Alignment exercise answers: Difference between revisions
No edit summary |
No edit summary |
||
| Line 8: | Line 8: | ||
</ul> | </ul> | ||
<p><b>Q2.</b> | <p><b>Q2.</b> Three header lines if you added the read group (<code>@RG</code>).</p> | ||
<p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p> | <p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p> | ||
| Line 19: | Line 19: | ||
<p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped. | <p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped. | ||
The SAM flag <code>0x4</code> ( | The SAM flag <code>0x4</code> (4) indicates “read unmapped”.</p> | ||
<p><b>Q7.</b> According to <code>flagstat</code>:</p> | <p><b>Q7.</b> According to <code>flagstat</code>:</p> | ||
| Line 27: | Line 27: | ||
</ul> | </ul> | ||
<p><b>Q8.</b> SAM was | <p><b>Q8.</b> SAM was 479M → BAM was 116M → approximately <b>4× smaller</b>.</p> | ||
<p><b>Q9.</b> BAM was | <p><b>Q9.</b> BAM was 116M → CRAM was 38M → approximately <b>3.1× smaller</b>.</p> | ||
<p><b>Q10.</b></p> | <p><b>Q10.</b></p> | ||
| Line 87: | Line 87: | ||
<pre>samtools view NA19201.bam chr20 | wc -l</pre> | <pre>samtools view NA19201.bam chr20 | wc -l</pre> | ||
<p>→ ~3, | <p>→ ~3,450,558 (or very close, depending on how BWA breaks ties)</p> | ||
<p>To exclude supplementary alignments (<code>0x800</code>):</p> | <p>To exclude supplementary alignments (<code>0x800</code>):</p> | ||
<pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre> | <pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre> | ||
<p>→ ~3, | <p>→ ~3,449,745</p> | ||
<p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>): | <p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>): | ||
Most common insert size is approximately <b>152 bp</b>.</p> | Most common insert size is approximately <b>152 bp</b>.</p> | ||
<p><b>Q20.</b> At the | <p><b>Q20.</b> At the position <code>chr20:35,581,362</code>: | ||
Reference base = <b>T</b>, sample base = <b>G</b>.</p> | Reference base = <b>T</b>, sample base = <b>G</b>.</p> | ||
<p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b>.</p> | <p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b> (the first line is the reference, the 2nd line is the sample consensus base). </p> | ||
<p><b>Q22.</b> | <p><b>Q22.</b> Homozygous as all reads support just a single allele.</p> | ||
Latest revision as of 11:57, 19 December 2025
Alignment Exercise – Suggested Answers
Q1. Three ways to determine which FASTQ file contains the trimmed reads:
- File size: the trimmed file is smaller because adapter/low-quality bases were removed.
- Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
- Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.
Q2. Three header lines if you added the read group (@RG).
Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.
Q4. The mapping quality of the read SRR8002634.3 is 0.
This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.
Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).
Q6. Yes, SRR8002634.11 is unmapped.
The SAM flag 0x4 (4) indicates “read unmapped”.
Q7. According to flagstat:
- 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
- Therefore, unmapped fraction = 100% – 94.47% = 5.53%.
Q8. SAM was 479M → BAM was 116M → approximately 4× smaller.
Q9. BAM was 116M → CRAM was 38M → approximately 3.1× smaller.
Q10.
samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 155,343 reads
Q11.
samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 149,524 reads
Q12. From SRR8002634_1.mosdepth.summary.txt:
NC_002516.2 6264404 177098613 28.27 0 113
- Average coverage: 28.27×
- Maximum depth at any position: 113×
Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.
Example values:
- Yersinia pestis reference: 5,689 reads (0.46%) mapped
- Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped
Q14. One-line alignment + convert to BAM + sort:
bwa mem -R "@RG\tID:RG26\tSM:YRB42" \ /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \ /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \ | samtools view -uS - \ | samtools sort /dev/stdin > NA19201.bam
This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.
Q15. Advantages of using UNIX pipes:
- No intermediate large files → less storage use.
- Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually
bwa mem).
Q16. From samtools flagstat NA19201.bam:
- 3,516,557 mapped reads → 99.83% mapped.
This is unusually high because the file only contains reads from chr20.
Q17. Properly paired:
3,490,934 properly paired (99.23%)
Q18. Number of reads aligned to chr20:
samtools view NA19201.bam chr20 | wc -l
→ ~3,450,558 (or very close, depending on how BWA breaks ties)
To exclude supplementary alignments (0x800):
samtools view -F 0x800 NA19201.bam chr20 | wc -l
→ ~3,449,745
Q19. From the insert size distribution (via plot-bamstats):
Most common insert size is approximately 152 bp.
Q20. At the position chr20:35,581,362:
Reference base = T, sample base = G.
Q21. Number of reads supporting the G allele: 8 reads (the first line is the reference, the 2nd line is the sample consensus base).
Q22. Homozygous as all reads support just a single allele.