Alignment exercise answers: Difference between revisions
(Created page with "'''Q1:''' 3 possible ways: * The file with the smaller file size contains the trimmed reads. * Peek in the file and determine which file contains reads of uneven lengths * Use fastqc, to determine which file contains overrepresented adapter sequences '''Q2:''' 4 lines if you have added the RG '''Q3:''' 166782 '''Q4:''' 0, this means that the probability of being mismapped is one. This means that this read cannot be confidently assigned to this position. '''Q5:''' The...") |
No edit summary |
||
| Line 1: | Line 1: | ||
<h2>Alignment Exercise – Suggested Answers</h2> | |||
<p><b>Q1.</b> Three ways to determine which FASTQ file contains the trimmed reads:</p> | |||
<ul> | |||
<li><b>File size:</b> the trimmed file is smaller because adapter/low-quality bases were removed.</li> | |||
<li><b>Look at the reads:</b> trimmed reads often have <b>uneven lengths</b>, while raw reads have uniform read length for each technology.</li> | |||
<li><b>Run FastQC:</b> raw reads show <b>adapter contamination</b>, “adapter content” warnings, or low-quality tails.</li> | |||
</ul> | |||
<p><b>Q2.</b> Four header lines if you added the read group (<code>@RG</code>).</p> | |||
<p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p> | |||
<p><b>Q4.</b> The mapping quality of the read <code>SRR8002634.3</code> is <b>0</b>. | |||
This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.</p> | |||
<p><b>Q5.</b> Among the first 8 reads: | |||
<b>3 mapped to the - strand</b> (flag 16) and <b>5 mapped to the + strand</b> (flag 0).</p> | |||
<p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped. | |||
The SAM flag <code>0x4</code> (decimal 4) indicates “read unmapped”.</p> | |||
<p><b>Q7.</b> According to <code>flagstat</code>:</p> | |||
<ul> | |||
<li>1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.</li> | |||
<li>Therefore, unmapped fraction = 100% – 94.47% = <b>5.53%</b>.</li> | |||
</ul> | |||
<p><b>Q8.</b> SAM was 467M → BAM was 117M → approximately <b>4× smaller</b>.</p> | |||
<p><b>Q9.</b> BAM was 117M → CRAM was 37M → approximately <b>3.1× smaller</b>.</p> | |||
<pre> | <p><b>Q10.</b></p> | ||
samtools | <pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l</pre> | ||
</pre> | <p>→ <b>155,343 reads</b></p> | ||
<pre> | <p><b>Q11.</b></p> | ||
samtools | <pre>samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l</pre> | ||
</pre> | <p>→ <b>149,524 reads</b></p> | ||
<p><b>Q12.</b> From <code>SRR8002634_1.mosdepth.summary.txt</code>:</p> | |||
<pre> | <pre> | ||
NC_002516.2 6264404 177098613 28.27 0 113 | NC_002516.2 6264404 177098613 28.27 0 113 | ||
</pre> | </pre> | ||
<ul> | |||
<li>Average coverage: <b>28.27×</b></li> | |||
<li>Maximum depth at any position: <b>113×</b></li> | |||
</ul> | |||
<p><b>Q13.</b> If the wrong reference is used, the number of aligned reads generally goes <b>down</b> because greater phylogenetic distance means more mismatches and fewer alignable regions. | |||
If the incorrect reference is a <b>closely related species</b>, more reads will map than to a distant one, but still fewer than to the correct reference.</p> | |||
< | |||
</ | |||
</ | |||
<p>Example values:</p> | |||
<ul> | |||
<li><i>Yersinia pestis</i> reference: 5,689 reads (0.46%) mapped</li> | |||
<li><i>Pseudomonas aeruginosa</i> reference: 279,217 reads (22.34%) mapped</li> | |||
</ul> | |||
<p><b>Q14.</b> One-line alignment + convert to BAM + sort:</p> | |||
<pre> | <pre> | ||
samtools | bwa mem -R "@RG\tID:RG26\tSM:YRB42" \ | ||
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | |||
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \ | |||
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \ | |||
| samtools view -uS - \ | |||
| samtools sort /dev/stdin > NA19201.bam | |||
</pre> | </pre> | ||
<p>This version uses <code>-uS</code> to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.</p> | |||
<p><b>Q15.</b> Advantages of using UNIX pipes:</p> | |||
< | <ul> | ||
<li>No intermediate large files → less storage use.</li> | |||
</ | <li>Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually <code>bwa mem</code>).</li> | ||
</ul> | |||
<p><b>Q16.</b> From <code>samtools flagstat NA19201.bam</code>:</p> | |||
<ul> | |||
<li>3,516,557 mapped reads → <b>99.83% mapped</b>.</li> | |||
</ul> | |||
<p>This is unusually high because the file only contains reads from <code>chr20</code>.</p> | |||
<p><b>Q17.</b> Properly paired:</p> | |||
<pre>3,490,934 properly paired (99.23%)</pre> | |||
<p><b>Q18.</b> Number of reads aligned to <code>chr20</code>:</p> | |||
< | |||
</ | |||
< | |||
</ | |||
<pre>samtools view NA19201.bam chr20 | wc -l</pre> | |||
<p>→ ~3,449,006 (or very close, depending on how BWA breaks ties)</p> | |||
<p>To exclude supplementary alignments (<code>0x800</code>):</p> | |||
<pre> | <pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre> | ||
samtools | <p>→ ~3,448,196</p> | ||
</ | |||
<p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>): | |||
Most common insert size is approximately <b>152 bp</b>.</p> | |||
<p><b>Q20.</b> At the IGV position <code>chr20:35,581,362</code>: | |||
Reference base = <b>T</b>, sample base = <b>G</b>.</p> | |||
<p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b>.</p> | |||
<p><b>Q22.</b> The variant lies in the gene: <b>FER1L4</b>.</p> | |||
Latest revision as of 13:23, 20 November 2025
Alignment Exercise – Suggested Answers
Q1. Three ways to determine which FASTQ file contains the trimmed reads:
- File size: the trimmed file is smaller because adapter/low-quality bases were removed.
- Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
- Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.
Q2. Four header lines if you added the read group (@RG).
Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.
Q4. The mapping quality of the read SRR8002634.3 is 0.
This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.
Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).
Q6. Yes, SRR8002634.11 is unmapped.
The SAM flag 0x4 (decimal 4) indicates “read unmapped”.
Q7. According to flagstat:
- 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
- Therefore, unmapped fraction = 100% – 94.47% = 5.53%.
Q8. SAM was 467M → BAM was 117M → approximately 4× smaller.
Q9. BAM was 117M → CRAM was 37M → approximately 3.1× smaller.
Q10.
samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 155,343 reads
Q11.
samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 149,524 reads
Q12. From SRR8002634_1.mosdepth.summary.txt:
NC_002516.2 6264404 177098613 28.27 0 113
- Average coverage: 28.27×
- Maximum depth at any position: 113×
Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.
Example values:
- Yersinia pestis reference: 5,689 reads (0.46%) mapped
- Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped
Q14. One-line alignment + convert to BAM + sort:
bwa mem -R "@RG\tID:RG26\tSM:YRB42" \ /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \ /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \ | samtools view -uS - \ | samtools sort /dev/stdin > NA19201.bam
This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.
Q15. Advantages of using UNIX pipes:
- No intermediate large files → less storage use.
- Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually
bwa mem).
Q16. From samtools flagstat NA19201.bam:
- 3,516,557 mapped reads → 99.83% mapped.
This is unusually high because the file only contains reads from chr20.
Q17. Properly paired:
3,490,934 properly paired (99.23%)
Q18. Number of reads aligned to chr20:
samtools view NA19201.bam chr20 | wc -l
→ ~3,449,006 (or very close, depending on how BWA breaks ties)
To exclude supplementary alignments (0x800):
samtools view -F 0x800 NA19201.bam chr20 | wc -l
→ ~3,448,196
Q19. From the insert size distribution (via plot-bamstats):
Most common insert size is approximately 152 bp.
Q20. At the IGV position chr20:35,581,362:
Reference base = T, sample base = G.
Q21. Number of reads supporting the G allele: 8 reads.
Q22. The variant lies in the gene: FER1L4.