Alignment exercise answers: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 8: Line 8:
</ul>
</ul>


<p><b>Q2.</b> Four header lines if you added the read group (<code>@RG</code>).</p>
<p><b>Q2.</b> Three header lines if you added the read group (<code>@RG</code>).</p>


<p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p>
<p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p>
Line 19: Line 19:


<p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped.   
<p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped.   
The SAM flag <code>0x4</code> (decimal 4) indicates “read unmapped”.</p>
The SAM flag <code>0x4</code> (4) indicates “read unmapped”.</p>


<p><b>Q7.</b> According to <code>flagstat</code>:</p>
<p><b>Q7.</b> According to <code>flagstat</code>:</p>
Line 27: Line 27:
</ul>
</ul>


<p><b>Q8.</b> SAM was 467M → BAM was 117M → approximately <b>4× smaller</b>.</p>
<p><b>Q8.</b> SAM was 479M → BAM was 116M → approximately <b>4× smaller</b>.</p>


<p><b>Q9.</b> BAM was 117M → CRAM was 37M → approximately <b>3.1× smaller</b>.</p>
<p><b>Q9.</b> BAM was 116M → CRAM was 38M → approximately <b>3.1× smaller</b>.</p>


<p><b>Q10.</b></p>
<p><b>Q10.</b></p>
Line 87: Line 87:


<pre>samtools view NA19201.bam chr20 | wc -l</pre>
<pre>samtools view NA19201.bam chr20 | wc -l</pre>
<p>→ ~3,449,006 (or very close, depending on how BWA breaks ties)</p>
<p>→ ~3,450,558 (or very close, depending on how BWA breaks ties)</p>


<p>To exclude supplementary alignments (<code>0x800</code>):</p>
<p>To exclude supplementary alignments (<code>0x800</code>):</p>
<pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre>
<pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre>
<p>→ ~3,448,196</p>
<p>→ ~3,449,745</p>


<p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>):   
<p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>):   
Most common insert size is approximately <b>152 bp</b>.</p>
Most common insert size is approximately <b>152 bp</b>.</p>


<p><b>Q20.</b> At the IGV position <code>chr20:35,581,362</code>:   
<p><b>Q20.</b> At the position <code>chr20:35,581,362</code>:   
Reference base = <b>T</b>, sample base = <b>G</b>.</p>
Reference base = <b>T</b>, sample base = <b>G</b>.</p>


<p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b>.</p>
<p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b> (the first line is the reference, the 2nd line is the sample consensus base). </p>


<p><b>Q22.</b> The variant lies in the gene: <b>FER1L4</b>.</p>
<p><b>Q22.</b> Homozygous as all reads support just a single allele.</p>

Latest revision as of 11:57, 19 December 2025

Alignment Exercise – Suggested Answers

Q1. Three ways to determine which FASTQ file contains the trimmed reads:

  • File size: the trimmed file is smaller because adapter/low-quality bases were removed.
  • Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
  • Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.

Q2. Three header lines if you added the read group (@RG).

Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.

Q4. The mapping quality of the read SRR8002634.3 is 0. This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.

Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).

Q6. Yes, SRR8002634.11 is unmapped. The SAM flag 0x4 (4) indicates “read unmapped”.

Q7. According to flagstat:

  • 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
  • Therefore, unmapped fraction = 100% – 94.47% = 5.53%.

Q8. SAM was 479M → BAM was 116M → approximately 4× smaller.

Q9. BAM was 116M → CRAM was 38M → approximately 3.1× smaller.

Q10.

samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

155,343 reads

Q11.

samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

149,524 reads

Q12. From SRR8002634_1.mosdepth.summary.txt:

NC_002516.2  6264404  177098613  28.27  0    113
  • Average coverage: 28.27×
  • Maximum depth at any position: 113×

Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.

Example values:

  • Yersinia pestis reference: 5,689 reads (0.46%) mapped
  • Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped

Q14. One-line alignment + convert to BAM + sort:

bwa mem -R "@RG\tID:RG26\tSM:YRB42" \
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \
| samtools view -uS - \
| samtools sort /dev/stdin > NA19201.bam

This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.

Q15. Advantages of using UNIX pipes:

  • No intermediate large files → less storage use.
  • Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually bwa mem).

Q16. From samtools flagstat NA19201.bam:

  • 3,516,557 mapped reads → 99.83% mapped.

This is unusually high because the file only contains reads from chr20.

Q17. Properly paired:

3,490,934 properly paired (99.23%)

Q18. Number of reads aligned to chr20:

samtools view NA19201.bam chr20 | wc -l

→ ~3,450,558 (or very close, depending on how BWA breaks ties)

To exclude supplementary alignments (0x800):

samtools view -F 0x800 NA19201.bam chr20 | wc -l

→ ~3,449,745

Q19. From the insert size distribution (via plot-bamstats): Most common insert size is approximately 152 bp.

Q20. At the position chr20:35,581,362: Reference base = T, sample base = G.

Q21. Number of reads supporting the G allele: 8 reads (the first line is the reference, the 2nd line is the sample consensus base).

Q22. Homozygous as all reads support just a single allele.