Alignment exercise answers: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with "'''Q1:''' 3 possible ways: * The file with the smaller file size contains the trimmed reads. * Peek in the file and determine which file contains reads of uneven lengths * Use fastqc, to determine which file contains overrepresented adapter sequences '''Q2:''' 4 lines if you have added the RG '''Q3:''' 166782 '''Q4:''' 0, this means that the probability of being mismapped is one. This means that this read cannot be confidently assigned to this position. '''Q5:''' The...")
 
No edit summary
 
Line 1: Line 1:
'''Q1:''' 3 possible ways:
<h2>Alignment Exercise – Suggested Answers</h2>
* The file with the smaller file size contains the trimmed reads.
* Peek in the file and determine which file contains reads of uneven lengths
* Use fastqc, to determine which file contains overrepresented adapter sequences


'''Q2:''' 4 lines if you have added the RG
<p><b>Q1.</b> Three ways to determine which FASTQ file contains the trimmed reads:</p>
<ul>
  <li><b>File size:</b> the trimmed file is smaller because adapter/low-quality bases were removed.</li>
  <li><b>Look at the reads:</b> trimmed reads often have <b>uneven lengths</b>, while raw reads have uniform read length for each technology.</li>
  <li><b>Run FastQC:</b> raw reads show <b>adapter contamination</b>, “adapter content” warnings, or low-quality tails.</li>
</ul>


'''Q3:''' 166782
<p><b>Q2.</b> Four header lines if you added the read group (<code>@RG</code>).</p>


'''Q4:''' 0, this means that the probability of being mismapped is one. This means that this read cannot be confidently assigned to this position.
<p><b>Q3.</b> The genomic coordinate of the first read (<code>SRR8002634.1</code>) is: <b>position 166782</b>.</p>


'''Q5:''' There are 3x16 and 5x0 so 3 on the - strand and 5 on the + strand
<p><b>Q4.</b> The mapping quality of the read <code>SRR8002634.3</code> is <b>0</b>. 
This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.</p>


'''Q6:''' Yes it is unmapped, 4 means read unmapped.
<p><b>Q5.</b> Among the first 8 reads:
<b>3 mapped to the - strand</b> (flag 16) and <b>5 mapped to the + strand</b> (flag 0).</p>


'''Q7:''' 1183276 aligned out of 1252591 which is 94.47%. Therefore 100%-94.47% = 5.53%
<p><b>Q6.</b> Yes, <code>SRR8002634.11</code> is unmapped.
The SAM flag <code>0x4</code> (decimal 4) indicates “read unmapped”.</p>


'''Q8:''' Down from 467M to 117M so about 4X less.
<p><b>Q7.</b> According to <code>flagstat</code>:</p>
<ul>
  <li>1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.</li>
  <li>Therefore, unmapped fraction = 100% – 94.47% = <b>5.53%</b>.</li>
</ul>


'''Q9:''' Down from 117M to 37M so about 3.1X less.
<p><b>Q8.</b> SAM was 467M → BAM was 117M → approximately <b>4× smaller</b>.</p>


'''Q10:''' Use:
<p><b>Q9.</b> BAM was 117M → CRAM was 37M → approximately <b>3.1× smaller</b>.</p>


<pre>
<p><b>Q10.</b></p>
samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000|wc -l  
<pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l</pre>
</pre>
<p>→ <b>155,343 reads</b></p>
155343
 
 
 
'''Q11:''' Use:


<pre>
<p><b>Q11.</b></p>
samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000|wc -l  
<pre>samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l</pre>
</pre>
<p>→ <b>149,524 reads</b></p>
149524
 
 
'''Q12:''' In SRR8002634_1.mosdepth.summary.txt we find:


<p><b>Q12.</b> From <code>SRR8002634_1.mosdepth.summary.txt</code>:</p>
<pre>
<pre>
NC_002516.2  6264404  177098613  28.27  0    113
NC_002516.2  6264404  177098613  28.27  0    113
</pre>
</pre>
<ul>
  <li>Average coverage: <b>28.27×</b></li>
  <li>Maximum depth at any position: <b>113×</b></li>
</ul>


The average is 28.27X, the maximum is 113 reads.
<p><b>Q13.</b> If the wrong reference is used, the number of aligned reads generally goes <b>down</b> because greater phylogenetic distance means more mismatches and fewer alignable regions.
 
If the incorrect reference is a <b>closely related species</b>, more reads will map than to a distant one, but still fewer than to the correct reference.</p>
'''Q13:''' if you use the wrong genome reference, the number of aligned reads will likely go down. This is because as the phylogenetic distance increases, the more the genomes diverge, the less the reads are likely to land on the genome due to increased divergence. In the extreme, the reads aligned by serendipity. Conversely, if the reference genome used is closer phylogenetically to the species of the sample, the reads are more likely to align and the number of aligned reads will go up. For instance, 5689 (0.46%) aligned to the ''Yersinia pestis'' and 279217 (22.34%) aligned to the ''Pseudomonas aeruginosa''.
 
'''Q14''' the command line is:
<pre>
bwa mem  -R "@RG\tID:RG26\tSM:YRB42"  /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz |samtools view -bS /dev/stdin | samtools sort /dev/stdin >  NA19201.bam
</pre>
 
By the way, when we write:
 
<pre>
samtools view -bS /dev/stdin
</pre>
 
we can also write:
 
<pre>
samtools view -bS -
</pre>
 
It means the same thing and is quicker.
 
Very technically speaking, the command line above is not the most efficient. By default, BAM is a compressed binary format. Compressed here means it is run through '''gzip''' or more precisely '''bgzip''' (block version of '''gzip''').  It can also be uncompressed binary. The first call to '''samtools view''' transforms the data as compressed binary. Subsequently, '''samtools sort''' decompresses and re-compresses it. This is a bit wasteful, there is no point in compressing data if you're going to decompress it again in the subsequent step. Is smarter command line is:
<pre>
bwa mem  -R "@RG\tID:RG26\tSM:YRB42" /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz |samtools view -uS - | samtools sort /dev/stdin >  NA19201.bam
</pre>
 
Note that we have changed '''samtools view -bS''' to '''samtools view -uS''', this says produced uncompressed BAM.


'''Q15''' 1) there are fewer intermediate files being produced 2) you guarantee parallel execution such that you are as fast as your bottleneck which in this case is mapping.
<p>Example values:</p>
<ul>
  <li><i>Yersinia pestis</i> reference: 5,689 reads (0.46%) mapped</li>
  <li><i>Pseudomonas aeruginosa</i> reference: 279,217 reads (22.34%) mapped</li>
</ul>


 
<p><b>Q14.</b> One-line alignment + convert to BAM + sort:</p>
'''Q16''' Using:
<pre>
<pre>
samtools flagstat NA19201.bam
bwa mem -R "@RG\tID:RG26\tSM:YRB42" \
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \
| samtools view -uS - \
| samtools sort /dev/stdin > NA19201.bam
</pre>
</pre>
We find:
<pre>
3516557 + 0 mapped (99.83% : N/A)
</pre>
So 99.83%. Please note that this is unusually high as we knew that these are reads then mapped to chromosome 20.


<p>This version uses <code>-uS</code> to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.</p>


'''Q17'''
<p><b>Q15.</b> Advantages of using UNIX pipes:</p>
<pre>
<ul>
3490934 + 0 properly paired (99.23% : N/A)
  <li>No intermediate large files → less storage use.</li>
</pre>
  <li>Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually <code>bwa mem</code>).</li>
</ul>


<p><b>Q16.</b> From <code>samtools flagstat NA19201.bam</code>:</p>
<ul>
  <li>3,516,557 mapped reads → <b>99.83% mapped</b>.</li>
</ul>
<p>This is unusually high because the file only contains reads from <code>chr20</code>.</p>


<p><b>Q17.</b> Properly paired:</p>
<pre>3,490,934 properly paired (99.23%)</pre>


'''Q18''' Using:
<p><b>Q18.</b> Number of reads aligned to <code>chr20</code>:</p>
<pre>
samtools view    NA19201.bam chr20 |wc -l
3449006 (or 3448996)
</pre>
However, please note that you might get different numbers. Like mentioned in class, BWA tends to plate reads randomly in case of a tie. But expect the same ballpark. Also, to be more rigorous, as noted in the flagstat, there are 4620 supplementary alignments in total, so one would need to filter those out. Using
 
<pre>
samtools view -F0x800 NA19201.bam chr20 |wc -l
3448196 (or 3448186)
</pre>


<pre>samtools view NA19201.bam chr20 | wc -l</pre>
<p>→ ~3,449,006 (or very close, depending on how BWA breaks ties)</p>


'''Q19'''
<p>To exclude supplementary alignments (<code>0x800</code>):</p>
<pre>
<pre>samtools view -F 0x800 NA19201.bam chr20 | wc -l</pre>
samtools stat NA19201.bam > NA19201.stat
<p>→ ~3,448,196</p>
plot-bamstats -p NA19201 NA19201.stat
</pre>


The most common insert size is 152.  
<p><b>Q19.</b> From the insert size distribution (via <code>plot-bamstats</code>): 
Most common insert size is approximately <b>152 bp</b>.</p>


'''Q20''':reference=T, sample=G
<p><b>Q20.</b> At the IGV position <code>chr20:35,581,362</code>: 
Reference base = <b>T</b>, sample base = <b>G</b>.</p>


'''Q21''': 8 reads
<p><b>Q21.</b> Number of reads supporting the G allele: <b>8 reads</b>.</p>


'''Q22''': FER1L4
<p><b>Q22.</b> The variant lies in the gene: <b>FER1L4</b>.</p>

Latest revision as of 13:23, 20 November 2025

Alignment Exercise – Suggested Answers

Q1. Three ways to determine which FASTQ file contains the trimmed reads:

  • File size: the trimmed file is smaller because adapter/low-quality bases were removed.
  • Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
  • Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.

Q2. Four header lines if you added the read group (@RG).

Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.

Q4. The mapping quality of the read SRR8002634.3 is 0. This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.

Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).

Q6. Yes, SRR8002634.11 is unmapped. The SAM flag 0x4 (decimal 4) indicates “read unmapped”.

Q7. According to flagstat:

  • 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
  • Therefore, unmapped fraction = 100% – 94.47% = 5.53%.

Q8. SAM was 467M → BAM was 117M → approximately 4× smaller.

Q9. BAM was 117M → CRAM was 37M → approximately 3.1× smaller.

Q10.

samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

155,343 reads

Q11.

samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

149,524 reads

Q12. From SRR8002634_1.mosdepth.summary.txt:

NC_002516.2  6264404  177098613  28.27  0    113
  • Average coverage: 28.27×
  • Maximum depth at any position: 113×

Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.

Example values:

  • Yersinia pestis reference: 5,689 reads (0.46%) mapped
  • Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped

Q14. One-line alignment + convert to BAM + sort:

bwa mem -R "@RG\tID:RG26\tSM:YRB42" \
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \
| samtools view -uS - \
| samtools sort /dev/stdin > NA19201.bam

This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.

Q15. Advantages of using UNIX pipes:

  • No intermediate large files → less storage use.
  • Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually bwa mem).

Q16. From samtools flagstat NA19201.bam:

  • 3,516,557 mapped reads → 99.83% mapped.

This is unusually high because the file only contains reads from chr20.

Q17. Properly paired:

3,490,934 properly paired (99.23%)

Q18. Number of reads aligned to chr20:

samtools view NA19201.bam chr20 | wc -l

→ ~3,449,006 (or very close, depending on how BWA breaks ties)

To exclude supplementary alignments (0x800):

samtools view -F 0x800 NA19201.bam chr20 | wc -l

→ ~3,448,196

Q19. From the insert size distribution (via plot-bamstats): Most common insert size is approximately 152 bp.

Q20. At the IGV position chr20:35,581,362: Reference base = T, sample base = G.

Q21. Number of reads supporting the G allele: 8 reads.

Q22. The variant lies in the gene: FER1L4.