Alignment exercise answers
Alignment Exercise – Suggested Answers
Q1. Three ways to determine which FASTQ file contains the trimmed reads:
- File size: the trimmed file is smaller because adapter/low-quality bases were removed.
- Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
- Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.
Q2. Four header lines if you added the read group (@RG).
Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.
Q4. The mapping quality of the read SRR8002634.3 is 0.
This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.
Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).
Q6. Yes, SRR8002634.11 is unmapped.
The SAM flag 0x4 (decimal 4) indicates “read unmapped”.
Q7. According to flagstat:
- 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
- Therefore, unmapped fraction = 100% – 94.47% = 5.53%.
Q8. SAM was 467M → BAM was 117M → approximately 4× smaller.
Q9. BAM was 117M → CRAM was 37M → approximately 3.1× smaller.
Q10.
samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 155,343 reads
Q11.
samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l
→ 149,524 reads
Q12. From SRR8002634_1.mosdepth.summary.txt:
NC_002516.2 6264404 177098613 28.27 0 113
- Average coverage: 28.27×
- Maximum depth at any position: 113×
Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.
Example values:
- Yersinia pestis reference: 5,689 reads (0.46%) mapped
- Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped
Q14. One-line alignment + convert to BAM + sort:
bwa mem -R "@RG\tID:RG26\tSM:YRB42" \ /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \ /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \ | samtools view -uS - \ | samtools sort /dev/stdin > NA19201.bam
This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.
Q15. Advantages of using UNIX pipes:
- No intermediate large files → less storage use.
- Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually
bwa mem).
Q16. From samtools flagstat NA19201.bam:
- 3,516,557 mapped reads → 99.83% mapped.
This is unusually high because the file only contains reads from chr20.
Q17. Properly paired:
3,490,934 properly paired (99.23%)
Q18. Number of reads aligned to chr20:
samtools view NA19201.bam chr20 | wc -l
→ ~3,449,006 (or very close, depending on how BWA breaks ties)
To exclude supplementary alignments (0x800):
samtools view -F 0x800 NA19201.bam chr20 | wc -l
→ ~3,448,196
Q19. From the insert size distribution (via plot-bamstats):
Most common insert size is approximately 152 bp.
Q20. At the IGV position chr20:35,581,362:
Reference base = T, sample base = G.
Q21. Number of reads supporting the G allele: 8 reads.
Q22. The variant lies in the gene: FER1L4.