Alignment exercise answers

From 22126
Revision as of 13:23, 20 November 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Alignment Exercise – Suggested Answers

Q1. Three ways to determine which FASTQ file contains the trimmed reads:

  • File size: the trimmed file is smaller because adapter/low-quality bases were removed.
  • Look at the reads: trimmed reads often have uneven lengths, while raw reads have uniform read length for each technology.
  • Run FastQC: raw reads show adapter contamination, “adapter content” warnings, or low-quality tails.

Q2. Four header lines if you added the read group (@RG).

Q3. The genomic coordinate of the first read (SRR8002634.1) is: position 166782.

Q4. The mapping quality of the read SRR8002634.3 is 0. This means the mapper believes that it cannot confidently assign the read to this position; multiple equally good mapping locations likely exist.

Q5. Among the first 8 reads: 3 mapped to the - strand (flag 16) and 5 mapped to the + strand (flag 0).

Q6. Yes, SRR8002634.11 is unmapped. The SAM flag 0x4 (decimal 4) indicates “read unmapped”.

Q7. According to flagstat:

  • 1,183,276 aligned out of 1,252,591 total reads → 94.47% mapped.
  • Therefore, unmapped fraction = 100% – 94.47% = 5.53%.

Q8. SAM was 467M → BAM was 117M → approximately 4× smaller.

Q9. BAM was 117M → CRAM was 37M → approximately 3.1× smaller.

Q10.

samtools view SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

155,343 reads

Q11.

samtools view -q 30 SRR8002634_1.sorted.bam NC_002516.2:2000000-3000000 | wc -l

149,524 reads

Q12. From SRR8002634_1.mosdepth.summary.txt:

NC_002516.2  6264404  177098613  28.27  0    113
  • Average coverage: 28.27×
  • Maximum depth at any position: 113×

Q13. If the wrong reference is used, the number of aligned reads generally goes down because greater phylogenetic distance means more mismatches and fewer alignable regions. If the incorrect reference is a closely related species, more reads will map than to a distant one, but still fewer than to the correct reference.

Example values:

  • Yersinia pestis reference: 5,689 reads (0.46%) mapped
  • Pseudomonas aeruginosa reference: 279,217 reads (22.34%) mapped

Q14. One-line alignment + convert to BAM + sort:

bwa mem -R "@RG\tID:RG26\tSM:YRB42" \
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz \
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz \
| samtools view -uS - \
| samtools sort /dev/stdin > NA19201.bam

This version uses -uS to emit uncompressed BAM, avoiding unnecessary compression/decompression inside the pipeline.

Q15. Advantages of using UNIX pipes:

  • No intermediate large files → less storage use.
  • Steps run in streaming mode → faster overall execution (the pipeline is as fast as the slowest tool, usually bwa mem).

Q16. From samtools flagstat NA19201.bam:

  • 3,516,557 mapped reads → 99.83% mapped.

This is unusually high because the file only contains reads from chr20.

Q17. Properly paired:

3,490,934 properly paired (99.23%)

Q18. Number of reads aligned to chr20:

samtools view NA19201.bam chr20 | wc -l

→ ~3,449,006 (or very close, depending on how BWA breaks ties)

To exclude supplementary alignments (0x800):

samtools view -F 0x800 NA19201.bam chr20 | wc -l

→ ~3,448,196

Q19. From the insert size distribution (via plot-bamstats): Most common insert size is approximately 152 bp.

Q20. At the IGV position chr20:35,581,362: Reference base = T, sample base = G.

Q21. Number of reads supporting the G allele: 8 reads.

Q22. The variant lies in the gene: FER1L4.