Postprocess exercise answers
Post-Alignment Processing – Answer Key
Q1
Running:
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
-M ERR016028_chr20_sort_markdup.metrics.txt \
-O ERR016028_chr20_sort_markdup.bam
The Picard log reports:
Marking 9798 records as duplicates.
This number is low because we only use a very small subset of the genome (chr20 only) to keep the exercise fast.
Q2
The two reads have:
- Different sequences (one contains an N)
- The same alignment start coordinate (position 45,996,739)
ERR016028.5947720 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
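Even though the sequences differ slightly, Picard considers them duplicates because:
- They align to the exact same genomic location (same start coordinate and orientation, for the read and its mate)
- Duplicate detection is based on alignment position, not sequence identity
- Reads at identical positions are assumed to originate from the same original fragment; the N is a base the sequencer could not call, not evidence of a different molecule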
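To see how small the difference between the two reads is, they can be compared base by base. The following is a minimal sketch in POSIX shell; the function name count_mismatches is made up for this illustration and is not part of Picard or samtools.

```shell
# Count the positions at which two equal-length sequences differ.
# count_mismatches is a hypothetical helper written for this exercise.
count_mismatches() {
  seq_a=$1
  seq_b=$2
  mismatches=0
  if [ "${#seq_a}" -ne "${#seq_b}" ]; then
    echo "sequences differ in length" >&2
    return 1
  fi
  while [ -n "$seq_a" ]; do
    base_a=${seq_a%"${seq_a#?}"}   # first character of seq_a
    base_b=${seq_b%"${seq_b#?}"}   # first character of seq_b
    [ "$base_a" = "$base_b" ] || mismatches=$((mismatches + 1))
    seq_a=${seq_a#?}               # drop the first character
    seq_b=${seq_b#?}
  done
  echo "$mismatches"
}

# The two reads from Q2.
read1=ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
read2=ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
count_mismatches "$read1" "$read2"
```

Picard does not perform this comparison; the sketch only illustrates that a sequence difference does not stop two reads from being flagged as duplicates.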
Q3
ERR016028.18808080 is the read marked as a duplicate.
You can tell because its SAM flag changed from 163 to 1187:
- 163 → original flag
- 1187 → original flag + duplicate flag (163 + 1024)
The added bit is:
0x400 (1024 decimal) = "PCR/optical duplicate"
Picard retains the “best” representative read (by default, the one with the highest sum of base qualities) and marks the others as duplicates.
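The flag arithmetic can be checked directly with a bitwise AND. This is a minimal sketch in POSIX shell; has_flag is a hypothetical helper name. For reference, 163 decomposes as 1 + 2 + 32 + 128 (paired, properly paired, mate on reverse strand, second in pair), so adding 1024 gives 1187.

```shell
# Succeeds if the SAM FLAG value in $1 has the bit in $2 set.
# has_flag is a hypothetical helper written for this exercise.
has_flag() {
  [ $(( $1 & $2 )) -ne 0 ]
}

DUPLICATE=1024   # 0x400, "PCR/optical duplicate"

for flag in 163 1187; do
  if has_flag "$flag" "$DUPLICATE"; then
    echo "$flag: duplicate bit set"
  else
    echo "$flag: duplicate bit not set"
  fi
done
```

On a real BAM file the same test is available through samtools, for example samtools view -f 1024 to list only records with the duplicate bit set, or samtools view -c -f 1024 to count them.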
Q4
The correct command for merging BAM files is:
samtools merge
samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.
The full command for this exercise is:
samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
ERR016028_chr20_sort_markdup.bam \
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
Where:
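- -c keeps the read groups unchanged (@RG headers with identical IDs are combined instead of being renamed with a suffix)
- --write-index creates the index automatically while the output is written (for BAM output samtools writes a .csi index by default, not a .bai)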
Q5
The sample/library of origin is indicated in the RG (read group) tag attached to each read:
RG:Z:ERR016025
RG:Z:ERR016028
- RG:Z:ERR016025 → read from the second library
- RG:Z:ERR016028 → read from the first library (the one you processed)
Variant callers use read groups to ensure correct sample/lane/library attribution.
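To check how many reads each library contributed to the merged file, the RG tags can be tallied. The sketch below assumes SAM records arrive on standard input (for example from samtools view); count_read_groups is a hypothetical helper name, not a samtools command.

```shell
# Tally reads per read group from SAM records on stdin.
# Optional tags start at column 12; the RG tag looks like RG:Z:<id>.
# count_read_groups is a hypothetical helper written for this exercise.
count_read_groups() {
  awk -F'\t' '
    { for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) n[substr($i, 6)]++ }
    END { for (rg in n) print rg, n[rg] }
  ' | sort
}

# Usage (assumes samtools is installed and the merged BAM from Q4 exists):
#   samtools view HG00418_chr20_sort_markdup.bam | count_read_groups
# The read groups declared in the header can be listed with:
#   samtools view -H HG00418_chr20_sort_markdup.bam | grep '^@RG'
```

Each output line holds a read group ID followed by the number of reads carrying it.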
Q6
Multiplexing
This is the process of pooling multiple samples into a single sequencing run.
Q7
Demultiplexing
This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.
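As a toy illustration of the idea, each read's barcode can be looked up in a table that maps barcodes to samples. Everything in this sketch is hypothetical: the barcodes, sample names, read IDs and the function name sample_for_barcode are invented for illustration. Real demultiplexing tools take the table from a sample sheet and usually tolerate a small number of barcode mismatches, which this sketch does not.

```shell
# Hypothetical barcode-to-sample table (illustrative values only).
sample_for_barcode() {
  case "$1" in
    ACGTAC) echo "sample_A" ;;
    TGCATG) echo "sample_B" ;;
    *)      echo "undetermined" ;;   # barcode matches no known sample
  esac
}

# Pooled input: one "read_id barcode" pair per line (made-up values).
printf '%s\n' 'read1 ACGTAC' 'read2 TGCATG' 'read3 ACGTAC' 'read4 GGGGGG' |
while read -r read_id barcode; do
  echo "$read_id -> $(sample_for_barcode "$barcode")"
done
```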
End of answer key.