Postprocess exercise answers

From 22126
Revision as of 14:58, 20 November 2025 by Mick (talk | contribs)

Post-Alignment Processing – Answer Key

Q1

Running:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
    -M ERR016028_chr20_sort_markdup.metrics.txt \
    -O ERR016028_chr20_sort_markdup.bam

The Picard log reports:

Marking 9798 records as duplicates.

This number is low because the input contains only reads aligned to chr20, a small subset of the genome chosen to keep the exercise fast.


Q2

The two reads have:

  • Different sequences (one contains an N)
  • The same alignment start coordinate (position 45,996,739)

ERR016028.5947720   ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080  ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG

Even though the sequences differ at one base (the N), Picard considers them duplicates because:

  • They align to the exact same genomic location
  • Duplicate detection is based on alignment position and orientation, not sequence identity
  • Reads that share these coordinates are assumed to originate from the same original fragment

Q3

ERR016028.18808080 is the read marked as a duplicate.

You can tell because its SAM flag changed from 163 to 1187:

  • 163 → original flag
  • 1187 → original flag + duplicate flag (163 + 1024)

The duplicate status is identified because the new flag includes:

0x400 (1024 decimal) = "PCR/optical duplicate"

Picard retains the “best” representative read (by default, the one with the highest sum of base qualities) and marks the others as duplicates.
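To check the flag arithmetic by hand, the bits can be decoded with plain shell arithmetic. The following is a minimal sketch that only covers the bits relevant to this exercise; the function name describe_flag and the short bit labels are chosen for illustration and are not part of samtools or Picard.

```shell
# Sketch only: decode selected SAM flag bits.
# describe_flag is a hypothetical helper; labels are informal shorthand.
describe_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 16:reverse 32:mate_reverse \
                64:first_in_pair 128:second_in_pair 1024:duplicate; do
        bit=${pair%%:*}     # numeric value of the bit
        name=${pair#*:}     # label for that bit
        if [ $(( $flag & $bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"         # strip the leading space
}

describe_flag 163    # paired proper_pair mate_reverse second_in_pair
describe_flag 1187   # paired proper_pair mate_reverse second_in_pair duplicate
```

The only difference between the two outputs is the duplicate bit, which matches 1187 - 163 = 1024.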


Q4

The correct command for merging BAM files is:

samtools merge

samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.

The full command for this exercise is:

samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
    ERR016028_chr20_sort_markdup.bam \
    /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Where:

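  • -c combines @RG header lines that share the same ID instead of renaming them, so the read group IDs stay unchanged
  • --write-index creates an index for the output automatically (by default, recent samtools versions write a .csi index for BAM output rather than a .bai)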

Q5

The sample/library of origin is indicated in the RG (read group) tag attached to each read:

RG:Z:ERR016025
RG:Z:ERR016028

  • RG:Z:ERR016025 → read from the second library
  • RG:Z:ERR016028 → read from the first library (the one you processed)

Variant callers use read groups to ensure correct sample/lane/library attribution.
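One way to confirm that both libraries are present in the merged file is to tally the RG tags across all reads. The sketch below works on SAM text from standard input; count_read_groups is a hypothetical helper name, and reads without an RG tag are reported under the made-up label "none".

```shell
# Sketch only: count reads per read group in SAM text.
# count_read_groups is a hypothetical helper, not a samtools command.
count_read_groups() {
    awk -F'\t' '!/^@/ {
        rg = "none"
        # Optional tags start at column 12 in SAM records.
        for (i = 12; i <= NF; i++)
            if ($i ~ /^RG:Z:/) rg = substr($i, 6)
        n[rg]++
    }
    END { for (rg in n) print rg "\t" n[rg] }' | sort
}

# Possible usage (assumes samtools is on the PATH):
# samtools view HG00418_chr20_sort_markdup.bam | count_read_groups
```

If the merge worked as intended, the output should contain one line for ERR016025 and one for ERR016028.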


Q6

Multiplexing

This is the process of pooling multiple samples into a single sequencing run.


Q7

Demultiplexing

This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.


End of answer key.