Postprocess exercise answers

Post-Alignment Processing – Answer Key

Q1

Running:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
    -M ERR016028_chr20_sort_markdup.metrics.txt \
    -O ERR016028_chr20_sort_markdup.bam

The Picard log reports:

Marking 9798 records as duplicates.

This number is low because the BAM file only contains reads aligned to chr20, a small subset of the genome chosen to keep the exercise fast.
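
One way to cross-check the marking is to count the records whose FLAG has the duplicate bit (0x400) set. The sketch below does this on SAM text read from standard input; the function name count_duplicates is made up for illustration. With samtools available, samtools view -c -f 1024 on the BAM file gives such a count directly.

```shell
# Count SAM records whose FLAG (column 2) has the duplicate bit (0x400 = 1024) set.
# Reads SAM text on stdin; header lines (starting with '@') are skipped.
count_duplicates() {
    awk -F'\t' '
        /^@/ { next }
        int($2 / 1024) % 2 == 1 { n++ }   # bit 10 of the flag is set
        END { print n + 0 }               # prints 0 when nothing matched
    '
}

# Example usage (assumes samtools is on the PATH and the BAM file from Q1 exists):
#   samtools view -h ERR016028_chr20_sort_markdup.bam | count_duplicates
```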

Q2

The two reads have:

Different sequences (one contains an N)
The same alignment start coordinate (position 45,996,739)

ERR016028.5947720   ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080  ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG

Even though the sequences differ slightly, Picard considers them duplicates because:

They align to the exact same genomic location, in the same orientation
Duplicate detection is based on alignment position (the 5' ends of the read pair), not sequence identity
Reads that share these coordinates most likely originate from the same original fragment
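
The idea of position-based detection can be sketched by grouping SAM records on reference, start position and strand while ignoring the sequence column. This is a deliberate simplification and not Picard's actual algorithm: it uses POS as given rather than the unclipped 5' coordinate, and it looks at single records rather than read pairs. The function name candidate_duplicates is hypothetical.

```shell
# Group mapped SAM records by (reference, POS, strand) and report every key
# shared by more than one read. The sequence (column 10) is never inspected.
# Simplified illustration only: Picard works on unclipped 5' ends of read pairs.
candidate_duplicates() {
    awk -F'\t' '
        /^@/ { next }                      # skip header lines
        int($2 / 4) % 2 == 1 { next }      # skip unmapped reads (flag 0x4)
        {
            strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag 0x10 = reverse strand
            key = $3 ":" $4 ":" strand
            count[key]++
            if (count[key] == 1) names[key] = $1
            else names[key] = names[key] "," $1
        }
        END {
            for (key in count)
                if (count[key] > 1) print key "\t" names[key]
        }
    ' | sort
}

# Example usage (assumes samtools is on the PATH):
#   samtools view ERR016028_chr20_sort_markdup.bam | candidate_duplicates | head
```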

Q3

ERR016028.18808080 is the read marked as a duplicate.

You can tell because its SAM flag changed after MarkDuplicates:

163 → original flag
1187 → original flag + duplicate flag (163 + 0x400 = 163 + 1024)

The duplicate status is identified because the new flag includes:

0x400 (1024 decimal) = "PCR/optical duplicate"

Picard retains one representative per set of duplicates (by default the read pair with the highest sum of base qualities) and marks the others as duplicates.
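
The flag arithmetic can be checked by testing each bit of the value. The following is a minimal sketch in plain shell; the function name decode_flag and the short labels are chosen here for illustration, with meanings taken from the SAM flag definitions. The samtools flags subcommand offers a similar lookup if samtools is installed.

```shell
# Print a label for every SAM flag bit that is set in the given integer.
decode_flag() {
    flag=$1
    for entry in \
        1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
        16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
        256:secondary 512:qc_fail 1024:duplicate 2048:supplementary
    do
        bit=${entry%%:*}     # number before the colon
        name=${entry#*:}     # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            printf '%s\n' "$name"
        fi
    done
}

decode_flag 163     # paired, proper_pair, mate_reverse, second_in_pair
decode_flag 1187    # the same four labels plus duplicate
```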

Q4

The correct command for merging BAM files is:

samtools merge

samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.

The full command for this exercise is:

samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
    ERR016028_chr20_sort_markdup.bam \
    /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Where:

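-c combines @RG header lines that share the same ID instead of renaming them with a suffix, so the read group IDs stay unchanged
--write-index creates the index automatically (for BAM output, samtools writes a .csi index by default rather than a .bai)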
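
To confirm that the merged file is still coordinate-sorted and carries both read groups, the header can be inspected. The sketch below summarises SAM header text read from standard input; header_summary is a hypothetical helper name, and the output labels are made up for readability.

```shell
# Summarise a SAM header read from stdin:
# the sort order recorded in @HD and the ID of every @RG line.
header_summary() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) print "sort_order\t" substr($i, 4)
        }
        $1 == "@RG" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^ID:/) print "read_group\t" substr($i, 4)
        }
    '
}

# Example usage (assumes samtools is on the PATH and the merge has been run):
#   samtools view -H HG00418_chr20_sort_markdup.bam | header_summary
```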

Q5

The library of origin is indicated by the RG (read group) tag attached to each read:

RG:Z:ERR016025
RG:Z:ERR016028

RG:Z:ERR016025 → read from the second library
RG:Z:ERR016028 → read from the first library (the one you processed)

Variant callers use read groups to ensure correct sample/lane/library attribution.
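
A quick way to see how the reads are split between the two read groups is to tally the RG tag over all records. This is a rough sketch operating on SAM text from standard input; count_by_read_group is a hypothetical name, and reads without an RG tag are reported under the placeholder (none).

```shell
# Tally SAM records per read group, using the optional RG:Z: tag (columns 12+).
count_by_read_group() {
    awk -F'\t' '
        /^@/ { next }                              # skip header lines
        {
            rg = "(none)"
            for (i = 12; i <= NF; i++)
                if ($i ~ /^RG:Z:/) { rg = substr($i, 6); break }
            count[rg]++
        }
        END { for (rg in count) print rg "\t" count[rg] }
    ' | sort
}

# Example usage (assumes samtools is on the PATH and the merged BAM exists):
#   samtools view HG00418_chr20_sort_markdup.bam | count_by_read_group
```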

Q6

Multiplexing

This is the process of pooling multiple samples into a single sequencing run.

Q7

Demultiplexing

This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.
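
As a toy illustration of the principle, the sketch below assigns each FASTQ read to a sample by exact match of its index sequence. It assumes the index is the last colon-separated field of the FASTQ header line (as in common Illumina headers) and that a two-column, tab-separated barcode table is supplied; the function name demux_assign and the table layout are hypothetical. Real demultiplexing tools typically also tolerate sequencing errors in the barcode, which this sketch does not.

```shell
# Usage: demux_assign BARCODE_TABLE < reads.fastq
# BARCODE_TABLE: tab-separated lines of "barcode<TAB>sample" (must not be empty).
# Prints "read_name<TAB>sample" for every FASTQ record; unknown barcodes are
# reported as "undetermined".
demux_assign() {
    awk '
        NR == FNR {                          # first input: the barcode table
            split($0, f, "\t")
            sample[f[1]] = f[2]
            next
        }
        FNR % 4 == 1 {                       # header line of each 4-line FASTQ record
            n = split($0, parts, ":")
            barcode = parts[n]               # last colon-separated field
            name = substr($1, 2)             # read name without the leading "@"
            print name "\t" ((barcode in sample) ? sample[barcode] : "undetermined")
        }
    ' "$1" -
}
```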

End of answer key.
