Postprocess exercise answers: Difference between revisions

Latest revision as of 14:58, 20 November 2025

Post-Alignment Processing – Answer Key

Q1

Running:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
    -M ERR016028_chr20_sort_markdup.metrics.txt \
    -O ERR016028_chr20_sort_markdup.bam

The Picard log reports:

Marking 9798 records as duplicates.

This number is low because the exercise uses only a small subset of the data (reads on chr20), which keeps the run fast.

Q2

The two reads have:

Different sequences (one contains an N)
The same alignment start coordinate (position 45,996,739)

ERR016028.5947720   ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080  ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG

Even though the sequences differ slightly, Picard considers them duplicates because:

They align to the exact same genomic location
Duplicate detection is based on alignment position and orientation, not sequence identity
Reads that align identically are presumed to originate from the same original fragment; the N is simply a base the sequencer could not call
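
The position-based grouping described above can be sketched in plain shell. This is a deliberately simplified illustration, not Picard's actual algorithm: it keys only on chromosome and start coordinate, whereas Picard also considers orientation and the mate's position. The four-column table layout and the function name `count_by_position` are assumptions made for the example; the read names, coordinate and sequences are the ones from this exercise.

```shell
# Toy table: read name, chromosome, start, sequence (tab-separated).
# The four-column layout is a simplification, not real SAM output.
toy_reads=$(printf '%s\t%s\t%s\t%s\n' \
  ERR016028.5947720  20 45996739 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG \
  ERR016028.18808080 20 45996739 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG)

count_by_position() {
  # Key on chromosome and start only; the sequence column ($4) is never compared.
  awk -F'\t' '{ n[$2 ":" $3]++ } END { for (k in n) print k, n[k] }'
}

# Both reads fall into the same group even though their sequences differ.
printf '%s\n' "$toy_reads" | count_by_position
```

Because the sequence column is never inspected, the T/N difference has no influence on the grouping.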

Q3

ERR016028.18808080 is the read marked as a duplicate.

You can tell because its SAM flag changed from:

163 → original flag
1187 → original flag + duplicate flag (0x400), i.e. 163 + 1024

The duplicate status is identified because the new flag includes:

0x400 (1024 decimal) = "PCR/optical duplicate"

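Within each set of duplicates, Picard leaves one “best” representative unmarked (by default the one with the highest sum of base qualities) and marks the others as duplicates. The marked reads stay in the BAM file; they are only flagged.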
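
The flag arithmetic above can be checked with shell integer arithmetic. The following is a minimal sketch; the function name `is_duplicate` and the variable `DUP_FLAG` are made up for this example and are not part of Picard or samtools.

```shell
DUP_FLAG=1024   # 0x400 in hexadecimal: the "PCR or optical duplicate" bit

is_duplicate() {
  # Prints "yes" if the duplicate bit is set in SAM flag $1, otherwise "no".
  if [ $(( $1 & DUP_FLAG )) -ne 0 ]; then echo yes; else echo no; fi
}

echo "163  -> $(is_duplicate 163)"
echo "1187 -> $(is_duplicate 1187)"

# Setting the duplicate bit on the original flag gives the new flag.
echo "163 with the duplicate bit set: $(( 163 | DUP_FLAG ))"
```

A bitwise test is used rather than a comparison with 1187 because the duplicate bit can be combined with any other flag value.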

Q4

The correct command for merging BAM files is:

samtools merge

samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.
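
The difference between concatenating and merging can be illustrated with an analogy on plain text files. This sketch does not touch BAM files or call samtools: the two lists of numbers are hypothetical stand-ins for coordinate-sorted inputs, and `sort -m` plays the role of a merge.

```shell
# Two individually sorted lists of positions (stand-ins for two sorted BAM files).
tmpdir=$(mktemp -d)
printf '%s\n' 100 300 500 > "$tmpdir/a.txt"
printf '%s\n' 200 400 600 > "$tmpdir/b.txt"

# Concatenation keeps each file's own order, but the combined list is not sorted.
concatenated=$(cat "$tmpdir/a.txt" "$tmpdir/b.txt" | paste -sd' ' -)

# A merge interleaves the already-sorted inputs, so the result stays sorted.
merged=$(sort -n -m "$tmpdir/a.txt" "$tmpdir/b.txt" | paste -sd' ' -)

echo "cat:   $concatenated"
echo "merge: $merged"
rm -r "$tmpdir"
```

Downstream tools that expect coordinate-sorted input need the second behaviour, which is why merging is the right choice here.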

The full command for this exercise is:

samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
    ERR016028_chr20_sort_markdup.bam \
    /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Where:

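-c combines @RG header lines that share the same ID instead of renaming them, so the read group IDs stay unchanged
--write-index indexes the output automatically (for BAM output, samtools writes a .csi index by default rather than a .bai)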

Q5

The sample/library of origin is indicated in the RG (read group) tag attached to each read:

RG:Z:ERR016025
RG:Z:ERR016028

RG:Z:ERR016025 → read from the second library (the pre-made file provided for the exercise)
RG:Z:ERR016028 → read from the first library (the one you processed)

Variant callers use read groups to ensure correct sample/lane/library attribution.
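
To see how many reads each read group contributes, the RG tag can be tallied from SAM text. The sketch below assumes tab-separated SAM records on standard input; the function name `count_read_groups` and the three toy records (names, mate position, template length) are hypothetical and only serve to make the example self-contained.

```shell
# Three hypothetical SAM records: 11 mandatory fields followed by an RG tag.
toy_sam=$(printf '%s\t163\t20\t45996739\t60\t76M\t=\t45996900\t237\t*\t*\t%s\n' \
  toy_read_1 RG:Z:ERR016028 \
  toy_read_2 RG:Z:ERR016025 \
  toy_read_3 RG:Z:ERR016028)

count_read_groups() {
  # Optional tags start at field 12; count each RG:Z: value once per record.
  awk -F'\t' '{ for (i = 12; i <= NF; i++) if ($i ~ /^RG:Z:/) n[substr($i, 6)]++ }
              END { for (rg in n) print rg, n[rg] }' | sort
}

printf '%s\n' "$toy_sam" | count_read_groups
```

On the merged file from Q4, the same function could presumably be fed with `samtools view HG00418_chr20_sort_markdup.bam | count_read_groups`.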

Q6

Multiplexing

This is the process of pooling multiple samples, each tagged with its own barcode/index sequence, into a single sequencing run.

Q7

Demultiplexing

This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.
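
The idea behind demultiplexing can be sketched as a barcode lookup. This is a toy illustration only: real demultiplexers work on FASTQ index reads and usually tolerate mismatches. The barcodes, sample names, read names, file names and the function `demultiplex` are all hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical sample sheet: barcode <TAB> sample.
printf '%s\t%s\n' ACGT sampleA TTAG sampleB > "$workdir/sheet.tsv"

# Hypothetical pooled reads: read name <TAB> barcode.
printf '%s\t%s\n' read1 ACGT read2 TTAG read3 ACGT read4 GGGG > "$workdir/pooled.tsv"

demultiplex() {
  # $1 = sample sheet, $2 = pooled reads; prints "read<TAB>sample" per read.
  awk -F'\t' -v OFS='\t' '
    NR == FNR { sample[$1] = $2; next }   # first file: load barcode -> sample
    { print $1, (($2 in sample) ? sample[$2] : "undetermined") }
  ' "$1" "$2"
}

assignments=$(demultiplex "$workdir/sheet.tsv" "$workdir/pooled.tsv")
echo "$assignments"
rm -r "$workdir"
```

Reads whose barcode is not in the sheet are reported as undetermined instead of being silently dropped.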

End of answer key.
