Postprocess exercise answers
Latest revision as of 14:58, 20 November 2025
Post-Alignment Processing – Answer Key
Q1
Running:
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
-M ERR016028_chr20_sort_markdup.metrics.txt \
-O ERR016028_chr20_sort_markdup.bam
The Picard log reports:
Marking 9798 records as duplicates.
This number is low because we only use a very small subset of the genome (chr20 only) to keep the exercise fast.
Q2
The two reads have:
- Different sequences (one contains an N)
- The same alignment start coordinate (position 45,996,739)
ERR016028.5947720 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
Even though the sequences differ slightly, Picard considers them duplicates because:
- They are presumed to originate from the same original fragment
- They align to the exact same genomic location
- Duplicate detection is based on alignment position and orientation, not sequence identity
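The position-based logic above can be sketched in a few lines of Python. This is a simplified illustration, not Picard's actual implementation: Picard works with the unclipped 5' positions of both mates in a pair, whereas this sketch groups single reads by start position and strand. The chromosome name "20", the qual_sum values and the third read are assumptions made up for the example.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads that would be flagged as duplicates.

    Reads are grouped by (chrom, start, strand). Within each group the
    read with the highest summed base quality is kept; the rest are
    reported as duplicates. The sequence itself is never compared.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Rank so that the best-scoring read comes first and is kept.
        ranked = sorted(group, key=lambda r: r["qual_sum"], reverse=True)
        for read in ranked[1:]:
            duplicates.add(read["name"])
    return duplicates


# Positions for the first two reads come from the exercise;
# chromosome naming and quality sums are hypothetical.
reads = [
    {"name": "ERR016028.5947720", "chrom": "20", "start": 45996739,
     "strand": "+", "qual_sum": 2400},
    {"name": "ERR016028.18808080", "chrom": "20", "start": 45996739,
     "strand": "+", "qual_sum": 2350},
    {"name": "hypothetical.read", "chrom": "20", "start": 45996800,
     "strand": "+", "qual_sum": 2000},
]

print(mark_duplicates(reads))
```

Because the grouping key contains only coordinates and strand, the single N in one of the reads has no influence on whether the two reads are treated as duplicates.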
Q3
ERR016028.18808080 is the read marked as a duplicate.
You can tell because its SAM flag changed from:
- 163 → original flag
- 1187 → original flag plus the duplicate flag (163 + 1024 = 1187)
The duplicate status is identified because the new flag includes:
0x400 (1024 decimal) = "PCR/optical duplicate"
Picard retains the “best” representative read (by default, the one with the highest sum of base qualities) and marks the others as duplicates.
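To check the flag arithmetic yourself, the following minimal Python sketch decodes a SAM flag into its individual bits. The bit values follow the SAM specification; the short names in the table and the function name decode_flag are chosen here for illustration only.

```python
# Bit values as defined in the SAM specification; names are shortened here.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(163))
print(decode_flag(1187))
print(163 | 0x400)  # setting the duplicate bit on the original flag
```

The two decoded lists differ only by the duplicate entry, which is exactly the change MarkDuplicates made to the read.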
Q4
The correct command for merging BAM files is:
samtools merge
samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.
The full command for this exercise is:
samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
ERR016028_chr20_sort_markdup.bam \
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
Where:
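- -c combines @RG header lines that share the same ID instead of renaming them, so the read groups stay unchanged
- --write-index indexes the merged BAM automatically (samtools writes a .csi index by default for BAM output, not a .bai)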
Q5
The sample/library of origin is indicated in the RG (read group) tag attached to each read:
RG:Z:ERR016025
RG:Z:ERR016028
- RG:Z:ERR016025 → read from the second library
- RG:Z:ERR016028 → read from the first library (the one you processed)
Variant callers use read groups to ensure correct sample/lane/library attribution.
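As a rough illustration of how the RG tag can be read programmatically, the sketch below pulls the read group out of tab-separated SAM lines and counts reads per group. The two example records are hypothetical (only the read-group IDs come from the exercise), and the helper name read_group is made up for this example.

```python
from collections import Counter


def read_group(sam_line):
    """Return the read-group ID from a SAM alignment line, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("RG:Z:"):
            return tag[len("RG:Z:"):]
    return None


# Hypothetical records: every column except the RG tag is invented.
lines = [
    "\t".join(["readA", "163", "20", "45996739", "60", "76M", "=",
               "45996900", "237", "*", "*", "RG:Z:ERR016028"]),
    "\t".join(["readB", "99", "20", "45997000", "60", "76M", "=",
               "45997150", "226", "*", "*", "NM:i:0", "RG:Z:ERR016025"]),
]

counts = Counter(read_group(line) for line in lines)
print(counts)
```

In practice you would feed it the output of samtools view on the merged BAM instead of the invented lines.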
Q6
Multiplexing
This is the process of pooling multiple samples into a single sequencing run.
Q7
Demultiplexing
This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.
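The idea behind demultiplexing can be sketched as a lookup from barcode to sample. This is a toy example under stated assumptions: the barcodes, sample names and read sequences are invented, and only exact barcode matches are accepted, whereas real demultiplexing tools usually tolerate a small number of mismatches.

```python
def demultiplex(reads, barcode_to_sample):
    """Sort (barcode, sequence) pairs into per-sample bins.

    Reads whose barcode is not in the table go to "undetermined".
    """
    bins = {sample: [] for sample in barcode_to_sample.values()}
    bins["undetermined"] = []
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(sequence)
    return bins


# Hypothetical barcodes and reads, for illustration only.
barcode_to_sample = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}
pooled_reads = [
    ("ACGTAC", "GATTACA"),
    ("TGCATG", "CCGGTTA"),
    ("ACGTAC", "TTAGGCA"),
    ("NNNNNN", "AAAAAAA"),
]

bins = demultiplex(pooled_reads, barcode_to_sample)
for sample, sequences in bins.items():
    print(sample, len(sequences))
```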
End of answer key.