Postprocess exercise answers: Difference between revisions

(Created page with "'''Q1''' Running: <pre> java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam -M ERR016028_chr20_sort_markdup.metrics.txt -O ERR016028_chr20_sort_markdup.bam </pre> The log should state: <pre> Marking 9798 records as duplicates. </pre> Please note that this is very low but that is because we have very little data so that it runs faster. '''Q2''' They do not have the same sequence:...")
 
No edit summary
 
Line 1: Line 1:
'''Q1'''
<h2>Post-Alignment Processing – Answer Key</h2>
Running:
 
<h3>Q1</h3>
 
<p>Running:</p>


<pre>
<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam -M ERR016028_chr20_sort_markdup.metrics.txt -O ERR016028_chr20_sort_markdup.bam
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
    -M ERR016028_chr20_sort_markdup.metrics.txt \
    -O ERR016028_chr20_sort_markdup.bam
</pre>
</pre>


The log should state:
<p>The Picard log reports:</p>
 


<pre>
<pre>
Marking 9798 records as duplicates.
Marking 9798 records as duplicates.
</pre>
</pre>
Please note that this is very low but that is because we have very little data so that it runs faster.


<p>This number is low because the exercise uses only a small dataset (reads from chr20) to keep it fast.</p>
<hr>
<h3>Q2</h3>


'''Q2'''
<p>The two reads have:</p>
<ul>
  <li><b>Different sequences</b> (one contains an <code>N</code>)</li>
  <li><b>The same alignment start coordinate</b> (position 45,996,739)</li>
</ul>


They do not have the same sequence:
<pre>
<pre>
ERR016028.5947720 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.5947720   ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
ERR016028.18808080 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
</pre>
</pre>
notice "TCTCA" vs "TCNCA" but they both have the same starting coordinate (45996739).


'''Q3'''
<p>Even though the sequences differ slightly, Picard considers them duplicates because:</p>
 
<ul>
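  <li>Duplicate detection is based on alignment position and orientation, not sequence identity</li>
  <li>Both reads align to the exact same genomic location</li>
  <li>Reads at identical coordinates most likely derive from the same original fragment; the <code>N</code> is a base the sequencer could not call, not a real difference</li>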
</ul>
 
<hr>


ERR016028.18808080 is the read marked as duplicate. It is the read whose flag (2nd field) changed from 163 to 1187, which corresponds to a duplicate (see [https://broadinstitute.github.io/picard/explain-flags.html https://broadinstitute.github.io/picard/explain-flags.html]).
<h3>Q3</h3>


'''Q4'''
<p><code>ERR016028.18808080</code> is the read marked as a duplicate.</p>
 
<p>You can tell because its SAM flag changed from:</p>
 
<ul>
  <li><b>163</b> → flag before marking</li>
  <li><b>1187</b> → the same flag plus the duplicate bit 0x400 (163 + 1024 = 1187)</li>
</ul>
 
<p>The duplicate status is identified because the new flag includes:</p>


The correct command is:
<pre>
<pre>
samtools merge
0x400 (1024 decimal) = "PCR/optical duplicate"
</pre>
</pre>
If you choose:
 
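<p>By default, Picard keeps the read pair with the highest sum of base qualities as the representative and marks the others as duplicates. Marked reads stay in the BAM file; they are flagged, not removed.</p>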
 
<hr>
 
<h3>Q4</h3>
 
<p>The correct command for merging BAM files is:</p>
 
<pre>
<pre>
samtools cat
samtools merge
</pre>
</pre>
It will merely concatenate the files meaning that they will be file1, file2, file3.. It will not necessarily be sorted.


The full command should look something like this:
<p><code>samtools cat</code> simply concatenates files and does <b>not</b> guarantee sorted order, so it is not appropriate here.</p>
 
<p>The full command for this exercise is:</p>
 
<pre>
<pre>
samtools merge -c --write-index HG00418_chr20_sort_markdup.bam   ERR016028_chr20_sort_markdup.bam /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam  
samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
    ERR016028_chr20_sort_markdup.bam \
    /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
</pre>
</pre>


'''Q5'''
<p>Where:</p>
<ul>
It is '''RG''' which stands for read group. You will see them at the end of reads:
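  <li><code>-c</code> combines <code>@RG</code> headers that share the same ID instead of adding a suffix to tell them apart, so read group IDs stay unchanged</li>
  <li><code>--write-index</code> indexes the output automatically (for BAM, samtools writes a <code>.csi</code> index by default)</li>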
</ul>
 
<hr>
 
<h3>Q5</h3>
 
<p>The sample/library of origin is indicated in the <b>RG (read group)</b> tag attached to each read:</p>


<pre>
<pre>
RG:Z:ERR016025
RG:Z:ERR016025
RG:Z:ERR016028
RG:Z:ERR016028
</pre>
</pre>


If it was '''RG:Z:ERR016025''' it was from the file that was stored, '''RG:Z:ERR016028''' was from the file you generated.
<ul>
  <li><code>RG:Z:ERR016025</code> → read from the pre-computed file provided for the exercise</li>
  <li><code>RG:Z:ERR016028</code> → read from the file you generated in Q1</li>
</ul>
 
<p>Variant callers use read groups to ensure correct sample/lane/library attribution.</p>
 
<hr>
 
<h3>Q6</h3>
 
<p><b>Multiplexing</b></p>
 
<p>This is the process of pooling multiple samples into a single sequencing run.</p>
 
<hr>
 
<h3>Q7</h3>
 
<p><b>Demultiplexing</b></p>
 
<p>This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.</p>


'''Q6''' multiplexing
<hr>


'''Q7''' demultiplexing
<p><b>End of answer key.</b></p>

Latest revision as of 14:58, 20 November 2025

Post-Alignment Processing – Answer Key

Q1

Running:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam \
    -M ERR016028_chr20_sort_markdup.metrics.txt \
    -O ERR016028_chr20_sort_markdup.bam

The Picard log reports:

Marking 9798 records as duplicates.

This number is low because the exercise uses only a small dataset (reads from chr20) to keep it fast.


Q2

The two reads have:

  • Different sequences (one contains an N)
  • The same alignment start coordinate (position 45,996,739)

ERR016028.5947720   ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080  ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG

Even though the sequences differ slightly, Picard considers them duplicates because:

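  • Duplicate detection is based on alignment position and orientation, not sequence identity
  • Both reads align to the exact same genomic location
  • Reads at identical coordinates most likely derive from the same original fragment; the N is a base the sequencer could not call, not a real difference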
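
To see exactly where the two reads from Q2 differ, the sequences can be compared position by position. The following is a minimal sketch in POSIX shell with awk; the function name list_mismatches is a hypothetical helper (not part of Picard or samtools), and the two sequences are copied from the answer above.

```shell
#!/bin/sh
# Read sequences copied from the Q2 answer.
seq1="ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG"
seq2="ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG"

# Hypothetical helper: print "position base1 base2" (1-based) for every
# position where two equal-length sequences differ.
list_mismatches() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) {
            print "sequences differ in length" > "/dev/stderr"
            exit 1
        }
        for (i = 1; i <= length(a); i++) {
            x = substr(a, i, 1)
            y = substr(b, i, 1)
            if (x != y) print i, x, y
        }
    }'
}

list_mismatches "$seq1" "$seq2"
```

Each output line gives the position followed by the base in each read. For these two reads there is a single line, pairing T in the first read with N in the second.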

Q3

ERR016028.18808080 is the read marked as a duplicate.

You can tell because its SAM flag changed from:

  • 163 → flag before marking
  • 1187 → the same flag plus the duplicate bit 0x400 (163 + 1024 = 1187)

The duplicate status is identified because the new flag includes:

0x400 (1024 decimal) = "PCR/optical duplicate"

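By default, Picard keeps the read pair with the highest sum of base qualities as the representative and marks the others as duplicates. Marked reads stay in the BAM file; they are flagged, not removed. Flags can be decoded at https://broadinstitute.github.io/picard/explain-flags.html.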
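
The flag arithmetic can be checked directly in the shell. This is a small sketch using POSIX arithmetic expansion; the helper names is_duplicate and decode_flag are made up for illustration, while the bit values follow the SAM flag definitions.

```shell
#!/bin/sh
# Hypothetical helper: report whether the duplicate bit (0x400 = 1024) is set.
is_duplicate() {
    if [ $(( $1 & 0x400 )) -ne 0 ]; then echo yes; else echo no; fi
}

# Hypothetical helper: list the names of all bits set in a SAM flag.
decode_flag() {
    flag=$1
    for entry in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                 16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
                 256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${entry%%:*}    # numeric value before the colon
        name=${entry#*:}    # label after the colon
        if [ $(( $flag & $bit )) -ne 0 ]; then printf '%s ' "$name"; fi
    done
    printf '\n'
}

is_duplicate 163     # prints "no"
is_duplicate 1187    # prints "yes"
decode_flag 163      # paired proper_pair mate_reverse second_in_pair
decode_flag 1187     # the same four names plus duplicate
```

The only difference between the two decoded flags is the duplicate bit, which matches 1187 - 163 = 1024.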


Q4

The correct command for merging BAM files is:

samtools merge

samtools cat simply concatenates files and does not guarantee sorted order, so it is not appropriate here.

The full command for this exercise is:

samtools merge -c --write-index HG00418_chr20_sort_markdup.bam \
    ERR016028_chr20_sort_markdup.bam \
    /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Where:

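  • -c combines @RG headers that share the same ID instead of adding a suffix to tell them apart, so read group IDs stay unchanged
  • --write-index indexes the output automatically (for BAM, samtools writes a .csi index by default)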
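
The difference between concatenating and merging can be illustrated without BAM files. The sketch below is only an analogy using standard Unix tools on two small, already sorted lists of made-up positions: cat appends one file after the other, while sort -m interleaves the inputs so the result stays sorted, which is comparable to what samtools merge does for coordinate-sorted BAM files. The function names and the numbers are illustrative assumptions.

```shell
#!/bin/sh
# Append the inputs one after another (analogous to samtools cat).
concat_lists() {
    cat "$@"
}

# Merge inputs that are each already sorted numerically
# (analogous to samtools merge on coordinate-sorted BAM files).
merge_lists() {
    sort -m -n "$@"
}

# Two made-up, individually sorted position lists.
workdir=$(mktemp -d)
printf '100\n300\n500\n' > "$workdir/a.txt"
printf '200\n400\n' > "$workdir/b.txt"

echo "concatenated:"
concat_lists "$workdir/a.txt" "$workdir/b.txt"   # 100 300 500 200 400
echo "merged:"
merge_lists "$workdir/a.txt" "$workdir/b.txt"    # 100 200 300 400 500

rm -r "$workdir"
```

The concatenated output is out of order as soon as the second file starts, which is why downstream tools that expect coordinate-sorted input need the merged form.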

Q5

The sample/library of origin is indicated in the RG (read group) tag attached to each read:

RG:Z:ERR016025
RG:Z:ERR016028

  • RG:Z:ERR016025 → read from the pre-computed file provided for the exercise
  • RG:Z:ERR016028 → read from the file you generated in Q1

Variant callers use read groups to ensure correct sample/lane/library attribution.
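
To check how many reads each read group contributes, the RG tag can be tallied from SAM text. The following is a sketch under stated assumptions: count_read_groups and toy_sam are hypothetical helpers, and the three records are made up for illustration (only their RG tags match the exercise).

```shell
#!/bin/sh
# Hypothetical helper: count records per read group from SAM text on stdin.
# Optional tags start at column 12 of a SAM record.
count_read_groups() {
    awk -F '\t' '
        {
            for (i = 12; i <= NF; i++)
                if ($i ~ /^RG:Z:/) count[substr($i, 6)]++
        }
        END { for (rg in count) print rg, count[rg] }
    ' | sort
}

# Toy records, made up for illustration; only the RG tag matters here.
toy_sam() {
    printf 'readA\t163\t20\t100\t60\t76M\t=\t300\t276\t*\t*\tRG:Z:ERR016028\n'
    printf 'readB\t99\t20\t150\t60\t76M\t=\t350\t276\t*\t*\tRG:Z:ERR016025\n'
    printf 'readC\t1187\t20\t100\t60\t76M\t=\t300\t276\t*\t*\tRG:Z:ERR016028\n'
}

toy_sam | count_read_groups
```

On the real data, the input would come from samtools view HG00418_chr20_sort_markdup.bam piped into count_read_groups instead of the toy records.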


Q6

Multiplexing

This is the process of pooling multiple samples into a single sequencing run.


Q7

Demultiplexing

This is the computational process of separating pooled reads back into individual samples using barcodes/index sequences.


End of answer key.