Postprocess exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with "<H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "postalign" <LI>Navigate to the directory you just created. </OL> In this exercise, we will pre-process bam-files so they are ready for SNP calling. This is necessary to reduce the high number of potential false SNPs that will get called. You will try to: <OL> <LI>Mark read duplicates from the BAM-files <LI>Merge BAM files </OL> <H2>Duplicate removal</H2> <p>We are...")
 
No edit summary
 
Line 1: Line 1:
<H2>Overview</H2>
<h2>Overview</h2>


First:
<p>In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:</p>
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "postalign"
<LI>Navigate to the directory you just created.
</OL>


In this exercise, we will pre-process bam-files so they are ready for SNP calling. This is necessary to reduce the high number of potential false SNPs that will get called. You will try to:
<ol>
<OL>
  <li>Mark duplicate reads in BAM files</li>
  <LI>Mark read duplicates from the BAM-files
  <li>Examine the effect of duplicate marking on read interpretation</li>
  <LI>Merge BAM files
  <li>Merge multiple sequencing libraries from the same individual into a single BAM file</li>
</OL>
</ol>


<H2>Duplicate removal</H2>
<p>First:</p>
<ol>
  <li>Navigate to your home directory</li>
  <li>Create a directory called <code>postalign</code></li>
  <li>Enter the <code>postalign</code> directory</li>
</ol>


<p>We are going to work on data from a Han Chinese individual (HG00418) that was sequenced to around ~40X using Illumina paired-end sequencing. We are not going to use all the data but only some data from two libraries that are both paired-end and we are only going to use chr20 instead of the entire human genome.</p>
<hr>


The first Library can be found here:
<h2>Duplicate Marking</h2>
 
<p>We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.</p>
 
<p><b>Library 1 BAM file:</b></p>
<pre>
<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
</pre>
</pre>


We have already trimmed and aligned the data. As the filename indicates, the file has been sorted.
<p>The file is already trimmed, aligned, and sorted.</p>
 
<p>We will mark duplicate reads using <b>Picard MarkDuplicates</b>. The general command is:</p>


We will mark duplicates them using Picard:
<pre>
<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I [input BAM] -M [metrics output text file] -O [output BAM file]
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I [input.bam] \
    -M [metrics.txt] \
    -O [output.bam]
</pre>
</pre>


Try to maybe add a suffix like "_chr20_sort_markdup.bam".
<p>Suggested output name: <code>ERR016028_chr20_sort_markdup.bam</code></p>


'''Q1:'''  the output should indicate how many reads were marked as duplicates, how many duplicates did Picard flag?
<p><b>Q1:</b> After running Picard, how many reads were marked as duplicates
(Hint: this number is printed in the Picard metrics output file.)</p>


<H3>Effect</H3>
<hr>


We will look at the effect of marking duplicates, a reminder that you can inspect a specific genomic region using:
<h3>Inspecting the Effect of Duplicate Marking</h3>
 
<p>To view reads in a specific genomic region, use:</p>


<pre>
<pre>
samtools view [input.sorted.bam] [region ID]:[start]-[end]  
samtools view [input.sorted.bam] [chrom]:[start]-[end]
</pre>
</pre>


However, the bam file has to be indexed (and therefore sorted, however, do not worry about sorting the file again because Picard maintains the order of the reads).
<p>The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.</p>
 
<p>Inspect reads in the following region for both the <b>original file</b> and your <b>duplicate-marked file</b>:</p>


Try to view the reads aligning to the following region:
<pre>
<pre>
chr20:45996339-45996839
chr20:45996339-45996839
</pre>
</pre>


Both for the original file and a new file you just created with duplicates marked.
<p>Identify the two reads:</p>
<ul>
  <li><code>ERR016028.5947720</code></li>
  <li><code>ERR016028.18808080</code></li>
</ul>
 
<p><b>Q2:</b> Why did MarkDuplicates consider these reads to be duplicates?</p>
 
<p><b>Q3:</b> Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?</p>


'''Q2:''' why were reads ERR016028.5947720 and ERR016028.18808080 considered duplicates?
<hr>


'''Q3:''' which of the two was marked as duplicate and will be ignored by genotypers? How did you determine this?
<h2>Merging BAM Files</h2>


<p>Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.</p>


<H2>Merging reads</H2>
<p>You will merge:</p>


Very often you will have to combine different sequencing libraries together prior to genotyping.
<ul>
  <li>Your duplicate-marked file</li>
  <li>The second library file:</li>
</ul>


The goal is to combine the file you just generated with the duplicates removed with another of BAM file:
<pre>
<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
</pre>
</pre>


This BAM file is from the same individual, it is just a different library.
<p>Run:</p>
 
<pre>samtools</pre>
 
<p>and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the <code>.bai</code> index in the same step.</p>
 
<p><b>Q4:</b> Which <code>samtools</code> command performs merging, with options to keep read groups and write the index?</p>
 
<p>Use the options:</p>


Type:
<pre>
<pre>
samtools
-c  --write-index
</pre>
</pre>
Find a command that can combine both files, keep them sorted and write the index out for the new file in a single command.


'''Q4:''' which commander can do this?
<p>Your merged BAM file should be named:</p>
 
Run the command that you found and use the options:


<pre>
<pre>
-c --write-index
HG00418_chr20_sort_markdup.bam
</pre>
</pre>


The first option is to keep the re groups as they are is and the second is to simultaneously generate the index. The new file should be named '''HG00418_chr20_sort_markdup.bam'''. Inspect the new file using '''samtools view''' and '''less'''
<p>Inspect your merged file using:</p>


'''Q5:''' which field allows you to determine the sample of origin (i.e. from which file was a read from)?
<pre>
samtools view HG00418_chr20_sort_markdup.bam | less -S
</pre>


<p><b>Q5:</b> Which SAM/BAM field indicates the sample or library of origin for each read?</p>


'''Q6''' What is the term for pooling multiple samples into a single sequencing run?
<p><b>Q6:</b> What is the term for pooling multiple samples together into a single sequencing run?</p>


'''Q7''' What is the term for the computational process where we take the raw data from a single sequencing run and separate them into different samples?
<p><b>Q7:</b> What is the computational step where we separate pooled reads back into individual samples?</p>


Please find answers [[Postprocess_exercise_answers|here]]
<hr>


'''Congratulations you finished the exercise!'''
<p>You can find the answers here: <a href="Postprocess_exercise_answers">Postprocess_exercise_answers</a></p>


<p></p>
<p><b>Congratulations—you have completed the exercise!</b></p>
<p></p>
<p></p>

Latest revision as of 14:52, 20 November 2025

Overview

In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:

  1. Mark duplicate reads in BAM files
  2. Examine the effect of duplicate marking on read interpretation
  3. Merge multiple sequencing libraries from the same individual into a single BAM file

First:

  1. Navigate to your home directory
  2. Create a directory called postalign
  3. Enter the postalign directory

Duplicate Marking

We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.

Library 1 BAM file:

/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam

The file is already trimmed, aligned, and sorted.

We will mark duplicate reads using Picard MarkDuplicates. The general command is:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I [input.bam] \
    -M [metrics.txt] \
    -O [output.bam]

Suggested output name: ERR016028_chr20_sort_markdup.bam

Q1: After running Picard, how many reads were marked as duplicates? (Hint: this number is printed in the Picard metrics output file.)


Inspecting the Effect of Duplicate Marking

To view reads in a specific genomic region, use:

samtools view [input.sorted.bam] [chrom]:[start]-[end]

The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.

Inspect reads in the following region for both the original file and your duplicate-marked file:

chr20:45996339-45996839

Identify the two reads:

  • ERR016028.5947720
  • ERR016028.18808080

Q2: Why did MarkDuplicates consider these reads to be duplicates?

Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?


Merging BAM Files

Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.

You will merge:

  • Your duplicate-marked file
  • The second library file:
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Run:

samtools

and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the .bai index in the same step.

Q4: Which samtools command performs merging, with options to keep read groups and write the index?

Use the options:

-c   --write-index

Your merged BAM file should be named:

HG00418_chr20_sort_markdup.bam

Inspect your merged file using:

samtools view HG00418_chr20_sort_markdup.bam | less -S

Q5: Which SAM/BAM field indicates the sample or library of origin for each read?

Q6: What is the term for pooling multiple samples together into a single sequencing run?

Q7: What is the computational step where we separate pooled reads back into individual samples?


You can find the answers here: <a href="Postprocess_exercise_answers">Postprocess_exercise_answers</a>

Congratulations—you have completed the exercise!