Postprocess exercise
Overview
In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:
- Mark duplicate reads in BAM files
- Examine the effect of duplicate marking on read interpretation
- Merge multiple sequencing libraries from the same individual into a single BAM file
First:
- Navigate to your home directory
- Create a directory called
postalign - Enter the
postaligndirectory
Duplicate Marking
We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.
Library 1 BAM file:
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
The file is already trimmed, aligned, and sorted.
We will mark duplicate reads using Picard MarkDuplicates. The general command is:
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I [input.bam] \
-M [metrics.txt] \
-O [output.bam]
Suggested output name: ERR016028_chr20_sort_markdup.bam
Q1: After running Picard, how many reads were marked as duplicates? (Hint: this number is printed in the Picard metrics output file.)
Inspecting the Effect of Duplicate Marking
To view reads in a specific genomic region, use:
samtools view [input.sorted.bam] [chrom]:[start]-[end]
The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.
Inspect reads in the following region for both the original file and your duplicate-marked file:
chr20:45996339-45996839
Identify the two reads:
ERR016028.5947720ERR016028.18808080
Q2: Why did MarkDuplicates consider these reads to be duplicates?
Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?
Merging BAM Files
Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.
You will merge:
- Your duplicate-marked file
- The second library file:
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
Run:
samtools
and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the .bai index in the same step.
Q4: Which samtools command performs merging, with options to keep read groups and write the index?
Use the options:
-c --write-index
Your merged BAM file should be named:
HG00418_chr20_sort_markdup.bam
Inspect your merged file using:
samtools view HG00418_chr20_sort_markdup.bam | less -S
Q5: Which SAM/BAM field indicates the sample or library of origin for each read?
Q6: What is the term for pooling multiple samples together into a single sequencing run?
Q7: What is the computational step where we separate pooled reads back into individual samples?
You can find the answers here: <a href="Postprocess_exercise_answers">Postprocess_exercise_answers</a>
Congratulations—you have completed the exercise!