Postprocess exercise

From 22126
Revision as of 14:52, 20 November 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:

  1. Mark duplicate reads in BAM files
  2. Examine the effect of duplicate marking on read interpretation
  3. Merge multiple sequencing libraries from the same individual into a single BAM file

First:

  1. Navigate to your home directory
  2. Create a directory called postalign
  3. Enter the postalign directory

Duplicate Marking

We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.

Library 1 BAM file:

/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam

The file is already trimmed, aligned, and sorted.

We will mark duplicate reads using Picard MarkDuplicates. The general command is:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
    -I [input.bam] \
    -M [metrics.txt] \
    -O [output.bam]

Suggested output name: ERR016028_chr20_sort_markdup.bam

Q1: After running Picard, how many reads were marked as duplicates? (Hint: this number is printed in the Picard metrics output file.)


Inspecting the Effect of Duplicate Marking

To view reads in a specific genomic region, use:

samtools view [input.sorted.bam] [chrom]:[start]-[end]

The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.

Inspect reads in the following region for both the original file and your duplicate-marked file:

chr20:45996339-45996839

Identify the two reads:

  • ERR016028.5947720
  • ERR016028.18808080

Q2: Why did MarkDuplicates consider these reads to be duplicates?

Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?


Merging BAM Files

Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.

You will merge:

  • Your duplicate-marked file
  • The second library file:
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

Run:

samtools

and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the .bai index in the same step.

Q4: Which samtools command performs merging, with options to keep read groups and write the index?

Use the options:

-c   --write-index

Your merged BAM file should be named:

HG00418_chr20_sort_markdup.bam

Inspect your merged file using:

samtools view HG00418_chr20_sort_markdup.bam | less -S

Q5: Which SAM/BAM field indicates the sample or library of origin for each read?

Q6: What is the term for pooling multiple samples together into a single sequencing run?

Q7: What is the computational step where we separate pooled reads back into individual samples?


You can find the answers here: <a href="Postprocess_exercise_answers">Postprocess_exercise_answers</a>

Congratulations—you have completed the exercise!