Postprocess exercise

From 22126
Jump to navigation Jump to search

Overview

First:

  1. Navigate to your home directory:
  2. Create a directory called "postalign"
  3. Navigate to the directory you just created.

In this exercise, we will pre-process bam-files so they are ready for SNP calling. This is necessary to reduce the high number of potential false SNPs that will get called. You will try to:

  1. Mark read duplicates from the BAM-files
  2. Merge BAM files

Duplicate removal

We are going to work on data from a Han Chinese individual (HG00418) that was sequenced to around ~40X using Illumina paired-end sequencing. We are not going to use all the data but only some data from two libraries that are both paired-end and we are only going to use chr20 instead of the entire human genome.

The first Library can be found here:

/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam

We have already trimmed and aligned the data. As the filename indicates, the file has been sorted.

We will mark duplicates them using Picard:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I [input BAM] -M [metrics output text file] -O [output BAM file]

Try to maybe add a suffix like "_chr20_sort_markdup.bam".

Q1: the output should indicate how many reads were marked as duplicates, how many duplicates did Picard flag?

Effect

We will look at the effect of marking duplicates, a reminder that you can inspect a specific genomic region using:

samtools view  [input.sorted.bam] [region ID]:[start]-[end] 

However, the bam file has to be indexed (and therefore sorted, however, do not worry about sorting the file again because Picard maintains the order of the reads).

Try to view the reads aligning to the following region:

chr20:45996339-45996839

Both for the original file and a new file you just created with duplicates marked.

Q2: why were reads ERR016028.5947720 and ERR016028.18808080 considered duplicates?

Q3: which of the two was marked as duplicate and will be ignored by genotypers? How did you determine this?


Merging reads

Very often you will have to combine different sequencing libraries together prior to genotyping.

The goal is to combine the file you just generated with the duplicates removed with another of BAM file:

/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam

This BAM file is from the same individual, it is just a different library.

Type:

samtools

Find a command that can combine both files, keep them sorted and write the index out for the new file in a single command.

Q4: which commander can do this?

Run the command that you found and use the options:

-c --write-index

The first option is to keep the re groups as they are is and the second is to simultaneously generate the index. The new file should be named HG00418_chr20_sort_markdup.bam. Inspect the new file using samtools view and less

Q5: which field allows you to determine the sample of origin (i.e. from which file was a read from)?


Q6 What is the term for pooling multiple samples into a single sequencing run?

Q7 What is the term for the computational process where we take the raw data from a single sequencing run and separate them into different samples?

Please find answers here

Congratulations you finished the exercise!