Postprocess exercise answers
Q1 Running:
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I /home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam -M ERR016028_chr20_sort_markdup.metrics.txt -O ERR016028_chr20_sort_markdup.bam
The log should state:
Marking 9798 records as duplicates.
Note that this number is very low, but only because the exercise uses a small dataset so that it runs quickly.
Q2
They do not have the same sequence:
ERR016028.5947720  ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG
ERR016028.18808080 ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
Notice "TCTCA" vs "TCNCA": the second read has an uncalled base (N). Both reads nevertheless have the same starting coordinate (45996739).
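If you want to locate the difference without comparing the reads by eye, a small helper can do it. This is a minimal sketch: the function name seq_diff is made up for illustration, and it assumes both sequences have the same length. It prints the 1-based position of each mismatch followed by the two bases.

```shell
# Print every 1-based position where two equal-length sequences differ,
# followed by the base in the first and in the second sequence.
# seq_diff is a hypothetical helper name, not part of samtools or Picard.
seq_diff() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1))
                print i, substr(a, i, 1), substr(b, i, 1)
    }'
}

# The two reads from Q2
seq_diff \
    ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCTCATTATCTTGCCCAGGCTAG \
    ACATGTGGCTAATTTTTTTTACTGTTGTGGAGAAAGGAGGAGGGAGAGGGGAGTCNCATTATCTTGCCCAGGCTAG
```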
Q3
ERR016028.18808080 is the read marked as duplicate. Its flag (2nd field) changed from 163 to 1187. The difference, 1024 (0x400), is the bit that marks a read as a PCR or optical duplicate (see https://broadinstitute.github.io/picard/explain-flags.html).
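The flag is a sum of bits, so you can check the duplicate bit with plain shell arithmetic. This is a small sketch: the function name is_duplicate is made up for illustration, and the only fact it relies on is that the duplicate bit is 1024 (0x400).

```shell
# Report whether the duplicate bit (0x400 = 1024) is set in a SAM flag.
# is_duplicate is a hypothetical helper name.
is_duplicate() {
    if [ $(( $1 & 1024 )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

echo "flag 163 duplicate:  $(is_duplicate 163)"
echo "flag 1187 duplicate: $(is_duplicate 1187)"
echo "1187 - 163 = $(( 1187 - 163 ))"
```

If samtools is available, "samtools flags 1187" should give the same information as a list of flag names.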
Q4
The correct command is:
samtools merge
If you choose:
samtools cat
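it will merely concatenate the files in the order given (file1, file2, file3, ...), so the output will not necessarily be sorted. samtools merge, in contrast, takes sorted input files and produces a single sorted output file.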
The full command should look something like this:
samtools merge -c --write-index HG00418_chr20_sort_markdup.bam ERR016028_chr20_sort_markdup.bam /home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
Q5
It is RG, which stands for read group. You will see the tag at the end of each read:
RG:Z:ERR016025
RG:Z:ERR016028
Reads tagged RG:Z:ERR016025 come from the file that was already stored; reads tagged RG:Z:ERR016028 come from the file you generated.
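To check that the merged file really contains reads from both read groups, you can count the RG tags. This is a sketch under assumptions: the function name count_read_groups is made up for illustration, and it expects tab-separated SAM records on standard input, as produced by samtools view, where optional tags start at field 12.

```shell
# Count reads per read group from SAM records on standard input.
# count_read_groups is a hypothetical helper name.
# Output: one line per read group, "ID<TAB>count", sorted by ID.
count_read_groups() {
    awk -F'\t' '{
        # Fields 1-11 are mandatory in SAM; optional tags start at 12.
        for (i = 12; i <= NF; i++)
            if ($i ~ /^RG:Z:/) count[substr($i, 6)]++
    }
    END { for (rg in count) print rg "\t" count[rg] }' | sort
}

# Usage on the merged file from Q4 (requires samtools):
#   samtools view HG00418_chr20_sort_markdup.bam | count_read_groups
```

Both ERR016025 and ERR016028 should appear in the output if the merge kept the read groups.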
Q6 multiplexing
Q7 demultiplexing