Alignment exercise answers

From 22126
Jump to navigation Jump to search

Q1: 3 possible ways:

  • The file with the smaller file size contains the trimmed reads.
  • Peek in the file and determine which file contains reads of uneven lengths
  • Use fastqc, to determine which file contains overrepresented adapter sequences

Q2: 4 lines if you have added the RG

Q3: 166782

Q4: 0, this means that the probability of being mismapped is one. This means that this read cannot be confidently assigned to this position.

Q5: There are 3x16 and 5x0 so 3 on the - strand and 5 on the + strand

Q6: Yes it is unmapped, 4 means read unmapped.

Q7: 1183276 aligned out of 1252591 which is 94.47%. Therefore 100%-94.47% = 5.53%

Q8: Down from 467M to 117M so about 4X less.

Q9: Down from 117M to 37M so about 3.1X less.

Q10: Use:

samtools  view  SRR8002634_1.sorted.bam  NC_002516.2:2000000-3000000|wc -l 

155343


Q11: Use:

samtools  view -q 30 SRR8002634_1.sorted.bam  NC_002516.2:2000000-3000000|wc -l 

149524


Q12: In SRR8002634_1.mosdepth.summary.txt we find:

NC_002516.2  6264404  177098613  28.27  0    113

The average is 28.27X, the maximum is 113 reads.

Q13: if you use the wrong genome reference, the number of aligned reads will likely go down. This is because as the phylogenetic distance increases, the more the genomes diverge, the less the reads are likely to land on the genome due to increased divergence. In the extreme, the reads aligned by serendipity. Conversely, if the reference genome used is closer phylogenetically to the species of the sample, the reads are more likely to align and the number of aligned reads will go up. For instance, 5689 (0.46%) aligned to the Yersinia pestis and 279217 (22.34%) aligned to the Pseudomonas aeruginosa.

Q14 the command line is:

bwa mem  -R "@RG\tID:RG26\tSM:YRB42"  /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
 /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz |samtools view -bS /dev/stdin | samtools sort /dev/stdin >  NA19201.bam

By the way, when we write:

samtools view -bS /dev/stdin

we can also write:

samtools view -bS -

It means the same thing and is quicker.

Very technically speaking, the command line above is not the most efficient. By default, BAM is a compressed binary format. Compressed here means it is run through gzip or more precisely bgzip (block version of gzip). It can also be uncompressed binary. The first call to samtools view transforms the data as compressed binary. Subsequently, samtools sort decompresses and re-compresses it. This is a bit wasteful, there is no point in compressing data if you're going to decompress it again in the subsequent step. Is smarter command line is:

bwa mem  -R "@RG\tID:RG26\tSM:YRB42" /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa /home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
 /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz |samtools view -uS - | samtools sort /dev/stdin >  NA19201.bam

Note that we have changed samtools view -bS to samtools view -uS, this says produced uncompressed BAM.

Q15 1) there are fewer intermediate files being produced 2) you guarantee parallel execution such that you are as fast as your bottleneck which in this case is mapping.


Q16 Using:

samtools flagstat NA19201.bam

We find:

3516557 + 0 mapped (99.83% : N/A)

So 99.83%. Please note that this is unusually high as we knew that these are reads then mapped to chromosome 20.


Q17

3490934 + 0 properly paired (99.23% : N/A)


Q18 Using:

samtools view    NA19201.bam chr20 |wc -l 
3449006 (or 3448996)

However, please note that you might get different numbers. Like mentioned in class, BWA tends to plate reads randomly in case of a tie. But expect the same ballpark. Also, to be more rigorous, as noted in the flagstat, there are 4620 supplementary alignments in total, so one would need to filter those out. Using

samtools view -F0x800 NA19201.bam chr20 |wc -l 
3448196 (or 3448186)


Q19

samtools stat NA19201.bam > NA19201.stat
plot-bamstats -p NA19201 NA19201.stat

The most common insert size is 152.

Q20:reference=T, sample=G

Q21: 8 reads

Q22: FER1L4