Data Preprocess exercise answers

Q1

 zcat /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz | head -n 2 | tail -n 1 | wc -c
151

However, the answer is 150, because "wc -c" also counts the end-of-line character.
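
The off-by-one can be reproduced on a toy example. This is a minimal sketch: the 8-base read below is hypothetical and merely stands in for line 2 of a FASTQ record. It compares the raw "wc -c" count with two ways of getting the length without the newline.

```shell
# Hypothetical 8-base read standing in for line 2 of a FASTQ record.
seq_line='ACGTACGT'

# wc -c counts every byte, including the trailing newline.
with_newline=$(printf '%s\n' "$seq_line" | wc -c)

# Deleting the newline before counting gives the read length itself.
without_newline=$(printf '%s\n' "$seq_line" | tr -d '\n' | wc -c)

# awk's length() does not include the line terminator either.
awk_length=$(printf '%s\n' "$seq_line" | awk '{ print length($0) }')

echo "with newline: $with_newline, without: $without_newline, awk: $awk_length"
```

Either of the last two forms could replace the final "wc -c" in the pipeline above if you prefer to read off the length directly.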

Q2

Running:

fastqc -o .  /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
fastqc -o .  /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz

SRR957824 is the worse run: the quality scores towards the end of the reads are low.

Q3

SRR957868 is the better run, but it has a warning in the "Overrepresented sequences" category and an error in "Adapter Content" due to the presence of untrimmed adapters. One solution would be to trim the remaining adapter sequences.


Q4

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG -o  SRR957868_1_o.fastq.gz SRR957868_1.fastq.gz
Sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG; Type: regular 3'; Length: 50; Trimmed: 59789 times

So the adapter was trimmed 59789 times.

Q5

The adapter would have been trimmed far fewer times: the program would not have recognized the real adapter sequences, and would only have removed random sequences that happened to resemble the erroneous adapter sequence.

Q6

fastqc -o .  SRR957868_1_o.fastq.gz

The error in the "Adapter Content" section is gone. There is still a warning in the "Overrepresented sequences" category, but the program no longer reports the adapter sequence there.

Q7

The command:

fastp -Q -L --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT --out1 SRR794302_1_trimmed.fastq.gz --out2 SRR794302_2_trimmed.fastq.gz --in1 /home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz --in2 /home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz

Should give you the following stats:

Filtering result:
reads passed filter: 499988
reads failed due to low quality: 0
reads failed due to too many N: 0
reads with adapter trimmed: 11042
bases trimmed due to adapters: 138056

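The first forward read to be trimmed is the second record in this excerpt of the trimmed output (its sequence is shorter than that of its neighbours):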

@SRR794302.7 HWI-ST434:134117522:C1N85ACXX:6:1101:1666:2094/1
GCCTTTCTCCTATCCATCCCCTACTGTTTCAGTGCACCACTGAAACTATTCCACCTCCATAGAGACTCCGCAGGGTGATGTCCTATCAGCCAGCTATGAG
+
@@?BDBD?,=C<CFF?EHC@FHEIIEDDFAEHH<F@CBD@AGC@FFHGCBHBFFD8(8B<CHIBGG@F>@AE:/'.;A3>(6.;@@(;;3;?B@3:5>AA
@SRR794302.8 HWI-ST434:134117522:C1N85ACXX:6:1101:1721:2099/1
TCACTTTGTCGGCCAGGCTGGAGTGCAGTTGTGCAATCTCAGTTTGTTGCAACATCTGCCTCCCAGGGTCAAGCAATTCTCATGCTTCA
+
;??DB;DDDH??DG::AC?FGAG?CGF@::B?BDFB*?@<D?FEG?D39DHH9FH>F4@==.6-7?EF=7;777>3;A>A>;;;CCEA>
@SRR794302.9 HWI-ST434:134117522:C1N85ACXX:6:1101:1687:2127/1
TAGAGGGACTAATCTAAAACTACCTTTTTTCAATTTAAGAACTTTGTTTTATTTACCAATTTAAGGGTGATAAGCTGTGAAGAAGTAATTTAGAACAACC
+
@@CDFFFFHHFFHIJJGIIHHIHJIJJIJJJFGIJJFEHGGGIJJIGIJJIIJJDHHEGIJJJIIIIDDACCCAEABCECBCC@C;>>CDCC@CA;?CBB


So "@SRR794302.8 HWI-ST434:134117522:C1N85ACXX:6:1101:1721:2099" was the first read to be trimmed. The original read was:

@SRR794302.8 HWI-ST434:134117522:C1N85ACXX:6:1101:1721:2099/1
TCACTTTGTCGGCCAGGCTGGAGTGCAGTTGTGCAATCTCAGTTTGTTGCAACATCTGCCTCCCAGGGTCAAGCAATTCTCATGCTTCAAGATCGGAAGA
+
;??DB;DDDH??DG::AC?FGAG?CGF@::B?BDFB*?@<D?FEG?D39DHH9FH>F4@==.6-7?EF=7;777>3;A>A>;;;CCEA>A>>AC@?<?5<

The "AGATCGGAAGA" at the end of the original read matches the start of the adapter sequence that we provided, which was: "AGATCGGAAGAGCACACGTCTGAACTCCAGT".
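
This can be checked directly in the shell. The sketch below uses the original and trimmed sequences of read SRR794302.8 and the adapter given to fastp, all copied from above. The variable names are illustrative only.

```shell
adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGT'
original='TCACTTTGTCGGCCAGGCTGGAGTGCAGTTGTGCAATCTCAGTTTGTTGCAACATCTGCCTCCCAGGGTCAAGCAATTCTCATGCTTCAAGATCGGAAGA'
trimmed='TCACTTTGTCGGCCAGGCTGGAGTGCAGTTGTGCAATCTCAGTTTGTTGCAACATCTGCCTCCCAGGGTCAAGCAATTCTCATGCTTCA'

# Strip the trimmed read from the front of the original read;
# what remains is the part that fastp removed.
removed=${original#"$trimmed"}

# Test whether the removed part is a prefix of the adapter.
case "$adapter" in
  "$removed"*) is_prefix=yes ;;
  *)           is_prefix=no ;;
esac

echo "removed: $removed (${#removed} bases), prefix of adapter: $is_prefix"
```

The removed part is only the beginning of the adapter because the read ends before the rest of the adapter could be sequenced.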

The same read pair is indeed also the first one to be trimmed among the reverse reads:

@SRR794302.8 HWI-ST434:134117522:C1N85ACXX:6:1101:1721:2099/2
TGAAGCATGAGAATTGCTTGACCCTGGGAGGCAGATGTTGCAACAAACTGAGATTGCACAACTGCACTCCAGCCTGGCCGACAAAGTGA
+
@@@DDDBDF8D>F@FA>F?FC??EHEEHG1?:1?;FDGEGC<BFAHEHGAGHEHHIICH<>;=DHI@;AAA>>?CE?BCB>>/;9?C>3

This makes sense: after trimming, both reads are 89 bases long, so the original DNA fragment was probably 89 bp long.
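
One way to convince yourself is to check that the trimmed reverse read is the reverse complement of the trimmed forward read, which is what you expect when both reads span the whole fragment. A minimal sketch, using the two trimmed sequences shown above and assuming the "rev" utility (part of util-linux) is installed:

```shell
forward='TCACTTTGTCGGCCAGGCTGGAGTGCAGTTGTGCAATCTCAGTTTGTTGCAACATCTGCCTCCCAGGGTCAAGCAATTCTCATGCTTCA'
reverse='TGAAGCATGAGAATTGCTTGACCCTGGGAGGCAGATGTTGCAACAAACTGAGATTGCACAACTGCACTCCAGCCTGGCCGACAAAGTGA'

# Reverse the forward read, then swap each base for its complement.
revcomp=$(printf '%s\n' "$forward" | rev | tr 'ACGT' 'TGCA')

if [ "$revcomp" = "$reverse" ]; then
  same=yes
else
  same=no
fi

echo "forward: ${#forward} bases, reverse: ${#reverse} bases, reverse complement matches: $same"
```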

Q8

According to the message when the program finishes:

reads with adapter trimmed: 11042

11042 sequences out of 499988 were trimmed.

Q9

The longer the insert size, the fewer adapters will be detected. More adapters are detected and trimmed when the insert size is short, because a read that is longer than the insert runs past its end and into the adapter.
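
The relationship can be illustrated with simple arithmetic. The sketch below assumes 100-base reads (the length of the untrimmed read shown in Q7). The fragment lengths in the loop are hypothetical, except for 89, which is the fragment from Q7, and the function name is illustrative.

```shell
read_len=100

# Number of bases at the end of a read that lie beyond the insert,
# i.e. that come from the adapter rather than from the sample.
adapter_bases() {
  frag=$1
  if [ "$frag" -lt "$read_len" ]; then
    echo $(( read_len - frag ))
  else
    echo 0
  fi
}

for frag in 50 89 100 300; do
  echo "fragment of ${frag} bp: $(adapter_bases "$frag") bases past the insert"
done
```

For the 89 bp fragment this gives 11 bases, which is the length of the "AGATCGGAAGA" that was trimmed in Q7.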

Q10

The reverse reads have particularly low quality scores towards the end of the reads.

Q11

Trimmomatic should have removed a substantial number of sequences with low base quality. The improvement should be more noticeable on the reverse reads.


Q12

zcat SRR8002634_1U.fastq.gz|wc -l 
384672

So 384672/4 = 96168 reads.

A quicker way:

echo $(zcat SRR8002634_1U.fastq.gz|wc -l)/4|bc

96168

echo $(zcat SRR8002634_2U.fastq.gz|wc -l)/4|bc

4770

It is normal that many forward reads now find themselves alone, because the reverse reads were of much lower quality and were discarded more often. Therefore, the unpaired files contain more forward reads than reverse reads.
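
Dividing the line count by 4 is a safer way to count reads than counting lines that start with "@", because a quality line may also start with "@" (as in the first record shown in Q7). A minimal sketch on a hypothetical two-record FASTQ, using "gzip -dc", which behaves like zcat when reading from standard input:

```shell
# Two hypothetical FASTQ records, gzip-compressed on the fly.
# The quality line of the first record starts with '@' on purpose.
make_fastq_gz() {
  printf '@r1\nACGT\n+\n@@II\n@r2\nTTGA\n+\nIIII\n' | gzip -c
}

# Each FASTQ record is exactly 4 lines, so reads = lines / 4.
lines=$(make_fastq_gz | gzip -dc | wc -l)
reads=$(( lines / 4 ))

# Counting header-like lines overcounts when a quality line starts with '@'.
at_lines=$(make_fastq_gz | gzip -dc | grep -c '^@')

echo "reads from line count: $reads, lines starting with @: $at_lines"
```

The shell's own arithmetic expansion, $(( ... )), does the division here, so "bc" is not needed.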

Q13

leeHom --auto --ancientdna -fqo ERR4778296 -fq1 /home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_1.fastq.gz -fq2 /home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_2.fastq.gz
Total 250000; Merged (trimming) 237536; Merged (overlap) 9848; Kept PE/SR 2601; Trimmed SR 0; Adapter dimers/chimeras 15; Failed Key 0; UMI problems 0

2601 read pairs were left as is, which is 1.04% of the total (2601/250000), so very few.
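
The percentage can be recomputed from the numbers in the leeHom summary line. The sketch below also adds up the four non-zero categories, which happen to sum to the total. That these categories are mutually exclusive is an assumption based on this sum, not something stated in the output. The variable names are illustrative.

```shell
# Numbers copied from the leeHom summary line above.
total=250000
merged_trimming=237536
merged_overlap=9848
kept=2601
dimers=15

# Sum of the non-zero categories, to compare against the total.
sum=$(( merged_trimming + merged_overlap + kept + dimers ))

# Shell arithmetic is integer-only, so use awk for the percentage.
pct=$(LC_ALL=C awk -v k="$kept" -v t="$total" 'BEGIN { printf "%.2f", 100 * k / t }')

echo "categories sum to $sum of $total; kept as is: $pct%"
```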