Data Preprocess exercise
Overview
First:
- Navigate to your home directory:
- Create a directory called "preprocess"
- Navigate to the directory you just created.
We will try to pre-process several types of NGS data.
- Escherichia coli single-end Illumina reads
- Pseudomonas aeruginosa paired-end Illumina reads
Escherichia coli single-end Illumina reads
Introduction
An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is the one responsible for the outbreak.
The lab technician has performed two different sequencing runs using an Illumina MiSeq sequencer. The lab technician mentions that one run was ok and the other one had poor quality.
The data can be found here:
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz
Leave the data there, you do not need to copy it.
Q1: What is the read length?
FastQC
We will use the program "FastQC" to assess the quality of the reads. First, create 2 directories named (have a look at the UNIX notes to remind yourself how to do it):
SRR957824 SRR957868
First, type the following to view options:
fastqc --help
I recommend to create a directory called:
fastqc/
We will use the -o option to redirect the output to a directory. We will use the following command line:
fastqc -o [output directory] [fastq.gz file]
Where [output directory] is the name of the output directory in our case: fastqc/. And [fastq.gz file] is the fastq file to analyze. Please run SRR957824_1.fastq.gz and send the output to the SRR957824/ directory you created. Do the same for SRR957868_1.fastq.gz by sending its output to SRR957868/
firefox fastqc/[file prefix]_fastqc.html &
If this is very slow (it might be if all do this at the same time) you can copy the files to your own computer.
To do this all you have to do is start another local session tab in MobaXterm (windows) or Terminal (Linux or Mac) you write (remember to type your password when prompted):
scp stud0XX@(pupil1 pupil2 pupil3 ):preprocess/fastqc/*.html .
The output should look like a series of graphs/tables in different categories (ex: Basic Statistics, Per base sequence quality). You can use the left column to quickly navigate to a category that looks suspicious. Look at each of the categories and see if it reports something that we should look at (green check vs yellow exclamation mark vs red cross).
N.B. For this run, do not worry about the following categories:
[FAIL]Per base sequence content [WARNING]Per sequence GC content
The most interesting figure is the first where the quality of each base is plotted as you move from the beginning of the read to the end. It shows the distribution of quality scores in the reads as you move from 5' to 3'. Here you should see the trailing bad qualities typical for Illumina data. It can be a good idea to remove these bad bases as we do not want them to influence our assembly.
Also, look for "Overrepresented sequences" (second last plot), these are often sequencing adapters that are also present at the end of reads. It is important to remove these when are doing a de novo assembly as these will overlap between the reads and make erroneous junctions between reads. In alignments, they are also troublesome as they will create incorrect mismatches at the end of reads.
Q2: Which of the 2 sequencing runs (SRR957824 or SRR957868) do you think had poor quality?
Q3: For the sequencing run with good quality, there seems to be a remaining issue. What do you think is the cause and the solution?
cutadapt
In "Overrepresented sequences", here you see that there is a sequence that is overrepresented in the reads that matches the TruSeq Adapter.
We will use cutadapt to remove the lingering adapters:
cutadapt -a [primer sequence] -o [output file] [input file]
The sequencing technician gives you the following sequence for the primer:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
Use cutadapt to remove the ring a ring adaptor. I suggest that you call your output file SRR957868_1_trimmed.fastq.gz.
Q4: How many times was this adapter trimmed (look at the output produced by the program)?
Q5: What would have happened to this number of trimmed sequences had the signal technician given you the wrong adapter sequence?
Run FastQC again on the resulting output file.
Q6: you still find adapter sequences among the "overrepresented sequences"?
Human Illumina Paired-end reads
Let us look at some paired-end Illumina reads, these reads are from whole-genome sequencing of a Yoruba individual:
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz
The "_1.fastq.gz" is the forward read, the "2_fastq.gz" is the reverse read. The problem is that you do not know the adapter sequences.
fastp
We will use fastp, an ultra-fast adapter trimming software, to trim the adapter sequences:
fastp -Q -L --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT --out1 [output read1] --out2 [output read2] --in1 [input read1] --in2 [input read2]
-Q and -L disables quality filtering and length filtering. The fastq files do not need to be unzipped, fastq.gz is fine. The file [output read1] contains the trimmed forward reads and [output read2] contains the trimmed reverse reads. You can use "SRR794302_1_trimmed.fastq.gz" as output [output read1] and SRR794302_2_trimmed.fastq.gz as [output read2] . Pay attention to the messages produced by the program as you can see some interesting summary statistics.
You can use:
zcat file.fastq.gz |less -S
To view a zipped fastq file. Please inspect the resulting trimmed reads.
Q7: Which forward read was the first to be trimmed (write the ID)? hint: it will have a different sequence length. Had we asked the same question for the reverse reads, would you have found a different read or the same? Does it makes sense?
Q8: How many sequences have been trimmed?
Q9: Given an identical number of starting sequences, do you think that you will get more sequences trimmed if the insert size is short or long? why?
Metagenomic Illumina Paired-end reads
Let's try some reads from a study of Pseudomonas aeruginosa, an opportunistic pathogen that can live in the environment and may infect humans. Inside humans, they may live in the lungs and form biofilms.
The reads are found here:
/home/projects/22126_NGS/exercises/preprocess/ex3/
FastQC
First let us look at the reads, use FastQC on the forward and reverse reads to identify potential problems.
Q10. One type of read (forward/reverse) has a particular type of problem compared to the other, which type of read is it and what is the problem?
Trimmomatic
We will use Trimmomatic the simultaneously trim adaptors and remove sequences with bad quality. This step can be required especially when doing de novo assembly as they can create incorrect junctions between reads.
The basic command line is as such:
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [some flags] [forward read fastq] [reverse read fastq] [output prefix]_1P.fastq.gz [output prefix]_1U.fastq.gz [output prefix]_2P.fastq.gz [output prefix]_2U.fastq.gz [some more flags]
where:
file | contents |
[output prefix]_1P.fq.gz | paired forward reads |
[output prefix]_1U.fq.gz | unpaired forward reads (the reverse read failed quality control) |
[output prefix]_2P.fq.gz | for paired reverse reads |
[output prefix]_2U.fq.gz | unpaired reverse reads (the forward read failed quality control) |
Because sometimes the reverse forward reads are not of the same average quality, the files ([output prefix]_1U.fq.gz and [output prefix]_2U.fq.gz ) do not necessarily contain the same number of reads. However, the files ([output prefix]_1P.fq.gz and [output prefix]_2P.fq.gz ) should contain the same number of reads. We will use Trimmomatic to trim adapters and remove segments of poor quality. Here is the command line:
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 1 -phred33 /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
The additional options do the following:
option | effect |
ILLUMINACLIP | The sequence of the adapters, the remaining numbers are sensitivity thresholds (see software manual for the exact definitions) |
LEADING | Remove the bases at the 5' end when the QC scores fall below this threshold. |
TRAILING | Remove the bases at the 3' end when the QC scores fall below this threshold. |
SLIDINGWINDOW | Remove bases using a sliding window strategy, the first number is the window size, the second number is the quality score. |
MINLEN | minimum length for a sequence. |
FastQC
Again let us run FastQC on the forward (SRR8002634_1P.fastq.gz) and reverse reads (SRR8002634_2P.fastq.gz) to verify if the previous problems went away.
Q11. Did the quality scores improve especially for the reverse reads?
Q12. The reads in SRR8002634_1U.fastq.gz are the forward reads for which the paired reverse reads were removed due to poor quality. The reverse reads whose paired forward reads failed quality controls are found in SRR8002634_2U.fastq.gz without counting the number of lines, which file would you think contains more reads? Count the number of reads found in each and check if you prediction was correct
Ancient DNA
A study extracted DNA from dogs, wolves and mammoths. You can find the study here. However, DNA tends to degrade fast and the sequences can be very short.
The raw data for a dog can be found here:
/home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_2.fastq.gz
leeHom
We will not only trim adapters but also merge overlapping sequences to get an idea of how long our sequences are. We will use leeHom, a specialized program for short sequences, to infer the adapter sequences and remove the adapters:
leeHom --auto --ancientdna -fq1 [forward read fastq] -fq2 [reverse read fastq] -fqo [output prefix]
"--auto" means infer the adapter sequences, "--ancientdna" means merge overlapping pairs. The fastq files do not need to be unzipped, fastq.gz is fine. You can use "ERR4778296_trimmed" as output prefix. Pay attention to the messages produced by the program as you can see some interesting summary statistics. The program should produce the following statistics:
Q13. How many reads were left as pairs?
File suffix | Meaning |
ERR4778296_trimmed.fq.gz | Sequences that were trimmed |
ERR4778296_trimmed_r1.fq.gz | Forward reads that were not trimmed and left as is |
ERR4778296_trimmed_r2.fq.gz | Reverse reads that were not trimmed and left as is |
You also see some files with the word "fail", these sequences generally had some problems for instance the program was not sure if the adapter was present or not.
You can see how short the sequences are using zcat on ERR4778296_trimmed.fq.gz.
Different trimming/merging programs
In closing, there are several different programs to remove adapters on the servers:
cutadapt
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar
leeHom
AdapterRemoval
Let us know if you need an additional one that is not found on this list.
Please find answers here
Congratulations you finished the exercise!