Data Preprocess exercise
Overview
First:
- Navigate to your home directory.
- Create a directory called "preprocess".
- Navigate to the directory you just created.
We will try to pre-process several types of NGS data.
- Escherichia coli single-end Illumina reads
- Pseudomonas aeruginosa paired-end Illumina reads
Escherichia coli single-end Illumina reads
Introduction
An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.
The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.
The data can be found here:
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz
Leave the data where they are; you do not need to copy them.
Q1: What is the read length?
FastQC
We will use FastQC to assess read quality. First create the directories:
SRR957824 SRR957868
Check the FastQC help:
fastqc --help
Create an output directory:
fastqc/
Run FastQC:
fastqc -o [output directory] [fastq.gz file]
View results:
firefox fastqc/[file prefix]_fastqc.html &
If this is slow, copy the files locally using scp:
scp stud0XX@pupilX:preprocess/fastqc/*.html .
Look for warnings or failures in categories such as per-base quality and overrepresented sequences.
Ignore these warnings for this exercise:
[FAIL] Per base sequence content [WARNING] Per sequence GC content
Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.
Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?
Q3: For the good run, there is still a remaining issue. What is the cause and solution?
cutadapt
FastQC shows overrepresented TruSeq adapters.
Use cutadapt:
cutadapt -a [adapter sequence] -o [output file] [input file]
Adapter sequence:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
Suggested output file: SRR957868_1_trimmed.fastq.gz
Q4: How many times was this adapter trimmed?
Q5: What would happen if the wrong adapter sequence was used?
Run FastQC again on the trimmed output.
Q6: Do you still find adapter sequences among the “overrepresented sequences”?
Human Illumina Paired-end Reads
These reads come from whole-genome sequencing of a Yoruba individual:
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz
Read 1 is forward; read 2 is reverse.
fastp
fastp is a fast and versatile tool for adapter trimming.
It can also merge overlapping paired-end reads, but here we use it only for trimming.
fastp -Q -L \ --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \ --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \ --out1 [output read1] \ --out2 [output read2] \ --in1 [input read1] \ --in2 [input read2]
Use:
SRR794302_1_trimmed.fastq.gz SRR794302_2_trimmed.fastq.gz
Inspect trimmed reads:
zcat file.fastq.gz | less -S
Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?
Q8: How many sequences were trimmed?
Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?
Metagenomic Illumina Paired-end reads
These reads come from a Pseudomonas aeruginosa metagenomics study:
/home/projects/22126_NGS/exercises/preprocess/ex3/
FastQC
Run FastQC on both forward and reverse reads.
Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?
Trimmomatic
Trimmomatic simultaneously trims adapters and removes low-quality segments — useful before de novo assembly.
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags] [forward.fq] [reverse.fq] [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \ [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \ [more flags]
| file | contains |
|---|---|
| 1P | paired forward reads |
| 1U | unpaired forward reads |
| 2P | paired reverse reads |
| 2U | unpaired reverse reads |
Example command:
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \ PE -threads 1 -phred33 \ /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \ /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \ SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \ SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \ ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \ LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
Q11: Did the reverse-read quality improve?
Q12: Which unpaired file (1U or 2U) do you expect to have more reads? Count lines to check.
Different trimming/merging programs
Here are several commonly used tools available on the servers:
cutadapt # adapter trimming
Trimmomatic # trimming + quality filtering
fastp # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval # trimming + merging overlapping paired-end reads
Let us know if you need an additional tool installed.
Please find answers here.
Congratulations, you finished the exercise!