Data Preprocess exercise
Overview
First:
- Navigate to your home directory.
- Create a directory called "preprocess".
- Navigate to the directory you just created.
We will try to pre-process several types of NGS data.
- Escherichia coli single-end Illumina reads
- Pseudomonas aeruginosa paired-end Illumina reads
Escherichia coli single-end Illumina reads
Introduction
An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.
The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.
The data can be found here:
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz
Leave the data where they are; you do not need to copy them.
Q1: What is the read length?
FastQC
We will use FastQC to assess read quality. First create the directories:
SRR957824 SRR957868
Check the FastQC help:
fastqc --help
Create an output directory:
fastqc/
Run FastQC:
fastqc -o [output directory] [fastq.gz file]
View results:
If you are using MobaXterm, you can open the FastQC HTML files directly from the left-hand file panel on the server.
If you are using macOS (or a standard terminal), copy the HTML files to your local computer and open them in a web browser. For example:
scp stud0XX@pupilX:preprocess/fastqc/*.html .
Replace stud0XX with your student ID and pupilX with the
compute node you are working on. The files will be copied to your current local
directory.
Look for warnings or failures in categories such as per-base quality and overrepresented sequences.
Ignore these warnings for this exercise:
[FAIL] Per base sequence content [WARNING] Per sequence GC content
Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.
Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?
Q3: For the good run, there is still a remaining issue. What is the cause and solution?
cutadapt
FastQC shows overrepresented TruSeq adapters.
Use cutadapt:
cutadapt -a [adapter sequence] -o [output file] [input file]
Adapter sequence:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
Suggested output file: SRR957868_1_trimmed.fastq.gz
Q4: How many times was this adapter trimmed?
Q5: What would happen if the wrong adapter sequence was used?
Run FastQC again on the trimmed output.
Q6: Do you still find adapter sequences among the “overrepresented sequences”?
Human Illumina Paired-end Reads
These reads come from whole-genome sequencing of a Yoruba individual:
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz
Read 1 is forward; read 2 is reverse.
fastp
fastp is a fast and versatile tool for adapter trimming.
It can also merge overlapping paired-end reads, but here we use it only for trimming.
fastp -Q -L \ --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \ --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \ --out1 [output read1] \ --out2 [output read2] \ --in1 [input read1] \ --in2 [input read2]
Use:
SRR794302_1_trimmed.fastq.gz SRR794302_2_trimmed.fastq.gz
Inspect trimmed reads:
zcat file.fastq.gz | less -S
Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?
Q8: How many sequences were trimmed?
Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?
Metagenomic Illumina Paired-end reads
These reads come from a Pseudomonas aeruginosa metagenomics study:
/home/projects/22126_NGS/exercises/preprocess/ex3/
FastQC
Run FastQC on both forward and reverse reads.
Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?
Trimmomatic
Trimmomatic simultaneously trims adapters and removes low-quality segments — useful before de novo assembly.
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags] [forward.fq] [reverse.fq] [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \ [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \ [more flags]
| file | contains |
|---|---|
| 1P | paired forward reads |
| 1U | unpaired forward reads |
| 2P | paired reverse reads |
| 2U | unpaired reverse reads |
Example command:
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \ PE -threads 1 -phred33 \ /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \ /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \ SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \ SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \ ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \ LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
Q11: Did the reverse-read quality improve?
Q12: Which unpaired file (1U or 2U) do you expect to have more reads? Count lines to check.
Different trimming/merging programs
Here are several commonly used tools available on the servers:
cutadapt # adapter trimming
Trimmomatic # trimming + quality filtering
fastp # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval # trimming + merging overlapping paired-end reads
Let us know if you need an additional tool installed.
Please find answers here.
Congratulations, you finished the exercise!