Data Preprocess exercise: Difference between revisions
Latest revision as of 14:04, 11 December 2025
Overview
First:
- Navigate to your home directory.
- Create a directory called "preprocess".
- Navigate to the directory you just created.
We will pre-process several types of NGS data:
- Escherichia coli single-end Illumina reads
- Human paired-end Illumina reads
- Pseudomonas aeruginosa metagenomic paired-end Illumina reads
Escherichia coli single-end Illumina reads
Introduction
An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.
The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.
The data can be found here:
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz
Leave the data where they are; you do not need to copy them.
Q1: What is the read length?
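One way to inspect read lengths without copying the data is to decompress on the fly and measure every sequence line. The following is a minimal sketch, not part of the original exercise: the function name read_lengths and the toy FASTQ file are hypothetical, and the toy file exists only so that the snippet runs anywhere.

```shell
# Print "length count" pairs for the first N reads of a gzipped FASTQ file.
# Usage: read_lengths FILE [N]   (read_lengths is a hypothetical helper name)
read_lengths() {
    # gzip -dc behaves like zcat; sequence lines are lines 2, 6, 10, ...
    gzip -dc "$1" | head -n $(( ${2:-1000} * 4 )) \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' \
        | sort -n
}

# Toy input (made-up reads) so the sketch is self-contained
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' \
    | gzip > "$tmpdir/toy.fastq.gz"
read_lengths "$tmpdir/toy.fastq.gz"
```

On the exercise data, the same function could be pointed at one of the two paths listed above.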
FastQC
We will use FastQC to assess read quality. First create the directories:
SRR957824
SRR957868
Check the FastQC help:
fastqc --help
Create an output directory:
fastqc/
Run FastQC:
fastqc -o [output directory] [fastq.gz file]
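Both runs can be handled with a small loop. The sketch below only prints the commands (a dry run), so it can be checked before anything is executed. The function name print_fastqc_commands is hypothetical, and the output directory name fastqc is assumed from the step above.

```shell
# Print one FastQC command per sequencing run (dry run).
print_fastqc_commands() {
    data=/home/projects/22126_NGS/exercises/preprocess/ex1
    outdir=fastqc   # assumed to exist already
    for run in SRR957824 SRR957868; do
        # echo makes this a dry run; remove it to actually start FastQC
        echo fastqc -o "$outdir" "$data/${run}_1.fastq.gz"
    done
}

print_fastqc_commands
```

Once the printed commands look right, removing the echo runs FastQC on each file in turn.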
View results:
firefox fastqc/[file prefix]_fastqc.html &
If this is slow, copy the files locally using scp:
scp stud0XX@pupilX:preprocess/fastqc/*.html .
Look for warnings or failures in categories such as per-base quality and overrepresented sequences.
Ignore these warnings for this exercise:
[FAIL] Per base sequence content
[WARNING] Per sequence GC content
Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.
Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?
Q3: For the good run, there is still a remaining issue. What is the cause and solution?
cutadapt
FastQC shows overrepresented TruSeq adapters.
Use cutadapt:
cutadapt -a [adapter sequence] -o [output file] [input file]
Adapter sequence:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
Suggested output file: SRR957868_1_trimmed.fastq.gz
Q4: How many times was this adapter trimmed?
Q5: What would happen if the wrong adapter sequence were used?
Run FastQC again on the trimmed output.
Q6: Do you still find adapter sequences among the “overrepresented sequences”?
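As a rough cross-check next to the FastQC report, the reads that still contain the start of the adapter can be counted directly. This is a crude sketch and not what FastQC does internally: it looks for exact matches only, the function name count_adapter_reads is hypothetical, and the toy reads are made up so that the snippet runs on its own.

```shell
# Count reads whose sequence line contains a given adapter prefix (exact match).
# Usage: count_adapter_reads FILE ADAPTER_PREFIX
count_adapter_reads() {
    gzip -dc "$1" \
        | awk -v a="$2" 'NR % 4 == 2 && index($0, a) { n++ } END { print n + 0 }'
}

adapter_prefix=AGATCGGAAGAGC   # first 13 bases of the adapter sequence given above

# Toy input: r1 ends in the adapter prefix, r2 does not
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGAGATCGGAAGAGC\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' \
    | gzip > "$tmpdir/toy.fastq.gz"
count_adapter_reads "$tmpdir/toy.fastq.gz" "$adapter_prefix"
```

Running it on the input and on the trimmed output gives two numbers to compare.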
Human Illumina paired-end reads
These reads come from whole-genome sequencing of a Yoruba individual:
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz
Read 1 is forward; read 2 is reverse.
fastp
fastp is a fast and versatile tool for adapter trimming.
It can also merge overlapping paired-end reads, but here we use it only for trimming.
fastp -Q -L \
    --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \
    --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \
    --out1 [output read1] \
    --out2 [output read2] \
    --in1 [input read1] \
    --in2 [input read2]
Use the following output file names:
SRR794302_1_trimmed.fastq.gz
SRR794302_2_trimmed.fastq.gz
Inspect trimmed reads:
zcat file.fastq.gz | less -S
Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?
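Scrolling through less works, but the first shortened read can also be located by comparing read lengths before and after trimming. The sketch below assumes that both files list the same reads in the same order. The helper names name_and_length and first_trimmed are hypothetical, and the toy files stand in for the real input and output.

```shell
# Emit "name<TAB>length" for every read in a gzipped FASTQ file.
name_and_length() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 1 { name = $1 } NR % 4 == 2 { print name "\t" length($0) }'
}

# Print name, original length and trimmed length of the first read that changed.
# Usage: first_trimmed ORIGINAL.fastq.gz TRIMMED.fastq.gz
first_trimmed() {
    work=$(mktemp -d)
    name_and_length "$1" > "$work/original.tsv"
    name_and_length "$2" > "$work/trimmed.tsv"
    # Assumption: line i of both tables describes the same read
    paste "$work/original.tsv" "$work/trimmed.tsv" \
        | awk -F '\t' '$2 != $4 { print $1, $2, $4; exit }'
    rm -r "$work"
}

# Toy input: r2 loses four bases in the "trimmed" file
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGTAG\n+\nIIIIIIIIII\n' | gzip > "$tmpdir/in.fastq.gz"
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip > "$tmpdir/out.fastq.gz"
first_trimmed "$tmpdir/in.fastq.gz" "$tmpdir/out.fastq.gz"
```

The read name it prints can then be searched for in less to see what was removed. Both files are decompressed in full, so this takes a while on large inputs.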
Q8: How many sequences were trimmed?
Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?
Metagenomic Illumina paired-end reads
These reads come from a Pseudomonas aeruginosa metagenomics study:
/home/projects/22126_NGS/exercises/preprocess/ex3/
FastQC
Run FastQC on both forward and reverse reads.
Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?
Trimmomatic
Trimmomatic trims adapters and removes low-quality segments in a single run, which is useful before de novo assembly.
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags] \
    [forward.fq] [reverse.fq] \
    [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \
    [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \
    [more flags]
| file | contains |
|---|---|
| 1P | paired forward reads |
| 1U | unpaired forward reads |
| 2P | paired reverse reads |
| 2U | unpaired reverse reads |
Example command:
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \
    PE -threads 1 -phred33 \
    /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \
    /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \
    SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \
    SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \
    ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \
    LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
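The trimming steps at the end of the example command are applied in the order in which they are written. The sketch below restates them one per line with a comment each. The shell variable names are hypothetical and serve only to hang the comments on; the step strings themselves are copied from the command above.

```shell
# Trimming steps from the example command, in the order Trimmomatic applies them
clip="ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10"
#     adapter FASTA : seed mismatches : palindrome clip threshold : simple clip threshold
lead="LEADING:15"            # cut bases with quality below 15 from the start of the read
trail="TRAILING:15"          # cut bases with quality below 15 from the end of the read
window="SLIDINGWINDOW:5:15"  # cut once the mean quality in a 5-base window drops below 15
minlen="MINLEN:50"           # discard reads shorter than 50 bases after trimming

steps="$clip $lead $trail $window $minlen"
echo "$steps"
```

Because MINLEN comes last, it acts on the lengths left after all other steps, and this is what decides whether a read ends up in a P file or a U file.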
Q11: Did the reverse-read quality improve?
Q12: Which unpaired file (1U or 2U) do you expect to have more reads? Count lines to check.
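Each FASTQ record spans four lines, so the number of reads is the line count divided by four. A minimal sketch follows; the function name count_reads is hypothetical, and the toy file is there only so that the snippet runs without the exercise data.

```shell
# Print the number of reads in a gzipped FASTQ file (four lines per read).
# Usage: count_reads FILE
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Toy input with two reads
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip > "$tmpdir/toy.fastq.gz"
count_reads "$tmpdir/toy.fastq.gz"
```

Applied to the 1U and 2U files, it gives the two counts to compare.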
Different trimming/merging programs
Here are several commonly used tools available on the servers:
cutadapt # adapter trimming
Trimmomatic # trimming + quality filtering
fastp # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval # trimming + merging overlapping paired-end reads
Let us know if you need an additional tool installed.
Please find the answers on the Data_Preprocess_exercise_answers page.
Congratulations, you finished the exercise!