Data Preprocess exercise: Difference between revisions

From 22126
(Created page with " <H3>Overview</H3> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "preprocess" <LI>Navigate to the directory you just created. </OL> We will try to pre-process several types of NGS data. # <i>Escherichia coli</i> single-end Illumina reads # <i>Pseudomonas aeruginosa</i> paired-end Illumina reads <HR> <h2><i>Escherichia coli</i> single-end Illumina reads</h2> <h3>Introduction</h3> <p> An outbreak of <i>E. coli</i> has occurred. Peop...")
 
No edit summary
 
Line 1: Line 1:
<H3>Overview</H3>
<H3>Overview</H3>


First:
First:
<OL>
<OL>
<LI>Navigate to your home directory:
<LI>Navigate to your home directory.</LI>
<LI>Create a directory called "preprocess"  
<LI>Create a directory called "preprocess".</LI>
<LI>Navigate to the directory you just created.
<LI>Navigate to the directory you just created.</LI>
</OL>
</OL>


We will try to pre-process several types of NGS data.  
We will try to pre-process several types of NGS data.
# <i>Escherichia coli</i> single-end Illumina reads
<ol>
# <i>Pseudomonas aeruginosa</i> paired-end Illumina reads
  <li><i>Escherichia coli</i> single-end Illumina reads</li>
  <li><i>Pseudomonas aeruginosa</i> paired-end Illumina reads</li>
</ol>


<HR>
<HR>
Line 19: Line 20:
<h3>Introduction</h3>
<h3>Introduction</h3>


<p> An outbreak of <i>E. coli</i> has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is the one responsible for the outbreak. </p>
<p>An outbreak of <i>E. coli</i> has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.</p>


<p>The lab technician has performed two different sequencing runs using an Illumina MiSeq sequencer. The lab technician mentions that one run was ok and the other one had poor quality.</p>
<p>The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.</p>


The data can be found here:
<p>The data can be found here:</p>
<pre>
<pre>
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
Line 29: Line 30:
</pre>
</pre>


Leave the data there, you do not need to copy it.  
Leave the data where they are; you do not need to copy them.


<b> Q1: What is the read length?</b>
<b>Q1: What is the read length?</b>


<h3>FastQC</h3>


<h3>FastQC</h3>
<p>We will use FastQC to assess read quality. First create the directories:</p>


We will use the program "FastQC" to assess the quality of the reads. First, create 2 directories named (have a look at the UNIX notes to remind yourself how to do it):
<pre>
<pre>
SRR957824
SRR957824
Line 42: Line 43:
</pre>
</pre>


<p>Check the FastQC help:</p>
<pre>fastqc --help</pre>


First, type the following to view options:
<p>Create an output directory:</p>
<pre>fastqc/</pre>


<pre>
<p>Run FastQC:</p>
fastqc --help
<pre>fastqc -o [output directory] [fastq.gz file]</pre>
</pre>


I recommend to create a directory called:
<p>View results:</p>
<pre>firefox fastqc/[file prefix]_fastqc.html &</pre>


<pre>
<hr>
fastqc/
</pre>
 
We will use the -o option to redirect the output to a directory. We will use the following command line:
 
<pre>
fastqc -o [output directory] [fastq.gz file]
</pre>
 
Where [output directory] is the name of the output directory in our case: fastqc/. And [fastq.gz file] is the fastq file to analyze. Please run SRR957824_1.fastq.gz and send the output to the SRR957824/ directory you created. Do the same for SRR957868_1.fastq.gz by sending its output to SRR957868/
 
<pre>
firefox fastqc/[file prefix]_fastqc.html &
</pre>
 
<HR>


<b>If this is very slow (it might be if all do this at the same time) you can copy the files to your own computer.</b>
<b>If this is slow</b>, copy the files locally using scp:


To do this all you have to do is start another local session tab in MobaXterm (windows) or Terminal (Linux or Mac) you write (remember to type your password when prompted):
<pre>
<pre>
scp stud0XX@(pupil1 pupil2 pupil3 ):preprocess/fastqc/*.html .
scp stud0XX@pupilX:preprocess/fastqc/*.html .
</pre>
</pre>


<p>Look for warnings or failures in categories such as per-base quality and overrepresented sequences.</p>


The output should look like a series of graphs/tables in different categories (ex: Basic Statistics, Per base sequence quality). You can use the left column to quickly navigate to a category that looks suspicious. Look at each of the categories and see if it reports something that we should look at (green check vs yellow exclamation mark vs red cross).
<p>Ignore these warnings for this exercise:</p>


N.B. For this run, do not worry about the following categories:
<pre>
<pre>
[FAIL]Per base sequence content
[FAIL] Per base sequence content
[WARNING]Per sequence GC content
[WARNING] Per sequence GC content
</pre>
</pre>


The most interesting figure is the first where the quality of each base is plotted as you move from the beginning of the read to the end. It shows the distribution of quality scores in the reads as you move from 5' to 3'. Here you should see the trailing bad qualities typical for Illumina data. It can be a good idea to remove these bad bases as we do not want them to influence our assembly.
<p>Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.</p>
 


Also, look for "Overrepresented sequences" (second last plot), these are often sequencing adapters that are also present at the end of reads. It is important to remove these when are doing a ''de novo'' assembly as these will overlap between the reads and make erroneous junctions between reads. In alignments, they are also troublesome as they will create incorrect mismatches at the end of reads.
<b>Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?</b>
 
 
 
<b> Q2: Which of the 2 sequencing runs (SRR957824 or SRR957868) do you think had poor quality?</b>
 
<b> Q3: For the sequencing run with good quality, there seems to be a remaining issue. What do you think is the cause and the solution?</b>


<b>Q3: For the good run, there is still a remaining issue. What is the cause and solution?</b>


<h3>cutadapt</h3>
<h3>cutadapt</h3>


In "Overrepresented sequences", here you see that there is a sequence that is overrepresented in the reads that matches the TruSeq Adapter.  
<p>FastQC shows overrepresented TruSeq adapters.</p>


We will use cutadapt to remove the lingering adapters:
<p>Use cutadapt:</p>
<pre>
<pre>
cutadapt -a [primer sequence] -o [output file] [input file]  
cutadapt -a [adapter sequence] -o [output file] [input file]
</pre>
</pre>


The sequencing technician gives you the following sequence for the primer:
Adapter sequence:
 
<pre>
<pre>
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG
</pre>
</pre>


Use cutadapt to remove the ring a ring adaptor. I suggest that you call your output file SRR957868_1_trimmed.fastq.gz.
<p>Suggested output file: <code>SRR957868_1_trimmed.fastq.gz</code></p>


<b> Q4: How many times was this adapter trimmed (look at the output produced by the program)?</b>
<b>Q4: How many times was this adapter trimmed?</b>


<b> Q5: What would have happened to this number of trimmed sequences had the signal technician given you the wrong adapter sequence?</b>
<b>Q5: What would happen if the wrong adapter sequence was used?</b>


Run FastQC again on the resulting output file.
<p>Run FastQC again on the trimmed output.</p>


<b> Q6: you still find adapter sequences among the "overrepresented sequences"?</b>
<b>Q6: Do you still find adapter sequences among the “overrepresented sequences”?</b>




<HR>
 
<H2>Human Illumina Paired-end reads</h2>


<p>Let us look at some paired-end Illumina reads, these reads are from whole-genome sequencing of a Yoruba individual:</p>
<h2>Human Illumina Paired-end Reads</h2>


<p>These reads come from whole-genome sequencing of a Yoruba individual:</p>
<pre>
<pre>
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
Line 134: Line 113:
</pre>
</pre>


<p>The "_1.fastq.gz" is the forward read, the "2_fastq.gz" is the reverse read. The problem is that you do not know the adapter sequences.</p>
<p>Read 1 is forward; read 2 is reverse.</p>


<H3>fastp</H3>
<h3>fastp</h3>


<p>We will use fastp, an ultra-fast adapter trimming software, to trim the adapter sequences:</p>
<p><code>fastp</code> is a fast and versatile tool for adapter trimming.
<b>It can also merge overlapping paired-end reads</b>, but here we use it only for trimming.</p>


<pre>
<pre>
fastp -Q -L --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT  --out1 [output read1] --out2 [output read2] --in1 [input read1] --in2 [input read2]
fastp -Q -L \
  --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \
  --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \
   --out1 [output read1] \
  --out2 [output read2] \
  --in1 [input read1] \
  --in2 [input read2]
</pre>
</pre>


<p> -Q and -L disables quality filtering and length filtering. The fastq files do not need to be unzipped, fastq.gz is fine. The file [output read1] contains the trimmed forward reads and [output read2] contains the trimmed reverse reads. You can use "SRR794302_1_trimmed.fastq.gz" as output [output read1] and SRR794302_2_trimmed.fastq.gz as [output read2] . Pay attention to the messages produced by the program as you can see some interesting summary statistics. </p>
<p>Use:</p>
 
 
You can use:
 
<pre>
<pre>
zcat file.fastq.gz |less -S
SRR794302_1_trimmed.fastq.gz
SRR794302_2_trimmed.fastq.gz
</pre>
</pre>


To view a zipped fastq file. Please inspect the resulting trimmed reads.
<p>Inspect trimmed reads:</p>
<pre>zcat file.fastq.gz | less -S</pre>


<b> Q7: Which forward read was the first to be trimmed (write the ID)? hint: it will have a different sequence length. Had we asked the same question for the reverse reads, would you have found a different read or the same? Does it makes sense? </b>
<b>Q7: Which forward read was the first to be trimmed?
Would the reverse read be different? Why?</b>


<b>Q8: How many sequences were trimmed?</b>


<b> Q8: How many sequences have been trimmed? </b>
<b>Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?</b>


<b> Q9: Given an identical number of starting sequences, do you think that you will get more sequences trimmed if the insert size is short or long? why? </b>
<HR>


<H2>Metagenomic Illumina Paired-end reads</h2>
<h2>Metagenomic Illumina Paired-end reads</h2>


<p>Let's try some reads from a study of <i>Pseudomonas aeruginosa</i>, an opportunistic pathogen that can live in the environment and may infect humans. Inside humans, they may live in the lungs and form biofilms. </p>
<p>These reads come from a <i>Pseudomonas aeruginosa</i> metagenomics study:</p>


The reads are found here:
<pre>
<pre>
/home/projects/22126_NGS/exercises/preprocess/ex3/
/home/projects/22126_NGS/exercises/preprocess/ex3/
</pre>
</pre>


<h3>FastQC</h3>
<h3>FastQC</h3>


<p>First let us look at the reads, use FastQC on the forward and reverse reads to identify potential problems. </p>
<p>Run FastQC on both forward and reverse reads.</p>
 
 
<b>Q10. One type of read (forward/reverse) has a particular type of problem compared to the other, which type of read is it and what is the problem?</b>


<b>Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?</b>


<h3>Trimmomatic</h3>
<h3>Trimmomatic</h3>


<p>We will use Trimmomatic the simultaneously trim adaptors and remove sequences with bad quality. This step can be required especially when doing ''de novo'' assembly as they can create incorrect junctions between reads.</p>
<p>Trimmomatic simultaneously trims adapters and removes low-quality segments — useful before <i>de novo</i> assembly.</p>


<p>The basic command line is as such:</p>
<pre>
<pre>
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [some flags]  
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags] \
[forward read fastq]   [reverse read fastq]
  [forward.fq] [reverse.fq] \
   [output prefix]_1P.fastq.gz [output prefix]_1U.fastq.gz [output prefix]_2P.fastq.gz [output prefix]_2U.fastq.gz [some more flags]  
   [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \
  [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \
  [more flags]
</pre>
</pre>


<p>where:</p>
<table class="wikitable">
{| class="wikitable"
<tr><th>file</th><th>contains</th></tr>
| file
<tr><td>1P</td><td>paired forward reads</td></tr>
| contents
<tr><td>1U</td><td>unpaired forward reads</td></tr>
|-
<tr><td>2P</td><td>paired reverse reads</td></tr>
| [output prefix]_1P.fq.gz
<tr><td>2U</td><td>unpaired reverse reads</td></tr>
paired forward reads
</table>
|-
| [output prefix]_1U.fq.gz
| unpaired forward reads (the reverse read failed quality control)
|-
| [output prefix]_2P.fq.gz
| for paired reverse reads
|-
| [output prefix]_2U.fq.gz
unpaired reverse reads (the forward read failed quality control)
|-
|}
 


<p>Because sometimes the reverse forward reads are not of the same average quality, the files ([output prefix]_1U.fq.gz and [output prefix]_2U.fq.gz ) do not necessarily contain the same number of reads. However, the files ([output prefix]_1P.fq.gz and [output prefix]_2P.fq.gz ) should contain the same number of reads. We will use Trimmomatic to trim adapters and remove segments of poor quality. Here is the command line: </p>
<p>Example command:</p>


<pre>
<pre>
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 1 -phred33   /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \
  PE -threads 1 -phred33 \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \
  SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \
  SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \
  ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \
  LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
</pre>
</pre>


<p>The additional options do the following:</p>
<b>Q11: Did the reverse-read quality improve?</b>


{| class="wikitable"
<b>Q12: Which unpaired file (1U or 2U) do you expect to have more reads?   
| option
Count lines to check.</b>
| effect
|-
| ILLUMINACLIP
| The sequence of the adapters, the remaining numbers are sensitivity thresholds (see software manual for the exact definitions)
|- 
| LEADING
| Remove the bases at the 5' end when the QC scores fall below this threshold.
|-
| TRAILING
| Remove the bases at the 3' end when the QC scores fall below this threshold.
|-
| SLIDINGWINDOW
| Remove bases using a sliding window strategy, the first number is the window size, the second number is the quality score.
|-
| MINLEN
| minimum length for a sequence.
|-
|}
 
 
<h3>FastQC</h3>
 
<p>Again let us run FastQC on the forward (SRR8002634_1P.fastq.gz) and reverse reads (SRR8002634_2P.fastq.gz) to verify if the previous problems went away. </p>
 
 
<b>Q11. Did the quality scores improve especially for the reverse reads?</b>
 
<b>Q12. The reads in SRR8002634_1U.fastq.gz are the forward reads for which the paired reverse reads were removed due to poor quality. The reverse reads whose paired forward reads failed quality controls are found in SRR8002634_2U.fastq.gz without counting the number of lines, which file would you think contains more reads? Count the number of reads found in each and check if you prediction was correct</b>
 
<h2>Ancient DNA</h2>
 
A study extracted DNA from dogs, wolves and mammoths. You can find the study [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07229-y here]. However, DNA tends to degrade fast and the sequences can be very short.
 
The raw data for a dog can be found here:
<pre>
/home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex4/ERR4778296_2.fastq.gz
</pre>
 
<H3>leeHom</H3>
 
We will not only trim adapters but also merge overlapping sequences to get an idea of how long our sequences are. We will use leeHom, a specialized program for short sequences, to infer the adapter sequences and remove the adapters:</p>
 
<pre>
leeHom --auto --ancientdna -fq1 [forward read fastq] -fq2 [reverse read fastq]  -fqo [output prefix]
</pre>
 
<p> "--auto" means infer the adapter sequences, "--ancientdna" means merge overlapping pairs. The fastq files do not need to be unzipped, fastq.gz is fine. You can use "ERR4778296_trimmed" as output prefix. Pay attention to the messages produced by the program as you can see some interesting summary statistics. The program should produce the following statistics:</p>
 
<b>Q13. How many reads were left as pairs?</b>
 
 
{| class="wikitable"
| File suffix
| Meaning   
|-
| '''ERR4778296_trimmed.fq.gz'''           
| Sequences that were trimmed                 
|-
| '''ERR4778296_trimmed_r1.fq.gz'''       
| Forward reads that were not trimmed and left as is                     
|-
| '''ERR4778296_trimmed_r2.fq.gz'''       
| Reverse reads that were not trimmed and left as is         
|}
 
 
<p>You also see some files with the word "fail", these sequences generally had some problems for instance the program was not sure if the adapter was present or not. </p>
 
You can see how short the sequences are using '''zcat''' on ERR4778296_trimmed.fq.gz.


<HR>


<h2>Different trimming/merging programs</h2>
<h2>Different trimming/merging programs</h2>


In closing, there are several different programs to remove adapters on the servers:
<p>Here are several commonly used tools available on the servers:</p>
<pre>
cutadapt
</pre>


<pre>
<pre>cutadapt              # adapter trimming</pre>
java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar
<pre>Trimmomatic           # trimming + quality filtering</pre>
</pre>
<pre>fastp                # trimming + (optionally) merging overlapping paired-end reads</pre>
 
<pre>AdapterRemoval       # trimming + merging overlapping paired-end reads</pre>
<pre>
leeHom
</pre>
 
<pre>
AdapterRemoval
</pre>


Let us know if you need an additional one that is not found on this list.
<p>Let us know if you need an additional tool installed.</p>


Please find answers [[Data_Preprocess_exercise_answers|here]]  
<p>Please find answers [[Data_Preprocess_exercise_answers|here]].</p>


'''Congratulations you finished the exercise!'''
<b>Congratulations, you finished the exercise!</b>

Latest revision as of 14:04, 11 December 2025

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called "preprocess".
  3. Navigate to the directory you just created.

We will try to pre-process several types of NGS data.

  1. Escherichia coli single-end Illumina reads
  2. Human paired-end Illumina reads
  3. Pseudomonas aeruginosa paired-end Illumina reads

Escherichia coli single-end Illumina reads

Introduction

An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.

The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.

The data can be found here:

/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz

Leave the data where they are; you do not need to copy them.

Q1: What is the read length?
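One way to approach Q1 is to tabulate the lengths of the sequence lines, since the sequence is the second line of every four-line FASTQ record. The sketch below is only an illustration: it builds a tiny toy file (hypothetical data, not the course reads) so that it runs anywhere. On the server you would point it at one of the files listed above instead.

```shell
# Build a tiny toy FASTQ (hypothetical data) so the sketch runs without the course files.
tmpdir=$(mktemp -d)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGTACGTAC\n+\nIIIIIIIIII\n@read3\nACGTAC\n+\nIIIIII\n' \
  | gzip > "$tmpdir/toy.fastq.gz"

# Keep only the sequence lines (line 2 of each 4-line record), print their
# lengths, and count how many reads have each length.
length_table=$(gzip -dc "$tmpdir/toy.fastq.gz" \
  | awk 'NR % 4 == 2 { print length($0) }' \
  | sort -n | uniq -c | awk '{ print $2 "\t" $1 }')

# Columns: read length, number of reads with that length.
echo "$length_table"
rm -r "$tmpdir"
```

Here `gzip -dc` plays the same role as `zcat`. For a large file, piping through `head` before `awk` keeps the check quick.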

FastQC

We will use FastQC to assess read quality. First create the directories:

SRR957824
SRR957868

Check the FastQC help:

fastqc --help

Create an output directory:

fastqc/

Run FastQC:

fastqc -o [output directory] [fastq.gz file]

View results:

firefox fastqc/[file prefix]_fastqc.html &
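The FastQC steps above could be combined into a small loop, sketched here under some assumptions: the output directory is fastqc/, as in the viewing command, and the input paths are the two files listed in the introduction. The `planned` counter and the "would run" message are illustrative additions. When FastQC or the input files are not available, the loop only prints the command it would have run.

```shell
# Paths come from the exercise text; everything else here is illustrative.
data_dir=/home/projects/22126_NGS/exercises/preprocess/ex1
outdir=fastqc
mkdir -p "$outdir"

planned=0
for run in SRR957824 SRR957868; do
  input="$data_dir/${run}_1.fastq.gz"
  if command -v fastqc >/dev/null 2>&1 && [ -r "$input" ]; then
    # On the course server: run FastQC and write the report into $outdir.
    fastqc -o "$outdir" "$input"
  else
    # Elsewhere: only show what would be run.
    echo "would run: fastqc -o $outdir $input"
  fi
  planned=$((planned + 1))
done
echo "$planned FastQC jobs handled"
```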

If this is slow (it might be if everyone does this at the same time), copy the HTML reports to your own computer by running scp in a local terminal:

scp stud0XX@pupilX:preprocess/fastqc/*.html .

Look for warnings or failures in categories such as per-base quality and overrepresented sequences.

Ignore these warnings for this exercise:

[FAIL] Per base sequence content
[WARNING] Per sequence GC content

Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.

Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?

Q3: For the good run, there is still a remaining issue. What is the cause and solution?

cutadapt

FastQC shows overrepresented TruSeq adapters.

Use cutadapt:

cutadapt -a [adapter sequence] -o [output file] [input file]

Adapter sequence:

AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG

Suggested output file: SRR957868_1_trimmed.fastq.gz
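Before trimming, a rough sanity check is to count how many reads contain the start of the adapter as an exact substring. The sketch below does this on a toy file (hypothetical reads), using the first 12 bases of the adapter given above. It is not how cutadapt counts: cutadapt also accepts partial matches at the 3' end and a limited number of mismatches, so its reported number will generally differ.

```shell
# First 12 bases of the adapter sequence given in the exercise.
adapter_prefix=AGATCGGAAGAG

# Toy FASTQ (hypothetical reads): r1 and r3 contain the adapter start, r2 does not.
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGTAGATCGGAAGAGCACA\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n@r3\nTTTTAGATCGGAAGAGCACACGTC\n+\nIIIIIIIIIIIIIIIIIIIIIIII\n' \
  | gzip > "$tmpdir/toy.fastq.gz"

# Look only at sequence lines (line 2 of each record) and count exact matches.
adapter_hits=$(gzip -dc "$tmpdir/toy.fastq.gz" \
  | awk 'NR % 4 == 2' \
  | grep -c "$adapter_prefix")

echo "reads containing $adapter_prefix: $adapter_hits"
rm -r "$tmpdir"
```

Changing `adapter_prefix` to an unrelated sequence and rerunning gives a feel for what a wrong adapter does to the count.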

Q4: How many times was this adapter trimmed?

Q5: What would have happened to this number of trimmed sequences if you had been given the wrong adapter sequence?

Run FastQC again on the trimmed output.

Q6: Do you still find adapter sequences among the “overrepresented sequences”?



Human Illumina Paired-end Reads

These reads come from whole-genome sequencing of a Yoruba individual:

/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz

The "_1.fastq.gz" file contains the forward reads; the "_2.fastq.gz" file contains the reverse reads.

fastp

fastp is a fast and versatile tool for adapter trimming. It can also merge overlapping paired-end reads, but here we use it only for trimming.

fastp -Q -L \
  --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \
  --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \
  --out1 [output read1] \
  --out2 [output read2] \
  --in1 [input read1] \
  --in2 [input read2]

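The -Q and -L options disable quality filtering and length filtering. The fastq files do not need to be unzipped. Suggested names for [output read1] and [output read2]: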

SRR794302_1_trimmed.fastq.gz
SRR794302_2_trimmed.fastq.gz

Inspect trimmed reads:

zcat file.fastq.gz | less -S

Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?
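For Q7, a trimmed read is shorter than the untrimmed ones, so one option is to scan for the first record whose sequence is below the full read length. The sketch below runs on a toy file (hypothetical reads) and assumes a full length of 10, which is a made-up value; use the read length of the real data when you try it on your trimmed output.

```shell
# Toy trimmed FASTQ (hypothetical data): read3 is shorter than the others.
tmpdir=$(mktemp -d)
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nTTGCATTGCA\n+\nIIIIIIIIII\n@read3\nACGTACG\n+\nIIIIIII\n@read4\nGGCCAAGGCC\n+\nIIIIIIIIII\n' \
  | gzip > "$tmpdir/toy_trimmed.fastq.gz"

# Remember each header line, then report the first sequence shorter than
# the full read length and stop.
first_trimmed=$(gzip -dc "$tmpdir/toy_trimmed.fastq.gz" \
  | awk -v full=10 '
      NR % 4 == 1 { id = $0 }
      NR % 4 == 2 && length($0) < full { print id, length($0); exit }')

echo "$first_trimmed"
rm -r "$tmpdir"
```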

Q8: How many sequences were trimmed?

Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?
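The reasoning behind Q9 can be checked with a little arithmetic. The sketch below uses an idealised model, which is an assumption: "insert size" means the length of the DNA fragment between the adapters, and a read runs into the adapter only when the fragment is shorter than the read. The function name and the numbers (read length 100, inserts of 80 and 300) are hypothetical.

```shell
# Number of adapter bases expected at the 3' end of a read under the
# idealised model: read_length - insert_size, but never below zero.
adapter_bases() {
  read_length=$1
  insert_size=$2
  if [ "$insert_size" -lt "$read_length" ]; then
    echo $(( read_length - insert_size ))
  else
    echo 0
  fi
}

short_insert=$(adapter_bases 100 80)    # fragment shorter than the read
long_insert=$(adapter_bases 100 300)    # fragment longer than the read

echo "insert 80:  $short_insert adapter bases"
echo "insert 300: $long_insert adapter bases"
```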


Metagenomic Illumina Paired-end reads

These reads come from a Pseudomonas aeruginosa metagenomics study:

/home/projects/22126_NGS/exercises/preprocess/ex3/

FastQC

Run FastQC on both forward and reverse reads.

Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?

Trimmomatic

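Trimmomatic simultaneously trims adapters and removes low-quality segments. This is especially useful before de novo assembly, where leftover adapters and poor bases can create incorrect junctions between reads.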

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags] \
  [forward.fq] [reverse.fq] \
  [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \
  [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \
  [more flags]

The output files contain:

file  contains
1P    paired forward reads
1U    unpaired forward reads (the reverse read failed quality control)
2P    paired reverse reads
2U    unpaired reverse reads (the forward read failed quality control)

Example command:

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \
  PE -threads 1 -phred33 \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \
  SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \
  SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \
  ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \
  LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50
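The -phred33 flag in the command above says that quality characters are stored as ASCII codes offset by 33, and thresholds such as LEADING:15 or TRAILING:15 refer to the decoded scores. As a small illustration, the helper below (the name `phred33` is hypothetical) decodes a single quality character by hand.

```shell
# Decode one Phred+33 quality character into its numeric score.
phred33() {
  # printf with a leading single quote prints the character's ASCII code.
  ascii=$(printf '%d' "'$1")
  echo $(( ascii - 33 ))
}

q_I=$(phred33 I)   # 'I' is ASCII 73
q_0=$(phred33 0)   # '0' is ASCII 48

echo "I -> Q$q_I"
echo "0 -> Q$q_0"
```

With a threshold of 15, a base with quality character `0` sits exactly at the cutoff, while characters with lower ASCII codes fall below it.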

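Q11: Run FastQC again on the paired outputs (SRR8002634_1P.fastq.gz and SRR8002634_2P.fastq.gz). Did the quality scores improve, especially for the reverse reads?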

Q12: Without counting first, which unpaired file (1U or 2U) do you expect to contain more reads? Then count the reads in each file to check whether your prediction was correct.
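To check a prediction for Q12, remember that a FASTQ record spans four lines, so the number of reads is the line count divided by four. The sketch below wraps this in a small helper (the name `count_reads` and the toy file names are hypothetical) and runs it on two toy files. On the server you would call it on SRR8002634_1U.fastq.gz and SRR8002634_2U.fastq.gz instead.

```shell
# Count reads in a gzipped FASTQ file: 4 lines per record.
count_reads() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Toy files (hypothetical data): two reads in one, a single read in the other.
tmpdir=$(mktemp -d)
printf '@a\nACGT\n+\nIIII\n@b\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/toy_1U.fastq.gz"
printf '@c\nGGCA\n+\nIIII\n' | gzip > "$tmpdir/toy_2U.fastq.gz"

reads_1U=$(count_reads "$tmpdir/toy_1U.fastq.gz")
reads_2U=$(count_reads "$tmpdir/toy_2U.fastq.gz")

echo "1U: $reads_1U reads, 2U: $reads_2U reads"
rm -r "$tmpdir"
```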


Different trimming/merging programs

Here are several commonly used tools available on the servers:

cutadapt              # adapter trimming
Trimmomatic           # trimming + quality filtering
fastp                 # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval        # trimming + merging overlapping paired-end reads

Let us know if you need an additional tool installed.

Please find answers on the Data_Preprocess_exercise_answers page.

Congratulations, you finished the exercise!