Data Preprocess exercise

From 22126
Revision as of 14:55, 16 December 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called "preprocess".
  3. Navigate to the directory you just created.

We will try to pre-process several types of NGS data.

  1. Escherichia coli single-end Illumina reads
  2. Pseudomonas aeruginosa paired-end Illumina reads

Escherichia coli single-end Illumina reads

Introduction

An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.

The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.

The data can be found here:

/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz

Leave the data where they are; you do not need to copy them.

Q1: What is the read length?

FastQC

We will use FastQC to assess read quality. First create the directories:

SRR957824
SRR957868

Check the FastQC help:

fastqc --help

Create an output directory:

fastqc/

Run FastQC:

fastqc -o [output directory] [fastq.gz file]

View results:

If you are using MobaXterm, you can open the FastQC HTML files directly from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to your local computer and open them in a web browser. For example:

scp stud0XX@pupilX:preprocess/fastqc/*.html .

Replace stud0XX with your student ID and pupilX with the compute node you are working on. The files will be copied to your current local directory.

Look for warnings or failures in categories such as per-base quality and overrepresented sequences.

Ignore these warnings for this exercise:

[FAIL] Per base sequence content
[WARNING] Per sequence GC content

Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.

Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?

Q3: For the good run, there is still a remaining issue. What is the cause and solution?

cutadapt

FastQC shows overrepresented TruSeq adapters.

Use cutadapt:

cutadapt -a [adapter sequence] -o [output file] [input file]

Adapter sequence:

AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG

Suggested output file: SRR957868_1_trimmed.fastq.gz

Q4: How many times was this adapter trimmed?

Q5: What would happen if the wrong adapter sequence was used?

Run FastQC again on the trimmed output.

Q6: Do you still find adapter sequences among the “overrepresented sequences”?



Human Illumina Paired-end Reads

These reads come from whole-genome sequencing of a Yoruba individual:

/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz

Read 1 is forward; read 2 is reverse.

fastp

fastp is a fast and versatile tool for adapter trimming. It can also merge overlapping paired-end reads, but here we use it only for trimming.

fastp -Q -L \
  --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \
  --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \
  --out1 [output read1] \
  --out2 [output read2] \
  --in1 [input read1] \
  --in2 [input read2]

Use:

SRR794302_1_trimmed.fastq.gz
SRR794302_2_trimmed.fastq.gz

Inspect trimmed reads:

zcat file.fastq.gz | less -S

Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?

Q8: How many sequences were trimmed?

Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?


Metagenomic Illumina Paired-end reads

These reads come from a Pseudomonas aeruginosa metagenomics study:

/home/projects/22126_NGS/exercises/preprocess/ex3/

FastQC

Run FastQC on both forward and reverse reads.

Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?

Trimmomatic

Trimmomatic simultaneously trims adapters and removes low-quality segments — useful before de novo assembly.

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags]
  [forward.fq] [reverse.fq]
  [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \
  [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \
  [more flags]
filecontains
1Ppaired forward reads
1Uunpaired forward reads
2Ppaired reverse reads
2Uunpaired reverse reads

Example command:

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \
  PE -threads 1 -phred33 \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \
  SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \
  SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \
  ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \
  LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50

Q11: Did the reverse-read quality improve?

Q12: Which unpaired file (1U or 2U) do you expect to have more reads? Count lines to check.


Different trimming/merging programs

Here are several commonly used tools available on the servers:

cutadapt              # adapter trimming
Trimmomatic           # trimming + quality filtering
fastp                 # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval        # trimming + merging overlapping paired-end reads

Let us know if you need an additional tool installed.

Please find answers here.

Congratulations, you finished the exercise!