Data Preprocess exercise

From 22126
Revision as of 14:04, 11 December 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called "preprocess".
  3. Navigate to the directory you just created.

We will try to pre-process several types of NGS data.

  1. Escherichia coli single-end Illumina reads
  2. Pseudomonas aeruginosa paired-end Illumina reads

Escherichia coli single-end Illumina reads

Introduction

An outbreak of E. coli has occurred. People have been getting sick after eating salad. A lab has sequenced different sources to try to pinpoint which one is responsible for the outbreak.

The lab technician performed two MiSeq sequencing runs. One run was good; the other had poor quality.

The data can be found here:

/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957824_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex1/SRR957868_1.fastq.gz

Leave the data where they are; you do not need to copy them.

Q1: What is the read length?

FastQC

We will use FastQC to assess read quality. First create the directories:

SRR957824
SRR957868

Check the FastQC help:

fastqc --help

Create an output directory:

fastqc/

Run FastQC:

fastqc -o [output directory] [fastq.gz file]

View results:

firefox fastqc/[file prefix]_fastqc.html &

If this is slow, copy the files locally using scp:

scp stud0XX@pupilX:preprocess/fastqc/*.html .

Look for warnings or failures in categories such as per-base quality and overrepresented sequences.

Ignore these warnings for this exercise:

[FAIL] Per base sequence content
[WARNING] Per sequence GC content

Pay special attention to trailing quality decay and overrepresented adapter sequences, which affect trimming and downstream assembly.

Q2: Which of the two runs (SRR957824 or SRR957868) had poor quality?

Q3: For the good run, there is still a remaining issue. What is the cause and solution?

cutadapt

FastQC shows overrepresented TruSeq adapters.

Use cutadapt:

cutadapt -a [adapter sequence] -o [output file] [input file]

Adapter sequence:

AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATG

Suggested output file: SRR957868_1_trimmed.fastq.gz

Q4: How many times was this adapter trimmed?

Q5: What would happen if the wrong adapter sequence was used?

Run FastQC again on the trimmed output.

Q6: Do you still find adapter sequences among the “overrepresented sequences”?



Human Illumina Paired-end Reads

These reads come from whole-genome sequencing of a Yoruba individual:

/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_1.fastq.gz
/home/projects/22126_NGS/exercises/preprocess/ex2/SRR794302_2.fastq.gz

Read 1 is forward; read 2 is reverse.

fastp

fastp is a fast and versatile tool for adapter trimming. It can also merge overlapping paired-end reads, but here we use it only for trimming.

fastp -Q -L \
  --adapter_sequence AGATCGGAAGAGCACACGTCTGAACTCCAGT \
  --adapter_sequence_r2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT \
  --out1 [output read1] \
  --out2 [output read2] \
  --in1 [input read1] \
  --in2 [input read2]

Use:

SRR794302_1_trimmed.fastq.gz
SRR794302_2_trimmed.fastq.gz

Inspect trimmed reads:

zcat file.fastq.gz | less -S

Q7: Which forward read was the first to be trimmed? Would the reverse read be different? Why?

Q8: How many sequences were trimmed?

Q9: With the same number of starting reads, will short or long insert sizes lead to more trimming? Why?


Metagenomic Illumina Paired-end reads

These reads come from a Pseudomonas aeruginosa metagenomics study:

/home/projects/22126_NGS/exercises/preprocess/ex3/

FastQC

Run FastQC on both forward and reverse reads.

Q10: One read type (forward or reverse) has a special problem. Which, and what is the problem?

Trimmomatic

Trimmomatic simultaneously trims adapters and removes low-quality segments — useful before de novo assembly.

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar PE [flags]
  [forward.fq] [reverse.fq]
  [prefix]_1P.fastq.gz [prefix]_1U.fastq.gz \
  [prefix]_2P.fastq.gz [prefix]_2U.fastq.gz \
  [more flags]
filecontains
1Ppaired forward reads
1Uunpaired forward reads
2Ppaired reverse reads
2Uunpaired reverse reads

Example command:

java -jar /home/ctools/Trimmomatic-0.39/trimmomatic-0.39.jar \
  PE -threads 1 -phred33 \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_1.fastq.gz \
  /home/projects/22126_NGS/exercises/preprocess/ex3/SRR8002634_2.fastq.gz \
  SRR8002634_1P.fastq.gz SRR8002634_1U.fastq.gz \
  SRR8002634_2P.fastq.gz SRR8002634_2U.fastq.gz \
  ILLUMINACLIP:/usr/share/trimmomatic/TruSeq2-PE.fa:2:30:10 \
  LEADING:15 TRAILING:15 SLIDINGWINDOW:5:15 MINLEN:50

Q11: Did the reverse-read quality improve?

Q12: Which unpaired file (1U or 2U) do you expect to have more reads? Count lines to check.


Different trimming/merging programs

Here are several commonly used tools available on the servers:

cutadapt              # adapter trimming
Trimmomatic           # trimming + quality filtering
fastp                 # trimming + (optionally) merging overlapping paired-end reads
AdapterRemoval        # trimming + merging overlapping paired-end reads

Let us know if you need an additional tool installed.

Please find answers here.

Congratulations, you finished the exercise!