Denovo exercise

From 22126
Revision as of 11:09, 26 November 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called denovo.
  3. Navigate to the directory you just created.

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

  1. Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).
  2. Count k-mers and estimate genome size.
  3. Correct reads using Musket.
  4. Determine insert size of paired-end reads.
  5. Run de novo assembly using MEGAHIT.
  6. Calculate assembly statistics.
  7. Plot coverage and length histograms of the assembly.
  8. Evaluate the assembly quality.
  9. Visualize the assembly using Circoletto.
  10. (Bonus) Try assembling the genome with SPAdes.
  11. Compare assemblies from PacBio and Nanopore data.

FastQC and trimming

Make sure you are in the denovo directory you created. You can double-check with:

pwd

Copy the sequencing data:

cp /home/projects/22126_NGS/exercises/denovo/vchol/* .

Run FastQC on the reads:

mkdir fastqc
fastqc -o fastqc *.txt.gz

Open the reports in Firefox:

firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The --basename option defines the output prefix and --gzip compresses the output.

AdapterRemoval \
  --file1 Vchol-001_6_1_sequence.txt.gz \
  --file2 Vchol-001_6_2_sequence.txt.gz \
  --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
  --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
  --qualitybase 64 \
  --basename Vchol-001_6 \
  --gzip \
  --trimqualities \
  --minquality 20 \
  --minlength 40

When it finishes, inspect Vchol-001_6.settings for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes discarded.gz, pair1.truncated.gz, pair2.truncated.gz, and singleton.truncated.gz. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz

Genome size estimation

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.

Manual: jellyfish

gzip -dc Vchol-001_6.pair*.truncated.gz \
  | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

jellyfish histo Vchol-001 > Vchol-001.histo

Start R:

R

Then paste:

dat = read.table("Vchol-001.histo")
barplot(dat[,2],
        xlim=c(0,150),
        ylim=c(0,5e5),
        ylab="No of kmers",
        xlab="Counts of a k-mer",
        names.arg=dat[,1],
        cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)

The plot shows:

  • x-axis: how many times a k-mer occurs (its count)
  • y-axis: number of distinct k-mers with that count

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

N = (M * L) / (L - K + 1)
Genome_size = T / N
  • N = depth (coverage)
  • M = k-mer peak (from the histogram)
  • K = k-mer size (here: 15)
  • L = average read length (from fastx_readlength)
  • T = total number of bases (from fastx_readlength)

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?


Error correction

We will correct errors in the reads using Musket.

Musket: Musket

First, get the number of distinct k-mers (needed for memory allocation in Musket):

jellyfish stats Vchol-001

Use the reported number of distinct k-mers (here an example: 8423098) in the Musket command:

musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
  Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1

The output files are named Vchol-001_6.cor.0 and Vchol-001_6.cor.1. Rename them:

mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz

If this takes too long, you can copy precomputed corrected reads:

cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .

De novo assembly with MEGAHIT

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

export OMP_NUM_THREADS=4

Run MEGAHIT with k=35:

/home/ctools/MEGAHIT-1.2.9/bin/megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list 35 \
  -t 4 \
  -m 2000000000 \
  -o 35

When finished, you should have 35/final.contigs.fa. Compress it:

gzip 35/final.contigs.fa

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq

Index the assembly and map:

bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
  | samtools view -Sb - > Vchol_35bp.bam

Extract insert sizes (TLEN field, column 9):

samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt

Start R:

R

Then paste:

a   = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn  = quantile(a.v, seq(0,1,0.05))[4]
mx  = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])   # mean insert size
sd(a.v[a.v >= mn & a.v <= mx])     # standard deviation

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

Google sheet for k-mer assignment

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing [KMER] below:

export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9/bin/megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list [KMER] \
  -t 4 \
  -m 2000000000 \
  -o [KMER]

gzip [KMER]/final.contigs.fa

Compute assembly statistics using the Assemblathon script:

assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats

Open [KMER]/final.contigs.stats and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?


Coverage of the assembly

We will now calculate per-contig coverage and lengths, and visualize them in R.

zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov

Start R:

R

Then paste:

library(plotrix)
dat = read.table("default_finalt.cov", sep="\t")

par(mfrow=c(1,2))
weighted.hist(w = dat[,2],
              x = dat[,1],
              breaks = seq(0,100,1),
              main = "Weighted coverage",
              xlab = "Contig coverage")
hist(dat[,1],
     xlim   = c(0,100),
     breaks = seq(0,1000,1),
     main   = "Raw coverage",
     xlab   = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)

# Scaffold lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])),
        xlab = "# Scaffold",
        ylab = "Length",
        main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")

View the plots:

evince best.coverage.pdf &
evince scaffold.lengths.pdf &

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?


Assembly evaluation

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: quast

Run QUAST against the V. cholerae reference genome:

python /home/ctools/quast-quast_5.0.2/quast.py \
  default_final.contigs.fa.gz \
  -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa

firefox quast_results/latest/report.html &

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?


Visualization using Circoletto

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

fastx_filterfasta.py --i default_final.contigs.fa.gz --min 500
gunzip default_final.contigs.fa.gz.filtered_500.fa.gz

On your local machine, open a browser and go to:

Circoletto

Open the filtered assembly in a text editor on the server, for example:

gedit default_final.contigs.fa.gz.filtered_500.fa &

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: V. cholerae genome paper


Try to assemble the genome using SPAdes (bonus)

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: SPAdes

Check the help output:

spades.py -h

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST

Assembly of PacBio and Nanopore data

We now look at assemblies from long-read technologies using canu, a long-read assembler derived from the Celera assembler (used in the original human genome project).

Canu: canu

We will assemble E. coli K-12 using either PacBio or Oxford Nanopore reads.

# PacBio
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4

# Oxford Nanopore
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4

These runs can take a long time. You can instead link to the precomputed assemblies:

ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta .
ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .

Count number of contigs and inspect their lengths (from FASTA headers):

grep ">" [input fasta here]

Or, if compressed:

zgrep ">" [zipped input fasta here]

Run QUAST on both assemblies:

python quast.py \
  ecoli_pacbio.contigs.fasta \
  ecoli_nanopore.contigs.fasta \
  -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa

firefox quast_results/latest/report.html &

Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?


Annotation of a prokaryotic genome

We will annotate genes in ecoli_pacbio.contigs.fasta using prodigal.

Prodigal: prodigal

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i [input genome in fasta] \
  -a [output proteins in fasta] \
  -o [output annotations in gff]

GFF format: GFF format description

Next, index the protein FASTA file:

samtools faidx ecoli_pacbio.contigs.aa

Extract the protein sequence for gene ID tig00000001_4582:

samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582

Use BLASTP against the nr database:

BLAST for proteins

Paste the sequence and run BLASTP.

Q14. Which protein (function) does tig00000001_4582 correspond to?


Please find answers here: Denovo_solution


Congratulations, you finished the exercise!