Denovo exercise
Overview
First:
- Navigate to your home directory.
- Create a directory called
denovo. - Navigate to the directory you just created.
In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:
- Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).
- Count k-mers and estimate genome size.
- Correct reads using Musket.
- Determine insert size of paired-end reads.
- Run de novo assembly using MEGAHIT.
- Calculate assembly statistics.
- Plot coverage and length histograms of the assembly.
- Evaluate the assembly quality.
- Visualize the assembly using Circoletto.
- (Bonus) Try assembling the genome with SPAdes.
- Compare assemblies from PacBio and Nanopore data.
FastQC and trimming
Make sure you are in the denovo directory you created. You can double-check with:
pwd
Copy the sequencing data:
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
Run FastQC on the reads:
mkdir fastqc fastqc -o fastqc *.txt.gz
Open the reports in Firefox:
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
Q1. Which quality encoding format is used?
Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The --basename option defines the output prefix and --gzip compresses the output.
AdapterRemoval \ --file1 Vchol-001_6_1_sequence.txt.gz \ --file2 Vchol-001_6_2_sequence.txt.gz \ --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \ --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \ --qualitybase 64 \ --basename Vchol-001_6 \ --gzip \ --trimqualities \ --minquality 20 \ --minlength 40
When it finishes, inspect Vchol-001_6.settings for trimming statistics (how many reads were trimmed, discarded, etc.).
Q1A. The output includes discarded.gz, pair1.truncated.gz, pair2.truncated.gz, and singleton.truncated.gz. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)
Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz
Genome size estimation
We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.
Manual: jellyfish
gzip -dc Vchol-001_6.pair*.truncated.gz \ | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0 jellyfish histo Vchol-001 > Vchol-001.histo
Start R:
R
Then paste:
dat = read.table("Vchol-001.histo")
barplot(dat[,2],
xlim=c(0,150),
ylim=c(0,5e5),
ylab="No of kmers",
xlab="Counts of a k-mer",
names.arg=dat[,1],
cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)
The plot shows:
- x-axis: how many times a k-mer occurs (its count)
- y-axis: number of distinct k-mers with that count
K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.
Q2. Where is the k-mer coverage peak (approximately)?
We can estimate genome size using:
N = (M * L) / (L - K + 1) Genome_size = T / N
- N = depth (coverage)
- M = k-mer peak (from the histogram)
- K = k-mer size (here: 15)
- L = average read length (from fastx_readlength)
- T = total number of bases (from fastx_readlength)
Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.
Q3. What is your estimated genome size?
Error correction
We will correct errors in the reads using Musket.
Musket: Musket
First, get the number of distinct k-mers (needed for memory allocation in Musket):
jellyfish stats Vchol-001
Use the reported number of distinct k-mers (here an example: 8423098) in the Musket command:
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \ Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
The output files are named Vchol-001_6.cor.0 and Vchol-001_6.cor.1. Rename them:
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
If this takes too long, you can copy precomputed corrected reads:
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
De novo assembly with MEGAHIT
We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.
First, set the number of threads:
export OMP_NUM_THREADS=4
Run MEGAHIT with k=35:
/home/ctools/MEGAHIT-1.2.9/bin/megahit \ -1 Vchol-001_6.pair1.cor.truncated.fq.gz \ -2 Vchol-001_6.pair2.cor.truncated.fq.gz \ --k-list 35 \ -t 4 \ -m 2000000000 \ -o 35
When finished, you should have 35/final.contigs.fa. Compress it:
gzip 35/final.contigs.fa
To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
Index the assembly and map:
bwa index 35/final.contigs.fa.gz bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \ | samtools view -Sb - > Vchol_35bp.bam
Extract insert sizes (TLEN field, column 9):
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
Start R:
R
Then paste:
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
Q4. What are the mean insert size and standard deviation of the library?
Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:
Google sheet for k-mer assignment
Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing [KMER] below:
export OMP_NUM_THREADS=4 /home/ctools/MEGAHIT-1.2.9/bin/megahit \ -1 Vchol-001_6.pair1.cor.truncated.fq.gz \ -2 Vchol-001_6.pair2.cor.truncated.fq.gz \ --k-list [KMER] \ -t 4 \ -m 2000000000 \ -o [KMER] gzip [KMER]/final.contigs.fa
Compute assembly statistics using the Assemblathon script:
assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats
Open [KMER]/final.contigs.stats and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.
Copy the best assembly to your folder, or use a precomputed multi-k assembly:
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz . cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?
Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?
Coverage of the assembly
We will now calculate per-contig coverage and lengths, and visualize them in R.
zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov
Start R:
R
Then paste:
library(plotrix)
dat = read.table("default_finalt.cov", sep="\t")
par(mfrow=c(1,2))
weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0,100,1),
main = "Weighted coverage",
xlab = "Contig coverage")
hist(dat[,1],
xlim = c(0,100),
breaks = seq(0,1000,1),
main = "Raw coverage",
xlab = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)
# Scaffold lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")
View the plots:
evince best.coverage.pdf & evince scaffold.lengths.pdf &
The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.
Q7. Why might some short contigs have much higher coverage than the main assembly?
Q8. Why might some short contigs have much lower coverage than the main assembly?
Assembly evaluation
We will use QUAST to evaluate the assembly using various reference-based metrics.
QUAST: quast
Run QUAST against the V. cholerae reference genome:
python /home/ctools/quast-quast_5.0.2/quast.py \ default_final.contigs.fa.gz \ -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa firefox quast_results/latest/report.html &
Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?
Visualization using Circoletto
We will visualize the assembly against the V. cholerae reference using Circoletto.
First, filter out contigs shorter than 500 bp:
fastx_filterfasta.py --i default_final.contigs.fa.gz --min 500 gunzip default_final.contigs.fa.gz.filtered_500.fa.gz
On your local machine, open a browser and go to:
Open the filtered assembly in a text editor on the server, for example:
gedit default_final.contigs.fa.gz.filtered_500.fa &
Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.
Then open the reference genome:
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
Copy–paste this into the “Subject fasta” box.
In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.
If Circoletto does not work, you can use this precomputed image:
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).
Q10. Does your assembled genome appear broadly similar to the reference genome?
Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?
Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: V. cholerae genome paper
Try to assemble the genome using SPAdes (bonus)
Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.
SPAdes: SPAdes
Check the help output:
spades.py -h
Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.
Link to the SPAdes assembly:
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta # from here you can compute stats and run QUAST
Assembly of PacBio and Nanopore data
We now look at assemblies from long-read technologies using canu, a long-read assembler derived from the Celera assembler (used in the original human genome project).
Canu: canu
We will assemble E. coli K-12 using either PacBio or Oxford Nanopore reads.
# PacBio ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz . canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4 # Oxford Nanopore ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz . canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
These runs can take a long time. You can instead link to the precomputed assemblies:
ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta . ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .
Count number of contigs and inspect their lengths (from FASTA headers):
grep ">" [input fasta here]
Or, if compressed:
zgrep ">" [zipped input fasta here]
Run QUAST on both assemblies:
python quast.py \ ecoli_pacbio.contigs.fasta \ ecoli_nanopore.contigs.fasta \ -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa firefox quast_results/latest/report.html &
Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?
Annotation of a prokaryotic genome
We will annotate genes in ecoli_pacbio.contigs.fasta using prodigal.
Prodigal: prodigal
The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:
/home/ctools/prokka/binaries/linux/prodigal \ -f gff \ -i [input genome in fasta] \ -a [output proteins in fasta] \ -o [output annotations in gff]
GFF format: GFF format description
Next, index the protein FASTA file:
samtools faidx ecoli_pacbio.contigs.aa
Extract the protein sequence for gene ID tig00000001_4582:
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
Use BLASTP against the nr database:
Paste the sequence and run BLASTP.
Q14. Which protein (function) does tig00000001_4582 correspond to?
Please find answers here: Denovo_solution
Congratulations, you finished the exercise!