Denovo exercise

Overview

First:

Navigate to your home directory:
Create a directory called "denovo"
Navigate to the directory you just created.

In this exercise we will try to perform de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will try to:

Run FastQC, adaptor and quality trimming reads (Optional - repeat of analysis you have already done in data pre-processing)
Count k-mers and estimate genome size
Correct reads using Musket
Determine insert size of paired-end reads
Run de novo assembly using MEGAHIT
Calculate assembly statistics
Plot coverage and length histograms of the assembly
Assembly evaluation
Visualize assembly using Circoletto
Try to assemble the genome using SPAdes (bonus)
Assembly of PacBio and NanoPore

FastQC and trimming

Make sure you are in the "denovo" directory you created, you can double-check with the following command:

pwd

go ahead and copy the sequencing data

cp /home/projects/22126_NGS/exercises/denovo/vchol/* .

Let us start by running fastqc on the reads:

mkdir fastqc
fastqc -o fastqc *.txt.gz

Open the reports in firefox.

firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &

There are quite some issues with this data (but do not spend to much time on studying the report), we should clean it up first. Let us find out what type of encoding the qualities are in:

gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py

Q1. Which format are they encoded in?

Let us trim the reads using AdapterRemoval. In the command below the most frequent adapter/primer sequence is already pasted in - do not worry about the others in our case they are just variations of that one. Also, we use a minimum of 40 nt and trim to quality 20 (~3 mins) and note that we write qualitybase 64 (Q1). The "--basename" option gives the base name of the output files and we want it to be compressed using gzip "--gzip". When it is done take a look at "Vchol-001_6.settings" for some statistics on how many reads were trimmed etc.

AdapterRemoval --file1 Vchol-001_6_1_sequence.txt.gz --file2 Vchol-001_6_2_sequence.txt.gz --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG --qualitybase 64 --basename Vchol-001_6 --gzip --trimqualities --minquality 20 --minlength 40

Q1A. There are several output files (discarded.gz, pair1.truncated.gz, pair2.truncated.gz, singleton.truncated.gz), can you explain which files contains which reads? (tip: look for documentation online)

Ok, now we have our reads trimmed and ready to go. Count stats of the trimmed data such as average read length, min length, max length, no. reads and no. bases. Note down the avg readlength and the no. of bases.

python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz

Genome size estimation

Let's try to count the occurrence of k-mers in the data - a k-mer is simply a string of nucleotides of a certain length. Let's say we count all the 15-mers that are in the read data we have - we do this using a program called jellyfish. Here we tell jellyfish to count 15-mers and also add counts from the complementary strand. Afterward, we tell it to create a histogram that we will plot using R. "/dev/fd/0" in the command below means that it should take input from "STDIN", eg. from the gzip program.

gzip -dc Vchol-001_6.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0
jellyfish histo Vchol-001 > Vchol-001.histo

Then we call R:

and paste the following to generate a histogram:

dat=read.table("Vchol-001.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)

The plot shows how many times a k-mer is found (x-axis) and the number of kmers with this count (y-axis). The bars in the lower end are the k-mers that only occur very few times, these are probably sequencing errors, whereas the k-mers that occur many times are "real" k-mers. We can use the information from the "real" k-mers to correct similar "error k-mers". In this way we can increase the performance of our assembly (this also works for SNP calling as well).

Q2. Where is the peak for the kmers?

Before we perform error correction on the reads, we will try to estimate the genome size of our genome. Of course, we know that it is a V. cholerae, but if we did not know that or it was an unknown species then we could do like this. N = (M*L)/(L-K+1) and Genome_size = T/N, where N: Depth, M: Kmer peak, K: Kmer-size, L: avg readlength, T: Total bases. Try to put in the numbers (some of them you have from "fastx_readlength.py" command) and see what you get. The reference vibrio cholerae genome is around 4Mb, you should get within +/- 10% of this.

Q3. What is the estimated genome size?

Error correction

Let us try to correct the reads using Musket. First we need to know how many k-mers will be in the data, this is rather easy to get because we already counted kmers using jellyfish. Use this command to output the number of "distinct" k-mers in the dataset:

jellyfish stats Vchol-001

This is needed for musket to set up the bloom filters and hash tables for memory consumption. Then we are ready to run the correction.

musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1

The files do not have an ideal name. Finally, we will rename the output files to sensible names.

mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz

If this takes a long time copy my files:

cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .

de novo assembly

Now the reads are ready for de novo assembly. We are going to use MEGAHIT as our assembler, it uses the de Bruijn approach. When running de novo assemblies one should try different k-mer sizes - the k-mer size is used for building the de Bruijn graph and is therefore very important. There is currently no way to estimate what will be the optimal k before running, the best k-size may change between different datasets and may change between different assemblers for the same dataset. By default, MEGAHIT tried different kmer length values. But we can set a fixed kmer length as well

We will try first with a fixed kmer length of 35 bases.

export OMP_NUM_THREADS=4 
/home/ctools/MEGAHIT-1.2.9/bin/megahit -1 Vchol-001_6.pair1.cor.truncated.fq.gz -2 Vchol-001_6.pair2.cor.truncated.fq.gz  --k-list 35  -t 4 -m 2000000000 -o 35

The first command with OMP_NUM_THREADS is to allow more threads to be used. The flags -1 -2 specify the paired-end files. The kmer length is specified with --k-list but we could specify multiple values ex --k-list 35,39. The number of threads is -t 4 and memory is 2G (-m 2000000000). The output directory is 35/ in your current working directory.

Now you should have a file called "35/final.contigs.fa". It is unzipped so let's zip it:

gzip 35/final.contigs.fa

we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let us only map the first 100.000 reads - this should be enough.

zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 |gzip > Vchol_sample_1.fastq.gz
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 |gzip > Vchol_sample_2.fastq.gz

bwa index 35/final.contigs.fa.gz
bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq | samtools view -Sb - > Vchol_35bp.bam

samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt

Then use R:

and paste:

a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
sd(a.v[a.v >= mn & a.v <= mx])         # sd

Q4. What are the mean insert size and standard deviation of our library?

We will now try different values of k. We will all choose a different k-mer to run and then we will compare to find the "best" assembly.Chose one of the K-sizes on the Google sheet and change the "XX" values below to your number. Remember to write your name in the document. Then use the kmer length as [KMER LENGTH HERE] here:

export OMP_NUM_THREADS=4 
/home/ctools/MEGAHIT-1.2.9/bin/megahit -1 Vchol-001_6.pair1.cor.truncated.fq.gz -2 Vchol-001_6.pair2.cor.truncated.fq.gz  --k-list [KMER LENGTH HERE]  -t 4 -m 2000000000 -o [KMER LENGTH HERE]
gzip  [KMER LENGTH HERE]/final.contigs.fa

When it is done, we need to calculate some statistics on it, there is a script from the Assemblathon (a competition for assemblers) that does it

assemblathon_stats.pl [KMER LENGTH HERE]/final.contigs.fa > [KMER LENGTH HERE]/final.contigs.stats

Open the [KMER LENGTH HERE]/final.contigs.stats file and fill in the information in the Google sheet for your respective length of k. We will then compare the results and find the best K-size for the particular dataset.

Copy the best assembly to your folder and use this assembly from now on. You can also use a fairly good one I made previously using default parameters so multiple values of k.

cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .

Q5. How does the N50 of the assembly compare with what we got using fixed kmers?

Q6. How does the longest contig of the assembly compare with what we got using fixed kmers?

Coverage of assembly

Let's calculate the coverage and length for each sequence and plot it as a histogram in R. We also plot the lengths of the scaffolds.

zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov

Then use R:

and write the following:

library(plotrix)
dat=read.table("default_finalt.cov", sep="\t")
par(mfrow=c(1,2))
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage")
hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage")
dev.print("best.coverage.pdf", device=pdf)

# Lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")

evince best.coverage.pdf &
evince scaffold.lengths.pdf &

The left plot shows the length-weighted coverage (of k-mers) of the contigs/scaffolds - this means that long sequences have a larger weight compared to short sequences. The plot on the right side shows a normal histogram of contig coverage. By comparing the two plots we see that the majority of the assembly has a k-mer coverage of around 60-90X, and that the remaining assembly are short sequences. NB: The contig coverage is dependent on the size of the K you chose. Looking at the plot with the lengths you see that the majority of the assembly is in quite long scaffolds.

Q7. Can you think of why some short contigs have much higher coverage compared to the main assembly?

Q8. Can you think of why some short contigs have much lower coverage compared to the main assembly?

Assembly evaluation

First we will use Quast to evaluate the assembly using different metrics, see quast. You can run it with our without a reference genome and we will try to evaluate our assembly vs. the Vibrio cholerae reference genome:

python /home/ctools/quast-quast_5.0.2/quast.py default_final.contigs.fa.gz -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
firefox quast_results/latest/report.html &

Q9. You can see in the report that there are some misassemblies - can we always trust this?.

Visualization using circoletto

Let's try to visualize the assembly. Because we know that the data is from a Vibrio cholerae we can try to compare our assembly with the V.cholerae reference genome. First, let's filter the assembly to a minimum of 500bp.

fastx_filterfasta.py --i default_final.contigs.fa.gz --min 500
gunzip default_final.contigs.fa.gz.filtered_500.fa.gz

We need to unzip to use gedit. Open a browser on your own computer and go to the Circoletto webpage. Circoletto is an easy-to-use web program that builds upon Circos - a program to create amazing plots with genomic (and other) data. Open the filtered assembly ("gedit default_final.contigs.fa.gz.filtered_500.fa &") select all and copy-paste it into the upper box (query fasta). Do the same with the reference genome ("gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &") and copy-paste it into the box just below. In the "output" section click the "ONLY show the best hit per query" and click "submit to Circoletto". If it does not work, feel free to use: /home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png

You should see the two chromosomes of V. cholerae on the left-hand side of the figure (named "gi|...") and then the alignment of our assembly to it. The color of the alignments are the bitscores, the red is the most confident and black, the least confident.

Q10. Do you think that our genome is similar to the reference genome?

Q11. Are there scaffolds/contigs that do not or only partially map to the reference genome?

Q12. What do you think the region on chromosome 2 (the small one) with many small low-confident mappings could be? Hint: See the V. cholerae genome paper and search for "V. cholerae integron island"?

Try to assemble the genome using SPAdes (bonus)

There is a lot of discussions on which assembler is better than others. SPAdes is definitely one of the best ones. SPAdes will run error correction and use multiple k-mers at the same time when it is doing the assembly. Try it out, it is in the bin folder - try to figure out the commands you need to write to run it and compare it with the output from MEGAHIT (hint: you can use Quast for the comparison).

spades.py -h

NB: The assembly takes 45 minutes to run with SPAdes, you can therefore use my assembly, try to calculate assembly stats (assemblathon_stats.pl) and run it in Quast together with the SOAPdenovo assembly. To get my assembly:

ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta .
# you write the code from here

Assembly of PacBio and NanoPore data

For the PacBio and NanoPore data, we are going to use canu. Canu is a further development of the Celera assembler which was one of the assemblers used for the original human genome - here they used Sanger reads which are long and have low error rate. Canu has been changed to run with long reads with errors in them.
Let's try to run it on some E. coli K12 data. Here we only run it with either PacBio or NanoPore, but it can also handle combinations of these and Illumina. The commands are quite slow

# pacbio
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4

# oxford nanopore
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4

It is very likely going to take quite some time so you can copy my files:

ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta .
ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .

Try to count the number of contigs and their length of them (you can see that from the headers):

 
grep ">" [input fasta here]

or

 
zgrep ">" [zipped input fasta here]

python quast.py ecoli_pacbio.contigs.fasta ecoli_nanopore.contigs.fasta  -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa 
firefox quast_results/latest/report.html &

Q13. How good are the assemblies using Nanopore and pacbio?

Annotation of a prokaryotic genome

We will use prodigal to annotate our genes in ecoli_pacbio.contigs.fasta:

/home/ctools/prokka/binaries/linux/prodigal -f gff -i [input genome in fasta] -a [output proteins in fasta] -o [output annotations in gff]

"-f" specifies the output format and "-o" the output file. The output is in gff format for the annotation. The proteins are specified by "-a" and are in FASTA format.

Let's use a gene that was found and identify what it is as an example. First index the FASTA file:

samtools faidx ecoli_pacbio.contigs.aa

Then find the following gene ID "tig00000001_4582":

samtools faidx ecoli_pacbio.contigs.aa   tig00000001_4582

Use BLAST for proteins and determine which protein it is.

Q14. Which protein is it?

Please find answers here

Congratulations you finished the exercise!

Denovo exercise

Contents

Overview

FastQC and trimming

Genome size estimation

Error correction

de novo assembly

Coverage of assembly

Assembly evaluation

Visualization using circoletto

Try to assemble the genome using SPAdes (bonus)

Assembly of PacBio and NanoPore data

Annotation of a prokaryotic genome

Navigation menu

Denovo exercise

Overview

FastQC and trimming

Genome size estimation

Error correction

de novo assembly

Coverage of assembly

Assembly evaluation

Visualization using circoletto

Try to assemble the genome using SPAdes (bonus)

Assembly of PacBio and NanoPore data

Annotation of a prokaryotic genome

Navigation menu

Search