Denovo exercise
Overview
First:
- Navigate to your home directory:
- Create a directory called "denovo"
- Navigate to the directory you just created.
In this exercise we will try to perform de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will try to:
- Run FastQC, adaptor and quality trimming reads (Optional - repeat of analysis you have already done in data pre-processing)
- Count k-mers and estimate genome size
- Correct reads using Musket
- Determine insert size of paired-end reads
- Run de novo assembly using SOAPdenovo
- Calculate assembly statistics
- Plot coverage and length histograms of the assembly
- Assembly evaluation
- Visualize assembly using Circoletto
- Try to assemble the genome using SPAdes
- Assembly of PacBio and NanoPore
FastQC and trimming
Make sure you are in the "denovo" directory you created, you can double-check with the following command:
pwd
go ahead and copy the sequencing data
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
Let us start by running fastqc on the reads:
mkdir fastqc fastqc -o fastqc *.txt.gz
Open the reports in firefox.
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
There are quite some issues with this data (but do not spend to much time on studying the report), we should clean it up first. Let us find out what type of encoding the qualities are in:
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
Q1. Which format are they encoded in?
Let us trim the reads using AdapterRemoval. In the command below the most frequent adapter/primer sequence is already pasted in - do not worry about the others in our case they are just variations of that one. Also, we use a minimum of 40 nt and trim to quality 20 (~3 mins) and note that we write qualitybase 64 (Q1). The "--basename" option gives the base name of the output files and we want it to be compressed using gzip "--gzip". When it is done take a look at "Vchol-001_6.settings" for some statistics on how many reads were trimmed etc.
AdapterRemoval --file1 Vchol-001_6_1_sequence.txt.gz --file2 Vchol-001_6_2_sequence.txt.gz --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG --qualitybase 64 --basename Vchol-001_6 --gzip --trimqualities --minquality 20 --minlength 40
Q1A. There are several output files (discarded.gz, pair1.truncated.gz, pair2.truncated.gz, singleton.truncated.gz), can you explain which files contains which reads?
Ok, now we have our reads trimmed and ready to go. Count stats of the trimmed data such as avg readlength, min length, max length, no. reads and no. bases. Note down the avg readlength and the no. of bases.
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz
Genome size estimation
Let's try to count the occurrence of k-mers in the data - a k-mer is simply a string of nucleotides of a certain length. Let's say we count all the 15-mers that are in the read data we have - we do this using a program called jellyfish. Here we tell jellyfish to count 15-mers and also add counts from the complementary strand. Afterward, we tell it to create a histogram that we will plot using R. "/dev/fd/0" in the command below means that it should take input from "STDIN", eg. from the gzip program.
gzip -dc Vchol-001_6.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0 jellyfish histo Vchol-001 > Vchol-001.histo
Then we call R:
R
and paste the following to generate a histogram:
dat=read.table("Vchol-001.histo") barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8) dev.print("Vchol-001.histo.pdf", device=pdf)
The plot shows how many times a k-mer is found (x-axis) and the number of kmers with this count (y-axis). The bars in the lower end are the k-mers that only occur very few times, these are probably sequencing errors, whereas the k-mers that occur many times are "real" k-mers. We can use the information from the "real" k-mers to correct similar "error k-mers". In this way we can increase the performance of our assembly (this also works for SNP calling as well).
Q2. Where is the peak for the k-mers?
Before we perform error correction on the reads, we will try to estimate the genome size of our genome. Of course, we know that it is a V. cholerae, but if we did not know that or it was an unknown species then we could do like this. N = (M*L)/(L-K+1) and Genome_size = T/N, where N: Depth, M: Kmer peak, K: Kmer-size, L: avg readlength, T: Total bases. Try to put in the numbers (some of them you have from "fastx_readlength.py" command) and see what you get. The reference vibrio cholerae genome is around 4Mb, you should get within +/- 10% of this.
Q3. What is the estimated genome size?
Error correction
Let us try to correct the reads using Musket. First we need to know how many k-mers will be in the data, this is rather easy to get because we already counted kmers using jellyfish. Use this command to output the number of "distinct" k-mers in the dataset:
jellyfish stats Vchol-001
This is needed for musket to setup the bloom filters and hash tables for memory consumption. Then we are ready to run the correction. Finally, we will rename the outputfiles to sensible names.
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1 mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.gz mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.gz
If this takes a long time copy my files:
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.gz .
de novo assembly
Now the reads are ready for de novo assembly. We are going to use SOAPdenovo as the assembler, it uses the de Bruijn approach. When running de novo assemblies one should try different k-mer sizes - the k-mer size is what is being used for building the de Bruijn graph and is therefore very important. There is currently no way to estimate what will be the optimal k before running, the best k-size may change between different datasets and may change between different assemblers for the same dataset.
SOAPdenovo requires a configuration file with different information, a sample file is the Vchol-001.soap.conf. You need to open it and change the path to your file. To open it write "gedit Vchol-001.soap.conf" and then navigate to "q1" and "q2", here paste in the path and filename of your two files. Note that we have written "avg_ins=200" even though we do not know the average insert size. We will run an initial assembly to create contigs and then extract the information and then rerun the full assembly again. When you have changed the paths save the file (Ctrl-S) then start the assembler. We will try first with a k of 35.
SOAPdenovo-127mer pregraph -s Vchol-001.soap.conf -K 35 -p 2 -o initial SOAPdenovo-127mer contig -g initial
Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let us only map the first 100.000 reads - this should be enough.
zcat Vchol-001_6.pair1.cor.truncated.gz | head -n 400000 > Vchol_sample_1.fastq zcat Vchol-001_6.pair2.cor.truncated.gz | head -n 400000 > Vchol_sample_2.fastq bwa index initial.contig bwa mem initial.contig Vchol_sample_1.fastq Vchol_sample_2.fastq | samtools view -Sb - > initial.sample.bam samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt
Then use R:
R
and paste:
a = read.table("initial.insertsizes.txt") a.v = a[a[,1]>0,1] mn = quantile(a.v, seq(0,1,0.05))[4] mx = quantile(a.v, seq(0,1,0.05))[18] mean(a.v[a.v >= mn & a.v <= mx]) # mean sd(a.v[a.v >= mn & a.v <= mx]) # sd
Q4. What are the mean insert size and standard deviation of our library?
Open "Vchol-001.soap.conf" and change the avg_ins to the value you found. Then we are going to rerun the assembly where we also do scaffolding. Because the assembly takes some time to finish, you will all chose a different k-mer to run and then we will compare to find the "best" assembly. Chose one of the K-sizes on the Google sheet and change the "XX" values below to your number. Remember to write your name and save the document.
SOAPdenovo-127mer all -s Vchol-001.soap.conf -K XX -p 2 -o asmKXX
When it is done, we need to calculate some statistics on it, there is a script from the Assemblathon that does it nicely, but first lets remove all sequences shorter than 100
fastx_filterfasta.py --i asmKXX.scafSeq --min 100 assemblathon_stats.pl asmKXX.scafSeq.filtered_100.fa > asmKXX.scafSeq.stats
Open the asmKXX.scafSeq.stats file and fill in the information in the Google sheet for your respective K. We will then compare the results and find the best K-size for the particular dataset.
Copy the best assembly to your folder and use this assembly from now on. You can also use a fairly good one I made previously (I renamed it to Vchol-001.best.fa).
cp /home/projects/22126_NGS/exercises/denovo/best/Vchol-001.best.fa . cp /home/projects/22126_NGS/exercises/denovo/best/Vchol-001.best.stats .
Q5. How does the length (genome size) of the assembly compare with what we estimated using kmers?
Q6. How many contigs were joined together in scaffolds? Why do you think it is a such a low number (Hint: Repeats, PE insert size)?
Coverage of assembly
Lets calculate coverage and length for each sequence and plot it as a histogram in R. We also plot the lengths of the scaffolds.
fastx_soapcov.py --i Vchol-001.best.fa > Vchol-001.best.cov
Then use R:
R
and write the following:
library(plotrix) dat=read.table("Vchol-001.best.cov", sep="\t") par(mfrow=c(1,2)) weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage") hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage") dev.print("best.coverage.pdf", device=pdf) # Lengths par(mfrow=c(1,1)) barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths") dev.print("scaffold.lengths.pdf", device=pdf) q()
evince best.coverage.pdf & evince scaffold.lengths.pdf &
The left plot shows the length-weighted coverage (of k-mers) of the contigs/scaffolds - this means that long sequences has a larger weight compared to short sequences. The plot on the right side shows a normal histogram of contig coverage. By comparing the two plots we see that the majority of the assembly has a k-mer coverage around 20-30X, and that remaining assembly are short sequences. NB: The contig coverage is dependent on the size of the K you chose - larger K gives smaller coverage. Looking at the plot with the lengths you see that the majority of the assembly are in quite long scaffolds.
Q7. Can you think of why some short contigs have much higher coverage compared to main assembly?
Q8. Can you think of why some short contigs have much lower coverage compared to the main assembly?
Assembly evaluation
First we will use Quast to evaluate the assembly using different metrics, see quast. You can run it with our without a reference genome and we will try to evaluate our assembly vs. the Vibrio cholerae reference genome:
python /home/ctools/quast-quast_5.0.2/quast.py Vchol-001.best.fa -R vibrio_cholerae_O1_N16961.fa firefox quast_results/latest/report.html &
Q9. You can see in the report that there are some misassemblies - can we always trust this?.
Visualization using circoletto
Let's try to visualize the assembly. Because we know that the data is from a Vibrio cholerae we can try to compare our assembly with the V.cholerae reference genome. First, let's filter the assembly to a minimum of 500bp.
fastx_filterfasta.py --i Vchol-001.best.fa --min 500
Open a browser on your own computer and go to the Circoletto webpage. Circoletto is an easy-to-use web program that builds upon Circos - a program to create amazing plots with genomic (and other) data. Open the filtered assembly ("gedit Vchol-001.best.fa.filtered_500.fa &") select all and copy-paste it into the upper box (query fasta). Do the same with the reference genome ("gedit vibrio_cholerae_O1_N16961.fa &") and copy-paste it into the box just below. In the "output" section click the "ONLY show the best hit per query" and click "submit to Circoletto". If it doesn't work you can download the assembly and reference that I made.
If everything fails you can see this file.
You should see the two chromosomes of V. cholerae on the left-hand side of the figure (named "gi|...") and then the alignment of our assembly to it. The color of the alignments are the bitscores, the red is the most confident and black, the least confident.
Q10. Do you think that our genome is similar to the reference genome?
Q11. Are there scaffolds/contigs that do not or only partially map to the reference genome?
Q12. What do you think the region on chromosome 2 (the small one) with many small low-confident mappings could be? Hint: See the V. cholerae genome paper and search for "V. cholerae integron island"?
Try to assemble the genome using SPAdes
There is a lot of discussions on which assembler is better than others. SPAdes is definitely one of the best ones. SPAdes will run error correction and use multiple k-mers at the same time when it is doing the assembly. Try it out, it is in the bin folder - try to figure out the commands you need to write to run it and compare it with the output from SOAPdenovo (hint: you can use Quast for the comparison).
spades.py -h
NB: The assembly takes 45 minutes to run with SPAdes, you can therefore use my assembly, try to calculate assembly stats (assemblathon_stats.pl) and run it in Quast together with the SOAPdenovo assembly. To get my assembly:
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta # you write the code from here
Assembly of PacBio and NanoPore data
For the PacBio and NanoPore data, we are going to use canu. Canu is a further development of the Celera assembler which was one of the assemblers used for the original human genome - here they used Sanger reads which are long and have low error rate. Canu has been changed to run with long reads with errors in them.
Let's try to run it on some E. coli K12 data. Here we only run it with either PacBio or NanoPore, but it can also handle combinations of these and Illumina. The commands are quite slow
# pacbio ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz . canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4 # oxford nanopore ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz . canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
It is very likely going to take quite some time so you can copy my files:
cp /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta . cp /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .
Try to count the number of contigs and their length of them (you can see that from the headers):
grep ">" ecoli_nanopore.contigs.fasta grep ">" ecoli_pacbio.contigs.fasta
python quast.py ecoli_pacbio.contigs.fasta ecoli_nanopore.contigs.fasta -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa firefox quast_results/latest/report.html &
Q13. How good are the assemblies using Nanopore and pacbio?
Please find answers here
Congratulations you finished the exercise!