Denovo exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 1: Line 1:
<H2>Overview</H2>
<h2>Overview</h2>


<p>First:</p>
<ol>
  <li>Navigate to your home directory.</li>
  <li>Create a directory called <code>denovo</code>.</li>
  <li>Navigate to the directory you just created.</li>
</ol>


First:
<p>In this exercise we will perform a <b>de novo assembly</b> of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will:</p>
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "denovo"
<LI>Navigate to the directory you just created.
</OL>


<ol>
  <li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
  <li>Count k-mers and estimate genome size.</li>
  <li>Correct reads using Musket.</li>
  <li>Determine insert size of paired-end reads.</li>
  <li>Run de novo assembly using MEGAHIT.</li>
  <li>Calculate assembly statistics.</li>
  <li>Plot coverage and length histograms of the assembly.</li>
  <li>Evaluate the assembly quality.</li>
  <li>Visualize the assembly using Circoletto.</li>
  <li>(Bonus) Try assembling the genome with SPAdes.</li>
  <li>Compare assemblies from PacBio and Nanopore data.</li>
</ol>


<p>In this exercise we will try to perform de novo assembly of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will try to:
<hr>
<OL>
  <LI>Run FastQC, adaptor and quality trimming reads (Optional - repeat of analysis you have already done in data pre-processing)
  <LI>Count k-mers and estimate genome size
  <LI>Correct reads using Musket
  <LI>Determine insert size of paired-end reads
  <LI>Run de novo assembly using MEGAHIT
  <LI>Calculate assembly statistics
  <LI>Plot coverage and length histograms of the assembly
  <LI>Assembly evaluation
  <LI>Visualize assembly using Circoletto
  <LI>Try to assemble the genome using SPAdes (bonus)
  <LI>Assembly of PacBio and NanoPore
</OL>


<HR>
<h3>FastQC and trimming</h3>


<H3>FastQC and trimming</H3>
<p>Make sure you are in the <code>denovo</code> directory you created. You can double-check with:</p>


Make sure you are in the "denovo" directory you created, you can double-check with the following command:
<pre>
<pre>
pwd
pwd
</pre>
</pre>


go ahead and copy the sequencing data</p>
<p>Copy the sequencing data:</p>


<pre>
<pre>
Line 40: Line 40:
</pre>
</pre>


<p>Let us start by running fastqc on the reads:</p>
<p>Run FastQC on the reads:</p>
 
<pre>
<pre>
mkdir fastqc
mkdir fastqc
Line 46: Line 47:
</pre>
</pre>


<p>Open the reports in firefox.</p>
<p>Open the reports in Firefox:</p>
 
<pre>
<pre>
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
</pre>
</pre>


<p>There are quite some issues with this data (but do not spend to much time on studying the report), we should clean it up first. Let us find out what type of encoding the qualities are in:</p>
<p>There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:</p>
 
<pre>
<pre>
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
</pre>
</pre>
<b>Q1. Which format are they encoded in?</b>


<p>Let us trim the reads using AdapterRemoval. In the command below the most frequent adapter/primer sequence is already pasted in - do not worry about the others in our case they are just variations of that one. Also, we use a minimum of 40 nt and trim to quality 20 (~3 mins) and note that we write qualitybase 64 (Q1). The "--basename" option gives the base name of the output files and we want it to be compressed using gzip "--gzip". When it is done take a look at "Vchol-001_6.settings" for some statistics on how many reads were trimmed etc.</p>
<p><b>Q1. Which quality encoding format is used?</b></p>
 
<p>Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.</p>


<pre>
<pre>
AdapterRemoval --file1 Vchol-001_6_1_sequence.txt.gz --file2 Vchol-001_6_2_sequence.txt.gz --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG --qualitybase 64 --basename Vchol-001_6 --gzip --trimqualities --minquality 20 --minlength 40
AdapterRemoval \
  --file1 Vchol-001_6_1_sequence.txt.gz \
  --file2 Vchol-001_6_2_sequence.txt.gz \
  --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
  --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
  --qualitybase 64 \
  --basename Vchol-001_6 \
  --gzip \
  --trimqualities \
  --minquality 20 \
  --minlength 40
</pre>
</pre>


<b>Q1A. There are several output files (discarded.gz, pair1.truncated.gz, pair2.truncated.gz, singleton.truncated.gz), can you explain which files contains which reads? (tip: look for documentation online)</b>
<p>When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).</p>
 
<p><b>Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)</b></p>
 
<p>Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:</p>


<p>Ok, now we have our reads trimmed and ready to go. Count stats of the trimmed data such as average read length, min length, max length, no. reads and no. bases. Note down the avg readlength and the no. of bases.</p>
<pre>
<pre>
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
Line 71: Line 88:
</pre>
</pre>


<HR>
<hr>


<H3>Genome size estimation</H3>
<h3>Genome size estimation</h3>


<p> Let's try to count the occurrence of k-mers in the data - a k-mer is simply a string of nucleotides of a certain length. Let's say we count all the 15-mers that are in the read data we have - we do this using a program called [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]. Here we tell jellyfish to count 15-mers and also add counts from the complementary strand. Afterward, we tell it to create a histogram that we will plot using R. "/dev/fd/0" in the command below means that it should take input from "STDIN", eg. from the gzip program.</p>
<p>We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use <b>jellyfish</b> to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.</p>
 
<p>Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]</p>


<pre>
<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0
gzip -dc Vchol-001_6.pair*.truncated.gz \
  | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0
 
jellyfish histo Vchol-001 > Vchol-001.histo
jellyfish histo Vchol-001 > Vchol-001.histo
</pre>
</pre>


Then we call R:
<p>Start R:</p>
 
<pre>
<pre>
R
R
</pre>
</pre>


and paste the following to generate a histogram:
<p>Then paste:</p>


<pre>
<pre>
dat=read.table("Vchol-001.histo")
dat = read.table("Vchol-001.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
barplot(dat[,2],
        xlim=c(0,150),
        ylim=c(0,5e5),
        ylab="No of kmers",
        xlab="Counts of a k-mer",
        names.arg=dat[,1],
        cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)
dev.print("Vchol-001.histo.pdf", device=pdf)
</pre>
</pre>


<p>The plot shows how many times a k-mer is found (x-axis) and the number of kmers with this count (y-axis). The bars in the lower end are the k-mers that only occur very few times, these are probably sequencing errors, whereas the k-mers that occur many times are "real" k-mers. We can use the information from the "real" k-mers to correct similar "error k-mers". In this way we can increase the performance of our assembly (this also works for SNP calling as well).</p>
<p>The plot shows:</p>
<ul>
  <li>x-axis: how many times a k-mer occurs (its count)</li>
  <li>y-axis: number of distinct k-mers with that count</li>
</ul>


<b>Q2. Where is the peak for the kmers?</b>
<p>K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.</p>


<p>Before we perform error correction on the reads, we will try to estimate the genome size of our genome. Of course, we know that it is a V. cholerae, but if we did not know that or it was an unknown species then we could do like this. N = (M*L)/(L-K+1) and Genome_size = T/N, where N: Depth, M: Kmer peak, K: Kmer-size, L: avg readlength, T: Total bases. Try to put in the numbers (some of them you have from "fastx_readlength.py" command) and see what you get. The reference vibrio cholerae genome is around 4Mb, you should get within +/- 10% of this.</p>
<p><b>Q2. Where is the k-mer coverage peak (approximately)?</b></p>
<b>Q3. What is the estimated genome size?</b>
 
<p>We can estimate genome size using:</p>
 
<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>
 
<ul>
  <li><b>N</b> = depth (coverage)</li>
  <li><b>M</b> = k-mer peak (from the histogram)</li>
  <li><b>K</b> = k-mer size (here: 15)</li>
  <li><b>L</b> = average read length (from fastx_readlength)</li>
  <li><b>T</b> = total number of bases (from fastx_readlength)</li>
</ul>
 
<p>Compute the estimated genome size and compare with the known <i>V. cholerae</i> genome (~4 Mb). You should be within roughly ±10%.</p>
 
<p><b>Q3. What is your estimated genome size?</b></p>


<hr>
<hr>


<H3>Error correction</H3>
<h3>Error correction</h3>


<p>Let us try to correct the reads using [http://musket.sourceforge.net/homepage.htm Musket]. First we need to know how many k-mers will be in the data, this is rather easy to get because we already counted kmers using jellyfish. Use this command to output the number of "distinct" k-mers in the dataset:</p>
<p>We will correct errors in the reads using <b>Musket</b>.</p>
 
<p>Musket: [http://musket.sourceforge.net/homepage.htm Musket]</p>
 
<p>First, get the number of distinct k-mers (needed for memory allocation in Musket):</p>


<pre>
<pre>
Line 112: Line 166:
</pre>
</pre>


<p>This is needed for musket to set up the bloom filters and hash tables for memory consumption. Then we are ready to run the correction. </p>
<p>Use the reported number of distinct k-mers (here an example: <code>8423098</code>) in the Musket command:</p>


<pre>
<pre>
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
  Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>
</pre>


<p>The files do not have an ideal name. Finally, we will rename the output files to sensible names.</p>
<p>The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:</p>


<pre>
<pre>
Line 125: Line 180:
</pre>
</pre>


<p><b>If this takes a long time copy my files:</b></p>
<p><b>If this takes too long, you can copy precomputed corrected reads:</b></p>


<pre>
<pre>
Line 131: Line 186:
</pre>
</pre>


<HR>
<hr>
 
<h3><i>De novo</i> assembly with MEGAHIT</h3>


<H3><i>de novo</i> assembly</H3>
<p>We will now assemble the corrected reads using <b>MEGAHIT</b> (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.</p>


<p>Now the reads are ready for <i>de novo</i> assembly. We are going to use MEGAHIT as our assembler, it uses the de Bruijn approach. When running ''de novo'' assemblies one should try different k-mer sizes - the k-mer size is used for building the de Bruijn graph and is therefore very important. There is currently no way to estimate what will be the optimal k before running, the best k-size may change between different datasets and may change between different assemblers for the same dataset. By default, MEGAHIT tried different kmer length values. But we can set a fixed kmer length as well</p>
<p>First, set the number of threads:</p>
<BR>
 
<pre>
export OMP_NUM_THREADS=4
</pre>


<p>We will try first with a fixed kmer length of <b>35 bases</b>. </p>
<p>Run MEGAHIT with k=35:</p>


<pre>
<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9/bin/megahit \
/home/ctools/MEGAHIT-1.2.9/bin/megahit -1 Vchol-001_6.pair1.cor.truncated.fq.gz -2 Vchol-001_6.pair2.cor.truncated.fq.gz --k-list 35 -t 4 -m 2000000000 -o 35
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list 35 \
  -t 4 \
  -m 2000000000 \
  -o 35
</pre>
</pre>


<p>The first command with OMP_NUM_THREADS is to allow more threads to be used. The flags -1 -2 specify the paired-end files. The kmer length is specified with --k-list but we could specify multiple values ex --k-list 35,39. The number of threads is -t 4 and memory is 2G (-m 2000000000). The output directory is 35/ in your current working directory. </p>
<p>When finished, you should have <code>35/final.contigs.fa</code>. Compress it:</p>
 
<p>Now you should have a file called "35/final.contigs.fa". It is unzipped so let's zip it:</p>


<pre>
<pre>
Line 153: Line 216:
</pre>
</pre>


<p> we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let us only map the first 100.000 reads - this should be enough.</p>
<p>To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):</p>


<pre>
<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 |gzip > Vchol_sample_1.fastq.gz
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 |gzip > Vchol_sample_2.fastq.gz
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>


<p>Index the assembly and map:</p>
<pre>
bwa index 35/final.contigs.fa.gz
bwa index 35/final.contigs.fa.gz
bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq | samtools view -Sb - > Vchol_35bp.bam


bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
  | samtools view -Sb - > Vchol_35bp.bam
</pre>
<p>Extract insert sizes (TLEN field, column 9):</p>
<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>


</pre>
<p>Start R:</p>


Then use R:
<pre>
<pre>
R
R
</pre>
</pre>


and paste:
<p>Then paste:</p>
 
<pre>
<pre>
a = read.table("initial.insertsizes.txt")
a   = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
a.v = a[a[,1] > 0, 1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
mean(a.v[a.v >= mn & a.v <= mx])   # mean insert size
sd(a.v[a.v >= mn & a.v <= mx])         # sd
sd(a.v[a.v >= mn & a.v <= mx])     # standard deviation
</pre>
</pre>


<b>Q4. What are the mean insert size and standard deviation of our library?</b>
<p><b>Q4. What are the mean insert size and standard deviation of the library?</b></p>


<p>Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:</p>


<p>[https://docs.google.com/spreadsheets/d/1DjfbriqIEuCq8FMRLlt69ITpgm2NKcmZzun11-T43HQ/edit?usp=sharing Google sheet for k-mer assignment]</p>


<p>We will now try different values of k. We will all choose a different k-mer to run and then we will compare to find the "best" assembly.Chose one of the K-sizes on the [https://docs.google.com/spreadsheets/d/1DjfbriqIEuCq8FMRLlt69ITpgm2NKcmZzun11-T43HQ/edit?usp=sharing Google sheet] and change the "XX" values below to your number. Remember to write your name in the document. Then use the kmer length as [KMER LENGTH HERE] here:</p>
<p>Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:</p>


<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9/bin/megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list [KMER] \
  -t 4 \
  -m 2000000000 \
  -o [KMER]


<pre>
gzip [KMER]/final.contigs.fa
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9/bin/megahit -1 Vchol-001_6.pair1.cor.truncated.fq.gz -2 Vchol-001_6.pair2.cor.truncated.fq.gz  --k-list [KMER LENGTH HERE]  -t 4 -m 2000000000 -o [KMER LENGTH HERE]
gzip [KMER LENGTH HERE]/final.contigs.fa
</pre>
</pre>


<p>When it is done, we need to calculate some statistics on it, there is a script from the Assemblathon (a competition for assemblers) that does it </p>
<p>Compute assembly statistics using the Assemblathon script:</p>
 
<pre>
<pre>
assemblathon_stats.pl [KMER LENGTH HERE]/final.contigs.fa > [KMER LENGTH HERE]/final.contigs.stats
assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats
</pre>
</pre>


<p>Open the [KMER LENGTH HERE]/final.contigs.stats file and fill in the information in the Google sheet for your respective length of k. We will then compare the results and find the best K-size for the particular dataset.</p>
<p>Open <code>[KMER]/final.contigs.stats</code> and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.</p>
 
<p>Copy the best assembly to your folder, or use a precomputed multi-k assembly:</p>


<p>Copy the best assembly to your folder and use this assembly from now on. You can also use a fairly good one I made previously using default parameters so multiple values of k.</p>
<pre>
<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
Line 207: Line 291:
</pre>
</pre>


<b>Q5. How does the N50 of the assembly compare with what we got using fixed kmers?</b>
<p><b>Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?</b></p>


<br>
<p><b>Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?</b></p>


<hr>


<b>Q6. How does the longest contig of the assembly compare with what we got using fixed kmers?</b>
<h3>Coverage of the assembly</h3>


<HR>
<p>We will now calculate per-contig coverage and lengths, and visualize them in R.</p>


<H3>Coverage of assembly</H3>
<p> Let's calculate the coverage and length for each sequence and plot it as a histogram in R. We also plot the lengths of the scaffolds.</p>
<pre>
<pre>
zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov
zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov
</pre>
</pre>


Then use R:
<p>Start R:</p>
 
<pre>
<pre>
R
R
</pre>
</pre>


and write the following:
<p>Then paste:</p>
 
<pre>
<pre>
library(plotrix)
library(plotrix)
dat=read.table("default_finalt.cov", sep="\t")
dat = read.table("default_finalt.cov", sep="\t")
 
par(mfrow=c(1,2))
par(mfrow=c(1,2))
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage")
weighted.hist(w = dat[,2],
hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage")
              x = dat[,1],
              breaks = seq(0,100,1),
              main = "Weighted coverage",
              xlab = "Contig coverage")
hist(dat[,1],
    xlim   = c(0,100),
    breaks = seq(0,1000,1),
    main   = "Raw coverage",
    xlab   = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)
dev.print("best.coverage.pdf", device=pdf)


# Lengths
# Scaffold lengths
par(mfrow=c(1,1))
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths")
barplot(rev(sort(dat[,2])),
        xlab = "# Scaffold",
        ylab = "Length",
        main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")
q(save="no")
</pre>
</pre>
<p>View the plots:</p>


<pre>
<pre>
Line 249: Line 347:
</pre>
</pre>


<p>The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.</p>


<p>The left plot shows the length-weighted coverage (of k-mers) of the contigs/scaffolds - this means that long sequences have a larger weight compared to short sequences. The plot on the right side shows a normal histogram of contig coverage. By comparing the two plots we see that the majority of the assembly has a k-mer coverage of around 60-90X, and that the remaining assembly are short sequences. NB: The contig coverage is dependent on the size of the K you chose. Looking at the plot with the lengths you see that the majority of the assembly is in quite long scaffolds.</p>
<p><b>Q7. Why might some short contigs have much <i>higher</i> coverage than the main assembly?</b></p>


<b>Q7. Can you think of why some short contigs have much higher coverage compared to the main assembly?</b>
<p><b>Q8. Why might some short contigs have much <i>lower</i> coverage than the main assembly?</b></p>


<br>
<hr>


<b>Q8. Can you think of why some short contigs have much lower coverage compared to the main assembly?</b>
<h3>Assembly evaluation</h3>


<br>
<p>We will use <b>QUAST</b> to evaluate the assembly using various reference-based metrics.</p>
<HR>


<br>
<p>QUAST: [http://bioinf.spbau.ru/quast quast]</p>


<H3>Assembly evaluation</H3>
<p>Run QUAST against the <i>V. cholerae</i> reference genome:</p>


<p>First we will use Quast to evaluate the assembly using different metrics, see [http://bioinf.spbau.ru/quast quast]. You can run it with our without a reference genome and we will try to evaluate our assembly vs. the Vibrio cholerae reference genome:</p>
<pre>
python /home/ctools/quast-quast_5.0.2/quast.py \
  default_final.contigs.fa.gz \
  -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa


<pre>
firefox quast_results/latest/report.html &
python /home/ctools/quast-quast_5.0.2/quast.py default_final.contigs.fa.gz -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
firefox quast_results/latest/report.html &  
</pre>
</pre>


<b>Q9. You can see in the report that there are some misassemblies - can we always trust this?</b>.
<p><b>Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?</b></p>


<HR>
<hr>


<H3>Visualization using circoletto</H3>
<h3>Visualization using Circoletto</h3>


<p> Let's try to visualize the assembly. Because we know that the data is from a <i>Vibrio cholerae</i> we can try to compare our assembly with the V.cholerae reference genome. First, let's filter the assembly to a minimum of 500bp.</p>
<p>We will visualize the assembly against the <i>V. cholerae</i> reference using Circoletto.</p>
 
<p>First, filter out contigs shorter than 500 bp:</p>


<pre>
<pre>
Line 285: Line 386:
</pre>
</pre>


<p>We need to unzip to use gedit. Open a browser on your own computer and go to the [https://bat.infspire.org/circoletto/ Circoletto webpage]. Circoletto is an easy-to-use web program that builds upon Circos - a program to create amazing plots with genomic (and other) data. Open the filtered assembly (<b>"gedit default_final.contigs.fa.gz.filtered_500.fa &"</b>) select all and copy-paste it into the upper box (query fasta). Do the same with the reference genome (<b>"gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &"</b>) and copy-paste it into the box just below. In the "output" section click the "ONLY show the best hit per query" and click "submit to Circoletto".
<p>On your local machine, open a browser and go to:</p>


<p>[https://bat.infspire.org/circoletto/ Circoletto]</p>


<p>Open the filtered assembly in a text editor on the server, for example:</p>
<pre>
gedit default_final.contigs.fa.gz.filtered_500.fa &
</pre>
<p>Copy–paste the FASTA content into the <b>“Query fasta”</b> box on the Circoletto page.</p>
<p>Then open the reference genome:</p>
<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>
<p>Copy–paste this into the <b>“Subject fasta”</b> box.</p>
<p>In the “Output” section, select <b>“ONLY show the best hit per query”</b>, then click <b>Submit to Circoletto</b>.</p>
<p>If Circoletto does not work, you can use this precomputed image:</p>
<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>


If it does not work, feel free to use: /home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
<p>You should see the two <i>V. cholerae</i> chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).</p>
<!--
If it doesn't work you can download the [http://www.cbs.dtu.dk/courses/27626/misc/Vchol-001.best.fa assembly] and [http://www.cbs.dtu.dk/courses/27626/misc/vibrio_cholerae_O1_N16961.fa reference] that I made.</p>


<p>If everything fails you can see [http://www.cbs.dtu.dk/courses/27626/misc/circoletto.png this file].</p>
<p><b>Q10. Does your assembled genome appear broadly similar to the reference genome?</b></p>
-->
<p>You should see the two chromosomes of <i>V. cholerae</i> on the left-hand side of the figure (named "gi|...") and then the alignment of our assembly to it. The color of the alignments are the bitscores, the red is the most confident and black, the least confident.</p>


<b>Q10. Do you think that our genome is similar to the reference genome?</b>
<p><b>Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?</b></p>


<br>
<p><b>Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the <i>V. cholerae</i> genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</b></p>


<b>Q11. Are there scaffolds/contigs that do not or only partially map to the reference genome?</b><br>
<hr>


<b>Q12. What do you think the region on chromosome 2 (the small one) with many small low-confident mappings could be? Hint: See the [https://www.nature.com/articles/35020000 V. cholerae genome paper] and search for "V. cholerae integron island"?</b><br>
<h3>Try to assemble the genome using SPAdes (bonus)</h3>


<HR>
<p>Different assemblers can perform very differently. <b>SPAdes</b> is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.</p>


<p>SPAdes: [http://bioinf.spbau.ru/spades SPAdes]</p>


<H3>Try to assemble the genome using SPAdes (bonus)</H3>
<p>Check the help output:</p>


<p>There is a lot of discussions on which assembler is better than others. [http://bioinf.spbau.ru/spades SPAdes] is definitely one of the best ones. SPAdes will run error correction and use multiple k-mers at the same time when it is doing the assembly. Try it out, it is in the bin folder - try to figure out the commands you need to write to run it and compare it with the output from MEGAHIT (hint: you can use Quast for the comparison).</p>
<pre>
<pre>
spades.py -h
spades.py -h
</pre>
</pre>


<!--
<p><b>Note:</b> A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.</p>


<p>To run it:</p>
<p>Link to the SPAdes assembly:</p>
</pre>
ln -s Vchol-001_6.pair1.cor.truncated.gz Vchol-001_6.pair1.cor.truncated.fastq.gz
ln -s Vchol-001_6.pair2.cor.truncated.gz Vchol-001_6.pair2.cor.truncated.fastq.gz


spades.py --careful --pe1-1 Vchol-001_6.pair1.cor.truncated.fastq.gz --pe1-2 Vchol-001_6.pair2.cor.truncated.fastq.gz -o spades -t 1 -m 3 --cov-cutoff auto
<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>
</pre>
-->


<p>NB: The assembly takes 45 minutes to run with SPAdes, you can therefore use my assembly, try to calculate assembly stats (assemblathon_stats.pl) and run it in Quast together with the SOAPdenovo assembly. To get my assembly:</p>
<hr>


<pre>
<h3>Assembly of PacBio and Nanopore data</h3>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta .
# you write the code from here
</pre>
<!--
assemblathon_stats.pl spades/scaffolds.fasta > spades/scaffolds.stats
paste spades/scaffolds.stats asmK55.scafSeq.stats | less -S
# quite many fewer contigs for spades and higher mean contig length
-->


<HR>
<p>We now look at assemblies from long-read technologies using <b>canu</b>, a long-read assembler derived from the Celera assembler (used in the original human genome project).</p>


<H3>Assembly of PacBio and NanoPore data</H3>
<p>Canu: [https://github.com/marbl/canu canu]</p>


<p>For the PacBio and NanoPore data, we are going to use [https://github.com/marbl/canu canu]. Canu is a further development of the Celera assembler which was one of the assemblers used for the original human genome - here they used Sanger reads which are long and have low error rate. Canu has been changed to run with long reads with errors in them.<br>
<p>We will assemble <i>E. coli</i> K-12 using either PacBio or Oxford Nanopore reads.</p>
Let's try to run it on some E. coli K12 data. Here we only run it with either PacBio or NanoPore, but it can also handle combinations of these and Illumina. The commands are quite slow<p>


<pre>
<pre>
# pacbio
# PacBio
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4


# oxford nanopore
# Oxford Nanopore
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
</pre>
</pre>


<p>It is very likely going to take quite some time so you can copy my files:</p>
<p>These runs can take a long time. You can instead link to the precomputed assemblies:</p>


<pre>
<pre>
Line 362: Line 472:
</pre>
</pre>


<p>Try to count the number of contigs and their length of them (you can see that from the headers):</p>
<p>Count number of contigs and inspect their lengths (from FASTA headers):</p>


<pre>  
<pre>
grep ">" [input fasta here]
grep ">" [input fasta here]
</pre>
</pre>


or
<p>Or, if compressed:</p>
<pre>  
 
<pre>
zgrep ">" [zipped input fasta here]
zgrep ">" [zipped input fasta here]
</pre>
</pre>


<p>Run QUAST on both assemblies:</p>


<pre>
<pre>
python quast.py ecoli_pacbio.contigs.fasta ecoli_nanopore.contigs.fasta -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa  
python quast.py \
firefox quast_results/latest/report.html &  
  ecoli_pacbio.contigs.fasta \
  ecoli_nanopore.contigs.fasta \
  -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa
 
firefox quast_results/latest/report.html &
</pre>
</pre>


<p><b>Q13. How good are the assemblies using Nanopore and pacbio?</b></p>
<p><b>Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?</b></p>


<H3>Annotation of a prokaryotic genome</H3>
<hr>


We will use [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal ] to annotate our genes in ecoli_pacbio.contigs.fasta:
<h3>Annotation of a prokaryotic genome</h3>
 
<p>We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using <b>prodigal</b>.</p>
 
<p>Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]</p>
 
<p>The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:</p>


<pre>
<pre>
/home/ctools/prokka/binaries/linux/prodigal -f gff -i [input genome in fasta] -a [output proteins in fasta] -o [output annotations in gff]
/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i [input genome in fasta] \
  -a [output proteins in fasta] \
  -o [output annotations in gff]
</pre>
</pre>


"-f" specifies the output format and "-o" the output file. The output is in [https://www.ensembl.org/info/website/upload/gff.html gff] format for the annotation. The proteins are specified by "-a" and are in FASTA format.
<p>GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]</p>
 
<p>Next, index the protein FASTA file:</p>


Let's use a gene that was found and identify what it is as an example. First index the FASTA file:
<pre>
<pre>
samtools faidx ecoli_pacbio.contigs.aa
samtools faidx ecoli_pacbio.contigs.aa
</pre>
</pre>


Then find the following gene ID "tig00000001_4582":
<p>Extract the protein sequence for gene ID <code>tig00000001_4582</code>:</p>
 
<pre>
<pre>
samtools faidx ecoli_pacbio.contigs.aa   tig00000001_4582
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>
</pre>


<p>Use BLASTP against the nr database:</p>


Use [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins ] and determine which protein it is.
<p>[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]</p>


<b> Q14. Which protein is it? </b> <BR>
<p>Paste the sequence and run BLASTP.</p>
<HR>
<!--
<H3>Illumina assembly correction (optional)</H3>
<p>Let us instead try to use [http://www.sanger.ac.uk/science/tools/reapr Reapr]  to evaluate the internal quality of the assembly. It will map our reads back to the assembly and try to identify misassemblies from it. It works best using long insert libraries (>1000 bp) and ours is not that large, but we can try anyway. First, it will map our reads to the assembly using <a href="http://www.sanger.ac.uk/science/tools/smalt-0">smalt</a> and afterward use the read mapping to calculate Fragment Coverage Distribution (FCD) to identify breakpoints in the assembly. <b>It takes a bit of time to run so you could do the Visualization using circoletto while it is running. Eg, run these commands and continue while they are running (open a new terminal or run it in "screen").</b></p>


<pre>
<p><b>Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?</b></p>
reapr facheck Vchol-001.best.fa Vchol-001.best.facheck
gzip -dc Vchol-001_6.pair1.cor.truncated.gz > Vchol-001_6.pair1.cor.truncated
gzip -dc Vchol-001_6.pair2.cor.truncated.gz > Vchol-001_6.pair2.cor.truncated
reapr smaltmap Vchol-001.best.facheck.fa Vchol-001_6.pair1.cor.truncated Vchol-001_6.pair2.cor.truncated Vchol-001.best.bam
reapr pipeline Vchol-001.best.facheck.fa Vchol-001.best.bam reapr_results
</pre>


<p>When it is done will have generated a summary (reapr_results/05.summary.report.txt) and a broken assembly (reapr_results/04.break.broken_assembly.fa).</p>
<hr>
-->


<p>Please find answers here: [[Denovo_solution|Denovo_solution]]</p>


Please find answers [[Denovo_solution|here]]
<hr>


<HR>
<p><b>Congratulations, you finished the exercise!</b></p>
<p>Congratulations you finished the exercise!</p>

Latest revision as of 11:09, 26 November 2025

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called denovo.
  3. Navigate to the directory you just created.

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

  1. Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).
  2. Count k-mers and estimate genome size.
  3. Correct reads using Musket.
  4. Determine insert size of paired-end reads.
  5. Run de novo assembly using MEGAHIT.
  6. Calculate assembly statistics.
  7. Plot coverage and length histograms of the assembly.
  8. Evaluate the assembly quality.
  9. Visualize the assembly using Circoletto.
  10. (Bonus) Try assembling the genome with SPAdes.
  11. Compare assemblies from PacBio and Nanopore data.

FastQC and trimming

Make sure you are in the denovo directory you created. You can double-check with:

pwd

Copy the sequencing data:

cp /home/projects/22126_NGS/exercises/denovo/vchol/* .

Run FastQC on the reads:

mkdir fastqc
fastqc -o fastqc *.txt.gz

Open the reports in Firefox:

firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The --basename option defines the output prefix and --gzip compresses the output.

AdapterRemoval \
  --file1 Vchol-001_6_1_sequence.txt.gz \
  --file2 Vchol-001_6_2_sequence.txt.gz \
  --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
  --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
  --qualitybase 64 \
  --basename Vchol-001_6 \
  --gzip \
  --trimqualities \
  --minquality 20 \
  --minlength 40

When it finishes, inspect Vchol-001_6.settings for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes discarded.gz, pair1.truncated.gz, pair2.truncated.gz, and singleton.truncated.gz. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz

Genome size estimation

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.

Manual: jellyfish

gzip -dc Vchol-001_6.pair*.truncated.gz \
  | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

jellyfish histo Vchol-001 > Vchol-001.histo

Start R:

R

Then paste:

dat = read.table("Vchol-001.histo")
barplot(dat[,2],
        xlim=c(0,150),
        ylim=c(0,5e5),
        ylab="No of kmers",
        xlab="Counts of a k-mer",
        names.arg=dat[,1],
        cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)

The plot shows:

  • x-axis: how many times a k-mer occurs (its count)
  • y-axis: number of distinct k-mers with that count

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

N = (M * L) / (L - K + 1)
Genome_size = T / N
  • N = depth (coverage)
  • M = k-mer peak (from the histogram)
  • K = k-mer size (here: 15)
  • L = average read length (from fastx_readlength)
  • T = total number of bases (from fastx_readlength)

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?


Error correction

We will correct errors in the reads using Musket.

Musket: Musket

First, get the number of distinct k-mers (needed for memory allocation in Musket):

jellyfish stats Vchol-001

Use the reported number of distinct k-mers (here an example: 8423098) in the Musket command:

musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
  Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1

The output files are named Vchol-001_6.cor.0 and Vchol-001_6.cor.1. Rename them:

mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz

If this takes too long, you can copy precomputed corrected reads:

cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .

De novo assembly with MEGAHIT

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

export OMP_NUM_THREADS=4

Run MEGAHIT with k=35:

/home/ctools/MEGAHIT-1.2.9/bin/megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list 35 \
  -t 4 \
  -m 2000000000 \
  -o 35

When finished, you should have 35/final.contigs.fa. Compress it:

gzip 35/final.contigs.fa

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq

Index the assembly and map:

bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
  | samtools view -Sb - > Vchol_35bp.bam

Extract insert sizes (TLEN field, column 9):

samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt

Start R:

R

Then paste:

a   = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn  = quantile(a.v, seq(0,1,0.05))[4]
mx  = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])   # mean insert size
sd(a.v[a.v >= mn & a.v <= mx])     # standard deviation

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

Google sheet for k-mer assignment

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing [KMER] below:

export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9/bin/megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list [KMER] \
  -t 4 \
  -m 2000000000 \
  -o [KMER]

gzip [KMER]/final.contigs.fa

Compute assembly statistics using the Assemblathon script:

assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats

Open [KMER]/final.contigs.stats and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?


Coverage of the assembly

We will now calculate per-contig coverage and lengths, and visualize them in R.

zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov

Start R:

R

Then paste:

library(plotrix)
dat = read.table("default_finalt.cov", sep="\t")

par(mfrow=c(1,2))
weighted.hist(w = dat[,2],
              x = dat[,1],
              breaks = seq(0,100,1),
              main = "Weighted coverage",
              xlab = "Contig coverage")
hist(dat[,1],
     xlim   = c(0,100),
     breaks = seq(0,1000,1),
     main   = "Raw coverage",
     xlab   = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)

# Scaffold lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])),
        xlab = "# Scaffold",
        ylab = "Length",
        main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")

View the plots:

evince best.coverage.pdf &
evince scaffold.lengths.pdf &

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?


Assembly evaluation

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: quast

Run QUAST against the V. cholerae reference genome:

python /home/ctools/quast-quast_5.0.2/quast.py \
  default_final.contigs.fa.gz \
  -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa

firefox quast_results/latest/report.html &

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?


Visualization using Circoletto

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

fastx_filterfasta.py --i default_final.contigs.fa.gz --min 500
gunzip default_final.contigs.fa.gz.filtered_500.fa.gz

On your local machine, open a browser and go to:

Circoletto

Open the filtered assembly in a text editor on the server, for example:

gedit default_final.contigs.fa.gz.filtered_500.fa &

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: V. cholerae genome paper


Try to assemble the genome using SPAdes (bonus)

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: SPAdes

Check the help output:

spades.py -h

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST

Assembly of PacBio and Nanopore data

We now look at assemblies from long-read technologies using canu, a long-read assembler derived from the Celera assembler (used in the original human genome project).

Canu: canu

We will assemble E. coli K-12 using either PacBio or Oxford Nanopore reads.

# PacBio
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4

# Oxford Nanopore
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4

These runs can take a long time. You can instead link to the precomputed assemblies:

ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta .
ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .

Count number of contigs and inspect their lengths (from FASTA headers):

grep ">" [input fasta here]

Or, if compressed:

zgrep ">" [zipped input fasta here]

Run QUAST on both assemblies:

python quast.py \
  ecoli_pacbio.contigs.fasta \
  ecoli_nanopore.contigs.fasta \
  -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa

firefox quast_results/latest/report.html &

Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?


Annotation of a prokaryotic genome

We will annotate genes in ecoli_pacbio.contigs.fasta using prodigal.

Prodigal: prodigal

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i [input genome in fasta] \
  -a [output proteins in fasta] \
  -o [output annotations in gff]

GFF format: GFF format description

Next, index the protein FASTA file:

samtools faidx ecoli_pacbio.contigs.aa

Extract the protein sequence for gene ID tig00000001_4582:

samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582

Use BLASTP against the nr database:

BLAST for proteins

Paste the sequence and run BLASTP.

Q14. Which protein (function) does tig00000001_4582 correspond to?


Please find answers here: Denovo_solution


Congratulations, you finished the exercise!