Denovo exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with "<H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "denovo" <LI>Navigate to the directory you just created. </OL> <p>In this exercise we will try to perform de novo assembly of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will try to: <OL> <LI>Run FastQC, adaptor and quality trimming reads (Optional - repeat of analysis you have already done in data pre-processing)...")
 
No edit summary
 
(42 intermediate revisions by 3 users not shown)
Line 1: Line 1:
<H2>Overview</H2>
<h2>Overview</h2>


<p>First:</p>
<ol>
  <li>Navigate to your home directory.</li>
  <li>Create a directory called <code>denovo</code>.</li>
  <li>Navigate to the directory you just created.</li>
</ol>


First:
<p>In this exercise we will perform a <b>de novo assembly</b> of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will:</p>
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "denovo"
<LI>Navigate to the directory you just created.
</OL>


<ol>
  <li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
  <li>Count k-mers and estimate genome size.</li>
  <li>Correct reads using Musket.</li>
  <li>Determine insert size of paired-end reads.</li>
  <li>Run de novo assembly using MEGAHIT.</li>
  <li>Calculate assembly statistics.</li>
  <li>Plot coverage and length histograms of the assembly.</li>
  <li>Evaluate the assembly quality.</li>
  <li>Visualize the assembly using Circoletto.</li>
  <li>(Bonus) Try assembling the genome with SPAdes.</li>
  <li>Annotation of a prokaryotic genome.</li>
</ol>


<p>In this exercise we will try to perform de novo assembly of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will try to:
<hr>
<OL>
  <LI>Run FastQC, adaptor and quality trimming reads (Optional - repeat of analysis you have already done in data pre-processing)
  <LI>Count k-mers and estimate genome size
  <LI>Correct reads using Musket
  <LI>Determine insert size of paired-end reads
  <LI>Run de novo assembly using SOAPdenovo
  <LI>Calculate assembly statistics
  <LI>Plot coverage and length histograms of the assembly
  <LI>Assembly evaluation
  <LI>Visualize assembly using Circoletto
  <LI>Try to assemble the genome using SPAdes
  <LI>Assembly of PacBio and NanoPore
</OL>


<HR>
<h3>FastQC and trimming</h3>


<H3>FastQC and trimming</H3>
<p>Make sure you are in the <code>denovo</code> directory you created. You can double-check with:</p>


Make sure you are in the "denovo" directory you created, you can double-check with the following command:
<pre>
<pre>
pwd
pwd
</pre>
</pre>


go ahead and copy the sequencing data</p>
<p>Copy the sequencing data:</p>


<pre>
<pre>
Line 40: Line 40:
</pre>
</pre>


<p>Let us start by running fastqc on the reads:</p>
<p>Run FastQC on the reads:</p>
 
<pre>
<pre>
mkdir fastqc
mkdir fastqc
Line 46: Line 47:
</pre>
</pre>


<p>Open the reports in firefox.</p>
<p><b>Viewing FastQC HTML reports:</b></p>
 
<p>If you are using <b>MobaXterm</b>, you can open the FastQC HTML files directly
from the left-hand file panel on the server.</p>
 
<p>If you are using <b>macOS</b> (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:</p>
 
<pre>
<pre>
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
scp stud0XX@pupilX:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>
</pre>


<p>There are quite some issues with this data (but do not spend to much time on studying the report), we should clean it up first. Let us find out what type of encoding the qualities are in:</p>
<p>Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.</p>
 
<p>There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:</p>
 
<pre>
<pre>
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>
</pre>
<b>Q1. Which format are they encoded in?</b>


<p>Let us trim the reads using AdapterRemoval. In the command below the most frequent adapter/primer sequence is already pasted in - do not worry about the others in our case they are just variations of that one. Also, we use a minimum of 40 nt and trim to quality 20 (~3 mins) and note that we write qualitybase 64 (Q1). The "--basename" option gives the base name of the output files and we want it to be compressed using gzip "--gzip". When it is done take a look at "Vchol-001_6.settings" for some statistics on how many reads were trimmed etc.</p>
<p><b>Q1. Which quality encoding format is used?</b></p>
 
<p>Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.</p>


<pre>
<pre>
AdapterRemoval --file1 Vchol-001_6_1_sequence.txt.gz --file2 Vchol-001_6_2_sequence.txt.gz --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG --qualitybase 64 --basename Vchol-001_6 --gzip --trimqualities --minquality 20 --minlength 40
AdapterRemoval \
  --file1 Vchol-001_6_1_sequence.txt.gz \
  --file2 Vchol-001_6_2_sequence.txt.gz \
  --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
  --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
  --qualitybase 64 \
  --basename Vchol-001_6 \
  --gzip \
  --trimqualities \
  --minquality 20 \
  --minlength 40
</pre>
</pre>


<b>Q1A. There are several output files (discarded.gz, pair1.truncated.gz, pair2.truncated.gz, singleton.truncated.gz), can you explain which files contains which reads?</b>
<p>When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).</p>


<!--
<p><b>Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)</b></p>
<pre>
cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC -O 3 -m 40 -q 20 --quality-base=64 \
Vchol-001_6_1_sequence.txt.gz > Vchol-001_6_1.cut.fastq &
cutadapt -b GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG -O 3 -m 40 -q 20 --quality-base=64 \
Vchol-001_6_2_sequence.txt.gz > Vchol-001_6_2.cut.fastq &
</pre>


cutadapt -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC -O 3 -m 40 -q 20 --quality-base=64 Vchol-001_6_1_sequence.txt.gz > Vchol-001_6_1.cut.fastq &
<p>Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:</p>
cutadapt -b GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG -O 3 -m 40 -q 20 --quality-base=64 Vchol-001_6_2_sequence.txt.gz > Vchol-001_6_2.cut.fastq &


<p>Then we need to put the reads in the same order using cmpfastq (1min) and remove the cut fastqs</p>
<pre>
<pre>
cmpfastq.pl Vchol-001_6_1.cut.fastq Vchol-001_6_2.cut.fastq
fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
rm Vchol-001_6_1.cut.fastq Vchol-001_6_2.cut.fastq
fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>
</pre>
-->


<p>Ok, now we have our reads trimmed and ready to go. Count stats of the trimmed data such as avg readlength, min length, max length, no. reads and no. bases. Note down the avg readlength and the no. of bases.</p>
<hr>
<pre>
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz
</pre>


<HR>
<h3>Genome size estimation</h3>


<H3>Genome size estimation</H3>
<p>We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use <b>jellyfish</b> to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.</p>


<p> Let's try to count the occurrence of k-mers in the data - a k-mer is simply a string of nucleotides of a certain length. Let's say we count all the 15-mers that are in the read data we have - we do this using a program called [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]. Here we tell jellyfish to count 15-mers and also add counts from the complementary strand. Afterward, we tell it to create a histogram that we will plot using R. "/dev/fd/0" in the command below means that it should take input from "STDIN", eg. from the gzip program.</p>
<p>Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]</p>


<pre>
<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0
gzip -dc Vchol-001_6.pair*.truncated.gz \
  | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0
 
jellyfish histo Vchol-001 > Vchol-001.histo
jellyfish histo Vchol-001 > Vchol-001.histo
</pre>


</pre>
<p>Start R:</p>


Then we call R:
<pre>
<pre>
R
R
</pre>
</pre>


and paste the following to generate a histogram:
<p>Then paste:</p>
 
<pre>
<pre>
dat=read.table("Vchol-001.histo")
dat = read.table("Vchol-001.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
barplot(dat[,2],
        xlim=c(0,150),
        ylim=c(0,5e5),
        ylab="No of kmers",
        xlab="Counts of a k-mer",
        names.arg=dat[,1],
        cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)
dev.print("Vchol-001.histo.pdf", device=pdf)
</pre>
</pre>


<p>The plot shows how many times a k-mer is found (x-axis) and the number of kmers with this count (y-axis). The bars in the lower end are the k-mers that only occur very few times, these are probably sequencing errors, whereas the k-mers that occur many times are "real" k-mers. We can use the information from the "real" k-mers to correct similar "error k-mers". In this way we can increase the performance of our assembly (this also works for SNP calling as well).</p>
<p>The plot shows:</p>
<ul>
  <li>x-axis: how many times a k-mer occurs (its count)</li>
  <li>y-axis: number of distinct k-mers with that count</li>
</ul>
 
<p>K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.</p>
 
<p><b>Q2. Where is the k-mer coverage peak (approximately)?</b></p>
 
<p>We can estimate genome size using:</p>
 
<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>
 
<ul>
  <li><b>N</b> = depth (coverage)</li>
  <li><b>M</b> = k-mer peak (from the histogram)</li>
  <li><b>K</b> = k-mer size (here: 15)</li>
  <li><b>L</b> = average read length (from fastx_readlength)</li>
  <li><b>T</b> = total number of bases (from fastx_readlength)</li>
</ul>
 
<p>Compute the estimated genome size and compare with the known <i>V. cholerae</i> genome (~4 Mb). You should be within roughly ±10%.</p>


<b>Q2. Where is the peak for the k-mers?</b>
<p><b>Q3. What is your estimated genome size?</b></p>


<p>Before we perform error correction on the reads, we will try to estimate the genome size of our genome. Of course, we know that it is a V. cholerae, but if we did not know that or it was an unknown species then we could do like this. N = (M*L)/(L-K+1) and Genome_size = T/N, where N: Depth, M: Kmer peak, K: Kmer-size, L: avg readlength, T: Total bases. Try to put in the numbers (some of them you have from "fastx_readlength.py" command) and see what you get. The reference vibrio cholerae genome is around 4Mb, you should get within +/- 10% of this.</p>
<hr>
<b>Q3. What is the estimated genome size?</b>


<HR>
<h3>Error correction</h3>


<H3>Error correction</H3>
<p>We will correct errors in the reads using <b>Musket</b>.</p>


<p>Let us try to correct the reads using [http://musket.sourceforge.net/homepage.htm Musket]. First we need to know how many k-mers will be in the data, this is rather easy to get because we already counted kmers using jellyfish. Use this command to output the number of "distinct" k-mers in the dataset:</p>
<p>Musket: [http://musket.sourceforge.net/homepage.htm Musket]</p>
 
<p>First, get the number of distinct k-mers (needed for memory allocation in Musket):</p>


<pre>
<pre>
Line 130: Line 177:
</pre>
</pre>


<p>This is needed for musket to setup the bloom filters and hash tables for memory consumption. Then we are ready to run the correction. Finally, we will rename the outputfiles to sensible names.</p>
<p>Use the reported number of distinct k-mers (here an example: <code>8423098</code>) in the Musket command:</p>
 
<pre>
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
  Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>
 
<p>The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:</p>


<pre>
<pre>
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.gz
</pre>
</pre>


<p><b>If this takes a long time copy my files:</b></p>
<p><b>If this takes too long, you can copy precomputed corrected reads:</b></p>


<pre>
<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.gz .
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>
</pre>


<HR>
<hr>


<H3><i>de novo</i> assembly</H3>
<h3><i>De novo</i> assembly with MEGAHIT</h3>


<p>Now the reads are ready for <i>de novo</i> assembly. We are going to use SOAPdenovo as the assembler, it uses the de Bruijn approach. When running ''de novo'' assemblies one should try different k-mer sizes - the k-mer size is what is being used for building the de Bruijn graph and is therefore very important. There is currently no way to estimate what will be the optimal k before running, the best k-size may change between different datasets and may change between different assemblers for the same dataset.</p>
<p>We will now assemble the corrected reads using <b>MEGAHIT</b> (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.</p>
<p><b>SOAPdenovo requires a configuration file with different information, a sample file is the Vchol-001.soap.conf. You need to open it and change the path to your file. To open it write "gedit Vchol-001.soap.conf" and then navigate to "q1" and "q2", here paste in the path and filename of your two files</b>. Note that we have written "avg_ins=200" even though we do not know the average insert size. We will run an initial assembly to create contigs and then extract the information and then rerun the full assembly again. When you have changed the paths save the file (Ctrl-S) then start the assembler. We will try first with a k of 35.</p>
 
<p>First, set the number of threads:</p>


<pre>
<pre>
SOAPdenovo-127mer pregraph -s Vchol-001.soap.conf -K 35 -p 2 -o initial
export OMP_NUM_THREADS=4
SOAPdenovo-127mer contig -g initial
</pre>
</pre>
<p>Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let us only map the first 100.000 reads - this should be enough.</p>
 
<p>Run MEGAHIT with k=35:</p>


<pre>
<pre>
zcat Vchol-001_6.pair1.cor.truncated.gz | head -n 400000 > Vchol_sample_1.fastq
megahit \
zcat Vchol-001_6.pair2.cor.truncated.gz | head -n 400000 > Vchol_sample_2.fastq
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list 35 \
  -t 4 \
  -m 2000000000 \
  -o 35
</pre>


bwa index initial.contig
<p>When finished, you should have <code>35/final.contigs.fa</code>. Compress it:</p>
bwa mem initial.contig Vchol_sample_1.fastq Vchol_sample_2.fastq | samtools view -Sb - > initial.sample.bam


samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt
<pre>
gzip 35/final.contigs.fa
</pre>


<p>To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):</p>
<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>
</pre>


Then use R:
<p>Index the assembly and map:</p>
 
<pre>
bwa index 35/final.contigs.fa.gz
 
bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
  | samtools view -Sb - > Vchol_35bp.bam
</pre>
 
<p>Extract insert sizes (TLEN field, column 9):</p>
 
<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>
 
<p>Start R:</p>
 
<pre>
<pre>
R
R
</pre>
</pre>


and paste:
<p>Then paste:</p>
 
<pre>
<pre>
a = read.table("initial.insertsizes.txt")
a   = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
a.v = a[a[,1] > 0, 1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
mean(a.v[a.v >= mn & a.v <= mx])   # mean insert size
sd(a.v[a.v >= mn & a.v <= mx])         # sd
sd(a.v[a.v >= mn & a.v <= mx])     # standard deviation
</pre>
</pre>


<b>Q4. What are the mean insert size and standard deviation of our library?</b>
<p><b>Q4. What are the mean insert size and standard deviation of the library?</b></p>
<p>Open "Vchol-001.soap.conf" and change the avg_ins to the value you found. Then we are going to rerun the assembly where we also do scaffolding. Because the assembly takes some time to finish, you will all chose a different k-mer to run and then we will compare to find the "best" assembly. <b>Chose one of the K-sizes on the [https://docs.google.com/spreadsheets/d/1SbOGvtOkr29oZfNXscH7wDSJ9Mi9I7OxBcpr9WNETs4/edit?usp=sharing Google sheet] and change the "XX" values below to your number. Remember to write your name and save the document.</b></p>
 
<p>Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:</p>
 
<p>[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]</p>
 
<p>Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:</p>


<pre>
<pre>
SOAPdenovo-127mer all -s Vchol-001.soap.conf -K XX -p 2 -o asmKXX
export OMP_NUM_THREADS=4
megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list [KMER] \
  -t 4 \
  -m 2000000000 \
  -o [KMER]
 
gzip [KMER]/final.contigs.fa
</pre>
</pre>


<p>When it is done, we need to calculate some statistics on it, there is a script from the Assemblathon that does it nicely, but first lets remove all sequences shorter than 100</p>
<p>Compute assembly statistics using <code>QUAST</code>:</p>
 
<pre>
<pre>
fastx_filterfasta.py --i asmKXX.scafSeq --min 100
quast.py \
assemblathon_stats.pl asmKXX.scafSeq.filtered_100.fa > asmKXX.scafSeq.stats
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>
</pre>


<p>Open the asmKXX.scafSeq.stats file and fill in the information in the [https://docs.google.com/spreadsheets/d/1SbOGvtOkr29oZfNXscH7wDSJ9Mi9I7OxBcpr9WNETs4/edit?usp=sharing Google sheet] for your respective K. We will then compare the results and find the best K-size for the particular dataset.</p>
<p>Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:</p>
 
<ul>
  <li>Number of contigs (&ge; 500 bp)</li>
  <li>Total assembly length</li>
  <li>Largest contig</li>
  <li>N50</li>
</ul>
 
<p>As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.</p>
 
<p>Copy the best assembly to your folder, or use a precomputed multi-k assembly:</p>


<p>Copy the best assembly to your folder and use this assembly from now on. You can also use a fairly good one I made previously (I renamed it to Vchol-001.best.fa).</p>
<pre>
<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/Vchol-001.best.fa .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/Vchol-001.best.stats .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>
</pre>


<b>Q5. How does the length (genome size) of the assembly compare with what we estimated using kmers?</b><br>
<p><b>Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?</b></p>
<b>Q6. How many contigs were joined together in scaffolds? Why do you think it is a such a low number (Hint: Repeats, PE insert size)?</b>


<HR>
<p><b>Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?</b></p>
 
<hr>
 
<h3>Coverage of the assembly</h3>
 
<p>We will now calculate per-contig coverage and lengths, and visualize them in R.</p>


<H3>Coverage of assembly</H3>
<p>Lets calculate coverage and length for each sequence and plot it as a histogram in R. We also plot the lengths of the scaffolds.</p>
<pre>
<pre>
fastx_soapcov.py --i Vchol-001.best.fa > Vchol-001.best.cov
zcat default_final.contigs.fa.gz | fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>
</pre>


Then use R:
<p>Start R:</p>
 
<pre>
<pre>
R
R
</pre>
</pre>


and write the following:
<p>Then paste:</p>
 
<pre>
<pre>
library(plotrix)
library(plotrix)
dat=read.table("Vchol-001.best.cov", sep="\t")
dat = read.table("default_finalt.cov", sep="\t")
 
par(mfrow=c(1,2))
par(mfrow=c(1,2))
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage")
weighted.hist(w = dat[,2],
hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage")
              x = dat[,1],
              breaks = seq(0,100,1),
              main = "Weighted coverage",
              xlab = "Contig coverage")
hist(dat[,1],
    xlim   = c(0,100),
    breaks = seq(0,1000,1),
    main   = "Raw coverage",
    xlab   = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)
dev.print("best.coverage.pdf", device=pdf)


# Lengths
# Scaffold lengths
par(mfrow=c(1,1))
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths")
barplot(rev(sort(dat[,2])),
        xlab = "# Scaffold",
        ylab = "Length",
        main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
dev.print("scaffold.lengths.pdf", device=pdf)
q()
q(save="no")
</pre>
</pre>
<p>View the plots:</p>


<pre>
<pre>
Line 241: Line 372:
</pre>
</pre>


<p>The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.</p>


<p>The left plot shows the length-weighted coverage (of k-mers) of the contigs/scaffolds - this means that long sequences has a larger weight compared to short sequences. The plot on the right side shows a normal histogram of contig coverage. By comparing the two plots we see that the majority of the assembly has a k-mer coverage around 20-30X, and that remaining assembly are short sequences. NB: The contig coverage is dependent on the size of the K you chose - larger K gives smaller coverage. Looking at the plot with the lengths you see that the majority of the assembly are in quite long scaffolds.</p>
<p><b>Q7. Why might some short contigs have much <i>higher</i> coverage than the main assembly?</b></p>


<b>Q7. Can you think of why some short contigs have much higher coverage compared to main assembly?</b><br>
<p><b>Q8. Why might some short contigs have much <i>lower</i> coverage than the main assembly?</b></p>
<b>Q8. Can you think of why some short contigs have much lower coverage compared to the main assembly?</b>


<HR>
<hr>


<H3>Assembly evaluation</H3>
<h3>Assembly evaluation</h3>


<p>First we will use Quast to evaluate the assembly using different metrics, see [http://bioinf.spbau.ru/quast quast]. You can run it with our without a reference genome and we will try to evaluate our assembly vs. the Vibrio cholerae reference genome:</p>
<p>We will use <b>QUAST</b> to evaluate the assembly using various reference-based metrics.</p>
 
<p>QUAST: [https://quast.sourceforge.net/quast quast]</p>
 
<p>Run QUAST against the <i>V. cholerae</i> reference genome:</p>


<pre>
<pre>
python /home/ctools/quast-quast_5.0.2/quast.py Vchol-001.best.fa -R vibrio_cholerae_O1_N16961.fa
quast.py \
firefox quast_results/latest/report.html &
  default_final.contigs.fa.gz \
  --threads 1 \
  -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>
</pre>


<b>Q9. You can see in the report that there are some misassemblies - can we always trust this?</b>.
<pre>
mkdir fastqc
fastqc -o fastqc *.txt.gz
</pre>


<HR>
<p>If you are using <b>MobaXterm</b>, you can open the HTML files directly
from the left-hand file panel on the server.</p>


<H3>Visualization using circoletto</H3>
<p>If you are using <b>macOS</b> (or a standard terminal), copy the HTML files to
 
your local computer and open them in a web browser. For example:</p>
<p> Let's try to visualize the assembly. Because we know that the data is from a <i>Vibrio cholerae</i> we can try to compare our assembly with the V.cholerae reference genome. First, let's filter the assembly to a minimum of 500bp.</p>


<pre>
<pre>
fastx_filterfasta.py --i Vchol-001.best.fa --min 500
scp stud0XX@pupilX:denovo/quast_results/latest/report.html .
</pre>
</pre>


<p>Open a browser on your own computer and go to the [https://bat.infspire.org/circoletto/ Circoletto webpage]. Circoletto is an easy-to-use web program that builds upon Circos - a program to create amazing plots with genomic (and other) data. Open the filtered assembly (<b>"gedit Vchol-001.best.fa.filtered_500.fa &"</b>) select all and copy-paste it into the upper box (query fasta). Do the same with the reference genome (<b>"gedit vibrio_cholerae_O1_N16961.fa &"</b>) and copy-paste it into the box just below. In the "output" section click the "ONLY show the best hit per query" and click "submit to Circoletto". If it doesn't work you can download the [http://www.cbs.dtu.dk/courses/27626/misc/Vchol-001.best.fa assembly] and [http://www.cbs.dtu.dk/courses/27626/misc/vibrio_cholerae_O1_N16961.fa reference] that I made.</p>
<p>Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.</p>
 
<p><b>Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?</b></p>
 
<hr>


<p>If everything fails you can see [http://www.cbs.dtu.dk/courses/27626/misc/circoletto.png this file].</p>
<h3>Visualization using Circoletto</h3>


<p>You should see the two chromosomes of <i>V. cholerae</i> on the left-hand side of the figure (named "gi|...") and then the alignment of our assembly to it. The color of the alignments are the bitscores, the red is the most confident and black, the least confident.</p>
<p>We will visualize the assembly against the <i>V. cholerae</i> reference using Circoletto.</p>


<b>Q10. Do you think that our genome is similar to the reference genome?</b><br>
<p>First, filter out contigs shorter than 500 bp:</p>
<b>Q11. Are there scaffolds/contigs that do not or only partially map to the reference genome?</b><br>


<b>Q12. What do you think the region on chromosome 2 (the small one) with many small low-confident mappings could be? Hint: See the [https://www.nature.com/articles/35020000 V. cholerae genome paper] and search for "V. cholerae integron island"?</b><br>
<pre>
fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>


<HR>
<p>On your local machine, open a browser and go to:</p>


<p>[https://bat.infspire.org/circoletto/ Circoletto]</p>


<H3>Try to assemble the genome using SPAdes</H3>
<p>Open the filtered assembly in a text editor on the server, for example:</p>


<p>There is a lot of discussions on which assembler is better than others. [http://bioinf.spbau.ru/spades SPAdes] is definitely one of the best ones. SPAdes will run error correction and use multiple k-mers at the same time when it is doing the assembly. Try it out, it is in the bin folder - try to figure out the commands you need to write to run it and compare it with the output from SOAPdenovo (hint: you can use Quast for the comparison).</p>
<pre>
<pre>
spades.py -h
gedit default_final.contigs_filtered_500.fa &
</pre>
</pre>


<!--
<p>Copy–paste the FASTA content into the <b>“Query fasta”</b> box on the Circoletto page.</p>


<p>To run it:</p>
<p>Then open the reference genome:</p>
 
<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>
</pre>
ln -s Vchol-001_6.pair1.cor.truncated.gz Vchol-001_6.pair1.cor.truncated.fastq.gz
ln -s Vchol-001_6.pair2.cor.truncated.gz Vchol-001_6.pair2.cor.truncated.fastq.gz


spades.py --careful --pe1-1 Vchol-001_6.pair1.cor.truncated.fastq.gz --pe1-2 Vchol-001_6.pair2.cor.truncated.fastq.gz -o spades -t 1 -m 3 --cov-cutoff auto
<p>Copy–paste this into the <b>“Subject fasta”</b> box.</p>
</pre>
 
-->
<p>In the “Output” section, select <b>“ONLY show the best hit per query”</b>, then click <b>Submit to Circoletto</b>.</p>


<p>NB: The assembly takes 45 minutes to run with SPAdes, you can therefore use my assembly, try to calculate assembly stats (assemblathon_stats.pl) and run it in Quast together with the SOAPdenovo assembly. To get my assembly:</p>
<p>If Circoletto does not work, you can use this precomputed image:</p>


<pre>
<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
# you write the code from here
</pre>
</pre>
<!--
assemblathon_stats.pl spades/scaffolds.fasta > spades/scaffolds.stats
paste spades/scaffolds.stats asmK55.scafSeq.stats | less -S
# quite many fewer contigs for spades and higher mean contig length
-->


<HR>
<p>You should see the two <i>V. cholerae</i> chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).</p>


<H3>Assembly of PacBio and NanoPore data</H3>
<p><b>Q10. Does your assembled genome appear broadly similar to the reference genome?</b></p>


<p>For the PacBio and NanoPore data, we are going to use [https://github.com/marbl/canu canu]. Canu is a further development of the Celera assembler which was one of the assemblers used for the original human genome - here they used Sanger reads which are long and have low error rate. Canu has been changed to run with long reads with errors in them.<br>
<p><b>Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?</b></p>
Let's try to run it on some E. coli K12 data. Here we only run it with either PacBio or NanoPore, but it can also handle combinations of these and Illumina. The commands are quite slow<p>
 
<p><b>Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the <i>V. cholerae</i> genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</b></p>
 
<hr>
 
<h3>Try to assemble the genome using SPAdes (bonus)</h3>
 
<p>Different assemblers can perform very differently. <b>SPAdes</b> is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.</p>
 
<p>SPAdes: [https://ablab.github.io/spades/ SPAdes]</p>
 
<p>Check the help output:</p>


<pre>
<pre>
# pacbio
spades.py -h
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz .
</pre>
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4


# oxford nanopore
<p><b>Note:</b> A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.</p>
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz .
 
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
<p>Link to the SPAdes assembly:</p>
 
<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>
</pre>


<p>It is very likely going to take quite some time so you can copy my files:</p>
 
<h3>Annotation of a prokaryotic genome</h3>
 
<p>We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using <b>prodigal</b>.</p>
 
<p>Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]</p>
 
<p>The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:</p>


<pre>
<pre>
cp /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta .
prodigal \
cp /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .
  -f gff \
  -i [input genome in fasta] \
  -a [output proteins in fasta] \
  -o [output annotations in gff]
</pre>
</pre>


<p>Try to count the number of contigs and their length of them (you can see that from the headers):</p>
<p>GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]</p>
 
<p>Next, index the protein FASTA file:</p>


<pre>  
<pre>
grep ">" ecoli_nanopore.contigs.fasta
samtools faidx ecoli_pacbio.contigs.aa
grep ">" ecoli_pacbio.contigs.fasta
</pre>
</pre>
<p>Extract the protein sequence for gene ID <code>tig00000001_4582</code>:</p>


<pre>
<pre>
python quast.py ecoli_pacbio.contigs.fasta ecoli_nanopore.contigs.fasta  -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
firefox quast_results/latest/report.html &
</pre>
</pre>


<p><b>Q13. How good are the assemblies using Nanopore and pacbio?</b></p>
<p>Use BLASTP against the nr database:</p>
 
<p>[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]</p>


<HR>
<p>Paste the sequence and run BLASTP.</p>
<!--
<H3>Illumina assembly correction (optional)</H3>
<p>Let us instead try to use [http://www.sanger.ac.uk/science/tools/reapr Reapr]  to evaluate the internal quality of the assembly. It will map our reads back to the assembly and try to identify misassemblies from it. It works best using long insert libraries (>1000 bp) and ours is not that large, but we can try anyway. First, it will map our reads to the assembly using <a href="http://www.sanger.ac.uk/science/tools/smalt-0">smalt</a> and afterward use the read mapping to calculate Fragment Coverage Distribution (FCD) to identify breakpoints in the assembly. <b>It takes a bit of time to run so you could do the Visualization using circoletto while it is running. Eg, run these commands and continue while they are running (open a new terminal or run it in "screen").</b></p>


<pre>
<p><b>Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?</b></p>
reapr facheck Vchol-001.best.fa Vchol-001.best.facheck
gzip -dc Vchol-001_6.pair1.cor.truncated.gz > Vchol-001_6.pair1.cor.truncated
gzip -dc Vchol-001_6.pair2.cor.truncated.gz > Vchol-001_6.pair2.cor.truncated
reapr smaltmap Vchol-001.best.facheck.fa Vchol-001_6.pair1.cor.truncated Vchol-001_6.pair2.cor.truncated Vchol-001.best.bam
reapr pipeline Vchol-001.best.facheck.fa Vchol-001.best.bam reapr_results
</pre>


<p>When it is done will have generated a summary (reapr_results/05.summary.report.txt) and a broken assembly (reapr_results/04.break.broken_assembly.fa).</p>
<hr>
-->


<p>Please find answers here: [[Denovo_solution|Denovo_solution]]</p>


Please find answers [[Denovo_solution|here]]
<hr>


<HR>
<p><b>Congratulations, you finished the exercise!</b></p>
<p>Congratulations you finished the exercise!</p>

Latest revision as of 15:27, 16 December 2025

Overview

First:

  1. Navigate to your home directory.
  2. Create a directory called denovo.
  3. Navigate to the directory you just created.

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

  1. Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).
  2. Count k-mers and estimate genome size.
  3. Correct reads using Musket.
  4. Determine insert size of paired-end reads.
  5. Run de novo assembly using MEGAHIT.
  6. Calculate assembly statistics.
  7. Plot coverage and length histograms of the assembly.
  8. Evaluate the assembly quality.
  9. Visualize the assembly using Circoletto.
  10. (Bonus) Try assembling the genome with SPAdes.
  11. Annotation of a prokaryotic genome.

FastQC and trimming

Make sure you are in the denovo directory you created. You can double-check with:

pwd

Copy the sequencing data:

cp /home/projects/22126_NGS/exercises/denovo/vchol/* .

Run FastQC on the reads:

mkdir fastqc
fastqc -o fastqc *.txt.gz

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to your local computer and open them in a web browser. For example:

scp stud0XX@pupilX:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .

Replace stud0XX with your student ID and pupilX with the compute node you are working on. The files will be copied to your current local directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The --basename option defines the output prefix and --gzip compresses the output.

AdapterRemoval \
  --file1 Vchol-001_6_1_sequence.txt.gz \
  --file2 Vchol-001_6_2_sequence.txt.gz \
  --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
  --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
  --qualitybase 64 \
  --basename Vchol-001_6 \
  --gzip \
  --trimqualities \
  --minquality 20 \
  --minlength 40

When it finishes, inspect Vchol-001_6.settings for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes discarded.gz, pair1.truncated.gz, pair2.truncated.gz, and singleton.truncated.gz. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
fastx_readlength.sh Vchol-001_6.pair2.truncated.gz

Genome size estimation

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.

Manual: jellyfish

gzip -dc Vchol-001_6.pair*.truncated.gz \
  | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

jellyfish histo Vchol-001 > Vchol-001.histo

Start R:

R

Then paste:

dat = read.table("Vchol-001.histo")
barplot(dat[,2],
        xlim=c(0,150),
        ylim=c(0,5e5),
        ylab="No of kmers",
        xlab="Counts of a k-mer",
        names.arg=dat[,1],
        cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)

The plot shows:

  • x-axis: how many times a k-mer occurs (its count)
  • y-axis: number of distinct k-mers with that count

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

N = (M * L) / (L - K + 1)
Genome_size = T / N
  • N = depth (coverage)
  • M = k-mer peak (from the histogram)
  • K = k-mer size (here: 15)
  • L = average read length (from fastx_readlength)
  • T = total number of bases (from fastx_readlength)

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?


Error correction

We will correct errors in the reads using Musket.

Musket: Musket

First, get the number of distinct k-mers (needed for memory allocation in Musket):

jellyfish stats Vchol-001

Use the reported number of distinct k-mers (here an example: 8423098) in the Musket command:

musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
  Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1

The output files are named Vchol-001_6.cor.0 and Vchol-001_6.cor.1. Rename them:

mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz

If this takes too long, you can copy precomputed corrected reads:

cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .

De novo assembly with MEGAHIT

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

export OMP_NUM_THREADS=4

Run MEGAHIT with k=35:

megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list 35 \
  -t 4 \
  -m 2000000000 \
  -o 35

When finished, you should have 35/final.contigs.fa. Compress it:

gzip 35/final.contigs.fa

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq

Index the assembly and map:

bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
  | samtools view -Sb - > Vchol_35bp.bam

Extract insert sizes (TLEN field, column 9):

samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt

Start R:

R

Then paste:

a   = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn  = quantile(a.v, seq(0,1,0.05))[4]
mx  = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])   # mean insert size
sd(a.v[a.v >= mn & a.v <= mx])     # standard deviation

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

Google sheet for k-mer assignment

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing [KMER] below:

export OMP_NUM_THREADS=4
megahit \
  -1 Vchol-001_6.pair1.cor.truncated.fq.gz \
  -2 Vchol-001_6.pair2.cor.truncated.fq.gz \
  --k-list [KMER] \
  -t 4 \
  -m 2000000000 \
  -o [KMER]

gzip [KMER]/final.contigs.fa

Compute assembly statistics using QUAST:

quast.py \
 [KMER]/final.contigs.fa.gz \
 --threads 1 \
 -o [KMER]/quast

Open the file [KMER]/quast/report.txt (or report.tsv) and record the following values in the Google sheet for your k-mer:

  • Number of contigs (≥ 500 bp)
  • Total assembly length
  • Largest contig
  • N50

As a class, compare results across k-mer sizes and discuss which k-mer produces the most reasonable assembly and why.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?


Coverage of the assembly

We will now calculate per-contig coverage and lengths, and visualize them in R.

zcat default_final.contigs.fa.gz | fastx_megahit.sh --i /dev/stdin > default_finalt.cov

Start R:

R

Then paste:

library(plotrix)
dat = read.table("default_finalt.cov", sep="\t")

par(mfrow=c(1,2))
weighted.hist(w = dat[,2],
              x = dat[,1],
              breaks = seq(0,100,1),
              main = "Weighted coverage",
              xlab = "Contig coverage")
hist(dat[,1],
     xlim   = c(0,100),
     breaks = seq(0,1000,1),
     main   = "Raw coverage",
     xlab   = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)

# Scaffold lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])),
        xlab = "# Scaffold",
        ylab = "Length",
        main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")

View the plots:

evince best.coverage.pdf &
evince scaffold.lengths.pdf &

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?


Assembly evaluation

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: quast

Run QUAST against the V. cholerae reference genome:

quast.py \
  default_final.contigs.fa.gz \
  --threads 1 \
  -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
mkdir fastqc
fastqc -o fastqc *.txt.gz

If you are using MobaXterm, you can open the HTML files directly from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to your local computer and open them in a web browser. For example:

scp stud0XX@pupilX:denovo/quast_results/latest/report.html .

Replace stud0XX with your student ID and pupilX with the compute node you are working on. The files will be copied to your current local directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?


Visualization using Circoletto

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa

On your local machine, open a browser and go to:

Circoletto

Open the filtered assembly in a text editor on the server, for example:

gedit default_final.contigs_filtered_500.fa &

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: V. cholerae genome paper


Try to assemble the genome using SPAdes (bonus)

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: SPAdes

Check the help output:

spades.py -h

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST


Annotation of a prokaryotic genome

We will annotate genes in ecoli_pacbio.contigs.fasta using prodigal.

Prodigal: prodigal

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

prodigal \
  -f gff \
  -i [input genome in fasta] \
  -a [output proteins in fasta] \
  -o [output annotations in gff]

GFF format: GFF format description

Next, index the protein FASTA file:

samtools faidx ecoli_pacbio.contigs.aa

Extract the protein sequence for gene ID tig00000001_4582:

samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582

Use BLASTP against the nr database:

BLAST for proteins

Paste the sequence and run BLASTP.

Q14. Which protein (function) does tig00000001_4582 correspond to?


Please find answers here: Denovo_solution


Congratulations, you finished the exercise!