Denovo exercise: Difference between revisions
No edit summary |
No edit summary |
||
| Line 1: | Line 1: | ||
< | <h2>Overview</h2> | ||
<p>First:</p> | |||
<ol> | |||
<li>Navigate to your home directory.</li> | |||
<li>Create a directory called <code>denovo</code>.</li> | |||
<li>Navigate to the directory you just created.</li> | |||
</ol> | |||
<p>In this exercise we will perform a <b>de novo assembly</b> of Illumina paired-end reads. The data is from a <i>Vibrio cholerae</i> strain isolated in Nepal. You will:</p> | |||
< | |||
< | |||
< | |||
< | |||
</ | |||
<ol> | |||
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li> | |||
<li>Count k-mers and estimate genome size.</li> | |||
<li>Correct reads using Musket.</li> | |||
<li>Determine insert size of paired-end reads.</li> | |||
<li>Run de novo assembly using MEGAHIT.</li> | |||
<li>Calculate assembly statistics.</li> | |||
<li>Plot coverage and length histograms of the assembly.</li> | |||
<li>Evaluate the assembly quality.</li> | |||
<li>Visualize the assembly using Circoletto.</li> | |||
<li>(Bonus) Try assembling the genome with SPAdes.</li> | |||
<li>Compare assemblies from PacBio and Nanopore data.</li> | |||
</ol> | |||
< | <hr> | ||
< | <h3>FastQC and trimming</h3> | ||
< | <p>Make sure you are in the <code>denovo</code> directory you created. You can double-check with:</p> | ||
<pre> | <pre> | ||
pwd | pwd | ||
</pre> | </pre> | ||
<p>Copy the sequencing data:</p> | |||
<pre> | <pre> | ||
| Line 40: | Line 40: | ||
</pre> | </pre> | ||
<p> | <p>Run FastQC on the reads:</p> | ||
<pre> | <pre> | ||
mkdir fastqc | mkdir fastqc | ||
| Line 46: | Line 47: | ||
</pre> | </pre> | ||
<p>Open the reports in | <p>Open the reports in Firefox:</p> | ||
<pre> | <pre> | ||
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html & | firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html & | ||
</pre> | </pre> | ||
<p>There are | <p>There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:</p> | ||
<pre> | <pre> | ||
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py | gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py | ||
</pre> | </pre> | ||
<p> | <p><b>Q1. Which quality encoding format is used?</b></p> | ||
<p>Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.</p> | |||
<pre> | <pre> | ||
AdapterRemoval --file1 Vchol-001_6_1_sequence.txt.gz --file2 Vchol-001_6_2_sequence.txt.gz --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG --qualitybase 64 --basename Vchol-001_6 --gzip --trimqualities --minquality 20 --minlength 40 | AdapterRemoval \ | ||
--file1 Vchol-001_6_1_sequence.txt.gz \ | |||
--file2 Vchol-001_6_2_sequence.txt.gz \ | |||
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \ | |||
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \ | |||
--qualitybase 64 \ | |||
--basename Vchol-001_6 \ | |||
--gzip \ | |||
--trimqualities \ | |||
--minquality 20 \ | |||
--minlength 40 | |||
</pre> | </pre> | ||
<b>Q1A. | <p>When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).</p> | ||
<p><b>Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)</b></p> | |||
<p>Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:</p> | |||
<pre> | <pre> | ||
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz | python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz | ||
| Line 71: | Line 88: | ||
</pre> | </pre> | ||
< | <hr> | ||
< | <h3>Genome size estimation</h3> | ||
<p> | <p>We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use <b>jellyfish</b> to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.</p> | ||
<p>Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]</p> | |||
<pre> | <pre> | ||
gzip -dc Vchol-001_6.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0 | gzip -dc Vchol-001_6.pair*.truncated.gz \ | ||
| jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0 | |||
jellyfish histo Vchol-001 > Vchol-001.histo | jellyfish histo Vchol-001 > Vchol-001.histo | ||
</pre> | </pre> | ||
<p>Start R:</p> | |||
<pre> | <pre> | ||
R | R | ||
</pre> | </pre> | ||
<p>Then paste:</p> | |||
<pre> | <pre> | ||
dat=read.table("Vchol-001.histo") | dat = read.table("Vchol-001.histo") | ||
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8) | barplot(dat[,2], | ||
xlim=c(0,150), | |||
ylim=c(0,5e5), | |||
ylab="No of kmers", | |||
xlab="Counts of a k-mer", | |||
names.arg=dat[,1], | |||
cex.names=0.8) | |||
dev.print("Vchol-001.histo.pdf", device=pdf) | dev.print("Vchol-001.histo.pdf", device=pdf) | ||
</pre> | </pre> | ||
<p>The plot shows how many times a k-mer | <p>The plot shows:</p> | ||
<ul> | |||
<li>x-axis: how many times a k-mer occurs (its count)</li> | |||
<li>y-axis: number of distinct k-mers with that count</li> | |||
</ul> | |||
< | <p>K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.</p> | ||
<p> | <p><b>Q2. Where is the k-mer coverage peak (approximately)?</b></p> | ||
<b>Q3. What is | |||
<p>We can estimate genome size using:</p> | |||
<pre> | |||
N = (M * L) / (L - K + 1) | |||
Genome_size = T / N | |||
</pre> | |||
<ul> | |||
<li><b>N</b> = depth (coverage)</li> | |||
<li><b>M</b> = k-mer peak (from the histogram)</li> | |||
<li><b>K</b> = k-mer size (here: 15)</li> | |||
<li><b>L</b> = average read length (from fastx_readlength)</li> | |||
<li><b>T</b> = total number of bases (from fastx_readlength)</li> | |||
</ul> | |||
<p>Compute the estimated genome size and compare with the known <i>V. cholerae</i> genome (~4 Mb). You should be within roughly ±10%.</p> | |||
<p><b>Q3. What is your estimated genome size?</b></p> | |||
<hr> | <hr> | ||
< | <h3>Error correction</h3> | ||
<p> | <p>We will correct errors in the reads using <b>Musket</b>.</p> | ||
<p>Musket: [http://musket.sourceforge.net/homepage.htm Musket]</p> | |||
<p>First, get the number of distinct k-mers (needed for memory allocation in Musket):</p> | |||
<pre> | <pre> | ||
| Line 112: | Line 166: | ||
</pre> | </pre> | ||
<p> | <p>Use the reported number of distinct k-mers (here an example: <code>8423098</code>) in the Musket command:</p> | ||
<pre> | <pre> | ||
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1 | musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \ | ||
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1 | |||
</pre> | </pre> | ||
<p>The files | <p>The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:</p> | ||
<pre> | <pre> | ||
| Line 125: | Line 180: | ||
</pre> | </pre> | ||
<p><b>If this takes | <p><b>If this takes too long, you can copy precomputed corrected reads:</b></p> | ||
<pre> | <pre> | ||
| Line 131: | Line 186: | ||
</pre> | </pre> | ||
< | <hr> | ||
<h3><i>De novo</i> assembly with MEGAHIT</h3> | |||
< | <p>We will now assemble the corrected reads using <b>MEGAHIT</b> (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.</p> | ||
<p> | <p>First, set the number of threads:</p> | ||
< | |||
<pre> | |||
export OMP_NUM_THREADS=4 | |||
</pre> | |||
<p> | <p>Run MEGAHIT with k=35:</p> | ||
<pre> | <pre> | ||
/home/ctools/MEGAHIT-1.2.9/bin/megahit \ | |||
/home/ctools/MEGAHIT-1.2.9/bin/megahit -1 Vchol-001_6.pair1.cor.truncated.fq.gz -2 Vchol-001_6.pair2.cor.truncated.fq.gz | -1 Vchol-001_6.pair1.cor.truncated.fq.gz \ | ||
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \ | |||
--k-list 35 \ | |||
-t 4 \ | |||
-m 2000000000 \ | |||
-o 35 | |||
</pre> | </pre> | ||
<p> | <p>When finished, you should have <code>35/final.contigs.fa</code>. Compress it:</p> | ||
<pre> | <pre> | ||
| Line 153: | Line 216: | ||
</pre> | </pre> | ||
<p> we | <p>To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):</p> | ||
<pre> | <pre> | ||
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 | zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq | ||
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 | zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq | ||
</pre> | |||
<p>Index the assembly and map:</p> | |||
<pre> | |||
bwa index 35/final.contigs.fa.gz | bwa index 35/final.contigs.fa.gz | ||
bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \ | |||
| samtools view -Sb - > Vchol_35bp.bam | |||
</pre> | |||
<p>Extract insert sizes (TLEN field, column 9):</p> | |||
<pre> | |||
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt | samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt | ||
</pre> | |||
</ | <p>Start R:</p> | ||
<pre> | <pre> | ||
R | R | ||
</pre> | </pre> | ||
<p>Then paste:</p> | |||
<pre> | <pre> | ||
a = read.table("initial.insertsizes.txt") | a = read.table("initial.insertsizes.txt") | ||
a.v = a[a[,1]>0,1] | a.v = a[a[,1] > 0, 1] | ||
mn = quantile(a.v, seq(0,1,0.05))[4] | mn = quantile(a.v, seq(0,1,0.05))[4] | ||
mx = quantile(a.v, seq(0,1,0.05))[18] | mx = quantile(a.v, seq(0,1,0.05))[18] | ||
mean(a.v[a.v >= mn & a.v <= mx]) | mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size | ||
sd(a.v[a.v >= mn & a.v <= mx]) | sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation | ||
</pre> | </pre> | ||
<b>Q4. What are the mean insert size and standard deviation of | <p><b>Q4. What are the mean insert size and standard deviation of the library?</b></p> | ||
<p>Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:</p> | |||
<p>[https://docs.google.com/spreadsheets/d/1DjfbriqIEuCq8FMRLlt69ITpgm2NKcmZzun11-T43HQ/edit?usp=sharing Google sheet for k-mer assignment]</p> | |||
<p> | <p>Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:</p> | ||
<pre> | |||
export OMP_NUM_THREADS=4 | |||
/home/ctools/MEGAHIT-1.2.9/bin/megahit \ | |||
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \ | |||
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \ | |||
--k-list [KMER] \ | |||
-t 4 \ | |||
-m 2000000000 \ | |||
-o [KMER] | |||
gzip [KMER]/final.contigs.fa | |||
gzip | |||
</pre> | </pre> | ||
<p> | <p>Compute assembly statistics using the Assemblathon script:</p> | ||
<pre> | <pre> | ||
assemblathon_stats.pl [KMER | assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats | ||
</pre> | </pre> | ||
<p>Open | <p>Open <code>[KMER]/final.contigs.stats</code> and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.</p> | ||
<p>Copy the best assembly to your folder, or use a precomputed multi-k assembly:</p> | |||
<pre> | <pre> | ||
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz . | cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz . | ||
| Line 207: | Line 291: | ||
</pre> | </pre> | ||
<b>Q5. How does the N50 of the assembly compare | <p><b>Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?</b></p> | ||
< | <p><b>Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?</b></p> | ||
<hr> | |||
< | <h3>Coverage of the assembly</h3> | ||
< | <p>We will now calculate per-contig coverage and lengths, and visualize them in R.</p> | ||
<pre> | <pre> | ||
zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov | zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov | ||
</pre> | </pre> | ||
<p>Start R:</p> | |||
<pre> | <pre> | ||
R | R | ||
</pre> | </pre> | ||
<p>Then paste:</p> | |||
<pre> | <pre> | ||
library(plotrix) | library(plotrix) | ||
dat=read.table("default_finalt.cov", sep="\t") | dat = read.table("default_finalt.cov", sep="\t") | ||
par(mfrow=c(1,2)) | par(mfrow=c(1,2)) | ||
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="Weighted coverage", xlab="Contig coverage") | weighted.hist(w = dat[,2], | ||
hist(dat[,1], xlim=c(0,100), breaks=seq(0,1000,1), main="Raw coverage", xlab="Contig coverage") | x = dat[,1], | ||
breaks = seq(0,100,1), | |||
main = "Weighted coverage", | |||
xlab = "Contig coverage") | |||
hist(dat[,1], | |||
xlim = c(0,100), | |||
breaks = seq(0,1000,1), | |||
main = "Raw coverage", | |||
xlab = "Contig coverage") | |||
dev.print("best.coverage.pdf", device=pdf) | dev.print("best.coverage.pdf", device=pdf) | ||
# | # Scaffold lengths | ||
par(mfrow=c(1,1)) | par(mfrow=c(1,1)) | ||
barplot(rev(sort(dat[,2])), xlab="# Scaffold", ylab="Length", main="Scaffold Lengths") | barplot(rev(sort(dat[,2])), | ||
xlab = "# Scaffold", | |||
ylab = "Length", | |||
main = "Scaffold Lengths") | |||
dev.print("scaffold.lengths.pdf", device=pdf) | dev.print("scaffold.lengths.pdf", device=pdf) | ||
q(save="no") | q(save="no") | ||
</pre> | </pre> | ||
<p>View the plots:</p> | |||
<pre> | <pre> | ||
| Line 249: | Line 347: | ||
</pre> | </pre> | ||
<p>The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.</p> | |||
<p> | <p><b>Q7. Why might some short contigs have much <i>higher</i> coverage than the main assembly?</b></p> | ||
<b> | <p><b>Q8. Why might some short contigs have much <i>lower</i> coverage than the main assembly?</b></p> | ||
< | <hr> | ||
< | <h3>Assembly evaluation</h3> | ||
< | <p>We will use <b>QUAST</b> to evaluate the assembly using various reference-based metrics.</p> | ||
< | |||
< | <p>QUAST: [http://bioinf.spbau.ru/quast quast]</p> | ||
< | <p>Run QUAST against the <i>V. cholerae</i> reference genome:</p> | ||
< | <pre> | ||
python /home/ctools/quast-quast_5.0.2/quast.py \ | |||
default_final.contigs.fa.gz \ | |||
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa | |||
firefox quast_results/latest/report.html & | |||
firefox quast_results/latest/report.html & | |||
</pre> | </pre> | ||
<b>Q9. | <p><b>Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?</b></p> | ||
< | <hr> | ||
< | <h3>Visualization using Circoletto</h3> | ||
<p> | <p>We will visualize the assembly against the <i>V. cholerae</i> reference using Circoletto.</p> | ||
<p>First, filter out contigs shorter than 500 bp:</p> | |||
<pre> | <pre> | ||
| Line 285: | Line 386: | ||
</pre> | </pre> | ||
<p> | <p>On your local machine, open a browser and go to:</p> | ||
<p>[https://bat.infspire.org/circoletto/ Circoletto]</p> | |||
<p>Open the filtered assembly in a text editor on the server, for example:</p> | |||
<pre> | |||
gedit default_final.contigs.fa.gz.filtered_500.fa & | |||
</pre> | |||
<p>Copy–paste the FASTA content into the <b>“Query fasta”</b> box on the Circoletto page.</p> | |||
<p>Then open the reference genome:</p> | |||
<pre> | |||
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa & | |||
</pre> | |||
<p>Copy–paste this into the <b>“Subject fasta”</b> box.</p> | |||
<p>In the “Output” section, select <b>“ONLY show the best hit per query”</b>, then click <b>Submit to Circoletto</b>.</p> | |||
<p>If Circoletto does not work, you can use this precomputed image:</p> | |||
<pre> | |||
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png | |||
</pre> | |||
<p>You should see the two <i>V. cholerae</i> chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).</p> | |||
< | |||
<p> | <p><b>Q10. Does your assembled genome appear broadly similar to the reference genome?</b></p> | ||
<b> | <p><b>Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?</b></p> | ||
< | <p><b>Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the <i>V. cholerae</i> genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</b></p> | ||
< | <hr> | ||
< | <h3>Try to assemble the genome using SPAdes (bonus)</h3> | ||
< | <p>Different assemblers can perform very differently. <b>SPAdes</b> is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.</p> | ||
<p>SPAdes: [http://bioinf.spbau.ru/spades SPAdes]</p> | |||
< | <p>Check the help output:</p> | ||
<pre> | <pre> | ||
spades.py -h | spades.py -h | ||
</pre> | </pre> | ||
< | <p><b>Note:</b> A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.</p> | ||
<p> | <p>Link to the SPAdes assembly:</p> | ||
spades. | <pre> | ||
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta | |||
# from here you can compute stats and run QUAST | |||
</pre> | </pre> | ||
< | <hr> | ||
< | <h3>Assembly of PacBio and Nanopore data</h3> | ||
< | |||
< | <p>We now look at assemblies from long-read technologies using <b>canu</b>, a long-read assembler derived from the Celera assembler (used in the original human genome project).</p> | ||
< | <p>Canu: [https://github.com/marbl/canu canu]</p> | ||
<p> | <p>We will assemble <i>E. coli</i> K-12 using either PacBio or Oxford Nanopore reads.</p> | ||
<pre> | <pre> | ||
# | # PacBio | ||
ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz . | ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz . | ||
canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4 | canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4 | ||
# | # Oxford Nanopore | ||
ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz . | ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz . | ||
canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4 | canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4 | ||
</pre> | </pre> | ||
<p> | <p>These runs can take a long time. You can instead link to the precomputed assemblies:</p> | ||
<pre> | <pre> | ||
| Line 362: | Line 472: | ||
</pre> | </pre> | ||
<p> | <p>Count number of contigs and inspect their lengths (from FASTA headers):</p> | ||
<pre> | <pre> | ||
grep ">" [input fasta here] | grep ">" [input fasta here] | ||
</pre> | </pre> | ||
<p>Or, if compressed:</p> | |||
<pre> | |||
<pre> | |||
zgrep ">" [zipped input fasta here] | zgrep ">" [zipped input fasta here] | ||
</pre> | </pre> | ||
<p>Run QUAST on both assemblies:</p> | |||
<pre> | <pre> | ||
python quast.py ecoli_pacbio.contigs.fasta ecoli_nanopore.contigs.fasta | python quast.py \ | ||
firefox quast_results/latest/report.html & | ecoli_pacbio.contigs.fasta \ | ||
ecoli_nanopore.contigs.fasta \ | |||
-R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa | |||
firefox quast_results/latest/report.html & | |||
</pre> | </pre> | ||
<p><b>Q13. | <p><b>Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?</b></p> | ||
< | <hr> | ||
We will | <h3>Annotation of a prokaryotic genome</h3> | ||
<p>We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using <b>prodigal</b>.</p> | |||
<p>Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]</p> | |||
<p>The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:</p> | |||
<pre> | <pre> | ||
/home/ctools/prokka/binaries/linux/prodigal -f gff -i [input genome in fasta] -a [output proteins in fasta] -o [output annotations in gff] | /home/ctools/prokka/binaries/linux/prodigal \ | ||
-f gff \ | |||
-i [input genome in fasta] \ | |||
-a [output proteins in fasta] \ | |||
-o [output annotations in gff] | |||
</pre> | </pre> | ||
<p>GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]</p> | |||
<p>Next, index the protein FASTA file:</p> | |||
<pre> | <pre> | ||
samtools faidx ecoli_pacbio.contigs.aa | samtools faidx ecoli_pacbio.contigs.aa | ||
</pre> | </pre> | ||
<p>Extract the protein sequence for gene ID <code>tig00000001_4582</code>:</p> | |||
<pre> | <pre> | ||
samtools faidx ecoli_pacbio.contigs.aa | samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582 | ||
</pre> | </pre> | ||
<p>Use BLASTP against the nr database:</p> | |||
<p>[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]</p> | |||
<p>Paste the sequence and run BLASTP.</p> | |||
<p> | |||
< | <p><b>Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?</b></p> | ||
</ | |||
< | <hr> | ||
<p>Please find answers here: [[Denovo_solution|Denovo_solution]]</p> | |||
<hr> | |||
< | <p><b>Congratulations, you finished the exercise!</b></p> | ||
< | |||
Latest revision as of 11:09, 26 November 2025
Overview
First:
- Navigate to your home directory.
- Create a directory called
denovo. - Navigate to the directory you just created.
In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:
- Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).
- Count k-mers and estimate genome size.
- Correct reads using Musket.
- Determine insert size of paired-end reads.
- Run de novo assembly using MEGAHIT.
- Calculate assembly statistics.
- Plot coverage and length histograms of the assembly.
- Evaluate the assembly quality.
- Visualize the assembly using Circoletto.
- (Bonus) Try assembling the genome with SPAdes.
- Compare assemblies from PacBio and Nanopore data.
FastQC and trimming
Make sure you are in the denovo directory you created. You can double-check with:
pwd
Copy the sequencing data:
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
Run FastQC on the reads:
mkdir fastqc fastqc -o fastqc *.txt.gz
Open the reports in Firefox:
firefox fastqc/Vchol-001_6_1_sequence_fastqc.html fastqc/Vchol-001_6_2_sequence_fastqc.html &
There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:
gzip -dc Vchol-001_6_1_sequence.txt.gz | fastx_detect_fq.py
Q1. Which quality encoding format is used?
Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The --basename option defines the output prefix and --gzip compresses the output.
AdapterRemoval \ --file1 Vchol-001_6_1_sequence.txt.gz \ --file2 Vchol-001_6_2_sequence.txt.gz \ --adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \ --adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \ --qualitybase 64 \ --basename Vchol-001_6 \ --gzip \ --trimqualities \ --minquality 20 \ --minlength 40
When it finishes, inspect Vchol-001_6.settings for trimming statistics (how many reads were trimmed, discarded, etc.).
Q1A. The output includes discarded.gz, pair1.truncated.gz, pair2.truncated.gz, and singleton.truncated.gz. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)
Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:
python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair1.truncated.gz --gz python2 /home/ctools/misc_scripts/fastx_readlength.py --i Vchol-001_6.pair2.truncated.gz --gz
Genome size estimation
We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram.
Manual: jellyfish
gzip -dc Vchol-001_6.pair*.truncated.gz \ | jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0 jellyfish histo Vchol-001 > Vchol-001.histo
Start R:
R
Then paste:
dat = read.table("Vchol-001.histo")
barplot(dat[,2],
xlim=c(0,150),
ylim=c(0,5e5),
ylab="No of kmers",
xlab="Counts of a k-mer",
names.arg=dat[,1],
cex.names=0.8)
dev.print("Vchol-001.histo.pdf", device=pdf)
The plot shows:
- x-axis: how many times a k-mer occurs (its count)
- y-axis: number of distinct k-mers with that count
K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.
Q2. Where is the k-mer coverage peak (approximately)?
We can estimate genome size using:
N = (M * L) / (L - K + 1) Genome_size = T / N
- N = depth (coverage)
- M = k-mer peak (from the histogram)
- K = k-mer size (here: 15)
- L = average read length (from fastx_readlength)
- T = total number of bases (from fastx_readlength)
Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.
Q3. What is your estimated genome size?
Error correction
We will correct errors in the reads using Musket.
Musket: Musket
First, get the number of distinct k-mers (needed for memory allocation in Musket):
jellyfish stats Vchol-001
Use the reported number of distinct k-mers (here an example: 8423098) in the Musket command:
musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \ Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
The output files are named Vchol-001_6.cor.0 and Vchol-001_6.cor.1. Rename them:
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
If this takes too long, you can copy precomputed corrected reads:
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
De novo assembly with MEGAHIT
We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.
First, set the number of threads:
export OMP_NUM_THREADS=4
Run MEGAHIT with k=35:
/home/ctools/MEGAHIT-1.2.9/bin/megahit \ -1 Vchol-001_6.pair1.cor.truncated.fq.gz \ -2 Vchol-001_6.pair2.cor.truncated.fq.gz \ --k-list 35 \ -t 4 \ -m 2000000000 \ -o 35
When finished, you should have 35/final.contigs.fa. Compress it:
gzip 35/final.contigs.fa
To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
Index the assembly and map:
bwa index 35/final.contigs.fa.gz bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \ | samtools view -Sb - > Vchol_35bp.bam
Extract insert sizes (TLEN field, column 9):
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
Start R:
R
Then paste:
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
Q4. What are the mean insert size and standard deviation of the library?
Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:
Google sheet for k-mer assignment
Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing [KMER] below:
export OMP_NUM_THREADS=4 /home/ctools/MEGAHIT-1.2.9/bin/megahit \ -1 Vchol-001_6.pair1.cor.truncated.fq.gz \ -2 Vchol-001_6.pair2.cor.truncated.fq.gz \ --k-list [KMER] \ -t 4 \ -m 2000000000 \ -o [KMER] gzip [KMER]/final.contigs.fa
Compute assembly statistics using the Assemblathon script:
assemblathon_stats.pl [KMER]/final.contigs.fa > [KMER]/final.contigs.stats
Open [KMER]/final.contigs.stats and fill in your stats in the Google sheet for your k-mer. As a class, you can compare and identify an approximate “best” k-mer.
Copy the best assembly to your folder, or use a precomputed multi-k assembly:
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz . cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?
Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?
Coverage of the assembly
We will now calculate per-contig coverage and lengths, and visualize them in R.
zcat default_final.contigs.fa.gz | fastx_megahit.py --i /dev/stdin > default_finalt.cov
Start R:
R
Then paste:
library(plotrix)
dat = read.table("default_finalt.cov", sep="\t")
par(mfrow=c(1,2))
weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0,100,1),
main = "Weighted coverage",
xlab = "Contig coverage")
hist(dat[,1],
xlim = c(0,100),
breaks = seq(0,1000,1),
main = "Raw coverage",
xlab = "Contig coverage")
dev.print("best.coverage.pdf", device=pdf)
# Scaffold lengths
par(mfrow=c(1,1))
barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")
dev.print("scaffold.lengths.pdf", device=pdf)
q(save="no")
View the plots:
evince best.coverage.pdf & evince scaffold.lengths.pdf &
The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.
Q7. Why might some short contigs have much higher coverage than the main assembly?
Q8. Why might some short contigs have much lower coverage than the main assembly?
Assembly evaluation
We will use QUAST to evaluate the assembly using various reference-based metrics.
QUAST: quast
Run QUAST against the V. cholerae reference genome:
python /home/ctools/quast-quast_5.0.2/quast.py \ default_final.contigs.fa.gz \ -R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa firefox quast_results/latest/report.html &
Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?
Visualization using Circoletto
We will visualize the assembly against the V. cholerae reference using Circoletto.
First, filter out contigs shorter than 500 bp:
fastx_filterfasta.py --i default_final.contigs.fa.gz --min 500 gunzip default_final.contigs.fa.gz.filtered_500.fa.gz
On your local machine, open a browser and go to:
Open the filtered assembly in a text editor on the server, for example:
gedit default_final.contigs.fa.gz.filtered_500.fa &
Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.
Then open the reference genome:
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
Copy–paste this into the “Subject fasta” box.
In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.
If Circoletto does not work, you can use this precomputed image:
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).
Q10. Does your assembled genome appear broadly similar to the reference genome?
Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?
Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: V. cholerae genome paper
Try to assemble the genome using SPAdes (bonus)
Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.
SPAdes: SPAdes
Check the help output:
spades.py -h
Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.
Link to the SPAdes assembly:
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta # from here you can compute stats and run QUAST
Assembly of PacBio and Nanopore data
We now look at assemblies from long-read technologies using canu, a long-read assembler derived from the Celera assembler (used in the original human genome project).
Canu: canu
We will assemble E. coli K-12 using either PacBio or Oxford Nanopore reads.
# PacBio ln -s /home/projects/22126_NGS/exercises/denovo/canu/pacbio.fastq.gz . canu -p ecoli_pacbio -d ecoli_pacbio genomeSize=4.8m -pacbio-raw pacbio.fastq.gz -maxThreads=4 # Oxford Nanopore ln -s /home/projects/22126_NGS/exercises/denovo/canu/oxford.fasta.gz . canu -p ecoli_nanopore -d ecoli_nanopore genomeSize=4.8m -nanopore-raw oxford.fasta.gz -maxThreads=4
These runs can take a long time. You can instead link to the precomputed assemblies:
ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta . ln -s /home/projects/22126_NGS/exercises/denovo/canu/ecoli_nanopore.contigs.fasta .
Count number of contigs and inspect their lengths (from FASTA headers):
grep ">" [input fasta here]
Or, if compressed:
zgrep ">" [zipped input fasta here]
Run QUAST on both assemblies:
python quast.py \ ecoli_pacbio.contigs.fasta \ ecoli_nanopore.contigs.fasta \ -R /home/projects/22126_NGS/exercises/denovo/canu/ecoli_K12.fa firefox quast_results/latest/report.html &
Q13. Based on QUAST, how good are the PacBio and Nanopore assemblies (e.g. contiguity, errors, completeness)?
Annotation of a prokaryotic genome
We will annotate genes in ecoli_pacbio.contigs.fasta using prodigal.
Prodigal: prodigal
The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:
/home/ctools/prokka/binaries/linux/prodigal \ -f gff \ -i [input genome in fasta] \ -a [output proteins in fasta] \ -o [output annotations in gff]
GFF format: GFF format description
Next, index the protein FASTA file:
samtools faidx ecoli_pacbio.contigs.aa
Extract the protein sequence for gene ID tig00000001_4582:
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
Use BLASTP against the nr database:
Paste the sequence and run BLASTP.
Q14. Which protein (function) does tig00000001_4582 correspond to?
Please find answers here: Denovo_solution
Congratulations, you finished the exercise!