Denovo solution: Difference between revisions
No edit summary |
No edit summary |
||
| Line 1: | Line 1: | ||
<h2>Answers</h2> | |||
<p><b>Q1.</b> The reads are encoded in an <b>Illumina</b>-style quality format (Phred+64 in this dataset).</p> | |||
<p><b>Q1A.</b></p> | |||
<ul> | |||
<li><code>Vchol-001_6.discarded.gz</code>: reads that were discarded during trimming (too short after trimming and/or very low quality).</li> | |||
<li><code>Vchol-001_6.pair1.truncated.gz</code> and <code>Vchol-001_6.pair2.truncated.gz</code>: read pairs where <b>both mates passed</b> the trimming filters.</li> | |||
<li><code>Vchol-001_6.singleton.truncated.gz</code>: reads where <b>one mate passed</b> trimming but the other mate was discarded (so only one read from the pair survives).</li> | |||
</ul> | |||
<hr> | |||
<p><b>Q2.</b> The main k-mer coverage peak is at approximately <b>84×</b>.</p> | |||
<hr> | |||
<p><b>Q3.</b> Using the formulas:</p> | |||
<pre> | |||
N = (M * L) / (L - K + 1) | |||
Genome_size = T / N | |||
</pre> | |||
<ul> | |||
<li><b>M</b> (k-mer peak) ≈ 84</li> | |||
<li><b>L</b> (average read length) ≈ 99</li> | |||
<li><b>K</b> (k-mer size) = 15</li> | |||
<li><b>T</b> (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases</li> | |||
</ul> | |||
<p>Compute depth:</p> | |||
<pre> | |||
N = (84 * 99) / (99 - 15 + 1) | |||
= 8316 / 85 | |||
≈ 97.84× | |||
</pre> | |||
<p>Then genome size:</p> | |||
<pre> | |||
Genome_size = T / N | |||
= 426,245,061 / 97.84 | |||
≈ 4.35 Mb | |||
</pre> | |||
<p><b>Estimated genome size ≈ 4.35 Mb</b>, which is close to the expected ~4 Mb for <i>V. cholerae</i>.</p> | |||
<hr> | |||
<p><b>Q4.</b> From the insert-size analysis in R, we obtain:</p> | |||
<ul> | |||
<li><b>Mean insert size</b> ≈ <b>259 bp</b></li> | |||
<li><b>Standard deviation (SD)</b> ≈ <b>11 bp</b></li> | |||
</ul> | |||
<pre> | <hr> | ||
<p><b>Q5.</b> The “best” assembly (using multiple k-mers / default settings) has:</p> | |||
<ul> | |||
<li><b>N50 ≈ 179,846 bp</b></li> | |||
</ul> | |||
<p>The best fixed-k assembly we observed (e.g. at k=79) had:</p> | |||
<ul> | |||
<li><b>N50 ≈ 92,020 bp</b></li> | |||
</ul> | |||
<p>So the multi-k / default assembly has a substantially <b>higher N50</b> than the best fixed-k assembly.</p> | |||
<hr> | |||
<p><b>Q6.</b> In the QUAST comparison, there is effectively <b>one main large scaffold</b> that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).</p> | |||
<hr> | |||
<p><b>Q7.</b> Short contigs with <b>much higher coverage</b> than the bulk of the assembly are typically:</p> | |||
<ul> | |||
<li><b>Repeat regions</b> (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.</li> | |||
<li>Sometimes <b>misassemblies</b> where repeated segments have been collapsed into one contig, which inflates the apparent coverage.</li> | |||
</ul> | |||
<hr> | |||
<p><b>Q8.</b> Short contigs with <b>much lower coverage</b> than the main assembly are often:</p> | |||
<ul> | |||
<li><b>Contaminant sequences</b> from another species (few reads map, so coverage stays low).</li> | |||
<li><b>Misassemblies</b> or chimeric contigs that are only weakly supported.</li> | |||
<li>Very low-abundance regions (e.g. rare plasmid fragments or artifacts).</li> | |||
</ul> | |||
<hr> | |||
<p><b>Q9.</b> We cannot always fully trust “misassembly” calls from QUAST because:</p> | |||
<ul> | |||
<li>QUAST assumes the reference genome is the <b>truth</b>.</li> | |||
<li>Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.</li> | |||
<li>Thus, some reported “misassemblies” likely reflect <b>true variation</b> rather than assembler errors.</li> | |||
</ul> | |||
<hr> | |||
<p><b>Q10.</b> By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, <b>yes</b>, our genome appears broadly similar to the reference <i>V. cholerae</i> genome.</p> | |||
<hr> | |||
<p><b>Q11.</b> There are very few contigs that do not map well. The contig <code>K119.81</code>, for example, only maps partially.</p> | |||
<p>This could represent:</p> | |||
<ul> | |||
<li>A region present in our strain but absent (or highly divergent) in the reference genome, or</li> | |||
<li>A <b>misassembly</b> (chimeric contig or incorrectly joined segments).</li> | |||
</ul> | |||
<hr> | |||
<p><b>Q12.</b> The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the <b>integron island</b> in <i>V. cholerae</i>.</p> | |||
<p>This region:</p> | |||
<ul> | |||
<li>Contains many gene cassettes and repeats.</li> | |||
<li>Is difficult to assemble with short reads (leading to fragmentation and poor mapping).</li> | |||
<li>Acts as a hotspot where <i>V. cholerae</i> integrates new genes into its genome.</li> | |||
</ul> | |||
<p>See: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</p> | |||
<hr> | |||
<p><b>Q13.</b> Inspecting the number of contigs:</p> | |||
<pre> | |||
grep ">" ecoli_nanopore.contigs.fasta | grep ">" ecoli_nanopore.contigs.fasta | ||
grep ">" ecoli_pacbio.contigs.fasta | grep ">" ecoli_pacbio.contigs.fasta | ||
</pre> | </pre> | ||
The Nanopore assembly | <p>We see that:</p> | ||
<ul> | |||
<li>The <b>Nanopore assembly</b> has <b>2 contigs</b>.</li> | |||
<li>The <b>PacBio assembly</b> has <b>1 contig</b>.</li> | |||
</ul> | |||
<p>This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a <b>single-contig (closed) genome</b>, and Nanopore is nearly as good.</p> | |||
<hr> | |||
<p><b>Q14.</b> Running prodigal:</p> | |||
<pre> | <pre> | ||
/home/ctools/prokka/binaries/linux/prodigal -f gff -i ecoli_pacbio.contigs.fasta -a ecoli_pacbio.contigs.aa -o ecoli_pacbio.contigs.gff | /home/ctools/prokka/binaries/linux/prodigal \ | ||
-f gff \ | |||
-i ecoli_pacbio.contigs.fasta \ | |||
-a ecoli_pacbio.contigs.aa \ | |||
-o ecoli_pacbio.contigs.gff | |||
</pre> | </pre> | ||
<p>Then extracting and BLASTing <code>tig00000001_4582</code> reveals that this gene encodes <b>β-galactosidase (lacZ)</b> from <i>E. coli</i>.</p> | |||
<p>This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the [https://www.nobelprize.org/prizes/medicine/1965/summary/ Nobel Prize in Physiology or Medicine in 1965].</p> | |||
< | |||
Latest revision as of 11:18, 26 November 2025
Answers
Q1. The reads are encoded in an Illumina-style quality format (Phred+64 in this dataset).
Q1A.
Vchol-001_6.discarded.gz: reads that were discarded during trimming (too short after trimming and/or very low quality).Vchol-001_6.pair1.truncated.gzandVchol-001_6.pair2.truncated.gz: read pairs where both mates passed the trimming filters.Vchol-001_6.singleton.truncated.gz: reads where one mate passed trimming but the other mate was discarded (so only one read from the pair survives).
Q2. The main k-mer coverage peak is at approximately 84×.
Q3. Using the formulas:
N = (M * L) / (L - K + 1) Genome_size = T / N
- M (k-mer peak) ≈ 84
- L (average read length) ≈ 99
- K (k-mer size) = 15
- T (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases
Compute depth:
N = (84 * 99) / (99 - 15 + 1) = 8316 / 85 ≈ 97.84×
Then genome size:
Genome_size = T / N
= 426,245,061 / 97.84
≈ 4.35 Mb
Estimated genome size ≈ 4.35 Mb, which is close to the expected ~4 Mb for V. cholerae.
Q4. From the insert-size analysis in R, we obtain:
- Mean insert size ≈ 259 bp
- Standard deviation (SD) ≈ 11 bp
Q5. The “best” assembly (using multiple k-mers / default settings) has:
- N50 ≈ 179,846 bp
The best fixed-k assembly we observed (e.g. at k=79) had:
- N50 ≈ 92,020 bp
So the multi-k / default assembly has a substantially higher N50 than the best fixed-k assembly.
Q6. In the QUAST comparison, there is effectively one main large scaffold that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).
Q7. Short contigs with much higher coverage than the bulk of the assembly are typically:
- Repeat regions (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.
- Sometimes misassemblies where repeated segments have been collapsed into one contig, which inflates the apparent coverage.
Q8. Short contigs with much lower coverage than the main assembly are often:
- Contaminant sequences from another species (few reads map, so coverage stays low).
- Misassemblies or chimeric contigs that are only weakly supported.
- Very low-abundance regions (e.g. rare plasmid fragments or artifacts).
Q9. We cannot always fully trust “misassembly” calls from QUAST because:
- QUAST assumes the reference genome is the truth.
- Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.
- Thus, some reported “misassemblies” likely reflect true variation rather than assembler errors.
Q10. By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, yes, our genome appears broadly similar to the reference V. cholerae genome.
Q11. There are very few contigs that do not map well. The contig K119.81, for example, only maps partially.
This could represent:
- A region present in our strain but absent (or highly divergent) in the reference genome, or
- A misassembly (chimeric contig or incorrectly joined segments).
Q12. The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the integron island in V. cholerae.
This region:
- Contains many gene cassettes and repeats.
- Is difficult to assemble with short reads (leading to fragmentation and poor mapping).
- Acts as a hotspot where V. cholerae integrates new genes into its genome.
Q13. Inspecting the number of contigs:
grep ">" ecoli_nanopore.contigs.fasta grep ">" ecoli_pacbio.contigs.fasta
We see that:
- The Nanopore assembly has 2 contigs.
- The PacBio assembly has 1 contig.
This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a single-contig (closed) genome, and Nanopore is nearly as good.
Q14. Running prodigal:
/home/ctools/prokka/binaries/linux/prodigal \ -f gff \ -i ecoli_pacbio.contigs.fasta \ -a ecoli_pacbio.contigs.aa \ -o ecoli_pacbio.contigs.gff
Then extracting and BLASTing tig00000001_4582 reveals that this gene encodes β-galactosidase (lacZ) from E. coli.
This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the Nobel Prize in Physiology or Medicine in 1965.