Denovo solution: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 1: Line 1:
Q1. Illumina
<h2>Answers</h2>


Q1A. discarded contains reads that are too short, pair1 and pair2 files contain the read pairs were both passed trimming and singleton are reads were one of the two pairs were discarded.
<p><b>Q1.</b> The reads are encoded in an <b>Illumina</b>-style quality format (Phred+64 in this dataset).</p>


Q2. Around 84
<p><b>Q1A.</b></p>
<ul>
  <li><code>Vchol-001_6.discarded.gz</code>: reads that were discarded during trimming (too short after trimming and/or very low quality).</li>
  <li><code>Vchol-001_6.pair1.truncated.gz</code> and <code>Vchol-001_6.pair2.truncated.gz</code>: read pairs where <b>both mates passed</b> the trimming filters.</li>
  <li><code>Vchol-001_6.singleton.truncated.gz</code>: reads where <b>one mate passed</b> trimming but the other mate was discarded (so only one read from the pair survives).</li>
</ul>


Q3. N = (M*L)/(L-K+1) = (84*99)/(99-15+1) = 97.84
<hr>
Genome_size = T/N = (213721367+212523694)/97.84 = 4.35Mb


Q4. Mean = 259 ; SD = 11
<p><b>Q2.</b> The main k-mer coverage peak is at approximately <b>84×</b>.</p>


Q5. It is higher N50:179846 than the best we found at k=79 N50:92020
<hr>


Q6. 1
<p><b>Q3.</b> Using the formulas:</p>


Q7. Repeat regions + misassemblies
<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>
 
<ul>
  <li><b>M</b> (k-mer peak) ≈ 84</li>
  <li><b>L</b> (average read length) ≈ 99</li>
  <li><b>K</b> (k-mer size) = 15</li>
  <li><b>T</b> (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases</li>
</ul>
 
<p>Compute depth:</p>
 
<pre>
N = (84 * 99) / (99 - 15 + 1)
  = 8316 / 85
  ≈ 97.84×
</pre>


Q8. Contaminations + misassemblies
<p>Then genome size:</p>


Q9. Because we use the reference genome as the truth it may be hard to distinguish what is a misassembly and what is true variation from the reference genome.  
<pre>
Genome_size = T / N
            = 426,245,061 / 97.84
            ≈ 4.35 Mb
</pre>


Q10. This is just visual, but it seems that a lot of the reference genome is covered by our assembly, so yes.
<p><b>Estimated genome size ≈ 4.35 Mb</b>, which is close to the expected ~4 Mb for <i>V. cholerae</i>.</p>


Q11. very few and the K119.81 only maps partially. This could be a sequence in our strain, but not in the reference genome. Or a misassembly.
<hr>


Q12. This is a region with a lot of repeats, this is also why we can't really assemble it. It is used by V. cholerae to integrate new genes into its genome.
<p><b>Q4.</b> From the insert-size analysis in R, we obtain:</p>


Q13. Using:
<ul>
  <li><b>Mean insert size</b> ≈ <b>259 bp</b></li>
  <li><b>Standard deviation (SD)</b> ≈ <b>11 bp</b></li>
</ul>


<pre>  
<hr>
 
<p><b>Q5.</b> The “best” assembly (using multiple k-mers / default settings) has:</p>
 
<ul>
  <li><b>N50 ≈ 179,846 bp</b></li>
</ul>
 
<p>The best fixed-k assembly we observed (e.g. at k=79) had:</p>
 
<ul>
  <li><b>N50 ≈ 92,020 bp</b></li>
</ul>
 
<p>So the multi-k / default assembly has a substantially <b>higher N50</b> than the best fixed-k assembly.</p>
 
<hr>
 
<p><b>Q6.</b> In the QUAST comparison, there is effectively <b>one main large scaffold</b> that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).</p>
 
<hr>
 
<p><b>Q7.</b> Short contigs with <b>much higher coverage</b> than the bulk of the assembly are typically:</p>
<ul>
  <li><b>Repeat regions</b> (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.</li>
  <li>Sometimes <b>misassemblies</b> where repeated segments have been collapsed into one contig, which inflates the apparent coverage.</li>
</ul>
 
<hr>
 
<p><b>Q8.</b> Short contigs with <b>much lower coverage</b> than the main assembly are often:</p>
<ul>
  <li><b>Contaminant sequences</b> from another species (few reads map, so coverage stays low).</li>
  <li><b>Misassemblies</b> or chimeric contigs that are only weakly supported.</li>
  <li>Very low-abundance regions (e.g. rare plasmid fragments or artifacts).</li>
</ul>
 
<hr>
 
<p><b>Q9.</b> We cannot always fully trust “misassembly” calls from QUAST because:</p>
<ul>
  <li>QUAST assumes the reference genome is the <b>truth</b>.</li>
  <li>Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.</li>
  <li>Thus, some reported “misassemblies” likely reflect <b>true variation</b> rather than assembler errors.</li>
</ul>
 
<hr>
 
<p><b>Q10.</b> By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, <b>yes</b>, our genome appears broadly similar to the reference <i>V. cholerae</i> genome.</p>
 
<hr>
 
<p><b>Q11.</b> There are very few contigs that do not map well. The contig <code>K119.81</code>, for example, only maps partially.</p>
 
<p>This could represent:</p>
<ul>
  <li>A region present in our strain but absent (or highly divergent) in the reference genome, or</li>
  <li>A <b>misassembly</b> (chimeric contig or incorrectly joined segments).</li>
</ul>
 
<hr>
 
<p><b>Q12.</b> The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the <b>integron island</b> in <i>V. cholerae</i>.</p>
 
<p>This region:</p>
<ul>
  <li>Contains many gene cassettes and repeats.</li>
  <li>Is difficult to assemble with short reads (leading to fragmentation and poor mapping).</li>
  <li>Acts as a hotspot where <i>V. cholerae</i> integrates new genes into its genome.</li>
</ul>
 
<p>See: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</p>
 
<hr>
 
<p><b>Q13.</b> Inspecting the number of contigs:</p>
 
<pre>
grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta
</pre>
</pre>


The Nanopore assembly only has 2 contigs and pacbio 1! Pretty good!
<p>We see that:</p>
<ul>
  <li>The <b>Nanopore assembly</b> has <b>2 contigs</b>.</li>
  <li>The <b>PacBio assembly</b> has <b>1 contig</b>.</li>
</ul>
 
<p>This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a <b>single-contig (closed) genome</b>, and Nanopore is nearly as good.</p>
 
<hr>


<p><b>Q14.</b> Running prodigal:</p>


Q14. using:
<pre>
<pre>
/home/ctools/prokka/binaries/linux/prodigal -f gff -i ecoli_pacbio.contigs.fasta -a ecoli_pacbio.contigs.aa -o ecoli_pacbio.contigs.gff
/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i ecoli_pacbio.contigs.fasta \
  -a ecoli_pacbio.contigs.aa \
  -o ecoli_pacbio.contigs.gff
</pre>
</pre>


<p>Then extracting and BLASTing <code>tig00000001_4582</code> reveals that this gene encodes <b>β-galactosidase (lacZ)</b> from <i>E. coli</i>.</p>


it is the Beta-galactosidase gene of E.Coli. That gene was used by François Jacob, André Lwoff and Jacques Monod to describe gene regulation in 1960 which got them the [https://www.nobelprize.org/prizes/medicine/1965/summary/ Nobel prize in 1965]
<p>This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the [https://www.nobelprize.org/prizes/medicine/1965/summary/ Nobel Prize in Physiology or Medicine in 1965].</p>
<!-- Q14. The 454 assembly was best. -->

Latest revision as of 11:18, 26 November 2025

Answers

Q1. The reads are encoded in an Illumina-style quality format (Phred+64 in this dataset).

Q1A.

  • Vchol-001_6.discarded.gz: reads that were discarded during trimming (too short after trimming and/or very low quality).
  • Vchol-001_6.pair1.truncated.gz and Vchol-001_6.pair2.truncated.gz: read pairs where both mates passed the trimming filters.
  • Vchol-001_6.singleton.truncated.gz: reads where one mate passed trimming but the other mate was discarded (so only one read from the pair survives).

Q2. The main k-mer coverage peak is at approximately 84×.


Q3. Using the formulas:

N = (M * L) / (L - K + 1)
Genome_size = T / N
  • M (k-mer peak) ≈ 84
  • L (average read length) ≈ 99
  • K (k-mer size) = 15
  • T (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases

Compute depth:

N = (84 * 99) / (99 - 15 + 1)
  = 8316 / 85
  ≈ 97.84×

Then genome size:

Genome_size = T / N
            = 426,245,061 / 97.84
            ≈ 4.35 Mb

Estimated genome size ≈ 4.35 Mb, which is close to the expected ~4 Mb for V. cholerae.


Q4. From the insert-size analysis in R, we obtain:

  • Mean insert size259 bp
  • Standard deviation (SD)11 bp

Q5. The “best” assembly (using multiple k-mers / default settings) has:

  • N50 ≈ 179,846 bp

The best fixed-k assembly we observed (e.g. at k=79) had:

  • N50 ≈ 92,020 bp

So the multi-k / default assembly has a substantially higher N50 than the best fixed-k assembly.


Q6. In the QUAST comparison, there is effectively one main large scaffold that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).


Q7. Short contigs with much higher coverage than the bulk of the assembly are typically:

  • Repeat regions (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.
  • Sometimes misassemblies where repeated segments have been collapsed into one contig, which inflates the apparent coverage.

Q8. Short contigs with much lower coverage than the main assembly are often:

  • Contaminant sequences from another species (few reads map, so coverage stays low).
  • Misassemblies or chimeric contigs that are only weakly supported.
  • Very low-abundance regions (e.g. rare plasmid fragments or artifacts).

Q9. We cannot always fully trust “misassembly” calls from QUAST because:

  • QUAST assumes the reference genome is the truth.
  • Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.
  • Thus, some reported “misassemblies” likely reflect true variation rather than assembler errors.

Q10. By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, yes, our genome appears broadly similar to the reference V. cholerae genome.


Q11. There are very few contigs that do not map well. The contig K119.81, for example, only maps partially.

This could represent:

  • A region present in our strain but absent (or highly divergent) in the reference genome, or
  • A misassembly (chimeric contig or incorrectly joined segments).

Q12. The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the integron island in V. cholerae.

This region:

  • Contains many gene cassettes and repeats.
  • Is difficult to assemble with short reads (leading to fragmentation and poor mapping).
  • Acts as a hotspot where V. cholerae integrates new genes into its genome.

See: V. cholerae genome paper


Q13. Inspecting the number of contigs:

grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta

We see that:

  • The Nanopore assembly has 2 contigs.
  • The PacBio assembly has 1 contig.

This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a single-contig (closed) genome, and Nanopore is nearly as good.


Q14. Running prodigal:

/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i ecoli_pacbio.contigs.fasta \
  -a ecoli_pacbio.contigs.aa \
  -o ecoli_pacbio.contigs.gff

Then extracting and BLASTing tig00000001_4582 reveals that this gene encodes β-galactosidase (lacZ) from E. coli.

This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the Nobel Prize in Physiology or Medicine in 1965.