Denovo solution: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with "Q1. Illumina Q1A. discarded contains reads that are too short, pair1 and pair2 files contain the read pairs were both passed trimming and singleton are reads were one of the two pairs were discarded. Q2. Around 84 Q3. N = (M*L)/(L-K+1) = (84*99)/(99-15+1) = 97.84 Genome_size = T/N = (213721367+212523694)/97.84 = 4.35Mb Q4. Mean = 259 ; SD = 11 Q5. It is lower, this means that the actual kmer peak we found (unless you found one higher than 84) is higher (this would g...")
 
No edit summary
 
(6 intermediate revisions by one other user not shown)
Line 1: Line 1:
Q1. Illumina
<h2>Answers</h2>


Q1A. discarded contains reads that are too short, pair1 and pair2 files contain the read pairs were both passed trimming and singleton are reads were one of the two pairs were discarded.
<p><b>Q1.</b> The reads are encoded in an <b>Illumina</b>-style quality format (Phred+64 in this dataset).</p>


Q2. Around 84
<p><b>Q1A.</b></p>
<ul>
  <li><code>Vchol-001_6.discarded.gz</code>: reads that were discarded during trimming (too short after trimming and/or very low quality).</li>
  <li><code>Vchol-001_6.pair1.truncated.gz</code> and <code>Vchol-001_6.pair2.truncated.gz</code>: read pairs where <b>both mates passed</b> the trimming filters.</li>
  <li><code>Vchol-001_6.singleton.truncated.gz</code>: reads where <b>one mate passed</b> trimming but the other mate was discarded (so only one read from the pair survives).</li>
</ul>


Q3. N = (M*L)/(L-K+1) = (84*99)/(99-15+1) = 97.84
<hr>
Genome_size = T/N = (213721367+212523694)/97.84 = 4.35Mb


Q4. Mean = 259 ; SD = 11
<p><b>Q2.</b> The main k-mer coverage peak is at approximately <b>84×</b>.</p>


Q5. It is lower, this means that the actual kmer peak we found (unless you found one higher than 84) is higher (this would give a lower genome size).
<hr>


Q6. 10 of 195 contigs were scaffolded into scaffolds, this is quite few - normally it is much higher. A reason for this could be that our insert size is quite low (~250 bp) and the repeats in the genome are larger than this.
<p><b>Q3.</b> Using the formulas:</p>


Q7. Repeat regions
<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>


Q8. Contaminations
<ul>
  <li><b>M</b> (k-mer peak) ≈ 84</li>
  <li><b>L</b> (average read length) ≈ 99</li>
  <li><b>K</b> (k-mer size) = 15</li>
  <li><b>T</b> (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases</li>
</ul>


Q9. Because we use the reference genome as the truth it may be hard to distinguish what is a misassembly and what is true variation from the reference genome.
<p>Compute depth:</p>


Q10. This is of course just visual, but it seems that most part of the reference genome is covered by our assembly, so yes.
<pre>
N = (84 * 99) / (99 - 15 + 1)
  = 8316 / 85
  ≈ 97.84×
</pre>


Q11. Yes, a couple of the small contigs does not map at all, and the C1097 only maps partially. This could be sequence in our strain, but not in the reference genome.
<p>Then genome size:</p>


Q12. This is a region with a lot of repeats, this is also why we cant really assemble it. It is used by V. cholerae to integrate new genes into its genome.
<pre>
Genome_size = T / N
            = 426,245,061 / 97.84
            ≈ 4.35 Mb
</pre>


Q13. The Nanopore assembly only has 2 contigs and pacbio 1!
<p><b>Estimated genome size ≈ 4.35 Mb</b>, which is close to the expected ~4 Mb for <i>V. cholerae</i>.</p>


<!-- Q14. The 454 assembly was best. -->
<hr>
 
<p><b>Q4.</b> From the insert-size analysis in R, we obtain:</p>
 
<ul>
  <li><b>Mean insert size</b> ≈ <b>259 bp</b></li>
  <li><b>Standard deviation (SD)</b> ≈ <b>11 bp</b></li>
</ul>
 
<hr>
 
<p><b>Q5.</b> The “best” assembly (using multiple k-mers / default settings) has:</p>
 
<ul>
  <li><b>N50 ≈ 179,846 bp</b></li>
</ul>
 
<p>The best fixed-k assembly we observed (e.g. at k=79) had:</p>
 
<ul>
  <li><b>N50 ≈ 92,020 bp</b></li>
</ul>
 
<p>So the multi-k / default assembly has a substantially <b>higher N50</b> than the best fixed-k assembly.</p>
 
<hr>
 
<p><b>Q6.</b> In the QUAST comparison, there is effectively <b>one main large scaffold</b> that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).</p>
 
<hr>
 
<p><b>Q7.</b> Short contigs with <b>much higher coverage</b> than the bulk of the assembly are typically:</p>
<ul>
  <li><b>Repeat regions</b> (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.</li>
  <li>Sometimes <b>misassemblies</b> where repeated segments have been collapsed into one contig, which inflates the apparent coverage.</li>
</ul>
 
<hr>
 
<p><b>Q8.</b> Short contigs with <b>much lower coverage</b> than the main assembly are often:</p>
<ul>
  <li><b>Contaminant sequences</b> from another species (few reads map, so coverage stays low).</li>
  <li><b>Misassemblies</b> or chimeric contigs that are only weakly supported.</li>
  <li>Very low-abundance regions (e.g. rare plasmid fragments or artifacts).</li>
</ul>
 
<hr>
 
<p><b>Q9.</b> We cannot always fully trust “misassembly” calls from QUAST because:</p>
<ul>
  <li>QUAST assumes the reference genome is the <b>truth</b>.</li>
  <li>Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.</li>
  <li>Thus, some reported “misassemblies” likely reflect <b>true variation</b> rather than assembler errors.</li>
</ul>
 
<hr>
 
<p><b>Q10.</b> By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, <b>yes</b>, our genome appears broadly similar to the reference <i>V. cholerae</i> genome.</p>
 
<hr>
 
<p><b>Q11.</b> There are very few contigs that do not map well. The contig <code>K119.81</code>, for example, only maps partially.</p>
 
<p>This could represent:</p>
<ul>
  <li>A region present in our strain but absent (or highly divergent) in the reference genome, or</li>
  <li>A <b>misassembly</b> (chimeric contig or incorrectly joined segments).</li>
</ul>
 
<hr>
 
<p><b>Q12.</b> The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the <b>integron island</b> in <i>V. cholerae</i>.</p>
 
<p>This region:</p>
<ul>
  <li>Contains many gene cassettes and repeats.</li>
  <li>Is difficult to assemble with short reads (leading to fragmentation and poor mapping).</li>
  <li>Acts as a hotspot where <i>V. cholerae</i> integrates new genes into its genome.</li>
</ul>
 
<p>See: [https://www.nature.com/articles/35020000 V. cholerae genome paper]</p>
 
<hr>
 
<p><b>Q13.</b> Inspecting the number of contigs:</p>
 
<pre>
grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta
</pre>
 
<p>We see that:</p>
<ul>
  <li>The <b>Nanopore assembly</b> has <b>2 contigs</b>.</li>
  <li>The <b>PacBio assembly</b> has <b>1 contig</b>.</li>
</ul>
 
<p>This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a <b>single-contig (closed) genome</b>, and Nanopore is nearly as good.</p>
 
<hr>
 
<p><b>Q14.</b> Running prodigal:</p>
 
<pre>
/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i ecoli_pacbio.contigs.fasta \
  -a ecoli_pacbio.contigs.aa \
  -o ecoli_pacbio.contigs.gff
</pre>
 
<p>Then extracting and BLASTing <code>tig00000001_4582</code> reveals that this gene encodes <b>β-galactosidase (lacZ)</b> from <i>E. coli</i>.</p>
 
<p>This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the [https://www.nobelprize.org/prizes/medicine/1965/summary/ Nobel Prize in Physiology or Medicine in 1965].</p>

Latest revision as of 11:18, 26 November 2025

Answers

Q1. The reads are encoded in an Illumina-style quality format (Phred+64 in this dataset).

Q1A.

  • Vchol-001_6.discarded.gz: reads that were discarded during trimming (too short after trimming and/or very low quality).
  • Vchol-001_6.pair1.truncated.gz and Vchol-001_6.pair2.truncated.gz: read pairs where both mates passed the trimming filters.
  • Vchol-001_6.singleton.truncated.gz: reads where one mate passed trimming but the other mate was discarded (so only one read from the pair survives).

Q2. The main k-mer coverage peak is at approximately 84×.


Q3. Using the formulas:

N = (M * L) / (L - K + 1)
Genome_size = T / N
  • M (k-mer peak) ≈ 84
  • L (average read length) ≈ 99
  • K (k-mer size) = 15
  • T (total bases) = 213,721,367 + 212,523,694 = 426,245,061 bases

Compute depth:

N = (84 * 99) / (99 - 15 + 1)
  = 8316 / 85
  ≈ 97.84×

Then genome size:

Genome_size = T / N
            = 426,245,061 / 97.84
            ≈ 4.35 Mb

Estimated genome size ≈ 4.35 Mb, which is close to the expected ~4 Mb for V. cholerae.


Q4. From the insert-size analysis in R, we obtain:

  • Mean insert size259 bp
  • Standard deviation (SD)11 bp

Q5. The “best” assembly (using multiple k-mers / default settings) has:

  • N50 ≈ 179,846 bp

The best fixed-k assembly we observed (e.g. at k=79) had:

  • N50 ≈ 92,020 bp

So the multi-k / default assembly has a substantially higher N50 than the best fixed-k assembly.


Q6. In the QUAST comparison, there is effectively one main large scaffold that dominates the assembly (i.e. a very contiguous assembly with one primary contig/scaffold capturing most of the genome).


Q7. Short contigs with much higher coverage than the bulk of the assembly are typically:

  • Repeat regions (e.g. rRNA operons, transposons, insertion sequences) that occur multiple times in the genome, causing reads from many genomic copies to pile onto a single contig.
  • Sometimes misassemblies where repeated segments have been collapsed into one contig, which inflates the apparent coverage.

Q8. Short contigs with much lower coverage than the main assembly are often:

  • Contaminant sequences from another species (few reads map, so coverage stays low).
  • Misassemblies or chimeric contigs that are only weakly supported.
  • Very low-abundance regions (e.g. rare plasmid fragments or artifacts).

Q9. We cannot always fully trust “misassembly” calls from QUAST because:

  • QUAST assumes the reference genome is the truth.
  • Real biological differences (e.g. insertions, deletions, rearrangements) between the sample and the reference can be misinterpreted as misassemblies.
  • Thus, some reported “misassemblies” likely reflect true variation rather than assembler errors.

Q10. By visual inspection (Circoletto / alignment plots), a large fraction of the reference genome is covered by our contigs, with strong matches along most of both chromosomes. So, yes, our genome appears broadly similar to the reference V. cholerae genome.


Q11. There are very few contigs that do not map well. The contig K119.81, for example, only maps partially.

This could represent:

  • A region present in our strain but absent (or highly divergent) in the reference genome, or
  • A misassembly (chimeric contig or incorrectly joined segments).

Q12. The region on chromosome 2 with many small, low-confidence mappings is likely a highly repetitive region, such as the integron island in V. cholerae.

This region:

  • Contains many gene cassettes and repeats.
  • Is difficult to assemble with short reads (leading to fragmentation and poor mapping).
  • Acts as a hotspot where V. cholerae integrates new genes into its genome.

See: V. cholerae genome paper


Q13. Inspecting the number of contigs:

grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta

We see that:

  • The Nanopore assembly has 2 contigs.
  • The PacBio assembly has 1 contig.

This is an extremely good result for bacterial genomes: the PacBio assembly is essentially a single-contig (closed) genome, and Nanopore is nearly as good.


Q14. Running prodigal:

/home/ctools/prokka/binaries/linux/prodigal \
  -f gff \
  -i ecoli_pacbio.contigs.fasta \
  -a ecoli_pacbio.contigs.aa \
  -o ecoli_pacbio.contigs.gff

Then extracting and BLASTing tig00000001_4582 reveals that this gene encodes β-galactosidase (lacZ) from E. coli.

This gene was central to the classic studies of gene regulation by François Jacob, André Lwoff, and Jacques Monod, which led to the concept of the lac operon and earned them the Nobel Prize in Physiology or Medicine in 1965.