Denovo solution

Latest revision as of 14:02, 29 November 2024

Q1. Illumina

Q1A. The discarded file contains reads that are too short, the pair1 and pair2 files contain the read pairs where both reads passed trimming, and the singleton file contains reads where the other read of the pair was discarded.

Q2. Around 84

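Q3. N = (M*L)/(L-K+1) = (84*99)/(99-15+1) = 97.84, where M is the k-mer peak from Q2, L is the read length and K is the k-mer size. Genome_size = T/N = (213721367+212523694)/97.84 = 4.36 Mb, where T is the total number of bases.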
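The calculation in Q3 can be reproduced on the command line. The following is a minimal sketch using awk for the floating-point arithmetic; the function name `estimate_genome_size` and its argument order are made up for illustration and are not part of the exercise.

```shell
# Hypothetical helper: estimate coverage and genome size from the k-mer peak.
# Arguments: k-mer peak M, read length L, k-mer size K, total bases T
estimate_genome_size() {
    awk -v M="$1" -v L="$2" -v K="$3" -v T="$4" 'BEGIN {
        N = (M * L) / (L - K + 1)                # base coverage
        printf "%.2f %.2f\n", N, T / N / 1e6     # coverage, genome size in Mb
    }'
}

# Numbers taken from the answers to Q2 and Q3
total_bases=$((213721367 + 212523694))
estimate_genome_size 84 99 15 "$total_bases"
```

The two printed values are the coverage N and the genome size in Mb, each rounded to two decimals. Because the unrounded N is used for the division, the last digit can differ slightly from a hand calculation that rounds N first.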

Q4. Mean = 259 ; SD = 11

Q5. It is higher: the N50 is 179846, compared with 92020 for the best assembly we found at k=79.
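For reference, N50 is the contig length at which the running total of contig lengths, sorted from longest to shortest, reaches half of the total assembly size. Below is a minimal sketch of that calculation; the function names `contig_lengths` and `n50` and the file name `contigs.fa` are hypothetical, and the assembler's own statistics should be preferred when available.

```shell
# Hypothetical helper: print the length of every sequence in a FASTA file.
contig_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# Hypothetical helper: read one contig length per line on stdin, print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            running = 0
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # First contig at which the running sum reaches half the total
                if (running >= total / 2) { print len[i]; exit }
            }
        }'
}

# Toy example: total is 1500, and the running sum 500 + 400 passes 750
printf '%s\n' 100 200 300 400 500 | n50

# On a real assembly (file name is an assumption):
# contig_lengths contigs.fa | n50
```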

Q6. 1

Q7. Repeat regions + misassemblies

Q8. Contaminations + misassemblies

Q9. Because we use the reference genome as the truth, it may be hard to distinguish a misassembly from true variation relative to the reference genome.

Q10. This is just visual, but it seems that a lot of the reference genome is covered by our assembly, so yes.

Q11. Very few, and K119.81 only maps partially. This could be sequence that is present in our strain but not in the reference genome, or it could be a misassembly.

Q12. This is a region with a lot of repeats, which is also why we can't really assemble it. It is used by V. cholerae to integrate new genes into its genome.

Q13. Using:

 
grep ">" ecoli_nanopore.contigs.fasta
grep ">" ecoli_pacbio.contigs.fasta

The Nanopore assembly has only 2 contigs and the PacBio assembly has 1! Pretty good!
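If only the number of contigs is needed, `grep -c` can print the count directly instead of listing the headers. This is a small sketch; the function name `count_contigs` is hypothetical, and the loop simply skips files that are not in the current directory.

```shell
# Hypothetical helper: count FASTA header lines, i.e. the number of contigs.
count_contigs() {
    grep -c '^>' "$1"
}

# Report the contig count for each assembly that exists in this directory
for f in ecoli_nanopore.contigs.fasta ecoli_pacbio.contigs.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(count_contigs "$f") contigs"
    fi
done
```

Anchoring the pattern with `^` means only lines that start with `>` are counted, which is how FASTA headers are written.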


Q14. Using:

/home/ctools/prokka/binaries/linux/prodigal -f gff -i ecoli_pacbio.contigs.fasta -a ecoli_pacbio.contigs.aa -o ecoli_pacbio.contigs.gff


It is the beta-galactosidase gene of E. coli. That gene was used by François Jacob, André Lwoff and Jacques Monod to describe gene regulation in 1960, which earned them the Nobel Prize in 1965 (https://www.nobelprize.org/prizes/medicine/1965/summary/).