Metagenomic assembly solution
Q1. We can't really find bell-shaped distributions in our samples, except in MH0032, where there are two very small bell-shaped coverage distributions.
Q2. This is because we have many organisms with relatively low abundance. This makes it very hard to distinguish them using coverage information.
Q3. It will probably perform much like SOAPdenovo.
Q4. There aren't really any major differences in the coverage distributions between the assemblers; the metagenome assemblers have as many problems as the standard assembler. The coverage distributions tell us that by far most contigs have fairly low coverage in the assembly, which is also what we expect.
Q5. There aren't really any major differences between the assemblies from MetaVelvet and SOAPdenovo. However, the MEGAHIT assembly seems to be as long as the other assemblies while having a larger mean scaffold size and a larger N50, meaning that it was able to assemble the metagenome into longer pieces.
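To make the N50 comparison concrete, here is a minimal sketch of how N50 can be computed from a list of scaffold lengths. The function name `n50` and the two toy assemblies are made up for illustration and are not the lengths from the exercise; the point is only that two assemblies with the same total length can have very different N50 values.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical assemblies with the same total length (1000 bp each).
fragmented = [100] * 10            # many short scaffolds
contiguous = [400, 300, 200, 100]  # fewer, longer scaffolds

print(n50(fragmented))  # 100
print(n50(contiguous))  # 300
```

Both toy assemblies cover the same number of bases, but the second one has a larger N50 because more of its sequence sits in long scaffolds.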
Q6. MH0032: 84,606; MH0047: around 70,000. That is roughly 2-4 times the number of human genes.
Q7. We only count it once because both reads come from the same DNA fragment, i.e. it was only present once.
Q8. If the two reads of a pair map to different genes, we count one hit to each gene, because we have seen both genes once (our DNA fragment just happened to span both).
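The counting rules from Q7 and Q8 can be sketched in a few lines. This is only an illustration, not the script used in the exercise: the function name `count_gene_hits`, the gene names, and the input format (one tuple of gene names per read pair) are assumptions, and so is the choice to use `None` for an unmapped read and still count the mapped mate.

```python
from collections import Counter


def count_gene_hits(pair_alignments):
    """Count hits per gene from paired-end alignments.

    pair_alignments: iterable of (gene_of_read1, gene_of_read2) tuples,
    where None means the read did not map (an assumption of this sketch).
    """
    counts = Counter()
    for gene1, gene2 in pair_alignments:
        # A set collapses both reads into one hit when they map to the
        # same gene (Q7) and keeps two hits when the genes differ (Q8).
        for gene in {gene1, gene2} - {None}:
            counts[gene] += 1
    return counts


# Hypothetical read pairs.
pairs = [
    ("geneA", "geneA"),  # same gene: counted once
    ("geneA", "geneB"),  # different genes: one hit each
    ("geneC", None),     # only one read mapped
]
counts = count_gene_hits(pairs)
print(dict(sorted(counts.items())))  # {'geneA': 2, 'geneB': 1, 'geneC': 1}
```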
Q9. There are both genes in common and genes specific to each sample. Most of the genes have very low abundance (the blue field near 0), which is also what we expected from the k-mer distributions (Q1).
Q10. Many of the species have very few genes, so we probably cannot trust all of them. We need a better reference genome set that covers more of the genomes in our samples (human gut). We could BLAST against human gut species instead.
Q11. Yes, there are!
Q12. We can see that several Prevotella species are very abundant in the MH0032 individual compared to MH0047. Probably the MH0032 individual has the Prevotella enterotype.
Q13. 36 bins were identified. You can think of the bins as a clustering of the contigs into what we believe are genomes. Importantly, there will be errors and the list will not be complete.
Q14. The lengths of the bins vary from 200 kb to 3.5 Mb. Why do you think this is?
Q15. We have both very nice bins (at the top) with high completeness and low contamination, and bins that are less complete with higher contamination. The bins without marker genes could be incomplete bins or perhaps something that is not well known?