Metagenomic assembly exercise

From 22126
Jump to navigation Jump to search

Overview

In this exercise we will try de novo assemble a metagenomic dataset of Illumina paired end reads from a stool sample. The data is part of the MetaHit project and was published here. These are just two samples of 396, in the next exercise (tomorrow) you will analyze data from 124 samples. Today, you will try to:

  1. Analyze k-mer distributions
  2. Assemble using SOAPdenovo and MetaVelvet (well, it is already done)
  3. Analyze output from metagenomic assemblies
  4. Gene prediction and clustering on metagenomic data
  5. Create a gene abundance matrix

de novo assembly of metagenomic data


Analyze k-mer distributions

Assembly of metagenomic data is much harder than assembly of single organism datasets. Here we will use SOAPdenovo as we did in the de novo assembly exercise and MetaVelvet. Let us first look at the k-mer coverage of our data, we already counted the kmers for you (else you could use jellyfish as we did for the V. cholerae exercise) and lets plot it:


mkdir meta
cd meta
cp /home/projects/22126_NGS/exercises/metagenomics/kmer_counts/MH0047.histo .
cp /home/projects/22126_NGS/exercises/metagenomics/kmer_counts/MH0032.histo .

R
pdf(file="kmerFreq.pdf", width=12, height=12)
par(mfrow=c(2,1))
d = read.table("MH0047.histo")
barplot(d[,2], main="MH0047 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))

d = read.table("MH0032.histo")
barplot(d[,2], main="MH0032 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))
dev.off()
q()

Open the plot by downloading and viewing or using acroread.

evince kmerFreq.pdf &

Q1. Can you identify the peak(s) in the Gaussian (bell-shaped) distributions in the two plots?

You should see that compared to the V. cholera sample that we used in the de novo exercise we havent got the same nice peak. Actually the x-axis in the plots goes from 1-400 so there seem to be something with quite high coverage in the MH0032 sample, however by far most of the sequences can not be separated from each other in terms of different abundance.

Q2. Can you think of why the coverage distributions looks like this with many sequences with "lower" coverage?
Q3. Given that MetaVelvet works by dividing the de bruijn graph using coverage peaks do you think it will perform better or worse compared to SOAPdenovo?


de novo assembly

Unfortunately because the assemblies takes 30-45 mins and uses 10-25Gb of RAM each you will not run the assemblies in the exercise, but you can see the code that I used to run it here code if you need to assemble some genomes for the projects.

Instead copy the contigs and scaffolds to your folder and filter for minimum 100 bp:

cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.soap.scafSeq MH0047.soap.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.metavelvet.contigs.fa MH0047.velvet.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.megahit.contigs.fa MH0047.megahit.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.spades.fa MH0047.spades.fa

fastx_filterfasta.py --i MH0047.soap.fa --min 100
fastx_filterfasta.py --i MH0047.velvet.fa --min 100
fastx_filterfasta.py --i MH0047.megahit.fa --min 100
fastx_filterfasta.py --i MH0047.spades.fa --min 100

Ok now lets calculate coverage of the MetaVelvet and SOAPdenovo assemblies and plot them (the Megahit assembly does not have that information ready available so we will skip it).

fastx_soapcov.py --i MH0047.soap.fa.filtered_100.fa > MH0047.soap.cov
fastx_velvetcov.py --i MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.cov

R
library(plotrix)
par(mfrow=c(1,2))
dat=read.table("MH0047.soap.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="SOAPdenovo - Weighted coverage", xlab="Contig coverage")
dat=read.table("MH0047.velvet.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="MetaVelvet - Weighted coverage", xlab="Contig coverage")
dev.print("MH0047.coverage.pdf", device=pdf)
q()

Open the plot by downloading and viewing or using acroread.

evince MH0047.coverage.pdf &

Q4. Are there any differences between the two assemblies and what can you tell about the assembly from the coverage distributions?

Finally lets look at the stats for the three assemblies. We can either use paste to compare the files (paste file1 file2 file3 file4) but it becomes cumbersome when there are more than two files. Instead we will use R to create a table (assembly.stats.tab) and plot some of the key stats (assembly.stats.pdf).

assemblathon_stats.pl -csv MH0047.soap.fa.filtered_100.fa > MH0047.soap.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.megahit.fa.filtered_100.fa > MH0047.megahit.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.spades.fa.filtered_100.fa > MH0047.spades.fa.filtered_100.asm

R
soap=read.csv("MH0047.soap.fa.filtered_100.csv")
velvet=read.csv("MH0047.velvet.fa.filtered_100.csv")
megahit=read.csv("MH0047.megahit.fa.filtered_100.csv")
spades=read.csv("MH0047.spades.fa.filtered_100.csv")
df = t(rbind(soap, velvet, megahit, spades)[,-1])
options(scipen=999)
colnames(df) = c("Soap", "Velvet", "Megahit", "Spades")
write.table(df, "assembly.stats.tab", quote=FALSE, sep="\t")
par(mfrow=c(2,2))
barplot(df[1,], main=rownames(df)[1], ylab="Count")
barplot(df[2,],  main=rownames(df)[2], ylab="Base pairs")
barplot(df[16,],  main=rownames(df)[16], ylab="Base pairs")
barplot(df[18,],  main=rownames(df)[18], ylab="Base pairs")
dev.print(file="assembly.stats.pdf", device=pdf)

Q5. Are there any large differences in the stats of the four assemblies?


Gene predictions and clustering

Lets try to predict genes, we do this using Prodigal. It is together with MetaGeneMark one of the best and fastest prokaryotic gene finders available. We are using it in the metagenomic setting by setting "-p meta" and outputting the predictions as dna and as gff-format. Lets use the Spades assembly and for the sake of time lets also only use the contigs with a size >500bp (normally you could use down to 100bp). NB: The gene prediction takes 3-4 minutes each and we are running each in the background (the "&") - wait for the commands to finish.

cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0032.spades.fa MH0032.spades.fa
fastx_filterfasta.py --i MH0032.spades.fa --min 500
fastx_filterfasta.py --i MH0047.spades.fa --min 500

prodigal -p meta -f gff -d MH0047.prodigal.fna -i MH0047.spades.fa.filtered_500.fa -o MH0047.prodigal.gff &
prodigal -p meta -f gff -d MH0032.prodigal.fna -i MH0032.spades.fa.filtered_500.fa -o MH0032.prodigal.gff &

Lets look at how many genes were predicted?

grep ">" -c MH00*.prodigal.fna

Q6. How many genes are there in the samples - consider how many genes humans have (22k)?

To be able to create a count matrix of the gene abundances we must create a common set of genes of the all the samples we want to compare (in our case only two). To do this we can use cd-hit which will cluster the genes based on sequence similarity and select a representative gene from each cluster that we will then use. First we combine the two gene sets, rename the genes so that names are unique and then cluster. We use a sequence identity threshold of 95% and say that the alignment must cover 90% of the sequence as shown below. However the clustering takes 15 min so you can copy the file I made to here:

# code for making it yourself
cat MH0047.prodigal.fna MH0032.prodigal.fna | rename_genes.sh humangut > combined.prodigal.fna
cd-hit-est -i combined.prodigal.fna -o combined.prodigal.cdhit.fna -c 0.95 -n 8 -l 100 -aS 0.9 -d 0 -B 0 -T 2 -M 10000

# code to copy
cp /home/27626/exercises/metagenomics/combined.prodigal.cdhit.fna .

Create gene abundance matrix

Ok now that we have our common set of genes we can determine the abundance of each gene in our sample - we do that by mapping the reads from our samples to the common gene set. This we will do using bwa, but for this exercise, we will only use a subset of 0.5 mill reads from each of the libraries (NB: this is only 1% of the total amount of reads):

bwa index combined.prodigal.cdhit.fna

ln -s /home/27626/exercises/metagenomics/sub/* .

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_081224.1.fq.gz MH0032_081224.2.fq.gz | samtools view -Sb - > MH0032_081224.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_091021.1.fq.gz MH0032_091021.2.fq.gz | samtools view -Sb - > MH0032_091021.bam

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_081223.1.fq.gz MH0047_081223.2.fq.gz | samtools view -Sb - > MH0047_081223.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_090201.1.fq.gz MH0047_090201.2.fq.gz | samtools view -Sb - > MH0047_090201.bam

Now that we have mapped the reads back to the gene set we can start counting. However before we start we need to consider what to do with paired end reads where both pairs map to the same gene.

Q7. If both pairs map to the same gene we only count it as one hit - can you think of why?
Q8. What should we do if each pair map to different genes?

We filter pairs mapping to the same gene using the read_count_bam.pl script below and then only takes reads that are mapped with a mapping quality better than 30 (-q30 below). After that we sort the bam-files:

samtools view -h MH0032_081224.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0032_081224.sort.bam -
samtools view -h MH0032_091021.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0032_091021.sort.bam - 
samtools view -h MH0047_081223.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0047_081223.sort.bam -
samtools view -h MH0047_090201.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0047_090201.sort.bam -

Now lets merge them to sample-bams and index them:

samtools merge MH0032.sort.bam MH0032_081224.sort.bam MH0032_091021.sort.bam
samtools merge MH0047.sort.bam MH0047_081223.sort.bam MH0047_090201.sort.bam
samtools index MH0032.sort.bam
samtools index MH0047.sort.bam

Now it is very easy to do the counting using samtools idxstats. It will output three columns: gene-name, gene-length, no. mapped_reads, no. unmapped_reads. Try it out:

samtools idxstats MH0032.sort.bam | less

Lets create a count matrix then. First we create a header line using "echo", then we get all of the gene-names, then the counts and finally we combined it one file:

echo -e "Gene\tMH0032\tMH0047" > header
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f1 > gene_names
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f3 > counts1
samtools idxstats MH0047.sort.bam | grep -v "*" | cut -f3 > counts2
paste gene_names counts1 counts2 | cat header - > count_matrix.sub.tab

Take a look at the count matrix using less. Also lets try to load it into R and look at the count distribution between the samples:

less count_matrix.sub.tab

R
d = read.table("count_matrix.sub.tab", sep="\t", header=TRUE, as.is=TRUE)
library(ggplot2)
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Sub count matrix")
ggsave("count.sub.hex.pdf")
quit("no")

You see that the counts per gene are quite low, but we actually previous course years used this data. But lets instead try to use a version of the count-matrix that I created using the full data. By now you have probably realised that metagenomics is lots of data and we are only using two samples with not that many reads - our paper has 396 samples in total! We will copy it to here and plot the same plot of counts as before. Last we open both in evince.

cp /home/27626/exercises/metagenomics/full_countmatrix/count_matrix.tab .

R
d = read.table("count_matrix.tab", sep="\t", header=TRUE, as.is=TRUE)
library(ggplot2)
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Full count matrix")
ggsave("count.hex.pdf")
quit("no")

evince count.*hex.pdf &

Q9. How does the overlap between the samples look like, are there genes in common and/or genes specific for each samples?


Annotation of the count matrix

Now that we have our count matrix we can annotate the genes according to species or function. Lets try to annotate them according to species.

To do that we need to blast all of the genes towards a database of organisms that we expect to be present - in our case lets blast vs. all fully sequenced bacteria at NCBI (>11000 bacterial chromosomes and plasmids) and 373 bacterial genomes from the human gut that we have published. The blast of our ~140k genes against this database takes 10 mins using 5 cores so you will not run this (if you really want to you can) - instead you can use the file I made. If you were to run the blast yourself the command is this:

# copy this file to your folder, it is the blast output #
cp /home/27626/exercises/metagenomics/combined.prodigal.cdhit.m8 .

# command to blast genes vs. bacteria - do not run it in the exercise (unless you really want to) #
# the -num_threads tells blast how many cores to use, eg. here we use 5 of the 28 cores on the machine #
# -max_target_seqs tells blast to only output the five best hits instead of the best 500 hits! #
blastn -query combined.prodigal.cdhit.fna -db /home/27626/exercises/metagenomics/blastdb/Bacteria_MGS.20160531 -evalue 1e-2 -outfmt 6 \
   -max_target_seqs 5 -num_threads 5 -out combined.prodigal.cdhit.m8

Take a look at the blast-report (combined.prodigal.cdhit.m8), the output is in blast-tabular format (also known as m8) - the fields are: query name, subject name, percent identity, aligned length, no. mismatches, no. gaps, query start, query end, subject start, subject end, e-value, bitscore. You should see that there often are several hits pr. gene (query). We will select only the best hit and require that it has >80% identity over 100bp before we will annotate the gene to a species. This is achieved using this perl-oneliner:

perl -ane 'BEGIN{$prev=""}; if ($F[0] eq $prev) { next } else { if ($F[2] > 80 & $F[3] > 100) { print $_}; $prev = $F[0];}' \
combined.prodigal.cdhit.m8 > combined.prodigal.cdhit.best.accepted.m8

Now lets just take the columns 1 and 2 because these has the information of which gene is annotated to which organism and copy an annotation file to here. Take a look at the Bacteria_MGS.20160531.tab file (after you copied it to here - see below), you can see that it contains information on each sequence in the blast database and the phylogenetic information of that sequence (eg. phylum, genus, species etc). This is what we will use to figure out which gene comes from which species.

cut -f1,2 combined.prodigal.cdhit.best.accepted.m8 > combined.prodigal.ann
cp /home/27626/exercises/metagenomics/blastdb/Bacteria_MGS.20160531.tab .

Now we have:

  1. Count matrix with genes (count.matrix.tab)
  2. Annotation of genes to species-ids (combined.prodigal.ann)
  3. Annotation of species-ids to species names (Bacteria_MGS.20160531.tab)

This we will use to create an abundance measure of the species in our sample using an R-script. For each species we determine the abundance of it as the sum of all genes annotated to that particular species:

R --vanilla count_matrix.tab combined.prodigal.ann Bacteria_MGS.20160531.tab species_matrix.tab < /home/27626/bin/create_species_matrix_sum.R

The script outputs the species matrix and a plot of how many genes that was annotated to each species - try to take a look at the species matrix and at the plot

less species_matrix.tab

evince genes_pr_species.pdf &

Q10. When looking at the plot, were there many species for which we could identify most genes? Can you think of how to improve this?

Lets plot the species-abundance versus each other in our two samples.

R
library(ggplot2)
d = read.table("species_matrix.tab", header=TRUE, sep="\t")
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(name="No. species", low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Species counts", x="Abundance in MH0032", y="Abundance in MH0047")
ggsave("species.count.hex.pdf")

Q11. Are there some species with differential abundance between the two samples?

Let us try to plot the species with the most differences in the abundance:

R
sm=read.table("species_matrix.tab", sep="\t", header=TRUE, as.is=TRUE)
sm$diff = sm[,1]-sm[,2]
min_diff = sm[order(abs(sm$diff)),][1:20,]
max_diff = sm[rev(order(abs(sm$diff))),][1:20,]
# plot differences
par(mar=c(9.5,4.1,4.1,2.1))	# change bottom margin to 9 cm
barplot(as.matrix(t(max_diff[,1:2])), beside=TRUE, las=2, cex.names=0.75, col=c("grey20", "grey80"), ylab="Species abundance", main="Species with most difference")
legend("topright", legend=c("MH0032", "MH0047"), bty="n", fill=c("grey20", "grey80"))
dev.print(file="species.differences.pdf", device=pdf)

Q12. Which genus seems to have the highest number of different species with most difference in abundance?

This is actually a very important genus in the human gut, one of the key drivers in determining which enterotype one is associated with - you can read more here. In the exercise on Quantitative metagenomics will try to analyze a species matrix of data from 120 individuals.



Binning using metabat (optional)

Lets try to do unsupervised binning of the contigs using Metabat. To do that we will combine our contigs and cluster them using metabat so that if we have the same contig assembled twice we only write it once (this is quite tricky actually and if you have many samples you might want to do a co-assembly). First lets cat them together and do the run the clustering, then map the reads to the assembly (as above). NB: I commented out ("#") all commands as you should not run them now - it takes a long time to run.

# ln -s /home/27626/exercises/metagenomics/MH0032/concat/MH0032.1.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0032/concat/MH0032.2.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0047/concat/MH0047.1.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0047/concat/MH0047.2.fq.gz .
# cat MH0032.spades.fa MH0047.spades.fa | rename_contigs.sh humangut > both.spades.fa
# cd-hit-est -i both.spades.fa -o both.spades.cdhit.fna -c 0.95 -n 8 -l 100 -aS 0.9 -d 0 -B 0 -T 8 -M 30000
# bwa index both.spades.cdhit.fna
# bwa mem -t 10 both.spades.cdhit.fna MH0032.1.fq.gz MH0032.2.fq.gz | samtools view -Sb - > MH0032.bam
# bwa mem -t 10 both.spades.cdhit.fna MH0047.1.fq.gz MH0047.2.fq.gz | samtools view -Sb - > MH0047.bam
# samtools sort -O BAM -o MH0032.metabat.sort.bam MH0032.bam
# samtools sort -O BAM -o MH0047.metabat.sort.bam MH0047.bam
# samtools index MH0032.metabat.sort.bam
# samtools index MH0047.metabat.sort.bam

Ok so instead of running all that yourself you can link the files from me and start directly with the metabat command.

ln -s /home/27626/exercises/metagenomics/metabat_on_spades/MH*.metabat.sort.bam .
ln -s /home/27626/exercises/metagenomics/metabat_on_spades/MH*.metabat.sort.bam.bai .
ln -s /home/27626/exercises/metagenomics/metabat_on_spades/both.spades.cdhit.fna .
runMetaBat.sh both.spades.cdhit.fna MH0032.metabat.sort.bam MH0047.metabat.sort.bam

Take a look a the output-folder, here you should see the bins it identified. These are the "genomes" that it has learnt from the data using tetra-nucleotide-frequency content - if there are more than 10 samples it is also using co-abundance.
Q13. How many bins was identified?

Try to look at the length of the bins:

fastx_sumofsequence.py -f fasta both.spades.cdhit.fna.metabat-bins/*.fa | sort -k2 -n

Q14. How many bases are there in the bins? Do you think they are all complete?

Lets try to investigate the bins - this can be done by blasting the contigs in each bin and analyzing the output or alternative using CheckM. We will try to use checkM which is an automated pipeline that uses single marker genes to assess how complete each bin is and if they are potentially contaminated.

checkm lineage_wf both.spades.cdhit.fna.metabat-bins checkm -x fa -t 2 -f checkm.output
# cp /home/27626/exercises/metagenomics/checkm.output .

If checkm is taking a long time to run you can copy my file (the last line above) by removing the hash-sign. Now lets take a look at the "checkm.output" file.
Q15. Does it look like our bins are good or bad (or both?). You may notice some bins without any marker genes, does this mean they are not real?
Q16. Which next steps could one do?



Additional analyses that can be done

We can investigate the abundance of different functional characteristics. This can be achieved by eg. blasting all of our genes to different databases such as ncbi-nt, kegg, eggnog etc.


Congratulations you finished the exercise!


EXTRA COMMANDS FOR THE PROJECT


Using linclust instead of cd-hit

cd-hit for clustering of genes can be a bit slow. Instead you can run linclust which is much faster. The output is the combined.prodigal.linclust.fna file.

module load mmseqs2/release_6-f5a1c
mmseqs createdb combined.prodigal.cdhit.fna combined
mkdir tmp
mmseqs linclust combined combined_cluster tmp -c 0.9 --min-seq-id 0.95  --threads 8
mmseqs result2repseq combined combined_cluster combined_clu_rep
mmseqs result2flat combined combined combined_clu_rep combined.prodigal.linclust.fna --use-fasta-header