Kaiju exercise

From 22126
Jump to navigation Jump to search


Introduction

Kaiju is a protein-based sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments.
Kaiju translates metagenomic sequencing reads into the six possible reading frames and searches for maximum exact matches (MEMs) of amino acid sequences in a given database of annotated proteins from microbial reference genomes. If matches to one or more database sequences are found for a read, Kaiju outputs the taxonomic identifier of the corresponding taxon, or it determines the Last Common Ancestor (LCA) in the case of equally good matches to different taxa. Kaiju’s underlying sequence comparison algorithm uses the Burrows–Wheeler transform (BWT) of the protein database, which enables exact string matching in time proportional to the length of the query, to achieve a high classification speed.

In k-mer-based methods, the size of k governs the sensitivity and precision of the search. If k is chosen too large, no identical k-mers between read and database might be found, especially for short or erroneous reads, as well as for evolutionary distant sequences. If k is chosen too small, more false positive matches will be found. Therefore, in order to not be restricted by a prespecified k-mer size, Kaiju finds MEMs between reads and database to achieve both a high sensitivity and precision. Reads are directly assigned to a species or strain, or in case of ambiguity, to higher level nodes in the taxonomic tree. For example, if a read contains an amino acid sequence that is identical in two different species of the same genus then the read will be classified to this genus. Kaiju also offers the possibility to extend matches by allowing a certain number of amino acid substitutions at the end of an exact match in a greedy heuristic approach using the BLOSUM62 substitution matrix. See the paperfor a detailed description of Kaiju’s algorithm The most important adjustments are made to adjust the sensitivity of Kaiju. Running Kaiju in greedy mode will be more sensitive BUT less precise. As in we can see that we have Enterobacterales but not if it is Salmonella or Escherichia. Sensitivity <> Precision

Construct Burrows-Wheeler transform and FM-index

Before classification of reads, Kaiju's database index needs to be built from a reference protein database. You can either create a local index based on the currently available data from GenBank, or download one of the indexes used by the Kaiju web server. For creating a local index, the program kaiju-makedb will download a source database and the taxonomy files from the NCBI FTP server, convert them into a protein database and construct Kaiju's index (the Burrows-Wheeler transform and the FM-index) in one go. For this course, we have downloaded the pre-indexed database nr_euk (made 2019-06-25) from the Kaiju website so you do not need to create a database, but can use the Kaiju database at /home/databases/databases/Kaiju/

Assigning taxonomy using Kaiju

Kaiju needs to know the location of the database (kaiju_db_nr_euk.fmi in this case) which you want to use and the associate nodes.dmp file. It is able to take paired-end reads as in the example below:

 kaiju -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -i reads.fastq [-j reads2.fastq]

Mandatory arguments:

  -t FILENAME   Name of nodes.dmp file
  -f FILENAME   Name of database (.fmi) file
  -i FILENAME   Name of input file containing reads in FASTA or FASTQ format

Optional arguments:

  -j FILENAME   Name of second input file for paired-end reads
  -o FILENAME   Name of output file. If not specified, output will be printed to STDOUT
  -z INT        Number of parallel threads for classification (default: 1)
  -a STRING     Run mode, either "mem"  or "greedy" (default: greedy)
  -e INT        Number of mismatches allowed in Greedy mode (default: 3)
  -m INT        Minimum match length (default: 11)
  -s INT        Minimum match score in Greedy mode (default: 65)
  -E FLOAT      Minimum E-value in Greedy mode
  -x            Enable SEG low complexity filter (enabled by default)
  -X            Disable SEG low complexity filter
  -p            Input sequences are protein sequences
  -v            Enable verbose output

Exercise time !!!

Kaiju - Protein-based taxonomy

We are going to compare the effects of using different settings in Kaiju and visualize using Krona.

In the exercises we will use the database nr_euk.

Q1: What is nr_euk? And how do the choice of database influence the results of Kaiju?

Different settings will lead to differences in accuracy and precision of the model.

Q2: Explain the terms precision and sensitivity in relation to testing.

First, we will run kaiju using the most precise but less sensitive mem mode:

kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_1.fastq \
-j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_2.fastq -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -v -z 5 -a mem -o SRR7610114_mem.kaiju

Second, we will run using VERY greedy options allowing 5 mismatches and only an E-value of 0.1

kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_1.fastq \
-j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_2.fastq -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -v -z 5 -a greedy -e 5 -E 0.1 -o SRR7610114_greedy.kaiju

If you don't have the patience to wait for the program to run, you can also copy the files from

/home/projects/22126_NGS/exercises/metagenomics/kaiju

Try to visualise the results with Krona

kaiju2krona -i SRR7610114_mem.kaiju -o SRR7610114_mem.krona -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-n /home/databases/databases/Kaiju/kaiju_db_nr_euk_names.dmp ; ktImportText -o SRR7610114_mem.html SRR7610114_mem.krona  

Download the .html-file and open it in your browser.

If you are interested many other tools exists for taxonomic classification. However as the results for SRR7610114 we will continue with Kaiju, which we now want to run on all samples.

AS IT TAKES TOO MUCH TIME FOR EVERYONE TO RUN KAIJU ON ALL SAMPLES YOU SHOULD USE THE RESULTS FOUND IN /home/projects/22126_NGS/exercises/metagenomics/kaiju/all_samples and skip the parallelisation

parallel -j 1 --xapply "kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/{1} -j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/{2} \
-t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -o {1.}.kaiju -v -z 30" \
:::: /home/projects/22126_NGS/exercises/metagenomics/Pacu/no_fish_r1.list  :::: /home/projects/22126_NGS/exercises/metagenomics/Pacu/no_fish_r2.list;
cd all_samples; rename '_1.kaiju' '.kaiju' *.kaiju

Now we want to fuse all the information into one table. First we need to create tabulated tables and the tools provided by Kaiju are far from optimal, so we use some homebrewed ones.

mkdir all_samples;
cp /home/projects/22126_NGS/exercises/metagenomics/kaiju/all_samples/* all_samples/;
/home/ctools/misc_scripts/kaiju2phyloseq.py -i all_samples -n /home/databases/databases/Kaiju/kaiju_db_nr_euk_names.dmp \
-m /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -o pacu_kaiju; rm -r all_samples 

Q3: Take a look at pacu_kaiju.otu.tab and pacu_kaiju.tax.tab and explain what information the files contain.

Kaiju - RStudio import

I recommend moving files to your laptop and running RStudio locally.

Again, we will use RStudio, but first we need RStudio set up with necessary packages etc. This workflow uses Tidyverse and broom. Furthermore, we will use Phyloseq and DAtest.

library(tidyverse)
library(broom)
library(phyloseq)
library(DAtest)
library(vegan)

Load in the data from Kaiju and use the metadata found in /home/projects/22126_NGS/exercises/metagenomics/kaiju/

otutab <- read.csv("pacu_kaiju.otu.tab", sep = "\t", row.names = 1, header = TRUE)
OTU = otu_table(otutab, taxa_are_rows = TRUE)

taxtab <- read.csv("pacu_kaiju.tax.tab", sep = "\t", row.names = 1, header = TRUE)
taxmat = as.matrix(taxtab)
TAX = tax_table(taxmat)

metadata = read.csv("metadata.csv", sep = ",", skip=1, header=FALSE)
metadata <- rename(metadata, Run=1, Day=2, Treatment=3) #renaming the columns 

Once the data has been loaded we are now ready to perform the phyloseq analysis.

META = sample_data(metadata)
rownames(META) <-metadata$Run
physeq = phyloseq(OTU, TAX, META)
physeq 

The phyloseq results can be saved as an rds object as:

saveRDS(physeq, "pacu.phyloseq.rds")

Kaiju - Sample composition using R

To get a feel for the samples we can plot the domain distribution but first we'll make it into a dataframe

physeq_df <- psmelt(physeq)
rawkaijubarplot <- ggplot(physeq_df, aes(x = Sample, y = Abundance, fill = Domain)) + theme_bw() +   geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle=90, size=6))
rawkaijubarplot
dir.create("Results")
ggsave("Results/rawkaijubarplot.png") 

Q4: Look at the plot. Which domains do you see in the samples?

For simplicity we will focus on the main components of the known bacterial community. Adjust according to your question, time and resources. We want to focus on the known bacterial composition, so we do:

physeq_bac <- subset_taxa(physeq, Domain == "Bacteria")

Q5: Can you think of domains or fields that could be relevant to investigate for other research questions?

We can filter low abundant taxa based on three criteria:

  • They should be present in a minimum amount of samples (min.samples)
  • They should have a minimum amount of reads (min.reads)
  • They should have a minimum average relative abundance (min.abundance)

You don't have to use all three criteria. The filtered taxa are grouped in a new taxa called "Others".

We only want taxa that have at least a 0.00005 fraction of the total reads. These will be fused into a category called “Others”. This is quite a lot that we filter but it allows us to work faster in this example. Firstly we identify the number of reads from the fraction 0.00005.

 n_reads <- sum(sample_sums(physeq_bac))*0.00005 

This number of reads is used for as cutoff.

physeq_bac_cutoff = preDA(physeq_bac, min.reads = n_reads)
physeq_bac_cutoff

To visualize we do again make a dataframe

physeq_bac_cutoff_df <- psmelt(physeq_bac_cutoff) 

And we do the visualization

ggplot(physeq_bac_cutoff_df, aes(x = Sample, y = Abundance, fill = Family)) + theme_bw() +   geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle=90, size=6))
ggsave("Results/kaijuClassbarplot.png", width = unit(15,"cm"))

There is two "NA" categories. One indicates that Kaiju was unable to assign taxonomy, while the other comes from the fused “Others” that are less abundant.

We can also look at the relative abundance of each class and cluster according to treatment.

physeq_relat_abund <- transform_sample_counts(physeq_bac_cutoff, function(x){x / sum(x)})
phyloseq::plot_bar(physeq_relat_abund, fill = "Phylum") +
  geom_bar(aes(color = Phylum, fill = Phylum), stat = "identity", position = "stack") +
  labs(x = "", y = "Relative Abundance\n") +
  facet_wrap(~ Treatment, scales = "free") +
  theme(panel.background = element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())
ggsave("Results/kaijuPhylumRel_abund.png", width = unit(10, "cm"))

Kaiju - Beta-diversity analysis using R

When we start comparing samples, consequences of the compositional nature of sequencing become important. To do beta-diversity we need to normalize somehow. The classical way is by rarefaction to the number of observations of the smallest sample, however this means removing valuable data. Also, relative abundances become less precise, as it produces false zeroes and several other problems. Luckily, we now have much better methods available. We will use a Centered Log-Ratio transformation (CLR), which is a dedicated compositional transformation using log-ratio to the geometric mean of the relative abundance.

Given an observation vector of D “counted” features (taxa, operational taxonomic units or OTUs, genes, etc.) in a sample, x = [x1, x2, …xD] and G(x) is the geometric mean of x. We use the geometric mean instead of the more common arithmetic mean, because it is the only correct mean when presented as ratios to reference values.

CLR transformation for the sample can be obtained as follows:

File:CLR equation.png

However, G(x) cannot be determined for without deleting, replacing or estimating the 0 count values. Read more here and here about CLR.


We perform a zero correction by substituting 0 with 1 and all non-zero values are corrected such that log-ratios before and after correction are the same.

physeq_zc <- transform_sample_counts(physeq_bac_cutoff, function(y) sapply(y, function(x) ifelse(x==0, 1, (1-(sum(y==0)*1)/sum(y))*x)))

Now we can perform the log transformation:

physeq_clr <- transform_sample_counts(physeq_zc, function(x) log(x/exp(mean(log(x))))) 

We can now perform the Principal Component Analysis (PCA).

Q6:What is PCA used for?

Here we do a redundancy analysis (RDA) without restraints which is the same.

ord_clr <- ordinate(physeq_clr, "RDA")

Usually, we only look at the first couple of principal components because we can plot them easily but with a scree plot we can look at many more.

plot_scree(ord_clr) +
  geom_bar(stat="identity", fill = "blue") +
  labs(x = "\nAxis", y = "Proportion of Variance\n")
ggsave("Results/kaijuclrscree.png")

Q7: What do the plot tell us about the principal components and their associated amount of information?

Now, we will plot PC1 and PC2 while scaling the plot to reflect the relative amount of information explained by each.

clr1 <- ord_clr$CA$eig[1] / sum(ord_clr$CA$eig)
clr2 <- ord_clr$CA$eig[2] / sum(ord_clr$CA$eig)

We can plot were the coordinates are fixed to reflect importance:

plot_ordination(physeq, ord_clr, type="samples", color="Treatment") +
  geom_point(size = 4) +
  coord_fixed(clr2 / clr1) + 
  geom_text(aes(label=Day), colour="black")
ggsave("Results/kaijuclrPCA.png")

Or more spread out for easy reading:

plot_ordination(physeq, ord_clr, type="samples", color="Treatment") +
  geom_point(size = 6) +
  geom_text(aes(label=Day), colour="black")
ggsave("Results/kaijuclrPCA1x1.png")

We see a clear tendency for especially late phase antibiotic treatment samples being separated from the other samples, but is it statistically significant? Albeit very useful, PCA is just an exploratory data visualization tool. To test whether the samples cluster beyond what is expected by sampling variability we will use permutational multivariate analysis of variance (PERMANOVA). It does this by partitioning the sums of squares for the within- and between-cluster components using the concept of centroids. Many permutations of the data (i.e. random shuffling) are used to generate the null distribution. The test from ADONIS can be confounded by differences in dispersion (or spread)…so we want to check this as well.

First, we create a distance matrix. This is also called the Aitchison distance.

clr_dist_matrix <- distance(physeq_clr, method = "euclidean")
adonis(clr_dist_matrix ~ sample_data(physeq_clr)$Treatment, method = "eucledian")

Why do we not always end up with exactly this result? So, despite grouping the less significant “Day 1” samples together with the other antibiotic treatment samples or not merging Pre, Post and control samples, the analysis still shows that the samples are not randomly clustered according to treatment. In fact, Treatment explains 31% of the variance.

Now, we want to do a post-hoc pairwise test to see exactly which treatments drive the variance. This can be done easily using the package pairwiseAdonis - installation guide here.

library(pairwiseAdonis)
pairwise.adonis(clr_dist_matrix, sample_data(physeq_clr)$Treatment, sim.method = "eucledian", p.adjust.m = "holm")

Q8: Do we see any significant pairs?

Kaiju - Differential abundance using R

Now we want to focus on differential abundance. In this case we will look for differential abundance based on phylogeny, but my chosen method can also be used for genes. DeSeq2 is fairly easy to use, shows consistent performance and works by fitting a negative binomial model for count data. DESeq2 default data normalization is Relative Log Expression (RLE) based on scaling each sample by the median ratio of the sample counts over the geometric mean counts across samples. This paper explains more about RLE and CLR. What is also very useful is that DeSeq2 easily allows for the inclusion of covariates in the analysis.

First we load the needed R packages and the R-object created in "Kaiju - RStudio import". You will need to install DESeq2.

library(phyloseq)
library(DAtest)
library(DESeq2)
pacuphyseq = readRDS("pacu.phyloseq.rds")

Firstly, we need to decide at which level you want to test your hypothesis. Do you want to be specific or is it more fitting with your hypothesis to test at a higher taxonomic level. It really depends on the hypothesis and sometimes it makes sense to do it on several levels. Secondly, we need to filter. In this example we filter a lot to speed things up, but there really is no golden standard here. Scientifically, it is a trade-off between keeping rare Amplicon sequence variants (ASVs) which could be interesting to test, and removing ASVs to increase statistical power. Here we collapse the hierarchical taxonomical data at genus level:

phy_genus <- tax_glom(pacuphyseq, "Genus")

This takes 5-10 minutes. In the meantime you can read a bit about DeSeq2

We display a summary of the data

phy_genus

The DESeq2 analysis is carried out.

treatdds <- phyloseq_to_deseq2(phy_genus, ~ Treatment)
treatdds <- DESeq(treatdds)

We investigate the results, where we use the threshold for significance of 0.05. Entries with p-values above 0.05 are removed.

res = results(treatdds, alpha = 0.05)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(phy_genus)[rownames(sigtab), ], "matrix"))

We take a look at the data. The dimensions tells us how many OTU's that are significantly differential abundant between the treatment forms.

head(sigtab)
dim(sigtab)

Q9: How many OTU's are significantly different between the treatments? Try to change the alpha to 0.01. How many OTU's is then significant?

In order to visualise the data we sort the OTU's according to p-value and select the 100 OTU's with the lowest p-value.

sig100 <- sigtab[order(sigtab$padj),][1:100,]

We visualise the significant OTU's. We choose to color and fill by Phylum and label by Genus.

library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
  scale_fill_brewer(palette = palname, ...)
}
# Phylum order
x = tapply(sigtab$log2FoldChange, sigtab$Phylum, function(x) max(x))
x = sort(x, TRUE)
sigtab$Phylum = factor(as.character(sigtab$Phylum), levels=names(x))
# Genus order
x = tapply(sigtab$log2FoldChange, sigtab$Genus, function(x) max(x))
x = sort(x, TRUE)
sigtab$Genus = factor(as.character(sigtab$Genus), levels=names(x))
ggplot(sigtab, aes(x=Genus, y=log2FoldChange, color=Phylum)) + geom_point(size=2) + 
  theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))
ggsave("Results/DDseq_100OTU.png")

Try to select the 350 OTU's with the lowest p-value and make the same plot.

Q10: What does the plots with 100 and 350 OTU's show? Are any phylums dominant?

Please find answers here