Bayesian Phylogeny
This exercise is part of the course Computational Molecular Evolution (22115).
Overview
Today's exercise will focus on phylogenetic analysis using Bayesian methods.
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data are produced. This means that, for a given set of parameter values, you can compute the probability or probability density of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values.
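The flat-prior case can be sketched numerically in R. The snippet below uses a coin-flip example invented purely for illustration (it is not part of the exercise data): with a flat prior over a grid of parameter values, the normalized posterior is just the normalized likelihood, so the posterior mode coincides with the maximum likelihood estimate.

```r
# Coin-flip example invented for illustration: 7 heads out of 10 tosses.
p_grid     <- seq(0.01, 0.99, by = 0.01)   # candidate parameter values
prior      <- rep(1, length(p_grid))       # flat prior
likelihood <- dbinom(7, size = 10, prob = p_grid)
posterior  <- prior * likelihood
posterior  <- posterior / sum(posterior)   # normalize to sum to 1

# With a flat prior, the posterior mode equals the maximum likelihood
# estimate (0.7 here):
p_grid[which.max(posterior)]
```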
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.
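As an illustration of how such a lognormal prior could be chosen, the parameter values below are derived for this example only (they are not MrBayes defaults): center the distribution on the geometric midpoint of 0.5 and 20, and pick the log-scale spread so the central 95% interval covers that range.

```r
# Derived for this example (not MrBayes defaults): lognormal parameters
# placing roughly 95% of the prior mass for kappa between 0.5 and 20.
meanlog <- log(sqrt(0.5 * 20))                 # center on the geometric midpoint
sdlog   <- (log(20) - log(0.5)) / (2 * 1.96)   # 95% interval spans 0.5-20

# Check the implied central 95% prior interval:
qlnorm(c(0.025, 0.975), meanlog = meanlog, sdlog = sdlog)
```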
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.
Getting started
- In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).
cd /path/to/molevol
mkdir bayes
cd bayes
cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus
cp ../data/neanderthal.nexus ./neanderthal.nexus
cp ../data/hcvsmall.nexus ./hcvsmall.nexus
- You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.
Load R libraries
- In RStudio: set the working directory to the bayes directory. Then issue these commands:
library(magrittr)
library(tidyverse)
library(bayesplot)
Posterior samples and posterior probability of trees
Question 1
- In today's exercise we will be using the program "MrBayes" to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.
- Note that the command "help" will give you a list of all available commands. Issuing "help command" will give you a more detailed description of the specified command along with current option values. This is similar to how "help command" works in PAUP.
Start program
- In a terminal window, issue the command:
mb
- This starts the program, giving you a prompt ("MrBayes> ") where you can enter commands.
Get a quick overview of available commands
help
Load your sequences
execute primatemitDNA.nexus
- This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.
Inspect data set
showmatrix
Define outgroup
outgroup Gibbon
Specify your model of sequence evolution
lset nst=2 rates=gamma
- This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.
Start Markov chain Monte Carlo sampling
- Make sure to make the shell window as wide as possible and then issue the following commands to start the run:
mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000
- What you are doing here is using the method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.
- Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps ("generations") and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one "cold" from which sampling takes place, and two "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution.
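The basic move performed by each individual chain can be illustrated with a bare-bones Metropolis sampler for a single continuous parameter. This is a toy sketch in which a normal density stands in for the real posterior; it is not what MrBayes does internally, but it shows the propose-and-accept cycle that each generation consists of:

```r
set.seed(42)
# Toy target: pretend the posterior of some parameter is a Normal(2, 0.5)
# density (invented for illustration).
log_post <- function(theta) dnorm(theta, mean = 2, sd = 0.5, log = TRUE)

ngen    <- 10000
samples <- numeric(ngen)
theta   <- 0                               # arbitrary starting value
for (gen in seq_len(ngen)) {
  proposal <- theta + rnorm(1, sd = 0.3)   # propose a small random move
  # Accept with probability min(1, posterior ratio); otherwise stay put:
  if (log(runif(1)) < log_post(proposal) - log_post(theta)) {
    theta <- proposal
  }
  samples[gen] <- theta
}
# After discarding a burn-in, the samples approximate the target:
mean(samples[-(1:1000)])
```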
- The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A "split" is the same as a bipartition, i.e., a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).
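For illustration, the diagnostic can be computed by hand for a handful of made-up split frequencies. (MrBayes's own bookkeeping differs in details, e.g. it may exclude rare splits, but the principle is the same.)

```r
# Made-up split frequencies observed in the tree samples of two runs:
split_freq_run1 <- c(0.98, 0.52, 0.31)
split_freq_run2 <- c(0.97, 0.55, 0.28)

# For each split: standard deviation of its frequency across the two runs.
per_split_sd <- apply(rbind(split_freq_run1, split_freq_run2), 2, sd)
mean(per_split_sd)   # the average standard deviation of split frequencies
```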
- During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so that one of the heated chains becomes cold (and sampling then takes place from that chain).
Continue run until parallel runs converge on same solution
- At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01 then you should answer "yes" when the program asks "Continue the analysis? (yes/no)").
Question: MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?
Question 2
Have a look at the resulting sample files
- Open a new Terminal window (don't quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:
nedit primatemitDNA.nexus.run1.p &
- This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are:
- Gen: generation; number of MCMC steps taken so far
- lnL: log likelihood of the current parameter estimates
- LnPr: log of the prior probability
- TL: tree length (sum of all branch lengths)
- kappa: transition/transversion rate ratio
- pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T
- alpha: shape parameter for the gamma distribution.
- (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial "burnin" period, before they settle near their most probable values.
Question: You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.
df_expdist = tibble(
x = seq(0, 1, by = 0.001),
density = dexp(x, rate = 10),
logdensity = log(dexp(x, rate = 10))
)
ggplot(df_expdist, aes(x = x, y = density)) +
geom_line(color="blue") +
geom_hline(yintercept = 1, lty=2) +
labs(
x = "Branch length",
y = "Probability density",
title = "Exponential prior on branch lengths: Exp(rate = 10)"
)
ggplot(df_expdist, aes(x = x, y = logdensity)) +
geom_line(color="blue") +
geom_hline(yintercept = 0, lty=2) +
labs(
x = "Branch length",
y = "log of Probability density",
title = "Exponential prior on branch lengths: Exp(rate = 10)"
)
Question 3
Examine MCMC trajectory for gamma shape parameter, alpha
- Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:
df_primates = read_tsv("primatemitDNA.nexus.run1.p", skip=1)
mcmc_trace(df_primates, pars="alpha")
- mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.
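Beyond the trace plot, the post-burn-in samples can be summarized numerically. The sketch below simulates a stand-in trace (with invented values) so that it is self-contained; in the exercise you would apply the same steps to the alpha column of df_primates:

```r
set.seed(1)
# Simulate a stand-in parameter trace: a burn-in drift followed by
# stationary sampling around 0.3 (values invented for illustration).
gen   <- seq(0, 100000, by = 100)
alpha <- c(seq(1, 0.3, length.out = 200),
           rnorm(length(gen) - 200, mean = 0.3, sd = 0.05))
trace <- data.frame(Gen = gen, alpha = alpha)

# Discard the first 25% of generations as burn-in, then summarize:
burnin_gen <- max(trace$Gen) * 0.25
alpha_post <- trace$alpha[trace$Gen > burnin_gen]
mean(alpha_post)                       # posterior mean
quantile(alpha_post, c(0.025, 0.975))  # 95% credible interval
```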
Question: Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?
Question 4
Investigate posterior probability distribution over trees
- Now, close the nedit window and have a look at the file containing sampled trees:
nedit primatemitDNA.nexus.run1.t &
- Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.
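The count of 15 topologies comes from the general formula (2n-5)!! for the number of unrooted binary trees on n taxa, which you can check in R:

```r
# Number of unrooted binary tree topologies for n >= 4 taxa:
# (2n-5)!! = 3 * 5 * 7 * ... * (2n-5)
n_unrooted_trees <- function(n) prod(seq(3, 2 * n - 5, by = 2))

n_unrooted_trees(5)    # 15, matching the count above
n_unrooted_trees(10)   # over two million topologies already
```

This is why the number of taxa matters so much: the tree space that the MCMC must explore grows super-exponentially.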
- MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.
- Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).
sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes
- (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has "clade credibility" values. We will return to the meaning of these later in today's exercise.
- What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc, and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.
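To illustrate how the cumulative column can be used, here is a small sketch with made-up posterior probabilities (sorted from most to least probable, as MrBayes prints them):

```r
# Made-up posterior tree probabilities, sorted as MrBayes prints them:
tree_prob <- c(0.70, 0.16, 0.10, 0.02, 0.015, 0.005)   # the "p" column
cum_prob  <- cumsum(tree_prob)                          # the "P" column

# The 95% credible set is the smallest set of top-ranked trees whose
# cumulative probability reaches 0.95:
credible_set_size <- which(cum_prob >= 0.95)[1]
credible_set_size   # here: the 3 most probable trees
```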
- The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).
- NOTE: Annoyingly, there is a bug in the version of MrBayes we are using here, which means leaf names are not printed in the list of trees with probabilities. However, the most probable tree here is in fact identical to the consensus tree printed above it.
Question: What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?
Neanderthal data: posterior probability of clades
Question 5
For many years, there was considerable debate about the origin of modern humans. One view, often called the Multiregional Hypothesis, proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?
We will use the present data set to examine this question.
Load Neanderthal data set
- In the Terminal where you have MrBayes running:
execute neanderthal.nexus
delete 5-40
- As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.
Investigate data
showmatrix
- This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.
Start analysis
outgroup Pan_troglodytes
lset nst=mixed rates=gamma
mcmc ngen=500000 nchains=3 diagnfreq=10000
- Here we use the command `nst=mixed` which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now considers the substitution model as one more parameter, and uses MCMC to sample from the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).
Find posterior probability of clades
- When the run has finished, issue this command to compute a consensus tree:
sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25
- Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set).
Question: What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?
Posterior distributions over other parameters
Question 6
- As the last thing, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.
Load data set
execute hcvsmall.nexus
Define site partition
charset 1stpos=1-.\3
charset 2ndpos=2-.\3
charset 3rdpos=3-.\3
partition bycodon = 3:1stpos,2ndpos,3rdpos
set partition=bycodon
prset ratepr=variable
- This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).
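To see exactly which sites the charset definitions pick out, here is a small R illustration for a hypothetical alignment of 12 sites:

```r
# Hypothetical alignment of 12 sites:
n_sites <- 12
pos1 <- seq(1, n_sites, by = 3)   # sites 1, 4, 7, 10
pos2 <- seq(2, n_sites, by = 3)   # sites 2, 5, 8, 11
pos3 <- seq(3, n_sites, by = 3)   # sites 3, 6, 9, 12

# Together the three classes cover every site exactly once:
sort(c(pos1, pos2, pos3))
```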
Specify model
lset nst=6
- This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.
- When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.
Inspect model details
showmodel
- This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "Statefreq") have a "Dirichlet" prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).
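To get a feel for the Dirichlet prior, you can draw random frequency vectors from it. A standard construction normalizes independent gamma draws (packages such as gtools also provide an rdirichlet function; the helper below is written for this example):

```r
set.seed(1)
# Draw one frequency vector from a flat Dirichlet(1,1,1,1) prior by
# normalizing independent gamma draws:
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
freqs <- rdirichlet1(c(1, 1, 1, 1))
freqs        # four positive frequencies
sum(freqs)   # they sum to 1
```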
- We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).
Start MCMC sampling
mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3
- The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).
Compute summary of parameter values
sump relburnin=yes burninfrac=0.25
- The sump command (with a "p" at the end) works much like the sumt command (with a "t" at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.
- First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled "1" and "2" respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.
- Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.
- The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and measures how many effectively independent samples you have from the posterior; the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ gives another convergence diagnostic (also known as "R-hat") and should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of the chains) are sampling the same distribution of values. As a rule of thumb, values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 suggest poor convergence.
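To see what ESS measures, here is a crude R sketch on a simulated autocorrelated chain. (The coda package's effectiveSize() and MrBayes itself use more careful estimators; this simplified version just applies the formula N / (1 + 2 * sum of autocorrelations).)

```r
set.seed(1)
# Simulate a strongly autocorrelated AR(1) chain as a stand-in for MCMC output:
x <- as.numeric(stats::filter(rnorm(5000), 0.9, method = "recursive"))

# Crude ESS estimate: N / (1 + 2 * sum of early autocorrelations)
crude_ess <- function(x, max_lag = 100) {
  rho   <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]   # drop lag 0
  first <- which(rho < 0.05)[1]                              # truncate noisy tail
  if (!is.na(first)) rho <- rho[seq_len(first - 1)]
  length(x) / (1 + 2 * sum(rho))
}
crude_ess(x)   # far fewer effectively independent samples than the 5000 drawn
```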
Question: What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?
Question 7: Based on the reported posterior means, which of the two parameters, r(AC) or r(CG), appears to be larger on average?
Question 8
Marginal distributions
- Comparing posterior means gives a useful first summary of the two parameters, but it does not show how uncertain these estimates are. One of the strengths of Bayesian analysis is precisely that it gives us access not just to a single best estimate, but to a full posterior probability distribution over possible parameter values. We will now use this to get a fuller picture of the two substitution-rate parameters.
- We start by looking at the posterior distribution of each parameter separately. Such a distribution for one parameter alone is called its marginal distribution.
Examine marginal distributions
- In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burn-in.
df_hcv = read_tsv("hcvsmall.nexus.run1.p", skip=1)
burnin = df_hcv$Gen %>%
max() %>%
multiply_by(0.25) %>%
floor()
df_hcv2 = df_hcv %>%
filter(Gen > burnin) %>%
select(CG = `r(C<->G){all}`,
AC = `r(A<->C){all}`
)
mcmc_intervals(df_hcv2, prob_outer = 1)
mcmc_areas(df_hcv2, prob_outer = 1)
- The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions.
- You can also simply plot the data using ggplot:
df_hcv2_long = pivot_longer(df_hcv2, cols = c("CG", "AC"))
ggplot(df_hcv2_long) +
geom_density(mapping=aes(x=value, fill=name), alpha=0.3) +
labs(x="Substitution rate")
Question: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?
Question 9
Marginal vs. joint distributions
- Looking at the marginal distributions gives us a fuller understanding of the uncertainty in each parameter separately. However, it still does not directly answer the question of whether r(AC) is larger than r(CG) in most posterior samples, because the two parameters may be associated with each other across samples. For instance, one parameter might be larger than the other in almost every individual sample, even though the two overall marginal distributions overlap. To answer such questions, we must examine the two parameters simultaneously. A probability distribution over several parameters at the same time is called a "joint distribution", whereas a distribution for one parameter considered by itself is called a "marginal distribution".
- These plots and results explore the relationship between the A<->C and C<->G rates.
ggplot(df_hcv2, aes(x=CG, y=AC)) +
geom_point(col="blue") +
geom_abline(intercept=0, slope=1, lty=2, col="red") +
xlim(0,0.25) +
ylim(0,0.25) +
labs(x="CG rate", y ="AC rate")
ggplot(df_hcv2, aes(x=CG, y= AC)) +
geom_hex(col="blue") +
geom_abline(intercept=0, slope=1, lty=2, col="red") +
xlim(0,0.25) +
ylim(0, 0.25) +
labs(x="CG rate", y ="AC rate")
df_hcv2 %>%
nrow()
df_hcv2 %>%
filter(AC>CG) %>%
nrow()
Question: Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) > r(CG)?
- Note how examining the joint distribution provides information that you could not obtain by simply comparing the marginal distributions. In particular, it lets you answer direct questions about how parameters relate to each other, for instance whether one is larger than another in most posterior samples. The same idea can be used to answer many other questions about posterior distributions.
Question 10
- Now, plot the relative substitution rates at the first, second, and third codon positions:
df_hcv3 = df_hcv %>%
filter(Gen > burnin) %>%   # reuse the 25% burn-in cutoff computed above
select(Codon_1st = `m{1}`,
Codon_2nd = `m{2}`,
Codon_3rd = `m{3}` ) %>%
pivot_longer(cols=c("Codon_1st", "Codon_2nd", "Codon_3rd"))
ggplot(df_hcv3) +
geom_density(mapping=aes(x=value, fill=name), alpha=0.3) +
labs(x="Relative substitution rate")
Question: Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)
- Codon position 2 is the most variable of the codon positions.
- Codon position 1 is the most variable of the codon positions.
- Codon position 1 is the most conserved codon position.
- Codon position 3 is the most conserved codon position.
- Codon position 3 is the most variable of the codon positions.
- Codon position 2 is the most conserved codon position.
Question 11
Question: How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?