Bayesian phylogenetics: clock models
This exercise is part of the course Computational Molecular Evolution (22115).
Overview
In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.
In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.
In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.
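To see why sampling times carry information about rates and dates, it can help to look at a simple root-to-tip regression. The sketch below is not what BEAST does internally (BEAST estimates everything jointly in a Bayesian framework); it only illustrates the intuition. The sampling years and distances are made-up toy numbers, generated under the assumption of a perfectly clock-like rate of 0.001 substitutions per site per year and a root in 1980, and the function name `root_to_tip_regression` is hypothetical.

```python
def root_to_tip_regression(sampling_times, distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling time.

    Returns (rate, root_time): the slope is a rough estimate of the
    substitution rate, and the x-intercept is a rough estimate of the
    time of the root (where the expected distance is zero).
    """
    n = len(sampling_times)
    mean_t = sum(sampling_times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in sampling_times)
    sxy = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(sampling_times, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    root_time = -intercept / slope
    return slope, root_time


# Toy data (assumed, not from the tutorial): five sequences sampled in
# different years, with distances generated from a perfect clock.
assumed_rate = 1e-3       # substitutions per site per year
assumed_root = 1980.0     # calendar year of the root
sampling_years = [1990.0, 1995.0, 2000.0, 2005.0, 2010.0]
root_to_tip = [assumed_rate * (t - assumed_root) for t in sampling_years]

rate, root_year = root_to_tip_regression(sampling_years, root_to_tip)
print(f"estimated rate: {rate:.2e} substitutions/site/year")
print(f"estimated root year: {root_year:.1f}")
```

If all sequences had been sampled in the same year, `sxx` would be zero and neither the slope nor the intercept could be estimated. This is the reason heterochronous sampling is needed to calibrate the clock from the data alone.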
The main purposes of the exercise are:
- to become familiar with the BEASTX workflow
- to set up and run a clock-based Bayesian phylogenetic analysis
- to inspect MCMC output in Tracer
- to summarize posterior trees using TreeAnnotator
- to visualize and interpret a dated tree in FigTree
- to compare a strict-clock analysis with a relaxed-clock analysis
In the exercise below, you should follow the instructions on the tutorial page.
Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double-clicking an app. The executables that you may need are:
- beauti
- beast
- tracer
- treeannotator
- figtree
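If you plan to work from the command line, a quick way to check which of these programs can be found is sketched below. It assumes that the executables are named exactly as in the list above and that they are on your `PATH`. Depending on how you installed the software, the names may differ or the programs may only be available as apps, in which case "not found" does not necessarily mean that the program is missing.

```python
import shutil

# Executable names taken from the list above; actual names may differ
# depending on operating system and installation method.
TOOLS = ["beauti", "beast", "tracer", "treeannotator", "figtree"]


def locate_tools(names=TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


found = locate_tools()
for name, path in found.items():
    print(f"{name:15s} {path or 'not found on PATH'}")
```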
BEASTX tutorial
- Open this link in a new tab: Estimating rates and dates from time-stamped sequences
Answer the questions below and hand in the report. Include a small number of screenshots showing relevant output from the tools you are using.
Questions
Question 1
Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?
Question 2
In the first analysis, the tutorial uses a strict molecular clock. What is the assumption behind this model? Explain what is being assumed about evolutionary rates on different branches, and why this means that expected branch length depends on branch duration and a single shared substitution rate. Also describe a pattern in the data that would suggest this assumption may be unrealistic.
Question 3
After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screenshot from Tracer.
Question 4
Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.
Question 5
TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or a simple consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations visible in this tutorial, and explain briefly why each is useful.
Question 6
Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?
Question 7
The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?
Question 8
Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state which parameter in Tracer you inspected to assess this, and explain what kind of result would indicate little versus substantial rate variation. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.