Bayesian phylogenetics: clock models: Difference between revisions

No edit summary
Line 3: Line 3:
== Overview ==
== Overview ==


In this exercise we will explore how to use the software tool BEASTX to construct phylogenies based on molecular-clock models. In previous exercises we have worked with phylogenies where we did not have information about how fast sequences were evolving, and we therefore used the number of substitutions as branch lengths. When there ''is'' temporal information (e.g., fossils that can be used to date an internal node, or information about sampling-time for rapidly evolving sequences) we can instead use clock-based models. These models assume that sequences are evolving at a more or less constant rate, branch lengths are expressed in terms of time, and we can estimate times for internal nodes. Apart from being useful when the focus is on dating evolutionary events, time trees are also useful in that the clock model itself can lead to better inference of the phylogeny (essentially because it adds prior information to the problem, such that we don't have to infer all branch lengths only from limited amounts of sequence variation).
In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.


The main purpose with this exercise is to make you acquainted with BEASTX and to learn how to fit clock-models using either fossil data (by setting a prior on the date for internal nodes) or using so-called heterochronous data, i.e., sequences where the individual leaves have been sampled at different, known times, and where evolution is sufficiently rapid that we can estimate the parameters in a clock-model by seeing how much change has happened over time.
In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.


For these tutorials you only need to report minimally: make a small report with a handful of uncommented screendumps showing your progress through the exercise. The important thing is that you get to be a bit familiar with the use of the program, such that you can use it in the mini project later.
In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.


:* In the exercises below, you should simply follow the instructions on the tutorial pages.  
The main purpose of the exercise is:
:* Depending on your operating system and on how you installed the software, you can start required programs either from the command line or by double clicking an app. The names of the executables that you will need for this exercise are:
:* to become familiar with the BEASTX workflow
:* to set up and run a clock-based Bayesian phylogenetic analysis
:* to inspect MCMC output in Tracer
:* to summarize posterior trees using TreeAnnotator
:* to visualize and interpret a dated tree in FigTree
:* to compare a strict-clock analysis with a relaxed-clock analysis
 
:* In the exercise below, you should follow the instructions on the tutorial page.
:* Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double clicking an app. The executables that you may need are:
:** beauti
:** beauti
:** beast
:** beast
Line 17: Line 25:
:** figtree
:** figtree


== Introduction to BEASTX ==
== BEASTX tutorial ==
 
:* Open this link in a new tab: [https://beast.community/workshop_rates_and_dates Estimating rates and dates from time-stamped sequences]
 
Answer the questions below and hand in the report. Include a small number of screendumps showing relevant output from the tools you are using.
 
== Questions ==
 
'''Question 1'''
 
Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?
 
'''Question 2'''
 
In the first analysis, the tutorial uses a strict molecular clock. What does this assumption mean biologically and statistically? Why might this be a reasonable first model to try, and what kinds of evolutionary patterns would violate this assumption?
 
'''Question 3'''
 
After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screendump from Tracer.
 
'''Question 4'''
 
Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.
 
'''Question 5'''
 
TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations and explain briefly why each is useful.
 
'''Question 6'''
 
Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?
 
'''Question 7'''


:* Create a new directory for storing the results of this exercise:
The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?
cd /path/to/molevol
mkdir bayes2
cd bayes2
:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Introduction-to-BEAST2/ Introduction to BEAST2]
:* Follow instructions down to the optional part.
:* '''Note:''' If you are running the BEASTX programs from the command line (not starting them by double clicking an app), then to get the graphical interface for BEASTX shown in figure 11 in the tutorial, you should start the program as follows:
beast -options


== Prior selection and clock calibration using Influenza A data ==
'''Question 8'''


:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Prior-selection/ Prior selection and clock calibration using Influenza A data]
Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state what output you used to assess this. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.
:* '''NOTE:''' Only do the part about '''heterochronous''' data (not the homochronous part, although you can if you want to)

Revision as of 12:39, 22 April 2026

This exercise is part of the course Computational Molecular Evolution (22115).

Overview

In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.

In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.

In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.
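
As a rough illustration of why sampling times carry information about the rate, the sketch below fits a straight line to root-to-tip genetic distance as a function of sampling year. The slope is a crude estimate of the evolutionary rate, and the point where the line crosses zero distance is a crude estimate of the date of the root. This is a simple heuristic known as root-to-tip regression (the kind of check performed by tools such as TempEst), not the Bayesian inference that BEASTX performs. The data values and the function name `fit_line` are hypothetical and chosen only for illustration.

```python
# Hypothetical data, invented for illustration only:
# sampling year of each sequence and its genetic distance from the root
# (in expected substitutions per site).
sampling_years = [1980.0, 1990.0, 2000.0, 2010.0, 2020.0]
root_to_tip = [0.011, 0.019, 0.031, 0.039, 0.050]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Slope: substitutions per site per year (a crude rate estimate).
rate, intercept = fit_line(sampling_years, root_to_tip)

# The fitted line reaches zero distance at the estimated root date.
root_year = -intercept / rate

print(f"Estimated rate: {rate:.5f} substitutions/site/year")
print(f"Estimated root date: {root_year:.1f}")
```

If all sequences had been sampled in the same year, `sxx` would be zero and no slope could be estimated. This is the regression analogue of why a rate cannot be estimated from sequences alone when there is no temporal information.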

The main purposes of the exercise are:

  • to become familiar with the BEASTX workflow
  • to set up and run a clock-based Bayesian phylogenetic analysis
  • to inspect MCMC output in Tracer
  • to summarize posterior trees using TreeAnnotator
  • to visualize and interpret a dated tree in FigTree
  • to compare a strict-clock analysis with a relaxed-clock analysis

  • In the exercise below, you should follow the instructions on the tutorial page.
  • Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double-clicking an app. The executables that you may need are:
    • beauti
    • beast
    • tracer
    • treeannotator
    • figtree

BEASTX tutorial

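  • Open this link in a new tab: Estimating rates and dates from time-stamped sequences (https://beast.community/workshop_rates_and_dates)

Answer the questions below and hand in the report. Include a small number of screendumps showing relevant output from the tools you are using.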

Questions

Question 1

Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?

Question 2

In the first analysis, the tutorial uses a strict molecular clock. What does this assumption mean biologically and statistically? Why might this be a reasonable first model to try, and what kinds of evolutionary patterns would violate this assumption?

Question 3

After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screendump from Tracer.

Question 4

Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.

Question 5

TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations and explain briefly why each is useful.

Question 6

Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?

Question 7

The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?

Question 8

Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state what output you used to assess this. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.