<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22115/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Gorm</id>
	<title>22115 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22115/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Gorm"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php/Special:Contributions/Gorm"/>
	<updated>2026-05-03T09:52:43Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=293</id>
		<title>Bayesian phylogenetics: clock models</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=293"/>
		<updated>2026-04-22T13:43:44Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Questions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.&lt;br /&gt;
&lt;br /&gt;
In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.&lt;br /&gt;
&lt;br /&gt;
The main purpose of the exercise is:&lt;br /&gt;
:* to become familiar with the BEASTX workflow&lt;br /&gt;
:* to set up and run a clock-based Bayesian phylogenetic analysis&lt;br /&gt;
:* to inspect MCMC output in Tracer&lt;br /&gt;
:* to summarize posterior trees using TreeAnnotator&lt;br /&gt;
:* to visualize and interpret a dated tree in FigTree&lt;br /&gt;
:* to compare a strict-clock analysis with a relaxed-clock analysis&lt;br /&gt;
&lt;br /&gt;
:* In the exercise below, you should follow the instructions on the tutorial page.&lt;br /&gt;
:* Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double clicking an app. The executables that you may need are:&lt;br /&gt;
:** beauti&lt;br /&gt;
:** beast&lt;br /&gt;
:** tracer&lt;br /&gt;
:** treeannotator&lt;br /&gt;
:** figtree&lt;br /&gt;
&lt;br /&gt;
== BEASTX tutorial ==&lt;br /&gt;
&lt;br /&gt;
:* Open this link in a new tab: [https://beast.community/workshop_rates_and_dates Estimating rates and dates from time-stamped sequences]&lt;br /&gt;
&lt;br /&gt;
Answer the questions below and hand in the report. Include a small number of screendumps showing relevant output from the tools you are using.&lt;br /&gt;
&lt;br /&gt;
== Questions ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the first analysis, the tutorial uses a strict molecular clock. What is the assumption behind this model? Explain what is being assumed about evolutionary rates on different branches, and why this means that expected branch length depends on branch duration and a single shared substitution rate. Also describe a pattern in the data that would suggest this assumption may be unrealistic.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screendump from Tracer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or a simple consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations visible in this tutorial, and explain briefly why each is useful.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state which parameter in Tracer you inspected to assess this, and explain what kind of result would indicate little versus substantial rate variation. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=292</id>
		<title>Bayesian phylogenetics: clock models</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=292"/>
		<updated>2026-04-22T10:50:43Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Questions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.&lt;br /&gt;
&lt;br /&gt;
In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.&lt;br /&gt;
&lt;br /&gt;
The main purpose of the exercise is:&lt;br /&gt;
:* to become familiar with the BEASTX workflow&lt;br /&gt;
:* to set up and run a clock-based Bayesian phylogenetic analysis&lt;br /&gt;
:* to inspect MCMC output in Tracer&lt;br /&gt;
:* to summarize posterior trees using TreeAnnotator&lt;br /&gt;
:* to visualize and interpret a dated tree in FigTree&lt;br /&gt;
:* to compare a strict-clock analysis with a relaxed-clock analysis&lt;br /&gt;
&lt;br /&gt;
:* In the exercise below, you should follow the instructions on the tutorial page.&lt;br /&gt;
:* Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double clicking an app. The executables that you may need are:&lt;br /&gt;
:** beauti&lt;br /&gt;
:** beast&lt;br /&gt;
:** tracer&lt;br /&gt;
:** treeannotator&lt;br /&gt;
:** figtree&lt;br /&gt;
&lt;br /&gt;
== BEASTX tutorial ==&lt;br /&gt;
&lt;br /&gt;
:* Open this link in a new tab: [https://beast.community/workshop_rates_and_dates Estimating rates and dates from time-stamped sequences]&lt;br /&gt;
&lt;br /&gt;
Answer the questions below and hand in the report. Include a small number of screendumps showing relevant output from the tools you are using.&lt;br /&gt;
&lt;br /&gt;
== Questions ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the first analysis, the tutorial uses a strict molecular clock. What is the assumption behind this model? Explain what is being assumed about evolutionary rates on different branches, and why this means that expected branch length depends on branch duration and a single shared substitution rate. Also describe a pattern in the data that would suggest this assumption may be unrealistic.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screendump from Tracer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or a simple consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations visible in this tutorial, and explain briefly why each is useful.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state which parameter in Tracer you inspected to assess this, and explain what kind of result would indicate little versus substantial rate variation. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=291</id>
		<title>Bayesian phylogenetics: clock models</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=291"/>
		<updated>2026-04-22T10:39:12Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
In this exercise we will use the software package BEASTX to infer phylogenies under molecular-clock models.&lt;br /&gt;
&lt;br /&gt;
In previous exercises, branch lengths were measured only in expected numbers of substitutions per site. In a clock-based analysis, genetic change is instead related to calendar time through a model of evolutionary rates. If temporal information is available, for example in the form of known sampling times for rapidly evolving sequences, this can be used to estimate both the rate of evolution and the times of internal nodes in the tree.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will focus on so-called heterochronous data, i.e., sequence data where the individual sequences were sampled at different known times. When evolution is sufficiently rapid, the amount of sequence change observed over these sampling times contains information about the evolutionary rate and about the timing of common ancestors.&lt;br /&gt;
&lt;br /&gt;
The main purpose of the exercise is:&lt;br /&gt;
:* to become familiar with the BEASTX workflow&lt;br /&gt;
:* to set up and run a clock-based Bayesian phylogenetic analysis&lt;br /&gt;
:* to inspect MCMC output in Tracer&lt;br /&gt;
:* to summarize posterior trees using TreeAnnotator&lt;br /&gt;
:* to visualize and interpret a dated tree in FigTree&lt;br /&gt;
:* to compare a strict-clock analysis with a relaxed-clock analysis&lt;br /&gt;
&lt;br /&gt;
:* In the exercise below, you should follow the instructions on the tutorial page.&lt;br /&gt;
:* Depending on your operating system and how you installed the software, you can start the relevant programs either from the command line or by double clicking an app. The executables that you may need are:&lt;br /&gt;
:** beauti&lt;br /&gt;
:** beast&lt;br /&gt;
:** tracer&lt;br /&gt;
:** treeannotator&lt;br /&gt;
:** figtree&lt;br /&gt;
&lt;br /&gt;
== BEASTX tutorial ==&lt;br /&gt;
&lt;br /&gt;
:* Open this link in a new tab: [https://beast.community/workshop_rates_and_dates Estimating rates and dates from time-stamped sequences]&lt;br /&gt;
&lt;br /&gt;
Answer the questions below and hand in the report. Include a small number of screendumps showing relevant output from the tools you are using.&lt;br /&gt;
&lt;br /&gt;
== Questions ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Explain what the temporal information is in this analysis. How does BEAST obtain information about the sampling times of the sequences, and why is that information needed in order to estimate dates in calendar time?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the first analysis, the tutorial uses a strict molecular clock. What does this assumption mean biologically and statistically? Why might this be a reasonable first model to try, and what kinds of evolutionary patterns would violate this assumption?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
After the first BEAST run, inspect the output in Tracer. What indications are there that the initial run is not yet satisfactory? In your answer, mention burn-in, trace behaviour, and ESS, and include at least one relevant screendump from Tracer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Why does increasing the MCMC chain length help in this case? Explain the difference between increasing chain length and discarding a larger burn-in.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
TreeAnnotator is used to summarize the posterior sample of trees into a single representative tree. Compared with an ordinary phylogram or consensus tree, what additional information does this summary tree contain? Mention at least two specific annotations and explain briefly why each is useful.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Inspect the summarized tree in FigTree. How do the virus samples from the Americas cluster relative to the African samples? What does the inferred timescale suggest about the origin and history of yellow fever virus in the Americas?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The tutorial then repeats the analysis using a relaxed lognormal clock. What is the difference between a strict clock and this relaxed-clock model? What extra biological possibility is the relaxed-clock model allowing for?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Based on the relaxed-clock analysis, is there evidence for substantial rate variation among lineages? In your answer, state what output you used to assess this. Also comment on whether the main biological conclusion about introduction of yellow fever virus into the Americas changes or remains similar under the relaxed-clock model.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=290</id>
		<title>Bayesian phylogenetics: clock models</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=290"/>
		<updated>2026-04-22T09:33:52Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Introduction to BEAST2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how to use the software tool BEASTX to construct phylogenies based on molecular-clock models. In previous exercises we have worked with phylogenies where we did not have information about how fast sequences were evolving, and we therefore used the number of substitutions as branch lengths. When there &#039;&#039;is&#039;&#039; temporal information (e.g., fossils that can be used to date an internal node, or information about sampling times for rapidly evolving sequences) we can instead use clock-based models. These models assume that sequences are evolving at a more or less constant rate, branch lengths are expressed in terms of time, and we can estimate times for internal nodes. Apart from being useful when the focus is on dating evolutionary events, time trees are also valuable in that the clock model itself can lead to better inference of the phylogeny (essentially because it adds prior information to the problem, so that we don&#039;t have to infer all branch lengths from limited amounts of sequence variation alone).&lt;br /&gt;
&lt;br /&gt;
The main purpose of this exercise is to acquaint you with BEASTX and to learn how to fit clock models using either fossil data (by setting a prior on the date of an internal node) or so-called heterochronous data, i.e., sequences where the individual leaves have been sampled at different, known times, and where evolution is sufficiently rapid that the parameters of a clock model can be estimated from how much change has accumulated over time.&lt;br /&gt;
&lt;br /&gt;
For these tutorials the reporting requirements are minimal: hand in a small report with a handful of uncommented screendumps showing your progress through the exercise. The important thing is that you become somewhat familiar with the program, so that you can use it in the mini project later.&lt;br /&gt;
&lt;br /&gt;
:* In the exercises below, you should simply follow the instructions on the tutorial pages. &lt;br /&gt;
:* Depending on your operating system and on how you installed the software, you can start required programs either from the command line or by double clicking an app. The names of the executables that you will need for this exercise are:&lt;br /&gt;
:** beauti&lt;br /&gt;
:** beast&lt;br /&gt;
:** tracer&lt;br /&gt;
:** treeannotator&lt;br /&gt;
:** figtree&lt;br /&gt;
&lt;br /&gt;
== Introduction to BEASTX ==&lt;br /&gt;
&lt;br /&gt;
:* Create a new directory for storing the results of this exercise:&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes2&lt;br /&gt;
 cd bayes2&lt;br /&gt;
:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Introduction-to-BEAST2/ Introduction to BEAST2]&lt;br /&gt;
:* Follow instructions down to the optional part.&lt;br /&gt;
:* &#039;&#039;&#039;Note:&#039;&#039;&#039; If you are running the BEASTX programs from the command line (not starting them by double clicking an app), then to get the graphical interface for BEASTX shown in figure 11 in the tutorial, you should start the program as follows:&lt;br /&gt;
 beast -options&lt;br /&gt;
&lt;br /&gt;
== Prior selection and clock calibration using Influenza A data ==&lt;br /&gt;
&lt;br /&gt;
:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Prior-selection/ Prior selection and clock calibration using Influenza A data]&lt;br /&gt;
:* &#039;&#039;&#039;NOTE:&#039;&#039;&#039; Only do the part about &#039;&#039;&#039;heterochronous&#039;&#039;&#039; data (not the homochronous part, although you can if you want to)&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=289</id>
		<title>Bayesian phylogenetics: clock models</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_clock_models&amp;diff=289"/>
		<updated>2026-04-22T09:33:17Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Overview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how to use the software tool BEASTX to construct phylogenies based on molecular-clock models. In previous exercises we have worked with phylogenies where we did not have information about how fast sequences were evolving, and we therefore used the number of substitutions as branch lengths. When there &#039;&#039;is&#039;&#039; temporal information (e.g., fossils that can be used to date an internal node, or information about sampling times for rapidly evolving sequences) we can instead use clock-based models. These models assume that sequences are evolving at a more or less constant rate, branch lengths are expressed in terms of time, and we can estimate times for internal nodes. Apart from being useful when the focus is on dating evolutionary events, time trees are also valuable in that the clock model itself can lead to better inference of the phylogeny (essentially because it adds prior information to the problem, so that we don&#039;t have to infer all branch lengths from limited amounts of sequence variation alone).&lt;br /&gt;
&lt;br /&gt;
The main purpose of this exercise is to acquaint you with BEASTX and to learn how to fit clock models using either fossil data (by setting a prior on the date of an internal node) or so-called heterochronous data, i.e., sequences where the individual leaves have been sampled at different, known times, and where evolution is sufficiently rapid that the parameters of a clock model can be estimated from how much change has accumulated over time.&lt;br /&gt;
&lt;br /&gt;
For these tutorials the reporting requirements are minimal: hand in a small report with a handful of uncommented screendumps showing your progress through the exercise. The important thing is that you become somewhat familiar with the program, so that you can use it in the mini project later.&lt;br /&gt;
&lt;br /&gt;
:* In the exercises below, you should simply follow the instructions on the tutorial pages. &lt;br /&gt;
:* Depending on your operating system and on how you installed the software, you can start required programs either from the command line or by double clicking an app. The names of the executables that you will need for this exercise are:&lt;br /&gt;
:** beauti&lt;br /&gt;
:** beast&lt;br /&gt;
:** tracer&lt;br /&gt;
:** treeannotator&lt;br /&gt;
:** figtree&lt;br /&gt;
&lt;br /&gt;
== Introduction to BEAST2 ==&lt;br /&gt;
&lt;br /&gt;
:* Create a new directory for storing the results of this exercise:&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes2&lt;br /&gt;
 cd bayes2&lt;br /&gt;
:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Introduction-to-BEAST2/ Introduction to BEAST2]&lt;br /&gt;
:* Follow instructions down to the optional part.&lt;br /&gt;
:* &#039;&#039;&#039;Note:&#039;&#039;&#039; If you are running the BEAST2 programs from the command line (not starting them by double clicking an app), then to get the graphical interface for BEAST2 shown in figure 11 in the tutorial, you should start the program as follows:&lt;br /&gt;
 beast -options&lt;br /&gt;
&lt;br /&gt;
== Prior selection and clock calibration using Influenza A data ==&lt;br /&gt;
&lt;br /&gt;
:* Open this link in a new tab: [https://taming-the-beast.org/tutorials/Prior-selection/ Prior selection and clock calibration using Influenza A data]&lt;br /&gt;
:* &#039;&#039;&#039;NOTE:&#039;&#039;&#039; Only do the part about &#039;&#039;&#039;heterochronous&#039;&#039;&#039; data (not the homochronous part, although you can if you want to)&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=288</id>
		<title>Bayesian phylogenetics: checking convergence</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=288"/>
		<updated>2026-04-22T09:31:17Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Check convergence using Tracer */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Check convergence using Tracer ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will be briefly introduced to how to check if an MCMC run has converged using the program Tracer from the BEASTX package. You will do this by re-examining the output from the Bayesian analysis you did in the week 7 exercise.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Issue this command to start the Tracer program:&lt;br /&gt;
 tracer&lt;br /&gt;
&lt;br /&gt;
: Now import the two MCMC sample files from the MrBayes run you did in week 7 for the hcvsmall data set:&lt;br /&gt;
:* File -&amp;gt; Import Trace File (or use the + under the trace file pane)&lt;br /&gt;
:* In the import dialog: find the &amp;quot;bayes&amp;quot; directory and select &amp;quot;All files&amp;quot; under &amp;quot;files of type&amp;quot;. This should give you a list of the output files from the MrBayes run&lt;br /&gt;
:* Select the file &amp;quot;hcvsmall.nexus.run1.p&amp;quot; and open it.&lt;br /&gt;
:* Repeat process for second log file (suffix &amp;quot;.run2.p&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer fileload2.png |800px]]&lt;br /&gt;
&lt;br /&gt;
: You can now use Tracer to explore the results of the Bayesian analysis. The first thing you want to check is that the two independent runs have resulted in similar posteriors for the different parameters. This is investigated as follows:&lt;br /&gt;
:* Select both trace files by shift-clicking on their names in the &amp;quot;Trace files&amp;quot; pane (upper left of the Tracer window)&lt;br /&gt;
:* Select the &amp;quot;Marginal Density&amp;quot; tab in the window on the right.&lt;br /&gt;
:* Check different parameters by choosing them in the &amp;quot;Traces&amp;quot; pane on the left (while making sure you still have both trace files selected). This will show the two posteriors for the chosen parameter (see example below). If a run has converged then the two posteriors should mostly be placed right on top of each other. &lt;br /&gt;
:* Note that Tracer by default uses a burn-in of 10% of the total number of generations. You can change that by double-clicking in the Burn-in field of the trace file pane (you need to change it separately for each file). Typically we would use a burn-in of 25% or 50%.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer marginals overlap.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* The plot below shows an example where convergence has not occurred yet:&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer lousy convergence.png|800px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Select both trace files in the “Trace Files” pane and inspect the Marginal Density plots for the parameters r(A&amp;lt;-&amp;gt;G){all} and m{2}.&lt;br /&gt;
&lt;br /&gt;
Include the marginal posterior plots for these two parameters in your report. For each parameter, describe whether the posterior distributions from the two independent runs overlap closely, and explain what this indicates about convergence.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
:* Another thing to check is how the trace looks as a function of the iteration number: Optimally you would want a trace that looks like a &amp;quot;hairy caterpillar&amp;quot;, with random jumps up and down on a mostly constant level (see example below). &lt;br /&gt;
:* Select the &amp;quot;Trace&amp;quot; tab in the window on the right to see trace plots (still with one or both trace files selected in the Trace File pane).&lt;br /&gt;
:* Related to this: The ESS column gives the &amp;quot;Effective Sample Size&amp;quot; for each parameter. As a rule of thumb we want this to be at least 200 (and Tracer flags smaller values by colouring the ESS values). &lt;br /&gt;
:** Briefly, the problem here is that consecutive samples from MCMC are correlated (they are not independent). This is due to the use of a Markov chain for sampling: the new position in parameter space depends on the previous location (and the proposal distribution). &lt;br /&gt;
:** The degree of non-independence can be quantified by the autocorrelation at different lags: the autocorrelation for lag k is found by computing the Pearson correlation between the vector of all samples and the values in the same vector shifted k positions (a small code sketch of this computation is included below the example figures). &lt;br /&gt;
:** Based on computation of auto-correlation at different lags (&amp;lt;math&amp;gt;k = [1, 2, 3, ...]&amp;lt;/math&amp;gt;) Tracer determines the Auto-Correlation Time (ACT), which is the number of generations in the MCMC chain that two samples have to be separated by for them to be uncorrelated. The ACT for a parameter can be seen in the Estimates tab in Tracer.&lt;br /&gt;
:** Tracer also estimates the Effective Sample Size (ESS), which is the number of independent samples that the trace is equivalent to. This is essentially the chain length (excluding the burn-in) divided by the ACT.&lt;br /&gt;
:* Note how the highlighted parameter corresponding to the hairy caterpillar trace also has a high ESS in the example below.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer hairy caterpillar.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* Trace plots with clearly visible dips and rises (see example below) indicate that there is autocorrelation among the samples we have included: the samples are not independent of each other (and therefore provide less information about the posterior). This is referred to as &amp;quot;poor mixing&amp;quot;. One solution to such a problem is to increase the number of iterations (and perhaps write samples less frequently). It may also be an indication that the model fits poorly, and that you could get better convergence by changing the substitution model or setting more informative priors.&lt;br /&gt;
:* Note how the poorly mixing parameter in the example below also has a low ESS.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer ugly caterpillar.png|800px]]&lt;br /&gt;
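&lt;br /&gt;
:* As an aside, the ACT/ESS computation described above can be sketched in a few lines of Python. This is only an illustration, not Tracer&#039;s actual algorithm, which may differ in details such as how the sum over lags is truncated:&lt;br /&gt;
 import numpy as np&lt;br /&gt;
 def ess(trace):&lt;br /&gt;
     # trace: 1-D array of post-burn-in MCMC samples for one parameter&lt;br /&gt;
     x = np.asarray(trace, dtype=float)&lt;br /&gt;
     n = len(x)&lt;br /&gt;
     rho = []&lt;br /&gt;
     for k in range(1, n // 2):&lt;br /&gt;
         # lag-k autocorrelation: Pearson correlation of the trace with itself shifted k samples&lt;br /&gt;
         r = np.corrcoef(x[:-k], x[k:])[0, 1]&lt;br /&gt;
         if r &lt; 0.0:&lt;br /&gt;
             break  # truncate the sum at the first negative autocorrelation&lt;br /&gt;
         rho.append(r)&lt;br /&gt;
     act = 1.0 + 2.0 * sum(rho)  # autocorrelation time, in units of logged samples (multiply by the sampling interval to get generations)&lt;br /&gt;
     return n / act  # effective sample size&lt;br /&gt;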
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Now inspect the Trace plots for the same two parameters: r(A&amp;lt;-&amp;gt;G){all} and m{2}.&lt;br /&gt;
&lt;br /&gt;
Include the trace plots for these two parameters in your report, and state the ESS for each of them. For each parameter, describe whether the trace looks well mixed, and compare the two parameters in terms of sampling efficiency. Does one of them appear to have stronger autocorrelation than the other?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Using both the marginal density plots and the trace plots/ESS values, assess whether r(A&amp;lt;-&amp;gt;G){all} and m{2} appear to have converged. Explain what pattern in either the marginal densities or the trace/ESS values would have made you doubt convergence.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=287</id>
		<title>Bayesian phylogenetics: checking convergence</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=287"/>
		<updated>2026-04-22T09:09:23Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Check convergence using Tracer ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will be briefly introduced to how to check if an MCMC run has converged using the program Tracer from the BEASTX package. You will do this by re-examining the output from the Bayesian analysis you did in the week 7 exercise.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Issue this command to start the Tracer program:&lt;br /&gt;
 tracer&lt;br /&gt;
&lt;br /&gt;
: Now import the two MCMC sample files from the MrBayes run you did in week 7 for the hcvsmall data set:&lt;br /&gt;
:* File -&amp;gt; Import Trace File (or use the + under the trace file pane)&lt;br /&gt;
:* In the import dialog: find the &amp;quot;bayes&amp;quot; directory and select &amp;quot;All files&amp;quot; under &amp;quot;files of type&amp;quot;. This should give you a list of the output files from the MrBayes run&lt;br /&gt;
:* Select the file &amp;quot;hcvsmall.nexus.run1.p&amp;quot; and open it.&lt;br /&gt;
:* Repeat process for second log file (suffix &amp;quot;.run2.p&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer fileload2.png |800px]]&lt;br /&gt;
&lt;br /&gt;
: You can now use Tracer to explore the results of the Bayesian analysis. The first thing you want to check is that the two independent runs have resulted in similar posteriors for the different parameters. This is investigated as follows:&lt;br /&gt;
:* Select both trace files by shift-clicking on their names in the &amp;quot;Trace files&amp;quot; pane (upper left of the Tracer window)&lt;br /&gt;
:* Select the &amp;quot;Marginal Density&amp;quot; tab in the window on the right.&lt;br /&gt;
:* Check different parameters by choosing them in the &amp;quot;Traces&amp;quot; pane on the left (while making sure you still have both trace files selected). This will show the two posteriors for the chosen parameter (see example below). If a run has converged then the two posteriors should mostly be placed right on top of each other. &lt;br /&gt;
:* Note that Tracer by default uses a burn-in of 10% of the total number of generations. You can change that by double-clicking in the Burn-in field of the trace file pane (you need to change it separately for each file). Typically we would use a burn-in of 25% or 50%.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer marginals overlap.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* The plot below shows an example where convergence has not occurred yet:&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer lousy convergence.png|800px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Take screen dumps of the marginal posterior plots for the following parameters and include them in your report: m{1} and piA{all}&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
:* Another thing to check is how the trace looks as a function of the iteration number: Optimally you would want a trace that looks like a &amp;quot;hairy caterpillar&amp;quot;, with random jumps up and down on a mostly constant level (see example below). &lt;br /&gt;
:* Select the &amp;quot;Trace&amp;quot; tab in the window on the right to see trace plots (still with one or both trace files selected in the Trace File pane).&lt;br /&gt;
:* Related to this: The ESS column gives the &amp;quot;Effective Sample Size&amp;quot; for each parameter. As a rule of thumb we want this to be at least 200 (and Tracer flags smaller values by colouring the ESS values). &lt;br /&gt;
:** Briefly, the problem here is that consecutive samples from MCMC are correlated (they are not independent). This is due to the use of a Markov chain for sampling: the new position in parameter space depends on the previous location (and the proposal distribution). &lt;br /&gt;
:** The degree of non-independence can be quantified by the autocorrelation at different lags: the autocorrelation for lag k is found by computing the Pearson correlation between the vector of all samples, and the values in the same vector shifted k positions. &lt;br /&gt;
:** Based on computation of auto-correlation at different lags (&amp;lt;math&amp;gt;k = [1, 2, 3, ...]&amp;lt;/math&amp;gt;) Tracer determines the Auto-Correlation Time (ACT), which is the number of generations in the MCMC chain that two samples have to be separated by for them to be uncorrelated. The ACT for a parameter can be seen in the Estimates tab in Tracer.&lt;br /&gt;
:** Tracer also estimates the Effective Sample Size (ESS), which is the number of independent samples that the trace is equivalent to. This is essentially the chain length (excluding the burn-in) divided by the ACT.&lt;br /&gt;
:* Note how the highlighted parameter corresponding to the hairy caterpillar trace also has a high ESS in the example below.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer hairy caterpillar.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* Trace plots with clearly visible dips and rises (see example below) indicate that there is autocorrelation among the samples we have included: the samples are not independent of each other (and therefore provide less information about the posterior). This is referred to as &amp;quot;poor mixing&amp;quot;. One solution to such a problem is to increase the number of iterations (and perhaps write samples less frequently). It may also be an indication that the model fits poorly, and that you could get better convergence by changing the substitution model or setting more informative priors.&lt;br /&gt;
:* Note how the poorly mixing parameter in the example below also has a low ESS.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer ugly caterpillar.png|800px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Take screen dumps of the trace plots for the following parameters and include them in your report: m{1} and piA{all}. What is the ESS for these parameters?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=286</id>
		<title>Bayesian phylogenetics: checking convergence</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_phylogenetics:_checking_convergence&amp;diff=286"/>
		<updated>2026-04-22T09:03:06Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Check convergence using Tracer */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
== Check convergence using Tracer ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will be briefly introduced to how to check if an MCMC run has converged using the program Tracer from the BEASTX package. You will do this by re-examining the output from the Bayesian analysis you did in the week 7 exercise.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Issue this command to start the Tracer program:&lt;br /&gt;
 tracer&lt;br /&gt;
&lt;br /&gt;
: Now import the two MCMC sample files from the MrBayes run you did in week 7 for the hcvsmall data set:&lt;br /&gt;
:* File -&amp;gt; Import Trace File (or use the + under the trace file pane)&lt;br /&gt;
:* In the import dialog: find the &amp;quot;bayes&amp;quot; directory and select &amp;quot;All files&amp;quot; under &amp;quot;files of type&amp;quot;. This should give you a list of the output files from the MrBayes run&lt;br /&gt;
:* Select the file &amp;quot;hcvsmall.nexus.run1.p&amp;quot; and open it.&lt;br /&gt;
:* Repeat process for second log file (suffix &amp;quot;.run2.p&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer fileload2.png |800px]]&lt;br /&gt;
&lt;br /&gt;
: You can now use Tracer to explore the results of the Bayesian analysis. The first thing you want to check is that the two independent runs have resulted in similar posteriors for the different parameters. This is investigated as follows:&lt;br /&gt;
:* Select both trace files by shift-clicking on their names in the &amp;quot;Trace files&amp;quot; pane (upper left of the Tracer window)&lt;br /&gt;
:* Select the &amp;quot;Marginal Density&amp;quot; tab in the window on the right.&lt;br /&gt;
:* Check different parameters by choosing them in the &amp;quot;Traces&amp;quot; pane on the left (while making sure you still have both trace files selected). This will show the two posteriors for the chosen parameter (see example below). If a run has converged then the two posteriors should mostly be placed right on top of each other. &lt;br /&gt;
:* Note that Tracer by default uses a burn-in of 10% of the total number of generations. You can change that by double-clicking in the Burn-in field of the trace file pane (you need to change it separately for each file). Typically we would use a burn-in of 25% or 50%.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer marginals overlap.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* The plot below shows an example where convergence has not occurred yet:&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer lousy convergence.png|800px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Take screen dumps of the marginal posterior plots for the following parameters and include them in your report: m{1} and piA{all}&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
:* Another thing to check is how the trace looks as a function of the iteration number: Optimally you would want a trace that looks like a &amp;quot;hairy caterpillar&amp;quot;, with random jumps up and down on a mostly constant level (see example below). &lt;br /&gt;
:* Select the &amp;quot;Trace&amp;quot; tab in the window on the right to see trace plots (still with one or both trace files selected in the Trace File pane).&lt;br /&gt;
:* Related to this: The ESS column gives the &amp;quot;Effective Sample Size&amp;quot; for each parameter. As a rule of thumb we want this to be at least 200 (and Tracer flags smaller values by colouring the ESS values). &lt;br /&gt;
:** Briefly, the problem here is that consecutive samples from MCMC are correlated (they are not independent). This is due to the use of a Markov chain for sampling: the new position in parameter space depends on the previous location (and the proposal distribution). &lt;br /&gt;
:** The degree of non-independence can be quantified by the autocorrelation at different lags: the autocorrelation for lag k is found by computing the Pearson correlation between all samples and the samples k generations later. &lt;br /&gt;
:** Based on computation of auto-correlation at different lags (&amp;lt;math&amp;gt;k = [1, 2, 3, ...]&amp;lt;/math&amp;gt;) Tracer determines the Auto-Correlation Time (ACT), which is the number of generations in the MCMC chain that two samples have to be separated by for them to be uncorrelated. The ACT for a parameter can be seen in the Estimates tab in Tracer.&lt;br /&gt;
:** Tracer also estimates the Effective Sample Size (ESS), which is the number of independent samples that the trace is equivalent to. This is essentially the chain length (excluding the burn-in) divided by the ACT.&lt;br /&gt;
:* Note how the highlighted parameter corresponding to the hairy caterpillar trace also has a high ESS in the example below.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer hairy caterpillar.png|800px]]&lt;br /&gt;
&lt;br /&gt;
:* Trace plots with clearly visible dips and rises (see example below) indicate that there is autocorrelation among the samples we have included: the samples are not independent of each other (and therefore provide less information about the posterior). This is referred to as &amp;quot;poor mixing&amp;quot;. One solution to such a problem is to increase the number of iterations (and perhaps write samples less frequently). It may also be an indication that the model fits poorly, and that you could get better convergence by changing the substitution model or setting more informative priors.&lt;br /&gt;
:* Note how the poorly mixing parameter in the example below also has a low ESS.&lt;br /&gt;
&lt;br /&gt;
:::::[[File:Tracer ugly caterpillar.png|800px]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Take screen dumps of the trace plots for the following parameters and include them in your report: m{1} and piA{all}. What is the ESS for these parameters?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=285</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=285"/>
		<updated>2026-04-15T18:27:05Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. This means there may be considerable selective pressure on gp120 to produce immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether such selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal of the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute Akaike weights, which can be interpreted as relative support for the models, or approximately as relative model probabilities within the candidate set. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and Akaike weights (interpretable as relative model probabilities) ==&lt;br /&gt;
&lt;br /&gt;
: Later in today’s exercise you will be asked to compute AIC values and Akaike weights. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the Akaike weight for each model is found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;weight(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has weight(model) = 1.3 / 3.75 ≈ 0.35. (A small code sketch of steps 2 to 6 is given below the example table.)&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
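&lt;br /&gt;
: As a concrete illustration of the recipe above, here is a minimal Python sketch of steps 2 to 6 (the lnL and K values are made up and only serve as placeholders for the numbers you get from your own model fits):&lt;br /&gt;
 import math&lt;br /&gt;
 # hypothetical (lnL, K) pairs for three models; replace with your own values&lt;br /&gt;
 models = [(-2010.0, 5), (-2008.2, 6), (-2011.5, 4)]&lt;br /&gt;
 aic = [-2.0 * lnL + 2.0 * K for (lnL, K) in models]   # step 2: AIC = -2 x lnL + 2K&lt;br /&gt;
 delta = [a - min(aic) for a in aic]                   # steps 3 and 4: ΔAIC relative to the best model&lt;br /&gt;
 numerator = [math.exp(-0.5 * d) for d in delta]       # step 5&lt;br /&gt;
 weights = [x / sum(numerator) for x in numerator]     # step 6: Akaike weights (sum to 1)&lt;br /&gt;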
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
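&lt;br /&gt;
: The back-translation step can be sketched in a few lines of Python (a simplified illustration, not the actual RevTrans code): each residue in the aligned protein is replaced by the corresponding codon from the unaligned DNA sequence, and each protein gap becomes a codon-sized gap.&lt;br /&gt;
 def backtranslate(dna, prot_aln):&lt;br /&gt;
     # dna: one unaligned coding DNA sequence; prot_aln: the corresponding row of the protein alignment&lt;br /&gt;
     out = []&lt;br /&gt;
     codon = 0&lt;br /&gt;
     for aa in prot_aln:&lt;br /&gt;
         if aa == &#039;-&#039;:&lt;br /&gt;
             out.append(&#039;---&#039;)  # a gap in the protein alignment becomes one codon-sized gap&lt;br /&gt;
         else:&lt;br /&gt;
             out.append(dna[3 * codon : 3 * codon + 3])&lt;br /&gt;
             codon += 1&lt;br /&gt;
     return &#039;&#039;.join(out)&lt;br /&gt;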
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done, you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;). (The total of 56 models comes from combining the 7 substitution schemes with equal or unequal base frequencies and with four options for rate variation across sites (none, +I, +G, +I+G): 7 x 2 x 4 = 56.)&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
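&lt;br /&gt;
: If you want to check your hand calculation, the same recipe can be scripted. The Python sketch below uses arbitrary placeholder lnL and K values; substitute the numbers you wrote down from the results table.&lt;br /&gt;
&lt;br /&gt;
 import math&lt;br /&gt;
 # Placeholder values - replace with the lnL and K you noted for each model.&lt;br /&gt;
 models = {&lt;br /&gt;
     &#039;JC+I&#039;:   (-2010.0, 5),&lt;br /&gt;
     &#039;JC+G&#039;:   (-2008.0, 5),&lt;br /&gt;
     &#039;JC+I+G&#039;: (-2007.5, 6),&lt;br /&gt;
 }&lt;br /&gt;
 aic = {name: -2 * lnl + 2 * k for name, (lnl, k) in models.items()}&lt;br /&gt;
 aic_min = min(aic.values())&lt;br /&gt;
 delta = {name: a - aic_min for name, a in aic.items()}&lt;br /&gt;
 numer = {name: math.exp(-0.5 * d) for name, d in delta.items()}&lt;br /&gt;
 total = sum(numer.values())&lt;br /&gt;
 for name in models:&lt;br /&gt;
     print(name, round(aic[name], 1), round(delta[name], 1), round(numer[name] / total, 3))&lt;br /&gt;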
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect the estimated parameter values to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: its likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihoods of the fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, we use the tree produced by IQ-TREE 3 (gp120.nexus.treefile) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
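&lt;br /&gt;
: If you want to do that, a small script is enough. The sketch below (Python, an illustration only) removes whole codons in which more than half of the sequences have a gap; the 50% threshold and the file names are just example choices.&lt;br /&gt;
&lt;br /&gt;
 # Sketch: drop codon columns where more than half of the sequences have gaps.&lt;br /&gt;
 def read_fasta(path):&lt;br /&gt;
     seqs, name = {}, None&lt;br /&gt;
     for line in open(path):&lt;br /&gt;
         line = line.strip()&lt;br /&gt;
         if line.startswith(&#039;&amp;gt;&#039;):&lt;br /&gt;
             name = line[1:]&lt;br /&gt;
             seqs[name] = &amp;quot;&amp;quot;&lt;br /&gt;
         elif name:&lt;br /&gt;
             seqs[name] += line&lt;br /&gt;
     return seqs&lt;br /&gt;
 aln = read_fasta(&#039;gp120align.fasta&#039;)&lt;br /&gt;
 nseq = len(aln)&lt;br /&gt;
 length = len(next(iter(aln.values())))&lt;br /&gt;
 keep = []&lt;br /&gt;
 for i in range(0, length, 3):   # one codon (three columns) at a time&lt;br /&gt;
     worst = max(sum(s[j] == &#039;-&#039; for s in aln.values()) for j in range(i, min(i + 3, length)))&lt;br /&gt;
     if worst &amp;lt;= nseq / 2:&lt;br /&gt;
         keep.append(i)&lt;br /&gt;
 with open(&#039;gp120align_filtered.fasta&#039;, &#039;w&#039;) as out:&lt;br /&gt;
     for name, seq in aln.items():&lt;br /&gt;
         out.write(&#039;&amp;gt;&#039; + name + &#039;\n&#039;)&lt;br /&gt;
         out.write(&amp;quot;&amp;quot;.join(seq[i:i + 3] for i in keep) + &#039;\n&#039;)&lt;br /&gt;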
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems the current version of codeml (4.10.10) may print an error message near the end of the run, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
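&lt;br /&gt;
: If you prefer, these numbers can also be pulled out with a small script. The Python sketch below simply scans selection.results for lines of the form shown above; adjust the pattern if your version of codeml formats the line differently.&lt;br /&gt;
&lt;br /&gt;
 # Sketch: extract np and lnL for each fitted model from selection.results.&lt;br /&gt;
 import re&lt;br /&gt;
 pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
 text = open(&#039;selection.results&#039;).read()&lt;br /&gt;
 for ntime, np, lnl in pattern.findall(text):&lt;br /&gt;
     print(&#039;np =&#039;, np, &#039; lnL =&#039;, lnl)&lt;br /&gt;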
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and Akaike weights for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
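&lt;br /&gt;
: The computation is the same as in Question 1, now with just two models. A minimal Python sketch is shown below; the M1 numbers are the example values from the lnL line shown earlier, and the M2 numbers are pure placeholders, so substitute your own results for both models.&lt;br /&gt;
&lt;br /&gt;
 import math&lt;br /&gt;
 m1_lnl, m1_k = -4242.470345, 74   # example values from the lnL line above&lt;br /&gt;
 m2_lnl, m2_k = -4200.0, 76        # placeholders - use your own M2 results&lt;br /&gt;
 aic = {&#039;M1&#039;: -2 * m1_lnl + 2 * m1_k, &#039;M2&#039;: -2 * m2_lnl + 2 * m2_k}&lt;br /&gt;
 aic_min = min(aic.values())&lt;br /&gt;
 numer = {m: math.exp(-0.5 * (a - aic_min)) for m, a in aic.items()}&lt;br /&gt;
 total = sum(numer.values())&lt;br /&gt;
 for m in (&#039;M1&#039;, &#039;M2&#039;):&lt;br /&gt;
     print(m, round(aic[m], 2), round(aic[m] - aic_min, 2), round(numer[m] / total, 4))&lt;br /&gt;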
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1? Based on this comparison, is there evidence that some codon sites in gp120 have dN/dS &amp;gt; 1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate the &amp;quot;Bayes Empirical Bayes (BEB) analysis&amp;quot; table. Note: use the BEB table, not the Naive Empirical Bayes (NEB) table. (It is not important what the distinction is in this context, but if you are interested: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets - for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value - whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].)&lt;br /&gt;
&lt;br /&gt;
: This table lists codon positions, the posterior probability that each site belongs to the positively selected class (that is, the class with dN/dS &amp;gt; 1), and the estimated mean value of w for the site.&lt;br /&gt;
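&lt;br /&gt;
: If the table is long you can also filter it with a small script. The Python sketch below assumes that each BEB table line starts with the codon position, then the amino acid, then the posterior probability; check this against your own output before relying on it.&lt;br /&gt;
&lt;br /&gt;
 # Sketch: list BEB sites with posterior probability above 0.95.&lt;br /&gt;
 inside = False&lt;br /&gt;
 for line in open(&#039;selection.results&#039;):&lt;br /&gt;
     if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
         inside = True&lt;br /&gt;
         continue&lt;br /&gt;
     if inside:&lt;br /&gt;
         fields = line.split()&lt;br /&gt;
         if len(fields) &amp;gt;= 3 and fields[0].isdigit():&lt;br /&gt;
             try:&lt;br /&gt;
                 prob = float(fields[2].rstrip(&#039;*&#039;))&lt;br /&gt;
             except ValueError:&lt;br /&gt;
                 continue&lt;br /&gt;
             if prob &amp;gt; 0.95:&lt;br /&gt;
                 print(&#039;site&#039;, fields[0], &#039;P =&#039;, prob)&lt;br /&gt;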
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 14&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Consider your results from Questions 10–13. In model M2, only a fraction of sites belong to the positively selected class, while many sites belong to classes with dN/dS ≤ 1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What does this suggest about the overall pattern of selection acting on gp120? Explain briefly why one might expect a mixture of negatively selected, neutral, and positively selected sites in this gene.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=284</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=284"/>
		<updated>2026-04-15T15:09:47Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 for creating immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
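&lt;br /&gt;
: The worked numbers above can be checked with a few lines of Python (just a sanity check of the arithmetic, not part of the recipe):&lt;br /&gt;
&lt;br /&gt;
 import math&lt;br /&gt;
 print(-2 * -2010 + 2 * 5)               # AIC example: 4030&lt;br /&gt;
 print(round(math.exp(-0.5 * 4.2), 4))   # numerator example: 0.1225&lt;br /&gt;
 print(round(1.3 / 3.75, 2))             # model probability example: 0.35&lt;br /&gt;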
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done, you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect the estimated parameter values to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: its likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihoods of the fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, we use the tree produced by IQ-TREE 3 (gp120.nexus.treefile) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems the current version of codeml (4.10.10) may print an error message near the end of the run, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are three separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1? Based on this comparison, is there evidence that some codon sites in gp120 have dN/dS &amp;gt; 1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate the &amp;quot;Bayes Empirical Bayes (BEB) analysis&amp;quot; table. Note: use the BEB table, not the Naive Empirical Bayes (NEB) table. (It is not important what the distinction is in this context, but if you are interested: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets - for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value - whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].)&lt;br /&gt;
&lt;br /&gt;
: This table lists codon positions, the posterior probability that each site belongs to the positively selected class (that is, the class with dN/dS &amp;gt; 1), and the estimated mean value of w for the site.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 14&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Consider your results from Questions 10–13. In model M2, only a fraction of sites belong to the positively selected class, while many sites belong to classes with dN/dS ≤ 1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What does this suggest about the overall pattern of selection acting on gp120? Explain briefly why one might expect a mixture of negatively selected, neutral, and positively selected sites in this gene.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=283</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=283"/>
		<updated>2026-04-15T15:08:58Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal of the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (the Akaike Information Criterion), from which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
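: If you prefer, you can let a short script do the bookkeeping in steps 2-6. Below is a minimal sketch in Python; the first model reproduces the worked example from step 2 (lnL = -2010, K = 5, AIC = 4030), while the second uses invented numbers purely for illustration:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def akaike_weights(models):&lt;br /&gt;
    # models maps a model name to (lnL, K); returns (AIC, delta-AIC, weight)&lt;br /&gt;
    # per model, following steps 2-6 of the recipe above.&lt;br /&gt;
    aic = {m: -2 * lnl + 2 * k for m, (lnl, k) in models.items()}&lt;br /&gt;
    aic_min = min(aic.values())&lt;br /&gt;
    numer = {m: math.exp(-0.5 * (aic[m] - aic_min)) for m in models}&lt;br /&gt;
    total = sum(numer.values())&lt;br /&gt;
    return {m: (aic[m], aic[m] - aic_min, numer[m] / total) for m in models}&lt;br /&gt;
&lt;br /&gt;
# Toy input: modelA is the worked example from step 2, modelB is invented.&lt;br /&gt;
print(akaike_weights({&amp;quot;modelA&amp;quot;: (-2010.0, 5), &amp;quot;modelB&amp;quot;: (-2008.2, 7)}))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;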
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance /Users/bob/Documents/molevol or /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
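&lt;br /&gt;
: To make the gap-threading idea concrete, here is a minimal Python sketch of the back-translation step (an illustration of the principle only, not the actual RevTrans implementation):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(dna, aligned_protein):&lt;br /&gt;
    # Thread the ungapped coding DNA onto its row of the gapped protein&lt;br /&gt;
    # alignment: every protein gap becomes a codon-sized gap in the DNA.&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, k = [], 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
            out.append(&amp;quot;---&amp;quot;)&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[k])&lt;br /&gt;
            k += 1&lt;br /&gt;
    return &amp;quot;&amp;quot;.join(out)&lt;br /&gt;
&lt;br /&gt;
print(backtranslate(&amp;quot;ATGAAATTT&amp;quot;, &amp;quot;M-KF&amp;quot;))   # prints ATG---AAATTT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;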
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
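&lt;br /&gt;
: How you do the conversion is up to you (an alignment editor, a web-based converter, or a script all work). As one possible sketch, assuming Biopython is installed, the conversion could be done like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
# One possible route (assumes Biopython); the NEXUS writer needs a&lt;br /&gt;
# molecule_type annotation on each record.&lt;br /&gt;
records = []&lt;br /&gt;
for rec in SeqIO.parse(&amp;quot;gp120align.fasta&amp;quot;, &amp;quot;fasta&amp;quot;):&lt;br /&gt;
    rec.annotations[&amp;quot;molecule_type&amp;quot;] = &amp;quot;DNA&amp;quot;&lt;br /&gt;
    records.append(rec)&lt;br /&gt;
SeqIO.write(records, &amp;quot;gp120.nexus&amp;quot;, &amp;quot;nexus&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;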
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
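&lt;br /&gt;
: Taken together, the relevant lines of codeml.ctl should end up looking roughly like this (the text after each asterisk is just a comment; all other settings in the file are left as they are):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
 seqfile   = gp120align.fasta       * RevTrans alignment&lt;br /&gt;
 treefile  = gp120.nexus.treefile   * tree produced by IQ-TREE 3&lt;br /&gt;
 seqtype   = 1                      * coding DNA (codons)&lt;br /&gt;
 NSsites   = 1 2                    * fit site models M1 and M2&lt;br /&gt;
 cleandata = 0                      * keep alignment columns with gaps&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;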
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, we use the tree produced by &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; (&#039;&#039;&#039;gp120.nexus.treefile&#039;&#039;&#039;) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are three distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (the same types of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The values of the dN/dS ratios (for the classes with dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fractions of sites belonging to each class, and the positions of the sites in each class are unknown at first and will be estimated during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding the &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites, and I would therefore recommend not using that setting. At the same time, you do want to be wary of columns where most sequences have gaps, since inference of selection at these will be very uncertain. It might be better to discard such columns from the alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems the current version of codeml (4.10.10) may print an error message near the end of the run, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
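&lt;br /&gt;
: If you would rather pull these numbers out programmatically than by eye, a small Python sketch along the following lines works on a line copied from the result file (the line used below is the example from above, not your own result):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Extract K (reported as np) and lnL from a codeml lnL line.&lt;br /&gt;
line = &amp;quot;lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&amp;quot;&lt;br /&gt;
m = re.search(r&amp;quot;np:\s*(\d+)\):\s*(-?\d+\.\d+)&amp;quot;, line)&lt;br /&gt;
K, lnL = int(m.group(1)), float(m.group(2))&lt;br /&gt;
print(K, lnL)   # 74 -4242.470345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;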
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are three separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1? Based on this comparison, is there evidence that some codon sites in gp120 have dN/dS &amp;gt; 1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate the &amp;quot;Bayes Empirical Bayes (BEB) analysis&amp;quot; table. Note: use the BEB table, not the Naive Empirical Bayes (NEB) table. (It is not important what the distinction is in this context, but if you are interested: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets - for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value - whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].)&lt;br /&gt;
&lt;br /&gt;
: This table lists codon positions, the posterior probability that each site belongs to the positively selected class (that is, the class with dN/dS &amp;gt; 1), and the estimated mean value of w for the site.&lt;br /&gt;
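&lt;br /&gt;
: If the table is long, you can also filter it with a few lines of Python. The sketch below assumes BEB lines shaped roughly like the invented examples shown (position, amino acid, posterior probability, mean w); check the exact layout against your own selection.results before relying on it:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Invented example lines in the rough shape of the BEB table - not real results.&lt;br /&gt;
beb_lines = [&lt;br /&gt;
    &amp;quot;    12 T      0.953*        2.615 +- 1.191&amp;quot;,&lt;br /&gt;
    &amp;quot;    25 N      0.991**       2.823 +- 0.933&amp;quot;,&lt;br /&gt;
    &amp;quot;    87 K      0.771         1.904 +- 1.102&amp;quot;,&lt;br /&gt;
]&lt;br /&gt;
for line in beb_lines:&lt;br /&gt;
    m = re.match(r&amp;quot;\s*(\d+)\s+\S+\s+([01]\.\d+)&amp;quot;, line)&lt;br /&gt;
    if m and float(m.group(2)) &amp;gt; 0.95:&lt;br /&gt;
        print(&amp;quot;site&amp;quot;, m.group(1), &amp;quot;P =&amp;quot;, m.group(2))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;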
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 14&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Consider your results from Questions 10–13. In model M2, only a fraction of sites belong to the positively selected class, while many sites belong to classes with dN/dS ≤ 1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What does this suggest about the overall pattern of selection acting on gp120? Explain briefly why one might expect a mixture of negatively selected, neutral, and positively selected sites in this gene.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=282</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=282"/>
		<updated>2026-04-15T10:08:22Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal of the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (the Akaike Information Criterion), from which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance /Users/bob/Documents/molevol or /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: This starts an interactive PAUP session, in which you will enter the commands below. (Note: it is also possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we do it in PAUP in order to see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analysis than merely reconstructing trees. One useful property of probabilistic methods is that the model parameters are estimated as part of the optimization procedure. Once such a model has been fitted to the data, the estimated parameter values can be examined to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that it provides a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, AIC values and Akaike weights (model probabilities) can be computed from the likelihoods of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, use the tree produced by &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; (&#039;&#039;&#039;gp120.nexus.treefile&#039;&#039;&#039;) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems the current version of codeml (4.10.10) may print an error message near the end of the run, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
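&lt;br /&gt;
: (If you prefer to extract these numbers with a small script rather than by eye, something along the lines of the Python sketch below should work. It assumes that the lnL lines in selection.results have exactly the format shown above; one line is then printed per fitted model, in the order the models appear in the file.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Pull K (reported as np) and lnL from lines of the form:&lt;br /&gt;
#   lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        hit = pattern.search(line)&lt;br /&gt;
        if hit:&lt;br /&gt;
            print(&#039;K =&#039;, hit.group(2), &#039; lnL =&#039;, hit.group(3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;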
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
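&lt;br /&gt;
: As a purely hypothetical worked example of the arithmetic (your own numbers will differ): with lnL(M1) = -4242.47 and K(M1) = 74 (the example values shown earlier), and invented values lnL(M2) = -4230.00 and K(M2) = 76, one gets AIC(M1) = -2 x (-4242.47) + 2 x 74 = 8632.94 and AIC(M2) = -2 x (-4230.00) + 2 x 76 = 8612.00, so ΔAIC(M2) = 0 and ΔAIC(M1) = 20.94, corresponding to an Akaike weight close to 1 for M2 and close to 0 for M1 in this made-up case.&lt;br /&gt;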
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1? Based on this comparison, is there evidence that some codon sites in gp120 have dN/dS &amp;gt; 1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate the &amp;quot;Bayes Empirical Bayes (BEB) analysis&amp;quot; table. Note: use the BEB table, not the Naive Empirical Bayes (NEB) table. (It is not important what the distinction is in this context, but if you are interested: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets - for instance, the estimated w for some codon is perhaps not exactly the reported point value, but could lie in a region around it - whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].)&lt;br /&gt;
&lt;br /&gt;
: This table lists codon positions, the posterior probability that each site belongs to the positively selected class (that is, the class with dN/dS &amp;gt; 1), and the estimated mean value of w for the site.&lt;br /&gt;
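&lt;br /&gt;
: (For a long table it can be convenient to filter the sites with a small script. The Python sketch below assumes the typical layout of the codeml BEB table - rows of the form &amp;quot;25 A 0.959*  3.133 +- 0.769&amp;quot;, i.e. position, amino acid, posterior probability and mean w - so adjust the parsing if your output looks different.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Collect rows from the BEB table whose posterior probability of belonging&lt;br /&gt;
# to the dN/dS &amp;gt; 1 class exceeds 0.95.&lt;br /&gt;
row = re.compile(r&#039;^\s*(\d+)\s+(\S)\s+([01]\.\d+)&#039;)&lt;br /&gt;
&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True                      # start of the BEB section&lt;br /&gt;
            continue&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            hit = row.match(line)&lt;br /&gt;
            if hit and float(hit.group(3)) &amp;gt; 0.95:&lt;br /&gt;
                print(hit.group(1), hit.group(2), hit.group(3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;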
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 14&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Consider your results from Questions 10–13. In model M2, only a fraction of sites belong to the positively selected class, while many sites belong to classes with dN/dS ≤ 1.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What does this suggest about the overall pattern of selection acting on gp120? Explain briefly why one might expect a mixture of negatively selected, neutral, and positively selected sites in this gene.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=281</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=281"/>
		<updated>2026-04-15T09:48:45Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that will subsequently be used to investigate whether such selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
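&lt;br /&gt;
: The recipe is also straightforward to script. The minimal Python sketch below implements steps 2-6 for an arbitrary set of models; the example call uses the model from the example in step 2 (lnL = -2010, K = 5) together with a second, made-up model.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from math import exp&lt;br /&gt;
&lt;br /&gt;
def akaike_weights(models):&lt;br /&gt;
    # models: a dict mapping model name to a (lnL, K) tuple&lt;br /&gt;
    aic = {name: -2 * lnl + 2 * k for name, (lnl, k) in models.items()}&lt;br /&gt;
    aic_min = min(aic.values())                                # step 3&lt;br /&gt;
    delta = {name: a - aic_min for name, a in aic.items()}     # step 4&lt;br /&gt;
    num = {name: exp(-0.5 * d) for name, d in delta.items()}   # step 5&lt;br /&gt;
    total = sum(num.values())                                  # step 5 (sum)&lt;br /&gt;
    return {name: num[name] / total for name in models}        # step 6&lt;br /&gt;
&lt;br /&gt;
print(akaike_weights({&#039;modelA&#039;: (-2010.0, 5), &#039;modelB&#039;: (-2012.5, 3)}))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;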
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
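&lt;br /&gt;
: To make the back-translation step concrete, here is a minimal Python sketch of the idea (not the actual RevTrans implementation): each aligned amino acid is replaced by the corresponding codon from the original DNA sequence, and each protein gap becomes a gap of three nucleotides, so codon boundaries are preserved. The toy sequences in the example call are made up.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(dna, aligned_protein):&lt;br /&gt;
    # dna: unaligned coding sequence; aligned_protein: the same sequence&lt;br /&gt;
    # after translation and protein-level alignment (with - for gaps)&lt;br /&gt;
    codons = [dna[i:i+3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, next_codon = [], 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &#039;-&#039;:&lt;br /&gt;
            out.append(&#039;---&#039;)        # protein gap becomes a codon-sized gap&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[next_codon])&lt;br /&gt;
            next_codon += 1&lt;br /&gt;
    return &#039;&#039;.join(out)&lt;br /&gt;
&lt;br /&gt;
print(backtranslate(&#039;ATGAAATGC&#039;, &#039;M-KC&#039;))     # prints ATG---AAATGC&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;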
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: This starts an interactive PAUP session, in which you will enter the commands below. (Note: it is also possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we do it in PAUP in order to see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analysis than merely reconstructing trees. One useful property of probabilistic methods is that the model parameters are estimated as part of the optimization procedure. Once such a model has been fitted to the data, the estimated parameter values can be examined to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that it provides a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, AIC values and Akaike weights (model probabilities) can be computed from the likelihoods of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, use the tree produced by &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; (&#039;&#039;&#039;gp120.nexus.treefile&#039;&#039;&#039;) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems the current version of codeml (4.10.10) may print an error message near the end of the run, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: The distinction is not important in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, which matters especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046 but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=280</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=280"/>
		<updated>2026-04-15T09:42:36Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences, taking protein-level information into account (using RevTrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jModelTest2).&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP and IQ-TREE 3).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
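&lt;br /&gt;
: The recipe above is easy to script if you prefer to let the computer do the arithmetic. Below is a minimal Python sketch of the same computation; the lnL and K values are placeholders, not results from this data set.&lt;br /&gt;
 import math&lt;br /&gt;
 # maximized log-likelihoods (lnL) and free parameter counts (K) - placeholder values&lt;br /&gt;
 lnls = [-2010.0, -2008.5, -2007.9]&lt;br /&gt;
 ks = [5, 5, 6]&lt;br /&gt;
 aics = [-2 * lnl + 2 * k for lnl, k in zip(lnls, ks)]&lt;br /&gt;
 aic_min = min(aics)&lt;br /&gt;
 deltas = [aic - aic_min for aic in aics]&lt;br /&gt;
 numerators = [math.exp(-0.5 * d) for d in deltas]&lt;br /&gt;
 total = sum(numerators)&lt;br /&gt;
 weights = [n / total for n in numerators]&lt;br /&gt;
 print(weights)   # model probabilities; should sum to 1&lt;br /&gt;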
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
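&lt;br /&gt;
: To make the principle concrete, here is a minimal Python sketch of the back-translation idea. This is only an illustration of how a gapped protein alignment maps back onto codons, not the actual RevTrans implementation.&lt;br /&gt;
 def back_translate(dna, aligned_protein):&lt;br /&gt;
     # dna: unaligned coding sequence; aligned_protein: its translation, with gaps&lt;br /&gt;
     codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
     aligned_dna = &amp;quot;&amp;quot;&lt;br /&gt;
     pos = 0&lt;br /&gt;
     for aa in aligned_protein:&lt;br /&gt;
         if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
             aligned_dna += &amp;quot;---&amp;quot;   # a protein gap becomes a codon-sized DNA gap&lt;br /&gt;
         else:&lt;br /&gt;
             aligned_dna += codons[pos]&lt;br /&gt;
             pos += 1&lt;br /&gt;
     return aligned_dna&lt;br /&gt;
 print(back_translate(&amp;quot;ATGAAA&amp;quot;, &amp;quot;M-K&amp;quot;))   # prints ATG---AAA&lt;br /&gt;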
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file (gp120align.fasta) to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
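&lt;br /&gt;
: The exercise does not prescribe a particular tool for this conversion. One possibility, assuming Biopython is installed, is the following short Python snippet:&lt;br /&gt;
 from Bio import AlignIO&lt;br /&gt;
 # convert the RevTrans alignment from FASTA to NEXUS (molecule_type is required for NEXUS output)&lt;br /&gt;
 AlignIO.convert(&amp;quot;gp120align.fasta&amp;quot;, &amp;quot;fasta&amp;quot;, &amp;quot;gp120.nexus&amp;quot;, &amp;quot;nexus&amp;quot;, molecule_type=&amp;quot;DNA&amp;quot;)&lt;br /&gt;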
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is also possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we do it in PAUP here in order to see each step more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: its likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file so that the following settings have the values shown below, and save it. Any text appearing after an asterisk (*) on a line is only a comment and does not need to be changed.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important:&#039;&#039;&#039; codeml requires a phylogenetic tree as input. The tree supplies the topology for the codon-model analysis, while codeml estimates the branch lengths and the remaining model parameters itself. In this exercise, use the tree produced by &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; (&#039;&#039;&#039;gp120.nexus.treefile&#039;&#039;&#039;) as input.&lt;br /&gt;
&lt;br /&gt;
: The settings above will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same types of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The values of the dN/dS ratios (for the classes with dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the positions of the sites in each class are unknown at first and will be estimated during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
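&lt;br /&gt;
: For reference, the edited part of the control file might look roughly as follows. Only the settings discussed above are shown; the outfile name is inferred from the inspection step further down, and the remaining settings in the supplied codeml.ctl should be left unchanged.&lt;br /&gt;
       seqfile = gp120align.fasta        * codon alignment from RevTrans&lt;br /&gt;
      treefile = gp120.nexus.treefile    * tree topology from IQ-TREE 3&lt;br /&gt;
       outfile = selection.results       * main results file (name taken from the inspection step below)&lt;br /&gt;
       seqtype = 1                       * 1 means codon (coding DNA) data&lt;br /&gt;
       NSsites = 1 2                     * fit site models M1 and M2&lt;br /&gt;
     cleandata = 0                       * keep alignment columns with gaps&lt;br /&gt;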
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to follow the optimization on screen: as the fit improves, the log-likelihood increases, which means the negative log-likelihood decreases.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems codeml may print an error message near the end of the run, for instance during the BEB step, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find the likelihood and the number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
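&lt;br /&gt;
: If you prefer not to hunt through the file by hand, a small Python sketch along the following lines can pull out np and lnL for each fitted model. The exact layout of the lnL lines may vary slightly between PAML versions, so check the pattern against your own result file.&lt;br /&gt;
 import re&lt;br /&gt;
 # matches lines such as:  lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
 pattern = re.compile(r&amp;quot;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s+(-?\d+\.\d+)&amp;quot;)&lt;br /&gt;
 with open(&amp;quot;selection.results&amp;quot;) as fh:&lt;br /&gt;
     for line in fh:&lt;br /&gt;
         m = pattern.search(line)&lt;br /&gt;
         if m:&lt;br /&gt;
             np, lnl = int(m.group(2)), float(m.group(3))&lt;br /&gt;
             print(np, lnl)   # one line per fitted model (M1 first, then M2)&lt;br /&gt;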
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and the proportion of sites (p) for each of the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find the likelihood and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and the proportion of sites (p) for each of the three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
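&lt;br /&gt;
: As a concrete illustration, the two-model comparison can be scripted along the same lines as the recipe above. The M1 numbers below are taken from the example shown earlier; the M2 numbers are placeholders that you should replace with the values from your own output.&lt;br /&gt;
 import math&lt;br /&gt;
 lnl_m1, k_m1 = -4242.470345, 74   # lnL and np from the M1 example above&lt;br /&gt;
 lnl_m2, k_m2 = -4200.0, 76        # placeholders - use the values from your own M2 output&lt;br /&gt;
 aic_m1 = -2 * lnl_m1 + 2 * k_m1&lt;br /&gt;
 aic_m2 = -2 * lnl_m2 + 2 * k_m2&lt;br /&gt;
 aic_min = min(aic_m1, aic_m2)&lt;br /&gt;
 num_m1 = math.exp(-0.5 * (aic_m1 - aic_min))&lt;br /&gt;
 num_m2 = math.exp(-0.5 * (aic_m2 - aic_min))&lt;br /&gt;
 total = num_m1 + num_m2&lt;br /&gt;
 print(num_m1 / total, num_m2 / total)   # model probabilities for M1 and M2&lt;br /&gt;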
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: The distinction is not important in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, which matters especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046 but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
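&lt;br /&gt;
: If the BEB table is long, you can also filter it programmatically. The sketch below assumes you have copied the BEB table rows into a plain text file called beb_sites.txt (a hypothetical file name) and prints the sites with probability above 0.95; the column layout may differ slightly between PAML versions, so check it against your own output first.&lt;br /&gt;
 # beb_sites.txt is assumed to contain only the BEB table rows copied from selection.results&lt;br /&gt;
 with open(&amp;quot;beb_sites.txt&amp;quot;) as fh:&lt;br /&gt;
     for line in fh:&lt;br /&gt;
         parts = line.split()&lt;br /&gt;
         if len(parts) &amp;gt;= 3 and parts[0].isdigit():&lt;br /&gt;
             site, residue, prob = parts[0], parts[1], parts[2].rstrip(&amp;quot;*&amp;quot;)&lt;br /&gt;
             if float(prob) &amp;gt; 0.95:&lt;br /&gt;
                 print(site, residue, prob)   # sites in the positively selected class&lt;br /&gt;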
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=279</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=279"/>
		<updated>2026-04-15T09:28:30Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences, taking protein-level information into account (using RevTrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jModelTest2).&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP and IQ-TREE 3).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file (gp120align.fasta) to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is also possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we do it in PAUP here in order to see each step more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, AIC values and Akaike weights (model probabilities) can be computed from the likelihoods of the fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical way of deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following settings have these values. Any text appearing after an asterisk (*) on a line is only a comment, and you do not need to change it.&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;&lt;br /&gt;
: name of alignment file&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;&lt;br /&gt;
: name of tree file produced by IQ-TREE 3&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;&lt;br /&gt;
: tells the program that the data consist of coding DNA&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;&lt;br /&gt;
: tells the program to analyze models M1 and M2&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;&lt;br /&gt;
: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important note:&#039;&#039;&#039; in this exercise, use the tree from &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; for the codeml analysis. On some systems, codeml may have problems with the PAUP tree file or may stop with a confusing error message even though the downstream results are still written to the output file. Using the IQ-TREE tree avoids this issue.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems codeml may print an error message near the end of the run, for instance during the BEB step, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
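&lt;br /&gt;
: If you prefer not to hunt through the file by hand, the following minimal Python sketch pulls out ntime, np and lnL from every &amp;quot;lnL(...)&amp;quot; line. It assumes the result file is called selection.results and that the lines have the layout shown in the example above (the exact spacing in your own output may differ slightly).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Matches lines of the form shown above, e.g.:&lt;br /&gt;
#   lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
LNL_LINE = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
&lt;br /&gt;
def extract_lnl(path=&#039;selection.results&#039;):&lt;br /&gt;
    # Return a list of (ntime, np, lnL) tuples, one per fitted model.&lt;br /&gt;
    hits = []&lt;br /&gt;
    with open(path) as handle:&lt;br /&gt;
        for line in handle:&lt;br /&gt;
            match = LNL_LINE.search(line)&lt;br /&gt;
            if match:&lt;br /&gt;
                ntime, np_, lnl = match.groups()&lt;br /&gt;
                hits.append((int(ntime), int(np_), float(lnl)))&lt;br /&gt;
    return hits&lt;br /&gt;
&lt;br /&gt;
# With NSsites = 1 2 there should be one such line per model:&lt;br /&gt;
# the first tuple then corresponds to M1 and the second to M2.&lt;br /&gt;
print(extract_lnl())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;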
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have at least as high a log-likelihood as M1, because M1 is nested within M2: the extra free parameters of M2 can, at worst, be set so that M2 reduces to M1. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2 (a small worked sketch is shown below).&lt;br /&gt;
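&lt;br /&gt;
: As a worked illustration of that recipe, here is a minimal Python sketch comparing M1 and M2. The lnL and K values below are hypothetical placeholders; substitute the numbers you found in your own selection.results file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Hypothetical placeholder values; replace them with your own results.&lt;br /&gt;
lnL_M1, K_M1 = -4242.47, 74&lt;br /&gt;
lnL_M2, K_M2 = -4230.12, 76&lt;br /&gt;
&lt;br /&gt;
AIC_M1 = -2 * lnL_M1 + 2 * K_M1&lt;br /&gt;
AIC_M2 = -2 * lnL_M2 + 2 * K_M2&lt;br /&gt;
AIC_min = min(AIC_M1, AIC_M2)&lt;br /&gt;
delta_M1, delta_M2 = AIC_M1 - AIC_min, AIC_M2 - AIC_min&lt;br /&gt;
&lt;br /&gt;
num_M1 = math.exp(-0.5 * delta_M1)&lt;br /&gt;
num_M2 = math.exp(-0.5 * delta_M2)&lt;br /&gt;
total = num_M1 + num_M2&lt;br /&gt;
&lt;br /&gt;
print(&#039;M1: AIC %.2f  dAIC %.2f  weight %.3f&#039; % (AIC_M1, delta_M1, num_M1 / total))&lt;br /&gt;
print(&#039;M2: AIC %.2f  dAIC %.2f  weight %.3f&#039; % (AIC_M2, delta_M2, num_M2 / total))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;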
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
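&lt;br /&gt;
: To answer the question below you can simply read the table, but as a small illustration, the following Python sketch filters rows like the ones above and keeps sites with posterior probability greater than 0.95. The rows used here are just the example values from the table above (not your own results), and the parsing assumes the column layout shown: position, amino acid, probability with optional stars, then the mean w.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Example rows copied from the table above; replace with your own BEB rows.&lt;br /&gt;
beb_rows = [&lt;br /&gt;
    &#039;25 A 0.959*        3.133 +- 0.769&#039;,&lt;br /&gt;
    &#039;27 P 0.906         2.990 +- 0.877&#039;,&lt;br /&gt;
    &#039;56 K 0.987*        3.197 +- 0.687&#039;,&lt;br /&gt;
    &#039;59 V 0.915         3.032 +- 0.873&#039;,&lt;br /&gt;
]&lt;br /&gt;
&lt;br /&gt;
for row in beb_rows:&lt;br /&gt;
    fields = row.split()&lt;br /&gt;
    position, residue = int(fields[0]), fields[1]&lt;br /&gt;
    prob = float(fields[2].rstrip(&#039;*&#039;))   # drop the significance stars before converting&lt;br /&gt;
    if prob &amp;gt; 0.95:&lt;br /&gt;
        print(position, residue, prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;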
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=278</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=278"/>
		<updated>2026-04-15T09:25:14Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences, taking protein-level information into account (using RevTrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jModelTest2).&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
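&lt;br /&gt;
: If you prefer to let the computer do the bookkeeping, the following minimal Python sketch implements steps 2-6 of the recipe. The model names and the lnL and K values in the example dictionary are placeholders for illustration only; fill in the numbers from your own results table.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def akaike_weights(models):&lt;br /&gt;
    # models maps a model name to a (lnL, K) pair.&lt;br /&gt;
    aic = {name: -2 * lnl + 2 * k for name, (lnl, k) in models.items()}   # step 2&lt;br /&gt;
    aic_min = min(aic.values())                                           # step 3&lt;br /&gt;
    delta = {name: a - aic_min for name, a in aic.items()}                # step 4&lt;br /&gt;
    numer = {name: math.exp(-0.5 * d) for name, d in delta.items()}       # step 5&lt;br /&gt;
    total = sum(numer.values())&lt;br /&gt;
    weights = {name: n / total for name, n in numer.items()}              # step 6&lt;br /&gt;
    return aic, delta, weights&lt;br /&gt;
&lt;br /&gt;
# Placeholder lnL and K values, for illustration only:&lt;br /&gt;
models = {&#039;JC+I&#039;: (-2010.0, 5), &#039;JC+G&#039;: (-2008.0, 5), &#039;JC+I+G&#039;: (-2007.5, 6)}&lt;br /&gt;
aic, delta, weights = akaike_weights(models)&lt;br /&gt;
for name in models:&lt;br /&gt;
    print(name, round(aic[name], 1), round(delta[name], 1), round(weights[name], 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;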
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance /Users/bob/Documents/molevol or /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
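&lt;br /&gt;
: To make the idea concrete, here is a minimal Python sketch of the back-translation step. It is only an illustration of the principle, not RevTrans&#039;s actual implementation, and the tiny protein and DNA sequences at the bottom are made up purely for the example.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(protein_alignment, dna_sequences):&lt;br /&gt;
    # protein_alignment maps a name to an aligned amino acid string (with gaps).&lt;br /&gt;
    # dna_sequences maps the same name to the original, unaligned coding DNA.&lt;br /&gt;
    codon_alignment = {}&lt;br /&gt;
    for name, prot in protein_alignment.items():&lt;br /&gt;
        dna = dna_sequences[name]&lt;br /&gt;
        codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
        aligned, pos = [], 0&lt;br /&gt;
        for aa in prot:&lt;br /&gt;
            if aa == &#039;-&#039;:&lt;br /&gt;
                aligned.append(&#039;---&#039;)        # gaps always come in groups of three&lt;br /&gt;
            else:&lt;br /&gt;
                aligned.append(codons[pos])  # reinsert the original codon&lt;br /&gt;
                pos += 1&lt;br /&gt;
        codon_alignment[name] = &#039;&#039;.join(aligned)&lt;br /&gt;
    return codon_alignment&lt;br /&gt;
&lt;br /&gt;
# Made-up toy example: two short peptides, one with a gap in the alignment.&lt;br /&gt;
protein_alignment = {&#039;seq1&#039;: &#039;MK-V&#039;, &#039;seq2&#039;: &#039;MKAV&#039;}&lt;br /&gt;
dna_sequences = {&#039;seq1&#039;: &#039;ATGAAAGTT&#039;, &#039;seq2&#039;: &#039;ATGAAAGCCGTA&#039;}&lt;br /&gt;
print(backtranslate(protein_alignment, dna_sequences))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;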
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure it is saved in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
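&lt;br /&gt;
: One of several ways to do the conversion is a small Biopython script (this assumes Biopython is installed; any other format-conversion tool you are comfortable with works just as well). In recent Biopython versions the NEXUS writer needs to be told the molecule type, hence the molecule_type argument.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from Bio import SeqIO&lt;br /&gt;
&lt;br /&gt;
# Convert the aligned FASTA file to a NEXUS data block.&lt;br /&gt;
SeqIO.convert(&#039;gp120align.fasta&#039;, &#039;fasta&#039;, &#039;gp120.nexus&#039;, &#039;nexus&#039;, molecule_type=&#039;DNA&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;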
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is also possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we instead do it in PAUP in order to see each step more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, AIC values and Akaike weights (model probabilities) can be computed from the likelihoods of the fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical way of deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Edit the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that control how &#039;&#039;&#039;codeml&#039;&#039;&#039; is run. Before continuing, edit the file and save it so that the following arguments have exactly these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;: name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;: name of tree file produced by IQ-TREE 3&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;: tells the program to analyze models M1 and M2&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important note:&#039;&#039;&#039; in this exercise, use the tree from &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; for the codeml analysis. On some systems, codeml may have problems with the PAUP tree file or may stop with a confusing error message even though the downstream results are still written to the output file. Using the IQ-TREE tree avoids this issue.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems codeml may print an error message near the end of the run, for instance during the BEB step, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have at least as high a log-likelihood as M1, because M1 is nested within M2: the extra free parameters of M2 can, at worst, be set so that M2 reduces to M1. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=277</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=277"/>
		<updated>2026-04-15T09:18:50Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Tree inference using the model selected by jModelTest2 (PAUP) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
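&lt;br /&gt;
: If you want to script the calculation instead of doing it by hand, the small Python sketch below implements the recipe above. The function name is just an illustration, and the example values are the made-up numbers used in the recipe, not results from your data.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def akaike_weights(models):&lt;br /&gt;
    # models: dict mapping model name to (lnL, K)&lt;br /&gt;
    aic = {name: -2 * lnL + 2 * K for name, (lnL, K) in models.items()}&lt;br /&gt;
    aic_min = min(aic.values())   # AICmin: best model in the set&lt;br /&gt;
    num = {name: math.exp(-0.5 * (a - aic_min)) for name, a in aic.items()}&lt;br /&gt;
    total = sum(num.values())&lt;br /&gt;
    # returns (AIC, delta-AIC, Akaike weight) for each model&lt;br /&gt;
    return {name: (aic[name], aic[name] - aic_min, num[name] / total) for name in models}&lt;br /&gt;
&lt;br /&gt;
# Example with two made-up models (replace with your own lnL and K values):&lt;br /&gt;
print(akaike_weights({'modelA': (-2010.0, 5), 'modelB': (-2012.1, 4)}))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;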
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
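&lt;br /&gt;
: To make the idea concrete, here is a minimal Python sketch of the back-translation step (not the actual RevTrans code, just an illustration of the principle): a gapped protein sequence is mapped back onto its original, ungapped coding DNA one codon at a time, so gaps always appear as groups of three.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(gapped_protein, dna):&lt;br /&gt;
    # split the original coding DNA into codons&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, k = [], 0&lt;br /&gt;
    for aa in gapped_protein:&lt;br /&gt;
        if aa == '-':&lt;br /&gt;
            out.append('---')      # a protein gap becomes a codon-sized gap&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[k])  # otherwise re-use the original codon&lt;br /&gt;
            k += 1&lt;br /&gt;
    return ''.join(out)&lt;br /&gt;
&lt;br /&gt;
print(backtranslate('M-K', 'ATGAAA'))   # prints ATG---AAA&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;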
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
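&lt;br /&gt;
: There are many ways to do this conversion; one option, assuming you have Biopython installed, is the short sketch below (any other alignment-conversion tool you are comfortable with is equally fine).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: convert the RevTrans FASTA alignment to NEXUS using Biopython&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
AlignIO.convert('gp120align.fasta', 'fasta', 'gp120.nexus', 'nexus', molecule_type='DNA')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;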
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we will instead do it in PAUP in order to see each step that is taken more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihoods of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;: name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;: name of tree file produced by IQ-TREE 3&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;: tells the program to analyze models M1 and M2&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important note:&#039;&#039;&#039; in this exercise, use the tree from &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; for the codeml analysis. On some systems, codeml may have problems with the PAUP tree file or may stop with a confusing error message even though the downstream results are still written to the output file. Using the IQ-TREE tree avoids this issue.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
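&lt;br /&gt;
: For orientation, the relevant part of codeml.ctl might look roughly like the sketch below. This is only a sketch: the copy of codeml.ctl you were given contains additional options that are omitted here, so check and edit the actual file rather than retyping it from this example. The outfile name matches the result file inspected later in the exercise.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      seqfile = gp120align.fasta      * codon alignment from RevTrans&lt;br /&gt;
     treefile = gp120.nexus.treefile  * ML tree produced by IQ-TREE 3&lt;br /&gt;
      outfile = selection.results     * main result file&lt;br /&gt;
      seqtype = 1                     * data are codon sequences&lt;br /&gt;
      NSsites = 1 2                   * fit site models M1 and M2&lt;br /&gt;
    cleandata = 0                     * keep alignment columns with gaps&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;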
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems codeml may print an error message near the end of the run, for instance during the BEB step, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
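&lt;br /&gt;
: As an illustration, the comparison can be scripted with the hypothetical akaike_weights() sketch from the AIC recipe section. The M1 numbers below are the ones from the example lnL line shown earlier; the M2 numbers are placeholders, so substitute the values you read from selection.results.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# lnL and K (np) are read off the lnL lines in selection.results&lt;br /&gt;
# M1 values are from the example above; M2 values are placeholders&lt;br /&gt;
models = {'M1': (-4242.470345, 74), 'M2': (-4200.0, 76)}&lt;br /&gt;
print(akaike_weights(models))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;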
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
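&lt;br /&gt;
: If the BEB table is long, you can also filter it with a few lines of code. The sketch below assumes you have pasted the BEB table into a plain text file called beb.txt (an assumed file name) and simply prints the rows whose Prob(w&amp;gt;1) column exceeds 0.95; these are the same rows that codeml marks with asterisks.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: print BEB sites with posterior probability of positive selection above 0.95&lt;br /&gt;
threshold = 0.95&lt;br /&gt;
for line in open('beb.txt'):&lt;br /&gt;
    parts = line.split()&lt;br /&gt;
    # data rows look like: "25 A 0.959*  3.133 +- 0.769" (6 whitespace-separated fields)&lt;br /&gt;
    if len(parts) == 6 and parts[0].isdigit():&lt;br /&gt;
        prob = float(parts[2].rstrip('*'))&lt;br /&gt;
        if prob &amp;gt; threshold:&lt;br /&gt;
            print(parts[0], parts[1], prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;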
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=276</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=276"/>
		<updated>2026-04-15T09:18:00Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it sits on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether such selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Phylogenetic tree inference ==&lt;br /&gt;
&lt;br /&gt;
=== Tree inference using the model selected by jModelTest2 (PAUP) ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Above, you used jModelTest2 to select a substitution model based on AIC. You will now use that model to construct a maximum likelihood tree in &#039;&#039;&#039;PAUP&#039;&#039;&#039;. The purpose of doing this in PAUP is mainly pedagogical: it makes the individual steps of model-based phylogenetic inference more explicit. In practice, one would often use a more integrated program such as IQ-TREE 3 (see next section).&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jModelTest2, but we will instead do it in PAUP in order to see each step that is taken more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jModelTest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree_paup.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree_paup.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree_paup.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== One-step model selection and tree inference using IQ-TREE 3 ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis. You will now repeat the analysis using this more integrated approach.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihoods of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, this gives us a stringent statistical framework for deciding which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS = 1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;: name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120.nexus.treefile&#039;&#039;&#039;: name of tree file produced by IQ-TREE 3&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039;: tells the program to analyze models M1 and M2&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Important note:&#039;&#039;&#039; in this exercise, use the tree from &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039; for the codeml analysis. On some systems, codeml may have problems with the PAUP tree file or may stop with a confusing error message even though the downstream results are still written to the output file. Using the IQ-TREE tree avoids this issue.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS = 1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS = 1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;Note:&#039;&#039;&#039; On some systems codeml may print an error message near the end of the run, for instance during the BEB step, even though the relevant results have already been written to &#039;&#039;&#039;selection.results&#039;&#039;&#039;. If this happens, inspect &#039;&#039;&#039;selection.results&#039;&#039;&#039; before assuming that the run failed completely.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: this is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
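&lt;br /&gt;
: If you prefer to pull these numbers out of the result file programmatically instead of searching in an editor, the small Python sketch below is one way to do it. It is only a sketch: it assumes the lnL lines have exactly the layout shown above, and the layout may differ between PAML versions. One matching line should be printed for each fitted model (M1 first, then M2).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Sketch: extract np (the number of free parameters, K) and lnL from a codeml&lt;br /&gt;
# result file, assuming lines of the form:&lt;br /&gt;
#   lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        m = pattern.search(line)&lt;br /&gt;
        if m:&lt;br /&gt;
            ntime, np, lnl = int(m.group(1)), int(m.group(2)), float(m.group(3))&lt;br /&gt;
            print(&#039;K =&#039;, np, &#039; lnL =&#039;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;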
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039;Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1, then you have evidence for the existence of positively selected sites in the gp120 gene. Now scroll down to the end of the result file and locate a list similar to the one below. Note: this is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site.&lt;br /&gt;
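&lt;br /&gt;
: If you want to extract these sites programmatically, a rough Python sketch is shown below. It assumes the table rows have the layout shown above (position, amino acid, Prob(w&amp;gt;1) possibly followed by an asterisk, mean w, +-, standard error) and that they follow the line containing &amp;quot;Bayes Empirical Bayes&amp;quot;; the exact layout may differ between PAML versions, so check the output against the file.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: list BEB sites with Prob(w&amp;gt;1) above 0.95.&lt;br /&gt;
threshold = 0.95&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True&lt;br /&gt;
            continue&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            fields = line.split()&lt;br /&gt;
            # expected row: position, amino acid, probability (maybe ending in &#039;*&#039;), mean w, +-, SE&lt;br /&gt;
            if len(fields) == 6 and fields[0].isdigit():&lt;br /&gt;
                prob = float(fields[2].rstrip(&#039;*&#039;))&lt;br /&gt;
                if prob &amp;gt; threshold:&lt;br /&gt;
                    print(fields[0], fields[1], prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;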
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class.&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=275</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=275"/>
		<updated>2026-04-15T08:29:34Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree, which you will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
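&lt;br /&gt;
: If you would rather let the computer do the bookkeeping, the recipe above is easy to express in a few lines of Python. The sketch below uses made-up placeholder lnL and K values - substitute the numbers you have noted down for your own models.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# (name, lnL, K) for each fitted model - the numbers below are placeholders&lt;br /&gt;
models = [(&#039;JC+I&#039;, -2010.0, 5), (&#039;JC+G&#039;, -2005.0, 5), (&#039;JC+I+G&#039;, -2004.0, 6)]&lt;br /&gt;
&lt;br /&gt;
aic = {name: -2 * lnl + 2 * k for (name, lnl, k) in models}        # AIC = -2 x lnL + 2 x K&lt;br /&gt;
aic_min = min(aic.values())&lt;br /&gt;
delta = {name: a - aic_min for (name, a) in aic.items()}           # delta-AIC&lt;br /&gt;
numer = {name: math.exp(-0.5 * d) for (name, d) in delta.items()}&lt;br /&gt;
total = sum(numer.values())&lt;br /&gt;
&lt;br /&gt;
for name in aic:&lt;br /&gt;
    weight = numer[name] / total                                   # model probability&lt;br /&gt;
    print(name, round(aic[name], 2), round(delta[name], 2), round(weight, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;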
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
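&lt;br /&gt;
: To make the back-translation idea concrete, here is a toy Python illustration of the principle (this is not the actual RevTrans code, and the example sequences are made up): each amino acid in the gapped, aligned protein is replaced by its original codon, and each gap by three gap characters, so codon boundaries are preserved.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def back_translate(dna, aligned_protein):&lt;br /&gt;
    # split the original (unaligned) coding DNA into codons&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, k = [], 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &#039;-&#039;:&lt;br /&gt;
            out.append(&#039;---&#039;)      # gaps always come in groups of three&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[k])  # reuse the original codon for this residue&lt;br /&gt;
            k += 1&lt;br /&gt;
    return &#039;&#039;.join(out)&lt;br /&gt;
&lt;br /&gt;
# made-up example: ATG AAA TTT encodes M K F; the aligned protein has one gap&lt;br /&gt;
print(back_translate(&#039;ATGAAATTT&#039;, &#039;MK-F&#039;))   # prints ATGAAA---TTT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;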
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
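&lt;br /&gt;
: To get a feeling for what the &amp;quot;+G&amp;quot; part means in practice, the toy Python sketch below (which requires scipy) discretizes a gamma distribution of rates with mean 1 into four equally probable categories, each represented by its median rate. This is only an illustration: the alpha value here is arbitrary, and the programs you are using may discretize slightly differently (e.g., using category means rather than medians).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from scipy.stats import gamma&lt;br /&gt;
&lt;br /&gt;
# Toy illustration of &amp;quot;+G&amp;quot;: site rates follow a gamma distribution with mean 1&lt;br /&gt;
# (shape = alpha, scale = 1/alpha), cut into ncat equally probable categories.&lt;br /&gt;
alpha = 0.5          # arbitrary shape parameter; small alpha means strong rate variation&lt;br /&gt;
ncat = 4&lt;br /&gt;
quantiles = [(2 * i + 1) / (2 * ncat) for i in range(ncat)]    # 0.125, 0.375, 0.625, 0.875&lt;br /&gt;
rates = [gamma.ppf(q, alpha, scale=1 / alpha) for q in quantiles]&lt;br /&gt;
mean_rate = sum(rates) / ncat&lt;br /&gt;
rates = [r / mean_rate for r in rates]   # rescale so the average rate is exactly 1&lt;br /&gt;
print([round(r, 3) for r in rates])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;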
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on whether we can detect positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: in this framework one uses likelihoods (probabilities of data given model) to compare models. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS = 1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), whereas [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site. Using only DNA sequences, you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=274</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=274"/>
		<updated>2026-04-15T08:29:03Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection and tree reconstruction in one step using IQ-TREE 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree, which you will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting property of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data, namely its likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the uncertainty in the maximum likelihood estimates, which can be substantial for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really belongs to the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=273</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=273"/>
		<updated>2026-04-15T08:28:29Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection and tree reconstruction in one step using IQ-TREE 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, make it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
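&lt;br /&gt;
: If you would like to double-check your manual calculations later in the exercise, the recipe above can also be expressed in a few lines of Python. This is only a sketch: the lnL and K values in the dictionary are made-up placeholders, so substitute the numbers you note down for your own models.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Step 1: placeholder (lnL, K) values - replace these with the numbers from your own analysis&lt;br /&gt;
models = {&lt;br /&gt;
    &#039;JC+I&#039;:   (-2010.0, 75),&lt;br /&gt;
    &#039;JC+G&#039;:   (-2005.0, 75),&lt;br /&gt;
    &#039;JC+I+G&#039;: (-2004.0, 76),&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# Steps 2-3: AIC = -2 x lnL + 2K, and the smallest AIC in the set&lt;br /&gt;
aic = {name: -2 * lnL + 2 * K for name, (lnL, K) in models.items()}&lt;br /&gt;
aic_min = min(aic.values())&lt;br /&gt;
&lt;br /&gt;
# Steps 4-6: delta-AIC, numerator = exp(-0.5 x delta-AIC), weight = numerator / sum&lt;br /&gt;
delta = {name: a - aic_min for name, a in aic.items()}&lt;br /&gt;
numer = {name: math.exp(-0.5 * d) for name, d in delta.items()}&lt;br /&gt;
total = sum(numer.values())&lt;br /&gt;
&lt;br /&gt;
for name in models:&lt;br /&gt;
    print(name, round(aic[name], 2), round(delta[name], 2), round(numer[name] / total, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;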
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
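&lt;br /&gt;
: To make the codon-preserving back-translation concrete, the toy Python sketch below illustrates the principle (this is just the idea, not how RevTrans itself is implemented): the original codons are threaded back onto the aligned protein sequence, and every gap in the protein alignment becomes a three-base gap in the DNA alignment. The sequences used here are hypothetical.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Toy illustration of protein-guided back-translation: thread the original codons&lt;br /&gt;
# onto the aligned protein, inserting &#039;---&#039; wherever the protein alignment has a gap.&lt;br /&gt;
def back_translate(dna, aligned_protein):&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, k = [], 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &#039;-&#039;:&lt;br /&gt;
            out.append(&#039;---&#039;)&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[k])&lt;br /&gt;
            k += 1&lt;br /&gt;
    return &#039;&#039;.join(out)&lt;br /&gt;
&lt;br /&gt;
# Hypothetical toy sequences (not gp120 data)&lt;br /&gt;
print(back_translate(&#039;ATGAAATTTGGG&#039;, &#039;MK-FG&#039;))   # prints ATGAAA---TTTGGG&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;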
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
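&lt;br /&gt;
: Any alignment viewer or conversion tool you are comfortable with can do this. If you prefer to script the conversion, here is a minimal Python sketch that writes a simple NEXUS data block; it assumes the RevTrans output gp120align.fasta is a proper alignment (all sequences the same length) and that the sequence names contain no spaces or other characters that downstream programs would object to.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal FASTA-to-NEXUS conversion (assumes an already aligned DNA FASTA file)&lt;br /&gt;
def read_fasta(path):&lt;br /&gt;
    seqs, name = {}, None&lt;br /&gt;
    with open(path) as fh:&lt;br /&gt;
        for line in fh:&lt;br /&gt;
            line = line.strip()&lt;br /&gt;
            if line.startswith(&#039;&amp;gt;&#039;):&lt;br /&gt;
                name = line[1:].split()[0]&lt;br /&gt;
                seqs[name] = []&lt;br /&gt;
            elif name:&lt;br /&gt;
                seqs[name].append(line)&lt;br /&gt;
    return {n: &#039;&#039;.join(parts) for n, parts in seqs.items()}&lt;br /&gt;
&lt;br /&gt;
seqs = read_fasta(&#039;gp120align.fasta&#039;)&lt;br /&gt;
nchar = len(next(iter(seqs.values())))&lt;br /&gt;
&lt;br /&gt;
with open(&#039;gp120.nexus&#039;, &#039;w&#039;) as out:&lt;br /&gt;
    out.write(&#039;#NEXUS\n\nBEGIN DATA;\n&#039;)&lt;br /&gt;
    out.write(&#039;  DIMENSIONS NTAX=%d NCHAR=%d;\n&#039; % (len(seqs), nchar))&lt;br /&gt;
    out.write(&#039;  FORMAT DATATYPE=DNA MISSING=? GAP=-;\n&#039;)&lt;br /&gt;
    out.write(&#039;  MATRIX\n&#039;)&lt;br /&gt;
    for name, seq in seqs.items():&lt;br /&gt;
        out.write(&#039;  %s  %s\n&#039; % (name, seq))&lt;br /&gt;
    out.write(&#039;  ;\nEND;\n&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;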
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting property of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data, namely its likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
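&lt;br /&gt;
: (If you find it awkward to hunt through the result file by hand, a small Python snippet like the sketch below can pull out the np and lnL values for each fitted model. It assumes the output file is named selection.results as above; the matches should appear in the order the models were fitted, i.e. M1 first and then M2.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Extract ntime, np (= K) and lnL for each fitted model from the codeml output&lt;br /&gt;
pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for match in pattern.finditer(fh.read()):&lt;br /&gt;
        ntime, np, lnl = match.groups()&lt;br /&gt;
        print(&#039;ntime =&#039;, ntime, &#039; K (np) =&#039;, np, &#039; lnL =&#039;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;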
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the uncertainty in the maximum likelihood estimates, which can be substantial for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really belongs to the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
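&lt;br /&gt;
: (If the BEB table is long, you can also extract the relevant sites with a small script. The sketch below assumes the output file is selection.results and that the positively-selected-sites table looks like the example above; it simply keeps the rows whose Prob(w&amp;gt;1) column is above 0.95.)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# List BEB sites whose probability of belonging to the positively selected class exceeds 0.95&lt;br /&gt;
site_line = re.compile(r&#039;\s*(\d+)\s+([A-Z*-])\s+(\d\.\d+)&#039;)&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True&lt;br /&gt;
        elif &#039;Naive Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = False&lt;br /&gt;
        elif in_beb:&lt;br /&gt;
            m = site_line.match(line)&lt;br /&gt;
            if m and float(m.group(3)) &amp;gt; 0.95:&lt;br /&gt;
                print(&#039;site&#039;, m.group(1), m.group(2), &#039;Prob =&#039;, m.group(3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;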
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=272</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=272"/>
		<updated>2026-04-15T08:28:00Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection and tree reconstruction in one step using IQ-TREE 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
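&lt;br /&gt;
: If you prefer to let a script do the arithmetic, the recipe above can be written as a few lines of Python. The sketch below is only meant as a convenience: the lnL and K values are placeholders that you should replace with the numbers you note down for your own models.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Placeholder values: replace with the lnL and K you noted for each model&lt;br /&gt;
models = {&lt;br /&gt;
    &#039;JC+I&#039;:   {&#039;lnL&#039;: -4000.0, &#039;K&#039;: 66},&lt;br /&gt;
    &#039;JC+G&#039;:   {&#039;lnL&#039;: -3990.0, &#039;K&#039;: 66},&lt;br /&gt;
    &#039;JC+I+G&#039;: {&#039;lnL&#039;: -3988.0, &#039;K&#039;: 67},&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# AIC = -2 x lnL + 2K&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&#039;AIC&#039;] = -2 * m[&#039;lnL&#039;] + 2 * m[&#039;K&#039;]&lt;br /&gt;
&lt;br /&gt;
# Delta-AIC and Akaike weights (model probabilities)&lt;br /&gt;
aic_min = min(m[&#039;AIC&#039;] for m in models.values())&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&#039;dAIC&#039;] = m[&#039;AIC&#039;] - aic_min&lt;br /&gt;
    m[&#039;numerator&#039;] = math.exp(-0.5 * m[&#039;dAIC&#039;])&lt;br /&gt;
&lt;br /&gt;
total = sum(m[&#039;numerator&#039;] for m in models.values())&lt;br /&gt;
for name, m in models.items():&lt;br /&gt;
    print(name, round(m[&#039;AIC&#039;], 2), round(m[&#039;dAIC&#039;], 2), round(m[&#039;numerator&#039;] / total, 4))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: The printed weights should sum to 1 (apart from rounding); if they do not, check the signs of your lnL values.&lt;br /&gt;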
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance /Users/bob/Documents/molevol or /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
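&lt;br /&gt;
: As a toy illustration (made-up sequences, not taken from the data set): a single-residue gap in the protein alignment corresponds to a gap of exactly three nucleotides, placed at a codon boundary, in the back-translated DNA alignment:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
protein alignment          back-translated DNA alignment&lt;br /&gt;
seq1  MK-V                 seq1  ATGAAA---GTT&lt;br /&gt;
seq2  MKAV                 seq2  ATGAAAGCAGTG&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;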
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the aligned fasta file (gp120align.fasta) to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
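&lt;br /&gt;
: You can do this conversion with an alignment editor or a small script. If you have Biopython installed, one possible sketch (assuming you saved the alignment under the name gp120align.fasta, as described above) is:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: convert the aligned fasta file to NEXUS format using Biopython&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
&lt;br /&gt;
AlignIO.convert(&#039;gp120align.fasta&#039;, &#039;fasta&#039;, &#039;gp120.nexus&#039;, &#039;nexus&#039;, molecule_type=&#039;DNA&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;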
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
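&lt;br /&gt;
: If you prefer to pull these values out of the report with a small script rather than by eye, a rough sketch along the following lines may help (it assumes the default output prefix, so that the report is named gp120.nexus.iqtree; the exact wording of the report lines can differ between IQ-TREE versions, so adjust the search strings if needed):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: print report lines mentioning the chosen model or the tree log-likelihood&lt;br /&gt;
with open(&#039;gp120.nexus.iqtree&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        if &#039;Best-fit model&#039; in line or &#039;Log-likelihood&#039; in line:&lt;br /&gt;
            print(line.rstrip())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;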
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting aspect of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect the estimated parameter values to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: in this framework one uses likelihoods (probabilities of the data given the model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
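&lt;br /&gt;
: (If you do want to remove very gappy columns, one possible approach is sketched below: it uses Biopython and drops whole codons - groups of three columns - in which more than half of the sequences have a gap, writing the result to a new file called gp120align_trimmed.fasta. The threshold and the output file name are arbitrary choices for illustration; if you use a trimmed alignment you would of course also have to point seqfile in codeml.ctl at the new file.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: drop codon columns (groups of three) where more than half the sequences have a gap&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
&lt;br /&gt;
aln = AlignIO.read(&#039;gp120align.fasta&#039;, &#039;fasta&#039;)&lt;br /&gt;
n_seq = len(aln)&lt;br /&gt;
keep = []&lt;br /&gt;
for i in range(0, aln.get_alignment_length(), 3):&lt;br /&gt;
    codons = [str(rec.seq[i:i + 3]) for rec in aln]&lt;br /&gt;
    gappy = sum(&#039;-&#039; in c for c in codons)&lt;br /&gt;
    if gappy &amp;lt;= n_seq / 2:&lt;br /&gt;
        keep.append(i)&lt;br /&gt;
&lt;br /&gt;
with open(&#039;gp120align_trimmed.fasta&#039;, &#039;w&#039;) as out:&lt;br /&gt;
    for rec in aln:&lt;br /&gt;
        seq = &#039;&#039;.join(str(rec.seq[i:i + 3]) for i in keep)&lt;br /&gt;
        out.write(&#039;&amp;gt;&#039; + rec.id + &#039;\n&#039; + seq + &#039;\n&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;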
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
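&lt;br /&gt;
: If you would rather extract these numbers with a script than by searching in the editor, a minimal sketch (assuming the result file is called selection.results, as above) could look like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Sketch: pull ntime, np (free parameters) and lnL out of each lnL(ntime: .. np: ..) line&lt;br /&gt;
# in the codeml result file; the lines appear in the same order as the fitted models&lt;br /&gt;
pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        hit = pattern.search(line)&lt;br /&gt;
        if hit:&lt;br /&gt;
            ntime, np, lnl = hit.groups()&lt;br /&gt;
            print(&#039;ntime =&#039;, ntime, &#039; np =&#039;, np, &#039; lnL =&#039;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;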
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dn/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dn/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1, because M2 has more free parameters and M1 is nested within M2. Note, however, that M2 has two extra free parameters, so AIC will only favour M2 if its log-likelihood is more than 2 units higher than that of M1. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
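&lt;br /&gt;
: You can read the answer directly off the table (in the codeml output, sites marked with a single &amp;quot;*&amp;quot; have posterior probability above 0.95, and &amp;quot;**&amp;quot; above 0.99), but if you would like to extract them programmatically, a rough sketch (again assuming the result file is called selection.results) might be:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: list sites from the BEB table with Prob(w&amp;gt;1) above 0.95&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as fh:&lt;br /&gt;
    for line in fh:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True&lt;br /&gt;
            continue&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            fields = line.split()&lt;br /&gt;
            # data lines look like: 25 A 0.959* 3.133 +- 0.769&lt;br /&gt;
            if len(fields) &amp;gt;= 3 and fields[0].isdigit() and len(fields[1]) == 1:&lt;br /&gt;
                prob = float(fields[2].rstrip(&#039;*&#039;))&lt;br /&gt;
                if prob &amp;gt; 0.95:&lt;br /&gt;
                    print(fields[0], fields[1], prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;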
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=271</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=271"/>
		<updated>2026-04-15T08:27:29Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance /Users/bob/Documents/molevol or /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the aligned fasta file (gp120align.fasta) to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP -nt AUTO&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analysis than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, the estimated parameter values can be examined to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model that has the dN/dS ratio as one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that it provides a quantitative measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to compare models; as you saw above, it is for instance possible to compute AIC values and Akaike weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The values of the dN/dS ratios (for the classes with dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fractions of sites belonging to each class, and the positions of the sites in each class are unknown at first and will be estimated during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
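&lt;br /&gt;
: (Optional) If you would rather extract these numbers with a small script than locate them by eye, the following Python sketch shows one possible way. It simply scans selection.results for lines of the form shown above; since both M1 and M2 produce such a line, it prints two results, in the order they appear in the file. The regular expression is an assumption based on the example line above, so adjust it if your file looks slightly different.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Matches lines like:  lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
pattern = re.compile(r&amp;quot;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s+(-?\d+\.\d+)&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as handle:&lt;br /&gt;
    for line in handle:&lt;br /&gt;
        match = pattern.search(line)&lt;br /&gt;
        if match:&lt;br /&gt;
            ntime, np, lnl = match.groups()&lt;br /&gt;
            print(&amp;quot;ntime =&amp;quot;, ntime, &amp;quot; K (np) =&amp;quot;, np, &amp;quot; lnL =&amp;quot;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;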
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always fit at least as well as model M1 (i.e., have an equal or higher log-likelihood), because M1 is nested within M2: the extra free parameters of M2 allow it to reproduce M1 as a special case. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but, very briefly, NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: The table lists the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that each site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
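&lt;br /&gt;
: (Optional) Rather than scanning the BEB table by eye, you can filter it with a few lines of Python. The sketch below is only an illustration: it assumes the table lines look exactly like the excerpt above, skips everything before the &amp;quot;Bayes Empirical Bayes&amp;quot; heading (so that the NEB table is ignored), and prints the sites whose probability of belonging to the positively selected class exceeds 0.95.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Assumes BEB table lines formatted like:  &amp;quot;    25 A 0.959*        3.133 +- 0.769&amp;quot;&lt;br /&gt;
pattern = re.compile(r&amp;quot;^\s*(\d+)\s+\S\s+([01]\.\d+)&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as handle:&lt;br /&gt;
    in_beb = False&lt;br /&gt;
    for line in handle:&lt;br /&gt;
        if &amp;quot;Bayes Empirical Bayes&amp;quot; in line:&lt;br /&gt;
            in_beb = True   # everything before this heading is the NEB table&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            match = pattern.match(line)&lt;br /&gt;
            if match and float(match.group(2)) &amp;gt; 0.95:&lt;br /&gt;
                print(line.rstrip())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;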
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=270</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=270"/>
		<updated>2026-04-15T08:25:27Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection and tree reconstruction in one step using IQ-TREE 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that will subsequently be used to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
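&lt;br /&gt;
: Alternatively, if you prefer to script the recipe above instead of filling in the table by hand, a minimal Python sketch could look like the one below. The lnL and K values in the example dictionary are made-up placeholders, not results from this exercise; substitute the values you have noted down for your own models.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Placeholder (lnL, K) values - substitute the numbers from your own analysis&lt;br /&gt;
models = {&lt;br /&gt;
    &amp;quot;modelA&amp;quot;: (-2010.0, 5),&lt;br /&gt;
    &amp;quot;modelB&amp;quot;: (-2005.0, 6),&lt;br /&gt;
    &amp;quot;modelC&amp;quot;: (-2002.0, 7),&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
aic = {name: -2 * lnl + 2 * k for name, (lnl, k) in models.items()}   # AIC = -2 x lnL + 2K&lt;br /&gt;
aic_min = min(aic.values())&lt;br /&gt;
delta = {name: a - aic_min for name, a in aic.items()}                # dAIC = AIC - AICmin&lt;br /&gt;
numer = {name: math.exp(-0.5 * d) for name, d in delta.items()}&lt;br /&gt;
total = sum(numer.values())&lt;br /&gt;
&lt;br /&gt;
for name in models:&lt;br /&gt;
    weight = numer[name] / total                                      # P(model) = numerator / sum&lt;br /&gt;
    print(name, &amp;quot;AIC=%.2f  dAIC=%.2f  weight=%.3f&amp;quot; % (aic[name], delta[name], weight))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;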
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the first command below, replace /path/to/molevol with the path to the directory where you have placed your course files (so the command becomes, for instance, cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
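&lt;br /&gt;
: To make the back-translation idea concrete, here is a small Python sketch of the principle (an illustration only, not the actual RevTrans code). Given one unaligned coding DNA sequence and the corresponding gapped protein sequence from a protein alignment, it re-inserts the gaps three nucleotides at a time:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def thread_codons(dna, aligned_protein):&lt;br /&gt;
    # Split the coding DNA into codons, then walk along the gapped protein&lt;br /&gt;
    # sequence: a gap becomes &amp;quot;---&amp;quot;, a residue becomes the next codon.&lt;br /&gt;
    # Codon boundaries are preserved by construction.&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    aligned = []&lt;br /&gt;
    pos = 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
            aligned.append(&amp;quot;---&amp;quot;)&lt;br /&gt;
        else:&lt;br /&gt;
            aligned.append(codons[pos])&lt;br /&gt;
            pos += 1&lt;br /&gt;
    return &amp;quot;&amp;quot;.join(aligned)&lt;br /&gt;
&lt;br /&gt;
# Tiny made-up example: Met-Lys-Phe threaded onto the gapped protein &amp;quot;MK-F&amp;quot;&lt;br /&gt;
print(thread_codons(&amp;quot;ATGAAATTT&amp;quot;, &amp;quot;MK-F&amp;quot;))   # prints ATGAAA---TTT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;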
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus (one possible way of doing this is sketched below).&lt;br /&gt;
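&lt;br /&gt;
: One way to do the conversion, assuming you have Biopython (version 1.78 or newer) installed, is a short snippet using the Bio.AlignIO module; any other conversion tool you are comfortable with will of course also work:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Convert the RevTrans FASTA alignment to NEXUS format (assumes Biopython 1.78+)&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
&lt;br /&gt;
# molecule_type is required when writing NEXUS in recent Biopython versions&lt;br /&gt;
AlignIO.convert(&amp;quot;gp120align.fasta&amp;quot;, &amp;quot;fasta&amp;quot;, &amp;quot;gp120.nexus&amp;quot;, &amp;quot;nexus&amp;quot;, molecule_type=&amp;quot;DNA&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;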
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP -nt AUTO&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;.&lt;br /&gt;
: Note that IQ-TREE 3 reports several model-selection criteria in the output. The line near the top of the ModelFinder section states which criterion was used to choose the final model. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
: Which substitution model was selected by IQ-TREE 3, and according to which information criterion was it selected? What was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analysis than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, the estimated parameter values can be examined to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model that has the dN/dS ratio as one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that it provides a quantitative measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to compare models; as you saw above, it is for instance possible to compute AIC values and Akaike weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The values of the dN/dS ratios (for the classes with dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fractions of sites belonging to each class, and the positions of the sites in each class are unknown at first and will be estimated during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always fit at least as well as model M1 (i.e., have an equal or higher log-likelihood), because M1 is nested within M2: the extra free parameters of M2 allow it to reproduce M1 as a special case. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but, very briefly, NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: The table lists the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that each site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=269</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=269"/>
		<updated>2026-04-15T08:20:42Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection and tree reconstruction in one step using IQ-TREE 3 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that will subsequently be used to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, and note the maximized log likelihood (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the first command below, replace /path/to/molevol with the path to the directory where you have placed your course files (so the command becomes, for instance, cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
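&lt;br /&gt;
: If you want to double-check your hand calculation, the same computation can be done with a few lines of Python. The sketch below follows the recipe directly; the lnL and K values are placeholders that you should replace with the numbers you wrote down from the jmodeltest2 table.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Placeholder (lnL, K) values - replace with the numbers from the results table.&lt;br /&gt;
models = {&#039;JC+I&#039;: (-5000.0, 68), &#039;JC+G&#039;: (-4990.0, 68), &#039;JC+I+G&#039;: (-4989.0, 69)}&lt;br /&gt;
&lt;br /&gt;
aic = {m: -2 * lnL + 2 * K for m, (lnL, K) in models.items()}   # AIC = -2 x lnL + 2K&lt;br /&gt;
best = min(aic.values())                                        # AICmin&lt;br /&gt;
num = {m: math.exp(-0.5 * (a - best)) for m, a in aic.items()}  # exp(-0.5 x dAIC)&lt;br /&gt;
tot = sum(num.values())&lt;br /&gt;
&lt;br /&gt;
for m in models:&lt;br /&gt;
    print(m, &#039;AIC=%.2f&#039; % aic[m], &#039;dAIC=%.2f&#039; % (aic[m] - best),&lt;br /&gt;
          &#039;weight=%.3f&#039; % (num[m] / tot))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: The weights printed this way should sum to 1, just as in your hand calculation.&lt;br /&gt;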
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP -nt AUTO&lt;br /&gt;
&lt;br /&gt;
: Here, &amp;quot;-m MFP&amp;quot; tells IQ-TREE 3 to perform automatic model selection using &amp;quot;ModelFinder Plus&amp;quot; before reconstructing the tree.&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Which substitution model was selected by IQ-TREE 3, and what was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data, namely its likelihood (the probability of the data given the model). This means you can use a stringent statistical approach to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use such an approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The values of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the positions of the sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
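&lt;br /&gt;
: Purely as a conceptual illustration (nothing you need to run): in these site-class models the likelihood of each codon site is a weighted sum over the classes, with the class proportions acting as weights. A toy Python sketch with made-up numbers:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Toy illustration of a site-class mixture (all numbers are made up).&lt;br /&gt;
# p[k] is the proportion of sites in class k; lik[k] is the likelihood of&lt;br /&gt;
# one particular site computed under the dN/dS value of class k.&lt;br /&gt;
p   = [0.75, 0.25]        # e.g. M1: one class with dN/dS below 1, one with dN/dS = 1&lt;br /&gt;
lik = [1.2e-5, 0.4e-5]    # hypothetical per-class likelihoods for one site&lt;br /&gt;
&lt;br /&gt;
site_likelihood = sum(pk * lk for pk, lk in zip(p, lik))&lt;br /&gt;
print(site_likelihood)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;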
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
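&lt;br /&gt;
: If you prefer not to search by eye, the np and lnL values can also be pulled out of the result file with a few lines of Python. This is just a convenience sketch: it assumes the line format shown above, and the file contains one such line for each fitted model (M1 first, then M2).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Matches lines like:  lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
pattern = re.compile(r&#039;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;)&lt;br /&gt;
with open(&#039;selection.results&#039;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        m = pattern.search(line)&lt;br /&gt;
        if m:&lt;br /&gt;
            ntime, np, lnl = int(m.group(1)), int(m.group(2)), float(m.group(3))&lt;br /&gt;
            print(&#039;ntime =&#039;, ntime, &#039; np (K) =&#039;, np, &#039; lnL =&#039;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;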
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above, approximately 75% of sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
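&lt;br /&gt;
: Again purely optional: the p and w lines can be read programmatically in the same spirit as the lnL line above. The sketch below simply prints every &amp;quot;p:&amp;quot; and &amp;quot;w:&amp;quot; line it finds (one pair for M1 and one for M2), and assumes those labels only occur in the site-class tables.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Print the proportions (p) and dN/dS values (w) for each site-class table.&lt;br /&gt;
with open(&#039;selection.results&#039;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        s = line.strip()&lt;br /&gt;
        if s.startswith(&#039;p:&#039;) or s.startswith(&#039;w:&#039;):&lt;br /&gt;
            label, values = s.split(&#039;:&#039;, 1)&lt;br /&gt;
            print(label, [float(v) for v in values.split()])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;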
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be anywhere in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really belongs to the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
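&lt;br /&gt;
: If the BEB table is long, you can of course also filter it with a small script. The sketch below is a rough filter only: it assumes the row layout shown above, starts reading at the &amp;quot;Bayes Empirical Bayes&amp;quot; header (so the NEB table is skipped), and treats any subsequent line starting with a site number as a table row.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# List BEB sites with posterior probability above 0.95 of belonging to&lt;br /&gt;
# the positively selected class. Rows look like:  25 A 0.959*  3.133 +- 0.769&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            fields = line.split()&lt;br /&gt;
            if len(fields) &amp;gt;= 3 and fields[0].isdigit():&lt;br /&gt;
                try:&lt;br /&gt;
                    prob = float(fields[2].rstrip(&#039;*&#039;))&lt;br /&gt;
                except ValueError:&lt;br /&gt;
                    continue&lt;br /&gt;
                if prob &amp;gt; 0.95:&lt;br /&gt;
                    print(fields[0], fields[1], prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;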
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=268</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=268"/>
		<updated>2026-04-15T08:13:51Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection and the fact that it is situated on the surface of the HIV particle, means it is an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click link named &amp;quot;here&amp;quot; to go to results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above, you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (Note: it is possible to create a maximum likelihood tree, or even a model-averaged tree, directly from jmodeltest2, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above, you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates found for the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP window, enter the following command:&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, and the parameter estimates found by modeltest for that model. You could also choose to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take at most a few minutes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection at a subset of codon positions, and the tree is just something we need in order to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed on this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the FigTree window). Close the FigTree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Which substitution model did you use for the PAUP tree search, and what was the negative log likelihood of the best tree found under this model?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection and tree reconstruction in one step using IQ-TREE 3 ==&lt;br /&gt;
&lt;br /&gt;
: In real-life phylogenetic analysis one would often use a modern program such as &#039;&#039;&#039;IQ-TREE 3&#039;&#039;&#039;, which can carry out model selection and maximum likelihood tree reconstruction in a single analysis.&lt;br /&gt;
&lt;br /&gt;
: Run IQ-TREE 3 on the alignment using the following command:&lt;br /&gt;
 iqtree3 -s gp120.nexus -m MFP -nt AUTO&lt;br /&gt;
&lt;br /&gt;
: When the run has finished, inspect the output written to the screen and to the file ending in &amp;quot;.iqtree&amp;quot;. Locate the substitution model selected by IQ-TREE 3 and the log likelihood of the best tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Which substitution model was selected by IQ-TREE 3, and what was the log likelihood of the best tree? Compare this to the earlier jModelTest2 + PAUP analysis. Is the selected model exactly the same? Is the likelihood exactly the same? Briefly suggest one reason why the two analyses might not give identical results.&lt;br /&gt;
&lt;br /&gt;
---&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods, is that the parameters of a model will have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data. This means you can use a stringent approach to determine which model fits the data best. In this framework one uses likelihoods (probabilities of data given model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models, Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dn/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class , while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes. Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dn/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 13&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=267</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=267"/>
		<updated>2026-04-15T07:01:03Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Construction of phylogenetic tree using PAUP */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that will subsequently be used to investigate whether you can detect such selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
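&lt;br /&gt;
: If you prefer to let a script do the arithmetic, the small Python sketch below implements the recipe above. The lnL and K values in it are made-up placeholders (they are not results from this exercise) - substitute the numbers you read off for your own models:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Placeholder lnL and K values - replace with your own numbers&lt;br /&gt;
models = {&lt;br /&gt;
    &#039;ModelA&#039;: {&#039;lnL&#039;: -2010.0, &#039;K&#039;: 5},&lt;br /&gt;
    &#039;ModelB&#039;: {&#039;lnL&#039;: -2005.0, &#039;K&#039;: 6},&lt;br /&gt;
    &#039;ModelC&#039;: {&#039;lnL&#039;: -2004.5, &#039;K&#039;: 7},&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# Step 2: AIC = -2 x lnL + 2K&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&#039;AIC&#039;] = -2 * m[&#039;lnL&#039;] + 2 * m[&#039;K&#039;]&lt;br /&gt;
&lt;br /&gt;
# Steps 3-5: AICmin, delta-AIC, numerator = exp(-0.5 x delta-AIC)&lt;br /&gt;
aic_min = min(m[&#039;AIC&#039;] for m in models.values())&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&#039;dAIC&#039;] = m[&#039;AIC&#039;] - aic_min&lt;br /&gt;
    m[&#039;num&#039;] = math.exp(-0.5 * m[&#039;dAIC&#039;])&lt;br /&gt;
total = sum(m[&#039;num&#039;] for m in models.values())&lt;br /&gt;
&lt;br /&gt;
# Step 6: model probabilities (Akaike weights)&lt;br /&gt;
for name, m in models.items():&lt;br /&gt;
    print(name, round(m[&#039;AIC&#039;], 2), round(m[&#039;dAIC&#039;], 2), round(m[&#039;num&#039;] / total, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: The weights printed in the last column should sum to 1 (up to rounding).&lt;br /&gt;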
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
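&lt;br /&gt;
: (The exercise does not prescribe a particular tool for this conversion - any alignment format converter will do. If you happen to have Biopython installed, one possible way is sketched below; this is just an illustration, not part of the official course toolkit):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
&lt;br /&gt;
# Read the RevTrans DNA alignment and write it out again in NEXUS format.&lt;br /&gt;
# The molecule_type annotation is needed by the NEXUS writer in recent Biopython versions.&lt;br /&gt;
aln = AlignIO.read(&#039;gp120align.fasta&#039;, &#039;fasta&#039;)&lt;br /&gt;
for rec in aln:&lt;br /&gt;
    rec.annotations[&#039;molecule_type&#039;] = &#039;DNA&#039;&lt;br /&gt;
AlignIO.write(aln, &#039;gp120.nexus&#039;, &#039;nexus&#039;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;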
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model that has the dN/dS ratio as one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
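&lt;br /&gt;
: (If you prefer the command line, the same lines can also be pulled out of the result file with a command along these lines, assuming the result file is named selection.results as above):&lt;br /&gt;
 grep &#039;lnL(&#039; selection.results&lt;br /&gt;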
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
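&lt;br /&gt;
: As a small worked example of the arithmetic (using only the illustrative M1 numbers shown earlier, not your actual results): with lnL = -4242.470345 and K = 74, AIC(M1) = -2 x (-4242.470345) + 2 x 74 = 8632.94. Compute AIC(M2) the same way from your own values, then get ΔAIC and the weights exactly as in the recipe (or with the Python sketch given there).&lt;br /&gt;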
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
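&lt;br /&gt;
: If you want to extract these sites programmatically rather than by eye, a minimal Python sketch along the following lines will print the BEB sites with posterior probability above 0.95 (assuming the result file is named selection.results and the table has the layout shown above):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&#039;selection.results&#039;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        if &#039;Bayes Empirical Bayes&#039; in line:&lt;br /&gt;
            in_beb = True   # start of the BEB section&lt;br /&gt;
        elif in_beb and &#039;+-&#039; in line:&lt;br /&gt;
            # data rows look like:  25 A 0.959*  3.133 +- 0.769&lt;br /&gt;
            m = re.match(r&#039;\s*(\d+)\s+(\w)\s+([\d.]+)\*?&#039;, line)&lt;br /&gt;
            if m and float(m.group(3)) &amp;gt; 0.95:&lt;br /&gt;
                print(m.group(1), m.group(2), m.group(3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;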
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=266</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=266"/>
		<updated>2026-04-15T06:56:05Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Model selection using AIC and jModelTest2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that will subsequently be used to investigate whether you can detect such selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model that has the dN/dS ratio as one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
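&lt;br /&gt;
: (Optional: if you would rather extract these numbers from the file programmatically than by eye, a minimal Python sketch along the lines shown below should work. It assumes the result file is named selection.results, as specified in codeml.ctl, and simply matches lines with the format shown above; one line should be printed per fitted model.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Matches lines like:  lnL(ntime: 72  np: 74):  -4242.470345&lt;br /&gt;
pattern = re.compile(r&amp;quot;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        m = pattern.search(line)&lt;br /&gt;
        if m:&lt;br /&gt;
            ntime, np, lnl = m.groups()&lt;br /&gt;
            print(&amp;quot;np (K) =&amp;quot;, np, &amp;quot;lnL =&amp;quot;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;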
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for each of the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
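&lt;br /&gt;
: (To illustrate the computation, here is a minimal Python sketch of that recipe for the two codon models. The M1 numbers are just the example values shown earlier on this page, and the M2 numbers are placeholders; substitute the lnL and np values you actually noted from selection.results.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Replace these with the lnL and np (K) values you noted for M1 and M2.&lt;br /&gt;
# The M1 numbers are the example values shown earlier; the M2 numbers are&lt;br /&gt;
# placeholders only (M2 has two extra free parameters compared to M1).&lt;br /&gt;
models = {&lt;br /&gt;
    &amp;quot;M1&amp;quot;: {&amp;quot;lnL&amp;quot;: -4242.470345, &amp;quot;K&amp;quot;: 74},&lt;br /&gt;
    &amp;quot;M2&amp;quot;: {&amp;quot;lnL&amp;quot;: -4200.0, &amp;quot;K&amp;quot;: 76},   # placeholder values&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
aic = {name: -2 * m[&amp;quot;lnL&amp;quot;] + 2 * m[&amp;quot;K&amp;quot;] for name, m in models.items()}&lt;br /&gt;
aic_min = min(aic.values())&lt;br /&gt;
delta = {name: a - aic_min for name, a in aic.items()}&lt;br /&gt;
numerator = {name: math.exp(-0.5 * d) for name, d in delta.items()}&lt;br /&gt;
total = sum(numerator.values())&lt;br /&gt;
&lt;br /&gt;
for name in models:&lt;br /&gt;
    print(name, &amp;quot;AIC =&amp;quot;, round(aic[name], 2),&lt;br /&gt;
          &amp;quot;dAIC =&amp;quot;, round(delta[name], 2),&lt;br /&gt;
          &amp;quot;weight =&amp;quot;, round(numerator[name] / total, 4))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;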
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, which matters especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences, you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
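&lt;br /&gt;
: (A rough Python sketch for pulling those sites out automatically is shown below; it assumes the BEB table in selection.results has the layout shown above, i.e. site number, amino acid, posterior probability (with a trailing * or ** for probabilities above 0.95 or 0.99), and then the mean and standard error of w.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Print BEB sites with posterior probability of positive selection above 0.95&lt;br /&gt;
# (codeml marks these with * or **).&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        if &amp;quot;Bayes Empirical Bayes&amp;quot; in line:&lt;br /&gt;
            in_beb = True                  # skip the earlier NEB table&lt;br /&gt;
        parts = line.split()&lt;br /&gt;
        # Table rows look like:  25 A 0.959*  3.133 +- 0.769&lt;br /&gt;
        if in_beb and len(parts) &amp;gt;= 3 and parts[0].isdigit() and parts[2].endswith(&amp;quot;*&amp;quot;):&lt;br /&gt;
            print(line.rstrip())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;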
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=265</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=265"/>
		<updated>2026-04-15T06:55:26Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, make it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
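&lt;br /&gt;
: (If you prefer to let a small script do the bookkeeping, the sketch below implements the recipe directly in Python. The (lnL, K) pairs are placeholders; replace them with the values you noted for the models you want to compare. The first pair is the worked example from step 2 above.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def akaike_weights(models):&lt;br /&gt;
    # models: dict mapping model name to (lnL, K); returns name -&amp;gt; weight&lt;br /&gt;
    aic = {name: -2 * lnl + 2 * k for name, (lnl, k) in models.items()}&lt;br /&gt;
    aic_min = min(aic.values())&lt;br /&gt;
    numerator = {name: math.exp(-0.5 * (a - aic_min)) for name, a in aic.items()}&lt;br /&gt;
    total = sum(numerator.values())&lt;br /&gt;
    return {name: numerator[name] / total for name in models}&lt;br /&gt;
&lt;br /&gt;
# Placeholder values: the first pair is the worked example from step 2,&lt;br /&gt;
# the other two are made up. Replace all of them with your own numbers.&lt;br /&gt;
example = {&amp;quot;modelA&amp;quot;: (-2010.0, 5), &amp;quot;modelB&amp;quot;: (-2008.0, 6), &amp;quot;modelC&amp;quot;: (-2007.5, 7)}&lt;br /&gt;
print(akaike_weights(example))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;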
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below: instead of /path/to/molevol, enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Protein-guided alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
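&lt;br /&gt;
: (The small Python sketch below is a toy illustration of this idea; it is not needed for the exercise. It threads the original, unaligned codons back onto an already-made protein alignment, so that gaps always appear in groups of three. The sequences in the example are made up.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Toy version of the back-translation idea: map each aligned amino acid&lt;br /&gt;
# (or gap) back to the corresponding codon (or a three-base gap).&lt;br /&gt;
def backtranslate(protein_alignment, dna):&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, i = [], 0&lt;br /&gt;
    for aa in protein_alignment:&lt;br /&gt;
        if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
            out.append(&amp;quot;---&amp;quot;)&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[i])&lt;br /&gt;
            i += 1&lt;br /&gt;
    return &amp;quot;&amp;quot;.join(out)&lt;br /&gt;
&lt;br /&gt;
# Made-up example: protein alignment row &amp;quot;M-KV&amp;quot; and its unaligned coding DNA&lt;br /&gt;
print(backtranslate(&amp;quot;M-KV&amp;quot;, &amp;quot;ATGAAAGTT&amp;quot;))   # prints ATG---AAAGTT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;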
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
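&lt;br /&gt;
: (How you do the conversion is up to you; many alignment editors and online converters can do it. If you have Biopython installed, a minimal sketch along the following lines should also work.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from Bio import AlignIO&lt;br /&gt;
&lt;br /&gt;
# Read the aligned FASTA file produced by RevTrans&lt;br /&gt;
alignment = AlignIO.read(&amp;quot;gp120align.fasta&amp;quot;, &amp;quot;fasta&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
# The NEXUS writer needs to know the molecule type of each sequence&lt;br /&gt;
for record in alignment:&lt;br /&gt;
    record.annotations[&amp;quot;molecule_type&amp;quot;] = &amp;quot;DNA&amp;quot;&lt;br /&gt;
&lt;br /&gt;
AlignIO.write(alignment, &amp;quot;gp120.nexus&amp;quot;, &amp;quot;nexus&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;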
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Model selection using AIC and jModelTest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. An initial tree is first constructed by neighbor joining (start=nj), and the heuristic search then proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (swap=tbr). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to determine which model fits the data best; as you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for each of the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, which matters especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences, you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=264</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=264"/>
		<updated>2026-04-15T06:36:14Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Detection of positively selected sites in gp120 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, make it an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below: instead of /path/to/molevol, enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
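&lt;br /&gt;
: If you prefer to check your hand calculation with a script, here is a minimal Python sketch of the same recipe. The lnL and K values below are placeholders only - replace them with the numbers you wrote down from the jmodeltest2 table (K is the parameter count &amp;quot;p&amp;quot;, which includes branch lengths). Note that the weights sum to 1 by construction, and that the same sketch can be reused later when comparing the codeml models M1 and M2.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# Placeholder values - replace with the lnL and parameter counts from the&lt;br /&gt;
# jmodeltest2 results table (the table lists -lnL, so lnL should be negative).&lt;br /&gt;
models = {&lt;br /&gt;
    &amp;quot;JC+I&amp;quot;:   {&amp;quot;lnL&amp;quot;: -9999.0, &amp;quot;K&amp;quot;: 100},&lt;br /&gt;
    &amp;quot;JC+G&amp;quot;:   {&amp;quot;lnL&amp;quot;: -9999.0, &amp;quot;K&amp;quot;: 100},&lt;br /&gt;
    &amp;quot;JC+I+G&amp;quot;: {&amp;quot;lnL&amp;quot;: -9999.0, &amp;quot;K&amp;quot;: 101},&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
# AIC = -2 * lnL + 2 * K&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&amp;quot;AIC&amp;quot;] = -2 * m[&amp;quot;lnL&amp;quot;] + 2 * m[&amp;quot;K&amp;quot;]&lt;br /&gt;
aic_min = min(m[&amp;quot;AIC&amp;quot;] for m in models.values())&lt;br /&gt;
&lt;br /&gt;
# Delta-AIC, numerator = exp(-0.5 * dAIC), and Akaike weights&lt;br /&gt;
for m in models.values():&lt;br /&gt;
    m[&amp;quot;dAIC&amp;quot;] = m[&amp;quot;AIC&amp;quot;] - aic_min&lt;br /&gt;
    m[&amp;quot;num&amp;quot;] = math.exp(-0.5 * m[&amp;quot;dAIC&amp;quot;])&lt;br /&gt;
total = sum(m[&amp;quot;num&amp;quot;] for m in models.values())&lt;br /&gt;
&lt;br /&gt;
for name, m in models.items():&lt;br /&gt;
    weight = m[&amp;quot;num&amp;quot;] / total&lt;br /&gt;
    print(name, round(m[&amp;quot;AIC&amp;quot;], 2), round(m[&amp;quot;dAIC&amp;quot;], 2), round(weight, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;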
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using both the model selected by modeltest and the parameter estimates found by modeltest for that model. You could also have chosen to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
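&lt;br /&gt;
: If you would rather extract these numbers with a script than by searching in the editor, the following minimal Python sketch should work; it simply scans selection.results for the lnL lines, using a regular expression based on the line format shown above. The lnL lines appear in the order the models were fitted, so here the first match corresponds to M1 and the second to M2.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# Matches lines like:  lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
pattern = re.compile(r&amp;quot;lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        hit = pattern.search(line)&lt;br /&gt;
        if hit:&lt;br /&gt;
            ntime, nparams, lnl = int(hit.group(1)), int(hit.group(2)), float(hit.group(3))&lt;br /&gt;
            print(&amp;quot;ntime =&amp;quot;, ntime, &amp;quot; K (np) =&amp;quot;, nparams, &amp;quot; lnL =&amp;quot;, lnl)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;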
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
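&lt;br /&gt;
: For the question below you can simply read off the sites marked with an asterisk in the BEB table, but you can also extract them with a small script. Here is a minimal Python sketch, assuming the result file is called selection.results and that the BEB table has the format shown above (site number, amino acid, and posterior probability as the first three fields on each line):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
# BEB table lines look like:  &amp;quot;    25 A 0.959*        3.133 +- 0.769&amp;quot;&lt;br /&gt;
site_line = re.compile(r&amp;quot;^\s*(\d+)\s+(\S)\s+([0-9.]+)\*?\s&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
in_beb = False&lt;br /&gt;
with open(&amp;quot;selection.results&amp;quot;) as f:&lt;br /&gt;
    for line in f:&lt;br /&gt;
        if &amp;quot;Bayes Empirical Bayes&amp;quot; in line:&lt;br /&gt;
            in_beb = True   # only read sites from the BEB table, not the NEB table&lt;br /&gt;
            continue&lt;br /&gt;
        if in_beb:&lt;br /&gt;
            hit = site_line.match(line)&lt;br /&gt;
            if hit and float(hit.group(3)) &amp;gt; 0.95:&lt;br /&gt;
                print(hit.group(1), hit.group(2), hit.group(3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;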
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=263</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=263"/>
		<updated>2026-04-15T06:35:40Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Construction of phylogenetic tree using PAUP */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, and note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click link named &amp;quot;here&amp;quot; to go to results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor-joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that account for rate variation among sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models is best supported as a description of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using both the model selected by modeltest and the parameter estimates found by modeltest for that model. You could also have chosen to estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One useful feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=262</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=262"/>
		<updated>2026-04-15T06:34:30Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Selection of substitution model using jmodeltest2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. This means that there may be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, and note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
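: If you prefer to script the calculation, the small Python sketch below implements the recipe above step by step (the function name aic_weights is just an illustration, not something provided with the course; an example of calling it is given further down, where you are asked to do the computation by hand):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def aic_weights(models):&lt;br /&gt;
    # models is a list of (name, lnL, K) tuples, one per fitted model&lt;br /&gt;
    aic = {name: -2 * lnL + 2 * K for (name, lnL, K) in models}        # step 2&lt;br /&gt;
    aic_min = min(aic.values())                                        # step 3&lt;br /&gt;
    delta = {name: a - aic_min for (name, a) in aic.items()}           # step 4&lt;br /&gt;
    numer = {name: math.exp(-0.5 * d) for (name, d) in delta.items()}  # step 5&lt;br /&gt;
    total = sum(numer.values())&lt;br /&gt;
    # step 6: model probability (Akaike weight) for each model&lt;br /&gt;
    return {name: (aic[name], delta[name], numer[name] / total) for name in aic}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;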
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
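: To make the principle concrete, here is a minimal Python sketch of the back-translation step (an illustration of the idea only, not the actual RevTrans implementation):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(dna, aligned_protein):&lt;br /&gt;
    # dna: unaligned coding sequence; aligned_protein: the same sequence&lt;br /&gt;
    # as amino acids, aligned (i.e. possibly containing gaps)&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    result = []&lt;br /&gt;
    pos = 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
            result.append(&amp;quot;---&amp;quot;)       # gaps are always inserted three nucleotides at a time&lt;br /&gt;
        else:&lt;br /&gt;
            result.append(codons[pos])  # each amino acid is replaced by its original codon&lt;br /&gt;
            pos += 1&lt;br /&gt;
    return &amp;quot;&amp;quot;.join(result)&lt;br /&gt;
&lt;br /&gt;
print(backtranslate(&amp;quot;ATGAAATGC&amp;quot;, &amp;quot;MK-C&amp;quot;))   # prints ATGAAA---TGC&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;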
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in selecting a model that achieves a good balance between fit to the data and number of parameters. We will investigate this issue by fitting a set of 56 different models to our data and then comparing them using AIC.&lt;br /&gt;
&lt;br /&gt;
: In the first part of the exercise, however, you will only compare three closely related models by hand. The purpose of this restricted comparison is not primarily to identify the globally best model, but to understand how AIC, ΔAIC, and Akaike weights are computed and interpreted. Because only three models are included in this manual comparison, the weights you compute will describe relative support &#039;&#039;&#039;within this restricted set only&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute Akaike weights for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model comparison to investigate which of the following three substitution models are best supported as descriptions of how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
: Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Use the recipe above to compute AIC, ΔAIC, and Akaike weight for the three models. Report the results in a table similar to the one shown above, and verify that the three weights sum to 1 (up to rounding error).&lt;br /&gt;
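: As a purely illustrative check of the procedure, the aic_weights sketch from the recipe section could be called like this (the lnL and K numbers below are made up and are not values from your jmodeltest2 run):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Made-up numbers for illustration only - use the values from your own results table&lt;br /&gt;
example = [(&amp;quot;JC+I&amp;quot;, -5110.0, 70), (&amp;quot;JC+G&amp;quot;, -5105.0, 70), (&amp;quot;JC+I+G&amp;quot;, -5104.5, 71)]&lt;br /&gt;
for name, (aic, delta, weight) in aic_weights(example).items():&lt;br /&gt;
    print(name, round(aic, 1), round(delta, 1), round(weight, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: With these made-up numbers JC+G would get a weight of roughly 0.62, JC+I+G roughly 0.38, and JC+I essentially zero, and the three weights sum to 1; your own values will of course differ.&lt;br /&gt;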
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Based on the Akaike weights, which of the three models has the strongest support? Is the support strongly concentrated on one model, or is there still substantial support for one or both of the others?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC values and Akaike weights for the full candidate set:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt;&lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In the same AIC results table, now click (not SHIFT+click) the header of the &amp;quot;-lnL&amp;quot; column. In this table, a normal click sorts in ascending order, whereas SHIFT+click sorts in descending order. The model at the top will therefore be the one with the smallest value of &amp;quot;-lnL&amp;quot; (equivalently: the highest likelihood). Compare this model to the one with the highest Akaike weight.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Is the model with the highest likelihood also the model with the highest Akaike weight? If not, explain briefly why these two rankings can differ.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model will have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data: in this framework one uses likelihoods (probabilities of data given model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
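: For reference, the relevant lines of the control file typically look roughly like this (a sketch only - the codeml.ctl you copied contains additional options that should be left unchanged, and the outfile name is inferred from the result file inspected below):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      seqfile = gp120align.fasta    * codon alignment from RevTrans&lt;br /&gt;
     treefile = gp120tree.phy       * ML tree from PAUP&lt;br /&gt;
      outfile = selection.results   * main result file (inspected below)&lt;br /&gt;
      seqtype = 1                   * data are codon sequences&lt;br /&gt;
      NSsites = 1 2                 * fit site models M1 and M2&lt;br /&gt;
    cleandata = 0                   * keep alignment columns containing gaps&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;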
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. The question is whether the improvement in fit is large enough to justify the extra parameters: since AIC = -2 x lnL + 2K, M2 obtains the lower AIC exactly when its lnL exceeds that of M1 by more than the difference in K (compare the np values in your output; for this pair of models the difference is typically 2). You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 12&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly NEB ignores the fact that there is uncertainty about  maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed is the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of the w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
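: If the BEB table is long, a small script along the following lines can pull out the sites above the cutoff (a convenience sketch only; it assumes you have copied the BEB lines into a plain text file, here hypothetically called beb.txt):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Print sites whose posterior probability of w &amp;gt; 1 exceeds 0.95&lt;br /&gt;
with open(&amp;quot;beb.txt&amp;quot;) as handle:&lt;br /&gt;
    for line in handle:&lt;br /&gt;
        parts = line.split()&lt;br /&gt;
        if len(parts) &amp;lt; 3:&lt;br /&gt;
            continue                    # skip blank lines and short header lines&lt;br /&gt;
        site, aa, prob = parts[0], parts[1], parts[2].rstrip(&amp;quot;*&amp;quot;)&lt;br /&gt;
        if prob.replace(&amp;quot;.&amp;quot;, &amp;quot;&amp;quot;).isdigit() and float(prob) &amp;gt; 0.95:&lt;br /&gt;
            print(site, aa, prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;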
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=261</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=261"/>
		<updated>2026-04-15T05:52:04Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Recipe for computing AIC values and model probabilities */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. As a consequence, there may be considerable selective pressure on gp120 to create immune-escape mutants, in which amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether you can detect such selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as:&amp;lt;br&amp;gt; &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute model probabilities for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model probabilities to investigate which of the following three substitution models are best at describing how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039; Based on the model probabilities: which model has the most support?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt; &lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model will have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data: in this framework one uses likelihoods (probabilities of data given model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=260</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=260"/>
		<updated>2026-04-15T05:51:20Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Recipe for computing AIC values and model probabilities */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities (also called model weights) for each model are found as: &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
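&lt;br /&gt;
: If you prefer, the same bookkeeping can be done in a few lines of Python. The sketch below is only meant as an illustration of the recipe above; the lnL and K values are made-up placeholders, not results from this exercise:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from math import exp&lt;br /&gt;
&lt;br /&gt;
# Hypothetical (lnL, K) values - replace with the numbers you noted down&lt;br /&gt;
models = {&amp;quot;modelA&amp;quot;: (-2010.0, 5), &amp;quot;modelB&amp;quot;: (-2008.0, 6), &amp;quot;modelC&amp;quot;: (-2007.5, 7)}&lt;br /&gt;
&lt;br /&gt;
aic = {name: -2 * lnL + 2 * K for name, (lnL, K) in models.items()}&lt;br /&gt;
aic_min = min(aic.values())                                # best (smallest) AIC&lt;br /&gt;
num = {name: exp(-0.5 * (a - aic_min)) for name, a in aic.items()}&lt;br /&gt;
total = sum(num.values())&lt;br /&gt;
for name in models:&lt;br /&gt;
    print(name, aic[name], aic[name] - aic_min, num[name] / total)   # AIC, dAIC, weight&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;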
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
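&lt;br /&gt;
: To make the back-translation idea concrete, here is a small Python sketch of the principle (this is not the RevTrans code itself, just an illustration): each aligned amino acid is replaced by its original codon, and each gap by a gap triplet, so codon boundaries are preserved.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def backtranslate(aligned_protein, unaligned_dna):&lt;br /&gt;
    # Replace each residue in the aligned protein by its original codon;&lt;br /&gt;
    # gaps become gap triplets so the result stays in frame.&lt;br /&gt;
    codons = [unaligned_dna[i:i + 3] for i in range(0, len(unaligned_dna), 3)]&lt;br /&gt;
    result, k = [], 0&lt;br /&gt;
    for aa in aligned_protein:&lt;br /&gt;
        if aa == &amp;quot;-&amp;quot;:&lt;br /&gt;
            result.append(&amp;quot;---&amp;quot;)&lt;br /&gt;
        else:&lt;br /&gt;
            result.append(codons[k])&lt;br /&gt;
            k += 1&lt;br /&gt;
    return &amp;quot;&amp;quot;.join(result)&lt;br /&gt;
&lt;br /&gt;
print(backtranslate(&amp;quot;M-KV&amp;quot;, &amp;quot;ATGAAAGTT&amp;quot;))   # prints ATG---AAAGTT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;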
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus. You can use any alignment editor or converter you like; a minimal script along the lines sketched below will also do.&lt;br /&gt;
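&lt;br /&gt;
: The sketch below is only a minimal illustration (it assumes an aligned FASTA file where all sequences have the same length, and writes a bare-bones NEXUS data block); dedicated conversion tools will handle more cases:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fasta_to_nexus(fasta_path, nexus_path):&lt;br /&gt;
    # Read the aligned FASTA file into a name -&amp;gt; sequence dictionary&lt;br /&gt;
    seqs, name = {}, None&lt;br /&gt;
    with open(fasta_path) as infile:&lt;br /&gt;
        for line in infile:&lt;br /&gt;
            line = line.strip()&lt;br /&gt;
            if line.startswith(&amp;quot;&amp;gt;&amp;quot;):&lt;br /&gt;
                name = line[1:].split()[0]&lt;br /&gt;
                seqs[name] = []&lt;br /&gt;
            elif name is not None:&lt;br /&gt;
                seqs[name].append(line)&lt;br /&gt;
    seqs = {n: &amp;quot;&amp;quot;.join(parts) for n, parts in seqs.items()}&lt;br /&gt;
&lt;br /&gt;
    # Write a minimal NEXUS data block&lt;br /&gt;
    ntax, nchar = len(seqs), len(next(iter(seqs.values())))&lt;br /&gt;
    with open(nexus_path, &amp;quot;w&amp;quot;) as out:&lt;br /&gt;
        out.write(&amp;quot;#NEXUS\nbegin data;\n&amp;quot;)&lt;br /&gt;
        out.write(f&amp;quot;  dimensions ntax={ntax} nchar={nchar};\n&amp;quot;)&lt;br /&gt;
        out.write(&amp;quot;  format datatype=dna missing=? gap=-;\n  matrix\n&amp;quot;)&lt;br /&gt;
        for n, seq in seqs.items():&lt;br /&gt;
            out.write(f&amp;quot;  {n}  {seq}\n&amp;quot;)&lt;br /&gt;
        out.write(&amp;quot;  ;\nend;\n&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
fasta_to_nexus(&amp;quot;gp120align.fasta&amp;quot;, &amp;quot;gp120.nexus&amp;quot;)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;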
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute model probabilities for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model probabilities to investigate which of the following three substitution models is best at describing how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039; Based on the model probabilities: which model has the most support?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt; &lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree in PAUP. (Note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
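&lt;br /&gt;
: (Optional: if you prefer to extract these numbers programmatically rather than by eye, a small Python snippet along the following lines can pull ntime, np and lnL out of such a line; the example line is the one shown above.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
line = &amp;quot;lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&amp;quot;&lt;br /&gt;
m = re.search(r&amp;quot;ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)&amp;quot;, line)&lt;br /&gt;
ntime, K, lnL = int(m.group(1)), int(m.group(2)), float(m.group(3))&lt;br /&gt;
print(ntime, K, lnL)   # 72 74 -4242.470345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;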
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of all sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
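&lt;br /&gt;
: (Optional: for a long BEB table you can also filter the sites programmatically. The sketch below assumes the table rows have been pasted into the script as shown; the values are the example values from above.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
beb_rows = [&lt;br /&gt;
    &amp;quot;    25 A 0.959*        3.133 +- 0.769&amp;quot;,&lt;br /&gt;
    &amp;quot;    27 P 0.906         2.990 +- 0.877&amp;quot;,&lt;br /&gt;
    &amp;quot;    56 K 0.987*        3.197 +- 0.687&amp;quot;,&lt;br /&gt;
]&lt;br /&gt;
&lt;br /&gt;
for row in beb_rows:&lt;br /&gt;
    site, residue, prob = row.split()[:3]&lt;br /&gt;
    if float(prob.rstrip(&amp;quot;*&amp;quot;)) &amp;gt; 0.95:   # keep sites with P(w&amp;gt;1) above 0.95&lt;br /&gt;
        print(site, residue, prob)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;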
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=259</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=259"/>
		<updated>2026-04-15T05:49:12Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Selection of substitution model using jmodeltest2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 to create immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities for each model are found as: &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download the DNA alignment by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot; and choosing &amp;quot;Save link as...&amp;quot; (save the file under the name gp120align.fasta, and make sure to save it in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the FASTA file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute model probabilities for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model probabilities to investigate which of the following three substitution models is best at describing how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039; Based on the model probabilities: which model has the most support?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt; &lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree in PAUP. (Note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well each model fits the data: the likelihood (the probability of the data given the model). As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
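&lt;br /&gt;
: (If you prefer to extract these numbers with a small script rather than by eye, the following minimal Python sketch parses a line with exactly the format shown above. It is only an illustration and not a required part of the exercise.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: extract K (np) and lnL from a codeml result line&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
line = &#039;lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&#039;&lt;br /&gt;
match = re.search(r&#039;np:\s*(\d+)\):\s*(-?\d+\.\d+)&#039;, line)&lt;br /&gt;
K = int(match.group(1))&lt;br /&gt;
lnL = float(match.group(2))&lt;br /&gt;
print(K, lnL)   # 74 -4242.470345&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;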
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
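&lt;br /&gt;
: To see how the numbers combine, here is a minimal Python sketch of the M1 vs M2 comparison. The M1 values are taken from the example line shown earlier; the M2 values are made-up placeholders, so substitute the lnL and K (np) values from your own run:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# lnL and K (np) for M1 and M2 - replace with your own numbers&lt;br /&gt;
m1_lnl, m1_k = -4242.470345, 74&lt;br /&gt;
m2_lnl, m2_k = -4230.0, 76      # placeholder values, not real results&lt;br /&gt;
&lt;br /&gt;
aic1 = -2 * m1_lnl + 2 * m1_k&lt;br /&gt;
aic2 = -2 * m2_lnl + 2 * m2_k&lt;br /&gt;
aic_min = min(aic1, aic2)&lt;br /&gt;
num1 = math.exp(-0.5 * (aic1 - aic_min))&lt;br /&gt;
num2 = math.exp(-0.5 * (aic2 - aic_min))&lt;br /&gt;
total = num1 + num2&lt;br /&gt;
print(&#039;M1:&#039;, round(aic1, 2), round(aic1 - aic_min, 2), round(num1 / total, 4))&lt;br /&gt;
print(&#039;M2:&#039;, round(aic2, 2), round(aic2 - aic_min, 2), round(num2 / total, 4))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;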
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: The distinction is not important in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: The table lists the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
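&lt;br /&gt;
: (If the table is long, a few lines of Python can pick out the sites for you. This assumes you have copied the BEB table rows into a plain text file - the file name beb_sites.txt is just an example.)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: print BEB sites with Prob(w&amp;gt;1) above 0.95&lt;br /&gt;
for row in open(&#039;beb_sites.txt&#039;):&lt;br /&gt;
    fields = row.split()&lt;br /&gt;
    if len(fields) &amp;gt;= 4 and float(fields[2].rstrip(&#039;*&#039;)) &amp;gt; 0.95:&lt;br /&gt;
        print(fields[0], fields[1], fields[2])&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;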
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=258</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=258"/>
		<updated>2026-04-15T05:47:22Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Selection of substitution model using jmodeltest2 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities for each model are found as: &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
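&lt;br /&gt;
: For reference, the whole recipe can also be written as a few lines of Python. The lnL and K values below are placeholders (they are not results from this exercise); substitute the numbers you noted down for your own models:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
# name: (lnL, K) - placeholder values only&lt;br /&gt;
models = {&#039;modelA&#039;: (-2010.0, 5), &#039;modelB&#039;: (-2008.3, 6), &#039;modelC&#039;: (-2011.9, 4)}&lt;br /&gt;
&lt;br /&gt;
aic = {m: -2 * lnl + 2 * k for m, (lnl, k) in models.items()}   # step 2&lt;br /&gt;
aic_min = min(aic.values())                                     # step 3&lt;br /&gt;
delta = {m: a - aic_min for m, a in aic.items()}                # step 4&lt;br /&gt;
num = {m: math.exp(-0.5 * d) for m, d in delta.items()}         # step 5&lt;br /&gt;
total = sum(num.values())&lt;br /&gt;
for m in models:                                                # step 6&lt;br /&gt;
    print(m, round(aic[m], 2), round(delta[m], 2), round(num[m] / total, 3))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;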
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
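&lt;br /&gt;
: The back-translation step itself is conceptually simple. The sketch below is a minimal Python illustration of the idea (it is not the actual RevTrans code, and the sequences are toy examples): each residue in the aligned protein is replaced by its original codon, and each gap by three gap characters.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch of the back-translation idea&lt;br /&gt;
def back_translate(dna, aligned_protein):&lt;br /&gt;
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]&lt;br /&gt;
    out, pos = [], 0&lt;br /&gt;
    for residue in aligned_protein:&lt;br /&gt;
        if residue == &#039;-&#039;:&lt;br /&gt;
            out.append(&#039;---&#039;)      # gaps are inserted in groups of three&lt;br /&gt;
        else:&lt;br /&gt;
            out.append(codons[pos])&lt;br /&gt;
            pos += 1&lt;br /&gt;
    return &#039;&#039;.join(out)&lt;br /&gt;
&lt;br /&gt;
print(back_translate(&#039;ATGAAACGT&#039;, &#039;M-KR&#039;))   # gives ATG---AAACGT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;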
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result, manually check model probabilities for three models&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot; - note this includes branch lengths), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute model probabilities for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model probabilities to investigate which of the following three substitution models are best at describing how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039; Based on the model probabilities: which model has the most support?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt; &lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model have their values estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn about features of the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a quantitative measure of how well any model fits the data: the likelihood (the probability of the data given the model). This means you can use a stringent statistical approach to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use such a statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% belong to the first class, while 25% of all sites belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: The distinction is not important in this context, but very briefly: NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance, w for some codon is perhaps not exactly 3.046, but could lie in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: The table lists the residues (if any) that were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=257</id>
		<title>Model selection</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Model_selection&amp;diff=257"/>
		<updated>2026-04-15T05:31:28Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Analysis of viral data set: alignment of coding DNA */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).&lt;br /&gt;
&lt;br /&gt;
: The role gp120 plays in infection, and the fact that it is situated on the surface of the HIV particle, mean that it is an obvious target for the immune response. There may therefore be considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether such a selective pressure can be detected on parts of gp120, again using maximum likelihood methods.&lt;br /&gt;
&lt;br /&gt;
: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.&lt;br /&gt;
&lt;br /&gt;
: Specifically, you will&lt;br /&gt;
&lt;br /&gt;
:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).&lt;br /&gt;
:# select a suitable nucleotide substitution model (using jmodeltest2)&lt;br /&gt;
:# construct a phylogenetic tree (using PAUP).&lt;br /&gt;
:# try to detect positively selected sites in gp120 (using PAML).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Recipe for computing AIC values and model probabilities ==&lt;br /&gt;
&lt;br /&gt;
: Later in today&#039;s exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.&lt;br /&gt;
&lt;br /&gt;
:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.&lt;br /&gt;
:# Compute AIC for each of the models: &#039;&#039;&#039;AIC = -2 x lnL + 2K&#039;&#039;&#039;.  &amp;lt;br&amp;gt;For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.&lt;br /&gt;
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model &#039;&#039;&#039;&amp;quot;AICmin&amp;quot;&#039;&#039;&#039;.&lt;br /&gt;
:# Compute the &amp;quot;ΔAIC&amp;quot; values for each model: &#039;&#039;&#039;ΔAIC = AIC - AICmin&#039;&#039;&#039; &amp;lt;br&amp;gt;For each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.&lt;br /&gt;
:# For each model compute the following quantity: &#039;&#039;&#039;numerator = exp(-0.5 x ΔAIC)&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the &#039;&#039;&#039;sum&#039;&#039;&#039; of the numerator values for all models.&lt;br /&gt;
:# Finally, the model probabilities for each model are found as: &#039;&#039;&#039;P(model) = numerator / sum&#039;&#039;&#039; &amp;lt;br&amp;gt;For example, if sum = 3.75 and a model has numerator = 1.3, then it has P(model) = 1.3 / 3.75 = 0.35&lt;br /&gt;
&lt;br /&gt;
: You may want to keep track of the computations by constructing a table along the following lines:&lt;br /&gt;
[[File:Molevol-Downloads-aictable.png|700px]]&lt;br /&gt;
&lt;br /&gt;
: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Create working directory, copy files&#039;&#039;&#039;&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir modelselect&lt;br /&gt;
 cd modelselect&lt;br /&gt;
 cp ../data/gp120.fasta ./gp120.fasta&lt;br /&gt;
 cp ../data/codeml.ctl ./codeml.ctl&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the DNA data file:&#039;&#039;&#039;&lt;br /&gt;
 nedit gp120.fasta &amp;amp;&lt;br /&gt;
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you&#039;ve had a look.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of viral data set: alignment of coding DNA ==&lt;br /&gt;
&lt;br /&gt;
: There are three reasons why it is preferable to align coding DNA at the amino acid level rather than directly as nucleotides. First, the DNA alphabet has only four letters compared to 20 amino acids, which means that even unrelated DNA sequences will share roughly 25% identity by chance alone, making it much harder to distinguish true homology from noise. Second, protein alignments benefit from empirical substitution matrices such as BLOSUM-62, which capture the fact that amino acid replacements tend to be conservative. Equivalent matrices for DNA are far less informative because nucleotide substitution patterns vary strongly between genes and organisms. Third, synonymous substitutions accumulate faster than non-synonymous ones, so DNA sequences diverge more rapidly than the proteins they encode. Taken together, these factors mean that the phylogenetic signal in a DNA alignment erodes much faster than in the corresponding protein alignment.&lt;br /&gt;
&lt;br /&gt;
: However, while protein-level alignment gives us the most reliable homology signal, it discards exactly the information we need for studying selection: the synonymous substitutions. Estimating the ratio of non-synonymous to synonymous substitution rates (dN/dS) requires a codon-level alignment where analogous codon positions are properly lined up. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;RevTrans&#039;&#039;&#039; takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment. Because each amino acid maps back to its original codon, gaps are always inserted in groups of three and codon boundaries are preserved throughout the alignment. This is essential for the dN/dS analyses we will carry out in this exercise: if codon positions are misaligned, synonymous and non-synonymous changes cannot be correctly distinguished.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Construct RevTrans alignment&#039;&#039;&#039;&lt;br /&gt;
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/&lt;br /&gt;
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)&lt;br /&gt;
:* Click the &amp;quot;Submit query&amp;quot; button&lt;br /&gt;
:* When the alignment is done you may have to click the link named &amp;quot;here&amp;quot; to go to the results page&lt;br /&gt;
:* Download DNA alignment, by right-clicking the link for &amp;quot;Download alignment in FASTA format&amp;quot;, and choosing &amp;quot;Save link as...&amp;quot; (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Convert alignment to NEXUS format&#039;&#039;&#039;&lt;br /&gt;
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Selection of substitution model using jmodeltest2 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start jmodeltest2&#039;&#039;&#039;&lt;br /&gt;
 jmodeltest&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
:* File -&amp;gt; Load DNA alignment -&amp;gt; File Format -&amp;gt; Select &amp;quot;All files&amp;quot;&lt;br /&gt;
:* Navigate to gp120.nexus and load it&lt;br /&gt;
&#039;&#039;&#039;Fit 56 models&#039;&#039;&#039;&lt;br /&gt;
:* Analysis -&amp;gt; Compute likelihood scores&lt;br /&gt;
:* Select &amp;quot;7&amp;quot; under &amp;quot;Number of substitution schemes&amp;quot;&lt;br /&gt;
:* Select &amp;quot;Fixed BIONJ-JC&amp;quot; under &amp;quot;Base tree for likelihood calculations&amp;quot;&lt;br /&gt;
:* Click &amp;quot;Compute likelihoods&amp;quot;&lt;br /&gt;
&lt;br /&gt;
: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates (&amp;quot;+G&amp;quot;), and (2) by allowing for a proportion of constant (&amp;quot;invariable&amp;quot;) sites (&amp;quot;+I&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result, manually check model probabilities for three models&#039;&#039;&#039;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
: For each model this table lists the negative log-likelihood (&amp;quot;-lnL&amp;quot;), the number of parameters (&amp;quot;p&amp;quot;), and estimates of all model parameters (excluding branch lengths).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Manually compute model probabilities for three substitution models&#039;&#039;&#039;&lt;br /&gt;
: Use AIC-based model probabilities to investigate which of the following three substitution models are best at describing how the sequences have evolved:&lt;br /&gt;
:* Jukes and Cantor with fraction of invariant sites (JC+I)&lt;br /&gt;
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)&lt;br /&gt;
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)&lt;br /&gt;
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.&lt;br /&gt;
&lt;br /&gt;
:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039; Based on the model probabilities: which model has the most support?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&#039;&#039;&#039;Use modeltest program to select best model&#039;&#039;&#039;&lt;br /&gt;
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:&lt;br /&gt;
:* Analysis -&amp;gt; Do AIC calculations -&amp;gt; &lt;br /&gt;
:* Select &amp;quot;Write PAUP* block&amp;quot;&lt;br /&gt;
:* click &amp;quot;Do AIC calculations&amp;quot;&lt;br /&gt;
:* Results -&amp;gt; Show results table&lt;br /&gt;
:* Select &amp;quot;AIC&amp;quot; tab&lt;br /&gt;
:* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; SHIFT+click on the header of the &amp;quot;weight&amp;quot; column. This sorts the rows according to model weight, in descending order.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What model was selected by modeltest based on the AIC values? (This will be the model with the highest weight - and lowest AIC - and will be in the first row of the results table after SHIFT-clicking the &amp;quot;weight&amp;quot; header).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Construction of phylogenetic tree using PAUP ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between &amp;quot;BEGIN PAUP&amp;quot; and &amp;quot;END;&amp;quot; and should look something like this:&lt;br /&gt;
 Lset Base=(0.4064, [...]&lt;br /&gt;
: You will need to copy this command to a PAUP session in the next step.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start PAUP&#039;&#039;&#039;&lt;br /&gt;
 paup&lt;br /&gt;
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree, using PAUP for this purpose. (Note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to see each step more clearly.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load alignment:&#039;&#039;&#039;&lt;br /&gt;
 execute gp120.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set tree-building criterion to maximum likelihood&#039;&#039;&#039;&lt;br /&gt;
 set criterion=likelihood&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Set model parameters to winning estimates&#039;&#039;&#039;&lt;br /&gt;
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.&lt;br /&gt;
&lt;br /&gt;
 PASTE LSET COMMAND FROM MODELTEST RUN HERE&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find best tree using selected model&#039;&#039;&#039;&lt;br /&gt;
: Still in the PAUP-window, enter the following command&lt;br /&gt;
 hsearch swap=tbr start=nj&lt;br /&gt;
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the &amp;quot;tree bisection and reconnection&amp;quot; type (TBR). We are using both the model selected by modeltest and the parameter estimates found by modeltest for that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes at most.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Save best tree to file&#039;&#039;&#039;&lt;br /&gt;
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Quit program&#039;&#039;&#039;&lt;br /&gt;
 quit&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the tree:&#039;&#039;&#039;&lt;br /&gt;
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:&lt;br /&gt;
 figtree gp120tree.phy &amp;amp;&lt;br /&gt;
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the negative log likelihood of the tree you just found?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Detection of positively selected sites in gp120 ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of the model are estimated as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to inspect the estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.&lt;br /&gt;
&lt;br /&gt;
: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data: the likelihoods (probabilities of the data given the models) can be used to determine which model fits best. As you saw above, it is for instance possible to compute AIC values and weights (model probabilities) from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.&lt;br /&gt;
&lt;br /&gt;
: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:&lt;br /&gt;
&lt;br /&gt;
:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS &amp;lt; 1, some with dN/dS=1.&lt;br /&gt;
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS &amp;gt; 1.&lt;br /&gt;
:* Assess the strength of evidence for the two models using AIC-based model probabilities&lt;br /&gt;
:* If M2 is better: identify the positively selected codons&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect the parameter file&#039;&#039;&#039;&lt;br /&gt;
 nedit codeml.ctl &amp;amp;&lt;br /&gt;
: The file &amp;quot;codeml.ctl&amp;quot; contains several settings that are relevant for running the program &#039;&#039;&#039;codeml&#039;&#039;&#039;. Find the following lines and ensure that the file contains these values:&lt;br /&gt;
 &#039;&#039;&#039;seqfile =  gp120align.fasta&#039;&#039;&#039;:  name of alignment file&lt;br /&gt;
 &#039;&#039;&#039;treefile =  gp120tree.phy&#039;&#039;&#039;: name of tree file&lt;br /&gt;
 &#039;&#039;&#039;seqtype = 1&#039;&#039;&#039;: tells the program that our data consists of coding DNA.&lt;br /&gt;
 &#039;&#039;&#039;NSsites = 1 2&#039;&#039;&#039; : tells the program to analyze models M1 and M2.&lt;br /&gt;
 &#039;&#039;&#039;cleandata = 0&#039;&#039;&#039;: tells the program to keep positions with gaps.&lt;br /&gt;
&lt;br /&gt;
: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS &amp;lt; 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS &amp;lt; 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS &amp;gt; 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS &amp;lt; 1 or dN/dS &amp;gt; 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.&lt;br /&gt;
&lt;br /&gt;
: (Regarding &amp;quot;cleandata&amp;quot; setting: &amp;quot;cleandata = 1&amp;quot; would cause codeml to discard any alignment columns with gaps or ambiguity symbols. For alignments with many sequences this can lead to discarding the majority of sites and I would therefore recommend not using that setting. At the same time you do want to be wary of columns where most sequences have gaps, since inference of selection for these will be very uncertain. It might be better to discard such columns from an alignment before doing the codeml analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start the analysis&#039;&#039;&#039;&lt;br /&gt;
 codeml&lt;br /&gt;
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect result file:&#039;&#039;&#039;&lt;br /&gt;
: Wait for the run to finish, and then look at the result file:&lt;br /&gt;
 nedit selection.results &amp;amp;&lt;br /&gt;
: This file contains a wealth of information concerning your analysis. You may want to turn off line-wrapping to more clearly see the structure of the file (there will be some long lines that will otherwise wrap around). The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and number of free parameters for model M1&#039;&#039;&#039;&lt;br /&gt;
 Search ==&amp;gt; Find... ==&amp;gt; enter &amp;quot;Model 1&amp;quot; and click Find&lt;br /&gt;
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:&lt;br /&gt;
 lnL(ntime: 72  np: 74):  -4242.470345     +0.000000&lt;br /&gt;
: Identify the number of &amp;quot;free parameters&amp;quot;, K, used in model M1: This is indicated by &amp;quot;np&amp;quot;, and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by &amp;quot;ntime&amp;quot;, and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M1:&#039;&#039;&#039;&lt;br /&gt;
: Scroll down a few lines until you get to a small table similar to this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      dN/dS for site classes (K=2)&lt;br /&gt;
      p:   0.75111  0.24889&lt;br /&gt;
      w:   0.06583  1.00000&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: This gives a summary of the dN/dS ratios that were found in the data set. The line starting &amp;quot;w:&amp;quot; lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000 - the last one was pre-specified by us as part of the model and was therefore not a free parameter). In this context &amp;quot;w&amp;quot; means &amp;quot;omega&amp;quot;, which is the symbol typically used to represent the dN/dS rate ratio. The line starting &amp;quot;p:&amp;quot; gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of all sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for the two classes? Report the following values: p(class1), w(class1), p(class2), w(class2)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find likelihood, and K for model M2&#039;&#039;&#039;&lt;br /&gt;
: Scroll past the M1 output until you get to the results for model M2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the values of K and lnL for model M2?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find dN/dS ratios and codon class proportions for model M2:&#039;&#039;&#039;&lt;br /&gt;
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the dN/dS values (w) and proportions (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Assess strength of evidence for models M1 and M2:&#039;&#039;&#039;&lt;br /&gt;
: M2 will always have a better (higher) log-likelihood than model M1 because M2 has more free parameters, and M1 is nested within M2. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.&lt;br /&gt;
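&lt;br /&gt;
: If you used the small R sketch from the jmodeltest section above, the same function can be re-used here; the numbers below are placeholders to be replaced by the lnL and np values you found for M1 and M2:&lt;br /&gt;
 aic_weights(lnL = c(M1 = -1111.1, M2 = -1111.1), K = c(M1 = 99, M2 = 99))&lt;br /&gt;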
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039; Report: AIC, ΔAIC, weight (model probability) for M1 and M2&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10: &#039;&#039;&#039; Is M2 better than M1?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine list of positively selected sites&#039;&#039;&#039;&lt;br /&gt;
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the &amp;quot;Bayes Empirical Bayes&amp;quot; table, not the &amp;quot;Naive Empirical Bayes&amp;quot; table. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Bayes Empirical Bayes (BEB) analysis&lt;br /&gt;
Positively selected sites&lt;br /&gt;
&lt;br /&gt;
         Prob(w&amp;gt;1)     mean w&lt;br /&gt;
&lt;br /&gt;
    25 A 0.959*        3.133 +- 0.769&lt;br /&gt;
    27 P 0.906         2.990 +- 0.877&lt;br /&gt;
    56 K 0.987*        3.197 +- 0.687&lt;br /&gt;
    59 V 0.915         3.032 +- 0.873&lt;br /&gt;
    78 R 0.637         2.351 +- 1.129&lt;br /&gt;
    88 K 0.573         2.148 +- 1.077&lt;br /&gt;
    95 V 0.925         3.046 +- 0.843&lt;br /&gt;
    ...&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
: It is not important what the distinction is in this context, but very briefly NEB ignores the fact that there is uncertainty about the maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].&lt;br /&gt;
: This gives you a list of which residues (if any) were found to belong to the positively selected dN/dS class. Also listed are the probability that the site really is in the codon class where dN/dS &amp;gt; 1, and a weighted average of w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;List all sites having more than 95% probability of belonging to the positively selected class&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=256</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=256"/>
		<updated>2026-03-19T14:59:53Z</updated>

		<summary type="html">&lt;p&gt;Gorm: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
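&lt;br /&gt;
As a small illustration (not something you need for the exercise), the following R lines show one way of finding such parameters: they choose the meanlog and sdlog of a lognormal distribution so that its central 95% interval runs from roughly 0.5 to 20.&lt;br /&gt;
 # choose lognormal parameters whose central 95% interval is about 0.5 - 20&lt;br /&gt;
 meanlog = (log(0.5) + log(20)) / 2&lt;br /&gt;
 sdlog   = (log(20) - log(0.5)) / (2 * qnorm(0.975))&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)  # check: returns approximately 0.5 and 20&lt;br /&gt;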
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior samples and posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
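&lt;br /&gt;
: To make the diagnostic concrete, here is a small R sketch with made-up numbers: for each split, the frequency with which it occurs is recorded in each of the two runs, the standard deviation of the two frequencies is computed, and these standard deviations are then averaged across splits.&lt;br /&gt;
 # illustrative (made-up) split frequencies from two independent runs&lt;br /&gt;
 freq_run1 = c(0.98, 0.52, 0.47)&lt;br /&gt;
 freq_run2 = c(0.99, 0.55, 0.41)&lt;br /&gt;
 split_sd  = apply(cbind(freq_run1, freq_run2), 1, sd)  # SD per split&lt;br /&gt;
 mean(split_sd)                                          # average SD of split frequencies&lt;br /&gt;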
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01 then you should answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulated probability of the trees shown so far, and is useful for constructing a credible set.) This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
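&lt;br /&gt;
: As a small illustration of how the cumulated probabilities can be used, the following R lines (with made-up tree probabilities) show how one would find the number of trees in a 95% credible set:&lt;br /&gt;
 # made-up posterior probabilities of tree topologies, sorted in decreasing order&lt;br /&gt;
 tree_probs = c(0.62, 0.20, 0.10, 0.05, 0.03)&lt;br /&gt;
 cumsum(tree_probs)                      # cumulated probabilities (the upper-case P column)&lt;br /&gt;
 which(cumsum(tree_probs) &amp;gt;= 0.95)[1]    # number of trees needed to reach 95%&lt;br /&gt;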
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means leaf names are not printed in the list of trees with probabilities. However, the most probable tree here is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Neanderthal data: posterior probability of clades ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the command `nst=mixed` which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now considers the substitution model as one more parameter, and uses MCMC to sample from the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set).&lt;br /&gt;
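&lt;br /&gt;
: (Optional aside: the same clade probability can in principle also be computed directly from the sampled trees in R, assuming the ape package is installed. The tip-label pattern below is a guess and must be replaced by the actual names used in the alignment.)&lt;br /&gt;
 library(ape)&lt;br /&gt;
 trees  = read.nexus(&amp;quot;neanderthal.nexus.run1.t&amp;quot;)&lt;br /&gt;
 post   = trees[-(1:floor(0.25 * length(trees)))]   # discard 25% burnin&lt;br /&gt;
 humans = grep(&amp;quot;Homo_sapiens&amp;quot;, trees[[1]]$tip.label, value = TRUE)  # hypothetical label pattern&lt;br /&gt;
 mean(sapply(post, is.monophyletic, tips = humans)) # fraction of sampled trees containing the clade&lt;br /&gt;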
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
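&lt;br /&gt;
: If you want to get a feeling for what a flat Dirichlet prior looks like, the following R sketch draws random nucleotide frequency vectors from dirichlet(1,1,1,1) by normalizing four independent Gamma(1) draws (a standard way of simulating from a Dirichlet distribution):&lt;br /&gt;
 draw_freqs = function() { x = rgamma(4, shape = 1); x / sum(x) }&lt;br /&gt;
 t(replicate(5, draw_freqs()))   # five random (A, C, G, T) frequency vectors, each summing to 1&lt;br /&gt;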
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and measures how many effectively independent samples you have from the posterior — the higher the better, but this should be at least 100. [https://sites.stat.columbia.edu/gelman/research/published/brooksgelman2.pdf The column labeled PSRF+ gives another convergence diagnostic (also known as “R-hat”)] and should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of the chains) are sampling the same distribution of values. As a rule of thumb, values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 suggest poor convergence.&lt;br /&gt;
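&lt;br /&gt;
: (Optional aside: if the coda R package is installed, you can compute an ESS yourself from one of the sample files and compare it with the value reported by sump. The column name below follows the format used in the .p files.)&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_p = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1)&lt;br /&gt;
 rAC  = df_p$`r(A&amp;lt;-&amp;gt;C){all}`[-(1:floor(0.25 * nrow(df_p)))]  # drop 25% burnin&lt;br /&gt;
 effectiveSize(rAC)   # effective sample size for the A&amp;lt;-&amp;gt;C rate&lt;br /&gt;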
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039;Based on the reported posterior means, which of the two parameters, r(AC) or r(CG), appears to be larger on average?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: Comparing posterior means gives a useful first summary of the two parameters, but it does not show how uncertain these estimates are. One of the strengths of Bayesian analysis is precisely that it gives us access not just to a single best estimate, but to a full posterior probability distribution over possible parameter values. We will now use this to get a fuller picture of the two substitution-rate parameters.&lt;br /&gt;
&lt;br /&gt;
: We start by looking at the posterior distribution of each parameter separately. Such a distribution for one parameter alone is called its marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Looking at the marginal distributions gives us a fuller understanding of the uncertainty in each parameter separately. However, it still does not directly answer the question of whether r(AC) is larger than r(CG) in most posterior samples, because the two parameters may be associated with each other across samples. For instance, one parameter might be larger than the other in almost every individual sample, even though the two overall marginal distributions overlap. To answer such questions, we must examine the two parameters simultaneously. A probability distribution over several parameters at the same time is called a &amp;quot;joint distribution&amp;quot;, whereas a distribution for one parameter considered by itself is called a &amp;quot;marginal distribution&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
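&lt;br /&gt;
: Equivalently, the same posterior probability can be obtained in one line as the fraction of samples in which the AC rate exceeds the CG rate:&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)&lt;br /&gt;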
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides information that you could not obtain by simply comparing the marginal distributions. In particular, it lets you answer direct questions about how parameters relate to each other, for instance whether one is larger than another in most posterior samples. The same idea can be used to answer many other questions about posterior distributions.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=255</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=255"/>
		<updated>2026-03-19T11:40:05Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments; for instance, one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance, the transition/transversion rate ratio kappa typically lies in the range 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than the range we think of as plausible. One could for instance use a lognormal distribution with suitable parameters.&lt;br /&gt;
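&lt;br /&gt;
As a concrete illustration (a small R sketch, not part of the exercise; the numbers are simply the example values mentioned above), one way to choose such a lognormal prior is to place its 2.5% and 97.5% quantiles at 0.5 and 20:&lt;br /&gt;
 # Lognormal prior with roughly 95% of its mass between 0.5 and 20&lt;br /&gt;
 meanlog = (log(0.5) + log(20)) / 2           # centre of the interval on the log scale&lt;br /&gt;
 sdlog   = (log(20) - log(0.5)) / (2 * 1.96)  # 95% of a normal lies within +/- 1.96 sd&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)      # returns approximately 0.5 and 20&lt;br /&gt;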
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (so the first command becomes, for instance, cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, many of the PAUP commands can also be used in MrBayes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: Here you are using the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of the output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 controls how often MrBayes checks whether the run has converged. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
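&lt;br /&gt;
: To make the diagnostic concrete, here is a small R sketch of how an average standard deviation of split frequencies could be computed; the split frequencies below are made up, and MrBayes applies additional rules about which splits to include that are ignored here:&lt;br /&gt;
 # Hypothetical frequencies of five splits in two independent runs&lt;br /&gt;
 run1 = c(0.95, 0.52, 0.48, 0.99, 0.30)&lt;br /&gt;
 run2 = c(0.97, 0.49, 0.50, 0.98, 0.33)&lt;br /&gt;
 # Standard deviation of each split frequency across the two runs, averaged over splits&lt;br /&gt;
 mean(apply(cbind(run1, run2), 1, sd))&lt;br /&gt;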
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
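&lt;br /&gt;
: If you also want a numerical summary of alpha, a small sketch along these lines could be used (it reuses the df_primates data frame read above and discards the first 25% of samples, as elsewhere in this exercise):&lt;br /&gt;
 burnin_gen = 0.25 * max(df_primates$Gen)&lt;br /&gt;
 df_primates %&amp;gt;%&lt;br /&gt;
     filter(Gen &amp;gt; burnin_gen) %&amp;gt;%&lt;br /&gt;
     summarise(mean_alpha = mean(alpha),&lt;br /&gt;
               lower95 = quantile(alpha, 0.025),&lt;br /&gt;
               upper95 = quantile(alpha, 0.975))&lt;br /&gt;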
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in the phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc, and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
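&lt;br /&gt;
: As an illustration of how a 95% credible set of trees could be constructed from such a list, here is a small R sketch; the posterior probabilities below are made up:&lt;br /&gt;
 # Hypothetical posterior probabilities for the 15 possible topologies&lt;br /&gt;
 tree_probs = c(0.62, 0.21, 0.09, 0.04, 0.02, 0.01, 0.005, 0.005, rep(0, 7))&lt;br /&gt;
 cum_probs = cumsum(sort(tree_probs, decreasing = TRUE))&lt;br /&gt;
 # Number of trees needed before the cumulative probability reaches 95%&lt;br /&gt;
 which(cum_probs &amp;gt;= 0.95)[1]&lt;br /&gt;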
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequences 5 through 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible models (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the resolved branches you will notice that numbers have been plotted. These are clade credibility values: for each clade, the posterior probability (given the present data set and model) that the clade is part of the true tree.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
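&lt;br /&gt;
: If the 1-.\3 notation seems abstract, this small R sketch simply lists which of the first 12 alignment positions would fall into each of the three character sets:&lt;br /&gt;
 sites = 1:12&lt;br /&gt;
 split(sites, (sites - 1) %% 3 + 1)&lt;br /&gt;
 # group 1 = sites 1 4 7 10 (1stpos), group 2 = 2 5 8 11 (2ndpos), group 3 = 3 6 9 12 (3rdpos)&lt;br /&gt;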
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
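&lt;br /&gt;
: To get a feel for what a single draw from the flat dirichlet(1,1,1,1) prior looks like, you can use a small base-R sketch; a draw from Dirichlet(1,...,1) can be obtained by normalising independent Gamma(1) variates:&lt;br /&gt;
 g = rgamma(4, shape = 1)   # four independent Gamma(1) (i.e. exponential) variates&lt;br /&gt;
 freqs = g / sum(g)         # normalise so the four values sum to 1&lt;br /&gt;
 round(freqs, 3)            # one random nucleotide-frequency vector from the flat prior&lt;br /&gt;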
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. ESS means Effective Sample Size, and measures how many effectively independent samples you have from the posterior (the higher the better; it should be at least 100). [https://sites.stat.columbia.edu/gelman/research/published/brooksgelman2.pdf The column labeled PSRF+ gives another convergence diagnostic (also known as “R-hat”)], which should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of the chains) are sampling the same distribution of values. As a rule of thumb, values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 suggest poor convergence.&lt;br /&gt;
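&lt;br /&gt;
: If you are curious how an effective sample size can be estimated, here is a rough R sketch using the alpha samples from the primate run read earlier (this is only a crude approximation of what MCMC software reports):&lt;br /&gt;
 # ESS is roughly N / (1 + 2 * sum of autocorrelations at positive lags)&lt;br /&gt;
 x = df_primates$alpha[df_primates$Gen &amp;gt; 0.25 * max(df_primates$Gen)]&lt;br /&gt;
 rho = as.vector(acf(x, lag.max = 100, plot = FALSE)$acf)[-1]   # drop lag 0&lt;br /&gt;
 first_neg = which(rho &amp;lt; 0)[1]                                 # truncate at first negative lag&lt;br /&gt;
 if (!is.na(first_neg)) rho = rho[seq_len(first_neg - 1)]&lt;br /&gt;
 length(x) / (1 + 2 * sum(rho))&lt;br /&gt;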
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039;Based on the reported posterior means, which of the two parameters, r(AC) or r(CG), appears to be larger on average?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: Comparing posterior means gives a useful first summary of the two parameters, but it does not show how uncertain these estimates are. One of the strengths of Bayesian analysis is precisely that it gives us access not just to a single best estimate, but to a full posterior probability distribution over possible parameter values. We will now use this to get a fuller picture of the two substitution-rate parameters.&lt;br /&gt;
&lt;br /&gt;
: We start by looking at the posterior distribution of each parameter separately. Such a distribution for one parameter alone is called its marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin:&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
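: To complement the plots with numbers, you could also summarise each marginal distribution directly, for instance with the posterior mean and a 95% credible interval for each rate:&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     summarise(mean_AC = mean(AC), lower_AC = quantile(AC, 0.025), upper_AC = quantile(AC, 0.975),&lt;br /&gt;
               mean_CG = mean(CG), lower_CG = quantile(CG, 0.025), upper_CG = quantile(CG, 0.975))&lt;br /&gt;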
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you decide, from these marginal distributions alone, whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Looking at the marginal distributions gives us a fuller understanding of the uncertainty in each parameter separately. However, it still does not directly answer the question of whether r(AC) is larger than r(CG) in most posterior samples, because the two parameters may be associated with each other across samples. For instance, one parameter might be larger than the other in almost every individual sample, even though the two overall marginal distributions overlap. To answer such questions, we must examine the two parameters simultaneously. A probability distribution over several parameters at the same time is called a &amp;quot;joint distribution&amp;quot;, whereas a distribution for one parameter considered by itself is called a &amp;quot;marginal distribution&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
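: The last two commands count the total number of post-burnin samples and the number of samples in which the AC rate exceeds the CG rate; the same fraction can also be computed directly, for example:&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)   # fraction of posterior samples with r(AC) &amp;gt; r(CG)&lt;br /&gt;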
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides information that you could not obtain by simply comparing the marginal distributions. In particular, it lets you answer direct questions about how parameters relate to each other, for instance whether one is larger than another in most posterior samples. The same idea can be used to answer many other questions about posterior distributions.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions (again discarding the first 25% of samples as burnin):&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
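: If you prefer a numerical summary to the density plot, the posterior mean relative rate for each codon position can be computed along these lines:&lt;br /&gt;
 df_hcv3 %&amp;gt;%&lt;br /&gt;
     group_by(name) %&amp;gt;%&lt;br /&gt;
     summarise(mean_rate = mean(value))&lt;br /&gt;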
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=254</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=254"/>
		<updated>2026-03-19T11:09:13Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: Here you are using the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of the output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 controls how often MrBayes checks whether the run has converged. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in the phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc, and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to explore the possible substitution models automatically. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible models (corresponding to nst values from 1 to 6). This will often be the best choice when using MrBayes. (Below, we use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is present in the true tree (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
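&lt;br /&gt;
: To see exactly which sites these charset definitions pick out, here is a tiny R illustration (a toy alignment length of 12 is assumed; the real alignment is of course longer):&lt;br /&gt;
 L = 12                 # toy alignment length&lt;br /&gt;
 seq(1, L, by = 3)      # sites in charset 1stpos: 1 4 7 10&lt;br /&gt;
 seq(2, L, by = 3)      # sites in charset 2ndpos: 2 5 8 11&lt;br /&gt;
 seq(3, L, by = 3)      # sites in charset 3rdpos: 3 6 9 12&lt;br /&gt;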
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
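&lt;br /&gt;
: If you want to get a feeling for what the flat dirichlet(1,1,1,1) prior looks like, here is a small optional R sketch (not part of the exercise). It uses the fact that a Dirichlet(1,1,1,1) draw can be obtained by normalizing four independent Gamma(1,1) draws.&lt;br /&gt;
 draw_flat_dirichlet = function() { x = rgamma(4, shape = 1); x / sum(x) }&lt;br /&gt;
 freqs = t(replicate(1000, draw_flat_dirichlet()))   # 1000 random frequency vectors&lt;br /&gt;
 colnames(freqs) = c(&amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;T&amp;quot;)&lt;br /&gt;
 head(freqs)       # each row sums to 1&lt;br /&gt;
 colMeans(freqs)   # prior mean is 0.25 for each nucleotide&lt;br /&gt;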
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and measures how many effectively independent samples you have from the posterior — the higher the better, but this should be at least 100. [https://sites.stat.columbia.edu/gelman/research/published/brooksgelman2.pdf The column labeled PSRF+ gives another convergence diagnostic (also known as “R-hat”)] and should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of the chains) are sampling the same distribution of values. As a rule of thumb, values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 suggest poor convergence.&lt;br /&gt;
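&lt;br /&gt;
: If you would like to double-check an ESS value yourself, the sketch below computes one in R using the coda package (assumed to be installed; this is not part of the exercise, and MrBayes may compute its diagnostics in a slightly different way). The coda function gelman.diag can similarly compute a PSRF from the two run files.&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_check = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1)&lt;br /&gt;
 post = filter(df_check, Gen &amp;gt; 0.25 * max(Gen))    # discard 25% burnin&lt;br /&gt;
 effectiveSize(post$`r(A&amp;lt;-&amp;gt;C){all}`)              # ESS for one of the rate parameters&lt;br /&gt;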
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039;Based on the reported posterior means, which of the two parameters, r(AC) or r(CG), appears to be larger on average?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
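&lt;br /&gt;
: Here is a small simulated toy example (not based on the MrBayes output) that illustrates the point: two parameters whose marginal distributions overlap heavily, even though one of them is larger in essentially every joint sample.&lt;br /&gt;
 set.seed(1)&lt;br /&gt;
 a = rnorm(1000, mean = 0, sd = 1)&lt;br /&gt;
 b = a + abs(rnorm(1000, mean = 0.3, sd = 0.1))   # b is always a bit larger than a&lt;br /&gt;
 summary(a); summary(b)                           # the marginals overlap a lot&lt;br /&gt;
 mean(b &amp;gt; a)                                      # but jointly, b &amp;gt; a in every sample&lt;br /&gt;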
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you decide, from these marginal distributions alone, whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
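: As a small shortcut (equivalent to dividing the two counts above), the fraction of posterior samples in which AC exceeds CG can also be computed directly:&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)&lt;br /&gt;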
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions (again discarding the first 25% of the samples as burnin, reusing the burnin cutoff computed above):&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=253</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=253"/>
		<updated>2026-03-19T10:44:37Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance, one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance, the transition/transversion rate ratio kappa is typically in the range 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the range 0.5-20, slightly wider than the range we think of as plausible. One could, for instance, use a lognormal distribution with suitable parameters.&lt;br /&gt;
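&lt;br /&gt;
As a concrete illustration (a small optional R sketch using only base R functions; the exact numbers are just an example), one can solve for lognormal parameters that put roughly 95% of the prior mass between 0.5 and 20:&lt;br /&gt;
 meanlog = mean(log(c(0.5, 20)))               # centre of the interval on the log scale&lt;br /&gt;
 sdlog   = (log(20) - log(0.5)) / (2 * 1.96)   # 95% of a normal lies within 1.96 sd of the mean&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)       # approximately 0.5 and 20&lt;br /&gt;
 plnorm(20, meanlog, sdlog) - plnorm(0.5, meanlog, sdlog)   # approximately 0.95&lt;br /&gt;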
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
: Make the shell window as wide as possible and then issue the following command to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes starts two entirely independent runs from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
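&lt;br /&gt;
: To make the diagnostic concrete, here is a toy R illustration with made-up split frequencies (MrBayes computes the quantity internally and may treat rare splits slightly differently):&lt;br /&gt;
 splits = tibble(split = c(&amp;quot;AB|CDE&amp;quot;, &amp;quot;AC|BDE&amp;quot;, &amp;quot;DE|ABC&amp;quot;),&lt;br /&gt;
                 run1  = c(0.91, 0.08, 0.97),&lt;br /&gt;
                 run2  = c(0.93, 0.06, 0.98))&lt;br /&gt;
 splits %&amp;gt;%&lt;br /&gt;
     rowwise() %&amp;gt;%&lt;br /&gt;
     mutate(split_sd = sd(c(run1, run2))) %&amp;gt;%   # standard deviation across the two runs&lt;br /&gt;
     ungroup() %&amp;gt;%&lt;br /&gt;
     summarise(average_sd_of_split_freqs = mean(split_sd))&lt;br /&gt;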
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority-rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set.) This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to explore the possible substitution models automatically. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible models (corresponding to nst values from 1 to 6). This will often be the best choice when using MrBayes. (Below, we use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is present in the true tree (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and measures how many effectively independent samples you have from the posterior — the higher the better, but this should be at least 100. [https://sites.stat.columbia.edu/gelman/research/published/brooksgelman2.pdf The column labeled PSRF+ gives another convergence diagnostic (also known as “R-hat”)] and should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of the chains) are sampling the same distribution of values. As a rule of thumb, values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 suggest poor convergence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you decide, from these marginal distributions alone, whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions (again discarding the first 25% of the samples as burnin, reusing the burnin cutoff computed above):&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=252</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=252"/>
		<updated>2026-03-19T10:37:00Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
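&lt;br /&gt;
As a small illustration (an R sketch, not part of the exercise itself), such a lognormal prior can be constructed by matching 0.5 and 20 to the 2.5% and 97.5% quantiles of the distribution:&lt;br /&gt;
 meanlog = (log(0.5) + log(20)) / 2           # midpoint on the log scale, approx. 1.15&lt;br /&gt;
 sdlog   = (log(20) - log(0.5)) / (2 * 1.96)  # approx. 0.94&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)      # approx. 0.5 and 20, as intended&lt;br /&gt;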
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
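&lt;br /&gt;
: As an optional side note (an R sketch, not a MrBayes command), you can plot the gamma density for a few values of the shape parameter alpha to see how it controls among-site rate variation. The mean rate is fixed at 1, so both the shape and rate arguments are set to alpha; small alpha means strongly varying rates, large alpha means nearly uniform rates:&lt;br /&gt;
 # Densities of the gamma rate distribution for three alpha values&lt;br /&gt;
 df_gamma = expand_grid(alpha = c(0.2, 1, 5), r = seq(0.01, 3, by = 0.01)) %&amp;gt;%&lt;br /&gt;
     mutate(density = dgamma(r, shape = alpha, rate = alpha))&lt;br /&gt;
 ggplot(df_gamma, aes(x = r, y = density, color = factor(alpha))) +&lt;br /&gt;
     geom_line() +&lt;br /&gt;
     labs(x = &amp;quot;Relative site rate&amp;quot;, y = &amp;quot;Density&amp;quot;, color = &amp;quot;alpha&amp;quot;)&lt;br /&gt;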
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
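&lt;br /&gt;
: To make the diagnostic concrete, here is a toy R calculation (invented numbers, not output from your run) of the average standard deviation of split frequencies for three splits observed in two independent runs:&lt;br /&gt;
 split_freq_run1 = c(0.98, 0.52, 0.31)   # frequency of three splits in run 1&lt;br /&gt;
 split_freq_run2 = c(0.97, 0.49, 0.35)   # frequency of the same splits in run 2&lt;br /&gt;
 per_split_sd = apply(rbind(split_freq_run1, split_freq_run2), 2, sd)&lt;br /&gt;
 mean(per_split_sd)                      # small value: the two runs sample similar trees&lt;br /&gt;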
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
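&lt;br /&gt;
: A quick numerical check of the same point (density values above 1 have positive logs, so a prior built from such densities can give a positive LnPr):&lt;br /&gt;
 dexp(0.01, rate = 10)        # approx. 9.05, i.e. a density larger than 1&lt;br /&gt;
 log(dexp(0.01, rate = 10))   # approx. 2.2, i.e. a positive log prior density&lt;br /&gt;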
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
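&lt;br /&gt;
: If you want to experiment, the following lines (a small optional sketch) show a trace for two parameters at once, and a trace where the earliest samples have been discarded; the column names are the ones listed for the .p file above:&lt;br /&gt;
 mcmc_trace(df_primates, pars = c(&amp;quot;alpha&amp;quot;, &amp;quot;kappa&amp;quot;))&lt;br /&gt;
 mcmc_trace(df_primates %&amp;gt;% filter(Gen &amp;gt; 250000), pars = &amp;quot;alpha&amp;quot;)   # 25% of the initial 1,000,000 generations; adjust if you continued the run&lt;br /&gt;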
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below, the options relburnin=yes and burninfrac=0.25 tell MrBayes to discard the first 25% of the samples as burnin (you could also have given the number of samples to discard explicitly - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree be calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
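&lt;br /&gt;
: As a small illustration of how p and P are related, here is a toy R calculation with invented posterior probabilities for five trees, ordered as in the sumt output; the cumulative sum shows how many trees are needed for a 95% credible set:&lt;br /&gt;
 tree_probs = c(0.70, 0.15, 0.08, 0.05, 0.02)    # invented values playing the role of p&lt;br /&gt;
 cumsum(tree_probs)                              # corresponds to the upper-case P column&lt;br /&gt;
 which(cumsum(tree_probs) &amp;gt;= 0.95)[1]            # number of trees in the 95% credible set&lt;br /&gt;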
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the setting nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, we use nst=6 for pedagogical purposes, because it makes it simpler to analyze the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the resolved branches you will notice that numbers have been plotted. These are clade-credibility values: for each clade, they give the posterior probability (given the model and the present data set) that the clade is part of the true tree.&lt;br /&gt;
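&lt;br /&gt;
: If you want to verify a clade-credibility value yourself, one possibility (a sketch, assuming the ape R package is installed and that the human sequences have tip labels containing Homo_sapiens; adjust the file name, labels and burn-in to match your own run) is to count how often the clade occurs among the sampled trees:&lt;br /&gt;
 library(ape)&lt;br /&gt;
 trees = read.nexus(&amp;quot;neanderthal.nexus.run1.t&amp;quot;)            # trees sampled in run 1&lt;br /&gt;
 trees = trees[-(1:floor(0.25 * length(trees)))]              # discard 25% burn-in&lt;br /&gt;
 human_tips = grep(&amp;quot;Homo_sapiens&amp;quot;, trees[[1]]$tip.label, value = TRUE)&lt;br /&gt;
 mean(sapply(trees, is.monophyletic, tips = human_tips))      # fraction of sampled trees containing the clade&lt;br /&gt;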
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As the last thing, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
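&lt;br /&gt;
: Just to make the site numbering concrete, this small R illustration lists which sites would fall in each character set (using a made-up alignment length of 12; hcvsmall.nexus is of course longer):&lt;br /&gt;
 n_sites = 12   # illustration only&lt;br /&gt;
 list(pos1 = seq(1, n_sites, by = 3),   # 1, 4, 7, 10&lt;br /&gt;
      pos2 = seq(2, n_sites, by = 3),   # 2, 5, 8, 11&lt;br /&gt;
      pos3 = seq(3, n_sites, by = 3))   # 3, 6, 9, 12&lt;br /&gt;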
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
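&lt;br /&gt;
: To get a feeling for the flat dirichlet(1,1,1,1) prior, you can draw a few frequency vectors from it in R (a sketch; base R has no Dirichlet sampler, but normalizing independent gamma draws is a standard way to construct one):&lt;br /&gt;
 draw_dirichlet = function(alpha = c(1, 1, 1, 1)) {&lt;br /&gt;
     g = rgamma(length(alpha), shape = alpha)&lt;br /&gt;
     g / sum(g)                        # positive values that sum to 1&lt;br /&gt;
 }&lt;br /&gt;
 t(replicate(5, draw_dirichlet()))     # five random nucleotide-frequency vectors&lt;br /&gt;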
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, but it should be at least 100. The column labeled PSRF+ is a measure (also known as &amp;quot;R-hat&amp;quot;) that should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of different chains) converge on sampling the same set of values. As a rule of thumb: values less than 1.05 are good, values between 1.05 and 1.10 are acceptable, and values above 1.10 indicate that the runs have not converged well.&lt;br /&gt;
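&lt;br /&gt;
: If you would like to double-check an ESS value outside MrBayes, a minimal sketch (assuming the coda R package is installed) is to compute it for one parameter from one run; it should be in the same ballpark as the value reported by sump:&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_check = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1)&lt;br /&gt;
 post = df_check %&amp;gt;% filter(Gen &amp;gt; 0.25 * max(Gen))       # discard 25% burn-in&lt;br /&gt;
 effectiveSize(as.mcmc(post$`r(A&amp;lt;-&amp;gt;C){all}`))            # ESS for the A&amp;lt;-&amp;gt;C rate in run 1&lt;br /&gt;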
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
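&lt;br /&gt;
: The difference is easiest to see in a toy R example (simulated numbers, not course data): two parameters whose marginal distributions overlap heavily, even though one of them is larger in every single joint sample:&lt;br /&gt;
 set.seed(1)&lt;br /&gt;
 x = rnorm(1000)          # marginal roughly Normal(0, 1)&lt;br /&gt;
 y = x + 0.5              # marginal roughly Normal(0.5, 1), overlapping that of x&lt;br /&gt;
 mean(y &amp;gt; x)              # joint view: y exceeds x in 100% of the samples&lt;br /&gt;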
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you, from these marginal distributions alone, decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
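&lt;br /&gt;
: The same probability can also be read off directly as the fraction of joint samples in which the AC rate exceeds the CG rate; this one-liner is equivalent to dividing the two row counts above:&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)&lt;br /&gt;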
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%   # reuse the 25% burn-in cutoff computed earlier&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=251</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=251"/>
		<updated>2026-03-19T10:32:45Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below, the options relburnin=yes and burninfrac=0.25 tell MrBayes to discard the first 25% of the samples as burnin (you could also have given the number of samples to discard explicitly - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree be calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now treats the substitution model itself as one more parameter, and uses MCMC to sample over the possible models (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, i.e., the posterior probability that the clade is present in the true tree (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
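: If you want to convince yourself which sites a charset such as 1stpos contains, the same every-third-site pattern can be generated in R, for example:&lt;br /&gt;
 seq(from=1, by=3, length.out=10)    # the first ten sites in 1stpos: 1 4 7 10 ...&lt;br /&gt;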
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
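: If you are curious what a flat dirichlet(1,1,1,1) prior looks like, you can draw random frequency vectors from it in R. One standard trick (a small sketch using only base R) is to draw four independent Gamma(1) values and normalize them so they sum to 1:&lt;br /&gt;
 draw_flat_dirichlet = function() {&lt;br /&gt;
     g = rgamma(4, shape=1)                          # four independent Gamma(1) draws&lt;br /&gt;
     setNames(g / sum(g), c(&amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;T&amp;quot;))   # normalize to a frequency vector&lt;br /&gt;
 }&lt;br /&gt;
 draw_flat_dirichlet()    # each call gives one equally plausible frequency vector&lt;br /&gt;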
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ is the Potential Scale Reduction Factor (also known as &amp;quot;R-hat&amp;quot;), which should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of individual chains) converge to sample the same set of values.&lt;br /&gt;
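: If you would like to cross-check an ESS value yourself in R, the coda package (assuming it is installed; it is not among the libraries loaded earlier) provides the function effectiveSize. For example, for the alpha column of the primate parameter file examined earlier:&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_p = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 effectiveSize(df_p$alpha)    # rough ESS for alpha (no burn-in removed here)&lt;br /&gt;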
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing (or, for continuous parameters, integrating) its probability over all values of the other parameters. The resulting distribution is called the marginal distribution of that parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burn-in.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
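: Equivalently, you can compute the fraction of samples in which AC exceeds CG directly (this is the same calculation as dividing the two row counts above):&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)    # fraction of posterior samples where r(AC) &amp;gt; r(CG)&lt;br /&gt;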
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%    # discard the 25% burn-in computed earlier&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
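: If you prefer numbers to a density plot, you can also summarize the same samples directly (a small tidyverse sketch using the long-format data frame defined above):&lt;br /&gt;
 df_hcv3 %&amp;gt;%&lt;br /&gt;
     group_by(name) %&amp;gt;%&lt;br /&gt;
     summarise(posterior_mean = mean(value))&lt;br /&gt;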
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=250</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=250"/>
		<updated>2026-03-19T10:31:39Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
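: As a concrete illustration of this last point (a small R sketch, not something you need for the analyses below): one way to choose lognormal parameters whose central 95% interval is roughly 0.5-20 is to centre the distribution between log(0.5) and log(20) and scale it so that 1.96 standard deviations reach the endpoints:&lt;br /&gt;
 meanlog = (log(0.5) + log(20)) / 2           # centre of the interval on the log scale&lt;br /&gt;
 sdlog = (log(20) - log(0.5)) / (2 * 1.96)    # 95% of the mass within 1.96 sd on the log scale&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)      # approximately 0.5 and 20&lt;br /&gt;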
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
: Make the shell window as wide as possible, and then issue the following command to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until this value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit MrBayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
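: As a concrete check of the hint above, you can also evaluate the prior density directly at a very short branch length:&lt;br /&gt;
 dexp(0, rate=10)         # the density at branch length 0 is 10&lt;br /&gt;
 log(dexp(0, rate=10))    # so the log density is positive (about 2.3)&lt;br /&gt;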
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
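: For instance, assuming the data frame df_primates read above, you could plot the traces of kappa and the tree length TL (two of the other columns in the .p file) in the same way:&lt;br /&gt;
 mcmc_trace(df_primates, pars=c(&amp;quot;kappa&amp;quot;, &amp;quot;TL&amp;quot;))&lt;br /&gt;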
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
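: If you want to verify the number 15: for n taxa there are 3 * 5 * ... * (2n-5) possible unrooted binary topologies, which you can check with a small R sketch:&lt;br /&gt;
 n_taxa = 5&lt;br /&gt;
 prod(seq(from=3, to=2*n_taxa - 5, by=2))    # 3 * 5 = 15 unrooted topologies for 5 taxa&lt;br /&gt;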
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below, relburnin=yes and burninfrac=0.25 tell MrBayes to discard the first 25% of the samples as burn-in (you could also have given the number of samples to discard explicitly; help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees in the file you examined manually above. The option contype=halfcompat requests that a majority-rule consensus tree is calculated from the set of trees that are left after discarding the burn-in. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this phylogram have been averaged over the trees in which the branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set.) This list highlights how Bayesian phylogenetic analysis differs from maximum likelihood: instead of finding the best tree(s), we quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now treats the substitution model itself as one more parameter, and uses MCMC to sample over the possible models (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, i.e., the posterior probability that the clade is present in the true tree (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
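: If you want to convince yourself which sites a charset such as 1stpos contains, the same every-third-site pattern can be generated in R, for example:&lt;br /&gt;
 seq(from=1, by=3, length.out=10)    # the first ten sites in 1stpos: 1 4 7 10 ...&lt;br /&gt;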
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible frequency vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-to-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
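: If you are curious what a flat dirichlet(1,1,1,1) prior looks like, you can draw random frequency vectors from it in R. One standard trick (a small sketch using only base R) is to draw four independent Gamma(1) values and normalize them so they sum to 1:&lt;br /&gt;
 draw_flat_dirichlet = function() {&lt;br /&gt;
     g = rgamma(4, shape=1)                          # four independent Gamma(1) draws&lt;br /&gt;
     setNames(g / sum(g), c(&amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;T&amp;quot;))   # normalize to a frequency vector&lt;br /&gt;
 }&lt;br /&gt;
 draw_flat_dirichlet()    # each call gives one equally plausible frequency vector&lt;br /&gt;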
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ is the Potential Scale Reduction Factor (also known as &amp;quot;R-hat&amp;quot;), which should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of individual chains) converge to sample the same set of values.&lt;br /&gt;
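: If you would like to cross-check an ESS value yourself in R, the coda package (assuming it is installed; it is not among the libraries loaded earlier) provides the function effectiveSize. For example, for the alpha column of the primate parameter file examined earlier:&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_p = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 effectiveSize(df_p$alpha)    # rough ESS for alpha (no burn-in removed here)&lt;br /&gt;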
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing (or, for continuous parameters, integrating) its probability over all values of the other parameters. The resulting distribution is called the marginal distribution of that parameter.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burn-in.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
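: Equivalently, you can compute the fraction of samples in which AC exceeds CG directly (this is the same calculation as dividing the two row counts above):&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)    # fraction of posterior samples where r(AC) &amp;gt; r(CG)&lt;br /&gt;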
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%    # discard the 25% burn-in computed earlier&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
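: If you prefer numbers to a density plot, you can also summarize the same samples directly (a small tidyverse sketch using the long-format data frame defined above):&lt;br /&gt;
 df_hcv3 %&amp;gt;%&lt;br /&gt;
     group_by(name) %&amp;gt;%&lt;br /&gt;
     summarise(posterior_mean = mean(value))&lt;br /&gt;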
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=249</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=249"/>
		<updated>2026-03-19T10:29:25Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Probability distributions over other parameters */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
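&lt;br /&gt;
As a small illustration (not part of the analyses below), the following base-R sketch shows one possible choice of such a lognormal prior for kappa, with meanlog and sdlog chosen so that roughly 95% of the prior mass falls between 0.5 and 20:&lt;br /&gt;
 meanlog = log(sqrt(0.5 * 20))                # midpoint of the 0.5-20 range on the log scale&lt;br /&gt;
 sdlog = (log(20) - log(0.5)) / (2 * 1.96)    # ~95% of the prior mass between 0.5 and 20&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)      # check: approximately 0.5 and 20&lt;br /&gt;
 curve(dlnorm(x, meanlog, sdlog), from = 0, to = 25,&lt;br /&gt;
       xlab = &amp;quot;kappa&amp;quot;, ylab = &amp;quot;Prior density&amp;quot;)&lt;br /&gt;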
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
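&lt;br /&gt;
: To make the diagnostic concrete, here is a small R sketch (using made-up split frequencies, not actual MrBayes output) of how the average standard deviation of split frequencies is computed from two independent runs:&lt;br /&gt;
 # Hypothetical frequencies of four splits in run 1 and run 2&lt;br /&gt;
 run1 = c(0.95, 0.60, 0.30, 0.10)&lt;br /&gt;
 run2 = c(0.93, 0.64, 0.27, 0.12)&lt;br /&gt;
 # Standard deviation of each split frequency across the two runs,&lt;br /&gt;
 # averaged over splits; a small value means the runs sample similar trees&lt;br /&gt;
 mean(apply(rbind(run1, run2), 2, sd))&lt;br /&gt;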
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap states so that one of the heated chains becomes the cold one (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is still larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior both as the density itself and as the log of the density, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
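&lt;br /&gt;
: As a quick aside (not required for the exercise), the number of unrooted binary topologies for n taxa is (2n-5)!! = 3 * 5 * ... * (2n-5), which you can compute in R:&lt;br /&gt;
 n_unrooted_trees = function(n) prod(seq(3, 2 * n - 5, by = 2))&lt;br /&gt;
 n_unrooted_trees(5)     # 15 topologies for 5 taxa&lt;br /&gt;
 n_unrooted_trees(10)    # already more than 2 million&lt;br /&gt;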
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set.) This list highlights how Bayesian phylogenetic analysis differs from maximum likelihood: instead of finding the best tree(s), we quantify our degree of belief in all possible trees.&lt;br /&gt;
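&lt;br /&gt;
: To illustrate how a credible set is constructed (with made-up probabilities, not your actual results), you sort the trees by posterior probability and include them until the cumulative probability reaches the desired level:&lt;br /&gt;
 # Hypothetical posterior probabilities of the 15 possible topologies&lt;br /&gt;
 p = c(0.62, 0.21, 0.08, 0.04, 0.02, 0.01, 0.01, 0.005, 0.005, rep(0, 6))&lt;br /&gt;
 p_sorted = sort(p, decreasing = TRUE)&lt;br /&gt;
 cum_p = cumsum(p_sorted)&lt;br /&gt;
 # 95% credible set: smallest set of trees whose probabilities sum to at least 0.95&lt;br /&gt;
 which(cum_p &amp;gt;= 0.95)[1]    # number of trees in the credible set (here 4)&lt;br /&gt;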
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all the possible substitution models. Essentially, MrBayes now treats the substitution model itself as one more parameter, and uses MCMC to sample over the possible models (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, we use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, i.e., the posterior probability that the clade is present in the true tree (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites may evolve at different rates. With a gamma model, we allow rates to vary across sites but do not specify in advance which sites are fast or slow; instead, that pattern is inferred from the data. Here we instead use prior biological knowledge about the structure of the genetic code to divide sites into three classes: 1st, 2nd, and 3rd codon positions. We then allow each class to have its own rate, so that all 1st positions share one rate, all 2nd positions another, and all 3rd positions a third. Specifically, charset 1stpos=1-.\3 defines a character set named 1stpos consisting of site 1 followed by every third site (\3, i.e. sites 1, 4, 7, 10, …), continuing until the end of the alignment (denoted .).&lt;br /&gt;
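&lt;br /&gt;
: To see which alignment columns each charset picks out, you can reproduce the pattern in R (assuming here a hypothetical alignment of 12 sites):&lt;br /&gt;
 nsites = 12                # hypothetical alignment length&lt;br /&gt;
 seq(1, nsites, by = 3)     # 1stpos: sites 1, 4, 7, 10&lt;br /&gt;
 seq(2, nsites, by = 3)     # 2ndpos: sites 2, 5, 8, 11&lt;br /&gt;
 seq(3, nsites, by = 3)     # 3rdpos: sites 3, 6, 9, 12&lt;br /&gt;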
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
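&lt;br /&gt;
: To get a feeling for the flat dirichlet(1,1,1,1) prior, you can draw random frequency vectors from it in base R (one standard way is to normalize independent gamma draws; this snippet is only for illustration):&lt;br /&gt;
 ndraws = 5&lt;br /&gt;
 g = matrix(rgamma(ndraws * 4, shape = 1), ncol = 4)   # one Gamma(1) draw per frequency&lt;br /&gt;
 freqs = g / rowSums(g)                                # normalize so each row sums to 1&lt;br /&gt;
 colnames(freqs) = c(&amp;quot;A&amp;quot;, &amp;quot;C&amp;quot;, &amp;quot;G&amp;quot;, &amp;quot;T&amp;quot;)&lt;br /&gt;
 round(freqs, 3)    # each row is one draw from the flat prior over nucleotide frequencies&lt;br /&gt;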
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but summarizes parameters other than the tree topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ is a measure (also known as &amp;quot;R-hat&amp;quot;) which should be close to 1 if the runs have converged. Specifically, it measures whether different chains (and different parts of different chains) converge on sampling the same set of values.&lt;br /&gt;
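&lt;br /&gt;
: You can reproduce this kind of summary directly from the sample file. For example (assuming the run has finished so that hcvsmall.nexus.run1.p exists), the following R code computes the posterior mean, median and 95% credible interval for one rate parameter after discarding 25% burnin:&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1)&lt;br /&gt;
 burnin = floor(max(df_hcv$Gen) * 0.25)&lt;br /&gt;
 rAC = df_hcv %&amp;gt;% filter(Gen &amp;gt; burnin) %&amp;gt;% pull(`r(A&amp;lt;-&amp;gt;C){all}`)&lt;br /&gt;
 mean(rAC)                         # posterior mean&lt;br /&gt;
 median(rAC)                       # posterior median&lt;br /&gt;
 quantile(rAC, c(0.025, 0.975))    # 95% credible interval&lt;br /&gt;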
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
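&lt;br /&gt;
: The following small simulation (with made-up numbers, unrelated to the HCV data) illustrates the point: two parameters can have almost completely overlapping marginal distributions even though one of them is larger than the other in every single joint sample:&lt;br /&gt;
 set.seed(1)&lt;br /&gt;
 a = rnorm(1000, mean = 1.0, sd = 0.5)    # samples of parameter A&lt;br /&gt;
 b = a + 0.1                              # parameter B is always slightly larger than A&lt;br /&gt;
 range(a)    # the marginals cover nearly the same range ...&lt;br /&gt;
 range(b)&lt;br /&gt;
 mean(b &amp;gt; a)    # ... yet B exceeds A in 100% of the joint samples&lt;br /&gt;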
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
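: The ratio of the two counts estimates the posterior probability that r(AC) &amp;gt; r(CG); the same number can also be computed directly as the fraction of samples in which AC exceeds CG:&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)&lt;br /&gt;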
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=248</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=248"/>
		<updated>2026-03-19T09:56:56Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Analysis of Neanderthal data (posterior probability of clades) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separate the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, Mrbayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01 then you should answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc, and are sorted according to their posterior probability which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulated probability of trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of mrbayes we are using here, which means leaf names are not printed on the list of trees with probabilities. However, the most probable tree here in fact is identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequences number 5 through 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now considers the substitution model as one more parameter, and uses MCMC to sample from the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
: When the run has finished, issue this command to compute a consensus tree:&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probabilities that the corresponding clades are real (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning about site-specific rates from the data, we are here using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named &amp;quot;1stpos&amp;quot; which includes site 1 in the alignment followed by every third site (&amp;quot;\3&amp;quot;, meaning it includes sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted &amp;quot;.&amp;quot;).&lt;br /&gt;
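&lt;br /&gt;
: (For comparison, the corresponding site numbers can be generated in R; the alignment length of 300 used here is just a made-up example:)&lt;br /&gt;
 # Sites included in the 1stpos character set, for a hypothetical 300-column alignment&lt;br /&gt;
 seq(from = 1, to = 300, by = 3)&lt;br /&gt;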
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters, the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or uninformative prior dirichlet(1,1,1,1).&lt;br /&gt;
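&lt;br /&gt;
: (To get a feeling for the flat dirichlet(1,1,1,1) prior, you can draw random frequency vectors from it in base R by normalizing four independent Gamma(1) draws; this is only an illustration, not part of the exercise:)&lt;br /&gt;
 # One random draw from a flat Dirichlet(1,1,1,1) prior on the (A,C,G,T) frequencies&lt;br /&gt;
 x = rgamma(4, shape = 1)&lt;br /&gt;
 round(x / sum(x), 3)    # four positive values that sum to 1&lt;br /&gt;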
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ is a measure (also known as &amp;quot;R-hat&amp;quot;) which should be close to 1 if the runs have converged. Specifically, this measures whether different chains (and different parts of different chains) converge to sample the same set of values.&lt;br /&gt;
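&lt;br /&gt;
: (If you want to cross-check the ESS values reported by sump in R, the coda package - assumed to be installed, it is not part of the setup above - provides an effectiveSize function. A rough sketch for one of the rate parameters:)&lt;br /&gt;
 library(coda)    # assumed installed; not loaded as part of this exercise&lt;br /&gt;
 df_check = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1)&lt;br /&gt;
 keep = df_check$Gen &amp;gt; 0.25 * max(df_check$Gen)    # discard 25% as burnin&lt;br /&gt;
 effectiveSize(df_check$`r(A&amp;lt;-&amp;gt;C){all}`[keep])&lt;br /&gt;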
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
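&lt;br /&gt;
: (Equivalently, the fraction of posterior samples in which the AC rate exceeds the CG rate can be computed in one line; this is just a shortcut for the two row counts above:)&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)&lt;br /&gt;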
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=247</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=247"/>
		<updated>2026-03-19T09:53:42Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Analysis of Neanderthal data (posterior probability of clades) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data are produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments; for instance, one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance, the transition/transversion rate ratio kappa typically lies in the range 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. One could for instance use a lognormal distribution with suitable parameters.&lt;br /&gt;
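&lt;br /&gt;
: (As a small illustration - the specific numbers below are just one possible choice, not a recommendation - you can check in R where a candidate lognormal prior places 95% of its probability mass:)&lt;br /&gt;
 # 2.5% and 97.5% quantiles of a lognormal prior centred near kappa = 3&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog = log(3), sdlog = 0.95)    # roughly 0.5 and 19&lt;br /&gt;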
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make the shell window as wide as possible, and then issue the following command to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
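&lt;br /&gt;
: (A toy R sketch of how such a diagnostic could be computed, using three made-up splits with invented frequencies from the two runs; the real calculation in MrBayes is over all sampled splits:)&lt;br /&gt;
 # Hypothetical frequencies of three splits in run 1 and run 2&lt;br /&gt;
 freq_run1 = c(0.95, 0.52, 0.03)&lt;br /&gt;
 freq_run2 = c(0.93, 0.49, 0.05)&lt;br /&gt;
 # Standard deviation across the two runs for each split, then averaged over splits&lt;br /&gt;
 mean(apply(cbind(freq_run1, freq_run2), 1, sd))&lt;br /&gt;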
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01, you should answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit MrBayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
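: (For example, the following plots the trajectories of kappa and the tree length TL; any of the column names described above can be used:)&lt;br /&gt;
 mcmc_trace(df_primates, pars = c(&amp;quot;kappa&amp;quot;, &amp;quot;TL&amp;quot;))&lt;br /&gt;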
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below, relburnin=yes and burninfrac=0.25 tell MrBayes to discard 25% of the samples as burnin (you could also have given the number of samples to discard explicitly - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that remain after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis differs from maximum likelihood: instead of finding the best tree(s), we quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to use that option if you have many more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs, which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree here is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequences number 5 through 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now considers the substitution model as one more parameter, and uses MCMC to sample from the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probabilities that the corresponding clades are real (given the present data set and model).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning about site-specific rates from the data, we are here using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named &amp;quot;1stpos&amp;quot; which includes site 1 in the alignment followed by every third site (&amp;quot;\3&amp;quot;, meaning it includes sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted &amp;quot;.&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters, the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or uninformative prior dirichlet(1,1,1,1).&lt;br /&gt;
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size and is a measure of how many effectively independent samples you have from the posterior - the higher the better, and as a rule of thumb it should be at least 100. The column labeled PSRF+ is a measure (also known as &amp;quot;R-hat&amp;quot;) which should be close to 1 if the runs have converged. Specifically, this measures whether different chains (and different parts of different chains) converge to sample the same set of values.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you from these marginal distributions alone decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=246</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=246"/>
		<updated>2026-03-19T09:53:10Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Analysis of Neanderthal data (posterior probability of clades) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data are produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments; for instance, one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance, the transition/transversion rate ratio kappa typically lies in the range 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. One could for instance use a lognormal distribution with suitable parameters.&lt;br /&gt;
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the commands below, replace /path/to/molevol with the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. Like PAUP*, MrBayes is controlled by giving commands at a command-line prompt, and there is substantial overlap between the MrBayes command language and that of PAUP*. This should help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
: Make the shell window as wide as possible, and then issue the following command to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
&lt;br /&gt;
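: To get a feel for what this diagnostic measures, here is a small R sketch of roughly how it is computed, using made-up frequencies for four splits in two hypothetical runs (the numbers are purely illustrative):&lt;br /&gt;
 # Hypothetical frequencies of the same four splits in two independent runs&lt;br /&gt;
 split_freqs = tibble(&lt;br /&gt;
   run1 = c(0.98, 0.52, 0.47, 0.03),&lt;br /&gt;
   run2 = c(0.97, 0.55, 0.44, 0.04)&lt;br /&gt;
 )&lt;br /&gt;
 # The standard deviation of two numbers is |difference| / sqrt(2);&lt;br /&gt;
 # MrBayes averages this quantity across the observed splits&lt;br /&gt;
 split_freqs %&amp;gt;%&lt;br /&gt;
     mutate(split_sd = abs(run1 - run2) / sqrt(2)) %&amp;gt;%&lt;br /&gt;
     summarize(average_sd_of_split_freqs = mean(split_sd))&lt;br /&gt;
&lt;br /&gt;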
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
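&lt;br /&gt;
As a further hint, here is a minimal sketch (the branch lengths are made up) showing that the summed log prior density of several short branches can be positive under an Exp(rate = 10) prior; note that the actual LnPr in the .p file also includes the priors on the other parameters:&lt;br /&gt;
 # Seven hypothetical short branch lengths&lt;br /&gt;
 branch_lengths = rep(0.02, 7)&lt;br /&gt;
 # Each density value is larger than 1 ...&lt;br /&gt;
 dexp(branch_lengths, rate = 10)&lt;br /&gt;
 # ... so the summed log density is positive&lt;br /&gt;
 sum(dexp(branch_lengths, rate = 10, log = TRUE))&lt;br /&gt;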
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
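&lt;br /&gt;
: If you like, you can also look at the alpha samples after discarding the first 25% as burn-in. A small sketch using the data frame loaded above (mcmc_hist is another bayesplot command, which plots a histogram of the sampled values):&lt;br /&gt;
 df_primates_postburn = df_primates %&amp;gt;%&lt;br /&gt;
     filter(Gen &amp;gt; max(Gen) * 0.25)&lt;br /&gt;
 mcmc_trace(df_primates_postburn, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
 mcmc_hist(df_primates_postburn, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;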
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done.) This command gives you a summary of the trees in the file you examined manually above. The option contype=halfcompat requests that a majority-rule consensus tree be calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in the phylogram have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the leaves, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set.) This list highlights how Bayesian phylogenetic analysis differs from maximum likelihood: instead of finding the best tree(s), we quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis], proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible models (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyze the output files.)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to the screen: on the resolved branches, you will notice that numbers have been plotted. These are clade-credibility values; each is the posterior probability of that clade, given the model and the present data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Finally, we will turn away from the tree topology and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of the Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning about site-specific rates from the data, we use our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named &amp;quot;1stpos&amp;quot; which includes site 1 in the alignment followed by every third site (&amp;quot;\3&amp;quot;, meaning sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted &amp;quot;.&amp;quot;).&lt;br /&gt;
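&lt;br /&gt;
: As a quick sanity check of the every-third-site idea, the following R lines list the sites that would fall in each character set for a hypothetical 12-site alignment:&lt;br /&gt;
 # Hypothetical 12-site alignment: sites belonging to each codon position&lt;br /&gt;
 seq(from = 1, to = 12, by = 3)   # 1stpos: 1 4 7 10&lt;br /&gt;
 seq(from = 2, to = 12, by = 3)   # 2ndpos: 2 5 8 11&lt;br /&gt;
 seq(from = 3, to = 12, by = 3)   # 3rdpos: 3 6 9 12&lt;br /&gt;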
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating if the run has converged. Specifically ESS means Effective Sample Size, and is a measure of how many independent samples you have from the posterior - the higher the better, but this should be at least 100. The column labeled PSRF+ is a measure (also known as &amp;quot;R-hat&amp;quot;) which should be close to 1 if the runs have converged. Specifically this measures whether different chains (and different parts of different chains) converge to sample the same set of values. &lt;br /&gt;
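&lt;br /&gt;
: If you are curious, you can check some of these summaries directly from the sample file in RStudio. The sketch below assumes that the column holding the A&amp;lt;-&amp;gt;C rate is named r(A&amp;lt;-&amp;gt;C){all} (the name used further down in this exercise) and discards the first 25% of the samples as burnin:&lt;br /&gt;
 df_check = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1) %&amp;gt;%&lt;br /&gt;
     filter(Gen &amp;gt; max(Gen) * 0.25)&lt;br /&gt;
 df_check %&amp;gt;%&lt;br /&gt;
     summarize(mean = mean(`r(A&amp;lt;-&amp;gt;C){all}`),&lt;br /&gt;
               median = median(`r(A&amp;lt;-&amp;gt;C){all}`),&lt;br /&gt;
               lower = quantile(`r(A&amp;lt;-&amp;gt;C){all}`, 0.025),&lt;br /&gt;
               upper = quantile(`r(A&amp;lt;-&amp;gt;C){all}`, 0.975))&lt;br /&gt;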
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing (or integrating) the joint distribution over all values of the other parameters. The resulting single-parameter distribution is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. From these marginal distributions alone, can you decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
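&lt;br /&gt;
: Instead of comparing the two row counts by hand, you can also compute the fraction directly; this gives the same estimate of the posterior probability in one step:&lt;br /&gt;
 # Fraction of post-burnin samples in which the AC rate exceeds the CG rate&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     summarize(prob_AC_greater_than_CG = mean(AC &amp;gt; CG))&lt;br /&gt;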
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions (again discarding the burnin samples):&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
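&lt;br /&gt;
: To back up your reading of the density plot, you can also compute the posterior mean relative rate for each codon position:&lt;br /&gt;
 df_hcv3 %&amp;gt;%&lt;br /&gt;
     group_by(name) %&amp;gt;%&lt;br /&gt;
     summarize(posterior_mean_rate = mean(value))&lt;br /&gt;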
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=245</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=245"/>
		<updated>2026-03-19T09:51:49Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Analysis of Neanderthal data (posterior probability of clades) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments. For instance one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance the transition/transversion rate ratio kappa is typically 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. For instance one could use a lognormal distribution with suitable parameters.&lt;br /&gt;
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A “split” is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better)&lt;br /&gt;
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the two independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap, so that one of the heated chains becomes cold (and sampling then takes place from that chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
: At the end of the run, MrBayes will print the average standard deviation of split frequencies (a measure of how similar the tree samples of the two independent runs are). We recommend that you continue the analysis until the value gets below 0.01 (if the value is larger than 0.01, answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit mrbayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc, and are sorted according to their posterior probability which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulated probability of trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means that leaf names are not printed in the list of trees with probabilities. However, the most probable tree is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
For many years, there was considerable debate about the origin of modern humans. One view, often called the Multiregional Hypothesis, proposed that after Homo erectus spread from Africa into different parts of the world, regional populations gradually evolved into modern humans more or less in parallel. A different view, often called the Recent African Origin model, proposed that modern Homo sapiens evolved in Africa and later spread outward, largely replacing other archaic human groups such as the Neanderthals.&lt;br /&gt;
&lt;br /&gt;
Today it is clear that the history is more complicated than either simple extreme: modern humans arose in Africa, but there was also some interbreeding with Neanderthals and other archaic humans. However, in this exercise we will focus on a narrower question that can be addressed using a phylogeny of mitochondrial DNA: do the sampled Neanderthal and human mitochondrial sequences suggest that the Neanderthal sequence falls inside or outside modern human mitochondrial diversity?&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to examine this question.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to explore all the possible substitution models automatically. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As the last thing, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning about site-specific rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named &amp;quot;1stpos&amp;quot; which includes site 1 in the alignment followed by every third site (&amp;quot;\3&amp;quot;, meaning it includes sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted &amp;quot;.&amp;quot;).&lt;br /&gt;
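&lt;br /&gt;
: To see which alignment positions end up in each character set, here is a small sketch in base R (the alignment length of 30 is purely for illustration):&lt;br /&gt;
 alignment_length = 30                    # hypothetical length, for illustration only&lt;br /&gt;
 seq(1, alignment_length, by = 3)         # charset 1stpos: sites 1, 4, 7, 10, ...&lt;br /&gt;
 seq(2, alignment_length, by = 3)         # charset 2ndpos: sites 2, 5, 8, 11, ...&lt;br /&gt;
 seq(3, alignment_length, by = 3)         # charset 3rdpos: sites 3, 6, 9, 12, ...&lt;br /&gt;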
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and is a measure of how many independent samples you effectively have from the posterior - the higher the better, but it should be at least 100. The column labeled PSRF+ gives the Potential Scale Reduction Factor (also known as &amp;quot;R-hat&amp;quot;), which should be close to 1 if the runs have converged. Specifically, this measures whether different chains (and different parts of different chains) converge to sample the same set of values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you, from these marginal distributions alone, decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
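&lt;br /&gt;
: Instead of dividing the two row counts by hand, you can get the same fraction directly with a one-line check (this uses the df_hcv2 table defined above):&lt;br /&gt;
 mean(df_hcv2$AC &amp;gt; df_hcv2$CG)   # fraction of post-burnin samples in which r(AC) exceeds r(CG)&lt;br /&gt;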
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%   # use the same 25% burnin computed above&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=244</id>
		<title>Bayesian Phylogeny</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22115/index.php?title=Bayesian_Phylogeny&amp;diff=244"/>
		<updated>2026-03-19T09:28:41Z</updated>

		<summary type="html">&lt;p&gt;Gorm: /* Posterior probability of trees */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This exercise is part of the course  [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
&lt;br /&gt;
Today&#039;s exercise will focus on phylogenetic analysis using Bayesian methods.&lt;br /&gt;
&lt;br /&gt;
As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. This means that, for a given set of parameter values, you can compute the probability or [https://www.statlect.com/glossary/probability-density-function probability density] of any possible observation. For a full dataset, you then obtain the likelihood by multiplying these values across all observations. You will recall from the lecture that in Bayesian statistics the goal is to obtain a full posterior probability distribution over all possible parameter values. The posterior distribution quantifies our degree of belief in any possible parameter value after seeing the data. It is obtained by updating the prior probability distribution using the likelihood of the observed data.&lt;br /&gt;
&lt;br /&gt;
The prior probability distribution expresses your beliefs about the parameters before seeing any data, while the likelihood expresses what the observed data are telling you about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. We regard a parameter value as more plausible the more probable it makes the observed data. This is the same measure we have previously used to find the maximum likelihood estimate. If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability), then the posterior distribution is proportional to the likelihood, and the parameter value with the maximum likelihood also has the maximum posterior probability. However, even in this case, using a Bayesian approach still lets you interpret the result as a probability distribution over parameter values. &lt;br /&gt;
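&lt;br /&gt;
The following small R sketch illustrates this updating step for a single parameter on a grid (a toy binomial example, not part of the phylogenetic analysis). With a flat prior, the normalized posterior is proportional to the likelihood, so the posterior mode coincides with the maximum likelihood estimate:&lt;br /&gt;
 p_grid = seq(0, 1, by = 0.01)                     # possible values of a binomial parameter&lt;br /&gt;
 prior = rep(1, length(p_grid))                    # flat prior&lt;br /&gt;
 likelihood = dbinom(7, size = 10, prob = p_grid)  # probability of observing 7 successes in 10 trials&lt;br /&gt;
 posterior = prior * likelihood / sum(prior * likelihood)   # Bayes: prior times likelihood, normalized&lt;br /&gt;
 p_grid[which.max(posterior)]                      # posterior mode, equal to the ML estimate 0.7&lt;br /&gt;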
&lt;br /&gt;
If the prior is not flat, then it may have a substantial impact on the posterior, although this effect will usually diminish as the amount of data increases. A prior should ideally be based on domain knowledge and results from previous experiments; for instance, one can use the posterior from one analysis as the prior in a new, independent analysis. Often a prior is chosen to be weakly informative, meaning that it places reasonable bounds on the parameter values without constraining them too narrowly. For instance, the transition/transversion rate ratio kappa is typically in the range 1.5-10. Values such as 100, 1,000 or 1,000,000 would be extremely unlikely, so a weakly informative prior for this parameter could be chosen to place 95% of its probability mass in the 0.5-20 range, slightly wider than what we think of as plausible values. One could for instance use a lognormal distribution with suitable parameters.&lt;br /&gt;
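&lt;br /&gt;
As a sketch of how such a weakly informative prior could be chosen in practice (the exact numbers are just one possible choice), you can solve in R for lognormal parameters that place the central 95% of the probability mass between 0.5 and 20:&lt;br /&gt;
 lower = 0.5&lt;br /&gt;
 upper = 20&lt;br /&gt;
 meanlog = (log(lower) + log(upper)) / 2          # center of the interval on the log scale&lt;br /&gt;
 sdlog = (log(upper) - log(lower)) / (2 * 1.96)   # 95% of a normal lies within +/- 1.96 sd&lt;br /&gt;
 qlnorm(c(0.025, 0.975), meanlog, sdlog)          # check: approximately 0.5 and 20&lt;br /&gt;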
&lt;br /&gt;
In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as the transition/transversion rate ratio or the gamma shape parameter. The difference is that, whereas in maximum likelihood phylogeny we seek the best point estimates of the parameter values, in Bayesian phylogeny the goal is instead to infer a full probability distribution over the possible parameter values. The observed data are again usually taken to be the alignment, although strictly speaking it would be more reasonable to say that the sequences are what have been observed, and that the alignment should then be inferred jointly with the phylogeny.&lt;br /&gt;
&lt;br /&gt;
In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
&lt;br /&gt;
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).&lt;br /&gt;
 cd /path/to/molevol&lt;br /&gt;
 mkdir bayes&lt;br /&gt;
 cd bayes&lt;br /&gt;
 cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus&lt;br /&gt;
 cp ../data/neanderthal.nexus ./neanderthal.nexus&lt;br /&gt;
 cp ../data/hcvsmall.nexus ./hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load R libraries&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In RStudio: set the working directory to the bayes directory. Then issue these commands:&lt;br /&gt;
 library(magrittr)&lt;br /&gt;
 library(tidyverse)&lt;br /&gt;
 library(bayesplot)&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Posterior probability of trees ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: In today&#039;s exercise we will be using the program &amp;quot;MrBayes&amp;quot; to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.&lt;br /&gt;
&lt;br /&gt;
: Note that the command &amp;quot;help&amp;quot; will give you a list of all available commands. Issuing &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; will give you a more detailed description of the specified command along with current option values. This is similar to how &amp;quot;help &#039;&#039;command&#039;&#039;&amp;quot; works in PAUP.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start program&#039;&#039;&#039;&lt;br /&gt;
: In a terminal window, issue the command:&lt;br /&gt;
 mb&lt;br /&gt;
: This starts the program, giving you a prompt (&amp;quot;MrBayes&amp;gt; &amp;quot;) where you can enter commands.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Get a quick overview of available commands&#039;&#039;&#039;&lt;br /&gt;
 help&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load your sequences&#039;&#039;&#039;&lt;br /&gt;
 execute primatemitDNA.nexus&lt;br /&gt;
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect data set&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define outgroup&#039;&#039;&#039;&lt;br /&gt;
 outgroup Gibbon&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify your model of sequence evolution&#039;&#039;&#039;&lt;br /&gt;
 lset nst=2 rates=gamma&lt;br /&gt;
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.&lt;br /&gt;
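&lt;br /&gt;
: To get a feeling for what the gamma model of rate variation implies, the sketch below (to be run in RStudio, using the tidyverse packages loaded earlier) plots gamma densities with a few different shape parameters (alpha); the alpha values are arbitrary examples. Small alpha values correspond to strong rate variation among sites, whereas large values mean that most sites evolve at nearly the same rate.&lt;br /&gt;
 df_gamma = expand_grid(alpha = c(0.2, 1, 5),&lt;br /&gt;
                        rate_value = seq(0.01, 3, by = 0.01)) %&amp;gt;%&lt;br /&gt;
     mutate(density = dgamma(rate_value, shape = alpha, rate = alpha))   # mean rate fixed at 1&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_gamma) +&lt;br /&gt;
     geom_line(aes(x = rate_value, y = density, color = factor(alpha))) +&lt;br /&gt;
     labs(x = &amp;quot;Relative substitution rate&amp;quot;, y = &amp;quot;Density&amp;quot;, color = &amp;quot;alpha&amp;quot;)&lt;br /&gt;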
&lt;br /&gt;
&#039;&#039;&#039;Start Markov chain Monte Carlo sampling&#039;&#039;&#039;&lt;br /&gt;
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 nchains=3 diagnfreq=5000&lt;br /&gt;
: What you are doing here is to use the method known as MCMCMC (&amp;quot;Metropolis-coupled Markov chain Monte Carlo&amp;quot;) to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.&lt;br /&gt;
&lt;br /&gt;
: Let us examine the command in detail. First, ngen=1000000 samplefreq=100 lets the search run for 1,000,000 MCMC steps (&amp;quot;generations&amp;quot;) and saves parameter values once every 100 rounds (meaning that a total of 10,000 sets of parameter values will be saved to sample files). You sometimes need to run longer (or shorter) than 1,000,000, and would then typically tweak samplefreq so you get around 1,000 - 10,000 samples in all. The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one &amp;quot;cold&amp;quot; from which sampling takes place, and two &amp;quot;heated&amp;quot; that move around in the parameter space more quickly to find additional peaks in the probability distribution.&lt;br /&gt;
&lt;br /&gt;
: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees, but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are, specifically the average standard deviation of split frequencies. A &amp;quot;split&amp;quot; is the same as a bipartition, i.e. a division of all leaves in the tree into two groups, obtained by cutting an internal branch. For each split, MrBayes compares how often that split occurs in the two independent runs; if the runs have converged, these frequencies should be very similar, giving a small standard deviation. The program then averages this quantity across splits. As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).&lt;br /&gt;
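&lt;br /&gt;
: To make this diagnostic concrete, here is a small R sketch with made-up split frequencies for two runs (three splits, observed with slightly different frequencies in run 1 and run 2); for each split we take the standard deviation of its two frequencies and then average across splits:&lt;br /&gt;
 run1_freqs = c(0.95, 0.60, 0.45)                          # hypothetical split frequencies in run 1&lt;br /&gt;
 run2_freqs = c(0.93, 0.64, 0.41)                          # hypothetical split frequencies in run 2&lt;br /&gt;
 split_sds = apply(rbind(run1_freqs, run2_freqs), 2, sd)   # sd of each split across the two runs&lt;br /&gt;
 mean(split_sds)                                           # average standard deviation of split frequencies&lt;br /&gt;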
&lt;br /&gt;
: During the run you will see reports about the progress of the two independent runs, each consisting of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so that one of the heated chains becomes cold (and sampling then takes place from this chain).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Continue run until parallel runs converge on same solution&#039;&#039;&#039;&lt;br /&gt;
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01, you should answer &amp;quot;yes&amp;quot; when the program asks &amp;quot;Continue the analysis? (yes/no)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;MrBayes starts two independent runs from different random trees. Why is it useful to run two independent analyses instead of just one? How does the average standard deviation of split frequencies help you decide whether the two runs have converged to the same posterior distribution? At approximately how many generations does this happen in your run?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Have a look at the resulting sample files&#039;&#039;&#039;&lt;br /&gt;
: Open a new Terminal window (don&#039;t quit MrBayes in the other terminal yet!) and cd to the bayes directory. Open one of the parameter sampling files in a text editor:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.p &amp;amp;&lt;br /&gt;
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: &lt;br /&gt;
:* Gen: generation; number of MCMC steps taken so far&lt;br /&gt;
:* lnL: log likelihood of the current parameter estimates&lt;br /&gt;
:* LnPr: log of the prior probability&lt;br /&gt;
:* TL: tree length (sum of all branch lengths)&lt;br /&gt;
:* kappa: transition/transversion rate ratio&lt;br /&gt;
:* pi(A), pi(C), pi(G), pi(T): frequency of A, C, G, T&lt;br /&gt;
:* alpha: shape parameter for the gamma distribution. &lt;br /&gt;
&lt;br /&gt;
: (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial &amp;quot;burnin&amp;quot; period, before they settle near their most probable values. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;You will notice that lnL is always negative, while LnPr can sometimes be positive. At first sight this may seem impossible, since probabilities cannot be larger than 1. How can this happen?&lt;br /&gt;
&lt;br /&gt;
As a hint, note that (1) priors for continuous parameters are probability densities, and (2) the default prior for each branch length in MrBayes is an exponential distribution with rate 10. Use the following R code to plot this prior on both an ordinary y-axis and a log-scaled y-axis, and then explain why positive values of LnPr are possible.&lt;br /&gt;
&lt;br /&gt;
 df_expdist = tibble(&lt;br /&gt;
   x = seq(0, 1, by = 0.001),&lt;br /&gt;
   density = dexp(x, rate = 10),&lt;br /&gt;
   logdensity = log(dexp(x, rate = 10))&lt;br /&gt;
 )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = density)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 1, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_expdist, aes(x = x, y = logdensity)) +&lt;br /&gt;
   geom_line(color=&amp;quot;blue&amp;quot;) +&lt;br /&gt;
   geom_hline(yintercept = 0, lty=2) +&lt;br /&gt;
   labs(&lt;br /&gt;
     x = &amp;quot;Branch length&amp;quot;,&lt;br /&gt;
     y = &amp;quot;log of Probability density&amp;quot;,&lt;br /&gt;
     title = &amp;quot;Exponential prior on branch lengths: Exp(rate = 10)&amp;quot;&lt;br /&gt;
   )&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine MCMC trajectory for gamma shape parameter, alpha&#039;&#039;&#039;&lt;br /&gt;
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that points are visited according to their posterior probability (i.e., regions with high posterior probability are visited frequently). Now, in RStudio, plot the sampled values for the gamma shape parameter, alpha, for one of the run files:&lt;br /&gt;
 df_primates = read_tsv(&amp;quot;primatemitDNA.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 mcmc_trace(df_primates, pars=&amp;quot;alpha&amp;quot;)&lt;br /&gt;
: mcmc_trace is one of several plotting commands available in the bayesplot package. This command plots the sampled values of the parameter alpha from the first of the two parallel runs against MCMC generation number. Thus, the x-axis shows the progress of the run through time, with the leftmost values being the earliest samples and the rightmost values the later ones. Note how the Markov chain starts at the arbitrary value 1.0, rapidly moves to values that fit the observed data better, and then moves around in parameter space, sampling different plausible values of alpha. You can experiment with plotting other columns as well.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Describe briefly what happens to the sampled values of alpha during the run. Why is it reasonable to discard the earliest samples as burn-in?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate posterior probability distribution over trees&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Now, close the nedit window and have a look at the file containing sampled trees:&lt;br /&gt;
 nedit primatemitDNA.nexus.run1.t &amp;amp;&lt;br /&gt;
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical Newick format you have seen before. There are 5 taxa in the present data set, so the number of possible unrooted binary tree topologies is only 15. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.&lt;br /&gt;
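&lt;br /&gt;
: (The number 15 comes from the general formula for the number of unrooted binary trees on n labeled taxa, (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). A quick check in R:)&lt;br /&gt;
 n_taxa = 5&lt;br /&gt;
 prod(seq(1, 2 * n_taxa - 5, by = 2))   # (2n-5)!! = 1 * 3 * 5 = 15 unrooted trees for 5 taxa&lt;br /&gt;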
&lt;br /&gt;
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic used a relative burn-in of 25%, we will also discard the first 25% of tree samples when summarizing the posterior.&lt;br /&gt;
&lt;br /&gt;
: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).&lt;br /&gt;
 sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes&lt;br /&gt;
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bipartition of the taxa, and will typically not be present in every sampled tree). The cladogram also has &amp;quot;clade credibility&amp;quot; values. We will return to the meaning of these later in today&#039;s exercise.&lt;br /&gt;
&lt;br /&gt;
: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled &amp;quot;Tree 1&amp;quot;, &amp;quot;Tree 2&amp;quot;, etc, and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: instead of finding the best tree(s), we here quantify our degree of belief in all possible trees.&lt;br /&gt;
&lt;br /&gt;
: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to use that option if you have many more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).&lt;br /&gt;
&lt;br /&gt;
: &#039;&#039;&#039;NOTE&#039;&#039;&#039;: Annoyingly, there is a bug in the version of MrBayes we are using here, which means leaf names are not printed on the list of trees with probabilities. However, the most probable tree here is in fact identical to the consensus tree printed above it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability of the most probable tree? Does the analysis strongly support a single tree, or is the posterior probability distributed across several different trees?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Analysis of Neanderthal data (posterior probability of clades) ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 5&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The predominant theory in the 1950s and 1960s (although it varied greatly from scholar to scholar) was that our earliest hominid ancestors (specifically Homo erectus) evolved in Africa and then radiated out into the world. This so-called [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis] says that after H. erectus arrived in the various regions of the world hundreds of thousands of years ago, these regional populations slowly evolved into modern humans. The hypothesis thus posits that there were nearly independent origins of modern humans within the various regions of the world.&lt;br /&gt;
&lt;br /&gt;
In the 1970s, the anthropologist W.W. Howells proposed an alternative theory: the first Recent African Origin model. Howells argued that H. sapiens evolved solely in Africa. By the 1980s, growing data from human genetics led Stringer and Andrews to develop a model in which the earliest anatomically modern humans arose in Africa about 100,000 years ago, while the archaic populations found throughout Eurasia (including the Neanderthals) might be descendants of H. erectus and later archaic forms, but did not contribute to modern humans.&lt;br /&gt;
&lt;br /&gt;
We will use the present data set to consider this issue.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load Neanderthal data set&#039;&#039;&#039;&lt;br /&gt;
: In the Terminal where you have MrBayes running:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
execute neanderthal.nexus&lt;br /&gt;
delete 5-40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Investigate data&#039;&#039;&#039;&lt;br /&gt;
 showmatrix&lt;br /&gt;
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start analysis&#039;&#039;&#039;&lt;br /&gt;
 outgroup Pan_troglodytes&lt;br /&gt;
 lset nst=mixed rates=gamma&lt;br /&gt;
 mcmc ngen=500000 nchains=3 diagnfreq=10000&lt;br /&gt;
&lt;br /&gt;
: Here we use the option nst=mixed, which allows MrBayes to explore all the possible substitution models automatically. Essentially, MrBayes now treats the substitution model as one more parameter, and uses MCMC to sample over the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Find posterior probability of clades&#039;&#039;&#039;&lt;br /&gt;
 sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25&lt;br /&gt;
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set).&lt;br /&gt;
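&lt;br /&gt;
: If you would like to compute such clade probabilities yourself from the sampled trees, one possibility is sketched below (this assumes the ape package is installed; it is not among the packages loaded elsewhere in this exercise, and the file name simply follows the naming pattern of the other MrBayes output files):&lt;br /&gt;
 library(ape)&lt;br /&gt;
 trees = read.nexus(&amp;quot;neanderthal.nexus.run1.t&amp;quot;)   # all sampled trees from run 1&lt;br /&gt;
 trees = trees[-(1:floor(0.25 * length(trees)))]        # discard the first 25% as burnin&lt;br /&gt;
 splits = prop.part(trees)                              # bipartitions and how often they were sampled&lt;br /&gt;
 attr(splits, &amp;quot;number&amp;quot;) / length(trees)       # clade frequencies = posterior clade probabilities&lt;br /&gt;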
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What is the posterior probability that all sampled Homo sapiens sequences form a monophyletic group excluding the Neanderthal sequence? Does this support placing the Neanderthal outside modern human mitochondrial diversity?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
== Probability distributions over other parameters ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 6&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: As the last thing, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Load data set&#039;&#039;&#039;&lt;br /&gt;
 execute hcvsmall.nexus&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Define site partition&#039;&#039;&#039;&lt;br /&gt;
 charset 1stpos=1-.\3&lt;br /&gt;
 charset 2ndpos=2-.\3&lt;br /&gt;
 charset 3rdpos=3-.\3&lt;br /&gt;
 partition bycodon = 3:1stpos,2ndpos,3rdpos&lt;br /&gt;
 set partition=bycodon&lt;br /&gt;
 prset ratepr=variable&lt;br /&gt;
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning about site-specific rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named &amp;quot;1stpos&amp;quot; which includes site 1 in the alignment followed by every third site (&amp;quot;\3&amp;quot;, meaning it includes sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted &amp;quot;.&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Specify model&#039;&#039;&#039;&lt;br /&gt;
 lset nst=6&lt;br /&gt;
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.&lt;br /&gt;
&lt;br /&gt;
: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Inspect model details&#039;&#039;&#039;&lt;br /&gt;
 showmodel&lt;br /&gt;
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled &amp;quot;Statefreq&amp;quot;) have a &amp;quot;Dirichlet&amp;quot; prior. Without going into details, the Dirichlet distribution is a probability distribution over frequency vectors (i.e., vectors of positive values that sum to 1). Depending on the exact parameters the distribution can be more or less flat (flat here means that all sum-1 vectors are equally probable). The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).&lt;br /&gt;
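&lt;br /&gt;
: To get an intuition for the flat dirichlet(1,1,1,1) prior, you can draw a few frequency vectors from it in base R using a standard trick (normalizing independent gamma draws); every draw is a vector of four positive numbers that sum to 1, and no particular composition is favored:&lt;br /&gt;
 rdirichlet_flat = function() {&lt;br /&gt;
     x = rgamma(4, shape = 1)   # four independent Gamma(1) draws&lt;br /&gt;
     x / sum(x)                 # normalizing gives one draw from Dirichlet(1,1,1,1)&lt;br /&gt;
 }&lt;br /&gt;
 replicate(5, rdirichlet_flat())   # five random frequency vectors (one per column)&lt;br /&gt;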
&lt;br /&gt;
: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Start MCMC sampling&#039;&#039;&#039;&lt;br /&gt;
 mcmc ngen=1000000 samplefreq=100 diagnfreq=10000 nchains=3&lt;br /&gt;
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Compute summary of parameter values&#039;&#039;&#039;&lt;br /&gt;
 sump relburnin=yes burninfrac=0.25&lt;br /&gt;
: The sump command (with a &amp;quot;p&amp;quot; at the end) works much like the sumt command (with a &amp;quot;t&amp;quot; at the end), but for other parameters than the tree-topology. Again, we are using 25% of the total number of samples as burnin.&lt;br /&gt;
&lt;br /&gt;
: First, you get a scatter plot of the lnL as a function of generation number. Values from the two independent runs are labeled &amp;quot;1&amp;quot; and &amp;quot;2&amp;quot; respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.&lt;br /&gt;
&lt;br /&gt;
: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.&lt;br /&gt;
&lt;br /&gt;
: The last columns contain values indicating whether the run has converged. Specifically, ESS means Effective Sample Size, and is a measure of how many independent samples you effectively have from the posterior - the higher the better, but it should be at least 100. The column labeled PSRF+ gives the Potential Scale Reduction Factor (also known as &amp;quot;R-hat&amp;quot;), which should be close to 1 if the runs have converged. Specifically, this measures whether different chains (and different parts of different chains) converge to sample the same set of values. &lt;br /&gt;
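&lt;br /&gt;
: If you want to compute an effective sample size yourself for one of the sampled parameters, one option is sketched below (to be run in RStudio; this assumes the coda package is installed, which is not among the packages loaded earlier). Here we read the parameter file, discard the burnin, and compute the ESS for the tree length column:&lt;br /&gt;
 library(coda)&lt;br /&gt;
 df_ess = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip = 1) %&amp;gt;%&lt;br /&gt;
     filter(Gen &amp;gt; max(Gen) * 0.25)   # discard the first 25% as burnin, as in sump&lt;br /&gt;
 effectiveSize(df_ess$TL)              # ESS for the tree length; compare with the sump output&lt;br /&gt;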
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;What are the posterior mean values of the relative substitution rate parameters r(AC) and r(CG)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 7: &#039;&#039;&#039; Based on the reported posterior means, does r(CG) appear to be larger than r(AC)?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 8&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Marginal vs. joint distributions&#039;&#039;&#039;&lt;br /&gt;
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a &amp;quot;joint distribution&amp;quot; over the parameters.&lt;br /&gt;
&lt;br /&gt;
: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine marginal distributions&#039;&#039;&#039;&lt;br /&gt;
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.&lt;br /&gt;
 df_hcv = read_tsv(&amp;quot;hcvsmall.nexus.run1.p&amp;quot;, skip=1)&lt;br /&gt;
 burnin = df_hcv$Gen %&amp;gt;% &lt;br /&gt;
     max() %&amp;gt;% &lt;br /&gt;
     multiply_by(0.25) %&amp;gt;% &lt;br /&gt;
     floor()&lt;br /&gt;
 df_hcv2 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%&lt;br /&gt;
     select(CG = `r(C&amp;lt;-&amp;gt;G){all}`,&lt;br /&gt;
            AC = `r(A&amp;lt;-&amp;gt;C){all}`&lt;br /&gt;
            )&lt;br /&gt;
 mcmc_intervals(df_hcv2, prob_outer = 1)&lt;br /&gt;
 mcmc_areas(df_hcv2, prob_outer = 1)&lt;br /&gt;
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions. &lt;br /&gt;
&lt;br /&gt;
: You can also simply plot the data using ggplot:&lt;br /&gt;
 df_hcv2_long = pivot_longer(df_hcv2, cols = c(&amp;quot;CG&amp;quot;, &amp;quot;AC&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2_long) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question&#039;&#039;&#039;: Based on the marginal distributions, r(AC) appears to be centered at a higher value than r(CG), but the two distributions overlap somewhat. Can you, from these marginal distributions alone, decide whether r(AC) is larger than r(CG) in most posterior samples? Why or why not?&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 9&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Examine joint distributions&#039;&#039;&#039;&lt;br /&gt;
: These plots and results explore the relationship between the A&amp;lt;-&amp;gt;C and C&amp;lt;-&amp;gt;G rates.&lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y=AC)) + &lt;br /&gt;
     geom_point(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0,0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv2, aes(x=CG, y= AC)) + &lt;br /&gt;
     geom_hex(col=&amp;quot;blue&amp;quot;) + &lt;br /&gt;
     geom_abline(intercept=0, slope=1, lty=2, col=&amp;quot;red&amp;quot;) +&lt;br /&gt;
     xlim(0,0.25) + &lt;br /&gt;
     ylim(0, 0.25) + &lt;br /&gt;
     labs(x=&amp;quot;CG rate&amp;quot;, y =&amp;quot;AC rate&amp;quot;) &lt;br /&gt;
 &lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
 df_hcv2 %&amp;gt;%&lt;br /&gt;
     filter(AC&amp;gt;CG) %&amp;gt;%&lt;br /&gt;
     nrow()&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Based on the two different ways to plot the joint distribution and based on the unfiltered and filtered row counts, what is the posterior probability that r(AC) &amp;gt; r(CG)?&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 10&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.&lt;br /&gt;
&lt;br /&gt;
: Now, plot the relative substitution rates at the first, second, and third codon positions:&lt;br /&gt;
 df_hcv3 = df_hcv %&amp;gt;% &lt;br /&gt;
     filter(Gen &amp;gt; burnin) %&amp;gt;%   # use the same 25% burnin computed above&lt;br /&gt;
     select(Codon_1st = `m{1}`,&lt;br /&gt;
            Codon_2nd = `m{2}`,&lt;br /&gt;
            Codon_3rd = `m{3}` ) %&amp;gt;%&lt;br /&gt;
     pivot_longer(cols=c(&amp;quot;Codon_1st&amp;quot;, &amp;quot;Codon_2nd&amp;quot;, &amp;quot;Codon_3rd&amp;quot;))&lt;br /&gt;
 &lt;br /&gt;
 ggplot(df_hcv3) + &lt;br /&gt;
     geom_density(mapping=aes(x=value, fill=name), alpha=0.3) + &lt;br /&gt;
     labs(x=&amp;quot;Relative substitution rate&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;Since random mutations presumably hit all three codon positions with the same frequency, any differences are expected to be caused by subsequent selection. Which of the following statements are correct? (More than one answer may be correct.)&lt;br /&gt;
&lt;br /&gt;
:* Codon position 2 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 1 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most conserved codon position.&lt;br /&gt;
:* Codon position 3 is the most variable of the codon positions.&lt;br /&gt;
:* Codon position 2 is the most conserved codon position.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 11&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question: &#039;&#039;&#039;How does this result fit with your knowledge of the genetic code? Why are these codon positions the most conserved or the most variable?&lt;/div&gt;</summary>
		<author><name>Gorm</name></author>
	</entry>
</feed>