Models of Evolution

This exercise is part of the course Computational Molecular Evolution (22115).

Overview

In this exercise we will explore a number of different, but closely related, models of evolution. Using such models it is possible to estimate the number of unseen mutational events and thereby obtain genetic distances that have been corrected for superimposed substitutions. It is, however, important to realize that these corrections are based on the assumption that we observe approximately the expected amount of change - if, for instance, 20 mutational events end up leading to no observable changes then it is impossible to guess the actual amount of change regardless of which correctional scheme we employ. Using more and longer sequences helps ensuring that the observed change is closer to the expected change, so the correction is more likely to be accurate with adequate amounts of data. The same models also play an important role in phylogenetic reconstruction based on maximum likelihood and Bayesian techniques.

Getting started

1: Start Terminal window

2: Construct working directory

In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).

cd /path/to/molevol
mkdir models

3: Change to working directory

cd models

4: Copy files for exercise

cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus
cp ../data/titv.data ./titv.data

5: Inspect sequence file

nedit primatemitDNA.nexus &

This file contains an aligned set of mitochondrial DNA sequences from man, chimpanzee, gorilla, orangutan and gibbon. Mitochondria are cellular organelles that are bounded by a lipid membrane and contain their own genome. Mitochondrial DNA is related to certain bacterial genomes, and it is believed that the original mitochondrium was a primitive bacterial cell that was engulfed by an early ancestor of eukaryotic cells and that the pair subsequently went on to form a constant symbiotic relationship.

Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This makes it useful for investigating phylogenetic relationships between closely related species, such as the five primates included in the present data set. Close the nedit window when you are done.

6: Inspect additional data file

nedit titv.data &

This file contains a single header line and one column of numbers giving estimated times of divergence between man and chimpanzee, man and gorilla, man and orangutan, and man and gibbon. (Divergence times are in millions of years). This file will be used later in the exercise when we investigate how various distance measures increase over time. Note: If the nedit window is too narrow, then the column headings will wrap over two lines. Make sure to make the window as wide as possible in order to understand the structure of this file. Close the nedit window when you are done.

The Jukes and Cantor model

Question 1

The Jukes and Cantor model of evolution has the following rate matrix:

  | A   C   G   T |
-------------------
A | -   a   a   a | 
  |		  |
C | a   -   a   a |
  |		  |
G | a   a   -   a |
  |		  |
T | a   a   a   - |
-------------------

Start RStudio

We will use RStudio to explore some features of evolution occurring according to this model. Start by loading the tidyverse libraries:

library(tidyverse)

For the Jukes and Cantor model the following equation gives the probability, D, that a given site will display observable change, expressed as a function of branch length, d:

$D={\frac {3}{4}}\left(1-\exp \left(-{\frac {4}{3}}d\right)\right)$

Here, d is measured in substitutions per site. D is also the expected fraction of sites showing observable change along a branch of length d: if any single site has probability D of changing, then on average D * L sites will have changed in a sequence of length L. We will now explore how the expected amount of observable change depends on the branch length.

In RStudio enter the following (note: you may want to enter this in the script window, in the upper left of RStudio, so you can re-use the code later on):

df = tibble(
    d = seq(0,10,0.1),
    observed = 0.75*(1-exp(-1.33*d)),
    max = 0.75
)

dflong = pivot_longer(df, cols=-d)
ggplot(dflong, aes(x=d, y=value, col=name)) + 
    geom_line() + 
    geom_abline(slope=1, lty=2) +
    labs(title="Exp. observable differences", 
         x = "Actual distance (branch length)",
         y = "Observed distance") +
    ylim(0,1)

In this expression d is the the branch length (the actual amount of change that has occurred). The curve we have plotted thus gives the expected observed difference as a function of the actual amount of change.

Question: Which of the following statements are true?

Sequences can become no more than 75 % different according to the Jukes and Cantor model
The graph of the observed differences plateaus at 3/4
The Jukes and Cantor correction will have a limited effect when the branch length is large
For small branch lengths, the expected observable difference rises almost linearly and is very close to the real distance
The graph of the observed differences plateaus at 1

Question 2

Jukes and Cantor model: Examine estimated branch length as a function of observed difference

Above we examined how the (expected) observed distance depended on the real distance. We will now examine how the real distance can be estimated from the observed distance. This is done by solving the above equation for d, giving us an expression that allows us to estimate the real amount of change as a function of the observed change:

d=-{\frac {3}{4}}ln\left(1-{\frac {4}{3}}D\right)

Note that this correction will only work if the observed difference is approximately as expected. Consider this: In the dice-rolling simulation we found that if there has been 0.67 changes per site then the expected observed difference is 0.44. However, as you saw in the simulation, the actual observed difference can be different from the expected 0.44 (say, 0.33 or 0.58). If the observed difference is not the same as the expected observed difference, then we will obviously also get the wrong estimate of the real distance after correcting for multiple substitutions.

We will now plot (estimated) real change as a function of observed difference (this is the inverse of what you did before). In RStudio enter:

df = tibble(
    D = seq(0,0.749,0.01),
    real = -0.75*log(1-1.33*D)
)

ggplot(df, aes(x=D, y=real)) + 
    geom_line(col="blue") + 
    geom_abline(slope=1, lty=2) +
    labs(title="Estimated real distance", 
         x = "Observed difference",
         y = "Real distance") + 
    xlim(0,0.8)

The function "log" means the natural logarithm in R. Note how the correction becomes increasingly more important as the observed distance increases. Also note that this correction does not allow the observed distance to rise above 0.75, although that situation may arise in real data. Above 75% difference the corrected distance is not defined. When using JC corrected distances for phylogenetic reconstruction, you should therefore beware of this situation.

Question: Use the equation above to estimate the actual distance if the observed distance is 0.1, 0.4, and 0.6 respectively

The Kimura 2 parameter model

The Kimura 2 parameter model of evolution has the following rate matrix:

  | A   C   G   T |
-------------------
A | -   b   a   b | 
  |		  |
C | b   -   b   a |
  |		  |
G | a   b   -   b |
  |		  |
T | b   a   b   - |
-------------------

Note how transitions (A/G and C/T) have a different rate than transversions (A/C, A/T, C/G, and G/T). Based on this matrix, the expected ratio of transitions to transversions is:

$R={\frac {a}{2b}}$

meaning that if transitions and transversions had the same rate (Jukes and Cantor), then we would expect: $R=0.5$ . Empirically, this is typically not the case. In fact one often sees $R\geq 2$ and for mitochondrial DNA a typical value is $R=10$ (meaning that a is 20 times higher than b)! We will now use RStudio to explore some features of evolution occurring according to this model.

It can be shown that, under the K2P model, the chance of observing a transition and a transversion respectively depends on R and t in the following way:

$P_{\textrm {transition}}=0.25-0.5\exp(A*t)+0.25*\exp(B*t)$

$P_{\textrm {transversion}}=0.5-0.5*\exp(B*t)$

where

$A={\frac {-2R-1}{R+1}}$

$B={\frac {-2}{R+1}}$

Note that in these equations we have chosen to measure time in suitable units such that the overall rate of substitution ( $\mu =a+2b$ ) has the value 1 substitution per site per unit time. (An example: if $\mu =10^{-9}$ substitutions per site per year, then we would choose to measure time in billions of years, instead of in years. The substitution rate would now be 1 substitution per site per billion years). This means that the amount of change accumulated during t time units simply is: $D=1*t=t$

Question 3

Examine expected amount of change as a function of branch length We will now examine how the expected amount of transitions and transversions change with time when R=10. In the RStudio window enter the following:

R = 10
A = (-2*R-1.0)/(R+1.0)
B = (-2)/(R+1.0)

You can check the computed values of A and B by using the print command:

print(A)
print(B)

You should have obtained values of approximately A = -1.909 and B = -0.1818. You can now plot the curves showing how the expected amount of transitions and transversions change as a function of the branch length (the actual amount of change):

df = tibble(
    t = seq(0, 40, 0.1),
    Transitions = 0.25-0.5*exp(A*t)+0.25*exp(B*t),
    Transversions = 0.5 - 0.5 * exp(B*t),
    Total_dist = 0.25-0.5*exp(A*t)+0.25*exp(B*t) + 0.5 - 0.5 * exp(B*t)
)

dflong = df %>% pivot_longer(-t)

ggplot(dflong, aes(x=t, y=value, col=name)) + 
    geom_line() + 
    labs(title="Exp. observable differences", 
         x = "Real distance",
         y = "Observed difference") +
    geom_hline(yintercept = 0.25, col="blue", lty=2) + 
    geom_hline(yintercept = 0.5, col="blue", lty=2) + 
    geom_hline(yintercept = 0.75, col="blue", lty=2) + 
    ylim(0,1)

Several interesting things are going on in this plot. First of all, note that I have added a third curve showing the total observed difference. This is simply the sum of the observed transitions and transversions.

Second, as was the case for the Jukes and Cantor model, the total observed difference increases to a maximum value of 0.75 (corresponding to 25% similarity).

Third, note that the expected amount of transitional differences first rise rapidly and then decline slowly to an equilibrium value of 0.25. Transversional differences rise slowly to an equilibrium value of 0.5. The equilibrium values are determined by the fact that when sufficient time has passed sequence similarities will essentially be random; since there are twice as many possible transversions as transitions, these will in the end make up two thirds of all observed changes. Early on, before this stage is reached, the much higher rate of transitions will cause them to make up the vast majority of all observed changes, and only after considerable time has elapsed will the transversions catch up.

Question: From the plot, estimate the real distance (x-axis) at which the transition and transversion lines cross.

Question 4

Experiment with other transition/transversion rate ratios The exact behaviour of the relationship between the two types of change depends on the relative rates of transition and transversion. You should now repeat the above analysis with:

R=2
R=0.5

Remember to recompute A and B after entering the new value of R. Recall that R=0.5 means that transitions and transversion occur with the same rate, a=b. For each of these two cases rerun the plot command and consider the changes.

Question: Based on the two plots which of the following statements are true?

For R = 2, the transition and transversion lines cross at around 1.5 substitutions per site.
For R = 0.5, the transition and transversion lines never cross each other.
For both R=2 and R = 0.5, the transition and transversion lines never cross each other.
For R = 2, the transition and transversion lines cross at around 5 substitutions per site.

Question 5 When R=0.5, the Kimura 2 parameter model is in fact equivalent to another model - which one?

Question 6

Examine how apparent transition/transversion ratio changes with branch length The apparent transition transversion ratio is simply the observed number of transitions divided by the observed number of transversions. The following plot command shows this number as a function of branch length for the case R=10 (I have simply taken the expression for observed transitions and divided it by the expression for observed transversions):

df = tibble(
    t= seq(0.01, 4, 0.01),
    obsrat = (0.25-0.5*exp(-1.909*t)+0.25*exp(-0.1818*t))/ (0.5 - 0.5 * exp(-0.1818*t))
)

ggplot(df, aes(x=t, y=obsrat)) + 
    geom_line(col="blue") + 
    labs(x = "Real distance",
         y = "Observed transition/transversion ratio")

Note how the apparent ratio is close to the real ratio, R=10, when not much change has occurred (i.e., for small x).

The model of evolution that we have explored here is not a particularly complicated one - in fact it only has two free parameters. Nevertheless, you will by now appreciate that it is capable of displaying fairly un-intuitive behaviour. Stating our hypothesis about this biological system in explicit mathematical terms is what allowed us to explore this thoroughly.

Question: What value does the apparent transition/transversion ratio approach asymptotically? (You will need to construct a plot with a wider x-range to see this).

Analysis of mitochondrial data set

In this part of the exercise we will explore a real mitochondrial data set containing sequences from man, chimpanzee, gorilla, orangutan, and gibbon. We will investigate how the use of different models of evolution affects the estimated distance matrix. Since mitochondrial DNA is known to have very different transition and transversion rates, we will pay special attention to this aspect.

Question 7

Prepare editor window In a terminal window, enter:

nedit titv.data &

(Make sure to make the nedit window as wide as possible - otherwise the header line will be wrapped over two lines). This file contains a header line and a column listing the estimated divergence time between man and each of the other four primates (in millions of years). These estimates are associated with a fair amount of uncertainty, but the implied branching order is almost certainly correct. You will be using this file for entering various measures that you compute from the data file.

Start PAUP*, load data file

paup primatemitDNA.nexus

Define outgroup:

outgroup gibbon

Activate outgroup rooting and select how tree will be printed:

set root=outgroup outroot=monophyl

Select distance-based tree-reconstruction:

set criterion=distance

Select uncorrected distances under the least squares criterion:

dset distance=p objective=lsfit

Short digression on PAUP* online help system:

We interrupt this exercise for a brief announcement: By now you should be familiar with many of the commands used in PAUP, but you probably do not have an overview of the long list of possible options that can be specified. Fortunately, PAUP has a command that is useful in this context:

dset ?

Here I have used dset as an example, but typing any command followed by a question mark ("?") will give you a list of all the possible options for that command, along with a list of the current values. This is very useful if you want to experiment with different settings in an analysis. When you want to learn more about the individual settings, you can also check the command reference and manual, which are linked on the course wiki.

One final thing that may be good to know: PAUP* accepts abbreviated commands as long as the abbreviation is unambiguous. That means you can for instance write set crit=dist instead of the full set criterion=distance, and desc instead of describetrees.

Construct least squares tree:

alltrees

This is a small data set so we can use exhaustive searching.

Inspect tree:

describetrees all/plot=phylogram

This tree reflects our current belief about how these organisms are related

Print distance matrix, note distances from human:

showdist

The showdist command lists the distance matrix computed according to the currently active distance-setting (as specified in the dset command above).

Question: What are the p-distances for the following pairs of sequences: human/chimpanzee, human/gorilla, human/orangutan, human/gibbon.

Note: Also copy the entries giving the p-distance between human and each of the other four primates into the proper place in the titv.data file. (The numbers should all be in a single column under the "p_dist" header).

Question 8

Select uncorrected distances, counting only transitions:

dset subst=ti

The option subst=ti specifies that only transitional substitutions should be counted. The previously issued "distance=p" is still the active setting. You can verify this by typing "dset ?" and checking the value listed for distance.

Print distance matrix, note transitional distances from human:

showdist

In this distance matrix only the transitions have been counted for each pair of taxa.

Question: What are the transition-distances for the following pairs of sequences: human/chimpanzee, human/gorilla, human/orangutan, human/gibbon. Note: Also enter the numbers in the column labeled "Transitions(P)" in the file.

Question 9

Select uncorrected distances, counting only transversions:

dset subst=tv

Print distance matrix, note transversional distances from human:

showdist

Question: Again enter the distances from everything to human below (separated by spaces, and using at least two significant digits) and in the column labeled Transversions(Q) in the file.

Question 10

Compute JC-corrected distances: As we saw above, it is possible to come up with model-based corrections for the effect of multiple substitutions that allow us to estimate the real amount of change from the observed amount of change. For the JC-model, the equation for the corrected distance is:

d=-{\frac {3}{4}}ln\left(1-{\frac {4}{3}}D\right)

Question: For each of the four lines in the titv.data file, and based on the numbers in the column labeled p_dist, compute the JC-corrected distance. Enter the results in the column labeled "JC" in the titv.data file.

Question 11

Compute K2P corrected distance: As was the case for the JC model, we can also compute estimated real distances under the K2P model. This can be done using the following equation:

d=-{\frac {1}{2}}\ln(1-2P-Q)-{\frac {1}{4}}\ln(1-2Q)

Question: Using the numbers in columns P and Q, you should now use this equation to compute the K2P-corrected distance estimates. Enter the results in the column labeled K2P in the file. Make sure to save the file after all results have been entered.

Question 12

Plot distances In RStudio enter:

df = read_table2("titv.data", col_names = FALSE, skip=1)

df = df %>% rename(organisms=X1, 
                   div_time=X2, 
                   pdist=X3, 
                   transitions=X4, 
                   transversions=X5,
                   JC = X6, 
                   K2P = X7)

dflong = df %>% pivot_longer(cols=-c(organisms, div_time))

ggplot(dflong, aes(x=div_time, y=value, col=name)) + 
    geom_line() + 
    labs(x="Time since divergence (MY)", 
         y="Genetic distance (substitutions/site)")

We have here plotted the total difference, the observed transitional and transversional difference, as well as the JC- and K2P-corrected distances as a function of estimated divergence times.

Question: Do the two different correction schemes result in the same estimates of the real distance?

Models of Evolution

Contents

Overview

Getting started

The Jukes and Cantor model

The Kimura 2 parameter model

Analysis of mitochondrial data set

Navigation menu

Models of Evolution

Overview

Getting started

The Jukes and Cantor model

The Kimura 2 parameter model

Analysis of mitochondrial data set

Navigation menu

Search