Maximum Likelihood

From 22115
Jump to navigation Jump to search

This exercise is part of the course Computational Molecular Evolution (22115).


Overview

The data set you will work with here consists of an alignment of full length mitochondrial DNA from human (53 sequences), chimpanzee (1 sequence), bonobo (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically 38,000 year old bones found at Vindija in Croatia (all sequence data was taken from this paper: Green et al., Cell, 2008).

The view emerging from most anatomical, archaeological, and DNA-based studies places Neanderthals as a different species from Homo sapiens. This is in agreement with the "Out-of-Africa hypothesis", which states that Neanderthals coexisted with modern humans who originated in Africa somewhere between 100,000 to 200,000 years ago. There is, however, also some anatomical and paleontological research which supports the so-called "multi-regional hypothesis", which propounds that some populations of archaic Homo evolved into modern human populations in many geographical regions. Under this hypothesis, Neanderthals would be a sub-clade within human . We will use the present data set to consider this issue.

Getting started

Construct working directory, copy files

In a terminal window enter the following commands (remember to replace /path/to/molevol with your actual path to the molevol folder):

cd /path/to/molevol
mkdir likelihood
cd likelihood
cp ../data/neanderthal.nexus ./neanderthal.nexus

Question 1

Start PAUP and load data set:

paup neanderthal.nexus

Remove subset of sequences to reduce computational burden:

delete 5-40
This command removes 36 human sequences (sequence number 5 to sequence number 40) from the data set. We do this in order to reduce the time needed to finish the analysis. In the remaining data set we now have 17 human sequences, one chimpanzee, one bonobo, and one Neanderthal.

Specify substitution model In the analysis performed here, we have reason to believe that the Kimura 2 parameter model is a fair description of how the sequences evolve (i.e., transitions and transversions have separate rates). We furthermore have evidence that different sites evolve at quite different rates, and we want to model this using a gamma distribution. Moreover, we will request that the transition/transversion rate ratio and the gamma shape parameter are estimated from data. (Although we will not discuss the issue further at this point, it is important to realize that there are techniques for stringent selection of the best model, and that one should never just randomly select one. We will return to such techniques later in the course when we discuss model selection. For now, however, you should just accept that K2P + gamma is an adequate model for the present data set). To specify the substitution model, enter the following at the PAUP prompt:

set criterion=likelihood
lset nst=2 tratio=estimate basefreq=equal rates=gamma shape=estimate
In order to search for a maximum likelihood tree, we must first give a detailed description of the assumed substitution model. Since this is the first time we do this, I will give a rather thorough description of each part of the command.
First lset ("likelihood settings") is the command used in PAUP to specify likelihood models, just as dset was used to specify settings for the distance criterion.
Secondly, we specify that we want a model with two different types of substitution rates (nst=2) and where the frequency of each base is 25% (basefreq=equal). You will recognize this as the K2P model. Note that, by default, PAUP assumes that nst=2 means that we want to make a distinction between transitions and transversions. It is also possible to specify models with two types of substitutions that are NOT transitions and transversions respectively. One example would be: lset nst=6 rmatrix=estimate rclass=(a a a b b b). I will not explain this example in detail at this point.
Third, we request that the transition/transversion ratio should be estimated from the data (tratio=estimate).
Finally, we specify that we want to use a model where substitution rates at different sites follow a gamma distribution (rates=gamma), and that we want the shape of this distribution to also be estimated from the data (shape=estimate).

Specify outgroup and rooting options

outgroup Pan_troglodytes Pan_paniscus
set root=outgroup outroot=monophyl
The chimpanzee and the bonobo form the outgroup

Start heuristic search of tree space using nearest neighbor interchange (NNI)

hsearch swap=nni
This step may take a little while to finish (depending on your computer). For large datasets you sometimes have to wait hours or even days for a maximum likelihood analysis to finish. The score that PAUP lists for maximum likelihood analysis is the negative log of the likelihood, -ln(L). (Recall that since likelihoods are numbers between 0 and 1, log-likelihoods will be negative numbers, and therefore negative log-likelihoods will be positive numbers. Perhaps a bit confusing that PAUP doesn't simply list the ln(L) ). As the likelihood increases, this number will decrease.

Question: What is the negative log-likelihood, -ln(L), for the best tree found using NNI?


Question 2 Is the Neanderthal sequence placed inside or outside the clade of human sequences?