22115 - User contributions [en]

22115 - Computational Molecular Evolution

2024-03-19T14:00:11Z

WikiSysop:

; Overview [[File:Darwin logo2 medium.png |right|border|550px]]
: This page contains links to video lectures, computer exercises, and other material for the course [https://kurser.dtu.dk/course/22115 22115 - Computational Molecular Evolution], which is part of the [https://www.dtu.dk/english/education/msc/programmes/systems_biology MSc in Bioinformatics and Systems Biology] at the [https://www.dtu.dk/english Technical University of Denmark]. The course is taught by Professor Anders Gorm Pedersen, [https://www.healthtech.dtu.dk/english/Research/Research-Sections/Section-Bioinformatics Section for Bioinformatics], [https://www.healthtech.dtu.dk/english Department of Health Technology].

: The main goal of this course is to give an introduction to theory and algorithms in the field of computational molecular evolution. We will cover basic evolutionary theory (common descent, natural selection, genetic drift, models of growth and selection), and the main types of algorithms used for constructing and analyzing phylogenetic trees (parsimony, distance based methods, maximum likelihood methods, and Bayesian inference). We will also discuss the role of statistical modeling in science more generally.

:The course will consist of lectures, computer exercises, and mini-projects. The student will acquire practical experience in the use of a range of computational methods by analyzing sequences from the scientific literature.

__TOC__

=='''Computer setup'''==

===Linux===
:* [[Linux software installation]]


===Windows===
:* [[Windows software installation]]


===MacOS===
:* [[MacOS software installation]]



== '''Lecture Schedule''' ==

:([[27615 Previous course programs|Course programs, previous years]])

===Week 1 (January 31): Introduction to evolutionary theory and population genetics. Models of growth, selection and mutation===

:; Online lectures
:* [https://youtu.be/okjVaLA5S38 Common descent (11:52)]
:* [https://youtu.be/VkkIu1ZtaIE Natural selection (14:57)]
:* [https://youtu.be/wqa6W3_WW7s Evidence for evolution (part 1) (9:34)]
:* [http://y2u.be/_-a-F8egAis Evidence for evolution (part 2) (20:54)]
:* [http://y2u.be/AUGbSMWPILE Population growth and selection (18:13)]

:; Course material
:* [https://github.com/agormp/evolintro/blob/main/evolintro.pdf Lecture notes on evolutionary theory and population genetics]
:* [https://teaching.healthtech.dtu.dk/material/22115/slides_week1.pdf Slides, week 1]

:; Computer exercise
:* [[Population Growth, Fitness, and Selection]]

----

===Week 2 (February 7): Neutral mutations and genetic drift. Tree reconstruction by parsimony===

:; Online lectures
:* [https://youtu.be/cQVjL50kK0k Neutral Theory of Molecular Evolution (11:28)]
:* [https://youtu.be/J8LDUFm4ttA Genetic Drift (9:35)]
:* [https://youtu.be/AZkHWdl2oAw Trees: Terminology and Representation (9:41)]
:* [https://youtu.be/zCj1s9fmaKs Homology and Homoplasy (9:07)]
:* [https://youtu.be/gXb_WuLCD8g Maximum Parsimony (7:48)]
:* [https://youtu.be/Q7ZpdPCx0uQ The Fitch Algorithm (10:31)]
:* [https://youtu.be/deywW9wJXmw Searching Tree Space (14:01)]

:; Course material
:* [https://teaching.healthtech.dtu.dk/material/22115/slides_week2.pdf Slides, week 2]
:* [https://teaching.healthtech.dtu.dk/material/22115/Paup_Doc_31.pdf PAUP 3.1 manual (note: for older version - contains explanations of parsimony and tree moves)]
:* [https://teaching.healthtech.dtu.dk/material/22115/PAUP4-manual.pdf PAUP 4beta command reference]

:; Computer exercise
:* [[Phylogenetic Analysis using Parsimony]]
----

===Week 3 (February 14): Consensus trees. Distance matrix methods===

:; Online lectures
:* [https://www.youtube.com/watch?v=YXZZyu9OAcg Consensus Trees (16:27)]
:* [https://www.youtube.com/watch?v=MhjSSxcGjaY Distance Matrix Methods, part 1 (6:07)]
:* [https://www.youtube.com/watch?v=PNoUcQTCxiM Distance Matrix Methods, part 2 (22:28)]
:* [https://www.youtube.com/watch?v=Dj24mCLQYUE Neighbour Joining (15:28)]

:; Course material
:* [https://teaching.healthtech.dtu.dk/material/22115/Consensus.pdf Handout exercise: Consensus Trees]
:* [https://teaching.healthtech.dtu.dk/material/22115/Distance_handout.pdf Handout exercise: Distance Matrix Methods]
:* [https://teaching.healthtech.dtu.dk/material/22115/Slides_week3.pdf Slides, week 3]

:; Computer exercises
:* [[Consensus Trees]]
:* [[Distance Matrix Methods]]

----

===Week 4+5 (February 21 + 28): Mini project 1===

<--

Project description: [https://teaching.healthtech.dtu.dk/material/22115/Miniproject1_whales.pdf Building a tree from scratch: What are the closest relatives of whales?]

The mini project should be submitted and assessed via a peer assessment module that will become available on the course DTU Learn page.

Take this tree quiz to test yourself on your ability to accurately interpret evolutionary trees:
* [https://teaching.healthtech.dtu.dk/material/22115/Treequiz1.pdf Tree quiz]
Check your replies here:
* [https://teaching.healthtech.dtu.dk/material/22115/Treequiz1_answers.pdf Tree quiz with answers]

-->

----

===Week 6 (March 6): Models of sequence evolution. Likelihood methods===

:; Online lectures
:* [https://youtu.be/ro2MFmVZypQ Models of evolution (28:48)]
:* [https://youtu.be/xDKUIegYpWM Maximum likelihood (22:06)]
:* [https://youtu.be/Siau2o_egGI Ancestral reconstruction (10:45)]

:; Course material
:* [https://teaching.healthtech.dtu.dk/material/22115/Handout_real_exp_change.pdf Handout exercise: Real, Observed, and Expected Change]
:* [https://teaching.healthtech.dtu.dk/material/22115/Handout_likelihood.pdf Handout exercise: Computation of Likelihood]
:* [https://teaching.healthtech.dtu.dk/material/22115/Slides_week4.pdf Slides, week 6]
:* [https://teaching.healthtech.dtu.dk/material/22115/substitutionmodels.pdf Lecture notes: Substitution models]
:* [https://teaching.healthtech.dtu.dk/material/22115/main.pdf Optional lecture notes: Matrix exponentials for Markov chains]
:; Computer exercises
:* [[Models of Evolution]]
:* [[Maximum Likelihood]]

----

===Week 7 (March 13): Bayesian inference of phylogeny===

:; Online lectures
:* [https://www.youtube.com/watch?v=DI3TIx78NqM&t=12s Bayesian Inference (23:48)]
:* [https://youtu.be/uyG5DVigEyM?list=PLXwjzs_mabFrlRF7uALEomfGGckG0sG5y Markov chain Monte Carlo (19:54)]

:; Course material
:* [https://teaching.healthtech.dtu.dk/material/22115/Handout.class08.pdf Handout exercise: Bayesian estimation of model parameter value]
:* [https://teaching.healthtech.dtu.dk/material/22115/Slides_week5.pdf Slides, week 7]
:* [https://teaching.healthtech.dtu.dk/material/22115/MTN122.pdf An Introduction to Bayesian Statistics Without Using Equations]
:* [http://www.nature.com/nbt/journal/v22/n9/pdf/nbt0904-1177.pdf Background reading: "What is Bayesian statistics?"]
:* [http://rsta.royalsocietypublishing.org/content/roypta/361/1813/2681.full.pdf Background reading: "Bayesian computation: a statistical revolution"]

:; Computer exercise
:* [[Bayesian Phylogeny]]

----

===Week 8+9 (March 20 + April 3): Mini project 2===


----

===Week 10 (April 10): Model Selection===

:; Online lectures
:* [https://youtu.be/sJB2LmppZj8?list=PLXwjzs_mabFrlRF7uALEomfGGckG0sG5y Model selection, part 1 (15:19)]
:* [https://youtu.be/qSoDZ_33GbM Model selection, part 2 (17:20)]
:* [https://youtu.be/YYoo1vUO4ME Introduction to computer exercise: detection of selection (15:24)]

:; Course material
:* [https://teaching.healthtech.dtu.dk/material/22115/Slides_week6.pdf Slides, week 10]
:* [https://github.com/ddarriba/jmodeltest2/files/157130/manual.pdf jmodeltest manual]

:; Computer exercise
:* [[Model selection]]

----

===Week 11 (April 17): Bayesian Phylogenetics, Part 2 ===
:; Course material
:* [https://www.researchgate.net/publication/319965471_A_biologist%27s_guide_to_Bayesian_phylogenetic_analysis A biologist’s guide to Bayesian phylogenetic analysis]
:* [https://beast.community/analysing_beast_output Analysing BEAST output using Tracer]
:* [https://beast.community/tracer_convergence Identifying convergence problems using Tracer]
:* [https://taming-the-beast.org/tutorials/Troubleshooting/ Post-processing and improving performance]

:; Computer exercise
:* [[Bayesian phylogenetics: checking convergence]]
:* [[Bayesian phylogenetics: clock models]]

----

===Week 12 + 13 (April 24 + May 1): Mini project 3: Final exam===

'''Details will follow'''

----

Bayesian phylogenetics: clock models

2024-03-19T13:48:12Z

2024-03-19T13:44:30Z

WikiSysop: Created page

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Check convergence using Tracer ==

In this exercise you will be briefly introduced to how to check if an MCMC run has converged using the program Tracer from the BEAST2 package. You will do this by re-examining the output from the Bayesian analysis you did in the week 9 exercise.

----

'''Question 1'''

: Issue this command to start the Tracer program:
 tracer

: Now import the two MCMC sample files from the MrBayes run you did in week 9 for the hcvsmall data set:
:* File -> Import Trace File (or use the + under the trace file pane)
:* In the import dialog: find the "bayes" directory and select "All files" under "files of type". This should give you a list of the output files from the MrBayes run
:* Select the file "hcvsmall.nexus.run1.p" and open it.
:* Repeat the process for the second log file (suffix ".run2.p")

:::::[[File:Tracer fileload2.png |800px]]

: You can now use Tracer to explore the results of the Bayesian analysis. The first thing you want to check is that the two independent runs have resulted in similar posteriors for the different parameters. This is investigated as follows:
:* Select both trace files by shift-clicking on their names in the "Trace files" pane (upper left of the Tracer window)
:* Select the "Marginal Density" tab in the window on the right.
:* Check different parameters by choosing them in the "Traces" pane on the left (while making sure you still have both trace files selected). This will show the two posteriors for the chosen parameter (see example below). If a run has converged then the two posteriors should mostly be placed right on top of each other.
:* Note that Tracer by default uses a burn-in of 10% of the total number of generations. You can change that by double-clicking in the Burn-in field of the trace file pane (you need to change it separately for each file). Typically we would use a burn-in of 25% or 50%.

:::::[[File:Tracer marginals overlap.png|800px]]

:* The plot below shows an example where convergence has not occurred yet:

:::::[[File:Tracer lousy convergence.png|800px]]
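: The effect of the burn-in setting can be illustrated with a minimal Python sketch. This is not how Tracer is implemented. The sample values and the helper names <code>discard_burnin</code> and <code>summarize</code> are hypothetical and only serve to show that, once the early part of each chain is dropped, two converged runs should give similar summaries.

```python
def discard_burnin(samples, fraction=0.25):
    """Return the samples that remain after dropping the first `fraction` of the chain."""
    n_drop = int(len(samples) * fraction)
    return samples[n_drop:]


def summarize(samples):
    """Return the mean and a rough central 95% interval of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    mean = sum(ordered) / n
    lower = ordered[int(0.025 * (n - 1))]
    upper = ordered[int(0.975 * (n - 1))]
    return mean, lower, upper


# Hypothetical sampled values of one parameter from two independent runs.
# Both runs start far from the posterior and then settle around 0.30.
run1 = [0.90, 0.70, 0.50, 0.40, 0.31, 0.29, 0.30, 0.32, 0.28, 0.30, 0.31, 0.29]
run2 = [0.05, 0.10, 0.20, 0.27, 0.30, 0.31, 0.29, 0.30, 0.28, 0.32, 0.30, 0.31]

for name, run in (("run1", run1), ("run2", run2)):
    kept = discard_burnin(run, fraction=0.25)
    mean, lower, upper = summarize(kept)
    print(f"{name}: kept {len(kept)} of {len(run)} samples, "
          f"mean={mean:.3f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

: With the burn-in removed, the two hypothetical runs have nearly identical means, which corresponds to marginal densities lying on top of each other in Tracer.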

'''Question: '''Take screen dumps of the marginal posterior plots for the following parameters and include them in your report: m{1} and piA{all}

----

'''Question 2'''

:* Another thing to check is how the trace looks as a function of the iteration number: Optimally you would want a trace that looks like a "hairy caterpillar", with random jumps up and down on a mostly constant level (see example below).
:* Select the "Trace" tab in the window on the right to see trace plots (still with one or both trace files selected in the Trace File pane).
:* Related to this: The ESS column gives the "Effective Sample Size" for each parameter. As a rule of thumb we want this to be at least 200 (and Tracer flags smaller values by colouring the ESS values).
:** Briefly, the problem here is that consecutive samples from MCMC are correlated (they are not independent). This is due to the use of a Markov chain for sampling: the new position in parameter space depends on the previous location (and the proposal distribution).
:** The degree of non-independence can be quantified by the autocorrelation for different lags: The autocorrelation for lag k is found by computing the Pearson correlation between all samples and the samples k generations later.
:** Based on computation of the autocorrelation at different lags (<math>k = 1, 2, 3, \ldots</math>) Tracer determines the Auto-Correlation Time (ACT), which is the number of generations in the MCMC chain that two samples have to be separated by for them to be uncorrelated. The ACT for a parameter can be seen in the Estimates tab in Tracer.
:** Tracer also estimates the Effective Sample Size (ESS), which is the number of independent samples that the trace is equivalent to. This is essentially the chain length (excluding the burn-in) divided by the ACT.
:* Note how the highlighted parameter corresponding to the hairy caterpillar trace also has a high ESS in the example below.

:::::[[File:Tracer hairy caterpillar.png|800px]]

:* Trace plots where there are clearly visible dips and rises (see example below) indicate that there is autocorrelation among the samples we have included - the samples are not independent of each other (and therefore provide less information about the posterior). This is referred to as "poor mixing". One solution to such a problem is to increase the number of iterations (and perhaps write samples less frequently). It might also be an indication that the model fits poorly, and that you could get a better convergence by changing the substitution model, or setting more informative priors.
:* Note how the poorly mixing parameter in the example below also has a low ESS.

:::::[[File:Tracer ugly caterpillar.png|800px]]
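: The relation between autocorrelation, ACT, and ESS described above can be sketched in Python. This is a simplified estimator and is not claimed to be the algorithm Tracer uses: here the ACT is approximated as 1 plus twice the sum of the autocorrelations, truncated at the first non-positive value, and it is measured in units of saved samples rather than generations. The two simulated chains (one with independent samples, one "sticky" chain where each value depends strongly on the previous one) are hypothetical stand-ins for well-mixing and poorly mixing parameters.

```python
import random


def autocorrelation(samples, lag):
    """Correlation between the samples and the samples `lag` steps later."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    cov = sum((samples[i] - mean) * (samples[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(samples, max_lag=200):
    """Return (ESS, ACT) where ESS = number of samples / ACT.

    ACT is approximated as 1 + 2 * (sum of autocorrelations at lags 1, 2, ...),
    stopping at the first lag where the autocorrelation is no longer positive.
    """
    act = 1.0
    for lag in range(1, min(max_lag, len(samples) - 1) + 1):
        rho = autocorrelation(samples, lag)
        if rho <= 0:
            break
        act += 2 * rho
    return len(samples) / act, act


rng = random.Random(1)

# Well-mixing case: every sample is independent of the previous one
independent = [rng.gauss(0, 1) for _ in range(2000)]

# Poorly mixing case: each sample stays close to the previous one
sticky = [0.0]
for _ in range(1999):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0, 1))

ess_ind, act_ind = effective_sample_size(independent)
ess_sticky, act_sticky = effective_sample_size(sticky)
print(f"independent: ACT={act_ind:.1f}, ESS={ess_ind:.0f}")
print(f"sticky:      ACT={act_sticky:.1f}, ESS={ess_sticky:.0f}")
```

: Both chains contain 2000 samples, but the sticky chain has a much larger ACT and therefore a much smaller ESS, which is the pattern the ESS column in Tracer is meant to reveal.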

'''Question: '''Take screen dumps of the trace plots for the following parameters and include them in your report: m{1} and piA{all}. What is the ESS for these parameters?

Model selection

2024-03-19T13:43:35Z

WikiSysop: /* Analysis of viral data set: alignment of coding DNA */

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Overview ==

: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.

: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

: The role gp120 plays in infection, together with the fact that it is situated on the surface of the HIV particle, makes it an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that you will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.

: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.

: Specifically, you will

:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).
:# select a suitable nucleotide substitution model (using jmodeltest2)
:# construct a phylogenetic tree (using PAUP).
:# try to detect positively selected sites in gp120 (using PAML).

----

== Recipe for computing AIC values and model probabilities ==

: Later in today's exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.

:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.
:# Compute AIC for each of the models: '''AIC = -2 x lnL + 2K'''. For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model '''"AICmin"'''.
:# Compute the "ΔAIC" values for each model: '''ΔAIC = AIC - AICmin'''. That is, for each model subtract the minimum AIC value. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.
:# For each model compute the following quantity: '''numerator = exp(-0.5 x ΔAIC)'''. For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Also compute the '''sum''' of the numerator values for all models.
:# Finally, the model probabilities for each model are found as: '''P(model) = numerator / sum'''. For example, if sum = 2.5 and a model has numerator = 0.75, then it has P(model) = 0.75 / 2.5 = 0.30. (Each numerator lies between 0 and 1, since ΔAIC is never negative.)

: You may want to keep track of the computations by constructing a table along the following lines:
[[File:Molevol-Downloads-aictable.png|700px]]

: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).
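: The recipe can also be written as a short Python sketch, which may be handy for checking a hand-computed table. The model names and the lnL and K values below are hypothetical, and the function names <code>aic</code> and <code>model_probabilities</code> are chosen for this illustration only.

```python
import math


def aic(lnL, K):
    """AIC = -2 x lnL + 2K."""
    return -2 * lnL + 2 * K


def model_probabilities(models):
    """Map each model name to (AIC, deltaAIC, P(model)).

    `models` maps a model name to a tuple (lnL, K).
    """
    aics = {name: aic(lnL, K) for name, (lnL, K) in models.items()}
    aic_min = min(aics.values())
    numerators = {name: math.exp(-0.5 * (a - aic_min)) for name, a in aics.items()}
    total = sum(numerators.values())
    return {name: (aics[name], aics[name] - aic_min, numerators[name] / total)
            for name in models}


# Hypothetical fitted models: name -> (maximized lnL, number of free parameters K)
models = {
    "model A": (-2010.0, 5),
    "model B": (-2008.5, 6),
    "model C": (-2008.4, 7),
}

results = model_probabilities(models)
for name, (a, delta, prob) in results.items():
    print(f"{name}: AIC={a:.1f}  deltaAIC={delta:.1f}  P(model)={prob:.3f}")
```

: Note that the input is lnL (a negative number), not the -lnL value that some programs report, and that the probabilities always sum to 1 over the set of models that was compared.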

----

== Getting started ==

'''Create working directory, copy files'''
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).
 cd /path/to/molevol
 mkdir modelselect
 cd modelselect
 cp ../data/gp120.fasta ./gp120.fasta
 cp ../data/codeml.ctl ./codeml.ctl

'''Have a look at the DNA data file:'''
 nedit gp120.fasta &
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you've had a look.

----

== Analysis of viral data set: alignment of coding DNA ==

: DNA sequences are a lot less informative than protein sequences and for this reason it is always preferable to align coding DNA in translated form. The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous substitutions over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level.

: However, in the context of molecular evolution, DNA alignments retain a lot of useful information regarding silent mutations. Especially the ratio between silent and non-silent substitutions is informative. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.

: '''RevTrans''' takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment that is in accordance with the protein alignment. This also means that gaps are always inserted in groups of three so reading frames are kept in order. That is important if you want to analyze selection, as we will in this exercise.
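: The last step, using a protein alignment as a template for the DNA alignment, can be illustrated with a minimal Python sketch. This is not the RevTrans implementation. It assumes that the protein alignment has already been produced by some other tool, and the function name <code>thread_codons</code> and the toy sequences are hypothetical.

```python
def thread_codons(dna, aligned_protein, gap="-"):
    """Build an aligned DNA sequence from unaligned coding DNA and its aligned translation.

    Every residue in the protein alignment is replaced by the codon that encodes it,
    and every gap is replaced by three gap characters.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("DNA length does not match the aligned protein")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


# Toy example: unaligned coding DNA and the corresponding (already aligned) proteins
dna_seqs = {"seq1": "ATGGCTAAA", "seq2": "ATGAAA"}
protein_alignment = {"seq1": "MAK", "seq2": "M-K"}

dna_alignment = {name: thread_codons(dna_seqs[name], protein_alignment[name])
                 for name in dna_seqs}
for name, seq in dna_alignment.items():
    print(name, seq)
```

: Because each protein gap becomes exactly three DNA gap characters, all aligned DNA sequences keep the same reading frame, which is what the codon models used later in the exercise rely on.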

'''Construct RevTrans alignment'''
:* Open RevTrans server page: https://services.healthtech.dtu.dk/services/RevTrans-2.0/
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)
:* Click the "Submit query" button
:* When the alignment is done you may have to click the link named "here" to go to the results page
:* Download DNA alignment, by right-clicking the link for "Download alignment in FASTA format", and choosing "Save link as..." (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).

'''Convert alignment to NEXUS format'''
: Convert the fasta file to NEXUS format and save the file in the modelselect directory under the name gp120.nexus
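: The exercise does not prescribe a particular conversion tool, so use whichever method you normally use in the course. Purely as an illustration of what the conversion involves, the following Python sketch turns aligned FASTA text into a minimal NEXUS data block. The function names are hypothetical, and the sketch assumes that sequence names contain no spaces or other characters that would require quoting in NEXUS.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence name
            records.append([line[1:].split()[0], ""])
        else:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def to_nexus(records):
    """Return the aligned sequences as text in a minimal NEXUS data block."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("All sequences in an alignment must have the same length")
    nchar = lengths.pop()
    width = max(len(name) for name, _ in records)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(records)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in records]
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


# Toy aligned FASTA input (in the exercise the input would be gp120align.fasta)
fasta_text = """>seq1 example sequence
ATGGCTAAA
>seq2
ATG---AAA
"""

nexus_text = to_nexus(parse_fasta(fasta_text))
print(nexus_text)
```

: To apply the sketch to the exercise files, the text read from gp120align.fasta would be passed to <code>parse_fasta</code> and the returned NEXUS text written to gp120.nexus.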

----

== Selection of substitution model using jmodeltest2 ==

'''Question 1'''

: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.

'''Start jmodeltest2'''
 jmodeltest
'''Load data set'''
:* File -> Load DNA alignment -> File Format -> Select "All files"
:* Navigate to gp120.nexus and load it
'''Fit 56 models'''
:* Analysis -> Compute likelihood scores
:* Select "7" under "Number of substitution schemes"
:* Select "Fixed BIONJ-JC" under "Base tree for likelihood calculations"
:* Click "Compute likelihoods"

: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates ("+G"), and (2) by allowing for a proportion of constant ("invariable") sites ("+I").

: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).

'''Inspect result, manually check model probabilities for three models'''
:* Results -> Show results table
: For each model this table lists the negative log-likelihood ("-lnL"), the number of parameters ("p"), and estimates of all model parameters (excluding branch lengths).

'''Manually compute model probabilities for three substitution models'''
: Use AIC-based model probabilities to investigate which of the following three substitution models are best at describing how the sequences have evolved:
:* Jukes and Cantor with fraction of invariant sites (JC+I)
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.

:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).

'''Question: ''' Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above.

----

'''Question 2''' Based on the model probabilities: which model has more support?

----

'''Question 3'''
'''Use modeltest program to select best model'''
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:
:* Analysis -> Do AIC calculations ->
:* Select "Write PAUP* block"
:* Click "Do AIC calculations"
:* Results -> Show results table
:* Select "AIC" tab
:* SHIFT+click on the header of the "weight" column. This sorts the rows according to model weight, in descending order.

'''Question: '''What model was selected by modeltest based on the AIC values?

----

== Construction of phylogenetic tree using PAUP ==

'''Question 4'''

: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between "BEGIN PAUP" and "END;" and should look something like this:
 Lset Base=(0.4064, [...]
: You will need to copy this command to a PAUP session in the next step.

'''Start PAUP'''
 paup
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).

'''Load alignment:'''
 execute gp120.nexus

'''Set tree-building criterion to maximum likelihood'''
 set criterion=likelihood

'''Set model parameters to winning estimates'''
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.

 PASTE LSET COMMAND FROM MODELTEST RUN HERE

'''Find best tree using selected model'''
: Still in the PAUP-window, enter the following command
 hsearch swap=tbr start=nj
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "tree bisection and reconnection" type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes.

'''Save best tree to file'''
 savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1

'''Quit program'''
 quit

'''Have a look at the tree:'''
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:
 figtree gp120tree.phy &
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.

'''Question: '''What is the negative log likelihood of the tree you just found?

----

== Detection of positively selected sites in gp120 ==

'''Question 5'''

: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting result of probabilistic methods is that the parameters of a model will have their values determined as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn features about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model where the dN/dS ratio is one of its parameters.

: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data. This means you can use a stringent approach to determine which model fits the data best. In this framework one uses likelihoods (probabilities of data given model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and model probabilities from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.

: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:

:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS < 1, some with dN/dS=1.
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS > 1.
:* Assess the strength of evidence for the two models using AIC-based model probabilities
:* If M2 is better: identify the positively selected codons

'''Inspect the parameter file'''
 nedit codeml.ctl &
: The file "codeml.ctl" contains several settings that are relevant for running the program '''codeml'''. Find the following lines and ensure that the file contains these values:
 '''seqfile = gp120align.fasta''': name of alignment file
 '''treefile = gp120tree.phy''': name of tree file
 '''seqtype = 1''': tells the program that our data consists of coding DNA.
 '''NSsites = 1 2''': tells the program to analyze models M1 and M2.
 '''cleandata = 1''': tells the program to ignore positions with gaps.

: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS < 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS < 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS > 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS < 1 or dN/dS > 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.

'''Start the analysis'''
codeml
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).

'''Inspect result file:'''
: Wait for the run to finish, and then look at the result file:
nedit selection.results &
: This file contains a wealth of information concerning your analysis. The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:

'''Find likelihood, and number of free parameters for model M1'''
Search ==> Find... ==> enter "Model 1" and click Find
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:
lnL(ntime: 72 np: 74): -4242.470345 +0.000000
: Identify the number of "free parameters", K, used in model M1: This is indicated by "np", and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by "ntime", and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.

'''Question: '''What are the values of K and lnL for model M1?

----

'''Question 6'''

'''Find dN/dS ratios and codon class proportions for model M1:'''
: Scroll down a few lines until you get to a small table similar to this:
<pre>
dN/dS for site classes (K=2)
p: 0.75111 0.24889
w: 0.06583 1.00000
</pre>

: This gives a summary of the dN/dS ratios that were found in the data set. The line starting with w: lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000; the last one was pre-specified by us as part of the model and was therefore not a free parameter). The line starting with p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of all sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).

'''Question: '''What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)

----

'''Question 7'''

'''Find likelihood, and K for model M2'''
: Scroll past the M1 output until you get to the results for model M2.

'''Question: '''What are the values of K and lnL for model M2?

----

'''Question 8'''

'''Find dN/dS ratios and codon class proportions for model M2:'''
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.

'''Question: '''What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)

----

'''Question 9'''

'''Assess strength of evidence for models M1 and M2:'''
: M2 will always have a log-likelihood that is at least as high as that of M1, because M1 is nested within M2 and M2 has more free parameters. A higher log-likelihood on its own is therefore not evidence that M2 is the better model. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.

'''Question: ''' Report: AIC, ΔAIC, w (model probability) for M1 and M2

----

'''Question 10: ''' Is M2 better than M1?

----

'''Question 11'''

'''Examine list of positively selected sites'''
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the "Bayes Empirical Bayes" table, not the "Naive Empirical Bayes" table.

<pre>
Bayes Empirical Bayes (BEB) analysis
Positively selected sites

Prob(w>1) mean w

25 A 0.959* 3.133 +- 0.769
27 P 0.906 2.990 +- 0.877
56 K 0.987* 3.197 +- 0.687
59 V 0.915 3.032 +- 0.873
78 R 0.637 2.351 +- 1.129
88 K 0.573 2.148 +- 1.077
95 V 0.925 3.046 +- 0.843
...
</pre>

: It is not important what the distinction is in this context, but very briefly NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS-class. Also listed is the probability that the site really is in the codon class where dN/dS > 1, and a weighted average of the w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.

'''Question: '''List all sites having more than 95% probability of belonging to the positively selected class

File:Molevol-Downloads-aictable.png

2024-03-19T13:43:09Z

WikiSysop:

Model selection

2024-03-19T13:42:35Z

WikiSysop: Created page with "This exercise is part of the course Computational Molecular Evolution (22115). == Overview == : In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein. : Like other retroviruses, particles of HIV..."

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Overview ==

: In this exercise you are going to investigate features of HIV-1 evolution. You will do this by analyzing a large set of env-genes from HIV-1, subtype B. Specifically, the DNA sequences analyzed here correspond to a region surrounding the hypervariable V3 region of the gp120 protein.

: Like other retroviruses, particles of HIV are made up of 2 copies of a single-stranded RNA genome packaged inside a protein core, or capsid. The core particle also contains viral proteins that are essential for the early steps of the virus life cycle, such as reverse transcription and integration. A lipid envelope, derived from the infected cell, surrounds the core particle. Embedded in this envelope are the surface glycoproteins of HIV: gp120 and gp41. The gp120 protein is crucial for binding of the virus particle to target cells, while gp41 is important for the subsequent fusion event. It is the specific affinity of gp120 for the CD4 protein that targets HIV to those cells of the immune system that express CD4 on their surface (e.g., T-helper lymphocytes, monocytes, and macrophages).

: The role gp120 plays in infection and the fact that it is situated on the surface of the HIV particle, means it is an obvious target for the immune response. That means that there may be a considerable selective pressure on gp120 for creating immune-escape mutants, where amino acids in the gp120 epitopes have been substituted. In this exercise you will construct a maximum likelihood tree that we will subsequently use to investigate whether you can detect such a selective pressure on parts of gp120, again using maximum likelihood methods.

: One major goal with the exercise is to introduce you to statistically based methods for assessing the strength of evidence for a set of alternative hypotheses about some biological system of interest. The model selection method we will use is AIC (Akaike Information Criterion), based on which you will compute model probabilities. A second goal is to make you aware that phylogenetic analysis is not only about constructing trees, but that it is also a useful framework for analyzing biological questions more generally.

: Specifically, you will

:# perform a multiple alignment of gp120 DNA sequences taking protein-level information into account (using revtrans).
:# select a suitable nucleotide substitution model (using jmodeltest2)
:# construct a phylogenetic tree (using PAUP).
:# try to detect positively selected sites in gp120 (using PAML).

----

== Recipe for computing AIC values and model probabilities ==

: Later in today's exercise you will be asked to compute AIC values and model probabilities. Return to this section and follow the instructions when you need to do so.

:# Fit a set of models to your data, note the maximized log likelihoods (lnL) and the number of free parameters (K) for each model in the investigated set. The models you fit should represent a plausible and comprehensive set of hypotheses about your data.
:# Compute AIC for each of the models: '''AIC = -2 x lnL + 2K'''. For example: a model with lnL = -2010 and K = 5 will have AIC = -2 x -2010 + 2 x 5 = 4030.
:# Identify the model with the smallest AIC (this is the best model in the set). We will call the AIC for this model '''"AICmin"'''.
:# Compute the "ΔAIC" value for each model by subtracting the minimum AIC: '''ΔAIC = AIC - AICmin'''. The best model will have a ΔAIC of zero. The rest of the models will have positive ΔAICs.
:# For each model compute the following quantity: '''numerator = exp(-0.5 x ΔAIC)'''. For example, a model with ΔAIC=4.2 will have numerator = exp(-0.5 x 4.2) = exp(-2.1) = 0.1225. Note that the numerator is always between 0 and 1, and that it is exactly 1 for the best model. Also compute the '''sum''' of the numerator values for all models.
:# Finally, the model probabilities for each model are found as: '''P(model) = numerator / sum'''. For example, if sum = 1.6 and a model has numerator = 0.56, then it has P(model) = 0.56 / 1.6 = 0.35

: You may want to keep track of the computations by constructing a table along the following lines:
[[File:Molevol-Downloads-aictable.png|700px]]
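: The recipe above can also be sketched in a few lines of Python. This is only an illustration and not part of the course material: the function name <code>akaike_weights</code> and the three models with their lnL and K values are made up for the example.

```python
import math

def akaike_weights(models):
    """models maps a model name to (lnL, K).
    Returns a dict mapping each name to (AIC, deltaAIC, model probability)."""
    # Step 2: AIC = -2 x lnL + 2K
    aic = {name: -2.0 * lnl + 2.0 * k for name, (lnl, k) in models.items()}
    # Step 3: smallest AIC in the set
    aic_min = min(aic.values())
    # Step 4: deltaAIC = AIC - AICmin
    delta = {name: a - aic_min for name, a in aic.items()}
    # Step 5: numerator = exp(-0.5 x deltaAIC), and the sum of numerators
    numerator = {name: math.exp(-0.5 * d) for name, d in delta.items()}
    total = sum(numerator.values())
    # Step 6: P(model) = numerator / sum
    return {name: (aic[name], delta[name], numerator[name] / total)
            for name in models}

# Hypothetical values, chosen only to illustrate the computation
example = {"A": (-2010.0, 5), "B": (-2008.0, 6), "C": (-2015.0, 4)}
table = akaike_weights(example)
for name, (aic, delta, prob) in table.items():
    print(f"{name}: AIC={aic:.1f}  dAIC={delta:.1f}  P={prob:.3f}")
```

: Remember that the lnL values passed to the function must be log-likelihoods (negative numbers), not the negative log-likelihoods reported by some programs.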

: Note that model probabilities can also be computed using Bayesian methods. One advantage of Bayesian methods over AIC is that instead of relying on a point estimate, uncertainty about parameter values is accounted for by integrating over all possible values (typically using MCMC).

----

== Getting started ==

'''Create working directory, copy files'''
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).
cd /path/to/molevol
mkdir modelselect
cd modelselect
cp ../data/gp120.fasta ./gp120.fasta
cp ../data/codeml.ctl ./codeml.ctl

'''Have a look at the DNA data file:'''
nedit gp120.fasta &
: The file contains several DNA sequences from HIV-1, subtype B. The sequences are approximately 500 bp long, and correspond to a region surrounding the hypervariable V3 region in the gene encoding gp120. Close the nedit window when you've had a look.

----

== Analysis of viral data set: alignment of coding DNA ==

: DNA sequences are a lot less informative than protein sequences and for this reason it is always preferable to align coding DNA in translated form. The simple fact that proteins are built from 20 amino acids while DNA only contains four different bases, means that the 'signal-to-noise ratio' in protein sequence alignments is much better than in alignments of DNA. Besides this information-theoretical advantage, protein alignments also benefit from the information that is implicit in empirical substitution matrices such as BLOSUM-62. Taken together with the generally higher rate of synonymous substitutions over non-synonymous ones, this means that the phylogenetic signal disappears much more rapidly from DNA sequences than from the encoded proteins. It is therefore preferable to align coding DNA at the amino acid level.

: However, in the context of molecular evolution, DNA alignments retain a lot of useful information regarding silent mutations. Especially the ratio between silent and non-silent substitutions is informative. We would therefore like to construct a multiple alignment at the DNA level, but using information at the protein level, and the RevTrans server does exactly that.

: '''RevTrans''' takes as input an unaligned set of DNA sequences, automatically translates them to the equivalent amino acid sequences, constructs a multiple alignment of the protein sequences, and finally uses the protein alignment as a template for constructing a multiple DNA alignment that is in accordance with the protein alignment. This also means that gaps are always inserted in groups of three so reading frames are kept in order. That is important if you want to analyze selection, as we will in this exercise.
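: The final step described above (using the protein alignment as a template for the DNA alignment) can be sketched as follows. This is a minimal illustration of the idea, not the RevTrans implementation: the function name <code>thread_dna</code> and the toy sequences are assumptions, and the sketch assumes that each DNA sequence starts in frame with its protein.

```python
def thread_dna(aligned_protein, dna, gap="-"):
    """Thread an unaligned coding DNA sequence onto its aligned protein.
    Each residue is replaced by its codon, each protein gap by three DNA gaps."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) < 3 * n_residues:
        raise ValueError("DNA is too short for the protein sequence")
    # Codons in order of appearance; one is consumed per non-gap residue
    codons = (dna[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join(gap * 3 if aa == gap else next(codons)
                   for aa in aligned_protein)

# Toy example (hypothetical sequences)
aligned = {"seq1": "MK-L", "seq2": "MKTL"}
dna = {"seq1": "ATGAAACTG", "seq2": "ATGAAAACCCTG"}
dna_alignment = {name: thread_dna(aligned[name], dna[name]) for name in aligned}
for name, seq in dna_alignment.items():
    print(name, seq)
```

: Because gaps are only ever inserted as whole triplets, every column triplet in the DNA alignment corresponds to one column in the protein alignment.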

'''Construct RevTrans alignment'''
:* Open RevTrans server page: https://services.healthtech.dtu.dk/service.php?RevTrans-2.0/
:* On the RevTrans page: Choose the file gp120.fasta as input (or copy and paste the sequence into the sequence window)
:* Click the "Submit query" button
:* When the alignment is done you may have to click the link named "here" to go to the results page
:* Download DNA alignment, by right-clicking the link for "Download alignment in FASTA format", and choosing "Save link as..." (save file under the name gp120align.fasta and make sure to save the file in the directory modelselect).

'''Convert alignment to NEXUS format'''
: Convert the fasta file to NEXUS format and save file in the modelselect directory under the name gp120.nexus
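: If you do not have a conversion tool at hand, the format change can be sketched in Python as shown below. This is one possible approach under stated assumptions, not the method prescribed by the course: the function names are made up, sequence names are taken to be the first word of each FASTA header, and names are assumed to contain no characters that would require quoting in NEXUS.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping name to sequence (insertion order kept)."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}

def to_nexus(seqs):
    """Write aligned DNA sequences as a minimal NEXUS data block."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (unequal lengths)")
    nchar = lengths.pop()
    width = max(len(n) for n in seqs)
    lines = ["#NEXUS", "", "begin data;",
             f"    dimensions ntax={len(seqs)} nchar={nchar};",
             "    format datatype=dna missing=? gap=-;",
             "    matrix"]
    lines += [f"    {n.ljust(width)}  {s}" for n, s in seqs.items()]
    lines += ["    ;", "end;"]
    return "\n".join(lines) + "\n"

# Toy input; for the exercise you would instead read gp120align.fasta
# and write the result to gp120.nexus
toy_nexus = to_nexus(read_fasta(">a\nACG-\n>b\nAC\nGT\n"))
print(toy_nexus)
```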

----

== Selection of substitution model using jmodeltest2 ==

'''Question 1'''

: As part of the present analysis we are going to build a phylogenetic tree based on the DNA alignment constructed above. We will construct the tree using maximum likelihood, but to do that we first have to decide which substitution model we want to use. Specifically, we are interested in using the model that best describes our data without having more parameters than strictly necessary (thus avoiding overfitting). We will investigate this issue by fitting a set of 56 different models to our data and then selecting one with a reasonable balance between model complexity and data fit.

'''Start jmodeltest2'''
jmodeltest
'''Load data set'''
:* File -> Load DNA alignment -> File Format -> Select "All files"
:* Navigate to gp120.nexus and load it
'''Fit 56 models'''
:* Analysis -> Compute likelihood scores
:* Select "7" under "Number of substitution schemes"
:* Select "Fixed BIONJ-JC" under "Base tree for likelihood calculations"
:* Click "Compute likelihoods"

: This causes jmodeltest2 to perform the following actions: first a neighbor joining tree is constructed using the Jukes and Cantor model. Then the tree is fixed and used as the basis for fitting a set of 56 different models to the data. For each model, the estimated model parameters and the negative log-likelihood are recorded. In addition to varying sets of substitution rate parameters (JC, K2P, ...), some of these models also include extra parameters that take into account the presence of different rates between sites. This is done in two ways: (1) by fitting a gamma distribution of rates ("+G"), and (2) by allowing for a proportion of constant ("invariable") sites ("+I").

: Wait until jmodeltest2 is done fitting all 56 models (this will take a little while depending on your computer).

'''Inspect result, manually check model probabilities for three models'''
:* Results -> Show results table
: For each model this table lists the negative log-likelihood ("-lnL"), the number of parameters ("p"), and estimates of all model parameters (excluding branch lengths).

'''Manually compute model probabilities for three substitution models'''
: Use AIC-based model probabilities to investigate which of the following three substitution models is best at describing how the sequences have evolved:
:* Jukes and Cantor with fraction of invariant sites (JC+I)
:* Jukes and Cantor with gamma-distributed rates over sites (JC+G)
:* Jukes and Cantor with invariant sites and gamma-distributed rates (JC+I+G)
: Before you can do the computation you need to know the log likelihood and the number of parameters for each model. Locate these values in the table for the JC+I, JC+G, and JC+I+G models, and write them down. Close the window with the result table when you are done.

:Make sure to get the signs right: the values reported in the table are -lnL values, so you will need to reverse the sign to get the lnL (the lnL values you write down should be negative).

'''Question: ''' Use the recipe above to compute AIC values and model probabilities. Report the results in a table similar to the one shown above

----

'''Question 2''' Based on the model probabilities: which model has more support?

----

'''Question 3'''
'''Use modeltest program to select best model'''
: What you just did manually for JC+I, JC+G and JC+I+G, jmodeltest2 can do automatically for the full set of 56 fitted models. Specifically, it uses the list of negative log likelihoods and parameter counts in the table to compute AIC and model probabilities, and uses this to select the model that best fits the sequence data:
:* Analysis -> Do AIC calculations ->
:* Select "Write PAUP* block"
:* click "Do AIC calculations"
:* Results -> Show results table
:* Select "AIC" tab
:* SHIFT+click on the header of the "weight" column. This sorts the rows according to model weight, in descending order.

'''Question: '''What model was selected by modeltest based on the AIC values?

----

== Construction of phylogenetic tree using PAUP ==

'''Question 4'''

: Close the results table. In the main window you should now scroll up to the lines giving PAUP commands that will implement the selected model. The command is enclosed between "BEGIN PAUP" and "END;" and should look something like this:
Lset Base=(0.4064, [...]
: You will need to copy this command to a PAUP session in the next step.

'''Start PAUP'''
paup
: Above you used jmodeltest2 to select the most suitable substitution model for the present data set. You will now use this model to construct a maximum likelihood tree. You will use PAUP for this purpose. (note: it is possible to create a maximum likelihood or a model-averaged tree directly from the jmodeltest2 program, but we will instead do it in PAUP in order to more clearly see each step that is taken).

'''Load alignment:'''
execute gp120.nexus

'''Set tree-building criterion to maximum likelihood'''
set criterion=likelihood

'''Set model parameters to winning estimates'''
: Above you located a set of lines in the jmodeltest2 output giving a PAUP command that sets the model parameters to the estimates that were found using the winning model. Copy and paste this lset command (without the BEGIN and END parts) into the window where PAUP is running.

PASTE LSET COMMAND FROM MODELTEST RUN HERE

'''Find best tree using selected model'''
: Still in the PAUP-window, enter the following command
hsearch swap=tbr start=nj
: This command causes PAUP to perform a heuristic search for the best maximum likelihood tree. Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "tree bisection and reconnection" type (TBR). We are using the model selected by modeltest, AND the parameter estimates found by modeltest on that model. You could also have chosen to simply estimate all the model parameters as part of this step (i.e., at the same time as finding the best tree), but fixing them improves speed tremendously. Finding the best tree should take a few minutes.

'''Save best tree to file'''
savetrees format=newick brlens=yes file=gp120tree.phy from=1 to=1

'''Quit program'''
quit

'''Have a look at the tree:'''
: You have now produced an unrooted tree of the HIV sequences and saved it in the file gp120tree.phy. Note that in this exercise we will not be interested in the tree as such - our focus is instead on finding positive selection on a subset of codon positions and the tree is just something we need in order to be able to fit the different codon models to the data. If you want to see the tree, you can do so with the following command:
figtree gp120tree.phy &
: There is no meaningful root placed in this tree, so you may want to choose the unrooted view (the third icon in the Layout section of the figtree window). Close the figtree window when you have had a look.

'''Question: '''What is the negative log likelihood of the tree you just found?

----

== Detection of positively selected sites in gp120 ==

'''Question 5'''

: There is much more to phylogenetic analyses than merely reconstructing trees. One interesting feature of probabilistic methods is that the parameters of a model will have their values determined as part of the optimization procedure. This means that once such a model has been fitted to the data, it is possible to investigate these estimated parameter values to learn about the evolutionary history of the sequences under investigation. In the present example we will focus on investigating whether we can find positively selected sites in our data set, defined as sites where the dN/dS ratio is larger than 1. We do that by using a codon substitution model in which the dN/dS ratio is one of the parameters.

: A further strength of the probabilistic approach is that you get a probabilistic measure of how well any model fits the data. In this framework one uses likelihoods (probabilities of the data given the model) to determine which model fits the data best. As you saw above, it is for instance possible to compute AIC values and model probabilities from the likelihood values of fitted models. Since each model essentially corresponds to a hypothesis about the evolutionary history of the data, we can thus use a stringent statistical approach to decide which hypothesis best describes our data.

: In outline, you will now use the following steps to investigate whether there is any evidence for positively selected codons in your data set:

:* Fit model M1, which assumes there are two classes of codons in the sequence: some with dN/dS < 1, some with dN/dS=1.
:* Fit model M2, which assumes 3 distinct classes of codons: two with dN/dS ratios as for M1, and one extra class with dN/dS > 1.
:* Assess the strength of evidence for the two models using AIC-based model probabilities
:* If M2 is better: identify the positively selected codons

'''Inspect the parameter file'''
nedit codeml.ctl &
: The file "codeml.ctl" contains several settings that are relevant for running the program '''codeml'''. Find the following lines and ensure that the file contains these values:
'''seqfile = gp120align.fasta''': name of alignment file
'''treefile = gp120tree.phy''': name of tree file
'''seqtype = 1''': tells the program that our data consists of coding DNA.
'''NSsites = 1 2''' : tells the program to analyze models M1 and M2.
'''cleandata = 1''': tells the program to ignore positions with gaps.
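: If you prefer to check the settings programmatically rather than by eye, something along the following lines could be used. This is a sketch, not part of codeml: the function names are made up, and it assumes the usual control file layout of "name = value" lines where "*" starts a comment.

```python
# Values the exercise asks for (taken from the list above)
EXPECTED = {
    "seqfile": "gp120align.fasta",
    "treefile": "gp120tree.phy",
    "seqtype": "1",
    "NSsites": "1 2",
    "cleandata": "1",
}

def parse_ctl(text):
    """Parse 'name = value' lines; anything after '*' is treated as a comment."""
    settings = {}
    for line in text.splitlines():
        line = line.split("*", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            # Collapse repeated whitespace so that '1  2' matches '1 2'
            settings[key.strip()] = " ".join(value.split())
    return settings

def check_ctl(text):
    """Return the settings that differ from EXPECTED, mapped to the value found
    in the file (None if the setting is missing)."""
    found = parse_ctl(text)
    return {key: found.get(key) for key, value in EXPECTED.items()
            if found.get(key) != value}

# Toy control file content; for the exercise you would read codeml.ctl instead
sample = """
seqfile = gp120align.fasta * name of alignment file
treefile = gp120tree.phy
seqtype = 1
NSsites = 1 2
cleandata = 1
"""
print(check_ctl(sample))   # an empty dict means all five settings match
```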

: The settings entered by us will cause codeml to analyze two hypotheses about dN/dS ratios. M1 says there are two classes of codons with different dN/dS ratios in the sequence: one class with dN/dS < 1 (codons under purifying or negative selection), and one class with dN/dS=1 (no selection - neutrally evolving sites). M2 says there are 3 distinct dN/dS ratios for different sites in the sequence: one class with dN/dS < 1, one class with dN/dS=1 (these are the same type of classes as for M1), and one class with dN/dS > 1 (corresponding to sites under positive selection). The value of the dN/dS ratios (for those classes that have dN/dS < 1 or dN/dS > 1), the fraction of sites belonging to each class, and the position of sites belonging to each class, are unknown at first and will be determined during the analysis.

'''Start the analysis'''
codeml
: This will start the codeml program using the settings in the file codeml.ctl. Depending on your computer, this will take some minutes to finish. (You may be able to see how the optimization procedure results in progressively better fits: the likelihood increases, meaning that negative log-likelihood decreases, as the fit improves).

'''Inspect result file:'''
: Wait for the run to finish, and then look at the result file:
nedit selection.results &
: This file contains a wealth of information concerning your analysis. The top part of the file gives an overview of your sequences, codon usage and nucleotide frequencies. You can ignore this information for now, and move on to the interesting part, namely the model likelihoods and parameter values:

'''Find likelihood, and number of free parameters for model M1'''
Search ==> Find... ==> enter "Model 1" and click Find
: You are now in the region of the result file where the model likelihoods and parameter estimates are noted. Now, locate a line that looks a bit like the one shown below:
lnL(ntime: 72 np: 74): -4242.470345 +0.000000
: Identify the number of "free parameters", K, used in model M1: This is indicated by "np", and is 74 in the example shown above (most of these parameters are branch lengths in the tree; specifically, the number of branch length parameters is indicated by "ntime", and is 72 in this example). Also note the log-likelihood of the fitted model. This is the number right after the parenthesis, and is -4242.470345 in the example here.
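: The same values can be pulled out of such a line with a regular expression. The sketch below is an illustration only: the function name <code>parse_lnl</code> is made up, and the pattern assumes the line has the layout shown in the example above.

```python
import re

# Matches e.g. "lnL(ntime: 72 np: 74): -4242.470345 +0.000000"
LNL_LINE = re.compile(r"lnL\(ntime:\s*(\d+)\s+np:\s*(\d+)\):\s*(-?\d+\.\d+)")

def parse_lnl(line):
    """Return (ntime, np, lnL) from a codeml lnL line, or None if no match."""
    match = LNL_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))

ntime, K, lnL = parse_lnl("lnL(ntime: 72 np: 74): -4242.470345 +0.000000")
print(f"branch lengths: {ntime}, K = {K}, lnL = {lnL}")
```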

'''Question: '''What are the values of K and lnL for model M1?

----

'''Question 6'''

'''Find dN/dS ratios and codon class proportions for model M1:'''
: Scroll down a few lines until you get to a small table similar to this:
<pre>
dN/dS for site classes (K=2)
p: 0.75111 0.24889
w: 0.06583 1.00000
</pre>

: This gives a summary of the dN/dS ratios that were found in the data set. The line starting with w: lists the two dN/dS ratios that were found (in this case 0.06583 and 1.00000; the last one was pre-specified by us as part of the model and was therefore not a free parameter). The line starting with p: gives the proportion of codon sites belonging to each of the dN/dS ratio classes (in the example above approximately 75% of all sites belong to the first class, while 25% belong to the class having dN/dS=1.00000).

'''Question: '''What are the dN/dS value (w) and proportion (p) of sites for both classes? Report the following values: p(class1), w(class1), p(class2), w(class2)

----

'''Question 7'''

'''Find likelihood, and K for model M2'''
: Scroll past the M1 output until you get to the results for model M2.

'''Question: '''What are the values of K and lnL for model M2?

----

'''Question 8'''

'''Find dN/dS ratios and codon class proportions for model M2:'''
: Now, scroll down a few lines until you get to a small table similar to the one you examined for M1 before. For this model there are 3 separate classes of codons.

'''Question: '''What are the dN/dS value (w) and proportion (p) of sites for all three classes? Report these values: p(class1), w(class1), p(class2), w(class2), p(class3), w(class3)

----

'''Question 9'''

'''Assess strength of evidence for models M1 and M2:'''
: M2 will always have a log-likelihood that is at least as high as that of M1, because M1 is nested within M2 and M2 has more free parameters. A higher log-likelihood on its own is therefore not evidence that M2 is the better model. You should now use the recipe given above to compute AIC values and model probabilities for M1 and M2.
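: To see how AIC trades fit against complexity for a pair of nested models, it may help to write out the comparison. The derivation below follows directly from the AIC definition in the recipe; the subscripts M1 and M2 are simply labels for the two models.

```latex
\begin{align*}
\mathrm{AIC}_{M2} < \mathrm{AIC}_{M1}
  &\iff -2\ln L_{M2} + 2K_{M2} < -2\ln L_{M1} + 2K_{M1} \\
  &\iff \ln L_{M2} - \ln L_{M1} > K_{M2} - K_{M1}
\end{align*}
% With only two models in the set, the model probability of M2 becomes
\[
P(M2) = \frac{1}{1 + \exp\!\left(-\tfrac{1}{2}\,(\mathrm{AIC}_{M1} - \mathrm{AIC}_{M2})\right)}
\]
```

: In words: M2 only gets the lower AIC if its gain in log-likelihood is larger than the number of extra free parameters it uses.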

'''Question: ''' Report: AIC, ΔAIC, w (model probability) for M1 and M2

----

'''Question 10: ''' Is M2 better than M1?

----

'''Question 11'''

'''Examine list of positively selected sites'''
: If your M2 is clearly better than M1 (I firmly believe it should be if you did things according to instructions...), then you have evidence for the existence of positively selected sites in the gp120 gene. Now, scroll down to the end of the result file and locate a list similar to the one below. Note: This is the "Bayes Empirical Bayes" table, not the "Naive Empirical Bayes" table.

<pre>
Bayes Empirical Bayes (BEB) analysis
Positively selected sites

Prob(w>1) mean w

25 A 0.959* 3.133 +- 0.769
27 P 0.906 2.990 +- 0.877
56 K 0.987* 3.197 +- 0.687
59 V 0.915 3.032 +- 0.873
78 R 0.637 2.351 +- 1.129
88 K 0.573 2.148 +- 1.077
95 V 0.925 3.046 +- 0.843
...
</pre>

: It is not important what the distinction is in this context, but very briefly NEB ignores the fact that there is uncertainty about maximum likelihood estimates, especially for smaller data sets (for instance w for some codon is perhaps not exactly 3.046, but could be in a region around that value), while [https://pubmed.ncbi.nlm.nih.gov/15689528/ BEB accounts for that uncertainty].
: This gives you a list of the residues (if any) that were found to belong to the positively selected dN/dS-class. Also listed is the probability that the site really is in the codon class where dN/dS > 1, and a weighted average of the w at the site. Using only DNA sequences you have now identified likely epitopes on the gp120 protein.
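: For long tables it can be convenient to filter the rows by probability with a small script. The sketch below is only an illustration: the function names are made up, and it assumes that each row has the six whitespace-separated fields seen in the example table above (site, amino acid, probability with optional asterisks, mean w, "+-", standard error).

```python
def parse_beb_rows(text):
    """Return (site, amino acid, Prob(w>1), mean w) for each table row found."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        # Keep only lines with the six-field layout of the BEB table rows
        if len(tokens) == 6 and tokens[0].isdigit() and tokens[4] == "+-":
            prob = float(tokens[2].rstrip("*"))   # strip significance stars
            rows.append((int(tokens[0]), tokens[1], prob, float(tokens[3])))
    return rows

def selected_sites(text, cutoff=0.95):
    """Sites whose probability of belonging to the w > 1 class exceeds cutoff."""
    return [site for site, _, prob, _ in parse_beb_rows(text) if prob > cutoff]

# The rows from the example table above (not results from your own run)
example_table = """
25 A 0.959* 3.133 +- 0.769
27 P 0.906 2.990 +- 0.877
56 K 0.987* 3.197 +- 0.687
59 V 0.915 3.032 +- 0.873
78 R 0.637 2.351 +- 1.129
88 K 0.573 2.148 +- 1.077
95 V 0.925 3.046 +- 0.843
"""
example_sites = selected_sites(example_table)
print(example_sites)
```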

'''Question: '''List all sites having more than 95% probability of belonging to the positively selected class

Bayesian Phylogeny

2024-03-19T13:41:15Z

WikiSysop: Created page with "This exercise is part of the course Computational Molecular Evolution (22115). == Overview == Today's exercise will focus on phylogenetic analysis using Bayesian methods. As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. (This means that, for a given set of parameter values, you can compute the probability of any possible data set)...."

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Overview ==

Today's exercise will focus on phylogenetic analysis using Bayesian methods.

As was the case for likelihood methods, Bayesian analysis is founded on having a probabilistic model of how the observed data is produced. (This means that, for a given set of parameter values, you can compute the probability of any possible data set). You will recall from the lecture that in Bayesian statistics the goal is to obtain a full probability distribution over all possible parameter values. To find this so-called posterior probability distribution requires combining the likelihood and the prior probability distribution.

The prior probability distribution shows your beliefs about the parameters before seeing any data, while the likelihood shows what the data is telling about the parameters. Specifically, the likelihood of a parameter value is the probability of the observed data given that parameter value. (This is the measure we have previously used to find the maximum likelihood estimate). If the prior probability distribution is flat (i.e., if all possible parameter values have the same prior probability) then the posterior distribution is simply proportional to the likelihood distribution, and the parameter value with the maximum likelihood then also has the maximum posterior probability. However, even in this case, using a Bayesian approach still allows one to interpret the posterior as a probability distribution. If the prior is NOT flat, then it may have a substantial impact on the posterior although this effect will diminish with increasing amounts of data. A prior may be derived from the results of previous experiments. For instance one can use the posterior of one analysis as the prior in a new, independent analysis.
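The relationship between prior, likelihood, and posterior can be illustrated with a toy example that has nothing to do with phylogeny: estimating a single probability parameter on a grid of candidate values. Everything in the sketch below (the counts, the grid, the shape of the non-flat prior, and the function names) is made up for illustration.

```python
def grid_posterior(likelihood, prior):
    """Posterior on a grid: multiply likelihood by prior, then normalize."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def mode(grid, density):
    """Grid value with the highest density."""
    return max(zip(density, grid))[1]

# Candidate parameter values 0.01, 0.02, ..., 0.99
grid = [i / 100 for i in range(1, 100)]

# Hypothetical data: 7 successes and 3 failures
successes, failures = 7, 3
likelihood = [t**successes * (1 - t)**failures for t in grid]

# Flat prior: all parameter values equally probable a priori
flat_prior = [1.0] * len(grid)
# Non-flat prior concentrated around 0.5 (arbitrary choice)
peaked_prior = [t**10 * (1 - t)**10 for t in grid]

mode_ml = mode(grid, likelihood)
mode_flat = mode(grid, grid_posterior(likelihood, flat_prior))
mode_peaked = mode(grid, grid_posterior(likelihood, peaked_prior))
print(mode_ml, mode_flat, mode_peaked)
```

With the flat prior the posterior mode coincides with the maximum likelihood estimate, while the peaked prior pulls the posterior mode toward 0.5.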

In Bayesian phylogeny the parameters are of the same kind as in maximum likelihood phylogeny. Thus, typical parameters include tree topology, branch lengths, nucleotide frequencies, and substitution model parameters such as for instance the transition/transversion ratio or the gamma shape parameter. The difference is that while we want to find the best point estimates of parameter values in maximum likelihood, the goal in Bayesian phylogeny is instead to find a full probability distribution over all possible parameter values. The observed data is again usually taken to be the alignment, although it would of course be more reasonable to say that the sequences are what have been observed (and the alignment should then be inferred along with the phylogeny).

In this exercise we will explore how one can determine and use posterior probability distributions over trees, over clades, and over substitution parameters. We will also touch upon the difference between marginal and joint probability distributions.

== Getting started ==

: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).
cd /path/to/molevol
mkdir bayes
cd bayes
cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus
cp ../data/neanderthal.nexus ./neanderthal.nexus
cp ../data/hcvsmall.nexus ./hcvsmall.nexus

: You have analyzed (versions of) all these data files previously in this course. We will now use Bayesian phylogenetic analysis to complement what we learned in those analyses.

'''Load R libraries'''

: In RStudio: set the working directory to the bayes directory. Then issue these commands:
library(magrittr)
library(tidyverse)
library(bayesplot)

----

== Posterior probability of trees ==

'''Question 1'''

: In today's exercise we will be using the program "MrBayes" to perform Bayesian phylogenetic analysis. MrBayes is a program that, like PAUP*, can be controlled by giving commands at a command line prompt. In fact, there is a substantial overlap between the commands used to control MrBayes and the PAUP command language. This should be a help when you are trying to understand how to use the program.

: Note that the command "help" will give you a list of all available commands. Issuing "help ''command''" will give you a more detailed description of the specified command along with current option values. This is similar to how "help ''command''" works in PAUP.

'''Start program'''
mb
: This starts the program, giving you a prompt ("MrBayes> ") where you can enter commands.

'''Get a quick overview of available commands'''
help

'''Load your sequences'''
execute primatemitDNA.nexus
: This file contains mitochondrial DNA sequences from 5 different primates. Note that MrBayes accepts input in nexus format, and that this is the same command that was used to load sequences in PAUP*. In general, you can use many of the PAUP commands in MrBayes also.

'''Inspect data set'''
showmatrix

'''Define outgroup'''
outgroup Gibbon

'''Specify your model of sequence evolution'''
lset nst=2 rates=gamma
: This command is again very much like the corresponding one in PAUP. You are specifying that you want to use a model with two substitution types (nst=2), and this is automatically taken to mean that you want to distinguish between transitions and transversions. Furthermore, rates=gamma means that you want the model to use a gamma distribution to account for different rates at different sites in the sequence.

'''Start Markov chain Monte Carlo sampling'''
:Make sure to make the shell window as wide as possible and then issue the following commands to start the run:
mcmc ngen=1000000 samplefreq=1000 nchains=3 diagnfreq=5000
: What you are doing here is to use the method known as MCMCMC ("Metropolis-coupled Markov chain Monte Carlo") to empirically determine the posterior probability distribution of trees, branch lengths and substitution parameters. Recall that in the Bayesian framework this is how we learn about parameter values: instead of finding the best point estimates, we typically want to quantify the probability of the entire range of possible values. An estimate of the time left is shown in the last column of output.

: Let us examine the command in detail. First, ngen=1000000 samplefreq=1000 lets the search run for 1,000,000 steps ("generations") and saves parameter values once every 1000 rounds (meaning that a total of 1000 sets of parameter values will be saved). The option nchains=3 means that the MCMCMC sampling uses 3 parallel chains (but see below): one "cold" from which sampling takes place, and two "heated" that move around in the parameter space more quickly to find additional peaks in the probability distribution.

: The option diagnfreq=5000 has to do with testing whether the MrBayes run is successful. Briefly, MrBayes will start two entirely independent runs starting from different random trees. In the early phases of the run, the two runs will sample very different trees but when they have reached convergence (when they produce a good sample from the posterior probability distribution), the two tree samples should be very similar. Every diagnfreq generations, the program will compute a measure of how similar the tree samples are. Specifically, the measure is the average standard deviation of split frequencies. (A "split" is the same as a bipartition, i.e., a division of all leaves in the tree into two groups, obtained by cutting an internal branch). As a rule of thumb, you may want to run until this value is less than 0.05 (the smaller the better).

: During the run you will see reports about the progress of the two sets of three chains. Each line of output lists the generation number and the log likelihoods of the current tree/parameter combination for each of the two groups of three chains (a column of asterisks separates the results for the independent runs). The cold chains are the ones enclosed in brackets [...], while the heated chains are enclosed in parentheses (...). Occasionally the chains will swap so one of the heated chains now becomes cold (and sampling then takes place from this chain).
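
: The idea behind the average standard deviation of split frequencies can be sketched in R as follows. The split names and frequencies are made up, and the calculation is a simplified assumption: MrBayes may, for instance, ignore very rare splits or compute the standard deviation slightly differently.

```r
# Hypothetical split frequencies: rows = splits, columns = independent runs.
# Each value is the fraction of sampled trees in that run containing the split.
split_freqs <- rbind(
  "Human+Chimpanzee"         = c(run1 = 0.98, run2 = 0.96),
  "Human+Chimpanzee+Gorilla" = c(run1 = 1.00, run2 = 1.00),
  "Human+Gorilla"            = c(run1 = 0.02, run2 = 0.04)
)

# Standard deviation of each split's frequency across the runs
split_sd <- apply(split_freqs, 1, sd)

# Average over all splits: values near zero mean the runs agree
asdsf <- mean(split_sd)
print(asdsf)
```

: If the two runs sample the same splits with nearly the same frequencies, every per-split standard deviation is small, and so is the average.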

'''Continue run until parallel runs converge on same solution'''
:At the end of the run, MrBayes will print the average standard deviation of split frequencies (which is a measure of how similar the tree samples of the two independent runs are). We recommend that you continue with the analysis until the value gets below 0.01 (if the value is larger than 0.01 then you should answer "yes" when the program asks "Continue the analysis? (yes/no)").

'''Question: '''Once you have reached convergence (and answered "no" to continue the analysis): How many generations did you have to run?

----

'''Question 2'''

'''Have a look at the resulting sample files'''
: Open a new Terminal and cd to the bayes directory. Open one of the parameter sampling files in an nedit window:
nedit primatemitDNA.nexus.run1.p &
: This file contains one line for each sampled point (you may want to turn off line-wrapping in nedit under the preferences menu). Each row corresponds to a certain sample time (or generation). Each column contains the sampled values of one specific parameter. The first line contains headings telling what the different columns are: "lnL" is the log likelihood of the current parameter estimates, "TL" is the tree length (sum of all branch lengths), "kappa" is the transition/transversion rate ratio, "pi(A)" is the frequency of A (etc.), and "alpha" is the shape parameter for the gamma distribution. (Column headings may be shifted relative to their corresponding columns). Note how the values of most parameters change a lot during the initial "burnin" period, before they settle near their most probable values. Now, close the nedit window and have a look at the file containing sampled trees:
nedit primatemitDNA.nexus.run1.t &
: Tree topology is also a parameter in our model, and exactly like for the other parameters we also get samples from tree-space. One tree is printed per line in the parenthetical format used by most phylogeny software. There are 5 taxa in the present data set, meaning that the tree-space consists of only 15 different possible trees. Since we have taken more than 15 sample points, there must be several lines containing the same tree topology. Close the nedit window when you are done.
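
: The figure of 15 trees for 5 taxa follows from the standard formula for the number of unrooted, fully resolved trees with n leaves, (2n-5)!! = 3 × 5 × ... × (2n-5). A small R sketch (the function name n_unrooted_trees is made up for this example):

```r
# Number of unrooted, strictly bifurcating tree topologies for n taxa:
# (2n - 5)!! = 3 * 5 * ... * (2n - 5)
n_unrooted_trees <- function(n) {
  stopifnot(n >= 3)
  if (n == 3) return(1)               # three taxa: only one unrooted tree
  prod(seq(3, 2 * n - 5, by = 2))
}

print(n_unrooted_trees(5))    # 15
print(n_unrooted_trees(10))   # 2027025
```

: The rapid growth of this number is the reason why listing the probability of every possible tree is only practical for very small data sets.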

'''Examine MCMC trajectory for nucleotide frequency'''
: Recall that the idea in MCMCMC sampling is to move around in parameter space in such a way that the points will be visited according to their posterior probability (i.e., a region with very high posterior probability will be visited frequently). Now, in RStudio plot the sampled values for the frequency of A for one of the run files:
df = read_tsv("primatemitDNA.nexus.run1.p", skip=1)
mcmc_trace(df, pars="pi(A)")
: mcmc_trace is one of several plotting commands available in the bayesplot package. These commands produce a plot of f_A (or "pi(A)") from the sample file for the first of the two parallel runs. Note how the Markov chain starts at the arbitrary value of 0.25, rapidly moves to a value that fits with the observed data, and then moves around in parameter space, sampling different possible values of f_A. You can experiment with plotting other columns as well.

'''Investigate posterior probability distribution over trees'''
: MrBayes provides the sumt command to summarize the sampled trees. Before using it, we need to decide on the burn-in: The burn-in is the initial set of samples that are typically discarded, because we want to ensure that the MCMC has moved away from the random starting values, and has found the peaks of the probability landscape. Since the convergence diagnostic we used previously to determine when to stop the analysis discarded the first 25% of the samples, it makes sense to also discard 25% of the samples obtained during the analysis.

: Return to the shell window where you have MrBayes running. In the command below relburnin=yes and burninfrac=0.25 tells MrBayes to discard 25% of the samples as burnin (you could also have explicitly given the number of samples to discard - help sumt will give you details about the command and the current option settings).
sumt contype=halfcompat conformat=simple relburnin=yes burninfrac=0.25 showtreeprobs=yes
: (Scroll back so you can see the top of the output when the command is done). This command gives you a summary of the trees that are in the file you examined manually above. The option contype=halfcompat requests that a majority rule consensus tree is calculated from the set of trees that are left after discarding the burnin. This consensus is the first tree plotted to the screen. Below the consensus cladogram, a consensus phylogram is plotted. The branch lengths in this have been averaged over the trees in which that branch was present (a particular branch corresponds to a bi-partition of the data, and will typically not be present in every sampled tree). The cladogram also has "clade credibility" values. We will return to the meaning of these later in today's exercise.

: What most interests us right now is the list of trees that is printed after the phylogram. These trees are labeled "Tree 1", "Tree 2", etc., and are sorted according to their posterior probability, which is indicated by a lower-case p after the tree number. (The upper-case P gives the cumulative probability of the trees shown so far, and is useful for constructing a credible set). This list highlights how Bayesian phylogenetic analysis is different from maximum likelihood: Instead of finding the best tree(s), we now get a full list of how probable any possible tree is.

: The list of trees and probabilities was printed because of the option showtreeprobs=yes. Note that you probably do not want to issue that command if you have much more than 5 taxa! In that case you could instead inspect the file named primatemitDNA.nexus.trprobs which is now present in the same directory as your other files (this file is automatically produced by the sumt command).

: '''NOTE''': Annoyingly, there is a bug in the version of MrBayes we are using here, which means leaf names are not printed on the list of trees with probabilities. However, the most probable tree here is in fact identical to the consensus tree printed above it.
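
: The way tree probabilities (p), cumulative probabilities (P), and a credible set relate to the sampled trees can be sketched in R. The topology labels and counts below are invented for illustration and do not come from the MrBayes output:

```r
# Hypothetical post-burn-in sample: one topology label per sampled tree.
sampled_topologies <- c(rep("T1", 700), rep("T2", 200),
                        rep("T3", 60),  rep("T4", 40))

# Posterior probability of each topology = fraction of samples showing it
p <- sort(table(sampled_topologies) / length(sampled_topologies),
          decreasing = TRUE)

# Cumulative probability, in order of decreasing posterior probability
P <- cumsum(p)

# 95% credible set: the most probable trees whose probabilities sum to >= 0.95
n_needed <- which(P >= 0.95)[1]
credible_set <- names(p)[seq_len(n_needed)]
print(credible_set)
```

: In this made-up sample the three most probable topologies are needed to reach a cumulative probability of at least 0.95.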

'''Question: '''What is the posterior probability of the most probable tree?

----

== Analysis of Neanderthal data (posterior probability of clades) ==

'''Question 3'''

The predominant theory in the 1950s and 60s (although it varied greatly from scholar to scholar) was that our earliest hominid ancestors (specifically Homo erectus) evolved in Africa and then radiated out into the world. This so-called [https://www.thoughtco.com/multiregional-hypothesis-167235 Multiregional Hypothesis] says that after H. erectus arrived in the various regions in the world hundreds of thousands of years ago, they slowly evolved into modern humans. The hypothesis thus posits that there were nearly independent origins of modern humans within the various regions of the world.

In the 1970s, paleontologist W.W. Howells proposed an alternate theory: the first Recent African Origin model. Howells argued that H. sapiens evolved solely in Africa. By the 1980s, growing data from human genetics led Stringer and Andrews to develop a model in which the very earliest anatomically modern humans arose in Africa about 100,000 years ago, while archaic populations found throughout Eurasia (including Neanderthals) might be descendants of H. erectus and later archaic types but were not related to modern humans.

We will use the present data set to consider this issue.

'''Load Neanderthal data set'''
: In the Terminal where you have MrBayes running:
execute neanderthal.nexus
delete 5-40
: As we did for the maximum likelihood analysis, we will discard some of the human sequences in order to speed up the analysis. The command delete 5-40 removes sequence number 5 to sequence number 40 from the active data set.

'''Investigate data'''
showmatrix
: This data set consists of an alignment of mitochondrial DNA from human (17 sequences), chimpanzee (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically bones found at Vindija in Croatia.

'''Start analysis'''
outgroup Pan_troglodytes
lset nst=mixed rates=gamma
mcmc ngen=500000 nchains=3 diagnfreq=10000

: Here we use the option nst=mixed, which allows MrBayes to automatically explore all possible substitution models. Essentially, MrBayes now considers the substitution model as one more parameter, and uses MCMC to sample from the possible versions (with nst ranging from 1 to 6). This will often be the best choice when using MrBayes. (Below, I use nst=6 for pedagogical purposes, because it makes it simpler to analyse the output files).

'''Find posterior probability of clades'''
sumt contype=halfcompat showtreeprobs=no relburnin=yes burninfrac=0.25
: Examine the consensus tree that is plotted to screen: On the branches that are resolved, you will notice that numbers have been plotted. These are clade-credibility values, and are in fact the posterior probability that the clade is real (based on the present data set). These numbers are different from bootstrap values: unlike bootstrap support (which has no clear statistical meaning) these are actual probabilities. Furthermore, they have been found using a full probabilistic model, instead of neighbor joining, and the analysis has still finished in a reasonable amount of time. These features make Bayesian phylogeny very useful for assessing hypotheses about monophyly.
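
: A clade probability is a marginal probability: it is the summed posterior probability of all trees that contain the clade (equivalently, the fraction of sampled trees containing it). A minimal R sketch with invented tree probabilities and an invented indicator of which trees contain the clade:

```r
# Hypothetical posterior probabilities of four tree topologies
tree_prob <- c(T1 = 0.70, T2 = 0.20, T3 = 0.06, T4 = 0.04)

# Assumed for illustration: the clade of interest is present in T1 and T2 only
has_clade <- c(T1 = TRUE, T2 = TRUE, T3 = FALSE, T4 = FALSE)

# Clade probability = sum of the probabilities of the trees that contain it
clade_prob <- sum(tree_prob[has_clade])
print(clade_prob)
```

: Note that a clade can have high posterior probability even when no single tree does, as long as most of the probable trees agree on that particular clade.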

'''Question: '''What is the clade probability for Homo sapiens being a monophyletic group excluding the Neanderthal?

----

== Probability distributions over other parameters ==

'''Question 4'''

: As the last thing, we will now turn away from the tree topology, and instead examine the other parameters that also form part of the probabilistic model. We will do this using a reduced version of the Hepatitis C virus data set that we have examined previously. Stay in the shell window where you just performed the analysis of Neanderthal sequences.

'''Load data set'''
execute hcvsmall.nexus

'''Define site partition'''
charset 1stpos=1-.\3
charset 2ndpos=2-.\3
charset 3rdpos=3-.\3
partition bycodon = 3:1stpos,2ndpos,3rdpos
set partition=bycodon
prset ratepr=variable
: This is an alternative way of specifying that different sites have different rates. Instead of using a gamma distribution and learning which sites have what rates from the data, we are instead using our prior knowledge about the structure of the genetic code to specify that all 1st codon positions have the same rate, all 2nd codon positions have the same rate, and all 3rd codon positions have the same rate. Specifically, charset 1stpos=1-.\3 means that we define a character set named "1stpos" which includes site 1 in the alignment followed by every third site ("\3", meaning it includes sites 1, 4, 7, 10, ...) until the end of the alignment (here denoted ".").
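
: The sites selected by the three charset definitions can be mimicked in R with seq. The alignment length of 20 is an arbitrary value chosen for illustration (it is not the length of the HCV alignment):

```r
alignment_length <- 20   # hypothetical alignment length

# Equivalent of "1-.\3", "2-.\3" and "3-.\3": start site, then every third site
pos1 <- seq(1, alignment_length, by = 3)
pos2 <- seq(2, alignment_length, by = 3)
pos3 <- seq(3, alignment_length, by = 3)

print(pos1)   # 1 4 7 10 13 16 19
print(pos2)   # 2 5 8 11 14 17 20
print(pos3)   # 3 6 9 12 15 18
```

: Together the three sets cover every site exactly once, which is what the partition command requires.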

'''Specify model'''
lset nst=6
: This specifies that we want to use a model of the General Time Reversible (GTR) type, where all 6 substitution types have separate rate parameters.

: When the lset command was discussed previously, a few issues were glossed over. Importantly, and unlike PAUP, the lset command in MrBayes gives no information about whether nucleotide frequencies are equal or not, and whether they should be estimated from the data or not. In MrBayes this is instead controlled by defining the prior probability of the nucleotide frequencies (the command prset can be used to set priors). For instance, a model with equal nucleotide frequencies corresponds to having prior probability 1 (one) for the frequency vector (A=0.25, C=0.25, G=0.25, T=0.25), and zero prior probability for the infinitely many other possible vectors. As you will see below, the default prior is not this limited, and the program will therefore estimate the frequencies from the data.

'''Inspect model details'''
showmodel
: This command gives you a summary of the current model settings. You will also get a summary of how the prior probabilities of all model parameters are set. You will for instance notice that the nucleotide frequencies (parameter labeled "Statefreq") have a "Dirichlet" prior. We will not go into the grisly details of what exactly the Dirichlet distribution looks like, but merely note that it is a distribution over many variables, and that depending on the exact parameters the distribution can be more or less flat. The Dirichlet distribution is a handy way of specifying the prior probability distribution of nucleotide (or amino acid) frequency vectors. The default statefreq prior in MrBayes is the flat or un-informative prior dirichlet(1,1,1,1).

: We will not go into the priors for the remaining parameters in any detail, but you may notice that by default all topologies are taken to be equally likely (a flat prior on trees).
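
: To get a feeling for the flat dirichlet(1,1,1,1) prior on nucleotide frequencies, one can draw random frequency vectors from it in R. Base R has no Dirichlet sampler, so the sketch below uses the standard construction of normalising independent gamma variables; the function name rdirichlet_sketch is made up for this example.

```r
set.seed(1)

# Draw n vectors from Dirichlet(alpha) by normalising independent gamma draws
rdirichlet_sketch <- function(n, alpha) {
  k <- length(alpha)
  # Column j holds n draws from Gamma(shape = alpha[j], rate = 1)
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), nrow = n, ncol = k)
  g / rowSums(g)            # each row now sums to 1
}

freqs <- rdirichlet_sketch(1000, c(1, 1, 1, 1))
colnames(freqs) <- c("A", "C", "G", "T")

print(head(freqs, 3))       # three random frequency vectors
print(colMeans(freqs))      # each mean is close to 0.25
```

: Every row is a possible vector of nucleotide frequencies. Under the flat prior all such vectors are equally probable a priori, so individual draws can be far from (0.25, 0.25, 0.25, 0.25) even though the average is close to it.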

'''Start MCMC sampling'''
mcmc ngen=500000 samplefreq=100 diagnfreq=10000 nchains=3
: The run will take a while to finish (you may want to ensure that the average standard deviation of split frequencies is less than 0.01 before ending the analysis).

'''Compute summary of parameter values'''
sump relburnin=yes burninfrac=0.25
: The sump command works much like the sumt command, but summarizes the non-tree parameters. Again, we are using 25% of the total number of samples as burnin.

: First, you get a plot of the lnL as a function of generation number. Values from the two independent runs are labeled "1" and "2" respectively. If the burnin is suitable, then the points should be randomly scattered over a narrow lnL interval.

: Secondly, the posterior probability distribution of each parameter is summarized by giving the mean, variance, median, and 95% credible interval.
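
: The same kind of summary can be computed directly from a vector of sampled values in R. The samples below are simulated stand-ins rather than real MrBayes output, and the interval shown is a simple equal-tailed interval based on quantiles, which may differ from the interval type that MrBayes reports.

```r
set.seed(1)

# Simulated stand-in for post-burn-in samples of one parameter
samples <- rgamma(5000, shape = 20, rate = 200)

summary_stats <- c(
  mean     = mean(samples),
  variance = var(samples),
  median   = median(samples),
  quantile(samples, probs = c(0.025, 0.975))   # equal-tailed 95% interval
)
print(summary_stats)
```

: The two quantiles bracket the central 95% of the sampled values, so according to the posterior there is a 95% probability that the parameter lies between them.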

'''Question: '''Report the mean of the relative substitution rate parameters r(AC) and r(CG).

----

'''Question 5: ''' Based on the reported posterior means, does it seem that r(CG) is different from r(AC)?

----

'''Question 6'''

'''Marginal vs. joint distributions'''
: Strictly speaking the comparison above was not entirely appropriate. We first found the overall distribution of the r(CG) parameter and then compared its mean to the mean of the overall distribution of the r(AC) parameter. By doing things this way, we are ignoring the possibility that the two parameters might be associated in some way. For instance, one parameter might always be larger than the other in any individual sample, even though the total distributions overlap. We should instead be looking at the distribution over both parameters simultaneously. A probability distribution over several parameters simultaneously is called a "joint distribution" over the parameters.

: By looking at one parameter at a time, we are summing its probability over all values of the other parameters. This is called the marginal distribution.
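
: The situation described above, where one parameter is larger than the other in every single sample although the marginal distributions overlap, can be reproduced with simulated values in R. The numbers are invented and unrelated to the HCV analysis:

```r
set.seed(1)
n <- 2000

# x varies widely; y is always a little larger than x in the same sample
x <- rnorm(n, mean = 0.10, sd = 0.02)
y <- x + 0.001 + abs(rnorm(n, mean = 0, sd = 0.005))

# The marginal distributions overlap heavily ...
print(range(x))
print(range(y))

# ... but the joint distribution shows that y > x in every sample
prob_y_gt_x <- mean(y > x)
print(prob_y_gt_x)
```

: Comparing only the two marginal summaries would suggest that x and y are hardly different, whereas counting samples in the joint distribution gives a probability of 1 that y is larger than x.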

'''Examine marginal distributions'''
: In RStudio, use the following commands to read and plot the marginal distributions of r(AC) and r(CG). Note that we are discarding the first 25% of the samples as burnin.
df = read_tsv("hcvsmall.nexus.run1.p", skip=1)
burnin = df$Gen %>%
max() %>%
multiply_by(0.25) %>%
floor()
df2 = df %>%
filter(Gen > burnin) %>%
select(CG = `r(C<->G){all}`,
AC = `r(A<->C){all}`
)
mcmc_intervals(df2, prob_outer = 1)
mcmc_areas(df2, prob_outer = 1)
: The functions mcmc_intervals and mcmc_areas plot different views of the same posterior distributions.

: You can also simply plot the data using ggplot:
df2long = pivot_longer(df2, cols = c("CG", "AC"))

ggplot(df2long) +
geom_density(mapping=aes(x=value, fill=name), alpha=0.3) +
labs(x="Substitution rate")

'''Question''': Which of the following statements best describes how the marginal distributions behave?

:* The two marginal distributions have a small overlap. The r(CG) distribution has the highest peak.
:* The two marginal distributions have a large overlap. The r(CG) distribution has the highest peak.
:* The two marginal distributions have no overlap. The r(CG) distribution has the highest peak.
:* The two marginal distributions have a large overlap. The r(AC) distribution has the highest peak.
:* The two marginal distributions have a small overlap. The r(AC) distribution has the highest peak.

----

'''Question 7'''

'''Examine joint distributions'''
: These plots and results explore the relationship between the A<->C and C<->G rates.
ggplot(df2, aes(x=CG, y=AC)) +
geom_point(col="blue") +
geom_abline(intercept=0, slope=1, lty=2, col="red") +
xlim(0,0.25) +
ylim(0,0.25) +
labs(x="CG rate", y ="AC rate")

ggplot(df2, aes(x=CG, y= AC)) +
geom_hex(col="blue") +
geom_abline(intercept=0, slope=1, lty=2, col="red") +
xlim(0,0.25) +
ylim(0, 0.25) +
labs(x="CG rate", y ="AC rate")

df2 %>%
nrow()
df2 %>%
filter(AC>CG) %>%
nrow()

'''Question: '''Based on the above plots and results: What is the joint probability that rAC > rCG?

----

'''Question 8'''

: Note how examining the joint distribution provides you with information that you could not get from simply comparing the marginal distributions. This very simple procedure can be used to answer many different questions.

: Now, plot the relative substitution rates at the first, second, and third codon positions:
# Reuse the 25% burnin cutoff (burnin) computed above instead of a hard-coded generation number
df3 = df %>%
filter(Gen > burnin) %>%
select(Codon_1st = `m{1}`,
Codon_2nd = `m{2}`,
Codon_3rd = `m{3}` ) %>%
pivot_longer(cols=c("Codon_1st", "Codon_2nd", "Codon_3rd"))

ggplot(df3) +
geom_density(mapping=aes(x=value, fill=name), alpha=0.3) +
labs(x="Relative substitution rate")

'''Question: '''Since random mutations presumably hit all three codon positions with the same frequency, any differences must be caused by subsequent selection. How does the result fit with your knowledge of the structure of the genetic code? Which of the following statements are correct? (More than one answer may be correct)

:* Codon position 2 is the most degenerate of the codon positions.
:* Codon position 1 is the most degenerate of the codon positions.
:* Codon position 1 is the most conserved codon position.
:* Codon position 3 is the most conserved codon position.
:* Codon position 3 is the most degenerate of the codon positions.
:* Codon position 2 is the most conserved codon position.

Maximum Likelihood

2024-03-19T13:39:51Z

WikiSysop:

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Overview ==

The data set you will work with here consists of an alignment of full length mitochondrial DNA from human (53 sequences), chimpanzee (1 sequence), bonobo (1 sequence), and Neanderthal (1 sequence). The Neanderthal DNA was extracted from archaeological material, specifically 38,000 year old bones found at Vindija in Croatia (all sequence data was taken from this paper: Green et al., Cell, 2008).

The view emerging from most anatomical, archaeological, and DNA-based studies places Neanderthals as a different species from Homo sapiens. This is in agreement with the "Out-of-Africa hypothesis", which states that Neanderthals coexisted with modern humans who originated in Africa somewhere between 100,000 and 200,000 years ago. There is, however, also some anatomical and paleontological research which supports the so-called "multi-regional hypothesis", which propounds that some populations of archaic Homo evolved into modern human populations in many geographical regions. Under this hypothesis, Neanderthals would be a sub-clade within humans. We will use the present data set to consider this issue.

== Getting started ==

'''Construct working directory, copy files'''
In a terminal window enter:
cd ~student
mkdir likelihood
cd likelihood
cp ~/data/neanderthal.nexus ./neanderthal.nexus

----

'''Question 1'''

Start PAUP and load data set:
paup neanderthal.nexus

'''Remove subset of sequences to reduce computational burden:'''
delete 5-40
: This command removes 36 human sequences (sequence number 5 to sequence number 40) from the data set. We do this in order to reduce the time needed to finish the analysis. In the remaining data set we now have 17 human sequences, one chimpanzee, one bonobo, and one Neanderthal.

'''Specify substitution model'''
In the analysis performed here, we have reason to believe that the Kimura 2 parameter model is a fair description of how the sequences evolve (i.e., transitions and transversions have separate rates). We furthermore have evidence that different sites evolve at quite different rates, and we want to model this using a gamma distribution. Moreover, we will request that the transition/transversion rate ratio and the gamma shape parameter are estimated from data. (Although we will not discuss the issue further at this point, it is important to realize that there are techniques for stringent selection of the best model, and that one should never just randomly select one. We will return to such techniques later in the course when we discuss model selection. For now, however, you should just accept that K2P + gamma is an adequate model for the present data set). To specify the substitution model, enter the following at the PAUP prompt:
set criterion=likelihood
lset nst=2 tratio=estimate basefreq=equal rates=gamma shape=estimate
: In order to search for a maximum likelihood tree, we must first give a detailed description of the assumed substitution model. Since this is the first time we do this, I will give a rather thorough description of each part of the command.

: First lset ("likelihood settings") is the command used in PAUP to specify likelihood models, just as dset was used to specify settings for the distance criterion.

: Secondly, we specify that we want a model with two different types of substitution rates (nst=2) and where the frequency of each base is 25% (basefreq=equal). You will recognize this as the K2P model. Note that, by default, PAUP assumes that nst=2 means that we want to make a distinction between transitions and transversions. It is also possible to specify models with two types of substitutions that are NOT transitions and transversions respectively. One example would be: lset nst=6 rmatrix=estimate rclass=(a a a b b b). I will not explain this example in detail at this point.

: Third, we request that the transition/transversion ratio should be estimated from the data (tratio=estimate).

: Finally, we specify that we want to use a model where substitution rates at different sites follow a gamma distribution (rates=gamma), and that we want the shape of this distribution to also be estimated from the data (shape=estimate).
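
: In practice, gamma-distributed rates are usually approximated by a small number of discrete rate categories. The R sketch below shows one common way of doing this (equal-probability categories represented by their medians and rescaled to a mean rate of 1). The function name, the choice of 4 categories, and the median method are assumptions made for illustration; PAUP's exact procedure may differ.

```r
# Discrete approximation of gamma-distributed site rates.
# alpha is the shape parameter; k is the number of rate categories.
discrete_gamma_rates <- function(alpha, k = 4) {
  # Probability midpoints of k equally probable categories
  p <- (seq_len(k) - 0.5) / k
  # A gamma distribution with shape = rate = alpha has mean 1
  r <- qgamma(p, shape = alpha, rate = alpha)
  r / mean(r)               # rescale so the average rate is exactly 1
}

print(discrete_gamma_rates(alpha = 0.5))   # strong rate variation among sites
print(discrete_gamma_rates(alpha = 5))     # rates fairly similar among sites
```

: A small shape parameter gives categories with very different rates (many nearly invariant sites and a few fast ones), while a large shape parameter gives rates that are all close to 1.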

'''Specify outgroup and rooting options'''
outgroup Pan_troglodytes Pan_paniscus
set root=outgroup outroot=monophyl
: The chimpanzee and the bonobo form the outgroup

'''Start heuristic search of tree space using nearest neighbor interchange (NNI)'''
hsearch swap=nni
: This step may take a little while to finish (depending on your computer). For large datasets you sometimes have to wait hours or even days for a maximum likelihood analysis to finish. The score that PAUP lists for maximum likelihood analysis is the ''negative'' log of the likelihood, -ln(L). (Recall that since likelihoods are numbers between 0 and 1, log-likelihoods will be negative numbers, and therefore negative log-likelihoods will be positive numbers. It is perhaps a bit confusing that PAUP does not simply list ln(L)). As the likelihood increases, this number will decrease.
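
: The relation between likelihood, log-likelihood and -ln(L) can be illustrated with a few made-up per-site likelihoods in R. The sketch assumes, as is standard in these models, that sites are independent, so that the total likelihood is the product of the per-site likelihoods.

```r
# Hypothetical likelihoods of four alignment sites on some tree
site_likelihoods <- c(0.012, 0.0004, 0.03, 0.0075)

# Product of site likelihoods = exp(sum of site log-likelihoods)
lnL <- sum(log(site_likelihoods))
neg_lnL <- -lnL
print(c(lnL = lnL, neg_lnL = neg_lnL))

# With many sites the plain product underflows to zero,
# while the sum of logs remains well-behaved
many_sites <- rep(1e-5, 100)
print(prod(many_sites))        # 0 (underflow)
print(sum(log(many_sites)))    # about -1151.3
```

: This is one practical reason for working with log-likelihoods: for real alignments the likelihood itself is far too small a number to be represented directly.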

'''Question: '''What is the negative log-likelihood, -ln(L), for the best tree found using NNI?

----

'''Question 2''' Is the Neanderthal sequence placed inside or outside the clade of human sequences?

Models of Evolution

2024-03-19T13:35:13Z

WikiSysop:

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Overview ==

In this exercise we will explore a number of different, but closely related, models of evolution. Using such models it is possible to estimate the number of unseen mutational events and thereby obtain genetic distances that have been corrected for superimposed substitutions. It is, however, important to realize that these corrections are based on the assumption that we observe approximately the expected amount of change: if, for instance, 20 mutational events end up leading to no observable changes then it is impossible to guess the actual amount of change regardless of which correction scheme we employ. Using more and longer sequences helps ensure that the observed change is closer to the expected change, so the correction is more likely to be accurate with adequate amounts of data. The same models also play an important role in phylogenetic reconstruction based on maximum likelihood and Bayesian techniques.

== Getting started ==

'''1: Start Terminal window'''

'''2: Construct working directory'''
: In the command below: Instead of /path/to/molevol enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol, or cd /home/student/molevol).
cd /path/to/molevol
mkdir models

'''3: Change to working directory'''
cd models

'''4: Copy files for exercise'''
cp ../data/primatemitDNA.nexus ./primatemitDNA.nexus
cp ../data/titv.data ./titv.data

'''5: Inspect sequence file'''
nedit primatemitDNA.nexus &
: This file contains an aligned set of mitochondrial DNA sequences from man, chimpanzee, gorilla, orangutan and gibbon. Mitochondria are cellular organelles that are bounded by a lipid membrane and contain their own genome. Mitochondrial DNA is related to certain bacterial genomes, and it is believed that the original mitochondrion was a primitive bacterial cell that was engulfed by an early ancestor of eukaryotic cells and that the pair subsequently went on to form a permanent symbiotic relationship.

: Mitochondrial DNA has a higher rate of substitution than nuclear DNA. This makes it useful for investigating phylogenetic relationships between closely related species, such as the five primates included in the present data set. Close the nedit window when you are done.

'''6: Inspect additional data file'''
nedit titv.data &
: This file contains a single header line and one column of numbers giving estimated times of divergence between man and chimpanzee, man and gorilla, man and orangutan, and man and gibbon. (Divergence times are in millions of years). This file will be used later in the exercise when we investigate how various distance measures increase over time. Note: If the nedit window is too narrow, then the column headings will wrap over two lines. Make sure to make the window as wide as possible in order to understand the structure of this file. Close the nedit window when you are done.

----

== The Jukes and Cantor model ==

'''Question 1'''

The Jukes and Cantor model of evolution has the following rate matrix:

<pre>
      |  A   C   G   T  |
   ----------------------
   A  |  -   a   a   a  |
   C  |  a   -   a   a  |
   G  |  a   a   -   a  |
   T  |  a   a   a   -  |
   ----------------------
</pre>

'''Start RStudio'''
: We will use RStudio to explore some features of evolution occurring according to this model. Start by loading the tidyverse libraries:
library(tidyverse)

For the Jukes and Cantor model the following equation gives the probability, D, that a given site will display observable change, expressed as a function of branch length, d:

<math>D=\frac{3}{4} \left( 1 - \exp\left(-\frac{4}{3}d\right) \right)</math>

Here, d is measured in substitutions per site. D is also the expected fraction of sites showing observable change along a branch of length d: if any single site has probability D of changing, then on average D * L sites will have changed in a sequence of length L. We will now explore how the expected amount of observable change depends on the branch length.
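
As a small worked example of the D * L relationship, the sketch below evaluates the equation for one branch length. The values d = 0.1 and L = 1000, and the function name jc_expected_diff, are illustrative assumptions and are not part of the exercise data:

```r
# Expected fraction of sites showing an observable difference under
# Jukes-Cantor, for a branch of length d (substitutions per site).
jc_expected_diff <- function(d) {
  0.75 * (1 - exp(-4 / 3 * d))
}

d <- 0.1     # real distance (illustrative value)
L <- 1000    # sequence length (illustrative value)

D <- jc_expected_diff(d)
real_substitutions   <- d * L   # substitutions that actually occurred
expected_differences <- D * L   # differences we expect to be able to see

print(D)
print(real_substitutions)
print(expected_differences)
```

Even at this short distance the expected number of visible differences is slightly smaller than the number of substitutions that occurred, because some substitutions hit a site that has already changed.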

In RStudio enter the following (note: you may want to enter this in the script window, in the upper left of RStudio, so you can re-use the code later on):
df = tibble(
d = seq(0,10,0.1),
observed = 0.75*(1-exp(-4/3*d)),
max = 0.75
)

dflong = pivot_longer(df, cols=-d)
ggplot(dflong, aes(x=d, y=value, col=name)) +
geom_line() +
geom_abline(slope=1, lty=2) +
labs(title="Exp. observable differences",
x = "Actual distance (branch length)",
y = "Observed distance") +
ylim(0,1)

: In this expression d is the branch length (the actual amount of change that has occurred). The curve we have plotted thus gives the expected observed difference as a function of the actual amount of change.

'''Question: '''Which of the following statements are true?
:# Sequences can become no more than 75 % different according to the Jukes and Cantor model
:# The graph of the observed differences plateaus at 3/4
:# The Jukes and Cantor correction will have a limited effect when the branch length is large
:# For small branch lengths, the expected observable difference rises almost linearly and is very close to the real distance
:# The graph of the observed differences plateaus at 1

----

'''Question 2'''

'''Jukes and Cantor model: Examine estimated branch length as a function of observed difference'''

Above we examined how the (expected) observed distance depended on the real distance. We will now examine how the real distance can be estimated from the observed distance. This is done by solving the above equation for d, giving us an expression that allows us to estimate the real amount of change as a function of the observed change:

:<math>d=-\frac{3}{4}\ln\left( 1 - \frac{4}{3}D \right)</math>

Note that this correction will only work if the observed difference is approximately as expected. Consider this: In the dice-rolling simulation we found that if there have been 0.67 changes per site then the expected observed difference is 0.44. However, as you saw in the simulation, the actual observed difference can differ from the expected 0.44 (say, 0.33 or 0.58). If the observed difference is not the same as the expected observed difference, then we will also get the wrong estimate of the real distance after correcting for multiple substitutions.
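
The effect of sequence length on this sampling variation can be sketched with a small simulation. This is not the dice-rolling simulation itself: it assumes, as a simplification, that every site is an independent draw with probability equal to the expected observed difference, and the sequence lengths (20 and 2000), the number of replicates, and the seed are arbitrary illustrative choices:

```r
set.seed(1)   # arbitrary seed, only for reproducibility

d <- 0.67                                    # real distance used in the text
D_expected <- 0.75 * (1 - exp(-4 / 3 * d))   # expected observed difference

# Observed fraction of differing sites in n_rep simulated sequence pairs,
# treating each site as an independent draw with probability D_expected.
simulate_observed <- function(seq_len, n_rep = 1000) {
  rbinom(n_rep, size = seq_len, prob = D_expected) / seq_len
}

short_seqs <- simulate_observed(seq_len = 20)
long_seqs  <- simulate_observed(seq_len = 2000)

print(D_expected)
print(range(short_seqs))   # wide spread around the expectation
print(range(long_seqs))    # narrow spread around the expectation
```

With short sequences the observed difference scatters widely around the expectation, so the corrected distance will often be off; with long sequences the observed values stay close to the expected value.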

We will now plot (estimated) real change as a function of observed difference (this is the inverse of what you did before). In RStudio enter:
df = tibble(
D = seq(0,0.749,0.01),
real = -0.75*log(1-4/3*D)
)

ggplot(df, aes(x=D, y=real)) +
geom_line(col="blue") +
geom_abline(slope=1, lty=2) +
labs(title="Estimated real distance",
x = "Observed difference",
y = "Real distance") +
xlim(0,0.8)

: In R, the function "log" computes the natural logarithm. Note how the correction becomes increasingly important as the observed distance increases. Also note that the correction is only defined for observed distances below 0.75, although larger values may arise in real data. At 75% difference the corrected distance is infinite, and above 75% it is not defined. When using JC corrected distances for phylogenetic reconstruction, you should therefore beware of this situation.
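
: The correction can also be wrapped in a small helper function. The following is only a sketch: the name jc_corrected and the choice to return NA for observed differences of 0.75 or more are assumptions made for illustration, and the input values are arbitrary:

```r
# Jukes-Cantor corrected distance from an observed difference D.
# Returns NA where D >= 0.75, since the correction is infinite or
# undefined there.
jc_corrected <- function(D) {
  d <- rep(NA_real_, length(D))
  ok <- D < 0.75
  d[ok] <- -0.75 * log(1 - 4 / 3 * D[ok])
  d
}

D_obs <- c(0.05, 0.2, 0.5, 0.8)   # illustrative observed differences
d_est <- jc_corrected(D_obs)

print(data.frame(observed = D_obs, corrected = d_est))
```

: The corrected value is always larger than the observed one, and the gap widens as the observed difference approaches 0.75.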

'''Question: '''Use the equation above to estimate the actual distance if the observed distance is 0.1, 0.4, and 0.6 respectively

----

== The Kimura 2 parameter model ==

The Kimura 2 parameter model of evolution has the following rate matrix:
<pre>
      |  A   C   G   T  |
   ----------------------
   A  |  -   b   a   b  |
   C  |  b   -   b   a  |
   G  |  a   b   -   b  |
   T  |  b   a   b   -  |
   ----------------------
</pre>

Note how transitions (A/G and C/T) have a different rate than transversions (A/C, A/T, C/G, and G/T). Based on this matrix, the expected ratio of transitions to transversions is:

<math>R = \frac{a}{2b}</math>

meaning that if transitions and transversions had the same rate (Jukes and Cantor), then we would expect: <math>R = 0.5</math>. Empirically, this is typically not the case. In fact one often sees <math>R \geq 2</math> and for mitochondrial DNA a typical value is <math>R=10</math> (meaning that a is 20 times higher than b)! We will now use RStudio to explore some features of evolution occurring according to this model.

It can be shown that, under the K2P model, the probabilities of observing a transition and a transversion, respectively, depend on R and t in the following way:

<math>
P_\textrm{transition} = 0.25 - 0.5 \exp(A t) + 0.25 \exp(B t)
</math>

<math>
P_\textrm{transversion} = 0.5 - 0.5 \exp(B t)
</math>

where

<math>
A = \frac{-2R-1}{R+1}
</math>

<math>
B = \frac{-2}{R+1}
</math>

Note that in these equations we have chosen to measure time in suitable units such that the overall rate of substitution (<math>\mu=a+2b</math>) has the value 1 substitution per site per unit time. (An example: if <math>\mu=10^{-9}</math> substitutions per site per year, then we would choose to measure time in billions of years, instead of in years. The substitution rate would now be 1 substitution per site per billion years). This means that the real amount of change accumulated during t time units is simply: <math>d = 1 \cdot t = t</math>
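
As a sketch of where A and B come from: assuming the usual form of the K2P solution, in which the exponents are <math>-2(a+b)t</math> and <math>-4bt</math> (this form is not stated above), the two conditions <math>a+2b=1</math> and <math>R=a/(2b)</math> fix both rates:

<math>
a + 2b = 1, \quad R = \frac{a}{2b} \quad \Rightarrow \quad a = \frac{R}{R+1}, \quad b = \frac{1}{2(R+1)}
</math>

<math>
A = -2(a+b) = \frac{-2R-1}{R+1}, \qquad B = -4b = \frac{-2}{R+1}
</math>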

----

'''Question 3'''

'''Examine expected amount of change as a function of branch length'''
We will now examine how the expected amounts of transitions and transversions change with time when R=10. In the RStudio window enter the following:
R = 10
A = (-2*R-1.0)/(R+1.0)
B = (-2)/(R+1.0)
You can check the computed values of A and B by using the print command:
print(A)
print(B)
You should have obtained values of approximately A = -1.909 and B = -0.1818. You can now plot the curves showing how the expected amounts of transitions and transversions change as a function of the branch length (the actual amount of change):
df = tibble(
t = seq(0, 40, 0.1),
Transitions = 0.25-0.5*exp(A*t)+0.25*exp(B*t),
Transversions = 0.5 - 0.5 * exp(B*t),
Total_dist = 0.25-0.5*exp(A*t)+0.25*exp(B*t) + 0.5 - 0.5 * exp(B*t)
)

dflong = df %>% pivot_longer(-t)

ggplot(dflong, aes(x=t, y=value, col=name)) +
geom_line() +
labs(title="Exp. observable differences",
x = "Real distance",
y = "Observed difference") +
geom_hline(yintercept = 0.25, col="blue", lty=2) +
geom_hline(yintercept = 0.5, col="blue", lty=2) +
geom_hline(yintercept = 0.75, col="blue", lty=2) +
ylim(0,1)

:Several interesting things are going on in this plot. First of all, note that I have added a third curve showing the total observed difference. This is simply the sum of the observed transitions and transversions.

: Second, as was the case for the Jukes and Cantor model, the total observed difference increases to a maximum value of 0.75 (corresponding to 25% similarity).

: Third, note that the expected amount of transitional differences first rises rapidly and then declines slowly to an equilibrium value of 0.25. Transversional differences rise slowly to an equilibrium value of 0.5. The equilibrium values are determined by the fact that when sufficient time has passed, sequence similarities will essentially be random; since there are twice as many possible transversions as transitions, these will in the end make up two thirds of all observed changes. Early on, before this stage is reached, the much higher rate of transitions will cause them to make up the vast majority of all observed changes, and only after considerable time has elapsed will the transversions catch up.

'''Question: '''From the plot, estimate the real distance (x-axis) at which the transition and transversion lines cross.

----

'''Question 4'''

'''Experiment with other transition/transversion rate ratios'''
The exact behaviour of the relationship between the two types of change depends on the relative rates of transition and transversion. You should now repeat the above analysis with:
:# R=2
:# R=0.5
Remember to recompute A and B after entering the new value of R. Recall that R=0.5 means that transitions and transversions occur at the same rate, a=b. For each of these two cases rerun the plot command and consider the changes.

'''Question: '''Based on the two plots, which of the following statements are true?
:# For R = 2, the transition and transversion lines cross at around 1.5 substitutions per site.
:# For R = 0.5, the transition and transversion lines never cross each other.
:# For both R=2 and R = 0.5, the transition and transversion lines never cross each other.
:# For R = 2, the transition and transversion lines cross at around 5 substitutions per site.

----

'''Question 5''' When R=0.5, the Kimura 2 parameter model is in fact equivalent to another model - which one?

----

'''Question 6'''

'''Examine how apparent transition/transversion ratio changes with branch length'''
The apparent transition/transversion ratio is simply the observed number of transitions divided by the observed number of transversions. The following plot command shows this number as a function of branch length for the case R=10 (I have simply taken the expression for observed transitions and divided it by the expression for observed transversions):
df = tibble(
t= seq(0.01, 4, 0.01),
obsrat = (0.25-0.5*exp(-1.909*t)+0.25*exp(-0.1818*t))/ (0.5 - 0.5 * exp(-0.1818*t))
)

ggplot(df, aes(x=t, y=obsrat)) +
geom_line(col="blue") +
labs(x = "Real distance",
y = "Observed transition/transversion ratio")

Note how the apparent ratio is close to the real ratio, R=10, when not much change has occurred (i.e., for small values of t, at the left edge of the plot).
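
This behaviour for short branches can be checked with a first-order expansion. As a sketch, assuming t is small enough that <math>\exp(x) \approx 1 + x</math> is a reasonable approximation:

<math>
P_\textrm{transition} \approx \left(-0.5 A + 0.25 B\right) t = \frac{R}{R+1} t, \qquad
P_\textrm{transversion} \approx -0.5 B t = \frac{1}{R+1} t
</math>

<math>
\frac{P_\textrm{transition}}{P_\textrm{transversion}} \approx R
</math>

so for very short branches the apparent ratio approaches the real ratio R.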

The model of evolution that we have explored here is not a particularly complicated one - in fact it only has two free parameters. Nevertheless, you will by now appreciate that it is capable of displaying fairly un-intuitive behaviour. Stating our hypothesis about this biological system in explicit mathematical terms is what allowed us to explore this thoroughly.

'''Question: '''What value does the apparent transition/transversion ratio approach asymptotically? (You will need to construct a plot with a wider x-range to see this).

----

== Analysis of mitochondrial data set ==

In this part of the exercise we will explore a real mitochondrial data set containing sequences from man, chimpanzee, gorilla, orangutan, and gibbon. We will investigate how the use of different models of evolution affects the estimated distance matrix. Since mitochondrial DNA is known to have very different transition and transversion rates, we will pay special attention to this aspect.

----

'''Question 7 '''

'''Prepare editor window'''
In a terminal window, enter:
nedit titv.data &
(Make sure to make the nedit window as wide as possible - otherwise the header line will be wrapped over two lines). This file contains a header line and a column listing the estimated divergence time between man and each of the other four primates (in millions of years). These estimates are associated with a fair amount of uncertainty, but the implied branching order is almost certainly correct. You will be using this file for entering various measures that you compute from the data file.

'''Start PAUP*, load data file'''
paup primatemitDNA.nexus

'''Define outgroup:'''
outgroup gibbon

'''Activate outgroup rooting and select how tree will be printed:'''
set root=outgroup outroot=monophyl

'''Select distance-based tree-reconstruction:'''
set criterion=distance

'''Select uncorrected distances under the least squares criterion:'''
dset distance=p objective=lsfit

'''Short digression on PAUP* online help system:'''
: We interrupt this exercise for a brief announcement: By now you should be familiar with many of the commands used in PAUP, but you probably do not have an overview of the long list of possible options that can be specified. Fortunately, PAUP has a command that is useful in this context:
dset ?
: Here I have used dset as an example, but typing any command followed by a question mark ("?") will give you a list of all the possible options for that command, along with a list of the current values. This is very useful if you want to experiment with different settings in an analysis. When you want to learn more about the individual settings, you can also check the command reference and manual, which are linked on the course wiki.

: One final thing that may be good to know: PAUP* accepts abbreviated commands as long as the abbreviation is unambiguous. That means you can for instance write set crit=dist instead of the full set criterion=distance, and desc instead of describetrees.

'''Construct least squares tree:'''
alltrees
: This is a small data set so we can use exhaustive searching.

'''Inspect tree:'''
describetrees all/plot=phylogram
: This tree reflects our current belief about how these organisms are related.

'''Print distance matrix, note distances from human:'''
showdist
: The showdist command lists the distance matrix computed according to the currently active distance-setting (as specified in the dset command above).

'''Question: '''What are the p-distances for the following pairs of sequences: human/chimpanzee, human/gorilla, human/orangutan, human/gibbon.

'''Note''': Also copy the entries giving the p-distance between human and each of the other four primates into the proper place in the titv.data file. (The numbers should all be in a single column under the "p_dist" header).

----

'''Question 8'''

'''Select uncorrected distances, counting only transitions:'''
dset subst=ti
: The option subst=ti specifies that only transitional substitutions should be counted. The previously issued "distance=p" is still the active setting. You can verify this by typing "dset ?" and checking the value listed for distance.

'''Print distance matrix, note transitional distances from human:'''
showdist
:In this distance matrix only the transitions have been counted for each pair of taxa.

'''Question:''' What are the transition-distances for the following pairs of sequences: human/chimpanzee, human/gorilla, human/orangutan, human/gibbon.
Note: Also enter the numbers in the column labeled "Transitions(P)" in the file.

----

'''Question 9'''

'''Select uncorrected distances, counting only transversions:'''
dset subst=tv

'''Print distance matrix, note transversional distances from human:'''
showdist

'''Question:''' What are the transversion-distances for the following pairs of sequences: human/chimpanzee, human/gorilla, human/orangutan, human/gibbon? Also enter the numbers in the column labeled "Transversions(Q)" in the file.

----

'''Question 10'''

'''Compute JC-corrected distances:'''
As we saw above, it is possible to come up with model-based corrections for the effect of multiple substitutions that allow us to estimate the real amount of change from the observed amount of change. For the JC-model, the equation for the corrected distance is:
:<math>d=-\frac{3}{4}\ln\left( 1 - \frac{4}{3}D \right)</math>
'''Question: '''For each of the four lines in the titv.data file, and based on the numbers in the column labeled p_dist, compute the JC-corrected distance. Enter the results in the column labeled "JC" in the titv.data file.

----

'''Question 11'''

'''Compute K2P corrected distance:'''
As was the case for the JC model, we can also compute estimated real distances under the K2P model. This can be done using the following equation:
:<math>d = -\frac{1}{2} \ln(1 - 2P - Q) - \frac{1}{4}\ln(1 - 2Q)</math>
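
One way to convince yourself that this equation inverts the K2P expressions for expected transitions and transversions is a round-trip check. The sketch below uses hypothetical function names (k2p_expected, k2p_corrected) and an arbitrary real distance of 0.3 with R = 10; it does not use the primate data:

```r
# Expected observable transition (P) and transversion (Q) differences
# under K2P, for real distance t and transition/transversion ratio R.
k2p_expected <- function(t, R) {
  A <- (-2 * R - 1) / (R + 1)
  B <- -2 / (R + 1)
  list(P = 0.25 - 0.5 * exp(A * t) + 0.25 * exp(B * t),
       Q = 0.5 - 0.5 * exp(B * t))
}

# K2P corrected distance from observed P (transitions) and Q (transversions).
k2p_corrected <- function(P, Q) {
  -0.5 * log(1 - 2 * P - Q) - 0.25 * log(1 - 2 * Q)
}

t_real <- 0.3                             # illustrative real distance
obs    <- k2p_expected(t_real, R = 10)
d_est  <- k2p_corrected(obs$P, obs$Q)

print(obs$P + obs$Q)   # expected total observed difference
print(d_est)           # corrected distance, should recover t_real
```

When the observed P and Q are exactly as expected, the corrected distance recovers the real distance; with real data the match is only approximate, for the sampling reasons discussed earlier.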

'''Question: '''Using the numbers in columns P and Q, you should now use this equation to compute the K2P-corrected distance estimates. Enter the results in the column labeled K2P in the file. Make sure to save the file after all results have been entered.

----
'''Question 12'''

'''Plot distances'''
In RStudio enter:
df = read_table("titv.data", col_names = FALSE, skip=1)

df = df %>% rename(organisms=X1,
div_time=X2,
pdist=X3,
transitions=X4,
transversions=X5,
JC = X6,
K2P = X7)

dflong = df %>% pivot_longer(cols=-c(organisms, div_time))

ggplot(dflong, aes(x=div_time, y=value, col=name)) +
geom_line() +
labs(x="Time since divergence (MY)",
y="Genetic distance (substitutions/site)")

: We have here plotted the total difference, the observed transitional and transversional difference, as well as the JC- and K2P-corrected distances as a function of estimated divergence times.

'''Question: '''Do the two different correction schemes result in the same estimates of the real distance?

Distance Matrix Methods

2024-03-19T13:34:02Z

WikiSysop: /* Getting started */

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Getting started ==

: '''Note:''' If you didn't already do this during the video lecture: Start by doing the [https://teaching.healthtech.dtu.dk/material/22115/Distance_handout.pdf handout exercise for distance matrix methods].

: In this exercise we will reconstruct phylogenetic trees using a variety of distance-based methods. Specifically, we will explore two different optimality criteria (least squares and minimum evolution), and one clustering method (neighbor joining).

'''Copy files for today's exercise:'''
Make sure you're still in today's working directory (condist) and that you already have the hcv.nexus file there. Now, copy the following file to that directory as well:
cp ../data/simple.nexus simple.nexus
ls -l
: simple.nexus is an artificial data set that I have constructed. It is identical to the one you analyzed by hand in the handout exercise. We will use it to convince ourselves that PAUP gets the same result as you.

----

== Analysis of the Simple Data Set ==

'''Question 1'''

'''Start PAUP* and load the simple data set:'''
paup simple.nexus

'''Select distance-based tree-reconstruction:'''
set criterion=distance

'''Select uncorrected distances under the un-weighted least squares criterion:'''
dset distance=p objective=lsfit power=0
: The dset command is used to set various options for the distance-based methods. Option "distance=p" specifies the use of "uncorrected sequence distances", i.e., we do not want to correct the observed distances for multiple substitutions. Note that distances are here reported as "substitutions per site". This simply means that the number of differences has been divided by the length of the sequence. You can think of this distance as the fraction of sites that are different between two sequences.

: The option "objective=lsfit" specifies that we want to reconstruct trees using the least squares optimality criterion. Recall that under least squares we are trying to find the tree that has the smallest possible deviation between the observed pairwise distances and the pairwise distances measured along the tree. (The distance between two taxa measured along the tree is called the "patristic" distance). The overall fit of the tree is found by (1) computing the difference between each observed distance and the corresponding patristic distance, (2) squaring this difference (this way we are sure to obtain a positive number, regardless of whether the observed or the patristic distance is larger), and (3) adding all the squared differences. The option "power=0" specifies that we do not want to weight the squared differences according to the size of the distances when computing this fit.
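
: The three steps can be sketched in R as follows. All numbers here are hypothetical (they are not the values from simple.nexus), and the tree is assumed to be the unrooted four-taxon tree that groups A with B and C with D:

```r
# Observed pairwise distances for four taxa (hypothetical values).
observed <- c(AB = 0.30, AC = 0.50, AD = 0.55,
              BC = 0.45, BD = 0.50, CD = 0.25)

# Branch lengths on the unrooted tree ((A,B),(C,D)); also hypothetical.
br <- c(A = 0.17, B = 0.13, C = 0.11, D = 0.14, internal = 0.21)

# Patristic distance = sum of branch lengths on the path between two taxa.
patristic <- c(
  AB = br[["A"]] + br[["B"]],
  AC = br[["A"]] + br[["internal"]] + br[["C"]],
  AD = br[["A"]] + br[["internal"]] + br[["D"]],
  BC = br[["B"]] + br[["internal"]] + br[["C"]],
  BD = br[["B"]] + br[["internal"]] + br[["D"]],
  CD = br[["C"]] + br[["D"]]
)

deviations <- observed - patristic   # step 1: differences
sse <- sum(deviations^2)             # steps 2 and 3: square and add (unweighted)

print(round(patristic, 2))
print(sse)
```

: A least squares search repeats this calculation for each candidate tree, with the branch lengths chosen to make the sum as small as possible for that tree, and keeps the tree with the smallest sum.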

'''Inspect distance matrix'''
showdist
: This command shows the distance matrix as evaluated under the current criteria.

'''Question: ''' Report the pairwise distances for all the pairs of (different) sequences: AB, AC, AD, BC, BD, CD

----

'''Question 2'''

'''Find best tree using exhaustive search:'''
alltrees
: This data set is sufficiently small that we can search through all possible trees.

'''Question: ''' How many different unrooted trees with 4 leaves is it possible to construct?

----

'''Question 3'''

'''Inspect best tree:'''
outgroup A D
set root=outgroup outroot=poly
describetrees all/plot=phylogram brlens=yes label=yes

'''Question: ''' We now want to investigate whether the fitted branch lengths correspond to the observed pairwise distances. First, draw a sketch of the tree (note that in the PAUP output, this unrooted tree may look a bit weird - just draw it in the normal unrooted way you also used for the manual exercise, i.e., the tree should have a total of 5 branches). Second, label each branch with the branch length as listed in the table you just produced with describetrees. Finally, compute the patristic distance between each pair of species on the tree by adding up the branch lengths of branches lying on the path between the two taxa. Do the observed pairwise distances (from the distance matrix in the previous question) correspond to the patristic distances in this case?

----

'''Question 4'''

'''Compare to the manually constructed tree:'''

'''Question: ''' We now want to investigate whether the tree that PAUP has found here corresponds to the one you constructed manually in the handout exercise. To do this you should convert all the fractional ("per-site") distances reported by PAUP to absolute distances. This is done simply by multiplying the fractional distance by the length of the alignment (15 positions, in this case). Are your tree and the PAUP tree identical (within rounding error)?

----

== Analysis of HCV Data Set using Neighbor Joining ==

'''Question 5'''

'''Set up analysis for HCV data set'''
execute hcv.nexus
set criterion=distance
dset distance=p objective=lsfit power=0
outgroup 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10
set root=outgroup outroot=monophyl
: These commands will: load the file hcv.nexus (say yes when asked whether you want to reset the active datafile), select distance-based tree-reconstruction, select uncorrected distances, define patient 2 sequences as the outgroup, set outgroup rooting, and ensure outgroup is printed as monophyletic sister group to ingroup.

'''Construct a neighbor joining tree based on the HCV data:'''
nj
: This will construct a neighbor joining tree using the active distance measure (currently set to uncorrected).

'''Print tree and table of branch lengths:'''
describetrees 1/plot=phylogram brlens=yes
: The neighbor joining tree resembles the trees you previously constructed using parsimony. Importantly, you should see that the viral sequences from different patients form distinct clusters. Note that only a single tree is produced. This is characteristic of clustering methods, which work by following a deterministic algorithm for constructing a tree from distance data. Clustering algorithms such as neighbor joining do not have any measure of tree-goodness and therefore are not able to identify sets of equally good trees.

'''Question: ''' The present neighbor joining tree was computed without correcting the observed distances for multiple substitutions. In the phylogram, identify the internal node that is ancestral to the patient 5 sequences (you will see that internal nodes are labeled with consecutive numbers), and also the internal node that is one level further down in the tree (i.e., ancestral to the ancestral node). You will note that the branch connecting these two nodes is relatively long. Locate the branch in the list of branch lengths, which is printed above the tree. What is the length of this branch?

----

'''Question 6'''

'''Select correction of multiple substitutions using the Jukes and Cantor model:'''
dset distance=jc
: This causes all observed distances to be corrected using a formula based on the Jukes and Cantor model of evolution. Recall that under the Jukes and Cantor model all base frequencies are assumed to be equal (at 0.25), and all base substitution rates are also assumed to be equal.

'''Construct a new neighbor joining tree using corrected distances:'''
nj

'''Print tree and table of branch lengths:'''
describetrees 1/plot=phylogram brlens=yes
: In this tree all branch lengths have been corrected for (unobserved) multiple substitutions. That means they are slightly longer than the uncorrected distances, and this correction is more noticeable for longer branches.

'''Question: ''' Again locate the internal node that is ancestral to the patient 5 sequences and also the immediate ancestor of this node (the node labels are not necessarily the same as before). Now find the corresponding branch in the table and make a note of the length. Is the corrected branch length longer than the uncorrected one?

----

'''Question 7''' What is the ratio of the corrected to the uncorrected branch length? (Divide the corrected branch length by the uncorrected one)

----

'''Question 8'''

'''Prepare table of model fit measures'''
: You are currently using neighbor joining to reconstruct the phylogenetic tree. Below you will also explore the use of least squares and minimum evolution methods. In order to compare the performance and characteristics of these methods we want to record some informative numbers. Construct a small table with two columns (labeled "SSE" and "tree length"), and three rows (labeled "NJ", "least squares", and "minimum evolution").

'''Question: ''' At the end of the list of branch lengths (printed with the describetrees command), you will find the sum of all branch lengths. This is often called the "length" of the tree. What is the length of the tree? (also enter this number in your table, under the column "tree length" in the row "NJ")

----

'''Question 9'''

'''Compute fit of NJ branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0
: The dscores command calculates the scores of trees in memory according to the distance criterion. In this case we are computing the fit between the observed pairwise distances and the branch lengths found by neighbor joining. The measure used is the sum of squared deviations mentioned above.

'''Question: ''' What is the sum of squared errors? (it is indicated by "SS" which is an abbreviation for sum of squares). Enter the number in the table

----

== Analysis of HCV Data Set Using Least Squares ==

'''Question 10'''

'''Select JC corrected distances under the unweighted least squares criterion:'''
dset distance=jc objective=lsfit power=0

'''Find the best tree using heuristic searching:'''
hsearch start=nj swap=tbr
: As we have seen previously, the HCV data set is far too big for exhaustive searching, and we therefore have to resort to heuristic techniques when we are using a phylogenetic reconstruction method that is based on an optimality criterion. In this case the starting tree is constructed by neighbor joining, i.e., it should be identical to the tree we just inspected (in previous exercises we have used a random starting tree, but neighbor joining will get us closer to the optimum from the start). The heuristic search (which again uses re-arrangements of the "tree-bisection and reconnection" type) should result in a small set of equally good trees.

'''Inspect trees:'''
contree all/strict=no majrule=yes percent=50
: This constructs a consensus tree from the set of equally good best trees. Again you should see that the set of best trees have individual patients clustered separately. Note that while the Neighbor Joining tree also showed this feature, it did not indicate that there might be any uncertainty as to the details of the tree. However, by using a method that has an explicit measure of tree goodness (least squares in this case) you have now learned that there are several equally good reconstructions of the branch order within the individual patient clusters.

'''Compute fit of least squares branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0
: Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Arbitrarily we have chosen to only do this for tree number 1 ("dscores all" would have done it for all trees in memory), but recall that all trees in memory are equally good, so the results would have been identical to what you now get.

'''Question: ''' What is the sum of squares? (Also enter the numbers in your table)

----

'''Question 11'''

'''Find total length of tree:'''
describetrees 1/plot=no brlens=yes

'''Question: ''' What is the sum of all branch lengths when using the least squares criterion? (Remember to also enter the numbers in your table).

----

'''Question 12''' Now, compare the results from this analysis with the numbers you obtained from the neighbor joining tree above. Has the fit improved? (Recall that for both sum of squares and tree length, smaller is better).

----

== Analysis of HCV Data Set Using Minimum Evolution ==

'''Question 13'''

'''Select JC corrected distances under the minimum evolution criterion:'''
dset distance=jc objective=me
: We now want to explore a different optimality criterion for distance-based analysis. Under minimum evolution we take the shortest tree to be the best one. This is very similar to parsimony, but in this case we are using pairwise, JC-corrected distances as the basis for reconstructing the tree. ME proceeds by searching through a list of possible trees; for each tested topology the best set of branch lengths is found by the least squares method, but instead of choosing the tree with the best fit, we end up choosing the shortest tree.

'''Find the best tree using heuristic searching starting from a NJ tree:'''
hsearch start=nj swap=tbr

'''Inspect trees:'''
contree all/strict=no majrule=yes percent=50
: Again you should see that the set of best trees have individual patients clustered separately.

'''Find total length of tree:'''
describetrees 1/plot=no brlens=yes

'''Question: ''' At the end of the table listing branch lengths, you will again find the sum of all branch lengths. What is it?

----

'''Question 14''' Is the minimum evolution tree shorter than the other two trees?

----

'''Question 15'''

'''Compute fit of minimum evolution branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0

'''Question: ''' Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Note the result from this analysis in your table and compare it with the numbers you obtained from the neighbor joining and least squares analyses above. How is the fit of the ME tree compared to those two judged by the sum of squares?

Distance Matrix Methods

2024-03-19T13:31:16Z

WikiSysop: Created page with "This exercise is part of the course Computational Molecular Evolution (22115). == Getting started == : '''Note:''' If you didn't already do this during the video lecture: Start by doing the [https://teaching.healthtech.dtu.dk/22115/images/3/3f/Distance_handout.pdf handout exercise for distance matrix methods]. : In this exercise we will reconstruct phylogenetic trees using a variety of distance-based methods. Specificall..."

This exercise is part of the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]].

== Getting started ==

: '''Note:''' If you didn't already do this during the video lecture: Start by doing the [https://teaching.healthtech.dtu.dk/22115/images/3/3f/Distance_handout.pdf handout exercise for distance matrix methods].

: In this exercise we will reconstruct phylogenetic trees using a variety of distance-based methods. Specifically, we will explore two different optimality criteria (least squares and minimum evolution), and one clustering method (neighbor joining).

'''Copy files for today's exercise:'''
Make sure you're still in today's working directory (condist) and that you already have the hcv.nexus file there. Now, also copy the following file to that directory:
cp ../data/simple.nexus simple.nexus
ls -l
: simple.nexus is an artificial data set that I have constructed. It is identical to the one you analyzed by hand in the handout exercise. We will use it to convince ourselves that PAUP gets the same result as you.

----

== Analysis of the Simple Data Set ==

'''Question 1'''

'''Start PAUP* and load the simple data set:'''
paup simple.nexus

'''Select distance-based tree-reconstruction:'''
set criterion=distance

'''Select uncorrected distances under the un-weighted least squares criterion:'''
dset distance=p objective=lsfit power=0
: The dset command is used to set various options for the distance-based methods. Option "distance=p" specifies the use of "uncorrected sequence distances", i.e., we do not want to correct the observed distances for multiple substitutions. Note that distances are here reported as "substitutions per site". This simply means that the number of differences has been divided by the length of the sequence. You can think of this distance as the fraction of sites that are different between two sequences.

: The option "objective=lsfit" specifies that we want to reconstruct trees using the least squares optimality criterion. Recall that under least squares we are trying to find the tree that has the smallest possible deviation between the observed pairwise distances and the pairwise distances measured along the tree. (The distance between two taxa measured along the tree is called the "patristic" distance.) The overall fit of the tree is found by (1) computing the difference between each observed distance and the corresponding patristic distance, (2) squaring this difference (this way we are sure to obtain a positive number, regardless of whether the observed or the patristic distance is larger), and (3) adding all the squared differences. The option "power=0" specifies that we do not want to weight the squared differences according to branch lengths when computing this fit.
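: The two ideas above (uncorrected distances and the unweighted sum of squared deviations) can be sketched in a few lines of Python. This is only an illustration, not PAUP's implementation: the toy alignment, the patristic distances, and the function names <code>p_distance</code> and <code>sum_of_squares</code> are made up for the example and are unrelated to simple.nexus.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Uncorrected distance: fraction of aligned sites that differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def sum_of_squares(observed, patristic):
    """Unweighted (power=0) sum of squared deviations over all pairs."""
    return sum((observed[pair] - patristic[pair]) ** 2 for pair in observed)


# Hypothetical toy alignment (assumption, not course data)
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGAACGTTA",
}

# Observed pairwise distances, in substitutions per site
observed = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(sorted(alignment), 2)
}

# Hypothetical patristic distances read off some candidate tree
patristic = {("A", "B"): 0.10, ("A", "C"): 0.28, ("B", "C"): 0.22}

ss = sum_of_squares(observed, patristic)
print(observed)
print(f"SS = {ss:.6f}")
```

: A smaller SS means that the distances measured along the tree agree more closely with the observed distances.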

'''Inspect distance matrix'''
showdist
: This command shows the distance matrix as evaluated under the current criteria.

'''Question: ''' Report the pairwise distances for all the pairs of (different) sequences: AB, AC, AD, BC, BD, CD

----

'''Question 2'''

'''Find best tree using exhaustive search:'''
alltrees
: This data set is sufficiently small that we can search through all possible trees.

'''Question: ''' How many different unrooted trees with 4 leaves is it possible to construct?

----

'''Question 3'''

'''Inspect best tree:'''
outgroup A D
set root=outgroup outroot=poly
describetrees all/plot=phylogram brlens=yes label=yes

'''Question: ''' We now want to investigate whether the fitted branch lengths correspond to the observed pairwise distances. First, draw a sketch of the tree (note that in the PAUP output, this unrooted tree may look a bit weird - just draw it in the normal unrooted way you also used for the manual exercise, i.e., the tree should have a total of 5 branches). Second, label each branch with the branch length as listed in the table you just produced with describetrees. Finally, compute the patristic distance between each pair of species on the tree by adding up the branch lengths of branches lying on the path between the two taxa. Do the observed pairwise distances (from the distance matrix in the previous question) correspond to the patristic distances in this case?
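: As a sketch of how patristic distances are obtained, the Python snippet below adds up branch lengths along the path between each pair of taxa on an unrooted four-taxon tree with 5 branches. The topology ((A,B),(C,D)) and all branch lengths are hypothetical values chosen for illustration; they are not the values PAUP reports for the simple data set.

```python
from itertools import combinations

# Hypothetical branch lengths for the unrooted tree ((A,B),(C,D)):
# four terminal branches plus one internal branch (5 branches in total).
terminal = {"A": 0.10, "B": 0.05, "C": 0.20, "D": 0.15}
internal = 0.30

# Which end of the internal branch each taxon is attached to
side = {"A": "left", "B": "left", "C": "right", "D": "right"}


def patristic(x, y):
    """Sum of branch lengths on the path between taxa x and y."""
    distance = terminal[x] + terminal[y]
    if side[x] != side[y]:
        # The path crosses the internal branch
        distance += internal
    return distance


patristic_distances = {
    (x, y): patristic(x, y) for x, y in combinations("ABCD", 2)
}

for pair, distance in patristic_distances.items():
    print(pair, round(distance, 2))
```

: Comparing such a table with the observed distance matrix, pair by pair, is exactly the check asked for in this question.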

----

'''Question 4'''

'''Compare to the manually constructed tree:'''

'''Question: ''' We now want to investigate whether the tree that PAUP has found here corresponds to the one you constructed manually in the handout exercise. To do this you should convert all the fractional ("per-site") distances reported by PAUP to absolute distances. This is done simply by multiplying the fractional distance by the length of the alignment (15 positions, in this case). Are your tree and the PAUP tree identical (within rounding error)?

----

== Analysis of HCV Data Set using Neighbor Joining ==

'''Question 5'''

'''Set up analysis for HCV data set'''
execute hcv.nexus
set criterion=distance
dset distance=p objective=lsfit power=0
outgroup 2_1_1 2_1_2 2_1_3 2_1_4 2_1_5 2_1_7 2_1_8 2_1_9 2_1_10
set root=outgroup outroot=monophyl
: These commands will: load the file hcv.nexus (say yes when asked whether you want to reset the active datafile), select distance-based tree-reconstruction, select uncorrected distances, define patient 2 sequences as the outgroup, set outgroup rooting, and ensure outgroup is printed as monophyletic sister group to ingroup.

'''Construct a neighbor joining tree based on the HCV data:'''
nj
: This will construct a neighbor joining tree using the active distance measure (currently set to uncorrected).

'''Print tree and table of branch lengths:'''
describetrees 1/plot=phylogram brlens=yes
: The neighbor joining tree resembles the trees you previously constructed using parsimony. Importantly, you should see that the viral sequences from different patients form distinct clusters. Note that only a single tree is produced. This is characteristic of clustering methods, which work by following a deterministic algorithm for constructing a tree from distance data. Clustering algorithms such as neighbor joining do not have any measure of tree-goodness and therefore are not able to identify sets of equally good trees.
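: To illustrate what "following a deterministic algorithm" means, here is a minimal Python sketch of the first joining step of the standard neighbor joining procedure: compute the Q criterion for every pair, join the pair with the smallest value, and assign the two new branch lengths. The 5-taxon distance matrix is a made-up example, the function name <code>nj_first_join</code> is hypothetical, and the code is not meant to reproduce PAUP's internals.

```python
from itertools import combinations


def nj_first_join(names, dist):
    """Return the first pair joined by neighbor joining and the lengths
    of the two branches leading from that pair to their new node."""
    n = len(names)
    # Sum of distances from each taxon to all others
    row_sum = {x: sum(dist[x][y] for y in names if y != x) for x in names}
    # Q criterion: the pair with the smallest value is joined
    q = {
        (x, y): (n - 2) * dist[x][y] - row_sum[x] - row_sum[y]
        for x, y in combinations(names, 2)
    }
    x, y = min(q, key=q.get)
    # Branch lengths from x and y to the newly created internal node
    len_x = dist[x][y] / 2 + (row_sum[x] - row_sum[y]) / (2 * (n - 2))
    len_y = dist[x][y] - len_x
    return (x, y), (len_x, len_y)


# Hypothetical symmetric distance matrix (assumption, not HCV data)
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {
    names[i]: {names[j]: matrix[i][j] for j in range(len(names))}
    for i in range(len(names))
}

pair, lengths = nj_first_join(names, dist)
print(pair, lengths)
```

: The full algorithm repeats this step on a reduced matrix until the tree is resolved. Because every step is fully determined by the distances, the same input always gives the same single tree.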

'''Question: ''' The present neighbor joining tree was computed without correcting the observed distances for multiple substitutions. In the phylogram, identify the internal node that is ancestral to the patient 5 sequences (you will see that internal nodes are labeled with consecutive numbers), and also the internal node that is one level further down in the tree (i.e., ancestral to the ancestral node). You will note that the branch connecting these two nodes is relatively long. Locate the branch in the list of branch lengths, which is printed above the tree. What is the length of this branch?

----

'''Question 6'''

'''Select correction of multiple substitutions using the Jukes and Cantor model:'''
dset distance=jc
: This causes all observed distances to be corrected using a formula based on the Jukes and Cantor model of evolution. Recall that under the Jukes and Cantor model all base frequencies are assumed to be equal (at 0.25), and all base substitution rates are also assumed to be equal.
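: The Jukes and Cantor correction is <code>d = -3/4 * ln(1 - 4p/3)</code>, where p is the observed fraction of differing sites. The short Python sketch below applies it to a few illustrative p values (chosen arbitrarily, not taken from the HCV data); the function name <code>jc_distance</code> is an assumption made for this example.

```python
import math


def jc_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences are saturated and d is undefined
        raise ValueError("p must be in the interval [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Illustrative observed distances (assumed values)
for p in (0.01, 0.10, 0.30, 0.50):
    d = jc_distance(p)
    print(f"p = {p:.2f}  JC = {d:.4f}  ratio = {d / p:.3f}")
```

: Note how the ratio between corrected and uncorrected distance grows with p: small distances are almost unchanged, while large distances are stretched considerably.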

'''Construct a new neighbor joining tree using corrected distances:'''
nj

'''Print tree and table of branch lengths:'''
describetrees 1/plot=phylogram brlens=yes
: In this tree all branch lengths are based on distances that have been corrected for (unobserved) multiple substitutions. That means they are slightly longer than the uncorrected branch lengths, and this correction is more noticeable for longer branches.

'''Question: ''' Again locate the internal node that is ancestral to the patient 5 sequences and also the immediate ancestor of this node (the node labels are not necessarily the same as before). Now find the corresponding branch in the table and make a note of the length. Is the corrected branch length longer than the uncorrected one?

----

'''Question 7''' What is the ratio of the corrected to the uncorrected branch length? (Divide the corrected branch length by the uncorrected one)

----

'''Question 8'''

'''Prepare table of model fit measures'''
: You are currently using neighbor joining to reconstruct the phylogenetic tree. Below you will also explore the use of least squares and minimum evolution methods. In order to compare the performance and characteristics of these methods we want to record some informative numbers. Construct a small table with two columns (labeled "SSE" and "tree length"), and three rows (labeled "NJ", "least squares", and "minimum evolution").

'''Question: ''' At the end of the list of branch lengths (printed with the describetrees command), you will find the sum of all branch lengths. This is often called the "length" of the tree. What is the length of the tree? (also enter this number in your table, under the column "tree length" in the row "NJ")

----

'''Question 9'''

'''Compute fit of NJ branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0
: The dscores command calculates the scores of trees in memory according to the distance criterion. In this case we are computing the fit between the observed pairwise distances and the branch lengths found by neighbor joining. The measure used is the sum of squared deviations mentioned above.

'''Question: ''' What is the sum of squared errors? (It is indicated by "SS", which is an abbreviation for sum of squares.) Enter the number in the table.

----

== Analysis of HCV Data Set Using Least Squares ==

'''Question 10'''

'''Select JC corrected distances under the unweighted least squares criterion:'''
dset distance=jc objective=lsfit power=0

'''Find the best tree using heuristic searching:'''
hsearch start=nj swap=tbr
: As we have seen previously, the HCV data set is far too big for exhaustive searching, and we therefore have to resort to heuristic techniques when we are using a phylogenetic reconstruction method that is based on an optimality criterion. In this case the starting tree is constructed by neighbor joining, i.e., it should be identical to the tree we just inspected (in previous exercises we have used a random starting tree, but neighbor joining will get us closer to the optimum from the start). The heuristic search (which again uses re-arrangements of the "tree-bisection and reconnection" type) should result in a small set of equally good trees.

'''Inspect trees:'''
contree all/strict=no majrule=yes percent=50
: This constructs a consensus tree from the set of equally good best trees. Again you should see that, in the set of best trees, individual patients are clustered separately. Note that while the neighbor joining tree also showed this feature, it did not indicate that there might be any uncertainty as to the details of the tree. However, by using a method that has an explicit measure of tree goodness (least squares in this case) you have now learned that there are several equally good reconstructions of the branch order within the individual patient clusters.
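: The idea behind a 50% majority-rule consensus can be sketched as follows: count how often each group of taxa occurs among the input trees and keep the groups found in more than half of them. In this Python illustration each tree is represented simply as a set of groups (clades); the three trees and the taxon names are hypothetical, and the code is not how PAUP stores or processes trees.

```python
from collections import Counter

# Each hypothetical tree is described by the groups (clades) it contains
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("AB"), frozenset("ABC")},
]

# Count in how many trees each group occurs
group_counts = Counter(group for tree in trees for group in tree)

# Keep groups present in more than 50% of the trees
majority = {
    group: count / len(trees)
    for group, count in group_counts.items()
    if count / len(trees) > 0.5
}

for group, frequency in majority.items():
    print(sorted(group), f"{frequency:.0%}")
```

: Groups that appear in all trees (here A+B) would also be present in a strict consensus, while groups supported by only some of the trees reflect the uncertainty mentioned above.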

'''Compute fit of least squares branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0
: Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Arbitrarily we have chosen to only do this for tree number 1 ("dscores all" would have done it for all trees in memory), but recall that all trees in memory are equally good, so the results would have been identical to what you now get.

'''Question: ''' What is the sum of squares? (Also enter the numbers in your table)

----

'''Question 11'''

'''Find total length of tree:'''
describetrees 1/plot=no brlens=yes

'''Question: ''' What is the sum of all branch lengths when using the least squares criterion? (Remember to also enter the numbers in your table).

----

'''Question 12''' Now, compare the results from this analysis with the number you obtained from the neighbor joining tree above. Has the fit improved? (Recall that for both sum of squares and tree length, smaller is better).

----

== Analysis of HCV Data Set Using Minimum Evolution ==

'''Question 13'''

'''Select JC corrected distances under the minimum evolution criterion:'''
dset distance=jc objective=me
: We now want to explore a different optimality criterion for distance-based analysis. Under minimum evolution we take the shortest tree to be the best one. This is very similar to parsimony, but in this case we are using pairwise, JC-corrected distances as the basis for reconstructing the tree. ME proceeds by searching through a list of possible trees; for each tested topology the best set of branch lengths is found by the least squares method, but instead of choosing the tree with the best fit, we end up choosing the shortest tree.
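: The difference between the two criteria lies only in the final selection step, which the following Python sketch tries to make explicit. It assumes that branch lengths have already been fitted by least squares for each candidate topology; the three candidate trees, their branch lengths, and their SS values are invented numbers used purely for illustration.

```python
# Hypothetical candidate topologies with least-squares fitted branch
# lengths and the resulting sum of squared deviations (SS).
candidates = {
    "tree1": {"branch_lengths": [0.10, 0.05, 0.20, 0.15, 0.30], "ss": 0.0012},
    "tree2": {"branch_lengths": [0.12, 0.06, 0.18, 0.14, 0.25], "ss": 0.0020},
    "tree3": {"branch_lengths": [0.11, 0.07, 0.22, 0.16, 0.28], "ss": 0.0031},
}


def tree_length(tree):
    """Tree length = sum of all branch lengths."""
    return sum(tree["branch_lengths"])


# Least squares: pick the tree with the smallest SS
best_ls = min(candidates, key=lambda name: candidates[name]["ss"])

# Minimum evolution: pick the tree with the smallest total length
best_me = min(candidates, key=lambda name: tree_length(candidates[name]))

print("least squares choice:    ", best_ls)
print("minimum evolution choice:", best_me)
```

: In this made-up example the two criteria disagree: the tree with the best fit is not the shortest one. Whether that also happens for the HCV data is what the following questions ask you to find out.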

'''Find the best tree using heuristic searching starting from a NJ tree:'''
hsearch start=nj swap=tbr

'''Inspect trees:'''
contree all/strict=no majrule=yes percent=50
: Again you should see that, in the set of best trees, individual patients are clustered separately.

'''Find total length of tree:'''
describetrees 1/plot=no brlens=yes

'''Question: ''' At the end of the table listing branch lengths, you will again find the sum of all branch lengths. What is it?

----

'''Question 14''' Is the minimum evolution tree shorter than the other two trees?

----

'''Question 15'''

'''Compute fit of minimum evolution branch lengths to observed pairwise distances:'''
dscores 1/objective=lsfit power=0

'''Question: ''' Again, we are computing the sum of squared deviations between observed and patristic pairwise distances. Note the result from this analysis in your table and compare it with the numbers you obtained from the neighbor joining and least squares analyses above. Judged by the sum of squares, how does the fit of the ME tree compare to those two?

2024-03-19T13:07:45Z

WikiSysop:

These are instructions for how to install software and data used on the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]] when using the Linux operating system.

The commands assume you are using the [https://ubuntu.com/server/docs/package-management apt package manager] used on e.g. Ubuntu Linux.

'''Note''': If your computer uses a CPU based on the ARM architecture, rather than the more common Intel or AMD (x86_64) architectures, some commands may need to be adjusted. Please let me know if this applies to you, so I can provide additional instructions tailored for ARM-based systems.

# Use the commented-out commands below if you want to copy my premade .bashrc file for customising bash
# WARNING: do not overwrite a pre-existing .bashrc unless you are sure it contains nothing you want to keep
# NOTE: if you are using a different shell, then you should use the corresponding .rc file (e.g., .zshrc for zsh)
# wget http://teaching.bioinformatics.dtu.dk/material/22115/bashrc.txt
# mv bashrc.txt ~/.bashrc

# Nedit
sudo apt update
sudo apt -y install nedit

# R, Rstudio
sudo apt -y install r-base r-base-dev gdebi-core
wget https://download1.rstudio.org/electron/focal/amd64/rstudio-2023.12.1-402-amd64.deb
sudo gdebi -n ./rstudio-2023.12.1-402-amd64.deb
rm rstudio-2023.12.1-402-amd64.deb

# Dependencies for R-packages
sudo apt -y install libcurl4-openssl-dev libxml2-dev libgit2-dev libopenblas-base

# MrBayes
sudo apt -y install git
git clone --depth=1 https://github.com/NBISweden/MrBayes.git ~/MrBayes
cd ~/MrBayes
./configure --disable-sse
make
sudo make install
cd ..
# Note: above, I am using the flag --disable-sse to avoid crashes on some machines
# It is possible that mb will run faster if you omit this flag, so you may want to experiment
# with using just "./configure" instead (without the quotes)

# PAUP
wget http://phylosolutions.com/paup-test/paup4a168_ubuntu64.gz
gunzip paup4a168_ubuntu64.gz
chmod 755 paup4a168_ubuntu64
sudo mv paup4a168_ubuntu64 /usr/local/bin/paup
sudo apt -y install libpython2.7

# PAML
sudo apt -y install paml

# jmodeltest
wget https://github.com/ddarriba/jmodeltest2/files/157117/jmodeltest-2.1.10.tar.gz
sudo tar -xvf jmodeltest-2.1.10.tar.gz --directory /usr/local/src
rm jmodeltest-2.1.10.tar.gz
echo "alias jmodeltest='java -jar /usr/local/src/jmodeltest-2.1.10/jModelTest.jar'" >> ~/.bashrc

# BEAST2
wget https://github.com/CompEvol/beast2/releases/download/v2.7.6/BEAST.v2.7.6.Linux.x86.tgz
sudo tar -zxvf BEAST.v2.7.6.Linux.x86.tgz --directory /usr/local/src
echo "alias beauti='/usr/local/src/beast/bin/beauti > /dev/null 2> /dev/null'" >> ~/.bashrc
echo 'PATH="/usr/local/src/beast/bin${PATH:+:${PATH}}"' >> ~/.bashrc

# FigTree
sudo apt -y install figtree

# Tracer
wget https://github.com/beast-dev/tracer/releases/download/v1.7.2/Tracer_v1.7.2.tgz
sudo mkdir /usr/local/src/Tracer
sudo tar -zxf Tracer_v1.7.2.tgz --directory /usr/local/src/Tracer
sudo ln -s /usr/local/src/Tracer/bin/tracer /usr/local/bin/
rm Tracer_v1.7.2.tgz

# MAFFT
sudo apt -y install mafft

# Aliview
wget https://ormbunkar.se/aliview/downloads/linux/linux-version-1.28/aliview.install.run
chmod 755 aliview.install.run
sudo ./aliview.install.run
rm aliview.install.run

# seqconverter, sequencelib, phylotreelib
# Anders Gorm scripts and libraries:
# https://github.com/agormp/phylotreelib
# https://github.com/agormp/seqconverter
sudo apt -y install python3-numpy python3-pip
pip3 install seqconverter
pip3 install phylotreelib
echo 'PATH="$HOME/.local/bin${PATH:+:${PATH}}"' >> ~/.bashrc

# maxalign tool (see: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-312)
wget http://teaching.bioinformatics.dtu.dk/material/22115/maxalign.pl
chmod 755 maxalign.pl
sudo mv maxalign.pl /usr/local/bin

# Clean up
sudo apt autoremove --purge
sudo apt clean

# Activate changes to .bashrc in current shell
source ~/.bashrc

# Set up molevol directory for course exercises
# You can place this directory anywhere you prefer:
# Just replace tilde (~) in the command below with path to preferred base directory
# (The tilde symbol is short for the user's home directory)
cd ~
mkdir molevol
wget http://teaching.bioinformatics.dtu.dk/material/22115/data.tar.gz
tar -xvf data.tar.gz --directory molevol
rm data.tar.gz

# R packages (do this inside Rstudio)
install.packages("tidyverse")
install.packages("bayesplot")

Linux software installation

2024-03-19T13:04:17Z

WikiSysop: Created page with " These are instructions for how to install software and data used on the course Computational Molecular Evolution (22115) when using the Linux operating system. The commands assume you are using the [https://ubuntu.com/server/docs/package-management apt package manager] used on e.g. Ubuntu Linux. '''Note''': If your computer uses a CPU based on the ARM architecture, rather than the more common Intel or AMD (x86_64) archite..."

These are instructions for how to install software and data used on the course [[22115_-_Computational_Molecular_Evolution|Computational Molecular Evolution (22115)]] when using the Linux operating system.

The commands assume you are using the [https://ubuntu.com/server/docs/package-management apt package manager] used on e.g. Ubuntu Linux.

'''Note''': If your computer uses a CPU based on the ARM architecture, rather than the more common Intel or AMD (x86_64) architectures, some commands may need to be adjusted. Please let me know if this applies to you, so I can provide additional instructions tailored for ARM-based systems.

# Use the commented-out commands below if you want to copy my premade .bashrc file for customising bash
# WARNING: do not overwrite a pre-existing .bashrc unless you are sure it contains nothing you want to keep
# NOTE: if you are using a different shell, then you should use the corresponding .rc file (e.g., .zshrc for zsh)
# wget http://teaching.bioinformatics.dtu.dk/material/36615/bashrc.txt
# mv bashrc.txt ~/.bashrc

# Nedit
sudo apt update
sudo apt -y install nedit

# R, Rstudio
sudo apt -y install r-base r-base-dev gdebi-core
wget https://download1.rstudio.org/electron/focal/amd64/rstudio-2023.12.1-402-amd64.deb
sudo gdebi -n ./rstudio-2023.12.1-402-amd64.deb
rm rstudio-2023.12.1-402-amd64.deb

# Dependencies for R-packages
sudo apt -y install libcurl4-openssl-dev libxml2-dev libgit2-dev libopenblas-base

# MrBayes
sudo apt -y install git
git clone --depth=1 https://github.com/NBISweden/MrBayes.git ~/MrBayes
cd ~/MrBayes
./configure --disable-sse
make
sudo make install
cd ..
# Note: above, I am using the flag --disable-sse to avoid crashes on some machines
# It is possible that mb will run faster if you omit this flag, so you may want to experiment
# with using just "./configure" instead (without the quotes)

# PAUP
wget http://phylosolutions.com/paup-test/paup4a168_ubuntu64.gz
gunzip paup4a168_ubuntu64.gz
chmod 755 paup4a168_ubuntu64
sudo mv paup4a168_ubuntu64 /usr/local/bin/paup
sudo apt -y install libpython2.7

# PAML
sudo apt -y install paml

# jmodeltest
wget https://github.com/ddarriba/jmodeltest2/files/157117/jmodeltest-2.1.10.tar.gz
sudo tar -xvf jmodeltest-2.1.10.tar.gz --directory /usr/local/src
rm jmodeltest-2.1.10.tar.gz
echo "alias jmodeltest='java -jar /usr/local/src/jmodeltest-2.1.10/jModelTest.jar'" >> ~/.bashrc

# BEAST2
wget https://github.com/CompEvol/beast2/releases/download/v2.7.6/BEAST.v2.7.6.Linux.x86.tgz
sudo tar -zxvf BEAST.v2.7.6.Linux.x86.tgz --directory /usr/local/src
echo "alias beauti='/usr/local/src/beast/bin/beauti > /dev/null 2> /dev/null'" >> ~/.bashrc
echo 'PATH="/usr/local/src/beast/bin${PATH:+:${PATH}}"' >> ~/.bashrc

# FigTree
sudo apt -y install figtree

# Tracer
wget https://github.com/beast-dev/tracer/releases/download/v1.7.2/Tracer_v1.7.2.tgz
sudo mkdir /usr/local/src/Tracer
sudo tar -zxf Tracer_v1.7.2.tgz --directory /usr/local/src/Tracer
sudo ln -s /usr/local/src/Tracer/bin/tracer /usr/local/bin/
rm Tracer_v1.7.2.tgz

# MAFFT
sudo apt -y install mafft

# Aliview
wget https://ormbunkar.se/aliview/downloads/linux/linux-version-1.28/aliview.install.run
chmod 755 aliview.install.run
sudo ./aliview.install.run
rm aliview.install.run

# seqconverter, sequencelib, phylotreelib
# Anders Gorm scripts and libraries:
# https://github.com/agormp/phylotreelib
# https://github.com/agormp/seqconverter
sudo apt -y install python3-numpy python3-pip
pip3 install seqconverter
pip3 install phylotreelib
echo 'PATH="$HOME/.local/bin${PATH:+:${PATH}}"' >> ~/.bashrc

# maxalign tool (see: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-312)
wget http://teaching.bioinformatics.dtu.dk/material/36615/maxalign.pl
chmod 755 maxalign.pl
sudo mv maxalign.pl /usr/local/bin

# Clean up
sudo apt autoremove --purge
sudo apt clean

# Activate changes to .bashrc in current shell
source ~/.bashrc

# Set up molevol directory for course exercises
# You can place this directory anywhere you prefer:
# Just replace tilde (~) in the command below with path to preferred base directory
# (The tilde symbol is short for the user's home directory)
cd ~
mkdir molevol
wget http://teaching.bioinformatics.dtu.dk/material/36615/data.tar.gz
tar -xvf data.tar.gz --directory molevol
rm data.tar.gz

# R packages (do this inside Rstudio)
install.packages("tidyverse")
install.packages("bayesplot")

File:Darwin logo2 medium.png

2024-03-19T13:03:32Z

WikiSysop:

22115 - Computational Molecular Evolution

2024-03-19T12:36:12Z

WikiSysop: Created page with "; Overview 550px : This page contains links to video lectures, computer exercises, and other material for the course [https://kurser.dtu.dk/course/22115 22115 - Computational Molecular Evolution], which is part of the [https://www.dtu.dk/english/education/msc/programmes/systems_biology MSc in Bioinformatics and Systems Biology] at the [https://www.dtu.dk/english Technical University of Denmark]. The course is taught by Prof..."

; Overview [[File:Darwin logo2 medium.png |right|border|550px]]
: This page contains links to video lectures, computer exercises, and other material for the course [https://kurser.dtu.dk/course/22115 22115 - Computational Molecular Evolution], which is part of the [https://www.dtu.dk/english/education/msc/programmes/systems_biology MSc in Bioinformatics and Systems Biology] at the [https://www.dtu.dk/english Technical University of Denmark]. The course is taught by Professor Anders Gorm Pedersen, [https://www.healthtech.dtu.dk/english/Research/Research-Sections/Section-Bioinformatics Section for Bioinformatics], [https://www.healthtech.dtu.dk/english Department of Health Technology].

: The main goal of this course is to give an introduction to theory and algorithms in the field of computational molecular evolution. We will cover basic evolutionary theory (common descent, natural selection, genetic drift, models of growth and selection), and the main types of algorithms used for constructing and analyzing phylogenetic trees (parsimony, distance based methods, maximum likelihood methods, and Bayesian inference). We will also discuss the role of statistical modeling in science more generally.

:The course will consist of lectures, computer exercises, and mini-projects. The student will acquire practical experience in the use of a range of computational methods by analyzing sequences from the scientific literature.

__TOC__

=='''Computer setup'''==

===Linux===
:* [[Linux software installation]]


===Windows===
:* [[Windows software installation]]


===MacOS===
:* [[MacOS software installation]]



== '''Lecture Schedule''' ==

:([[27615 Previous course programs|Course programs, previous years]])

===Week 1 (January 31): Introduction to evolutionary theory and population genetics. Models of growth, selection and mutation===

:; Online lectures
:* [https://youtu.be/okjVaLA5S38 Common descent (11:52)]
:* [https://youtu.be/VkkIu1ZtaIE Natural selection (14:57)]
:* [https://youtu.be/wqa6W3_WW7s Evidence for evolution (part 1) (9:34)]
:* [http://y2u.be/_-a-F8egAis Evidence for evolution (part 2) (20:54)]
:* [http://y2u.be/AUGbSMWPILE Population growth and selection (18:13)]

:; Course material
:* [https://github.com/agormp/evolintro/blob/main/evolintro.pdf Lecture notes on evolutionary theory and population genetics]
:* [http://teaching.healthtech.dtu.dk/material/36615/slides_week1.pdf Slides, week 1]

:; Computer exercise
:* [[Population Growth, Fitness, and Selection]]

----

===Week 2 (February 7): Neutral mutations and genetic drift. Tree reconstruction by parsimony===

:; Online lectures
:* [https://youtu.be/cQVjL50kK0k Neutral Theory of Molecular Evolution (11:28)]
:* [https://youtu.be/J8LDUFm4ttA Genetic Drift (9:35)]
:* [https://youtu.be/AZkHWdl2oAw Trees: Terminology and Representation (9:41)]
:* [https://youtu.be/zCj1s9fmaKs Homology and Homoplasy (9:07)]
:* [https://youtu.be/gXb_WuLCD8g Maximum Parsimony (7:48)]
:* [https://youtu.be/Q7ZpdPCx0uQ The Fitch Algorithm (10:31)]
:* [https://youtu.be/deywW9wJXmw Searching Tree Space (14:01)]

:; Course material
:* [http://teaching.healthtech.dtu.dk/material/36615/slides_week2.pdf Slides, week 2]
:* [http://teaching.healthtech.dtu.dk/material/36615/Paup_Doc_31.pdf PAUP 3.1 manual (note: for older version - contains explanations of parsimony and tree moves)]
:* [http://teaching.healthtech.dtu.dk/material/36615/PAUP4-manual.pdf PAUP 4beta command reference]

:; Computer exercise
:* [[Phylogenetic Analysis using Parsimony]]
----

===Week 3 (February 14): Consensus trees. Distance matrix methods===

:; Online lectures
:* [https://www.youtube.com/watch?v=YXZZyu9OAcg Consensus Trees (16:27)]
:* [https://www.youtube.com/watch?v=MhjSSxcGjaY Distance Matrix Methods, part 1 (6:07)]
:* [https://www.youtube.com/watch?v=PNoUcQTCxiM Distance Matrix Methods, part 2 (22:28)]
:* [https://www.youtube.com/watch?v=Dj24mCLQYUE Neighbour Joining (15:28)]

:; Course material
:* [[Media:Consensus.pdf|Handout exercise: Consensus Trees]]
:* [[Media:Distance handout.pdf|Handout exercise: Distance Matrix Methods]]
:* [[Media:Slides week3.pdf|Slides, week 3]]

:; Computer exercises
:* [[Consensus Trees]]
:* [[Distance Matrix Methods]]

----

===Week 4+5 (February 21 + 28): Mini project 1===



----

===Week 6 (March 6): Models of sequence evolution. Likelihood methods===

:; Online lectures
:* [https://youtu.be/ro2MFmVZypQ Models of evolution (28:48)]
:* [https://youtu.be/xDKUIegYpWM Maximum likelihood (22:06)]
:* [https://youtu.be/Siau2o_egGI Ancestral reconstruction (10:45)]

:; Course material
:* [[Media:Handout real exp change.pdf|Handout exercise: Real, Observed, and Expected Change]]
:* [[Media:Handout likelihood.pdf|Handout exercise: Computation of Likelihood]]
:* [[Media:Slides week4.pdf|Slides, week 6]]
:* [http://teaching.bioinformatics.dtu.dk/material/36615/substitutionmodels.pdf Lecture notes: Substitution models]
:* [http://teaching.bioinformatics.dtu.dk/material/36615/main.pdf Optional lecture notes: Matrix exponentials for Markov chains]
:; Computer exercises
:* [[Models of Evolution]]
:* [[Maximum Likelihood]]

----

===Week 7 (March 13): Bayesian inference of phylogeny===

:; Online lectures
:* [https://www.youtube.com/watch?v=DI3TIx78NqM&t=12s Bayesian Inference (23:48)]
:* [https://youtu.be/uyG5DVigEyM?list=PLXwjzs_mabFrlRF7uALEomfGGckG0sG5y Markov chain Monte Carlo (19:54)]

:; Course material
:* [[Media:Handout.class08.pdf|Handout exercise: Bayesian estimation of model parameter value]]
:* [[Media:Slides week5.pdf|Slides, week 7]]
:* [[Media:MTN122.pdf| An Introduction to Bayesian Statistics Without Using Equations]]
:* [http://www.nature.com/nbt/journal/v22/n9/pdf/nbt0904-1177.pdf Background reading: "What is Bayesian statistics?"]
:* [http://rsta.royalsocietypublishing.org/content/roypta/361/1813/2681.full.pdf Background reading: "Bayesian computation: a statistical revolution"]

:; Computer exercise
:* [[Bayesian Phylogeny]]

----

===Week 8+9 (March 20 + April 3): Mini project 2===


----

===Week 10 (April 10): Model Selection===

:; Online lectures
:* [https://youtu.be/sJB2LmppZj8?list=PLXwjzs_mabFrlRF7uALEomfGGckG0sG5y Model selection, part 1 (15:19)]
:* [https://youtu.be/qSoDZ_33GbM Model selection, part 2 (17:20)]
:* [https://youtu.be/YYoo1vUO4ME Introduction to computer exercise: detection of selection (15:24)]

:; Course material
:* [[Media:Slides week6.pdf|Slides, week 10]]
:* [https://github.com/ddarriba/jmodeltest2/files/157130/manual.pdf jmodeltest manual]

:; Computer exercise
:* [[Model selection]]

----

===Week 11 (April 17): Bayesian Phylogenetics, Part 2 ===
:; Course material
:* [https://www.researchgate.net/publication/319965471_A_biologist%27s_guide_to_Bayesian_phylogenetic_analysis A biologist’s guide to Bayesian phylogenetic analysis]
:* [https://beast.community/analysing_beast_output Analysing BEAST output using Tracer]
:* [https://beast.community/tracer_convergence Identifying convergence problems using Tracer]
:* [https://taming-the-beast.org/tutorials/Troubleshooting/ Post-processing and improving performance]

:; Computer exercise
:* [[Bayesian phylogenetics: checking convergence]]
:* [[Bayesian phylogenetics: clock models]]

----

===Week 12 + 13 (April 24 + May 1): Mini project 3: Final exam===

'''Details will follow'''

----

MediaWiki:Mainpage

2024-03-19T12:35:43Z

WikiSysop: Created page with "22115 - Computational Molecular Evolution"

22115 - Computational Molecular Evolution

MediaWiki:Sidebar

2024-03-19T12:33:14Z

WikiSysop:

* navigation
** https://teaching.healthtech.dtu.dk/|Course List
** https://teaching.healthtech.dtu.dk/22115/|Course 22115
* TOOLBOX

MediaWiki:Sidebar

2024-03-19T12:33:02Z

WikiSysop: Created page with " * navigation ** https://teaching.healthtech.dtu.dk/|Course List ** https://teaching.healthtech.dtu.dk/22115/|Course 22115 ** Programme|Programme * TOOLBOX"

* navigation
** https://teaching.healthtech.dtu.dk/|Course List
** https://teaching.healthtech.dtu.dk/22115/|Course 22115
** Programme|Programme
* TOOLBOX

MediaWiki:Disclaimers

2024-03-19T12:32:15Z

WikiSysop: Created blank page

MediaWiki:Aboutsite

2024-03-19T12:31:45Z

WikiSysop: Created blank page

MediaWiki:Privacy

2024-03-19T12:31:02Z

WikiSysop: Created blank page