Phylogenetic Analysis using Parsimony
This exercise is part of the course Computational Molecular Evolution (22115).
Overview
In this exercise you are going to examine three different data sets using the PAUP* program. All analyses will be performed under the parsimony criterion. Data set 1 consists of genomic nucleotide sequences from 9 primate species. This set is sufficiently small that an exhaustive search of all possible trees can be performed. Data set 2 consists of mitochondrial nucleotide sequences from 12 primates. This data set is so large that it would take too long to run through all possible trees (at least for the purpose of this exercise), but it is still small enough that branch and bound can be used to perform an exhaustive search. The third and last data set consists of Hepatitis C virus sequences isolated from different patients. This set is much too big for exhaustive searching and we will have to employ heuristic methods to analyze it.
In addition to exploring aspects of parsimonious phylogenetic reconstruction, an important goal of this exercise is to introduce you to the PAUP* interface, and to the different types of manipulations and analyses that can be performed within the program. Later in the course you will use PAUP* for distance-based and maximum likelihood-based phylogenetic reconstruction. Several other programs (e.g., MacClade and MrBayes) use command-line interfaces that are very similar to the one used by PAUP*.
Getting started
1: Start Terminal window
- Make sure to maximize the window: the analyses we will perform give lots of output to the screen, so having a nice and large shell window makes it easier to keep track of what happens.
2: Construct working directory, copy files:
In the command below: Instead of /path/to/molevol
enter the path to the directory where you have placed your course files (for instance cd /Users/bob/Documents/molevol
, or cd /home/student/molevol
).
cd /path/to/molevol mkdir parsimony cd parsimony cp ../data/mhc.fasta mhc.fasta cp ../data/primate_mtDNA_interleaved.fasta primate_mtDNA_interleaved.fasta cp ../data/hcv.fasta hcv.fasta ls -l
- The commands above first construct a directory called "parsimony", then changes to that directory, and finally copies three files from your data directory to the current folder (parsimony).
3: Have a look at the data files:
nedit mhc.fasta
- This file is in the so-called fasta-format, and contains unaligned DNA sequences. Specifically, the sequences are major histo-compatibility complex (MHC) class I genes from nine different primate species. MHC class I genes encode proteins that are involved in the immune response. Now, close the nedit window and have a look at the next data file:
nedit primate_mtDNA_interleaved.fasta
- This file contains mitochondrial DNA sequences from 12 different primate species. Close the nedit window, and have a look at the final data file:
nedit hcv.fasta
- This file contains Hepatitis C virus (HCV) sequences isolated from 4 different patients. The sequenced region corresponds to the end of the E1 gene and the beginning of the E2 gene, surrounding the so-called hyper-variable region 1 (HVR1). When you've had a look close the nedit window so it doesn't clutter your screen.
Phylogenetic Analysis of MHC Class I sequences
Question 1
Make multiple alignment using mafft
mafft --auto mhc.fasta > mhc_aligned.fasta
- mafft is a program for making multiple alignments, that works well and can handle large data sets. It is possible to use different alignment algorithms with mafft, depending on whether the focus is on very precise alignment of a specific type of sequences or rapid alignment of large data sets for instance. (See here for details). The option --auto makes the program choose automatically among these algorithms based on the input data.
Inspect the alignment in graphical viewer:
aliview mhc_aligned.fasta
- aliview is a program for graphically viewing multiple alignments. Residues can be coloured according to different principles (e.g., one color per nucleotide, or colouring according to the degree of conservation of a column) making it simpler to get an overview of the quality of the alignment.
Question: There is a gap in the "yellow_baboon" sequence. What is the length of this gap? (number of nucleotides)
Question: 2
Question: Why does this length make evolutionary sense (as opposed to, say, if the gap was one nucleotide shorter or longer)?
Question 3
Convert alignment format to NEXUS:
The program we will use next does not understand FASTA format, so we will first have to convert the alignment to a format it does understand, namely the so-called NEXUS format. We will do that using the command line program seqconverter (which is one that I have written, and that does various sequence manipulations):
seqconverter --informat fasta --outformat nexus -i mhc_aligned.fasta > mhc.nexus
seqconverter writes its output to the terminal ("stdout") so to save the output to a file we need to use redirection of the output.
(There are a number of other programs that can perform sequence format conversion, including the EMBOSS Seqret tool at the homepage of the European Bioinformatics Institute).
Start the PAUP* program:
paup
- You have now started the PAUP program. Notice that instead of the normal command prompt you now have the PAUP command prompt (
paup>
). This indicates that you are in the PAUP program, which is ready to receive commands.
Load the aligned data
At the PAUP prompt, type the following:
execute mhc.nexus
This loads the data into PAUP.
Question: Check the output from the PAUP program: How long (how many columns) is the alignment you just loaded?
Question 4
Define the outgroup:
At the PAUP prompt, type the following command to set the outgroup for the tree:
outgroup olive_baboon macaque yellow_baboon
- This defines an outgroup consisting of the three old world monkeys in the data set. The names are those that appear in the mhc.nexus file.
- The outgroup will be used to place the root of the tree. The rationale is as follows: our data set consists of sequences from man, from a number of great apes, and from a number of old world monkeys. We know from other evidence that the lineage leading to monkeys branched off before any of the remaining organisms diverged from each other. The root of the tree connecting the organisms investigated here, must therefore be located between the monkeys (the "outgroup") and the rest (the "ingroup")). This way of finding a root is called "outgroup rooting".
Activate outgroup rooting and select how tree will be printed:
set root=outgroup outroot=monophyl
- This makes PAUP* use the outgroup we defined above for the purpose of rooting the tree. The "outroot=monophyl" command makes PAUP construct a tree where the outgroup is a monophyletic sister group to the ingroup. (Outroot could also have been set to "polytomy" or "paraphyl")
Find the most parsimonious tree by examining all possible trees:
Make sure the terminal window is maximally wide and then run the following command:
alltrees fd=barChart
- This makes PAUP* find the best trees by exhaustively searching through all possible trees (the number of unrooted trees with 9 taxons is 135,135 - yes that is an actual number!). By default PAUP* uses the parsimony criterion for constructing trees. For each possible tree PAUP* finds the length (i.e., the number of mutational events required to explain the data set), and upon finishing this, the best tree is the one with the smallest total length. If there are several trees with the same length, then these are all kept since they are equally good. At the end of the run, PAUP* outputs a bar chart (histogram) giving the frequency distribution of all tree lengths (the histogram is turned on its side here). Above the histogram is a textual summary of what PAUP did and the result. Look through all of this to answer the next questions.
Question: What is the length of the shortest trees?
Question 5
Question: How many trees with this score was found?
Question 6
Question: How far are the best trees from the second best ones?
Question 7
Examine the best trees
pscores
- This prints the length of the trees that are currently in memory. (Actually this command not only prints but computes the lengths, so it can be used to evaluate any tree that has been loaded into memory - also trees built by other methods or other programs).
describetrees all/plot=cladogram labelnode=no
- This will print descriptions and plots of both trees (The command "describetrees 1" would have printed a description of only the first tree). A cladogram is a type of tree where branch lengths are ignored and only branching order is indicated. The option "labelnode=no" turns off labelling of internal nodes (otherwise they would have been labeled with consecutive numbers). You can scroll back to compare the two trees. You may notice that the disagreement between the two trees concerns whether gibbon or orangutan is closer to the root. Based on the information in this data set we can not distinguish between these two possibilities.
Question: Note where the human sequence is located: Who are our closest relatives according to this tree?
Question 8
Examine tree with branch lengths:
describetrees all/plot=phylogram labelnode=no
- A phylogram is a tree where branches are drawn with lengths proportional to the number of mutational events that has happened on them. (Note that tree-terminology is not entirely consistent, and there are several other names for this type of plot). Branch lengths are based on the reconstructed location of mutational events (and therefore also on the reconstructed ancestral sequences). Under the parsimony criterion: If a mutational event could have occurred on any one of a number of branches (in the sense that all these reconstructions give the same tree length), then each of the branches are assigned a fraction of a mutation. For instance, if a mutational event could have been placed on either of two branches, then both of them will be counted as having had 0.5 mutational events.
Question: What is the longest terminal branch on the tree? (Terminal branch = branch leading to a leaf).
Question 9
showtrees all
- This gives just the cladograms without further descriptions.
Save the trees to a file:
savetrees file=mhcalltrees.nexus brlens=yes
- This saves the trees to a file named "mhcalltrees.nexus" with indication of the location of the root, and with information about branch lengths.
View trees in FigTree viewer:
Now, open a second Terminal window, change to the working directory and open the tree file with the figtree viewer as follows:
cd /path/to/molevol/parsimony figtree mhcalltrees.nexus
- With this graphical viewer you can investigate the tree more closely and export it as a figure in various formats. The program has several options for altering the display of the tree, including viewing the tree as unrooted, and altering the rooting interactively. After you've played around with the possibilities for a while you should close the window again.
- If, later in this course, you want to construct a tree figure that is prettier than the ASCII rendition that PAUP gives you, then you need to repeat the steps performed above (save tree using the savetrees command, and subsequently open in FigTree. Remember the option "
brlens=yes
" in order to get the phylogram). The FigTree viewer is available for several platforms.
Find the most parsimonious trees using branch and bound:
Now return to the shell window where you have PAUP running and perform a branch and bound search for the best trees:
bandb
- As explained in the lecture branch and bound is guaranteed to find the best trees without necessarily searching through all of tree-space. There is therefore actually no reason to use alltrees unless you are explicitly interested in examining suboptimal trees or the distribution of all tree lengths.
Question: What is the length of the best tree found using branch and bound?
Question 10
Question: Is that the same value found using exhaustive search?
Question 11
Examine the branch and bound trees:
showtrees all
- (Remember: you also have the possibility of using FigTree as explained above).
Load the previously found trees into memory:
We want to convince ourselves that the branch and bound trees really are identical to the trees found by the alltrees command. In order to do that we must now load the previously found trees back into memory while still retaining the branch and bound tree:
gettrees file=mhcalltrees.nexus mode=7
Answer "yes" when asked whether you want to proceed). The mode=7 command means that PAUP* should keep all trees that are currently in memory and append all trees from the file. We now have four trees in memory (the two found by alltrees and the two found by bandb). Now compare the four trees and convince yourself that the same two trees were found by bandb:
showtrees all
You can also compare the scores:
pscores all
As one final way of establishing that the trees are identical you can compute the "distance" between the four trees in memory:
treedist
- This indicates how similar the trees in memory are by computing the so-called Robinson–Foulds or symmetric difference metric. The output includes both a table listing all pairwise differences and a bar-chart giving the distribution of differences. Two trees with identical topology will have a distance of zero.
Question: Are the two new trees the same as two previously found trees?
Question 12
Define a constraint for tree-searching:
constraints homooran (monophyly)=((human,orangutan));
- This defines a constraint named "homooran" which requires the taxons "human" and "orangutan" to form a monophyletic group. The constraint tree was here given as simply: ((human,orangutan)); The tree is here shown using a notation where a pair of parentheses corresponds to an internal node, while a comma-separated list enclosed by a pair of parentheses indicates the subtrees that branch out from this internal node. The constraint tree shown above is a brief way of defining the full unresolved constraint tree:
(gorilla,gibbon,bonobo,chimpanzee,yellow_baboon,olive_baboon,macaque,(human,orangutan));
Find the best tree that satisfies the constraint:
bandb /constraints=homooran enforce=yes
- This performs a branch and bound search for the best tree that has human and orangutan together as a monophyletic group. Note the option enforce=yes which ensures that the named constraint is applied.
Question: Compare the length of this tree with the best tree found above. How many extra steps (mutations) are required in this tree? (The difference tells you something about how much better one hypothesis is than the other. There are tests that will tell you whether the difference is significant, but we will not get into that at this point in the course).
Phylogenetic Analysis of Mitochondrial DNA Sequences
Question 13
Delete previously found trees from memory:
cleartrees
Prepare data for analysis:
We will now analyze the primate mitochondrial data set. Using what you learned above, perform the following steps:
- Align the sequences present in the file primate_mtDNA_interleaved.fasta
- Convert alignment to NEXUS format, and save to file
primate_mtDNA_interleaved.nexus
- Load data into PAUP
- Set outgroup to consist of all the non-hominoid species: Macaca_fuscata M_mulatta M_fascicularis M_sylvanus Saimiri_sciureus Tarsius_syrichta Lemur_catta
- Activate outgroup rooting and select output of outgroup as monophyletic sister group to the ingroup.
Estimate time needed for exhaustive search:
We now have 12 taxons in our data set, corresponding to 654,729,075 possible trees. We first want to estimate how long it would take to search through every single one of them, but we do not want to actually wait for the required amount of time. First, start the search as indicated below:
alltrees
At the bottom of the screen you will see a progress bar indicating the percentage of tree-space that has been explored. Wait until the program reaches about 10% and interrupt the run by pressing CTRL-C (and replying Y to the question of whether you want to stop).
After about 10% finished: press CTRL + C
Question: Use the numbers to estimate how many minutes it would take to work through all 654,729,075 possible trees.
Note: This will obviously depend on your specific machine
Question 14
Find the best trees using branch and bound:
bandb
Question: How long did it take to find the best trees by branch and bound (sec)? Compare this to the estimated time you would have had to wait for the alltrees command to finish. The time saved by ignoring suboptimal parts of search space can be quite impressive, but it depends on the structure of your data set.
Question 15
Inspect most parsimonious trees:
describetrees all/plot=phylogram labelnode=no
Question: Which species is closer to the root: Pongo (Orangutan) or Hylobates (Gibbon)?
Phylogenetic Analysis of viral DNA Sequences
Question 16
Delete previously found trees from memory:
cleartrees
Prepare and load the data:
We will now investigate the data in the file hcv.fasta. Using what you have learned above:
- Align sequences that are in the file hcv.fasta
- Convert to Nexus format and save to file hcv.nexus
- Load Nexus format alignment into PAUP
This dataset consists of 41 hepatitis C virus sequences obtained from four different patients. The alignment should not contain any gaps (there have been no insertions or deletions. So strictly speaking we did not need to align, however you can't know that without aligning the sequences).
Enable PAUP* to store an unlimited number of trees:
set increase=auto
Perform a heuristic search using NNI:
hsearch start=stepwise addseq=random nreps=100 rseed=117 swap=NNI
- We now have so many sequences that exhaustive searching using alltrees or even bandb is impossible (there are approximately 10^60 possible trees). We therefore have to employ heuristic searching. The above command starts a heuristic search using sequential addition with random addition to construct the initial tree. (In all 100 different random addition sequences, and therefore starting trees, are tried). The option rseed=117 controls the random number generator and ensures that student results will be comparable (you can experiment with other seeds, or with leaving out the option alltogether if you are interested; ordinarily you would not set this - it is only for the purpose of the exercise). Once an initial tree has been constructed, the heuristic search proceeds by rearrangements of the "nearest neighbor interchange" type (NNI).
Inspect search result:
This time the search results in a large number of equally parsimonious trees. The result is summarized in the form of a "tree-island profile". Trees from the same island are much more similar to each other than to trees from other islands. Specifically, a tree-island is defined as a set of trees where you can go from any tree to any other tree using one or more re-arrangements of the type currently used for searching tree-space (e.g., NNI).
Question: What is the best score found?
Question 17
Question: How many tree islands are there with the best score?
Question 18
Question: How many trees were found with the best score?
Question 19 Perform a heuristic search using SPR:
hsearch start=stepwise addseq=random nreps=100 rseed=117 swap=SPR
- This starts a heuristic search using rearrangements of the "subtree pruning and re-grafting" type (SPR). SPR is more elaborate than NNI and examines many more neighbors for each tree.
Question: Did this search find the same best score as NNI?
Question 20
Perform a heuristic search using TBR:
hsearch start=stepwise addseq=random nreps=100 rseed=117 swap=TBR
- This starts a heuristic search using rearrangements of the "tree bisection and reconnection" type (TBR). TBR is more elaborate than both NNI and SPR. TBR is the default swap mode for heuristic searching in PAUP*, and NNI or SPR should mostly be used if you are interested in reducing search time.
Question: How many tree islands are there with the best score when using TBR?
Question 21
Question: Why do you think there are fewer islands with TBR and SPR compared to NNI?
Quitting PAUP program
To stop the PAUP program and return to the shell command prompt, you simply type:
q
- for "quit"