Exercise: Phylogeny (Seaview version)
Before you start: please make sure you have the Seaview program installed on your computer. If not, see the Multiple alignment exercise.
The Phylogeny of HIV
In this exercise you will analyze the evolutionary relationships between HIV and related viruses from humans and other primates:
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.
The "Pol" gene, which is present in the genome of all these viruses, encodes three different polypeptides that are important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein, which is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different Pol polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:
Step 1: alignment
Align the Pol sequences using the Clustal Omega program in Seaview.
- QUESTION 1:
- Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.
Step 2: distance matrix
In Seaview, go to Trees→Distance Methods. In the window that pops up, select Save to File and set Distance to Observed. Leave Ignore all gap sites checked. Click Go and save the file.
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.
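To make the numbers in the file less abstract, here is a minimal Python sketch of how an observed distance can be computed. It assumes that "Observed" means the fraction of compared alignment columns at which two sequences differ, and that Ignore all gap sites removes every column in which at least one sequence has a gap. The sequences and names are made up for illustration and are not taken from the Pol data set, and Seaview's own implementation may differ in details.

```python
def drop_gap_columns(seqs):
    """Remove every alignment column in which at least one sequence has a gap."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def observed_distance(a, b):
    """Fraction of columns at which two aligned sequences differ."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be aligned, equally long and non-empty")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


# Toy alignment (hypothetical sequences, not from the exercise data)
names = ["seqA", "seqB", "seqC"]
aligned = ["MKV-LTA", "MKVELSA", "MRV-LSG"]

ungapped = drop_gap_columns(aligned)

# One pairwise distance per line, similar in spirit to the second
# half of the file that Seaview writes
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = observed_distance(ungapped[i], ungapped[j])
        print(names[i], names[j], round(d, 3))
```

In this toy example the column containing gaps is dropped first, so every pair is compared over the same six columns.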
- QUESTION 2:
- Can you spot which sequence has the largest distances to all the others?
Step 3: neighbor joining
Go to Trees→Distance Methods again, but this time, select NJ instead of Save to File. Then, clicking Go will produce a neighbor-joining tree based on the distances you just looked at.
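For those curious about what happens when you click Go, the following Python sketch shows a single step of the neighbor-joining procedure: choosing which two sequences to join first and computing the lengths of their branches. It uses the standard Q criterion, Q(i,j) = (n-2)·d(i,j) - (sum of row i) - (sum of row j), on a small made-up distance matrix. The function name and the numbers are illustrative assumptions, and the full algorithm repeats this step on a shrinking matrix until the tree is complete.

```python
def nj_pick_pair(names, d):
    """Return the pair joined first by neighbor joining, with Q and branch lengths."""
    n = len(names)
    totals = [sum(row) for row in d]  # sum of distances from each taxon to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the two joined taxa to their new common node
    length_i = d[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return names[i], names[j], q, length_i, length_j


# Toy symmetric distance matrix for five hypothetical taxa
taxa = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

print(nj_pick_pair(taxa, dist))
```

Note that the pair with the smallest distance (d and e in this toy matrix) is not necessarily the pair that is joined first, because the Q criterion also takes into account how far each sequence is from all the others.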
- QUESTION 3:
- Hand in a picture of the resulting tree (Hint: you can either take a screenshot or save the tree as SVG via the File menu).
- Which sequence has the longest branch? Does that agree with your answer to Question 2?
Step 4: rooted vs unrooted tree
In principle, the NJ algorithm always produces an unrooted tree. The reason why the trees you have seen so far (in this and last week's exercises) have been shown as rooted trees is that Seaview uses midpoint rooting, i.e., it places the root halfway between the two tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: in the drop-down menu at the top of the tree window, change squared to circular. (It is a bit unfortunate that Seaview uses the term "circular", since some other programs offer a circular way of displaying rooted trees, which should not be confused with unrooted trees.) Later in the exercise, we will encounter tree rerooting.
- QUESTION 4:
- Hand in a picture of the unrooted tree.
Step 5: rearrangement
Now, go back to the rooted view of the tree and click Swap in the second line of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click Full, the black squares disappear again, but the changes in the tree layout remain.
- QUESTION 5:
- Hand in a picture of the tree where you have rearranged it so that:
- HTLV is at the bottom,
- The HIV1 sequences are above the HIV2 sequences, and
- "SIVCZ" is placed next to "Smanga_S4".
Note that none of these rearrangements changes the topology (the branching pattern) of the tree; it still shows the same phylogeny.
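The point about rotations and topology can be checked with a small Python sketch. Here a rooted tree is written as nested tuples (an assumption made for illustration; this is not how Seaview stores trees), and two trees are considered to have the same topology if they contain the same set of clades. The tip names are hypothetical.

```python
def clades(tree):
    """Return the set of clades (each a frozenset of tip names) in a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):  # a tip
            return frozenset([node])
        members = frozenset().union(*(tips(child) for child in node))
        found.add(members)  # every internal node defines one clade
        return members

    tips(tree)
    return found


original = ((("A", "B"), "C"), ("D", "E"))
rotated = (("E", "D"), ("C", ("B", "A")))    # branches swapped at several nodes
regrouped = ((("A", "C"), "B"), ("D", "E"))  # A now groups with C instead of B

print(clades(original) == clades(rotated))    # True: only the drawing order differs
print(clades(original) == clades(regrouped))  # False: the branching pattern differs
```

Swapping branches only changes the order in which the children of a node are drawn, so the set of clades stays the same. Moving a tip to a different group, by contrast, creates a clade that was not there before.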
Step 6: interpretation
- QUESTION 6:
- Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.
- Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?
- The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?
- Taking these groupings into consideration, what can you say about the origin of the two HIV viruses?
Comparing trees
For the next part of the exercise, the task is to create a rooted phylogenetic tree from a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so.) The sequences can be found via the following link:
Step 7: with or without gapped positions
This time, make two versions of your tree: one where Ignore all gap sites is on, and one where it is off.
- QUESTION 7:
- Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?
- Your answers should include the following:
- How did you construct the trees? (alignment method, construction of tree, etc.).
- Pictures of the trees.
- Which tree do you think is most correct?
Step 8: comparison to taxonomy
Now, go to NCBI taxonomy and construct a "Common Tree" with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. Note: Remember to tick include unranked (phylogenetic) taxa.
- QUESTION 8:
- Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?
Mitochondrial versus cytoplasmic proteins
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion's own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use UniProt to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.
Step 9: building the dataset
- Find all proteins named "ribosomal protein L3" from as many eukaryotes (Eukaryota) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).
- How many of these have a Subcellular location of "mitochondrion" and "cytoplasm", respectively? Download the results of these two searches in FASTA format.
- Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with "RL3" (cytoplasmic) or "RM03"/"RK3" (mitochondrial), which makes it easy to tell them apart. If you have any names that do not begin with "RL3", "RK3" or "RM03", revisit your UniProt search criteria! Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).
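If you prefer to check the combined file with a script rather than by eye, the following Python sketch shows one way to do it. It assumes the usual UniProt FASTA header layout, in which the entry name is the third field separated by "|" (for example ">sp|ACCESSION|RL3_HUMAN ..."). The accessions, sequences and the "RL4_TOY" entry below are placeholders made up for illustration. With real files you would read their contents with open() instead of using the strings shown here.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def entry_name(header):
    """Extract the entry name from a header such as 'sp|ACCESSION|RL3_HUMAN description'."""
    parts = header.split()[0].split("|")
    return parts[2] if len(parts) >= 3 else parts[0]


ALLOWED_PREFIXES = ("RL3", "RK3", "RM03")


def unexpected_names(records):
    """List entry names that do not start with one of the expected prefixes."""
    names = [entry_name(header) for header, _ in records]
    return [name for name in names if not name.startswith(ALLOWED_PREFIXES)]


# Placeholder data standing in for the two downloaded files
cytoplasmic = ">sp|X00001|RL3_HUMAN toy record\nMSHRKFSA\nPRHGSLGF\n"
mitochondrial = ">sp|X00002|RM03_HUMAN toy record\nMPGWRLLT\n>sp|X00003|RL4_TOY toy record\nMACARPLI\n"

combined = read_fasta(cytoplasmic + mitochondrial)
print(len(combined), "records")
print("unexpected names:", unexpected_names(combined))
```

An empty list of unexpected names would suggest that the search criteria were fine. Any name that shows up, like the deliberately wrong toy entry here, is a hint to revisit the UniProt search.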
Step 10: making the tree
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set Ignore all gap sites off. Describe all the steps you took to make it, and hand in a picture of your tree in unrooted view. Also, go to File→Save unrooted tree and save the tree file; name it something ending in .txt. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later.
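To get a feeling for the Newick format, it can help to pull the tip names out of a tree string. The Python sketch below does this with a regular expression, under the assumption that names are unquoted and contain no parentheses, commas, colons, semicolons or spaces. The tree string is a made-up toy example, not the output you will get from Seaview. In Newick, nested parentheses describe the groupings, the number after each colon is a branch length, and a value directly after a closing parenthesis is a label (often a support value) for that internal node.

```python
import re


def tip_names(newick):
    """Return the tip names in a Newick string, in the order they appear.

    A tip name is taken to be any label that directly follows "(" or ",".
    Labels that follow ")" belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Toy tree with hypothetical names, branch lengths, and one internal node label
toy = "((RL3_TOYA:0.10,RL3_TOYB:0.12)0.95:0.30,(RM03_TOYA:0.40,RM03_TOYB:0.35):0.30);"

names = tip_names(toy)
print(names)
print(len(names), "tips")
```

Counting the tips in your own file this way is a quick check that no sequence was lost between the FASTA file and the tree.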
Step 11: rerooting the tree in Seaview
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:
- Switch back to rooted ("squared") view.
- Click Re-root in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)
- Now find a node whose descendants are either all cytoplasmic or all mitochondrial, and which contains every sequence of that kind. Click it (don't worry about clicking a wrong node, you can always click another). Make sure that all the cytoplasmic and all the mitochondrial sequences are in two separate subtrees.
- Then, click Full in the second row of the tree window to make the small black squares disappear again.
Include a picture of the rerooted tree in your answer.
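What "correctly rooted" means here can be expressed as a small check, sketched below in Python. The tree is again written as nested tuples with hypothetical tip names, and the naming convention from Step 9 ("RL3" for cytoplasmic, "RM03"/"RK3" for mitochondrial) is used to classify the tips. This is only an illustration of the idea and not something Seaview does for you.

```python
def tips(node):
    """Return all tip names below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [tip for child in node for tip in tips(child)]


def compartment(name):
    """Classify a sequence by its name prefix (convention from Step 9)."""
    return "cytoplasmic" if name.startswith("RL3") else "mitochondrial"


def root_separates(tree):
    """True if the root splits the tree into one cytoplasmic and one mitochondrial subtree."""
    if isinstance(tree, str) or len(tree) != 2:
        return False
    kinds = [set(compartment(t) for t in tips(side)) for side in tree]
    return all(len(k) == 1 for k in kinds) and kinds[0] != kinds[1]


# Toy trees with hypothetical names
well_rooted = (("RL3_A", "RL3_B"), ("RM03_A", ("RM03_B", "RK3_C")))
badly_rooted = ("RL3_A", ("RL3_B", ("RM03_A", ("RM03_B", "RK3_C"))))

print(root_separates(well_rooted))   # True
print(root_separates(badly_rooted))  # False: one side mixes both kinds
```

In the badly rooted toy tree the unrooted branching pattern is the same as in the well-rooted one. Only the position of the root differs, which is exactly what rerooting changes.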
Step 12: interactive Tree Of Life
In this step, we will use the website iTOL (interactive Tree Of Life) to reroot our tree:
- Open the website in a new browser tab, and click Upload in the top row.
- Click the button under Tree file: and select the unrooted Newick tree file you saved in Step 10.
- Click Upload. You will now see a tree displayed with an arbitrary placement of the root.
- Look at the Control panel to the right. Under Label options switch Position from Aligned to At tips.
- Note that when you hover the mouse over a branch, information about the branch is displayed.
- As in the previous step, find a node whose descendants are either all cytoplasmic or all mitochondrial. Click it. A menu will appear. In that menu, go to Editing→Tree structure→Re-root the tree here.
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.
Step 13: annotating the tree
- In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions).
- Click Manual annotations and select the first tool ("Draw an ellipse / circle").
- Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark both these nodes with a green circle each.
- Note that in case you place a circle incorrectly, you can move it with the "Move/rotate/scale objects" tool. There is also a "Delete objects" tool.
- Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a blue circle each.
Hand in a picture of your annotated tree.
Step 14: interpretation
Consider your rerooted and annotated tree from iTOL, and answer the following questions:
- Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?
- Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?
- Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (i.e., where have the most changes accumulated per unit of time): among the cytoplasmic or the mitochondrial proteins?