22111 - User contributions [en]

Exercise: Phylogeny - Answers (Seaview version)

2025-11-26T21:02:38Z

Henni: /* Step 14 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: its distance to every other sequence lies between 0.74 and 0.76, while no distance between any two of the other sequences exceeds 0.41.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
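The observation about HTLV can also be checked programmatically. The following is a minimal sketch, assuming the "#pairwise distances" lines have the <tt>name1,name2: value</tt> layout shown above; the function name <tt>min_distance_per_taxon</tt> is made up for this illustration, and only six of the lines from the file are used as sample input.

```python
from collections import defaultdict


def min_distance_per_taxon(lines):
    """Return {taxon: smallest distance to any other taxon}."""
    nearest = defaultdict(lambda: float("inf"))
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        pair, value = line.split(":")
        a, b = pair.split(",")
        d = float(value)
        nearest[a] = min(nearest[a], d)
        nearest[b] = min(nearest[b], d)
    return dict(nearest)


# A few lines copied from the distance file above (not the full file).
sample = """\
HIV1B5,HTLV: 0.750305
HIV2CA,HTLV: 0.749086
HTLV,SIVCZ: 0.747868
HIV1B5,HIV2CA: 0.399513
HIV1B5,SIVCZ: 0.130329
HIV2CA,SIVCZ: 0.392205
""".splitlines()

nearest = min_distance_per_taxon(sample)
# Print taxa with the most isolated one first.
for taxon, d in sorted(nearest.items(), key=lambda kv: -kv[1]):
    print(f"{taxon:8s} {d:.6f}")
```

A sequence whose nearest neighbour is still far away is a candidate outgroup; in this sample that sequence is HTLV.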

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.
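To see how neighbor joining turns distances into branch lengths, here is a minimal sketch of a single NJ step in its textbook formulation. It is not Seaview's implementation, the function name <tt>nj_first_join</tt> is made up for this illustration, and the five-taxon matrix is a toy example, not data from the exercise.

```python
def nj_first_join(names, d):
    """Perform one neighbor-joining step on a symmetric distance matrix.

    Returns the two taxa that are joined first and the lengths of the
    branches connecting each of them to the new internal node.
    """
    n = len(names)
    r = [sum(row) for row in d]  # total distance from each taxon to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: a small pairwise distance, corrected for how far
            # both taxa are from everything else.
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    len_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return names[i], names[j], len_i, len_j


# Toy matrix (hypothetical taxa a to e).
toy_names = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
first_join = nj_first_join(toy_names, toy_dist)
print(first_join)
```

The branch length formula gives the longer branch to the taxon with the larger total distance to all other taxa. This is why a sequence such as HTLV, which is far from everything else, ends up on a long branch.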

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. RevTrans is arguably the most appropriate option: the sequences are coding DNA, and RevTrans gives us the "best of both worlds". It takes amino acid similarities into account when aligning, while the aligned DNA still retains the synonymous (silent) nucleotide differences. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.
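The idea behind a codon-aware alignment can be illustrated with a short sketch. This is not the RevTrans code itself; it only shows the principle of threading each coding sequence onto an already aligned protein sequence. The function name <tt>thread_codons</tt> and the two short sequences are hypothetical, and stop codons are not handled.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto a gapped protein alignment row."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(codons) < n_residues:
        raise ValueError("coding sequence is too short for the protein")
    out = []
    k = 0  # index of the next unused codon
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")  # one protein gap becomes a three-base gap
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)


# Hypothetical toy data: two proteins aligned with one gap, plus their CDS.
protein_alignment = {"seq1": "MK-L", "seq2": "MKTL"}
cds = {"seq1": "ATGAAACTG", "seq2": "ATGAAGACCCTT"}

dna_alignment = {
    name: thread_codons(protein_alignment[name], cds[name]) for name in cds
}
for name, row in dna_alignment.items():
    print(name, row)
```

In the toy example both sequences have lysine (K) and leucine (L) at the same positions, but they use different codons (AAA/AAG and CTG/CTT). These differences are invisible at the protein level, yet they remain in the aligned DNA and contribute to the distances.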

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

The two trees differ in one aspect of their topology: in the tree made without the gap positions, Rice is grouped with Fruit fly inside the animal subtree, while in the other tree, Rice is grouped with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct one. '''Note:''' This is not always the case!
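The two ways of treating gaps can give different distances for the same pair of sequences. The sketch below assumes that "Ignore all gap sites" removes every alignment column containing a gap in any sequence, and that the alternative only skips gaps within the pair being compared. Seaview's exact behaviour may differ from this. The function names and the three short sequences are hypothetical.

```python
def p_distance(a, b):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def drop_gap_columns(alignment):
    """Remove every column in which at least one sequence has a gap."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [i for i in range(length)
            if all(alignment[n][i] != "-" for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Hypothetical toy alignment: only sequence C contains gaps.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTTCGTAC",
    "C": "AC--TCGAAC",
}

d_with_gap_sites = p_distance(alignment["A"], alignment["B"])
no_gaps = drop_gap_columns(alignment)
d_without_gap_sites = p_distance(no_gaps["A"], no_gaps["B"])
print(d_with_gap_sites, d_without_gap_sites)
```

Although A and B contain no gaps themselves, their distance changes when the gapped columns are removed, because the gaps in C decide which columns are compared. Small changes of this kind can be enough to move a taxon such as Rice to a different place in the tree.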

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are, which is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons for this, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species for entirely random reasons. This effect tends to disappear as more sequence data are included in the analysis (the law of large numbers).
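The effect of the amount of sequence data can be illustrated with a small simulation. This is a deliberately simplified sketch: sites are treated as independent, the two distances are sampled independently of each other, and the true per-site difference probabilities (0.10 to the sister species, 0.12 to a non-sister species) are made-up values. The function name <tt>misorder_rate</tt> is hypothetical.

```python
import random


def misorder_rate(n_sites, p_sister=0.10, p_other=0.12, replicates=200, seed=1):
    """Estimate how often the non-sister species looks closer than the sister.

    For each replicate, the number of differing sites is drawn for both
    comparisons; the replicate counts as wrong if the non-sister species
    shows fewer differences than the true sister species.
    """
    rng = random.Random(seed)
    wrong = 0
    for _ in range(replicates):
        diff_sister = sum(rng.random() < p_sister for _ in range(n_sites))
        diff_other = sum(rng.random() < p_other for _ in range(n_sites))
        if diff_other < diff_sister:
            wrong += 1
    return wrong / replicates


rates = {n: misorder_rate(n) for n in (50, 500, 5000)}
for n, rate in rates.items():
    print(f"{n:5d} sites: wrong order in {rate:.1%} of replicates")
```

With few sites, the random sampling noise is of the same size as the true difference between the two distances, so the wrong order appears frequently. With many sites, the noise shrinks relative to the signal and the wrong order becomes rare.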

==Step 9==
# 54 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 27 results, respectively. Search strings: <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt>. To download each result set, select "Download all", "FASTA (canonical)" and "Uncompressed" under the Download tab in UniProt.
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]
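As an alternative to a text editor, the two downloaded files can be combined with a few lines of Python. This is a minimal sketch; the function name <tt>combine_fasta</tt> and the file names <tt>mito.fasta</tt>, <tt>cyto.fasta</tt> and <tt>combined.fasta</tt> are hypothetical, and the demonstration writes tiny placeholder files to a temporary directory instead of using the real downloads.

```python
import tempfile
from pathlib import Path


def combine_fasta(inputs, output):
    """Concatenate FASTA files and return the number of sequence records written."""
    count = 0
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            if text and not text.endswith("\n"):
                text += "\n"  # avoid gluing a header onto the previous sequence
            out.write(text)
            count += sum(1 for line in text.splitlines() if line.startswith(">"))
    return count


# Demonstration with placeholder records (not real UniProt entries).
with tempfile.TemporaryDirectory() as tmp:
    mito = Path(tmp, "mito.fasta")
    cyto = Path(tmp, "cyto.fasta")
    combined = Path(tmp, "combined.fasta")
    mito.write_text(">mito_example_1\nMKV")  # note: no trailing newline
    cyto.write_text(">cyto_example_1\nMSH\n>cyto_example_2\nMSR\n")
    n_records = combine_fasta([mito, cyto], combined)
    combined_text = combined.read_text()

print(n_records)
```

Counting the header lines is a quick sanity check: for the real files, the count should equal the sum of the two result sets (8 + 27 = 35).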

==Step 10==
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular".


Here is the result:

[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]

==Step 12==


Here is the annotated tree. The blue circles mark the most recent common ancestor of human and yeast, and the green circles mark the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_35-NJ_tree.rerooted+annotated.png]]

==Step 13==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondrial subtree, Bovine (cow) is the sister group to Human, while in the cytoplasmic subtree, Mouse+Rat is the sister group to Human+Macaque. Also, in the mitochondrial subtree, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic subtree, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by a factor of approximately 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).
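The reasoning in the last point can be written out as a small calculation. Because the blue-to-green interval covers the same amount of time in both subtrees, the ratio of the horizontal distances equals the ratio of the substitution rates. The positions below are placeholder values chosen only to illustrate the arithmetic; they are not measured from the tree, and the function name is hypothetical.

```python
def relative_rate(blue_a, green_a, blue_b, green_b):
    """Ratio of substitutions accumulated in subtree A versus subtree B.

    Each argument is the horizontal position (distance from the root, in
    substitutions per site) of the blue or green circle in that subtree.
    """
    substitutions_a = green_a - blue_a
    substitutions_b = green_b - blue_b
    if substitutions_b <= 0:
        raise ValueError("green circle must lie to the right of the blue circle")
    return substitutions_a / substitutions_b


# Placeholder positions, NOT read from the actual tree.
mito_blue, mito_green = 0.20, 0.60
cyto_blue, cyto_green = 0.15, 0.35

ratio = relative_rate(mito_blue, mito_green, cyto_blue, cyto_green)
print(f"mitochondrial / cytoplasmic rate: {ratio:.2f}")
```

To apply this to the real tree, replace the placeholder positions with values read off the scale bar or summed from the branch lengths in the Newick file.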

File:Ribosomal proteins 35-NJ tree.rerooted+annotated.png

2025-11-26T21:00:44Z

Henni:

Exercise: Phylogeny - Answers (Seaview version)

2025-11-26T21:00:13Z

Henni: /* Step 13 */


Exercise: Phylogeny - Answers (Seaview version)

2025-11-26T20:59:13Z

Henni: /* Step 12 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
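
The claim about HTLV can be checked by averaging the distances per sequence from the one-pair-per-line part of the file. The following is only a minimal sketch: the function name <tt>mean_distances</tt> is made up for this illustration, and the embedded sample is a short excerpt of lines copied from the listing above, not the full file.

```python
from collections import defaultdict

def mean_distances(lines):
    """Average distance per sequence from 'NAME1,NAME2: value' lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Skip blank lines, comment lines and anything that is not a pair line.
        if not line or line.startswith("#") or ":" not in line:
            continue
        pair, value = line.rsplit(":", 1)
        name_a, name_b = pair.split(",")
        d = float(value)
        for name in (name_a.strip(), name_b.strip()):
            totals[name] += d
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Excerpt only: six pair lines involving four of the 20 sequences.
sample = """\
HIV1B5,HTLV: 0.750305
HTLV,SIVCZ: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_SP: 0.389769
SIVCZ,Smanga_SP: 0.388551
""".splitlines()

means = mean_distances(sample)
furthest = max(means, key=means.get)
print(furthest, round(means[furthest], 3))
```

To run it on the real file, one would pass the lines of the saved distance file instead of <tt>sample</tt>; the triangle rows and the name row contain no colon and are therefore skipped.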

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) nucleotide differences. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in topology between the two trees: In the tree made without the gap positions, Rice groups with Fruit fly inside the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct of the two. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species purely by chance. This effect tends to diminish as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 54 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 27 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]

==Step 10==
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular".


Here is the result:

[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]

==Step 12==


==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondrial subtree, Bovine (cow) is the sister group to Human, while in the cytoplasmic subtree, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondrial subtree, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic subtree, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

Exercise: Phylogeny (Seaview version)

2025-11-26T20:58:07Z

Henni: /* Step 11: rerooting the tree in Seaview */

Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].

== The Phylogeny of HIV ==

In this exercise you will analyze the evolutionary relationship between HIV-related viruses from man and monkeys:

Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.

The "Pol" gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycles: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:

:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]

===Step 1: alignment===

Align the Pol sequences using the Clustal Omega program in Seaview.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1''':
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.

===Step 2: distance matrix===

In Seaview, go to Trees→Distance Methods. In the window that pops up, select Save to File and set Distance to Observed. Let Ignore all gap sites be checked. Click Go and save the file.
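
The "Observed" distance is simply the fraction of compared sites at which two aligned sequences differ. The sketch below illustrates the idea for a single pair; the function name and the two short sequences are made up for this example. It skips columns where either of the two sequences has a gap, which is a simplification: with Ignore all gap sites checked, Seaview presumably discards a column as soon as any sequence in the alignment has a gap there.

```python
def observed_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing sites, skipping columns where either sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped column: not counted at all
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared

# Two made-up aligned fragments: 6 ungapped columns, 2 of which differ.
print(observed_distance("MKV-LLA", "MRVGLIA"))
```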

Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.
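
If you want to work with the triangle programmatically, it can be unfolded into a full symmetric matrix. The following is a minimal sketch under the layout described above; the function name <tt>read_triangle</tt>, the sequence names A to D and the distance values are all invented for the illustration.

```python
def read_triangle(rows, names):
    """Build a symmetric dict-of-dicts from upper-triangle rows.

    rows[i] holds d(i, i+1), ..., d(i, n-1) as whitespace-separated text,
    where i indexes the sequences in the order given by `names`.
    """
    n = len(names)
    if len(rows) != n - 1:
        raise ValueError("expected n - 1 rows")
    dist = {a: {b: 0.0 for b in names} for a in names}
    for i, row in enumerate(rows):
        values = [float(x) for x in row.split()]
        if len(values) != n - 1 - i:
            raise ValueError(f"row {i} has the wrong number of values")
        for offset, d in enumerate(values, start=1):
            a, b = names[i], names[i + offset]
            dist[a][b] = d
            dist[b][a] = d  # distances are symmetric
    return dist

names = ["A", "B", "C", "D"]                      # made-up names
rows = ["0.30 0.40 0.50", "0.10 0.20", "0.15"]    # made-up distances
dist = read_triangle(rows, names)
print(dist["D"]["B"])
```

Each row is one value shorter than the previous one, which is why the sketch checks the row length against <tt>n - 1 - i</tt>.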

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2''':
:Can you spot which sequence has the largest distances to all the others?

===Step 3: neighbor joining===

Go to Trees→Distance Methods again, but this time, select NJ instead of Save to File. Then, clicking Go will produce a neighbor-joining tree based on the distances you just looked at.
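
To give an idea of what happens behind the Go button, the sketch below computes the standard neighbor-joining selection criterion Q(i,j) = (n − 2)·d(i,j) − Σ d(i,k) − Σ d(j,k) for a small distance matrix and reports which pair would be joined first. It shows only the first step of the algorithm, not the full tree construction, and it makes no claim about how Seaview implements it. The five taxa a to e and their distances are a made-up toy example.

```python
from itertools import combinations

def nj_first_pair(dist):
    """Return the pair minimising the neighbor-joining Q criterion, plus all Q values."""
    names = list(dist)
    n = len(names)
    # Sum of distances from each taxon to all the others.
    row_sum = {a: sum(dist[a][b] for b in names if b != a) for a in names}
    q = {
        (a, b): (n - 2) * dist[a][b] - row_sum[a] - row_sum[b]
        for a, b in combinations(names, 2)
    }
    best = min(q, key=q.get)
    return best, q

# Made-up toy distances between five hypothetical taxa.
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {}
for (x, y), d in pairs.items():
    dist.setdefault(x, {})[y] = d
    dist.setdefault(y, {})[x] = d

best, q = nj_first_pair(dist)
print(best, q[best])
```

In this toy matrix the smallest raw distance is between d and e, yet the criterion selects a and b first. Neighbor joining does not simply pair the two closest sequences; it also takes into account how far each candidate is from everything else.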

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 3''':
:Hand in a picture of the resulting tree ('''Hint''': you can either take a screenshot or save the tree as SVG via the File menu).
:Which sequence has the longest branch? Does that correspond to your answer before?

===Step 4: rooted ''vs'' unrooted tree===

In principle, the NJ algorithm always produces an ''unrooted'' tree. The reason why the trees you have seen so far (in this and last week's exercises) have been shown as rooted trees is that Seaview uses ''midpoint rooting'', i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change squared to circular. (It is a bit unfortunate that Seaview uses the term "circular", since some other programs offer a circular way of displaying ''rooted'' trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.
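
The idea of midpoint rooting can be sketched in a few lines: find the two tips that are furthest apart, then walk half of that distance along the path between them. The code below is an illustration only, assuming non-negative branch lengths; the function name <tt>midpoint</tt>, the node names and the branch lengths are all hypothetical, and it is not meant to reproduce Seaview's own implementation.

```python
from collections import defaultdict

def midpoint(edges):
    """Locate the midpoint of the longest tip-to-tip path of an unrooted tree.

    edges: iterable of (node_a, node_b, branch_length).
    Returns (edge, offset): the root lies on `edge`, `offset` away from edge[0].
    """
    graph = defaultdict(dict)
    for a, b, length in edges:
        graph[a][b] = length
        graph[b][a] = length

    def paths_from(start):
        # Walk the tree once, recording distance and path to every node.
        found = {start: (0.0, [start])}
        stack = [start]
        while stack:
            node = stack.pop()
            dist, path = found[node]
            for nxt, length in graph[node].items():
                if nxt not in found:
                    found[nxt] = (dist + length, path + [nxt])
                    stack.append(nxt)
        return found

    tips = [n for n in graph if len(graph[n]) == 1]
    # Two sweeps: the tip furthest from an arbitrary tip is one end of the longest path.
    first = paths_from(tips[0])
    end_a = max(tips, key=lambda t: first[t][0])
    second = paths_from(end_a)
    end_b = max(tips, key=lambda t: second[t][0])
    total, path = second[end_b]

    half = total / 2
    walked = 0.0
    for a, b in zip(path, path[1:]):
        if walked + graph[a][b] >= half:
            return (a, b), half - walked
        walked += graph[a][b]

# Hypothetical tree: one long branch to "Out", short branches elsewhere.
edges = [("Out", "x", 4.0), ("x", "y", 1.0), ("y", "A", 1.0),
         ("y", "B", 2.0), ("x", "C", 1.0)]
edge, offset = midpoint(edges)
print(edge, offset)
```

In this toy tree the longest path runs from B to Out (total length 7), so the root ends up on the long branch leading to Out.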

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4''':
:Hand in a picture of the unrooted tree.

===Step 5: rearrangement===
Now, go back to the rooted view of the tree and click Swap in the second line of the tree window. Now, every internal node will be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click Full, the black squares disappear again, but the changes in the tree layout will remain.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 5''':
:Hand in a picture of the tree where you have rearranged it so that:
:# HTLV is at the bottom,
:# The HIV1 sequences are above the HIV2 sequences, and
:# "SIVCZ" is placed next to "Smanga_S4".
Note that all these rearrangements do ''not'' change the topology (the branching pattern) of the tree — it still shows the same phylogeny.
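
One way to convince yourself of this is to compare the sets of clades (groups of tips below each internal node) of two trees: swapping branches changes the drawing order but not the clades. The sketch below represents rooted trees as nested tuples rather than reading Newick files; the function name <tt>clades</tt> and the tip names A to E are hypothetical.

```python
def clades(tree):
    """Return the set of clades (frozensets of tip names) of a nested-tuple tree."""
    found = set()

    def tips_of(node):
        if isinstance(node, str):
            return frozenset([node])  # a tip
        # Internal node: its clade is the union of its children's tips.
        tips = frozenset().union(*(tips_of(child) for child in node))
        found.add(tips)
        return tips

    tips_of(tree)
    return found

original = ("E", (("A", "B"), ("C", "D")))
swapped = ((("D", "C"), ("B", "A")), "E")    # same tree, every node rotated
different = ("E", (("A", "C"), ("B", "D")))  # a genuinely different branching pattern

print(clades(original) == clades(swapped))
print(clades(original) == clades(different))
```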

===Step 6: interpretation===

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 6''':
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?

== Comparing trees ==

For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so). The sequences can be found via the following link:

* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]

===Step 7: with or without gapped positions===
This time, make two versions of your tree: one where Ignore all gap sites is on, and one where it is off.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 7''':
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?
: Your answers should include the following:
:* How did you construct the trees? (alignment method, construction of tree, etc.).
:* Pictures of the trees.
:* Which tree do you think is most correct?

===Step 8: comparison to taxonomy===
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a "Common Tree" with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. '''Note''': Remember to tick include unranked (phylogenetic) taxa.
[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 8''':
: Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?

== Mitochondrial ''versus'' cytoplasmic proteins ==
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion's own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.

===Step 9: building the dataset===
# Find all proteins named "ribosomal protein L3" from as many eukaryotes (''Eukaryota'') as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).
# How many of these have a Subcellular location of "mitochondrion" and "cytoplasm", respectively? Download the results of these two searches in FASTA format.
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with "RL3" (cytoplasmic) or "RM03"/"RK3" (mitochondrial), which is very convenient for telling them apart. ''If you have any names that do not begin with "RL3", "RK3" or "RM03", revisit your UniProt search criteria!'' Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).
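
If you prefer a script to a text editor, the combining and the name check can be sketched as follows. This is only an illustration: the function names are invented, the two example records (accessions, entry names and sequences) are made up, and the header parsing assumes the usual UniProt FASTA layout <tt>>sp|ACCESSION|ENTRY_NAME description</tt>.

```python
def combine_fasta(*texts):
    """Concatenate FASTA texts, making sure each one ends with a newline."""
    return "".join(t if t.endswith("\n") else t + "\n" for t in texts if t)

def unexpected_entries(fasta_text, prefixes=("RL3", "RK3", "RM03")):
    """List entry names that do not start with one of the expected prefixes."""
    odd = []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line
        fields = line[1:].split("|")
        # Third field holds "ENTRY_NAME description ..."; keep the name only.
        source = fields[2] if len(fields) >= 3 else line[1:]
        words = source.split()
        entry = words[0] if words else ""
        if not entry.startswith(prefixes):
            odd.append(entry)
    return odd

# Made-up records standing in for the two downloaded files.
cyto = ">sp|X00001|RL3_EXAMPLE made-up cytoplasmic entry\nMSHRKF"
mito = ">sp|X00002|RM03_EXAMPLE made-up mitochondrial entry\nMLTRSL\n"

combined = combine_fasta(cyto, mito)
print(combined.count(">"), unexpected_entries(combined))
```

The newline handling matters: if the first file does not end with a line break, a plain concatenation would glue the next header onto the last sequence line.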

===Step 10: making the tree===
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set Ignore all gap sites off. Describe all the steps you took to make it, and hand in a picture of your tree in ''unrooted'' view. 

===Step 11: rerooting the tree in Seaview===
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:
# Switch back to rooted ("squared") view.
# Click Re-root in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)
# Now find a node where all children are either cytoplasmic or mitochondrial. Click it (don't worry about clicking a wrong node, you can always click another). Make sure that all the cytoplasmic and all the mitochondrial sequences are in two separate subtrees.
# Then, click Full in the second row of the tree window to make the small black squares disappear again.
Include a picture of the rerooted tree in your answer.


===Step 12: annotating the tree===
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark ''both'' these nodes with a green circle each.
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a ''blue'' circle each.
Hand in a picture of your annotated tree.


===Step 13: interpretation===

Consider your rerooted and annotated tree, and answer the following questions:
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (where are there most mutations per time unit) — among the cytoplasmic or the mitochondrial proteins?

File:Ribosomal proteins 35-NJ tree.rerooted.png

2025-11-26T20:57:02Z

Henni:

Exercise: Phylogeny - Answers (Seaview version)

2025-11-26T20:56:23Z

Henni: /* Step 11 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) nucleotide differences. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in topology between the two trees: In the tree made without the gap positions, Rice groups with Fruit fly inside the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct of the two. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species purely by chance. This effect tends to diminish as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 54 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 27 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]

==Step 10==
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular".


Here is the result:

[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondrial subtree, Bovine (cow) is the sister group to Human, while in the cytoplasmic subtree, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondrial subtree, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic subtree, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

Exercise: Phylogeny - Answers (Seaview version)

2025-11-26T20:55:24Z

Henni: /* Step 11 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. It is clear that HTLV is far more distant from all the other sequences than they are from each other: every distance involving HTLV is above 0.7.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
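The observation about HTLV can also be checked programmatically. The following is a minimal Python sketch, not part of the original exercise. It assumes that lines from the <tt>#pairwise distances</tt> part of the file have been pasted into a string (only a few of them are reproduced here). The function names are made up for illustration.

<pre>
def parse_pairwise(text):
    """Parse lines of the form 'NAME1,NAME2: distance' into a dict
    keyed by the (unordered) pair of sequence names."""
    distances = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pair, value = line.split(":")
        a, b = pair.split(",")
        distances[frozenset((a.strip(), b.strip()))] = float(value)
    return distances


def distances_involving(distances, name):
    """Return all distances in which the sequence `name` takes part."""
    return [d for pair, d in distances.items() if name in pair]


# A few lines copied from the list above (not the full data set)
sample = """
HIV1B5,HTLV: 0.750305
HIV2D1,HTLV: 0.741778
HTLV,SIVCZ: 0.747868
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV2CA: 0.399513
HIV1B5,SIVCZ: 0.130329
Smanga_S4,Smanga_SP: 0.0146163
"""

dist = parse_pairwise(sample)
htlv = distances_involving(dist, "HTLV")
others = [d for pair, d in dist.items() if "HTLV" not in pair]
# Smallest distance involving HTLV versus largest distance among the rest
print(min(htlv), max(others))
</pre>

Running the same functions on the complete list lets you compare the smallest HTLV distance with the largest distance among the remaining sequences.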

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the nucleotide differences that do not change the encoded amino acids (synonymous differences). The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in topology between the two trees: in the one made without the gap positions, Rice is grouped with Fruit fly within the animal subtree, while in the other tree, Rice is grouped with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the more correct one. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally, a gene will be most similar to the gene from a non-sister species for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 54 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 27 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]
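As an alternative to a text editor, the two downloaded files can be combined with a few lines of Python. This is only a sketch: the function name and the file names in the usage note are hypothetical, so use the names of the files you actually downloaded.

<pre>
from pathlib import Path


def combine_fasta(input_paths, output_path):
    """Concatenate FASTA files into one file and return the number of
    sequences written. A newline is added to any input file that lacks
    a final one, so that headers never end up glued to a sequence line."""
    count = 0
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
            # Every FASTA record starts with a '>' header line
            count += sum(1 for line in text.splitlines() if line.startswith(">"))
    return count
</pre>

A call such as <tt>combine_fasta(["mitochondrial.fasta", "cytoplasmic.fasta"], "combined.fasta")</tt> should return 35 if the two downloads contain 8 and 27 sequences, respectively.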

==Step 10==
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular".


Here is the result:

[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_35-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: the tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which placement is more correct.

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

File:Ribosomal proteins 35-NJ tree.unrooted.newick.txt

2025-11-26T20:44:04Z

Henni:

2025-11-10T13:48:21Z

Henni: /* When BLAST fails */

Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.

==Introduction==

Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today's lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today's exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.


===Links===
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/


==When BLAST fails==

Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?

>QUERY1
MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV
EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK
LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS
IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL
YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID
LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE
IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL
QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE

Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select blastp as the algorithm. Paste in the query sequence. Change the database from ClusteredNR to Protein Data Bank (pdb), and press BLAST (Figure 1).

[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]

* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q1.html backup output])

==Trying another approach==

Now, click on Edit Search on the results page (then you don't have to paste in the query sequence again). This time, set the database to Reference proteins (refseq_protein) and select PSI-BLAST (Position-Specific Iterated BLAST) as the algorithm (Figure 2).

'''IMPORTANT:''' To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious "Query1" sequence is from an archaeon.

[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]

* '''QUESTION 2''': How many significant hits does BLAST find (E-value < 0.005)? ('''Tip:''' you can see the number by selecting all significant hits (clicking All under Sequences with E-value BETTER than threshold) and then looking at the number of selected hits)
* '''QUESTION 3''': How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?


===Constructing the PSSM===
<div style="background-color: lightyellow; border: solid thin grey;">
:'''Note:''' If you see the error message “Entrez Query: txid2157 [ORGN] is not supported”, then click Recent Results in the upper right part of the BLAST window, select your most recent search, and try again.
</div>
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the Run button at Run PSI-Blast iteration 2 (you can find it both at the top of the results table and after the list of significant hits).

* '''QUESTION 4''': How many significant hits does BLAST find (E-value < 0.005)?
* '''QUESTION 5''': What is the E-value of the ''least'' significant hit shown on the results page?
* '''QUESTION 6''': How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?
* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!

===Saving and reusing the PSSM===
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.

Go to the top of the PSI-BLAST output page and click Download All, then click PSSM. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.

Then, open ''a new BLAST window'' (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select pdb as the database. Do ''not'' limit your search to Archaea this time. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. '''Note:''' You don't have to paste the query sequence again, it is stored in the PSSM!

* '''QUESTION 8''': Do you find any significant PDB hits (E-value < 0.005) now? If yes, how many?
* '''QUESTION 9''': What are the PDB identifiers and the E-values for the two best PDB hits?
* '''QUESTION 10''': What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? ('''Tip:''' click on the description to get to the actual alignment between the query sequence and the PDB hit)?
* '''QUESTION 11''': What is the function of these proteins?

===One more round===
Let's try one more iteration of PSI-BLAST:
* Go back to your first BLAST window (the one with the results from the refseq_protein database limited to Archaea) and press the Run button at Run PSI-Blast iteration 3.
* Save the resulting PSSM file (make sure you give it a different name!).
* Launch a new PSI-BLAST search against pdb in all organisms using this PSSM (you may have to click on Clear to erase your first PSSM file from the server).
* '''QUESTION 12''': Answer questions 8-10 again for the new search.

==Finding a remote homolog (on your own)==
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way for finding a remote homolog in a specific organism or taxonomic group, and in this round you will search the broader database "Reference proteins" (refseq_protein). ('''Note:''' we would have liked to do this exercise in the broadest database, nr, but that search runs into technical problems.) Your task is to find out whether the protein with the UniProt ID '''GPAA1_HUMAN''' has a homolog in the genus ''Trypanosoma'' (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).
* First, try a standard BlastP (where you set Organism to ''Trypanosoma'', Database to refseq_protein ('''not''' refseq_select), switch the Low complexity regions filter off, and set the E-value threshold to 10).
* '''QUESTION 13''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
* Then, try PSI-BLAST. '''Hint:''' You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in ''Trypanosoma''.
* '''QUESTION 14''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?




==Concluding remarks==
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.

ExPSIBLAST answer

2025-11-10T10:15:53Z

Henni: /* Finding a remote homolog (on your own) */

ExPSIBLAST answer

2025-11-10T09:08:52Z

Henni: /* Saving and reusing the PSSM */

2025-10-14T13:06:12Z

Henni: /* Identification of membrane proteins (potential vaccine targets) */

Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.

The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and could therefore be used in a vaccine. As part of the exercise, some new material will be introduced, especially concerning prediction of B-cell epitopes (immunoreactive peptides). The outline of the exercise is as follows:

# What exactly is malaria?
# Identification of membrane bound proteins (potential vaccine targets)
# Analysis of membrane protein domain structure
# Prediction of B-cell epitopes from membrane proteins
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.

 

== What exactly is malaria? ==
[[Image:Office-notes-line_drawing.png|30px|left]]
'''Question 1:''' ''Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?''

Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:
*'''NCBI Taxonomy:''' http://www.ncbi.nlm.nih.gov/Taxonomy    ('''Hint:''' If you don't know the Latin name for the organism, it will be easier to search for a name as a "[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]" rather than as a "Complete name".)
*'''Tree of life:''' http://www.tolweb.org/

'''Question 1a)''' Identify the following taxonomical levels for the malaria-causing organism:
* Genus
* Phylum
* (Super)Kingdom

'''Question 1b)''' How "close" in taxonomy space is the organism to the following other organisms? Find the lowest taxonomic group that contains both of them. '''Hint:''' as an alternative to manually comparing the taxonomy strings (the "lineage"), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.
* ''Homo sapiens''
* ''Babesia microti''    (can in rare cases be transmitted by ticks (Danish: "skovflåt") and can lead to the disease ''[https://en.wikipedia.org/wiki/Babesiosis babesiosis]'', in which the red blood cells (erythrocytes) are invaded as in malaria, and which leads to ''anemia'' ("blood loss", in this case a lack of oxygen-carrying capacity in the blood); see the Tree of Life page for this organism for images of infected erythrocytes)

Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .

'''Question 1c)''' Report the names of the '''four''' species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.

 

== Identification of membrane proteins (potential vaccine targets) ==
Malaria caused by ''Plasmodium falciparum'' (''Pf'') is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.

When the ''Pf'' genome was initially sequenced in the 1990s, it was based on ''Pf'' cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named ''3D7'' and is the most studied malaria strain to this day (even though it's not known from where in the world it originates).

'''Task:'''
Locate the entry for ''Pf'' 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI's taxonomy browser]. In the multi-colored table on the right-hand side ("Entrez records"), a set of sequence-related data is shown. For instance, the "Gene" link shows how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Question 2a)''' How many verified genes (NOT hypothetical) does ''Pf'' 3D7 have? ('''Hint:''' Follow the Gene link and add <tt>NOT hypothetical</tt> to the search string).

Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by ''sporozoites'' injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when ''merozoites'' developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells.

Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the ''sporozoites'' and ''merozoites'' as well as non-human proteins on the surface of infected hepatocytes and erythrocytes.

[[Image:Nm0206-170-F1.jpg | center]]

=== Searching UniProt ===
We'll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed "visible" to the immune system. Building on the information from the previous section, we therefore need to identify proteins that '''originate''' from the parasite, and that are present on the cell surface of ''sporozoites'', ''merozoites'' OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:

# Are secreted from the parasite to the vacuole ''inside'' the host cell,
# Migrate from the vacuole to the host cell, and
# Are transported to the surface (membrane) of the host cell

Initially, we'll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we'll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.

[[Image:Emblem-important_tiny.png|left]]'''Note:''' When answering the questions below, you have to ''write the search string'' you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.

'''2b)''' Go to [http://www.uniprot.org/ UniProt]. Investigate how many ''Plasmodium falciparum'' (''Pf'') proteins there are in total in UniProtKB (i.e. proteins from all ''Pf'' strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL?

'''2c)''' Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question '''2a)'''? How many of these are from Swiss-Prot and how many from TrEMBL?

Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. '''Note:''' We go back to working with all strains of ''Pf'', not exclusively 3D7.

'''2d)''' First, check how many ''Pf'' proteins have a "Subcellular location [CC]" comment at all ('''Tip:''' choose Subcellular location > Subcellular location [CC] > Subcellular location term in the menu and enter a <tt>*</tt> in the field). How many from each part of the database? ('''Note''' that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question '''2b)''' — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).

'''2e)''' How many of these are secreted? ('''Tip:''' that should go into the field that pops up when the menu is set to Subcellular location > Subcellular location [CC] > Subcellular location term).

To get more hits, we will try to search for other terms in the Subcellular location term field. Interesting subcellular locations might include words such as "<tt>surface</tt>" or "<tt>membrane</tt>".

'''2f)''' How many are there of these, respectively?

The word "membrane" gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, ''not'' in an inner membrane in the cell. To get an overview, you should try another function in UniProt's interface: First, click to select the Table view instead of the Card view (above the results list). Then, click the button Customize columns; that will bring up a table where you can find a Subcellular location item. Click it, mark Subcellular location [CC], and click Close.

'''2g)''' Now look again at the list of results where "subcellular location" contained "membrane". Consider the field Subcellular location. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful as vaccine targets and of hits that are surely not useful (at least two ''different'' examples of each). '''Hint:''' if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (Entry), Entry name, or Protein name.

Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the ''host cell''.

'''2h)''' How many of the hits have the location "host cell membrane"?


These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the "Subcellular location" annotation, it might be a part of the description (the protein name). '''Tip:''' you can always discard a search term in the Advanced interface by clicking the Remove button.

'''2i)''' How many ''Pf'' proteins contain <tt>erythrocyte</tt> in their Protein Name [DE] field? How many of these are from Swiss-Prot (reviewed)?

'''2j)''' How many of these erythrocyte proteins also have <tt>membrane</tt> in their name?

Some of the hits you find in this way are very short (you can try to sort them by length by clicking the Length heading). These short proteins might be fragments.

'''2k)''' How many of the hits are complete (not annotated as fragments)? ('''Tip:''' see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).

'''2l)''' Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? ('''Tip:''' you should look for Cross-references in the menu, and again place a <tt>*</tt> in the field). If yes, what are their names and accession codes?

As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking Download above the results list and choosing FASTA (canonical). You can either choose to download them (remember to choose No under Compressed) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.
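If you download the FASTA file, a few lines of Python are enough to check what you saved and to spot the very short sequences that might be fragments. The following is only a minimal sketch: the function names <tt>parse_fasta</tt> and <tt>accession</tt>, and the two example records (including their accession codes), are invented for illustration and are not real UniProt entries. It assumes the usual UniProt header layout <tt>&gt;db|ACCESSION|ENTRY_NAME description</tt>.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record (if any).
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession(header):
    """Extract the accession from a header of the form 'db|ACCESSION|ENTRY_NAME ...'."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else header.split()[0]


# Invented example data, NOT real UniProt entries.
example = """>tr|X00001|X00001_PLAFA Hypothetical protein one
MKTAYIAKQR
QISFVKSHFS
>tr|X00002|X00002_PLAFA Hypothetical protein two
MSEQ
"""

records = parse_fasta(example)
lengths = {accession(header): len(seq) for header, seq in records}
for acc, length in sorted(lengths.items(), key=lambda item: item[1]):
    print(acc, length)
```

To use it on your own download, read the file contents into a string and pass that to <tt>parse_fasta</tt> instead of <tt>example</tt>.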

 

== Analysis of membrane protein domain structure ==
[[Image:PfEMP1_transport.jpg|right|border]]

The PfEMP1 (''Plasmodium falciparum'' Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins).

The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: ''milten'') which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.

If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against ''Pf''. Symptoms such as anemia would thereby not become so severe.

We will now examine how the PfEMP1 proteins are built.

Look at the entries you found at the end of section 2. Select just those hits whose accession codes start with "Q" (there should be three of them; otherwise, revisit section 2).

Take a closer look (in UniProt) at these three entries. Scroll down to Family and domain databases under Family & Domains. Here, you will find some services providing an overview of known families/domains in the protein in question. InterPro is the most important of these, since it collects information from a number of family & domain databases (including the one called Pfam) and therefore has the widest repertoire of domain types.

Open the link labeled View protein in InterPro in a new tab. Note the graphical interface of InterPro under the heading "Entry matches to this protein". When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least ''two'' names and identifiers, an InterPro identifier beginning with "IPR" and a member database identifier, e.g. beginning with "PF" if it is derived from Pfam.

<div style="background-color: lavender; border: solid thin grey;">
:'''What are families and domains, anyway?'''
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:
:*'''Domains''' are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger.
:*A protein '''family''' is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family.
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.
</div>

'''3a)''' Note that one named family/domain is found in several copies in all our three erythrocyte membrane proteins. What are the names and identifiers of this family/domain? How many times does it occur in each of the proteins?

Click the identifiers for this particular family/domain and read more about it.

'''3b)''' Under "Other Features", InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular?

Look (in UniProt) at the PDB cross-references under 3D structure databases (under Structure). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam.

'''3c)''' Which positions are structurally determined '''by X-ray''' in each of the three proteins? If you number the occurrences of the known family/domain from '''3a''' (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins?

Now read what is said about the function and location of our proteins according to Gene Ontology (GO - Molecular function, GO - Biological process and GO - Cellular component) in UniProt.

'''3d)''' Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples.

 

== Prediction of B-cell epitopes in a membrane protein ==
'''Q8I639''' is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for ''pregnancy-associated malaria'' (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually.

One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and you would need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore an easier place to start with a vaccine.

In order to have a better handle on our bioinformatics work, we'll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in '''question 3c''').

=== Epitope prediction ===
The vaccine we are working towards designing should contain '''epitopes'''. Epitopes are the parts of the disease-associated protein that the immune system will recognize, for instance the parts that the infected person's antibodies will bind to (the so-called '''B-cell epitopes'''; there are also '''T-cell epitopes''', which we'll not cover here).

For predicting which parts of the protein are potential epitopes, we'll use the '''BepiPred 2.0 server''', which was created here at DTU.
<div style="background-color: lavender; border: solid thin grey;">
:'''Important Note:''' Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results.
:Please select the method called "BepiPred 2.0"
:http://tools.iedb.org/bcell/
</div>

In order to run the prediction, we'll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.
# Go to the Structure section.
# Right-click the link labeled RCSB-PDB and open it in a new tab. This will take you to a PDB page.
# Here, you can find the sequence by clicking Display Files and choosing FASTA Sequence. Alternatively, you can choose to download the sequence by clicking Download Files.


'''Question 4a''': What is the name of the PDB entry, and is it a crystal or NMR structure?

'''Question 4b''': Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA.

'''Question 4c''':
Note down the following from the UniProt entry; you'll need it in the next section:
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?
* What position in the original protein does position 1 in the new FASTA file correspond to?

You can now run the '''BepiPred 2.0''' prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the '''results page''':
* Set '''threshold''' to '''0.55'''
This gives us a reasonable amount of epitopes to continue our work with:
* Write down the start/end sequence positions of all epitopes of at least '''8 amino acids'''
* '''Hint:''' there should be '''7''' such epitopes, and the last one starts at position '''276'''
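If you prefer to double-check the epitope boundaries programmatically rather than by eye, the selection rule above (score above the threshold, at least 8 consecutive residues) can be sketched in Python. This is only an illustration under stated assumptions: the function name <tt>epitope_runs</tt> and the example scores are made up, it assumes you have the per-residue scores as a plain list, and it assumes a residue counts as "epitope" when its score is greater than or equal to the threshold (the server's exact comparison may differ).

```python
def epitope_runs(scores, threshold=0.55, min_length=8):
    """Return 1-based (start, end) intervals of consecutive residues whose
    score is >= threshold and whose run length is at least min_length."""
    runs = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = pos  # a new run begins here
        else:
            # The run (if any) ended at the previous position.
            if start is not None and pos - start >= min_length:
                runs.append((start, pos - 1))
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        runs.append((start, len(scores)))
    return runs


# Made-up scores: one run of 9 residues (kept) and one of 5 residues (dropped).
example_scores = [0.4] * 3 + [0.6] * 9 + [0.5] * 2 + [0.7] * 5
print(epitope_runs(example_scores))
```

The positions returned are relative to the sequence you submitted, i.e. the domain sequence, not the full protein.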

[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]

'''Question 4d''': Create a table with the following information about the predicted epitopes:
* Start/end position, length, Start/end position ''in the original protein''

''(We'll need the coordinate-transformed values for the PyMOL visualization)''
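The coordinate transformation for the table is a simple shift: if position 1 of the domain sequence corresponds to position <tt>domain_start</tt> in the full protein, then domain position ''p'' corresponds to <tt>domain_start + p - 1</tt>. A small Python sketch of this is shown below. The value <tt>domain_start = 101</tt> and the two epitope intervals are placeholders chosen for illustration only; use your own answers from questions 4c and 4d. The sketch also assumes that the FASTA sequence is one contiguous stretch of the full protein, with no extra residues (such as expression tags) in front.

```python
def to_full_protein(pos, domain_start):
    """Map a 1-based position in the domain sequence to the full protein."""
    return domain_start + pos - 1


def epitope_table(epitopes, domain_start):
    """Return rows of (start, end, length, full_start, full_end)."""
    rows = []
    for start, end in epitopes:
        rows.append((
            start,
            end,
            end - start + 1,
            to_full_protein(start, domain_start),
            to_full_protein(end, domain_start),
        ))
    return rows


# Placeholder values, NOT the answers to the exercise.
domain_start = 101
example_epitopes = [(4, 12), (30, 40)]

table = epitope_table(example_epitopes, domain_start)
for row in table:
    print("\t".join(str(value) for value in row))
```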

== Visualization of epitopes ==
Lastly, we'll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it's still a good idea to check it visually.


In the PDB database page for the structure you found in the last section, click the "Sequence" tab and look at the figure. In the case of this structure, the authors' numbering directly follows the coordinates from the FULL UniProt sequence.

[[Image:Office-notes-line_drawing.png|30px|left]]'''Question 5a):'''
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the "UNMODELED" feature. 
* Will this have an impact on any of our predicted epitopes?
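Checking whether an epitope is affected by a missing region is an interval-overlap question. As a hedged sketch (the function names and all coordinates below are invented for illustration, and the unmodeled intervals are assumed not to overlap each other), the number of epitope residues that fall in unmodeled regions could be counted like this, with everything expressed in full-protein coordinates:

```python
def overlap(a, b):
    """Number of positions shared by two inclusive intervals (start, end)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def missing_residues(epitope, unmodeled):
    """Count residues of one epitope that lie in any unmodeled interval."""
    return sum(overlap(epitope, gap) for gap in unmodeled)


# Invented coordinates, NOT the answers to the exercise.
example_epitopes = {"epitope_1": (104, 112), "epitope_2": (130, 140)}
example_unmodeled = [(108, 115), (200, 210)]

report = {
    name: missing_residues(interval, example_unmodeled)
    for name, interval in example_epitopes.items()
}
print(report)
```

An epitope with a count of 0 is fully visible in the structure; a count equal to its length means it cannot be seen at all.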

Now it's time to work with visualization of the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals.

The goal will be to:
* Colour the epitopes in different colours
* Have a look at where in the structure they are found: on the surface or inside.

After you have loaded the structure (either via "fetch" or by downloading the file), you can help yourself by setting the base colour to a neutral grey, and with a basic "cartoon" visualization as the first step:

<pre>
color gray80
hide all
show cartoon
</pre>

Since we're working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:
<pre>
select epitope_XXX, resi 1-3
</pre>

This will create the selection of residues 1 to 3 under the name "epitope_XXX" — please refer to the PyMOL exercise for more details about selection rules.

'''TASK:'''
* Create named selections for all seven epitopes
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)
** Select a unique and easy to identify colour for each epitope.
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!
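Typing fourteen nearly identical commands invites typos, so one option is to generate them. The Python sketch below only builds the text of the PyMOL <tt>select</tt> and <tt>color</tt> commands, which you can then paste into the PyMOL command line. The function name <tt>pymol_commands</tt>, the colour list, and the example intervals are assumptions made for illustration; replace the intervals with your own full-protein coordinates from question 4d.

```python
# Seven standard PyMOL colour names, one per epitope.
COLOURS = ["red", "orange", "yellow", "green", "cyan", "blue", "magenta"]


def pymol_commands(epitopes):
    """Build 'select' and 'color' command strings for a list of
    (start, end) intervals given in the structure's residue numbering."""
    lines = []
    for index, (start, end) in enumerate(epitopes, start=1):
        name = f"epitope_{index}"
        colour = COLOURS[(index - 1) % len(COLOURS)]
        lines.append(f"select {name}, resi {start}-{end}")
        lines.append(f"color {colour}, {name}")
    return lines


# Invented intervals, NOT the answers to the exercise.
commands = pymol_commands([(104, 112), (130, 140)])
print("\n".join(commands))
```

Keeping the intervals in one list also makes it easy to rebuild the figure legend from the same data.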

As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.

<pre>
create ka, chain A
</pre>

This will create a new object with the A chain.
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.

Lastly, we'll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):
* show as → surface
to show the protein from the outside.
* show as → cartoon
* show → mesh
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.

[[Image:Office-notes-line_drawing.png|30px|left]]'''Question 5b):''' Play around with the visualization, and create one (or more) good figures for your report that show the following:
* Placement of the epitopes
* A legend for the colours (or arrows with explanations or something similar)
* Which epitopes are (partly) missing?
* Are the remaining epitopes accessible on the surface of the protein?

== Epilogue ==
''Now all that remains is to ship off the sequences of the surface accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments, with the right linker sequences, getting it expressed in a production host, follow up with animal testing and phase 1, 2 and 3 clinical trials, and the vaccine should be ready for the market.''

Exercise:Malaria Vaccine

2025-10-14T13:00:53Z

Henni: /* What exactly is malaria? */

Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.

The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:

# What exactly is malaria?
# Identification of membrane bound proteins (potential vaccine targets)
# Analysis of membrane protein domain structure
# Prediction of B-cell epitopes from membrane proteins
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.

 

== What exactly is malaria? ==
[[Image:Office-notes-line_drawing.png|30px|left]]
'''Question 1:''' ''Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?''

Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:
*'''NCBI Taxonomy:''' http://www.ncbi.nlm.nih.gov/Taxonomy    ('''Hint:''' If you don't know the Latin name for the organism, it will be easier to search for a name as a "[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]" rather than as a "Complete name".)
*'''Tree of life:''' http://www.tolweb.org/

'''Question 1a)''' Identify the following taxonomical levels for the malaria-causing organism:
* Genus
* Phylum
* (Super)Kingdom

'''Question 1b)''' How "close" in taxonomy space is the organism to the following other organisms (find the upper level taxonomical group, that ties them together). '''Hint:''' as an alternative to manually comparing the taxonomy-strings (the "lineage"), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.
* ''Homo sapiens''
* ''Babesia microti''    (Can in rare cases be transmitted by ticks (Danish: "Skovflåt") and can lead to the disease ''[https://en.wikipedia.org/wiki/Babesiosis babesiosis]'', where the red blood cells (erythrocytes) are invaded as in malaria, and which will lead to ''anemia'' ("blood loss", in this case lack of oxygen carrying capacity in the blood). See the Tree of Life page for this organism for images of infected erythrocytes.)

Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .

'''Question 1c)''' Report the names of the '''four''' species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.

 

== Identification of membrane proteins (potential vaccine targets) ==
Malaria caused by ''Plasmodium falciparum'' (''Pf'') is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.

When the ''Pf'' genome was initially sequenced in the 1990s, it was based on ''Pf'' cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named ''3D7'' and is the most studied malaria strain to this day (even though it's not known from where in the world it originates).

'''Task:'''
Locate the entry for ''Pf'' 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI's taxonomy browser]. In the multi-colored table on the right-hand side ("Entrez records"), a set of sequence-related data is shown. For instance, the "Gene" entry describes how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).

[[Image:Office-notes-line_drawing.png|30px|left]]
'''Question 2a)''' How many chromosomes, and how many verified genes (NOT hypothetical) does ''Pf'' 3D7 have? ('''Hints:''' First, follow the Genome link and select the first assembly to see an overview of the chromosomes. Then, go back to the taxonomy page and follow the Gene link and add <tt>NOT hypothetical</tt> to the search string).

Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by ''sporozoites'' injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when ''merozoites'' developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells.

Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the ''sporozoites'' and ''merozoites'' as well as non-human proteins on the surface of infected hepatocytes and erythrocytes.

[[Image:Nm0206-170-F1.jpg | center]]

=== Searching UniProt ===
We'll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed "visible" to the immune system. Building on the information from the previous section, we therefore need to identify proteins that '''originate''' from the parasite, and that are present on the cell surface of ''sporozoites'', ''merozoites'' OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:

# Are secreted from the parasite to the vacuole ''inside'' the host cell,
# Migrate from the vacuole to the host cell, and
# Are transported to the surface (membrane) of the host cell

Initially, we'll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we'll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.

[[Image:Emblem-important_tiny.png|left]]'''Note:''' When answering the questions below, you have to ''write the search string'' you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.

'''2b)''' Go to [http://www.uniprot.org/ UniProt]. Investigate how many ''Plasmodium falciparum'' (''Pf'') proteins there are in total in UniProtKB (i.e. proteins from all ''Pf'' strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL?

'''2c)''' Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question '''2a)'''? How many of these are from Swiss-Prot and how many from TrEMBL?

Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. '''Note:''' We go back to working with all strains of ''Pf'', not exclusively 3D7.

'''2d)''' First, check how many ''Pf'' proteins have a "Subcellular location [CC]" comment at all ('''Tip:''' choose Subcellular location > Subcellular location [CC] > Subcellular location term in the menu and enter a <tt>*</tt> in the field). How many from each part of the database? ('''Note''' that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question '''2b)''' — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).

'''2e)''' How many of these are secreted? ('''Tip:''' that should go into the field that pops up when the menu is set to Subcellular location > Subcellular location [CC] > Subcellular location term).

To get more hits, we will try to search for other terms in the Subcellular location term field. Interesting subcellular locations might include words such as "<tt>surface</tt>" or "<tt>membrane</tt>".

'''2f)''' How many are there of these, respectively?

The word "membrane" gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, ''not'' in an inner membrane in the cell. To get an overview, you should try another function in UniProt's interface: First, click to select the Table view instead of the Card view (above the results list). Then, click the button Customize columns; that will bring up a table where you can find a Subcellular location item. Click it, mark Subcellular location [CC], and click Close.

'''2g)''' Look again at the list of results where "subcellular location" contained "membrane". Consider the field Subcellular location. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and of hits that are surely not useful, as vaccine targets (at least two ''different'' examples of each). '''Hint:''' if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (Entry), Entry name, or Protein name.

Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the ''host cell''.

'''2h)''' How many of the hits have the location "host cell membrane"?


These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the "Subcellular location" annotation, it might be a part of the description (the protein name). '''Tip:''' you can always discard a search term in the Advanced interface by clicking the Remove button.

'''2i)''' How many ''Pf'' proteins contain <tt>erythrocyte</tt> in their Protein Name [DE] field? How many of these are from Swiss-Prot (reviewed)?

'''2j)''' How many of these erythrocyte proteins also have <tt>membrane</tt> in their name?

Some of the hits you find in this way are very short (you can try to sort them by length by clicking the Length heading). These short proteins might be fragments.

'''2k)''' How many of the hits are complete (not annotated as fragments)? ('''Tip:''' see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).

'''2l)''' Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? ('''Tip:''' you should look for Cross-references in the menu, and again place a <tt>*</tt> in the field). If yes, what are their names and accession codes?

As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking Download above the results list and choosing FASTA (canonical). You can either choose to download them (remember to choose No under Compressed) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.

 

== Analysis of membrane protein domain structure ==
[[Image:PfEMP1_transport.jpg|right|border]]

The PfEMP1 (''Plasmodium falciparum'' Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins).

The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: ''milten'') which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.

If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against ''Pf''. Symptoms such as anemia would thereby not become so severe.

We will now examine how the PfEMP1 proteins are built.

Look at the entries you found at the end of section 2. Select just those hits whose accession codes start with "Q" (there should be three of them; otherwise, revisit section 2).

Take a closer look (in UniProt) at these three entries. Scroll down to Family and domain databases under Family & Domains. Here, you will find some services providing an overview of known families/domains in the protein in question. InterPro is the most important of these, since it collects information from a number of family & domain databases (including the one called Pfam) and therefore has the widest repertoire of domain types.

Open the link labeled View protein in InterPro in a new tab. Note the graphical interface of InterPro under the heading "Entry matches to this protein". When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least ''two'' names and identifiers, an InterPro identifier beginning with "IPR" and a member database identifier, e.g. beginning with "PF" if it is derived from Pfam.

<div style="background-color: lavender; border: solid thin grey;">
:'''What are families and domains, anyway?'''
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:
:*'''Domains''' are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger.
:*A protein '''family''' is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family.
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.
</div>

'''3a)''' Note that one named family/domain is found in several copies in all our three erythrocyte membrane proteins. What are the names and identifiers of this family/domain? How many times does it occur in each of the proteins?

Click the identifiers for this particular family/domain and read more about it.

'''3b)''' Under "Other Features", InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular?

Look (in UniProt) at the PDB cross-references under 3D structure databases (under Structure). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam.

'''3c)''' Which positions are structurally determined '''by X-ray''' in each of the three proteins? If you number the occurrences of the known family/domain from '''3a''' (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins?

Now read what is said about the function and location of our proteins according to Gene Ontology (GO - Molecular function, GO - Biological process and GO - Cellular component) in UniProt.

'''3d)''' Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples.

 

== Prediction of B-cell epitopes in a membrane protein ==
'''Q8I639''' is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for ''pregnancy-associated malaria'' (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually.

One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and you would need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore an easier place to start with a vaccine.

In order to have a better handle on our bioinformatics work, we'll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in '''question 3c''').

=== Epitope prediction ===
The vaccine we are working towards designing should contain '''epitopes'''. Epitopes are the parts of the disease-associated protein that the immune system recognizes, for instance the parts that the infected person's antibodies bind to (the so-called '''B-cell epitopes'''; there are also '''T-cell epitopes''', which we'll not cover here).

For predicting which parts of the protein are potential epitopes, we'll use the '''BepiPred 2.0 server''', which was created here at DTU.
<div style="background-color: lavender; border: solid thin grey;">
:'''Important Note:''' Please run the prediction on the IEDB web server instead of the one at DTU, as our local servers were updated in a way that changed the results.
:Please select the method called "BepiPred 2.0"
:http://tools.iedb.org/bcell/
</div>

In order to run the prediction, we'll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.
# Go to the Structure section.
# Right-click the link labeled RCSB-PDB and open it in a new tab. This will take you to a PDB page.
# Here, you can find the sequence by clicking Display Files and choosing FASTA Sequence. Alternatively, you can download the sequence by clicking Download Files.


'''Question 4a''': What is the name of the PDB entry, and is it a crystal or NMR structure?

'''Question 4b''': Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA.

'''Question 4c''':
Note down the following from the UniProt entry; you'll need it in the next section:
* Which sequence interval does the structure cover, in the coordinates of the original (full) UniProt sequence?
* What position in the original protein does position 1 in the new FASTA file correspond to?

You can now run the '''BepiPred 2.0''' prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the '''results page''':
* Set '''threshold''' to '''0.55'''
This gives us a reasonable number of epitopes to continue working with:
* Write down the start/end sequence positions of all epitopes of at least '''8 amino acids'''
* '''Hint:''' there should be '''7''' such epitopes, and the last one starts at position '''276'''

[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]
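
If you want to double-check the list you read off the results page, the rule "all epitopes of at least 8 amino acids" can be sketched in Python as below. This is only an illustration: the function name <code>find_epitopes</code> and the toy scores are made up, and we assume that a residue counts as predicted when its score is at or above the threshold (the server itself may treat the boundary differently).

```python
from typing import List, Tuple


def find_epitopes(scores: List[float],
                  threshold: float = 0.55,
                  min_length: int = 8) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) runs of consecutive residues scoring >= threshold."""
    epitopes = []
    run_start = None  # start position of the run we are currently inside, if any
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if run_start is None:
                run_start = position
        else:
            # The run (if any) ended at position - 1; keep it only if long enough.
            if run_start is not None and position - run_start >= min_length:
                epitopes.append((run_start, position - 1))
            run_start = None
    # Handle a run that continues to the last residue.
    if run_start is not None and len(scores) - run_start + 1 >= min_length:
        epitopes.append((run_start, len(scores)))
    return epitopes


# Made-up scores for illustration only, NOT real BepiPred output.
toy_scores = [0.4] * 3 + [0.6] * 9 + [0.5] * 2 + [0.7] * 4 + [0.3]
print(find_epitopes(toy_scores))
```

In the toy example only the 9-residue stretch is reported; the 4-residue stretch is above the threshold but too short.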

'''Question 4d''': Create a table with the following information about the predicted epitopes:
* Start/end position, length, Start/end position ''in the original protein''

''(We'll need the coordinate-transformed values for the PyMOL visualization)''
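
The coordinate transformation is a simple shift: if position 1 of the extracted FASTA sequence corresponds to position ''S'' in the full protein, then position ''p'' in the domain corresponds to ''p'' + ''S'' - 1 in the full protein. A minimal Python sketch of this is shown below. The start value 101 and the epitope interval are hypothetical placeholders, not the answers to '''Question 4c''' or '''4d'''; replace them with your own values.

```python
def to_original(position: int, domain_start: int) -> int:
    """Map a 1-based position in the extracted domain to the full-length protein."""
    return position + domain_start - 1


# Hypothetical values for illustration only.
domain_start = 101          # full-protein position matching position 1 of the FASTA
epitopes = [(4, 12)]        # (start, end) in domain coordinates

for start, end in epitopes:
    length = end - start + 1
    print(start, end, length,
          to_original(start, domain_start),
          to_original(end, domain_start))
```

Note that the length of an epitope is unchanged by the shift, which is a quick sanity check for your table.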

== Visualization of epitopes ==
Lastly, we'll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface-exposed epitopes, but it's still a good idea to check this visually.


In the PDB database page for the structure you found in the last section, click the "Sequence" tab and look at the figure. In the case of this structure, the authors' numbering directly follows the coordinates from the FULL UniProt sequence.

[[Image:Office-notes-line_drawing.png|30px|left]]'''Question 5a):'''
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the "UNMODELED" feature. 
* Will this have an impact on any of our predicted epitopes?
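
Checking whether a missing interval affects an epitope comes down to testing whether two intervals overlap. The sketch below shows one way to do this in Python, assuming both lists are given as inclusive (start, end) pairs in the same (full-protein) coordinates. All numbers and names are made-up examples, not the real intervals for this structure.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]


def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the shared part of two inclusive intervals, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def affected_epitopes(epitopes: List[Interval],
                      unmodeled: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """List every (epitope, missing part) pair where an epitope overlaps a gap."""
    hits = []
    for epitope in epitopes:
        for gap in unmodeled:
            shared = overlap(epitope, gap)
            if shared is not None:
                hits.append((epitope, shared))
    return hits


# Hypothetical intervals for illustration only.
example_epitopes = [(104, 112), (150, 160)]
example_unmodeled = [(110, 120)]
print(affected_epitopes(example_epitopes, example_unmodeled))
```

Here the first example epitope is partly missing (residues 110 to 112), while the second is unaffected.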

Now it's time to visualize the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals.

The goal will be to:
* Colour the epitopes in different colours
* Have a look at where in the structure they are found: on the surface or inside.

After you have loaded the structure (either via "fetch" or by downloading the file), you can help yourself by setting the base colour to a neutral grey, and with a basic "cartoon" visualization as the first step:

 color gray80
 hide all
 show cartoon

Since we're working with 7 epitopes, it can be beneficial to work with named selections. To avoid renaming selections, you can specify the name directly in the select command:

 select epitope_XXX, resi 1-3

This will create a selection of residues 1 to 3 under the name "epitope_XXX". Please refer to the PyMOL exercise for more details about selection rules.

'''TASK:'''
* Create named selections for all seven epitopes
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)
** Select a unique and easy to identify colour for each epitope.
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!
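
Typing fourteen commands by hand is error-prone, so one option is to let a small script write them for you. The Python sketch below turns a list of epitope intervals into <code>select</code> and <code>color</code> lines that can be pasted into the PyMOL command line. The interval, the naming scheme and the choice of colours are assumptions for illustration; use your own coordinate-transformed values from '''Question 4d'''.

```python
from typing import List, Tuple

# Seven standard PyMOL colour names, one per epitope.
COLOURS = ["red", "orange", "yellow", "green", "cyan", "blue", "magenta"]


def pymol_commands(epitopes: List[Tuple[int, int]]) -> List[str]:
    """Build 'select' and 'color' command lines for each (start, end) epitope."""
    lines = []
    for number, (start, end) in enumerate(epitopes, start=1):
        name = f"epitope_{number}"
        colour = COLOURS[(number - 1) % len(COLOURS)]  # reuse colours if more than 7
        lines.append(f"select {name}, resi {start}-{end}")
        lines.append(f"color {colour}, {name}")
    return lines


# Hypothetical interval for illustration only (full-protein numbering).
for line in pymol_commands([(104, 112)]):
    print(line)
```

The printed lines use the same <code>select name, selection</code> form as the example above, so they can be checked by eye before pasting them into PyMOL.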

As you may have noticed, there are two (identical) chains in the structure. We only need one of them, so the next step is to separate them.

 create ka, chain A

This will create a new object with the A chain.
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.

Lastly, we'll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):
* show as → surface
*: This shows the protein from the outside.
* show as → cartoon, followed by show → mesh
*: This shows BOTH the inside and the outside. It works especially nicely when you actively rotate the structure.

[[Image:Office-notes-line_drawing.png|30px|left]]'''Question 5b):''' Play around with the visualization, and create one (or more) good figures for your report that show the following:
* Placement of the epitopes
* A legend for the colours (or arrows with explanations or something similar)
* Which epitopes are (partly) missing?
* Are the remaining epitopes accessible on the surface of the protein?

== Epilogue ==
''Now all that remains is to ship off the sequences of the surface-accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments and the right linker sequences, getting it expressed in a production host, and following up with animal testing and phase 1, 2 and 3 clinical trials. After that, the vaccine should be ready for the market.''