<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22111/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Henni</id>
	<title>22111 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22111/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Henni"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php/Special:Contributions/Henni"/>
	<updated>2026-05-07T22:36:59Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=796</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=796"/>
		<updated>2025-11-26T21:02:38Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 14 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: every distance involving HTLV is above 0.74, whereas no distance between any two of the other sequences exceeds 0.41.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
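&lt;br /&gt;
The layout named in the first comment line of the listing (one row per sequence, each row holding the distances to all later sequences, then a line of names) can be read mechanically. The following is a minimal Python sketch, not part of the exercise: the function names are hypothetical, and the four-sequence excerpt only reuses values copied from the listing.&lt;br /&gt;

```python
def parse_upper_triangle(text):
    """Parse a listing with the layout shown above.

    Expected layout: comment lines starting with '#', the number of
    sequences n, then n-1 rows holding d(1,2)..d(1,n), d(2,3)..d(2,n), ...,
    and finally one line with the n sequence names.
    """
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    n = int(rows[0][0])
    values, names = rows[1:n], rows[n]
    dist = {}
    for i, row in enumerate(values):
        for offset, value in enumerate(row):
            j = i + 1 + offset  # 0-based: row i holds d(i, i+1) .. d(i, n-1)
            dist[frozenset((names[i], names[j]))] = float(value)
    return names, dist


def mean_distances(names, dist):
    """Average distance from each sequence to all the others."""
    return {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) / (len(names) - 1)
        for a in names
    }


# Four-sequence excerpt; the values are copied from the listing above.
EXCERPT = """\
#distances order: d(1,2),...,d(1,n) / d(2,3),...,d(2,n) / ...
4
0.750305 0.749086 0.747868
0.399513 0.130329
0.392205
HTLV HIV1B5 HIV2CA SIVCZ
"""

names, dist = parse_upper_triangle(EXCERPT)
means = mean_distances(names, dist)
most_distant = max(means, key=means.get)
print(most_distant, round(means[most_distant], 3))  # prints: HTLV 0.749
```

Run on the full 20-sequence listing, the same approach should single out HTLV in the same way, since all of its distances are far larger than the rest.&lt;br /&gt;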
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which agrees with the large HTLV distances observed in Step 2.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the best option for coding DNA, since it gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) nucleotide differences. The trees below were constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one difference in topology between the two trees: in the tree made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; This is not always the case!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is as we would expect, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. In the correct species phylogeny, salmon branches off before frog, which in turn branches off before the mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are, which is also incorrect. Fungi, including yeast, actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
It is often seen that a phylogeny based on a single gene differs from the true species phylogeny. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).&lt;br /&gt;
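&lt;br /&gt;
The effect of alignment length can be illustrated with a small simulation. This is only a sketch under simplified assumptions that are not taken from the exercise: each site differs independently, the true distance to the sister species is 0.10 and to a non-sister species 0.13, and all function names are hypothetical.&lt;br /&gt;

```python
import random


def observed_differences(true_distance, n_sites, rng):
    """Number of differing sites if each site differs with probability true_distance."""
    weights = (true_distance, 1.0 - true_distance)
    return sum(rng.choices((1, 0), weights=weights, k=n_sites))


def misleading_fraction(n_sites, trials=200, seed=1):
    """Fraction of trials in which the non-sister species looks strictly closer."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        candidates = [
            ("sister", observed_differences(0.10, n_sites, rng)),
            ("non-sister", observed_differences(0.13, n_sites, rng)),
        ]
        # min() returns the first minimal item, so ties count for the sister.
        closest = min(candidates, key=lambda item: item[1])
        wrong += closest[0] == "non-sister"
    return wrong / trials


# Assumed alignment lengths, chosen only to span short to long genes.
error_rate = {n: misleading_fraction(n) for n in (50, 500, 5000)}
for n_sites, rate in error_rate.items():
    print(n_sites, rate)
```

With short alignments the non-sister species should look closer in a noticeable share of the trials, and that share should shrink towards zero as the number of sites grows.&lt;br /&gt;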
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine the two downloaded files. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
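&lt;br /&gt;
As an alternative to clicking through the web interface, the two location-specific searches can probably also be expressed as download URLs. The sketch below only builds the URLs and does not contact any server. It assumes the public UniProt REST search endpoint and its query, format and size parameters, none of which are mentioned in the exercise, and the helper name is hypothetical.&lt;br /&gt;

```python
from urllib.parse import urlencode

# Assumed endpoint of the public UniProt REST API (not named in the text above).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"

# Part of the query shared by both searches, copied from the search strings above.
COMMON = ('(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) '
          'AND (fragment:false) AND (reviewed:true)')

# Subcellular-location terms copied from the two search strings above.
LOCATIONS = {"mitochondrion": "SL-0173", "cytoplasm": "SL-0086"}


def fasta_url(location_term):
    """Build a search URL asking for the matching entries in FASTA format."""
    query = f"{COMMON} AND (cc_scl_term:{location_term})"
    return BASE_URL + "?" + urlencode({"query": query, "format": "fasta", "size": 500})


urls = {name: fasta_url(term) for name, term in LOCATIONS.items()}
for name, url in urls.items():
    print(name, url)
```

The result counts may differ from the 8 and 27 quoted above if the database has been updated since the answers were written.&lt;br /&gt;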
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures are made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted+annotated.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 13==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria arose only once in evolution.&lt;br /&gt;
# There are two differences. In the mitochondrial subtree, Bovine (cow) is the sister group to Human, while in the cytoplasmic subtree, Mouse+Rat form the sister group to Human+Macaque. Also, in the mitochondrial subtree, Yeast branches off before Arabidopsis on the way to Human, while in the cytoplasmic subtree, the plants including Arabidopsis branch off (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by a factor of approximately 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.rerooted%2Bannotated.png&amp;diff=795</id>
		<title>File:Ribosomal proteins 35-NJ tree.rerooted+annotated.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.rerooted%2Bannotated.png&amp;diff=795"/>
		<updated>2025-11-26T21:00:44Z</updated>

		<summary type="html">&lt;p&gt;Henni: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=794</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=794"/>
		<updated>2025-11-26T21:00:13Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 13 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: every distance involving HTLV is above 0.74, whereas no distance between any two of the other sequences exceeds 0.41.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which agrees with the large HTLV distances observed in Step 2.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the best option for coding DNA, since it gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) nucleotide differences. The trees below were constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one topological difference between the two trees: in the one made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the more correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; Including gap positions does not always give the better tree!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[File:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This effect tends to disappear as more sequence data is included in the analysis (the law of large numbers).&lt;br /&gt;
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine the two files. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures are made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted+annotated.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 14==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.&lt;br /&gt;
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=793</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=793"/>
		<updated>2025-11-26T20:59:13Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 12 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
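&lt;br /&gt;
The observation about HTLV can also be checked programmatically. The following is a minimal Python sketch, not part of Seaview: the function name &lt;tt&gt;smallest_distance_per_sequence&lt;/tt&gt; is made up for illustration, and the input is a short excerpt of the pairwise lines listed above. It reports, for each sequence, the distance to its closest neighbour.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def smallest_distance_per_sequence(lines):
    """Map each sequence name to its smallest distance to any other sequence."""
    per_seq = defaultdict(list)
    for line in lines:
        # Each line looks like "NAME_A,NAME_B: distance".
        pair, value = line.split(":")
        a, b = pair.split(",")
        d = float(value)
        per_seq[a.strip()].append(d)
        per_seq[b.strip()].append(d)
    return {name: min(ds) for name, ds in per_seq.items()}


# Excerpt of the "#pairwise distances" section (values copied from the file above).
excerpt = [
    "HIV1B5,HTLV: 0.750305",
    "HIV1H2,HTLV: 0.751523",
    "HTLV,SIVCZ: 0.747868",
    "HIV1B5,HIV1H2: 0.0158343",
    "HIV1B5,SIVCZ: 0.130329",
    "HIV1H2,SIVCZ: 0.129111",
]

closest = smallest_distance_per_sequence(excerpt)
for name, d in sorted(closest.items()):
    print(name, d)
```

In this excerpt, HTLV is the only sequence whose closest neighbour is still more than 0.7 away. Running the same function on all lines of the pairwise section should show the same pattern.&lt;br /&gt;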
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while still retaining the silent (synonymous) differences in the aligned DNA. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one topological difference between the two trees: in the one made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the more correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; Including gap positions does not always give the better tree!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[File:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This effect tends to disappear as more sequence data is included in the analysis (the law of large numbers).&lt;br /&gt;
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine the two files. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures are made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Step 13==&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 14==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.&lt;br /&gt;
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=792</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=792"/>
		<updated>2025-11-26T20:58:07Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 11: rerooting the tree in Seaview */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will analyze the evolutionary relationships between HIV-related viruses from humans and non-human primates.&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different Pol polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Let &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; be checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file. &lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
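&lt;br /&gt;
If you prefer to read the file with a script, the triangle layout described above can be turned into a lookup table. The following is a minimal Python sketch under stated assumptions: the function name &lt;tt&gt;parse_triangle&lt;/tt&gt; and the three-sequence example text are made up for illustration, and the layout (a comment line, the number of sequences, the triangle rows, then the names) is assumed to match the description above.&lt;br /&gt;
&lt;br /&gt;
```python
def parse_triangle(text):
    """Return {(name_a, name_b): distance} from the triangle part of the file."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#pairwise"):
            break  # the second, one-pair-per-line section is not needed here
        if line and not line.startswith("#"):
            lines.append(line)
    n = int(lines[0])  # number of sequences
    rows = [[float(x) for x in row.split()] for row in lines[1:n]]
    names = lines[n].split()  # the names follow the n-1 triangle rows
    dist = {}
    for i, row in enumerate(rows):
        # Row i holds the distances from sequence i to sequences i+1, i+2, ...
        for offset, d in enumerate(row, start=1):
            dist[(names[i], names[i + offset])] = d
    return dist


# Hypothetical three-sequence file, made up to show the layout.
example = """#distances order: d(1,2),...,d(1,n)
3
0.75 0.74
0.02
A B C
"""

table = parse_triangle(example)
print(table[("A", "B")], table[("A", "C")], table[("B", "C")])
```

The keys are ordered pairs, so each pair is stored once, with the sequence that comes first in the name list as the first element.&lt;br /&gt;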
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The reason why the trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees is that Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
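&lt;br /&gt;
The idea behind midpoint rooting can be sketched in a few lines of Python. This illustrates the general principle only and is not the code Seaview uses; the function name &lt;tt&gt;midpoint&lt;/tt&gt;, the three-tip toy tree, and its branch lengths are all made up.&lt;br /&gt;
&lt;br /&gt;
```python
def midpoint(edges):
    """Locate the midpoint root of an unrooted tree.

    edges: list of (node_u, node_v, branch_length).
    Returns (tip_a, tip_b, path_length, edge, offset): the two most distant
    tips, the length of the path between them, the edge that carries the
    root, and the distance of the root from the first node of that edge.
    """
    adj = {}
    length = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
        length[(u, v)] = w
        length[(v, u)] = w
    tips = [n for n in adj if len(adj[n]) == 1]

    def paths_from(start):
        # Walk the tree, recording the path and the distance to every node.
        found = {start: ([start], 0.0)}
        stack = [start]
        while stack:
            node = stack.pop()
            path, dist = found[node]
            for nxt, w in adj[node]:
                if nxt not in found:
                    found[nxt] = (path + [nxt], dist + w)
                    stack.append(nxt)
        return found

    # Find the pair of tips that are furthest away from each other.
    best = None
    for a in tips:
        reach = paths_from(a)
        for b in tips:
            if best is None or reach[b][1] > best[2]:
                best = (a, b, reach[b][1], reach[b][0])
    a, b, total, path = best

    # Walk along that path until half of its length has been covered.
    half, walked = total / 2, 0.0
    for u, v in zip(path, path[1:]):
        w = length[(u, v)]
        if walked + w >= half:
            return a, b, total, (u, v), half - walked
        walked += w


# Toy tree: tips A, B and C joined at the internal node X.
toy = [("A", "X", 0.25), ("B", "X", 0.5), ("C", "X", 1.5)]
result = midpoint(toy)
print(result)
```

In the toy tree, B and C are the most distant tips, so the root is placed on the long branch leading to C, at equal distance from B and C.&lt;br /&gt;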
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Now, go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout will remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so). The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which is very convenient for telling the difference between them. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
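&lt;br /&gt;
If you would rather check the names with a script than by eye, a minimal Python sketch could look like the one below. It assumes the usual UniProt FASTA header layout (database, accession, and entry name separated by vertical bars); the accessions and sequences in the example are placeholders, not real entries, and the function names are made up for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
ALLOWED_PREFIXES = ("RL3", "RK3", "RM03")


def entry_names(fasta_text):
    """Collect the entry name from every FASTA header line."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed header layout: >sp|ACCESSION|ENTRY_NAME description
            fields = line[1:].split("|")
            if len(fields) >= 3:
                names.append(fields[2].split()[0])
            else:
                # Fall back to the first word of a header in another layout.
                names.append(line[1:].split()[0])
    return names


def unexpected_names(fasta_text):
    """Return the entry names that do not start with an allowed prefix."""
    return [n for n in entry_names(fasta_text) if not n.startswith(ALLOWED_PREFIXES)]


# Placeholder records (hypothetical accessions and sequences).
example_fasta = """>sp|X00001|RL3_HUMAN placeholder description
MSHRK
>sp|X00002|RM03_HUMAN placeholder description
MPGWR
>sp|X00003|RL5_HUMAN placeholder description
MGFVK
"""
print(unexpected_names(example_fasta))
```
&lt;br /&gt;
An empty list means that every name starts with one of the three expected prefixes; anything else is a hint to revisit the search criteria.&lt;br /&gt;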
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. &amp;lt;!-- Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!).&lt;br /&gt;
# Now find a node whose descendants are either all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node, you can always click another). Make sure that all the cytoplasmic and all the mitochondrial sequences are in two separate subtrees.&lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.&lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 13: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (where are there most mutations per time unit) — among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.rerooted.png&amp;diff=791</id>
		<title>File:Ribosomal proteins 35-NJ tree.rerooted.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.rerooted.png&amp;diff=791"/>
		<updated>2025-11-26T20:57:02Z</updated>

		<summary type="html">&lt;p&gt;Henni: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=790</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=790"/>
		<updated>2025-11-26T20:56:23Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 11 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
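&lt;br /&gt;
The same conclusion can be reached without reading the table by eye. The following Python sketch is one possible way, assuming the file has the name-pair layout shown in the second half of the listing above; the function name is made up, and only three lines copied from the table are used as example input.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import defaultdict


def mean_distances(lines):
    """Return {name: mean distance to the other sequences} from 'A,B: d' lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        # Skip blank lines, comment lines, and the matrix part of the file.
        if not line or line.startswith("#") or ":" not in line:
            continue
        pair, value = line.split(":")
        first, second = pair.split(",")
        distance = float(value)
        for name in (first.strip(), second.strip()):
            totals[name] += distance
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Three lines copied from the table above.
example = [
    "HIV1B5,HTLV: 0.750305",
    "HIV1H2,HTLV: 0.751523",
    "HIV1B5,HIV1H2: 0.0158343",
]
means = mean_distances(example)
most_distant = max(means, key=means.get)
print(most_distant, round(means[most_distant], 6))
```
&lt;br /&gt;
Run on the complete list of pairs, the same function gives the mean distance for each of the 20 sequences.&lt;br /&gt;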
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while still retaining the silent (synonymous) nucleotide differences in the aligned DNA. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one difference in the tree topology between the two trees: In the one made without the gap positions, Rice is together with Fruit fly within the animal subtree, while in the other tree, Rice is together with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the most correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; This is not always the case!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is as we would expect based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally, a gene will be most similar to the gene from a non-sister species for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).&lt;br /&gt;
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine them. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
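&lt;br /&gt;
The two location-specific searches can also be written as download links. The Python sketch below only builds the URLs and does not contact any server. The endpoint address is an assumption about the UniProt REST interface and should be checked against the UniProt help pages before use; the query strings are the ones given in the answers above, and the function name is made up.&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST interface; verify before relying on it.
BASE_URL = "https://rest.uniprot.org/uniprotkb/stream"

# Part of the search string shared by both searches (taken from the answer above).
COMMON = ('(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) '
          'AND (fragment:false) AND (reviewed:true)')


def fasta_url(location_term):
    """Build a FASTA download URL for one subcellular location term."""
    query = COMMON + " AND (cc_scl_term:" + location_term + ")"
    return BASE_URL + "?" + urlencode({"query": query, "format": "fasta"})


mito_url = fasta_url("SL-0173")  # mitochondrion
cyto_url = fasta_url("SL-0086")  # cytoplasm
print(mito_url)
print(cyto_url)
```
&lt;br /&gt;
The numbers of hits will change as Swiss-Prot is updated, so the counts given above apply to the time of writing.&lt;br /&gt;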
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures were made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.&lt;br /&gt;
&lt;br /&gt;
==Step 13==&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 14==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.&lt;br /&gt;
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both aspects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=789</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=789"/>
		<updated>2025-11-26T20:55:24Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 11 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. HTLV is clearly far more distant from every other sequence than those sequences are from each other: all of its pairwise distances are above 0.7, whereas no other pair exceeds 0.41.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while the aligned DNA still retains the nucleotide-level differences, including synonymous (silent) ones. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one difference in topology between the two trees: In the one made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the more correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; This is not always the case!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is as we would expect based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are, which is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: Occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).&lt;br /&gt;
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine the two downloaded FASTA files. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures were made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.rerooted-Seaview.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which of the two placements is more correct.&lt;br /&gt;
&lt;br /&gt;
==Step 13==&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 14==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria arose only once in evolution.&lt;br /&gt;
# There are two differences: In the mitochondrial proteins, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat form the sister group to Human+Macaque. Also, in the mitochondrial proteins, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt&amp;diff=788</id>
		<title>File:Ribosomal proteins 35-NJ tree.unrooted.newick.txt</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt&amp;diff=788"/>
		<updated>2025-11-26T20:44:04Z</updated>

		<summary type="html">&lt;p&gt;Henni: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.unrooted.png&amp;diff=787</id>
		<title>File:Ribosomal proteins 35-NJ tree.unrooted.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Ribosomal_proteins_35-NJ_tree.unrooted.png&amp;diff=787"/>
		<updated>2025-11-26T20:43:03Z</updated>

		<summary type="html">&lt;p&gt;Henni: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=786</id>
		<title>Exercise: Phylogeny - Answers (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_-_Answers_(Seaview_version)&amp;diff=786"/>
		<updated>2025-11-26T20:42:30Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 10 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Step 1 ==&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.&lt;br /&gt;
&lt;br /&gt;
==Step 2==&lt;br /&gt;
This is the text file with the pairwise distances. HTLV is clearly far more distant from every other sequence than those sequences are from each other: all of its pairwise distances are above 0.7, whereas no other pair exceeds 0.41.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#distances order: d(1,2),...,d(1,n) &amp;lt;new line&amp;gt; d(2,3),...,d(2,n) &amp;lt;new line&amp;gt;...&lt;br /&gt;
20&lt;br /&gt;
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665 &lt;br /&gt;
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769 &lt;br /&gt;
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551 &lt;br /&gt;
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683 &lt;br /&gt;
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333 &lt;br /&gt;
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114 &lt;br /&gt;
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551 &lt;br /&gt;
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551 &lt;br /&gt;
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513 &lt;br /&gt;
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551 &lt;br /&gt;
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585 &lt;br /&gt;
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111 &lt;br /&gt;
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803 &lt;br /&gt;
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058 &lt;br /&gt;
0.0682095 0.0621194 0.386114 0.120585 0.118149 &lt;br /&gt;
0.0657734 0.389769 0.126675 0.123021 &lt;br /&gt;
0.394641 0.116931 0.115713 &lt;br /&gt;
0.388551 0.388551 &lt;br /&gt;
0.0146163 &lt;br /&gt;
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP &lt;br /&gt;
&lt;br /&gt;
#pairwise distances&lt;br /&gt;
HIV1B5,HTLV: 0.750305&lt;br /&gt;
HIV1H2,HTLV: 0.751523&lt;br /&gt;
HIV1MN,HTLV: 0.75&lt;br /&gt;
HIV1N5,HTLV: 0.752741&lt;br /&gt;
HIV1ND,HTLV: 0.752741&lt;br /&gt;
HIV1OY,HTLV: 0.752741&lt;br /&gt;
HIV1PV,HTLV: 0.750305&lt;br /&gt;
HIV1U4,HTLV: 0.750305&lt;br /&gt;
HIV1Z2,HTLV: 0.752741&lt;br /&gt;
HIV2CA,HTLV: 0.749086&lt;br /&gt;
HIV2D1,HTLV: 0.741778&lt;br /&gt;
HIV2G1,HTLV: 0.747868&lt;br /&gt;
HIV2KR,HTLV: 0.749086&lt;br /&gt;
HIV2RO,HTLV: 0.744214&lt;br /&gt;
HIV2SB,HTLV: 0.750305&lt;br /&gt;
HIV2ST,HTLV: 0.747868&lt;br /&gt;
HTLV,SIVCZ: 0.747868&lt;br /&gt;
HTLV,Smanga_S4: 0.747868&lt;br /&gt;
HTLV,Smanga_SP: 0.74665&lt;br /&gt;
HIV1B5,HIV1H2: 0.0158343&lt;br /&gt;
HIV1B5,HIV1MN: 0.0414634&lt;br /&gt;
HIV1B5,HIV1N5: 0.0304507&lt;br /&gt;
HIV1B5,HIV1ND: 0.043849&lt;br /&gt;
HIV1B5,HIV1OY: 0.0341048&lt;br /&gt;
HIV1B5,HIV1PV: 0.0170524&lt;br /&gt;
HIV1B5,HIV1U4: 0.0803898&lt;br /&gt;
HIV1B5,HIV1Z2: 0.045067&lt;br /&gt;
HIV1B5,HIV2CA: 0.399513&lt;br /&gt;
HIV1B5,HIV2D1: 0.399513&lt;br /&gt;
HIV1B5,HIV2G1: 0.389769&lt;br /&gt;
HIV1B5,HIV2KR: 0.393423&lt;br /&gt;
HIV1B5,HIV2RO: 0.394641&lt;br /&gt;
HIV1B5,HIV2SB: 0.389769&lt;br /&gt;
HIV1B5,HIV2ST: 0.394641&lt;br /&gt;
HIV1B5,SIVCZ: 0.130329&lt;br /&gt;
HIV1B5,Smanga_S4: 0.389769&lt;br /&gt;
HIV1B5,Smanga_SP: 0.389769&lt;br /&gt;
HIV1H2,HIV1MN: 0.0402439&lt;br /&gt;
HIV1H2,HIV1N5: 0.0292326&lt;br /&gt;
HIV1H2,HIV1ND: 0.0414129&lt;br /&gt;
HIV1H2,HIV1OY: 0.0328867&lt;br /&gt;
HIV1H2,HIV1PV: 0.00974421&lt;br /&gt;
HIV1H2,HIV1U4: 0.0803898&lt;br /&gt;
HIV1H2,HIV1Z2: 0.0426309&lt;br /&gt;
HIV1H2,HIV2CA: 0.399513&lt;br /&gt;
HIV1H2,HIV2D1: 0.401949&lt;br /&gt;
HIV1H2,HIV2G1: 0.392205&lt;br /&gt;
HIV1H2,HIV2KR: 0.393423&lt;br /&gt;
HIV1H2,HIV2RO: 0.394641&lt;br /&gt;
HIV1H2,HIV2SB: 0.389769&lt;br /&gt;
HIV1H2,HIV2ST: 0.394641&lt;br /&gt;
HIV1H2,SIVCZ: 0.129111&lt;br /&gt;
HIV1H2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1H2,Smanga_SP: 0.388551&lt;br /&gt;
HIV1MN,HIV1N5: 0.0365854&lt;br /&gt;
HIV1MN,HIV1ND: 0.0512195&lt;br /&gt;
HIV1MN,HIV1OY: 0.0365854&lt;br /&gt;
HIV1MN,HIV1PV: 0.0439024&lt;br /&gt;
HIV1MN,HIV1U4: 0.0865854&lt;br /&gt;
HIV1MN,HIV1Z2: 0.054878&lt;br /&gt;
HIV1MN,HIV2CA: 0.4&lt;br /&gt;
HIV1MN,HIV2D1: 0.40122&lt;br /&gt;
HIV1MN,HIV2G1: 0.396341&lt;br /&gt;
HIV1MN,HIV2KR: 0.392683&lt;br /&gt;
HIV1MN,HIV2RO: 0.395122&lt;br /&gt;
HIV1MN,HIV2SB: 0.392683&lt;br /&gt;
HIV1MN,HIV2ST: 0.397561&lt;br /&gt;
HIV1MN,SIVCZ: 0.130488&lt;br /&gt;
HIV1MN,Smanga_S4: 0.392683&lt;br /&gt;
HIV1MN,Smanga_SP: 0.392683&lt;br /&gt;
HIV1N5,HIV1ND: 0.0341048&lt;br /&gt;
HIV1N5,HIV1OY: 0.0304507&lt;br /&gt;
HIV1N5,HIV1PV: 0.0316687&lt;br /&gt;
HIV1N5,HIV1U4: 0.0791717&lt;br /&gt;
HIV1N5,HIV1Z2: 0.0389769&lt;br /&gt;
HIV1N5,HIV2CA: 0.397077&lt;br /&gt;
HIV1N5,HIV2D1: 0.399513&lt;br /&gt;
HIV1N5,HIV2G1: 0.389769&lt;br /&gt;
HIV1N5,HIV2KR: 0.390987&lt;br /&gt;
HIV1N5,HIV2RO: 0.392205&lt;br /&gt;
HIV1N5,HIV2SB: 0.389769&lt;br /&gt;
HIV1N5,HIV2ST: 0.392205&lt;br /&gt;
HIV1N5,SIVCZ: 0.127893&lt;br /&gt;
HIV1N5,Smanga_S4: 0.387333&lt;br /&gt;
HIV1N5,Smanga_SP: 0.387333&lt;br /&gt;
HIV1ND,HIV1OY: 0.043849&lt;br /&gt;
HIV1ND,HIV1PV: 0.043849&lt;br /&gt;
HIV1ND,HIV1U4: 0.0767357&lt;br /&gt;
HIV1ND,HIV1Z2: 0.0219245&lt;br /&gt;
HIV1ND,HIV2CA: 0.390987&lt;br /&gt;
HIV1ND,HIV2D1: 0.394641&lt;br /&gt;
HIV1ND,HIV2G1: 0.386114&lt;br /&gt;
HIV1ND,HIV2KR: 0.386114&lt;br /&gt;
HIV1ND,HIV2RO: 0.388551&lt;br /&gt;
HIV1ND,HIV2SB: 0.387333&lt;br /&gt;
HIV1ND,HIV2ST: 0.389769&lt;br /&gt;
HIV1ND,SIVCZ: 0.125457&lt;br /&gt;
HIV1ND,Smanga_S4: 0.386114&lt;br /&gt;
HIV1ND,Smanga_SP: 0.386114&lt;br /&gt;
HIV1OY,HIV1PV: 0.0365408&lt;br /&gt;
HIV1OY,HIV1U4: 0.0767357&lt;br /&gt;
HIV1OY,HIV1Z2: 0.047503&lt;br /&gt;
HIV1OY,HIV2CA: 0.394641&lt;br /&gt;
HIV1OY,HIV2D1: 0.397077&lt;br /&gt;
HIV1OY,HIV2G1: 0.388551&lt;br /&gt;
HIV1OY,HIV2KR: 0.388551&lt;br /&gt;
HIV1OY,HIV2RO: 0.389769&lt;br /&gt;
HIV1OY,HIV2SB: 0.386114&lt;br /&gt;
HIV1OY,HIV2ST: 0.390987&lt;br /&gt;
HIV1OY,SIVCZ: 0.131547&lt;br /&gt;
HIV1OY,Smanga_S4: 0.388551&lt;br /&gt;
HIV1OY,Smanga_SP: 0.388551&lt;br /&gt;
HIV1PV,HIV1U4: 0.0828258&lt;br /&gt;
HIV1PV,HIV1Z2: 0.045067&lt;br /&gt;
HIV1PV,HIV2CA: 0.401949&lt;br /&gt;
HIV1PV,HIV2D1: 0.404385&lt;br /&gt;
HIV1PV,HIV2G1: 0.394641&lt;br /&gt;
HIV1PV,HIV2KR: 0.394641&lt;br /&gt;
HIV1PV,HIV2RO: 0.397077&lt;br /&gt;
HIV1PV,HIV2SB: 0.390987&lt;br /&gt;
HIV1PV,HIV2ST: 0.393423&lt;br /&gt;
HIV1PV,SIVCZ: 0.130329&lt;br /&gt;
HIV1PV,Smanga_S4: 0.388551&lt;br /&gt;
HIV1PV,Smanga_SP: 0.388551&lt;br /&gt;
HIV1U4,HIV1Z2: 0.0767357&lt;br /&gt;
HIV1U4,HIV2CA: 0.398295&lt;br /&gt;
HIV1U4,HIV2D1: 0.403167&lt;br /&gt;
HIV1U4,HIV2G1: 0.392205&lt;br /&gt;
HIV1U4,HIV2KR: 0.395859&lt;br /&gt;
HIV1U4,HIV2RO: 0.394641&lt;br /&gt;
HIV1U4,HIV2SB: 0.394641&lt;br /&gt;
HIV1U4,HIV2ST: 0.397077&lt;br /&gt;
HIV1U4,SIVCZ: 0.137637&lt;br /&gt;
HIV1U4,Smanga_S4: 0.400731&lt;br /&gt;
HIV1U4,Smanga_SP: 0.399513&lt;br /&gt;
HIV1Z2,HIV2CA: 0.393423&lt;br /&gt;
HIV1Z2,HIV2D1: 0.397077&lt;br /&gt;
HIV1Z2,HIV2G1: 0.387333&lt;br /&gt;
HIV1Z2,HIV2KR: 0.388551&lt;br /&gt;
HIV1Z2,HIV2RO: 0.389769&lt;br /&gt;
HIV1Z2,HIV2SB: 0.389769&lt;br /&gt;
HIV1Z2,HIV2ST: 0.389769&lt;br /&gt;
HIV1Z2,SIVCZ: 0.125457&lt;br /&gt;
HIV1Z2,Smanga_S4: 0.388551&lt;br /&gt;
HIV1Z2,Smanga_SP: 0.388551&lt;br /&gt;
HIV2CA,HIV2D1: 0.0816078&lt;br /&gt;
HIV2CA,HIV2G1: 0.0694275&lt;br /&gt;
HIV2CA,HIV2KR: 0.0645554&lt;br /&gt;
HIV2CA,HIV2RO: 0.0511571&lt;br /&gt;
HIV2CA,HIV2SB: 0.0682095&lt;br /&gt;
HIV2CA,HIV2ST: 0.0657734&lt;br /&gt;
HIV2CA,SIVCZ: 0.392205&lt;br /&gt;
HIV2CA,Smanga_S4: 0.125457&lt;br /&gt;
HIV2CA,Smanga_SP: 0.120585&lt;br /&gt;
HIV2D1,HIV2G1: 0.0511571&lt;br /&gt;
HIV2D1,HIV2KR: 0.0840438&lt;br /&gt;
HIV2D1,HIV2RO: 0.088916&lt;br /&gt;
HIV2D1,HIV2SB: 0.09257&lt;br /&gt;
HIV2D1,HIV2ST: 0.0864799&lt;br /&gt;
HIV2D1,SIVCZ: 0.397077&lt;br /&gt;
HIV2D1,Smanga_S4: 0.131547&lt;br /&gt;
HIV2D1,Smanga_SP: 0.129111&lt;br /&gt;
HIV2G1,HIV2KR: 0.0779537&lt;br /&gt;
HIV2G1,HIV2RO: 0.0730816&lt;br /&gt;
HIV2G1,HIV2SB: 0.0791717&lt;br /&gt;
HIV2G1,HIV2ST: 0.0767357&lt;br /&gt;
HIV2G1,SIVCZ: 0.394641&lt;br /&gt;
HIV2G1,Smanga_S4: 0.127893&lt;br /&gt;
HIV2G1,Smanga_SP: 0.121803&lt;br /&gt;
HIV2KR,HIV2RO: 0.0645554&lt;br /&gt;
HIV2KR,HIV2SB: 0.0633374&lt;br /&gt;
HIV2KR,HIV2ST: 0.0572473&lt;br /&gt;
HIV2KR,SIVCZ: 0.392205&lt;br /&gt;
HIV2KR,Smanga_S4: 0.118149&lt;br /&gt;
HIV2KR,Smanga_SP: 0.112058&lt;br /&gt;
HIV2RO,HIV2SB: 0.0682095&lt;br /&gt;
HIV2RO,HIV2ST: 0.0621194&lt;br /&gt;
HIV2RO,SIVCZ: 0.386114&lt;br /&gt;
HIV2RO,Smanga_S4: 0.120585&lt;br /&gt;
HIV2RO,Smanga_SP: 0.118149&lt;br /&gt;
HIV2SB,HIV2ST: 0.0657734&lt;br /&gt;
HIV2SB,SIVCZ: 0.389769&lt;br /&gt;
HIV2SB,Smanga_S4: 0.126675&lt;br /&gt;
HIV2SB,Smanga_SP: 0.123021&lt;br /&gt;
HIV2ST,SIVCZ: 0.394641&lt;br /&gt;
HIV2ST,Smanga_S4: 0.116931&lt;br /&gt;
HIV2ST,Smanga_SP: 0.115713&lt;br /&gt;
SIVCZ,Smanga_S4: 0.388551&lt;br /&gt;
SIVCZ,Smanga_SP: 0.388551&lt;br /&gt;
Smanga_S4,Smanga_SP: 0.0146163&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Step 3==&lt;br /&gt;
Here is a picture of the NJ tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.png]]&lt;br /&gt;
&lt;br /&gt;
The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.&lt;br /&gt;
&lt;br /&gt;
==Step 4==&lt;br /&gt;
Here is an unrooted tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 5==&lt;br /&gt;
Here is a rearranged (swapped) tree:&lt;br /&gt;
&lt;br /&gt;
[[File:Pol21-NJ_tree.swapped.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 6==&lt;br /&gt;
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).&lt;br /&gt;
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).&lt;br /&gt;
* Further answers to &amp;quot;The Phylogeny of HIV&amp;quot; can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].&lt;br /&gt;
&lt;br /&gt;
==Step 7==&lt;br /&gt;
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the &amp;quot;best of both worlds&amp;quot;: it takes amino acid similarities into account when aligning, while the aligned DNA still retains the nucleotide-level differences, including synonymous (silent) ones. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.&lt;br /&gt;
&lt;br /&gt;
Here is the tree made ignoring gap positions: &lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]&lt;br /&gt;
&lt;br /&gt;
And here is the tree made taking gap positions into account:&lt;br /&gt;
&lt;br /&gt;
[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]&lt;br /&gt;
&lt;br /&gt;
There is one difference in the tree topology between the two trees: In the one made without the gap positions, Rice is together with Fruit fly within the animal subtree, while in the other tree, Rice is together with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the most correct one. &#039;&#039;&#039;Note:&#039;&#039;&#039; This is not always the case!&lt;br /&gt;
&lt;br /&gt;
==Step 8==&lt;br /&gt;
On the whole, the structure of this tree is as we would expect, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group &#039;&#039;Tetrapoda&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:salmon_frog.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy&#039;s &amp;quot;Common Tree&amp;quot; function (see illustration below). &lt;br /&gt;
&lt;br /&gt;
First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group &#039;&#039;Euarchontoglires&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group &#039;&#039;Opisthokonta&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
[[file:L18_Common_Taxonomy_Tree.png|center|frame]]&lt;br /&gt;
&lt;br /&gt;
A phylogeny based on a single gene often differs from the real phylogeny of the species. There are a number of reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data are included in the analysis (the law of large numbers).&lt;br /&gt;
&lt;br /&gt;
==Step 9==&lt;br /&gt;
# 54 results. &amp;lt;br&amp;gt;Search string: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)&amp;lt;/tt&amp;gt;&lt;br /&gt;
# 8 and 27 results, respectively. &amp;lt;br&amp;gt;Search strings: &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:mitochondrion) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;and &amp;lt;!-- &amp;lt;tt&amp;gt;name:&amp;quot;ribosomal protein l3&amp;quot; taxonomy:eukaryota fragment:no locations:(location:cytoplasm) AND reviewed:yes&amp;lt;/tt&amp;gt; --&amp;gt; &amp;lt;tt&amp;gt;(protein_name:&amp;quot;ribosomal protein l3&amp;quot;) AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)&amp;lt;/tt&amp;gt; &amp;lt;br&amp;gt;Under the &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; tab in UniProt, select &amp;quot;Download all&amp;quot;, &amp;quot;FASTA (canonical)&amp;quot; and &amp;quot;Uncompressed&amp;quot;.&lt;br /&gt;
# Then use a plain text editor to combine the two files. The combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_35.fasta.txt Ribosomal_proteins_35.fasta.txt]&lt;br /&gt;
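&lt;br /&gt;
The same search can in principle be run without the web interface. The Python sketch below only builds a request URL for the UniProt REST service at rest.uniprot.org, using the first search string above. It is an illustration rather than part of the exercise: the helper name uniprot_url is our own, and nothing is downloaded unless you open the printed URL yourself.&lt;br /&gt;

```python
from urllib.parse import urlencode

# Assumed endpoint: the "stream" resource of the UniProtKB REST service,
# which returns all hits for a query in one response.
BASE = "https://rest.uniprot.org/uniprotkb/stream"

# The first search string from the answer above.
QUERY = (
    '(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) '
    "AND (fragment:false) AND (reviewed:true)"
)


def uniprot_url(query, fmt="fasta"):
    """Return a URL that asks UniProtKB for all hits of query in the given format."""
    # urlencode takes care of escaping spaces, quotes and parentheses.
    return BASE + "?" + urlencode({"query": query, "format": fmt})


url = uniprot_url(QUERY)
print(url)
```

Adding the subcellular location terms from the second answer to the query string gives the corresponding URLs for the mitochondrial and cytoplasmic sets.&lt;br /&gt;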
&lt;br /&gt;
==Step 10==&lt;br /&gt;
Open the FASTA file with the 35 ribosomal protein sequences in Seaview, make sure &amp;lt;u&amp;gt;Alignment options&amp;lt;/u&amp;gt; is set to &amp;quot;clustalo&amp;quot;, and align all sequences. Then make an NJ tree (with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; unchecked) and change the view to &amp;quot;&amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;&amp;quot;. &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; the following pictures are made last year, when the number of cytoplasmic+mitochondrial sequences was 34, not 35. The rabbit has been added since then, but the general picture is the same.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Here is the result:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_35-NJ_tree.unrooted.png]]&lt;br /&gt;
&lt;br /&gt;
And [[Media:Ribosomal_proteins_35-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.&lt;br /&gt;
&lt;br /&gt;
== Step 11 ==&lt;br /&gt;
Here is the rerooted tree made by Seaview:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 12==&lt;br /&gt;
Here is the rerooted tree made by iTOL:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.&lt;br /&gt;
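&lt;br /&gt;
The effect of the two root placements can be illustrated with a small calculation. This is only a sketch under simplifying assumptions and not the actual algorithm of either program: the selected branch has a given length, and depth_a and depth_b (hypothetical names) are the largest distances from the two ends of that branch to the tips behind them. The function balanced_point is our own stand-in for a placement near the midpoint of the whole tree.&lt;br /&gt;

```python
def branch_midpoint(length):
    """Root in the middle of the selected branch (the iTOL behaviour described above)."""
    return length / 2


def balanced_point(length, depth_a, depth_b):
    """Root position, measured from end A, that puts the deepest tips on both sides level.

    Solves depth_a + x = depth_b + (length - x) and keeps x on the branch.
    """
    x = (length + depth_b - depth_a) / 2
    return min(max(x, 0.0), length)


# Hypothetical numbers: side A (think: mitochondrial) has deeper tips than side B.
length, depth_a, depth_b = 1.0, 0.75, 0.25

mid = branch_midpoint(length)
bal = balanced_point(length, depth_a, depth_b)

# Root-to-tip distances on each side for the two placements.
print("branch midpoint:", mid + depth_a, (length - mid) + depth_b)
print("balanced point: ", bal + depth_a, (length - bal) + depth_b)
```

With the root at the branch midpoint, the deepest tips on side A end up further from the root than those on side B (1.25 versus 0.75). With the balanced placement, both sides reach the same distance (1.0), which corresponds to the aligned tips seen in the Seaview tree.&lt;br /&gt;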
&lt;br /&gt;
==Step 13==&lt;br /&gt;
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:&lt;br /&gt;
&lt;br /&gt;
[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]&lt;br /&gt;
&lt;br /&gt;
==Step 14==&lt;br /&gt;
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.&lt;br /&gt;
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both aspects, the cytoplasmic tree is more correct.&lt;br /&gt;
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=22111:Course_plan_autumn_2025&amp;diff=785</id>
		<title>22111:Course plan autumn 2025</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=22111:Course_plan_autumn_2025&amp;diff=785"/>
		<updated>2025-11-26T08:58:14Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Tuesday Dec 2 — Bioinformatics in practice + old exam questions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== General information ==&lt;br /&gt;
&lt;br /&gt;
=== Where and when ===&lt;br /&gt;
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting &#039;&#039;&#039;Tuesday Sep 2 at 13:00&#039;&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
Lectures will be from 13:00 to approx. 14:00 in &#039;&#039;&#039;Aud. 54, building 208&#039;&#039;&#039;, and the exercises will then take place in &#039;&#039;&#039;the group rooms ALC1 (001), ALC2 (011), ALC4 (012), and &amp;quot;Touch Down&amp;quot; (024), also in building 208&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
=== Teachers ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=257116&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] &amp;amp;mdash; Associate professor, course responsible.&lt;br /&gt;
* [https://www.dtu.dk/Person/cwis?id=142840&amp;amp;entity=profile Carolina Barra Quaglia] &amp;amp;mdash; Associate professor, course responsible.&lt;br /&gt;
* [https://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] &amp;amp;mdash; Affiliated professor.&lt;br /&gt;
&amp;lt;!-- * [http://www.dtu.dk/service/telefonbog/person?id=5118&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Anders Gorm Pedersen] &amp;amp;mdash; Professor, guest lecturer. Topic: Phylogenetic trees. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Teaching assistants ===&lt;br /&gt;
&lt;br /&gt;
* [https://www.dtu.dk/person/mads-vodder-hartmann?id=137701&amp;amp;entity=profile Mads Vodder Hartmann] &amp;amp;mdash; PhD student&lt;br /&gt;
* [https://www.dtu.dk/person/david-lokjaer-faurdal?id=98246&amp;amp;entity=profile David Lokjær Faurdal] &amp;amp;mdash; PhD student&lt;br /&gt;
&lt;br /&gt;
=== Course content ===&lt;br /&gt;
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.&lt;br /&gt;
&lt;br /&gt;
See also [http://kurser.dtu.dk/course/22111 the course base about 22111].&lt;br /&gt;
&lt;br /&gt;
=== Curriculum ===&lt;br /&gt;
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Learn. Please note that &#039;&#039;all&#039;&#039; exercise guides are mandatory curriculum — including the &#039;&#039;answers&#039;&#039; to the exercises which will be made available on DTU Learn after each exercise.&lt;br /&gt;
&lt;br /&gt;
=== Computers ===&lt;br /&gt;
====Hardware====&lt;br /&gt;
&#039;&#039;&#039;You must bring your own laptop&#039;&#039;&#039; to the exercises, and it must be able to connect to DTU&#039;s wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough. A Chromebook will also not be enough (unless you have succeeded in installing a Linux distribution on it, but in that case we assume you know what you&#039;re doing). &lt;br /&gt;
&lt;br /&gt;
In some of the exercises (&amp;quot;PDB/PyMOL&amp;quot;, &amp;quot;Malaria vaccine&amp;quot;, and &amp;quot;Old exam questions&amp;quot;), you will work with the molecular visualization program PyMOL. This is rather difficult to control with a touchpad, so please remember to &#039;&#039;&#039;bring a mouse&#039;&#039;&#039;. The mouse should have two buttons plus a scroll wheel. &lt;br /&gt;
&lt;br /&gt;
====Software====&lt;br /&gt;
# Most importantly: an updated &#039;&#039;&#039;internet browser&#039;&#039;&#039; (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], [https://www.microsoft.com/edge Edge], or Safari for Mac only). &#039;&#039;&#039;NB:&#039;&#039;&#039; You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.&lt;br /&gt;
# A plain text editor for working with, e.g., sequence files. We recommend &#039;&#039;&#039;Geany&#039;&#039;&#039;, which you can download for free from https://geany.org/. You will find some tips and installation instructions in [[Plain_text_files_and_Geany|the first exercise]].&lt;br /&gt;
Other software will be installed during the exercises.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; Earlier versions of the course used some Java-based software, but in our experience newer Macs (with M-series CPUs, a.k.a. ARM chips) often have problems with Java. Therefore, we have replaced these programs with other options: &lt;br /&gt;
* [https://geany.org/ Geany] has replaced [http://jedit.org/ jEdit], see the exercise in [[Plain_text_files_and_Geany|plain text files]].&lt;br /&gt;
* [https://doua.prabi.fr/software/seaview SeaView] has replaced [https://www.jalview.org/ Jalview], see the exercise in [[Exercise:_Multiple_Alignments_(Seaview_version)|multiple alignments]].&lt;br /&gt;
* [https://doua.prabi.fr/software/seaview SeaView] (and to some degree the website [https://itol.embl.de/ iTOL]) has also replaced the software [https://github.com/rambaut/figtree/releases FigTree], see the exercise in [[Exercise: Phylogeny|phylogenetic trees]].&lt;br /&gt;
Be aware that if you are working on old exam sets, they may refer to the old software.&lt;br /&gt;
&lt;br /&gt;
=== Hand-ins ===&lt;br /&gt;
As preparation for the computer-based exam, each participant or group must write a &amp;quot;&#039;&#039;&#039;logbook&#039;&#039;&#039;&amp;quot; with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Learn.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NB:&#039;&#039;&#039; All hand-ins are by definition group hand-ins. If you work alone, you must form a &amp;quot;group&amp;quot; of one person. &lt;br /&gt;
&amp;lt;!-- It is possible to hand in as a group. We would &#039;&#039;much&#039;&#039; rather receive one group hand-in than a number of identical logbooks. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac, [https://docs.google.com/ Google Docs] or similar. You should be able to insert &#039;&#039;&#039;screenshots&#039;&#039;&#039; in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. Both Windows 10/11 and Mac OS also have dedicated screenshot tools.&lt;br /&gt;
&amp;lt;!-- For Windows users, however, we recommend the free program [http://getgreenshot.org/ Greenshot] which can not only take screenshots and copy them to the clipboard, but also make simple edits and annotations in the screenshots. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Regardless of your choice of writing software, the result &#039;&#039;&#039;must be handed in as a PDF file&#039;&#039;&#039;. LibreOffice and Google Docs can make PDFs directly. MacOS and Windows 10/11 have built-in functions for converting any printable file to PDF. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Please do &#039;&#039;not&#039;&#039; copy the questions&#039;&#039;&#039; from the exercise guide to your logbook. The hand-in module on DTU Learn has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.&lt;br /&gt;
&lt;br /&gt;
In case you don&#039;t finish the exercises Tuesday afternoon, there is still a chance to hand in — the deadline for handing in at Learn is &#039;&#039;&#039;Thursday at 13:00&#039;&#039;&#039; each week. The &#039;&#039;&#039;answers to the exercises&#039;&#039;&#039; will become visible at the same moment (Thursday at 13:00). You should read the answers carefully and compare with your own answers.&lt;br /&gt;
&lt;br /&gt;
We do not offer individual feedback on the hand-ins, but we will give collective feedback before the lecture the following Tuesday, where we address any common mistakes.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;NB:&#039;&#039;&#039; &#039;&#039;The hand-ins do not affect your grade&#039;&#039; — they are mainly meant as a preparation for the exam. They are also a means for us to check the understanding of the teaching; if we can see that many participants have made the same mistake, we will try to explain the issue better at the beginning of the next lecture.&lt;br /&gt;
&lt;br /&gt;
=== Exam ===&lt;br /&gt;
The 22111 exam is electronic; i.e. you must bring your own computer, and you will &#039;&#039;not&#039;&#039; get a paper copy of the questions. &lt;br /&gt;
&lt;br /&gt;
This year, the exam questions will be Multiple Choice.&lt;br /&gt;
&amp;lt;!-- The questions will be made available as a PDF file on the DTU online exam system. &#039;&#039;&#039;The only accepted hand-in format is PDF&#039;&#039;&#039;. --&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
All aids are allowed at the exam; you can bring any books, papers or notes. You will have &#039;&#039;&#039;open access to the internet&#039;&#039;&#039; which includes all the materials and websites we have used during the course. You are also allowed to search information on Google, Wikipedia, etc., but you are &#039;&#039;not&#039;&#039; allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are kept.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Just like in the weekly hand-ins, we kindly ask you: &#039;&#039;Please don&#039;t copy the questions in your answer document&#039;&#039; — that might result in the answer being flagged as plagiarism.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== DTU Learn &amp;amp; Inside ===&lt;br /&gt;
* Link to this year&#039;s DTU Learn page: https://learn.inside.dtu.dk/d2l/home/271404 &lt;br /&gt;
* Link to this year&#039;s Campusnet group: https://campusnet.dtu.dk/cnnet/element/805277&lt;br /&gt;
&lt;br /&gt;
=== Evaluation and feedback ===&lt;br /&gt;
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can:&lt;br /&gt;
* send them by email to the teachers, or &lt;br /&gt;
* write them under &amp;quot;General feedback&amp;quot; in &amp;quot;Discussion&amp;quot; in DTU Learn.&lt;br /&gt;
If somebody writes a message in &amp;quot;Discussion&amp;quot;, you can comment on it. If you see a message you agree with, please comment &amp;quot;Agree!&amp;quot; so that we can see that it is not just one person&#039;s opinion. &lt;br /&gt;
&lt;br /&gt;
In addition, we will conduct a mid-term evaluation in [https://evaluering.dtu.dk/ DTU evaluation].&lt;br /&gt;
&lt;br /&gt;
== Lecture &amp;amp; exercise plan ==&lt;br /&gt;
&lt;br /&gt;
Note: This is a &#039;&#039;preliminary&#039;&#039; plan; changes may occur!&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Sep 2 — Introduction &amp;amp; taxonomy ===&lt;br /&gt;
:&#039;&#039;&#039;Lectures:&#039;&#039;&#039;&lt;br /&gt;
:* &#039;&#039;Introduction to the course, bioinformatics, and computers&#039;&#039; — Henrik Nielsen.&lt;br /&gt;
:* &#039;&#039;Test of prior knowledge&#039;&#039; — a Vevox session.&lt;br /&gt;
:* &#039;&#039;Evolution and taxonomy&#039;&#039; — Rasmus Wernersson.&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; will be made available on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; [https://teaching.healthtech.dtu.dk/material/22111/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.&lt;br /&gt;
&amp;lt;!-- :&#039;&#039;&#039;Test of prior knowledge:&#039;&#039;&#039; Go to  https://evaluering.dtu.dk/, click &amp;quot;Test of prior knowledge&amp;quot; under 22111, and fill out the form (it&#039;s anonymous). Spend max. 10 minutes on it. --&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Exercises:&#039;&#039;&#039;&lt;br /&gt;
:# [[Plain text files and Geany]] &lt;br /&gt;
:# [[Taxonomy databases]] &lt;br /&gt;
:&#039;&#039;&#039;Extra material&#039;&#039;&#039; &lt;br /&gt;
:*&amp;quot;[https://teaching.healthtech.dtu.dk/material/22111/ELS_bioinformatics.pdf Bioinformatics]&amp;quot; — Encyclopedia entry from 2009.&lt;br /&gt;
:*&amp;quot;[https://doi.org/10.1093/nar/gkae979 Database resources of the National Center for Biotechnology Information in 2025]&amp;quot; — article from the annual database issue of Nucleic Acids Research, 2025&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Sep 9 — GenBank ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;DNA as Biological Information&#039;&#039; — Rasmus Wernersson&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; [https://teaching.healthtech.dtu.dk/material/22111/DNA_SequencingTutorial.pdf DNA sequencing tutorial] — source: IDT Tech Vault&lt;br /&gt;
:&#039;&#039;&#039;Handout&#039;&#039;&#039; for the lecture: [https://teaching.healthtech.dtu.dk/material/22111/HandoutEx_BaseCalling_Simple.pdf &amp;quot;Base-calling&amp;quot; exercise (for printing)] [PDF] / [https://teaching.healthtech.dtu.dk/material/22111/BaseCalling_on_screen_version.pdf &amp;quot;Base-calling&amp;quot; exercise (version for on-screen viewing)] [PDF].&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
:&#039;&#039;&#039;Test of prior knowledge:&#039;&#039;&#039; Go to  https://evaluering.dtu.dk/, click &amp;quot;Test of prior knowledge&amp;quot; under 22111, and fill out the form (it&#039;s anonymous). Spend max. 10 minutes on it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[ExGenbank-new|Using the GenBank database]] &lt;br /&gt;
:&#039;&#039;&#039;Reference material&#039;&#039;&#039; for the exercise: [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf GenBank + FASTA format] [PDF] &lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Background material&#039;&#039;&#039; (supposedly known): &lt;br /&gt;
:*[[File:Phone_34.gif]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).&lt;br /&gt;
:*[https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Overview of eukaryotic gene structure] (PDF).&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Extra material:&#039;&#039;&#039; &lt;br /&gt;
:*[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)&lt;br /&gt;
:*[https://doi.org/10.1093/nar/gkae1114 &amp;quot;GenBank 2025 update&amp;quot;] — article from the annual database issue of Nucleic Acids Research, 2025.&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Sep 16 — Translation &amp;amp; UniProt ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Protein databases&#039;&#039; — Henrik Nielsen&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; [https://teaching.healthtech.dtu.dk/material/22111/VirtualRibosome.pdf Virtual Ribosome] — software article (PDF).&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Exercises:&#039;&#039;&#039; &lt;br /&gt;
:#[[Exercise: Translation - Virtual Ribosome]] &lt;br /&gt;
:#[[Exercise: The protein database UniProt]] &lt;br /&gt;
:&#039;&#039;&#039;Background material&#039;&#039;&#039; (supposedly known): &lt;br /&gt;
:*[https://teaching.healthtech.dtu.dk/material/22111/PDF/protein_handout.pdf Levels of protein structure] [PDF]&lt;br /&gt;
:*[https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Overview of eukaryotic gene structure] (PDF).&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Extra material:&#039;&#039;&#039; &lt;br /&gt;
:*[https://doi.org/10.1093/nar/gkae1010 &amp;quot;UniProt: the Universal Protein Knowledgebase in 2025&amp;quot;] — article from the annual database issue of Nucleic Acids Research, 2025.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
:*[https://teaching.healthtech.dtu.dk/material/22111/uniprotkb_quickguide.pdf &amp;quot;A Quick Guide to UniProtKB&amp;quot;] — nice printable overview.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Sep 23 — Pairwise alignment ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Pairwise alignment&#039;&#039; — Henrik Nielsen.&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Learn → General information and files → Textbook excerpt).&lt;br /&gt;
:&#039;&#039;&#039;Handout&#039;&#039;&#039; for the lecture: [https://teaching.healthtech.dtu.dk/material/22111/New_handout_alignscores.pdf Alignment scores]&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[ExPairwiseAlignment|Pairwise alignment]]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Sep 30 — BLAST ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Introduction to BLAST&#039;&#039; — Carolina Barra Quaglia.&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Learn). &lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise: [[Exercise:_BLAST3|BLAST]]&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
:&#039;&#039;&#039;Extra material:&#039;&#039;&#039; &lt;br /&gt;
::[[File:Phone_34.gif‎]] &#039;&#039;&#039;Videos about BLAST from NCBI:&#039;&#039;&#039; (Video introduction to NCBI&#039;s web interface and Expect Values)  [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI&#039;s YouTube channel]&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Oct 7 — Protein structure, PDB &amp;amp; PyMOL ===&lt;br /&gt;
:&#039;&#039;&#039;Remember to bring a mouse for this day&#039;s exercise.&#039;&#039;&#039; The mouse should have two buttons and a scroll wheel.&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Protein 3D structure&#039;&#039; — Carolina Barra Quaglia&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Link to advanced course:&#039;&#039;&#039; &lt;br /&gt;
::* [http://kurser.dtu.dk/course/22117 22117 Protein Structure and Computational Biology]&lt;br /&gt;
&lt;br /&gt;
:&#039;&#039;&#039;Software&#039;&#039;&#039; for installation: [https://pymol.org/ PyMOL] (choose the newest version)&lt;br /&gt;
::&#039;&#039;&#039;Note:&#039;&#039;&#039; you will need the license file found at DTU Learn under this week&#039;s topic. The license is valid for a limited time. If you need PyMOL for educational purposes later in your studies, you can go to https://pymol.org/edu/index.php and register as a student to get your own license file (and if you don&#039;t receive an email after registering, write to help@schrodinger.com). However, if you need PyMOL to make figures for a scientific publication, you will have to pay for a license.&lt;br /&gt;
:&#039;&#039;&#039;Exercises:&#039;&#039;&#039; &lt;br /&gt;
:#[[Media:PyMOL_tutorial.pdf|PyMOL tutorial]] (PDF): basic usage of PyMOL.&lt;br /&gt;
:#[https://teaching.healthtech.dtu.dk/22111/index.php/Protein_Structure Protein Structure exercise]&lt;br /&gt;
:&#039;&#039;&#039;Extra material:&#039;&#039;&#039; &lt;br /&gt;
:*[https://doi.org/10.1093/nar/gkae1091 &amp;quot;Updated resources for exploring experimentally-determined PDB structures and Computed Structure Models at the RCSB Protein Data Bank&amp;quot;] — article from the annual database issue of Nucleic Acids Research, 2025.&lt;br /&gt;
:*[[PyMOL]] — some tips and tricks.&lt;br /&gt;
:*[https://teaching.healthtech.dtu.dk/material/22111/PDF/PyMOL_structure_navigation.pdf PyMOL basics — a small example] (optional extra exercise)&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&amp;lt;div align=&amp;quot;center&amp;quot;&amp;gt;&lt;br /&gt;
 &#039;&#039;&#039;Autumn holiday&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Oct 21 — Sequence information &amp;amp; logo-plots ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Sequence information &amp;amp; logo-plots&#039;&#039; — Rasmus Wernersson&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; &lt;br /&gt;
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Learn). &lt;br /&gt;
:# Pages 1-9 of &amp;quot;&#039;&#039;Information theory primer&#039;&#039;&amp;quot; ([https://teaching.healthtech.dtu.dk/material/22111/PDF/primer-2.72.pdf PDF])&lt;br /&gt;
:#* Read also the appendix on logarithms (especially log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;) if needed!&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Handout&#039;&#039;&#039; for the lecture: [https://teaching.healthtech.dtu.dk/material/22111/Logo_exercise.pdf How to construct sequence logos] (PDF)&lt;br /&gt;
:[[Image:Emblem-important_tiny.png]] &#039;&#039;&#039;Mid-term evaluation:&#039;&#039;&#039; Go to https://evaluering.dtu.dk/ and click &amp;quot;Mid-term evaluation&amp;quot; under 22111 [[Image:Emblem-important_tiny.png]]&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[ExSeqLogos|DNA and Peptide Logos]]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Oct 28 — Case: Malaria vaccine ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Malaria and vaccines&#039;&#039; — [https://cmp.ku.dk/staff/?pure=en/persons/226923 Thomas Lavstsen], Associate Professor, University of Copenhagen&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[Exercise:Malaria Vaccine|Malaria vaccine]]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Nov 4 — Weight matrices and other prediction methods ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Introduction to prediction methods, especially Weight Matrices&#039;&#039; — Henrik Nielsen&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; Same as Oct 21!&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
&amp;lt;!--:&#039;&#039;&#039;Handouts&#039;&#039;&#039; for the lecture: --&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Exercises:&#039;&#039;&#039; &lt;br /&gt;
:# [https://teaching.healthtech.dtu.dk/material/22111/Estimationofpseudocounts_new+examples.pdf How to estimate pseudo frequencies]  &#039;&#039;&#039;Note&#039;&#039;&#039;: If you solve this manually, just select a couple of amino acids from the table. If you solve it programmatically (Python, Excel, or another tool), fill out the entire table.&lt;br /&gt;
:# [[Exercise: Construction of sequence logos and weight matrices|Construction of weight matrices]] &lt;br /&gt;
:&#039;&#039;&#039;Link to advanced course: &#039;&#039;&#039;&lt;br /&gt;
:: [http://teaching.healthtech.dtu.dk/22125/ 22125: Algorithms in bioinformatics]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Nov 11 — PSI-BLAST ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;PSI-BLAST&#039;&#039; — Rasmus Wernersson &lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; &lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[ExPSIBLAST|PSI-BLAST]]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Nov 18 — Multiple alignments ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Multiple alignment&#039;&#039; — Henrik Nielsen &lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; [[Exercise: Multiple Alignments (Seaview version)|Multiple Alignments]]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Nov 25 — Phylogenetic trees ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;Phylogenetic Reconstruction: Distance Matrix Methods&#039;&#039; — Henrik Nielsen&lt;br /&gt;
&amp;lt;!-- :&#039;&#039;&#039;Extra lecture:&#039;&#039;&#039; &#039;&#039;Bioinformatics and Systems Biology in precision medicine&#039;&#039; — Rasmus Wernersson --&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; &lt;br /&gt;
:# &#039;&#039;Introduction to Tree Building&#039;&#039;, PDF on Learn &amp;lt;!-- XXX WHERE? → Slides etc → Lecture12 --&amp;gt;&lt;br /&gt;
:# &#039;&#039;[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]&#039;&#039; (minus the section &amp;quot;How to reconstruct an evolutionary tree&amp;quot;)&lt;br /&gt;
:# &#039;&#039;Understanding Evolutionary Trees&#039;&#039;, [https://teaching.healthtech.dtu.dk/material/22111/PDF/understanding_evo_trees.pdf PDF].&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Handout&#039;&#039;&#039; for lecture: [https://teaching.healthtech.dtu.dk/material/22111/PDF/handout_distance.pdf Reconstructing a distance tree] &lt;br /&gt;
&amp;lt;!-- :&#039;&#039;&#039;Software&#039;&#039;&#039; for installation: [https://github.com/rambaut/figtree/releases FigTree tree-viewer]&lt;br /&gt;
::&#039;&#039;&#039;IMPORTANT NOTE&#039;&#039;&#039; for Windows users: Download the &amp;lt;tt&amp;gt;.zip&amp;lt;/tt&amp;gt; file (FigTree.v1.4.4.zip) and unpack it. Then, go to the &amp;quot;lib&amp;quot; subfolder and double-click the &amp;lt;tt&amp;gt;.jar&amp;lt;/tt&amp;gt; file. The &amp;lt;tt&amp;gt;.exe&amp;lt;/tt&amp;gt; file may not work.&lt;br /&gt;
:&#039;&#039;&#039;TEST&#039;&#039;&#039; of the internal webserver we are going to use during the exercise: Please go to https://services.healthtech.dtu.dk/service.php?TreeHugger and click &amp;quot;View &amp;lt;u&amp;gt;example alignment files&amp;lt;/u&amp;gt;&amp;quot;. Then, copy either the &amp;quot;Sample DNA alignment&amp;quot; or the &amp;quot;Sample peptide dataset&amp;quot; and paste it in the TreeHugger input field. Click &amp;lt;u&amp;gt;Submit query&amp;lt;/u&amp;gt; when instructed by the lecturer.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Exercise: [[Exercise: Phylogeny (Seaview version)|Phylogeny]]&#039;&#039;&#039; &lt;br /&gt;
:&#039;&#039;&#039;Link to advanced course:&#039;&#039;&#039; &lt;br /&gt;
::* [http://teaching.healthtech.dtu.dk/22115/ 22115 Computational Molecular Evolution]&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Dec 2 — Bioinformatics in practice + old exam questions ===&lt;br /&gt;
:&#039;&#039;&#039;Lecture:&#039;&#039;&#039; &#039;&#039;AI, phage discovery and supercomputing&#039;&#039; — [https://globe.ku.dk/staff-list/?pure=en/persons/271131 Bent Petersen, KU]. &lt;br /&gt;
:&#039;&#039;&#039;Curriculum:&#039;&#039;&#039; (None - lean back and enjoy)&lt;br /&gt;
:&#039;&#039;&#039;Slides:&#039;&#039;&#039; on DTU Learn.&lt;br /&gt;
:&#039;&#039;&#039;Exercise:&#039;&#039;&#039; We practice on the old exam set from &#039;&#039;&#039;spring 2022&#039;&#039;&#039;, available on DTU Learn. Note that there is no hand-in. The answers will become available at 17:00 on Tuesday Dec 2.&lt;br /&gt;
&lt;br /&gt;
== Exam ==&lt;br /&gt;
&lt;br /&gt;
=== Tuesday Dec 16 ===&lt;br /&gt;
&#039;&#039;&#039;Winter exam 2025:&#039;&#039;&#039; Go to https://eksamen.dtu.dk/ and find 22111. &lt;br /&gt;
&lt;br /&gt;
[https://teaching.healthtech.dtu.dk/material/22111/Vejledning-til-digital-eksamen-DE-DK-ENG-revideret-2023-.pdf Here is a guide] to the Digital Exam interface (in Danish and English).&lt;br /&gt;
&lt;br /&gt;
The assignment will be accessible from &#039;&#039;&#039;XX:00&#039;&#039;&#039; on Dec 16.&lt;br /&gt;
&lt;br /&gt;
=== Checklist for computers ===&lt;br /&gt;
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]&lt;br /&gt;
&lt;br /&gt;
=== Link collection ===&lt;br /&gt;
A quick overview of the websites we have used in the course: [[Link collection]]&lt;br /&gt;
&lt;br /&gt;
=== FAQ ===&lt;br /&gt;
Questions we have received and answered: [[FAQ]]&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=784</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=784"/>
		<updated>2025-11-25T11:17:42Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 13: interpretation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will analyze the evolutionary relationship between HIV-related viruses from man and monkeys:&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycles: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Let &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; be checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file. &lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
&lt;br /&gt;
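The triangle layout described above can be expanded into a full symmetric matrix with a few lines of Python. This is only a minimal sketch: the function names and the four-sequence toy data are illustrative assumptions, not part of Seaview or of the Pol data set.&lt;br /&gt;

```python
def read_upper_triangle(rows):
    """Expand rows d(1,2..n), d(2,3..n), ... into a full symmetric matrix."""
    n = len(rows) + 1
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            full[i][j] = value
            full[j][i] = value
    return full


def mean_distances(full):
    """Average distance from each sequence to all the others."""
    n = len(full)
    return [sum(row) / (n - 1) for row in full]


# Toy triangle for four hypothetical sequences A, B, C, D (not the Pol data).
names = ["A", "B", "C", "D"]
rows = [[0.75, 0.74, 0.76], [0.02, 0.40], [0.39]]
full = read_upper_triangle(rows)
means = mean_distances(full)
most_distant = names[means.index(max(means))]
print(most_distant)
```

Comparing the mean distance per sequence is one simple way to see which sequence stands apart from the rest; inspecting the first row of the triangle by eye works just as well for a small file.&lt;br /&gt;
&lt;br /&gt;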
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
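To give an idea of what happens behind the NJ button, the sketch below shows only the first step of neighbor joining: computing the Q criterion for every pair and picking the pair with the lowest value. The five taxa and their distances are a small made-up example, and the function name is an assumption; this is not how Seaview is implemented internally.&lt;br /&gt;

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that neighbor joining would join first."""
    n = len(names)
    totals = [sum(row) for row in dist]  # summed distance from each taxon to all others

    def q(pair):
        i, j = pair
        # Q(i,j) = (n - 2) * d(i,j) - sum_k d(i,k) - sum_k d(j,k)
        return (n - 2) * dist[i][j] - totals[i] - totals[j]

    i, j = min(combinations(range(n), 2), key=q)
    return names[i], names[j], q((i, j))


# Hypothetical distance matrix for five taxa (illustration only).
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(names, dist))
```

Note that the pair chosen is not simply the one with the smallest distance (d and e here); the Q criterion also rewards pairs that are far from everything else. The full algorithm then replaces the joined pair by a new node and repeats the step until the tree is resolved.&lt;br /&gt;
&lt;br /&gt;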
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The reason why the trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees is that Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
&lt;br /&gt;
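Midpoint rooting as described above can be sketched in Python: find the two tips that are furthest apart, then walk along the path between them until half of its length is reached. The tree, its branch lengths, and all function names below are hypothetical; this is a sketch of the idea under those assumptions, not the code Seaview uses.&lt;br /&gt;

```python
from itertools import combinations

# Hypothetical unrooted tree: node -> {neighbor: branch length}.
tree = {
    "A": {"n1": 0.1},
    "B": {"n1": 0.2},
    "n1": {"A": 0.1, "B": 0.2, "n2": 0.3},
    "C": {"n2": 0.9},
    "D": {"n2": 0.1},
    "n2": {"C": 0.9, "D": 0.1, "n1": 0.3},
}


def path_between(tree, start, goal, seen=None):
    """Return the list of nodes from start to goal (unique in a tree), or None."""
    seen = seen or {start}
    if start == goal:
        return [start]
    for neighbor in tree[start]:
        if neighbor not in seen:
            rest = path_between(tree, neighbor, goal, seen | {neighbor})
            if rest:
                return [start] + rest
    return None


def path_length(tree, path):
    """Sum of branch lengths along a path of nodes."""
    return sum(tree[a][b] for a, b in zip(path, path[1:]))


def midpoint_root(tree):
    """Return (node_a, node_b, offset): the root lies on branch a-b, offset from a."""
    tips = [node for node, nbrs in tree.items() if len(nbrs) == 1]
    longest = max(
        (path_between(tree, a, b) for a, b in combinations(tips, 2)),
        key=lambda p: path_length(tree, p),
    )
    half = path_length(tree, longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + tree[a][b] >= half:
            return a, b, half - walked
        walked += tree[a][b]


print(midpoint_root(tree))
```

In this toy tree the two most distant tips are B and C, so the root ends up on the long branch leading to C. The same logic explains why a single very divergent sequence tends to pull the midpoint root onto its own branch.&lt;br /&gt;
&lt;br /&gt;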
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node is now marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so). The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that the entry names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell them apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
&lt;br /&gt;
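If you prefer to check the combined file programmatically rather than by eye, a sketch like the following lists every entry whose name does not start with RL3, RK3 or RM03. It assumes the usual UniProt FASTA header layout, where the entry name is the third field separated by vertical bars; the accessions, sequences and function names are placeholders made up for the example.&lt;br /&gt;

```python
EXPECTED_PREFIXES = ("RL3", "RK3", "RM03")


def entry_names(fasta_text):
    """Extract the entry name (third |-separated field) from each FASTA header."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split("|")
            names.append(fields[2].split()[0])
    return names


def unexpected_names(fasta_text):
    """Entry names that do not begin with any of the expected prefixes."""
    return [n for n in entry_names(fasta_text) if not n.startswith(EXPECTED_PREFIXES)]


# Placeholder records: accessions and sequences are invented for illustration.
cyto = ">sp|X00001|RL3_HUMAN placeholder header\nMSHRKFSAPR\n"
mito = ">sp|X00002|RM03_HUMAN placeholder header\nMPGWRLLTQV\n"
other = ">sp|X00003|RS3_HUMAN placeholder header\nMAVQISKKRK\n"
combined = cyto + mito + other
print(unexpected_names(combined))
```

An empty list means that every entry matches one of the three prefixes. Any name that is printed is a hint that the search picked up a protein other than L3.&lt;br /&gt;
&lt;br /&gt;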
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. &amp;lt;!-- Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)&lt;br /&gt;
# Now find a node whose descendants are either all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking the wrong node; you can always click another). Check that the cytoplasmic sequences and the mitochondrial sequences end up in two separate subtrees.&lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.&lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 13: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (i.e., where have the most mutations accumulated per unit of time): among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=783</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=783"/>
		<updated>2025-11-25T11:13:26Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 10: making the tree */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will analyze the evolutionary relationship between HIV-related viruses from man and monkeys:&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycles: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Let &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; be checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file. &lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The reason why the trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees is that Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node is now marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so). The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell them apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. &amp;lt;!-- Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later. --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)&lt;br /&gt;
# Now find a node whose descendants are all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node; you can always click another). Make sure that the cytoplasmic sequences and the mitochondrial sequences end up in two separate subtrees.&lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.&lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 13: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (where are there most mutations per time unit) — among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=782</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=782"/>
		<updated>2025-11-25T11:02:16Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 14: interpretation */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise, you will analyze the evolutionary relationships between HIV-related viruses from humans and non-human primates:&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycles: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Leave &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file.&lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
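&lt;br /&gt;
As a rough illustration of the triangle layout described above, the following Python sketch expands triangle rows into a full symmetric matrix and reports the sequence with the largest mean distance. The function names and the four toy sequences are made up for this example; they are not part of Seaview or of the Pol data set, and reading the rows from the saved file is left out.&lt;br /&gt;
&lt;br /&gt;
```python
def triangle_to_matrix(rows):
    """Expand upper-triangle rows into a full symmetric distance matrix."""
    n = len(rows) + 1  # n sequences give n - 1 triangle rows
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, dist in enumerate(row):
            j = i + 1 + offset  # row i starts at the distance d(i, i + 1)
            matrix[i][j] = dist
            matrix[j][i] = dist
    return matrix


def most_distant(matrix, names):
    """Return the name whose mean distance to all other sequences is largest."""
    n = len(names)
    means = [sum(row) / (n - 1) for row in matrix]
    return names[means.index(max(means))]


# Toy triangle for four hypothetical sequences A-D (not the Pol data).
toy_rows = [[0.75, 0.74, 0.76], [0.02, 0.40], [0.39]]
toy_names = ["A", "B", "C", "D"]
toy_matrix = triangle_to_matrix(toy_rows)
print(most_distant(toy_matrix, toy_names))
```
&lt;br /&gt;
With these toy values, sequence A stands out because every distance in its row is large, which is the pattern to look for in the real file.&lt;br /&gt;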
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The reason why the trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees is that Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
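&lt;br /&gt;
The idea behind midpoint rooting can be sketched in a few lines of Python. This is only an illustration on a hand-made toy tree and makes no claim about how Seaview implements it; the tree, its branch lengths and the function names are all invented for the example. The unrooted tree is stored as a dictionary that maps each node to its neighbours and the connecting branch lengths.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations


def path_lengths(tree, start):
    """Path length from start to every other node of an unrooted tree."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


def midpoint_root(tree, tips):
    """Return the two most distant tips and the root's distance from each."""
    def separation(pair):
        return path_lengths(tree, pair[0])[pair[1]]

    tip_a, tip_b = max(combinations(tips, 2), key=separation)
    return tip_a, tip_b, separation((tip_a, tip_b)) / 2


# Toy unrooted tree: tips A-D, internal nodes X and Y (hypothetical lengths).
toy_tree = {
    "A": {"X": 0.25},
    "B": {"X": 0.5},
    "C": {"Y": 1.0},
    "D": {"Y": 0.25},
    "X": {"A": 0.25, "B": 0.5, "Y": 0.25},
    "Y": {"C": 1.0, "D": 0.25, "X": 0.25},
}
print(midpoint_root(toy_tree, ["A", "B", "C", "D"]))
```
&lt;br /&gt;
In this toy tree the most distant tips are B and C (path length 1.75), so the root is placed 0.875 from each of them, which falls on the long branch leading to C.&lt;br /&gt;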
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Now, go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise, the task is to create a rooted phylogenetic tree from a dataset of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences; they lack roughly the first 90 nucleotides.) The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
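&lt;br /&gt;
To see why the gap setting can change a tree, it may help to look at how an observed distance could be computed in each case. The Python sketch below is an illustration only. It assumes that with &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; on, every alignment column containing a gap in any sequence is dropped, and that with the option off only positions gapped in the pair being compared are skipped. That reading of the option, the function names and the toy alignment are assumptions, not taken from the Seaview documentation.&lt;br /&gt;
&lt;br /&gt;
```python
def observed_distance(seq_a, seq_b, skip_columns=frozenset()):
    """Fraction of compared sites at which two aligned sequences differ."""
    compared = 0
    different = 0
    for col, (a, b) in enumerate(zip(seq_a, seq_b)):
        if col in skip_columns or a == "-" or b == "-":
            continue  # skip globally ignored columns and gaps in this pair
        compared += 1
        if a != b:
            different += 1
    return different / compared  # fails if no site is left to compare


def gap_columns(alignment):
    """Indices of columns where at least one sequence has a gap."""
    return frozenset(
        col for col, chars in enumerate(zip(*alignment)) if "-" in chars
    )


# Toy alignment of three hypothetical sequences (not the L18 data).
toy = ["ACGTAC", "ACGAAC", "A-GTTC"]
print(observed_distance(toy[0], toy[1]))                    # pairwise gaps only
print(observed_distance(toy[0], toy[1], gap_columns(toy)))  # all gap sites ignored
```
&lt;br /&gt;
The first two toy sequences contain no gaps themselves, yet their distance changes from 1/6 to 1/5 once the column that is gapped in the third sequence is removed. Small shifts of this kind across many pairs are what can alter the branching pattern.&lt;br /&gt;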
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common tree. Are there any errors, i.e. taxa that are not placed correctly on your tree? Which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell them apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
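&lt;br /&gt;
The prefix check in the last item can also be done programmatically. The Python sketch below is one possible way, assuming UniProt-style header lines in which the entry name is the third field when the line is split on the vertical bar. The function names, accessions and sequences are placeholders invented for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def entry_names(fasta_text):
    """Collect entry names (e.g. RL3_HUMAN) from UniProt-style FASTA headers."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed header layout: database, accession, entry name,
            # separated by "|", with a free-text description after a space.
            names.append(line.split("|")[2].split()[0])
    return names


def unexpected_names(names, prefixes=("RL3", "RM03", "RK3")):
    """Names that start with none of the expected prefixes."""
    return [name for name in names if not name.startswith(prefixes)]


# Hypothetical downloads with placeholder accessions and sequences.
cyto = ">sp|X00001|RL3_HUMAN example\nMSHRK\n"
mito = ">sp|X00002|RM03_HUMAN example\nMPGWR\n>sp|X00003|RS3_HUMAN example\nMAVQI\n"
combined = cyto + mito  # combining FASTA files is plain concatenation
print(unexpected_names(entry_names(combined)))
```
&lt;br /&gt;
Here the made-up entry RS3_HUMAN is reported, which would be a signal to revisit the search. An empty list means that all names carry one of the expected prefixes.&lt;br /&gt;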
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later.&lt;br /&gt;
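&lt;br /&gt;
For orientation when reading the saved file: a Newick string nests subtrees in parentheses, separates siblings with commas, gives branch lengths after colons, and ends with a semicolon. The short Python sketch below pulls the tip labels out of such a string. The toy tree and the function name are invented for the example, and the simple pattern does not handle quoted labels.&lt;br /&gt;
&lt;br /&gt;
```python
import re


def leaf_names(newick):
    """Return tip labels: names that directly follow an opening bracket or a comma."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Toy tree with hypothetical branch lengths (not produced by Seaview).
toy_newick = (
    "((RL3_HUMAN:0.01,RL3_MOUSE:0.02):0.30,"
    "(RM03_HUMAN:0.10,RM03_MOUSE:0.12):0.25);"
)
print(leaf_names(toy_newick))
```
&lt;br /&gt;
Labels or support values written after a closing bracket belong to internal nodes, so the pattern deliberately skips them.&lt;br /&gt;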
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)&lt;br /&gt;
# Now find a node whose descendants are all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node; you can always click another). Make sure that the cytoplasmic sequences and the mitochondrial sequences end up in two separate subtrees.&lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.&lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 13: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (where are there most mutations per time unit) — among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=781</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=781"/>
		<updated>2025-11-25T11:01:38Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 12: annotating the tree */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise, you will analyze the evolutionary relationships between HIV-related viruses from humans and non-human primates:&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus one (HIV-1) and Human Immunodeficiency Virus two (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent in its effects. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus, SIV. HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycles: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein and is subsequently cleaved by protease into its three separate parts. In this exercise you will use a data set consisting of 20 different POL-polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Leave &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file.&lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The reason why the trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees is that Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: In the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees). Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Now, go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* Taking these groupings into consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise, the task is to create a rooted phylogenetic tree from a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences; they lack roughly the first 90 nucleotides.) The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common Tree. Are there any errors, i.e., taxa that are not placed correctly on your tree? If so, which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that the entry names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell the two groups apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later.&lt;br /&gt;
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)&lt;br /&gt;
# Now find a node whose descendants are all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node, you can always click another). Make sure that the cytoplasmic and the mitochondrial sequences end up in two separate subtrees. &lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
In this step, you need to be able to draw on a screenshot of the tree from Seaview. You can use any drawing software of your own choice, e.g. the Snip and Sketch tool (built into Windows), [https://inkscape.org/ Inkscape], or PowerPoint.&lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 14: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green circle in both subtrees. Where has evolution been faster (i.e., where have the most mutations accumulated per unit of time): among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=780</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=780"/>
		<updated>2025-11-25T10:43:55Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 13: annotating the tree */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will analyze the evolutionary relationship between HIV-related viruses from humans and non-human primates.&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus (SIV). HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein, which is subsequently cleaved by the protease into its three separate parts. In this exercise you will use a data set consisting of 20 different Pol polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Leave &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file. &lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
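&lt;br /&gt;
If you would rather let a script do the comparison, the triangle can be expanded into a full symmetric matrix. The following is only a minimal Python sketch, not part of Seaview: the function names, the four sequence names and the distance values are made up for illustration, and it assumes that the rows of the triangle have already been read into lists of numbers.&lt;br /&gt;
&lt;br /&gt;
```python
def read_triangle(rows):
    """Expand rows d(1,2)..d(1,n), d(2,3)..d(2,n), ... into a full symmetric matrix."""
    n = len(rows) + 1
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


def mean_distances(matrix):
    """Mean distance from each sequence to all the others."""
    n = len(matrix)
    return [sum(row) / (n - 1) for row in matrix]


# Toy triangle for four hypothetical sequences (values are invented).
names = ["A", "B", "C", "D"]
rows = [[0.75, 0.74, 0.76], [0.02, 0.04], [0.03]]

matrix = read_triangle(rows)
means = mean_distances(matrix)
most_distant = names[means.index(max(means))]
print(most_distant)
```
&lt;br /&gt;
With these toy values the script prints A, the sequence whose mean distance to all the others is largest.&lt;br /&gt;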
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that agree with your answer to Question 2?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted because Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the two tips that are farthest apart on the tree. However, you can also display the tree as unrooted in Seaview: in the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees.) Rerooting of trees is covered later in the exercise.&lt;br /&gt;
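&lt;br /&gt;
The idea behind midpoint rooting can be sketched as follows. This Python snippet is an illustration under simplifying assumptions, not what Seaview does internally: the unrooted tree is stored as a dictionary of neighbours with branch lengths, and the node names and lengths are invented. It finds the two tips that are farthest apart and reports how far from either tip the root would be placed.&lt;br /&gt;
&lt;br /&gt;
```python
def path_lengths(tree, start):
    """Path length from start to every other node of an unrooted tree."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist


def longest_tip_path(tree):
    """Return (length, tip_a, tip_b) for the two tips that are farthest apart."""
    tips = [node for node in tree if len(tree[node]) == 1]
    pairs = []
    for a in tips:
        dist = path_lengths(tree, a)
        for b in tips:
            pairs.append((dist[b], a, b))
    return max(pairs)


# Toy unrooted tree: three tips A, B, C joined at one internal node X.
tree = {
    "A": {"X": 0.25},
    "B": {"X": 0.5},
    "C": {"X": 1.5},
    "X": {"A": 0.25, "B": 0.5, "C": 1.5},
}

length, tip_a, tip_b = longest_tip_path(tree)
root_depth = length / 2  # distance from either of the two tips to the midpoint root
print(tip_a, tip_b, length, root_depth)
```
&lt;br /&gt;
In this toy tree the farthest tips are B and C (path length 2.0), so the root would sit 1.0 from each of them, i.e., on the long branch leading to C.&lt;br /&gt;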
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Now, go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* Taking these groupings into consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise, the task is to create a rooted phylogenetic tree from a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences; they lack roughly the first 90 nucleotides.) The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common Tree. Are there any errors, i.e., taxa that are not placed correctly on your tree? If so, which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that the entry names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell the two groups apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later.&lt;br /&gt;
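&lt;br /&gt;
To make the Newick format less mysterious: it writes a tree as nested parentheses, with each name or group optionally followed by a colon and a branch length, and a semicolon at the end. The Python sketch below builds such a string from a small nested structure. The helper name, the four entry names and the branch lengths are invented for illustration; the file Seaview saves will contain your own sequences and may differ in details such as number formatting.&lt;br /&gt;
&lt;br /&gt;
```python
def to_newick(node):
    """Serialise a (label_or_children, branch_length) pair as Newick text."""
    label, length = node
    if isinstance(label, str):
        text = label  # a tip: just its name
    else:
        # an internal node: its children, comma-separated, in parentheses
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    if length is None:
        return text
    return text + ":" + str(length)


# Toy unrooted tree: three branches meet at the top level.
toy_tree = (
    [
        ("RL3_HUMAN", 0.1),
        ("RL3_MOUSE", 0.2),
        ([("RM03_HUMAN", 0.3), ("RM03_MOUSE", 0.4)], 0.5),
    ],
    None,
)

newick = to_newick(toy_tree) + ";"
print(newick)
```
&lt;br /&gt;
This prints (RL3_HUMAN:0.1,RL3_MOUSE:0.2,(RM03_HUMAN:0.3,RM03_MOUSE:0.4):0.5); which has the same overall shape as the text you will see when you open your own tree file.&lt;br /&gt;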
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!)&lt;br /&gt;
# Now find a node whose descendants are all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node, you can always click another). Make sure that the cytoplasmic and the mitochondrial sequences end up in two separate subtrees. &lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 12: annotating the tree===&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 14: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green circle in both subtrees. Where has evolution been faster (i.e., where have the most mutations accumulated per unit of time): among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=779</id>
		<title>Exercise: Phylogeny (Seaview version)</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Phylogeny_(Seaview_version)&amp;diff=779"/>
		<updated>2025-11-25T10:42:22Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Step 12: interactive Tree Of Life */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Before you start: please make sure you have the Seaview program installed on your computer. If not, see the [[Exercise: Multiple Alignments (Seaview version)|Multiple alignment exercise]].&lt;br /&gt;
&lt;br /&gt;
== The Phylogeny of HIV ==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will analyze the evolutionary relationship between HIV-related viruses from humans and non-human primates.&lt;br /&gt;
&lt;br /&gt;
Acquired Immune Deficiency Syndrome (AIDS) is caused by two divergent viruses, Human Immunodeficiency Virus type 1 (HIV-1) and Human Immunodeficiency Virus type 2 (HIV-2). HIV-1 is responsible for the global pandemic, while HIV-2 has, until recently, been restricted to West Africa and appears to be less virulent. Viruses related to HIV have been found in many species of non-human primates (monkeys, apes, ...) and have been named Simian Immunodeficiency Virus (SIV). HTLV-1 is another, more distantly related, member of the family of retroviruses to which HIV and SIV belong.&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;Pol&amp;quot; gene, which is present in the genome of all these viruses, encodes three different polypeptides important for the viral life cycle: integrase, reverse transcriptase, and protease. It is expressed as a single polyprotein, which is subsequently cleaved by the protease into its three separate parts. In this exercise you will use a data set consisting of 20 different Pol polyprotein sequences from HIV1, HIV2, chimpanzee SIV, sooty mangabey SIV, and HTLV-1. It is available via this link:&lt;br /&gt;
&lt;br /&gt;
:[https://teaching.healthtech.dtu.dk/material/22111/Pol21.fsa Pol21.fsa]&lt;br /&gt;
&lt;br /&gt;
===Step 1: alignment===&lt;br /&gt;
&lt;br /&gt;
Align the Pol sequences using the Clustal Omega program in Seaview.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: &lt;br /&gt;
:Make a PDF in color of your alignment (File→Prepare PDF) and hand it in as an attachment to your answer document.&lt;br /&gt;
&lt;br /&gt;
===Step 2: distance matrix===&lt;br /&gt;
&lt;br /&gt;
In Seaview, go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt;. In the window that pops up, select &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt; and set &amp;lt;u&amp;gt;Distance&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Observed&amp;lt;/u&amp;gt;. Leave &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; checked. Click &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; and save the file. &lt;br /&gt;
&lt;br /&gt;
Look at the resulting file in a plain text editor. First, all distances are written in a triangle with the top row containing distances between the first sequence and all the others, the second row containing the distances between the second sequence and all others except the first, and so on. Just below the triangle, the names of the sequences are shown in the order in which they are displayed in the triangle. Further down in the file, all distances are repeated in another format with one pairwise distance per line including the sequence names.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: &lt;br /&gt;
:Can you spot which sequence has the largest distances to all the others?&lt;br /&gt;
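&lt;br /&gt;
If you prefer to check your visual impression with a few lines of code, the triangle can also be read programmatically. The following is only a minimal sketch: it assumes the layout described above (a comment line, the number of sequences, then the triangle), uses a made-up toy matrix instead of the real file, and identifies sequences by their position rather than by name.&lt;br /&gt;

```python
# Toy input in the triangle layout described above (all values are made up).
TRIANGLE = """\
#distances order: d(1,2),...,d(1,n) then d(2,3),...,d(2,n) and so on
4
0.75 0.74 0.76
0.02 0.40
0.39
"""

def mean_distances(text):
    """Return the mean distance from each sequence to all the others."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    n = int(rows[0][0])  # first data line: number of sequences
    totals = [0.0] * n
    # Row i of the triangle holds d(i, i+1) ... d(i, n); rows[1:n] skips
    # anything that follows the triangle (e.g. the sequence names).
    for i, row in enumerate(rows[1:n]):
        for offset, value in enumerate(row):
            j = i + 1 + offset
            totals[i] += float(value)
            totals[j] += float(value)
    return [total / (n - 1) for total in totals]

means = mean_distances(TRIANGLE)
most_distant = means.index(max(means)) + 1  # 1-based position in the file
print(most_distant, [round(m, 3) for m in means])
```

To try it on your own file, read the file contents into a string and pass that to the hypothetical helper &lt;tt&gt;mean_distances&lt;/tt&gt;; the position it reports refers to the order of the names listed below the triangle.&lt;br /&gt;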
&lt;br /&gt;
===Step 3: neighbor joining===&lt;br /&gt;
&lt;br /&gt;
Go to &amp;lt;u&amp;gt;Trees→Distance Methods&amp;lt;/u&amp;gt; again, but this time, select &amp;lt;u&amp;gt;NJ&amp;lt;/u&amp;gt; instead of &amp;lt;u&amp;gt;Save to File&amp;lt;/u&amp;gt;. Then, clicking &amp;lt;u&amp;gt;Go&amp;lt;/u&amp;gt; will produce a neighbor-joining tree based on the distances you just looked at. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the resulting tree (&#039;&#039;&#039;Hint&#039;&#039;&#039;: you can either take a screenshot or save the tree as SVG via the &amp;lt;u&amp;gt;File&amp;lt;/u&amp;gt; menu). &lt;br /&gt;
:Which sequence has the longest branch? Does that correspond to your answer before?&lt;br /&gt;
&lt;br /&gt;
===Step 4: rooted &#039;&#039;vs&#039;&#039; unrooted tree===&lt;br /&gt;
&lt;br /&gt;
In principle, the NJ algorithm always produces an &#039;&#039;unrooted&#039;&#039; tree. The trees you have seen so far (in this and last week&#039;s exercises) have been shown as rooted trees because Seaview uses &#039;&#039;midpoint rooting&#039;&#039;, i.e., it places the root halfway between the tips that are furthest away from each other on the tree. However, you can also display the tree as unrooted in Seaview: in the drop-down menu at the top of the tree window, change &amp;lt;u&amp;gt;squared&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;circular&amp;lt;/u&amp;gt;. (It is a bit unfortunate that Seaview uses the term &amp;quot;circular&amp;quot;, since some other programs offer a circular way of displaying &#039;&#039;rooted&#039;&#039; trees, which should not be confused with unrooted trees.) Later in the exercise, we will encounter tree rerooting.&lt;br /&gt;
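&lt;br /&gt;
The idea behind midpoint rooting can be illustrated with a small sketch. It is not Seaview code: the tree, the tip names and the branch lengths are invented, and the tree is stored as a simple adjacency map. The sketch finds the two tips that are furthest apart along the branches and reports how far from either tip the root would be placed.&lt;br /&gt;

```python
# Toy unrooted tree as an adjacency map: node -> {neighbour: branch length}.
# Tip names and branch lengths are invented for illustration.
TREE = {
    "A": {"n1": 0.125},
    "B": {"n1": 0.25},
    "C": {"n2": 0.25},
    "D": {"n2": 1.0},
    "n1": {"A": 0.125, "B": 0.25, "n2": 0.5},
    "n2": {"C": 0.25, "D": 1.0, "n1": 0.5},
}

def path_lengths(tree, start):
    """Distance along the branches from start to every other node."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                stack.append(neighbour)
    return dist

# Tips are the nodes with exactly one neighbour.
tips = [node for node in TREE if len(TREE[node]) == 1]
# Longest tip-to-tip path as a tuple (length, tip1, tip2).
span, tip1, tip2 = max(
    (path_lengths(TREE, a)[b], a, b) for a in tips for b in tips
)
midpoint = span / 2  # the root is placed this far from either tip
print(tip1, tip2, span, midpoint)
```

In this toy tree, the midpoint falls on the long branch leading to one tip, which is also what you should expect whenever one sequence is much more distant than all the others.&lt;br /&gt;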
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the unrooted tree.&lt;br /&gt;
&lt;br /&gt;
===Step 5: rearrangement===&lt;br /&gt;
Now, go back to the rooted view of the tree and click &amp;lt;u&amp;gt;Swap&amp;lt;/u&amp;gt; in the second row of the tree window. Every internal node will then be marked by a small black square. Click any square to rotate the subtree defined by that node (i.e., swap the upper and lower branches). When you click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt;, the black squares disappear again, but the changes in the tree layout will remain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: &lt;br /&gt;
:Hand in a picture of the tree where you have rearranged it so that:&lt;br /&gt;
:# HTLV is at the bottom,&lt;br /&gt;
:# The HIV1 sequences are above the HIV2 sequences, and&lt;br /&gt;
:# &amp;quot;SIVCZ&amp;quot; is placed next to &amp;quot;Smanga_S4&amp;quot;.&lt;br /&gt;
Note that all these rearrangements do &#039;&#039;not&#039;&#039; change the topology (the branching pattern) of the tree — it still shows the same phylogeny.&lt;br /&gt;
&lt;br /&gt;
===Step 6: interpretation===&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: &lt;br /&gt;
: Inspect the rooted tree that you now have and consider what this tells you about the origin of HIV viruses.&lt;br /&gt;
* Note that all HIV1 sequences form a clade. Which sequence is the sister group to the HIV1 sequences?&lt;br /&gt;
* The HIV2 sequences also form a clade. Which sequences make up the sister group to HIV2?&lt;br /&gt;
* With these groupings in consideration, what can you say about the origin of the two HIV viruses?&lt;br /&gt;
&lt;br /&gt;
== Comparing trees ==&lt;br /&gt;
&lt;br /&gt;
For the next part of the exercise the task is to create a rooted phylogenetic tree with a dataset consisting of DNA sequences encoding the ribosomal protein L18 from a number of different species. L18 forms part of the 60S subunit of the ribosome. (The sequences used here are not the complete coding sequences, but lack the first 90 nucleotides or so). The sequences can be found via the following link:&lt;br /&gt;
&lt;br /&gt;
* [https://teaching.healthtech.dtu.dk/material/22111/L18_CDS.fasta L18_CDS.fasta]&lt;br /&gt;
&lt;br /&gt;
===Step 7: with or without gapped positions===&lt;br /&gt;
This time, make two versions of your tree: one where &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; is on, and one where it is off. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: &lt;br /&gt;
: Compare the two trees. Are there any differences in the tree topology (i.e., in branching pattern, not just in branch lengths)?&lt;br /&gt;
: Your answers should include the following:&lt;br /&gt;
:* How did you construct the trees? (alignment method, construction of tree, etc.).&lt;br /&gt;
:* Pictures of the trees. &lt;br /&gt;
:* Which tree do you think is most correct?&lt;br /&gt;
&lt;br /&gt;
===Step 8: comparison to taxonomy===&lt;br /&gt;
Now, go to [http://www.ncbi.nlm.nih.gov/taxonomy NCBI taxonomy] and construct a &amp;quot;Common Tree&amp;quot; with all the different species in your L18 data set. It may be necessary to look up some of the common names on the net (Google, Wikipedia, Tree of Life) in order to enter them in the common tree function. &#039;&#039;&#039;Note&#039;&#039;&#039;: Remember to tick &amp;lt;u&amp;gt;include unranked (phylogenetic) taxa&amp;lt;/u&amp;gt;.&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;:&lt;br /&gt;
: Compare the most correct of your trees from Step 7 with the Common Tree. Are there any errors, i.e., taxa that are not placed correctly on your tree? Which?&lt;br /&gt;
&lt;br /&gt;
== Mitochondrial &#039;&#039;versus&#039;&#039; cytoplasmic proteins ==&lt;br /&gt;
In eukaryotes, many proteins occur inside mitochondria, where they function in energy metabolism or in the mitochondrion&#039;s own genetic system. This system includes ribosomes that differ from the ribosomes found in the cytoplasm. In this part of the exercise, you will use [http://www.uniprot.org/ UniProt] to construct a dataset of a specific ribosomal protein (L3) that exists in the large subunit of both cytoplasmic and mitochondrial ribosomes. Then, you will analyze the phylogeny of the dataset.&lt;br /&gt;
&lt;br /&gt;
===Step 9: building the dataset===&lt;br /&gt;
# Find all proteins named &amp;quot;ribosomal protein L3&amp;quot; from as many eukaryotes (&#039;&#039;Eukaryota&#039;&#039;) as possible in Swiss-Prot. Avoid fragments. How many results do you get? (Remember, as always, to include the search string in your answer).&lt;br /&gt;
# How many of these have a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; of &amp;quot;mitochondrion&amp;quot; and &amp;quot;cytoplasm&amp;quot;, respectively? Download the results of these two searches in FASTA format.&lt;br /&gt;
# Now combine the two data sets from the previous question into one FASTA file (using Geany or another plain text editor). Note that their names start with &amp;quot;RL3&amp;quot; (cytoplasmic) or &amp;quot;RM03&amp;quot;/&amp;quot;RK3&amp;quot; (mitochondrial), which makes it easy to tell them apart. &#039;&#039;If you have any names that do not begin with &amp;quot;RL3&amp;quot;, &amp;quot;RK3&amp;quot; or &amp;quot;RM03&amp;quot;, revisit your UniProt search criteria!&#039;&#039; Hand in your FASTA file as an attachment to your answers (do not include it in your PDF).&lt;br /&gt;
&lt;br /&gt;
===Step 10: making the tree===&lt;br /&gt;
Make a phylogenetic tree of all the sequences (cytoplasmic as well as mitochondrial). NB: set &amp;lt;u&amp;gt;Ignore all gap sites&amp;lt;/u&amp;gt; off. Describe all the steps you took to make it, and hand in a picture of your tree in &#039;&#039;unrooted&#039;&#039; view. Also, go to &amp;lt;u&amp;gt;File→Save unrooted tree&amp;lt;/u&amp;gt; and save the tree file; name it something ending in &amp;lt;tt&amp;gt;.txt&amp;lt;/tt&amp;gt;. Open this file in a plain text editor and have a look at it — this is the Newick tree file text format. We will need this file later.&lt;br /&gt;
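&lt;br /&gt;
To get a feeling for the Newick format, the sketch below pulls the tip names out of a Newick string and sorts them by the name prefixes from Step 9. The string is made up (its labels only mimic the UniProt naming pattern), and the approach assumes unquoted names without spaces, which may not hold for every file written by Seaview.&lt;br /&gt;

```python
import re

# A made-up Newick string; the labels only mimic the UniProt naming pattern.
NEWICK = (
    "((RL3_HUMAN:0.1,RL3_MOUSE:0.1):0.5,"
    "(RM03_HUMAN:0.3,RM03_YEAST:0.6):0.4,RL3_YEAST:0.2);"
)

def tip_names(newick):
    """Return the leaf labels, i.e. names that directly follow '(' or ','."""
    return re.findall(r"[(,]([^(),:;]+)", newick)

def location(name):
    """Classify a label by the prefixes mentioned in Step 9."""
    if name.startswith("RL3"):
        return "cytoplasmic"
    if name.startswith(("RM03", "RK3")):
        return "mitochondrial"
    return "unexpected"

groups = {}
for name in tip_names(NEWICK):
    groups.setdefault(location(name), []).append(name)
print(groups)
```

Nested parentheses are subtrees, the number after each colon is a branch length, and the semicolon ends the tree. Any name that ends up under &amp;quot;unexpected&amp;quot; is a hint to revisit the data set from Step 9.&lt;br /&gt;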
&lt;br /&gt;
===Step 11: rerooting the tree in Seaview===&lt;br /&gt;
Until now, we have not had to deal with rerooting, because the midpoint rooting happened to be correct. This is not the case here, since we want the cytoplasmic and the mitochondrial sequences to be in two monophyletic groups (two subtrees). In other words, we have to reroot:&lt;br /&gt;
# Switch back to rooted (&amp;quot;squared&amp;quot;) view.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Re-root&amp;lt;/u&amp;gt; in the second row of the tree window; a small black square will appear at each node. If you click a square, the tree will be rerooted at that node (try it!).&lt;br /&gt;
# Now find a node whose descendants are all cytoplasmic or all mitochondrial. Click it (don&#039;t worry about clicking a wrong node; you can always click another). Make sure that all the cytoplasmic and all the mitochondrial sequences are in two separate subtrees.&lt;br /&gt;
# Then, click &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; in the second row of the tree window to make the small black squares disappear again.&lt;br /&gt;
Include a picture of the rerooted tree in your answer.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
===Step 12: interactive Tree Of Life===&lt;br /&gt;
In this step, we will use the website [https://itol.embl.de/ iTOL] (interactive Tree Of Life) to reroot our tree: &lt;br /&gt;
# Open the website in a new browser tab, and click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt; in the top row.&lt;br /&gt;
# Click the button under &amp;lt;u&amp;gt;Tree file:&amp;lt;/u&amp;gt; and select the unrooted Newick tree file you saved in Step 10.&lt;br /&gt;
# Click &amp;lt;u&amp;gt;Upload&amp;lt;/u&amp;gt;. You will now see a tree displayed with an arbitrary placement of the root.&lt;br /&gt;
# Look at the &amp;lt;u&amp;gt;Control panel&amp;lt;/u&amp;gt; to the right. Under &amp;lt;u&amp;gt;Label options&amp;lt;/u&amp;gt; switch &amp;lt;u&amp;gt;Position&amp;lt;/u&amp;gt; from &amp;lt;u&amp;gt;Aligned&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;At tips&amp;lt;/u&amp;gt;.&lt;br /&gt;
# Note that when you hover the mouse over a branch, information about the branch is displayed.&lt;br /&gt;
# Find, like in the previous step, a node where all children are either cytoplasmic or mitochondrial. Click it. A menu will appear. In that menu, go to &amp;lt;u&amp;gt;Editing→Tree structure→Re-root the tree here&amp;lt;/u&amp;gt;.&lt;br /&gt;
Include a picture of the rerooted tree in your answer. Is there a difference between this tree and the one you made in Step 11? If so, describe it.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Step 13: annotating the tree===&lt;br /&gt;
# In the left part of the iTOL window, you will see six small icons: Zoom in, Zoom out, Fit to screen, Information, Search tree nodes, and Manual annotations (hover the mouse over them to see the descriptions). &lt;br /&gt;
# Click &amp;lt;u&amp;gt;Manual annotations&amp;lt;/u&amp;gt; and select the first tool (&amp;quot;Draw an ellipse / circle&amp;quot;). &lt;br /&gt;
# Find the nodes that mark the splits between Human and Mouse (the most recent common ancestors of Human and Mouse) in both the mitochondrial subtree and the cytoplasmic subtree. Mark &#039;&#039;both&#039;&#039; these nodes with a green circle each.&lt;br /&gt;
# Note that in case you place a circle incorrectly, you can move it with the &amp;quot;Move/rotate/scale objects&amp;quot; tool. There is also a &amp;quot;Delete objects&amp;quot; tool.&lt;br /&gt;
# Now, find the nodes that mark the most recent common ancestors of Human and Yeast in the two subtrees and mark those with a &#039;&#039;blue&#039;&#039; circle each. &lt;br /&gt;
Hand in a picture of your annotated tree. &lt;br /&gt;
&lt;br /&gt;
===Step 14: interpretation===&lt;br /&gt;
&lt;br /&gt;
Consider your rerooted and annotated tree from iTOL, and answer the following questions: &lt;br /&gt;
# Are the mitochondrial proteins most closely related to each other, or is each mitochondrial protein most closely related to its cytoplasmic counterpart from the same species? Does this indicate that mitochondria have evolved once or many times in the eukaryotes?&lt;br /&gt;
# Consider those species that are represented in both the cytoplasmic and the mitochondrial group. Do the two subtrees agree on the phylogeny of the eukaryotes? If not, where do you see differences?&lt;br /&gt;
# Consider the horizontal distance between the blue and the green point in both subtrees. Where has evolution been faster (i.e., where are there the most mutations per unit of time): among the cytoplasmic or the mitochondrial proteins?&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=778</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=778"/>
		<updated>2025-11-10T14:11:30Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Technical issues===&lt;br /&gt;
If you are denied access to the BLAST server, it may help to make a hotspot on your phone and use that to connect to the internet instead of the DTU Wi-Fi. &lt;br /&gt;
&lt;br /&gt;
In case you cannot make the BLAST server run at all, we have made some &amp;quot;backup output&amp;quot; links with copies of the relevant BLAST output page. &#039;&#039;&#039;Note:&#039;&#039;&#039; The backup outputs &#039;&#039;cannot&#039;&#039; show &amp;lt;u&amp;gt;Graphic Summary&amp;lt;/u&amp;gt; or &amp;lt;u&amp;gt;Alignments&amp;lt;/u&amp;gt;!&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where the database setting is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q1.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q2.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
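&lt;br /&gt;
As a rough illustration of what a position-specific scoring matrix is, the sketch below builds one from a tiny, made-up, gap-free alignment. It is a deliberately simplified model and not the procedure PSI-BLAST itself uses: it assumes a uniform background distribution and a flat pseudocount, and the function names are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

# Toy gap-free alignment (made-up sequences) standing in for the hits
# that would be used to build the profile.
ALIGNMENT = ["MKDT", "MKET", "MRDT", "MKDS"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Log-odds score per column and residue against a uniform background."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score(pssm, sequence):
    """Sum the position-specific scores for a sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "MKDT"), 2), round(score(pssm, "WWWW"), 2))
```

The point of the sketch is that each column gets its own scores: residues seen often at a position are rewarded there, while the same residue may score poorly at another position. A general substitution matrix, by contrast, scores a given pair of residues the same way everywhere in the sequence.&lt;br /&gt;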
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&amp;lt;!-- This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.--&amp;gt;&lt;br /&gt;
Now, you are going to save the PSSM that PSI-BLAST has created and use it for searching PDB.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important: you will need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, since it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q8.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit.)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12A.html backup output]) &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!). ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12B.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. For this, you will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q13.html backup output])&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14A.html backup output 1]) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14-PSSM_Scoremat.asn backup PSSM]) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14B.html backup output 2])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed, you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate whether the structural properties of the four most conserved residues from question Q12 could indeed form an active site. Phyre is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre website and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate whether the structural properties of the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels website and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=777</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=777"/>
		<updated>2025-11-10T14:03:59Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Finding a remote homolog (on your own) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homolog with an experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q1.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q2.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
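&lt;br /&gt;
To make the idea of a PSSM concrete, the following Python sketch builds position-specific log-odds scores from a toy alignment and scores two segments against it. It is a simplified illustration, not the PSI-BLAST algorithm: the aligned segments, the uniform background frequencies, and the add-one pseudocount are assumptions made for this example, and the function names build_pssm and score are hypothetical. PSI-BLAST itself uses more elaborate sequence weighting and pseudocounts.&lt;br /&gt;

```python
import math
from collections import Counter

# Toy alignment of equally long, gap-free segments (made-up data).
ALIGNED_HITS = ["LRPGE", "LRPSE", "IRPGD", "LRAGE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def build_pssm(sequences):
    """Return one dict per alignment column mapping residue to log2-odds score."""
    pssm = []
    for column in zip(*sequences):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            frequency = (counts[residue] + PSEUDOCOUNT) / total
            scores[residue] = math.log2(frequency / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


pssm = build_pssm(ALIGNED_HITS)
print(round(score(pssm, "LRPGE"), 2))  # segment resembling the aligned hits
print(round(score(pssm, "WWWWW"), 2))  # segment unlike any aligned hit
```

In this sketch, a column where all aligned segments agree gives a high score to the shared residue and a negative score to residues never observed there, so the same residue can score differently at different positions.&lt;br /&gt;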
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&amp;lt;!-- This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.--&amp;gt;&lt;br /&gt;
Now, you are going to save the PSSM that PSI-BLAST has created and use it for searching PDB.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q8.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit) &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12A.html backup output]) &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!). ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12B.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q13.html backup output])&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14A.html backup output 1]) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14-PSSM_Scoremat.asn backup PSSM]) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q14B.html backup output 2])&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the four most conserved residues from question Q12 could indeed form an active site. Phyre is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=776</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=776"/>
		<updated>2025-11-10T13:58:36Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homolog with an experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q1.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits) ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q2.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&amp;lt;!-- This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.--&amp;gt;&lt;br /&gt;
Now, you are going to save the PSSM that PSI-BLAST has created and use it for searching PDB.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q4-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q8.html backup output])&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit) &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12A.html backup output]) &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!). ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12-PSSM_Scoremat.asn backup PSSM])&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search. ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q12B.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed, you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the four most conserved residues from question Q12 could indeed form an active site. Phyre2 is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre2 web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence identity. Further, you have made qualified predictions about the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=775</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=775"/>
		<updated>2025-11-10T13:48:21Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* When BLAST fails */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with an experimentally determined structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow marks where the database is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? ([https://teaching.healthtech.dtu.dk/material/22111/PSI-BLAST/Q1.html backup output])&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important, because you will need your first BLAST window again later) and again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BLASTP search: set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039; and &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed, you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the four most conserved residues from question Q12 could indeed form an active site. Phyre2 is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre2 web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence identity. Further, you have made qualified predictions about the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=774</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=774"/>
		<updated>2025-11-10T10:15:53Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Finding a remote homolog (on your own) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: The E-values and other results reported here were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, there are many more than 500 hits, but BLAST shows only 500 by default)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, which was then used to build a more sensitive position-specific scoring matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary information about each position in the sequence, such as conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
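&lt;br /&gt;
As a rough illustration of the idea in the answer above, here is a minimal Python sketch of how position-specific log-odds scores can be derived from the columns of a multiple alignment. It is not the actual PSI-BLAST procedure, which among other things weights the aligned sequences and uses substitution-matrix-based pseudocounts. The function name build_pssm, the toy alignment, the uniform background frequencies and the simple additive pseudocount are all assumptions made for this example.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Hypothetical helper for illustration only; assumes a gap-free
    alignment of equal-length sequences and a uniform background.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumption: all residues equally likely
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from getting a score of -infinity.
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment (made up): column 1 is fully conserved, column 2 is variable.
toy_alignment = ["RLYE", "RIYD", "RVYE", "RLYQ"]
pssm = build_pssm(toy_alignment)

print("R at conserved position 1:", round(pssm[0]["R"], 2))
print("L at variable position 2: ", round(pssm[1]["L"], 2))
print("A at conserved position 1:", round(pssm[0]["A"], 2))
```

In this toy profile, the fully conserved arginine in the first column scores higher than any residue in the variable second column, and residues never seen at the conserved position get a negative score. This is the sense in which a profile rewards matches at conserved positions more than a generic substitution matrix can.&lt;br /&gt;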
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 16&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 5HXY_A with an E-value of 8´2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;; 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-17&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  46%   20%    35%&lt;br /&gt;
 4A8E_A  63%   18%    37%&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  3e-55  65%   18%    36%&lt;br /&gt;
 5HXY_A  3e-41  61%   18%    32%&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 4e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 3e-03&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=773</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=773"/>
		<updated>2025-11-10T09:08:52Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. are found November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (where many different amino acids occur at a given position).&lt;br /&gt;
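&lt;br /&gt;
The idea can be illustrated with a minimal Python sketch. It is not the actual PSI-BLAST procedure, which also applies sequence weighting and data-dependent pseudocounts. The toy alignment, the uniform background frequency of 1/20, and the pseudocount of 1 are assumptions chosen purely for illustration. Each alignment column gets its own log-odds scores, so a conserved column rewards its residue strongly, while a variable column gives many residues a similar, modest score.&lt;br /&gt;

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Assumptions for this sketch: ungapped sequences of equal length,
    a uniform background frequency, and a flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical alignment: column 0 is fully conserved (R),
# column 2 is highly variable.
toy_alignment = ["RLA", "RIG", "RLS", "RVT"]
pssm = build_pssm(toy_alignment)

# The conserved R scores higher than any residue in the variable column.
print(round(pssm[0]["R"], 2), round(pssm[2]["A"], 2))
# A segment resembling the alignment outscores an unrelated one.
print(round(score(pssm, "RLA"), 2), round(score(pssm, "WWW"), 2))
```

A generic matrix such as BLOSUM62 would give an R-to-R match the same score at every position; the position-specific scores are what make the second iteration more sensitive.&lt;br /&gt;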
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 16&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 5HXY_A with an E-value of 8´2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;; 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-17&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  46%   20%    35%&lt;br /&gt;
 4A8E_A  63%   18%    37%&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  3e-55  65%   18%    36%&lt;br /&gt;
 5HXY_A  3e-41  61%   18%    32%&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
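&lt;br /&gt;
The significance check used throughout this exercise amounts to counting E-values below the threshold of 0.005. A small Python sketch, using the E-values reported in the answers to questions 13 and 14 (the function and variable names are illustrative only, and only the best BlastP hit is listed because it is the only one reported above):&lt;br /&gt;

```python
from bisect import bisect_left

THRESHOLD = 0.005  # significance cut-off used in this exercise


def count_significant(evalues, threshold=THRESHOLD):
    """Count E-values strictly below the threshold."""
    # On a sorted list, bisect_left returns the number of values
    # that are strictly smaller than the threshold.
    return bisect_left(sorted(evalues), threshold)


blastp_evalues = [6.9]             # best BlastP hit (question 13)
psiblast_evalues = [1e-05, 8e-04]  # PSSM-based search (question 14)

print(count_significant(blastp_evalues))    # prints 0
print(count_significant(psiblast_evalues))  # prints 2
```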
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=772</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=772"/>
		<updated>2025-11-10T09:08:08Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* One more round */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. are found November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (where many different amino acids occur at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 16&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 5HXY_A with an E-value of 8´2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;; 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-17&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  46%   20%    35%&lt;br /&gt;
 4A8E_A  63%   18%    37%&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  3e-55  65%   18%    36%&lt;br /&gt;
 5HXY_A  3e-41  61%   18%    32%&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=771</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=771"/>
		<updated>2025-11-10T08:52:36Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* One more round */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with an experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important: you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit) &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed, you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the structural properties of the four most conserved residues from Question 16 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the structural properties of the four most conserved residues from Question 16 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from Question 16 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences with a sequence identity well below 30%. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=770</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=770"/>
		<updated>2025-11-10T08:50:13Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, there are many more than 500 hits, but BLAST shows only 500 by default) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. The second iteration searches with this PSSM instead of BLOSUM62, which is why more sequences are found. A PSSM captures evolutionary information about each position of the sequence, for example conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
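&lt;br /&gt;
As a rough illustration of this answer, the following minimal Python sketch turns a small multiple alignment into position-specific log-odds scores. The toy alignment, the uniform background frequencies and the add-one pseudocount are assumptions made for this example only; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.&lt;br /&gt;

```python
import math
from collections import Counter

# Hypothetical toy alignment (not taken from the exercise): one aligned hit per row.
ALIGNMENT = [
    "GLRPGEL",
    "GLRVSEL",
    "GVRVGEL",
    "GLRAGEL",
]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                    # assumption: simple add-one smoothing


def build_pssm(alignment):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


pssm = build_pssm(ALIGNMENT)
# Column index 2 is R in every row; column index 3 varies (P, V, V, A).
print(round(pssm[2]["R"], 2), round(pssm[3]["V"], 2), round(pssm[3]["R"], 2))
```

In this sketch a fully conserved column gives a high score to its conserved residue and negative scores to all others, whereas a variable column gives moderate scores to several residues. This dependence on position is what a general matrix such as BLOSUM62 cannot express.&lt;br /&gt;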
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 16&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 5HXY_A with an E-value of 8&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;; 4A8E_A with an E-value of 2&amp;times;10&amp;lt;sup&amp;gt;-17&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  46%   20%    35%&lt;br /&gt;
 4A8E_A  63%   18%    37%&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
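The query coverage values in the table can be reproduced from the first and last query positions of each alignment. The sketch below assumes a query length of 457 residues (counted from the Query1 sequence of the exercise) and assumes that the percentage is truncated rather than rounded, which is what matches the tabulated values. It only handles hits with a single aligned segment.&lt;br /&gt;

```python
QUERY_LENGTH = 457  # assumption: length of the Query1 sequence used in the exercise

# (hit, first aligned query position, last aligned query position),
# read off the two alignments shown for this search round.
ALIGNED_SPANS = [
    ("5HXY_A", 163, 454),
    ("4A8E_A", 154, 453),
]


def query_coverage(start, end, query_length):
    """Percentage of the query inside the aligned span, truncated to an integer."""
    aligned = end - start + 1  # positions are 1-based and inclusive
    return 100 * aligned // query_length


for hit, start, end in ALIGNED_SPANS:
    print(hit, f"{query_coverage(start, end, QUERY_LENGTH)}%")
```

With these inputs the sketch gives 63% for 5HXY_A and 65% for 4A8E_A, in agreement with the table.&lt;br /&gt;
&lt;br /&gt;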
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=769</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=769"/>
		<updated>2025-11-09T14:20:21Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, there are many more than 500 hits, but BLAST shows only 500 by default) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. The second iteration searches with this PSSM instead of BLOSUM62, which is why more sequences are found. A PSSM captures evolutionary information about each position of the sequence, for example conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 16&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;; 5HXY_A with an E-value of 8&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=768</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=768"/>
		<updated>2025-11-09T14:19:05Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow marks where the database setting is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits (E-value &amp;lt; 0.005) now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit) &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
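&lt;br /&gt;
For readers who prefer the command line, the save-and-reuse workflow above can be sketched with the standalone BLAST+ program psiblast. This is only an illustrative sketch and not part of the exercise: it assumes a local BLAST+ installation with local copies of the databases, and the file names (Query1.fasta, round2.pssm, the output files) are hypothetical. The script only builds and prints the two commands; it does not run them, and it does not reproduce the Archaea restriction used on the web page. The inclusion threshold is passed explicitly because the command-line default may differ from the web value.&lt;br /&gt;

```python
import shlex


def build_pssm_command(query_fasta, db, pssm_out, iterations=2, inclusion=0.005):
    """Command that iterates PSI-BLAST against db and saves the resulting PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,                  # hypothetical FASTA file with Query1
        "-db", db,                              # assumed local database name
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion),   # E-value cutoff for building the PSSM
        "-out_pssm", pssm_out,                  # PSSM checkpoint file to reuse later
        "-out", "build_pssm.txt",               # hypothetical report file
    ]


def reuse_pssm_command(pssm_in, db, evalue=0.005):
    """Command that searches db with a saved PSSM; no -query is given."""
    return [
        "psiblast",
        "-in_pssm", pssm_in,
        "-db", db,
        "-evalue", str(evalue),
        "-out", "pdb_hits.txt",                 # hypothetical report file
    ]


if __name__ == "__main__":
    print(shlex.join(build_pssm_command("Query1.fasta", "refseq_protein", "round2.pssm")))
    print(shlex.join(reuse_pssm_command("round2.pssm", "pdb")))
```

As on the web page, the query sequence is not supplied in the second command, because it is stored in the PSSM file.&lt;br /&gt;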
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. In doing this your are transferred to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the on the 3D model of the protein to get the relevant PDB filel).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence identity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=767</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=767"/>
		<updated>2025-11-09T13:33:10Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
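&lt;br /&gt;
The counting and coverage check in Questions 2 and 3 can also be sketched in code. The example below assumes the hits are available as BLAST tabular text with the standard twelve columns (query ID, subject ID, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The hit names and numbers are invented for illustration, and the function name is hypothetical. Coverage is computed from a single alignment per hit, which is a simplification of the Query coverage column on the web page.&lt;br /&gt;

```python
def summarize_hits(tabular_text, query_length, threshold=0.005):
    """Return {subject_id: query coverage in percent} for significant hits."""
    coverages = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        q_start, q_end = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if threshold > evalue:  # significant: E-value below the threshold
            coverage = 100.0 * (q_end - q_start + 1) / query_length
            # keep the best-covering alignment if a subject appears twice
            coverages[subject] = max(coverage, coverages.get(subject, 0.0))
    return coverages


if __name__ == "__main__":
    # Invented example rows; Query1 from the exercise is 457 residues long.
    rows = [
        ["QUERY1", "hit_A", "21.0", "212", "150", "4", "242", "453", "90", "278", "2e-19", "90.1"],
        ["QUERY1", "hit_B", "25.0", "40", "30", "0", "10", "49", "5", "44", "6.9", "28.0"],
    ]
    text = "\n".join("\t".join(row) for row in rows)
    for subject, coverage in summarize_hits(text, query_length=457).items():
        print(subject, round(coverage), "%")
```

The number of significant hits is then simply the number of entries in the returned dictionary.&lt;br /&gt;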
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, many more than 500 hits, but BLAST by default only shows 500) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used for the search in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position of the sequence: conserved regions and active sites score strongly, whereas positions under less evolutionary pressure (where many different amino acids occur) tolerate substitutions.&lt;br /&gt;
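&lt;br /&gt;
To make the idea of a PSSM concrete, here is a deliberately simplified sketch. It is not how PSI-BLAST computes its matrix (the real program also weights sequences and derives pseudocounts from the substitution matrix); it only assumes a toy gapless alignment, a uniform background over a hypothetical six-letter alphabet, and a flat pseudocount. Each column gets its own log-odds scores, so a conserved position rewards its residue far more than a variable one.&lt;br /&gt;

```python
import math
from collections import Counter


def build_pssm(alignment, background, pseudocount=1.0):
    """Return one {residue: log-odds score in bits} dictionary per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            frequency = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(frequency / background[residue])
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


if __name__ == "__main__":
    # Toy data: column 1 is fully conserved (R), column 2 is variable.
    toy_alignment = ["RLG", "RIG", "RVG", "RLA"]
    toy_background = {residue: 1 / 6 for residue in "RLIVGA"}
    toy_pssm = build_pssm(toy_alignment, toy_background)
    print(round(toy_pssm[0]["R"], 2), round(toy_pssm[1]["L"], 2))
    print(round(score(toy_pssm, "RLG"), 2), round(score(toy_pssm, "GRL"), 2))
```

In the toy matrix, R at the conserved first position scores higher than L at the variable second position, which is exactly the position-specific behaviour a single substitution matrix such as BLOSUM62 cannot express.&lt;br /&gt;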
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
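The identity and Positives values in the table can be read directly from the middle line of each alignment: a letter marks an identical residue, a plus sign marks a substitution with a positive score, and a blank marks everything else. The short sketch below counts these symbols for a made-up middle line; the function name is hypothetical. BLAST reports the percentages relative to the full alignment length, including gaps, so the length of the whole middle line must be used.&lt;br /&gt;

```python
def midline_stats(midline):
    """Return (identities, positives, alignment length) for a BLAST middle line."""
    identities = sum(1 for symbol in midline if symbol.isalpha())
    positives = identities + midline.count("+")  # identities also count as positives
    return identities, positives, len(midline)


if __name__ == "__main__":
    # Made-up middle line: 7 letters, 3 plus signs, 3 blanks.
    toy_midline = "KTPK+" + " " * 2 + "EE++" + " " + "L"
    identities, positives, length = midline_stats(toy_midline)
    print("identity:", round(100 * identities / length), "%")
    print("positives:", round(100 * positives / length), "%")
```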
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=766</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=766"/>
		<updated>2025-11-09T13:27:15Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with an experimentally determined structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow marks the database setting, changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it both at the top of the results table and after the list of significant hits).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
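&lt;br /&gt;
For readers who prefer the command line, the two steps above (build a PSSM in one database, then reuse it in another) can be sketched with the standalone NCBI BLAST+ program psiblast. This is only a sketch under assumptions that are not part of this exercise: a local BLAST+ installation, locally installed databases, and hypothetical file and database names (Query1.fasta, refseq_protein, pdbaa). The code only builds and prints the command lines and does not run them. Whether a PSSM saved this way is identical to the one downloaded from the web interface is not verified here.&lt;br /&gt;
&lt;br /&gt;
```python
import shlex


def build_pssm_command(query_fasta, database, pssm_file, iterations=2,
                       inclusion_threshold=0.005):
    """Command line for building a PSSM with standalone psiblast.

    Assumes a local BLAST+ installation; file and database names are
    placeholders chosen for illustration.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_threshold),
        "-out_pssm", pssm_file,
    ]


def reuse_pssm_command(pssm_file, database, report_file):
    """Command line for searching another database with a saved PSSM.

    No -query option is given: the query sequence is stored in the PSSM.
    """
    return [
        "psiblast",
        "-in_pssm", pssm_file,
        "-db", database,
        "-out", report_file,
    ]


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    step1 = build_pssm_command("Query1.fasta", "refseq_protein", "query1_iter2.pssm")
    step2 = reuse_pssm_command("query1_iter2.pssm", "pdbaa", "query1_vs_pdb.txt")
    for command in (step1, step2):
        print(shlex.join(command))
```
&lt;br /&gt;
As in the web interface, the second command takes no query sequence, because the PSSM already contains it.&lt;br /&gt;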
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=765</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=765"/>
		<updated>2025-11-08T16:29:22Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values etc. were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
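&lt;br /&gt;
The counting rule used throughout this exercise (a hit is significant when its E-value is below 0.005) can be written as a short sketch. The E-values in the two lists are made up for illustration and are not results from the actual search.&lt;br /&gt;
&lt;br /&gt;
```python
def count_significant(e_values, threshold=0.005):
    """Count hits whose E-value is strictly below the threshold."""
    return sum(1 for e_value in e_values if threshold > e_value)


# Hypothetical E-values, made up for illustration only.
no_significant_hits = [0.32, 1.4, 2.9, 6.9]
some_significant_hits = [8e-14, 1e-05, 8e-04, 0.02]

print(count_significant(no_significant_hits))    # prints 0
print(count_significant(some_significant_hits))  # prints 3
```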
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;!--* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted)  Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, there are many more than 500 hits, but BLAST by default only shows 500) &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: 53%-87% &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive Position-Specific Scoring Matrix (PSSM) was built and used for the second iteration. This is why more sequences are found in the second iteration. A PSSM can capture evolutionary information for each position of the sequence, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
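&lt;br /&gt;
The idea of turning a multiple alignment into position-specific scores can be illustrated with a minimal sketch. It is not the PSI-BLAST implementation: it assumes uniform background frequencies, no sequence weighting, and a flat pseudocount, and the toy alignment is made up for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Simplifying assumptions: uniform background frequencies, no sequence
    weighting, gaps ignored, and a flat pseudocount for every amino acid.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for amino_acid in AMINO_ACIDS:
            frequency = (counts[amino_acid] + pseudocount) / total
            scores[amino_acid] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Toy alignment, made up for illustration: column 1 is fully conserved (R),
# column 2 is variable.
toy_alignment = ["RA", "RK", "RE", "RL"]
toy_pssm = build_pssm(toy_alignment)

# The conserved R scores higher than any residue in the variable column,
# and residues never observed in a column get a negative score.
print(round(toy_pssm[0]["R"], 2), round(toy_pssm[1]["K"], 2), round(toy_pssm[0]["A"], 2))
print(score_sequence(toy_pssm, "RK") > score_sequence(toy_pssm, "AK"))
```
&lt;br /&gt;
Unlike a generic substitution matrix, the same amino acid receives a different score in each column, which is what makes the second iteration more sensitive.&lt;br /&gt;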
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
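The identity and Positives percentages can be read off the middle line of a BLAST alignment: a letter marks an identical pair and a plus sign marks a non-identical pair with a positive substitution score. The sketch below counts both for the first ten columns of the 4A8E_A alignment above (Query 242-251). Truncating to whole percent is an assumption made for simplicity, and the values for this short window differ from the values for the full alignment given in the table.&lt;br /&gt;
&lt;br /&gt;
```python
def identity_and_positives(midline):
    """Return (identity %, positives %) for a BLAST alignment midline.

    The midline must have the same length as the alignment, including
    the blanks that mark mismatches and gaps.
    """
    alignment_length = len(midline)
    identities = sum(1 for symbol in midline if symbol.isalpha())
    positives = identities + midline.count("+")
    return (100 * identities // alignment_length,
            100 * positives // alignment_length)


# First ten columns of the 4A8E_A alignment:
#   Query  DSFKTPKIQY
#          +  KTPK+
#   Sbjct  EKLKTPKMPK
window_midline = "+  KTPK+  "
print(identity_and_positives(window_midline))  # prints (40, 60)
```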
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
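The Query coverage values reported above can be checked from the first and last query positions of each alignment. This is a sketch that assumes a single aligned segment per hit and truncation to whole percent; with the 457-residue QUERY1 sequence, this reproduces the coverage values in both answer tables.&lt;br /&gt;
&lt;br /&gt;
```python
QUERY_LENGTH = 457  # number of residues in QUERY1


def query_coverage_percent(query_start, query_end, query_length=QUERY_LENGTH):
    """Percentage of the query spanned by one alignment, truncated to an integer."""
    aligned_span = query_end - query_start + 1
    return 100 * aligned_span // query_length


# Query start and end positions read from the alignments above.
print(query_coverage_percent(242, 453))  # 4A8E_A, first PSSM:  prints 46
print(query_coverage_percent(174, 454))  # 5HXY_A, first PSSM:  prints 61
print(query_coverage_percent(163, 454))  # 5HXY_A, second PSSM: prints 63
print(query_coverage_percent(154, 453))  # 4A8E_A, second PSSM: prints 65
```
&lt;br /&gt;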
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=764</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=764"/>
		<updated>2025-11-08T16:18:49Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
&amp;lt;!-- * &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted because it only makes sense when using nr)  Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: What is the E-value of the &#039;&#039;least&#039;&#039; significant hit shown on the results page?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important: you will need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, because it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database, &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options at their defaults and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You can find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=763</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=763"/>
		<updated>2025-11-08T16:03:05Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: E-values and other results were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.&lt;br /&gt;
 &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted) &amp;lt;!-- Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually, there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit shown has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits from that iteration were combined into a multiple alignment, from which a position-specific scoring matrix (PSSM) was derived. The second iteration searches with this PSSM, which is more sensitive than a generic matrix, and therefore finds more sequences. A PSSM captures evolutionary information for each position, i.e. conserved regions such as active sites (few amino acids tolerated) and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
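&lt;br /&gt;
As a minimal sketch of the idea (not the actual PSI-BLAST procedure, which weights the aligned sequences and uses more elaborate pseudocounts), the following Python code derives log-odds scores per position from a toy alignment. The four aligned sequences, the uniform background frequencies, and the add-one pseudocount are assumptions made for illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

# Toy alignment: four aligned sequences of equal length (made-up example).
ALIGNMENT = ["GLRP", "GLRV", "GVRV", "GIRE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def build_pssm(alignment):
    """Return one dict per alignment column, mapping residue to log2-odds score."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores for an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


pssm = build_pssm(ALIGNMENT)
# Third column is fully conserved (R): compare the score for R with the score for A.
print(round(pssm[2]["R"], 2), round(pssm[2]["A"], 2))
print(score(pssm, "GLRV") - score(pssm, "AAAA"))
```
&lt;br /&gt;
In this sketch, a fully conserved column gives a positive score to the conserved residue and a negative score to residues never observed there, so matches at conserved positions count more than matches at variable positions.&lt;br /&gt;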
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
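The Query coverage values in the table can be checked against the alignments above. The following Python sketch assumes that each hit consists of a single aligned segment and that the percentage is truncated to a whole number; truncation is an assumption that happens to reproduce the reported values. QUERY1 is 457 residues long.&lt;br /&gt;
&lt;br /&gt;
```python
QUERY_LENGTH = 457  # number of residues in QUERY1

# (hit, first aligned query residue, last aligned query residue),
# read from the Query lines of the alignments
FIRST_SEARCH = [("4A8E_A", 242, 453), ("5HXY_A", 174, 454)]


def query_coverage(start, end, query_length=QUERY_LENGTH):
    """Percentage of the query spanned by the alignment, truncated to a whole number."""
    aligned = end - start + 1
    return 100 * aligned // query_length


for hit, start, end in FIRST_SEARCH:
    print(hit, query_coverage(start, end))
# prints:
# 4A8E_A 46
# 5HXY_A 61
```
&lt;br /&gt;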
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=762</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=762"/>
		<updated>2025-11-08T15:53:44Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with an experimentally determined structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where the database is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now, &amp;lt;!-- go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP].--&amp;gt; click on &amp;lt;u&amp;gt;Edit Search&amp;lt;/u&amp;gt; on the results page (then you don&#039;t have to paste in the query sequence again). This time, set the database to &amp;lt;u&amp;gt;Reference proteins (refseq_protein)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in refseq_protein. We know that the mysterious &amp;quot;Query1&amp;quot; sequence is from an archaeon.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits typically match (excluding the 100% identity match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted) &amp;lt;!-- Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit.)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;.&lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site. CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=761</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=761"/>
		<updated>2025-11-08T15:52:12Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: The E-values and other results below were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 181 significant hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits match (excluding the identical match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. One hit (#2) is 84%. A few hits are lower, down to 11%.&lt;br /&gt;
 &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: (deleted) &amp;lt;!-- Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually many more than 500, but BLAST shows only 500 hits by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions and active sites as well as regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
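To make the idea of a PSSM concrete, the sketch below builds a small log-odds matrix from a toy alignment. It is only an illustration under simplifying assumptions (gap-free sequences, a uniform background frequency of 1/20, a fixed pseudocount); PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The names build_pssm and score and the toy sequences are made up for this example.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumes equal-length, gap-free sequences and a uniform background
    frequency of 1/20 (a simplification, not what PSI-BLAST does).
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, peptide):
    """Sum the position-specific scores of a peptide of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Hypothetical toy alignment: column 3 is fully conserved (always R).
toy_alignment = ["GLRPG", "GLRVS", "GVRVG", "GLRAG"]
pssm = build_pssm(toy_alignment)

# The conserved R scores much higher than an unseen residue in that column,
# whereas a general substitution matrix scores R the same at every position.
print(round(pssm[2]["R"], 2), round(pssm[2]["A"], 2))
print(round(score(pssm, "GLRVG"), 2), round(score(pssm, "WWWWW"), 2))
```

A peptide matching the conserved positions gets a high total score, which is how a PSSM can pick up distant homologs that share only the key residues.&lt;br /&gt;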
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
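The Query coverage values can be checked against the alignments above: coverage is the number of query positions spanned by the alignment divided by the query length. The sketch below assumes a query length of 457 residues (counted from the QUERY1 sequence) and assumes that the percentage is truncated to a whole number, which is consistent with the values reported here. The function name query_coverage is made up for this example.&lt;br /&gt;

```python
def query_coverage(query_start, query_end, query_length):
    """Percentage of the query spanned by one alignment, truncated to an integer."""
    aligned = query_end - query_start + 1  # positions are 1-based and inclusive
    return (100 * aligned) // query_length


QUERY_LENGTH = 457  # assumed length of QUERY1, counted from the sequence

print(query_coverage(242, 453, QUERY_LENGTH))  # 4A8E_A alignment, prints 46
print(query_coverage(174, 454, QUERY_LENGTH))  # 5HXY_A alignment, prints 61
```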
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases; the resolvase/invertase family uses a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approximately 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A, although 5HXY_A now has the lower E-value.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
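Counting significant hits amounts to keeping every hit whose E-value is below 0.005. As a minimal sketch, using the E-values reported in the answers to questions 13 and 14 (the function name significant_hits and the list layout are made up for this example):&lt;br /&gt;

```python
E_VALUE_THRESHOLD = 0.005  # significance cutoff used throughout this exercise


def significant_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Return the (name, e_value) pairs below the threshold, best hit first."""
    kept = [hit for hit in hits if threshold > hit[1]]
    return sorted(kept, key=lambda hit: hit[1])


# E-values taken from the answers above (standard BlastP vs. PSI-BLAST).
blastp_hits = [("hypothetical protein", 6.9)]
psiblast_hits = [
    ("putative GPI transamidase component GAA1 (T. theileri)", 8e-04),
    ("GPI transamidase component Gaa1 (T. melophagium)", 1e-05),
]

print(len(significant_hits(blastp_hits)))    # prints 0
print(len(significant_hits(psiblast_hits)))  # prints 2
print(significant_hits(psiblast_hits)[0][1])  # best E-value, prints 1e-05
```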
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=760</id>
		<title>ExPSIBLAST answer</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST_answer&amp;diff=760"/>
		<updated>2025-11-08T15:33:32Z</updated>

		<summary type="html">&lt;p&gt;Henni: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Note: The E-values and other results below were obtained on November 8, 2025.&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005. &lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: After the first iteration, 494 hits are found.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction of the query sequence do the significant hits match (excluding the identical match)? &lt;br /&gt;
Answer: For most hits between 45 and 55%. One hit (#2) is 84%. A few hits are lower, down to 11%.&lt;br /&gt;
 &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Do you find any PDB hits among the significant hits? &lt;br /&gt;
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually many more than 500, but BLAST shows only 500 hits by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The significant hits from that iteration were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions and active sites as well as regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A, but 5HXY_A now ranks first.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
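&lt;br /&gt;
The query coverage values in the two tables above can be reproduced from the first and last query positions of each alignment. The sketch below assumes that coverage is the aligned query span divided by the query length (457 residues for QUERY1) and that the percentage is rounded down. The rounding rule is a guess that happens to match the reported values, and the function name is hypothetical. For hits with several separate alignments the calculation would be more involved.&lt;br /&gt;
&lt;br /&gt;
```python
QUERY_LENGTH = 457  # number of residues in the QUERY1 sequence


def query_coverage(query_start, query_end, query_length=QUERY_LENGTH):
    """Return the percentage of the query spanned by one alignment, rounded down."""
    aligned_span = query_end - query_start + 1
    return 100 * aligned_span // query_length


# (hit, PDB search number) mapped to the first and last aligned query position
alignments = {
    ("4A8E_A", 1): (242, 453),
    ("5HXY_A", 1): (174, 454),
    ("5HXY_A", 2): (163, 454),
    ("4A8E_A", 2): (154, 453),
}

for (hit, search), (start, end) in alignments.items():
    print(hit, "search", search, "coverage:", query_coverage(start, end), "%")
```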
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
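&lt;br /&gt;
The significance check used in questions 13 and 14 is a plain comparison against the threshold of 0.005. A minimal sketch, using only the E-values quoted in the answers above (the function name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
```python
from operator import lt

E_VALUE_THRESHOLD = 0.005  # significance cut-off used throughout the exercise

# E-values quoted in the answers to questions 13 and 14
blastp_best_hit = [6.9]
psiblast_hits = [1e-05, 8e-04]


def count_significant(evalues, threshold=E_VALUE_THRESHOLD):
    """Count the hits whose E-value is strictly below the threshold."""
    return sum(1 for evalue in evalues if lt(evalue, threshold))


print("BlastP significant hits:", count_significant(blastp_best_hit))
print("PSI-BLAST significant hits:", count_significant(psiblast_hits))
```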
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=759</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=759"/>
		<updated>2025-11-08T15:28:20Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* When BLAST fails */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to find a homologue with experimentally known structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from &amp;lt;u&amp;gt;ClusteredNR&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where the database setting is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in nr.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;)&lt;br /&gt;
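&lt;br /&gt;
If you have the accessions as a text list, the pattern described in the tip can also be checked with a regular expression. This is only a sketch of the rule as stated in the tip (a digit, three further letters or digits, an underscore and one chain letter). The function name and the second accession in the example are made up for illustration, and real chain names may not always be a single letter.&lt;br /&gt;
&lt;br /&gt;
```python
import re

# Assumed pattern: a digit, three letters or digits, an underscore, one chain letter
PDB_ACCESSION = re.compile(r"[0-9][A-Za-z0-9]{3}_[A-Za-z]")


def is_pdb_accession(accession):
    """Return True if the accession looks like a PDB identifier plus chain."""
    return PDB_ACCESSION.fullmatch(accession) is not None


# "1XYZ_A" is the example from the tip; the second accession is a made-up non-PDB one
for accession in ["1XYZ_A", "WP_012345678.1"]:
    print(accession, is_pdb_accession(accession))
```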
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit) &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from Question 16 indeed could form an active site. Phyre is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from Question 16 indeed could form an active site. CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in PyMOL. If you do not have PyMOL installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from Question 16 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=758</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=758"/>
		<updated>2025-11-08T15:23:12Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Introduction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to identify relationships between proteins with low sequence similarity.&lt;br /&gt;
&amp;lt;!-- * Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein) --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to make predictions about its structural homologue. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where the database setting is changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in nr.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column. A PDB identifier is a 4-character code whose first character is a digit; in the BLAST output it is followed by an underscore and a chain name, such as &amp;quot;1XYZ_A&amp;quot;.)&lt;br /&gt;
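&lt;br /&gt;
As a rough aid for Question 4, the accession pattern described in the tip can be checked with a short script. This is only a sketch: the function name and the example accessions are made up for illustration, and the pattern is a heuristic that assumes accessions of the form &amp;quot;1XYZ_A&amp;quot;.&lt;br /&gt;

```python
import re

# Heuristic pattern: one digit, three letters or digits, an underscore, a chain name.
PDB_ACCESSION = re.compile(r"[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+")

def looks_like_pdb_accession(accession):
    """Return True if the accession has the PDB-style form, e.g. 1XYZ_A."""
    return PDB_ACCESSION.fullmatch(accession) is not None

# Hypothetical accessions, not real search results.
examples = ["1XYZ_A", "WP_012345678.1", "2ABC_B"]
pdb_like = [acc for acc in examples if looks_like_pdb_accession(acc)]
print(pdb_like)
```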
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important: you will need your first BLAST window again later) and again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit.)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. In this round you will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExLogo%2BMatrix-answers&amp;diff=648</id>
		<title>ExLogo+Matrix-answers</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExLogo%2BMatrix-answers&amp;diff=648"/>
		<updated>2025-11-04T08:26:15Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Construction of weightmatrices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Answers to &amp;quot;Construction of sequence logos and weight matrices&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Identification of MHC binding motifs ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 1:&#039;&#039;&#039; Which positions are anchor positions, and what amino acids are found at the anchor positions? &lt;br /&gt;
:Anchor positions are P2 and P9. Preferred amino acids are P2: LM, P9: VL. You don&#039;t have to take the &amp;quot;Auxiliary anchor&amp;quot; into account.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Construction of weightmatrices ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 2.1&#039;&#039;&#039;: Have a look at the sequence logo. How many different amino acids are present in the logo? &lt;br /&gt;
:More than 10 (all 20 in fact)! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 2.2&#039;&#039;&#039;: Can you understand the weight matrix values? Hint, compare the weight matrix values to the Blosum62 scoring matrix values for &amp;lt;tt&amp;gt;L&amp;lt;/tt&amp;gt;.&lt;br /&gt;
:If you only have one sequence (one amino acid), alpha in the equation for the combined frequency is zero, and p = g. To calculate the weight matrix value for, for instance, A we get&lt;br /&gt;
   g(A) = q(A|L) = 0.04&lt;br /&gt;
   p(A) = 0.04&lt;br /&gt;
   w(A) = 2*log(0.04/0.074)/log(2) = -1.78&lt;br /&gt;
:This value compares well with the Blosum scoring matrix value for matching L to A, BL(A,L) = -1. The matrix value reported by the EasyPred program is -1.468. The difference between the value found here (-1.78) and the EasyPred value (-1.468) is due to round-off errors: the EasyPred program uses a Blosum matrix with more digits defining the substitutions.&lt;br /&gt;
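&lt;br /&gt;
:The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using the rounded values quoted in this answer (q(A|L) = 0.04 and a background frequency of 0.074 for A); the variable names are only illustrative.&lt;br /&gt;

```python
import math

q_A_given_L = 0.04    # Blosum substitution probability q(A|L), rounded
background_A = 0.074  # background frequency of A, rounded

# With a single sequence, alpha is zero, so p(A) equals the pseudo frequency g(A).
p_A = q_A_given_L

# Weight matrix value: 2 * log2(p / q)
w_A = 2 * math.log2(p_A / background_A)
print(round(w_A, 2))
```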
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 2.3:&#039;&#039;&#039; How many different amino acids are present at the P1 position in the logo (just give a rough estimate)? &lt;br /&gt;
:More than 10 (all 20 in fact)! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 2.4:&#039;&#039;&#039; How many different amino acids are present at the P1 position in the binding data? &lt;br /&gt;
:2.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 3:&#039;&#039;&#039; Try to reproduce the matrix values for P1(I), and P1(K).&lt;br /&gt;
:The pseudo frequencies for I and K are&lt;br /&gt;
   g(I) = 0.4*0.12 + 0.6*0.16 = 0.144&lt;br /&gt;
   g(K) = 0.4*0.03 + 0.6*0.03 = 0.03&lt;br /&gt;
:Since the weight on prior (or weight on pseudo count, beta) is much greater than the number of sequences, the final frequencies are p(I) = g(I) and p(K) = g(K). Using the formula for the values in the weight matrix with q(I)=0.068 and q(K)=0.058 (remember that q is the background frequency), we find the weight matrix values to be&lt;br /&gt;
:Score(I) = 2.17, and Score(K) = -1.90&lt;br /&gt;
:These values compare fine with the values calculated by the EasyPred program (2.18, and -2.34). Remember, the EasyPred program uses a Blosum matrix with more digits defining the substitutions.&lt;br /&gt;
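&lt;br /&gt;
:The same numbers can be checked in Python. The sketch below assumes the rounded probabilities used in this answer and that the observed frequencies at P1 are 0.4 for L and 0.6 for V; the function and variable names are illustrative and not part of EasyPred. Expect small deviations in the last digit because of rounding.&lt;br /&gt;

```python
import math

# Observed amino acid frequencies at P1 in the five training peptides.
observed = {"L": 0.4, "V": 0.6}

# Rounded Blosum conditional probabilities q(target | observed) from the answer.
q_cond = {
    "I": {"L": 0.12, "V": 0.16},
    "K": {"L": 0.03, "V": 0.03},
}
background = {"I": 0.068, "K": 0.058}

def pseudo_frequency(target):
    """g(target) = sum over observed amino acids b of f(b) * q(target | b)."""
    return sum(freq * q_cond[target][b] for b, freq in observed.items())

def weight(target):
    """Weight matrix value 2 * log2(p / q), with p = g for a very large weight on prior."""
    return 2 * math.log2(pseudo_frequency(target) / background[target])

for aa in ("I", "K"):
    print(aa, round(pseudo_frequency(aa), 3), round(weight(aa), 2))
```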
&lt;br /&gt;
== Weight Matrix generation ==&lt;br /&gt;
&lt;br /&gt;
=== Small training set ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 4:&#039;&#039;&#039; What is the predictive performance of the matrix method?&lt;br /&gt;
:Pearson coefficient for N= 1266 data: 0.07628 Aroc value: 0.56979 &lt;br /&gt;
View the logo plot of the calculated matrix. Can you understand why the matrix performs so poorly? &lt;br /&gt;
:The logo shows very low information at all positions. We have trained a method for peptide:MHC binding on a mixture of peptide binders and peptide non-binders. This is clearly wrong. We can only include binders when estimating a binding motif. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 5:&#039;&#039;&#039; How many of the 110 peptides in the small.train.set are included in the matrix construction (Look for number of positive training examples)? &lt;br /&gt;
:All 110 peptides were included.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 6:&#039;&#039;&#039; What is the predictive performance of the matrix method now? &lt;br /&gt;
:Pearson coefficient for N= 1266 data: 0.29529 Aroc value: 0.71191 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 7:&#039;&#039;&#039; How many of the 110 peptides in the small.train.set are included in the matrix construction?&lt;br /&gt;
:10.&lt;br /&gt;
View the logo plot of the calculated matrix. Does the logo resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise? &lt;br /&gt;
:No.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 8:&#039;&#039;&#039; What is the predictive performance of the matrix method now? &lt;br /&gt;
:Pearson coefficient for N= 1266 data: 0.45328 Aroc value: 0.81865&lt;br /&gt;
Again view the logo plot of the calculated matrix. Has it changed compared to the previous calculation? &lt;br /&gt;
:In some positions, the order of letters has changed, but it still does not resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 9:&#039;&#039;&#039; What is the predictive performance of the matrix method now? &lt;br /&gt;
:Pearson coefficient for N= 1266 data: 0.49684 Aroc value: 0.84838 &lt;br /&gt;
View the logo plot of the calculated matrix. What is the big difference between this logo and the two previous ones? (how many different amino acids are present at each position in the binding motif?) &lt;br /&gt;
:In the two previous logo plots, only four amino acids were present at, for instance, P2. In this last logo, all amino acids are present. The information content is also much lower.&lt;br /&gt;
What are the reasons for these differences? &lt;br /&gt;
:Using pseudo-counts will give non-zero frequency values also for amino acids not observed, and add more terms to the sum in the equation for the information content, thereby lowering the value.&lt;br /&gt;
Does the logo begin to resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise? &lt;br /&gt;
:Yes, it captures some of the features (especially that position 2 is most important). Remember, it is still made from only 10 binding peptides.&lt;br /&gt;
&lt;br /&gt;
=== Large training set ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 10:&#039;&#039;&#039; What is the predictive performance of the matrix method now? &lt;br /&gt;
:Pearson coefficient for N= 1266 data: 0.71798 Aroc value: 0.96651 &lt;br /&gt;
View the logo plot of the calculated matrix. Does the logo compare to the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise? &lt;br /&gt;
:Yes.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 11:&#039;&#039;&#039; Look at the prediction list. How many false positive hits do you find among the top 20 highest scoring peptides (Assignment score &amp;lt; 0.426)? &lt;br /&gt;
:One false positive (Assignment score = 0).&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Construction_of_sequence_logos_and_weight_matrices&amp;diff=647</id>
		<title>Exercise: Construction of sequence logos and weight matrices</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:_Construction_of_sequence_logos_and_weight_matrices&amp;diff=647"/>
		<updated>2025-11-04T08:24:44Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Construction of weightmatrices */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise by: [https://www.dtu.dk/service/telefonbog/Person?id=5973 Morten Nielsen] - editing by Rasmus Wernerson&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
You shall use Bioinformatics tools to predict peptide-MHC binding and select potential epitope vaccine candidates. The exercise has four parts:&lt;br /&gt;
&lt;br /&gt;
#Identification of MHC binding motif&lt;br /&gt;
#Visualize the motif using sequence logos&lt;br /&gt;
#Training of MHC binding prediction methods&lt;br /&gt;
#Use the MHC binding prediction method to select potential epitope vaccine candidates&lt;br /&gt;
&lt;br /&gt;
===Background: Peptide MHC binding===&lt;br /&gt;
&lt;br /&gt;
The most selective step in identifying potential peptide immunogens is the binding of the peptide to the MHC complex. Only one in about 200 peptides will bind to a given MHC complex. A very large number of different MHC alleles exist, each with a highly selective peptide binding specificity.&lt;br /&gt;
&lt;br /&gt;
The binding motif for a given MHC class I complex is in most cases 9 amino acids long. The motif is characterized by a strong amino acid preference at specific positions in the motif. These positions are called anchor positions. For many MHC complexes the anchor positions are placed at P2 and P9 in the motif. However, this is not always the case.&lt;br /&gt;
&lt;br /&gt;
A large amount of peptide data exists describing this variation in MHC specificity. One important source of data was the SYFPEITHI MHC database (http://www.syfpeithi.de - currently unavailable). This database contains information on MHC ligands and binding motifs.&lt;br /&gt;
&lt;br /&gt;
Once an accurate and reliable prediction method of peptide MHC binding has been developed it can be applied in the context of effective vaccine design to search for potential epitope candidates. On a whole genome scale one can search for peptides with high MHC binding affinity.&lt;br /&gt;
&lt;br /&gt;
===Purpose of exercise===&lt;br /&gt;
&lt;br /&gt;
In this exercise you are going to:&lt;br /&gt;
&lt;br /&gt;
*Visualize the binding motif using sequence logos.&lt;br /&gt;
*Use the EasyPred web-interface to train a bioinformatics predictor for MHC-peptide binding.&lt;br /&gt;
*Apply an MHC binding prediction method to select peptides in the SARS genome useful for vaccine design.&lt;br /&gt;
&lt;br /&gt;
===Prediction performance===&lt;br /&gt;
&lt;br /&gt;
We shall use two performance measures to evaluate the predictive performance of a prediction method: the AUC (the area under the ROC curve) and the Pearson correlation. A short description of different performance measures is given here: [https://teaching.healthtech.dtu.dk/material/22111/PDF/perf.pdf Performance measures].&lt;br /&gt;
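&lt;br /&gt;
To make the two measures concrete, here is a minimal Python sketch that computes both from scratch. The function names and the toy data are made up for illustration; the binder threshold of 0.426 is the transformed value for 500 nM used later in this exercise, and EasyPred may compute the measures differently in detail.&lt;br /&gt;

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def auc(scores, is_binder):
    """Area under the ROC curve: the fraction of (binder, non-binder) pairs
    in which the binder gets the higher score; ties count as one half."""
    pos = [s for s, b in zip(scores, is_binder) if b]
    neg = [s for s, b in zip(scores, is_binder) if not b]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: measured (transformed) affinities and predicted scores.
measured = [0.80, 0.60, 0.45, 0.30, 0.10]
predicted = [0.70, 0.40, 0.50, 0.35, 0.05]
binders = [m > 0.426 for m in measured]

print("Pearson:", round(pearson(measured, predicted), 3))
print("AUC:", auc(predicted, binders))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of binders from non-binders; the Pearson correlation instead rewards agreement with the actual measured values.&lt;br /&gt;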
&lt;br /&gt;
==The exercise==&lt;br /&gt;
[[file:HLA-A0201.gif|right|frame|Logo visualization of the binding motif of the MHC allele HLA-A*0201]]&lt;br /&gt;
&lt;br /&gt;
===Identification of MHC binding motifs===&lt;br /&gt;
&lt;br /&gt;
Have a look at the peptide characteristics for the HLA-A*0201 MHC allele (displayed in the figure on the right). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: Which positions do you think are anchor positions for a peptide to bind MHC, and what amino acids are found more frequently at the anchor positions?&lt;br /&gt;
&lt;br /&gt;
===Sequence logos===&lt;br /&gt;
&lt;br /&gt;
A powerful way to visualize the peptide characteristics of the binding motif of an MHC complex is to plot a sequence logo. In a logo, the information content at each position in the sequence motif corresponds to the height of a column of letters. The information content &#039;&#039;I&amp;lt;sub&amp;gt;p&amp;lt;/sub&amp;gt;&#039;&#039; at each position &#039;&#039;p&#039;&#039; is defined as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- :&amp;lt;math&amp;gt;\log_2(20)+\sum_a{p_{ap}*\log_2(p_{ap})}&amp;lt;/math&amp;gt; --&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&amp;lt;dd&amp;gt;&amp;lt;span class=&amp;quot;texhtml&amp;quot; dir=&amp;quot;ltr&amp;quot;&amp;gt;&amp;lt;table&amp;gt;&amp;lt;tr align=&amp;quot;center&amp;quot;&amp;gt;&amp;lt;td&amp;gt;log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(20) + &amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&amp;lt;font size=&amp;quot;+2&amp;quot;&amp;gt;∑&amp;lt;/font&amp;gt;&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt;&amp;lt;sub&amp;gt;&amp;lt;i&amp;gt;a&amp;lt;/i&amp;gt;&amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt;&amp;lt;/sub&amp;gt; * log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(&amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt;&amp;lt;sub&amp;gt;&amp;lt;i&amp;gt;a&amp;lt;/i&amp;gt;&amp;lt;i&amp;gt;p&amp;lt;/i&amp;gt;&amp;lt;/sub&amp;gt;)&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;tr valign=&amp;quot;top&amp;quot; align=&amp;quot;center&amp;quot;&amp;gt;&amp;lt;td&amp;gt;&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&amp;lt;i&amp;gt;a&amp;lt;/i&amp;gt;&amp;lt;/td&amp;gt;&amp;lt;td&amp;gt;&amp;lt;/td&amp;gt;&amp;lt;/tr&amp;gt;&amp;lt;/table&amp;gt;&amp;lt;/span&amp;gt;&amp;lt;/dd&amp;gt;&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The information content is a measure of the degree of conservation and lies within the range of 0 (no conservation — all amino acids are equally probable) and log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;(20) = 4.3 (full conservation — only a single amino acid is observed at that position). The height of each letter within the columns is proportional to the frequency &#039;&#039;p&amp;lt;sub&amp;gt;ap&amp;lt;/sub&amp;gt;&#039;&#039; of the corresponding amino acid &#039;&#039;a&#039;&#039; at position &#039;&#039;p&#039;&#039;. The amino acids are colored according to their properties:&lt;br /&gt;
&lt;br /&gt;
* Acidic [DE]: red&lt;br /&gt;
* Basic [HKR]: blue&lt;br /&gt;
* Hydrophobic [ACFILMPVW]: black&lt;br /&gt;
* Neutral [GNQSTY]: green&lt;br /&gt;
&lt;br /&gt;
In the exercise, you shall use logo plots to visualize the MHC binding motif contained in different weight matrix predictors.&lt;br /&gt;
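&lt;br /&gt;
The information content defined above can be computed directly from the amino acid frequencies at a position. The following Python sketch illustrates this; the function name and the example frequencies are made up for illustration, and terms with zero frequency are taken to contribute zero.&lt;br /&gt;

```python
import math

def information_content(frequencies):
    """I = log2(20) + sum over amino acids a of p_a * log2(p_a), in bits.

    Amino acids with zero frequency contribute nothing to the sum."""
    return math.log2(20) + sum(p * math.log2(p) for p in frequencies if p > 0)

# No conservation: all 20 amino acids equally probable.
uniform = [1 / 20] * 20
# Full conservation: a single amino acid is observed.
conserved = [1.0] + [0.0] * 19
# Hypothetical anchor position dominated by two amino acids.
anchor = [0.6, 0.3] + [0.1 / 18] * 18

for name, freqs in [("uniform", uniform), ("conserved", conserved), ("anchor", anchor)]:
    print(name, round(information_content(freqs), 2))
```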
&lt;br /&gt;
===Construction of weightmatrices===&lt;br /&gt;
&lt;br /&gt;
First you shall use the [https://services.healthtech.dtu.dk/services/EasyPred-1.0/ EasyPred web-server] (click on the right mouse button, and select open in a new window) to confirm the results from the lecture. &lt;br /&gt;
&lt;br /&gt;
Start by calculating the weight matrix from the simple alignment containing one single amino acid:&lt;br /&gt;
&lt;br /&gt;
 L&lt;br /&gt;
&lt;br /&gt;
Go to the [https://services.healthtech.dtu.dk/services/EasyPred-1.0/ EasyPred web-server] and type in the above amino acid in the &#039;&#039;&#039;Paste in training examples&#039;&#039;&#039; window. Next, press &#039;&#039;&#039;Submit query&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
*&#039;&#039;&#039;QUESTION 2.1&#039;&#039;&#039;: Have a look at the sequence logo. How many different amino acids are present in the logo?&lt;br /&gt;
&lt;br /&gt;
Click on the link &#039;&#039;&#039;Parameters for prediction method&#039;&#039;&#039;. This will open a window with the weight matrix estimated from the input peptide sequences. If you run into problems, right-click the link and select &amp;quot;Download linked file&amp;quot;; you should be able to open this file in Geany. &lt;br /&gt;
&lt;br /&gt;
*&#039;&#039;&#039;QUESTION 2.2&#039;&#039;&#039;: Can you understand the weight matrix values? Hint, compare the weight matrix values to the Blosum62 scoring matrix values for L. &lt;br /&gt;
Link: [https://teaching.healthtech.dtu.dk/material/22111/files/Blosum62.txt BLOSUM62 scoring matrix].&lt;br /&gt;
&lt;br /&gt;
Next, you shall calculate the weight matrix from the multiple alignment:&lt;br /&gt;
&lt;br /&gt;
 VFAAA&lt;br /&gt;
 VHYWW&lt;br /&gt;
 VLQPK&lt;br /&gt;
 LREWQ&lt;br /&gt;
 LPYIH&lt;br /&gt;
&lt;br /&gt;
Go to the EasyPred web-server and paste the five peptide sequences from above into the &#039;&#039;&#039;Paste in training examples&#039;&#039;&#039; window. Select &#039;&#039;&#039;No clustering&#039;&#039;&#039;, and set &#039;&#039;&#039;Weight on prior&#039;&#039;&#039; to 10000. Note that this weight on prior (or weight on pseudo count) is arbitrary but very large, which makes the calculations easier. Next, press &#039;&#039;&#039;Submit query&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
Have a look at the sequence logo.&lt;br /&gt;
&lt;br /&gt;
*&#039;&#039;&#039;QUESTION 2.3&#039;&#039;&#039;: How many different amino acids are present at the P1 position in the logo (just give a rough estimate)?&lt;br /&gt;
*&#039;&#039;&#039;QUESTION 2.4&#039;&#039;&#039;: How many different amino acids are present at the P1 position in the binding data (the multiple alignment you used as input)?&lt;br /&gt;
&lt;br /&gt;
Click on the link &#039;&#039;&#039;Parameters for prediction method&#039;&#039;&#039;. This will open a window with the weightmatrix estimated from the input peptide sequences. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Try to reproduce the matrix values for P1(I) and P1(K). Remember, you shall use the Blosum substitution matrix to estimate the pseudo frequency for the two amino acids as we did in the lecture. Here is a link to a file containing the [https://teaching.healthtech.dtu.dk/material/22111/files/BLOSUM62-probabilities.txt Blosum substitution probability matrix], and here is a link to a file with the [https://teaching.healthtech.dtu.dk/material/22111/files/Background_freq.txt background frequencies] for the different amino acids. Be careful to use the Blosum matrix correctly, i.e. remember that each row contains the conditional probabilities P(J|I). Also, remember that the values in the weight matrix are calculated as:&lt;br /&gt;
&lt;br /&gt;
[[File:Weight.png]]&lt;br /&gt;
&lt;br /&gt;
where &#039;&#039;p&amp;lt;sub&amp;gt;a&amp;lt;/sub&amp;gt;&#039;&#039; is the &#039;&#039;estimated&#039;&#039; frequency of a given amino acid &#039;&#039;a&#039;&#039; (calculated as described in the lecture including pseudo counts), and &#039;&#039;q&amp;lt;sub&amp;gt;a&amp;lt;/sub&amp;gt;&#039;&#039; is the background frequency. You might not get exactly identical numbers to those of the EasyPred program. So, +/- 20% is fine!&lt;br /&gt;
&lt;br /&gt;
===Prediction of MHC-peptide binding===&lt;br /&gt;
&lt;br /&gt;
====Data====&lt;br /&gt;
&lt;br /&gt;
You shall now use the EasyPred web-interface to train and evaluate a series of different MHC-peptide binding predictors. The EasyPred web-server is an interface that allows for easy training and performance evaluation of weight matrix and artificial neural network prediction methods. In this exercise, you shall only use the part of the web-server that trains weight-matrices.&lt;br /&gt;
&lt;br /&gt;
You shall use three files (eval.set, small.train.set and large.train.set) that contain peptides and their binding affinity to the MHC allele HLA-A*0201. In the two training sets you find peptides with a binding affinity of either 0.1 (this value has absolutely no meaning; it is set to 0.1 for practical reasons) or 1, where 0.1 indicates a non-binder and 1 indicates that the peptide binds to the MHC complex. The [https://teaching.healthtech.dtu.dk/material/22111/files/small.train.set small.train.set] contains 110 peptides, whereas the [https://teaching.healthtech.dtu.dk/material/22111/files/large.train.set large.train.set] contains 232 peptides.&lt;br /&gt;
&lt;br /&gt;
For the evaluation set (eval.set) the affinities are given as real values. A high value indicates strong binding (&#039;&#039;&#039;a value of 0.5 corresponds to a binding affinity of approximately 200 nM&#039;&#039;&#039;). The values are calculated from the actual nM binding affinities using the relation:&lt;br /&gt;
&lt;br /&gt;
x = 1 - log(aff nM)/log(50000)&lt;br /&gt;
&lt;br /&gt;
A peptide that binds with an affinity stronger than 500 nM is said to be an intermediate binder, and a peptide that binds stronger than 50 nM is a high binder. &#039;&#039;&#039;Note that a low affinity value (in nM) means strong binding&#039;&#039;&#039;. As a rule of thumb, a peptide must be at least an intermediate binder in order to induce an immune response. Using the above transformation, an intermediate binder (500 nM) will have a value of 0.426 and a strong binder (50 nM) a value of 0.638. Again, note that a high transformed value corresponds to a low affinity value, and hence to a strong binder. The [https://teaching.healthtech.dtu.dk/material/22111/files/eval.set eval.set] contains 1266 peptides. Click on the filenames to view the content of the files.&lt;br /&gt;
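&lt;br /&gt;
The transformation and the two thresholds quoted above can be checked with a few lines of Python. This is only an illustrative sketch; the function and constant names are hypothetical, and the natural logarithm is used (any base gives the same ratio).&lt;br /&gt;
&lt;br /&gt;
```python
import math

MAX_AFFINITY_NM = 50000.0  # the constant from the relation above

def transform(aff_nm):
    """x = 1 - log(aff nM) / log(50000)."""
    return 1.0 - math.log(aff_nm) / math.log(MAX_AFFINITY_NM)

def inverse_transform(x):
    """Recover the affinity in nM from a transformed value."""
    return MAX_AFFINITY_NM ** (1.0 - x)

print(round(transform(500), 3))  # intermediate binder threshold: 0.426
print(round(transform(50), 3))   # strong binder threshold: 0.638
print(round(inverse_transform(0.5), 1))
```
&lt;br /&gt;
The last line shows which nM affinity a transformed value of 0.5 corresponds to.&lt;br /&gt;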
&lt;br /&gt;
During the exercise you shall use the files small.train.set and eval.set often. It might therefore be smart if you save these files on the Desktop on your laptop. You do that by clicking on the files names ([https://teaching.healthtech.dtu.dk/material/22111/files/eval.set eval.set], [https://teaching.healthtech.dtu.dk/material/22111/files/small.train.set small.train.set]) and saving the files as text files on the Desktop. You can also just copy and paste the files every time you need them.&lt;br /&gt;
&lt;br /&gt;
The accuracy of a prediction method can be measured in many ways. Here we use two measures: the area under the receiver operating characteristic (ROC) curve (Aroc) and the Pearson correlation. Both measures have a value of 1 for a perfect method. For a random method, the Aroc has a value of 0.5, whereas the Pearson correlation is 0. When you evaluate the performance of prediction methods, you therefore rank the method with the highest Aroc and Pearson correlation first (if you want to know more, use Google to find information on how the two measures are calculated).&lt;br /&gt;
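&lt;br /&gt;
For readers who prefer code to a search engine, the two measures can be sketched as follows. This is a minimal illustration and not necessarily how EasyPred computes them: Aroc is taken here as the fraction of (binder, non-binder) pairs in which the binder gets the higher predicted score, with ties counting one half, and the binder labels are assumed to be given as booleans. All names and example values are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def aroc(scores, is_binder):
    """Area under the ROC curve via pairwise ranking of binders and non-binders."""
    pos = [s for s, b in zip(scores, is_binder) if b]
    neg = [s for s, b in zip(scores, is_binder) if not b]
    credit = 0.0
    for p in pos:
        for n in neg:
            if p == n:
                credit += 0.5          # tie: half credit
            elif max(p, n) == p:
                credit += 1.0          # binder ranked above non-binder
    return credit / (len(pos) * len(neg))

# Hypothetical predictions for four peptides, two of which are binders.
labels = [True, True, False, False]
print(aroc([0.9, 0.8, 0.3, 0.1], labels))  # perfect ranking
print(aroc([0.9, 0.1, 0.8, 0.3], labels))  # no better than random
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```
&lt;br /&gt;
Aroc only depends on how the peptides are ranked, whereas the Pearson correlation also depends on how well the predicted values follow the measured values linearly.&lt;br /&gt;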
&lt;br /&gt;
===Weight Matrix generation===&lt;br /&gt;
&lt;br /&gt;
You shall now use the EasyPred web-server to train a series of methods to predict peptide-MHC binding. Go to the [https://services.healthtech.dtu.dk/service.php?EasyPred EasyPred] web-server (again click on the right mouse button, and select open in a new window). When you go through the exercise, please read the instructions carefully (all of them). Many of the examples only make sense if you do exactly as described in the text.&lt;br /&gt;
&lt;br /&gt;
====Small training set====&lt;br /&gt;
&lt;br /&gt;
In the upload training examples window, browse and select the small.train.set file from the Desktop; in the upload evaluation window, browse and select the eval.set file from the Desktop. &#039;&#039;&#039;Important!&#039;&#039;&#039; In the window &amp;quot;Cutoff for counting an example as a positive example&amp;quot; type 0. Now press &#039;&#039;&#039;Submit query&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: What is the predictive performance of the matrix method? (Pearson coefficient and Aroc values)&lt;br /&gt;
View the logo plot of the calculated matrix. Can you understand why the matrix performs so poorly?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many of the 110 peptides in the small.train.set are included in the matrix construction (Look for number of positive training examples)?&lt;br /&gt;
&lt;br /&gt;
Go back to the EasyPred server window. Set the &amp;quot;&#039;&#039;&#039;Cutoff for counting an example as a positive example&#039;&#039;&#039;&amp;quot; to 0.5. Set clustering method to No clustering and the weight on prior to 0.0 and redo calculation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: What is the predictive performance of the matrix method now?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: How many of the 110 peptides in the small.train.set are included in the matrix construction?&lt;br /&gt;
View the logo plot of the calculated matrix. Does the logo resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?&lt;br /&gt;
&lt;br /&gt;
Return to the EasyPred server window (use the Back button). Set the clustering method to &#039;&#039;&#039;Clustering at 62% identity&#039;&#039;&#039;. Keep the weight on prior at 0.0. Redo the calculation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: &lt;br /&gt;
*What is the predictive performance of the matrix method now?&lt;br /&gt;
*Again view the logo plot of the calculated matrix. Has it changed compared to the previous calculation?&lt;br /&gt;
&lt;br /&gt;
Return to the EasyPred server window (use the Back button). Keep the clustering method at &#039;&#039;&#039;Clustering at 62% identity&#039;&#039;&#039;. Set the &#039;&#039;&#039;weight on prior to 200.0&#039;&#039;&#039;, and redo the calculation.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: &lt;br /&gt;
* What is the predictive performance of the matrix method now?&lt;br /&gt;
* View the logo plot of the calculated matrix. What is the big difference between this logo and the two previous ones? (how many different amino acids are present at each position in the binding motif?)&lt;br /&gt;
* What are reasons for these differences?&lt;br /&gt;
* Does the logo begin to resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?&lt;br /&gt;
&lt;br /&gt;
Note that, using as few as 10 binding peptides in the small.train.set, you were able to derive a weight matrix with a reasonable predictive performance and a corresponding logo plot that captured many of the features observed in the logo plot calculated from the large training data set.&lt;br /&gt;
&lt;br /&gt;
====Large training set====&lt;br /&gt;
&lt;br /&gt;
Now you shall do the last matrix training, using a larger set of data. Return to the EasyPred server window and press &#039;&#039;&#039;Clear fields&#039;&#039;&#039;. Upload the large.train.set in the &amp;quot;&#039;&#039;&#039;Paste in training examples&#039;&#039;&#039;&amp;quot; window and the eval.set in the &amp;quot;&#039;&#039;&#039;Paste in evaluation examples&#039;&#039;&#039;&amp;quot; window. Leave all other options unchanged (Cluster at 62% identity, and weight on prior at 200). Select &#039;&#039;&#039;Sort output on predicted values&#039;&#039;&#039; and submit the query.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What is the predictive performance of the matrix method now?&lt;br /&gt;
View the logo plot of the calculated matrix. Does the logo compare to the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Look at the prediction list. How many false positive hits do you find among the top 20 highest scoring peptides (Assignment score &amp;lt; 0.426)?&lt;br /&gt;
&lt;br /&gt;
Before you continue click on the &amp;quot;Parameters for prediction method&amp;quot; link and save the content to a file on your Desktop (&amp;lt;tt&amp;gt;para.dat&amp;lt;/tt&amp;gt; for instance).&lt;br /&gt;
&lt;br /&gt;
===Finding epitopes in real proteins===&lt;br /&gt;
&lt;br /&gt;
As the last part, you shall use the weight matrix to find potential epitopes in the SARS virus. In the EasyPred web-interface, press &#039;&#039;&#039;Clear fields&#039;&#039;&#039; to reset all parameter fields.&lt;br /&gt;
&lt;br /&gt;
#Go to the UniProt home-page: http://uniprot.org. Search for a SARS entry by typing &amp;quot;&#039;&#039;Sars virus&#039;&#039;&amp;quot; in the search window. Click your way to the FASTA format for one of the proteins (select your protein of interest, click &amp;lt;u&amp;gt;Format&amp;lt;/u&amp;gt; near the top of the page, and then click &amp;quot;FASTA (canonical)&amp;quot;). Backup sequence: [https://teaching.healthtech.dtu.dk/material/22111/files/P59595.fsa P59595 in FASTA format]&lt;br /&gt;
#Paste the FASTA file into the Paste in evaluation examples window. Upload the weight matrix parameter file (&amp;lt;tt&amp;gt;para.dat&amp;lt;/tt&amp;gt;) from before into the Load saved prediction method window. Make sure that the option for sorting the output is set to Sort output on predicted values, and press Submit query.&lt;br /&gt;
#Now the top scoring peptides should be the peptides that resemble the binding motif (i.e. your weight matrix) the most. Selecting the top 10 peptides you should have a high chance of including a set of potential HLA-A*0201 epitopes. These you could sell to a big pharma company and become rich and famous!!&lt;br /&gt;
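&lt;br /&gt;
What the server does in the steps above can be pictured as sliding a window along the protein, scoring each peptide with the weight matrix, and sorting by score. The sketch below illustrates that idea only; it is not the EasyPred code. The three-position matrix and the sequence are hypothetical toy values (a real matrix has one column per position of the binding motif and weights for all 20 amino acids), and residues missing from the toy matrix are assumed to score 0.&lt;br /&gt;
&lt;br /&gt;
```python
def score_peptide(peptide, matrix):
    """Sum the position-specific weights of the residues in a peptide."""
    return sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(peptide))

def top_peptides(protein, matrix, top_n=10):
    """Score every window of matrix length and return the best top_n peptides."""
    k = len(matrix)
    windows = [protein[i:i + k] for i in range(len(protein) - k + 1)]
    ranked = sorted(windows, key=lambda pep: score_peptide(pep, matrix), reverse=True)
    return ranked[:top_n]

# Hypothetical toy matrix with three positions, NOT a real HLA-A*0201 matrix.
toy_matrix = [{"L": 2.0}, {"A": 1.0}, {"V": 3.0}]
toy_protein = "MLAVLAK"  # hypothetical sequence

for pep in top_peptides(toy_protein, toy_matrix, top_n=2):
    print(pep, score_peptide(pep, toy_matrix))
```
&lt;br /&gt;
With the real matrix saved in &amp;lt;tt&amp;gt;para.dat&amp;lt;/tt&amp;gt;, the same ranking is what the sorted EasyPred output list shows.&lt;br /&gt;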
&lt;br /&gt;
&#039;&#039;Now you are done!!&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=638</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=638"/>
		<updated>2025-10-28T12:41:29Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Epitope prediction */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms (find the upper level taxonomical group, that ties them together). &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (Can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, where the red blood cells (erythrocytes) are invaded as in malaria, and which will lead to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case lack of oxygen carrying capacity in the blood); see the Tree of Life page for this organism for images of infected erythrocytes.)&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence related data is shown. For instance the &amp;quot;Gene&amp;quot; link describes how many genes have been identified in the genome (including both manually curated genes as well as genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hint:&#039;&#039;&#039; Follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many are there of these, respectively? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Save&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look at the list of results, where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;, again. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list after, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either choose to download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits that are from &#039;&#039;Pf&#039;&#039; 3D7 (there should be three of them; otherwise, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one domain named &amp;quot;Duffy binding domain&amp;quot; is found in several copies in all our three erythrocyte membrane proteins. What are the identifiers of this domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
Now, set &amp;lt;u&amp;gt;Feature Display Mode&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; (instead of &amp;lt;u&amp;gt;Summary&amp;lt;/u&amp;gt;) and scroll down.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Here, you see that InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular? Are the Duffy binding domains intra- or extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples. &#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy-associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and that you&#039;ll need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039; — there also exist &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Before running the B-cell epitope prediction, you will need to create a hotspot with your own mobile phone, since all submissions from DTU run in a single queue and each job does not start until the previous one has finished.&lt;br /&gt;
:https://services.healthtech.dtu.dk/services/BepiPred-2.0/&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. You can download the sequence by clicking Download Files &amp;gt; FASTA Sequence.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal, EM, or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.53&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;5 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;6&#039;&#039;&#039; such epitopes, and the second one starts at position &#039;&#039;&#039;L23&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
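If you would rather extract the epitopes from a list of per-residue scores than read them off the plot, the thresholding step can be sketched as below. This is only an illustration: the function name &lt;tt&gt;epitope_runs&lt;/tt&gt; and the example scores are made up, and it assumes that a residue counts as part of an epitope when its score is at or above the threshold, which may differ from how the server itself draws the line.&lt;br /&gt;

```python
def epitope_runs(scores, threshold=0.53, min_length=5):
    """Return 1-based (start, end) stretches of consecutive residues
    whose score is at or above the threshold (assumed rule)."""
    runs = []
    start = None
    for index, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = index  # a new stretch begins here
        elif start is not None:
            # the stretch ended at the previous residue
            if index - start >= min_length:
                runs.append((start, index - 1))
            start = None
    # handle a stretch that runs to the end of the sequence
    if start is not None and len(scores) - start + 1 >= min_length:
        runs.append((start, len(scores)))
    return runs


# Made-up scores for illustration only, NOT real BepiPred output.
example_scores = [0.6] * 5 + [0.4] + [0.6] * 3 + [0.4] + [0.7] * 6
print(epitope_runs(example_scores))
```

Stretches shorter than &lt;tt&gt;min_length&lt;/tt&gt; are dropped, which mirrors the "at least 5 amino acids" rule above.&lt;br /&gt;
&lt;br /&gt;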
[[image:BepiPred2.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
 &#039;&#039;&#039;Hint:&#039;&#039;&#039; If the prediction takes more than 5 minutes to run, you can follow this link to pre-calculated results: &lt;br /&gt;
 https://teaching.healthtech.dtu.dk/material/22111/BepiPred-2.0/&lt;br /&gt;
 &amp;lt;!-- https://services.healthtech.dtu.dk/cgi-bin/webface2.cgi?jobid=6900B2C30023DC3C6C456FAC&amp;amp;wait=20&lt;br /&gt;
 Please note that this will only work until Wed 29-10-25 at 13hs --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
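&lt;br /&gt;
The coordinate transformation for the table is a simple shift, and a minimal sketch of it is shown below. The helper names and all numbers are hypothetical and are not the answer to the exercise; &lt;tt&gt;domain_start&lt;/tt&gt; stands for the position in the full protein that corresponds to position 1 in the extracted FASTA sequence (your answer to question 4c).&lt;br /&gt;

```python
def to_full_protein(position, domain_start):
    """Map a 1-based position in the domain FASTA to the full protein.

    domain_start is the full-protein position that corresponds to
    position 1 of the domain FASTA.
    """
    return position + domain_start - 1


def epitope_table(epitopes, domain_start):
    """Return rows of (start, end, length, full_start, full_end)."""
    rows = []
    for start, end in epitopes:
        length = end - start + 1  # both ends are included
        rows.append((start, end, length,
                     to_full_protein(start, domain_start),
                     to_full_protein(end, domain_start)))
    return rows


# Made-up example values, NOT the real answer to the exercise.
example_epitopes = [(4, 10), (40, 47)]
example_domain_start = 101

print("start", "end", "length", "full_start", "full_end", sep="\t")
for row in epitope_table(example_epitopes, example_domain_start):
    print(*row, sep="\t")
```

Note the "minus one": position 1 of the extracted sequence maps to &lt;tt&gt;domain_start&lt;/tt&gt; itself, not to the residue after it.&lt;br /&gt;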
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it&#039;s still a good idea to check it visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
&lt;br /&gt;
Now it&#039;s time to visualize the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), you can help yourself by setting the base colour to a neutral grey, and with a basic &amp;quot;cartoon&amp;quot; visualization as the first step:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
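&lt;br /&gt;
If you do not want to type every command by hand, the select and color lines can also be generated from your epitope table. The sketch below is one possible way to do it: the function name, the residue ranges, and the choice of colours are made up for illustration, and the output is plain text that you paste into the PyMOL command line.&lt;br /&gt;

```python
def pymol_commands(epitopes, colours):
    """Build PyMOL select/color command lines for each epitope.

    epitopes: list of (start, end) residue numbers in full-protein coordinates
    colours:  list of PyMOL colour names, one per epitope
    """
    lines = []
    pairs = zip(epitopes, colours)
    for number, ((start, end), colour) in enumerate(pairs, start=1):
        name = f"epitope_{number}"
        lines.append(f"select {name}, resi {start}-{end}")
        lines.append(f"color {colour}, {name}")
    return lines


# Made-up residue ranges, NOT the real epitopes from the exercise.
example_epitopes = [(104, 110), (140, 147)]
example_colours = ["red", "blue"]

for line in pymol_commands(example_epitopes, example_colours):
    print(line)
```

The residue numbers must be the coordinate-transformed ones (positions in the full protein), since those are the ones the structure file uses.&lt;br /&gt;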
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.&lt;br /&gt;
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface &lt;br /&gt;
to show the protein from the outside.&lt;br /&gt;
* show as → cartoon&lt;br /&gt;
* show → mesh&lt;br /&gt;
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments, with the right linker sequences, getting it expressed in a production host, follow up with animal testing and phase 1, 2 and 3 clinical trials, and the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=620</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=620"/>
		<updated>2025-10-14T13:50:56Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Searching UniProt */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms? (Find the upper-level taxonomical group that ties them together.) &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (Can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, in which the red blood cells (erythrocytes) are invaded as in malaria, leading to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case a lack of oxygen-carrying capacity in the blood). See the Tree of Life page for this organism for images of infected erythrocytes.)&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right-hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence-related data is shown. For instance, the &amp;quot;Gene&amp;quot; link describes how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hint:&#039;&#039;&#039; Follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many are there of these, respectively? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Save&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look again at the list of results where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either choose to download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits that are from &#039;&#039;Pf&#039;&#039; 3D7 (there should be three of them; otherwise, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one domain named &amp;quot;Duffy binding domain&amp;quot; is found in several copies in all our three erythrocyte membrane proteins. What are the identifiers of this domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
Now, set &amp;lt;u&amp;gt;Feature Display Mode&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; (instead of &amp;lt;u&amp;gt;Summary&amp;lt;/u&amp;gt;) and scroll down.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Here, you see that InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular? Are the Duffy binding domains intra- or extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples. &#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and you would need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039; — there also exist &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results. &lt;br /&gt;
:Please select the method called &amp;quot;BepiPred 2.0&amp;quot;&lt;br /&gt;
:http://tools.iedb.org/bcell/ &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. Alternatively, you can choose to download the sequence by clicking &amp;lt;u&amp;gt;Download Files&amp;lt;/u&amp;gt;.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.55&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;8 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;7&#039;&#039;&#039; such epitopes, and the last one starts at position &#039;&#039;&#039;276&#039;&#039;&#039; &lt;br /&gt;
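&lt;br /&gt;
To make the thresholding step concrete, here is a minimal Python sketch of how runs of at least 8 consecutive residues at or above the threshold could be pulled out of a list of per-residue scores. It is only an illustration: the function name &amp;lt;tt&amp;gt;epitope_runs&amp;lt;/tt&amp;gt; and the toy scores are made up, and we assume that a residue counts as predicted when its score is at or above the threshold, which may differ from what the server does internally.&lt;br /&gt;
&lt;br /&gt;
```python
def epitope_runs(scores, threshold=0.55, min_length=8):
    """Return (start, end) pairs, 1-based and inclusive, for runs of
    consecutive residues scoring at or above the threshold."""
    runs = []
    start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position  # a new run begins here
        else:
            # The run (if any) ended at the previous residue.
            if start is not None and position - start >= min_length:
                runs.append((start, position - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(scores) - start + 1 >= min_length:
        runs.append((start, len(scores)))
    return runs


# Made-up scores: residues 3-10 form one run of 8; residues 13-14 are too short.
toy_scores = [0.40, 0.50] + [0.60] * 8 + [0.30, 0.20] + [0.70] * 2
print(epitope_runs(toy_scores))
```
&lt;br /&gt;
In the exercise itself you read the intervals off the results page; the sketch only shows the logic behind the length filter.&lt;br /&gt;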
&lt;br /&gt;
[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
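&lt;br /&gt;
The coordinate transformation for the table can be sketched in a few lines of Python. This assumes that the extracted FASTA sequence is one contiguous stretch of the full UniProt sequence, so that position 1 in the extract corresponds to a single known position in the original protein. The value of &amp;lt;tt&amp;gt;domain_start&amp;lt;/tt&amp;gt; and the two intervals below are hypothetical placeholders, not the answers to the questions above.&lt;br /&gt;
&lt;br /&gt;
```python
def to_full_protein(position, domain_start):
    """Convert a 1-based position in the extracted domain sequence to the
    corresponding 1-based position in the full protein."""
    return position + domain_start - 1


domain_start = 101  # hypothetical: full-protein position of residue 1 in the extract
epitopes = [(5, 14), (40, 47)]  # made-up (start, end) pairs in domain coordinates

for start, end in epitopes:
    length = end - start + 1
    full_start = to_full_protein(start, domain_start)
    full_end = to_full_protein(end, domain_start)
    print(start, end, length, full_start, full_end)
```
&lt;br /&gt;
Note that the length of an epitope is the same in both coordinate systems; only the start and end positions shift.&lt;br /&gt;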
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it&#039;s still a good idea to check it visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
&lt;br /&gt;
Now it&#039;s time to visualize the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), it helps to set the base colour to a neutral grey and to use a basic &amp;quot;cartoon&amp;quot; visualization as the first step:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
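&lt;br /&gt;
If you prefer not to type all seven selections by hand, a small Python helper can write the PyMOL command lines for you. The sketch below is one possible approach, not a required part of the exercise: the function name &amp;lt;tt&amp;gt;pymol_commands&amp;lt;/tt&amp;gt; is made up, and the two intervals are placeholders rather than real epitopes. It only produces text using the &amp;lt;tt&amp;gt;select&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;color&amp;lt;/tt&amp;gt; commands, which you can then paste into the PyMOL command line.&lt;br /&gt;
&lt;br /&gt;
```python
def pymol_commands(epitopes, colours):
    """Build PyMOL select/color command lines, one named selection per epitope.

    epitopes: list of (start, end) residue numbers in full-protein coordinates
    colours:  list of PyMOL colour names, one per epitope
    """
    lines = []
    for (start, end), colour in zip(epitopes, colours):
        name = f"epitope_{start}"  # name each selection after its first position
        lines.append(f"select {name}, resi {start}-{end}")
        lines.append(f"color {colour}, {name}")
    return lines


epitopes = [(105, 114), (140, 147)]  # made-up intervals, for illustration only
colours = ["red", "blue"]
print("\n".join(pymol_commands(epitopes, colours)))
```
&lt;br /&gt;
Replace the placeholder intervals with the full-protein coordinates from your table in &#039;&#039;&#039;question 4d&#039;&#039;&#039;, and supply one colour name per epitope.&lt;br /&gt;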
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.&lt;br /&gt;
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface &lt;br /&gt;
to show the protein from the outside.&lt;br /&gt;
* show as → cartoon&lt;br /&gt;
* show → mesh&lt;br /&gt;
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments, with the right linker sequences, getting it expressed in a production host, follow up with animal testing and phase 1, 2 and 3 clinical trials, and the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=619</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=619"/>
		<updated>2025-10-14T13:46:37Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Analysis of membrane protein domain structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms (find the upper level taxonomical group, that ties them together). &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (Can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, in which the red blood cells (erythrocytes) are invaded as in malaria, leading to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case lack of oxygen-carrying capacity in the blood); see the Tree of Life page for this organism for images of infected erythrocytes.)&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence related data is shown. For instance, the &amp;quot;Gene&amp;quot; link shows how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hint:&#039;&#039;&#039; Follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many are there of these, respectively? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Close&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look again at the list of results where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and open them in a text editor, or preview them in the browser. In the latter case, keep the browser window with the sequences open; we will need them later in the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins that we have now found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure; the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits that are from &#039;&#039;Pf&#039;&#039; 3D7 (there should be three of them; otherwise, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one domain named &amp;quot;Duffy binding domain&amp;quot; is found in several copies in all three of our erythrocyte membrane proteins. What are the identifiers of this domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
Now, set &amp;lt;u&amp;gt;Feature Display Mode&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;Full&amp;lt;/u&amp;gt; (instead of &amp;lt;u&amp;gt;Summary&amp;lt;/u&amp;gt;) and scroll down.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Here, you see that InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular? Are the Duffy binding domains intra- or extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples. &#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy-associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and you&#039;ll need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein that the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so-called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039;; there also exist &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results. &lt;br /&gt;
:Please select the method called &amp;quot;BepiPred 2.0&amp;quot;&lt;br /&gt;
:http://tools.iedb.org/bcell/ &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. Alternatively, you can choose to download the sequence by clicking &amp;lt;u&amp;gt;Download Files&amp;lt;/u&amp;gt;.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.55&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;8 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;7&#039;&#039;&#039; such epitopes, and the last one starts at position &#039;&#039;&#039;276&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
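&lt;br /&gt;
As an illustration of the coordinate transformation, here is a small worked example with made-up numbers (they are &#039;&#039;not&#039;&#039; the answer for VAR2CSA). Assume that position 1 of the extracted FASTA sequence corresponds to position 501 of the original protein, and that an epitope is predicted at positions 20-31 of the extracted sequence:&lt;br /&gt;
&lt;br /&gt;
 offset         = 501 - 1      = 500&lt;br /&gt;
 original start = 20 + 500     = 520&lt;br /&gt;
 original end   = 31 + 500     = 531&lt;br /&gt;
 length         = 31 - 20 + 1  = 12&lt;br /&gt;
&lt;br /&gt;
The same offset applies to every epitope, provided the extracted sequence is one contiguous stretch of the original protein.&lt;br /&gt;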
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it&#039;s still a good idea to check it visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
&lt;br /&gt;
Now it&#039;s time to work with visualization of the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), a good first step is to set the base colour to a neutral grey and switch to a basic &amp;quot;cartoon&amp;quot; visualization:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme, for example epitope_1 to epitope_7, or reference the first position (e.g. epitope_276 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
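&lt;br /&gt;
A minimal sketch of what this could look like for two of the epitopes. The residue ranges are placeholders, not the real epitope positions, and the colour names are only suggestions. Since the numbering in the structure follows the full UniProt sequence, use the coordinate-transformed positions:&lt;br /&gt;
&lt;br /&gt;
 select epitope_1, resi 520-531&lt;br /&gt;
 color red, epitope_1&lt;br /&gt;
 select epitope_2, resi 560-569&lt;br /&gt;
 color blue, epitope_2&lt;br /&gt;
 set seq_view, 1&lt;br /&gt;
 deselect&lt;br /&gt;
&lt;br /&gt;
The last two commands turn on the sequence viewer and clear the on-screen selection markers; the named selections themselves are kept.&lt;br /&gt;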
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to note in your report which chain you have chosen to work with.&lt;br /&gt;
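&lt;br /&gt;
If you prefer typing to clicking, the same steps could be sketched on the command line as follows (assuming the object names ka and kb suggested above, and that you choose to continue with chain A):&lt;br /&gt;
&lt;br /&gt;
 create kb, chain B&lt;br /&gt;
 disable all&lt;br /&gt;
 enable ka&lt;br /&gt;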
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface &lt;br /&gt;
to show the protein from the outside.&lt;br /&gt;
* show as → cartoon&lt;br /&gt;
* show → mesh&lt;br /&gt;
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.&lt;br /&gt;
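&lt;br /&gt;
As a rough command-line equivalent of the menu choices above (assuming you kept chain A in an object named ka), the first line gives the outside view and the last two give the combined inside/outside view:&lt;br /&gt;
&lt;br /&gt;
 show_as surface, ka&lt;br /&gt;
 show_as cartoon, ka&lt;br /&gt;
 show mesh, ka&lt;br /&gt;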
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments, with the right linker sequences, getting it expressed in a production host, follow up with animal testing and phase 1, 2 and 3 clinical trials, and the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=618</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=618"/>
		<updated>2025-10-14T13:37:43Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Analysis of membrane protein domain structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we covered earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms (find the higher-level taxonomic group that ties them together)? &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (Can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, where the red blood cells (erythrocytes) are invaded as in malaria, and which will lead to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case lack of oxygen-carrying capacity in the blood); see the Tree of Life page for this organism for images of infected erythrocytes.)&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right-hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence-related data is shown. For instance, the &amp;quot;Gene&amp;quot; link describes how many genes have been identified in the genome (including both manually curated genes as well as genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hint:&#039;&#039;&#039; Follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many hits does each of these two terms give? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Close&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look again at the list of results where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either choose to download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits that are from &#039;&#039;Pf&#039;&#039; 3D7 (there should be three of them; if not, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one domain named &amp;quot;Duffy binding domain&amp;quot; is found in several copies in all our three erythrocyte membrane proteins. What are the identifiers of this domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Under &amp;quot;Other Features&amp;quot;, InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular? Are the Duffy binding domains intra- or extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples. &#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contracting malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and that you&#039;ll need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein that the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so-called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039;; there are also &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results. &lt;br /&gt;
:Please select the method called &amp;quot;BepiPred 2.0&amp;quot;&lt;br /&gt;
:http://tools.iedb.org/bcell/ &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. Alternatively, you can choose to download the sequence by clicking &amp;lt;u&amp;gt;Download Files&amp;lt;/u&amp;gt;.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.55&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;8 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;7&#039;&#039;&#039; such epitopes, and the last one starts at position &#039;&#039;&#039;276&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
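&lt;br /&gt;
The coordinate transformation is a simple shift. The sketch below assumes that the PDB FASTA sequence is one contiguous stretch of the UniProt sequence, without extra residues such as expression tags at the start (check this against your answer to question 4c):&lt;br /&gt;
&lt;br /&gt;
 position in original protein = position in domain + (original position of domain residue 1) - 1&lt;br /&gt;
&lt;br /&gt;
For example, with a purely hypothetical offset where domain position 1 corresponds to original position 500, domain position 276 would map to 276 + 500 - 1 = 775. Use the offset you noted down yourself, not this made-up value.&lt;br /&gt;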
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it&#039;s still a good idea to check it visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
&lt;br /&gt;
Now it&#039;s time to work with visualization of the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), it helps to start by setting the base colour to a neutral grey and switching to a basic &amp;quot;cartoon&amp;quot; visualization:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
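&lt;br /&gt;
As a minimal sketch of the pattern for a single epitope (the residue range 100-110 and the colour are placeholders chosen for illustration, not one of the predicted epitopes), the commands could look like this:&lt;br /&gt;
&lt;br /&gt;
 # placeholder residue range, replace with your own transformed coordinates&lt;br /&gt;
 select epitope_1, resi 100-110&lt;br /&gt;
 color red, epitope_1&lt;br /&gt;
 # turn on the sequence viewer&lt;br /&gt;
 set seq_view, 1&lt;br /&gt;
&lt;br /&gt;
Repeat the select and color lines with a new name, range and colour for each of the remaining epitopes.&lt;br /&gt;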
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.&lt;br /&gt;
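&lt;br /&gt;
If you prefer typing over clicking, the same step can be sketched with commands (the object names ka and kb are simply the suggestions from above, and chain A is chosen here only as an example):&lt;br /&gt;
&lt;br /&gt;
 create kb, chain B&lt;br /&gt;
 # hide all representations of all objects&lt;br /&gt;
 hide everything&lt;br /&gt;
 # show only the A-chain object, use kb instead if you chose chain B&lt;br /&gt;
 show cartoon, ka&lt;br /&gt;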
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface &lt;br /&gt;
to show the protein from the outside.&lt;br /&gt;
* show as → cartoon&lt;br /&gt;
* show → mesh&lt;br /&gt;
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.&lt;br /&gt;
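&lt;br /&gt;
The command-line equivalents would be roughly as follows, assuming you continued with an object named ka from the previous step:&lt;br /&gt;
&lt;br /&gt;
 # outside view only&lt;br /&gt;
 show_as surface, ka&lt;br /&gt;
 # inside and outside at the same time&lt;br /&gt;
 show_as cartoon, ka&lt;br /&gt;
 show mesh, ka&lt;br /&gt;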
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface-accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments and the right linker sequences, getting it expressed in a production host, and following up with animal testing and phase 1, 2 and 3 clinical trials. After that, the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=617</id>
		<title>Answers:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=617"/>
		<updated>2025-10-14T13:37:15Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* 3 - Analysis of membrane protein domain structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Answers to case study exercise about malaria vaccines&#039;&#039;&#039; (NB: numbers etc. found in the databases 11-10-2023):&lt;br /&gt;
&lt;br /&gt;
== 1 - What exactly is malaria? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1a)&#039;&#039;&#039; If you search for &amp;quot;malaria&amp;quot; on NCBI&#039;s Taxonomy page, you find some mosquitoes and some protozoans with the genus name &#039;&#039;Plasmodium&#039;&#039;. Clicking the name of one of these (twice) gets you to a page where you can see the &#039;&#039;lineage&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
* Genus: &#039;&#039;Plasmodium&#039;&#039;&lt;br /&gt;
* Phylum: &#039;&#039;Apicomplexa&#039;&#039; &lt;br /&gt;
* (Super)Kingdom: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1b)&#039;&#039;&#039; On NCBI&#039;s Taxonomy page there is a function named &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi Taxonomy common tree]&amp;quot; which gives a nice overview. Alternatively, you can open the taxonomy pages for the two organisms to compare, and see from their &#039;&#039;lineages&#039;&#039; how much they have in common. &lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Aconoidasida&#039;&#039;&lt;br /&gt;
Here is the picture you can get from the &amp;quot;Taxonomy common tree&amp;quot; function:&lt;br /&gt;
[[Image:Common Taxonomy Tree.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1c)&#039;&#039;&#039; On [http://www.cdc.gov/dpdx/malaria/ CDC&#039;s page about malaria] or on [http://tolweb.org/Plasmodium/68071 Tree of Life&#039;s page about Plasmodium] you find:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;P. malariae&#039;&#039;, &#039;&#039;P. ovale&#039;&#039;, &#039;&#039;P. falciparum&#039;&#039; and &#039;&#039;P. vivax&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
By looking up these four species in NCBI Taxonomy and looking at the table to the right, you see that &#039;&#039;&#039;all four&#039;&#039;&#039; species have a full genome in the databases (see the link named &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Entrez records&amp;lt;/u&amp;gt;). &lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 2 - Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
&lt;br /&gt;
===2a)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &#039;&#039;&#039;14&#039;&#039;&#039; chromosomes. Actually, this is not as easy to find as it used to be. Previously, you got a list of the 14 chromosomes just by following the &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; link. Now, you can see the previous Genome page with the chromosomes by following the link labeled &amp;quot;View the legacy Genome page&amp;quot;. Alternatively, you can see a list of the chromosomes by clicking the link under &amp;quot;Reference genome&amp;quot; (Genome assembly GCA_000002765). --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5566&#039;&#039;&#039; not hypothetical genes (search details below)&lt;br /&gt;
 txid36329[Organism:noexp] NOT hypothetical[All Fields] AND alive[prop]&lt;br /&gt;
&lt;br /&gt;
If you instead found 5570 not hypothetical genes, it is because you found the species &#039;&#039;Plasmodium falciparum&#039;&#039; (taxID:5833) in NCBI Taxonomy instead of the specific isolate 3D7 (taxID:36329) as specified in the exercise.&lt;br /&gt;
&lt;br /&gt;
===2b)===&lt;br /&gt;
&lt;br /&gt;
The correct search strings&lt;br /&gt;
 (taxonomy_id:5833)&lt;br /&gt;
or &lt;br /&gt;
 (organism_name:&amp;quot;Plasmodium falciparum&amp;quot;)&lt;br /&gt;
both give &#039;&#039;&#039;129,611&#039;&#039;&#039; hits in total, &#039;&#039;&#039;483&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;129,128&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
If you only found 34,196 hits, it was because you used &lt;br /&gt;
 (organism_id:5833)&lt;br /&gt;
which only gives those &#039;&#039;Pf&#039;&#039; proteins that do &#039;&#039;not&#039;&#039; have a specified strain or isolate — cf. question 3.4+3.5 in [[Exercise: The protein database UniProt|the UniProt exercise]].  &lt;br /&gt;
&lt;br /&gt;
If, on the other hand, you found 131,860 hits, it was because you searched in All instead of specifying the search field:&lt;br /&gt;
 Plasmodium falciparum&lt;br /&gt;
In that case, you will include some proteins that originate from e.g. humans but play a role in &#039;&#039;Plasmodium falciparum&#039;&#039; infection, which may be mentioned in some comment field or reference title.&lt;br /&gt;
&lt;br /&gt;
===2c)===&lt;br /&gt;
&lt;br /&gt;
This can be solved in several ways:&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:36329)&amp;lt;/tt&amp;gt; (either selecting the right isolate from the drop-down menu or using the TaxID you found in the Taxonomy database)&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;(organism_name:&amp;quot;Plasmodium falciparum&amp;quot;) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
They all give: &#039;&#039;&#039;5,495&#039;&#039;&#039; in total, &#039;&#039;&#039;295&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;5,200&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
That corresponds &#039;&#039;approximately&#039;&#039; to the number of genes found in &#039;&#039;&#039;2a)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
===2d)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:*)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;24,439&#039;&#039;&#039; (&#039;&#039;&#039;390&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;24,049&#039;&#039;&#039; from TrEMBL).&lt;br /&gt;
&lt;br /&gt;
===2e)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:secreted)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0243)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;594&#039;&#039;&#039; (39 from Swiss-Prot).&lt;br /&gt;
&lt;br /&gt;
===2f)===&lt;br /&gt;
&lt;br /&gt;
surface:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:surface)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0310)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;413&#039;&#039;&#039; hits&lt;br /&gt;
&lt;br /&gt;
membrane: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:membrane)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0162)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;14,515&#039;&#039;&#039; hits&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2g)===&lt;br /&gt;
&lt;br /&gt;
Potentially useful (found in the cell membrane): &lt;br /&gt;
* Q7KQL9 / ALF_PLAF7 / Fructose-bisphosphate aldolase: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* A0A2I0BVG8 / CDPK1_PLAFO / Calcium-dependent protein kinase 1: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* W7KN63 / W7KN63_PLAFO / Merozoite surface antigen 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q8IFM5 / RH5_PLAF7 / Reticulocyte-binding protein homolog 5: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* O97364 / SUB2_PLAFA / Subtilisin-like protease 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* C6KSX0 / PF12_PLAF7 / Merozoite surface protein P12: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8I1Y0 / PF41_PLAF7 / Merozoite surface protein P41: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8IDN0 / PFS47_PLAF7 / Female gametocyte surface protein P47: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* P62343 / CDPK1_PLAFK / Calcium-dependent protein kinase 1: &amp;quot;In the parasite and on erythrocytic membrane at a lower level&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Definitely not useful (found in an inner membrane):&lt;br /&gt;
* Q8I6V3 / PLM2_PLAF7 / Plasmepsin II: &amp;quot;Vacuole membrane&amp;quot;&lt;br /&gt;
* U3M186 / U3M186_PLAFA / Cytochrome c oxidase subunit 1: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* O97321 / O97321_PLAF7 / GlcNAc-1-P transferase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q8I719 / KGP_PLAF7 / cGMP-dependent protein kinase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
* Q8IDR3 / MYOA_PLAF7 / Myosin-A: &#039;&#039;&#039;NB:&#039;&#039;&#039; even though this is found associated with the cell membrane, it is useless, because it is a Peripheral membrane protein bound to the &#039;&#039;Cytoplasmic&#039;&#039; side of the membrane.&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase (quinone), mitochondrial: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q9N623 / CRT_PLAFA / Chloroquine resistance transporter, PfCRT: &amp;quot;Localizes to the parasite digestive vacuole&amp;quot;&lt;br /&gt;
* Q9GPP8 / PSD_PLAFA / Phosphatidylserine decarboxylase proenzyme: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
Of course, the actual examples you selected may differ from these!&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;1699&#039;&#039;&#039; of the hits contain the phrase &amp;quot;cell membrane&amp;quot;, this can be found by modifying the search to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or, you can find &#039;&#039;&#039;1697&#039;&#039;&#039; hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;Cell membrane [SL-0039]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; UniProt is developing, and not everything we wrote in the exercise guide earlier is still true. In this question, it was stated:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!). &lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Some of you actually tried this and found three hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;Host cell membrane [SL-0375]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We stand corrected!&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2h)===&lt;br /&gt;
&#039;&#039;&#039;61&#039;&#039;&#039; hits, 27 from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0375)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2i)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10,134&#039;&#039;&#039;, among these only &#039;&#039;&#039;4&#039;&#039;&#039; from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
===2j)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,543&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;or&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:&amp;quot;erythrocyte membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,502&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
===2k)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2,387&#039;&#039;&#039; (or &#039;&#039;&#039;2,385&#039;&#039;&#039; if the words &amp;quot;erythrocyte membrane&amp;quot; are combined)&lt;br /&gt;
&lt;br /&gt;
===2l)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false) AND (database:pdb)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8&#039;&#039;&#039; hits, called &amp;quot;Erythrocyte membrane protein 1&amp;quot; or &amp;quot;Erythrocyte membrane protein 2&amp;quot;: &#039;&#039;&#039;Q6UDW7&#039;&#039;&#039;, &#039;&#039;&#039;Q8I098&#039;&#039;&#039;, &#039;&#039;&#039;Q8I639&#039;&#039;&#039;, &#039;&#039;&#039;Q8IHM0&#039;&#039;&#039;, &#039;&#039;&#039;W7K270&#039;&#039;&#039;, &#039;&#039;&#039;A3R6S4&#039;&#039;&#039;, &#039;&#039;&#039;A0A024V5I6&#039;&#039;&#039;, and &#039;&#039;&#039;I1X0L2&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 3 - Analysis of membrane protein domain structure ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
InterPro identifier: &#039;&#039;&#039;IPR008602&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Pfam identifier: &#039;&#039;&#039;PF05424&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
It is found &#039;&#039;&#039;4&#039;&#039;&#039; times in Q8IHM0, &#039;&#039;&#039;5&#039;&#039;&#039; times in Q8I098 and &#039;&#039;&#039;6&#039;&#039;&#039; times in Q8I639.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The transmembrane segments are in the following positions:&lt;br /&gt;
* Q8I098: 3124-3146&lt;br /&gt;
* Q8I639: 2650-2667 &lt;br /&gt;
* Q8IHM0: 2695-2717&lt;br /&gt;
&lt;br /&gt;
The extracellular parts are the N-terminal parts (all the positions &#039;&#039;before&#039;&#039; the transmembrane segments), and the intracellular (cytoplasmic) parts are C-terminal (positions &#039;&#039;after&#039;&#039; the transmembrane segments). All the Duffy binding domains are in the extracellular part.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The following positions are structurally determined by X-ray in the three proteins:&lt;br /&gt;
* Q8I098: No X-Ray, only EM (Electron Microscopy).&lt;br /&gt;
* Q8I639: 2333-2634, covering Duffy_binding domain 6&lt;br /&gt;
* Q8IHM0: 728-1214, covering Duffy_binding domain 2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Yes! &lt;br /&gt;
* Biological process: cytoadherence to microvasculature, mediated by symbiont protein&lt;br /&gt;
* Biological process: pathogenesis&lt;br /&gt;
* Cellular component: host cell plasma membrane&lt;br /&gt;
* Cellular component: infected host cell surface knob&lt;br /&gt;
* Molecular function: cell adhesion molecule binding&lt;br /&gt;
* Molecular function: host cell surface receptor binding&lt;br /&gt;
All these annotations support the conclusion that these proteins are involved in binding infected erythrocytes to endothelial cells (as described in the exercise).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 4 - Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4a)&#039;&#039;&#039; The PDB entry is &#039;&#039;&#039;2WAU&#039;&#039;&#039; and it&#039;s a crystal structure (X-ray).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4b)&#039;&#039;&#039; &lt;br /&gt;
FASTA sequence for the Duffy Binding domain covered by the 3D structure:&lt;br /&gt;
 &amp;gt;2WAU_1|Chains A, B|ERYTHROCYTE MEMBRANE PROTEIN 1 (PFEMP1)|PLASMODIUM FALCIPARUM (36329)&lt;br /&gt;
 ICNKYKNINVNMKKNNDDTWTDLVKNSSDINKGVLLPPRRKNLFLKIDESDICKYKRDPKLFKDFIYSSAISEVERLKKV&lt;br /&gt;
 YGEAKTKVVHAMKYSFADIGSIIKGDDMMENNSSDKIGKILGDGVGQNEKRKKWWDMNKYHIWESMLSGYKHAYGNISEN&lt;br /&gt;
 DRKMLDIPNNDDEHQFLRWFQEWTENFCTKRNELYENMVTACNSAKCNTSNGSVDKKECTEACKNYSNFILIKKKEYQSL&lt;br /&gt;
 NSQYDMNYKETKAEKKESPEYFKDKCNGECSCLSEYFKDETRWKNPYETLDDTEVKNNCMCK&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4c)&#039;&#039;&#039; &lt;br /&gt;
The sequence interval was 2333-2634. This means that the &#039;&#039;&#039;first&#039;&#039;&#039; position in the new FASTA file corresponds to position &#039;&#039;&#039;2333&#039;&#039;&#039; in the original sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4d)&#039;&#039;&#039; It&#039;s possible to convert from the coordinates in the FASTA files to the full length sequence by adding &#039;&#039;&#039;2332&#039;&#039;&#039;. In the table below the epitopes have been named by their starting position as well as numbered.&lt;br /&gt;
&lt;br /&gt;
 EPITOPE     POSITIONS    LENGTH     ORIG_POSITIONS&lt;br /&gt;
 #1 ep_5       5 to  29       25     2337 to 2361&lt;br /&gt;
 #2 ep_49     49 to  57        9     2381 to 2389&lt;br /&gt;
 #3 ep_107   107 to 114        8     2439 to 2446&lt;br /&gt;
 #4 ep_153   153 to 172       20     2485 to 2504&lt;br /&gt;
 #5 ep_209   209 to 218       10     2541 to 2550&lt;br /&gt;
 #6 ep_249   249 to 258       10     2581 to 2590&lt;br /&gt;
 #7 ep_273   273 to 294       22     2605 to 2626&lt;br /&gt;
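&lt;br /&gt;
As a minimal sketch of the conversion above (an illustration, not part of the original exercise), the offset can be applied in Python. The intervals are copied from the table; the names &amp;lt;tt&amp;gt;OFFSET&amp;lt;/tt&amp;gt;, &amp;lt;tt&amp;gt;EPITOPES&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;to_original&amp;lt;/tt&amp;gt; are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Offset between FASTA (PDB fragment) numbering and full-length numbering:
# fragment position 1 corresponds to original position 2333.
OFFSET = 2333 - 1

# Predicted epitopes as inclusive (start, end) in fragment numbering,
# copied from the table above.
EPITOPES = [(5, 29), (49, 57), (107, 114), (153, 172),
            (209, 218), (249, 258), (273, 294)]


def to_original(start, end, offset=OFFSET):
    """Convert an inclusive fragment interval to full-length coordinates."""
    return start + offset, end + offset


for number, (start, end) in enumerate(EPITOPES, start=1):
    orig_start, orig_end = to_original(start, end)
    length = end - start + 1  # inclusive interval
    print(f"#{number} ep_{start}: {start}-{end} (length {length}), "
          f"original {orig_start}-{orig_end}")
```
&lt;br /&gt;
Keeping the offset in one named constant makes it easy to reuse the same conversion for another fragment with a different start position.&lt;br /&gt;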
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 5 - Visualization of epitopes ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5a)&#039;&#039;&#039; &lt;br /&gt;
Invisible positions:&lt;br /&gt;
&lt;br /&gt;
 Chain A: 2333-2349 and 2540-2546&lt;br /&gt;
 Chain B: 2333-2348 and 2535-2549&lt;br /&gt;
&lt;br /&gt;
This means that the first epitope (pos 5-29, orig pos 2337 to 2361) and the 5th epitope (pos 209 to 218, orig pos 2541 to 2550) are partially invisible. &lt;br /&gt;
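&lt;br /&gt;
The same overlap can be checked programmatically. The sketch below is an illustration only: it uses the chain A ranges listed above and the original-numbering epitope intervals from 4d, and the helper name &amp;lt;tt&amp;gt;hidden_residues&amp;lt;/tt&amp;gt; is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Residues without coordinates in chain A (original numbering, inclusive),
# as listed above.
INVISIBLE_A = [(2333, 2349), (2540, 2546)]

# Epitopes in original numbering, copied from the table in 4d.
EPITOPES = {1: (2337, 2361), 2: (2381, 2389), 3: (2439, 2446),
            4: (2485, 2504), 5: (2541, 2550), 6: (2581, 2590),
            7: (2605, 2626)}


def hidden_residues(epitope, invisible):
    """Return the sorted epitope positions that fall in any invisible range."""
    start, end = epitope
    positions = set(range(start, end + 1))
    hidden = set()
    for gap_start, gap_end in invisible:
        hidden |= positions.intersection(range(gap_start, gap_end + 1))
    return sorted(hidden)


for number, interval in EPITOPES.items():
    hidden = hidden_residues(interval, INVISIBLE_A)
    if hidden:
        total = interval[1] - interval[0] + 1
        print(f"epitope #{number}: {len(hidden)} of {total} residues "
              f"invisible in chain A")
```
&lt;br /&gt;
Replacing the chain A ranges with those of chain B (2333-2348 and 2535-2549) gives the corresponding check for chain B.&lt;br /&gt;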
&lt;br /&gt;
&#039;&#039;&#039;5b)&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Overview figure with SURFACE visualization and indication of epitopes in Chain B. Notice that epitope #1 is partly hidden and epitope #6 is fully hidden (as expected from &#039;&#039;&#039;Q5b&#039;&#039;&#039; - here its also directly seen by the grey positions in the sequence). Note that if chain A is chosen a few amino acids of epitope #6 will be visible.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
[[Image:Epitopes_PyMol_figure.pptx.png|thumb|center|800px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:2wau_figure_for_wiki1.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
Same as above, but with MESH + CARTOON visualization (for a combined overview of surface + interior)&lt;br /&gt;
[[Image:3d_mesh_view3.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=616</id>
		<title>Answers:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=616"/>
		<updated>2025-10-14T13:29:14Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* 3 - Analysis of membrane protein domain structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Answers to case study exercise about malaria vaccines&#039;&#039;&#039; (NB: numbers etc. are as found in the databases on 11-10-2023):&lt;br /&gt;
&lt;br /&gt;
== 1 - What exactly is malaria? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1a)&#039;&#039;&#039; If you search for &amp;quot;malaria&amp;quot; on NCBI&#039;s Taxonomy page, you find some mosquitoes and some protozoans with the genus name &#039;&#039;Plasmodium&#039;&#039;. Clicking the name of one of these (twice) gets you to a page where you can see the &#039;&#039;lineage&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
* Genus: &#039;&#039;Plasmodium&#039;&#039;&lt;br /&gt;
* Phylum: &#039;&#039;Apicomplexa&#039;&#039; &lt;br /&gt;
* (Super)Kingdom: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1b)&#039;&#039;&#039; NCBI&#039;s Taxonomy page has a function named &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi Taxonomy common tree]&amp;quot;, which gives a nice overview. Alternatively, you can open the taxonomy pages for the two organisms you want to compare and see from their &#039;&#039;lineages&#039;&#039; how much they have in common. &lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Aconoidasida&#039;&#039;&lt;br /&gt;
Here is the picture you can get from the &amp;quot;Taxonomy common tree&amp;quot; function:&lt;br /&gt;
[[Image:Common Taxonomy Tree.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1c)&#039;&#039;&#039; On [http://www.cdc.gov/dpdx/malaria/ CDC&#039;s page about malaria] or on [http://tolweb.org/Plasmodium/68071 Tree of Life&#039;s page about Plasmodium] you find:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;P. malariae&#039;&#039;, &#039;&#039;P. ovale&#039;&#039;, &#039;&#039;P. falciparum&#039;&#039; and &#039;&#039;P. vivax&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
By looking up these four species in NCBI Taxonomy and looking at the table to the right, you see that &#039;&#039;&#039;all four&#039;&#039;&#039; species have a full genome in the databases (see the link named &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Entrez records&amp;lt;/u&amp;gt;). &lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 2 - Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
&lt;br /&gt;
===2a)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &#039;&#039;&#039;14&#039;&#039;&#039; chromosomes. Actually, this is not as easy to find as it used to be. Previously, you got a list of the 14 chromosomes just by following the &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; link. Now, you can see the previous Genome page with the chromosomes by following the link labeled &amp;quot;View the legacy Genome page&amp;quot;. Alternatively, you can see a list of the chromosomes by clicking the link under &amp;quot;Reference genome&amp;quot; (Genome assembly GCA_000002765). --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5566&#039;&#039;&#039; non-hypothetical genes (search details below)&lt;br /&gt;
 txid36329[Organism:noexp] NOT hypothetical[All Fields] AND alive[prop]&lt;br /&gt;
&lt;br /&gt;
If you instead found 5570 non-hypothetical genes, it is because you selected the species &#039;&#039;Plasmodium falciparum&#039;&#039; (taxID:5833) in NCBI Taxonomy instead of the specific isolate 3D7 (taxID:36329) as specified in the exercise.&lt;br /&gt;
&lt;br /&gt;
===2b)===&lt;br /&gt;
&lt;br /&gt;
The correct search strings&lt;br /&gt;
 (taxonomy_id:5833)&lt;br /&gt;
or &lt;br /&gt;
 (organism_name:&amp;quot;Plasmodium falciparum&amp;quot;)&lt;br /&gt;
both give &#039;&#039;&#039;129,611&#039;&#039;&#039; hits in total, &#039;&#039;&#039;483&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;129,128&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
If you only found 34,196 hits, it was because you used &lt;br /&gt;
 (organism_id:5833)&lt;br /&gt;
which only gives those &#039;&#039;Pf&#039;&#039; proteins that do &#039;&#039;not&#039;&#039; have a specified strain or isolate — cf. question 3.4+3.5 in [[Exercise: The protein database UniProt|the UniProt exercise]].  &lt;br /&gt;
&lt;br /&gt;
If, on the other hand, you found 131,860 hits, it was because you searched in All instead of specifying the search field:&lt;br /&gt;
 Plasmodium falciparum&lt;br /&gt;
In that case, you will include some proteins that originate from e.g. humans but play a role in &#039;&#039;Plasmodium falciparum&#039;&#039; infection, which may be mentioned in some comment field or reference title.&lt;br /&gt;
&lt;br /&gt;
===2c)===&lt;br /&gt;
&lt;br /&gt;
This can be solved in several ways:&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:36329)&amp;lt;/tt&amp;gt; (either selecting the right isolate from the drop-down menu or using the TaxID you found in the Taxonomy database)&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;(organism_name:&amp;quot;Plasmodium falciparum&amp;quot;) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
They all give: &#039;&#039;&#039;5,495&#039;&#039;&#039; in total, &#039;&#039;&#039;295&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;5,200&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
That corresponds &#039;&#039;approximately&#039;&#039; to the number of genes found in &#039;&#039;&#039;2a)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
===2d)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:*)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;24,439&#039;&#039;&#039; (&#039;&#039;&#039;390&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;24,049&#039;&#039;&#039; from TrEMBL).&lt;br /&gt;
&lt;br /&gt;
===2e)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:secreted)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0243)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;594&#039;&#039;&#039; (39 from Swiss-Prot).&lt;br /&gt;
&lt;br /&gt;
===2f)===&lt;br /&gt;
&lt;br /&gt;
surface:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:surface)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0310)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;413&#039;&#039;&#039; hits&lt;br /&gt;
&lt;br /&gt;
membrane: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:membrane)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0162)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;14,515&#039;&#039;&#039; hits&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2g)===&lt;br /&gt;
&lt;br /&gt;
Potentially useful (found in the cell membrane): &lt;br /&gt;
* Q7KQL9 / ALF_PLAF7 / Fructose-bisphosphate aldolase: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* A0A2I0BVG8 / CDPK1_PLAFO / Calcium-dependent protein kinase 1: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* W7KN63 / W7KN63_PLAFO / Merozoite surface antigen 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q8IFM5 / RH5_PLAF7 / Reticulocyte-binding protein homolog 5: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* O97364 / SUB2_PLAFA / Subtilisin-like protease 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* C6KSX0 / PF12_PLAF7 / Merozoite surface protein P12: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8I1Y0 / PF41_PLAF7 / Merozoite surface protein P41: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8IDN0 / PFS47_PLAF7 / Female gametocyte surface protein P47: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* P62343 / CDPK1_PLAFK / Calcium-dependent protein kinase 1: &amp;quot;In the parasite and on erythrocytic membrane at a lower level&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Definitely not useful (found in an inner membrane):&lt;br /&gt;
* Q8I6V3 / PLM2_PLAF7 / Plasmepsin II: &amp;quot;Vacuole membrane&amp;quot;&lt;br /&gt;
* U3M186 / U3M186_PLAFA / Cytochrome c oxidase subunit 1: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* O97321 / O97321_PLAF7 / GlcNAc-1-P transferase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q8I719 / KGP_PLAF7 / cGMP-dependent protein kinase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
* Q8IDR3 / MYOA_PLAF7 / Myosin-A: &#039;&#039;&#039;NB:&#039;&#039;&#039; even though this is found associated with the cell membrane, it is useless, because it is a Peripheral membrane protein bound to the &#039;&#039;Cytoplasmic&#039;&#039; side of the membrane.&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase (quinone), mitochondrial: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q9N623 / CRT_PLAFA / Chloroquine resistance transporter, PfCRT: &amp;quot;Localizes to the parasite digestive vacuole&amp;quot;&lt;br /&gt;
* Q9GPP8 / PSD_PLAFA / Phosphatidylserine decarboxylase proenzyme: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
Of course, the actual examples you selected may differ from these!&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;1699&#039;&#039;&#039; of the hits contain the phrase &amp;quot;cell membrane&amp;quot;, this can be found by modifying the search to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or, you can find &#039;&#039;&#039;1697&#039;&#039;&#039; hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;Cell membrane [SL-0039]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; UniProt is developing, and not everything we wrote in the exercise guide earlier is still true. In this question, it was stated:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!). &lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Some of you actually tried this and found three hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;Host cell membrane [SL-0375]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We stand corrected!&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2h)===&lt;br /&gt;
&#039;&#039;&#039;61&#039;&#039;&#039; hits, 27 from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0375)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2i)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10,134&#039;&#039;&#039;, among these only &#039;&#039;&#039;4&#039;&#039;&#039; from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
===2j)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,543&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;or&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:&amp;quot;erythrocyte membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,502&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
===2k)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2,387&#039;&#039;&#039; (or &#039;&#039;&#039;2,385&#039;&#039;&#039; if the words &amp;quot;erythrocyte membrane&amp;quot; are combined)&lt;br /&gt;
&lt;br /&gt;
===2l)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false) AND (database:pdb)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8&#039;&#039;&#039; hits, called &amp;quot;Erythrocyte membrane protein 1&amp;quot; or &amp;quot;Erythrocyte membrane protein 2&amp;quot;: &#039;&#039;&#039;Q6UDW7&#039;&#039;&#039;, &#039;&#039;&#039;Q8I098&#039;&#039;&#039;, &#039;&#039;&#039;Q8I639&#039;&#039;&#039;, &#039;&#039;&#039;Q8IHM0&#039;&#039;&#039;, &#039;&#039;&#039;W7K270&#039;&#039;&#039;, &#039;&#039;&#039;A3R6S4&#039;&#039;&#039;, &#039;&#039;&#039;A0A024V5I6&#039;&#039;&#039;, and &#039;&#039;&#039;I1X0L2&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 3 - Analysis of membrane protein domain structure ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
InterPro identifier: &#039;&#039;&#039;IPR008602&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Pfam identifier: &#039;&#039;&#039;PF05424&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
It is found &#039;&#039;&#039;4&#039;&#039;&#039; times in Q8IHM0, &#039;&#039;&#039;5&#039;&#039;&#039; times in Q8I098 and &#039;&#039;&#039;6&#039;&#039;&#039; times in Q8I639.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The transmembrane segments are in the following positions:&lt;br /&gt;
* Q8I098: 3124-3146&lt;br /&gt;
* Q8I639: 2650-2667 &lt;br /&gt;
* Q8IHM0: 2695-2717&lt;br /&gt;
&lt;br /&gt;
The extracellular parts are the N-terminal parts (all the positions &#039;&#039;before&#039;&#039; the transmembrane segments), and the intracellular (cytoplasmic) parts are C-terminal (positions &#039;&#039;after&#039;&#039; the transmembrane segments). All the Duffy binding domains are in the extracellular part.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The following positions are structurally determined by X-ray in the three proteins:&lt;br /&gt;
* Q8I098: No X-Ray, only EM (Electron Microscopy).&lt;br /&gt;
* Q8I639: 2333-2634, covering Duffy_binding domain 6&lt;br /&gt;
* Q8IHM0: 728-1214, covering Duffy_binding domain 2&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Yes! &lt;br /&gt;
* Biological process: cytoadherence to microvasculature, mediated by symbiont protein&lt;br /&gt;
* Biological process: pathogenesis&lt;br /&gt;
* Cellular component: host cell plasma membrane&lt;br /&gt;
* Cellular component: infected host cell surface knob&lt;br /&gt;
* Molecular function: cell adhesion molecule binding&lt;br /&gt;
* Molecular function: host cell surface receptor binding&lt;br /&gt;
All these annotations support the conclusion that these proteins are involved in binding infected erythrocytes to endothelial cells (as described in the exercise).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 4 - Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4a)&#039;&#039;&#039; The PDB entry is &#039;&#039;&#039;2WAU&#039;&#039;&#039; and it&#039;s a crystal structure (X-ray).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4b)&#039;&#039;&#039; &lt;br /&gt;
FASTA sequence for the Duffy Binding domain covered by the 3D structure:&lt;br /&gt;
 &amp;gt;2WAU_1|Chains A, B|ERYTHROCYTE MEMBRANE PROTEIN 1 (PFEMP1)|PLASMODIUM FALCIPARUM (36329)&lt;br /&gt;
 ICNKYKNINVNMKKNNDDTWTDLVKNSSDINKGVLLPPRRKNLFLKIDESDICKYKRDPKLFKDFIYSSAISEVERLKKV&lt;br /&gt;
 YGEAKTKVVHAMKYSFADIGSIIKGDDMMENNSSDKIGKILGDGVGQNEKRKKWWDMNKYHIWESMLSGYKHAYGNISEN&lt;br /&gt;
 DRKMLDIPNNDDEHQFLRWFQEWTENFCTKRNELYENMVTACNSAKCNTSNGSVDKKECTEACKNYSNFILIKKKEYQSL&lt;br /&gt;
 NSQYDMNYKETKAEKKESPEYFKDKCNGECSCLSEYFKDETRWKNPYETLDDTEVKNNCMCK&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4c)&#039;&#039;&#039; &lt;br /&gt;
The sequence interval was 2333-2634. This means that the &#039;&#039;&#039;first&#039;&#039;&#039; position in the new FASTA file corresponds to position &#039;&#039;&#039;2333&#039;&#039;&#039; in the original sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4d)&#039;&#039;&#039; It&#039;s possible to convert from the coordinates in the FASTA files to the full length sequence by adding &#039;&#039;&#039;2332&#039;&#039;&#039;. In the table below the epitopes have been named by their starting position as well as numbered.&lt;br /&gt;
&lt;br /&gt;
 EPITOPE     POSITIONS    LENGTH     ORIG_POSITIONS&lt;br /&gt;
 #1 ep_5       5 to  29       25     2337 to 2361&lt;br /&gt;
 #2 ep_49     49 to  57        9     2381 to 2389&lt;br /&gt;
 #3 ep_107   107 to 114        8     2439 to 2446&lt;br /&gt;
 #4 ep_153   153 to 172       20     2485 to 2504&lt;br /&gt;
 #5 ep_209   209 to 218       10     2541 to 2550&lt;br /&gt;
 #6 ep_249   249 to 258       10     2581 to 2590&lt;br /&gt;
 #7 ep_273   273 to 294       22     2605 to 2626&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 5 - Visualization of epitopes ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5a)&#039;&#039;&#039; &lt;br /&gt;
Invisible positions:&lt;br /&gt;
&lt;br /&gt;
 Chain A: 2333-2349 and 2540-2546&lt;br /&gt;
 Chain B: 2333-2348 and 2535-2549&lt;br /&gt;
&lt;br /&gt;
This means that the first epitope (pos 5-29, orig pos 2337 to 2361) and the 5th epitope (pos 209 to 218, orig pos 2541 to 2550) are partially invisible. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5b)&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Overview figure with SURFACE visualization and indication of epitopes in Chain B. Notice that epitope #1 is partly hidden and epitope #6 is fully hidden (as expected from &#039;&#039;&#039;Q5b&#039;&#039;&#039; - here its also directly seen by the grey positions in the sequence). Note that if chain A is chosen a few amino acids of epitope #6 will be visible.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
[[Image:Epitopes_PyMol_figure.pptx.png|thumb|center|800px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:2wau_figure_for_wiki1.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
Same as above, but with MESH + CARTOON visualization (for a combined overview of surface + interior)&lt;br /&gt;
[[Image:3d_mesh_view3.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=615</id>
		<title>Answers:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Answers:Malaria_Vaccine&amp;diff=615"/>
		<updated>2025-10-14T13:06:55Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* 2a) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Answers to case study exercise about malaria vaccines&#039;&#039;&#039; (NB: numbers etc. found in the databases 11-10-2023):&lt;br /&gt;
&lt;br /&gt;
== 1 - What exactly is malaria? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1a)&#039;&#039;&#039; If you search for &amp;quot;malaria&amp;quot; on NCBI&#039;s Taxonomy page, you find some mosquitoes and some protozoans with the genus name &#039;&#039;Plasmodium&#039;&#039;. Clicking the name of one of these (twice) gets you to a page where you can see the &#039;&#039;lineage&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
* Genus: &#039;&#039;Plasmodium&#039;&#039;&lt;br /&gt;
* Phylum: &#039;&#039;Apicomplexa&#039;&#039; &lt;br /&gt;
* (Super)Kingdom: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1b)&#039;&#039;&#039; NCBI&#039;s Taxonomy page has a function named &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi Taxonomy common tree]&amp;quot;, which gives a nice overview. Alternatively, you can open the taxonomy pages for the two organisms you want to compare and see from their &#039;&#039;lineages&#039;&#039; how much they have in common.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Eukaryota&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039; and &#039;&#039;Plasmodium&#039;&#039;: &#039;&#039;Aconoidasida&#039;&#039;&lt;br /&gt;
Here is the picture you can get from the &amp;quot;Taxonomy common tree&amp;quot; function:&lt;br /&gt;
[[Image:Common Taxonomy Tree.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;1c)&#039;&#039;&#039; On [http://www.cdc.gov/dpdx/malaria/ CDC&#039;s page about malaria] or on [http://tolweb.org/Plasmodium/68071 Tree of Life&#039;s page about Plasmodium] you find:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;P. malariae&#039;&#039;, &#039;&#039;P. ovale&#039;&#039;, &#039;&#039;P. falciparum&#039;&#039; and &#039;&#039;P. vivax&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
By looking up these four species in NCBI Taxonomy and looking at the table to the right, you see that &#039;&#039;&#039;all four&#039;&#039;&#039; species have a full genome in the databases (see the link named &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Entrez records&amp;lt;/u&amp;gt;). &lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 2 - Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
&lt;br /&gt;
===2a)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- &#039;&#039;&#039;14&#039;&#039;&#039; chromosomes. Actually, this is not as easy to find as it used to be. Previously, you got a list of the 14 chromosomes just by following the &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; link. Now, you can see the previous Genome page with the chromosomes by following the link labeled &amp;quot;View the legacy Genome page&amp;quot;. Alternatively, you can see a list of the chromosomes by clicking the link under &amp;quot;Reference genome&amp;quot; (Genome assembly GCA_000002765). --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5566&#039;&#039;&#039; not hypothetical genes (search details below)&lt;br /&gt;
 txid36329[Organism:noexp] NOT hypothetical[All Fields] AND alive[prop]&lt;br /&gt;
&lt;br /&gt;
If you instead found 5570 not hypothetical genes, it is because you found the species &#039;&#039;Plasmodium falciparum&#039;&#039; (taxID:5833) in NCBI Taxonomy instead of the specific isolate 3D7 (taxID:36329) as specified in the exercise.&lt;br /&gt;
&lt;br /&gt;
===2b)===&lt;br /&gt;
&lt;br /&gt;
The correct search strings&lt;br /&gt;
 (taxonomy_id:5833)&lt;br /&gt;
or &lt;br /&gt;
 (organism_name:&amp;quot;Plasmodium falciparum&amp;quot;)&lt;br /&gt;
both give &#039;&#039;&#039;129,611&#039;&#039;&#039; hits in total, &#039;&#039;&#039;483&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;129,128&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
If you only found 34,196 hits, it was because you used &lt;br /&gt;
 (organism_id:5833)&lt;br /&gt;
which only gives those &#039;&#039;Pf&#039;&#039; proteins that do &#039;&#039;not&#039;&#039; have a specified strain or isolate — cf. question 3.4+3.5 in [[Exercise: The protein database UniProt|the UniProt exercise]].  &lt;br /&gt;
&lt;br /&gt;
If, on the other hand, you found 131,860 hits, it was because you searched in All instead of specifying the search field:&lt;br /&gt;
 Plasmodium falciparum&lt;br /&gt;
In that case, you will include some proteins that originate from e.g. humans but play a role in &#039;&#039;Plasmodium falciparum&#039;&#039; infection, which may be mentioned in some comment field or reference title.&lt;br /&gt;
&lt;br /&gt;
===2c)===&lt;br /&gt;
&lt;br /&gt;
This can be solved in several ways:&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:36329)&amp;lt;/tt&amp;gt; (either selecting the right isolate from the drop-down menu or using the TaxID you found in the Taxonomy database)&lt;br /&gt;
* &amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
* &amp;lt;tt&amp;gt;(organism_name:&amp;quot;Plasmodium falciparum&amp;quot;) AND (organism_name:3d7)&amp;lt;/tt&amp;gt;&lt;br /&gt;
They all give: &#039;&#039;&#039;5,495&#039;&#039;&#039; in total, &#039;&#039;&#039;295&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;5,200&#039;&#039;&#039; from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
That corresponds &#039;&#039;approximately&#039;&#039; to the number of genes found in &#039;&#039;&#039;2a)&#039;&#039;&#039;.&lt;br /&gt;
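&lt;br /&gt;
The same search strings can also be sent to UniProt programmatically. The sketch below only builds the request URLs. It assumes the public UniProt REST endpoint shown in the code and the query field &amp;lt;tt&amp;gt;reviewed&amp;lt;/tt&amp;gt; for separating Swiss-Prot from TrEMBL; neither is part of the exercise, and the helper name is hypothetical. The hit counts would still have to be read from the responses.&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import urlencode

# Assumed endpoint of the public UniProt REST API (not named in the exercise).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(query, reviewed=None):
    """Return a search URL for a UniProtKB query string (hypothetical helper).

    reviewed=True restricts to Swiss-Prot, reviewed=False to TrEMBL,
    and None keeps both parts of the database.
    """
    if reviewed is not None:
        flag = "true" if reviewed else "false"
        query = f"({query}) AND (reviewed:{flag})"
    # size=1 keeps the response small; only the hit count is of interest.
    params = {"query": query, "format": "json", "size": 1}
    return BASE_URL + "?" + urlencode(params)


for label, reviewed in [("all", None), ("Swiss-Prot", True), ("TrEMBL", False)]:
    print(label, build_search_url("taxonomy_id:36329", reviewed))

# Consistency check on the counts reported for 2c): Swiss-Prot + TrEMBL = total.
assert 295 + 5200 == 5495
```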
&lt;br /&gt;
===2d)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:*)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;24,439&#039;&#039;&#039; (&#039;&#039;&#039;390&#039;&#039;&#039; from Swiss-Prot and &#039;&#039;&#039;24,049&#039;&#039;&#039; from TrEMBL).&lt;br /&gt;
&lt;br /&gt;
===2e)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:secreted)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0243)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;594&#039;&#039;&#039; (39 from Swiss-Prot).&lt;br /&gt;
&lt;br /&gt;
===2f)===&lt;br /&gt;
&lt;br /&gt;
surface:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:surface)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0310)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;413&#039;&#039;&#039; hits&lt;br /&gt;
&lt;br /&gt;
membrane: &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:membrane)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0162)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;14,515&#039;&#039;&#039; hits&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2g)===&lt;br /&gt;
&lt;br /&gt;
Potentially useful (found in the cell membrane): &lt;br /&gt;
* Q7KQL9 / ALF_PLAF7 / Fructose-bisphosphate aldolase: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* A0A2I0BVG8 / CDPK1_PLAFO / Calcium-dependent protein kinase 1: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* W7KN63 / W7KN63_PLAFO / Merozoite surface antigen 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q8IFM5 / RH5_PLAF7 / Reticulocyte-binding protein homolog 5: &amp;quot;Host cell membrane&amp;quot;&lt;br /&gt;
* O97364 / SUB2_PLAFA / Subtilisin-like protease 2: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* C6KSX0 / PF12_PLAF7 / Merozoite surface protein P12: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8I1Y0 / PF41_PLAF7 / Merozoite surface protein P41: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* Q8IDN0 / PFS47_PLAF7 / Female gametocyte surface protein P47: &amp;quot;Cell membrane&amp;quot;&lt;br /&gt;
* P62343 / CDPK1_PLAFK / Calcium-dependent protein kinase 1: &amp;quot;In the parasite and on erythrocytic membrane at a lower level&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Definitely not useful (found in an inner membrane):&lt;br /&gt;
* Q8I6V3 / PLM2_PLAF7 / Plasmepsin II: &amp;quot;Vacuole membrane&amp;quot;&lt;br /&gt;
* U3M186 / U3M186_PLAFA / Cytochrome c oxidase subunit 1: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* O97321 / O97321_PLAF7 / GlcNAc-1-P transferase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q8I719 / KGP_PLAF7 / cGMP-dependent protein kinase: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
* Q8IDR3 / MYOA_PLAF7 / Myosin-A: &#039;&#039;&#039;NB:&#039;&#039;&#039; even though this is found associated with the cell membrane, it is useless, because it is a Peripheral membrane protein bound to the &#039;&#039;Cytoplasmic&#039;&#039; side of the membrane.&lt;br /&gt;
* Q08210 / PYRD_PLAF7 / Dihydroorotate dehydrogenase (quinone), mitochondrial: &amp;quot;Mitochondrion inner membrane&amp;quot;&lt;br /&gt;
* Q9N623 / CRT_PLAFA / Chloroquine resistance transporter, PfCRT: &amp;quot;Localizes to the parasite digestive vacuole&amp;quot;&lt;br /&gt;
* Q9GPP8 / PSD_PLAFA / Phosphatidylserine decarboxylase proenzyme: &amp;quot;Endoplasmic reticulum membrane&amp;quot;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
Of course, the actual examples you selected may differ from these!&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;1699&#039;&#039;&#039; of the hits contain the phrase &amp;quot;cell membrane&amp;quot;, this can be found by modifying the search to:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or, you can find &#039;&#039;&#039;1697&#039;&#039;&#039; hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;taxonomy:&amp;quot;Plasmodium falciparum [5833]&amp;quot; locations:(location:&amp;quot;Cell membrane [SL-0039]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&#039;&#039;&#039;Note:&#039;&#039;&#039; UniProt is developing, and not everything we wrote in the exercise guide earlier is still true. In this question, it was stated:&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&lt;br /&gt;
These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!). &lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
Some of you actually tried this and found three hits with the search string:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
or &lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;organism:&amp;quot;plasmodium falciparum&amp;quot; locations:(location:&amp;quot;Host cell membrane [SL-0375]&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We stand corrected!&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2h)===&lt;br /&gt;
&#039;&#039;&#039;61&#039;&#039;&#039; hits, 27 from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:&amp;quot;host cell membrane&amp;quot;)&amp;lt;/tt&amp;gt;&amp;lt;br&amp;gt;&lt;br /&gt;
or&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (cc_scl_term:SL-0375)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===2i)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;10,134&#039;&#039;&#039;, among these only &#039;&#039;&#039;4&#039;&#039;&#039; from Swiss-Prot.&lt;br /&gt;
&lt;br /&gt;
===2j)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,543&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;or&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:&amp;quot;erythrocyte membrane&amp;quot;)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;9,502&#039;&#039;&#039; hits, all from TrEMBL.&lt;br /&gt;
&lt;br /&gt;
===2k)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2,387&#039;&#039;&#039; (or &#039;&#039;&#039;2,385&#039;&#039;&#039; if the words &amp;quot;erythrocyte membrane&amp;quot; are combined)&lt;br /&gt;
&lt;br /&gt;
===2l)===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;tt&amp;gt;(taxonomy_id:5833) AND (protein_name:erythrocyte) AND (protein_name:membrane) AND (fragment:false) AND (database:pdb)&amp;lt;/tt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;8&#039;&#039;&#039; hits, called &amp;quot;Erythrocyte membrane protein 1&amp;quot; or &amp;quot;Erythrocyte membrane protein 2&amp;quot;: &#039;&#039;&#039;Q6UDW7&#039;&#039;&#039;, &#039;&#039;&#039;Q8I098&#039;&#039;&#039;, &#039;&#039;&#039;Q8I639&#039;&#039;&#039;, &#039;&#039;&#039;Q8IHM0&#039;&#039;&#039;, &#039;&#039;&#039;W7K270&#039;&#039;&#039;, &#039;&#039;&#039;A3R6S4&#039;&#039;&#039;, &#039;&#039;&#039;A0A024V5I6&#039;&#039;&#039;, and &#039;&#039;&#039;I1X0L2&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 3 - Analysis of membrane protein domain structure ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
InterPro identifier and name: &#039;&#039;&#039;IPR008602, Duffy-antigen binding&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
Pfam identifier and name: &#039;&#039;&#039;PF05424, Duffy binding domain&#039;&#039;&#039;&amp;lt;br&amp;gt;&lt;br /&gt;
It is found &#039;&#039;&#039;4&#039;&#039;&#039; times in Q8IHM0 and &#039;&#039;&#039;6&#039;&#039;&#039; times in each of Q8I639 and Q6UDW7.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The transmembrane segments are in the following positions:&lt;br /&gt;
* Q6UDW7: 2653-2674&lt;br /&gt;
* Q8I639: 2650-2667 &lt;br /&gt;
* Q8IHM0: 2695-2717&lt;br /&gt;
&lt;br /&gt;
The extracellular parts are the N-terminal parts (all the positions &#039;&#039;before&#039;&#039; the transmembrane segments), and the intracellular (cytoplasmic) parts are C-terminal (positions &#039;&#039;after&#039;&#039; the transmembrane segments).  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
The following positions are structurally determined by X-ray in the three proteins:&lt;br /&gt;
* Q6UDW7: &lt;br /&gt;
** 1215-1950, covering Duffy_binding domain 3 and 4&lt;br /&gt;
** 1218-1577 or 1220-1580, covering Duffy_binding domain 3 &lt;br /&gt;
** 2326-2631 covering Duffy_binding domain 6&lt;br /&gt;
* Q8I639: 2333-2634, covering Duffy_binding domain 6&lt;br /&gt;
* Q8IHM0: 728-1214, covering Duffy_binding domain 2&lt;br /&gt;
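&lt;br /&gt;
As a quick sanity check, the positions from 3b) and 3c) can be compared to confirm that every crystallized range lies N-terminal to the transmembrane segment, i.e. in the extracellular part. This is a minimal sketch using only the numbers listed above; the function name is chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Transmembrane (TM) segments from 3b) and X-ray-covered ranges from 3c).
tm_segments = {
    "Q6UDW7": (2653, 2674),
    "Q8I639": (2650, 2667),
    "Q8IHM0": (2695, 2717),
}
structures = {
    "Q6UDW7": [(1215, 1950), (1218, 1577), (1220, 1580), (2326, 2631)],
    "Q8I639": [(2333, 2634)],
    "Q8IHM0": [(728, 1214)],
}


def is_extracellular(acc, start, end):
    """True if the whole range lies N-terminal to the TM segment."""
    tm_start, _ = tm_segments[acc]
    n_terminal = range(1, tm_start)  # positions 1 .. tm_start - 1
    return start in n_terminal and end in n_terminal


for acc, ranges in structures.items():
    for start, end in ranges:
        print(acc, f"{start}-{end}", is_extracellular(acc, start, end))
```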
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Yes! &lt;br /&gt;
* Biological process: cytoadherence to microvasculature, mediated by symbiont protein&lt;br /&gt;
* Biological process: pathogenesis&lt;br /&gt;
* Cellular component: host cell plasma membrane&lt;br /&gt;
* Cellular component: infected host cell surface knob&lt;br /&gt;
* Molecular function: cell adhesion molecule binding&lt;br /&gt;
* Molecular function: host cell surface receptor binding&lt;br /&gt;
All these examples support that these proteins are involved in binding the infected erythrocytes to the endothelial cells (as described in the exercise).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Tip:&#039;&#039;&#039; You can click &amp;lt;u&amp;gt;View the complete GO annotation on QuickGO&amp;lt;/u&amp;gt; in UniProt.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 4 - Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4a)&#039;&#039;&#039; The PDB entry is &#039;&#039;&#039;2WAU&#039;&#039;&#039; and it&#039;s a crystal structure (X-ray).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4b)&#039;&#039;&#039; &lt;br /&gt;
FASTA sequence for the Duffy Binding domain covered by the 3D structure:&lt;br /&gt;
 &amp;gt;2WAU_1|Chains A, B|ERYTHROCYTE MEMBRANE PROTEIN 1 (PFEMP1)|PLASMODIUM FALCIPARUM (36329)&lt;br /&gt;
 ICNKYKNINVNMKKNNDDTWTDLVKNSSDINKGVLLPPRRKNLFLKIDESDICKYKRDPKLFKDFIYSSAISEVERLKKV&lt;br /&gt;
 YGEAKTKVVHAMKYSFADIGSIIKGDDMMENNSSDKIGKILGDGVGQNEKRKKWWDMNKYHIWESMLSGYKHAYGNISEN&lt;br /&gt;
 DRKMLDIPNNDDEHQFLRWFQEWTENFCTKRNELYENMVTACNSAKCNTSNGSVDKKECTEACKNYSNFILIKKKEYQSL&lt;br /&gt;
 NSQYDMNYKETKAEKKESPEYFKDKCNGECSCLSEYFKDETRWKNPYETLDDTEVKNNCMCK&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4c)&#039;&#039;&#039; &lt;br /&gt;
The sequence interval was 2333-2634. This means that the &#039;&#039;&#039;first&#039;&#039;&#039; position in the new FASTA file corresponds to position &#039;&#039;&#039;2333&#039;&#039;&#039; in the original sequence.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;4d)&#039;&#039;&#039; It&#039;s possible to convert from the coordinates in the FASTA files to the full length sequence by adding &#039;&#039;&#039;2332&#039;&#039;&#039;. In the table below the epitopes have been named by their starting position as well as numbered.&lt;br /&gt;
&lt;br /&gt;
 EPITOPE     POSITIONS    LENGTH     ORIG_POSITIONS&lt;br /&gt;
 #1 ep_5       5 to  29       25     2337 to 2361&lt;br /&gt;
 #2 ep_49     49 to  57        9     2381 to 2389&lt;br /&gt;
 #3 ep_107   107 to 114        8     2439 to 2446&lt;br /&gt;
 #4 ep_153   153 to 172       20     2485 to 2504&lt;br /&gt;
 #5 ep_209   209 to 218       10     2541 to 2550&lt;br /&gt;
 #6 ep_249   249 to 258       10     2581 to 2590&lt;br /&gt;
 #7 ep_273   273 to 294       22     2605 to 2626&lt;br /&gt;
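&lt;br /&gt;
The conversion in the table can be reproduced with a few lines of Python. This is only a sketch of the arithmetic (adding 2332 to both ends of each interval); the epitope intervals are copied from the table, and the function name is chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
OFFSET = 2332  # fragment position 1 corresponds to original position 2333

# (start, end) of each predicted epitope in fragment coordinates, from the table.
epitopes = [(5, 29), (49, 57), (107, 114), (153, 172),
            (209, 218), (249, 258), (273, 294)]


def to_original(start, end, offset=OFFSET):
    """Shift a fragment interval to full-length sequence coordinates."""
    return start + offset, end + offset


for number, (start, end) in enumerate(epitopes, start=1):
    orig_start, orig_end = to_original(start, end)
    length = end - start + 1  # both ends are inclusive
    print(f"#{number} ep_{start}: {start}-{end} (length {length}) "
          f"maps to {orig_start}-{orig_end}")
```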
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== 5 - Visualization of epitopes ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;5a)&#039;&#039;&#039; &lt;br /&gt;
Invisible positions:&lt;br /&gt;
&lt;br /&gt;
 Chain A: 2333-2349 and 2540-2546&lt;br /&gt;
 Chain B: 2333-2348 and 2535-2549&lt;br /&gt;
&lt;br /&gt;
This means that the first epitope (pos 5-29, orig pos 2337 to 2361) and the 5th epitope (pos 209 to 218, orig pos 2541 to 2550) are partially invisible. &lt;br /&gt;
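&lt;br /&gt;
The conclusion can be checked by intersecting each epitope with the invisible stretches of each chain. The sketch below uses the original numbering from the table in 4d) and the invisible positions listed above; the helper names are chosen for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Residues missing from the structure (original numbering), from 5a).
invisible = {
    "A": [(2333, 2349), (2540, 2546)],
    "B": [(2333, 2348), (2535, 2549)],
}
# Epitopes in original numbering, from the table in 4d).
epitopes = {
    1: (2337, 2361), 2: (2381, 2389), 3: (2439, 2446), 4: (2485, 2504),
    5: (2541, 2550), 6: (2581, 2590), 7: (2605, 2626),
}


def overlap(a, b):
    """Number of positions shared by two inclusive intervals."""
    # An empty range (start beyond end) has length 0.
    return len(range(max(a[0], b[0]), min(a[1], b[1]) + 1))


def hidden_residues(epitope, chain):
    """Count epitope residues that fall in invisible stretches of a chain."""
    return sum(overlap(epitope, gap) for gap in invisible[chain])


for number, epitope in epitopes.items():
    length = epitope[1] - epitope[0] + 1
    for chain in invisible:
        hidden = hidden_residues(epitope, chain)
        if hidden:
            print(f"epitope #{number}, chain {chain}: {hidden} of {length} hidden")
```

Only epitopes #1 and #5 are reported, and in neither chain is the whole epitope missing, which matches the statement that they are partially invisible.&lt;br /&gt;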
&lt;br /&gt;
&#039;&#039;&#039;5b)&#039;&#039;&#039; &lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
Overview figure with SURFACE visualization and indication of epitopes in Chain B. Notice that epitope #1 is partly hidden and epitope #6 is fully hidden (as expected from &#039;&#039;&#039;Q5b&#039;&#039;&#039; - here its also directly seen by the grey positions in the sequence). Note that if chain A is chosen a few amino acids of epitope #6 will be visible.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
[[Image:Epitopes_PyMol_figure.pptx.png|thumb|center|800px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:2wau_figure_for_wiki1.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
Same as above, but with MESH + CARTOON visualization (for a combined overview of surface + interior)&lt;br /&gt;
[[Image:3d_mesh_view3.png|thumb|center|1000px|Click to zoom]]&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=614</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=614"/>
		<updated>2025-10-14T13:06:12Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* Identification of membrane proteins (potential vaccine targets) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and could therefore be used in a vaccine. As part of the exercise, some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms (find the upper level taxonomical group, that ties them together). &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, in which the red blood cells (erythrocytes) are invaded as in malaria, leading to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case a lack of oxygen-carrying capacity in the blood); see the Tree of Life page for this organism for images of infected erythrocytes).&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right-hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence-related data is shown. For instance, the &amp;quot;Gene&amp;quot; link shows how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hint:&#039;&#039;&#039; Follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend keeping the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many are there of these, respectively? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Close&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look again at the list of results where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either choose to download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins which we now have found constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure — the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits whose accession codes start with &amp;quot;Q&amp;quot; (there should be three of them; otherwise, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one named family/domain is found in several copies in all our three erythrocyte membrane proteins. What are the names and identifiers of  this family/domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Under &amp;quot;Other Features&amp;quot;, InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy-associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contracting malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and that you&#039;ll need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so-called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039;; there also exist &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results. &lt;br /&gt;
:Please select the method called &amp;quot;BepiPred 2.0&amp;quot;&lt;br /&gt;
:http://tools.iedb.org/bcell/ &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. Alternatively, you can choose to download the sequence by clicking &amp;lt;u&amp;gt;Download Files&amp;lt;/u&amp;gt;.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.55&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;8 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;7&#039;&#039;&#039; such epitopes, and the last one starts at position &#039;&#039;&#039;276&#039;&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
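&lt;br /&gt;
The coordinate transformation is a constant shift. As a sketch, writing &#039;&#039;p1&#039;&#039; for the position in the original protein that corresponds to position 1 in the FASTA file (your answer to &#039;&#039;&#039;Question 4c&#039;&#039;&#039;), and assuming the domain sequence maps onto the full sequence without gaps:&lt;br /&gt;
&lt;br /&gt;
 position in original protein = position in domain sequence + (p1 - 1)&lt;br /&gt;
&lt;br /&gt;
For a purely hypothetical &#039;&#039;p1&#039;&#039; of 101, an epitope at domain positions 10 to 20 would sit at positions 110 to 120 in the original protein.&lt;br /&gt;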
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface-exposed epitopes, but it&#039;s still a good idea to check this visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
&lt;br /&gt;
Now it&#039;s time to visualize the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), it helps to first set the base colour to a neutral grey and switch to a basic &amp;quot;cartoon&amp;quot; visualization:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme — for example epitope_1 to epitope_7 or reference the first position (e.g. epitope_273 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
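&lt;br /&gt;
As a sketch of this pattern for a single epitope (the residue range 110-120 and the colour are placeholders, not one of the actual predicted epitopes; substitute your own coordinate-transformed values from &#039;&#039;&#039;Question 4d&#039;&#039;&#039;):&lt;br /&gt;
&lt;br /&gt;
 select epitope_1, resi 110-120&lt;br /&gt;
 color red, epitope_1&lt;br /&gt;
 set seq_view, 1&lt;br /&gt;
&lt;br /&gt;
The last command turns on the sequence viewer from the command line.&lt;br /&gt;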
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.&lt;br /&gt;
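&lt;br /&gt;
A command-line sketch of the same steps, assuming you name the second object kb and choose to continue with chain A (the click-interface works equally well):&lt;br /&gt;
&lt;br /&gt;
 create kb, chain B&lt;br /&gt;
 disable all&lt;br /&gt;
 enable ka&lt;br /&gt;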
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface &lt;br /&gt;
to show the protein from the outside.&lt;br /&gt;
* show as → cartoon&lt;br /&gt;
* show → mesh&lt;br /&gt;
to show BOTH the inside and outside — it especially works nicely when you actively rotate the structure.&lt;br /&gt;
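&lt;br /&gt;
The same switches can also be typed as commands. A sketch, assuming the chain you kept is the object ka:&lt;br /&gt;
&lt;br /&gt;
 as surface, ka&lt;br /&gt;
 as cartoon, ka&lt;br /&gt;
 show mesh, ka&lt;br /&gt;
&lt;br /&gt;
The first line shows only the surface; the last two lines show the cartoon with a mesh on top of it.&lt;br /&gt;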
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface-accessible epitopes to the lab to start the long process of constructing an expression vector with the gene fragments and the right linker sequences, getting it expressed in a production host, and following up with animal testing and phase 1, 2 and 3 clinical trials. After that, the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=613</id>
		<title>Exercise:Malaria Vaccine</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Malaria_Vaccine&amp;diff=613"/>
		<updated>2025-10-14T13:00:53Z</updated>

		<summary type="html">&lt;p&gt;Henni: /* What exactly is malaria? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Exercise written by: Thomas Salhøj Rask and [http://www.dtu.dk/service/telefonbog/person?id=25617&amp;amp;cpid=214126&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Henrik Nielsen] — translated, revised and updated to BepiPred 2.0 by [http://www.dtu.dk/service/telefonbog/person?id=18103&amp;amp;cpid=214039&amp;amp;tab=2&amp;amp;qt=dtupublicationquery Rasmus Wernersson] and Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
The purpose of this exercise is to apply the methods and knowledge you have learned so far to a real biological problem: taking steps towards designing a malaria vaccine, by selecting peptides from the malaria parasite that have a chance of eliciting an immune response and therefore could be used in a vaccine. As part of the exercise some new material will be introduced, especially concerning prediction of B-cell epitopes (immuno-reactive peptides). The outline of the exercise is as follows:&lt;br /&gt;
&lt;br /&gt;
# What exactly is malaria?&lt;br /&gt;
# Identification of membrane bound proteins (potential vaccine targets)&lt;br /&gt;
# Analysis of membrane protein domain structure&lt;br /&gt;
# Prediction of B-cell epitopes from membrane proteins&lt;br /&gt;
# Modelling / visualization of predicted epitopes in the 3D structure of a protein domain.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== What exactly is malaria? ==&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 1:&#039;&#039;&#039; &#039;&#039;Which organism causes malaria? Bacteria, protozoa (single cell eukaryote), worm or virus?&#039;&#039; &lt;br /&gt;
&lt;br /&gt;
Investigate this by looking up the organism in the two taxonomy databases we have been covering earlier in the course:&lt;br /&gt;
*&#039;&#039;&#039;NCBI Taxonomy:&#039;&#039;&#039; http://www.ncbi.nlm.nih.gov/Taxonomy &amp;amp;nbsp;&amp;amp;nbsp; (&#039;&#039;&#039;Hint:&#039;&#039;&#039; If you don&#039;t know the Latin name for the organism, it will be easier to search for a name as a &amp;quot;[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi Token set]&amp;quot; rather than as a &amp;quot;Complete name&amp;quot;.)&lt;br /&gt;
*&#039;&#039;&#039;Tree of life:&#039;&#039;&#039; http://www.tolweb.org/ &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1a)&#039;&#039;&#039; Identify the following taxonomical levels for the malaria-causing organism:&lt;br /&gt;
* Genus&lt;br /&gt;
* Phylum&lt;br /&gt;
* (Super)Kingdom&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1b)&#039;&#039;&#039; How &amp;quot;close&amp;quot; in taxonomy space is the organism to the following other organisms (find the upper level taxonomical group, that ties them together). &#039;&#039;&#039;Hint:&#039;&#039;&#039; as an alternative to manually comparing the taxonomy-strings (the &amp;quot;lineage&amp;quot;), you can use the [http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi NCBI Taxonomy Common Tree] tool to automate the comparison.&lt;br /&gt;
* &#039;&#039;Homo sapiens&#039;&#039;&lt;br /&gt;
* &#039;&#039;Babesia microti&#039;&#039;&amp;amp;nbsp;&amp;amp;nbsp;&amp;amp;nbsp; (Can in rare cases be transmitted by ticks (Danish: &amp;quot;Skovflåt&amp;quot;) and can lead to the disease &#039;&#039;[https://en.wikipedia.org/wiki/Babesiosis babesiosis]&#039;&#039;, in which the red blood cells (erythrocytes) are invaded as in malaria, leading to &#039;&#039;anemia&#039;&#039; (&amp;quot;blood loss&amp;quot;, in this case a lack of oxygen-carrying capacity in the blood). See the Tree of Life page for this organism for images of infected erythrocytes.)&lt;br /&gt;
&lt;br /&gt;
Finally, read more about malaria and the complicated life cycle of the malaria parasite here: [http://www.cdc.gov/dpdx/malaria/ CDC - DPDx Malaria] .&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 1c)&#039;&#039;&#039; Report the names of the &#039;&#039;&#039;four&#039;&#039;&#039; species of parasites causing malaria in humans, and use the NCBI Genome (https://www.ncbi.nlm.nih.gov/datasets/genome/) database to investigate which of them (if any) have had their genomes sequenced.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Identification of membrane proteins (potential vaccine targets) ==&lt;br /&gt;
Malaria caused by &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) is by far the most lethal malaria variant. This parasite species is responsible for 80%-90% of the ~1 million annual deaths due to malaria. It will therefore be a natural starting point to develop a vaccine against this type of malaria.&lt;br /&gt;
&lt;br /&gt;
When the &#039;&#039;Pf&#039;&#039; genome was initially sequenced in the 1990s, it was based on &#039;&#039;Pf&#039;&#039; cells isolated from the blood of a Dutch malaria patient, who picked up the disease while traveling. Unfortunately, it was not recorded exactly where the patient had been. This isolate is named &#039;&#039;3D7&#039;&#039; and is the most studied malaria strain to this day (even though it&#039;s not known from where in the world it originates).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Task:&#039;&#039;&#039;&lt;br /&gt;
Locate the entry for &#039;&#039;Pf&#039;&#039; 3D7 in [http://www.ncbi.nlm.nih.gov/Taxonomy NCBI&#039;s taxonomy browser]. &amp;lt;!-- At the bottom of the page some technical information regarding the genome sequencing is shown (&amp;quot;Genome Information&amp;quot;), and --&amp;gt; In the multi-colored table on the right-hand side (&amp;quot;Entrez records&amp;quot;), a set of sequence-related data is shown. For instance, the &amp;quot;Gene&amp;quot; entry describes how many genes have been identified in the genome (including both manually curated genes and genes predicted using bioinformatics methods).&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;Question 2a)&#039;&#039;&#039; How many chromosomes, and how many verified genes (NOT hypothetical) does &#039;&#039;Pf&#039;&#039; 3D7 have? (&#039;&#039;&#039;Hints:&#039;&#039;&#039; First, follow the &amp;lt;u&amp;gt;Genome&amp;lt;/u&amp;gt; link and select the first assembly to see an overview of the chromosomes. Then, go back to the taxonomy page and follow the &amp;lt;u&amp;gt;Gene&amp;lt;/u&amp;gt; link and add &amp;lt;tt&amp;gt;NOT hypothetical&amp;lt;/tt&amp;gt; to the search string).&lt;br /&gt;
&lt;br /&gt;
Malaria takes place in different stages within the human host (see figure below), and this is important to take into account when designing a malaria vaccine. The disease development can be divided into two phases: 1) The liver-stage and 2) the blood-stage. The liver-stage is defined by &#039;&#039;sporozoites&#039;&#039; injected by the malaria mosquito, which travel to the liver and invade hepatocytes (liver cells). The blood-stage is the second stage and is reached when &#039;&#039;merozoites&#039;&#039; developed within the hepatocytes are released into the blood stream, where they invade erythrocytes (red blood cells). In both stages the malaria parasite hides from the human immune system by staying inside native human cells. &lt;br /&gt;
&lt;br /&gt;
Much of the effort towards developing malaria vaccines so far has been focused on surface exposed (cell-membrane) proteins from the &#039;&#039;sporozoites&#039;&#039; and &#039;&#039;merozoites&#039;&#039; as well as non-human proteins on the surface of infected hepatocytes and erythrocytes. &lt;br /&gt;
&lt;br /&gt;
[[Image:Nm0206-170-F1.jpg | center]]&lt;br /&gt;
&lt;br /&gt;
=== Searching UniProt ===&lt;br /&gt;
We&#039;ll now see if we can use the annotation of protein properties in UniProt to point us towards potential vaccine targets. When designing a vaccine it is important to make sure that the intended vaccine target is indeed &amp;quot;visible&amp;quot; to the immune system. Building on the information from the previous section, we therefore need to identify proteins that &#039;&#039;&#039;originate&#039;&#039;&#039; from the parasite, and that are present on the cell surface of &#039;&#039;sporozoites&#039;&#039;, &#039;&#039;merozoites&#039;&#039; OR infected host cells. In the case of infected host cells, we would therefore be looking for proteins that fulfill the following criteria:&lt;br /&gt;
&lt;br /&gt;
# Are secreted from the parasite to the vacuole &#039;&#039;inside&#039;&#039; the host cell,&lt;br /&gt;
# Migrate from the vacuole to the host cell, and&lt;br /&gt;
# Are transported to the surface (membrane) of the host cell&lt;br /&gt;
&lt;br /&gt;
Initially, we&#039;ll see how many hits we can find by searching for one or more of these criteria in relevant UniProtKB fields. Here we&#039;ll use the same search interface as in the UniProt exercise. We recommend having the original [[Exercise: The protein database UniProt|UniProt Exercise manual]] open in a different browser window for quick cross-referencing of what we have already learned about searching UniProt.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;Note:&#039;&#039;&#039; When answering the questions below, you have to &#039;&#039;write the search string&#039;&#039; you used in the answer; merely writing a number is not enough. When the search string is included in the answer, we can understand the reason for possible wrong answers.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2b)&#039;&#039;&#039; Go to [http://www.uniprot.org/ UniProt]. Investigate how many &#039;&#039;Plasmodium falciparum&#039;&#039; (&#039;&#039;Pf&#039;&#039;) proteins there are in total in UniProtKB (i.e. proteins from all &#039;&#039;Pf&#039;&#039; strains, not only from 3D7). How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2c)&#039;&#039;&#039; Now try to see how many of the hits from the previous question are from the strain (isolate) 3D7. Is the number approximately equal to the number you got in question &#039;&#039;&#039;2a)&#039;&#039;&#039;? How many of these are from Swiss-Prot and how many from TrEMBL? &lt;br /&gt;
&lt;br /&gt;
Now, we shall investigate whether we can use the annotations of subcellular location in UniProt. &#039;&#039;&#039;Note:&#039;&#039;&#039; We go back to working with all strains of &#039;&#039;Pf&#039;&#039;, not exclusively 3D7. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2d)&#039;&#039;&#039; First, check how many &#039;&#039;Pf&#039;&#039; proteins have a &amp;quot;&amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;&amp;quot; comment at all (&#039;&#039;&#039;Tip:&#039;&#039;&#039; choose &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt; in the menu and &amp;lt;!-- leave the &amp;lt;u&amp;gt;Term&amp;lt;/u&amp;gt; field empty)--&amp;gt;enter a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). How many from each part of the database? (&#039;&#039;&#039;Note&#039;&#039;&#039; that the ratio between TrEMBL and Swiss-Prot numbers changes considerably relative to question &#039;&#039;&#039;2b)&#039;&#039;&#039; — Swiss-Prot entries on average contain many more annotations than TrEMBL entries).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2e)&#039;&#039;&#039; How many of these are secreted? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; that should go into the field that pops up when the menu is set to &amp;lt;u&amp;gt;Subcellular location &amp;gt; Subcellular location [CC] &amp;gt; Subcellular location term&amp;lt;/u&amp;gt;).  &lt;br /&gt;
&lt;br /&gt;
To get more hits, we will try to search for other terms in the &amp;lt;u&amp;gt;Subcellular location term&amp;lt;/u&amp;gt; field. Interesting subcellular locations might include words such as &amp;quot;&amp;lt;tt&amp;gt;surface&amp;lt;/tt&amp;gt;&amp;quot; or &amp;quot;&amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt;&amp;quot;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2f)&#039;&#039;&#039; How many hits does each of these two terms give? &lt;br /&gt;
&lt;br /&gt;
The word &amp;quot;membrane&amp;quot; gave the highest number of hits, so we will examine those in more detail. Far from all of these proteins are suitable as vaccine targets. In order to be potentially interesting, they need to be located in the cell membrane (plasma membrane) of either the parasite or the host cell, &#039;&#039;not&#039;&#039; in an inner membrane in the cell. To get an overview, you should try another function in UniProt&#039;s interface: First, click to select the &amp;lt;u&amp;gt;Table&amp;lt;/u&amp;gt; view instead of the &amp;lt;u&amp;gt;Card&amp;lt;/u&amp;gt; view (above the results list). Then, click the button &amp;lt;u&amp;gt;Customize columns&amp;lt;/u&amp;gt;; that will bring up a table where you can find a &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt; item. Click it, mark &amp;lt;u&amp;gt;Subcellular location [CC]&amp;lt;/u&amp;gt;, and click &amp;lt;u&amp;gt;Close&amp;lt;/u&amp;gt;.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2g)&#039;&#039;&#039; Now look again at the list of results where &amp;quot;&amp;lt;u&amp;gt;subcellular location&amp;lt;/u&amp;gt;&amp;quot; contained &amp;quot;membrane&amp;quot;. Consider the field &amp;lt;u&amp;gt;Subcellular location&amp;lt;/u&amp;gt;. Give some examples (including accession codes, protein names, and reasons for selecting them) of hits that may be useful, and hits that are surely not useful as vaccine targets (at least two &#039;&#039;different&#039;&#039; examples of each). &#039;&#039;&#039;Hint:&#039;&#039;&#039; if you need to see some different examples, try clicking on the column headings in the table to sort the results list by, e.g., Accession (&amp;lt;u&amp;gt;Entry&amp;lt;/u&amp;gt;), Entry name, or Protein name. &lt;br /&gt;
&lt;br /&gt;
Now, let us focus on the life stage of the parasite where it is located inside an erythrocyte (a red blood cell), and thereby focus on the vaccine targets that are in the plasma membrane of the &#039;&#039;host cell&#039;&#039;. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2h)&#039;&#039;&#039; How many of the hits have the location &amp;quot;host cell membrane&amp;quot;?&lt;br /&gt;
&amp;lt;!-- These should ideally have a &amp;quot;Subcellular location&amp;quot; annotated as &amp;quot;erythrocyte membrane&amp;quot; or &amp;quot;host cell membrane&amp;quot; — but there are no examples of that in your search from the last question (you are welcome to try!).  --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These proteins could be very interesting as vaccine targets. However, the experimental researchers from your organization report that these have already been tried and do not work in practice, so they ask you to find other examples. We therefore try another approach: If the information we are looking for is not part of the &amp;quot;Subcellular location&amp;quot; annotation, it might be a part of the description (the protein name). &#039;&#039;&#039;Tip:&#039;&#039;&#039; you can always discard a search term in the Advanced interface by clicking the &amp;lt;u&amp;gt;Remove&amp;lt;/u&amp;gt; button.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2i)&#039;&#039;&#039; How many &#039;&#039;Pf&#039;&#039; proteins contain &amp;lt;tt&amp;gt;erythrocyte&amp;lt;/tt&amp;gt; in their &amp;lt;u&amp;gt;Protein Name [DE]&amp;lt;/u&amp;gt; field? How many of these are from Swiss-Prot (reviewed)?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2j)&#039;&#039;&#039; How many of these erythrocyte proteins also have &amp;lt;tt&amp;gt;membrane&amp;lt;/tt&amp;gt; in their name? &lt;br /&gt;
&lt;br /&gt;
Some of the hits you find in this way are very short (you can try to sort them by length by clicking the &amp;lt;u&amp;gt;Length&amp;lt;/u&amp;gt; heading). These short proteins might be fragments. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2k)&#039;&#039;&#039; How many of the hits are complete (not annotated as fragments)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; see question 16 in [[Exercise: The protein database UniProt|the UniProt exercise]]).&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;2l)&#039;&#039;&#039; Do any of these proteins have a determined 3D structure? In other words: Do any proteins from the previous search have a cross-reference to the database PDB? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you should look for &amp;lt;u&amp;gt;Cross-references&amp;lt;/u&amp;gt; in the menu, and again place a &amp;lt;tt&amp;gt;*&amp;lt;/tt&amp;gt; in the field). If yes, what are their names and accession codes?&lt;br /&gt;
&lt;br /&gt;
As a last step in this part of the exercise, you should save all sequences from the last search in FASTA format. This is most easily done by clicking &amp;lt;u&amp;gt;Download&amp;lt;/u&amp;gt; above the results list and choosing &amp;lt;u&amp;gt;FASTA (canonical)&amp;lt;/u&amp;gt;. You can either choose to download them (remember to choose &amp;lt;u&amp;gt;No&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Compressed&amp;lt;/u&amp;gt;) and then open them in a text editor or to preview them in the browser. In the latter case, keep the browser window with the sequences; we will need them later in the exercise.&lt;br /&gt;
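&lt;br /&gt;
If you downloaded the sequences, a quick sanity check of the file can also be scripted. The following is only a minimal sketch, assuming the FASTA text is available as a Python string; the function name &amp;lt;tt&amp;gt;parse_fasta&amp;lt;/tt&amp;gt; and the two example records are made up for illustration and are NOT real UniProt entries.&lt;br /&gt;
&lt;br /&gt;
```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record (if there was one).
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up demo records (hypothetical accessions and sequences).
example = """>sp|X00001|DEMO1_PLAF7 Hypothetical demo protein
MKTAYIAKQR
QISFVKSHFS
>tr|X00002|DEMO2_PLAF7 Another demo protein (Fragment)
MSLLTEVET
"""

for header, sequence in parse_fasta(example):
    # Print the accession (second field of the header) and the length.
    print(header.split("|")[1], len(sequence))
```
&lt;br /&gt;
Comparing the number of records and their lengths with the UniProt results table is a simple way to confirm that the download is complete.&lt;br /&gt;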
&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Analysis of membrane protein domain structure ==&lt;br /&gt;
[[Image:PfEMP1_transport.jpg|right|border]]&lt;br /&gt;
&lt;br /&gt;
The PfEMP1 (&#039;&#039;Plasmodium falciparum&#039;&#039; Erythrocyte Membrane Protein 1) proteins, which we have now found, constitute a group of proteins expressed by the malaria parasite and transported to the plasma membrane of the infected erythrocyte (see figure; the red/orange sticks represent PfEMP1 proteins). &lt;br /&gt;
&lt;br /&gt;
The function of the PfEMP1 proteins on the surface of the infected erythrocytes is to mediate binding to certain receptors on the surface of endothelial cells (the cells making up the walls of blood vessels). In this way, the malaria parasite can make the infected erythrocytes stick to the walls of the blood vessels in various tissues of the body, and thereby it can avoid being transported through the spleen (Danish: &#039;&#039;milten&#039;&#039;) which otherwise removes diseased erythrocytes from the blood and is one of the main actors in generating an immune response against malaria.&lt;br /&gt;
&lt;br /&gt;
If we, using a vaccine, can generate antibodies that bind to the PfEMP1 proteins, preventing the infected erythrocytes from binding to the endothelial cells, the body would be able to generate a faster and broader immune response against &#039;&#039;Pf&#039;&#039;. Symptoms such as anemia would thereby not become so severe. &lt;br /&gt;
&lt;br /&gt;
We will now examine how the PfEMP1 proteins are built. &lt;br /&gt;
&lt;br /&gt;
Look at the entries you found at the end of section 2. Select just those hits whose accession codes start with &amp;quot;Q&amp;quot; (there should be three of them; otherwise, revisit section 2). &lt;br /&gt;
&lt;br /&gt;
Take a closer look (in UniProt) at these three entries. Scroll down to &amp;lt;u&amp;gt;Family and domain databases&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Family &amp;amp; Domains&amp;lt;/u&amp;gt;. Here, you will find some services providing an overview of known families/domains in the protein in question. &amp;lt;u&amp;gt;InterPro&amp;lt;/u&amp;gt; is the most important of these, since it collects information from a number of family &amp;amp; domain databases (including the one called &amp;lt;u&amp;gt;Pfam&amp;lt;/u&amp;gt;) and therefore has the widest repertoire of domain types. &lt;br /&gt;
&lt;br /&gt;
Open the link labeled &amp;lt;u&amp;gt;View protein in InterPro&amp;lt;/u&amp;gt; in a new tab. Note the graphical interface of InterPro under the heading &amp;quot;Entry matches to this protein&amp;quot;. When you hover the mouse over one of the coloured bars, the name of the family/domain will appear. Note that each family/domain in InterPro has at least &#039;&#039;two&#039;&#039; names and identifiers, an InterPro identifier beginning with &amp;quot;IPR&amp;quot; and a member database identifier, e.g. beginning with &amp;quot;PF&amp;quot; if it is derived from Pfam.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;What are families and domains, anyway?&#039;&#039;&#039;&lt;br /&gt;
:Here are the definitions from the [https://www.ebi.ac.uk/interpro/help/faqs/ InterPro FAQ]:&lt;br /&gt;
:*&#039;&#039;&#039;Domains&#039;&#039;&#039; are distinct functional, structural or sequence units that may exist in a variety of biological contexts. A match to an InterPro entry of this type indicates the presence of a domain. Common examples of protein domains are the PH domain, Immunoglobulin domain or the classical C2H2 zinc finger. &lt;br /&gt;
:*A protein &#039;&#039;&#039;family&#039;&#039;&#039; is a group of proteins that share a common evolutionary origin reflected by their related functions, similarities in sequence, or similar primary, secondary or tertiary structure. A match to an InterPro entry of this type indicates membership of a protein family. &lt;br /&gt;
:However, the distinction between what is regarded as a family and what is regarded as a domain is not completely sharp.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3a)&#039;&#039;&#039; Note that one named family/domain is found in several copies in all three of our erythrocyte membrane proteins. What are the names and identifiers of this family/domain? How many times does it occur in each of the proteins?&lt;br /&gt;
&lt;br /&gt;
Click the identifiers for this particular family/domain and read more about it.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3b)&#039;&#039;&#039; Under &amp;quot;Other Features&amp;quot;, InterPro has annotated a transmembrane segment. Which positions are transmembrane in the three proteins? Which part (N- or C-terminal part) of the proteins is intracellular, and which part is extracellular?&lt;br /&gt;
&lt;br /&gt;
Look (in UniProt) at the PDB cross-references under &amp;lt;u&amp;gt;3D structure databases&amp;lt;/u&amp;gt; (under &amp;lt;u&amp;gt;Structure&amp;lt;/u&amp;gt;). Focus on X-ray structures only. Compare the coordinates (positions) for the structures to the coordinates for the domains denoted in Pfam. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3c)&#039;&#039;&#039; Which positions are structurally determined &#039;&#039;&#039;by X-ray&#039;&#039;&#039; in each of the three proteins? If you number the occurrences of the known family/domain from &#039;&#039;&#039;3a&#039;&#039;&#039; (1, 2, 3, and so on, starting from the N-terminus), which number(s) are covered by the structurally determined region(s) in each of the three proteins? &lt;br /&gt;
&lt;br /&gt;
Now read what is said about the function and location of our proteins according to Gene Ontology (&amp;lt;u&amp;gt;GO - Molecular function&amp;lt;/u&amp;gt;, &amp;lt;u&amp;gt;GO - Biological process&amp;lt;/u&amp;gt; and &amp;lt;u&amp;gt;GO - Cellular component&amp;lt;/u&amp;gt;) in UniProt.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;3d)&#039;&#039;&#039; Do these pieces of information support our choice of these proteins as vaccine targets? Give at least 3 examples.&lt;br /&gt;
&lt;br /&gt;
&amp;amp;nbsp;&lt;br /&gt;
&lt;br /&gt;
== Prediction of B-cell epitopes in a membrane protein ==&lt;br /&gt;
&#039;&#039;&#039;Q8I639&#039;&#039;&#039; is also known as VAR2CSA, and this protein is of particular interest, since it is considered to be responsible for &#039;&#039;Pregnancy-associated malaria&#039;&#039; (PAM). Pregnant women are more prone to contract malaria, which sadly leads to the deaths of ~10,000 mothers and ~200,000 newborn/unborn children annually. &lt;br /&gt;
&lt;br /&gt;
One of the reasons why it has been so difficult to develop a malaria vaccine is that the malaria parasite carries ~60 PfEMP1 protein variants, and that you&#039;ll need antibodies against all of them to be immune. However, in the case of PAM there is only one specific PfEMP1 in play, and this special case is therefore easier to start to address with a vaccine.&lt;br /&gt;
&lt;br /&gt;
In order to have a better handle on our bioinformatics work, we&#039;ll concentrate the effort on the Duffy binding domain in VAR2CSA for which a 3D structure is available (the one we found in &#039;&#039;&#039;question 3c&#039;&#039;&#039;).  &lt;br /&gt;
&lt;br /&gt;
=== Epitope prediction ===&lt;br /&gt;
The vaccine we are working towards designing should contain &#039;&#039;&#039;epitopes&#039;&#039;&#039;. Epitopes are the parts of the disease-associated protein that the immune system will recognize, for instance the parts the infected person&#039;s antibodies will bind to (the so-called &#039;&#039;&#039;B-cell epitopes&#039;&#039;&#039;; there also exist &#039;&#039;&#039;T-cell epitopes&#039;&#039;&#039;, which we&#039;ll not cover here).&lt;br /&gt;
&lt;br /&gt;
For predicting which parts of the protein are potential epitopes, we&#039;ll use the &#039;&#039;&#039;BepiPred 2.0 server&#039;&#039;&#039;, which was created here at DTU.  &lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Important Note:&#039;&#039;&#039; Please run the prediction on the web server of the IEDB instead of the one at DTU, as our local servers had an update that has modified the results. &lt;br /&gt;
:Please select the method called &amp;quot;BepiPred 2.0&amp;quot;&lt;br /&gt;
:http://tools.iedb.org/bcell/ &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the structure-determined Duffy binding domain in VAR2CSA. This must be done using the link to PDB from UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt.&lt;br /&gt;
# Go to the Structure section.&lt;br /&gt;
# Right-click the link labeled &amp;lt;u&amp;gt;RCSB-PDB&amp;lt;/u&amp;gt; and open it in a new tab. This will take you to a PDB page.&lt;br /&gt;
# Here, you can find the sequence by clicking &amp;lt;u&amp;gt;Display Files&amp;lt;/u&amp;gt; and choosing &amp;lt;u&amp;gt;FASTA Sequence&amp;lt;/u&amp;gt;. Alternatively, you can choose to download the sequence by clicking &amp;lt;u&amp;gt;Download Files&amp;lt;/u&amp;gt;.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]]&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; In order to run the prediction, we&#039;ll first need to extract the amino acid sequence for the Duffy binding domain in VAR2CSA. This can be done using only the web-interface for UniProt:&lt;br /&gt;
# Find the [https://www.uniprot.org/uniprotkb/Q8I639/entry VAR2CSA entry] in UniProt&lt;br /&gt;
# Locate the section concerning CROSS-REFERENCES to 3D structures (NOT the &amp;quot;live action&amp;quot; 3D structure you can move around).&lt;br /&gt;
#* Find the field called &#039;&#039;&#039;positions&#039;&#039;&#039; — this is actually a &#039;&#039;&#039;clickable link!&#039;&#039;&#039;&lt;br /&gt;
#* Click the positions link — this will open up a new page where this subsequence can be used for a BLAST query.&lt;br /&gt;
#* &#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; do NOT start the BLAST run, but just copy out the FASTA sequence, it contains ONLY the sequence interval specified in the &#039;&#039;&#039;positions&#039;&#039;&#039; field.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4a&#039;&#039;&#039;: What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4b&#039;&#039;&#039;: Report the FASTA sequence of the structure-determined Duffy binding domain in VAR2CSA. &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4c&#039;&#039;&#039;: &lt;br /&gt;
Note down the following from the UniProt entry; you&#039;ll need it in the next section:&lt;br /&gt;
* What was the sequence interval in the coordinates of the original (full) UniProt sequence?&lt;br /&gt;
* What position in the original protein does position 1 in the new FASTA file correspond to?&lt;br /&gt;
&lt;br /&gt;
You can now run the &#039;&#039;&#039;BepiPred 2.0&#039;&#039;&#039; prediction server on the domain sequence (ONLY the subset extracted above). Run it and then adjust the following on the &#039;&#039;&#039;results page&#039;&#039;&#039;: &lt;br /&gt;
* Set &#039;&#039;&#039;threshold&#039;&#039;&#039; to &#039;&#039;&#039;0.55&#039;&#039;&#039;&lt;br /&gt;
This gives us a reasonable number of epitopes to continue our work with:&lt;br /&gt;
* Write down the start/end sequence positions of all epitopes of at least &#039;&#039;&#039;8 amino acids&#039;&#039;&#039;&lt;br /&gt;
* &#039;&#039;&#039;Hint:&#039;&#039;&#039; there should be &#039;&#039;&#039;7&#039;&#039;&#039; such epitopes, and the last one starts at position &#039;&#039;&#039;276&#039;&#039;&#039; &lt;br /&gt;
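&lt;br /&gt;
The filtering step above (threshold plus minimum length) can also be sketched in code. This is only an illustration, assuming you have copied the per-residue scores into a Python list and that a residue counts as part of an epitope when its score is at or above the threshold (whether the server uses &amp;quot;at or above&amp;quot; or &amp;quot;strictly above&amp;quot; is an assumption here). The function name &amp;lt;tt&amp;gt;find_epitopes&amp;lt;/tt&amp;gt; and the scores are made up and are not BepiPred output.&lt;br /&gt;
&lt;br /&gt;
```python
def find_epitopes(scores, threshold=0.55, min_length=8):
    """Return 1-based (start, end) runs of consecutive residues scoring >= threshold."""
    epitopes = []
    run_start = None
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if run_start is None:
                run_start = position  # a new run begins here
        else:
            # The run (if any) ended at the previous position.
            if run_start is not None and position - run_start >= min_length:
                epitopes.append((run_start, position - 1))
            run_start = None
    # Handle a run that extends to the last residue.
    if run_start is not None and len(scores) - run_start + 1 >= min_length:
        epitopes.append((run_start, len(scores)))
    return epitopes


# Made-up scores: one run of 8 residues (kept) and one run of 5 (discarded).
demo_scores = [0.4] * 3 + [0.6] * 8 + [0.5] * 2 + [0.7] * 5
print(find_epitopes(demo_scores))
```
&lt;br /&gt;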
&lt;br /&gt;
[[image:BepiPred-2_onIEDB.png|thumb|center|600px|Click to zoom]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Question 4d&#039;&#039;&#039;: Create a table with the following information about the predicted epitopes:&lt;br /&gt;
* Start/end position, length, Start/end position &#039;&#039;in the original protein&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;(We&#039;ll need the coordinate-transformed values for the PyMOL visualization)&#039;&#039;&lt;br /&gt;
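&lt;br /&gt;
The coordinate transformation is a simple offset. The sketch below assumes that position 1 of the extracted FASTA sequence corresponds exactly to the first position of the interval you noted in &#039;&#039;&#039;question 4c&#039;&#039;&#039; (check this yourself, for instance for extra residues at the start of the PDB sequence). The value &amp;lt;tt&amp;gt;domain_start = 100&amp;lt;/tt&amp;gt; and the epitope positions are hypothetical placeholders, not the answer.&lt;br /&gt;
&lt;br /&gt;
```python
def to_full_protein(position, domain_start):
    """Map a 1-based position in the domain sequence to the full protein."""
    return position + domain_start - 1


def epitope_table(epitopes, domain_start):
    """Build one table row per (start, end) epitope given in domain coordinates."""
    rows = []
    for start, end in epitopes:
        rows.append({
            "start": start,
            "end": end,
            "length": end - start + 1,
            "full_start": to_full_protein(start, domain_start),
            "full_end": to_full_protein(end, domain_start),
        })
    return rows


# Hypothetical values for illustration only.
for row in epitope_table([(4, 11), (20, 30)], domain_start=100):
    print(row)
```
&lt;br /&gt;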
&lt;br /&gt;
== Visualization of epitopes ==&lt;br /&gt;
Lastly, we&#039;ll want to visualize the epitopes in the VAR2CSA Duffy binding domain. Generally, BepiPred 2.0 is very good at selecting surface exposed epitopes, but it&#039;s still a good idea to check it visually.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
For the Q8I639 UniProt entry we have been working with, look at the structure section again, and find the link to the PDB structure of the Duffy binding domain.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; What is the name of the PDB entry, and is it a crystal or NMR structure?&lt;br /&gt;
&lt;br /&gt;
Sometimes it will not be possible to get reliable structural information about the entire protein (or in this case the Duffy binding domain). This could for example be the case if parts of the protein are in &#039;&#039;disorder&#039;&#039; (essentially not stabilized and not fixed in place in the crystal). We&#039;ll investigate this next. &lt;br /&gt;
&lt;br /&gt;
From the UniProt page, locate the right structure in PDB:&lt;br /&gt;
* Method 1: Go to https://www.rcsb.org and search for the structure&lt;br /&gt;
* Method 2: Adjust the cross-link in the &amp;quot;structure&amp;quot; section in UniProt to be &amp;quot;RCSB PDB&amp;quot; and click the link.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
In the PDB database page for the structure you found in the last section, click the &amp;quot;Sequence&amp;quot; tab and look at the figure. In the case of this structure, the authors&#039; numbering directly follows the coordinates from the FULL UniProt sequence.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5a):&#039;&#039;&#039; &lt;br /&gt;
* Which intervals in the sequence are missing (disordered/invisible) in the structure? Hint: Look at the &amp;quot;UNMODELED&amp;quot; feature. &amp;lt;!-- DSSP legend and notice what the lack of underlining means. --&amp;gt;&lt;br /&gt;
* Will this have an impact on any of our predicted epitopes?&lt;br /&gt;
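&lt;br /&gt;
Checking whether a missing interval touches an epitope is an interval-overlap test. A minimal sketch follows, assuming both the epitopes and the unmodeled regions are written as inclusive (start, end) pairs in the SAME coordinate system (full UniProt numbering). All numbers are made up.&lt;br /&gt;
&lt;br /&gt;
```python
def overlap(a, b):
    """Return the shared (start, end) of two inclusive intervals, or None."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if end >= start else None


def affected_epitopes(epitopes, unmodeled):
    """List (epitope, gap, shared_region) for every epitope hit by a gap."""
    hits = []
    for epitope in epitopes:
        for gap in unmodeled:
            shared = overlap(epitope, gap)
            if shared is not None:
                hits.append((epitope, gap, shared))
    return hits


# Hypothetical intervals: the first epitope is partly missing, the second is intact.
demo_epitopes = [(103, 110), (150, 160)]
demo_unmodeled = [(108, 120)]
print(affected_epitopes(demo_epitopes, demo_unmodeled))
```
&lt;br /&gt;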
&lt;br /&gt;
Now it&#039;s time to visualize the epitopes in PyMOL. IMPORTANT: Refer back to the PyMOL exercise if you have forgotten some of the PyMOL fundamentals. &lt;br /&gt;
&lt;br /&gt;
The goal will be to:&lt;br /&gt;
* Colour the epitopes in different colours&lt;br /&gt;
* Have a look at where in the structure they are found: on the surface or inside.&lt;br /&gt;
&lt;br /&gt;
After you have loaded the structure (either via &amp;quot;fetch&amp;quot; or by downloading the file), you can help yourself by setting the base colour to a neutral grey, and with a basic &amp;quot;cartoon&amp;quot; visualization as the first step:&lt;br /&gt;
&lt;br /&gt;
 color gray80&lt;br /&gt;
 hide all&lt;br /&gt;
 show cartoon&lt;br /&gt;
&lt;br /&gt;
Since we&#039;re working with 7 epitopes it can be beneficial to work with named selections. To avoid renaming selections you can specify the name directly in the select command:&lt;br /&gt;
 select epitope_XXX, resi 1-3&lt;br /&gt;
&lt;br /&gt;
This will create the selection of residues 1 to 3 under the name &amp;quot;epitope_XXX&amp;quot; — please refer to the PyMOL exercise for more details about selection rules.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK:&#039;&#039;&#039;&lt;br /&gt;
* Create named selections for all seven epitopes&lt;br /&gt;
** Select a good naming scheme, for example epitope_1 to epitope_7, or reference the first position (e.g. epitope_276 for the last one)&lt;br /&gt;
** Select a unique and easy to identify colour for each epitope.&lt;br /&gt;
** HINT: Turn on the sequence viewer — then you can directly see your selections AND colours in the sequence as well!&lt;br /&gt;
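&lt;br /&gt;
Typing seven &amp;lt;tt&amp;gt;select&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;color&amp;lt;/tt&amp;gt; commands by hand is error-prone, so one option is to generate the command lines from your table. The sketch below only produces text that you can paste into the PyMOL command line; the function name, the colour order and the residue ranges are arbitrary choices made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
# Standard PyMOL colour names; the order is an arbitrary choice.
COLOURS = ["red", "orange", "yellow", "green", "cyan", "blue", "magenta"]


def pymol_commands(epitopes):
    """Return PyMOL command lines that select and colour each epitope."""
    lines = []
    for number, (start, end) in enumerate(epitopes, start=1):
        name = f"epitope_{number}"
        colour = COLOURS[(number - 1) % len(COLOURS)]
        lines.append(f"select {name}, resi {start}-{end}")
        lines.append(f"color {colour}, {name}")
    return lines


# Hypothetical residue ranges in full-protein numbering.
for line in pymol_commands([(103, 110), (150, 160)]):
    print(line)
```
&lt;br /&gt;
Remember to use the residue numbering of the structure itself (the coordinate-transformed values) when filling in the ranges.&lt;br /&gt;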
&lt;br /&gt;
As you may have noticed there are two (identical) chains in the structure. We only need one of them, and the next step will be to separate them out.&lt;br /&gt;
&lt;br /&gt;
 create ka, chain A&lt;br /&gt;
&lt;br /&gt;
This will create a new object with the A chain. &lt;br /&gt;
* Repeat this for the B chain (you could call the object kb), and then use the click-interface to hide the whole structure, and select ONLY one of the chains to continue to work with. REMEMBER to write which chain you have chosen to work with.&lt;br /&gt;
&lt;br /&gt;
Lastly, we&#039;ll need to look at how the epitopes are located relative to the surface. Here you can benefit from switching between two types of visualization (using the click interface):&lt;br /&gt;
* show as → surface, to show the protein from the outside.&lt;br /&gt;
* show as → cartoon, then show → mesh, to show BOTH the inside and outside; this works especially nicely when you actively rotate the structure.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&#039;&#039;&#039;Question 5b):&#039;&#039;&#039; Play around with the visualization, and create one (or more) good figures for your report that show the following:&lt;br /&gt;
* Placement of the epitopes&lt;br /&gt;
* A legend for the colours (or arrows with explanations or something similar)&lt;br /&gt;
* Which epitopes are (partly) missing?&lt;br /&gt;
* Are the remaining epitopes accessible on the surface of the protein?&lt;br /&gt;
&lt;br /&gt;
== Epilogue ==&lt;br /&gt;
&#039;&#039;Now all that remains is to ship off the sequences of the surface accessible epitopes to the lab, to start the long process of constructing an expression vector with the gene fragments, with the right linker sequences, getting it expressed in a production host, follow up with animal testing and phase 1, 2 and 3 clinical trials, and the vaccine should be ready for the market.&#039;&#039;&lt;/div&gt;</summary>
		<author><name>Henni</name></author>
	</entry>
</feed>