<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22111/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Carol</id>
	<title>22111 - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22111/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Carol"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php/Special:Contributions/Carol"/>
	<updated>2026-04-26T11:50:44Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Unknown_variant&amp;diff=798</id>
		<title>Exercise:Unknown variant</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Unknown_variant&amp;diff=798"/>
		<updated>2026-04-19T17:54:10Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
What happens when you don&#039;t find your variant of interest?&lt;br /&gt;
As an example, we are going to work with a variant in the glucagon-like peptide-1 receptor (GLP1R) to assess whether patients bearing that mutation can respond to Ozempic.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
In this exercise, you will:&lt;br /&gt;
&lt;br /&gt;
* Identify a mutation in a patient GLP1R sequence&lt;br /&gt;
* Determine its effect at the protein level&lt;br /&gt;
* Evaluate whether the mutation affects drug response (Ozempic)&lt;br /&gt;
&lt;br /&gt;
===Tools you will use===&lt;br /&gt;
&lt;br /&gt;
* NCBI Nucleotide and GenBank&lt;br /&gt;
* ClinVar&lt;br /&gt;
* Ensembl&lt;br /&gt;
* gnomAD&lt;br /&gt;
* Virtual Ribosome: translates DNA into protein sequences and can scan reading frames to find coding regions&lt;br /&gt;
* UniProt&lt;br /&gt;
* PDB&lt;br /&gt;
* PyMOL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Where to find GenBank===&lt;br /&gt;
The GenBank database is hosted at NCBI (National Center for Biotechnology Information, USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also hosts a number of other biological databases (for example whole-genome databases for human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical &amp;quot;GenBank&amp;quot; database. &amp;lt;!-- (http://www.ncbi.nlm.nih.gov/gene/).--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Using the &amp;quot;Entrez&amp;quot; database browser===&lt;br /&gt;
ALL the NCBI databases can be queried through a common search interface named Entrez. On nearly all NCBI webpages a search box can be found in the upper part of the page, allowing easy access to the individual databases (or a search across all databases). Click on the following link to open up a new browser window with Entrez, where the focus is pre-set to search in the GenBank database:&lt;br /&gt;
&lt;br /&gt;
http://www.ncbi.nlm.nih.gov/gene&lt;br /&gt;
&lt;br /&gt;
(Alternatively go to [http://www.ncbi.nlm.nih.gov/ the main NCBI webpage] and choose &amp;quot;Nucleotide&amp;quot; as the database).&lt;br /&gt;
[[Image:NCBI-nucleotide.png|link=http://www.ncbi.nlm.nih.gov/nucleotide|center]]&lt;br /&gt;
&lt;br /&gt;
==Part 1: Concerning the DATA in GenBank==&lt;br /&gt;
This part of the exercise is about the types of data hosted in GenBank.&lt;br /&gt;
&lt;br /&gt;
===Searching for a specific ID===&lt;br /&gt;
&lt;br /&gt;
The typical reasons for searching for a specific ID in GenBank are looking up information from the literature (e.g. a gene found in a study), following up on information from other databases, investigating lists of interesting genes etc. In this part of the exercise we will be working with a set of &#039;&#039;&#039;alpha-globin genes&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Search for &amp;lt;tt&amp;gt;AB001981&amp;lt;/tt&amp;gt;&#039;&#039;&#039; - by default the result is shown in the &#039;&#039;&#039;GenBank format&#039;&#039;&#039;. &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 1.1:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many genes are contained in this entry? &lt;br /&gt;
:: b) From which organism does the DNA originate? &lt;br /&gt;
:: c) What kind of information is contained within the HEADER and within the FEATURE block?&lt;br /&gt;
&lt;br /&gt;
===PubMed links===&lt;br /&gt;
Notice that the publication from which the DNA sequence originates is cited (and linked via a [http://www.ncbi.nlm.nih.gov/pubmed/ PubMed] ID) within the header. Sometimes multiple publications related to the same gene are listed. This is of great importance since it makes it possible to trace the source(s) of the DNA sequence and investigate if the experiments carried out are to be trusted. &lt;br /&gt;
&lt;br /&gt;
This can be of real importance if something seems &amp;quot;wrong&amp;quot; with the sequence (for example if this particular gene exhibits a really strange intron/exon structure compared to other closely related genes, or if it simply doesn&#039;t match ANY other known genes of the same family). By investigating the original publication it&#039;s possible to double-check the experimental procedure. It may be that the article correctly states the gene to be of type XXX, but when the data were submitted it was accidentally annotated as YYY (it is the original researchers&#039; responsibility to double-check this). There can also be more serious problems with the experiments, ranging from bad/wrong PCR primers to contamination with DNA from a different species during a cloning step.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png‎|left]] &#039;&#039;NEVER FORGET: biological data CAN be wrong.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Investigate the PubMed link(s):&#039;&#039;&#039; &lt;br /&gt;
** Follow the &amp;lt;u&amp;gt;PubMed&amp;lt;/u&amp;gt; link from the sequence entry. &lt;br /&gt;
** Observe that it is always possible to read the ABSTRACT of the publication in PubMed, even if access to the publication requires subscription. For most (new) publications there will also be a direct link to the publication itself. &lt;br /&gt;
** Return to the sequence entry once again (or perform the search again if you closed the window).&lt;br /&gt;
&lt;br /&gt;
===GenBank vs. FASTA format===&lt;br /&gt;
* &#039;&#039;&#039;View the sequence entry in FASTA format&#039;&#039;&#039; (Simply click on &amp;quot;&amp;lt;u&amp;gt;FASTA&amp;lt;/u&amp;gt;&amp;quot; in the top part of the page, below the page title) &amp;lt;br/&amp;gt; Now the entire GenBank entry is shown in FASTA format. &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 1.2:&#039;&#039;&#039;&lt;br /&gt;
:: a) What happened to the alpha-globin genes? Can they still be found? &lt;br /&gt;
:: b) Which part of the GenBank entry has been converted?&lt;br /&gt;
: Observe that the name of the sequence is based on the name of the GenBank entry. &lt;br /&gt;
* &#039;&#039;&#039;Go back to GenBank format&#039;&#039;&#039; (Click on &amp;quot;&amp;lt;u&amp;gt;GenBank&amp;lt;/u&amp;gt;&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK: Save the GenBank &amp;quot;raw data&amp;quot; on your own computer:&#039;&#039;&#039;&lt;br /&gt;
* Click on &amp;quot;&amp;lt;u&amp;gt;Send:&amp;lt;/u&amp;gt;&amp;quot; in the upper right part of the page &lt;br /&gt;
* Choose &amp;quot;Complete Record&amp;quot;, &amp;quot;File&amp;quot; and &amp;quot;Genbank(full)&amp;quot; and click on &amp;quot;&amp;lt;u&amp;gt;Create file&amp;lt;/u&amp;gt;&amp;quot; &lt;br /&gt;
* Locate the downloaded file on your own computer &lt;br /&gt;
* By default it has a pretty generic name (&amp;quot;sequence.gb&amp;quot;) - rename the file to &amp;quot;&amp;lt;tt&amp;gt;AB001981.gb&amp;lt;/tt&amp;gt;&amp;quot; &amp;lt;br/&amp;gt;&#039;&#039;Notice&#039;&#039;: The reason for renaming the file is simply a practice of good file management - now we can by just skimming the filenames guess that it&#039;s a GenBank file (&amp;quot;&amp;lt;tt&amp;gt;*.gb&amp;lt;/tt&amp;gt;&amp;quot;) and that it contains the &amp;quot;&amp;lt;tt&amp;gt;AB001981&amp;lt;/tt&amp;gt;&amp;quot; entry.&lt;br /&gt;
* Open it in Geany. &amp;lt;br/&amp;gt; &#039;&#039;Notice&#039;&#039;: What we have now is the &amp;quot;raw&amp;quot; data behind the information shown online, with no fancy HTML formatting and cross-links.  &lt;br /&gt;
* Verify that the contents of the file are as expected by inspecting it in Geany (it should look exactly like the information shown online).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.3&#039;&#039;&#039;: Does the downloaded file have UNIX or Windows line-endings?&lt;br /&gt;
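&lt;br /&gt;
Besides inspecting the file in Geany, the line endings can also be counted directly from the raw bytes. The following is a minimal Python sketch, not part of the original exercise: the function name &amp;quot;count_line_endings&amp;quot; is hypothetical, and the sample bytes are made up for the demonstration rather than taken from the real GenBank file.&lt;br /&gt;
&lt;br /&gt;
```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows) and bare LF (UNIX) line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    return {"CRLF": crlf, "LF": lf}


# In-memory demonstration with made-up content (not the real GenBank file).
sample = b"LOCUS       EXAMPLE\r\nDEFINITION  demo\r\nORIGIN\n"
print(count_line_endings(sample))

# For the downloaded file, read it in binary mode so that Python does not
# translate the line endings:
#     with open("AB001981.gb", "rb") as handle:
#         print(count_line_endings(handle.read()))
```
&lt;br /&gt;
A file with only UNIX line endings gives a CRLF count of zero, while a file with only Windows line endings gives an LF count of zero.&lt;br /&gt;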
&lt;br /&gt;
===Exploring the genes defined in a GenBank entry===&lt;br /&gt;
&#039;&#039;&#039;Go back to the GenBank entry in your browser. Click the first &amp;quot;&amp;lt;u&amp;gt;CDS&amp;lt;/u&amp;gt;&amp;quot; element (Alpha-D)&#039;&#039;&#039; &lt;br /&gt;
*CDS = &#039;&#039;&#039;C&#039;&#039;&#039;o&#039;&#039;&#039;D&#039;&#039;&#039;ing &#039;&#039;&#039;S&#039;&#039;&#039;equence: The PROTEIN CODING part of a gene. Basically: the sequence you get when the CODING exons are concatenated (UTR regions are ignored). A complete CDS starts with a START codon and ends with a STOP codon. &lt;br /&gt;
* Hopefully it&#039;s quite intuitive why some of the sequence is highlighted - otherwise discuss it within the group (or with the instructor)&lt;br /&gt;
&lt;br /&gt;
Repeat the same procedure for the other CDS (Alpha-A). &lt;br /&gt;
*When looking at the FEATURE table, the first line of text in the definition of each CDS is as follows: &lt;br /&gt;
 join(1104..1192,1306..1510,1614..1742) &lt;br /&gt;
 join(4915..5009,5165..5369,5474..5602) &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.4:&#039;&#039;&#039; Based on your observations: &lt;br /&gt;
::a) What do these numbers mean? &lt;br /&gt;
::b) How many coding exons does each gene contain?&lt;br /&gt;
* &#039;&#039;&#039;View both of the CDSs in FASTA format&#039;&#039;&#039; (click &amp;quot;Send to&amp;quot; in the upper right corner, choose &amp;quot;Coding Sequences&amp;quot; and set format to &amp;quot;FASTA&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.5&#039;&#039;&#039;: What do the numbers in the sequence title represent? &lt;br /&gt;
* &#039;&#039;&#039;Switch to Graphic view&#039;&#039;&#039; (Click on &amp;lt;u&amp;gt;Graphics&amp;lt;/u&amp;gt; at the top of the page) &amp;lt;br/&amp;gt; An interactive graphical representation of the GenBank entry will now be shown. The upper part of the visualization shows the entire length of the entry (5,891 bp) with bars representing the individual exons within the two genes. &lt;br /&gt;
** The zoomed view below can be changed by dragging the transparent box with the blue borders in the overview representation at the top of the page. &lt;br /&gt;
** The zoom level can be changed. &lt;br /&gt;
** By &amp;quot;mousing over&amp;quot; the bars additional information about that particular feature will be shown. &lt;br /&gt;
The graphical overview is mostly useful for inspecting GenBank entries with multiple genes (some entries have hundreds of embedded genes). Play around with the interface for a few minutes to see what functionality is offered.&lt;br /&gt;
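&lt;br /&gt;
Once you have answered Question 1.4 yourself, the two join(...) locations from the FEATURE table can be checked programmatically. The following is a minimal Python sketch under stated assumptions: the helper names &amp;quot;parse_join&amp;quot; and &amp;quot;cds_length&amp;quot; are hypothetical, and the regular expression only handles simple locations of the form shown in this entry (no complement, no partial-range markers).&lt;br /&gt;
&lt;br /&gt;
```python
import re


def parse_join(location: str) -> list:
    """Return (start, end) pairs from a simple join(...) location.

    GenBank positions are 1-based and both ends are inclusive.
    """
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def cds_length(location: str) -> int:
    """Total number of bases covered by all segments of the location."""
    return sum(end - start + 1 for start, end in parse_join(location))


# The two locations quoted from the FEATURE table of AB001981.
locations = [
    "join(1104..1192,1306..1510,1614..1742)",
    "join(4915..5009,5165..5369,5474..5602)",
]
for loc in locations:
    segments = parse_join(loc)
    total = cds_length(loc)
    print(loc, len(segments), "segments,", total, "bases,", total % 3, "left over")
```
&lt;br /&gt;
A remainder of zero after dividing the total by three is a quick sanity check that the concatenated segments form whole codons.&lt;br /&gt;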
&lt;br /&gt;
== Part 2: Searching GenBank ==&lt;br /&gt;
The key issue to keep in mind when searching GenBank is to avoid drowning in huge amounts of irrelevant data. It is therefore of great importance to filter out unwanted information, WITHOUT losing the relevant entries. Today we will work with searching the TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to sequence based searches (BLAST). &lt;br /&gt;
&lt;br /&gt;
In the first part of the exercise we&#039;ll investigate various ways to search using &#039;&#039;&#039;insulin&#039;&#039;&#039; as the example.&lt;br /&gt;
&lt;br /&gt;
===Naïve search===&lt;br /&gt;
&#039;&#039;&#039;Search for GenBank entries containing the term &amp;quot;&amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;&amp;quot;&#039;&#039;&#039;&lt;br /&gt;
* Just do a simple search for INSULIN - don&#039;t put anything else in the search box.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Observe the following&#039;&#039;&#039;: &lt;br /&gt;
* A large number of entries are found. &lt;br /&gt;
* Go through a few pages of results and notice that we are offered data from a diverse set of sources: Experimental work, Patent applications, predicted genes, partial genes etc. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.1:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many search results were returned? &amp;lt;!-- (only the &amp;quot;Nucleotide&amp;quot; hits, not the &amp;quot;EST&amp;quot; and &amp;quot;GSS&amp;quot; hits) --&amp;gt;&lt;br /&gt;
:: b) Are they all from Human? If no, give a counterexample. (Would you have expected them to be all human?)&lt;br /&gt;
:: c) Are they all insulin? If no, give a counterexample.&lt;br /&gt;
&lt;br /&gt;
By default the search term is matched against ALL POSSIBLE fields in the GenBank entries - including almost all text in the HEADER and FEATURE table. It&#039;s even possible to pick up entries where the match is to one of the authors&#039; names and not a gene name! (Perhaps not an issue for insulin). Luckily it is possible to restrict the search to specific pre-indexed fields in the HEADER and FEATURE table (&amp;quot;&#039;&#039;Search fields&#039;&#039;&amp;quot;), which makes it possible to make the search much more focused.&lt;br /&gt;
&lt;br /&gt;
====How the search is interpreted====&lt;br /&gt;
When you do a naïve search (just write a few terms Google-style), GenBank tries to interpret what you most likely meant, and it has a behind-the-scenes scheme for sorting the results to push the most interesting ones to the top. It is actually possible to see exactly how your search query is interpreted by locating the &#039;&#039;&#039;SEARCH DETAILS&#039;&#039;&#039; box.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.2:&#039;&#039;&#039;&lt;br /&gt;
:: a) What has your search for &amp;quot;insulin&amp;quot; been expanded into?&lt;br /&gt;
&lt;br /&gt;
Spend a few moments to investigate the HEADER section of the GenBank entry you have all received as a hand-out (&amp;lt;tt&amp;gt;&#039;&#039;&#039;X01831&#039;&#039;&#039;&amp;lt;/tt&amp;gt;) to get an idea of how the data is related to specific sections (e.g. &amp;lt;tt&amp;gt;&#039;&#039;&#039;KEYWORDS&#039;&#039;&#039;&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;&#039;&#039;&#039;ORGANISM&#039;&#039;&#039;&amp;lt;/tt&amp;gt; which we will use in a moment).&lt;br /&gt;
&lt;br /&gt;
Try to find a search result that appears NOT to be the real insulin gene, and see why it was picked up by the search. If you have trouble finding one in your own result, search for &#039;&#039;&#039;DL142095.1&#039;&#039;&#039; which came up around page 200 when the exercise was written.&lt;br /&gt;
&lt;br /&gt;
The main issue here is that we find entries where &amp;quot;insulin&amp;quot; is mentioned anywhere in the entry, and sometimes it&#039;s unrelated genes like &amp;quot;Insulin-receptor&amp;quot;, &amp;quot;Insulin inhibitor&amp;quot; etc.&lt;br /&gt;
&lt;br /&gt;
====Searching for human insulin====&lt;br /&gt;
Search for &amp;lt;tt&amp;gt;human insulin&amp;lt;/tt&amp;gt; and see what happens.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.3:&#039;&#039;&#039;&lt;br /&gt;
:: a) How many search results were returned?&lt;br /&gt;
:: b) Can you find the human insulin entry? (If yes, write down its title and Accession)&lt;br /&gt;
:: c) How was your search interpreted by the system (the SEARCH DETAILS box)?&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
===Advanced search===&lt;br /&gt;
Looking at the &#039;&#039;&#039;SEARCH DETAILS&#039;&#039;&#039; from the naïve searches we have just performed gives us a good idea of how we can build our own more powerful searches. This can be done in two ways:&lt;br /&gt;
# Simply writing the advanced search string yourself (e.g. &amp;quot;insulin[title]&amp;quot; - to search in the title field)&lt;br /&gt;
# Using the &amp;quot;Search builder&amp;quot; to put together the query bit by bit.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;But why did the naïve search for &amp;quot;human insulin&amp;quot; go so well?&#039;&#039;&#039;&lt;br /&gt;
* If you just need a single (and well-known) gene from one of the well-known model organisms, a simple search will indeed work very well. (Much like when you do a Google search and get your desired hit on the first page).&lt;br /&gt;
* However, there are some situations where it&#039;s beneficial to specify the search in more detail - e.g. for building data sets of the same gene across multiple species, or just trying to locate a slightly more obscure gene. (Same as when the link you were looking for on Google was on page 10+ and you had to provide more accurate search terms).&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:keys1a.png|right|frame|link=http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers|It&#039;s possible to restrict the search to specific fields in the GenBank entries [http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers (click to open the entire list)]]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Now we are going to narrow down the search to specific parts of the annotation.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Click on &amp;lt;u&amp;gt;Advanced&amp;lt;/u&amp;gt; in the top of the page.&#039;&#039;&#039; &amp;lt;br/&amp;gt; This brings up a form with a &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; that can be used to select and combine terms restricted to specific fields.&lt;br /&gt;
* Select &amp;quot;&amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt;&amp;quot; and enter &amp;lt;tt&amp;gt;human&amp;lt;/tt&amp;gt;.&lt;br /&gt;
* Select &amp;quot;&amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt;&amp;quot; and enter &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;.&lt;br /&gt;
* Click &amp;quot;&amp;lt;u&amp;gt;Search&amp;lt;/u&amp;gt;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 2.2:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many hits do we have now? &lt;br /&gt;
:: b) Are they all from Human? If no, give a counterexample. &lt;br /&gt;
:: c) Do they all appear to be insulin genes? If no, give a counterexample.&lt;br /&gt;
&lt;br /&gt;
* Now use the &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; to search for &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt; in other fields instead of &amp;quot;&amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt;&amp;quot; (&#039;&#039;&#039;still with &amp;quot;&amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt;&amp;quot; set to &amp;lt;tt&amp;gt;human&amp;lt;/tt&amp;gt;&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 2.3:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many hits are found when &amp;quot;&amp;lt;u&amp;gt;Keyword&amp;lt;/u&amp;gt;&amp;quot; is set to &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;?&lt;br /&gt;
:: b) How many hits are found when &amp;quot;&amp;lt;u&amp;gt;Protein Name&amp;lt;/u&amp;gt;&amp;quot; is set to &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;?&lt;br /&gt;
:: c) Find the correct Human Insulin gene entry (the correct hit). Click on it and write down its Accession codes (there are more than one!), Locus name and Definition (title).&lt;br /&gt;
&lt;br /&gt;
Note that the &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; simply is a tool for filling out the search box. If you know the names of the available search fields, it is often more convenient to type your search with the field names manually. A schematic overview of the search fields can be found on the NCBI homepage: [http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers Search Fields and Qualifiers].&lt;br /&gt;
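&lt;br /&gt;
A field-restricted query typed by hand is just a string of term[field] pairs, so it can also be assembled in a script. The following is a minimal Python sketch, not part of the original exercise: the helper names &amp;quot;build_query&amp;quot; and &amp;quot;esearch_url&amp;quot; are hypothetical, and it assumes the NCBI E-utilities ESearch endpoint with its &amp;quot;db&amp;quot; and &amp;quot;term&amp;quot; parameters and &amp;quot;nuccore&amp;quot; as the identifier of the Nucleotide database. The code only builds the URL and does not contact NCBI.&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(fields: dict) -> str:
    """Join {field: term} pairs into an Entrez query such as 'human[organism]'."""
    return " AND ".join(f"{term}[{field}]" for field, term in fields.items())


def esearch_url(query: str, db: str = "nuccore") -> str:
    """Return an ESearch URL for the query; urlencode escapes brackets and spaces."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Same search as the Search Builder steps: Organism = human, Title = insulin.
query = build_query({"organism": "human", "title": "insulin"})
print(query)
print(esearch_url(query))
```
&lt;br /&gt;
The first printed line is the text you would type into the search box; the second is the same search expressed as a web address.&lt;br /&gt;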
&lt;br /&gt;
===Combining search terms using boolean operators: NOT, AND and OR===&lt;br /&gt;
[[Image:T044680.gif|thumb|400px|[http://www.mountsaintvincent.edu/library2/venn.htm Venn Diagrams for Boolean Logic]]]&lt;br /&gt;
&lt;br /&gt;
Our next task will be to find full length insulin genes from &#039;&#039;as many different organisms as possible&#039;&#039; using the &amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt; field. Note that it might have been easier to use the &amp;lt;u&amp;gt;Protein name&amp;lt;/u&amp;gt; or &amp;lt;u&amp;gt;Keyword&amp;lt;/u&amp;gt; fields, but with &amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt; we can immediately see the results of what we are doing, so we are using it for pedagogical reasons. We will now type the searches directly into the Search Box without using the Search Builder.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Let&#039;s start out with a new clean search for Insulin:&#039;&#039;&#039; &lt;br /&gt;
Query:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The number of hits is very high, and there are many partial genes and mRNA entries. &lt;br /&gt;
* Let&#039;s now specify that the entries should be complete:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] AND complete[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
About the use of &#039;&#039;&#039;AND&#039;&#039;&#039;: The AND keyword is implicitly used whenever you enter more than one search term: &amp;quot;&amp;lt;tt&amp;gt;human globin&amp;lt;/tt&amp;gt;&amp;quot; will be interpreted as &amp;quot;&amp;lt;tt&amp;gt;human AND globin&amp;lt;/tt&amp;gt;&amp;quot; and only results where BOTH terms are found will be reported. We could therefore have omitted the &amp;quot;&amp;lt;tt&amp;gt;AND&amp;lt;/tt&amp;gt;&amp;quot; in the previous query.&lt;br /&gt;
&lt;br /&gt;
Observe that we still have many hits that are not actually insulin, so we want to add search terms to AVOID in order to bring down the &#039;&#039;false positive&#039;&#039; rate. By a brief inspection of some of the search hits, it turns out that some of them are, e.g., insulin receptors. &lt;br /&gt;
* Let&#039;s get rid of these with the NOT keyword:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] complete[title] NOT receptor[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Conceptually what we are doing here is to conduct a number of searches that are either COMBINED or SUBTRACTED from each other. The &amp;quot;&amp;lt;tt&amp;gt;&amp;lt;nowiki&amp;gt;receptor[title]&amp;lt;/nowiki&amp;gt;&amp;lt;/tt&amp;gt;&amp;quot; search term finds all entries where this term is found. This list is then excluded from the combined &amp;quot;&amp;lt;tt&amp;gt;&amp;lt;nowiki&amp;gt;insulin[title] AND complete[title]&amp;lt;/nowiki&amp;gt;&amp;lt;/tt&amp;gt;&amp;quot; list by using the NOT operator. &lt;br /&gt;
&lt;br /&gt;
The use of boolean operators can be visualized graphically using Venn diagrams (see the figure to the right). A good strategy for narrowing down a GenBank search is to build a list of &amp;quot;&#039;&#039;kill words&#039;&#039;&amp;quot;/&amp;quot;&#039;&#039;filter words&#039;&#039;&amp;quot; (terms to avoid). More terms can be added to the list as search results are inspected, and it&#039;s found out why strange entries appear on the result list.  &lt;br /&gt;
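&lt;br /&gt;
The same combine-and-subtract logic can be mimicked with sets. The following is a minimal Python sketch for illustration only: the entry IDs are made up (real searches return GenBank accessions), and each set stands for the list of hits of one single-term search.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical hit lists for three single-term searches (IDs are made up).
insulin_title = {"E1", "E2", "E3", "E4"}
complete_title = {"E2", "E3", "E4", "E5"}
receptor_title = {"E4", "E6"}

# insulin[title] AND complete[title]: entries present in BOTH lists.
both = insulin_title.intersection(complete_title)

# ... NOT receptor[title]: remove every entry that is on the receptor list.
filtered = both.difference(receptor_title)

# insulin[title] OR complete[title]: entries present in EITHER list.
either = insulin_title.union(complete_title)

print(sorted(both))
print(sorted(filtered))
print(sorted(either))
```
&lt;br /&gt;
Each added kill word corresponds to one more difference step, which is why a very broad kill word can remove entries you wanted to keep.&lt;br /&gt;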
&lt;br /&gt;
&#039;&#039;A word of caution&#039;&#039;: Be careful not to throw the baby out with the bath water - don&#039;t add kill-words that are so broad that they will actually exclude the gene(s) we are looking for. And don&#039;t add kill-words without specifying a search field - e.g. the search &lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] complete[title] NOT receptor&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would exclude some real insulin hits that just happened to mention &amp;quot;receptor&amp;quot; in some reference!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;The final part of the exercise is to continue finding terms to exclude on your own.&#039;&#039;&#039; The point is to bring down the number of search results to a level where it&#039;s easy to pick the correct ones. &#039;&#039;&#039;Remember:&#039;&#039;&#039; the task is to find full length insulin genes from as many different organisms as possible using the Title field.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.4:&#039;&#039;&#039; &lt;br /&gt;
:: a) Which search term did you end up using? &lt;br /&gt;
:: b) How many search results do you get now? &lt;br /&gt;
Notice: There are several possible answers to this question, as it will be a balance between filtering out False Positives (things that are NOT insulin) without filtering out (too many) True Positives (things that are actually insulin). &amp;lt;br/&amp;gt;&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;Free exercise&amp;quot; ===&lt;br /&gt;
[[Image:Cogs_brain.png|50px]]&lt;br /&gt;
Now it&#039;s time to perform a number of GenBank searches on your own. It&#039;s important to think about the search strategy - discuss this within the group. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt; &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;QUESTION 3:&#039;&#039;&#039; Do at least three of the below and report your findings. Remember to write down the search string you ended up using for each question. &lt;br /&gt;
# &#039;&#039;&#039;Find the Rat and Mouse Insulin gene&#039;&#039;&#039; &lt;br /&gt;
# &#039;&#039;&#039;Find the alcohol-dehydrogenase gene from as many organisms as possible.&#039;&#039;&#039; &lt;br /&gt;
# &#039;&#039;&#039;Find the alpha-globin gene from &#039;&#039;Capra hircus&#039;&#039;&#039;&#039;&#039; - (Remember: Alpha-globin is part of hemoglobin). &lt;br /&gt;
# &#039;&#039;&#039;Find the alpha-globin gene from all ruminants&#039;&#039;&#039; - (hint: inspect the ORGANISM fields in a GenBank entry from an animal you know to be a ruminant, in order to pick up a good search term). If you want to go deeper into the taxonomy, the Tree of Life project has an entry on placental mammals here: [http://tolweb.org/tree?group=Eutheria&amp;amp;contgroup=Mammalia http://tolweb.org/tree?group=Eutheria&amp;amp;contgroup=Mammalia]. &lt;br /&gt;
# &#039;&#039;&#039;Find the actin gene from as many organisms as possible.&#039;&#039;&#039; &amp;lt;br/&amp;gt; Avoid mRNA and entries that are part of whole chromosomes, cosmids etc. &lt;br /&gt;
# &#039;&#039;&#039;Find the human insulin receptor gene.&#039;&#039;&#039; Avoid partial genes / single exons in the results.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
# &#039;&#039;&#039;Find the NORMAL p53 gene from human&#039;&#039;&#039;  &amp;lt;br/&amp;gt; p53 is involved in cancer and therefore a large number of mutated versions of the gene have been investigated. The problem is here that these mutant versions &amp;quot;pollute&amp;quot; the GenBank database, when we want to search for the &amp;quot;vanilla&amp;quot; version of the gene. &amp;lt;br/&amp;gt; For starters try to have a look at one of the mutated versions: &amp;lt;tt&amp;gt;&#039;&#039;&#039;S66666&#039;&#039;&#039;&amp;lt;/tt&amp;gt;. Notice where the term &amp;quot;&#039;&#039;&#039;p53&#039;&#039;&#039;&amp;quot; is present and use this to devise your search strategy. (Sometimes this gene also goes by the name &amp;quot;&#039;&#039;&#039;TP53&#039;&#039;&#039;&amp;quot;). &amp;lt;br/&amp;gt; The tricky part of this assignment is to find the best search fields (and terms) to use, and to avoid eliminating the real (unmutated) version of the gene when you put together your &amp;quot;kill-word&amp;quot; list. &amp;lt;br/&amp;gt; Can you find the mRNA version? The full length gene complete with intron/exon structure?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Unknown_variant&amp;diff=797</id>
		<title>Exercise:Unknown variant</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise:Unknown_variant&amp;diff=797"/>
		<updated>2026-04-19T17:52:24Z</updated>

		<summary type="html">&lt;p&gt;Carol: Created page with &amp;quot; What happens when you don&amp;#039;t find your variant of interest?  As an example, we are going to work with a variant in glucagon-like peptide-1 receptor to assess whether patients bearing that mutation can respond to Ozempic  ==Introduction==  In this exercise, you will:  Identify a mutation in a patient GLP1R sequence Determine its effect at the protein level Evaluate whether the mutation affects drug response (Ozempic)  ===Tools you will use===  NCBI Nucleotide and GeneBank...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
What happens when you don&#039;t find your variant of interest?&lt;br /&gt;
&lt;br /&gt;
As an example, we are going to work with a variant in the glucagon-like peptide-1 receptor (GLP1R) to assess whether patients bearing that mutation can respond to Ozempic.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
In this exercise, you will:&lt;br /&gt;
&lt;br /&gt;
Identify a mutation in a patient GLP1R sequence&lt;br /&gt;
Determine its effect at the protein level&lt;br /&gt;
Evaluate whether the mutation affects drug response (Ozempic)&lt;br /&gt;
&lt;br /&gt;
===Tools you will use===&lt;br /&gt;
&lt;br /&gt;
NCBI Nucleotide and GenBank&lt;br /&gt;
ClinVar&lt;br /&gt;
Ensembl&lt;br /&gt;
gnomAD&lt;br /&gt;
Virtual Ribosome: The Virtual Ribosome translates DNA into protein sequences and can scan reading frames to find coding regions.&lt;br /&gt;
UniProt&lt;br /&gt;
PDB&lt;br /&gt;
PyMOL&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Where to find GenBank===&lt;br /&gt;
The GenBank database is hosted at NCBI (National Center for Biotechnology Information, USA) (Link: http://www.ncbi.nlm.nih.gov/). Besides the main GenBank database, NCBI also hosts a number of other biological databases (for example whole-genome databases for human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical &amp;quot;GenBank&amp;quot; database. &amp;lt;!-- (http://www.ncbi.nlm.nih.gov/gene/).--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Using the &amp;quot;Entrez&amp;quot; database browser===&lt;br /&gt;
ALL the NCBI databases can be queried through a common search interface named Entrez. On nearly all NCBI webpages a search box can be found in the upper part of the page, allowing easy access to the individual databases (or a search across all databases). Click on the following link to open up a new browser window with Entrez, where the focus is pre-set to search in the GenBank database:&lt;br /&gt;
&lt;br /&gt;
http://www.ncbi.nlm.nih.gov/gene&lt;br /&gt;
&lt;br /&gt;
(Alternatively go to [http://www.ncbi.nlm.nih.gov/ the main NCBI webpage] and choose &amp;quot;Nucleotide&amp;quot; as the database).&lt;br /&gt;
[[Image:NCBI-nucleotide.png|link=http://www.ncbi.nlm.nih.gov/nucleotide|center]]&lt;br /&gt;
&lt;br /&gt;
==Part 1: Concerning the DATA in GenBank==&lt;br /&gt;
This part of the exercise is about the types of data hosted in GenBank.&lt;br /&gt;
&lt;br /&gt;
===Searching for a specific ID===&lt;br /&gt;
&lt;br /&gt;
Typical reasons for searching for a specific ID in GenBank are looking up information from the literature (e.g. a gene found in a study), following up on information from other databases, and investigating lists of interesting genes. In this part of the exercise we will be working with a set of &#039;&#039;&#039;alpha-globin genes&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Search for &amp;lt;tt&amp;gt;AB001981&amp;lt;/tt&amp;gt;&#039;&#039;&#039; - by default the result is shown in the &#039;&#039;&#039;GenBank format&#039;&#039;&#039;. &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 1.1:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many genes are contained in this entry? &lt;br /&gt;
:: b) From which organism does the DNA originate? &lt;br /&gt;
:: c) What kind of information is contained within the HEADER and within the FEATURE block?&lt;br /&gt;
&lt;br /&gt;
===PubMed links===&lt;br /&gt;
Notice that the publication from which the DNA sequence originates is cited (and linked via a [http://www.ncbi.nlm.nih.gov/pubmed/ PubMed] ID) within the header. Sometimes multiple publications related to the same gene are listed. This is of great importance since it makes it possible to trace the source(s) of the DNA sequence and investigate whether the experiments carried out are to be trusted. &lt;br /&gt;
&lt;br /&gt;
This can be of real importance if something seems &amp;quot;wrong&amp;quot; with the sequence (for example if this particular gene exhibits a really strange intron/exon structure compared to other closely related genes, or if it simply doesn&#039;t match ANY other known genes of the same family). By consulting the original publication it&#039;s possible to double-check the experimental procedure. It may be that the article correctly states the gene to be of type XXX, but when the data were submitted the entry was accidentally annotated as YYY (it is the original researchers&#039; responsibility to double-check this). There can also be more serious problems with the experiments, ranging from bad/wrong PCR primers to contamination with DNA from a different species during a cloning step.&lt;br /&gt;
&lt;br /&gt;
[[Image:Emblem-important_tiny.png|left]] &#039;&#039;NEVER FORGET: biological data CAN be wrong.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Investigate the PubMed link(s):&#039;&#039;&#039; &lt;br /&gt;
** Follow the &amp;lt;u&amp;gt;PubMed&amp;lt;/u&amp;gt; link from the sequence entry. &lt;br /&gt;
** Observe that it is always possible to read the ABSTRACT of the publication in PubMed, even if access to the publication requires subscription. For most (new) publications there will also be a direct link to the publication itself. &lt;br /&gt;
** Return to the sequence entry once again (or perform the search again if you closed the window).&lt;br /&gt;
&lt;br /&gt;
===GenBank vs. FASTA format===&lt;br /&gt;
* &#039;&#039;&#039;View the sequence entry in FASTA format&#039;&#039;&#039; (Simply click on &amp;quot;&amp;lt;u&amp;gt;FASTA&amp;lt;/u&amp;gt;&amp;quot; in the top part of the page, below the page title) &amp;lt;br/&amp;gt; Now the entire GenBank entry is shown in FASTA format. &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 1.2:&#039;&#039;&#039;&lt;br /&gt;
:: a) What happened to the alpha-globin genes? Can they still be found? &lt;br /&gt;
:: b) Which part of the GenBank entry has been converted?&lt;br /&gt;
: Observe that the name of the sequence is based on the name of the GenBank entry. &lt;br /&gt;
* &#039;&#039;&#039;Go back to GenBank format&#039;&#039;&#039; (Click on &amp;quot;&amp;lt;u&amp;gt;GenBank&amp;lt;/u&amp;gt;&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;TASK: Save the GenBank &amp;quot;raw data&amp;quot; on your own computer:&#039;&#039;&#039;&lt;br /&gt;
* Click on &amp;quot;&amp;lt;u&amp;gt;Send:&amp;lt;/u&amp;gt;&amp;quot; in the upper right part of the page &lt;br /&gt;
* Choose &amp;quot;Complete Record&amp;quot;, &amp;quot;File&amp;quot; and &amp;quot;Genbank(full)&amp;quot; and click on &amp;quot;&amp;lt;u&amp;gt;Create file&amp;lt;/u&amp;gt;&amp;quot; &lt;br /&gt;
* Locate the downloaded file on your own computer &lt;br /&gt;
* By default it has a pretty generic name (&amp;quot;sequence.gb&amp;quot;) - rename the file to &amp;quot;&amp;lt;tt&amp;gt;AB001981.gb&amp;lt;/tt&amp;gt;&amp;quot; &amp;lt;br/&amp;gt;&#039;&#039;Notice&#039;&#039;: The reason for renaming the file is simply good file management - now we can tell just by skimming the filenames that it&#039;s a GenBank file (&amp;quot;&amp;lt;tt&amp;gt;*.gb&amp;lt;/tt&amp;gt;&amp;quot;) and that it contains the &amp;quot;&amp;lt;tt&amp;gt;AB001981&amp;lt;/tt&amp;gt;&amp;quot; entry.&lt;br /&gt;
* Open it in Geany. &amp;lt;br/&amp;gt; &#039;&#039;Notice&#039;&#039;: What we have now is the &amp;quot;raw&amp;quot; data behind the information shown online, with no fancy HTML formatting and cross-links.  &lt;br /&gt;
* Verify that the contents of the file are as expected by inspecting it in Geany (it should look exactly like the information shown online).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.3&#039;&#039;&#039;: Does the downloaded file have UNIX or Windows line-endings?&lt;br /&gt;
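&lt;br /&gt;
If Geany does not make the line endings obvious, they can also be checked with a few lines of Python. The following is only a minimal sketch: the function name &amp;lt;tt&amp;gt;detect_line_endings&amp;lt;/tt&amp;gt; is made up for this illustration, and the two sample byte strings are invented stand-ins for the beginning of a GenBank file.&lt;br /&gt;

```python
def detect_line_endings(data):
    """Classify the line endings found in a bytes object."""
    crlf = data.count(b"\r\n")            # Windows style: carriage return + line feed
    lf_only = data.count(b"\n") - crlf    # UNIX style: line feed on its own
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "Windows (CRLF)"
    if lf_only:
        return "UNIX (LF)"
    return "no line breaks found"


# Invented in-memory samples, so the sketch runs without any downloaded file.
print(detect_line_endings(b"LOCUS       AB001981\nDEFINITION  example\n"))
print(detect_line_endings(b"LOCUS       AB001981\r\nDEFINITION  example\r\n"))
```

To check the file you saved, read it in binary mode (for example &amp;lt;tt&amp;gt;open(&amp;quot;AB001981.gb&amp;quot;, &amp;quot;rb&amp;quot;).read()&amp;lt;/tt&amp;gt;) and pass the result to the function. Binary mode matters, because text mode may silently translate the line endings.&lt;br /&gt;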
&lt;br /&gt;
===Exploring the genes defined in a GenBank entry===&lt;br /&gt;
&#039;&#039;&#039;Go back to the GenBank entry in your browser. Click the first &amp;quot;&amp;lt;u&amp;gt;CDS&amp;lt;/u&amp;gt;&amp;quot; element (Alpha-D)&#039;&#039;&#039; &lt;br /&gt;
*CDS = &#039;&#039;&#039;C&#039;&#039;&#039;o&#039;&#039;&#039;D&#039;&#039;&#039;ing &#039;&#039;&#039;S&#039;&#039;&#039;equence: The PROTEIN CODING part of a gene. Basically: the sequence you get when the CODING exons are concatenated (UTR regions are ignored). A complete CDS starts with a START codon and ends with a STOP codon. &lt;br /&gt;
* Hopefully it&#039;s quite intuitive why some of the sequence is highlighted - otherwise discuss it within the group (or with the instructor)&lt;br /&gt;
&lt;br /&gt;
Repeat the same procedure for the other CDS (Alpha-A). &lt;br /&gt;
*When looking at the FEATURE table, the first line of text in the definition of each CDS is as follows: &lt;br /&gt;
 join(1104..1192,1306..1510,1614..1742) &lt;br /&gt;
 join(4915..5009,5165..5369,5474..5602) &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.4:&#039;&#039;&#039; Based on your observations: &lt;br /&gt;
::a) What do these numbers mean? &lt;br /&gt;
::b) How many coding exons does each gene contain?&lt;br /&gt;
* &#039;&#039;&#039;View both CDSs in FASTA format&#039;&#039;&#039; (click &amp;quot;Send to&amp;quot; in the upper right corner, choose &amp;quot;Coding Sequences&amp;quot; and set format to &amp;quot;FASTA&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 1.5&#039;&#039;&#039;: What do the numbers in the sequence title represent? &lt;br /&gt;
* &#039;&#039;&#039;Switch to Graphic view&#039;&#039;&#039; (Click on &amp;lt;u&amp;gt;Graphics&amp;lt;/u&amp;gt; at the top of the page) &amp;lt;br/&amp;gt; An interactive graphical representation of the GenBank entry will now be shown. The upper part of the visualization shows the entire length of the entry (5,891 bp) with bars representing the individual exons within the two genes. &lt;br /&gt;
** The zoomed view below can be changed by dragging the transparent box with the blue borders in the overview representation at the top of the page. &lt;br /&gt;
** The zoom level can be changed. &lt;br /&gt;
** By &amp;quot;mousing over&amp;quot; the bars additional information about that particular feature will be shown. &lt;br /&gt;
The graphical overview is mostly useful for inspecting GenBank entries with multiple genes (some entries have hundreds of embedded genes). Play around with the interface for a few minutes to see what functionality is offered.&lt;br /&gt;
&lt;br /&gt;
== Part 2: Searching GenBank ==&lt;br /&gt;
The key issue to keep in mind when searching GenBank is to avoid drowning in huge amounts of irrelevant data. It is therefore of great importance to filter out unwanted information, WITHOUT losing the relevant entries. Today we will work with searching the TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to sequence based searches (BLAST). &lt;br /&gt;
&lt;br /&gt;
In the first part of the exercise we&#039;ll investigate various ways to search using &#039;&#039;&#039;insulin&#039;&#039;&#039; as the example.&lt;br /&gt;
&lt;br /&gt;
===Naïve search===&lt;br /&gt;
&#039;&#039;&#039;Search for GenBank entries containing the term &amp;quot;&amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;&amp;quot;&#039;&#039;&#039;&lt;br /&gt;
* Just do a simple search for INSULIN - don&#039;t put anything else in the search box.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Observe the following&#039;&#039;&#039;: &lt;br /&gt;
* A large number of entries are found. &lt;br /&gt;
* Go through a few pages of results and notice that we are offered data from a diverse set of sources: Experimental work, Patent applications, predicted genes, partial genes etc. &lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.1:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many search results were returned? &amp;lt;!-- (only the &amp;quot;Nucleotide&amp;quot; hits, not the &amp;quot;EST&amp;quot; and &amp;quot;GSS&amp;quot; hits) --&amp;gt;&lt;br /&gt;
:: b) Are they all from Human? If no, give a counterexample. (Would you have expected them to be all human?)&lt;br /&gt;
:: c) Are they all insulin? If no, give a counterexample.&lt;br /&gt;
&lt;br /&gt;
By default the search term is matched against ALL POSSIBLE fields in the GenBank entries - including almost all text in the HEADER and FEATURE table. It&#039;s even possible to pick up entries where the match is to an author&#039;s name rather than a gene name! (Perhaps not an issue for insulin). Luckily it is possible to restrict the search to specific pre-indexed fields in the HEADER and FEATURE table (&amp;quot;&#039;&#039;Search fields&#039;&#039;&amp;quot;), which makes it possible to make the search much more focused.&lt;br /&gt;
&lt;br /&gt;
====How the search is interpreted====&lt;br /&gt;
When you do a naïve search (just writing a few terms Google-style), GenBank tries to interpret what you most likely meant, and it has a behind-the-scenes scheme for sorting the results to push the most interesting ones to the top. It is actually possible to see exactly how your search query is interpreted by locating the &#039;&#039;&#039;SEARCH DETAILS&#039;&#039;&#039; box.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.2:&#039;&#039;&#039;&lt;br /&gt;
:: a) What has your search for &amp;quot;insulin&amp;quot; been expanded into?&lt;br /&gt;
&lt;br /&gt;
Spend a few moments to investigate the HEADER section of the GenBank entry you have all received as a hand-out (&amp;lt;tt&amp;gt;&#039;&#039;&#039;X01831&#039;&#039;&#039;&amp;lt;/tt&amp;gt;) to get an idea of how the data is related to specific sections (e.g. &amp;lt;tt&amp;gt;&#039;&#039;&#039;KEYWORDS&#039;&#039;&#039;&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;&#039;&#039;&#039;ORGANISM&#039;&#039;&#039;&amp;lt;/tt&amp;gt; which we will use in a moment).&lt;br /&gt;
&lt;br /&gt;
Try to find a search result that appears NOT to be the real insulin gene, and see why it was picked up by the search. If you have trouble finding one in your own result, search for &#039;&#039;&#039;DL142095.1&#039;&#039;&#039; which came up around page 200 when the exercise was written.&lt;br /&gt;
&lt;br /&gt;
The main issue here is that we find entries where &amp;quot;insulin&amp;quot; is mentioned anywhere in the entry, and sometimes it&#039;s unrelated genes like &amp;quot;Insulin-receptor&amp;quot;, &amp;quot;Insulin inhibitor&amp;quot; etc.&lt;br /&gt;
&lt;br /&gt;
====Searching for human insulin====&lt;br /&gt;
Search for &amp;lt;tt&amp;gt;human insulin&amp;lt;/tt&amp;gt; and see what happens.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.1.3:&#039;&#039;&#039;&lt;br /&gt;
:: a) How many search results were returned?&lt;br /&gt;
:: b) Can you find the human insulin entry? (If yes, write down its title and Accession)&lt;br /&gt;
:: c) How was your search interpreted by the system (the SEARCH DETAILS box)?&lt;br /&gt;
&lt;br /&gt;
------&lt;br /&gt;
&lt;br /&gt;
===Advanced search===&lt;br /&gt;
Looking at the &#039;&#039;&#039;SEARCH DETAILS&#039;&#039;&#039; from the naïve searches we have just performed gives us a good idea of how we can build our own more powerful searches. This can be done in two ways:&lt;br /&gt;
# Simply writing the advanced search string yourself (e.g. &amp;quot;insulin[title]&amp;quot; - to search in the title field)&lt;br /&gt;
# Using the &amp;quot;Search builder&amp;quot; to put together the query bit by bit.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;But why did the naïve search for &amp;quot;human insulin&amp;quot; go so well?&#039;&#039;&#039;&lt;br /&gt;
* If you just need a single (and well-known) gene from one of the well-known model organisms, it will indeed work very well to do a simple search. (Much like when you do a Google search and get your desired hit on the first page).&lt;br /&gt;
* However, there are some situations where it&#039;s beneficial to specify the search in more detail - e.g. for building data sets of the same gene across multiple species, or just trying to locate a slightly more obscure gene. (Same as when the link you were looking for on Google was on page 10+ and you have to provide more accurate search terms).&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Image:keys1a.png|right|frame|link=http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers|It&#039;s possible to restrict the search to specific fields in the GenBank entries [http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers (click to open the entire list)]]]&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Now we are going to narrow down the search to specific parts of the annotation.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Click on &amp;lt;u&amp;gt;Advanced&amp;lt;/u&amp;gt; in the top of the page.&#039;&#039;&#039; &amp;lt;br/&amp;gt; This brings up a form with a &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; that can be used to select and combine terms restricted to specific fields.&lt;br /&gt;
* Select &amp;quot;&amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt;&amp;quot; and enter &amp;lt;tt&amp;gt;human&amp;lt;/tt&amp;gt;.&lt;br /&gt;
* Select &amp;quot;&amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt;&amp;quot; and enter &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;.&lt;br /&gt;
* Click &amp;quot;&amp;lt;u&amp;gt;Search&amp;lt;/u&amp;gt;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 2.2:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many hits do we have now? &lt;br /&gt;
:: b) Are they all from Human? If no, give a counterexample. &lt;br /&gt;
:: c) Do they all appear to be insulin genes? If no, give a counterexample.&lt;br /&gt;
&lt;br /&gt;
* Now use the &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; to search for &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt; in other fields instead of &amp;quot;&amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt;&amp;quot; (&#039;&#039;&#039;still with &amp;quot;&amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt;&amp;quot; set to &amp;lt;tt&amp;gt;human&amp;lt;/tt&amp;gt;&#039;&#039;&#039;)&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
::&#039;&#039;&#039;QUESTION 2.3:&#039;&#039;&#039; &lt;br /&gt;
:: a) How many hits are found when &amp;quot;&amp;lt;u&amp;gt;Keyword&amp;lt;/u&amp;gt;&amp;quot; is set to &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;?&lt;br /&gt;
:: b) How many hits are found when &amp;quot;&amp;lt;u&amp;gt;Protein Name&amp;lt;/u&amp;gt;&amp;quot; is set to &amp;lt;tt&amp;gt;insulin&amp;lt;/tt&amp;gt;?&lt;br /&gt;
:: c) Find the correct Human Insulin gene entry (the correct hit). Click on it and write down its Accession codes (there are more than one!), Locus name and Definition (title).&lt;br /&gt;
&lt;br /&gt;
Note that the &amp;quot;&amp;lt;u&amp;gt;Search Builder&amp;lt;/u&amp;gt;&amp;quot; simply is a tool for filling out the search box. If you know the names of the available search fields, it is often more convenient to type your search with the field names manually. A schematic overview of the search fields can be found on the NCBI homepage: [http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers Search Fields and Qualifiers].&lt;br /&gt;
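&lt;br /&gt;
As an illustration of typing the field names yourself, here is a minimal Python sketch that assembles a fielded query string from term/field pairs. The helper name &amp;lt;tt&amp;gt;build_query&amp;lt;/tt&amp;gt; is hypothetical and not part of any NCBI tool; the sketch only builds the text that would be pasted into the search box and does not contact NCBI.&lt;br /&gt;

```python
def build_query(pairs):
    """Turn (term, field) pairs into a query such as insulin[title].

    The pairs are joined with AND, which mirrors how the Search Builder
    combines the two restrictions used in this exercise.
    """
    parts = [f"{term}[{field}]" for term, field in pairs]
    return " AND ".join(parts)


# Same restrictions as entered in the Search Builder above.
query = build_query([("human", "organism"), ("insulin", "title")])
print(query)  # human[organism] AND insulin[title]
```

Keeping the terms in a list like this makes it easy to swap &amp;lt;tt&amp;gt;title&amp;lt;/tt&amp;gt; for another field and compare the hit counts.&lt;br /&gt;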
&lt;br /&gt;
===Combining search terms using boolean operators: NOT, AND and OR===&lt;br /&gt;
[[Image:T044680.gif|thumb|400px|[http://www.mountsaintvincent.edu/library2/venn.htm Venn Diagrams for Boolean Logic]]]&lt;br /&gt;
&lt;br /&gt;
Our next task will be to find full length insulin genes from &#039;&#039;as many different organisms as possible&#039;&#039; using the &amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt; field. Note that it might have been easier to use the &amp;lt;u&amp;gt;Protein name&amp;lt;/u&amp;gt; or &amp;lt;u&amp;gt;Keyword&amp;lt;/u&amp;gt; fields, but with &amp;lt;u&amp;gt;Title&amp;lt;/u&amp;gt; we can immediately see the results of what we are doing, so we are using it for pedagogical reasons. We will now type the searches directly into the Search Box without using the Search Builder.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Let&#039;s start out with a new clean search for Insulin:&#039;&#039;&#039; &lt;br /&gt;
Query:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The number of hits is very high, and there are many partial genes and mRNA entries. &lt;br /&gt;
* Let&#039;s now specify that the entries should be complete:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] AND complete[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
About the use of &#039;&#039;&#039;AND&#039;&#039;&#039;: The AND keyword is implicitly used whenever you enter more than one search term: &amp;quot;&amp;lt;tt&amp;gt;human globin&amp;lt;/tt&amp;gt;&amp;quot; will be interpreted as &amp;quot;&amp;lt;tt&amp;gt;human AND globin&amp;lt;/tt&amp;gt;&amp;quot; and only results where BOTH terms are found will be reported. We could therefore have omitted the &amp;quot;&amp;lt;tt&amp;gt;AND&amp;lt;/tt&amp;gt;&amp;quot; in the previous query.&lt;br /&gt;
&lt;br /&gt;
Observe that we still have many hits that are not actually insulin, so we want to add search terms to AVOID in order to bring down the &#039;&#039;false positive&#039;&#039; rate. By a brief inspection of some of the search hits, it turns out that some of them are, e.g., insulin receptors. &lt;br /&gt;
* Let&#039;s get rid of these with the NOT keyword:&lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] complete[title] NOT receptor[title]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Conceptually what we are doing here is to conduct a number of searches that are either COMBINED or SUBTRACTED from each other. The &amp;quot;&amp;lt;tt&amp;gt;&amp;lt;nowiki&amp;gt;receptor[title]&amp;lt;/nowiki&amp;gt;&amp;lt;/tt&amp;gt;&amp;quot; search term finds all entries where this term is found. This list is then excluded from the combined &amp;quot;&amp;lt;tt&amp;gt;&amp;lt;nowiki&amp;gt;insulin[title] AND complete[title]&amp;lt;/nowiki&amp;gt;&amp;lt;/tt&amp;gt;&amp;quot; list by using the NOT operator. &lt;br /&gt;
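&lt;br /&gt;
The combining and subtracting of result lists can be mimicked with Python sets. This is only a sketch of the idea: the accession codes below are invented placeholders, not real GenBank entries, and it says nothing about how Entrez evaluates queries internally.&lt;br /&gt;

```python
# Hypothetical accession sets, as if returned by three separate searches.
insulin_title = {"ACC1", "ACC2", "ACC3", "ACC4"}
complete_title = {"ACC2", "ACC3", "ACC4", "ACC9"}
receptor_title = {"ACC4", "ACC7"}

# AND keeps only entries found by both searches (set intersection).
both = insulin_title.intersection(complete_title)

# NOT removes entries found by the excluded search (set difference).
result = both.difference(receptor_title)

print(sorted(result))  # ['ACC2', 'ACC3']
```

In the same picture, OR corresponds to the union of two sets (&amp;lt;tt&amp;gt;insulin_title.union(receptor_title)&amp;lt;/tt&amp;gt;).&lt;br /&gt;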
&lt;br /&gt;
The use of boolean operators can be visualized graphically using Venn diagrams (see the figure to the right). A good strategy for narrowing down a GenBank search is to build a list of &amp;quot;&#039;&#039;kill words&#039;&#039;&amp;quot;/&amp;quot;&#039;&#039;filter words&#039;&#039;&amp;quot; (terms to avoid). More terms can be added to the list as search results are inspected, and it&#039;s found out why strange entries appear on the result list.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;A word of caution&#039;&#039;: Be careful not to throw the baby out with the bath water - don&#039;t add kill-words that are so broad that they will actually exclude the gene(s) we are looking for. And don&#039;t add kill-words without specifying a search field - e.g. the search &lt;br /&gt;
&amp;lt;pre style=&amp;quot;overflow:auto;&amp;quot;&amp;gt;&lt;br /&gt;
insulin[title] complete[title] NOT receptor&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
would exclude some real insulin hits that just happened to mention &amp;quot;receptor&amp;quot; in some reference!&lt;br /&gt;
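&lt;br /&gt;
The difference between a fielded and an unfielded kill-word can be shown with a toy model in Python. Everything here is made up for illustration: the three records, the &amp;lt;tt&amp;gt;other&amp;lt;/tt&amp;gt; field standing in for all non-title text, and the simple substring matching, which is much cruder than the real Entrez term matching.&lt;br /&gt;

```python
# Hypothetical toy records: only the title and one free-text field are modelled.
records = [
    {"acc": "ACC1", "title": "insulin gene, complete cds", "other": ""},
    {"acc": "ACC2", "title": "insulin receptor gene, complete cds", "other": ""},
    {"acc": "ACC3", "title": "insulin gene, complete cds",
     "other": "reference: binding of insulin to its receptor"},
]


def search(records, exclude_term, title_only):
    """Keep insulin/complete titles, dropping records that match exclude_term."""
    hits = []
    for rec in records:
        title = rec["title"]
        if "insulin" not in title or "complete" not in title:
            continue
        # A fielded kill-word looks at the title only; an unfielded one at all text.
        searched = title if title_only else title + " " + rec["other"]
        if exclude_term not in searched:
            hits.append(rec["acc"])
    return hits


print(search(records, "receptor", title_only=True))   # ['ACC1', 'ACC3']
print(search(records, "receptor", title_only=False))  # ['ACC1']
```

The unfielded variant loses ACC3, a genuine insulin entry in this toy data, only because one of its references mentions a receptor.&lt;br /&gt;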
&lt;br /&gt;
* &#039;&#039;&#039;The final part of the exercise is to continue finding terms to exclude on your own.&#039;&#039;&#039; The point is to bring down the number of search results to a level where it&#039;s easy to pick the correct ones. &#039;&#039;&#039;Remember:&#039;&#039;&#039; the task is to find full length insulin genes from as many different organisms as possible using the Title field.&lt;br /&gt;
&lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
:: &#039;&#039;&#039;QUESTION 2.4:&#039;&#039;&#039; &lt;br /&gt;
:: a) Which search term did you end up using? &lt;br /&gt;
:: b) How many search results do you get now? &lt;br /&gt;
Notice: There are several possible answers to this question, as it will be a balance between filtering out False Positives (things that are NOT insulin) without filtering out (too many) True Positives (things that are actually insulin). &amp;lt;br/&amp;gt;&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== &amp;quot;Free exercise&amp;quot; ===&lt;br /&gt;
[[Image:Cogs_brain.png|50px]]&lt;br /&gt;
Now it&#039;s time to perform a number of GenBank searches on your own. It&#039;s important to think about the search strategy - discuss this within the group. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt; &lt;br /&gt;
[[Image:Office-notes-line_drawing.png|30px|left]]&lt;br /&gt;
&#039;&#039;&#039;QUESTION 3:&#039;&#039;&#039; Do at least three of the below and report your findings. Remember to write down the search string you ended up using for each question. &lt;br /&gt;
# &#039;&#039;&#039;Find the Rat and Mouse Insulin gene&#039;&#039;&#039; &lt;br /&gt;
# &#039;&#039;&#039;Find the alcohol-dehydrogenase gene from as many organisms as possible.&#039;&#039;&#039; &lt;br /&gt;
# &#039;&#039;&#039;Find the alpha-globin gene from &#039;&#039;Capra hircus&#039;&#039;&#039;&#039;&#039; - (Remember: Alpha-globin is part of hemoglobin). &lt;br /&gt;
# &#039;&#039;&#039;Find the alpha-globin gene from all ruminants&#039;&#039;&#039; - (hint: inspect the ORGANISM fields in a GenBank entry from an animal you know to be a ruminant, in order to pick up a good search term). If you want to go deeper into the taxonomy, the Tree of Life project has an entry on placental mammals here: [http://tolweb.org/tree?group=Eutheria&amp;amp;contgroup=Mammalia http://tolweb.org/tree?group=Eutheria&amp;amp;contgroup=Mammalia]. &lt;br /&gt;
# &#039;&#039;&#039;Find the actin gene from as many organisms as possible.&#039;&#039;&#039; &amp;lt;br/&amp;gt; Avoid mRNA and entries that are part of whole chromosomes, cosmids etc. &lt;br /&gt;
# &#039;&#039;&#039;Find the human insulin receptor gene.&#039;&#039;&#039; Avoid partial genes / single exons in the results.&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
# &#039;&#039;&#039;Find the NORMAL p53 gene from human&#039;&#039;&#039;  &amp;lt;br/&amp;gt; p53 is involved in cancer and therefore a large number of mutated versions of the gene have been investigated. The problem is here that these mutant versions &amp;quot;pollute&amp;quot; the GenBank database, when we want to search for the &amp;quot;vanilla&amp;quot; version of the gene. &amp;lt;br/&amp;gt; For starters try to have a look at one of the mutated versions: &amp;lt;tt&amp;gt;&#039;&#039;&#039;S66666&#039;&#039;&#039;&amp;lt;/tt&amp;gt;. Notice where the term &amp;quot;&#039;&#039;&#039;p53&#039;&#039;&#039;&amp;quot; is present and use this to devise your search strategy. (Sometimes this gene also goes by the name &amp;quot;&#039;&#039;&#039;TP53&#039;&#039;&#039;&amp;quot;). &amp;lt;br/&amp;gt; The tricky part of this assignment is to find the best search fields (and terms) to use, and to avoid eliminating the real (unmutated) version of the gene when you put together your &amp;quot;kill-word&amp;quot; list. &amp;lt;br/&amp;gt; Can you find the mRNA version? The full length gene complete with intron/exon structure?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;/blockquote&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=757</id>
		<title>ExPSIBLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=ExPSIBLAST&amp;diff=757"/>
		<updated>2025-11-07T14:13:30Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
&lt;br /&gt;
Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today&#039;s lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to:&lt;br /&gt;
* Identify relationships between proteins with low sequence similarity&lt;br /&gt;
* Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein)&lt;br /&gt;
&lt;br /&gt;
===Links=== &lt;br /&gt;
* NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/&lt;br /&gt;
&amp;lt;!-- * [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] is a tool for visualization of protein sequence profiles and identification of conserved residues.&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Say you have a protein sequence [https://teaching.healthtech.dtu.dk/material/22111/files/Query1.txt Query] (also pasted below), and you want to make predictions about its structural homologue. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;QUERY1&lt;br /&gt;
 MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV&lt;br /&gt;
 EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK&lt;br /&gt;
 LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS&lt;br /&gt;
 IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL&lt;br /&gt;
 YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID&lt;br /&gt;
 LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE&lt;br /&gt;
 IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL&lt;br /&gt;
 QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in nr.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all significant hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical match)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Do you find any PDB hits among the significant hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; look for a PDB identifier in the &amp;lt;u&amp;gt;Accession&amp;lt;/u&amp;gt; column — a PDB identifier is a 4 character code, where the first character is a number, followed by a single letter chain name, such as &amp;quot;1XYZ_A&amp;quot;)&lt;br /&gt;
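&lt;br /&gt;
With a long hit list, the tip above can also be applied to a downloaded or copied list of accessions using a regular expression. The sketch below is an assumption-laden illustration: the pattern is derived from the description in the tip, it deliberately allows chain names longer than one character, and the example accessions other than &amp;quot;1XYZ_A&amp;quot; are invented strings in the style of other databases.&lt;br /&gt;

```python
import re

# Assumed pattern: a digit, three letters or digits, an underscore, a chain name.
PDB_CHAIN = re.compile(r"^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+$")


def looks_like_pdb(accession):
    """Return True if the accession has the PDB-style form, e.g. 1XYZ_A."""
    return PDB_CHAIN.match(accession) is not None


# Invented example accessions; only the first has the PDB-style form.
for acc in ["1XYZ_A", "WP_012345678.1", "P01308"]:
    print(acc, looks_like_pdb(acc))
```

A match only says that the accession has the right shape; whether the hit really comes from PDB should still be confirmed in the BLAST output.&lt;br /&gt;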
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lightyellow; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If you see the error message “&amp;lt;u&amp;gt;Entrez Query: txid2157 [ORGN] is not supported&amp;lt;/u&amp;gt;”, then click &amp;lt;u&amp;gt;Recent Results&amp;lt;/u&amp;gt; in the upper right part of the BLAST window, select your most recent search, and try again. &lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.&lt;br /&gt;
&lt;br /&gt;
Go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.&lt;br /&gt;
&lt;br /&gt;
Then, open &#039;&#039;a new BLAST window&#039;&#039; (this is important—you need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; as the database. Do &#039;&#039;not&#039;&#039; limit your search to Archaea this time. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again, it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; click on the description to get to the actual alignment between the query sequence and the PDB hit)&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
===One more round===&lt;br /&gt;
Let&#039;s try one more iteration of PSI-BLAST: &lt;br /&gt;
* Go back to your first BLAST window (the one with the results from the &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt; database limited to Archaea) and press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 3&amp;lt;/u&amp;gt;. &lt;br /&gt;
* Save the resulting PSSM file (make sure you give it a different name!).&lt;br /&gt;
* Launch a new PSI-BLAST search against &amp;lt;u&amp;gt;pdb&amp;lt;/u&amp;gt; in all organisms using this PSSM (you may have to click on &amp;lt;u&amp;gt;Clear&amp;lt;/u&amp;gt; to erase your first PSSM file from the server).&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog (on your own)==&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. There, just leave all options at their defaults and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Concluding remarks==&lt;br /&gt;
Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles you have been able to identify a relationship between protein sequences far below 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=756</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=756"/>
		<updated>2025-11-06T16:08:28Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. The aim is a research‐style annotation of a “dark” gene that is poorly annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to be expressed as a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence, of unknown function, encoded by the human gene &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First, we will check that a standard BLAST search does not find any homologous sequences. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] website at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
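&lt;br /&gt;
If you download the results as a tab-separated hit table, the count of significant hits can be cross-checked with a few lines of Python. This is only a minimal sketch: it assumes the standard 12-column BLAST tabular layout (E-value in column 11), and the function name and the example rows are made up for illustration, they are not real search results.&lt;br /&gt;

```python
SIGNIFICANCE_THRESHOLD = 0.005  # same cutoff as used in the questions


def count_significant_hits(table_text, threshold=SIGNIFICANCE_THRESHOLD):
    """Count rows of a BLAST tabular hit table with E-value below threshold."""
    count = 0
    for line in table_text.splitlines():
        line = line.strip()
        # Skip blank lines and the comment lines that start with "#".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        evalue = float(fields[10])  # column 11 holds the E-value
        if threshold > evalue:
            count += 1
    return count


# Hypothetical rows (not real search results), only to show the format.
example = "\n".join([
    "# Fields: query, subject, identity, ...",
    "query1\thitA\t100.0\t159\t0\t0\t1\t159\t1\t159\t2e-110\t310",
    "query1\thitB\t35.2\t71\t40\t2\t80\t150\t12\t80\t0.003\t38.1",
    "query1\thitC\t28.0\t50\t30\t1\t10\t59\t5\t52\t4.2\t27.3",
])
print(count_significant_hits(example))
```

Passing a different value for the threshold argument (for example 10 or 100) reproduces the looser cutoffs used later in the exercise.&lt;br /&gt;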
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP] search page. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide its sequence ID, % identity, and query coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
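&lt;br /&gt;
Query coverage, as asked for above, is the percentage of query residues that fall inside at least one aligned segment. The sketch below shows the arithmetic under the assumption that you read the query start and end positions off an alignment; the helper name and the example coordinates are hypothetical, and the sequence is the C22orf45 query given earlier.&lt;br /&gt;

```python
# The C22orf45 query sequence from this exercise.
QUERY = (
    "MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG"
    "CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP"
    "CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP"
)


def query_coverage(segments, query_length):
    """Percent of query positions covered by 1-based, inclusive segments."""
    covered = set()
    for start, end in segments:
        # Overlapping segments are only counted once thanks to the set.
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / query_length


# Hypothetical aligned region: query positions 80 to 150.
print(len(QUERY))
print(round(query_coverage([(80, 150)], len(QUERY)), 1))
```

A hit can therefore have a good E-value and still cover only a small part of the query, which is worth keeping in mind when you judge the matches.&lt;br /&gt;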
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
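&lt;br /&gt;
For intuition about what a Position-Specific Scoring Matrix contains, here is a toy construction from a small ungapped alignment. It is a deliberately simplified sketch and not the procedure PSI-BLAST itself uses: it assumes a uniform background frequency and a fixed pseudocount and ignores sequence weighting and gaps. The alignment and all names are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict of log-odds scores (log2) per alignment column."""
    n_seqs = len(aligned_seqs)
    total = n_seqs + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment of the same length."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


toy_alignment = ["MKVL", "MRVL", "MKIL", "MKVF"]  # hypothetical, ungapped
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "MKVL"), 2), round(score(pssm, "GGGG"), 2))
```

The point of the sketch is that each column gets its own scores, so a residue that is conserved in the alignment is rewarded at that position only, unlike a general substitution matrix that scores a residue pair the same way everywhere.&lt;br /&gt;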
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You will run a third iteration shortly, but first save the PSSM from iteration 2 for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run a third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
Save the PSSM again and rename it to PSSM-3, to record that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs, we are back on track to answer the original question: what is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
Because protein function is closely related to protein structure, we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the files you saved. Remember to set the Expect threshold to the significance cutoff (0.005): by default the form keeps the value from your last search, which should be 100. Do this once for each saved PSSM (PSSM-2 and PSSM-3). &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and the query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when a standard BLAST search fails. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homologues.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
&lt;br /&gt;
First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results with this ID: &#039;&#039;&#039;GTHC3N0F016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. There, just leave all options at their defaults and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 indicates a reliable model. You can find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=755</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=755"/>
		<updated>2025-11-06T15:45:33Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence, of unknown function, encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
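&lt;br /&gt;
The E-value is the number of alignments with at least a given score that would be expected by chance in a search of this size, so a threshold of 100 tolerates many chance hits. The Python sketch below illustrates this with the standard bit-score relation E = m × n × 2^(-S), where m is the query length, n the total database length and S the bit score. The database sizes and the bit score are made-up illustrative numbers, not values taken from NCBI, and real BLAST uses effective (edge-corrected) lengths.&lt;br /&gt;
&lt;br /&gt;

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Karlin-Altschul estimate: E = m * n * 2**(-S) for bit score S."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 159             # residues in the C22orf45 query above
SMALL_DB = 3 * 10**7           # hypothetical total residues, small database
LARGE_DB = 3 * 10**11          # hypothetical total residues, large database
BIT_SCORE = 40.0               # hypothetical alignment score

for name, size in [("small", SMALL_DB), ("large", LARGE_DB)]:
    e_value = expected_chance_hits(BIT_SCORE, QUERY_LENGTH, size)
    print(f"{name} database: E = {e_value:.3g}")
```

For a fixed bit score, the expected number of chance hits grows in proportion to the database size, which is one reason the same alignment can look more convincing in a small database than in a large one.&lt;br /&gt;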
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide its sequence ID, % identity and query coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
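&lt;br /&gt;
To make the idea of a PSSM concrete, here is a minimal Python sketch that turns a toy gap-free alignment into position-specific log-odds scores and then scores two sequences against it. It is a simplification and not the PSI-BLAST implementation: the sequences, the reduced alphabet, the uniform background and the flat pseudocount are all assumptions made for illustration, whereas PSI-BLAST also weights the aligned sequences and derives its pseudocounts from a substitution matrix.&lt;br /&gt;
&lt;br /&gt;

```python
import math
from collections import Counter

# Toy alignment (hypothetical sequences, all the same length, no gaps).
ALIGNMENT = ["ACDA", "ACEA", "ACDG", "TCDA"]
ALPHABET = "ACDEGT"                 # reduced alphabet, for illustration only
BACKGROUND = 1.0 / len(ALPHABET)    # uniform background frequency (assumption)
PSEUDOCOUNT = 1.0                   # flat pseudocount per residue (assumption)


def build_pssm(alignment):
    """Return one dict per alignment column mapping residue -> log2-odds."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + PSEUDOCOUNT) / total
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(sequence, pssm):
    """Sum the position-specific scores of a sequence of matching length."""
    return sum(col[res] for res, col in zip(sequence, pssm))


pssm = build_pssm(ALIGNMENT)
print(round(score("ACDA", pssm), 2))  # sequence resembling the alignment
print(round(score("GTGT", pssm), 2))  # sequence unlike the alignment
```

Unlike a general substitution matrix, the score of a residue here depends on the position in which it occurs, so conserved columns reward matches and penalise mismatches more strongly than variable ones.&lt;br /&gt;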
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2, since it comes from iteration 2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3, to remember that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select a PSSM file you saved (repeat the search for PSSM-2 and PSSM-3). Remember to change the Expect threshold to the significance level (E-value &amp;lt; 0.005): by default, the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homologues.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is useful not only for finding a remote homolog in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this we will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database, &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
&lt;br /&gt;
First, try a standard BlastP search: set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, set &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GTDKMR8P014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
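&lt;br /&gt;
The same two-step workflow (build a PSSM on a broad database, then reuse it on a restricted one) can also be run locally with the psiblast program from NCBI BLAST+. The Python sketch below only assembles and prints the two command lines; it does not run them. The file names and database names, in particular a Trypanosoma-only database, are hypothetical placeholders, and a local BLAST+ installation with formatted databases is assumed.&lt;br /&gt;
&lt;br /&gt;

```python
import shlex


def build_pssm_command(query_fasta, database, pssm_file, iterations=3):
    """Step 1: iterate over a broad database and save the final PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", "0.005",   # E-value needed to enter the PSSM
        "-out_pssm", pssm_file,
    ]


def search_with_pssm_command(pssm_file, database, evalue="0.005"):
    """Step 2: reuse the saved PSSM against a restricted database."""
    return [
        "psiblast",
        "-in_pssm", pssm_file,           # used instead of -query
        "-db", database,
        "-evalue", evalue,
    ]


# Hypothetical file and database names, for illustration only.
step1 = build_pssm_command("GPAA1_HUMAN.fasta", "refseq_protein", "gpaa1.pssm")
step2 = search_with_pssm_command("gpaa1.pssm", "trypanosoma_refseq")

print(shlex.join(step1))
print(shlex.join(step2))
```

As on the web page, the second command needs no query sequence because the saved PSSM already contains it.&lt;br /&gt;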
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing so transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate whether the structural properties of the four most conserved residues from question Q12 could indeed form an active site.  Phyre2 is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure as output, based on single-template homology modelling. Go to the Phyre2 web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate whether the structural properties of the four most conserved residues from question Q14 could indeed form an active site.  CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure as output, based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 indicates a reliable model. You can find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=754</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=754"/>
		<updated>2025-11-06T14:56:30Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence, of unknown function, encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide its sequence ID, % identity and query coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2, since it comes from iteration 2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3, to remember that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the PSSM files you saved; repeat the search for each PSSM. Remember to change the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005): by default, the value from the last search is kept, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this part we search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in &#039;&#039;Trypanosoma cruzi&#039;&#039; (a unicellular parasite of the genus &#039;&#039;Trypanosoma&#039;&#039;, whose members cause diseases like sleeping sickness and Chagas disease).&lt;br /&gt;
&lt;br /&gt;
First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma cruzi&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma cruzi&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GTDKMR8P014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=753</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=753"/>
		<updated>2025-11-06T14:55:55Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein‐coding sequence with unknown function from the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows the database setting changed to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
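&lt;br /&gt;
To see why a threshold of 100 lets chance matches through, recall that the E-value is the number of alignments with at least a given score that are expected by chance. The following is a minimal Python sketch, assuming the standard bit-score relation E = m * n * 2^(-S), where m is the query length, n is the database length and S is the bit score. The database size is a hypothetical number chosen for illustration, and real BLAST uses corrected effective lengths, so the values you see in the web interface will differ.&lt;br /&gt;
&lt;br /&gt;
```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Return E = m * n * 2**(-S) for query length m, database length n, bit score S."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 159          # residues in the C22orf45 query sequence
DATABASE_LENGTH = 10 ** 9   # hypothetical database size in residues (assumption)

# Print the expected number of chance hits for a few illustrative bit scores.
for bit_score in (30, 40, 50):
    e_value = expected_chance_hits(bit_score, QUERY_LENGTH, DATABASE_LENGTH)
    print(bit_score, f"{e_value:.3g}")
```
&lt;br /&gt;
Under these assumptions, every extra bit of score halves the E-value and doubling the database size doubles it, so low-scoring alignments in a large database are easily produced by chance alone.&lt;br /&gt;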
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide the sequence ID, % identity and coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
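&lt;br /&gt;
As an illustration of what a position-specific scoring matrix contains, here is a minimal Python sketch that builds log-odds scores from a toy ungapped alignment. The alignment, the reduced alphabet, the uniform background frequency and the pseudocount are all hypothetical choices made for this example; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, which are not reproduced here.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

# Toy alignment of four equal-length sequences (hypothetical data, not real hits).
ALIGNMENT = ["ACDA", "ACEA", "ACDG", "TCDA"]
ALPHABET = "ACDEGT"                # reduced alphabet, for illustration only
BACKGROUND = 1.0 / len(ALPHABET)   # assumed uniform background frequency
PSEUDOCOUNT = 1.0                  # assumed pseudocount, avoids log(0)


def build_pssm(alignment):
    """Return one dict per alignment column, mapping residue to log2-odds score."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + PSEUDOCOUNT) / total
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


pssm = build_pssm(ALIGNMENT)
print(round(score(pssm, "ACDA"), 2))  # consensus-like sequence
print(round(score(pssm, "GTGT"), 2))  # sequence unlike the alignment
```
&lt;br /&gt;
In this sketch, residues that are frequent in a column get positive scores and rare ones get negative scores, so the same residue can score differently depending on its position.&lt;br /&gt;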
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run a third iteration, this time including as many sequences with an E-value &amp;lt; 0.005 as possible. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs, we are back on track to answer the original question: what is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the PSSM files you saved; repeat the search for each PSSM. Remember to change the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005): by default, the value from the last search is kept, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this part we search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in &#039;&#039;Trypanosoma cruzi&#039;&#039; (a unicellular parasite of the genus &#039;&#039;Trypanosoma&#039;&#039;, whose members cause diseases like sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma cruzi&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma cruzi&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GTDKMR8P014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=752</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=752"/>
		<updated>2025-11-06T14:52:35Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein‐coding sequence with unknown function from the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
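&lt;br /&gt;
For intuition about this threshold: a BLAST E-value is the expected number of chance alignments that score at least as well as the hit. Under the usual assumption that chance hits are Poisson-distributed, the probability of seeing at least one such chance alignment is 1 - exp(-E). The minimal Python sketch below evaluates this for the thresholds used in this exercise. The function name is our own and is not part of BLAST.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed chance hits."""
    return 1.0 - math.exp(-e_value)


# Thresholds used in this exercise: significance cutoff, inclusion cutoff, search cutoff
for threshold in (0.005, 10, 100):
    print(threshold, round(chance_hit_probability(threshold), 4))
```
&lt;br /&gt;
At very small E-values the probability is almost equal to E itself, whereas at a threshold of 10 or 100 a chance hit is practically guaranteed.&lt;br /&gt;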
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide sequence ID, %identity and coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
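&lt;br /&gt;
To make the idea of a PSSM concrete, here is a minimal Python sketch under simplifying assumptions: a hypothetical ungapped toy alignment, a uniform background frequency of 1/20 and a pseudocount of 1 per residue. PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, so this illustrates the principle and is not its actual algorithm. All names in the code are our own.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column, mapping residue to log2-odds score."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, peptide):
    """Sum the position-specific scores of an ungapped peptide."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))


# Hypothetical toy alignment (not taken from the BLAST results)
toy_alignment = ["MEQDW", "MEQEW", "MDQDW", "MEKDW"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "MEQDW"), 2))  # peptide resembling the alignment
print(round(score(pssm, "AAAAA"), 2))  # unrelated peptide
```
&lt;br /&gt;
Residues that are frequent in a column receive positive scores and residues never observed there receive negative scores, so the same residue can score differently at different positions.&lt;br /&gt;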
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Next you will run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (0.005): by default, the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when a standard BLAST search does not work.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homologues.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way to find a remote homolog in a specific organism or taxonomic group. In this round you will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT51ND2J016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. There, just leave all options as default and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the four most conserved residues from question Q12 could indeed form an active site. Phyre is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the four most conserved residues from question Q14 could indeed form an active site. CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=751</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=751"/>
		<updated>2025-11-06T12:34:49Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein‐coding sequence with unknown function from the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, which match has the lowest E-value? Provide sequence ID, %identity and coverage. Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Next you will run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs, we are back on track to answer the original question: what is the function of this orphan gene in humans? You can get some hints from the BLAST searches.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to set the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005); by default, it keeps the value from the last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is not responding, you can look up pre-run results with this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is not responding, you can look up pre-run results with this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species in which we found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is not responding, you can look up pre-run results with this ID: &#039;&#039;&#039;GT3TCGYR016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. There, just leave all options at their defaults and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels website and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=750</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=750"/>
		<updated>2025-11-06T12:26:48Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Apart from the orphan protein&#039;s hit to itself, none of the hits are significant: their E-values lie between 1 and 10, meaning that one to ten hits with the same or a better score are expected by chance alone in a search of this size. The fact that the sequences are from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
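As a rough illustration of what such E-values mean, the sketch below converts an E-value into the probability of seeing at least one chance hit with the same or a better score. It assumes that the number of chance hits follows a Poisson distribution, which is the usual BLAST statistical model; the function name is made up for this example.&lt;br /&gt;

```python
import math

def p_at_least_one_chance_hit(e_value):
    """P = 1 - exp(-E), assuming Poisson-distributed chance hits."""
    # expm1 keeps numerical precision for very small E-values
    return -math.expm1(-e_value)

# E-values spanning the range discussed in this exercise
for e in (10, 1, 0.005):
    print(e, p_at_least_one_chance_hit(e))
```

For small E-values the probability is nearly equal to the E-value itself, while for E-values between 1 and 10 a chance hit is more likely than not.&lt;br /&gt;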
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: 500 (actually, many more than 500 hits, but BLAST by default only shows 500; note that the last hit has an E-value much smaller than 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
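To make the notion of query coverage concrete, here is a minimal sketch that computes the percentage of query positions covered by one or more aligned regions. The function name, the coordinates and the query length are all hypothetical values chosen for illustration; they are not taken from the actual search.&lt;br /&gt;

```python
def query_coverage(aligned_regions, query_length):
    """Percentage of query positions covered by at least one region.

    aligned_regions holds (start, end) pairs, 1-based and inclusive.
    """
    covered = set()
    for start, end in aligned_regions:
        covered.update(range(start, end + 1))
    return 100.0 * len(covered) / query_length

# Hypothetical hits: one on the N-terminal half, one on the C-terminal half
n_terminal_hit = [(1, 90)]
c_terminal_hit = [(101, 200)]
print(query_coverage(n_terminal_hit, 200))
print(query_coverage(c_terminal_hit, 200))
```

Two hits can have similar coverage values and still align to completely different parts of the query, which is why the Graphic Summary is worth inspecting alongside the numbers.&lt;br /&gt;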
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
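The principle can be sketched with a toy example. The code below builds a log-odds matrix from a tiny, made-up alignment, using a flat pseudocount and a uniform background frequency. These are simplifying assumptions: PSI-BLAST itself also weights the sequences and uses data-derived pseudocounts and background frequencies, so this is an illustration of the idea and not the actual algorithm.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(col[res] for col, res in zip(pssm, sequence))

# Hypothetical aligned hits: columns 1, 3 and 4 conserved, column 2 variable
toy_alignment = ["CPHC", "CAHC", "CGHC", "CPHC"]
pssm = build_pssm(toy_alignment)
print(score(pssm, "CSHC"), score(pssm, "WSWW"))
```

A sequence that keeps the conserved positions scores well even with an unseen residue at the variable position, whereas a generic substitution matrix would treat every position the same way.&lt;br /&gt;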
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The E-values are lower this time and the query cover has increased to around 74%, but the cover seems to be skewed towards only one part of the previous matches. This suggests that only one of the two types of matches has dominated the construction of PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In search 2 (PSI-BLAST iteration 2), the annotated hits were mostly deaminase domain-containing proteins and Rrf2 family transcriptional regulators, plus some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In search 3 (PSI-BLAST iteration 3), the hits were mostly Rrf2 family transcriptional regulators. This agrees with the coverage shown on the Graphic Summary: the matches to only one of the regions (the C-terminal part of the protein) seem to have dominated PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
&lt;br /&gt;
[[File:PSSM-2_on_PDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PSSM_onPDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: Searching with either PSSM-2 or PSSM-3, we find the same matches in PDB. The coverage coincides with the second half (C-terminal) of the orphan protein. This result is expected, since our PSSMs were built from more proteins matching this region, so in effect we amplified the signal from this region.&lt;br /&gt;
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
The first two PDB hits (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA with a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so this function is not expected to be present in our orphan protein. &lt;br /&gt;
UniProt provides further information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also find the three cysteines in our protein of interest, so it might have some regulatory function, but at this point we would probably need experimental assays to test this hypothesis.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
The orphan protein used in this example is a real-world case; unfortunately, we do not know its function. There are still many genes whose function is unknown, and some are involved in diseases, so it is important to have ways of proposing a potential function for them.&lt;br /&gt;
When we use PSI-BLAST, we select some sequences to build a position-specific scoring matrix (PSSM). The advantage of using a matrix instead of a single sequence to find remote homologs is that it learns a wider range of preferences for each position, which is why we find more hits, with lower (more significant) E-values.&lt;br /&gt;
A word of caution, however: make sure that the sequences you include in your PSSM do not pollute the initial signal, so preferably include only a few hits, or only the more significant ones.&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=749</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=749"/>
		<updated>2025-11-06T12:11:02Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Apart from the orphan protein&#039;s hit to itself, none of the hits are significant: their E-values lie between 1 and 10, meaning that one to ten hits with the same or a better score are expected by chance alone in a search of this size. The fact that the sequences are from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: 500 (actually, many more than 500 hits, but BLAST by default only shows 500; note that the last hit has an E-value much smaller than 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The E-values are lower this time and the query cover has increased to around 74%, but the cover seems to be skewed towards only one part of the previous matches. This suggests that only one of the two types of matches has dominated the construction of PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In search 2 (PSI-BLAST iteration 2), the annotated hits were mostly deaminase domain-containing proteins and Rrf2 family transcriptional regulators, plus some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In search 3 (PSI-BLAST iteration 3), the hits were mostly Rrf2 family transcriptional regulators. This agrees with the coverage shown on the Graphic Summary: the matches to only one of the regions (the C-terminal part of the protein) seem to have dominated PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
&lt;br /&gt;
[[File:PSSM-2_on_PDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PSSM_onPDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: Searching with either PSSM-2 or PSSM-3, we find the same matches in PDB. The coverage coincides with the second half (C-terminal) of the orphan protein. This result is expected, since our PSSMs were built from more proteins matching this region, so in effect we amplified the signal from this region.&lt;br /&gt;
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
The first two PDB hits (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA with a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so this function is not expected to be present in our orphan protein. &lt;br /&gt;
UniProt provides further information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also find the three cysteines in our protein of interest, so it might have some regulatory function, but at this point we would probably need experimental assays to test this hypothesis.&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=748</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=748"/>
		<updated>2025-11-06T12:08:21Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein sequence, of unknown function, encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
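&lt;br /&gt;
If you download the hit table rather than reading it off the results page, the count can also be scripted. Below is a minimal sketch, assuming a tab-separated table in the standard 12-column BLAST tabular layout (E-value in column 11). The function name count_significant and the three example rows are hypothetical and are not real results for C22orf45.&lt;br /&gt;
&lt;br /&gt;
```python
import csv
import io


def count_significant(table_text, threshold=0.005):
    """Count rows whose E-value (column 11) is below the threshold."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    count = 0
    for row in reader:
        # Skip blank lines and comment lines that start with "#".
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])
        if threshold > evalue:
            count += 1
    return count


# Hypothetical rows: query, subject, identity, length, mismatches, gaps,
# query start/end, subject start/end, E-value, bit score.
example = "\n".join([
    "\t".join(["query", "hitA", "35.0", "80", "50", "2", "1", "80", "5", "84", "2e-10", "60.1"]),
    "\t".join(["query", "hitB", "28.0", "60", "40", "3", "10", "70", "1", "60", "0.004", "35.0"]),
    "\t".join(["query", "hitC", "25.0", "40", "30", "0", "20", "60", "3", "43", "6.9", "28.5"]),
])

print(count_significant(example))
```
&lt;br /&gt;
Raising the threshold argument to 10 would count all three example rows, which mirrors how a looser cutoff lets in weaker hits.&lt;br /&gt;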
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
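&lt;br /&gt;
To get a feeling for what a PSSM contains, here is a deliberately simplified sketch. It assumes a tiny gap-free alignment over a four-letter toy alphabet, a uniform background, and a fixed pseudocount; the names build_pssm and score are hypothetical. PSI-BLAST itself additionally weights the sequences and uses more refined pseudocounts, so this is an illustration of the idea and not its actual procedure.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            # Observed frequency, smoothed so unseen residues keep a finite score.
            freq = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(freq / background[residue])
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    return sum(col[res] for col, res in zip(pssm, segment))


# Toy four-letter alphabet with a uniform background (an assumption made
# to keep the example small; real profiles use all 20 amino acids).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "W": 0.25}
aligned = ["ACGW", "ACGA", "ACWW", "GCGW"]

pssm = build_pssm(aligned, background)
print(round(score(pssm, "ACGW"), 2))
print(round(score(pssm, "WGAC"), 2))
```
&lt;br /&gt;
A segment that follows the conserved columns scores higher than one that does not, because each position now has its own scores instead of one substitution matrix shared by all positions.&lt;br /&gt;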
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
In order to do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time with the maximum number of sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again, and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to a significant value (E-value &amp;lt; 0.005): by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxa (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT3TCGYR016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=747</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=747"/>
		<updated>2025-11-06T12:07:01Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Finding a remote homolog in a specific taxa (Optional) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein sequence, of unknown function, encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
In order to do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time with the maximum number of sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again, and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to a significant value (E-value &amp;lt; 0.005): by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when a standard BLAST search finds no significant hits.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database, &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
GT3TCGYR016&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File:Logo.png|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a PSI-BLAST profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output]&lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PSSM_onPDB.png&amp;diff=746</id>
		<title>File:GraphicSummary PSSM onPDB.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PSSM_onPDB.png&amp;diff=746"/>
		<updated>2025-11-06T12:01:14Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:PSSM-2_on_PDB.png&amp;diff=745</id>
		<title>File:PSSM-2 on PDB.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:PSSM-2_on_PDB.png&amp;diff=745"/>
		<updated>2025-11-06T12:01:02Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=744</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=744"/>
		<updated>2025-11-06T12:00:42Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that between one and ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
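&lt;br /&gt;
The meaning of an E-value can be made concrete with a small calculation. The following is a minimal Python sketch, not BLAST&#039;s actual implementation: it assumes the Karlin-Altschul relation E = m * n * 2^(-S&#039;), where S&#039; is the bit score, m the query length and n the total database length, and it ignores the effective-length corrections that BLAST applies. The function names and the example numbers are hypothetical.&lt;br /&gt;

```python
import math


def expected_chance_hits(bit_score, query_len, db_len):
    """E-value estimate E = m * n * 2**(-S'), with S' the bit score.

    Assumption: raw lengths are used; BLAST itself uses effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def prob_at_least_one(e_value):
    """Probability of one or more chance hits, assuming the number of
    chance hits is Poisson-distributed with mean E."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    query_len = 150
    db_len = 50_000_000_000
    for bits in (30.0, 40.0, 50.0, 60.0):
        e = expected_chance_hits(bits, query_len, db_len)
        print(bits, e, prob_at_least_one(e))
```

Under this model an E-value between 1 and 10 corresponds to a probability of roughly 63% to nearly 100% of seeing at least one equally good hit by chance, which is why such hits cannot be taken as evidence of homology.&lt;br /&gt;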
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information such as conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
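&lt;br /&gt;
To illustrate how a PSSM turns an alignment into position-specific scores, here is a minimal Python sketch. It is a simplification and not what PSI-BLAST actually does: it assumes a uniform background frequency for the 20 amino acids, a fixed pseudocount, and no sequence weighting. The toy alignment and all function names are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gap characters are skipped.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


if __name__ == "__main__":
    # Hypothetical alignment: columns 1, 3 and 4 conserved, column 2 variable.
    toy_alignment = ["CAHC", "CGHC", "CSHC", "CTHC"]
    pssm = build_pssm(toy_alignment)
    print(score_segment(pssm, "CAHC"), score_segment(pssm, "WAWW"))
```

In this sketch a conserved column rewards its residue strongly and penalizes all others, while a variable column scores every residue close to zero, which is the property that makes the profile search more sensitive than a single generic matrix.&lt;br /&gt;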
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The E-values are lower this time and the query coverage has increased to around 74%, but the coverage is skewed towards only one part of the previous matches. This suggests that one of the two types of matches has dominated the construction of PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2) the hits were mostly deaminase domain-containing proteins and Rrf2 family transcriptional regulators, plus some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3) the hits were mostly Rrf2 family transcriptional regulators. This agrees with the coverage seen in the Graphic Summary: the matches in only one of the regions (the C-terminal part of the protein) have dominated PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
&lt;br /&gt;
[[File:PSSM-2_on_PDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PSSM_onPDB.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, so in effect we amplified the signal from this region.&lt;br /&gt;
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: The first two PDB hits (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA with a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so it is not a function we would expect in our orphan protein.&lt;br /&gt;
UniProt has additional information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also find the three cysteines in our protein of interest, so it might have some regulatory function. At this point, however, that is speculation, and experimental assays would be needed to test its function.&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=743</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=743"/>
		<updated>2025-11-06T11:59:36Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that between one and ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information such as conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The E-values are lower this time and the query coverage has increased to around 74%, but the coverage is skewed towards only one part of the previous matches. This suggests that one of the two types of matches has dominated the construction of PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2) the hits were mostly deaminase domain-containing proteins and Rrf2 family transcriptional regulators, plus some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3) the hits were mostly Rrf2 family transcriptional regulators. This agrees with the coverage seen in the Graphic Summary: the matches in only one of the regions (the C-terminal part of the protein) have dominated PSSM-3.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
&lt;br /&gt;
Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, so in effect we amplified the signal from this region.&lt;br /&gt;
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: The first two PDB hits (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA with a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so it is not a function we would expect in our orphan protein.&lt;br /&gt;
UniProt has additional information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also find the three cysteines in our protein of interest, so it might have some regulatory function. At this point, however, that is speculation, and experimental assays would be needed to test its function.&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=742</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=742"/>
		<updated>2025-11-06T11:12:55Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that between one and ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information such as conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
[[File:results_PSI-BLAST_iteration3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB3.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: The E-values are lower this time, but the query coverage is now skewed towards only one of the two regions matched previously.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2) the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3) the hits were mostly annotated as Rrf2 family transcriptional regulators, which were also found in search 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=741</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=741"/>
		<updated>2025-11-06T11:09:28Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein‐coding sequence with unknown function from the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including the maximum number of sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values, and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to remember that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to a significant value (E-value &amp;lt; 0.005); by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
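&lt;br /&gt;
For readers who prefer the command line, the same save-and-reuse workflow can be sketched in Python as argument lists for the BLAST+ &lt;code&gt;psiblast&lt;/code&gt; program. This is a sketch under assumptions, not part of the exercise: it presumes a local BLAST+ installation with local copies of the databases, and the file names (C22orf45.fasta, PSSM-2.asn) and database names (nr, pdbaa) are placeholders. Which iteration&#039;s profile ends up in the saved file depends on the psiblast options, so check the BLAST+ manual before relying on it.&lt;br /&gt;

```python
import shlex

def psiblast_build_pssm(query_fasta, db, pssm_out, iterations=2,
                        evalue=100, inclusion_ethresh=10):
    """Argument list that iterates PSI-BLAST and writes a PSSM checkpoint."""
    return ["psiblast", "-query", query_fasta, "-db", db,
            "-num_iterations", str(iterations),
            "-evalue", str(evalue),
            "-inclusion_ethresh", str(inclusion_ethresh),
            "-out_pssm", pssm_out]

def psiblast_search_with_pssm(pssm_in, db, evalue=0.005):
    """Argument list that starts a search from a saved PSSM (no -query given)."""
    return ["psiblast", "-in_pssm", pssm_in, "-db", db, "-evalue", str(evalue)]

# Placeholder file and database names; adjust to your own setup.
build_cmd = psiblast_build_pssm("C22orf45.fasta", "nr", "PSSM-2.asn")
search_cmd = psiblast_search_with_pssm("PSSM-2.asn", "pdbaa")
print(shlex.join(build_cmd))
print(shlex.join(search_cmd))
# To actually run a command: subprocess.run(build_cmd, check=True)
```

The second command mirrors the web form: the query is not given again because it is stored in the PSSM file, and the E-value is reset from 100 to 0.005.&lt;br /&gt;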
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding remote homologs in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. In doing this your are transferred to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the on the 3D model of the protein to get the relevant PDB filel).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=740</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=740"/>
		<updated>2025-11-06T11:09:02Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Here you have the protein‐coding sequence with unknown function from the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check the pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including the maximum number of sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 as a reminder that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
Protein function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the PSSM files you saved (repeat the search for PSSM-2 and PSSM-3). Remember to set the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005); by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GT08HV28016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query coverage in the PSI-BLAST searches, the PDB structures, and the species where we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this part we will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=739</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=739"/>
		<updated>2025-11-06T11:08:03Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to explore the possible function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing proteins.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence, of unknown function, encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database setting to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide the sequence ID) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
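&lt;br /&gt;
As a rough illustration of what happens when a PSSM is built from retained hits, the sketch below constructs a tiny position-specific scoring matrix from a toy alignment. It is a simplification under stated assumptions, not PSI-BLAST&#039;s actual procedure: the four aligned sequences are made up, the background frequencies are assumed uniform, and a plain add-one pseudocount stands in for the sequence weighting and more elaborate pseudocounts used by the real program.&lt;br /&gt;

```python
import math
from collections import Counter

# Toy alignment (hypothetical data): equal-length hits aligned to the query.
ALIGNMENT = ["MEQDW", "MEQEW", "MDQDW", "LEQDW"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def build_pssm(alignment):
    """Return one dict per alignment column mapping residue to log-odds score."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + PSEUDOCOUNT) / total
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of a sequence of the same length."""
    return sum(col[res] for col, res in zip(pssm, sequence))


pssm = build_pssm(ALIGNMENT)
# Score of M and W in the first column, then two whole-sequence scores.
print(round(pssm[0]["M"], 2), round(pssm[0]["W"], 2))
print(round(score(pssm, "MEQDW"), 2), round(score(pssm, "WWWWW"), 2))
```

Residues that are frequent in a column receive positive scores and residues never seen there receive negative ones, so a sequence that matches the conserved pattern scores higher than an unrelated one.&lt;br /&gt;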
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You could now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run a third iteration, this time including all the sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 as a reminder that this one comes from iteration 3.&lt;br /&gt;
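&lt;br /&gt;
For readers who prefer the command line, the save-and-reuse steps have rough counterparts in the standalone BLAST+ program psiblast. The sketch below only assembles the argument lists and does not run anything. The file names and the local database names (nr, pdbaa) are placeholders that assume a local BLAST+ installation with those databases; they are not files provided by this exercise, and the thresholds simply mirror the ones used on the web page.&lt;br /&gt;

```python
def build_psiblast_args(query, db, iterations, pssm_out,
                        inclusion=10.0, evalue=100.0):
    """Assemble a psiblast command that searches with a query and saves a PSSM."""
    return [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),                # reporting threshold
        "-inclusion_ethresh", str(inclusion),  # threshold for inclusion in the PSSM
        "-out_pssm", pssm_out,
    ]


def reuse_pssm_args(pssm_in, db, evalue=0.005):
    """Assemble a psiblast command that searches with a saved PSSM, not a query."""
    return ["psiblast", "-in_pssm", pssm_in, "-db", db, "-evalue", str(evalue)]


# Placeholder file and database names (assumptions, see the text above).
print(" ".join(build_psiblast_args("C22orf45.fasta", "nr", 2, "PSSM-2.asn")))
print(" ".join(reuse_pssm_args("PSSM-2.asn", "pdbaa")))
```

To execute such a command, the list could be passed to subprocess.run from the Python standard library. Note that the second command takes no query file, which mirrors the web interface, where the query is stored in the PSSM.&lt;br /&gt;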
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
Protein function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the PSSM files you saved (repeat the search for PSSM-2 and PSSM-3). Remember to set the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005); by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query coverage in the PSI-BLAST searches, the PDB structures, and the species where we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this part we will search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. Doing this transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the on the 3D model of the protein to get the relevant PDB filel).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in PyMOL. If you do not have PyMOL installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=738</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=738"/>
		<updated>2025-11-06T11:07:36Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to explore the possible function of a real human orphan gene called C22orf45. The aim is a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing protein.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence of unknown function encoded by the human gene &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
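&lt;br /&gt;
To get a feeling for what an E-value threshold means, here is a minimal Python sketch (it is not part of the BLAST web interface). It assumes the standard bit-score relation E = m * n * 2^(-S), where m is the query length, n the total length of the database and S the bit score. The database sizes and the bit score below are made-up round numbers, and real BLAST uses corrected (effective) lengths, so treat the output as an illustration only.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: expected number of chance hits (E-value) for a given bit score.
# Assumes the standard relation E = m * n * 2**(-S).
# Database sizes and the bit score are illustrative, not real values.

def expected_chance_hits(query_len, db_len, bit_score):
    """Return the E-value of an alignment with the given bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)

QUERY_LEN = 159  # number of residues in the C22orf45 query

for name, db_len in [("small database", 1e8), ("large database", 1e11)]:
    e_value = expected_chance_hits(QUERY_LEN, db_len, bit_score=40)
    print(f"{name}: E = {e_value:.2g}")
```
&lt;br /&gt;
The same alignment score gives a larger E-value in a larger database, which is one reason why the E-value of a hit can differ between a pdb search and an nr search. With an Expect threshold of 100, the weakest hits that are still reported are ones you would expect to see about 100 times by chance alone.&lt;br /&gt;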
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
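&lt;br /&gt;
The idea behind a PSSM can be sketched in a few lines of Python. This is a simplified illustration, not the actual PSI-BLAST algorithm: it assumes a uniform background frequency and a fixed pseudocount, whereas PSI-BLAST also weights sequences and uses substitution-matrix-based pseudocounts. The toy alignment and the function names are made up for this example.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue to a log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency
    pssm = []
    for column in zip(*aligned_seqs):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[res] for col, res in zip(pssm, segment))

toy_alignment = ["CSHTW", "CSKTW", "CAHSW", "CSHTF"]  # made-up aligned hits
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "CSHTW"), 2), round(score(pssm, "GGGGG"), 2))
```
&lt;br /&gt;
Residues that are conserved in a column get a high score at that position only, so a sequence matching the conserved positions scores well even if its overall identity to the query is low.&lt;br /&gt;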
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|300px|thumb|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time with the maximum number of sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage in the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
Save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select one of the files you saved (run the search once with PSSM-2 and once with PSSM-3). Remember to set the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005): by default the form keeps the value from your last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why or why not?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This takes you to the Seq2Logo web server. There, just leave all options at their defaults and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate whether the structural properties of the four most conserved residues from question Q12 are indeed compatible with an active site.  Phyre is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate whether the structural properties of the four most conserved residues from question Q14 are indeed compatible with an active site.  CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You can find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in PyMOL. If you do not have PyMOL installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Results_PSI-BLAST_iteration2.png&amp;diff=737</id>
		<title>File:Results PSI-BLAST iteration2.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Results_PSI-BLAST_iteration2.png&amp;diff=737"/>
		<updated>2025-11-06T11:06:45Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=736</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=736"/>
		<updated>2025-11-06T10:57:00Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to explore the possible function of a real human orphan gene called C22orf45. The aim is a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing protein.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence of unknown function encoded by the human gene &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|200px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time with the maximum number of sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST collapses you can check pre-run results using this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; in here [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage in the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
Save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
Since protein function is closely related to protein structure, we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005). By default, the E-value is carried over from the last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GR15WYYN016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. There, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=735</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=735"/>
		<updated>2025-11-06T10:50:41Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to:&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. We will aim to do a research-style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein.&lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence of unknown function encoded by the human gene &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First, we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] website at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
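&lt;br /&gt;
To make the idea of a PSSM concrete, here is a minimal Python sketch. It is not PSI-BLAST&#039;s actual procedure: it assumes a hypothetical ungapped toy alignment, a uniform background frequency, and simple add-one pseudocounts, and the names ALIGNMENT, build_pssm and score are invented for illustration. The real program additionally weights the aligned sequences and derives its pseudocounts from a substitution matrix.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

# Hypothetical toy alignment (not taken from the exercise): four ungapped sequences.
ALIGNMENT = ["MEQDW", "MEKDW", "MDQEW", "MEQDF"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: simple add-one smoothing


def build_pssm(alignment):
    """Return one dict per alignment column mapping residue to a log2-odds score."""
    n_seqs = len(alignment)
    total = n_seqs + PSEUDOCOUNT * len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in ALPHABET:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(pssm, sequence))


pssm = build_pssm(ALIGNMENT)
conserved = score(pssm, "MEQDW")   # matches the conserved residues
unrelated = score(pssm, "GGGGG")   # residues never seen in the alignment
print(round(conserved, 2), round(unrelated, 2))
```
&lt;br /&gt;
In this sketch, residues that recur in a column receive positive scores at that position only, so a sequence matching the conserved positions scores higher than an unrelated one. This is the position-specific behaviour that a single substitution matrix cannot capture.&lt;br /&gt;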
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|200px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You can run a third iteration, but before that, let&#039;s save the PSSM from iteration 2 for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Rename the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including the maximum number of sequences that have an E-value &amp;lt; 0.005.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
Since protein function is closely related to protein structure, we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (E-value &amp;lt; 0.005). By default, the E-value is carried over from the last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM!&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-2&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;PSSM-3&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If the BLAST server fails, you can check pre-run results by entering this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. There, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CHPmodels is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template shares little sequence similarity. As a rule of thumb, a Z-score greater than 10 will signify a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=734</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=734"/>
		<updated>2025-11-06T10:48:52Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing proteins. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence of unknown function encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database setting to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
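&lt;br /&gt;
The effect of the Expect threshold can be illustrated with the Karlin-Altschul relation E = K·m·n·exp(-λS), which estimates how many alignments with raw score S or better are expected by chance for a query of length m and a database of n residues. The Python sketch below is only an illustration: the function name, the database size, and the K and lambda values are assumptions chosen for the example, and the real BLAST calculation additionally uses effective (corrected) lengths.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def expected_chance_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are assumed, illustrative statistical parameters.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


QUERY_LEN = 159          # residues in the C22orf45 sequence above
DB_LEN = 100_000_000     # hypothetical database size in residues

for raw_score in (60, 80, 100, 120):
    e_value = expected_chance_hits(raw_score, QUERY_LEN, DB_LEN)
    print(f"raw score {raw_score}: E = {e_value:.2e}")
```
&lt;br /&gt;
Because E falls exponentially with the score, a threshold of 100 admits many low-scoring alignments, most of which are expected to arise by chance.&lt;br /&gt;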
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|200px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
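&lt;br /&gt;
As a rough illustration of what a profile is, the Python sketch below builds a toy PSSM from a small ungapped alignment and scores sequences against it. The sequences, the uniform background frequencies, and the add-one pseudocount are assumptions made for this example; PSI-BLAST itself uses sequence weighting, empirical background frequencies, and a more elaborate pseudocount scheme.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

# Toy alignment: hypothetical sequences, not taken from the exercise.
ALIGNMENT = ["MKVLA", "MKILA", "MRVLG", "MKVLS"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                  # assumption: simple add-one smoothing


def build_pssm(alignment):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + PSEUDOCOUNT) / total
            scores[residue] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(sequence, pssm):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(col[res] for res, col in zip(sequence, pssm))


pssm = build_pssm(ALIGNMENT)
print(round(score("MKVLA", pssm), 2))  # consensus-like sequence
print(round(score("WWWWW", pssm), 2))  # sequence unlike the alignment
```
&lt;br /&gt;
A residue that is conserved in a column receives a positive score only at that position, so the same substitution can be tolerated at one position and penalised at another, unlike with a single substitution matrix such as BLOSUM62.&lt;br /&gt;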
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You could now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GSW70U2V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (0.005): by default, the value from the last search is kept, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species where we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and testing whether the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. This transfers you to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to check whether the structural properties of the four most conserved residues from question Q12 could indeed form an active site.  Phyre is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to check whether the structural properties of the four most conserved residues from question Q14 could indeed form an active site.  CPHmodels is a program for protein homology modelling. It takes a protein amino acid sequence as input and produces a 3D protein structure based on single-template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take 10-20 minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in PyMOL. If you do not have PyMOL installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=733</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=733"/>
		<updated>2025-11-06T10:47:27Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to discover the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing proteins. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence of unknown function encoded by the human gene named &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database setting to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues we will increase the E-value threshold (Expect threshold) of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|200px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST fails to respond, you can look up pre-run results with this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You could now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
To do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run the third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3 to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt; Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (0.005): by default, the value from the last search is kept, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species where we have found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;).  (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems). PSI-BLAST can be used in the same way for finding a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. In doing this you are transferred to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=732</id>
		<title>Exercise PSI-BLAST</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST&amp;diff=732"/>
		<updated>2025-11-06T10:47:11Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Written by: Carolina Barra Quaglia&lt;br /&gt;
&lt;br /&gt;
==Overview==&lt;br /&gt;
&lt;br /&gt;
In this exercise you will learn how to:&lt;br /&gt;
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.&lt;br /&gt;
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).&lt;br /&gt;
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.&lt;br /&gt;
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.&lt;br /&gt;
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.&lt;br /&gt;
&lt;br /&gt;
==Introduction: What are orphan genes?==&lt;br /&gt;
&lt;br /&gt;
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.&lt;br /&gt;
&lt;br /&gt;
In today&#039;s exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. The aim is a research‐style annotation of a “dark” gene that is not well annotated.&lt;br /&gt;
&lt;br /&gt;
Interestingly, this gene (C22orf45) may have originated from so-called &#039;junk DNA&#039; and is thought to have gained function through mutations that allowed it to start producing a protein. &lt;br /&gt;
(You can find more information about the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])&lt;br /&gt;
&lt;br /&gt;
==When BLAST fails==&lt;br /&gt;
&lt;br /&gt;
Below is the protein sequence, of unknown function, encoded by the human gene &amp;quot;C22orf45&amp;quot;. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;C22orf45&lt;br /&gt;
 MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG&lt;br /&gt;
 CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP&lt;br /&gt;
 CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
First, we will confirm that a standard BLAST search does not find any homologous sequences. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select &amp;lt;u&amp;gt;blastp&amp;lt;/u&amp;gt; as the algorithm. Paste in the query sequence. Change the database from nr to &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt;, and press &amp;lt;u&amp;gt;BLAST&amp;lt;/u&amp;gt; (Figure 1).&lt;br /&gt;
&lt;br /&gt;
[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results using this ID: &#039;&#039;&#039;GPHA6F6K016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&amp;amp;PAGE_TYPE=BlastSearch&amp;amp;LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to &amp;lt;u&amp;gt;Non-redundant protein sequences (nr)&amp;lt;/u&amp;gt; and select &amp;lt;u&amp;gt;PSI-BLAST (Position-Specific Iterated BLAST)&amp;lt;/u&amp;gt; as the algorithm (Figure 2). &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;IMPORTANT:&#039;&#039;&#039; To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results using this ID: &#039;&#039;&#039;GPJM9RYM014&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide the sequence ID) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
&lt;br /&gt;
===Constructing the PSSM===&lt;br /&gt;
&lt;br /&gt;
Now retain the hits with an E-value&amp;lt;10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the &amp;lt;u&amp;gt;Run&amp;lt;/u&amp;gt; button at &amp;lt;u&amp;gt;Run PSI-Blast iteration 2&amp;lt;/u&amp;gt; (you can find it at both the bottom and top of the results table).&lt;br /&gt;
&lt;br /&gt;
[[File:PSI-BLAST_firstrun.png|250px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;background-color: lavender; border: solid thin grey;&amp;quot;&amp;gt;&lt;br /&gt;
:&#039;&#039;&#039;Note:&#039;&#039;&#039; If BLAST is unresponsive, you can look up pre-run results using this ID: &#039;&#039;&#039;GPX0AZ4V016&#039;&#039;&#039; here: [[https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&amp;amp;RECENT_RESULTS=on Lookup BLAST Job]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
&lt;br /&gt;
===Saving and reusing the PSSM===&lt;br /&gt;
&lt;br /&gt;
You could now run a third iteration, but before that, let&#039;s save the PSSM for future searches.&lt;br /&gt;
&lt;br /&gt;
In order to do that, go to the top of the PSI-BLAST output page and click &amp;lt;u&amp;gt;Download All&amp;lt;/u&amp;gt;, then click &amp;lt;u&amp;gt;PSSM&amp;lt;/u&amp;gt;. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.&lt;br /&gt;
&lt;br /&gt;
Now run a third iteration, this time including all sequences that have an E-value &amp;lt; 0.005. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.&lt;br /&gt;
&lt;br /&gt;
You can save the PSSM again and rename it to PSSM-3, to recall that this one comes from iteration 3.&lt;br /&gt;
&lt;br /&gt;
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
&lt;br /&gt;
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Open &#039;&#039;a new BLAST window&#039;&#039;. Select &amp;lt;u&amp;gt;Protein Data Bank (pdb)&amp;lt;/u&amp;gt; as the database. Click on &amp;lt;u&amp;gt;Algorithm parameters&amp;lt;/u&amp;gt; to show the extended settings. Click the button next to &amp;lt;u&amp;gt;Upload PSSM&amp;lt;/u&amp;gt; and select the file you just saved. Remember to change the Expect threshold to the significance cutoff (0.005); by default, the value is carried over from the last search, which should be 100. &#039;&#039;&#039;Note:&#039;&#039;&#039; You don&#039;t have to paste the query sequence again; it is stored in the PSSM! &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
==Reflection time==&lt;br /&gt;
&lt;br /&gt;
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work. &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: However, can you see any potential risks in doing so? Can we trust the results?&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Hint:&#039;&#039;&#039; Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homology.&lt;br /&gt;
&lt;br /&gt;
==Finding a remote homolog in a specific taxon (Optional)==&lt;br /&gt;
&lt;br /&gt;
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database &amp;quot;Reference proteins&amp;quot; (&amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt;). (&#039;&#039;&#039;Note:&#039;&#039;&#039; we would have liked to do this exercise in the broadest database, &amp;lt;u&amp;gt;nr&amp;lt;/u&amp;gt;, but that search runs into technical problems.) PSI-BLAST can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID &#039;&#039;&#039;GPAA1_HUMAN&#039;&#039;&#039; has a homolog in the genus &#039;&#039;Trypanosoma&#039;&#039; (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).&lt;br /&gt;
* First, try a standard BlastP (where you set &amp;lt;u&amp;gt;Organism&amp;lt;/u&amp;gt; to &#039;&#039;Trypanosoma&#039;&#039;, &amp;lt;u&amp;gt;Database&amp;lt;/u&amp;gt; to &amp;lt;u&amp;gt;refseq_protein&amp;lt;/u&amp;gt; (&#039;&#039;&#039;not&#039;&#039;&#039; refseq_select), switch the &amp;lt;u&amp;gt;Low complexity regions&amp;lt;/u&amp;gt; filter off, and set the E-value threshold to 10). &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit?&lt;br /&gt;
* Then, try PSI-BLAST. &#039;&#039;&#039;Hint:&#039;&#039;&#039; You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in &#039;&#039;Trypanosoma&#039;&#039;.&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
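&lt;br /&gt;
The same two-step workflow (build a PSSM in a broad database, then reuse it in a restricted one) can also be run with the command-line psiblast program from NCBI BLAST+. The Python sketch below only assembles the two command lines and prints them; it does not run a search. The file names and the local database names (refseq_protein, trypanosoma_refseq) are hypothetical placeholders, and a local BLAST+ installation with such databases is assumed.&lt;br /&gt;
&lt;br /&gt;
```python
import shlex


def build_pssm_command(query_fasta, database, pssm_file, iterations=3):
    """psiblast call that iterates against a broad database and saves the PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-evalue", "10",                 # report hits up to E = 10
        "-inclusion_ethresh", "0.005",   # only significant hits enter the PSSM
        "-out_pssm", pssm_file,          # checkpoint file reused in step 2
        "-out", "round1_hits.txt",       # hypothetical output file name
    ]


def reuse_pssm_command(pssm_file, database):
    """psiblast call that searches a second database with the saved PSSM."""
    return [
        "psiblast",
        "-in_pssm", pssm_file,           # used instead of -query: the query is stored in the PSSM
        "-db", database,
        "-evalue", "10",
        "-out", "trypanosoma_hits.txt",  # hypothetical output file name
    ]


# Hypothetical file and database names, for illustration only.
step1 = build_pssm_command("GPAA1_HUMAN.fasta", "refseq_protein", "gpaa1.pssm")
step2 = reuse_pssm_command("gpaa1.pssm", "trypanosoma_refseq")

for command in (step1, step2):
    print(shlex.join(command))
```
&lt;br /&gt;
With BLAST+ installed, each list could be passed to subprocess.run(command, check=True). As on the web interface, the second step takes no query sequence, because the saved PSSM already contains it.&lt;br /&gt;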
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
==Identifying conserved residues==&lt;br /&gt;
[[File: Logo.png‎|right|frame|thumb|Logo of a sequence profile spanning residues 279-296. The logo is calculated from a Psi-Blast profile]] &lt;br /&gt;
&lt;br /&gt;
We now return to the Query sequence you used in questions 1-12. You have now (hopefully) identified a structural relationship between the Query sequence and a protein sequence in the PDB database of protein structures. Say you would like to validate this relationship. This could be done by mutating (substituting) essential residues in the query sequence and test if the protein function (or structure) is affected by these mutations.&lt;br /&gt;
&lt;br /&gt;
The protein sequence of the query is large (more than 400 amino acids) and a complete mutation study including all residues would be extremely costly. Instead one can use PSI-BLAST and sequence profiles to identify conserved residues that are likely to be essential for the protein structure and/or protein function.&lt;br /&gt;
&lt;br /&gt;
Below you find a set of 8 residues from the Query protein sequence. You shall use the PSI-BLAST and Blast2logo programs to select four of the eight residues for a mutagenesis study (you shall select the four residues based on sequence conservation only).&lt;br /&gt;
&lt;br /&gt;
* (a): H271&lt;br /&gt;
* (b): R287&lt;br /&gt;
* (c): E290&lt;br /&gt;
* (d): Y334&lt;br /&gt;
* (e): F371&lt;br /&gt;
* (f): R379&lt;br /&gt;
* (g): R400&lt;br /&gt;
* (h): Y436&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?Blast2logo Blast2logo] server to identify which residues are conserved in the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] protein sequence. Go to the Blast2logo server and upload the Query sequence. Set the Blast database to &amp;lt;u&amp;gt;NR70&amp;lt;/u&amp;gt;, set the logo type to &amp;lt;u&amp;gt;Shannon&amp;lt;/u&amp;gt; and press submit (note it might take some (10-15) minutes before your job is completed). If the job does not complete, or if you don&#039;t have the patience to wait, you can find the output following this link [https://teaching.healthtech.dtu.dk/material/36611/files/Blast2logo_Query1_frame.htm Blast2logo output].&lt;br /&gt;
&lt;br /&gt;
When the job is completed you should see the logo-plot on the website. You can download it in PDF format. To improve the readability of the logo, you can also click on the &amp;lt;u&amp;gt;Customize visualization using Seq2Logo&amp;lt;/u&amp;gt; button. In doing this you are transferred to the Seq2Logo web server. Here, just leave all options as default, and press submit.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;:  Which four of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
&lt;br /&gt;
===Homology modelling ===&lt;br /&gt;
You shall use the [http://www.sbg.bio.ic.ac.uk/phyre2/ Phyre2] program to validate if the structural properties of the four most conserved residues from question Q12 indeed could form an active site.  Phyre is a program for protein homology modelling. This program takes as input a protein amino acids sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the Phyre web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note it might take some (10-20) minutes before your job is completed. To save you time, I have run the calculation for you. You can find the output here [http://www.sbg.bio.ic.ac.uk/servers/phyre/qphyre_scripts/results.cgi?jobid=070ac42bdea13d4e Phyre output].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Find the PDB hit identified by PSI-BLAST (you can click on the 3D model of the protein to get the relevant PDB file).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
You shall use the [https://services.healthtech.dtu.dk/service.php?CPHmodels CPHmodels] server to validate if the structural properties of the four most conserved residues from question Q14 indeed could form an active site.  CPHmodels is a program for protein homology modelling. This program takes as input a protein amino acid sequence and produces as output a 3D protein structure based on single template homology modelling. Go to the CPHmodels web-site and upload the [https://teaching.healthtech.dtu.dk/material/36611/files/Query1.txt Query] sequence. Note that it might take some time (10-20 minutes) before your job is completed. To save you time, I have run the calculation for you. You can find the output here: [http://www.cbs.dtu.dk/services/CPHmodels/teaching/query1.html CPHmodels output] &lt;br /&gt;
&lt;br /&gt;
The output from CPHmodels is not straightforward to interpret. However, the method provides a Z-score in situations where the query and template share little sequence similarity. As a rule of thumb, a Z-score greater than 10 signifies a reliable model. You find the template used by CPHmodels and the Z-score for the model in the last part of the output file.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
&lt;br /&gt;
Download the homology model made by CPHmodels (click on the query.pdb link), and open the model file in Pymol. If you do not have Pymol installed on your computer, you can find it on Campusnet in the Pymol folder. Show the location of the four essential residues from question Q14 on the structure.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PB3.png&amp;diff=731</id>
		<title>File:GraphicSummary PB3.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PB3.png&amp;diff=731"/>
		<updated>2025-11-06T10:45:26Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Results_PSI-BLAST_iteration3.png&amp;diff=730</id>
		<title>File:Results PSI-BLAST iteration3.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Results_PSI-BLAST_iteration3.png&amp;diff=730"/>
		<updated>2025-11-06T10:45:12Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=729</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=729"/>
		<updated>2025-11-06T09:54:36Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterized gene, and not many good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, but these are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide the sequence ID) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp. Its sequence identity is 33.33% and its query coverage is 48%. Apart from the query itself, none of the hits is human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it looks a bit odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits is significant (E-values are between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone). The fact that the sequences are from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
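&lt;br /&gt;
To make the E-value interpretation concrete, here is a minimal Python sketch (an illustration, not part of the original exercise) of the standard Karlin-Altschul relation E = m * n * 2^(-S), where S is the bit score and m and n are the query and database lengths. The database size and the bit scores are made-up values, and the length corrections that BLAST applies in practice are ignored.&lt;br /&gt;
&lt;br /&gt;
```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score.

    Uses the Karlin-Altschul relation E = m * n * 2**(-S), without the
    edge-effect length corrections that BLAST applies in practice.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers, for illustration only.
query_length = 159           # assumed length of the query in residues
database_length = 10 ** 9    # assumed total number of residues in the database

for bit_score in (30, 40, 50):
    e_value = expected_hits(bit_score, query_length, database_length)
    print(f"bit score {bit_score}: E = {e_value:.3g}")
```
&lt;br /&gt;
Every additional 10 bits of score lowers the expected number of chance hits by a factor of 1024, while a larger database raises it proportionally. This is why the same alignment can be significant in a small database such as PDB and not in nr.&lt;br /&gt;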
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
Answer: 500 (actually, many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value much smaller than 0.005). The E-values of the previous hits are much lower and look significant this time. This is because those sequences were incorporated into the PSSM and therefore influence the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
&lt;br /&gt;
[[File:graphicSummary_PB2.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
Answer: For most hits the query coverage is around 45-50%. However, the hits seem to fall into two separate regions of the protein, as if our orphan protein contained a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) was constructed from the selected sequences for the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, i.e. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
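&lt;br /&gt;
The principle can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the procedure PSI-BLAST actually uses: the alignment is invented, the background frequencies are taken as uniform, and a flat pseudocount stands in for the sequence weighting and substitution-matrix-based pseudocounts of the real program.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gap characters are ignored.
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Invented toy alignment: column 1 is fully conserved, column 3 is variable.
toy_alignment = ["HKAG", "HRLG", "HKVG", "HKIG"]
pssm = build_pssm(toy_alignment)

print("score of H at column 1:", round(pssm[0]["H"], 2))
print("score of W at column 1:", round(pssm[0]["W"], 2))
print("profile score of HKAG:", round(score_sequence(pssm, "HKAG"), 2))
print("profile score of WWWW:", round(score_sequence(pssm, "WWWW"), 2))
```
&lt;br /&gt;
Unlike BLOSUM62, which gives the same score to a given substitution everywhere, the profile rewards the conserved histidine strongly at column 1 only and is tolerant at the variable column 3.&lt;br /&gt;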
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.&lt;br /&gt;
Answer: The E-values are lower this time, but the query coverage seems to be skewed towards only one of the two regions matched previously.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2), the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3), the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation that also appeared in search 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PB2.png&amp;diff=728</id>
		<title>File:GraphicSummary PB2.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:GraphicSummary_PB2.png&amp;diff=728"/>
		<updated>2025-11-06T09:53:25Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=727</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=727"/>
		<updated>2025-11-06T09:52:52Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is very poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with a sequence identity of 33.33% and a query coverage of 48%. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is rather odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant (the E-values are between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and now look significant. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) was constructed from the selected sequences for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
Answer: The E-values are lower this time, but the query coverage seems to be skewed towards only one of the regions matched previously.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2), the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3), the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation that also appeared in search 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=726</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=726"/>
		<updated>2025-11-06T09:51:16Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is very poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with a sequence identity of 33.33% and a query coverage of 48%. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is rather odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant (the E-values are between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and now look significant. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) was constructed from the selected sequences for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
Answer: The E-values are lower this time, but the query coverage seems to be skewed towards only one of the regions matched previously.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2), the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3), the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation that also appeared in search 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) == &lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=725</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=725"/>
		<updated>2025-11-06T09:49:00Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Saving and reusing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is very poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with a sequence identity of 33.33% and a query coverage of 48%. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is rather odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant (the E-values are between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and now look significant. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) was constructed from the selected sequences for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Can you see any changes in the results now? Look at the E-values and the query cover on the Graphic Summary tab.&lt;br /&gt;
Answer: The E-values are lower this time, but the query coverage seems to be skewed towards only one of the regions matched previously.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: Are there any homologous sequences found in search 2 that have an annotated function?&lt;br /&gt;
Answer: In the previous search (PSI-BLAST run 2), the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?&lt;br /&gt;
Answer: In the new search (PSI-BLAST run 3), the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation that also appeared in search 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query sequence did the BLAST search cover)? &lt;br /&gt;
Answer: Query coverage against 5HXY_A was around 64% (positions 159-450, in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (see the NCBI BLAST graphics). You can also compare with the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=724</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=724"/>
		<updated>2025-11-06T09:35:59Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Constructing the PSSM */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is very poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with a sequence identity of 33.33% and a query coverage of 48%. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is rather odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant (the E-values are between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: After iteration 2, how many significant hits (E-value &amp;lt; 0.005) are now found? What happened to the E-values of the hits found before?&lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and now look significant. This is because those sequences were incorporated into the PSSM and therefore into the search.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?&lt;br /&gt;
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a mix of two different proteins, both of which seem to be abundant in many genera of bacteria.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.&lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, from which a new and more sensitive position-specific scoring matrix (PSSM) was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
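&lt;br /&gt;
The idea of a position-specific matrix can be illustrated with a minimal sketch. It is a simplification and not the PSI-BLAST algorithm itself: it assumes a gap-free toy alignment, a uniform background frequency of 1/20, and a flat pseudocount, whereas PSI-BLAST additionally weights the sequences and uses more refined pseudocounts. The three aligned segments are taken from the query, 4A8E_A and 5HXY_A alignments on this page. The function names are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Simplifying assumptions: no gaps, uniform background, flat pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    denominator = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / denominator
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the column-specific scores of a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment: query, 4A8E_A and 5HXY_A around the conserved G-x-R motif.
toy_alignment = ["GLRPGEL", "GLRVSEL", "GVRVGEL"]
pssm = build_pssm(toy_alignment)

print("R at the conserved column:", round(pssm[2]["R"], 2))
print("W at the conserved column:", round(pssm[2]["W"], 2))
print("score of GLRVGEL:", round(score_segment(pssm, "GLRVGEL"), 2))
```

Unlike BLOSUM62, where a given substitution always receives the same score, the same residue is scored differently in each column here: it is rewarded at positions where it is conserved in the alignment and penalised where it was never observed.&lt;br /&gt;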
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
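The identity and Positives values can be read off the middle line of a BLAST alignment: a letter marks an identical residue pair and a plus sign marks a non-identical pair with a positive score. The sketch below counts both for the first 60-column block of the 4A8E_A alignment above. Since it uses only that one block, the percentages differ from the values reported for the whole alignment. The function name is hypothetical.&lt;br /&gt;

```python
def midline_stats(query, midline):
    """Count identities and positives in one BLAST alignment block.

    Letters in the midline are identities; plus signs are positive-scoring
    substitutions. Positives include the identities, as in the BLAST report.
    """
    identities = sum(1 for ch in midline if ch.isalpha())
    plus_signs = midline.count("+")
    positives = identities + plus_signs
    return identities, positives, len(query)


# First block of the 4A8E_A alignment (query positions 242-301).
query = "DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL"
midline = "+  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ "

identities, positives, length = midline_stats(query, midline)
print(f"identities: {identities}/{length} = {identities / length:.0%}")
print(f"positives:  {positives}/{length} = {positives / length:.0%}")
```

Because only characters are counted, the result does not depend on the exact spacing of the midline, which makes the sketch robust to copy-and-paste from the BLAST report.&lt;br /&gt;
&lt;br /&gt;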
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=723</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=723"/>
		<updated>2025-11-06T09:24:28Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|800px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterised gene and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, and these 4 are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best non-identical hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial bacterial match before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein matching itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the orphan gene C22orf45 suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually many more than 500, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, which was then turned into a more sensitive position-specific scoring matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=722</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=722"/>
		<updated>2025-11-06T09:24:15Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|500px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterised gene and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, and these 4 are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best non-identical hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial bacterial match before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein matching itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the orphan gene C22orf45 suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (actually many more than 500, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, which was then turned into a more sensitive position-specific scoring matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=721</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=721"/>
		<updated>2025-11-06T09:23:57Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, the hits are not human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria and no match at all in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant. The E-values lie between 1 and 10, meaning that one to ten hits with the same or a better score are expected by chance alone. That the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
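&lt;br /&gt;
As a rough illustration of why E-values between 1 and 10 carry no weight, the sketch below converts an E-value into the probability of seeing at least one chance hit with that score or better. It assumes the usual Poisson approximation, P = 1 - exp(-E); the function name is made up for this example and is not part of any BLAST tool.&lt;br /&gt;

```python
import math

# Hypothetical helper (not a BLAST function): assuming chance hits follow
# a Poisson distribution with mean E, return P(at least one chance hit).
def p_at_least_one_chance_hit(e_value):
    return 1.0 - math.exp(-e_value)

# 0.005 is the significance cutoff used in this exercise;
# 1.0, 6.9 and 10.0 are E-values in the range reported in the answers.
for e_value in (0.005, 1.0, 6.9, 10.0):
    print(e_value, round(p_at_least_one_chance_hit(e_value), 4))
```

For very small E-values the probability is almost equal to E itself, while for E-values of 1 or more a chance hit is more likely than not.&lt;br /&gt;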
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration uses a generic BLOSUM62 substitution matrix. The significant hits from that search are combined into a multiple alignment, from which a position-specific scoring matrix (PSSM) is built. The second iteration searches with this more sensitive PSSM, which is why more sequences are found. A PSSM captures evolutionary information for each position, e.g. conserved regions and active sites versus regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
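&lt;br /&gt;
The idea of turning a multiple alignment into a PSSM can be sketched in a few lines of Python. This is a minimal toy version, not the PSI-BLAST implementation: it assumes a uniform background frequency of 1/20 and a flat pseudocount of 1, whereas PSI-BLAST also weights the sequences and uses more elaborate pseudocounts. The three 7-residue segments are the GLRPGEL region of the query and the matching stretches of 4A8E_A and 5HXY_A from the alignments in this answer sheet; the function names are invented for the example.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy PSSM builder: one dict of log-odds scores (in bits) per column.
# Assumptions: uniform background (1/20) and a flat pseudocount.
def build_pssm(alignment, pseudocount=1.0):
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gap characters are ignored
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

# Score an ungapped segment of the same length as the PSSM
def score(pssm, segment):
    return sum(col[aa] for col, aa in zip(pssm, segment))

# Segments: query, 4A8E_A, 5HXY_A (same aligned region)
toy_alignment = ["GLRPGEL", "GLRVSEL", "GVRVGEL"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "GLRPGEL"), 2), round(score(pssm, "AAAAAAA"), 2))
```

Fully conserved columns (G, R, E and L here) get a high score for the conserved residue and a penalty for everything else, while variable columns are scored more leniently. A generic matrix such as BLOSUM62 cannot express this position dependence.&lt;br /&gt;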
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
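The query coverage values in the table can be checked against the alignment coordinates: coverage is the number of query positions spanned by the alignment divided by the query length. The query length is not stated in this answer sheet, so the sketch below assumes 461 residues, a value chosen only because it is consistent with the reported percentages.&lt;br /&gt;

```python
# ASSUMPTION: the query length is not given in the answers; 461 is a
# guess that is consistent with the reported coverage percentages.
QUERY_LENGTH = 461

# Percentage of the query spanned by an alignment (inclusive coordinates)
def query_coverage(start, end, query_length=QUERY_LENGTH):
    aligned_positions = end - start + 1
    return 100.0 * aligned_positions / query_length

# First and last query positions taken from the alignments above
hits = {"4A8E_A": (242, 453), "5HXY_A": (174, 454)}
for name, (start, end) in hits.items():
    print(name, round(query_coverage(start, end)))
```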
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=720</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=720"/>
		<updated>2025-11-06T09:23:32Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, the hits are not human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria and no match at all in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant. The E-values lie between 1 and 10, meaning that one to ten hits with the same or a better score are expected by chance alone. That the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration uses a generic BLOSUM62 substitution matrix. The significant hits from that search are combined into a multiple alignment, from which a position-specific scoring matrix (PSSM) is built. The second iteration searches with this more sensitive PSSM, which is why more sequences are found. A PSSM captures evolutionary information for each position, e.g. conserved regions and active sites versus regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases. The resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family uses a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp each catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query sequence did the BLAST search cover)?&lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (see the NCBI BLAST graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=719</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=719"/>
		<updated>2025-11-06T09:23:26Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|2500px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised, and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria and no match at all in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equal or better score are expected by chance alone. The fact that the sequences are from Bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
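&lt;br /&gt;
To make the E-value interpretation concrete: an E-value is the expected number of chance hits scoring at least as well. The minimal sketch below assumes that the number of chance hits is Poisson-distributed (the usual model behind BLAST statistics) and converts an E-value into the probability of seeing at least one such hit. The function name is hypothetical and the E-values are only illustrative thresholds.&lt;br /&gt;

```python
import math


def prob_at_least_one_chance_hit(e_value):
    """Probability of one or more chance hits, assuming Poisson-distributed counts."""
    return 1.0 - math.exp(-e_value)


# Illustrative E-values: the significance threshold and the weak-hit range
for e_value in (0.005, 1.0, 10.0):
    p = prob_at_least_one_chance_hit(e_value)
    print(f"E = {e_value}: P(at least one chance hit) = {p:.4f}")
```

For very small E-values the probability is almost equal to the E-value itself, while for E-values between 1 and 10 a chance hit is more likely than not.&lt;br /&gt;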
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far smaller than 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures position-specific evolutionary information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
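&lt;br /&gt;
The idea of a position-specific matrix can be illustrated with a small sketch. This is not the PSI-BLAST implementation: it assumes uniform background frequencies and a flat pseudocount, whereas PSI-BLAST also weights sequences and uses more elaborate pseudocounts. The toy alignment and the names build_pssm, AMINO_ACIDS and BACKGROUND are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are ignored
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount * BACKGROUND) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


# Hypothetical toy alignment (not taken from the exercise)
toy_alignment = ["GLRVG", "GLRAS", "GVRPG", "GMRTE"]
pssm = build_pssm(toy_alignment)
print("R in the conserved column:", round(pssm[2]["R"], 2))
print("V in the variable column:", round(pssm[3]["V"], 2))
print("R in the variable column:", round(pssm[3]["R"], 2))
```

A residue in a fully conserved column scores much higher than any residue in a variable column, which is what lets the second iteration reward matches at conserved positions more strongly than a generic matrix can.&lt;br /&gt;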
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
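The identity percentage BLAST reports is the number of identical columns divided by the alignment length, gaps included. As a minimal sketch, the hypothetical helper below counts these for the last block of the 4A8E_A alignment above (query 422-453). Because it covers a single block, its percentage need not equal the value reported for the whole alignment.&lt;br /&gt;

```python
def alignment_stats(query_row, subject_row):
    """Count identical columns and gap columns in one pairwise alignment block."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    columns = len(query_row)
    identical = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    gaps = sum(1 for q, s in zip(query_row, subject_row) if "-" in (q, s))
    return columns, identical, gaps


# Last block of the 4A8E_A alignment (query 422-453, subject 251-278)
query_row = "GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL"
subject_row = "GHSNLSTTQI----YTKVSTKHLKEAVKKAKL"

columns, identical, gaps = alignment_stats(query_row, subject_row)
print(f"columns={columns} identical={identical} gaps={gaps}")
print(f"identity in this block: {100.0 * identical / columns:.1f}%")
```

Summing the counts over all blocks of one hit and dividing once at the end gives the identity for the full alignment.&lt;br /&gt;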
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases. The resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family uses a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp each catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query sequence did the BLAST search cover)?&lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (see the NCBI BLAST graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=718</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=718"/>
		<updated>2025-11-06T09:23:14Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised, and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria and no match at all in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein&#039;s hit to itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equal or better score are expected by chance alone. The fact that the sequences are from Bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far smaller than 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures position-specific evolutionary information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases. The resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family uses a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approx. 30 bp each catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=717</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=717"/>
		<updated>2025-11-06T09:22:51Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center|frame|Partial screenshot of the first PSI-BLAST search, showing no significant hits other than C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterized gene and not many good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, but these are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp. Its sequence identity is 33.33% and its query coverage is 48%. Apart from the query itself, the hits are not human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the hit of the orphan protein to itself, none of the hits are significant. The E-values lie between 1 and 10, meaning that 1 to 10 hits with a score at least this good are expected by chance alone in a search of this database. The fact that the sequences are bacterial does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
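&lt;br /&gt;
The link between an E-value and the chance of seeing a random hit can be sketched in a few lines of Python. This assumes the standard BLAST statistics, in which the number of chance hits follows a Poisson distribution, so that P = 1 - exp(-E); the function name is hypothetical and chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def prob_at_least_one_chance_hit(e_value):
    """Probability of one or more chance hits scoring at least this well.

    Assumes the number of chance hits is Poisson distributed with mean E,
    so that P = 1 - exp(-E).
    """
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # 0.005 is the significance threshold used in the exercise;
    # 1 and 10 bracket the E-values of the non-significant hits.
    for e in (0.005, 1.0, 10.0):
        print(f"E = {e}: P = {prob_at_least_one_chance_hit(e):.4f}")
```

For very small E-values, P is practically equal to E, whereas for E = 1 the probability of at least one chance hit is already about 0.63.&lt;br /&gt;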
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST only shows 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information about each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
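&lt;br /&gt;
How a PSSM rewards conserved positions can be illustrated with a minimal Python sketch. It is not the procedure PSI-BLAST actually runs (PSI-BLAST additionally uses sequence weighting, data-dependent pseudocounts and non-uniform background frequencies); the toy alignment, the uniform background and the function name build_pssm are assumptions made for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    Simplifications (assumptions of this sketch): every sequence has the
    same weight, the background is uniform, and a flat pseudocount is
    added to every amino acid in every column.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gaps and unknown letters are ignored.
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(scores)
    return pssm


# Toy alignment (hypothetical sequences, not taken from a real search).
toy_alignment = ["GLRVGEL", "GLRVSEL", "GVRVGEL", "GLRPGEL"]
pssm = build_pssm(toy_alignment)

if __name__ == "__main__":
    # Column 3 is fully conserved (R); column 4 is mostly V with one P.
    print(f"R in column 3: {pssm[2]['R']:.2f}")
    print(f"V in column 4: {pssm[3]['V']:.2f}")
    print(f"P in column 4: {pssm[3]['P']:.2f}")
    print(f"A in column 4: {pssm[3]['A']:.2f}")
```

A fully conserved column gives its residue a high positive score and every other residue a negative one, whereas a generic matrix such as BLOSUM62 scores a given substitution identically at every position.&lt;br /&gt;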
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx. 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=716</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=716"/>
		<updated>2025-11-06T09:22:31Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center|frame| Partial screenshot of the first PSI-BLAST search, showing no significant hits other than C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterized gene and not many good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, but these are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp. Its sequence identity is 33.33% and its query coverage is 48%. Apart from the query itself, the hits are not human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the hit of the orphan protein to itself, none of the hits are significant. The E-values lie between 1 and 10, meaning that 1 to 10 hits with a score at least this good are expected by chance alone in a search of this database. The fact that the sequences are bacterial does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST only shows 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information about each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx. 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=715</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=715"/>
		<updated>2025-11-06T09:21:59Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center|Partial screenshot of the first PSI-BLAST search with no significant hits but C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The hit with the highest identity is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp. Sequence identity is 33.33% and query coverage is 48%. Apart from the self-hit, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the hit of the orphan protein to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that one to ten hits scoring at least this well are expected by chance alone. The fact that the sequences are from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
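&lt;br /&gt;
The meaning of an E-value can be made concrete with a small sketch. It assumes the standard bit-score relation, in which the expected number of chance hits equals the query length times the database length times 2 to the power of minus the bit score. The function name and all numbers below are hypothetical and chosen only to illustrate the trend; they are not taken from the search in this exercise.&lt;br /&gt;

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    # Simplified search space: no edge-effect correction is applied here.
    search_space = query_len * db_len
    return search_space * 2.0 ** (-bit_score)


# Hypothetical query and database sizes, used only to show the trend.
for bit_score in (30, 40, 50):
    e_value = expected_chance_hits(bit_score, query_len=500, db_len=10**9)
    print(bit_score, f"{e_value:.3g}")
```

Under this relation, every additional 10 bits of score lowers the expected number of chance hits by a factor of 1024, which is why hits with E-values between 1 and 10 carry essentially no evidence of homology.&lt;br /&gt;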
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far smaller than 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
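&lt;br /&gt;
The idea behind a PSSM can be sketched in a few lines. This is a simplified illustration, not the procedure PSI-BLAST itself uses: it assumes a uniform background frequency and a fixed pseudocount, and the toy alignment and function names are hypothetical.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gaps and unknowns.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


# Hypothetical toy alignment; real input would be the first-iteration hits.
toy_alignment = ["LRHSF", "LRHTF", "LRRHF", "MRHSF"]
pssm = build_pssm(toy_alignment)
print(round(score_segment(pssm, "LRHSF"), 2))
print(round(score_segment(pssm, "AAAAA"), 2))
```

Unlike BLOSUM62, which gives a residue pair the same score everywhere, the scores here differ by column: the fully conserved second column rewards R strongly, while the variable fourth column penalises mismatches only mildly.&lt;br /&gt;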
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&lt;sup&gt;-19&lt;/sup&gt; and 5HXY_A with an E-value of 8&amp;times;10&lt;sup&gt;-19&lt;/sup&gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx. 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=714</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=714"/>
		<updated>2025-11-06T09:21:39Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250px|center|frame|Partial screenshot of the first PSI-BLAST search with no significant hits but C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterised and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The hit with the highest identity is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp. Sequence identity is 33.33% and query coverage is 48%. Apart from the self-hit, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the hit of the orphan protein to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that one to ten hits scoring at least this well are expected by chance alone. The fact that the sequences are from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far smaller than 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used in the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&lt;sup&gt;-19&lt;/sup&gt; and 5HXY_A with an E-value of 8&amp;times;10&lt;sup&gt;-19&lt;/sup&gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx. 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=713</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=713"/>
		<updated>2025-11-06T09:21:21Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|250dpi|center|frame|Partial screenshot of the first PSI-BLAST search with no significant hits but C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterized gene, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, and these 4 are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein matching itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone. That the hits come from bacteria does not make the homology hypothesis promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
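&lt;br /&gt;
To make the E-value interpretation above concrete, here is a minimal Python sketch. It assumes the standard Poisson model behind BLAST statistics, in which an E-value E corresponds to a probability 1 - exp(-E) of seeing at least one chance hit that scores this well. The function name is hypothetical and only for illustration.&lt;br /&gt;

```python
import math


def prob_at_least_one_chance_hit(e_value):
    """Probability of one or more chance hits, assuming a Poisson model."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


# E-values taken from the answers on this page: the significance
# threshold (0.005), the range seen in Question 4 (1 to 10), and the
# best hit of Question 13 (6.9).
for e_value in (0.005, 1.0, 6.9, 10.0):
    p = prob_at_least_one_chance_hit(e_value)
    print(f"E = {e_value:g}  P(at least one chance hit) = {p:.4f}")
```

For small E-values the probability is nearly equal to E itself, whereas E-values between 1 and 10 correspond to probabilities above 0.63, which is why such hits cannot be told apart from noise.&lt;br /&gt;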
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits it found were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. Searching with this PSSM is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
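&lt;br /&gt;
The idea of turning aligned hits into position-specific scores can be sketched in a few lines of Python. This is a simplified illustration, not the PSI-BLAST algorithm: it assumes a uniform background frequency and add-one pseudocounts, whereas PSI-BLAST also weights sequences and derives its pseudocounts from substitution-matrix data. The function name is hypothetical; the three 7-residue fragments are taken from the query, 4A8E_A and 5HXY_A alignments on this page.&lt;br /&gt;

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive score: residue seen more often than background.
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Gap-free stretch around the conserved arginine in the alignments.
fragments = ["GLRPGEL", "GLRVSEL", "GVRVGEL"]
pssm = build_pssm(fragments)

for position, scores in enumerate(pssm, start=1):
    best = max(scores, key=scores.get)
    print(position, best, round(scores[best], 2))
```

Fully conserved columns, such as the arginine in the third position, receive a higher score for the conserved residue than variable columns give to any single residue, which is how a PSSM rewards matches at conserved positions more than a generic matrix does.&lt;br /&gt;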
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&lt;sup&gt;-19&lt;/sup&gt; and 5HXY_A with an E-value of 8&amp;times;10&lt;sup&gt;-19&lt;/sup&gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=712</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=712"/>
		<updated>2025-11-06T09:20:44Z</updated>

		<summary type="html">&lt;p&gt;Carol: /* Trying another approach */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
[[File:question2_answer.png|center|frame|Partial screenshot of the first PSI-BLAST search with no significant hits but C22orf45 against itself]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This is a poorly characterized gene, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, and these 4 are not significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein matching itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone. That the hits come from bacteria does not make the homology hypothesis promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (there are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits it found were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. Searching with this PSSM is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).&lt;br /&gt;
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;times;10&lt;sup&gt;-19&lt;/sup&gt; and 5HXY_A with an E-value of 8&amp;times;10&lt;sup&gt;-19&lt;/sup&gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Question2_answer.png&amp;diff=711</id>
		<title>File:Question2 answer.png</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=File:Question2_answer.png&amp;diff=711"/>
		<updated>2025-11-06T09:19:04Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=710</id>
		<title>Exercise PSI-BLAST ans</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22111/index.php?title=Exercise_PSI-BLAST_ans&amp;diff=710"/>
		<updated>2025-11-06T09:18:51Z</updated>

		<summary type="html">&lt;p&gt;Carol: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NEW answers are being updated!&lt;br /&gt;
&lt;br /&gt;
== When BLAST fails ==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 1&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)?&lt;br /&gt;
Answer: No sequences with E-value below 0.005.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Trying another approach==&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 2&#039;&#039;&#039;: How many hits do you obtain (E-value &amp;lt; 10)? (&#039;&#039;&#039;Tip:&#039;&#039;&#039; you can see the number by selecting all hits (clicking &amp;lt;u&amp;gt;All&amp;lt;/u&amp;gt; under &amp;lt;u&amp;gt;Sequences producing significant alignments with E-value BETTER than threshold&amp;lt;/u&amp;gt;) and then looking at the number of selected hits)&lt;br /&gt;
Answer: This gene is poorly characterized, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 3&#039;&#039;&#039;: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?&lt;br /&gt;
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from &#039;&#039;Thermoactinomyces&#039;&#039; sp., with a sequence identity of 33.33% and a query coverage of 48%. Apart from the self-match, none of the hits are human. &#039;&#039;Thermoactinomyces&#039;&#039; is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before finding any match in vertebrates.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 4&#039;&#039;&#039;: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?&lt;br /&gt;
Answer: Apart from the orphan protein matching itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that between one and ten hits with an equally good score are expected by chance alone. The fact that the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue searching to see what we get.&lt;br /&gt;
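&lt;br /&gt;
The meaning of these E-values can be checked numerically. The short Python sketch below is an illustrative addition: it assumes the standard BLAST relation P = 1 - exp(-E) between an E-value and the probability of seeing at least one chance hit with an equally good score, and the function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import math


def evalue_to_pvalue(evalue):
    """Probability of at least one chance hit this good: P = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)


# E-values discussed in the exercise: the upper and lower end of the weak
# hits (10 and 1) and the significance threshold (0.005).
for e in (10.0, 1.0, 0.005):
    print(f"E = {e:g} gives P = {evalue_to_pvalue(e):.4f}")
```
&lt;br /&gt;
For E = 1 the probability is about 0.63, so a hit of that quality is more likely than not to appear by chance, whereas at the 0.005 threshold P and E are nearly equal.&lt;br /&gt;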
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Constructing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 5&#039;&#039;&#039;: How many significant hits does BLAST find (E-value &amp;lt; 0.005)? &lt;br /&gt;
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005)&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 6&#039;&#039;&#039;: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?&lt;br /&gt;
Answer: Approximately 50-60% query coverage, except for one hit (#2) with 84%.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 7&#039;&#039;&#039;: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on! &lt;br /&gt;
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, i.e. conserved regions, active sites and regions under less evolutionary pressure (many different amino acids at a given position).&lt;br /&gt;
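&lt;br /&gt;
How a PSSM captures conservation can be illustrated with a minimal Python sketch. It is a simplified illustration, not the PSI-BLAST implementation: the toy alignment is invented, background frequencies are assumed to be uniform, and a simple add-one pseudocount stands in for the sequence weighting and pseudocount scheme that PSI-BLAST uses. All names are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
import math
from collections import Counter

# Hypothetical toy alignment, invented for illustration only.
ALIGNMENT = ["GLRPGEL", "GLRVSEL", "GVRVGEL", "GLRAGDL"]
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                    # assumption: add-one smoothing


def build_pssm(alignment):
    """Return one dict per alignment column mapping residue to a log2-odds score."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


pssm = build_pssm(ALIGNMENT)
# Column 3 is fully conserved (R); column 4 is variable (P, V, V, A).
print(f"R in conserved column: {pssm[2]['R']:.2f}")
print(f"V in variable column:  {pssm[3]['V']:.2f}")
print(f"GLRVGEL scores {score(pssm, 'GLRVGEL'):.2f}")
print(f"GLKVGEL scores {score(pssm, 'GLKVGEL'):.2f}")
```
&lt;br /&gt;
Replacing the conserved arginine with lysine lowers the score far more than BLOSUM62 would penalize an R to K substitution, because the PSSM has learned that this particular position does not tolerate change. This position-specific weighting is what lets the second iteration find more distant homologues.&lt;br /&gt;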
&lt;br /&gt;
=== Saving and reusing the PSSM ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 8&#039;&#039;&#039;: Do you find any significant PDB hits now? If yes, how many?&lt;br /&gt;
Answer: Yes, 13&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 9&#039;&#039;&#039;: What are the PDB identifiers and the E-values for the two best PDB hits?&lt;br /&gt;
Answer: 4A8E_A with an E-value of 2&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt; and 5HXY_A with an E-value of 8&amp;amp;times;10&amp;lt;sup&amp;gt;-19&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 10&#039;&#039;&#039;: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039;&lt;br /&gt;
 ID      cov   ident  sim/pos &lt;br /&gt;
 4A8E_A  46%   21%    39%&lt;br /&gt;
 5HXY_A  61%   18%    31%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
  &lt;br /&gt;
 Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301&lt;br /&gt;
             +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ &lt;br /&gt;
 Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149&lt;br /&gt;
 &lt;br /&gt;
 Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361&lt;br /&gt;
             +  +I + +  +  +      S      +++ YL +R +        + +          &lt;br /&gt;
 Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197&lt;br /&gt;
 &lt;br /&gt;
 Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421&lt;br /&gt;
                K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L &lt;br /&gt;
 Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250&lt;br /&gt;
 &lt;br /&gt;
 Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453&lt;br /&gt;
             G    +  +I    YT  + + L++   +A L&lt;br /&gt;
 Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 Length=317&lt;br /&gt;
 &lt;br /&gt;
 Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233&lt;br /&gt;
             SRYT      L+  ++ F   K       +   Y+                         &lt;br /&gt;
 Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115&lt;br /&gt;
 &lt;br /&gt;
 Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292&lt;br /&gt;
                D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL &lt;br /&gt;
 Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175&lt;br /&gt;
 &lt;br /&gt;
 Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352&lt;br /&gt;
             N+ I ++DL+  II + +  +  +      + +  + L   YL  R              &lt;br /&gt;
 Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219&lt;br /&gt;
 &lt;br /&gt;
 Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411&lt;br /&gt;
                + + + D      +   +    + R I +   +A   K+   + LR  FAT +    &lt;br /&gt;
 Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277&lt;br /&gt;
 &lt;br /&gt;
 Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
                  I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
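Identity and similarity (Positives) can be counted directly from the middle line of a BLAST alignment, where a letter marks an identical residue and a plus sign marks a conservative substitution. The Python sketch below does this for the first block of the 4A8E_A alignment above. The function name is illustrative, and the percentages cover only this 60-column block, not the whole alignment summarized in the table.&lt;br /&gt;
&lt;br /&gt;
```python
def alignment_stats(query_line, midline):
    """Count identities and positives in one block of a BLAST alignment."""
    length = len(query_line)  # alignment columns, gaps included
    identities = sum(ch.isalpha() for ch in midline)
    # BLAST counts identical residues as positives too.
    positives = identities + midline.count("+")
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "identity_pct": 100.0 * identities / length,
        "positives_pct": 100.0 * positives / length,
    }


# First block of the 4A8E_A alignment (query residues 242-301).
QUERY = "DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL"
MIDLINE = "+  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ "

stats = alignment_stats(QUERY, MIDLINE)
print(stats)
```
&lt;br /&gt;
For this block the sketch reports 16 identities and 31 positives in 60 columns. Summing the counts over all blocks of an alignment and dividing by its total length gives the values BLAST reports in the hit header.&lt;br /&gt;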
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 11&#039;&#039;&#039;: What is the function of these proteins?&lt;br /&gt;
Answer: They are recombinases.&lt;br /&gt;
&lt;br /&gt;
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.&lt;br /&gt;
&lt;br /&gt;
=== One more round ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 12&#039;&#039;&#039;: Answer questions 8-10 again for the new search.&lt;br /&gt;
&#039;&#039;&#039;Answer:&#039;&#039;&#039; There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A, although 5HXY_A now ranks first.&lt;br /&gt;
 ID      E      cov   ident  sim/pos &lt;br /&gt;
 5HXY_A  5e-34  63%   18%    32%&lt;br /&gt;
 4A8E_A  1e-30  65%   17%    33%&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Alignments:&#039;&#039;&#039;&lt;br /&gt;
 &amp;gt;5HXY_A Chain A, Crystal Structure Of Xera Recombinase&lt;br /&gt;
 &lt;br /&gt;
 Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222&lt;br /&gt;
                E  +    SRYT      L+  ++ F   K       +   Y+              &lt;br /&gt;
 Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104&lt;br /&gt;
 &lt;br /&gt;
 Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281&lt;br /&gt;
                           D ++  +   PK      V +  +E K + +         A   +L&lt;br /&gt;
 Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164&lt;br /&gt;
 &lt;br /&gt;
 Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341&lt;br /&gt;
             A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   &lt;br /&gt;
 Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219&lt;br /&gt;
 &lt;br /&gt;
 Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400&lt;br /&gt;
                           + + + D      +   +    + R I +   +A   K+   + LR&lt;br /&gt;
 Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266&lt;br /&gt;
 &lt;br /&gt;
 Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454&lt;br /&gt;
               FAT +         I  + G       +I    YT      LR+ Y +    &lt;br /&gt;
 Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316&lt;br /&gt;
&lt;br /&gt;
 &amp;gt;4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea&lt;br /&gt;
 &lt;br /&gt;
 Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212&lt;br /&gt;
                +    I     Y  L   SR T       I  +      +  S    + + +     &lt;br /&gt;
 Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60&lt;br /&gt;
 &lt;br /&gt;
 Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272&lt;br /&gt;
                 S  +  L  +  +            +  KTPK+       +  EE++ +    E +&lt;br /&gt;
 Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120&lt;br /&gt;
 &lt;br /&gt;
 Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332&lt;br /&gt;
               +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++&lt;br /&gt;
 Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179&lt;br /&gt;
 &lt;br /&gt;
 Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392&lt;br /&gt;
              YL +R +        + +             K K KL P     L +K      R  G &lt;br /&gt;
 Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221&lt;br /&gt;
 &lt;br /&gt;
 Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452&lt;br /&gt;
                 + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A &lt;br /&gt;
 Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277&lt;br /&gt;
 &lt;br /&gt;
 Query  453  L  453&lt;br /&gt;
             L&lt;br /&gt;
 Sbjct  278  L  278&lt;br /&gt;
&lt;br /&gt;
== Finding a remote homolog (on your own) ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 13&#039;&#039;&#039;: Do you find any significant (E&amp;lt;0.005) hits? What is the E-value of the best hit? &lt;br /&gt;
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 14&#039;&#039;&#039;: How many significant (E&amp;lt;0.005) hits do you find now? What is the E-value of the best hit?&lt;br /&gt;
Answer: There are 2 significant hits: &lt;br /&gt;
* &amp;quot;GPI transamidase component Gaa1&amp;quot; from &#039;&#039;Trypanosoma melophagium&#039;&#039; with an E-value of 1e-05&lt;br /&gt;
* &amp;quot;putative GPI transamidase component GAA1&amp;quot; from &#039;&#039;Trypanosoma theileri&#039;&#039; with an E-value of 8e-04 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
== Identifying conserved residues ==&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 15&#039;&#039;&#039;: Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)? &lt;br /&gt;
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:&lt;br /&gt;
[[File:Blast_QUERY1.png]]&lt;br /&gt;
&lt;br /&gt;
In this picture, you can clearly see that the vast majority of hits cover only the right half of the query sequence.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 16&#039;&#039;&#039;: Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?&lt;br /&gt;
Answer: R287, E290, R400, Y436&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Homology modelling ===&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 17&#039;&#039;&#039;: Does CPHmodels agree that the hit identified by PSI-BLAST is significant?&lt;br /&gt;
Answer: Yes - CPHmodels comes up with a Z-score of 31.75&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;QUESTION 18&#039;&#039;&#039;: Could the residues form an active site?&lt;br /&gt;
Answer: Yes - the four residues are close in space.&lt;br /&gt;
[[File:active_site.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]&lt;br /&gt;
&lt;br /&gt;
--&amp;gt;&lt;/div&gt;</summary>
		<author><name>Carol</name></author>
	</entry>
</feed>