Exercise PSI-BLAST ans: Difference between revisions

From 22111
(Created page with "NEW answers are being updated! == When BLAST fails == * '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)? Answer: No sequences with E-value below 0.005. ==Trying another approach== * '''QUESTION 2''': How many significant hits does BLAST find (E-value < 0.005)? Answer: After the first iteration, 494 hits are found. * '''QUESTION 3''': How large a fraction of the query sequence do the significant hits match (excluding the identical match)...")
 
No edit summary
 
(22 intermediate revisions by the same user not shown)
Line 1: Line 1:
NEW answers are being updated!
== When BLAST fails ==
* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?
* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.  
Answer: No sequences with E-value below 0.005.


==Trying another approach==
[[File:question2_answer.png|800px|center]]
* '''QUESTION 2''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: After the first iteration, 494 hits are found.


* '''QUESTION 3''': How large a fraction of the query sequence do the significant hits match (excluding the identical match)?  
* '''QUESTION 2''': How many hits do you obtain (E-value < 10)? ('''Tip:''' you can see the number by selecting all hits (clicking <u>All</u> under <u>Sequences producing significant alignments with E-value BETTER than threshold</u>) and then looking at the number of selected hits)
Answer: For most hits between 45 and 55%. One hit (#2) is 84%. A few hits are lower, down to 11%.
Answer: This gene is poorly characterised, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.
* '''QUESTION 4''': Do you find any PDB hits among the significant hits?
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.


=== Constructing the PSSM ===
* '''QUESTION 3''': Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?
* '''QUESTION 5''': How many significant hits does BLAST find (E-value < 0.005)?  
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from ''Thermoactinomyces'' sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. ''Thermoactinomyces'' is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.
Answer: 500 (actually, much more than 500 hits, but BLAST by default only shows 500 — note that the last hit has an E-value much much smaller than 0.005)


* '''QUESTION 6''': How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
* '''QUESTION 4''': Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.
Answer: Apart from the orphan protein's hit to itself, none of the hits are significant (E-values lie between 1 and 10, meaning that one to ten hits with a score at least this good are expected by chance alone in a database of this size). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.


* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
* '''QUESTION 5''': After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?
Answer: During the first iteration a generic Blosum62 substitution matrix was used, and hits found there were made into a multiple alignment and next a more sensitive position-specific-substitution-matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary sequence information i.e. conserved regions, active sites and regions with less evolutionary pressure (many different amino acids at a certain position).


=== Saving and reusing the PSSM ===
[[File:results_PSI-BLAST_iteration2.png|800px|center]]
* '''QUESTION 8''': Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 13


* '''QUESTION 9''': What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 500. There are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005. The hits found before now have much lower E-values and look significant. This is because those sequences were incorporated into the PSSM and therefore shape the search.
Answer: 4A8E_A with an E-value of 2&times;10<sup>-19</sup>, 5HXY_A with an E-value of 8&times;10<sup>-19</sup>,


* '''QUESTION 10''': What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?  
* '''QUESTION 6''': Explore the Graphic Summary tab. What can you say about the query coverage of the matches?


'''Answer:'''
[[File:graphicSummary_PB2.png|800px|center]]
ID      cov  ident  sim/pos
4A8E_A  46%  21%    39%
5HXY_A  61%  18%    31%


'''Alignments:'''
Answer: Most hits have a query coverage of around 45-50%. However, the hits seem to fall on two separate regions of the protein, as if our orphan protein were a mix of two different proteins, each of which seems to be abundant in many genera of bacteria.
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
 
Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301
            +  KTPK+      +  EE++ +    E +  +  +LL  +GLR  EL N+ +E+++
Sbjct  90  EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149
Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361
            +  +I + +  +  +      S      +++ YL +R +        + +         
Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197
Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421
                K K KL P    L +K      R  G    + LR  FAT+M  + +    I  L
Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250
Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453
            G    +  +I    YT  + + L++  +A L
Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278


>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.
Length=317
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were made into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).
Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233
            SRYT      L+  ++ F  K      +  Y+                       
Sbjct  56  SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115
Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292
                D ++  +  PK      V +  +E K + +        A  +LA +G+R GEL
Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175
Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352
            N+ I ++DL+  II + +  +  +      + +  + L  YL  R             
Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219
Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411
                + + + D      +  +    + R I +  +A   K+  + LR  FAT +   
Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277
Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
                  I  + G      +I    YT      LR+ Y +   
Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316


* '''QUESTION 8''': Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.


[[File:results_PSI-BLAST_iteration3.png|800px|center]]


* '''QUESTION 11''': What is the function of these proteins?
[[File:graphicSummary_PB3.png|800px|center]]
Answer: They are recombinases.


There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.
Answer: The E-values are lower this time and the query coverage has increased to around 74%, but the coverage is skewed towards only one part of the previous matches. This suggests that one of the two types of matches has dominated the construction of PSSM-3.


=== One more round ===
* '''QUESTION 9''': Are there any homologous sequences found in search 2 that have an annotated function?
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
Answer: In the previous search (PSI-BLAST run 2) the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, together with some hypothetical proteins of unknown function.
'''Answer:''' There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
ID      E      cov  ident  sim/pos
5HXY_A  5e-34  63%  18%    32%
4A8E_A  1e-30  65%  17%    33%


'''Alignments:'''
* '''QUESTION 10''': Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Answer: In the new search (PSI-BLAST run 3) the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation already present in search 2. This agrees with the coverage shown in the Graphic Summary: the matches to only one region (the C-terminal part of the protein) seem to have dominated PSSM-3.
Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
                E  +    SRYT      L+  ++ F  K      +  Y+             
Sbjct  45  RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104
Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
                          D ++  +  PK      V +  +E K + +        A  +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164
Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L  YL  R 
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219
Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
                          + + + D      +  +    + R I +  +A  K+  + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266
Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
              FAT +        I  + G      +I    YT      LR+ Y +   
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316


>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
* '''QUESTION 11''': Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?
Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
                +    I    Y  L  SR T      I  +      +  S    + + +   
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60
Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
                S  +  L  +  +            +  KTPK+      +  EE++ +    E +
Sbjct  61  RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120
Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
              +  +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179
Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
              YL +R +        + +            K K KL P    L +K      R  G
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221
Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
                + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++  +A
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277
Query  453  L  453
            L
Sbjct  278  L  278


== Finding a remote homolog (on your own) ==
[[File:PSSM-2_on_PDB.png|800px|center]]
* '''QUESTION 13''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.  


* '''QUESTION 14''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
[[File:graphicSummary_PSSM_onPDB.png|800px|center]]
Answer: There are 2 significant hits:
* "GPI transamidase component Gaa1" from ''Trypanosoma melophagium'' with an E-value of 1e-05
* "putative GPI transamidase component GAA1" from ''Trypanosoma theileri'' with an E-value of 8e-04


Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, so in effect we amplified the signal from it.
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.


<!--
* '''QUESTION 12''': What is the function of these proteins?
== Identifying conserved residues ==
The first two hit proteins on PDB (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA with the Helix-turn-Helix motif (HTH) to repress the transcription of genes, but that part of the protein is outside of our alignment so it is not a function that would be present in our orphan protein.
* '''QUESTION 15''': Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?  
UniProt provides further information about this protein (Q9L132): it binds DNA, and this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, the three cysteines are also present in our protein of interest, so it might have some regulatory function, but at this point experimental assays would be needed to test this hypothesis.
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:
[[File:Blast_QUERY1.png]]


In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.
* '''QUESTION 13''': However, can you see any potential risks in doing so? Can we trust the results?
Answer: The orphan protein used in this example is a real-world case, and unfortunately its function is unknown. There are still many genes whose function we do not know, and some are involved in diseases, so it is important to find ways to propose a potential function for them.
When we use PSI-BLAST we select some sequences to build a position-specific scoring matrix (PSSM). Compared with a single sequence, a matrix has the advantage of learning a wider range of preferences for each position, which is why we find more remote homologs and with lower (more significant) E-values.
A note of caution, however: make sure that the sequences you include in your PSSM do not pollute the initial signal, so preferably include only a few hits, or those with lower E-values.


* '''QUESTION 16''': Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
== Finding a remote homolog (on your own) ==
Answer: R287, E290, R400, Y436
* '''QUESTION 14''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?  
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.


 
* '''QUESTION 15''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
=== Homology modelling ===
Answer: There are 2 significant hits:
* '''QUESTION 17''': Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?
* "GPI transamidase component Gaa1" from ''Trypanosoma melophagium'' with an E-value of 1e-05
Answer: Yes - CPHmodels comes up with a Z-score of 31.75
* "putative GPI transamidase component GAA1" from ''Trypanosoma theileri'' with an E-value of 8e-04
 
* '''QUESTION 18''': Could the residues form an active site?
Answer: Yes - the four residues are close in space.
[[File:active_site.png]]
 
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]
 
-->

Latest revision as of 13:26, 6 November 2025

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Answer: No sequences with E-value below 0.005.

  • QUESTION 2: How many hits do you obtain (E-value < 10)? (Tip: you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)

Answer: This gene is poorly characterised, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.

  • QUESTION 3: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?

Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from Thermoactinomyces sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. Thermoactinomyces is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.

  • QUESTION 4: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?

Answer: Apart from the orphan protein's hit to itself, none of the hits are significant (E-values lie between 1 and 10, meaning that one to ten hits with a score at least this good are expected by chance alone in a database of this size). The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.
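To make the meaning of the E-value concrete, the sketch below uses the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The query length, database size and bit scores are hypothetical values chosen for illustration, and BLAST itself uses edge-corrected "effective" lengths, so real E-values will differ somewhat.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """E-value for a bit score S': E = m * n * 2**(-S').

    Uses raw lengths; BLAST applies effective (edge-corrected) lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def prob_at_least_one_chance_hit(e_value):
    """Chance hits are roughly Poisson distributed, so P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Hypothetical values, not taken from the exercise.
    query_length = 450
    database_length = 1_000_000_000
    for bit_score in (35.0, 40.0, 60.0):
        e = expected_chance_hits(bit_score, query_length, database_length)
        p = prob_at_least_one_chance_hit(e)
        print(f"bit score {bit_score:5.1f}: E = {e:.3g}, P(>=1 chance hit) = {p:.3g}")
```

An E-value between 1 and 10 therefore corresponds to a score that random sequences are expected to reach at least once in the search, which is why such hits cannot be taken as evidence of homology on their own.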

  • QUESTION 5: After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?

Answer: 500. There are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005. The hits found before now have much lower E-values and look significant. This is because those sequences were incorporated into the PSSM and therefore shape the search.

  • QUESTION 6: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?

Answer: Most hits have a query coverage of around 45-50%. However, the hits seem to fall on two separate regions of the protein, as if our orphan protein were a mix of two different proteins, each of which seems to be abundant in many genera of bacteria.
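Query coverage and the "two regions" observation can also be checked numerically. The following is a minimal sketch, assuming each hit is described by one or more (query_start, query_end) alignment ranges in 1-based inclusive coordinates; the function names and the example coordinates are hypothetical and not taken from the BLAST output.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) ranges."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def query_coverage(intervals, query_length):
    """Fraction of query positions covered by at least one aligned range."""
    covered = sum(e - s + 1 for s, e in merge_intervals(intervals))
    return covered / query_length


def dominant_half(intervals, query_length):
    """Report which half of the query holds most of the aligned positions."""
    mid = query_length // 2
    n_term = c_term = 0
    for s, e in merge_intervals(intervals):
        n_term += max(0, min(e, mid) - s + 1)      # overlap with 1..mid
        c_term += max(0, e - max(s, mid + 1) + 1)  # overlap with mid+1..end
    if n_term == c_term:
        return "both"
    return "N-terminal" if n_term > c_term else "C-terminal"


if __name__ == "__main__":
    # Hypothetical hits on a hypothetical 450-residue query.
    hits = {"hit_1": [(10, 210)], "hit_2": [(240, 450)]}
    for name, ranges in hits.items():
        print(name, f"{query_coverage(ranges, 450):.0%}", dominant_half(ranges, 450))
```

Grouping hits by the half they cover gives a quick tabular view of the same pattern that the Graphic Summary shows visually.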

  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.

Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were made into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM can capture evolutionary sequence information, e.g. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a certain position).
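The principle can be illustrated with a toy PSSM built from a small gapless alignment. This is a deliberately simplified sketch: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and substitution-matrix-derived pseudocounts. The aligned sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column.

    Simplification: uniform background (1/20) and a flat pseudocount.
    Gap characters and unknown letters are ignored when counting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(
            seq[col] for seq in aligned_seqs if seq[col] in AMINO_ACIDS
        )
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of a segment aligned to the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    # Made-up alignment: columns 1, 3 and 6 are fully conserved.
    alignment = ["GLRPGE", "GLRVSE", "GVRVGE", "GLRASE"]
    pssm = build_pssm(alignment)
    for candidate in ("GLRVGE", "GARKTE", "WWWWWW"):
        print(candidate, round(score(pssm, candidate), 2))
```

A candidate that keeps the conserved columns scores well even if it differs at the variable ones, which a single query sequence scored with BLOSUM62 cannot express.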

  • QUESTION 8: Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.

Answer: The E-values are lower this time and the query coverage has increased to around 74%, but the coverage is skewed towards only one part of the previous matches. This suggests that one of the two types of matches has dominated the construction of PSSM-3.

  • QUESTION 9: Are there any homologous sequences found in search 2 that have an annotated function?

Answer: In the previous search (PSI-BLAST run 2) the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, together with some hypothetical proteins of unknown function.

  • QUESTION 10: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?

Answer: In the new search (PSI-BLAST run 3) the hits were mostly annotated as Rrf2 family transcriptional regulators, an annotation already present in search 2. This agrees with the coverage shown in the Graphic Summary: the matches to only one region (the C-terminal part of the protein) seem to have dominated PSSM-3.

  • QUESTION 11: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?

Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, so in effect we amplified the signal from it. We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.

  • QUESTION 12: What is the function of these proteins?

Answer: The first two PDB hits (5N07_A and 5N08_A) are called HTH-type transcriptional repressor NsrR. These proteins bind DNA through a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so this function is unlikely to be present in our orphan protein. UniProt provides further information about this protein (Q9L132): it binds DNA, and this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators. The 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, the three cysteines are also present in our protein of interest, so it might have some regulatory function, but at this point experimental assays would be needed to test this hypothesis.

  • QUESTION 13: However, can you see any potential risks in doing so? Can we trust the results?

Answer: The orphan protein used in this example is a real-world case, and unfortunately its function is unknown. There are still many genes whose function we do not know, and some are involved in diseases, so it is important to find ways to propose a potential function for them. When we use PSI-BLAST we select some sequences to build a position-specific scoring matrix (PSSM). Compared with a single sequence, a matrix has the advantage of learning a wider range of preferences for each position, which is why we find more remote homologs and with lower (more significant) E-values. A note of caution, however: make sure that the sequences you include in your PSSM do not pollute the initial signal, so preferably include only a few hits, or those with lower E-values.
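One way to limit the risk of polluting the PSSM is to make the selection rule explicit before each iteration. The sketch below assumes a hypothetical Hit record with an E-value and query coordinates; the accessions and numbers are invented, and the cut-off is a choice to be justified for each search rather than a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    e_value: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def select_for_pssm(hits, max_e_value, region=None):
    """Keep hits at or below an E-value cut-off, optionally restricted
    to those overlapping a query region given as (start, end)."""
    selected = []
    for hit in hits:
        if hit.e_value > max_e_value:
            continue
        if region is not None:
            region_start, region_end = region
            if hit.query_end < region_start or hit.query_start > region_end:
                continue
        selected.append(hit)
    return selected


if __name__ == "__main__":
    # Invented hits for illustration only.
    hits = [
        Hit("hit_A", 1e-30, 240, 450),
        Hit("hit_B", 2e-4, 10, 200),
        Hit("hit_C", 3.5, 250, 440),
    ]
    strict = select_for_pssm(hits, max_e_value=0.005)
    n_terminal_only = select_for_pssm(hits, max_e_value=0.005, region=(1, 220))
    print([h.accession for h in strict])
    print([h.accession for h in n_terminal_only])
```

Restricting the selection to one region is the programmatic counterpart of choosing only the N-terminal or only the C-terminal matches by hand in the web interface.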

Finding a remote homolog (on your own)

  • QUESTION 14: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?

Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

  • QUESTION 15: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Answer: There are 2 significant hits:

  • "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
  • "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04