Exercise PSI-BLAST ans: Difference between revisions

From 22111
No edit summary
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
NEW answers are being updated!
== When BLAST fails ==
* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.
==Trying another approach==


[[File:question2_answer.png|800px|center]]
Line 20: Line 13:
Apart from the orphan protein's hit to itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search suggests that the function of the C22orf45 orphan gene is unknown, we continue the searches to see what we get.


* '''QUESTION 5''': After iteration 2, How many significant hits (E-value < 0.005) are now found? What happened with E-value of the hits found before?


[[File:results_PSI-BLAST_iteration2.png|800px|center]]


=== Constructing the PSSM ===
* '''QUESTION 5''': After iteration 2, How many significant hits (E-value < 0.005) are now found? What happened with E-value of the hits found before?
Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005). The E-values of the hits found before are now much lower and look significant. This is because those sequences were incorporated into the PSSM and therefore shaped the search.


* '''QUESTION 6''': Explore the Graphic Summary tab. What can you say about the query coverage of the matches?
[[File:graphicSummary_PB2.png|800px|center]]
Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a fusion of two different proteins, each of which is common across many bacterial genera.


Line 33: Line 28:
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM captures evolutionary sequence information, such as conserved regions, active sites and regions under less evolutionary pressure (many different amino acids at a given position).


=== Saving and reusing the PSSM ===
* '''QUESTION 8''': Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.
Answer: The e-values are lower this time but the query cover seems to be skewed to only one part of the previous matches.
 
[[File:results_PSI-BLAST_iteration3.png|800px|center]]
 
[[File:graphicSummary_PB3.png|800px|center]]
 
Answer: The E-values are lower this time and the query cover has increased to around 74%, but the coverage is skewed towards only one of the two regions matched previously. This suggests that one of the two types of matches has dominated the construction of PSSM-3.


* '''QUESTION 9''': Are there any homologous sequences found in search 2 that have an annotated function?
In the previous search (PSI-BLAST run 2) the functions were mostly deaminase domain-containing protein and Rrf2 family transcriptional regulator and some hypothetical proteins with unknown function.
Answer: In the previous search (PSI-BLAST run 2) the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.


* '''QUESTION 10''': Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?
In the new search (PSI-BLAST run 3) the functions were mostly Rrf2 family transcriptional regulator  
Answer: In the new search (PSI-BLAST run 3) the hits were mostly Rrf2 family transcriptional regulators, which were also found in search 2. This agrees with the coverage seen in the Graphic Summary: only the matches in one region (the C-terminal part of the protein) have dominated PSSM-3.
 
* '''QUESTION 11''': Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected, Why?


[[File:PSSM-2_on_PDB.png|800px|center]]


* '''QUESTION 11''': Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected, Why?
[[File:graphicSummary_PSSM_onPDB.png|800px|center]]
* '''QUESTION 12''': What is the function of these proteins?


=== One more round ===
Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, which in effect amplified the signal from it.
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
We could have selected only the other proteins for the second iteration of PSI-BLAST to skew the results to the other (N-terminal) portion of the protein. It also makes sense that the PDB structures are coming from Bacteria.
'''Answer:''' There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
ID      E      cov  ident  sim/pos
5HXY_A  5e-34  63%  18%    32%
4A8E_A  1e-30  65%  17%    33%


'''Alignments:'''
* '''QUESTION 12''': What is the function of these proteins?
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
The first two PDB hits (5N07_A and 5N08_A) are both the HTH-type transcriptional repressor NsrR. These proteins bind DNA through a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so this function is not expected to be present in our orphan protein.
UniProt gives further information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators; the 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also found the three cysteines in our protein of interest, so it might have some regulatory function, but experimental assays would be needed to test this hypothesis.
Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
                E  +    SRYT      L+  ++ F  K      +  Y+             
Sbjct  45  RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104
Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
                          D ++  +  PK      V +  +E K + +        A  +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164
Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L  YL  R 
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219
Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
                          + + + D      +  +    + R I +  +A  K+  + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266
Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
              FAT +        I  + G      +I    YT      LR+ Y +   
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316


>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
QUESTION 13: However, can you see any potential risks on doing so? Can we believe in the results?
The Orphan protein used for the example is a real case scenario protein, unfortunately we do not know the function. There are still many genes that we do not know what they do, and some are involved in diseases so it is important to find ways to find a potential function for them.
Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
When we use PSI-BLAST we select some sequences to build a position-specific scoring matrix (PSSM). Compared with a single sequence, a matrix has the advantage of learning a wider range of residue preferences at each position, which is why we find more remote homologs and with lower (more significant) E-values.
                +    I    Y  L  SR T      I  +      +  S    + + +   
A note of caution, however: you should make sure that the sequences you include in your PSSM do not pollute the initial signal, so it is preferable to include only a few hits, ideally those with the lowest E-values.
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60
Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
                S  +  L  +  +            +  KTPK+      +  EE++ +    E +
Sbjct  61  RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120
Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
              +  +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179
Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
              YL +R +        + +            K K KL P    L +K      R  G
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221
Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
                + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++  +A
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277
Query  453  L  453
            L
Sbjct  278  L  278


== Finding a remote homolog (on your own) ==
* '''QUESTION 13''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?  
* '''QUESTION 14''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?  
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.  


* '''QUESTION 14''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
* '''QUESTION 15''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:  
* "GPI transamidase component Gaa1" from ''Trypanosoma melophagium'' with an E-value of 1e-05
* "putative GPI transamidase component GAA1" from ''Trypanosoma theileri'' with an E-value of 8e-04
 
 
<!--
== Identifying conserved residues ==
* '''QUESTION 15''': Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:
[[File:Blast_QUERY1.png]]
 
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.
 
* '''QUESTION 16''': Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
Answer: R287, E290, R400, Y436
 
 
=== Homology modelling ===
* '''QUESTION 17''': Does CPHmodels agree that the hit identified by PSI-Blast hit is significant?
Answer: Yes - CPHmodels comes up with a Z-score of 31.75
 
* '''QUESTION 18''': Could the residues form an active site?
Answer: Yes - the four residues are close in space.
[[File:active_site.png]]
 
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]
 
-->

Latest revision as of 13:26, 6 November 2025

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Answer: No sequences with E-value below 0.005.

  • QUESTION 2: How many hits do you obtain (E-value < 10)? (Tip: you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)

Answer: This gene is poorly characterized, so few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.

  • QUESTION 3: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?

Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from Thermoactinomyces sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. Thermoactinomyces is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria without any match in vertebrates.

  • QUESTION 4: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?

Answer: Apart from the orphan protein's hit to itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score are expected by chance alone in a database of this size. The fact that the sequences come from bacteria does not make the homology hypothesis very promising either. However, since a Google search suggests that the function of the C22orf45 orphan gene is unknown, we continue the searches to see what we get.
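
To make the E-value interpretation above concrete, here is a minimal Python sketch. It assumes the simplified Karlin-Altschul relation E = m * n * 2^(-S') for a bit score S', with m and n taken as the raw query and database lengths (BLAST itself uses effective lengths), and a Poisson model for chance hits. The function names are hypothetical and chosen only for illustration.

```python
import math


def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Simplified E-value: expected number of chance hits scoring >= bit_score.

    Assumption: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def prob_at_least_one(e_value: float) -> float:
    """Probability of seeing at least one chance hit, assuming Poisson statistics."""
    return 1.0 - math.exp(-e_value)


if __name__ == "__main__":
    # Thresholds discussed in the exercise: 0.005 (significant) and 1-10 (not significant).
    for e in (0.005, 1.0, 10.0):
        print(f"E = {e:g}: P(at least one chance hit) = {prob_at_least_one(e):.3f}")
```

With E between 1 and 10, a chance hit of that quality is more likely than not, which is why such hits cannot be taken as evidence of homology on their own.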

  • QUESTION 5: After iteration 2, How many significant hits (E-value < 0.005) are now found? What happened with E-value of the hits found before?

Answer: 500 (in fact there are many more than 500 hits, but BLAST shows only 500 by default; note that even the last hit has an E-value far below 0.005). The E-values of the hits found before are now much lower and look significant. This is because those sequences were incorporated into the PSSM and therefore shaped the search.

  • QUESTION 6: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?

Answer: The query coverage of most hits is around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a fusion of two different proteins, each of which is common across many bacterial genera.
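
Query coverage as reported in the Graphic Summary can be sketched as follows. This is a minimal illustration, assuming each hit is described by one or more (start, end) intervals in 1-based, inclusive query coordinates; the coordinates and the query length of 200 are made up for the example and are not taken from the exercise results.

```python
def query_coverage(hsps, query_len):
    """Fraction of query positions covered by at least one aligned segment.

    hsps: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping segments are counted only once.
    """
    covered = 0
    current_end = 0  # last query position already counted
    for start, end in sorted(hsps):
        start = max(start, current_end + 1)  # skip positions counted before
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_len


def region_of_hit(start, end, query_len):
    """Label a hit by the half of the query in which its midpoint falls."""
    midpoint = (start + end) / 2
    return "N-terminal" if midpoint <= query_len / 2 else "C-terminal"


if __name__ == "__main__":
    QUERY_LEN = 200  # hypothetical query length
    hits = {"hit_A": [(5, 95)], "hit_B": [(105, 198)]}  # hypothetical coordinates
    for name, hsps in hits.items():
        start, end = hsps[0]
        print(name, f"{query_coverage(hsps, QUERY_LEN):.0%}",
              region_of_hit(start, end, QUERY_LEN))
```

Two groups of hits that each cover roughly half of the query, but different halves, produce the pattern described in the answer.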

  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.

Answer: During the first iteration a generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, and a new, more sensitive position-specific scoring matrix (PSSM) based on the selected sequences was constructed for the second iteration. This is why more sequences are found after the second iteration. A PSSM captures evolutionary sequence information, such as conserved regions, active sites and regions under less evolutionary pressure (many different amino acids at a given position).
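
The principle can be illustrated with a toy PSSM built from a small alignment. This is only a sketch under simplifying assumptions: a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and pseudocounts derived from the substitution matrix. The alignment and function names are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column.

    Assumptions: uniform background (1/20) and a flat pseudocount;
    gap characters are ignored when counting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    toy_alignment = ["ACDK", "ACEK", "ACDR", "GCDK"]  # hypothetical sequences
    pssm = build_pssm(toy_alignment)
    # The fully conserved column 2 (C) scores higher than the variable column 3.
    print(round(pssm[1]["C"], 2), round(pssm[2]["D"], 2), round(pssm[2]["E"], 2))
    print(round(score(pssm, "ACDK"), 2), round(score(pssm, "WWWW"), 2))
```

Unlike a single substitution matrix, the score for the same amino acid differs from column to column, which is what makes the second iteration more sensitive.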

  • QUESTION 8: Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.

Answer: The E-values are lower this time and the query cover has increased to around 74%, but the coverage is skewed towards only one of the two regions matched previously. This suggests that one of the two types of matches has dominated the construction of PSSM-3.

  • QUESTION 9: Are there any homologous sequences found in search 2 that have an annotated function?

Answer: In the previous search (PSI-BLAST run 2) the hits were mostly annotated as deaminase domain-containing proteins and Rrf2 family transcriptional regulators, along with some hypothetical proteins of unknown function.

  • QUESTION 10: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?

Answer: In the new search (PSI-BLAST run 3) the hits were mostly Rrf2 family transcriptional regulators, which were also found in search 2. This agrees with the coverage seen in the Graphic Summary: only the matches in one region (the C-terminal part of the protein) have dominated PSSM-3.

  • QUESTION 11: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?

Answer: Searching with either PSSM-2 or PSSM-3 finds the same matches in PDB. The coverage coincides with the second (C-terminal) half of the orphan protein. This result is expected, since more of the proteins used to construct our PSSMs matched this region, which in effect amplified the signal from it. We could instead have selected only the other proteins for the second iteration of PSI-BLAST to skew the results towards the other (N-terminal) portion of the protein. It also makes sense that the PDB structures come from bacteria.
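
The exercise does this through the NCBI web interface, but the same two steps (build a PSSM on a large database, then reuse it against PDB) could be sketched with the command-line psiblast program from BLAST+. The snippet below only assembles the command lines and does not run them. The file names, the database names nr and pdbaa, and the helper function names are assumptions for illustration; check them against your local BLAST+ installation.

```python
import shlex


def build_pssm_command(query_fasta, db, pssm_out, iterations=2, inclusion=0.005):
    """psiblast call that iterates on `db` and writes a PSSM checkpoint file."""
    return ["psiblast", "-query", query_fasta, "-db", db,
            "-num_iterations", str(iterations),
            "-inclusion_ethresh", str(inclusion),
            "-out_pssm", pssm_out]


def search_with_pssm_command(pssm_in, db, evalue=0.005):
    """psiblast call that starts from a saved PSSM instead of a query sequence."""
    return ["psiblast", "-in_pssm", pssm_in, "-db", db, "-evalue", str(evalue)]


if __name__ == "__main__":
    # Hypothetical file and database names.
    print(shlex.join(build_pssm_command("orphan.fasta", "nr", "orphan.pssm")))
    print(shlex.join(search_with_pssm_command("orphan.pssm", "pdbaa")))
```

Keeping the PSSM in a file makes it explicit that the PDB search inherits whatever bias the selected sequences introduced, such as the emphasis on the C-terminal region noted above.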

  • QUESTION 12: What is the function of these proteins?

Answer: The first two PDB hits (5N07_A and 5N08_A) are both the HTH-type transcriptional repressor NsrR. These proteins bind DNA through a helix-turn-helix (HTH) motif to repress the transcription of genes, but that part of the protein lies outside our alignment, so this function is not expected to be present in our orphan protein. UniProt gives further information about this protein (Q9L132): it binds DNA; this binding is disrupted by nitrosylation upon exposure to nitric oxide (NO) and also by EDTA and iron chelators; the 2Fe-2S cluster is stable in the presence of O2. This regulatory function depends on three cysteines (C) that bind iron. Interestingly, we also found the three cysteines in our protein of interest, so it might have some regulatory function, but experimental assays would be needed to test this hypothesis.

  • QUESTION 13: However, can you see any potential risks in doing so? Can we believe the results?

Answer: The orphan protein used in this example is a real-case protein whose function is unfortunately unknown. There are still many genes whose function we do not know, and some are involved in diseases, so it is important to have ways of finding a potential function for them. When we use PSI-BLAST we select some sequences to build a position-specific scoring matrix (PSSM). Compared with a single sequence, a matrix has the advantage of learning a wider range of residue preferences at each position, which is why we find more remote homologs and with lower (more significant) E-values. A note of caution, however: you should make sure that the sequences you include in your PSSM do not pollute the initial signal, so it is preferable to include only a few hits, ideally those with the lowest E-values.

Finding a remote homolog (on your own)

  • QUESTION 14: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?

Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

  • QUESTION 15: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Answer: There are 2 significant hits:

  • "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
  • "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04
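
Counting significant hits amounts to filtering on the E-value threshold. The sketch below uses the two hits reported above, together with the E-value 6.9 hit from the first search, which is included only to show a rejected entry. The function name and the (description, E-value) tuple layout are assumptions for the example.

```python
def significant_hits(hits, threshold=0.005):
    """Return hits with E-value below the threshold, best (lowest) first."""
    return sorted((hit for hit in hits if hit[1] < threshold), key=lambda hit: hit[1])


if __name__ == "__main__":
    hits = [
        ("putative GPI transamidase component GAA1 [Trypanosoma theileri]", 8e-04),
        ("hypothetical protein", 6.9),  # best hit of the first search, not significant
        ("GPI transamidase component Gaa1 [Trypanosoma melophagium]", 1e-05),
    ]
    for description, e_value in significant_hits(hits):
        print(f"{e_value:g}\t{description}")
```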