Exercise PSI-BLAST ans: Difference between revisions

From 22111
* '''QUESTION 12''': What is the function of these proteins?


=== One more round ===
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
'''Answer:''' There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
ID      E      cov  ident  sim/pos
5HXY_A  5e-34  63%  18%    32%
4A8E_A  1e-30  65%  17%    33%


'''Alignments:'''
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
                E  +    SRYT      L+  ++ F  K      +  Y+             
Sbjct  45  RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104
Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
                          D ++  +  PK      V +  +E K + +        A  +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164
Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L  YL  R 
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219
Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
                          + + + D      +  +    + R I +  +A  K+  + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266
Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
              FAT +        I  + G      +I    YT      LR+ Y +   
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316


>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
                +    I    Y  L  SR T      I  +      +  S    + + +   
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60
Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
                S  +  L  +  +            +  KTPK+      +  EE++ +    E +
Sbjct  61  RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120
Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
              +  +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179
Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
              YL +R +        + +            K K KL P    L +K      R  G
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221
Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
                + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++  +A
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277
Query  453  L  453
            L
Sbjct  278  L  278
 
== Finding a remote homolog (on your own) ==
* '''QUESTION 14''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

* '''QUESTION 15''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:
* "GPI transamidase component Gaa1" from ''Trypanosoma melophagium'' with an E-value of 1e-05
* "putative GPI transamidase component GAA1" from ''Trypanosoma theileri'' with an E-value of 8e-04

<!--
== Identifying conserved residues ==
* '''QUESTION 15''': Spend a little time looking at the logo plot. Can you understand why the logo is so flat for the first 150 residues (how large a fraction of the query section did the Blast search cover)?
Answer: Query coverage vs 5HXY_A was around 64% (positions 159-450 - in the second round). There is only limited sequence coverage for the first 150 aa of the query sequence (See ncbi-blast graphics). You can also compare to the graphics from the first BLASTP search:
[[File:Blast_QUERY1.png]]
 
In this picture, you can clearly see that the vast majority of hits only covers the right half of the query sequence.
 
* '''QUESTION 16''': Which of the eight residues listed above are most conserved and hence most likely to be essential for the protein stability and/or function?
Answer: R287, E290, R400, Y436
 
 
=== Homology modelling ===
* '''QUESTION 17''': Does CPHmodels agree that the hit identified by PSI-BLAST is significant?
Answer: Yes - CPHmodels comes up with a Z-score of 31.75
 
* '''QUESTION 18''': Could the residues form an active site?
Answer: Yes - the four residues are close in space.
[[File:active_site.png]]
 
[[File:Psi-blast_active_site.png|center|frame|Another view from a different angle, which shows that the residues could potentially be a part of the active site.]]
 
-->

Revision as of 10:51, 6 November 2025

NEW answers are being updated!

When BLAST fails

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Answer: No sequences with E-value below 0.005.


Trying another approach

  • QUESTION 2: How many hits do you obtain (E-value < 10)? (Tip: you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)

Answer: This gene is poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which is a significant hit.

  • QUESTION 3: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?

Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from Thermoactinomyces sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-match, none of the hits are human. Thermoactinomyces is a genus of Gram-positive bacteria, so it is odd to find only a partial match in bacteria before any match in vertebrates.

  • QUESTION 4: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?

Apart from the orphan protein's hit to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that between one and ten hits with an equally good score are expected purely by chance. That the hits come from bacteria does not make the homology hypothesis more promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we continue the searches to see what we get.
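To make the meaning of these E-values concrete, here is a minimal sketch that converts an E-value into the probability of seeing at least one chance hit with that score or better. It assumes the standard BLAST statistical model, in which the number of chance hits is Poisson-distributed with mean E, so P = 1 - exp(-E). The function name is made up for this illustration.

```python
import math


def p_at_least_one_chance_hit(e_value: float) -> float:
    """Probability of >= 1 chance hit, assuming Poisson-distributed chance hits with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but is numerically stable for very small E.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # 0.005 is the significance threshold used in this exercise;
    # 1, 6.9 and 10 are in the range of the non-significant hits discussed above.
    for e in (0.005, 1.0, 6.9, 10.0):
        print(f"E = {e:<6} -> P(at least one chance hit) = {p_at_least_one_chance_hit(e):.4f}")
```

For small E-values, P is practically equal to E, which is why a threshold such as 0.005 can be read almost directly as a probability. For E-values between 1 and 10, P is already well above 0.5, so such hits cannot be told apart from random matches.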


Constructing the PSSM

  • QUESTION 5: After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?

Answer: 500 (in fact far more than 500, but BLAST shows only 500 hits by default; note that the last hit has an E-value far below 0.005). The E-values of the previous hits are much lower and now look significant, because those sequences were incorporated into the PSSM and therefore into the search.

  • QUESTION 6: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?

Answer: Most hits have a query coverage of around 45-50%. However, the hits fall into two separate regions of the protein, as if our orphan protein were a combination of two different proteins, each of which appears to be abundant in many genera of bacteria.

  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.

Answer: During the first iteration, the generic BLOSUM62 substitution matrix was used. The hits found there were combined into a multiple alignment, from which a new and more sensitive position-specific scoring matrix (PSSM) was constructed for the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, i.e. conserved regions, active sites, and regions under less evolutionary pressure (many different amino acids at a given position).
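The principle can be illustrated with a minimal sketch that builds a log-odds PSSM from a small gapless alignment. This is a simplification and not what PSI-BLAST actually does internally: it assumes a uniform background frequency, a flat pseudocount, and no sequence weighting, and the toy alignment and function names are made up for this illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log2-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column_scores[aa] = math.log2(freq / background)
        pssm.append(column_scores)
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of seq against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Toy alignment (illustrative only): column 2 is fully conserved, column 4 is variable.
toy_alignment = ["LRHSF", "LRHTF", "LRRHF", "VRHSF"]
pssm = build_pssm(toy_alignment)

if __name__ == "__main__":
    print("R at conserved column 2:", round(pssm[1]["R"], 2))
    print("S at variable column 4: ", round(pssm[3]["S"], 2))
    print("score of LRHSF:", round(score(pssm, "LRHSF"), 2))
    print("score of AAAAA:", round(score(pssm, "AAAAA"), 2))
```

A residue at a conserved position receives a high score, while the same match at a variable position counts for less. A single general-purpose matrix such as BLOSUM62 cannot make this distinction, because it scores a given amino acid pair identically at every position.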

Saving and reusing the PSSM

  • QUESTION 8: Can you see any changes on the results now? Look at the E-values, and the query cover on the Graphic Summary tab.

Answer: The E-values are lower this time, but the query coverage is skewed towards only one of the regions matched previously.

  • QUESTION 9: Are there any homologous sequences found in search 2 that have an annotated function?

In the previous search (PSI-BLAST run 2), the annotated functions were mostly deaminase domain-containing protein and Rrf2 family transcriptional regulator, along with some hypothetical proteins of unknown function.

  • QUESTION 10: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?

In the new search (PSI-BLAST run 3), the annotated functions were mostly Rrf2 family transcriptional regulator, a function already seen in search 2.


  • QUESTION 11: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why?
  • QUESTION 12: What is the function of these proteins?


Finding a remote homolog (on your own)

  • QUESTION 14: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?

Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

  • QUESTION 15: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Answer: There are 2 significant hits:

  • "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
  • "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04
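As a small illustration of the significance cutoff used throughout this exercise, the sketch below filters a list of hits by E-value. The hit descriptions and E-values are those quoted in the answers to questions 14 and 15; putting them into one list, as well as the variable and function names, is done here only for illustration.

```python
THRESHOLD = 0.005  # significance cutoff used in this exercise

# (description, E-value) pairs taken from the answers above.
hits = [
    ("hypothetical protein (best hit of the first search)", 6.9),
    ("GPI transamidase component Gaa1 [Trypanosoma melophagium]", 1e-05),
    ("putative GPI transamidase component GAA1 [Trypanosoma theileri]", 8e-04),
]


def significant_hits(hits, threshold=THRESHOLD):
    """Return the hits with E-value below the threshold, best (lowest E-value) first."""
    return sorted((hit for hit in hits if hit[1] < threshold), key=lambda hit: hit[1])


if __name__ == "__main__":
    for description, e_value in significant_hits(hits):
        print(f"{e_value:.0e}  {description}")
```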