ExPSIBLAST answer: Difference between revisions

Latest revision as of 11:47, 14 November 2024

Note: The E-values and other results below were obtained on November 8, 2023.

When BLAST fails

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Answer: No sequences with E-value below 0.005.

Trying another approach

  • QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)?

Answer: After the first iteration, 494 hits are found.

  • QUESTION 3: How large a fraction of the query sequence do the significant hits match (excluding the identical match)?

Answer: Most hits cover between 45% and 55% of the query. One hit (#2) covers 84%. A few hits cover less, down to 11%.

  • QUESTION 4: Do you find any PDB hits among the significant hits?

Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.

Constructing the PSSM

  • QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)?

Answer: 500. There are actually many more than 500 hits, but BLAST shows only 500 by default. Note that the last hit shown has an E-value far below 0.005.

  • QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?

Answer: Approximately 50-60% query coverage, except for one hit (#2) that covers 84%.

  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!

Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. The second iteration searches with this PSSM, which is why more sequences are found. A PSSM captures evolutionary information for each position, i.e. conserved regions, active sites, and regions under less evolutionary pressure (where many different amino acids occur at a given position).
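
To make the idea of a PSSM concrete, here is a minimal sketch of how per-column log-odds scores could be derived from a multiple alignment. It is not the procedure PSI-BLAST actually uses: the uniform background frequencies, the flat pseudocount, the function names and the toy alignment are all assumptions chosen for illustration, and sequence weighting is left out entirely.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background. Real tools use observed residue frequencies.
BACKGROUND = 1.0 / len(AMINO_ACIDS)


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / BACKGROUND)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical toy alignment: columns 1 and 3 are fully conserved.
toy_alignment = ["GLRVS", "GLRPG", "GVRVG", "GLRLA"]
pssm = build_pssm(toy_alignment)
```

In this toy PSSM a conserved residue such as G in the first column gets a positive score while residues never seen there get negative ones, whereas the variable last column gives weaker scores overall. That position dependence is what a single matrix such as BLOSUM62 cannot express.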

Saving and reusing the PSSM

  • QUESTION 8: Do you find any significant PDB hits now? If yes, how many?

Answer: Yes, 13
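
For readers who prefer the standalone BLAST+ tools to the web interface, the save-and-reuse step could be sketched as below. The helper names, file names and database names are placeholders, and the functions only assemble argument lists for `psiblast` (using its `-out_pssm` and `-in_pssm` options) without running anything. Treat this as one possible way to do it under those assumptions, not as the procedure used in the exercise.

```python
def build_pssm_command(query_fasta, database, pssm_path,
                       iterations=2, inclusion_ethresh=0.005):
    """Arguments for an iterated search that saves its PSSM to pssm_path."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-out_pssm", pssm_path,
    ]


def reuse_pssm_command(pssm_path, database, evalue=0.005):
    """Arguments for searching another database with a saved PSSM."""
    # -in_pssm replaces -query: the saved PSSM acts as the query.
    return [
        "psiblast",
        "-in_pssm", pssm_path,
        "-db", database,
        "-evalue", str(evalue),
    ]


# Hypothetical file and database names, for illustration only.
first = build_pssm_command("query.fasta", "nr", "query.pssm")
second = reuse_pssm_command("query.pssm", "pdbaa")
print(" ".join(first))
print(" ".join(second))
```

Either list could be passed to `subprocess.run` on a machine with BLAST+ and the databases installed.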

  • QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?

Answer: 4A8E_A with an E-value of 2e-19 and 5HXY_A with an E-value of 8e-19.

  • QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?

Answer:

ID      cov   ident  sim/pos 
4A8E_A  46%   21%    39%
5HXY_A  61%   18%    31%

Alignments:

>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
 
Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301
            +  KTPK+       +  EE++ +    E +  +   +LL  +GLR  EL N+ +E+++ 
Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149

Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361
            +  +I + +  +  +      S      +++ YL +R +        + +          
Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197

Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421
               K K KL P     L +K      R  G     + LR  FAT+M  + +    I  L 
Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250

Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453
            G    +  +I    YT  + + L++   +A L
Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278

>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Length=317

Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233
            SRYT      L+  ++ F   K       +   Y+                         
Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115

Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292
               D ++  +   PK      V +  +E K + +         A   +LA +G+R GEL 
Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175

Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352
            N+ I ++DL+  II + +  +  +      + +  + L   YL  R              
Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219

Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411
               + + + D      +   +    + R I +   +A   K+   + LR  FAT +    
Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277

Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
                 I  + G       +I    YT      LR+ Y +    
Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316
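
The identity and positives percentages in the table can be read off the middle line of a BLAST alignment: a letter marks an identical pair, and a `+` marks a conservative substitution. The sketch below counts both for a ten-column fragment of the 4A8E_A alignment above. The function names are made up for illustration, and the numbers passed to `query_coverage` are toy values, not the real query length.

```python
def midline_stats(query, midline, subject):
    """Count identities and positives from a BLAST-style alignment block."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("query, midline and subject must have equal length")
    columns = len(query)
    # A letter on the midline marks an identical pair of residues.
    identities = sum(1 for m in midline if m.isalpha())
    # Positives are identities plus conservative substitutions ('+').
    positives = identities + midline.count("+")
    return {
        "columns": columns,
        "identities": identities,
        "positives": positives,
        "identity_pct": 100.0 * identities / columns,
        "positives_pct": 100.0 * positives / columns,
    }


def query_coverage(query_start, query_end, query_length):
    """Fraction of the query spanned by the alignment (1-based, inclusive)."""
    return (query_end - query_start + 1) / query_length


# Fragment of the 4A8E_A alignment (query residues GLRPGELLNV).
stats = midline_stats("GLRPGELLNV", "GLR  EL N+", "GLRVSELCNL")
print(stats["identities"], stats["positives"])   # 6 identities, 7 positives
print(query_coverage(11, 60, 100))               # toy numbers: 0.5
```

Over a whole alignment the same counts are divided by the full alignment length, gap columns included, which is how an alignment can share a clearly conserved motif and still show only around 20% identity.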


  • QUESTION 11: What is the function of these proteins?

Answer: They are recombinases.

There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approximately 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.

One more round

  • QUESTION 12: Answer questions 8-10 again for the new search.

Answer: There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.

ID      E      cov   ident  sim/pos 
5HXY_A  5e-34  63%   18%    32%
4A8E_A  1e-30  65%   17%    33%

Alignments:

>5HXY_A Chain A, Crystal Structure Of Xera Recombinase

Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
               E  +    SRYT      L+  ++ F   K       +   Y+              
Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104

Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
                          D ++  +   PK      V +  +E K + +         A   +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164

Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L   YL  R   
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219

Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
                          + + + D      +   +    + R I +   +A   K+   + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266

Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
              FAT +         I  + G       +I    YT      LR+ Y +    
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea

Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
               +    I     Y  L   SR T       I  +      +  S    + + +     
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60

Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
                S  +  L  +  +            +  KTPK+       +  EE++ +    E +
Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120

Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
              +   +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179

Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
             YL +R +        + +             K K KL P     L +K      R  G 
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221

Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
                + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++   +A 
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277

Query  453  L  453
            L
Sbjct  278  L  278

Finding a remote homolog (on your own)

  • QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?

Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

  • QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Answer: There are 2 significant hits:

  • "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
  • "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04