ExPSIBLAST answer: Difference between revisions

From 22111
No edit summary
 
(7 intermediate revisions by the same user not shown)
Line 8: Line 8:
==Trying another approach==
==Trying another approach==
* '''QUESTION 2''': How many significant hits does BLAST find (E-value < 0.005)?
* '''QUESTION 2''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: After the first iteration, 494 hits are found.
Answer: After the first iteration, 181 significant hits are found.


* '''QUESTION 3''': How large a fraction of the query sequence do the significant hits match (excluding the identical match)?  
* '''QUESTION 3''': How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)?  
Answer: For most hits between 45 and 55%. One hit (#2) is 84%. A few hits are lower, down to 11%.
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.
   
   
* '''QUESTION 4''': Do you find any PDB hits among the significant hits?  
<!--* '''QUESTION 4''': (deleted)  Do you find any PDB hits among the significant hits?  
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.-->


=== Constructing the PSSM ===
=== Constructing the PSSM ===
* '''QUESTION 5''': How many significant hits does BLAST find (E-value < 0.005)?  
* '''QUESTION 4''': How many significant hits does BLAST find (E-value < 0.005)?  
Answer: 500 (actually, much more than 500 hits, but BLAST by default only shows 500 — note that the last hit has an E-value much much smaller than 0.005)
Answer: 500 (actually, much more than 500 hits, but BLAST by default only shows 500)  


* '''QUESTION 6''': How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
* '''QUESTION 5''': What is the E-value of the ''least'' significant hit shown on the results page?
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.
Answer: Hit #500 has an E-value of 8e-14, i.e., much much smaller than 0.005.
 
* '''QUESTION 6''': How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?
Answer: 53%-87%  


* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!  
* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!  
Answer: During the first iteration a generic Blosum62 substitution matrix was used, and hits found there were made into a multiple alignment and next a more sensitive position-specific-substitution-matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary sequence information i.e. conserved regions, active sites and regions with less evolutionary pressure (many different amino acids at a certain position).
Answer: During the first iteration a generic BLOSUM62 substitution matrix was used, and hits found there were made into a multiple alignment and next a more sensitive position-specific-substitution-matrix (PSSM). This is why more sequences are found in the second iteration. A PSSM can capture evolutionary sequence information i.e. conserved regions, active sites and regions with less evolutionary pressure (many different amino acids at a certain position).


=== Saving and reusing the PSSM ===
=== Saving and reusing the PSSM ===
* '''QUESTION 8''': Do you find any significant PDB hits now? If yes, how many?
* '''QUESTION 8''': Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 13
Answer: Yes, 16


* '''QUESTION 9''': What are the PDB identifiers and the E-values for the two best PDB hits?
* '''QUESTION 9''': What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 4A8E_A with an E-value of 2&times;10<sup>-19</sup>, 5HXY_A with an E-value of 8&times;10<sup>-19</sup>,
Answer: 5HXY_A with an E-value of 8´2&times;10<sup>-19</sup>; 4A8E_A with an E-value of 2&times;10<sup>-17</sup>.


* '''QUESTION 10''': What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?  
* '''QUESTION 10''': What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?  
Line 37: Line 40:
'''Answer:'''
'''Answer:'''
  ID      cov  ident  sim/pos  
  ID      cov  ident  sim/pos  
  4A8E_A 46%  21%    39%
  5HXY_A 46%  20%    35%
  5HXY_A 61%  18%    31%
  4A8E_A 63%  18%    37%
 
'''Alignments:'''
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
 
Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301
            +  KTPK+      +  EE++ +    E +  +  +LL  +GLR  EL N+ +E+++
Sbjct  90  EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149
Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361
            +  +I + +  +  +      S      +++ YL +R +        + +         
Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197
Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421
                K K KL P    L +K      R  G    + LR  FAT+M  + +    I  L
Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250
Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453
            G    +  +I    YT  + + L++  +A L
Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278
 
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Length=317
Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233
            SRYT      L+  ++ F  K      +  Y+                       
Sbjct  56  SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115
Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292
                D ++  +  PK      V +  +E K + +        A  +LA +G+R GEL
Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175
Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352
            N+ I ++DL+  II + +  +  +      + +  + L  YL  R             
Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219
Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411
                + + + D      +  +    + R I +  +A  K+  + LR  FAT +   
Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277
Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
                  I  + G      +I    YT      LR+ Y +   
Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316
 




Line 92: Line 51:
=== One more round ===
=== One more round ===
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
'''Answer:''' There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
'''Answer:''' There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.
  ID      E      cov  ident  sim/pos  
  ID      E      cov  ident  sim/pos  
  5HXY_A 5e-34 63%  18%    32%
  4A8E_A 3e-55 65%  18%    36%
  4A8E_A 1e-30 65%  17%    33%
  5HXY_A 3e-41 61%  18%    32%
 
'''Alignments:'''
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
                E  +    SRYT      L+  ++ F  K      +  Y+             
Sbjct  45  RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104
Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
                          D ++  +  PK      V +  +E K + +        A  +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164
Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+  II + +  +  +      + +  + L  YL  R 
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219
Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
                          + + + D      +  +    + R I +  +A  K+  + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266
Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
              FAT +        I  + G      +I    YT      LR+ Y +   
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316
 
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
                +    I    Y  L  SR T      I  +      +  S    + + +   
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60
Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
                S  +  L  +  +            +  KTPK+      +  EE++ +    E +
Sbjct  61  RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120
Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
              +  +LL  +GLR  EL N+ +E+++ +  +I + +  +  +      S      +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179
Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
              YL +R +        + +            K K KL P    L +K      R  G
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221
Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
                + LR  FAT+M  + +    I  L G    +  +I    YT  + + L++  +A
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277
Query  453  L  453
            L
Sbjct  278  L  278


== Finding a remote homolog (on your own) ==
== Finding a remote homolog (on your own) ==

Latest revision as of 10:08, 10 November 2025

Note: E-values etc. are found November 8, 2025.

When BLAST fails

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Answer: No sequences with E-value below 0.005.

Trying another approach

  • QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)?

Answer: After the first iteration, 181 significant hits are found.

  • QUESTION 3: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)?

Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.


Constructing the PSSM

  • QUESTION 4: How many significant hits does BLAST find (E-value < 0.005)?

Answer: 500 (actually, many more than 500 hits, but BLAST by default only shows 500)

  • QUESTION 5: What is the E-value of the least significant hit shown on the results page?

Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.
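
The reasoning behind Questions 4 and 5 can be sketched in a few lines of Python: count the hits below the cutoff, and check whether a full result list whose worst hit is still significant may hide further hits. This is only an illustration. The function names and the toy E-values are assumptions made for this sketch and are not output from the actual search.

```python
E_VALUE_CUTOFF = 0.005  # significance threshold used throughout this exercise


def count_significant(e_values, cutoff=E_VALUE_CUTOFF):
    """Count hits whose E-value is strictly below the cutoff."""
    return sum(1 for e in e_values if e < cutoff)


def may_be_truncated(e_values, max_shown, cutoff=E_VALUE_CUTOFF):
    """True if the list is full AND even its worst hit is still significant.

    In that case the display limit, not the cutoff, ended the list,
    so more significant hits may exist than are shown.
    """
    return bool(e_values) and len(e_values) >= max_shown and max(e_values) < cutoff


# Toy list for illustration only (hypothetical values, not real search output).
toy_e_values = [3e-55, 8e-14, 1e-05, 8e-04, 0.004, 0.005, 6.9]

n_significant = count_significant(toy_e_values)  # 0.005 itself is not < 0.005
truncated = may_be_truncated(toy_e_values[:3], max_shown=3)
```

With the real results, a list of exactly 500 hits whose last E-value is 8e-14 would make `may_be_truncated(..., max_shown=500)` return True, which is the point of Question 5.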

  • QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?

Answer: 53%-87%

  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!

Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were assembled into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built and used for the second iteration. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position in the sequence, e.g. conserved regions and active sites versus regions under less evolutionary pressure (many different amino acids at a given position).
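
To make the idea of a PSSM concrete, here is a minimal sketch that turns a toy multiple alignment into per-column log-odds scores. It is deliberately simplified and is not how PSI-BLAST itself computes its matrix: it assumes a uniform background frequency and a fixed pseudocount, and it applies no sequence weighting. The function name `build_pssm` and the toy alignment are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy alignment (hypothetical): columns 1 and 3 are fully conserved,
# the other columns are variable.
toy_alignment = ["GLRVS", "GVRPG", "GLRIA", "GIRLT"]
pssm = build_pssm(toy_alignment)

conserved_score = pssm[0]["G"]  # G in a fully conserved column
variable_score = pssm[1]["L"]   # L in a column that also holds V and I
```

The conserved column rewards its residue more strongly than the variable column does, and penalises residues never seen there. A single general-purpose matrix such as BLOSUM62 cannot make this position-by-position distinction.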

Saving and reusing the PSSM

  • QUESTION 8: Do you find any significant PDB hits now? If yes, how many?

Answer: Yes, 16

  • QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?

Answer: 5HXY_A with an E-value of 8´2×10^-19; 4A8E_A with an E-value of 2×10^-17.

  • QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?

Answer:

ID      cov   ident  sim/pos 
5HXY_A  46%   20%    35%
4A8E_A  63%   18%    37%
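
As a rough sketch of what the three columns mean, the snippet below derives coverage, identity and positives from the aligned query range and a BLAST-style midline (a letter marks an identity, "+" a conservative substitution, a space anything else). The function name, the 10-column fragment and the query length of 25 are assumptions for illustration. NCBI reports query coverage over all aligned segments of a hit combined, which this single-segment sketch does not attempt.

```python
def alignment_stats(query_start, query_end, query_length, midline):
    """Compute coverage, identity and positives for one aligned segment."""
    columns = len(midline)                                  # alignment length incl. gaps
    identities = sum(1 for ch in midline if ch.isalpha())   # letters = identical residues
    positives = identities + midline.count("+")             # identities also count as positives
    return {
        "coverage": (query_end - query_start + 1) / query_length,
        "identity": identities / columns,
        "positives": positives / columns,
    }


# Toy fragment (10 columns); the query length of 25 is hypothetical.
query   = "GLRPGELLNV"
midline = "GLR  EL N+"
subject = "GLRVSELCNL"

stats = alignment_stats(query_start=1, query_end=10, query_length=25, midline=midline)
```

Here 6 of 10 columns are identical and 7 of 10 are positives, while the segment spans 10 of the 25 hypothetical query residues.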


  • QUESTION 11: What is the function of these proteins?

Answer: They are recombinases.

There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approx. 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.

One more round

  • QUESTION 12: Answer questions 8-10 again for the new search.

Answer: There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.

ID      E      cov   ident  sim/pos 
4A8E_A  3e-55  65%   18%    36%
5HXY_A  3e-41  61%   18%    32%

Finding a remote homolog (on your own)

  • QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?

Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

  • QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Answer: There are 2 significant hits:

  • "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
  • "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04