ExPSIBLAST answer
Note: The E-values and other results reported here were obtained on November 8, 2023.
When BLAST fails
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.
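The counting in these questions is done by eye on the BLAST results page. If the same search were instead run with command-line BLAST+ and saved in tabular form (`-outfmt 6`, whose default columns end with `evalue` and `bitscore`), the count could be sketched as below. The function name and the example rows are hypothetical and are not results from this exercise.

```python
E_VALUE_CUTOFF = 0.005  # significance threshold used throughout the exercise


def count_significant(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Count distinct subject sequences with at least one HSP below the cutoff.

    Assumes default BLAST+ tabular columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    significant = set()
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        e_value = float(fields[10])
        if e_value < cutoff:
            significant.add(subject_id)
    return len(significant)


# Hypothetical rows, invented for illustration only.
EXAMPLE_TABULAR = "\n".join([
    "query1\thypA\t30.0\t200\t140\t2\t1\t200\t5\t204\t1e-30\t120.0",
    "query1\thypA\t28.0\t50\t36\t0\t210\t259\t220\t269\t0.004\t35.0",
    "query1\thypB\t25.0\t120\t90\t1\t10\t129\t3\t122\t0.001\t40.0",
    "query1\thypC\t22.0\t60\t47\t0\t30\t89\t12\t71\t6.9\t28.0",
])

print(count_significant(EXAMPLE_TABULAR))
```

A subject with several HSPs is counted once, which matches how the results page lists one row per database sequence.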
Trying another approach
- QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)?
Answer: After the first iteration, 494 hits are found.
- QUESTION 3: How large a fraction of the query sequence do the significant hits match (excluding the identical match)?
Answer: For most hits, between 45 and 55%. One hit (#2) covers 84%. A few hits cover less, down to 11%.
- QUESTION 4: Do you find any PDB hits among the significant hits?
Answer: No, of course not. If there were any significant PDB hits, we would have found them under QUESTION 1.
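The fraction asked for in QUESTION 3 is the "Query cover" column: the share of query positions that fall inside at least one aligned segment. A minimal sketch of that calculation follows, assuming 1-based inclusive query coordinates as printed in BLAST alignments; the function name and the numbers in the example are hypothetical.

```python
def query_coverage(hsp_ranges, query_length):
    """Fraction of the query covered by the union of aligned segments.

    hsp_ranges: list of (query_start, query_end) pairs, 1-based and inclusive.
    Overlapping segments are counted only once.
    """
    covered = 0
    current_end = 0  # last query position already counted
    for start, end in sorted(hsp_ranges):
        start = max(start, current_end + 1)  # skip positions counted earlier
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / query_length


# Hypothetical example: one segment spanning positions 101-350 of a 500-residue query.
print(f"{query_coverage([(101, 350)], 500):.0%}")
```

Merging the ranges first matters only when a hit has several overlapping segments; for a single segment the result is simply its length divided by the query length.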
Constructing the PSSM
- QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)?
Answer: 500. There are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit has an E-value far below 0.005.
- QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
Answer: Approximately 50-60% query coverage, except one hit (#2) at 84%.
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits from that search were assembled into a multiple alignment, from which a position-specific scoring matrix (PSSM) was derived. The second iteration searches with this PSSM, which is more sensitive than a generic matrix, and this is why more sequences are found. A PSSM captures evolutionary information about each position of the query: conserved regions such as active sites score strongly, while positions under less evolutionary pressure (where many different amino acids occur) tolerate substitutions.
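To make the idea of a PSSM concrete, here is a toy sketch that turns a small ungapped alignment into per-column log-odds scores. It is a simplification under stated assumptions, not PSI-BLAST's actual procedure: it uses a uniform background frequency and a flat pseudocount, whereas PSI-BLAST applies sequence weighting and more elaborate pseudocounts. The alignment and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    Assumes an ungapped alignment of equal-length sequences and a uniform
    background frequency (both simplifications).
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment aligned to the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical toy alignment: column 1 is fully conserved, column 4 is variable.
toy_alignment = ["GLRVS", "GLRPG", "GVRVG", "GLRAS"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["G"], 2), round(pssm[3]["V"], 2), round(pssm[0]["W"], 2))
```

The fully conserved first column rewards G more strongly than the variable fourth column rewards V, and an unobserved residue such as W gets a negative score. That position dependence is what a single generic matrix like BLOSUM62 cannot express.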
Saving and reusing the PSSM
- QUESTION 8: Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 13
- QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 4A8E_A with an E-value of 2×10⁻¹⁹ and 5HXY_A with an E-value of 8×10⁻¹⁹.
- QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?
Answer:
ID       cov   ident   sim/pos
4A8E_A   46%   21%     39%
5HXY_A   61%   18%     31%
Alignments:
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query  242  DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL  301
            + KTPK+ + EE++ + E + + +LL +GLR EL N+ +E+++
Sbjct  90   EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF  149

Query  302  KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI  361
            + +I + + + + S +++ YL +R + + +
Sbjct  150  EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR----------  197

Query  362  DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ  421
            K K KL P L +K R G + LR FAT+M + + I L
Sbjct  198  ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL  250

Query  422  GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL  453
            G + +I YT + + L++ +A L
Sbjct  251  GHSNLSTTQI----YTKVSTKHLKEAVKKAKL  278

>5HXY_A Chain A, Crystal Structure Of Xera Recombinase Length=317
Query  174  SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT  233
            SRYT L+ ++ F K + Y+
Sbjct  56   SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY  115

Query  234  IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL  292
            D ++ + PK V + +E K + + A +LA +G+R GEL
Sbjct  116  KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC  175

Query  293  NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL  352
            N+ I ++DL+ II + + + + + + + L YL R
Sbjct  176  NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--------------  219

Query  353  AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK  411
            + + + D + + + R I + +A K+ + LR FAT +
Sbjct  220  --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG  277

Query  412  VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
            I + G +I YT LR+ Y +
Sbjct  278  GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316
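The "ident" percentage can be checked by hand from the aligned rows: count the columns where Query and Sbjct carry the same residue and divide by the number of alignment columns, gaps included. A small sketch follows, applied to the final block of the 4A8E_A alignment (Query 422-453, Sbjct 251-278). The function name is hypothetical, and the figure BLAST reports is computed over the whole alignment, not over a single block.

```python
def identity_stats(query_aln, sbjct_aln):
    """Return (identical columns, total columns) for two aligned rows.

    Both rows must have the same length; '-' marks a gap and never
    counts as an identity.
    """
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned rows must have equal length")
    identical = sum(
        1 for q, s in zip(query_aln, sbjct_aln) if q == s and q != "-"
    )
    return identical, len(query_aln)


# Last block of the 4A8E_A alignment, copied from the BLAST output.
query_block = "GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL"
sbjct_block = "GHSNLSTTQI----YTKVSTKHLKEAVKKAKL"

identical, columns = identity_stats(query_block, sbjct_block)
print(f"{identical}/{columns} identical in this block")
```

The seven identities found here are the seven letters shown on the middle line of that block (G, I, Y, T, L, A, L); the "+" signs on the middle line are the additional positions counted under "Positives".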
- QUESTION 11: What is the function of these proteins?
Answer: They are recombinases.
There are two families of site-specific recombinases: the resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.
One more round
- QUESTION 12: Answer questions 8-10 again for the new search.
Answer: There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A, but 5HXY_A now ranks first.
ID       E       cov   ident   sim/pos
5HXY_A   5e-34   63%   18%     32%
4A8E_A   1e-30   65%   17%     33%
Alignments:
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Query  163  LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI  222
            E + SRYT L+ ++ F K + Y+
Sbjct  45   RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ  104

Query  223  LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL  281
            D ++ + PK V + +E K + + A +L
Sbjct  105  YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL  164

Query  282  AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF  341
            A +G+R GEL N+ I ++DL+ II + + + + + + + L YL R
Sbjct  165  AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR---  219

Query  342  IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR  400
            + + + D + + + R I + +A K+ + LR
Sbjct  220  -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR  266

Query  401  RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV  454
            FAT + I + G +I YT LR+ Y +
Sbjct  267  HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR  316

>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query  154  IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT  212
            + I Y L SR T I + + S + + +
Sbjct  5    EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK  60

Query  213  SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI  272
            S + L + + + KTPK+ + EE++ + E +
Sbjct  61   RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL  120

Query  273  PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK  332
            + +LL +GLR EL N+ +E+++ + +I + + + + S +++
Sbjct  121  RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR  179

Query  333  VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK  392
            YL +R + + + K K KL P L +K R G
Sbjct  180  -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV  221

Query  393  RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG  452
            + LR FAT+M + + I L G + +I YT + + L++ +A
Sbjct  222  ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK  277

Query  453  L  453
            L
Sbjct  278  L  278
Finding a remote homolog (on your own)
- QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.
- QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:
- "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
- "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04