Exercise PSI-BLAST ans: Difference between revisions
Revision as of 10:22, 6 November 2025
NEW answers are being updated!
When BLAST fails
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.
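Trying another approach
[Figure: Partial screenshot of the first PSI-BLAST search with no significant hits but C22orf45 against itself]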
- QUESTION 2: How many hits do you obtain (E-value < 10)? (Tip: you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)
Answer: This gene is poorly characterized and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 others, none of which are significant hits.
- QUESTION 3: Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?
Answer: The best hit is WP_340711999.1, a deaminase domain-containing protein from Thermoactinomyces sp., with 33.33% sequence identity and 48% query coverage. Apart from the query itself, none of the hits are human. Thermoactinomyces is a genus of Gram-positive bacteria, so it is odd to find only a partial bacterial match before any match in vertebrates.
- QUESTION 4: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?
Answer: Apart from the orphan protein's hit to itself, none of the hits are significant: the E-values lie between 1 and 10, meaning that one to ten hits with an equal or better score are expected by chance alone. The fact that the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we continue the searches to see what we get.
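To make the E-value reasoning above concrete, the following is a minimal sketch that converts an E-value into the probability of seeing at least one chance hit with that score or better, using the standard relation P = 1 - exp(-E). The function name is our own and is not part of BLAST.

```python
import math


def p_at_least_one_chance_hit(e_value: float) -> float:
    """Probability of at least one chance hit with this score or better.

    Uses P = 1 - exp(-E); expm1 keeps precision for very small E-values.
    """
    return -math.expm1(-e_value)


# The significance threshold used in this exercise, and the two ends of the
# 1-10 range seen for the non-self hits in the first search.
for e in (0.005, 1.0, 10.0):
    p = p_at_least_one_chance_hit(e)
    print(f"E = {e:g}: P(at least one chance hit) = {p:.4f}")
```

For small E-values, P and E are nearly identical, which is why the two are often used interchangeably at the significant end of the hit list. For E-values of 1 or more, a chance hit of that quality is more likely than not.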
Constructing the PSSM
- QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)?
Answer: 500. There are actually many more than 500 hits, but BLAST shows only 500 by default; note that the last hit shown has an E-value far below 0.005.
- QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
Answer: Approx. 50-60% query coverage, except for one hit (#2) with 84%.
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. The second iteration searches with this PSSM, which is why it finds more sequences. A PSSM captures evolutionary information about the sequence family, i.e. conserved regions and active sites as well as regions under less evolutionary pressure (many different amino acids at a given position).
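The idea behind the PSSM can be illustrated with a much simplified sketch. It assumes a toy gapless alignment (made up for illustration), uniform background frequencies and a fixed pseudocount; the real PSI-BLAST procedure additionally weights sequences and uses more elaborate pseudocounts, so this is not a reimplementation of it. The names build_pssm and score_segment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment aligned to the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment, invented for illustration only.
toy_alignment = ["GLRVSEL", "GVRVGEL", "GLRPGEL", "GIRASEL"]
pssm = build_pssm(toy_alignment)

# Column 0 is fully conserved (G); column 3 is variable (V, V, P, A).
print("conserved column, G:", round(pssm[0]["G"], 2))
print("variable column,  V:", round(pssm[3]["V"], 2))
print("family-like segment:", round(score_segment(pssm, "GLRVSEL"), 2))
print("unrelated segment:  ", round(score_segment(pssm, "WWWWWWW"), 2))
```

A residue at a conserved position is rewarded more than the same kind of match at a variable position, which a single matrix such as BLOSUM62 cannot express. This is the source of the extra sensitivity in the second iteration.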
Saving and reusing the PSSM
- QUESTION 8: Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 13
- QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 4A8E_A with an E-value of 2e-19 and 5HXY_A with an E-value of 8e-19.
- QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?
Answer:
ID       cov   ident   sim/pos
4A8E_A   46%   21%     39%
5HXY_A   61%   18%     31%
Alignments:
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query 242 DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL 301
+ KTPK+ + EE++ + E + + +LL +GLR EL N+ +E+++
Sbjct 90 EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF 149
Query 302 KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI 361
+ +I + + + + S +++ YL +R + + +
Sbjct 150 EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR---------- 197
Query 362 DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ 421
K K KL P L +K R G + LR FAT+M + + I L
Sbjct 198 ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL 250
Query 422 GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL 453
G + +I YT + + L++ +A L
Sbjct 251 GHSNLSTTQI----YTKVSTKHLKEAVKKAKL 278
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Length=317
Query 174 SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT 233
SRYT L+ ++ F K + Y+
Sbjct 56 SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY 115
Query 234 IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL 292
D ++ + PK V + +E K + + A +LA +G+R GEL
Sbjct 116 KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC 175
Query 293 NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL 352
N+ I ++DL+ II + + + + + + + L YL R
Sbjct 176 NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR-------------- 219
Query 353 AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK 411
+ + + D + + + R I + +A K+ + LR FAT +
Sbjct 220 --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG 277
Query 412 VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV 454
I + G +I YT LR+ Y +
Sbjct 278 GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR 316
- QUESTION 11: What is the function of these proteins?
Answer: They are recombinases.
There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.
One more round
- QUESTION 12: Answer questions 8-10 again for the new search.
Answer: There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
ID       E       cov   ident   sim/pos
5HXY_A   5e-34   63%   18%     32%
4A8E_A   1e-30   65%   17%     33%
Alignments:
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Query 163 LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI 222
E + SRYT L+ ++ F K + Y+
Sbjct 45 RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ 104
Query 223 LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL 281
D ++ + PK V + +E K + + A +L
Sbjct 105 YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL 164
Query 282 AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF 341
A +G+R GEL N+ I ++DL+ II + + + + + + + L YL R
Sbjct 165 AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--- 219
Query 342 IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR 400
+ + + D + + + R I + +A K+ + LR
Sbjct 220 -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR 266
Query 401 RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV 454
FAT + I + G +I YT LR+ Y +
Sbjct 267 HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR 316
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea
Query 154 IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT 212
+ I Y L SR T I + + S + + +
Sbjct 5 EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK 60
Query 213 SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI 272
S + L + + + KTPK+ + EE++ + E +
Sbjct 61 RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL 120
Query 273 PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK 332
+ +LL +GLR EL N+ +E+++ + +I + + + + S +++
Sbjct 121 RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR 179
Query 333 VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK 392
YL +R + + + K K KL P L +K R G
Sbjct 180 -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV 221
Query 393 RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG 452
+ LR FAT+M + + I L G + +I YT + + L++ +A
Sbjct 222 ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK 277
Query 453 L 453
L
Sbjct 278 L 278
Finding a remote homolog (on your own)
- QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.
- QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:
- "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
- "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04
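Counting significant hits, as done throughout this exercise, amounts to filtering a hit list by the E-value threshold. The sketch below uses the two hits listed above plus the best hit from QUESTION 13 (which comes from the earlier search and is included here only to show a hit being rejected). The list layout and function name are our own assumptions, not a BLAST output format.

```python
SIGNIFICANCE_THRESHOLD = 0.005  # threshold used throughout this exercise

# (description, E-value) pairs taken from the answers to QUESTIONS 13 and 14.
hits = [
    ("putative GPI transamidase component GAA1 [Trypanosoma theileri]", 8e-04),
    ("hypothetical protein (best hit of the first search)", 6.9),
    ("GPI transamidase component Gaa1 [Trypanosoma melophagium]", 1e-05),
]


def significant_hits(hit_list, threshold=SIGNIFICANCE_THRESHOLD):
    """Return hits with an E-value below the threshold, best hit first."""
    kept = [hit for hit in hit_list if hit[1] < threshold]
    return sorted(kept, key=lambda hit: hit[1])


selected = significant_hits(hits)
print(f"{len(selected)} significant hits (E < {SIGNIFICANCE_THRESHOLD})")
for description, e_value in selected:
    print(f"  {e_value:.0e}  {description}")
```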