ExPSIBLAST answer: Difference between revisions
Latest revision as of 10:08, 10 November 2025
Note: E-values etc. were obtained on November 8, 2025.
When BLAST fails
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.
Trying another approach
- QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)?
Answer: After the first iteration, 181 significant hits are found.
- QUESTION 3: How large a fraction of the query sequence do the significant hits typically match (excluding the 100% identity match)?
Answer: For most hits between 45 and 55%. A few hits are lower, down to 5%.
Constructing the PSSM
- QUESTION 4: How many significant hits does BLAST find (E-value < 0.005)?
Answer: 500 (actually, much more than 500 hits, but BLAST by default only shows 500)
- QUESTION 5: What is the E-value of the least significant hit shown on the results page?
Answer: Hit #500 has an E-value of 8e-14, i.e., far smaller than 0.005.
- QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the 100% identity match)?
Answer: 53%-87%
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The significant hits found there were combined into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was built. This is why more sequences are found in the second iteration. A PSSM captures evolutionary information for each position of the sequence, e.g. conserved regions and active sites, as well as regions under less evolutionary pressure (where many different amino acids occur at a given position).
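To make the answer to QUESTION 7 concrete, here is a minimal sketch of how a position-specific matrix can be derived from a multiple alignment. It is not PSI-BLAST's actual procedure: the toy alignment, the uniform background frequencies and the add-one pseudocount are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy multiple alignment (hypothetical sequences, one column per position).
ALIGNMENT = [
    "ACDW",
    "ACEW",
    "ACDW",
    "GCKW",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies (real tools use observed ones).
BACKGROUND = 1.0 / len(AMINO_ACIDS)
# Assumption: simple add-one smoothing so unseen residues get a finite score.
PSEUDOCOUNT = 1.0


def build_pssm(alignment):
    """Return one dict per alignment column mapping amino acid -> log-odds score."""
    n_seqs = len(alignment)
    total = n_seqs + PSEUDOCOUNT * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        pssm.append(scores)
    return pssm


def score(pssm, peptide):
    """Sum the position-specific scores of an ungapped peptide."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


pssm = build_pssm(ALIGNMENT)
```

In this sketch the fully conserved second column rewards C strongly and penalises everything else, while the variable third column gives moderate scores to D, E and K. A single generic matrix such as BLOSUM62 cannot express that kind of per-position difference.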
Saving and reusing the PSSM
- QUESTION 8: Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 16
- QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 5HXY_A with an E-value of 8´2×10^-19; 4A8E_A with an E-value of 2×10^-17.
- QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?
Answer:
ID      cov  ident  sim/pos
5HXY_A  46%  20%    35%
4A8E_A  63%  18%    37%
- QUESTION 11: What is the function of these proteins?
Answer: They are recombinases.
There are two families of site-specific recombinases; the resolvase/invertase family use a serine nucleophile to mediate a concerted double strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two approximately 30 bp recombination core sites catalyse the breaking and rejoining of four DNA phosphodiester bonds.
One more round
- QUESTION 12: Answer questions 8-10 again for the new search.
Answer: There are now 18 significant hits. The two best are still 4A8E_A and 5HXY_A.
ID      E      cov  ident  sim/pos
4A8E_A  3e-55  65%  18%    36%
5HXY_A  3e-41  61%  18%    32%
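The cov, ident and sim/pos columns above can be read as three different ratios over one pairwise alignment. The sketch below shows one way to compute them. The aligned strings are invented, and the similarity groups are a crude assumption standing in for "positive score in the substitution matrix", which is what BLAST actually uses for Positives.

```python
# Assumption: residues in the same group count as "similar". BLAST instead
# counts pairs with a positive substitution-matrix score; these groups are
# only an illustrative approximation.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST"]


def is_positive(a, b):
    """True if two residues are identical or fall in the same similarity group."""
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def alignment_stats(query_aln, subject_aln, query_length):
    """Return (coverage, identity, positives) as fractions.

    query_aln, subject_aln: aligned strings of equal length, '-' marks a gap.
    query_length: length of the full, unaligned query sequence.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")
    aln_len = len(query_aln)
    pairs = list(zip(query_aln, subject_aln))
    # Coverage: share of the query residues that take part in the alignment.
    query_residues = sum(1 for q, _ in pairs if q != "-")
    # Identity and positives are counted per alignment column, gaps included
    # in the denominator.
    identical = sum(1 for q, s in pairs if q != "-" and q == s)
    positive = sum(
        1 for q, s in pairs if q != "-" and s != "-" and is_positive(q, s)
    )
    return query_residues / query_length, identical / aln_len, positive / aln_len


# Hypothetical example: a 12-residue query of which 6 residues are aligned.
coverage, identity, positives = alignment_stats("MKV-LDE", "MRVALNE", query_length=12)
```

Positives can never be lower than identity, because every identical pair also counts as similar. This is why the hits above show around 18% identity but 32-37% similarity.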
Finding a remote homolog (on your own)
- QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.
- QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:
- "GPI transamidase component Gaa1" from Trypanosoma melophagium with an E-value of 1e-05
- "putative GPI transamidase component GAA1" from Trypanosoma theileri with an E-value of 8e-04
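The significance test used throughout this exercise (E-value < 0.005) is a plain threshold filter. The sketch below applies it to the values reported in QUESTION 13 and QUESTION 14. Putting all three hits in one list is an assumption made for illustration only, since they come from two different searches; the function name is hypothetical.

```python
E_VALUE_THRESHOLD = 0.005  # significance cutoff used in the exercise

# (description, E-value) pairs taken from the answers above. The last entry is
# the best hit of the search in QUESTION 13, included to show a rejected hit.
hits = [
    ("GPI transamidase component Gaa1 [Trypanosoma melophagium]", 1e-05),
    ("putative GPI transamidase component GAA1 [Trypanosoma theileri]", 8e-04),
    ("hypothetical protein", 6.9),
]


def significant_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Return the hits with E-value below the threshold, best (lowest) first."""
    return sorted((hit for hit in hits if hit[1] < threshold), key=lambda hit: hit[1])


selected = significant_hits(hits)
for description, e_value in selected:
    print(f"{e_value:.0e}  {description}")
```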