Exercise PSI-BLAST

2025-11-06T11:09:28Z

Carol: /* Saving and reusing the PSSM */

Written by: Carolina Barra Quaglia

==Overview==

In this exercise you will learn how to
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.

==Introduction: What are orphan genes?==

In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.

In today's exercise, you will use the iterative BLAST program (PSI-BLAST) to build sequence profiles and investigate the function of a real human orphan gene called C22orf45. We will aim for a research-style annotation of a “dark” gene that is not well annotated.

Interestingly, this gene (C22orf45) may have originated from so-called 'junk DNA' and is thought to have gained function through mutations that allowed it to be translated into protein.
(More information about the gene is available here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])

==When BLAST fails==

Here is the protein sequence, of unknown function, encoded by the human gene named "C22orf45". This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.

>C22orf45
MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP

First we will check that a standard BLAST search does not find any homologous sequences. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] website at NCBI. Select blastp as the algorithm. Paste in the query sequence. Change the database from nr to Protein Data Bank (pdb), and press BLAST (Figure 1).

[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the BLAST interface. The red arrow shows where to change the database to pdb.]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GPHA6F6K016''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?

==Trying another approach==

Now go back to the [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLASTP] search page. Paste in the query sequence again. This time, set the database to Non-redundant protein sequences (nr) and select PSI-BLAST (Position-Specific Iterated BLAST) as the algorithm (Figure 2).

'''IMPORTANT:''' To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.
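To get a feel for what changing the E-value threshold means, the sketch below uses the standard relation between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n is the total database length. It is only an illustration: the database length used here is a made-up round number, and real BLAST uses effective (edge-corrected) lengths, so the numbers will not match the web output.

```python
import math

# The C22orf45 query sequence from this exercise.
QUERY = (
    "MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG"
    "CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP"
    "CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP"
)

# Assumption: a hypothetical database of one billion residues (not the real size of nr).
HYPOTHETICAL_DB_LENGTH = 1_000_000_000


def evalue_from_bits(bit_score, query_len, db_len):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bits_needed(evalue, query_len, db_len):
    """Bit score a hit must reach to fall at or below the given E-value."""
    return math.log2(query_len * db_len / evalue)


if __name__ == "__main__":
    m = len(QUERY)
    for threshold in (100, 10, 0.005):
        bits = bits_needed(threshold, m, HYPOTHETICAL_DB_LENGTH)
        print(f"E <= {threshold}: needs about {bits:.1f} bits")
```

A looser threshold lowers the bit score a hit has to reach, which is why raising it to 100 lets weaker (and possibly spurious) alignments into the results.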

[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows where to select PSI-BLAST.]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GPJM9RYM014''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 2''': How many hits do you obtain (E-value < 10)? ('''Tip:''' to see the number, select all hits by clicking All under "Sequences producing significant alignments with E-value BETTER than threshold", then look at the number of selected hits.)

* '''QUESTION 3''': Excluding the identical match, what is the highest sequence identity (provide the sequence ID) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?

* '''QUESTION 4''': Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?

===Constructing the PSSM===

Now retain the hits with an E-value < 10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of PSI-BLAST. Press the Run button at Run PSI-BLAST iteration 2 (you can find it at both the bottom and the top of the results table).
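As a rough picture of what a PSSM contains, the following sketch builds a log-odds matrix from a tiny made-up alignment. It is a simplification under stated assumptions: a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself uses sequence weighting, real background frequencies and substitution-matrix-based pseudocounts. The alignment and the function names are hypothetical and only serve as illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical toy alignment (not real PSI-BLAST output).
toy_alignment = ["CSHTW", "CTHSW", "CSHAW", "CAH-W"]
toy_pssm = build_pssm(toy_alignment)

if __name__ == "__main__":
    print(score(toy_pssm, "CSHTW"), score(toy_pssm, "AAAAA"))
```

Conserved columns (here C, H and W) get high scores for the conserved residue and negative scores for most others, so a segment matching the family pattern scores higher than an unrelated one.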

[[File:PSI-BLAST_firstrun.png|100px|center|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GPX0AZ4V016''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 5''': After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?
* '''QUESTION 6''': Explore the Graphic Summary tab. What can you say about the query coverage of the matches?
* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.

===Saving and reusing the PSSM===

You can run a third iteration, but before that, let's save the PSSM for future searches.

To do that, go to the top of the PSI-BLAST output page and click Download All, then click PSSM. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is not really meant to be human-readable. Rename the file to PSSM-2.

Now run a third iteration, this time with the maximum number of sequences that have an E-value < 0.005.

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GSW70U2V016''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 8''': Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.

You can save the PSSM again and rename it to PSSM-3, to remember that this one comes from iteration 3.

Now that we have our PSSMs, we are back on track to answer the original question: what is the function of this orphan gene in humans? You can get some hints from the BLAST searches.

* '''QUESTION 9''': Are there any homologous sequences found in search 2 that have an annotated function?
* '''QUESTION 10''': Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?

Because function is closely tied to protein structure, we will use our PSSMs to search for structures in PDB.

Open ''a new BLAST window''. Select Protein Data Bank (pdb) as the database. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. Remember to change the Expect threshold to a significant level (E-value < 0.005); by default, the form keeps the value from the last search, which should be 100. Run the search once for each saved PSSM (PSSM-2 and PSSM-3). '''Note:''' You don't have to paste the query sequence again, since it is stored in the PSSM!
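For those who prefer the command line, the same save-and-reuse workflow can be sketched with the standalone BLAST+ program psiblast. The snippet below only assembles the command lines and prints them. The file names and the local database names (nr, pdbaa) are assumptions for illustration, and it is assumed, not verified here, that the PSSM file downloaded from the web page can be passed to -in_pssm; check psiblast -help for your installed version before running anything.

```python
import shlex


def psiblast_build_cmd(query_fasta, db, pssm_out, iterations=2,
                       evalue=100, inclusion_ethresh=10):
    """Command that iterates PSI-BLAST and writes the resulting PSSM to a file."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", db,
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),                        # reporting threshold
        "-inclusion_ethresh", str(inclusion_ethresh),  # threshold for entering the PSSM
        "-out_pssm", pssm_out,
    ]


def psiblast_reuse_cmd(pssm_in, db, evalue=0.005):
    """Command that searches another database with a saved PSSM (no -query needed)."""
    return ["psiblast", "-in_pssm", pssm_in, "-db", db, "-evalue", str(evalue)]


if __name__ == "__main__":
    # Hypothetical file and database names.
    print(shlex.join(psiblast_build_cmd("C22orf45.fasta", "nr", "PSSM-2.asn")))
    print(shlex.join(psiblast_reuse_cmd("PSSM-2.asn", "pdbaa")))
```

Keeping the two thresholds separate mirrors the web form: one E-value controls which hits are reported, the other controls which hits are allowed into the profile.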

'''PSSM-2'''

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GR15WYYN016''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

'''PSSM-3'''

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST is unresponsive, you can view pre-run results by entering this ID: '''GT08HV28016''' here: [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 11''': Do you find any significant PDB hits now? Look at the Graphic Summary and the query coverage. Is this what you expected? Why or why not?
* '''QUESTION 12''': What is the function of these proteins?

==Reflection time==

Now you have learnt how to construct a PSSM and use it to improve your search when a standard BLAST search does not work.
* '''QUESTION 13''': However, can you see any potential risks in doing so? Can we trust the results?

'''Hint:''' Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we have found homology.

==Finding a remote homolog in a specific taxon (Optional)==

PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way to find a remote homolog in a specific organism or taxonomic group, and now it is time to search the broader database "Reference proteins" (refseq_protein). ('''Note:''' we would have liked to do this exercise in the broadest database, nr, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID '''GPAA1_HUMAN''' has a homolog in the genus ''Trypanosoma'' (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).
* First, try a standard BlastP (where you set Organism to ''Trypanosoma'', Database to refseq_protein ('''not''' refseq_select), switch the Low complexity regions filter off, and set the E-value threshold to 10).
* '''QUESTION 14''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
* Then, try PSI-BLAST. '''Hint:''' You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in ''Trypanosoma''.
* '''QUESTION 15''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

Exercise PSI-BLAST

2025-11-06T11:07:36Z

Carol: /* Constructing the PSSM */

Written by: Carolina Barra Quaglia

==Overview==

In this exercise you will learn how to
* Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.
* Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).
* Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.
* Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.
* Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.

==Introduction: What are orphan genes?==

In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.

In today's exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. We will aim to do a research‐style annotation of a “dark” gene that is not well annotated.

Interestingly, this gene (C22orf45) may have originated from 'junk DNA' and is thought to have gained function through mutations that allowed it to start producing a protein.
(You can find more known information of the gene here: [https://www.uniprot.org/uniprotkb/P86434/publications C22orf45 Publications])

==When BLAST fails==

Here you have the protein‐coding sequence with unknown function from the human gene named "C22orf45". This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.

>C22orf45
MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP

First we are going to check that BLAST does not find any homologous sequence. Go to the [http://www.ncbi.nlm.nih.gov/BLAST BLAST] web-site at NCBI. Select blastp as the algorithm. Paste in the query sequence. Change the database from nr to Protein Data Bank (pdb), and press BLAST (Figure 1).

[[File:blastp_pdb.png|center|frame|Figure 1. Partial screenshot of the Blast interface. The red arrow shows the settings change to the database to pdb]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GPHA6F6K016''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?

==Trying another approach==

Now go back to the search web-site of [https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLASTP]. Paste in the query sequence again. This time, set the database to Non-redundant protein sequences (nr) and select PSI-BLAST (Position-Specific Iterated BLAST) as the algorithm (Figure 2).

'''IMPORTANT:''' To allow for more remote homologues, we will increase the E-value threshold of our search to 100. Note that this risks including non-homologous proteins in our results.

[[File:psiblastp_nr.png|250px|center|frame|Figure 2. Partial screenshot of the PSI-BLAST interface. The red arrow shows the settings change to PSI-BLAST.]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GPJM9RYM014''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 2''': How many hits do you obtain (E-value < 10)? ('''Tip:''' you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)

* '''QUESTION 3''': Excluding the identical match, what are the highest sequence identity (provide the sequence ID) and the query coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?

* '''QUESTION 4''': Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?
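
When judging the coverage values asked about above, it helps to keep the query length in mind. The following is a minimal sketch, not part of the BLAST software: it reads the C22orf45 record given earlier and computes a simple coverage value for a single aligned region. The helper names <code>read_single_fasta</code> and <code>query_coverage</code> and the example coordinates are made up for illustration; the "Query Cover" column on the NCBI results page combines all aligned regions of a hit, so it can differ from this single-region estimate.

```python
# C22orf45 protein sequence, copied from the FASTA record in this exercise.
FASTA = """>C22orf45
MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP
"""


def read_single_fasta(text):
    """Return (name, sequence) for a FASTA text holding exactly one record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    name = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])
    return name, sequence


def query_coverage(q_start, q_end, q_len):
    """Percentage of the query spanned by one aligned region (1-based, inclusive)."""
    return 100.0 * (q_end - q_start + 1) / q_len


name, seq = read_single_fasta(FASTA)
print(name, "length:", len(seq))

# Hypothetical alignment coordinates, chosen only to show the arithmetic.
print("coverage of positions 1-80: %.1f%%" % query_coverage(1, 80, len(seq)))
```

A hit that aligns to only part of such a short query may reflect a shared domain or a chance match rather than homology over the whole protein.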

===Constructing the PSSM===

Now retain the hits with an E-value<10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the Run button at Run PSI-Blast iteration 2 (you can find it at both the bottom and top of the results table).
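
The selection step above can also be reproduced offline on a downloaded hit table. This is a minimal sketch under stated assumptions: it assumes a tab-separated table in the standard 12-column BLAST tabular layout (E-value in the 11th column, comment lines starting with <code>#</code>), and the rows shown are invented placeholders, not real results. The function name <code>select_hits</code> is hypothetical.

```python
def select_hits(hit_table, max_evalue):
    """Return (subject_id, evalue) pairs for hits with evalue < max_evalue."""
    selected = []
    for line in hit_table.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        evalue = float(fields[10])  # 11th column holds the E-value
        if evalue < max_evalue:
            selected.append((subject_id, evalue))
    return selected


# Made-up rows for illustration only; these are NOT real search results.
EXAMPLE_TABLE = "\n".join([
    "# hypothetical hit table",
    "query\tsubject_A\t100.0\t159\t0\t0\t1\t159\t1\t159\t1e-110\t320",
    "query\tsubject_B\t33.3\t78\t50\t2\t40\t115\t12\t89\t2.1\t33.0",
    "query\tsubject_C\t28.0\t50\t36\t0\t5\t54\t100\t149\t45\t29.0",
])

included = select_hits(EXAMPLE_TABLE, max_evalue=10)        # used to build the PSSM
significant = select_hits(EXAMPLE_TABLE, max_evalue=0.005)  # significance cut-off
print(len(included), "hits below 10;", len(significant), "below 0.005")
```

Keeping the inclusion threshold (here 10) separate from the significance threshold (0.005) makes it explicit that sequences admitted into the profile are not necessarily significant hits themselves.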

[[File:PSI-BLAST_firstrun.png|300px|thumb|frame|Figure 3. Partial screenshot of the PSI-BLAST interface before running Iteration 2. The red square shows how to change the settings for the run.]]

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GPX0AZ4V016''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 5''': After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?
* '''QUESTION 6''': Explore the Graphic Summary tab. What can you say about the query coverage of the matches?
* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.

===Saving and reusing the PSSM===

You can run a third iteration, but before that, let's save the PSSM for future searches.

To do that, go to the top of the PSI-BLAST output page and click Download All, then click PSSM. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.

Now run the third iteration, this time including the maximum number of sequences that have an E-value < 0.005.

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GSW70U2V016''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 8''': Can you see any changes in the results now? Look at the E-values and at the query coverage on the Graphic Summary tab.

You can save the PSSM again and rename it to PSSM-3 to remember that this one comes from iteration 3.

Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches.

* '''QUESTION 9''': Are there any homologous sequences found in search 2 that have an annotated function?
* '''QUESTION 10''': Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?

We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.

Open ''a new BLAST window''. Select Protein Data Bank (pdb) as the database. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. Remember to set the Expect threshold to 0.005 so that only significant hits are reported; by default it keeps the value from the last search, which should be 100. '''Note:''' You don't have to paste the query sequence again, since it is stored in the PSSM!

'''PSSM-2'''

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GR15WYYN016''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

'''PSSM-3'''

<div style="background-color: lavender; border: solid thin grey;">
:'''Note:''' If BLAST fails, you can check pre-run results by entering this ID: '''GSW70U2V016''' at [https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=GetSaved&RECENT_RESULTS=on Lookup BLAST Job]
</div>

* '''QUESTION 11''': Do you find any significant PDB hits now? Look at the Graphic Summary and the query coverage. Is this what you expected? Why?
* '''QUESTION 12''': What is the function of these proteins?

==Reflection time==

Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work.
* '''QUESTION 13''': However, can you see any potential risks in doing so? Can we trust the results?

'''Hint:''' Think about our human orphan gene, the query coverage in the PSI-BLAST searches, the PDB structures, and the species in which we found homology.

==Finding a remote homolog in a specific taxon (Optional)==

PSI-BLAST is useful not only for finding remote homologs in a specific database such as PDB; it can be used in the same way to find a remote homolog in a specific organism or taxonomic group. In this round you will search the broader database "Reference proteins" (refseq_protein). ('''Note:''' we would have liked to do this exercise in the broadest database, nr, but that search runs into technical problems.) Your task is to find out whether the protein with the UniProt ID '''GPAA1_HUMAN''' has a homolog in the genus ''Trypanosoma'' (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).
* First, try a standard BlastP (where you set Organism to ''Trypanosoma'', Database to refseq_protein ('''not''' refseq_select), switch the Low complexity regions filter off, and set the E-value threshold to 10).
* '''QUESTION 14''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
* Then, try PSI-BLAST. '''Hint:''' You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in ''Trypanosoma''.
* '''QUESTION 15''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?

2025-11-06T09:19:04Z

Carol:

Exercise PSI-BLAST ans

2025-11-06T09:18:51Z

Carol:

NEW answers are being updated!

== When BLAST fails ==

* '''QUESTION 1''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: No sequences with E-value below 0.005.

==Trying another approach==

* '''QUESTION 2''': How many hits do you obtain (E-value < 10)? ('''Tip:''' you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)
Answer: This gene is very poorly characterized, and few good hits appear. Only 5 sequences have an E-value below 10: the query sequence itself and 4 more, none of which are significant.

* '''QUESTION 3''': Excluding the identical match, what is the highest sequence identity (provide sequence Id) and coverage among the hits? Are the hits only human, or do they include other mammals/vertebrates?
Answer: The best non-identical hit is WP_340711999.1, a deaminase domain-containing protein from ''Thermoactinomyces'' sp., with 33.33% sequence identity and 48% query coverage. Apart from the self-hit, none of the hits are human. ''Thermoactinomyces'' is a genus of Gram-positive bacteria, so it is somewhat odd to find only a partial match in bacteria before finding any match in vertebrates.

* '''QUESTION 4''': Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?
Answer: Apart from the orphan protein's hit to itself, none of the hits are significant. Their E-values lie between 1 and 10, meaning that one to ten hits with an equally good score would be expected by chance alone. The fact that the hits come from bacteria does not make the homology hypothesis very promising either. However, since a Google search for the C22orf45 orphan gene suggests that its function is unknown, we will continue the searches to see what we get.
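
To make the meaning of these E-values more concrete, here is a small worked sketch. It relies on the standard Poisson assumption behind BLAST statistics, under which the probability of seeing at least one chance hit with a given score or better is P = 1 - exp(-E). The function name is made up for illustration.

```python
import math


def prob_at_least_one_chance_hit(evalue):
    """P(at least one random hit this good), assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)


for e in (0.005, 1, 10, 100):
    print("E = %-6g  P(at least one chance hit) = %.5f" % (e, prob_at_least_one_chance_hit(e)))
```

For small E-values, P and E are nearly identical, which is why a cut-off such as 0.005 can be read roughly as a probability. For E-values of 1 or more, a chance hit is more likely than not, so such hits cannot be taken as evidence of homology on their own.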

=== Constructing the PSSM ===
* '''QUESTION 5''': How many significant hits does BLAST find (E-value < 0.005)?
Answer: 500 (actually many more than 500, but BLAST shows only 500 by default; note that the last hit has an E-value much smaller than 0.005)

* '''QUESTION 6''': How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
Answer: approx. 50-60% sequence coverage, except one (#2) that is 84%.

* '''QUESTION 7''': Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
Answer: The first iteration used the generic BLOSUM62 substitution matrix. The hits found there were assembled into a multiple alignment, from which a more sensitive position-specific scoring matrix (PSSM) was derived. This is why more sequences are found in the second iteration. A PSSM captures evolutionary sequence information, i.e. conserved regions such as active sites, as well as regions under less evolutionary pressure (where many different amino acids occur at a given position).
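
The principle can be illustrated with a deliberately simplified sketch. It is not PSI-BLAST's actual procedure, which additionally weights sequences and uses more elaborate pseudocounts and background frequencies. Here a log-odds score is computed per alignment column from plain residue counts, with a fixed pseudocount and a uniform background frequency of 1/20 (both assumptions). The toy alignment is invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a list of per-column {amino acid: log-odds score} dictionaries."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for col in range(len(alignment[0])):
        # Count residues in this column, ignoring gaps and unknown characters.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(pssm[i][aa] for i, aa in enumerate(segment))


# Invented toy alignment: columns 1, 3 and 5 are fully conserved.
toy_alignment = ["GLRAG", "GLRSG", "GLRTG", "GVRAG"]
pssm = build_pssm(toy_alignment)

print("G at conserved column 1: %.2f" % pssm[0]["G"])
print("W at conserved column 1: %.2f" % pssm[0]["W"])
print("A at variable column 4:  %.2f" % pssm[3]["A"])
```

Unlike BLOSUM62, which gives the same score for a given substitution everywhere, this matrix rewards the conserved residue strongly at conserved columns and is more tolerant at variable ones. That position dependence is what allows the second iteration to detect more distant relatives.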

=== Saving and reusing the PSSM ===
* '''QUESTION 8''': Do you find any significant PDB hits now? If yes, how many?
Answer: Yes, 13

* '''QUESTION 9''': What are the PDB identifiers and the E-values for the two best PDB hits?
Answer: 4A8E_A with an E-value of 2×10<sup>-19</sup> and 5HXY_A with an E-value of 8×10<sup>-19</sup>.

* '''QUESTION 10''': What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits?

'''Answer:'''
{| class="wikitable"
! ID !! Query coverage !! Identity !! Similarity (Positives)
|-
| 4A8E_A || 46% || 21% || 39%
|-
| 5HXY_A || 61% || 18% || 31%
|}

'''Alignments:'''
>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea

Query 242 DSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENIDL 301
+ KTPK+ + EE++ + E + + +LL +GLR EL N+ +E+++
Sbjct 90 EKLKTPKMPKTLPKSLTEEEVRRIINAAETLRDRLILLLLYGAGLRVSELCNLRVEDVNF 149

Query 302 KARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQEI 361
+ +I + + + + S +++ YL +R + + +
Sbjct 150 EYGVIVV-RGGKGGKDRVVPISESLLSEIKR-YLESRNDDSPYLFVEMKR---------- 197

Query 362 DLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINILQ 421
K K KL P L +K R G + LR FAT+M + + I L
Sbjct 198 ---KRKDKLSPKTVWRLVKK----YGRKAGVELTPHQLRHSFATHMLERGIDIRIIQELL 250

Query 422 GRVGPNEFRILKENYTVFTIEDLRKLYDEAGL 453
G + +I YT + + L++ +A L
Sbjct 251 GHSNLSTTQI----YTKVSTKHLKEAVKKAKL 278

>5HXY_A Chain A, Crystal Structure Of Xera Recombinase
Length=317

Query 174 SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTT 233
SRYT L+ ++ F K + Y+
Sbjct 56 SRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQYLAIKAVKLFY 115

Query 234 IGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLLAESGLRPGELL 292
D ++ + PK V + +E K + + A +LA +G+R GEL
Sbjct 116 KALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVLAYTGVRVGELC 175

Query 293 NVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKL 352
N+ I ++DL+ II + + + + + + + L YL R
Sbjct 176 NLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR-------------- 219

Query 353 AAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALRRHFATYMQLKK 411
+ + + D + + + R I + +A K+ + LR FAT +
Sbjct 220 --LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLRHTFATSVLRNG 277

Query 412 VPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV 454
I + G +I YT LR+ Y +
Sbjct 278 GDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR 316

* '''QUESTION 11''': What is the function of these proteins?
Answer: They are recombinases.

There are two families of site-specific recombinases. The resolvase/invertase family uses a serine nucleophile to mediate a concerted double-strand cleavage and rejoining reaction at nucleotide phosphates separated by 2 bp, while the lambda integrase family enzymes use a tyrosine nucleophile to mediate sequential pairs of strand exchanges that are positioned 6–8 bp apart. In site-specific recombination reactions mediated by both families, four recombinase molecules bound to two recombination core sites of approximately 30 bp catalyse the breaking and rejoining of four DNA phosphodiester bonds.

=== One more round ===
* '''QUESTION 12''': Answer questions 8-10 again for the new search.
'''Answer:''' There are now 17 significant hits. The two best are still 4A8E_A and 5HXY_A.
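{| class="wikitable"
! ID !! E-value !! Query coverage !! Identity !! Similarity (Positives)
|-
| 5HXY_A || 5e-34 || 63% || 18% || 32%
|-
| 4A8E_A || 1e-30 || 65% || 17% || 33%
|}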

'''Alignments:'''
>5HXY_A Chain A, Crystal Structure Of Xera Recombinase

Query 163 LLNEVYSNLFSSRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRI 222
E + SRYT L+ ++ F K + Y+
Sbjct 45 RFVEYXTGERKSRYTIKEYRFLVDQFLSFXNKKPDEITPXDIERYKNFLAVKKRYSKTSQ 104

Query 223 LNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIE-HIPSKAYFVLL 281
D ++ + PK V + +E K + + A +L
Sbjct 105 YLAIKAVKLFYKALDLRVPINLTPPKRPSHXPVYLSEDEAKRLIEAASSDTRXYAIVSVL 164

Query 282 AESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEF 341
A +G+R GEL N+ I ++DL+ II + + + + + + + L YL R
Sbjct 165 AYTGVRVGELCNLKISDVDLQESIINV-RSGKGDKDRIVIXAEECVKAL-GSYLDLR--- 219

Query 342 IRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALG-KRFELYALR 400
+ + + D + + + R I + +A K+ + LR
Sbjct 220 -------------LSXDTDNDYLFVSNRRVRFDTSTIERXIRDLGKKAGIQKKVTPHVLR 266

Query 401 RHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLV 454
FAT + I + G +I YT LR+ Y +
Sbjct 267 HTFATSVLRNGGDIRFIQQILGHASVATTQI----YTHLNDSALREXYTQHRPR 316

>4A8E_A Chain A, The Structure Of A Dimeric Xer Recombinase From Archaea

Query 154 IILLSIGFILLNEVYSNLFS-SRYTTISIFTLIVSYMLFIRNKIISSEEEEQIEYEKVAT 212
+ I Y L SR T I + + S + + +
Sbjct 5 EERVRDDTIEEFATYLELEGKSRNTVRMYTYYISKFFE----EGHSPTARDALRFLAKLK 60

Query 213 SYISSLINRILNTKFTENTTTIGQDKQLYDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHI 272
S + L + + + KTPK+ + EE++ + E +
Sbjct 61 RKGYSTRSLNLVIQALKAYFKFEGLDSEAEKLKTPKMPKTLPKSLTEEEVRRIINAAETL 120

Query 273 PSKAYFVLLAESGLRPGELLNVSIENIDLKARIIWINKETQTKRAYFSFFSRKTAEFLEK 332
+ +LL +GLR EL N+ +E+++ + +I + + + + S +++
Sbjct 121 RDRLILLLLYGAGLRVSELCNLRVEDVNFEYGVIVV-RGGKGGKDRVVPISESLLSEIKR 179

Query 333 VYLPAREEFIRANEKNIAKLAAANENQEIDLEKWKAKLFPYKDDVLRRKIYEAMDRALGK 392
YL +R + + + K K KL P L +K R G
Sbjct 180 -YLESRNDDSPYLFVEMKR-------------KRKDKLSPKTVWRLVKK----YGRKAGV 221

Query 393 RFELYALRRHFATYMQLKKVPPLAINILQGRVGPNEFRILKENYTVFTIEDLRKLYDEAG 452
+ LR FAT+M + + I L G + +I YT + + L++ +A
Sbjct 222 ELTPHQLRHSFATHMLERGIDIRIIQELLGHSNLSTTQI----YTKVSTKHLKEAVKKAK 277

Query 453 L 453
L
Sbjct 278 L 278

== Finding a remote homolog (on your own) ==
* '''QUESTION 13''': Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Answer: There are no significant hits. The best hit has an E-value of 6.9, and it is a hypothetical protein.

* '''QUESTION 14''': How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
Answer: There are 2 significant hits:
* "GPI transamidase component Gaa1" from ''Trypanosoma melophagium'' with an E-value of 1e-05
* "putative GPI transamidase component GAA1" from ''Trypanosoma theileri'' with an E-value of 8e-04