Exercise PSI-BLAST
Written by: Carolina Barra Quaglia
Overview
In this exercise you will learn how to
- Critically assess when BLAST fails (e.g., no significant hits) and explore alternative strategies.
- Use PSI-BLAST to search for remote homologues of a given protein sequence (an orphan gene).
- Interpret iterative PSI-BLAST output (number of hits, coverage, E-value, identity/positives) to assess significance.
- Save and reuse a PSSM (profile) to search specialized databases (e.g., PDB, RefSeq) for structural or functional insights.
- Make a reasoned functional hypothesis for a gene of unknown function (the orphan gene) based on remote homology, domain architecture, structural clues, conserved residues, etc.
Introduction: What are orphan genes?
In genomics and evolutionary biology, an orphan gene (also called a taxonomically-restricted gene, TRG) is a gene for which no detectable homologue exists outside a given species or lineage.
In today's exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and to investigate the function of a real human orphan gene called C22orf45. The aim is a research-style annotation of a “dark” gene that is poorly annotated.
Interestingly, this gene (C22orf45) may have originated from 'junk DNA' and is thought to have gained function through mutations that allowed it to start producing a protein. (You can find more information about the gene here: C22orf45 Publications)
When BLAST fails
Here you have the protein sequence, of unknown function, encoded by the human gene C22orf45. This gene is currently poorly annotated in the human genome, and initial BLAST searches show no obvious homologues. Your task is to use PSI-BLAST to search for remote homologues, explore whether this gene might belong to a known protein family, gain insight into its possible function and structure, and reflect on its status as a potential orphan gene.
>C22orf45
MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP
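Before pasting the sequence into BLAST, it can be worth checking that it was copied completely. The following is a minimal Python sketch, not part of the original exercise: the helper name parse_fasta is hypothetical, and the script only reads the FASTA record, reports its length and flags any character that is not one of the 20 standard amino acids.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its sequence (spaces removed)."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first word after '>'
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.replace(" ", ""))
    return {rid: "".join(chunks) for rid, chunks in records.items()}


# The query sequence exactly as given in the exercise
QUERY_FASTA = """>C22orf45
MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP
"""

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

query = parse_fasta(QUERY_FASTA)["C22orf45"]
unexpected = sorted(set(query) - VALID_RESIDUES)
print(f"length={len(query)} unexpected_characters={unexpected}")
```

If the reported length differs from what you see in the BLAST query summary, part of the sequence was probably lost when copying.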
First we are going to check that BLAST does not find any homologous sequence. Go to the BLAST web-site at NCBI. Select blastp as the algorithm. Paste in the query sequence. Change the database from nr to Protein Data Bank (pdb), and press BLAST (Figure 1).

- Note: If BLAST collapses you can look up pre-run results with the ID GPHA6F6K016 at [Lookup BLAST Job]
- QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?
Trying another approach
Now go back to the BLASTP search page. Paste in the query sequence again. This time, set the database to Non-redundant protein sequences (nr) and select PSI-BLAST (Position-Specific Iterated BLAST) as the algorithm (Figure 2).
IMPORTANT: To allow for more remote homologues we will increase the Expect threshold (E-value) of our search to 100. Note that this risks including non-homologous proteins in our results.
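To get a feeling for what these thresholds mean: the E-value is the number of hits with at least that score expected by chance alone. If chance hits are assumed to follow a Poisson distribution (the usual assumption in BLAST statistics), the probability of seeing at least one such hit is 1 - exp(-E). The short Python sketch below, with a hypothetical function name, compares the thresholds used in this exercise.

```python
import math


def prob_at_least_one_chance_hit(evalue):
    """P(at least one chance hit) for a given E-value, assuming Poisson statistics."""
    return 1.0 - math.exp(-evalue)


# Thresholds that appear in this exercise
for evalue in (0.005, 10, 100):
    p = prob_at_least_one_chance_hit(evalue)
    print(f"E = {evalue:>7}: P(>=1 chance hit) = {p:.6f}")
```

For small E-values the probability is almost equal to E itself, while at E = 10 or E = 100 at least one chance hit is practically guaranteed. This is why hits accepted at such thresholds need to be inspected critically.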

- Note: If BLAST collapses you can look up pre-run results with the ID GPJM9RYM014 at [Lookup BLAST Job]
- QUESTION 2: How many hits do you obtain (E-value < 10)? (Tip: you can see the number by selecting all hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)
- QUESTION 3: Excluding the identical match, which match has the lowest E-value? Provide sequence ID, % identity and query coverage. Are the hits only human, or do they include other mammals/vertebrates?
- QUESTION 4: Based on the first result, is there a clear homologue in non‐human species? What does that suggest about the gene’s taxonomic distribution?
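If you prefer to count hits programmatically instead of clicking through the table, the results can be exported as a tab-separated hit table. The sketch below is an illustration under stated assumptions: it expects the 12 standard columns of BLAST tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and the three rows are made up for demonstration. They are NOT real results for C22orf45. The coverage is computed per alignment from qstart and qend, which can differ from the "Query cover" column on the web page when a hit has several aligned segments.

```python
import csv
import io

COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up rows for illustration only; NOT real results for C22orf45.
EXAMPLE_TABLE = (
    "query\thypothetical_hit_A\t100.0\t159\t0\t0\t1\t159\t1\t159\t1e-110\t320\n"
    "query\thypothetical_hit_B\t41.5\t53\t29\t1\t60\t112\t210\t260\t0.8\t33.1\n"
    "query\thypothetical_hit_C\t35.0\t40\t26\t0\t100\t139\t12\t51\t42\t28.5\n"
)


def read_hits(text, query_length):
    """Parse a tabular hit table and add per-alignment query coverage (%)."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        qstart, qend = int(hit["qstart"]), int(hit["qend"])
        hits.append({
            "subject": hit["sseqid"],
            "identity": float(hit["pident"]),
            "evalue": float(hit["evalue"]),
            "coverage": 100.0 * (qend - qstart + 1) / query_length,
        })
    return hits


def below_threshold(hits, max_evalue):
    """Return hits with E-value below max_evalue, best (lowest) first."""
    kept = [h for h in hits if h["evalue"] < max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# 159 is the length of the C22orf45 query sequence
hits = read_hits(EXAMPLE_TABLE, query_length=159)
for limit in (10, 0.005):
    kept = below_threshold(hits, limit)
    print(f"E < {limit}: {len(kept)} hit(s)")
    for h in kept:
        print(f"  {h['subject']}  identity={h['identity']:.1f}%  "
              f"coverage={h['coverage']:.1f}%  E={h['evalue']:g}")
```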
Constructing the PSSM
Now retain the hits with an E-value<10 to build the PSSM (Position-Specific Scoring Matrix) and run a second iteration of BLAST. Press the Run button at Run PSI-Blast iteration 2 (you can find it at both the bottom and top of the results table).

- Note: If BLAST collapses you can look up pre-run results with the ID GPX0AZ4V016 at [Lookup BLAST Job]
- QUESTION 5: After iteration 2, how many significant hits (E-value < 0.005) are now found? What happened to the E-values of the hits found before?
- QUESTION 6: Explore the Graphic Summary tab. What can you say about the query coverage of the matches?
- QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Explain in your own words the principle of profile‐based search in PSI-BLAST.
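To make the idea of a position-specific scoring matrix more concrete, here is a deliberately simplified Python sketch. It is not how PSI-BLAST computes its profile: the real program weights sequences, uses substitution-matrix-based pseudocounts and realistic background frequencies. Here we assume a uniform background, a flat pseudocount of 1, and a made-up gap-free toy alignment that has nothing to do with C22orf45. Each column gets its own log-odds score for every amino acid, so conserved positions reward the conserved residue and penalize the others.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies

# Made-up toy alignment (equal-length rows, no gaps); not derived from C22orf45.
TOY_ALIGNMENT = [
    "GCSHW",
    "GCTHW",
    "ACSHF",
    "GCSKW",
]


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    n_rows = len(alignment)
    total = n_rows + pseudocount * len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(TOY_ALIGNMENT)
for candidate in ("GCSHW", "ACTKF", "PLMDE"):
    print(candidate, round(score(pssm, candidate), 2))
```

Note how the second candidate still gets a positive score although it is not identical to any single row of the alignment: every one of its residues has been observed at that position in some family member. A fixed substitution matrix cannot capture this kind of position-specific information.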
Saving and reusing the PSSM
You can run a third iteration, but before that, let's save the PSSM for future searches.
In order to do that, go to the top of the PSI-BLAST output page and click Download All, then click PSSM. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable. Change the name of the file to PSSM-2.
Now run the third iteration, this time including all sequences that have an E-value < 0.005.
- Note: If BLAST collapses you can look up pre-run results with the ID GSW70U2V016 at [Lookup BLAST Job]
- QUESTION 8: Can you see any changes in the results now? Look at the E-values, and the query cover on the Graphic Summary tab.
You can save the PSSM again, and rename it to PSSM-3 as a reminder that this one comes from iteration 3.
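The same kind of search can also be run from the command line with the psiblast program from NCBI BLAST+. The Python sketch below only builds and prints the command; it is an illustration, not part of the original exercise. It assumes a local BLAST+ installation and a local copy of the database, and the file names (c22orf45.fasta, PSSM-2.asn) are hypothetical. Note that the command-line program applies one inclusion threshold (-inclusion_ethresh) to all rounds, so changing the threshold between iterations as we did on the web page cannot be reproduced in a single command, and the profile written by -out_pssm is not guaranteed to be identical to the one downloaded from the web page.

```python
import shlex


def build_psiblast_command(query_fasta, database, iterations, pssm_out,
                           report_evalue=100, inclusion_evalue=10):
    """Build (but do not run) a psiblast command line as a list of arguments."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-evalue", str(report_evalue),                # hits reported up to this E-value
        "-inclusion_ethresh", str(inclusion_evalue),  # hits used to build the PSSM
        "-out_pssm", pssm_out,                        # save the profile for later searches
        "-out", f"iteration{iterations}_hits.txt",
    ]


# Hypothetical file names; "nr" assumes a local copy of the database.
command = build_psiblast_command("c22orf45.fasta", "nr", 2, "PSSM-2.asn")
print(shlex.join(command))

# To actually run it (requires BLAST+ and the database to be installed):
# import subprocess
# subprocess.run(command, check=True)
```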
Now that we have our PSSMs we are back on track to answer the original question. What is the function of this orphan gene in humans? You can get some hints from the BLAST searches.
- QUESTION 9: Are there any homologous sequences found in search 2 that have an annotated function?
- QUESTION 10: Are there any homologous sequences found in search 3 that have an annotated function? Is there anything in common with search 2?
We know that function is closely related to protein structure, so we will use our PSSMs to search for structures in PDB.
Open a new BLAST window. Select Protein Data Bank (pdb) as the database. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. Remember to change the Expect threshold to a significant level (E-value < 0.005); by default the value is carried over from the last search, which should be 100. Note: You don't have to paste the query sequence again, it is stored in the PSSM!
PSSM-2
- Note: If BLAST collapses you can look up pre-run results with the ID GR15WYYN016 at [Lookup BLAST Job]
PSSM-3
- Note: If BLAST collapses you can look up pre-run results with the ID GT08HV28016 at [Lookup BLAST Job]
- QUESTION 11: Do you find any significant PDB hits now? Look at the Graphic Summary and query coverage. Is this what you expected? Why or why not?
- QUESTION 12: What is the function of these proteins?
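For reference, searching with a saved profile has a command-line counterpart in BLAST+ as well: psiblast accepts a PSSM through -in_pssm instead of a query sequence (the two options cannot be combined, which mirrors the note above that the query is stored in the PSSM). The Python sketch below only builds and prints the commands. It assumes a local BLAST+ installation, a local copy of NCBI's preformatted PDB protein database (called pdbaa here, which is an assumption about your setup), and that the PSSM files are saved under the names used in this exercise.

```python
import shlex


def build_pssm_search_command(pssm_file, database="pdbaa", max_evalue=0.005):
    """Build (but do not run) a psiblast command that searches with a saved PSSM."""
    return [
        "psiblast",
        "-in_pssm", pssm_file,       # the profile replaces -query
        "-db", database,
        "-evalue", str(max_evalue),  # only report significant hits
        "-outfmt", "6",              # tab-separated hit table
        "-out", f"{pssm_file}_vs_{database}.tsv",
    ]


for pssm in ("PSSM-2", "PSSM-3"):
    print(shlex.join(build_pssm_search_command(pssm)))
```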
Reflection time
Now you have learnt how to construct a PSSM and use it to improve your search when BLAST does not work.
- QUESTION 13: However, can you see any potential risks in doing so? Can we trust the results?
Hint: Think about our orphan gene from humans, the query cover in the PSI-BLAST searches, the PDB structures, and the species where we have found homology.
Finding a remote homolog in a specific taxon (Optional)
PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. It can be used in the same way to find a remote homolog in a specific organism or taxonomic group. For this part we will search the broader database "Reference proteins" (refseq_protein). (Note: we would have liked to do this exercise in the broadest database, nr, but that search runs into technical problems.) Your task in this round is to find out whether the protein with the UniProt ID GPAA1_HUMAN has a homolog in the genus Trypanosoma (unicellular parasites which cause diseases like sleeping sickness or Chagas disease).
First, try a standard blastp search: set Organism to Trypanosoma and Database to refseq_protein (not refseq_select), switch the Low complexity regions filter off, and set the E-value threshold to 10.
- QUESTION 14: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
Then, try PSI-BLAST. Hint: You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in Trypanosoma.
- QUESTION 15: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?
- Note: If BLAST collapses you can look up pre-run results with the ID GTHC3N0F016 at [Lookup BLAST Job]