ExPSIBLAST

From 22111
Revision as of 11:01, 15 March 2024 by WikiSysop (talk | contribs) (→‎When BLAST fails)

Originally written by: Morten Nielsen — some editing by Rasmus Wernersson and Bent Petersen — new version by Henrik Nielsen.

Introduction

Earlier in the course you have used the BLAST program to perform fast alignments of DNA and protein sequences. As shown in today's lecture, BLAST will often fail to recognize relationships between proteins with low sequence similarity. In today's exercise, you will use the iterative BLAST program (PSI-BLAST) to calculate sequence profiles and see how such profiles can be used to:

  • Identify relationships between proteins with low sequence similarity
  • Identify conserved residues in protein sequences (residues important for the structural stability or function of the protein)

When BLAST fails

Partial screenshot of the BLAST interface. The red arrow marks where the database setting is changed to pdb.

Say you have a query sequence (Query1, pasted below) and you want to make predictions about its function and structure. As seen earlier in the course, you will most often use BLAST to do this. However, what happens when BLAST fails?

>QUERY1
MKDTDLSTLLSIIRLTELKESKRNALLSLIFQLSVAYFIALVIVSRFVRYVNYITYNNLV
EFIIVLSLIMLIIVTDIFIKKYISKFSNILLETLNLKINSDNNFRREIINASKNHNDKNK
LYDLINKTFEKDNIEIKQLGLFIISSVINNFAYIILLSIGFILLNEVYSNLFSSRYTTIS
IFTLIVSYMLFIRNKIISSEEEEQIEYEKVATSYISSLINRILNTKFTENTTTIGQDKQL
YDSFKTPKIQYGAKVPVKLEEIKEVAKNIEHIPSKAYFVLLAESGLRPGELLNVSIENID
LKARIIWINKETQTKRAYFSFFSRKTAEFLEKVYLPAREEFIRANEKNIAKLAAANENQE
IDLEKWKAKLFPYKDDVLRRKIYEAMDRALGKRFELYALRRHFATYMQLKKVPPLAINIL
QGRVGPNEFRILKENYTVFTIEDLRKLYDEAGLVVLE

Go to the BLAST web-site at NCBI. Select blastp as the algorithm. Paste in the query sequence. Change the database from nr to pdb, and press BLAST.

  • QUESTION 1: How many significant hits does BLAST find (E-value < 0.005)?

Trying another approach

Partial screenshot of the PSI-BLAST interface. The red arrow marks where the algorithm setting is changed to PSI-BLAST.

Now go back to the BLAST web-site. Paste in the query sequence Query1. This time, set the database to nr and select PSI-BLAST (Position-Specific Iterated BLAST) as the algorithm. IMPORTANT: To ease the load on the NCBI server, limit the search to Archaea (TaxID 2157) when searching Query1 in nr.

  • QUESTION 2: How many significant hits does BLAST find (E-value < 0.005)? (Tip: you can see the number by selecting all significant hits (clicking All under Sequences producing significant alignments with E-value BETTER than threshold) and then looking at the number of selected hits)
  • QUESTION 3: How large a fraction (Query coverage) of the query sequence do the significant hits match (excluding the identical match)?
  • QUESTION 4: Do you find any PDB hits among the significant hits? (Tip: look for a PDB identifier in the Accession column. A PDB identifier is a four-character code whose first character is a digit; in the BLAST output it is followed by an underscore and a single-letter chain name, such as "1XYZ_A")
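
The three checks behind Questions 2-4 (E-value cutoff, query coverage, PDB-style accession) can be sketched in a few lines of Python. This is only an illustration: the hit list, the accessions, the coordinates and the query length of 400 are made-up placeholders, not real BLAST output, and the regular expression is a loose assumption based on the identifier pattern described in the tip above.

```python
import re

E_VALUE_THRESHOLD = 0.005  # significance cutoff used throughout this exercise

# Hypothetical hits for illustration only (NOT real BLAST results):
# (accession, E-value, first aligned query position, last aligned query position)
hits = [
    ("WP_000000001.1", 0.0, 1, 400),
    ("WP_000000002.1", 2e-30, 201, 400),
    ("1XYZ_A", 3e-04, 101, 300),
    ("WP_000000003.1", 0.8, 300, 350),
]
QUERY_LENGTH = 400  # placeholder value, not the true length of Query1

# Assumed pattern: a digit, three more letters/digits, an underscore, a chain name.
PDB_ACCESSION = re.compile(r"^[0-9][A-Za-z0-9]{3}_[A-Za-z0-9]+$")


def is_significant(e_value, threshold=E_VALUE_THRESHOLD):
    """A hit counts as significant when its E-value is below the threshold."""
    return e_value < threshold


def query_coverage(q_start, q_end, query_length):
    """Fraction of the query covered by one alignment (1-based, inclusive)."""
    return (q_end - q_start + 1) / query_length


def looks_like_pdb(accession):
    """True if the accession matches the assumed PDB-style pattern."""
    return PDB_ACCESSION.match(accession) is not None


significant = [h for h in hits if is_significant(h[1])]
print("significant hits:", len(significant))
for accession, e_value, q_start, q_end in significant:
    coverage = query_coverage(q_start, q_end, QUERY_LENGTH)
    print(accession, e_value, f"{coverage:.0%}", looks_like_pdb(accession))
```

In the real exercise you read these values directly from the BLAST results table; the sketch only makes explicit what "significant", "Query coverage" and "PDB identifier" mean.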

Constructing the PSSM

Note: If you see the error message “Entrez Query: txid2157 [ORGN] is not supported”, then click Recent Results in the upper right part of the BLAST window, select your most recent search, and try again.

Now run a second BLAST iteration in order to construct a PSSM (Position-Specific Scoring Matrix). Press the Run button at Run PSI-Blast iteration 2 (you can find it at both the bottom and top of the results table).

  • QUESTION 5: How many significant hits does BLAST find (E-value < 0.005)?
  • QUESTION 6: How large a fraction of the query sequence do the 20 most significant hits match (do not include the first hit since this is identical to the query)?
  • QUESTION 7: Why does BLAST come up with more significant hits in the second iteration? Make sure you answer this question and understand what is going on!
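
To make the idea of a position-specific scoring matrix more concrete, here is a minimal sketch that builds a toy PSSM from a small ungapped alignment. It is a simplification and not the procedure PSI-BLAST actually uses: the alignment is invented, the background frequencies are assumed to be uniform, all sequences get equal weight, and a fixed pseudocount of 1 is used.

```python
import math
from collections import Counter

# Toy alignment, invented for illustration: one row per sequence, no gaps.
alignment = ["MKLV", "MKIV", "MRLV", "MKLI"]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: one pseudocount per amino acid


def build_pssm(sequences):
    """Return one dict per alignment column mapping amino acid -> log-odds score."""
    pssm = []
    total = len(sequences) + PSEUDOCOUNT * len(ALPHABET)
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        column = {}
        for aa in ALPHABET:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            column[aa] = math.log2(freq / BACKGROUND)
        pssm.append(column)
    return pssm


def score(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


pssm = build_pssm(alignment)
print(round(pssm[0]["M"], 2), round(pssm[0]["A"], 2))
print(score(pssm, "MKLV") > score(pssm, "AAAA"))
```

The point to notice is that the score for an amino acid now depends on the position in the alignment, whereas a standard substitution matrix such as BLOSUM62 gives the same score for a given pair of amino acids everywhere in the sequence.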

Saving and reusing the PSSM

This time, we will not ask you to look for PDB identifiers manually among the significant hits. Instead, you should save the PSSM that PSI-BLAST has created and use it for searching PDB directly.

Go to the top of the PSI-BLAST output page and click Download All, then click PSSM. Save the file to a place on your computer where you can find it again! You can take a look at this file using Geany, but it is really not meant to be human-readable.

Then, open a new BLAST window (this is important: you will need your first BLAST window again later) where you again select PSI-BLAST as the algorithm. Select pdb as the database. Do not limit your search to Archaea this time. Click on Algorithm parameters to show the extended settings. Click the button next to Upload PSSM and select the file you just saved. Note: You don't have to paste the query sequence again, because it is stored in the PSSM!

  • QUESTION 8: Do you find any significant PDB hits now? If yes, how many?
  • QUESTION 9: What are the PDB identifiers and the E-values for the two best PDB hits?
  • QUESTION 10: What are the values for Query coverage, sequence identity, and sequence similarity (Positives) for the two best PDB hits? (Tip: click on the description to get to the actual alignment between the query sequence and the PDB hit)
  • QUESTION 11: What is the function of these proteins?
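
The exercise only requires the web interface, but the same save-and-reuse workflow can also be sketched for the command-line psiblast program from NCBI BLAST+. The sketch below only assembles and prints the two command lines; it does not run them. It assumes a local BLAST+ installation and locally available databases, and the file names (query1.fasta, query1_iter2.pssm) and database names (nr_archaea, pdbaa) are hypothetical placeholders.

```python
import shlex


def build_pssm_command(query_fasta, database, pssm_out,
                       iterations=2, inclusion_ethresh=0.005):
    """Command line that iterates PSI-BLAST and saves the resulting PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),
        "-out_pssm", pssm_out,
    ]


def search_with_pssm_command(pssm_in, database, evalue=0.005):
    """Command line that searches another database with a saved PSSM.

    No -query is given: the query sequence is stored in the PSSM file.
    """
    return [
        "psiblast",
        "-in_pssm", pssm_in,
        "-db", database,
        "-evalue", str(evalue),
    ]


# Hypothetical file and database names, for illustration only.
build_cmd = build_pssm_command("query1.fasta", "nr_archaea", "query1_iter2.pssm")
search_cmd = search_with_pssm_command("query1_iter2.pssm", "pdbaa")
print(shlex.join(build_cmd))
print(shlex.join(search_cmd))
```

Building the commands as lists keeps each option next to its value and makes it easy to pass them to subprocess.run later, if you do have BLAST+ installed.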

One more round

Let's try one more iteration of PSI-BLAST:

  • Go back to your first BLAST window (the one with the results from the nr database limited to Archaea) and press the Run button at Run PSI-Blast iteration 3.
  • Save the resulting PSSM file (make sure you give it a different name!).
  • Launch a new PSI-BLAST search against pdb in all organisms using this PSSM (you may have to click on Clear to erase your first PSSM file from the server).
  • QUESTION 12: Answer questions 8-10 again for the new search.

Finding a remote homolog (on your own)

PSI-BLAST is not only useful for finding a remote homolog in a specific database such as PDB. Now it is time to search the broader database "Reference proteins" (refseq_protein). (Note: we would have liked to do this exercise in the broadest database, nr, but that search runs into technical problems.) PSI-BLAST can be used in the same way to find a remote homolog in a specific organism or taxonomic group. Your task in this round is to find out whether the protein with the UniProt ID GPAA1_HUMAN has a homolog in the genus Trypanosoma (unicellular parasites that cause diseases such as sleeping sickness and Chagas disease).

  • First, try a standard BlastP (where you set Organism to Trypanosoma, Database to refseq_protein (not refseq_select), switch the Low complexity regions filter off, and set the E-value threshold to 10).
  • QUESTION 13: Do you find any significant (E<0.005) hits? What is the E-value of the best hit?
  • Then, try PSI-BLAST. Hint: You need to search in all organisms (still using refseq_protein) to build a PSSM, then save your PSSM and use that to search in Trypanosoma.
  • QUESTION 14: How many significant (E<0.005) hits do you find now? What is the E-value of the best hit?


Concluding remarks

Now you have seen the power of sequence profiles in general and the PSI-BLAST program in particular. Using sequence profiles, you have been able to identify relationships between protein sequences that share far less than 30% sequence similarity. Further, you have made qualified predictions on the protein function and selected a set of essential amino acids suitable for experimental validation of the structural and functional predictions.