ExPairwiseAlignment

From 22111
Jump to navigation Jump to search

Exercise written by: Rasmus Wernersson, with modifications by Anne Bresciani, Ulrich Johan Kudahl, Anders Gorm Pedersen and Henrik Nielsen.

Introduction

In this exercise we will be working with pairwise alignment of protein sequences. As explained in today's lecture, pairwise alignment is performed using an algorithm known as Dynamic Programming (DP). In this exercise we will not go into further details about the mathematics behind the algorithm, but rather focus on the practical use of Pairwise Alignment, and how to critically evaluate the result.


Keep the following in mind as we go through the exercise (as mentioned in the lecture):

  • The quality of an alignment is assessed through its alignment score.
  • The alignment score is calculated based on match/mismatch of the amino acids via look-ups in a substitution matrix (for example BLOSUM62). The penalty for introducing gaps in the alignment is governed by two parameters: Gap opening ("expensive" - big penalty, e.g. 10 points) and Gap elongation ("cheap", e.g. 1 point).
  • The BLOSUM substitution matrices are based on empirical data: substitutions that are seen frequently in real proteins have a higher score. We don't cover DNA matrices in this exercise.
  • There are two main variants of the DP algorithm:
    • Global alignment (Needleman-Wunsch) — the sequences are forced to be aligned across their entire length.
    • Local alignment (Smith-Waterman) — only the best matching sub-part of the sequences are aligned).

Part 0 — interactive demo

Go to this website: http://experiments.mostafa.io/public/needleman-wunsch/ and play around with it a bit.

Note how the entire alignment matrix changes interactively if you introduce mutations in the two sequences or change the values of the three scores.

QUESTION 0:

Try replacing the two sequences with the sequences we used in the example in the lecture (slides 36-43). Does the website give you the same alignment matrix? Insert a screenshot of the resulting alignment matrix in your answer.

Part 1 — basic use

The different options at EBI (local and global alignment)

Open the alignment webpage at EBI: https://www.ebi.ac.uk/jdispatcher/psa .

Notice that

  • EBI has help pages for the tools we are going to use. Unfortunately, they are not very easy to find, but here are the direct links:
  • as described in the help pages, a number of sequence formats are supported (e.g. FASTA).

For starters we will work with a pair of prokaryotic serine-proteases (both retrieved from UniProt). The first one (P29600) is the thermostable protease that the company Novozymes sells to the washing-powder industry under the trade-name "Savinase". The other sequence is likewise a thermostable serine-protease, but from a different species of bacteria.

>P29600|SUBS_LEDLE Subtilisin Savinase - Lederbergia lentus
AQSVPWGISRVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGN
GHGTHVAGTIAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLEWAGNNGMHVA
NLSLGSPSPSATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNR
ASFSQYGAGLDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQI
RNHLKNTATSLGSTNLYGSGLVNAEAATR

>P41363|ELYA_ALKHC Thermostable alkaline protease precursor - Alkalihalobacillus halodurans
MRQSLKVMVLSTVALLFMANPAAASEEKKEYLIVVEPEEVSAQSVEESYDVDVIHEFEEI
PVIHAELTKKELKKLKKDPNVKAIEKNAEVTISQTVPWGISFINTQQAHNRGIFGNGARV
AVLDTGIASHPDLRIAGGASFISSEPSYHDNNGHGTHVAGTIAALNNSIGVLGVAPSADL
YAVKVLDRNGSGSLASVAQGIEWAINNNMHIINMSLGSTSGSSTLELAVNRANNAGILLV
GAAGNTGRQGVNYPARYSGVMAVAAVDQNGQRASFSTYGPEIEISAPGVNVNSTYTGNRY
VSLSGTSMATPHVAGVAALVKSRYPSYTNNQIRQRINQTATYLGSPSLYGNGLVHAGRAT
Q
QUESTION 1:
Which sequence format are the two sequences listed in?

Global alignment

Your first task is to make a global alignment.

  • In the Needle section (Short for Needleman & Wunsch) at the site http://www.ebi.ac.uk/Tools/psa/, Click the button labeled "Protein" and paste in the sequences - including the header line - one sequence in each search box.
  • Click the button labeled "More options" (under "STEP 2") and ensure that Matrix is set to "BLOSUM62"
  • Also under "options" ("STEP 2") set End Gap Penalty to "true" (otherwise, the algorithm ignores gaps at the very ends of the alignments, i.e. gaps positioned at the end of one of the sequences would not cost anything).
  • Finally, start the alignment by clicking the button labeled "Submit"
  • Have a look at the result. Notice that the degree of similarity between aligned amino acids is indicated with the following symbols: "|" indicates a perfect match, ":" indicates a "close" mismatch (positive value on the BLOSUM matrix, also known as a conservative substitution — the amino acids share some physico-chemical properties), and "." indicates a "far" mismatch (zero or negative value on the BLOSUM matrix).


QUESTION 2:
Report the following values / observations from the alignment.
  • Alignment score
  • Alignment length
  • % and fraction Identity (The value reported for "Identity" includes perfect matches only)
  • % and fraction Similarity (The value reported for "Similarity" includes perfect matches + "close" mismatches)

Local Alignment

Now, perform the alignment once again (preferably in a new browser tab/window, so it's easy to compare), but this time use the Water (short for Smith-Waterman) algorithm. This can (again) be found on http://www.ebi.ac.uk/Tools/psa/ . Again use protein alignment.

QUESTION 3:
Report the same values as above (Alignment score etc). Consider the alignments produced by the two different approaches: do YOU think one of them is more biologically relevant than the other, or do both contribute valuable information?


QUESTION 4:

Let's go a bit deeper into why the two sequences differ in the N-terminal part: Look up both entries in UniProt (http://www.uniprot.org) and try to locate information regarding the following questions.
  • How were the amino acid sequences of the two proteins determined? (Hint: look at the titles of the papers, and the Cited for fields, listed under Publications).
  • Subcellular localization: Where in (or outside) the cell do the enzymes function?
  • The feature table contains details about the regions/domains of the protein - try to do a comparison to spot the differences between the two UniProt entries (Hint: focus on the "PTM / processing" section).


QUESTION 5:
Based on what you've learned about the P41363 protein from the alignment to Savinase and from the data on the Uniprot site: do you think this could be used as an enzyme in washing powder? (Why? / why not?).

Part 2 — about gaps and dubious alignments

So far we have been comparing sequences that are relatively closely related. It's now the time to compare our prokaryotic Savinase (P29600) protease with a human peptidase (a Tripeptidyl Peptidase), and see if they are related at all.

>sp|P29144|TPP2_HUMAN Tripeptidyl-peptidase 2 
MATAATEEPFPFHGLLPKKETGAASFLCRYPEYDGRGVLIAVLDTGVDPGAPGMQVTTDG
KPKIVDIIDTTGSGDVNTATEVEPKDGEIVGLSGRVLKIPASWTNPSGKYHIGIKNGYDF
YPKALKERIQKERKEKIWDPVHRVALAEACRKQEEFDVANNGSSQANKLIKEELQSQVEL
LNSFEKKYSDPGPVYDCLVWHDGEVWRACIDSNEDGDLSKSTVLRNYKEAQEYGSFGTAE
MLNYSVNIYDDGNLLSIVTSGGAHGTHVASIAAGHFPEEPERNGVAPGAQILSIKIGDTR
LSTMETGTGLIRAMIEVINHKCDLVNYSYGEATHWPNSGRICEVINEAVWKHNIIYVSSA
GNNGPCLSTVGCPGGTTSSVIGVGAYVSPDMMVAEYSLREKLPANQYTWSSRGPSADGAL
GVSISAPGGAIASVPNWTLRGTQLMNGTSMSSPNACGGIALILSGLKANNIDYTVHSVRR
ALENTAVKADNIEVFAQGHGIIQVDKAYDYLVQNTSFANKLGFTVTVGNNRGIYLRDPVQ
VAAPSDHGVGIEPVFPENTENSEKISLQLHLALTSNSSWVQCPSHLELMNQCRHINIRVD
PRGLREGLHYTEVCGYDIASPNAGPLFRVPITAVIAAKVNESSHYDLAFTDVHFKPGQIR
RHFIEVPEGATWAEVTVCSCSSEVSAKFVLHAVQLVKQRAYRSHEFYKFCSLPEKGTLTE
AFPVLGGKAIEFCIARWWASLSDVNIDYTISFHGIVCTAPQLNIHASEGINRFDVQSSLK
YEDLAPCITLKNWVQTLRPVSAKTKPLGSRDVLPNNRQLYEMVLTYNFHQPKSGEVTPSC
PLLCELLYESEFDSQLWIIFDQNKRQMGSGDAYPHQYSLKLEKGDYTIRLQIRHEQISDL
ERLKDLPFIVSHRLSNTLSLDIHENHSFALLGKKKSSNLTLPPKYNQPFFVTSLPDDKIP
KGAGPGCYLAGSLTLSKTELGKKADVIPVHYYLIPPPTKTKNGSKDKEKDSEKEKDLKEE
FTEALRDLKIQWMTKLDSSDIYNELKETYPNYLPLYVARLHQLDAEKERMKRLNEIVDAA
NAVISHIDQTALAVYIAMKTDPRPDAATIKNDMDKQKSTLVDALCRKGCALADHLLHTQA
QDGAISTDAEGKEEEGESPLDSLAETFWETTKWTDLFDNKVLTFAYKHALVNKMYGRGLK
FATKLVEEKPTKENWKNCIQLMKLLGWTHCASFTENWLPIMYPPDYCVF
QUESTION 6:
Compare Savinase to the human peptidase by global alignment (Needle) — remember again to set End Gap Penalty to "true" — and report the following:
  • Alignment score
  • Alignment length
  • Identity and Similarity
  • How large a part of the alignment is gaps?


QUESTION 7:
Repeat the alignment with End Gap Penalty set to "false" and report the same results as above.


QUESTION 8:
Repeat the alignment again — this time using the local alignment algorithm (Water) — and report the same results as above.


QUESTION 9:
Do you think local or global alignment is best for finding similar parts of distantly related proteins? Why?
  • Hint: Distantly related proteins typically share a core, that relates to the function of the protein


When is an alignment "true"?
When we perform an alignment of two (or more) sequences, it's actually a statement of evolutionary relationship: We believe that the sequences share a common ancestor - and that the sequences thus have diverged from an identical gene/protein throughout the eons.
The problem that we face right now in this part of the exercise, is that we're actually NOT sure if the two peptidases are related at all. As we discussed in the lecture, it's often possible to get at least some kind of alignment from unrelated sequences, but then the alignment score will be low. But in order to investigate if the alignment score we got from the alignment above indicates that the alignment is trustworthy, we need to get an idea of what range of alignment scores we can expect by aligning truly unrelated sequences.
It's difficult to find a sequence we can be 100% sure to be unrelated, and even if we did, the differences we might observe in alignment score could be due to differences in length and amino acid composition of the sequences. Instead, we can simply apply a small trick: if we shuffle one of the sequences (i.e. the amino acids are reordered randomly) prior to the alignment, any signature of its evolutionary relationships will be erased, and it will appear unrelated to any protein (while still having the same length and amino acid composition).


  • Open the "Shuffle Protein" page (http://www.bioinformatics.org/sms2/shuffle_protein.html) in a new window/tab, paste in the tripeptidyl peptidase sequence and shuffle it.
  • Then align Savinase and the shuffled tripeptidyl peptidase sequence using local alignment.
  • Repeat the above procedure two more times, so that you align Savinase with three different shuffled versions of tripeptidyl peptidase.


QUESTION 10:
How do the local alignments look? (What are the ranges of Alignment score, Alignment length, Identity, Similarity, and gap percentage)?


QUESTION 11:
Comparing the Savinase/shuffled alignment to the previous Savinase/Human Peptidase alignment - how will you judge the alignment with human peptidase now? (More/Less confidence in relation between the sequences?).

Part 3 — about parameters

Color coded BLOSUM 62 matrix (click to zoom)

The third part of the exercise is meant to illustrate that alignments is dependent on parameters. We can make anything look like similar sequences, if we twist the parameters enough.

Choose the Matrix

First, let's see how the alignment depends on the choice of substitution matrix.

  • Go the the page for local protein alignment.
  • Click "More Options"
  • Choose the substitution matrix "BLOSUM90" instead of the default (BLOSUM62)
  • Align Savinase and Tripeptidyl Peptidase
  • Redo the alignment, this time with the matrix "BLOSUM30"

You can see the BLOSUM matrices on this page: BLOSUM matrix examples


QUESTION 12:
  • What are the alignment results (Length, score, gaps, identity, similarity)?
  • How do alignment length and % identity depend on the BLOSUM number (compare also to your answer to question 8)?

Release the Gaps!

As the last part of the comparison of sequences — let's allow as many gaps as possible, by making it almost "free" for the algorithm to insert gaps.

  • Go the the page for local protein alignment.
  • Click "More Options"
  • Set the matrix back to "BLOSUM62"
  • Set the Gap opening penalty to 1.0 (smallest possible value).
  • Set the Gap extend penalty to 0.0005 (smallest possible value).
  • Align Savinase and Tripeptidyl Peptidase


QUESTION 13:
  • How do the quality parameters look this time (Length, score, gaps, identity, similarity)?
  • Is this alignment biologically meaningful at all?

Epilogue — mostly for fun

Go to UniProt and find the sequences with the IDs GLBE_CHITH and GLB7A_CHITH. These are two globins from a midge, a very small and annoying insect. If you have ever been to Scotland in the summer, you will know about midges. Make a global alignment of GLBE_CHITH and GLB7A_CHITH.

QUESTION 14:

Note that there is a gap of 6 positions in GLBE_CHITH. What is the corresponding 6 amino acid long sequence of GLB7A_CHITH? This is an authentic example! Nature truly is fascinating...