Exercise: Multiple Alignments (English version)

From 22111
Jump to navigation Jump to search

Exercise written by: Rasmus Wernersson with some editing by Henrik Nielsen and an intermezzo by Anders Gorm Pedersen.


Step 0 - installing Jalview

Go to https://www.jalview.org/, click the green downwards arrow and download the software for your platform. Install and test it. Note that the software starts showing a root window with several subwindows showing pre-loaded examples.



Part 1 - using MAFFT

One of the most widely used alignment programs is the "Clustal" package. It comes in two varieties: ClustalW (with a command line interface) and ClustalX (with a graphical interface). Typically, ClustalX is chosen for interactive use, and ClustalW is chosen when there is a need for automating the workflow. The underlying algorithm is precisely the same, and the results will be identical.

ClustalW/X is free to use and available for many computer systems — including Windows and Mac. If you want to install ClustalX on your own machine, it can be downloaded from here: http://www.clustal.org/clustal2/.

However, as mentioned during the Multiple Alignment lecture, ClustalW/X is neither the fastest nor the most precise program available. No single program performs best in all comparisons, but one of the best and fastest is called MAFFT. It can be downloaded from here: http://mafft.cbrc.jp/alignment/software/ (Note: command line interface only) or used online from here: http://mafft.cbrc.jp/alignment/server/. It is also available in an online version under EBI's multiple alignment page: http://www.ebi.ac.uk/Tools/msa/.


Step 1

For the first part of the exercise we are going to consider a set of alpha-globin genes from a number of different animals. The first task is to construct a useful dataset. Below is a list of GenBank IDs for entries containing the sequences we need (some entries contain more than one gene).

GenBank: Nucleotide search

AB001981
X01831
AH002483 
J00043
J00044
X01086
X07053
AF098919

Open a text editor (e.g. jEdit) — as we find the genes we search for, we need to collect them in a FASTA file using descriptive short unique names.

NOTE:

  1. By definition, FASTA names do not contain spaces, therefore use underscore or dash if you want to specify more than one word.
  2. Names should be unique within the first 15 characters, since some programs (e.g. JalView) only consider the first 15 characters and fail in "interesting" ways if names are identical.
  3. You should ignore the alpha-D-globin gene that has a CDS of only 89 nucleotides; it is not complete.
  4. You should not include "embryonic alpha-type globin pi".
  5. There can be more than one alpha-globin gene in some of the GenBank entries. The resulting FASTA file should have 10 sequences.

For example, a sequence could be named like this:

>goat_alpha_globin_II
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCAGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCACCCCAGTGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA

For every GenBank entry find the genes (CDS features) coding for alpha-globin. We need the DNA sequence for the CDS features specifically — remember that you can click "CDS" to show that sequence only, and then change the display to FASTA format.

Copy each DNA sequence into your text editor as you find it. Give it a descriptive name that conveys which organism it comes from, and what type of alpha-globin it is. Remember to save your work often.

Important: Always use the gene/protein name from the CDS feature, not the GenBank entry name. (There is a trap buried somewhere here, where the entry name is directly misleading). See this PDF: How to locate CDS names in GenBank.

QUESTION 1:
Save your FASTA file and write the filename in your report. Hand in the file as an attachment to your report when you are finished.

Step 2

Go to EBI's multiple alignment page: http://www.ebi.ac.uk/Tools/msa/ and choose the program MAFFT. Paste your sequences in the box (or upload the entire file). Choose "ClustalW" as Output format, and tell the program whether you are submitting Protein or Nucleic Acid sequences. Start the program.

Basically, MAFFT yields a purely text-based output, and all the graphics and tables you can see on the web page are added by EBI's web server.

The tab "Alignment" shows the actual alignment.

QUESTION 2a:
  1. What do you guess that the asterisks ("*") under the alignment mean?
  2. How many stretches of perfectly conserved sequence (of at least, say, 10 nucleotides) can you find? Write down the sequence(s) of the perfectly conserved stretch(es).

In the tab "Guide Tree" a graphical representation of the distances between the sequences is shown (note: this is not a "real" phylogenetic tree — it is an estimate based on the pairwise alignments. Actual phylogenetic trees need a multiple alignment as input — we'll get there next week).

Look at the three main groups (clusters) that the sequences fall into.

QUESTION 2b:
  1. Are the sequences "naturally" (biologically plausibly) placed? Or do the sequences seem to be randomly intermixed?
  2. Do alpha-A and alpha-D seem closely or distantly related, sequence-wise?
  3. What about alpha-1 and alpha-2?
JalView on Mac

It can be difficult to get a good overview over the sequences by looking at the raw text output. As an alternative, we will use the java-based "Jalview" program. Select the Results Viewers tab and click View result with Jalview. This is a link to a .jvl file, which should be opened with Jalview. The .jvl file contains a link to the alignment you just made. Note: if this does not work (on Linux, it doesn't), open Jalview manually and then follow the instructions in the Results Viewers tab.

Jalview opens with a large root window, which contains a smaller alignment window. Note that both windows have their own menus. By default, the alignment is in black and white, but you can choose a colouring scheme in the Colour menu in the alignment window.

Explore the full length of the alignment in the JalView window by using the scrollbar in the bottom of the window — note the colouring of the nucleotides and the "consensus" line at the bottom. Note also that you can get information about the position of each nucleotide in each sequence by positioning the mouse over it.

QUESTION 2c:
Include a screenshot of the alignment window in your report. It should show the 3' part of the alignment (the rightmost part), and it should be coloured by nucleotide.

Step 3

Now translate the DNA sequences to protein sequences and construct a new alignment. Link: https://services.healthtech.dtu.dk/services/VirtualRibosome-2.0/

QUESTION 3:
  1. Insert the translated sequences in FASTA format in your report.
  2. Examine the "Guide tree" tab again — do you get the same results as last time?
  3. Inspect the alignment. How many perfectly conserved stretches can you find now (of at least, say, 5 amino acids)? Write down the sequence(s) of the perfectly conserved stretch(es).

Again, use "Jalview" to inspect the alignment (Note: if you want to launch Jalview via the .jvl file link, you have to close the program first. If you prefer to see both alignments at the same time, you should use the other option, i.e. File → Input Alignment → From URL).

  • Experiment with different colouring schemes for the amino acids.
  • Note also that a "conservation" and a "quality" score is calculated for each position.

Step 4

New data set: Insulin. This FASTA file contains the DNA sequence from the Insulin gene from a range of organisms

Notice that this FASTA file has been auto-generated from a database, and it is currently not that informative with regards to entry names. Before we carry on with the analysis, you need to figure out which organisms the sequences belong to by looking up the entries in GenBank. Based on this information construct a new FASTA file with names that 1) describe what organism each sequence came from and 2) keep in the GenBank ID for later reference.

Naming guideline: For example, the first entry (U00659 (now AH005355)) can be updated to:

>Sheep_AH005355
ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCC
CCGGCCCACGCCTTCGTCAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGAGGTGGAGGGC
CCCCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCCGGCGCGGGTGGCCTGGAGGGGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
GAGAACTACTGTAACTAG
QUESTION 4:
  • Use the naming guideline above to identify all species names, save your FASTA file and write the filename in your report. Hand in the file as an attachment to your report when you are finished.

Notice: As you might notice while you investigate the individual sequences in the FASTA file, it contains a certain level of redundancy (identical sequences). For now we keep in all the entries — we will learn from the multiple alignment we construct in the next step, which of the sequences are identical and should therefore be combined into a single entry.

Step 5

Construct a multiple alignment at the DNA level.

QUESTION 5:
  1. Inspect the alignment (textual + JalView) - can you find any gap which is not a multiple of three (and which therefore cannot correspond to a number of whole codons)? Does it otherwise look like all gaps follow the codon boundaries?
  2. By just eye-balling the differences between the sequences (easiest observed in JalView), can you immediately point to one sequence being the most "remote" (with the most differences compared to the rest)? Does this make taxonomical sense? (Hint: are all the sequences vertebrate?)
  3. As mentioned above, some sequences are identical. Based on the guide tree - which of the sequences can we remove as being redundant (branch length = 0)?

Keep the browser window/tab with the DNA alignment open - we'll need it for comparison in a moment.

Step 6

Now construct a multiple alignment at the peptide level.

  • Once again look over the alignment — and pay special attention to the gaps, which will now truly represent the underlying codons. Try to see if you can find some of the locations that correspond to regions where small gaps had been inserted in the DNA alignment.
  • Why do you think there may be a disagreement between the DNA and peptide alignment?


QUESTION 6:
  1. Inspect the peptide alignment guide tree: Which of the sequences can we safely eliminate now? More than before?

Intermezzo - alternative splicing & protein isoforms

Overview of different types of alternative splicing. CLICK HERE to view full screen PDF version

.

In the alignments we have been working with so far, the proteins have been related by evolution; either orthologous proteins from different organisms or paralogous proteins from the same organism (e.g. alpha-A and alpha-D globin). Now, we will work with a dataset of proteins related by alternative splicing: In some genes, introns can be spliced out of the transcript in more than one way, with the consequence that the same gene can produce a number of different proteins (isoforms). There are several kinds of alternative splicing, summarized in the figure to the right.

When aligning isoforms (proteins related by alternative splicing), it is important to realize that stretches of amino acid sequence are either completely identical (if they originate from the same stretch of nucleotide sequence) or completely unrelated (if they originate from different stretches of nucleotide sequence). Therefore, a correct alignment of isoforms will contain only matches and gaps, no mismatches.

Step 7

We are now going to investigate how three different alignment programs perform on a dataset of isoforms from one particular gene.

Here is a dataset consisting of 11 alternatively-spliced gene products from the human erythrocyte membrane protein band 4.1 (EPB). The goal of this exercise is to compare how well three different popular multiple alignment programs perform when attempting to align a set of proteins that are identical except for having different deletions.

Align the EPB sequences using the MAFFT, MUSCLE, and Kalign servers at the EBI Multiple Sequence Alignment page with Clustalw as output format. For each alignment, keep the window (or tab) open after use.

Now compare the the three alignments you just constructed using the EBI servers. For this purpose you should use the JalView alignment viewer linked on the EBI result pages. Remember that an alignment of isoforms ideally should have only matches and gaps, no mismatches.

QUESTION 7:
Are the three alignments different? Which, if any, of the three alignment methods got the alignment entirely correct?

You should note that this was just one particular form of test. On a different problem the relative performance of the alignment methods could well be different. However, you should also note that this was a fairly simple problem, and one where you could easily see artefacts. That will not be the case for most real biological data sets.

Part 2 - RevTrans

Step 8

As the final step in this exercise, we will have a look at how to get the best of both worlds: how to combine knowledge of both DNA and protein biology in a single multiple alignment (for the theory behind this, please refer to the RevTrans paper, linked on the main course page). RevTrans v.2 uses MAFFT as the default algorithm for constructing the peptide alignment — other options are ClustalW, T-Coffee and Dialign (a locally optimizing program, not available at the EBI servers).


If you haven't read the RevTrans paper yet — please quickly skim through it now (it's an easy read). The paper explains the concept behind the RevTrans method in details (DNA → Protein; Multiple alignment of the proteins; Construct DNA alignment from the DNA sequences using the peptide alignment as a scaffold).

The data set for this part of the exercise will be a cleaned-up version of the Insulin FASTA file from Step 4 above (redundancy reduced and with short informative names), available via this link.

Link to the RevTrans 2.0 server: https://services.healthtech.dtu.dk/services/RevTrans-2.0/

Notice that it's possible to specify which translation table to use. You may recognize the options from VirtualRibosome. This is no coincidence: both servers are using the same underlying algorithm for translating DNA to protein.

Paste in the sequences and start the RevTrans analysis with default settings.

QUESTION 8:
  • Inspect the alignment:
    1. Are gap lengths always a multiple of 3?
    2. Are all codons aligned? (Codon 1st positions will be in the same columns as other 1st positions, 2nd positions only in columns with other 2nd positions etc.).

Closing remark: Currently the RevTRans server does not perform a lot of additional analysis on top of the actual alignment. The idea with the server is to provide input to "downstream" analysis in other tools. For example construction of phylogenetic trees and statistical analysis of silent vs. non-silent mutations (that is, mutations that do not change the amino acid sequence vs. mutations that do).