Sequence alignment

In today's exercise you will implement a program that performs sequence alignment.

The exercise has 4 parts


Smith Waterman in O3 time

First, get the program templates for today's exercise.

Download the file Align.tar.gz.

Unpack the file (using tar -xzvf Align.tar.gz) and place the resulting Align directory in the "Algo/code" directory (or whatever directory you have created for the course code).

The Align directory should now contain two Jupyter Notebook files and a Python program:

smith_waterman_O3.ipynb
smith_waterman_O2.ipynb
smith_waterman_O2_all_against_all.py

Now we can get started. The two ipynb files contain program templates for Smith-Waterman sequence alignment in O3 and O2 time, respectively.
Finally, the file smith_waterman_O2_all_against_all.py contains an implementation of smith_waterman_O2 that performs an all-against-all sequence alignment.

Note that the matrices in the different programs should be read so that the first index refers to a row and the second to a column.
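
As a small illustration of this indexing convention, here is a sketch that assumes the matrix is stored as nested Python lists (the templates may well use a different container, such as a NumPy array). The names n_rows, n_cols and matrix are made up for this example.

```python
# Hypothetical matrix with 3 rows and 5 columns, filled with zeros.
n_rows, n_cols = 3, 5
matrix = [[0.0] * n_cols for _ in range(n_rows)]

# The first index selects the row, the second index selects the column.
matrix[2][4] = 7.0  # last row, last column

print(len(matrix))     # number of rows: 3
print(len(matrix[0]))  # number of columns: 5
```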

Open the smith_waterman_O3.ipynb program. Go through the code. Check what each box is doing, and make sure you understand how the main program is organized.

Fill in the places marked XX.
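
To make the O3 idea concrete, here is a minimal score-only sketch. It is not the template's code and it makes several assumptions: a simple match/mismatch score replaces the substitution matrix, a gap of length k is assumed to cost gap_open + (k-1)*gap_extension, the matrix is filled from the top-left corner, and the function name sw_score_o3 is hypothetical. The inner loops over all gap lengths are what give the cubic running time.

```python
def sw_score_o3(query, database, gap_open=-11.0, gap_extension=-1.0,
                match=2.0, mismatch=-1.0):
    """Best local alignment score, trying every gap length at every cell."""
    n, m = len(query), len(database)
    # First index = row (query position), second index = column (database position).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == database[j - 1] else mismatch
            # Either start a new alignment (0) or extend along the diagonal.
            score = max(0.0, H[i - 1][j - 1] + s)
            # Gap of length k that skips k query residues (assumed cost model).
            for k in range(1, i + 1):
                score = max(score, H[i - k][j] + gap_open + (k - 1) * gap_extension)
            # Gap of length k that skips k database residues.
            for k in range(1, j + 1):
                score = max(score, H[i][j - k] + gap_open + (k - 1) * gap_extension)
            H[i][j] = score
            best = max(best, score)
    return best


# Cheap gaps are used here only so that a gap shows up in such a short example.
print(sw_score_o3("AAAACCTTTT", "AAAATTTT", gap_open=-2.0, gap_extension=-1.0))  # 13.0
```

In this toy example the best alignment matches all eight database residues (8 * 2 = 16) and pays for one gap of length 2 (-2 - 1 = -3).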

Once you have completed the code, you can test the program using the example given in the Test box.

If you run the example called "#Matrix dump exercise 1", you should get:

ALN IDVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN 99 AEVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPEYLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN 97 98 64 439.0
QAL 1 DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL 1 EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN

The first line starts with the keyword ALN. This line has the format

ALN Q-name Q-len D-name D-len alignment-length nid alignment-score 

Make sure you understand each of these fields. The next two lines contain the sequence alignment in the format

QAL q-offset Q-alignment
DAL d-offset D-alignment

Here q-offset and d-offset are the starting positions in the query and database sequences, respectively (counting from 0).
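
If you want to check the fields of the ALN line programmatically, it can be split on whitespace. The sketch below assumes that nid is the number of identical positions in the alignment, and it uses the short placeholder names Query and Database instead of real sequence names. The function name parse_aln_line is hypothetical.

```python
def parse_aln_line(line):
    """Split an ALN line into named fields, following the format given above."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "ALN":
        raise ValueError("not an ALN line: " + line)
    return {
        "q_name": fields[1],
        "q_len": int(fields[2]),
        "d_name": fields[3],
        "d_len": int(fields[4]),
        "aln_len": int(fields[5]),
        "nid": int(fields[6]),       # assumed: number of identical positions
        "score": float(fields[7]),
    }


# Placeholder names; the numbers are those of the example above.
aln = parse_aln_line("ALN Query 99 Database 97 98 64 439.0")
percent_identity = 100.0 * aln["nid"] / aln["aln_len"]
print(aln["score"], round(percent_identity, 1))
```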

Test the program on the other examples in the "Now test the code on a few examples" box.

Next, test how fast the program is by aligning a query sequence against a database of 100 sequences. You do this in the box "Align query against database. Might take a while. Go get some coffee". This will take some time, so be patient.

Finally, download the notebook as a Python program and add a command-line parser with the options

  -q QUERY_FILE      File with query sequence
  -db DB_FILE        File with database sequence
  -go GAP_OPEN       Value of gap open (-11.0)
  -ge GAP_EXTENSION  Value of gap extension (-1.0)
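
One possible way to get these options is Python's standard argparse module. This is only a sketch: the dest names, the description text and the file name query.fsa are assumptions, and the argument list is passed explicitly here so that the example also runs inside a notebook.

```python
from argparse import ArgumentParser


def build_parser():
    """Create a parser with the four options listed above."""
    parser = ArgumentParser(description="Smith-Waterman alignment")
    parser.add_argument("-q", dest="query_file", type=str,
                        help="File with query sequence")
    parser.add_argument("-db", dest="db_file", type=str,
                        help="File with database sequence")
    parser.add_argument("-go", dest="gap_open", type=float, default=-11.0,
                        help="Value of gap open (-11.0)")
    parser.add_argument("-ge", dest="gap_extension", type=float, default=-1.0,
                        help="Value of gap extension (-1.0)")
    return parser


# In the real program, call parse_args() without arguments to read sys.argv.
args = build_parser().parse_args(["-q", "query.fsa", "-db", "db_100.tab"])
print(args.query_file, args.db_file, args.gap_open, args.gap_extension)
```

Because the metavar defaults to the upper-cased dest, the generated help text shows QUERY_FILE, DB_FILE, GAP_OPEN and GAP_EXTENSION as in the listing above.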

Smith Waterman in O2 time

Now you can look at the other alignment program. This program implements the Smith-Waterman alignment algorithm in O2 time using the method described by Gotoh. Open the file smith_waterman_O2.ipynb. Make sure you understand the structure of the program.
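
The core of the Gotoh method can be sketched as follows. This is a score-only illustration under stated assumptions, not the template's code: a simple match/mismatch score replaces the substitution matrix, a gap of length k is assumed to cost gap_open + (k-1)*gap_extension, the matrices are filled from the top-left corner, and the names sw_score_o2, H, E and F are chosen for this example. The two extra matrices remember the best score of an alignment that ends in a gap, so no loop over gap lengths is needed.

```python
def sw_score_o2(query, database, gap_open=-11.0, gap_extension=-1.0,
                match=2.0, mismatch=-1.0):
    """Best local alignment score with affine gaps, one pass over the matrix."""
    n, m = len(query), len(database)
    neg_inf = float("-inf")
    # First index = row (query position), second index = column (database position).
    H = [[0.0] * (m + 1) for _ in range(n + 1)]      # best score ending at (i, j)
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # ends in a gap along the row
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # ends in a gap along the column
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap from H, or extend the gap already stored in E or F.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extension)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extension)
            s = match if query[i - 1] == database[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# Cheap gaps are used here only so that a gap shows up in such a short example.
print(sw_score_o2("AAAACCTTTT", "AAAATTTT", gap_open=-2.0, gap_extension=-1.0))  # 13.0
```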

Now fill in the lines blanked out with X's to make the code run, and test the code on the examples in the "Now test the code on a few examples" box.

The example "Matrix dump exercise 1" should come out as

ALN Query 99 Database 97 98 439.0 64
QAL DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN

Now you can test the speed of the program.

How does this time compare with the time used by the O3 code? How many times faster does the O2 code run? Also make sure the two programs produce the same alignments (for instance, use the diff command on the output generated by the two programs).
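
If you prefer to do the comparison from Python rather than with diff, a sketch along the following lines could be used. The helper names outputs_identical and timed are hypothetical, and the timed call below uses a stand-in workload rather than a real alignment.

```python
import time


def outputs_identical(path_a, path_b):
    """Return True if the two output files contain exactly the same lines."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return file_a.read().splitlines() == file_b.read().splitlines()


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock time in seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace it with a call to your own alignment function.
total, seconds = timed(sum, range(1000))
print("result:", total, "elapsed seconds:", seconds)
# With timings t_o3 and t_o2 measured on the same input, the speed-up is t_o3 / t_o2.
```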

Finally, download the code as a Python program and add a command-line parser with the options

  -q QUERY_FILE      File with query sequence
  -db DB_FILE        File with database sequence
  -go GAP_OPEN       Value of gap open (-11.0)
  -ge GAP_EXTENSION  Value of gap extension (-1.0)

Aligning all against all

The last file in the Align code directory is a Python program that runs an all-against-all alignment using the O2 algorithm. This program is complete, so here you only need to go through the code to be sure you understand what is going on.

You can test the program using the commands

cd data/Align
head -3 db_100.tab > small_db
cd ../../code/Align
python smith_waterman_O2_all_against_all.py -db small_db

We can now do a large-scale calculation where 1000 protein sequences are aligned against each other. We will use the output from this calculation in Wednesday's exercise.

We will, however, not do this here, since the calculation takes too long to complete. You will get the result file from me tomorrow.


This is it for today!!! Remember to upload the two alignment programs via DTU-Learn