In today's exercise you shall implement a program to perform sequence alignment.
The exercise has 4 parts.
First you must access the program templates for today's exercise.
Download the file Align.tar.gz
Unpack the file (using tar -xzvf Align.tar.gz) and place the created Align directory in the "Algo/code" directory (or whatever directory you have created for the course code material).
The Align directory should now contain two Jupyter Notebook files and a Python program:
smith_waterman_O3.ipynb smith_waterman_O2.ipynb smith_waterman_O2_all_against_all.py
Now we can get started. The two ipynb files contain program templates
for Smith-Waterman sequence alignment in O3 (cubic) and O2 (quadratic) time, respectively.
The file smith_waterman_O2_all_against_all.py
contains an implementation of smith_waterman_O2 that performs an all-against-all sequence alignment.
Note that the matrices in the different programs should be read so that the first index refers to a row and the second to a column.
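As a small illustration of this indexing convention, the sketch below builds a matrix as a nested Python list. The two sequences, and the choice of one row per query residue and one column per database residue, are assumptions made for the example; check the templates for the layout they actually use.

```python
# Hypothetical toy sequences, used only to give the matrix a shape.
query = "ACD"
database = "ACDEF"

# Assumed layout: one row per query residue, one column per database residue.
matrix = [[0.0] * len(database) for _ in query]

# First index = row, second index = column.
matrix[1][3] = 7.0  # row 1 (query 'C'), column 3 (database 'E')

n_rows = len(matrix)      # number of rows
n_cols = len(matrix[0])   # number of columns
```

The same reading applies to a NumPy array, where the element would be written matrix[1, 3].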
Open the smith_waterman_O3.ipynb program. Go through the code. Check what each box is doing, and make sure you understand how the main program is organized.
Fill in the places marked with XX.
Once you have completed the code, you can test the program using the example given in the Test box.
If you run the example called "#Matrix dump exercise 1", you should get
ALN IDVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN 99 AEVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPEYLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN 97 98 64 439.0
QAL 1 DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL 1 EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN
The first line starts with the keyword ALN. This line has the format
ALN Q-name Q-len D-name D-len alignment-length nid alignment-score
Make sure you understand each of these fields. The next two lines contain the sequence alignment in the format
QAL q-offset Q-alignment
DAL d-offset D-alignment
Here the q-offset and d-offset are the starting positions of the alignment in the query and database sequences, respectively (counting from 0).
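To check your reading of the fields, the sketch below parses an ALN line in the 8-field layout given above and recomputes nid from a pair of aligned strings. The function names and the toy input are made up for illustration and are not part of the templates. In the example output above, the alignment length (98) equals the number of alignment columns, gap columns included, so the length is taken as the length of the aligned string.

```python
def parse_aln_line(line):
    """Split an ALN line of the form
    ALN Q-name Q-len D-name D-len alignment-length nid alignment-score."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "ALN":
        raise ValueError("not an ALN line in the 8-field layout")
    return {
        "q_name": fields[1],
        "q_len": int(fields[2]),
        "d_name": fields[3],
        "d_len": int(fields[4]),
        "aln_len": int(fields[5]),
        "nid": int(fields[6]),
        "score": float(fields[7]),
    }


def count_identities(q_aln, d_aln):
    """Count alignment columns where both sequences carry the same residue."""
    if len(q_aln) != len(d_aln):
        raise ValueError("aligned strings must have the same length")
    return sum(1 for q, d in zip(q_aln, d_aln) if q == d and q != "-")


# Toy alignment (hypothetical data): 6 columns, one of them a gap.
q_aln = "ACD-EF"
d_aln = "ACDGEY"
aln_len = len(q_aln)
nid = count_identities(q_aln, d_aln)
```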
Test the program on the other examples in the "Now test the code on a few examples" box.
Next, test how fast the program is by aligning a query sequence against a database of 100 sequences. Do this in the box "Align query against database. Might take a while. Go get some coffee". This will take some time, so be patient.
Finally, download the program as a Python program and add a command-line parser with the options
-q QUERY_FILE        File with query sequence
-db DB_FILE          File with database sequence
-go GAP_OPEN         Value of gap open (-11.0)
-ge GAP_EXTENSION    Value of gap extension (-1.0)
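One way to set up such a parser is with argparse from the Python standard library. The following is a minimal sketch, not the required solution: the dest names, the description text and the file name query.fsa are assumptions chosen for the example, while the option flags and default values are the ones listed above.

```python
import argparse


def build_parser():
    """Build a parser with the four options listed in the exercise."""
    parser = argparse.ArgumentParser(
        description="Smith-Waterman alignment of a query against a database")
    parser.add_argument("-q", dest="query_file", metavar="QUERY_FILE",
                        help="File with query sequence")
    parser.add_argument("-db", dest="db_file", metavar="DB_FILE",
                        help="File with database sequence")
    parser.add_argument("-go", dest="gap_open", type=float, default=-11.0,
                        metavar="GAP_OPEN",
                        help="Value of gap open (-11.0)")
    parser.add_argument("-ge", dest="gap_extension", type=float, default=-1.0,
                        metavar="GAP_EXTENSION",
                        help="Value of gap extension (-1.0)")
    return parser


# Parse an explicit argument list here; in the real program you would call
# build_parser().parse_args() without arguments to read the command line.
# "query.fsa" is a hypothetical file name.
args = build_parser().parse_args(["-q", "query.fsa", "-db", "db_100.tab"])
gap_open = args.gap_open
gap_extension = args.gap_extension
```

Setting type=float converts the gap values to numbers, and negative values such as -go -9 are accepted because none of the option names look like numbers.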
Now you can look at the other alignment program. This program implements the Smith-Waterman alignment algorithm in O2 time using the method described by Gotoh. Open the file smith_waterman_O2.ipynb. Make sure you understand the structure of the program.
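For orientation, the idea behind the O2 method can be sketched as a score-only recursion with three matrices: D for the best local score ending in a match state, and P and Q for the best scores ending in a gap in either sequence. This is a generic textbook-style sketch, not the layout of the template: the function name, the simple match/mismatch scoring (a stand-in for whatever substitution scores the template uses) and the direction in which the matrices are filled are all assumptions. A gap of length k costs gap_open + (k - 1) * gap_extension.

```python
def sw_affine_score(query, database, match=5.0, mismatch=-4.0,
                    gap_open=-11.0, gap_extension=-1.0):
    """Best local alignment score with affine gaps, in O(len(query) * len(database)) time."""
    n, m = len(query), len(database)
    neg_inf = float("-inf")
    # First index = row (query position), second index = column (database position).
    D = [[0.0] * (m + 1) for _ in range(n + 1)]      # best score ending at (i, j)
    P = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # ending in a gap along the rows
    Q = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # ending in a gap along the columns
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from D or extend an existing one.
            P[i][j] = max(D[i - 1][j] + gap_open, P[i - 1][j] + gap_extension)
            Q[i][j] = max(D[i][j - 1] + gap_open, Q[i][j - 1] + gap_extension)
            s = match if query[i - 1] == database[j - 1] else mismatch
            # Local alignment: the score is never allowed to drop below zero.
            D[i][j] = max(0.0, D[i - 1][j - 1] + s, P[i][j], Q[i][j])
            best = max(best, D[i][j])
    return best


# Toy example: eight matches and one gap of length 1.
score = sw_affine_score("AAAATTTT", "AAAACTTTT", gap_open=-3.0, gap_extension=-1.0)
```

Because each cell only looks at its three neighbours, no inner loop over gap lengths is needed, which is what removes one factor of the sequence length compared with the O3 code.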
Now fill in the lines blanked out with X's to make the code run, and test the code on the examples in the "Now test the code on a few examples" box.
The example "Matrix dump exercise 1" should come out as
ALN Query 99 Database 97 98 439.0 64
QAL DVLLGADDGSLAFVPSEFSISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLNAKGETFEVALSNKGEYSFYCSPHQGAGMVGKVTVN
DAL EVKLGSDDGGLVFSPSSFTVAAGEKITFKNNAGFPHNIVFDEDEVPAGVNAEKISQPE--YLNGAGETYEVTLTEKGTYKFYCEPHAGAGMKGEVTVN
Now you can test the speed of the program.
How does this time compare with the time used by the O3 code? How many times faster does the O2 code run? Also make sure the two programs produce the same alignments (use for instance the diff command on the output generated by the two programs).
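If you prefer to do the timing and comparison inside Python, a minimal sketch could look like the following. The helper names are made up for illustration. Since the ALN lines of the two programs may not be formatted identically, the comparison here looks only at the aligned strings, taken as the last field of each QAL and DAL line; that choice is an assumption about the output, so adapt it to what your programs print.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def alignment_strings(output_text):
    """Collect the aligned strings from QAL/DAL lines, ignoring any offset field."""
    return [line.split()[-1]
            for line in output_text.splitlines()
            if line.startswith(("QAL", "DAL"))]


def same_alignments(output_a, output_b):
    """True if both outputs contain the same aligned strings in the same order."""
    return alignment_strings(output_a) == alignment_strings(output_b)


# Toy outputs (hypothetical data) in the two layouts.
out_o3 = "ALN q 3 d 3 3 3 15.0\nQAL 0 ACD\nDAL 0 ACD\n"
out_o2 = "ALN Query 3 Database 3 3 15.0 3\nQAL ACD\nDAL ACD\n"
identical = same_alignments(out_o3, out_o2)
```

With the elapsed times of the two programs in hand, the speed-up is simply the O3 time divided by the O2 time.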
Finally, download the code as a Python program and add a command-line parser with the options
-q QUERY_FILE        File with query sequence
-db DB_FILE          File with database sequence
-go GAP_OPEN         Value of gap open (-11.0)
-ge GAP_EXTENSION    Value of gap extension (-1.0)
The last file in the Align code directory is a Python program that runs an all-against-all alignment using the O2 algorithm. This program is complete, so here you only need to check the code to be sure you understand what is going on.
You can test the program using the commands
cd data/Align
head -3 db_100.tab > small_db
cd ../../code/Align
python smith_waterman_O2_all_against_all.py -db small_db
We can now do a large-scale calculation where 1000 protein sequences are aligned against each other. We shall use the output from this calculation in Wednesday's exercise.
We will, however, not do this here, since the calculation takes too long to complete. You will get the result file from me tomorrow.
This is it for today!!! Remember to upload the two alignment programs via DTU-Learn.