Data redundancy reduction

In today's exercise you will implement the two data redundancy reduction algorithms, Hobohm 1 and Hobohm 2, introduced in the lecture.


First you must get the program templates for today's exercise.

Download the file Hobohm.tar.gz.

Unpack the archive (using tar -xzvf Hobohm.tar.gz) and place the resulting Hobohm directory in the "Algo/code" directory.

The Hobohm directory should now contain two Jupyter Notebook files

Hobohm1.ipynb
Hobohm2.ipynb

Implementing the algorithms

Now we are ready to code.

Hobohm-1

Open the Hobohm1.ipynb program. Go to the main procedure. Make sure you understand the structure of the program. In particular make sure you understand how the accepted_sequences list is updated to keep track of the list of unique sequences.
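
The bookkeeping around accepted_sequences can be sketched as follows. This is only a minimal illustration, not the notebook's code: the function names hobohm1 and toy_similar are hypothetical, and toy_similar is a simple position-by-position identity check standing in for the alignment-based similarity test that the notebook uses.

```python
def toy_similar(seq_a, seq_b, threshold=0.8):
    """Hypothetical stand-in for the notebook's alignment-based test.

    Returns True when the fraction of identical positions, counted over
    the shorter of the two sequences, is at least `threshold`.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return False
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / n >= threshold


def hobohm1(sequences, similar=toy_similar):
    """Return the sequences kept by a Hobohm 1 style pass over `sequences`."""
    accepted_sequences = []
    for candidate in sequences:
        # Keep the candidate only if it is not similar to any sequence
        # that has already been accepted.
        if not any(similar(candidate, kept) for kept in accepted_sequences):
            accepted_sequences.append(candidate)
    return accepted_sequences


unique = hobohm1(["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"])
print(len(unique), unique)
```

The second sequence differs from the first at one position out of ten, so it is rejected, while the third shares no positions with the first and is kept.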

Fill in the missing code (look for the places marked XXXX).

When the code is complete, test it by running the Hobohm1 algorithm on the data file "database_list.tab" containing 1052 protein sequences.

How many unique sequences does the algorithm find, and how long did the calculation take? The code takes some time to run, so either go for some coffee or move on while it is running.

Now download the code as a Python program, and add a command-line parser with the options

optional arguments:
  -h, --help         show this help message and exit
  -f ALIGNMENT_FILE  File input data
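
One way to get a help message of this shape is Python's standard argparse module. The sketch below is an assumption about how the parser could be set up: the function name build_parser and the description text are made up for illustration, and the file name is passed explicitly here so the example runs without real command-line arguments.

```python
from argparse import ArgumentParser


def build_parser():
    """Build a parser with a single -f option for the input file."""
    parser = ArgumentParser(description="Hobohm redundancy reduction")  # description is illustrative
    # dest="alignment_file" makes argparse display the option as
    # "-f ALIGNMENT_FILE" in the help message.
    parser.add_argument("-f", action="store", dest="alignment_file",
                        type=str, help="File input data")
    return parser


# In the real program you would call build_parser().parse_args() without
# arguments so that the options are read from the command line.
args = build_parser().parse_args(["-f", "database_list.tab"])
print("Input file:", args.alignment_file)
```

The -h/--help option is added by argparse automatically. Note that the heading of the options section in the help text depends on the Python version.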

Hobohm-2

To run Hobohm 2 you need an additional data file generated by running the all-against-all alignment from yesterday. Download the file alignment.aln.gz.

Decompress the file (using gunzip alignment_aln.gz) and place the file alignment_aln in the "Algo/data/Hobohm/" directory.

You can check the content of alignment_aln by typing

head alignment_aln

Part of the file content is shown below

ALN 1US0.A 316 2EAB.A 899 308 64 87.0
ALN 1US0.A 316 1QWN.A 1045 343 68 95.0
ALN 1US0.A 316 1HLR.A 907 207 43 105.0
ALN 1US0.A 316 1MXT.A 504 160 37 62.0
ALN 1US0.A 316 1MUW.A 386 190 37 56.0
ALN 1US0.A 316 1SU8.A 636 240 56 67.0
ALN 1US0.A 316 1WUI.L 534 232 53 90.0
ALN 1US0.A 316 2BHU.A 602 179 37 84.0
ALN 1US0.A 316 2E7Z.A 727 222 44 97.0
ALN 1US0.A 316 1JZ7.A 1023 432 80 103.0

Each line has the format

ALN Q-name Q-len D-name D-len alignment-length nid alignment-score
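
If you want to inspect such lines from Python, the eight whitespace-separated columns can be read into named fields. This is only a sketch: the names AlnRecord and parse_aln_line are hypothetical and are not part of the course code.

```python
from typing import NamedTuple


class AlnRecord(NamedTuple):
    """One line of the all-against-all alignment output (hypothetical type)."""
    q_name: str
    q_len: int
    d_name: str
    d_len: int
    alignment_length: int
    nid: int
    score: float


def parse_aln_line(line):
    """Split an 'ALN ...' line into typed fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "ALN":
        raise ValueError(f"Not an ALN line: {line!r}")
    return AlnRecord(fields[1], int(fields[2]), fields[3], int(fields[4]),
                     int(fields[5]), int(fields[6]), float(fields[7]))


record = parse_aln_line("ALN 1US0.A 316 2EAB.A 899 308 64 87.0")
print(record.q_name, record.d_name, record.alignment_length, record.nid)
```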

The input to the hobohm2 algorithm is a file in the format

name1 name2 score

where score is the similarity score between the sequences name1 and name2. We can transform the output from the all-against-all alignment into this format using the following command

cd data/Hobohm
cat alignment_aln | gawk '{if ( 2.9*sqrt($6) > $7 ) { h=0} else { h=1} print $2,$4,h}' > alignment_aln.fmt

This command reads each line of the file and checks whether 2.9*sqrt(alen) > nid, where alen is the alignment length. The syntax $6 refers to the 6th column of the line, and $7 to the 7th column. If 2.9*sqrt(alen) > nid, the variable h is set to 0; otherwise it is set to 1. Then name1 ($2), name2 ($4) and h are printed to standard output, which is redirected to the file alignment_aln.fmt.
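
The same test can be written in Python, which may help when checking individual lines by hand. The function names similarity_flag and convert_line are hypothetical; the logic mirrors the gawk command above.

```python
import math


def similarity_flag(alignment_length, nid):
    """Return 0 if 2.9*sqrt(alignment_length) > nid, otherwise 1."""
    return 0 if 2.9 * math.sqrt(alignment_length) > nid else 1


def convert_line(line):
    """Turn one 'ALN ...' line into 'name1 name2 h'."""
    fields = line.split()
    # Column 6 (index 5) is the alignment length, column 7 (index 6) is nid.
    h = similarity_flag(float(fields[5]), float(fields[6]))
    return f"{fields[1]} {fields[3]} {h}"


# 2.9*sqrt(308) is about 50.9, which is not greater than 64, so h = 1.
print(convert_line("ALN 1US0.A 316 2EAB.A 899 308 64 87.0"))
# 2.9*sqrt(190) is about 40.0, which is greater than 37, so h = 0.
print(convert_line("ALN 1US0.A 316 1MUW.A 386 190 37 56.0"))
```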

You can now run the Hobohm2.ipynb algorithm on the file alignment_aln.fmt

Now open the file Hobohm2.ipynb. Spend some time making sure you understand the structure of the program. Fill in the missing code (XXXXXX) to make the code run.
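
As a reminder of the general idea, Hobohm 2 repeatedly removes the sequence with the most neighbors until no remaining sequence has any neighbor left. The sketch below illustrates this on a handful of made-up pairs in the name1 name2 score format. It is not the notebook's implementation: the function name hobohm2 and the data structures are assumptions, and ties between sequences with equally many neighbors are broken by insertion order here.

```python
def hobohm2(pairs):
    """Return the names kept after a Hobohm 2 style reduction.

    `pairs` is an iterable of (name1, name2, score) tuples, where a
    score of 1 means the two sequences are similar and 0 means they are not.
    """
    neighbors = {}
    for name1, name2, score in pairs:
        neighbors.setdefault(name1, set())
        neighbors.setdefault(name2, set())
        if score and name1 != name2:
            neighbors[name1].add(name2)
            neighbors[name2].add(name1)

    while True:
        # Find the sequence with the largest number of neighbors.
        worst = max(neighbors, key=lambda n: len(neighbors[n]), default=None)
        if worst is None or not neighbors[worst]:
            break  # nothing left that has a neighbor
        # Remove it and update the neighbor sets of the sequences it touched.
        for other in neighbors.pop(worst):
            neighbors[other].discard(worst)
    return sorted(neighbors)


toy_pairs = [("A", "B", 1), ("A", "C", 1), ("B", "C", 0), ("D", "A", 0)]
print(hobohm2(toy_pairs))
```

In the toy data, A has two neighbors and is removed first, after which B, C and D have no neighbors and are all kept.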

How many unique clusters does the algorithm find? How does the number compare to the number of clusters found using the Hobohm1 algorithm? And how fast did the program complete the calculations?

Finally, save the code as a Python program, and add a command-line parser with the option

optional arguments:
  -h, --help         show this help message and exit
  -f ALIGNMENT_FILE  File input data

Now you are done. Remember to upload the two Python programs via DTU-Inside.