In today's exercise you will implement the two data redundancy algorithms, Hobohm 1 and Hobohm 2, introduced in the lecture.
First you must access the program templates for today's exercise.
Download the file Hobohm.tar.gz.
Unpack the file (using tar -xzvf Hobohm.tar.gz) and place the created Hobohm directory in the "Algo/code" directory.
Now the Hobohm directory should contain two Jupyter Notebook files:
Hobohm1.ipynb
Hobohm2.ipynb
Now we are ready to code.
Open the Hobohm1.ipynb program. Go to the main procedure. Make sure you understand the structure of the program. In particular make sure you understand how the accepted_sequences list is updated to keep track of the list of unique sequences.
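To make the bookkeeping concrete, here is a minimal toy sketch of how a Hobohm 1 style pass grows a list of accepted sequences. It is not the code in Hobohm1.ipynb: the function names, the `toy_similar` predicate and its 0.8 threshold are all assumptions chosen for illustration, and the notebook decides similarity from alignments rather than from this position-by-position comparison.

```python
def hobohm1(sequences, is_similar):
    """Greedy Hobohm 1 style pass over (name, sequence) tuples.

    is_similar is a hypothetical predicate: it takes two sequences and
    returns True when they are too similar to both be kept.
    """
    accepted = []
    for name, seq in sequences:
        # Keep the candidate only if it is not similar to any accepted sequence
        if not any(is_similar(seq, kept_seq) for _, kept_seq in accepted):
            accepted.append((name, seq))
    return accepted


def toy_similar(a, b, threshold=0.8):
    """Toy similarity (assumption): fraction of identical positions."""
    n = min(len(a), len(b))
    if n == 0:
        return False
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n >= threshold


toy_db = [("s1", "ACDEFGHIKL"), ("s2", "ACDEFGHIKV"), ("s3", "WWWWWYYYYY")]
unique = hobohm1(toy_db, toy_similar)
print([name for name, _ in unique])
```

In this toy run s2 is rejected because it matches s1 in 9 of 10 positions, while s3 shares nothing with s1 and is accepted. Note that the result depends on the order in which the sequences are visited.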
You shall fill in the missing code (marked XXXX).
When completed, test the code by running the Hobohm1 algorithm on the data file "database_list.tab", which contains 1052 protein sequences.
How many unique sequences does the algorithm find, and how long did the calculation take? The code takes some time to run, so either go for some coffee or move on while it is running.
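One simple way to measure the running time is to wrap the call in a wall-clock timer. The sketch below is only an illustration: `timed` is a hypothetical helper, and `sum(range(...))` stands in for the actual Hobohm1 call in the notebook.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the call that runs the algorithm
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={seconds:.3f} s")
```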
Now download the code as a Python program, and add a command-line parser with the options
optional arguments:
  -h, --help          show this help message and exit
  -f ALIGNMENT_FILE   File input data
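A parser producing a help message of this shape can be sketched with the standard argparse module. The description string and the `build_parser` name are assumptions; only the -f option and its help text come from the exercise. In recent Python versions (3.10 and later) the heading is printed as "options:" rather than "optional arguments:".

```python
from argparse import ArgumentParser


def build_parser():
    """Build a parser with a single -f option (plus the automatic -h/--help)."""
    parser = ArgumentParser(description="Hobohm redundancy reduction")
    # dest doubles as the placeholder name shown in the help message
    parser.add_argument("-f", dest="ALIGNMENT_FILE", help="File input data")
    return parser


# In the real script, call parse_args() without arguments to read sys.argv
args = build_parser().parse_args(["-f", "database_list.tab"])
print(args.ALIGNMENT_FILE)
```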
To run Hobohm 2 you need an additional data file generated by running the all-against-all alignment from yesterday. Download the file alignment.aln.gz.
Unpack the file (using gunzip alignment_aln.gz) and place the file alignment_aln in the "Algo/data/Hobohm/" directory.
You can check the content of alignment_aln by typing
head alignment_aln
Part of the file content is shown below
ALN 1US0.A 316 2EAB.A 899 308 64 87.0
ALN 1US0.A 316 1QWN.A 1045 343 68 95.0
ALN 1US0.A 316 1HLR.A 907 207 43 105.0
ALN 1US0.A 316 1MXT.A 504 160 37 62.0
ALN 1US0.A 316 1MUW.A 386 190 37 56.0
ALN 1US0.A 316 1SU8.A 636 240 56 67.0
ALN 1US0.A 316 1WUI.L 534 232 53 90.0
ALN 1US0.A 316 2BHU.A 602 179 37 84.0
ALN 1US0.A 316 2E7Z.A 727 222 44 97.0
ALN 1US0.A 316 1JZ7.A 1023 432 80 103.0
Each line has the format
ALN Q-name Q-len D-name D-len alignment-length nid alignment-score
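As an illustration of this layout, a single line can be split into named fields as sketched below. The `AlnRecord` type, its field names and the `parse_aln_line` function are hypothetical, and the choice of int for the lengths and nid and float for the score is an assumption based on the sample lines.

```python
from typing import NamedTuple


class AlnRecord(NamedTuple):
    q_name: str
    q_len: int
    d_name: str
    d_len: int
    aln_len: int
    nid: int
    score: float


def parse_aln_line(line):
    """Parse one 'ALN ...' line into an AlnRecord."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "ALN":
        raise ValueError(f"Not a valid ALN line: {line!r}")
    return AlnRecord(
        q_name=fields[1],
        q_len=int(fields[2]),
        d_name=fields[3],
        d_len=int(fields[4]),
        aln_len=int(fields[5]),
        nid=int(fields[6]),
        score=float(fields[7]),
    )


rec = parse_aln_line("ALN 1US0.A 316 2EAB.A 899 308 64 87.0")
print(rec)
```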
The input to the hobohm2 algorithm is a file in the format
name1 name2 score
where score is the similarity score between the sequences name1 and name2. We can transform the output from the all-against-all alignment into this format using the following commands
cd data/Hobohm
cat alignment_aln | gawk '{if ( 2.9*sqrt($6) > $7 ) { h=0} else { h=1} print $2,$4,h}' > alignment_aln.fmt
The gawk command takes each line in the file and checks if 2.9*sqrt(alen) > nid, where alen is the alignment length and nid is the number of identical positions in the alignment. The syntax $6 refers to the 6th column in the line (alen), and $7 to the 7th column (nid). If 2.9*sqrt(alen) > nid the variable h is set to 0, otherwise it is set to 1. Next name1 ($2), name2 ($4) and h are printed to standard output, which is redirected to the file alignment_aln.fmt.
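For readers less familiar with gawk, the same per-line transformation can be sketched in Python. The function name `to_hobohm2_line` is hypothetical; the column positions and the 2.9*sqrt(alen) threshold are taken from the command above.

```python
import math


def to_hobohm2_line(aln_line):
    """Convert one 'ALN ...' line into 'name1 name2 h'."""
    fields = aln_line.split()
    alen = float(fields[5])  # gawk $6: alignment length
    nid = float(fields[6])   # gawk $7: nid
    # h = 0 when nid falls below the threshold, otherwise h = 1
    h = 0 if 2.9 * math.sqrt(alen) > nid else 1
    return f"{fields[1]} {fields[3]} {h}"


print(to_hobohm2_line("ALN 1US0.A 316 2EAB.A 899 308 64 87.0"))
print(to_hobohm2_line("ALN 1US0.A 316 1MUW.A 386 190 37 56.0"))
```

For the first line 2.9*sqrt(308) is about 50.9, which is below nid = 64, so h is 1. For the second line 2.9*sqrt(190) is about 40.0, which is above nid = 37, so h is 0.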
You can now run the Hobohm2.ipynb algorithm on the file alignment_aln.fmt
Now open the file Hobohm2.ipynb. Spend some time to make sure you understand the structure of the program. Fill in the missing code (XXXXXX), to make the code run.
How many unique clusters does the algorithm find? How does the number compare to the number of clusters found using the Hobohm1 algorithm? And how fast did the program complete the calculations?
Finally, save the code as a Python program, and add a command-line parser with the option
optional arguments:
  -h, --help          show this help message and exit
  -f ALIGNMENT_FILE   File input data
Now you are done. Remember to upload the two Python programs via DTU-Inside.