Stabilization matrix method

In today's exercise you will implement and evaluate the SMM (stabilization matrix method) algorithm for predicting peptide-MHC binding, using both gradient descent and Monte Carlo minimization.

Data

The project consists of two algorithm templates, in which you must fill in the missing code, and two data sets of MHC-peptide binding data. First you must copy the data files.

Now download data for three HLA alleles from the list below.

You can find the number of peptides and the number of peptide binders per allele via the Data statistics link. Here, a binder is a peptide with a binding affinity stronger than 500 nM.

Note: select at least three alleles, such that one has many peptide data points, one has few, and one is in between.

A0101.tar.gz
A0201.tar.gz
A0202.tar.gz
A0203.tar.gz
A0206.tar.gz
A0301.tar.gz
A1101.tar.gz
A2301.tar.gz
A2402.tar.gz
A2403.tar.gz
A2601.tar.gz
A2902.tar.gz
A3001.tar.gz
A3002.tar.gz
A3101.tar.gz
A3301.tar.gz
A6801.tar.gz
A6802.tar.gz
A6901.tar.gz
B0702.tar.gz
B0801.tar.gz
B1501.tar.gz
B1801.tar.gz
B2705.tar.gz
B3501.tar.gz
B4001.tar.gz
B4002.tar.gz
B4402.tar.gz
B4403.tar.gz
B4501.tar.gz
B5101.tar.gz
B5301.tar.gz
B5401.tar.gz
B5701.tar.gz
B5801.tar.gz

Next, place the downloaded files in the data/SMM directory, and extract them (using tar -xzvf XXXXX.tar.gz), where XXXXX is the name of an MHC allele (for instance A0201). For each allele (MHC molecule) you will now have 10 files called c000, c001, c002, c003, c004, f000, f001, f002, f003, and f004. Check this by typing

ls -ltr XXXXX

where XXXXX is the name of one of the selected MHC alleles. Each of these files contains peptide-MHC binding data. The format of the files is

YVMTMILFL 1 A0201
YLYALYSPL 1 A0201
YLMDKLNLT 1 A0201
YITALNHLV 1 A0201
WLIGFDFDV 1 A0201
VLALYSPPL 1 A0201
TLAPFNFLV 1 A0201
SLDVINYLI 1 A0201

The first column gives the peptide sequence, the second the binding value, and the last the MHC allele name. The binding values are given as 1 - log(IC50)/log(50000), meaning that a value of 1 corresponds to an IC50 value of 1 nM, and a value of 0 to an IC50 value of 50000 nM. Note that low IC50 values correspond to strong binding.
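As a sanity check, the transformation between IC50 (in nM) and the rescaled binding value can be sketched as follows (the function names are illustrative, not part of the exercise code):

```python
import math

def ic50_to_score(ic50_nm):
    """Rescale an IC50 value (nM) to the 0-1 binding score used in the data files."""
    return 1.0 - math.log(ic50_nm) / math.log(50000)

def score_to_ic50(score):
    """Invert the transformation: recover the IC50 value (nM) from a binding score."""
    return math.exp((1.0 - score) * math.log(50000))

print(ic50_to_score(1))      # strong binder -> 1.0
print(ic50_to_score(50000))  # non-binder -> 0.0
print(ic50_to_score(500))    # the 500 nM binder threshold -> ~0.43
```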

The data files are prepared for 5-fold cross-validated training. For instance, the files c000 and f000 together define one of the five cross-validation partitions, where the f000 file is used for training and the c000 file for testing.


Implementing the algorithms

Now we are ready to code. First you must copy the program templates.

Download the SMM.tar.gz file

Extract the file (using tar -xzvf SMM.tar.gz) and place the created SMM directory in the "Algo/code" directory. The SMM directory should now contain two jupyter-notebook program files

smm_gradient_descent.ipynb
smm_monte_carlo.ipynb

The two files are templates for the gradient descent and Monte Carlo based minimization procedures of the SMM method, respectively.

smm_gradient_descent

Open the program smm_gradient_descent. Make sure you understand the structure of the code, and then fill out the missing parts.

Next, test the code with the two examples of training and evaluation data (A0201_training/A0201_evaluation and A2403_training/A2403_evaluation). Try changing the value of lambda. Do you see any impact on the training and evaluation performance? Try also changing epsilon. How does this affect the training and evaluation performance?

Now download the code as a Python program, remove the parts of the code related to making the graphical performance plots and to the evaluation (the last part, called "Performance Evaluation"), and add a command-line parser with the options

SMM_GD

optional arguments:
  -h, --help          show this help message and exit
  -l LAMB             Lambda (default: 0.01)
  -t TRAINING_FILE    File with training data
  -e EVALUATION_FILE  File with evaluation data
  -epi EPSILON        Epsilon (default 0.05)
  -s SEED             Seed for random numbers (default 1)
  -i EPOCHS           Number of epochs to train (default 100)

With these changes, the program smm_gradient_descent.py will take as input the two files TRAINING_FILE and EVALUATION_FILE, run the SMM algorithm, and print the generated PSSM to standard output.
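A minimal sketch of such a parser, matching the options listed above, could look like this (your own program will call parse_args() without an argument list, so that the options come from the command line):

```python
import argparse

# Command-line options matching the usage message shown above.
parser = argparse.ArgumentParser(description="SMM with gradient descent")
parser.add_argument("-l", dest="lamb", type=float, default=0.01,
                    help="Lambda (default: 0.01)")
parser.add_argument("-t", dest="training_file",
                    help="File with training data")
parser.add_argument("-e", dest="evaluation_file",
                    help="File with evaluation data")
parser.add_argument("-epi", dest="epsilon", type=float, default=0.05,
                    help="Epsilon (default 0.05)")
parser.add_argument("-s", dest="seed", type=int, default=1,
                    help="Seed for random numbers (default 1)")
parser.add_argument("-i", dest="epochs", type=int, default=100,
                    help="Number of epochs to train (default 100)")

# Illustrative call with an explicit argument list; in the real program
# use parser.parse_args() to read sys.argv instead.
args = parser.parse_args(["-t", "train.dat", "-e", "eval.dat", "-l", "0.05"])
print(args.training_file, args.lamb, args.epochs)
```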

Test this with one of the examples in the data/SMM directory.


smm_monte_carlo.ipynb

The smm_monte_carlo.ipynb program implements a Monte Carlo based optimization of the SMM algorithm. This program is almost completely written out for you. Open the program, spend some time making sure you understand its structure, and then fill in the missing code (the XX's).

Test the code with the two examples of training and evaluation data (A0201_training/A0201_evaluation and A2403_training/A2403_evaluation). Try changing the value of lambda. Do you see any impact on the training and evaluation performance? Try also changing epsilon. How does this affect the training and evaluation performance?

Now download the code as a Python program, remove the parts of the code related to making the graphical performance plots and to the evaluation (the last part, called "Performance Evaluation"), and add a command-line parser with the options

optional arguments:
  -h, --help          show this help message and exit
  -l LAMB             Lambda (default: 1)
  -t TRAINING_FILE    File with training data
  -e EVALUATION_FILE  File with evaluation data
  -s SEED             Seed for random numbers (default 1)
  -i ITERS            Number of epochs to train (default 1000)
  -Ts T_I             Start Temp (0.01)
  -Te T_F             End Temp (0.000001)
  -nT T_STEPS         Number of T steps (10)
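The -Ts, -Te, and -nT options define a cooling scheme from the start to the end temperature in a fixed number of steps. One common choice is a geometric schedule, sketched below; the template may use a different (for example linear) scheme, so treat this purely as an illustration:

```python
def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures decreasing geometrically from t_start to t_end
    (inclusive). Requires n_steps > 1."""
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio**k for k in range(n_steps)]

# The default -Ts, -Te, and -nT values from the usage message above
temps = geometric_schedule(0.01, 0.000001, 10)
print(temps[0], temps[-1], len(temps))
```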

and test the code with one of the examples in the data/SMM directory.


Training and evaluation of the two SMM algorithms

Now we are ready to train the first MHC peptide binding prediction algorithm.

To evaluate the code you need a script to calculate the Pearson correlation coefficient. You can download this script here
xycorr script.

Save this script in the SMM code directory. After saving the file, type the command

chmod u+x xycorr
This will allow you to execute the script.
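If you want to sanity-check the xycorr output, the Pearson correlation of two equal-length series can also be computed with a short Python snippet like the one below (a cross-check only; the exercise expects you to use xycorr):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly correlated values give a coefficient of 1 (up to rounding)
print(pearson([0.1, 0.4, 0.9], [0.2, 0.8, 1.8]))
```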

Demonstrate to yourself that the two implementations work, and that they converge to similar performance values. Use default parameters

python smm_gradient_descent.py -t ../../data/SMM/A0201/f000 -e ../../data/SMM/A0201/c000 | grep -v "#" > A0201.gd_mat.0
python smm_monte_carlo.py -t ../../data/SMM/A0201/f000 -e ../../data/SMM/A0201/c000 | grep -v "#" > A0201.mc_mat.0
python ../PSSM/pep2score.py -mat A0201.gd_mat.0 -f ../../data/SMM/A0201/c000 > A0201.gd_pred_0
python ../PSSM/pep2score.py -mat A0201.mc_mat.0 -f ../../data/SMM/A0201/c000 > A0201.mc_pred_0
cat A0201.gd_pred_0 | gawk '{print $2,$3}' | xycorr
cat A0201.mc_pred_0 | gawk '{print $2,$3}' | xycorr

where A0201 is an example of an allele you could have selected for download.

Design a 5-fold cross-validation training procedure for each allele using the data sets c000, c001, c002, c003, and c004. Train the SMM method using the smm_gradient_descent.py and smm_monte_carlo.py programs with default parameters, and calculate the cross-validated performance in terms of the Pearson correlation (using the xycorr program) and the MSE (mean squared error). The latter is calculated as

cat file | gawk 'BEGIN{n=0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}'

where file is a two-column file with target and prediction values.
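One way to structure the cross-validation is a small shell loop over the five partitions. The sketch below only echoes the commands it would run (remove the echo in front of each command to actually execute them); the paths and file names follow the earlier examples, and the allele name is just an illustration:

```shell
# Dry-run sketch of 5-fold cross-validation for one allele.
allele="A0201"                    # substitute one of your selected alleles
data="../../data/SMM/$allele"

for n in 0 1 2 3 4; do
    # Train on f00$n and predict on the held-out c00$n partition
    echo "python smm_gradient_descent.py -t $data/f00$n -e $data/c00$n | grep -v '#' > $allele.gd_mat.$n"
    echo "python ../PSSM/pep2score.py -mat $allele.gd_mat.$n -f $data/c00$n > $allele.gd_pred_$n"
done

# Concatenate the five prediction files before computing the
# cross-validated Pearson correlation with xycorr:
echo "cat $allele.gd_pred_? | gawk '{print \$2,\$3}' | xycorr"
```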

Using default parameters for the two programs smm_gradient_descent.py and smm_monte_carlo.py, which program has the highest performance, and is this consistent for the different HLA alleles?

Using the same 5-fold cross-validation setup, try to find the optimal value for lambda (the -l option). Does this value differ between the two methods and between MHC alleles? If you have time, repeat this for a series of HLA alleles and check whether a correlation exists between the optimal lambda value and the size of the training data (or the number of peptide binders in the training data).


You might want to use some shell scripting to make the training and evaluation more efficient. You can get an example of such a tcsh script here
CV doit script

or here if you prefer BASH
CV doit script (bash)

and here is an example of the output generated with the script
doit.out

Make sure you understand the structure of both the doit script and the doit.out output file. Also think about why using the values in the doit.out file to estimate the optimal lambda value and model performance is misleading.


Now you are done. Remember to upload the two SMM Python programs via DTU-Learn.