The project consists of two algorithm templates, in which you must fill out the missing code, and two data sets of MHC-peptide binding data. First you must copy the data files.
Now download data for three HLA alleles from the list below.
You can find the number of peptides and the number of peptide binders per allele via the following link: Data statistics. Here, a binder is a peptide with a binding affinity stronger than 500 nM.
Note: select at least three alleles such that one has many peptide data points, one has few, and one lies in between.
A0101.tar.gz
A0201.tar.gz
A0202.tar.gz
A0203.tar.gz
A0206.tar.gz
A0301.tar.gz
A1101.tar.gz
A2301.tar.gz
A2402.tar.gz
A2403.tar.gz
A2601.tar.gz
A2902.tar.gz
A3001.tar.gz
A3002.tar.gz
A3101.tar.gz
A3301.tar.gz
A6801.tar.gz
A6802.tar.gz
A6901.tar.gz
B0702.tar.gz
B0801.tar.gz
B1501.tar.gz
B1801.tar.gz
B2705.tar.gz
B3501.tar.gz
B4001.tar.gz
B4002.tar.gz
B4402.tar.gz
B4403.tar.gz
B4501.tar.gz
B5101.tar.gz
B5301.tar.gz
B5401.tar.gz
B5701.tar.gz
B5801.tar.gz
Next, place the downloaded files in the data/SMM directory and unpack them (using tar -xzvf XXXXX.tar.gz). Here, XXXXX is the name of an MHC allele (A0201, for instance). For each allele (MHC molecule) you will now have 10 files called c000, c001, c002, c003, c004, f000, f001, f002, f003, and f004. Check this by typing
ls -ltr XXXXX
where XXXXX is the name of one of the selected MHC alleles. Each of these files contains peptide-MHC binding data. The format of the files is
YVMTMILFL 1 A0201
YLYALYSPL 1 A0201
YLMDKLNLT 1 A0201
YITALNHLV 1 A0201
WLIGFDFDV 1 A0201
VLALYSPPL 1 A0201
TLAPFNFLV 1 A0201
SLDVINYLI 1 A0201
The first column gives the peptide sequence, the second the binding value, and the last column the MHC allele name. The MHC binding values are given as 1 - log(IC50)/log(50000), meaning that a value of 1 corresponds to an IC50 value of 1 nM, and a value of 0 to an IC50 value of 50000 nM. Note that low IC50 values correspond to strong binding.
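The transform is easy to invert, which is handy for sanity checks. A minimal Python sketch (the function names are illustrative, not part of the provided code); note that the 500 nM binder threshold mentioned above corresponds to a score of roughly 0.426:

import math

def ic50_to_score(ic50_nm):
    # Map an IC50 value in nM to the 0-1 binding score used in the data files
    return 1.0 - math.log(ic50_nm) / math.log(50000)

def score_to_ic50(score):
    # Invert the transform: recover the IC50 value in nM from a binding score
    return math.exp((1.0 - score) * math.log(50000))

print(ic50_to_score(1))      # 1.0 (strongest binding in this encoding)
print(ic50_to_score(50000))  # 0.0 (weakest)
print(ic50_to_score(500))    # ~0.426, the binder threshold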
The data files are prepared for 5-fold cross-validated training. The files c000 and f000, for instance, contain data for one of the five cross-validation partitions, where the f000 file is used for training and the c000 file for testing.
Now we are ready to code. First you must copy the program templates.
Download the SMM.tar.gz file
Unpack the file (using tar -xzvf SMM.tar.gz) and place the created SMM directory in the "Algo/code" directory. The SMM directory should now contain two Jupyter notebook program files
smm_gradient_descent.ipynb
smm_monte_carlo.ipynb
The two files are templates for the gradient descent and Monte Carlo based minimization procedures of the SMM method, respectively.
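For orientation: in the standard SMM formulation, both procedures minimize the same objective, the sum of squared prediction errors plus a penalty lambda on the squared size of the PSSM weights; the two notebooks differ only in how they search for the minimum. A minimal sketch of this objective and its gradient, assuming sparse-encoded 9-mer peptides (variable names are illustrative and do not match the templates):

import numpy as np

def smm_loss_and_gradient(weights, peptides, targets, lamb):
    # peptides: (N, 180) sparse-encoded 9-mers (9 positions x 20 amino acids)
    # targets:  (N,) binding scores; weights: (180,) PSSM weights
    errors = peptides @ weights - targets
    loss = np.sum(errors ** 2) + lamb * np.sum(weights ** 2)
    gradient = 2 * peptides.T @ errors + 2 * lamb * weights
    return loss, gradient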
Open the program smm_gradient_descent. Make sure you understand the structure of the code, and then fill out the missing parts.
Next, test the code with the two examples of training and evaluation data (A0201_training/A0201_evaluation and A2403_training/A2403_evaluation). Try changing the value of lambda. Do you see any impact on the training and evaluation performance? Try also changing epsilon. How does this affect the training and evaluation performance?
Now download the code as a Python program, remove the part of the code related to making the graphical plots of performance and the parts related to the evaluation (the last part, called "Performance Evaluation"), and add a command-line parser with the options
SMM_GD optional arguments:
  -h, --help          show this help message and exit
  -l LAMB             Lambda (default: 0.01)
  -t TRAINING_FILE    File with training data
  -e EVALUATION_FILE  File with evaluation data
  -epi EPSILON        Epsilon (default 0.05)
  -s SEED             Seed for random numbers (default 1)
  -i EPOCHS           Number of epochs to train (default 100)
With these changes, the program smm_gradient_descent.py will take as input two files, TRAINING_FILE and EVALUATION_FILE, run the SMM algorithm, and print the generated PSSM to standard out.
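A minimal argparse skeleton matching the options above could look as follows (a sketch; the option strings and defaults are taken from the usage text, the rest is illustrative):

from argparse import ArgumentParser

parser = ArgumentParser(description="SMM_GD")
parser.add_argument("-l", dest="LAMB", type=float, default=0.01, help="Lambda (default: 0.01)")
parser.add_argument("-t", dest="TRAINING_FILE", help="File with training data")
parser.add_argument("-e", dest="EVALUATION_FILE", help="File with evaluation data")
parser.add_argument("-epi", dest="EPSILON", type=float, default=0.05, help="Epsilon (default 0.05)")
parser.add_argument("-s", dest="SEED", type=int, default=1, help="Seed for random numbers (default 1)")
parser.add_argument("-i", dest="EPOCHS", type=int, default=100, help="Number of epochs to train (default 100)")
args = parser.parse_args()
# args.LAMB, args.TRAINING_FILE, etc. are then passed on to the training code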
Test this with one of the examples in the data/SMM directory.
The smm_monte_carlo.ipynb program implements a Monte Carlo based optimization of the SMM algorithm. This program is almost completely written out for you. Open the program. Spend some time to make sure you understand the structure, then fill in the missing code (the XX's).
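The heart of such a minimization is the Metropolis acceptance criterion: perturb a weight at random, keep the move if the error drops, and otherwise accept it with probability exp(-dE/T). A generic sketch of a single move (not the notebook's exact code; loss_fn and the step size are illustrative):

import numpy as np

def mc_step(weights, loss_fn, temperature, rng, step_size=0.1):
    trial = weights.copy()
    i = rng.integers(len(trial))                    # pick a random weight
    trial[i] += rng.uniform(-step_size, step_size)  # perturb it
    d_e = loss_fn(trial) - loss_fn(weights)
    if d_e < 0 or rng.uniform() < np.exp(-d_e / temperature):
        return trial   # move accepted
    return weights     # move rejected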
Test the code with the two examples of training and evaluation data (A0201_training/A0201_evaluation and A2403_training/A2403_evaluation). Try changing the value of lambda. Do you see any impact on the training and evaluation performance? Try also changing epsilon. How does this affect the training and evaluation performance?
Now download the code as a Python program, remove the part of the code related to making the graphical plots of performance and the parts related to the evaluation (the last part, called "Performance Evaluation"), and add a command-line parser with the options listed below (a sketch of the new options follows the list)
optional arguments:
  -h, --help          show this help message and exit
  -l LAMB             Lambda (default: 1)
  -t TRAINING_FILE    File with training data
  -e EVALUATION_FILE  File with evaluation data
  -s SEED             Seed for random numbers (default 1)
  -i ITERS            Number of epochs to train (default 1000)
  -Ts T_I             Start Temp (0.01)
  -Te T_F             End Temp (0.000001)
  -nT T_STEPS         Number of T steps (10)
and test the code with one of the examples in the data/SMM directory.
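The parser can be built exactly as in the gradient descent sketch above; only the defaults change (-l to 1, -i to 1000) and the temperature schedule options are new (again a sketch, with defaults taken from the usage text):

parser.add_argument("-Ts", dest="T_I", type=float, default=0.01, help="Start Temp (0.01)")
parser.add_argument("-Te", dest="T_F", type=float, default=0.000001, help="End Temp (0.000001)")
parser.add_argument("-nT", dest="T_STEPS", type=int, default=10, help="Number of T steps (10)")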
Now we are ready to train the first MHC peptide binding prediction algorithm.
To evaluate the code you need a script to calculate Pearson's correlation coefficient. You can download this script here
xycorr script.
Save this script in the SMM code directory. After saving the file, type the command
chmod u+x xycorr
This will allow you to execute the script.
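If you want to cross-check the xycorr output, Pearson's correlation is also easy to compute in Python. A minimal sketch reading a two-column file of (target, prediction) pairs (the script name pcc.py is illustrative):

import sys
import numpy as np

# pcc.py: print Pearson's correlation for a two-column file, as xycorr does
x, y = np.loadtxt(sys.argv[1], usecols=(0, 1), unpack=True)
print(np.corrcoef(x, y)[0, 1])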
Demonstrate to yourself that the two implementations work, and that they converge to similar performance values. Use default parameters:
python smm_gradient_descent.py -t ../../data/SMM/A0201/f000 -e ../../data/SMM/A0201/c000 | grep -v "#" > A0201.gd_mat.0
python smm_monte_carlo.py -t ../../data/SMM/A0201/f000 -e ../../data/SMM/A0201/c000 | grep -v "#" > A0201.mc_mat.0
python ../PSSM/pep2score.py -mat A0201.gd_mat.0 -f ../../data/SMM/A0201/c000 > A0201.gd_pred_0
python ../PSSM/pep2score.py -mat A0201.mc_mat.0 -f ../../data/SMM/A0201/c000 > A0201.mc_pred_0
cat A0201.gd_pred_0 | gawk '{print $2,$3}' | xycorr
cat A0201.mc_pred_0 | gawk '{print $2,$3}' | xycorr
where A0201 is an example of an allele you could have selected for download.
Design a 5-fold cross-validation training procedure for each allele using the data sets c000, c001, c002, c003, and c004 (together with the corresponding training files f000-f004). Train the SMM method using the smm_gradient_descent.py and smm_monte_carlo.py programs with default parameters, and calculate the cross-validated performance in terms of Pearson's correlation (using the xycorr program) and the MSE (mean squared error). The latter is calculated as
cat file | gawk 'BEGIN{n=0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}'
where file is a two-column file with target and prediction values.
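If you prefer to drive the cross-validation from Python rather than the shell, a sketch could look as follows (assuming the commands and file layout shown above, and that columns 2 and 3 of the pep2score.py output hold the target and prediction values; allele and method are illustrative):

import subprocess
import numpy as np

allele, method = "A0201", "smm_gradient_descent.py"
targets, predictions = [], []

for fold in range(5):
    train = f"../../data/SMM/{allele}/f00{fold}"
    test = f"../../data/SMM/{allele}/c00{fold}"
    mat = f"{allele}.mat.{fold}"
    # train on f00X and save the PSSM, stripping comment lines as grep -v "#" does
    pssm = subprocess.run(["python", method, "-t", train, "-e", test],
                          capture_output=True, text=True).stdout
    with open(mat, "w") as f:
        f.writelines(l + "\n" for l in pssm.splitlines() if "#" not in l)
    # score the held-out c00X partition with the trained matrix
    pred = subprocess.run(["python", "../PSSM/pep2score.py", "-mat", mat, "-f", test],
                          capture_output=True, text=True).stdout
    for line in pred.splitlines():
        fields = line.split()
        targets.append(float(fields[1]))
        predictions.append(float(fields[2]))

t, p = np.array(targets), np.array(predictions)
print("PCC:", np.corrcoef(t, p)[0, 1])  # cross-validated Pearson's correlation
print("MSE:", np.mean((t - p) ** 2))    # cross-validated mean squared error

Concatenating the five test partitions before computing the statistics gives a single cross-validated estimate per allele.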
Using default parameters for the two programs smm_gradient_descent.py and smm_monte_carlo.py, which program has the highest performance, and is this consistent for the different HLA alleles?
Using the same 5-fold cross-validation setup, try to find the optimal value for lambda (the -l option). Does this value differ between the two methods and between MHC alleles? If you have time, you can repeat this for a series of HLA alleles and investigate whether a correlation exists between the optimal lambda value and the size of the training data (or the number of peptide binders in the training data).
You might want to use some shell scripting to make the training and evaluation more effective. You can get an example
of such a tcsh script here
CV doit script
or here if you prefer BASH
CV doit script (bash)
and here is an example of the output generated with the script
doit.out
Make sure you understand the structure of both the doit script and the doit.out output file. Also think about why using the values in the doit.out file to estimate the optimal lambda value and model performance is misleading.