During this exercise you will use bioinformatics tools to predict peptide-MHC binding. The exercise has three parts: exploring MHC binding motifs with the SYFPEITHI database and Seq2Logo, training and evaluating peptide-MHC class I binding predictors with EasyPred, and developing an MHC class II binding predictor with NNAlign.
The most selective step in identifying potential peptide immunogens is the binding of the peptide to the MHC complex. Only about one in 200 peptides will bind to a given MHC complex. A very large number of different MHC alleles exist, each with a highly selective peptide binding specificity.
The binding motif for a given MHC class I complex is in most cases 9 amino acids long. The motif is characterized by a strong amino acid preference at specific positions, the so-called anchor positions. For many MHC complexes the anchor positions are located at P2 and P9 in the motif, but this is not always the case.
A large amount of peptide data exists describing this variation in MHC specificity. One important source is the SYFPEITHI MHC database (http://www.syfpeithi.de), which contains information on MHC ligands and binding motifs.
In this exercise you are going to examine the binding motifs of a set of MHC class I alleles using the SYFPEITHI database and sequence logos.
Go to the SYFPEITHI MHC database (right-click and select Open in a new window). Use the Find motif, Ligand or epitope option. Have a look at the peptide characteristics of different MHC alleles such as HLA-A*0201, HLA-A*01, HLA-A*1101, and HLA-B27. You do this by selecting, for instance, HLA-A*0201 and pressing Do query.
A powerful way to visualize the peptide characteristics of the binding motif of an MHC complex is to plot a sequence logo. The files HLA-A01, HLA-A0201, and HLA-B27 in the exercise directory contain peptides known to bind a particular MHC complex (HLA-A*0201, for example). The files are in the format used by a program that generates sequence logos. Go to the Seq2Logo web-site.
You shall use Seq2Logo to visualize the sequence logos. You do this by pasting the files one at a time into the Submission window and pressing Submit query. Do this for each of the three peptide files and examine the sequence logos.
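To get a feel for what a sequence logo summarizes, the sketch below computes the per-position information content for a set of 9-mer binders using a simple Shannon-entropy logo. This is a simplification of what Seq2Logo offers (which also supports background frequencies, sequence weighting and pseudo counts), and the peptides are illustrative examples only.

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def logo_heights(peptides):
    """Per-position letter heights of a Shannon-style sequence logo."""
    heights = []
    for pos in range(len(peptides[0])):
        counts = Counter(p[pos] for p in peptides)
        freqs = {aa: counts[aa] / len(peptides) for aa in AA}
        # information content in bits: log2(20) minus the Shannon entropy
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = math.log2(len(AA)) - entropy
        heights.append({aa: f * info for aa, f in freqs.items() if f > 0})
    return heights

# illustrative 9-mers with conserved anchors at P2 (L) and P9 (V/L)
peps = ["ALAKAAAAV", "ILYQVPFSV", "GLCTLVAML", "KLNEPVLLL"]
for pos, h in enumerate(logo_heights(peps), start=1):
    top = max(h, key=h.get)
    print(f"P{pos}: tallest letter {top} ({h[top]:.2f} bits)")
```

Tall, single-letter columns at P2 and P9 correspond to the anchor positions described above.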
In this part of the exercise you shall use the EasyPred web-interface to train and evaluate a series of different MHC-peptide binding predictors. You shall use two data sets (eval.set, train.set) that contain peptides and their binding affinity to the MHC allele HLA-A*0201. The binding affinity is a number between 0 and 1, where a high value indicates strong binding (a value of 0.5 corresponds to a binding affinity of approximately 200 nM). The eval.set contains 66 such peptides and the train.set 1200. Click on the filenames to view the content of the files.
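A log-transform of the kind commonly used for this rescaling is sketched below. The exact formula used to prepare these files is not given in the exercise, but the one shown (1 - log(IC50)/log(50000)) is consistent with a score of 0.5 corresponding to roughly 200 nM, since sqrt(50000) ≈ 224.

```python
import math

def affinity_to_score(ic50_nm, max_ic50=50000.0):
    """Map an IC50 binding affinity (in nM) to the 0-1 scale (assumed transform).

    A score of 0.5 corresponds to sqrt(50000) ~ 224 nM, i.e. roughly 200 nM.
    """
    score = 1.0 - math.log(ic50_nm) / math.log(max_ic50)
    return min(1.0, max(0.0, score))  # clamp to the 0-1 range

print(affinity_to_score(224))    # ~0.50, a moderate binder
print(affinity_to_score(50))     # ~0.64, a strong binder
print(affinity_to_score(5000))   # ~0.21, a weak binder
```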
Before you start using EasyPred you must save the train.set and eval.set files locally on the Desktop of your laptop. You do that by clicking on the file names (eval.set, train.set) and saving the files as text files on the Desktop. You shall now use the EasyPred web-server to train a series of methods to predict peptide-MHC binding. Go to the EasyPred web-server.
First you shall train a matrix predictor. On the EasyPred web-server, press Clear fields. In the upload training examples window, browse and select the train.set file from the Desktop; in the upload evaluation window, browse and select the eval.set file from the Desktop. In the Matrix method parameters, select Clustering at 62% identity and set the weight on prior (weight on pseudo counts) to 200. Press Submit query. This will calculate a weight-matrix using sequence weighting by clustering and a weight on prior (pseudo counts) of 200.
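To make the idea of a weight-matrix concrete, here is a minimal sketch of a position-specific scoring matrix built from 9-mer binders with a pseudo-count term. It is not the exact EasyPred procedure: the sketch uses a flat background and a flat pseudo-count distribution and omits the sequence weighting by clustering, and the binder peptides are toy examples.

```python
import math
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def weight_matrix(binders, beta=200.0):
    """Position-specific log-odds matrix from 9-mer binders (simplified sketch)."""
    n = len(binders)
    matrix = []
    for pos in range(9):
        counts = Counter(p[pos] for p in binders)
        row = {}
        for aa in AA:
            # blend the observed frequency with a flat prior, weighted by beta
            freq = (counts[aa] + beta / len(AA)) / (n + beta)
            row[aa] = math.log(freq / (1.0 / len(AA)))  # log-odds versus a flat background
        matrix.append(row)
    return matrix

def score(matrix, peptide):
    """Sum of the per-position log-odds scores for a 9-mer."""
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

binders = ["ALAKAAAAV", "ILYQVPFSV", "GLCTLVAML"]   # toy binder set
m = weight_matrix(binders)
print(round(score(m, "ILYQVPFSV"), 2))
```

A large weight on prior pulls the matrix towards the prior distribution; with the weight set to zero (as in the next step) the matrix is based on the raw counts alone.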
Go back to the EasyPred server window (use the Back button). Set the clustering method to No clustering and the weight on prior to zero and redo the calculation.
Now the fun starts: you shall train some neural networks to predict MHC-peptide binding. In the Type of prediction method window, select Neural networks. Leave all other parameters as they are. Press Submit query.
This will train a neural network with 2 hidden neurons running for up to 300 training epochs. The top 80% (960 peptides) of the train.set is used to train the neural network and the bottom 20% (240 peptides) is used to stop the training and thereby avoid over-fitting.
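The sketch below illustrates, under simplifying assumptions, what this training scheme amounts to: a sparse (one-hot) peptide encoding, a top-80%/bottom-20% split, and early stopping on the stop set. It uses a generic scikit-learn network rather than EasyPred's own implementation, and it only records the best-stopping epoch (a full implementation would also keep the network weights from that epoch); the peptides and affinities in the demo are made up.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """Sparse (one-hot) encoding of a 9-mer: 9 x 20 = 180 inputs."""
    x = np.zeros(9 * len(AA))
    for i, aa in enumerate(peptide):
        x[i * len(AA) + AA.index(aa)] = 1.0
    return x

def train_with_early_stopping(peptides, targets, epochs=300):
    X = np.array([encode(p) for p in peptides])
    y = np.array(targets)
    n_train = int(0.8 * len(X))
    X_tr, y_tr = X[:n_train], y[:n_train]            # top 80%: gradient updates
    X_stop, y_stop = X[n_train:], y[n_train:]        # bottom 20%: stop the training
    net = MLPRegressor(hidden_layer_sizes=(2,), solver="sgd", learning_rate_init=0.05)
    best_err, best_epoch = np.inf, -1
    for epoch in range(epochs):
        net.partial_fit(X_tr, y_tr)                  # one pass over the training part
        err = np.mean((net.predict(X_stop) - y_stop) ** 2)
        if err < best_err:                           # epoch with the lowest stop-set error
            best_err, best_epoch = err, epoch
    return best_epoch, best_err

# toy demonstration; in the exercise the 1200 peptides of train.set play this role
peps = ["ALAKAAAAV", "ILYQVPFSV", "GLCTLVAML", "KLNEPVLLL", "QWERTYIPA"]
affs = [0.7, 0.8, 0.6, 0.5, 0.1]                     # made-up 0-1 binding values
print(train_with_early_stopping(peps, affs, epochs=50))
```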
Go back to the EasyPred interface and change the parameters so that you use the bottom 80% of the train.set to train the neural network and the top 20% to stop the training. Redo the network training with the new parameters.
Go back to the EasyPred interface and change the parameters back so that you use the top 80% of the train.set for training. Next, do the neural network training with a different number of hidden neurons (1 and 5, for instance).
As you found in the first part of the neural network training, the network performance can depend strongly on the partition of the training data into training and stop sets. One way of improving the performance is to exploit this variation in a cross-validated training. The general idea behind cross-validated training is that since you cannot tell in advance which training set partition will be optimal, you make a series of N network trainings, each with a different partition. The final network prediction is then taken as the simple average over the N predictions. In a 5-fold cross-validated training, the training set is split into 5 sets. In one training, sets 1, 2, 3 and 4 are used to train the network and the 5th set to stop the training; in another training, sets 1, 3, 4 and 5 are used for training and the 2nd set to stop the training, and so forth.
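A minimal sketch of this scheme is shown below, again using a generic scikit-learn network rather than EasyPred's own code: the data are split into 5 folds, each network is stopped on its held-out fold, and the evaluation predictions are averaged over the 5 networks. Storing the eval-set predictions at the best-stopping epoch is a shortcut for storing the network weights; X, y and X_eval are assumed to be numerically encoded peptide arrays (see the encoding sketch above).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def cv_ensemble_predict(X, y, X_eval, n_folds=5, epochs=300):
    """5-fold cross-validated training with a simple ensemble average."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    predictions = []
    for i in range(n_folds):
        stop_idx = folds[i]                                  # fold i stops the training
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        net = MLPRegressor(hidden_layer_sizes=(2,), solver="sgd", learning_rate_init=0.05)
        best_err, best_pred = np.inf, None
        for _ in range(epochs):
            net.partial_fit(X[train_idx], y[train_idx])
            err = np.mean((net.predict(X[stop_idx]) - y[stop_idx]) ** 2)
            if err < best_err:
                best_err = err
                best_pred = net.predict(X_eval)              # predictions at the best epoch
        predictions.append(best_pred)
    return np.mean(predictions, axis=0)                      # average over the 5 networks

# tiny synthetic demonstration; in the exercise X and y come from train.set
rng = np.random.default_rng(0)
X_demo, y_demo = rng.random((50, 180)), rng.random(50)
print(cv_ensemble_predict(X_demo, y_demo, X_demo[:3], epochs=20))
```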
Go back to the EasyPred interface and set the hidden neuron parameter back to 2. Next, set the number of partitions for cross-validated training to 5 and redo the neural network training (this might take a few minutes).
Write down the test performance for each of the five networks. The output will look something like this:
Cross validation number 1
  Maximal test set pearson correlation coefficient sum = 0.862300 in epoch 109
  minimal per example squared error = 0.012800 in epoch 121
Cross validation number 2
  Maximal test set pearson correlation coefficient sum = 0.824700 in epoch 89
  minimal per example squared error = 0.017100 in epoch 89
Cross validation number 3
  Maximal test set pearson correlation coefficient sum = 0.794500 in epoch 59
  minimal per example squared error = 0.025000 in epoch 80
Cross validation number 4
  Maximal test set pearson correlation coefficient sum = 0.834600 in epoch 72
  minimal per example squared error = 0.016400 in epoch 80
Cross validation number 5
  Maximal test set pearson correlation coefficient sum = 0.823400 in epoch 117
  minimal per example squared error = 0.014700 in epoch 101
You shall now use the neural network to find potential epitopes in the SARS virus. In the EasyPred web-interface, press Clear fields to reset all parameter fields. Go to the Uniprot homepage Uniprot. Search for a SARS entry by typing "Sars virus" in the search window. Click your way to the FASTA format for one of the proteins. Here is a link if you are lazy. Paste the FASTA file into the Paste in evaluation examples window. Upload the network parameter file (para.dat) from before into the Load saved prediction method window. Leave the Networks to choose in ensemble window blank, make sure that the option for sorting the output is set to Sort output on predicted values, and press Submit query.
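Behind the scenes, scanning a protein for potential epitopes simply means sliding a 9-residue window along the sequence and scoring every resulting peptide with the trained predictor. A minimal sketch is shown below; the FASTA record is a made-up example and the helper names are illustrative.

```python
def read_fasta(text):
    """Return the sequence of a single-record FASTA string (header line dropped)."""
    lines = text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))

def nine_mers(sequence):
    """All overlapping 9-mer peptides in the protein."""
    return [sequence[i:i + 9] for i in range(len(sequence) - 8)]

fasta = """>toy_protein made-up sequence for illustration
MAGICPEPTIDESCANNING"""
peptides = nine_mers(read_fasta(fasta))
print(len(peptides), peptides[:3])   # each 9-mer would be scored by the trained network
```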
If you have more time, you shall now develop a prediction method for MHC class II binding.
HLA class II binding peptides have a broad length distribution, which complicates the development of prediction methods. Identifying the correct alignment of a set of peptides known to bind the MHC class II complex is a crucial part of identifying the core of an MHC class II binding motif.
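The alignment problem can be made concrete with a short sketch: because class II ligands are longer than the 9-mer binding core, each peptide offers several possible core placements (binding registers), and an alignment method such as NNAlign has to learn which register is the correct one. The peptide below is an illustrative example.

```python
def candidate_cores(peptide, core_len=9):
    """All possible placements of the 9-mer binding core within a class II peptide."""
    return [peptide[i:i + core_len] for i in range(len(peptide) - core_len + 1)]

# a 15-mer offers 7 possible binding registers
print(candidate_cores("PKYVKQNTLKLATGM"))
```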
Here you shall use the NNAlign web-server to develop and evaluate a prediction method for the HLA class II allele DRB1*0401. The data used to train and evaluate the method are downloaded from the Immune Epitope Database (IEDB). The DRB10401.train file contains 800 peptides with measured binding affinity to the DRB1*0401 MHC class II allele. The binding values have again been transformed to fall in the range 0-1, where a high value indicates strong binding (a value of 0.5 corresponds to a binding affinity of approximately 200 nM). The DRB10401.test file likewise contains 800 peptides with associated binding affinities, to be used for evaluation.
As explained in today's lecture, the prediction of MHC class II binding is complicated by the fact that one needs to correctly identify the binding register in order to learn the binding preferences. To illustrate this, we have extracted the set of binding peptides from the Class II_train.set. Use this data set, DRB10401.train_bind, to display the binding motif using the Seq2Logo server.
Next you shall use the NNAlign server to derive the binding motif for the class II molecule. Go to the NNAlign web-server. Upload DRB10401.train as training data and DRB10401.test as evaluation data. Select the MHC CLASS II ligands: Load parameters option. Set Folds for cross-validation to "No CV", Number of seeds to 10, Number of hidden neurons to 20, Number of training cycles to 300, and Burn-in period to 20, and leave the other settings unchanged. Press Submit. It takes a little while for the calculation to complete, so be patient. If the job does not complete, click here for a link to the output page.
Now you are done!!