Gibbs sampling

In this exercise, you shall implement the Gibbs sampling algorithm for identification of linear binding motifs in peptide ligand data.


Implementing the algorithms

First you must access the program template of today exercise

Download the file Gibbs.tar.gz file

Open the file (using tar -xzvf Gibbs.tar.gz) and place the created Gibbs directory in the "Algo/code" directory.

Now the Gibbs directory should contain one Jupyter-Notebook file and a small gawk script to calculate the PCC between a two column input file

GibbsSampler.ipynb
xycorr

The first program constructs a PSSM from a list of peptides using the Gibbs sampler algorithm to identify the optimal core alignment and writes the PSSM to standard output. The second program scores a peptide list against a PSSM and prints the peptide sequence, (eventual binding affinity) and PSSM score to standard output. The prediction score of the peptide to a PSSM is the score of the highest scoring binding core contained within the peptide. The format of the PSSM matrix made by the gibbs_sampler program is the standard Psi-Blast profile matrix output.


Gibbs_sampler

Open the GibbsSampler.ipynb jupyter-notebook file. Go through the code. Make sure you understand the structure of the program.

You shall fill in the missing code (XXXX). Again make sure you understand the structure of the code, and then fill out the missing code.

First check what is the content of the file DRB10401.lig by typing

cat data/Gibbs/DRB10401.lig | more

Next, run the GibbsSampler code on the file DRB10401.lig.

Plot the "learning" curve of the KLD score as a function of the iterations. Does it behave as expected?

Look at the generated PSSM. Copy the PSSM to the Seq2Logo server, and generate the logo. How does it look? Can you identify the HLA anchor positions?

Now you can evaluate the predictive power of the generate PSSM on the independent Gibbs/DRB10401.eval data set.

Fill in the missing parts of the code in the "Scoring peptides to weight matrix" box, and record the PCC performance.

If you have time, try to change some of the parameters of the code; seed, T_i, T_f, T_step, and iters_per_point, to see how these alter the PSSM (logo), and the PCC performance.

Finally, download the code as a python program. Remove the part related to the evaluation, and add a command line parser with the options

optional arguments:
  -h, --help          show this help message and exit
  -b BETA             Weight on pseudo count (default: 50.0)
  -w                  Use Sequence weighting
  -f PEPTIDES_FILE    File with peptides
  -i ITERS_PER_POINT  Number of iteration per data point
  -s SEED             Random number seed
  -Ts T_I             Start Temp
  -Te T_F             End Temp
  -nT T_STEPS         Number of T steps

cl2pred

As the last part of the exercise, make a copy called cl2pred.py of the pep2score.py program from the PSSM exercise and place it in the Gibbs code directory.

Modify the code to score a peptide of variable length against a PSSM. Hint the code is written for you in the evaluation part of the GibbsSampler.ipynb program.

Check that the code is running, by comparing the prediction values you get for each peptide, to the values produced in the GibbsSampler.ipynb program.

You can now use the two programs GibbsSampler.py and cl2pred.py to first make a PSSM and next evaluate it performance, i.e

python GibbsSampler.py -f ../data/Gibbs/DRB10401.lig -w > DRB10401.pssm
python cl2pred.py -mat DRB10401.pssm -f ../data/Gibbs/DRB10401.eval | gawk '{print $2,$3}' | xycorr

And with this set up, you are ready to launch large benchmark calculations to find the optimal parameters for the GibbsSampler.


Now you are done. Remember to upload both the GibbsSampler.py and the cl2pred.py programs via DTU-Learn.