<h2>Gibbs sampling</h2>

<p>
In this exercise, you shall implement the Gibbs sampling algorithm for identification of linear binding motifs
in peptide ligand data.

<hr>

<h3>Implementing the algorithms</h3>

<p>
First, access the program template for today's exercise.

<p>
Download the file

<a href="./Gibbs.tar.gz" target="_blank">Gibbs.tar.gz file</a>

<p>
Unpack the file (using tar -xzvf Gibbs.tar.gz) and place the resulting Gibbs directory in the "Algo/code" directory.

<p>
The Gibbs directory should now contain one Jupyter notebook file and
a small gawk script that calculates the Pearson correlation coefficient (PCC) between the two columns of an input file

<pre>
GibbsSampler.ipynb
xycorr
</pre>
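<p>
For reference, the calculation performed by xycorr can be sketched in Python as follows. This is a minimal illustration, not the gawk script itself; the function names <b>pearson</b> and <b>read_two_columns</b> are made up for this example, and the input is assumed to be whitespace-separated with the two values in the first two columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the common 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when a column is constant")
    return cov / math.sqrt(var_x * var_y)


def read_two_columns(lines):
    """Parse 'x y' lines (assumed format), skipping lines with fewer than two fields."""
    xs, ys = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        xs.append(float(fields[0]))
        ys.append(float(fields[1]))
    return xs, ys


# Toy usage with made-up numbers
xs, ys = read_two_columns(["0.1 0.2", "0.5 0.4", "0.9 1.0"])
pcc = pearson(xs, ys)
```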

<p>
By the end of the exercise you will have two programs.
The first (the Gibbs sampler) constructs a PSSM from a list of peptides, using the Gibbs sampler algorithm to identify the optimal
core alignment, and writes the PSSM to standard output.
The second (cl2pred) scores a peptide list against a PSSM and prints the peptide
sequence, the binding affinity (if available), and the PSSM score to standard output. The prediction score
of a peptide against a PSSM is the score of the highest-scoring binding core contained within the peptide.
The format of the PSSM matrix
made by the <b>gibbs_sampler</b> program is the standard Psi-Blast profile matrix output.

<hr>

<h4>Gibbs_sampler</h4>

<p>
Open the GibbsSampler.ipynb Jupyter notebook file and go through the code.
Make sure you understand the structure of the program.

<p>
Once you understand the structure of the code, fill in the missing parts (marked XXXX).

<p>
First, inspect the content of the file <b>DRB10401.lig</b> by typing

<pre>
cat data/Gibbs/DRB10401.lig | more
</pre>

<p>
Next, run the GibbsSampler code on the file DRB10401.lig. 

<p>
Plot the "learning" curve of the KLD score as a function of the iterations. Does it behave 
as expected?
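<p>
As a reminder of what the KLD score measures, the sketch below computes the Kullback-Leibler divergence of a set of aligned cores, summed over core positions. It is a simplified illustration and not the notebook's code: it assumes a uniform background distribution, uses raw frequencies without pseudo counts or sequence weighting, and uses the natural logarithm. The name <b>kld_score</b> is made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kld_score(cores, background=None):
    """Sum over positions of sum_a p(a) * log(p(a) / q(a)) for aligned cores."""
    if background is None:
        # Assumption: uniform background; real code typically uses observed frequencies
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    core_len = len(cores[0])
    total = 0.0
    for pos in range(core_len):
        counts = Counter(core[pos] for core in cores)
        for aa, count in counts.items():
            p = count / len(cores)  # raw frequency, no pseudo counts
            total += p * math.log(p / background[aa])
    return total


# Toy cores: a fully conserved alignment scores higher than a mixed one
conserved = kld_score(["ACD", "ACD", "ACD"])
mixed = kld_score(["ACD", "GCD", "ACE"])
```

<p>
A well-aligned set of cores has more conserved positions and therefore a higher KLD, which is why the score is expected to increase overall as the sampler proceeds.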

<p>
Look at the generated PSSM. Copy the PSSM to the <a HREF="https://services.healthtech.dtu.dk/service.php?Seq2Logo-2.0" target=_blank>Seq2Logo server</a>, and generate the logo. How does it look? Can you identify the HLA anchor positions?

<p>
Now you can evaluate the predictive power of the generated PSSM on the
independent Gibbs/DRB10401.eval data set.

<p>
Fill in the missing parts of the code in the "Scoring peptides to weight matrix" box, and
record the PCC performance. 

<p>
If you have time, try changing some of the parameters of the code (seed,
T_i, T_f, T_step, and iters_per_point) to see how they alter the PSSM (logo)
and the PCC performance.
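<p>
To build intuition for the temperature parameters, the sketch below shows one common way a simulated-annealing schedule and a Metropolis acceptance test can be written. It is an assumption-laden illustration: the notebook may space its temperatures differently or use another acceptance rule, and the function names <b>temperature_schedule</b> and <b>accept_move</b> are made up for this example.

```python
import math
import random


def temperature_schedule(t_i, t_f, t_steps):
    """Linearly spaced temperatures from t_i down to t_f (assumed spacing)."""
    if t_steps < 2:
        return [t_i]
    step = (t_i - t_f) / (t_steps - 1)
    return [t_i - k * step for k in range(t_steps)]


def accept_move(delta_kld, temperature, rng):
    """Metropolis criterion: always accept an improvement in KLD,
    otherwise accept with probability exp(delta_kld / temperature)."""
    if delta_kld >= 0:
        return True
    return rng.random() < math.exp(delta_kld / temperature)


# Toy usage with placeholder values (not the notebook's defaults)
rng = random.Random(1)
temps = temperature_schedule(1.0, 0.0001, 10)
decisions = [accept_move(-0.05, t, rng) for t in temps]
```

<p>
At high temperature, moves that lower the KLD are accepted fairly often, which lets the sampler escape local optima; as the temperature approaches T_f, such moves are almost always rejected.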

<p>
Finally, download the code as a Python program. Remove the part related to
the evaluation, and add a command line parser with the options

<pre>
optional arguments:
  -h, --help          show this help message and exit
  -b BETA             Weight on pseudo count (default: 50.0)
  -w                  Use Sequence weighting
  -f PEPTIDES_FILE    File with peptides
  -i ITERS_PER_POINT  Number of iterations per data point
  -s SEED             Random number seed
  -Ts T_I             Start Temp
  -Te T_F             End Temp
  -nT T_STEPS         Number of T steps
</pre>
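<p>
A parser with these options could be set up with Python's standard argparse module roughly as follows. The option strings and help texts follow the listing above; the destination names and all default values except the one for -b are placeholders chosen for this sketch, so replace them with the values used in your notebook.

```python
from argparse import ArgumentParser


def build_parser():
    parser = ArgumentParser(description="Gibbs sampler")
    parser.add_argument("-b", dest="beta", type=float, default=50.0,
                        help="Weight on pseudo count (default: 50.0)")
    parser.add_argument("-w", dest="sequence_weighting", action="store_true",
                        help="Use Sequence weighting")
    parser.add_argument("-f", dest="peptides_file", type=str,
                        help="File with peptides")
    # The defaults below are placeholders, not values taken from the notebook
    parser.add_argument("-i", dest="iters_per_point", type=int, default=4,
                        help="Number of iterations per data point")
    parser.add_argument("-s", dest="seed", type=int, default=1,
                        help="Random number seed")
    parser.add_argument("-Ts", dest="T_i", type=float, default=1.0,
                        help="Start Temp")
    parser.add_argument("-Te", dest="T_f", type=float, default=0.0001,
                        help="End Temp")
    parser.add_argument("-nT", dest="T_steps", type=int, default=10,
                        help="Number of T steps")
    return parser


# Example: parse a command line equivalent to
#   python GibbsSampler.py -f DRB10401.lig -w -Ts 0.5 -nT 20
args = build_parser().parse_args(
    ["-f", "DRB10401.lig", "-w", "-Ts", "0.5", "-nT", "20"]
)
```

<p>
In the program itself you would call parse_args() without arguments so that the options are read from the command line. Note that recent Python versions print the heading "options:" instead of "optional arguments:" in the help message.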

<hr>

<h4>cl2pred</h4>

<p>
As the last part of the exercise, make a copy called cl2pred.py 
of the pep2score.py program from the PSSM exercise and place
it in the Gibbs code directory. 

<p>
Modify the code to score a peptide of variable length against a PSSM. Hint:
the code is written for you in the evaluation part of the GibbsSampler.ipynb
notebook.
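<p>
The idea of scoring a variable-length peptide, as described earlier, is to slide a window of the core length along the peptide and keep the highest-scoring core. The sketch below illustrates this under stated assumptions: the PSSM is represented as a list with one dictionary per core position, mapping amino acid to score, and the function name <b>score_peptide</b> and the toy matrix are made up for this example. The notebook's own data structures may differ.

```python
def score_peptide(peptide, pssm):
    """Return (best_score, best_core) over all windows of length len(pssm)."""
    core_len = len(pssm)
    if len(peptide) < core_len:
        raise ValueError("peptide is shorter than the PSSM core length")
    best_score, best_core = None, None
    for start in range(len(peptide) - core_len + 1):
        core = peptide[start:start + core_len]
        # Sum the position-specific score of each residue in the window
        score = sum(pssm[pos][aa] for pos, aa in enumerate(core))
        if best_score is None or score > best_score:
            best_score, best_core = score, core
    return best_score, best_core


# Toy PSSM with core length 2 and a two-letter alphabet (made-up numbers)
toy_pssm = [
    {"A": 1.0, "C": -1.0},
    {"A": -1.0, "C": 2.0},
]
best_score, best_core = score_peptide("CACA", toy_pssm)
```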

<p>
Check that the code runs correctly by comparing the prediction values you get for
each peptide with the values produced in the GibbsSampler.ipynb notebook.

<p>
You can now use the two programs GibbsSampler.py and cl2pred.py to first
make a PSSM and then evaluate its performance, i.e.

<pre>
python GibbsSampler.py -f ../data/Gibbs/DRB10401.lig -w > DRB10401.pssm
python cl2pred.py -mat DRB10401.pssm -f ../data/Gibbs/DRB10401.eval | gawk '{print $2,$3}' | xycorr
</pre>

<p>
With this setup in place, you are ready to launch large benchmark calculations
to find the optimal parameters for the GibbsSampler.

<hr>

<p>
Now you are done. Remember to upload both the GibbsSampler.py and the cl2pred.py programs via DTU-Learn.