Construction of PSSMs

In todays exercise, you shall implement the algorithm for construction of position specific scoring matrices (PSSM) introduced in this weeks lecture. The algorithm shall include pseudo counts to correct for small number of observations, and sequences weighting using heuristics.


Implementing the algorithms

Download the file PSSM.tar.gz file

Place the file in the code directory. Remove the current PSSM directory if present (this code has been updated), and open the PSSM.tar.gz file (using tar -xzvf PSSM.tar.gz). In the PSSM directory you will now find two files

pep2mat.ipynb
pep2score.ipynb

The first program constructs a PSSM from a list of peptides of equal length. The PSSM can be constructed with and without sequence weighting, and with pseudo count correction for low counts, as described in todays lecture. The format of the PSSM matrix made by the pep2mat program is the standard Psi-Blast profile matrix output.

The second program scores a peptide list against a PSSM and generates a scatter plot of the predictions versus target values, and reports the Pearsons correlation coefficient as a performance measure.


pep2mat

Open the file pep2mat.ipynb as a jupyther-notebook. Add the path to your course directory in the "DEFINE THE PATH TO YOUR COURSE DIRECTORY" box. Next step through each box of the code from the top and down. Make sure you understand what the code in each box is supposed to do, and if the code is imcomplete (some lines have been blanked out with XXX), write these lines of code.

First, try to run the code with the file A0201.single_lig (defined in the box "Load Peptides").

Check what is the content of the file A0201.single_lig. It is located in the data/PSSM directory.

Next, look at the PSSM generated by your code. How does the out compare to the BLOSUM62 scoring matrix? That is, the first line in the PSSSM matrix should correspond to the BLOSUM scoring matrix line for the amino acids in the first position in the peptide from the A0201.single_lig file.

Next, make a PSSM based on the peptide data in the file A0201.small_lig without sequence weighting and pseudo counts. That is, set beta to 0.0 and sequence_weighting = False in the "Define options for run" box

Look at the output. Can you understand why the matrix has so many zero (or -999 depending on you implementation) values?

Now, make a matrix using the pseudo counts with a weight on prior of 50

Look at the output. What happened to all the zero (or -999) values? Can you understand what has happened?

Try also to visualize some of the PSSM's you have generated using the Seq2Logo server. You do this by simply copying the PSSM matrix made by your code into the submission window of Seq2Logo. Can you from these logos appreciate the power of pseudo counts?

The last part of the pep2mat program contains code to score peptides against a generated PSSM. Using this code you can test the performance of your PSSM by evaluating how this matrix can be used to predict the binding affinity for other peptide data. The file A0201.eval in the data/PSSM directory contain 1266 such peptide data. The format of the file is

ILYQVPFSV 0.8532
VVMGTLVAL 0.5891
ILDEAYVMA 0.4941
KILSVFFLA 0.8512
HLYQGCQVV 0.5386
where the first column is the peptide sequence and the second column is the binding score normalized so the a value of 1 corresponds to strong binding and a value of 0 corresponds to non-binding.

We can now investigate how the performance depends on the use of pseudo counts, sequence weighting and the number of peptide include in the data used for making the PSSM.

Using the PSSM/A0201.small_lig ligand data set to generate the PSSM, investigate how the evaluation performance depends on

Finally, you shall make a PSSM from a larger peptide data set (PSSM/A0201.large_lig) again using sequence weighting and beta = 50. What is the predictive performance of that matrix?

If you have more time you can try to modify the part of the program score a peptide to the PSSM to allow for position specific weighting to increase the weights of position 2 and 9 in the scoring matrix, and test if this can be used to improve the predictive performance

Finally, save the pep2mat program, delete the part of the code related to scoring of a peptide to a PSSM, and add commandline options to define

  -b BETA           Weight on pseudo count (default: 50.0)
  -w                Use Sequence weighting
  -f PEPTIDES_FILE  File with peptides
so that the code takes a peptide-file as input (define with the -f option) and and writes a PSSM to standard output (calculated defined by thge options -b and -w).

pep2score

Finally, open the file pep2score.ipynb. This program contains an extract of the last parts of the pep2mat code. This program might come handy if you later need to run larger benchmark analysis.

Spend some time to make sure you understand the structure of the program.

Add a commandline parser to allow for the options

 -mat MAT_FILE     File with PSSM
  -f PEPTIDES_FILE  File with peptides (format [peptide target])

Test the code with some of the PSSMs generated with you pep2mat program.


You can see the output I have gotten for some of the calcutations here

A0201.single_lig_mat
A0201.mat_b0_small.lig
A0201.mat_b50_small.lig
A0201.mat_small.lig_sw
A0201.mat_large.lig_sw
A0201.eval scored against A0201.mat_large.lig_sw

Now you are done. Remember to upload the pep2mat.py and pep2score.py programs via DTU-Learn.