Score sequence data with a PSSM

Description

Position specific scoring matrices (PSSM) are statistically motivated sequence motif models that provide higher sensitivity and specificity than regular expressions. The project consists of reading a TRANSFAC matrix table, converting it to a log-likelihood matrix, which is used to find matching motifs in a DNA fasta file.

Learn more about PSSM: https://en.wikipedia.org/wiki/Position_weight_matrix

Input and output

The program is given a fasta file with DNA sequences, a TRANSFAC matrix table, an 'A' prior and a score threshold (you decide the threshold) for the motif. The motif in the TRANSFAC table can be of varying length, as can the DNA sequences. The output should be the sites that score above the provided threshold (where on which sequence). Site scores are calculated as the sum of PSSM weights selected at each position.

Getting TRANSFAC matrices

You can find weight matrices that describe the DNA binding preference for transcription factors in the database TRANSFAC. You will have to register with TRANSFAC to gain access but it is free. Choose the TRANSFAC matrix table and search for 'M' to see all. The project can be done without registering as examples are provided.
See TRANSFAC examples here: M00302, M00218, M00038, M00014.
The units for these weights are often unknown and as such should be interpreted as un-normalized frequencies. Basically, just use the numbers as they are, even if they contain decimals. The numbers needed are between the PO (position) line and the following XX line. Only the numbers (which represents positional base counts from an alignment) are needed, i.e. the numbers in the A/C/G/T columns. They can be extracted with stateful parsing.

PO	A	C	G	T
01	6	11	2	7	N
02	13	4	5	4	A
03	5	5	8	8	N
04	12	1	2	11	W
05	2	0	23	1	G
06	0	0	26	0	G
07	26	0	0	0	A
08	25	0	1	0	A
09	25	1	0	0	A
10	15	5	2	4	A
11	9	6	2	9	N
12	5	6	6	9	N
XX

Converting from a positional base count matrix to a log-likelihood matrix

The first 3 rows in the extracted matrix are used for demonstration.

A	C	G	T
6	11	2	7
13	4	5	4
5	5	8	8

Frequencies can be interpreted as letter (nucleotide) probabilities for a particular letter at a particular position in the matrix. The letter background probabilities, i.e. probability of a letter not within the pattern, are often called prior probabilities (priors), i.e. what we would expect before we knew about the motif. The priors will depend on the organism and how the background is defined.
In E.coli, for example, the genomic nucleotide priors are almost equiprobable (p_A:0.25, p_C:0.25, p_G:0.25, p_T:0.25) but in yeast they are closer to p_A:0.31, p_C:0.19, p_G:0.19, p_T:0.31 (note that p_A==p_T and p_C==p_G).
Notice that if you know the prior for one base (A), then you can calculate the rest due to the p_A==p_T, p_C==p_G and p_A + p_T + p_C + p_G = 1 properties.

As can be seen there is a risk of a number (count) being 0, which will introduce some unpleasant infinities in further calculations. Hence, we will introduce pseudo-counts to the matrix. For each position (every row and column) in the matrix a pseudo-count of (1 * the prior) will be added. Using the priors for yeast in this example the first 3 rows will be:

A	C	G	T
6.31	11.19	2.19	7.31
13.31	4.19	5.19	4.31
5.31	5.19	8.19	8.31

Now the numbers must be converted from counts to probabilities. Each number is divided by the sum of it's row; 6.31/(6.31+11.19+2.19+7.31) = 0.2337

A	C	G	T
0.2337	0.4144	0.0811	0.2707
0.4930	0.1552	0.1922	0.1596
0.1967	0.1922	0.3033	0.3078

The probability matrix can then be converted to a log-likelihood matrix by dividing with the appropriate prior and then use the log2. log2(0.2337/0.31) = log(0.2337/0.31)/log(2) = -0.4076

A	C	G	T
-0.4076	1.1250	-1.2282	-0.1956
0.6693	-0.2919	0.0166	-0.9578
-0.6563	0.0166	0.6747	-0.0103

Using the log-likelihood matrix on a sequence

Given a sequence TGATAGT and above matrix, compute the score for each position of the sequence and report those above the given threshold. As an example, just using the first 3 positions (TGA) the score is found by addition: -0.1956 + 0.0166 + -0.6563 = −0.8353, and for the next (GAT) it is -1.2282 + 0.6693 + -0.0103 = −0.5692.
Just to be clear: The length (height) of the matrix determines the minimum length of sequence you can use for scoring. When the score is 0 it is equally likely that the site is a real motif as it is a random occurrence.

PO	A	C	G	T
01	6	11	2	7	N
02	13	4	5	4	A
03	5	5	8	8	N
04	12	1	2	11	W
05	2	0	23	1	G
06	0	0	26	0	G
07	26	0	0	0	A
08	25	0	1	0	A
09	25	1	0	0	A
10	15	5	2	4	A
11	9	6	2	9	N
12	5	6	6	9	N
XX

PO	A	C	G	T
01	6	11	2	7	N
02	13	4	5	4	A
03	5	5	8	8	N
04	12	1	2	11	W
05	2	0	23	1	G
06	0	0	26	0	G
07	26	0	0	0	A
08	25	0	1	0	A
09	25	1	0	0	A
10	15	5	2	4	A
11	9	6	2	9	N
12	5	6	6	9	N
XX