ExLogo+Matrix-answers
Answers to "Construction of sequence logos and weight matrices"
Identification of MHC binding motifs
QUESTION 1: Which positions are anchor positions and what amino acids are found at the anchor positions?
- Anchor positions are P2 and P9. Preferred amino acids are P2: LM, P9: VL. You don't have to take the "Auxiliary anchor" into account.
Construction of weight matrices
Unnumbered question: Have a look at the sequence logo. How many different amino acids are present in the logo?
- More than 10 (all 20 in fact)!
Unnumbered question: Can you understand the weight matrix values? Hint: compare the weight matrix values to the Blosum62 scoring matrix values for L.
- If you only have one sequence (one amino acid), alpha in the equation for the combined frequency is zero, and p = g. For example, the weight matrix value for A is calculated as
g(A) = q(A|L) = 0.04
p(A) = g(A) = 0.04
w(A) = 2*log(0.04/0.074)/log(2) = -1.78
- This value compares well with the Blosum scoring matrix value for matching L to A, BL(A,L) = -1. The matrix value reported by the EasyPred program is -1.468. The difference between the value found here (-1.78) and the EasyPred value (-1.468) is due to round-off errors: the EasyPred program uses a Blosum matrix with more digits defining the substitutions.
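The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the EasyPred implementation. The form of the combined frequency, p = (alpha*f + beta*g)/(alpha + beta), is an assumption about "the equation for the combined frequency" mentioned in the answer. The function names and the value of beta are hypothetical: with alpha = 0, any positive beta gives p = g.

```python
import math

def combined_frequency(f, g, alpha, beta):
    """Mix observed frequency f and pseudo frequency g (assumed form)."""
    return (alpha * f + beta * g) / (alpha + beta)

def weight_matrix_value(p, q):
    """Log-odds score in half-bits: 2 * log(p / q) / log(2)."""
    return 2 * math.log(p / q) / math.log(2)

g_A = 0.04    # q(A|L), as quoted in the answer
q_A = 0.074   # background frequency of A, as used in the answer
beta = 200.0  # hypothetical weight on prior; irrelevant when alpha = 0

# A is not observed (f = 0) and a single sequence gives alpha = 0, so p = g.
p_A = combined_frequency(f=0.0, g=g_A, alpha=0.0, beta=beta)
w_A = weight_matrix_value(p_A, q_A)
print(f"p(A) = {p_A:.2f}, w(A) = {w_A:.2f}")
```

Running the sketch should give a value close to the -1.78 derived by hand.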
QUESTION 2: How many different amino acids are present at the P1 position in the logo (just give a rough estimate)?
- More than 10 (all 20 in fact)!
QUESTION 3: How many different amino acids are present at the P1 position in the binding data?
- 2.
QUESTION 3.1: Try to reproduce the matrix values for P1(I), and P1(K).
- The pseudo frequencies for I and K are
g(I) = 0.4*0.12 + 0.6*0.16 = 0.144
g(K) = 0.4*0.03 + 0.6*0.03 = 0.03
- Since the weight on prior (the weight on pseudo counts, beta) is much greater than the number of sequences, the final frequencies are p(I) = g(I) and p(K) = g(K). Using the formula for the weight matrix values with q(I) = 0.068 and q(K) = 0.058 (remember that q denotes the background frequencies), we find the weight matrix values to be
- Score(I) = 2.17 and Score(K) = -1.90
- These values compare well with the values calculated by the EasyPred program (2.18 and -2.34). Remember, the EasyPred program uses a Blosum matrix with more digits defining the substitutions.
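The P1 calculation can be reproduced with the short Python sketch below. It is an illustration only. The helper names are hypothetical, and the two observed amino acids are represented only by their frequencies (0.4 and 0.6) and the conditional probabilities quoted in the answer, because the answer does not name them. The sketch assumes p = g, as argued above for a large weight on prior.

```python
import math

def pseudo_frequency(observed_freqs, cond_probs):
    """g(a) = sum over observed residues b of f(b) * q(a|b)."""
    return sum(f * q for f, q in zip(observed_freqs, cond_probs))

def weight_matrix_value(p, q):
    """Log-odds score in half-bits: 2 * log2(p / q)."""
    return 2 * math.log2(p / q)

f_obs = [0.4, 0.6]  # frequencies of the two amino acids observed at P1

g_I = pseudo_frequency(f_obs, [0.12, 0.16])  # q(I|b) values from the answer
g_K = pseudo_frequency(f_obs, [0.03, 0.03])  # q(K|b) values from the answer

# With a large weight on prior, p is taken to be equal to g.
score_I = weight_matrix_value(g_I, 0.068)  # q(I) = 0.068
score_K = weight_matrix_value(g_K, 0.058)  # q(K) = 0.058
print(f"g(I) = {g_I:.3f}, g(K) = {g_K:.3f}")
print(f"Score(I) = {score_I:.3f}, Score(K) = {score_K:.3f}")
```

The printed scores should agree with the hand-calculated values to within rounding of the two-digit input probabilities.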
Weight Matrix generation
Small training set
QUESTION 4: What is the predictive performance of the matrix method?
- Pearson coefficient for N= 1266 data: 0.07628 Aroc value: 0.56979
View the logo plot of the calculated matrix. Can you understand why the matrix performs so poorly?
- The logo shows very low information content at all positions. The matrix was trained on a mixture of peptide binders and peptide non-binders, which is clearly wrong: only binders should be included when estimating a binding motif.
QUESTION 5: How many of the 110 peptides in the small.train.set are included in the matrix construction (Look for number of positive training examples)?
- All 110 peptides were included.
QUESTION 6: What is the predictive performance of the matrix method now?
- Pearson coefficient for N= 1266 data: 0.29529 Aroc value: 0.71191
QUESTION 7: How many of the 110 peptides in the small.train.set are included in the matrix construction?
- 10.
View the logo plot of the calculated matrix. Does the logo resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?
- No.
QUESTION 8: What is the predictive performance of the matrix method now?
- Pearson coefficient for N= 1266 data: 0.45328 Aroc value: 0.81865
Again view the logo plot of the calculated matrix. Has it changed compared to the previous calculation?
- In some positions, the order of the letters has changed, but it still does not resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise.
QUESTION 9: What is the predictive performance of the matrix method now?
- Pearson coefficient for N= 1266 data: 0.49684 Aroc value: 0.84838
View the logo plot of the calculated matrix. What is the big difference between this logo and the two previous ones? (how many different amino acids are present at each position in the binding motif?)
- In the two previous logo plots, only four amino acids were present at, for instance, P2. In this last logo, all amino acids are present. The information content is also much lower.
What are the reasons for these differences?
- Pseudo-counts give non-zero frequencies also to amino acids that were not observed. This adds more terms to the sum in the equation for the information content and thereby lowers its value.
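The effect can be illustrated with a toy Python example. It assumes the Shannon form of the information content, I = log2(20) + sum of p*log2(p), and it mimics pseudo-counts by mixing the observed frequencies with a uniform distribution. Both choices are simplifying assumptions for illustration. The exercise itself uses Blosum-based pseudo-counts, and the frequencies below are made up.

```python
import math

def information_content(freqs):
    """Shannon information in bits: log2(20) + sum of p*log2(p), skipping p = 0."""
    return math.log2(20) + sum(p * math.log2(p) for p in freqs if p > 0)

# Toy position: only four of the 20 amino acids observed, equally often.
raw = [0.25] * 4 + [0.0] * 16

# Stand-in for pseudo-counts: mix with a uniform distribution (assumption).
uniform = [1 / 20] * 20
weight = 0.5  # hypothetical relative weight on the pseudo-count part
smoothed = [(1 - weight) * r + weight * u for r, u in zip(raw, uniform)]

info_raw = information_content(raw)
info_smoothed = information_content(smoothed)
print(f"without pseudo-counts: {info_raw:.2f} bits")
print(f"with pseudo-counts:    {info_smoothed:.2f} bits")
```

In the smoothed distribution all 20 amino acids have non-zero frequency, and the information content is lower than for the raw frequencies.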
Does the logo begin to resemble the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?
- Yes, it captures some of the features (especially that position 2 is most important). Remember, it is still made from only 10 binding peptides.
Large training set
QUESTION 10: What is the predictive performance of the matrix method now?
- Pearson coefficient for N= 1266 data: 0.71798 Aroc value: 0.96651
View the logo plot of the calculated matrix. Does the logo compare to the logo for the HLA-A*0201 binding motif shown in the beginning of the exercise?
- Yes.
QUESTION 11: Look at the prediction list. How many false positive hits do you find among the top 20 highest scoring peptides (Assignment score < 0.426)?
- One false positive (Assignment score = 0).
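Counting false positives in a ranked prediction list can be sketched as follows. The function name and the toy data are hypothetical and are not taken from the exercise output. The only values borrowed from the question are the threshold of 0.426 on the assignment score and the default of 20 top-scoring peptides.

```python
def count_false_positives(predictions, top_n=20, threshold=0.426):
    """Count top-scoring peptides whose assignment score is below the threshold.

    predictions: list of (prediction_score, assignment_score) pairs.
    """
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, assignment in ranked[:top_n] if assignment < threshold)

# Made-up (prediction score, assignment score) pairs for illustration only.
toy = [(9.1, 0.80), (8.7, 0.65), (8.2, 0.0), (7.9, 0.51), (3.0, 0.10)]

n_fp = count_false_positives(toy, top_n=4)
print(n_fp)
```

In this toy list, one of the four highest-scoring peptides has an assignment score below 0.426 and is counted as a false positive.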