ExSeqLogosAnswers

Answers to the Sequence logo exercise

Written by: Rasmus Wernersson, April 2015 (latest update: April 2018)

Q1

  • The "GT" consensus site is clearly seen (completely conserved: 2 bits of information), and it appears that there is also some signal on the exon side (preference for "G" on the position before) and on the intron side.
  • The intron starts at position 11, so positions 1-10 are EXON and positions 11-20 are INTRON.
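The "2 bits" quoted for the conserved "GT" can be checked with a small calculation. The sketch below assumes the usual Shannon definition of information content for a DNA column (log2(4) minus the column entropy) and ignores any small-sample correction that a logo program may apply; the function name and the toy columns are made up for illustration.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed letters; no small-sample correction is applied.
    """
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

# Toy columns (hypothetical data, one letter per sequence)
conserved = column_information("GGGGGGGG")   # every sequence has G
two_letter = column_information("TCTCTCTC")  # two letters, equally frequent
uniform = column_information("ACGTACGT")     # all four letters equally frequent
print(conserved, two_letter, uniform)
```

A fully conserved column gives 2 bits, a column split evenly between two letters gives 1 bit, and a column with all four bases equally frequent gives 0 bits.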

Q4 - cross species comparison

IMPORTANT: This question is easiest to answer if you compare the DONOR and ACCEPTOR sites separately across the 5 species (it is simply easier to spot the differences this way).

Donor sites

Observations:

  • The animal (human + fruit fly) DONOR sites contain ~1 bit of information in the very last position of the exon (with a preference for "G"), and some information (< 1 bit) for the 4 positions following "GT" in the intron (=> in 6 intron positions in all)
  • The plant (Arabidopsis) has the same pattern of some signal in the final 2 exon positions, but the signal in the intron is very weak after the "GT".
  • The two fungal species show next to no signal in the exon, but a very strong signal on the intron side beyond the "GT".

Acceptor sites

  • Overall observation: the ACCEPTOR site motif is much more similar across all 5 species than the DONOR site motif.
    • In all cases there is next to no signal on the EXON side (after the "AG") - and there is a strong preference for T and C (~1 bit, as strong as it can get for a two-letter preference) immediately before AG.
    • There is a diffuse preference for Ts in the region before AG in the animals + the plant.
    • This preference for Ts is clearly centered around the -9 position in the fungi.

Q5 - E. coli - Shine-Dalgarno

  • The START codon is mostly ATG, but GTG is common enough to be seen at the first position. If you zoom in on positions 51-53, it is possible to see a small number of other bases also being used in rare cases.
  • A region with As and Gs can be seen at positions 40-44, which could potentially be part of the SD sequence.
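The start codon observation can also be checked by counting directly in the sequences rather than reading it off the logo. This is a minimal sketch, assuming the sequences are available as plain strings aligned so that the start codon sits at positions 51-53 (1-based); `codon_counts` and the toy data are hypothetical and not part of the exercise files.

```python
from collections import Counter

def codon_counts(sequences, start=51, length=3):
    """Count the triplets found at 1-based position `start` in each sequence."""
    begin = start - 1  # convert 1-based logo coordinate to 0-based index
    return Counter(
        seq[begin:begin + length]
        for seq in sequences
        if len(seq) >= begin + length  # skip sequences that are too short
    )

# Hypothetical toy data: 50 upstream bases followed by the start codon
upstream = "A" * 50
toy = [upstream + codon for codon in ["ATG"] * 8 + ["GTG"] * 2 + ["TTG"]]
counts = codon_counts(toy)
print(counts.most_common())
```

On the real data set, the rare codons that are hard to see in the logo would show up as low counts in this table.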

Q6 SD zoom

The LOGO is consistent with the consensus sequence AGGAGG in the sense that it does not strongly disagree with it. Judging from the LOGO, it is a bit of a stretch to claim A over G in any of the positions, but an overrepresentation of As OR Gs is clearly seen.

Q7 - Kozak sequence (Yeast)

There appears to be a weak signal in the positions immediately before the START codon (especially the -3 position = coordinate 48).

Zoom of the 40-50 region
Zoom + Y axis rescale of the 40-50 region


It can now clearly be seen that only position 48 (= 3 before the ATG) has information content above 0.2 bits.

A frequency plot of the same region shows that >50% of the sequences have A in position 48.
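It may seem surprising that a position where more than half of the sequences share one base carries so little information. The sketch below illustrates this with a made-up column (11 A, 3 C, 3 G, 3 T, so 55% A); the numbers are hypothetical and not taken from the yeast data set, and no small-sample correction is applied.

```python
import math
from collections import Counter

def base_frequencies(sequences, position):
    """Relative frequency of each base at a 1-based alignment position."""
    column = [seq[position - 1] for seq in sequences]
    counts = Counter(column)
    return {base: n / len(column) for base, n in counts.items()}

def information(freqs, alphabet_size=4):
    """Information content (bits) from a dict of base frequencies."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy

# Hypothetical single-column "alignment": 55% A, the rest spread evenly
toy_column = ["A"] * 11 + ["C"] * 3 + ["G"] * 3 + ["T"] * 3
freqs = base_frequencies(toy_column, 1)
bits = information(freqs)
print(freqs["A"], round(bits, 2))
```

In this toy case the column has 55% A but only about 0.29 bits of information, which is why the signal is much easier to spot in a frequency plot than in a logo scaled to 2 bits.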

Q8 - Signal peptides comparison

  • Similarities:
    • It is clearly seen that positions -1 (just before the cleavage site) and -3 are important, and A (alanine) is preferred here (especially in the prokaryotes).
    • In all three cases there is a stretch of hydrophobic (color = black) amino acids (L, V, A, I) in the middle of the signal peptide.
  • Differences:
    • The preference for A (alanine) at the -1 position is much stronger in the prokaryotic sequences
    • The hydrophobic stretch is longer in Gram positive bacteria
    • There is a preference for S/A at position -6 in Gram negatives that is not seen elsewhere
    • There is no signal after the cleavage site in eukaryotes and some signal in the first few positions in both prokaryotic groups

  • Yes - it clearly shows the same overall motif as above. Note that, unlike WebLogo, Seq2Logo indicates positions with gaps by making the stack of letters narrower.

Q10 - small data sets

IMPORTANT: Compare the LOGOs from the small data set to the LOGO we got from the large data set (Question 9+10) and investigate if you can see the same pattern.

  • The first plot (without pseudo-counts) is very noisy, and only the broadest trends can be seen: the tendency to have an "A" at the -1 position and a somewhat diffuse hydrophobic region.
  • In the second plot (with pseudo-counts) the picture looks a lot more like what we saw in the big data sets: a specific pattern at the -1 and -3 positions, and a hydrophobic region much closer in shape to what we saw before.
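The effect of pseudo-counts on a small data set can be illustrated with a short calculation. Note that this sketch uses a simple uniform additive pseudo-count, which is an assumption made for illustration and not necessarily the scheme used by the logo server; the function name, the pseudo-count value of 0.5 and the toy column are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_information(column, pseudocount=0.0, alphabet=AMINO_ACIDS):
    """Information content (bits) of a protein column.

    `pseudocount` is added to the count of every letter in the alphabet
    (uniform additive smoothing) before frequencies are computed.
    """
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    info = math.log2(len(alphabet))
    for letter in alphabet:
        p = (counts[letter] + pseudocount) / total
        if p > 0:
            info += p * math.log2(p)  # subtracts this letter's entropy term
    return info

# Hypothetical column from a data set of only 5 sequences
noisy_column = "LLVLL"
raw = column_information(noisy_column)
smoothed = column_information(noisy_column, pseudocount=0.5)
print(round(raw, 2), round(smoothed, 2))
```

With only 5 sequences, the raw column looks almost fully conserved, while the smoothed value is much lower. This damping of chance patterns is what makes the logo with pseudo-counts less noisy.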