ExSeqLogos

Revision as of 10:30, 15 March 2024 by WikiSysop (talk | contribs) (→‎Seq2Logo (CBS/DTU))

Exercise written by: Rasmus Wernersson (last update: March 2017)

Introduction

In this exercise we will introduce two methods for generating sequence logos, and we will investigate how to extract and compare sequence information from large sets of sequences.

WebLogo (Berkeley)

Link: http://weblogo.berkeley.edu/ (we'll use WebLogo version 2 for this exercise)

A good general-purpose logo generator for BOTH DNA and peptide sequences.

Seq2Logo (CBS/DTU)

Link: https://services.healthtech.dtu.dk/services/Seq2Logo-2.0/

A more advanced method for working with peptide sequences.

Part 1: DNA logos

We'll start out by investigating a small dataset of human splice sites (donor/acceptor)

http://weblogo.berkeley.edu/logo.cgi

HUMAN donor sites

The sequences below have been extracted from a random sample of human genes, and each line corresponds to the DNA sequence immediately BEFORE and AFTER the EXON/INTRON boundary. You can think of it as a multiple alignment written in a compact way, where all the sequence names have been discarded.

CAAAACCATTGTGAGTAATC
GCCAGAGCAGGTAAAATATC
GAACAGTCAGGTCTGTTGCT
GAAGGCCCAGGTGAGCATAA
TCCTCTACAGGTGGGTACAT
GGCGTCCCGCGTAAGTATGG
CCTCGTGCAGGTAAGATTAA
TGCATGACAGGTGAGTGTTA
GAAATGTACAGTAAGTCTCT
GGTTCTCTGGGTAAGTAGAG
AAATGTACAGGTGAGTACTG
ACCTCGCTTGGTACGTGGGA
AATCAGACAGGTATAGAAAC
AGGACAGAAGGTAATTTTCT
AACTATTTGGGTAGGTAGCA
GAACTTCCAGGTGTGTGCAG
AAACTTGAAGGTATGTTGTT
CTGGGATAAGGTAAAAGTAT
TTGCACCCAGGTTAGTGGAT
ACTTCAATCGGTATGTTTTC

TASK 1: Our first DNA LOGO

QUESTION #1:

  • Paste the logo into your report (if the PNG file gives you problems, try generating a PDF instead)
  • Can you recognize the DONOR site pattern? (Compare to the lecture slides) - how many bits of information are in the GT positions?
  • How many bases are from the EXON and how many are from the INTRON?
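
The height of each stack in a logo is the information content of that alignment column. The calculation can be sketched in Python as below, assuming the uncorrected Shannon formula for a 4-letter alphabet. WebLogo additionally applies a small-sample correction, so its values for a set of only 20 sequences will come out slightly lower. The function name `column_information` is our own and not part of WebLogo.

```python
import math
from collections import Counter

def column_information(seqs, alphabet_size=4):
    """Return the information content (in bits) of each alignment column.

    Uses I = log2(alphabet_size) - H, where H is the Shannon entropy of the
    column. No small-sample correction is applied (simplifying assumption).
    """
    info = []
    for column in zip(*seqs):          # iterate over alignment columns
        counts = Counter(column)
        total = len(column)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        info.append(math.log2(alphabet_size) - entropy)
    return info

# A few of the donor-site lines from the data set above
donors = [
    "CAAAACCATTGTGAGTAATC",
    "GCCAGAGCAGGTAAAATATC",
    "GAACAGTCAGGTCTGTTGCT",
    "GAAGGCCCAGGTGAGCATAA",
]

for position, bits in enumerate(column_information(donors), start=1):
    print(position, round(bits, 2))
```

Pasting all 20 lines into `donors` gives the uncorrected values to compare against the stack heights in the WebLogo plot.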


TASK 2: prettifying the LOGO

  • Since the interesting part of the sequences is at the EXON/INTRON boundary, it would be nice to be able to highlight this. An easy way to achieve this is to adjust the numbering scheme so that the GT starts at position "0".
  • Play around with the "First position number" setting to give the EXON sequence negative numbers and the INTRON sequence positive numbers.
  • While we're at it, put a title on the LOGO plot to indicate that it's about human donor sites.
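
The effect of the "First position number" setting can be sketched as simple arithmetic, assuming the tool numbers columns consecutively and does not skip 0. The helper `position_labels` is hypothetical and only illustrates the idea.

```python
def position_labels(n_columns, first_position):
    """Labels for each logo column when numbering starts at first_position."""
    return list(range(first_position, first_position + n_columns))

# Toy example: a 6-column alignment whose third column should be labelled 0
labels = position_labels(6, first_position=-2)
print(labels)  # [-2, -1, 0, 1, 2, 3]
```

In other words, the first position number is minus the number of columns that come before the column you want labelled 0.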

QUESTION #2:

  • Paste the new logo plot into your report (if the PNG file gives you problems, try generating a PDF instead)


TASK/QUESTION #3:

  • Finally for good measure we also want to generate a frequency plot for the human donor sites.
  • Find the option to do this, and paste the resulting LOGO into your report.
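
In a frequency plot the letter heights are the relative base frequencies in each column rather than bits of information. A minimal sketch of that calculation follows; the function name `column_frequencies` and the toy input are our own illustration.

```python
from collections import Counter

def column_frequencies(seqs, alphabet="ACGT"):
    """Relative frequency of each base in every alignment column."""
    freqs = []
    for column in zip(*seqs):
        counts = Counter(column)       # missing bases count as 0
        freqs.append({base: counts[base] / len(column) for base in alphabet})
    return freqs

# Toy alignment of two sequences, two columns
for position, freq in enumerate(column_frequencies(["AC", "AG"]), start=1):
    print(position, freq)
```

Unlike the bit-scaled logo, every column in a frequency plot has the same total height, so weak and strong positions are harder to tell apart at a glance.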

Research task: cross-species comparison

As we have seen above, it's pretty straightforward to generate DNA logos (provided that the data has already been well prepared), and it's now time to perform a real research task.

INPUT DATA:

The ZIP file linked below contains 500 sequences related to DONOR and ACCEPTOR sites for the following species:

  • Homo sapiens (Human)
  • Drosophila melanogaster (Fruit fly)
  • Arabidopsis thaliana (Thale Cress)
  • Saccharomyces cerevisiae (Baker's / Brewer's yeast)
  • Schizosaccharomyces pombe (Fission yeast)
  • Download link: LOGO_exercise_donor+acceptor_Q4.zip


OVERALL TASK:

It is now your task to investigate the signal around first the DONOR and then the ACCEPTOR site for all 5 species, and conclude what is similar and what is different.

IMPORTANT:

  • Make sure to label all your plots with species + donor/acceptor (you will be generating 10 different plots, and it's important that they are easy to tell apart).

QUESTION #4:

  • Include all LOGOs in your report, and note down your observations — e.g. if less signal is seen in the EXON part, will more signal be seen in the INTRON part?
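
One way to put numbers on the exon-versus-intron comparison is to sum the information content on each side of the boundary. The sketch below assumes the uncorrected Shannon formula and takes the boundary column as a parameter you read off the data yourself; `split_signal` and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def column_bits(column, alphabet_size=4):
    """Uncorrected information content (bits) of one alignment column."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy

def split_signal(seqs, boundary):
    """Total bits left and right of the boundary.

    boundary is the number of columns on the left-hand side, e.g. the
    number of exon bases for a donor-site alignment.
    """
    bits = [column_bits(col) for col in zip(*seqs)]
    return sum(bits[:boundary]), sum(bits[boundary:])

# Toy alignment: 2 columns on the left, 2 on the right
toy = ["AAGT", "CAGT", "GAGT", "TAGT"]
left, right = split_signal(toy, boundary=2)
print(round(left, 2), round(right, 2))
```

Running this for each species and site type gives two numbers per plot that can be tabulated next to the logos.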

Looking for much weaker DNA motifs

In the examples above, we have been investigating some pretty strong signals, and now we turn our attention to how to work with data sets with somewhat weaker motifs.

Shine-Dalgarno sequence

In prokaryotes the translation of an mRNA transcript is initiated by the binding of the ribosome to the mRNA a little upstream of the start codon. This binding site is known as the RBS (ribosomal binding site) and the sequence being recognized by the ribosome is known as the "Shine-Dalgarno sequence". The consensus sequence is AGGAGG (DNA: AGGAGG) and it's located approximately 8 bp before the AUG (DNA: ATG).

In order to investigate this in more detail, we have prepared a data set of 500 E. coli genes, each of which includes 50 bp before the START codon and the first 50 bp of the coding sequence (100 bp in all).


TASK:

  • Generate a LOGO of the 500 E. coli sequences. Since we're visualizing 100bps it will be easier to see if we stretch out the LOGO a bit: set the dimensions to 50*5cm.
  • As can easily be seen, many of the sequence positions contain next to no signal, and we can benefit from narrowing down the region we look at.


TASK:

  • Generate a new LOGO of the E. coli data set, but set the Logo range to position 25-75 to only show the middle part of the data.
  • Play around with the dimensions to get a good looking LOGO.
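
Restricting the logo range amounts to slicing the same columns out of every sequence. A minimal sketch, assuming 1-based inclusive positions as in the task text; the helper `logo_range` and the placeholder sequence are our own.

```python
def logo_range(seqs, start, end):
    """Keep alignment columns start..end (1-based, inclusive)."""
    return [s[start - 1:end] for s in seqs]

# Placeholder 100 bp sequence: 50 bp upstream, then ATG and 47 more bases
example = "N" * 50 + "ATG" + "N" * 47
window = logo_range([example], 25, 75)[0]
print(len(window), window.find("ATG"))
```

With 50 bp upstream of the coding sequence, the start codon sits in columns 51 to 53 of the full alignment, which helps when choosing a range that includes or excludes it.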


QUESTION #5:

  • Paste your LOGO into the report.
  • Is the start codon always "ATG"?
  • Can you see something resembling the Shine-Dalgarno sequence (at which positions)?

In order to get a better "view" of the sequence region with the Shine-Dalgarno sequence, we need to zoom in a bit further. The very strong signal from the START codon is blinding us a bit, and it can be beneficial to remove that from our field of view.

TASK:

  • Set the range to position 30-50, and play around with the dimensions to get a good looking plot.
  • Furthermore, set the Y-axis to be capped at 0.5 bits, and put in Y-axis tic marks at 0.1 bit intervals.


QUESTION #6:

  • Paste your LOGO into the report.
  • Is the logo (more or less) consistent with the 6 bp consensus sequence?

Kozak sequence

As the final step in our work with DNA logos, we shall investigate the signal around eukaryotic START codons, and see if we can find the "Kozak sequence" which aids in the initiation of eukaryotic translation. For this we have prepared a data set of 500 yeast sequences (50 bp before+after CDS start - exactly as above). It is now your task to investigate whether you can find a signal upstream of (before) the START codon, and prepare a good visualization of this using the tricks you have learned so far.


QUESTION 7:

  • Can you find any positions (except for the ATG) with information content above 0.2 bits?
  • For any such position what is the approximate frequency of the most common base?
  • Include relevant LOGO plots in your report
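
To get a feel for how information content relates to the frequency of the most common base, the following sketch evaluates the uncorrected Shannon formula for a column where one base has frequency p and the other three share the remainder equally. That equal split is a simplifying assumption; real columns will differ, so treat the numbers as a rough guide only.

```python
import math

def bits_for_top_frequency(p, alphabet_size=4):
    """Bits in a column where one base has frequency p and the remaining
    bases share 1 - p equally (simplifying assumption)."""
    rest = (1 - p) / (alphabet_size - 1)
    entropy = 0.0
    for f in [p] + [rest] * (alphabet_size - 1):
        if f > 0:                      # 0 * log2(0) is taken as 0
            entropy -= f * math.log2(f)
    return math.log2(alphabet_size) - entropy

for p in (0.25, 0.4, 0.5, 0.6, 0.8):
    print(p, round(bits_for_top_frequency(p), 2))
```

The printed table can be read backwards: find the row closest to the stack height seen in the logo to estimate the frequency of the top base.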

Part 2: Protein logos

We'll start our work with peptide LOGOs with a study of signal peptides, which is a well understood system, and we have prepared data files for this in advance.

Signal peptides

Most secreted proteins have an N-terminal signal peptide that directs the protein to the secretory pathway. During passage of the membrane (plasma membrane in prokaryotes, ER membrane in eukaryotes), the signal peptide is cleaved off.

Signal peptides are known from all domains of life, but there are certain differences between signal peptides of various taxonomical groups. Eukaryotic signal peptides are on average shorter than those of bacteria, and Gram-negative bacteria (bacteria with two membranes) have shorter signal peptides than Gram-positive bacteria (bacteria with one membrane). (Not much is known about Archaeal signal peptides).

It is now your task to investigate whether there are other differences between signal peptides of Eukaryotes and the two bacterial groups.

DATA: Zip archive with signal peptides for all 3 taxonomical groups. Download link: signal_peptides.zip

  #Seqs #File name
   3280 EUK.sp.25+5.fasta
    416 gram+.sp.25+5.fasta
    846 gram-.sp.25+5.fasta

Each sequence line contains (up to) 25aa signal peptide + 5aa following the cleavage site. All sequences have been aligned at the cleavage site as illustrated below. Notice that not all signal peptides are 25aa long, and in these cases gaps have been inserted at the front instead.

--MKQILLISLVVVLAVFAFNVAEG CDATC
KTHSVFGFFFKVLLIQVYLFNKTLA APSPI
---MARGAALALLLFGLLGVLVAAP DGGFD
MARMKYNIALIGILASVLLTIAVNA ENACN
-------MSFRSLLALSGLVCSGLA NVISK
--MSSTWIKFLFILTLVLLPYSVFS VNIFA
---------MIALFVLMGLMAAASA SSCCS
KKTVAALSFLFIVLFVAQEIAVTEA KTCEN
-----MKAIFVSALLVVALVASTSA HHQEL
------MLRLLLLPLFLFTLSMCMG QTFQY

|.... signal peptide ...|^next 5 amino acids - the peptide is cleaved at the "^" mark
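
The alignment at the cleavage site can be reproduced with simple string padding. The sketch below is our own illustration of the layout described above, not the script used to prepare the files; it assumes that peptides longer than 25aa are trimmed to their last 25 residues.

```python
def align_at_cleavage(signal_peptide, following, sp_width=25, gap="-"):
    """Right-align the signal peptide so that every cleavage site falls
    in the same alignment column."""
    sp = signal_peptide[-sp_width:]    # keep at most the last sp_width residues
    return sp.rjust(sp_width, gap) + following

# First example line from the illustration above
print(align_at_cleavage("MKQILLISLVVVLAVFAFNVAEG", "CDATC"))
```

Because the padding goes at the front, the leftmost columns of the logo are built from fewer real residues than the columns near the cleavage site.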

Signal peptide comparison

OVERALL TASK:

  • Generating peptide LOGOs with the WebLogo resource is as straightforward as before, and your goal is to compare the motifs in the signal peptides across the three taxonomical groups.
  • Play around with the options (dimensions, Y-axis scaling, tics etc) as much as you like, but make sure that:
    1. It's easy to compare the plots (put in labels etc)
    2. The y-axis scaling must be the same across the 3 plots
    3. Plot the signal peptide part of the sequences in negative numbers
    4. Read up on what the coloring means (click the "?" in "color scheme") - this will help you interpret the results.

QUESTION 8:

  • Put the 3 plots in your report and comment on the following:
  • Which single positions (using the negative numbering scheme) appear to be the most important across all data sets?
  • Are there any regions where a certain class of amino acids appears to be important? What characterizes these amino acids?
  • Any striking differences between the three taxonomical groups?

As the next step we shall investigate how to work with the Seq2Logo tool hosted locally here at DTU. It offers a lot more advanced functionality with regard to the algorithms behind the plots, and we'll get back to many of these functions next week when we work with weight matrices.

TASK: create a basic LOGO plot for the eukaryote dataset

  • Set logo type to "Shannon"
  • Put in a title
  • Leave the rest of the "strange" options as default - for reasons we'll learn later they matter less for a large and well-balanced data set such as this one

QUESTION 9:

  • Include the plot in your report
  • Does it (more or less) show the same as the plot from WebLogo?

Small data sets

So far we have been working with relatively large data sets and have been able to see some quite clear patterns. But what happens if the data is limited? Below is a small set (n=20) of the signal peptides from the eukaryote data set:

--MKQILLISLVVVLAVFAFNVAEGCDATC
KTHSVFGFFFKVLLIQVYLFNKTLAAPSPI
---MARGAALALLLFGLLGVLVAAPDGGFD
MARMKYNIALIGILASVLLTIAVNAENACN
-------MSFRSLLALSGLVCSGLANVISK
--MSSTWIKFLFILTLVLLPYSVFSVNIFA
---------MIALFVLMGLMAAASASSCCS
KKTVAALSFLFIVLFVAQEIAVTEAKTCEN
-----MKAIFVSALLVVALVASTSAHHQEL
------MLRLLLLPLFLFTLSMCMGQTFQY
MFRVTSVGCLLLVIVFLNLVVPTSACRAEG
GRAMVARLGLGLLLLALLLPTQIYCNQTSV
---MKNHLLFWGVLAVFIKAVHVKAQEDER
----MQRLCVCVLILALALTAFSEASWKPR
---MKLFTTLSASLIFIHSLGSTRAAPVTG
LRLLLSALKPGIHVPRAGPAAAFGTSVTSA
---------MKSLIVFACLVAYAAADCTSL
------MKTALPLLLLTCLVAAVQSTGSQG
--MGLRALMLWLLAAAGLVRESLQGEFQRK
-MATTRFPSLLFYSYIFLLCNGSMAQLFGQ

TASK: Generate a standard LOGO of the 20 sequences

  • IMPORTANT: Be sure to reset the Seq2Logo page, so that you're not accidentally calculating the plots with the previous input data
  • Set type to "Shannon"
  • Set Weight on prior to 0 (this will disable the advanced functionality for small sequence sets, and produce a bare-bones LOGO)
  • Generate the LOGO - and save it for comparison (or just paste it into your report right away).

This logo represents the information that can be obtained from this limited set of sequences using the standard LOGO algorithm.

Follow-up TASK: Generate a pseudo-count assisted LOGO of the 20 sequences

  • Set Weight on prior to 200
  • Generate the new LOGO and compare it to the one we created a moment ago.
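
The general idea behind a "weight on prior" can be sketched as a weighted average of the observed frequencies f and a prior distribution g: p = (alpha * f + beta * g) / (alpha + beta). The exercise does not spell out how Seq2Logo computes this, so the sketch below rests on two loud assumptions: alpha is simply the number of sequences, and g is a uniform distribution over the 20 amino acids standing in for whatever prior the tool actually uses. The function `blend_with_prior` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def blend_with_prior(observed, n_seqs, weight_on_prior, prior=None):
    """Mix observed column frequencies with a prior distribution.

    observed: dict mapping amino acid -> observed frequency in one column
    prior:    dict of prior frequencies; uniform if omitted (assumption)
    """
    if prior is None:
        prior = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    alpha, beta = n_seqs, weight_on_prior   # alpha = n_seqs is an assumption
    return {
        aa: (alpha * observed.get(aa, 0.0) + beta * prior[aa]) / (alpha + beta)
        for aa in AMINO_ACIDS
    }

# A column where all 20 sequences have alanine
column = {"A": 1.0}
print(round(blend_with_prior(column, 20, 0)["A"], 3))    # prior switched off
print(round(blend_with_prior(column, 20, 200)["A"], 3))  # prior dominates
```

With a weight of 0 the observed frequencies pass through unchanged, which matches the bare-bones LOGO from the previous task; the larger the weight relative to the number of sequences, the more the prior shapes the plot.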

QUESTION 10:

  • Include BOTH plots in your report
  • Comment briefly on what you see - which of the plots best resembles what we learned from the big data sets?