Exercise: The protein database UniProt
Exercise written by: Henrik Nielsen - updated by Morten Nielsen and Rasmus Wernersson
In this exercise, we shall extract information from the protein database UniProt. This database is maintained in collaboration between the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), England, and Georgetown University, Washington DC, USA.
UniProt, http://www.uniprot.org/, consists of three parts:
- UniProt Knowledge-base (UniProtKB)
- protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
- homology-reduced database, where similar sequences (having a certain percentage identity) are merged into clusters, each with a representative sequence
- UniProt Archive (UniParc)
- an archive containing all versions of UniProt without annotations
Of these databases, the UniProt Knowledge-base is the most useful, and this is the database we shall be using today. The UniProt Knowledge-base consists of two parts:
- UniProtKB/Swiss-Prot
- a manually annotated (reviewed) protein database.
- UniProtKB/TrEMBL
- a computer-annotated supplement to Swiss-Prot, which contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.
Simple text mining
First, we will find some UniProt entries using simple text mining. You are supposed to find the entry for human insulin.
- Open the UniProt home-page http://www.uniprot.org/
- Type human insulin in the search field at the top of the page. Leave the search menu on "UniProtKB", which is the default. Press Enter or click the Search button.
- QUESTION 1:
- How many hits do you find? (tip: See the number above the results list)
- How many of these hits are from Swiss-Prot? (tip: See under "Reviewed" at the top left)
- Can you identify the correct hit (i.e. see which one is actually human insulin and not something else)? If yes, write down its Accession code (found under Entry) and Entry name (also called ID).
In this case, it was relatively easy to spot the correct hit, but sometimes it is more difficult. If you do not identify the correct hit immediately, it will often help to narrow down the search, and that is exactly what we ask you to do in the next four questions.
The first step is searching for proteins that actually come from the organism "human" and are named something containing the word "insulin", as opposed to just containing the words "human" and "insulin" somewhere in the description. This can be done very easily: To the left of the results list under Search terms you find a list of links that allow you to restrict the search to specific fields.
- Under Filter "human" as: click on: organism.
- QUESTION 2:
- How many hits are now left? How many of these are from Swiss-Prot?
- Under Filter "insulin" as: click on: protein name.
- QUESTION 3:
- How many hits are now left? How many of these are from Swiss-Prot?
Note that all selections made with the mouse are shown in text format in the search field at the top of the page. It is possible to edit the search criteria manually in this field to make them broader or more narrow.
- Try for instance to exclude proteins that are not insulin, but only insulin-like. You do this by adding the following text in the search field: NOT name:insulin-like and click on the Search button.
- QUESTION 4:
- How many hits are now left?
- Try now to exclude proteins that are insulin receptors (or described as substrates for insulin receptors).
- QUESTION 5:
- How did you do this?
- How many hits are now left?
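The narrowing steps above all amount to adding field:value terms to the search text with AND or NOT. The following is a minimal sketch of that idea in Python. The helper names (quote, restrict, exclude) are made up for illustration, and the field names that UniProt accepts may differ between versions of the website, so treat the output as a pattern rather than as a guaranteed valid search string.

```python
def quote(value: str) -> str:
    """Wrap a value in double quotes if it contains whitespace."""
    return f'"{value}"' if any(ch.isspace() for ch in value) else value


def restrict(query: str, field: str, value: str) -> str:
    """Narrow a query: hits must also match field:value."""
    return f"{query} AND {field}:{quote(value)}"


def exclude(query: str, field: str, value: str) -> str:
    """Narrow a query: hits matching field:value are dropped."""
    return f"{query} NOT {field}:{quote(value)}"


# The exclusion used in the exercise, added to a simple starting query.
query = exclude("insulin", "name", "insulin-like")
print(query)  # insulin NOT name:insulin-like
```

Each call returns a new string, so several restrictions and exclusions can be chained one after the other, just as you do by clicking the filters and editing the search field.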
The contents of UniProt
We shall now see what information is contained in a UniProt entry, and what further information is available as links in each entry.
Click on the accession-number for insulin (the blue code in the field Entry). This will take you to the insulin entry in the UniProtKB/Swiss-Prot database. Spend some time to get an overview of the page and the information it contains.
- Note that you can click on the blue boxes on the left side of the page to scroll to different sections of the page. Try it!
- Note also that every time there is a small "i" after a term on the page, you can click it to get information about the term. Try it!
Now click on Publications under Display in the upper left part of the window. Click on UniProtKB/Swiss-Prot under Source to show only those references that are part of the entry and exclude those that are "computationally mapped". Note that it is indicated what each reference has contributed ("Cited for"). You can get to the PubMed literature database at NCBI by clicking on the "PubMed" link for a reference; try this. The abstract of a publication can be read there (or directly at UniProt using the "Abstract" link), if the work is an actual published article and not a "direct submission".
- QUESTION 6:
- How many references are there in the insulin entry?
- Why do you think insulin is such a highly investigated protein? (Hint: see other sections of the entry, e.g. Function and Pathol./Biotech, especially the subsections Involvement in disease and Pharmaceutical use)
- Scroll back to Function and read the free-text description at the top of the section. Also have a look at the controlled vocabulary annotations: "Gene Ontology" (GO) and Keywords. Note that both of these are split into two different aspects: Molecular function and Biological process.
- Now scroll to Subcell. location and read what is written there. Note that you find another set of "Gene Ontology" (GO) and Keywords annotations here; this time labelled Cellular component.
- QUESTION 7:
- Where in the cell / outside the cell do you find insulin?
- Why do you think it is found there? (Hint: consider the function)
Just like in GenBank, a UniProt entry has a Feature Table containing annotations that are coupled to specific parts of the sequence. In the default view, the Feature Table is not so easy to spot, since it is split up under different sections corresponding to the biological significance of the various annotations. However, you can click on Feature table under Display in the upper left part of the window to see those annotations only. Try it! Try also clicking on Feature viewer, which shows the same information in a graphical form.
Now switch back to the default (Entry) view. In the following, you will see some examples of Feature Table annotations.
- Under Sequences, the subsection Natural variant lists the variants (mutations) of insulin that have been described in the literature. Under the heading Description, it is indicated which amino acid is changed into which other amino acid. If the variant is known to be associated with a disease, this is indicated with an abbreviation of the disease (e.g. "R → C in IDDM2").
- Under Pathol./Biotech, the subsection Involvement in disease gives a description of each disease that is mentioned in the Feature Table, and repeats those variants that are associated with each disease.
- Under PTM / Processing, the subsection Molecule processing shows that insulin has both a signal peptide and a propeptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under Sequences.
- QUESTION 8:
- How long are the signal peptide and the propeptide, respectively?
- Under Structure, the subsection Secondary structure shows a colour-coded representation of the sequence, showing the secondary structure elements "Helix" (α-helix), "Beta strand" (part of a β-pleated sheet) or "Turn" (the grey regions without specified secondary structure are often called "Loop" or "Coil"). Try to see what happens when you hover the mouse (without clicking) over the coloured bars. To see that this is really part of the feature table, try clicking on Show more details.
- QUESTION 9:
- Which positions are in β-sheet conformation in insulin?
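Feature Table annotations such as the signal peptide, the propeptide and the secondary structure elements are given as a start and an end position in the sequence. Positions are counted from 1 and both ends are included, so the length of a feature is end - start + 1. The sketch below illustrates this. The sequence and all positions are made up for illustration and are not those of insulin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        """Number of residues covered by the feature."""
        return self.end - self.start + 1

    def extract(self, sequence: str) -> str:
        """Return the part of the sequence covered by the feature."""
        # Python slices are 0-based and exclude the end, hence start - 1.
        return sequence[self.start - 1 : self.end]


# Hypothetical 20-residue precursor with made-up feature positions.
precursor = "MKTLLVAGLLASAQAGSPEK"
features = [
    Feature("Signal peptide", 1, 8),
    Feature("Propeptide", 9, 12),
    Feature("Chain", 13, 20),
]

for feature in features:
    print(feature.kind, feature.length, feature.extract(precursor))
```

The same arithmetic applies to the positions you read off the insulin entry.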
Other databases linked from Swiss-Prot
Now, scroll to Cross-references. Here, you can find links to other databases. Under the subsection Sequence databases you find links to corresponding entries in the nucleotide databases. If you set the radio button on the left to GenBank, you can click on one of the blue GenBank identifiers and see a GenBank entry for the insulin gene (or for several genes including the insulin gene)—try it!
To look at the three-dimensional structure of a protein, you must go to yet another database, the PDB, listed under 3D structure databases (it can also be found in the section Structure). We will be working with 3D structures later in the course, but let's have a quick look here today as well. As you can see, the 3D structure of insulin has been determined several times. Set the radio button on the left to RCSB PDB, select one structure marked X-ray under Method, and click on the blue identifier under Entry. Besides a lot of information on the molecule and the experimental procedure used to solve the structure, the page also contains a nice picture of the insulin molecule.
Under Family and domain databases (can also be found in the section Family & Domains) you find a list of databases containing proteins that are similar (protein families). These have been collected using various techniques that you will hear about later in the course (multiple alignment). In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins can contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try to click on one of the InterPro links. This will take you to the InterPro page with lots of information about the protein family that insulin belongs to.
Advanced search
The UniProt interface allows you to use most of the fields in the database for searching, not only the fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
- Go back to UniProt's website http://www.uniprot.org/. Important: If the search string from the previous search is still shown in the search field, clear it. Then click Advanced to the right of the search field. This brings up a box with a new interface.
- Now we will find out how many proteins have signal peptides (just like insulin has). In the drop-down menu that appears in the box, select PTM/Processing, then select Molecule Processing, then select Signal peptide. Click the Search button.
- QUESTION 10:
- How many proteins did you find, and what was the search string (the text that appeared in the search field)?
- Evidence: The proteins we find in this way include proteins that are predicted to have signal peptides, without necessarily having any experimental evidence for the signal peptides. We will now limit the search to experimentally confirmed signal peptides. Click Advanced again (without erasing your previous search) and change the Evidence menu to Any experimental assertion.
- QUESTION 11:
- How many proteins do you find now, and what has the search string changed into?
- Combining fields: How many experimentally confirmed signal peptides are found in humans? Click Advanced again and go to the bottom of the box (with light blue background). Leave the menu to the left on AND, select Organism [OS] in the drop-down menu, type human in the field Term, accept the suggestion "Human [9606]" and click the Search button.
- QUESTION 12:
- How many proteins do you find now, and what is the search string? (Note that you can always perform the search by editing the text in the search field — however to do this you need to know the names for the fields).
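Since a search is ultimately just a text string, the same string can also be sent to UniProt from a script instead of being typed into the search field. The following is a minimal sketch that only builds the request URL and does not contact the server. The endpoint https://rest.uniprot.org/uniprotkb/search and the parameter names query, format and size are assumptions about UniProt's current programmatic interface; they are not part of this exercise and may change.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST service (not part of the exercise).
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def search_url(query: str, fmt: str = "fasta", size: int = 25) -> str:
    """Build a search URL; the query text is percent-encoded for use in a URL."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


print(search_url("human insulin"))
# https://rest.uniprot.org/uniprotkb/search?query=human+insulin&format=fasta&size=25
```

Any search string that works in the search field can be passed as the query argument, provided the field names are those understood by the service.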
Important note about the organism field: when you type some letters, a drop-down list with suggestions will come up. Each has a number in brackets — this is the TaxID, which you can also find in the NCBI Taxonomy Browser. If you search for e.g. Human proteins, it is a good idea to include the TaxID; if you omit it and just write "human", you will also find proteins from organisms like Human immunodeficiency virus (try it!).
On the other hand, if you search for proteins from a microbial species, you may run into trouble, because each subspecies or strain has its own TaxID, and you probably want all possible strains. Let's try an example (first, clear the previous search): Say you want all proteins from the bacterium Neisseria gonorrhoeae — you can probably guess which disease it causes in humans. Try to type Neisseria gonorrhoeae in the organism field: you will see a suggestion named "Neisseria gonorrhoeae [485]" – accept that.
- QUESTION 13 a:
- How many proteins are there in UniProt from Neisseria gonorrhoeae with the default TaxID [485]?
That's not a very high number, considering that Neisseria gonorrhoeae is a very well studied species with a lot of known strains which have been fully sequenced. Now note the line above the results that says "Expand search "Neisseria gonorrhoeae [485]" to include lower taxonomic ranks" – click it.
- QUESTION 13 b:
- How many proteins are there in UniProt from Neisseria gonorrhoeae in total (all strains and subspecies)?
- QUESTION 13 c:
- What does the search string look like now?
In conclusion, use the field Taxonomy [OC] instead of Organism [OS] when working with microbial species where you want all strains.
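The difference between the two fields can be illustrated with a toy example: an organism search matches only entries annotated with exactly the given TaxID, whereas a taxonomy search also matches entries whose TaxID lies below it in the taxonomic tree. In the sketch below, 485 and 9606 are the TaxIDs mentioned in this exercise, while the strain TaxIDs 900001 and 900002 and all protein names are made up for illustration.

```python
# Toy taxonomy: child TaxID -> parent TaxID.
# 900001 and 900002 are hypothetical strains of species 485.
PARENT = {900001: 485, 900002: 485}


def lineage(taxid: int) -> list[int]:
    """Return the TaxID itself followed by all its ancestors in the toy tree."""
    chain = [taxid]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# Hypothetical entries: (protein name, TaxID of the source organism).
records = [
    ("PROT_A", 485),
    ("PROT_B", 900001),
    ("PROT_C", 900002),
    ("PROT_D", 9606),
]


def by_organism(entries, taxid):
    """Exact match on the annotated TaxID only."""
    return [name for name, tid in entries if tid == taxid]


def by_taxonomy(entries, taxid):
    """Match the TaxID itself and everything below it in the tree."""
    return [name for name, tid in entries if taxid in lineage(tid)]


print(by_organism(records, 485))  # ['PROT_A']
print(by_taxonomy(records, 485))  # ['PROT_A', 'PROT_B', 'PROT_C']
```

The exact search misses the entries annotated at strain level, which is why the first count in Question 13 is so low.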
- Numerical field: Now we will try to answer a completely different question: Which extremely short proteins are present in UniProt? Clear the previous search. In the advanced drop-down menu, select Sequence and then Sequence length. Now two new fields appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. Note: in your answers to questions 14-17, include the search string just like you did in the questions above!
- QUESTION 14:
- How many proteins of maximum length 10 do you find?
- Extremely short proteins are often mistakes translated directly from a nucleotide sequence with no evidence for the sequences being protein coding. Limit your search to proteins that actually have evidence for their existence at the protein level (set the drop-down menu to Protein existence [PE] and select Evidence at protein level).
- QUESTION 15:
- How many proteins are now left?
- A large fraction of the proteins identified in this way are fragments. Try to exclude fragments from the search. In the drop-down menu, choose Sequence, then Fragment, then No.
- QUESTION 16:
- How many proteins are now left?
- And as the final question, how many of these proteins are found in humans? Do as before.
- QUESTION 17:
- How many human non-fragment proteins of maximum length 10 do you find in UniProt?
- Finally, you can save the results of your search. Click on the blue Download button. You can now save the search results in the format you prefer (try FASTA (canonical) and click Preview).
- QUESTION 18:
- Copy the FASTA sequences to your report.
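Once the results are saved as FASTA, the criteria used in Questions 14 to 16 (sequence length, protein existence and fragment status) can also be checked on the downloaded file. The sketch below assumes that the header lines follow the layout ">db|accession|entry_name description OS=... OX=... PE=n SV=n", that PE=1 means evidence at protein level, and that fragments carry "(Fragment)" at the end of the description. The four records are made up for illustration and are not real UniProt entries.

```python
import re
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    accession: str
    entry_name: str
    existence: int      # PE level; 1 is assumed to mean evidence at protein level
    is_fragment: bool
    sequence: str


# Assumed header layout: >db|accession|entry_name description OS=... PE=n SV=n
HEADER = re.compile(
    r"^>(?:sp|tr)\|(?P<acc>[^|]+)\|(?P<name>\S+)\s+(?P<desc>.*?)\s+OS=.*?\bPE=(?P<pe>\d)"
)


def make_record(header: str, sequence: str) -> Record:
    """Turn one header line and its joined sequence lines into a Record."""
    match = HEADER.match(header)
    if match is None:
        raise ValueError(f"Unrecognised FASTA header: {header!r}")
    return Record(
        accession=match["acc"],
        entry_name=match["name"],
        existence=int(match["pe"]),
        is_fragment=match["desc"].endswith("(Fragment)"),
        sequence=sequence,
    )


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield one Record per FASTA entry; sequences may span several lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield make_record(header, "".join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield make_record(header, "".join(chunks))


# Hypothetical records (made-up accessions, names and sequences).
EXAMPLE = """\
>sp|X00001|EXA_HUMAN Example peptide OS=Homo sapiens OX=9606 GN=EXA PE=1 SV=1
MKTAYIAK
>tr|X00002|X00002_HUMAN Example protein (Fragment) OS=Homo sapiens OX=9606 PE=1 SV=1
GSPEK
>tr|X00003|X00003_HUMAN Uncharacterized protein OS=Homo sapiens OX=9606 PE=4 SV=1
MAAQL
>sp|X00004|EXB_HUMAN Longer example protein OS=Homo sapiens OX=9606 GN=EXB PE=1 SV=2
MKTLLVAGLLAS
AQAGSPEK
"""

records = list(parse_fasta(EXAMPLE))
# Keep short, non-fragment proteins with evidence at protein level.
kept = [
    r for r in records
    if len(r.sequence) <= 10 and r.existence == 1 and not r.is_fragment
]
for r in kept:
    print(r.accession, r.entry_name, len(r.sequence))  # X00001 EXA_HUMAN 8
```

To use this on your own results, read the saved file into a string and pass it to parse_fasta instead of EXAMPLE.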