Exercise: The protein database UniProt
Exercise written by: Henrik Nielsen - updated by Morten Nielsen and Rasmus Wernersson
In this exercise, we shall extract information from the protein database UniProt. This database is maintained jointly by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), England, and Georgetown University, Washington DC, USA.
UniProt, http://www.uniprot.org/, consists of three parts:
- UniProt Knowledge-base (UniProtKB)
- protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
- homology-reduced database, where similar sequences (having a certain percentage identity) are merged into clusters, each with a representative sequence
- UniProt Archive (UniParc)
- an archive containing all versions of UniProt without annotations
Of these databases, UniProt Knowledge-base is the most useful, and this is the database we shall be using today. UniProt Knowledge-base consists of two parts:
- UniProtKB/Swiss-Prot
- a manually annotated (reviewed) protein-database.
- UniProtKB/TrEMBL
- a computer-annotated supplement to Swiss-Prot, which contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.
Simple text mining
First, we will find some UniProt entries using simple text mining. You are supposed to find the entry for human insulin.
- Open the UniProt home-page https://www.uniprot.org/
- Type human insulin in the search field at the top of the page. Leave the search menu on "UniProtKB", which is the default. Press Enter or click the Search button.
- If you are new to UniProt, you will be asked whether you want to view your results as "Cards" or "Table". Choose "Table".
- QUESTION 1.1:
- How many hits do you find? (tip: See the number above the results list)
- How many of these hits are from Swiss-Prot? (tip: See under "Reviewed" at the top left)
- Can you identify the correct hit (i.e. see which one is actually human insulin and not something else)? If yes, write down its Accession code and Entry name (also called ID).
In this case, it was relatively easy to spot the correct hit, but sometimes it is more difficult. If you do not identify the correct hit immediately, it will often help to narrow down the search, and that is exactly what we ask you to do in the next four questions.
The first step is searching for proteins that actually come from the organism "human" and are named something containing the word "insulin", as opposed to just containing the words "human" and "insulin" somewhere in the entry.
On the left, you can see a list of "Model organisms". Try to click "Human".
- QUESTION 1.2:
- How many hits are now left? How many of these are from Swiss-Prot?
However, to really solve the problem, we have to enter Advanced mode. Click on Advanced in the right part of the search field. Search for human in the Organism [OS] field, then click Add field and search for insulin in the Protein Name [DE] field.
- QUESTION 1.3:
- How many hits are now left? How many of these are from Swiss-Prot? And what has the search string in the text box at the top of the page now turned into?
Now, you should exclude proteins that are not insulin, but only insulin-like. Open the Advanced menu again, add a field, make sure it is combined by NOT instead of AND, and remove hits that have insulin-like in the protein name.
- QUESTION 1.4:
- How many hits are now left? How many of these are from Swiss-Prot? And what is the search string?
Note that you can also edit the search string directly, instead of going through the Advanced menu every time.
- Try now to exclude proteins that are insulin receptors (or substrates for insulin receptors).
- QUESTION 1.5:
- How did you do this?
- How many hits are now left? How many of these are from Swiss-Prot?
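The search strings produced by the Advanced menu follow a simple pattern: "field:value" terms combined with AND, with NOT in front of the terms to exclude. The following Python sketch illustrates that pattern on a different protein, so it does not give away the answers above. The helper names term and build_query are hypothetical, and the field names organism_name and protein_name are assumptions; compare them with the search string that UniProt actually shows you.

```python
def term(field, value):
    """Return one 'field:value' search term, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def build_query(include, exclude=()):
    """Join the included terms with AND and prefix each excluded term with NOT."""
    parts = [f"({t})" for t in include]
    parts += [f"NOT ({t})" for t in exclude]
    return " AND ".join(parts)


# Illustrative example: mouse glucagon, but not glucagon-like proteins.
# The field names are assumed, not confirmed by this exercise.
query = build_query(
    include=[term("organism_name", "mouse"), term("protein_name", "glucagon")],
    exclude=[term("protein_name", "glucagon-like")],
)
print(query)
```

The printed string can be pasted into the search field. If UniProt rejects it, adjust the field names or operators to match what the Advanced menu generates.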
The contents of UniProt
We shall now see what information is contained in a UniProt entry, and what further information is available as links in each entry.
Click on the accession-code or ID for insulin. This will take you to the insulin entry in the UniProtKB/Swiss-Prot database. Spend some time to get an overview of the page and the information it contains.
- Note that you can click on the headings in the left side of the page to scroll to different sections of the page. Try it!
- Note also that every time there is a small "i" after a term on the page, you can click it to get information about the term. Try it!
Now click on Publications in the top part of the window. Click on UniProtKB/Swiss-Prot under Source to show only those references that are part of the entry and exclude those that are "computationally mapped". Note that it is indicated what each reference has contributed ("Cited for"). You can get to the PubMed literature database at NCBI by clicking on the link "PubMed" for a reference; try this. The abstract of a publication can be read there (or directly in UniProt using the "View abstract" link), if the work is an actual published article and not a "direct submission".
- QUESTION 2.1:
- How many references are there in the insulin entry?
- Why do you think insulin is such a highly investigated protein? (Hint: see other sections of the entry, e.g. Function and Disease & Drugs, especially the subsections Involvement in disease and Pharmaceutical)
- Scroll back to Function and read the free-text description at the top of the section. Also have a look at the controlled vocabulary annotations: "Gene Ontology" (GO) and Keywords. Note that both of these are split into two different aspects: Molecular function and Biological process.
- Now scroll to Subcellular Location and read what is written there. Note that you find another set of "Gene Ontology" (GO) and Keywords annotations here; this time labelled Cellular component.
- QUESTION 2.2:
- Where in the cell / outside the cell do you find insulin?
- Why do you think it is found there? (Hint: consider the function)
Just like in GenBank, a UniProt entry has a Feature Table containing annotations that are coupled to specific parts of the sequence. In the default view, the Feature Table is not so easy to spot, since it is split up under different sections corresponding to the biological significance of the various annotations. However, in the top part of the window you can click on Feature viewer, which shows the feature table information in a graphical form. Try it. Then click on Molecule processing to show the signal peptide and the propeptide.
Now switch back to the default (Entry) view. In the following, you will see some examples of Feature Table annotations.
- Under Disease & Drugs, the subsection Variants lists the variants (mutations) of insulin that have been described in the literature. Under the heading Change, it is indicated which amino acid is changed into which other amino acid. If the variant is known to be associated with a disease, this is indicated under the heading Description.
- Under PTM/Processing, the subsection Features shows that insulin has both a signal peptide and a pro-peptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under Sequences.
- QUESTION 2.3:
- How long are the signal peptide and the propeptide, respectively?
- Under Structure, the subsection Features shows the secondary structure elements "Helix" (α-helix), "Beta strand" (part of a β-pleated sheet) or "Turn". The regions without specified secondary structure are often called "Loop" or "Coil". CORRECTION 2024: With the latest update of the UniProt interface, you need to go to the top of the window and select Feature viewer to see the secondary structure annotations! Click Structural features on the left to see helices, strands, and turns. Click each coloured box to see positions.
- QUESTION 2.4:
- Which positions are in β-sheet conformation in insulin?
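Feature positions in UniProt are 1-based and inclusive, so the length of a feature is end minus start plus one, not simply end minus start. The small Python sketch below shows the arithmetic. The function name feature_length is hypothetical, and the coordinates are made up for illustration; they are not taken from the insulin entry.

```python
def feature_length(start, end):
    """Length of a feature given inclusive, 1-based start and end positions."""
    if start < 1 or end < start:
        raise ValueError("positions must satisfy 1 <= start <= end")
    return end - start + 1


# Hypothetical coordinates, NOT the real insulin annotations.
features = {"Signal": (1, 20), "Propeptide": (51, 60)}
for name, (start, end) in features.items():
    print(name, feature_length(start, end))
```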
Other databases linked from UniProt
UniProt has many useful links to other databases. In the graphical view, the cross-references are spread among several different headings, just like the feature table is.
Under the heading Sequence & Isoform, there is a sub-heading named Sequence databases. Here, you can e.g. find links to nucleotide sequences in the databases EMBL / GenBank / DDBJ. Try clicking one of the GenBank links marked "Genomic DNA"; that should take you to a page that looks like something you saw last week.
Under the heading Structure, there is an interactive window showing a three-dimensional structure of insulin. Note that you can rotate the structure with your mouse. Actually, this structure is not part of UniProt itself; it is a cross-link to the protein structure database PDB. Below the interactive window, you can see the actual cross-links to PDB. Note that PDB is not one single database – just as was the case for the nucleotide databases, there is a European version (PDBe), an American version (RCSB-PDB), and a Japanese version (PDBj), but luckily, they contain the same data. We will work with the American version of PDB later in the course. As you can see, there are many PDB structures of insulin; in other words, the 3D structure of insulin has been determined several times.
Under the heading Family & Domains, there is a subsection named Family and domain databases. It has links to databases containing proteins that are similar (protein families). These have been collected using various techniques that you will hear about later in the course (multiple alignment). In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins (not small ones like insulin) can contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try to click on one of the InterPro links. This will take you to the InterPro page with lots of information about the protein family that insulin belongs to.
Text format
Until now, we have been working with the graphical user interface to UniProt. However, all the information is also available in plain text format, and that's what you will be working with if you are going to analyze larger amounts of UniProt data later in your studies. For now, let's just have a look at it.
Scroll to the top of the Human Insulin page and find the menu labeled Download. Click it, and then right-click the option Text and open it in a new tab. What you see here contains essentially all the information you have seen in the graphical interface.
Scroll through the plain text file and see if you can find the same information that you just found in the graphical interface. Note that every line starts with a two-letter code specifying the type of the information in the line. Here are some examples:
- ID: Entry name (ID). There is only one ID.
- AC: Accession code. There may be more than one.
- DE: Description (protein names).
- GN: Gene Name
- OS: Organism/Species.
- OC: Organism Classification.
- OX: TaxID (as defined in the NCBI Taxonomy database).
- RN, RP, RX, RA, RT, RL: References.
- CC: Comments (annotations pertaining to the whole protein).
- DR: Cross-references to other databases.
- KW: Keywords.
- FT: Feature Table (annotations pertaining to specified parts of the sequence).
- SQ: Sequence header line.
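Because every line starts with a two-letter code, the text format is easy to process with a short script. The Python sketch below groups the lines of one entry by their code and collects the sequence lines that follow SQ. It is a minimal illustration, not a full parser: the function name parse_entry is hypothetical, and the entry used here is an invented toy record (including its ID, accessions and organism), not a real UniProt entry.

```python
from collections import defaultdict

# Invented toy record in the UniProt text layout; not a real entry.
TOY_ENTRY = """\
ID   TOY_EXAMPLE             Reviewed;          12 AA.
AC   X00001; X00002;
DE   RecName: Full=Toy peptide;
OS   Examplus fictus.
KW   Signal.
FT   SIGNAL          1..4
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def parse_entry(text):
    """Group lines by their two-letter code; return (fields, sequence)."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            break  # '//' marks the end of an entry
        code, content = line[:2], line[5:].strip()
        if code.strip() == "":
            # Sequence lines have no code; drop the spaces between blocks.
            sequence.append(content.replace(" ", ""))
        else:
            fields[code].append(content)
    return dict(fields), "".join(sequence)


fields, sequence = parse_entry(TOY_ENTRY)
entry_name = fields["ID"][0].split()[0]
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(entry_name, accessions, len(sequence))
```

For real analyses, an established library is usually a better choice than a hand-written parser, but the sketch shows why the line codes matter.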
Advanced search
The UniProt interface allows you to use most of the fields in the database for searching, not only the fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
- Go back to UniProt's main page, http://www.uniprot.org/. Important: If the search string from the previous search is still shown in the search field, clear it. Then click Advanced to the right of the search field. This brings up a box with a new interface.
- Now we will find out how many proteins have signal peptides (just like insulin has). In the drop-down menu that appears in the box, select PTM/Processing, then select Molecule Processing, then select Signal peptide. In the empty field that now appears to the right of the word Signal, type a * (otherwise, it will not work). Click the Search button.
- QUESTION 3.1:
- How many proteins did you find, how many of them are from Swiss-Prot, and what was the search string (the text that appeared in the search field)?
- Evidence: The proteins we find in this way include proteins that are predicted to have signal peptides, without necessarily having any experimental evidence for the signal peptides. We will now limit the search to experimentally confirmed signal peptides. Click Advanced again (without erasing your previous search) and change the Evidence menu to Any experimental assertion.
- QUESTION 3.2:
- How many proteins do you find now, how many of them are from Swiss-Prot, and what has the search string changed into?
- Combining fields: How many experimentally confirmed signal peptides are found in humans? Click on Advanced Search again and click Add field to get a second search line. Leave the menu to the left on AND, select Organism [OS] in the drop-down menu, type human in the field Term, accept the suggestion "Homo sapiens (Human) [9606]" and click the Search button.
- QUESTION 3.3:
- How many proteins do you find now, and what is the search string? (Note that you can always perform the search by editing the text in the search field — however to do this you need to know the names for the fields).
About strains and subspecies
Let us now try something different. If you search for proteins from a microbial species, you may run into trouble, because each subspecies or strain has its own TaxID, and you probably want all possible strains. Let's try an example (first, clear the previous search): Say you want all proteins from the bacterium Bacillus subtilis — a very important production organism in biotechnology. Try to type Bacillus subtilis in the Organism [OS] field: you will see a suggestion named "Bacillus subtilis [1423]" – accept that.
- QUESTION 3.4:
- How many proteins are there in UniProt from Bacillus subtilis with the default TaxID [1423]? How many of these are from Swiss-Prot? And what is the search string?
The number of entries in Swiss-Prot may seem low for such a well-studied organism. In addition, you may note that there is a link next to the total number of results saying "or expand search to "1423" to include lower taxonomic ranks". Click it.
- QUESTION 3.5:
- How many proteins are there in UniProt from Bacillus subtilis in total (all strains and subspecies)? How many of these are from Swiss-Prot? And what is the search string?
In conclusion, use the field Taxonomy [OC] instead of Organism [OS] when working with microbial species where you want all strains.
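The same distinction applies if you run searches from a script rather than from the web page. The Python sketch below only builds the request URLs and does not contact the server. It assumes that the UniProt REST search endpoint is https://rest.uniprot.org/uniprotkb/search with the parameters query, format and size, and that the query fields organism_id and taxonomy_id correspond to Organism [OS] and Taxonomy [OC]; verify both against the search strings you wrote down above and the UniProt API documentation. The function name search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the UniProt REST API; check the official documentation.
BASE_URL = "https://rest.uniprot.org/uniprotkb/search"


def search_url(query, fmt="tsv", size=25):
    """Build a search URL for the given query string without sending it."""
    params = {"query": query, "format": fmt, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


# Assumed field names: organism_id matches one TaxID exactly,
# taxonomy_id also includes lower taxonomic ranks (strains, subspecies).
exact = search_url("organism_id:1423")
with_strains = search_url("taxonomy_id:1423")
print(exact)
print(with_strains)
```

Opening either URL in a browser, or fetching it with urllib.request.urlopen, should return the first page of results in tab-separated format if the assumptions hold.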
Searching for short proteins
- Numerical field: Now we will try to answer a completely different question: Which extremely short proteins are present in UniProt? Clear the previous search. In the advanced drop-down menu, select Sequence and then Sequence length. Now two new fields appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. Note: in your answers to the questions below, include the search string just like you did in the questions above!
- QUESTION 3.6:
- How many proteins of maximum length 10 do you find?
- Extremely short proteins are often mistakes translated directly from a nucleotide sequence with no evidence for the sequences being protein coding. Limit your search to proteins that actually have evidence for their existence at the protein level (add a field, and set the drop-down menu to Protein existence [PE] and select Evidence at protein level).
- QUESTION 3.7:
- How many proteins are now left?
- A large fraction of the proteins identified in this way are fragments. Try to exclude fragments from the search. Add a field. In the drop-down menu, choose Sequence, then Fragment, then No.
- QUESTION 3.8:
- How many proteins are now left?
- And how many of these proteins are found in humans? Do as before.
- QUESTION 3.9:
- How many human non-fragment proteins of maximum length 10 do you find in UniProt?
- Finally you can save the results of your search. First, sort them by length by clicking on the column header. Then, click on Download above the list of results. You can now save the search results in the format you prefer (try FASTA (canonical) and click Preview).
- QUESTION 3.10:
- Copy the FASTA sequences to your report.
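Once the results are saved as FASTA, they can be read and sorted by length with a few lines of code. The Python sketch below parses FASTA text into (header, sequence) pairs and sorts them by sequence length. The function name parse_fasta is hypothetical, and the three records are invented toy sequences, not actual UniProt results. The accession is taken from between the first two "|" characters of the header, which assumes the usual UniProt header layout.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented toy records; replace with the text you downloaded.
TOY_FASTA = """\
>sp|X00001|TOYA_EXAMP Toy peptide A
MKTAYIAKQR
>sp|X00002|TOYB_EXAMP Toy peptide B
MKT
>sp|X00003|TOYC_EXAMP Toy peptide C
MKTAYI
"""

records = sorted(parse_fasta(TOY_FASTA), key=lambda r: len(r[1]))
for header, seq in records:
    accession = header.split("|")[1]
    print(accession, len(seq), seq)
```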
On your own
QUESTION 4: Now that you are proficient in UniProt searches, try the following:
(As always, remember to write your search string in the answer).
- Find out how many proteins from Escherichia coli (all strains) there are in UniProt.
- How many of these are from the notorious pathogenic serotype O157:H7 (including its sub-strains)?
- Find insulin from as many organisms as possible, without including entries that are not insulin (Hint: If you attempt to do this with the Protein Name field only, it will require an unwieldy amount of kill-words. Therefore, take the gene name into account).
- Find alpha-globin (the alpha subunit of hemoglobin) from as many ruminants as possible (see the GenBank exercise).
- Find alpha-A globin and alpha-D globin from Columba livia (Hint: You can use a "*" to perform the search with one search string).