Find the mature part of human genes with a signal peptide

Description

Find all human genes in UniProt with a signal peptide. Extract the entire sequence and create a FASTA file containing only the mature proteins.

Input/output

Download the entire Swiss-Prot database. This file is the input to your program.
Unpack it yourself with

gunzip uniprot_sprot.dat.gz

or whatever method you prefer. Careful: the unpacked file takes up 3 GB.
Notice that the file contains many Swiss-Prot entries, and your program must handle all of them. The output must be a FASTA file in which each record has the Swiss-Prot ID of the entry as its header, followed by the mature protein sequence from that entry.
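
One possible way to handle the many entries is to read the file line by line and collect lines until the `//` terminator that closes each entry. The sketch below is only an illustration, not a reference solution. The function names (`read_entries`, `entry_id`, `entry_sequence`, `write_fasta`) are made up for this example, the sample entry is synthetic, and it assumes that "Swiss-Prot ID" means the entry name on the ID line.

```python
import io
import sys


def read_entries(handle):
    """Yield each entry as a list of lines; entries end with a '//' line."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def entry_id(entry):
    """Return the entry name from the ID line (assumed to be the wanted ID)."""
    for line in entry:
        if line.startswith("ID "):
            return line.split()[1]
    return None


def entry_sequence(entry):
    """Join the lines that follow the SQ line, dropping the spaces."""
    parts = []
    in_sequence = False
    for line in entry:
        if line.startswith("SQ "):
            in_sequence = True
        elif in_sequence:
            parts.append(line.replace(" ", ""))
    return "".join(parts)


def write_fasta(out, name, seq, width=60):
    """Write one FASTA record with the sequence wrapped at `width` letters."""
    out.write(">" + name + "\n")
    for i in range(0, len(seq), width):
        out.write(seq[i:i + width] + "\n")


# Synthetic two-line-sequence entry, only to show the calls fitting together.
sample = io.StringIO(
    "ID   TEST1_HUMAN             Reviewed;          12 AA.\n"
    "OS   Homo sapiens (Human).\n"
    "SQ   SEQUENCE   12 AA;\n"
    "     MKTAYIAKQR QI\n"
    "//\n"
)
for entry in read_entries(sample):
    write_fasta(sys.stdout, entry_id(entry), entry_sequence(entry))
```

Reading one entry at a time keeps memory use small even though the unpacked file is large. In a real program the `io.StringIO` sample would be replaced by the opened `uniprot_sprot.dat` file.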

How to pick the right entries/genes

Study the example entry carefully.
First of all, we are only interested in human genes. This means that the OS line must contain "Homo sapiens". To determine whether there is a signal peptide (which comes first in the sequence), check that the first FT (feature) line contains SIGNAL. The signal is not part of the mature protein, so you must remove it when you extract the sequence. Sometimes there is a propeptide right after the signal. In such cases the propeptide is not part of the mature protein either and must also be removed. Note that an FT feature can span several lines, as the propeptide in the following example shows.

FT   SIGNAL        1     20       {ECO:0000255}.
FT   PROPEP       21     82       {ECO:0000269|PubMed:16384863,
FT                                ECO:0000269|PubMed:17185225}.
FT                                /FTId=PRO_0000434044.
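
The selection rules above could be sketched roughly as follows. This is a hedged illustration rather than the intended solution: the helper names (`is_human`, `feature_ranges`, `mature_start`) are invented here, an entry is assumed to be a list of its lines, and the parser assumes the column layout of the example above (feature key in columns 6 to 13, then start and end positions separated by whitespace). A file that writes ranges differently would need a different parser.

```python
def is_human(entry):
    """True if any OS line of the entry mentions Homo sapiens."""
    return any(
        line.startswith("OS ") and "Homo sapiens" in line for line in entry
    )


def feature_ranges(entry):
    """Return (key, start, end) per FT feature, skipping continuation lines."""
    features = []
    for line in entry:
        if not line.startswith("FT"):
            continue
        key = line[5:13].strip()
        if not key:
            continue  # continuation line: the key columns are blank
        fields = line.split()
        try:
            features.append((key, int(fields[2]), int(fields[3])))
        except (IndexError, ValueError):
            # Position missing or not a plain number (e.g. '?'): keep the key only.
            features.append((key, None, None))
    return features


def mature_start(entry):
    """Return the 0-based index where the mature protein starts.

    Returns None when the first feature is not a SIGNAL with a known end.
    """
    features = feature_ranges(entry)
    if not features or features[0][0] != "SIGNAL" or features[0][2] is None:
        return None
    end = features[0][2]
    if len(features) > 1:
        key, start, stop = features[1]
        # Only skip a propeptide that starts right after the signal.
        if key == "PROPEP" and start == end + 1:
            end = stop
    return end


example = [
    "OS   Homo sapiens (Human).",
    "FT   SIGNAL        1     20       {ECO:0000255}.",
    "FT   PROPEP       21     82       {ECO:0000269|PubMed:16384863,",
    "FT                                ECO:0000269|PubMed:17185225}.",
    "FT                                /FTId=PRO_0000434044.",
]
print(is_human(example), mature_start(example))  # prints: True 82
```

Because feature positions are 1-based and inclusive, the end position of the last removed feature is also the 0-based index of the first mature residue, so the mature protein is `sequence[mature_start(entry):]`.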