Human genes with activities in more than one region of the cell
Description
Find human genes which are targeted to more than one region in the cell. Save the genes in fasta format.
Input/output
Download the entire swissprot database.
This will be the input file for your program.
Unpack it yourself with
gunzip uniprot_sprot.dat.gz
or whatever method you prefer. Careful, it will take up 3 GB.
Notice there are many swissprot entries in the file and your program must handle that.
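Since the file holds many entries and is large, one option is to read it line by line and hand over one entry at a time. The sketch below assumes that every entry ends with a line starting with "//", as in the Swiss-Prot flat-file format; the function name iter_entries is made up for this illustration.

```python
from typing import Iterable, Iterator, List


def iter_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of its lines, without the "//" terminator.

    Assumption: an entry ends with a line that starts with "//".
    """
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        # Leftover lines if the last entry has no terminator.
        yield entry


# Possible usage (reads the file lazily, one entry in memory at a time):
# with open("uniprot_sprot.dat") as handle:
#     for entry in iter_entries(handle):
#         ...
```

Because the function accepts any iterable of lines, it can be tried on a small hand-made list before running it on the full database.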
The output must be a fasta file, where each header is the swissprot ID of the entry, followed by the protein sequence from that entry.
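A minimal sketch of producing one fasta record from an entry, given as a list of its lines. It rests on assumptions you should check against the example entry: that the "swissprot ID" is the second field of the ID line, and that all lines following the SQ line hold the sequence in space-separated blocks. The function names and the line width of 60 are choices made for this illustration only.

```python
from typing import List


def entry_id(entry: List[str]) -> str:
    """Return the second field of the ID line (assumed to be the ID)."""
    for line in entry:
        if line.startswith("ID"):
            return line.split()[1]
    raise ValueError("entry has no ID line")


def entry_sequence(entry: List[str]) -> str:
    """Join the sequence lines that follow the SQ line, dropping spaces."""
    parts: List[str] = []
    in_sequence = False
    for line in entry:
        if line.startswith("SQ"):
            in_sequence = True
        elif in_sequence:
            parts.append(line.replace(" ", ""))
    return "".join(parts)


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Return one fasta record: a ">" header line and wrapped sequence."""
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

The returned string can be written directly to the output file, one record per selected entry.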
How to pick the right entries/genes
Study the example entry carefully.
First of all we are only interested in human genes. This means that the OS line must contain "Homo sapiens".
To find entries where the gene is present in more than one region, we need to look at the CC lines.
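The human check can be sketched as below, for an entry given as a list of its lines. The function name is_human is hypothetical. Joining all OS lines before searching is a precaution in case the organism name is spread over more than one OS line.

```python
from typing import List


def is_human(entry: List[str]) -> bool:
    """Return True if the OS line(s) of the entry mention Homo sapiens."""
    os_text = " ".join(
        line[2:].strip() for line in entry if line.startswith("OS")
    )
    return "Homo sapiens" in os_text
```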
There is a pattern in how the CC lines are constructed. We are only interested in this section:
"CC -!- SUBCELLULAR LOCATION:"
When that is found, it can be seen that it consists of lines like
"Cytoplasm {ECO:0000269|PubMed:12080473}."
Every such line describes a region in the cell where the protein is found. The words before the { are the location, and the text inside the {} is the evidence for where this has been found. We are only interested in locations where the evidence code is ECO:0000269, as this means the location has been demonstrated experimentally. Note that there are no more locations once you see "Note=" in the paragraph.
To conclude: a gene must be extracted if it is a human gene and the subcellular location paragraph in the CC lines contains more than one location with ECO:0000269.
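The location rules can be sketched as follows, for an entry given as a list of its lines. This is one possible reading of the rules and not a reference solution. It assumes that a CC paragraph starts with "-!-", that continuation lines belong to the paragraph before them, and that every "text {evidence}" chunk before "Note=" is one location. The names experimental_locations and has_multiple_locations are made up here. Real entries may carry extra qualifiers in this paragraph that the sketch does not treat specially.

```python
import re
from typing import List, Optional

HEADER = "-!- SUBCELLULAR LOCATION:"
# One chunk: some text, then evidence inside curly braces.
LOCATION = re.compile(r"([^{}]+?)\s*\{([^{}]*)\}")


def experimental_locations(entry: List[str]) -> List[str]:
    """Return distinct locations backed by ECO:0000269, in order of appearance."""
    paragraphs: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in entry:
        if not line.startswith("CC"):
            current = None
            continue
        content = line[2:].strip()
        if content.startswith(HEADER):
            current = [content[len(HEADER):]]
            paragraphs.append(current)
        elif content.startswith("-!-") or content.startswith("---"):
            current = None  # another CC paragraph or a separator line
        elif current is not None:
            current.append(content)  # continuation line

    found: List[str] = []
    for parts in paragraphs:
        # Nothing after "Note=" counts as a location.
        text = " ".join(parts).split("Note=")[0]
        for name, evidence in LOCATION.findall(text):
            name = name.strip(" .;,")
            if "ECO:0000269" in evidence and name and name not in found:
                found.append(name)
    return found


def has_multiple_locations(entry: List[str]) -> bool:
    """True if more than one location has experimental evidence."""
    return len(experimental_locations(entry)) > 1
```

The paragraph is joined into one string before matching so that a location whose evidence wraps onto the next CC line is still seen as a single chunk.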