Spider toxins
Description
Find all spider toxins in UniProt and write them to a FASTA file. Who knows when it will be useful to produce venom?
Input/output
Download the entire Swiss-Prot database.
This file will be the input to your program.
Unpack it yourself with
gunzip uniprot_sprot.dat.gz
or whatever method you prefer. Careful, it will take up 3 GB.
Note that the file contains many Swiss-Prot entries, and your program must handle all of them.
The output must be a FASTA file in which each record has the Swiss-Prot ID of the entry as its header, followed by the protein sequence from that entry.
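One way to handle a file with many entries is to read it line by line and collect lines until the entry terminator. The sketch below assumes the standard Swiss-Prot flat-file layout: each entry ends with a line starting with "//", the ID is the second field of the "ID" line, and the sequence lines follow the "SQ" line. The function names (read_entries, entry_id, entry_sequence, to_fasta) and the 60-character line width are illustrative choices, not part of the exercise.

```python
def read_entries(lines):
    """Yield each entry as a list of lines; an entry ends at a '//' line."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def entry_id(entry):
    """Return the second field of the ID line (assumed to be the Swiss-Prot ID)."""
    for line in entry:
        if line.startswith("ID   "):
            return line.split()[1]
    return None


def entry_sequence(entry):
    """Join the sequence lines that follow the SQ line, dropping the spaces."""
    parts = []
    in_sequence = False
    for line in entry:
        if line.startswith("SQ   "):
            in_sequence = True
        elif in_sequence:
            parts.append(line.replace(" ", ""))
    return "".join(parts)


def to_fasta(identifier, sequence, width=60):
    """Format one FASTA record: a '>' header line, then the wrapped sequence."""
    lines = [">" + identifier]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Because read_entries accepts any iterable of lines, an open file handle can be passed to it directly, so the 3 GB file never has to be held in memory at once.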
How to pick the right entries/genes
Study the example entry carefully.
First of all, we are only interested in spider genes. This means that the OC line must contain "Araneae".
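The taxonomy of an entry can span several OC lines, so it is safer to join them before searching. A minimal sketch, assuming an entry is available as a list of its lines and that each line type starts with a two-letter code followed by three spaces; the function name is_spider is hypothetical:

```python
def is_spider(entry):
    """Return True if the combined OC (taxonomy) lines mention 'Araneae'."""
    # Strip the 'OC   ' prefix and join all taxonomy lines into one string.
    taxonomy = " ".join(line[5:] for line in entry if line.startswith("OC   "))
    return "Araneae" in taxonomy
```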
Observe the CC lines in the entries. They are divided into several sections, each starting with "-!-" followed by a section name in capital letters. Pay attention to the following three sections: FUNCTION, TISSUE SPECIFICITY and SIMILARITY. From these sections you can derive whether the protein is a toxin, like this:
Set a counter to 0. If a FUNCTION section exists and it contains the word "toxin", increase the counter by 1. If a TISSUE SPECIFICITY section exists and it contains the words "venom gland", increase the counter by 1. If a SIMILARITY section exists and it contains the word "toxin", increase the counter by 1. If the counter is at least 2, then this is a spider toxin gene.
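The counting rule above can be sketched as follows. The code assumes that a CC section starts with "CC   -!- NAME:" and continues on the following indented CC lines, and that a CC line of dashes starts the copyright block, which is skipped. Two choices here are assumptions rather than requirements of the exercise: the match is a case-insensitive substring search (so "Neurotoxin" counts as "toxin"), and repeated sections with the same name are merged. The function names are hypothetical.

```python
def comment_sections(entry):
    """Map each CC section name to its text, joined across continuation lines."""
    sections = {}
    topic = None
    for line in entry:
        if not line.startswith("CC   "):
            continue
        body = line[5:]
        if body.startswith("-!- "):
            # New section: the name is everything before the first colon.
            topic, _, text = body[4:].partition(":")
            sections.setdefault(topic, []).append(text.strip())
        elif body.startswith("---"):
            # Dashed line: start or end of the copyright block, not a section.
            topic = None
        elif topic is not None:
            # Continuation line of the current section.
            sections[topic].append(body.strip())
    return {name: " ".join(parts) for name, parts in sections.items()}


def is_toxin(entry):
    """Apply the counter rule: at least 2 of the 3 checks must succeed."""
    sections = comment_sections(entry)
    counter = 0
    if "toxin" in sections.get("FUNCTION", "").lower():
        counter += 1
    if "venom gland" in sections.get("TISSUE SPECIFICITY", "").lower():
        counter += 1
    if "toxin" in sections.get("SIMILARITY", "").lower():
        counter += 1
    return counter >= 2
```

Joining the continuation lines with a space matters because a phrase such as "venom gland" can be broken across two CC lines in the file.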