Find short virus genes with disulfide bridges


Description

Find all short (150 amino acids or fewer) virus genes in UniProt that contain intrachain disulfide bridges. Interchain disulfide bonds can produce stable, covalently linked protein dimers, multimers or complexes, whereas intrachain disulfide bonds can contribute to protein folding and stability.

Input/output

Download the entire Swiss-Prot database. This will be the input file for your program.
Unpack it yourself with

gunzip uniprot_sprot.dat.gz

or any other method you prefer. Be careful: the unpacked file takes up 3 GB.
Note that the file contains many Swiss-Prot entries, and your program must handle all of them. The output must be a FASTA file in which each record consists of a header holding the Swiss-Prot ID of the entry, followed by the protein sequence from that entry.
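As a starting point, reading the flat file one entry at a time and writing FASTA records could be sketched as follows. This is a minimal sketch, not a reference solution. It assumes that each entry ends with a line starting with `//`, that the Swiss-Prot ID is the second field of the `ID` line, and that the sequence is everything between the `SQ` line and the end of the entry. The function names are hypothetical.

```python
from typing import Iterator, List, TextIO


def read_entries(handle: TextIO) -> Iterator[List[str]]:
    """Yield each entry as a list of lines; entries are assumed to end with '//'."""
    entry: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)


def entry_id(entry: List[str]) -> str:
    """Return the second field of the ID line (assumed to be the Swiss-Prot ID)."""
    for line in entry:
        if line.startswith("ID   "):
            return line.split()[1]
    return ""


def entry_sequence(entry: List[str]) -> str:
    """Join the sequence lines that follow the SQ line, dropping the spaces."""
    parts: List[str] = []
    in_sequence = False
    for line in entry:
        if line.startswith("SQ   "):
            in_sequence = True
        elif in_sequence:
            parts.append(line.replace(" ", ""))
    return "".join(parts)


def write_fasta(handle: TextIO, identifier: str, sequence: str, width: int = 60) -> None:
    """Write one FASTA record: a '>' header line, then the sequence in fixed-width lines."""
    handle.write(f">{identifier}\n")
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")
```

Reading entry by entry keeps memory use low, which matters for a 3 GB input file. The line width of 60 is a common FASTA convention, not a requirement of the exercise.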

How to pick the right entries/genes

Study the example entry carefully.
First of all, we are only interested in virus genes. This means that the OC line must contain "Viruses". The gene must be short (at most 150 amino acids), so that is another selection filter. There must also be a DISULFID feature, so there must be an FT line similar to
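The first two filters could be sketched like this. It is a minimal sketch that assumes an entry is available as a list of its lines and that the sequence has already been extracted as a string. The function names are hypothetical.

```python
from typing import List


def is_virus(entry_lines: List[str]) -> bool:
    """True if any OC (organism classification) line mentions 'Viruses'."""
    return any(
        line.startswith("OC   ") and "Viruses" in line
        for line in entry_lines
    )


def is_short(sequence: str, max_length: int = 150) -> bool:
    """True if the sequence has at most max_length residues."""
    return len(sequence) <= max_length
```

Restricting the search to lines that start with `OC` avoids matching the word "Viruses" where it happens to occur in other line types, such as comments or references.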

FT   DISULFID     69     75

You must verify that there actually are cysteine residues at the positions given by the FT line, i.e. in this example you must check that there is a C at positions 69 and 75 in the sequence. If not, discard the entry.
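Parsing a DISULFID line and checking a position for cysteine could be sketched as below. This is a minimal sketch that assumes the whitespace-separated layout shown in the example line above (two integer positions after the feature key), with any remaining text treated as the description. Lines whose positions are not plain integers return `None`, and how to treat those is left open. The function names are hypothetical.

```python
from typing import Optional, Tuple


def parse_disulfid(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (first, second, description) for a DISULFID FT line, else None."""
    fields = line.split()
    if len(fields) < 4 or fields[0] != "FT" or fields[1] != "DISULFID":
        return None
    try:
        first, second = int(fields[2]), int(fields[3])
    except ValueError:
        # Positions that are not plain integers are not handled in this sketch.
        return None
    return first, second, " ".join(fields[4:])


def has_cysteine(sequence: str, position: int) -> bool:
    """Check a 1-based position; positions outside the sequence count as failures."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == "C"
```

Feature positions are 1-based, while Python strings are 0-based, hence the `position - 1` when indexing.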

How can you tell whether the cysteine(s) described by a DISULFID feature are part of an interchain bridge or an intrachain bridge?
If the two numbers are the same, it is an interchain bridge between two copies of the same chain (a dimer, etc.), whether or not the word "interchain" is present in the feature. Check for a cysteine at that position in the sequence.
If the numbers are different and "interchain" is not present in the feature, it is an intrachain bridge, and you must check for a cysteine at both positions in the sequence.
If the numbers are different and "interchain" is present, it is an interchain bridge between two different chains. Check for a cysteine only at the first position, since the second position is in another chain, i.e. another entry.
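The three cases above can be sketched as one small function that returns both the type of bridge and the positions to check in this entry's sequence. It is a minimal sketch. The case-insensitive search for "interchain" in the feature description is an assumption about how the word appears in the file, and the function name is hypothetical.

```python
from typing import List, Tuple


def classify_bridge(first: int, second: int, description: str) -> Tuple[str, List[int]]:
    """Return (bridge type, positions to check in this entry's sequence)."""
    if first == second:
        # Same position twice: a bridge between copies of the same chain.
        return "interchain", [first]
    if "interchain" in description.lower():
        # The second position belongs to another chain, i.e. another entry.
        return "interchain", [first]
    return "intrachain", [first, second]
```

For the example feature with positions 69 and 75 and no description, this returns `("intrachain", [69, 75])`.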

Note: You check ALL disulfide positions in the entry for the presence of cysteine, but you ONLY keep the entry/sequence if it contains an intrachain bridge.
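The final decision for one entry could then be sketched as follows. This is a minimal sketch that assumes each bridge has already been reduced to a pair `(is_intrachain, positions)`, where `positions` are the 1-based positions to check in this entry's sequence. The function name and the input shape are hypothetical.

```python
from typing import Iterable, List, Tuple


def keep_entry(sequence: str, bridges: Iterable[Tuple[bool, List[int]]]) -> bool:
    """Keep the entry only if every checked position is a C and
    at least one bridge is intrachain."""
    has_intrachain = False
    for is_intrachain, positions in bridges:
        for position in positions:
            in_range = 1 <= position <= len(sequence)
            if not in_range or sequence[position - 1] != "C":
                return False  # one failed cysteine check discards the entry
        has_intrachain = has_intrachain or is_intrachain
    return has_intrachain
```

An entry with only interchain bridges is rejected even if all its cysteines check out, and an entry with an intrachain bridge is rejected if any other bridge in it fails the cysteine check.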