Positive proteins
Description
Find the 1000 most positively charged protein sequences in UniProt and put them in a FASTA file. Repeat the search, but this time find the sequences with the highest positive charge per unit of molecular weight, and put those into another FASTA file.

Among the 20 common amino acids, five have a side chain that can be charged. At pH 7, two are negatively charged (acidic side chains): aspartic acid (D) and glutamic acid (E). Three are positively charged (basic side chains): lysine (K), arginine (R) and histidine (H).
Input/output
Download the entire swissprot database.
This will be your input file to your program.
Unpack it yourself with
gunzip uniprot_sprot.dat.gz
or whatever method you prefer. Careful, it will take up 3 GB.
Note that the file contains many Swiss-Prot entries, and your program must handle all of them.
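One way to handle many entries is to stream the file line by line and emit one record each time the entry terminator is reached. The sketch below assumes the usual Swiss-Prot flat-file layout: an entry starts with an ID line, the sequence follows the SQ line, and the entry ends with a line containing "//". The function name iter_entries is hypothetical, not part of the exercise.

```python
from typing import Iterable, Iterator, Tuple


def iter_entries(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (entry_id, sq_line, sequence) for each entry in a Swiss-Prot flat file.

    Assumes each entry has an 'ID' line, an 'SQ' line followed by the
    sequence lines, and a closing '//' line.
    """
    entry_id = None
    sq_line = None
    seq_parts = []
    in_sequence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("ID "):
            # The entry name is the second whitespace-separated token.
            entry_id = line.split()[1]
        elif line.startswith("SQ "):
            sq_line = line
            in_sequence = True
        elif line.startswith("//"):
            # End of entry: emit it if it was complete, then reset the state.
            if entry_id is not None and sq_line is not None:
                yield entry_id, sq_line, "".join(seq_parts)
            entry_id, sq_line, seq_parts, in_sequence = None, None, [], False
        elif in_sequence:
            # Sequence lines are split into blocks by spaces; strip them all.
            seq_parts.append("".join(line.split()))
```

Because the function is a generator over an open file handle, only one entry is held in memory at a time, for example: with open("uniprot_sprot.dat") as handle: for entry_id, sq_line, sequence in iter_entries(handle): ...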
You are only allowed to read this file once to find the answer to both tasks.
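Reading the file only once means both rankings have to be maintained while the entries stream past. One possible approach, sketched here and not necessarily the intended solution, is a bounded min-heap that keeps only the best N candidates seen so far. The class name TopN and its methods are hypothetical.

```python
import heapq
from itertools import count


class TopN:
    """Keep the n highest-scoring items seen so far, using a min-heap."""

    def __init__(self, n: int):
        self.n = n
        self._heap = []           # entries are (score, tiebreak, item)
        self._tiebreak = count()  # unique counter so items are never compared

    def offer(self, score, item) -> None:
        entry = (score, next(self._tiebreak), item)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # Better than the current worst kept item: swap it in.
            heapq.heapreplace(self._heap, entry)

    def ranked(self):
        """Return (score, item) pairs, best first; ties keep arrival order."""
        ordered = sorted(self._heap, key=lambda e: (-e[0], e[1]))
        return [(score, item) for score, _, item in ordered]
```

With two such objects, say TopN(1000) for charge and TopN(1000) for charge divided by molecular weight, every entry can be offered to both inside the same loop, so the file is traversed a single time and memory use stays proportional to 2000 entries.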
The output is two FASTA files: one contains the 1000 most positively charged sequences, and the other contains the 1000 sequences with the highest positive charge per unit of molecular weight. Write the charge and molecular weight of the sequence in the comment section of the header line (after the ID), like this:
>HEWA_HUMAN MW: 234356 Charge: 13
MADRRRQRASQDTEDEESGASGSDSGGSPLRGGGSCSGSAGGGGSGSLPSQRGGRTGALH
LRRVESGGAKSAEESECESEDGIEGDAVLSDYESAEDSEGEEGEYSEEENSKVELKSEAN
DAVNSSTKEEKGEEKPDTKSTVTGERQSGDGQESTEPVENKVGKKGPKHLDDDEDRKNPA
YIPRKGLFFEHDLRGQTQEEEVRPKGRQRKLWKDEGRWEHDKFREDEQAPKSRQELIALY
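A record in the format above could be produced with a small helper like the following sketch. The function name format_fasta and the default line width of 60 residues are assumptions made for illustration.

```python
def format_fasta(entry_id: str, mw: int, charge: int, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with MW and charge in the header comment."""
    header = f">{entry_id} MW: {mw} Charge: {charge}"
    # Wrap the sequence into fixed-width lines.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"
```

The returned string ends with a newline, so records can be written to the output file one after another without extra separators.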
The actual charge of a protein depends on the pH of its environment, but this project simplifies it: each D and E counts as -1, and each K, R and H counts as +1.
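Under this simplified rule the charge is just a difference of residue counts. A minimal sketch, with the hypothetical function name net_charge:

```python
def net_charge(sequence: str) -> int:
    """Simplified net charge: +1 for each K, R, H and -1 for each D, E."""
    positive = sum(sequence.count(aa) for aa in "KRH")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative
```

For instance, the fragment MADRRRQRAS contains four R and one D, so net_charge returns 3 for it.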
The molecular weight can be found on the SQ line.
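In the Swiss-Prot flat file, an SQ line typically looks like "SQ   SEQUENCE   256 AA;  29735 MW;  B4840739BF7D4121 CRC64;", with the fields separated by semicolons. The sketch below extracts the molecular weight under that assumption; the function name parse_molecular_weight is hypothetical.

```python
def parse_molecular_weight(sq_line: str) -> int:
    """Extract the molecular weight (in Da) from a Swiss-Prot SQ line."""
    for field in sq_line.split(";"):
        tokens = field.split()
        # The MW field has the form '<number> MW'.
        if len(tokens) >= 2 and tokens[-1] == "MW":
            return int(tokens[-2])
    raise ValueError(f"no MW field found in SQ line: {sq_line!r}")
```

Searching for the field labelled MW, instead of relying on a fixed column position, keeps the parser working even if the spacing on the line varies.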