Positive proteins

From 22113

Description

Find the 1000 most positively charged protein sequences in UniProt and put them in a FASTA file. Repeat the search, but this time find the sequences with the highest positive charge per unit of molecular weight, and put those into another FASTA file. Among the 20 common amino acids, five have a side chain that can be charged. At pH 7, two are negatively charged: aspartic acid (D) and glutamic acid (E) (acidic side chains), and three are positively charged: lysine (K), arginine (R) and histidine (H) (basic side chains).

Input/output

Download the entire Swiss-Prot database. This will be the input file for your program.
Unpack it yourself with

gunzip uniprot_sprot.dat.gz

or whatever method you prefer. Careful, it will take up 3 GB.
Note that the file contains many Swiss-Prot entries, and your program must handle that. You are only allowed to read the file once to find the answers to both tasks.
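
One possible way to satisfy the read-once requirement is to stream the file entry by entry and keep two bounded min-heaps, one per ranking. The sketch below is only an illustration, not the required solution. It assumes that each entry ends with a line starting with "//" (as in the UniProtKB flat file format) and that each record has already been reduced to a tuple (name, charge, mw, sequence). The function names iter_entries, push_top and select_top are made up for this example.

```python
import heapq


def iter_entries(lines):
    """Yield each entry as a list of lines; an entry ends at a '//' line."""
    entry = []
    for line in lines:
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line.rstrip("\n"))


def push_top(heap, n, score, serial, item):
    """Keep only the n highest-scoring items seen so far in a min-heap.

    serial is a unique integer, so tuple comparison never reaches item.
    """
    record = (score, serial, item)
    if len(heap) < n:
        heapq.heappush(heap, record)
    elif score > heap[0][0]:
        # The smallest kept score is at heap[0]; replace it.
        heapq.heapreplace(heap, record)


def select_top(records, n=1000):
    """Rank records by charge and by charge per molecular weight in one pass.

    records is an iterable of (name, charge, mw, sequence) tuples
    (an assumed layout for this sketch).
    """
    by_charge, by_density = [], []
    for serial, rec in enumerate(records):
        _name, charge, mw, _seq = rec
        push_top(by_charge, n, charge, serial, rec)
        push_top(by_density, n, charge / mw, serial, rec)

    def best_first(heap):
        return [item for _score, _serial, item in sorted(heap, reverse=True)]

    return best_first(by_charge), best_first(by_density)
```

With this approach, memory use depends on n rather than on the size of the database, since at most 2 * n records are held at any time.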

The output is two FASTA files: one contains the 1000 most positively charged sequences, and the other contains the 1000 sequences that are most positively charged per molecular weight. Write the charge and molecular weight of the sequence in the comment section of the header line (after the ID), like this:

>HEWA_HUMAN     MW: 234356  Charge: 13
MADRRRQRASQDTEDEESGASGSDSGGSPLRGGGSCSGSAGGGGSGSLPSQRGGRTGALH
LRRVESGGAKSAEESECESEDGIEGDAVLSDYESAEDSEGEEGEYSEEENSKVELKSEAN
DAVNSSTKEEKGEEKPDTKSTVTGERQSGDGQESTEPVENKVGKKGPKHLDDDEDRKNPA
YIPRKGLFFEHDLRGQTQEEEVRPKGRQRKLWKDEGRWEHDKFREDEQAPKSRQELIALY
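
A record in this layout could be produced along the following lines. This is a minimal sketch: the function name format_fasta is made up for this example, single spaces are used as field separators in the header, and the line width of 60 residues is taken from the example above rather than from any stated requirement.

```python
def format_fasta(name, mw, charge, sequence, width=60):
    """Return one FASTA record with MW and charge in the header comment."""
    header = f">{name} MW: {mw} Charge: {charge}"
    # Split the sequence into chunks of at most `width` residues.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"
```

The returned string can be passed directly to the write method of an open output file.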

The actual charge of a protein depends on the pH of its environment, but this project simplifies the calculation: each D and E counts as -1, and each K, R and H counts as +1.
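
The simplified charge rule can be written as a short function. This is a sketch under the rule stated above; the name net_charge is made up for this example, and the sequence is assumed to be an uppercase one-letter amino acid string.

```python
def net_charge(sequence):
    """Simplified net charge: +1 per K, R or H and -1 per D or E."""
    positive = sum(sequence.count(aa) for aa in "KRH")
    negative = sum(sequence.count(aa) for aa in "DE")
    return positive - negative
```

All other residues contribute nothing, so a sequence without K, R, H, D or E has a net charge of 0.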

The molecular weight can be read from the SQ line of each entry.
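
As a sketch of how the SQ line might be read, the function below assumes the usual UniProtKB flat file layout, where the SQ line looks like "SQ   SEQUENCE   256 AA;  29735 MW;  B4840739BF7D4121 CRC64;" and the sequence follows on indented lines in space-separated blocks. The name parse_mw_and_sequence and the sample values in the comment are made up for illustration.

```python
import re

# Matches the integer that precedes "MW;" on the SQ line.
MW_PATTERN = re.compile(r"(\d+)\s+MW;")


def parse_mw_and_sequence(entry_lines):
    """Return (molecular weight, sequence) from the lines of one entry."""
    mw = None
    residues = []
    in_sequence = False
    for line in entry_lines:
        if line.startswith("SQ"):
            match = MW_PATTERN.search(line)
            if match is None:
                raise ValueError("no molecular weight on SQ line: " + line)
            mw = int(match.group(1))
            in_sequence = True
        elif in_sequence and line.startswith(" "):
            # Sequence lines are indented; drop all whitespace between blocks.
            residues.append("".join(line.split()))
        elif in_sequence:
            # Any other line (for example '//') ends the sequence block.
            break
    if mw is None:
        raise ValueError("entry has no SQ line")
    return mw, "".join(residues)
```

The protein ID for the FASTA header would come from the ID line of the same entry, which is not handled in this sketch.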