Example code - Wrong parsing
People sometimes think that to parse a SwissProt or GenBank file (or even a FASTA file), they first need to split the file into entries. That is not completely wrong, since you do need to keep the entries apart from each other, but the way they go about it is. Here is what typically happens.
# Read the file
with open('swissprotfile', 'r') as infile:
    content = infile.read()
# Now the whole file has been read into content, every byte of it.
# That means you used your computer's memory, which is small compared to the disk, to hold the file.

# Separate into entries
entries = content.split('//')
# As SwissProt/GenBank entries end with a // line, the entries have now been split from each other.
# The entries list also contains all the data in the input file.
# You now have the entire file in memory TWICE and you have not done anything significant yet. And you won't either.
# In real life these files grow big. The SwissProt database is almost 4 GB and that is the small database.
# Essentially your program will break down at this point.

# OK, when reading these kinds of files you often have to extract the sequence.
# Many think: "Let's use Stateful parsing", because they were taught that.
# And they are right. Stateful parsing is the way to go.
for entry in entries:
    # The entry is one long multi-line string. Must be split in lines.
    lines = entry.split('\n')
    seq, flag = '', False
    for line in lines:
        # Some code that extracts this and that
        # Standard Stateful parsing
        # The red line, where sequence ends
        if line[