Example code - Wrong parsing
Revision as of 09:19, 1 September 2025
Sometimes people think that to parse a SwissProt file or a GenBank file (or even a FASTA file), they must first split the file into entries. That is not completely wrong, since you do need to keep the entries apart from each other, but the way they go about it is wrong. Here is what typically happens.
<pre>
# Read the file
with open('swissprotfile', 'r') as infile:
    content = infile.read()
# Now the file is read into content, every byte of it.
# That means you used your small (compared to the disk) computer memory to contain the file

# Separate into entries
entries = content.split('//')
# As SwissProt/GenBank entries end with a // line, the entries have been split from each other.
# The entries list also contains all the data in the input file.
# You now have the entire file in memory TWICE and you have not done anything significant yet. And you won't either.
# In real life these files grow big. The SwissProt database is almost 4 GB and that is the small database.
# Essentially your program will break down at this point.

# OK, when reading these kinds of files you often have to extract the sequence.
# Many think: "Let's use Stateful parsing", because they were taught that.
# And they are right. Stateful parsing is the way to go.
for entry in entries:
    # The entry is one long multi-line string. It must be split into lines.
    lines = entry.split('\n')
    seq, flag = '', False
    for line in lines:
        # Some code that extracts this and that
        # Standard Stateful parsing
        # The red line, where sequence ends
        if line.startswith('//'):
            flag = False
        # The sequence collection
        if flag:
            seq += ''.join(line.split())
        # The green line, where sequence starts
        if line.startswith('SQ'):
            flag = True
    # Here we do something with the sequence and other stuff we extracted
# So what is wrong here? Nothing ..... beautiful Stateful parsing.
# EXCEPT WHERE IS THE RED LINE?
# Earlier the file content was split into entries by the // pattern. The pattern is NOT part of the result of the split.
# That means the // line has disappeared from the entry - you deleted the red line yourself.
# The Stateful parsing will not work.
</pre>
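The vanished red line is easy to confirm on a tiny example. The sketch below uses a made-up two-entry string (the IDs `TEST1` and `TEST2` and the sequences are invented for illustration, not real SwissProt data) and checks whether any line in the split result still starts with `//`.

```python
# Made-up miniature of a SwissProt-style file: two entries,
# each ending with a // line. The data is invented for illustration.
content = (
    "ID   TEST1\n"
    "SQ   SEQUENCE   4 AA;\n"
    "     MKVL\n"
    "//\n"
    "ID   TEST2\n"
    "SQ   SEQUENCE   3 AA;\n"
    "     ACD\n"
    "//\n"
)

entries = content.split('//')

# str.split() drops the separator, so no piece can contain a line
# that starts with '//'.
has_red_line = any(
    line.startswith('//')
    for entry in entries
    for line in entry.split('\n')
)

print(len(entries))   # 3 (the last piece is just the trailing newline)
print(has_red_line)   # False
```

Note the third piece: the split also leaves an almost empty "entry" after the last `//`, which the loop over entries would have to deal with as well.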
Welcome to re-exam.
Maybe study my solutions a bit.
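For contrast, here is a minimal sketch of Stateful parsing done directly on the file, one line at a time, so that the `//` line is still there to act as the red line and only one entry's sequence is held in memory. This is not the author's solution, just an illustration under stated assumptions: the function name `read_sequences` is hypothetical, and taking the second field of the `ID` line as the entry name is an assumption made to have something to print.

```python
import io


def read_sequences(infile):
    """Yield (entry_id, sequence) per entry from an open file or any iterable of lines."""
    entry_id, seq, flag = None, '', False
    for line in infile:
        # Assumption for illustration: the entry name is the second field of the ID line
        if line.startswith('ID'):
            fields = line.split()
            entry_id = fields[1] if len(fields) > 1 else None
        # The red line, where the sequence (and the entry) ends
        if line.startswith('//'):
            yield entry_id, seq
            entry_id, seq, flag = None, '', False
        # The sequence collection
        if flag:
            seq += ''.join(line.split())
        # The green line, where the sequence starts
        if line.startswith('SQ'):
            flag = True


# Made-up data standing in for an open file. With a real file you would write:
#     with open('swissprotfile', 'r') as infile:
#         for entry_id, seq in read_sequences(infile):
#             ...
demo = io.StringIO(
    "ID   TEST1   Reviewed;   4 AA.\n"
    "SQ   SEQUENCE   4 AA;\n"
    "     MKVL\n"
    "//\n"
    "ID   TEST2   Reviewed;   3 AA.\n"
    "SQ   SEQUENCE   3 AA;\n"
    "     ACD\n"
    "//\n"
)

results = list(read_sequences(demo))
print(results)  # [('TEST1', 'MKVL'), ('TEST2', 'ACD')]
```

Iterating over the file handle reads one line at a time, so memory use depends on the size of the largest entry, not on the size of the file. The entries are still kept apart from each other, but by the `//` line itself instead of by a split.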