Example code - Wrong parsing

People often assume that to parse a SwissProt or GenBank file (or even a FASTA file), they must first split the file into entries. That is not completely wrong, since you do need to keep the entries apart from each other, but the way they go about it is. Here is what typically happens.

# Read the file
with open('swissprotfile', 'r') as infile:
    content = infile.read()
# Now the whole file, every byte of it, is held in content.
# That means you used your computer's memory, which is small compared to the disk, to hold the entire file.

# Separate into entries
entries = content.split('//')
# As SwissProt/GenBank entries end with a // line, the entries have now been split from each other.

# The entries list also contains all the data in the input file.
# You now have the entire file in memory TWICE and you have not done anything significant yet. And you won't either.
# In real life these files grow big. The SwissProt database is almost 4 GB and that is the small database.
# Essentially your program will break down at this point.

# OK, when reading this kind of file you often have to extract the sequence.
# Many think: "Let's use Stateful parsing", because they were taught that.
# And they are right. Stateful parsing is the way to go.

for entry in entries:
    # The entry is one long multi-line string. It must be split into lines.
    lines = entry.split('\n')
    seq, flag = '', False
    for line in lines:
        # Some code that extracts this and that

        # Standard Stateful parsing
        # The red line, where sequence ends
        if line.startswith('//'):
            flag = False
        # The sequence collection
        if flag:
            seq += ''.join(line.split())
        # The green line, where sequence starts
        if line.startswith('SQ'):
            flag = True

    # Here we do something with the sequence and other stuff we extracted for this entry

# So what is wrong here? Nothing ... beautiful Stateful parsing.
# EXCEPT WHERE IS THE RED LINE?
# Earlier the file content was split into entries by the // pattern. The pattern is NOT part of the result of the split.
# That means the // line has disappeared from the entry - you deleted the red line yourself.
# The red-line test can never be true, so the Stateful parsing will not work as intended.
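
For contrast, here is a minimal sketch of the alternative: read the file one line at a time, so the // line is still there to act as the red line and only one entry is in memory at any moment. The function name read_entries, the ID-line extraction and the two-entry sample text are made up for illustration. This is one possible way to do it, not necessarily the course solution.

```python
import io

# First, what split() really does: the separator is dropped from the pieces.
pieces = "A\n//\nB\n//\n".split('//')
# No piece contains '//' any more, and a leftover "entry" appears at the end.

def read_entries(handle):
    """Yield (entry_id, sequence) per entry, reading line by line.

    Hypothetical helper: handle is any iterable of lines, e.g. an open file.
    """
    entry_id, seq_parts, in_seq = None, [], False
    for line in handle:
        if line.startswith('//'):
            # The red line: the entry ends, hand it over and reset the state
            yield entry_id, ''.join(seq_parts)
            entry_id, seq_parts, in_seq = None, [], False
        elif in_seq:
            # The sequence collection: strip spaces and the newline
            seq_parts.append(''.join(line.split()))
        elif line.startswith('ID'):
            # Assumption: the entry name is the second field of the ID line
            entry_id = line.split()[1]
        elif line.startswith('SQ'):
            # The green line: sequence starts on the next line
            in_seq = True

# Made-up two-entry sample in SwissProt-like layout
sample_text = (
    "ID   TEST1_HUMAN   Reviewed;   10 AA.\n"
    "SQ   SEQUENCE   10 AA;\n"
    "     MKTAYIAKQR\n"
    "//\n"
    "ID   TEST2_HUMAN   Reviewed;   11 AA.\n"
    "SQ   SEQUENCE   11 AA;\n"
    "     QISFVK SHFSR\n"
    "//\n"
)

results = list(read_entries(io.StringIO(sample_text)))
for name, seq in results:
    print(name, seq)
```

With a real file you would pass the open file object instead of the StringIO, for example inside a with open(...) block. The memory used then depends on the size of the largest entry, not on the size of the file.
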

Welcome to the re-exam.

Maybe study my solutions a bit.