Example code - Wrong parsing

Revision as of 09:07, 1 September 2025 by WikiSysop

Sometimes people think that when they need to parse a SwissProt file or a GenBank file (or even a FASTA file), they need to separate the file into entries. That is not completely wrong, since you do need to keep the entries apart from each other, but the way they go about it is wrong. Here is what typically happens.

# Read the file
with open('swissprotfile', 'r') as infile:
    content = infile.read()
# Now the whole file has been read into content, every byte of it.
# That means you used your computer's memory (small compared to the disk) to hold the entire file.

# Separate into entries
entries = content.split('//')
# As SwissProt/GenBank entries end with a // line, the entries have now been split from each other.

# The entries list also contains all the data in the input file.
# You now have the entire file in memory TWICE and you have not done anything significant yet. And you won't either.
# In real life these files grow big. The SwissProt database is almost 4 GB and that is the small database.
# Essentially your program will break down at this point.
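
# For contrast, a minimal sketch of keeping the entries apart without ever
# holding the whole file in memory: read line by line and hand over one entry
# each time a line starting with // is seen. The helper name iter_entries and
# the small in-memory demo record are hypothetical, made up for illustration;
# this is one possible approach under those assumptions, not the only one.
# Note that it tests the start of the line, whereas split('//') also cuts at
# any // inside a line (for example in a URL).

```python
import io


def iter_entries(handle):
    """Yield one entry at a time as a list of lines (hypothetical helper)."""
    entry = []
    for line in handle:
        if line.startswith('//'):
            # End-of-entry line reached: hand over the entry, start a new one
            yield entry
            entry = []
        else:
            entry.append(line.rstrip('\n'))
    if entry:
        # Trailing lines that were not closed by a // line
        yield entry


# Made-up two-entry demo input; a real file handle from open() works the same way
demo = io.StringIO("ID   A\nSQ   SEQ\n     MKV\n//\nID   B\n//\n")
demo_entries = list(iter_entries(demo))
```

# Only one entry is in memory at a time when the generator is consumed in a
# for loop; list() is used above only to keep the demo small.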

# OK, when reading this kind of file you often have to extract the sequence.
# Many think: "Let's use Stateful parsing", because they were taught that.
# And they are right. Stateful parsing is the way to go.

for entry in entries:
    # The entry is one long multi-line string. It must be split into lines.
    lines = entry.split('\n')
    seq, flag = '', False
    for line in lines:
        # Some code that extracts this and that

        # Standard Stateful parsing
        # The red line, where the sequence ends
        if line[