Example code - Wrong parsing

Revision as of 08:19, 1 September 2025

Sometimes people think that when they need to parse a SwissProt file or a GenBank file (or even a FASTA file), they need to separate the file into entries. That is not completely wrong, since you do need to keep the entries apart from each other, but the way they go about it is. Here is what typically happens.

# Read the file
with open('swissprotfile', 'r') as infile:
    content = infile.read()
# Now the whole file has been read into content, every byte of it.
# That means you used your computer's memory (small compared to the disk) to hold the entire file.

# Separate into entries
entries = content.split('//')
# As SwissProt/GenBank entries end with a // line, the entries have now been split from each other.

# The entries list also contains all the data in the input file.
# You now have the entire file in memory TWICE and you have not done anything significant yet. And you won't either.
# In real life these files grow big. The SwissProt database is almost 4 GB and that is the small database.
# Essentially your program will break down at this point.

# OK, when reading these kinds of files you often have to extract the sequence.
# Many think: "Let's use Stateful parsing", because they were taught that.
# And they are right. Stateful parsing is the way to go.

for entry in entries:
    # The entry is one long multi-line string. It must be split into lines.
    lines = entry.split('\n')
    seq, flag = '', False
    for line in lines:
        # Some code that extracts this and that

        # Standard Stateful parsing
        # The red line, where sequence ends
        if line.startswith('//'):
            flag = False
        # The sequence collection
        if flag:
            seq += ''.join(line.split())
        # The green line, where sequence starts
        if line.startswith('SQ'):
            flag = True

    # Here we do something with the sequence and other stuff we extracted

# So what is wrong here? Nothing ..... beautiful Stateful parsing.
# EXCEPT WHERE IS THE RED LINE?
# Earlier the file content was split into entries by the // pattern. The pattern is NOT part of the result of the split.
# That means the // line has disappeared from the entry - you deleted the red line yourself.
# The Stateful parsing will not work.
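
The disappearing red line is easy to check on a tiny example. The sketch below uses a made-up two-entry string (toy data, not a real SwissProt record) and tests whether any piece returned by `split('//')` still contains a line starting with `//`.

```python
# Toy data (an assumption for illustration): two minimal entries, each ended by a // line.
content = "ID   A\nSQ   SEQUENCE\n     MKT\n//\nID   B\nSQ   SEQUENCE\n     GAV\n//\n"

entries = content.split('//')

# split() consumes the separator, so no line in any entry starts with // any more.
has_terminator = [any(line.startswith('//') for line in entry.split('\n'))
                  for entry in entries]
print(len(entries), has_terminator)
# prints: 3 [False, False, False]
```

Note that the split yields three pieces for two entries: the last one is only the newline that followed the final `//`, so the loop would also process one empty pseudo-entry.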

Welcome to re-exam.

Maybe study my solutions a bit.
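
For comparison, here is a minimal sketch of one alternative. It is not necessarily the solution referred to above. It assumes a SwissProt-style layout (an `ID` line with the entry name as second field, an `SQ` line before the sequence, a `//` line closing the entry) and reads the file line by line, so only one line and one sequence are in memory at a time and the red line is still there to be seen. The function name `read_sequences` is hypothetical.

```python
def read_sequences(filename):
    """Yield (entry_id, sequence) for one entry at a time, reading line by line."""
    entry_id, seq, flag = None, '', False
    with open(filename, 'r') as infile:
        for line in infile:
            # The red line: // closes the entry, so hand over what was collected
            if line.startswith('//'):
                yield entry_id, seq
                entry_id, seq, flag = None, '', False
            # The sequence collection
            elif flag:
                seq += ''.join(line.split())
            # The green line, where the sequence starts
            elif line.startswith('SQ'):
                flag = True
            # Assumption: the entry name is the second field of the ID line
            elif line.startswith('ID'):
                entry_id = line.split()[1]
```

A caller would loop with `for entry_id, seq in read_sequences('swissprotfile'):` and handle each sequence as it arrives. Memory use then depends on the size of a single entry, not on the size of the file.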

