Stateful parsing

Required course material for the lesson

Powerpoint: Stateful Parsing
Video: Stateful Parsing
Video: Finding errors with Stateful Parsing

Subjects covered

Using stateful parsing to extract data spanning several lines, by recognizing keywords.
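As a rough illustration of the idea, the sketch below keeps one flag (the state) that is switched on when a start keyword is seen and switched off at an end keyword. The entry is a made-up toy record in a SwissProt-like layout, not real data, and the names TOY_ENTRY and extract_sequence are invented for this example.

```python
# Toy, made-up entry in a SwissProt-like layout (not a real database record).
TOY_ENTRY = """\
ID   TOY1_EXAMP     STANDARD;      PRT;    25 AA.
AC   X00001;
DE   MADE-UP PROTEIN FOR ILLUSTRATION.
SQ   SEQUENCE   25 AA;  0 MW;  0 CRC32;
     MKTAYIAKQR QISFVKSHFS RQLEE
//
"""


def extract_sequence(lines):
    """Collect the sequence lines between the SQ line and the // terminator."""
    in_sequence = False  # the state: are we inside the sequence block?
    parts = []
    for line in lines:
        if line.startswith("//"):
            # End keyword: stop collecting.
            in_sequence = False
        elif in_sequence:
            # Inside the block: keep the residues, drop the spaces between groups.
            parts.append("".join(line.split()))
        elif line.startswith("SQ"):
            # Start keyword: collecting begins on the NEXT line.
            in_sequence = True
    return "".join(parts)


sequence = extract_sequence(TOY_ENTRY.splitlines())
print(sequence, len(sequence))
```

The order of the tests matters: the end keyword is checked first, then the state, then the start keyword, so the SQ line itself is never collected as data.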

Exercises to be handed in

The following exercises deal with SwissProt. The file sprot1.dat is a SwissProt database entry. Study it carefully and locate the SwissProt ID (SP96_DICDI), the accession number (P14328) and the amino acid sequence (MRVLLVLVAC....TTTATTTATS). There are other entries (sprot2.dat, sprot3.dat, sprot4.dat), and your programs should work on those, too. Your programs must also solve all the problems in ONE reading of the file. It is acceptable to hand in a single program that solves exercises 1 to 4; exercise 5 is separate. These exercises are about studying and understanding the file format.

Make a program that reads the ID and prints it. Try to use the "walk the line" technique.
Add the following functionality to the program: Read the accession number and print it. Again, walk the line.
Add the following functionality to the program: Read the amino acid sequence and print it. You really should use stateful parsing in this exercise. Maybe check the video.
Add the following functionality to the program: Verification of the number of amino acids. This means extracting the number from the SQ line (example: SQ SEQUENCE 629 AA;) and checking that the amino acid sequence has that number of residues. It should be the program that determines if something is wrong, not the user. Imagine that before you go home, you set the computer to run through a million SwissProt entries. The next day, you must be able to see what failed. In a sense you don't care about what succeeded, as that is the common case. You care about what failed, because that is where you must take action.
Now that you have the ID, accession number and AA sequence, save them to a file sprot.fsa in FASTA format. Look in the file dna.fsa for an example of FASTA. Notice that the first line starts with > and is immediately followed by a unique identifier, like an accession number or a SwissProt ID. Any other data must be on the header line only, but in free format. Sequence data is on the following lines. Notice that this exercise incorporates the previous 4, but uses the result in a slightly different way.
In reality you never have just one entry in a SwissProt file. The current SwissProt database contains more than half a million entries and it grows all the time. A bioinformatician can access the database by downloading a subset, or the whole database, as one file and extracting the needed information from it. In the file sprotall.dat the 4 sprot files have been collected as a very small example of what you really meet. The exercise is to read all entries in sprotall.dat and create one FASTA file sprotall.fsa with the IDs, accession numbers and sequences of all entries. In short, just like the previous exercise but for a file that contains more SwissProt entries. And please, don't do anything stupid like reading the entire file into a list and then parsing that. These SwissProt excerpts can get really big, and doing that can exhaust your computer's memory.
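The points about memory use and about reporting only failures can be sketched as follows. This is one possible shape under stated assumptions, not the expected solution: the entries are made-up toy records, read_entries and write_fasta are invented names, the fields are picked out with split() rather than the "walk the line" technique, the FASTA lines are wrapped at 60 characters, and entries that fail the length check are skipped and reported on stderr. An io.StringIO object stands in for an open file so the sketch is self-contained.

```python
import io
import sys

# Two made-up toy entries; the second one declares 12 AA but holds only 10.
TOY_FILE = """\
ID   TOY1_EXAMP     STANDARD;      PRT;    25 AA.
AC   X00001;
SQ   SEQUENCE   25 AA;  0 MW;  0 CRC32;
     MKTAYIAKQR QISFVKSHFS RQLEE
//
ID   TOY2_EXAMP     STANDARD;      PRT;    12 AA.
AC   X00002; X00003;
SQ   SEQUENCE   12 AA;  0 MW;  0 CRC32;
     ACDEFGHIKL
//
"""


def read_entries(handle):
    """Yield (swiss_id, accession, declared_length, sequence), one entry at a time."""
    swiss_id = accession = declared = None
    parts = []
    in_sequence = False
    for line in handle:  # one line in memory at a time
        if line.startswith("//"):
            # End of entry: hand it over, then reset everything for the next one.
            yield swiss_id, accession, declared, "".join(parts)
            swiss_id = accession = declared = None
            parts = []
            in_sequence = False
        elif in_sequence:
            parts.append("".join(line.split()))
        elif line.startswith("ID "):
            swiss_id = line.split()[1]
        elif line.startswith("AC ") and accession is None:
            accession = line.split()[1].rstrip(";")  # first accession only
        elif line.startswith("SQ "):
            declared = int(line.split()[2])
            in_sequence = True


def write_fasta(entries, out, width=60):
    """Write verified entries as FASTA; return the IDs that failed the check."""
    failed = []
    for swiss_id, accession, declared, sequence in entries:
        if declared != len(sequence):
            failed.append(swiss_id)
            print(f"{swiss_id}: SQ line says {declared} AA, found {len(sequence)}",
                  file=sys.stderr)
            continue  # assumption: failed entries are not written
        out.write(f">{swiss_id} {accession}\n")
        for start in range(0, len(sequence), width):
            out.write(sequence[start:start + width] + "\n")
    return failed


fasta = io.StringIO()  # stands in for open("sprotall.fsa", "w")
failed = write_fasta(read_entries(io.StringIO(TOY_FILE)), fasta)
print(fasta.getvalue(), end="")
```

Because read_entries is a generator, only the current entry is held in memory, regardless of how many entries the file contains. With real files the two StringIO objects would be replaced by file handles from open().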

Exercises for extra practice

Count the number of RA (Author) lines in the sprot1-4.dat files. sprot2.dat contains 25 RA lines.
Extract the author names from the RA lines in the sprot1-4.dat files. Display the names - only the names.
Continuing the previous exercise: Now also extract the title (RT lines). Display the title first on one line, followed by the authors on the next, then an empty line, then the next title and authors, and so forth until there are no more titles or authors.
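A possible starting point for these three exercises is sketched below. It assumes that the data on each line starts in column 6, that author names are separated by commas with a semicolon after the last one, and that titles are quoted and end with a semicolon. The reference block is made-up toy data, and read_references is a name invented for this sketch.

```python
# Made-up reference lines in a SwissProt-like layout (not real citations).
TOY_REFS = """\
RN   [1]
RP   SEQUENCE.
RA   DOE J., ROE R.,
RA   POE E.;
RT   "A MADE-UP TITLE THAT SPANS
RT   TWO LINES.";
RL   J. IMAGINARY RES. 1:1-2(1990).
RN   [2]
RA   MOE M.;
RT   "ANOTHER MADE-UP TITLE.";
RL   J. IMAGINARY RES. 2:3-4(1991).
""".splitlines()


def read_references(lines):
    """Yield (title, authors) for each reference, in one pass over the lines."""
    authors, title_parts = [], []
    for line in lines:
        code, text = line[:2], line[5:].strip()
        if code == "RA":
            names = text.rstrip(";").split(",")
            authors.extend(name.strip() for name in names if name.strip())
        elif code == "RT":
            title_parts.append(text)
        elif authors or title_parts:
            # Any other line code closes the reference collected so far.
            yield " ".join(title_parts).rstrip(";").strip('"'), authors
            authors, title_parts = [], []
    if authors or title_parts:  # input ended right after RA/RT lines
        yield " ".join(title_parts).rstrip(";").strip('"'), authors


ra_count = sum(1 for line in TOY_REFS if line.startswith("RA"))
references = list(read_references(TOY_REFS))

print("RA lines:", ra_count)
for title, authors in references:
    print(title)
    print(", ".join(authors))
    print()
```

The authors appear before the title in the input, so both are collected first and only emitted once the reference is complete; that is what makes it possible to print the title first.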
