Stateful Parsing
Required course material for the lesson
Powerpoint: Stateful Parsing
Video: Stateful Parsing
Video: Finding errors with Stateful Parsing
Subjects covered
Using stateful parsing to extract data spanning several lines, by recognizing keywords.
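As a minimal sketch of the idea, assuming a made-up format in which the wanted lines sit between a line starting with BEGIN and a line starting with END (both keywords, and the function name extract_block, are hypothetical and chosen only for illustration): a flag remembers whether the parser is currently inside the block, and each line is handled according to that state.

```python
def extract_block(lines, start_keyword, end_keyword):
    """Collect the lines found between a start keyword and an end keyword."""
    collecting = False              # the state: are we inside the block?
    collected = []
    for line in lines:
        # Check the end keyword first so the END line itself is not collected.
        if line.startswith(end_keyword):
            collecting = False
        if collecting:
            collected.append(line.rstrip("\n"))
        # Check the start keyword last so the BEGIN line itself is not collected.
        if line.startswith(start_keyword):
            collecting = True
    return collected


# Made-up input; a real program would iterate over an open file instead.
sample = [
    "HEADER something\n",
    "BEGIN\n",
    "first data line\n",
    "second data line\n",
    "END\n",
    "FOOTER something else\n",
]
print(extract_block(sample, "BEGIN", "END"))
```

The order of the three tests inside the loop decides whether the keyword lines themselves end up in the result; here both are excluded.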
Exercises to be handed in
The following five exercises deal with SwissProt. The file sprot1.dat is a SwissProt database entry. Study it carefully. Locate the SwissProt ID (SP96_DICDI), the accession number (P14328) and the amino acid sequence (MRVLLVLVAC....TTTATTTATS). There are other entries (sprot2.dat, sprot3.dat, sprot4.dat), and your programs should work on those, too. Your programs must also solve all the problems in ONE reading of the file. It is acceptable to hand in a single program that solves exercises 1 to 4; exercise 5 is handed in separately. These exercises are about studying and understanding the file format.
- Make a program that reads the ID and prints it.
- Add the following functionality to the program: Read the accession number and print it.
- Add the following functionality to the program: Read the amino acid sequence and print it. You really should use Stateful Parsing in this exercise. Maybe check the video.
- Add the following functionality to the program: Verify the number of amino acids. This means extracting the number from the SQ line (example: SQ SEQUENCE 629 AA;) and checking that the amino acid sequence has that number of residues. The program, not the user, should determine whether something is wrong. Imagine that before you go home, you set the computer to run through a million SwissProt entries. The next day, you must be able to see what failed. In a sense you don't care about what succeeded, as that is the common case. You care about what failed, because that is where you must take action.
- Now that you have the ID, accession number and AA sequence, save them to a file sprot.fsa in FASTA format. Look in the file dna.fsa for an example of FASTA. Notice that the first line starts with > and is immediately followed by a unique identifier, like an accession number or a SwissProt ID. Any other data must be on the header line only, but in free format. Sequence data is on the following lines.
Notice that this exercise incorporates the previous 4, but uses the result in a slightly different way.
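One possible shape for such a program is sketched below; it is an illustration, not the reference solution. It assumes each line starts with a two-letter code (ID, AC, SQ), that the sequence lines follow the SQ line, and that the entry ends with a // line. Check these assumptions against sprot1.dat before relying on them. The function names and the small entry are made up for the example.

```python
import io


def parse_entry(handle):
    """Read one SwissProt-style entry in a single pass over its lines."""
    sp_id = accession = declared_length = None
    sequence_parts = []
    in_sequence = False                     # state flag: inside the sequence block?
    for line in handle:
        if line.startswith("ID ") and sp_id is None:
            sp_id = line.split()[1]
        elif line.startswith("AC ") and accession is None:
            accession = line.split()[1].rstrip(";")   # first accession only
        elif line.startswith("SQ "):
            declared_length = int(line.split()[2])    # e.g. SQ SEQUENCE 629 AA;
            in_sequence = True
        elif line.startswith("//"):
            in_sequence = False
        elif in_sequence:
            sequence_parts.append("".join(line.split()))  # drop spaces and newline
    return sp_id, accession, declared_length, "".join(sequence_parts)


def length_problem(sp_id, declared_length, sequence):
    """Return an error message if the length check fails, otherwise None."""
    if declared_length is None:
        return f"{sp_id}: no SQ line found"
    if declared_length != len(sequence):
        return f"{sp_id}: SQ line says {declared_length} residues, found {len(sequence)}"
    return None


def format_fasta(sp_id, accession, sequence, width=60):
    """Build FASTA text: a header line, then the sequence in fixed-width lines."""
    header = f">{sp_id} {accession}"
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


# A made-up miniature entry, for illustration only (not real SwissProt data).
sample_entry = io.StringIO(
    "ID   TEST_ENTRY     STANDARD;      PRT;    25 AA.\n"
    "AC   X00000;\n"
    "DE   MADE-UP ENTRY FOR ILLUSTRATION ONLY.\n"
    "SQ   SEQUENCE   25 AA;  0 MW;  0 CRC32;\n"
    "     MRVLLVLVAC LVAVALASDQ TTTAT\n"
    "//\n"
)
sp_id, accession, declared_length, sequence = parse_entry(sample_entry)
problem = length_problem(sp_id, declared_length, sequence)
if problem is not None:
    print("ERROR", problem)                 # only failures are reported
fasta_text = format_fasta(sp_id, accession, sequence)
print(fasta_text, end="")
```

To use it on a real entry, pass an open file to parse_entry instead of the io.StringIO object, and write fasta_text to sprot.fsa with open("sprot.fsa", "w"). Staying silent on success and reporting only failures matches the reasoning in exercise 4.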
Exercises for extra practice
- Count the number of RA (Author) lines in the sprot1-4.dat files. sprot2.dat contains 25 RA lines.
- Extract the author names from the RA lines in the sprot1-4.dat files. Display the names - only the names.
- Continuing the previous exercise: Now also extract the title (RT lines). Display the title first on one line, followed by the authors on the next line, then an empty line, then the next title and authors, and so forth until there are no more authors/titles.
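A rough sketch of how the reference lines could be grouped is shown below. It rests on several assumptions that should be verified in the sprot files: every reference starts with an RN line, author names on RA lines are separated by commas and end with a semicolon, and titles on RT lines are quoted and may continue over several lines. The function name and the sample lines are made up.

```python
def collect_references(lines):
    """Return the number of RA lines and a list of (title, authors) pairs."""
    references = []
    ra_lines = 0
    authors, title_parts = [], []           # state: data of the current reference

    def flush():
        # Store the current reference, if any, and reset the state.
        if authors or title_parts:
            title = " ".join(title_parts).rstrip(";").strip('"')
            references.append((title, list(authors)))
        authors.clear()
        title_parts.clear()

    for line in lines:
        code, text = line[:2], line[5:].strip()
        if code == "RN":                    # assumed start of a new reference
            flush()
        elif code == "RA":
            ra_lines += 1
            names = text.rstrip(";,").split(",")
            authors.extend(name.strip() for name in names if name.strip())
        elif code == "RT":
            title_parts.append(text)
    flush()                                 # do not forget the last reference
    return ra_lines, references


# Made-up reference lines, for illustration only.
sample_lines = [
    "RN   [1]\n",
    "RA   Author A.B., Author C.D.,\n",
    "RA   Author E.F.;\n",
    'RT   "A made-up title that spans\n',
    'RT   two lines.";\n',
    "RL   Hypothetical Journal 1:1-2(1990).\n",
    "RN   [2]\n",
    "RA   Author G.H.;\n",
    'RT   "Another made-up title.";\n',
]
ra_count, refs = collect_references(sample_lines)
print("RA lines:", ra_count)
for title, names in refs:
    print(title)
    print(", ".join(names))
    print()
```

Collecting authors and title parts until the next reference begins, and only then storing them, makes it possible to print the title before the authors regardless of the order in which the RA and RT lines appear in the file.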