Simple Pattern Matching

From 22101

Required course material for the lesson

Powerpoint: Pattern Matching
Video: Pattern matching in Python

Subjects covered

String pattern matching, stripping, replacement, translation.
Pattern matching is about two things: WHAT the pattern is and WHERE the pattern is.
It is important to AVOID matching the wrong place.
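
A minimal sketch of the four operations named above, using only built-in string methods. The sample line and the variable names are made up for illustration, and the spacing is only assumed to resemble a Genbank header line.

```python
# Hypothetical sample line; real Genbank files may differ in spacing.
line = "ACCESSION   AB000123\n"

# WHAT: the keyword "ACCESSION". WHERE: at the very start of the line.
# startswith() avoids matching the word somewhere else in the line.
if line.startswith("ACCESSION"):
    # Stripping: cut off the keyword, then remove surrounding whitespace.
    accession = line[len("ACCESSION"):].strip()
else:
    accession = None

# find() returns the position of the first match, or -1 if there is none.
position = line.find("AB")

# Replacement: swap every occurrence of one substring for another.
rna = "ATGTTC".replace("T", "U")

# Translation: map single characters to other characters in one pass.
table = str.maketrans("ATCG", "TAGC")
complement = "ATGTTC".translate(table)

print(accession, position, rna, complement)
```

Anchoring the match with startswith() rather than testing with `in` is one way to avoid matching the wrong place, since a keyword can also occur inside free text further down the file.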

Exercises to be handed in

The first 4 exercises all deal with the files data1-4.gb, which are Genbank entries of various genes. First, study the files and notice the structure of the data. In all exercises you will have to parse the files (read them and find the wanted data) using pattern matching. Each exercise adds to the previous ones, so the final program can do a lot. Remember: your program should be able to handle all the files, but only one at a time.

  1. Extract the accession number, the definition and the organism (and print it).
  2. Extract and print all MEDLINE article numbers which are mentioned in the entries.
  3. Extract and print the translated gene (the amino acid sequence). Look for the line starting with /translation=. Generalize: an amino acid sequence can be short, i.e. only one line in the feature table, or long, i.e. more than one line in the feature table. Use stateful parsing.
  4. Extract and print the DNA (whole base sequence in the end of the file). Use stateful parsing.
  5. Improve exercise 5.8 using all you have learned. The program shall now take a DNA FASTA file (asking interactively for it), and reverse and complement all entries in the file. There can be more than one entry; study dna7.fsa. Hint: Use substitution or translation for complementing the DNA. I will point out that reading and writing FASTA files with many entries is a regular occurrence in bioinformatics (and on the exam), so be sure to get it right. Many people mistakenly believe that they should use a form of stateful parsing with a flag for this. Doing so confuses the issue, so abstain from that.
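
The stateful parsing asked for in exercises 3 and 4 can be sketched roughly as below. This is an illustration on a made-up, in-memory fragment, not the expected solution: the sample lines, the flag name `in_translation` and the constant `START` are assumptions, and the sketch assumes that the sequence ends on the line carrying the closing quote.

```python
# Made-up fragment imitating a Genbank feature table; not real data.
lines = [
    '     CDS             1..48',
    '                     /product="demo protein"',
    '                     /translation="MKVLAAGIVG',
    '                     LLLAQPS"',
    'ORIGIN',
]

START = '/translation="'
protein = ""
in_translation = False      # the state: are we inside the sequence?

for line in lines:
    text = line.strip()
    if text.startswith(START):
        # Start line found: switch the state on and drop the keyword.
        in_translation = True
        text = text[len(START):]
    if in_translation:
        # Collect the sequence part of this line, without a closing quote.
        protein += text.rstrip('"')
        # Assumption: the closing quote marks the end of the sequence.
        if text.endswith('"'):
            in_translation = False

print(protein)
```

Because the end test runs on the same line as the start test, the same loop handles a sequence that fits on one line and one that spans several lines.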

Exercises for extra practice

  • Given that you have a string containing some DNA (ask for input or hardcode it in the program), how would you check that it actually only contains ATCG using the methods of today's lesson?
  • Given a string of comma-separated values, like "1,22,333,4,5", how would you find the positions of the commas in the string?
  • In the files data1-4.gb find all titles and authors using stateful parsing. Put every author/title record (i.e. all authors/title in one string) in one list. Display all neatly with title on one line, authors on the next and a separator line.
  • Continue with exercise 4: Your program already extracts the DNA. Make it also extract the positions of the coding sequence. Display them. You can recognize it by the line that starts with "     CDS     join("
  • Continue with exercise 4: Now that you have the DNA and the positions in the DNA for the coding sequence, extract the coding DNA directly from the DNA, such that you can display the actual gene. Verify that you have done so correctly by thinking about the properties of a coding DNA sequence.
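
One possible approach to the first two practice questions, using only replace() and find(). The hardcoded strings and the variable names are illustrative, and other solutions are just as valid.

```python
dna = "ATGCCGTA"            # hardcoded example input

# Remove every allowed letter; anything left over is not A, T, C or G.
leftover = dna
for base in "ATCG":
    leftover = leftover.replace(base, "")
# An empty input is treated as invalid here (a choice made for this sketch).
is_dna = dna != "" and leftover == ""

values = "1,22,333,4,5"
comma_positions = []
# find() takes an optional start position, so the search can resume
# just after the previous hit until find() returns -1.
pos = values.find(",")
while pos != -1:
    comma_positions.append(pos)
    pos = values.find(",", pos + 1)

print(is_dna, comma_positions)
```

Restarting find() one position after the previous match is what keeps the loop from reporting the same comma forever.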