Regular Expressions
Required course material for the lesson
Powerpoint: Regular expressions in Python
Video: Regular Expressions Monday
Video: An (unfortunately) true story Monday
Resource: Example code - Regex
Video: Live Coding
PDF: Regular Expressions Cheat Sheet
WWW: Web page where you can test your regular expressions
Subjects covered
- Regular expressions, duh.
- Patterns, how to design and use them.
Exercises to be handed in
You might recognize some of these exercises. You must ONLY use regex for your pattern recognition and extraction of single data points (like an accession number).
If you don't know what stateful parsing is, look here
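As a rough illustration of the idea: in stateful parsing the program reads the file line by line and keeps a flag (the "state") that remembers whether it is currently inside the section of interest. The sketch below is only an example under assumptions; the `BEGIN` and `END` markers, the function name `extract_block` and the sample lines are all made up for illustration and do not come from any of the course files.

```python
import re

def extract_block(lines):
    """Collect the lines found between a BEGIN line and an END line.

    BEGIN and END are hypothetical markers, chosen only for this sketch.
    """
    start = re.compile(r"^BEGIN\b")
    end = re.compile(r"^END\b")
    in_block = False      # the state: are we inside the block right now?
    collected = []
    for line in lines:
        if not in_block:
            # Outside the block: only look for the start marker.
            if start.search(line):
                in_block = True
        elif end.search(line):
            # Inside the block and the end marker is reached: stop reading.
            break
        else:
            # Inside the block: keep the line without its newline.
            collected.append(line.rstrip("\n"))
    return collected

# Made-up input; a real program would iterate over an open file instead.
sample = ["header\n", "BEGIN data\n", "first\n", "second\n", "END\n", "trailer\n"]
print(extract_block(sample))   # ['first', 'second']
```

The same pattern should carry over to real files by swapping in regexes that recognize where the wanted section starts and ends.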
- Make a program that accepts a string as input from the keyboard. Use regular expressions (RE) to determine if the input is a number. The goal is to do this with a SINGLE regex.
 These should all be considered as numbers: "4" "-7" "0.656" "-67.35555"
 These are not numbers: "5." "56F" ".32" "-.04" "1+1"
 Note: The program is very simple, but this is likely the most difficult regular expression you will have to make in this set of exercises. Consider doing the following exercises first, just to get some experience.
- Make a program that can read and verify a fasta file. Test with dna7.fsa and dnanoise.fsa. Verification here means that the program prints "DNA fasta" or "Protein fasta" if the file is successfully verified as either DNA or protein sequence, and "Not fasta" if verification fails. You can find a description of fasta format in Biological knowledge needed in the course. You are expected to know which symbols are used for DNA and protein sequence, or to be able to look them up. Hint: If you have made a program before (previous course) that reads a fasta file, this and the following exercise are not too hard; otherwise you can consider doing them last.
- Change exercise 2 in the following way: Make the program discard entries that cannot conform to DNA or protein sequence, and write the acceptable entries to the output file fastaout.fsa, in such a way that the normal 60 chars per line is followed with no spaces in between. The program must inform the user how many entries were kept and how many were discarded. Test on dnanoise.fsa, which contains 3 entries that should be discarded - this is a strong hint.
- The last exercises all have to do with the files data1-4.gb, which are various Genbank entries of genes. First you should study the files and notice the structure of the data. In all exercises you will have to parse the files (read them and find the wanted data) using REs, which are very well suited for that purpose. This is a build-up process: every exercise is added to the previous ones, so the final program can do a lot. Your program should be able to handle all files (so test them), but just one at a time.
- Extract the accession number, the definition and the organism (and print it).
- Extract and print all MEDLINE article numbers which are mentioned in the entries.
- Extract and print the translated gene (the amino acid sequence). Look for the line starting with /translation=. Generalize: an amino acid sequence can be short, i.e. only one line in the feature table, or long, i.e. more than one line in the feature table. Use stateful parsing.
- Extract and print the DNA (whole base sequence in the end of the file). Use stateful parsing.
- Extract and print ONLY the coding DNA. That is described in FEATURES - CDS (Coding DNA Sequence). As an example, the line in data1.gb says 'join(2424..2610,3397..3542)' and means that the coding sequence is bases 2424-2610 followed by bases 3397-3542. The bases in between are an intron and not a part of the coding DNA. Remember to generalize; there can be more (or fewer) than two exons, and the 'join' line can continue on the next line. Use stateful parsing.
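
To make the 'join' notation concrete, here is a minimal sketch of how such a location string could be turned into coding DNA once it has been collected into a single string. It is an illustration under assumptions, not a full solution: the function name `coding_dna` and the toy sequence are made up, it assumes positions are 1-based and inclusive (as in the 2424..2610 example above), and it does not handle a 'join' that is still split over several lines or anything other than plain start..end ranges.

```python
import re

def coding_dna(location, dna):
    """Return the bases named by a location string such as 'join(1..3,7..9)'.

    Hypothetical helper for illustration; positions are taken to be
    1-based and inclusive.
    """
    # Every exon appears as start..end; findall returns (start, end) pairs.
    exons = re.findall(r"(\d+)\.\.(\d+)", location)
    # Convert to Python's 0-based, end-exclusive slices and glue the exons.
    return "".join(dna[int(start) - 1:int(end)] for start, end in exons)

# Toy sequence, not taken from any of the data files.
toy_dna = "aaacccgggttt"
print(coding_dna("join(1..3,7..9)", toy_dna))   # aaaggg
```

Because `re.findall` picks up every start..end pair, the same call works for one exon or for many.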