Regular expressions
Required course material for the lesson
Powerpoint: Regular expressions in Python
Video: An (unfortunately) true story
Resource: Example code - Regex
Video: Live Coding
PDF: Regular Expressions Cheat Sheet
WWW: Web page where you can test your regular expressions
Subjects covered
- Regular expressions, duh.
- Patterns, how to design and use them.
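As a small illustration of designing and using a pattern, the sketch below uses Python's standard `re` module. The identifier pattern and the sample strings are invented for this example and are not taken from the course material.

```python
import re

# Hypothetical pattern (not from the course): 2-4 capital letters
# followed by one or more digits, e.g. "AB123".
ID_PATTERN = re.compile(r"([A-Z]{2,4})(\d+)")

def find_id(text):
    """Return (letters, digits) for the first identifier in text, or None."""
    match = ID_PATTERN.search(text)  # search() looks anywhere in the string
    if match is None:
        return None
    return match.group(1), match.group(2)

def is_id(text):
    """Return True only if the entire string is an identifier."""
    return ID_PATTERN.fullmatch(text) is not None  # fullmatch() must consume everything

print(find_id("entry AB123 found"))
print(is_id("AB123"))
print(is_id("AB123 "))
```

The difference between `search()` and `fullmatch()` matters whenever a string must be validated as a whole rather than merely contain a match.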
Exercises to be handed in
- Make a program that accepts a string as input from the keyboard. Use regular expressions (RE) to determine if the input is a number. The goal is to do this with a SINGLE regex.
These should all be considered as numbers: "4" "-7" "0.656" "-67.35555"
These are not numbers: "5." "56F" ".32" "-.04" "1+1"
Note: The program is very simple, but this is likely the most difficult regular expression you will have to write in this set of exercises. Consider doing the following exercises first, just to gain some experience.
- Make a program that can read and verify a fasta file. Use your previous function fastaread(). Test the program with dna7.fsa and dnanoise.fsa. Verification here means that the program prints "DNA fasta" or "Protein fasta" if the file is successfully verified as DNA or protein sequence, and "Not fasta" if verification fails. You can find a description of the fasta format in Biological knowledge needed in the course. You are expected to know which symbols are used for DNA and protein sequences, or to be able to look them up.
- Building on your experience with the previous exercise, make a program that reads a fasta file, discards entries that cannot conform to DNA or protein sequence, and writes the acceptable entries (using your fastawrite() function) to the output file fastaout.fsa, following the normal 60 characters per line with no spaces in between. The program must inform the user how many entries were kept and how many were discarded. Hint: Test on dnanoise.fsa, which contains 3 entries that should be discarded.
- All HIV envelope proteins from various HIV strains in SwissProt have been identified and collected in the file HIVenvelope.txt. Using regular expressions, you must extract the ID and the protein sequence from each entry and save them in a fasta file named HIVenv.fsa. This job is not new to you, and you can use your fastawrite() function if you want to.
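For the first exercise, it can help to check a candidate regex against all the listed examples at once. The following is a minimal sketch of such a check, not a solution: `check_pattern` is a hypothetical helper name, and the pattern passed to it at the end is a deliberately incomplete placeholder that only handles integers.

```python
import re

# Example strings taken from the first exercise.
NUMBERS = ["4", "-7", "0.656", "-67.35555"]
NOT_NUMBERS = ["5.", "56F", ".32", "-.04", "1+1"]

def check_pattern(pattern):
    """Return the example strings that the pattern classifies wrongly."""
    regex = re.compile(pattern)
    wrong = []
    for text in NUMBERS:
        if regex.fullmatch(text) is None:       # should match, but does not
            wrong.append(text)
    for text in NOT_NUMBERS:
        if regex.fullmatch(text) is not None:   # should not match, but does
            wrong.append(text)
    return wrong

# Placeholder pattern (an assumption for illustration): integers only.
print(check_pattern(r"-?\d+"))
```

An empty list means the pattern agrees with every example. The placeholder above still reports the two decimal numbers, so the real work of the exercise remains.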