Regular expressions
Revision as of 23:41, 3 September 2025
Previous: Dict techniques | Next: Python object model
Required course material for the lesson
Powerpoint: Regular expressions in Python
Video: An (unfortunately) true story
Resource: Example code - Regex
Video: Live Coding
PDF: Regular Expressions Cheat Sheet
WWW: Web page where you can test your regular expressions
Subjects covered
- Regular expressions, duh.
- Patterns, how to design and use them.
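As a small illustration of designing and using a pattern, the sketch below pulls the ID and the description out of a fasta-style header line. The pattern, the function name parse_header() and the example header are assumptions made up for this illustration; they are not taken from the course material.

```python
import re

# Illustrative pattern (an assumption, not course code):
# '>' followed by an ID without whitespace, then an optional description.
header_pattern = re.compile(r"^>(\S+)\s*(.*)$")


def parse_header(line):
    """Return (identifier, description) for a header line, or None if it does not match."""
    match = header_pattern.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_header(">seq1 Homo sapiens gene"))
print(parse_header("ATCGATCG"))
```

The parentheses in the pattern define groups, so the parts of interest can be picked out with group() once the match has succeeded. Checking for None first avoids an error on lines that do not match.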
Exercises to be handed in
Exercises 4 to 6 have a strong taste of something I would do at an exam. They are also an interesting beginning of the making of an HIV vaccine. The data is real and the methods are real.
- Make a program that accepts a string as input from the keyboard. Use regular expressions (RE) to determine if the input is a number. The goal is to do this with a SINGLE regex.
These should all be considered as numbers: "4" "-7" "0.656" "-67.35555"
These are not numbers: "5." "56F" ".32" "-.04" "1+1"
Note: The program is very simple, but it is likely the most difficult regular expression you will have to make in this set of exercises. Perhaps you should do the following exercises before attempting this one, just to get some experience first.
- Make a program that can read and verify a fasta file. Use your previous function fastaread(). Test the program with dna7.fsa and dnanoise.fsa. Verification here means that the program prints "DNA fasta" or "Protein fasta" if the file is successfully verified as either DNA or protein sequence, and "Not fasta" if the verification fails. You can find a description of the fasta format in Biological knowledge needed in the course. You are expected to know which symbols are used for DNA and protein sequence, or to be able to look them up.
- Building on your experience with the previous exercise, make a program that reads a fasta file, discards entries that cannot conform to DNA or protein sequence, and rewrites (using your fastawrite() function) the acceptable entries to the output file fastaout.fsa, in such a way that the normal 60 chars per line is followed with no spaces in between. The program must inform the user how many entries were kept and how many were discarded. Hint: Test on dnanoise.fsa, which contains 3 entries that should be discarded.
- All HIV envelope proteins from various HIV strains in SwissProt have been identified and collected in the file HIVenvelope.txt. Using regular expressions you must extract the ID and the protein sequence from each entry and save them in a fasta file named HIVenv.fsa. This job is not new to you and you can use your fastawrite() function if you want to.
- Continuing the investigation into HIV. Read the HIVenv.fsa fasta file and create a single set consisting of all possible epitopes for all sequences. Save the epitopes in the file HIVepitopes.txt, one epitope per line. An epitope is simply a k-mer 9 residues long, which can possibly elicit an immune system response. So save all unique 9-mers in the sequences in the file. To generate a file that you can verify against my file, you must sort the epitopes alphabetically before saving them.
- You must prepare the epitopes in HIVepitopes.txt for machine learning. A common ML problem is when you have many data points (here epitopes) that look like each other. This introduces an unwanted bias in the ML predictions. Your job is to eliminate epitopes that are too similar using the Hobohm-1 algorithm. Hobohm-1 works like this: Look at the first epitope in the list. Compare it sequentially with the rest of the epitopes. If an epitope is too similar to the first, throw it away. Now no remaining epitope looks like the first. Proceed to the second epitope and repeat the comparing and possible throwing away of the subsequent epitopes. Proceed to the third epitope in the list. Repeat this pattern until you have reached the end of the list. Now all epitopes left are dissimilar to each other.
How do you determine if two epitopes are too similar? Easy - you earlier learned about the Hamming distance. Just compute the Hamming distance between two epitopes and if the distance is 3 or less, they are too similar. Save the "surviving" epitopes in the file HIVepitopesML.txt.
Hint: Think about the Hobohm-1 algorithm before you implement it. You can easily run into problems.
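The Hobohm-1 description above can be sketched in Python as follows. This is only one possible way to structure it, not the course's reference solution. The function names hamming() and hobohm1() and the parameter max_similar_distance are made up for this illustration. Instead of throwing epitopes away from the list while walking through it, the sketch builds a new list of kept epitopes, which should give the same survivors, since an epitope survives exactly when no earlier survivor is too similar to it.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def hobohm1(epitopes, max_similar_distance=3):
    """Keep an epitope only if its distance to every epitope kept so far
    is greater than max_similar_distance (assumed here to be 3, as in the text)."""
    kept = []
    for candidate in epitopes:
        if all(hamming(candidate, k) > max_similar_distance for k in kept):
            kept.append(candidate)
    return kept


# Tiny made-up example, not real HIV data.
example = ["AAAAAAAAA", "AAAAAAAAC", "CCCCCCCCC", "CCCCCCAAA"]
print(hobohm1(example))
```

Building a separate list sidesteps one of the problems the hint alludes to: removing items from a list while looping over that same list can make the loop skip elements.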
Exercises for extra practice
protein alignment motif