Simple pattern matching: Difference between revisions

From 22116
Jump to navigation Jump to search
mNo edit summary
Line 27: Line 27:


== Exercises for extra practice ==
== Exercises for extra practice ==
You will program a [https://en.wikipedia.org/wiki/Reverse_Polish_notation reverse polish notation] calculator. This is somewhat complicated, so I will split up the job in a number of smaller exercises, where the one exercise builds on the previous. Shortly, reverse polish notation is a mathematical notation in which operators follow their operands, i.e. adding two numbers is written by "3 4 +" instead of "3 + 4". We will utilize a [https://en.wikipedia.org/wiki/Stack_(abstract_data_type) stack], which is simply a list, where elements are added to and removed from the bottom.
* Let's start with making the overall structure; a loop that asks for input, which is one item/command at a time. Inside the loop, there can be a number of operations depending on the input. If you enter a number, it should be simply appended to your list (the stack, which is the word I will use in the rest of the exercises here). If you enter commands '''STOP''', '''EXIT''' or '''QUIT''', the loop should end and the program stop. If you enter something that the program does not understand it should write "Illegal command", but otherwise do nothing.
* Add the following commands: '''PEEK''', which displays the last number in the stack, and '''SHOW''' which will display the entire stack.
* Add the standard math operations, '''+''', '''-''', '''*''' and '''/''': These remove 2 numbers from the stack (the last number in stack is the first operand and the second last number is the second operand), perform the operation and appends the result back to the stack. You notice operand order matters for - and /. At this point we have a simple but functional calculator.
* Add single number operations '''LOG''' and '''SQRT''', which replaces the last number in the stack with the log/square root of the number.
* Solidify your program by ensuring any operations first check if the required number of operands are on the stack before trying to perform the operation. Display "Not enough operands on the stack" and otherwise do nothing if the check fails. Hint: Maybe lists of zero, one and two operands commands could be useful.
* Add the commands: '''DROP''' which simply removes the last number on the stack, '''=''', which removes the last number on the stack and displays it, '''SWITCH''', which switches the last two numbers on the stack, and '''DUP''' which duplicates the last number on the stack, i.e stack consists of 1, after DUP it consists of 1 1.
* Add the commands: '''ABS''', '''NEG''' which replaces the last number on the stack which the absolute/negative value of the number.
* At this point you can probably work out some mathematical operations to add that require one or 2 operands by yourself. I can think of the '''SIN''', '''COS''' & '''TAN''' operations, the '''INT''' taking the integer part of a number, '''EXP'''onential and '''SQUARE'''. Modulo operation with '''%'''. '''FAC'''torial. You could add constants like '''PI''' and '''E''' that simply puts an appropriate value on the stack.
* Add stack control with '''CLEAR''' to empty the stack and extend the '''SWITCH''' to switch the numbers as two given position like "SWITCH 2 4" if given positions.

Revision as of 16:01, 29 August 2025

Previous: Functions Next: Set techniques

Required course material for the lesson

Powerpoint: Simple pattern matching

Subjects covered

String pattern matching, stripping, replacement, translation.
Pattern matching is about 2 things: WHAT is the pattern and WHERE is the pattern.
It is important to AVOID matching the wrong place.

Exercises to be handed in

The first 4 exercises will all have to do with the files data1-4.gb, which are various Genbank entries of genes. First you should study the files, notice the structure of the data. In all exercises you will have to parse (read and find the wanted data) the files using pattern matching. Every exercise adds to the previous ones, so the final program can do a lot. Remember. Your program should be able to handle all files, but just one at a time.

  1. Extract the accession number, the definition and the organism (and print it).
  2. Extract and print all MEDLINE article numbers which are mentioned in the entries.
  3. Extract and print the translated gene (the amino acid sequence). Look for the line starting with /translation=. Generalize; An amino acid sequence can be short, i.e. only one line in the feature table, or long, i.e. more than one line in the feature table. Use stateful parsing.
  4. Extract and print the DNA (whole base sequence in the end of the file). Use stateful parsing.
  5. Improve exercise 5.8 using all you have learned. The program shall now take a DNA FASTA file (asking interactively for it), and reverse and complement all entries in the file. There can be more than one entry, study dna7.fsa. Hint: Use substitution or translation for complementing the DNA. I will point out that the reading and writing of fasta files with many entries is a regular occurrence in bioinformatics (and exam), so be sure to get it right. Many people mistakenly believe that they should use a form of stateful parsing with a flag for this - doing so confuses the issue, so abstain from that.
  • Given that you have a string containing some DNA (ask for input or hardcode it in the program), how would you check that it actually only contains ATCG using the methods of today's lesson?
  • Given that you have a string of comma separated values, like "1,22,333,4,5". How would you find the positions of the commas in the string?
  • In the files data1-4.gb find all titles and authors using stateful parsing. Put every author/title record (i.e. all authors/title in one string) in one list. Display all neatly with title on one line, authors on the next and a separator line.
  • Continue with exercise 4: Your program already extracts the DNA. Make it also extract the positions of the coding sequence. Display them. You can recognize it by the line that starts with "     CDS     join("
  • Continue with exercise 4: Now that you have the DNA and the positions in the DNA for the coding sequence, extract the coding DNA directly from the DNA, such that you can display the actual gene. Verify that you have done so correctly by thinking about the properties of coding DNA sequence.

Exercises for extra practice

You will program a reverse polish notation calculator. This is somewhat complicated, so I will split up the job in a number of smaller exercises, where the one exercise builds on the previous. Shortly, reverse polish notation is a mathematical notation in which operators follow their operands, i.e. adding two numbers is written by "3 4 +" instead of "3 + 4". We will utilize a stack, which is simply a list, where elements are added to and removed from the bottom.

  • Let's start with making the overall structure; a loop that asks for input, which is one item/command at a time. Inside the loop, there can be a number of operations depending on the input. If you enter a number, it should be simply appended to your list (the stack, which is the word I will use in the rest of the exercises here). If you enter commands STOP, EXIT or QUIT, the loop should end and the program stop. If you enter something that the program does not understand it should write "Illegal command", but otherwise do nothing.
  • Add the following commands: PEEK, which displays the last number in the stack, and SHOW which will display the entire stack.
  • Add the standard math operations, +, -, * and /: These remove 2 numbers from the stack (the last number in stack is the first operand and the second last number is the second operand), perform the operation and appends the result back to the stack. You notice operand order matters for - and /. At this point we have a simple but functional calculator.
  • Add single number operations LOG and SQRT, which replaces the last number in the stack with the log/square root of the number.
  • Solidify your program by ensuring any operations first check if the required number of operands are on the stack before trying to perform the operation. Display "Not enough operands on the stack" and otherwise do nothing if the check fails. Hint: Maybe lists of zero, one and two operands commands could be useful.
  • Add the commands: DROP which simply removes the last number on the stack, =, which removes the last number on the stack and displays it, SWITCH, which switches the last two numbers on the stack, and DUP which duplicates the last number on the stack, i.e stack consists of 1, after DUP it consists of 1 1.
  • Add the commands: ABS, NEG which replaces the last number on the stack which the absolute/negative value of the number.
  • At this point you can probably work out some mathematical operations to add that require one or 2 operands by yourself. I can think of the SIN, COS & TAN operations, the INT taking the integer part of a number, EXPonential and SQUARE. Modulo operation with %. FACtorial. You could add constants like PI and E that simply puts an appropriate value on the stack.
  • Add stack control with CLEAR to empty the stack and extend the SWITCH to switch the numbers as two given position like "SWITCH 2 4" if given positions.