String manipulation


Required course material for the lesson

Powerpoint: String manipulation
Resource: Biological knowledge needed in the course
Resource: Example code - Changing tabs to spaces
Resource: Clean Code. Every time you read it, you will take something from it.

Subjects covered

Strings, string-like data types, sequences.
Length - len, membership - in, slicing for substrings.
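
As a small illustration of the three operations listed above, here is a minimal sketch. The sequence and the helper name describe are made up for the example and do not come from the course files.

```python
# Hypothetical example sequence; not taken from the course files.
dna = "ATGCGTTAGC"

def describe(seq):
    """Return the length, whether 'TAG' occurs in seq, and the first three chars."""
    return len(seq), "TAG" in seq, seq[0:3]

length, has_tag, first_three = describe(dna)
print(length)       # 10
print(has_tag)      # True
print(first_three)  # ATG
print(dna[-3:])     # AGC, a negative start index counts from the end
```

A slice never raises an error when it runs past the end of the string; it simply returns what is there, which is convenient when walking along a line in fixed-size steps.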

Exercises to be handed in

  1. Write a program that counts the number of negative numbers in the file ex1.dat. Display the result, which is 3272. Hint: Think about what defines a negative number.
  2. Write a program that converts temperatures from Fahrenheit to Celsius or vice versa, given input like "36F" or "15C" ( F = (C * 9/5) + 32 ).
  3. Enter the world of spies. In the file secret.txt is a hidden message. You can find it by replacing every char that has an even Unicode number with a space (' '), and every char that has an odd number with a hash sign ('#'). Make a program that asks for a file name, reads the file and reveals the message.
  4. The secrecy just got worse. The lines in the secret2.txt file have also been split at a random place, and the two parts have been switched with each other. Fortunately there is a char '|' that shows where a line was split. Make a program that asks for an input filename and an output filename, then reads the input file, identifies where the split is, puts each line back in order again, and saves the corrected lines in the output file. Now you can use your program from the previous exercise to reveal the message. Hint: A line that looks like DEFG|ABC should look like ABCDEFG.
  5. This and the rest of the exercises aim to make the reverse complement string (called the "complement strand") of DNA. They build up in complexity, but every exercise must stand alone, i.e. NOT depend on what the previous exercises achieved, but start over every time in order to be a coherent product.
    There is some human DNA in the dna.dat file. Read the file and put all the DNA in one variable. Now complement the DNA and store it in another variable. Display it and ensure that it works. HINT: Complementing means changing all A's to T's, T's to A's, C's to G's and G's to C's.
  6. Now reverse the DNA after complementing it. Reverse means last letter (base) should be the first, next to last should be the second, and so forth. Display.
  7. Now write the DNA to the file revdna.dat. Make it look nice, just like dna.dat, i.e. 60 letters per line. This does NOT mean that you should insert newlines in the variable containing your complement strand (that would contaminate clean data you might need later in the program). It just means that the DNA in the output file must have 60 chars per line, just as in the input file.
  8. In the file dna.fsa is the same human DNA in FASTA format. This format is VERY often used in bioinformatics. Look at it using less and get used to the format. Observe the first line, which starts with a > and identifies the sequence. The name (AB000410 in this case) MUST uniquely identify a sequence in the file. This is a DNA (actually mRNA) sequence taken from the GenBank database. Now make a program that reverse complements the sequence and writes it to the file revdna.fsa, just like you did in the previous exercises. This time you have to keep the first identifying line, so the sequence can be identified. You must add 'ComplementStrand' at the end of that line, though, so you later know that it is the complement strand.
    Summary: Keep the first line and reverse complement the sequence. Notice that this is a purple exercise; you have to make the pseudocode first and hand it in as part of the exercise.
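
The core string operations behind exercises 5 to 7 can be sketched as follows. This is one possible approach under stated assumptions, not the required solution: it works on an in-memory string instead of dna.dat, assumes upper-case bases only, and the names COMPLEMENT, reverse_complement and wrap are invented for the illustration. A loop over the characters would work just as well.

```python
# Translation table for the four bases (upper case only; an assumption).
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def wrap(seq, width=60):
    """Return the sequence as a list of lines of at most `width` chars.

    The sequence itself is left untouched; no newlines are stored in it.
    """
    return [seq[i:i + width] for i in range(0, len(seq), width)]

revcomp = reverse_complement("AACGTG")
print(revcomp)                      # CACGTT
for line in wrap(revcomp, width=4): # small width just for the demonstration
    print(line)                     # CACG, then TT
```

Keeping the wrapping in a separate step means the variable holding the complement strand stays clean, and the line breaks only appear when the data is written out.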

Exercises for extra practice

Slicing: the act of taking out a substring (part of a string) from a string. Learning the technique of "walking the line".
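
A minimal sketch of "walking the line" with slices is given here. The function name walk and its parameters are hypothetical; the idea is only that a start position advances by a fixed step and a slice of fixed size is taken at each stop.

```python
def walk(line, size, step, keep_leftover=True):
    """Return substrings of `size` chars, starting every `step` positions.

    A final piece shorter than `size` is kept only if keep_leftover is True.
    """
    pieces = []
    for start in range(0, len(line), step):
        piece = line[start:start + size]
        if len(piece) == size or keep_leftover:
            pieces.append(piece)
    return pieces

print(walk("ABCDE", size=2, step=2))                       # ['AB', 'CD', 'E']
print(walk("ABCDE", size=2, step=2, keep_leftover=False))  # ['AB', 'CD']
print(walk("ABC", size=2, step=1, keep_leftover=False))    # ['AB', 'BC']
```

In a real program the line would come from input() and each piece would be printed on its own line; the literal strings are used here only to keep the sketch self-contained.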

  • Input a line from the keyboard. Count the number of stars * on the line. Display. Example: DG*GDG*GSD*GG
    Star count: 3
  • Input a line from the keyboard. Now print the characters on the line vertically, i.e. one char per line downwards. Example: ABC
    A
    B
    C
  • Input a line from the keyboard. Now print 2 characters on the line vertically, i.e. two chars per line downwards. Advance 1 char at a time. Example: ABC
    AB
    BC
  • Input a line from the keyboard. Now print 2 characters on the line vertically, i.e. two chars per line downwards. Advance 2 chars at a time. If there is a leftover char at the end of the string, print it. Example: ABCDE
    AB
    CD
    E
  • Input a line from the keyboard. Now print 2 characters on the line vertically, i.e. two chars per line downwards. Advance 2 chars at a time. If there is a leftover char at the end of the string, do NOT print it. Example: ABCDE
    AB
    CD
  • Input a DNA line from the keyboard. Now print 3 characters (a codon) on the line vertically, i.e. three chars per line downwards. Advance 3 chars at a time. If there is leftover DNA at the end of the string that cannot constitute a codon, do NOT print it. Example: TACCATCGATCAG
    TAC
    CAT
    CGA
    TCA
  • Input a sentence (a line) from the keyboard. Write every word on its own line. Example: The answer is 42.
    The
    answer
    is
    42.
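
For the star-counting and word-splitting exercises above, Python's built-in string methods offer a shortcut. The sketch below is one possible approach, with invented function names; doing the same with an explicit loop over the characters is equally valid and is good practice for walking the line.

```python
def count_stars(line):
    """Count how many '*' characters occur in the line."""
    return line.count("*")

def split_words(sentence):
    """Split a sentence into words on any run of whitespace."""
    return sentence.split()

print("Star count:", count_stars("DG*GDG*GSD*GG"))  # Star count: 3
for word in split_words("The answer is 42."):
    print(word)  # one word per line
```

Note that split() with no argument treats punctuation as part of the word, so "42." keeps its full stop, matching the example output above.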