Python Input-Output


Required course material for the lesson

Powerpoint: Input, output, libraries and strings
Video: File reading and writing
Video: Python libraries
Video: Strings and substrings
Resource: Biological knowledge needed in the course
Resource: Example code - Input Output
Resource: Clean Code. Every time you read it, you will take something from it.

Subjects covered

  • Using files
      • open, which opens a file (makes it ready) for reading or writing.
      • close, which ends the reading/writing.
      • readline, which reads a line from a file handle.
      • write, which writes a line to a file (handle).
  • String manipulation
      • len, which tells how long a string is.
      • Slicing.
  • Standard library functions
      • sys.exit, which terminates the program.
      • os.system, which submits jobs to the operating system.
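
The functions listed above can be tried together in a small sketch like the one below. It is only an illustration under assumptions: the helper names write_lines and read_lines, the file name io_demo.txt and the echo command are made up for the example and are not part of the course material.

```python
import os
import sys
import tempfile


def write_lines(path, lines):
    """Write each string in lines to the file at path, one per line."""
    outfile = open(path, "w")          # "w" creates the file or overwrites it
    for line in lines:
        outfile.write(line + "\n")     # write does not add the newline itself
    outfile.close()


def read_lines(path):
    """Read the file at path with readline and return its lines without newlines."""
    if not os.path.isfile(path):
        print("No such file:", path)
        sys.exit(1)                    # terminate the program with an error status
    infile = open(path, "r")
    lines = []
    line = infile.readline()
    while line != "":                  # readline returns "" at the end of the file
        lines.append(line.rstrip("\n"))
        line = infile.readline()
    infile.close()
    return lines


# Hypothetical demo file placed in the system's temporary directory.
demo_path = os.path.join(tempfile.gettempdir(), "io_demo.txt")
write_lines(demo_path, ["ATGC", "GGCCTA"])
lines = read_lines(demo_path)
first = lines[0]
print(len(first))                      # 4
print(first[1:3])                      # TG

# os.system hands a command to the operating system and returns its exit status.
status = os.system("echo done")
os.remove(demo_path)
```

Closing the file after writing matters: until close is called, some of the written text may still sit in a buffer and not yet be in the file.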

Exercises to be handed in

  1. Write a program that counts the number of negative numbers in the file ex1.dat. Display the result, which is 3272. Hint: Think about what defines a negative number.
  2. Write a program that converts temperatures from Fahrenheit to Celsius or vice versa, given input like "36F" or "15C" ( F = (C * 9/5) + 32 ).
  3. Read the file orphans.sp, find all accession numbers (and only the accession numbers), and save them in another file of your choosing. Hint: an accession number might look like AB000114.CDS.1, AB000114 or AB000114.CDS.3. CDS means CoDing Sequence and is followed by a number. If the accession number contains the CDS part, consider .CDS.1 part of the accession number. Accession numbers differ in length for historical reasons. You can assume that the accession number comes straight after the >, which is first on the line. Notice that this is a purple exercise: you have to make the pseudocode first and hand it in as part of the exercise.
  4. Now you must analyze the AT/GC content of the DNA in the file dna.dat. You must count all A, T, C and G, and display the result: A: 333 T: 303 C: 454 G: 469.
  5. This and the rest of the exercises aim to make the reverse complement string (called "complement strand") of DNA. They build up in complexity, but every exercise must stand alone, i.e. NOT depend on what the previous exercises achieved but start over every time in order to be a coherent product.
    There is some human DNA in the dna.dat file. Read the file and put all the DNA in one variable. Now complement the DNA and store it in another variable. Display and ensure that it works. HINT: Complementing means changing all A's to T's, T's to A's, C's to G's and G's to C's.
  6. Now reverse the DNA after complementing it. Reverse means last letter (base) should be the first, next to last should be the second, and so forth. Display.
  7. Now write the DNA to the file revdna.dat. Make it look nice, just like dna.dat, i.e. 60 letters per line. This does NOT mean that you should insert newlines in the variable containing your complement strand (that would contaminate clean data you might need later in the program). It just means that the DNA in the output file must have 60 chars per line, just as in the input file.
  8. The file dna.fsa contains the same human DNA in FASTA format. This format is VERY often used in bioinformatics. Look at it using less and get used to the format. Observe the first line, which starts with a > and identifies the sequence. The name (AB000410 in this case) MUST uniquely identify a sequence in the file. This is a DNA (actually mRNA) sequence taken from the GenBank database. Now make a program that reverse complements the sequence and writes it into the file revdna.fsa just like you did in previous assignments. This time you have to keep the first identifying line, so the sequence can be identified. You must add 'ComplementStrand' at the end of that line, though, so you later know that it is the complement strand.
    Summary: Keep the first line and reverse complement the sequence.
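
The complementing, reversing and 60-letters-per-line steps described in exercises 5 to 8 can be sketched on a short string before any files are involved. This is one possible approach, not the required solution: the function names are made up for the illustration, str.maketrans with translate is just one way to swap the bases, and the sketch assumes the DNA contains only upper-case A, T, C and G.

```python
def complement(dna):
    """Return the complement strand: A<->T and C<->G, all swapped in one pass."""
    table = str.maketrans("ATCG", "TAGC")
    return dna.translate(table)


def reverse_complement(dna):
    """Complement the DNA, then reverse it with a slice that steps backwards."""
    return complement(dna)[::-1]


def wrap_lines(sequence, width=60):
    """Cut the sequence into pieces of at most width characters.

    The sequence itself is left untouched; only the returned list
    is meant for writing to the output file, one piece per line.
    """
    pieces = []
    for start in range(0, len(sequence), width):
        pieces.append(sequence[start:start + width])
    return pieces


print(complement("ATGCC"))            # TACGG
print(reverse_complement("ATGCC"))    # GGCAT
print(wrap_lines("ABCDEFG", 3))       # ['ABC', 'DEF', 'G']
```

Swapping all bases in one pass avoids a common trap: replacing every A with T first and then every T with A turns the original A's back into A's.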

Exercises for extra practice

Slicing is the act of taking a substring (part of a string) out of a string. These exercises teach the technique of "walking the line".
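
As a quick reminder of the slice notation itself, here is a minimal sketch using a DNA string as an arbitrary example. Positions count from 0, and the end position of a slice is not included.

```python
line = "TACCATCGATCAG"

print(len(line))      # 13
print(line[0])        # T           (first character)
print(line[0:3])      # TAC         (positions 0, 1 and 2)
print(line[3:])       # CATCGATCAG  (from position 3 to the end)
print(line[-3:])      # CAG         (the last three characters)
print(line[::-1])     # GACTAGCTACCAT (the whole string backwards)
```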

  • Input a line from the keyboard. Count the number of stars * on the line. Display. Example: DG*GDG*GSD*GG
    Star count: 3
  • Input a line from the keyboard. Now print the characters on the line vertically, i.e. one char per line downwards. Example: ABC
    A
    B
    C
  • Input a line from the keyboard. Now print the line vertically 2 characters at a time, i.e. two chars per line downwards. Advance 1 char at a time. Example: ABC
    AB
    BC
  • Input a line from the keyboard. Now print the line vertically 2 characters at a time, i.e. two chars per line downwards. Advance 2 chars at a time. If there is a leftover char at the end of the string, print it. Example: ABCDE
    AB
    CD
    E
  • Input a line from the keyboard. Now print the line vertically 2 characters at a time, i.e. two chars per line downwards. Advance 2 chars at a time. If there is a leftover char at the end of the string, do NOT print it. Example: ABCDE
    AB
    CD
  • Input a DNA line from the keyboard. Now print the line vertically 3 characters (a codon) at a time, i.e. three chars per line downwards. Advance 3 chars at a time. If there is leftover DNA at the end of the string that cannot form a codon, do NOT print it. Example: TACCATCGATCAG
    TAC
    CAT
    CGA
    TCA
  • Input a sentence (a line) from the keyboard. Write every word on its own line. Example: The answer is 42.
    The
    answer
    is
    42.
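
"Walking the line" in the exercises above comes down to moving a start position along the string and slicing a fixed-size piece at each stop. The sketch below shows one way to do that; the function name windows and its parameters are made up for the illustration, and the keyboard input (the built-in input function) is replaced by fixed strings so the example runs on its own.

```python
def windows(text, size, step, keep_leftover=True):
    """Return pieces of text that are size long, advancing step chars each time.

    A shorter piece at the end is kept only if keep_leftover is True.
    """
    pieces = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if len(piece) == size or keep_leftover:
            pieces.append(piece)
    return pieces


print(windows("ABC", 2, 1, keep_leftover=False))            # ['AB', 'BC']
print(windows("ABCDE", 2, 2))                               # ['AB', 'CD', 'E']
print(windows("ABCDE", 2, 2, keep_leftover=False))          # ['AB', 'CD']
print(windows("TACCATCGATCAG", 3, 3, keep_leftover=False))  # ['TAC', 'CAT', 'CGA', 'TCA']

# Counting a character and splitting a sentence on whitespace.
print("DG*GDG*GSD*GG".count("*"))                           # 3
print("The answer is 42.".split())                          # ['The', 'answer', 'is', '42.']
```

To print the pieces vertically, loop over the returned list and print each piece on its own line.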

Opening files.

  • Ask for 2 input file names (ex1.acc & ex1.dat are good) and an output file name. Copy (read and write) the input files into the output file, one after the other.
  • Ask for 2 input file names (ex1.acc & ex1.dat are good) and an output file name. Now copy/print the lines of the input files into the output file so that the first lines of the two input files are joined with a tab and become the first line of the output. Continue that way with all the lines in the input files. A mental picture of what is supposed to happen is this: Imagine the 2 input files to be two pieces of paper with lines. In the exercise before, the papers were put one after the other in the output. In this exercise the papers are put next to each other (sideways) in the output.
  • Read the file mixedlines.txt and count how many long lines there are in it. A long line has more than 30 chars.
  • In mixedlines.txt how many occurrences of "rna" can you find?
  • In mixedlines.txt how many lines contain "rna"?
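
The difference between counting occurrences and counting lines, and the side-by-side merge, can be sketched on small lists of lines before working with the real files. The function names and the sample lines below are made up for the illustration. Two assumptions go beyond the exercise text: the search is case sensitive, and when one input has fewer lines than the other the missing side is left empty.

```python
def count_occurrences(lines, word):
    """Count every occurrence of word, so one line can contribute several."""
    total = 0
    for line in lines:
        total += line.count(word)
    return total


def count_lines_containing(lines, word):
    """Count the lines that contain word at least once."""
    total = 0
    for line in lines:
        if word in line:
            total += 1
    return total


def merge_side_by_side(lines_a, lines_b):
    """Join line 1 of a with line 1 of b by a tab, then line 2, and so on."""
    merged = []
    longest = max(len(lines_a), len(lines_b))
    for position in range(longest):
        left = ""
        right = ""
        if position < len(lines_a):
            left = lines_a[position]
        if position < len(lines_b):
            right = lines_b[position]
        merged.append(left + "\t" + right)
    return merged


sample = ["rna and more rna", "dna only", "mrna"]
print(count_occurrences(sample, "rna"))        # 3
print(count_lines_containing(sample, "rna"))   # 2
print(merge_side_by_side(["a1", "a2"], ["b1", "b2", "b3"]))
# ['a1\tb1', 'a2\tb2', '\tb3']
```

Lines read from a file with readline end with a newline, so strip it (for example with rstrip) before joining two lines with a tab or measuring a line's length.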