Comprehension, generators, iteration

Previous: Functions, namespace, memory management Next: Beginning classes

Required course material for the lesson

Powerpoint: Comprehension, generators & iteration
Powerpoint: Randomness
Video: Comprehension
Video: Generators
Video: Iteration in detail, use of lambda function, libraries
Video: How to parse bio files with many entries
Resource: Example code - Comprehension
Resource: Example code - Misc
Video: Live Coding 2

Subjects covered

Comprehension, which is a way of manipulating/selecting data with a hidden loop.
Lambda, the small anonymous function.
Generators, which are like functions that remember where they left off between calls.
More theoretical iteration.
On the nature of randomness in the computer.
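
As a small illustration of the first three subjects, here is a minimal sketch. All names and numbers in it (numbers, genes, running_total) are made up for the example and do not come from the course files.

```python
numbers = [3, 8, 1, 6]

# Comprehension: the loop is hidden inside the brackets.
# Here it selects the even numbers and squares them.
even_squares = [n * n for n in numbers if n % 2 == 0]

# Lambda: a small anonymous function, used here as a sort key.
# Hypothetical tuples of (name, first average, second average).
genes = [("g1", 2.0, 1.5), ("g2", 0.2, 3.0), ("g3", 1.0, 1.1)]
by_difference = sorted(genes, key=lambda t: abs(t[1] - t[2]), reverse=True)

# Generator: execution pauses at each yield and resumes from there
# on the next request, so 'total' survives between values.
def running_total(values):
    total = 0
    for v in values:
        total += v
        yield total

totals = list(running_total(numbers))

print(even_squares)                      # [64, 36]
print([g[0] for g in by_difference])     # ['g2', 'g1', 'g3']
print(totals)                            # [3, 11, 12, 18]
```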

Exercises to be handed in

  1. Make a program that calculates the product of two matrices and prints it on the screen (that is, STDOUT; remember Unix). The matrices are in the files mat1.dat and mat2.dat. Numbers in the files are tab-separated. A matrix should be stored as a list of lists.
    Advice: The program should have a function that reads a matrix from a given file (to be used twice), a function that calculates the product, and a function that prints a matrix. This structure makes the program easy to adapt to other forms of matrix calculation. Here are two links to the definition of matrix multiplication.
    Math is Fun
    Math world
  2. The purpose of this exercise is to find the 10 genes that have the biggest difference in expression between cancer and control patients in the dna-array.dat file after a linear transformation of the numbers in the columns. You should (re)visit exercises 4 & 5 in the last lesson of course 22116. To avoid starting from the beginning, use the file dna-array-norm.dat created in exercise 5 as input; both files are supplied if you want to do extra work. The other tab-separated input file, lineartransform.dat, has an A (slope) and a B (intercept), one AB pair for each number column in the dna-array-norm.dat file. For each line in dna-array-norm.dat, you first linearly transform the numbers according to the A & B in lineartransform.dat: the first number uses the first AB pair, the second number uses the second AB pair, and so forth. If your number is X, then the transformed number is A*X+B. When the entire line is transformed, you calculate the average of the cancer patients and the average of the controls. From that, find the 10 genes with the biggest difference in expression. There are a number of ways, but a simple one is to create a list of tuples, each consisting of (gene name, cancer average, control average), and then sort the list according to the difference between the cancer and control averages. Using a lambda function when sorting springs to mind. Display the top 10 in the sorted list.
  3. Make a moving average generator: moving_avg(List_of_numbers, Window_size). The generator calculates the average of the numbers in a window moving across the list. Try it on the numbers in ex1.dat: first load the numbers column by column into a single list, that is, all the numbers in column 1, then all the numbers in column 2, and so forth.
  4. Make a trend discoverer generator: trend(List_of_numbers). It looks at a list of numbers in a moving-window way and emits 1 if the next number is higher than the previous one, and 0 otherwise. Any longer sequence of 0's or 1's in the generator output is a trend in the data. Check with ex1.dat (loaded the same way as in the previous exercise) or another file of your choosing.
  5. Changing the previous exercise: Make a find_trend(List_of_numbers, Minimum_trend_size) generator, which yields a tuple (Position_Start, Size, Direction) for each trend in List_of_numbers, telling where and how big the trend is. Direction is 0 or 1, as you want to know which direction the trend is going. Position_Start is the position in the (zero-based) list where the trend starts. Size is how long the trend of ascending/descending numbers is. This is surprisingly difficult. Test with a simple file of your own making to check your results.
  6. Make a generator combinations(), that takes a list of strings as input, e.g. combinations(["GAVIL", "ST", "NQ", "FWY", "D", "HKR"]), and generates all possible combinations. A combination is formed by choosing 1 letter from the first string, 1 letter from the second string, and so forth, in that order, until a letter from all strings is chosen. The input list can have any number of strings and the strings can have any length (greater than 0). There must be NO REPEATS - random is not an acceptable library to use. As is obvious, the example has 5*2*2*3*1*3 = 180 different combinations, the first being GSNFDH. Print them all on the screen. If your input is ['0123456789', '0123456789', '0123456789'], then you will print the numbers from 000 to 999. Hint: A list of counters, 1 per string, could be useful in iterating through the combinations.
    When can such a generator be useful? If you want to generate a list of antigens that need certain amino acids in certain positions.
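
The hint about a list of counters in exercise 6 can be sketched as an odometer: the rightmost counter advances first, and when it runs past the end of its string it resets to 0 and carries to the counter on its left. This is only one possible approach under that reading of the hint, not the required solution.

```python
def combinations(strings):
    """Yield every combination, taking one letter from each string in order."""
    counters = [0] * len(strings)        # one position counter per string
    while True:
        yield "".join(s[i] for s, i in zip(strings, counters))
        # Advance the counters like an odometer, rightmost first.
        pos = len(counters) - 1
        while pos >= 0:
            counters[pos] += 1
            if counters[pos] < len(strings[pos]):
                break                    # no carry needed
            counters[pos] = 0            # roll over and carry to the left
            pos -= 1
        if pos < 0:
            return                       # every counter rolled over: done

result = list(combinations(["GAVIL", "ST", "NQ", "FWY", "D", "HKR"]))
print(len(result), result[0], result[-1])   # 180 GSNFDH LTQYDR
```

Because each counter state is visited exactly once, no combination is repeated. The standard library's itertools.product produces the same sequence and can be used to check the output.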

Exercises for extra practice

  1. Study the file dna-array.dat a bit. This is real DNA array data taken from a number of persons, some controls and some suffering from colon cancer. If you look at the second line, there are many 0s and 1s. A '0' means that the values in that column are from a cancer patient and a '1' means the data are from a control (healthy person). The data are all log(intensity), i.e. the logarithm of the measured intensity of the relevant spot on the DNA chip. The data in this file will be used in coming exercises. The data/columns are tab-separated. The second item on each line is the accession number for that particular gene.
    Now make a main function that extracts data from one file and saves it in another, given the accession number, input file dna-array.dat and output file column.tab. Search in the file for the data concerning that accession number. If it does not find it (you gave a wrong accession number), complain and stop. Otherwise it shall display the data in two tab-separated columns. The first column shall be the data from the cancer patients, the second column the data from the controls. The numbers of sick and healthy people are not the same; be able to handle that.

  2. The numbers in the input file dna-array.dat should be normalized between 0 and 1 for each line with an accession number, i.e. normalization only for the individual line - not across the data set. Write the result out in the file dna-array-norm.dat, but NOT the control lines, i.e. lines where the annotation says 'control'. The resulting file will be similar to the original, but control lines are removed and the numbers are different. The problem can (and should) be solved one line at a time.
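
The per-line normalization in exercise 2 can be sketched as below. The exercise does not state the formula, so this assumes ordinary min-max scaling, (x - min) / (max - min). The function name normalize and the handling of a line where all numbers are equal are likewise assumptions made for the example.

```python
def normalize(values):
    """Scale one line of numbers into the range 0..1 (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a line with identical numbers maps to all zeros,
        # which avoids dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
```

Since the function only needs the numbers from a single line, it fits the one-line-at-a-time approach the exercise asks for.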