List manipulation


Required course material for the lesson

Powerpoint: List manipulation

Subjects covered

Lists and list methods.
List algorithms.

Exercises to be handed in

  1. Searching for accession numbers. The file ex5.acc contains 6461 unique GenBank accession numbers (taken from the HU6800 DNA array chip). Unfortunately, an inexperienced bioinformatician fouled up the list, so many of the accession numbers appear more than once. Make a program that first reads the file into a list, then asks for an accession number, counts how many times it appears in the list, and displays the result. Keep asking and searching for accession numbers until STOP is entered. Hint: this takes two loops, one inside the other.
  2. You need to clean up the ex5.acc file. The first step is to sort the accession numbers alphabetically. You must program a sorting algorithm. There are many different algorithms for sorting, but let's pick a simple one - Bubble Sort.
    It goes like this. Read the accessions into a list as in the previous exercise. Go through the list looking at pairs of adjacent accessions (at positions i and i+1). If a pair is in the wrong order, swap the two. Repeat going through the list until you have made a complete pass without a single swap. The list is now sorted, and you save it in the file sorted5.acc.
    Note: It takes 10-30 sec to run this program. There is room for optimization in the described algorithm and it is in any case not the most efficient method. Strings can be compared directly with each other using the operators ==, !=, >, <, >=, <=.
  3. It is now time to find the unique accession numbers, so you only have one of each - no duplicates. Read the accessions from sorted5.acc into a list. Since the list is now sorted, the duplicates are "next" to each other, which makes them easy to find. Make a new list with the unique accessions from the old list, and save that list in the file clean5.acc. Check that you have 6461 accessions, one per line.
  4. In this exercise you must achieve the same result as in the previous exercise, but in a different way. Instead of making a new list with the unique accessions, keep the old list and remove the duplicates from it. You can use del or pop to remove elements. If you run into trouble, imagine your code executed on this list: [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
  5. First ask for some input file names. Do this either in a loop, where you ask for one file name at a time until an empty string is entered, or by asking for all file names in one space-separated string. You must compute the sum of the first number in all files, the sum of the second number in all files, and so forth. Save the sums in the output file sums.txt. The input files may be corrupted or of different lengths. Deal with that by treating a corrupted line (a line you cannot convert to a number) and the missing lines in short files as 0. The files ex1_1.dat, ex1_2.dat, ex1_3.dat and ex1corrupt.dat are good for testing.
  6. After having looked at the cleaned accession numbers in clean5.acc, you will have seen that the accession numbers are sorted. This means that you can use the much more powerful binary search method when searching for accession numbers. Repeat exercise 3 from last week, but this time use binary search instead of the linear search you did then. See what Wikipedia has to say about binary search. Binary search as a method is well described in the solution to the "guess a number" exercise.
  7. Make a program that calculates the sum of all columns in an input file, no matter how many columns there are. Each column should be summed individually. You can assume that each row (line) has the same number of columns in the file. In the file ex1.dat, the sums of the three columns are: -904.4143, 482.8410, 292.0515
  8. Make a Python program that can select specific columns from a column-based (tab-separated) file and save them in a new file. It should select the columns you specify, in the order you specify them. Ask for the input file name, the output file name and the column numbers (3 questions, for example "ex1.dat", "2col.acc" and "3 1"). The program should work with any number of columns in the input file.
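
The Bubble Sort procedure described in exercise 2 can be sketched as follows. This is only a minimal illustration on a small in-memory list, not a full solution: reading ex5.acc and writing sorted5.acc are left out, the function name bubble_sort is made up for this example, and the accession strings are invented sample values.

```python
def bubble_sort(items):
    """Sort a list in place by swapping adjacent pairs that are out of order."""
    swapped = True
    # Keep making passes until one complete pass needs no swap.
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            # Strings compare alphabetically with the > operator.
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
    return items


# Invented sample values, not taken from ex5.acc.
accessions = ["X52541", "AB000114", "M11147", "AB000114"]
bubble_sort(accessions)
print(accessions)
```

One of the optimizations hinted at in the exercise is that every pass moves the largest remaining element to the end, so each new pass could stop one position earlier than the previous one.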
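
For exercise 4, the usual pitfall is advancing the index after deleting an element, which skips the element that has just moved into the vacated position. The sketch below shows one possible way to avoid that, assuming the list is already sorted so that duplicates are adjacent. The function name remove_adjacent_duplicates is hypothetical.

```python
def remove_adjacent_duplicates(items):
    """Remove repeated neighbours from a sorted list, modifying it in place."""
    i = 0
    while i < len(items) - 1:
        if items[i] == items[i + 1]:
            # Delete the duplicate but do not advance: the element that
            # moves into position i + 1 may be a duplicate as well.
            del items[i + 1]
        else:
            i += 1
    return items


numbers = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
remove_adjacent_duplicates(numbers)
print(numbers)  # prints [1, 2, 3, 4]
```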
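
The binary search mentioned in exercise 6 repeatedly halves the part of the sorted list that can still contain the target. A minimal sketch, assuming a sorted list without duplicates such as the one in clean5.acc, could look like this. The function name binary_search, the return value of -1 for "not found" and the sample accessions are choices made for this example only.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    low = 0
    high = len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1


# Invented sample values, already sorted.
accessions = ["AB000114", "D00596", "M11147", "X52541"]
print(binary_search(accessions, "M11147"))  # prints 2
print(binary_search(accessions, "Z99999"))  # prints -1
```

With 6461 accessions, a linear search may need up to 6461 comparisons per query, while this loop needs at most 13, since 2 to the power of 13 is 8192.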
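
For exercise 7, the number of columns is not known in advance, so the list of sums can be created when the first data line is seen. The following is a sketch under the exercise's assumption that every row has the same number of columns. It works on a list of text lines instead of a file, and the name column_sums is made up for this illustration.

```python
def column_sums(lines):
    """Return a list with the sum of each whitespace-separated column."""
    sums = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if not sums:
            # First data line: create one running sum per column.
            sums = [0.0] * len(fields)
        for i, field in enumerate(fields):
            sums[i] += float(field)
    return sums


# Invented sample rows, not taken from ex1.dat.
rows = ["1 2 3", "4 5 6"]
print(column_sums(rows))  # prints [5.0, 7.0, 9.0]
```

Since a file object yields its lines when iterated, the same function could also be given an open file instead of a list.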

Exercises for extra practice

  • Read a file with numbers into two lists. The first number in the file goes to the first list, the second number to the second list, the third number to the first list, and so forth - alternating numbers to alternating lists. Compute the sum of each of the two lists and display both sums. Files ex1_1.dat, ex1_2.dat & ex1_3.dat are good to use for this.
  • Now compute the same sums - without the lists :-)
  • Ask for a file name and read the numbers in the file into a list. Continue to ask for file names and add the numbers to the list until you enter STOP. Your list of numbers will simply grow with each file you add. Now compute the sum of the numbers in the list and display the result. Files ex1_1.dat, ex1_2.dat & ex1_3.dat are good to use for this.
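
The second extra exercise asks for the alternating sums without storing the numbers in lists. One possible approach is to keep two running totals and choose between them by the position of each number. The sketch below assumes the numbers arrive one at a time (here from a small made-up sequence instead of a file), and the name alternating_sums is hypothetical.

```python
def alternating_sums(numbers):
    """Sum the values at even and at odd positions without building lists."""
    first_sum = 0.0
    second_sum = 0.0
    for position, value in enumerate(numbers):
        if position % 2 == 0:
            first_sum += value   # 1st, 3rd, 5th ... number
        else:
            second_sum += value  # 2nd, 4th, 6th ... number
    return first_sum, second_sum


# Invented sample values.
print(alternating_sums([1, 2, 3, 4, 5]))  # prints (9.0, 6.0)
```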