K-nearest neighbor (k-NN) continuous variable estimation
Description
This scripts read a matrix-styled data file, containing missing values, and infers these values by finding the k-nearest neighbors. An application of this can be seen in Microarray experiments, in which the observed signal is not always significantly different from the background signal. Imputing these values are a cheaper solution rather than redoing the whole experiment. This method has been shown to perform better than e.g. rowmeans, and faster than SVD approaches (see reference 1).
An important step in validating new methods in science is to asses the error rates and accuracy feeding already-known results to a program, and verifying the output is in accordance. Hence, it is expected that the final submission of this project contains:
- A program that removes data points from a data set.
- A program that implements the k-NN algorithm.
- A program that validates the accuracy of the k-NN based on the original file and the resulting file from the k-NN.
- Graphs in the report presenting the accuracy of your implementation, for a variety of missing field percentages, as well as different k's.
Input and Output
The input should be a matrix-styled file, containing an identifier for each field M(i,j), in which a number of entries is listed as 'NA', as can be seen below.
Probe_ID ref1 ref2 ref3 week1 week2 week3 100002 68.42 89.72 76.86 NA NA 103.03 100037 20.79 23.86 25.65 20.6 NA 20.26 100039 3.95 5.94 5.51 6.49 NA 6.21 . . . 100058 64.29 80.97 64.92 42.49 34.54 39.08
In this example, the rows represent probes, and columns individual samples. The data file provided contains a FULL dataset, with no NAs, from an actual experiment,
E-GEOD-10590.
You have to create the missing entries in this datafile, and this is part of your project. The output of this script should be a matrix in which all NAs are filled with numbers, using the kNN algorithm. You must yourself find a method of evaluating your results, when comparing to the original data file. It is a good idea to look into the literature for this (both reference #1 and #2 presents methods for this).
Examples of execution could be:
# Generation of test-set missing 5% of all values in the data.txt using the 10 nearest neighbors generate_test.py -missing 5% -file data.txt -k 10
The above script will then call the script containing the k-NN algorithm, and pass onto it the generated test set, as well as the k parameter. To get input from command line, it is suggested to use the argparse module. This is, however, no requirement.