K-nearest neighbor (k-NN) continuous variable estimation

From 22113
Jump to navigation Jump to search

Description

This scripts read a matrix-styled data file, containing missing values, and infers these values by finding the k-nearest neighbors. An application of this can be seen in Microarray experiments, in which the observed signal is not always significantly different from the background signal. Imputing these values are a cheaper solution rather than redoing the whole experiment. This method has been shown to perform better than e.g. rowmeans, and faster than SVD approaches (see reference 1).

An important step in validating new methods in science is to asses the error rates and accuracy feeding already-known results to a program, and verifying the output is in accordance. Hence, it is expected that the final submission of this project contains:

  1. A program that removes data points from a data set.
  2. A program that implements the k-NN algorithm.
  3. A program that validates the accuracy of the k-NN based on the original file and the resulting file from the k-NN.
  4. Graphs in the report presenting the accuracy of your implementation, for a variety of missing field percentages, as well as different k's.

Input and Output

The input should be a matrix-styled file, containing an identifier for each field M(i,j), in which a number of entries is listed as 'NA', as can be seen below.

Probe_ID	ref1	ref2	ref3	week1	week2	week3
100002	        68.42	89.72	76.86	NA	NA	103.03
100037	        20.79	23.86	25.65	20.6	NA	20.26
100039	        3.95	5.94	5.51	6.49	NA	6.21
.
.
.
100058	        64.29	80.97	64.92	42.49	34.54	39.08

In this example, the rows represent probes, and columns individual samples. The data file provided contains a FULL dataset, with no NAs, from an actual experiment, E-GEOD-10590.
You have to create the missing entries in this datafile, and this is part of your project. The output of this script should be a matrix in which all NAs are filled with numbers, using the kNN algorithm. You must yourself find a method of evaluating your results, when comparing to the original data file. It is a good idea to look into the literature for this (both reference #1 and #2 presents methods for this).

Examples of execution could be:

# Generation of test-set missing 5% of all values in the data.txt using the 10 nearest neighbors 
generate_test.py -missing 5% -file data.txt -k 10

The above script will then call the script containing the k-NN algorithm, and pass onto it the generated test set, as well as the k parameter. To get input from command line, it is suggested to use the argparse module. This is, however, no requirement.

References

  1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock et al. Missing value estimation methods for DNA. Bioinformatics 2001, 6:17, p. 520-525
  2. Ki-Yeol Kim, Byoung-Jin Kim and Gwan-Su Yi. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics 2004, 5:160, p. 160