K-means clustering
Description
The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest centroid.
The initial selection of centroids (the "middle" of a cluster) is very important for the performance of the algorithm, both with respect to time used, and results produced. The normal (and rather bad) approach is to randomly select points in the data set to represent the initial centroids. You must improve the algorithm with adding the K-means++ selection method for selecting the initial centroids. This basically selects the first centroid (data point) randomly, but the rest based on far they are from previous selected centroids. The further away - the higher probability of being picked.
Input and output
The input is a tab separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers :-) The program should handle any given vector size, but the vector size is constant in any data file. Input file example:
ex01 8.76 3.29 1.05 ex02 12.3 2.33 3.53 ex03 -0.54 -3.56 1.45 . .
The output is all data points, partitioned in the clusters they belong to. Output example where each cluster starts with the cluster centroid and is proceeded by the the members of that cluster:
Cluster-1 2.13 3.24 1.54 ex10 1.04 2.98 1.34 ex12 1.23 2.34 1.69 . . Cluster-2 -4.13 1.25 6.34 ex04 -0.34 3.51 9.02 ex07 -8.56 5.12 12.5 . .
Examples of program execution
cluster.py dataset.dat 5
The 5 is the number of clusters the data point should be partitioned in.
Details
Wikipedia has a page on K-means clustering and on K-means++.
Various data sets: 100 data points, 1000 data points,
4169 data points, 5000 data points, 6000 data points, 10000 data points,