K-means clustering

From 22113
Revision as of 15:15, 6 March 2024 by WikiSysop (talk | contribs) (Created page with "__NOTOC__ ===Description=== The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest cen...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Description

The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest centroid.
The initial selection of centroids (the "middle" of a cluster) is very important for the performance of the algorithm, both with respect to time used, and results produced. The normal (and rather bad) approach is to randomly select points in the data set to represent the initial centroids. You must improve the algorithm with adding the K-means++ selection method for selecting the initial centroids. This basically selects the first centroid (data point) randomly, but the rest based on far they are from previous selected centroids. The further away - the higher probability of being picked.

Input and output

The input is a tab separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers :-) The program should handle any given vector size, but the vector size is constant in any data file. Input file example:

ex01	8.76	3.29	1.05
ex02	12.3	2.33	3.53
ex03	-0.54	-3.56	1.45
.
.

The output is all data points, partitioned in the clusters they belong to. Output example where each cluster starts with the cluster centroid and is proceeded by the the members of that cluster:

Cluster-1	2.13	3.24	1.54
ex10	1.04	2.98	1.34
ex12	1.23	2.34	1.69
.
.
Cluster-2	-4.13	1.25	6.34
ex04	-0.34	3.51	9.02
ex07	-8.56	5.12	12.5
.
.

Examples of program execution

cluster.py dataset.dat 5

The 5 is the number of clusters the data point should be partitioned in.

Details

Wikipedia has a page on K-means clustering and on K-means++.
Various data sets: 100 data points, 1000 data points, 4169 data points, 5000 data points, 6000 data points, 10000 data points,