WikiSysop: /* Details */

2025-09-26T11:22:49Z

Details

← Older revision		Revision as of 13:22, 26 September 2025
Line 32:		Line 32:
	===Details===		===Details===
	Wikipedia has a page on [https://en.wikipedia.org/wiki/K-means_clustering K-means clustering] and on [https://en.wikipedia.org/wiki/K-means%2B%2B K-means++].<br>		Wikipedia has a page on [https://en.wikipedia.org/wiki/K-means_clustering K-means clustering] and on [https://en.wikipedia.org/wiki/K-means%2B%2B K-means++].<br>
	Various data sets: [https://teaching.healthtech.dtu.dk/material/~~22113~~/point100.lst 100 data points], [https://teaching.healthtech.dtu.dk/material/~~22113~~/point1000.lst 1000 data points],		Various data sets: [https://teaching.healthtech.dtu.dk/material/22118/point100.lst 100 data points], [https://teaching.healthtech.dtu.dk/material/22118/point1000.lst 1000 data points],
	[https://teaching.healthtech.dtu.dk/material/~~22113~~/point4169.lst 4169 data points], [https://teaching.healthtech.dtu.dk/material/~~22113~~/point5000.lst 5000 data points], [https://teaching.healthtech.dtu.dk/material/~~22113~~/point6000.lst 6000 data points], [https://teaching.healthtech.dtu.dk/material/~~22113~~/point10000.lst 10000 data points],		[https://teaching.healthtech.dtu.dk/material/22118/point4169.lst 4169 data points], [https://teaching.healthtech.dtu.dk/material/22118/point5000.lst 5000 data points], [https://teaching.healthtech.dtu.dk/material/22118/point6000.lst 6000 data points], [https://teaching.healthtech.dtu.dk/material/22118/point10000.lst 10000 data points],

WikiSysop: Created page with "NOTOC ===Description=== The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest cen..."

2025-09-26T11:17:51Z

Created page with "__NOTOC__ ===Description=== The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest cen..."

New page

__NOTOC__
===Description===
The program reads a number of data points (multi-dimensional vectors) from a file and partitions those into K clusters. Clustering is important in discovering patterns or modes in multi-dimensional data sets. It is also a method of organizing data examples into similar groups (clusters). In this particular case, K-means partitions the data set such that each example (data point) is assigned to exactly one cluster - the one with the closest centroid.<br>
The initial selection of centroids (the "middle" of a cluster) is very important for the performance of the algorithm, both with respect to time used, and results produced. The normal (and rather bad) approach is to randomly select points in the data set to represent the initial centroids. You must improve the algorithm with adding the K-means++ selection method for selecting the initial centroids. This basically selects the first centroid (data point) randomly, but the rest based on far they are from previous selected centroids. The further away - the higher probability of being picked.

===Input and output===
The input is a tab separated file containing one data point on each line. Each data point is a vector consisting of a number of numbers :-) The program should handle any given vector size, but the vector size is constant in any data file. Input file example:

ex01 8.76 3.29 1.05
ex02 12.3 2.33 3.53
ex03 -0.54 -3.56 1.45
.
.

The output is all data points, partitioned in the clusters they belong to. Output example where each cluster starts with the cluster centroid and is proceeded by the the members of that cluster:

Cluster-1 2.13 3.24 1.54
ex10 1.04 2.98 1.34
ex12 1.23 2.34 1.69
.
.
Cluster-2 -4.13 1.25 6.34
ex04 -0.34 3.51 9.02
ex07 -8.56 5.12 12.5
.
.

===Examples of program execution===
cluster.py dataset.dat 5
The 5 is the number of clusters the data point should be partitioned in.

===Details===
Wikipedia has a page on [https://en.wikipedia.org/wiki/K-means_clustering K-means clustering] and on [https://en.wikipedia.org/wiki/K-means%2B%2B K-means++].<br>
Various data sets: [https://teaching.healthtech.dtu.dk/material/22113/point100.lst 100 data points], [https://teaching.healthtech.dtu.dk/material/22113/point1000.lst 1000 data points],
[https://teaching.healthtech.dtu.dk/material/22113/point4169.lst 4169 data points], [https://teaching.healthtech.dtu.dk/material/22113/point5000.lst 5000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point6000.lst 6000 data points], [https://teaching.healthtech.dtu.dk/material/22113/point10000.lst 10000 data points],

K-means clustering - Revision history

WikiSysop: /* Details */