Talk:K-means algorithm

From Wikipedia, the free encyclopedia

WikiProject Mathematics
This article is within the scope of WikiProject Mathematics, which collaborates on articles related to mathematics.
Mathematics rating: Start Class Low Priority  Field: Applied mathematics

Suggestion: Make title to lowercase (I believe k-means is the consensus and not K-means) --Chrike 03:04, 19 January 2007 (UTC)

Some examples? --Abdull 16:32, 21 February 2006 (UTC)

[edit] Spherical clusters

I assume that the comment in the end of the article ("Also, the algorithm works well only when spherical clusters are naturally available in data.") is very much dependent on the distance metric used... --Andkaha(talk) 15:52, 20 July 2006 (UTC)

[edit] This isn't the k-means algorithm!

Did anyone bother to lookup the original reference?

The algorithm, described by wikipedia, is known as the Lloyd-algorithm. It is described in "An Algorithm for Vector Quantizer Design" by Linde and Buzo and Gray, 1980.

The original k-means algorithm was described by MacQueen ("Some Methods for classification and Analysis of Multivariate Observations", 1967) as follows

Stated informally, the k-means procedure consitsts of simply starting with k groups each of which consists of a single random point, and thereafter adding each new point to the group whose mean the new point is nearest. After a point is added to a group, the mean of that group is adjusted in order to take account of the new point. Thus at each stage the k-means are, in fact, the means of the groups they represent (hence the term k-means).

Every data point is processed only once and only one cluster mean is recomputed everytime a new data point is added. This is hardly the same algorithm as the one described by wikipedia.

This is exactly what the Lloyd clustering does, except that we need to switch to Euclidean space (the algorithm famously works for nutrient grouping etc, but it is difficult to visualize): Centroid = mean; k = number of clusters; cluster = group; k-means is for k groups. The informal procedure you quoted is a different interpretation, although it does the exact same thing: In the mentioned method, you already have the points given. You don't have to add them one by one; instead you add the group centers one by one. This is exactly the same. You still associate each point with the nearest site (or the center to distinguish it from a point). Take k sites. Given n points, these k random sites are placed near the k of the n possible points (usually you select k points and make them the sites). For every point, figure out which of the k sites is the nearest; that point goes to that group. You will have at most k groups now (some groups may be left unassigned if no point lies near to the site). Now recalculate the mean, reposition the k sites; repeat lather rinse. I don't understand how the quoted verbatim differs from this; it describes this exactly. Please clarify. Also provide a link to the PDF or article where this MacQueen paper is, I couldn't find it using Scholar. User_talk:gfxguy

I'm afraid there's very little correlation between the example and the description, making the example all but useless for someone who's trying to understand what's going on. "The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic data." I assume that the value of k was chosen as 4. Where then are the 4 initial clusters?

"It then calculates the mean point, or centroid, of each set." Where is this done?

"It constructs a new partition by associating each point with the closest centroid. Then the centroids are recalculated for the new clusters," Each frame should correspond with a step of the algorithm


Although MacQueen's original k-means algorithm may not be identical to Lloyd's method, the algorithm described in this article is identical to Lloyd's method. Moreover, both articles reflect the common understanding of the k-means algorithm, at least within the algorithms community. The first sentence of the the Arthur and Vassilviskii paper cited in this article is "The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982." I've therefore proposed a merge. (Perhaps the merged article should include a clear explanation of MacQueen's algorithm.) — Jeff Erickson 04:41, 6 April 2007 (UTC)

I agree with the merge. Lloyd's algorithm is the most common algorithm for approaching the k-means problem. Sancho 08:51, 9 April 2007 (UTC)
If K-means is more general I don't see why they articles should be merged. There also seems to be a fair amount of active research on new approaches to K-means (from a survey of Google Scholar articles), so I think it would be a mistake to merge. Imran 05:07, 12 April 2007 (UTC)
Of course, k-means algorithm is used synonymously with Lloyd's algorithm by many. "The most prominent and widely used clustering algorithm is Lloyd's algorithms sometimes also referred to as the k-means algorithm." (From Frahling, G. and Sohler, C. 2006. A fast k-means implementation using coresets. In Proceedings of the Twenty-Second Annual Symposium on Computational Geometry (Sedona, Arizona, USA, June 05 - 07, 2006). SCG '06. ACM Press, New York, NY, 135-143. DOI= http://doi.acm.org/10.1145/1137856.1137879). Sancho 08:55, 9 April 2007 (UTC)

I liked Sancho's comment. I found an article supporting other methods to solve the K-means problem and referenced it and also clarified the difference between K-means and Lloyd's as well as creating a reference to Lloyd's. Weston.pace 21:56, 12 June 2007 (UTC)

[edit] Image Replacement

Originally I started working on replacement images for the examples because they weren't licensed properly and I figured replacement was easier than tracking down the owner. However, they have sinced been properly licensed but I went ahead and replaced them because the new images don't show external software (which I beleive detracts from the quality of the image) and are in .svg format. However, I could be mistaken and do not wish to unfairly promote my own work, so if someone preferred the old images they can revert. -192.25.240.225 19:30, 26 July 2007 (UTC)