Talk:Data clustering

The explanation of the fuzzy c-means algorithm seems quite difficult to follow. The order of the bullet points is correct, but it is misleading about which steps are to be repeated and when (see the sketch after the quoted steps below).

"The fuzzy c-means algorithm is greatly similar to the k-means algorithm:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above"
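
To make the control flow explicit, here is a minimal NumPy sketch of the quoted steps (the function name, parameter names, and defaults are my own, not from the article). Only the two update steps sit inside the loop, and the loop stops once no coefficient changes by more than eps:

  import numpy as np

  def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-5, max_iter=300, seed=None):
      """X is an (n_points, n_dims) array; m > 1 is the fuzzifier."""
      rng = np.random.default_rng(seed)
      X = np.asarray(X, dtype=float)
      # Steps 1-2: choose the number of clusters, then assign random
      # coefficients to each point (each row is normalised to sum to 1).
      U = rng.random((len(X), n_clusters))
      U /= U.sum(axis=1, keepdims=True)
      for _ in range(max_iter):
          U_old = U.copy()
          # Repeated step (a): centroids are coefficient-weighted means.
          W = U ** m
          centroids = (W.T @ X) / W.sum(axis=0)[:, None]
          # Repeated step (b): recompute each point's coefficients from
          # its distances to the new centroids.
          d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          d = np.fmax(d, 1e-12)  # guard against division by zero
          U = 1.0 / d ** (2.0 / (m - 1.0))
          U /= U.sum(axis=1, keepdims=True)
          # Convergence: coefficients changed by no more than eps.
          if np.abs(U - U_old).max() < eps:
              break
      return centroids, U

Note that the convergence test compares coefficients, not centroids, matching the quoted description; and as m approaches 1 the coefficients harden toward 0/1 assignments, which is the relationship to k-means that the reply below describes.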

Also, aren't c-means and k-means just different names for the same thing? If so, could the article be changed to use one name consistently throughout?



C-means clustering refers only to the fuzzy-logic clustering algorithm. You could say that k-means is what c-means reduces to when ordinary (crisp) logic is used in place of fuzzy logic.

On-line versus off-line clustering

Beyond the division of clustering methodologies into hierarchical/partitional and agglomerative/divisive, it is possible to differentiate between:

  • Arrival of data points: on-line/off-line
  • Type of data: stationary/non-stationary
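
To illustrate the first distinction: an on-line method sees one point at a time and updates its model immediately, rather than iterating over a stored data set. A minimal sketch of sequential k-means (my own illustration, not from the article) looks like this:

  import numpy as np

  def online_kmeans(stream, n_clusters, dim, seed=None):
      """Sequential k-means: each arriving point nudges its nearest
      centroid, so the raw data never has to be stored."""
      rng = np.random.default_rng(seed)
      centroids = rng.standard_normal((n_clusters, dim))  # arbitrary init
      counts = np.zeros(n_clusters)
      for x in stream:
          x = np.asarray(x, dtype=float)
          j = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
          counts[j] += 1
          # A step size of 1/count keeps each centroid equal to the
          # running mean of the points assigned to it so far.
          centroids[j] += (x - centroids[j]) / counts[j]
      return centroids

An off-line method, by contrast, stores the whole data set and may pass over it many times. The 1/count step size also assumes a stationary stream; for non-stationary data a constant step size tracks drift better, which ties the two distinctions together.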


Additionally, it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids and the limits on memory or processing time, as well as techniques for addressing them, such as measuring phase transitions (Rose, Gurewitz & Fox).


Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)

Missing reference

I find this in the article:

This is the basic structure of the algorithm (J. MacQueen, 1967):

But when I looked at the bibliography, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)

Impossibility Theorem

The clustering impossibility theorem should be mentioned. It is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it, though. Briefly, it names three properties that a clustering algorithm ought to have:

  • Richness - "all labelings are possible"
  • Scale invariance
  • Consistency - "if you increase inter-cluster distances and decrease intra-cluster distances, cluster assignments should not change"

No clustering algorithm can have all 3.
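
For reference, the three conditions can be stated compactly for a clustering function $f$ that maps a distance function $d$ on a point set $S$ to a partition of $S$ (my paraphrase, following Kleinberg's 2002 paper):

  • Scale invariance: $f(\alpha \cdot d) = f(d)$ for every $d$ and every $\alpha > 0$.
  • Richness: for every partition $\Gamma$ of $S$ there is some $d$ with $f(d) = \Gamma$.
  • Consistency: if $d'$ is obtained from $d$ by decreasing intra-cluster distances and increasing inter-cluster distances of $f(d)$, then $f(d') = f(d)$.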

-- BAxelrod 03:53, 16 December 2005 (UTC)

It is a good thing to have, and the reference one should probably cite is: Jon Kleinberg, "An Impossibility Theorem for Clustering", Advances in Neural Information Processing Systems, 2002.

However, if there is no other source, then I'd mention that there is a little problem with this theorem as it is presented in that article.

First, it deals with graphs G(V,E) having |V| >= 2 and distances d(i,j) = 0 iff i == j, for i, j in V. Now take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), take a graph with |V| = 2, and boom, there you go: each clustering function gets either scale invariance or richness, but not both. If there is richness, then scale invariance does not hold, and the other way round. Sweet, isn't it? Or am I wrong somewhere?
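
Spelling the $|V| = 2$ case out in symbols: with $V = \{1, 2\}$ the distance function is determined by the single value $t = d(1,2) > 0$, and any other admissible value $t' > 0$ can be written $t' = \alpha t$ for some $\alpha > 0$. Scale invariance then forces $f$ to output the same partition for every $t$, while richness requires both partitions $\{\{1,2\}\}$ and $\{\{1\},\{2\}\}$ to be attainable, so the two conditions already conflict before consistency is even considered.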

Could you please explain something about the ISODATA algorithm for data clustering?

The last external link on this page has an example of ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan