Talk:Cluster analysis
From Wikipedia, the free encyclopedia
[edit] Sabotage
This page appears to have been deliberately vandalised.
[edit] Elbow criterion section
Someone should edit out the blatant advertising for the excel plugin product. It would also be nice to have some additional info on picking the number of clusters?
- Sorry, I just had Excel at hand when making the graph. This is not an advertisement, this is not real data, but a crappy Excel hand made graph to help visualize the EC heuristic. If you look through the archives, you will see I also made an extremely ugly representation for the hierarchical clustering, which was thankfully replaced.
- As for how to tweak clustering, either you have a good idea of how many clusters you want (number criterion), or a good idea of how much total variance (or another perf metric) you want to explain, or all you are left with is heuristic. That's from a Computer science POV.
- Looking in another direction, for example statistics, there are ways to compare models with a different number of parameters, like Akaike information criterion and the methods linked from there, or maybe something based on information entropy. It will again help you choose in the tradeoff between having too many cluster or having too low perf because of low cluser number. I'm sorry, I don't have any relevant article nor the time at the moment to find one. Bryan
- Actually ran into an article about using entropy criterion to stop clustering: Cluster Identification Using Maximum Configuration Entropy, by C.H. Li. —The preceding unsigned comment was added by 86.53.54.179 (talk) 19:02, 1 April 2007 (UTC).
I'm wondering if there isn't an error in this sentence: "Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance." I'm thinking that as the number of clusters increases, the within-group variation decreases, which is not what is shown on the graph. Should this be "... the ratio of the between-group variance to the total variance." Mhorney (talk) 17:57, 11 January 2008 (UTC)
[edit] V-means clustering
A Google search for "V-means clustering" only returns this Wikipedia article. Can someone provide a citation for this?
for future ref, this is the V-means paragraph that was removed
[edit] V-means clustering
This article or section may contain original research or unverified claims. Please improve the article by adding references. See the talk page for details. (October 2007) |
The neutrality or factuality of this article or section may be compromised by unattributed statements. You can help Wikipedia by removing weasel worded statements. |
V-means clustering utilizes cluster analysis and nonparametric statistical tests to key researchers into segments of data that may contain distinct homogenous sub-sets. The methodology embraced by V-means clustering circumvents many of the problems that traditionally beleaguer standard techniques for categorizing data. First, instead of relying on analyst predictions for the number of distinct sub-sets (k-means clustering), V-means clustering generates a pareto optimal number of sub-sets. V-means clustering is calibrated to a usened confidence level p, whereby the algorithm divides the data and then recombines the resulting groups until the probability that any given group belongs to the same distribution as either of its neighbors is less than p.
Second, V-means clustering makes use of repeated iterations of the nonparametric Kolmogorov-Smirnov test. Standard methods of dividing data into its constituent parts are often entangled in definitions of distances (distance measure clustering) or in assumptions about the normality of the data (expectation maximization clustering), but nonparametric analysis draws inference from the distribution functions of sets.
Third, the method is conceptually simple. Some methods combine multiple techniques in sequence in order to produce more robust results. From a practical standpoint this muddles the meaning of the results and frequently leads to conclusions typical of “data dredging.”
[edit] Fuzzy c-means clarification
I believe ther is a typo at "typological analysis"; should be "topological"
The explanation of the fuzzy c-means algorithm seems quite difficult to follow, the actual order of the bullet points is correct but which bit is to be repeated and when is misleading.
"The fuzzy c-means algorithm is greatly similar to the k-means algorithm:
- Choose a number of clusters
- Assign randomly to each point coefficients for being in the clusters
- Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
- Compute the centroid for each cluster, using the formula above
- For each point, compute its coefficients of being in the clusters, using the formula above"
Also aren't c-means and k-means just different names for the same thing, in which case can they be changed to be consistent throughout?
The c-means clustering relates only to the fuzzy logic clustering algorithm. You could say that k-means is teh convergence of c-clustering with ordinary logic, rather than fuzzy logic.
[edit] On-line versus off-line clustering
Beyond the division of clustering methodologies hierarchical/partitional agglomerative/divisive, it is possible to differentiate betewen: Arrivial of data points: on-line/off-line Type of data: stationary/non-stationary
Additionally it may be helpful to discuss some of the difficulties in clustering data, in particular choosing the correct number of centroids, limits on memory or processing time, and techniques for solving them, such as measuring phase transition (Rose, Gurewitz & Fox)
- Another division would be Extensional vs. Conceptual Clustering. --Beau 10:40, 28 June 2006 (UTC)
[edit] Missing reference
I find this in the article:
-
- This is the basic structure of the algorithm (J. MacQueen, 1967):
But when I looked at the bibliograpy, it was not there. If anyone has the information, could they add it? Michael Hardy 18:36, 21 November 2005 (UTC)
[edit] Impossibility Theorem
The clustering impossibility theorem should be mentioned. it is similar to Arrow's impossibility theorem, except for clustering. I don't know who created it though. Briefly, it states that any clustering algorithm should have these 3 properties:
- Richness - "all labelings are possible"
- scale invarience
- Consistancy - "if you increase inter-cluster distance and decrease intra-cluster distance, cluster assignments should not change"
No clustering algorithm can have all 3.
-- BAxelrod 03:53, 16 December 2005 (UTC)
It is a good thing to have and mostly one should reference, I guess \bibitem{JK02} Jon Kleinberg. An impossibility theorem for clustering - Advances in Neural Information Processing Systems, 2002.
However, if there is not another source, then I'd mention that there is a little problem with this theorem as it is presented in that article.
First, it deals with graphs G(V,E), having |V| >= 2 and having distances d(i,j) = 0 iff i==j, i,j in V. Thus, take richness and scale invariance (which means that a graph with some fixed weights has the same clustering if all the weights are multiplied by some positive constant), a graph with |V| = 2, and boom - here you go. For each clustering we get either scale invariance or richness. If there is richness, then scale invariance does not work and the other way round. Sweet, is not it? Or am I wrong somewhere?
Could you please explain something about Isodata Algorithm for data clustering
The last external link on this page has an example on ISODATA clustering. I will try to do a digest when I have time, but feel free to beat me to it. Bryan
[edit] Relation to NLP
It seems a large amount of the effort in text mining related to text clustering is left out of this article, but it seems to be most appropriate place. Josh Froelich 20:16, 9 January 2007 (UTC)
[edit] No strict definition for the problem itself.
There are lots of details about different methods and metrics used to solve the problem, which is defined too unstrictly.
[edit] All software links were removed
Here they are, should we put them back ?
[edit] Software implementations
[edit] Free
- The flexclust package for R
- COMPACT - Comparative Package for Clustering Assessment (in Matlab)
- YALE (Yet Another Learning Environment): freely available open-source software for data pre-processing, knowledge discovery, data mining, machine learning, visualization, etc. also including a plugin for clustering, fully integrating Weka, easily extendible, and featuring a graphical user interface as well as a XML-based scripting language for data mining;
- mixmod : Model Based Cluster And Discriminant Analysis. Code in C++, interface with Matlab and Scilab
- LingPipe Clustering Tutorial Tutorial for doing complete- and single-link clustering using LingPipe, a Java text data mining package distributed with source.
- Weka : Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
- Tanagra : a free data mining software including several clustering algorithms such as K-MEANS, SOM, Clustering Tree, HAC and more.
- Cluster : Open source clustering software. The routines are available in the form of a C clustering library, an extension module to Python, a module to Perl.
- python-cluster Pure python implementation
[edit] Non-free
- Clustan
- Peltarion Synapse (using self-organizing maps)[1]
- Eudaptics Viscovery: Data Mining Suite for Visual Cluster Analysis
- I removed these, about a month ago. Wikipedia's external link guidelines discourage large link directories like this. None of the links appear to meet the guidelines, each is just an instance of data clustering software with nothing to indicate it is especially notable to the topic. Normally I would have replaced them with a DMOZ link, but there doesn't appear to be a category for this. -SpuriousQ (talk) 10:14, 22 March 2007 (UTC)
- Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algo, it might very well be because you need a working implementation, or a reference implementation). Bryan
- I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)
- No problem, I accept the policy and see its utility, but think the links and comments are important, that's why I asked for a DMOZ directory to be created. Also, there is a problem with the fact that you removed some internal wikipedia links in the process (the YALE, Weka, Peltarion Synapse links). So, if we want to keep the links somewhere, do we try to move them to DMOZ or do we create an intermediary Wikipedia page for each package ? Bryan
- Hmm. The standard way of linking to internal articles is in the See also section. But I feel it's a bit questionable to give such prominence to the three instances that happen to have Wikipedia articles. Another idea would be to have an article List of data clustering software and link to that in See also. I would prefer that option, but it's also iffy to have a list article with so few examples. Do you know of any other articles we could add to that list? -SpuriousQ (talk) 01:58, 30 March 2007 (UTC)
- Data mining, software section. The software links there actually suffer the same problem as the ones here, although to a lesser extent (more wikipedia articles, less external links). Please see discussion with David Eppstein. Bryan
- Hmm. The standard way of linking to internal articles is in the See also section. But I feel it's a bit questionable to give such prominence to the three instances that happen to have Wikipedia articles. Another idea would be to have an article List of data clustering software and link to that in See also. I would prefer that option, but it's also iffy to have a list article with so few examples. Do you know of any other articles we could add to that list? -SpuriousQ (talk) 01:58, 30 March 2007 (UTC)
- No problem, I accept the policy and see its utility, but think the links and comments are important, that's why I asked for a DMOZ directory to be created. Also, there is a problem with the fact that you removed some internal wikipedia links in the process (the YALE, Weka, Peltarion Synapse links). So, if we want to keep the links somewhere, do we try to move them to DMOZ or do we create an intermediary Wikipedia page for each package ? Bryan
- I understand that, but that list was just an invitation for spam and links to data clustering software someone wrote one day. WP:EL is clear that links should be kept to a minimum. I would probably have no problem with links to clearly notable implementations; for example, one that was the subject of a peer-reviewed published paper or one created by noted researchers in the field. -SpuriousQ (talk) 15:12, 28 March 2007 (UTC)
- Actually it did take some time to build up such a list of software. Also, since this is an article about algorithms, any software implementing them is relevant to the topic (if you have an interest in the algo, it might very well be because you need a working implementation, or a reference implementation). Bryan
I think the list should either be kept out of the article altogether, or it should be explicitly limited to a small number (say half a dozen) of particularly notable clustering systems, for which some strong argument can be made for their inclusion beyond "it's a clustering system" or even "it's a free clustering system". By a strong argument, I mean statements such as "it's the most widely used free clustering system for Linux" or the like. Anything less restrictive is just an invitation for an unencyclopedic link farm. —David Eppstein 03:22, 30 March 2007 (UTC)
- I agree, it's time for a cleanup. Just to give a little "historical perspective", the list was started when commercial spam appeared, it kept it under very good control because it gave space for people tempted to spam the page with their soft's link to express themselves. The list has grown, and it now has many annotations, so I'm quite happy with it and would like to keep some of its info.
- Now, to continue the conversation together with SpuriousQ, imagine we move everything to a list of Data mining software. We would lose info about implementations (libraries,scripts) that are not autonomous software, thus failing to highlight "reference implementations" that people may want to examine at a code level, only linking to "working implementations" that people would like to use. So my proposition would be: if the soft is autonomous, we add it to the data mining software list. If the soft is just a library, and seems to be from academia, we keep a link to it. In our case, that would keep flexclust, COMPACT, mixmod and Cluster, the python-cluster link would be lost forever, and the rest would be transferred to the more general and richer data mining software list. What do you think ? Bryan.
- This would have been very useful a few weeks ago... trawling Google for OSS implementations and avoiding all the spam took ages. I am actually quite angry that someone just removed them all without a discussion. --MatthewKarlsen 09:34, 21 June 2007 (UTC)
- Now, to continue the conversation together with SpuriousQ, imagine we move everything to a list of Data mining software. We would lose info about implementations (libraries,scripts) that are not autonomous software, thus failing to highlight "reference implementations" that people may want to examine at a code level, only linking to "working implementations" that people would like to use. So my proposition would be: if the soft is autonomous, we add it to the data mining software list. If the soft is just a library, and seems to be from academia, we keep a link to it. In our case, that would keep flexclust, COMPACT, mixmod and Cluster, the python-cluster link would be lost forever, and the rest would be transferred to the more general and richer data mining software list. What do you think ? Bryan.
[edit] Affinity Propagation (Clustering by Passing Messages Between Data Points)
I believe that this algorithm developed at the University of Toronto by Brendan Frey, a professor in the department of Electrical and Computer Engineering and Delbert Dueck, a member of his research group which appeared in Science (journal) Feb 07 will change the way people think about clustering. See www.psi.toronto.edu/affinitypropagation/ and www.sciencemag.org/cgi/content/full/sci;315/5814/972 . However I am not capable of writing a full introduction, so I hope someone better equiped for the job will do that. Including the AP breakthrough is a must in my view to retain the currency of this article. Bunty.Gill 08:50, 24 April 2007 (UTC)
[edit] QT: recursion or iteration?
"Recurse with the reduced set of points." (with link to recursion)
Is this really recursion? I would call it iteration. You repeat the process until you can't go any further, then you stop. Sounds like a while loop to me.
--84.9.95.214 17:28, 1 July 2007 (UTC)
If you take a look at the Recursion (computer science) article "Any function that can be evaluated by a computer can be expressed in terms of recursive functions without the use of iteration, in continuation-passing style; and conversely any recursive function can be expressed in terms of iteration.", i.e. you can rewrite anything primitive recursive to an iterated algo. Here, recursion is in the philosophical sense, since you apply the same analysis to a reduced set of points.
[edit] Requested move: Should be entitled 'Cluster Analysis' and not 'Data Clustering'.
As mentioned in the article there are several different terms used to describe cluster analysis. However, the most frequent term used in books, papers, etc, always appears to be cluster analysis by a fair margin. It would make sense to use the most widely used term as the page name.
Approximately 7-10 days ago I collected statistics on the number of hits on each term from several different sources. The data could be used as an approximate indicator for the prevalence of each term. The numbers in brackets represent papers released since January 2000. On average the ratio of [papers published since January 2000 to number of papers overall] for the ACM and IEEE combined is approximately 0.757 and 0.759 for data clustering and cluster analysis respectively, thereby suggesting a relatively stable prevalence of each term in the literature over time.
Source | IEEE | ACM | Alexa | Yahoo | Ask | Gigablast | Live Search | |
---|---|---|---|---|---|---|---|---|
Results for "Data Clustering" | 524 (368) | 682 (555) | 20,000 | 96,900 | 353,000 | 43,600 | 38,219 | 27,279 |
Results for "Cluster Analysis" | 571 (511) | 1159 (724) | 130,000 | 715,000 | 1,860,000 | 307,200 | 316,708 | 148,096 |
Based on these results, and the titles and content of books on the subject, I propose that the page title be changed to "Cluster Analysis".
--MatthewKarlsen 16:07, 15 July 2007 (UTC)
[edit] Normalized Google Distance
Normalized Google Distance (NGD) -- when I saw this I thought it was a prank. It turns out that someone has actually written a (suprisingly well-informed) paper or two on this [2] -- but that does not mean it is a serious approach (or that it is not just the type of prank that bored computer scientists come up with in their free time).
Is anyone here informed on this? Is NGD a viable norm function (in its domain)? I have been unable to find any peer-reviewed publications regarding the topic (New Scientist hardly counts). --SteelSoul 23:12, 1 November 2007 (UTC)
[edit] Limits of Cluster Analysis
I understand that CA will cluster just anything we throw at it: random data, linearly correlated data, etc. Could somebody knowledgeable please point out when NOT to use CA? --Stevemiller (talk) 04:56, 15 February 2008 (UTC)
- It is often a good idea to compare the clustering results with random data and then compare it to confidence intervals. So I would say don't use clustering algorithms unless you are looking for or anticipating clusters (or their absence). Clustering algorithms can be used to compare the probability of a particular distribution of clusters forming and compare that to a random (or other comparator) case. For example you might be looking at two sets of star systems, you want to know if there is a tendency for them to form into clusters in the presence if there are detectable levels of ficticium (a fictious element). So you run your clustering algorithms past the data and see if there is a difference between the star systems with low ficticium and high ficticium, and how that correlates to clustering and if that is statistically significant, most likely using confidence intervals. You also may need to compare it to random data. User A1 (talk) 16:02, 15 February 2008 (UTC)