Canopy clustering algorithm

From Wikipedia, the free encyclopedia

The canopy clustering algorithm is an unsupervised pre-clustering algorithm, often used as a preprocessing step for the K-means algorithm or for hierarchical clustering.

It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical because of the size of the data set.

The algorithm proceeds as follows:

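Cheaply partition the data into overlapping subsets, called 'canopies', using an inexpensive, approximate distance measure.
Perform more expensive clustering, but only within these canopies: exact distances are computed only between points that share a canopy.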
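
The first step can be sketched as follows. This is a minimal illustration of the two-threshold canopy construction described by McCallum, Nigam and Ungar, not a reference implementation. The function names, the choice of Manhattan distance as the cheap measure, and the threshold values are assumptions made for the example.

```python
import random


def cheap_distance(a, b):
    # Assumed inexpensive measure for this sketch: Manhattan distance.
    return sum(abs(x - y) for x, y in zip(a, b))


def build_canopies(points, t1, t2, distance=cheap_distance, seed=0):
    """Return a list of (center_index, member_indices) canopies.

    t1 is the loose threshold and t2 the tight threshold (0 <= t2 <= t1).
    """
    if not 0 <= t2 <= t1:
        raise ValueError("thresholds must satisfy 0 <= t2 <= t1")
    rng = random.Random(seed)
    remaining = list(range(len(points)))  # points still eligible as centers
    canopies = []
    while remaining:
        center = rng.choice(remaining)
        # Every point within the loose threshold joins this canopy,
        # so a point may belong to several canopies.
        members = [i for i in range(len(points))
                   if distance(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        # The center itself is at distance 0, so the loop always terminates.
        remaining = [i for i in remaining
                     if distance(points[center], points[i]) > t2]
    return canopies


# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
canopies = build_canopies(points, t1=3, t2=2)
for center, members in canopies:
    print(center, members)
```

The second step would then run the expensive clustering algorithm, restricting exact distance computations to pairs of points that appear together in at least one canopy.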

Benefits

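The number of instances of training data that must be compared at each step is reduced, because the expensive distance measure is evaluated only between points that share a canopy.
There is some evidence that the resulting clusters are improved: McCallum, Nigam and Ungar report lower error on a reference-matching task than with a previously used algorithm.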
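
The reduction in comparisons can be illustrated with a small count. The sketch below assumes a pairwise method, in which every pair of points would otherwise be compared, and uses made-up canopies given as lists of point indices. The function names are hypothetical.

```python
from itertools import combinations


def pairs_within_canopies(canopies):
    """Collect the distinct point pairs that share at least one canopy."""
    pairs = set()
    for members in canopies:
        # Sorting makes (i, j) and (j, i) the same pair.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def all_pairs_count(n):
    """Number of pairs compared when no canopies are used."""
    return n * (n - 1) // 2


# Hypothetical canopies over six points; point 3 lies in both.
canopies = [[0, 1, 2, 3], [3, 4, 5]]
restricted = len(pairs_within_canopies(canopies))
total = all_pairs_count(6)
print(restricted, total)  # 9 15
```

With these assumed canopies, 9 expensive comparisons are needed instead of 15. The saving grows as the data set gets larger and the canopies stay small relative to it.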

References

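McCallum, A.; Nigam, K.; Ungar, L. H. (2000). "Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching". Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.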

External links

Cluster Computing and MapReduce Lecture 4 from Google

