Affinity propagation

In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points.[1] Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to k-medoids, affinity propagation finds "exemplars", members of the input set that are representative of clusters.[1]

Algorithm

Let x1 through xn be a set of data points, with no assumptions made about their internal structure, and let s be a function that quantifies the similarity between any two points, such that s(xi, xj) > s(xi, xk) iff xi is more similar to xj than to xk. For this example, the negative squared distance of two data points was used i.e. for points xi and xk, [1]

The diagonal of s (i.e. ) is particularly important, as it represents the input preference, meaning how likely a particular input is to become an exemplar. When it is set to the same value for all inputs, it controls how many classes the algorithm produces. A value close to the minimum possible similarity produces fewer classes, while a value close or larger to the maximum possible similarity, produces many classes. It is typically initialized to the median similarity of all pairs of inputs.

The algorithm proceeds by alternating two message passing steps, to update two matrices:[1]

Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. The algorithm then performs the following updates iteratively:

for and
.

The iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or after some predetermined number of iterations. The exemplars are extracted from the final matrices as those whose 'responsibility + availability' for themselves is positive (i.e. ).

Applications

The inventors of affinity propagation showed it is better for certain computer vision and computational biology tasks, e.g. clustering of pictures of human faces and identifying regulated transcripts, than k-means,[1] even when k-means was allowed many random restarts and initialized using PCA.[2] A study comparing affinity propagation and Markov clustering on protein interaction graph partitioning found Markov clustering to work better for that problem.[3] A semi-supervised variant has been proposed for text mining applications.[4]

Software

References

  1. 1 2 3 4 5 Brendan J. Frey; Delbert Dueck (2007). "Clustering by passing messages between data points". Science. 315 (5814): 972–976. PMID 17218491. doi:10.1126/science.1136800.
  2. Delbert Dueck; Brendan J. Frey (2007). Non-metric affinity propagation for unsupervised image categorization. Int'l Conf. on Computer Vision.
  3. James Vlasblom; Shoshana Wodak (2009). "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs". BMC Bioinformatics. 10 (1): 99. PMC 2682798Freely accessible. PMID 19331680. doi:10.1186/1471-2105-10-99.
  4. Renchu Guan; Xiaohu Shi; Maurizio Marchese; Chen Yang; Yanchun Liang (2011). "Text Clustering with Seeds Affinity Propagation". IEEE Transactions on Knowledge & Data Engineering. 23 (4): 627–637. doi:10.1109/tkde.2010.144.
  5. Clustering.jl www.github.com
  6. "Clustering — scikit-learn 0.14.1 documentation". Retrieved 15 July 2014.
  7. apcluster cran.r-project.org>
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.