Rand index


The Rand index or Rand measure is a measure of the similarity between two data clusterings.

Definition

Given a set of n elements S = \{O_1, \ldots, O_n\} and two partitions of S to compare, X = \{x_1, \ldots, x_r\} and Y = \{y_1, \ldots, y_s\}, we define the following:

  • a, the number of pairs of elements in S that are in the same set in X and in the same set in Y
  • b, the number of pairs of elements in S that are in different sets in X and in different sets in Y
  • c, the number of pairs of elements in S that are in the same set in X and in different sets in Y
  • d, the number of pairs of elements in S that are in different sets in X and in the same set in Y

The Rand index, R, is:

R = \frac{a+b}{a+b+c+d} = \frac{a+b}{{n \choose 2 }}

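Intuitively, one can think of a + b as the number of agreements between X and Y and c + d as the number of disagreements between X and Y. Every pair of elements falls into exactly one of the four categories, so a + b + c + d = {n \choose 2}, the total number of pairs.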
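The definition above can be sketched directly as a count over all pairs of elements. This is a minimal illustration rather than a reference implementation: it assumes each partition is given as a list of cluster labels, so that x_labels[i] names the set of X containing element O_i, and the function names pair_counts and rand_index are chosen here for illustration only.

```python
from itertools import combinations


def pair_counts(x_labels, y_labels):
    """Return (a, b, c, d) as defined above for two label lists."""
    if len(x_labels) != len(y_labels):
        raise ValueError("both partitions must label the same n elements")
    a = b = c = d = 0
    # Visit each unordered pair of elements exactly once.
    for i, j in combinations(range(len(x_labels)), 2):
        same_x = x_labels[i] == x_labels[j]
        same_y = y_labels[i] == y_labels[j]
        if same_x and same_y:
            a += 1  # same set in X, same set in Y
        elif not same_x and not same_y:
            b += 1  # different sets in X, different sets in Y
        elif same_x:
            c += 1  # same set in X, different sets in Y
        else:
            d += 1  # different sets in X, same set in Y
    return a, b, c, d


def rand_index(x_labels, y_labels):
    """Rand index R = (a + b) / (a + b + c + d)."""
    a, b, c, d = pair_counts(x_labels, y_labels)
    total = a + b + c + d  # equals n choose 2
    if total == 0:
        raise ValueError("at least two elements are needed")
    return (a + b) / total


# Four elements: X = {{O_1, O_2}, {O_3, O_4}}, Y = {{O_1, O_2, O_3}, {O_4}}
x = [0, 0, 1, 1]
y = [0, 0, 0, 1]
print(pair_counts(x, y))  # (1, 2, 1, 2)
print(rand_index(x, y))   # 0.5
```

Looping over every pair takes time quadratic in n, which is fine for small examples such as this one.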

The Rand index takes a value between 0 and 1: 0 indicates that the two clusterings do not agree on any pair of elements, and 1 indicates that the clusterings are exactly the same.
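The two extreme values can be checked on a small example. The sketch below assumes partitions are encoded as lists of cluster labels (one label per element) and uses a compact helper, here called rand_index, that counts the pairs on which both partitions agree. One way to reach 0 is to compare a partition that puts everything in a single set with one that puts every element in its own set. A value of 1 depends only on the grouping, not on the label names used.

```python
from itertools import combinations


def rand_index(x_labels, y_labels):
    """Fraction of element pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(x_labels)), 2))
    # A pair is an agreement when it is together in both partitions
    # or separated in both partitions.
    agreements = sum(
        (x_labels[i] == x_labels[j]) == (y_labels[i] == y_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


one_cluster = [0, 0, 0, 0]          # all elements in a single set
singletons = [0, 1, 2, 3]           # every element in its own set
relabelled = ["a", "b", "c", "d"]   # same grouping as singletons, new names

print(rand_index(one_cluster, singletons))  # 0.0
print(rand_index(singletons, relabelled))   # 1.0
```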

References

  • W. M. Rand, Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66, pp. 846–850 (1971).
  • K. Y. Yeung, W. L. Ruzzo, Details of the Adjusted Rand index and Clustering algorithms, Bioinformatics.