Sørensen–Dice coefficient

The Sørensen–Dice index, also known by other names (see Name, below), is a statistic used for comparing the similarity of two samples. It was independently developed by the botanists Thorvald Sørensen^[1] and Lee Raymond Dice,^[2] who published in 1948 and 1945 respectively. It is also known as the F1 score or the Dice similarity coefficient (DSC).

Name

The index is known by several other names, especially the Sørensen index or Dice's coefficient. Variants of these names use "similarity coefficient" or "index" in place of "coefficient". Common alternate spellings for Sørensen are Sorenson, Soerenson and Sörenson, and all three also appear with the –sen ending.

Other names include:

Czekanowski's binary (non-quantitative) index^[3]
Zijdenbos similarity index^[4]^[5], referring to a 1994 paper of Zijdenbos et al.^[6]

Formula

Sørensen's original formula was intended to be applied to presence/absence data, and is

QS={\frac {2|X\cap Y|}{|X|+|Y|}}

where |X| and |Y| are the numbers of elements in the two samples and |X ∩ Y| is the number of elements common to both. When applied to Boolean data, using the counts of true positives (TP), false positives (FP) and false negatives (FN), it can be written as

DSC={\frac {2TP}{2TP+FP+FN}}

It differs from the Jaccard index, which counts true positives only once in both the numerator and the denominator. QS is the quotient of similarity and ranges between 0 and 1.^[7] It can be viewed as a similarity measure over sets.
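The set form and the TP/FP/FN form can be checked against each other with a minimal Python sketch. The function names `dice_sets` and `dice_counts` and the sample data are illustrative assumptions, as is the convention of returning 1.0 when both inputs are empty (the ratio itself is undefined in that case).

```python
def dice_sets(x: set, y: set) -> float:
    """QS = 2|X ∩ Y| / (|X| + |Y|) for two finite sets."""
    total = len(x) + len(y)
    if total == 0:
        # Both sets empty: 0/0 is undefined; 1.0 is a convention assumed here.
        return 1.0
    return 2 * len(x & y) / total


def dice_counts(tp: int, fp: int, fn: int) -> float:
    """DSC = 2TP / (2TP + FP + FN) from confusion counts."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        # No positives on either side: same assumed convention as above.
        return 1.0
    return 2 * tp / denom


# Hypothetical example: items predicted positive vs. items actually positive.
predicted = {1, 2, 3, 4}
actual = {3, 4, 5}

tp = len(predicted & actual)  # in both sets
fp = len(predicted - actual)  # predicted but not actual
fn = len(actual - predicted)  # actual but not predicted

print(dice_sets(predicted, actual))  # 2*2 / (4 + 3)
print(dice_counts(tp, fp, fn))       # 2*2 / (2*2 + 2 + 1)
```

Both calls evaluate 4/7, since |X| + |Y| = (TP + FP) + (TP + FN) = 2TP + FP + FN.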

Similarly to the Jaccard index, the set operations can be expressed in terms of vector operations over binary vectors A and B:

s_v = \frac{2 | A \cdot B |}{| A |^2 + | B |^2}

which gives the same result for binary vectors, where A · B counts the shared elements and |A|^2 is the squared norm of A, and also serves as a more general similarity measure over real-valued vectors.
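As a rough illustration of the vector form, the following Python sketch evaluates s_v on plain lists of numbers. The name `dice_vectors`, the handling of all-zero input, and the example vectors are assumptions made for this sketch.

```python
def dice_vectors(a, b):
    """s_v = 2|A · B| / (|A|^2 + |B|^2) for equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    norm_sq = sum(p * p for p in a) + sum(q * q for q in b)
    if norm_sq == 0:
        # Both vectors are all zeros: undefined ratio; 1.0 is an assumed convention.
        return 1.0
    return 2 * abs(dot) / norm_sq


# Binary indicator vectors for the sets {1,2,3,4} and {3,4,5} over the universe 1..5.
a = [1, 1, 1, 1, 0]
b = [0, 0, 1, 1, 1]
print(dice_vectors(a, b))  # dot = 2, squared norms = 4 + 3

# The same function also accepts real-valued vectors.
print(dice_vectors([1.0, 2.0], [2.0, 4.0]))  # dot = 10, squared norms = 5 + 20
```

For 0/1 vectors the squared norm equals the number of ones, so the result matches the set formula.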

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the shared information (intersection) over the sum of cardinalities, that is, 2|X ∩ Y| / (|X| + |Y|).^[8]

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y using bigrams as follows:^[9]

s = \frac{2 n_t}{n_x + n_y}

where n_t is the number of character bigrams found in both strings, n_x is the number of bigrams in string x and n_y is the number of bigrams in string y. For example, to calculate the similarity between:

night

nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}

{na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula gives s = (2 · 1) / (4 + 4) = 0.25.
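The bigram calculation above can be reproduced with a short Python sketch. It follows the worked example in treating the bigrams of each string as a set; the function names and the treatment of strings shorter than two characters are assumptions of this sketch.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_strings(x: str, y: str) -> float:
    """s = 2 n_t / (n_x + n_y) over the bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Neither string has a bigram; fall back to exact comparison (assumed convention).
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))        # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))        # ['ac', 'ch', 'ht', 'na']
print(dice_strings("night", "nacht"))  # 0.25
```

Because sets discard repeated bigrams, a variant that counts duplicates (a multiset) would give different values for strings such as "aaaa".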

Difference from Jaccard

This coefficient is similar in form to the Jaccard index. In fact, the two are equivalent in the sense that, given a value $S$ of the Sørensen–Dice coefficient, one can calculate the corresponding Jaccard index value $J$ and vice versa, using the equations $J=S/(2-S)$ and $S=2J/(1+J)$.
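The two conversion equations follow from writing a = |X ∩ Y| and n = |X| + |Y|, so that S = 2a/n and J = a/(n - a). A small Python sketch, with function names and example sets chosen here purely for illustration, checks them numerically:

```python
import math


def dice_to_jaccard(s: float) -> float:
    """J = S / (2 - S)."""
    return s / (2 - s)


def jaccard_to_dice(j: float) -> float:
    """S = 2J / (1 + J)."""
    return 2 * j / (1 + j)


# Hypothetical example sets.
x = {1, 2, 3, 4}
y = {3, 4, 5}

s = 2 * len(x & y) / (len(x) + len(y))  # Sørensen–Dice: 4/7
j = len(x & y) / len(x | y)             # Jaccard: 2/5

print(math.isclose(dice_to_jaccard(s), j))  # True
print(math.isclose(jaccard_to_dice(j), s))  # True
```

Both maps are increasing on [0, 1], so the two indices always rank pairs of samples in the same order.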

Since the distance derived from the Sørensen–Dice coefficient does not satisfy the triangle inequality, the coefficient can be considered a semimetric version of the Jaccard index.^[3]

The function ranges between zero and one, like Jaccard. Unlike Jaccard, the corresponding difference function

d = 1 - \frac{2 | X \cap Y |}{| X | + | Y |}

is not a proper distance metric, as it does not satisfy the triangle inequality.^[3] The simplest counterexample is given by the three sets {a}, {b}, and {a,b}: the distance between the first two is 1, and the distance between the third and each of the others is one-third. To satisfy the triangle inequality, the sum of any two of these three sides must be greater than or equal to the remaining side. However, the distance between {a} and {a,b} plus the distance between {b} and {a,b} equals 2/3, which is less than the distance between {a} and {b}, which is 1.
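The counterexample can be verified directly. The Python sketch below uses illustrative function names and assumes non-empty sets; it computes the three distances and, for comparison, the corresponding Jaccard distances 1 - |X ∩ Y| / |X ∪ Y|.

```python
def dice_distance(x: set, y: set) -> float:
    """d = 1 - 2|X ∩ Y| / (|X| + |Y|); assumes at least one set is non-empty."""
    return 1 - 2 * len(x & y) / (len(x) + len(y))


def jaccard_distance(x: set, y: set) -> float:
    """1 - |X ∩ Y| / |X ∪ Y|; assumes at least one set is non-empty."""
    return 1 - len(x & y) / len(x | y)


a, b, ab = {"a"}, {"b"}, {"a", "b"}

# Sørensen–Dice: the two short sides sum to about 2/3, less than the long side of 1.
detour = dice_distance(a, ab) + dice_distance(ab, b)
direct = dice_distance(a, b)
print(detour < direct)  # True: the triangle inequality is violated

# Jaccard: the two short sides are 1/2 each, so the inequality holds with equality.
j_detour = jaccard_distance(a, ab) + jaccard_distance(ab, b)
j_direct = jaccard_distance(a, b)
print(j_detour >= j_direct)  # True
```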

Applications

The Sørensen–Dice coefficient is useful for ecological community data (e.g. Looman & Campbell, 1960^[10]). Justification for its use is primarily empirical rather than theoretical (although it can be justified theoretically as the intersection of two fuzzy sets^[11]). Compared with Euclidean distance, the Sørensen distance retains sensitivity in more heterogeneous data sets and gives less weight to outliers.^[12] The Dice score (and its variations, e.g. logDice, which takes a logarithm of it) has also become popular in computer lexicography for measuring the lexical association score of two given words.^[13] It is also commonly used in image segmentation, in particular for comparing algorithm output against reference masks in medical applications.^[14]

Abundance version

The expression is easily extended to abundance instead of presence/absence of species. This quantitative version is known by several names:

Quantitative Sørensen–Dice index^[3]
Quantitative Sørensen index^[3]
Quantitative Dice index^[3]
Bray–Curtis similarity (1 minus the Bray-Curtis dissimilarity)^[3]
Czekanowski's quantitative index^[3]
Steinhaus index^[3]
Pielou's percentage similarity^[3]
1 minus the Hellinger distance^[15]
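The article does not spell out the quantitative formula. One common form, assumed here, is the Bray–Curtis similarity, which replaces the intersection size with the sum of element-wise minima of the abundances: 2 Σ min(x_i, y_i) / (Σ x_i + Σ y_i). The Python sketch below uses hypothetical species counts and an illustrative function name.

```python
def quantitative_dice(x, y):
    """Bray–Curtis similarity (assumed form): 2 * sum(min(x_i, y_i)) / (sum(x) + sum(y)).

    x and y are equal-length lists of non-negative abundances, one entry per species.
    """
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    total = sum(x) + sum(y)
    if total == 0:
        # No individuals at either site: undefined ratio; 1.0 is an assumed convention.
        return 1.0
    shared = sum(min(p, q) for p, q in zip(x, y))
    return 2 * shared / total


# Hypothetical counts of three species at two sites.
site_1 = [6, 7, 4]
site_2 = [10, 0, 6]
print(quantitative_dice(site_1, site_2))  # 2 * (6 + 0 + 4) / (17 + 16)

# With 0/1 presence data the value reduces to the original set formula.
print(quantitative_dice([1, 1, 1, 1, 0], [0, 0, 1, 1, 1]))  # 2*2 / (4 + 3)
```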

References

↑ Sørensen, T. (1948). "A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons". Kongelige Danske Videnskabernes Selskab. 5 (4): 1–34.
↑ Dice, Lee R. (1945). "Measures of the Amount of Ecologic Association Between Species". Ecology. 26 (3): 297–302. JSTOR 1932409. doi:10.2307/1932409.
↑ Gallagher, E.D. (1999). COMPAH Documentation. University of Massachusetts, Boston.
↑ Prescott, J.W.; Pennell, M.; Best, T.M.; Swanson, M.S.; Haq, F.; Jackson, R.; Gurcan, M.N. (2009). An automated method to segment the femur for osteoarthritis research. IEEE. doi:10.1109/iembs.2009.5333257.
↑ Swanson, M.S.; Prescott, J.W.; Best, T.M.; Powell, K.; Jackson, R.D.; Haq, F.; Gurcan, M.N. (2010). "Semi-automated segmentation to assess the lateral meniscus in normal and osteoarthritic knees". Osteoarthritis and Cartilage. Elsevier BV. 18 (3): 344–353. ISSN 1063-4584. doi:10.1016/j.joca.2009.10.004.
↑ Zijdenbos, A.P.; Dawant, B.M.; Margolin, R.A.; Palmer, A.C. (1994). "Morphometric analysis of white matter lesions in MR images: method and validation". IEEE Transactions on Medical Imaging. Institute of Electrical and Electronics Engineers (IEEE). 13 (4): 716–724. ISSN 0278-0062. doi:10.1109/42.363096.
↑ http://www.sekj.org/PDF/anbf40/anbf40-415.pdf
↑ van Rijsbergen, Cornelis Joost (1979). Information Retrieval. London: Butterworths. ISBN 3-642-12274-4.
↑ Kondrak, Grzegorz; Marcu, Daniel; Knight, Kevin (2003). "Cognates Can Improve Statistical Translation Models" (PDF). Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. pp. 46–48.
↑ Looman, J. and Campbell, J.B. (1960) Adaptation of Sorensen's K (1948) for estimating unit affinities in prairie vegetation. Ecology 41 (3): 409–416.
↑ Roberts, D.W. (1986) Ordination on the basis of fuzzy set theory. Vegetatio 66 (3): 123–131.
↑ McCune, Bruce & Grace, James (2002) Analysis of Ecological Communities. Mjm Software Design; ISBN 0-9721290-0-6.
↑ Rychlý, P. (2008) A lexicographer-friendly association score. Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing RASLAN 2008: 6–9
↑ Zijdenbos, A.P.; Dawant, B.M.; Margolin, R.A.; Palmer, A.C. (1994). "Morphometric analysis of white matter lesions in MR images: method and validation". IEEE Transactions on Medical Imaging. 13 (4): 716–724.
↑ Bray, J. Roger; Curtis, J. T. (1957). "An Ordination of the Upland Forest Communities of Southern Wisconsin". Ecological Monographs. 27 (4): 326–349. doi:10.2307/1942268.

External links

The Wikibook Algorithm implementation has a page on the topic of: Dice's coefficient

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.