Dice's coefficient

From Wikipedia, the free encyclopedia

Dice's coefficient (also known as the Dice coefficient) is a similarity measure related to the Jaccard index.

For sets X and Y of keywords used in information retrieval, the coefficient may be defined as:^[1]

$s = \frac{2 | X \cap Y |}{| X | + | Y |}$
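The set definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `dice_coefficient`, the example keyword sets, and the choice to return 1.0 for two empty sets are all assumptions made here, since the formula itself is undefined when both sets are empty.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Return 2|X ∩ Y| / (|X| + |Y|) for two sets."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical,
        # because the formula would otherwise divide by zero.
        return 1.0
    return 2 * len(x & y) / total


# Hypothetical keyword sets, chosen only for illustration.
doc_keywords = {"retrieval", "index", "query"}
query_keywords = {"query", "index", "ranking", "score"}

# Two shared keywords out of 3 + 4 in total: 2 * 2 / 7.
print(dice_coefficient(doc_keywords, query_keywords))
```

The value ranges from 0 (no shared elements) to 1 (identical sets).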

When taken as a string similarity measure, the coefficient may be calculated for two strings, x and y, using bigrams as follows:^[2]

$s = \frac{2 n_{t}}{n_{x} + n_{y}}$

where $n_{t}$ is the number of character bigrams found in both strings, $n_{x}$ is the number of bigrams in string x, and $n_{y}$ is the number of bigrams in string y. For example, to calculate the similarity between:

night

nacht

We would find the set of bigrams in each word:

{ni,ig,gh,ht}

{na,ac,ch,ht}

Each set has 4 elements, and the intersection of these two sets has only one element: ht.

Plugging these values into the formula gives $s = (2 \cdot 1) / (4 + 4) = 0.25$.
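The worked example can be reproduced with a short Python sketch. The helper names `bigrams` and `dice_string_similarity` are illustrative assumptions, and the sketch follows the set-based description in the example, so a bigram that occurs more than once in a string is counted only once. The handling of strings shorter than two characters is also an assumption.

```python
def bigrams(s: str) -> set:
    """Return the set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_string_similarity(x: str, y: str) -> float:
    """Return 2 * n_t / (n_x + n_y) over the bigram sets of x and y."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: with no bigrams on either side, fall back to
        # exact string equality to avoid dividing by zero.
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


print(sorted(bigrams("night")))                   # ['gh', 'ht', 'ig', 'ni']
print(sorted(bigrams("nacht")))                   # ['ac', 'ch', 'ht', 'na']
print(dice_string_similarity("night", "nacht"))   # 0.25
```

A multiset-based variant, which counts repeated bigrams separately, would give different results for strings such as "aaaa"; which variant is appropriate depends on the application.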

See also

Wikibooks: Algorithm implementation has a page on the topic of Dice's coefficient

Notes

^[1] C. J. van Rijsbergen (1979)
^[2] Kondrak, G. et al. (2003)

References

C. J. van Rijsbergen (1979) Information Retrieval (London: Butterworths)
Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46–48

Categories: Information retrieval | String similarity measures
