Dice's coefficient
From Wikipedia, the free encyclopedia
Dice's coefficient (also known as the Dice coefficient) is a similarity measure related to the Jaccard coefficient.
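For sets X and Y of keywords used in information retrieval, the coefficient may be defined as twice the number of keywords shared by the two sets, divided by the sum of the sizes of the two sets:[1]
$$s = \frac{2\,|X \cap Y|}{|X| + |Y|}$$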
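The set form can be sketched in a few lines of Python, taking the coefficient as twice the size of the intersection divided by the sum of the two set sizes. The function name `dice_coefficient` and the sample keywords are illustrative only, and returning 1.0 for two empty sets is an assumption, since the definition leaves that case undefined.

```python
def dice_coefficient(x: set, y: set) -> float:
    """Dice's coefficient of two sets: 2|X & Y| / (|X| + |Y|)."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return 2 * len(x & y) / total


# Hypothetical keyword sets: two shared keywords, sizes 3 and 4.
doc_a = {"information", "retrieval", "index"}
doc_b = {"information", "retrieval", "query", "ranking"}
score = dice_coefficient(doc_a, doc_b)  # 2 * 2 / (3 + 4)
```

The result lies between 0 (no shared keywords) and 1 (identical sets).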
When taken as a string similarity measure, the coefficient may be calculated for two strings x and y using bigrams as follows:[2]
$$s = \frac{2\,n_t}{n_x + n_y}$$
where $n_t$ is the number of character bigrams found in both strings, $n_x$ is the number of bigrams in string x and $n_y$ is the number of bigrams in string y. For example, the similarity between
night
nacht
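would be calculated as follows. The bigrams of night are ni, ig, gh and ht; those of nacht are na, ac, ch and ht. Only ht occurs in both strings, so s = (2 * 1) / (4 + 4) = 0.25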
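The bigram calculation can be reproduced with a short Python sketch. The helper names `bigrams` and `string_dice` are illustrative. The sketch assumes that each distinct bigram is counted once per string (a set rather than a multiset), which matches the night/nacht example but is only one possible reading of "number of bigrams". The handling of strings shorter than two characters is likewise an assumption.

```python
def bigrams(s: str) -> set:
    """Return the set of distinct adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def string_dice(x: str, y: str) -> float:
    """Dice's coefficient over character bigrams: 2*nt / (nx + ny)."""
    bx, by = bigrams(x), bigrams(y)
    total = len(bx) + len(by)
    if total == 0:
        # Assumption: with no bigrams at all, fall back to exact equality.
        return 1.0 if x == y else 0.0
    return 2 * len(bx & by) / total


print(string_dice("night", "nacht"))  # 0.25
```

For strings with repeated bigrams, a multiset count (for example with `collections.Counter`) would give a different value for the bigram counts, so the choice should be made to suit the application.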
References
- van Rijsbergen, C. J. (1979) Information Retrieval (London: Butterworths)
- Kondrak, G., Marcu, D. and Knight, K. (2003) "Cognates Can Improve Statistical Translation Models" in Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 46–48