Jaro-Winkler distance
From Wikipedia, the free encyclopedia
The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995).
The Jaro distance metric states that given two strings s1 and s2, their distance dj is:
- dj = (1/3) * (m/a + m/b + (m − t)/m)
where:
- m is the number of "matching" characters;
- a and b are the lengths of s1 and s2, respectively;
- t is the number of "transpositions".
Two characters from s1 and s2, respectively, are considered "matching" only if they are the same and are not farther apart than floor(max(a, b) / 2) − 1 positions.
Each character of s1 is compared with all its matching characters in s2. The number of matching characters that appear in a different order in the two strings, divided by two, defines the number of "transpositions".
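The definition above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name jaro, the greedy left-to-right matching order, and the handling of empty strings are assumptions that the definition itself does not fix.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score dj of two strings (1.0 means identical, 0.0 means no matches)."""
    a, b = len(s1), len(s2)
    if a == 0 and b == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    if a == 0 or b == 0:
        return 0.0

    # Equal characters match only if they are at most this many positions apart.
    window = max(max(a, b) // 2 - 1, 0)

    matched1 = [False] * a
    matched2 = [False] * b
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(b, i + window + 1)
        # Assumption: take the first unused equal character inside the window.
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Matched characters of each string, kept in their original order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    # t is half the number of positions at which the two sequences disagree.
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2

    return (m / a + m / b + (m - t) / m) / 3
```

For example, for "MARTHA" and "MARHTA" all six characters match (m = 6) and the pair T/H is out of order (t = 1), so dj = (1/3) * (6/6 + 6/6 + 5/6) = 17/18, or about 0.944.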
Jaro-Winkler distance uses a prefix scale p, which gives more favourable ratings to strings that match from the beginning, up to a set prefix length l. Given two strings s1 and s2, their Jaro-Winkler distance dw is:
- dw = dj + (l * p * (1 − dj))
where:
- dj is the Jaro distance for strings s1 and s2
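- l is the length of the common prefix at the start of the two strings, up to a maximum of 4 characters
- p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes; the standard value is p = 0.1, and p should not exceed 0.25, otherwise dw could become larger than 1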
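The prefix adjustment can be sketched in Python as follows. The function name jaro_winkler and its parameters are hypothetical; the Jaro score dj is passed in rather than recomputed, and the defaults p = 0.1 and a prefix cap of 4 characters are assumed conventional values, not something the formula itself requires.

```python
def jaro_winkler(s1: str, s2: str, dj: float,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the prefix adjustment to an already computed Jaro score dj."""
    # prefix_len is l in the text: the common prefix length, capped at max_prefix.
    prefix_len = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix_len == max_prefix:
            break
        prefix_len += 1
    # dw = dj + (l * p * (1 - dj))
    return dj + prefix_len * p * (1 - dj)
```

For "MARTHA" and "MARHTA", dj = 17/18 and the common prefix "MAR" gives l = 3, so dw = 17/18 + 3 * 0.1 * (1 − 17/18), or about 0.961.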
References
- Jaro, M. A. (1989). "Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida". Journal of the American Statistical Association 84: 414-420.
- Jaro, M. A. (1995). "Probabilistic linkage of large public health data files". Statistics in Medicine 14: 491-498.
- Winkler, W. E. (1999). "The state of record linkage and current research problems". Statistics of Income Division, Internal Revenue Service Publication R99/04.