Jaro-Winkler distance

From Wikipedia, the free encyclopedia

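The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995). Despite the name, a higher score means the strings are more similar: the score is normalized so that 0 means no similarity and 1 means an exact match.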

Given two strings s1 and s2, their Jaro distance dj is 0 if they have no matching characters (m = 0); otherwise it is:

d_j = \frac{m}{3a} + \frac{m}{3b} + \frac{m-t}{3m}

where:

  • m is the number of "matching" characters;
  • a and b are the lengths of s1 and s2, respectively;
  • t is the number of "transpositions".

Two characters from s1 and s2 respectively are considered "matching" only if they are the same and their positions are no more than \left\lfloor\frac{\max(a,b)}{2}\right\rfloor-1 apart.

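The matching characters of s1 are then compared, in order, with the matching characters of s2. The number of matching characters that occupy different positions in the two sequences, divided by two, defines the number of "transpositions". For example, in MARTHA and MARHTA all six characters match, but T and H appear in swapped order, so m = 6 and t = 2/2 = 1.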
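The definition above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name jaro is hypothetical, the match window is taken as the integer part of max(a,b)/2 minus 1 (clamped at 0), each character is allowed to match at most once, and the handling of empty strings is an assumed convention that the definition does not cover.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score of s1 and s2, between 0.0 (no similarity) and 1.0 (identical)."""
    a, b = len(s1), len(s2)
    if a == 0 or b == 0:
        # Assumed convention: two empty strings are identical, otherwise no match.
        return 1.0 if a == b else 0.0

    # Maximum distance allowed between the positions of two matching characters.
    window = max(max(a, b) // 2 - 1, 0)

    matched1 = [False] * a
    matched2 = [False] * b
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(b, i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 may be matched at most once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Matched characters of each string, kept in their original order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    # Positions where the two sequences disagree, divided by two, give t.
    t = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) / 2

    return m / (3 * a) + m / (3 * b) + (m - t) / (3 * m)


print(round(jaro("MARTHA", "MARHTA"), 4))  # m = 6, t = 1
print(round(jaro("DIXON", "DICKSONX"), 4))  # m = 4, t = 0
```

For MARTHA and MARHTA the sketch finds m = 6 and t = 1, so dj = 6/18 + 6/18 + 5/18 = 17/18, or about 0.944.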

The Jaro-Winkler distance uses a prefix scale p, which gives more favourable ratings to strings that share a common prefix of length l. Given two strings s1 and s2, their Jaro-Winkler distance dw is:

d_w = d_j + l \cdot p \cdot (1 - d_j)

where:

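  • dj is the Jaro distance for strings s1 and s2;
  • l is the length of the common prefix at the start of the two strings, up to a maximum of 4 characters;
  • p is a constant scaling factor for how much the score is adjusted upwards for having a common prefix. The standard value in Winkler's work is p = 0.1; p should not exceed 0.25, otherwise the distance can become larger than 1.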
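The prefix adjustment can be sketched in Python as below. The function name winkler_adjust and its parameters are hypothetical. The defaults p = 0.1 and a prefix cap of 4 characters are the values commonly attributed to Winkler and are treated here as assumptions. The Jaro score dj is passed in rather than recomputed, so the sketch shows only the adjustment step.

```python
def winkler_adjust(s1: str, s2: str, dj: float,
                   p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the prefix adjustment dw = dj + l * p * (1 - dj) to a Jaro score dj."""
    # l * p must stay at or below 1, otherwise the result could exceed 1.
    if p < 0 or p * max_prefix > 1:
        raise ValueError("p must satisfy 0 <= p * max_prefix <= 1")

    # l: length of the common prefix, capped at max_prefix (assumed cap of 4).
    l = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or l == max_prefix:
            break
        l += 1

    return dj + l * p * (1 - dj)


# MARTHA / MARHTA: dj = 17/18 and the common prefix is "MAR", so l = 3.
print(round(winkler_adjust("MARTHA", "MARHTA", 17 / 18), 4))
```

With dj = 17/18, l = 3 and p = 0.1, the adjustment gives dw = 17/18 + 0.3 × (1/18), or about 0.961.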

References

  • Jaro, M. A. (1989). "Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida". Journal of the American Statistical Association 84 (406): 414-420.
  • Jaro, M. A. (1995). "Probabilistic linkage of large public health data files". Statistics in Medicine 14: 491-498.
