Fuzzy string searching

From Wikipedia, the free encyclopedia

Fuzzy string search is the name that is used for a category of techniques for finding strings that approximately match some given pattern string. It may also be known as approximate or inexact matching.

Approximate string searching has two different flavors:

finding one or more matching substrings of a text, and
finding similar strings in a string set, often referred to as a dictionary.

Approximate string searching has many application areas including information retrieval, spellchecking and computational biology [1].

Contents

[edit] Similarity functions

The cornerstone of any approximate searching method is a similarity function or string metric. Among the most commonly used similarity functions are Levenshtein distance (a type of edit distance) and n-gram distance. The latter is based on counting the number of common n-grams, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen [2] described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.

[edit] On-line vs. off-line

Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques do searching without indexation. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher[3] and by Sellers. [4] Both algorithms are based on dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.

On-line searching techniques were repeatedly improved. Perhaps the most famous improvement is the bitap algorithm (also known as the shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. The Bitap algorithm is the heart of the Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro.[5]

Although very fast on-line techniques exist, their performance on large data is unacceptable. Text preprocessing or indexing makes searching dramatically faster. Today, a variety of indexing algorithms are presented. Among them are suffix trees[6], metric trees[7] and n-gram methods.[8][9] For a detailed list of indexing techniques see the paper of Navarro et al.[10]

[edit] See also

[edit] References

[edit] External links

Languages