Fuzzy string searching
From Wikipedia, the free encyclopedia
Approximate string search is the name that is used for a category of techniques for finding strings that approximately match some given pattern string. It may also be known as approximate or inexact matching.
Note: In certain fields of practice, this type of technique is often mistakenly known as "fuzzy string search" or "fuzzy matching". The "fuzzy" misnomer has somehow been applied when in fact there is nothing fuzzy about searching strings. Characters in a string either exist or they do not. So there is no logical way to define them as fuzzy. Fuzzy is a term that is appropriate to entirely different fields (e.g. fuzzy logic; deals with linguistic variables and possibilities) where the application of probability and uncertain have entirely different meanings.
Moreover, the pioneering work on string search comes from computational biology. A quick search of that fields use of string search will reveal that the term "fuzzy" is almost never used. In fact, one of the landmark books in these techniques "Algorithms on Strings, Trees, and Sequences" by Dan Gusfield (ISBN 0-521-58519-8) never mentions the word fuzzy at all.
Approximate string searching has two different flavors:
- finding one or more matching substrings of a text, and
- finding similar strings in a string set often referred to as dictionary.
Approximate string searching has many application areas including information retrieval, spellchecking and computational biology .
Contents |
[edit] Similarity functions
The cornerstone of any approximate searching method is a similarity function. Among the most commonly used similarity functions are Levenshtein distance (a type of edit distance) and n-gram distance. The latter is based on counting of the number of common n-grams, and is used mostly for filtering. In contrast to n-gram distance, Levenshtein distance is a de-facto standard similarity function. It has several extensions. One well known extension is Damerau-Levenshtein distance that counts transposition as a single edit operation. Another extension is the so-called generalized or weighted Levenshtein distance. It assigns different costs to elementary edit operations. Ukkonen described even more sophisticated similarity function where edit operations go beyond single-character insertions, deletions and substitutions and include substitutions of arbitrary-length strings.
[edit] On-line vs. off-line
Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line. With on-line algorithms the pattern can be preprocessed before searching but the text cannot. In other words, on-line techniques do searching without indexation. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher dynamic programming but solve different problems. Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.
and by Sellers . Both algorithms are based onOn-line searching techniques were repeatedly improved. Perhaps, the most famous improvement is bitap algorithm (also known as shift-or and shift-and algorithm), which is very efficient for relatively short pattern strings. Bitap algorithm is the heart of Unix searching utility agrep. An excellent review of on-line searching algorithms was done by G. Navarro .
Although very fast on-line techniques exist their performance on large data is unacceptable. In its turn, text preprocessing, or in other words indexing, makes searching dramatically faster. Today, a variety of indexing algorithms are presented. Among them are suffix trees , metric trees and n-gram methods . For a detailed list of indexing techniques see the paper of Navarro et. al.
[edit] See also
- Soundex
- Agrep
- Spellchecker
- String searching algorithm
- Wildcard character
- Levenshtein distance
- Computer-assisted translation
[edit] References
- ↑ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1--23, Irvine, CA, Jun 1996.
- ↑ D. Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York, NY, USA, 1997.
- ↑ R. Baeza-Yates and G. Navarro. Fast Approximate String Matching in a Dictionary.Proc. SPIRE'98. IEEE CS Press, pages 14-22.
- ↑ G. Myers. A fast bit-vector algorithm for approximate string matching based on dynamic programming, Journal of the ACM (JACM) 46 (3), May 1999, 395 - 415.
- ↑ G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR) archive 33(1), pp 31-88, 2001.
- ↑ G. Navarro, Ricardo Baeza-Yates, E. Sutinen and J. Tarhio. Indexing Methods for Approximate String Matching.IEEE Data Engineering Bulletin 24(4):19-27, 2001.
- ↑ P.H. Sellers. The Theory and Computation of Evolutionary Distances: Pattern Recognition. Journal of Algorithms, 1:359-373, 1980.
- ↑ E. Ukkonen, Algorithms for approximate string matching. Information and Control 64, 100-118. 1985.
- ↑ R. Wagner and M. Fisher, The string-to-string correction problem, Journal of the association for computing machinery, vol. 21, pp. 168 173, 1974.
- ↑ J. Zobel, P. Dart. Finding approximate matches in large lexicons. Software-Practice & Experience 25(3), pp 331-345, 1995.
[edit] External links
- Efficient POSIX compliant regexp matching library with support for approximate matching
- Site devoted to fuzzy searching and information retrieval
- The description of Levenshtein algorithm
- Project Dedupe
- The Fuzzy Gazetteer: A fuzzy string search engine for place names worldwide
- Source code for n-gram based matching
- Siderite's Sift2: An empiric, but fairly accurate and very fast edit-distance algorithm