Soundex
From Wikipedia, the free encyclopedia
Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm".
Soundex was developed by Robert Russell and Margaret Odell and patented in 1918 and 1922 (U.S. Patent 1,261,167 and U.S. Patent 1,435,663). A variation called American Soundex was used in the 1930s for a retrospective analysis of the US censuses from 1890 through 1920. The Soundex code came to prominence in the 1960s when it was the subject of several articles in the Communications and Journal of the Association for Computing Machinery (CACM and JACM), and especially when described in Donald Knuth's magnum opus, The Art of Computer Programming.
The Soundex code for a name consists of a letter followed by three digits: the letter is the first letter of the name, and the digits encode the remaining consonants. Similar-sounding consonants share the same digit so, for example, the labial consonants B, F, P and V are all encoded as 1. Vowels can affect the coding, but are never coded directly unless they appear at the start of the name.
The exact algorithm is as follows:
1. Retain the first letter of the string.
2. Remove all occurrences of the following letters, unless one is the first letter: a, e, h, i, o, u, w, y.
3. Assign numbers to the remaining letters (after the first) as follows:
   - b, f, p, v = 1
   - c, g, j, k, q, s, x, z = 2
   - d, t = 3
   - l = 4
   - m, n = 5
   - r = 6
4. If two or more letters with the same number were adjacent in the original name (before step 1), or adjacent except for any intervening h and w (American census only), then omit all but the first.
5. Return the first four characters, right-padding with zeroes if there are fewer than four.
The National Archives and Records Administration (NARA) maintains the rule set for the official implementation of Soundex used by the U.S. Government.
Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150".
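The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the official NARA implementation: the function name `soundex` and the `census_hw_rule` flag are hypothetical names chosen here, non-letters and non-ASCII letters are simply dropped, and an empty input returns an empty string. Rather than deleting letters first and comparing afterwards, the sketch walks the original name once and remembers the previous letter's digit, which gives the same result as comparing adjacency "in the original name".

```python
# Letter groups and their digits, as listed in the algorithm above.
_GROUPS = (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name, census_hw_rule=True):
    """Return a four-character Soundex code for `name` (illustrative sketch).

    census_hw_rule is an assumed flag name: when True, h and w do not
    separate two letters with the same digit (the American census variant).
    """
    # Assumption: keep only ASCII letters; anything else is ignored.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""

    first = letters[0]
    result = first.upper()
    # Digit of the previous letter in the original name ("" if uncoded).
    prev = CODES.get(first, "")

    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit:
            # Emit the digit unless it repeats the adjacent letter's digit.
            if digit != prev:
                result += digit
            prev = digit
        elif ch in "hw" and census_hw_rule:
            # h and w are transparent: leave `prev` unchanged.
            continue
        else:
            # A vowel or y (or h/w without the census rule) breaks adjacency.
            prev = ""

    # Keep four characters, right-padding with zeroes.
    return (result + "000")[:4]


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
print(soundex("Rubin"))   # R150
```

With the census rule enabled, `soundex("Ashcraft")` gives "A261", because the s and c are separated only by an h; with `census_hw_rule=False` the same name gives "A226".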
Soundex variants
A similar algorithm called "Reverse Soundex" prefixes the last letter of the name instead of the first.
The NYSIIS algorithm was introduced by the New York State Identification and Intelligence System as an improvement to the Soundex algorithm. NYSIIS handles some multi-character n-grams and maintains relative vowel positioning, whereas Soundex does not.
The Celko Improved Soundex algorithm was introduced by Joe Celko in his book SQL For Smarties: Advanced SQL Programming.
As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.
Daitch-Mokotoff Soundex (D-M Soundex) was developed by genealogist Gary Mokotoff and later improved by genealogist Randy Daitch because of problems they encountered while trying to apply the Russell Soundex to Jews with Germanic or Slavic surnames (such as Moskowitz vs. Moskovitz or Levine vs. Lewin). D-M Soundex is sometimes referred to as "Jewish Soundex" or "Eastern European Soundex" [1], although the authors discourage the use of these nicknames. The D-M Soundex algorithm can return as many as 32 individual phonetic encodings for a single name. Results of D-M Soundex are returned as six-digit, all-numeric codes. This algorithm is much more complex than Russell Soundex.
External links
- The Soundex Indexing System (U.S. National Archives and Records Administration)