Talk:Boyer–Moore string search algorithm

From Wikipedia, the free encyclopedia

Contents

[edit] Second Table

The second table is poorly explained and may be incorrect. Can anyone look into this see if its done correctly on this page? Or better describe it. I have reread the description several times and still can't figure it out. Is the table correct?

[edit] Concrete Examples of second table

For the sake of newbies and clearity, I think that you should include a concrete example showing the search string and the text being searched for the string. The ANPANMAN example isn't clearly showing the formation of the second table. I am trying to clearly understand it, maybe I'll add the example.

[edit] First Table

The first table (δ1) gives the next offset at which a match could possibly occur given the last (nonmatching) character encountered while comparing the pattern with the text; the second (δ2) gives a similar offset based on the number of characters that matched correctly. When a mismatch occurs, the largest of these two values is used to skip ahead to the next position where a match could potentially occur.

Does anyone know the reason that the somewhat obscure references of δ1 and δ2 are used here? Were those perhaps the way they were referenced in the original publications about the algorithm? I'm replacing this text with a (hopefully!) more straightforward explanation but I'm wondering why they were referenced that way. -- Antaeus Feldspar 05:04, 19 Dec 2004 (UTC)
Yes δ1 and δ2 are the names a lot of technical papers give to these functions. -- 14:03, 13 Feb 2007 (UTC)

[edit] who and when

Does anyone know when the Boyer-Moore algorithm was first published, so we can state it? I added the full names of Boyer and Moore. RJFJR 03:16, Jan 12, 2005 (UTC)

Yes, 1977. See R. BOYER AND S. MOORE, A fast string matching algorithm, CACM, 20 (1977), pp. 762-772.

What about Gosper? Sedgewick, in "Algorithms" says Gosper independently invented it at about the same time. I've occasionally heard the algorithm called Boyer-Moore-Gosper. Anyone know more? 203.164.223.88 11:05, 17 December 2006 (UTC)

[edit] No need for the second jump table.

[edit] Worse-case performance is contradictory: Reference for 3n complexity proof

I cannot find the paper that proves it online, however http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps refers to it. Someone may wish to edit the page to provide that information.

The only online sources for the 1991 paper I could find required payment, but the CHPZ295.ps paper does have this information:
that in 1977 it was shown that the Boyer-Moore algorithm (BM) makes at most 6n comparisions, this was reduced to 4n in 1980 by Guibas and Odlyzko (ref GO80), and that in 1991 Cole et al had further proven the upper bound to be 3n − Ω(n / m). I don't understand the omega notation in this form. It goes on to cite paoers that show simple variants of BM have upper bounds of 2n. GO80 is A new proof of the linearity of the Boyer-Moore string searching algorithm, SIAM Journal of Computing 9 (1980) p672-682. And when I expanded the article's 3*N paragraph I noticed the preceeding paragraph talks about N*M which looks wrong, and contradicts the 3*N. -Wikibob 05:23, 18 December 2005 (UTC)

See the following two lines in the article:
"The worst-case performance of the algorithm to find all matches is approximately N*M."
and
"The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not."
I believe that, depending on the period of the pattern x, each of the above statements is true. If period(x) > |x|/2, then the worst case is O(3N) -> O(N). Otherwise, the worst case is O(N*M). I haven't posted this to the article page yet because I'm still learning about the algorithm. --Tekhnofiend 21:46, 2 December 2006 (UTC)

[edit] Wikipedia self-reference

I'm thinking that "WIKIPEDIA" should not be used as an example string, for the purpose of avoiding self-references. ~ Booyabazooka 20:06, 4 April 2006 (UTC)

Would be nice to have both example tables use the same word. 12.150.184.6 16:58, 10 April 2006 (UTC)

I agree that WIKIPEDIA should not be used as an example word, but that we should use the same word throughout. I'll fix it. Deco 06:03, 28 May 2006 (UTC)

[edit] About the example code

The example code I wrote on this page (the one with boyermoore_needlematch function) is quite slow.
I created it by literally translating the specs (the explanation how the algorithm works) into C code. I'm afraid it does not satisfy readers who are in search of a fast search algorithm example code. I would like it to be replaced with a code that is fast, but that is not unreadable.
The suffixes and preBmGs functions at [1] are very obfuscated in my opinion; they (and derivatives thereof) are not suitable for example code.
Is it possible to come up with a reasonably fast example that is not obfuscated? For comparison, the code at Boyer-Moore-Horspool algorithm is quite simple. It is this extra step, i.e. calculation of the "skip" array that is at question. --Bisqwit 08:59, 11 December 2006 (UTC)

[edit] better way to explain

I think the Boyer-Moore algorithm might be easier to understand if we:

  • explain how the first table works all by itself, Skip1 -- the Boyer-Moore-Horspool algorithm
  • explain how the second table works all by itself, Skip2 -- (or would it work all by itself? Is there a name for this algorithm? I suspect it might work well for searching bit-sequences or DNA base-pair sequences)

And finally summarize it by

  • the Boyer-Moore algorithm, whenever it hits a mismatch, skips ahead the *maximum* of Skip1 and Skip2 (since all shorter skips have been ruled out).

--68.0.120.35 14:49, 22 March 2007 (UTC)

[edit] First Table

Shouldn't the value of N be 3 in the first table in the article? If it's zero, then you get an infinite loop.

The second table

     ANPANMAN
     --------
           n  1
   aN         8
       mAN    3
   nMAN       6
  aNMAN       6
 pANMAN       6
nPANMAN       6

aNPANMAN 6 First 2 entries are slightly confusing. n 1 'n 1' means to match a character that is not 'n' then shift 1 to the left Second entry says 'aN 8' which means not the character a followed by 'N' needs 8 shifts to the left. But if you shift 8 to the left you don't match anything (the pattern is 8 characters long). This I understand - 'mAN 3' - not the character m followed by AN [pattern and partial pattern match]

ANPANMAN
     mAN<-fail    shift 0, partial pattern mAN matches and fails
    mAN <-fail    shift 1, 
   mAN  <-fail    shift 2
  mAN   <-success shift 3
nMAN 6
       <pattrn>


ANPANMAN----------------

           nMAN
          nMAN
         nMAN
        nMAN
       nMAN
      nMAN
     nMAN    <-success shift 6 from right, partial pattern nMAN only matches AN of pattern

==========Boyer-Moore algorithm in Robert Sedgewick 'Algorithms' is slightly different? ANPANMAN N=0 A=1 M=2 N=3 A=4 P=5 N=6 A=7 Ignore duplicates and use lowest gives N=0 A=1 M=2 P=5 (note: if you increment all these then you get the second table) All other letters =8, the length of the pattern It compares the last character of the pattern and the character of the text to search

 STING   <--pattern

CONSISTING<--text G compares with T T gives a shift value of 3 (T is 3 from right of pattern, assume last character position is 0) We move the pattern 3 to the right

    STING

CONSISTING Shift Table G=0 N=1 I=2 T=3 S=4 any other=5 Use the character in the text as an index to the skip table array. skip(0 to 255), 256 ascii characters? Initialise all to length of pattern Then go through pattern from right and set skip array index character to its position in the pattern. E.g. Asc("T")=84 'get ascii value of T skip(Asc("T"))=position in pattern string from right

    STING

CONSISTIGG----- G Compares with G N Compares with G Skip(Asc("G"))=0 Because we compare the second character in the pattern from the right we add 2, so we skip 2 to the right?

    STING

CONSIGTING---- S Compares with G Skip(Asc("G"))=0 + position of pattern compare character (in this case 5)? Sorry mangled text

[edit] Revised example programs

I revised the example functions for the Boyer-Moore and the Boyer-Moore-Horspool search algorithms and added a Turbo-Boyer-Moore example. However, I believe this code requires some refitting before it is suitable for an educational example. These are written in C++, and are fairly optimal. I have tested them extensively to ensure they produce correct results. --Bisqwit 08:27, 21 August 2007 (UTC)

[edit] Common functions

#include <vector>
 
typedef std::vector<size_t> occtable_type;
typedef std::vector<size_t> skiptable_type;
 
/* This function compares the two strings starting at ptr1 and ptr2,
 * which are assumed to be strlen bytes long, and returns a size_t
 * indicating how many bytes were identical, counting from the _end_
 * of both strings.
 * It is an utility function used by the actual Boyer-Moore algorithms.
 */
const size_t backwards_match_len(
    const unsigned char* ptr1,
    const unsigned char* ptr2,
    size_t strlen,
    size_t maxlen,
    size_t minlen)
{
    size_t result = minlen;
    while(result < maxlen && ptr1[strlen-1-result] == ptr2[strlen-1-result])
        ++result;
    return result;
}

[edit] Building of the tables

/* This function creates an occ table to be used by the search algorithms. */
/* It only needs to be created once per a needle to search. */
const occtable_type
    CreateOccTable(const unsigned char* needle, size_t needle_length)
{
    occtable_type occ(UCHAR_MAX+1, needle_length); // initialize a table of UCHAR_MAX+1 elements to value needle_length
 
    /* Populate it with the analysis of the needle */
    /* But ignoring the last letter */
    if(needle_length >= 1)
    {
        const size_t needle_length_minus_1 = needle_length-1;
        for(size_t a=0; a<needle_length_minus_1; ++a)
            occ[needle[a]] = needle_length_minus_1 - a;
    }
    return occ;
}
 
/* This function creates a skip table to be used by the search algorithms. */
/* It only needs to be created once per a needle to search. */
const skiptable_type
    CreateSkipTable(const unsigned char* needle, size_t needle_length)
{
    skiptable_type skip(needle_length, needle_length); // initialize a table of needle_length elements to value needle_length
 
    if(needle_length <= 1) return skip;
 
    /* I have absolutely no idea how this works. I just copypasted
     * it from http://www-igm.univ-mlv.fr/~lecroq/string/node14.html
     * and have since edited it in trying to seek clarity and efficiency.
     * In particular, the skip[] table creation is now interleaved within
     * building of the suff[] table, and the assumption that skip[] is
     * preinitialized into needle_length is utilized.
     * -Bisqwit
     */
 
    const size_t needle_length_minus_1 = needle_length-1;
 
    std::vector<ssize_t> suff(needle_length);
 
    suff[needle_length_minus_1] = needle_length;
 
    ssize_t f = 0;
    ssize_t g = needle_length_minus_1;
    size_t j = 0; // index for writing into skip[]
    for(ssize_t i = needle_length-2; i >= 0; --i) // For each suffix length
    {
        if(i > g)
        {
            const ssize_t tmp = suff[i + needle_length_minus_1 - f];
            if (tmp < i - g)
            {
                suff[i] = tmp;
                goto i_done;
            }
        }
        else
            g = i;
 
        f = i;
 
        g -= backwards_match_len(
                needle,
                needle + needle_length_minus_1 - f,
                g+1, // length of string to test
                g+1, // maximum match length
                0 // assumed minimum match length
              );
 
        suff[i] = f - g;
    i_done:;
 
        if(suff[i] == i+1) // This "if" matches sometimes. Less so on random data.
        {
            // Probably, this only happens when the needle contains self-similarity.
            size_t jlimit = needle_length_minus_1 - i;
            while(j < jlimit)
                skip[j++] = jlimit;
        }
    }
    /* j may be smaller than needle_length here. It doesn't matter, because
     * if we ran the skip[] initialization loop until j < needle_length (when i=-1),
     * we would only set the value as needle_length, which it already is.
     */
 
    for (size_t i = 0; i < needle_length_minus_1; ++i)
        skip[needle_length_minus_1 - suff[i]] = needle_length_minus_1 - i;
 
    return skip;
}

[edit] Boyer-Moore-Horspool search

/* A Boyer-Moore-Horspool search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchInHorspool(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;
    if(needle_length == 1)
    {
        const unsigned char* result = (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
 
    const size_t needle_length_minus_1 = needle_length-1;
 
    const unsigned char last_needle_char = needle[needle_length_minus_1];
 
    size_t haystack_position=0;
    while(haystack_position <= haystack_length-needle_length)
    {
        const unsigned char occ_char = haystack[haystack_position + needle_length_minus_1];
 
        if(last_needle_char == occ_char
        && std::memcmp(needle, haystack+haystack_position, needle_length_minus_1) == 0)
        {
            return haystack_position;
        }
 
        haystack_position += occ[occ_char];
    }
    return haystack_length;
}

[edit] Boyer-Moore search

/* A Boyer-Moore search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchIn(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const skiptable_type& skip,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;
 
    if(needle_length == 1)
    {
        const unsigned char* result =
            (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
 
    const size_t needle_length_minus_1 = needle_length-1;
 
    size_t haystack_position=0;
    while(haystack_position <= haystack_length-needle_length)
    {
        const size_t match_len = backwards_match_len(
            needle,
            haystack+haystack_position,
            needle_length, // length of string to test
            needle_length, // maximum match length
            0 // assumed minimum match length
           );
        if(match_len == needle_length) return haystack_position;
 
        const size_t mismatch_position = needle_length_minus_1 - match_len;
 
        const unsigned char occ_char = haystack[haystack_position + mismatch_position];
 
        const ssize_t bcShift = occ[occ_char] - match_len;
        const ssize_t gcShift = skip[mismatch_position];
 
        size_t shift = std::max(gcShift, bcShift);
 
        haystack_position += shift;
    }
    return haystack_length;
}

[edit] Turbo Boyer-Moore search

/* A Turbo Boyer-Moore search algorithm. */
/* If it finds the needle, it returns an offset to haystack from which
 * the needle was found. Otherwise, it returns haystack_length.
 */
const size_t SearchInTurbo(const unsigned char* haystack, const size_t haystack_length,
    const occtable_type& occ,
    const skiptable_type& skip,
    const unsigned char* needle,
    const size_t needle_length)
{
    if(needle_length > haystack_length) return haystack_length;
 
    if(needle_length == 1)
    {
        const unsigned char* result =
            (const unsigned char*)std::memchr(haystack, *needle, haystack_length);
        return result ? result-haystack : haystack_length;
    }
 
    const size_t needle_length_minus_1 = needle_length-1;
 
    size_t haystack_position = 0;
    size_t ignore_num = 0, shift = needle_length;
    while(haystack_position <= haystack_length-needle_length)
    {
        size_t match_len;
        if(ignore_num == 0)
        {
            match_len = backwards_match_len(
                needle,
                haystack+haystack_position,
                needle_length, // length of string to test
                needle_length, // maximum match length
                0 // assumed minimum match length
               );
            if(match_len == needle_length) return haystack_position;
        }
        else
        {
            match_len =
                backwards_match_len(needle, haystack+haystack_position,
                    needle_length, // length of string to test
                    shift, // maximum match length
                    0 // assumed minimum match length
                   );
 
            if(match_len == shift)
            {
                match_len =
                    backwards_match_len(needle, haystack+haystack_position,
                        needle_length, // length of string to test
                        needle_length, // maximum match length
                        shift + ignore_num // assumed minimum match length
                      );
            }
            if(match_len >= needle_length) return haystack_position;
        }
 
        const size_t mismatch_position = needle_length_minus_1 - match_len;
 
        const unsigned char occ_char = haystack[haystack_position + mismatch_position];
 
        const ssize_t bcShift = occ[occ_char] - match_len;
        const size_t gcShift  = skip[mismatch_position];
        const ssize_t turboShift = ignore_num - match_len;
 
        shift = std::max(std::max((ssize_t)gcShift, bcShift), turboShift);
 
        if(shift == gcShift)
            ignore_num = std::min( needle_length - shift, match_len);
        else
        {
            if(turboShift < bcShift && ignore_num >= shift)
                shift = ignore_num + 1;
            ignore_num = 0;
        }
        haystack_position += shift;
    }
    return haystack_length;
}

[edit] Multiple byte character sets?

The example implementation in the article uses an auto-allocated array called occ to hold the first table. But how big will occ be for a string encoded in UTF-16 or UTF-32? Should the article clarify that the example is intended for use with 8-bit encodings? I'd imagine that occ would have to be an associative array in such cases because it will be filled so sparsely. --Damian Yerrick (talk | stalk) 21:55, 28 December 2007 (UTC)

That is a valid point. In 16-bit or 32-bit implementations, I would probably use an associative array of 256-element arrays, i.e. something like this:
std::map<unsigned, array<size_t,256> > occ;
      /* To index: */
      unsigned ch = 0x672C; /* U+672C: Unicode CJK symbol of book */
      occ[ch >> 8][ch & 0xFF] += 1;
For the reason that even in multilanguage content, characters tend to be clustered around some particular region of unicode. (For example, in Russian text; most of the symbols fall to the cyrillic range.) Then again, how effective is Boyer-Moore multibyte situations? --Bisqwit (talk) 09:18, 2 January 2008 (UTC)

[edit] memset

It appears that the behavior of memset from the C standard library isn't well defined across platforms except

  1. when filling memory with a value of 0,
  2. when filling arrays of signed char or unsigned char, or
  3. when your platform's ABI guarantees that a type has a specific representation.

Thegeneralguy (talk · contribs) edited the sample code to use memset to fill an array of ssize_t objects with -1 values. This is not portable, and Bisqwit (talk · contribs) reverted the edit. Two months later, Thegeneralguy made the same change again, and I reverted it again. I guess Thegeneralguy's rationale for the changes is option 3: memset with fill value of -1 works on most widespread 16-, 32-, and 64-bit platforms, all of which use a two's complement representation for ssize_t.

So here's my question: Do we require strictly conforming C, or do we relax C and assume two's complement? I propose we use ISO C, which works everywhere and is easier to understand, as opposed to what I view as premature optimization. --Damian Yerrick (talk | stalk) 23:23, 7 March 2008 (UTC)

Now that you bring this up (I was unaware of this portability issue) I would be in favor of removing memset. However, memset is not premature optimization in the sense that something like this (filling up memory with a particular value) is the purpose of memset; memset(occ,-1,sizeof(occ)) should probably be used if it is decided to be used. If there is a wikipedia policy on this default to that.Thegeneralguy (talk) 23:46, 7 March 2008 (UTC)

I think the point of example code is to document the algorithm, and in my opinion, a for loop documents the purpose (setting each element of the array to some integer value) better than memset (setting each byte on a memory area to some byte value) does. That said, has anyone had a look at the replacement example code posted on this talk page? --Bisqwit (talk) 13:54, 11 March 2008 (UTC)

[edit] Correct code?

Is this code actually correct? It looks pretty buggy (or, at best, relies on a lucky bug to be functional). It isn't referenced--isn't it WP:OR and/or WP:NOT#HOWTO, then? -- Mikeblas (talk) 15:38, 4 May 2008 (UTC)

  • Can you please elaborate on what in it looks "pretty buggy" and what is the "lucky bug" it relies on? I wrote it based on the explanation in the article. A replacement suggestion is provided on this talk page, but nobody has commented on it. :-/ --Bisqwit (talk) 22:26, 4 May 2008 (UTC)