Talk:Boyer–Moore string search algorithm

From Wikipedia, the free encyclopedia

The first table (δ1) gives the next offset at which a match could possibly occur given the last (nonmatching) character encountered while comparing the pattern with the text; the second (δ2) gives a similar offset based on the number of characters that matched correctly. When a mismatch occurs, the largest of these two values is used to skip ahead to the next position where a match could potentially occur.

Does anyone know the reason that the somewhat obscure references of δ1 and δ2 are used here? Were those perhaps the way they were referenced in the original publications about the algorithm? I'm replacing this text with a (hopefully!) more straightforward explanation but I'm wondering why they were referenced that way. -- Antaeus Feldspar 05:04, 19 Dec 2004 (UTC)

Contents

[edit] who and when

Does anyone know when the Boyer-Moore algorithm was first published, so we can state it? I added the full names of Boyer and Moore. RJFJR 03:16, Jan 12, 2005 (UTC)

Yes, 1977. See R. BOYER AND S. MOORE, A fast string matching algorithm, CACM, 20 (1977), pp. 762-772.

[edit] No need for the second jump table.

[edit] Reference for 3n complexity proof

I cannot find the paper that proves it online, however http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps refers to it. Someone may wish to edit the page to provide that information.

The only online sources for the 1991 paper I could find required payment, but the CHPZ295.ps paper does have this information:
that in 1977 it was shown that the Boyer-Moore algorithm (BM) makes at most 6n comparisions, this was reduced to 4n in 1980 by Guibas and Odlyzko (ref GO80), and that in 1991 Cole et al had further proven the upper bound to be 3n − Ω(n / m). I don't understand the omega notation in this form. It goes on to cite paoers that show simple variants of BM have upper bounds of 2n. GO80 is A new proof of the linearity of the Boyer-Moore string searching algorithm, SIAM Journal of Computing 9 (1980) p672-682. And when I expanded the article's 3*N paragraph I noticed the preceeding paragraph talks about N*M which looks wrong, and contradicts the 3*N. -Wikibob 05:23, 18 December 2005 (UTC)

[edit] Wikipedia self-reference

I'm thinking that "WIKIPEDIA" should not be used as an example string, for the purpose of avoiding self-references. ~ Booyabazooka 20:06, 4 April 2006 (UTC)

Would be nice to have both example tables use the same word. 12.150.184.6 16:58, 10 April 2006 (UTC)

I agree that WIKIPEDIA should not be used as an example word, but that we should use the same word throughout. I'll fix it. Deco 06:03, 28 May 2006 (UTC)

[edit] Concrete Examples

For the sake of newbies and clearity, I think that you should include a concrete example showing the search string and the text being searched for the string. The ANPANMAN example isn't clearly showing the formation of the second table. I am trying to clearly understand it, maybe I'll add the example.

[edit] Worse-case performance is contradictory

See the following two lines in the article:
"The worst-case performance of the algorithm to find all matches is approximately N*M."
and
"The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not."
I believe that, depending on the period of the pattern x, each of the above statements is true. If period(x) > |x|/2, then the worst case is O(3N) -> O(N). Otherwise, the worst case is O(N*M). I haven't posted this to the article page yet because I'm still learning about the algorithm. --Tekhnofiend 21:46, 2 December 2006 (UTC)

[edit] About the example code

The example code I wrote on this page (the one with boyermoore_needlematch function) is quite slow.
I created it by literally translating the specs (the explanation how the algorithm works) into C code. I'm afraid it does not satisfy readers who are in search of a fast search algorithm example code. I would like it to be replaced with a code that is fast, but that is not unreadable.
The suffixes and preBmGs functions at [1] are very obfuscated in my opinion; they (and derivatives thereof) are not suitable for example code.
Is it possible to come up with a reasonably fast example that is not obfuscated? For comparison, the code at Boyer-Moore-Horspool algorithm is quite simple. It is this extra step, i.e. calculation of the "skip" array that is at question. --Bisqwit 08:59, 11 December 2006 (UTC)