Talk:Boyer–Moore string search algorithm
From Wikipedia, the free encyclopedia
[edit] Second Table
The second table is poorly explained and may be incorrect. Can anyone look into this and see if it's done correctly on this page? Or describe it better? I have reread the description several times and still can't figure it out. Is the table correct?
[edit] Concrete Examples of second table
For the sake of newbies and clarity, I think that you should include a concrete example showing the search string and the text being searched. The ANPANMAN example doesn't clearly show the formation of the second table. I am trying to understand it clearly; maybe I'll add the example.
[edit] First Table
The first table (δ1) gives the next offset at which a match could possibly occur given the last (nonmatching) character encountered while comparing the pattern with the text; the second (δ2) gives a similar offset based on the number of characters that matched correctly. When a mismatch occurs, the largest of these two values is used to skip ahead to the next position where a match could potentially occur.
- Does anyone know the reason that the somewhat obscure references of δ1 and δ2 are used here? Were those perhaps the way they were referenced in the original publications about the algorithm? I'm replacing this text with a (hopefully!) more straightforward explanation but I'm wondering why they were referenced that way. -- Antaeus Feldspar 05:04, 19 Dec 2004 (UTC)
- Yes δ1 and δ2 are the names a lot of technical papers give to these functions. -- 14:03, 13 Feb 2007 (UTC)
[edit] who and when
- Does anyone know when the Boyer-Moore algorithm was first published, so we can state it? I added the full names of Boyer and Moore. RJFJR 03:16, Jan 12, 2005 (UTC)
Yes, 1977. See R. BOYER AND S. MOORE, A fast string matching algorithm, CACM, 20 (1977), pp. 762-772.
What about Gosper? Sedgewick, in "Algorithms" says Gosper independently invented it at about the same time. I've occasionally heard the algorithm called Boyer-Moore-Gosper. Anyone know more? 203.164.223.88 11:05, 17 December 2006 (UTC)
[edit] No need for the second jump table.
- Please refer to http://www.movsd.com/bm.htm -- 83.237.28.135 2005-05-02 16:33:20
- The version without the second table is the Boyer-Moore-Horspool algorithm. The real Boyer-Moore algorithm requires both tables. --Bisqwit 11:01, 22 May 2006 (UTC)
[edit] Worst-case performance is contradictory: Reference for 3n complexity proof
I cannot find the paper that proves it online, however http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps refers to it. Someone may wish to edit the page to provide that information.
- The only online sources for the 1991 paper I could find required payment, but the CHPZ95.ps paper does have this information:
- that in 1977 it was shown that the Boyer-Moore algorithm (BM) makes at most 6n comparisons; this was reduced to 4n in 1980 by Guibas and Odlyzko (ref GO80), and in 1991 Cole et al. further proved the upper bound to be 3n − Ω(n / m). I don't understand the omega notation in this form. It goes on to cite papers that show simple variants of BM have upper bounds of 2n. GO80 is A new proof of the linearity of the Boyer-Moore string searching algorithm, SIAM Journal of Computing 9 (1980) pp. 672-682. And when I expanded the article's 3*N paragraph I noticed the preceding paragraph talks about N*M, which looks wrong and contradicts the 3*N. -Wikibob 05:23, 18 December 2005 (UTC)
See the following two lines in the article:
"The worst-case performance of the algorithm to find all matches is approximately N*M."
and
"The worst-case to find all occurrences in a text needs approximately 3*N comparisons, hence the complexity is O(n), regardless whether the text contains a match or not."
I believe that, depending on the period of the pattern x, each of the above statements is true. If period(x) > |x|/2, then the worst case is O(3N) -> O(N). Otherwise, the worst case is O(N*M). I haven't posted this to the article page yet because I'm still learning about the algorithm. --Tekhnofiend 21:46, 2 December 2006 (UTC)
[edit] Wikipedia self-reference
I'm thinking that "WIKIPEDIA" should not be used as an example string, for the purpose of avoiding self-references. ~ Booyabazooka 20:06, 4 April 2006 (UTC)
Would be nice to have both example tables use the same word. 12.150.184.6 16:58, 10 April 2006 (UTC)
- I agree that WIKIPEDIA should not be used as an example word, but that we should use the same word throughout. I'll fix it. Deco 06:03, 28 May 2006 (UTC)
[edit] About the example code
The example code I wrote on this page (the one with boyermoore_needlematch function) is quite slow.
I created it by literally translating the specs (the explanation how the algorithm works) into C code. I'm afraid it does not satisfy readers who are in search of a fast search algorithm example code. I would like it to be replaced with a code that is fast, but that is not unreadable.
The suffixes and preBmGs functions at [1] are very obfuscated in my opinion; they (and derivatives thereof) are not suitable for example code.
Is it possible to come up with a reasonably fast example that is not obfuscated? For comparison, the code at Boyer-Moore-Horspool algorithm is quite simple. It is this extra step, i.e. the calculation of the "skip" array, that is in question. --Bisqwit 08:59, 11 December 2006 (UTC)
[edit] better way to explain
I think the Boyer-Moore algorithm might be easier to understand if we:
- explain how the first table works all by itself, Skip1 -- the Boyer-Moore-Horspool algorithm
- explain how the second table works all by itself, Skip2 -- (or would it work all by itself? Is there a name for this algorithm? I suspect it might work well for searching bit-sequences or DNA base-pair sequences)
And finally summarize it by
- the Boyer-Moore algorithm, whenever it hits a mismatch, skips ahead the *maximum* of Skip1 and Skip2 (since all shorter skips have been ruled out).
--68.0.120.35 14:49, 22 March 2007 (UTC)