Talk:Knuth–Morris–Pratt algorithm

From Wikipedia, the free encyclopedia

Mistake? "Thus in the fourth step, m = 0 and i = 3." should that be m = 3 and i = 3?


First, I've removed the rather antique discussion topic that was already here, as the article has been revised substantially since then and anyway, it was more than a year old.

Second, are there any thoughts on whether I should replace the verbal algorithms with the plain C code? Barely any knowledge of C is necessary to comprehend what is going on in the code; I have tried to write it in such a way that it avoids all idioms even at the cost of being a little too verbose, and the language itself is self-documenting in its usage for the most part. But now that the code is there it seems to reproduce very closely the English-language algorithms and creates a redundancy. I think, given the choice, I would rather keep the code, since it's clearer for the most part, in addition to being more useful, but then, I am comfortable with C and others may not be. So, keep both forms, or just one?

The verbal description reflects the C description too closely, probably because it was written to the C description. Replace it with something more high-level, if you can. No verbal description should say stuff like "if i > 0, set i = e." Deco 07:59, 8 December 2005 (UTC)
There are two verbal descriptions of the code; the lengthy example was the first thing written (actually, it was the reason I rewrote the article, since in the original it was all there was, and quite bad). The second verbal description, which I wrote in vague emulation of Knuth's style for describing algorithms, was written to the C code after I decided that it was unwise to have the only formal description of the algorithm written in a single programming language that some casual readers might not know. Those who do know can verify the two are the same and, if curious, can compile my code and run the algorithm. Ryan Reich 14:12, 21 February 2006 (UTC)

In the The efficiency of the search algorithm section, the second paragraph starts:

  • Using this fact, we will show that the loop can execute at most l times.

but the third paragraph has this contradiction:

  • Thus the loop executes at most 2l times,

It seems it should be 2l times (for example searching AAAAA for AB, if I've read the code right), so I changed it. Secondly I tried but failed to make the math tags work on the letters l, m i, and so on, but they stayed obstinately unclear (1 l and I looked practically indistinguisale on my browser) so I resorted to boldifying each math term. I guess there's some preference setting for making the < math > terms look consistent? -Wikibob 02:23, 18 December 2005 (UTC)

That's a fine solution. Computer users have been unable to distinguish the letters l and I, and the number 1, for at least the last fifteen years. I blame the invention of Times New Roman. Ryan Reich 14:12, 21 February 2006 (UTC)


Is the line below supposed to read "T[i-1]" instead of "T[i+1]"?--

"...T[i] in the code is equivalent to T[i + 1] above..."

Yeah... Ryan Reich 14:12, 21 February 2006 (UTC)


Shouldn't O(n) + O(1) be O(n) and not O(n+1)? Constants are not taken into account in O() notation. PedroCR

You're talking about the very last section? That's not a 1 ("one"), it's a l ("ell"). It's the length of the string. Ryan Reich 12:10, 10 April 2006 (UTC)
Maybe it would be better to write uppercase 'L' and 'N', but lowercase 'i'? T0ljan 12:53, 17 April 2006 (UTC)

Contents

[edit] Code -> pseudocode

Since some IP editor decided that it would be neat to write Java code for this algorithm, I realized that having any language-specific code in here is a bad idea. Of course, the Java code was practically identical to the C code, with minor differences mostly centered on how the length of a string is determined, so it added nothing. I've replaced my C code, their Java code, and also my previous attempt at pseudocode with new pseudocode formatted as suggested in the WikiProject Computer Science manual of style. If you are a future editor reading this article, and for some reason you also read the talk page before editing, please don't implement the algorithm in your favorite language, as it adds nothing. I've also warned you in the page's source. Ryan Reich 19:10, 10 August 2006 (UTC)

[edit] History of the algorithm

The algorithm described in Item 179 of HAKMEM seems quite similar to KMP, despite that HAKMEM is from 1972 so it predates the publication by KMP. Were variants of the algorithm for fast string matching known long before this publication or am I misinterpreting something? Should this be mentioned in the article? Thanks, – b_jonas 19:07, 9 November 2006 (UTC)

I've now also asked this at the Reference desk/Mathematics. – b_jonas 15:53, 29 September 2007 (UTC)

[edit] The efficiency of the KMP algorithm

The comment about the Boyer-Moore algorithm's worst case:

"...while the Boyer-Moore algorithm achieves its best case at the great expense of its worst."

is incorrect. The worst case is still linear in the text size. Sustik 22:28, 15 June 2007 (UTC)

So fix it? Ryan Reich 22:07, 16 June 2007 (UTC)
I came here precisely to point out the same thing, and seeing that I'm
not the first, I will fix it. So there. :) 128.210.4.214 (talk) 18:58, 7 March 2008 (UTC)
Funny, I was just remembering this question this morning, after months of neglect. I'm glad to see my goading has encouraged people to action :) Aside from my enthusiasm for the KMP algorithm I am not even distantly connected to this area, and I don't know how these algorithms compare; it seemed, from a long-ago look at the Boyer-Moore page, that the claim was right, but I guess not. Since obviously I'm uninformed, I didn't feel like I could make the change. Ryan Reich (talk) 19:25, 7 March 2008 (UTC)

The confusion may involve the fact that the version of Boyer-Moore in some textbooks is a simplified one with a slower worst case, not the full linear-time algorithm. —David Eppstein (talk) 19:17, 7 March 2008 (UTC)

[edit] Terminal substring

What is terminal substring (this term is used w/o definition and I cannot find one with google)? 217.21.164.43 09:18, 10 October 2007 (UTC)

Perhaps I was writing too much under the influence of mathematical jargon (though you won't find it there either). "Terminal" just means "at the end", so a "terminal substring" of a string S is a substring located at the end of S. For example if S is the string "abcdefg", then the substrings "efg" and even "bcdefg" are terminal, whereas "a" and "cde" are not (to pick just a few examples). Ryan Reich 13:40, 10 October 2007 (UTC)
Somehow, I'd feel a LOT more comfortable if somebody used "proper prefix" or "proper suffix" rather than "proper initial substring" or "proper terminal substring". I mean, it's a mouthful of syllables that clutters the rest of the article --202.168.251.139 16:22, 10 October 2007 (UTC)
Here's an opportunity to gain comfort with editing Wikipedia, then. Why don't you change the terminology you don't like to something clearer? Ryan Reich 18:13, 10 October 2007 (UTC)

[edit] How to go on

Well, the pseudocode doesn't state how far to shift once we found a match (there could be more). I've learned the algo with a table building function that returns an array that is one longer than the pattern: e.g.

 ABCDABCD
-10000123

wouldn't tell you that you have to shift by 4 to find a potential match in ABCDABCDABCD, but

 ABCDABCD
-100001234

does. I think there is something wrong here. Or did I miss something? Cheers, 88.73.111.77 16:49, 3 November 2007 (UTC)

Well, I don't get your example since the string ABCDABCDABCD contains the pattern ABCDABCD, so in fact no shifting is ever necessary. However, say the text were ABCDABCCABCDABCD (in other words, it fails when matching the second D when starting from the beginning of the text). So we do the following:
m: 0123456789012345
S: ABCDABCCABCDABCD
W: ABCDABCD
i: 01234567
and you see that with m = 0, the search fails once when i = 7. According to the pseudocode, we then replace m with m + i - T[i] which is 0 + 7 - 3 = 4. And indeed, that where the first false match "ABC..." begins. So all is well. I certainly hope the code is right, since I agonized over it for some days when writing this article, including writing it into an actual C program and testing it. I'm fairly sure I've got the indices correct. I also think the "worked example" is general enough that the details of the algorithm can be verified by checking them against the computations done there. Do they not match up after all? Ryan Reich 19:26, 3 November 2007 (UTC)

[edit] Error in "working example of ..."?

In the following text: "As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character.", shouldn't it be "m = 10, i = 2"? m[10] and i[2] are the next characters to be compared, so this seems wrong to me. --132.199.235.61 (talk) 19:06, 15 May 2008 (UTC)

No. You don't understand the notation: m is the beginning of the word, and i is the position within the word. So, we position W so that W[0] = S[8] (that's what m = 8 means) and then start matching at W[2] (which is S[10]). Ryan Reich (talk) 20:47, 15 May 2008 (UTC)

[edit] Mistake in the table

I am not sure about the whole table computing. I was doing this table but based on the explanation I know of and it didn't come out the same. So then I also checked against the interactive once following link at the bottom of the page and it confirmed my table. So mistake in the table?! —Preceding unsigned comment added by Leshiy uk (talk • contribs) 13:45, 4 June 2008 (UTC)