Talk:Rabin-Karp string search algorithm
From Wikipedia, the free encyclopedia
I think it would be nice if the article discussed extending the algorithm for 2 dimensional pattern matching, as well as giving some optimizations in the case of having varying string lengths for multi-pattern matching case.
Contents |
[edit] =
In the pseudocode of the RK string search algorithm, if hs is given as hash(s[1..n]), the algorithm can not detect a target pattern which occurs on the first m bytes of a given text. This is why I reverted it to be hash(s[1...m]).
One arduous but impetuous, so-called, vandalism detector claims that it should be hash(s[1...n]), because, when m is larger than n, the code crashes. That is true, but I reckon such a range check matter is not a major issue in describing the fundamentals of the algorithm.
Considering the following sentence which is excerpted from the article and is supposed to be written by one of the original contributors,
"Lines 2,3, and 6 each require omega(m) time."
if one still thinks hs := hash(s[1...n]) is correct, why doesn't he/she also modify the above sentence too?
footnote) The line 3 of the pseudocode has been in the form of hash(s[1..m]) for the last six months or so since its original edition by the original writer.
--129.254.65.18 08:35, 23 May 2005 (UTC)
- I agree with you. We are assuming m ≤ n, and if [1..n] is used, the algorithm won't work, because it'd be trying to compare hashes of strings of length n to hashes of strings of length m. Deco 21:41, 23 May 2005 (UTC)
[edit] Line 8 of the muli-pattern-search pseudo-code seems wrong to me
Quote: if s[i..i+m-1] = a substring with hash hs This is exactly the same condition as in the line above (element of hash). As pointed out corretly in the single patters search - to rule out a hash collision we need to compare with the actual search string(s). (Plural only if by fat chance or denial of service attack, two search strings have the same hash.)
Or am I missing something obvious here?--iSee 10:16, 28 July 2005 (UTC)
- I need to correct myself. The code is ok, although I don't quite get how the rolling hash would compare varying key-lengths. I improved the code indentation so the code is more readable iSee
[edit] No worst case
I removed the following text in the "multiple pattern search" section: "We can also ensure O(m n log k) time in the worst case, where m is the length of the longest of the k strings, by storing the hashes in a self-balancing binary search tree instead of a hash table." Rabin-Karp simply doesn't make sense in the worst case, because the whole idea of hashing requires randomization and can only work in expectation (or with high probability).
- That's true. My mistake. Deco 02:51, 31 July 2005 (UTC)
[edit] Suggestions
Firstly, it seems like
1 function RabinKarp(string s[1..n], string sub[1..m]) 2 hsub := hash(sub[1..m]) 3 hs := hash(s[1..m]) 4 for i from 1 to n 5 if hs = hsub 6 if s[i..i+m-1] = sub 7 return i 8 hs := hash(s[i+1..i+m]) 9 return not found
could be shortened to
1 function RabinKarp(string s[1..n], string sub[1..m]) 2 hsub := hash(sub[1..m]) 3 for i from 1 to n 4 hs := hash(s[i..i+m-1]) 5 if hs = hsub 6 if s[i..i+m-1] = sub 7 return i 8 return not found
Aside from being a line shorter, I find this more clearly displays what hs is for any given iteration. The same comment applies to the other example of multi-pattern matching. (Edit: Ignore this suggestion--of course you want to keep the structure of the code the way it is to make it easy to introduce rolling hash functions.)
On the topic of the multi-pattern matching code, I think it is worth noting in both the code and the surrounding discussion the possibility for hash table collisions. As it stands, one might wrongly infer from the code that the only possibility is that a hash table either has no entry or a single entry for any given hash code.
- Thanks for looking over the article. The collisions are internal to the implementation of the hash table. Here we are treating it as an abstract set data structure, and don't even really care whether it's a hash table, binary search tree, or association list. If you think you can make this clearer please feel free. Deco 05:14, 15 November 2005 (UTC)
StTwister 15:39, 25 January 2007 (UTC): In order to be possible to change compute the hash of the next subsequence of S in constant time (to keep a low complexity), the hash must first be calculated outisde the loop entirely (with a 1 to M iteration) and then only updated inside the loop (remove the ith element from the hash and add the (i+m)-th element to the hash). So I think it's clearer if left this way.
[edit] Some pseudocode fixes (both single and multiple pattern search)
Firstly, changed this in single pattern search:
hs := hash(s[1..n])
to
hs := hash(s[1..m])
Secondly, changed the loop from (1 to n) to (1 to n-m+1), because from that points onwards, the hash function would be impossible to calculate (it would get out of sub's range), in both single and multiple pattern search.
Changes by StTwister 15:39, 25 January 2007 (UTC)