Universal hashing

From Wikipedia, the free encyclopedia

Universal hashing is a randomized algorithm for selecting a hash function F with the following property: for any two distinct inputs x and y, the probability that F(x)=F(y) (i.e., that there is a hash collision between x and y) is the same as if F was a random function. Thus, if F has function values in a range of size r, the probability of any particular hash collision should be 1/r. There are universal hashing methods that give a function F that can be evaluated in a handful of computer instructions.

Hashing has been used to associate with an input, usually a string, a small value that originally was used as an index to look up something about that input in a table. Since then hashing has found other uses. For example, two inputs might be compared by checking to see if their hash values are the same. Thus, we can see that hash functions are many-to-one mappings. The use of the word "hash" is mnemonic because the intent of a hash function is to take as many of the inputs usually encountered and assign different values to them, by scrambling them or making a hash of the inputs, using the meaning of hash from domains such as cooking. If for any given input there are too many collisions that is viewed as unfortunate.

Because a hash function is a many-to-one mapping, there must exist some set of elements that will collide under the hash function. One wants to design the hash function such that for the input sets, it is unlikely that elements collide. Proving in a mathematical sense that you are unlikely to encounter a particular set of inputs would appear to be an impossible task.

Randomized algorithms present a way of proving that you are unlikely to encounter a bad set of inputs. We can construct a Universal Class of hash functions with the property that for any given set of inputs they will scatter the inputs among the range of the function well -- essentially as well as choosing random values for those inputs. Thus, simply choosing a random function from the class allows a proof that the probabilistic expectation for any set of inputs is that they will be distributed randomly.

In fact, we are in many cases interested in only pairwise collisions. That is to say, the odds that any two inputs x and y collide will be approximately the same as the reciprocal of the size of the range. It might be that for any given universal class of hash functions there exist x, y and z such that if x and y collide then so does z. While some work has been done on the set issue, universal hashing only makes statements about pairwise collisions.

A simple universal class of hash function is all functions h of the form h(x)= f(g(x)), where g(x)=ax+b (mod p) with p being a prime guaranteed larger than any possible input and each combination of a and b forming a different function in the class. f then becomes a mapping function to map elements from a domain which is 0 to p to a range of say 0 to n-1. f then can simply be taking the result of g mod n. There is only one f for all the functions in this class. To see why this class is universal, observe that for any two inputs and any two outputs, there are approximately p/n elements that can map to any output and for any of pair of those p/n elements you can solve the simultaneous equations in the field mod p, so for any pair of inputs there is a unique pair of a and b that will take it to those elements.

Universal hashing has numerous uses in computer science, for example in cryptography and in implementations of hash tables. Since the function is randomly chosen, an adversary hoping to create many hash collisions is unlikely to succeed.

Universal hashing has been generalized in many ways, most notably to the notion of k-wise independent hash functions, where the function is required to act like a random function on any set of k inputs.

[edit] See also

[edit] References

    [edit] External links

    J. Lawrence Carter and Mark N. Wegman's original paper "Universal Classes of Hash Functions"