Double hashing

From Wikipedia, the free encyclopedia

Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in open-addressed hash tables. Double hashing is implemented in many popular libraries.

Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and quadratic probing, the interval depends on the data, so that even values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering.

Given two randomly, uniformly, and independently selected hash functions h_{1} and h_{2}, the ith location in the bucket sequence for value k in a hash table T is: h(i,k)=(h_{1}(k)+i\cdot h_{2}(k))\mod |T|. Generally, h_{1} and h_{2} are selected from a set of universal hash functions.

Classical applied data structure

Double hashing with open addressing is a classical data structure on a table T. Let n be the number of elements stored in T, then T's load factor is \alpha ={\frac  {n}{|T|}}.

Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and independently selecting two universal hash functions h_{1} and h_{2} to build a double hashing table T.

All elements are put in T by double hashing using h_{1} and h_{2}. Given a key k, determining the (i+1)-st hash location is computed by:

h(i,k)=(h_{1}(k)+i\cdot h_{2}(k))\mod |T|.

Let T have fixed load factor \alpha :1>\alpha >0. Bradford and Katehakis[1] showed the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is {\frac  {1}{1-\alpha }} regardless of the distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions are chosen from a set of universal hash functions where pair-wise independence suffices.

Previous results include: Guibas and Szemerédi[2] showed {\frac  {1}{1-\alpha }} holds for unsuccessful search for load factors \alpha <0.319. Also, Lueker and Molodowitch[3] showed this held assuming ideal randomized functions. Schmidt and Siegel[4] showed this with k-wise independent and uniform functions (for k=c\log n, and suitable constant c).

Implementation details for caching

Linear probing and, to a lesser extent, quadratic probing are able to take advantage of the data cache by accessing locations that are close together. Double hashing has, on average, larger intervals and is not able to achieve this advantage. To avoid this situation, store your data with the second key as the row, and your first key as the column. Doing this allows you to iterate on the column, thus preventing cache problems. This also prevents the need to rehash the second key.

For instance:

pData[hk_2][hk_1]
 
int hv_1 = Hash(v)
int hv_2 = Hash2(v)
 
int original_hash = hv_1
while(pData[hv_2][hv_1]){
  hv_1 = hv_1 + 1
}

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size, as with all other open addressing schemes.

On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k=5 with the following function:

h_{2}(k)=5-(k\mod 7)

The resulting sequence will always remain at the initial hash value. One possible solution is to change the secondary hash function to:

h_{2}(k)=(k\mod 7)+1

This ensures that the secondary hash function will always be non zero.

See also

Notes

  1. P. G. Bradford and M. Katehakis: A Probabilistic Study on Combinatorial Expanders and Hashing, SIAM Journal on Computing 2007 (37:1), 83-111. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2647
  2. L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.
  3. G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.
  4. J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.

External links

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.