User:T Long/Hash Table

From Wikipedia, the free encyclopedia


In computer science, a hash table is a data structure that provides fast lookup of a record indexed by a key where the domain of the key is too large for simple indexing; as would occur if an array were used. Like arrays, hash tables can provide O(1) lookup with respect to the number of records in the table. Hash tables are often used to implement associative arrays.

Hash tables use an array to hold, or reference, the stored records. Each slot of the array is indexed by a range reduction of the key. The range reduction is achieved by a hash function (thus the name hash table). However, various keys will, potentially, map to the same array index. Thus a collision resolution strategy is used. Most collision resolution strategies are some form of linear search through a (hopefully small) set of keys that collide.

Hash tables allow reasonably simple implementations of the essential operations of associative arrays: lookup of a record given a key; setting a record at a given a key (adding or replacing as necessary); and removing a record with a given key. Note however that hash tables do not naturally support any particular ordering of the records in the table (unlike a binary tree or skip list).

There are two broad categories of hash table implementation: open addressing, and chaining. In open addressing hash tables, the records are all stored within the array. In chaining hash tables records are seperately allocated and chained from the array.

Contents

[edit] Common uses of hash tables

Hash tables are found in a wide variety of programs. Most programming languages provide them in standard libraries. Most interpreted or scripting languages have special syntactic support (examples being Python, Perl, Ruby, Io, Smalltalk, Lua and ICI). In these languages, hash tables tend to be used extensively as data structures, sometimes replacing records and arrays.

Hash tables are commonly used for symbol tables, caches, and sets.

[edit] Key range reduction by hashing

A good hashing of the key is essential for good hash table performance. Hash collisions are generally resolved by some form of linear search, so if a hash function tends to produce similar values for some set of keys, poor performance linear searches will result (Murphy's law causes all important hash table uses to find such sets if they exist). When hashes collide to induce linear searches, it is refered to as clustering.

In an ideal hash function, changing any single bit in the key (including extending or shortening the key) will produce a pseudo-random change in all of the bits of the hash, which are independent of the changes caused by any other bits of the key. Because a good hash function can be hard to design, or computationally expensive to execute, much research has been devoted to collision resolution strategies that mitigate poor hashing performance. However, in reality there is no substitute for a good hash.

In practice, the creation of a range reduced index from a key is almost always carried out as a two step process. The first is the generation of a generic hash value which fills a natural machine integer. The seconds step is its reduction by finding its modulus with respect to the hash table array size.

Hash table array sizes are sometimes set, by construction, to be prime numbers. This is done to avoid any tendancy for the large integer hash to have common divisors with the hash table size, which would otherwise induce collisions after the modulus operation. However, again, a good hash function makes this unnecessary. A common alternative to prime sizes is to use a power of 2 size, with simple bit masking to achieve the modulus operation. Such bit masking is often significantly computationally cheaper than the division operation used in a general modulus.

[edit] Open addressing hash tables

Open addressing hash tables store all the records within the array. A hash collision is resolved by searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates the there is no such key in the table. Well known probe sequences include:

linear probing 
in which successive slots a fixed distance apart in the array are probed,
quadratic probing 
in which the space between probes increases quadratically, and
double hashing 
in which successive probes use a different hash function.

A critical influence on performance of an open addressing hash table is the load factor, that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, most algorithms will fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors.

The following pseudocode illustrates implementation of an open addressing hash table using linear probing with single slot stepping (the commonest case, and effective if the hash function is good). Each of the lookup, set and remove functions use a common internal function find_slot to locate the array slot that does, or perhaps should, contain a given key. The varables i, j, and k are all integers. The element at index i is referred to as slot[i] and there are nslots elements in the array.

function find_slot(key)
  i = hash(key) modulus nslots
  loop
    if slot[i] is not occupied, or is occupied with the given key
      return i
    step i to next slot modulus nslots
function lookup(key)
  i = find_slot(key)
  if slot[i] is occupied
    the record has been found, finish
  else
    the record is not in the table, finish
function set(key, value)
  i = find_slot(key)
  if slot[i] is occupied
    update slot[i] with the value, finish
  else
    if the table is almost full
      rebuild the table larger  (note 1)
      i = find_slot(key)
    store value and key in slot[i], finish
note 1 
Rebuilding the table requires allocating a larger array and recursively using the set operation to insert all the elements of the old array into the new larger array. It is common to use exponential growth to increase the array size. For example, doubling the old array size.
function remove(key)
    i = find_slot(key)
    if slot[i] is unoccupied
      the key is not in the table, finish
    j = i;
    loop
      j = slot after j modulus nslots
      if slot[j] is unoccupied
        exit the loop
      k = hash(key of slot[j]) modulus nslots
      if (j > i AND (k <= i OR k > j)) OR (j < i AND (k <= i AND k > j))  (note 2)
        copy the record at slot j to slot i
        i = j
    mark slot[i] as unoccupied, finish
note 2 
For all records in a cluster, there must be no vacant slots between their natural hash position and their current position (else lookups will terminate before finding the record). At this point in the pseudocode, i is a vacant slot that might be invalidating this property for subsequent records in the cluster. j is such as subsequent record. k is the raw hash where the record at j would naturally land in the hash table if there were no collisions. This test is asking if the record at j is invalidly positioned with respect the required properties of a cluster now that i is vacant.

Another technique for removal is simply to mark the slot as deleted. However this eventually requires rebuilding the table simply to remove deleted records. The methods above provide O(1) updating and removal of existing records, with occasional rebuilding if the high water mark of the table size grows.

[edit] Chaining hash tables

In simple chaining hash tables, each slot in the array is the head of a linked list of records that hash to that slot. Chaining hash tables are sometimes preferred over open addressing hash tables because removal is easy and there is no requirement to resize the table. However, to maintain efficiency, table resizing is still required. In addition, seperate allocations are required for each record, and link pointers need to be associated with each allocated record.

[edit] Variations

If all of the keys that will be used are known ahead of time, perfect hashing can be used to create a perfect hash table, in which there will be no collisions. If minimal perfect hashing is used, the hash table need not be sparse.

If a hash table entry does not include the key under which the record has been stored, the table may still be useful for probabilistic checks. For example, a very sparse (that is, very low load factor) hash tables of single bit records can be used as a fast means of selecting if an incomming key is in an interesting set. A collision will occasionally give a false positive, but a nagative is definite. (An example of this use was one of the early UNIX spell checking programs that used a precomputed hash table of know good words. It would occasionally pass a miss-spelled word incorrectly.)

[edit] See also