Coalesced hashing

From Wikipedia, the free encyclopedia

Coalesced Hashing example
Coalesced Hashing example

Coalesced hashing, also called coalesced chaining, is a strategy of collision resolution in a hash table that forms a hybrid of separate chaining and open addressing. In a separately chained hash table, items that hash to the same index are placed on a list at that index. This can result in a great deal of wasted memory because the table itself must be large enough to maintain a load factor that performs well (typically twice the expected number of items), and extra memory must be used for all but the first item in a chain (unless list headers are used, in which case extra memory must be used for all items in a chain).

Given a sequence of randomly generated three character long strings, the following table would be generated (using Bob Jenkins'[1] One-at-a-Time hash algorithm) with a table of size 10:

(null)
"clq"
"qur"
(null)
(null)
"dim"
"aty" "qsu"
"rhv"
"qrj" "ofu" "gcl" "ecd
(null)
(null)

This strategy is effective, efficient, and very easy to implement. However, sometimes the extra memory use might be prohibitive, and the most common alternative, open addressing, has uncomfortable disadvantages that decrease performance. More specifically, open addressing has the problem of primary and secondary clustering, where there are long sequences of used buckets, and extra time is needed to calculate the next open bucket for a collision resolution.

One of the solutions to clustering is coalesced hashing. Another, more common, solution is double hashing. Coalesced hashing uses a similar technique as separate chaining, but instead of allocating new nodes for the linked list, buckets in the table are used. The first empty bucket in the table is considered a collision bucket. When a collision occurs anywhere in the table, the item is placed in the collision bucket and a link is made between the colliding index and the collision bucket. Then the next empty bucket is searched for to give the next collision bucket.

Because the collision bucket moves in a predictable pattern unrelated to how the keys are hashed, the effect of primary and secondary clustering is minimized, and search times remain efficient. The lists are interleaved around one another, so they are said to coalesce, hence the name, coalesced hashing. The code to insert into a coalesced hash table is surprisingly short for how convoluted the concept seems at first.

Insertion in C:

int insert ( char key[] )
{
  unsigned h = hash ( key, strlen ( key ) ) % N;

  if ( htab[h] == NULL ) {
    /* Make a new chain */
    htab[h] = make_node ( key, NULL );
  }
  else {
    struct node *it;
    int cursor = 0;

    /* Find the first empty bucket */
    while ( cursor < N && htab[cursor] != NULL )
      ++cursor;

    /* The table is full, terminate unsuccessfully */
    if ( cursor == N )
      return -1;

    htab[cursor] = make_node ( key, NULL );
    
    /* Find the last node in the chain and point to it */
    for ( it = htab[h]; it->next != NULL; it = it->next )
      ;

    it->next = htab[cursor];
  }

  return 0;
}

One benefit of this strategy is that the search algorithm for separate chaining can be used without change in a coalesced hash table.

Lookup in C:

char *find ( char key[] )
{
  unsigned h = hash ( key, strlen ( key ) ) % N;

  if ( htab[h] != NULL ) {
    struct node *it;

    /* Search the chain at index h */
    for ( it = htab[h]; it != NULL; it = it->next ) {
      if ( strcmp ( key, it->data ) == 0 )
        return it->data;
    }
  }

  return NULL;
}

[edit] Performance

Coalesced chaining avoids the effects of primary and secondary clustering, and as a result can take advantage of the efficient search algorithm for separate chaining. If the chains are short, this strategy is very efficient and can be highly condensed, memory-wise. As in open addressing, deletion from a coalesced hash table is awkward and potentially expensive, and resizing the table is terribly expensive and should be done rarely, if ever. All in all, coalesced hashing offers a tempting combination of separate chaining and open addressing.