Linear hashing

Linear hashing is a dynamic hash table algorithm invented by Witold Litwin (1980)^[1] and later popularized by Paul (Per-Åke) Larson. Linear hashing expands the hash table one bucket at a time. These frequent single-bucket expansions keep collision chains short, and the cost of expansion is spread across insertions rather than being incurred all at once.^[2] Linear hashing is therefore well suited to interactive applications.

Algorithm Details

The hash table is first set up with an arbitrary initial number of buckets. The following values must be tracked:

$N$ : The initial number of buckets.
$L$ : The current level, an integer that indicates on a logarithmic scale approximately how much the table has grown (the number of times it has doubled). This is initially $0$.
$S$ : The step pointer which points to a bucket. It initially points to the first bucket in the table.

Bucket collisions can be handled in a variety of ways, but it is typical to have space for two items in each bucket and to add a bucket to the table whenever a bucket overflows. A larger bucket capacity can be used once the implementation is debugged. Addresses are calculated as follows:

Apply a hash function to the key and call the result $H$ .
If $H \bmod (N \times 2^{L})$ is an address that comes before $S$, the address is $H \bmod (N \times 2^{L+1})$.
If $H \bmod (N \times 2^{L})$ is $S$ or an address that comes after $S$, the address is $H \bmod (N \times 2^{L})$.
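
The address rule above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `address` and the 0-based bucket numbering are assumptions, and the parameters `n`, `level` and `step` stand for $N$, $L$ and $S$.

```python
def address(h: int, n: int, level: int, step: int) -> int:
    """Map a hash value h to a bucket index (0-based, an assumption here)."""
    low = h % (n * 2 ** level)
    if low < step:
        # Bucket `low` was already split in this round, so the doubled
        # range decides between it and the bucket it was split into.
        return h % (n * 2 ** (level + 1))
    return low


# N = 2, L = 1, S = 1: bucket 0 has been split, buckets 1..3 have not.
print([address(h, 2, 1, 1) for h in range(8)])  # [0, 1, 2, 3, 4, 1, 2, 3]
```

Hash values 0 and 4 would both map to bucket 0 under the smaller modulus; because bucket 0 lies before the step pointer, the larger modulus separates them into buckets 0 and 4.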

To add a bucket:

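Allocate a new bucket at the end of the table.
Redistribute the entries of the bucket at $S$ between that bucket and the new one using $H \bmod (N \times 2^{L+1})$.
If $S$ points to the $(N \times 2^{L})$th bucket in the table, reset $S$ and increment $L$.
Otherwise increment $S$.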
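
A possible Python sketch of adding a bucket is shown below. It assumes 0-based bucket indices, represents each bucket as a plain list, and assumes that the entries of the bucket at $S$ are redistributed with the doubled modulus when the new bucket is allocated. The name `add_bucket` and the `hash_fn` parameter are hypothetical.

```python
def add_bucket(buckets, n, level, step, hash_fn=hash):
    """Split the bucket at `step`; return the updated (level, step).

    Assumes len(buckets) == n * 2**level + step before the call.
    """
    buckets.append([])                    # new bucket at the end of the table
    new_mod = n * 2 ** (level + 1)
    old_items, buckets[step] = buckets[step], []
    for key in old_items:                 # rehash with the doubled modulus
        buckets[hash_fn(key) % new_mod].append(key)
    if step == n * 2 ** level - 1:        # S was on the last bucket of the round
        return level + 1, 0               # reset S and increment L
    return level, step + 1                # otherwise increment S


table = [[0, 2, 4, 6], [1, 3]]            # N = 2, L = 0, S = 0
level, step = add_bucket(table, 2, 0, 0, hash_fn=lambda k: k)
print(table, level, step)                 # [[0, 4], [1, 3], [2, 6]] 0 1
```

Each key in the split bucket either stays where it is or moves to the newly allocated bucket, because the doubled modulus can only add $N \times 2^{L}$ to its previous address.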

The effect of all of this is that the table is split into three sections: the section before $S$, the section from $S$ to $N \times 2^{L}$, and the section after $N \times 2^{L}$. The first and last sections are addressed using $H \bmod (N \times 2^{L+1})$ and the middle section is addressed using $H \bmod (N \times 2^{L})$. Each time $S$ reaches $N \times 2^{L}$ the table has doubled in size.
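
To make the three sections concrete, the following Python sketch lists the bucket indices in each section for a given state. It assumes 0-based indices, so the table holds $N \times 2^{L} + S$ buckets in total. The function name `sections` is hypothetical.

```python
def sections(n, level, step):
    """Return the three sections as ranges of 0-based bucket indices."""
    base = n * 2 ** level
    before = range(0, step)             # already split: modulus N * 2^(L+1)
    middle = range(step, base)          # not yet split: modulus N * 2^L
    after = range(base, base + step)    # added this round: modulus N * 2^(L+1)
    return before, middle, after


before, middle, after = sections(2, 1, 1)
print(list(before), list(middle), list(after))  # [0] [1, 2, 3] [4]
```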

Points to note

Full buckets are not necessarily split, so space for temporary overflow buckets is required. In external storage, this could mean a second file.
Buckets that are split are not necessarily full.
Every bucket will be split sooner or later, so all overflows will eventually be reclaimed and rehashed.
The split pointer $s$ decides which bucket to split. Here $i$, $s$ and $n$ correspond to $L$, $S$ and $N$ above.
- $s$ is independent of which bucket overflowed.
- At level $i$, $s$ is between $0$ and $2^{i} n - 1$.
- $s$ is incremented after each split and, on reaching the end, is reset to 0.
- Since the bucket at $s$ is split and $s$ is then incremented, only buckets before $s$ use the second, doubled hash range.
- A well-distributed hash value $x$ is first computed and then bit-masked with either $2^{i} - 1$ or $2^{i+1} - 1$ (masks of this form correspond to $n = 1$). The latter mask applies only if $x$ masked with $2^{i} - 1$ is less than $s$, so the larger range of hash values applies only to buckets that have already been split.
- For example, with $i = 3$ the mask is $2^{3} - 1 = 7$, which is 111 in binary, so the masked value is x & 0111, where & is the bitwise AND operator.
- If $s$ lands on a bucket that has one or more full overflow buckets, the split may not remove the whole overflow chain. Every entry, including those in the overflow buckets, must be rehashed to determine which of the two resulting buckets (or their overflow buckets) it belongs to.
$h_i(k) = h(k) \bmod (2^{i} n)$.
$h_{i+1}$ doubles the range of $h_i$.
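
The bit-masking rule described in the list above might look like this in Python. The sketch assumes one initial bucket (n = 1), so that masking with 2^i − 1 is the same as taking the value modulo 2^i. The function name `masked_address` is hypothetical.

```python
def masked_address(x: int, i: int, s: int) -> int:
    """Bucket index for hash value x at level i with split pointer s (n = 1)."""
    b = x & ((1 << i) - 1)              # same as x mod 2**i
    if b < s:
        # Bucket b has already been split, so use the wider mask.
        b = x & ((1 << (i + 1)) - 1)    # same as x mod 2**(i + 1)
    return b


x = 0b101101                            # 45
print(masked_address(x, 3, 6))          # 13: 0b101 < 6, so the 4-bit mask applies
print(masked_address(x, 3, 2))          # 5: 0b101 >= 2, so the 3-bit mask applies
```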

Algorithm for inserting ‘k’ and checking overflow condition

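b = h_i(k)
if b < split-pointer then
    b = h_{i+1}(k)
insert k into bucket b
if bucket b overflowed then
    split the bucket at split-pointer and advance split-pointer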
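
The following in-memory Python sketch puts the insertion steps together. It is an illustration under several assumptions: Python lists stand in for buckets and their overflow chains, a split is triggered whenever the target bucket exceeds `capacity`, and Python's built-in `hash` plays the role of h. The class and attribute names are hypothetical.

```python
class LinearHashTable:
    """Illustrative sketch only; lists stand in for buckets plus overflow."""

    def __init__(self, n=2, capacity=2):
        self.n = n                    # initial number of buckets
        self.level = 0                # current level i
        self.split = 0                # split pointer
        self.capacity = capacity      # items per bucket before overflow
        self.buckets = [[] for _ in range(n)]

    def _address(self, key):
        b = hash(key) % (self.n * 2 ** self.level)             # h_i(k)
        if b < self.split:
            b = hash(key) % (self.n * 2 ** (self.level + 1))   # h_{i+1}(k)
        return b

    def insert(self, key):
        b = self._address(key)
        self.buckets[b].append(key)
        if len(self.buckets[b]) > self.capacity:   # overflow condition
            self._split()

    def _split(self):
        # The bucket at the split pointer is split, which is not
        # necessarily the bucket that overflowed.
        self.buckets.append([])
        new_mod = self.n * 2 ** (self.level + 1)
        old, self.buckets[self.split] = self.buckets[self.split], []
        for key in old:
            self.buckets[hash(key) % new_mod].append(key)
        self.split += 1
        if self.split == self.n * 2 ** self.level:  # round done: table doubled
            self.split = 0
            self.level += 1


table = LinearHashTable(n=2, capacity=2)
for key in range(10):
    table.insert(key)
print(table.level, table.split, table.buckets)
```

Small non-negative integers hash to themselves in CPython, which makes the resulting layout easy to follow by hand.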

Searching in the hash table for ‘k’

b = h_i(k)
if b < split-pointer then
    b = h_{i+1}(k)
read bucket b and search there
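
A lookup following the same steps can be sketched in Python as below. The table state is built by hand for illustration, buckets are plain lists, and Python's built-in `hash` stands in for h. The function name `search` is hypothetical.

```python
def search(buckets, key, n, level, split):
    """Return True if key is stored in the table described by the arguments."""
    b = hash(key) % (n * 2 ** level)              # h_i(k)
    if b < split:
        b = hash(key) % (n * 2 ** (level + 1))    # h_{i+1}(k)
    return key in buckets[b]                      # read bucket b and search there


# Hand-built state: n = 2, level 1, split pointer 2 (buckets 0 and 1 are split).
buckets = [[0, 8], [1, 9], [2, 6], [3, 7], [4], [5]]
print(search(buckets, 9, 2, 1, 2))    # True: 9 mod 4 = 1 < 2, so 9 mod 8 = 1
print(search(buckets, 12, 2, 1, 2))   # False: 12 mod 4 = 0 < 2, so bucket 4
```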

Adoption in language systems

Griswold and Townsend^[3] discussed the adoption of linear hashing in the Icon language. They discussed implementation alternatives for the dynamic array algorithm used in linear hashing and presented performance comparisons using a list of Icon benchmark applications.

Adoption in database systems

Linear hashing is used in the Berkeley DB (BDB) database system, which in turn is used by many software systems such as OpenLDAP. It uses a C implementation derived from the CACM article^[2] and first published on Usenet in 1988 by Esmond Pitt.

References

[1] Litwin, Witold (1980), "Linear hashing: A new tool for file and table addressing", Proc. 6th Conference on Very Large Databases: 212–223
[2] Larson, Per-Åke (April 1988), "Dynamic Hash Tables", Communications of the ACM, 31 (4): 446–457, doi:10.1145/42404.42410
[3] Griswold, William G.; Townsend, Gregg M. (April 1993), "The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon", Software - Practice and Experience, 23 (4): 351–367

External links

TommyDS, C implementation of a Linear Hashtable
An in Memory Go Implementation with Explanation
This article incorporates public domain material from the NIST document: Black, Paul E. "linear hashing". Dictionary of Algorithms and Data Structures.
A C++ Implementation of Linear Hashtable which Supports Both Filesystem and In-Memory Storage