Interpolation search

From Wikipedia, the free encyclopedia

Interpolation search is an algorithm for searching a key in an indexed sorted list. It parallels how humans search through a telephone book. Instead of comparing against every item as in linear search, it attempts to find the item by estimating how far the item is likely to be from the current position and next searching there, whereas in the binary search only the sign of the difference is considered, and the binary search always divides the search space in half. On average the interpolation search makes about log(log(N)) comparisons (if the elements are uniformly distributed), where N is the number of elements to be searched. In the worst case (for instance where the numerical values of the items increase exponentially) it can make up to O(n) comparisons.

In reality, an interpolation search is often no faster than a binary search due to the complexity of the arithmetic calculations for estimating the next probe position. The key may not be easily regarded as a number, as when it is text such as names in a telephone book. Yet it is always possible to regard a bit pattern as a number, and text could be considered a number in base twenty-six (or however many distinct symbols are allowable) that could then be used in the interpolation arithmetic, but each such conversion is more trouble than a few binary chop probes. Nevertheless, a suitable numeric value could be pre-computed for every name and searching would employ those numerical values. Further, it may be easy enough to form a cumulative histogram of those values that follows a shape that can be approximated by a simple function (such as a straight line) or a few such curves over segments of the range. If so, then a suitable inverse interpolation is easily computed and the desired index found with very few probes. The particular advantage of this is that a probe of a particular index's value may involve access to slow storage (on disc, or not in the on-chip memory) whereas the interpolation calculation involves a small local collection of data which with many searches being performed is likely in fast storage. This approach shades into being a special-case Hash table search.

Such analysis should also detect the troublesome case of equal key values. If a run of equal key values exists, then the search will not necessarily select the first (or last) element of such a run, and if not carefully written, runs the risk of attempting a divide by zero during its interpolation calculation.

In the simple case, no analysis of values still less cumulative histogram approximations are prepared. With a sorted list, clearly the minimum value is at index one, and the maximum value is at index N. Assuming a uniform distribution of values amounts to assuming that the cumulative histogram's shape is a straight line from the minimum to the maximum and so linear interpolation will find that index whose associated value should be the sought value.

The following code example for the Java programming language is a simple implementation. At each stage it computes a probe position then as with the binary search, moves either the upper or lower bound in to define a smaller interval containing the sought value. Unlike the binary search which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce it by only one, thus the O(N) worst case.

public int interpolationSearch(int[] sortedArray, int toFind){
 // Returns index of toFind in sortedArray, or -1 if not found
 int low = 0;
 int high = sortedArray.length - 1;
 int mid;

 while (sortedArray[low] < toFind && sortedArray[high] >= toFind) {
  mid = low + ((toFind - sortedArray[low]) * (high - low)) / (sortedArray[high] - sortedArray[low]);

  if (sortedArray[mid] < toFind)
   low = mid + 1;
  else if (sortedArray[mid] > toFind)      //Repetition of the comparison code is forced by syntax limitations.
   high = mid - 1;
  else
   return mid;
 }

 if (sortedArray[low] == toFind)
  return low;
 else
  return -1; // Not found
}

Notice that having probed the list at index mid, for reasons of loop control administration, this code sets either high or low to be not mid but an adjacent index, which location is then probed during the next iteration. Since an adjacent entry's value will not be much different the interpolation calculation is not much improved by this one step adjustment, at the cost of an additional reference to distant memory such as disc.

[edit] Book-based searching

The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, etc.) and further, it is well-known that some names are much more common than others (Smith, Jones,) Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter so that at a glance a segmented interpolation can be performed.

[edit] See also

[edit] External links

In other languages