Interpolation search
From Wikipedia, the free encyclopedia
Interpolation search is an algorithm for searching for a given key value in an indexed array that has been ordered by the values of the key. It parallels how humans search through a telephone book for a particular name, the key value by which the book's entries are ordered. In each search step it calculates where in the remaining search space the sought item might be, based on the key values at the bounds of the search space and the value of the sought key, usually via linear interpolation. The key value actually found at this estimated position is then compared to the key value being sought. If they are not equal, then depending on the comparison, the remaining search space is reduced to the part before or after the estimated position. This method works only if calculations on the size of differences between key values are sensible. Humans have a rough idea of the distribution of surnames in an alphabetical listing, and can guess that "Baker" is closer to "Abadeeah" than it is to "Zyteca" and interpolate accordingly. In the absence of information about the popularity of surnames, computer representations of text strings do not facilitate such arithmetic.
By comparison, binary search always chooses the middle of the remaining search space, discarding one half or the other, again depending on the comparison between the key value found at the estimated position and the key value sought. The remaining search space is reduced to the part before or after the estimated position. Linear search uses equality only, as it compares elements one by one from the start, ignoring any sorting.
On average the interpolation search makes about log(log(n)) comparisons (if the elements are uniformly distributed), where n is the number of elements to be searched. In the worst case (for instance where the numerical values of the keys increase exponentially) it can make up to O(n) comparisons.
Performance
In reality, an interpolation search is often no faster than a binary search, due to the complexity of the arithmetic calculations for estimating the next probe position. The key may not be easily regarded as a number, as when it is text such as names in a telephone book. It is always possible to regard a bit pattern as a number, and text could be considered a number in base twenty-six (or however many distinct symbols are allowable) that could then be used in the interpolation arithmetic, but each such conversion is more trouble than a few binary chop probes. Nevertheless, a suitable numeric value could be pre-computed for every name, and searching would employ those numeric values. Further, it may be easy enough to form a cumulative histogram of those values whose shape can be approximated by a simple function (such as a straight line), or by a few such curves over segments of the range. If so, then a suitable inverse interpolation is easily computed and the desired index found with very few probes. The particular advantage of this is that probing a particular index's value may involve access to slow storage (on disc, or not in on-chip memory), whereas the interpolation calculation involves a small local collection of data which, with many searches being performed, is likely to be in fast storage. This approach shades into being a special case of a hash table search.
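A pre-computed numeric key could look something like the following sketch. The class name NameKey and the six-letter truncation are illustrative assumptions, not part of the article; the point is only that alphabetical order of names maps to numeric order of the resulting values, so the numbers can feed the interpolation arithmetic.

```java
// Illustrative sketch: map the leading letters of a name to a number
// whose numeric order matches the names' alphabetical order.
// Assumes plain a-z letters; the choice of 6 letters and base 27
// (0 = "no letter", 1-26 = 'a'-'z') is arbitrary.
public class NameKey {
    public static long toNumber(String name) {
        long value = 0;
        for (int i = 0; i < 6; i++) {
            int digit = 0;
            if (i < name.length()) {
                digit = Character.toLowerCase(name.charAt(i)) - 'a' + 1;
            }
            value = value * 27 + digit; // append next "digit" in base 27
        }
        return value;
    }
}
```

As the article notes, such a conversion only pays off if it is pre-computed once per entry rather than repeated for every probe.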
Such analysis should also detect the troublesome case of equal key values. If a run of equal key values exists, then the search will not necessarily select the first (or last) element of such a run, and if not carefully written, runs the risk of attempting a divide by zero during its interpolation calculation.
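A defensive probe calculation might look like the following sketch (the class name SafeProbe is invented here; this is not the article's code). When every key in the remaining range is equal, the interpolation denominator is zero, so the probe falls back to a direct answer instead of dividing; the multiplication is also widened to long to avoid intermediate overflow.

```java
// Illustrative sketch of a probe calculation guarded against the
// divide-by-zero that a run of equal key values would otherwise cause.
public class SafeProbe {
    public static int probe(int[] a, int low, int high, int toFind) {
        int denom = a[high] - a[low];
        if (denom == 0) {
            // All remaining keys are equal: interpolation is meaningless,
            // so just return one bound of the range.
            return low;
        }
        // Widen to long so (toFind - a[low]) * (high - low) cannot overflow.
        return low + (int) ((long) (toFind - a[low]) * (high - low) / denom);
    }
}
```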
In the simple case, no analysis of the values is prepared, still less a cumulative-histogram approximation. With a sorted list, the minimum value is clearly at the first index and the maximum value at the last. Assuming a uniform distribution of values amounts to assuming that the cumulative histogram's shape is a straight line from the minimum to the maximum, and so linear interpolation will find the index whose associated value should be the sought value.
Sample implementation
The following code example for the Java programming language is a simple implementation. At each stage it computes a probe position, then, as with the binary search, moves either the upper or lower bound inward to define a smaller interval containing the sought value. Unlike the binary search, which guarantees a halving of the interval's size with each stage, a misled interpolation may reduce it by only one, hence the O(n) worst case.
public int interpolationSearch(int[] sortedArray, int toFind) {
    // Returns index of toFind in sortedArray, or -1 if not found
    int low = 0;
    int high = sortedArray.length - 1;
    int mid;

    while (sortedArray[low] < toFind && sortedArray[high] >= toFind) {
        mid = low + ((toFind - sortedArray[low]) * (high - low))
                  / (sortedArray[high] - sortedArray[low]);

        if (sortedArray[mid] < toFind)
            low = mid + 1;
        else if (sortedArray[mid] > toFind)
            // Repetition of the comparison code is forced by syntax limitations.
            high = mid - 1;
        else
            return mid;
    }

    if (sortedArray[low] == toFind)
        return low;
    else
        return -1; // Not found
}
Notice that, having probed the list at index mid, the code sets either high or low not to mid but to an adjacent index, for reasons of loop control; that location is then probed during the next iteration. Since an adjacent entry's value will not differ by much, the interpolation calculation gains little from this one-step adjustment, at the cost of an additional reference to distant memory such as disc.
Each iteration of the above code requires between five and six comparisons (the extras are due to the repetitions needed to distinguish the three states of <, >, and = via binary comparisons in the absence of a three-way comparison), plus some messy arithmetic, while the binary search algorithm can be written with one comparison per iteration and uses only trivial integer arithmetic. Binary search would thereby search an array of a million elements with no more than twenty comparisons (involving accesses to slow memory where the array elements are stored); to beat that, the interpolation search as written above would be allowed no more than three iterations.
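A binary search with a single comparison per iteration can be sketched as follows. This is a standard deferred-equality variant, not code from the article: the loop never tests for equality, only narrowing a half-open interval, and the one equality test happens after the loop ends.

```java
// Binary search with exactly one comparison per loop iteration.
// The equality check is deferred until the interval has shrunk to
// a single candidate index.
public class BinarySearch {
    public static int search(int[] a, int toFind) {
        int low = 0;
        int high = a.length; // candidates lie in the half-open range [low, high)
        while (high - low > 1) {
            int mid = (low + high) >>> 1; // unsigned shift avoids overflow
            if (a[mid] <= toFind)         // the single comparison per iteration
                low = mid;
            else
                high = mid;
        }
        return (a.length > 0 && a[low] == toFind) ? low : -1;
    }
}
```

For a million elements, the loop above runs at most about twenty times, matching the comparison count cited in the text.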
Book-based searching
The conversion of names in a telephone book to some sort of number clearly will not provide numbers having a uniform distribution (except via immense effort such as sorting the names and calling them name #1, name #2, and so on), and further, it is well known that some names are much more common than others (Smith, Jones, and so forth). Similarly with dictionaries, where there are many more words starting with some letters than others. Some publishers go to the effort of preparing marginal annotations or even cutting into the side of the pages to show markers for each letter, so that at a glance a segmented interpolation can be performed.
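Such marginal markers amount to a precomputed first-letter index. A sketch of that idea follows; the class name ThumbIndex, the restriction to a plain a-z alphabet, and the linear scan within a segment are all illustrative assumptions, not the article's method.

```java
// Illustrative "thumb index": offsets[c] holds the starting position of
// the first word beginning with letter c, so a lookup jumps straight to
// the right segment before searching within it.
public class ThumbIndex {
    // Build the 27-entry offset table from a sorted array of a-z words.
    // offsets[26] is a sentinel equal to the total number of words.
    public static int[] buildOffsets(String[] sortedWords) {
        int[] offsets = new int[27];
        int letter = 0;
        for (int i = 0; i < sortedWords.length; i++) {
            int c = Character.toLowerCase(sortedWords[i].charAt(0)) - 'a';
            while (letter <= c) offsets[letter++] = i;
        }
        while (letter <= 26) offsets[letter++] = sortedWords.length;
        return offsets;
    }

    // Search only the segment for the word's first letter.
    public static int find(String[] sortedWords, int[] offsets, String word) {
        int c = Character.toLowerCase(word.charAt(0)) - 'a';
        for (int i = offsets[c]; i < offsets[c + 1]; i++) {
            if (sortedWords[i].equals(word)) return i;
        }
        return -1;
    }
}
```

Within each segment, the scan here could of course be replaced by a binary or interpolation search; the table only supplies the first, coarsest interpolation step.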