Binary search algorithm

From Wikipedia, the free encyclopedia

Binary search algorithm
General Data
Class: Search algorithm
Data Structure: Array
Time Complexity: O(log n)
Space Complexity: O(1)
Optimal: Yes

A binary search algorithm (or binary chop) is a technique for locating a particular value in a sorted list of values. It makes progressively better guesses, and closes in on the sought value by selecting the middle element in the span (which, because the list is in sorted order, is the median value), comparing its value to the target value, and determining if the selected value is greater than, less than, or equal to the target value. A guess that turns out to be too high becomes the new upper bound of the span, and a guess that is too low becomes the new lower bound. Pursuing this strategy iteratively, it narrows the search by a factor of two each time, and finds the target value or else determines that it is not in the list at all. A binary search is an example of a dichotomic divide and conquer search algorithm.

The most common application of binary search is to find a specific value in a sorted list. To cast this in the frame of the guessing game (see Example below), realize that we now guess the index, or numbered place, of the value in the list. This is useful because, given the index, other data structures will contain associated information. Suppose a data structure containing the classic collection of name, address, telephone number and so forth has been accumulated, and an array is prepared containing the names, numbered from one to N. A query might be: what is the telephone number for a given name X? To answer this, the array would be searched and the index (if any) corresponding to that name determined, whereupon it would be used to report the associated telephone number and so forth. Appropriate provision must be made for the name not being in the list (typically by returning an index value of zero); indeed, the question of interest might be only whether X is in the list or not.

If the list of names is in sorted order, a binary search will find a given name with far fewer probes than the simple procedure of probing each name in the list, one after the other in a linear search, and the procedure is much simpler than organising a hash table - though once created, a hash table may well be faster to search, typically averaging just over one probe per lookup. With a non-uniform distribution of values, if it is known that some few items are much more likely to be sought than the majority, then a linear search with the list ordered so that the most popular items come first may do better than binary search. However, the choice of the best method may not be immediately obvious. If, between searches, items in the list are modified, added or removed, maintaining the required organisation may consume more time than the searches save.

The binary search begins by comparing the sought value X to the value in the middle of the list; because the values are sorted, it is clear whether the sought value would belong before or after that middle value, and the search then continues through the correct half in the same way. Only the sign of the difference is inspected: there is no attempt at an interpolation search based on the size of the differences.

The method

In order to discuss the method in detail, a more formal description is necessary. The basic idea is that there is a data structure represented by array A in which individual elements are identified as A(1), A(2),...,A(N) and may be accessed in any order. The data structure contains a sub-element or data field, called here Key, and the array is ordered so that successive values A(1).Key ≤ A(2).Key and so on. Such a key might be a name, or it might be a number. The requirement is: given some value x, find an index p (not necessarily the one and only such index) such that A(p).Key = x.

To begin with, the span to be searched is the full supplied list of elements, as marked by variables L and R, whose values are changed with each iteration of the search process, as depicted by the flowchart. Note that the division by two is integer division, with any remainder lost, so that 3/2 comes out as 1, not 1½. The search finishes either because the value has been found, or because the specified value has been determined not to be in the list.
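As a concrete illustration, the loop just described can be sketched in C++. This is a sketch only: it uses a 0-based std::vector rather than the 1-based array of the description, so L starts at -1 and R at N instead of 0 and N + 1.

```cpp
#include <vector>

// A sketch of the method described above, using the exclusive bounds
// (L + 1) .. (R - 1). The array here is a 0-based std::vector, so the
// initialisation is L = -1 and R = N. Returns the index of an element
// equal to x, or -1 for "Not found".
int exclusive_search(const std::vector<int>& A, int x) {
    int L = -1;
    int R = static_cast<int>(A.size());
    while (true) {
        int p = (R - L) / 2;       // integer division; 0 means the span is empty
        if (p == 0) return -1;     // "Not found"
        p = L + p;                 // probe index, within (L + 1) .. (R - 1)
        if (x < A[p]) R = p;       // x, if present, lies before p
        else if (x > A[p]) L = p;  // x, if present, lies after p
        else return p;             // found
    }
}
```

Note that only the sign of the comparison is used, exactly as in the flowchart description: each iteration either finds x, moves L up, or moves R down.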

That it works

The method relies on and upholds the notion: if x is to be found, it will be amongst elements (L + 1) to (R - 1) of the array.

The initialisation of L and R to 0 and N + 1 makes this merely a restatement of the supplied problem, that elements 1 to N are to be searched, so the notion is established to begin with. Consider now that some element p of the indicated range is selected. At this point, it does not matter which specific element is selected, merely that one is. Compare x to A(p).Key. If x = A(p).Key then the method terminates in success. Otherwise, suppose x < A(p).Key. If so, then because the array is in sorted order, x will also be less than all later elements of the array, all the way to element (R - 1). Accordingly, the value of the right-hand bound index R can be changed to the value p, since, by the test just made, x < A(p).Key and so, if x is to be found, it will be amongst elements earlier than p, that is, (p - 1) and earlier. Contrariwise, for the case x > A(p).Key, the value of L would be changed to p. Further, whichever bound is changed, the span remaining to be searched is reduced: if L is changed, it is changed to a higher number (at least L + 1), whereas if R is changed, it is to a lower number (at most R - 1), because those are the limits for p.

It remains to consider that a valid value for p is always chosen, which in turn depends on whether there are any elements remaining in the search span (L + 1) to (R - 1). The number of such elements remaining is clearly (R - L - 1) so computing (R - L) gives (number of elements + 1); halving that number (with integer division) means that if one element remained, then p = 1, but if no elements remain, p = 0, and in that case the method terminates with the report "Not found". Otherwise, for p > 0, the search continues with p:=L + p, which by construction is within the bounds (L + 1) to (R - 1).

Accordingly, with each iteration, if the search span is empty the result is "Not found", otherwise either x is found at the probe point p or the search span is reduced for the next iteration. Thus the method works, and so can be called an Algorithm.

That it is fast

The interval being searched is successively halved (actually slightly better than halved) in width with each iteration, so the number of iterations required is at most the base-two logarithm of N - zero for empty lists.

Extension

There is no particular requirement that the array being searched has the bounds 1 to N. It is possible to search a specified range, elements first to last instead of 1 to N. All that is necessary is that the initialisation be L:=first - 1 and R:=last + 1; then all proceeds as before.

In more complex contexts, it might be that the data structure has many sub fields, such as a telephone number along with the name. An indexing array such as xref could be introduced so that elements A(xref(1)).Telephone ≤ A(xref(2)).Telephone ... ≤ A(xref(N)).Telephone so that, "viewed through" array xref the array can be regarded as being sorted on the telephone number, and a search would be to find a given telephone number. In this case, A(i).Key would be replaced by A(xref(i)).Telephone and all would be as before. Thus, with auxiliary xref arrays, an array can be treated as if it is sorted in different ways without it having to be resorted for each different usage.
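The xref idea can be sketched as follows. Entry, find_by_phone and the field names are hypothetical, introduced here only for illustration: A itself stays sorted by name, while xref orders it by telephone number.

```cpp
#include <string>
#include <vector>

// A hypothetical record type: the array A is sorted by name, while the
// auxiliary xref array orders it by telephone number, as described above.
struct Entry {
    std::string name;
    int telephone;
};

// Binary search "through" xref for a given telephone number (a sketch,
// using the same exclusive-bound method as in the text). Returns the
// index into A of the matching entry, or -1 for "Not found".
int find_by_phone(const std::vector<Entry>& A,
                  const std::vector<int>& xref, int number) {
    int L = -1;
    int R = static_cast<int>(xref.size());
    while (true) {
        int p = (R - L) / 2;
        if (p == 0) return -1;                         // not found
        p = L + p;
        if (number < A[xref[p]].telephone) R = p;
        else if (number > A[xref[p]].telephone) L = p;
        else return xref[p];                           // index into A
    }
}
```

Every access goes through A[xref[i]], so the data itself never moves; only the small index array has to be sorted for each new view.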

Variations

There are many variations, and they are easily confused. The most significant differences are between the "exclusive" and "inclusive" forms of the bounds. This description uses the "exclusive" bound form, in which the span to be searched is (L + 1) to (R - 1); this may seem clumsy when the span could be described in the "inclusive" form, as L to R. The inclusive form may be attained by replacing all appearances of "L" by "(L - 1)" and "R" by "(R + 1)", then rearranging. Thus, the initialisation L:=0 becomes (L - 1):=0, or L:=1, and R:=N + 1 becomes (R + 1):=N + 1, or R:=N. So far so good, but note now that the changes to L and R no longer simply transfer the value of p to L or R as appropriate, but must become (R + 1):=p, or R:=p - 1, and (L - 1):=p, or L:=p + 1.

Thus, the gain of a simpler initialisation, done once, is lost to a more complex calculation done on every iteration. If that were not enough, the test for an empty span is also more complex, compared to checking that the value of p is zero. Nevertheless, this is the form found in many publications, such as Donald Knuth's The Art of Computer Programming, Volume 3: Sorting and Searching.

Computer usage

"Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky..." - Professor Knuth. When Jon Bentley assigned it as a problem in a course for professional programmers, he found that an astounding ninety percent failed to code a binary search correctly after several hours of working on it[1], and another study found that accurate code for it appears in only five out of twenty textbooks (Kruse, 1999). Furthermore, Bentley's own implementation of binary search, published in his 1986 book Programming Pearls, contains an error that remained undetected for over twenty years.[2]

Careful thought is required. The first issue is a minor one: how to signify "Not found". If the array is indexed 1 to N, then a returned index of zero is an obvious choice. However, some computer languages (notably C and its relatives) insist that arrays have a lower bound of zero. In such a case, the array might be indexed 0 to N - 1, and so a negative result would be chosen for "Not found" - except that this can interfere with the desire to use unsigned integers for indexing.

More serious are the limitations of computer arithmetic. Variables have limited size; for instance, the (very common) sixteen-bit two's complement signed integer can only hold values from -32768 to +32767. (Exactly the same problems arise with thirty-two bit integers, except that the digit strings are longer.) If the array is to be indexed with such variables, then the values first - 1 and last + 1 must be representable, that is, last ≤ 32766, not 32767. Similarly, the lower bound first may be zero (for arrays whose indexing starts at zero), in which case the value -1 must be representable, which precludes the use of unsigned integers. General-purpose testing is unlikely to exercise this boundary, and so the detail can be overlooked.

Sixteen-bit integers are of course unlikely to be used when the collection being searched numbers around thirty thousand, but a second problem arises much sooner. A common variation computes the midpoint of the interval in one step, as p:=(L + R)/2; this means that the sum must not exceed the sixteen-bit limit for all to be well, and this detail is easily forgotten. The problem may be concealed on computers that use wider registers to perform sixteen-bit arithmetic, so that there is no overflow of intermediate results - but on a different computer, perhaps not. Thus, when tested and working code is transferred from a 16-bit version (in which there were never more than a few thousand elements being searched) to a 32-bit version, and problem sizes then steadily inflate, the forgotten limit can suddenly become relevant again.

Another difficulty is presented by the absence in most computer languages of a three-way result from a comparison, which forces a comparison to be performed twice. The form is somewhat as follows:

if a < b then action1
 else if a > b then action2
  else action3;

About half the time, the first test will be true so that there will be only one comparison of a and b, but the other half of the time it will be false, and a second comparison forced. This is so grievous that some versions are recast so as not to make a second test at all, thus not determining equality until the span has been reduced to zero. This problem is exacerbated for floating point variables that offer the special value NaN, which violates even the notion of equality: x = x is false if x has the value NaN!
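The NaN point can be demonstrated directly. This is a minimal illustration (is_reflexively_equal is a name introduced here): an equality test inside a search can never match a NaN key.

```cpp
#include <cmath>

// NaN violates reflexive equality: x == x is false when x is NaN,
// so a search that tests for equality can never report it as found.
bool is_reflexively_equal(double x) {
    return x == x;  // false only for NaN
}
```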

Implementations

The most straightforward implementation is recursive: the comparison dictates which subrange to search recursively:

   BinarySearch(A[0..N-1], value, low, high) {
       if (high < low)
           return -1 // not found
       mid = (low + high) / 2
       if (A[mid] > value)
           return BinarySearch(A, value, low, mid-1)
       else if (A[mid] < value)
           return BinarySearch(A, value, mid+1, high)
       else
           return mid // found
   }

It is invoked with initial low and high values of 0 and N-1. We can eliminate the tail recursion above and convert this to an iterative implementation:

   BinarySearch(A[0..N-1], value) {
       low = 0
       high = N - 1
       while (low <= high) {
           mid = (low + high) / 2
           if (A[mid] > value)
               high = mid - 1
           else if (A[mid] < value)
               low = mid + 1
           else
               return mid // found
       }
       return -1 // not found
   }
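The iterative pseudocode above translates directly into C++. This is a sketch (binary_search_iter is a name introduced here), with the midpoint computed in the overflow-safe form discussed under Computer usage:

```cpp
#include <vector>

// A direct C++ rendering of the iterative pseudocode above, with the
// midpoint computed as low + (high - low) / 2 to avoid overflow of the sum.
int binary_search_iter(const std::vector<int>& A, int value) {
    int low = 0;
    int high = static_cast<int>(A.size()) - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (A[mid] > value) high = mid - 1;
        else if (A[mid] < value) low = mid + 1;
        else return mid;  // found
    }
    return -1;  // not found
}
```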

Some implementations may not include the early termination branch, preferring to check at the end whether the value was found, as shown below. Checking for the value during the search (as opposed to at the end of the search) may seem a good idea, but there are extra computations involved in each iteration of the search. Also, with an array of length N using the low and high indices, the probability of actually finding the value on the first iteration is 1 / N, and the probability of finding it later on (before the end) is about 1 / (high - low). The following checks for the value at the end of the search:

       low = 0
       high = N
       while (low < high) {
           mid = (low + high)/2;
           if (A[mid] < value)
               low = mid + 1; 
           else
                //can't be high = mid-1: here A[mid] >= value,
                //so high can't be < mid if A[mid] == value
                high = mid; 
       }
       // high == low, using high or low depends on taste 
       if ((low < N) && (A[low] == value))
           return low // found
       else
           return -1 // not found        


This algorithm has two other advantages. At the end of the loop, low points to the first entry greater than or equal to value, so a new entry can be inserted if no match is found. Moreover, it requires only one comparison per iteration, which could be significant for complex keys in languages which do not allow the result of a comparison to be saved.
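The insertion point property can be sketched as follows (lower_bound_index and insert_sorted are names introduced here; the loop is the deferred-detection form described above, which the C++ library calls a lower bound):

```cpp
#include <vector>

// Deferred-detection loop: returns the index of the first element >= value,
// or A.size() if there is none.
int lower_bound_index(const std::vector<int>& A, int value) {
    int low = 0;
    int high = static_cast<int>(A.size());
    while (low < high) {
        int mid = low + (high - low) / 2;  // overflow-safe midpoint
        if (A[mid] < value) low = mid + 1;
        else high = mid;
    }
    return low;
}

// Because of that property, the same search point can be used to insert
// a new value while keeping the list sorted.
void insert_sorted(std::vector<int>& A, int value) {
    A.insert(A.begin() + lower_bound_index(A, value), value);
}
```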

In practice, one frequently uses a three-way comparison instead of two comparisons per loop. Also, real implementations using fixed-width integers with modular arithmetic need to account for the possibility of overflow. One frequently-used technique for this is to compute mid, so that two smaller numbers are ultimately added:

           mid = low + ((high - low) / 2)

Equal elements

The elements of the list are not necessarily all unique. If one searches for a value that occurs multiple times in the list, the index returned will be of the first-encountered equal element, and this will not necessarily be that of the first, last, or middle element of the run of equal-key elements but will depend on the positions of the values. Modifying the list even in seemingly unrelated ways such as adding elements elsewhere in the list may change the result.

To find all equal elements an upward and downward linear search can be carried out from the initial result, stopping each search when the element is no longer equal. Thus, e.g. in a table of cities sorted by country, we can find all cities in a given country.
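The outward linear scan can be sketched as follows (equal_run is a name introduced here; hit is assumed to be an index returned by a successful binary search):

```cpp
#include <utility>
#include <vector>

// After a successful binary search returns some index hit, scan linearly
// outward in both directions to find the whole run of equal elements.
// Returns the half-open index range [first, last) of elements equal to A[hit].
std::pair<int, int> equal_run(const std::vector<int>& A, int hit) {
    int first = hit;
    int last = hit + 1;
    while (first > 0 && A[first - 1] == A[hit]) --first;
    while (last < static_cast<int>(A.size()) && A[last] == A[hit]) ++last;
    return {first, last};
}
```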

Sort key

A list of pairs (p,q) can be sorted based on just p. Then the comparisons in the algorithm need only consider the values of p, not those of q. For example, in a table of cities sorted on a column "country" we can find cities in Germany by comparing country names with "Germany", instead of comparing whole rows. Such partial content is called a sort key.

Testing

Because it is difficult to analyse the code visually without making a mistake, it is important to verify the correctness of a binary search implementation by testing it thoroughly on a computer.

To that end, the following code thoroughly tests a binary search at every index, for many array lengths:

bool passed = true;
for (int offset = 1; offset < 5; offset++) {              // try several spacings between values
        for (int length = 1; length < 2049; length++) {   // make the array longer on each iteration
                std::vector<int> A(length);
                for (int i = 0; i < length; i++) A[i] = i * offset;  // values 0, offset, 2*offset, ...

                // check negative hits -----------------------------------------------------

                // search for a value below the smallest element;
                // if the search returns a non-negative index, fail
                if (BinarySearch(A, length, A[0] - 1) >= 0) passed = false;

                // search for a value above the largest element;
                // if the search returns a non-negative index, fail
                if (BinarySearch(A, length, A[length - 1] + 1) >= 0) passed = false;

                // check positive hits -----------------------------------------------------
                for (int i = 0; i < length; i++)          // search for every array value
                        // if the search does not return the correct index, fail
                        if (BinarySearch(A, length, A[i]) != i) passed = false;
        }
}

In the above C++ test code, if passed is ever false, then the binary search function has a bug. Note that this code assumes the function returns the index of the search value within the array. In addition, it does not properly test duplicate values within the array, or errors that could be caused by more randomly distributed values. As such, it should not be considered a complete proof of correctness, merely an aid for testing.

Performance

Binary search is a logarithmic algorithm and executes in O(log N) time. Specifically, at most ⌊log₂ N⌋ + 1 iterations are needed to return an answer. In most cases it is considerably faster than a linear search. It can be implemented using recursion or iteration, as shown above. In some languages it is more elegantly expressed recursively; however, in some C-based languages tail recursion is not eliminated and the recursive version requires more stack space.

Binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For in-memory searching, if the interval to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference. For external searching, care must be taken or each of the first several probes will lead to a disk seek. A common technique is to abandon binary searching for linear searching as soon as the size of the remaining interval falls below a small value such as 8 or 16, or even more on recent computers. The exact value depends entirely on the machine running the algorithm.
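The hybrid strategy just described can be sketched as follows (hybrid_search is a name introduced here; the threshold of 8 is an arbitrary choice, machine-dependent as noted above):

```cpp
#include <vector>

// Bisect until the remaining interval is small, then finish with a linear
// scan, which has better locality of reference. Returns an index of value,
// or -1 if it is not present.
int hybrid_search(const std::vector<int>& A, int value) {
    int low = 0;
    int high = static_cast<int>(A.size()) - 1;
    while (high - low > 8) {
        int mid = low + (high - low) / 2;
        if (A[mid] < value) low = mid + 1;  // value, if present, lies after mid
        else high = mid;                    // mid remains a candidate
    }
    for (int i = low; i <= high; ++i)       // linear scan of the small remainder
        if (A[i] == value) return i;
    return -1;  // not found
}
```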

When multiple binary searches are to be performed with the same key in related lists, fractional cascading can be used to speed up successive searches after the first one.

Examples

An example of binary search in action is a simple guessing game in which a player has to guess a positive integer, between 1 and N, selected by another player, using only questions answered with yes or no. Supposing N is 16 and the number 11 is selected, the game might proceed as follows.

  • Is the number greater than 8? (Yes).
  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

Therefore, the number must be 11. At each step, we choose a number right in the middle of the range of possible values for the number. For example, once we know the number is greater than 8, but less than or equal to 12, we know to choose a number in the middle of the range [9, 12] (in this case 10 is optimal).

At most ⌈log₂ N⌉ questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is constrained to a particular range.

Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most 2⌈log₂ k⌉ steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it:

  • Is the number greater than 1? (Yes)
  • Is the number greater than 2? (Yes)
  • Is the number greater than 4? (Yes)
  • Is the number greater than 8? (Yes)
  • Is the number greater than 16? (No, N=16, proceed as above)

(We know the number is greater than 8.)

  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

As one simple example, in revision control systems, it is possible to use a binary search to see in which revision a piece of content was added to a file. We simply do a binary search through the entire version history; if the content is not present in a particular version, it appeared later, while if it is present it appeared at that version or sooner. This is far quicker than checking every difference.
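That bisection over history can be sketched as follows. first_containing and contains are names introduced here; contains(rev) stands for whatever test reports whether the content is present in a given revision, and is assumed monotone (false before the content appeared, true from then on).

```cpp
#include <functional>

// Bisect a revision history numbered 0 .. N - 1. Returns the first revision
// for which contains() is true, or N if the content never appears.
int first_containing(int N, const std::function<bool(int)>& contains) {
    int low = 0;
    int high = N;
    while (low < high) {
        int mid = low + (high - low) / 2;
        if (contains(mid)) high = mid;  // appeared at mid or earlier
        else low = mid + 1;             // appeared after mid
    }
    return low;
}
```

Each probe checks one revision, so only about log₂ N revisions need be examined rather than every difference.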

There are many occasions unrelated to computers when a binary search is the quickest way to isolate a solution we seek. In troubleshooting a single problem with many possible causes, we can change half the suspects, see if the problem remains and deduce in which half the culprit is; change half the remaining suspects, and so on. (For finding non-deterministic bugs, where the test used will not always reveal the bug even if it is present in the revision, see Variations below.) See: Shotgun debugging.

People typically use a mixture of the binary search and interpolation search algorithms when searching a telephone book: after the initial guess, we exploit the fact that the entries are sorted to rapidly find the required entry. For example, when searching for Smith, if Rogers and Thomas have been found, one can flip to a page about halfway between the previous guesses; if this shows Samson, we know that Smith is somewhere between the Samson and Thomas pages, so we can bisect that range.

Language support

Many standard libraries provide a way to do binary search. C provides bsearch(3) in its standard library. C++'s STL provides algorithm functions binary_search, lower_bound and upper_bound. Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections for performing binary searches on Java arrays and Lists, respectively. They must be arrays of primitives, or the arrays or Lists must be of a type that implements the Comparable interface, or you must specify a custom Comparator object. Microsoft's .NET Framework 2.0 offers static generic versions of the Binary Search algorithm in its collection base classes. An example would be System.Array's method BinarySearch<T>(T[] array, T value). Python provides the bisect module. COBOL can perform binary search on internal tables using the SEARCH ALL statement.

Applications to complexity theory

Even if we do not know a fixed range the number k falls in, we can still determine its value by asking 2⌈log₂ k⌉ simple yes/no questions of the form "Is k greater than x?" for some number x. As a simple consequence, if you can answer the question "Is this integer property k greater than a given value?" in some amount of time, then you can find the value of that property in the same amount of time with an added factor of log₂ k. This is called a reduction, and it is because of this kind of reduction that most complexity theorists concentrate on decision problems - problems that ask for a simple yes/no answer.
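The unbounded search can be sketched as follows (recover and greater_than are names introduced here; greater_than(x) stands for the yes/no question "is k greater than x?"):

```cpp
#include <functional>

// Recover an unknown positive integer k using only answers to
// "is k greater than x?". First find an upper bound by repeated doubling,
// then binary search within the interval that doubling pinned down.
long long recover(const std::function<bool(long long)>& greater_than) {
    long long high = 1;
    while (greater_than(high)) high *= 2;  // now k <= high
    long long low = high / 2 + 1;          // and k > high / 2
    while (low < high) {
        long long mid = low + (high - low) / 2;
        if (greater_than(mid)) low = mid + 1;
        else high = mid;
    }
    return low;
}
```

The doubling phase asks about ⌈log₂ k⌉ questions and the search phase about the same again, giving the 2⌈log₂ k⌉ bound stated above.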

For example, suppose we could answer "Does this n × n matrix have determinant larger than k?" in O(n²) time. Then, by using binary search, we could find the (ceiling of the) determinant itself in O(n² log d) time, where d is the determinant; notice that d is not the size of the input, but the size of the output.

Variations

Several algorithms closely related to or extending binary search exist. For instance, noisy binary search solves the same class of problems as regular binary search, with the added complexity that any given test can return a false value at random. (Usually, the number of such erroneous results is bounded in some way, either in the form of an average error rate, or in the total number of errors allowed per element in the search space.) Optimal algorithms for several classes of noisy binary search problems have been known since the late seventies, and more recently, optimal algorithms for noisy binary search on quantum computers (where several elements can be tested at the same time) have been discovered.

References

  1. ^ Bentley, Jon (1986; 2nd edition 2000). Programming Pearls. Addison-Wesley, p. 34. ISBN 0201657880.
  2. ^ Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken, Google Research Blog.
  • Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Section 6.2.1: Searching an Ordered Table, pp. 409–426.
  • Kruse, Robert L. Data Structures and Program Design in C++. Prentice-Hall, 1999. ISBN 0-13-768995-0, p. 280.
  • Netty van Gasteren, Wim Feijen. The Binary Search Revisited, AvG127/WF214, 1995. (Investigates the foundations of binary search, debunking the myth that it applies only to sorted arrays.)