Binary search algorithm

From Wikipedia, the free encyclopedia

A binary search algorithm (or binary chop) is a technique for finding a particular value in a linear array, by ruling out half of the data at each step, widely but not exclusively used in computer science. A binary search finds the median, makes a comparison to determine whether the desired value comes before or after it, and then searches the remaining half in the same manner. A binary search is an example of a divide and conquer algorithm (more specifically a decrease and conquer algorithm) and a dichotomic search (more at Search algorithm).

Contents

[edit] The algorithm

The most common application of binary search is to find a specific value in a sorted list. To cast this in the frame of the guessing game (see Example below), realize that we are now guessing the index, or numbered place, of the value in the list.

The search begins by examining the value in the center of the list; because the values are sorted, it then knows whether the value occurs before or after the center value, and searches through the correct half in the same way. Here is simple pseudocode which determines the index of a given value in a sorted list a between indices left and right:

function binarySearch(a, value, left, right)
    if right < left
        return not found
    mid := floor((left+right)/2)
    if value > a[mid]
        return binarySearch(a, value, mid+1, right)
    else if value < a[mid]
        return binarySearch(a, value, left, mid-1)
    else
        return mid

Because the calls are tail-recursive, this can be rewritten as a loop, making the algorithm in-place:

function binarySearch(a, value, left, right)
    while left ≤ right
        mid := floor((left+right)/2)
        if value > a[mid]
            left  := mid+1
        else if value < a[mid]
            right := mid-1
        else
            return mid
    return not found

In both cases, the algorithm terminates because on each recursive call or iteration, the range of indexes right minus left always gets smaller, and so must eventually become negative.

[edit] Performance

Binary search is a logarithmic algorithm and executes in O(log n) time. Specifically, 1 + log2N iterations are needed to return an answer. In most cases it is considerably faster than a linear search. It can be implemented using recursion or iteration, as shown above. In some languages it is more elegantly expressed recursively; however, in most C-based languages it is more efficient to use a loop, because the stack use is increased with every recursion.

In the above code examples, the branches are ordered so that earlier branches are the more likely ones to be taken, resulting in less comparisons overall; however, in performance-critical applications the effect on branch prediction should be evaluated.

It should be noted that binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For in-memory searching, if the interval to be searched is small, a linear search may have superior performance simply because it exhibits better locality of reference. For external searching, care must be taken or each of the first several probes will lead to a disk seek. A common technique for high performance applications is to replace the loop test left ≤ right with something like right - left > N (where N has a small value like 8 or 16), followed by a short linear search over the remaining interval.

[edit] Examples

An example of binary search in action is a simple guessing game in which a player has to guess a positive integer, between 1 and N, selected by another player, using only questions answered with yes or no. Supposing N is 16 and the number 11 is selected, the game might proceed as follows.

  • Is the number greater than 8? (Yes).
  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

Therefore, the number must be 11. At each step, we choose a number right in the middle of the range of possible values for the number. For example, once we know the number is greater than 8, but less than or equal to 12, we know to choose a number in the middle of the range [9, 12] (in this case 10 is optimal).

At most \lceil\log_2 N\rceil questions are required to determine the number, since each question halves the search space. Note that one less question (iteration) is required than for the general algorithm, since the number is constrained to a particular range.

Even if the number we're guessing can be arbitrarily large, in which case there is no upper bound N, we can still find the number in at most 2\lceil \log_2 k \rceil steps (where k is the (unknown) selected number) by first finding an upper bound by repeated doubling. For example, if the number were 11, we could use the following sequence of guesses to find it:

  • Is the number greater than 1? (Yes)
  • Is the number greater than 2? (Yes)
  • Is the number greater than 4? (Yes)
  • Is the number greater than 8? (Yes)
  • Is the number greater than 16? (No, N=16, proceed as above)

( We know the number greater than 8 )

  • Is the number greater than 12? (No)
  • Is the number greater than 10? (Yes)
  • Is the number greater than 11? (No)

As one simple application, in revision control systems, it is possible to use a binary search to see in which revision a piece of content was added to a file. We simply do a binary search through the entire version history; if the content is not present in a particular version, it appeared later, while if it is present it appeared at that version or sooner. This is far quicker than checking every difference.

There are many occasions unrelated to computers when a binary chop is the quickest way to isolate a solution we seek. In troubleshooting a single problem with many possible causes, we can change half the suspects, see if the problem remains and deduce in which half the culprit is; change half the remaining suspects, and so on.

People typically use a variant of the binary search algorithm when searching a telephone book, after the inital guess we exploit the fact that the entries are sorted and can rapidly find the required entry. For example when searching for Smith, if Rogers and Thomas have been found, one can flip to the page halfway between the previous guesses, if this shows Samson, we know that Smith is somewhere between the Samson and Thomas pages so we can bisect these.

[edit] Language support

Many standard libraries provide a way to do binary search. C provides bsearch in its standard library. C++'s STL provides algorithm functions lower bound and upper bound. Java offers a set of overloaded binarySearch() static methods in the classes Arrays and Collections for performing binary searches on Java arrays and Lists, respectively. They must be arrays of primitives, or the arrays or Lists must be of a type that implements the Comparable interface, or you must specify a custom Comparator object. Microsoft's .NET Framework 2.0 offers static generic versions of the Binary Search algorithm in its collection base classes. An example would be System.Array's method BinarySearch<T>(T[] array, T value). Python provides the bisect module.

[edit] Applications to complexity theory

Even if we do not know a fixed range the number k falls in, we can still determine its value by asking 2\lceil\log_2k\rceil simple yes/no questions of the form "Is k greater than x?" for some number x. As a simple consequence of this, if you can answer the question "Is this integer property k greater than a given value?" in some amount of time then you can find the value of that property in the same amount of time with an added factor of log k. This is called a reduction, and it is because of this kind of reduction that most complexity theorists concentrate on decision problems, algorithms that produce a simple yes/no answer.

For example, suppose we could answer "Does this n x n matrix have determinant larger than k?" in O(n2) time. Then, by using binary search, we could find the (ceiling of the) determinant itself in O(n2log d) time, where d is the determinant; notice that d is not the size of the input, but the size of the output.

[edit] See also

[edit] External links

[edit] References

  • Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Section 6.2.1: Searching an Ordered Table, pp.409–426.