Selection algorithm

In computer science, a selection algorithm is an algorithm for finding the kth smallest (or kth largest) number in a list. This includes the simple common cases of finding the minimum, maximum, and median elements. These are also called order statistics. Relatively simple algorithms for finding the minimum, maximum, and kth smallest element in average linear time exist. It is also possible to find the kth smallest element in worst-case linear time or multiple order statistics at once. Selection is a subproblem of many more complex problems like the nearest neighbor problem and shortest path problems.

Selection with sorting algorithm

One simple and widely used selection algorithm is to apply a sorting algorithm to the list and then extract the kth element. This is an example of reducing one problem to another. It is particularly useful when we wish to make many different selections from a single list, in which case only one initial, expensive sort is needed, followed by many cheap extraction operations. When we need to perform only one selection, however, or when we need to strongly mutate the list between selections, this method can be costly, typically requiring at least O(n log n) time, where n is the length of the list.
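As a concrete illustration, here is a minimal Python sketch of this approach; the helper name select_by_sorting is ours, not a standard library routine:

 def select_by_sorting(a, k):
     # Return the kth smallest element (1-based) by sorting a copy of the list.
     # O(n log n) for the sort, O(1) for the extraction.
     return sorted(a)[k - 1]

 # When many selections are needed, sort once and reuse the result:
 # s = sorted(a); then s[k1 - 1], s[k2 - 1], ... each cost O(1).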

Linear minimum/maximum algorithms

Worst-case linear algorithms to find minimums or maximums are obvious; we keep two variables, one referring to the index of the minimum/maximum element seen so far, and one holding its value. As we scan through the list, we update these whenever we encounter a more extreme element:

 function minimum(a[1..n])
     minIndex := 1
     minValue := a[1]
     for i from 2 to n
         if a[i] < minValue
             minIndex := i
             minValue := a[i]
     return minValue

 function maximum(a[1..n])
     maxIndex := 1
     maxValue := a[1]
     for i from 2 to n
         if a[i] > maxValue
             maxIndex := i
             maxValue := a[i]
     return maxValue

This algorithm is based on the fact that, for a finite subset A of a totally ordered set (for instance, a subset of the integers, the real numbers, or the words in a dictionary) and any element x of that set, we have \max (A \cup \{x\}) = \max(\{\max A, x\}). Note that there may be multiple minimum or maximum elements. Because the comparisons above are strict, these algorithms find the minimum (or maximum) element with the smallest index. By using non-strict comparisons (≤ and ≥) instead, we would find the one with the largest index.

If we want to find both the minimum and the maximum at the same time, a slightly improved approach is to process the elements in pairs: first compare the two elements of each pair with each other, then compare the smaller of the two against the current minimum and the larger against the current maximum. This uses roughly 3n/2 comparisons rather than the 2n needed by two separate scans. Another approach is divide and conquer: find the minimum and maximum of the first half and of the second half, then combine them to obtain the global minimum and maximum.
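The following Python sketch illustrates the pairwise-comparison idea; the name min_and_max is ours, and the list is assumed to be non-empty. Each pair costs three comparisons instead of four:

 def min_and_max(a):
     # Find the minimum and maximum together using about 3n/2 comparisons.
     # Assumes a non-empty list.
     n = len(a)
     if n % 2:                       # odd length: the first element starts both running values
         lo = hi = a[0]
         start = 1
     else:                           # even length: compare the first pair to initialise them
         lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
         start = 2
     for i in range(start, n - 1, 2):
         x, y = a[i], a[i + 1]
         if x < y:                   # one comparison within the pair...
             if x < lo: lo = x       # ...one against the current minimum
             if y > hi: hi = y       # ...and one against the current maximum
         else:
             if y < lo: lo = y
             if x > hi: hi = x
     return lo, hi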

Nonlinear general selection algorithm

Using the same ideas as in the minimum/maximum algorithms, we can construct a simple but inefficient general algorithm for finding the kth smallest or kth largest item in a list, requiring O(kn) time, which is effective when k is small. To accomplish this, we repeatedly find the most extreme remaining value and move it to the front, until we reach our desired index. This can be seen as an incomplete selection sort. Here is the minimum-based algorithm:

 function select(a[1..n], k)
     for i from 1 to k
         minIndex := i
         minValue := a[i]
         for j from i+1 to n
             if a[j] < minValue
                 minIndex := j
                 minValue := a[j]
         swap a[i] and a[minIndex]
     return a[k]

Other advantages of this method are:

  • After locating the jth smallest element, it requires only O(j + (k − j)²) time to find the kth smallest element, or only O(k) for k ≤ 2j.
  • It can be done with linked list data structures, whereas the one based on partition requires random access.

Partition based general selection algorithm

Finding a worst-case linear algorithm for the general case of selecting the kth largest element is a much more difficult problem, but one does exist, and was published by Blum, Floyd, Pratt, Rivest, and Tarjan in their 1973 paper Time bounds for selection. It uses concepts based on those used in the quicksort sorting algorithm, along with an innovation of its own.

In quicksort, there is a subprocedure called partition which can, in linear time, group the list into two parts, those less than a certain value, and those greater than or equal to a certain value. Here is pseudocode which performs a partition about the value a[pivotIndex]:

 function partition(a, left, right, pivotIndex)
     pivotValue := a[pivotIndex]
     swap a[pivotIndex] and a[right]  // Move pivot to end
     storeIndex := left
     for i from left to right-1
         if a[i] ≤ pivotValue
             swap a[storeIndex] and a[i]
             storeIndex := storeIndex + 1
     swap a[right] and a[storeIndex]  // Move pivot to its final place
     return storeIndex

In quicksort, we recursively sort both parts, leading to best-case Ω(n log n) time. However, when doing selection, we already know which partition contains our desired element, since the pivot is in its final sorted position: every element before it is less than or equal to it, and every element after it is greater than or equal to it. Thus a single recursive call locates the desired element in the correct partition:

 function select(a, k, left, right)
     select a pivot value a[pivotIndex]
     pivotNewIndex := partition(a, left, right, pivotIndex)
     if k = pivotNewIndex
         return a[k]
     else if k < pivotNewIndex
         return select(a, k, left, pivotNewIndex-1)
     else
         return select(a, k, pivotNewIndex+1, right)

Note the resemblance to quicksort; indeed, just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log n) of its O(n) partitions. This simple procedure has expected linear performance, and, like quicksort, has quite good performance in practice. It is also an in-place algorithm, requiring only constant memory overhead, since the tail recursion can be eliminated with a loop like this:

 function select(a, k, left, right)
     loop
         select a pivot value a[pivotIndex]
         pivotNewIndex := partition(a, left, right, pivotIndex)
         if k = pivotNewIndex
             return a[k]
         else if k < pivotNewIndex
             right := pivotNewIndex-1
         else
             left := pivotNewIndex+1

Like quicksort, the performance of the algorithm is sensitive to the pivot that is chosen. If bad pivots are consistently chosen, this degrades to the minimum-based selection described previously.
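The pseudocode above translates directly into Python. The sketch below chooses the pivot uniformly at random, which makes consistently bad pivots unlikely and gives expected linear time; the function names are illustrative:

 import random

 def partition(a, left, right, pivot_index):
     # In-place partition around a[pivot_index]; returns the pivot's final index.
     pivot_value = a[pivot_index]
     a[pivot_index], a[right] = a[right], a[pivot_index]    # move pivot to the end
     store = left
     for i in range(left, right):
         if a[i] <= pivot_value:
             a[store], a[i] = a[i], a[store]
             store += 1
     a[right], a[store] = a[store], a[right]                # move pivot to its final place
     return store

 def quickselect(a, k):
     # Return the kth smallest element (1-based); mutates the list a.
     left, right, k = 0, len(a) - 1, k - 1                  # switch to 0-based indexing
     while True:
         pivot_index = random.randint(left, right)          # random pivot choice
         p = partition(a, left, right, pivot_index)
         if k == p:
             return a[k]
         elif k < p:
             right = p - 1
         else:
             left = p + 1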

Linear general selection algorithm - "Median of Medians algorithm"

There is a way to consistently find very good pivots; this is the key to making the previous algorithm worst-case linear. To accomplish this, we begin by dividing the list into groups of five elements. Any leftover elements are ignored for now. Then, for each group, we find its median, an operation that can be made very fast by loading all five values into registers and comparing them (provided the values are of simple types). We move all these medians into one contiguous block in the list and invoke select recursively on this sublist of n/5 elements to find its median value. Then, we use this "median of medians" as our pivot.

What makes this a good pivot? Note that it is greater than half of the n/5 medians, that is, about n/10 of them, and less than the other half. Moreover, each of those medians is itself greater than two other elements of its group of five, and less than two others, so the pivot is greater than at least about 3n/10 elements of the list and, symmetrically, less than at least about 3n/10 elements. Thus the pivot we chose splits the elements somewhere between 30%/70% and 70%/30%, which is good enough to ensure worst-case linear behavior of the algorithm.
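The following Python sketch shows the whole worst-case linear algorithm built around this pivot choice. For brevity it partitions out of place and includes any leftover group of fewer than five elements rather than setting it aside, which does not affect the asymptotic argument; the name mom_select is ours:

 def mom_select(a, k):
     # Return the kth smallest element (1-based) of a in worst-case O(n) time.
     if len(a) <= 5:
         return sorted(a)[k - 1]
     # Median of each group of five (the last group may be smaller).
     groups = [a[i:i + 5] for i in range(0, len(a), 5)]
     medians = [sorted(g)[len(g) // 2] for g in groups]
     # Recursively take the median of the medians as the pivot.
     pivot = mom_select(medians, (len(medians) + 1) // 2)
     lows   = [x for x in a if x < pivot]
     highs  = [x for x in a if x > pivot]
     pivots = [x for x in a if x == pivot]
     if k <= len(lows):
         return mom_select(lows, k)
     elif k <= len(lows) + len(pivots):
         return pivot
     else:
         return mom_select(highs, k - len(lows) - len(pivots))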

The trick, however, is ensuring that the added recursive call does not destroy this worst-case linear behavior. The list of medians is only 20% of the size of the original list (n/5 elements), and the other recursive call recurses on at most 70% of the list, so we have the following recurrence relation for the running time:

 T(n) ≤ T(n/5) + T(7n/10) + O(n)

The O(n) term accounts for the partitioning work (and for computing the medians of the groups of five). Because n/5 + 7n/10 = 9n/10 < n, it is not hard to find a linear bound on the running time that satisfies this recurrence.
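For example, guessing a bound of the form T(n) ≤ cn and writing the linear work as an (the constants c and a are ours, for illustration), substitution into the recurrence gives

 T(n) ≤ c(n/5) + c(7n/10) + an = (9c/10 + a)n ≤ cn

whenever c ≥ 10a, which confirms the linear bound.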

In practice, although this approach optimizes quite well, it is still typically outperformed by a considerable margin by the expected linear algorithm with fairly naive pivot choices, such as choosing randomly. The worst-case algorithm still has importance, however; it can be used, for example, to construct a worst-case O(n log n) quicksort algorithm, by using it to find the median at every step.

Selection as incremental sorting

One of the advantages of the sort-and-index approach, as mentioned, is its ability to amortize the sorting cost over many subsequent selections. However, sometimes the number of selections that will be done is not known in advance, and may be either small or large. In these cases, we can adapt the algorithms given above to simultaneously select an element while partially sorting the list, thus accelerating future selections.

Both the selection procedure based on minimum-finding and the one based on partitioning can be seen as a form of partial sort. The minimum-based algorithm sorts the list up to the given index, and so clearly speeds up future selections, especially of smaller indexes. The partition-based algorithm does not achieve the same behaviour automatically, but can be adapted to remember its previous pivot choices and reuse them wherever possible, avoiding costly partition operations, particularly the top-level one. The list becomes gradually more sorted as more partition operations are done incrementally; no pivots are ever "lost." If desired, this same pivot list could be passed on to quicksort to reuse, again avoiding many costly partition operations.

Using data structures to select in sublinear time

Given an unorganized list of data, linear time (Ω(n)) is required to find the minimum element, because we have to examine every element (otherwise, we might miss it). If we organize the list, for example by keeping it sorted at all times, then selecting the kth largest element is trivial, but then insertion requires linear time, as do other operations such as combining two lists.

The strategy to find an order statistic in sublinear time is to store the data in an organized fashion using suitable data structures that facilitate the selection. Two such data structures are tree based structures and frequency tables.

When only the minimum (or maximum) is needed, a good approach is to use a priority queue, which is able to find the minimum (or maximum) element in constant time, while all other operations, including insertion, are O(log n) or better. More generally, a self-balancing binary search tree can easily be augmented to make it possible to both insert an element and find the kth largest element in O(log n) time. We simply store in each node a count of how many descendants it has, and use this to determine which path to follow. The information can be updated efficiently since adding a node only affects the counts of its O(log n) ancestors, and tree rotations only affect the counts of the nodes involved in the rotation.
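A minimal Python sketch of this augmentation is shown below. For clarity it uses a plain (unbalanced) binary search tree, so the stated O(log n) bounds only hold while the tree stays balanced; a production version would use a self-balancing tree such as a red-black tree and update the counts during rotations. All names here are illustrative:

 class Node:
     # Binary search tree node augmented with the size of its subtree.
     def __init__(self, key):
         self.key = key
         self.left = None
         self.right = None
         self.size = 1                      # number of nodes in this subtree

 def insert(node, key):
     # Insert key and update subtree sizes on the way back up.
     if node is None:
         return Node(key)
     if key < node.key:
         node.left = insert(node.left, key)
     else:
         node.right = insert(node.right, key)
     node.size += 1
     return node

 def kth_smallest(node, k):
     # Return the kth smallest key (1-based) by following the subtree counts.
     left_size = node.left.size if node.left else 0
     if k <= left_size:
         return kth_smallest(node.left, k)
     if k == left_size + 1:
         return node.key
     return kth_smallest(node.right, k - left_size - 1)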

Another simple strategy is based on some of the same concepts as the hash table. When we know the range of values beforehand, we can divide that range into h subintervals and assign these to h buckets. When we insert an element, we add it to the bucket corresponding to the interval it falls in. To find the minimum or maximum element, we scan from the beginning or the end for the first nonempty bucket and find the minimum or maximum element in that bucket. In general, to find the kth element, we maintain a count of the number of elements in each bucket, scan the buckets from left to right adding up counts until we find the bucket containing the desired element, and then use the expected linear-time algorithm to find the correct element within that bucket.

If we choose h of size roughly sqrt(n), and the input is close to uniformly distributed, this scheme can perform selections in expected O(sqrt(n)) time. Unfortunately, the strategy is sensitive to clustering of elements in a narrow interval, which may result in buckets with large numbers of elements (clustering can be eliminated through a good hash function, but finding the element with the kth largest hash value isn't very useful). Additionally, like a hash table, this structure requires table resizings to maintain efficiency as elements are added and n becomes much larger than h². A useful special case is finding an order statistic or extremum over a finite range of values, such as exam marks limited to the integers 1 to 100 for a much larger number of students. In that case, the table above with a bucket interval of 1 and a count maintained in each bucket is far superior to the other methods. Such tables are essentially the frequency tables used to classify data in descriptive statistics.
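For the bucket-interval-1 case just described, a short Python sketch might look like this; the function name and the default range are ours, chosen to match the exam-marks example:

 def kth_in_range(values, k, lo=1, hi=100):
     # Return the kth smallest value, assuming every value is an integer in [lo, hi].
     counts = [0] * (hi - lo + 1)          # one frequency count per possible value
     for v in values:
         counts[v - lo] += 1
     remaining = k
     for i, c in enumerate(counts):        # scan buckets left to right, accumulating counts
         if remaining <= c:
             return lo + i
         remaining -= c
     raise ValueError("k exceeds the number of elements")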

Selecting k smallest or largest elements

Another fundamental selection problem is that of selecting the k smallest or k largest elements, which is particularly useful where we want to present just the "top k" of a sorted list, such as the top 100 corporations by gross sales. A number of simple but inefficient solutions are possible. We can use the linear-time solutions discussed above to select each element one at a time, resulting in time O(kn) — we can achieve the same time just by running the first k iterations of selection sort. If log n is much less than k, a better simple strategy is to sort the list and then take the first or last k elements.

Another simple method is to add each element of the list into an ordered set data structure, such as a heap or self-balancing binary search tree, with at most k elements. Whenever the data structure has more than k elements, we remove the largest element, which can be done in O(log k) time. Each insertion operation also takes O(log k) time, resulting in O(n log k) time overall.
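A sketch of this bounded-heap method in Python follows. The standard heapq module provides a min-heap, so values are negated to simulate the max-heap of size k described above (heapq.nsmallest(k, a) already implements essentially the same idea):

 import heapq

 def k_smallest(a, k):
     # Return the k smallest elements of a in sorted order, in O(n log k) time.
     # Assumes k >= 1.
     heap = []                              # negated values: a max-heap of at most k elements
     for x in a:
         if len(heap) < k:
             heapq.heappush(heap, -x)
         elif x < -heap[0]:                 # x is smaller than the largest of the kept k
             heapq.heapreplace(heap, -x)    # drop that largest element and keep x
     return sorted(-v for v in heap)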

More efficient than any of these are specialized partial sorting algorithms based on mergesort and quicksort. The simplest is the quicksort variation: there is no need to recursively sort partitions which only contain elements that would fall after the kth place in the end. Thus, if the pivot falls in position k or later, we only recur on the left partition:

 function quicksortFirstK(a, left, right, k)
     if right > left
         select a pivot value a[pivotIndex]
         pivotNewIndex := partition(a, left, right, pivotIndex)
         quicksortFirstK(a, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             quicksortFirstK(a, pivotNewIndex+1, right, k)

The resulting algorithm requires an expected time of only O(n + k log k), and is quite efficient in practice, especially if we substitute selection sort when k becomes small relative to n.

Even better, if we don't require those k items to be themselves sorted, we can ignore all partitions that fall entirely before or after the kth place; we recurse only into the partition that actually contains the kth element itself.

 function findFirstK(a, left, right, k)
     if right > left
         select a pivot value a[pivotIndex]
         pivotNewIndex := partition(a, left, right, pivotIndex)
         if pivotNewIndex > k  // new condition
             findFirstK(a, left, pivotNewIndex-1, k)
         if pivotNewIndex < k
             findFirstK(a, pivotNewIndex+1, right, k)

The resulting algorithm requires an expected time of only O(n), which is just about the best any algorithm can hope for.

Another method is the tournament algorithm. The idea is to conduct a knockout tournament to decide the ranks. It first organises the games (comparisons) between adjacent pairs and moves the winners to the next round until the champion (the overall best) is decided, constructing the tournament tree along the way. The second best element must then be among the direct losers to the champion, and these losers can be found by walking down the tournament tree in O(log n) time. Another tournament is then held to decide the second best among these candidates. The third best must be among the losers to the second best in either of the two tournament trees, and so on. The approach continues until k elements have been found. This algorithm takes O(n + k log n) time.

A simple case of this occurs when k = 2: selecting the second best element in addition to the best, in O(n + log n) time; see the sketch below.
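The following Python sketch handles this k = 2 case, taking "best" to mean smallest. It records, for every element, the elements that lost to it directly, so the second best can be taken from the winner's short list of direct losers; the function name is ours, and the list is assumed to have at least two elements:

 def two_smallest(a):
     # Knockout tournament: find the smallest element with n - 1 comparisons,
     # then the second smallest among its ceil(lg n) direct losers.
     players = [(x, []) for x in a]         # (value, values that lost directly to it)
     while len(players) > 1:
         nxt = []
         for i in range(0, len(players) - 1, 2):
             (x, xl), (y, yl) = players[i], players[i + 1]
             if x < y:
                 xl.append(y)               # y lost directly to x
                 nxt.append((x, xl))
             else:
                 yl.append(x)
                 nxt.append((y, yl))
         if len(players) % 2:               # odd player out gets a bye to the next round
             nxt.append(players[-1])
         players = nxt
     winner, losers = players[0]
     return winner, min(losers)             # the runner-up lost directly to the winner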

Lower bounds

In his seminal work The Art of Computer Programming, Donald Knuth discussed a number of lower bounds for the number of comparisons required to locate the kth smallest entry of an unorganized list of n items (using only comparisons). There is a trivial lower bound of n − 1 for the minimum or maximum entry. To see this, consider a tournament where each game represents one comparison. Since every player except the winner of the tournament must lose a game before we know the winner, we have a lower bound of n − 1 comparisons.

The story becomes more complex for other indexes. To find the kth smallest value requires at least this many comparisons:

n - k + \sum_{n+1-k < j \leq n} \lceil{\operatorname{lg}\, j}\rceil

This bound is achievable for k = 2: the sum then contains only the single term ⌈lg n⌉, giving n − 2 + ⌈lg n⌉ comparisons, which matches the tournament method above. Better, more complex bounds exist for larger k.

Language support

Very few languages have built-in support for general selection, although many provide facilities for finding the smallest or largest element of a list. A notable exception is C++, which provides a templated nth_element function with a guarantee of expected linear time. It is implied but not required that it is based on the expected linear-time partition-based algorithm above. (See section 25.3.2 of ISO/IEC 14882:2003(E) and 14882:1998(E); see also the SGI STL description of nth_element.)

C++ also provides the partial_sort algorithm, which solves the problem of selecting the smallest k elements (sorted), with a time complexity of O(n log k). No algorithm is provided for selecting the greatest k elements, but this can easily be achieved by inverting the ordering predicate.

For Perl, the module Sort::Key::Top, available from CPAN, provides a set of functions to select the top n elements from a list using several orderings and custom key extraction procedures.

Because language support for sorting is more ubiquitous, the simplistic approach of sorting followed by indexing is preferred in many environments despite its disadvantage in speed.
