No free lunch in search and optimization
From Wikipedia, the free encyclopedia
Many computational problems are solved by searching for good solutions in a space of candidates. An association of candidate solutions with goodness values is called an objective function. Problem solvers known as search algorithms take objective functions as inputs and evaluate the goodness of candidate solutions one by one. Here it is convenient to regard the output of a search algorithm as the observed sequence of goodness values[1].
There is an important class[2][3] of search algorithms for which inputs and outputs are in one-to-one correspondence [1]. Put loosely, the algorithms generate identical outputs, but in response to different inputs. If performance is measured on the outputs, then no algorithm is generally superior to any other. When an algorithm exhibits performance superior to that of another on selected inputs, it "pays" with inferiority on the excluded inputs. Thus there is no free lunch (NFL) for search algorithms[2].
Some measures of performance indicate how well search algorithms do at optimization of the objective function. Indeed, search algorithms of the class under consideration seem to have no interesting application other than optimization. Thus it is reasonable to say that there is no free lunch for optimizers[3], provided it is understood that the optimizers referred to are search algorithms[4].
Informal introduction
If, each time a search algorithm executes, all inputs are equally likely, then all outputs are equally likely. The following makes this concrete, describing how to generate objective functions by spinning roulette wheels, and showing that the sequence of values observed in search does not depend upon the algorithm.
When a croupier spins a European roulette wheel, the outcome is that the ball randomly falls into one of 37 pockets numbered from 0 to 36. Because the ball is equally likely to end up in each pocket, the outcome is uniformly distributed on the pockets. Because the outcome of a spin is unaffected by the outcomes of other spins, the distributions of outcomes are independent.
Consider a one-player game with 100 roulette wheels numbered from 1 to 100. The game begins with the player sequestered. Croupiers spin all the wheels simultaneously, wait for 100 outcomes, and then shroud the wheels. The association of wheels with outcomes serves as the objective function f, where f(n) is the outcome at wheel n. The player now removes shrouds one at a time, attempting to locate the wheel n with the highest value f(n). The player's performance is 100 minus the number of shrouds removed to find the maximum value.
There is no strategy for beating this optimization game. This is easy to see in an equivalent version of the game that uses just one roulette wheel. The outcomes of consecutive spins of a single wheel are independent and identically distributed, so spinning one wheel 100 times is equivalent to spinning 100 wheels at once. In the revised game, the player repeatedly names a virtual wheel n to "unshroud" next, and a croupier gives the single wheel a fresh spin to determine f(n).
Each outcome is unaffected by the method (search algorithm) the player uses for deciding which wheel to "unshroud" next. All associations of wheels with outcomes (objective functions) are equally likely, as are all sequences of 100 observed values. In short, a uniform distribution of inputs to a search algorithm induces a uniform distribution of outputs. Performance is a function of outputs, and algorithms with identically distributed outputs must have identically distributed performance values. Identically distributed performance implies, but is not equivalent to, identical average performance for all algorithms over all objective functions.
In the original implementation of the optimization game, it is possible to have algorithms play head-to-head. First one plays as originally described. Then the croupiers cover the wheels with drapes and allow the second algorithm to play with precisely the objective function the first did. One algorithm may have performance superior to that of the other on some objective functions, but it is certain that the apparently inferior algorithm compensates with superiority on other objective functions. There is no free lunch for search algorithms.
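The head-to-head game can be simulated directly. The following Python sketch (the function names and the two example strategies are illustrative, not from the article) estimates the average performance of a fixed left-to-right unshrouding order against a randomly shuffled order over many randomly generated objective functions; as the argument predicts, the averages come out essentially equal.

```python
import random

def play(order, f):
    """Unshroud wheels in the given order; performance is 100 minus
    the number of shrouds removed to find the maximum value."""
    best = max(f)
    for count, n in enumerate(order, start=1):
        if f[n] == best:
            return 100 - count

random.seed(0)
trials = 20000
fixed_order = list(range(100))      # always unshroud wheels in order 0, 1, 2, ...
total_fixed = total_random = 0
for _ in range(trials):
    f = [random.randrange(37) for _ in range(100)]   # one roulette spin per wheel
    total_fixed += play(fixed_order, f)
    total_random += play(random.sample(range(100), 100), f)

print(total_fixed / trials, total_random / trials)   # nearly identical averages
```

Any other strategy for choosing the next wheel would produce the same average, since the outcome behind each shroud is uniform and independent of the choices made so far.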
Synopsis of NFL in mathematical terms
The set of all objective functions is Y^X, where X is a finite solution space and Y is a finite poset. The set of all permutations of X is J. Random variable F is distributed on Y^X. For all j in J, F ∘ j is a random variable distributed on Y^X, with Pr{F ∘ j = f} = Pr{F = f ∘ j⁻¹} for all f in Y^X. A search algorithm is functionally equivalent to
- Input: f ∈ Y^X
- Output: y ∈ Y^|X|
- S ← X
- x ← ε
- y ← ε
- Do |X| times:
  - a ← NextToEvaluate(S, x, y)
  - b ← f(a)
  - S ← S \ {a}
  - x ← xa
  - y ← yb
- Return y
for any deterministic implementation of NextToEvaluate that returns an element of S, the set of unevaluated candidate solutions, without evaluating any element of S. The sequences of evaluated candidates and corresponding values are stored in x and y, respectively.
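The pseudocode above translates almost line for line into Python. In the sketch below, `run_search` and the greedy `nearest_to_best` policy are hypothetical names chosen for illustration; any deterministic implementation of NextToEvaluate could be plugged in.

```python
def run_search(next_to_evaluate, f, X):
    """Evaluate every candidate in X, choosing each one with next_to_evaluate.

    next_to_evaluate(S, x, y) must return an element of S, the set of
    unevaluated candidates, using only the trace (x, y) observed so far.
    Returns y, the sequence of observed objective values.
    """
    S = set(X)
    x, y = [], []
    for _ in range(len(X)):
        a = next_to_evaluate(S, x, y)
        b = f(a)
        S.remove(a)
        x.append(a)
        y.append(b)
    return tuple(y)

# Example policy (a hypothetical choice, not from the article): try the
# unevaluated candidate nearest to the best candidate seen so far.
def nearest_to_best(S, x, y):
    if not y:
        return min(S)
    best = x[y.index(max(y))]
    return min(S, key=lambda a: abs(a - best))

out = run_search(nearest_to_best, lambda n: (n * 7) % 5, range(5))
```

Because the loop runs |X| times and each chosen candidate is removed from S, the output y is always a sequence of all |X| objective values, in the order the algorithm discovered them.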
Let a(f) denote the output of search algorithm a on input f. If a(F) and b(F) are identically distributed for all search algorithms a and b, then F has an NFL distribution. This condition holds if and only if F and F ∘ j are identically distributed for all j in J.[4][5] In other words, there is no free lunch for search algorithms if and only if the distribution of objective functions is invariant under permutation of the solution space.
Let Φ be any performance measure on Y^|X|. If Φ(a(F)) and Φ(b(F)) are identically distributed for all search algorithms a and b, then F has a Φ-NFL distribution. Every NFL distribution is a Φ-NFL distribution for all performance measures Φ. To see this, suppose that F has an NFL distribution. Then identically distributed a(F) and b(F) imply identically distributed Φ(a(F)) and Φ(b(F)) for all search algorithms a and b and for all single-valued performance measures Φ.
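Conversely, when the distribution of objective functions is not invariant under permutation of the solution space, one search algorithm can beat another. A small simulation illustrates this (the three-point space and the particular biased distribution are assumptions made for the example): f(0) is biased toward high values, so an algorithm that evaluates point 0 first finds a good value in fewer evaluations on average.

```python
import random

random.seed(1)

# Not permutation-invariant: f(0) = 1 with probability 3/4, while
# f(1) and f(2) are fair coin flips.
def sample_f():
    return {0: int(random.random() < 0.75),
            1: random.randrange(2),
            2: random.randrange(2)}

def evaluations_to_find_1(order, f):
    """Number of evaluations needed to observe the value 1 (one more
    than the space size if the value 1 never occurs)."""
    for i, a in enumerate(order, start=1):
        if f[a] == 1:
            return i
    return len(order) + 1

trials = 20000
cost_a = sum(evaluations_to_find_1([0, 1, 2], sample_f()) for _ in range(trials))
cost_b = sum(evaluations_to_find_1([2, 1, 0], sample_f()) for _ in range(trials))
print(cost_a / trials, cost_b / trials)  # the algorithm that tries 0 first wins
```

Permuting the solution space (swapping points 0 and 2) changes the distribution of functions here, which is exactly the condition under which the equivalence above permits unequal performance.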
"The" NFL theorem
"The NFL theorem of Wolpert and Macready" is part of the mathematical folklore that has grown up around the researchers' seminal work. In the abstract of their first published article on NFL, Wolpert and Macready refer to "a number of 'no free lunch' (NFL) theorems."[3] Furthermore, the folkloric theorem does not correspond to any that Wolpert and Macready have published. It seems to originate in informal statements of the consequences of the formal theorems:
- "[A]ll algorithms that search for an extremum of a cost [objective] function perform exactly the same, when averaged over all possible cost functions."[2]
- "[T]he average performance of any pair of algorithms across all possible problems is identical."[3]
The second quotation comes from a sentence that begins, "Roughly speaking, we show that..." Precisely, they show that[3]
- Theorem 1: For any pair of algorithms a1 and a2,
  ∑_f P(d^y_m | f, m, a1) = ∑_f P(d^y_m | f, m, a2),
  where d^y_m denotes the sequence of m values produced by evaluating m candidate solutions.
Excluding stochastic algorithms from consideration, the probabilities are all zero or one, and a fairly accurate plain-language rendition is that the number of objective functions f for which a given multiset of values is observed after evaluation of m candidate solutions is identical for all algorithms. This implies the NFL theorem of folklore, and much more.
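On a small enough solution space this count can be checked by brute force. The sketch below (the two deterministic algorithms are arbitrary choices made for illustration) enumerates every function from a three-point space to {0, 1} and tallies, for each multiset of values observed after m = 2 evaluations, how many functions produce it; the tallies coincide for both algorithms.

```python
from collections import Counter
from itertools import product

X = [0, 1, 2]   # a tiny solution space
Y = [0, 1]      # a tiny set of objective values
m = 2           # number of candidate solutions evaluated

def trace(policy, f):
    """The m values observed when candidates are chosen by policy."""
    S, x, y = set(X), [], []
    for _ in range(m):
        a = policy(S, x, y)
        S.remove(a)
        x.append(a)
        y.append(f[a])
    return y

# Two arbitrary deterministic algorithms (chosen for illustration).
a1 = lambda S, x, y: min(S)                                 # left-to-right sweep
a2 = lambda S, x, y: max(S) if not y or y[-1] else min(S)   # value-dependent

def histogram_counts(policy):
    """Count, for each multiset of m observed values, the functions producing it."""
    counts = Counter()
    for values in product(Y, repeat=len(X)):    # all |Y|^|X| objective functions
        f = dict(enumerate(values))
        counts[tuple(sorted(trace(policy, f)))] += 1
    return counts

print(histogram_counts(a1) == histogram_counts(a2))
```

The comparison remains true for any other pair of deterministic policies substituted for a1 and a2, since the sum in Theorem 1 runs over all functions, a set closed under permutation of X.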
Theorem 1 applies to "static" objective functions. Wolpert and Macready's Theorem 2 establishes a "more subtle" NFL result for time-varying objective functions.[3]
Interpretations of NFL results
Theorem 1 explains why, over the set of all objective functions, each search algorithm does on average as well as any other. Every search algorithm embodies a bias: when the assumptions the algorithm makes match the structure of the objective function it performs well, and when they do not it performs poorly; averaged over all functions, the gains and losses cancel.
An implicit assumption of the no-free-lunch theorem is that all types of problems or data sets are equally likely. It is debatable whether this assumption is valid. Empirical studies such as the Statlog project, performed on a number of different data sets, have shown that there is no single best classifier across all problems.
Alternatively, the theorem establishes that "a general-purpose universal optimization strategy is theoretically impossible, and the only way one strategy can outperform another is if it is specialized to the specific problem under consideration" (Ho and Pepyne, 2002). That paper also contains a much simplified proof of the NFLT accessible to most engineers.
The theorem is used as an argument against generic search algorithms such as genetic algorithms and simulated annealing when they are employed with as little domain knowledge as possible. It has also been applied to other generic algorithms, though arbitrarily large subsets of "real-world" problems can in general be constructed that are not covered by the NFLT. Although the theorem is mathematically sound, the fact that it cannot be applied to arbitrary subsets of cost functions limits its applicability in practice.
The theorem justifies the view that as much prior domain knowledge as possible should be used, with custom optimization routines constructed for particular domains. It may be thought of as a full employment theorem for optimization professionals.
Intelligent design and no free lunch
The no-free-lunch theorem has been used in some of the critiques of evolution offered by intelligent design proponents, such as William A. Dembski in his book No Free Lunch: Why Specified Complexity Cannot be Purchased Without Intelligence. David Wolpert has said that Dembski misapplies the NFLT, characterizing Dembski's treatment of it as "written in jello". Dembski's mathematical argument is technically incorrect for two reasons. First, the theorem assumes a fixed cost function, whereas in biology cost functions depend on such factors as allele frequencies and biotic behaviour. Second, the theorem implies only equal average success over all logically possible problems, while the biological world lacks some problems and presents others with varying probability. (In short, the NFLT says only that unequal average performances require a posteriori knowledge of the world.) Although the theorems may be generalisable, they have not yet been generalized, so these objections cannot currently be resolved in pure mathematics. Furthermore, other results in computation theory help to account for the observed success of evolutionary algorithms, which in the light of the NFLT can seem impossible (see also above).
References
- ^ a b English, T. M. 2000. "Optimization Is Easy and Learning Is Hard in the Typical Function," Proceedings of the 2000 Congress on Evolutionary Computation: CEC00, pp. 924-931. http://members.cox.net/tom.english/CEC2000.pdf
- ^ a b c Wolpert, D.H., Macready, W.G. (1995), No Free Lunch Theorems for Search, Technical Report SFI-TR-95-02-010 (Santa Fe Institute).
- ^ a b c d e f Wolpert, D.H., Macready, W.G. (1997), No Free Lunch Theorems for Optimization, IEEE Transactions on Evolutionary Computation 1, 67. http://ic.arc.nasa.gov/people/dhw/papers/78.pdf
- ^ a b English, T. (2004) No More Lunch: Analysis of Sequential Search, Proceedings of the 2004 IEEE Congress on Evolutionary Computation, pp. 227-234. http://members.cox.net/tom.english/CEC04.pdf
- ^ Igel, C., and Toussaint, M. (2004) "A No-Free-Lunch Theorem for Non-Uniform Distributions of Target Functions," Journal of Mathematical Modelling and Algorithms 3, pp. 313–322.
Bibliography
- Ho, Y.C., Pepyne, D.L. (2002), Simple Explanation of the No-Free-Lunch Theorem and Its Implications, Journal of Optimization Theory and Applications 115, 549.
- Ho, Y.C., Zhao, Q.C., Pepyne, D.L. (2003), The No Free Lunch Theorem, Complexity and Computer Security, IEEE Transactions on Automatic Control 48(5), 783–793.
- Ho, Y.C., Pepyne, D.L. (2004), Conceptual Framework for Optimization and Distributed Intelligence, Proceedings of the IEEE Conference on Decision and Control.