Talk:Simulated annealing

From Wikipedia, the free encyclopedia


On the intro section

This was the original reading of the introductory section:

Simulated annealing is a generic probabilistic heuristic approach to global optimization problems. It locates a good approximation to the global optimum in a large search space with reasonable probability. It was independently invented by S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi in 1983, and by V. Cerny in 1985.
The name comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material to improve its structural properties by removing crystal defects. In this process the metal reaches its most stable configuration, minimizing its total internal energy.

Three remarks:

  1. Shouldn't this be called a META-heuristic? As I understand it, a heuristic is a non-guaranteed procedure that applies to a specific problem, e.g. "pick the nearest unvisited node" would be a heuristic for Euclidean TSP; whereas a meta-heuristic would be a general approach that can be used for almost any problem -- such as simulated annealing, genetic algorithms, etc.
    I've never heard of a meta-heuristic. I would not call it a heuristic, but a heuristic method (as opposed to a deterministic method, like SQP or steepest descent). moink 04:37, 2 Aug 2004 (UTC)
    Perhaps I hark from a different fold, but I have always seen SA (and GA, greedy, tabu, etc.) being labeled as "meta-heuristics".
    Basically, as said above, a heuristic is any technique which you cannot prove is good, but you intuitively expect to be better than random search — usually, for a specific class of problems, or for a specific distribution of instances of a problem. For Euclidean Traveling Salesman, a heuristic could be "compute a Delaunay triangulation of the sites and look for a Hamiltonian path on it."
    A meta-heuristic is a heuristic that (1) is supposed to work on a very general class of problems, such as "optimizing a black-box function" or "build a boolean string classifier from an arbitrary set of examples"; and (2) includes among its steps calls to user-given black-box procedures to do the problem-specific steps — enumerate the "neighboring" states, compute transition probabilities, discard or merge partial solutions, etc. Usually these black boxes are themselves heuristics, specific to the user's problem; hence the name "meta-heuristic".
    Jorge Stolfi 01:30, 5 January 2006 (UTC)
  2. The claim that simulated annealing finds a GOOD approximation to the global optimum with REASONABLE probability is rather misleading (and, strictly speaking, meaningless). Simulated annealing may satisfy some users, but it is easy to make up instances where it will deliver, say, a solution 2^30 times worse than the global optimum, with probability 1 − 2^−30, even with a cooling period of 2^30 steps. There is no reason to assume that the reader's optimization problems will be of the first type, and not of the second type.
    Yes, this should be more clearly explained. We should give examples of the types of problems in which it is likely to work and those in which it is likely not to work. moink 04:37, 2 Aug 2004 (UTC)
  3. Metallurgical annealing does NOT "minimize the total internal energy", by a long shot. Ignoring the nuclear energy and quantum noise, it will usually lower the total bond energy by only a little bit, and stop still very far from the optimum. Even if cooling is slow enough to turn the whole piece into a single perfect crystal, if the crystal axes happen to be oriented in a non-optimal way, the chances of them flipping to the optimal orientation are nil.

Jorge Stolfi 20:34, 30 Mar 2004 (UTC)

  1. Feel free to fix it. moink 04:37, 2 Aug 2004 (UTC)

This statement was deleted:

If T is infinity, it moves around randomly.

It is not correct for the "classical" transition probability function, because that version would always take a downhill move, even when T = ∞. (The phrase is valid for a physical system, though; which only underscores the bogosity of the "physical analogy" claims.)
Jorge Stolfi 03:49, 11 Jun 2004 (UTC)

Yeah, it's more correct to say that the method was inspired by metallurgy than that it is a direct analog. moink 04:37, 2 Aug 2004 (UTC)

This too was deleted:

Nonetheless, simulated annealing can be very effective at finding good suboptimal solutions.

This is definitely POV (or, rather, a meaningless statement — it all depends on what one considers "very effective" and "good solution").
In fact, it seems that the only true claim that can be made about SA is that, unlike the greedy method, it will not get stuck in the first local minimum. One cannot even claim that it will outperform the trivial method ("generate N random states, pick the best one").
Jorge Stolfi 04:46, 11 Jun 2004 (UTC)

No, but one can say that in a certain class of problems, it will with high probability outperform a random search. moink 04:37, 2 Aug 2004 (UTC)

I am done for a while. There is still a major problem in the article, which is the complicated interaction between the neighbor selection function and the energy-determined transition probabilities. They should perhaps be merged into a single "move" function.
Jorge Stolfi 06:48, 11 Jun 2004 (UTC)

Great! Feel free to change it. moink 04:37, 2 Aug 2004 (UTC)

I don't understand the above comment. The connection between neighborhoods and energy functions is explicitly made in the Hammersley–Clifford theorem, and its converse. One view is a Gibbs random field, the other a Markov random field, but they are essentially equivalent. -ska

See the section "Greedy optimization" below. --Jorge Stolfi 01:30, 5 January 2006 (UTC)

Old temp page still in use?

There's a page sitting in Simulated annealing/Temp that hasn't seen any significant activity since it was created in October 2004, and isn't linked to anywhere. Does anyone know what's up with it? Bryan 07:10, 29 May 2005 (UTC)

Credit for Metropolis et al?

Hello. Wasn't simulated annealing introduced first in the article:
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equation of State Calculations by Fast Computing Machines", J. Chem. Phys. 21 (6), 1087–1092, 1953.

?

As I understand it, no. The Metropolis algorithm is a way of sampling from a distribution
P(x) = (1/Z) exp(−βE(x))
for some value of inverse temperature β and for some energy function E that is a function of a vector x. It was typically used to compute expected values of functions over x. I believe that β was held fixed. Kirkpatrick realized (and proved) that if β increased slowly as time approaches infinity, then the expected value of x approaches the global minimum of E. -- hike395 02:01, 4 Jun 2005 (UTC)
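The distinction hike395 describes can be sketched in a few lines of Python (an illustrative sketch; the function names and the schedule are mine, not from any particular implementation):

```python
import math
import random

def metropolis_step(x, energy, beta, propose):
    """One Metropolis accept/reject step at a fixed inverse temperature beta."""
    x_new = propose(x)
    dE = energy(x_new) - energy(x)
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        return x_new  # move accepted
    return x          # move rejected: stay in the current state

def anneal(x0, energy, propose, betas):
    """Kirkpatrick's twist: run the same step while beta slowly increases
    (i.e. the temperature slowly decreases) over the course of the run."""
    x = x0
    for beta in betas:
        x = metropolis_step(x, energy, beta, propose)
    return x
```

Sampling in the 1953 sense runs `metropolis_step` with one fixed `beta`; annealing feeds it an increasing schedule of `betas`.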

Hike has it basically correct. Metropolis introduced the first MCMC sampling method from a Gibbs distribution (a very important development!). This is sampling at a fixed temperature. Simulated annealing, though, was introduced in the early-to-mid 1980s by Kirkpatrick et al., and I think Cerny independently. Essentially these methods turn the homogeneous Markov chains of the MCMC samplers into non-homogeneous chains. The MCMC sampler connection should be made explicit in these pages.

Maybe you will like the following paragraph, which I found in "Optimizing simulated annealing schedules with genetic programming" by Bölte and Thonemann; it gives proper credit to both Metropolis and Kirkpatrick, in my opinion.

"Simulated Annealing originated in statistical mechanics. It is based on a Monte Carlo model that was used by Metropolis et al. to simulate a collection of atoms in equilibrium at a given temperature. Kirkpatrick et al. were the first who applied Simulated Annealing to solve combinatorial optimization problems."

let's cool down wp a bit

... using simulated annealing. i have recently suggested that the cooling process could be initiated by slowly raising the minimum amount of reference information for matured articles with software aids like combo-boxes for reference type, etc. on top or instead of the underused comment line. (http://mail.wikipedia.org/pipermail/wikitech-l/2005-December/032894.html) that should cool down at least vandals or scatterbrains... ;). ok let's hear more opinions. -- Kku 09:30, 7 December 2005 (UTC)


Greedy "optimization"

This section says:

In some implementations of the method (including the original formulation by Kirkpatrick et al.), the condition to perform the move is "en < e or random() < P(...)". That is, downhill moves are always taken, irrespective of the value of P. However, this "greedy optimization" is not only unnecessary, but it is in fact quite opposite to the fundamental principle of SA.

Is this "quite opposite to the fundamental principle of SA" according to whom? This is an unsubstantiated POV, unless there is a citable reference. I'd say that the "fundamental principle" of SA is that some uphill moves have to be accepted. There's no principle that says that not every downhill move has to be accepted. Itub 21:12, 4 January 2006 (UTC)

Unfortunately it is hard to tell exactly what the "fundamental principle of SA" is, because there is quite some distance between the stated inspiration (the annealing of a physical system) and the concrete implementations that are usually given. The main discrepancies are the "greedy optimization" and the serial testing of the neighbors. To see the latter problem, suppose you have neighboring states s1,s2,... and your probability function assigns to them probabilities p1,p2,.... Typical implementations of SA (without greedy optimization) do something like this:
     if rand() < p1 then select s1; else 
     if rand() < p2 then select s2; else 
     if rand() < p3 then select s3; else
     ...
Obviously this algorithm will NOT select state si with probability pi. So the physical analogy and the exp(-dE/T) formula are quite misleading.
Anyway, we can guess that the "fundamental principle of SA" is the idea that moving downhill is not always the best choice, especially when T is high. And this principle is followed to some extent by standard implementations, even those with the "greedy optimization". In the above example, suppose that s3 is downhill but s1 and s2 are uphill; the standard algorithm will go uphill with probability p1 + (1-p1)p2. If we were to swap s1 with s3 in the list of neighbors, standard SA would always move to s3!
Said another way, the "greedy optimization" is equivalent to momentarily setting T = 0 when the neighbor si has lower energy. Note that this "instant freezing" does not occur whenever s has such a neighbor, but only when (if) the algorithm gets to that neighbor while scanning through the list. Perhaps putting it this way makes it more obvious that the "greedy optimization" is against the "gradual cooling" spirit of SA --- a programmer's lapse from "SA thinking mode" into "greedy thinking mode"?
Said yet another way, if the "greedy optimization" is good, then it should be even better to just run the greedy meta-heuristic after each transition, and switch to SA only if the current state turns out to be a local minimum. Would you consider this to be an improvement on SA?
All the best, --Jorge Stolfi 00:56, 5 January 2006 (UTC)
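Stolfi's serial-testing point can be checked exactly with a short script (illustrative; the function name is mine): under serial testing, s_i is selected only after s_1 .. s_{i-1} have all been rejected, so its actual selection probability is p_i times the probability of those earlier rejections, not p_i itself.

```python
def serial_selection_probs(ps):
    """Exact probability that serial testing selects each neighbor:
    s_i can only be chosen after s_1 .. s_{i-1} were all rejected."""
    out, not_yet_moved = [], 1.0
    for p in ps:
        out.append(not_yet_moved * p)
        not_yet_moved *= 1.0 - p
    return out  # final value of not_yet_moved = probability of no move at all

# Three neighbors, each with specified probability 0.5: serial testing
# actually selects them with probabilities 0.5, 0.25, 0.125.
```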
I think the confusion is this "serial testing of the neighbors". In the systems I work with, the neighbor list is infinite and not enumerable, so the way it works is: 1) you generate a neighbor randomly; 2) you decide whether to accept it or reject it. In this context, accepting all the neighbors that go downhill is not "bad"; it's just a particular case of P where P = 1 when En < E (it should be pointed out that when generating this random move, the energy goes up more often than not). I can see how this could cause trouble if you do a serial testing of the neighbors, but this point should be specified in the article. Itub 03:40, 5 January 2006 (UTC)
An easier argument may be: if you set P=1 when dE<0, then the probabilities pi = P(E(si)-E(s)) will not add to 1, and hence they cannot be the actual transition probabilities.
As for your comment: if the neighbor list is infinite, then the enumerator must pick the first/next neighbor with non-uniform probability. In that case, it is easy to see that the actual probability of moving to si will not be the specified probability pi. I believe that they are different even if we assume that the neighbor list is finite, that the next neighbor is picked at random (with repetitions and uniform probabilities), and that the specified pi add to 1. But I cannot check that now...
Note that the algorithm "works" in some fashion even with serial testing and the "greedy optimization". With serial testing, the blabla about Metropolis-Hastings and exp(-dE/T) is pure nonsense; adding the GO just turns it into double-strength nonsense.
All the best, --Jorge Stolfi 08:16, 5 January 2006 (UTC)
Why would pi have to add up to 1? pi is not the probability of transition to state i, but the probability of the move being accepted. The transition probability is the product of pi and the probability of attempting the move. For example, let's say you have 10 neighbors and pick one randomly with uniform probability (that is, the probability of attempting to move to any given neighbor is 0.1). Let's say all the neighbors have pi = 0.5. Each one will contribute 0.1*0.5 = 0.05 to the sum of the transition probabilities. This adds up to 0.5. The remaining 0.5 is the probability of staying in the original state, because the move was rejected. Therefore, all the probabilities add up to 1. Now, let's say that all the neighbors are downhill and have pi = 1. Each one will contribute 1*0.1=0.1 to the sum of the transition probabilities, which again will add up to 1 (the probability of staying in the original state is now zero). Itub 17:02, 5 January 2006 (UTC)
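Itub's bookkeeping can be written out directly (a sketch; the function name is mine): each transition probability is (attempt probability) × (acceptance probability), and whatever mass is left over is the probability of staying put, so the total is always 1.

```python
def transition_probs(acceptance):
    """Pick one of n neighbors uniformly at random, then accept it with
    its own probability; the leftover mass is the 'stay put' probability."""
    attempt = 1.0 / len(acceptance)           # uniform choice of neighbor
    moves = [attempt * p for p in acceptance]  # actual transition probabilities
    stay = 1.0 - sum(moves)                    # probability the move is rejected
    return moves, stay

# Ten neighbors, each with acceptance 0.5: the move probabilities sum to
# about 0.5 and the stay probability is about 0.5, totalling 1.
moves, stay = transition_probs([0.5] * 10)
```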
I should also point out that in the original version that uses P=exp(-dE/kT), the test is still always random() < P(...). It just happens that, for that function, P > 1 when dE < 0 (in actual implementations, yes, you might use an or to save computing the exponential function, but it's not strictly necessary). Itub 17:10, 5 January 2006 (UTC)
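The equivalence Itub points out can be made concrete (a sketch assuming the Metropolis form P = exp(-dE/T); the function names are mine): since exp(-dE/T) > 1 whenever dE < 0, the single comparison random() < P accepts every downhill move automatically, and the explicit `or` is only a shortcut that skips the exponential.

```python
import math
import random

def accept_with_shortcut(dE, T):
    # Formulation quoted earlier: downhill moves are taken unconditionally,
    # sparing the cost of the exponential.
    return dE < 0 or random.random() < math.exp(-dE / T)

def accept_single_test(dE, T):
    # Single test: for this particular P, exp(-dE/T) > 1 when dE < 0,
    # so random() < P succeeds automatically and the two rules agree.
    return random.random() < math.exp(-dE / T)
```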
Indeed, that is the problem: the alleged physical justification assumes implicitly that the pi = exp(-dE/T) are the *transition* probabilities; but when one uses the pi as *acceptance* probabilities in serial testing, the transition probabilities are something completely different. Thus the physical "Metropolis-Hastings/exp(-dE/T)" justification simply does not hold; and neither the physical nor the intuitive justification hold when one uses GO.
I am adding a couple of paragraphs to the article which may help settle this argument. Please check.
All the best, --Jorge Stolfi 20:45, 5 January 2006 (UTC)
I'm still unconvinced (although the section about saving the best solution is helpful :). The main problem I see is that the use of the term "greedy" in this section is an exaggeration. A "greedy" optimization is one that always goes downhill. A simulated annealing implementation that just happens to accept every downhill move that it sees, due to its choice of P(), is not greedy as long as it also tries uphill moves. My suggestion is to delete the entire greedy "optimization" subsection, actually.
I must insist that pi = exp(-dE/T) has never been a transition probability, even in the original physical justification (Metropolis Monte Carlo). All the equations I've seen that deal with transition probabilities in MC methods include two factors: one is the probability of attempting the move (which can be largely arbitrary as long as certain conditions are met), and the other is the probability of accepting it (which is the Metropolis test). Itub 22:54, 5 January 2006 (UTC)

Metaphor too complicated?

It seems to me that the general description of the algorithm could be simplified greatly if the metallurgy metaphor was dumped for a paragraph.

I'd add something like:

Simulated Annealing is a search process that has a better chance of finding global maxima than simple hill climbing. This is because at the beginning of a space search its random steps are very large, becoming smaller as the search continues, until finally the algorithm settles on a local maximum at the best place it has found in the space.

It's a little rough because I'm terrible at describing algorithms, but it frames the concept in terms of another space-search technique that the reader would either (a) be familiar with or (b) probably be able to understand a little more easily.

Would a paragraph like this be welcome?

I would favour such a change. Most people learning about simulated annealing are more likely to have already learned about hillclimbing than to have learned about metallurgical physics, although it's a vivid metaphor. I think it also gets across the point that simulated annealing is a search algorithm, and not some kind of physical simulation. Deco 23:15, 15 May 2006 (UTC)

Initial temperature

Currently the article says:

The initial temperature must be large enough to make the uphill and downhill transition probabilities about the same. To do that, one must have some estimate of the difference e' − e for a random state and its neighbours.

So far this is the best advice I've found on the subject of choosing an initial temperature, and it still doesn't spell out how exactly I do it - the author seems to know but isn't giving details. I'd like to see some added content at least giving an example of how to choose an effective initial temperature. Deco 23:24, 15 May 2006 (UTC)
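One common recipe can be sketched as follows (an illustrative sketch under my own assumptions: the function names and the 0.8 acceptance target are mine, not from the article). Sample random moves, average the uphill energy increases, and solve exp(−mean_dE / T0) = target for T0, so that a typical uphill move starts out being accepted with roughly the target probability.

```python
import math

def initial_temperature(states, energy, propose, accept_target=0.8):
    """Estimate T0 so that an average uphill move is accepted with
    probability accept_target, i.e. exp(-mean_dE / T0) == accept_target."""
    uphill = []
    for s in states:                      # a sample of random starting states
        dE = energy(propose(s)) - energy(s)
        if dE > 0:                        # keep only the uphill moves
            uphill.append(dE)
    mean_dE = sum(uphill) / len(uphill)   # assumes at least one uphill move
    return -mean_dE / math.log(accept_target)
```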

For what it's worth, I just found out about the paper "Computing the Initial Temperature of Simulated Annealing" which probably answers my question. :-) Maybe a good reference to add? Deco 23:28, 15 May 2006 (UTC)
If you add it, I think you should link it as http://dx.doi.org/10.1023/B:COAP.0000044187.23143.bd, which is a more permanent URL. Itub 21:24, 16 May 2006 (UTC)

Quenching

This section reads as highly unencyclopedic:

If the modified Simulated Quenching (SQ) works, fine, but it should not be called SA. SA is often used for hard problems where the chances are reduced for being trapped in local minima. SA also is used in problems with complex constraints since the method of rejection is simply applied to keep the sampling within the constrained region(s). In practice, when the need is clear to use a sampling technique, it is best to do at least a few long runs with SA, and then use SQ modifications to speed up calculations as long as the same results are achieved. To apply any generic sampling technique like SA over many classes of nonlinear problems, it should be expected that tuning of some basic search parameters will be required.

First, the first sentence is (oddly worded) POV. Second, the section doesn't explain what SQ is in comparison to SA. Third, the second half reads like a how-to. I don't know SQ so I can't modify this, but maybe someone here can. Thanks! ~ trialsanderrors 18:56, 1 September 2006 (UTC)

Monte Carlo

This is a good example of a Monte Carlo method, so shouldn't there be a reference to Monte_carlo_algorithm?

Shouldn't there also be a reference to the Ising model? (Might be a biased point of view, since I did my thesis on the Ising model).

I was struck by how simulated annealing resembles the way the Ising_model is usually simulated; this can be considered simulated annealing where one usually takes the Metropolis acceptance probability as given above. I'm inclined to think that the people who invented simulated annealing were acquainted with the simulation of the Ising Model and invented it as a generalization of the method.

(The fact that SA resembles the Ising model is not a coincidence, since an adapted version of the Ising model (with conserved spin, where spin up represents the presence of a particle and spin down represents the absence of a particle) can actually be used to accurately model the cooling down of a dense gas/liquid.)

Energy conservation :)

How can one justify that "Therefore the goal is to bring the system, from an arbitrary initial state, to a state with the minimum possible energy"? --Javalenok 22:43, 11 November 2006 (UTC)