
Simulated Annealing


Motivation

The method of simulated annealing (SA) draws its inspiration from the physical process of metallurgy and uses terminology that comes from that field. When a metal is heated to a sufficiently high temperature, its atoms undergo disordered movements of large amplitude. If one now cools the metal down progressively, the atoms reduce their movement and tend to stabilize around fixed positions in a regular crystal structure with minimal energy. In this state, in which internal structural constraints are minimized, ductility is improved and the metal becomes easier to work. This slow cooling process is called annealing by metallurgists and is to be contrasted with quenching, which consists of a rapid cooling of the metal or alloy. Quenching makes the cooled metal more fragile, but also harder and more resistant to wear and vibration. In this case, the resulting atomic structure corresponds to a local energy minimum whose value is higher than the one corresponding to the arrangement produced by annealing. Note finally that in practice metallurgists often use a process called tempering, in which heating and cooling phases are alternated to obtain the desired physical properties.

We can intuitively understand this process in the following way: at high temperature, atoms undergo large random movements thereby exploring a large number of possible configurations. Since in nature the energy of systems tends to be minimized, low-energy configurations will be preferred, but, at this stage, higher energy configurations remain accessible thanks to the thermal energy transferred to the system. During the exploration, the system might find itself in a low-energy configuration by chance. If the energy barrier to leave this configuration is high, then the system will stay there longer on average. As temperature decreases, the system will be more and more constrained to exploit low-amplitude movements and, finally, it will “freeze” into a low-energy minimum that may be, but is not guaranteed to be, the global one.

In an optimization context, SA seeks to emulate this process. SA begins at a very high temperature, where the input values are allowed to vary over a wide range. As the algorithm progresses, the temperature is allowed to fall, which restricts the degree to which the inputs are allowed to vary. This often leads the algorithm to a better solution, just as a metal achieves a better crystal structure through the actual annealing process. So, as the temperature is decreased, changes to the inputs generate successively better solutions, finally giving rise to a near-optimal set of input values when the temperature is close to zero.

SA can be used to find the minimum of an objective function, and it is expected that the algorithm will find the inputs that produce that minimum value. The main feature of the SA algorithm is its ability to avoid being trapped in local minima. This is done by letting the algorithm accept not only better solutions but also worse solutions with a given probability. The main disadvantage is that the choice of some control parameters (initial temperature, cooling rate, etc.) is somewhat subjective and must be made on an empirical basis, which means that the algorithm must be tuned in order to maximize its performance.

Principles of the Algorithm

The simulated annealing method is used to search for the minimum of a given objective function, often called the energy $E$, by analogy to the physical origins of the method. The algorithm follows the basic principles of all metaheuristics. The process begins by choosing an arbitrary admissible initial solution, also called the initial configuration. Furthermore, an initial “temperature” must also be defined. Next, the moves that allow the current configuration to reach its neighbors must be defined; these moves are also called elementary transformations. The algorithm doesn’t test all the neighbors of the current configuration; instead, a random move is selected among the allowed ones. If the move leads to a lower energy value, then the new configuration is accepted and becomes the current solution. But the original feature of SA is that even moves that lead to an increase of the energy can be accepted with positive probability. This probability of accepting moves that worsen the fitness is computed from the energy variation $\Delta E$ before and after the given move:

$$\Delta E = E_{new} - E_{current}$$

The probability $p$ of accepting the new configuration is defined by the exponential

$$p = \min(1, e^{-\Delta E/T})$$

This relation is called the Metropolis rule for historical reasons. The rule says that for $\Delta E \leq 0$, the acceptance probability is one, as the exponential is larger than one in this case. In other words, a solution that is better than the current one will always be accepted. On the other hand, if $\Delta E > 0$, which means that the fitness of the new configuration is worse, the new configuration will nonetheless be accepted with probability $p < 1$ computed according to the above equation. Thus, a move that worsens the fitness can still be accepted. It is also clear that the larger $\Delta E$ is, the smaller $p$ will be and, for a given $\Delta E$, $p$ becomes larger with increasing temperature $T$. As a consequence, at high temperatures worsening moves are more likely to be accepted, making it possible to overcome fitness barriers, providing exploration capabilities, and preventing the search from being stuck in local minima. In contrast, as the temperature is progressively lowered, the configurations will tend to converge towards a local minimum, thus exploiting a good region of the search space. Indeed, in the limit $T \rightarrow 0$, $p \rightarrow 0$ for any $\Delta E > 0$, and no worsening configuration is accepted.
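
As an illustration, the Metropolis rule amounts to only a few lines of code. The sketch below is a minimal Python version; the function name accept is purely illustrative and not part of any particular library.

from math import exp
from random import random

def accept(delta_E, T):
    """Decide whether to accept a move with energy variation delta_E at temperature T."""
    if delta_E <= 0:
        return True                         # improving (or equal) moves are always accepted
    return random() < exp(-delta_E / T)     # worsening moves accepted with probability e^(-dE/T)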

The choice of the Metropolis rule for the acceptance probability is not arbitrary. The corresponding stochastic process, which generates changes and accepts them with probability $p = e^{-\Delta E/T}$, samples the system configurations according to a well-defined probability distribution that is known in equilibrium statistical mechanics as the Maxwell-Boltzmann distribution. It is for this reason that the Metropolis rule is so widely used in the so-called Monte Carlo physical simulation methods. A fundamental aspect of simulated annealing is the fact that the temperature is progressively decreased during the search. The details of this process are specified by a temperature schedule, also called a cooling schedule, and can be defined in different ways. For instance, the temperature can be decreased at each iteration following a given law. In practice, it is more often preferred to lower the temperature in stages: after a given number of steps at a constant temperature, the search reaches a stationary regime in which the energy fluctuates around an average value that doesn’t change any more. At this point, the temperature is decreased to allow the system to converge to a lower energy state. Finally, after several stages in which the temperature has been decreased, no further fitness improvements are possible; the state reached is considered the final one, and the algorithm stops.

Simulated annealing algorithm

**Input**: Cooling schedule.

$s = s_0$ ; /* Generation of the initial solution */

$T = T_{max}$ ; /* Starting temperature */

**Repeat**

     **Repeat** /* At a fixed temperature */

          Generate a random neighbor $s'$;

          $\Delta E = f(s') - f(s)$;

          **If** $\Delta E \leq 0$ **Then** $s = s'$ /* Accept the neighbor solution */

          **Else** Accept $s'$ with a probability $e^{-\Delta E/T}$;

     **Until** Equilibrium condition /* e.g. a given number of iterations executed at each temperature $T$ */

     $T = g(T)$; /* Temperature update */

**Until** Stopping criteria satisfied /* e.g. $T < T_{min}$ */

**Output**: Best solution found.

Practical considerations

Choice of the initial temperature

In order to start a simulated annealing search, an initial temperature $T_0$ must be specified. Many methods have been proposed in the literature to compute this initial temperature $T_0$.

  1. The following heuristic is useful to determine a suitable value for $T_0$ (a small sketch of it is given after this list):
  • Perform 100 elementary transformations randomly starting from the initial configuration.
  • Compute the average of the energy variations $\langle \Delta E \rangle$ observed in these 100 moves.
  • Choose an initial acceptance probability $p_0$ for worsening moves according to the assumed quality of the initial solution. Typically, $p_0 = 0.5$ if the quality is assumed to be average, and $p_0 = 0.2$ if it is assumed to be good. After that, $T_0$ can be computed such that

$$\exp \left( - \frac{\langle \Delta E \rangle}{T_0} \right) = p_0$$

which means that the temperature is high enough to allow the system to traverse energy barriers of size $\langle \Delta E \rangle$ with probability $p_0$.

  2. It is suggested to take $T_0 = E_{max}$, where $E_{max}$ is the maximal cost difference between any two neighboring solutions.

  3. Another scheme, based on a more precise estimation of the cost distribution, has been proposed with multiple variants. It is recommended to choose $T_0 = K\sigma_{\infty}^2$, where $K$ is a constant typically ranging from 5 to 10 and $\sigma_{\infty}^2$ is the second moment of the energy distribution when the temperature is infinite. $\sigma_{\infty}$ is estimated by randomly generating some solutions.

  4. A more classical approach consists in computing a temperature such that the acceptance ratio is approximately equal to a given value $\chi_0$. First, we choose a large initial temperature. Then, we perform a number of transitions using this temperature. The ratio of accepted transitions is compared with $\chi_0$: if it is less than $\chi_0$, the temperature is multiplied by 2. The procedure continues until the observed acceptance ratio exceeds $\chi_0$. Other variants have been proposed to obtain an acceptance ratio that is close to $\chi_0$; it is, for example, possible to divide the temperature by 3 if the acceptance ratio is much higher than $\chi_0$. Using this kind of rule, cycles are avoided and a good estimate of the temperature can be found.

  5. In another procedure, the temperature is obtained using the formula

$$T_0 = -\frac{E}{\ln(\chi_0)}$$

where $E$ is an estimation of the average cost increase of strictly positive transitions. This estimation is again obtained by randomly generating some transitions. Notice that $-\delta_t/\ln(\chi_0)$, where $\delta_t$ is the cost increase induced by a transition $t$, is the temperature that allows this transition to be accepted with probability $\chi_0$. In other terms, $T_0 = -E/\ln(\chi_0)$ is the average of these temperatures over a set of random transitions.
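
As an illustration of the first heuristic above, the following sketch estimates $T_0$ from the average energy variation over a sample of random moves. It is a sketch only: the helpers energy and random_neighbour are assumed to be supplied by the problem at hand, and the absolute value of the variations is taken as the barrier size.

import numpy as np

def initial_temperature(s0, energy, random_neighbour, p0=0.5, n_moves=100):
    """Estimate T0 such that exp(-<|dE|>/T0) = p0 (heuristic 1 above)."""
    e0 = energy(s0)
    deltas = [abs(energy(random_neighbour(s0)) - e0) for _ in range(n_moves)]
    avg_dE = np.mean(deltas)
    return -avg_dE / np.log(p0)   # solving exp(-avg_dE/T0) = p0 for T0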

Neighbour generation

The perturbation mechanism is the method used to create new solutions from the current solution. In other words, it is a method to explore the neighborhood of the current solution by creating small changes in it. SA is commonly used in combinatorial problems where the parameters being optimized are integer numbers. In an application where the parameters vary continuously, the exploration of neighboring solutions can be carried out as follows. A solution $s$ is defined as a vector $s = (x_1, \dots, x_n)$ representing a point in the search space. A new solution is generated using a vector $\sigma = (\sigma_1, \dots, \sigma_n)$ of standard deviations to create a perturbation of the current solution. A neighbor solution is then produced from the present solution by:

$$x'_i = x_i + N(0, \sigma_i)$$

where $N(0, \sigma_i)$ is a Gaussian random number with zero mean and standard deviation $\sigma_i$.
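
A minimal sketch of this perturbation mechanism for a continuous problem is given below; it assumes numpy and a vector sigma of per-coordinate standard deviations (both names are illustrative, and any domain constraints would still have to be checked by the caller).

import numpy as np

def neighbour(s, sigma):
    """Generate a neighbour by adding Gaussian noise N(0, sigma_i) to each coordinate."""
    s = np.asarray(s, dtype=float)
    return s + np.random.normal(0.0, sigma, size=s.shape)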

Cooling schedule

In the SA algorithm, the temperature is decreased gradually such that $T_i > 0$ for all $i$ and $\lim_{i \rightarrow \infty} T_i = 0$.

There is always a compromise between the quality of the obtained solutions and the speed of the cooling schedule. If the temperature is decreased slowly, better solutions are obtained, but at the price of a longer computation time. The temperature $T$ can be updated in different ways:

  • **Linear**: In the trivial linear schedule, the temperature $T$ is updated as follows: $T = T - \beta$, where $\beta$ is a specified constant value. Hence, we have $T_i = T_0 - i \times \beta$, where $T_i$ represents the temperature at iteration $i$.

  • **Geometric**: In the geometric schedule, the temperature is updated using the formula $T = \alpha T$, where $\alpha \in (0, 1)$. It is the most popular cooling function. Experience has shown that $\alpha$ should be between 0.5 and 0.99. (A small sketch of these update rules is given after this list.)

  • **Very slow decrease**: The main trade-off in a cooling schedule is between using a large number of iterations at a few temperatures or a small number of iterations at many temperatures. A very slow cooling schedule such as

$$T_{i+1} = \frac{T_i}{1 + \beta T_i}$$

may be used, where $\beta = \frac{T_0 - T_F}{(L - 1)\, T_0 T_F}$, $T_F$ is the final temperature, and $L$ is the total number of temperature updates. Only one iteration is allowed at each temperature in this very slow decreasing function.

  • **Nonmonotonic**: Typical cooling schedules use monotone temperatures. Some nonmonotone schemes, in which the temperature is occasionally increased again, have also been suggested. This encourages diversification in the search space; for some types of search landscapes, the optimal schedule is nonmonotone.

  • **Adaptive**: Most cooling schedules are static in the sense that the schedule is defined completely a priori. In this case, the cooling schedule is “blind” to the characteristics of the search landscape. In an adaptive cooling schedule, the decreasing rate is dynamic and depends on information obtained during the search. A dynamic cooling schedule may be used where a small number of iterations are carried out at high temperatures and a large number of iterations at low temperatures.
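
For concreteness, the linear, geometric, and very slow decrease updates can be written as one-line update functions. This is only a sketch with illustrative parameter names.

def linear_cooling(T, beta):
    return T - beta                  # T_i = T_0 - i * beta

def geometric_cooling(T, alpha=0.9):
    return alpha * T                 # alpha typically between 0.5 and 0.99

def very_slow_cooling(T, beta):
    return T / (1.0 + beta * T)      # beta = (T0 - TF) / ((L - 1) * T0 * TF)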

Equilibrium State

To reach an equilibrium state at each temperature, a sufficient number of transitions (moves) must be applied. Theory suggests that the number of iterations at each temperature might have to be exponential in the problem size, which is not practical. Instead, the number of iterations must be set according to the size of the problem instance, and in particular proportional to the neighborhood size $|N(s)|$. The number of transitions may be determined as follows:

  • **Static**: In a static strategy, the number of transitions is determined before the search starts. For instance, a given proportion $y$ of the neighborhood $N(s)$ is explored, so the number of neighbors generated from a solution $s$ is $y \cdot |N(s)|$. The larger the ratio $y$, the higher the computational cost and the better the results. In practice, we assume that an equilibrium state has been attained if $12F$ elementary transformations have been accepted over a total of $100F$ tried moves, where $F$ is the number of degrees of freedom of the problem, i.e., the number of variables that define a solution (a small sketch of this rule is given after this list).

  • Adaptive\textbf{Adaptive}: The number of generated neighbors will depend on the characteristics of the search. For instance, it is not necessary to reach the equilibrium state at each temperature. Nonequilibrium simulated annealing algorithms may be used: the cooling schedule may be enforced as soon as an improving neighbor solution is generated. This feature may result in the reduction of the computational time without compromising the quality of the obtained solutions. Another adaptive approach using both the worst and the best solutions found in the inner loop of the algorithm may be used.
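
The static rule above is commonly read as ending the inner loop when either enough moves have been accepted or enough moves have been tried, whichever comes first. A sketch of that reading, with illustrative variable names:

def equilibrium_reached(n_accepted, n_tried, F):
    """Static rule: equilibrium is assumed once 12*F moves have been accepted
    or 100*F moves have been tried at the current temperature."""
    return n_accepted >= 12 * F or n_tried >= 100 * F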

Stopping Condition

Concerning the stopping condition, theory suggests a final temperature equal to 0. In practice, one can stop the search when the probability of accepting a move becomes negligible. The following stopping criteria may also be used:

  • Maximum number of iterations is reached.

  • Reaching a final temperature $T_F$ is the most popular stopping criterion. This temperature must be low (e.g., $T_{min} = 0.01$).

  • Achieving a predetermined number of iterations without improvement of the best found solution.

  • Achieving a predetermined number of temperature stages with a low acceptance percentage; that is, a counter is increased by 1 each time a temperature stage is completed with a smaller percentage of accepted moves than a predetermined limit, and is reset to 0 when a new best solution is found. If the counter reaches a predetermined limit $R$, the SA algorithm is stopped (see the sketch below).
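
A sketch of this counter-based criterion combined with the temperature threshold; the variable names are illustrative.

def should_stop(T, Tmin, low_acceptance_count, R):
    """Stop when the temperature falls below Tmin, or when low_acceptance_count
    consecutive stages had fewer accepted moves than the predetermined limit."""
    return T < Tmin or low_acceptance_count >= R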

Applications

Minimizing a nonlinear function

Problem

**Minimize** $f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$, $\quad 0 \leq x_1, x_2 \leq 5$

Solution using simulated annealing

The initial temperature is set to 1000.

The final temperature is set to 0.01.

The geometric cooling schedule is used with $\alpha = 0.9$.

The neighbour is generated using the perturbation mechanism $x_{n+1} = x_n + N(0, 1)$, ensuring that the new point lies within the domain specified in the problem.

For each temperature, we generate 1000 neighbours.

When the temperature is less than the minimum temperature, we stop.

Using simulated annealing, we see that the function attains the minimum value 0.001 at (2.999, 2.008). The actual minimum value is 0, attained at (3, 2).

The Python code for solving the above problem is given below.

import numpy as np
from random import random
from math import exp

def f(x1, x2):
    # Objective (energy) function to be minimized
    return (x1**2 + x2 - 11)**2 + (x2**2 + x1 - 7)**2

def neighbour(x1, x2):
    # Perturb the current point with Gaussian noise N(0, 1) until the
    # neighbour falls inside the feasible domain 0 <= x1, x2 <= 5
    while True:
        x1n, x2n = np.array([x1, x2]) + np.random.normal(0, 1, 2)
        if 0 <= x1n <= 5 and 0 <= x2n <= 5:
            return (x1n, x2n)

def simulated_annealing():
    # Initial temperature, final temperature, cooling factor and starting point
    T, Tmin, alpha, x1, x2 = 1000, 0.01, 0.9, 2.5, 2.5
    while True:
        for _ in range(1000):                   # 1000 neighbours at each temperature
            x1n, x2n = neighbour(x1, x2)
            delta_E = f(x1n, x2n) - f(x1, x2)
            if delta_E <= 0:                    # always accept improving moves
                x1, x2 = x1n, x2n
            elif random() < exp(-delta_E/T):    # Metropolis rule for worsening moves
                x1, x2 = x1n, x2n
        T = alpha*T                             # geometric cooling
        if T < Tmin:                            # stop at the final temperature
            return (x1, x2, f(x1, x2))

print(simulated_annealing())