
Simulated Annealing


Motivation

The method of simulated annealing (SA) draws its inspiration from the physical process of metallurgy and uses terminology that comes from that field. When a metal is heated to a sufficiently high temperature, its atoms undergo disordered movements of large amplitude. If one now cools the metal down progressively, the atoms reduce their movement and tend to stabilize around fixed positions in a regular crystal structure with minimal energy. In this state, in which internal structural constraints are minimized, ductility is improved and the metal becomes easier to work. This slow cooling process is called annealing by metallurgists and is to be contrasted with quenching, which consists of a rapid cooling of the metal or alloy. Quenching makes the cooled metal more fragile, but also harder and more resistant to wear and vibration. In this case, the resulting atomic structure corresponds to a local energy minimum whose value is higher than the one corresponding to the arrangement produced by annealing. Note finally that in practice metallurgists often use a process called tempering, in which heating and cooling phases are alternated to obtain the desired physical properties.

We can intuitively understand this process in the following way: at high temperature, atoms undergo large random movements thereby exploring a large number of possible configurations. Since in nature the energy of systems tends to be minimized, low-energy configurations will be preferred, but, at this stage, higher energy configurations remain accessible thanks to the thermal energy transferred to the system. During the exploration, the system might find itself in a low-energy configuration by chance. If the energy barrier to leave this configuration is high, then the system will stay there longer on average. As temperature decreases, the system will be more and more constrained to exploit low-amplitude movements and, finally, it will “freeze” into a low-energy minimum that may be, but is not guaranteed to be, the global one.

In an optimization context, SA seeks to emulate this process. SA begins at a very high temperature, where the input values are allowed to vary over a wide range. As the algorithm progresses, the temperature is allowed to fall, which restricts the degree to which the inputs are allowed to vary. This often leads the algorithm to a better solution, just as a metal achieves a better crystal structure through the actual annealing process. So, as the temperature is decreased, changes to the inputs generate successively better solutions, finally giving rise to a near-optimal set of input values when the temperature is close to zero.

SA can be used to find the minimum of an objective function, and it is expected that the algorithm will find the inputs that produce that minimum value. The main feature of the SA algorithm is its ability to avoid being trapped in local minima. This is done by letting the algorithm accept not only better solutions but also worse solutions with a given probability. The main disadvantage is that the choice of some control parameters (initial temperature, cooling rate, etc.) is somewhat subjective and must be made on an empirical basis, which means that the algorithm must be tuned in order to maximize its performance.

Principles of the Algorithm

The simulated annealing method is used to search for the minimum of a given objective function, often called the energy $E$, by analogy to the physical origins of the method. The algorithm follows the basic principles of all metaheuristics. The process begins by choosing an arbitrary admissible initial solution, also called the initial configuration. Furthermore, an initial “temperature” must also be defined. Next, the moves that allow the current configuration to reach its neighbors must be defined; these moves are also called elementary transformations. The algorithm doesn’t test all the neighbors of the current configuration; instead, a random move is selected among the allowed ones. If the move leads to a lower energy value, then the new configuration is accepted and becomes the current solution. But the original feature of SA is that even moves that lead to an increase of the energy can be accepted with positive probability. This probability of accepting moves that worsen the fitness is computed from the energy variation $\Delta E$ before and after the given move:

$$\Delta E = E_{new} - E_{current}$$

The probability $p$ of accepting the new configuration is defined by the exponential

$$p = \min(1, e^{-\Delta E/T})$$

This relation is called the Metropolis rule for historical reasons. The rule says that for $\Delta E \leq 0$, the acceptance probability is one, as the exponential is larger than one in this case. In other words, a solution that is better than the current one will always be accepted. On the other hand, if $\Delta E > 0$, which means that the fitness of the new configuration is worse, the new configuration will nonetheless be accepted with probability $p < 1$ computed according to the above equation. Thus, a move that worsens the fitness can still be accepted. It is also clear that the larger $\Delta E$ is, the smaller $p$ will be and, for a given $\Delta E$, $p$ becomes larger with increasing temperature $T$. As a consequence, at high temperatures worsening moves are more likely to be accepted, making it possible to overcome fitness barriers, providing exploration capabilities, and preventing the search from being stuck in local minima. In contrast, as the temperature is progressively lowered, the configurations will tend to converge towards a local minimum, thus exploiting a good region of the search space. Indeed, in the limit $T \rightarrow 0$, $p \rightarrow 0$ for any $\Delta E > 0$, and no worsening configuration is accepted.
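
As an illustration, the Metropolis rule amounts to only a few lines of code. The sketch below is a minimal Python version; the function name accept is purely illustrative and not part of any particular library.

from math import exp
from random import random

def accept(delta_E, T):
    """Decide whether to accept a move with energy variation delta_E at temperature T."""
    if delta_E <= 0:
        return True                         # improving (or equal) moves are always accepted
    return random() < exp(-delta_E / T)     # worsening moves accepted with probability e^(-dE/T)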

The choice of the Metropolis rule for the acceptance probability is not arbitrary. The corresponding stochastic process, which generates changes and accepts them with probability $p = e^{-\Delta E/T}$, samples the system configurations according to a well-defined probability distribution that is known in equilibrium statistical mechanics as the Maxwell-Boltzmann distribution. It is for this reason that the Metropolis rule is so widely used in the so-called Monte Carlo physical simulation methods. A fundamental aspect of simulated annealing is the fact that the temperature is progressively decreased during the search. The details of this process are specified by a temperature schedule, also called a cooling schedule, and can be defined in different ways. For instance, the temperature can be decreased at each iteration following a given law. In practice, it is more often preferred to lower the temperature in stages: after a given number of steps at a constant temperature, the search reaches a stationary regime in which the energy fluctuates around an average value that doesn’t change any more. At this point, the temperature is decreased to allow the system to converge to a lower energy state. Finally, after several stages in which the temperature has been decreased, no further fitness improvements are possible; the state reached is considered the final one, and the algorithm stops.

Simulated annealing algorithm

**Input**: Cooling schedule.

$s = s_0$ ; /* Generation of the initial solution */

$T = T_{max}$ ; /* Starting temperature */

**Repeat**

     **Repeat** /* At a fixed temperature */

          Generate a random neighbor $s'$;

          $\Delta E = f(s') - f(s)$;

          **If** $\Delta E \leq 0$ **Then** $s = s'$ /* Accept the neighbor solution */

          **Else** Accept $s'$ with a probability $e^{-\Delta E/T}$;

     **Until** Equilibrium condition /* e.g. a given number of iterations executed at each temperature $T$ */

     $T = g(T)$; /* Temperature update */

**Until** Stopping criteria satisfied /* e.g. $T < T_{min}$ */

**Output**: Best solution found.

Practical considerations

Choice of the initial temperature

In order to start a simulated annealing search, an initial temperature $T_0$ must be specified. Many methods have been proposed in the literature to compute this initial temperature $T_0$.

  1. The following heuristic is useful to determine a suitable value for $T_0$ (a small sketch of it is given after this list):
  • Perform 100 elementary transformations randomly starting from the initial configuration.
  • Compute the average of the energy variations $\langle \Delta E \rangle$ observed in these 100 moves.
  • Choose an initial acceptance probability $p_0$ for worsening moves according to the assumed quality of the initial solution. Typically, $p_0 = 0.5$ if the quality is assumed to be average, and $p_0 = 0.2$ if it is assumed to be good. After that, $T_0$ can be computed such that

$$\exp \left( - \frac{\langle \Delta E \rangle}{T_0} \right) = p_0$$

which means that the temperature is high enough to allow the system to traverse energy barriers of size $\langle \Delta E \rangle$ with probability $p_0$.

  2. It is suggested to take $T_0 = E_{max}$, where $E_{max}$ is the maximal cost difference between any two neighboring solutions.

  3. Another scheme, based on a more precise estimation of the cost distribution, has been proposed with multiple variants. It is recommended to choose $T_0 = K\sigma_{\infty}^2$, where $K$ is a constant typically ranging from 5 to 10 and $\sigma_{\infty}^2$ is the second moment of the energy distribution when the temperature is infinite. $\sigma_{\infty}$ is estimated by randomly generating some solutions.

  4. A more classical approach consists in computing a temperature such that the acceptance ratio is approximately equal to a given value $\chi_0$. First, we choose a large initial temperature. Then, we perform a number of transitions using this temperature. The ratio of accepted transitions is compared with $\chi_0$: if it is less than $\chi_0$, the temperature is multiplied by 2. The procedure continues until the observed acceptance ratio exceeds $\chi_0$. Other variants have been proposed to obtain an acceptance ratio that is close to $\chi_0$; it is, for example, possible to divide the temperature by 3 if the acceptance ratio is much higher than $\chi_0$. Using this kind of rule, cycles are avoided and a good estimate of the temperature can be found.

  5. In another procedure, the temperature is obtained using the formula

$$T_0 = -\frac{E}{\ln(\chi_0)}$$

where $E$ is an estimation of the average cost increase of strictly positive transitions. This estimation is again obtained by randomly generating some transitions. Notice that $-\delta_t/\ln(\chi_0)$, where $\delta_t$ is the cost increase induced by a transition $t$, is the temperature that allows this transition to be accepted with probability $\chi_0$. In other terms, $T_0 = -E/\ln(\chi_0)$ is the average of these temperatures over a set of random transitions.
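
As an illustration of the first heuristic above, the following sketch estimates $T_0$ from the average energy variation over a sample of random moves. It is a sketch only: the helpers energy and random_neighbour are assumed to be supplied by the problem at hand, and the absolute value of the variations is taken as the barrier size.

import numpy as np

def initial_temperature(s0, energy, random_neighbour, p0=0.5, n_moves=100):
    """Estimate T0 such that exp(-<|dE|>/T0) = p0 (heuristic 1 above)."""
    e0 = energy(s0)
    deltas = [abs(energy(random_neighbour(s0)) - e0) for _ in range(n_moves)]
    avg_dE = np.mean(deltas)
    return -avg_dE / np.log(p0)   # solving exp(-avg_dE/T0) = p0 for T0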

Neighbour generation

The perturbation mechanism is the method used to create new solutions from the current solution. In other words, it is a method to explore the neighborhood of the current solution by creating small changes in it. SA is commonly used in combinatorial problems where the parameters being optimized are integer numbers. In an application where the parameters vary continuously, the exploration of neighboring solutions can be carried out as follows. A solution $s$ is defined as a vector $s = (x_1, \dots, x_n)$ representing a point in the search space. A new solution is generated using a vector $\sigma = (\sigma_1, \dots, \sigma_n)$ of standard deviations to create a perturbation of the current solution. A neighbor solution is then produced from the present solution by:

$$x'_i = x_i + N(0, \sigma_i)$$

where $N(0, \sigma_i)$ is a Gaussian random number with zero mean and standard deviation $\sigma_i$.
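
A minimal sketch of this perturbation mechanism for a continuous problem is given below; it assumes numpy and a vector sigma of per-coordinate standard deviations (both names are illustrative, and any domain constraints would still have to be checked by the caller).

import numpy as np

def neighbour(s, sigma):
    """Generate a neighbour by adding Gaussian noise N(0, sigma_i) to each coordinate."""
    s = np.asarray(s, dtype=float)
    return s + np.random.normal(0.0, sigma, size=s.shape)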

Cooling schedule

In the SA algorithm, the temperature is decreased gradually such that $T_i > 0$ for all $i$ and $\lim_{i \rightarrow \infty} T_i = 0$.

There is always a compromise between the quality of the obtained solutions and the speed of the cooling schedule. If the temperature is decreased slowly, better solutions are obtained, but at the price of a longer computation time. The temperature $T$ can be updated in different ways:

  • **Linear**: In the trivial linear schedule, the temperature $T$ is updated as follows: $T = T - \beta$, where $\beta$ is a specified constant value. Hence, we have $T_i = T_0 - i \times \beta$, where $T_i$ represents the temperature at iteration $i$.

  • **Geometric**: In the geometric schedule, the temperature is updated using the formula $T = \alpha T$, where $\alpha \in (0, 1)$. It is the most popular cooling function. Experience has shown that $\alpha$ should be between 0.5 and 0.99. (A small sketch of these update rules is given after this list.)

  • **Very slow decrease**: The main trade-off in a cooling schedule is between using a large number of iterations at a few temperatures or a small number of iterations at many temperatures. A very slow cooling schedule such as

$$T_{i+1} = \frac{T_i}{1 + \beta T_i}$$

may be used, where $\beta = \frac{T_0 - T_F}{(L - 1)\, T_0 T_F}$, $T_F$ is the final temperature, and $L$ is the total number of temperature updates. Only one iteration is allowed at each temperature in this very slow decreasing function.

  • **Nonmonotonic**: Typical cooling schedules use monotone temperatures. Some nonmonotone schemes, in which the temperature is occasionally increased again, have also been suggested. This encourages diversification in the search space; for some types of search landscapes, the optimal schedule is nonmonotone.

  • **Adaptive**: Most cooling schedules are static in the sense that the schedule is defined completely a priori. In this case, the cooling schedule is “blind” to the characteristics of the search landscape. In an adaptive cooling schedule, the decreasing rate is dynamic and depends on information obtained during the search. A dynamic cooling schedule may be used where a small number of iterations are carried out at high temperatures and a large number of iterations at low temperatures.
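
For concreteness, the linear, geometric, and very slow decrease updates can be written as one-line update functions. This is only a sketch with illustrative parameter names.

def linear_cooling(T, beta):
    return T - beta                  # T_i = T_0 - i * beta

def geometric_cooling(T, alpha=0.9):
    return alpha * T                 # alpha typically between 0.5 and 0.99

def very_slow_cooling(T, beta):
    return T / (1.0 + beta * T)      # beta = (T0 - TF) / ((L - 1) * T0 * TF)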

Equilibrium State

To reach an equilibrium state at each temperature, a sufficient number of transitions (moves) must be applied. Theory suggests that the number of iterations at each temperature might have to be exponential in the problem size, which is not practical. Instead, the number of iterations must be set according to the size of the problem instance, and in particular proportional to the neighborhood size $|N(s)|$. The number of transitions may be determined as follows:

  • **Static**: In a static strategy, the number of transitions is determined before the search starts. For instance, a given proportion $y$ of the neighborhood $N(s)$ is explored, so the number of neighbors generated from a solution $s$ is $y \cdot |N(s)|$. The larger the ratio $y$, the higher the computational cost and the better the results. In practice, we assume that an equilibrium state has been attained if $12F$ elementary transformations have been accepted over a total of $100F$ tried moves, where $F$ is the number of degrees of freedom of the problem, i.e., the number of variables that define a solution (a small sketch of this rule is given after this list).

  • Adaptive\textbf{Adaptive}: The number of generated neighbors will depend on the characteristics of the search. For instance, it is not necessary to reach the equilibrium state at each temperature. Nonequilibrium simulated annealing algorithms may be used: the cooling schedule may be enforced as soon as an improving neighbor solution is generated. This feature may result in the reduction of the computational time without compromising the quality of the obtained solutions. Another adaptive approach using both the worst and the best solutions found in the inner loop of the algorithm may be used.
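
The static rule above is commonly read as ending the inner loop when either enough moves have been accepted or enough moves have been tried, whichever comes first. A sketch of that reading, with illustrative variable names:

def equilibrium_reached(n_accepted, n_tried, F):
    """Static rule: equilibrium is assumed once 12*F moves have been accepted
    or 100*F moves have been tried at the current temperature."""
    return n_accepted >= 12 * F or n_tried >= 100 * F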

Stopping Condition

Concerning the stopping condition, theory suggests a final temperature equal to 0. In practice, one can stop the search when the probability of accepting a move becomes negligible. The following stopping criteria may also be used:

  • Maximum number of iterations is reached.

  • Reaching a final temperature $T_F$ is the most popular stopping criterion. This temperature must be low (e.g., $T_{min} = 0.01$).

  • Achieving a predetermined number of iterations without improvement of the best found solution.

  • Achieving a predetermined number of temperature stages with a low acceptance percentage; that is, a counter is increased by 1 each time a temperature stage is completed with a smaller percentage of accepted moves than a predetermined limit, and is reset to 0 when a new best solution is found. If the counter reaches a predetermined limit $R$, the SA algorithm is stopped (see the sketch below).
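
A sketch of this counter-based criterion combined with the temperature threshold; the variable names are illustrative.

def should_stop(T, Tmin, low_acceptance_count, R):
    """Stop when the temperature falls below Tmin, or when low_acceptance_count
    consecutive stages had fewer accepted moves than the predetermined limit."""
    return T < Tmin or low_acceptance_count >= R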

Applications

Minimizing a nonlinear function

Problem

**Minimize** $f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$, $\quad 0 \leq x_1, x_2 \leq 5$

Solution using simulated annealing

The initial temperature is set to 1000.

The final temperature is set to 0.01.

The geometric cooling schedule is used with $\alpha = 0.9$.

The neighbour is generated using the perturbation mechanism $x_{n+1} = x_n + N(0, 1)$, ensuring that the new point lies within the domain specified in the problem.

For each temperature, we generate 1000 neighbours.

When the temperature is less than the minimum temperature, we stop.

Using simulated annealing, we see that the function attains the minimum value 0.001 at (2.999, 2.008). The actual minimum value is 0, attained at (3, 2).

The Python code for solving the above problem is given below.

import numpy as np
from random import random
from math import exp

def f(x1, x2):
    # Objective (energy) function to be minimized
    return (x1**2 + x2 - 11)**2 + (x2**2 + x1 - 7)**2

def neighbour(x1, x2):
    # Perturb the current point with Gaussian noise N(0, 1) until the
    # neighbour falls inside the feasible domain 0 <= x1, x2 <= 5
    while True:
        x1n, x2n = np.array([x1, x2]) + np.random.normal(0, 1, 2)
        if 0 <= x1n <= 5 and 0 <= x2n <= 5:
            return (x1n, x2n)

def simulated_annealing():
    # Initial temperature, final temperature, cooling factor and starting point
    T, Tmin, alpha, x1, x2 = 1000, 0.01, 0.9, 2.5, 2.5
    while True:
        for _ in range(1000):                   # 1000 neighbours at each temperature
            x1n, x2n = neighbour(x1, x2)
            delta_E = f(x1n, x2n) - f(x1, x2)
            if delta_E <= 0:                    # always accept improving moves
                x1, x2 = x1n, x2n
            elif random() < exp(-delta_E/T):    # Metropolis rule for worsening moves
                x1, x2 = x1n, x2n
        T = alpha*T                             # geometric cooling
        if T < Tmin:                            # stop at the final temperature
            return (x1, x2, f(x1, x2))

print(simulated_annealing())