On Proving Linear Convergence of Comparison-based Step-size Adaptive Randomized Search on Scaling-Invariant Functions via Stability of Markov Chains
Anne Auger, Nikolaus Hansen
To cite this version:
Anne Auger, Nikolaus Hansen. On Proving Linear Convergence of Comparison-based Step-size Adaptive Randomized Search on Scaling-Invariant Functions via Stability of Markov Chains. 2013. hal-00877160v3
HAL Id: hal-00877160 https://hal.inria.fr/hal-00877160v3 Preprint submitted on 1 Nov 2013 (v3), last revised 2 Jun 2016 (v7)
Abstract. In the context of numerical optimization, this paper develops a methodology to analyze the linear convergence of comparison-based step-size adaptive randomized search (CB-SARS), a class of probabilistic derivative-free optimization algorithms where the function is solely used through comparisons of candidate solutions. Various algorithms belong to the class of CB-SARS. On the one hand, a few methods introduced already in the 60's: the step-size adaptive random search by Schumer and Steiglitz, the compound random search by Devroye, and simplified versions of Matyas' random optimization algorithm or of Kjellström and Taxén's Gaussian adaptation. On the other hand, it includes simplified versions of several recent algorithms: the covariance matrix adaptation evolution strategy (CMA-ES), the exponential natural evolution strategy (xNES), and the cross entropy method.

CB-SARS algorithms typically exhibit several invariances. First, invariance to composing the objective function with a strictly monotonic transformation, which is a direct consequence of the fact that the algorithms only use comparisons. Second, scale invariance, which translates the fact that the algorithm has no intrinsic absolute notion of scale. The algorithms are investigated on scaling-invariant functions, defined as functions that preserve the ordering of two points (with respect to their objective function values) when both are scaled with the same positive parameter (this scaling is done w.r.t. a single constant reference point). This class of functions includes compositions of norms with strictly increasing functions, as well as some non-quasi-convex and non-continuous functions. We prove that on scaling-invariant functions the algorithm invariance properties entail the existence of a homogeneous Markov chain whose stability implies linear convergence of the original algorithm.
Hence we turn the problem of studying linear convergence of CB-SARS optimization algorithms into a study of the stability of a joint Markov chain. We naturally derive sufficient conditions for the global linear convergence of CB-SARS, expressed in terms of different stability conditions for the Markov chain (irreducibility, positivity, Harris recurrence, geometric ergodicity). The methodology presented here is applied in a companion paper, where the stability sufficient conditions are proven for the (1+1)-ES with a generalized one-fifth success rule that coincides with Schumer and Steiglitz's or Devroye's algorithms. This leads to a first proof of linear convergence for the algorithm considered, and a first proof of linear convergence of a scale-invariant CB-SARS on such a wide family of functions, including in particular non-quasi-convex and non-continuous functions.
Key words. linear convergence, derivative-free optimization, comparison-based algorithm, function-value-free optimization, discrete time Markov chain on continuous state space, stability of Markov chain, step-size adaptive randomized search, Evolution Strategies, CMA-ES, xNES
1. Introduction. Derivative-free optimization (DFO) algorithms were already proposed in the 60's to solve numerical optimization problems: see for instance the pattern search method by Hooke and Jeeves [17], the simplex method by Nelder and Mead [28], or randomized DFO methods proposed by Matyas [26], Schumer and Steiglitz [35], Rechenberg [31] or Devroye [12]. DFO methods have the advantage that they can handle problems posed in a black-box setting, where the function f : R^n → R to be (w.l.o.g.) minimized is seen as a black-box that is solely capable of returning a function value f(x) for any input vector x. In particular, gradients are not needed. Such a black-box scenario is very common in industry, for instance where the function results from heavy numerical simulation codes from which it is too complex to extract information or compute a gradient, or where the person who has to optimize the problem only has access to an executable of the function but not to its source code.
While almost abandoned for a few decades in the mathematical optimization community, deterministic DFO has recently seen a revival of interest with the introduction of new methods like the NEWUOA algorithm by Powell [30]; see also e.g. the book [11]. At the same time, randomized DFO methods, like Evolution Strategies, have been continuously developed and improved in the evolutionary computation community since their introduction in the 60's [31]. The elegant and powerful covariance matrix adaptation Evolution Strategy (CMA-ES) [15] is now considered state-of-the-art for randomized continuous optimization, with numerous empirical evidence of its efficiency and working principles [14].

While DFO methods optimize f without using its derivatives, an important distinction, however rarely recognized, can be made among DFO algorithms. Indeed, some DFO algorithms are also function-value-free (FVF): they optimize f without using its function values but solely comparisons between function values. For instance, the simplex method and Evolution Strategies are comparison-based and thus FVF.

This paper focuses on a specific subclass of FVF methods, namely comparison-based step-size adaptive randomized search (CB-SARS). Underlying the algorithms is a family of probability distributions that is iteratively updated. More precisely, we consider methods whose current state at iteration t is described by a random vector (X_t, σ_t) ∈ R^n × R^+_>, with X_t standing for the favorite solution and σ_t > 0 being a step-size, commonly representing the standard deviation of the underlying probability distribution (while X_t is its expected value). When convergence towards a solution x⋆ occurs, the vector X_t converges to x⋆ and the step-size goes to zero. Linear convergence is characterized by both ‖X_t − x⋆‖ and σ_t converging to zero linearly at the same rate, as will be explained below (see also Figure 5.1, left and middle plots).
The algorithms considered here share the same general structure based on three steps: (i) sample the underlying probability distribution to form candidate solutions, (ii) evaluate the candidate solutions on f, (iii) rank the solutions (possibly including those from the previous iteration) according to their f-values and update the current state (X_t, σ_t) based on the ranked solutions. This framework includes several methods proposed in the past: step-size adaptive Evolution Strategies [8], the step-size random search algorithm by Schumer and Steiglitz [35], or Devroye's compound random search [12]. It can also be seen as a simplified version of various other methods: Matyas' random optimization algorithm [26], which in addition to the step-size adapts a search direction; CMA-ES, exponential natural evolution strategies (xNES) [13] or Gaussian adaptation [25] with the adaptation of the covariance matrix turned off; or the cross entropy method [33], provided the covariance matrix divided by σ_t² is constant and has full rank.

One key feature of CB-SARS is their invariance to strictly increasing transformations of the objective function. In particular, the update function in (iii) uses solely the ranking of candidate solutions instead of their function values. It implies that the sequence (X_t, σ_t) generated for the optimization of f is almost surely equal to the sequence generated for optimizing g ∘ f for any strictly monotonic function g. Consequently, a convex smooth function f can be as "efficiently" optimized by a CB-SARS as any non-convex, non-smooth function g ∘ f, for a choice of a strictly monotonic g that renders g ∘ f non-convex and non-smooth. Remark that a method like simulated annealing does not share the property of invariance to increasing transformations, as the acceptance of a new solution depends on the function value itself [24].
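The three-step loop (sample, evaluate and rank, update) can be sketched generically. The helper names (`cb_sars_iterate`, `sol`, `update`) and the concrete weights and step-size shrink factor below are our own illustrative choices, not taken from the paper; they only serve to show that f enters solely through the ranking.

```python
import random

def cb_sars_iterate(state, f, sol, update, p, sample_u, rng):
    """One comparison-based transition: sample, evaluate, rank, update.

    `sol` creates a candidate from the state and one sampling coordinate;
    `update` plays the role of the pair (G1, G2); f is only used to rank."""
    x, sigma = state
    u = [sample_u(rng) for _ in range(p)]              # step (i): sample U_t
    cands = [sol((x, sigma), ui) for ui in u]
    fvals = [f(c) for c in cands]                      # step (ii): evaluate on f
    perm = sorted(range(p), key=lambda i: fvals[i])    # step (iii): rank
    y = [u[i] for i in perm]                           # ranked coordinates
    return update((x, sigma), y)

# illustrative instantiation on the sphere function (parameters are ours)
sphere = lambda v: sum(vi * vi for vi in v)
sol = lambda st, ui: [xi + st[1] * uij for xi, uij in zip(st[0], ui)]

def update(st, y):                 # weighted recombination + fixed shrink factor
    x, sigma = st
    w = [0.7, 0.3, 0.0, 0.0]
    move = [sum(wi * yi[j] for wi, yi in zip(w, y)) for j in range(len(x))]
    return ([xi + sigma * mj for xi, mj in zip(x, move)], 0.95 * sigma)

rng = random.Random(1)
state = ([3.0, 2.0], 1.0)
for _ in range(100):
    state = cb_sars_iterate(state, sphere, sol, update, 4,
                            lambda r: [r.gauss(0, 1) for _ in range(2)], rng)
```

After 100 iterations the step-size has shrunk and the mean has moved towards better solutions; selection biases every move towards smaller f-values.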
Another important invariance for an optimization algorithm is scale-invariance,
that is, the algorithm has no intrinsic absolute notion of scale. For a step-size adaptive randomized search (SARS), the given step-size defines the scale the algorithm is operating on, but the operations do not depend on the scale otherwise.

Linear convergence of many CB-SARS is observed on a relatively wide class of functions; however, few formal proofs of linear convergence exist. On the one hand, Jägersküpper proved linear convergence for variants of the (1+1)-ES and (1 +, λ)-ES with one-fifth success rule on spherical functions [21, 20] and on a certain class of convex quadratic functions [18, 19], using an ad-hoc technique consisting in deriving upper and lower bounds on the number of iterations needed to reduce the approximation error. On the other hand, another type of approach was proposed in [9] for a step-size adaptive ES with self-adaptation on spherical functions. The authors pointed out that the norm of the distance to the optimum divided by the step-size is a homogeneous Markov chain, and deduced that studying this Markov chain could lead to a linear convergence proof. The study of the Markov chain was achieved in [3]. All those results however hold for a restricted class of functions, namely spherical or convex quadratic functions (and of course their composites with a strictly monotonic transformation).

We develop in this paper a methodology generalizing [9, 3] to analyze the linear convergence of scale-invariant CB-SARS algorithms on the class of scaling-invariant functions, where we define a function f as scaling-invariant with respect to x⋆ if for all ρ > 0 and all x, y ∈ R^n,

f(ρ(x − x⋆)) ≤ f(ρ(y − x⋆))  ⇔  f(x − x⋆) ≤ f(y − x⋆) .

This class of functions includes functions that are the composition of a norm with a strictly increasing transformation (having hence convex sublevel sets), but also non-quasi-convex functions. Non-constant scaling-invariant functions do not admit strict local optima nor plateaux and are thus essentially unimodal.
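The definition (for all ρ > 0, f(ρ(x − x⋆)) ≤ f(ρ(y − x⋆)) iff f(x − x⋆) ≤ f(y − x⋆)) can be probed numerically. The helper below is our own illustrative sketch; a randomized test can only refute, never certify, scaling invariance.

```python
import math
import random

def is_scaling_invariant_sample(f, xstar, trials=1000, rng=None):
    """Randomized check of scaling invariance w.r.t. xstar: returns False as
    soon as one sampled (rho, x, y) violates the defining equivalence."""
    rng = rng or random.Random(0)
    n = len(xstar)
    for _ in range(trials):
        rho = math.exp(rng.uniform(-2, 2))            # random positive scaling
        x = [rng.uniform(-5, 5) for _ in range(n)]
        y = [rng.uniform(-5, 5) for _ in range(n)]
        dx = [xi - si for xi, si in zip(x, xstar)]
        dy = [yi - si for yi, si in zip(y, xstar)]
        lhs = f([rho * v for v in dx]) <= f([rho * v for v in dy])
        rhs = f(dx) <= f(dy)
        if lhs != rhs:
            return False
    return True

norm = lambda v: math.sqrt(sum(vi * vi for vi in v))
# the norm (and any increasing transform of it) is scaling-invariant w.r.t. 0
assert is_scaling_invariant_sample(norm, [0.0, 0.0])
# a shifted norm is not scaling-invariant w.r.t. 0 (it is w.r.t. its own center)
assert not is_scaling_invariant_sample(lambda v: norm([v[0] - 1.0, v[1]]), [0.0, 0.0])
```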
We prove that the algorithm properties of (i) invariance to strictly monotonic transformations and (ii) translation and scale-invariance imply, on scaling-invariant functions (w.l.o.g. with x⋆ in zero), that the mean vector X_t normalized by the step-size σ_t is a homogeneous Markov chain¹. We then explain how linear convergence is a consequence of the stability (irreducibility, Harris recurrence, positivity, geometric ergodicity) of the normalized Markov chain and make explicit in detail the typical form of stability we can prove. We naturally derive sufficient conditions for linear convergence, expressed as stability of the normalized chain. Stability per se is studied for the (1+1)-ES with generalized one-fifth success rule in the companion paper [6].

Because of the stochastic nature of the algorithms, linear convergence can be formulated in different ways. We formulate almost sure linear convergence as the existence of a convergence rate CR such that (1/t) ln(‖X_t‖/‖X_0‖) and (1/t) ln(σ_t/σ_0) converge to −CR almost surely. Linear convergence in expectation is formulated as the log-progress terms E[ln(‖X_{t+1}‖/‖X_t‖)] and E[ln(σ_{t+1}/σ_t)] converging to −CR. We also prove that under geometric ergodicity of the normalized chain the error between the log-progress and −CR decreases to zero like 1/r^t (for some r > 1), and we formulate a Central Limit Theorem to characterize the speed of convergence of (1/t) ln(‖X_t‖/‖X_0‖) and (1/t) ln(σ_t/σ_0) towards −CR.

The paper is organized as follows. We define in Section 2.1 SARS and CB-SARS algorithms. In Section 2.2, we formalize different invariance properties. First
¹ We assume for the sake of simplicity a specific form of the operator for sampling new solutions, typical of CB-SARS algorithms (property (2.15)). Otherwise the definition of the normalized Markov chain needs to be adapted.
invariance to strictly monotonic transformations, satisfied directly by CB-SARS. Then we give a general definition of translation and scale-invariance for SARS and derive practical sufficient conditions for a CB-SARS to be translation and scale-invariant. In Section 2.3 we present several examples of methods that follow our general definition of CB-SARS and study their invariance properties. In Section 3 we define the class of scaling-invariant functions. In Section 4, we prove that for certain translation and scale-invariant CB-SARS algorithms optimizing scaling-invariant functions (w.l.o.g. w.r.t. zero), the mean vector divided by the step-size is a homogeneous Markov chain. In Section 5, we give sufficient conditions for linear convergence, expressed in terms of stability of the normalized Markov chain exhibited in Section 4. A discussion is finally provided in Section 6.

Notations and definitions. The set of nonnegative real numbers is denoted R^+, and R^+_> denotes the elements of R^+ excluding 0; N is the set of natural numbers including zero, while N_> excludes zero. The Euclidean norm of a vector x of R^n is denoted ‖x‖. A Gaussian vector or multivariate normal distribution with mean vector m and covariance matrix C is denoted N(m, C). The identity matrix in R^{n×n} is denoted I_n, such that a standard multivariate normal distribution, i.e. with mean vector zero and identity covariance matrix, is denoted N(0, I_n). The density of a standard multivariate normal distribution (in any dimension) is denoted p_N. The set of strictly increasing functions g from R to R, or from a subset I ⊂ R to R, is denoted M.

2. Comparison Based Step-size Adaptive Randomized Search. In this section, we present a formal definition of SARS and CB-SARS algorithms. We then define invariance properties typically associated with those algorithms and finish by giving several concrete examples of CB-SARS algorithms as well as analyzing their invariance properties.

2.1. Algorithm Definitions.
For a SARS algorithm, the state of the algorithm is a mean vector living in R^n and a step-size living in R^+_>. Hence, the state space for the algorithm is Ω = R^n × R^+_>. Mathematically, a SARS algorithm is described by a stochastically recursive sequence on the state space Ω. Given (X_0, σ_0) ∈ R^n × R^+_>, the sequence is inductively defined as

(2.1) (X_{t+1}, σ_{t+1}) = F((X_t, σ_t), U_t)

where X_t ∈ R^n represents the favorite solution at iteration t, σ_t ∈ R^+_> is the so-called step-size, F is a measurable function and (U_t)_{t∈N} is an independent identically distributed (i.i.d.) sequence of random vectors. Each vector U_t belongs to a space U^p = U × ... × U and has p coordinates U_t^i belonging to U. The probability distribution of the vector U_t is denoted p_U. From the definition (2.1) it follows that ((X_t, σ_t))_{t∈N} is a time-homogeneous Markov chain. We call F the transition function of the algorithm. The objective function f is also an input argument to the transition function F, as the update of (X_t, σ_t) depends on f; however, we omit this dependence in general for the sake of notational simplicity. If there is an ambiguity, we add the function f as a superscript, i.e. F^{f(x)} or F^f.

A CB-SARS is a SARS where the transition function F depends on f only through comparisons of candidate solutions, and is the composition of several functions that we specify in the sequel. The p coordinates of U_t are first used to create new candidate solutions according to a Sol function:

X_t^i = Sol((X_t, σ_t), U_t^i) , i = 1, ..., p .
(For instance in the case where U = R^n the Sol function can equal Sol((x, σ), u^i) = x + σ u^i.) The p candidate solutions are then evaluated on f and ordered according to their objective function value. The permutation corresponding to the ordered objective function values f(X_t^i) is denoted S ∈ S(p), where S(p) denotes the set of permutations of p elements. Formally, S is the output of the Ord function defined below. It is then used to order the coordinates of the vector U_t accordingly. More formally, the permutation S acts on the coordinates of U_t via the following function:

(2.2) S(p) × U^p → U^p , (S, U_t) → S ∗ U_t = (U_t^{S(1)}, ..., U_t^{S(p)})

where the previous equation implicitly defines the operator ∗.

The update of (X_t, σ_t) is achieved using the current state (X_t, σ_t) and the ranked coordinates of U_t. More precisely, consider a measurable function G, called update function, that maps Ω × U^p onto Ω; the update of (X_t, σ_t) reads

(2.3) (X_{t+1}, σ_{t+1}) = G((X_t, σ_t), S ∗ U_t) = G((X_t, σ_t), Y_t) ,

where Y_t denotes the ordered coordinates of U_t, i.e. Y_t = (U_t^{S(1)}, ..., U_t^{S(p)}). We formalize the definition of a CB-SARS below after introducing a definition for the function Sol for generating solutions as well as for the ordering function.

Definition 2.1 (Sol function). Given U, the state space for the sampling coordinates of U_t, a Sol function used to create candidate solutions is a measurable function mapping Ω × U into R^n, i.e.

Sol : Ω × U → R^n .

We now define the ordering function that returns a permutation based on the objective function values.

Definition 2.2 (Ord function). The ordering function Ord maps R^p to S(p), the set of permutations of p elements, and returns for any set of real values (f_1, ..., f_p) the permutation S of ordered indexes. That is, S = Ord(f_1, ..., f_p) ∈ S(p) where

f_{S(1)} ≤ ... ≤ f_{S(p)} .

When more convenient we may denote Ord((f_i)_{i=1,...,p}) instead of Ord(f_1, ..., f_p).
When needed for the sake of clarity we may use the notations Ord_f or S_f to emphasize the dependency on f.

We are now ready to give a formal definition of a comparison-based step-size adaptive randomized search.

Definition 2.3 (CB-SARS minimizing f : R^n → R). Let p ∈ N_> and U^p = U × ... × U where U is a subset of R^m. Let p_U be a probability distribution defined on U^p, where each U distributed according to p_U has a representation (U^1, ..., U^p) (each U^i ∈ U). Let Sol be a solution function as in Definition 2.1. Let G_1 : Ω × U^p → R^n and G_2 : R^+ × U^p → R^+ be two measurable mappings and let G denote the pair (G_1, G_2).

A CB-SARS is determined by the quadruplet (Sol, G, U^p, p_U), from which the recursive sequence (X_t, σ_t) ∈ Ω is defined via (X_0, σ_0) ∈ Ω and for all t:

(2.4) X_t^i = Sol((X_t, σ_t), U_t^i) , i = 1, ..., p
(2.5) S = Ord(f(X_t^1), ..., f(X_t^p)) ∈ S(p)
(2.6) X_{t+1} = G_1((X_t, σ_t), S ∗ U_t)
(2.7) σ_{t+1} = G_2(σ_t, S ∗ U_t)
where (U_t)_{t∈N} is an i.i.d. sequence of random vectors on U^p distributed according to p_U, and Ord is the ordering function as in Definition 2.2.

The previous definition illustrates the function-value-free property: the update of the state (X_t, σ_t) is performed using solely the information given by the permutation S, which contains the order of the candidate solutions according to f. For a comparison-based step-size adaptive randomized search, the function F introduced in (2.1) is the composition of the update function G, the solution operator Sol and the ordering function; more precisely
(2.8) F^f((x, σ), u) = G((x, σ), Ord((f(Sol((x, σ), u^i)))_{i=1,...,p}) ∗ u) .
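The Ord function of Definition 2.2 and the ∗ operator of (2.2) amount to an argsort followed by a gather. A minimal sketch (0-indexed, function names are ours):

```python
def ord_perm(fvals):
    """Ord: return the permutation S with f_{S(1)} <= ... <= f_{S(p)}.

    Python's `sorted` is stable, so tied f-values keep their original order."""
    return sorted(range(len(fvals)), key=lambda i: fvals[i])

def star(perm, u):
    """The * operator of (2.2): reorder the coordinates of U_t according to S."""
    return [u[i] for i in perm]

s = ord_perm([3.0, 1.0, 2.0])
assert s == [1, 2, 0]                                  # indexes of sorted values
assert star(s, ["u1", "u2", "u3"]) == ["u2", "u3", "u1"]
```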
Note that for the update of the step-size (equation (2.7)), we assume that X_t does not come into play. Examples of CB-SARS are given in Section 2.3.

2.2. Invariance Properties. Invariance is an important principle in science in general and in optimization in particular. Invariance generalizes properties that are true on a single function to a whole class of functions. Some invariances are taken for granted in optimization, like translation invariance, while others are less common or less recognized. In the sequel, we start by formalizing that CB-SARS are invariant to strictly monotonic transformations of f. We then focus in Section 2.2.2 on invariance in the search space and formalize translation invariance and scale invariance. We also derive sufficient conditions for a CB-SARS to be translation and scale-invariant.

2.2.1. Invariance to Strictly Monotonic Transformations of f. Invariance of a CB-SARS algorithm to strictly monotonic transformations of f is a direct consequence of the algorithm definition. It stems from the fact that the objective function only comes into play through the ranking of the solutions via the ordering function (see (2.5), (2.6) and (2.7)). This ordering function outputs the same result on f as on any strictly monotonic transformation of f. More formally, let M_I denote the set of strictly monotonic mappings g : I → R, where I is a subset of R, i.e. such that for all x and y in I with x < y, g(x) < g(y).

Proposition 2.4. Let (Sol, G, U^p, p_U) be a CB-SARS according to Definition 2.3, let g ∈ M, and let ((X_t, σ_t))_{t∈N} and ((X_t′, σ_t′))_{t∈N} be the sequences generated when minimizing f and g ∘ f respectively, with (X_0, σ_0) = (X_0′, σ_0′) and using the same sequence (U_t)_{t∈N}. Then for all t,

(X_t, σ_t) = (X_t′, σ_t′) .
Proof. The proof is immediate, by induction. Assume (X_t, σ_t) = (X_t′, σ_t′) and denote X_t^i = Sol((X_t, σ_t), U_t^i). Because Ord(f(X_t^1), ..., f(X_t^p)) = Ord(g ∘ f(X_t^1), ..., g ∘ f(X_t^p)) = S, then

(X_{t+1}, σ_{t+1}) = G((X_t, σ_t), S ∗ U_t) = G((X_t′, σ_t′), S ∗ U_t) = (X_{t+1}′, σ_{t+1}′) .
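Proposition 2.4 can be observed numerically: a toy comparison-based (1,4)-ES (all parameters below are our own illustrative choices, not the paper's algorithm) produces bit-identical trajectories on f and on g ∘ f for a strictly increasing g when fed the same random draws.

```python
import math
import random

def run_es(f, x0, sigma0, steps, seed):
    """Tiny comparison-based search: keep the best of 4 Gaussian candidates and
    shrink the step-size by a fixed factor. Only f-comparisons are used."""
    rng = random.Random(seed)
    x, sigma = list(x0), sigma0
    traj = [(tuple(x), sigma)]
    for _ in range(steps):
        us = [[rng.gauss(0, 1) for _ in x] for _ in range(4)]
        cands = [[xi + sigma * ui for xi, ui in zip(x, u)] for u in us]
        x = cands[min(range(4), key=lambda i: f(cands[i]))]
        sigma *= 0.97
        traj.append((tuple(x), sigma))
    return traj

sphere = lambda v: sum(vi * vi for vi in v)
g_of_f = lambda v: math.exp(sphere(v)) - 5.0    # strictly increasing g applied to f
# same seed, hence same U_t: the two trajectories coincide (Proposition 2.4)
assert run_es(sphere, [2.0, 1.0], 1.0, 50, seed=7) == run_es(g_of_f, [2.0, 1.0], 1.0, 50, seed=7)
```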
Fig. 2.1. Illustration of invariance to strictly increasing transformations. Representation of three instances of functions belonging to the invariance class (w.r.t. strictly increasing transformations) of f(x) = x² in dimension 1. On the left the sphere function, and in the middle and on the right functions g ∘ f for two different g ∈ M. On those three functions, a comparison-based step-size adaptive randomized search will generate the same sequence (X_t, σ_t) (see Proposition 2.4) and consequently convergence will take place at the same rate.
Consequently, on the three functions depicted in Figure 2.1, a comparison-based step-size adaptive randomized search will produce the same sequence (X_t, σ_t)_{t∈N}. Hence if convergence takes place on one of those functions, it will take place on the two others, and at the same convergence rate.

2.2.2. Invariances in the Search Space: Translation and Scale-Invariance. We consider now invariance of comparison-based step-size adaptive randomized search related to transformations in the search space. We use a classical approach to formalize invariance, using homomorphisms transforming state variables via a group action, and visualize invariances with a commutative diagram. We start with translation invariance, usually taken for granted in optimization.

Translation invariance. Most optimization algorithms are translation invariant, which implies the same performance when optimizing x → f(x) or x → f(x − x_0) for all x_0, provided a respective initialization of the algorithm. More precisely, let us consider R^n endowed with the addition operation + as a group, and consider A(Ω), the set of invertible mappings from the state space Ω to itself, that is, the set of all (invertible) state space transformations. Endowed with the function composition ∘, (A(Ω), ∘) also has a group structure. We recall the definition of a group homomorphism.

Definition 2.5 (Group homomorphism). Let (G_1, .) and (G_2, ∗) be two groups. A mapping Φ : G_1 → G_2 is called a group homomorphism if for all x, y ∈ G_1 we have Φ(x.y) = Φ(x) ∗ Φ(y).

From the definition it follows that for any x ∈ G_1, Φ(x^{-1}) = [Φ(x)]^{-1}, where x^{-1} (resp. [Φ(x)]^{-1}) denotes the inverse of x (resp. of Φ(x)). Note that in case x belongs to an additive group, the inverse is denoted −x. Let Homo((R^n, +), (A(Ω), ∘)) be the set of group homomorphisms from (R^n, +) to (A(Ω), ∘). For instance, consider Φ ∈ Homo((R^n, +), (A(Ω), ∘)), i.e. Φ : y ∈ (R^n, +) → Φ(y) ∈ A(Ω), where Φ(y) is a state space transformation such that for all (x, σ) ∈ R^n × R^+_>, Φ(y)((x, σ)) = (x + y, σ). We are now ready to state a definition of translation invariance.

Definition 2.6 (Translation Invariance). A SARS with transition function F is translation invariant if there exists a group homomorphism Φ ∈ Homo((R^n, +), (A(Ω), ∘)) such that for any objective function f, for any x_0 ∈ R^n, for any (x, σ) ∈ Ω and for any u ∈ U^p,

(2.9) F^{f(x)}((x, σ), u) = Φ(x_0)^{-1} F^{f(x−x_0)}(Φ(x_0)(x, σ), u) ,
Fig. 2.2. Left: commutative diagram for the translation invariance property applied to one iteration of a step-size adaptive algorithm (Φ(−x_0) = [Φ(x_0)]^{-1}). Right: commutative diagram for the scale-invariance property applied to one iteration of a step-size adaptive algorithm (Φ(1/α) = [Φ(α)]^{-1}). The homomorphisms Φ (different on the left and right) define for any x_0 (resp. α) a search space transformation Φ(x_0) (resp. Φ(α)).
where the function to be optimized is shown as a superscript of the transition function F, and where Φ(x_0)^{-1} = Φ(−x_0). The previous definition means that a SARS algorithm is translation invariant if we can find a homomorphism Φ (that depends on the algorithm) that defines for any translation x_0 a search space transformation Φ(x_0), such that the following two procedures yield the same state (X_{t+1}, σ_{t+1}): (i) start from (X_t, σ_t) and apply one iteration of the algorithm to optimize f(x); or (ii) apply the state space transformation Φ(x_0) to the state of the algorithm, apply one iteration of the algorithm on x → f(x − x_0), and transform the state back via Φ(−x_0). This property is pictured via a double commutative diagram (see Figure 2.2).

We consider in the next proposition some specific properties of Sol and G that render a comparison-based step-size adaptive randomized search translation invariant. These properties are satisfied by the algorithms presented in Section 2.3.

Proposition 2.7. Let (Sol, G, U^p, p_U) be a CB-SARS according to Definition 2.3. If the following conditions are satisfied:
(i) for all x, x_0 ∈ R^n, for all σ > 0, for all u^i ∈ U,

(2.10) Sol((x + x_0, σ), u^i) = Sol((x, σ), u^i) + x_0

(ii) for all x, x_0 ∈ R^n, for all σ > 0, for all y ∈ U^p,

(2.11) G_1((x + x_0, σ), y) = G_1((x, σ), y) + x_0

then (Sol, G, U^p, p_U) is translation invariant and the associated group homomorphism Φ is defined by
(2.12) Φ(x_0)(x, σ) = (x + x_0, σ) for all x_0, x, σ .
In addition, assuming that the Sol function satisfies property (2.10), if (Sol, G, U^p, p_U) is translation invariant with (2.12) as homomorphism, then (2.11) is satisfied.

Proof. Consider the homomorphism defined in (2.12); then (2.11) writes

(2.13) G(Φ(x_0)(x, σ), y) = Φ(x_0)(G((x, σ), y)) ,

and (2.10) writes Sol(Φ(x_0)(x, σ), u^i) − x_0 = Sol((x, σ), u^i). This latter equation implies that the same permutation S results from ordering the solutions generated by the Sol function on f starting from (x, σ) or on x → f(x − x_0) starting from Φ(x_0)(x, σ). Using (2.13) we hence have F^{f(x−x_0)}(Φ(x_0)(x, σ), u) = G(Φ(x_0)(x, σ), S ∗ u) = Φ(x_0)(G((x, σ), S ∗ u)) = Φ(x_0)(F^f((x, σ), u)), which coincides with (2.9). The converse is immediate.
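A numerical sanity check of (2.9) under conditions (2.10) and (2.11), with a toy comparison-based step of our own choosing (best of 4 candidates, a step-size rule that ignores x): running on x → f(x − x_0) from the translated state and translating back recovers the untranslated step, up to floating-point rounding.

```python
import random

def step(x, sigma, f, rng):
    """One step with Sol((x, s), u) = x + s*u (satisfies (2.10)) and a G1 that
    returns one of the candidates (satisfies (2.11)); G2 ignores x entirely."""
    us = [[rng.gauss(0, 1) for _ in x] for _ in range(4)]
    cands = [[xi + sigma * ui for xi, ui in zip(x, u)] for u in us]
    best = min(range(4), key=lambda i: f(cands[i]))
    return cands[best], 0.9 * sigma

f = lambda v: sum(vi * vi for vi in v)
x0 = [100.0, -50.0]
# same draws: run on f from (x, s), and on x -> f(x - x0) from Phi(x0)(x, s) = (x + x0, s)
x1, s1 = step([3.0, 4.0], 1.0, f, random.Random(5))
x2, s2 = step([3.0 + x0[0], 4.0 + x0[1]], 1.0,
              lambda v: f([vi - ci for vi, ci in zip(v, x0)]), random.Random(5))
# translating the result back with Phi(-x0) recovers the same next state, as in (2.9)
assert s1 == s2
assert all(abs(a - (b - c)) < 1e-9 for a, b, c in zip(x1, x2, x0))
```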
Scale-invariance property. The scale invariance property translates the fact that the algorithm has no intrinsic notion of scale. It can be defined similarly to translation invariance by considering the set of group homomorphisms from the group (R^+_>, .) (where . denotes the multiplication between two real numbers) to the group (A(Ω), ∘). We denote this set Homo((R^+_>, .), (A(Ω), ∘)).

Definition 2.8 (Scale-invariance). A SARS with transition function F is scale-invariant if there exists a homomorphism Φ ∈ Homo((R^+_>, .), (A(Ω), ∘)) such that for any f, for any α > 0, for any (x, σ) ∈ R^n × R^+_> and for any u ∈ U^p,

(2.14) F^{f(x)}((x, σ), u) = Φ(1/α) F^{f(αx)}(Φ(α)(x, σ), u) ,

where the function optimized is shown as a superscript of the transition function F. In the previous definition we have used the fact that for any element α of the multiplicative group (R^+_>, .) its inverse is 1/α. The scale-invariance property can be pictured via a double commutative diagram (see Figure 2.2).

We derive in the next proposition some conditions for a CB-SARS to be scale-invariant, which will be useful in the sequel to prove that the algorithms presented in Section 2.3 are scale-invariant.

Proposition 2.9. Let (Sol, G, U^p, p_U) be a CB-SARS according to Definition 2.3. If for all α > 0 the following three conditions are satisfied:
(i) for all u^i ∈ U, (x, σ) ∈ R^n × R^+_>,

(2.15) Sol((x, σ), u^i) = α Sol((x/α, σ/α), u^i)

(ii) for all y ∈ U^p, (x, σ) ∈ R^n × R^+_>,

(2.16) G_1((x, σ), y) = α G_1((x/α, σ/α), y)

and (iii) for all y ∈ U^p, σ ∈ R^+_>,

(2.17) G_2(σ, y) = α G_2(σ/α, y) ,

then it is scale-invariant and the associated homomorphism is Φ : α ∈ R^+_> → Φ(α), where for all (x, σ) ∈ R^n × R^+_>,

(2.18) Φ(α)(x, σ) = (x/α, σ/α) .
Conversely, assuming that the Sol function satisfies (2.15), if (Sol, G, U^p, p_U) is scale-invariant with the homomorphism defined in (2.18), then (2.16) and (2.17) are satisfied.

Proof. From (i) we deduce that for any (x, σ) in R^n × R^+_> and any u^i ∈ U,

f(Sol((x, σ), u^i)) = f(α Sol((x/α, σ/α), u^i))

which implies that the same permutation S will result from ordering solutions (with Ord) on f starting from (x, σ) or on x → f(αx) starting from (x/α, σ/α), i.e. with some obvious notations S^{f(x)}_{(x,σ)} = S^{f(αx)}_{(x/α, σ/α)}. On the other hand, using (2.8) the following holds:

(2.19) F^{f(x)}((x, σ), u) = G((x, σ), S^{f(x)}_{(x,σ)} ∗ u)
(2.20) F^{f(αx)}((x/α, σ/α), u) = G((x/α, σ/α), S^{f(αx)}_{(x/α, σ/α)} ∗ u) .
Assuming (ii) and (iii), we find that F^{f(x)}((x, σ), u) = α F^{f(αx)}((x/α, σ/α), u), where α multiplies both components of the state. Using the homomorphism defined in (2.18), the previous equation reads

F^{f(x)}((x, σ), u) = Φ(1/α) F^{f(αx)}(Φ(α)(x, σ), u)

which is (2.14). Hence we have found a homomorphism such that (2.14) holds, which is the definition of scale-invariance. The converse is immediate.

Remark that given a CB-SARS that satisfies the conditions (i), (ii) and (iii) from the previous proposition, we can reparametrize the state of the algorithm by σ̃_t = σ_t² (if the underlying sampling distribution is Gaussian, this means parametrizing by the variance instead of the standard deviation), leaving the parametrization for the mean vector unchanged. Then the conditions of the previous proposition are no longer valid for the new parametrization. Yet the algorithm is still scale-invariant, but a different homomorphism needs to be considered, namely
(2.21) Φ(α)(x, σ̃) = (x/α, σ̃/α²) .
Hence the sufficient conditions derived are by no means general; however, they cover typical settings for CB-SARS, and adapting Proposition 2.9 to other parametrizations is usually easy.

2.3. Examples of CB-SARS. In order to illustrate the CB-SARS framework introduced, we present in this section several examples of CB-SARS algorithms and analyze their invariance properties.

2.3.1. Non-elitist Step-size Adaptive Evolution Strategies (ES). We consider two examples of algorithms following Definition 2.3 that were introduced under the name Evolution Strategies (ES). They share the same sampling space U = R^n (hence U^p = R^{n×p}). A vector U_t ∈ U^p is composed of p i.i.d. standard multivariate normal vectors, i.e. U_t^i ∼ N(0, I_n), and thus the joint density p_U(u^1, ..., u^p) is the product p_N(u^1) ... p_N(u^p), where p_N(x) = (2π)^{−n/2} exp(−x^T x / 2). The solution operator to sample new solutions is given by:

(2.22) Sol((X_t, σ_t), U_t^i) (= X_t^i) = X_t + σ_t U_t^i , i = 1, ..., p ,

and hence each candidate solution X_t^i follows the distribution N(X_t, σ_t² I_n). The vector X_t is thus the mean vector of the underlying distribution N(X_t, σ_t² I_n). Using the terminology sometimes employed for evolution strategies, stressing the parallel with biology, (2.22) implements a mutation: the new solutions X_t^i, also called offspring, are obtained by mutation of the solution X_t, also referred to as the parent.

Given the vector of ordered samples Y_t = S ∗ U_t = (U_t^{S(1)}, ..., U_t^{S(p)}), where S is the permutation resulting from the ranking of the objective function values of the solutions (see (2.5)), the update equation for the mean vector X_t that defines the function G_1 is given by

(2.23) X_{t+1} = G_1((X_t, σ_t), Y_t) := X_t + κ_m σ_t Σ_{i=1}^p w_i Y_t^i

where κ_m ∈ R^+ is usually called the learning rate and is often set to 1, and w_i ∈ R are weights that satisfy w_1 ≥ ... ≥ w_p and Σ_{i=1}^p |w_i| = 1.
The update implements the idea of moving the vector X_t towards better solutions by performing a weighted recombination of the p best solutions [34, 15, 2]. Often, only positive weights are considered, where
optimally half of the weights should be non-zero, i.e. w_1 ≥ w_2 ≥ ... ≥ w_{⌊p/2⌋} > 0 and w_i = 0 for i > ⌊p/2⌋. If equal weights are used, the terminology intermediate recombination is employed [7].
Recently, an interesting interpretation of the vector σ_t Σ_{i=1}^p w_i Y_t^i was given: it is an approximation of (the n first coordinates of) the natural gradient of a joint criterion defined on the manifold of the family of Gaussian probability distributions [1, 29].
Several step-size updates have been used together with the update of the mean vector X_t in (2.23). First of all, consider the update derived from the cumulative step-size adaptation, or path-length control, without cumulation [16] that reads
(2.24) σ_{t+1} = G_2(σ_t, Y_t) = σ_t exp( κ_σ ( ‖√w̄ Σ_{i=1}^p w_i Y_t^i‖ / E[‖N(0, I_n)‖] − 1 ) )
where κ_σ > 0 is the learning rate for the step-size update, usually set close to one, and w̄ = 1/Σ_{i=1}^p w_i^2. The value 1/κ_σ is often considered as a damping parameter. The ruling principle of the update is to compare the length of the recombined step Σ_{i=1}^p w_i Y_t^i to its expected length if the objective function returned independent random values. Indeed, if the signal given by the objective function is random, the step-size should stay constant. It is not difficult to see that in such a condition a random ordering takes place, hence the distribution of the vector Y_t is the same as the distribution of the vector U_t; it follows that √w̄ Σ_{i=1}^p w_i Y_t^i is distributed according to a standard multivariate normal distribution. Hence (2.24) increases the step-size if the observed length of √w̄ Σ_{i=1}^p w_i Y_t^i is larger than its expected length under random selection and decreases it otherwise. Overall, the update function associated with the CSA without cumulation reads
G_CSAw/o((x, σ), y) = ( x + σ κ_m Σ_{i=1}^p w_i y^i , σ exp( κ_σ ( ‖√w̄ Σ_{i=1}^p w_i y^i‖ / E[‖N(0, I_n)‖] − 1 ) ) ) .
In practice the CSA step-size update is used in combination with cumulation: in the update, the norm of the selected step is replaced by the norm of a path that cumulates the steps of previous generations [15]. The CSA with cumulation is the default step-size adaptation mechanism used in the CMA-ES algorithm. The second example we present corresponds to the natural gradient update for the step-size with exponential parametrization [13] that writes
σ_{t+1} = σ_t exp( (κ_σ / 2n) Tr( Σ_{i=1}^p w_i ( Y_t^i (Y_t^i)^T − I_n ) ) ) = σ_t exp( (κ_σ / 2n) Σ_{i=1}^p w_i ( ‖Y_t^i‖^2 − n ) ) .
We then define the update function for the step-size update in xNES as
(2.25) G_xNES((x, σ), y) = ( x + σ κ_m Σ_{i=1}^p w_i y^i , σ exp( (κ_σ / 2n) Σ_{i=1}^p w_i ( ‖y^i‖^2 − n ) ) ) .
Often the parameter p is denoted λ, a parameter μ corresponding to the number of positive weights is introduced, and the terminology comma selection, as opposed to plus (or elitist) selection that will be detailed later, is employed. This terminology stresses that the updates of the vector X_t and of the step-size take into account only solutions sampled anew and do not consider "old" solutions. Consequently, it is not guaranteed that the best solution at iteration t + 1 has a smaller objective function value than
the best solution at iteration t. In the case where only positive weights are used, a compact notation for the algorithms described above is (μ/μ_w, λ)-ES, where λ = p and μ equals the number of non-zero weights.
Invariance properties. The two comparison-based step-size adaptive randomized search algorithms presented in this section are translation-invariant and scale-invariant. They indeed satisfy the sufficient conditions derived in Proposition 2.7 and Proposition 2.9.
2.3.2. Evolution Strategy with Self-adaptation. Another type of algorithm included in the comparison-based step-size adaptive randomized search definition is the so-called self-adaptive step-size ES. The idea of self-adaptation dates back to the 70's and consists in adding the parameters to be adapted (step-size, covariance matrix, ...) to the vector that undergoes variations (mutations and recombinations) and letting the selection (through the ordering function) adjust the parameters [31, 36]. In the case where a single step-size is adapted, the step-size first undergoes a mutation: it is multiplied by a random variable following a log-normal distribution Logn(0, τ^2) where τ ≈ 1/√n. The mutated step-size is then used as the overall standard deviation for the multivariate normal distribution N(0, I_n). In this case, the space U equals R^{(n+1)×p}. The n first coordinates of an element U_t^i ∈ R^{n+1}, denoted [U_t^i]_{1...n} ∈ R^n, correspond to the sampled standard multivariate normal vector, and the last coordinate, denoted [U_t^i]_{n+1}, corresponds to the sampled normal variable used to generate the log-normal factor that mutates the step-size. The solution function is defined as
(2.26) Sol((X_t, σ_t), U_t^i) = X_t^i = X_t + σ_t exp( τ [U_t^i]_{n+1} ) [U_t^i]_{1...n}
where [U_t^i]_{1...n} ∼ N(0, I_n) and [U_t^i]_{n+1} ∼ N(0, 1). The distribution p_U thus admits a density that equals p_U(u^1, ..., u^p) = p_{N_{n+1}}(u^1) ... p_{N_{n+1}}(u^p), u^i ∈ R^{n+1}.
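The self-adaptive sampling (2.26) can be sketched as follows; each sampled vector carries n coordinates for the search space plus one extra coordinate driving the log-normal step-size mutation (the function name and signature are ours):

```python
import numpy as np

def sa_sample(x, sigma, tau, p, rng):
    """Sample p offspring by self-adaptive mutation, a sketch of (2.26)."""
    n = len(x)
    U = rng.standard_normal((p, n + 1))             # each row is one u^i in R^{n+1}
    factors = np.exp(tau * U[:, n])                 # log-normal step-size mutations
    X = x + (sigma * factors)[:, None] * U[:, :n]   # offspring X_t^i
    return X, U
```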
Remark that the ordering function selects at the same time the couple formed by the multivariate normal vector and the normal variable used to mutate the step-size. Assuming that only the best solution plays a role in the update of X_t (i.e. a single non-zero weight in the recombination equation (2.23)), the update for the mean vector reads
(2.27) X_{t+1} = X_t + σ_t exp( τ [Y_t^1]_{n+1} ) [Y_t^1]_{1...n}
and the update for the step-size is
(2.28) σ_{t+1} = σ_t exp( τ [Y_t^1]_{n+1} ) .
A step-size adaptive Evolution Strategy satisfying (2.26), (2.27) and (2.28) is called a (1,p) self-adaptive step-size ES ((1,p)-SA). The (1,p) refers to the fact that a single solution is selected out of the p. The update function G for the (1,p)-SA reads
G_{(1,p)-SA}((x, σ), y) = ( x + σ exp( τ [y^1]_{n+1} ) [y^1]_{1...n} , σ exp( τ [y^1]_{n+1} ) ) .
We thus see that the step-size is adapted by the selection that occurs through the ordering. The rationale behind the method is that an unadapted step-size cannot consistently produce good solutions, so that selection adapts the step-size "for free" (explaining the terminology "self-adaptation"). Self-adaptive algorithms were popular in the 90's, certainly because the idea is simple and attractive. However, self-adaptation leads in general to too small step-sizes. Different variants of self-adaptation using multiple parents and recombination exist; we refer to the review paper [8] for further reading and references.
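A full (1,p)-SA iteration, combining the sampling (2.26) with the updates (2.27) and (2.28), may be sketched as follows (the names are ours; the best offspring simply becomes the new mean, and its mutated step-size becomes the new step-size):

```python
import numpy as np

def sa_step(f, x, sigma, tau, p, rng):
    """One (1,p)-SA iteration: the best of p self-adaptively mutated
    offspring sets both the new mean (2.27) and the new step-size (2.28)."""
    n = len(x)
    U = rng.standard_normal((p, n + 1))
    sigmas = sigma * np.exp(tau * U[:, n])        # mutated step-sizes
    X = x + sigmas[:, None] * U[:, :n]            # offspring (2.26)
    best = int(np.argmin([f(xi) for xi in X]))    # ordering keeps the best only
    return X[best], sigmas[best]                  # (2.27) and (2.28)
```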
Invariances. In virtue of Proposition 2.7 and Proposition 2.9, the (1,p)-SA is translation- and scale-invariant.
The linear convergence in dimension 1 of the self-adaptive ES algorithm described in this section was proven in [3] on spherical functions using the Markov chain approach presented here.
2.3.3. Step-size Random Search or Compound Random Search or (1+1)-ES with 1/5 Success Rule. The last example presented is an algorithm where the sequence f(X_t) is decreasing, i.e. only updates that improve upon X_t or leave it unchanged are performed. The algorithm is often characterized as elitist, as the best solution at the current iteration is kept for the next iteration. For the example presented, at each iteration a single new solution is sampled from X_t, i.e.
X_t^1 = X_t + σ_t U_t^1
where U_t^1 follows a standard multivariate normal distribution, and hence X_t^1 ∈ R^n follows the distribution N(X_t, σ_t^2 I_n). The step U_t^1 is accepted if the candidate solution is better than the current one and rejected otherwise. Let us denote by U_t^2 = 0 ∈ R^n the zero vector and take U_t = (U_t^1, U_t^2). Hence U = R^{n×2} and the probability distribution of U equals p_U(u^1, u^2) = p_N(u^1) δ_0(u^2), where δ_0 is the Dirac delta function. The Sol function then corresponds to the function in (2.22).
The update equation for X_t is similar to (2.23) with weights (w_1, w_2) = (1, 0). Remark that, contrary to the non-elitist algorithms presented before, the sampled step U_t, the selected step Y_t and the new mean X_{t+1} have a singular part w.r.t. the Lebesgue measure. An algorithm following such an update is often referred to as a (1+1)-ES but was also introduced under the names Markov monotonous search [37], step-size random search [35] and compound random search [12].
The idea behind the adaptation of the step-size starts from the observation that if the step-size is very small, the probability of success (i.e. of sampling a better solution) is approximately one half, but the improvements are small because the step is small. On the opposite, if the step-size is too large, the probability of success will be small; typically the optimum will be overshot and the improvement will also be very small. In between lies an optimal step-size associated with an optimal probability of success [35, 31, 12]. A proposed adaptive step-size algorithm consists in trying to maintain the probability of success (i.e. the probability to sample a better solution) at a certain target value p_target: increase the step-size if the probability of success is larger than p_target and decrease it otherwise [12, 31, 32]. The optimal probability of success, i.e.
the one allowing to obtain an optimal convergence rate, has been computed on the sphere function f(x) = ‖x‖^2 for the dimension of the search problem going to infinity and is roughly equal to 0.27 [35, 31]. Another function where the asymptotic optimal probability of success was computed is the corridor function^2, where it equals 1/(2e) [32]. As a trade-off between the probability of success on the sphere and on the corridor, the target probability is often taken equal to 1/5 = 0.20, which gave the name one-fifth success rule to the step-size adaptive algorithm. We call the algorithm with p_target as target success probability the generalized one-fifth success rule.^3
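One iteration of the elitist (1+1)-ES with a generalized one-fifth success rule may be sketched as follows. The multiplicative factors are our choice of one common variant, set so that log σ is stationary in expectation exactly when the success probability equals p_target:

```python
import numpy as np

def one_plus_one_step(f, x, sigma, rng, p_target=0.2, gamma=1.5):
    """Elitist step: accept only improvements; increase sigma by `gamma` on
    success, decrease it by gamma**(-p_target/(1-p_target)) on failure."""
    candidate = x + sigma * rng.standard_normal(len(x))  # X_t^1 = X_t + sigma U_t^1
    if f(candidate) <= f(x):                             # success: keep the better point
        return candidate, sigma * gamma
    return x, sigma * gamma ** (-p_target / (1.0 - p_target))  # failure: keep x
```

At success probability p, the expected change of log σ is p log γ − (1 − p) (p_target/(1 − p_target)) log γ, which vanishes precisely at p = p_target, implementing the increase/decrease principle described above.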
^2 The corridor function is defined as f(x) = x_1 for −b < x_2 < b, ..., −b < x_n < b (with b > 0), and +∞ otherwise.
^3 Note that p_target does not correspond to the optimal probability of success: indeed, if the probability of success equals the target probability, the step-size is kept constant. Hence, if convergence occurs, the achieved probability of success is smaller than the target probability. Therefore, on
Several implementations of the generalized one-fifth success rule exist. In some implementations, the probability of success is estimated by fixing a step-size for a few iterations, counting the number of successful solutions and deducing an estimate of the probability of success. The step-size is then increased if the estimated probability of success is larger than p_target and decreased otherwise [31, 32]. This version of the one-fifth success rule is in particular the one investigated theoretically by Jägersküpper [22, 21, 20, 19]. A somewhat simpler implementation consists in estimating at each iteration^4 the probability of success as 1_{f(X_t^1) ≤ f(X_t)}: this indicator function being