The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

On the Time Complexity of Algorithm Selection Hyper-Heuristics for Multimodal Optimisation

Andrei Lissovoi, Pietro S. Oliveto, John Alasdair Warwicker
Rigorous Research, Department of Computer Science
The University of Sheffield, Sheffield S1 4DP, United Kingdom
{a.lissovoi,p.oliveto,j.warwicker}@sheffield.ac.uk

Abstract

Selection hyper-heuristics are automated algorithm selection methodologies that choose between different heuristics during the optimisation process. Recently, selection hyper-heuristics choosing between a collection of elitist randomised local search heuristics with different neighbourhood sizes have been shown to optimise a standard unimodal benchmark function from evolutionary computation in the optimal expected runtime achievable with the available low-level heuristics. In this paper we extend our understanding to the domain of multimodal optimisation by considering a hyper-heuristic from the literature that can switch between elitist and non-elitist heuristics during the run. We first identify the range of parameters that allow the hyper-heuristic to hillclimb efficiently and prove that it can optimise a standard hillclimbing benchmark function in the best expected asymptotic time achievable by unbiased mutation-based randomised search heuristics. Afterwards, we use standard multimodal benchmark functions to highlight function characteristics where the hyper-heuristic is efficient by swiftly escaping local optima and ones where it is not. For a function class called CLIFF_d, where a new gradient of increasing fitness can be identified after escaping local optima, the hyper-heuristic is extremely efficient while a wide range of established elitist and non-elitist algorithms are not, including the well-studied Metropolis algorithm. We complete the picture with an analysis of another standard benchmark function called JUMP_d as an example to highlight problem characteristics where the hyper-heuristic is inefficient. Yet, it still outperforms the well-established non-elitist Metropolis algorithm.

1 Introduction

Selection hyper-heuristics are automated algorithm selection methodologies designed to choose which of a set of low-level heuristics to apply in the next steps of the optimisation process (Cowling, Kendall, and Soubeiga 2001). Rather than deciding in advance which heuristics to apply for a problem, the aim is to automate the process at runtime. Originally shown to effectively optimise scheduling problems, such as the scheduling of a sales summit and university timetabling (Cowling, Kendall, and Soubeiga 2001; 2002), they have since been successfully applied to a variety of hard combinatorial optimisation problems (see (Burke et al. 2010; 2013) for surveys of results).

Despite their numerous successes, very limited rigorous theoretical understanding of their behaviour and performance is available (Lehre and Özcan 2013; Alanazi and Lehre 2014). Recently, it has been proved that a simple selection hyper-heuristic called Random Descent (Cowling, Kendall, and Soubeiga 2001; 2002), choosing between elitist randomised local search (RLS) heuristics with different neighbourhood sizes, optimises a standard unimodal benchmark function from evolutionary computation called LEADINGONES in the best possible runtime (up to lower order terms) achievable with the available low-level heuristics (Lissovoi, Oliveto, and Warwicker 2018). For this to occur, it is necessary to run the selected low-level heuristics for a sufficient amount of time, called the learning period, to allow the hyper-heuristic to accurately determine how useful the chosen operator is at the current stage of the optimisation process. If the learning period is not sufficiently long, the hyper-heuristic fails to identify whether the chosen low-level heuristics are actually useful, leading to random choices of which heuristics to apply and a degradation of the overall performance. Since the optimal duration of the learning period may change during the optimisation, (Doerr et al. 2018) have recently introduced a self-adjusting mechanism and rigorously proved that it allows the hyper-heuristic to track the optimal learning period throughout the optimisation process for LEADINGONES.

In this paper we aim to extend the understanding of the behaviour and performance of hyper-heuristics to multimodal optimisation problems. In order to evaluate their capability at escaping local optima, we consider elitist and non-elitist selection operators that have been used in hyper-heuristics in the literature. (Cowling, Kendall, and Soubeiga 2001; 2002) introduced two variants of move acceptance operators: the elitist ONLYIMPROVING (OI) operator, which only accepts moves that improve the current solution, and the non-elitist ALLMOVES (AM) operator, which accepts any new solution independent of its quality. Another selection operator that has been considered in the literature is the IMPROVINGANDEQUAL (IE) acceptance operator, which in addition to accepting improving solutions also accepts solutions of equal quality (Ayob and Kendall 2003; Bilgin, Özcan, and Korkmaz 2007; Özcan, Bilgin, and Korkmaz 2006).

In the mentioned works, the acceptance operator remains fixed throughout the run, with the hyper-heuristic only switching between different mutation operators. However, it would be more desirable that hyper-heuristics are allowed to decide to use elitist selection for hillclimbing in exploitation phases of the search and non-elitism for exploration, for instance to escape from local optima.

(Qian, Tang, and Zhou 2016) analysed selection hyper-heuristics in the context of multi-objective optimisation. They considered a hyper-heuristic selecting between elitist (IE) and strict elitist (OI) acceptance operators and presented a function where it is necessary to mix the two acceptance operators. Lehre and Özcan presented a hyper-heuristic which chooses between two different low-level heuristics that use the above described OI (elitist) and AM (non-elitist) acceptance operators (Lehre and Özcan 2013). The considered Simple Random hyper-heuristic uses 1-bit flips as mutation operator and selects the AM acceptance operator with probability p and the OI acceptance operator with probability 1 − p. Essentially the algorithm is a Random Local Search algorithm that switches between strict elitism, by only accepting improvements, and extreme non-elitism, by accepting any newly found solution. For the standard ROYALROAD_k benchmark function from evolutionary computation, which consists of several blocks of k bits each that have to be set correctly to observe a fitness improvement, and k ≥ 2, they theoretically proved that it is necessary to mix the acceptance operators for the hyper-heuristic to be efficient, because by only accepting improvements (i.e., p = 0) the runtime is infinite due to the hyper-heuristic not being able to cross the plateaus of equal fitness, while by always accepting any move (i.e., p = 1), the algorithm simply performs a random search. By choosing the value of the parameter p appropriately they provide an upper bound on the expected runtime of the hyper-heuristic of O(n³ · k · 2^{k−3}) versus the O(n log(n) · (2^k/k)) expected time required by evolutionary algorithms with standard bit mutation (i.e., each bit is flipped with probability 1/n) and with the standard selection operator that accepts new solutions if their fitness is as good as that of their parent (Doerr, Sudholt, and Witt 2013). In particular, just using the IE acceptance operator throughout the run leads to a better performance. Hence, the advantages of switching between selection operators rather than just using one all the time were not evident.

In this paper we present a systematic analysis of the same hyper-heuristic considered in (Lehre and Özcan 2013) for multimodal optimisation problems where the considerable advantages of changing the selection operator during the run may be highlighted. In particular, we will increase our understanding of the behaviour and performance of hyper-heuristics by providing examples of instance classes where the hyper-heuristic is efficient at escaping local optima and examples where it is not. We first perform an analysis of the standard unimodal ONEMAX benchmark function from evolutionary computation to identify the range of parameter values for p that allow the hyper-heuristic to hillclimb, hence to locate local optima efficiently. In particular, we prove that for p = 1/((1 + ε)n), for any constant ε > 0, the hyper-heuristic is asymptotically as efficient as the best mutation-based unbiased randomised search heuristic (Lehre and Witt 2012) even though it does not rely on elitism. Afterwards we highlight the power of the hyper-heuristic by analysing its performance for standard benchmark function classes chosen because they allow us to isolate important properties of multimodal optimisation landscapes.

Firstly, we consider the CLIFF_d class of functions, consisting of a local optimum that, once escaped, allows the identification of a new slope of increasing fitness. For this class of functions we rigorously prove that the hyper-heuristic efficiently escapes the local optima and even achieves the best known expected runtime of O(n log n) (for general purpose randomised search heuristics) on the hardest instances of the function class (Corus, Oliveto, and Yazdani 2017). Thus, we prove that it considerably outperforms established elitist and non-elitist evolutionary algorithms and the popular METROPOLIS algorithm. We complete the picture by considering the standard JUMP_m multimodal instance class of functions to provide an example of problem characteristics where the hyper-heuristic is not efficient. This class is chosen to isolate problems where escaping local optima is very hard because it is difficult to identify a new slope of increasing gradient. Nevertheless, the hyper-heuristic is efficient for instances of moderate jump size (i.e., constant), where it still considerably outperforms METROPOLIS.

Due to space constraints in this extended abstract, we omit some proofs.

2 Preliminaries

In this section, we will formally introduce the hyper-heuristic algorithms and the problem classes we will analyse in this paper, and briefly state some widely-known mathematical tools for the runtime analysis of randomised search heuristics that we will use throughout the paper.

Algorithms

We will analyse the Move Acceptance hyper-heuristic previously considered in (Lehre and Özcan 2013). In each iteration, one bit chosen uniformly at random will flip and, with probability p, the ALLMOVES (AM) acceptance operator is used, while with probability 1 − p the ONLYIMPROVING (OI) acceptance operator is used. Algorithm 1 shows its pseudocode.

Algorithm 1 Move Acceptance Hyper-Heuristic (MAHH_OI) (Lehre and Özcan 2013)
1: Choose x ∈ {0, 1}^n uniformly at random
2: while termination criteria not satisfied do
3:   x' ← FLIPRANDOMBIT(x)
4:   ACC ← ALLMOVES with probability p, ONLYIMPROVING otherwise
5:   if ACC(x, x') then x ← x'

We also consider the MAHH_IE variant of Algorithm 1 whereby the IE acceptance operator is used instead of the OI operator. Thus, in MAHH_IE, the AM acceptance operator is used with probability p, and the IE acceptance operator is used with probability 1 − p.
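For readers who prefer running code to pseudocode, the following is a minimal Python sketch of Algorithm 1. It is our own illustration, not part of the original paper; the names mahh, target and max_iters are assumptions we introduce, and setting use_ie=True switches to the MAHH_IE variant described above.

```python
import random

def mahh(f, n, p, target=None, max_iters=1_000_000, use_ie=False):
    """Move Acceptance Hyper-Heuristic sketch, maximising f over {0,1}^n."""
    x = [random.randint(0, 1) for _ in range(n)]   # line 1: uniform random initialisation
    fx = f(x)
    for t in range(1, max_iters + 1):              # line 2: termination criterion
        i = random.randrange(n)                    # line 3: FLIPRANDOMBIT
        y = x[:]
        y[i] ^= 1
        fy = f(y)
        if random.random() < p:                    # line 4: choose the acceptance operator
            accept = True                          # ALLMOVES: accept any offspring
        elif use_ie:
            accept = fy >= fx                      # IMPROVINGANDEQUAL (MAHH_IE variant)
        else:
            accept = fy > fx                       # ONLYIMPROVING (MAHH_OI)
        if accept:                                 # line 5: move acceptance
            x, fx = y, fy
        if target is not None and fx >= target:    # stop once a target fitness is reached
            return x, fx, t
    return x, fx, max_iters
```

For example, mahh(sum, n=100, p=1/(1.1 * 100), target=100) runs the sketch on ONEMAX with p = 1/((1 + ε)n) for ε = 0.1.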

The problem classes we consider are all functions of unitation, where changing the number of 1-bits in the bit-string by 1 (as is done by the FLIPRANDOMBIT mutation operator of Algorithm 1) will always change the fitness value, i.e., there are no plateaus of constant fitness. Under these conditions, we point out that all the statements made in the paper for the MAHH_OI hyper-heuristic will also hold for the MAHH_IE hyper-heuristic. In the rest of this paper, we will focus the analysis on the MAHH_OI hyper-heuristic.

Benchmark Function Classes

We will present a runtime analysis of the MAHH_OI hyper-heuristic on three benchmark problem classes defined over n-bit strings – ONEMAX, CLIFF_d, JUMP_m – commonly used in the theory of randomised search heuristics to evaluate their performance. These problem classes are artificially constructed with the purpose of reflecting and isolating common difficulty profiles that are known to appear in classical combinatorial optimisation problems and are expected to appear in real-world optimisation.

The ONEMAX problem class is a class of unimodal functions which provide a consistent fitness gradient leading to the global optimum. The class displays the typical function optimisation feature that improving solutions are harder to identify as the optimum is approached. It is generally used to evaluate and validate the hillclimbing performance of randomised search heuristics. The function is defined as follows:

ONEMAX(x) := Σ_{i=1}^{n} x_i.

In the above definition the global optimum is placed in the 1^n bit-string for convenience of the analysis, i.e., fitness increases with the number of 1-bits. The results we derive will hold for any instance in the function class, i.e., the optimum may be any bit string and the fitness function returns the Hamming distance to the global optimum. This class of functions is known to be the easiest among all functions with a unique global optimum for unary unbiased black box algorithms (mutation-based EAs) (Lehre and Witt 2012).

We also consider two similar multi-modal problem classes, where the optimisation algorithm will need to escape from a local optimum in order to reach the global optimum. The two classes differ in whether the fitness function guides the search toward or away from the global optimum once the search process does escape the local optimum.

The CLIFF_d class of functions was originally proposed as an example where non-elitist evolutionary algorithms outperform elitist ones (Jägersküpper and Storch 2007). Functions within the class generally lead the optimisation process to a local optimum, from which a fitness-decreasing mutation can be taken to find another fitness-improving slope leading to the global optimum. An example instance is shown in Figure 1. We define the CLIFF_d class of functions (for 1 < d < n/2) as follows:

CLIFF_d(x) := ONEMAX(x)              if |x|_1 ≤ n − d,
              ONEMAX(x) − d + 1/2    otherwise.

[Figure 1: CLIFF_d(x) with n = 100 and d = 35 (plot omitted).]

As for ONEMAX, the global optimum is placed at the 1^n bit string to simplify notation. The parameter d controls both the fitness decrease from the local optimum to the lowest point on the slope leading to the global optimum and the length of this second slope. This class of problems captures real-world problems where local optima have narrow basins of attraction: if it is able to escape from the local optimum, a search heuristic has good chances of identifying a new basin of attraction.

The JUMP_m class of functions is similar to CLIFF_d, but has both fitness gradients leading toward the local optimum: thus, to find the global optimum, the search algorithm should not only take a fitness-decreasing mutation to escape the local optimum, but also disregard the fitness gradient in further iterations. An example instance is shown in Figure 2. We define the JUMP_m class of functions (for 1 < m < n/2) as follows:

JUMP_m(x) := n + m              if |x|_1 = n,
             m + ONEMAX(x)      if |x|_1 ≤ n − m,
             n − ONEMAX(x)      otherwise.

[Figure 2: JUMP_m(x) with n = 100 and m = 35 (plot omitted).]

As for ONEMAX, the global optimum is placed at the 1^n bit string to simplify notation. Compared to CLIFF_d, this class of functions features a local optimum with a much broader basin of attraction, making it more difficult for the search heuristics to escape the basin and locate the global optimum.
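The three benchmark classes translate directly into code. The sketch below is our own transcription of the definitions above (the helper names onemax, cliff and jump are ours) and can be plugged into the mahh sketch from the Algorithms subsection.

```python
def onemax(x):
    # ONEMAX(x) = number of 1-bits; global optimum at the all-ones string.
    return sum(x)

def cliff(x, d):
    # CLIFF_d(x) for 1 < d < n/2: ONEMAX with a drop of d - 1/2 once |x|_1 > n - d.
    n, om = len(x), sum(x)
    return om if om <= n - d else om - d + 0.5

def jump(x, m):
    # JUMP_m(x) for 1 < m < n/2: a region of decreasing fitness before the optimum.
    n, om = len(x), sum(x)
    if om == n:
        return n + m
    if om <= n - m:
        return m + om
    return n - om
```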
Mathematical Analysis Tools

We now present some well known drift analysis theorems often used to bound the expected runtime of a randomised search heuristic by considering its drift (i.e., its expected decrease in distance from the optimum in each step). We will apply these theorems throughout the paper for our analyses, using the Hamming distance as a measure of distance to the optimum (or another set of target solutions), and write ∆(i) to refer to the drift conditioned on the parent solution containing i 1-bits.

The following Additive Drift Theorem provides upper and lower bounds on the expected runtime given, respectively, lower and upper bounds on the drift which hold throughout the process.

Theorem 1 (Additive Drift Theorem (He and Yao 2001)). Let {X_t}_{t≥0} be a sequence of random variables over a finite set of states S ⊆ R_0^+ and let T be the random variable that denotes the first point in time for which X_t = 0. If there exist δ_u ≥ δ_l > 0 such that for all t ≥ 0, we have

δ_u ≥ E(X_t − X_{t+1} | X_t) ≥ δ_l,

then the expected optimisation time E(T) satisfies

X_0/δ_u ≤ E(T | X_0) ≤ X_0/δ_l, and E(X_0)/δ_u ≤ E(T) ≤ E(X_0)/δ_l.

If the drift changes considerably throughout the process, then the following Variable Drift Theorem provides sharper upper bounds on the runtime.

Theorem 2 (Variable Drift Theorem (Johannsen 2010)). Let (X_t)_{t≥0} be a stochastic process over some state space S ⊆ {0} ∪ [x_min, x_max], where x_min > 0. Let h(x) be an integrable, monotone increasing function on [x_min, x_max] such that E(X_t − X_{t+1} | X_t ≥ x_min) ≥ h(X_t). Then it holds for the first hitting time T := min{t | X_t = 0} that

E(T | X_0) ≤ x_min/h(x_min) + ∫_{x_min}^{X_0} 1/h(x) dx.

The Negative Drift Theorem allows exponential lower bounds on the runtime to be derived when the drift is negative in large enough areas of the search space (Oliveto and Witt 2011; 2012). The Simplified Drift Theorem with Scaling is a generalised version of the Negative Drift Theorem that allows the negative drift ε to be sub-constant in magnitude.

Theorem 3 (Simplified Drift Theorem with Scaling (Oliveto and Witt 2015)). Let X_t, t ≥ 0, be real-valued random variables describing a stochastic process over some state space. Suppose that there exist an interval [a, b] ⊆ R and, possibly depending on ℓ := b − a, a drift bound ε := ε(ℓ) > 0, as well as a scaling factor r := r(ℓ), such that for all t ≥ 0 the following conditions hold:

1. E(X_{t+1} − X_t | X_0, ..., X_t; a < X_t < b) ≥ ε,
2. P(|X_{t+1} − X_t| ≥ jr | X_0, ..., X_t; a < X_t) ≤ e^{−j} for all j ∈ N_0,
3. 1 ≤ r² ≤ εℓ/(132 log(r/ε)).

Then for the first hitting time T* := min{t ≥ 0 : X_t ≤ a | X_0 > b} it holds that P(T* ≤ e^{εℓ/(132r²)}) = O(e^{−εℓ/(132r²)}).

3 Unimodal Functions

We begin the study of MAHH_OI by analysing its performance on unimodal functions. Since in this setting accepting worsening moves is never required, the hyper-heuristic cannot outperform the elitist algorithms. Nevertheless, Theorem 6 shows that MAHH_OI can still be very efficient when optimising ONEMAX, even with p > 0. We first introduce Theorem 4 and Theorem 5, which show that the parameter p should not be too large.

Theorem 4. The runtime of MAHH_OI on ONEMAX(x), with p ≥ (√n log² n)/n, is at least n^{Ω(log n)} with probability at least 1 − n^{−Ω(log n)}.

For ONEMAX, the larger the value of p, the greater the probability of accepting a move away from the global optimum and the greater the expected runtime of the hyper-heuristic. Applying the standard Negative Drift Theorem (Oliveto and Witt 2011; 2012) for MAHH_OI with any constant value of p gives the following result.

Theorem 5. The runtime of MAHH_OI on ONEMAX, with p = Θ(1), is at least 2^{Ω(n)} with probability at least 1 − 2^{−Ω(n)}.

We now provide an upper bound on p for which the hyper-heuristic is efficient.

Theorem 6. The expected runtime of MAHH_OI on ONEMAX, with p < 1/(n − 1), is at most

E(T_p) ≤ n/(1 − p(n − 1)) + (n/(1 + p)) · ln(n/(1 − p(n − 1))).

If p = 0, we have the well-studied Randomised Local Search (RLS) algorithm, and Theorem 6 gives an expected runtime of E(T_0) ≤ n + n ln n = O(n log n). For p = 1/n, combining Theorem 6 with a separate argument regarding the time needed to reach a fitness of n from a fitness of n − 1 (using e.g. Lemma 8) yields an O(n log n) bound on the expected runtime. For p = 1/((1 + ε)n) with any constant ε > 0, applying Theorem 6 directly shows that any ONEMAX-style hillclimb can be completed in expected time O(n log n).

Corollary 7. The expected runtime of MAHH_OI on ONEMAX, with p = 1/((1 + ε)n), for any constant ε > 0, is O(n log n).
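As a sanity check (our own expansion, not part of the paper), Corollary 7 can be obtained by substituting p = 1/((1 + ε)n) into the bound of Theorem 6 and using n − 1 < n:

```latex
% Substituting p = 1/((1+\varepsilon)n) into Theorem 6 and using n-1 < n:
\[
  1 - p(n-1) \;=\; 1 - \frac{n-1}{(1+\varepsilon)n} \;>\; 1 - \frac{1}{1+\varepsilon}
  \;=\; \frac{\varepsilon}{1+\varepsilon},
\]
\[
  E(T_p) \;\le\; \frac{n}{1-p(n-1)} + \frac{n}{1+p}\ln\!\left(\frac{n}{1-p(n-1)}\right)
  \;<\; \frac{(1+\varepsilon)n}{\varepsilon}
        + n\ln\!\left(\frac{(1+\varepsilon)n}{\varepsilon}\right)
  \;=\; O(n\log n).
\]
% For constant eps > 0 the first term is O(n) and the second is O(n log n).
```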
For the remainder of the paper, we consider MAHH_OI with p = 1/((1 + ε)n). We will show how, with this parameter value, as well as hill-climbing efficiently, the hyper-heuristic can escape from difficult local optima effectively.

4 When the Hyper-Heuristic is Efficient

We now analyse the CLIFF_d (1 < d < n/2) class of benchmark functions. Recall that CLIFF_d features a local optimum that the hyper-heuristic must escape before climbing a fitness gradient toward the global optimum. We refer to the local optimum at i = n − d as the 'cliff', the two ONEMAX-style hillclimbs as the 'first slope' and 'second slope' respectively (see Figure 1), and use d to denote the 'length of the cliff'.

To find the global optimum of the CLIFF_d function, it is necessary to escape the local optimum, by either dropping down from the cliff, accepting a worse candidate solution and then climbing up the second slope, or making a prohibitive jump to the global optimum on the other side of the cliff (this is possible with standard bit mutation, albeit requiring expected exponential time in the length of the cliff, but not with local mutations).

We now consider the performance of MAHH_OI on CLIFF_d. Clearly, with p = 1, MAHH_OI reduces to a random walk across the fitness landscape. Similarly, if p = 0, with probability at least 1/2 the bit-string is initialised with at most n/2 1-bits, and will hillclimb to the top of the cliff. There is no improving step from this position and the global optimum cannot be reached. By the law of total expectation, the expected runtime will be infinite. We will show that using MAHH_OI with p = 1/((1 + ε)n), which has been shown to hillclimb efficiently (Corollary 7), will still allow worsening moves with sufficiently high probability to be able to move down from the cliff and reach the global optimum in expected polynomial time.

Theorem 9 bounds the expected runtime of MAHH_OI on CLIFF_d from above. We begin, however, by introducing a helper lemma, which was proved in (Droste, Jansen, and Wegener 2001), for trajectory based algorithms which can only change the number of 1-bits in the bit-string by 1. The lemma was subsequently used to analyse the performance of the (1+1) EA for noisy ONEMAX for small noise strength (Droste 2004). Unlike in noisy optimisation, where the noise represents uncertainty with respect to the true fitness of solutions, in hyper-heuristics the AM operator is intended to be helpful to the optimisation process by allowing the algorithm to escape from local optima.

Lemma 8. (Droste, Jansen, and Wegener 2001, Lemma 3) Let E(T_i^+) be the expected time to reach a state with i + 1 1-bits, given a state with i 1-bits, and let p_i^+ and p_i^− be the transition probabilities to reach a state with, respectively, i + 1 and i − 1 1-bits. Then:

E(T_i^+) = 1/p_i^+ + (p_i^−/p_i^+) · E(T_{i−1}^+).

Within the context of non-elitist local search algorithms, such as MAHH_OI, the transition probability p_i^+ (p_i^−) refers to the probability of making a local mutation which increases (respectively decreases) the number of 1-bits in the offspring solution and accepting that solution.

Theorem 9. The expected runtime of MAHH_OI on CLIFF_d, with p = 1/((1 + ε)n) for any constant ε > 0, is O(n log n + n³/d²).

Proof. Let i denote the number of 1-bits in the bit-string at any time t > 0. We wish to bound the expected runtime from above by separately bounding four 'stages' of the optimisation process, each starting once all earlier stages have ended, and ending once a solution with at least a certain number of 1-bits has been constructed for the first time during the optimisation process (i.e., the HH may go backwards afterwards yet will remain in the same stage): the first stage ends when i ≥ n − d is reached, the second when i ≥ n − d + 1, the third when i ≥ n − d + 2, and the fourth when i = n, i.e., the optimum has been constructed. We use T_1, ..., T_4 to denote the number of iterations the algorithm spends in each of these stages, and note that, by definition of these stages, the number of iterations T before the optimum is reached is T = T_1 + T_2 + T_3 + T_4, and by linearity of expectation,

E(T) = E(T_1) + E(T_2) + E(T_3) + E(T_4).    (1)

In the first stage, while i < n − d, CLIFF_d resembles the ONEMAX function, and we can use the upper bound for the expected runtime of MAHH_OI on ONEMAX with p = 1/((1 + ε)n) from Theorem 6. Hence, E(T_1) = O(n log n).

The second stage begins with i ≥ n − d, and ends once i ≥ n − d + 1. When i = n − d, there are no improving moves. If the OI operator is selected, there will be no change in the candidate solution. Hence, any move must come from use of the AM operator, which is selected with probability 1/((1 + ε)n). A mutation step may either increase the number of 1-bits in the bit-string with probability d/n, or decrease the number of 1-bits with probability (n − d)/n. We use Lemma 8 to bound E(T_2), the expected time to jump down from the cliff,

E(T_2) = E(T_{n−d}^+) = n²(1 + ε)/d + ((n − d)/d) · E(T_{n−d−1}^+).    (2)

We now must bound E(T_{n−d−1}^+). At i = n − d − 1, the drift is

∆(n − d − 1) = (d + 1)/n − (1/((1 + ε)n)) · ((n − d − 1)/n) = (n(d + ε(d + 1)) + d + 1) / (n²(1 + ε)),

as flipping any one of the (d + 1) remaining 0-bits increases the fitness by 1 and is accepted by either operator, and flipping any one of the (n − d − 1) 1-bits decreases the fitness by 1 and is only accepted if the ALLMOVES operator is chosen, which occurs with probability p = 1/((1 + ε)n).

Clearly, for all 0 ≤ i ≤ n − d − 1, the drift is bounded from below by the drift at i = n − d − 1, while the distance to the required point from i = n − d − 1 is 1. By the Additive Drift Theorem (Theorem 1), we have

E(T_{n−d−1}^+) ≤ 1/∆(n − d − 1) = n²(1 + ε) / (n(d + ε(d + 1)) + d + 1) = O(n/d).

We can now return to bounding E(T_2) in Equation 2:

E(T_2) = n²(1 + ε)/d + ((n − d)/d) · E(T_{n−d−1}^+) = n²(1 + ε)/d + ((n − d)/d) · O(n/d) = O(n²/d).    (3)

The third stage begins with i ≥ n − d + 1, and ends once i ≥ n − d + 2. When i = n − d + 1, all moves are improving moves and all moves will be accepted, regardless of the choice of the OI or AM operator. With probability (n − d + 1)/n, the accepted move decreases the number of 1-bits to i = n − d, i.e., the hyper-heuristic returns to the local optimum. With probability (d − 1)/n, the accepted move instead increases the number of 1-bits to i = n − d + 2. We again use Lemma 8 to bound E(T_3), the expected time to take one step up the second slope. Noting that E(T_{n−d}^+) = O(n²/d) by Equation 3, we get

E(T_3) = E(T_{n−d+1}^+) = n/(d − 1) + ((n − d + 1)/(d − 1)) · E(T_{n−d}^+) = n/(d − 1) + ((n − d + 1)/(d − 1)) · O(n²/d) = O(n³/d²).

If d = 2, the CLIFF_d function will have been optimised at this point. However, for d ≥ 3, it is necessary to climb further up the second slope. If d = 3, we point out that E(T_4) = E(T_{n−d+2}^+), and by applying Lemma 8 bound

E(T_{n−d+2}^+) = n/(d − 2) + (1/((1 + ε)n)) · ((n − d + 2)/(d − 2)) · E(T_{n−d+1}^+)
             = n/(d − 2) + (1/((1 + ε)n)) · ((n − d + 2)/(d − 2)) · O(n³/d²) = n/(d − 2) + O(n³/d³).

For d > 3, we know that i = n − d + 2 and i = n − d + 3 are points on the second slope, and by applying Lemma 8 bound

E(T_{n−d+3}^+) = n/(d − 3) + (1/((1 + ε)n)) · ((n − d + 3)/(d − 3)) · E(T_{n−d+2}^+)
             = n/(d − 3) + (1/((1 + ε)n)) · ((n − d + 3)/(d − 3)) · O(n³/d³) = n/(d − 3) + O(n³/d⁴).

For d > 4 this trend will continue as MAHH_OI progresses closer towards the global optimum; in particular, for k < d, we will have E(T_{n−d+k}^+) = n/(d − k) + O(n³/d^{k+1}).

Hence, E(T_4) = Σ_{k=2}^{d−1} E(T_{n−d+k}^+). The terms in the summation will be asymptotically dominated by the O(n³/d³) term in E(T_{n−d+2}^+) if d is sub-linear, giving E(T_4 | d = o(n)) ≤ d · O(n³/d³) = O(n³/d²). However, if d is linear in the problem size, the first terms will dominate, and we will have:

E(T_4 | d = Θ(n)) = O(1) + Σ_{i=2}^{d−1} n/(d − i) ≤ O(1) + Σ_{i=0}^{d−1} n/(d − i) = O(1) + n · Σ_{j=1}^{d} 1/j = O(n log n).

Combining the bounds for the sub-linear and linear d, we conclude that E(T_4) = O(n log n + n³/d²).

We now return to our overall runtime bound from Equation 1 and complete the proof:

E(T) = E(T_1) + E(T_2) + E(T_3) + E(T_4) = O(n log n + n²/d + n³/d² + n log n + n³/d²) = O(n log n + n³/d²).

Theorem 9 gives an expected runtime bound of MAHH_OI on CLIFF_d of O(n log n + n³/d²). This bound is smallest when d is large, suggesting that the algorithm is fastest when the cliff is hardest for elitist algorithms, i.e., a linear cliff length, d = Θ(n), gives an expected runtime of O(n log n). This runtime asymptotically matches the best case expected performance of an artificial immune system, which escapes the local optimum with an ageing operator (also O(n log n) in expectation (Corus, Oliveto, and Yazdani 2017)), which is the best runtime known for hard CLIFF_d functions (for problem-independent search heuristics). We suspect MAHH_OI is faster in practice (i.e., by decreasing the leading constants in the runtime), but leave this proof for future work. Mutation-based EAs have a runtime of Θ(n^d) for d ≤ n/2 (Paixão et al. 2017); for d = Θ(n), this gives at least exponential runtime in the length of the cliff. Steady-State Genetic Algorithms which use crossover have recently been proven to be faster by at least a linear factor (Dang et al. 2018), but would still require exponential expected runtimes for large cliff lengths.

The worst-case scenario for MAHH_OI is a constant cliff length, giving a runtime of O(n³). This means that the (1+1) EA will outperform MAHH_OI if d < 3, but will be slower for any 3 < d ≤ n/2, with a performance gap that increases exponentially with the length of the cliff. For CLIFF_d, only an upper bound on the expected runtime of the non-elitist (1,λ) EA is available in the literature: O(n^25) if λ is neither too large nor too small (Jägersküpper and Storch 2007).
ever, if d is linear in the problem size, the first terms will We can also compare the performance of MAHHOI dominate, and we will have: on CLIFFd with a well-studied non-elitist algorithm. The d−1 d−1 METROPOLIS algorithm (Algorithm 2) accepts worsening X n X n f(y)−f(x) E(T4 | d = Θ(n)) = O(1) + ≤ O(1) + α(n) α(n) ≥ 1 d − i d − i moves with probability , for some , i=2 i=0 and accepts all improvements (Metropolis et al. 1953). The- d X 1 orem 10 shows that MAHHOI will considerably outperform = O(1) + n · = O(n log n). j METROPOLIS on CLIFFd for all d > 3. j=1 Theorem 10. The expected runtime of METROPOLIS on Combining the bounds for the sub-linear and linear d, we  d−3/2   3  1 n−d+1  n  ω(1) n CLIFFd is at least min · · , n . conclude that E(T4) = O n log n + d2 . 2 d−1 log n We now return to our overall runtime bound from Equa- We have proven that while METROPOLIS cannot optimise tion 1 and complete the proof: hard CLIFFd variants in polynomial time, MAHHOI is ex- E(T ) = E(T1) + E(T2) + E(T3) + E(T4) tremely efficient for hard cliffs, and has an expected runtime 3  n2 n3 n3  of O(n ) in the worst case. = O n log n + + 2 + n log n + 2 d d d 5 When the Hyper-Heuristic is Inefficient = O n log n + n3/d2 . We now consider the JUMPm function class (1 < m < n/2) as an example of a multimodal function where MAHHOI

5 When the Hyper-Heuristic is Inefficient

We now consider the JUMP_m function class (1 < m < n/2) as an example of a multimodal function where MAHH_OI has a harder time escaping the local optimum. Unlike CLIFF_d, the fitness decreases on the second slope, making it harder to traverse.

In order to optimise the JUMP_m function (Figure 2), it is first necessary to reach the local optimum at the top of the slope. Then, either a jump to the global optimum is made, which requires exponential expected time in the length of the jump for unbiased mutation operators, or a jump is made down to the slope of decreasing fitness, which must be traversed before the global optimum is found.

Theorem 11. The runtime of MAHH_OI on JUMP_m, with p = 1/((1 + ε)n), is at least Ω(n log n + 2^{cm}), for some constant c > 0 and any constant ε > 0, with probability at least 1 − 2^{−Ω(m)}.

The following theorem provides an upper bound on the expected runtime by considering the expected time for the MAHH_OI to accept m consecutive worsening moves from the local to the global optimum.

Theorem 12. The expected runtime of MAHH_OI on JUMP_m, with p = 1/((1 + ε)n), for any constant ε > 0, is O(n log n + n^{2m−1}/m).

Using global mutations to jump can achieve faster expected runtimes. The expected runtime of the (1+1) EA on JUMP_m is Θ(n^m) (Droste, Jansen, and Wegener 2002), and a bound on Steady State (µ+1) Genetic Algorithms of O(n^{m−1}) has recently been proved (Dang et al. 2018), both outperforming our upper bound for MAHH_OI. Recent work has shown that exponential speedups are achieved by an Artificial Immune System and an Evolutionary Algorithm and Genetic Algorithm (GA) with heavy-tailed mutation operators (Corus, Oliveto, and Yazdani 2018; Friedrich, Quinzan, and Wagner 2018; Doerr et al. 2017) (however, the expected runtime is still exponential in m). A compact GA has super-polynomial speedups over these algorithms for super-constant jump lengths when m = o(n) (Hasenöhrl and Sutton 2018). In particular, for logarithmic jump lengths the algorithm optimises JUMP_m in expected polynomial time. We point out that steady state GAs with diversity mechanisms that enhance the power of crossover can optimise JUMP_m in expected time O(mn log n + 4^m) for jumps up to m = n/8 (Dang et al. 2016) and in Θ(n log n + 4^m) with an unrealistically small crossover probability of O(m/n) (Kötzing, Sudholt, and Theile 2011).

Concerning algorithms that use non-elitism to escape from local optima, we will now show that METROPOLIS has worse performance than MAHH_OI on JUMP_m. The large fitness difference between the two points at i = n − m and i = n − m + 1 makes METROPOLIS unlikely to accept the i = n − m + 1 point when α(n) is set high enough to enable efficient hill-climbing of the first slope.

Theorem 13. The runtime of METROPOLIS on JUMP_m(x) is 2^{Ω(n)}, with probability at least 1 − 2^{−Ω(n)}.

Unlike METROPOLIS, MAHH_OI does not consider the magnitude of the fitness difference between two solutions in its acceptance operators. This allows it to optimise JUMP_m with smaller jump sizes, such as m = Θ(1), in expected polynomial time, while METROPOLIS requires exponential time in expectation in all cases.

6 Conclusions

We have rigorously analysed the performance of the Move Acceptance Hyper-Heuristic (MAHH_OI) for multimodal optimisation problems. We have shown that setting the parameter value to p ≤ 1/((1 + ε)n), for any constant ε > 0, allows it to hillclimb the ONEMAX function in the desired expected runtime of O(n log n). On the other hand, for too large parameter values we have shown how the runtime becomes exponential.

We have also shown that MAHH_OI performs well on multimodal landscapes where the candidate solution identifies a gradient of increasing fitness after leaving a local optimum. The CLIFF_d function class was analysed to provide such an example where MAHH_OI exhibits very good performance, matching the best runtime known for problem-independent search heuristics, for instances that are prohibitive for elitist Evolutionary Algorithms and METROPOLIS.

On multimodal functions with long slopes of decreasing fitness that have to be traversed, MAHH_OI does not perform as well. In particular, we analysed the JUMP_m function class to provide such an example, proving that MAHH_OI has a runtime that is exponential in the size of the 'jump', with overwhelming probability. However, the established non-elitist METROPOLIS algorithm was also shown to be inefficient on this benchmark function, with an even worse runtime (i.e., at least exponential in the problem size, with overwhelming probability). The difference is particularly dramatic for small jumps (i.e., m = Θ(1)), where MAHH_OI is able to find the global optimum in expected polynomial time, because unlike METROPOLIS, its acceptance operators do not consider the magnitude of the fitness difference between the parent and offspring solutions.

All the statements given in the paper for MAHH_OI also hold for MAHH_IE, a variant of MAHH_OI which chooses the IMPROVINGANDEQUAL acceptance operator with probability 1 − p, and the ALLMOVES acceptance operator with probability p.

Various avenues can be explored in future work. Firstly, analysing more sophisticated move acceptance operators would give more insight into the behaviour and performance of hyper-heuristics commonly used in the literature, such as Monte Carlo approaches (Ayob and Kendall 2003). Also, implementing learning mechanisms within the MAHH_OI, such that it could 'learn' to prefer a certain operator in certain parts of the fitness landscape, similarly to the hyper-heuristic introduced in (Doerr et al. 2018) for mutation operator selection, should be considered. Furthermore, MAHH_OI should be analysed on classical problems from combinatorial optimisation with real-world applications. Finally, more comprehensive hyper-heuristics that choose between multiple parameter sets, including the population size, should be analysed.

Acknowledgements

This work was supported by the EPSRC under Grant No. EP/M004252/1.

References

Alanazi, F., and Lehre, P. K. 2014. Runtime analysis of selection hyper-heuristics with classical learning mechanisms. In Proc. of CEC '14, 2515–2523. IEEE.

Ayob, M., and Kendall, G. 2003. A monte carlo hyper-heuristic to optimise component placement sequencing for multi head placement machine. In Proc. of InTech '03, 132–141.

Bilgin, B.; Özcan, E.; and Korkmaz, E. E. 2007. An experimental study on hyper-heuristics and exam timetabling. In Proc. of PATAT '07, 394–412. Springer.

Burke, E. K.; Hyde, M.; Kendall, G.; Ochoa, G.; Özcan, E.; and Woodward, J. R. 2010. A classification of hyper-heuristic approaches. In Handbook of Metaheuristics. Springer. 449–468.

Burke, E. K.; Gendreau, M.; Hyde, M.; Kendall, G.; Ochoa, G.; Özcan, E.; and Qu, R. 2013. Hyper-heuristics: A survey of the state of the art. J. Op. Res. Soc. 1695–1724.

Corus, D.; Oliveto, P. S.; and Yazdani, D. 2017. On the runtime analysis of the opt-IA artificial immune system. In Proc. of GECCO '17, 83–90. ACM.

Corus, D.; Oliveto, P. S.; and Yazdani, D. 2018. Fast artificial immune systems. In Proc. of PPSN '18, 67–78. Springer.

Cowling, P.; Kendall, G.; and Soubeiga, E. 2001. A hyperheuristic approach to scheduling a sales summit. In Proc. of PATAT '01, 176–190. Springer.

Cowling, P.; Kendall, G.; and Soubeiga, E. 2002. Hyperheuristics: A tool for rapid prototyping in scheduling and optimisation. In Proc. of EvoWorkshops '02, 1–10. Springer.

Dang, D.-C.; Friedrich, T.; Kötzing, T.; Krejca, M. S.; Lehre, P. K.; Oliveto, P. S.; Sudholt, D.; and Sutton, A. M. 2016. Escaping local optima with diversity mechanisms and crossover. In Proc. of GECCO '16, 645–652. ACM.

Dang, D.-C.; Friedrich, T.; Kötzing, T.; Krejca, M. S.; Lehre, P. K.; Oliveto, P. S.; Sudholt, D.; and Sutton, A. M. 2018. Escaping local optima using crossover with emergent diversity. IEEE Trans. Evol. Comp. 22(3):484–497.

Doerr, B.; Le, H. P.; Makhmara, R.; and Nguyen, T. D. 2017. Fast genetic algorithms. In Proc. of GECCO '17, 777–784. ACM.

Doerr, B.; Lissovoi, A.; Oliveto, P. S.; and Warwicker, J. A. 2018. On the runtime analysis of selection hyper-heuristics with adaptive learning periods. In Proc. of GECCO '18, 1015–1022. ACM.

Doerr, B.; Sudholt, D.; and Witt, C. 2013. When do evolutionary algorithms optimize separable functions in parallel? In Proc. of FOGA '13, 51–64. ACM.

Droste, S.; Jansen, T.; and Wegener, I. 2001. Dynamic parameter control in simple evolutionary algorithms. In Proc. of FOGA '01, 275–294. ACM.

Droste, S.; Jansen, T.; and Wegener, I. 2002. On the analysis of the (1+1) evolutionary algorithm. Th. Comp. Sci. 51–81.

Droste, S. 2004. Analysis of the (1+1) EA for a noisy onemax. In Proc. of GECCO '04, 1088–1099. Springer.

Friedrich, T.; Quinzan, F.; and Wagner, M. 2018. Escaping large deceptive basins of attraction with heavy-tailed mutation operators. In Proc. of GECCO '18, 293–300. ACM.

Hasenöhrl, V., and Sutton, A. M. 2018. On the runtime dynamics of the compact genetic algorithm on jump functions. In Proc. of GECCO '18, 967–974. ACM.

He, J., and Yao, X. 2001. Drift analysis and average time complexity of evolutionary algorithms. Artif. Intel. 127(1):57–85.

Jägersküpper, J., and Storch, T. 2007. When the plus strategy outperforms the comma strategy and when not. In Proc. of FOCI '07, 25–32. IEEE.

Johannsen, D. 2010. Random combinatorial structures and randomized search heuristics. Ph.D. Dissertation, Universität des Saarlandes, Saarbrücken.

Kötzing, T.; Sudholt, D.; and Theile, M. 2011. How crossover helps in pseudo-boolean optimization. In Proc. of GECCO '11, 989–996. ACM.

Lehre, P. K., and Özcan, E. 2013. A runtime analysis of simple hyper-heuristics: To mix or not to mix operators. In Proc. of FOGA '13, 97–104. ACM.

Lehre, P. K., and Witt, C. 2012. Black-box search by unbiased variation. Algorithmica 623–642.

Lissovoi, A.; Oliveto, P. S.; and Warwicker, J. A. 2018. Simple hyper-heuristics optimise LeadingOnes in the best runtime achievable using randomised local search low-level heuristics. ArXiv e-prints.

Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; and Teller, E. 1953. Equation of state calculations by fast computing machines. J. of Chem. Phys. 1087–1092.

Oliveto, P. S., and Witt, C. 2011. Simplified drift analysis for proving lower bounds in evolutionary computation. Algorithmica 369–386.

Oliveto, P. S., and Witt, C. 2012. Erratum: Simplified drift analysis for proving lower bounds in evolutionary computation. CoRR abs/1211.7184.

Oliveto, P. S., and Witt, C. 2015. Improved time complexity analysis of the simple genetic algorithm. Th. Comp. Sci. 605:21–41.

Özcan, E.; Bilgin, B.; and Korkmaz, E. E. 2006. Hill climbers and mutational heuristics in hyperheuristics. In Proc. of PPSN '06, 202–211. Springer.

Paixão, T.; Pérez Heredia, J.; Sudholt, D.; and Trubenová, B. 2017. Towards a runtime comparison of natural and artificial evolution. Algorithmica 681–713.

Qian, C.; Tang, K.; and Zhou, Z.-H. 2016. Selection hyper-heuristics can provably be helpful in evolutionary multi-objective optimization. In Proc. of PPSN '16, 835–846. Springer.
