The Cross-Entropy Method for Continuous Multi-Extremal Optimization

Dirk P. Kroese, Department of Mathematics, The University of Queensland, Brisbane 4072, Australia
Sergey Porotsky, Optimata Ltd., 11 Tuval St., Ramat Gan 52522, Israel
Reuven Y. Rubinstein, Faculty of Industrial Engineering and Management, Technion, Haifa, Israel

Abstract

In recent years, the cross-entropy method has been successfully applied to a wide range of discrete optimization tasks. In this paper we consider the cross-entropy method in the context of continuous optimization. We demonstrate the effectiveness of the cross-entropy method for solving difficult continuous multi-extremal optimization problems, including those with non-linear constraints.

Key words: cross-entropy, continuous optimization, multi-extremal objective function, dynamic smoothing, constrained optimization, nonlinear constraints, acceptance-rejection, penalty function.

1 Introduction

The cross-entropy (CE) method (Rubinstein and Kroese, 2004) was motivated by Rubinstein (1997), where an adaptive variance minimization algorithm for estimating probabilities of rare events in stochastic networks was presented. It was modified in Rubinstein (1999) to solve combinatorial optimization problems. The main idea behind using CE for continuous multi-extremal optimization is the same as for combinatorial optimization, namely to first associate with each optimization problem a rare-event estimation problem, the so-called associated stochastic problem (ASP), and then to tackle this ASP efficiently by an adaptive algorithm. The principal outcome of this approach is the construction of a random sequence of solutions that converges probabilistically to an optimal or near-optimal solution.

Once the ASP is defined, the CE method involves the following two iterative phases:

1. Generation of a sample of random data (trajectories, vectors, etc.) according to a specified random mechanism.

2. Updating the parameters of the random mechanism, typically parameters of pdfs, on the basis of the data, to produce a "better" sample in the next iteration.

The CE method has been successfully applied to a variety of problems in combinatorial optimization and rare-event estimation, the latter with both light- and heavy-tailed distributions. Application areas include buffer allocation, queueing models of telecommunication systems, neural computation, control and navigation, DNA sequence alignment, signal processing, scheduling, vehicle routing, reinforcement learning, project management and reliability systems. References and more details on applications and theory can be found in the CE tutorial (de Boer et al., 2004) and the recent CE monograph (Rubinstein and Kroese, 2004). It is important to note that the CE method deals successfully with both deterministic problems, such as the traveling salesman problem, and noisy (i.e., simulation-based) problems, such as the buffer allocation problem.

The goal of this paper is to demonstrate the accurate performance of the CE method for solving difficult continuous multi-extremal problems, both constrained and unconstrained. In particular, we show numerically that it typically finds the optimal (or a near-optimal) solution very fast and provides more accurate results than those reported in the literature.

It is outside the scope of this paper to review the huge literature on analytical/deterministic methods for solving optimization problems. Many of these methods are based on gradient or pseudo-gradient techniques.
Examples are line search, gradient descent, Newton-type methods, variable metric and conjugate gradient methods. The main drawback of gradient-based methods is that, by their nature, they do not cope well with optimization problems that have non-convex objective functions and/or many local optima. Such multi-extremal continuous optimization problems arise abundantly in applications. A popular and convenient approach to this type of problem is to use random search techniques. The basic idea behind such methods is to systematically partition the feasible region into smaller subregions and then to move from one subregion to another based on information obtained by random search. Well-known examples include simulated annealing (Aarts and Korst, 1989), threshold acceptance (Dueck and Scheuer, 1990), genetic algorithms (Goldberg, 1989), tabu search (Glover and Laguna, 1993), the ant colony method (Dorigo et al., 1996), and the stochastic comparison method (Gong et al., 1992).

The rest of this paper is organized as follows. Section 2 presents an overview of the CE method. In Section 3 we give the main algorithm for continuous multi-extremal optimization using multi-dimensional normal sampling with independent components. Particular emphasis is put on the different updating procedures for the parameters of the normal pdf, the so-called fixed and dynamic smoothing. In Section 4 we present numerical results with the CE algorithm for both unconstrained and constrained programs. We demonstrate the high accuracy of CE using two approaches to constrained optimization: the acceptance-rejection approach and the penalty approach.

2 The Main CE Algorithm for Optimization

The main idea of the CE method for optimization can be stated as follows. Suppose we wish to maximize some "performance" function S(x) over all elements/states x in some set 𝒳. Let us denote the maximum by γ*, thus

    γ* = max_{x ∈ 𝒳} S(x).    (1)

To proceed with CE, we first randomize our deterministic problem by defining a family of pdfs {f(·; v), v ∈ 𝒱} on the set 𝒳. Next, we associate with (1) the estimation of

    ℓ(γ) = P_u(S(X) ≥ γ) = E_u I{S(X) ≥ γ},    (2)

the so-called associated stochastic problem (ASP). Here, X is a random vector with pdf f(·; u), for some u ∈ 𝒱 (for example, X could be a normal random vector), and γ is a known or unknown parameter. Note that there are in fact two possible estimation problems associated with (2): for a given γ we can estimate ℓ, or alternatively, for a given ℓ we can estimate γ, the root of (2). Let us consider the problem of estimating ℓ for a certain γ close to γ*. Then, typically, {S(X) ≥ γ} is a rare event, and estimation of ℓ is a non-trivial problem. The CE method solves this efficiently by making adaptive changes to the probability density function according to the Kullback-Leibler cross-entropy distance, thus creating a sequence f(·; u), f(·; v_1), f(·; v_2), … of pdfs that are "steered" in the direction of the theoretically optimal density f(·; v*), which corresponds to the density degenerate at an optimal point. In fact, the CE method generates a sequence of tuples {(γ_t, v_t)}, which converges quickly to a small neighborhood of the optimal tuple (γ*, v*).
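To see why the ASP (2) cannot simply be attacked by crude Monte Carlo, consider the following sketch; the performance function, sampling density and numbers are illustrative assumptions only, not part of the paper. When γ is close to γ*, the indicator I{S(X) ≥ γ} is almost never 1 under f(·; u), so the naive estimator of ℓ(γ) returns 0 with high probability; the adaptive updates described next are designed to overcome exactly this.

```python
# Crude Monte Carlo estimate of ell(gamma) = P_u(S(X) >= gamma), cf. (2).
# The performance function, sampling density and numbers are illustrative only.
import numpy as np

def crude_mc_ell(S, sample_u, gamma, N=10_000):
    X = sample_u(N)                                           # X_1, ..., X_N ~ f(.; u)
    return np.mean(np.apply_along_axis(S, 1, X) >= gamma)     # mean of the indicators

# For S(x) = -sum(x^2) with gamma* = 0, a gamma near 0 makes {S(X) >= gamma} a rare
# event under a wide normal density, so the estimate below is typically 0:
# print(crude_mc_ell(lambda x: -np.sum(x**2), lambda N: 5 * np.random.randn(N, 10), -0.1))
```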
More specifically, we initialize by setting v_0 = u, choosing a not very small quantity ϱ, say ϱ = 10^{-2}, and then we proceed as follows:

1. Adaptive updating of γ_t. For a fixed v_{t−1}, let γ_t be the (1 − ϱ)-quantile of S(X) under v_{t−1}. That is, γ_t satisfies

    P_{v_{t−1}}(S(X) ≥ γ_t) ≥ ϱ,    (3)
    P_{v_{t−1}}(S(X) ≤ γ_t) ≥ 1 − ϱ,    (4)

where X ∼ f(·; v_{t−1}). A simple estimator γ̂_t of γ_t can be obtained by drawing a random sample X_1, …, X_N from f(·; v̂_{t−1}) and evaluating the sample (1 − ϱ)-quantile of the performances as

    γ̂_t = S_(⌈(1−ϱ)N⌉),    (5)

where S_(j) denotes the j-th order statistic of the performances S(X_1), …, S(X_N).

2. Adaptive updating of v_t. For fixed γ_t and v_{t−1}, derive v_t from the solution of the program

    max_v D(v) = max_v E_{v_{t−1}} [ I{S(X) ≥ γ_t} ln f(X; v) ].    (6)

The stochastic counterpart of (6) is as follows: for fixed γ̂_t and v̂_{t−1} (the estimate of v_{t−1}), derive ṽ_t from the solution of the program

    max_v D̂(v) = max_v (1/N) Σ_{i=1}^{N} I{S(X_i) ≥ γ̂_t} ln f(X_i; v).    (7)

Remark 2.1 (Smoothed Updating). Instead of updating the parameter vector v directly via the solution of (7), we use the following smoothed version:

    v̂_{t,i} = α ṽ_{t,i} + (1 − α) v̂_{t−1,i},   i = 1, …, n,    (8)

where ṽ_t is the parameter vector obtained from the solution of (7) and α is called the smoothing parameter, with 0.7 < α ≤ 1. Clearly, for α = 1 we recover the original updating rule. The reason for using the smoothed version (8) instead of the original updating rule is twofold: (a) to smooth out the values of v̂_t, and (b) to reduce the probability that some component v̂_{t,i} of v̂_t will be zero or one in the first few iterations. This is particularly important when v̂_t is a vector or matrix of probabilities. Note that for 0 < α < 1 we always have v̂_{t,i} > 0, while for α = 1 one might have (even in the first iterations) that either v̂_{t,i} = 0 or v̂_{t,i} = 1 for some indices i. As a result, the algorithm will converge to a wrong solution.

Thus, the main CE optimization algorithm, which includes the smoothed updating of the parameter vector v, can be summarized as follows.

Algorithm 1 (Generic CE Algorithm for Optimization)

1. Choose some v̂_0. Set t = 1.

2. Generate a sample X_1, …, X_N from the density f(·; v̂_{t−1}) and compute the sample (1 − ϱ)-quantile γ̂_t of the performances according to (5).

3. Use the same sample X_1, …, X_N to solve the stochastic program (7). Denote the solution by ṽ_t.

4. Apply (8) to smooth out the vector ṽ_t.

5. Repeat steps 2-4, increasing t by 1 at each iteration, until a pre-specified stopping criterion is met.

3 Continuous Multi-Extremal Optimization

Here we apply CE Algorithm 1 to continuous multi-extremal optimization problems of the form (1), both (a) unconstrained and (b) constrained with non-linear boundaries. We assume henceforth that each x = (x_1, …, x_n) is a real-valued vector and that 𝒳 is a subset of ℝ^n.

(a) The Unconstrained Case. In this case generation of a random vector X = (X_1, …, X_n) ∈ 𝒳 is straightforward.
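To make the unconstrained case concrete, here is a minimal sketch of Algorithm 1 with a normal sampling density with independent components, where the program (7) is solved by taking the sample mean and sample standard deviation of the elite samples, followed by the smoothing (8). The objective function, dimensions and parameter values (N, ϱ, α, the stopping tolerance) are illustrative assumptions, not prescriptions from the paper.

```python
# Minimal sketch of Algorithm 1 for unconstrained continuous maximization with a
# normal sampling density with independent components. All concrete choices below
# (objective, N, rho, alpha, stopping tolerance) are illustrative assumptions.
import numpy as np

def ce_maximize(S, mu0, sigma0, N=100, rho=0.1, alpha=0.8, max_iters=200, tol=1e-6):
    mu, sigma = np.array(mu0, dtype=float), np.array(sigma0, dtype=float)
    n_elite = int(np.ceil(rho * N))                     # roughly the rho*N best performers
    for t in range(1, max_iters + 1):
        X = mu + sigma * np.random.randn(N, mu.size)    # step 2: sample from f(.; v_{t-1})
        perf = np.apply_along_axis(S, 1, X)
        elite = X[np.argsort(perf)[-n_elite:]]          # samples with S(X_i) >= gamma_t, cf. (5)
        mu_new, sigma_new = elite.mean(axis=0), elite.std(axis=0)  # step 3: solution of (7)
        mu = alpha * mu_new + (1 - alpha) * mu          # step 4: smoothing, eq. (8)
        sigma = alpha * sigma_new + (1 - alpha) * sigma
        if sigma.max() < tol:                           # step 5: stop when sampling degenerates
            break
    return mu

# Example usage on a simple multi-extremal test function (illustrative):
# S = lambda x: -np.sum(x**2) + 0.1 * np.sum(np.cos(5 * np.pi * x))
# print(ce_maximize(S, mu0=np.full(5, 3.0), sigma0=np.full(5, 10.0)))
```

In this sketch the means are drawn towards better solutions while the standard deviations shrink as the elite samples concentrate; the smoothing in (8) (and the dynamic variant discussed later in the paper) is what keeps the standard deviations from collapsing too early.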