International Journal of Control Vol. 81, No. 8, August 2008, 1232–1238
Stochastic optimisation with inequality constraints using simultaneous perturbations and penalty functions

I-Jeng Wang* and James C. Spall

The Johns Hopkins University Applied Physics Laboratory, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA

(Received 20 July 2006; final version received 2 August 2007)
We present a stochastic approximation algorithm based on a penalty function method and a simultaneous perturbation gradient estimate for solving stochastic optimisation problems with general inequality constraints. We present a general convergence result that applies to a class of penalty functions including the quadratic penalty function, the augmented Lagrangian, and the absolute penalty function. We also establish an asymptotic normality result for the algorithm with smooth penalty functions under minor assumptions. Numerical results are given to compare the performance of the proposed algorithm with different penalty functions.

Keywords: stochastic optimisation; inequality constraint; simultaneous perturbation
1. Introduction

In this paper, we consider a constrained stochastic optimisation problem for which only noisy measurements of the cost function are available. More specifically, we aim to solve the following optimisation problem:

    min_{θ ∈ G} L(θ),    (1)

where L: R^d → R is a real-valued cost function, θ ∈ R^d is the parameter vector, and G ⊂ R^d is the constraint set. We also assume that the gradient of L(·) exists and is denoted by g(·). We assume that there exists a unique solution θ* for the constrained optimisation problem defined by (1). We consider the situation where no explicit closed-form expression of the function L is available (or it is very complicated even if available), and the only information available consists of noisy measurements of L at specified values of the parameter vector θ. This scenario arises naturally in simulation-based optimisation, where the cost function L is defined as the expected value of a random cost associated with the stochastic simulation of a complex system. We also assume that significant costs (in terms of time and/or computational costs) are involved in obtaining each measurement (or sample) of L(·). These constraints prevent us from estimating the gradient (or Hessian) of L(·) accurately, and hence prohibit the application of effective non-linear programming techniques for inequality constraints, for example, the sequential quadratic programming methods; see, for example, Bertsekas (1995, § 4.3). Throughout the paper, we use θ_n to denote the nth estimate of the solution θ*.

Several results have been presented for constrained optimisation in the stochastic domain. In the area of stochastic approximation (SA), most of the available results are based on the simple idea of projecting the estimate θ_n back to its nearest point in G whenever θ_n lies outside the constraint set G. These projection-based SA algorithms are typically of the following form:

    θ_{n+1} = π_G[θ_n − a_n ĝ_n(θ_n)],    (2)

where π_G: R^d → G is the set projection operator and ĝ_n(θ_n) is an estimate of the gradient g(θ_n); see, for example, Dupuis and Kushner (1987); Kushner and Clark (1978); Kushner and Yin (1997); Sadegh (1997). The main difficulty for this projection approach lies in the implementation (calculation) of the projection operator π_G. Except for simple constraints such as interval or linear constraints, calculation of π_G(θ) for an arbitrary vector θ is a formidable task.

Other techniques for dealing with constraints have also been considered: Hiriart-Urruty (1977) and Pflug (1981) present and analyse an SA algorithm based on the penalty function method for stochastic optimisation of a convex function with convex inequality constraints; Kushner and Clark (1978) present several SA algorithms based on the Lagrange multiplier method, the penalty function method, and a combination of both. Most of these techniques rely on the
*Corresponding author. Email: [email protected]
ISSN 0020–7179 print/ISSN 1366–5820 online © 2008 Taylor & Francis DOI: 10.1080/00207170701611123 http://www.informaworld.com
Kiefer and Wolfowitz (1952) (KW) type of gradient estimate when the gradient of the cost function is not readily available. Furthermore, the convergence of these SA algorithms based on 'non-projection' techniques generally requires complicated assumptions on the cost function L and the constraint set G. In this paper, we present and study the convergence of a class of algorithms based on the penalty function methods and the simultaneous perturbation (SP) gradient estimate of Spall (1992). The advantage of the SP gradient estimate over the KW-type estimate for unconstrained optimisation has been demonstrated with the simultaneous perturbation stochastic approximation (SPSA) algorithms. And whenever possible, we present sufficient conditions (as remarks) that can be more easily verified than the much weaker conditions used in our convergence proofs.

We focus on general explicit inequality constraints, where G is defined by

    G ≜ {θ ∈ R^d : q_j(θ) ≤ 0, j = 1, …, s},    (3)

where q_j: R^d → R are continuously differentiable real-valued functions. We assume that the analytical expression of the functions q_j is available. We extend the result presented in Wang and Spall (1999) to incorporate a larger class of penalty functions based on the augmented Lagrangian method. We also establish the asymptotic normality for the proposed algorithm. Simulation results are presented to illustrate the performance of the technique for stochastic optimisation.

2. Constrained SPSA algorithms

2.1 Penalty functions

The basic idea of the penalty-function approach is to convert the originally constrained optimisation problem (1) into an unconstrained one defined by

    min_θ L_r(θ) ≜ L(θ) + r P(θ),    (4)

where P: R^d → R is the penalty function and r is a positive real number normally referred to as the penalty parameter. The penalty functions are defined such that P is an increasing function of the constraint functions q_j; P > 0 if and only if q_j > 0 for any j; P → ∞ as q_j → ∞; and P → l (l ≥ 0) as q_j → −∞. In this paper, we consider a penalty function method based on the augmented Lagrangian function defined by

    L_r(θ, λ) = L(θ) + (1/2r) Σ_{j=1}^{s} ( [max{0, λ_j + r q_j(θ)}]^2 − λ_j^2 ),    (5)

where λ ∈ R^s can be viewed as an estimate of the Lagrange multiplier vector. The associated penalty function is

    P(θ) = (1/2r^2) Σ_{j=1}^{s} ( [max{0, λ_j + r q_j(θ)}]^2 − λ_j^2 ).    (6)

Let {r_n} be a positive and strictly increasing sequence with r_n → ∞ and {λ_n} be a bounded non-negative sequence in R^s. It can be shown (see, for example, Bertsekas (1995, § 4.2)) that the minimum of the sequence of functions {L_n}, defined by

    L_n(θ) ≜ L_{r_n}(θ, λ_n),

converges to the solution of the original constrained problem (1). Since the penalised cost function (or the augmented Lagrangian) (5) is a differentiable function of θ, we can apply the standard stochastic approximation technique with the SP gradient estimate for L to minimise {L_n(θ)}. In other words, the original problem can be solved with an algorithm of the following form:

    θ_{n+1} = θ_n − a_n ∇̂L_n(θ_n)
            = θ_n − a_n ĝ_n − a_n r_n ∇P(θ_n),

where ĝ_n is the SP estimate of the gradient g(θ) at θ_n that we shall specify later. Note that since we assume the constraints are explicitly given, the gradient of the penalty function P(·) is directly used in the algorithm.

Note that when λ_n = 0, the penalty function defined by (6) reduces to the standard quadratic penalty function discussed in Wang and Spall (1999):

    L_r(θ, 0) = L(θ) + (r/2) Σ_{j=1}^{s} [max{0, q_j(θ)}]^2.

Even though the convergence of the proposed algorithm only requires that {λ_n} be bounded (hence we can set λ_n = 0), we can significantly improve the performance of the algorithm with an appropriate choice of the sequence based on concepts from Lagrange multiplier theory. Moreover, it has been shown (Bertsekas 1995) that, with the standard quadratic penalty function, the penalised cost function L_n = L + r_n P can become ill-conditioned as r_n increases (that is, the condition number of the Hessian matrix of L_n at θ_n* diverges to ∞ with r_n). The use of the general penalty function defined in (6) can prevent this difficulty if {λ_n} is chosen so that it is close to the true Lagrange multipliers. In § 4, we will present an iterative method based on the method of multipliers (see, for example, Bertsekas (1982)) to update λ_n and compare its performance with the standard quadratic penalty function.

2.2 An SPSA algorithm for inequality constraints

In this section, we present the specific form of the algorithm for solving the constrained stochastic optimisation problem. The algorithm we consider is defined by

    θ_{n+1} = θ_n − a_n ĝ_n(θ_n) − a_n r_n ∇P(θ_n),    (7)

where ĝ_n(θ_n) is an estimate of the gradient of L, g(·), at θ_n, {r_n} is an increasing sequence of positive scalars with lim_{n→∞} r_n = ∞, ∇P(θ) is the gradient of P(θ) at θ, and {a_n} is a positive scalar sequence satisfying a_n → 0 and Σ_{n=1}^{∞} a_n = ∞. The gradient estimate ĝ_n is obtained from two noisy measurements of the cost function L by

    ĝ_n(θ_n) = [ (L(θ_n + c_n Δ_n) + ε_n^+) − (L(θ_n − c_n Δ_n) + ε_n^−) ] / (2 c_n) · (1/Δ_n),    (8)

where Δ_n ∈ R^d is a random perturbation vector, c_n → 0 is a positive sequence, ε_n^+ and ε_n^− are the noise in the measurements, and 1/Δ_n denotes the vector (1/Δ_n^1, …, 1/Δ_n^d). For analysis, we rewrite the algorithm (7) as

    θ_{n+1} = θ_n − a_n g(θ_n) − a_n r_n ∇P(θ_n) + a_n d_n − a_n (ε_n/2c_n)(1/Δ_n),    (9)

where d_n and ε_n are defined by

    d_n ≜ g(θ_n) − [ L(θ_n + c_n Δ_n) − L(θ_n − c_n Δ_n) ] / (2 c_n) · (1/Δ_n),    ε_n ≜ ε_n^+ − ε_n^−,

respectively.

We establish the convergence of the algorithm (7) and the associated asymptotic normality under appropriate assumptions in the next section.

3. Convergence and asymptotic normality

3.1 Convergence theorem

To establish convergence of the algorithm (7), we need to study the asymptotic behaviour of an SA algorithm with a 'time-varying' regression function. In other words, we need to consider the convergence of an SA algorithm of the following form:

    θ_{n+1} = θ_n − a_n f_n(θ_n) + a_n d_n + a_n e_n,    (10)

where {f_n(·)} is a sequence of functions. We state here without proof a version of the convergence theorem given by Spall and Cristion (1998) for an algorithm in the generic form (10).

Theorem 1: Assume the following conditions hold.

(A.1) For each n large enough (n ≥ N for some N ∈ N), there exists a unique θ_n* such that f_n(θ_n*) = 0. Furthermore, lim_{n→∞} θ_n* = θ*.

(A.2) d_n → 0, and Σ_{k=1}^{n} a_k e_k converges.

(A.3) For some N < ∞, any ρ > 0, and for each n ≥ N, if ‖θ − θ*‖ ≥ ρ, then there exists a δ_n(ρ) > 0 such that (θ − θ*)^T f_n(θ) ≥ δ_n(ρ) ‖θ − θ*‖, where δ_n(ρ) satisfies Σ_{n=1}^{∞} a_n δ_n(ρ) = ∞ and d_n δ_n(ρ)^{−1} → 0.

(A.4) For each i = 1, 2, …, d, and any ρ > 0, if |θ_{ni} − (θ*)_i| > ρ eventually, then either f_{ni}(θ_n) ≥ 0 eventually or f_{ni}(θ_n) < 0 eventually, where we use (θ*)_i to denote the ith element of θ*.

(A.5) For any τ > 0 and non-empty S ⊂ {1, 2, …, d}, there exists a ρ′(τ, S) > τ such that for all θ ∈ {θ ∈ R^d : |(θ − θ*)_i| < τ when i ∉ S, |(θ − θ*)_i| ≥ ρ′(τ, S) when i ∈ S},

    lim sup_{n→∞} | Σ_{i ∉ S} (θ − θ*)_i f_{ni}(θ) / Σ_{i ∈ S} (θ − θ*)_i f_{ni}(θ) | < 1.

Then the sequence {θ_n} defined by the algorithm (10) converges to θ*.

Based on Theorem 1, we give a convergence result for algorithm (7) by substituting ∇L_n(θ_n) = g(θ_n) + r_n ∇P(θ_n), d_n, and −(ε_n/2c_n)(1/Δ_n) into f_n(θ_n), d_n, and e_n in (10), respectively. We need the following assumptions.

(C.1) There exists K_1 ∈ N such that for all n ≥ K_1 we have a unique θ_n* ∈ R^d with ∇L_n(θ_n*) = 0.

(C.2) {Δ_{ni}} are i.i.d. and symmetrically distributed about 0, with |Δ_{ni}| ≤ α_0 a.s. and E|Δ_{ni}^{−1}| ≤ α_1.

(C.3) Σ_{k=1}^{∞} (a_k ε_k / 2c_k)(1/Δ_k) converges almost surely.

(C.4) If ‖θ − θ*‖ ≥ ρ, then there exists a δ(ρ) > 0 such that
  (i) if θ ∈ G, then (θ − θ*)^T g(θ) ≥ δ(ρ) ‖θ − θ*‖ > 0;
  (ii) if θ ∉ G, then at least one of the following two conditions holds:
    • (θ − θ*)^T g(θ) ≥ δ(ρ) ‖θ − θ*‖ and (θ − θ*)^T ∇P(θ) ≥ 0;
    • (θ − θ*)^T g(θ) ≥ −M for some constant M > 0 and (θ − θ*)^T ∇P(θ) ≥ δ(ρ) ‖θ − θ*‖ > 0.

(C.5) a_n r_n → 0; g(·) and ∇P(·) are Lipschitz (see comments below).

(C.6) ∇L_n(·) satisfies condition (A.5).

Theorem 2: Suppose that assumptions (C.1)–(C.6) hold. Then the sequence {θ_n} defined by (7) converges to θ* almost surely.

Proof: We only need to verify the conditions (A.1)–(A.5) in Theorem 1 to show the desired result.

• Condition (A.1) basically requires that the stationary points of the sequence {∇L_n(·)} converge to θ*. Assumption (C.1), together with existing results on penalty function methods, establishes this desired convergence.

• From the results in Spall (1992), Wang (1996), and assumptions (C.2)–(C.3), we can show that condition (A.2) holds.

• Since r_n → ∞, condition (A.3) holds from assumption (C.4).

• From (9), assumptions (C.1) and (C.5), we have

the algorithm visits any zero-measure set with zero probability. Assuming that the set D ≜ {θ ∈ R^d : ∇P(θ) does not exist} has Lebesgue measure 0 and the random perturbation Δ_n follows a Bernoulli distribution (P(Δ_n^i = 1) = P(Δ_n^i = −1) = 1/2), we can construct a simple proof to show that
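As a concrete illustration, the augmented-Lagrangian penalty of (6) and its gradient can be evaluated directly from the constraint functions q_j, since differentiating (6) gives ∇P(θ) = (1/r) Σ_j max{0, λ_j + r q_j(θ)} ∇q_j(θ). The following Python sketch is our own illustration (not code from the paper; the function name and calling convention are assumptions), written for constraints supplied as callables:

```python
import numpy as np

def penalty_and_gradient(theta, lam, r, q_funcs, q_grads):
    """Augmented-Lagrangian penalty P(theta) of (6) and its gradient,
    for constraints q_j(theta) <= 0 with multiplier estimates lam and
    penalty parameter r.  (Illustrative helper, not from the paper.)"""
    q = np.array([qj(theta) for qj in q_funcs])
    m = np.maximum(0.0, lam + r * q)            # max{0, lambda_j + r q_j(theta)}
    P = (np.sum(m ** 2) - np.sum(lam ** 2)) / (2.0 * r ** 2)
    # gradient of P: (1/r) sum_j max{0, lambda_j + r q_j} grad q_j
    gradP = sum(mj * gj(theta) for mj, gj in zip(m, q_grads)) / r
    return P, gradP
```

With lam = 0 this recovers the quadratic-penalty special case, since then r·P(θ) = (r/2) Σ_j max{0, q_j(θ)}².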
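The two-measurement SP gradient estimate of (8) can be sketched as follows; this is our own minimal illustration (names are ours), using the Bernoulli ±1 perturbation distribution that satisfies condition (C.2):

```python
import numpy as np

def sp_gradient(L, theta, c_n, rng):
    """Simultaneous perturbation gradient estimate, as in (8).
    L returns a (possibly noisy) scalar measurement of the cost at a point;
    all d components of the estimate share the same two measurements."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)   # Bernoulli +/-1 perturbation
    y_plus = L(theta + c_n * delta)
    y_minus = L(theta - c_n * delta)
    return (y_plus - y_minus) / (2.0 * c_n) * (1.0 / delta)
```

For a scalar quadratic L(θ) = θ² the estimate is exact regardless of the realised perturbation, which makes a convenient sanity check.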
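Putting the pieces together, algorithm (7) can be sketched as below. This is a hedged illustration, not the paper's implementation: the gain sequences a_n, c_n, r_n are example choices consistent with the stated requirements (a_n → 0, Σ a_n = ∞, r_n increasing to ∞, and a_n r_n → 0 as in (C.5)), and we fix λ_n = 0, i.e. the quadratic-penalty special case.

```python
import numpy as np

def constrained_spsa(L, q_funcs, q_grads, theta0, n_iter=3000, seed=1):
    """Sketch of algorithm (7):
        theta_{n+1} = theta_n - a_n ghat_n(theta_n) - a_n r_n gradP(theta_n).
    L returns a (possibly noisy) measurement of the cost; q_j(theta) <= 0 are
    the constraints.  Gain choices are illustrative, not tuned values."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    lam = np.zeros(len(q_funcs))                 # lambda_n = 0: quadratic penalty
    for n in range(1, n_iter + 1):
        a_n = 1.0 / (n + 10)                     # a_n -> 0, sum a_n = infinity
        c_n = 0.1 / n ** 0.25                    # c_n -> 0
        r_n = n ** 0.5                           # r_n -> infinity, a_n r_n -> 0
        # SP gradient estimate (8) from two measurements
        delta = rng.choice([-1.0, 1.0], size=theta.size)
        ghat = (L(theta + c_n * delta) - L(theta - c_n * delta)) / (2 * c_n) / delta
        # gradient of the penalty (6): (1/r) sum_j max{0, lam_j + r q_j} grad q_j
        m = np.maximum(0.0, lam + r_n * np.array([qj(theta) for qj in q_funcs]))
        gradP = sum(mj * gj(theta) for mj, gj in zip(m, q_grads)) / r_n
        theta = theta - a_n * ghat - a_n * r_n * gradP   # update (7)
    return theta
```

For instance, minimising L(θ) = ‖θ‖² subject to θ_1 ≥ 1 (i.e. q(θ) = 1 − θ_1 ≤ 0) drives the iterate toward θ* = (1, 0); with a finite r_n the penalty solution sits slightly inside the infeasible side, at θ_1 ≈ r_n/(2 + r_n), which approaches 1 as r_n grows.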