
Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

I-Jeng Wang* and James C. Spall**
The Johns Hopkins University Applied Physics Laboratory
11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA

This work was supported by the JHU/APL Independent Research and Development Program.
*Phone: 240-228-6204; E-mail: [email protected].
**Phone: 240-228-4960; E-mail: [email protected].

Abstract— We present a stochastic approximation algorithm based on the penalty function method and a simultaneous perturbation gradient estimate for solving stochastic optimization problems with general inequality constraints. We present a general convergence result that applies to a class of penalty functions including the quadratic penalty function, the augmented Lagrangian, and the absolute value penalty function. We also establish an asymptotic normality result for the algorithm with smooth penalty functions under minor assumptions. Numerical results are given to compare the performance of the proposed algorithm with different penalty functions.

I. INTRODUCTION

In this paper, we consider a constrained stochastic optimization problem for which only noisy measurements of the cost function are available. More specifically, we aim to solve the following optimization problem:

  min_{θ∈G} L(θ),    (1)

where L: R^d → R is a real-valued cost function, θ ∈ R^d is the parameter vector, and G ⊂ R^d is the constraint set. We also assume that the gradient of L(·) exists and is denoted by g(·). We assume that there exists a unique solution θ* for the constrained optimization problem defined by (1). We consider the situation where no explicit closed-form expression of the function L is available (or it is very complicated even if available), and the only information available consists of noisy measurements of L at specified values of the parameter vector θ. This scenario arises naturally in simulation-based optimization, where the cost function L is defined as the expected value of a random cost associated with the stochastic simulation of a complex system. We also assume that significant costs (in terms of time and/or computation) are involved in obtaining each measurement (or sample) of L(θ). These constraints prevent us from estimating the gradient (or Hessian) of L(·) accurately, and hence prohibit the application of effective techniques for inequality constraints, for example, the sequential quadratic programming methods (see, for example, Section 4.3 of [1]). Throughout the paper we use θ_n to denote the nth estimate of the solution θ*.

Several results have been presented for constrained optimization in the stochastic domain. In the area of stochastic approximation (SA), most of the available results are based on the simple idea of projecting the estimate θ_n back to its nearest point in G whenever θ_n lies outside the constraint set G. These projection-based SA algorithms are typically of the following form:

  θ_{n+1} = π_G[θ_n − a_n ĝ_n(θ_n)],    (2)

where π_G: R^d → G is the set projection operator and ĝ_n(θ_n) is an estimate of the gradient g(θ_n); see, for example, [2], [3], [5], [6]. The main difficulty of this projection approach lies in the implementation (calculation) of the projection operator π_G. Except for simple constraints such as interval or linear constraints, the calculation of π_G(θ) for an arbitrary vector θ is a formidable task.

Other techniques for dealing with constraints have also been considered: Hiriart-Urruty [7] and Pflug [8] present and analyze SA algorithms based on the penalty function method for stochastic optimization of a convex function with convex inequality constraints; Kushner and Clark [3] present several SA algorithms based on the Lagrange multiplier method, the penalty function method, and a combination of both. Most of these techniques rely on the Kiefer-Wolfowitz (KW) [4] type of gradient estimate when the gradient of the cost function is not readily available. Furthermore, the convergence of these SA algorithms based on "non-projection" techniques generally requires complicated assumptions on the cost function L and the constraint set G. In this paper, we present and study the convergence of a class of algorithms based on the penalty function method and the simultaneous perturbation (SP) gradient estimate [9].
The advantage of the SP gradient estimate over the KW-type estimate for unconstrained optimization has been demonstrated with the simultaneous perturbation stochastic approximation (SPSA) algorithms. Whenever possible, we present sufficient conditions (as remarks) that can be more easily verified than the much weaker conditions used in our convergence proofs.

We focus on general explicit inequality constraints, where G is defined by

  G ≜ {θ ∈ R^d : q_j(θ) ≤ 0, j = 1, ..., s},    (3)

where q_j: R^d → R are continuously differentiable real-valued functions. We assume that the analytical expressions of the functions q_j are available. We extend the result presented in [10] to incorporate a larger class of penalty functions based on the augmented Lagrangian method. We also establish the asymptotic normality for the proposed algorithm. Simulation results are presented to illustrate the performance of the technique for stochastic optimization.

II. CONSTRAINED SPSA ALGORITHMS

A. Penalty Functions

The basic idea of the penalty-function approach is to convert the originally constrained optimization problem (1) into an unconstrained one defined by

  min_θ L_r(θ) ≜ L(θ) + rP(θ),    (4)

where P: R^d → R is the penalty function and r is a positive real number normally referred to as the penalty parameter. The penalty functions are defined such that P is an increasing function of the constraint functions q_j; P > 0 if and only if q_j > 0 for some j; P → ∞ as q_j → ∞; and P → −l (l ≥ 0) as q_j → −∞. In this paper, we consider a penalty function method based on the augmented Lagrangian function defined by

  L_r(θ, λ) = L(θ) + (1/2r) ∑_{j=1}^{s} { [max{0, λ_j + r q_j(θ)}]² − λ_j² },    (5)

where λ ∈ R^s can be viewed as an estimate of the Lagrange multiplier vector. The associated penalty function is

  P(θ) = (1/2r²) ∑_{j=1}^{s} { [max{0, λ_j + r q_j(θ)}]² − λ_j² }.    (6)

Let {r_n} be a positive and strictly increasing sequence with r_n → ∞, and let {λ_n} be a bounded nonnegative sequence in R^s. It can be shown (see, for example, Section 4.2 of [1]) that the minimum of the sequence of functions {L_n}, defined by

  L_n(θ) ≜ L_{r_n}(θ, λ_n),

converges to the solution of the original constrained problem (1). Since the penalized cost function (or the augmented Lagrangian) (5) is a differentiable function of θ, we can apply the standard stochastic approximation technique with the SP gradient estimate of L to minimize {L_n(·)}. In other words, the original problem can be solved with an algorithm of the following form:

  θ_{n+1} = θ_n − a_n ∇̂L_n(θ_n) = θ_n − a_n ĝ_n − a_n r_n ∇P(θ_n),

where ĝ_n is the SP estimate of the gradient g(·) at θ_n that we shall specify later. Note that since we assume the constraints are explicitly given, the gradient of the penalty function P(·) is used directly in the algorithm.

Note that when λ_n = 0, the penalty function defined by (6) reduces to the standard quadratic penalty function discussed in [10],

  L_r(θ, 0) = L(θ) + r ∑_{j=1}^{s} [max{0, q_j(θ)}]².

Even though the convergence of the proposed algorithm only requires that {λ_n} be bounded (hence we can set λ_n = 0), we can significantly improve the performance of the algorithm with an appropriate choice of the sequence, based on concepts from Lagrange multiplier theory. Moreover, it has been shown [1] that, with the standard quadratic penalty function, the penalized cost function L_n = L + r_n P can become ill-conditioned as r_n increases (that is, the condition number of the Hessian matrix of L_n at θ_n* diverges to ∞ with r_n). The use of the general penalty function defined in (6) can prevent this difficulty if {λ_n} is chosen so that it is close to the true Lagrange multipliers. In Section IV, we will present an iterative method based on the method of multipliers (see, for example, [11]) to update λ_n, and we will compare its performance with that of the standard quadratic penalty function.
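To make the augmented Lagrangian penalty (6) and its gradient concrete, the following minimal Python sketch (our own illustration, not from the paper; the function names and signatures are assumptions) evaluates P(θ) and ∇P(θ) for explicitly given constraints:

```python
import numpy as np

def aug_lagrangian_penalty(theta, q_funcs, grad_q_funcs, lam, r):
    """Sketch of the penalty term (6) and its gradient for given q_j, grad q_j.

    lam : current Lagrange multiplier estimate, shape (s,); r : penalty parameter.
    Returns (P(theta), grad P(theta)).
    """
    P = 0.0
    gradP = np.zeros_like(theta, dtype=float)
    for j, (q, dq) in enumerate(zip(q_funcs, grad_q_funcs)):
        m = max(0.0, lam[j] + r * q(theta))      # max{0, lambda_j + r q_j(theta)}
        P += (m**2 - lam[j]**2) / (2.0 * r**2)
        gradP += (m / r) * dq(theta)             # consistent with (14) in Section IV
    return P, gradP
```

Setting lam to the zero vector recovers the standard quadratic penalty discussed above.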
B. An SPSA Algorithm for Inequality Constraints

In this section, we present the specific form of the algorithm for solving the constrained stochastic optimization problem. The algorithm we consider is defined by

  θ_{n+1} = θ_n − a_n ĝ_n(θ_n) − a_n r_n ∇P(θ_n),    (7)

where ĝ_n(θ_n) is an estimate of the gradient g(·) of L at θ_n, {r_n} is an increasing sequence of positive scalars with lim_{n→∞} r_n = ∞, ∇P(θ) is the gradient of P(θ) at θ, and {a_n} is a positive scalar sequence satisfying a_n → 0 and ∑_{n=1}^{∞} a_n = ∞. The gradient estimate ĝ_n is obtained from two noisy measurements of the cost function L by

  ĝ_n(θ_n) = [ (L(θ_n + c_n ∆_n) + ε_n^+) − (L(θ_n − c_n ∆_n) + ε_n^-) ] / (2 c_n) · (1/∆_n),    (8)

where ∆_n ∈ R^d is a random perturbation vector, {c_n} is a positive sequence with c_n → 0, ε_n^+ and ε_n^- are the noises in the measurements, and 1/∆_n denotes the vector [1/∆_n^1, ..., 1/∆_n^d]^T. For analysis, we rewrite the algorithm (7) as

  θ_{n+1} = θ_n − a_n g(θ_n) − a_n r_n ∇P(θ_n) + a_n d_n − a_n ε_n/(2 c_n ∆_n),    (9)

where d_n and ε_n are defined by

  d_n ≜ g(θ_n) − [L(θ_n + c_n ∆_n) − L(θ_n − c_n ∆_n)] / (2 c_n ∆_n),
  ε_n ≜ ε_n^+ − ε_n^-,

respectively. We establish the convergence of the algorithm (7) and the associated asymptotic normality under appropriate assumptions in the next section.
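For concreteness, one iteration of (7) with the SP gradient estimate (8) can be sketched in Python as follows (our own illustration; the Bernoulli ±1 perturbations and the gain forms a_n = a/(n+A)^α, c_n = c/n^γ, r_n = r₀ n^η anticipate Sections III and IV, and the helper names measure_L and penalty_grad are assumptions):

```python
import numpy as np

def constrained_spsa_step(theta, n, measure_L, penalty_grad, rng,
                          a=0.1, A=100.0, alpha=0.602, c=1.0, gamma=0.101,
                          r0=10.0, eta=0.1):
    """One iteration of (7)-(8). measure_L(theta) returns a noisy sample of
    L(theta); penalty_grad(theta, n) returns grad P(theta) for the chosen penalty."""
    a_n = a / (n + A) ** alpha                # step size: a_n -> 0, sum a_n = inf
    c_n = c / n ** gamma                      # perturbation size: c_n -> 0
    r_n = r0 * n ** eta                       # increasing penalty parameter
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # symmetric, bounded inverse
    y_plus = measure_L(theta + c_n * delta)   # two noisy measurements only
    y_minus = measure_L(theta - c_n * delta)
    g_hat = (y_plus - y_minus) / (2.0 * c_n * delta)   # SP estimate (8)
    return theta - a_n * g_hat - a_n * r_n * penalty_grad(theta, n)
```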
III. CONVERGENCE AND ASYMPTOTIC NORMALITY

A. Convergence Theorem

To establish convergence of the algorithm (7), we need to study the asymptotic behavior of an SA algorithm with a "time-varying" regression function. In other words, we need to consider the convergence of an SA algorithm of the following form:

  θ_{n+1} = θ_n − a_n f_n(θ_n) + a_n d_n + a_n e_n,    (10)

where {f_n(·)} is a sequence of functions. We state here without proof a version of the convergence theorem given by Spall and Cristion in [13] for an algorithm in the generic form (10).

Theorem 1: Assume the following conditions hold:

(A.1) For each n large enough (n ≥ N for some N ∈ N), there exists a unique θ_n* such that f_n(θ_n*) = 0. Furthermore, lim_{n→∞} θ_n* = θ*.
(A.2) d_n → 0, and ∑_{k=1}^{n} a_k e_k converges.
(A.3) For some N < ∞, any ρ > 0, and for each n ≥ N, if ‖θ − θ*‖ ≥ ρ, then there exists a δ_n(ρ) > 0 such that (θ − θ*)^T f_n(θ) ≥ δ_n(ρ)‖θ − θ*‖, where δ_n(ρ) satisfies ∑_{n=1}^{∞} a_n δ_n(ρ) = ∞ and d_n δ_n(ρ)^{−1} → 0.
(A.4) For each i = 1, 2, ..., d, and any ρ > 0, if |θ_{ni} − (θ*)_i| > ρ eventually, then either f_{ni}(θ_n) ≥ 0 eventually or f_{ni}(θ_n) < 0 eventually.
(A.5) For any τ > 0 and nonempty S ⊂ {1, 2, ..., d}, there exists a ρ′(τ, S) > τ such that for all θ ∈ {θ ∈ R^d : |(θ − θ*)_i| < τ when i ∉ S, |(θ − θ*)_i| ≥ ρ′(τ, S) when i ∈ S},

  limsup_{n→∞} | ∑_{i∉S} (θ − θ*)_i f_{ni}(θ) | / | ∑_{i∈S} (θ − θ*)_i f_{ni}(θ) | < 1.

Then the sequence {θ_n} defined by the algorithm (10) converges to θ*.

Based on Theorem 1, we give a convergence result for algorithm (7) by substituting ∇L_n(θ_n) = g(θ_n) + r_n ∇P(θ_n), d_n, and ε_n/(2 c_n ∆_n) for f_n(θ_n), d_n, and e_n in (10), respectively. We need the following assumptions:

(C.1) There exists K_1 ∈ N such that for all n ≥ K_1, we have a unique θ_n* ∈ R^d with ∇L_n(θ_n*) = 0.
(C.2) The {∆_{ni}} are i.i.d. and symmetrically distributed about 0, with |∆_{ni}| ≤ α_0 a.s. and E|∆_{ni}^{−1}| ≤ α_1.
(C.3) ∑_{k=1}^{n} a_k ε_k/(2 c_k ∆_k) converges almost surely.
(C.4) If ‖θ − θ*‖ ≥ ρ, then there exists a δ(ρ) > 0 such that
  (i) if θ ∈ G, then (θ − θ*)^T g(θ) ≥ δ(ρ)‖θ − θ*‖ > 0;
  (ii) if θ ∉ G, then at least one of the following two conditions holds:
    • (θ − θ*)^T g(θ) ≥ δ(ρ)‖θ − θ*‖ and (θ − θ*)^T ∇P(θ) ≥ 0;
    • (θ − θ*)^T g(θ) ≥ −M and (θ − θ*)^T ∇P(θ) ≥ δ(ρ)‖θ − θ*‖ > 0.
(C.5) a_n r_n → 0; g(·) and ∇P(·) are Lipschitz. (See comments below.)
(C.6) ∇L_n(·) satisfies condition (A.5).

Theorem 2: Suppose that assumptions (C.1)–(C.6) hold. Then the sequence {θ_n} defined by (7) converges to θ* almost surely.

Proof: We only need to verify conditions (A.1)–(A.5) in Theorem 1 to show the desired result:

• Condition (A.1) basically requires that the stationary points of the sequence {∇L_n(·)} converge to θ*. Assumption (C.1), together with existing results on penalty function methods, establishes this desired convergence.
• From the results in [9], [14] and assumptions (C.2)–(C.3), we can show that condition (A.2) holds.
• Since r_n → ∞, condition (A.3) holds from assumption (C.4).
• From (9) and assumptions (C.1) and (C.5), we have

  |(θ_{n+1} − θ_n)_i| < |(θ_n − θ*)_i|

for large n if |θ_{ni} − (θ*)_i| > ρ. Hence for large n, the sequence {θ_{ni}} does not "jump" over the interval between (θ*)_i and θ_{ni}. Therefore, if |θ_{ni} − (θ*)_i| > ρ eventually, then the sequence {f_{ni}(θ_n)} does not change sign eventually. That is, condition (A.4) holds.
• Assumption (A.5) holds directly from (C.6).

Theorem 2 given above is general in the sense that it does not specify the exact type of penalty function P(·) to adopt. In particular, assumption (C.4) seems difficult to satisfy. In fact, assumption (C.4) is fairly weak and does address the limitation of penalty-function-based algorithms. For example, suppose that a constraint function q_k(·) has a local minimum at θ′ with q_k(θ′) > 0. Then for every θ with q_j(θ) ≤ 0, j ≠ k, we have (θ − θ′)^T ∇P(θ) > 0 whenever θ is close enough to θ′. As r_n gets larger, the term ∇P(θ) would dominate the behavior of the algorithm and could result in convergence to θ′, a wrong solution. We would also like to point out that assumption (C.4) is satisfied if the cost function L and the constraint functions q_j, j = 1, ..., s, are convex and satisfy the Slater condition, that is, the minimum cost value L(θ*) is finite and there exists a θ ∈ R^d such that q_j(θ) < 0 for all j (this is the case studied in [8]).

Assumption (C.6) ensures that for n sufficiently large each element of g(θ) + r_n ∇P(θ) makes a non-negligible contribution to products of the form (θ − θ*)^T (g(θ) + r_n ∇P(θ)) when (θ − θ*)_i ≠ 0. A sufficient condition for (C.6) is that, for each i, g_i(θ) + r_n ∇_i P(θ) be uniformly bounded both away from 0 and away from ∞ when |(θ − θ*)_i| ≥ ρ > 0 for all i.
Theorem 2 in the stated form does require that the penalty function P be differentiable. However, it is possible to extend the stated results to the case where P is Lipschitz but not differentiable on a set of points with zero measure, for example, the absolute value penalty function

  P(θ) = max_{j=1,...,s} max{0, q_j(θ)}.

In the case where the density function of the measurement noise (ε_n^+ and ε_n^- in (8)) exists and has infinite support, we can take advantage of the fact that the iterates of the algorithm visit any zero-measure set with zero probability. Assuming that the set D ≜ {θ ∈ R^d : ∇P(θ) does not exist} has Lebesgue measure 0 and that the random perturbation ∆_n follows a Bernoulli distribution (P(∆_n^i = 1) = P(∆_n^i = −1) = 1/2), we can construct a simple proof to show that

  P{θ_n ∈ D i.o.} = 0

if P{θ_0 ∈ D} = 0. Therefore, the convergence result in Theorem 2 applies to penalty functions that are non-smooth on a set with measure zero. Hence in any practical application, we can simply ignore this technical difficulty and use

  ∇P(θ) = max{0, q_{J(θ)}(θ)} ∇q_{J(θ)}(θ),

where J(θ) = argmax_{j=1,...,s} q_j(θ) (note that J(θ) is uniquely defined for θ ∉ D). An alternative approach to handling this technical difficulty is to apply the SP gradient estimate directly to the penalized cost L(θ) + rP(θ) and adopt the convergence analysis presented in [15] for nondifferentiable optimization with additional convexity assumptions.
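A minimal Python sketch (our own code) of the almost-everywhere gradient formula above; it follows the paper's expression literally, and ties, which occur only on the zero-measure set D, are resolved arbitrarily by NumPy's argmax:

```python
import numpy as np

def abs_penalty_grad(theta, q_funcs, grad_q_funcs):
    """A.e. gradient of P(theta) = max_j max{0, q_j(theta)}, following the
    expression above; points of non-differentiability (the set D) are ignored."""
    q_vals = np.array([q(theta) for q in q_funcs])
    J = int(np.argmax(q_vals))                 # J(theta) = argmax_j q_j(theta)
    return max(0.0, q_vals[J]) * grad_q_funcs[J](theta)
```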
Use of non-differentiable penalty functions might allow us to avoid the difficulty of ill-conditioning as r_n → ∞ without using the more complicated penalty function methods, such as the augmented Lagrangian method used here. The rationale is that there exists a constant r̄ = ∑_{j=1}^{s} λ_j* (where λ_j* is the Lagrange multiplier associated with the jth constraint) such that the minimum of L + rP is identical to the solution of the original constrained problem for all r > r̄, based on the theory of exact penalties (see, for example, Section 4.3 of [1]). This property of the absolute value penalty function allows us to use a constant penalty parameter r > r̄ (instead of r_n → ∞) to avoid the issue of ill-conditioning. However, it is difficult to obtain a good estimate of r̄ in our situation, where the analytical expression of g(·) (the gradient of the cost function L(·)) is not available. Moreover, it is not clear that the application of exact penalty functions with r_n → ∞ would lead to better performance than the augmented Lagrangian based technique. In Section IV we will also illustrate (via numerical results) the potentially poor performance of the algorithm with an arbitrarily chosen large r.

B. Asymptotic Normality

When differentiable penalty functions are used, we can establish the asymptotic normality for the proposed algorithms. In the case where q_j(θ*) < 0 for all j = 1, ..., s (that is, there is no active constraint at θ*), the asymptotic behavior of the algorithm is exactly the same as that of the unconstrained SPSA algorithm and has been established in [9]. Here we consider the case where at least one of the constraints is active at θ*, that is, the set A ≜ {j = 1, ..., s : q_j(θ*) = 0} is not empty. We establish the asymptotic normality for the algorithm with smooth penalty functions of the form

  P(θ) = ∑_{j=1}^{s} p_j(q_j(θ)),

which includes both the quadratic penalty and augmented Lagrangian functions.

Assume further that E[ε_n | F_n, ∆_n] = 0 a.s., E[ε_n² | F_n] → σ² a.s., E[(∆_n^i)²] → ρ², and E[(∆_n^i)^{−2}] → ξ², where F_n is the σ-algebra generated by θ_1, ..., θ_n. Let H(θ) denote the Hessian matrix of L(θ) and

  H_p(θ) = ∑_{j∈A} ∇²(p_j(q_j(θ))).

The next proposition establishes the asymptotic normality for the proposed algorithm with the following choice of parameters: a_n = a n^{−α}, c_n = c n^{−γ}, and r_n = r n^{η} with a, c, r > 0, β = α − η − 2γ > 0, and 3γ − α/2 + 3η/2 ≥ 0.

Proposition 1: Assume that conditions (C.1)–(C.6) hold. Let P be orthogonal with P H_p(θ*) P^T = a^{−1} r^{−1} diag(λ_1, ..., λ_d). Then

  n^{β/2} (θ_n − θ*) converges in distribution to N(μ, P M P^T) as n → ∞,

where M = (1/4) a² r² c^{−2} σ² ξ² diag[(2λ_1 − β_+)^{−1}, ..., (2λ_d − β_+)^{−1}] with β_+ = β < 2 min_i λ_i if α = 1 and β_+ = 0 if α < 1, and

  μ = 0                                        if 3γ − α/2 + 3η/2 > 0,
  μ = (a r H_p(θ*) − (β_+/2) I)^{−1} T         if 3γ − α/2 + 3η/2 = 0,

where the lth element of T is

  −(1/6) a c² ξ² [ L^{(3)}_{lll}(θ*) + 3 ∑_{i=1, i≠l}^{d} L^{(3)}_{iii}(θ*) ].

Proof: For n large enough, we have

  E[ĝ_n(θ_n) | θ_n] = H(θ̄_n)(θ_n − θ*) + b_n(θ_n),
  ∇P(θ_n) = H_p(θ̄_n′)(θ_n − θ*),

where b_n(θ_n) = E[ĝ_n(θ_n) − g(θ_n) | θ_n]. Rewrite the algorithm as

  θ_{n+1} − θ* = (I − n^{−α+η} Γ_n)(θ_n − θ*) + n^{−(α−η+β)/2} Φ_n V_n + n^{−α+η−β/2} T_n,

where

  Γ_n = a n^{−η} H(θ̄_n) + a r H_p(θ̄_n′),
  V_n = n^{−γ} [ĝ_n(θ_n) − E(ĝ_n(θ_n) | θ_n)],
  Φ_n = −a I,
  T_n = −a n^{β/2−η} b_n(θ_n).

Following the techniques used in [9] and the general normality results from [16], we can establish the desired result.

Note that, based on the result in Proposition 1, a convergence rate of n^{1/3} is achieved with α = 1 and γ = 1/6 − η/2 > 0.
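As a worked check of the parameter bookkeeping in Proposition 1 (our own verification, not part of the paper), the rate claim follows directly from the definition β = α − η − 2γ:

```latex
% With a_n = a n^{-\alpha},\ c_n = c n^{-\gamma},\ r_n = r n^{\eta},
% take \alpha = 1 and \gamma = \tfrac{1}{6} - \tfrac{\eta}{2} > 0. Then
\beta \;=\; \alpha - \eta - 2\gamma
      \;=\; 1 - \eta - 2\Bigl(\tfrac{1}{6} - \tfrac{\eta}{2}\Bigr)
      \;=\; \tfrac{2}{3},
\qquad\text{so}\qquad
n^{\beta/2}\,(\theta_n - \theta^*) \;=\; n^{1/3}\,(\theta_n - \theta^*).
% Moreover, 3\gamma - \tfrac{\alpha}{2} + \tfrac{3\eta}{2}
%   = \bigl(\tfrac{1}{2} - \tfrac{3\eta}{2}\bigr) - \tfrac{1}{2} + \tfrac{3\eta}{2} = 0,
% i.e., this choice sits in the boundary case of Proposition 1, where the
% asymptotic bias \mu = (a r H_p(\theta^*) - \tfrac{\beta_+}{2} I)^{-1} T may be nonzero.
```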

IV. NUMERICAL EXPERIMENTS

We test our algorithm on a constrained optimization problem described in [17, p. 352]:

  min_θ L(θ) = θ_1² + θ_2² + 2θ_3² + θ_4² − 5θ_1 − 5θ_2 − 21θ_3 + 7θ_4

subject to

  q_1(θ) = 2θ_1² + θ_2² + θ_3² + 2θ_1 − θ_2 − θ_4 − 5 ≤ 0,
  q_2(θ) = θ_1² + θ_2² + θ_3² + θ_4² + θ_1 − θ_2 + θ_3 − θ_4 − 8 ≤ 0,
  q_3(θ) = θ_1² + 2θ_2² + θ_3² + 2θ_4² − θ_1 − θ_4 − 10 ≤ 0.

The minimum cost L(θ*) = −44 under the constraints occurs at θ* = [0, 1, 2, −1]^T, where the constraints q_1(·) ≤ 0 and q_2(·) ≤ 0 are active. The Lagrange multiplier is [λ_1*, λ_2*, λ_3*]^T = [2, 1, 0]^T. The problem had not been solved to satisfactory accuracy with deterministic search methods that operate directly with constraints (as claimed in [17]). Furthermore, we increase the difficulty of the problem by adding i.i.d. zero-mean Gaussian noise to L(θ) and assume that only noisy measurements of the cost function L are available (without the gradient). The initial point is chosen at [0, 0, 0, 0]^T, and the standard deviation of the added Gaussian noise is 4.0 (roughly equal to the initial error).

We consider three different penalty functions:

• Quadratic penalty function:

  P(θ) = (1/2) ∑_{j=1}^{s} [max{0, q_j(θ)}]².    (11)

In this case the gradient of P(·) required in the algorithm is

  ∇P(θ) = ∑_{j=1}^{s} max{0, q_j(θ)} ∇q_j(θ).    (12)

• Augmented Lagrangian:

  P(θ) = (1/2r²) ∑_{j=1}^{s} { [max{0, λ_j + r q_j(θ)}]² − λ_j² }.    (13)

In this case, the actual penalty function used varies over the iterations, depending on the specific values selected for r_n and λ_n. The gradient of the penalty function required in the algorithm at the nth iteration is

  ∇P(θ) = (1/r_n) ∑_{j=1}^{s} max{0, λ_{nj} + r_n q_j(θ)} ∇q_j(θ).    (14)

To properly update λ_n, we adopt a variation of the method of multipliers [1] (a code sketch of this update is given after this list):

  λ_{(n+1)j} = min{ max{0, λ_{nj} + r_n q_j(θ_n)}, M },    (15)

where λ_{nj} denotes the jth element of the vector λ_n and M ∈ R^+ is a large constant scalar. Since (15) ensures that {λ_n} is bounded, the convergence of the minimum of {L_n(·)} remains valid. Furthermore, {λ_n} will be close to the true Lagrange multipliers as n → ∞.

• Absolute value penalty function:

  P(θ) = max_{j=1,...,s} max{0, q_j(θ)}.    (16)

As discussed earlier, we ignore the technical difficulty that P(·) is not differentiable everywhere. The gradient of P(·), when it exists, is

  ∇P(θ) = max{0, q_{J(θ)}(θ)} ∇q_{J(θ)}(θ),    (17)

where J(θ) = argmax_{j=1,...,s} q_j(θ).
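A minimal Python sketch (ours) of the capped multiplier update (15); the cap M is written here as an arbitrary large constant:

```python
import numpy as np

def update_multipliers(lam, theta, q_funcs, r_n, M=1e6):
    """Method-of-multipliers update (15): raise lambda_j along violated
    constraints, clip at 0 from below and at the large constant M from
    above so that {lambda_n} stays bounded (M = 1e6 is an arbitrary choice)."""
    q_vals = np.array([q(theta) for q in q_funcs])
    return np.minimum(np.maximum(0.0, lam + r_n * q_vals), M)
```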
For all the simulations we use the following parameter values: a_n = 0.1(n + 100)^{−0.602} and c_n = n^{−0.101}. These parameters for a_n and c_n are chosen following a practical implementation guideline recommended in [18]. For the augmented Lagrangian method, λ_n is initialized as the zero vector. For the quadratic penalty function and the augmented Lagrangian, we use r_n = 10 n^{0.1} as the penalty parameter. For the absolute value penalty function, we consider two possible values of the constant penalty: r_n = r = 3.01 and r_n = r = 10. Note that in our experiment r̄ = ∑_{j=1}^{s} λ_j* = 3. Hence the first choice, r = 3.01, is theoretically optimal but not practical, since there is no reliable way to estimate r̄. The second choice, r = 10, represents a more typical scenario, where an upper bound on r̄ is estimated.

Figure 1 plots the errors to the optimum (averaged over 100 independent simulations) over 4000 iterations of the algorithms.

[Fig. 1. Error to the optimum (‖θ_n − θ*‖) averaged over 100 independent simulations. Curves: quadratic penalty; augmented Lagrangian; absolute value penalty with r = 3.01 and with r = 10. Average error vs. number of iterations (0–4000).]

The simulation results in Figure 1 seem to suggest that the proposed algorithm with the quadratic penalty function and with the augmented Lagrangian leads to comparable performance (the augmented Lagrangian method performed slightly better than the standard quadratic technique). This suggests that a more effective update scheme for λ_n than (15) is needed for the augmented Lagrangian technique. The absolute value penalty function with r = 3.01 (≈ r̄ = 3) has the best performance. However, when an arbitrary upper bound on r̄ is used (r = 10), the performance is much worse than with both the quadratic penalty function and the augmented Lagrangian. This illustrates a key difficulty in the effective application of the exact penalty theorem with the absolute value penalty function.
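Putting the pieces together, the experiment above can be reproduced in outline with the following Python sketch (our own code, shown for the quadratic penalty (11)–(12); the constants follow the values reported in this section, and the result is only expected to approach θ* = [0, 1, 2, −1]^T, not to reproduce Figure 1 exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

L = lambda t: t[0]**2 + t[1]**2 + 2*t[2]**2 + t[3]**2 - 5*t[0] - 5*t[1] - 21*t[2] + 7*t[3]
q = [lambda t: 2*t[0]**2 + t[1]**2 + t[2]**2 + 2*t[0] - t[1] - t[3] - 5,
     lambda t: t[0]**2 + t[1]**2 + t[2]**2 + t[3]**2 + t[0] - t[1] + t[2] - t[3] - 8,
     lambda t: t[0]**2 + 2*t[1]**2 + t[2]**2 + 2*t[3]**2 - t[0] - t[3] - 10]
dq = [lambda t: np.array([4*t[0] + 2, 2*t[1] - 1, 2*t[2], -1.0]),
      lambda t: np.array([2*t[0] + 1, 2*t[1] - 1, 2*t[2] + 1, 2*t[3] - 1]),
      lambda t: np.array([2*t[0] - 1, 4*t[1], 2*t[2], 4*t[3] - 1])]

def grad_P(t):
    # quadratic penalty gradient (12)
    return sum(max(0.0, qj(t)) * dqj(t) for qj, dqj in zip(q, dq))

theta = np.zeros(4)
for n in range(1, 4001):
    a_n = 0.1 / (n + 100) ** 0.602     # a_n = 0.1 (n + 100)^{-0.602}
    c_n = n ** -0.101                  # c_n = n^{-0.101}
    r_n = 10.0 * n ** 0.1              # r_n = 10 n^{0.1}
    delta = rng.choice([-1.0, 1.0], size=4)
    y_plus = L(theta + c_n * delta) + rng.normal(0.0, 4.0)    # noisy measurements only
    y_minus = L(theta - c_n * delta) + rng.normal(0.0, 4.0)
    g_hat = (y_plus - y_minus) / (2.0 * c_n * delta)
    theta = theta - a_n * g_hat - a_n * r_n * grad_P(theta)

print(theta)   # should drift toward theta* = [0, 1, 2, -1]
```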
V. CONCLUSIONS AND REMARKS

We present a stochastic approximation algorithm based on the penalty function method and a simultaneous perturbation gradient estimate for solving stochastic optimization problems with general inequality constraints. We also present a general convergence result and the associated asymptotic normality for the proposed algorithm. Numerical results are included to demonstrate the performance of the proposed algorithm with the standard quadratic penalty function and with a more complicated penalty function based on the augmented Lagrangian method.

In this paper, we consider explicit constraints for which the analytical expressions are available. It is also possible to apply the same algorithm, with an appropriate gradient estimate for P(θ), to problems with implicit constraints, where the constraints can only be measured or estimated with possible errors. The success of this approach would depend on efficient techniques to obtain an unbiased gradient estimate of the penalty function. For example, if we can measure or estimate the value of the penalty function P(θ_n) at an arbitrary location with zero-mean error, then the SP gradient estimate can be applied. Of course, in this situation further assumptions on r_n need to be satisfied (in general, we would at least need ∑_{n=1}^{∞} (a_n r_n / c_n)² < ∞). However, in a typical application, we most likely can only measure the value of the constraint q_j(θ_n) with zero-mean error. Additional bias would be present if the standard finite-difference or SP techniques were applied to estimate ∇P(θ_n) directly in this situation. A novel technique to obtain an unbiased estimate of ∇P(θ_n) based on a reasonable number of measurements is required to make the algorithm proposed in this paper feasible for dealing with implicit constraints.

VI. REFERENCES

[1] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1995.
[2] P. Dupuis and H. J. Kushner, "Asymptotic behavior of constrained stochastic approximations via the theory of large deviations," Probability Theory and Related Fields, vol. 75, pp. 223–274, 1987.
[3] H. Kushner and D. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978.
[4] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," Annals of Mathematical Statistics, vol. 23, pp. 462–466, 1952.
[5] H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications. Springer-Verlag, 1997.
[6] P. Sadegh, "Constrained optimization via stochastic approximation with a simultaneous perturbation gradient approximation," Automatica, vol. 33, no. 5, pp. 889–892, 1997.
[7] J. Hiriart-Urruty, "Algorithms of penalization type and of dual type for the solution of stochastic optimization problems with stochastic constraints," in Recent Developments in Statistics (J. Barra, ed.), pp. 183–219, North Holland Publishing Company, 1977.
[8] G. C. Pflug, "On the convergence of a penalty-type stochastic optimization procedure," Journal of Information and Optimization Sciences, vol. 2, no. 3, pp. 249–258, 1981.
[9] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Transactions on Automatic Control, vol. 37, pp. 332–341, March 1992.
[10] I-J. Wang and J. C. Spall, "A constrained simultaneous perturbation stochastic approximation algorithm based on penalty function method," in Proceedings of the 1999 American Control Conference, vol. 1, pp. 393–399, San Diego, CA, June 1999.
[11] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods. New York: Academic Press, 1982.
[12] E. K. Chong and S. H. Żak, An Introduction to Optimization. New York: John Wiley and Sons, 1996.
[13] J. C. Spall and J. A. Cristion, "Model-free control of nonlinear stochastic systems with discrete-time measurements," IEEE Transactions on Automatic Control, vol. 43, no. 9, pp. 1178–1200, 1998.
[14] I-J. Wang, Analysis of Stochastic Approximation and Related Algorithms. PhD thesis, School of Electrical and Computer Engineering, Purdue University, August 1996.
[15] Y. He, M. C. Fu, and S. I. Marcus, "Convergence of simultaneous perturbation stochastic approximation for nondifferentiable optimization," IEEE Transactions on Automatic Control, vol. 48, no. 8, pp. 1459–1463, 2003.
[16] V. Fabian, "On asymptotic normality in stochastic approximation," The Annals of Mathematical Statistics, vol. 39, no. 4, pp. 1327–1332, 1968.
[17] H.-P. Schwefel, Evolution and Optimum Seeking. John Wiley and Sons, 1995.
[18] J. C. Spall, "Implementation of the simultaneous perturbation algorithm for stochastic optimization," IEEE Transactions on Aerospace and Electronic Systems, vol. 34, no. 3, pp. 817–823, 1998.