
Proceedings of the 2002 Winter Simulation Conference
E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds.

CONFIDENCE REGIONS FOR STOCHASTIC APPROXIMATION ALGORITHMS

Ming-hua Hsieh
Department of Management Information Systems
National Chengchi University
Wenshan, Taipei 11623, TAIWAN

Peter W. Glynn
Department of Management Science and Engineering
Stanford University
Stanford, CA 94305, U.S.A.

ABSTRACT

In principle, known central limit theorems for stochastic approximation schemes permit the simulationist to provide confidence regions for both the optimum and optimizer of a problem that is solved by application of such algorithms. Unfortunately, the structure of the limiting normal distribution depends in a complex way on the problem itself. In particular, the covariance matrix depends not only on algorithmic constants but also on even more statistically challenging parameters (e.g. the Hessian of the objective function at the optimizer). In this paper, we describe an approach to producing such confidence regions that avoids the necessity of having to explicitly estimate the covariance structure of the limiting normal distribution. This procedure offers an easy way for the simulationist to provide confidence regions in the stochastic optimization setting.

1 INTRODUCTION

Stochastic approximation algorithms are iterative procedures that permit simulationists to numerically optimize complex stochastic systems. This class of algorithms exhibits a [...]. The approach we present has analogs in the setting of Kiefer-Wolfowitz procedures.

Section 2 describes a central limit theorem for the Robbins-Monro algorithm. This central limit theorem lies at the basis of the confidence region procedures we describe. Specifically, Section 2 discusses a confidence region procedure that requires consistent estimation of a certain covariance matrix, whereas Section 3 describes a more easily implemented "cancellation" procedure. In Section 4, we discuss some of our computational experience with the procedure introduced in Section 3. Finally, Section 5 offers some concluding remarks.

2 CENTRAL LIMIT THEOREM FOR STOCHASTIC APPROXIMATIONS

We start by describing the problem setting precisely. Suppose that our goal is to numerically maximize an objective function α(θ) over a continuous decision parameter θ ∈ ℝ^d. In particular, the maximizer θ* satisfies the first-order condition

∇α(θ*) = 0,    (1)

where ∇α(θ) is the gradient of α(·) at θ ∈ ℝ^d.

The recursion is driven by unbiased estimators Z(θ) of the gradient ∇α(θ). (Note that Z(θ) is also encoded as a row vector.) A number of different procedures have been proposed in the literature for obtaining such unbiased estimators of the gradient, including likelihood ratio methods (Glynn 1986, Glynn 1990), infinitesimal perturbation analysis (Glasserman 1991), conditional Monte Carlo (Fu and Hu 1997), and the "push-out" approach (Rubinstein 1992).

For θ ∈ ℝ^d, let F(·; θ) denote the distribution of Z(θ). The gradient estimates appearing in the recursion are assumed to satisfy

P(Z_{n+1}(θ_n) ∈ dz | θ_0, Z_1(θ_0), ..., Z_n(θ_{n−1})) = F(dz; θ_n)

for z ∈ ℝ^d. (Again, we choose to encode the θ_n's as row vectors.)

We are now ready to describe one of many known central limit theorems (CLT's) for θ_n; see pp. 147-150 of Nevel'son and Has'minskii (1973) for a complete proof. The result rests on a collection of regularity conditions i)-vi) on α(·), Z(·), and the recursion, referred to below as Assumption A; condition iii), for example, requires that for each ε > 0,

sup { ∇α(θ)(θ − θ*)^T : ε ≤ ‖θ − θ*‖ ≤ 1/ε } < 0.

Theorem 1. Suppose that (θ_n : n ≥ 0) satisfies (2), (3), and Assumption A. Then

n^{1/2}(θ_n − θ*) ⇒ N(0, C)

as n → ∞, where N(0, C) is a multivariate normal random vector with zero mean and covariance matrix C; the matrix C depends both on the gain constant a and on the Hessian H of α(·) at θ* and on EZ(θ*)^T Z(θ*).

Condition iii) is a sufficient condition that guarantees that θ* is a global maximizer of α(·). Condition iv) is a mild continuity hypothesis on the distribution of Z(θ), and condition v) is a technical integrability hypothesis that is satisfied in great generality. Finally, condition vi) is an assumption that controls the rate at which the variance of ‖Z(θ)‖ grows as ‖θ‖ tends to infinity.

Theorem 1 asserts that θ_n converges to θ* at rate n^{−1/2} in the number of iterations n. Furthermore, the error of θ_n, suitably normalized, is asymptotically multivariate normal with covariance matrix C.
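The specific recursion (2)-(3) is not reproduced above. As a rough illustration only, the following sketch assumes the standard Robbins-Monro form θ_{n+1} = θ_n + (a/(n+1)) Z_{n+1}(θ_n), with a the step ("gain") parameter referred to in Section 4, and omits any projection onto a feasible region; the callback grad_estimate and the toy objective are our own stand-ins, not part of the paper.

    import numpy as np

    def robbins_monro(grad_estimate, theta0, a, n_iter, rng):
        """Hypothetical Robbins-Monro ascent: theta_{n+1} = theta_n + a/(n+1) * Z_{n+1}(theta_n).

        grad_estimate(theta, rng) must return an unbiased estimate of the gradient
        of the objective alpha at theta (a row vector, as in the paper)."""
        theta = np.asarray(theta0, dtype=float)
        for n in range(n_iter):
            z = grad_estimate(theta, rng)          # noisy gradient Z_{n+1}(theta_n)
            theta = theta + (a / (n + 1.0)) * z    # gain sequence a/(n+1); no projection step
        return theta

    # Toy usage: maximize alpha(theta) = -||theta - 1||^2 from noisy gradients.
    rng = np.random.default_rng(0)
    noisy_grad = lambda th, r: -2.0 * (th - 1.0) + r.normal(size=th.shape)
    theta_hat = robbins_monro(noisy_grad, theta0=np.zeros(2), a=1.0, n_iter=5000, rng=rng)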

Suppose now that C_n is an estimator of C computed from the first n iterations. To continue the argument, we write the quadratic form n(θ_n − θ*) C_n^{−1} (θ_n − θ*)^T as follows:

n(θ_n − θ*) C_n^{−1} (θ_n − θ*)^T
    = n(θ_n − θ*) C^{−1} (θ_n − θ*)^T
    + n(θ_n − θ*) (C_n^{−1} − C^{−1}) (θ_n − θ*)^T.    (4)

The continuous mapping principle implies that the first term on the right hand side of (4) converges weakly to N(0, C) C^{−1} N(0, C)^T, which is easily seen to have a χ²_d distribution. On the other hand,

‖n(θ_n − θ*) (C_n^{−1} − C^{−1}) (θ_n − θ*)^T‖ ≤ ‖n^{1/2}(θ_n − θ*)‖² · ‖C_n^{−1} − C^{−1}‖.    (5)

The first factor on the right-hand side of (5) converges weakly, by the continuous mapping principle, to ‖N(0, C)‖², whereas the second factor converges weakly to zero. Hence, the product converges weakly to zero, establishing that the left-hand side of (4) does indeed converge weakly to N(0, C) C^{−1} N(0, C)^T, as desired.

Thus, if C_n is a consistent estimator for C, the region

{θ : n(θ_n − θ) C_n^{−1} (θ_n − θ)^T ≤ z}

is an approximate 100(1 − δ)% confidence region for θ*, provided that z is selected so that P(χ²_d > z) = δ. The key to constructing confidence regions for θ* is therefore the consistent estimation of C.

As is evident from the formula for C, even when H and EZ(θ*)^T Z(θ*) are known, the numerical evaluation of C is non-trivial. Of course, in practice, both H and EZ(θ*)^T Z(θ*) are generally unknown and must themselves be estimated, substantially complicating the task. In the current setting, EZ(θ*)^T Z(θ*) can be estimated in a straightforward fashion. In particular, let

Σ_n = (1/n) ∑_{i=1}^{n} Z_i(θ_{i−1})^T Z_i(θ_{i−1}).    (6)

Set γ(θ) = EZ(θ)^T Z(θ) and D_i = Z_i(θ_{i−1})^T Z_i(θ_{i−1}) − γ(θ_{i−1}). Provided that {E‖Z(θ)‖⁴ : θ ∈ ℝ^d} is suitably bounded, the D_i satisfy the conditions of the Martingale Convergence Theorem (see p. 468 of Billingsley 1995, or ch. 12 of Williams 1991), so that there exists a finite-valued random variable M_∞ such that ∑_{i=1}^{n} D_i/i → M_∞ a.s. as n → ∞. Kronecker's lemma (see p. 250 of Loeve 1977) then guarantees that n^{−1} ∑_{i=1}^{n} D_i → 0 a.s. as n → ∞. But

Σ_n = n^{−1} ∑_{i=1}^{n} D_i + n^{−1} ∑_{i=0}^{n−1} γ(θ_i).    (7)

Conditions A iv) and v) ensure that γ(θ) → γ(θ*) as θ → θ*. In addition, it is known that θ_n → θ* a.s. as n → ∞ under the conditions of Theorem 1; see p. 93 of Nevel'son and Has'minskii (1973). Hence,

n^{−1} ∑_{i=0}^{n−1} γ(θ_i) → γ(θ*) = EZ(θ*)^T Z(θ*) a.s.

as n → ∞. It follows from (7) that

Σ_n → EZ(θ*)^T Z(θ*) a.s. as n → ∞.

The estimator defined by (6) can be modified (and possibly improved) by weighting more recent observations more heavily, rather than weighting all observations equally.

In the setting of Theorem 1, the greater challenge in constructing a consistent estimator C_n is the estimation of the matrix H. The (i, j)'th entry of H is given by

H_ij = ∂²α(θ)/∂θ_i ∂θ_j |_{θ = θ*}.

A number of the gradient estimation algorithms mentioned earlier, while capable of producing unbiased estimates of first order partial derivatives, do not easily extend to the construction of unbiased estimators for second order partial derivatives. For example, infinitesimal perturbation analysis typically estimates only first derivatives consistently. One notable exception is the likelihood ratio method, for which second derivative estimators are easily constructed.
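To make the Section 2 procedure concrete, the following sketch forms Σ_n as in (6) from the observed gradient estimates and evaluates the chi-square region test; it takes a consistent estimator C_n of C as given, since assembling C_n (which requires an estimate of H) is exactly the difficulty discussed above. The function names are ours.

    import numpy as np
    from scipy.stats import chi2

    def sigma_hat(Z):
        """Equation (6): (1/n) * sum_i Z_i(theta_{i-1})^T Z_i(theta_{i-1}).

        Z is an (n, d) array whose i-th row is the gradient estimate observed at
        iteration i; rows are treated as row vectors, as in the paper."""
        n = Z.shape[0]
        return Z.T @ Z / n

    def in_chi2_region(theta, theta_n, C_n, n, delta=0.05):
        """Check n * (theta_n - theta) C_n^{-1} (theta_n - theta)^T <= z, where
        P(chi^2_d > z) = delta and C_n is a (given) consistent estimator of C."""
        diff = np.asarray(theta_n, dtype=float) - np.asarray(theta, dtype=float)
        stat = n * diff @ np.linalg.solve(C_n, diff)
        return stat <= chi2.ppf(1.0 - delta, diff.size)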

3 A GENERAL APPROACH TO CONSTRUCTION OF CONFIDENCE REGIONS

Given the difficulties associated with consistent estimation of the covariance matrix C, we seek an alternative method. To place our alternative methodology in its appropriate context, we describe the idea in a more general setting. Suppose that we wish to compute a parameter α ∈ ℝ^d for which there exists an estimator α_n satisfying a CLT. In particular, assume:

Assumption B. There exists a symmetric positive definite matrix Γ such that n^{1/2}(α_n − α) ⇒ N(0, Γ) as n → ∞.

We wish to construct confidence regions for α without requiring existence of a consistent estimator for Γ.

The idea is to compute α by simulating m independent replications of the estimator α_n (thereby consuming roughly mn computer time units). Let α_1n, α_2n, ..., α_mn be the m resulting copies of the random vector α_n. The natural estimator for α is then ᾱ(m, n) = (1/m) ∑_{i=1}^{m} α_in. The sample covariance matrix V(m, n) is given by

V(m, n) = (1/(m−1)) ∑_{i=1}^{m} (α_in − ᾱ(m, n))^T (α_in − ᾱ(m, n)).

Theorem 2. Under Assumption B,

a) √(mn) (ᾱ(m, n) − α) ⇒ N(0, Γ) as n → ∞;

b) if m ≥ d + 1, then

m (ᾱ(m, n) − α) V(m, n)^{−1} (ᾱ(m, n) − α)^T ⇒ (d(m − 1)/(m − d)) F_(d, m−d)

as n → ∞, where F_(d, m−d) is an F-distributed random variable with (d, m − d) degrees of freedom.

Proof. Part a) of Theorem 2 is a simple consequence of the continuous mapping principle. For part b) of Theorem 2, we note that if N_1(0, Γ), ..., N_m(0, Γ) are m independent and identically distributed random vectors with distribution N(0, Γ), then the random matrix

(1/(m−1)) ∑_{i=1}^{m} (N_i(0, Γ) − (1/m) ∑_{j=1}^{m} N_j(0, Γ))^T (N_i(0, Γ) − (1/m) ∑_{j=1}^{m} N_j(0, Γ))

is almost surely non-singular (since m ≥ d + 1); see p. 208 of Searle (1982). Because the matrix inverse functional is continuous in a neighborhood of any non-singular matrix, the function g : (ℝ^d)^m → ℝ given by

g(x_1, ..., x_m) = ((1/m) ∑_{j=1}^{m} x_j)
    · [ (1/(m−1)) ∑_{l=1}^{m} (x_l − (1/m) ∑_{j=1}^{m} x_j)^T (x_l − (1/m) ∑_{j=1}^{m} x_j) ]^{−1}
    · ((1/m) ∑_{j=1}^{m} x_j)^T

is almost surely continuous at (N_1(0, Γ), ..., N_m(0, Γ)). The continuous mapping principle then yields the second result. □

Theorem 2 applies immediately to the construction of confidence regions for stochastic approximations. Suppose that we independently replicate the stochastic approximation m times, thereby yielding m independent copies θ_1n, θ_2n, ..., θ_mn of the random vector θ_n. Put

θ̄(m, n) = (1/m) ∑_{i=1}^{m} θ_in,

Ṽ(m, n) = (1/(m−1)) ∑_{i=1}^{m} (θ_in − θ̄(m, n))^T (θ_in − θ̄(m, n)),

and set

Λ_mn(z) = { θ : m (θ̄(m, n) − θ) Ṽ(m, n)^{−1} (θ̄(m, n) − θ)^T ≤ z d(m − 1)/(m − d) }.

Proposition 3. Assume the conditions of Theorem 1, and suppose that EZ(θ*)^T Z(θ*) is non-singular. If m ≥ d + 1, then

P(θ* ∈ Λ_mn(z)) → 1 − δ

as n → ∞, where z is selected so that P(F_(d, m−d) ≤ z) = 1 − δ.

The proof is an easy application of Theorem 2. Proposition 3 asserts that Λ_mn(z) is (for large n) an approximate 100(1 − δ)% confidence region for θ*. In contrast to the confidence region approach of Section 2 (in which C was consistently estimated), this procedure can be easily implemented. Note that the current procedure is essentially a "cancellation procedure", in the terminology of Glynn and Iglehart (1990).

One additional advantage of the current approach is that it requires the simulationist to run m independent replications of the stochastic approximation, providing m independent estimators of the maximizer θ*. These multiple independent re-starts of the stochastic approximation procedure permit the simulationist to test the question of whether the data collected is consistent with α(·) having a unique local maximizer. This question is of great importance from an applied standpoint, since global maximization is typically the simulationist's objective. We shall pursue this important issue in an expanded version of this paper.
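A minimal sketch of the cancellation procedure follows, assuming the terminal iterates θ_1n, ..., θ_mn of the m independent replications are stacked as the rows of an array; the function name and the SciPy call are ours, not the paper's. Because only the replicate iterates enter, no estimate of H or of EZ(θ*)^T Z(θ*) is required.

    import numpy as np
    from scipy.stats import f as f_dist

    def cancellation_region(theta_reps, theta, delta=0.05):
        """Membership test for the confidence region Lambda_mn(z) of Proposition 3.

        theta_reps : (m, d) array; row i is the terminal iterate theta_{in} of the
                     i-th independent stochastic-approximation replication.
        theta      : candidate point (e.g. theta* when estimating coverage).
        Requires m >= d + 1 so that V~(m, n) is almost surely non-singular."""
        m, d = theta_reps.shape
        if m < d + 1:
            raise ValueError("need m >= d + 1 replications")
        theta_bar = theta_reps.mean(axis=0)                 # theta-bar(m, n)
        R = theta_reps - theta_bar
        V = R.T @ R / (m - 1)                               # sample covariance V~(m, n)
        diff = theta_bar - np.asarray(theta, dtype=float)
        stat = m * diff @ np.linalg.solve(V, diff)          # m (theta-bar - theta) V^{-1} (.)^T
        z = f_dist.ppf(1.0 - delta, d, m - d)               # P(F_{d, m-d} <= z) = 1 - delta
        return stat <= z * d * (m - 1) / (m - d)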
4 COMPUTATIONAL RESULTS

In this section, we test the confidence region procedure proposed in Section 3 on two different examples. The first involves a scalar decision parameter, whereas the second example concerns a two-dimensional vector of decision parameters.

4.1 Model 1

The first model is an M/M/1 queue with arrival rate λ = 1 and mean service time θ ∈ Θ = [0.05, 0.95]. Therefore, the service time V has density f_θ(x) = exp(−x/θ)/θ. The goal is to minimize

α(θ) = w(θ) + 1/θ,    (8)

where w(θ) is the mean steady-state sojourn time per customer in the system associated with parameter θ. It can be shown that w(θ) = θ/(1 − θ). Thus the optimal value is θ* = 0.5. Figure 1 shows the shape of α(θ) on the interval [0.05, 0.95]. This model appeared in L'Ecuyer, Giroux, and Glynn (1994).

Figure 1: Objective Function of Model 1 (plot of α(θ) = w(θ) + 1/θ on [0.05, 0.95])

We implemented the Robbins-Monro algorithm using the likelihood ratio technique to estimate the derivative; see L'Ecuyer and Glynn (1994) for details on constructing the likelihood ratio derivative estimator for this model. The initial point θ_0 is chosen uniformly from the interval Θ. The step parameter a in (3) is set to 0.1, and the simulation budget at iteration n is set to be proportional to √n (such a simulation time allocation has been shown to be empirically efficient in L'Ecuyer, Giroux, and Glynn (1994)).

Table 1 summarizes the numerical results of the simulation runs. The confidence level is 95%, and the 95% confidence interval for the coverage probability is estimated from 100 replications. The coverage probabilities are reasonably close to the nominal level of 95%, and exhibit some degree of convergence to their nominal level as n grows larger. (See, in particular, the results reported for m = 5.) Although n does not appear to have a significant effect on the coverage accuracy, larger n does narrow the confidence interval, i.e., the variation of the θ_in is smaller.

Table 1: Numerical Summary of Model 1. The Ideal Coverage Probability Is 0.95

            m = 3                    m = 5
    n       CI for coverage prob.    CI for coverage prob.
    64      0.87 ± 0.066             0.54 ± 0.098
    128     0.86 ± 0.068             0.66 ± 0.093
    256     0.90 ± 0.059             0.82 ± 0.076
    512     0.89 ± 0.062             0.74 ± 0.086
    1024    0.86 ± 0.068             0.80 ± 0.079
    2048    0.86 ± 0.068             0.87 ± 0.066

4.2 Model 2

The second model is an M/M/1 queue with arrival rate λ ∈ [1, 2.5] and service rate µ ∈ [3.5, 6]. Thus, the service time V has density f_µ(x) = µ exp(−µx) and the inter-arrival time U has density f_λ(x) = λ exp(−λx). The goal is to minimize

α(λ, µ) = w(λ, µ) + 1/λ + µ/4,    (9)

where w(λ, µ) is the mean steady-state customer sojourn time for the system associated with arrival rate λ and service rate µ. Again, it can be shown that w(λ, µ) = 1/(µ − λ), so that the optimal value is (λ*, µ*) = (2, 4). The shape of α(λ, µ) is displayed in Figure 2.

Again, we implement the Robbins-Monro algorithm using the likelihood ratio gradient estimation algorithm. The initial point (λ_0, µ_0) is chosen uniformly from the rectangle [1, 2.5] × [3.5, 6]. The step parameter a in (3) is set to 0.5, and the simulation budget at iteration n is set to be proportional to √n.
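The coverage estimates of Tables 1 and 2 can be reproduced in outline with a harness of the following form. The queueing simulation and the likelihood ratio derivative estimator are not reconstructed here: run_replication is a hypothetical stand-in for one complete Robbins-Monro run, and the routine reuses cancellation_region() from the sketch following Proposition 3. The reported half-width is the normal-approximation 95% interval based on 100 replications, matching the ± entries in the tables.

    import numpy as np

    def estimate_coverage(run_replication, theta_star, m, n, reps=100, delta=0.05, seed=1):
        """Monte Carlo estimate of P(theta* in Lambda_mn(z)), as in Tables 1 and 2.

        run_replication(n, rng) is assumed to run one independent Robbins-Monro
        replication for n iterations and return the terminal iterate (length-d array)."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            theta_reps = np.vstack([run_replication(n, rng) for _ in range(m)])
            hits += bool(cancellation_region(theta_reps, theta_star, delta))
        p_hat = hits / reps
        half_width = 1.96 * np.sqrt(p_hat * (1.0 - p_hat) / reps)   # 95% CI half-width
        return p_hat, half_width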

Figure 2: Objective Function of Model 2 (surface plot of α(λ, µ) = w(λ, µ) + 1/λ + µ/4)

Table 2 summarizes the numerical results of the simulation runs. Again, the confidence level is 95%, and the 95% confidence interval for the coverage probability is estimated from 100 replications. Again, the empirical evidence is in reasonably close agreement with the theory developed in Section 3.

Table 2: Numerical Summary of Model 2. The Ideal Coverage Probability Is 0.95

            m = 3                    m = 5
    n       CI for coverage prob.    CI for coverage prob.
    64      0.97 ± 0.034             0.89 ± 0.062
    128     0.93 ± 0.050             0.89 ± 0.062
    256     0.95 ± 0.043             0.88 ± 0.064
    512     0.91 ± 0.056             0.93 ± 0.050
    1024    0.96 ± 0.039             0.90 ± 0.059
    2048    0.96 ± 0.039             0.91 ± 0.056

The simulation results from these two models suggest that the procedure described in Section 3 is a pragmatic approach for constructing confidence intervals/regions for stochastic approximation algorithms.

5 CONCLUDING REMARKS

In this paper, we have offered an approach to the construction of confidence regions for stochastic approximation algorithms of Robbins-Monro type. A principal advantage of our proposed methodology is the ease with which it can be implemented from a practical standpoint. An additional feature of our proposed method is that its requirement to simulate multiple independent replications of the stochastic approximation procedure offers the simulationist an opportunity to test the question of whether the stochastic approximation has indeed converged to a global optimizer (as opposed to convergence to a local optimizer). We intend to pursue this question at greater depth in future work.

In future work, we hope to study the following issues concerning extensions of the methodology developed in this paper as well as extensions of other methods for simulation optimization:

1. Extend the theory and body of computational experience to the class of Kiefer-Wolfowitz algorithms;

2. Extend the theory and body of computational experience to the averaging algorithms proposed by Polyak and Juditsky (1992), as well as Kushner and Yin (1997);

3. Develop theory and algorithms to address the situation in which the gain constant a is chosen (inadvertently) so small that the matrix aH + (1/2)I may have positive eigenvalues.

REFERENCES

Billingsley, P. 1995. Probability and Measure. Third Edition. New York: John Wiley & Sons.
Fu, M. C., and J. Q. Hu. 1997. Conditional Monte Carlo: Gradient estimation and optimization applications. Boston: Kluwer Academic Publishers.
Gill, P. E., W. Murray, and M. H. Wright. 1981. Practical Optimization. London and New York: Academic Press.
Glasserman, P. 1991. Gradient estimation via perturbation analysis. Boston: Kluwer.
Glynn, P. W. 1986. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 1986 Winter Simulation Conference, 356–365.
Glynn, P. W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33:75–84.
Glynn, P. W., and D. L. Iglehart. 1990. Simulation output analysis using standardized time series. Mathematics of Operations Research 15:1–16.
Kushner, H. J., and G. G. Yin. 1997. Stochastic approximation algorithms and applications. New York: Springer-Verlag.
L'Ecuyer, P., and P. W. Glynn. 1994. Stochastic optimization by simulation: Convergence proofs for the GI/G/1 queue in steady-state. Management Science 40:1562–1578.
L'Ecuyer, P., N. Giroux, and P. W. Glynn. 1994. Stochastic optimization by simulation: Numerical experiments with the M/M/1 queue in steady-state. Management Science 40:1245–1261.
Ljung, L., G. Ch. Pflug, and H. Walk. 1992. Stochastic approximation and optimization of random systems. Basel: Birkhäuser Verlag.
Loeve, M. 1977. Probability Theory, Vol. 1. 4th edition. Springer-Verlag.

Luenberger, D. G. 1984. Introduction to Linear and Nonlinear Programming. Menlo Park, California: Addison-Wesley.
Nevel'son, M. B., and R. Z. Has'minskii. 1973. Stochastic Approximation and Recursive Estimation. Providence, Rhode Island: American Mathematical Society.
Polyak, B. T., and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30:838–855.
Rubinstein, R. Y. 1992. Sensitivity analysis of discrete event systems by the "push-out" method. Annals of Operations Research 39:229–237.
Searle, S. R. 1982. Matrix Algebra Useful for Statistics. New York: John Wiley.
Williams, D. 1991. Probability with Martingales. Cambridge Mathematical Textbooks. Cambridge: Cambridge University Press.

AUTHOR BIOGRAPHIES

MING-HUA HSIEH is an Assistant Professor in the Department of Management Information Systems at National Chengchi University, Taiwan. From 1997 to 1999, he was a software designer at Hewlett-Packard Company, California. He is a member of INFORMS. His research interests include simulation methodology and financial engineering. His e-mail address is .

PETER W. GLYNN received his Ph.D. from Stanford University, after which he joined the faculty of the Department of Industrial Engineering at the University of Wisconsin-Madison. In 1987, he returned to Stanford, where he is the Thomas Ford Professor of Engineering in the Department of Management Science and Engineering. His research interests include discrete-event simulation, computational probability, queueing, and general theory for stochastic systems. His e-mail address is .