
Proceedings of the 2002 Winter Simulation Conference
E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds.

CONFIDENCE REGIONS FOR STOCHASTIC APPROXIMATION ALGORITHMS

Ming-hua Hsieh
Department of Management Information Systems
National Chengchi University
Wenshan, Taipei 11623, TAIWAN

Peter W. Glynn
Department of Management Science and Engineering
Stanford University
Stanford, CA 94305, U.S.A.

ABSTRACT

In principle, known central limit theorems for stochastic approximation schemes permit the simulationist to provide confidence regions for both the optimum and optimizer of a problem that is solved by application of such algorithms. Unfortunately, the structure of the limiting normal distribution depends in a complex way on the problem itself. In particular, the covariance matrix depends not only on algorithmic constants but also on even more statistically challenging parameters (e.g. the Hessian of the objective function at the optimizer). In this paper, we describe an approach to producing such confidence regions that avoids the necessity of having to explicitly estimate the covariance structure of the limiting normal distribution. This procedure offers an easy way for the simulationist to provide confidence regions in the stochastic optimization setting.

1 INTRODUCTION

Stochastic approximation algorithms are iterative procedures that permit simulationists to numerically optimize complex stochastic systems. This class of algorithms exhibits a [...]. The approach we present has analogs in the setting of Kiefer-Wolfowitz procedures.

Section 2 describes a central limit theorem for the Robbins-Monro algorithm. This central limit theorem lies at the basis of the confidence region procedures we describe. Specifically, Section 2 discusses a confidence region procedure that requires consistent estimation of a certain covariance matrix, whereas Section 3 describes a more easily implemented "cancellation" procedure. In Section 4, we discuss some of our computational experience with the procedure introduced in Section 3. Finally, Section 5 offers some concluding remarks.

2 CENTRAL LIMIT THEOREM FOR STOCHASTIC APPROXIMATIONS

We start by describing the problem setting precisely. Suppose that our goal is to numerically maximize an objective function α(θ) over a continuous decision parameter θ ∈ ℝ^d. In particular, the maximizer θ* satisfies the first-order condition

∇α(θ*) = 0,    (1)

where ∇α(θ) is the gradient of α(·) at θ ∈ ℝ^d.

The recursion is driven by unbiased estimators Z(θ) of the gradient ∇α(θ). (Note that Z(θ) is also encoded as a row vector.) A number of different procedures have been proposed in the literature for obtaining such unbiased estimators of the gradient, including likelihood ratio methods (Glynn 1986, Glynn 1990), infinitesimal perturbation analysis (Glasserman 1991), conditional Monte Carlo (Fu and Hu 1997), and the "push-out" approach (Rubinstein 1992).

For θ ∈ ℝ^d, let F(·; θ) denote the distribution of Z(θ). The gradient estimates appearing in the recursion are assumed to satisfy

P(Z_{n+1}(θ_n) ∈ dz | θ_0, Z_1(θ_0), ..., Z_n(θ_{n−1})) = F(dz; θ_n)

for z ∈ ℝ^d. (Again, we choose to encode the θ_n's as row vectors.)

We are now ready to describe one of many known central limit theorems (CLT's) for θ_n; see pp. 147-150 of Nevel'son and Has'minskii (1973) for a complete proof. The result rests on a collection of regularity conditions i)-vi) on α(·), Z(·), and the recursion, referred to below as Assumption A; condition iii), for example, requires that for each ε > 0,

sup { ∇α(θ)(θ − θ*)^T : ε ≤ ‖θ − θ*‖ ≤ 1/ε } < 0.

Theorem 1. Suppose that (θ_n : n ≥ 0) satisfies (2), (3), and Assumption A. Then

n^{1/2}(θ_n − θ*) ⇒ N(0, C)

as n → ∞, where N(0, C) is a multivariate normal random vector with zero mean and covariance matrix C; the matrix C depends both on the gain constant a and on the Hessian H of α(·) at θ* and on EZ(θ*)^T Z(θ*).

Condition iii) is a sufficient condition that guarantees that θ* is a global maximizer of α(·). Condition iv) is a mild continuity hypothesis on the distribution of Z(θ), and condition v) is a technical integrability hypothesis that is satisfied in great generality. Finally, condition vi) is an assumption that controls the rate at which the variance of ‖Z(θ)‖ grows as ‖θ‖ tends to infinity.

Theorem 1 asserts that θ_n converges to θ* at rate n^{−1/2} in the number of iterations n. Furthermore, the error of θ_n, suitably normalized, is asymptotically multivariate normal with covariance matrix C.
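The specific recursion (2)-(3) is not reproduced above. As a rough illustration only, the following sketch assumes the standard Robbins-Monro form θ_{n+1} = θ_n + (a/(n+1)) Z_{n+1}(θ_n), with a the step ("gain") parameter referred to in Section 4, and omits any projection onto a feasible region; the callback grad_estimate and the toy objective are our own stand-ins, not part of the paper.

    import numpy as np

    def robbins_monro(grad_estimate, theta0, a, n_iter, rng):
        """Hypothetical Robbins-Monro ascent: theta_{n+1} = theta_n + a/(n+1) * Z_{n+1}(theta_n).

        grad_estimate(theta, rng) must return an unbiased estimate of the gradient
        of the objective alpha at theta (a row vector, as in the paper)."""
        theta = np.asarray(theta0, dtype=float)
        for n in range(n_iter):
            z = grad_estimate(theta, rng)          # noisy gradient Z_{n+1}(theta_n)
            theta = theta + (a / (n + 1.0)) * z    # gain sequence a/(n+1); no projection step
        return theta

    # Toy usage: maximize alpha(theta) = -||theta - 1||^2 from noisy gradients.
    rng = np.random.default_rng(0)
    noisy_grad = lambda th, r: -2.0 * (th - 1.0) + r.normal(size=th.shape)
    theta_hat = robbins_monro(noisy_grad, theta0=np.zeros(2), a=1.0, n_iter=5000, rng=rng)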

Suppose now that C_n is an estimator of C computed from the first n iterations. To continue the argument, we write the quadratic form n(θ_n − θ*) C_n^{−1} (θ_n − θ*)^T as follows:

n(θ_n − θ*) C_n^{−1} (θ_n − θ*)^T
    = n(θ_n − θ*) C^{−1} (θ_n − θ*)^T
    + n(θ_n − θ*) (C_n^{−1} − C^{−1}) (θ_n − θ*)^T.    (4)

The continuous mapping principle implies that the first term on the right hand side of (4) converges weakly to N(0, C) C^{−1} N(0, C)^T, which is easily seen to have a χ²_d distribution. On the other hand,

‖n(θ_n − θ*) (C_n^{−1} − C^{−1}) (θ_n − θ*)^T‖ ≤ ‖n^{1/2}(θ_n − θ*)‖² · ‖C_n^{−1} − C^{−1}‖.    (5)

The first factor on the right-hand side of (5) converges weakly, by the continuous mapping principle, to ‖N(0, C)‖², whereas the second factor converges weakly to zero. Hence, the product converges weakly to zero, establishing that the left-hand side of (4) does indeed converge weakly to N(0, C) C^{−1} N(0, C)^T, as desired.

Thus, if C_n is a consistent estimator for C, the region

{θ : n(θ_n − θ) C_n^{−1} (θ_n − θ)^T ≤ z}

is an approximate 100(1 − δ)% confidence region for θ*, provided that z is selected so that P(χ²_d > z) = δ. The key to constructing confidence regions for θ* is therefore the consistent estimation of C.

As is evident from the formula for C, even when H and EZ(θ*)^T Z(θ*) are known, the numerical evaluation of C is non-trivial. Of course, in practice, both H and EZ(θ*)^T Z(θ*) are generally unknown and must themselves be estimated, substantially complicating the task. In the current setting, EZ(θ*)^T Z(θ*) can be estimated in a straightforward fashion. In particular, let

Σ_n = (1/n) ∑_{i=1}^{n} Z_i(θ_{i−1})^T Z_i(θ_{i−1}).    (6)

Set γ(θ) = EZ(θ)^T Z(θ) and D_i = Z_i(θ_{i−1})^T Z_i(θ_{i−1}) − γ(θ_{i−1}). Provided that {E‖Z(θ)‖⁴ : θ ∈ ℝ^d} is suitably bounded, the D_i satisfy the conditions of the Martingale Convergence Theorem (see p. 468 of Billingsley 1995, or ch. 12 of Williams 1991), so that there exists a finite-valued random variable M_∞ such that ∑_{i=1}^{n} D_i/i → M_∞ a.s. as n → ∞. Kronecker's lemma (see p. 250 of Loeve 1977) then guarantees that n^{−1} ∑_{i=1}^{n} D_i → 0 a.s. as n → ∞. But

Σ_n = n^{−1} ∑_{i=1}^{n} D_i + n^{−1} ∑_{i=0}^{n−1} γ(θ_i).    (7)

Conditions A iv) and v) ensure that γ(θ) → γ(θ*) as θ → θ*. In addition, it is known that θ_n → θ* a.s. as n → ∞ under the conditions of Theorem 1; see p. 93 of Nevel'son and Has'minskii (1973). Hence,

n^{−1} ∑_{i=0}^{n−1} γ(θ_i) → γ(θ*) = EZ(θ*)^T Z(θ*) a.s.

as n → ∞. It follows from (7) that

Σ_n → EZ(θ*)^T Z(θ*) a.s. as n → ∞.

The estimator defined by (6) can be modified (and possibly improved) by weighting more recent observations more heavily, rather than weighting all observations equally.

In the setting of Theorem 1, the greater challenge in constructing a consistent estimator C_n is the estimation of the matrix H. The (i, j)'th entry of H is given by

H_ij = ∂²α(θ)/∂θ_i ∂θ_j |_{θ = θ*}.

A number of the gradient estimation algorithms mentioned earlier, while capable of producing unbiased estimates of first order partial derivatives, do not easily extend to the construction of unbiased estimators for second order partial derivatives. For example, infinitesimal perturbation analysis typically estimates only first derivatives consistently. One notable exception is the likelihood ratio method, for which second derivative estimators are easily constructed.
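To make the Section 2 procedure concrete, the following sketch forms Σ_n as in (6) from the observed gradient estimates and evaluates the chi-square region test; it takes a consistent estimator C_n of C as given, since assembling C_n (which requires an estimate of H) is exactly the difficulty discussed above. The function names are ours.

    import numpy as np
    from scipy.stats import chi2

    def sigma_hat(Z):
        """Equation (6): (1/n) * sum_i Z_i(theta_{i-1})^T Z_i(theta_{i-1}).

        Z is an (n, d) array whose i-th row is the gradient estimate observed at
        iteration i; rows are treated as row vectors, as in the paper."""
        n = Z.shape[0]
        return Z.T @ Z / n

    def in_chi2_region(theta, theta_n, C_n, n, delta=0.05):
        """Check n * (theta_n - theta) C_n^{-1} (theta_n - theta)^T <= z, where
        P(chi^2_d > z) = delta and C_n is a (given) consistent estimator of C."""
        diff = np.asarray(theta_n, dtype=float) - np.asarray(theta, dtype=float)
        stat = n * diff @ np.linalg.solve(C_n, diff)
        return stat <= chi2.ppf(1.0 - delta, diff.size)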

3 A GENERAL APPROACH TO CONSTRUCTION OF CONFIDENCE REGIONS

Given the difficulties associated with consistent estimation of the covariance matrix C, we seek an alternative method. To place our alternative methodology in its appropriate context, we describe the idea in a more general setting. Suppose that we wish to compute a parameter α ∈ ℝ^d for which there exists an estimator α_n satisfying a CLT. In particular, assume:

Assumption B. There exists a symmetric positive definite matrix Γ such that n^{1/2}(α_n − α) ⇒ N(0, Γ) as n → ∞.

We wish to construct confidence regions for α without requiring existence of a consistent estimator for Γ.

The idea is to compute α by simulating m independent replications of the estimator α_n (thereby consuming roughly mn computer time units). Let α_1n, α_2n, ..., α_mn be the m resulting copies of the random vector α_n. The natural estimator for α is then ᾱ(m, n) = (1/m) ∑_{i=1}^{m} α_in. The sample covariance matrix V(m, n) is given by

V(m, n) = (1/(m−1)) ∑_{i=1}^{m} (α_in − ᾱ(m, n))^T (α_in − ᾱ(m, n)).

Theorem 2. Under Assumption B,

a) √(mn) (ᾱ(m, n) − α) ⇒ N(0, Γ) as n → ∞;

b) if m ≥ d + 1, then

m (ᾱ(m, n) − α) V(m, n)^{−1} (ᾱ(m, n) − α)^T ⇒ (d(m − 1)/(m − d)) F_(d, m−d)

as n → ∞, where F_(d, m−d) is an F-distributed random variable with (d, m − d) degrees of freedom.

Proof. Part a) of Theorem 2 is a simple consequence of the continuous mapping principle. For part b) of Theorem 2, we note that if N_1(0, Γ), ..., N_m(0, Γ) are m independent and identically distributed random vectors with distribution N(0, Γ), then the random matrix

(1/(m−1)) ∑_{i=1}^{m} (N_i(0, Γ) − (1/m) ∑_{j=1}^{m} N_j(0, Γ))^T (N_i(0, Γ) − (1/m) ∑_{j=1}^{m} N_j(0, Γ))

is almost surely non-singular (since m ≥ d + 1); see p. 208 of Searle (1982). Because the matrix inverse functional is continuous in a neighborhood of any non-singular matrix, the function g : (ℝ^d)^m → ℝ given by

g(x_1, ..., x_m) = ((1/m) ∑_{j=1}^{m} x_j)
    · [ (1/(m−1)) ∑_{l=1}^{m} (x_l − (1/m) ∑_{j=1}^{m} x_j)^T (x_l − (1/m) ∑_{j=1}^{m} x_j) ]^{−1}
    · ((1/m) ∑_{j=1}^{m} x_j)^T

is almost surely continuous at (N_1(0, Γ), ..., N_m(0, Γ)). The continuous mapping principle then yields the second result. □

Theorem 2 applies immediately to the construction of confidence regions for stochastic approximations. Suppose that we independently replicate the stochastic approximation m times, thereby yielding m independent copies θ_1n, θ_2n, ..., θ_mn of the random vector θ_n. Put

θ̄(m, n) = (1/m) ∑_{i=1}^{m} θ_in,

Ṽ(m, n) = (1/(m−1)) ∑_{i=1}^{m} (θ_in − θ̄(m, n))^T (θ_in − θ̄(m, n)),

and set

Λ_mn(z) = { θ : m (θ̄(m, n) − θ) Ṽ(m, n)^{−1} (θ̄(m, n) − θ)^T ≤ z d(m − 1)/(m − d) }.

Proposition 3. Assume the conditions of Theorem 1, and suppose that EZ(θ*)^T Z(θ*) is non-singular. If m ≥ d + 1, then

P(θ* ∈ Λ_mn(z)) → 1 − δ

as n → ∞, where z is selected so that P(F_(d, m−d) ≤ z) = 1 − δ.

The proof is an easy application of Theorem 2. Proposition 3 asserts that Λ_mn(z) is (for large n) an approximate 100(1 − δ)% confidence region for θ*. In contrast to the confidence region approach of Section 2 (in which C was consistently estimated), this procedure can be easily implemented. Note that the current procedure is essentially a "cancellation procedure", in the terminology of Glynn and Iglehart (1990).

One additional advantage of the current approach is that it requires the simulationist to run m independent replications of the stochastic approximation, providing m independent estimators of the maximizer θ*. These multiple independent re-starts of the stochastic approximation procedure permit the simulationist to test the question of whether the data collected is consistent with α(·) having a unique local maximizer. This question is of great importance from an applied standpoint, since global maximization is typically the simulationist's objective. We shall pursue this important issue in an expanded version of this paper.
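A minimal sketch of the cancellation procedure follows, assuming the terminal iterates θ_1n, ..., θ_mn of the m independent replications are stacked as the rows of an array; the function name and the SciPy call are ours, not the paper's. Because only the replicate iterates enter, no estimate of H or of EZ(θ*)^T Z(θ*) is required.

    import numpy as np
    from scipy.stats import f as f_dist

    def cancellation_region(theta_reps, theta, delta=0.05):
        """Membership test for the confidence region Lambda_mn(z) of Proposition 3.

        theta_reps : (m, d) array; row i is the terminal iterate theta_{in} of the
                     i-th independent stochastic-approximation replication.
        theta      : candidate point (e.g. theta* when estimating coverage).
        Requires m >= d + 1 so that V~(m, n) is almost surely non-singular."""
        m, d = theta_reps.shape
        if m < d + 1:
            raise ValueError("need m >= d + 1 replications")
        theta_bar = theta_reps.mean(axis=0)                 # theta-bar(m, n)
        R = theta_reps - theta_bar
        V = R.T @ R / (m - 1)                               # sample covariance V~(m, n)
        diff = theta_bar - np.asarray(theta, dtype=float)
        stat = m * diff @ np.linalg.solve(V, diff)          # m (theta-bar - theta) V^{-1} (.)^T
        z = f_dist.ppf(1.0 - delta, d, m - d)               # P(F_{d, m-d} <= z) = 1 - delta
        return stat <= z * d * (m - 1) / (m - d)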
4 COMPUTATIONAL RESULTS

In this section, we test the confidence region procedure proposed in Section 3 on two different examples. The first involves a scalar decision parameter, whereas the second example concerns a two-dimensional vector of decision parameters.

4.1 Model 1

The first model is an M/M/1 queue with arrival rate λ = 1 and mean service time θ ∈ Θ = [0.05, 0.95]. Therefore, the service time V has density f_θ(x) = exp(−x/θ)/θ. The goal is to minimize

α(θ) = w(θ) + 1/θ,    (8)

where w(θ) is the mean steady-state sojourn time per customer in the system associated with parameter θ. It can be shown that w(θ) = θ/(1 − θ). Thus the optimal value is θ* = 0.5. Figure 1 shows the shape of α(θ) on the interval [0.05, 0.95]. This model appeared in L'Ecuyer, Giroux, and Glynn (1994).

Figure 1: Objective Function of Model 1 (plot of α(θ) = w(θ) + 1/θ on [0.05, 0.95])

We implemented the Robbins-Monro algorithm using the likelihood ratio technique to estimate the derivative; see L'Ecuyer and Glynn (1994) for details on constructing the likelihood ratio derivative estimator for this model. The initial point θ_0 is chosen uniformly from the interval Θ. The step parameter a in (3) is set to 0.1, and the simulation budget at iteration n is set to be proportional to √n (such a simulation time allocation has been shown to be empirically efficient in L'Ecuyer, Giroux, and Glynn (1994)).

Table 1 summarizes the numerical results of the simulation runs. The confidence level is 95%, and the 95% confidence interval for the coverage probability is estimated from 100 replications. The coverage probabilities are reasonably close to the nominal level of 95%, and exhibit some degree of convergence to their nominal level as n grows larger. (See, in particular, the results reported for m = 5.) Although n does not appear to have a significant effect on the coverage accuracy, larger n does narrow the confidence interval, i.e., the variation of the θ_in is smaller.

Table 1: Numerical Summary of Model 1. The Ideal Coverage Probability Is 0.95

            m = 3                    m = 5
    n       CI for coverage prob.    CI for coverage prob.
    64      0.87 ± 0.066             0.54 ± 0.098
    128     0.86 ± 0.068             0.66 ± 0.093
    256     0.90 ± 0.059             0.82 ± 0.076
    512     0.89 ± 0.062             0.74 ± 0.086
    1024    0.86 ± 0.068             0.80 ± 0.079
    2048    0.86 ± 0.068             0.87 ± 0.066

4.2 Model 2

The second model is an M/M/1 queue with arrival rate λ ∈ [1, 2.5] and service rate µ ∈ [3.5, 6]. Thus, the service time V has density f_µ(x) = µ exp(−µx) and the inter-arrival time U has density f_λ(x) = λ exp(−λx). The goal is to minimize

α(λ, µ) = w(λ, µ) + 1/λ + µ/4,    (9)

where w(λ, µ) is the mean steady-state customer sojourn time for the system associated with arrival rate λ and service rate µ. Again, it can be shown that w(λ, µ) = 1/(µ − λ), so that the optimal value is (λ*, µ*) = (2, 4). The shape of α(λ, µ) is displayed in Figure 2.

Again, we implement the Robbins-Monro algorithm using the likelihood ratio gradient estimation algorithm. The initial point (λ_0, µ_0) is chosen uniformly from the rectangle [1, 2.5] × [3.5, 6]. The step parameter a in (3) is set to 0.5, and the simulation budget at iteration n is set to be proportional to √n.
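The coverage estimates of Tables 1 and 2 can be reproduced in outline with a harness of the following form. The queueing simulation and the likelihood ratio derivative estimator are not reconstructed here: run_replication is a hypothetical stand-in for one complete Robbins-Monro run, and the routine reuses cancellation_region() from the sketch following Proposition 3. The reported half-width is the normal-approximation 95% interval based on 100 replications, matching the ± entries in the tables.

    import numpy as np

    def estimate_coverage(run_replication, theta_star, m, n, reps=100, delta=0.05, seed=1):
        """Monte Carlo estimate of P(theta* in Lambda_mn(z)), as in Tables 1 and 2.

        run_replication(n, rng) is assumed to run one independent Robbins-Monro
        replication for n iterations and return the terminal iterate (length-d array)."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(reps):
            theta_reps = np.vstack([run_replication(n, rng) for _ in range(m)])
            hits += bool(cancellation_region(theta_reps, theta_star, delta))
        p_hat = hits / reps
        half_width = 1.96 * np.sqrt(p_hat * (1.0 - p_hat) / reps)   # 95% CI half-width
        return p_hat, half_width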

Figure 2: Objective Function of Model 2 (surface plot of α(λ, µ) = w(λ, µ) + 1/λ + µ/4)

Table 2 summarizes the numerical results of the simulation runs. Again, the confidence level is 95%, and the 95% confidence interval for the coverage probability is estimated from 100 replications. Again, the empirical evidence is in reasonably close agreement with the theory developed in Section 3.

Table 2: Numerical Summary of Model 2. The Ideal Coverage Probability Is 0.95

            m = 3                    m = 5
    n       CI for coverage prob.    CI for coverage prob.
    64      0.97 ± 0.034             0.89 ± 0.062
    128     0.93 ± 0.050             0.89 ± 0.062
    256     0.95 ± 0.043             0.88 ± 0.064
    512     0.91 ± 0.056             0.93 ± 0.050
    1024    0.96 ± 0.039             0.90 ± 0.059
    2048    0.96 ± 0.039             0.91 ± 0.056

The simulation results from these two models suggest that the procedure described in Section 3 is a pragmatic approach for constructing confidence intervals/regions for stochastic approximation algorithms.

5 CONCLUDING REMARKS

In this paper, we have offered an approach to the construction of confidence regions for stochastic approximation algorithms of Robbins-Monro type. A principal advantage of our proposed methodology is the ease with which it can be implemented from a practical standpoint. An additional feature of our proposed method is that its requirement to simulate multiple independent replications of the stochastic approximation procedure offers the simulationist an opportunity to test the question of whether the stochastic approximation has indeed converged to a global optimizer (as opposed to convergence to a local optimizer). We intend to pursue this question at greater depth in future work.

In future work, we hope to study the following issues concerning extensions of the methodology developed in this paper as well as extensions of other methods for simulation optimization:

1. Extend the theory and body of computational experience to the class of Kiefer-Wolfowitz algorithms;

2. Extend the theory and body of computational experience to the averaging algorithms proposed by Polyak and Juditsky (1992), as well as Kushner and Yin (1997);

3. Develop theory and algorithms to address the situation in which the gain constant a is chosen (inadvertently) so small that the matrix aH + (1/2)I may have positive eigenvalues.

REFERENCES

Billingsley, P. 1995. Probability and Measure. Third Edition. New York: John Wiley & Sons.
Fu, M. C., and J. Q. Hu. 1997. Conditional Monte Carlo: Gradient estimation and optimization applications. Boston: Kluwer Academic Publishers.
Gill, P. E., W. Murray, and M. H. Wright. 1981. Practical Optimization. London and New York: Academic Press.
Glasserman, P. 1991. Gradient estimation via perturbation analysis. Boston: Kluwer.
Glynn, P. W. 1986. Stochastic approximation for Monte Carlo optimization. In Proceedings of the 1986 Winter Simulation Conference, 356–365.
Glynn, P. W. 1990. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33:75–84.
Glynn, P. W., and D. L. Iglehart. 1990. Simulation output analysis using standardized time series. Mathematics of Operations Research 15:1–16.
Kushner, H. J., and G. G. Yin. 1997. Stochastic approximation algorithms and applications. New York: Springer-Verlag.
L'Ecuyer, P., and P. W. Glynn. 1994. Stochastic optimization by simulation: Convergence proofs for the GI/G/1 queue in steady-state. Management Science 40:1562–1578.
L'Ecuyer, P., N. Giroux, and P. W. Glynn. 1994. Stochastic optimization by simulation: Numerical experiments with the M/M/1 queue in steady-state. Management Science 40:1245–1261.
Ljung, L., G. Ch. Pflug, and H. Walk. 1992. Stochastic approximation and optimization of random systems. Basel: Birkhäuser Verlag.
Loeve, M. 1977. Probability Theory, Vol. 1. 4th edition. Springer-Verlag.

Luenberger, D. G. 1984. Introduction to Linear and Nonlinear Programming. Menlo Park, California: Addison-Wesley.
Nevel'son, M. B., and R. Z. Has'minskii. 1973. Stochastic Approximation and Recursive Estimation. Providence, Rhode Island: American Mathematical Society.
Polyak, B. T., and A. B. Juditsky. 1992. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30:838–855.
Rubinstein, R. Y. 1992. Sensitivity analysis of discrete event systems by the "push-out" method. Annals of Operations Research 39:229–237.
Searle, S. R. 1982. Matrix Algebra Useful for Statistics. New York: John Wiley.
Williams, D. 1991. Probability with Martingales. Cambridge Mathematical Textbooks. Cambridge: Cambridge University Press.

AUTHOR BIOGRAPHIES

MING-HUA HSIEH is an Assistant Professor in the Department of Management Information Systems at National Chengchi University, Taiwan. From 1997 to 1999, he was a software designer at Hewlett-Packard Company, California. He is a member of INFORMS. His research interests include simulation methodology and financial engineering. His e-mail address is .

PETER W. GLYNN received his Ph.D. from Stanford University, after which he joined the faculty of the Department of Industrial Engineering at the University of Wisconsin-Madison. In 1987, he returned to Stanford, where he is the Thomas Ford Professor of Engineering in the Department of Management Science and Engineering. His research interests include discrete-event simulation, computational probability, queueing, and general theory for stochastic systems. His e-mail address is .