
Acceleration of Stochastic Approximation in Adaptive Importance Sampling

Sheng Xu
Supervised by Dr Reiichiro Kawai
University of Sydney

Abstract

Stochastic approximation can be used as a computationally simple method of updating the adaptive importance sampling parameter in Monte Carlo simulations. However, the performance of this method is poor when applied to simulations of events which occur with very low probability. In this project, we attempt to improve the stochastic approximation method in such simulations by introducing an auxiliary parameter. By doing so, we develop a new algorithm which performs noticeably better in the early stages of such simulations and exhibits significantly higher consistency across repeated runs. We also prove the asymptotic normality of our algorithm, and demonstrate the effectiveness of our method with a numerical example. For brevity's sake, many simplifications have been made and details omitted in this version of the report.

1 Introduction

The general problem is to use Monte Carlo simulations to estimate some constant $C$ which can be expressed as $C = \mathbb{E}_P[F(U)]$, where $U : \Omega \to \mathbb{R}^d$ is a continuous random vector in the probability space $(\Omega, \mathcal{F}, P)$ and $F : \mathbb{R}^d \to \mathbb{R}$ is a Borel-measurable function. Perhaps the best-known estimator for $C$ is the sample average

$$\frac{1}{N}\sum_{n=1}^{N} F(U_n),$$
where the $U_n$ are independent random vectors with the same distribution as $U$. By the central limit theorem (CLT), we have that
$$\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^{N} F(U_n) - C\right) \xrightarrow{d} \mathcal{N}\left(0,\, \mathbb{E}[|F(U)|^2] - C^2\right).$$

It is important to find estimators which have lower asymptotic variance than this crude estimator, as this translates to an increase in the efficiency of the Monte Carlo simulation. Some common variance reduction techniques are importance sampling, control variates, and stratified sampling. In importance sampling, the underlying measure is changed from $P$ to $P_\theta$, a member of some parametrised family of measures. Under $P_\theta$, we can write $C$ as
$$C = \mathbb{E}_{P_\theta}\left[F(U)\,\frac{p(U)}{p_\theta(U)}\right],$$
where $p$ and $p_\theta$ are the densities of $U$ with respect to $P$ and $P_\theta$, respectively. This yields a new unbiased estimator for $C$,
$$\frac{1}{N}\sum_{n=1}^{N} F(U_n)\,\frac{p(U_n)}{p_\theta(U_n)},$$
whose variance depends on $\theta$. Depending on the value of $\theta$, the variance of this estimator could be either lower or higher than that of the crude estimator. Unfortunately, in practice we usually do not have enough information a priori to know what a good choice of $\theta$ might be. One way of dealing with this is to choose $\theta$ adaptively. In adaptive importance sampling, the simulation to estimate $C$ and a search algorithm for $\theta$ are run concurrently. This area has attracted a great deal of attention recently [Kawai, 2015a,b; Lemaire & Pagès, 2010; Arouna, 2004; Kawai, 2008]. One search algorithm for $\theta$ is stochastic approximation, pioneered in [Robbins & Monro, 1951]. While stochastic approximation algorithms perform well in general, it is known that they perform poorly in simulations of rare events, where $F(U) = 0$ with very high probability. In such simulations, the convergence of $\theta_n$ to its optimal value $\theta^*$ is extremely slow; typically, $\theta$ remains stuck near its initial value $\theta_0$ for a significant amount of time. This is very undesirable as it means that there is little to no variance reduction.
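As a toy illustration (ours, not from the report), the snippet below estimates a small tail probability under a unit normal with both the crude estimator and an importance sampling estimator whose mean shift $\theta$ is picked by hand; for the $\mathcal{N}(\theta, 1)$ family, the likelihood ratio is $p(z)/p_\theta(z) = e^{-\theta z + \theta^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Toy target: C = E[F(Z)], Z ~ N(0,1), F(z) = 1{z > 3},
# a rare-event probability (true value is about 1.35e-3).
F = lambda z: (z > 3.0).astype(float)

# Crude Monte Carlo estimator.
z = rng.standard_normal(N)
crude = F(z).mean()

# Importance sampling: sample from N(theta, 1) and reweight
# by the likelihood ratio exp(-theta*z + theta**2/2).
theta = 3.0                                 # hand-picked shift
z = rng.standard_normal(N) + theta
is_est = (F(z) * np.exp(-theta * z + theta**2 / 2)).mean()

print(crude, is_est)  # similar means; far lower variance for IS
```

The difficulty addressed in this report is precisely that a good $\theta$ like the one hard-coded here is not known in advance.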

The simulation of rare events, such as critical failures of transportation and communications infrastructure, is crucial to many fields, and so it is of great interest to accelerate the stochastic approximation method in these simulations. To this end, we introduce an auxiliary parameter $\lambda$ into the stochastic approximation method through an additional change of measure. The introduction of an auxiliary parameter in adaptive importance sampling was first done in [Lemaire & Pagès, 2010], though for a different purpose. More recently, in [Kawai, 2015a], an auxiliary parameter was introduced to accelerate sample average approximation in adaptive importance sampling. This project can be seen as the stochastic approximation version of [Kawai, 2015a]. In Section 2, we formally state our assumptions, set up our adaptive importance sampling framework, and give a brief review of the history of stochastic approximation. In Section 3, we present our algorithm and theorems on its convergence. In Section 4, we give an example to demonstrate the effectiveness of our algorithm. We make some concluding remarks in Section 5.

2 Background

2.1 Adaptive Importance Sampling

The following, up to and including Proposition 2.1, is taken from [Kawai, 2015a,b]. We assume (without loss of generality) that $U$ is distributed uniformly on the $d$-dimensional hypercube $(0,1)^d$. We can now write $C$ as
$$C = \int_{(0,1)^d} F(u)\,du.$$
Henceforth, we omit the measure $P$ when writing expectations, as all following results and calculations are done with respect to $P$, although we may change measures during a derivation or proof. Let $g(\cdot\,;\theta)$ be a density function with parameter $\theta \in \Theta$, let $G(\cdot\,;\theta)$ denote the corresponding distribution function, and fix a reference parameter $\theta_0 \in \Theta$. Then, by changing the underlying measure, we have that
$$C = \int_{(0,1)^d} F(G(G^{-1}(u;\theta);\theta_0))\,\frac{g(G^{-1}(u;\theta);\theta_0)}{g(G^{-1}(u;\theta);\theta)}\,du.$$

Note that, if $\theta = \theta_0$, then this framework reduces to the crude estimator. The variance of this estimator is given by
$$\int_{(0,1)^d} \left[F(G(G^{-1}(u;\theta);\theta_0))\,\frac{g(G^{-1}(u;\theta);\theta_0)}{g(G^{-1}(u;\theta);\theta)}\right]^2 du - C^2 = \int_{(0,1)^d} |F(u)|^2\,\frac{g(G^{-1}(u;\theta_0);\theta_0)}{g(G^{-1}(u;\theta_0);\theta)}\,du - C^2 =: V(\theta) - C^2.$$
For simplicity, we introduce some additional notation:

$$H(u;\theta,\lambda) := \frac{g(G^{-1}(u;\lambda);\theta_0)}{g(G^{-1}(u;\lambda);\theta)}, \qquad R(u;\theta) := F(G(G^{-1}(u;\theta);\theta_0))\,H(u;\theta,\theta),$$
so that
$$C = \mathbb{E}[R(U;\theta)], \qquad V(\theta) = \mathbb{E}[|F(U)|^2 H(U;\theta,\theta_0)].$$
Our problem is then to minimise $V(\theta)$.

Proposition 2.1. If $g(\cdot\,;\theta)$ is chosen so that it is component-wise Normal or Exponential, then

(a) $V(\theta)$ is twice continuously differentiable and strictly convex, and

$$\nabla_\theta V(\theta) = \mathbb{E}[|F(U)|^2 \nabla_\theta H(U;\theta,\theta_0)], \qquad \mathrm{Hess}_\theta(V(\theta)) = \mathbb{E}[|F(U)|^2\,\mathrm{Hess}_\theta(H(U;\theta,\theta_0))].$$

(b) $\theta^* = \arg\min V(\theta)$ exists, is unique, and is the unique solution to $\nabla_\theta V(\theta) = 0$.

Assuming Proposition 2.1 holds, we implement adaptive importance sampling as follows.

FOR n = 1 TO N:
1. SAMPLE $U_n \sim \mathcal{U}((0,1)^d)$
2. ADD $R(U_n;\theta_{n-1})$ TO THE SIMULATED DATASET
3. $\theta_n$ = [SEARCH ALGORITHM FOR $\theta^*$]
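In Python, this loop might look like the sketch below; `R` and `search_step` are placeholder names of ours for the estimator of Section 2.1 and the search algorithm of step 3.

```python
import numpy as np

def adaptive_is(R, search_step, theta0, d, N, seed=0):
    """Generic adaptive importance sampling: estimate C = E[R(U; theta)]
    while a search algorithm refines theta in parallel."""
    rng = np.random.default_rng(seed)
    theta, samples = theta0, []
    for n in range(1, N + 1):
        u = rng.uniform(size=d)           # step 1: U_n ~ U((0,1)^d)
        samples.append(R(u, theta))       # step 2: record R(U_n; theta_{n-1})
        theta = search_step(u, theta, n)  # step 3: one update towards theta*
    return np.mean(samples), theta
```

Note that each $R(U_n;\theta_{n-1})$ is computed with the current parameter before it is updated, matching the order of steps 2 and 3.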

2.2 Stochastic Approximation

Stochastic approximation was first proposed in [Robbins & Monro, 1951] as a method to solve equations of the form $M(x) = \alpha$, where $M$ is monotonic and is observed with some noise. In our case, $M(\theta) = \nabla_\theta V(\theta)$ and $\alpha = 0$. If there exists $C' > 0$ such that, for all $\theta \in \Theta$, $\left\| |F(U)|^2 \nabla_\theta H(U;\theta,\theta_0) \right\|_2 \le C'(1 + \|\theta\|)$, then $\theta_n \to \theta^*$ in probability. However, as remarked in [Lemaire & Pagès, 2010], this assumption of sub-linear growth is rarely satisfied in the framework of adaptive importance sampling. In [Chen & Zhu, 1986], the authors developed the constrained stochastic approximation algorithm, which removes the sub-linear growth assumption:
$$\theta_{n-\frac{1}{2}} = \theta_{n-1} - \gamma_n |F(U_n)|^2 \nabla_\theta H(U_n;\theta_{n-1},\theta_0),$$
$$\text{if } \theta_{n-\frac{1}{2}} \in K_{\sigma_{n-1}}, \quad \theta_n = \theta_{n-\frac{1}{2}}, \quad \sigma_n = \sigma_{n-1}, \tag{2.1}$$
$$\text{if } \theta_{n-\frac{1}{2}} \notin K_{\sigma_{n-1}}, \quad \theta_n = \theta_0, \quad \sigma_n = \sigma_{n-1} + 1,$$
where

• $\sigma_0 = 0$, so $\sigma_n$ counts the number of projections up to the $n$-th step;
• $\{K_n\}_{n\in\mathbb{N}}$ is a sequence of compact sets satisfying $K_{n-1} \subset \mathrm{int}(K_n)$ and $\bigcup_{n=1}^{\infty} K_n = \Theta$;
• $\{\gamma_n\}_{n\in\mathbb{N}}$ is a sequence of step sizes satisfying $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$;

• $\nabla_\theta H(U_n;\theta_{n-1},\theta_0)$ is shorthand for $\nabla_\theta H(u;\theta,\theta_0)$ evaluated at $u = U_n$, $\theta = \theta_{n-1}$. Notation of this form will be used regularly henceforth.
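For concreteness, here is a minimal Python sketch of one update of Algorithm (2.1), assuming for illustration that the compact sets $K_m$ are centred balls of growing radius; `F`, `grad_H`, `gamma`, and `K_radius` are names of ours.

```python
import numpy as np

def sa_update(u, theta, sigma, n, theta0, F, grad_H, gamma, K_radius):
    """One step of the constrained stochastic approximation (2.1):
    returns (theta_n, sigma_n) from (theta_{n-1}, sigma_{n-1}) and U_n."""
    candidate = theta - gamma(n) * abs(F(u)) ** 2 * grad_H(u, theta, theta0)
    if np.linalg.norm(candidate) <= K_radius(sigma):
        return candidate, sigma   # candidate stays inside K_{sigma_{n-1}}
    return theta0, sigma + 1      # otherwise reset to theta0, count a projection
```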

In simulations of rare events, $F(U_n)$ is 0 with very high probability. This causes Algorithm (2.1) to set $\theta_n = \theta_{n-1}$ repeatedly, dramatically slowing down the convergence of $\theta_n$ to $\theta^*$. This is particularly bad in the early stages of the algorithm: until $\theta$ is updated at least once, it is stuck at $\theta_0$, which gives no reduction in variance whatsoever.
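A back-of-the-envelope calculation (ours, using the failure probability from the example of Section 4) shows how severe this stalling is:

```python
# With P(F(U) = 0) = 0.9862, the gradient in (2.1) vanishes on most draws.
p_nonzero = 1 - 0.9862
print(1000 * p_nonzero)   # ~13.8 informative draws in 1000 iterations
print(0.9862 ** 100)      # ~0.25: a one-in-four chance that the first
                          # 100 iterations leave theta stuck at theta0
```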

3 Main Results

As in [Kawai, 2015a], we introduce the auxiliary parameter $\lambda \in \Theta$ into $V$ by changing the measure:
$$V(\theta) = \int_{(0,1)^d} |F(G(G^{-1}(u;\lambda);\theta_0))|^2\, H(u;\theta,\lambda)\,H(u;\lambda,\lambda)\,du =: \mathbb{E}[N(U;\theta,\lambda)].$$

Note that the framework in Section 2 can be recovered by letting λ = θ0. We stress that V(θ) and ∇θV(θ) have not been changed by the introduction of λ. We have simply found new estimators for them.

3.1 Convergence

The following convergence results (adapted from [Lelong, 2013; Kawai, 2015b,a]) apply to Algorithm (3.1) below. Define $\delta M_n(\lambda) := \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1}) - \nabla_\theta V(\theta_{n-1})$.

Assumption 3.1. (a) There exist a function $y : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ with $\lim_{|\theta|\to 0} \|y(\theta)\| = 0$ and a matrix $A$ whose eigenvalues all have positive real parts, such that $\nabla_\theta V(\theta) = A(\theta - \theta^*) + y(\theta - \theta^*)(\theta - \theta^*)$.

(b) For every $q > 0$, the series $\sum_{n=1}^{\infty} \gamma_{n+1}\,\delta M_{n+1}(\lambda)\,\mathbf{1}\{|\theta_n - \theta^*| \le q\}$ converges a.s.

(c) There exist $\rho > 0$ and $\eta > 0$ such that
$$\sup_{n\in\mathbb{N}} \mathbb{E}\left[|\delta M_n(\lambda)|^{2+\rho}\,\mathbf{1}\{|\theta_{n-1} - \theta^*| \le \eta\}\right] < \infty,$$
and there exists a symmetric positive definite matrix $\Sigma(\lambda)$ such that
$$\mathbb{E}\left[\delta M_n(\lambda)\,\delta M_n(\lambda)^{\mathsf{T}} \mid \mathcal{F}_{n-1}\right]\mathbf{1}\{|\theta_{n-1} - \theta^*| \le \eta\} \xrightarrow{P} \Sigma(\lambda).$$

(d) There exists $\mu > 0$ such that, for every $n \ge 0$, $d(\theta^*, \partial K_n) \ge \mu$.

Theorem 3.2. If $0.5 < \alpha < 1$, then under Assumption 3.1, as $n \to \infty$,
$$\frac{\theta_n - \theta^*}{\sqrt{\gamma_n}} \xrightarrow{d} \mathcal{N}\left(0,\, \int_0^{\infty} \exp[-At]\,\Sigma(\lambda)\,\exp[-A^{\mathsf{T}}t]\,dt\right).$$
For $\alpha = 1$, under an additional assumption, a similar result holds with a slightly different asymptotic covariance matrix.

Theorem 3.3. It holds almost surely that, as $N \to \infty$,
$$\frac{1}{N}\sum_{n=1}^{N} R(U_n;\theta_{n-1}) \to C.$$
Moreover, if Theorem 3.2 holds and there exists $q > 2$ such that
$$\limsup_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} \left(\int_{(0,1)^d} |H(u;\theta_{n-1},\theta_0)|^{q-1}\,|F(u)|^q\,du\right)^{2/q} < \infty, \quad \text{a.s.},$$
then it holds that, as $N \to \infty$,
$$\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^{N} R(U_n;\theta_{n-1}) - C\right) \xrightarrow{d} \mathcal{N}\left(0,\, V(\theta^*) - C^2\right).$$
Note that Theorem 3.3 is not simply the strong law of large numbers (SLLN) or the standard CLT, as the summands are not identically distributed due to the updating of $\theta_n$. Instead, Theorem 3.3 is an application of the martingale CLT.

3.2 Updating the Auxiliary Parameter

Our accelerated algorithm is (we omit the justification)
$$\theta_{n-\frac{1}{2}} = \theta_{n-1} - \gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1}),$$
$$\text{if } \theta_{n-\frac{1}{2}} \in K_{\sigma_{n-1}}, \quad \theta_n = \theta_{n-\frac{1}{2}}, \quad \sigma_n = \sigma_{n-1}, \tag{3.1}$$
$$\text{if } \theta_{n-\frac{1}{2}} \notin K_{\sigma_{n-1}}, \quad \theta_n = \theta_0, \quad \sigma_n = \sigma_{n-1} + 1,$$
followed by the update of $\lambda$:
• if $\theta$ has not updated yet, draw $\lambda_n$ uniformly from $K_{\sigma_n}$;
• else if this is the first update for $\theta$ and $n \le n_0$, set $\lambda_n = \theta_0 - L\gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ if it lies in $K_{\sigma_n}$, and $\lambda_n = \theta_n$ otherwise;
• else if $n \le n_0$, set $\lambda_n = \lambda_{n-1} - L\gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ if it lies in $K_{\sigma_n}$, and $\lambda_n = \lambda_{n-1}$ otherwise;
• else set $\lambda_n = \lambda_{n-1}$,
where

• $\sigma_0 = 0$, so $\sigma_n$ counts the number of projections up to the $n$-th step;
• $\{K_n\}_{n\in\mathbb{N}}$ is a sequence of compact sets satisfying $K_{n-1} \subset \mathrm{int}(K_n)$ and $\bigcup_{n=1}^{\infty} K_n = \Theta$;
• $\gamma_n = \gamma(n+1)^{-\alpha}$ with $0.5 < \alpha \le 1$ and $\gamma > 0$;
• $L \gg 1$ and $n_0 < N$ are constants.

In practice, aside from jump-starting the convergence of $\theta_n$ to $\theta^*$, Algorithm (3.1) also tends to have better consistency than Algorithm (2.1). The example in Section 4 demonstrates clearly the improved asymptotic consistency of Algorithm (3.1) over Algorithm (2.1), as well as acceleration in the early stages. Note that Algorithm (3.1) is just one of many possible ways to update $\lambda$ in this framework, and finding better ways to update $\lambda$ could be an area of further investigation.
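The sketch below shows one reading of the update rules in Python, again with illustrative ball-shaped $K_m$; `grad_N`, `K_radius`, and the `n_updates` counter are our names, and the handling of the first-update branch reflects our interpretation of (3.1).

```python
import numpy as np

def accelerated_update(u, theta, lam, sigma, n, n_updates, rng, p):
    """One step of Algorithm (3.1). `p` bundles the problem data
    (grad_N, theta0, gamma, L, n0, K_radius); n_updates counts how
    many times theta has actually moved so far."""
    g = p.grad_N(u, theta, lam)
    candidate = theta - p.gamma(n) * g
    if np.linalg.norm(candidate) <= p.K_radius(sigma):
        theta_n, sigma_n = candidate, sigma
    else:
        theta_n, sigma_n = p.theta0, sigma + 1
    moved = not np.allclose(theta_n, theta)

    r = p.K_radius(sigma_n)
    if n_updates == 0 and not moved:
        # theta has not updated yet: explore K_{sigma_n} uniformly
        lam_n = rng.uniform(-r, r)
    elif n_updates == 0 and moved and n <= p.n0:
        # first update for theta: amplified move started from theta0
        trial = p.theta0 - p.L * p.gamma(n) * g
        lam_n = trial if np.linalg.norm(trial) <= r else theta_n
    elif n <= p.n0:
        # amplified (L >> 1) recursion for lambda while n <= n0
        trial = lam - p.L * p.gamma(n) * g
        lam_n = trial if np.linalg.norm(trial) <= r else lam
    else:
        lam_n = lam  # freeze lambda after n0
    return theta_n, lam_n, sigma_n, n_updates + int(moved)
```

The large constant $L$ lets $\lambda$ reach the rare-event region quickly, so that $\nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ is non-zero far more often than the gradient in (2.1).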

4 Numerical Example

For simplicity, we present a one-dimensional example, although it is not hard to apply Algorithm (3.1) to high-dimensional problems. In the classic Black-Scholes model, the price of a European call option is given by
$$C = \mathbb{E}\left[e^{-rT}\max\left(S_0\,e^{\left(r - \frac{\sigma^2}{2}\right)T + \sigma\sqrt{T}\,\Phi^{-1}(U)} - K,\; 0\right)\right] =: \mathbb{E}[F(U)],$$
where $r$ is the interest rate, $T$ is the maturation date, $S_0$ is the current stock price, $\sigma$ is the volatility of the stock, $K$ is the strike price, and $\Phi$ is the standard normal distribution function. For this example, we set $r = 0.05$, $T = 0.25$, $S_0 = 31$, $\sigma = 0.1$, and $K = 35$, which means the probability of $F(U) = 0$ is 0.9862. We choose $g(z;\theta) = \phi(z - \theta)$, i.e. normally distributed with mean $\theta$ and variance 1; note that this satisfies Proposition 2.1. We let $\theta_0 = 0$. Then we have that $\Theta = \mathbb{R}$, and
$$R(u;\theta) = F(\Phi(\Phi^{-1}(u) + \theta))\,e^{-\theta\Phi^{-1}(u) - \frac{\theta^2}{2}},$$
$$\nabla_\theta H(u;\theta,\theta_0) = (\theta - \Phi^{-1}(u))\,e^{\frac{\theta^2}{2} - \theta\Phi^{-1}(u)}.$$
By deterministic numerical approximation, $\theta^* = 2.7387$. Finally, we let $K_n = [-(n+2), n+2]$ and $\gamma_n = n^{-1}$.

After 1000 iterations, Algorithm (2.1) gives the estimate $\hat{C} = 0.0052$, which is quite far from the true value $C = 0.00853$. The sample variance was 0.0052, which is lower than the variance of the crude Monte Carlo estimator, 0.0097, but much higher than the minimum, 0.000076, obtained using $\theta^*$. Finally, $\theta_{1000} = 0.0264$: it has barely moved from its initial value and is far from $\theta^*$.

We now introduce the auxiliary parameter $\lambda$ to try to improve this simulation. After some calculations, we get
$$\nabla_\theta N(u;\theta,\lambda) = (\theta - \Phi^{-1}(u) - \lambda)\,\left|F(\Phi(\Phi^{-1}(u) + \lambda))\right|^2\, e^{\frac{\theta^2}{2} - \theta(\Phi^{-1}(u) + \lambda) - \lambda\Phi^{-1}(u) - \frac{\lambda^2}{2}}.$$
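These formulas translate directly into Python (a sketch of ours, using SciPy's `norm` for $\Phi$ and $\Phi^{-1}$):

```python
import numpy as np
from scipy.stats import norm

# Black-Scholes parameters from the text.
r, T, S0, sigma, K = 0.05, 0.25, 31.0, 0.1, 35.0

def F(u):
    """Discounted call payoff as a function of a uniform variate u."""
    z = norm.ppf(u)  # Phi^{-1}(u)
    ST = S0 * np.exp((r - sigma**2 / 2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(ST - K, 0.0)

def grad_H(u, theta):
    """d/dtheta of H(u; theta, theta0) with theta0 = 0."""
    z = norm.ppf(u)
    return (theta - z) * np.exp(theta**2 / 2 - theta * z)

def grad_N(u, theta, lam):
    """d/dtheta of N(u; theta, lam) for the unit-variance normal family."""
    z = norm.ppf(u)
    weight = np.exp(theta**2 / 2 - theta * (z + lam) - lam * z - lam**2 / 2)
    return (theta - z - lam) * F(norm.cdf(z + lam)) ** 2 * weight
```

Note that $\nabla_\theta N(u;\theta,\lambda)$ reduces to $|F(u)|^2\nabla_\theta H(u;\theta,\theta_0)$ at $\lambda = 0$, which is the comparison plotted in Figure 2.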

We let $L = 100$ and $n_0 = 900$, which is 90% of the duration of the simulation. For fairness, we use the same random seed to generate the $U_n$. After 1000 iterations, Algorithm (3.1) gives the estimate $\hat{C} = 0.0055$, which is marginally better than before. The sample variance was 0.0038, which is better than in the previous simulation. Finally, $\theta_{1000} = 0.1301$, which is closer to $\theta^*$ than in the previous simulation. Overall, the simulation has noticeably improved, and this improvement holds consistently across different random seeds. Algorithm (3.1) is also much more asymptotically consistent than Algorithm (2.1), as shown on the right side of Figure 1. Finally, for this example, Algorithm (2.1) updated $\theta$ 10 times while Algorithm (3.1) updated $\theta$ 386 times. Figure 2 shows a plot of the difference between Algorithms (3.1) and (2.1) in updating $\theta$, for $\theta = \theta_0$. For $\lambda < 0$ and $U < 0.98$, both algorithms fail to update $\theta$. For $U > 0.99$, both algorithms update $\theta$, but Algorithm (2.1) moves $\theta$ further. For $\lambda > 0$ and $0.6 < U < 0.98$, Algorithm (2.1) fails to update $\theta$ while Algorithm (3.1) does, though the update is relatively small. The improved asymptotic consistency observed is most likely due to updates being much more frequent.


Figure 1: On the left, the blue curve is a plot of $V(\theta)$ in a neighbourhood of $\theta^*$. The blue cross is $\theta_0$, the red one is $\theta_{1000}$ using Algorithm (2.1), and the yellow one is $\theta^*$. The pink asterisk is $\theta_{1000}$ using Algorithm (3.1). On the right is a histogram of $\theta_N$ for the two algorithms using $N = 30000$, $\gamma_n = 50n^{-0.51}$, $L = 10$, $n_0 = 0.9N$, and MATLAB random seeds 1 to 100. There is an outlier of 2.35 for Algorithm (3.1) not shown in the histogram.


Figure 2: A plot of $\nabla_\theta N(U;\theta,\lambda) - |F(U)|^2 \nabla_\theta H(U;\theta,\theta_0)$, for $\theta = \theta_0$.

5 Concluding Remarks

In this project, we present a promising way of accelerating stochastic approximation in adaptive importance sampling for simulations of rare events. A potential direction of future research is validation analysis of our method, that is, determining how large $N$ has to be in order for the algorithm to be within $\varepsilon$ of the optimal value $\theta^*$ with probability at least $1 - \alpha$. One could also investigate other methods of updating the auxiliary parameter $\lambda$. It is also possible that the introduction of an auxiliary parameter could improve stochastic approximation in other frameworks, such as mirror descent stochastic approximation [Nemirovski et al., 2009].

References

Arouna, B. 2004. Adaptive Monte Carlo method, a variance reduction technique. Monte Carlo Methods and Applications, 10(1), 1–24.

Chen, H.-F., & Zhu, Y.-M. 1986. Stochastic approximation procedure with randomly varying truncations. Scientia Sinica Series A.

Kawai, R. 2008. Adaptive Monte Carlo variance reduction for Lévy processes with two-time-scale stochastic approximation. Methodology and Computing in Applied Probability, 10(2), 199–223.

Kawai, R. 2015a. Acceleration of adaptive importance sampling with sample average approximation. Preprint.

Kawai, R. 2015b. Adaptive importance sampling Monte Carlo simulation for general multivariate probability laws. Preprint.

Lelong, J. 2013. Asymptotic normality of randomly truncated stochastic algorithms. ESAIM: Probability and Statistics, 17, 105–119.

Lemaire, V., & Pagès, G. 2010. Unconstrained recursive importance sampling. The Annals of Applied Probability, 20(3), 1029–1067.

Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 1574–1609.

Robbins, H., & Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
