
Acceleration of Stochastic Approximation in Adaptive Importance Sampling

Sheng Xu
Supervised by Dr Reiichiro Kawai
University of Sydney

Abstract

Stochastic approximation can be used as a computationally simple method of updating the adaptive importance sampling parameter in Monte Carlo simulations. However, the performance of this method is poor when applied to simulations of events which occur with very low probability. In this project, we attempt to improve the stochastic approximation method in such simulations by introducing an auxiliary parameter. By doing so, we develop a new algorithm which performs noticeably better in the early stages of such simulations and exhibits significantly higher consistency across repeated runs. We also prove the asymptotic normality of our algorithm, and demonstrate the effectiveness of our method with a numerical example. For brevity's sake, many simplifications have been made and details omitted in this version of the report.

1 Introduction

The general problem is to use Monte Carlo simulations to estimate some constant $C$ which can be expressed as $C = \mathbb{E}_P[F(U)]$, where $U : \Omega \to \mathbb{R}^d$ is a continuous random vector in the probability space $(\Omega, \mathcal{F}, P)$ and $F : \mathbb{R}^d \to \mathbb{R}$ is a Borel-measurable function. Perhaps the best-known estimator for $C$ is the sample average

$$\frac{1}{N}\sum_{n=1}^{N} F(U_n),$$
where the $U_n$ are independent random vectors with the same distribution as $U$. By the central limit theorem (CLT), we have that
$$\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^{N} F(U_n) - C\right) \xrightarrow{d} \mathcal{N}\left(0,\, \mathbb{E}[|F(U)|^2] - C^2\right).$$

It is important to find estimators which have lower asymptotic variance than this crude estimator, as this translates to an increase in the efficiency of the Monte Carlo simulation. Some common variance reduction techniques are importance sampling, control variates, and stratified sampling. In importance sampling, the underlying measure is changed from $P$ to $P_\theta$, a member of some parametrised family of measures. Under $P_\theta$, we can write $C$ as
$$C = \mathbb{E}_{P_\theta}\left[F(U)\,\frac{p(U)}{p_\theta(U)}\right],$$
where $p$ and $p_\theta$ are the densities of $U$ with respect to $P$ and $P_\theta$, respectively. This yields a new unbiased estimator for $C$,
$$\frac{1}{N}\sum_{n=1}^{N} F(U_n)\,\frac{p(U_n)}{p_\theta(U_n)},$$
whose variance depends on $\theta$. Depending on the value of $\theta$, the variance of this estimator could be either lower or higher than that of the crude estimator. Unfortunately, in practice we usually do not have enough information a priori to know what a good choice of $\theta$ might be. One way of dealing with this is to choose $\theta$ adaptively. In adaptive importance sampling, the simulation to estimate $C$ and a search algorithm for $\theta$ are run concurrently. This area has attracted a great deal of attention recently [Kawai, 2015a,b; Lemaire & Pagès, 2010; Arouna, 2004; Kawai, 2008]. One search algorithm for $\theta$ is stochastic approximation, pioneered in [Robbins & Monro, 1951]. While stochastic approximation algorithms perform well in general, it is known that they perform poorly in simulations of rare events, where $F(U) = 0$ with very high probability. In such simulations, the convergence of $\theta_n$ to its optimal value $\theta^*$ is extremely slow; typically, $\theta$ remains stuck near its initial value $\theta_0$ for a significant amount of time. This is very undesirable as it means that there is little to no variance reduction.
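As a toy illustration (ours, not from the report), the snippet below estimates a small tail probability under a unit normal with both the crude estimator and an importance sampling estimator whose mean shift $\theta$ is picked by hand; for the $\mathcal{N}(\theta, 1)$ family, the likelihood ratio is $p(z)/p_\theta(z) = e^{-\theta z + \theta^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Toy target: C = E[F(Z)], Z ~ N(0,1), F(z) = 1{z > 3},
# a rare-event probability (true value is about 1.35e-3).
F = lambda z: (z > 3.0).astype(float)

# Crude Monte Carlo estimator.
z = rng.standard_normal(N)
crude = F(z).mean()

# Importance sampling: sample from N(theta, 1) and reweight
# by the likelihood ratio exp(-theta*z + theta**2/2).
theta = 3.0                                 # hand-picked shift
z = rng.standard_normal(N) + theta
is_est = (F(z) * np.exp(-theta * z + theta**2 / 2)).mean()

print(crude, is_est)  # similar means; far lower variance for IS
```

The difficulty addressed in this report is precisely that a good $\theta$ like the one hard-coded here is not known in advance.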

The simulation of rare events, such as critical failures of transportation and communications infrastructure, is crucial to many fields, and so it is of great interest to accelerate the stochastic approximation method in these simulations. To this end, we introduce an auxiliary parameter $\lambda$ into the stochastic approximation method through an additional change of measure. The introduction of an auxiliary parameter in adaptive importance sampling was first done in [Lemaire & Pagès, 2010], though for a different purpose. More recently, in [Kawai, 2015a], an auxiliary parameter was introduced to accelerate sample average approximation in adaptive importance sampling. This project can be seen as the stochastic approximation version of [Kawai, 2015a]. In Section 2, we formally state our assumptions, set up our adaptive importance sampling framework, and give a brief review of the history of stochastic approximation. In Section 3, we present our algorithm and theorems on its convergence. In Section 4, we give an example to demonstrate the effectiveness of our algorithm. We make some concluding remarks in Section 5.

2 Background

2.1 Adaptive Importance Sampling

The following, up to and including Proposition 2.1, is taken from [Kawai, 2015a,b]. We assume (without loss of generality) that $U$ is distributed uniformly on the $d$-dimensional hypercube $(0,1)^d$. We can now write $C$ as
$$C = \int_{(0,1)^d} F(u)\,du.$$
Henceforth, we omit the measure $P$ when writing expectations, as all following results and calculations are done with respect to $P$, although we may change measures during a derivation or proof. Let $g(\cdot\,;\theta)$ be a density function with parameter $\theta \in \Theta$, let $G(\cdot\,;\theta)$ denote the corresponding distribution function, and fix a reference parameter $\theta_0 \in \Theta$. Then, by changing the underlying measure, we have that
$$C = \int_{(0,1)^d} F(G(G^{-1}(u;\theta);\theta_0))\,\frac{g(G^{-1}(u;\theta);\theta_0)}{g(G^{-1}(u;\theta);\theta)}\,du.$$

Note that, if $\theta = \theta_0$, then this framework reduces to the crude estimator. The variance of this estimator is given by
$$\int_{(0,1)^d} \left[F(G(G^{-1}(u;\theta);\theta_0))\,\frac{g(G^{-1}(u;\theta);\theta_0)}{g(G^{-1}(u;\theta);\theta)}\right]^2 du - C^2 = \int_{(0,1)^d} |F(u)|^2\,\frac{g(G^{-1}(u;\theta_0);\theta_0)}{g(G^{-1}(u;\theta_0);\theta)}\,du - C^2 =: V(\theta) - C^2.$$
For simplicity, we introduce some additional notation:

$$H(u;\theta,\lambda) := \frac{g(G^{-1}(u;\lambda);\theta_0)}{g(G^{-1}(u;\lambda);\theta)}, \qquad R(u;\theta) := F(G(G^{-1}(u;\theta);\theta_0))\,H(u;\theta,\theta),$$
so that
$$C = \mathbb{E}[R(U;\theta)], \qquad V(\theta) = \mathbb{E}[|F(U)|^2 H(U;\theta,\theta_0)].$$
Our problem is then to minimise $V(\theta)$.

Proposition 2.1. If $g(\cdot\,;\theta)$ is chosen so that it is component-wise Normal or Exponential, then

(a) $V(\theta)$ is twice continuously differentiable and strictly convex, and

$$\nabla_\theta V(\theta) = \mathbb{E}[|F(U)|^2 \nabla_\theta H(U;\theta,\theta_0)], \qquad \mathrm{Hess}_\theta(V(\theta)) = \mathbb{E}[|F(U)|^2\,\mathrm{Hess}_\theta(H(U;\theta,\theta_0))].$$

(b) $\theta^* = \arg\min V(\theta)$ exists, is unique, and is the unique solution to $\nabla_\theta V(\theta) = 0$.

Assuming Proposition 2.1 holds, we implement adaptive importance sampling as follows.

FOR n = 1 TO N:
1. SAMPLE $U_n \sim \mathcal{U}((0,1)^d)$
2. ADD $R(U_n;\theta_{n-1})$ TO THE SIMULATED DATASET
3. $\theta_n$ = [SEARCH ALGORITHM FOR $\theta^*$]
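In Python, this loop might look like the sketch below; `R` and `search_step` are placeholder names of ours for the estimator of Section 2.1 and the search algorithm of step 3.

```python
import numpy as np

def adaptive_is(R, search_step, theta0, d, N, seed=0):
    """Generic adaptive importance sampling: estimate C = E[R(U; theta)]
    while a search algorithm refines theta in parallel."""
    rng = np.random.default_rng(seed)
    theta, samples = theta0, []
    for n in range(1, N + 1):
        u = rng.uniform(size=d)           # step 1: U_n ~ U((0,1)^d)
        samples.append(R(u, theta))       # step 2: record R(U_n; theta_{n-1})
        theta = search_step(u, theta, n)  # step 3: one update towards theta*
    return np.mean(samples), theta
```

Note that each $R(U_n;\theta_{n-1})$ is computed with the current parameter before it is updated, matching the order of steps 2 and 3.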

2.2 Stochastic Approximation

Stochastic approximation was first proposed in [Robbins & Monro, 1951] as a method to solve equations of the form $M(x) = \alpha$, where $M$ is monotonic and is observed with some noise. In our case, $M(\theta) = \nabla_\theta V(\theta)$ and $\alpha = 0$. If there exists $C' > 0$ such that, for all $\theta \in \Theta$, $\left\| |F(U)|^2 \nabla_\theta H(U;\theta,\theta_0) \right\|_2 \le C'(1 + \|\theta\|)$, then $\theta_n \to \theta^*$ in probability. However, as remarked in [Lemaire & Pagès, 2010], this assumption of sub-linear growth is rarely satisfied in the framework of adaptive importance sampling. In [Chen & Zhu, 1986], the authors developed the constrained stochastic approximation algorithm, which removes the sub-linear growth assumption:
$$\theta_{n-\frac{1}{2}} = \theta_{n-1} - \gamma_n |F(U_n)|^2 \nabla_\theta H(U_n;\theta_{n-1},\theta_0),$$
$$\text{if } \theta_{n-\frac{1}{2}} \in K_{\sigma_{n-1}}, \quad \theta_n = \theta_{n-\frac{1}{2}}, \quad \sigma_n = \sigma_{n-1}, \tag{2.1}$$
$$\text{if } \theta_{n-\frac{1}{2}} \notin K_{\sigma_{n-1}}, \quad \theta_n = \theta_0, \quad \sigma_n = \sigma_{n-1} + 1,$$
where

• $\sigma_0 = 0$, so $\sigma_n$ counts the number of projections up to the $n$-th step;
• $\{K_n\}_{n\in\mathbb{N}}$ is a sequence of compact sets satisfying $K_{n-1} \subset \mathrm{int}(K_n)$ and $\bigcup_{n=1}^{\infty} K_n = \Theta$;
• $\{\gamma_n\}_{n\in\mathbb{N}}$ is a sequence of step sizes satisfying $\sum_{n=1}^{\infty} \gamma_n = \infty$ and $\sum_{n=1}^{\infty} \gamma_n^2 < \infty$;

• $\nabla_\theta H(U_n;\theta_{n-1},\theta_0)$ is shorthand for $\nabla_\theta H(u;\theta,\theta_0)$ evaluated at $u = U_n$, $\theta = \theta_{n-1}$. Notation of this form will be used regularly henceforth.
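For concreteness, here is a minimal Python sketch of one update of Algorithm (2.1), assuming for illustration that the compact sets $K_m$ are centred balls of growing radius; `F`, `grad_H`, `gamma`, and `K_radius` are names of ours.

```python
import numpy as np

def sa_update(u, theta, sigma, n, theta0, F, grad_H, gamma, K_radius):
    """One step of the constrained stochastic approximation (2.1):
    returns (theta_n, sigma_n) from (theta_{n-1}, sigma_{n-1}) and U_n."""
    candidate = theta - gamma(n) * abs(F(u)) ** 2 * grad_H(u, theta, theta0)
    if np.linalg.norm(candidate) <= K_radius(sigma):
        return candidate, sigma   # candidate stays inside K_{sigma_{n-1}}
    return theta0, sigma + 1      # otherwise reset to theta0, count a projection
```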

In simulations of rare events, $F(U_n)$ is 0 with very high probability. This causes Algorithm (2.1) to set $\theta_n = \theta_{n-1}$ repeatedly, dramatically slowing down the convergence of $\theta_n$ to $\theta^*$. This is particularly bad in the early stages of the algorithm: until $\theta$ is updated at least once, it is stuck at $\theta_0$, which gives no reduction in variance whatsoever.
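A back-of-the-envelope calculation (ours, using the failure probability from the example of Section 4) shows how severe this stalling is:

```python
# With P(F(U) = 0) = 0.9862, the gradient in (2.1) vanishes on most draws.
p_nonzero = 1 - 0.9862
print(1000 * p_nonzero)   # ~13.8 informative draws in 1000 iterations
print(0.9862 ** 100)      # ~0.25: a one-in-four chance that the first
                          # 100 iterations leave theta stuck at theta0
```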

3 Main Results

As in [Kawai, 2015a], we introduce the auxiliary parameter $\lambda \in \Theta$ into $V$ by changing the measure:
$$V(\theta) = \int_{(0,1)^d} |F(G(G^{-1}(u;\lambda);\theta_0))|^2\, H(u;\theta,\lambda)\,H(u;\lambda,\lambda)\,du =: \mathbb{E}[N(U;\theta,\lambda)].$$

Note that the framework in Section 2 can be recovered by letting λ = θ0. We stress that V(θ) and ∇θV(θ) have not been changed by the introduction of λ. We have simply found new estimators for them.

3.1 Convergence

The following convergence results (adapted from [Lelong, 2013; Kawai, 2015b,a]) apply to Algorithm (3.1) below. Define $\delta M_n(\lambda) := \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1}) - \nabla_\theta V(\theta_{n-1})$.

Assumption 3.1. (a) There exist a function $y : \mathbb{R}^d \to \mathbb{R}^{d\times d}$ with $\lim_{|\theta|\to 0} \|y(\theta)\| = 0$ and a matrix $A$ whose eigenvalues all have positive real parts, such that $\nabla_\theta V(\theta) = A(\theta - \theta^*) + y(\theta - \theta^*)(\theta - \theta^*)$.

(b) For every $q > 0$, the series $\sum_{n=1}^{\infty} \gamma_{n+1}\,\delta M_{n+1}(\lambda)\,\mathbf{1}\{|\theta_n - \theta^*| \le q\}$ converges a.s.

(c) There exist $\rho > 0$ and $\eta > 0$ such that
$$\sup_{n\in\mathbb{N}} \mathbb{E}\left[|\delta M_n(\lambda)|^{2+\rho}\,\mathbf{1}\{|\theta_{n-1} - \theta^*| \le \eta\}\right] < \infty,$$
and there exists a symmetric positive definite matrix $\Sigma(\lambda)$ such that
$$\mathbb{E}\left[\delta M_n(\lambda)\,\delta M_n(\lambda)^{\mathsf{T}} \mid \mathcal{F}_{n-1}\right]\mathbf{1}\{|\theta_{n-1} - \theta^*| \le \eta\} \xrightarrow{P} \Sigma(\lambda).$$

(d) There exists $\mu > 0$ such that, for every $n \ge 0$, $d(\theta^*, \partial K_n) \ge \mu$.

Theorem 3.2. If $0.5 < \alpha < 1$, then under Assumption 3.1, as $n \to \infty$,
$$\frac{\theta_n - \theta^*}{\sqrt{\gamma_n}} \xrightarrow{d} \mathcal{N}\left(0,\, \int_0^{\infty} \exp[-At]\,\Sigma(\lambda)\,\exp[-A^{\mathsf{T}}t]\,dt\right).$$
For $\alpha = 1$, under an additional assumption, a similar result holds with a slightly different asymptotic covariance matrix.

Theorem 3.3. It holds almost surely that, as $N \to \infty$,
$$\frac{1}{N}\sum_{n=1}^{N} R(U_n;\theta_{n-1}) \to C.$$
Moreover, if Theorem 3.2 holds and there exists $q > 2$ such that
$$\limsup_{N\to\infty} \frac{1}{N}\sum_{n=1}^{N} \left(\int_{(0,1)^d} |H(u;\theta_{n-1},\theta_0)|^{q-1}\,|F(u)|^q\,du\right)^{2/q} < \infty, \quad \text{a.s.},$$
then it holds that, as $N \to \infty$,
$$\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^{N} R(U_n;\theta_{n-1}) - C\right) \xrightarrow{d} \mathcal{N}\left(0,\, V(\theta^*) - C^2\right).$$
Note that Theorem 3.3 is not simply the strong law of large numbers (SLLN) or the standard CLT, as the summands are not identically distributed due to the updating of $\theta_n$. Instead, Theorem 3.3 is an application of the martingale CLT.

3.2 Updating the Auxiliary Parameter

Our accelerated algorithm is (we omit the justification)
$$\theta_{n-\frac{1}{2}} = \theta_{n-1} - \gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1}),$$
$$\text{if } \theta_{n-\frac{1}{2}} \in K_{\sigma_{n-1}}, \quad \theta_n = \theta_{n-\frac{1}{2}}, \quad \sigma_n = \sigma_{n-1}, \tag{3.1}$$
$$\text{if } \theta_{n-\frac{1}{2}} \notin K_{\sigma_{n-1}}, \quad \theta_n = \theta_0, \quad \sigma_n = \sigma_{n-1} + 1,$$
followed by the update of $\lambda$:
• if $\theta$ has not updated yet, draw $\lambda_n$ uniformly from $K_{\sigma_n}$;
• else if this is the first update for $\theta$ and $n \le n_0$, set $\lambda_n = \theta_0 - L\gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ if it lies in $K_{\sigma_n}$, and $\lambda_n = \theta_n$ otherwise;
• else if $n \le n_0$, set $\lambda_n = \lambda_{n-1} - L\gamma_n \nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ if it lies in $K_{\sigma_n}$, and $\lambda_n = \lambda_{n-1}$ otherwise;
• else set $\lambda_n = \lambda_{n-1}$,
where

• $\sigma_0 = 0$, so $\sigma_n$ counts the number of projections up to the $n$-th step;
• $\{K_n\}_{n\in\mathbb{N}}$ is a sequence of compact sets satisfying $K_{n-1} \subset \mathrm{int}(K_n)$ and $\bigcup_{n=1}^{\infty} K_n = \Theta$;
• $\gamma_n = \gamma(n+1)^{-\alpha}$ with $0.5 < \alpha \le 1$ and $\gamma > 0$;
• $L \gg 1$ and $n_0 < N$ are constants.

In practice, aside from jump-starting the convergence of $\theta_n$ to $\theta^*$, Algorithm (3.1) also tends to have better consistency than Algorithm (2.1). The example in Section 4 demonstrates clearly the improved asymptotic consistency of Algorithm (3.1) over Algorithm (2.1), as well as acceleration in the early stages. Note that Algorithm (3.1) is just one of many possible ways to update $\lambda$ in this framework, and finding better ways to update $\lambda$ could be an area of further investigation.
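The sketch below shows one reading of the update rules in Python, again with illustrative ball-shaped $K_m$; `grad_N`, `K_radius`, and the `n_updates` counter are our names, and the handling of the first-update branch reflects our interpretation of (3.1).

```python
import numpy as np

def accelerated_update(u, theta, lam, sigma, n, n_updates, rng, p):
    """One step of Algorithm (3.1). `p` bundles the problem data
    (grad_N, theta0, gamma, L, n0, K_radius); n_updates counts how
    many times theta has actually moved so far."""
    g = p.grad_N(u, theta, lam)
    candidate = theta - p.gamma(n) * g
    if np.linalg.norm(candidate) <= p.K_radius(sigma):
        theta_n, sigma_n = candidate, sigma
    else:
        theta_n, sigma_n = p.theta0, sigma + 1
    moved = not np.allclose(theta_n, theta)

    r = p.K_radius(sigma_n)
    if n_updates == 0 and not moved:
        # theta has not updated yet: explore K_{sigma_n} uniformly
        lam_n = rng.uniform(-r, r)
    elif n_updates == 0 and moved and n <= p.n0:
        # first update for theta: amplified move started from theta0
        trial = p.theta0 - p.L * p.gamma(n) * g
        lam_n = trial if np.linalg.norm(trial) <= r else theta_n
    elif n <= p.n0:
        # amplified (L >> 1) recursion for lambda while n <= n0
        trial = lam - p.L * p.gamma(n) * g
        lam_n = trial if np.linalg.norm(trial) <= r else lam
    else:
        lam_n = lam  # freeze lambda after n0
    return theta_n, lam_n, sigma_n, n_updates + int(moved)
```

The large constant $L$ lets $\lambda$ reach the rare-event region quickly, so that $\nabla_\theta N(U_n;\theta_{n-1},\lambda_{n-1})$ is non-zero far more often than the gradient in (2.1).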

4 Numerical Example

For simplicity, we present a one-dimensional example, although it is not hard to apply Algorithm (3.1) to high-dimensional problems. In the classic Black-Scholes model, the price of a European call option is given by
$$C = \mathbb{E}\left[e^{-rT}\max\left(S_0\,e^{\left(r - \frac{\sigma^2}{2}\right)T + \sigma\sqrt{T}\,\Phi^{-1}(U)} - K,\; 0\right)\right] =: \mathbb{E}[F(U)],$$
where $r$ is the interest rate, $T$ is the maturation date, $S_0$ is the current stock price, $\sigma$ is the volatility of the stock, $K$ is the strike price, and $\Phi$ is the standard normal distribution function. For this example, we set $r = 0.05$, $T = 0.25$, $S_0 = 31$, $\sigma = 0.1$, and $K = 35$, which means the probability of $F(U) = 0$ is 0.9862. We choose $g(z;\theta) = \phi(z - \theta)$, i.e. normally distributed with mean $\theta$ and variance 1; note that this satisfies Proposition 2.1. We let $\theta_0 = 0$. Then we have that $\Theta = \mathbb{R}$, and
$$R(u;\theta) = F(\Phi(\Phi^{-1}(u) + \theta))\,e^{-\theta\Phi^{-1}(u) - \frac{\theta^2}{2}},$$
$$\nabla_\theta H(u;\theta,\theta_0) = (\theta - \Phi^{-1}(u))\,e^{\frac{\theta^2}{2} - \theta\Phi^{-1}(u)}.$$
By deterministic numerical approximation, $\theta^* = 2.7387$. Finally, we let $K_n = [-(n+2), n+2]$ and $\gamma_n = n^{-1}$.

After 1000 iterations, Algorithm (2.1) gives the estimate $\hat{C} = 0.0052$, which is quite far from the true value $C = 0.00853$. The sample variance was 0.0052, which is lower than the variance of the crude Monte Carlo estimator, 0.0097, but much higher than the minimum, 0.000076, obtained using $\theta^*$. Finally, $\theta_{1000} = 0.0264$: it has barely moved from its initial value and is far from $\theta^*$.

We now introduce the auxiliary parameter $\lambda$ to try to improve this simulation. After some calculations, we get
$$\nabla_\theta N(u;\theta,\lambda) = (\theta - \Phi^{-1}(u) - \lambda)\,\left|F(\Phi(\Phi^{-1}(u) + \lambda))\right|^2\, e^{\frac{\theta^2}{2} - \theta(\Phi^{-1}(u) + \lambda) - \lambda\Phi^{-1}(u) - \frac{\lambda^2}{2}}.$$
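These formulas translate directly into Python (a sketch of ours, using SciPy's `norm` for $\Phi$ and $\Phi^{-1}$):

```python
import numpy as np
from scipy.stats import norm

# Black-Scholes parameters from the text.
r, T, S0, sigma, K = 0.05, 0.25, 31.0, 0.1, 35.0

def F(u):
    """Discounted call payoff as a function of a uniform variate u."""
    z = norm.ppf(u)  # Phi^{-1}(u)
    ST = S0 * np.exp((r - sigma**2 / 2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(ST - K, 0.0)

def grad_H(u, theta):
    """d/dtheta of H(u; theta, theta0) with theta0 = 0."""
    z = norm.ppf(u)
    return (theta - z) * np.exp(theta**2 / 2 - theta * z)

def grad_N(u, theta, lam):
    """d/dtheta of N(u; theta, lam) for the unit-variance normal family."""
    z = norm.ppf(u)
    weight = np.exp(theta**2 / 2 - theta * (z + lam) - lam * z - lam**2 / 2)
    return (theta - z - lam) * F(norm.cdf(z + lam)) ** 2 * weight
```

Note that $\nabla_\theta N(u;\theta,\lambda)$ reduces to $|F(u)|^2\nabla_\theta H(u;\theta,\theta_0)$ at $\lambda = 0$, which is the comparison plotted in Figure 2.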

We let $L = 100$ and $n_0 = 900$, which is 90% of the duration of the simulation. For fairness, we use the same random seed to generate the $U_n$. After 1000 iterations, Algorithm (3.1) gives the estimate $\hat{C} = 0.0055$, which is marginally better than before. The sample variance was 0.0038, which is better than in the previous simulation. Finally, $\theta_{1000} = 0.1301$, which is closer to $\theta^*$ than in the previous simulation. Overall, the simulation has noticeably improved, and this improvement holds consistently across different random seeds. Algorithm (3.1) is also much more asymptotically consistent than Algorithm (2.1), as shown on the right side of Figure 1. Finally, for this example, Algorithm (2.1) updated $\theta$ 10 times while Algorithm (3.1) updated $\theta$ 386 times. Figure 2 shows a plot of the difference between Algorithms (3.1) and (2.1) in updating $\theta$, for $\theta = \theta_0$. For $\lambda < 0$ and $U < 0.98$, both algorithms fail to update $\theta$. For $U > 0.99$, both algorithms update $\theta$, but Algorithm (2.1) moves $\theta$ further. For $\lambda > 0$ and $0.6 < U < 0.98$, Algorithm (2.1) fails to update $\theta$ while Algorithm (3.1) does, though the update is relatively small. The improved asymptotic consistency observed is most likely due to updates being much more frequent.


Figure 1: On the left, the blue curve is a plot of $V(\theta)$ in a neighbourhood of $\theta^*$. The blue cross is $\theta_0$, the red one is $\theta_{1000}$ using Algorithm (2.1), and the yellow one is $\theta^*$. The pink asterisk is $\theta_{1000}$ using Algorithm (3.1). On the right is a histogram of $\theta_N$ for the two algorithms using $N = 30000$, $\gamma_n = 50n^{-0.51}$, $L = 10$, $n_0 = 0.9N$, and MATLAB random seeds 1 to 100. There is an outlier of 2.35 for Algorithm (3.1) not shown in the histogram.


Figure 2: A plot of $\nabla_\theta N(U;\theta,\lambda) - |F(U)|^2 \nabla_\theta H(U;\theta,\theta_0)$, for $\theta = \theta_0$.

5 Concluding Remarks

In this project, we present a promising way of accelerating stochastic approximation in adaptive importance sampling for simulations of rare events. A potential direction of future research is validation analysis of our method, that is, determining how large $N$ has to be in order for the algorithm to be within $\varepsilon$ of the optimal value $\theta^*$ with probability at least $1 - \alpha$. One could also investigate other methods of updating the auxiliary parameter $\lambda$. It is also possible that the introduction of an auxiliary parameter could improve stochastic approximation in other frameworks, such as mirror descent stochastic approximation [Nemirovski et al., 2009].

References

Arouna, B. 2004. Adaptive Monte Carlo method, a variance reduction technique. Monte Carlo Methods and Applications, 10(1), 1–24.

Chen, H.-F., & Zhu, Y.-M. 1986. Stochastic approximation procedure with randomly varying truncations. Scientia Sinica Series A.

Kawai, R. 2008. Adaptive Monte Carlo variance reduction for Lévy processes with two-time-scale stochastic approximation. Methodology and Computing in Applied Probability, 10(2), 199–223.

Kawai, R. 2015a. Acceleration of adaptive importance sampling with sample average approximation. Preprint.

Kawai, R. 2015b. Adaptive importance sampling Monte Carlo simulation for general multivariate probability laws. Preprint.

Lelong, J. 2013. Asymptotic normality of randomly truncated stochastic algorithms. ESAIM: Probability and Statistics, 17, 105–119.

Lemaire, V., & Pagès, G. 2010. Unconstrained recursive importance sampling. The Annals of Applied Probability, 20(3), 1029–1067.

Nemirovski, A., Juditsky, A., Lan, G., & Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4), 1574–1609.

Robbins, H., & Monro, S. 1951. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407.
