Stochastic Approximation Algorithm
Learning in Complex Systems, Spring 2011
Lecture Notes, Nahum Shimkin

5 The Stochastic Approximation Algorithm

5.1 Stochastic Processes - Some Basic Concepts

5.1.1 Random Variables and Random Sequences

Let $(\Omega, \mathcal{F}, P)$ be a probability space, namely:

- $\Omega$ is the sample space.
- $\mathcal{F}$ is the event space. Its elements are subsets of $\Omega$, and it is required to be a $\sigma$-algebra (it includes $\emptyset$ and $\Omega$, all countable unions of its members, and all complements of its members).
- $P$ is the probability measure (it assigns a probability in $[0,1]$ to each element of $\mathcal{F}$, with the usual properties: $P(\Omega) = 1$, countable additivity).

A random variable (RV) $X$ on $(\Omega, \mathcal{F})$ is a function $X : \Omega \to \mathbb{R}$, with values $X(\omega)$. It is required to be measurable on $\mathcal{F}$, namely, all sets of the form $\{\omega : X(\omega) \le a\}$ are events in $\mathcal{F}$.

A vector-valued RV is a vector of RVs. Equivalently, it is a function $X : \Omega \to \mathbb{R}^d$, with a similar measurability requirement.

A random sequence, or a discrete-time stochastic process, is a sequence $(X_n)_{n \ge 0}$ of $\mathbb{R}^d$-valued RVs, all defined on the same probability space.

5.1.2 Convergence of Random Variables

A random sequence may converge to a random variable, say $X$. There are several useful notions of convergence:

1. Almost sure convergence (or: convergence with probability 1):
$$X_n \xrightarrow{a.s.} X \quad \text{if} \quad P\{\lim_{n\to\infty} X_n = X\} = 1 .$$

2. Convergence in probability:
$$X_n \xrightarrow{p} X \quad \text{if} \quad \lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 , \quad \forall \epsilon > 0 .$$

3. Mean-squares convergence (convergence in $L^2$):
$$X_n \xrightarrow{L^2} X \quad \text{if} \quad E|X_n - X|^2 \to 0 .$$

4. Convergence in distribution:
$$X_n \xrightarrow{\text{Dist}} X \quad (\text{or } X_n \Rightarrow X) \quad \text{if} \quad E f(X_n) \to E f(X)$$
for every bounded and continuous function $f$.

The following relations hold:

a. Basic implications: (a.s. or $L^2$) $\implies$ in probability $\implies$ in distribution.

b. Almost sure convergence is equivalent to
$$\lim_{n\to\infty} P\{\sup_{k \ge n} |X_k - X| > \epsilon\} = 0 , \quad \forall \epsilon > 0 .$$

c. A useful sufficient condition for a.s. convergence (by the Borel-Cantelli lemma):
$$\sum_{n=0}^{\infty} P(|X_n - X| > \epsilon) < \infty , \quad \forall \epsilon > 0 .$$

5.1.3 Sigma-algebras and Information

Sigma-algebras (or $\sigma$-algebras) are part of the mathematical structure of probability theory. They also have a convenient interpretation as "information sets", which we shall find useful.

- Define $\mathcal{F}_X \triangleq \sigma\{X\}$, the $\sigma$-algebra generated by the RV $X$. This is the smallest $\sigma$-algebra that contains all sets of the form $\{X \le a\} \equiv \{\omega \in \Omega : X(\omega) \le a\}$.
- We can interpret $\sigma\{X\}$ as carrying all the information in $X$. Accordingly, we identify $E(Z|X) \equiv E(Z|\mathcal{F}_X)$. Also, "$Z$ is measurable on $\sigma\{X\}$" is equivalent to: $Z = f(X)$ (with the additional technical requirement that $f$ is a Borel measurable function).
- We can similarly define $\mathcal{F}_n = \sigma\{X_1, \dots, X_n\}$, etc. Thus, $E(Z|X_1, \dots, X_n) \equiv E(Z|\mathcal{F}_n)$.
- Note that $\mathcal{F}_{n+1} \supset \mathcal{F}_n$: more RVs carry more information, making $\mathcal{F}_{n+1}$ finer, or "more detailed".

5.1.4 Martingales

A sequence $(X_k, \mathcal{F}_k)_{k \ge 0}$ on a given probability space $(\Omega, \mathcal{F}, P)$ is a martingale if

a. $(\mathcal{F}_k)$ is a "filtration", namely an increasing sequence of $\sigma$-algebras in $\mathcal{F}$;
b. each RV $X_k$ is $\mathcal{F}_k$-measurable;
c. $E(X_{k+1}|\mathcal{F}_k) = X_k$ ($P$-a.s.).

Note that:

- Property (a) is roughly equivalent to: $\mathcal{F}_k$ represents (the information in) some RVs $(Y_0, \dots, Y_k)$; property (b) then means that $X_k$ is a function of $(Y_0, \dots, Y_k)$.
- A particular case is $\mathcal{F}_n = \sigma\{X_1, \dots, X_n\}$ (a self-martingale).
- The central property is (c), which says that the conditional mean of $X_{k+1}$ equals $X_k$. This is obviously stronger than $E(X_{k+1}) = E(X_k)$.
- The definition sometimes also requires that $E|X_n| < \infty$; we shall assume this below.
- Replacing (c) by $E(X_{k+1}|\mathcal{F}_k) \ge X_k$ gives a submartingale, while $E(X_{k+1}|\mathcal{F}_k) \le X_k$ corresponds to a supermartingale.
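Before turning to examples, here is a minimal simulation sketch of the defining property (c), for the partial-sum martingale of Example a below. It is an illustration only; the variable names and the Gaussian increments are our choices, not part of the notes. We fix one history $\xi_0, \dots, \xi_k$ and estimate $E(X_{k+1} \,|\, \mathcal{F}_k)$ by Monte Carlo, resampling only the next increment.

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed history xi_0, ..., xi_k of independent zero-mean increments.
k = 20
xi_hist = rng.normal(size=k + 1)
X_k = xi_hist.sum()                 # X_k = xi_0 + ... + xi_k

# Monte Carlo estimate of E(X_{k+1} | F_k): the history stays fixed,
# only the next increment xi_{k+1} is resampled.
fresh_increments = rng.normal(size=1_000_000)
estimate = (X_k + fresh_increments).mean()

print(f"X_k = {X_k:.4f}, estimated E(X_(k+1)|F_k) = {estimate:.4f}")
```

Up to Monte Carlo error the two printed values coincide, as property (c) requires. Replacing the zero-mean increments with ones of positive mean would instead produce a submartingale.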
Examples:

a. The simplest example of a martingale is the partial-sum process
$$X_k = \sum_{\ell=0}^{k} \xi_\ell ,$$
with $\{\xi_k\}$ a sequence of 0-mean independent RVs, and $\mathcal{F}_k = \sigma(\xi_0, \dots, \xi_k)$.

b. $X_k = E(X|\mathcal{F}_k)$, where $(\mathcal{F}_k)$ is a given filtration and $X$ a fixed RV.

Martingales play an important role in the convergence analysis of stochastic processes. We quote a few basic theorems (see, for example: A.N. Shiryaev, Probability, Springer, 1996).

Martingale Inequalities

Let $(X_k, \mathcal{F}_k)_{k \ge 0}$ be a martingale. Then for every $\lambda > 0$ and $p \ge 1$,
$$P\Big\{\max_{k \le n} |X_k| \ge \lambda\Big\} \le \frac{E|X_n|^p}{\lambda^p} ,$$
and for $p > 1$,
$$E\Big[\big(\max_{k \le n} |X_k|\big)^p\Big] \le \Big(\frac{p}{p-1}\Big)^p E(|X_n|^p) .$$

Martingale Convergence Theorems

1. Convergence with bounded moments: Consider a martingale $(X_k, \mathcal{F}_k)_{k \ge 0}$. Assume that $E|X_k|^q \le C$ for some $C < \infty$, $q \ge 1$, and all $k$. Then $\{X_k\}$ converges (a.s.) to a RV $X_\infty$ (which is finite w.p. 1).

2. Positive Martingale Convergence: If $(X_k, \mathcal{F}_k)$ is a positive martingale (namely $X_n \ge 0$), then $X_k$ converges (a.s.) to some RV $X_\infty$.

Martingale Difference Convergence

The sequence $(\xi_k, \mathcal{F}_k)$ is a martingale difference sequence if property (c) is replaced by $E(\xi_{k+1}|\mathcal{F}_k) = 0$. In this case we have:

3. Suppose that for some $0 < q \le 2$,
$$\sum_{k=1}^{\infty} \frac{1}{k^q}\, E(|\xi_k|^q \,|\, \mathcal{F}_{k-1}) < \infty \quad \text{(a.s.)} .$$
Then
$$\lim_{n\to\infty} \frac{1}{n} \sum_{k=1}^{n} \xi_k = 0 \quad \text{(a.s.)} .$$
For example, the conclusion holds if the sequence $(\xi_k)$ is bounded, namely $|\xi_k| \le C$ for some constant $C > 0$ (independent of $k$).

Note:

- It is trivially seen that $\xi_n \triangleq X_n - X_{n-1}$ is a martingale difference if $(X_n)$ is a martingale.
- More generally, for any sequence $(Y_k)$ and filtration $(\mathcal{F}_k)$, where $Y_k$ is measurable on $\mathcal{F}_k$, the following is a martingale difference:
$$\xi_k \triangleq Y_k - E(Y_k|\mathcal{F}_{k-1}) .$$
The conditions of the last theorem hold for this $\xi_k$ if either (i) $|Y_k| \le M$ for all $k$ and some constant $M < \infty$, or (ii), more generally, $E(|Y_k|^q \,|\, \mathcal{F}_{k-1}) \le M$ (a.s.) for some $q > 1$ and a finite RV $M$. In that case we have
$$\frac{1}{n} \sum_{k=1}^{n} \xi_k \;\equiv\; \frac{1}{n} \sum_{k=1}^{n} \big(Y_k - E(Y_k|\mathcal{F}_{k-1})\big) \to 0 \quad \text{(a.s.)} .$$

5.2 The Basic SA Algorithm

The stochastic approximation (SA) algorithm essentially solves a system of (nonlinear) equations of the form
$$h(\theta) = 0$$
based on noisy measurements of $h(\theta)$.

More specifically, we consider a (continuous) function $h : \mathbb{R}^d \to \mathbb{R}^d$, with $d \ge 1$, which depends on a set of parameters $\theta \in \mathbb{R}^d$. Suppose that $h$ is unknown. However, for each $\theta$ we can measure $Y = h(\theta) + \omega$, where $\omega$ is some 0-mean noise. The classical SA algorithm (Robbins-Monro, 1951) is of the form
$$\theta_{n+1} = \theta_n + \alpha_n Y_n = \theta_n + \alpha_n [h(\theta_n) + \omega_n] , \quad n \ge 0 .$$
Here $\alpha_n$ is the step size, or gain, of the algorithm.

Obviously, with zero noise ($\omega_n \equiv 0$) the stationary points of the algorithm coincide with the solutions of $h(\theta) = 0$. Under appropriate conditions (on $\alpha_n$, $h$ and $\omega_n$) the algorithm can indeed be shown to converge to a solution of $h(\theta) = 0$.

References:

H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications, Springer, 1997.
V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, Hindustan Book Agency, 2008.
J. Spall, Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, 2003.

Some examples of the SA algorithm:

a. Average of an i.i.d. sequence: Let $(Z_n)_{n \ge 0}$ be an i.i.d. sequence with mean $\mu = E(Z_0)$ and finite variance. We wish to estimate the mean. The iterative algorithm
$$\theta_{n+1} = \theta_n + \frac{1}{n+1} [Z_n - \theta_n]$$
gives
$$\theta_n = \frac{1}{n} \sum_{k=0}^{n-1} Z_k \;\to\; \mu \quad \text{(w.p. 1), by the SLLN}$$
(note that the initial value $\theta_0$ is discarded at the first step). This is an SA iteration, with $\alpha_n = \frac{1}{n+1}$ and $Y_n = Z_n - \theta_n$. Writing $Z_n = \mu + \omega_n$ ($Z_n$ is considered a noisy measurement of $\mu$, with zero-mean noise $\omega_n$), we can identify $h(\theta) = \mu - \theta$. A minimal simulation sketch of this iteration follows.
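The sketch below implements the iteration exactly as stated; the variable names and the distribution of $Z_n$ (Gaussian with mean 3) are our illustrative assumptions. With $\alpha_n = 1/(n+1)$ the SA recursion reproduces the running sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. measurements Z_n = mu + omega_n, with mu = 3 unknown to the algorithm.
Z = rng.normal(loc=3.0, scale=2.0, size=100_000)

theta = 0.0                         # theta_0; forgotten after the first step
for n, z in enumerate(Z):
    theta += (z - theta) / (n + 1)  # theta_{n+1} = theta_n + alpha_n (Z_n - theta_n)

print(theta, Z.mean())              # agree up to floating-point rounding
```

The recursive form needs only $O(1)$ memory per step, which is the point of the SA view of averaging: the same update applies unchanged when samples arrive one at a time.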
b. Function minimization: Suppose we wish to minimize a (convex) function $f(\theta)$. Denoting $h(\theta) = -\nabla f(\theta) \equiv -\frac{\partial f}{\partial \theta}$, we need to solve $h(\theta) = 0$. The basic iteration here is
$$\theta_{n+1} = \theta_n + \alpha_n [-\nabla f(\theta_n) + \omega_n] .$$
This is a "noisy" gradient descent algorithm. When $\nabla f$ is not computable, it may be approximated by finite differences of the form
$$\frac{\partial f(\theta)}{\partial \theta_i} \approx \frac{f(\theta + e_i \delta_i) - f(\theta - e_i \delta_i)}{2 \delta_i} ,$$
where $e_i$ is the $i$-th unit vector and $\delta_i > 0$ is small. This scheme is known as the "Kiefer-Wolfowitz procedure".

Some variants of the SA algorithm:

- A fixed-point formulation: Let $h(\theta) = H(\theta) - \theta$. Then $h(\theta) = 0$ is equivalent to the fixed-point equation $H(\theta) = \theta$, and the algorithm is
$$\theta_{n+1} = \theta_n + \alpha_n [H(\theta_n) - \theta_n + \omega_n] = (1 - \alpha_n)\theta_n + \alpha_n [H(\theta_n) + \omega_n] .$$
This is the form used in the Bertsekas and Tsitsiklis (1996) monograph. Note that in the average estimation problem (Example a above) we get $H(\theta) = \mu$, hence $Z_n = H(\theta_n) + \omega_n$.

- Asynchronous updates: Different components of $\theta$ may be updated at different times and rates. A general form of the algorithm is
$$\theta_{n+1}(i) = \theta_n(i) + \alpha_n(i)\, Y_n(i) , \quad i = 1, \dots, d ,$$
where each component of $\theta$ is updated with a different gain sequence $\{\alpha_n(i)\}$. These gain sequences are typically required to be of comparable magnitude. Moreover, the gain sequences may be allowed to be stochastic, namely to depend on the entire history of the process up to the time of update. For example, in the TD(0) algorithm $\theta$ corresponds to the estimated value function $\hat{V} = (\hat{V}(s), s \in S)$, and we can define $\alpha_n(s) = 1/N_n(s)$, where $N_n(s)$ is the number of visits to state $s$ up to time $n$. A small simulation sketch of this asynchronous scheme follows.
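To make the asynchronous form concrete, here is a minimal Python sketch under assumptions of our own (the toy target $h(\theta) = \mu - \theta$ and all variable names are illustrative, not from the notes): each step picks one coordinate at random and updates it with gain $1/N_n(i)$, so each $\theta(i)$ becomes a running average of its own noisy measurements.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: h(theta) = mu - theta, so the solution of h(theta) = 0 is mu.
mu = np.array([1.0, -2.0, 0.5])     # unknown to the algorithm
d = len(mu)
theta = np.zeros(d)
visits = np.zeros(d, dtype=int)     # N_n(i): number of updates of component i

for n in range(200_000):
    i = rng.integers(d)                     # component chosen at time n
    visits[i] += 1
    Y = (mu[i] - theta[i]) + rng.normal()   # noisy measurement of h_i(theta_n)
    theta[i] += Y / visits[i]               # gain alpha_n(i) = 1 / N_n(i)

print(theta)    # close to mu: each theta(i) averages its own samples
```

This mirrors the TD(0) remark above, with coordinates playing the role of states and $N_n(i)$ counting visits; since the random choice of $i$ visits all coordinates at comparable rates, the gain sequences remain of comparable magnitude, as required.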