STA 4273: Minimizing Expectations Lecture 3 - Gradient Estimation I
Chris J. Maddison, University of Toronto

Announcements

- Presentation assignments are out.
- Project handout(s) are out.

Presentation assignments

Quercus: People → Groups → look for yourself in the Week N Presentation Assignments groups.

- Please let me know ASAP if you cannot do your week.
- If you have not gotten an assignment, either I made a mistake or you didn't send in your rankings. Email me!

Project

Handouts are up. Apologies for the delay. Come to office hours or email me for help!

I've moved the due date of the Proposal back to Feb 22:
- to reduce conflict with Prof. Grosse's course, and
- because I was late on getting the handout up.

The proposal is to get you started. You do not have to end up working on the same project that you propose!

Can I work alone? Yes, but the standards will be just as high as for groups of 4.

Gradient estimation

Assuming it exists, today and next week we will consider the problem of gradient estimation, i.e. computing
$$\nabla_\theta \, \mathbb{E}_{X \sim q_\theta}[f(X; \theta)]$$
Same old beloved assumptions:
- $X$ is a random variable taking values in $\mathcal{X}$ with a probability density $q_\theta$ in a parametric family of densities parameterized by $\theta \in \mathbb{R}^D$.
- $f : \mathcal{X} \times \mathbb{R}^D \to \mathbb{R}$ is a function.

Gradient estimation

A gradient estimator is a random variable $G(\theta)$ such that
$$\mathbb{E}[G(\theta)] = \nabla_\theta \, \mathbb{E}_{X \sim q_\theta}[f(X; \theta)]$$
We will briefly introduce two basic approaches:
- the score function estimator (we've actually seen this), and
- the pathwise gradient estimator, also called the reparameterization estimator.

Pathwise gradient

Let's start with the pathwise gradient.

Example: let $f : \mathbb{R} \to \mathbb{R}$ be continuously differentiable and let $X \sim \mathcal{N}(m, 1)$ be a Gaussian with mean $m \in \mathbb{R}$. We want to compute
$$\nabla_m \, \mathbb{E}[f(X)]$$
Imagine the flow of computation required to compute a sample $f(X)$ using numpy: $m$ feeds into randn(loc = m), which produces $x$, which feeds into $f$. Can we use the state of this computation to compute an estimator?

Pathwise gradient

Naive idea: compute $\nabla_m f(X)$ using the chain rule, given a realization of the computation graph above. But what is the partial derivative of randn(loc = m) with respect to $m$? Intuitively, it should be 0.

Pathwise gradient

One idea that works: reparameterize the sampling process of $X$:
$$\epsilon \sim \mathcal{N}(0, 1), \qquad X \stackrel{d}{=} \epsilon + m$$
so that
$$\mathbb{E}[f(X)] = \mathbb{E}[f(\epsilon + m)]$$
Assuming we can exchange the derivative and the integral, we now get
$$\nabla_m \, \mathbb{E}[f(X)] = \mathbb{E}[\nabla_m f(\epsilon + m)]$$
suggesting the estimator $\nabla_m f(\epsilon + m)$. In the computation graph, $x$ is now produced by adding $m$ to a draw $\epsilon$ from randn.

Key idea: $\epsilon$ does not depend on $m$.

Pathwise gradient

The pathwise gradient estimator is based on a change of variables. More generally, suppose there exists a random variable $\epsilon \in \mathcal{E}$ with density $p(\epsilon)$ and a function $y : \mathcal{E} \times \mathbb{R}^D \to \mathcal{X}$ such that $X \stackrel{d}{=} y(\epsilon, \theta)$. Then, assuming we can exchange the derivative and integral operations:
$$\nabla_\theta \, \mathbb{E}_{X \sim q_\theta}[f(X; \theta)] = \nabla_\theta \, \mathbb{E}_{\epsilon \sim p(\epsilon)}[f(y(\epsilon, \theta); \theta)] = \mathbb{E}_{\epsilon \sim p(\epsilon)}[\nabla_\theta f(y(\epsilon, \theta); \theta)]$$
suggesting the pathwise gradient estimator:
$$\nabla_\theta f(y(\epsilon, \theta); \theta)$$

Pathwise gradient

When can we exchange derivative and integral operators? Asmussen and Glynn (2007) give some simple conditions in Chap. 7.2, Prop. 2.3. Rule of thumb:
- Typically valid when $Z(\theta) := f(y(\epsilon, \theta); \theta)$ is continuous, and differentiable except at finitely many points.
- Does NOT hold for reparameterizations of discrete $X$!
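To make the running example concrete, here is a minimal numpy sketch of the pathwise estimator (my own illustration, not from the slides). It assumes the stand-in choice $f(x) = x^2$, for which $\mathbb{E}[f(X)] = m^2 + 1$, so the true gradient is $2m$ and the estimate can be checked.

```python
import numpy as np

def f_prime(x):
    """Derivative of the stand-in f(x) = x**2."""
    return 2.0 * x

def pathwise_grad(m, n_samples=100_000, seed=0):
    """Monte Carlo pathwise estimate of d/dm E[f(X)] for X ~ N(m, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)  # eps ~ N(0, 1); does not depend on m
    # Reparameterize: X = eps + m, so grad_m f(eps + m) = f'(eps + m).
    return f_prime(eps + m).mean()

print(pathwise_grad(m=1.5))  # approx. 2 * 1.5 = 3.0
```

Note how the sketch mirrors the key idea: the randomness is drawn once from a distribution that does not depend on $m$, and the derivative flows through the deterministic map $\epsilon \mapsto \epsilon + m$.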
Score function gradient

Let's move to the score function gradient.

Example: as before, let $f : \mathbb{R} \to \mathbb{R}$ be continuously differentiable and let $X \sim \mathcal{N}(m, 1)$ be a Gaussian with mean $m \in \mathbb{R}$. We want to compute
$$\nabla_m \, \mathbb{E}[f(X)]$$
Let's see if we can attack this directly.

Score function gradient

Assuming we can exchange derivative and integral,
$$\begin{aligned}
\nabla_m \, \mathbb{E}[f(X)] &= \nabla_m \int f(x) \, \frac{\exp\left(-(x - m)^2/2\right)}{\sqrt{2\pi}} \, dx \\
&= \int f(x) \, \nabla_m \frac{\exp\left(-(x - m)^2/2\right)}{\sqrt{2\pi}} \, dx \\
&= \int f(x) \, \frac{\exp\left(-(x - m)^2/2\right)}{\sqrt{2\pi}} \, \nabla_m\left(-(x - m)^2/2\right) dx \\
&= \int f(x) \, \frac{\exp\left(-(x - m)^2/2\right)}{\sqrt{2\pi}} \, (x - m) \, dx \\
&= \mathbb{E}[f(X)(X - m)]
\end{aligned}$$
i.e., weight the function value by the distance to the mean. Very intuitive!

Score function gradient

More generally, the score function gradient is based on the following identity:
$$\nabla_\theta q_\theta(x) = q_\theta(x) \nabla_\theta \log q_\theta(x)$$
Assuming we can exchange derivative and integral:
$$\nabla_\theta \, \mathbb{E}_{X \sim q_\theta}[f(X)] = \mathbb{E}_{X \sim q_\theta}[f(X) \nabla_\theta \log q_\theta(X)]$$
suggesting the score function gradient estimator:
$$f(X) \nabla_\theta \log q_\theta(X)$$
Note, this is for the case in which $f$ does not depend on $\theta$. To get a gradient with dependence on $\theta$, just add $\partial_\theta f(X; \theta)$, where $\partial_\theta$ is the vector of partial derivatives of $f$ w.r.t. $\theta$.

Control variates

The score function has an important property that makes it easy to design control variates. Let $C$ be any real-valued random variable that is uncorrelated with the score $\nabla_\theta \log q_\theta(X)$ (e.g., independent of $X$). Then
$$\mathbb{E}_{X \sim q_\theta}[C \nabla_\theta \log q_\theta(X)] = \mathbb{E}[C] \, \mathbb{E}_{X \sim q_\theta}[\nabla_\theta \log q_\theta(X)] = \mathbb{E}[C] \, \nabla_\theta \, \mathbb{E}_{X \sim q_\theta}[1] = 0$$
We can use this to reduce the variance of score function estimators! (A numpy sketch of a score function estimator with a simple control variate appears at the end of these notes.)

Control variates: RL

Recall our gradient estimator for the finite-horizon MDP:
$$\nabla J(\theta) = \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} r(\tau) \nabla \log \pi_\theta(a_t \mid s_t)\right]$$
Simulate a random trajectory $\tau \sim p$. Then $\sum_{t=0}^{T} r(\tau) \nabla \log \pi_\theta(a_t \mid s_t)$ is a score function estimator! We can reduce the variance using our new knowledge.

Control variates: RL

Given $s_t$, $r(s_{t'}, a_{t'})$ is independent of $a_t$ for $t' < t$, so
$$\begin{aligned}
\nabla J(\theta) &= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} r(\tau) \nabla \log \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} \left(\sum_{t'=0}^{T} r(s_{t'}, a_{t'})\right) \nabla \log \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} \left(\sum_{t' < t} r(s_{t'}, a_{t'}) + \sum_{t' \ge t} r(s_{t'}, a_{t'})\right) \nabla \log \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} \left(\sum_{t' \ge t} r(s_{t'}, a_{t'})\right) \nabla \log \pi_\theta(a_t \mid s_t)\right]
\end{aligned}$$
This is lower variance. But we can do more...

Control variates: RL

Given $s_t$, the value $V_t^\pi(s_t)$ is independent of $a_t$, so
$$\begin{aligned}
\nabla J(\theta) &= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} \left(\sum_{t' \ge t} r(s_{t'}, a_{t'})\right) \nabla \log \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} \left(\sum_{t' \ge t} r(s_{t'}, a_{t'}) - V_t^\pi(s_t)\right) \nabla \log \pi_\theta(a_t \mid s_t)\right] \\
&= \mathbb{E}_{\tau \sim p}\left[\sum_{t=0}^{T} A_t^\pi(s_t, a_t) \nabla \log \pi_\theta(a_t \mid s_t)\right]
\end{aligned}$$
This quantity
$$A_t^\pi(s_t, a_t) = \sum_{t' \ge t} r(s_{t'}, a_{t'}) - V_t^\pi(s_t)$$
is an example of an advantage, i.e., how much better it is to take action $a_t$ at time $t$ than the average value.

Control variates: RL

To summarize, (2) is a much lower variance gradient estimator than (1):
$$\sum_{t=0}^{T} \left(\sum_{t'=0}^{T} r(s_{t'}, a_{t'})\right) \nabla \log \pi_\theta(a_t \mid s_t) \tag{1}$$
$$\sum_{t=0}^{T} \left(\sum_{t' \ge t} r(s_{t'}, a_{t'}) - V_t^\pi(s_t)\right) \nabla \log \pi_\theta(a_t \mid s_t) \tag{2}$$

Talks today

- Can we use pathwise gradients for discrete random variables?
- Can we reduce the variance of the score function estimator for discrete random variables using clever subset structure?
- Can we reduce the variance of RL gradients by estimating the advantage in more clever ways?
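Finally, the promised companion sketch: a minimal numpy illustration (my own, not from the slides) of the score function estimator for the same Gaussian example, with and without a constant baseline as a control variate. As above, $f(x) = x^2$ is an assumed stand-in, and the baseline $b \approx \mathbb{E}[f(X)]$ is an illustrative choice rather than the variance-optimal one. Both estimators have mean $2m$; the baselined one has lower variance.

```python
import numpy as np

def f(x):
    """Stand-in integrand f(x) = x**2."""
    return x ** 2

def score_function_grads(m, n_samples=100_000, seed=0):
    """Per-sample score function estimates of d/dm E[f(X)], X ~ N(m, 1),
    without and with a constant baseline as a control variate."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=m, size=n_samples)
    score = x - m                  # grad_m log N(x; m, 1) = x - m
    naive = f(x) * score           # plain score function estimator

    # Constant baseline b, estimated on an independent batch so the
    # baselined estimator stays unbiased: E[b * score] = b * 0 = 0.
    b = f(rng.normal(loc=m, size=n_samples)).mean()
    baselined = (f(x) - b) * score
    return naive, baselined

naive, baselined = score_function_grads(m=1.5)
print(naive.mean(), naive.var())          # mean approx. 3.0, larger variance
print(baselined.mean(), baselined.var())  # same mean, smaller variance
```

Subtracting a constant from $f(X)$ leaves the expectation unchanged by the control variate identity above, which is exactly the mechanism behind the baseline $V_t^\pi(s_t)$ in the RL derivation.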