Policy Gradient Algorithms

Ashwin Rao, ICME, Stanford University

Overview
1. Motivation and Intuition
2. Definitions and Notation
3. Policy Gradient Theorem and Proof
4. Policy Gradient Algorithms
5. Compatible Function Approximation Theorem
6. Natural Policy Gradient

Why do we care about Policy Gradient (PG)?
- Let us review how we got here.
- We started with Markov Decision Processes and Bellman Equations.
- Next we studied several variants of DP and RL algorithms.
- We noted that the idea of Generalized Policy Iteration (GPI) is key.
- Policy Improvement step: $\pi(s, a)$ derived from $\arg\max_a Q(s, a)$.
- How do we do the argmax when the action space is large or continuous?
- Idea: do the Policy Improvement step with Gradient Ascent instead.

"Policy Improvement with a Gradient Ascent??"
- We want to find the Policy that fetches the "Best Expected Returns".
- Gradient Ascent on "Expected Returns" w.r.t. the parameters of the Policy function.
- So we need a function approximation for the (stochastic) Policy function: $\pi(s, a; \theta)$.
- This is in addition to the usual function approximation for the Action Value function: $Q(s, a; w)$.
- $\pi(s, a; \theta)$ is called the Actor and $Q(s, a; w)$ is called the Critic.
- Critic parameters $w$ are optimized by minimizing a $Q(s, a; w)$ loss function.
- Actor parameters $\theta$ are optimized by maximizing Expected Returns.
- We need to formally define "Expected Returns".
- But we already see that this idea is appealing for continuous actions.
- This is GPI with the Policy Improvement step done as Policy Gradient (Ascent).

Value Function-based and Policy-based RL
- Value Function-based: learn a Value Function (with a function approximation); the Policy is implicit, readily derived from the Value Function (e.g., $\epsilon$-greedy).
- Policy-based: learn a Policy (with a function approximation); no need to learn a Value Function.
- Actor-Critic: learn a Policy (Actor) and also learn a Value Function (Critic).

Advantages and Disadvantages of the Policy Gradient approach
Advantages:
- Finds the best Stochastic Policy (the Optimal Deterministic Policy produced by other RL algorithms can be unsuitable for POMDPs).
- Naturally explores due to the Stochastic Policy representation.
- Effective in high-dimensional or continuous action spaces.
- Small changes in $\theta$ $\Rightarrow$ small changes in $\pi$, and in the state distribution.
- This avoids the convergence issues seen in argmax-based algorithms.
Disadvantages:
- Typically converges to a local optimum rather than a global optimum.
- Policy Evaluation is typically inefficient and has high variance.
- Policy Improvement happens in small steps $\Rightarrow$ slow convergence.

Notation
- Assume episodic with $0 \le \gamma \le 1$, or non-episodic with $0 \le \gamma < 1$.
- Usual notation of discrete-time, countable-spaces, stationary MDPs.
- We lighten the notation $P(s, a, s')$ to $\mathcal{P}_{s,s'}^a$ and $R(s, a)$ to $\mathcal{R}_s^a$.
- Initial State Probability Distribution denoted as $p_0 : \mathcal{N} \rightarrow [0, 1]$.
- Policy Function Approximation $\pi(s, a; \theta) = \mathbb{P}[A_t = a \mid S_t = s; \theta]$.
- PG coverage is quite similar for the non-discounted, non-episodic case, by considering the average-reward objective (we won't cover it).

"Expected Returns" Objective
Now we formalize the "Expected Returns" Objective $J(\theta)$:

$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t \cdot R_{t+1}\Big]$$

The Value Function $V^\pi(s)$ and the Action Value Function $Q^\pi(s, a)$ are defined as:

$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{k=t}^\infty \gamma^{k-t} \cdot R_{k+1} \mid S_t = s\Big] \text{ for all } t = 0, 1, 2, \ldots$$

$$Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{k=t}^\infty \gamma^{k-t} \cdot R_{k+1} \mid S_t = s, A_t = a\Big] \text{ for all } t = 0, 1, 2, \ldots$$

The Advantage Function is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.

Also, $p(s \rightarrow s', t, \pi)$ will be a key function for us: it denotes the probability of going from state $s$ to $s'$ in $t$ steps by following policy $\pi$.
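Since $J(\theta)$ is just an expectation of discounted returns under $\pi$, it can be estimated by plain Monte Carlo: sample trajectories under the policy and average $\sum_t \gamma^t \cdot R_{t+1}$. The sketch below illustrates this on a made-up 2-state, 2-action MDP with a fixed stochastic policy standing in for $\pi(s, a; \theta)$; the transition probabilities, rewards, horizon, and sample count are all illustrative assumptions, not anything from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, horizon = 2, 2, 0.9, 200
P = np.array([[[0.8, 0.2], [0.3, 0.7]],              # P[s, a, s']: hypothetical transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                            # R[s, a]: reward R^a_s (deterministic here)
              [0.5, 2.0]])
p0 = np.array([0.6, 0.4])                            # initial state distribution p_0
policy = np.array([[0.7, 0.3],                       # fixed pi(s, a), standing in for pi(s, a; theta)
                   [0.4, 0.6]])

def discounted_return():
    # One sampled trajectory's return: sum_t gamma^t * R_{t+1}, truncated at `horizon`
    s = rng.choice(nS, p=p0)
    g, discount = 0.0, 1.0
    for _ in range(horizon):                         # gamma^200 is negligible, so truncation is safe
        a = rng.choice(nA, p=policy[s])
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(nS, p=P[s, a])
    return g

# Monte Carlo estimate of J(theta) as the average discounted return over sampled episodes
J_estimate = np.mean([discounted_return() for _ in range(5000)])
print(f"Estimated J: {J_estimate:.3f}")
```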
Discounted-Aggregate State-Visitation Measure

$$J(\theta) = \mathbb{E}_\pi\Big[\sum_{t=0}^\infty \gamma^t \cdot R_{t+1}\Big] = \sum_{t=0}^\infty \gamma^t \cdot \mathbb{E}_\pi[R_{t+1}]$$

$$= \sum_{t=0}^\infty \gamma^t \cdot \sum_{s \in \mathcal{N}} \Big(\sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)\Big) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$

$$= \sum_{s \in \mathcal{N}} \Big(\sum_{S_0 \in \mathcal{N}} \sum_{t=0}^\infty \gamma^t \cdot p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)\Big) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$

Definition:

$$J(\theta) = \sum_{s \in \mathcal{N}} \rho^\pi(s) \cdot \sum_{a \in \mathcal{A}} \pi(s, a; \theta) \cdot \mathcal{R}_s^a$$

where $\rho^\pi(s) = \sum_{S_0 \in \mathcal{N}} \sum_{t=0}^\infty \gamma^t \cdot p_0(S_0) \cdot p(S_0 \rightarrow s, t, \pi)$ is the key function (for PG) that we'll refer to as the Discounted-Aggregate State-Visitation Measure.

Policy Gradient Theorem (PGT)

Theorem:

$$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{N}} \rho^\pi(s) \cdot \sum_{a \in \mathcal{A}} \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a)$$

- Note: $\rho^\pi(s)$ depends on $\theta$, but there is no $\nabla_\theta \rho^\pi(s)$ term in $\nabla_\theta J(\theta)$.
- Note: $\nabla_\theta \pi(s, a; \theta) = \pi(s, a; \theta) \cdot \nabla_\theta \log \pi(s, a; \theta)$.
- So we can simply generate sampling traces and, at each time step, calculate $(\nabla_\theta \log \pi(s, a; \theta)) \cdot Q^\pi(s, a)$ (the probabilities are implicit in the sampled paths).
- Note: $\nabla_\theta \log \pi(s, a; \theta)$ is the Score function (Gradient of log-likelihood).
- We will estimate $Q^\pi(s, a)$ with a function approximation $Q(s, a; w)$.
- We will later show how to avoid the estimate bias of $Q(s, a; w)$.
- This numerical estimate of $\nabla_\theta J(\theta)$ enables Policy Gradient Ascent.
- Let us look at the score function of some canonical $\pi(s, a; \theta)$.

Canonical $\pi(s, a; \theta)$ for finite action spaces
- For finite action spaces, we often use a Softmax Policy.
- $\theta$ is an $m$-vector $(\theta_1, \ldots, \theta_m)$.
- Features vector $\phi(s, a) = (\phi_1(s, a), \ldots, \phi_m(s, a))$ for all $s \in \mathcal{N}, a \in \mathcal{A}$.
- Weight actions using linear combinations of features: $\phi(s, a)^T \cdot \theta$.
- Action probabilities are proportional to exponentiated weights:

$$\pi(s, a; \theta) = \frac{e^{\phi(s, a)^T \cdot \theta}}{\sum_{b \in \mathcal{A}} e^{\phi(s, b)^T \cdot \theta}} \text{ for all } s \in \mathcal{N}, a \in \mathcal{A}$$

- The score function is:

$$\nabla_\theta \log \pi(s, a; \theta) = \phi(s, a) - \sum_{b \in \mathcal{A}} \pi(s, b; \theta) \cdot \phi(s, b) = \phi(s, a) - \mathbb{E}_\pi[\phi(s, \cdot)]$$

Canonical $\pi(s, a; \theta)$ for continuous action spaces
- For continuous action spaces, we often use a Gaussian Policy.
- $\theta$ is an $m$-vector $(\theta_1, \ldots, \theta_m)$.
- State features vector $\phi(s) = (\phi_1(s), \ldots, \phi_m(s))$ for all $s \in \mathcal{N}$.
- The Gaussian mean is a linear combination of state features: $\phi(s)^T \cdot \theta$.
- The variance may be fixed at $\sigma^2$, or it can also be parameterized.
- The Policy is Gaussian: $a \sim \mathcal{N}(\phi(s)^T \cdot \theta, \sigma^2)$ for all $s \in \mathcal{N}$.
- The score function is:

$$\nabla_\theta \log \pi(s, a; \theta) = \frac{(a - \phi(s)^T \cdot \theta) \cdot \phi(s)}{\sigma^2}$$
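Both score functions are easy to sanity-check numerically. The sketch below (not from the slides) evaluates the closed-form softmax and Gaussian scores at a random $\theta$ and compares them against central finite-difference gradients of $\log \pi$; the feature vectors, dimensions, $\sigma$, and test action values are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_actions = 4, 3

# Softmax policy for one fixed state s: pi(s, a; theta) proportional to exp(phi(s, a)^T theta)
phi_sa = rng.normal(size=(n_actions, m))             # hypothetical action features phi(s, a)

def softmax_pi(theta):
    w = phi_sa @ theta
    e = np.exp(w - w.max())
    return e / e.sum()

def softmax_score(theta, a):
    # grad_theta log pi(s, a; theta) = phi(s, a) - E_pi[phi(s, .)]
    return phi_sa[a] - softmax_pi(theta) @ phi_sa

# Gaussian policy: a ~ N(phi(s)^T theta, sigma^2)
phi_s = rng.normal(size=m)                           # hypothetical state features phi(s)
sigma = 0.7

def gaussian_log_pi(theta, a):
    mu = phi_s @ theta
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(theta, a):
    # grad_theta log pi(s, a; theta) = (a - phi(s)^T theta) * phi(s) / sigma^2
    return (a - phi_s @ theta) * phi_s / sigma ** 2

def finite_diff(f, theta, eps=1e-6):
    # Central finite-difference gradient, used only as a numerical check
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

theta = rng.normal(size=m)
print(np.allclose(softmax_score(theta, 1),
                  finite_diff(lambda t: np.log(softmax_pi(t)[1]), theta)))        # True
print(np.allclose(gaussian_score(theta, 0.3),
                  finite_diff(lambda t: gaussian_log_pi(t, 0.3), theta)))         # True
```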
Proof of Policy Gradient Theorem

We begin the proof by noting that:

$$J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot V^\pi(S_0) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0)$$

Calculate $\nabla_\theta J(\theta)$ by parts, splitting the product across $\pi(S_0, A_0; \theta)$ and $Q^\pi(S_0, A_0)$:

$$\nabla_\theta J(\theta) = \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta Q^\pi(S_0, A_0)$$

Now expand $Q^\pi(S_0, A_0)$ as $\mathcal{R}_{S_0}^{A_0} + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \mathcal{P}_{S_0,S_1}^{A_0} \cdot V^\pi(S_1)$ (Bellman Policy Equation):

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta \Big(\mathcal{R}_{S_0}^{A_0} + \gamma \cdot \sum_{S_1 \in \mathcal{N}} \mathcal{P}_{S_0,S_1}^{A_0} \cdot V^\pi(S_1)\Big)$$

Note that $\nabla_\theta \mathcal{R}_{S_0}^{A_0} = 0$, so remove that term:

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \nabla_\theta \Big(\gamma \cdot \sum_{S_1 \in \mathcal{N}} \mathcal{P}_{S_0,S_1}^{A_0} \cdot V^\pi(S_1)\Big)$$

Now bring the $\nabla_\theta$ inside the $\sum_{S_1 \in \mathcal{N}}$ so that it applies only to $V^\pi(S_1)$:

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \gamma \cdot \sum_{S_1 \in \mathcal{N}} \mathcal{P}_{S_0,S_1}^{A_0} \cdot \nabla_\theta V^\pi(S_1)$$

Now bring $\sum_{S_0 \in \mathcal{N}}$ and $\sum_{A_0 \in \mathcal{A}}$ inside the $\sum_{S_1 \in \mathcal{N}}$:

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \gamma \cdot \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \Big(\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}_{S_0,S_1}^{A_0}\Big) \cdot \nabla_\theta V^\pi(S_1)$$

Note that $\sum_{A_0 \in \mathcal{A}} \pi(S_0, A_0; \theta) \cdot \mathcal{P}_{S_0,S_1}^{A_0} = p(S_0 \rightarrow S_1, 1, \pi)$:

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \gamma \cdot \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \rightarrow S_1, 1, \pi) \cdot \nabla_\theta V^\pi(S_1)$$

Now expand $V^\pi(S_1)$ to $\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^\pi(S_1, A_1)$:

$$= \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot \sum_{A_0 \in \mathcal{A}} \nabla_\theta \pi(S_0, A_0; \theta) \cdot Q^\pi(S_0, A_0) + \sum_{S_1 \in \mathcal{N}} \gamma \cdot \sum_{S_0 \in \mathcal{N}} p_0(S_0) \cdot p(S_0 \rightarrow S_1, 1, \pi) \cdot \nabla_\theta \Big(\sum_{A_1 \in \mathcal{A}} \pi(S_1, A_1; \theta) \cdot Q^\pi(S_1, A_1)\Big)$$

We are now back to where we started: calculating the gradient of $\sum_a \pi \cdot Q^\pi$, so the same expansion can be applied repeatedly.
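The surprising part of the theorem, that no $\nabla_\theta \rho^\pi(s)$ term appears, can be verified numerically on a small example. The sketch below (not part of the slides) uses a made-up 2-state, 2-action MDP with a tabular softmax policy: it computes $\rho^\pi$, $V^\pi$ and $Q^\pi$ in closed form, evaluates the PGT expression $\sum_s \rho^\pi(s) \sum_a \nabla_\theta \pi(s, a; \theta) \cdot Q^\pi(s, a)$, and checks it against a finite-difference gradient of $J(\theta)$.

```python
import numpy as np

nS, nA, gamma = 2, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # P[s, a, s']: random illustrative transitions
R = rng.normal(size=(nS, nA))                        # R[s, a]: expected reward R^a_s
p0 = np.array([0.6, 0.4])                            # initial state distribution p_0

def pi(theta):
    # Tabular softmax policy: pi(s, a; theta) proportional to exp(theta[s, a])
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rho(theta):
    # Discounted-aggregate state-visitation measure: p0^T (I - gamma * P_pi)^{-1}
    P_pi = np.einsum('sa,sab->sb', pi(theta), P)
    return p0 @ np.linalg.inv(np.eye(nS) - gamma * P_pi)

def J(theta):
    # J(theta) = sum_s rho(s) * sum_a pi(s, a; theta) * R[s, a]
    return rho(theta) @ (pi(theta) * R).sum(axis=1)

def pgt_gradient(theta):
    # Right-hand side of the PGT: sum_s rho(s) * sum_a grad_theta pi(s, a; theta) * Q(s, a)
    p = pi(theta)
    P_pi = np.einsum('sa,sab->sb', p, P)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, (p * R).sum(axis=1))
    Q = R + gamma * np.einsum('sab,b->sa', P, V)     # Bellman policy equation
    r = rho(theta)
    grad = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            # d pi(s, b) / d theta[s, a] = pi(s, b) * (1{b == a} - pi(s, a))
            dpi = p[s] * ((np.arange(nA) == a) - p[s, a])
            grad[s, a] = r[s] * (dpi @ Q[s])
    return grad

theta = rng.normal(size=(nS, nA))
fd = np.zeros_like(theta)                            # finite-difference gradient of J(theta)
for s in range(nS):
    for a in range(nA):
        d = np.zeros_like(theta)
        d[s, a] = 1e-6
        fd[s, a] = (J(theta + d) - J(theta - d)) / 2e-6
print(np.allclose(pgt_gradient(theta), fd))          # True
```

The two gradients agree to numerical precision even though $\rho^\pi$ itself depends on $\theta$, which is exactly the content of the theorem.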
