Policy Gradient
Lecture 7: Policy Gradient
David Silver

Outline
1 Introduction
2 Finite Difference Policy Gradient
3 Monte-Carlo Policy Gradient
4 Actor-Critic Policy Gradient

Introduction

Policy-Based Reinforcement Learning
- In the last lecture we approximated the value or action-value function using parameters θ:
  V_θ(s) ≈ V^π(s)
  Q_θ(s, a) ≈ Q^π(s, a)
- A policy was generated directly from the value function, e.g. using ε-greedy
- In this lecture we will directly parametrise the policy:
  π_θ(s, a) = P[a | s, θ]
- We will focus again on model-free reinforcement learning

Value-Based and Policy-Based RL
- Value-based: learnt value function, implicit policy (e.g. ε-greedy)
- Policy-based: no value function, learnt policy
- Actor-critic: learnt value function and learnt policy

Advantages of Policy-Based RL
Advantages:
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies
Disadvantages:
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance

Example: Rock-Paper-Scissors
- Two-player game of rock-paper-scissors: scissors beats paper, rock beats scissors, paper beats rock
- Consider policies for iterated rock-paper-scissors:
  - A deterministic policy is easily exploited
  - A uniform random policy is optimal (i.e. a Nash equilibrium)

Example: Aliased Gridworld (1)
- The agent cannot differentiate the grey states
- Consider features of the following form (for all N, E, S, W):
  φ(s, a) = 1(wall to N, a = move E)
- Compare value-based RL, using an approximate value function Q_θ(s, a) = f(φ(s, a), θ),
- to policy-based RL, using a parametrised policy π_θ(s, a) = g(φ(s, a), θ)

Example: Aliased Gridworld (2)
- Under aliasing, an optimal deterministic policy will either
  - move W in both grey states (shown by red arrows), or
  - move E in both grey states
- Either way, it can get stuck and never reach the money
- Value-based RL learns a near-deterministic policy, e.g. greedy or ε-greedy
- So it will traverse the corridor for a long time

Example: Aliased Gridworld (3)
- An optimal stochastic policy will randomly move E or W in the grey states:
  π_θ(wall to N and S, move E) = 0.5
  π_θ(wall to N and S, move W) = 0.5
- It will reach the goal state in a few steps with high probability
- Policy-based RL can learn the optimal stochastic policy
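Before moving on to how such a policy is evaluated, a minimal sketch of what "directly parametrising the policy" can look like in code may help. Assumptions: a linear-softmax parametrisation (one common choice, not specified at this point in the lecture), NumPy, and made-up feature values and weights for the aliased grey state.

```python
import numpy as np

def softmax_policy(theta, phi):
    """pi_theta(s, .) for a linear-softmax parametrisation:
    pi_theta(s, a) is proportional to exp(theta . phi(s, a)).

    phi: (num_actions, num_features) array, one feature row per action.
    Returns a probability vector over the actions.
    """
    prefs = phi @ theta
    prefs -= prefs.max()          # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

# Hypothetical features for the aliased grey state, actions [move E, move W]:
# 1(wall to N, a = move E) and 1(wall to N, a = move W).
phi_grey = np.array([[1.0, 0.0],   # move E
                     [0.0, 1.0]])  # move W

# With weights that are symmetric between E and W, the policy is stochastic,
# matching pi_theta(wall to N and S, move E) = pi_theta(..., move W) = 0.5.
theta = np.array([0.2, 0.2])
print(softmax_policy(theta, phi_grey))   # -> [0.5 0.5]
```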
Policy Search

Policy Objective Functions
- Goal: given a policy π_θ(s, a) with parameters θ, find the best θ
- But how do we measure the quality of a policy π_θ?
- In episodic environments we can use the start value:
  J_1(θ) = V^{π_θ}(s_1) = E_{π_θ}[v_1]
- In continuing environments we can use the average value:
  J_avV(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)
- Or the average reward per time-step:
  J_avR(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) R^a_s
- where d^{π_θ}(s) is the stationary distribution of the Markov chain for π_θ

Policy Optimisation
- Policy-based reinforcement learning is an optimisation problem: find θ that maximises J(θ)
- Some approaches do not use the gradient:
  - Hill climbing
  - Simplex / amoeba / Nelder-Mead
  - Genetic algorithms
- Greater efficiency is often possible using the gradient:
  - Gradient descent
  - Conjugate gradient
  - Quasi-Newton
- We focus on gradient descent (many extensions are possible) and on methods that exploit sequential structure

Finite Difference Policy Gradient

Policy Gradient
- Let J(θ) be any policy objective function
- Policy gradient algorithms search for a local maximum in J(θ) by ascending the gradient of the policy, w.r.t. parameters θ:
  Δθ = α ∇_θ J(θ)
- where ∇_θ J(θ) is the policy gradient,
  ∇_θ J(θ) = (∂J(θ)/∂θ_1, ..., ∂J(θ)/∂θ_n)^T
- and α is a step-size parameter

Computing Gradients By Finite Differences
- To evaluate the policy gradient of π_θ(s, a):
- For each dimension k ∈ [1, n]:
  - Estimate the kth partial derivative of the objective function w.r.t. θ
  - by perturbing θ by a small amount ε in the kth dimension:
    ∂J(θ)/∂θ_k ≈ (J(θ + ε u_k) − J(θ)) / ε
  - where u_k is the unit vector with 1 in the kth component, 0 elsewhere
- Uses n evaluations to compute the policy gradient in n dimensions
- Simple, noisy, inefficient, but sometimes effective
- Works for arbitrary policies, even if the policy is not differentiable
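The finite-difference estimator translates almost directly into code. Below is a minimal sketch, assuming only that J is a black-box scalar objective (in practice an average return estimated from rollouts, and therefore noisy); the quadratic at the end is a toy stand-in for such an objective.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate grad_theta J(theta) with n perturbed evaluations of J
    (plus one baseline evaluation at theta itself)."""
    n = len(theta)
    grad = np.zeros(n)
    J0 = J(theta)
    for k in range(n):
        u_k = np.zeros(n)
        u_k[k] = 1.0                               # unit vector in the k-th dimension
        grad[k] = (J(theta + eps * u_k) - J0) / eps
    return grad

def gradient_ascent_step(J, theta, alpha=0.1, eps=1e-2):
    """Delta theta = alpha * grad_theta J(theta)."""
    return theta + alpha * finite_difference_gradient(J, theta, eps)

# Usage on a toy objective standing in for a rollout-based estimate of J:
J = lambda th: -np.sum((th - 1.0) ** 2)
theta = np.zeros(3)
for _ in range(100):
    theta = gradient_ascent_step(J, theta)
print(theta)   # approaches the maximiser [1, 1, 1]
```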
AIBO Example: Training AIBO to Walk by Finite Difference Policy Gradient
- Goal: learn a fast AIBO walk (useful for RoboCup)
- The AIBO walk policy is controlled by 12 numbers describing half-elliptical loci for the feet
- Adapt these parameters by finite difference policy gradient
- Evaluate a policy by uploading its parameters to an AIBO and timing it as it walks between two fixed landmarks; a faster traversal gives a better score

Results (UT Austin Villa [10])
- Each evaluation was noisy, so every parameter set was scored by the average of three traversals; each iteration evaluated 15 policies (45 traversals, roughly 7.5 minutes)
- To offset the small perturbation sizes ε_j, a relatively large step size η = 2 was used
- Starting from the hand-tuned UT Austin Villa gait, 23 iterations of learning produced one of the fastest known AIBO gaits, significantly faster than both hand-coded gaits and previously learned gaits
- [Figure: the training environment, with each AIBO timing itself between a pair of landmarks (A-A', B-B', or C-C')]
- [Figure: the elliptical locus of the AIBO's foot, defined by length, height, and position in the x-y plane]
- [Figure: velocity of the best gait from each iteration during training, compared to hand-tuned gaits (CMU 2002: 200 mm/s, German Team: 230, UT Austin Villa: 245, UNSW: 254) and previously learned gaits (Hornby 1999: 170, UNSW: 270)]
- Initial policy, perturbation size ε_j, and best policy learned after 23 iterations (values in centimeters, except time to move through locus in seconds and time on ground as a fraction):

  Parameter                    Initial   ε_j     Best
  Front locus height           4.2       0.35    4.081
  Front locus x offset         2.8       0.35    0.574
  Front locus y offset         4.9       0.35    5.152
  Rear locus height            5.6       0.35    6.02
  Rear locus x offset          0.0       0.35    0.217
  Rear locus y offset          -2.8      0.35    -2.982
  Locus length                 4.893     0.35    5.285
  Locus skew multiplier        0.035     0.175   0.049
  Front height                 7.7       0.35    7.483
  Rear height                  11.2      0.35    10.843
  Time to move through locus   0.704     0.016   0.679
  Time on ground               0.5       0.05    0.430
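For concreteness, here is a schematic sketch of the training loop this example suggests. It is not the UT Austin Villa team's exact algorithm (their experiment evaluated 15 perturbed policies per iteration and scaled a normalised update by η = 2); the sketch applies the per-coordinate finite-difference scheme from the slides, and walk_time is a hypothetical stand-in for the physical traversal-time measurement.

```python
import numpy as np

def average_score(theta, walk_time, repeats=3):
    """Score a parameter set as negative walk time, averaged over noisy trials
    (mirroring the three evaluations per parameter set described above)."""
    return np.mean([-walk_time(theta) for _ in range(repeats)])

def train_gait(theta, walk_time, deltas, eta, iterations=23):
    """Per-coordinate finite-difference ascent over the 12 gait parameters.

    deltas -- per-parameter perturbation sizes (the epsilon_j column of the table)
    eta    -- step size; the paper's reported eta = 2 scales a normalised update,
              not a raw gradient as used in this sketch
    """
    n = len(theta)
    for _ in range(iterations):
        base = average_score(theta, walk_time)
        grad = np.zeros(n)
        for k in range(n):
            perturbed = theta.copy()
            perturbed[k] += deltas[k]            # perturb only the k-th parameter
            grad[k] = (average_score(perturbed, walk_time) - base) / deltas[k]
        theta = theta + eta * grad
    return theta

# Synthetic stand-in for the physical walk-time measurement (hypothetical):
rng = np.random.default_rng(0)
walk_time = lambda th: float(np.sum((th - 1.0) ** 2) + 0.01 * rng.normal())

theta = train_gait(np.zeros(12), walk_time, deltas=np.full(12, 0.35), eta=0.05)
print(theta)   # parameters move towards the synthetic optimum at 1.0
```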