
Is the Policy Gradient a Gradient?

Chris Nota
College of Information and Computer Sciences
University of Massachusetts Amherst
[email protected]

Philip S. Thomas
College of Information and Computer Sciences
University of Massachusetts Amherst
[email protected]

ABSTRACT

The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods drop the discount factor from the state distribution and therefore do not optimize the discounted objective. What do they optimize instead? This has been an open question for several years, and this lack of theoretical clarity has led to an abundance of misstatements in the literature. We answer this question by proving that the update direction approximated by most methods is not the gradient of any function. Further, we argue that algorithms that follow this direction are not guaranteed to converge to a "reasonable" fixed point by constructing a counterexample wherein the fixed point is globally pessimal with respect to both the discounted and undiscounted objectives. We motivate this work by surveying the literature and showing that there remains a widespread misunderstanding regarding discounted policy gradient methods, with errors present even in highly-cited papers published at top conferences.

1 INTRODUCTION

Reinforcement learning (RL) is a subfield of machine learning in which computational agents learn to maximize a numerical reward signal through interaction with their environment. Policy gradient methods encode an agent's behavior as a parameterized stochastic policy and update the policy parameters according to an estimate of the gradient of the expected sum of rewards (the expected return) with respect to those parameters. In practice, estimating the effect of a particular action on rewards received far in the future can be difficult, so almost all state-of-the-art implementations instead consider an exponentially discounted sum of rewards (the discounted return), which shortens the effective horizon considered when selecting actions. The policy gradient theorem [25] describes the appropriate update direction for this discounted setting. However, almost all modern policy gradient algorithms deviate from the original theorem by dropping one of the two instances of the discount factor that appear in the theorem. It has been an open question for several years as to whether these algorithms are unbiased with respect to a different, related objective [26]. In this paper, we answer this question and prove that most policy gradient algorithms, including state-of-the-art algorithms, do not follow the gradient of any function. Further, we show that for some tasks, the fixed point of the update direction followed by these algorithms is pessimal, regardless of whether the discounted or undiscounted objective is considered.

The analysis in this paper applies to nearly all state-of-the-art policy gradient methods. In Section 6, we review all of the policy gradient algorithms included in the popular stable-baselines repository [9] and their associated papers, including A2C/A3C [13], ACER [28], ACKTR [30], DDPG [11], PPO [18], TD3 [6], TRPO [16], and SAC [8]. We motivate this choice in Section 6, but we note that all of these papers were published at top conferences¹ and have received hundreds or thousands of citations. We found that all of the implementations of the algorithms used the "incorrect" policy gradient that we discuss in this paper. While this is a valid algorithmic choice if properly acknowledged, we found that only one of the eight papers acknowledged this choice, while three of the papers made erroneous claims regarding the discounted policy gradient and others made claims that were misleading. The purpose of identifying these errors is not to criticize the authors or the algorithms, but to draw attention to the fact that confusion regarding the behavior of policy gradient algorithms exists at the very core of the RL community and has gone largely unnoticed by reviewers. This has led to a proliferation of errors in the literature. We hope that by providing definitive answers to the questions associated with these errors we are able to improve the technical precision of the literature and contribute to the development of a better theoretical understanding of the behavior of reinforcement learning algorithms.

¹ICML, NeurIPS, or ICLR, with the exception of PPO, which appears to have been published only on arXiv.

2 NOTATION

RL agents learn through interactions with an environment. An environment is expressed mathematically as a Markov decision process (MDP). An MDP is a tuple, (S, A, P, R, d_0, γ), where S is the set of possible states of the environment, A is the set of actions available to the agent, P : S × A × S → [0, 1] is a transition function that determines the probability of transitioning between states given an action, R : S × A → [−R_max, R_max] is the expected reward from taking an action in a particular state, bounded by some R_max ∈ ℝ, d_0 : S → [0, 1] is the initial state distribution, and γ ∈ [0, 1] is the discount factor, which decreases the utility of rewards received in the future. In the episodic setting, interactions with the environment are broken into independent episodes. Each episode is further broken into individual timesteps. At each timestep, t, the agent observes a state, S_t, takes an action, A_t, transitions to a new state, S_{t+1}, and receives a reward, R_t. Each episode begins with t = 0 and ends when the agent enters a special state called the terminal absorbing state, s_∞. Once s_∞ is entered, the agent can never leave and receives a reward of 0 forever. We assume that lim_{t→∞} Pr(S_t = s_∞) = 1, since otherwise the episode may persist indefinitely and the continuing setting must be considered.

A policy, π : S × A → [0, 1], determines the probability that an agent will choose an action in a particular state. A parameterized policy, π^θ, is a policy that is defined as a function of some parameter vector, θ, which may be the weights in a neural network, values in a tabular representation, etc. The compatible features of a parameterized policy represent how θ may be changed in order to make a particular action, a ∈ A, more likely in a particular state, s ∈ S, and are defined as ψ(s, a) := ∂/∂θ ln π^θ(s, a). The value function, V_γ^θ : S → ℝ, represents the expected discounted sum of rewards when starting in a particular state under policy π^θ; that is, ∀t, V_γ^θ(s) := E[Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s, θ], where conditioning on θ indicates that ∀t, A_t ∼ π^θ(S_t, ·). The action-value function, Q_γ^θ : S × A → ℝ, is similar, but also considers the action taken; that is, ∀t, Q_γ^θ(s, a) := E[Σ_{k=0}^∞ γ^k R_{t+k} | S_t = s, A_t = a, θ]. The advantage function is the difference between the action-value function and the (state) value function: A_γ^θ(s, a) := Q_γ^θ(s, a) − V_γ^θ(s).

The objective of an RL agent is to maximize some function, J, of its policy parameters, θ. In the episodic setting, the two most commonly stated objectives are the discounted objective, J_γ(θ) = E[Σ_{t=0}^∞ γ^t R_t | θ], and the undiscounted objective, J(θ) = E[Σ_{t=0}^∞ R_t | θ]. The discounted objective has some convenient mathematical properties, but it corresponds to few real-world tasks. Sutton and Barto [23] have even argued for its deprecation. However, we will see in Section 6 that the discounted objective is commonly stated as a justification for the use of a discount factor, even when the algorithms themselves do not optimize this objective.

3 PROBLEM STATEMENT

The formulation of the policy gradient theorem [2, 25, 29] presented by Sutton et al. [25] was given for two objectives: the average reward objective for the infinite horizon setting [12] and the discounted objective, J_γ, for the episodic setting. The episodic setting considered in this paper is more popular, as it is better suited to the types of tasks that RL researchers typically use for evaluation (e.g., many classic control tasks, Atari games [3], etc.). The discounted policy gradient, ∇J_γ(θ), tells us how to modify the policy parameters, θ, in order to increase J_γ, and is given by:

\[
\nabla J_\gamma(\theta) = \mathbf{E}\!\left[\sum_{t=0}^{\infty} \gamma^t \, \psi(S_t, A_t) \, Q^\theta_\gamma(S_t, A_t) \,\middle|\, \theta\right].
\]

Because ∇J_γ is the true gradient of the discounted objective, algorithms that follow unbiased estimates of it are given the standard guarantees of stochastic gradient descent (namely, that given an appropriate step-size schedule and smoothness assumptions, convergence to a locally optimal policy is almost sure [4]). However, most conventional "policy gradient" algorithms instead directly or indirectly estimate the expression:

\[
\nabla J_?(\theta) = \mathbf{E}\!\left[\sum_{t=0}^{\infty} \psi(S_t, A_t) \, Q^\theta_\gamma(S_t, A_t) \,\middle|\, \theta\right].
\]

Note that this expression includes the γ contained in Q_γ^θ, but differs from the true discounted policy gradient in that it drops the outer γ^t. We label this expression ∇J_?(θ) because the question of whether or not it is the gradient of some objective function, J_?, was left open by Thomas [26]. Thomas [26] was only able to construct J_? in an impractically restricted setting where π did not affect the state distribution. The goal of this paper is to provide answers to the following questions:

• Is ∇J_?(θ) the gradient of some objective function?
• If not, does ∇J_?(θ) at least converge to a reasonable policy?
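To make the distinction concrete, the following Python sketch (our own illustration, not code from any of the papers or libraries reviewed in Section 6; the function names are ours) computes both update directions from a single sampled episode, given a user-supplied score function ψ. The only difference between the two estimates is the γ^t weighting.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_k gamma^k R_{t+k}, a Monte Carlo estimate of Q_gamma(S_t, A_t)."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def policy_gradient_estimates(states, actions, rewards, score_fn, gamma):
    """Return (estimate of grad J_gamma, estimate of grad J_?) from one episode.

    score_fn(s, a) must return psi(s, a) = d/dtheta ln pi_theta(s, a) as a vector.
    """
    G = discounted_returns(rewards, gamma)
    grad_J_gamma = np.zeros_like(score_fn(states[0], actions[0]))
    grad_J_question = np.zeros_like(grad_J_gamma)
    for t, (s, a) in enumerate(zip(states, actions)):
        grad_J_gamma += (gamma ** t) * score_fn(s, a) * G[t]  # keeps the gamma^t weight
        grad_J_question += score_fn(s, a) * G[t]              # drops it, as most methods do
    return grad_J_gamma, grad_J_question
```

In the tabular counterexamples used later in this paper, score_fn can be written in closed form; with a neural network policy it would be supplied by automatic differentiation.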
4 ∇J_?(θ) IS NOT A GRADIENT

In this section, we answer the first of our two questions and show that the update direction used by almost all policy gradient algorithms, ∇J_?(θ), is not the gradient of any function, using a proof by contraposition with the Clairaut-Schwarz theorem on mixed partial derivatives [19]. First, we present this theorem (Theorem 4.1) and its contrapositive (Corollary 4.2). Next, we present Lemma 4.3, which allows us to rewrite ∇J_?(θ) in a new form. Finally, in Theorem 4.4 we apply Corollary 4.2 and Lemma 4.3 and derive a counterexample proving that J_? does not, in general, exist, and therefore that the "policy gradient" given by ∇J_?(θ) is not, in fact, a gradient.

Theorem 4.1 (Clairaut-Schwarz theorem). If f : ℝⁿ → ℝ exists and is continuously twice differentiable in some neighborhood of the point (a_1, a_2, ..., a_n), then its second derivative is symmetric:

\[
\forall i, j : \quad \frac{\partial^2 f(a_1, a_2, \ldots, a_n)}{\partial x_i \, \partial x_j} = \frac{\partial^2 f(a_1, a_2, \ldots, a_n)}{\partial x_j \, \partial x_i}.
\]

Proof. The first complete proof was given by Schwarz [19]. English proofs can be found in many advanced calculus and analysis textbooks [15, p. 236]. □

Corollary 4.2 (Contrapositive of Clairaut-Schwarz). If at some point (a_1, a_2, ..., a_n) ∈ ℝⁿ there exist an i and j such that

\[
\frac{\partial^2 f(a_1, a_2, \ldots, a_n)}{\partial x_i \, \partial x_j} \neq \frac{\partial^2 f(a_1, a_2, \ldots, a_n)}{\partial x_j \, \partial x_i},
\]

then f does not exist or is not continuously twice differentiable in any neighborhood of (a_1, a_2, ..., a_n).

Proof. Contrapositive of Theorem 4.1. As a reminder, the contrapositive of a statement, P ⟹ Q, is ¬Q ⟹ ¬P. The contrapositive is always implied by the original statement. Additionally, recall that for any function g, ¬(∀i : g(i)) ⟹ ∃i : ¬g(i). □

If we can find an example where ∇²J_?(θ) is continuous but asymmetric, that is, ∃i, j : ∂²J_?(θ)/∂θ_i∂θ_j ≠ ∂²J_?(θ)/∂θ_j∂θ_i, then we may apply Corollary 4.2 and conclude that J_? does not exist. To this end, we present a new lemma that allows us to rewrite ∇J_?(θ) in a form that is more amenable to computing the second derivatives by hand. The result of this lemma is of some theoretical interest in itself, but further interpretation is left as future work. We do not leverage it here for any purpose except to aid in our proof of Theorem 4.4.

Lemma 4.3. Let d_γ^θ be the unnormalized, weighted state distribution given by:

\[
d^\theta_\gamma(s) := d_0(s) + (1 - \gamma) \sum_{t=1}^{\infty} \Pr(S_t = s \mid \theta).
\]

Then:

\[
\nabla J_?(\theta) = \sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \, \frac{\partial}{\partial \theta} V^\theta_\gamma(s).
\]

Proof. See appendix. □

In this form, we begin to see the root of the issue: in the above expression, d_γ^θ(s) is not differentiated, meaning that ∇J_?(θ) does not consider the effect updates to θ have on the state distribution. We will show that this is in fact the source of the asymmetry in ∇²J_?(θ). With this in mind, we present our main theorem.

Theorem 4.4. Let M be the set of all MDPs with rewards bounded by [−R_max, R_max] satisfying ∀π : Σ_{t=0}^∞ Pr(S_t ≠ s_∞) < ∞. Then, for all γ < 1:

\[
\neg \exists J_? : \forall M \in \mathcal{M} : \quad \nabla J_?(\theta) = \mathbf{E}\!\left[\sum_{t=0}^{\infty} \psi(S_t, A_t) \, Q^\theta_\gamma(S_t, A_t) \,\middle|\, \theta\right].
\]

Proof. Theorem 4.1 states that if J_? is a twice continuously differentiable function, then its second derivative is symmetric. It is easy to show informally, using Lemma 4.3, that this symmetry does not hold in general:

\[
\frac{\partial^2 J_?(\theta)}{\partial \theta_i \, \partial \theta_j}
= \frac{\partial}{\partial \theta_i} \left( \sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta_j} V^\theta_\gamma(s) \right)
= \underbrace{\sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} V^\theta_\gamma(s)}_{\text{symmetric}}
+ \underbrace{\sum_{s \in \mathcal{S}} \frac{\partial}{\partial \theta_i} d^\theta_\gamma(s) \, \frac{\partial}{\partial \theta_j} V^\theta_\gamma(s)}_{\text{asymmetric}}.
\]

Therefore, we have by Corollary 4.2 that, so long as ∇²J_?(θ) is continuous, J_? does not exist. In order to rigorously complete the proof, we must provide as a counterexample an MDP for which the above asymmetry is present. We provide such a counterexample in Figure 1. While we defer the full proof to the appendix, we describe the intuition behind the counterexample below.

Figure 1: A counterexample wherein the derivative of ∇J_?(θ) is asymmetric for all γ < 1, necessary for the proof of Theorem 4.4. The agent begins in s_1, and may choose between actions a_1 and a_2, each of which produces deterministic transitions. The rewards are 0 everywhere except when the agent chooses a_1 in state s_2, which produces a reward of +1. The policy is assumed to be tabular over states and actions, with one parameter, θ_1, determining the policy in s_1, and a second parameter, θ_2, determining the policy in s_2 (e.g., the policy may be determined by the standard logistic function, σ(x) = 1/(1 + e^{−x}), such that π(s_1, a_1) = σ(θ_1)).

Assume that for the example given in Figure 1, J_? exists and ∇J_?(θ) is its gradient. Consider in particular the case where γ = 0. In this case, ∇²J_?(θ) is asymmetric because θ_1 affects the value function and the state distribution, whereas θ_2 affects the value function but not the state distribution. Therefore, the second term in ∂²J_?(θ)/∂θ_i∂θ_j (labeled "asymmetric" in the above equation) is non-zero when i = 1 and j = 2, and zero when i = 2 and j = 1. In the appendix, we show that the above expression is symmetric if and only if γ = 1. Therefore, for γ < 1, Corollary 4.2 applies and we may conclude that J_? either does not exist or is not continuously twice differentiable. In this example, ∇²J_?(θ) is continuous everywhere. Therefore, we conclude that J_? does not exist. This completes the counterexample. □
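As a sanity check of Lemma 4.3 on this counterexample, the sketch below (our own; it assumes the Figure 1 dynamics and uses the closed-form expressions for d_γ^θ and ∂V_γ^θ/∂θ derived in the appendix) evaluates both sides of the lemma exactly, which is possible because every episode in this MDP lasts at most two steps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_J_question_lhs(theta1, theta2, gamma):
    """E[sum_t psi(S_t, A_t) Q_gamma(S_t, A_t) | theta] for the Figure 1 MDP,
    by enumerating the (at most two-step) episodes.
    Q(s1, a1) = gamma * V(s2), Q(s2, a1) = 1, and all other action values are 0."""
    p1, p2 = sigmoid(theta1), sigmoid(theta2)
    # t = 0: always in s1; psi(s1, a1) = (1 - p1, 0), psi(s1, a2) = (-p1, 0).
    g1 = p1 * (1 - p1) * (gamma * p2) + (1 - p1) * (-p1) * 0.0
    # t = 1: in s2 with probability p1; psi(s2, a1) = (0, 1 - p2), psi(s2, a2) = (0, -p2).
    g2 = p1 * (p2 * (1 - p2) * 1.0 + (1 - p2) * (-p2) * 0.0)
    return np.array([g1, g2])

def grad_J_question_rhs(theta1, theta2, gamma):
    """sum_s d_gamma(s) * dV_gamma(s)/dtheta, with d_gamma(s1) = 1 and
    d_gamma(s2) = (1 - gamma) * pi(s1, a1), as computed in the appendix."""
    p1, p2 = sigmoid(theta1), sigmoid(theta2)
    dp1, dp2 = p1 * (1 - p1), p2 * (1 - p2)
    dV_s1 = np.array([gamma * p2 * dp1, gamma * p1 * dp2])   # V(s1) = gamma * p1 * p2
    dV_s2 = np.array([0.0, dp2])                             # V(s2) = p2
    return 1.0 * dV_s1 + (1 - gamma) * p1 * dV_s2

theta1, theta2, gamma = 0.3, -0.7, 0.9
assert np.allclose(grad_J_question_lhs(theta1, theta2, gamma),
                   grad_J_question_rhs(theta1, theta2, gamma))
```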

5 THE FIXED POINT OF ∇J_?(θ) IS SOMETIMES PESSIMAL

Having established that ∇J_?(θ) is not the gradient of any function for choices of γ < 1, we move on to the question of whether or not ∇J_?(θ) converges to some reasonable policy in the general case. For instance, consider the case of temporal difference (TD) methods. While the expected update of TD is not a gradient update [22], in the on-policy setting with a linear function approximator, TD has been shown to converge to a unique point close to the global optimum of the mean-squared projected Bellman error (MSPBE) [23, 24, 27], which is called the "TD fixed point." Through the geometric interpretation of the MSPBE, it may be said that the TD fixed point is "reasonable" in that it is close to the best possible estimate of the mean squared Bellman error (MSBE) under a particular linear parameterization of the value function.

We ask the question of whether or not a similar reasonable fixed point exists in the case of the update given by ∇J_?(θ). While it is not clear in what sense the fixed point should be "reasonable," we propose a very minimal criterion: any reasonable policy fixed point should at least surpass the worst possible (pessimal) policy under either the discounted or undiscounted objective. That is, if the fixed point is pessimal under both objectives, this suggests that it will be difficult to come up with a satisfactory justification. Surprisingly, it can be shown that ∇J_?(θ) fails to pass even this low bar.

To demonstrate this, we contrast two examples, given in Figures 2 and 3. In the former example, ∇J_?(θ) behaves in a way that is (perhaps) expected: it converges to the optimal policy under the discounted objective, while failing to optimize the undiscounted objective. This is a well-understood trade-off of discounting that can be explained as "short-sightedness" by the agent. The latter example, however, shows a case where an agent following ∇J_?(θ) behaves in a manner that is apparently irrational: it achieves the smallest possible discounted return and undiscounted return, despite the fact that it is possible to maximize either within the given policy parameterization. We therefore suggest that for at least some MDPs, ∇J_?(θ) may not be a reasonable choice.

Figure 2: An example where the fixed point of ∇J_?(θ) is optimal with respect to the discounted objective but pessimal with respect to the undiscounted objective. The agent starts in state s_1, and can achieve a reward of +1 by transitioning from s_1 to s_2, and a reward of +2 by transitioning from s_6 to s_7, regardless of the action taken. In order to maximize the undiscounted return, the agent should choose a_2. However, the discounted return is higher from choosing a_1. Only the return in s_1 will affect the policy gradient, as the advantage is zero in every other state. Thus, algorithms following ∇J_?(θ) will eventually choose a_1. If the researcher is concerned with the undiscounted objective, as is often the case, this result is problematic. Choosing a larger value of γ trivially fixes the problem in this particular example, but for any value of γ < 1, similar problems will arise given a sufficiently long horizon, and thus the problem is not truly eliminated. Nevertheless, this is well understood as the trade-off of discounting.

Figure 3: An example where the fixed point of ∇J_?(θ) is pessimal with respect to both the discounted and undiscounted objectives. In this formulation, there is a single policy parameter, θ, so the agent must execute the same policy in every state. It is difficult to justify any solution other than to always choose a_1. If the agent is completely myopic, then a_1 gives the superior immediate reward of +1 in the starting state. If the agent is somewhat farther-sighted, then always choosing a_1 will eventually result in the +100 reward. Choosing a_2 provides no benefit with respect to either view. Following ∇J_?(θ), however, will result in a policy that always chooses a_2. This results from the fact that the advantage of a_2 in state s_2 is greater than the advantage of a_1 in state s_1, while the advantages in states s_3 and s_5 are zero. Because ∇J_?(θ) ignores the change in the state distribution, the net update always increases the probability of choosing a_2. Again, while we chose γ = 0 for simplicity, similar examples can be produced in long-horizon problems with "reasonable" settings of γ, such as 0.99 or higher. One may ask if the sharing of a policy between states is contrived, but such a situation occurs reliably under partial observability or when a function approximator is used.

6 LITERATURE REVIEW

We previously claimed that ∇J_?(θ) is the direction followed by state-of-the-art policy gradient methods and that the lack of theoretical clarity on the behavior of algorithms following ∇J_?(θ) has resulted in a multiplicity of errors in the literature. In this section, we substantiate this point by surveying a subset of popular policy gradient algorithms and their associated papers. We show that the majority of them include erroneous or misleading statements directly relating to arguments made in this paper. We emphasize that we do not pose this as a criticism of the authors, but rather as a symptom of the ambiguity in the literature that we hope to address with this paper.
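To make explicit what following ∇J_?(θ) looks like in practice, the sketch below (a schematic with dummy data, written by us and not taken from any of the reviewed papers or libraries) shows the kind of batch update that such implementations typically compute, and the single γ^t reweighting that the discounted policy gradient theorem would additionally require.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T, n_params = 0.99, 5, 3

# Dummy per-timestep quantities standing in for a short rollout:
score = rng.normal(size=(T, n_params))   # psi(S_t, A_t), the score vectors
advantage = rng.normal(size=T)           # discounted advantage estimates

# The update typically computed in practice: an estimate of grad J_?(theta)
# (averaging over the batch only rescales the direction).
update_J_question = (score * advantage[:, None]).mean(axis=0)

# Reweighting each timestep by gamma**t recovers the form prescribed by the
# discounted policy gradient theorem, i.e., an estimate of grad J_gamma(theta).
discounts = gamma ** np.arange(T)
update_J_gamma = (discounts[:, None] * score * advantage[:, None]).mean(axis=0)
```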

6.1 Methodology

Rather than manually selecting which papers to review, which may have introduced an unacceptable degree of bias, we chose to review the papers associated with every policy gradient algorithm included in stable-baselines [9], a fork of the popular OpenAI baselines [5] library. We chose this particular subset of algorithms because inclusion in the library generally indicates that the algorithms have achieved a certain level of popularity and relevance. The papers corresponding to these algorithms have received hundreds or thousands of citations each, and, with the exception of PPO [18], were published at top machine learning conferences. It therefore seems reasonable to claim that the papers were impactful in the field and are representative of high-quality research. While we acknowledge that this sampling of algorithms is heavily biased towards the subfield of "deep" RL, we argue that this is not unreasonable given the immense popularity of this area and its impact on the broader field.

For each algorithm, we examined the pseudocode for the algorithm itself, the background and theoretical sections of the associated paper, and several publicly available implementations, including the implementation created by the authors where available. We tried to answer the following three questions for each algorithm:

(1) Does the algorithm use ∇J_?(θ) rather than an unbiased estimator?
(2) If so, did the authors note that ∇J_?(θ) is not an unbiased estimator of the policy gradient?
(3) Did the paper include any erroneous or misleading claims about ∇J_?(θ)?

For questions (2) and (3), we support our evaluation of the papers with quotations from the text in cases where there were errors or ambiguities. While this approach is verbose, we felt that paraphrasing the original papers would not allow the readers to understand the errors in their appropriate context. For 5 out of 8 of the algorithms,² the original authors or organizations provided code, allowing us to directly answer (1). For all eight papers, we examined the stable-baselines [9] implementation as well as several other implementations, including tf-agents [20], garage [7], spinning-up [1], and the autonomous-learning-library [14]. Finally, for each paper we note the conference, year, and citation count estimated by Google Scholar on February 23, 2020.

²ACKTR, PPO, TD3, TRPO, and SAC.

6.2 Results

Eight policy gradient algorithms are included in stable-baselines. Our high-level results for each question are as follows:

(1) All eight of the algorithms use ∇J_?(θ) instead of an unbiased estimator, both in their pseudocode and implementations.
(2) Only one out of the eight papers calls attention to the fact that ∇J_?(θ) is a biased estimator.
(3) Three out of the eight papers included claims that are clearly erroneous. Two additional papers made misleading claims in that they presented the discounted policy gradient (∇J_γ(θ)) and then proposed an algorithm that uses ∇J_?(θ), without making note of this discrepancy. A sixth paper included claims that we argue are misleading, but not strictly false.

We reproduce the three explicitly erroneous claims below in order to give the reader a sense of common misunderstandings. Notice that the quotations sometimes use slightly different notation than this paper. We try to supply the appropriate context such that the meaning intended by the authors can be understood. Additional analysis can be found in the appendix.

A3C (ICML 2016, 2859 citations)

Asynchronous Advantage Actor-Critic (A3C) [13] is an actor-critic method that generates sample batches by running multiple actors in parallel. The original version of the algorithm achieved state-of-the-art performance on many Atari games when it was published. The background section of the main paper includes the text:

    "The return R_t = Σ_{k=0}^∞ γ^k r_{t+k} is the total accumulated return from time step t with discount factor γ ∈ (0, 1]. The goal of the agent is to maximize the expected return from each state s_t. [...] In contrast to value-based methods, policy-based methods directly parameterize the policy π(a|s; θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t]. One example of such a method is the REINFORCE family of algorithms due to Williams [29]. Standard REINFORCE updates the policy parameters θ in the direction ∇_θ log π(a_t|s_t; θ)R_t, which is an unbiased estimate of ∇_θ E[R_t]. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline [29], from the return. The resulting gradient is ∇_θ log π(a_t|s_t; θ)(R_t − b_t(s_t))."

It is falsely claimed that ∇_θ log π(a_t|s_t; θ)R_t (which is an unbiased sample estimate of ∇J_?(θ)) is an unbiased estimate of ∇_θ E[R_t], where R_t is the discounted return at timestep t rather than the individual reward. We showed that this is not the case.

ACKTR (ICLR 2017, 235 citations)

Actor Critic using Kronecker-Factored Trust Region (ACKTR) [30] is a variation of A3C that attempts to efficiently estimate the natural policy gradient, which uses an alternate notion of the direction of steepest ascent. The resulting algorithm is considerably more sample efficient than A3C. The background section contains:

    "The goal of the agent is to maximize the expected γ-discounted cumulative return J(θ) = E_π[R_t] = E_π[Σ_{i≥0} γ^i r(s_{t+i}, a_{t+i})] with respect to the policy parameters θ. Policy gradient methods directly parameterize a policy π_θ(a|s_t) and update parameter θ so as to maximize the objective J(θ). In its general form [17], the policy gradient is defined as:
    ∇_θ J(θ) = E_π[Σ_{t=0}^∞ Ψ_t ∇_θ log π_θ(a_t|s_t)],
    where Ψ_t is often chosen to be the advantage function A^π(s_t, a_t), which provides a relative measure of value of each action a_t at a given state s_t."

The authors assert that the gradient, ∇_θ J(θ), is equal to the given expression, which is false for the proposed choice of A^π(s_t, a_t). They did not note that the above expression is an approximation or otherwise clarify. Therefore, the definition of the policy gradient presented above is erroneous.

SAC (ICML 2018, 446 citations)

Soft Actor-Critic (SAC) [8] is an algorithm in the deterministic policy gradient family [21]. The deterministic policy gradient gives an expression for updating the parameters of a deterministic policy and is derived from the conventional policy gradient theorem, meaning that the arguments presented in this paper apply. SAC is considered state-of-the-art on continuous control tasks. The appendix states:

    "The exact definition of the discounted maximum entropy objective is complicated by the fact that, when using a discount factor for policy gradient methods, we typically do not discount the state distribution, only the rewards. In that sense, discounted policy gradients typically do not optimize the true discounted objective. Instead, they optimize average reward, with the discount serving to reduce variance, as discussed by Thomas [26]."

The relationship between the average reward objective and ∇J_?(θ) was discussed by Kakade [10] and elaborated on by Thomas [26]. However, the claim that ∇J_?(θ) optimizes the average reward objective is erroneous by Theorem 4.4.

6.3 Discussion

The three erroneous claims all involved misinterpreting ∇J_?(θ) as the gradient of some function. Two of the remaining papers failed to acknowledge the difference between the gradient of the discounted objective, ∇J_γ(θ), and the gradient direction followed by the presented algorithm, typically an estimate of ∇J_?(θ). Even among the papers where we did not find explicit errors, errors were avoided largely through the use of hedged language and ambiguity, rather than technical precision. For examples of this, we encourage the reader to refer to the appendix. While for the purposes of this review we only sampled a small subset of the literature on policy gradient methods, we found the results sufficient to support our claim that there exists a widespread misunderstanding regarding ∇J_?(θ).

7 CONCLUSIONS

We conclude by emphasizing that while ∇J_?(θ) is not a gradient (Section 4), can in some cases result in pessimal behavior (Section 5), and is commonly misrepresented in the literature (Section 6), it has remained the most popular estimator of the policy gradient due to its effectiveness when applied to practical problems. The precise reason for this effectiveness, especially in the episodic setting, remains an open question.

APPENDIX

Proof of Lemma 4.3

We begin by hypothesizing that ∇J_?(θ) takes the form of a weighted distribution over ∂V_γ^θ(s)/∂θ, given some time-dependent weights, w(t), on each term in the state distribution. That is, we hypothesize that the equality

\[
\nabla J_?(\theta) = \sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta} V^\theta_\gamma(s)
\]

holds for some d_γ^θ of the form:

\[
d^\theta_\gamma(s) = \sum_{t=0}^{\infty} w(t) \Pr(S_t = s \mid \theta).
\]

We then must prove that this holds for some choice of w(t), and then derive the satisfying choice of w(t). Sutton et al. [25] established that:

\[
\frac{\partial}{\partial \theta} V^\theta_\gamma(s) = \sum_{k=0}^{\infty} \gamma^k \sum_{x \in \mathcal{S}} \Pr(S_{t+k} = x \mid S_t = s, \theta) \sum_{a \in \mathcal{A}} Q^\theta_\gamma(x, a) \frac{\partial \pi^\theta(x, a)}{\partial \theta}.
\]

Substituting this into our expression for ∇J_?(θ) gives us:

\[
\begin{aligned}
\sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta} V^\theta_\gamma(s)
&= \sum_{s \in \mathcal{S}} \sum_{t=0}^{\infty} w(t) \Pr(S_t = s \mid \theta) \sum_{k=0}^{\infty} \gamma^k \sum_{x \in \mathcal{S}} \Pr(S_{t+k} = x \mid S_t = s, \theta) \sum_{a \in \mathcal{A}} Q^\theta_\gamma(x, a) \frac{\partial \pi^\theta(x, a)}{\partial \theta} \\
&= \sum_{s \in \mathcal{S}} \sum_{t=0}^{\infty} \sum_{k=0}^{\infty} \sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} w(t) \gamma^k \Pr(S_t = s \mid \theta) \Pr(S_{t+k} = x \mid S_t = s, \theta) \, Q^\theta_\gamma(x, a) \, \pi^\theta(x, a) \frac{\partial \ln(\pi^\theta(x, a))}{\partial \theta} \\
&= \sum_{s \in \mathcal{S}} \sum_{t=0}^{\infty} \sum_{k=0}^{\infty} \sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} w(t) \gamma^k \Pr(S_t = s \mid \theta) \Pr(S_{t+k} = x \mid S_t = s, \theta) \Pr(A_{t+k} = a \mid S_{t+k} = x, \theta) \, Q^\theta_\gamma(x, a) \, \psi(x, a) \\
&= \sum_{t=0}^{\infty} \sum_{k=0}^{\infty} \sum_{x \in \mathcal{S}} \sum_{a \in \mathcal{A}} w(t) \gamma^k \Pr(S_{t+k} = x \mid \theta) \Pr(A_{t+k} = a \mid S_{t+k} = x, \theta) \, Q^\theta_\gamma(x, a) \, \psi(x, a),
\end{aligned}
\]

since Pr(A_{t+k} = a | S_{t+k} = x, θ) = Pr(A_{t+k} = a | S_{t+k} = x, S_t = s, θ) and by the law of total probability. The key point is that we have removed the term Pr(S_t = s | θ). Continuing, starting with the fact that Pr(S_{t+k} = x | θ) Pr(A_{t+k} = a | S_{t+k} = x, θ) = Pr(S_{t+k} = x, A_{t+k} = a | θ), we have that:

\[
\begin{aligned}
\sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta} V^\theta_\gamma(s)
&= \sum_{t=0}^{\infty} \sum_{k=0}^{\infty} \mathbf{E}\!\left[ w(t) \, \gamma^k \, Q^\theta_\gamma(S_{t+k}, A_{t+k}) \, \psi(S_{t+k}, A_{t+k}) \,\middle|\, \theta \right] \\
&= \sum_{t=0}^{\infty} \sum_{i=t}^{\infty} \mathbf{E}\!\left[ w(t) \, \gamma^{i-t} \, Q^\theta_\gamma(S_i, A_i) \, \psi(S_i, A_i) \,\middle|\, \theta \right],
\end{aligned}
\]

by substitution of the variable i = t + k. Continuing, we can move the summation inside the expectation and reorder the summation:

\[
\begin{aligned}
\sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta} V^\theta_\gamma(s)
&= \mathbf{E}\!\left[ \sum_{t=0}^{\infty} \sum_{i=t}^{\infty} w(t) \, \gamma^{i-t} \, Q^\theta_\gamma(S_i, A_i) \, \psi(S_i, A_i) \,\middle|\, \theta \right] \\
&= \mathbf{E}\!\left[ \sum_{i=0}^{\infty} \sum_{t=0}^{i} w(t) \, \gamma^{i-t} \, Q^\theta_\gamma(S_i, A_i) \, \psi(S_i, A_i) \,\middle|\, \theta \right] \\
&= \mathbf{E}\!\left[ \sum_{i=0}^{\infty} Q^\theta_\gamma(S_i, A_i) \, \psi(S_i, A_i) \sum_{t=0}^{i} w(t) \, \gamma^{i-t} \,\middle|\, \theta \right].
\end{aligned}
\]

In order to derive ∇J_?(θ), we simply need to choose a w(t) such that ∀i : Σ_{t=0}^{i} w(t) γ^{i−t} = 1. This is satisfied by the choice: w(t) = 1 if t = 0, and 1 − γ otherwise. This trivially holds for i = 0, as w(0)γ⁰ = (1)(1) = 1. For i > 0:

\[
\begin{aligned}
\sum_{t=0}^{i} w(t) \gamma^{i-t}
&= w(0) \gamma^i + \sum_{t=1}^{i} w(t) \gamma^{i-t}
= \gamma^i + \sum_{t=1}^{i} (1 - \gamma) \gamma^{i-t}
= \gamma^i + \underbrace{\sum_{t=1}^{i} \left(\gamma^{i-t} - \gamma^{i-t+1}\right)}_{\text{telescoping}} \\
&= \gamma^i + \gamma^{i-i} - \gamma^{i-1+1}
= \gamma^i + 1 - \gamma^i
= 1.
\end{aligned}
\]

Thus, for this choice of w(t):

\[
\sum_{s \in \mathcal{S}} d^\theta_\gamma(s) \frac{\partial}{\partial \theta} V^\theta_\gamma(s)
= \mathbf{E}\!\left[ \sum_{i=0}^{\infty} Q^\theta_\gamma(S_i, A_i) \, \psi(S_i, A_i) \,\middle|\, \theta \right]
= \nabla J_?(\theta).
\]

Finally, we see that this choice of w(t) also gives us the expression for d_γ^θ stated in Lemma 4.3:

\[
\begin{aligned}
d^\theta_\gamma(s) &= \sum_{t=0}^{\infty} w(t) \Pr(S_t = s \mid \theta)
= \underbrace{w(0)}_{1} \underbrace{\Pr(S_0 = s \mid \theta)}_{d_0(s)} + \sum_{t=1}^{\infty} \underbrace{w(t)}_{1 - \gamma} \Pr(S_t = s \mid \theta) \\
&= d_0(s) + (1 - \gamma) \sum_{t=1}^{\infty} \Pr(S_t = s \mid \theta).
\end{aligned}
\]
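The identity Σ_{t=0}^{i} w(t) γ^{i−t} = 1 used above is easy to confirm numerically; the following short check (our own sketch) does so for a range of horizons and discount factors.

```python
import numpy as np

def w(t, gamma):
    """The weights chosen in the proof: w(0) = 1 and w(t) = 1 - gamma for t >= 1."""
    return 1.0 if t == 0 else 1.0 - gamma

for gamma in (0.0, 0.5, 0.9, 0.99):
    for i in (0, 1, 2, 10, 100):
        total = sum(w(t, gamma) * gamma ** (i - t) for t in range(i + 1))
        assert np.isclose(total, 1.0), (gamma, i, total)
```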

Continuation of Proof of Theorem 4.4

We continue from the example given in Figure 1. First, we compute d_γ^θ in terms of θ for each state using the definition of the MDP and π:

\[
\begin{aligned}
d^\theta_\gamma(s_1) &= 1, \\
d^\theta_\gamma(s_2) &= (1 - \gamma) \Pr(S_1 = s_2) = (1 - \gamma) \, \pi(s_1, a_1) = (1 - \gamma) \, \sigma(\theta_1).
\end{aligned}
\]

Next, we compute V_γ^θ in each state in terms of θ. Note that Q_γ^θ(s_1, a_1) = γ V_γ^θ(s_2), because taking a_1 in s_1 leads to s_2 and has zero reward.

\[
\begin{aligned}
V^\theta_\gamma(s_2) &= \pi(s_2, a_1) Q^\theta_\gamma(s_2, a_1) + \pi(s_2, a_2) Q^\theta_\gamma(s_2, a_2) = \pi(s_2, a_1)(1) + \pi(s_2, a_2)(0) = \sigma(\theta_2), \\
V^\theta_\gamma(s_1) &= \pi(s_1, a_1) Q^\theta_\gamma(s_1, a_1) + \pi(s_1, a_2) Q^\theta_\gamma(s_1, a_2) = \pi(s_1, a_1)\big(\gamma V^\theta_\gamma(s_2)\big) + \pi(s_1, a_2)(0) = \gamma \, \sigma(\theta_1) \, \sigma(\theta_2).
\end{aligned}
\]

Recall that by the definition of our policy and substitution, we have:

\[
\frac{\partial \pi^\theta(s_1, a_1)}{\partial \theta_1} = \frac{\partial \sigma(\theta_1)}{\partial \theta_1},
\qquad
\frac{\partial \pi^\theta(s_2, a_1)}{\partial \theta_2} = \frac{\partial \sigma(\theta_2)}{\partial \theta_2}.
\]

Next, we compute each partial derivative, ∂V_γ^θ(s)/∂θ_i:

\[
\begin{aligned}
\frac{\partial V^\theta_\gamma(s_1)}{\partial \theta_1} &= \frac{\partial}{\partial \theta_1} \gamma \sigma(\theta_1) \sigma(\theta_2) = \gamma \sigma(\theta_2) \frac{\partial \sigma(\theta_1)}{\partial \theta_1},
&\qquad
\frac{\partial V^\theta_\gamma(s_1)}{\partial \theta_2} &= \frac{\partial}{\partial \theta_2} \gamma \sigma(\theta_1) \sigma(\theta_2) = \gamma \sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2}, \\
\frac{\partial V^\theta_\gamma(s_2)}{\partial \theta_1} &= \frac{\partial \sigma(\theta_2)}{\partial \theta_1} = 0,
&\qquad
\frac{\partial V^\theta_\gamma(s_2)}{\partial \theta_2} &= \frac{\partial \sigma(\theta_2)}{\partial \theta_2}.
\end{aligned}
\]

With the necessary components in place, we can apply Lemma 4.3 to compute each partial derivative of J_?:

\[
\begin{aligned}
\frac{\partial J_?(\theta)}{\partial \theta_1}
&= \underbrace{d^\theta_\gamma(s_1)}_{1} \frac{\partial V^\theta_\gamma(s_1)}{\partial \theta_1} + d^\theta_\gamma(s_2) \underbrace{\frac{\partial V^\theta_\gamma(s_2)}{\partial \theta_1}}_{0}
= \gamma \sigma(\theta_2) \frac{\partial \sigma(\theta_1)}{\partial \theta_1}, \\
\frac{\partial J_?(\theta)}{\partial \theta_2}
&= d^\theta_\gamma(s_1) \frac{\partial V^\theta_\gamma(s_1)}{\partial \theta_2} + d^\theta_\gamma(s_2) \frac{\partial V^\theta_\gamma(s_2)}{\partial \theta_2}
= \gamma \sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2} + (1 - \gamma) \sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2}
= \sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2}.
\end{aligned}
\]

Finally, we compute the second order partial derivatives:

\[
\begin{aligned}
\frac{\partial}{\partial \theta_2}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_1}\right)
&= \frac{\partial}{\partial \theta_2}\!\left(\gamma \sigma(\theta_2) \frac{\partial \sigma(\theta_1)}{\partial \theta_1}\right)
= \gamma \frac{\partial \sigma(\theta_1)}{\partial \theta_1} \frac{\partial \sigma(\theta_2)}{\partial \theta_2}, \\
\frac{\partial}{\partial \theta_1}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_2}\right)
&= \frac{\partial}{\partial \theta_1}\!\left(\sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2}\right)
= \frac{\partial \sigma(\theta_1)}{\partial \theta_1} \frac{\partial \sigma(\theta_2)}{\partial \theta_2}.
\end{aligned}
\]

Thus we see that the following holds for any θ_1 and θ_2:

\[
\forall \gamma < 1 : \quad
\frac{\partial}{\partial \theta_2}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_1}\right) \neq
\frac{\partial}{\partial \theta_1}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_2}\right).
\]

The consequence of this asymmetry is that the contrapositive of the Clairaut-Schwarz theorem [19] implies that if J_? exists, it must not be continuously twice differentiable. However, consider the remaining terms in the second order partial derivative:

\[
\begin{aligned}
\frac{\partial}{\partial \theta_1}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_1}\right)
&= \frac{\partial}{\partial \theta_1}\!\left(\gamma \sigma(\theta_2) \frac{\partial \sigma(\theta_1)}{\partial \theta_1}\right)
= \gamma \sigma(\theta_2) \frac{\partial^2 \sigma(\theta_1)}{\partial \theta_1^2}, \\
\frac{\partial}{\partial \theta_2}\!\left(\frac{\partial J_?(\theta)}{\partial \theta_2}\right)
&= \frac{\partial}{\partial \theta_2}\!\left(\sigma(\theta_1) \frac{\partial \sigma(\theta_2)}{\partial \theta_2}\right)
= \sigma(\theta_1) \frac{\partial^2 \sigma(\theta_2)}{\partial \theta_2^2}.
\end{aligned}
\]

Thus, we have constructed all of the second order partial derivatives. As the sigmoid function, σ, is itself continuously twice differentiable, we see that ∀θ ∈ ℝ², i, j : ∂²J_?(θ)/∂θ_i∂θ_j is continuous. Therefore, if J_? exists, it is continuously twice differentiable. However, we showed using the Clairaut-Schwarz theorem [19] that J_? is not continuously twice differentiable. Therefore, we have derived a contradiction.
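The asymmetry derived above can also be confirmed numerically. The sketch below (our own; the function names are illustrative) finite-differences the two components of ∇J_?(θ) and compares the resulting cross partials against γ σ′(θ_1) σ′(θ_2) and σ′(θ_1) σ′(θ_2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

def grad_J_question(theta1, theta2, gamma):
    """The two partial derivatives of J_? derived above for the Figure 1 MDP."""
    return np.array([gamma * sigmoid(theta2) * dsigmoid(theta1),
                     sigmoid(theta1) * dsigmoid(theta2)])

theta1, theta2, gamma, h = 0.4, -1.2, 0.5, 1e-6
# Cross partials by central finite differences.
d2_d1 = (grad_J_question(theta1, theta2 + h, gamma)[0]
         - grad_J_question(theta1, theta2 - h, gamma)[0]) / (2 * h)   # d/dtheta2 of dJ/dtheta1
d1_d2 = (grad_J_question(theta1 + h, theta2, gamma)[1]
         - grad_J_question(theta1 - h, theta2, gamma)[1]) / (2 * h)   # d/dtheta1 of dJ/dtheta2
assert np.isclose(d2_d1, gamma * dsigmoid(theta1) * dsigmoid(theta2), atol=1e-4)
assert np.isclose(d1_d2, dsigmoid(theta1) * dsigmoid(theta2), atol=1e-4)
assert not np.isclose(d2_d1, d1_d2)   # asymmetric whenever gamma < 1
```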

Additional Literature Review

Here we include our analysis of the papers associated with algorithms from stable-baselines which did not include clearly erroneous claims regarding ∇J_?(θ). However, several of the papers included claims that we argue are nevertheless misleading, some more so than others. Due to space limitations, we are forced to omit our analysis of DDPG and TRPO.

ACER (ICLR 2017, 299 citations)

ACER [28] combines experience replay with actor-critic methods, improving sample efficiency. The background section contains the following text:

    "The parameters θ of the differentiable policy π_θ(a_t|x_t) can be updated using the discounted approximation to the policy gradient [25], which, borrowing notation from Schulman et al. [17], is defined as:
    g = E_{x_{0:∞}, a_{0:∞}}[Σ_{t≥0} A^π(x_t, a_t) ∇_θ log π_θ(a_t|x_t)].
    Following Proposition 1 of Schulman et al. [17], we can replace A^π(x_t, a_t) in the above expression with the state-action value Q^π(x_t, a_t), the discounted return R_t, or the temporal difference residual r_t + γV^π(x_{t+1}) − V^π(x_t), without introducing bias."

We note that g is exactly ∇J_?(θ). The authors claim that any of the given choices may be used "without introducing bias." We would argue that a naive reader is likely to assume that this means that g is unbiased with respect to some objective. The authors were actually making the subtler point that the given choices do not introduce bias relative to the choice of A^π, which is itself biased. This claim is not erroneous, but at the same time, information is left out that is important for a clear understanding. The correctness hinges on ambiguity rather than precision, and a reader is likely to come away with the opposite impression. For this reason, we argue that this section is still misleading.

PPO (arXiv 2017, 1769 citations)

Proximal Policy Optimization (PPO) [18] is arguably the most popular deep actor-critic algorithm at this time due to its speed and sample efficiency. The paper contains the text:

    "Policy gradient methods work by computing an estimator of the policy gradient and plugging it into a stochastic gradient ascent algorithm. The most commonly used gradient estimator has the form
    ĝ = Ê[∇_θ log π_θ(a_t|s_t) Â_t],
    where π_θ is a stochastic policy and Â_t is an estimator of the advantage function at timestep t. Here the expectation Ê[...] indicates the empirical average over a finite batch of samples, in an algorithm that alternates between sampling and optimization."

In this case, the authors considered a very specific setup where an algorithm "alternates between sampling and optimization." They construct an objective that operates on a given batch of data and is used only for a single optimization step. They do not relate this objective to a global objective. The final algorithm does follow a direction resembling ĝ, with a number of optimization tricks. For this reason, we did not consider the claims made above to be misleading. However, we note again that the issues with ∇J_?(θ) were sidestepped rather than addressed directly.

TD3 (ICML 2018, 247 citations)

Twin Delayed Deep Deterministic policy gradient (TD3) [6] is another paper in the DDPG family. The paper was published concurrently with SAC and contains similar enhancements. It contains the text:

    "The return is defined as the discounted sum of rewards R_t = Σ_{i=t}^T γ^{i−t} r(s_i, a_i), where γ is a discount factor determining the priority of short-term rewards. In reinforcement learning, the objective is to find the optimal policy π_ϕ, with parameters ϕ, which maximizes the expected return J(π) = E_{s_i∼p_π, a_i∼π}[R_0]. For continuous control, parameterized policies π_ϕ can be updated by taking the gradient of the expected return ∇_ϕ J(π). In actor-critic methods, the policy, known as the actor, can be updated through the deterministic policy gradient algorithm:
    ∇_ϕ J(ϕ) = E_{s∼p_π}[∇_a Q^π(s, a)|_{a=π(s)} ∇_ϕ π_ϕ(s)].
    Q^π(s, a) = E_{s_i∼p_π, a_i∼π}[R_t | s, a], the expected return when performing action a in state s and following π after, is known as the critic or the value function."

The authors did not define p_π, leaving the above expression ambiguous. In the original DDPG paper, it was defined as the discounted state distribution. It is misused here in the definitions of J and Q, in that it is not clear what role the discounted state distribution plays in the definition of Q in either case. The algorithm eventually computes the sample gradient by averaging over samples drawn from a replay buffer: ∇_ϕ J(ϕ) = N^{−1} Σ ∇_a Q_{θ_1}(s, a)|_{a=π_ϕ(s)} ∇_ϕ π_ϕ(s), the deterministic policy gradient form of ∇J_?(θ).

REFERENCES
[1] Joshua Achiam. 2018. Spinning Up in Deep Reinforcement Learning. (2018).
[2] J. Baxter and P. L. Bartlett. 2000. Direct gradient-based reinforcement learning. In 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353), Vol. 3 (2000), 271–274.
[3] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47 (2013), 253–279.
[4] D. P. Bertsekas and J. N. Tsitsiklis. 2000. Gradient convergence in gradient methods with errors. SIAM J. Optim. 10 (2000), 627–642.
[5] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. 2017. OpenAI Baselines. https://github.com/openai/baselines. (2017).
[6] Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning. 1582–1591.
[7] The garage contributors. 2019. Garage: A toolkit for reproducible reinforcement learning research. https://github.com/rlworkgroup/garage. (2019).
[8] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning. 1861–1870.
[9] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. 2018. Stable Baselines. https://github.com/hill-a/stable-baselines. (2018).
[10] Sham Kakade. 2001. Optimizing average reward using discounted rewards. In Proceedings of the 14th International Conference on Computational Learning Theory. Springer, 605–615.
[11] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations.
[12] Sridhar Mahadevan. 1996. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning 22, 1-3 (1996), 159–195.
[13] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. 1928–1937.
[14] Chris Nota. 2020. The Autonomous Learning Library. https://github.com/cpnota/autonomous-learning-library. (2020).
[15] Walter Rudin et al. 1964. Principles of Mathematical Analysis (3 ed.).
[16] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning. 1889–1897.
[17] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
[18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
[19] Hermann Amandus Schwarz. 1873. Communication. Archives des Sciences Physiques et Naturelles (1873).
[20] Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Efi Kokiopoulou, Luciano Sbaiz, Jamie Smith, Gábor Bartók, Jesse Berent, Chris Harris, Vincent Vanhoucke, and Eugene Brevdo. 2018. TF-Agents: A library for reinforcement learning in TensorFlow. https://github.com/tensorflow/agents. (2018). [Online; accessed 25-June-2019].
[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning.
[22] Richard S Sutton. 2015. Introduction to reinforcement learning with function approximation. In Tutorial at the Conference on Neural Information Processing Systems.
[23] Richard S Sutton and Andrew G Barto. 2018. Reinforcement Learning: An Introduction. MIT Press.
[24] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 993–1000.
[25] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. 1057–1063.
[26] Philip Thomas. 2014. Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning. 441–448.
[27] John N Tsitsiklis and Benjamin Van Roy. 1997. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems. 1075–1081.
[28] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. 2017. Sample efficient actor-critic with experience replay. In Proceedings of the 5th International Conference on Learning Representations.
[29] Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
[30] Yuhuai Wu, Elman Mansimov, Shun Liao, Roger Grosse, and Jimmy Ba. 2017. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Proceedings of the 31st International Conference on Neural Information Processing Systems. Curran Associates Inc., 5285–5294.