
Mitigating Planner Overfitting in Model-Based Reinforcement Learning

Dilip Arumugam 1 David Abel 2 Kavosh Asadi 2 Nakul Gopalan 2 Christopher Grimm 3 Jun Ki Lee 2 Lucas Lehnert 2 Michael L. Littman 2

1 Department of Computer Science, Stanford University. 2 Department of Computer Science, Brown University. 3 Department of Computer Science & Engineering, University of Michigan. Correspondence to: Dilip Arumugam.

Abstract

An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting"—aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.

1. Introduction

Model-based reinforcement learning (RL) has proven to be a powerful approach for generating reward-seeking behavior in sequential decision-making environments. For example, a number of methods are known for guaranteeing near-optimal behavior in a Markov decision process (MDP) by adopting a model-based approach (Kearns & Singh, 1998; Brafman & Tennenholtz, 2002; Strehl et al., 2009). In this line of work, a learning agent continually updates its model of the transition dynamics of the environment and actively seeks out parts of its environment that could contribute to achieving high reward but that are not yet well learned. Policies, in this setting, are designed specifically to explore unknown transitions so that the agent will be able to exploit (that is, maximize reward) in the long run.

A distinct model-based RL problem is one in which an agent has explored its environment, constructed a model, and must then use this learned model to select the best policy that it can. A straightforward approach to this problem, referred to as the certainty equivalence approximation (Dayan & Sejnowski, 1996), is to take the learned model and to compute its optimal policy, deploying the resulting policy in the real environment. The promise of such an approach is that, for environments that are defined by relatively simple dynamics but require complex behavior, a model-based learner can start making high-quality decisions with little data.

Nevertheless, recent large-scale successes of reinforcement learning have not been due to model-based methods but instead derive from value-function-based or policy-search methods (Mnih et al., 2015; 2016; Schulman et al., 2017; Hessel et al., 2018). Attempts to leverage model-based methods have fallen below expectations, particularly when models are learned using function-approximation methods. Jiang et al. (2015) highlighted a significant shortcoming of the certainty equivalence approximation, showing that it is important to hedge against possibly misleading errors in a learned model. They found that reducing the effective planning depth by decreasing the discount factor used for decision making can result in improved performance when operating in the true environment.

At first, this result might seem counterintuitive—the best way to exploit a learned model can be to exploit it incompletely. However, an analogous situation arises in supervised learning. It is well established that, particularly when data is sparse, the representational capacity of supervised learning methods must be restrained or regularized to avoid overfitting. Returning the best hypothesis in a hypothesis class relative to the training data can be problematic if the hypothesis class is overly expressive relative to the size of the training data. The classic result is that testing performance improves, plateaus, then drops as the complexity of the learner's hypothesis class is increased.

In this paper, we extend the results on avoiding planner overfitting via decreasing discount rates by introducing several other ways of regularizing policies in model-based RL. In each case, we see the classic "overfitting" pattern in which resisting the urge to treat the learned model as correct and to search in a reduced policy class is repaid by improved performance in the actual environment. We believe this research direction may hold the key to large-scale applications of model-based RL.

Section 2 provides a set of definitions, which provide a vocabulary for the paper. Section 3 reviews the results on decreasing discount rates, Section 4 presents a new approach that plans using epsilon-greedy policies, and Section 5 presents results where policy search is performed using lower capacity representations of policies. Section 6 summarizes related work and Section 7 concludes.

2. Definitions

An MDP M is defined by the quantities ⟨S, A, R, T, γ⟩, where S is a state space, A is an action space, R : S × A → ℝ is a reward function, T : S × A → P(S) is a transition function, and 0 ≤ γ < 1 is a discount factor. The notation P(X) represents the set of probability distributions over the discrete set X. Given an MDP M = ⟨S, A, R, T, γ⟩, its optimal value function Q* is the solution to the Bellman equation:

    Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} max_{a'} Q*(s', a').

This function is unique and can be computed by algorithms such as value iteration or linear programming (Puterman, 1994).

A (deterministic) policy is a mapping from states to actions, π : S → A. Given a value function Q : S × A → ℝ, the greedy policy with respect to Q is π_Q(s) = argmax_a Q(s, a). The greedy policy with respect to Q* maximizes expected discounted reward from all states. We assume that ties between actions of the greedy policy are broken arbitrarily but consistently so there is always a unique optimal policy for any MDP.

The value function for a policy π deployed in M can be found by solving

    Q^π_M(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} Q^π_M(s', π(s')).

The value function of the optimal policy is the optimal value function. For a policy π, we also define the scalar V^π_M = Σ_s w_s Q^π_M(s, π(s)), where w is an MDP-specific weighting function over the states.

The epsilon-greedy policy (Sutton & Barto, 1998) is a stochastic policy where the probability of choosing action a is (1 − ε) + ε/|A| if a = argmax_a Q(s, a) and ε/|A| otherwise. The optimal epsilon-greedy policy for M is not generally the epsilon-greedy policy for Q*. Instead, it is necessary to solve a different set of Bellman equations:

    Q_ε(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} × ( (1 − ε) max_{a'} Q_ε(s', a') + (ε/|A|) Σ_{a'} Q_ε(s', a') ).

The optimal epsilon-greedy policy plays an important role in the analysis of learning algorithms like SARSA (Rummery, 1994; Littman & Szepesvári, 1996).

These examples of optimal policies are with respect to all possible deterministic Markov policies. In this paper, we also consider optimization with respect to a restricted set of policies Π. The optimal restricted policy can be found by comparing the scalar values of the policies: ρ* = argmax_{ρ∈Π} V_ρ.
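To make the epsilon-greedy Bellman equation concrete, the following sketch (ours, not taken from the paper; numpy assumed, with the transition function stored as a tensor T[s, a, s']) solves it by dynamic programming on tabular arrays. With ε = 0 it reduces to ordinary value iteration for Q*.

    import numpy as np

    def epsilon_greedy_value_iteration(R, T, gamma, epsilon, iters=1000):
        """Solve the epsilon-greedy Bellman equations for tabular R[s, a], T[s, a, s']."""
        S, A = R.shape
        Q = np.zeros((S, A))
        for _ in range(iters):
            # Next-state value under the epsilon-greedy policy induced by Q:
            # greedy action with prob. (1 - eps), each action with prob. eps/|A|.
            next_v = (1 - epsilon) * Q.max(axis=1) + (epsilon / A) * Q.sum(axis=1)
            Q = R + gamma * T @ next_v          # the matmul sums over s'
        return Q

    def epsilon_greedy_policy(Q, epsilon):
        """Return the stochastic policy matrix pi[s, a] implied by Q."""
        S, A = Q.shape
        pi = np.full((S, A), epsilon / A)
        pi[np.arange(S), Q.argmax(axis=1)] += 1 - epsilon
        return pi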

3. Decreased Discounting

Let M = ⟨S, A, R, T, γ⟩ be the evaluation environment and M̂ = ⟨S, A, R, T̂, γ′⟩ be the planning environment, where T̂ is the learned model and γ′ ≤ γ is a smaller discount factor used to decrease the effective planning horizon. Jiang et al. (2015) proved a bound on the difference between the performance of the optimal policy in M and the performance of the optimal policy in M̂ when executed in M:

    (γ − γ′)/((1 − γ)(1 − γ′)) · R_max + (2 R_max/(1 − γ′)²) · √( (1/(2n)) log( 2|S||A||Π_{R,γ′}| / δ ) ).      (1)

Here, R_max = max_{s,a} R(s, a) is the largest reward (we assume all rewards are non-negative), δ is the certainty with which the bound needs to hold, n is the number of samples of each transition used to build the model, and |Π_{R,γ′}| is the number of distinct possibly optimal policies for ⟨S, A, R, ·, γ′⟩ over the entire space of possible transition functions.

They show that |Π_{R,γ′}| is an increasing function of γ′, growing from 1 to as high as |A|^|S|, the size of the set of all possible deterministic policies. They left open the shape of this function, which is most useful if it grows gradually, but could possibly jump abruptly.

To help ground intuitions, we estimated the shape of |Π_{R,γ′}| over a set of randomly generated MDPs. Following Jiang et al. (2015), a "ten-state chain" MDP M = ⟨S, A, T, R, γ⟩ is drawn such that, for each state–action pair (s, a) ∈ S × A, the transition function T(s, a) is constructed by choosing 5 states at random from S, then assigning probabilities to these states by drawing 5 independent samples from a uniform distribution over [0, 1] and normalizing the resulting numbers. The probability of transition to any other state is zero. For each state–action pair (s, a) ∈ S × A, the reward R(s, a) is drawn from a uniform distribution with support [0, 1]. For our MDPs, we chose |S| = 10, |A| = 2 and γ = 0.99. We examined γ′ in {0.0, 0.1, ..., 0.9, 0.99} and computed optimal policies by running value iteration with 10 iterations. We sampled repeatedly until no new optimal policy was discovered for 5000 consecutive samples.
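This sampling procedure is simple to express in code. The sketch below is our own illustration (numpy assumed; the function names are ours): fix a reward function, repeatedly draw transition functions from the distribution above, plan in each sampled MDP with discount γ′, and count the distinct greedy policies that appear. Repeating it for each γ′ with a fixed reward function yields a curve like Figure 1.

    import numpy as np

    def sample_transition(S=10, A=2, support=5, rng=np.random):
        """Sample T[s, a, s'] with 'support' random successor states per (s, a)."""
        T = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                succ = rng.choice(S, size=support, replace=False)
                p = rng.uniform(size=support)
                T[s, a, succ] = p / p.sum()
        return T

    def optimal_policy(R, T, gamma, iters=1000):
        """Deterministic greedy policy from value iteration on (R, T, gamma)."""
        Q = np.zeros(R.shape)
        for _ in range(iters):
            Q = R + gamma * T @ Q.max(axis=1)
        return tuple(Q.argmax(axis=1))          # hashable, so policies can be counted

    def estimate_policy_count(R, gamma_plan, patience=5000, rng=np.random):
        """Count distinct optimal policies until none new appears for 'patience' samples."""
        seen, since_new = set(), 0
        while since_new < patience:
            pi = optimal_policy(R, sample_transition(rng=rng), gamma_plan)
            since_new = 0 if pi not in seen else since_new + 1
            seen.add(pi)
        return len(seen)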

[Figure 2 plot: the bound of Equation 1 as a function of γ′, with one curve per n ∈ {10000, 50000, 200000, 900000, 1500000}.]
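The bound plotted in Figure 2 is straightforward to evaluate directly. The sketch below (ours; numpy assumed, and δ = 0.05 is a placeholder since the paper does not state the value used) computes Equation 1 for a given planning discount and sample size; sweeping γ′ for each n produces curves like those in the figure.

    import numpy as np

    def equation_1_bound(gamma, gamma_plan, n, num_policies, S=10, A=2, r_max=1.0, delta=0.05):
        """Evaluate the planning-loss bound of Equation 1."""
        planning_term = (gamma - gamma_plan) * r_max / ((1 - gamma) * (1 - gamma_plan))
        estimation_term = (2 * r_max / (1 - gamma_plan) ** 2) * np.sqrt(
            np.log(2 * S * A * num_policies / delta) / (2 * n))
        return planning_term + estimation_term

    # One curve of Figure 2: fix n and sweep the planning discount, supplying an
    # estimate of |Pi_{R, gamma'}| (e.g., from the sampling procedure above).
    # bounds = [equation_1_bound(0.99, g, n=200000, num_policies=...) for g in gammas]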

Figure 1. The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying γ′, in random MDPs.

Figure 2. Bound on policy loss for randomly generated MDPs, showing the tightest bound for intermediate values of γ′ for intermediate amounts of data.

Figure 1 is an estimate of how |Π_{R,γ′}| grows in this class of randomly generated MDPs. Fortunately, the set appears to grow gradually, making γ′ an effective parameter for fighting planner overfitting.

Estimating |Π_{R,γ′}| ≈ 11e^{γ′} − 10, Figure 2 shows the bound of Equation 1 applied to the random MDP distribution (|S| = 10, |A| = 2, R_max = 1, γ = 0.99). Note that the expected "U" shape is visible, but only for a relatively narrow range of values of n. For under 50k samples, the minimal loss bound is achieved for γ′ = 0. For over 900k samples, the minimal loss bound is achieved for γ′ = γ. (Note that the pattern shown here is relatively insensitive to the estimated shape of |Π_{R,γ′}|.)

For actual MDPs, the "U" shape is much more robust. Using this same distribution over MDPs, Figure 3 replicates an empirical result of Jiang et al. (2015) showing that intermediate values of γ′ are most successful and that this value grows as the model used in planning becomes more accurate (having been trained on more trajectories). We sampled MDPs from the same random distribution and, for each value of n ∈ {5, 10, 20, 50}, we generated 1000 datasets, each consisting of n trajectories of length 10 starting from a state selected uniformly at random and executing a random policy. In all experiments, the estimated MDP (M̂) was computed using maximum likelihood estimates of T and R with no additive Gaussian noise. Optimal policies were all found by running value iteration in the estimated MDP M̂. The empirical loss (Equation 14 of Jiang et al. (2015)) was computed for each value of γ′ ∈ {0.0, 0.1, 0.2, ..., 0.9, 0.99}. The error bars shown in the figure represent 95% confidence intervals.

Figure 3. Reducing the discount factor used in planning combats planner overfitting in random MDPs.
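The core of this experimental pipeline can be sketched as follows (our illustration, not the authors' code). It reuses optimal_policy from the earlier sketch; for brevity it samples each (s, a) pair directly rather than collecting whole trajectories, and it treats R as known rather than estimated.

    import numpy as np

    def evaluate(policy, R, T, gamma):
        """Exact value vector of a deterministic policy (array of actions) in (R, T, gamma)."""
        S = R.shape[0]
        P = T[np.arange(S), policy]                     # S x S transition matrix under the policy
        r = R[np.arange(S), policy]
        return np.linalg.solve(np.eye(S) - gamma * P, r)   # V = (I - gamma P)^-1 r

    def certainty_equivalence_loss(R, T, gamma, gamma_plan, n, rng=np.random):
        """Loss in M of the policy planned in the ML model with discount gamma_plan."""
        S, A, _ = T.shape
        # Maximum-likelihood transition estimate from n samples per (s, a).
        counts = np.array([[rng.multinomial(n, T[s, a]) for a in range(A)] for s in range(S)])
        T_hat = counts / n
        pi_hat = np.array(optimal_policy(R, T_hat, gamma_plan))   # plan in the model
        pi_star = np.array(optimal_policy(R, T, gamma))           # plan in the truth
        w = np.full(S, 1.0 / S)                                   # uniform state weighting
        return w @ (evaluate(pi_star, R, T, gamma) - evaluate(pi_hat, R, T, gamma))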


4. Increased Exploration

In this section, we consider a novel regularization approach in which planning is performed over the set of epsilon-greedy policies. The intuition here is that adding noise to the policies makes it harder for them to be tailored explicitly to the learned model, resulting in less planner overfitting. In Section 4.1, a general bound is introduced and then Section 4.2 applies the bound to the set of epsilon-greedy policies.

4.1. General Bounds

We can relate the structure of a restricted set of policies Π to the performance in an approximate model with the following theorem.

Theorem 1. Let Π be a set of policies for an MDP M = ⟨S, A, T, R, γ⟩. Let M̂ = ⟨S, A, T̂, R, γ⟩ be an MDP like M, but with a different transition function. Let π be the optimal policy for M and π̂ be the optimal policy for M̂. Let ρ be the optimal policy in Π for M and ρ̂ be the optimal policy in Π for M̂. Then,

    |V^π_M − V^ρ̂_M| ≤ |V^π_M − V^ρ_M| + 2 max_{p∈Π} |V^p_M − V^p_M̂|.

Proof. We can write

    V^π_M − V^ρ̂_M
      = (V^π_M − V^ρ_M) + (V^ρ_M − V^ρ_M̂) − (V^ρ̂_M − V^ρ̂_M̂) − (V^ρ̂_M̂ − V^ρ_M̂)
      ≤ (V^π_M − V^ρ_M) + (V^ρ_M − V^ρ_M̂) − (V^ρ̂_M − V^ρ̂_M̂)                      (2)
      ≤ |V^π_M − V^ρ_M| + |V^ρ_M − V^ρ_M̂| + |V^ρ̂_M − V^ρ̂_M̂|
      ≤ |V^π_M − V^ρ_M| + 2 max_{p∈Π} |V^p_M − V^p_M̂|.                             (3)

Equation 2 follows from the fact that V^ρ̂_M̂ − V^ρ_M̂ ≥ 0, since ρ̂ is chosen as optimal among the set of restricted policies with respect to M̂. Equation 3 follows because both ρ and ρ̂ are included in Π. The theorem follows from the fact that V^π_M − V^ρ̂_M ≥ 0 since π is chosen to be optimal in M.

Theorem 1 shows that the restricted policy set Π impacts the resulting value of the plan in two ways. First, the bigger the policy class is, the closer V^ρ_M becomes to V^π_M—that is, the more policies we consider, the closer to optimal we become. At the same time, max_{p∈Π} |V^p_M − V^p_M̂| grows as Π gets larger, as there are more policies that can differ in value between M and M̂. In particular, consider a sequence of Π_i such that Π_i ⊆ Π_{i+1}. Then, the first part of the bound is monotonically non-increasing (it goes down each time a better policy is included in the set) and the second part of the bound is monotonically non-decreasing (it goes up each time a policy is included that magnifies the difference in performance possible in the two MDPs).

Jiang et al. (2015) leverage this structure in the specific case of defining Π by optimizing policies using a smaller value for γ. Our Theorem 1 generalizes the idea to arbitrary restricted policy classes and arbitrary pairs of MDPs M and M̂.

In Lemma 1, we show that the particular choice of M̂ that comes from statistically sampling transitions as in certainty equivalence leads to a bound on |V^p_M − V^p_M̂|, for an arbitrary policy p.

Lemma 1. Given true MDP M, let M̂ be an MDP comprised of a reward function R and transition function T̂ estimated from n samples for each state–action pair, and let p be a policy. Then the following holds with probability at least 1 − δ:

    2 max_{p∈Π} |V^p_M − V^p_M̂| ≤ (2 R_max/(1 − γ)²) · √( (1/(2n)) log( 2|S||A||Π| / δ ) ).

Proof. This lemma is a variation of the classic "Simulation Lemma" (Kearns & Singh, 1998; Strehl et al., 2009) and is proven in this form as Theorem 2 of Jiang et al. (2015) (specifically, Lemma 2). Note that their proof, though stated with respect to a particular choice of the set Π, holds in this general form.
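Because every quantity in Theorem 1 is computable exactly in small tabular problems, the bound can be checked numerically. The sketch below is purely a sanity check of our own (not part of the paper's experiments); it reuses evaluate and optimal_policy from the earlier sketches, uses a uniform state weighting for the scalar values, and takes the restricted class Π as an explicit list of deterministic policies (for example, the greedy policies obtained by planning with several reduced discount factors).

    import numpy as np

    def scalar_value(policy, R, T, gamma):
        """V_M^p: uniformly weighted scalar value of a deterministic policy."""
        return evaluate(np.array(policy), R, T, gamma).mean()

    def check_theorem_1(R, T, T_hat, gamma, Pi):
        """Compare both sides of Theorem 1 for a restricted policy set Pi."""
        pi_star = optimal_policy(R, T, gamma)                              # optimal in M
        rho = max(Pi, key=lambda p: scalar_value(p, R, T, gamma))          # best in Pi for M
        rho_hat = max(Pi, key=lambda p: scalar_value(p, R, T_hat, gamma))  # best in Pi for M-hat
        lhs = abs(scalar_value(pi_star, R, T, gamma) - scalar_value(rho_hat, R, T, gamma))
        gap = max(abs(scalar_value(p, R, T, gamma) - scalar_value(p, R, T_hat, gamma)) for p in Pi)
        rhs = abs(scalar_value(pi_star, R, T, gamma) - scalar_value(rho, R, T, gamma)) + 2 * gap
        return lhs, rhs   # lhs <= rhs should hold, up to value-iteration convergence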

4.2. Bound for Epsilon-Greedy Policies

It remains to show that ‖V^π_M − V^ρ_M‖_∞ is bounded when restricted to epsilon-greedy policies. For the case of planning with a decreased discount factor, Jiang et al. (2015) provide a bound for this quantity in their Lemma 1. For the case of epsilon-greedy policies, the corresponding bound is proven in the following lemma.

Lemma 2. For any MDP M, the difference in value of the optimal policy π and the optimal ε-greedy policy ρ is bounded by:

    |V^π_M − V^ρ_M| ≤ ε R_max / ((1 − γ)(1 − γ(1 − ε))).

Proof. Let π be the optimal policy for M and u be a policy that selects actions uniformly at random. We can define π_ε, an ε-greedy version of π, as:

    π_ε^a(s) = (1 − ε) π^a(s) + ε u^a(s),      (4)

where π^a(s) refers to the probability associated with action a under a policy π. Let T^π denote the transition matrix from states to states under policy π. Using the above definition, we can decompose the transition matrix into

    T^{π_ε} = (1 − ε) T^π + ε T^u.      (5)

Similarly, we have for the reward vector over states,

    R^{π_ε} = (1 − ε) R^π + ε R^u.      (6)

To obtain our bound, note

    V^π_M − V^{π_ε}_M
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} R^{π_ε}
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} [(1 − ε) R^π + ε R^u]
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} (1 − ε) R^π − γ^{t−1} [T^{π_ε}]^{t−1} ε R^u   (the last term is ≥ 0)
      ≤ Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [(1 − ε) T^π + ε T^u]^{t−1} (1 − ε) R^π.      (7)

Since T^π is a transition matrix, all its entries lie in [0, 1]; hence, we have the following element-wise matrix inequality:

    [(1 − ε) T^π + ε T^u]^{t−1} ≥ [(1 − ε) T^π]^{t−1}.      (8)

Plugging inequality 8 into the bound 7 results in

    V^π_M − V^{π_ε}_M
      ≤ Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} (1 − ε)^t [T^π]^{t−1} R^π
      ≤ ‖ Σ_{t=1}^∞ γ^{t−1} (1 − (1 − ε)^t) [T^π]^{t−1} R^π ‖_∞.

Since ‖R^π‖_∞ = R_max, we can upper bound the norm of the difference of the value vectors over states with

    ‖V^π_M − V^{π_ε}_M‖_∞ ≤ Σ_{t=1}^∞ γ^{t−1} (1 − (1 − ε)^t) R_max = ε R_max / ((1 − γ)(γε − γ + 1)).

Using this inequality, we can bound the difference in value of the optimal policy π and the optimal ε-greedy policy ρ by

    ‖V^π_M − V^ρ_M‖_∞ ≤ ‖V^π_M − V^{π_ε}_M‖_∞ ≤ ε R_max / ((1 − γ)(γε − γ + 1)).

Figure 4. The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying ε.

Figure 4 is an estimate of how |Π_{R,ε}| grows over the class of randomly generated MDPs. Again, the set appears to grow gradually as ε decreases, making ε another effective parameter for fighting planner overfitting.

4.3. Empirical Results

We evaluated this exploration-based regularization approach in the distribution over MDPs used in Figure 3. Figure 5 shows results for each value of ε ∈ {0.0, 0.1, 0.2, ..., 0.9, 1.0}. Here, the maximum likelihood transition function T̂ was replaced with the epsilon-softened transition function T̂_ε. In contrast to the previous figure, regularization increases as we go to the right. Once again, we see that intermediate values of ε are most successful and the best value of ε decreases as the model used in planning becomes more accurate (having been trained on more trajectories). The similarity to Figure 3 is striking—in spite of the difference in approach, it is essentially the mirror image of Figure 3.

We see that manipulating either γ or ε can be used to modulate the impact of planner overfitting. Which method to use in practice depends on the particular planner being used and how easily it is modified to use these methods.
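The paper does not spell out the exact form of T̂_ε, so the sketch below is one natural reading of ours rather than the authors' procedure: mix each action's estimated transition distribution with the average over actions (the effect of an action being replaced, with probability ε, by a uniformly random one), and then plan greedily in the softened model. It reuses optimal_policy and scalar_value from the earlier sketches.

    import numpy as np

    def epsilon_soften(T_hat, epsilon):
        """T_eps[s, a] = (1 - eps) * T_hat[s, a] + eps * mean over b of T_hat[s, b]."""
        uniform_action = T_hat.mean(axis=1, keepdims=True)   # transition under a random action
        return (1 - epsilon) * T_hat + epsilon * uniform_action

    def plan_with_softened_model(R_hat, T_hat, gamma, epsilon):
        """Greedy deterministic plan computed in the epsilon-softened learned model."""
        return optimal_policy(R_hat, epsilon_soften(T_hat, epsilon), gamma)

    # Sketch of the evaluation loop behind Figure 5: plan in the softened model for
    # each epsilon, then measure the resulting policy's loss in the true MDP M.
    # for eps in [0.0, 0.1, ..., 1.0]:
    #     pi = plan_with_softened_model(R_hat, T_hat, gamma, eps)
    #     loss = scalar_value(pi_star, R, T, gamma) - scalar_value(pi, R, T, gamma)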

[Figure 6 plot: "Optimized-Policy Performance"; mean return versus hidden size (1 to 2000), with one curve for performance on the learned model and one for performance on the environment.]

Figure 5. Increasing the randomness in action selection during planning combats planner overfitting in random MDPs.

Figure 6. Decreasing the number of hidden units used to represent a policy combats planner overfitting in the Lunar Lander domain.

5. Decreased Policy Complexity

In addition to indirectly controlling policy complexity via ε and γ, it is possible to control for the complexity via the representation of the policy itself. In this section, we look at varying the complexity of the policy in the context of model-based RL in which a model is learned and then a policy for that model is optimized via a policy-search approach. Such an approach was used in the setting of helicopter control (Ng et al., 2003) in the sense that collected data in that work was used to build a model, and a policy was constructed to optimize performance in this model (via policy search, in this case) and then deployed in the environment.

Our test domain was Lunar Lander, an environment with a continuous state space and discrete actions. The goal of the environment is to control a falling spacecraft so as to land gently in a target area. It consists of 8 state variables, namely the lander's x and y coordinates, x and y velocities, angle and angular velocity, and two Boolean flags corresponding to whether each leg has touched down. The agent can take 4 actions, corresponding to which of its three thrusters (or no thruster) is active during the current time step. The Lunar Lander environment is publicly available as part of the OpenAI Gym toolkit (Brockman et al., 2016).

We collected 40k 200-step episodes of data on Lunar Lander. During data collection, decisions were made by a policy-gradient algorithm. Specifically, we ran the REINFORCE algorithm with the state–value function as the baseline (Williams, 1992; Sutton et al., 2000). For the policy and value networks, we used a single-hidden-layer neural network with 16 hidden units and ReLU activation functions. We used the Adam algorithm (Kingma & Ba, 2014) with the default parameters and a step size of 0.005. The learned model was a 3-layer neural net with ReLU activation functions mapping the agent's state (8 inputs corresponding to 8 state variables) as well as a one-hot representation of actions (4 inputs corresponding to 4 possible actions). The model consisted of two fully connected hidden layers with 32 units each and ReLU activations. We again used Adam, with a step size of 0.001, to learn the model.

We then ran policy-gradient RL (REINFORCE) as a planner using the learned model. The policy was represented by a neural network with a single hidden layer. To control the complexity of the policy representation, we varied the number of units in the hidden layer from 1 to 2000. Results were averaged over 40 runs. Figure 6 shows that increasing the size of the hidden layer in the policy resulted in better and better performance on the learned model (top line). However, after 250 or so units, the resulting policy performed less well on the actual environment (bottom line). Thus, we see that reducing policy complexity serves as yet another way to reduce planner overfitting.
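The planning loop can be sketched as follows (our illustration, not the authors' code; PyTorch assumed). Here model_step and reset are hypothetical stand-ins for the learned dynamics/reward network and an initial-state sampler, model_step returning a next-state tensor and a scalar reward; no value baseline is used in this simplified version, unlike the data-collection runs.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    def make_policy(num_hidden, state_dim=8, num_actions=4):
        """Single-hidden-layer policy; num_hidden controls policy complexity."""
        return nn.Sequential(nn.Linear(state_dim, num_hidden), nn.ReLU(),
                             nn.Linear(num_hidden, num_actions))

    def reinforce_on_model(policy, model_step, reset, episodes=1000, horizon=200,
                           gamma=0.99, lr=0.005):
        """REINFORCE where rollouts come from the learned model, not the real environment."""
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        for _ in range(episodes):
            state, log_probs, rewards = reset(), [], []
            for _ in range(horizon):
                dist = Categorical(logits=policy(state))
                action = dist.sample()
                log_probs.append(dist.log_prob(action))
                state, reward = model_step(state, action)   # imagined transition
                state = state.detach()                       # do not backprop through the model
                rewards.append(float(reward))
            # Discounted returns-to-go for each step of the imagined episode.
            returns, g = [], 0.0
            for r in reversed(rewards):
                g = r + gamma * g
                returns.append(g)
            returns = torch.tensor(list(reversed(returns)))
            loss = -(torch.stack(log_probs) * returns).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return policy

    # Varying num_hidden from 1 to 2000 and evaluating each resulting policy in the
    # real environment traces out the two curves of Figure 6.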
6. Related Work

Prior work has explored the use of regularization in reinforcement learning to mitigate overfitting. We survey some of the previous methods according to which function is regularized: (1) value, (2) model, or (3) policy.

6.1. Regularizing Value Functions

Many prior approaches have applied regularization to value-function approximation, including Least Squares Temporal Difference learning (Bradtke & Barto, 1996), Policy Evaluation, and the batch approach of Fitted Q-Iteration (FQI) (Ernst et al., 2005).

Kolter & Ng (2009) applied regularization techniques to LSTD (Bradtke & Barto, 1996) with an algorithm they called LARS-TD. In particular, they argued that, without regularization, LSTD's performance depends heavily on the number of basis functions chosen and the size of the data set collected. If the data set is too small, the technique is prone to overfitting. They showed that L1 and L2 regularization yield a procedure that inherits the benefits of selecting good features while making it possible to compute the fixed point. Later work by Liu et al. (2012) built on this work with the algorithm RO-TD, an L1-regularized off-policy temporal difference learning method. Johns et al. (2010) cast the L1-regularized fixed-point computation as a linear complementarity problem, which provides stronger solution-uniqueness guarantees than those provided for LARS-TD. Petrik et al. (2010) examined the approximate linear programming (ALP) framework for finding approximate value functions in large MDPs. They showed the benefits of adding an L1 regularization constraint to the ALP that increases the error bound at training time and helps fight overfitting.

Farahmand et al. (2008a) and Farahmand et al. (2009) focused on regularization applied to Policy Iteration and Fitted Q-Iteration (FQI) (Ernst et al., 2005) and developed two related methods for Regularized Policy Iteration, each leveraging L2 regularization during the evaluation of policies for each iteration. The first method adds a regularization term to the Least Squares Temporal Difference (LSTD) error (Bradtke & Barto, 1996), while the second adds a similar term to the optimization of Bellman residual minimization (Baird et al., 1995; Schweitzer & Seidmann, 1985; Williams & Baird, 1993) with regularization (Loth et al., 2007). Their main result shows finite convergence for the Q function under the approximated policy and the true optimal policy. A method for FQI adds a regularization cost to the least squares regression of the Q function. Follow-up work (Farahmand et al., 2008b) expanded Regularized Fitted Q-Iteration to planning. That is, given a data set D = ⟨(s_1, a_1, r_1, s'_1), ..., (s_m, a_m, r_m, s'_m)⟩ and a function family F (like regression trees), FQI approximates a Q function through repeated iterations of the following regression problem:

    Q̂^{t+1} = argmin_{Q∈F} Σ_{i=1}^m ( r_i + γ max_{a'∈A} Q̂^t(s'_i, a') − Q(s_i, a_i) )² + λ Pen(Q̂),

where λ Pen(Q̂) imposes a regularization penalty term and λ is a regularization coefficient. They prove bounds relating this regularization cost to the approximation error in Q̂ between iterations of FQI.

Farahmand & Szepesvári (2011) and Farahmand (2011) focused on a problem relevant to our approach—regularization for Q-value selection in RL and planning. They considered an offline setting in which an algorithm, given a data set of experiences and a set of possible Q functions, must choose a Q function from the set that minimizes the true Bellman error. They provided a general complexity regularization bound for model selection, which they applied to bound the approximation error for the Q function chosen by their proposed algorithm, BERMIN.

6.2. Regularizing Models

In model-based RL, regularization can be used to improve estimates of R and T when data is finite or limited.

Taylor & Parr (2009) investigated the relationship between Kernelized LSTD (Xu et al., 2005) and other related techniques, with a focus on regularization in model-based RL. Most relevant to our work is their decomposition of the Bellman error into transition and reward error, which they empirically show offers insight into the choice of regularization parameters.

Bartlett & Tewari (2009) developed an algorithm, REGAL, with optimal regret for weakly communicating MDPs. REGAL heavily relies on regularization; based on all prior experience, the algorithm continually updates a set M that, with high probability, contains the true MDP. Letting λ*(M) denote the optimal per-step reward of the MDP M, the traditional optimistic exploration tactic would suggest that the agent should choose the M′ in M with maximal λ*(M′). REGAL also includes a regularization term in this maximization to prevent overfitting based on the experiences so far, resulting in state-of-the-art regret bounds.

6.3. Regularizing Policies

The focus of applying regularization to policies is to limit the complexity of the policy class being searched in the planning process. It is this approach that we adopt in the present paper.

Somani et al. (2013) explored how regularization can help online planning for Partially Observable Markov Decision Processes (POMDPs). They introduced the DESPOT algorithm (Determinized Sparse Partially Observable Tree), which constructs a tree that models the execution of all policies on a number of sampled scenarios (rollouts). However, the authors note that DESPOT typically succumbs to overfitting, as a policy that performs well on the sampled scenarios is not likely to perform well in general. The work proposes a regularized extension of DESPOT, R-DESPOT, where regularization takes the form of balancing the performance of the policy on the samples with the complexity of the policy class. Specifically, R-DESPOT imposes a regularization penalty on the utility of each node in the belief tree. The algorithm then computes the policy that maximizes regularized utility for the tree using a bottom-up dynamic programming procedure on the tree. The approach is similar to ours in that it also limits policy complexity through regularization, but focuses on regularizing utility instead of regularizing the use of a transition model. Investigating the interplay between these two approaches poses an interesting direction for future work. In a similar vein, Thomas et al. (2015) developed a batch RL algorithm with a probabilistic performance guarantee that limits the complexity of the policy class as a means of regularization.

Petrik & Scherrer (2008) conducted analysis similar to Jiang et al. (2015). Specifically, they investigated the situations in which using a lower-than-actual discount factor can improve solution quality given an approximate model, noting that this procedure has the effect of regularizing rewards. The work also advanced the first bounds on the error of using a smaller discount factor.

7. Conclusion

For three different regularization methods—decreased discounting, increased exploration, and decreased policy complexity—we found a consistent U-shaped tradeoff between the size of the policy class being searched and its performance on a learned model. Future work will evaluate other methods such as dropout.

The plots that varied ε and γ were quite similar, raising the possibility that perhaps epsilon-greedy action selection is functioning as another way to decrease the effective horizon depth used in planning—chaining together random actions makes future states less predictable and therefore carry less weight. Later work can examine whether jointly choosing ε and γ is more effective than setting only one at a time.

More work is needed to identify methods that can learn in much larger domains (Bellemare et al., 2013). One concept worth considering is adapting regularization non-uniformly to the state space. That is, it should be possible to modulate the complexity of policies considered in parts of the state space where the model is more accurate, allowing more expressive plans in some places than others.

References

Baird, Leemon et al. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Bartlett, Peter L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42, 2009.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bradtke, Steven J. and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym, 2016.

Dayan, Peter and Sejnowski, Terrence J. Exploration bonuses and dual control. Machine Learning, 25:5–22, 1996.

Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farahmand, Amir-massoud. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Farahmand, Amir-massoud and Szepesvári, Csaba. Model selection in reinforcement learning. Machine Learning, 85:299–332, 2011.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized policy iteration. In Advances in Neural Information Processing Systems (NIPS), pp. 441–448, 2008a.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration: Application to planning. Lecture Notes in Computer Science, 5323 LNAI:55–68, 2008b.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pp. 725–730, 2009.

Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad Gheshlaghi, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.

Jiang, Nan, Kulesza, Alex, Singh, Satinder, and Lewis, Richard. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Johns, J., Painter-Wakefield, C., and Parr, R. Linear complementarity for regularized policy evaluation and improvement. Advances in Neural Information Processing Systems, 23:1009–1017, 2010.

Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. In Proceedings of the 15th International Conference on Machine Learning, pp. 260–268, 1998. URL citeseer.nj.nec.com/kearns98nearoptimal.html.

Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolter, J. Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Saitta, Lorenza (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 310–318, 1996.

Liu, Bo, Mahadevan, Sridhar, and Liu, Ji. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pp. 836–844, 2012.

Loth, Manuel, Davy, Manuel, and Preux, Philippe. Sparse temporal difference learning using LASSO. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 352–359. IEEE, 2007.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Ng, Andrew Y., Kim, H. Jin, Jordan, Michael I., and Sastry, Shankar. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS-03), 2003.

Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems (NIPS), pp. 1–8, 2008.

Petrik, Marek, Taylor, Gavin, Parr, Ron, and Zilberstein, Shlomo. Feature selection using regularization in approximate linear programs for Markov decision processes. arXiv preprint arXiv:1005.1860, 2010.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Rummery, G. A. Problem Solving with Reinforcement Learning. PhD thesis, Cambridge University Engineering Department, 1994.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Schweitzer, Paul J. and Seidmann, Abraham. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.

Somani, A., Ye, Nan, Hsu, D., and Lee, W. S. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pp. 1–9, 2013.

Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, Richard S., McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.

Taylor, Gavin and Parr, Ronald. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Thomas, Philip, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2380–2388, 2015.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Williams, Ronald J. and Baird, Leemon C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.

Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63, 2005.