
Mitigating Planner Overfitting in Model-Based Reinforcement Learning

Dilip Arumugam 1 David Abel 2 Kavosh Asadi 2 Nakul Gopalan 2 Christopher Grimm 3 Jun Ki Lee 2 Lucas Lehnert 2 Michael L. Littman 2

1 Department of Computer Science, Stanford University. 2 Department of Computer Science, Brown University. 3 Department of Computer Science & Engineering, University of Michigan. Correspondence to: Dilip Arumugam.

Abstract

An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to its model. Alternatively, it can take a more conservative stance and eschew its model in favor of optimizing its behavior solely via real-world interaction. This latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting"—aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.

1. Introduction

Model-based reinforcement learning (RL) has proven to be a powerful approach for generating reward-seeking behavior in sequential decision-making environments. For example, a number of methods are known for guaranteeing near-optimal behavior in a Markov decision process (MDP) by adopting a model-based approach (Kearns & Singh, 1998; Brafman & Tennenholtz, 2002; Strehl et al., 2009). In this line of work, a learning agent continually updates its model of the transition dynamics of the environment and actively seeks out parts of its environment that could contribute to achieving high reward but that are not yet well learned. Policies, in this setting, are designed specifically to explore unknown transitions so that the agent will be able to exploit (that is, maximize reward) in the long run.

A distinct model-based RL problem is one in which an agent has explored its environment, constructed a model, and must then use this learned model to select the best policy that it can. A straightforward approach to this problem, referred to as the certainty equivalence approximation (Dayan & Sejnowski, 1996), is to take the learned model and to compute its optimal policy, deploying the resulting policy in the real environment. The promise of such an approach is that, for environments that are defined by relatively simple dynamics but require complex behavior, a model-based learner can start making high-quality decisions with little data.

Nevertheless, recent large-scale successes of reinforcement learning have not been due to model-based methods but instead derive from value-function-based or policy-search methods (Mnih et al., 2015; 2016; Schulman et al., 2017; Hessel et al., 2018). Attempts to leverage model-based methods have fallen below expectations, particularly when models are learned using function-approximation methods. Jiang et al. (2015) highlighted a significant shortcoming of the certainty equivalence approximation, showing that it is important to hedge against possibly misleading errors in a learned model. They found that reducing the effective planning depth by decreasing the discount factor used for decision making can result in improved performance when operating in the true environment.

At first, this result might seem counterintuitive—the best way to exploit a learned model can be to exploit it incompletely. However, an analogous situation arises in supervised learning. It is well established that, particularly when data is sparse, the representational capacity of supervised learning methods must be restrained or regularized to avoid overfitting. Returning the best hypothesis in a hypothesis class relative to the training data can be problematic if the hypothesis class is overly expressive relative to the size of the training data. The classic result is that testing performance improves, plateaus, then drops as the complexity of the learner's hypothesis class is increased.

In this paper, we extend the results on avoiding planner overfitting via decreasing discount rates by introducing several other ways of regularizing policies in model-based RL. In each case, we see the classic "overfitting" pattern in which resisting the urge to treat the learned model as correct and to search in a reduced policy class is repaid by improved performance in the actual environment. We believe this research direction may hold the key to large-scale applications of model-based RL.

Section 2 provides a set of definitions, which provide a vocabulary for the paper. Section 3 reviews the results on decreasing discount rates, Section 4 presents a new approach that plans using epsilon-greedy policies, and Section 5 presents results where policy search is performed using lower capacity representations of policies. Section 6 summarizes related work and Section 7 concludes.

2. Definitions

An MDP M is defined by the quantities ⟨S, A, R, T, γ⟩, where S is a state space, A is an action space, R : S × A → ℝ is a reward function, T : S × A → P(S) is a transition function, and 0 ≤ γ < 1 is a discount factor. The notation P(X) represents the set of probability distributions over the discrete set X. Given an MDP M = ⟨S, A, R, T, γ⟩, its optimal value function Q* is the solution to the Bellman equation:

    Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} max_{a'} Q*(s', a').

This function is unique and can be computed by algorithms such as value iteration or linear programming (Puterman, 1994).

A (deterministic) policy is a mapping from states to actions, π : S → A. Given a value function Q : S × A → ℝ, the greedy policy with respect to Q is π_Q(s) = argmax_a Q(s, a). The greedy policy with respect to Q* maximizes expected discounted reward from all states. We assume that ties between actions of the greedy policy are broken arbitrarily but consistently so there is always a unique optimal policy for any MDP.

The value function for a policy π deployed in M can be found by solving

    Q^π_M(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} Q^π_M(s', π(s')).

The value function of the optimal policy is the optimal value function. For a policy π, we also define the scalar V^π_M = Σ_s w_s Q^π_M(s, π(s)), where w is an MDP-specific weighting function over the states.

The epsilon-greedy policy (Sutton & Barto, 1998) is a stochastic policy where the probability of choosing action a is (1 − ε) + ε/|A| if a = argmax_a Q(s, a) and ε/|A| otherwise. The optimal epsilon-greedy policy for M is not generally the epsilon-greedy policy for Q*. Instead, it is necessary to solve a different set of Bellman equations:

    Q_ε(s, a) = R(s, a) + γ Σ_{s'} T(s, a)_{s'} × ( (1 − ε) max_{a'} Q_ε(s', a') + (ε/|A|) Σ_{a'} Q_ε(s', a') ).

The optimal epsilon-greedy policy plays an important role in the analysis of learning algorithms like SARSA (Rummery, 1994; Littman & Szepesvári, 1996).

These examples of optimal policies are with respect to all possible deterministic Markov policies. In this paper, we also consider optimization with respect to a restricted set of policies Π. The optimal restricted policy can be found by comparing the scalar values of the policies: ρ* = argmax_{ρ∈Π} V_ρ.
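To make the epsilon-greedy Bellman equation concrete, the following sketch (ours, not taken from the paper; numpy assumed, with the transition function stored as a tensor T[s, a, s']) solves it by dynamic programming on tabular arrays. With ε = 0 it reduces to ordinary value iteration for Q*.

    import numpy as np

    def epsilon_greedy_value_iteration(R, T, gamma, epsilon, iters=1000):
        """Solve the epsilon-greedy Bellman equations for tabular R[s, a], T[s, a, s']."""
        S, A = R.shape
        Q = np.zeros((S, A))
        for _ in range(iters):
            # Next-state value under the epsilon-greedy policy induced by Q:
            # greedy action with prob. (1 - eps), each action with prob. eps/|A|.
            next_v = (1 - epsilon) * Q.max(axis=1) + (epsilon / A) * Q.sum(axis=1)
            Q = R + gamma * T @ next_v          # the matmul sums over s'
        return Q

    def epsilon_greedy_policy(Q, epsilon):
        """Return the stochastic policy matrix pi[s, a] implied by Q."""
        S, A = Q.shape
        pi = np.full((S, A), epsilon / A)
        pi[np.arange(S), Q.argmax(axis=1)] += 1 - epsilon
        return pi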

3. Decreased Discounting

Let M = ⟨S, A, R, T, γ⟩ be the evaluation environment and M̂ = ⟨S, A, R, T̂, γ′⟩ be the planning environment, where T̂ is the learned model and γ′ ≤ γ is a smaller discount factor used to decrease the effective planning horizon. Jiang et al. (2015) proved a bound on the difference between the performance of the optimal policy in M and the performance of the optimal policy in M̂ when executed in M:

    (γ − γ′)/((1 − γ)(1 − γ′)) · R_max + (2 R_max/(1 − γ′)²) · √( (1/(2n)) log( 2|S||A||Π_{R,γ′}| / δ ) ).      (1)

Here, R_max = max_{s,a} R(s, a) is the largest reward (we assume all rewards are non-negative), δ is the certainty with which the bound needs to hold, n is the number of samples of each transition used to build the model, and |Π_{R,γ′}| is the number of distinct possibly optimal policies for ⟨S, A, R, ·, γ′⟩ over the entire space of possible transition functions.

They show that |Π_{R,γ′}| is an increasing function of γ′, growing from 1 to as high as |A|^|S|, the size of the set of all possible deterministic policies. They left open the shape of this function, which is most useful if it grows gradually, but could possibly jump abruptly.

To help ground intuitions, we estimated the shape of |Π_{R,γ′}| over a set of randomly generated MDPs. Following Jiang et al. (2015), a "ten-state chain" MDP M = ⟨S, A, T, R, γ⟩ is drawn such that, for each state–action pair (s, a) ∈ S × A, the transition function T(s, a) is constructed by choosing 5 states at random from S, then assigning probabilities to these states by drawing 5 independent samples from a uniform distribution over [0, 1] and normalizing the resulting numbers. The probability of transition to any other state is zero. For each state–action pair (s, a) ∈ S × A, the reward R(s, a) is drawn from a uniform distribution with support [0, 1]. For our MDPs, we chose |S| = 10, |A| = 2 and γ = 0.99. We examined γ′ in {0.0, 0.1, ..., 0.9, 0.99} and computed optimal policies by running value iteration with 10 iterations. We sampled repeatedly until no new optimal policy was discovered for 5000 consecutive samples.
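This sampling procedure is simple to express in code. The sketch below is our own illustration (numpy assumed; the function names are ours): fix a reward function, repeatedly draw transition functions from the distribution above, plan in each sampled MDP with discount γ′, and count the distinct greedy policies that appear. Repeating it for each γ′ with a fixed reward function yields a curve like Figure 1.

    import numpy as np

    def sample_transition(S=10, A=2, support=5, rng=np.random):
        """Sample T[s, a, s'] with 'support' random successor states per (s, a)."""
        T = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                succ = rng.choice(S, size=support, replace=False)
                p = rng.uniform(size=support)
                T[s, a, succ] = p / p.sum()
        return T

    def optimal_policy(R, T, gamma, iters=1000):
        """Deterministic greedy policy from value iteration on (R, T, gamma)."""
        Q = np.zeros(R.shape)
        for _ in range(iters):
            Q = R + gamma * T @ Q.max(axis=1)
        return tuple(Q.argmax(axis=1))          # hashable, so policies can be counted

    def estimate_policy_count(R, gamma_plan, patience=5000, rng=np.random):
        """Count distinct optimal policies until none new appears for 'patience' samples."""
        seen, since_new = set(), 0
        while since_new < patience:
            pi = optimal_policy(R, sample_transition(rng=rng), gamma_plan)
            since_new = 0 if pi not in seen else since_new + 1
            seen.add(pi)
        return len(seen)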

[Figure 2 plot: the bound of Equation 1 as a function of γ′, with one curve per n ∈ {10000, 50000, 200000, 900000, 1500000}.]
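The bound plotted in Figure 2 is straightforward to evaluate directly. The sketch below (ours; numpy assumed, and δ = 0.05 is a placeholder since the paper does not state the value used) computes Equation 1 for a given planning discount and sample size; sweeping γ′ for each n produces curves like those in the figure.

    import numpy as np

    def equation_1_bound(gamma, gamma_plan, n, num_policies, S=10, A=2, r_max=1.0, delta=0.05):
        """Evaluate the planning-loss bound of Equation 1."""
        planning_term = (gamma - gamma_plan) * r_max / ((1 - gamma) * (1 - gamma_plan))
        estimation_term = (2 * r_max / (1 - gamma_plan) ** 2) * np.sqrt(
            np.log(2 * S * A * num_policies / delta) / (2 * n))
        return planning_term + estimation_term

    # One curve of Figure 2: fix n and sweep the planning discount, supplying an
    # estimate of |Pi_{R, gamma'}| (e.g., from the sampling procedure above).
    # bounds = [equation_1_bound(0.99, g, n=200000, num_policies=...) for g in gammas]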

Figure 1. The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying γ′, in random MDPs.

Figure 2. Bound on policy loss for randomly generated MDPs, showing the tightest bound for intermediate values of γ′ for intermediate amounts of data.

Figure 1 is an estimate of how |Π_{R,γ′}| grows in this class of randomly generated MDPs. Fortunately, the set appears to grow gradually, making γ′ an effective parameter for fighting planner overfitting.

Estimating |Π_{R,γ′}| ≈ 11e^{γ′} − 10, Figure 2 shows the bound of Equation 1 applied to the random MDP distribution (|S| = 10, |A| = 2, R_max = 1, γ = 0.99). Note that the expected "U" shape is visible, but only for a relatively narrow range of values of n. For under 50k samples, the minimal loss bound is achieved for γ′ = 0. For over 900k samples, the minimal loss bound is achieved for γ′ = γ. (Note that the pattern shown here is relatively insensitive to the estimated shape of |Π_{R,γ′}|.)

For actual MDPs, the "U" shape is much more robust. Using this same distribution over MDPs, Figure 3 replicates an empirical result of Jiang et al. (2015) showing that intermediate values of γ′ are most successful and that this value grows as the model used in planning becomes more accurate (having been trained on more trajectories). We sampled MDPs from the same random distribution and, for each value of n ∈ {5, 10, 20, 50}, we generated 1000 datasets, each consisting of n trajectories of length 10 starting from a state selected uniformly at random and executing a random policy. In all experiments, the estimated MDP (M̂) was computed using maximum likelihood estimates of T and R with no additive Gaussian noise. Optimal policies were all found by running value iteration in the estimated MDP M̂. The empirical loss (Equation 14 of Jiang et al. (2015)) was computed for each value of γ′ ∈ {0.0, 0.1, 0.2, ..., 0.9, 0.99}. The error bars shown in the figure represent 95% confidence intervals.

Figure 3. Reducing the discount factor used in planning combats planner overfitting in random MDPs.
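The core of this experimental pipeline can be sketched as follows (our illustration, not the authors' code). It reuses optimal_policy from the earlier sketch; for brevity it samples each (s, a) pair directly rather than collecting whole trajectories, and it treats R as known rather than estimated.

    import numpy as np

    def evaluate(policy, R, T, gamma):
        """Exact value vector of a deterministic policy (array of actions) in (R, T, gamma)."""
        S = R.shape[0]
        P = T[np.arange(S), policy]                     # S x S transition matrix under the policy
        r = R[np.arange(S), policy]
        return np.linalg.solve(np.eye(S) - gamma * P, r)   # V = (I - gamma P)^-1 r

    def certainty_equivalence_loss(R, T, gamma, gamma_plan, n, rng=np.random):
        """Loss in M of the policy planned in the ML model with discount gamma_plan."""
        S, A, _ = T.shape
        # Maximum-likelihood transition estimate from n samples per (s, a).
        counts = np.array([[rng.multinomial(n, T[s, a]) for a in range(A)] for s in range(S)])
        T_hat = counts / n
        pi_hat = np.array(optimal_policy(R, T_hat, gamma_plan))   # plan in the model
        pi_star = np.array(optimal_policy(R, T, gamma))           # plan in the truth
        w = np.full(S, 1.0 / S)                                   # uniform state weighting
        return w @ (evaluate(pi_star, R, T, gamma) - evaluate(pi_hat, R, T, gamma))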


4. Increased Exploration

In this section, we consider a novel regularization approach in which planning is performed over the set of epsilon-greedy policies. The intuition here is that adding noise to the policies makes it harder for them to be tailored explicitly to the learned model, resulting in less planner overfitting. In Section 4.1, a general bound is introduced and then Section 4.2 applies the bound to the set of epsilon-greedy policies.

4.1. General Bounds

We can relate the structure of a restricted set of policies Π to the performance in an approximate model with the following theorem.

Theorem 1. Let Π be a set of policies for an MDP M = ⟨S, A, T, R, γ⟩. Let M̂ = ⟨S, A, T̂, R, γ⟩ be an MDP like M, but with a different transition function. Let π be the optimal policy for M and π̂ be the optimal policy for M̂. Let ρ be the optimal policy in Π for M and ρ̂ be the optimal policy in Π for M̂. Then,

    |V^π_M − V^ρ̂_M| ≤ |V^π_M − V^ρ_M| + 2 max_{p∈Π} |V^p_M − V^p_M̂|.

Proof. We can write

    V^π_M − V^ρ̂_M
      = (V^π_M − V^ρ_M) + (V^ρ_M − V^ρ_M̂) − (V^ρ̂_M − V^ρ̂_M̂) − (V^ρ̂_M̂ − V^ρ_M̂)
      ≤ (V^π_M − V^ρ_M) + (V^ρ_M − V^ρ_M̂) − (V^ρ̂_M − V^ρ̂_M̂)                      (2)
      ≤ |V^π_M − V^ρ_M| + |V^ρ_M − V^ρ_M̂| + |V^ρ̂_M − V^ρ̂_M̂|
      ≤ |V^π_M − V^ρ_M| + 2 max_{p∈Π} |V^p_M − V^p_M̂|.                             (3)

Equation 2 follows from the fact that V^ρ̂_M̂ − V^ρ_M̂ ≥ 0, since ρ̂ is chosen as optimal among the set of restricted policies with respect to M̂. Equation 3 follows because both ρ and ρ̂ are included in Π. The theorem follows from the fact that V^π_M − V^ρ̂_M ≥ 0 since π is chosen to be optimal in M.

Theorem 1 shows that the restricted policy set Π impacts the resulting value of the plan in two ways. First, the bigger the policy class is, the closer V^ρ_M becomes to V^π_M—that is, the more policies we consider, the closer to optimal we become. At the same time, max_{p∈Π} |V^p_M − V^p_M̂| grows as Π gets larger, as there are more policies that can differ in value between M and M̂. In particular, consider a sequence of Π_i such that Π_i ⊆ Π_{i+1}. Then, the first part of the bound is monotonically non-increasing (it goes down each time a better policy is included in the set) and the second part of the bound is monotonically non-decreasing (it goes up each time a policy is included that magnifies the difference in performance possible in the two MDPs).

Jiang et al. (2015) leverage this structure in the specific case of defining Π by optimizing policies using a smaller value for γ. Our Theorem 1 generalizes the idea to arbitrary restricted policy classes and arbitrary pairs of MDPs M and M̂.

In Lemma 1, we show that the particular choice of M̂ that comes from statistically sampling transitions as in certainty equivalence leads to a bound on |V^p_M − V^p_M̂|, for an arbitrary policy p.

Lemma 1. Given true MDP M, let M̂ be an MDP comprised of a reward function R and transition function T̂ estimated from n samples for each state–action pair, and let p be a policy. Then the following holds with probability at least 1 − δ:

    2 max_{p∈Π} |V^p_M − V^p_M̂| ≤ (2 R_max/(1 − γ)²) · √( (1/(2n)) log( 2|S||A||Π| / δ ) ).

Proof. This lemma is a variation of the classic "Simulation Lemma" (Kearns & Singh, 1998; Strehl et al., 2009) and is proven in this form as Theorem 2 of Jiang et al. (2015) (specifically, Lemma 2). Note that their proof, though stated with respect to a particular choice of the set Π, holds in this general form.
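Because every quantity in Theorem 1 is computable exactly in small tabular problems, the bound can be checked numerically. The sketch below is purely a sanity check of our own (not part of the paper's experiments); it reuses evaluate and optimal_policy from the earlier sketches, uses a uniform state weighting for the scalar values, and takes the restricted class Π as an explicit list of deterministic policies (for example, the greedy policies obtained by planning with several reduced discount factors).

    import numpy as np

    def scalar_value(policy, R, T, gamma):
        """V_M^p: uniformly weighted scalar value of a deterministic policy."""
        return evaluate(np.array(policy), R, T, gamma).mean()

    def check_theorem_1(R, T, T_hat, gamma, Pi):
        """Compare both sides of Theorem 1 for a restricted policy set Pi."""
        pi_star = optimal_policy(R, T, gamma)                              # optimal in M
        rho = max(Pi, key=lambda p: scalar_value(p, R, T, gamma))          # best in Pi for M
        rho_hat = max(Pi, key=lambda p: scalar_value(p, R, T_hat, gamma))  # best in Pi for M-hat
        lhs = abs(scalar_value(pi_star, R, T, gamma) - scalar_value(rho_hat, R, T, gamma))
        gap = max(abs(scalar_value(p, R, T, gamma) - scalar_value(p, R, T_hat, gamma)) for p in Pi)
        rhs = abs(scalar_value(pi_star, R, T, gamma) - scalar_value(rho, R, T, gamma)) + 2 * gap
        return lhs, rhs   # lhs <= rhs should hold, up to value-iteration convergence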

4.2. Bound for Epsilon-Greedy Policies

It remains to show that ‖V^π_M − V^ρ_M‖_∞ is bounded when restricted to epsilon-greedy policies. For the case of planning with a decreased discount factor, Jiang et al. (2015) provide a bound for this quantity in their Lemma 1. For the case of epsilon-greedy policies, the corresponding bound is proven in the following lemma.

Lemma 2. For any MDP M, the difference in value of the optimal policy π and the optimal ε-greedy policy ρ is bounded by:

    |V^π_M − V^ρ_M| ≤ ε R_max / ((1 − γ)(1 − γ(1 − ε))).

Proof. Let π be the optimal policy for M and u be a policy that selects actions uniformly at random. We can define π_ε, an ε-greedy version of π, as:

    π_ε^a(s) = (1 − ε) π^a(s) + ε u^a(s),      (4)

where π^a(s) refers to the probability associated with action a under a policy π. Let T^π denote the transition matrix from states to states under policy π. Using the above definition, we can decompose the transition matrix into

    T^{π_ε} = (1 − ε) T^π + ε T^u.      (5)

Similarly, we have for the reward vector over states,

    R^{π_ε} = (1 − ε) R^π + ε R^u.      (6)

To obtain our bound, note

    V^π_M − V^{π_ε}_M
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} R^{π_ε}
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} [(1 − ε) R^π + ε R^u]
      = Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [T^{π_ε}]^{t−1} (1 − ε) R^π − γ^{t−1} [T^{π_ε}]^{t−1} ε R^u   (the last term is ≥ 0)
      ≤ Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} [(1 − ε) T^π + ε T^u]^{t−1} (1 − ε) R^π.      (7)

Since T^π is a transition matrix, all its entries lie in [0, 1]; hence, we have the following element-wise matrix inequality:

    [(1 − ε) T^π + ε T^u]^{t−1} ≥ [(1 − ε) T^π]^{t−1}.      (8)

Plugging inequality 8 into the bound 7 results in

    V^π_M − V^{π_ε}_M
      ≤ Σ_{t=1}^∞ γ^{t−1} [T^π]^{t−1} R^π − γ^{t−1} (1 − ε)^t [T^π]^{t−1} R^π
      ≤ ‖ Σ_{t=1}^∞ γ^{t−1} (1 − (1 − ε)^t) [T^π]^{t−1} R^π ‖_∞.

Since ‖R^π‖_∞ = R_max, we can upper bound the norm of the difference of the value vectors over states with

    ‖V^π_M − V^{π_ε}_M‖_∞ ≤ Σ_{t=1}^∞ γ^{t−1} (1 − (1 − ε)^t) R_max = ε R_max / ((1 − γ)(γε − γ + 1)).

Using this inequality, we can bound the difference in value of the optimal policy π and the optimal ε-greedy policy ρ by

    ‖V^π_M − V^ρ_M‖_∞ ≤ ‖V^π_M − V^{π_ε}_M‖_∞ ≤ ε R_max / ((1 − γ)(γε − γ + 1)).

Figure 4. The number of distinct optimal policies found generating random transition functions for a fixed reward function, varying ε.

Figure 4 is an estimate of how |Π_{R,ε}| grows over the class of randomly generated MDPs. Again, the set appears to grow gradually as ε decreases, making ε another effective parameter for fighting planner overfitting.

4.3. Empirical Results

We evaluated this exploration-based regularization approach in the distribution over MDPs used in Figure 3. Figure 5 shows results for each value of ε ∈ {0.0, 0.1, 0.2, ..., 0.9, 1.0}. Here, the maximum likelihood transition function T̂ was replaced with the epsilon-softened transition function T̂_ε. In contrast to the previous figure, regularization increases as we go to the right. Once again, we see that intermediate values of ε are most successful and the best value of ε decreases as the model used in planning becomes more accurate (having been trained on more trajectories). The similarity to Figure 3 is striking—in spite of the difference in approach, it is essentially the mirror image of Figure 3.

We see that manipulating either γ or ε can be used to modulate the impact of planner overfitting. Which method to use in practice depends on the particular planner being used and how easily it is modified to use these methods.
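The paper does not spell out the exact form of T̂_ε, so the sketch below is one natural reading of ours rather than the authors' procedure: mix each action's estimated transition distribution with the average over actions (the effect of an action being replaced, with probability ε, by a uniformly random one), and then plan greedily in the softened model. It reuses optimal_policy and scalar_value from the earlier sketches.

    import numpy as np

    def epsilon_soften(T_hat, epsilon):
        """T_eps[s, a] = (1 - eps) * T_hat[s, a] + eps * mean over b of T_hat[s, b]."""
        uniform_action = T_hat.mean(axis=1, keepdims=True)   # transition under a random action
        return (1 - epsilon) * T_hat + epsilon * uniform_action

    def plan_with_softened_model(R_hat, T_hat, gamma, epsilon):
        """Greedy deterministic plan computed in the epsilon-softened learned model."""
        return optimal_policy(R_hat, epsilon_soften(T_hat, epsilon), gamma)

    # Sketch of the evaluation loop behind Figure 5: plan in the softened model for
    # each epsilon, then measure the resulting policy's loss in the true MDP M.
    # for eps in [0.0, 0.1, ..., 1.0]:
    #     pi = plan_with_softened_model(R_hat, T_hat, gamma, eps)
    #     loss = scalar_value(pi_star, R, T, gamma) - scalar_value(pi, R, T, gamma)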

[Figure 6 plot: "Optimized-Policy Performance"; mean return versus hidden size (1 to 2000), with one curve for performance on the learned model and one for performance on the environment.]

Figure 5. Increasing the randomness in action selection during planning combats planner overfitting in random MDPs.

Figure 6. Decreasing the number of hidden units used to represent a policy combats planner overfitting in the Lunar Lander domain.

5. Decreased Policy Complexity

In addition to indirectly controlling policy complexity via ε and γ, it is possible to control for the complexity via the representation of the policy itself. In this section, we look at varying the complexity of the policy in the context of model-based RL in which a model is learned and then a policy for that model is optimized via a policy-search approach. Such an approach was used in the setting of helicopter control (Ng et al., 2003) in the sense that collected data in that work was used to build a model, and a policy was constructed to optimize performance in this model (via policy search, in this case) and then deployed in the environment.

Our test domain was Lunar Lander, an environment with a continuous state space and discrete actions. The goal of the environment is to control a falling spacecraft so as to land gently in a target area. It consists of 8 state variables, namely the lander's x and y coordinates, x and y velocities, angle and angular velocity, and two Boolean flags corresponding to whether each leg has touched down. The agent can take 4 actions, corresponding to which of its three thrusters (or no thruster) is active during the current time step. The Lunar Lander environment is publicly available as part of the OpenAI Gym toolkit (Brockman et al., 2016).

We collected 40k 200-step episodes of data on Lunar Lander. During data collection, decisions were made by a policy-gradient algorithm. Specifically, we ran the REINFORCE algorithm with the state–value function as the baseline (Williams, 1992; Sutton et al., 2000). For the policy and value networks, we used a single-hidden-layer neural network with 16 hidden units and ReLU activation functions. We used the Adam algorithm (Kingma & Ba, 2014) with the default parameters and a step size of 0.005. The learned model was a 3-layer neural net with ReLU activation functions mapping the agent's state (8 inputs corresponding to 8 state variables) as well as a one-hot representation of actions (4 inputs corresponding to 4 possible actions). The model consisted of two fully connected hidden layers with 32 units each and ReLU activations. We again used Adam, with a step size of 0.001, to learn the model.

We then ran policy-gradient RL (REINFORCE) as a planner using the learned model. The policy was represented by a neural network with a single hidden layer. To control the complexity of the policy representation, we varied the number of units in the hidden layer from 1 to 2000. Results were averaged over 40 runs. Figure 6 shows that increasing the size of the hidden layer in the policy resulted in better and better performance on the learned model (top line). However, after 250 or so units, the resulting policy performed less well on the actual environment (bottom line). Thus, we see that reducing policy complexity serves as yet another way to reduce planner overfitting.
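The planning loop can be sketched as follows (our illustration, not the authors' code; PyTorch assumed). Here model_step and reset are hypothetical stand-ins for the learned dynamics/reward network and an initial-state sampler, model_step returning a next-state tensor and a scalar reward; no value baseline is used in this simplified version, unlike the data-collection runs.

    import torch
    import torch.nn as nn
    from torch.distributions import Categorical

    def make_policy(num_hidden, state_dim=8, num_actions=4):
        """Single-hidden-layer policy; num_hidden controls policy complexity."""
        return nn.Sequential(nn.Linear(state_dim, num_hidden), nn.ReLU(),
                             nn.Linear(num_hidden, num_actions))

    def reinforce_on_model(policy, model_step, reset, episodes=1000, horizon=200,
                           gamma=0.99, lr=0.005):
        """REINFORCE where rollouts come from the learned model, not the real environment."""
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        for _ in range(episodes):
            state, log_probs, rewards = reset(), [], []
            for _ in range(horizon):
                dist = Categorical(logits=policy(state))
                action = dist.sample()
                log_probs.append(dist.log_prob(action))
                state, reward = model_step(state, action)   # imagined transition
                state = state.detach()                       # do not backprop through the model
                rewards.append(float(reward))
            # Discounted returns-to-go for each step of the imagined episode.
            returns, g = [], 0.0
            for r in reversed(rewards):
                g = r + gamma * g
                returns.append(g)
            returns = torch.tensor(list(reversed(returns)))
            loss = -(torch.stack(log_probs) * returns).sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return policy

    # Varying num_hidden from 1 to 2000 and evaluating each resulting policy in the
    # real environment traces out the two curves of Figure 6.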
6. Related Work

Prior work has explored the use of regularization in reinforcement learning to mitigate overfitting. We survey some of the previous methods according to which function is regularized: (1) value, (2) model, or (3) policy.

6.1. Regularizing Value Functions

Many prior approaches have applied regularization to value-function approximation, including Least Squares Temporal Difference learning (Bradtke & Barto, 1996), Policy Evaluation, and the batch approach of Fitted Q-Iteration (FQI) (Ernst et al., 2005).

Kolter & Ng (2009) applied regularization techniques to LSTD (Bradtke & Barto, 1996) with an algorithm they called LARS-TD. In particular, they argued that, without regularization, LSTD's performance depends heavily on the number of basis functions chosen and the size of the data set collected. If the data set is too small, the technique is prone to overfitting. They showed that L1 and L2 regularization yield a procedure that inherits the benefits of selecting good features while making it possible to compute the fixed point. Later work by Liu et al. (2012) built on this work with the algorithm RO-TD, an L1-regularized off-policy temporal difference learning method. Johns et al. (2010) cast the L1-regularized fixed-point computation as a linear complementarity problem, which provides stronger solution-uniqueness guarantees than those provided for LARS-TD. Petrik et al. (2010) examined the approximate linear programming (ALP) framework for finding approximate value functions in large MDPs. They showed the benefits of adding an L1 regularization constraint to the ALP that increases the error bound at training time and helps fight overfitting.

Farahmand et al. (2008a) and Farahmand et al. (2009) focused on regularization applied to Policy Iteration and Fitted Q-Iteration (FQI) (Ernst et al., 2005) and developed two related methods for Regularized Policy Iteration, each leveraging L2 regularization during the evaluation of policies for each iteration. The first method adds a regularization term to the Least Squares Temporal Difference (LSTD) error (Bradtke & Barto, 1996), while the second adds a similar term to the optimization of Bellman residual minimization (Baird et al., 1995; Schweitzer & Seidmann, 1985; Williams & Baird, 1993) with regularization (Loth et al., 2007). Their main result shows finite convergence for the Q function under the approximated policy and the true optimal policy. A method for FQI adds a regularization cost to the least squares regression of the Q function. Follow-up work (Farahmand et al., 2008b) expanded Regularized Fitted Q-Iteration to planning. That is, given a data set D = ⟨(s_1, a_1, r_1, s'_1), ..., (s_m, a_m, r_m, s'_m)⟩ and a function family F (like regression trees), FQI approximates a Q function through repeated iterations of the following regression problem:

    Q̂^{t+1} = argmin_{Q∈F} Σ_{i=1}^m ( r_i + γ max_{a'∈A} Q̂^t(s'_i, a') − Q(s_i, a_i) )² + λ Pen(Q̂),

where λ Pen(Q̂) imposes a regularization penalty term and λ is a regularization coefficient. They prove bounds relating this regularization cost to the approximation error in Q̂ between iterations of FQI.

Farahmand & Szepesvári (2011) and Farahmand (2011) focused on a problem relevant to our approach—regularization for Q-value selection in RL and planning. They considered an offline setting in which an algorithm, given a data set of experiences and a set of possible Q functions, must choose a Q function from the set that minimizes the true Bellman error. They provided a general complexity regularization bound for model selection, which they applied to bound the approximation error for the Q function chosen by their proposed algorithm, BERMIN.

6.2. Regularizing Models

In model-based RL, regularization can be used to improve estimates of R and T when data is finite or limited.

Taylor & Parr (2009) investigated the relationship between Kernelized LSTD (Xu et al., 2005) and other related techniques, with a focus on regularization in model-based RL. Most relevant to our work is their decomposition of the Bellman error into transition and reward error, which they empirically show offers insight into the choice of regularization parameters.

Bartlett & Tewari (2009) developed an algorithm, REGAL, with optimal regret for weakly communicating MDPs. REGAL heavily relies on regularization; based on all prior experience, the algorithm continually updates a set M that, with high probability, contains the true MDP. Letting λ*(M) denote the optimal per-step reward of the MDP M, the traditional optimistic exploration tactic would suggest that the agent should choose the M′ in M with maximal λ*(M′). REGAL also includes a regularization term in this maximization to prevent overfitting based on the experiences so far, resulting in state-of-the-art regret bounds.

6.3. Regularizing Policies

The focus of applying regularization to policies is to limit the complexity of the policy class being searched in the planning process. It is this approach that we adopt in the present paper.

Somani et al. (2013) explored how regularization can help online planning for Partially Observable Markov Decision Processes (POMDPs). They introduced the DESPOT algorithm (Determinized Sparse Partially Observable Tree), which constructs a tree that models the execution of all policies on a number of sampled scenarios (rollouts). However, the authors note that DESPOT typically succumbs to overfitting, as a policy that performs well on the sampled scenarios is not likely to perform well in general. The work proposes a regularized extension of DESPOT, R-DESPOT, where regularization takes the form of balancing the performance of the policy on the samples with the complexity of the policy class. Specifically, R-DESPOT imposes a regularization penalty on the utility of each node in the belief tree. The algorithm then computes the policy that maximizes regularized utility for the tree using a bottom-up dynamic programming procedure on the tree. The approach is similar to ours in that it also limits policy complexity through regularization, but focuses on regularizing utility instead of regularizing the use of a transition model. Investigating the interplay between these two approaches poses an interesting direction for future work. In a similar vein, Thomas et al. (2015) developed a batch RL algorithm with a probabilistic performance guarantee that limits the complexity of the policy class as a means of regularization.

Petrik & Scherrer (2008) conducted analysis similar to Jiang et al. (2015). Specifically, they investigated the situations in which using a lower-than-actual discount factor can improve solution quality given an approximate model, noting that this procedure has the effect of regularizing rewards. The work also advanced the first bounds on the error of using a smaller discount factor.

7. Conclusion

For three different regularization methods—decreased discounting, increased exploration, and decreased policy complexity—we found a consistent U-shaped tradeoff between the size of the policy class being searched and its performance on a learned model. Future work will evaluate other methods such as dropout.

The plots that varied ε and γ were quite similar, raising the possibility that perhaps epsilon-greedy action selection is functioning as another way to decrease the effective horizon depth used in planning—chaining together random actions makes future states less predictable and therefore carry less weight. Later work can examine whether jointly choosing ε and γ is more effective than setting only one at a time.

More work is needed to identify methods that can learn in much larger domains (Bellemare et al., 2013). One concept worth considering is adapting regularization non-uniformly to the state space. That is, it should be possible to modulate the complexity of policies considered in parts of the state space where the model is more accurate, allowing more expressive plans in some places than others.

References

Baird, Leemon et al. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37, 1995.

Bartlett, Peter L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42, 2009.

Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Bradtke, Steven J. and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX—a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym, 2016.

Dayan, Peter and Sejnowski, Terrence J. Exploration bonuses and dual control. Machine Learning, 25:5–22, 1996.

Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farahmand, Amir-massoud. Regularization in Reinforcement Learning. PhD thesis, University of Alberta, 2011.

Farahmand, Amir-massoud and Szepesvári, Csaba. Model selection in reinforcement learning. Machine Learning, 85:299–332, 2011.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized policy iteration. In Advances in Neural Information Processing Systems (NIPS), pp. 441–448, 2008a.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration: Application to planning. Lecture Notes in Computer Science, 5323 LNAI:55–68, 2008b.

Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. In Proceedings of the American Control Conference, pp. 725–730, 2009.

Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad Gheshlaghi, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.

Jiang, Nan, Kulesza, Alex, Singh, Satinder, and Lewis, Richard. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189, 2015.

Johns, J., Painter-Wakefield, C., and Parr, R. Linear complementarity for regularized policy evaluation and improvement. Advances in Neural Information Processing Systems, 23:1009–1017, 2010.

Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. In Proceedings of the 15th International Conference on Machine Learning, pp. 260–268, 1998. URL citeseer.nj.nec.com/kearns98nearoptimal.html.

Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kolter, J. Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Saitta, Lorenza (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 310–318, 1996.

Liu, Bo, Mahadevan, Sridhar, and Liu, Ji. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pp. 836–844, 2012.

Loth, Manuel, Davy, Manuel, and Preux, Philippe. Sparse temporal difference learning using LASSO. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 352–359. IEEE, 2007.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In ICML, 2016.

Ng, Andrew Y., Kim, H. Jin, Jordan, Michael I., and Sastry, Shankar. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS-03), 2003.

Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems (NIPS), pp. 1–8, 2008.

Petrik, Marek, Taylor, Gavin, Parr, Ron, and Zilberstein, Shlomo. Feature selection using regularization in approximate linear programs for Markov decision processes. arXiv preprint arXiv:1005.1860, 2010.

Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Rummery, G. A. Problem Solving with Reinforcement Learning. PhD thesis, Cambridge University Engineering Department, 1994.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

Schweitzer, Paul J. and Seidmann, Abraham. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.

Somani, A., Ye, Nan, Hsu, D., and Lee, W. S. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems, pp. 1–9, 2013.

Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.

Sutton, Richard S., McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.

Taylor, Gavin and Parr, Ronald. Kernelized value function approximation for reinforcement learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pp. 1–8, 2009.

Thomas, Philip, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 2380–2388, 2015.

Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Williams, Ronald J. and Baird, Leemon C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, 1993.

Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63, 2005.