Planning with Expectation Models for Control

Planning with Expectation Models for Control Katya Kudashkina 1 Yi Wan 2 3 Abhishek Naik 2 3 Richard S. Sutton 2 3 Abstract primarily due to their ability to leverage the ever-increasing computational power (driven by Moore’s Law), as well In model-based reinforcement learning (MBRL), as their generality in terms of dependence on data rather Wan et al. (2019) showed conditions under which than hand-crafted features or rule-based techniques. Key to the environment model could produce the expecta- building general-purpose AI systems would be methods that tion of the next feature vector rather than the full scale with computation (Sutton, 2019). distribution, or a sample thereof, with no loss in planning performance. Such expectation models RL is on its way to fully embrace scaling with computation— are of interest when the environment is stochastic it already embraces scaling with computation in many as- and non-stationary, and the model is approximate, pects, for example by leveraging search techniques (e.g., such as when it is learned using function approx- Monte-Carlo Tree Search, see Browne et al., 2012; Finnsson imation. In these cases a full distribution model & Bjornsson,¨ 2008), and modern deep learning techniques may be impractical and a sample model may be such as artificial neural networks. Methods that resort to either more expensive computationally or of high approximating functions rather than learning them exactly variance. Wan et al. considered only planning for have not been fully investigated. Extending the techniques prediction to evaluate a fixed policy. In this paper, used in the simpler tabular regime to this function approx- we treat the control case—planning to improve imation regime is an obvious first step, but some of the and find a good approximate policy. We prove that ideas that have served us well in the past might actually be planning with an expectation model must update impeding progress on the new problem of interest. For exam- a state-value function, not an action-value func- ple, Sutton & Barto (2018) showed that when dealing with tion as previously suggested (e.g., Sorg & Singh, feature vectors rather than underlying states, the common 2010). This opens the question of how planning Bellman error objective is not learnable with any amount of influences action selections. We consider three experiential data. Recently, Naik et al. (2019) showed that strategies for this and present general MBRL al- discounting is incompatible with function approximation in gorithms for each. We identify the strengths and the case of continuing control tasks. Understanding func- weaknesses of these algorithms in computational tion approximation in RL is key to building general-purpose experiments. Our algorithms and experiments are intelligent systems that can learn to solve many tasks of the first to treat MBRL with expectation models arbitrary complexity in the real world. in a general setting. Many of the methods in RL focus on approximating value functions for a given fixed policy—referred to as the prediction problem (e.g., Sutton et al., 1988; Singh et al., 1995; Methods that scale with computation are most likely to Wan et al., 2019). The more challenging problem of learning arXiv:2104.08543v1 [cs.AI] 17 Apr 2021 stand the test of time. We refer to scaling with computation the best behavior is known as the control problem—that is in the context of artificial intelligence (AI) as using more approximating optimal policies and optimal value functions. computation to provide a better approximate answer. This is In the control problem, an RL agent learns within one of in contrast to with the common notion of scaling in computer two broad frameworks: model-free RL and model-based RL science as using more computation for solving a bigger (MBRL). In model-free RL, the agent relies solely on its problem exactly. The recent success of modern machine observations to make decisions (Sutton & Barto, 2018). In learning techniques, in particular that of deep learning, is model-based RL, the agent has a model of the world, which 1Department of Engineering, University of Guelph, Guelph, it uses in conjunction with its observations to plan its deci- Canada 2Department of Computing Science, University of Al- sions. The process of taking a model as input and producing berta, Edmonton, Canada 3Alberta Machine Intelligence Institute, or improving a policy for interacting with a modeled envi- Edmonton, Canada. Correspondence to: Katya Kudashkina <eku- ronment is referred to as planning. Models and planning [email protected]>. are helpful. One advantage of models and planning is that Pre-print. Copyright 2021 by the author(s). they are useful when the agent faces unfamiliar or novel Planning with Expectation Models for Control situations—when the agent may have to consider states and MBRL methods in which 1) the model is parameterized by actions that it has not experienced or seen before. Planning some learnable weights, 2) the agent learns the parameters can help the agent evaluate possible actions by rolling out of the model, and 3) the agent then uses the model to plan hypothetical scenarios according to the model and then com- an improved policy, are referred to as planning with learned puting their expected future outcomes (Doll et al., 2012; Ha parametric models. Learned parametric models are used in & Schmidhuber, 2018; Sutton & Barto, 2018). Planning the Dyna architecture (Sutton, 1991), in normalized advan- with function approximation remains a challenge in rein- tage functions method that incorporates the learned model forcement learning today (Shariff and Szepesvari,´ 2020). into the Q-learning algorithm based on imagination rollouts Planning can be performed with various kinds of models: (Gu et al., 2016), and in MuZero (Schrittwieser et al., 2019). distribution, sample, and expectation. Wan et al. (2019) The latter is a combination of a replay-based method and considered planning with an expectation model for the pre- a learned model: the model is trained by using trajectories diction problem within the function approximation setting. that are sampled from the replay buffer. In this paper we extend Wan et al.’s (2019) work on the pre- If the environment is non-stationary, then the transitions diction problem to the more challenging control problem, in stored in the replay buffer might be stale and can slow down the context of stochastic and non-stationary environments. or even hinder the learning progress. In this work, we focus This will involve several important definitions. We start on the learned parametric models. off with discussing important choices in MBRL (Section 1). Types of Learned Models. Another important choice to This is followed by fundamentals on planning with expecta- make in MBRL is a choice of a model type. A model tion models in the general context of function approximation enables an agent to predict what would happen if actions are (Section 2). We then show (in Section 3) that planning with executed from states prior to actually executing them and an expectation model for control in stochastic non-stationary without necessarily being in those states. Given a state and environments must update a state-value function and not an action, the model can predict a sample, an expectation, or a action-value function as previously suggested (e.g., Sorg distribution of outcomes, which results in three model-type & Singh, 2010). Finally, we consider three ways in which possibilities. actions can be selected when planning with state-value functions, and identify their relative strengths and weaknesses The first possibility is when a model computes a probability (Sections 4 & 5). p of the next state as a result of the action taken by the agent. We refer to such model as a distribution model. Such 1. Choices in Model-Based RL models have been used typically with an assumption of a particular kind of distribution such as Gaussian (e.g., Chua Model-based methods are an important part of reinforce- et al., 2018). For example, Deisenroth & Rasmussen (2011) ment learning’s claim to provide a full account of intelli- proposed a model-based policy search method based on gence. An intelligent agent should be able to model its probabilistic inference for learning control where a distribu- environment and use that model flexibly and efficiently to tion model is learned using Gaussian processes. Learning a plan its behavior. In MBRL, models add knowledge to distribution can be challenging: 1) distributions are poten- the agent in a way that policies and value functions do not tially large objects (Wan et al., 2019); and 2) distribution (van Hasselt et al., 2019). Typically, a model receives a models may require an efficient way of representing and state and an action as inputs, and computes the next state computing them, and both of the tasks can be challenging and reward (Kuvayev & Sutton, 1996; Sutton et al., 2008; (Kurutach et al., 2018). Hester & Stone, 2011). This output is used in planning to The second possibility is for the model to compute a sample further improve policies and value functions. In this section of the next state rather than computing the full distribution. we discuss three important choices one needs to make in We refer to such model as a sample model. The output of MBRL. sample models is more compact than the output of distri- Learned models vs. Experience Replay. One choice bution models and thus, more computationally feasible. In to make is where planning improvements could come this sense sample models are similar to experience replay. from: learned models or experience replay (ER) (Lin, 1992). Sample models have been a good solution to when a deter- In replay-based methods the agent plans using experience ministic environment appears stochastic or non-stationary stored in the agent’s memory.

Planning with Expectation Models for Control

Ai12-General-Game-Playing-Pre-Handout

Towards General Game-Playing Robots: Models, Architecture and Game Controller

Artificial General Intelligence in Games: Where Play Meets Design and User Experience

PAI2013 Popularize Artificial Intelligence

Arxiv:2002.10433V1 [Cs.AI] 24 Feb 2020 on Game AI Was in a Niche, Largely Unrecognized by Have Certainly Changed in the Last 10 to 15 Years

Conference Program

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

General Game Playing and the Game Description Language

Rolling Horizon NEAT for General Video Game Playing

Automatic Construction of a Heuristic Search Function for General Game Playing

Simulation-Based Approach to General Game Playing

General Game Playing Michael Thielscher, Dresden