Latent Property State Abstraction for Reinforcement Learning
John Burden, Centre for the Study of Existential Risk, University of Cambridge, [email protected]
Sajjad Kamali Siahroudi, L3S Research Center, Hanover, [email protected]
Daniel Kudenko, L3S Research Center, Hanover, [email protected]

ABSTRACT
Potential Based Reward Shaping has proven itself to be an effective method for improving the learning rate of Reinforcement Learning algorithms — especially when the potential function is derived from the solution to an Abstract Markov Decision Process (AMDP) encapsulating an abstraction of the desired task. The provenance of the AMDP is often a domain expert. In this paper we introduce a novel method for the full automation of creating and solving an AMDP to induce a potential function. We then show empirically that the potential function our method creates improves the sample efficiency of DQN in the domain in which we test our approach.

KEYWORDS
Reinforcement Learning, Reward Shaping, Abstraction

1 INTRODUCTION
Advances in recent years have thrust Reinforcement Learning (RL) into the spotlight. RL agents primarily learn through interaction with their environment. The aim of RL is for the agent to learn a suitable policy π that maps states to actions such that the expected cumulative reward is maximised over an episode of the environment. The "Curse of Dimensionality" [1] observes that the number of states in a system grows exponentially with the size of the state representation. This is an issue for many RL algorithms, as each state (or those nearby) often needs to be visited several times for a satisfactory policy to be learnt. This increase in sample complexity can drastically slow the rate at which RL algorithms learn.
Reward Shaping (RS) can somewhat mitigate the effects of this "curse" [16], but in most approaches this comes at the cost of incorporating domain knowledge that has to be provided by a human expert. In these approaches, an intrinsic reward function (the reward comes from within the agent, rather than from the environment) encodes domain knowledge which the agent receives in addition to the reward from the environment.
However, in many cases it is not straightforward, and at times rather difficult, to encode the required domain knowledge as a reward shaping function. There have been initial attempts at automatically constructing the intrinsic reward function [11], but these approaches require environment dynamics to be known a priori. Solutions to abstract representations of the RL task can also be used to yield shaping functions [4, 7, 12]. This still requires domain knowledge in the form of a formulation of the abstract task. Nevertheless, this requires less domain knowledge than encoding the full shaping function. Unfortunately, the task of encoding or identifying such knowledge for many applications can still be quite difficult and time consuming. This motivates research into the further automation of the construction of RS functions. This would allow for reaping the benefits of reward shaping without the onus of manually encoding domain knowledge or abstract tasks.
Our main contribution is a novel RL technique, called Latent Property State Abstraction (LPSA), that creates its own intrinsic reward function for use with reward shaping, which can speed up the learning process of Deep RL algorithms. In our work we demonstrate this speed-up on the widely popular DQN algorithm [13]. Our method works by learning its own abstract representation of environment states using an auto-encoder. It then learns an abstract value function over the environment before utilising this to increase the learning rate on the ground level of the domain using reward shaping. We empirically evaluate our novel approach against the baseline DQN agent as well as an implementation of a World Models [8] agent.

2 BACKGROUND AND PREREQUISITES
In this section we introduce the background on Reinforcement Learning and Abstract Markov Decision Processes.

2.1 Reinforcement Learning
The principal component of Reinforcement Learning is interaction between an agent and an environment [19]. The environment is usually represented by a Markov Decision Process (MDP). We use the standard notation for MDPs:

M = (S_M, A_M, R_M, P_M)

S_M is the set of states which M can take. A_M : S_M → A is a function mapping a state to the actions available from that state. R_M(s, a, s') denotes the reward received when transitioning from s to s' via action a. Finally, P_M denotes the probability of moving to state s' when in state s and using action a.
The subscript M is used to allow us to distinguish between multiple MDPs and also between Abstract Markov Decision Processes, which we introduce later. For a more complete overview of RL, see [19].
Many methods exist for training an agent to learn a policy π (a function mapping the current state to the action to perform) that maximises the cumulative reward received by the agent over a single episode of the environment. Such methods include simple tabular methods like Q-Learning [19] and SARSA [17], as well as newer deep learning approaches such as DQN [13] and PPO [18]. These methods typically suffer from the so-called "Curse of Dimensionality" [1], where the number of possible states increases exponentially with the size of the state dimension. This often causes learning to be slow in large, complex environments.
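As an illustration of the tabular methods mentioned above, the following is a minimal sketch of one-step Q-Learning. The environment interface (reset, step, actions) and the hyperparameter values are illustrative assumptions in the style of common RL toolkits, not anything prescribed by the paper.

```python
import random
from collections import defaultdict


def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal one-step Q-Learning for a discrete-state, discrete-action environment.

    Assumes `env.reset() -> state`, `env.step(action) -> (next_state, reward, done)`,
    and `env.actions(state) -> list of actions` (a stand-in for A_M(s))."""
    q = defaultdict(float)  # Q-table: (state, action) -> estimated value

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s_next, r, done = env.step(a)
            # one-step update towards the bootstrapped target
            if done:
                target = r
            else:
                target = r + gamma * max(q[(s_next, act)] for act in env.actions(s_next))
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q
```

Because the table is indexed by individual states, the number of entries (and visits required) grows with the state space, which is exactly the sample-complexity problem described above.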
2.2 Reward Shaping
An aid to combating this curse is Reward Shaping (RS). Reward Shaping operates by providing an additional reward F(s, a, s') to the agent after each transition. This additional reward represents external knowledge and is used to steer the agent towards more desirable behaviour.
It was shown in [14] that the optimal policy for a given environment will remain unchanged if F is defined as the difference of potential between two states. That is, we define

F(s, a, s') = ω(γφ(s') − φ(s))

where γ is the discount factor and ω is a scaling factor. The total reward received by the agent after each transition has occurred is then r_total = r_extrinsic + r_intrinsic, where r_extrinsic is the reward given by the environment and r_intrinsic = F(s, a, s') is the additional reward for shaping.
The primary issue to consider is the origin of the shaping function. Manually designing a function that indicates the quality of a transition or state is in many cases time consuming or infeasible. This motivates research into automatically constructing such a function.
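To make the shaping computation concrete, here is a minimal sketch of potential-based reward shaping as defined above. The potential function `phi`, the scaling factor `omega`, and the discount `gamma` are placeholders to be supplied by whatever knowledge source is available; they are assumptions for illustration only.

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # ground state type


def shaped_reward(
    r_extrinsic: float,
    s: S,
    s_next: S,
    phi: Callable[[S], float],  # potential function over ground states
    gamma: float = 0.99,        # discount factor
    omega: float = 1.0,         # scaling factor
) -> float:
    """Return r_total = r_extrinsic + F(s, a, s'), with
    F(s, a, s') = omega * (gamma * phi(s') - phi(s))."""
    f = omega * (gamma * phi(s_next) - phi(s))
    return r_extrinsic + f
```

Because F is a difference of potentials, adding it leaves the optimal policy of the environment unchanged, as per [14].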
2.3 Reward Shaping With Abstract Markov Decision Processes
Useful shaping functions can be created from solutions to abstractions of tasks. Abstract Markov Decision Processes (AMDPs) provide a framework for representing these abstractions of MDPs:

A = (S_A, A_A, R_A, P_A)

Each element of the tuple corresponds to an abstraction of its MDP counterpart, operating on abstract states rather than ground states. Defining these abstract sets and functions is not trivial, and they are often constructed utilising domain knowledge from a human expert. However, even with a human expert, the questions of what abstract states there should be, and what abstract actions, rewards, and transitions should be defined, are often not easy to answer.
A function, Z, is also required in order to map ground states s ∈ S_M to abstract states t ∈ S_A. The function Z is referred to as the state-abstraction function.
When multiple agents operate on separate levels of abstraction, we refer to the agent interacting with the MDP as the ground agent, and the agent interacting with the AMDP as the abstract agent. Abstract actions need not correspond directly to ground actions; doing so would induce a very stochastic AMDP. The game of Othello was used in [12], and constructing the state-abstraction function Z involved detailed knowledge about desirable states of the game, such as the agent having its pieces occupy corners and edges. This approach also involved constructing a basic Option that responded optimally within the abstract state-space.
None of these approaches is sufficiently automated to scale up to much larger domains, in terms of the cost of encoding domain knowledge.
When utilising AMDPs for PBRS, there is an inherent trade-off to consider: the more "abstract" the AMDP is, the less knowledge the shaping function can encode, but the AMDP will consequently be easier to solve. The reverse also holds — less "abstract" AMDPs can provide more useful shaping functions at the higher cost of solving the AMDP. The optimal degree of abstraction will ultimately depend on the size of the task and the level to which abstract knowledge is available or can be obtained.
Previous work [3] has also utilised small, low-level discrete state-space AMDPs to speed up an agent learning to solve continuous-state, discrete-action control problems with deep learning. Here the agent constructed the AMDP based on an initial exploration phase. The agent subsequently solved the AMDP and utilised it with PBRS to speed up its own learning process for the control problems. The biggest drawback of this work was that the abstract states created were uniform in size across each state dimension — thus the AMDP consisted of a number of abstract states exponential in the size of the abstract state dimension. This limited the domains for which this approach worked to small, low-dimensional continuous control problems.
Our work seeks to address this issue, extending the abstract state-space to handle larger, pixel-based state representations by overcoming the exponential growth of the uniform state-abstraction approach. To do this, we employ auto-encoders to learn a continuous latent representation of ground states and use this as our abstract state-space.

2.4 Auto-encoders
An auto-encoder [6] is a neural network architecture consisting of an encoder-decoder pair. The purpose of an auto-encoder is to simultaneously learn both an encoder and decoder by feeding the combined pair training examples (x, x) — that is, the network tries to learn the identity function.
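As a concrete illustration of the auto-encoder idea, below is a minimal sketch in PyTorch. The fully-connected layer sizes, the latent dimension, and the mean-squared reconstruction loss are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn


class AutoEncoder(nn.Module):
    """Simple fully-connected auto-encoder: the encoder maps a flattened
    observation to a low-dimensional latent code z, the decoder maps z back."""

    def __init__(self, input_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def train_autoencoder(model: AutoEncoder, states: torch.Tensor,
                      epochs: int = 10, lr: float = 1e-3) -> None:
    """Train on (x, x) pairs: the reconstruction target is the input itself."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        reconstruction = model(states)
        loss = loss_fn(reconstruction, states)  # compare the reconstruction against x
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```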
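Tying Sections 2.2 to 2.4 together, one way to realise an automatically constructed potential function of the kind discussed above is to use the learned encoder as the state-abstraction function Z and an abstract value estimate over the latent space as the potential, phi(s) = V_A(Z(s)). The sketch below is a hedged illustration of this composition; `abstract_value` is a hypothetical stand-in for whatever abstract value function is learned over the AMDP, and this is not the paper's exact implementation.

```python
from typing import Callable

import torch


def make_potential(
    encoder: torch.nn.Module,                         # learned encoder, used as Z: ground state -> latent state
    abstract_value: Callable[[torch.Tensor], float],  # hypothetical abstract value function V_A over latent states
) -> Callable[[torch.Tensor], float]:
    """Build a potential function phi(s) = V_A(Z(s)) for use with shaped_reward above."""
    def phi(state: torch.Tensor) -> float:
        with torch.no_grad():
            z = encoder(state.flatten().unsqueeze(0))  # abstract (latent) state
        return float(abstract_value(z))
    return phi
```

With `phi` defined this way, the shaped reward from Section 2.2 can be added to the environment reward during ground-level training, for example of a DQN agent.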