Internal Model from Observations for Reward Shaping
Daiki Kimura, IBM Research AI, Tokyo, Japan ([email protected])
Subhajit Chaudhury, IBM Research AI, Tokyo, Japan ([email protected])
Ryuki Tachibana, IBM Research AI, Tokyo, Japan ([email protected])
Sakyasingha Dasgupta*, Ascent Robotics Inc., Tokyo, Japan ([email protected])

*This work originated while working at IBM Research.

arXiv:1806.01267v4 [cs.LG] 14 Oct 2018

ABSTRACT

Reinforcement learning methods require careful design involving a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work on the topic has proposed several methods for reward estimation by using expert state trajectories and action pairs. However, there are cases where complete or good action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation on the basis of expert-demonstrated state trajectories to estimate rewards without completely learning the dynamics of the external environment from state-action pairs. The internal model is obtained in the form of a predictive model for the given expert state distribution. During reinforcement learning, the agent predicts the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted multiple experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games. We show our method successfully trains good policies directly from expert game-play videos.

KEYWORDS

Inverse Reinforcement Learning, Reward Shaping, Expert demonstration

1 INTRODUCTION
Reinforcement learning (RL) [30] enables an agent to learn the desired behavior required to accomplish a given objective, such that the expected return or reward for the agent is maximized over time. Typically, a scalar reward signal is used to guide the agent's behavior so that the agent learns a control policy that maximizes the cumulative scalar reward over trajectories. This type of learning is referred to as model-free RL if the agent does not have an a priori model or knowledge of the dynamics of the environment it is acting in. Some notable breakthroughs among the many recent research efforts that incorporate deep models are the deep Q-network (DQN) [17], which approximated a Q-value function using a deep neural network and trained agents to play Atari games with discrete control; the deep deterministic policy gradient (DDPG) [15], which successfully applied deep RL to continuous control agents; and trust region policy optimization (TRPO) [27], which formulated a method for optimizing control policies with guaranteed monotonic improvement.

In most RL methods, it is critical to choose a well-designed reward function to ensure that the agent learns a good action policy for performing the task. Moreover, there are cases in which the reward function is very sparse or may not be directly available. Humans can often imitate the behavior of their instructors and estimate which actions or environmental states are good for the eventual accomplishment of a task without being provided with a continual reward. For example, young adults initially learn how to write letters by imitating demonstrations provided by their teachers or other adults (experts). Further skills are developed on the basis of exploration around this initial grounding provided by the demonstrations. Taking inspiration from such scenarios, various methods have been proposed, collectively known as imitation learning [8, 9] or learning from demonstration [26]. Inverse reinforcement learning [1, 19, 35], behavior cloning [25], and curiosity-based exploration [23] are also examples of research in this field. Typically, in all these formulations, expert demonstrations are provided as input.

The majority of such prior work assumes that the demonstrations contain both states and actions, {(s_0^i, a_0^i), ..., (s_t^i, a_t^i)}, and that these can be used to solve the problem of having only a sparse reward or a complete lack thereof. However, there are many cases in real-world environments in which such detailed action information is not readily available. For example, a typical schoolteacher does not tell students the exact amount of force to apply to each of their fingers while they are learning how to write.

As such, in this work, as our primary contribution, we propose a reinforcement learning method in which the agent learns an internal predictive model of the external environment that is trained on state-only trajectories from expert demonstrations; this model is not trained on state-action pairs. Hence, during each RL step, the method estimates an expected reward value on the basis of the similarity between the actual state and the state predicted by the internal model. The agent is therefore rewarded for reaching known good states and penalized for unknown deviations. Here, we formulate this internal model as a temporal-sequence prediction model that predicts the next state value given the current and past state values at every time step. This paper presents experimental results on multiple environments with varying input and output settings for the internal model. In particular, we show that it is possible to learn good policies using an internal model trained by observing only game-playing videos, akin to the way we as humans learn by observing others. Furthermore, we compare the performance of our proposed method against baselines of hand-crafted rewards, prior research efforts, and other baseline methods in the different environments.
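To make the idea concrete, the following is a minimal sketch of how such an internal model and shaped reward could be implemented. It assumes an LSTM one-step-ahead state predictor trained with a mean-squared-error loss and a negative-squared-error similarity; these architectural and similarity choices, and all hyperparameters, are illustrative assumptions rather than the exact formulation used in the paper.

```python
# Minimal sketch of internal-model reward shaping from state-only expert data.
# Assumptions not taken from the paper: an LSTM one-step-ahead predictor,
# a mean-squared-error training loss, and a negative-squared-error reward.
import torch
import torch.nn as nn


class InternalModel(nn.Module):
    """Temporal-sequence predictor: given states s_0..s_t, predict s_1..s_{t+1}."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, state_seq):
        # state_seq: (batch, time, state_dim) -> one-step-ahead prediction per step
        out, _ = self.lstm(state_seq)
        return self.head(out)


def train_internal_model(model, expert_trajectories, epochs=10, lr=1e-3):
    """Fit the predictor on expert state-only trajectories (no actions required)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for traj in expert_trajectories:       # traj: (time, state_dim) tensor
            inputs = traj[:-1].unsqueeze(0)    # s_0 .. s_{m-1}
            targets = traj[1:].unsqueeze(0)    # s_1 .. s_m
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            opt.step()


def shaped_reward(model, past_states, actual_next_state):
    """Reward the agent for reaching a state close to the internal model's prediction."""
    with torch.no_grad():
        predicted_next = model(past_states.unsqueeze(0))[0, -1]
    # Expert-like states score near zero; deviations are penalized.
    return -torch.mean((predicted_next - actual_next_state) ** 2).item()
```

During RL, shaped_reward would be queried at every step with the states observed so far in the current episode and used in place of (or in addition to) the unknown environment reward.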
2 RELATED WORK

In RL, an agent learns a policy π(a_t | s_t) that produces good actions from the observation at the time. DQN [17] showed that a Q-value function q(s_t, a_t) can be successfully approximated with a deep neural network. DAQN [12] showed that pre-training with a generative model reduces the number of training iterations. Similarly, the actor and critic networks in DDPG enable continuous control, e.g., in robotic manipulation, by minimizing a distance between the robot end-effector and the target position. Since the success of DDPG, other methods such as TRPO [27] and proximal policy optimization (PPO) [28] have been proposed as further improvements to model-free RL for continuous control.

Although RL enables an agent to learn an optimal policy in the absence of supervised training data, in the standard case it involves the difficult task of hand-crafting good reward functions for each environment [1]. Several kinds of approaches have been proposed to work around or tackle this problem. An approach that does not require hand-crafted rewards is behavior cloning based on supervised learning instead of RL [25]. It learns the conditional distribution of actions given states in a supervised manner. Although it has the advantage of fast convergence [8] (as behavior cloning learns a single action from states during each step), it typically results in the compounding of errors in future states.

An alternate approach, inverse reinforcement learning (IRL), was proposed [19]. In this work, the authors tried to recover the reward function as the best description of the given expert demonstrations from humans or expert agents using linear programming methods. This was based on the assumption that expert demonstrations are solutions to a Markov decision process (MDP) defined by a hidden reward function [19]. It demonstrated successful estimation of the reward function in relatively simple environments, such as a grid world and the mountain car problem. Extending [19], entropy-based methods that compute a suitable reward function by maximizing the entropy of the expert demonstrations have been proposed [36].

Our approach is closely related to the method proposed by [5], because both methods calculate the similarity of demonstrations as a reward-shaping function. However, while their paper dealt only with discrete action tasks, we show that a similar approach can be applied to continuous action tasks¹. Moreover, all the above-mentioned methods rely on both state and action information provided by expert demonstrations.

Another recent line of work aimed at learning useful policies for agents even in the absence of expert demonstrations. In this regard, an RL agent was trained with a combination of an intrinsic curiosity-based reward and a hand-engineered reward that provided a continuous or very sparse scalar signal [23]. The curiosity-based reward was designed to have a high value when the agent encountered unseen states and a low value when it was in a state similar to previously explored states. The paper reported good policies in games such as Super Mario Bros. and Doom without any expert demonstrations. Here, we also compared our proposed method with the curiosity-based approach and demonstrated better-learned behavior. However, as a limitation, our method assumed that state demonstrations were available as expert data.

There is also work that estimates the reward with a linear function [29]. However, it was evaluated only on a simple task; specifically, 27 discrete state variables were used for Mario. In contrast, our method uses a non-linear model.

Concurrently with this work, a recent paper [31] proposed learning policies using a behavior cloning method based on observations only. Unlike that work, we put our primary focus on reward shaping based on an internal model learned from observation data.

3 PROPOSED METHOD

3.1 Problem Statement

We considered an MDP consisting of states S and actions A, where the reward signal r : S × A → R was unknown. An agent acted in the environment defined by this MDP, following a policy π(a_t | s_t). Here, we assumed knowledge of a finite set of expert state trajectories, τ = {S^0, ..., S^n}, where S^i = {s_0^i, ..., s_m^i}. These trajectories represented joint angles, raw images, or other environmental states. Since the reward signal was unknown, our primary goal was to find a reward signal that enabled the agent to learn a policy, π, that could imitate the expert state trajectories.
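To make the notation above concrete, the toy sketch below constructs a synthetic set of expert state-only trajectories τ = {S^0, ..., S^n}, with S^i = {s_0^i, ..., s_m^i}, and shows where the hypothetical InternalModel, train_internal_model, and shaped_reward helpers from the earlier sketch would be fit and queried in place of the unknown reward r(s, a). The 2-D constant-velocity trajectories are purely illustrative; real trajectories could be joint angles or raw video frames.

```python
# Toy instantiation of the problem setup: expert data are state-only trajectories.
# Reuses the hypothetical InternalModel / train_internal_model / shaped_reward
# helpers sketched earlier; all data here are synthetic and illustrative.
import torch


def make_expert_trajectory(m=50, state_dim=2):
    """One expert trajectory S^i = {s_0^i, ..., s_m^i}: a point moving at constant velocity."""
    start = torch.rand(state_dim)
    velocity = 0.1 * torch.rand(state_dim)
    return torch.stack([start + t * velocity for t in range(m + 1)])  # (m+1, state_dim)


# tau = {S^0, ..., S^n}: a finite set of expert state trajectories (no actions, no rewards).
expert_tau = [make_expert_trajectory() for _ in range(20)]

# Fit the internal model on the state-only expert data.
model = InternalModel(state_dim=2)
train_internal_model(model, expert_tau)

# During RL, the unknown reward r(s, a) is replaced at every step by the
# similarity-based shaped reward between the predicted and the actual next state.
past_states = expert_tau[0][:10]        # states visited so far in the current episode
actual_next = expert_tau[0][10]         # state actually reached by the agent
print(shaped_reward(model, past_states, actual_next))  # near zero for expert-like states
```

Any of the standard RL algorithms mentioned above (DQN, DDPG, TRPO, PPO) could then be trained against this shaped reward in place of a hand-crafted one.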