Interactive Reinforcement Learning with Dynamic Reuse of Prior Knowledge from Human and Agent Demonstrations

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Zhaodong Wang and Matthew E. Taylor
School of EECS, Washington State University
{zhaodong.wang, [email protected]}

Abstract

Reinforcement learning has enjoyed multiple impressive successes in recent years. However, these successes typically require very large amounts of data before an agent achieves acceptable performance. This paper focuses on a novel way of combating such requirements by leveraging existing (human or agent) knowledge. In particular, this paper leverages demonstrations, allowing an agent to quickly achieve high performance. This paper introduces the Dynamic Reuse of Prior (DRoP) algorithm, which combines offline knowledge (demonstrations recorded before learning) with an online confidence-based performance analysis. DRoP leverages the demonstrator's knowledge by automatically balancing between reusing the prior knowledge and the current learned policy, allowing the agent to outperform the original demonstrations. We compare with multiple state-of-the-art learning algorithms and empirically show that DRoP can achieve superior performance in two domains. Additionally, we show that this confidence measure can be used to selectively request additional demonstrations, significantly improving the learning performance of the agent.

1 Introduction

There have been increasingly successful applications of reinforcement learning (RL) [Sutton and Barto, 1998] in both virtual agents and physical robots. However, RL often suffers from slow learning speeds in complex domains, which is particularly detrimental when initial performance is critical. External knowledge may be leveraged by RL agents to improve learning — demonstrations have been shown to be useful for many types of agents' learning [Schaal, 1997; Argall et al., 2009]. In contrast to many behavior cloning methods, which seek to mimic the demonstrated behavior, our goal is to leverage a demonstration to learn faster, and ultimately outperform the demonstrator.

Inverse reinforcement learning (IRL) [Ng et al., 2000] is an alternative to behavior cloning in which the agent aims to estimate the demonstrator's reward function and then optimize it. It is typically used in cases where no environmental reward is available, and it often requires the transition model.

To further improve over the demonstrations, one approach is the Human Agent Transfer (HAT) algorithm [Taylor et al., 2011], which formulates the problem as one of transfer learning [Taylor and Stone, 2009]: a source agent can demonstrate a policy and then a target agent can improve its performance over that policy. As a refinement, the Confidence Human Agent Transfer algorithm [Wang and Taylor, 2017] was proposed to leverage the confidence in a policy.

In order to leverage demonstrations to improve learning, four problems must be considered. First, the demonstration may be suboptimal, and the agent should aim to improve upon it. Second, if there are multiple demonstrators, their outputs must be combined in a way that handles any inconsistencies [Mao et al., 2018]. Third, the demonstration is rarely exhaustive, so some type of generalization must be used to handle unseen states. Fourth, the agent must balance the use of the prior knowledge against its own self-learned policy.

In this paper, we introduce DRoP (Dynamic Reuse of Prior) as an interactive method to assist RL by addressing the above problems. Prior research [Chernova and Veloso, 2007; Wang and Taylor, 2017] used offline confidence. In contrast, DRoP leverages temporal difference models to achieve online confidence-based performance measurement of the transferred knowledge for better domain adaptation. To guarantee convergence, we introduce an action selection method to help the target agent balance between following the demonstration and following its own learned knowledge. We empirically evaluate DRoP in the Cartpole and Mario domains, showing improvement over other state-of-the-art demonstration learning methods. Results also validate our claim that multiple experts' demonstrations can be leveraged simultaneously, and that DRoP is able to distinguish between high- and low-quality demonstrations automatically. Finally, we show that these confidence measures can be used to actively request additional demonstrations, significantly improving learning performance.

The main contributions of this paper are: 1) automatically balancing between an existing demonstration and a self-learned policy, 2) efficiently integrating demonstrations from multiple sources by distinguishing their knowledge quality, and 3) actively requesting demonstrations in low-confidence states.

2 Background

This section presents a selection of relevant background knowledge and techniques from recent research.

2.1 Reinforcement Learning

By interacting with an environment, an RL agent can learn a policy to maximize an external reward. A Markov decision process (MDP) is a common formulation of the RL problem. In an MDP, A is the set of actions an agent can take and S is the set of states. There are two (initially unknown) functions within this process: a transition function (T : S × A ↦ S) and a reward function (R : S × A ↦ ℝ). The goal of an RL agent is to maximize the expected reward — different RL algorithms have different ways of approaching this goal. This paper uses Q-learning [Watkins and Dayan, 1992] as the base RL algorithm:

    Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
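For concreteness, the update rule above can be implemented as a small tabular routine. The sketch below is only illustrative: it assumes a discrete environment object exposing reset(), step(), and a list of actions, and the hyperparameter values are placeholders rather than settings from the paper.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
        Q = defaultdict(float)  # Q[(state, action)] -> estimated return
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy choice over the agent's own Q-values.
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)  # assumed interface: (next_state, reward, done)
                # TD update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                best_next = max(Q[(s_next, act)] for act in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q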
2.2 Transfer Learning and Learning from Demonstration

The key idea of transfer learning is to leverage existing knowledge to improve a new agent's learning performance. Transfer learning has been applied in various domains, such as multitask learning [Kirkpatrick et al., 2017; Teh et al., 2017], deep reinforcement learning [Rusu et al., 2016; Parisotto et al., 2016; Higgins et al., 2017], and representation learning [Maurer et al., 2016; Luo et al., 2017]. In this section, we discuss knowledge transfer techniques that use demonstrations.

Probabilistic Policy Reuse [Fernández and Veloso, 2006] is one transfer learning approach. Like many other existing approaches, it assumes that the source and the target agents share the same internal representation, and optimal demonstrations are required. Existing policies can guide the learning direction, as shown elsewhere [Da Silva and Mackworth, 2010; Brys et al., 2017], but near-optimal policies may be impracticable due to the complexity of the learning task or the cost of a domain expert's time.

Imitation is a popular and fundamental approach that transfers the demonstrator's behavior by having an agent exactly follow the demonstrations. However, the learner's performance may be limited by the demonstrator. On top of imitation learning, Dagger [Ross et al., 2011] incorporates the demonstration trajectories into the states actually visited by the target agent: it works by collecting a dataset of states visited under the learner's current policy, labeled with the demonstrator's actions, and training the next policy on the aggregate of all collected data.

Human Agent Transfer (HAT) takes a novel step by integrating demonstrations with RL. The goal of HAT is to leverage a demonstration from a source human or source agent, and then improve the agent's performance with RL. Rule transfer [Taylor and Stone, 2007] is used in HAT to remove the requirement that source and target agents share the same internal representation, which is the novel element that allows knowledge transfer across different types of agents (e.g., from a human to an agent). The following steps summarize HAT:

1. Learn a policy (π : S ↦ A) from the source task.
2. Train a decision list upon the learned policy as "IF-ELSE" rules.
3. Guide the target agent's action with the trained rules under a decaying probability.

By fitting demonstration reuse into RL domains with reward distributions, HAT can better adapt the transferred policy to the target task compared to the pure classifier training of Dagger.
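Step 3's action selection can be sketched in a few lines. The sketch below assumes the trained decision list is available as a callable rules(state) and that the reuse probability is decayed after each episode; the function names and the geometric decay schedule are illustrative assumptions, not the paper's exact design.

    import random

    def hat_action(state, Q, actions, rules, reuse_prob):
        """HAT-style step 3 (sketch): with probability reuse_prob, follow the
        decision list trained on the demonstration; otherwise act greedily
        on the agent's own Q-values."""
        if random.random() < reuse_prob:
            return rules(state)  # action suggested by the learned "IF-ELSE" rules
        return max(actions, key=lambda a: Q[(state, a)])

    def decay_reuse_prob(reuse_prob, rate=0.999):
        """Illustrative decay schedule: shrink the reuse probability after each
        episode so control gradually shifts from the rules to the learned policy."""
        return reuse_prob * rate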
As an extension, Confidence Human Agent Transfer (CHAT) [Wang and Taylor, 2017] provides a method based on confidence — it leverages a source agent's or human's demonstration, together with a confidence measure, to improve learning performance. Instead of rule transfer, CHAT measures the confidence in the source demonstration. Such offline confidence is used to predict how reliable the transferred knowledge is. To assist RL, CHAT leverages the source demonstrations to suggest an action in the agent's current state, along with the calculated confidence. For example, CHAT can use a Gaussian distribution to predict an action from a demonstration with an offline probability. If the calculated confidence is higher than a pre-tuned confidence threshold, the agent considers the prior knowledge reliable and executes the suggested action.

To guarantee that the demonstration data will not harm the agent's learning convergence, all of the above methods use a similar solution: an artificial probability control that forces the agent to reuse the prior knowledge under a decaying probability curve.

3 Dynamic Reuse of Prior (DRoP)

This section introduces DRoP, a method to estimate the agent's confidence in different data sources over time. Section 3.1 discusses how we build the confidence-based policy.
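To make the contrast with DRoP's online confidence concrete, the sketch below illustrates the offline, threshold-based gate used by the CHAT-style approach described above: a model fit to the demonstration data (e.g., a Gaussian model) returns a suggested action together with a confidence value, and the suggestion is executed only when that confidence clears a pre-tuned threshold. The demo_model interface and the threshold value are illustrative assumptions rather than details from the paper.

    def chat_action(state, Q, actions, demo_model, conf_threshold=0.8):
        """CHAT-style selection (sketch): trust the demonstration model only when
        its offline confidence exceeds a pre-tuned threshold; otherwise fall back
        to the agent's own greedy policy."""
        suggested, confidence = demo_model(state)  # e.g., Gaussian model over demonstration data
        if confidence > conf_threshold:
            return suggested
        return max(actions, key=lambda a: Q[(state, a)])

DRoP, by contrast, replaces this fixed offline gate with an online, temporal-difference-based confidence measure and an action selection method that balances the demonstration against the agent's own learned knowledge.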
