Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans

A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans Bei Peng James MacGlashan Robert Loftin Washington State University Brown University North Carolina State [email protected] [email protected] University [email protected] Michael L. Littman David L. Roberts Matthew E. Taylor Brown University North Carolina State Washington State University [email protected] University [email protected] [email protected] ABSTRACT 1. INTRODUCTION As robots become pervasive in human environments, it is As robots become pervasive in human environments, it is important to enable users to effectively convey new skills important to enable users to effectively convey new skills without programming. Most existing work on Interactive without programming. Recent work has shown that rein- Reinforcement Learning focuses on interpreting and incorpo- forcement learning [18] (RL) agents can successfully speed rating non-expert human feedback to speed up learning; we up learning using human feedback [3, 9, 20]. However, learn- aim to design a better representation of the learning agent ing efficiently from human input brings significant challenges that is able to elicit more natural and effective communica- (e.g., delay, inconsistent input, and limited human patience) tion between the human trainer and the learner, while treat- relative to learning autonomously. ing human feedback as discrete communication that depends This work is inspired by the ways humans train dogs | probabilistically on the trainer's target policy. This work en- using rewards and punishments is a simple and familiar tails a user study where participants train a virtual agent to paradigm for humans to train intelligent agents. A human accomplish tasks by giving reward and/or punishment in a can easily correct the agent if it makes a mistake interpreting variety of simulated environments. We present results from a command, rather than stopping the agent and providing 60 participants to show how a learner can ground natural a new language example. In particular, we treat human language commands and adapt its action execution speed feedback as a discrete communication that depends prob- to learn more efficiently from human trainers. The agent's abilistically on the trainer's target policy. RL methods are action execution speed can be successfully modulated to en- most effective in environments where the numerically-valued courage more explicit feedback from a human trainer in ar- reward function contains all the information needed to learn eas of the state space where there is high uncertainty. Our the policy. Learning from human feedback is substantively results show that our novel adaptive speed agent dominates different. Trainers generally do not have numerical rewards different fixed speed agents on several measures of perfor- to give to learners | instead, they use only a small set of mance. Additionally, we investigate the impact of instruc- discrete feedback signals and they may give those signals in tions on user performance and user preference in training a number of different ways to implicitly communicate the conditions. target behavior. Thus, while standard RL algorithms can be used in this setting, they are not designed to leverage all Categories and Subject Descriptors of the available information. Our previous work [12] showed that considering human feedback to be a discrete communi- I.2.6 [Learning]: Miscellaneous cation that depends probabilistically on the trainer's target policy can be more effective for training an agent, rather General Terms than treating it as a numerical reward signal as in tradi- Languages, Performance, Human Factors, Experimentation tional approaches. Our Strategy-aware Bayesian Learning (SABL) model is able to learn from both explicit and im- Keywords plicit feedback from human trainers. While most existing work on Interactive Reinforcement Learning Sequential Decision Tasks from Humans; Human- Learning focuses on interpreting and incorporating human Agent Interaction; Crowdsourcing Experiments; Eliciting feedback to speed up learning, the design of agents is one Human Feedback; Variable Speed Agents neglected topic. We aim to design a better representation of the learning agent that is able to elicit more natural and Appears in: Proceedings of the 15th International Conference effective communication between the human trainer and the on Autonomous Agents and Multiagent Systems (AAMAS 2016), learner, treating human feedback as a discrete value (reward, J. Thangarajah, K. Tuyls, C. Jonker, S. Marsella (eds.), May 9{13, 2016, Singapore. punishment, or neutral). Considering that human trainers Copyright c 2016, International Foundation for Autonomous Agents and get tired and then give infrequent feedback to the agent [11], Multiagent Systems (www.ifaamas.org). All rights reserved. 957 trying to motivate people to provide more explicit feedback the impact of teacher guidance increases with state space to the agent when it is in more important states is current size. Tenorio-Gonzalez et al. [19] introduce a dynamic re- work. To enable the agent to make the most of limited ward shaping approach to learn from infrequent and incon- human input, higher amounts of explicit feedback should sistent human feedback where the user's feedback is con- be delivered in areas of the state space where the learner verted into a shaping reward, which is then used to modify has more uncertainty about how to act. Similarly, when the MDP reward to guide the agent's learning process. the agent is fairly confident in its action choice, humans can Work by Griffith et al. [3] and Ho et al. [4] are per- expend less effort providing feedback. haps the most similar to ours. Here, human feedback is To seek a more successful and effective learning interac- treated as discrete communication that depends probabilis- tion, the agent should aim to minimize the number of task tically on the trainer's target policy, rather than on some failures, maximize its use of explicit human feedback, and numerical reward, so that the trainer's feedback is making minimize the total time needed to learn to successfully com- a direct statement about the policy itself rather than in- plete the task. The agent can move slower when it is unsure, fluencing the policy through a reward. Specifically, In the giving human trainers more time to think about their strat- work of Ho et al. [4], instead of using the reward-maximizing egy and deliver feedback. Similarly, when the agent is fairly model based on standard reinforcement learning, an action- confident in its action choice, it can move faster and finish feedback model is shown to be followed by the majority of the task more quickly. Lastly, the number of actions needed people while teaching. However, the interaction between to learn a (near-)optimal policy should be minimized to re- learning mechanism and feedback pattern is not explored. duce any per-action penalties (e.g., efficiently exploring a In the current paper, human feedback (either explicitly re- state space while minimizing wear on a robot). Note that warding, explicitly punishing, or providing no feedback) af- minimizing the time and the number of actions is identical ter an agent's action is evidence for or against hypothesized when the agent acts with a fixed time per action | if the target tasks represented by MDP reward functions. At each action rate is variable, the number of actions executed can time step, the agent follows the optimal policy for the most accrue at a different rate from the wall clock time expended. likely task, which is computed using an \off-the-shelf" MDP Therefore, we are interested in designing an agent to op- planning algorithm. timize four metrics. We aim to 1) minimize the wall clock To better learn from human feedback, we must deeply un- time the agent spends learning, 2) minimize the number of derstand how do human teachers want to teach the agent. actions the agent takes while learning, 3) encourage the user Work by Knox et al. [8] examine how users want to pro- to give explicit feedback to the agent (especially when the vide feedback, finding that: 1) there is little difference in agent is unsure about its action choice), and 4) consistently a trainer's feedback whether they think that the agent can learn a (near-)optimal policy. learn or that they are critiquing a fixed performance; and 2) In this paper, we present our Learning Agents Model- humans can reduce the amount of feedback they give over ing Belief-Dependent Action Speeds (LAMBDAS) approach time, and having the learner purposefully select bad actions and results of a human-subjects experiment to show how a can encourage more feedback. Similarly, our work addresses learner can adapt its action execution speed to learn more the problem of increasing the trainer's feedback quantity efficiently from human trainers. We demonstrate that the and quality. However, it is the first to consider modulating agent's action execution speed can be successfully modu- the agent's action execution speed according to its confi- lated to strategically encourage more explicit feedback from dence level to elicit more explicit feedback. Our previous a human trainer in parts of the state space where the learner work [12] find that people choose different training strate- has more uncertainty about how to act. Our results show gies to teach the agent, and may use a lack of explicit feed- that our novel adaptive speed agent dominates different fixed back to communicate the same information as delivered by speed agents according to the four metrics discussed above. reward or punishment in both contextual bandit and sequential decision-making tasks. The choices of trainer strategies 2. BACKGROUND AND RELATED WORK are shown to be affected by several factors (e.g., the na- ture of the training task or the learning agent). Instead of This section discusses the research most related to this focusing on training strategies, we aim to seek a more suc- paper's contributions.

Load more