Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans

Total Page:16

File Type:pdf, Size:1020Kb

Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans A Need for Speed: Adapting Agent Action Speed to Improve Task Learning from Non-Expert Humans Bei Peng James MacGlashan Robert Loftin Washington State University Brown University North Carolina State [email protected] [email protected] University [email protected] Michael L. Littman David L. Roberts Matthew E. Taylor Brown University North Carolina State Washington State University [email protected] University [email protected] [email protected] ABSTRACT 1. INTRODUCTION As robots become pervasive in human environments, it is As robots become pervasive in human environments, it is important to enable users to effectively convey new skills important to enable users to effectively convey new skills without programming. Most existing work on Interactive without programming. Recent work has shown that rein- Reinforcement Learning focuses on interpreting and incorpo- forcement learning [18] (RL) agents can successfully speed rating non-expert human feedback to speed up learning; we up learning using human feedback [3, 9, 20]. However, learn- aim to design a better representation of the learning agent ing efficiently from human input brings significant challenges that is able to elicit more natural and effective communica- (e.g., delay, inconsistent input, and limited human patience) tion between the human trainer and the learner, while treat- relative to learning autonomously. ing human feedback as discrete communication that depends This work is inspired by the ways humans train dogs | probabilistically on the trainer's target policy. This work en- using rewards and punishments is a simple and familiar tails a user study where participants train a virtual agent to paradigm for humans to train intelligent agents. A human accomplish tasks by giving reward and/or punishment in a can easily correct the agent if it makes a mistake interpreting variety of simulated environments. We present results from a command, rather than stopping the agent and providing 60 participants to show how a learner can ground natural a new language example. In particular, we treat human language commands and adapt its action execution speed feedback as a discrete communication that depends prob- to learn more efficiently from human trainers. The agent's abilistically on the trainer's target policy. RL methods are action execution speed can be successfully modulated to en- most effective in environments where the numerically-valued courage more explicit feedback from a human trainer in ar- reward function contains all the information needed to learn eas of the state space where there is high uncertainty. Our the policy. Learning from human feedback is substantively results show that our novel adaptive speed agent dominates different. Trainers generally do not have numerical rewards different fixed speed agents on several measures of perfor- to give to learners | instead, they use only a small set of mance. Additionally, we investigate the impact of instruc- discrete feedback signals and they may give those signals in tions on user performance and user preference in training a number of different ways to implicitly communicate the conditions. target behavior. Thus, while standard RL algorithms can be used in this setting, they are not designed to leverage all Categories and Subject Descriptors of the available information. Our previous work [12] showed that considering human feedback to be a discrete communi- I.2.6 [Learning]: Miscellaneous cation that depends probabilistically on the trainer's target policy can be more effective for training an agent, rather General Terms than treating it as a numerical reward signal as in tradi- Languages, Performance, Human Factors, Experimentation tional approaches. Our Strategy-aware Bayesian Learning (SABL) model is able to learn from both explicit and im- Keywords plicit feedback from human trainers. While most existing work on Interactive Reinforcement Learning Sequential Decision Tasks from Humans; Human- Learning focuses on interpreting and incorporating human Agent Interaction; Crowdsourcing Experiments; Eliciting feedback to speed up learning, the design of agents is one Human Feedback; Variable Speed Agents neglected topic. We aim to design a better representation of the learning agent that is able to elicit more natural and Appears in: Proceedings of the 15th International Conference effective communication between the human trainer and the on Autonomous Agents and Multiagent Systems (AAMAS 2016), learner, treating human feedback as a discrete value (reward, J. Thangarajah, K. Tuyls, C. Jonker, S. Marsella (eds.), May 9{13, 2016, Singapore. punishment, or neutral). Considering that human trainers Copyright c 2016, International Foundation for Autonomous Agents and get tired and then give infrequent feedback to the agent [11], Multiagent Systems (www.ifaamas.org). All rights reserved. 957 trying to motivate people to provide more explicit feedback the impact of teacher guidance increases with state space to the agent when it is in more important states is current size. Tenorio-Gonzalez et al. [19] introduce a dynamic re- work. To enable the agent to make the most of limited ward shaping approach to learn from infrequent and incon- human input, higher amounts of explicit feedback should sistent human feedback where the user's feedback is con- be delivered in areas of the state space where the learner verted into a shaping reward, which is then used to modify has more uncertainty about how to act. Similarly, when the MDP reward to guide the agent's learning process. the agent is fairly confident in its action choice, humans can Work by Griffith et al. [3] and Ho et al. [4] are per- expend less effort providing feedback. haps the most similar to ours. Here, human feedback is To seek a more successful and effective learning interac- treated as discrete communication that depends probabilis- tion, the agent should aim to minimize the number of task tically on the trainer's target policy, rather than on some failures, maximize its use of explicit human feedback, and numerical reward, so that the trainer's feedback is making minimize the total time needed to learn to successfully com- a direct statement about the policy itself rather than in- plete the task. The agent can move slower when it is unsure, fluencing the policy through a reward. Specifically, In the giving human trainers more time to think about their strat- work of Ho et al. [4], instead of using the reward-maximizing egy and deliver feedback. Similarly, when the agent is fairly model based on standard reinforcement learning, an action- confident in its action choice, it can move faster and finish feedback model is shown to be followed by the majority of the task more quickly. Lastly, the number of actions needed people while teaching. However, the interaction between to learn a (near-)optimal policy should be minimized to re- learning mechanism and feedback pattern is not explored. duce any per-action penalties (e.g., efficiently exploring a In the current paper, human feedback (either explicitly re- state space while minimizing wear on a robot). Note that warding, explicitly punishing, or providing no feedback) af- minimizing the time and the number of actions is identical ter an agent's action is evidence for or against hypothesized when the agent acts with a fixed time per action | if the target tasks represented by MDP reward functions. At each action rate is variable, the number of actions executed can time step, the agent follows the optimal policy for the most accrue at a different rate from the wall clock time expended. likely task, which is computed using an \off-the-shelf" MDP Therefore, we are interested in designing an agent to op- planning algorithm. timize four metrics. We aim to 1) minimize the wall clock To better learn from human feedback, we must deeply un- time the agent spends learning, 2) minimize the number of derstand how do human teachers want to teach the agent. actions the agent takes while learning, 3) encourage the user Work by Knox et al. [8] examine how users want to pro- to give explicit feedback to the agent (especially when the vide feedback, finding that: 1) there is little difference in agent is unsure about its action choice), and 4) consistently a trainer's feedback whether they think that the agent can learn a (near-)optimal policy. learn or that they are critiquing a fixed performance; and 2) In this paper, we present our Learning Agents Model- humans can reduce the amount of feedback they give over ing Belief-Dependent Action Speeds (LAMBDAS) approach time, and having the learner purposefully select bad actions and results of a human-subjects experiment to show how a can encourage more feedback. Similarly, our work addresses learner can adapt its action execution speed to learn more the problem of increasing the trainer's feedback quantity efficiently from human trainers. We demonstrate that the and quality. However, it is the first to consider modulating agent's action execution speed can be successfully modu- the agent's action execution speed according to its confi- lated to strategically encourage more explicit feedback from dence level to elicit more explicit feedback. Our previous a human trainer in parts of the state space where the learner work [12] find that people choose different training strate- has more uncertainty about how to act. Our results show gies to teach the agent, and may use a lack of explicit feed- that our novel adaptive speed agent dominates different fixed back to communicate the same information as delivered by speed agents according to the four metrics discussed above. reward or punishment in both contextual bandit and sequen- tial decision-making tasks. The choices of trainer strategies 2. BACKGROUND AND RELATED WORK are shown to be affected by several factors (e.g., the na- ture of the training task or the learning agent). Instead of This section discusses the research most related to this focusing on training strategies, we aim to seek a more suc- paper's contributions.
Recommended publications
  • Rotational Motion (The Dynamics of a Rigid Body)
    University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Robert Katz Publications Research Papers in Physics and Astronomy 1-1958 Physics, Chapter 11: Rotational Motion (The Dynamics of a Rigid Body) Henry Semat City College of New York Robert Katz University of Nebraska-Lincoln, [email protected] Follow this and additional works at: https://digitalcommons.unl.edu/physicskatz Part of the Physics Commons Semat, Henry and Katz, Robert, "Physics, Chapter 11: Rotational Motion (The Dynamics of a Rigid Body)" (1958). Robert Katz Publications. 141. https://digitalcommons.unl.edu/physicskatz/141 This Article is brought to you for free and open access by the Research Papers in Physics and Astronomy at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Robert Katz Publications by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. 11 Rotational Motion (The Dynamics of a Rigid Body) 11-1 Motion about a Fixed Axis The motion of the flywheel of an engine and of a pulley on its axle are examples of an important type of motion of a rigid body, that of the motion of rotation about a fixed axis. Consider the motion of a uniform disk rotat­ ing about a fixed axis passing through its center of gravity C perpendicular to the face of the disk, as shown in Figure 11-1. The motion of this disk may be de­ scribed in terms of the motions of each of its individual particles, but a better way to describe the motion is in terms of the angle through which the disk rotates.
    [Show full text]
  • Chapter 5 the Relativistic Point Particle
    Chapter 5 The Relativistic Point Particle To formulate the dynamics of a system we can write either the equations of motion, or alternatively, an action. In the case of the relativistic point par- ticle, it is rather easy to write the equations of motion. But the action is so physical and geometrical that it is worth pursuing in its own right. More importantly, while it is difficult to guess the equations of motion for the rela- tivistic string, the action is a natural generalization of the relativistic particle action that we will study in this chapter. We conclude with a discussion of the charged relativistic particle. 5.1 Action for a relativistic point particle How can we find the action S that governs the dynamics of a free relativis- tic particle? To get started we first think about units. The action is the Lagrangian integrated over time, so the units of action are just the units of the Lagrangian multiplied by the units of time. The Lagrangian has units of energy, so the units of action are L2 ML2 [S]=M T = . (5.1.1) T 2 T Recall that the action Snr for a free non-relativistic particle is given by the time integral of the kinetic energy: 1 dx S = mv2(t) dt , v2 ≡ v · v, v = . (5.1.2) nr 2 dt 105 106 CHAPTER 5. THE RELATIVISTIC POINT PARTICLE The equation of motion following by Hamilton’s principle is dv =0. (5.1.3) dt The free particle moves with constant velocity and that is the end of the story.
    [Show full text]
  • Chapter 5 ANGULAR MOMENTUM and ROTATIONS
    Chapter 5 ANGULAR MOMENTUM AND ROTATIONS In classical mechanics the total angular momentum L~ of an isolated system about any …xed point is conserved. The existence of a conserved vector L~ associated with such a system is itself a consequence of the fact that the associated Hamiltonian (or Lagrangian) is invariant under rotations, i.e., if the coordinates and momenta of the entire system are rotated “rigidly” about some point, the energy of the system is unchanged and, more importantly, is the same function of the dynamical variables as it was before the rotation. Such a circumstance would not apply, e.g., to a system lying in an externally imposed gravitational …eld pointing in some speci…c direction. Thus, the invariance of an isolated system under rotations ultimately arises from the fact that, in the absence of external …elds of this sort, space is isotropic; it behaves the same way in all directions. Not surprisingly, therefore, in quantum mechanics the individual Cartesian com- ponents Li of the total angular momentum operator L~ of an isolated system are also constants of the motion. The di¤erent components of L~ are not, however, compatible quantum observables. Indeed, as we will see the operators representing the components of angular momentum along di¤erent directions do not generally commute with one an- other. Thus, the vector operator L~ is not, strictly speaking, an observable, since it does not have a complete basis of eigenstates (which would have to be simultaneous eigenstates of all of its non-commuting components). This lack of commutivity often seems, at …rst encounter, as somewhat of a nuisance but, in fact, it intimately re‡ects the underlying structure of the three dimensional space in which we are immersed, and has its source in the fact that rotations in three dimensions about di¤erent axes do not commute with one another.
    [Show full text]
  • SPEED MANAGEMENT ACTION PLAN Implementation Steps
    SPEED MANAGEMENT ACTION PLAN Implementation Steps Put Your Speed Management Plan into Action You did it! You recognized that speeding is a significant Figuring out how to implement a speed management plan safety problem in your jurisdiction, and you put a lot of time can be daunting, so the Federal Highway Administration and effort into developing a Speed Management Action (FHWA) has developed a set of steps that agency staff can Plan that holds great promise for reducing speeding-related adopt and tailor to get the ball rolling—but not speeding! crashes and fatalities. So…what’s next? Agencies can use these proven methods to jump start plan implementation and achieve success in reducing speed- related crashes. Involve Identify a Stakeholders Champion Prioritize Strategies for Set Goals, Track Implementation Market the Develop a Speed Progress, Plan Management Team Evaluate, and Celebrate Success Identify Strategy Leads INVOLVE STAKEHOLDERS In order for the plan to be successful, support and buy-in is • Outreach specialists. needed from every area of transportation: engineering (Federal, • Governor’s Highway Safety Office representatives. State, local, and Metropolitan Planning Organizations (MPO), • National Highway Transportation Safety Administration (NHTSA) Regional Office representatives. enforcement, education, and emergency medical services • Local/MPO/State Department of Transportation (DOT) (EMS). Notify and engage stakeholders who were instrumental representatives. in developing the plan and identify gaps in support. Potential • Police/enforcement representatives. stakeholders may include: • FHWA Division Safety Engineers. • Behavioral and infrastructure funding source representatives. • Agency Traffic Operations Engineers. • Judicial representatives. • Agency Safety Engineers. • EMS providers. • Agency Pedestrian/Bicycle Coordinators. • Educators. • Agency Pavement Design Engineers.
    [Show full text]
  • Newton's Third Law (Lecture 7) Example the Bouncing Ball You Can Move
    3rd Law Newton’s third law (lecture 7) • If object A exerts a force on object B, then object B exerts an equal force on object A For every action there is an in the opposite direction. equal and opposite reaction. B discuss collisions, impulse, A momentum and how airbags work B Æ A A Æ B Example The bouncing ball • What keeps the box on the table if gravity • Why does the ball is pulling it down? bounce? • The table exerts an • It exerts a downward equal and opposite force on ground force upward that • the ground exerts an balances the weight upward force on it of the box that makes it bounce • If the table was flimsy or the box really heavy, it would fall! Action/reaction forces always You can move the earth! act on different objects • The earth exerts a force on you • you exert an equal force on the earth • The resulting accelerations are • A man tries to get the donkey to pull the cart but not the same the donkey has the following argument: •F = - F on earth on you • Why should I even try? No matter how hard I •MEaE = myou ayou pull on the cart, the cart always pulls back with an equal force, so I can never move it. 1 Friction is essential to movement You can’t walk without friction The tires push back on the road and the road pushes the tires forward. If the road is icy, the friction force You push on backward on the ground and the between the tires and road is reduced.
    [Show full text]
  • Newton's Third Law of Motion
    ENERGY FUNDAMENTALS – LESSON PLAN 1.4 Newton’s Third Law of Motion This lesson is designed for 3rd – 5th grade students in a variety of school settings Public School (public, private, STEM schools, and home schools) in the seven states served by local System Teaching power companies and the Tennessee Valley Authority. Community groups (Scouts, 4- Standards Covered H, after school programs, and others) are encouraged to use it as well. This is one State lesson from a three-part series designed to give students an age-appropriate, Science Standards informed view of energy. As their understanding of energy grows, it will enable them to • AL 3.PS.4 3rd make informed decisions as good citizens or civic leaders. • AL 4.PS.4 4th rd • GA S3P1 3 • GA S3CS7 3rd This lesson plan is suitable for all types of educational settings. Each lesson can be rd adapted to meet a variety of class sizes, student skill levels, and time requirements. • GA S4P3 3 • KY PS.1 3rd rd • KY PS.2 3 Setting Lesson Plan Selections Recommended for Use • KY 3.PS2.1 3rd th Smaller class size, • The “Modeling” Section contains teaching content. • MS GLE 9.a 4 • NC 3.P.1.1 3rd higher student ability, • While in class, students can do “Guided Practice,” complete the • TN GLE 0307.10.1 3rd and /or longer class “Recommended Item(s)” and any additional guided practice items the teacher • TN GLE 0307.10.2 3rd length might select from “Other Resources.” • TN SPI 0307.11.1 3rd • NOTE: Some lesson plans do and some do not contain “Other Resources.” • TN SPI 0307.11.2 3rd • At home or on their own in class, students can do “Independent Practice,” • TN GLE 0407.10.1 4th th complete the “Recommended Item(s)” and any additional independent practice • TN GLE 0507.10.2 5 items the teacher selects from “Other Resources” (if provided in the plan).
    [Show full text]
  • Climate Solutions Acceleration Fund Winners Announced
    FOR IMMEDIATE RELEASE Contact: Amanda Belles, Communications & Marketing Manager (617) 221-9671; [email protected] ​ Climate Solutions Acceleration Fund Winners Announced Boston, Massachusetts (June 3, 2020) - Today, Second Nature - a Boston-based NGO who ​ ​ accelerates climate action in, and through, higher education - announced the colleges and universities that were awarded grant funding through the Second Nature Climate Solutions ​ Acceleration Fund (the Acceleration Fund). ​ The opportunity to apply for funding was first announced at the 2020 Higher Education Climate ​ Leadership Summit. The Acceleration Fund is dedicated to supporting innovative cross-sector ​ climate action activities driven by colleges and universities. Second Nature created the Acceleration Fund with generous support from Bloomberg Philanthropies, as part of a larger project to accelerate higher education’s leadership in cross-sector, place-based climate action. “Local leaders are at the forefront of climate action because they see the vast benefits to their communities, from cutting energy bills to protecting public health,” said Antha Williams, head ​ of environmental programs at Bloomberg Philanthropies. “It’s fantastic to see these ​ forward-thinking colleges and universities advance their bold climate solutions, ensuring continued progress in our fight against the climate crisis.” The institutions who were awarded funding are (additional information about each is further ​ below): ​ Agnes Scott College (Decatur, GA) Bard College (Annandale-on-Hudson,
    [Show full text]
  • A Focus on Action Speed Training
    A Focus on Action Speed Training by Klaus Feldmann / EHF Lecturer Recent analyses of major international events (ECh, WCh and Olympics) have been highlighting the fact that European top teams have further developed their execution of the game. This, however, has not involved highly obvious changes such as new tactical concepts and formations but, more importantly, minor steps including • more fast breaks in the game and efficient exploitation of opportunities arising from such breaks, • active, variable defence action with small, fast steps, frequently culminating in recovery of the ball as well as • a larger number of attacks in each game with shorter preparation times and faster execution. Underlying all these tendencies in game development is higher playing speed and faster action as exhibited by teams from Africa and Asia. This style of playing has more recently also been encouraged by the IHF through its rule revisions (e.g. fast throw-off and early warning for passive playing). Training activities taking into account these new developments will therefore include the following aspects in the future: • quicker perception and information processing, also on the part of those playing without the ball, • greater demands made on legwork to support flexible defence action, • speed-driven application of attack techniques and • improvements in switching from defence to attack and from attack to defence. From demands being made by the game ... As a first step in making our players faster it is helpful to take a close look at and analyse the demands being made on speed in the game. From this analysis, we can then develop the general components of a basic methodological formula for speed-driven training.
    [Show full text]
  • Newton's Third
    Standard—7.3.17: Investigate that an unbalanced force, acting on an object, changes its speed or path of motion or both, and know that if the force always acts toward the same center as the object moves, the object’s path may curve into an orbit around the center. Also covers: 7.2.6 (Detailed standards begin on page IN8.) Newton’s Third Law Action and Reaction Newton’s first two laws of motion explain how the motion of a single object changes. If the forces acting on the object are bal- anced, the object will remain at rest or stay in motion with con- I Identify the relationship stant velocity. If the forces are unbalanced, the object will between the forces that objects accelerate in the direction of the net force. Newton’s second law exert on each other. tells how to calculate the acceleration, or change in motion, of an object if the net force acting on it is known. Newton’s third law describes something else that happens Newton’s third law can explain how when one object exerts a force on another object. Suppose you birds fly and rockets move. push on a wall. It may surprise you to learn that if you push on Newton’s third Review Vocabulary a wall, the wall also pushes on you. According to force: a push or a pull law of motion , forces always act in equal but opposite pairs. Another way of saying this is for every action, there is an equal New Vocabulary but opposite reaction.
    [Show full text]
  • Solving Equations of Motion by Using Monte Carlo Metropolis: Novel Method Via Random Paths Sampling and the Maximum Caliber Principle
    entropy Article Solving Equations of Motion by Using Monte Carlo Metropolis: Novel Method Via Random Paths Sampling and the Maximum Caliber Principle Diego González Diaz 1,2,* , Sergio Davis 3 and Sergio Curilef 1 1 Departamento de Física, Universidad Católica del Norte, Casilla 1280, Antofagasta, Chile; [email protected] 2 Banco Itaú-Corpbanca, Casilla 80-D, Santiago, Chile 3 Comisión Chilena de Energía Nuclear, Casilla 188-D, Santiago, Chile; [email protected] * Correspondence: [email protected] Received: 9 July 2020; Accepted: 14 August 2020; Published: 21 August 2020 Abstract: A permanent challenge in physics and other disciplines is to solve Euler–Lagrange equations. Thereby, a beneficial investigation is to continue searching for new procedures to perform this task. A novel Monte Carlo Metropolis framework is presented for solving the equations of motion in Lagrangian systems. The implementation lies in sampling the path space with a probability functional obtained by using the maximum caliber principle. Free particle and harmonic oscillator problems are numerically implemented by sampling the path space for a given action by using Monte Carlo simulations. The average path converges to the solution of the equation of motion from classical mechanics, analogously as a canonical system is sampled for a given energy by computing the average state, finding the least energy state. Thus, this procedure can be general enough to solve other differential equations in physics and a useful tool to calculate the time-dependent properties of dynamical systems in order to understand the non-equilibrium behavior of statistical mechanical systems. Keywords: maximum caliber; Monte Carlo Metropolis; equation of motion; least action principle 1.
    [Show full text]
  • Chapter 19 Angular Momentum
    Chapter 19 Angular Momentum 19.1 Introduction ........................................................................................................... 1 19.2 Angular Momentum about a Point for a Particle .............................................. 2 19.2.1 Angular Momentum for a Point Particle ..................................................... 2 19.2.2 Right-Hand-Rule for the Direction of the Angular Momentum ............... 3 Example 19.1 Angular Momentum: Constant Velocity ........................................ 4 Example 19.2 Angular Momentum and Circular Motion ..................................... 5 Example 19.3 Angular Momentum About a Point along Central Axis for Circular Motion ........................................................................................................ 5 19.3 Torque and the Time Derivative of Angular Momentum about a Point for a Particle ........................................................................................................................... 8 19.4 Conservation of Angular Momentum about a Point ......................................... 9 Example 19.4 Meteor Flyby of Earth .................................................................... 10 19.5 Angular Impulse and Change in Angular Momentum ................................... 12 19.6 Angular Momentum of a System of Particles .................................................. 13 Example 19.5 Angular Momentum of Two Particles undergoing Circular Motion .....................................................................................................................
    [Show full text]
  • The Action, the Lagrangian and Hamilton's Principle
    The Action, the Lagrangian and Hamilton's principle Physics 6010, Fall 2010 The Action, the Lagrangian and Hamilton's principle Relevant Sections in Text: x1.4{1.6, x2.1{2.3 Variational Principles A great deal of what we shall do in this course hinges upon the fact that one can de- scribe a wide variety of dynamical systems using variational principles. You have probably already seen the Lagrangian and Hamiltonian ways of formulating mechanics; these stem from variational principles. We shall make a great deal of fuss about these variational for- mulations in this course. I will not try to completely motivate the use of these formalisms at this early point in the course since the motives are so many and so varied; you will see the utility of the formalism as we proceed. Still, it is worth commenting a little here on why these principles arise and are so useful. The appearance of variational principles in classical mechanics can in fact be traced back to basic properties of quantum mechanics. This is most easily seen using Feynman's path integral formalism. For simplicity, let us just think about the dynamics of a point particle in 3-d. Very roughly speaking, in the path integral formalism one sees that the probability amplitude for a particle to move from one place, r1, to another, r2, is given by adding up the probability amplitudes for all possible paths connecting these two positions (not just the classically allowed trajectory). The amplitude for a given path r(t) is of the i S[r] form e h¯ , where S[r] is the action functional for the trajectory.
    [Show full text]