
A Comparison of Action Selection Learning Methods


From: Proceedings of the Third International Conference on Multistrategy Learning. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.


Diana F. Gordon
Naval Research Laboratory, Code 5510
4555 Overlook Avenue, S.W.
Washington, D.C. 20375
[email protected]

Devika Subramanian
Department of Computer Science
Rice University
Houston, TX 77005
[email protected]

Abstract

Our goal is to develop a hybrid cognitive model of how humans acquire skills on complex cognitive tasks. We are pursuing this goal by designing hybrid computational architectures for the NRL Navigation task, which requires competent sensorimotor coordination. In this paper, we empirically compare two methods for control knowledge acquisition (reinforcement learning and a novel variant of action models), as well as a hybrid of these methods, with human learning on this task. Our results indicate that the performance of our action models approach more closely approximates the rate of human learning on the task than does reinforcement learning or the hybrid. We also experimentally explore the impact of background knowledge on system performance. By adding knowledge used by the action models system to the benchmark reinforcement learner, we elevate its performance above that of the action models system.

Introduction

Our goal is to develop a hybrid cognitive model of how humans acquire skills by explicit instruction and repeated practice on complex cognitive tasks. We are pursuing this goal by designing hybrid (multistrategy) computational architectures for the NRL Navigation task, which requires sensorimotor coordination skill. In this paper, we develop a novel method based on parametric action models for actively learning visual-motor coordination. Although similar to previous work on action models, our method is novel because it capitalizes on available background knowledge regarding sensor relevance. We have confirmed the existence and use of such knowledge with extensive verbal protocol data collected from human subjects. In our action models approach, the agent actively interacts with its environment by gathering execution traces (time-indexed streams of visual inputs and motor outputs) and by learning a compact representation of an effective policy for action choice guided by the action model.

This paper begins by describing the NRL Navigation task, as well as the types of data collected from human subjects performing the task. Then, two learning methods are described: our model-based method and a benchmark algorithm that does not have an explicit model. Prior results reported in the literature of empirical comparisons of action models versus reinforcement learning are mixed (Lin, 1992; Mahadevan, 1992); they do not clearly indicate that one method is superior. Here we compare these two methods empirically on the Navigation task using a large collection of execution traces. Our primary goal in this comparison is to determine which performs more like human learning on this task. Both methods include sensor relevance knowledge from the verbal protocols. The results of this empirical comparison indicate that our action models method more closely approximates the time-scales and trends in human learning behavior on this task. Nevertheless, neither algorithm performs as well as the human subject. We next explore a multistrategy variant that combines the two methods for the purpose of better approximating the human learning, and present empirical results with this method. Although the multistrategy approach is unsuccessful, an alternative is highly successful. This alternative consists of modifying the architecture of the reinforcement learner to incorporate knowledge used by the action models method.

The NRL Navigation and Mine Avoidance Domain

The NRL navigation and mine avoidance domain, developed by Alan Schultz at the Naval Research Laboratory and hereafter abbreviated the "Navigation task," is a simulation that can be run either by humans through a graphical interface, or by an automated agent. The task involves learning to navigate through obstacles in a two-dimensional world. A single agent controls an autonomous underwater vehicle (AUV) that has to avoid mines and rendezvous with a stationary target before exhausting its fuel. The mines may be stationary, drifting, or seeking. Time is divided into episodes. An episode begins with the agent on one side of the mine field, the target placed randomly on the other side of the mine field, and random mine locations within a bounded region.


An episode ends with one of three possible outcomes: the agent reaches the goal (success), hits a mine (failure), or exhausts its fuel (failure). Reinforcement, in the form of a binary reward dependent on the outcome, is received at the end of each episode. An episode is further subdivided into decision cycles corresponding to actions (decisions) taken by the agent.

The agent has a limited capacity to observe the world it is in; in particular, it obtains information about its proximal environs through a set of seven consecutive sonar segments that give it a 90 degree forward field of view for a short distance. Obstacles in the field of view cause a reduction in sonar segment length (segment length is proportional to obstacle distance); one mine may appear in multiple segments. The agent also has a range sensor that provides the current distance to the target, a bearing sensor (in clock notation) that indicates the direction in which the target lies, and a time sensor that measures the remaining fuel. A human subject performing this task sees visual gauges corresponding to each of these sensors. The turn and speed actions are controlled by joystick motions. The turn and speed chosen on the previous decision cycle are additionally available to the agent. Given its delayed reward structure and the fact that the world is presented to the agent via sensors that are inadequate to guarantee correct identification of the current state, the Navigation world is a partially observable Markov decision process (POMDP).

An example of a few snapshots from an execution trace (with only a subset of the sensors shown) is the following:

time  range  bearing  sonar1  turn  speed
4     1000   1        220     32    20
5     1000   12       220     -32   20
6     1000   11       220     0     20
7     1000   11       90      0     20

A trace file records the binary success/failure for each episode.
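For concreteness, one decision cycle of such a trace could be represented as sketched below. This is only an illustration; the class and field names (TraceStep, Episode, and so on) are our own placeholders and are not part of the original simulation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TraceStep:
    """One decision cycle from an execution trace (illustrative field names)."""
    time: int          # remaining fuel (time sensor)
    range: int         # range sensor: distance to the target
    bearing: int       # target direction in clock notation (assumed: 12 = straight ahead)
    sonar: List[int]   # seven sonar segment lengths (e.g., a clear segment reads 220)
    turn: int          # turn chosen on this cycle
    speed: int         # speed chosen on this cycle

@dataclass
class Episode:
    steps: List[TraceStep]
    success: bool      # the binary outcome recorded in the trace file
```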
Data from Human Subjects

In the experiments with humans, seven subjects were used, and each ran for two or three 45-minute sessions with the simulation. We instrumented[1] the simulation to gather execution traces for subsequent analysis (Gordon et al., 1994). We also obtained verbal protocols by recording subject utterances during play and by collecting answers to questions posed at the end of the individual sessions.

Footnote 1: Note that although human subjects use a joystick for actions, we do not model the joystick but instead model actions at the level of discrete turns and speeds (e.g., turn 32 degrees to the left at speed 20). Human joystick motions are ultimately translated to these turn and speed values before being passed to the simulated task. Likewise, the learning agents we construct do not "see" gauges but instead get the numeric sensor values directly from the simulation (e.g., range is 500).

Methods for Modeling Action Selection Learning

Our goal is to build a model that most closely duplicates the human subject data in learning performance. In particular, subjects become proficient at this task (assuming no noise in the sensors and only 25 mines) after only a few episodes. Modeling such an extremely rapid learning rate presents a challenge. In developing our learning methods, we have drawn from both the machine learning and the cognitive science literature. By far the most widely used machine learning method for tasks like ours is reinforcement learning. Reinforcement learning is mathematically sufficient for learning policies for our task, yet has no explicit world model. More common in the cognitive science literature are action models, e.g., (Arbib, 1972), which require building explicit representations of the dynamics of the world to choose actions.

Reinforcement learning

Reinforcement learning has been studied extensively in the psychological literature, e.g., (Skinner, 1984), and has recently become very popular in the machine learning literature, e.g., (Sutton, 1988; Lin, 1992; Gordon & Subramanian, 1993). Rather than using only the difference between the prediction and the true reward for the error, as in traditional supervised learning, (temporal difference) reinforcement learning methods use the difference between successive predictions for errors to improve the learning. Reinforcement learning provides a method for modeling the acquisition of the policy function:

F : sensors → actions

Currently, the most popular type of reinforcement learning is q-learning, developed by Watkins, which is based on ideas from temporal difference learning, as well as conventional dynamic programming (Watkins, 1989). It requires estimating the q-value of a sensor configuration s, i.e., q(s, a) is a prediction of the utility of taking action a in a world state represented by s. The q-values are updated during learning based on minimizing a temporal difference error. Action choice is typically stochastic, where a higher q-value implies a higher probability that action will be chosen in that state.

While q-learning with explicit state representations addresses the temporal credit assignment problem, it is standard practice to use input generalization and neural networks to also address the structural credit assignment problem, e.g., (Lin, 1992).

The q-value output node of the control neural network corresponding to the chosen action a is given an error that reflects the difference between the current prediction of the utility, q(s1, a_i), and a better estimate of the utility (using the reward) of what this prediction should be:

error_i = (r + γ max{q(s2, k) | k ∈ A}) - q(s1, a_i)    if a_i = a
error_i = 0                                             otherwise

where r is the reward, A is the set of available actions, a is the chosen action, s2 is the state achieved by performing action a in state s1, i indexes the possible actions, and 0 < γ < 1 is a discount factor that controls the learning rate. This error is used to update the neural network weights using standard backpropagation. The result is improved q-values at the output nodes.

We selected q-learning as a benchmark algorithm with which to compare because the literature reports a wide range of successes with this algorithm, including on tasks with aspects similar to the NRL Navigation task, e.g., see (Lin, 1992). Our implementation uses standard q-learning with neural networks.[2] One network corresponds to each action (i.e., there are three turn networks corresponding to turn left, turn right, and go straight; speed is fixed at a level frequently found in the human execution traces, i.e., 20/40). Each turn network has one input node for every one of the 12 sensor inputs (e.g., one for bearing, one for each sonar segment, etc.), one hidden layer consisting of 10 hidden units, and a single output node corresponding to the q-value for that action. A Boltzmann distribution is used to stochastically make the final turn choice:

probability(a|s) = e^{q(s,a)/T} / Σ_i e^{q(s,a_i)/T}    (1)

where s is a state and the temperature T controls the degree of randomness of action choice.

Footnote 2: We ran initial experiments to try to optimize the reinforcement learning parameters. For the neural networks, the chosen learning rate is 0.5, momentum 0.1, 10 hidden units, 10 training iterations, and a discount factor of 0.9.
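The sketch below shows one way the Boltzmann turn selection of Equation 1 and the temporal difference error above fit together. It is a minimal illustration, not the actual implementation: the q-networks are abstracted as callables mapping a sensor vector to a scalar q-value, and the function and variable names are our own.

```python
import math
import random
from typing import Callable, Dict, Sequence

TURNS = ["left", "right", "straight"]  # the three turn networks described above
QNet = Callable[[Sequence[float]], float]

def boltzmann_choice(q_nets: Dict[str, QNet], sensors: Sequence[float], temperature: float) -> str:
    """Stochastic turn choice per Equation 1: p(a|s) = e^{q(s,a)/T} / sum_i e^{q(s,a_i)/T}."""
    weights = [math.exp(q_nets[a](sensors) / temperature) for a in TURNS]
    return random.choices(TURNS, weights=weights, k=1)[0]

def td_errors(q_nets: Dict[str, QNet], s1, chosen: str, reward: float, s2, gamma: float = 0.9):
    """Error for each action's output node; only the chosen action's node gets a non-zero error."""
    target = reward + gamma * max(q_nets[a](s2) for a in TURNS)
    return {a: (target - q_nets[a](s1)) if a == chosen else 0.0 for a in TURNS}
```

In the system described above each q-network is a backpropagation-trained network with 12 inputs and 10 hidden units; here the networks are left abstract.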
We use a reward r composed of a weighted sum of the sensor values.[3] Our reward models a special type of sensor relevance information derived from the human subject data we collected: it represents knowledge of the relative importance of the various sensory inputs in determining action. The verbal protocols reveal that the sonar and bearing sensors appear to be critical for action selection. This is logical: after all, the sonar shows mines which you need to avoid, and the bearing tells you whether you are navigating toward or away from the target. We have implemented a reward function that weights the bearing and sonar equally and gives 0 weight to the other sensors. Thus, if the bearing shows the target straight ahead and the sonar segments show no obstacles, then the reward is highest. Our subjects appeared to learn relevance knowledge and action selection knowledge simultaneously. Here, we assume the relevance is known. Future work will involve methods for acquiring relevance knowledge.

Footnote 3: Ron Sun suggested a reward of sensor values for this task (personal communication). Our choice of sensor weights for the reward is 30 for bearing and 10 for each of the seven sonar segments, and the scale for the reward is between -1.0 and 0.
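As a concrete illustration of this weighted-sum reward, here is a minimal sketch using the weights from Footnote 3 (30 for bearing, 10 per sonar segment, rescaled into [-1.0, 0]). The normalization of each sensor into a "goodness" score and the clock arithmetic are our own assumptions; the paper does not spell these details out.

```python
SONAR_MAX = 220        # maximum (clear) sonar segment length seen in the traces
BEARING_WEIGHT = 30    # weights from Footnote 3
SONAR_WEIGHT = 10      # per sonar segment (seven segments)

def bearing_goodness(bearing: int) -> float:
    """1.0 when the target is straight ahead (assumed 12 o'clock), 0.0 at 6 o'clock."""
    hours_off = min((bearing - 12) % 12, (12 - bearing) % 12)
    return 1.0 - hours_off / 6.0

def reward(bearing: int, sonar: list) -> float:
    """Weighted sum of sensor 'goodness', rescaled into [-1.0, 0]; 0 is best."""
    total = BEARING_WEIGHT + SONAR_WEIGHT * len(sonar)
    score = BEARING_WEIGHT * bearing_goodness(bearing)
    score += sum(SONAR_WEIGHT * (s / SONAR_MAX) for s in sonar)
    return score / total - 1.0
```

With this scaling, the reward reaches its maximum of 0 exactly when the target is straight ahead and all seven sonar segments are clear, matching the description above.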

The verbal protocols also indicate heuristics for focusing attention on different sensors at different times. This knowledge is implemented in our novel variant of action models, described next. Nevertheless, it is not implemented in the q-learner because to do so would require a departure from the standard q-learning architecture reported in the literature, with which we wish to compare initially as a benchmark. Later, we describe modifications to the q-learner to include this focusing knowledge.

Learning action models

One of the more striking aspects of the verbal protocols we collected was that subjects exhibited a tendency to build internal models of actions and their consequences, i.e., forward models of the world. These expectations produced surprise, disappointment, or positive reinforcement, depending on whether or not the predictions matched the actual results of performing the action. For example, one subject had an expectation of the results of a certain joystick motion: "Why am I turning to the left when I don't feel like I am moving the joystick much to the left?" Another expressed surprise: "It feels strange to hit the target when the bearing is not directly ahead." Yet a third subject developed a specific model of the consequences of his movements: "One small movement right or left seems to jump you over one box to the right or left," where each box refers to a visual depiction of a single sonar segment in the graphical interface.

Action models (i.e., forward models) have appeared in multidisciplinary sources in the literature. Arbib (1972) and Drescher (1991) provide examples in the psychological literature, STRIPS (Nilsson, 1980) is a classic example in the AI literature, and Sutton uses them in DYNA (Sutton, 1990). The learning of action models has been studied in the neural networks (Moore, 1992), machine learning (Sutton, 1990; Mahadevan, 1992), and cognitive science (Munro, 1987; Jordan & Rumelhart, 1992) communities.

Our algorithm uses two functions:

M : sensors × actions → sensors
P : sensors → utilities

M is an action model, which our method represents as a decision tree. The decision trees are learned using Quinlan's C4.5 system (Quinlan, 1986).[4] P rates the desirability of various sensor configurations. P embodies background (relevance) knowledge about the task. For sonars, high utilities are associated with large values (no or distant mines), and for the bearing sensor high utilities are associated with values closer to the target being straight ahead. Currently, P is supplied by us. At each time step, actions are selected using P and M by performing a 1-step lookahead with model M and rating the resulting sensory configurations using P. The action models algorithm has the same action set as the q-learning algorithm, i.e., turn right, turn left, or go straight at a fixed speed (20/40).

Footnote 4: We are not claiming humans use decision trees for action models; however, we use this implementation because it appears to have a computational speed that is needed for modeling human learning. We are also investigating connectionist models as in Jordan & Rumelhart (1992). Currently, C4.5 learning is in batch. To more faithfully model human learning, we are planning to use an incremental version of decision tree learning in future implementations.

First, our algorithm goes through a training phase, during which random turns are taken and the execution traces are saved as input for C4.5. C4.5 models the learning of the function M. In particular, it constructs two decision trees from the data: one tree to predict (from (s, a)) the next composite value of the sonar segments (prediction choices are no-mines, mine-far, mine-mid, mine-fairly-close, or mine-close, where these nominal values are translations from the numeric sonar readings) and one tree to predict the bearing on the next time step. Note that the choice of these two trees employs the same relevance information used in the reinforcement learning reward function, namely, that the sonar and bearing are the relevant sensors. The training phase concludes after C4.5 constructs these two decision trees.

During the testing phase, these trees representing the world dynamics (M) are consulted to make predictions and select turns. Given the current state, a tree is chosen. The tree selection heuristic for focus of attention (hereafter called the focus heuristic) states: if all the sonar segments are below a certain empirically determined threshold (150/220), the sonar prediction tree selects the next turn. Otherwise, the bearing prediction tree selects the next turn. To make a prediction, the agent feeds the current sensor readings (which include the last turn and speed) and a candidate next turn to the decision tree, and the tree returns the predicted sonar or bearing value. The agent chooses the next turn which maximizes P.[5]

Footnote 5: If the next turn is considered irrelevant by the decision tree, a random action choice is made.
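The testing-phase action selection just described can be sketched as follows. The tree and utility interfaces are simplified stand-ins for the learned C4.5 trees and the hand-supplied function P; none of the names below come from the original implementation.

```python
import random

TURNS = ["left", "right", "straight"]
SONAR_THRESHOLD = 150   # focus heuristic threshold (out of a 220 maximum)

def select_turn(state, sonar_tree, bearing_tree, p_sonar, p_bearing):
    """1-step lookahead with the learned trees, arbitrated by the focus heuristic.

    sonar_tree / bearing_tree: callables (state, candidate_turn) -> predicted value,
    standing in for the C4.5 decision trees.  p_sonar / p_bearing: utility ratings
    (higher is better), standing in for the hand-supplied function P.
    """
    if all(seg < SONAR_THRESHOLD for seg in state["sonar"]):
        tree, rate = sonar_tree, p_sonar       # mines close by: consult the sonar tree
    else:
        tree, rate = bearing_tree, p_bearing   # path ahead is clear: consult the bearing tree
    scored = [(rate(tree(state, turn)), turn) for turn in TURNS]
    best = max(score for score, _ in scored)
    # Ties are broken randomly here; the paper also falls back to a random
    # choice when the tree deems the candidate turn irrelevant (Footnote 5).
    return random.choice([turn for score, turn in scored if score == best])
```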
It is unlikely that humans recompute the consequences of actions when the current state is similar to one seen in the past. Therefore, our future work will address memorizing cases of successful action model use so that memory can be invoked, rather than the trees, for some predictions.

Empirical Comparison of the Two Methods

To make our comparisons fair, we include a training phase for the reinforcement learner with the Boltzmann temperature at 0.5, which results in random actions.[6] A testing phase follows in which the turn with the best q-value is selected deterministically at each time step. In summary, the reinforcement learner takes random actions and learns its q-values during training.[7] It uses these learned q-values for action selection during testing. The action models method takes the same random actions as the q-learner during training (i.e., it experiences exactly the same sensor and action training data as the q-learner), and then from the execution trace training data it learns decision tree action models. The focus heuristic uses the learned trees for action selection during testing. Neither of the two methods learns during testing. Both methods have the same knowledge regarding which sensors are relevant, i.e., the bearing and sonar sensors.

Footnote 6: We also tried an annealing schedule but performance did not improve.

Footnote 7: Arbib (1972) provides convincing cognitive justification for the role of random exploration of actions in the acquisition of motor skill.

In our experimental tests of all hypotheses, the training phase length is varied methodically at 25, 50, 75, and 100 episodes. The testing phase remains fixed at 400 episodes.[8] Each episode can last a maximum of 200 time steps, i.e., decision cycles. In all experiments, the number of mines is fixed at 25, there is a small amount of mine drift, and no sensor noise. These task parameter settings match exactly those used for the human subject whose learning we wish to model.[9]

Footnote 8: We experimented with the number of episodes and chose a setting where performance improvement leveled off for both algorithms.

Footnote 9: Both algorithms go straight (0 turn) for the first three time steps of every episode during training. This not only matches performance we observed in the execution traces from human subjects, but also aids the learning process by quickly moving the AUV into the mine field.

Recall that the outcome of every episode is binary (success/failure) and that success implies the AUV avoids all the mines and reaches the target location. The performance measure we use is the percentage of test episodes in which the AUV succeeds. This information is obtained from the trace file. Performance is averaged over 10 experiments because the algorithms are stochastic during training, and testing results depend upon the data seen during training. Graphs show mean performance, and error bars denote the standard deviation.
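The train/test protocol described in this section amounts to the loop sketched below; make_learner and run_episode are placeholders for the learner constructors and the simulation, and are not part of the original code.

```python
import statistics

def evaluate(make_learner, run_episode, training_lengths=(25, 50, 75, 100),
             test_episodes=400, runs=10):
    """Mean and standard deviation of test success rate for each training length.

    run_episode(learner, explore) is assumed to run one episode and return 1 on
    success, 0 on failure (hitting a mine or exhausting fuel).
    """
    results = {}
    for n_train in training_lengths:
        rates = []
        for _ in range(runs):                         # 10 stochastic repetitions
            learner = make_learner()
            for _ in range(n_train):                  # random-action training phase
                run_episode(learner, explore=True)
            wins = sum(run_episode(learner, explore=False)   # deterministic testing
                       for _ in range(test_episodes))
            rates.append(100.0 * wins / test_episodes)
        results[n_train] = (statistics.mean(rates), statistics.stdev(rates))
    return results
```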


Figure 1: The graph represents the learning curves for A, Q, and the human subject (% success on test data vs. training episodes).

Figure 2: The graph represents the learning curves for A_rel, Q_rel, and the human subject (% success on test data vs. training episodes).

We denote the q-learning scheme described above as Q and the action model scheme with decision trees described above as A. Our goal is to find an algorithm that most closely approximates the human learning. We start with the basic algorithms (Q and A), then make changes as needed to more closely approximate the human learning.

We begin by empirically testing the following hypothesis:

Hypothesis 1: The slope of A's learning curve is closer than Q's to the slope of the human learning curve, for the Navigation task.

The justification for Hypothesis 1 is that our action models method uses an action choice policy, including a focus heuristic, specially designed to capitalize on sensor relevance knowledge.

To test Hypothesis 1, we used data from a typical subject (the variance between subjects was surprisingly low) for a single 45-minute session. Note that we cannot divide the human learning into a training phase and a testing phase during which the human stops learning. Therefore, we have averaged performance over a sliding window of 10 previous episodes. We considered averaging performance over multiple subjects, but that would entail significant information loss.

Figure 1 shows the results of testing Hypothesis 1. A outperforms Q at a statistically significant level (using a paired, two-tailed t-test with α = 0.05). Thus, Hypothesis 1 is confirmed.[10] Apparently, our novel method for coupling action models with an action choice policy that exploits sensor relevance has tremendous value for this task.

Footnote 10: It is unclear why the performance of the q-learner drops slightly with more training episodes, though perhaps overfitting explains this.

Among the most plausible explanations for the power of our action models approach over the q-learner for this task are: (1) A's focus of attention heuristic, (2) use of an action model per se, and (3) the decision tree representation, e.g., Chapman and Kaelbling (1991) discuss how decision trees can improve over neural networks with backpropagation for reinforcement learning.

Although A performs more like the human than Q, the most pressing question is why neither performs as well as the human. We next add more knowledge in order to gain a better approximation of the curve of the human learner. A more careful examination of the verbal protocols indicates that subjects not only attended to the sonar sensor, but some subjects mentioned focusing almost exclusively on the middle three sonar segments. This makes sense if you consider that the middle three sonar segments show mines straight ahead, which is critical information for avoiding immediate collisions. We have altered Q's reward function to weight the bearing equally to the middle three sonar segments and to give all other sensors zero weight. Thus, if the bearing shows the target straight ahead and the middle three sonar segments show no obstacles ahead, then the reward is highest. A was also modified to include this more specific relevance knowledge, i.e., the modified version only predicts the values of the middle three sonar segments rather than all seven, and the focus heuristic checks the threshold for only those three segments. We denote the modified q-learning scheme Q_rel and the modified action model scheme A_rel.

We empirically test the following hypothesis:

Hypothesis 2: The slope of A_rel's learning curve (respectively, Q_rel's) is higher than that of A (respectively, Q), for the Navigation task.

The justification for Hypothesis 2 is that the addition of relevance knowledge should improve performance.

Figure 2 shows the results of testing Hypothesis 2. If we compare Figure 2 with Figure 1, we see that Q_rel performs slightly better than Q. The performance difference is not statistically significant (α = 0.10) for training lengths of 25, 50, and 100, but is significant (α = 0.10) with a training length of 75. The surprise is that A_rel performs worse than A. The differences are statistically significant at α = 0.05 for training lengths of 75 and 100, but only at α = 0.20 for training lengths of 25 and 50. Our results refute Hypothesis 2. (Although these results refute Hypothesis 2, they provide further confirmation of Hypothesis 1, because the performance improvement of A_rel over that of Q_rel is statistically significant with α = 0.05.)
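The significance claims in this section come from paired, two-tailed t-tests over matched runs. A minimal sketch of such a comparison, assuming the ten per-run success percentages for each method are already available as lists, could look like this:

```python
from scipy import stats

def compare(rates_a, rates_b, alpha=0.05):
    """Paired, two-tailed t-test on matched per-run success rates (e.g., 10 runs each)."""
    t_stat, p_value = stats.ttest_rel(rates_a, rates_b)
    return p_value < alpha, t_stat, p_value

# Hypothetical per-run success percentages for one training length:
# significant, t, p = compare([52.0, 55.5, 48.2, ...], [21.0, 24.5, 19.8, ...])
```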

Apparently, adding this relevance knowledge does not help A, which is the better performer. We conjecture the reason is that although the middle three sonar segments may be the most relevant, all the sonar segments are in fact relevant for action choice. For example, if there is a mine ahead as well as one to the left, then the best action to take to avoid the mines is to turn right. With knowledge only of what is straight ahead, turning right or left would seem equally valid. The subjects were most likely using this knowledge, but their verbal recollections were probably incomplete accounts of the full knowledge that they actually used for decision-making.

A Multistrategy Approach

The addition of knowledge about which sensors are relevant does not improve our algorithms. Therefore, we are still left with the question: why are both methods slower learners than the human?[11] While Q in principle could match A's performance, the search that it embodies has to be rescued from the performance plateau we find in Figures 1 and 2 in order to match human learning rates. Q possesses the ability to implicitly acquire the function P as well as the focus heuristic from task interaction. Therefore we examine the question: is it possible to combine algorithms A and Q to create a hybrid method that outperforms both, and which more closely matches the human learning curve on this task?

Footnote 11: It may also be the case that their asymptotic performance is lower than that of humans, although A does show signs of a steady increase and the slope is positive even at 100 training episodes.

We are in the preliminary stages of investigating this question. The idea is to use A's superior performance after the same level of training to get Q out of its performance plateau, and then use Q to refine the knowledge it acquired from A's focus heuristic and sensor evaluation function P. We have implemented the first part of our hybrid scheme in the algorithm MSL, described below.

MSL performs a random walk in the space of actions for the first 25 episodes, i.e., it experiences exactly the same training data as A and Q for the first 25 episodes. The A component of MSL then builds the decision trees. Next, MSL uses the focus heuristic with the learned decision trees to select actions from episodes 25 to 100. Concurrently, action choices and rewards are used to refine the q-values for Q. Since we have empirically established that the action choices of A are significantly better than those of Q after 25 episodes, we expect algorithm Q to find itself in a better region of the policy space during learning episodes 25 to 100. After the 100th episode, we turn off learning and test MSL for 400 episodes. In the testing phase, MSL chooses actions based on the learned q-values alone. The results of this experiment (shown below) are quite surprising. The performance of MSL turns out to be not statistically significantly different from that of Q (α = 0.20). This experiment highlights some of the difficulties in constructing hybrid algorithms for complex tasks.

Algorithm   Training Length   Mean    Std
Q           100               24.92   18.62
MSL         100               27.07   8.56
A           25                53.50   15.81
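Under this description, MSL's schedule interleaves the two learners as sketched below. The learner objects, their methods, and the run_episode interface are placeholders chosen for illustration, not the actual implementation.

```python
def train_and_test_msl(q_learner, action_model, run_episode,
                       random_phase=25, guided_phase_end=100, test_episodes=400):
    """Hybrid MSL schedule: random walk, then tree-guided acting while Q keeps learning.

    run_episode(policy, q_update) is a placeholder that runs one episode and
    returns (trace, success).
    """
    traces = []
    for _ in range(random_phase):                      # episodes 1..25: shared random data
        trace, _ = run_episode(policy="random", q_update=q_learner.update)
        traces.append(trace)
    action_model.build_trees(traces)                   # A component: C4.5 trees from the traces
    for _ in range(guided_phase_end - random_phase):   # episodes 26..100: A chooses, Q refines
        run_episode(policy=action_model.select_turn, q_update=q_learner.update)
    wins = 0
    for _ in range(test_episodes):                     # testing: learned q-values alone
        _, success = run_episode(policy=q_learner.greedy_turn, q_update=None)
        wins += success
    return 100.0 * wins / test_episodes
```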
It seems the multistrategy approach is not rescuing Q to make the combined system perform more like the human. In the next section, we try an alternative tactic, namely, that of adding what we consider to be one of the most useful elements of A into Q. Perhaps by adding elements of A one by one into Q we can eventually develop a system whose learning rate more closely approximates that of the human than either A or Q, because it would include the best elements of both systems.

Adding the Focus Heuristic to Q

As mentioned above, the focus heuristic, the use of action models, and the use of decision trees are all candidate explanations for A's superior performance over that of Q. We begin by considering the impact of the focus heuristic. This heuristic separates sonar and bearing information and therefore is able to take advantage of knowledge regarding which sensors are relevant when. Because the q-learning method lumps all knowledge into one reward, it probably needs longer training to do likewise. Here, we explore the impact of adding the focus heuristic to Q. Future work will investigate adding other elements of A to Q one by one.

A new version of Q, called Q_focus, has been created by generating two sets of neural networks, one for selecting actions to improve the bearing and another set for improving the sonar. The first set receives the bearing component of the reward and the second receives the sonar component (i.e., these are the respective elements of the previous reward, which is a weighted sum). Arbitration between the two sets of networks is done by the same focus heuristic used by the action models approach, namely, use the sonar networks to select a turn (based on Boltzmann selection using the q-values) if the sonar values are below the threshold; otherwise use the bearing networks to select.
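A sketch of the Q_focus arbitration follows; the two q-network sets are abstract dictionaries of callables, the state encoding is illustrative, and the threshold reuses the 150/220 figure of the focus heuristic.

```python
def q_focus_turn(sonar_nets, bearing_nets, state, choose, threshold=150):
    """Q_focus: two q-network sets arbitrated by the focus heuristic.

    sonar_nets / bearing_nets map each turn to a q-network trained on the sonar or
    bearing component of the reward (illustrative interface).  choose is a stochastic
    selector such as the boltzmann_choice sketched earlier.
    """
    if all(seg < threshold for seg in state["sonar"]):
        return choose(sonar_nets, state["vector"])    # mines nearby: sonar networks decide
    return choose(bearing_nets, state["vector"])      # otherwise steer by the bearing networks
```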

We empirically test the following hypothesis:

Hypothesis 3: The slope of Q_focus's learning curve is closer to that of the human's than Q's is.

Figure 3: The graph represents the learning curves for A, Q_focus, and the human subject (% success on test data vs. training episodes).

The results of testing Hypothesis 3 are in Figure 3. The hypothesis is confirmed, but what is unexpected is that Q_focus outperforms A and comes significantly closer to the human learning curve. The differences between the curves for Q_focus and A are statistically significant at all points (α = 0.05). Apparently the focus heuristic plays a major role in the advantage of A over Q, and by adding that knowledge to generate Q_focus, we have generated a system that most closely approximates the human learner of any system we have investigated so far.

Discussion

Our results indicate that relevance knowledge, especially knowledge regarding which sensors are relevant when, is one of the keys to better performance on this domain. Future work will explore adding other elements of A to Q. The rate of human learning on this task is a function of both the amount of strategic knowledge that humans possess on related navigation tasks, as well as the finely honed sensorimotor procedural machinery that makes effective use of this information. To make our algorithms competitive with human learning, we are making explicit other forms of navigational knowledge brought to bear by humans, and we are refining our algorithms to make use of this knowledge. We hope to thereby achieve our goal of developing an algorithm that models human learning on the Navigation task.

Acknowledgements

This research was sponsored by the Office of Naval Research N00014-95-WX30360 and N00014-95-1-0846. We thank Sandy Marshall and William Spears for their helpful comments and suggestions, and especially William Spears for his suggestion to divide the q-learner into two data structures in order to use the focus heuristic. Thanks also to Helen Gigley and Susan Chipman for their encouragement and support, to Jim Ballas for the joystick interface, and to Alan Schultz for all the work he did to tailor the simulation to both machine learning and human experimental needs.

References

Arbib, M.A. 1972. The Metaphorical Brain. NY: Wiley and Sons Publishers.

Breiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth International Group Publishers.

Chapman, D. & Kaelbling, L. 1991. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (pp. 726-731). San Mateo, CA: Morgan Kaufmann Publishers.

Drescher, G.L. 1991. Made-Up Minds. Cambridge, MA: MIT Press.

Gordon, D., Schultz, A., Grefenstette, J., Ballas, J., & Perez, M. 1994. User's Guide to the Navigation and Collision Avoidance Task. Naval Research Laboratory Technical Report AIC-94-013.

Gordon, D. & Subramanian, D. 1993. A Multistrategy Learning Scheme for Agent Knowledge Acquisition. Informatica, 17, 331-346.

Jordan, M.I. & Rumelhart, D.E. 1992. Forward models: supervised learning with a distal teacher. Cognitive Science, 16, 307-354.

Lin, L. 1992. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8, 293-321.

Mahadevan, S. 1992. Enhancing transfer in reinforcement learning by building stochastic models of robot actions. In Proceedings of the Ninth International Conference on Machine Learning (pp. 290-299). San Mateo, CA: Morgan Kaufmann Publishers.

Moore, A. 1992. Fast, robust adaptive control by learning only forward models. In Proceedings of the International Joint Conference on Neural Networks, 4 (pp. 571-578). San Mateo, CA: Morgan Kaufmann.

Munro, P. 1987. A dual back-propagation scheme for scalar reward learning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 165-176). Hillsdale, NJ: Erlbaum.

Nilsson, N. 1980. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing Company.

Quinlan, J.R. 1986. Induction of decision trees. Machine Learning, 1, 81-107.

Rissanen, J. 1983. Minimum Description Length Principle. Report RJ 4131 (45769), IBM Research Laboratory, San Jose.

Rumelhart, D.E. & McClelland, J.L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.

Skinner, B.F. 1984. Selection by consequences. The Behavioral and Brain Sciences, 7, 477-510.

Sutton, R. 1988. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3, 9-44.

Sutton, R. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216-224). San Mateo, CA: Morgan Kaufmann.

Watkins, C. 1989. Learning from Delayed Rewards. Doctoral dissertation. Cambridge, England: Cambridge University.
