Evolutionary Algorithms for Reinforcement Learning
Journal of Artificial Intelligence Research

David E. Moriarty (moriarty@isi.edu)
University of Southern California, Information Sciences Institute, Admiralty Way, Marina del Rey, CA

Alan C. Schultz (schultz@aic.nrl.navy.mil)
Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory, Washington, DC

John J. Grefenstette (gref@ib.gmu.edu)
Institute for Biosciences, Bioinformatics and Biotechnology, George Mason University, Manassas, VA

Abstract

There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman, and Moore (1996) recently provided an informative survey of temporal difference methods. This article focuses on the application of evolutionary algorithms to the reinforcement learning problem, emphasizing alternative policy representations, credit assignment methods, and problem-specific genetic operators. Strengths and weaknesses of the evolutionary approach to reinforcement learning are presented, along with a survey of representative applications.

Introduction

Kaelbling, Littman, and Moore (1996) and, more recently, Sutton and Barto (1998) provide informative surveys of the field of reinforcement learning (RL). They characterize two classes of methods for reinforcement learning: methods that search the space of value functions and methods that search the space of policies. The former class is exemplified by the temporal difference (TD) method and the latter by the evolutionary algorithm (EA) approach. Kaelbling et al. focus entirely on the first set of methods, and they provide an excellent account of the state of the art in TD learning. This article is intended to round out the picture by addressing evolutionary methods for solving the reinforcement learning problem.

As Kaelbling et al. clearly illustrate, reinforcement learning presents a challenging array of difficulties in the process of scaling up to realistic tasks, including problems associated with very large state spaces, partially observable states, rarely occurring states, and non-stationary environments. At this point, which approach is best remains an open question, so it is sensible to pursue parallel lines of research on alternative methods. While it is beyond the scope of this article to address whether it is better in general to search value function space or policy space, we do hope to highlight some of the strengths of the evolutionary approach to the reinforcement learning problem. The reader is advised not to view this article as an EA vs. TD discussion. In some cases the two methods provide complementary strengths, so hybrid approaches are advisable; in fact, our survey of implemented systems illustrates that many EA-based reinforcement learning systems include elements of TD learning as well.

The next section spells out the reinforcement learning problem. In order to provide a specific anchor for the later discussion, a subsequent section presents a particular TD method, and the section after it outlines the approach we call Evolutionary Algorithms for Reinforcement Learning (EARL) and provides a simple example of a particular EARL system. The following three sections focus on features that distinguish EAs for RL from EAs for general function optimization, including alternative policy representations, credit assignment methods, and RL-specific genetic operators. Two further sections highlight some strengths and weaknesses of the EA approach, and a later section briefly surveys some successful applications of EA systems on challenging RL tasks. The final section summarizes our presentation and points out directions for further research.

Reinforcement Learning

All reinforcement learning methods share the same goal: to solve sequential decision tasks through trial-and-error interactions with the environment (Barto, Sutton, & Watkins, 1990; Grefenstette, Ramsey, & Schultz, 1990). In a sequential decision task, an agent interacts with a dynamic system by selecting actions that affect state transitions so as to optimize some reward function. More formally, at any given time step t, an agent perceives its state s_t and selects an action a_t. The system responds by giving the agent some (possibly zero) numerical reward r(s_t) and changing into state s_{t+1} = \delta(s_t, a_t). The state transition may be determined solely by the current state and the agent's action, or may also involve stochastic processes.

The agent's goal is to learn a policy \pi : S \to A, which maps states to actions. The optimal policy can be defined in many ways, but it is typically defined as the policy that produces the greatest cumulative reward over all states s:

    \pi^* \equiv \arg\max_{\pi} V^{\pi}(s), \quad \forall s

where V^{\pi}(s) is the cumulative reward received from state s using policy \pi. There are also many ways to compute V^{\pi}(s). One approach uses a discount rate \gamma to discount rewards over time, with the sum computed over an infinite horizon:

    V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^i \, r_{t+i}

where r_t is the reward received at time step t. Alternatively, V^{\pi}(s_t) could be computed by summing the rewards over a finite horizon h:

    V^{\pi}(s_t) = \sum_{i=0}^{h} r_{t+i}
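To make these two definitions of cumulative reward concrete, the short sketch below computes them from a list of rewards observed from time step t onward. It is a minimal illustration of our own, not taken from the paper; the function names, the discount rate value, and the sample reward sequence are all illustrative assumptions.

```python
# Minimal sketch (ours, not the paper's): the two cumulative-reward
# definitions above, written as plain Python functions.

def discounted_return(rewards, gamma):
    """V(s_t) = sum_{i=0}^{inf} gamma^i * r_{t+i}, truncated at len(rewards)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def finite_horizon_return(rewards, h):
    """V(s_t) = sum_{i=0}^{h} r_{t+i}."""
    return sum(rewards[: h + 1])

# Example: rewards observed from time step t onward (illustrative values).
rewards = [0, 2, 1, 2, 4]
print(discounted_return(rewards, gamma=0.9))   # discounted, infinite-horizon style
print(finite_horizon_return(rewards, h=3))     # undiscounted, horizon h = 3
```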
The agent's state descriptions are usually identified with the values returned by its sensors, which provide a description of both the agent's current state and the state of the world. Often the sensors do not give the agent complete state information, and thus the state is only partially observable.

Besides reinforcement learning, intelligent agents can be designed within other paradigms, notably planning and supervised learning. We briefly note some of the major differences among these approaches. In general, planning methods require an explicit model of the state transition function \delta(s, a). Given such a model, a planning algorithm can search through possible action choices to find an action sequence that will guide the agent from an initial state to a goal state. Since planning algorithms operate using a model of the environment, they can backtrack or undo state transitions that enter undesirable states. In contrast, RL is intended to apply to situations in which a sufficiently tractable action model does not exist. Consequently, an agent in the RL paradigm must actively explore its environment in order to observe the effects of its actions. Unlike planning, RL agents cannot normally undo state transitions. Of course, in some cases it may be possible to build up an action model through experience (Sutton, 1990), enabling more planning as experience accumulates. However, RL research focuses on the behavior of an agent when it has insufficient knowledge to perform planning.

Agents can also be trained through supervised learning. In supervised learning, the agent is presented with examples of state-action pairs, along with an indication that the action was either correct or incorrect. The goal in supervised learning is to induce a general policy from the training examples. Thus, supervised learning requires an oracle that can supply correctly labeled examples. In contrast, RL does not require prior knowledge of correct and incorrect decisions. RL can be applied to situations in which rewards are sparse; for example, rewards may be associated only with certain states. In such cases it may be impossible to associate a label of correct or incorrect with particular decisions without reference to the agent's subsequent decisions, making supervised learning infeasible.

In summary, RL provides a flexible approach to the design of intelligent agents in situations for which both planning and supervised learning are impractical. RL can be applied to problems for which significant domain knowledge is either unavailable or costly to obtain. For example, a common RL task is robot control. Designers of autonomous robots often lack sufficient knowledge of the intended operational environment to use either the planning or the supervised learning regime to design a control policy for the robot. In this case, the goal of RL would be to enable the robot to generate effective decision policies as it explores its environment.

The figure below shows a simple sequential decision task that will be used as an example later in this paper. The task of the agent in this grid world is to move from state to state by selecting between two actions: right (R) or down (D). The sensor of the agent returns the identity of the current state. The agent always starts in state a1 and receives the reward indicated upon visiting each state. The task continues until the agent moves off the grid world (e.g., by taking action D from state a5). The goal is to learn a policy that returns the highest cumulative reward. For example, a policy which produces the sequence of actions R D R D D R R D starting from state a1 gives the optimal score.

[Figure: A simple grid-world sequential decision task, drawn as a 5x5 grid of boxes (columns a-e, rows 1-5) with a payoff in each box; the individual payoff values are not reproduced here. The agent starts in state a1 and receives the row and column of the current box as sensory input. The agent moves from one box to another by selecting between two moves, right or down, and the agent's score is increased by the payoff indicated in each box. The goal is to find a policy that maximizes the cumulative score.]
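To make the example concrete, the sketch below implements a grid world of this form and scores a fixed sequence of right/down actions as the cumulative reward collected before the agent leaves the grid. It is our own illustration, not the paper's code; the payoff matrix is a placeholder (the actual payoffs are in the figure), and names such as PAYOFFS and episode_return are assumptions.

```python
# Illustrative sketch of the grid-world task described above.
# The payoff values below are placeholders, NOT the ones in the paper's figure.

PAYOFFS = [  # rows 1-5 (top to bottom), columns a-e (left to right)
    [0, 2, 1, 1, 1],
    [1, 1, 2, 0, 2],
    [1, 1, 3, 1, 1],
    [1, 1, 2, 1, 2],
    [1, 1, 1, 1, 1],
]

def episode_return(actions, payoffs=PAYOFFS):
    """Score a sequence of 'R'/'D' actions starting from state a1 (top-left).
    The reward of a box is collected on visiting it; the episode ends as soon
    as the agent steps off the grid."""
    row, col = 0, 0                      # state a1
    total = payoffs[row][col]
    for a in actions:
        row, col = (row, col + 1) if a == "R" else (row + 1, col)
        if row >= len(payoffs) or col >= len(payoffs[0]):
            break                        # moved off the grid world: episode over
        total += payoffs[row][col]
    return total

print(episode_return("RDRDDRRD"))
```

Scoring whole episodes in exactly this way, that is, by the cumulative reward a candidate policy earns, is the kind of evaluation that the policy-search methods discussed in this article rely on.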
Policy Space vs. Value-Function Space

Given the reinforcement learning problem as described in the previous section, we now address the main topic: how to find an optimal policy. We consider two main approaches: one involves searching the space of value functions, and the other involves searching the space of policies directly.
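As a rough illustration of this distinction (a sketch of our own, not the paper's formulation), the fragment below contrasts the two representations for the grid-world task: a table of estimated returns Q(s, a) from which actions are chosen greedily, versus a direct state-to-action table of the kind a policy-search method would evaluate as a whole. All names and values here are illustrative assumptions.

```python
# Two ways to represent behavior for the grid world above (illustrative only).

# 1) Value-function view: store an estimated return Q(s, a) for each
#    state-action pair and act greedily with respect to those estimates.
Q = {}  # e.g., Q[("a1", "R")] = 2.0, filled in by a TD method such as Q-learning

def greedy_action(state, actions=("R", "D")):
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# 2) Policy-space view: represent a candidate policy directly as a
#    state -> action table; a search over policies (for example, an
#    evolutionary algorithm) would generate many such tables and keep
#    the ones that earn the highest episode score.
policy = {"a1": "R", "b1": "D", "b2": "R"}  # partial example

def policy_action(state):
    return policy.get(state, "R")  # default action for states not listed
```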