Q-Learning in Continuous State and Action Spaces
Chris Gaskett, David Wettergreen, and Alexander Zelinsky
Robotic Systems Laboratory
Department of Systems Engineering
Research School of Information Sciences and Engineering
The Australian National University
Canberra, ACT 0200 Australia
[cg|dsw|alex]@syseng.anu.edu.au

Abstract. Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions, in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task.

1 Introduction

Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning system's choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.

There are two prevalent approaches to reinforcement learning: Q-learning and actor-critic learning. In Q-learning [2] the expected value of each action in each state is stored. In Q-learning the policy is formed by executing the action with the highest expected value. In actor-critic learning [3] a critic learns the value of each state. The value is the expected reward over time from the environment under the current policy. The actor tries to maximise a local reward signal from the critic by choosing actions close to its current policy, then changing its policy depending upon feedback from the critic. In turn, the critic adjusts the value of states in response to rewards received following the actor's policy.

The main advantage of Q-learning over actor-critic learning is exploration insensitivity: the ability to learn without necessarily following the current policy. However, actor-critic learning has a major advantage over current implementations of Q-learning: the ability to respond to smoothly varying states with smoothly varying actions. Actor-critic systems can form a continuous mapping from state to action and update this policy based on the local reward signal from the critic. Q-learning is generally considered in the case where states and actions are both discrete. In some real-world situations, and especially in control, it is advantageous to treat both states and actions as continuous variables. This paper describes a continuous state and action Q-learning method and applies it to a simulated control task. Essential characteristics of a continuous state and action Q-learning system are also described. Advantage Learning [4] is found to be an important variation of Q-learning for these tasks.
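To make the distinction concrete, the following minimal sketch (ours, not from the paper) shows how a tabular Q-learner forms its policy: it simply executes the action with the highest stored expected value in the current state. The array shape and example values are illustrative assumptions.

    import numpy as np

    def greedy_action(q_table: np.ndarray, state: int) -> int:
        """Policy formation in Q-learning: pick the action with the
        highest expected value in the given state."""
        return int(np.argmax(q_table[state]))

    # Hypothetical table: 5 discrete states, 3 discrete actions,
    # with values assumed to have been learnt already.
    q_table = np.zeros((5, 3))
    q_table[2] = [0.1, 0.7, 0.3]
    assert greedy_action(q_table, 2) == 1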
2 Q-Learning

Q-learning works by incrementally updating the expected values of actions in states. For every possible state, every possible action is assigned a value which is a function of both the immediate reward for taking that action and the expected reward in the future based on the new state that is the result of taking that action. This is expressed by the one-step Q-update equation,

    Q(x, u) := (1 - α) Q(x, u) + α (R + γ max_{u_{t+1}} Q(x_{t+1}, u_{t+1})),    (1)

where Q is the expected value of performing action u in state x; x is the state vector; u is the action vector; R is the reward; α is a learning rate which controls convergence; and γ is the discount factor. The discount factor makes rewards earned earlier more valuable than those received later.

This method learns the values of all actions, rather than just finding the optimal policy. This knowledge is expensive in terms of the amount of information which has to be stored, but it does bring benefits. Q-learning is exploration insensitive: any action can be carried out at any time and information is gained from this experience. Actor-critic learning does not have this ability; actions must follow or nearly follow the current policy. This exploration insensitivity allows Q-learning to learn from other controllers; even if they are directed toward achieving a different task, they can provide valuable data. Knowledge from several Q-learners can be combined; as the values of non-optimal actions are known, a compromise action can be found.

In the standard Q-learning implementation, Q-values are stored in a table. One cell is required per combination of state and action. This implementation is not amenable to continuous state and action problems.

3 Continuous States and Actions

Many real-world control problems require actions of a continuous nature, in response to continuous state measurements. It should be possible for actions to vary smoothly in response to smooth changes in state. But most learning systems, indeed most classical AI techniques, are designed to operate in discrete domains, manipulating symbols rather than real-numbered variables. Some problems that we may wish to address, such as high-performance control of mobile robots, cannot be adequately carried out with coarsely coded inputs and outputs. Motor commands need to vary smoothly and accurately in response to continuous changes in state.

Q-learning with discretised states and actions scales poorly. As the number of state and action variables increases, the size of the table used to store Q-values grows exponentially. Accurate control requires that variables be quantised finely, but as these systems fail to generalise between similar states and actions, they require large quantities of training data. If the learning task described in Sect. 7 were attempted with a discrete Q-learning algorithm, the number of Q-values to be stored in the table would be extremely large. For example, discretised roughly to seven levels, the eight state variables and two action variables would require almost 300 million elements. Without generalisation, producing this number of experiences is impractical. Using a coarser representation of states leads to aliasing: functionally different situations map to the same state and are thus indistinguishable. It is possible to avoid these discretisation problems entirely by using learning methods which can deal directly with continuous states and actions.
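As a concrete illustration (ours, not the paper's implementation), the sketch below applies the one-step update (1) to a discretised table and reproduces the table-size arithmetic from Sect. 3. The learning-rate and discount values, table sizes, and all names are assumptions.

    import numpy as np

    ALPHA = 0.1   # learning rate (assumed value)
    GAMMA = 0.9   # discount factor (assumed value)

    def q_update(q_table, x, u, reward, x_next):
        """One-step update (1): Q(x,u) := (1-α)Q(x,u) + α(R + γ max Q(x',u'))."""
        target = reward + GAMMA * np.max(q_table[x_next])
        q_table[x, u] = (1 - ALPHA) * q_table[x, u] + ALPHA * target

    # Toy table: 10 discrete states, 4 discrete actions (assumed sizes).
    q_table = np.zeros((10, 4))
    q_update(q_table, x=3, u=1, reward=1.0, x_next=4)

    # Table size for the task in Sect. 7, discretised to seven levels per variable:
    levels, state_vars, action_vars = 7, 8, 2
    print(levels ** (state_vars + action_vars))  # 282475249 cells, almost 300 million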
4 Continuous State and Action Q-Learning

There have been several recent attempts at extending the Q-learning framework to continuous state and action spaces [5, 6, 7, 8, 9].

We believe that there are eight criteria that are necessary and sufficient for a system to be capable of this type of learning. Listed in Fig. 1, these requirements combine those required for basic Q-learning as described in Sect. 2 with the type of continuous behaviour described in Sect. 3.

Action Selection: Finds the action with the highest expected value quickly.
State Evaluation: Finds the value of a state quickly, as required for the Q-update equation (1). A state's value is the value of the highest valued action in that state.
Q Evaluation: Stores or approximates the entire Q-function, as required for the Q-update equation (1).
Model-Free: Requires no model of system dynamics to be known or learnt.
Flexible Policy: Allows representation of a broad range of policies, to allow freedom in developing a novel controller.
Continuity: Actions can vary smoothly with smooth changes in state.
State Generalisation: Generalises between similar states, reducing the amount of exploration required in state space.
Action Generalisation: Generalises between similar actions, reducing the amount of exploration required in action space.

Fig. 1. Essential capabilities for a continuous state and action Q-learning system

None of the Q-learning systems discussed below appears to fulfil all of these criteria completely. In particular, many systems cannot learn a policy where actions vary smoothly with smooth changes in state (the Continuity criterion). In these not-quite-continuous systems a small change in state cannot cause a small change in action; in effect, the function which maps state to action is a staircase, a piecewise constant function. Sections 4.1-4.6 describe various real-valued state and action Q-learning methods and techniques and rate them (in an unfair and biased manner) against the criteria in Fig. 1.

4.1 Adaptive Critic Methods

Werbos's adaptive critic family of methods [5] uses several feedforward artificial neural networks to implement reinforcement learning. The adaptive critic family includes methods closely related to actor-critic and Q-learning. A learnt dynamic model assists in assigning reward to components of the action vector (not meeting the Model-Free criterion). If the dynamic model is already known, or learning one is easier than learning the controller itself, model-based adaptive critic methods are an efficient approach to continuous state, continuous action reinforcement learning.

4.2 CMAC-Based Q-Learning

Santamaria, Ashwin and Sutton [6] have presented results for Q-learning systems using Albus's CMAC (Cerebellar Model Articulation Controller) [10]. The CMAC is a function approximation system which features spatial locality, avoiding the unlearning problem described in Sect. 6. It is a compromise between a look-up table and a weight-based approximator. It can generalise between similar states, but it involves discretisation, making it impossible to completely fulfil the Continuity criterion. In [6] the inputs to the CMAC are the state and action; the output is the expected value.
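As a rough illustration of the idea (a sketch under our own assumptions, not the implementation from [6]), the code below approximates Q with CMAC-style tile coding: the concatenated (state, action) vector activates one tile in each of several offset tilings, and the value is the sum of the corresponding weights. Nearby inputs share tiles, giving generalisation, but the output changes in discrete steps, which is why the Continuity criterion cannot be fully met. Tiling counts, resolutions and names are assumptions.

    import numpy as np

    class TileCodedQ:
        """CMAC-style tile coding over the joint (state, action) input."""

        def __init__(self, n_tilings=8, tiles_per_dim=10, n_dims=3, seed=0):
            rng = np.random.default_rng(seed)
            # Each tiling is shifted by a random fraction of one tile width.
            self.offsets = rng.uniform(0.0, 1.0 / tiles_per_dim, (n_tilings, n_dims))
            self.tiles_per_dim = tiles_per_dim
            self.weights = np.zeros((n_tilings,) + (tiles_per_dim,) * n_dims)

        def active_tiles(self, xu):
            # One active tile per tiling for an input xu in [0, 1)^n_dims.
            idx = np.floor((xu + self.offsets) * self.tiles_per_dim).astype(int)
            return np.clip(idx, 0, self.tiles_per_dim - 1)

        def value(self, xu):
            # Q estimate: sum of the weights of the active tiles.
            return sum(self.weights[(t,) + tuple(tile)]
                       for t, tile in enumerate(self.active_tiles(xu)))

        def update(self, xu, target, lr=0.1):
            # Move the estimate towards the target, spreading the change
            # evenly across the tilings.
            error = target - self.value(xu)
            for t, tile in enumerate(self.active_tiles(xu)):
                self.weights[(t,) + tuple(tile)] += lr * error / len(self.weights)

    # Input is the concatenated (state, action) vector; output is the expected value.
    q = TileCodedQ(n_dims=3)
    xu = np.array([0.2, 0.5, 0.8])
    q.update(xu, target=1.0)
    print(q.value(xu))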