
Q-Learning in Continuous State and Action Spaces

Chris Gaskett, David Wettergreen, and Alexander Zelinsky

Robotic Systems Laboratory, Department of Systems Engineering, Research School of Information Sciences and Engineering, The Australian National University, Canberra, ACT 0200, Australia
[cg|dsw|alex]@syseng.anu.edu.au

Abstract. Q-learning can be used to learn a control policy that maximises a scalar reward through interaction with the environment. Q-learning is commonly applied to problems with discrete states and actions. We describe a method suitable for control tasks which require continuous actions in response to continuous states. The system consists of a neural network coupled with a novel interpolator. Simulation results are presented for a non-holonomic control task. Advantage Learning, a variation of Q-learning, is shown to enhance learning speed and reliability for this task.

1 Introduction

Reinforcement learning systems learn by trial-and-error which actions are most valuable in which situations (states) [1]. Feedback is provided in the form of a scalar reward signal which may be delayed. The reward signal is defined in relation to the task to be achieved; reward is given when the system is successfully achieving the task. The value is updated incrementally with experience and is defined as a discounted sum of expected future reward. The learning system's choice of actions in response to states is called its policy. Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data.

There are two prevalent approaches to reinforcement learning: Q-learning and actor-critic learning. In Q-learning [2] the expected value of each action in each state is stored. In Q-learning the policy is formed by executing the action with the highest expected value. In actor-critic learning [3] a critic learns the value of each state. The value is the expected reward over time from the environment under the current policy. The actor tries to maximise a local reward signal from the critic by choosing actions close to its current policy, then changing its policy depending upon feedback from the critic. In turn, the critic adjusts the value of states in response to rewards received following the actor's policy.

The main advantage of Q-learning over actor-critic learning is exploration insensitivity: the ability to learn without necessarily following the current policy. However, actor-critic learning has a major advantage over current implementations of Q-learning: the ability to respond to smoothly varying states with smoothly varying actions. Actor-critic systems can form a continuous mapping from state to action and update this policy based on the local reward signal from the critic. Q-learning is generally considered in the case that states and actions are both discrete.

In some real world situations, and especially in control, it is advantageous to treat both states and actions as continuous variables. This paper describes a continuous state and action Q-learning method and applies it to a simulated control task. Essential characteristics of a continuous state and action Q-learning system are also described. Advantage Learning [4] is found to be an important variation of Q-learning for these tasks.

2 Q-Learning

Q-learning works by incrementally updating the expected values of actions in states. For every possible state, every possible action is assigned a value which is a function of both the immediate reward for taking that action and the expected reward in the future based on the new state that is the result of taking that action. This is expressed by the one-step Q-update equation,

$$Q(x_t, u_t) := (1-\alpha)\,Q(x_t, u_t) + \alpha\left(R + \gamma \max_{u_{t+1}} Q(x_{t+1}, u_{t+1})\right), \qquad (1)$$

where Q is the expected value of performing action u in state x; x is the state vector; u is the action vector; R is the reward; α is a learning rate which controls convergence; and γ is the discount factor. The discount factor makes rewards earned earlier more valuable than those received later.

This method learns the values of all actions, rather than just finding the optimal policy. This knowledge is expensive in terms of the amount of information which has to be stored, but it does bring benefits. Q-learning is exploration insensitive: any action can be carried out at any time and information is gained from this experience.
Actor-critic learning does not have this ability: actions must follow or nearly follow the current policy. This exploration insensitivity allows Q-learning to learn from other controllers; even if they are directed toward achieving a different task, they can provide valuable data. Knowledge from several Q-learners can be combined: because the values of non-optimal actions are known, a compromise action can be found.

In the standard Q-learning implementation, Q-values are stored in a table. One cell is required per combination of state and action. This implementation is not amenable to continuous state and action problems.
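For reference, the tabular form of update (1) is compact in code. The following is a minimal sketch in which the table sizes, learning rate and discount factor are illustrative only.

```python
import numpy as np

# Tabular one-step Q-update (1): one Q-value per state-action cell.
# Sizes and parameter values below are illustrative only.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def q_update(x, u, reward, x_next):
    """Blend the old estimate with the reward plus the discounted best next value."""
    target = reward + gamma * np.max(Q[x_next])
    Q[x, u] = (1.0 - alpha) * Q[x, u] + alpha * target
```

It is exactly this table that becomes impractical in the continuous setting discussed next.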

3 Continuous States and Actions

Many real world control problems require actions of a continuous nature, in response to continuous state measurements. It should be possible for actions to vary smoothly in response to smooth changes in state. But most learning systems, indeed most classical AI techniques, are designed to operate in discrete domains, manipulating symbols rather than real-numbered variables. Some problems that we may wish to address, such as high-performance control of mobile robots, cannot be adequately carried out with coarsely coded inputs and outputs. Motor commands need to vary smoothly and accurately in response to continuous changes in state.

Q-learning with discretised states and actions scales poorly. As the number of state and action variables increases, the size of the table used to store Q-values grows exponentially. Accurate control requires that variables be quantised finely, but as these systems fail to generalise between similar states and actions, they require large quantities of training data. If the learning task described in Sect. 7 were attempted with a discrete Q-learner, the number of Q-values to be stored in the table would be extremely large. For example, discretised roughly to seven levels, the eight state variables and two action variables would require almost 300 million table elements (7^10 = 282,475,249). Without generalisation, producing this number of experiences is impractical. Using a coarser representation of states leads to aliasing: functionally different situations map to the same state and are thus indistinguishable. It is possible to avoid these discretisation problems entirely by using learning methods which can deal directly with continuous states and actions.

4 Continuous State and Action Q-Learning

There have been several recent attempts at extending the Q-learning framework to continuous state and action spaces [5, 6, 7, 8, 9]. We believe that there are eight criteria that are necessary and sufficient for a system to be capable of this type of learning. Listed in Fig. 1, these requirements are a combination of those required for basic Q-learning as described in Sect. 2 and the type of continuous behaviour described in Sect. 3. None of the Q-learning systems discussed below appear to fulfil all of these criteria completely. In particular, many systems cannot learn a policy where actions vary smoothly with smooth changes in state (the Continuity criterion). In these not-quite-continuous systems a small change in state cannot cause a small change in action; in effect the function which maps state to action is a staircase, a piecewise constant function. Sections 4.1–4.6 describe various real-valued state and action Q-learning methods and techniques and rate them (in an unfair and biased manner) against the criteria in Fig. 1.

Action Selection: Finds the action with the highest expected value quickly.
State Evaluation: Finds the value of a state quickly, as required for the Q-update equation (1). A state's value is the value of the highest valued action in that state.
Q Evaluation: Stores or approximates the entire Q-function, as required for the Q-update equation (1).
Model-Free: Requires no model of system dynamics to be known or learnt.
Flexible Policy: Allows representation of a broad range of policies, to allow freedom in developing a novel controller.
Continuity: Actions can vary smoothly with smooth changes in state.
State Generalisation: Generalises between similar states, reducing the amount of exploration required in state space.
Action Generalisation: Generalises between similar actions, reducing the amount of exploration required in action space.

Fig. 1. Essential capabilities for a continuous state and action Q-learning system

4.1 Adaptive Critic Methods

Werbos's adaptive critic family of methods [5] uses several feedforward artificial neural networks to implement reinforcement learning. The adaptive critic family includes methods closely related to actor-critic and Q-learning. A learnt dynamic model assists in assigning reward to components of the action vector (not meeting the Model-Free criterion). If the dynamic model is already known, or learning one is easier than learning the controller itself, model-based adaptive critic methods are an efficient approach to continuous state, continuous action reinforcement learning.

4.2 CMAC Based Q-learning

Santamaria, Sutton and Ram [6] have presented results for Q-learning systems using Albus's CMAC (Cerebellar Model Articulation Controller) [10]. The CMAC is a function approximation system which features spatial locality, avoiding the unlearning problem described in Sect. 6. It is a compromise between a look-up table and a weight-based approximator. It can generalise between similar states, but it involves discretisation, making it impossible to completely fulfil the Continuity criterion. In [6] the inputs to the CMAC are the state and action; the output is the expected value. To find max Q this implementation requires a search across all possible actions, calculating the Q-value for each to find the highest. This does not fulfil the Action Selection criterion. Another concern is that approximation resources are used evenly across the state and action spaces. Santamaria et al. address this by pre-distorting the state information using a priori knowledge so that more important parts of the state space receive more approximation resources.

4.3 Q-AHC

Rummery presents a method which combines Q-learning with actor-critic learning [7]. Q-learning is used to choose between a set of actor-critic learners. Its overall performance was unsatisfactory. In general it either set the actions to constant settings, making it equivalent to Lin's system for generalising between states [11], or only used one of the actor-critic modules, making it equivalent to a standard actor-critic system. These problems may stem from not fulfilling the Q Evaluation, Action Generalisation and State Generalisation criteria when different actor-critic learners are used. This system is one of the few which can represent non-piecewise-constant policies (the Continuity criterion).

4.4 Q-Kohonen

Touzet describes a Q-learning system based on Kohonen's self-organising map [8, 12]. The state, action and expected value are the elements of the feature vector. Actions are chosen by selecting the node which most closely matches the state and the maximum possible value (one). Unfortunately the actions are always piecewise constant, not fulfilling the Continuity criterion.

4.5 Q-Radial Basis

Santos describes a system based on radial basis functions [13]. It is very similar to the Q-Kohonen system in that each radial basis neuron holds a centre vector like the Kohonen feature vector. The number of possible actions is equal to the number of radial basis neurons, so actions are piecewise constant (not fulfilling the Continuity criterion). It does not meet the Q Evaluation criterion, as only those actions described by the radial basis neurons have an associated value.

4.6 Neural Field Q-learning

Gross, Stephan and Krabbes have implemented a Q-learning system based on dynamic neural fields [9]. A neural vector quantiser (Neural Gas) clusters similar states. A neural field encodes the values of actions, so that selecting the action with the highest Q requires iterative evaluation of the neural field dynamics. This limits the speed with which actions can be selected (the Action Selection criterion) and the values of states found (the State Evaluation criterion). The system fulfils the State Generalisation and Action Generalisation criteria.

4.7 Our Approach

We seek a method of learning the control for a continuously acting agent functioning in the real world, for example a mobile robot travelling to a goal location. For this application of reinforcement learning, the existing approaches have shortcomings that make them inappropriate for controlling this type of system. Many cannot adequately generalise between states and/or actions. Others cannot produce smoothly varying control actions or cannot generate actions quickly enough for operation in real time. For these reasons we propose a scheme for reinforcement learning that uses a neural network and an interpolator to approximate the Q-function.

5 Wire-fitted Neural Network Q-Learning

Wire-fitted Neural Network Q-Learning is a continuous state, continuous action Q-learning method. It couples a single feedforward artificial neural network with an interpolator (the "wire-fitter") to fulfil all the criteria in Fig. 1.

Feedforward artificial neural networks have been used successfully to generalise between similar states in Q-learning systems where actions are discrete [11, 7]. If the output from the neural network describes (non-fixed) actions and their expected values, an interpolator can be used to generalise between them. This would fulfil the State Generalisation and Action Generalisation criteria.

Baird and Klopf [14] describe a suitable interpolation scheme called "wire-fitting". The wire-fitting function is a moving least squares interpolator, closely related to Shepard's function [15]. Each "wire" is a combination of an action vector, u, and its expected value, q, which is a sample of the Q-function. Baird and Klopf used the wire-fitting function in a memory-based reinforcement learning scheme. In our system the parameters describing the wire positions are the output of a neural network whose input is the state vector, x.

Figure 2 is an example of wire-fitting. The action in this case is one-dimensional, but the system supports many-dimensional actions. The example shows the graph of action versus value (Q) for a particular state. The number of wires is fixed; the positions of the wires change to fit new data. Required changes are calculated using the partial derivatives of the wire-fitting function. Once new wire positions have been calculated, the neural network is trained to output these new positions. The sketch below illustrates the resulting structure.
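The following is a minimal sketch of the forward pass and greedy action selection, assuming a single hidden layer; the layer sizes, number of wires and (random) weights are placeholders rather than the values used in our experiments.

```python
import numpy as np

# Sketch of the wire-fitted network's forward pass: the network maps the
# state to n_wires wires, each an action vector u_i plus a scalar value q_i.
# Sizes are illustrative; tanh outputs keep actions and values in (-1, 1).
state_dim, action_dim, n_wires, n_hidden = 8, 2, 9, 30
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hidden, state_dim))
W2 = rng.normal(scale=0.1, size=(n_wires * (action_dim + 1), n_hidden))

def wires(x):
    """Forward pass for state x: returns (u_i, q_i) for every wire."""
    hidden = np.tanh(W1 @ x)
    out = np.tanh(W2 @ hidden).reshape(n_wires, action_dim + 1)
    return out[:, :action_dim], out[:, action_dim]

def greedy_action(x):
    """Action selection needs only the forward pass: take the wire with highest q."""
    u, q = wires(x)
    return u[np.argmax(q)]
```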

Fig. 2. The wire-fitting process. The action (u) is one-dimensional in this case. Three wires (shown as ◦) are the output from the neural network for a particular state. The wire-fitting function interpolates between the wires to calculate Q for every u. The new data (∗) does not fit the curve well (left), so the wires are moved according to the partial derivatives (right). In other states the wires would be in different positions.

The wire-fitting function has several properties which make it a useful interpolator for implementing Q-learning. Updates to the Q-value (1) require max Q(x, u), which can be calculated quickly with the wire-fitting function (the State Evaluation criterion). The action u for max Q(x, u) can also be calculated quickly (the Action Selection criterion). This is needed when choosing an action to carry out. A property of this interpolator is that the highest interpolated value always coincides with the highest valued interpolation point, so the action with the highest value is always one of the input actions. When choosing an action it is sufficient to propagate the state through the neural network, then compare the output q values to find the best action. The wire-fitter is not required at this stage; the only calculation is the forward pass through the neural network.

Wire-fitting also works with many-dimensional scattered data while remaining computationally tractable; no inversion of matrices is required. Interpolation is local: only nearby points influence the value of Q. Areas far from all wires have a value which is the average of the q values, so wild extrapolations do not occur (see Fig. 2). It does not suffer from oscillations, unlike most polynomial schemes. Importantly, the partial derivatives in terms of each q and u of each point can be calculated quickly. These partial derivatives allow error in the output of the Q-function to be propagated back to the neural network.

This combination of neural network and interpolator stores the entire Q-function (the Q Evaluation criterion). It represents policies in a very flexible way; it allows sudden changes in action in response to a change in state by changing wires, while also allowing actions to change smoothly in response to changes in state (the Continuity and Flexible Policy criteria).

The training algorithm is shown in Fig. 3. Training of the single hidden layer, feedforward neural network is by incremental backpropagation. The learning rate is kept constant throughout. Tan-sigmoidal neurons are used, restricting the magnitude of actions and values to between −1 and 1. The wire-fitting function is

$$Q(x, u) = \lim_{\epsilon \to 0^{+}} \frac{\sum_{i=0}^{n} \dfrac{q_i(x)}{\left\| u - u_i(x) \right\|^{2} + c\,(q_{\max}(x) - q_i(x)) + \epsilon}}{\sum_{i=0}^{n} \dfrac{1}{\left\| u - u_i(x) \right\|^{2} + c\,(q_{\max}(x) - q_i(x)) + \epsilon}} = \lim_{\epsilon \to 0^{+}} \frac{\sum_{i=0}^{n} q_i(x)/\mathrm{distance}_i(x, u)}{\sum_{i=0}^{n} 1/\mathrm{distance}_i(x, u)} = \lim_{\epsilon \to 0^{+}} \frac{\mathrm{wsum}(x, u)}{\mathrm{norm}(x, u)}, \qquad (2)$$

where i is the wire number; n is the total number of wires; x is the state vector; u_i(x) is the ith action vector; q_i(x) is the value of the ith action vector; u is the action vector to be evaluated; c is a small smoothing factor; ε avoids division by zero; and distance_i(x, u) denotes the common per-wire denominator. The dimensionality of the action vectors u and u_i is the number of continuous variables in the action. The two simplified forms shown simplify the description of the partial derivatives.

1. In real time, feed the state into the neural network. Carry out the action with the highest q. Store the resulting change in state.

2. Calculate a new estimate of Q from the current value, the reward and the value of the next state. This can be done when convenient.

3. From the new value of Q calculate new values for u and q using the wire-fitter partial derivatives. Train the neural network to output the new u and q. This can be done when convenient.

Fig. 3. Wire-fitted neural network training algorithm

The partial derivative of Q from (2) in terms of q_k(x) is

$$\frac{\partial Q}{\partial q_k} = \lim_{\epsilon \to 0^{+}} \frac{\mathrm{norm}(x, u)\,\bigl(\mathrm{distance}_k(x, u) + q_k\, c\bigr) - \mathrm{wsum}(x, u)\, c}{\bigl[\mathrm{norm}(x, u) \cdot \mathrm{distance}_k(x, u)\bigr]^{2}}. \qquad (3)$$

Equation (3) is inexact when q_k = q_max. The partial derivative of Q in terms of u_{k,j} is

$$\frac{\partial Q}{\partial u_{k,j}} = \lim_{\epsilon \to 0^{+}} \frac{\bigl[\mathrm{wsum}(x, u) - \mathrm{norm}(x, u)\, q_k\bigr] \cdot 2\,(u_{k,j} - u_j)}{\bigl[\mathrm{norm}(x, u) \cdot \mathrm{distance}_k(x, u)\bigr]^{2}}, \qquad (4)$$

where j selects a term of the action vector (u_j is a term of the chosen action). The summation terms in (3) and (4) have already been found in the calculation of Q with (2). With the partial derivatives known it is possible to calculate new positions for all the wires u_0...u_n and q_0...q_n by gradient descent. As a result of this change the output from the wire-fitter should move closer to the new target Q.
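For concreteness, a direct transcription of (2)–(4) might look like the sketch below; the wire positions and values would come from the network's forward pass, and the values of c and eps are illustrative.

```python
import numpy as np

# Wire-fitting interpolator (2) and its partial derivatives (3) and (4).
# wires_u has shape (n, action_dim); wires_q has shape (n,).
c, eps = 0.01, 1e-6   # smoothing factor and division-by-zero guard (illustrative)

def wirefit(u, wires_u, wires_q):
    """Return Q(x, u) together with the per-wire terms reused by (3) and (4)."""
    q_max = np.max(wires_q)
    dist = np.sum((u - wires_u) ** 2, axis=1) + c * (q_max - wires_q) + eps
    wsum = np.sum(wires_q / dist)
    norm = np.sum(1.0 / dist)
    return wsum / norm, dist, wsum, norm

def dQ_dq(u, wires_u, wires_q, k):
    """Equation (3); as noted above, it is inexact when q_k equals q_max."""
    _, dist, wsum, norm = wirefit(u, wires_u, wires_q)
    return (norm * (dist[k] + wires_q[k] * c) - wsum * c) / (norm * dist[k]) ** 2

def dQ_du(u, wires_u, wires_q, k, j):
    """Equation (4): derivative with respect to component j of wire k's action."""
    _, dist, wsum, norm = wirefit(u, wires_u, wires_q)
    return ((wsum - norm * wires_q[k]) * 2.0 * (wires_u[k, j] - u[j])
            / (norm * dist[k]) ** 2)
```

Gradient steps on q and u using these derivatives give the new wire positions that the network is then trained to output.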

6 Practical Issues

When the input to a neural network changes slowly, a problem known as unlearning or interference can cause the network to unlearn the correct output for other inputs, because recent experience dominates the training data [16]. We cope with this problem by storing examples of state, action and next-state transitions and replaying them as if they were being re-experienced. This creates a constantly changing input to the neural network, known as persistent excitation. We do not store target outputs for the network, as these would become incorrect through the learning process described in Sect. 5. Instead, the wire-fitter is used to calculate new neural network output targets. This method makes efficient use of data gathered from the world without relying on extrapolation. A disadvantage is that if conditions change the stored data could become misleading.

One problem with applying Q-learning to continuous problems is that a single suboptimal action will not prevent a high-valued action from being carried out at the next time step. Thus the values of actions in a particular state can be very similar, as the value of the action in the next time step will be carried back. As the Q-value is only approximated for continuous states and actions, it is likely that most of the approximation power will be used to represent the values of the states rather than of actions within states. The relative values of actions will be poorly represented, resulting in an unsatisfactory policy. The problem is compounded as the time intervals between control actions get smaller.

Advantage Learning [4] addresses this problem by emphasising the differences in value between the actions. In Advantage Learning the value of the optimal action is the same as for Q-learning, but the lesser value of non-optimal actions is emphasised by a scaling factor (k ∝ Δt). This makes more efficient use of the approximation resources available. The Advantage Learning update is

$$A(x_t, u_t) := (1-\alpha)\,A(x_t, u_t) + \alpha\left[\frac{1}{k}\left(R + \gamma \max_{u_{t+1}} A(x_{t+1}, u_{t+1})\right) + \left(1 - \frac{1}{k}\right)\max_{u_t} A(x_t, u_t)\right], \qquad (5)$$

where A is analogous to Q in (1). The results in Sect. 7 show that Advantage Learning does make a difference in our learning task.
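As a minimal sketch, the update (5) for a single stored estimate can be written as follows; max_a_next and max_a_curr stand for the maxima over actions in the next and current states (obtained from the wire-fitter in our setting), and the argument names are illustrative.

```python
def advantage_update(a_old, reward, max_a_next, max_a_curr, alpha, gamma, k):
    """One Advantage Learning update (5); k scales the penalty on non-optimal actions."""
    blended = (reward + gamma * max_a_next) / k + (1.0 - 1.0 / k) * max_a_curr
    return (1.0 - alpha) * a_old + alpha * blended
```

With k = 1 this reduces to the ordinary update (1); smaller k widens the represented gap between optimal and non-optimal actions.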

7 Simulation Results

We apply our learning algorithm to a simulation task. The task involves guiding a submersible vehicle to a target position by firing thrusters located on either side. The thrusters produce continuously variable thrust ranging from full forward to full backward. As there are only two thrusters (left, right) but three degrees of freedom (x, y, rotation), the submersible is non-holonomic in its planar world. The simulation includes momentum and friction effects in both angular and linear displacement. The controller must learn to slow the submersible and hold position as it reaches the target. Reward is the negative of the distance to the target (this is not a pure delayed reward problem).

Fig. 4. Appearance of the simulator for one run. The submersible gradually learns to control its motion to reach targets
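To illustrate the kind of plant involved (not our actual simulator, whose constants are not given here), a toy planar model with differential thrusters might look as follows.

```python
import numpy as np

# A toy sketch of a submersible with two thrusters (left, right) and three
# planar degrees of freedom (x, y, heading), with momentum and friction.
# All constants are assumptions, not the values used in the experiments.
dt, mass, inertia = 0.1, 1.0, 0.1
lin_friction, ang_friction, arm = 0.95, 0.95, 0.2

def step(state, left_thrust, right_thrust):
    """Advance (x, y, heading, vx, vy, omega) by one time step."""
    x, y, heading, vx, vy, omega = state
    forward = left_thrust + right_thrust          # net force along the heading
    torque = (right_thrust - left_thrust) * arm   # differential thrust turns the vehicle
    vx = lin_friction * vx + dt * forward * np.cos(heading) / mass
    vy = lin_friction * vy + dt * forward * np.sin(heading) / mass
    omega = ang_friction * omega + dt * torque / inertia
    return np.array([x + dt * vx, y + dt * vy, heading + dt * omega, vx, vy, omega])
```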

Figure 4 shows a simulation run with hand-placed targets. At the point marked zero the learning system does not have any knowledge of the effects of its actuators, the meaning of its sensors, or even the task to be achieved. After some initial wandering the controller gradually learns to guide the submersible directly toward the target and come to a near stop.

In earlier results using Q-learning alone [17], the controller learned to direct the submersible to the first randomly placed target about 70% of the time. Less than half of the controllers could reach all targets in a series of 10. Observation of Q-values showed that the value varied only slightly between actions, making it difficult to learn a stable policy. In our current implementation we use Advantage Learning (see Sect. 6) to emphasise the differences between actions. We now report that 100% of the controllers converge to acceptable performance.

To test this, we placed random targets at a distance of 1 unit, in a random direction, from a simulated submersible robot and allowed a period of 200 time steps for it to approach and hold station on the target. For a series of targets, the average distance over the time period was recorded. A random motion achieves an average distance of 1 unit (no progress), while a hand-coded controller can achieve 0.25. The learning algorithm reduces the average distance with time, eventually approaching hand-coded controller performance. Recording distance rather than just ability to reach the target ensures that controllers which fail to hold station do not receive a high rating.

Graphs comparing 140 controllers trained with Q-learning and 140 trained with Advantage Learning are shown in the box-and-whisker plots in Fig. 5. The median distance to the target is the horizontal line in the middle of the box. The upper and lower bounds of the box show where 25% of the data above and below the median lie, so the box contains the middle 50% of the data. Outliers, which are outside 1.5 times the range between the upper and lower ends of the box from the median, are shown by a "+" sign. The whiskers show the range of the data, excluding outliers.

Fig. 5. Performance of 140 learning controllers using Q-learning (left) and Advantage Learning (right), which attempt to reach 40 targets, each placed one distance unit away
[Each panel plots average distance to target against target number.]

Advantage Learning converges to good performance more quickly and reliably than Q-learning, and with many fewer and smaller-magnitude spurious actions. Gradual improvement is still taking place at the 40th target. The quantity of outliers on the graph for Q-learning shows that the policy continues to produce erratic behaviour in about 10% of cases.

When reward is based only on distance to the target (as in the experiment above) the actions are somewhat step-like. To promote smooth control it is necessary to punish both energy use and sudden changes in the commanded action. Such penalties encouraged smoothness and confirmed that the system is capable of responding to continuous changes in state with continuous changes in action. A side effect of punishing energy consumption is an improved ability to maintain position.
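A sketch of such a shaped reward, with assumed penalty weights since the coefficients are not given in the text, is:

```python
import numpy as np

# Shaped reward: negative distance to the target, minus assumed penalties
# for energy use and for sudden changes in the commanded action.
w_energy, w_change = 0.1, 0.1   # illustrative weights

def shaped_reward(position, target, action, previous_action):
    distance = np.linalg.norm(target - position)
    energy = np.sum(action ** 2)
    change = np.sum((action - previous_action) ** 2)
    return -distance - w_energy * energy - w_change * change
```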

8 Conclusion

A practical continuous state, continuous action Q-learning system has been described and tested. It was found to converge quickly and reliably on a simulated control task. Advantage Learning was found to be an important tool in overcoming the problem of similarity in value between actions.

Acknowledgements

We thank WindRiver Systems and BEI Systron Donner for their support of the Kambara AUV project. We also thank the Underwater project team: Samer Abdallah, Terence Betlehem, Wayne Dunston, Ian Fitzgerald, Chris McPherson, Chanop Silpa-Anan and Harley Truong for their contributions.

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Bradford Books, MIT Press, 1998.
[2] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.
[3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983.
[4] Mance E. Harmon and Leemon C. Baird. Residual advantage learning applied to a differential game. In Proceedings of the International Conference on Neural Networks, Washington D.C., 1995.
[5] Paul J. Werbos. Approximate dynamic programming for real-time control and neural modeling. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, 1992.
[6] Juan C. Santamaria, Richard S. Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behaviour, 6(2):163–218, 1998.
[7] Gavin Adrian Rummery. Problem solving with reinforcement learning. PhD thesis, Cambridge University, 1995.
[8] Claude F. Touzet. Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous Systems, 22(3-4):251–281, 1997.
[9] H.-M. Gross, V. Stephan, and M. Krabbes. A neural field approach to topological reinforcement learning in continuous action spaces. In Proc. 1998 IEEE World Congress on Computational Intelligence, WCCI'98, and International Joint Conference on Neural Networks, IJCNN'98, Anchorage, Alaska, 1998.
[10] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97:220–227, 1975.
[11] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3/4), 1992.
[12] T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third edition, 1989.
[13] Juan Miguel Santos. Contribution to the study and design of reinforcement functions. PhD thesis, Universidad de Buenos Aires and Université d'Aix-Marseille III, 1999.
[14] Leemon C. Baird and A. Harry Klopf. Reinforcement learning with high-dimensional, continuous actions. Technical Report WL-TR-93-1147, Wright Laboratory, 1993.
[15] Peter Lancaster and Kęstutis Šalkauskas. Curve and Surface Fitting: An Introduction. Academic Press, 1986.
[16] W. Baker and J. Farrell. An introduction to connectionist learning control systems. In D. A. White and D. A. Sofge, editors, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, 1992.
[17] Chris Gaskett, David Wettergreen, and Alexander Zelinsky. Reinforcement learning applied to the control of an autonomous underwater vehicle. In Proceedings of the Australian Conference on Robotics and Automation (AuCRA99), 1999.