A Comparison of Action Selection Methods for Implicit Policy Method Reinforcement Learning in Continuous Action-Space
Barry D. Nichols
School of Science and Technology
Middlesex University
London, England
Email: [email protected]

Abstract—In this paper I investigate methods of applying reinforcement learning to continuous state- and action-space problems without a policy function. I compare the performance of four methods, one of which is the discretisation of the action-space, and the other three are optimisation techniques applied to finding the greedy action without discretisation. The optimisation methods I apply are gradient descent, Nelder-Mead and Newton's Method. The action selection methods are applied in conjunction with the SARSA algorithm, with a multilayer perceptron utilized for the approximation of the value function. The approaches are applied to two simulated continuous state- and action-space control problems: Cart-Pole and double Cart-Pole. The results are compared both in terms of action selection time and the number of trials required to train on the benchmark problems.

I. INTRODUCTION

Reinforcement learning (RL) [1], [2] has many successes in discrete and small action-space problems. However, many interesting problems have large and/or continuous action-space, for which it is not obvious which RL approach is the most appropriate.

The most common methods utilised in the application of RL to continuous action-space problems are either actor-critic methods (AC) [3] or direct policy search (DPS) [4]. AC requires the approximation of two functions: the value function Q : S × A → ℝ, giving the expected long-term reward from being in a given state s ∈ S and taking an action a ∈ A, and the policy function π : S → A, which is a mapping from states to actions. It is also common, when using AC methods, to use the value function V : S → ℝ, which is the expected long-term, discounted reward when in state s ∈ S and following π(s). DPS, on the other hand, seeks to directly optimise the policy function π(s), taking the value of the policy to be the sum of discounted rewards received when following that policy.

Another method, which is the method I investigate here, is the implicit policy method (IPM), which approximates the value function Q : S × A → ℝ; however, unlike AC methods, no policy function is stored. Instead the action to take from a given state is calculated as required from the Q function. Despite being the method of choice when the action-space is small or discrete [3], IPM is not frequently applied when the action-space is large or continuous. This is due to the fact that it becomes impossible to compare the values of every possible action, and it has been stated that applying optimisation to the action selection at every time-step would be prohibitively time consuming [3], [5]. However, there have been some examples of the successful use of IPM in continuous action-space [6]–[8].

Here I investigate IPM using several different action selection methods: discretisation, gradient descent, Newton's Method and Nelder-Mead. I compare these methods both in terms of performance and action selection time on two well-known continuous state- and action-space control benchmark problems from the literature: the Cart-Pole problem and the related double Cart-Pole problem. In these experiments, apart from the action selection method used, all parameter values are kept constant. This allows a clear comparison of the action selection methods, both in terms of speed, measured by action selection time, and accuracy, reflected by the number of training trials required by each approach.

The rest of the paper is organised as follows. Firstly, in Section II, I give a brief summary of RL focussing on the approach used here. Then, in Section III, I present the details of the action selection methods which will be compared. In Section IV I specify the details of the neural network I use to approximate Q(s, a), and the derivation of ∇_a Q(s, a) and ∇²_a Q(s, a), which are required to apply the derivative-based action selection methods. Sections V and VI give the details of the experiments conducted on the Cart-Pole and double Cart-Pole domains respectively, including results obtained. Finally I present my conclusions and compare the performance of the approaches across the different problem domains in Section VII. In the conclusion I also give some indication of problems which may arise when applying these approaches, with possible solutions and some directions for future work.
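As a concrete illustration of the implicit policy method described above, the sketch below selects a continuous greedy action by numerically optimising the learned Q function over the action at the current state, using SciPy's Nelder-Mead routine (one of the four action selection methods compared later). The placeholder q_hat, the starting action and the action bounds are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of implicit-policy action selection: the greedy action is
# found by optimising Q(s, .) at each time-step rather than being read from
# a separately stored policy function.  The quadratic q_hat below is only a
# stand-in for a trained value-function approximator.
import numpy as np
from scipy.optimize import minimize


def q_hat(state: np.ndarray, action: float) -> float:
    """Placeholder Q(s, a); in the paper this would be the trained MLP."""
    return -(action - 0.1 * state.sum()) ** 2


def greedy_action(state: np.ndarray, a_min: float, a_max: float) -> float:
    """Approximate arg max_a Q(s, a) with a derivative-free optimiser."""
    # Minimise the negative value so that the optimiser maximises Q.
    result = minimize(
        lambda a: -q_hat(state, float(a[0])),
        x0=np.array([0.0]),          # starting action (e.g. the previous action)
        method="Nelder-Mead",
    )
    # Keep the chosen action inside the valid action range.
    return float(np.clip(result.x[0], a_min, a_max))


if __name__ == "__main__":
    s = np.array([0.2, -0.4, 0.0, 0.1])
    print(greedy_action(s, a_min=-10.0, a_max=10.0))
```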
II. BACKGROUND

An RL agent learns to perform a task by experimenting on that task and receiving feedback based on its current performance. At any given time-step, the agent is in state s ∈ S and must select an action a ∈ A. After applying the selected action a to the environment, the agent receives a numerical reward r ∈ ℝ from the environment and transitions to the resulting state s′ ∈ S. The reward could be negative in order to represent a punishment for reaching undesirable states, or, in the case of delayed rewards, the reward may be zero until the terminal state. The delayed reward case is suitable for several tasks, e.g. a board game where the reward is based on the final outcome of the game, or a disaster avoidance task where we simply attempt to avoid failure. An example of disaster avoidance is the Cart-Pole problem used in this paper, described in more detail in Section V.

In order to improve its performance on the task, the RL agent seeks to maximise the rewards it receives, or to be precise the sum of long-term discounted rewards (1), where t is the time-step, r_{t+1} is the reward received in time-step t+1, after taking a_t from s_t, and γ ∈ (0, 1] is the discount rate. Normally γ < 1 to ensure the sum of rewards remains finite and to assign more weight to rewards which will be received in the short-term.

    Σ_{t=0}^{T−1} γ^t r_{t+1}    (1)

The optimisation of the sum of discounted rewards is achieved by the construction of a value function Q : S × A → ℝ, which is an approximation of the expected sum of long-term rewards when taking the given action from the given state and then following the learned policy π : S → A. The learned policy is a mapping from states to the action to take when in that state. The Q function is defined as (2), and can be recursively defined as (3), which is the sum of the immediate reward and the expected sum of rewards which will be received resulting from taking the next action from the next state.

    Q(s, a) = E{ Σ_{k=0}^{T−1} γ^k r(s_{k+1}, a_{k+1}) | s_0 = s, a_0 = a }    (2)

    Q(s_t, a_t) = r + γ Q(s_{t+1}, a_{t+1})    (3)

The learnt policy involves selecting the greedy action w.r.t. the value function. In small, discrete action-space the agent can select the greedy action by evaluating the value function at every possible action (4).

    arg max_a Q(s, a), ∀a ∈ A    (4)

The agent must also perform exploratory actions in order to update the value function in unexplored areas of the state- and action-space, which may lead to an improved policy.

In the SARSA algorithm applied here, the temporal-difference error (5) is used to update the values of Q(s, a) (6), where α is a step-size parameter.

    δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)    (5)

    Q(s_t, a_t) ← Q(s_t, a_t) + α δ_t    (6)
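To make the update concrete, the following sketch implements the temporal-difference error (5) and the SARSA update (6) for a small discretised action set, with the greedy choice made by direct enumeration as in (4). The dictionary-backed Q table, the action grid and the numeric values are illustrative assumptions rather than settings taken from the paper.

```python
# Minimal tabular SARSA sketch for equations (4)-(6): greedy selection by
# enumerating a finite action set, TD error delta_t, and the step-size update.
from collections import defaultdict
from typing import Hashable, Sequence

Q = defaultdict(float)            # Q[(state, action)] -> value, initialised to 0
GAMMA = 0.9                       # discount rate, gamma in (0, 1]
ALPHA = 0.1                       # step-size parameter alpha


def greedy_action(state: Hashable, actions: Sequence[float]) -> float:
    """Equation (4): compare Q(s, a) for every available action."""
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa_update(s: Hashable, a: float, r: float,
                 s_next: Hashable, a_next: float) -> float:
    """Equations (5) and (6): TD error followed by the Q-value update."""
    delta = r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]   # (5)
    Q[(s, a)] += ALPHA * delta                            # (6)
    return delta


if __name__ == "__main__":
    actions = [-1.0, 0.0, 1.0]                 # illustrative discretised actions
    s, a = "s0", -1.0
    delta = sarsa_update(s, a, r=1.0, s_next="s1",
                         a_next=greedy_action("s1", actions))
    print(delta, Q[(s, a)])
```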
A. Continuous State- and Action-Space

When the state-space is continuous it is not possible to apply a lookup table to store the Q values, as it is in the small, discrete state- and action-space setting. Therefore function approximation is often applied to approximate the Q function. However, when the action-space is continuous it is also impossible to compare the values of each action from the current state; thus, other methods must be employed.

The approximation of the Q function could be achieved by an artificial neural network (ANN). Many types of ANN have been successfully applied to the approximation of the value function, e.g. CMAC, RBF, perceptron, MLP. Here, I focus my presentation on the multilayer perceptron (MLP) as this is the ANN I apply in this work. The MLP is a global function approximator capable of approximating any function to arbitrary accuracy, provided sufficient neurons are included in the hidden layer [9].

When an ANN is employed to approximate the value function, the update is then applied to the parameters of the ANN (the synaptic weights) rather than directly updating the value associated with the specific state and action. The update equation is then (7), where θ is the parameter vector for the ANN.

    θ ← θ + α δ_t ∇_θ Q(s_t, a_t)    (7)

In the continuous action-space setting exploration is often applied through Gaussian exploration [10]. In Gaussian exploration a Gaussian random variable is added to the greedy action before it is applied to the environment. This is more appropriate than ε-greedy and Softmax when the action-space is continuous.

When the action-space is continuous some researchers believe it is infeasible to directly solve the optimisation problem (4) at every time-step [3], [5]. Therefore, separate function approximators are often applied to approximating the policy function. This approach, which employs both the Q function and the π function, is the actor-critic method [5].

An alternative is to attempt to optimise the policy function without learning the value function. This is achieved by taking the value of the current policy to be the J function, which is defined as the sum of discounted rewards received in an episode when following that policy.
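Returning to the update (7) and the Gaussian exploration scheme described above, the sketch below shows how both might look for a one-hidden-layer MLP approximating Q(s, a), with ∇_θ Q computed by hand for this small network. The network size, tanh activation, step size and exploration noise are assumptions chosen for illustration, not the settings used in the paper.

```python
# Minimal sketch of the semi-gradient update (7) for a one-hidden-layer MLP
# Q(s, a) with tanh hidden units and a linear output, plus Gaussian exploration.
import numpy as np

rng = np.random.default_rng(0)
N_STATE, N_HIDDEN = 4, 8
ALPHA, GAMMA, SIGMA = 0.01, 0.9, 0.5        # step size, discount, exploration noise

# The parameter vector theta is split into two weight blocks (biases omitted).
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_STATE + 1))   # input: state and action
w2 = rng.normal(scale=0.1, size=N_HIDDEN)


def q_value(s: np.ndarray, a: float):
    """Forward pass; returns Q(s, a) and the values needed for the gradient."""
    x = np.append(s, a)
    h = np.tanh(W1 @ x)
    return w2 @ h, (x, h)


def sarsa_step(s, a, r, s_next, a_next):
    """Equation (7): theta <- theta + alpha * delta_t * grad_theta Q(s_t, a_t)."""
    global W1, w2
    q_sa, (x, h) = q_value(s, a)
    q_next, _ = q_value(s_next, a_next)
    delta = r + GAMMA * q_next - q_sa                      # TD error (5)
    # Gradients of Q w.r.t. the two weight blocks (chain rule through tanh).
    grad_w2 = h
    grad_W1 = np.outer(w2 * (1.0 - h ** 2), x)
    w2 = w2 + ALPHA * delta * grad_w2                      # update (7)
    W1 = W1 + ALPHA * delta * grad_W1
    return delta


def explore(greedy_a: float, a_min: float, a_max: float) -> float:
    """Gaussian exploration: perturb the greedy action before applying it."""
    return float(np.clip(greedy_a + rng.normal(0.0, SIGMA), a_min, a_max))


if __name__ == "__main__":
    s, s_next = rng.normal(size=N_STATE), rng.normal(size=N_STATE)
    a = explore(greedy_a=0.0, a_min=-10.0, a_max=10.0)
    print(sarsa_step(s, a, r=1.0, s_next=s_next, a_next=0.0))
```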