Continuous Control with a Combination of Supervised and Reinforcement Learning
Dmitry Kangin and Nicolas Pugeault
Computer Science Department, University of Exeter, Exeter EX4 4QF, UK
{d.kangin, [email protected]}

Abstract—Reinforcement learning methods have recently achieved impressive results on a wide range of control problems. However, especially with complex inputs, they still require an extensive amount of training data to converge to a meaningful solution. This limits their applicability to complex input spaces such as video signals and makes them impractical for many complex real-world problems, including many of those requiring video-based control. Supervised learning, on the contrary, is capable of learning from a relatively limited number of samples, but relies on arbitrary hand-labelling of data rather than task-derived reward functions, and hence does not yield independent control policies. In this article we propose a novel, model-free approach which uses a combination of reinforcement and supervised learning for autonomous control and paves the way towards policy-based control in real-world environments. We use the SpeedDreams/TORCS video game to demonstrate that our approach requires far fewer samples (hundreds of thousands against millions or tens of millions) compared to state-of-the-art reinforcement learning techniques on similar data, and at the same time outperforms both supervised and reinforcement learning approaches in terms of quality. Additionally, we demonstrate the applicability of the method to MuJoCo control problems.

I. INTRODUCTION

Recently published reinforcement learning approaches [1], [2], [3] have achieved significant improvements in solving complex control problems such as vision-based control. They can define control policies from scratch, without mimicking any other control signals; instead, reward functions are used to formulate criteria for control behaviour optimisation. However, in order to explore the control space and find appropriate actions, even the most recent state-of-the-art methods require several tens of millions of steps before convergence [1], [4].

One of the first attempts to combine reinforcement and supervised learning can be attributed to [5]. Recently proposed supervised methods for continuous control problems, such as [6], are known to converge rapidly; however, they do not provide independent policies. Contrary to many classification or regression problems, where a definitive ground truth exists, as presented in [7], [8], for many control problems an infinitely large number of control policies may be appropriate. In autonomous driving problems, for example, there is no single correct steering angle for every moment of time, as different trajectories can be appropriate. However, a whole control sequence can be assessed according to objective measures: for example, by the average speed, the time for completing a lap in a race, or other appropriate criteria. The situation is the same for other control problems connected with robotics, including walking [9] and balancing [10] robots, as well as many others [11]. In these problems, there also usually exist criteria for assessment (for example, the time spent to pass the challenge), which help to assess how desirable the control actions are.

The problem becomes even more challenging if the results depend on the sequence of previous observations [12], e.g. because of the dynamic nature of the problem, involving speed or acceleration, or the difference between the current and the previous control signal.

In many real-world problems, it is possible to combine reinforcement and supervised learning. For the problem of autonomous driving, it is often possible to provide parallel signals from an autopilot, which can be used to restrict the reinforcement learning solutions towards sensible subsets of control actions. The same can be done for robotic control. Such reference models can be analytical or trained by machine learning techniques, and may use other sensors capable of providing alternative information from which to build the reference model (e.g., a model trained on LiDAR data can be used to train a vision-based model). However, although there has been some work using partially labelled datasets within the reinforcement learning framework [13], to the best of our knowledge the proposed problem statement, injecting supervised data into reinforcement learning through regularisation of Q-functions, differs from the ones published before. In [13], the authors consider a robotic control problem which does not involve video data, and their approach is based on sharing the replay buffer between the reinforcement learning and demonstrator data. In [14], the authors address the problem of sample efficiency by using pretraining and state dynamics prediction. However, their idea of pretraining differs from the one proposed here, as random positions and actions are used. Also, in contrast to the proposed method, which uses image input, the approach proposed in [14] has only been tested on low-dimensional problems.

The novelty of the approach presented in this paper is as follows:
1) a new regularised optimisation formulation, combining reinforcement and supervised learning, is proposed;
2) a training algorithm is formulated based on this problem statement, and assessed on control problems;
3) a novel greedy actor-critic reinforcement learning algorithm is proposed as part of the training algorithm, containing interlaced data collection, critic update and actor update stages.

The proposed method reduces the number of samples required to train a reinforcement learning model on visual data from millions or tens of millions to just hundreds of thousands, and is shown to converge to policies outperforming those achieved by either supervised or reinforcement learning alone. It builds upon the performance of supervised learning through reinforcement learning: the combination both learns faster than pure reinforcement learning and, at the same time, improves the policy using rewards, which is not possible with supervised training alone.
INTRODUCTION by machine learning techniques, and may use some other Recently published reinforcement learning approaches [1], sensors, which are capable to provide alternative information [2], [3] have achieved significant improvements in solving which enables to build the reference model (e.g., the model complex control problems such as vision-based control. They trained on LiDAR data can be used to train the vision based can define control policies from scratch, without mimicking model). However, although there were some works using any other control signals. Instead, reward functions are used to partially labelled datasets within the reinforcement learning formulate criteria for control behaviour optimisation. However, framework [13], as far as we believe, the proposed prob- the problem with these methods is that in order to explore the lem statement, injecting supervised data into reinforcement control space and find the appropriate actions, even the most learning using regularisation of Q-functions, is different from recent state-of-the-art methods require several tens of millions the ones published before. In [13], the authors consider the of steps before convergence [1], [4]. problem of robotic control which does not involve video data, One of the first attempts to combine reinforcement and and their approach considers sharing the replay buffer between supervised learning can be attributed to [5]. Recently proposed the reinforcement learning and demonstrator data. In [14], the supervised methods for continuous control problems, like [6], authors address the problem of sample efficiency improvement are known to rapidly converge, however they do not provide by using pretraining and state dynamics prediction. However, independent policies. Contrary to many classification or re- the proposed idea of pretraining is different from the one gression problems, where there exists a definitive ground truth, proposed here as the random positions and actions are used. as it is presented in [7], [8], for many control problems an Also, in contrast to the proposed method, which uses image infinitely large number of control policies may be appropriate. input, the approach, proposed in [14], has only been tested on In autonomous driving problems, for example, there is no low-dimensional problems. single correct steering angle for every moment of time as The novelty of the approach, presented in this paper, is given different trajectories can be appropriate. However, a whole as follows: control sequence can be assessed according to objective mea- 1) a new regularised formulation optimisation, combining l Annotated Dataset Supervised Pre-training Initial Model Environment l > 0 : hz1; z2; : : : ; zli 2 Z . We define an operator from the subsequences of real valued quantities to d-dimensional Rewards Control Signals p d control signals π(zi; zi−1; : : : ; zi−p+1jΘπ): Z ! R , Reinforcement&Supervised Learning where i 2 [p; l], p 2 N is the number of frames used to produce the control signal, and Θπ are the parameters of the operator π(·). We denote ci = π(zi; : : : ; zi−p+1jΘπ), Finetuned Model xi = (zi; zi−1; : : : ; zi−p+1). The problem is stated as follows: given the set of N N Fig. 
A. Problem statement

Observations are given as sequences of length $l > 0$: $\langle z_1, z_2, \dots, z_l \rangle \in Z^l$, where $Z$ is the set of possible observations. We define an operator from subsequences of real-valued quantities to $d$-dimensional control signals, $\pi(z_i, z_{i-1}, \dots, z_{i-p+1} \mid \Theta_\pi): Z^p \to \mathbb{R}^d$, where $i \in [p, l]$, $p \in \mathbb{N}$ is the number of frames used to produce the control signal, and $\Theta_\pi$ are the parameters of the operator $\pi(\cdot)$. We denote $c_i = \pi(z_i, \dots, z_{i-p+1} \mid \Theta_\pi)$ and $x_i = (z_i, z_{i-1}, \dots, z_{i-p+1})$.

The problem is stated as follows: given the set of $N$ sequences $\{\hat{z}^j \in Z^{l_j}\}_{j=1}^{N}$ and the set of corresponding control signal sequences $\{\langle \hat{c}^j_i \in \mathbb{R}^d \rangle_{i=1}^{l_j}\}_{j=1}^{N}$, produced by some external control method called a 'reference actor', find the parameters $\Theta_\pi$ of the actor $\pi(\cdot \mid \Theta_\pi)$ which minimise a loss function (the loss function used is defined in Formula (7)).
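For concreteness, the sketch below instantiates this problem statement in PyTorch: the last $p$ observations are stacked into $x_i$ and a small actor network is fitted to the reference actor's control signals. The architecture, the toy data, and the use of a plain mean squared error are assumptions made here for illustration only; the loss actually used in the paper is defined later in Formula (7).

import torch
import torch.nn as nn

# Minimal sketch (not the authors' code) of fitting the actor pi(.|Theta_pi)
# to a reference actor. Assumptions: observations are flat vectors of size
# OBS_DIM, the actor is a small MLP, and a plain MSE stands in for the loss
# defined later in Formula (7).
OBS_DIM, D, P = 32, 3, 4      # observation size, control dimension d, history length p

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM * P, 256), nn.ReLU(),
            nn.Linear(256, D), nn.Tanh())      # bounded control signals

    def forward(self, x):                      # x: (batch, p * OBS_DIM)
        return self.net(x)

def stack_observations(z, p=P):
    # Build x_i = (z_i, z_{i-1}, ..., z_{i-p+1}) for i = p, ..., l.
    return torch.stack([z[i - p + 1:i + 1].flatten() for i in range(p - 1, len(z))])

actor = Actor()
optimiser = torch.optim.Adam(actor.parameters(), lr=1e-4)

# Toy stand-ins for one recorded sequence and its reference control signals.
z_hat = torch.randn(100, OBS_DIM)              # observations z_1, ..., z_l
c_hat = torch.randn(100 - P + 1, D)            # reference control for each x_i

x = stack_observations(z_hat)                  # shape: (l - p + 1, p * OBS_DIM)
loss = nn.functional.mse_loss(actor(x), c_hat) # surrogate for the loss in Formula (7)
optimiser.zero_grad()
loss.backward()
optimiser.step()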
B. Label-assisted reinforcement learning

The reinforcement learning method is inspired by the DDPG algorithm [3]; however, it is substantially reworked to meet the needs of the proposed combination of supervised and reinforcement learning working in real time. First, the problem statement and the basic definitions necessary for the formalisation of the method need to be given.

In line with the standard terminology of reinforcement learning, we refer to the model generating control signals as an agent, and to control signals as actions. Also, as is usually done in reinforcement learning problem statements, we assume the states to be equivalent to the observations. Initially, the model receives an initial state of the environment $x_1$. It then applies the action $c_1$, which results in a transition into the state $x_2$. The procedure repeats recurrently, resulting in sequences of states $X^n = \langle x_1, x_2, \dots, x_n \rangle$ and actions $C^n = \langle c_1, c_2, \dots, c_n \rangle$. Every time the agent performs an action, it receives a reward $r(x_i, c_i) \in \mathbb{R}$ [3].

The emitted actions are defined by the policy $\pi$, mapping states into actions. In general [15], the policy can be stochastic, defining a probability distribution over the action space; here, however, a deterministic policy is assumed. For such a deterministic policy, one can express the discounted future reward using the following form of the Bellman equation for the expectation of the discounted future reward $Q(\cdot, \cdot)$ [3]:

$$Q(x_i, c_i) = \mathbb{E}_{x_{i+1}}\left[ r(x_i, c_i) + \gamma Q(x_{i+1}, \pi(x_{i+1})) \right], \qquad (1)$$

where $\pi(\cdot)$ is an operator from states to actions, $\gamma \in [0, 1]$ is