Control Policy with Autocorrelated Noise in Reinforcement Learning for Robotics
International Journal of Machine Learning and Computing, Vol. 5, No. 2, April 2015

Control Policy with Autocorrelated Noise in Reinforcement Learning for Robotics

Paweł Wawrzyński

Abstract—Direct application of reinforcement learning in robotics raises the issue of discontinuity of the control signal. Consecutive actions are selected independently at random, which often makes them excessively far from one another. Such control is hardly ever appropriate for robots; it may even lead to their destruction. This paper considers a control policy in which consecutive actions are modified by autocorrelated noise. That policy generally solves the aforementioned problems and is readily applicable in robots. In the experimental study it is applied to three robotic learning control tasks: Cart-Pole SwingUp, Half-Cheetah, and a walking humanoid.

Index Terms—Machine learning, reinforcement learning, actor-critics, robotics.

Manuscript received August 5, 2014; revised December 1, 2014. Paweł Wawrzyński is with Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland (e-mail: [email protected]). DOI: 10.7763/IJMLC.2015.V5.489

I. INTRODUCTION

Reinforcement learning (RL) addresses the problem of an agent that optimizes its reactive policy in a poorly structured and initially unknown environment [1]. The primary application of RL is robotics, where the agent becomes the robot's controller and the robot itself, together with its surroundings, becomes the agent's environment. Reinforcement learning offers the prospect of efficient robot behaviour being learned rather than programmed by a human designer.

A typical setting in which RL is applied in robotics is as follows. There are two levels of control. The lower level is based on servomotors in the robot's joints. The servomotors are fed with desired joint positions and try to make the joints follow them. At the higher level, the controller determines the desired servomotor positions based on the robot state. A learning (through reinforcement) component resides at the higher control level. Within the typical scheme of operation of the learning component, the desired servomotor positions are periodically selected at random, and consecutive selections are only stochastically dependent through the robot state. But that means that consecutive desired servo positions are far from one another. This results in a characteristic jerking of the robot, which is unhealthy robot behaviour and may lead to its destruction.

Applications of RL in robotics are surveyed in [2]. A more general discussion of policy search in robotics is presented in [3]. The work [4] presents an RL algorithm that optimizes robotic primitives. This algorithm overcomes the problem of robot jerking during learning at the cost of giving up the framework of Markov Decision Processes (MDPs) [1]. In [5] a method is presented that enables optimization of robotic primitives by means of RL algorithms based on the MDP framework, but it does not alleviate the problem of robot jerking. The current paper is intended to fill this gap.

In this paper, a control policy is introduced that may undergo reinforcement learning and has the following properties:
- It is applicable to robot control optimization.
- It does not lead to robot jerking.
- It can be optimized by any RL algorithm that is designed to optimize a classical stochastic control policy.

The policy introduced here is based on a deterministic transformation of the state combined with a random element in the form of a specific stochastic process, namely the moving average.

The paper is organized as follows. In Section II the problem of our interest is defined. Section III presents the main contribution of this paper, i.e., a stochastic control policy that prevents robot jerking while learning. In Section IV an analysis of this policy is presented. Section V contains an experimental study in which the policy is applied in two simulated and one real robotic learning control tasks. The last section concludes the paper.

II. PROBLEM FORMULATION

We consider the standard RL setup [1]. A Markov Decision Process (MDP) defines a problem of an agent that observes its state, st, in discrete time t = 1, 2, 3, ..., performs actions, at, receives rewards, rt, and moves to other states, st+1. A particular MDP is a tuple <S, A, Ps, r> where S and A are the state and action spaces, respectively; {Ps(·|s, a) : s ∈ S, a ∈ A} is a set of state transition distributions; we write st+1 ~ Ps(·|st, at) and assume that each Ps is a density. Each state transition generates a reward, rt ∈ R. Here we assume that each reward depends deterministically on the current action and the next state, rt = r(at, st+1). The agent learns to assign actions to states so as to expect, in each state, the highest rewards in the future.

Here we consider robotic applications of the above general framework. Therefore, both spaces of interest are multidimensional and continuous: S = R^{n_s} and A = R^{n_a}. Also, it is assumed that st reflects the state of a certain continuous-time system at discrete time instants. Let τ ∈ R denote continuous time. The dynamics of that system can be described by an equation of the form

    ds(τ)/dτ = f(s(τ), a(τ)),    (1)

where f is unknown, δ > 0 denotes the time discretization, st = s(τ0 + tδ), and a(τ) = at for τ ∈ [τ0 + tδ, τ0 + (t+1)δ).

The subject of our interest is a stochastic control policy that produces actions. The following properties of this policy are required:
1) It is possible to optimize this policy by any RL algorithm designed for stochastic control policy optimization.
2) Fine time discretization does not prevent the learning algorithm from efficient operation, nor does it result in robot jerking. However, fine time discretization may require adjustment of some parameters of the learning algorithm and the policy.
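To make the relation between (1) and the discrete-time MDP concrete, the following minimal sketch shows how a transition st → st+1 can be simulated: the action at is held constant over one discretization interval (zero-order hold) while the continuous dynamics are integrated numerically. The Euler integrator, the function names, and the toy pendulum dynamics are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_transition(f, s_t, a_t, delta, n_substeps=10):
    """Approximate s_{t+1} = s(tau_0 + (t+1)*delta) from eq. (1).
    The action a_t is held constant over [tau_0 + t*delta, tau_0 + (t+1)*delta)
    and ds/dtau = f(s, a) is integrated with simple Euler sub-steps."""
    h = delta / n_substeps
    s = np.asarray(s_t, dtype=float).copy()
    for _ in range(n_substeps):
        s = s + h * f(s, a_t)      # Euler step of the continuous-time dynamics
    return s

# Illustrative dynamics (not from the paper): a torque-driven pendulum,
# state s = [angle, angular velocity], action a = [torque].
def pendulum_f(s, a):
    angle, vel = s
    return np.array([vel, -9.81 * np.sin(angle) + a[0]])

s_t = np.array([0.1, 0.0])
a_t = np.array([0.5])
s_next = simulate_transition(pendulum_f, s_t, a_t, delta=0.02)
```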
III. POLICY DEFINITION

A. Generic Definition

Let the actions be produced by the following function:

    at = h(st, ξt, θ),    (2)

where st is the state, ξt ∈ R^{n_ξ} is a random element, and θ ∈ R^{n_θ} is a parameter (e.g., neural weights). Typically, a policy that produces actions is defined as a probability distribution parameterized by the state and a vector, θ, whose optimization is the objective of learning. But technically, actions are always computed on the basis of a certain function, h, and a finite-dimensional random element, ξt.
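The following sketch shows one way the policy of Section III-B, eqs. (3)-(5), could be generated in code: a buffer of the last M noise vectors ζ is maintained, ξt is their sum, and the action is the approximator output plus ξt. The class name, the placeholder approximator g, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

class MovingAverageNoisePolicy:
    """Sketch of a_t = g(s_t; theta) + xi_t, eq. (3), where xi_t is the
    moving average (5) of M i.i.d. vectors zeta_t ~ N(0, I*sigma^2/M), eq. (4)."""

    def __init__(self, g, action_dim, sigma=0.1, M=10, rng=None):
        self.g = g                      # approximator g(s; theta), e.g. a neural net
        self.action_dim = action_dim
        self.sigma = sigma
        self.M = M
        self.rng = rng or np.random.default_rng()
        # Buffer of the last M zeta vectors; xi_t is their sum.
        self.zetas = [self._draw_zeta() for _ in range(M)]

    def _draw_zeta(self):
        # zeta_t ~ N(0, I * sigma^2 / M)   -- eq. (4)
        return self.rng.normal(0.0, self.sigma / np.sqrt(self.M),
                               size=self.action_dim)

    def act(self, s_t):
        # Advance the moving average: drop zeta_{t-M}, add a fresh zeta_t.
        self.zetas.pop(0)
        self.zetas.append(self._draw_zeta())
        xi_t = np.sum(self.zetas, axis=0)   # eq. (5): stationary N(0, I*sigma^2)
        return self.g(s_t) + xi_t           # eq. (3)

# Usage with a dummy linear "approximator" (illustrative only):
g = lambda s: 0.1 * np.asarray(s)[:2]
policy = MovingAverageNoisePolicy(g, action_dim=2, sigma=0.2, M=20)
a = policy.act(np.array([0.0, 0.1, -0.2]))
```

Because only one ζ is replaced per step, consecutive ξt differ little, so consecutive actions stay close to each other, which is exactly the anti-jerking property required in Section II.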
With the additional assumptions that the ξt have the same distribution for all t, and that ξt is stochastically independent of ξt+i for i ≠ 0, eq. (2) is a typical representation of a stochastic control policy in reinforcement learning.

In this paper the following set of requirements is imposed on (2):
1) ξt has the same (stationary) distribution for each t.
2) ξt is stochastically independent of ξt+i for |i| ≥ M, where M > 0 is a certain constant.
3) E‖ξt − ξt−i‖² < E‖ξt − ξt−i−1‖² for 0 ≤ i < M.
4) h is continuous with respect to all its arguments.

The first two conditions are required for the policy to be applicable in known reinforcement learning algorithms, e.g., in actor-critics [6], [7]. The latter two make consecutive control actions close to one another. Strict "continuity" of the control signal is not possible in continuous time. However, if h is continuous and ξt is on average close to ξt+1, consecutive actions are close to each other as well.

B. Specific Definition

A specific design of a policy based on the above requirements may be the following. Let

    at = h(st, ξt, θ) = g(st; θ) + ξt,    (3)

where g is a certain approximator parameterized by θ with input st, and ξt is defined as follows. Let

    ζt ∼ N(0, Iσ²/M)    (4)

be random vectors stochastically independent for different t, let M > 0 be a constant, and let

    ξt = ∑_{i=0}^{M−1} ζt−i.    (5)

The stochastic process defined in (5) is known as the moving average. Let us now verify the conditions from the previous section.
1) Each ξt is a sum of M independent normal random vectors, therefore each ξt has the same (stationary) distribution N(0, σ²I).
2) ξt and ξt+i are, for |i| ≥ M, computed from different ζ-s, thus they are stochastically independent.
3) For 0 ≤ i < M we have

    E‖ξt − ξt−i‖² = E‖ ∑_{j=0}^{M−1} ζt−j − ∑_{j=0}^{M−1} ζt−i−j ‖²    (6)
                  = E‖ ∑_{j=0}^{i−1} ζt−j − ∑_{j=0}^{i−1} ζt−M−j ‖²    (7)
                  = 2iσ²/M,    (8)

and therefore

    E‖ξt − ξt−i‖² < E‖ξt − ξt−i−1‖² = 2(i+1)σ²/M.    (9)

4) If g is continuous with respect to its arguments, then h is continuous too.

The distribution of the action defined according to (3) is normal with mean g(st; θ) and covariance matrix Iσ².

IV. ANALYSIS

In this section we investigate the following question: how to parametrize the control policy to assure a constant level of randomness in the state trajectory for any given time discretization, δ. To this end, the power of the noise is defined below, it is derived for ξt as defined in the previous section, and it is proven that the randomness in the state trajectory is proportional to that power.

A. Power of ξ

Let

    ξ(τ) = ξt for t such that τ0 + tδ ≤ τ < τ0 + (t+1)δ.    (10)

We define the power of ξ as

    Pξ = lim_{T→∞} (1/T) E[ (∫_{τ0}^{τ0+T} ξ(τ) dτ) (∫_{τ0}^{τ0+T} ξ(τ) dτ)^T ].    (11)

In the case of the moving average (5), this power has the value

    Pξ = lim_{t→∞} (1/(tδ)) E[ (∑_{i=0}^{t−1} δξi) (∑_{i=0}^{t−1} δξi)^T ]    (12)
       = lim_{t→∞} (1/(tδ)) E[ (∑_{i=0}^{t−1} δMζi) (∑_{i=0}^{t−1} δMζi)^T ]    (13)
       = (1/(tδ)) δ²M²t Iσ²/M    (14)
       = Iσ²δM.    (15)
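As a rough sanity check on the result Pξ = Iσ²δM, the following self-contained sketch estimates the power (11) by Monte Carlo in the scalar (one-dimensional) case and compares it with σ²δM. The constants and the Monte Carlo setup are arbitrary illustration choices, not values used in the paper.

```python
import numpy as np

# Rough Monte Carlo check of eq. (15) in one dimension: estimate
# (1/(t*delta)) * E[(sum_i delta*xi_i)^2] over a long horizon and compare
# it with sigma^2 * delta * M.
rng = np.random.default_rng(0)
sigma, M, delta = 0.3, 25, 0.01
t, trials = 20000, 200

estimates = []
for _ in range(trials):
    zeta = rng.normal(0.0, sigma / np.sqrt(M), size=t + M)   # eq. (4), scalar case
    # xi_i = sum of the last M zetas  -- moving average, eq. (5)
    xi = np.convolve(zeta, np.ones(M), mode="valid")[:t]
    estimates.append((delta * xi.sum()) ** 2 / (t * delta))

print("Monte Carlo power  :", np.mean(estimates))
print("sigma^2 * delta * M:", sigma**2 * delta * M)   # eq. (15), per dimension
```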