The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy

Naoto Yoshida
School of Biomedical Engineering, Tohoku University
Sendai, Aramaki Aza Aoba 6-6-01
Email: [email protected]

Abstract—Learning and acting in a high-dimensional state-action space is one of the central problems in the reinforcement learning (RL) community. Recently developed model-free reinforcement learning algorithms based on energy-based models have been shown to be effective in such domains. However, because the previous algorithms used neural networks with stochastic hidden units, they required Monte Carlo sampling methods for action selection. In this paper, we investigate actor-critic algorithms based on the energy-based model, and in particular use neural networks with deterministic hidden units as the actor. By introducing the deterministic hidden units, we show that the gradient of the objective function is proportional to the gradient of the energy function. We then show the relationship between the energy-based approach and conventional policy gradient algorithms by introducing a specific energy function. We reveal that the specificity of the RL paradigm severely disturbs learning of the actor when a multilayer perceptron is used as the representation of the policy, and we therefore introduce a "twin net" architecture. Finally, we empirically show the effectiveness of this architecture in several discrete domains.

I. INTRODUCTION

Reinforcement learning (RL) is a paradigm of machine learning in which artificial agents learn optimal action sequences from interactions with their environments. One of the central problems in recent studies of RL is efficient learning in environments with a large state-action set. Actor-critic algorithms are model-free RL algorithms that explicitly separate the evaluation of the current policy from the representation of the policy; hence, the policy over an infinitely large state-action set can be compactly represented by arbitrary smooth functions, such as multilayer perceptrons [1][2][3]. Conventional actor-critic algorithms treat "1 of K" actions, in which the agent at each time step selects one action from K discrete actions, or continuous vector actions, in which actions are represented by real-valued vectors. However, there is an important class of action representations in which an action is a K-bit vector of binary values. This type of action is challenging for RL algorithms because the size of the action set grows exponentially with the length of the vector. In this paper, we call this class of actions K-bit binary vector actions.

In recent studies on online model-free RL algorithms, energy-based RL has appeared to be a promising approach to learning in a large state-action set, especially when the actions are represented by binary vector actions [4][5][6][7][8]. Energy-based RL algorithms represent a policy based on energy-based models [9]. In this approach, the probability of a state-action pair is represented by the Boltzmann distribution and an energy function, and the policy is the conditional probability of the action given the current state of the agent. Previous studies on energy-based RL utilized the framework of restricted Boltzmann machines (RBMs) [10][11][12] to represent the probability over the state-action pairs. There are two approaches in energy-based RL: one is the value-based approach, in which the action-value function is approximated by the (free) energy function of the RBM [4][5][6][7]; the other is the actor-critic approach, in which the RBM is used only for the policy and the value function is approximated by a separate deterministic function [8]. However, because RBMs are stochastic neural networks, exact sampling of the action from the Boltzmann distribution is intractable except for small action sets. Therefore, Monte Carlo sampling methods, such as Gibbs sampling, are used as an approximation.
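Concretely, the energy-based policies referred to above take the Boltzmann form: each state-action pair is assigned an energy (for RBMs, a free energy obtained by summing out the hidden units), and the policy is obtained by normalizing over actions. A generic statement of this form, written here for reference (a temperature parameter is often included but omitted here), is

    \pi_\theta(a \mid s) = \frac{\exp\big(-E_\theta(s, a)\big)}{\sum_{a'} \exp\big(-E_\theta(s, a')\big)},

where the sum in the denominator runs over the whole action set. For K-bit binary vector actions this is a sum over 2^K terms, which is why exact sampling from the policy becomes intractable for RBM-based methods and Monte Carlo approximations such as Gibbs sampling are needed.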
In this paper, we investigate energy-based RL and suggest a new actor-critic algorithm. In our approach, we use deterministic neural networks, and the energy functions are also deterministic functions with respect to the state-action pairs. By introducing a specific energy function, we show that exact sampling from the policy is tractable even for large action sets. We also show that the gradient of the objective function, that is, the policy gradient, is proportional to the gradient of the energy function. We then reveal that the specificity of the learning framework of actor-critic algorithms can disturb the learning of the actor when we use multilayer perceptrons. In order to overcome this problem, we introduce a new architecture of multilayer perceptrons, an energy function, and a loss function for the actor-critic algorithm. Finally, we empirically show that the suggested algorithm successfully learns good policies in several environments with discrete state-action sets.

II. BACKGROUND

A. Policy Gradient

We assume that the RL agent acts in an environment modeled by a Markov Decision Process (MDP). An MDP is defined by a 4-tuple ⟨S, A, P, R⟩, where S is the state set with states s ∈ S, and A is the action set with actions a ∈ A. P is the transition rule of the MDP, defined by the conditional probability P(s'|s, a), where s' is the next state given a state s and an action a. R is the reward function r(s, a, s'), which gives the temporal evaluation of the state-action pair. At every time step the agent takes an action from the stochastic policy π(a|s), the conditional probability over the action set a ∈ A given a state s.

The objective function of the RL algorithms is defined by

    J(\pi) = E\Big[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \Big],        (1)

where t denotes the time step, γ denotes the discount factor 0 ≤ γ < 1, r_t = r(s_t, a_t, s_{t+1}), and d^π(s) denotes the discounted transition probability given the policy, d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi) [13]. E[·] is the expectation operator, E[f(s,a)] = \sum_s d^\pi(s) \sum_a \pi(a|s) f(s,a). Model-free RL algorithms try to maximize this objective function J(π) by trial and error, without explicitly modeling the environment.

Many RL algorithms introduce the value function V^π(s) and the action-value function Q^π(s,a), defined by

    V^\pi(s) = E_\pi\Big[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0 = s \Big]

    Q^\pi(s,a) = E_\pi\Big[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \;\Big|\; s_0 = s,\, a_0 = a \Big],

where E_π[·] is the expectation operator over the trajectories given a policy π [14]. The value function and the action-value function satisfy V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a).

If the policy π(a|s) is a smooth function π_θ(a|s) with respect to the parameter θ, the gradient of the objective function is given by

    \nabla_\theta J(\theta) = \sum_s d^\pi(s) \sum_a \big[ Q^\pi(s,a) - V^\pi(s) \big] \nabla_\theta \pi_\theta(a|s)
                            = \sum_s d^\pi(s) \sum_a A^\pi(s,a)\, \nabla_\theta \pi_\theta(a|s)        (2)
                            = E\big[ A^\pi(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \big],            (3)

where A^π(s,a) = Q^π(s,a) − V^π(s) is the advantage function. The advantage function has the property \sum_a \pi(a|s) A^\pi(s,a) = \sum_a \pi(a|s) Q^\pi(s,a) - V^\pi(s) = 0.
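Both the substitution of A^π for Q^π in (2) and the zero-mean property above follow from the normalization \sum_a \pi_\theta(a|s) = 1:

    \sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0
    \;\Rightarrow\; V^\pi(s) \sum_a \nabla_\theta \pi_\theta(a|s) = 0,

    \sum_a \pi(a|s)\, A^\pi(s,a) = \sum_a \pi(a|s)\, Q^\pi(s,a) - V^\pi(s) \sum_a \pi(a|s) = V^\pi(s) - V^\pi(s) = 0.

The first line shows that the state-value baseline contributes nothing to the gradient, so Q^π in the policy gradient theorem can be replaced by the advantage; the second uses V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a).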
In the actor-critic architecture, we approximate the value function of the current policy by a parametric function V_v(s), where v is the parameter of the function. Because the temporal difference (TD) error \delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t) is an unbiased estimate of the advantage function, one approximation of the advantage is obtained from the approximated value function, \tilde{\delta}_t = r_t + \gamma V_v(s_{t+1}) - V_v(s_t) [16][17]. Then the update of the parameter θ is given by

    \Delta\theta_t = \alpha\, \tilde{\delta}_t\, \nabla_\theta \log \pi_\theta(a_t|s_t),        (4)

where α is the learning rate.

B. Energy-based Reinforcement Learning

In this paper, we call an approach energy-based RL when the policy π_θ is defined by some energy function E_θ. The energy-based RL algorithms in previous studies used the energy function

    E_\theta(s, a, h) = s^\top W_{sh} h + a^\top W_{ah} h + b_s^\top s + b_a^\top a + b_h^\top h,

where s, a, and h are the state, the action, and the hidden units, each represented by a stochastic binary vector in the RBM. W_{xy} denotes the bidirectional connection between x and y, b_x is the bias vector of x, and ^\top denotes the transpose of a matrix or vector. The parameters of the energy function are θ = {W_{sh}, W_{ah}, b_s, b_a, b_h}.
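As an illustration of this bilinear energy (not code from the paper), the following minimal NumPy sketch evaluates E_θ(s, a, h) for binary vectors; the dimensions, the random initialization, and the variable names are assumptions made only for this example.

import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes (illustrative only): 6 state bits, K = 4 action bits, 8 hidden units.
n_s, n_a, n_h = 6, 4, 8

# Parameters theta = {W_sh, W_ah, b_s, b_a, b_h}, randomly initialized for the example.
W_sh = 0.01 * rng.standard_normal((n_s, n_h))
W_ah = 0.01 * rng.standard_normal((n_a, n_h))
b_s = np.zeros(n_s)
b_a = np.zeros(n_a)
b_h = np.zeros(n_h)

def energy(s, a, h):
    """Bilinear RBM-style energy E_theta(s, a, h) of a binary configuration (s, a, h)."""
    return s @ W_sh @ h + a @ W_ah @ h + b_s @ s + b_a @ a + b_h @ h

# One binary state, one K-bit binary vector action (one of the 2**K possibilities),
# and one configuration of the stochastic hidden units.
s = rng.integers(0, 2, n_s)
a = rng.integers(0, 2, n_a)
h = rng.integers(0, 2, n_h)
print(energy(s, a, h))

Turning such an energy into the Boltzmann policy requires normalizing over all 2^K binary actions (and, for an RBM, marginalizing the hidden units), which is exactly the step that becomes intractable as K grows and which motivates the deterministic-network formulation investigated in this paper.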