
The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy
Naoto Yoshida

To cite this version:

Naoto Yoshida. The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy. 2015. hal-01138709v2

HAL Id: hal-01138709
https://hal.archives-ouvertes.fr/hal-01138709v2
Preprint submitted on 19 Oct 2015

The Architecture of a Multilayer Perceptron for Actor-Critic Algorithm with Energy-based Policy

Naoto Yoshida
School of Biomedical Engineering, Tohoku University
Sendai, Aramaki Aza Aoba 6-6-01
Email: [email protected]

Abstract—Learning and acting in a high dimensional state-action space is one of the central problems in the reinforcement learning (RL) community. Recently developed model-free algorithms based on an energy-based model have been shown to be effective in such domains. However, since the previous algorithms used neural networks with stochastic hidden units, these algorithms required Monte Carlo sampling methods for the action selections. In this paper, we investigate actor-critic algorithms based on the energy-based model, and we especially use neural networks with deterministic hidden units as an actor. Introducing the deterministic hidden units, we show that the gradient of the objective function is proportional to the gradient of the energy function. Then, we show the relationships between the energy-based approach and the conventional policy gradient algorithms by introducing a specific energy function. We reveal that the specificity of the RL paradigm severely disturbs the learning of an actor when we use a multilayer perceptron as the representation of the policy. We therefore introduce a "twin net" architecture. Finally, we empirically show the effectiveness of this architecture in several discrete domains.

I. INTRODUCTION

Reinforcement learning (RL) is a paradigm of machine learning in which artificial agents learn optimal action sequences from interactions with their environments. One of the central problems in recent studies of RL is efficient learning in environments with a large state-action set. Actor-critic algorithms are model-free RL algorithms that explicitly separate the evaluation of the current policy from the representation of the policy; hence, the policy in an infinitely large state-action set can be compactly represented by arbitrary smooth functions, such as multilayer perceptrons [1][2][3].

The conventional actor-critic algorithms treat "1 of K" type actions, in which the agent at each time step selects one action from K discrete actions, or continuous vector actions, in which actions are represented by real valued vectors. However, there is an important class of representations for actions, which is represented by a K bits vector of binary values. This type of action is challenging for RL algorithms because the size of the action set grows exponentially with the length of the vector. In this paper, we call this class of actions the K bits binary vector actions.

In recent studies on online model-free RL algorithms, energy-based RL appeared to be a promising approach to tackle learning in a large state-action set, especially when the actions are represented by binary vector actions [4][5][6][7][8]. Energy-based RL represents a policy based on energy-based models [9]. In this approach the probability of the state-action pair is represented by the Boltzmann distribution and the energy function, and the policy is a conditional probability given the current state of the agent. Previous studies on energy-based RL utilized the framework of restricted Boltzmann machines (RBMs) [10][11][12] to represent the probability over the state-action pair. There are two approaches in energy-based RL: one is the value-based approach, in which the action-value function is approximated by the (free) energy function of the RBMs [4][5][6][7]; the other is the actor-critic approach, in which the RBMs are used only for the policy and the value function is approximated by another deterministic function [8]. However, because the RBMs are stochastic neural networks, exact sampling of the action from the Boltzmann distribution is intractable except when considering a small action set. Therefore, Monte Carlo sampling methods, such as Gibbs sampling, are used as an approximation.

In this paper, we investigate energy-based RL and suggest a new actor-critic algorithm. In our approach, we use deterministic neural networks, and the energy functions are also deterministic functions with respect to the state-action pairs. By introducing a specific energy function, we show that exact sampling from the policy is tractable, even when we consider large action sets. We also show that the gradient of the objective function, that is, the policy gradient, is proportional to the gradient of the energy function. Then, we reveal that the specificity of the learning framework of actor-critic algorithms can disturb the learning of the actor when we use multilayer perceptrons. In order to overcome this problem, we introduce a new architecture of multilayer perceptrons, an energy function, and a learning procedure for the actor-critic algorithm. Finally, we empirically show that the suggested algorithm successfully learns good policies in several environments with discrete state-action sets.

II. BACKGROUNDS

A. Policy Gradient

We assume that the RL agents act in an environment modeled by a Markov decision process (MDP). An MDP is defined by the 4-tuple ⟨S, A, P, R⟩, where S is the state set and s ∈ S is a state, and A is the action set and a ∈ A is an action. P is the transition rule of the MDP, which is defined by the conditional probability P(s'|s, a), where s' is the next state given a state s and an action a. R is the reward function r(s, a, s'), which gives the temporal evaluation of the state-action pair. The agent takes an action at every time step from the stochastic policy π(a|s), which is the conditional probability over the action set a ∈ A given a state s.

The objective function of the RL algorithms is defined by

J(π) = E[ Σ_{t=1}^{∞} γ^{t−1} r_t ],

where t denotes the time step, γ denotes the discount factor 0 ≤ γ < 1, r_t = r(s_t, a_t, s_{t+1}), and d^π(s) denotes the discounted transition probability given the policy, d^π(s) = Σ_{t=0}^{∞} γ^t P(s_t = s | s_0, π) [13]. E[·] is the expectation operator E[f(s, a)] = Σ_s d^π(s) Σ_a π(a|s) f(s, a). The model-free RL algorithms try to maximize this objective function J(π) from trials and errors, without explicitly modeling the environment.

Many RL algorithms introduce the value function V^π(s) and the action-value function Q^π(s, a) by

V^π(s) = E_π[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_0 = s ],
Q^π(s, a) = E_π[ Σ_{t=1}^{∞} γ^{t−1} r_t | s_0 = s, a_0 = a ],

where E_π[·] is the expectation operator that denotes the expectation over the trajectories given a policy π [14]. The value function and the action-value function have the relationship V^π(s) = Σ_a π(a|s) Q^π(s, a).

If the policy π(a|s) is a smooth function π_θ(a|s) with respect to the parameter θ, the objective function can be treated as a function of the parameter, J(π_θ) = J(θ). Then the gradient of J(θ) with respect to the parameters is given by

∇_θ J(θ) = Σ_s d^π(s) Σ_a Q^π(s, a) ∇_θ π_θ(a|s).   (1)

This relationship is called the policy gradient theorem [13]. However, learning the parameters by directly applying equation (1) is known to be slow because of the large variance of ∇_θ J(θ). Therefore, usually a baseline function F(s) is subtracted from Q^π(s, a) as

∇_θ J(θ) = Σ_s d^π(s) Σ_a ( Q^π(s, a) − F(s) ) ∇_θ π_θ(a|s).

Because Σ_a π_θ(a|s) = 1 ⇔ Σ_a ∇_θ π_θ(a|s) = 0, we have Σ_a ∇_θ π_θ(a|s) F(s) = 0. Therefore, the subtraction above does not change the direction of the gradient [15]. When we define the baseline F(s) = V^π(s), the policy gradient is given by

∇_θ J(θ) = Σ_s d^π(s) Σ_a ( Q^π(s, a) − V^π(s) ) ∇_θ π_θ(a|s)
         = Σ_s d^π(s) Σ_a A^π(s, a) ∇_θ π_θ(a|s)   (2)
         = E[ A^π(s, a) ∇_θ log π_θ(a|s) ],   (3)

where A^π(s, a) = Q^π(s, a) − V^π(s) is the advantage function. The advantage function has the property Σ_a π(a|s) A^π(s, a) = Σ_a π(a|s) Q^π(s, a) − V^π(s) = 0.

In the actor-critic architecture, we approximate the value function of the current policy by a parametric function V_v(s), where v is the parameter of the function. Because the temporal difference (TD) error δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t) is an unbiased estimate of the advantage function, one approximation of the advantage is calculated by using the approximated value function, δ̃_t = r_t + γ V_v(s_{t+1}) − V_v(s_t) [16][17]. Then the update of the parameter ∆θ is given by

∆θ_t = α δ̃_t ∇_θ log π_θ(a|s),   (4)

where α is the learning rate.
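To make equations (1) through (4) concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of one actor-critic step for a linear-softmax policy over K discrete actions: the critic's TD error serves as the advantage estimate, and the actor follows the update of equation (4). The function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, v, phi_s, phi_s_next, a, r,
                      alpha=0.1, alpha_c=0.1, gamma=0.95):
    """One actor-critic update for a linear-softmax policy (illustrative only).

    theta : (K, D) actor parameters, policy pi(a|s) = softmax(theta @ phi(s))
    v     : (D,)   critic parameters, V(s) = v @ phi(s)
    """
    # Critic: TD error used as an estimate of the advantage.
    delta = r + gamma * (v @ phi_s_next) - (v @ phi_s)
    v = v + alpha_c * delta * phi_s

    # Actor: gradient of log pi(a|s) for the softmax policy, then equation (4).
    pi = softmax(theta @ phi_s)
    grad_log_pi = -np.outer(pi, phi_s)
    grad_log_pi[a] += phi_s
    theta = theta + alpha * delta * grad_log_pi
    return theta, v, delta
```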

B. Energy-based Reinforcement Learning

In this paper, we call an approach energy-based RL when the policy π is defined by some energy function E_θ. The energy-based RL algorithms in previous studies used the energy function

E_θ(s, a, h) = s^⊤ W_sh h + a^⊤ W_ah h + b_s^⊤ s + b_a^⊤ a + b_h^⊤ h,

where s, a, and h are the state, the action, and the hidden units represented by stochastic binary vectors in the RBMs. W_xy denotes the bidirectional connection between x and y, b_x is the bias vector of x, and ^⊤ denotes the transpose of a matrix or vector. The parameter of the energy function θ is then θ = {W_sh, W_ah, b_s, b_a}. In previous studies, the joint probability was defined by using the energy function and the Boltzmann distribution as

P(s, a, h; θ) = e^{−β E_θ(s, a, h)} / Σ_{s', a', h'} e^{−β E_θ(s', a', h')}.

Then the policy is represented by the conditional probability

π_θ(a|s) = P(a|s; θ) = e^{−β F_θ(s, a)} / Σ_{a'} e^{−β F_θ(s, a')},

where F_θ(s, a) denotes the free energy F_θ(s, a) = −(1/β) log Σ_h e^{−β E_θ(s, a, h)}. Energy-based RL was first developed by Sallans and Hinton [4]. In their approach, the free energy is used for the function approximation of the action-value function, while the parameter θ is updated by using standard TD-based reinforcement learning algorithms [4][5][6][7].

Actor-critic algorithms based on energy-based policies were suggested by Heess et al. [8]. In their approach, the actor is represented by the energy-based policy, while using a critic architecture based on conventional TD learning with function approximation.

In previous approaches to energy-based RL, the stochastic hidden units h were introduced. Hence an exact sampling from an energy-based policy can easily become intractable if we assume a large action set, that is, high dimensional actions. To overcome this intractability, previous approaches used (blocked) Gibbs sampling as an approximation of the action sampling in a large action set. However, if we use more complex energy models than RBMs, like deep Boltzmann machines [18], sampling actions from the models can be computationally expensive. Also, the use of (blocked) Gibbs sampling requires iterative samplings and some additional experiences to obtain good performances.
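To make the free-energy policy concrete, the following Python sketch (my own construction under the energy definition above, with the hidden bias omitted as in the stated parameter set) computes F_θ(s, a) in closed form for binary hidden units and enumerates the Boltzmann policy over a small explicit action set.

```python
import numpy as np

def free_energy(s, a, W_sh, W_ah, b_s, b_a, beta=1.0):
    """Free energy F(s,a) = -(1/beta) log sum_h exp(-beta * E(s,a,h)).

    Follows the energy E = s.Wsh.h + a.Wah.h + b_s.s + b_a.a (hidden bias
    dropped, as in the parameter set {Wsh, Wah, b_s, b_a}). For binary hidden
    units the sum over h factorizes, giving one softplus term per hidden unit.
    """
    x = s @ W_sh + a @ W_ah                        # input to each hidden unit
    return (b_s @ s + b_a @ a) - np.logaddexp(0.0, -beta * x).sum() / beta

def boltzmann_policy(s, actions, params, beta=1.0):
    """Exact Boltzmann policy over a small, explicitly enumerated action set."""
    F = np.array([free_energy(s, a, *params, beta=beta) for a in actions])
    logits = -beta * F
    logits -= logits.max()                         # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```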

III. ENERGY-BASED ACTOR-CRITIC

The previous approaches to energy-based RL utilized RBMs, and hence they used stochastic hidden units h. Because the intractability of exact sampling from the energy-based policy mainly comes from these stochastic hidden units, we instead investigate the energy-based policy with deterministic hidden units. From this perspective, we derive the policy gradient based on the energy-based policy. Also, we show that normalization terms of the learning can be naturally introduced into the policy gradient.

A. Actor-Critic Algorithm for Energy-based Policy

In our approach, we assume that the joint probability is given by

P(s, a; θ) = e^{−β E_θ(s, a)} / Σ_{s', a'} e^{−β E_θ(s', a')},

where β ≥ 0 denotes the inverse temperature and E_θ(s, a) is the energy function parameterized by θ. Hence the policy is given by

π_θ(a|s) = e^{−β E_θ(s, a)} / Σ_{a' ∈ A} e^{−β E_θ(s, a')}.   (5)

The policy gradient theorem has the following property with respect to the energy.

Theorem 3.1: If the policy of the agent is given by the energy-based policy (5), the gradient of the objective function with respect to the parameter is given by

∇_θ J(θ) = −β E[ A^π(s, a) ∇_θ E_θ(s, a) ].   (6)

Proof 3.1: We apply equality (5) to equation (3); then

∇_θ J(θ) = E[ A^π(s, a) ∇_θ log π_θ(a|s) ]
         = E[ A^π(s, a) ∇_θ ( −β E_θ(s, a) − log Σ_b e^{−β E_θ(s, b)} ) ]
         = E[ A^π(s, a) β ( −∇_θ E_θ(s, a) + Σ_b π_θ(b|s) ∇_θ E_θ(s, b) ) ]
         = −β E[ A^π(s, a) ∇_θ E_θ(s, a) ] + β E[ A^π(s, a) Σ_b π_θ(b|s) ∇_θ E_θ(s, b) ]
         = −β E[ A^π(s, a) ∇_θ E_θ(s, a) ].

The last equality is given by the equality

E[ A^π(s, a) Σ_b π_θ(b|s) ∇_θ E_θ(s, b) ]
   = Σ_s d^π(s) ( Σ_b π_θ(b|s) ∇_θ E_θ(s, b) ) ( Σ_a π_θ(a|s) A^π(s, a) )
   = 0.   □

We can easily extend this proof to the energy-based policy with stochastic hidden units. From the above equality, if we update the critic V_v(s) by using the standard TD algorithms, we may use the TD error δ_t to calculate the update of the parameter ∆θ by

∆θ_t = −α β δ_t ∇_θ E_θ(s, a),   (7)

where α is the learning rate. Because the advantage function has the property Σ_a π_θ(a|s) A^π(s, a) = 0, we can add β E[ A^π(s, a) ∇_θ L(s, θ) ] = 0 to the right hand side of equation (6), where L(s, θ) is an arbitrary smooth function with respect to θ. Then the policy gradient has the form

∇_θ J(θ) = −β E[ A^π(s, a) ∇_θ ( E_θ(s, a) − L(s, θ) ) ].

This equality suggests that a regularization term, which is often used in gradient-based optimization, can also be introduced into the policy gradient. The original policy gradient given by equation (3) is the special case in which the normalization term is given by L(s, θ) = −(1/β) log Σ_b exp(−β E_θ(s, b)).
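A minimal sketch of the update in equations (6) and (7), assuming for illustration a simple bilinear energy whose gradient is available analytically; the actual method uses multilayer perceptrons and backpropagation, so this is only a stand-in for the general scheme, and the helper names are mine.

```python
import numpy as np

def energy(W, phi_s, a):
    """Illustrative energy E_W(s, a) = -a . (W @ phi(s)); any smooth parametric
    energy could be substituted here."""
    return -a @ (W @ phi_s)

def grad_energy(W, phi_s, a):
    """Gradient of the energy above with respect to W."""
    return -np.outer(a, phi_s)

def energy_actor_update(W, v, phi_s, phi_s_next, a, r,
                        alpha=0.1, alpha_c=0.1, beta=1.0, gamma=0.95):
    """One energy-based actor-critic step: the actor moves its parameters along
    -alpha * beta * delta * grad E, as in equation (7)."""
    delta = r + gamma * (v @ phi_s_next) - (v @ phi_s)   # TD error from the critic
    v = v + alpha_c * delta * phi_s                      # critic update
    W = W - alpha * beta * delta * grad_energy(W, phi_s, a)
    return W, v, delta
```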

B. Definition of the Energy Functions

In this section, we introduce two energy functions, and the relationships between the energy-based RL and the conventional policy gradient algorithms are discussed.

When the environment is defined with discrete actions, such as 1 of K type discrete actions or the K bits binary action vectors, we can use the cross-entropy energy

E_θ(s, a) = −Σ_{i=1}^{K} ( a^i log μ_θ^i(s) + (1 − a^i) log(1 − μ_θ^i(s)) ),   (8)

where x^i is the i-th element of the vector x and μ_θ(s) is the output vector of some deterministic function. In this paper, we assume that this function is represented by a multilayer perceptron with K output units, and that each element of μ_θ(s) is in the range 0 < μ_θ^i(s) < 1, which can easily be implemented in multilayer perceptrons by applying logistic output units.

When we consider the 1 of K type discrete actions, we can explicitly calculate equation (8) for each of the K actions and take one of the K actions according to the Boltzmann distribution (5). If we choose β = 1 and the sum of the output μ_θ(s) is normalized to one, the energy-based update is equivalent to the conventional policy gradient algorithm for K discrete actions, which uses μ_θ(s) as the representation of the stochastic policy. This policy with the energy-based update is equivalent to the classical actor-critic algorithm, in which the actor is trained through backpropagation [2][19].
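As an illustration of the 1 of K case, the toy sketch below (mine, not from the paper) evaluates the cross-entropy energy (8) for every one-hot action and forms the Boltzmann policy (5) over them.

```python
import numpy as np

def cross_entropy_energy(a, mu):
    """Equation (8): E(s,a) = -sum_i [a_i log mu_i + (1 - a_i) log(1 - mu_i)]."""
    return -np.sum(a * np.log(mu) + (1 - a) * np.log(1 - mu))

def one_of_k_policy(mu, beta=1.0):
    """Boltzmann policy (5) over the K one-hot actions, given outputs mu in (0,1)."""
    K = len(mu)
    actions = np.eye(K)
    energies = np.array([cross_entropy_energy(a, mu) for a in actions])
    logits = -beta * energies
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

# Example usage with K = 3 outputs in (0, 1).
print(one_of_k_policy(np.array([0.7, 0.2, 0.1])))
```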
When we consider the K bits binary vector actions, we substitute equation (8) into the Boltzmann distribution (5). Then we have

π_θ(a|s) = e^{−β E_θ(s,a)} / Σ_b e^{−β E_θ(s,b)}
         = exp( β Σ_i [ a^i log μ_θ^i(s) + (1 − a^i) log(1 − μ_θ^i(s)) ] ) / Σ_b exp( β Σ_i [ b^i log μ_θ^i(s) + (1 − b^i) log(1 − μ_θ^i(s)) ] )
         = Π_i exp( β [ a^i log μ_θ^i(s) + (1 − a^i) log(1 − μ_θ^i(s)) ] ) / Σ_{b^i ∈ {0,1}} exp( β [ b^i log μ_θ^i(s) + (1 − b^i) log(1 − μ_θ^i(s)) ] )
         = Π_i π(a^i | s, θ),

where π(a^i = 1 | s, θ) is given by

π(a^i = 1 | s, θ) = e^{β log μ_θ^i(s)} / Σ_{b^i ∈ {0,1}} e^{β [ b^i log μ_θ^i(s) + (1 − b^i) log(1 − μ_θ^i(s)) ]}
                  = μ_θ^i(s)^β / ( μ_θ^i(s)^β + (1 − μ_θ^i(s))^β ).

Therefore, we can sample the action efficiently from the policy. Also, if we use β = 1 and define μ_θ(s) by a multilayer perceptron, we obtain π_θ(a^i = 1|s) = μ_θ^i(s). Again this case is identical to the conventional actor-critic algorithm applied to a policy represented by a multilayer perceptron with stochastic output units, as discussed in [8].

Also, we can treat multi-agent settings, such as the case in which there are N agents in the environment and the n-th agent takes the action a_n from K_n discrete actions. Because the energy can be decomposed as E_θ(s, a) = Σ_{n=1}^{N} E_θ^n(s, a_n), the whole policy is also decomposed as

π_θ(a|s) = Π_{n=1}^{N} π_θ^n(a_n|s),

where π_θ^n(a_n|s) is the sub-policy for the n-th agent, given by

π_θ^n(a_n|s) = e^{−β E_θ^n(s, a_n)} / Σ_{b_n} e^{−β E_θ^n(s, b_n)}.

If we treat continuous vector actions, we can introduce the energy function

E_θ(s, a) = (1/2) ||a − μ_θ(s)||₂,   (9)

where ||·||₂ is the squared norm and μ_θ(s) is the output vector of some deterministic function. In this case, the Boltzmann distribution is obtained by

π_θ(a|s) = e^{−β E_θ(s,a)} / ∫ e^{−β E_θ(s,b)} db
         = e^{−(β/2)(a − μ_θ(s))^⊤(a − μ_θ(s))} / ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} e^{−(β/2)(b − μ_θ(s))^⊤(b − μ_θ(s))} db
         = N(μ_θ(s), I/β).

Hence the policy is given as the Gaussian policy with mean μ_θ(s) and covariance matrix I/β, with the identity matrix I. This equation suggests that energy-based RL with the squared norm energy function is identical to the policy gradient algorithm with respect to the Gaussian policy with mean parametrization.

In the following sections, we focus on the energy-based RL with the 1 of K type discrete actions and the K bits binary vector actions.
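The factorization above means that a K bits binary action can be drawn with K independent Bernoulli draws instead of enumerating all 2^K action vectors. A minimal sketch, assuming logistic outputs mu in (0, 1):

```python
import numpy as np

def bit_probabilities(mu, beta=1.0):
    """Per-bit probability pi(a_i = 1 | s) = mu_i^beta / (mu_i^beta + (1 - mu_i)^beta)."""
    num = mu ** beta
    return num / (num + (1.0 - mu) ** beta)

def sample_binary_action(mu, beta=1.0, rng=np.random):
    """Exact sample from the factorized policy: K independent Bernoulli draws,
    avoiding enumeration of all 2**K binary action vectors."""
    p = bit_probabilities(mu, beta)
    return (rng.random(len(mu)) < p).astype(int)

# Example: K = 4 bits, as in the grid world with binary vector actions.
print(sample_binary_action(np.array([0.9, 0.8, 0.1, 0.2])))
```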

IV. ENERGY-BASED ARCHITECTURE OF MULTILAYER PERCEPTRON FOR RL DOMAIN

The direction of the update (7) is preserved even if we replace βδ_t by sign(βδ_t), where sign(·) is the sign function. Then, the update rule is given by

∆θ_t = −α sign(βδ_t) ∇_θ E_θ(s_t, a_t).   (10)

This update rule suggests a very simple rule: if the current TD error is positive, decrease the energy with respect to the current state-action pair, otherwise increase the energy. This learning procedure, which decreases the energy of the desired input-output pair and increases the energy of the undesired input-output pair, can be seen as energy-based learning (EL) [9]. Because the energy function in the previous section can be seen as a conventional cost function (and regularization terms) with the teaching signal a_t and the output μ_θ(s_t), if we use a multilayer perceptron to represent μ_θ(s_t), the gradients of the energies are efficiently calculated through the standard backpropagation algorithm. While updating the actor by backpropagation was suggested in classical studies [2][19][15], we show in the following section that the learning of the actor with a single fully-connected multilayer perceptron can be very slow, even in simple classification tasks with positive and negative TD errors.

A. Problem of EL with Multilayer Perceptron in RL domain

One of the most prominent differences between supervised learning and RL, especially the learning of the actor in the actor-critic, is the quality of the training data set. That is, in supervised learning, the data set always gives correct pairs of inputs and outputs. We will call this kind of data the clean data set. On the other hand, in RL, the agent often receives incorrect pairs that are unwanted. We will call data that includes this kind of incorrect data the dirty data set. Energy-based learning with dirty data can disturb the learning even in a simple classification task.

Here, we show an example in which the actor (agent) with a 3-layer perceptron learns image classification with the MNIST handwritten digit database [20]. The MNIST database consists of a 60,000 sample training data set and a 10,000 sample test data set, and the data are 28×28 gray scale images and their corresponding labels. In this example, we scaled all the pixels of the input images to real values in the range of [0, 1]. In order to mimic the condition of RL, we artificially generated incorrect data, in which the input images and the given labels do not match. Examples of the correct data and the incorrect data are shown in Figure 1. Also, we attached correct-incorrect labels to the correct and incorrect data, which correspond to the sign of the TD errors in the context of RL. For the correct data, we gave sign(βδ) = 1, and otherwise sign(βδ) = −1. We used stochastic gradient descent with a constant step size parameter α = 0.1 to train all of the multilayer perceptrons in this task. At each time step, the agent receives correct data with a probability of 0.001, and otherwise receives incorrect data.

Fig. 1. Examples of correct data and incorrect data. Correct data are composed of an image and a corresponding number ("2" in this case) and sign(βδ) = 1. In the case of incorrect data, an image is also provided but the number is wrong (the right answer corresponding to the image is "9" in this case, but "5" is provided in this data) and sign(βδ) = −1.

We first trained the energy-based model, which uses a 3-layer perceptron with 20 hidden units with logistic activation function and 10 output units with softmax activation function, to represent μ_θ(s). In the test phase, the models were tested by using 100 random test images from the test data set, and the output of the model in the test was the lowest-energy output given an image, which is greedy action selection in terms of the RL context. The blue line in Figure 2 is the result of the simple energy-based learning explained above. And the broken line is the result of the same architecture, but the clean data set was used in the training, which was exactly the same as supervised learning. The figure shows that the performance of the simple energy-based learning with the dirty data is almost at the chance level.

Fig. 2. The difficulty in learning of an actor using a multilayer perceptron and dirty data: The three curves in the figure represent the means and the standard deviations of the test errors over 10 trials of the MNIST classification task, including 99.9% incorrect data with "correct-incorrect" labels. The vertical axis is the mean test error rate, the horizontal axis is the number of updates. The blue line is the energy-based updates. The green line is the CACLA-type updates. The red line is the energy-based updates with the twin net. The broken line represents the energy-based updates without the incorrect data, which is the same as supervised learning.

One approach to avoid this problem is to filter out the incorrect data and train the model only when the TD error is positive (sign(βδ) > 0). This approach was suggested by van Hasselt et al., and it is named the actor-critic learning automaton (ACLA) for discrete actions and the continuous ACLA (CACLA) for continuous actions. It has been empirically confirmed that (C)ACLA-type updates improve the learning speed in several RL domains [21][22]. The update rule of CACLA is exactly the same as equation (10) except that the sign function is replaced by the Heaviside function h(·)¹.

The CACLA-type updates were tested on the MNIST classification task, and the results are shown in Figure 2 as the green line. This figure shows that the CACLA-type update clearly outperforms the simple energy-based update.

¹The Heaviside function h(x) is given by h(x) = 1 (x > 0), 0 (x < 0), 0.5 (x = 0).
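The following sketch (my own, with hypothetical helper names) mimics the dirty-data protocol described above and lists the three weightings of the TD-error sign compared in this section; the MNIST loading and the perceptron itself are left out.

```python
import numpy as np

def dirty_stream(images, labels, p_correct=0.001, n_classes=10, rng=np.random):
    """Generate (image, label, sign) samples mimicking the RL condition: with
    probability p_correct the true label and sign = +1 are given, otherwise a
    wrong label and sign = -1 (a stand-in for sign(beta * delta))."""
    while True:
        i = rng.randint(len(images))
        if rng.random() < p_correct:
            yield images[i], labels[i], +1.0
        else:
            wrong = (labels[i] + rng.randint(1, n_classes)) % n_classes
            yield images[i], wrong, -1.0

# The three weightings of the TD-error sign compared in the text
# (the h(0) = 0.5 case of the Heaviside function is omitted for brevity):
normal = lambda d: d                   # Normal update, f(x) = x
cacla = lambda d: float(d > 0)         # CACLA-type update, f(x) = h(x)
signed = lambda d: float(np.sign(d))   # sign update of equation (10)
```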

B. Twin net: an Architecture of Energy-based Model for RL domain

Although the CACLA-type update rules simply discard the incorrect data with negative TD errors, it is natural to assume that the incorrect data also include some useful information for learning the optimal policy. For example, if the agent receives negative rewards after almost all state transitions and the values at all states are initialized to zero, the actor receives mostly negative TD errors at the beginning of the learning; hence the agent is required to learn the policy by elimination. In this kind of environment, the actor learning with the CACLA-type update can be less sample-efficient because the data with negative TD errors are all filtered out and the policy is not updated at most of the time steps.

In order to remove this sample-inefficiency of the CACLA-type update, we explicitly separate the energy function by

E_θ(s, a) = E_θp(s, a) − E_θn(s, a),   (11)

where E_θp(s, a) and E_θn(s, a) are energy functions parameterized by θ_p and θ_n, and we define that both E_θp(s, a) and E_θn(s, a) are given by the cross-entropy energy of equation (8), introducing two deterministic functions, μ_θp(s) and μ_θn(s). By using this energy function, we introduce the loss function

L_θ(s, a) = h(δ) E_θp(s, a) + (1 − h(δ)) E_θn(s, a),   (12)

and we train the actor by

∆θ = −α ∇_θ L_θ(s, a).   (13)

This update is a natural extension of the CACLA-type updates for learning with negative TD errors.

The resultant policy with the energy function (11) for 1 of K type actions is exactly the same as before, and the policy for K bits binary vector actions is given by

π_θ(a|s) = Π_{i=1}^{K} π_θ(a^i|s),

where

π_θ(a^i = 1|s) = ( μ_θp^i(s) / μ_θn^i(s) )^β / ( ( μ_θp^i(s) / μ_θn^i(s) )^β + ( (1 − μ_θp^i(s)) / (1 − μ_θn^i(s)) )^β ).

Because the suggested learning architecture requires two deterministic functions (multilayer perceptrons, in this paper) for representing μ_θp(s) and μ_θn(s), we call this method the "twin net" architecture in this paper (Figure 3).

Fig. 3. The architecture of the twin net (the parameters θ are omitted in this figure). The policy is represented by two energy functions: E_p is trained only when the TD errors are positive, and otherwise E_n is trained.

We also tested this update rule in the MNIST classification task. In this task we used two multilayer perceptrons, one for μ_θp(s) and the other for μ_θn(s). The results are shown as the red line in Figure 2. The figure clearly shows that the twin net can utilize the incorrect data and outperform the CACLA-type update.
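A compact sketch of the twin-net loss (12) and the resulting per-bit policy; the two perceptrons are abstracted to their output vectors mu_p and mu_n, and the parameter update (13) would in practice be carried out by backpropagation, so only the scalar loss and the policy probabilities are computed here.

```python
import numpy as np

def cross_entropy_energy(a, mu):
    """Equation (8) for one state: E = -sum_i [a_i log mu_i + (1 - a_i) log(1 - mu_i)]."""
    return -np.sum(a * np.log(mu) + (1 - a) * np.log(1 - mu))

def twin_loss(a, mu_p, mu_n, delta):
    """Equation (12): train E_p when the TD error is positive, E_n otherwise.
    Minimizing this with respect to the corresponding network's parameters
    gives the update (13)."""
    h = 1.0 if delta > 0 else (0.5 if delta == 0 else 0.0)  # Heaviside, h(0) = 0.5
    return h * cross_entropy_energy(a, mu_p) + (1.0 - h) * cross_entropy_energy(a, mu_n)

def twin_bit_probabilities(mu_p, mu_n, beta=1.0):
    """Per-bit policy of the twin net for K bits binary vector actions."""
    on = (mu_p / mu_n) ** beta
    off = ((1.0 - mu_p) / (1.0 - mu_n)) ** beta
    return on / (on + off)
```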

V. EXPERIMENT

We verified the twin net in three domains (grid world, grid world with binary vector actions, and blocker task) and compared it with the conventional actor-critic method (Normal) and the actor-critic with the CACLA-type update (CACLA).

A. Agents

In this experiment, we constructed the critic by TD learning with linear function approximation. That is, the agent receives a feature vector φ(s) at state s, and the value function at s is approximated by the linear function V_v(s) = v^⊤ φ(s), where v is the parameter vector. The update of the parameters given the transition s_t → s_{t+1} is

δ_t = r_t + γ V_{v_t}(s_{t+1}) − V_{v_t}(s_t),
v_{t+1} = v_t + α_c δ_t ∇_v V_v(s)|_{v=v_t},

where α_c is the step size parameter of the critic. In the experiment, we initialized all elements of v to zero.

We used 3-layer perceptrons with 20 hidden units with sigmoid activation function, and we set φ(s) as the input for all the multilayer perceptrons used in the actor². In the grid world task (GW), we used softmax output units. In the grid world with binary vector actions task (GWBV), we used logistic output units. And in the blocker task, we used agent-wise softmax output units (4 actions × 3 agents). The parameters of the multilayer perceptrons θ were initialized so that the parameters connected with the output units were all zero, and the parameters between the input units and the hidden units were sampled from a uniform distribution over [−1/N_in, 1/N_in], where N_in is the number of input units.

For the twin net architecture, we updated the actor by equation (13). For comparison, we updated the actor with a single multilayer perceptron with the rule

∆θ_t = −α f(βδ_t) ∇_θ E_θ(s, a).

If f(x) = x, we call this update the Normal update, and if f(x) = h(x), we call this update the CACLA-type update. We note that the Normal update was very slow in the GWBV task in the range of the step sizes for the actor explained below, so we used f(x) = sign(x) instead of f(x) = x in this task.

We used β = 1 and γ = 0.95 in all domains; therefore the Normal update was equivalent to the conventional actor-critic algorithm. For the critic, we used α_c = 0.1 in the GW and GWBV. In the blocker task, we used α_c = 0.02. For the actor updates, we tested α from {1.0, 0.6, 0.3} × 10^{−I} where I = 1, 2, 3, 4, 5 and compared the best performance of each update rule. The actual learning rates compared in the figures are shown in Table I.

TABLE I
STEP SIZE PARAMETERS OF THE ACTOR NETWORKS

Method   | GW    | GWBV  | Blocker
Normal   | 0.1   | 0.1   | 0.003
CACLA    | 0.003 | 0.003 | 0.00006
Twin net | 0.06  | 0.1   | 0.06

²In a preliminary experiment, the numbers of hidden units for the Normal and CACLA-type updates were varied to 50 and 100. However, the performances did not change very much.
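A sketch of the actor initialization described above; the helper is hypothetical, and setting the biases to zero is my assumption (the text only specifies the weights).

```python
import numpy as np

def init_actor_params(n_in, n_hidden=20, n_out=4, rng=np.random):
    """Initialize a 3-layer perceptron actor as described in Section V-A:
    hidden-layer weights ~ U[-1/N_in, 1/N_in], output-layer weights zero,
    so the initial policy is uniform regardless of the feature vector."""
    limit = 1.0 / n_in
    return {
        "W1": rng.uniform(-limit, limit, size=(n_hidden, n_in)),
        "b1": np.zeros(n_hidden),            # zero biases (assumed)
        "W2": np.zeros((n_out, n_hidden)),   # zero output weights
        "b2": np.zeros(n_out),
    }
```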

B. Environments

1) Grid World (GW): In this environment, the task is the shortest path problem from the start state to the goal state. We used the 47-state grid world suggested by Sutton, as shown in Figure 4(a). The feature vector φ(s) in the GW is a 48 bit binary feature; only one of the first 47 bits, which corresponds to the current state s, takes one, and all other 46 bits are zero. The value of the 48th bit is always one. The agent has four actions for the 1-step moves toward north, west, south, and east. A state transition occurs only when there is a next state in the direction of the taken action at the current state.

The training was done by an episodic rule; the agent starts from the start state 'S' in all episodes, and one episode ends when the agent enters the goal state 'G' or 800 time steps have passed without reaching the goal. The agent receives a reward +1 when reaching the goal, otherwise the reward is always zero.

Figure 4(b) is the result showing the mean and standard deviation of 10 runs. The vertical axis represents the steps taken in the corresponding episodes, the horizontal axis represents the number of episodes. The broken line represents the performance of the optimal policy (14 steps). This result suggests that the twin net method (red line) learns faster than the other single net methods (blue line: Normal; green line: CACLA). And only the twin net could reach the optimal performance within this number of episodes.

2) Grid World with Binary Vector Actions (GWBV): In this environment, we consider the shortest path problem in the same grid as the previous GW. However, the actions are defined differently; the actions in this task are coded by binary vectors. And the reward is zero when the agent reaches the goal, otherwise the reward −1 is provided. The actions are represented by a 4 bit binary vector, which has 2^4 = 16 variations, and only four of the sixteen actions can move the agent toward the four directions. The relationships are shown in Table II. The action 'Stay' does not cause any state transition in the environment, hence this environment is challenging because the agent needs to learn the "pattern" of the optimal action at each state.

TABLE II
BINARY ACTION VECTORS

Action | Binary Vector
North  | 1,1,0,0
South  | 0,0,1,1
East   | 1,0,1,0
West   | 0,1,0,1
Stay   | otherwise

Figure 4(c) is the result showing the mean and standard deviation of 10 runs. Again the vertical axis represents the steps taken in the corresponding episodes, the horizontal axis represents the number of episodes. The broken line represents the performance of the optimal policy (14 steps). Figure 4(c) clearly shows that only the suggested twin net method (red line) could learn the optimal policy in this experiment, while the other methods (blue line: Normal; green line: CACLA) maintain low quality policies.

3) Blocker Task: The blocker task is a multi-agent task suggested by Sallans and Hinton [4]. This environment consists of a 4 × 7 grid, three agents, and two blockers. To obtain a positive reward, the agents need to cooperate in this environment. Each agent can move in four directions as in the GW task, and the 'team' of agents obtains a +1 reward when one of the three agents enters the end-zone; otherwise the team receives a −1 reward. The feature vector φ(s) is given by the positions (grid cells) of each agent (28 bits × 3 agents), the eastern most position of each blocker (28 bits × 2 blockers), and a bias bit which is always one (1 bit).

The moves of the agents and blockers are ordered. An agent can move in one of the four directions if there is a next position and the next position is not filled by another agent or a blocker; otherwise the agent stays at the same position. To enter the end-zone, one of the agents needs to pass through the grid cell between the blockers. The two blockers can move only east or west and block the agents. The west blocker is responsible for columns 1-4, and the east blocker is responsible for columns 4-7; the behavior of the blockers is pre-programmed to prevent the agents from entering the end-zone [4][8]. One time step ends after the transitions of the three agents and the subsequent two blocker transitions. The agents start from random positions in the bottom row of Figure 5(a), but no agents ever overlap. Each episode ends when one of the agents enters the end-zone (success) or 40 time steps have passed (failure).

Figure 5(b) is the result of the experiment. The vertical axis represents the mean success rate of the last 1000 episodes. And the horizontal axis represents the time steps. This result shows that the twin net learns the successful policy very quickly, while the other methods still fail with some probability after 1.0 × 10^5 time steps.
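For concreteness, a small sketch (hypothetical helpers, with a coordinate convention of my choosing) of the 48 bit grid-world feature vector and the action decoding of Table II:

```python
import numpy as np

# Decoding of Table II: 4 bit action vectors mapped to (row, column) offsets.
ACTION_TABLE = {
    (1, 1, 0, 0): (-1, 0),   # North
    (0, 0, 1, 1): (1, 0),    # South
    (1, 0, 1, 0): (0, 1),    # East
    (0, 1, 0, 1): (0, -1),   # West
}

def decode_action(bits):
    """Any of the other 12 binary vectors means 'Stay' (no state transition)."""
    return ACTION_TABLE.get(tuple(int(b) for b in bits), (0, 0))

def gw_features(state_index, n_states=47):
    """48 bit feature: one-hot over the 47 states plus an always-on bias bit."""
    phi = np.zeros(n_states + 1)
    phi[state_index] = 1.0
    phi[-1] = 1.0
    return phi
```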

700

600

500

400

300

200

100

0

−100 0 50 100 150 200 250 300 350 400

(a) (b) (c)

Fig. 4. (a): Sutton’s grid world. (b) and (c): The performances of the algorithms in the grid world domain (b) and the grid world and the binary action vector domain (c). The vertical axis represents the steps from the start and the goal state. The horizontal axis is the number of episodes. The blue line is a single neural network with energy-based update. The green line is the CACLA-type update. The red line is the proposed method with an twin net.


Fig. 5. (a): The blocker task domain. There are three agents and two blockers in the domain. The blockers are pre-programmed with a strategy to stop the agents. To achieve the goal, the agents need to cooperate. (b): The success ratio of the three different learning methods. The vertical axis is the mean success rate of the last 1000 episodes. The horizontal axis is the time step. The blue line is a single neural network with the energy-based update. The green line is the CACLA-type update. The red line is the proposed method with a twin net.

VI. CONCLUSION

In this paper, we introduced energy-based reinforcement learning with deterministic hidden units. We showed that the policy gradient is proportional to the gradient of the energy, and that the energy-based actor-critic algorithm with specific types of energy functions and settings is identical to the conventional actor-critic algorithms. Additionally, using a simple experiment, we revealed that the learning of an actor can be slow in RL domains when we represent the policy by a multilayer perceptron. From this fact, we suggested the twin net architecture. We also suggested the energy-based learning procedure for this architecture. Finally, we empirically showed that the suggested architecture worked effectively in several RL domains.

ACKNOWLEDGMENT

I would like to thank Makoto Otsuka for useful comments on this work.

REFERENCES

[1] I. H. Witten, "An adaptive optimal controller for discrete-time Markov environments," Information and Control, vol. 34, no. 4, pp. 286–295, 1977.
[2] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Transactions on Systems, Man and Cybernetics, no. 5, pp. 834–846, 1983.
[3] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, "A survey of actor-critic reinforcement learning: Standard and natural policy gradients," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 6, pp. 1291–1307, 2012.
[4] B. Sallans and G. E. Hinton, "Using free energies to represent q-values in a multiagent reinforcement learning task," in NIPS, 2000, pp. 1075–1081.
[5] B. Sallans and G. E. Hinton, "Reinforcement learning with factored states and actions," The Journal of Machine Learning Research, vol. 5, pp. 1063–1088, 2004.
[6] M. Ohtsuka, "Goal-oriented representations of the external world: A free-energy-based approach," 2010.
[7] S. Elfwing, E. Uchibe, and K. Doya, "Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state spaces," Frontiers in Neurorobotics, vol. 7, 2013.
[8] N. Heess, D. Silver, and Y. W. Teh, "Actor-critic reinforcement learning with energy-based policies," in EWRL, 2012, pp. 43–58.
[9] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, "A tutorial on energy-based learning," Predicting Structured Data, 2006.
[10] P. Smolensky, "Information processing in dynamical systems: Foundations of harmony theory," 1986.
[11] Y. Freund and D. Haussler, Unsupervised learning of distributions of binary vectors using two layer networks. Computer Research Laboratory, University of California, Santa Cruz, 1994.
[12] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[13] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in NIPS, vol. 99, 1999, pp. 1057–1063.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[15] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[16] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee, "Natural actor-critic algorithms," Automatica, vol. 45, no. 11, pp. 2471–2482, 2009.
[17] H. van Hasselt, "Reinforcement learning in continuous state and action spaces," Reinforcement Learning, pp. 207–251, 2012.
[18] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in International Conference on Artificial Intelligence and Statistics, 2009, pp. 448–455.
[19] C. W. Anderson, "Strategy learning with multilayer connectionist representations," 1987.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[21] H. van Hasselt and M. A. Wiering, "Reinforcement learning in continuous action spaces," in Approximate Dynamic Programming and Reinforcement Learning, 2007. ADPRL 2007. IEEE International Symposium on, 2007, pp. 272–279.
[22] H. van Hasselt and M. A. Wiering, "Using continuous action spaces to solve discrete problems," in Neural Networks, 2009. IJCNN 2009. International Joint Conference on, 2009, pp. 1149–1156.
Anderson, “Strategy learning with multilayer connectionist rep- resentations,” 1987. [20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. [21] H. Van Hasselt and M. A. Wiering, “Reinforcement learning in contin- uous action spaces,” in Approximate and Rein- forcement Learning, 2007. ADPRL 2007. IEEE International Symposium on. IEEE, 2007, pp. 272–279. [22] H. van Hasselt and M. A. Wiering, “Using continuous action spaces to solve discrete problems,” in Neural Networks, 2009. IJCNN 2009. International Joint Conference on. IEEE, 2009, pp. 1149–1156.