A Regularization Study for Policy Gradient Methods

Master Thesis to obtain the academic degree of Diplom-Ingenieur in the Master's Program Computer Science

Submitted by: Florian Henkel
Submitted at: Institute of Computational Perception
Supervisor: Univ.-Prof. Dr. Gerhard Widmer
Co-Supervisor: Dipl.-Ing. Matthias Dorfer
July 2018

Johannes Kepler University Linz, Altenbergerstraße 69, 4040 Linz, Österreich, www.jku.at
Abstract
Regularization is an important concept in supervised machine learning. Especially with neural networks, it is necessary to restrict their capacity and expressivity in order to avoid overfitting to the given training data. While there are several well-known and widely used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their effect in the context of reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning algorithms relying on neural networks. We compare different state-of-the-art algorithms together with regularization methods from supervised learning to get a better understanding of how we can improve generalization in reinforcement learning. The main motivation for exploring this line of research is our current work on score following, where we try to train reinforcement learning agents to listen to and read music. These agents should learn from given musical training pieces to follow music they have never heard and seen before. Thus, the agents have to generalize, which is why this scenario is a suitable test bed for investigating generalization in the context of reinforcement learning.
The empirical results found in this thesis should primarily serve as a guideline for our future work in this field. Although we have a rather limited set of experiments due to hardware limitations, we see that regularization in reinforcement learning does not work in the same way as for supervised learning. Most notable is the effect of Batch-Normalization. While this technique did not work for one of the tested algorithms, it yields promising but very unstable results for another. We further observe that one algorithm is robust and not affected by regularization at all. In our opinion it is necessary to further explore this field and to perform a more in-depth and thorough study in the future.
Kurzfassung
Regularization plays an essential role in the field of supervised machine learning. Especially with neural networks, it is necessary to restrict their capacity and expressivity in order to avoid so-called overfitting to the given training data. While there are several well-known and frequently used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their influence with respect to reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning based on neural networks. We compare different state-of-the-art algorithms together with regularization methods for supervised machine learning in order to understand how the generalization ability of reinforcement learning can be improved. The main motivation for investigating this research area is our current work in the field of automatic score following, where we try to teach agents, by means of reinforcement learning, to listen to and read music. These agents should learn from given training pieces in order to follow music they have never heard or seen before. Hence, the agents have to be able to generalize, which makes this scenario a suitable test environment for studying generalization in reinforcement learning.

The empirical results of this thesis should primarily serve as a guideline for our future work in this field. Even though we could only carry out a limited number of experiments due to hardware constraints, we can observe that regularization in reinforcement learning does not behave in the same way as for supervised learning. Particularly noteworthy is the influence of Batch-Normalization. While this technique did not work for one of the tested algorithms, it delivered promising, albeit unstable, results for another. Furthermore, we observe that one algorithm reacts robustly to regularization and is not influenced by it at all. In our opinion it is necessary to continue research in this area and to conduct a more thorough and extensive study in the future.
Acknowledgments
First of all, I would like to thank my whole family, my loving partner and my friends who always supported me throughout my studies. Without them this would not have been possible. Furthermore, I would like to thank Gerhard Widmer, who not only supervised this thesis but also gave me the opportunity to work as a student researcher at the Institute of Computational Perception. I am really grateful for having the possibility to work in such a productive environment and with all my experienced colleagues. I would also particularly like to thank Matthias Dorfer, who guided me during the course of this thesis. One could not think of a better and more supportive advisor.
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 670035, project Con Espressione).
Contents
1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Outline
2 Theory
  2.1 Reinforcement Learning
  2.2 Policy Gradient and Actor Critic Methods
    2.2.1 REINFORCE
    2.2.2 One-Step Actor Critic
    2.2.3 (Asynchronous) Advantage Actor Critic
    2.2.4 Proximal Policy Optimization
  2.3 Neural Networks
    2.3.1 Fully Connected Neural Networks
    2.3.2 Convolutional Neural Networks
    2.3.3 Back-Propagation and Gradient Based Optimization
    2.3.4 Activation Functions
  2.4 Regularization for Neural Networks
    2.4.1 L2-Normalization
    2.4.2 Dropout
    2.4.3 Batch-Normalization
3 Experimental Study
  3.1 The Score Following Game
  3.2 Experimental Setup
    3.2.1 The Nottingham Dataset
    3.2.2 Network Architectures
    3.2.3 Training and Validation
  3.3 Results and Discussion
    3.3.1 Result Summary
    3.3.2 Comparing Algorithms
    3.3.3 Comparing Activation Functions
    3.3.4 Comparing Regularization Techniques
  3.4 Implications on the Network Architecture
4 Conclusion
List of Figures
1 Agent-Environment interaction framework
2 Simple example neural network
3 Convolution visualization
4 Activation Function comparison
5 Dropout
6 Time Domain to Frequency Domain
7 Score Following Game MDP
8 Score Following Game State Space
9 Score Following Game Reward function
10 Network Architecture sketch
11 Shallow Network: Validation set setting comparison
12 Shallow Network: Test set setting comparison
13 Deep Network: Validation set setting comparison
14 Deep Network: Test set setting comparison
15 Shallow Network: Algorithm comparison
16 Shallow Network: Performance with different activations
17 Deep Network: Performance with different activations
18 Shallow Network: Reinforce with regularization
19 Shallow Network: A2C with regularization
20 Deep Network: A2C with regularization
21 Shallow Network: PPO with regularization
22 Deep Network: PPO with regularization
List of Tables
1 Shallow Network Architecture
2 Shallow Network Architecture with Dropout
3 Shallow Network Architecture with Batch-Normalization
4 Deep Network Architecture
5 Deep Network Architecture with Dropout
6 Deep Network Architecture with Batch-Normalization
7 Hyperparameters
8 Shallow Network: Training set performance
9 Shallow Network: Validation set performance
10 Shallow Network: Test set performance
11 Deep Network: Training set performance
12 Deep Network: Validation set performance
13 Deep Network: Test set performance
14 Simplified Network Architecture S1
15 Simplified Network Architecture S2
16 Simplified Network Results
1 Introduction
In this introductory section, we illustrate the motivation behind the thesis, discuss research related to the subject of our work and give an outline of the structure of the remaining content.
1.1 Motivation

In recent years, the field of reinforcement learning (RL) gained a lot of attention with achievements in the domain of games like Atari or Go [25, 34]. Increasing computational power as well as improved algorithms are pushing the boundaries of what people thought to be too complex for computers to learn. However, the application of RL is not limited to games only. Recent work successfully incorporates such techniques in fields like traffic control or the control of electric power systems [14, 23], making it an important research area with practical relevance.
During the course of our research on RL in the context of score following1, we stumbled upon several interesting open research problems. One of them is the problem of overfitting and generalization, which we want to address in this thesis. While there are multiple well understood and working regularization techniques to avoid overfitting in supervised machine learning, the effect of regularization with respect to RL is barely investigated. In this thesis, we explore how regularization techniques like L2-Normalization, Dropout and Batch-Normalization influence the behavior of state-of-the-art RL algorithms. Furthermore, we investigate the use of different activation functions in a deep and a shallow neural network architecture. To do so, we conduct experiments with state-of-the-art RL algorithms on an environment called the Score Following Game, which we developed as part of our research on score following [11].
1.2 Related Work

In this section we review work related to policy gradient methods and regularization in the context of RL. The most basic policy gradient algorithm is REINFORCE, which was introduced in 1992 by Williams [39]. The ideas used by this algorithm are the basis for most state-of-the-art policy gradient methods. In 2015, Lillicrap et al. introduced Deep Deterministic Policy Gradient (DDPG) [21], an off-policy RL algorithm utilizing experience replay, which seems to be especially suited for continuous control tasks. The concept of experience replay was further investigated by Wang et al. in 2016 and Andrychowicz et al. in 2017, who introduced Actor-Critic with Experience Replay (ACER) [38] and Hindsight Experience Replay (HER) [1], respectively.

1 Score following is the process of listening to a piece of music while following or tracking the piece in its score representation. This will be further explained in Section 3.
In 2015, Schulman et al. proposed Trust Region Policy Optimization (TRPO) [31], utilizing a trust region approach, meaning that policy updates only happen in such a way that the new policy is still considered trustworthy. To measure the trust region, the Kullback-Leibler divergence is used. In 2017 they also introduced Proximal Policy Optimization (PPO) [33], which is similar to TRPO, but more sample efficient and easier to implement.
In 2016, Mnih et al. introduced Asynchronous Advantage Actor Critic (A3C) [24], a multi-step actor critic method using several actors in parallel to encourage exploration. Their approach was further improved in terms of sample efficiency by Wu et al. in [41], where they introduce Actor Critic using Kronecker-factored Trust Region (ACKTR). They propose the use of a Kronecker-factored approximation for approximating the natural gradient, which should allow for faster convergence than plain stochastic gradient descent.
To the best of our knowledge, there is not much work concerned with overfitting and regularization in deep RL, especially in the context of policy gradient methods. One recent study by Zhang et al. in 2018 elaborates on this problem and evaluates several strategies proposed to alleviate overfitting by introducing stochasticity into the learning process [42]. However, they do not explore the effects of regularizing the underlying function approximators, as we will in the course of this thesis. Furthermore, there is an older study from 2011 related to regularization in RL [12]. However, this work is not concerned with policy gradient methods or newer regularization techniques such as Dropout and Batch-Normalization.
From all the aforementioned algorithms, we decided to use REINFORCE, a synchronous version of A3C (Advantage Actor Critic) and PPO in our study. We include REINFORCE as our baseline for this task and expect the other algorithms to outperform it in several aspects. The Advantage Actor Critic is chosen because of its simplicity, while still yielding good results in the literature. Instead of TRPO we chose PPO, because of the aforementioned benefits and because it is currently used as the default RL agent at OpenAI2.
2https://blog.openai.com/openai-baselines-ppo/
1.3 Outline

The remainder of this thesis is structured in the following way. In Section 2 we explain the minimum theory required to understand this work. This comprises the basics of RL, policy gradient and actor critic methods as well as an introduction to the three algorithms we are considering: REINFORCE with baseline, Advantage Actor Critic and Proximal Policy Optimization. Additionally, we briefly explain the idea behind neural networks and how they work. Finally, we introduce three different activation functions and three common regularization techniques for the supervised training of neural networks, which we examine in our experiments.
Section 3 is the main part of this thesis and is concerned with our experimental study. At first, we introduce the Score Following Game [11], which is the RL environment we used for the experiments. Afterwards we explain our experimental setup, including information about the dataset, neural networks and hyperparameters. The experiments themselves are split into three parts. First, we compare the algorithms. Second, we explore the effect of different activation functions, and finally we study the influence of different regularization techniques. At last, in Section 4 we conclude this thesis and provide an outlook on future work and possible research directions.
2 Theory
In the following section, we first elaborate the basic theory of reinforcement learning including policy gradient methods and the actor-critic setup, as well as a description of the algorithms used in our experimental study. Afterwards, we give an introduction to neural networks, especially the subclass of convolutional neural networks (CNNs) along with an explanation of activation functions used in machine learning. The last part of this section comprises a brief introduction to regularization techniques for neural networks covering L2-Normalization, Dropout and Batch-Normalization.
2.1 Reinforcement Learning

To summarize reinforcement learning in a nutshell, one could simply say that it is learning from interaction3. The two basic components are an agent and an environment, with which the agent interacts. This process is visualized in Figure 1. At each time step t the environment is in a certain state St. The agent observes this state and reacts by choosing an action At. This action is performed within the environment, yielding a new state St+1 as well as an immediate reward signal Rt+1 for this action.
Figure 1: Interaction between the agent and the environment. The agent performs an action At depending on a state St. It receives a new state St+1 and a reward Rt+1 for the current action. (Figure reproduced from [36])
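To make this interaction loop concrete, the following minimal sketch shows how it is typically written in code. The environment and agent objects, their method names and the OpenAI-Gym-like step() interface are illustrative assumptions and do not refer to the implementation used later in this thesis.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# `env` and `agent` are hypothetical objects with a Gym-like interface.
def run_episode(env, agent):
    state = env.reset()                      # initial state S_0
    done, total_reward = False, 0.0
    while not done:
        action = agent.select_action(state)          # A_t ~ pi(a | S_t)
        state, reward, done, _ = env.step(action)    # S_{t+1}, R_{t+1}
        total_reward += reward
    return total_reward                      # undiscounted episode return
```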
The sole objective of the agent is to maximize this reward over time. However, the crucial part is that the agent cannot simply choose actions yielding a high immediate reward, as it is possible that those actions will eventually lead to suboptimal situations in the future. Thus, the agent has to plan in the long run and tries to maximize the sum of future rewards, which we call the return Gt. In order to keep this sum finite and to control the influence of future rewards, one uses a discounted formulation of the return as given in Equation (1)

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \tag{1}$$

where γ is a discounting factor with γ ∈ [0, 1). Choosing a factor closer to 1 considers future rewards more strongly, while a factor closer to 0 emphasizes immediate rewards.

3 This introduction to RL as well as our choice of notation is based on the book by Sutton and Barto [36].
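As a small illustration of Equation (1), the following sketch computes the discounted return for every time step of a finite episode by iterating backwards over the collected rewards; the function name and the example rewards are chosen purely for illustration.

```python
# Discounted returns G_t for a finite episode (sketch of Equation (1)).
def discounted_returns(rewards, gamma=0.9):
    returns = [0.0] * len(rewards)
    running = 0.0
    # iterate backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0]))  # [0.81, 0.9, 1.0]
```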
Until now, we have not explained how an agent chooses an action. To do so, we introduce the term policy. The policy determines the behavior of an agent and can be seen as a conditional probability distribution over actions given states. We refer to the policy as π(a | s). Note that the decision of the agent only depends on the current state St, thus the agent should be able to choose an appropriate action by just observing this state. In this context we often speak of a Markov Decision Process (MDP) and the Markov property, respectively. The Markov property basically means that a future state only depends on the current state and not on the past, i.e. p(St+1 | St, St−1, St−2, ..., S0) = p(St+1 | St). While there are ways to tackle problems where the Markov property is violated, it is desirable to formulate environments in such a way that the state-transition process is Markovian [36].
If a stochastic decision process has the Markov property, it is called an MDP. An MDP is a quintuple consisting of the following parts:
• S, a set of all possible states the environment can be in (state space)
• A, a set of all possible actions an agent can take (action space)
• R, a function defining the immediate rewards (reward function)
• P, a conditional probability distribution determining the probabilities for going from one state to another given a certain action (transition probabilities)
• γ, the discounting factor for controlling the trade-off between immediate and future rewards
Note that S and A do not necessarily have to be finite sets. For real world problems it is often the case that one of these two or even both sets are infinite.
Although we are not concerned with an infinite action space in this thesis, we later on briefly describe how this can be handled.
In order to learn a proper behavior, the agent needs the means to determine whether a state or an action in a certain state, respectively, is good or bad. We can measure the goodness of a state and of a state-action pair using the previously defined return Gt. For a state s, we define the state-value function as the expected return from time step t onwards given that we are in this particular state, and given that we follow a certain policy π
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \tag{2}$$
Similar to this, we define the action-value function for a state s and action a
$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \tag{3}$$
Note that we always define these functions with respect to a policy π. Assuming that we knew the optimal value functions q∗(s, a) and v∗(s), meaning that we have perfect knowledge of the value of a certain state or state-action pair, we could determine the best action within each state and in this way construct an optimal policy that yields the maximum reward.
It is possible to obtain such an optimal solution for problems with a small state space and where we have access to the underlying dynamics of the environment (colloquially the "laws of nature", describing how an environment reacts to actions). However, this is infeasible for most real world applications. Therefore, other approaches have been developed. One of them are policy gradient methods, which we introduce in the following section.
2.2 Policy Gradient and Actor Critic Methods

The essential point of policy gradient methods is that instead of learning value functions, which are then used to determine a policy, we directly learn a policy. The definition of the policy therefore changes to a parametrized formulation in the form of π(a | s; θ). Learning a policy then means adapting this parametrization θ to conform to some desired behavior. In terms of policy gradient methods, this learning is achieved by using the gradient of a performance measure J(θ) and performing (stochastic) gradient ascent with respect to this performance measure

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)} \tag{4}$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic approximation of the gradient of J with respect to the parameters θt. As this performance measure involves the policy π, we need to make sure that π(a | s; θ) is differentiable with respect to θ. One common form of parametrization for a discrete action space is the use of numerical action selection preferences as shown in Equation (5) [36]
$$\pi(a \mid s; \theta) = \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))} \tag{5}$$

where h(s, a, θ) are real-valued numerical preferences for state-action pairs parametrized by θ. In the following we assume θ to be the weights of a neural network. Equation (5) scales these preferences to a valid probability distribution, meaning that the state-action pair with the highest preference will have the highest probability. Once we have defined the policy, one can either act greedily by selecting the action with the highest probability, or stochastically by sampling an action from the probability distribution.
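The following short sketch illustrates Equation (5): numerical action preferences are turned into a probability distribution via the softmax, from which an action can either be picked greedily or sampled. The preference values are made up for illustration.

```python
import numpy as np

# Softmax policy over action preferences h(s, a, theta) (sketch of Equation (5)).
def softmax_policy(preferences):
    z = np.exp(preferences - np.max(preferences))  # subtract max for numerical stability
    return z / z.sum()

prefs = np.array([1.2, 0.3, -0.5])                 # hypothetical preferences for 3 actions
probs = softmax_policy(prefs)

greedy_action = int(np.argmax(probs))                        # act greedily
sampled_action = int(np.random.choice(len(probs), p=probs))  # act stochastically
```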
Real world problems often involve a continuous action space; in that case it is not possible to define such numerical preferences, as the set of actions is infinitely large. A way of approaching this is to learn the parameters of probability distributions from which the actions will then be sampled. The common choice is a Gaussian distribution, where the parameters µ and σ are learned by function approximators like neural networks [36]. However, there is also work where different probability distributions are used, e.g. the Beta distribution [7].
It remains to show how the performance measure J(θ) is defined. In general, one distinguishes between episodic and continuing tasks. Continuing tasks are ongoing tasks, where the agent is constantly acting inside the environment. Episodic tasks are tasks that usually have a clearly defined ending. One game of chess could for example be seen as a single episode with three different terminal states (win, lose, draw). The agent then plays several of those episodes until it is able to solve a problem. In this thesis we are concerned with episodic tasks and will therefore derive the performance measure for this case.
For episodic tasks the performance measure is defined as the value of the start state s0 by

$$J(\theta) = v_{\pi_\theta}(s_0) \tag{6}$$

with the value function v following the policy π parameterized by θ. Recall that the value of a state is the expected future return starting in this specific state. Therefore, we maximize our future return by maximizing the objective J(θ). The problem, however, is that with this formulation the performance depends not only on the action selection but also on the distribution of states. While the first is not severe, the latter is unfavorable due to the usually unknown effect of the policy on the state distribution, as it is a function of the environment [36]. Fortunately, this can be rewritten to what is known as the policy gradient theorem, which does not rely on the derivative of the state distribution.

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta) \tag{7}$$

where µ(s) is the probability of being in state s. A derivation of the whole theorem can be found in [36].
2.2.1 REINFORCE

With the policy gradient theorem, we can finally derive an update rule for learning a policy. Recall that µ(s) is the probability of being in state s, given that we follow our current policy. The policy gradient theorem is therefore a probability weighted sum, which allows us to rewrite Equation (7) as an expectation

$$
\begin{aligned}
\nabla J(\theta) &\propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta) \\
&= \mathbb{E}_\pi\left[\sum_a q_\pi(S_t, a) \nabla_\theta \pi(a \mid S_t; \theta)\right] \\
&= \mathbb{E}_\pi\left[\sum_a \pi(a \mid S_t; \theta)\, q_\pi(S_t, a) \frac{\nabla_\theta \pi(a \mid S_t; \theta)}{\pi(a \mid S_t; \theta)}\right] \\
&= \mathbb{E}_\pi\left[q_\pi(S_t, A_t) \frac{\nabla_\theta \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}\right] \\
&= \mathbb{E}_\pi\left[G_t \frac{\nabla_\theta \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}\right]
\end{aligned} \tag{8}
$$

Using this, we can refine the update formula given in Equation (4) and arrive at what was introduced as the REINFORCE algorithm by Williams [39]

$$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t \mid S_t; \theta_t)}{\pi(A_t \mid S_t; \theta_t)} \tag{9}$$
Note that $\frac{\nabla_\theta \pi(A_t \mid S_t; \theta_t)}{\pi(A_t \mid S_t; \theta_t)}$ is often written as $\nabla_\theta \ln \pi(A_t \mid S_t; \theta_t)$. An intuitive explanation of this formula is that the gradient points into the direction of the parameter space which increases the probability of taking this action in the future. If this action was beneficial, the return will most probably be high, and if not it will be low. This scales the update, so a high return increases the probability of taking this action more than a low return does. The denominator is used as a normalizing factor, in order to avoid an advantage for actions that are selected more often, i.e. have a higher probability. Most state-of-the-art policy gradient algorithms use a variation of this formula.
One problem with the REINFORCE algorithm, and with policy gradients in general, is that the variance of updates (gradients) can be very high. The reason for this is that the updates strongly depend on the used samples (the trajectory). High variance has the effect that it slows down learning. One obvious way to alleviate this is to use more samples. However, this is often not possible as creating such samples, be it in real life or in a simulation, can be very costly [38]. It is therefore desirable to create agents that learn fast, while using as few interactions with the environment as possible. In the literature this is often referred to as sample efficiency [38, 41]. There are algorithms that try to do so, e.g. actor-critic with experience replay (ACER) [38]. However, as this thesis focuses on investigating the effect of regularization on RL, we are not considering them.
One simple approach to reduce variance is to include a baseline b(s), to which the action-value function will be compared. This changes the policy gradient theorem to

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left(q_\pi(s, a) - b(s)\right) \nabla_\theta \pi(a \mid s; \theta) \tag{10}$$

This baseline could be any function or even a constant, however it must not depend on the actions a. The idea behind this is intuitive. For instance, consider the baseline to be the average reward we would normally gather in a certain state. Using this in the update formula allows us to evaluate an action we performed relative to the average performance we usually achieve in this state. This means, if we actually got less reward than on average, the action was probably not that good and we therefore want to discourage this behavior. On the other hand, if we performed better than on average, the action should be encouraged in the future. In practice, the function used as a baseline is often an estimate of the state-value function $\hat{v}(S_t; \theta_v)$, where θv is again a learned parametrization, e.g. a neural network. The update formula for the REINFORCE with baseline algorithm is given in Equation (11)

$$\theta_{t+1} = \theta_t + \alpha \left(G_t - \hat{v}(S_t; \theta_v)\right) \nabla_\theta \ln \pi(A_t \mid S_t; \theta_t) \tag{11}$$

Note that adding this baseline does not add bias to the update rule. REINFORCE with baseline will be the first algorithm we consider in our experiments in Section 3. For simplicity we will refer to it as Reinforce.
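To connect Equation (11) with a gradient-based implementation, the sketch below expresses one REINFORCE-with-baseline update as a loss in PyTorch. It assumes a hypothetical `policy` module that returns a torch.distributions.Categorical over actions, a `value_net` baseline and a single optimizer over both networks; none of this is the exact setup used later in this thesis.

```python
import torch

# One REINFORCE-with-baseline update for a single episode (sketch of Equation (11)).
def reinforce_update(policy, value_net, optimizer, states, actions, returns):
    returns = torch.as_tensor(returns, dtype=torch.float32)        # G_t for each step
    log_probs = policy(states).log_prob(actions)                    # log pi(A_t | S_t; theta)
    baselines = value_net(states).squeeze(-1)                       # v_hat(S_t; theta_v)

    advantages = returns - baselines.detach()                       # G_t - baseline
    policy_loss = -(advantages * log_probs).mean()                  # gradient ascent via negative loss
    value_loss = torch.nn.functional.mse_loss(baselines, returns)   # fit the baseline

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
```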
2.2.2 One-Step Actor Critic

In machine learning it is often advantageous to introduce bias in order to reduce variance. This is referred to as the bias-variance trade-off [5]. Actor critic methods introduce a bias in the form of bootstrapping, meaning that we update the value estimate of one state by using the value estimate of subsequent states. Using this knowledge we can define the update for the one-step actor critic

$$
\begin{aligned}
\delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}; \theta_v) - \hat{v}(S_t; \theta_v) \\
\theta_{t+1} &= \theta_t + \alpha \delta_t \nabla_\theta \ln \pi(A_t \mid S_t; \theta_t)
\end{aligned} \tag{12}
$$
The learned state-value function $\hat{v}(S_t; \theta_v)$ acts again as a baseline and additionally serves as an estimation of the future return. The intuition behind this formulation is that δt (also called the advantage) is used to evaluate the action we took in a certain state compared to the average performance in this state. This means that if an action was better than our average performance, we increase the probability of taking it in the future, as the expression inside the brackets will be positive. If it was worse, on the other hand, the expression will be negative and we thus reduce the probability.
The advantage of such a formulation is that it is not necessary to wait for the whole return Gt of one trajectory/episode until an update can be applied. Therefore actor critic methods are able to learn online, i.e. incrementally improve their policy after each step.
In terms of actor critic methods the policy is referred to as the actor and the state-value function as the critic. A common choice for optimizing the state-value function is to minimize the squared difference between the immediate one-step reward plus the value estimate of the next state, and the state-value estimate of the current state. This translates to

$$
\begin{aligned}
R &= R_t + \gamma \hat{v}(S_{t+1}; \theta_v) \\
\theta_v &= \theta_v - \alpha_v \frac{\partial \left(R - \hat{v}(S_t; \theta_v)\right)^2}{\partial \theta_v}
\end{aligned} \tag{13}
$$

where αv is a separate learning rate for the critic. It is however often the case that actor and critic share some of the parameters and thus they can be learned jointly by a single optimizer with a single learning rate [24, 41].
2.2.3 (Asynchronous) Advantage Actor Critic

In 2016, Mnih et al. introduced the Asynchronous Advantage Actor Critic (A3C) algorithm [24], which is basically a multi-step actor-critic method. Instead of updating the policy and value function after each step, a certain number of tmax steps are taken before applying the update. This changes the update rule to

$$A(S_t, A_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i R_{t+i} + \gamma^k \hat{v}(S_{t+k}; \theta_v) - \hat{v}(S_t; \theta_v) \tag{14}$$

$$\theta_{t+1} = \theta_t + \alpha A(S_t, A_t; \theta_t, \theta_v) \nabla_\theta \log \pi(A_t \mid S_t; \theta_t)$$

where A(St, At; θ, θv) is called the advantage function and k is at most tmax, or a smaller number in case a terminal state was reached before tmax steps could be performed. $\hat{v}$ is again an estimate of the value function. Furthermore, they make use of entropy regularization as introduced by Williams et al. [40]. They add the entropy of the policy to the update rule, which should avoid early convergence to non-optimal policies as well as encourage exploration. The idea is to keep the entropy high and thus have more evenly distributed action selection probabilities. Thereby different actions are chosen more frequently, which in turn leads to more exploration.
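A minimal sketch of the multi-step advantage in Equation (14) is given below: for a rollout segment of at most tmax steps it computes the advantage of each step by bootstrapping from the value estimate of the state after the segment. The variable names are illustrative only.

```python
# Multi-step (A2C-style) advantages for one rollout segment (sketch of Equation (14)).
# rewards:    R_t for the k steps taken
# values:     v_hat(S_t) for those k states
# last_value: v_hat(S_{t+k}) for the state after the segment (0 if it was terminal)
def nstep_advantages(rewards, values, last_value, gamma=0.99):
    advantages = []
    ret = last_value                       # bootstrapped tail value
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret     # discounted k-step return from step t
        advantages.append(ret - values[t])
    advantages.reverse()
    return advantages
```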
The special characteristic of the A3C algorithm is that multiple actors are used in parallel, on different instances of the same environment. It is likely that these actors do not act the same over all instances, hence different parts of the environment will be explored. This not only improves exploration, but also uses decorrelated samples for updating. These updates happen asynchronously, utilizing a parallelized approach to stochastic gradient descent [27].
To avoid the technical complexity of asynchronous updates, one can use a synchronous variant of the A3C algorithm, dubbed A2C for Advantage Actor Critic. In this case, the updates are not applied asynchronously; instead, once all actors have finished a number of steps, an update is applied by averaging over all actors. This version is now used by several researchers for comparing their algorithms [37, 41].
2.2.4 Proximal Policy Optimization

Proximal Policy Optimization (PPO) was introduced in 2017 by Schulman et al. [33] and is currently used as the default algorithm for RL problems by OpenAI. Like A2C, PPO is an actor critic method which uses the same approach of parallel actors. The difference, however, is how the advantage function as well as the policy loss are defined and how PPO handles updates.
As an advantage function, PPO uses generalized advantage estimation (GAE) [32]. Using this advantage formulation we get
$$\delta_t^V = R_t + \gamma \hat{v}(S_{t+1}; \theta_v) - \hat{v}(S_t; \theta_v) \tag{15}$$

$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V \tag{16}$$

where $\hat{v}$ is again an estimation of the value function, γ the discounting factor and λ an additional adaptable hyperparameter. If, for instance, one chose λ = 0, then

$$\hat{A}_t^{GAE(\gamma,0)} = \delta_t^V = R_t + \gamma \hat{v}(s_{t+1}; \theta_v) - \hat{v}(s_t; \theta_v) \tag{17}$$

By choosing the λ parameter accordingly, one can control the bias-variance trade-off. A value closer to zero introduces more bias due to the immediate approximation, and in this way the variance tends to be lower [32]. Using λ = 1 on the other hand has high variance and almost no bias due to the summation of the rewards, as one can see in Equation (18)
$$\hat{A}_t^{GAE(\gamma,1)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty} \gamma^l R_{t+l} - \hat{v}(s_t; \theta_v) \tag{18}$$

One problem with regular policy gradient methods is data or sample efficiency, i.e. the number of interactions the agent has to perform within the environment until it is able to solve a given task. High variance during the learning process leads to a low sample efficiency, and thus the goal is to reduce variance through various methods [32]. Besides GAE, PPO does so by performing multiple update steps using the same samples created during interaction. However, in order to do so the loss used for optimizing the policy has to be changed, as it would otherwise have a damaging effect on the policy itself [33].
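The GAE advantages of Equations (15) and (16) are typically computed for a finite rollout by accumulating the one-step TD errors backwards in time, as in the sketch below; the truncation to a finite segment and all names are illustrative assumptions.

```python
# Generalized advantage estimation for a finite rollout (sketch of Equations (15) and (16)).
# rewards[t]: R_t;  values[t]: v_hat(S_t), with one extra entry values[T] for bootstrapping
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error, Eq. (15)
        gae = delta + gamma * lam * gae                          # accumulate (gamma*lambda)^l terms
        advantages[t] = gae
    return advantages
```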
Consider the regular policy loss of REINFORCE and using one trajectory to update the policy multiple times. This would lead to large updates and serious overfitting with respect to the current trajectory. To alleviate this problem, a new loss (also termed surrogate objective) is introduced, given as

$$\mathbb{E}\left[\min\left(r_t(\theta)\hat{A}_t,\; \operatorname{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right)\right] \tag{19}$$

where $r_t(\theta) = \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{old}}(A_t \mid S_t)}$ is a probability ratio of the policy under the parametrization θ and the policy under the parametrization θold, denoting the policy at the previous update iteration. $\hat{A}$ denotes the estimated advantage function in the form of GAE. If the parametrizations are equal, i.e. θ = θold, this ratio is one. By using clip inside the objective, the ratio is clipped to be within [1 − ϵ, 1 + ϵ] such that the updates will not drive the new policy too far away from the old one. Taking the minimum afterwards results in a loss that is a lower bound on $r_t(\theta)\hat{A}_t$, and furthermore includes a penalty if the updates of the policy are too large.
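As an illustration of Equation (19), the following PyTorch sketch computes the clipped surrogate loss from the log-probabilities of the current and the old policy; it is a simplified stand-alone function, not the PPO implementation used in our experiments.

```python
import torch

# Clipped PPO surrogate loss (sketch of Equation (19)).
def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # negative sign: the objective is maximized, but optimizers minimize a loss
    return -torch.min(unclipped, clipped).mean()
```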
Comparing updates with and without clipping, Schulman et al. show that without clipping no useful policy is learned and the result is even worse than a random policy, whereas with clipping learning succeeds. Furthermore, we can see from their results that this epsilon is another hyperparameter that can significantly influence the results. Thus, proper hyperparameter tuning is required.
2.3 Neural Networks

Neural networks are the de facto standard in various tasks like image classification [20], speech recognition [16] or music information retrieval [19]. In the following, we give a brief overview of the basics of neural networks as they are used for approximating the policy and value function in policy gradient methods. We explain the idea behind convolutional neural networks (CNNs) and error backpropagation, and introduce three state-of-the-art activation functions that we consider in our experimental study.
2.3.1 Fully Connected Neural Networks

As shown in Figure 2, a fully connected neural network (FCNN) consists of inputs x = {x1, x2, ..., xn}, an arbitrary number of hidden layers with (hidden) units and an output layer with a number of output units. Each layer l is connected to its subsequent layer by weights W(l) as well as an additional bias vector b(l). Furthermore, we have hidden layer activation functions h(l)(x) and an output layer activation function f(x) that process their input x.
Figure 2: A simple neural network with three inputs, a single hidden layer with three units and two outputs.
A neural network is formalized with a matrix notation. As an example, we will do this for the network shown in Figure 2.

$$x = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \tag{20}$$

$$W^{(1)} = \begin{pmatrix} w^{(1)}_{11} & w^{(1)}_{12} & w^{(1)}_{13} \\ w^{(1)}_{21} & w^{(1)}_{22} & w^{(1)}_{23} \\ w^{(1)}_{31} & w^{(1)}_{32} & w^{(1)}_{33} \end{pmatrix} \tag{21}$$

$$W^{(2)} = \begin{pmatrix} w^{(2)}_{11} & w^{(2)}_{12} \\ w^{(2)}_{21} & w^{(2)}_{22} \\ w^{(2)}_{31} & w^{(2)}_{32} \end{pmatrix} \tag{22}$$

$$b^{(1)} = \begin{pmatrix} b^{(1)}_1 & b^{(1)}_2 & b^{(1)}_3 \end{pmatrix} \tag{23}$$

$$b^{(2)} = \begin{pmatrix} b^{(2)}_1 & b^{(2)}_2 \end{pmatrix} \tag{24}$$
The input for the first hidden activation function is defined by a matrix multiplication and a vector addition.

$$\text{net}_{h^{(1)}}(x) = x \cdot W^{(1)} + b^{(1)} = \begin{pmatrix} x_1 w^{(1)}_{11} + x_2 w^{(1)}_{21} + x_3 w^{(1)}_{31} + b^{(1)}_1 \\ x_1 w^{(1)}_{12} + x_2 w^{(1)}_{22} + x_3 w^{(1)}_{32} + b^{(1)}_2 \\ x_1 w^{(1)}_{13} + x_2 w^{(1)}_{23} + x_3 w^{(1)}_{33} + b^{(1)}_3 \end{pmatrix}^T \tag{25}$$
The hidden activation function h(1) is then
$$h^{(1)}(x) = g\left(\text{net}_{h^{(1)}}(x)\right) \tag{26}$$

where g is a differentiable function, e.g. tanh or sigmoid. We will later on introduce three special kinds of functions that are used in deep learning.
The final output is defined in the same way (we now omit writing out the whole matrix calculation):

$$\text{net}_{f}(x) = h^{(1)}(x) \cdot W^{(2)} + b^{(2)} \tag{27}$$

$$f(x) = g\left(\text{net}_{f}(x)\right) \tag{28}$$

where g is again a differentiable function. For (multinomial) classification tasks this function is chosen to be the softmax function given in Equation (29). The idea is to normalize a vector in such a way that the values are in the range (0, 1) and sum up to 1. Thus, this vector is interpreted as a probability distribution over some classes, or, to forge a bridge to RL, a probability distribution over actions

$$\operatorname{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} \tag{29}$$
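As a compact illustration of Equations (20) to (29), the following NumPy sketch implements the forward pass of the small example network from Figure 2 with randomly initialized weights; tanh is used as the hidden activation purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example network from Figure 2: 3 inputs, 3 hidden units, 2 outputs (sketch).
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

def softmax(z):
    z = np.exp(z - np.max(z))
    return z / z.sum()

def forward(x):
    h = np.tanh(x @ W1 + b1)       # hidden layer, Eqs. (25)-(26)
    return softmax(h @ W2 + b2)    # output layer with softmax, Eqs. (27)-(29)

print(forward(np.array([0.5, -1.0, 2.0])))  # two output probabilities summing to 1
```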
2.3.2 Convolutional Neural Networks

The previously described neural network is called fully connected, meaning that every unit in a layer is connected to every unit in the following layer. This might become a problem if one has to process high dimensional input data. For instance, consider a (single-channel) image with 256x256 pixels, where each pixel has to be treated as one input dimension, and one hidden layer with 128 units. Then the first weight matrix would have a dimensionality of 65,536x128 and thus 8,388,608 entries, i.e. parameters that have to be adjusted. As a neural network not only consists of a single layer, the total number of parameters is even higher.
CNNs circumvent this by using the convolution operation and so-called kernels instead of the regular matrix multiplication. Reconsidering our example, the image is now treated as a two dimensional input. This input is then convolved with a kernel of a certain size; typical sizes are 3×3 or 5×5. The result of this operation is called a feature map. So, instead of more than a million parameters, we would only need 9 for a 3×3 kernel. Usually one uses multiple kernels in a convolutional layer and thus produces several feature maps, but the number of parameters is still far below that of a FCNN.
Using a two dimensional input I and kernel K, we define the convolution operation by

$$(K * I)(i, j) = \sum_m \sum_n I(i - m, j - n) K(m, n) \tag{30}$$

For a better understanding, the process is visualized in Figure 3, where a 3×4 input is convolved with a 2×2 kernel. The kernel is moved over the whole input step by step, and each position in the image is multiplied with the respective position in the kernel, which is afterwards summed up and passed through an activation function. In this example we do not consider cases where parts of the kernel would lie outside the input image. Therefore the resulting output (the feature map) is smaller than the input. To keep the output at the same size as the input, one can pad the border of the input with zeros (zero padding). Instead of shifting the kernel one step at a time, one can also use bigger step sizes. This is generally referred to as the stride. Especially in the first layers of a CNN it might be advantageous to choose a bigger stride in order to reduce the dimensionality of the input.
Figure 3: Convolution of a 3 × 4 input with a 2 × 2 kernel, without zero padding and a stride of one. For CNNs, the val- ues in the output cells will be passed through an activation function before being processed by a subsequent layer. Figure reproduced from [15]
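The sliding-window operation visualized in Figure 3 can be written as a few lines of NumPy, shown below for a single input channel and a single kernel. Note that, as in most deep learning libraries, the kernel is not flipped, so the sketch is technically a cross-correlation rather than the convolution of Equation (30).

```python
import numpy as np

# Naive "valid" convolution of a 2D input with a 2D kernel (sketch, stride 1, no zero padding).
def conv2d(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(12, dtype=float).reshape(3, 4)  # 3x4 input as in Figure 3
kernel = np.array([[1.0, 0.0], [0.0, 1.0]])       # hypothetical 2x2 kernel
print(conv2d(image, kernel))                       # 2x3 feature map
```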
Besides the fact that we can reduce the overall number of parameters, CNNs have another important advantage, which is called translation equivariance or sometimes invariance. Colloquially this means that for a CNN it does not matter where in the input a certain pattern occurs as it will still be able to recognize it, e.g. detect the same cat no matter if it is in the bottom left or bottom right corner of an image.
In Section 3 we introduce the network architectures for our experiments which consist of a combination of several convolutional and fully connected layers. We use the convolutional layers to learn representations from high dimensional input images and the fully connected layers to learn the policy and value function.
2.3.3 Back-Propagation and Gradient Based Optimization

Neural networks are usually trained using the back-propagation algorithm (backprop) together with a gradient-based optimization technique [29]. The goal of optimization is to minimize or maximize a function. It is often the case that one wants to minimize an error/objective function J(θ) that is parametrized by θ, e.g. the weights of a neural network.
If we now take the derivative of this function with respect to the weight vector, we get the gradient $\frac{\partial J(\theta)}{\partial \theta} = \nabla_\theta J(\theta)$. This gradient gives us the slope of J(θ) at θ or, in other words, points in the direction of the steepest ascent. Thus, we can minimize the objective by performing small update steps in the direction of the negative gradient, which results in the following update rule for gradient descent
$$\theta = \theta - \eta \nabla_\theta J(\theta) \tag{31}$$

where η > 0 is the learning rate or step-size. There are several known issues with gradient descent, like oscillations or slow convergence due to flat error surfaces. In general the algorithm is quite sensitive to the choice of the learning rate, which therefore has to be chosen carefully. Researchers try to overcome these issues by using more advanced versions that extend this algorithm, like RMSprop, AdaGrad or Adam [28].
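As a toy illustration of the update rule in Equation (31), the sketch below minimizes a simple quadratic function with plain gradient descent; the objective and the learning rate are chosen arbitrarily.

```python
import numpy as np

# Plain gradient descent on J(theta) = ||theta||^2 (toy sketch of Equation (31)).
theta = np.array([3.0, -2.0])
eta = 0.1                        # learning rate

for _ in range(100):
    grad = 2.0 * theta           # gradient of ||theta||^2
    theta = theta - eta * grad   # update step, Eq. (31)

print(theta)                     # close to the minimum at [0, 0]
```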
To compute the required gradient, the backprop algorithm is used. This simply refers to the process of calculating the gradient on each layer of a neural network by propagating information (error deltas in function approximation) backwards from the output layer. This is done by recursively applying the chain rule of calculus. So, starting with a new input, the data is passed forward through the neural network and its hidden layers, where finally an output is produced on the last layer. This is referred to as forward propagation or forward pass. Using this output, the gradient of the objective with respect to the weights on the last layer is computed. This information is then passed backwards and used on the preceding hidden layers to calculate the gradients for the weights of the hidden units.
2.3.4 Activation Functions

We already mentioned two activation functions, namely tanh and sigmoid. While these are still in use, it is nowadays more common to use different functions [15]. In the following, we explain three state-of-the-art activation functions: rectified linear units (ReLUs) [26], exponential linear units (ELUs) [8] and scaled exponential linear units (SELUs) [18].
As shown in Equation (32) and Figure 4, the ReLU activation function clips all negative values to zero and otherwise passes x linearly. An advantage of this function is that the gradient will either be 0 or 1, which is simple to calculate. If x > 0 the gradient will always be 1, thus it alleviates the problem of vanishing gradients in these cases. If x ≤ 0 the gradient will be 0, which on the one hand has the advantage that activations will be sparse. This means some hidden units will be set to 0 and thus not fire, which might help during learning. On the other hand, it could occur that a unit is never active and thus the weights will not be adapted during training [22].

$$\operatorname{relu}(x) = \max(0, x) \tag{32}$$

Like ReLUs, ELUs let positive values pass linearly. The difference, however, is that negative values are not mapped to 0, but exponentially diminish and saturate at a value specified by the parameter α. An example with α = 1 is shown in Figure 4, where one can see that the negative domain saturates close to a value of −1. As ELUs allow negative values, this circumvents the problem of inactive neurons, while still mitigating the problem of vanishing gradients. Furthermore, the mean value of the activation function is closer to zero, which should permit faster learning [8].

$$\operatorname{elu}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha\left(e^x - 1\right) & \text{if } x \leq 0 \end{cases} \tag{33}$$
As the name suggests, SELUs are basically ELUs scaled by an additional parameter λ as shown in Equation (34). Like ELUs, SELUs have a mean close to zero, but additionally keep the variance near one. In their paper, Klambauer et al. show that activation functions with these characteristics have self-normalizing properties, meaning that in all layers of the network the activation converges to zero mean and unit variance. This allows networks with many layers to be trained robustly.

$$\operatorname{selu}(x) = \begin{cases} \lambda x & \text{if } x > 0 \\ \lambda\left(\alpha e^x - \alpha\right) & \text{if } x \leq 0 \end{cases} \tag{34}$$
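The three activation functions of Equations (32) to (34) can be written compactly as follows; the default α and λ values for SELU are the constants reported by Klambauer et al. [18], rounded here for readability.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                                    # Eq. (32)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))         # Eq. (33)

def selu(x, alpha=1.6733, lam=1.0507):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # Eq. (34)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), elu(x), selu(x), sep="\n")
```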