
Submitted by Florian Henkel

Submitted at Institute of Computational Perception

Supervisor Univ.-Prof. Dr. Gerhard Widmer

Co-Supervisor Dipl.-Ing. Matthias Dorfer

A Regularization Study for Policy Gradient Methods

July 2018

Master Thesis to obtain the academic degree of Diplom-Ingenieur in the Master’s Program Computer Science

JOHANNES KEPLER UNIVERSITY LINZ Altenbergerstraße 69 4040 Linz, Österreich www.jku.at DVR 0093696

Abstract

Regularization is an important concept in the context of supervised machine learning. Especially with neural networks it is necessary to restrict their capacity and expressivity in order to avoid overfitting to given training data. While there are several well-known and widely used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their effect in the context of reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning algorithms relying on neural networks. We compare different state-of-the-art algorithms together with regularization methods for supervised machine learning to get a better understanding of how we can improve generalization in reinforcement learning. The main motivation for exploring this line of research is our current work on score following, where we try to train reinforcement learning agents to listen to and read music. These agents should learn from given musical training pieces to follow music they have never heard and seen before. Thus, the agents have to generalize, which is why this scenario is a suitable test bed for investigating generalization in the context of reinforcement learning.

The empirical results found in this thesis should primarily serve as a guideline for our future work in this field. Although we have a rather limited set of experiments due to hardware limitations, we see that regularization in reinforcement learning does not work in the same way as for supervised learning. Most notable is the effect of Batch-Normalization. While this technique did not work for one of the tested algorithms, it yields promising but very unstable results for another. We further observe that one algorithm is robust and not affected at all by regularization. In our opinion it is necessary to further explore this field and to perform a more in-depth and thorough study in the future.


Kurzfassung

In supervised machine learning, the concept of regularization plays an essential role. Especially for neural networks it is necessary to restrict their capacity and expressivity in order to avoid so-called overfitting to given training data. While there are several well-known and frequently used regularization techniques for supervised machine learning, such as L2-Normalization, Dropout or Batch-Normalization, their influence on reinforcement learning has not yet been investigated. In this thesis we give an overview of regularization in combination with policy gradient methods, a subclass of reinforcement learning based on neural networks. We compare different state-of-the-art algorithms together with regularization methods for supervised machine learning in order to understand how the generalization ability of reinforcement learning can be improved. The main motivation for investigating this research area is our current work on automatic score following, where we try to teach agents, by means of reinforcement learning, to listen to and read music. These agents are supposed to learn from given training pieces in order to then follow music they have never heard or seen before. The agents therefore have to be able to generalize, which makes this scenario a suitable test environment for studying generalization in the context of reinforcement learning.

The empirical results of this thesis are primarily meant to serve as a guideline for our future work in this field. Even though we could only run a limited number of experiments due to hardware constraints, we can nevertheless observe that regularization in reinforcement learning does not behave in the same way as in supervised learning. Particularly noteworthy is the influence of Batch-Normalization. While this technique did not work for one of the tested algorithms, it delivered promising, albeit unstable, results for another. Furthermore, we observe that one algorithm reacts robustly to regularization and is not affected by it at all. In our opinion it is necessary to continue research in this area and to conduct a more thorough and extensive study in the future.


Acknowledgments

First of all, I would like to thank my whole family, my loving partner and my friends who always supported me throughout my studies. Without them this would not have been possible. Furthermore, I would like to thank Gerhard Widmer, who not only supervised this thesis but also gave me the opportunity to work as a student researcher at the Institute of Computational Perception. I am really grateful for having the possibility to work in such a productive environment and with all my experienced colleagues. I would also particularly like to thank Matthias Dorfer, who guided me during the course of this thesis. One could not think of a better and more supportive advisor.

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement 670035, project Con Espressione).


Contents

1 Introduction
   1.1 Motivation
   1.2 Related Work
   1.3 Outline

2 Theory
   2.1 Reinforcement Learning
   2.2 Policy Gradient and Actor Critic Methods
       2.2.1 REINFORCE
       2.2.2 One-Step Actor Critic
       2.2.3 (Asynchronous) Advantage Actor Critic
       2.2.4 Proximal Policy Optimization
   2.3 Neural Networks
       2.3.1 Fully Connected Neural Networks
       2.3.2 Convolutional Neural Networks
       2.3.3 Back-Propagation and Gradient Based Optimization
       2.3.4 Activation Functions
   2.4 Regularization for Neural Networks
       2.4.1 L2-Normalization
       2.4.2 Dropout
       2.4.3 Batch-Normalization

3 Experimental Study
   3.1 The Score Following Game
   3.2 Experimental Setup
       3.2.1 The Nottingham Dataset
       3.2.2 Network Architectures
       3.2.3 Training and Validation
   3.3 Results and Discussion
       3.3.1 Result Summary
       3.3.2 Comparing Algorithms
       3.3.3 Comparing Activation Functions
       3.3.4 Comparing Regularization Techniques
   3.4 Implications on the Network Architecture

4 Conclusion


List of Figures

1 Agent-Environment interaction framework
2 Simple example neural network
3 Convolution visualization
4 Activation function comparison
5 Dropout
6 Time Domain to Frequency Domain
7 Score Following Game MDP
8 Score Following Game State Space
9 Score Following Game Reward function
10 Network Architecture sketch
11 Shallow Network: Validation set setting comparison
12 Shallow Network: Test set setting comparison
13 Deep Network: Validation set setting comparison
14 Deep Network: Test set setting comparison
15 Shallow Network: Algorithm comparison
16 Shallow Network: Performance with different activations
17 Deep Network: Performance with different activations
18 Shallow Network: Reinforce with regularization
19 Shallow Network: A2C with regularization
20 Deep Network: A2C with regularization
21 Shallow Network: PPO with regularization
22 Deep Network: PPO with regularization


List of Tables

1 Shallow Network Architecture
2 Shallow Network Architecture with Dropout
3 Shallow Network Architecture with Batch-Normalization
4 Deep Network Architecture
5 Deep Network Architecture with Dropout
6 Deep Network Architecture with Batch-Normalization
7 Hyperparameters
8 Shallow Network: Training set performance
9 Shallow Network: Validation set performance
10 Shallow Network: Test set performance
11 Deep Network: Training set performance
12 Deep Network: Validation set performance
13 Deep Network: Test set performance
14 Simplified Network Architecture S1
15 Simplified Network Architecture S2
16 Simplified Network Results


1 Introduction

In this introductory section, we illustrate the motivation behind this thesis, review research related to the subject of our work, and give an outline of the structure of the remaining content.

1.1 Motivation

In recent years, the field of reinforcement learning (RL) has gained a lot of attention with achievements in the domain of games like Atari or Go [25, 34]. Increasing computational power as well as improved algorithms are pushing the boundaries of what people thought to be too complex for computers to learn. However, the application of RL is not limited to games. Recent work successfully incorporates such techniques in fields like traffic control or the control of electric power systems [14, 23], making it an important research area with practical relevance.

During the course of our research on RL in the context of score following1, we stumbled upon several interesting open research problems. One of them is the problem of overfitting and generalization, which we want to address in this thesis. While there are multiple well understood and working regularization techniques to avoid overfitting in supervised machine learning, the effect of regularization with respect to RL is barely investigated. In this thesis, we explore how regularization techniques like L2-Normalization, Dropout and Batch-Normalization influence the behavior of state-of-the-art RL algorithms. Furthermore, we investigate the use of different activation functions in a deep and a shallow neural network architecture. To do so, we conduct experiments with state-of-the-art RL algorithms on an environment called the Score Following Game, which we developed as part of our research on score following [11].

1.2 Related Work

In this section we review work related to policy gradient methods and regularization in the context of RL. The most basic policy gradient algorithm is REINFORCE, which was introduced in 1992 by Williams [39]. The ideas used by this algorithm are the basis for most state-of-the-art policy gradient methods. In 2015, Lillicrap et al. introduced Deep Deterministic Policy Gradient (DDPG) [21], an off-policy RL algorithm utilizing experience replay, which seems to be especially suited for continuous control tasks.

1 Score following is the process of listening to a piece of music while following or tracking the piece in its score representation. This will be further explained in Section 3.

The concept of experience replay was further investigated by Wang et al. in 2016 and Andrychowicz et al. in 2017, where they introduce Actor-Critic with Experience Replay (ACER) [38] and Hindsight Experience Replay (HER) [1], respectively.

In 2015, Schulman et al. proposed Trust Region Policy Optimization (TRPO) [31], utilizing a trust region approach, meaning that policy updates only happen in such a way that the new policy is still considered trustworthy. The trust region is measured using the Kullback-Leibler divergence. In 2017 they also introduced Proximal Policy Optimization (PPO) [33], which is similar to TRPO, but more sample efficient and furthermore easier to implement.

In 2016, Mnih et al. introduced Asynchronous Advantage Actor Critic (A3C) [24], a multi-step actor critic method using several actors in parallel to encourage exploration. Their approach was further improved in terms of sample efficiency by Wu et al. in [41], where they introduce Actor Critic using Kronecker-factored Trust Region (ACKTR). They propose the use of a Kronecker-factored approximation of the natural gradient, which should allow for faster convergence than plain stochastic gradient descent.

To the best of our knowledge, there is not much work concerned with overfitting and regularization in deep RL, especially in the context of policy gradient methods. One recent study by Zhang et al. in 2018 elaborates on this problem and evaluates several strategies proposed to alleviate overfitting by introducing stochasticity into the learning process [42]. However, they do not explore the effects of regularizing the underlying function approximators as we will in the course of this thesis. Furthermore, there is an older study from 2011 related to regularization in RL [12]. However, this work is not concerned with policy gradient methods or newer regularization techniques such as Dropout and Batch-Normalization.

From all the aforementioned algorithms, we decide to use REINFORCE, a synchronous version of A3C (Advantage Actor Critic) and PPO in our study. We include REINFORCE as our baseline for this task and expect the other algorithms to outperform it in several aspects. The Advantage Actor Critic is chosen because of its simplicity, while still yielding good results in the literature. Instead of TRPO we choose PPO, because of the aforementioned benefits and because it is currently used as the default RL agent at OpenAI2.

2 https://blog.openai.com/openai-baselines-ppo/

1.3 Outline

The remainder of this thesis is structured in the following way. In Section 2 we explain the minimum theory required to understand this work. This comprises the basics of RL, policy gradient and actor critic methods, as well as an introduction to the three algorithms we are considering: REINFORCE with baseline, Advantage Actor Critic and Proximal Policy Optimization. Additionally, we briefly explain the idea behind neural networks and how they work. Finally, we introduce three different activation functions and three common regularization techniques for the supervised training of neural networks, which we examine in our experiments.

Section 3 is the main part of this thesis and is concerned with our experimental study. At first, we introduce the Score Following Game [11], which is the RL environment we used for the experiments. Afterwards we explain our experimental setup, including information about the dataset, neural networks and hyperparameters. The experiments themselves are split into three parts. First, we compare the algorithms. Second, we explore the effect of different activation functions, and finally we study the influence of different regularization techniques. At last, in Section 4 we conclude this thesis and provide an outlook on future work and possible research directions.

2 Theory

In the following section, we first elaborate the basic theory of reinforcement learning including policy gradient methods and the actor-critic setup, as well as a description of the algorithms used in our experimental study. Afterwards, we give an introduction to neural networks, especially the subclass of convolutional neural networks (CNNs) along with an explanation of activation functions used in machine learning. The last part of this section comprises a brief introduction to regularization techniques for neural networks covering L2-Normalization, Dropout and Batch-Normalization.

2.1 Reinforcement Learning

To summarize reinforcement learning in a nutshell, one could simply say that it is learning from interaction3. The two basic components are an agent and an environment, with which the agent will interact. In Figure 1 this process is visualized. At each time step t the environment is in a certain state St. The agent observes this state and reacts by choosing an action At. This action is performed within the environment, yielding a new state St+1 as well as an immediate reward signal Rt+1 for this action.


Figure 1: Interaction between the agent and the environment. Agent performs an action At depending on a state St. It re- ceives a new state St+1 and a reward Rt+1 for the current action. (Figure reproduced from [36])

The sole objective of the agent is to maximize this reward over time. However, the crucial part is that the agent cannot simply choose actions yielding a high immediate reward as it is possible that those actions will eventually lead to

3 This introduction to RL as well as our choice of notation is based on the book by Sutton and Barto [36].

suboptimal situations in the future. Thus, the agent has to plan in the long run and tries to maximize the sum of future rewards, which we call the return Gt. In order to keep this sum finite and to control the influence of future rewards, one uses a discounted formulation of the return as given in Equation (1)

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (1)$$

where γ is a discounting factor with γ ∈ [0, 1). Choosing a factor closer to 1 considers future rewards more strongly, while a factor closer to 0 emphasizes immediate rewards.
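To make the discounted return concrete, the following minimal Python sketch computes G_t for every step of a finished episode by iterating backwards over the collected rewards; the convention that rewards[t] stores R_{t+1} is our own assumption for the example.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k R_{t+k+1} for every step of an episode.

    rewards[t] is assumed to hold R_{t+1}, the reward received after action A_t.
    """
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):   # reuse the successor's return
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81 0.9  1.  ]
```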

Until now, we have not explained how an agent chooses an action. To do so, we introduce the term policy. The policy determines the behavior of an agent and can be seen as a conditional probability distribution over actions given states. We refer to the policy as π(a | s). Note that the decision of the agent only depends on the current state St, thus the agent should be able to choose an appropriate action by just observing this state. In this context we often speak of a Markov Decision Process (MDP) and the Markov property, respectively. The Markov property basically means that a future state only depends on the current state and not on the past, i.e. p(St+1 | St, St−1, St−2, ..., S0) = p(St+1 | St). While there are ways to tackle problems where the Markov property is violated, it is desirable to formulate environments in such a way that the state-transition process is Markovian [36].

If a stochastic decision process has the Markov property, it is called an MDP. An MDP is a quintuple consisting of

• S, a set of all possible states the environment can be in (state space)

• A, a set of all possible actions an agent can take (action space)

• R, a function defining the immediate rewards (reward function)

• P, a conditional probability distribution determining the probabilities for going from one state to another given a certain action (transition probabilities)

• γ, the discounting factor for controlling the trade-off between immediate and future rewards

Note that S and A do not necessarily have to be finite sets. For real world problems it is often the case that one of these two or even both sets are infinite.

Although we are not concerned with an infinite action space in this thesis, we later on briefly describe how this can be handled.

In order to learn a proper behavior, the agent needs the means to determine whether a state or an action in a certain state, respectively, is good or bad. We can measure the goodness of a state and of a state-action pair using the previously defined return Gt. For a state s, we define the state-value function as the expected return from time step t onwards given that we are in this particular state, and given that we follow a certain policy π

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \qquad (2)$$

Similar to this, we define the action-value function for a state s and action a

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] \qquad (3)$$

Note that we always define these functions with respect to a policy π. Assuming that we knew the optimal value functions q∗(s, a) and v∗(s), meaning that we have perfect knowledge of the value of a certain state or state-action pair, we could determine the best action within each state and in this way construct an optimal policy that yields the maximum reward.

It is possible to obtain such an optimal solution for problems with a small state space and where we have access to the underlying dynamics of the environment (colloquially the “laws of nature”, describing how an environment reacts to actions). However, this is infeasible for most real world applications. Therefore, other approaches have been developed. One of them are policy gradient methods, which we introduce in the following section.

2.2 Policy Gradient and Actor Critic Methods

The essential point of policy gradient methods is that instead of learning value functions, which are then used to determine a policy, we directly learn a policy. The definition of the policy therefore changes to a parametrized formulation in the form of π(a | s; θ). Learning a policy then means adapting this parametrization θ to conform to some desired behavior. In terms of policy gradient methods, this learning is achieved by using the gradient of a performance measure J(θ) and performing (stochastic) gradient ascent with respect to this performance measure

$$\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J(\theta_t)} \qquad (4)$$

where $\widehat{\nabla J(\theta_t)}$ is a stochastic approximation of the gradient of J with respect to the parameters θt. As this performance measure involves the policy π, we need to make sure that π(a | s; θ) is differentiable with respect to θ. One common form of parametrization for a discrete action space is the use of numerical preferences as shown in Equation (5) [36]

$$\pi(a \mid s; \theta) = \frac{\exp(h(s, a, \theta))}{\sum_b \exp(h(s, b, \theta))} \qquad (5)$$

where h(s, a, θ) are real-valued numerical preferences for state-action pairs parametrized by θ. In the following we assume θ to be the weights of a neural network. Equation (5) scales these preferences to a valid probability distribution, meaning that the state-action pair with the highest preference will have the highest probability. Once we have defined the policy, one can either act greedily by selecting the action with the highest probability, or stochastically by sampling an action from the probability distribution.
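As a small illustration of Equation (5), the sketch below turns numerical preferences for the actions of a state into a probability distribution and then selects an action either greedily or by sampling; the preference values are made up for the example.

```python
import numpy as np

def softmax(h):
    h = h - h.max()                 # subtract the maximum for numerical stability
    e = np.exp(h)
    return e / e.sum()

def select_action(preferences, greedy=False):
    """preferences: h(s, a, theta) for all actions in the current state s."""
    probs = softmax(preferences)
    if greedy:
        return int(np.argmax(probs)), probs                       # most probable action
    return int(np.random.choice(len(probs), p=probs)), probs      # sample stochastically

action, probs = select_action(np.array([1.2, 0.3, -0.5]))
print(action, probs)
```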

Real world problems often involve a continuous action space, therefore it is not possible to define such numerical preferences, as our set of actions is now infinitely large. A way of approaching this is by learning the parameters of probability distributions from which the actions will then be sampled. The common choice is a Gaussian distribution, where the parameters µ and σ are learned by function-approximators like neural networks [36]. However, there is also work where different probability distributions are used, e.g. the Beta distribution [7].

It remains to show how the performance measure J(θ) is defined. In general, one distinguishes between episodic and continuing tasks. Continuing tasks are ongoing tasks, where the agent is constantly acting inside the environment. Episodic tasks usually have a clearly defined ending. One game of chess could for example be seen as a single episode with three different terminal states (win, lose, draw). The agent then plays several of those episodes until it is able to solve a problem. In this thesis we are concerned with episodic tasks and will therefore derive the performance measure for this case.

For episodic tasks the performance measure is defined as the value of the start state s0 by

$$J(\theta) = v_{\pi_\theta}(s_0) \qquad (6)$$

with the value function v following policy π parameterized by θ. Recall that the value of a state is the expected future return starting in this specific state. Therefore, we will maximize our future return by maximizing the objective J(θ).

The problem, however, is that by using this formulation the performance depends not only on the action selection but also on the distribution of states. While the first is not severe, the latter is unfavorable due to the usually unknown effect of the policy on the state distribution, as it is a function of the environment [36]. Fortunately, this can be rewritten to what is known as the policy gradient theorem, which does not rely on the derivative of the state distribution

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta) \qquad (7)$$

where µ(s) is the probability of being in state s. A derivation of the whole theorem can be found in [36].

2.2.1 REINFORCE

With the policy gradient theorem, we can finally derive an update rule for learning a policy. Recall that µ(s) is the probability of being in state s, given that we follow our current policy. The policy gradient theorem is therefore a probability weighted sum, which allows us to rewrite Equation (7) as an expectation

$$\begin{aligned}
\nabla J(\theta) &\propto \sum_s \mu(s) \sum_a q_\pi(s, a) \nabla_\theta \pi(a \mid s; \theta) \\
&= \mathbb{E}_\pi\left[ \sum_a q_\pi(S_t, a) \nabla_\theta \pi(a \mid S_t; \theta) \right] \\
&= \mathbb{E}_\pi\left[ \sum_a \pi(a \mid S_t; \theta)\, q_\pi(S_t, a) \frac{\nabla_\theta \pi(a \mid S_t; \theta)}{\pi(a \mid S_t; \theta)} \right] \\
&= \mathbb{E}_\pi\left[ q_\pi(S_t, A_t) \frac{\nabla_\theta \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)} \right] \\
&= \mathbb{E}_\pi\left[ G_t \frac{\nabla_\theta \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)} \right]
\end{aligned} \qquad (8)$$

Using this, we can refine the update formula given in Equation (4) and arrive at what was introduced as the REINFORCE algorithm by Williams [39]

$$\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t \mid S_t; \theta_t)}{\pi(A_t \mid S_t; \theta_t)} \qquad (9)$$

Note that $\frac{\nabla_\theta \pi(A_t \mid S_t; \theta_t)}{\pi(A_t \mid S_t; \theta_t)}$ is often written as $\nabla_\theta \ln \pi(A_t \mid S_t; \theta_t)$. An intuitive explanation of this formula is that the gradient points into the direction of the parameter space which increases the probability of taking this action in the future. If this action was beneficial, the return will most probably be

high and if not it will be low. This scales the update, so a high return will increase the probability of taking this action more than a low return would. The denominator is used as a normalizing factor, in order to avoid an advantage of actions that are selected more often, i.e. have a higher probability. Most state-of-the-art policy gradient algorithms use a variation of this formula.

One problem with the REINFORCE algorithm, and with policy gradient methods in general, is that the variance of the updates (gradients) can be very high. The reason for this is that the updates strongly depend on the used samples (the trajectory). High variance has the effect that it slows down learning. One obvious way to alleviate this is to use more samples. However, this is often not possible as creating such samples, be it in real life or in a simulation, can be very costly [38]. It is therefore desirable to create agents that learn fast, while using as few interactions with the environment as possible. In the literature this is often referred to as sample efficiency [38, 41]. There are algorithms that try to do so, e.g. actor-critic with experience replay (ACER) [38]. However, as this thesis focuses on investigating the effect of regularization on RL, we are not considering them.

One simple approach to reduce variance is to include a baseline b(s), to which the action-value function will be compared. This changes the policy gradient theorem to

$$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left(q_\pi(s, a) - b(s)\right) \nabla_\theta \pi(a \mid s; \theta) \qquad (10)$$

This baseline could be any function or even a constant, however it must not depend on the actions a. The idea behind this is intuitive. For instance, consider the baseline to be the average reward we would normally gather in a certain state. Using this in the update formula allows us to evaluate an action we performed relative to the average performance we usually achieve in this state. This means, if we actually got less reward than on average, the action was probably not that good and we therefore want to discourage this behavior. On the other hand, if we performed better than on average, the action should be encouraged in the future. In practice, the function used as a baseline is often an estimate of the state-value function vˆ(St; θv), where θv is again a learned parametrization, e.g. a neural network. The update formula for the REINFORCE with baseline algorithm is given in Equation (11)

$$\theta_{t+1} = \theta_t + \alpha \left(G_t - \hat{v}(S_t; \theta_v)\right) \nabla_\theta \ln \pi(A_t \mid S_t; \theta_t) \qquad (11)$$

Note that adding this baseline does not add bias to the update rule. REINFORCE with baseline will be the first algorithm we consider in our experiments in Section 3. For simplicity we will refer to it as Reinforce.
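To illustrate how Equation (11) translates into code, the following PyTorch sketch performs one Reinforce update after a finished episode. It assumes that the log-probabilities of the chosen actions and the baseline estimates were collected during the rollout; the function name and the weighting factor value_coef are our own choices, and this is a minimal illustration rather than the implementation used for the experiments in this thesis.

```python
import torch

def reinforce_with_baseline_update(optimizer, log_probs, values, returns, value_coef=0.5):
    """One update after a finished episode, following Eq. (11).

    log_probs: log pi(A_t | S_t; theta) collected during the rollout
    values:    baseline estimates v_hat(S_t; theta_v) for the same steps
    returns:   Monte Carlo returns G_t
    """
    log_probs = torch.stack(log_probs)
    values = torch.stack(values).squeeze(-1)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    advantages = returns - values.detach()           # (G_t - v_hat); no gradient into the baseline
    policy_loss = -(log_probs * advantages).mean()   # negated for gradient *ascent* on J
    value_loss = (returns - values).pow(2).mean()    # fit the baseline to the observed returns

    optimizer.zero_grad()
    (policy_loss + value_coef * value_loss).backward()
    optimizer.step()
```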

2.2.2 One-Step Actor Critic

In machine learning it is often advantageous to introduce bias in order to reduce variance. This is referred to as the bias-variance trade-off [5]. Actor critic methods introduce a bias in the form of bootstrapping, meaning that we update the value estimate of one state by using the value estimate of subsequent states. Using this knowledge we can define the update for the one-step actor critic

$$\begin{aligned}
\delta_t &= R_{t+1} + \gamma \hat{v}(S_{t+1}; \theta_v) - \hat{v}(S_t; \theta_v) \\
\theta_{t+1} &= \theta_t + \alpha \delta_t \nabla_\theta \ln \pi(A_t \mid S_t; \theta_t)
\end{aligned} \qquad (12)$$

The learned state-value function vˆ(St; θv) acts again as a baseline and additionally serves as an estimate of the future return. The intuition behind this formulation is that δt (also called the advantage) is used to evaluate the action we took in a certain state compared to the average performance in this state. This means that if an action was better than our average performance, we will increase the probability of taking it in the future, as the expression inside the brackets will be positive. If it was worse, on the other hand, the expression will be negative and we thus reduce the probability.

The advantage of such a formulation is that it is not necessary to wait for the whole return Gt of one trajectory/episode until an update can be applied. Therefore actor critic methods are able to learn online, i.e. incrementally improve their policy after each step.

In terms of actor critic methods the policy is referred to as the actor and the state-value function as the critic. A common choice for optimizing the state-value function is to minimize the squared difference between the immediate one-step reward plus the value estimate of the next state, and the state-value estimate of the current state. This translates to

$$\begin{aligned}
R &= R_t + \gamma \hat{v}(S_{t+1}; \theta_v) \\
\theta_v &= \theta_v - \alpha_v \frac{\partial \left(R - \hat{v}(S_t; \theta_v)\right)^2}{\partial \theta_v}
\end{aligned} \qquad (13)$$

where αv is a separate learning rate for the critic. It is, however, often the case that actor and critic share some of their parameters and thus can be learned jointly by a single optimizer with a single learning rate [24, 41].
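A minimal sketch of one such update, combining Equations (12) and (13) in a single optimizer step (as is common when actor and critic share parameters), could look as follows; the function and argument names are ours, not taken from the thesis.

```python
import torch

def one_step_actor_critic_update(optimizer, log_prob, reward, value_s, value_s_next,
                                 done, gamma=0.99):
    """Single-transition actor critic update.

    log_prob:     log pi(A_t | S_t; theta) of the chosen action
    value_s:      v_hat(S_t; theta_v), with gradients enabled
    value_s_next: v_hat(S_{t+1}; theta_v), used only as a fixed bootstrap target
    """
    target = reward + gamma * value_s_next.detach() * (1.0 - float(done))
    delta = target - value_s                   # TD error / advantage of Eq. (12)

    critic_loss = delta.pow(2)                 # squared error of Eq. (13)
    actor_loss = -delta.detach() * log_prob    # policy part of Eq. (12)

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```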

2.2.3 (Asynchronous) Advantage Actor Critic

In 2016, Mnih et al. introduced the Asynchronous Advantage Actor Critic (A3C) algorithm [24], which is basically a multi-step actor-critic method. Instead of updating the policy and value function after each step, a certain number of tmax steps are taken before applying the update. This changes the update rule to

$$A(S_t, A_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i R_{t+i} + \gamma^k \hat{v}(S_{t+k}; \theta_v) - \hat{v}(S_t; \theta_v) \qquad (14)$$

$$\theta_{t+1} = \theta_t + \alpha A(S_t, A_t; \theta_t, \theta_v) \nabla_\theta \log \pi(A_t \mid S_t; \theta_t)$$

where A(St, At; θ, θv) is called the advantage function and k is at most tmax, or a smaller number in case a terminal state was reached before tmax steps could be performed. vˆ is again an estimate of the value function. Furthermore, they make use of entropy regularization as introduced by Williams et al. [40]. They add the entropy of the policy to the update rule, which should avoid early convergence to non-optimal policies as well as encourage exploration. The idea is to keep the entropy high and thus have more evenly distributed action selection probabilities. Thereby different actions are chosen more frequently, which in turn leads to more exploration.
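The multi-step advantage of Equation (14) can be computed for one rollout segment of at most tmax steps as sketched below (the entropy bonus would be added separately to the policy loss); the naming is ours and this is only an illustration.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantages of Eq. (14) for one segment of at most t_max steps.

    rewards, values: R_{t+i} and v_hat(S_{t+i}; theta_v) for the segment
    bootstrap_value: v_hat(S_{t+k}; theta_v) of the state after the segment (0 if terminal)
    """
    returns = np.zeros(len(rewards))
    g = bootstrap_value
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * g      # discounted rewards plus bootstrapped tail
        returns[i] = g
    return returns - np.asarray(values, dtype=float)
```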

The special characteristic of the A3C algorithm is that multiple actors are used in parallel, on different instances of the same environment. It is likely that these actors are not acting the same over all instances, hence different parts of the environment will be explored. This not only improves exploration, but also provides decorrelated samples for updating. These updates happen asynchronously, utilizing a parallelized approach to stochastic gradient descent [27].

To avoid the technical complexity of asynchronous updates, one can use a synchronous variant of the A3C algorithm, dubbed A2C for Advantage Actor Critic. In this case, the updates are not applied asynchronously; instead, once all actors have finished a number of steps, an update is applied by averaging over all of them. This version is now used by several researchers for comparing their algorithms [37, 41].

2.2.4 Proximal Policy Optimization

Proximal Policy Optimization (PPO) was introduced in 2017 by Schulman et al. [33] and is currently used as the default algorithm for RL problems by OpenAI. Like A2C, PPO is an actor critic method, which uses the same approach

of parallel actors. The difference, however, is how the advantage function as well as the loss of the policy is defined and how PPO handles updates.

As an advantage function, PPO uses generalized advantage estimation (GAE) [32]. Using this advantage formulation we get

$$\delta_t^V = R_t + \gamma \hat{v}(S_{t+1}; \theta_v) - \hat{v}(S_t; \theta_v) \qquad (15)$$

$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V \qquad (16)$$

where vˆ is again an estimate of the value function, γ the discounting factor and λ an additional adaptable hyperparameter. If, for instance, one chooses λ = 0 then

$$\hat{A}_t^{GAE(\gamma,0)} = \delta_t^V = R_t + \gamma \hat{v}(s_{t+1}; \theta_v) - \hat{v}(s_t; \theta_v) \qquad (17)$$

By choosing the λ parameter accordingly, one can control the bias-variance trade-off. A value closer to zero introduces more bias due to the immediate approximation, and in this way the variance tends to be lower [32]. Using λ = 1 on the other hand has high variance and almost no bias due to the summation of the rewards, as one can see in Equation (18)

$$\hat{A}_t^{GAE(\gamma,1)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty} \gamma^l R_{t+l} - \hat{v}(s_t; \theta_v) \qquad (18)$$

One problem with regular policy gradient methods is data or sample efficiency, i.e. the number of interactions the agent has to perform within the environment until it is able to solve a given task. High variance during the learning process leads to a low sample efficiency, and thus the goal is to reduce variance through various methods [32]. Besides GAE, PPO does so by performing multiple update steps using the same samples created during interaction. However, in order to do so the loss used for optimizing the policy has to be changed, as reusing samples would otherwise have a damaging effect on the policy itself [33].
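A compact way to compute the GAE advantages of Equations (15) and (16) for a finite rollout segment is sketched below; the truncation at the segment end and the default λ = 0.95 are assumptions of the example, not values taken from the thesis.

```python
import numpy as np

def generalized_advantage_estimation(rewards, values, bootstrap_value,
                                     gamma=0.99, lam=0.95):
    """GAE over one rollout segment, truncated at the segment end."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.append(np.asarray(values, dtype=float), bootstrap_value)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t^V of Eq. (15)
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):
        acc = deltas[t] + gamma * lam * acc                # recursive form of Eq. (16)
        advantages[t] = acc
    return advantages
```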

Consider the regular policy loss of REINFORCE and using one trajectory to update the policy multiple times. This would lead to large updates and serious overfitting with respect to the current trajectory. To alleviate this problem, a new loss (also termed surrogate objective) is introduced, given as

$$\mathbb{E}\left[ \min\left( r_t(\theta) \hat{A}_t,\; \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right] \qquad (19)$$

where $r_t(\theta) = \frac{\pi_\theta(A_t \mid S_t)}{\pi_{\theta_{old}}(A_t \mid S_t)}$ is the probability ratio between the policy under the parametrization θ and the policy under the parametrization θold, denoting the policy at the previous update iteration. Aˆ denotes the estimated advantage function in the form of GAE. If the parametrizations are equal, i.e. θ = θold, this ratio is one. By using clip inside the objective, the ratio is clipped to lie within [1 − ϵ, 1 + ϵ] such that the updates will not drive the new policy too far away from the old one. Taking the minimum afterwards results in a loss that is a lower bound on $r_t(\theta)\hat{A}_t$, and furthermore includes a penalty if the updates of the policy are too large.
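The clipped surrogate of Equation (19) is straightforward to express in code; the sketch below returns it as a loss to be minimized, assuming log-probabilities of the actions under the new and the old policy are available.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective of Eq. (19), negated so it can be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages  # restrict the update
    return -torch.min(unclipped, clipped).mean()
```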

Comparing clipping with no clipping, Schulman et al. show that without clipping no useful policy is learned, performing even worse than a random one, while with clipping the agent succeeds. Furthermore, we can see from their results that ϵ is another hyperparameter that can significantly influence the results. Thus, proper hyperparameter tuning is required.

2.3 Neural Networks

Neural networks are the de facto standard in various tasks like image classification [20], [16] or music information retrieval [19]. In the following, we give a superficial overview of the basics of neural networks, as they are used for approximating the policy and value function of policy gradient methods. We explain the idea behind convolutional neural networks (CNNs) and error back-propagation, and introduce three state-of-the-art activation functions that we consider in our experimental study.

2.3.1 Fully Connected Neural Networks

As shown in Figure 2, a fully connected neural network (FCNN) consists of inputs x = {x1, x2, ..., xn}, an arbitrary number of hidden layers with (hidden) units and an output layer with a number of output units. Each layer l is connected to its subsequent layer by weights W(l) as well as an additional bias vector b(l). Furthermore, we have hidden layer activation functions h(l)(x) and an output layer activation function f(x) that process their input x.

Figure 2: A simple neural network with three inputs, a single hidden layer with three units and two outputs.

A neural network can be formalized using matrix notation. As an example, we will do this for the network shown in Figure 2.

$$\mathbf{x} = \begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix} \qquad (20)$$

$$W^{(1)} = \begin{pmatrix} w^{(1)}_{11} & w^{(1)}_{12} & w^{(1)}_{13} \\ w^{(1)}_{21} & w^{(1)}_{22} & w^{(1)}_{23} \\ w^{(1)}_{31} & w^{(1)}_{32} & w^{(1)}_{33} \end{pmatrix} \qquad (21)$$

$$W^{(2)} = \begin{pmatrix} w^{(2)}_{11} & w^{(2)}_{12} \\ w^{(2)}_{21} & w^{(2)}_{22} \\ w^{(2)}_{31} & w^{(2)}_{32} \end{pmatrix} \qquad (22)$$

$$\mathbf{b}^{(1)} = \begin{pmatrix} b^{(1)}_1 & b^{(1)}_2 & b^{(1)}_3 \end{pmatrix} \qquad (23)$$

$$\mathbf{b}^{(2)} = \begin{pmatrix} b^{(2)}_1 & b^{(2)}_2 \end{pmatrix} \qquad (24)$$

The input to the first hidden activation function is defined by a matrix multiplication and a vector addition.

$$\text{net}_{h^{(1)}}(\mathbf{x}) = \mathbf{x} \cdot W^{(1)} + \mathbf{b}^{(1)} = \begin{pmatrix} x_1 w^{(1)}_{11} + x_2 w^{(1)}_{21} + x_3 w^{(1)}_{31} + b^{(1)}_1 \\ x_1 w^{(1)}_{12} + x_2 w^{(1)}_{22} + x_3 w^{(1)}_{32} + b^{(1)}_2 \\ x_1 w^{(1)}_{13} + x_2 w^{(1)}_{23} + x_3 w^{(1)}_{33} + b^{(1)}_3 \end{pmatrix}^{T} \qquad (25)$$

The hidden activation function h(1) is then

$$h^{(1)}(\mathbf{x}) = g\left(\text{net}_{h^{(1)}}(\mathbf{x})\right) \qquad (26)$$

where g is a differentiable function, e.g. tanh or sigmoid. We will later on introduce three special kinds of activation functions that are commonly used in deep learning.

The final output is defined in the same way (we now omit writing out the whole matrix calculation):

$$\text{net}_f(\mathbf{x}) = h^{(1)}(\mathbf{x}) \cdot W^{(2)} + \mathbf{b}^{(2)} \qquad (27)$$

$$f(\mathbf{x}) = g\left(\text{net}_f(\mathbf{x})\right) \qquad (28)$$

where g is again a differentiable function. For (multinomial) classification tasks this function is chosen to be the softmax function given in Equation (29). The idea is to normalize a vector in such a way that its values lie in the range (0, 1) and sum up to 1. Thus, this vector can be interpreted as a probability distribution

15 over some classes, or a probability distribution over actions to forge a bridge to RL

$$\text{softmax}(\mathbf{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} \qquad (29)$$
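The forward pass of the toy network from Figure 2 can be written out in a few lines of NumPy; the random weights and the choice of tanh as hidden activation are purely illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# toy network from Figure 2: 3 inputs, one hidden layer with 3 units, 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])
h = np.tanh(x @ W1 + b1)      # Eqs. (25)-(26) with g = tanh
y = softmax(h @ W2 + b2)      # Eqs. (27)-(29)
print(y, y.sum())             # a probability distribution over the two outputs
```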

2.3.2 Convolutional Neural Networks

The previously described neural network is called fully connected, meaning that every unit in a layer is connected to every unit in the following layer. This might become a problem if one has to process high dimensional input data. For instance, consider a (single-channel) image with 256×256 pixels, where each pixel has to be treated as one input dimension, and one hidden layer with 128 units. Then the first weight matrix would have a dimensionality of 65,536×128 and thus 8,388,608 entries, i.e. parameters that have to be adjusted. As a neural network does not consist of only a single layer, the total number of parameters is even higher.

CNNs circumvent this by using the convolution operation and so-called kernels instead of the regular matrix multiplication. Reconsidering our example, the image is now treated as a two dimensional input. This input is then convolved with a kernel of a certain size, typically 3×3 or 5×5. The result of this operation is called a feature map. So, instead of more than a million parameters, we would only need 9 for a 3×3 kernel. Usually one uses multiple kernels in a convolutional layer and thus produces several feature maps, but the number of parameters is still far below that of a FCNN.

Using a two dimensional input I and kernel K we define the convolution operation by

$$(K * I)(i, j) = \sum_m \sum_n I(i - m, j - n) K(m, n) \qquad (30)$$

For a better understanding, the process is visualized in Figure 3, where a 3×4 input is convolved with a 2×2 kernel. The kernel is moved over the whole input step by step and each position in the image is multiplied with the respective position in the kernel, which is afterwards summed up and passed through an activation function. In this example we do not consider cases where parts of the kernel would lie outside the input image. Therefore the resulting output (the feature map) is smaller than the input. To keep the output image at the same size, one can pad the border of the input with zeros (zero padding). Instead of shifting the kernel one step at a time, one can also use bigger step sizes. This is generally referred to as the stride. Especially in the first layers of

a CNN it might be advantageous to choose a bigger stride in order to reduce the dimensionality of the input.

Figure 3: Convolution of a 3 × 4 input with a 2 × 2 kernel, without zero padding and a stride of one. For CNNs, the val- ues in the output cells will be passed through an activation function before being processed by a subsequent layer. Figure reproduced from [15]
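A direct NumPy sketch of this operation is given below. Note that, like most CNN implementations, it computes the cross-correlation variant (the kernel is not flipped as in the strict definition of Equation (30)); padding is omitted, so the output shrinks exactly as in Figure 3.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """'Valid' 2D convolution (cross-correlation form) without zero padding."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, then sum
    return out

feature_map = conv2d_valid(np.arange(12.0).reshape(3, 4), np.ones((2, 2)))
print(feature_map.shape)   # (2, 3): smaller than the 3x4 input, as in Figure 3
```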

Besides the fact that we can reduce the overall number of parameters, CNNs have another important advantage, which is called translation equivariance or sometimes invariance. Colloquially this means that for a CNN it does not matter where in the input a certain pattern occurs as it will still be able to recognize it, e.g. detect the same cat no matter if it is in the bottom left or bottom right corner of an image.

In Section 3 we introduce the network architectures for our experiments which consist of a combination of several convolutional and fully connected layers. We use the convolutional layers to learn representations from high dimensional input images and the fully connected layers to learn the policy and value function.

2.3.3 Back-Propagation and Gradient Based Optimization

Neural networks are usually trained using the back-propagation algorithm (backprop) together with a gradient-based optimization technique [29]. The goal of optimization is to minimize, respectively maximize, a function. It is often the case that one wants to minimize an error/objective function J(θ) that is parametrized by θ, e.g. the weights of a neural network.

If we now take the derivative of this function with respect to the weight vector, we get the gradient $\frac{\partial J(\theta)}{\partial \theta} = \nabla_\theta J(\theta)$. This gradient gives us the slope of J(θ) at θ, or in other words points in the direction of the steepest ascent. Thus, we can minimize the objective by performing small update steps in the direction of the negative gradient, which results in the following update rule for gradient descent

$$\theta = \theta - \eta \nabla_\theta J(\theta) \qquad (31)$$

where η > 0 is the learning rate or step-size. There are several known issues with gradient descent, like oscillations or slow convergence due to flat error surfaces. In general the algorithm is quite sensitive to the choice of the learning rate, which is crucial to choose properly. Researchers try to overcome these issues by using more advanced versions that extend this algorithm, like RMSprop, AdaGrad or Adam [28].
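As a minimal worked example of Equation (31), the loop below minimizes the one-dimensional objective J(θ) = (θ − 3)² with plain gradient descent; the objective is chosen only for illustration.

```python
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    """Plain gradient descent of Eq. (31); grad_fn returns the gradient of J at theta."""
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# J(theta) = (theta - 3)^2 has the gradient 2 * (theta - 3)
print(gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0))  # approaches 3.0
```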

To compute the required gradient, the backprop algorithm is used. This simply refers to the process of calculating the gradient on each layer of a neural network by propagating information (error deltas in function approximation) backwards from the output layer. This is done by recursively applying the chain rule of calculus. So, starting with a new input, the data is passed forward through the neural network and its hidden layers, until finally an output is produced on the last layer. This is referred to as forward propagation or the forward pass. Using this output, the gradient of the objective with respect to the weights of the last layer is computed. This information is then passed backwards and used on the preceding hidden layers to calculate the gradients for the weights of the hidden units.

2.3.4 Activation Functions

We already mentioned two activation functions, namely tanh and sigmoid. While these are still in use, it is nowadays more common to use different functions [15]. In the following, we explain three state-of-the-art activation functions: rectified linear units (ReLUs) [26], exponential linear units (ELUs) [8] and scaled exponential linear units (SELUs) [18].

As shown in Equation (32) and Figure 4, the ReLU activation function clips all negative values to zero and otherwise passes x linearly. An advantage of this function is that the gradient will either be 0 or 1, which is simple to calculate. If x > 0 the gradient will always be 1, thus alleviating the problem of vanishing gradients in these cases. If x ≤ 0 the gradient will be 0, which on the one hand has the advantage that activations will be sparse. This means some hidden units will be set to 0 and thus not fire, which might help during learning. On the other hand, it could occur that a unit is never active and thus its weights will not be adapted during training [22].

$$\text{relu}(x) = \max(0, x) \qquad (32)$$

Like ReLUs, ELUs let positive values pass linearly. The difference, however, is that negative values are not mapped to 0, but exponentially diminish and saturate at a value specified by the parameter α. An example with α = 1 is shown in Figure 4, where one can see that the negative domain saturates close to a value of −1. As ELUs allow negative values, this circumvents the problem of inactive neurons, while still mitigating the problem of vanishing gradients. Furthermore, the mean value of the activation function is closer to zero, which should permit faster learning [8].

$$\text{elu}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases} \qquad (33)$$

As the name suggests, SELUs are basically ELUs scaled by an additional parameter λ as shown in Equation (34). Like ELUs, SELUs have a mean close to zero, but additionally keep the variance near one. In their paper, Klambauer et al. show that activation functions with these characteristics have self-normalizing properties, meaning that in all layers of the network the activations converge to zero mean and unit variance. This allows to robustly train networks with many layers.

$$\text{selu}(x) = \begin{cases} \lambda x & \text{if } x > 0 \\ \lambda(\alpha e^x - \alpha) & \text{if } x \leq 0 \end{cases} \qquad (34)$$


Figure 4: Comparison of ReLU, ELU and SELU activation. For ELU we use α = 1 and for SELU we use α = 1.67326 and λ = 1.0507 as proposed in the original paper [18].
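For reference, the three activations of Equations (32)-(34) can be written in NumPy as follows, using the α and λ values from the caption of Figure 4.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                                    # Eq. (32)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))         # Eq. (33)

def selu(x, alpha=1.67326, lam=1.0507):
    return lam * np.where(x > 0, x, alpha * np.exp(x) - alpha)   # Eq. (34)

x = np.linspace(-3, 3, 7)
print(relu(x), elu(x), selu(x), sep="\n")
```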

2.4 Regularization for Neural Networks

Regularization is a widely used technique in supervised machine learning to avoid overfitting to training data and thus allow for generalization to unseen observations. The problem with neural networks is that they have a high capacity and expressivity. Thus, they easily overfit to data and are even able to learn the training data by heart. Three common methods for neural networks to alleviate these problems are L2-Normalization (L2-Norm), Dropout and Batch-Normalization (Batch-Norm) [15]. While their behavior is widely investigated and acknowledged for supervised learning, this is not yet the case for RL.

As we normally do not have the basic train/test data setup in RL, we cannot easily observe the effect of overfitting. However, it is known that RL algorithms even tend to overfit to the latest episodes or observations [13]. Therefore, it is relevant to explore the influence of regularization in RL. The implementation of the previously described learning algorithms is not affected by the use of regularization; it only concerns the underlying neural networks used for function approximation. In the following, the use and effect of the three methods is explained.

2.4.1 L2-Normalization

A regularization technique that is also used for other methods like linear or logistic regression is the use of norm penalties. The idea is to adapt the general objective function by adding a term that acts as a penalty on the model parameters. Given some objective function J(θ) and a norm penalty Ω(θ) on the parameters θ, we derive the new regularized objective J′ as

$$J'(\theta) = J(\theta) + \alpha \Omega(\theta) \qquad (35)$$

where the amount of regularization is adapted by the parameter α ≥ 0. Choosing α = 0 results in no regularization, while higher values yield a stronger regularization.

If we now minimize the new objective function, we minimize the original objective as well as the penalty term, which measures the size of the parameters. Thus, during optimization the model is forced to produce smaller weights and only has high values if really necessary, i.e. if the gain on the original objective outweighs the penalty. The purpose of keeping the weights small is to limit the capacity of a model, i.e. restrict its power of adapting too much to the

21 training data. Thereby, the effect of overfitting should be mitigated.

The parameter norm penalty we are using is the L2 norm given by

$$\Omega(\theta) = \frac{1}{2} \|\mathbf{w}\|_2^2 \qquad (36)$$

where w are only the weights (weight vectors) of the model excluding the bias, as one usually applies regularization only to the weights [15]. Thus, this kind of regularization is often referred to as weight decay.
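In code, the regularized objective of Equations (35) and (36) amounts to adding the scaled squared weight norm to the task loss, as sketched below; during optimization this contributes the familiar weight decay term αw to the gradient. The value of α is an illustrative assumption.

```python
import numpy as np

def l2_regularized_objective(task_loss, weight_matrices, alpha=1e-4):
    """J'(theta) = J(theta) + alpha * 0.5 * ||w||_2^2; biases are excluded from the penalty."""
    penalty = 0.5 * sum(np.sum(w ** 2) for w in weight_matrices)
    return task_loss + alpha * penalty
```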

2.4.2 Dropout

Dropout is a technique to avoid overfitting and was proposed in 2014 by Srivastava et al. [35]. The basic idea behind dropout is to randomly deactivate units of the neural network during training, i.e. to set their activation to zero. In Figure 5, we provide a visualization of this process. On the left we see the fully connected network and on the right we see certain units that are dropped on some of the layers. The idea is that by dropping some units, the network is less likely to adapt to and rely on certain input configurations of the previous layers. Thus, the network should not be able to easily overfit to the data, as it can never be sure which units are active.

It is important to note that the SELU activation function should not be used with the standard dropout technique, as this would change the mean and variance of the activation. Thus, the self-normalizing properties would be violated. In order to preserve these, Klambauer et al. introduce alpha dropout [18], where the activation is not set to zero, but to −λα (the value in the negative domain, where the SELU function saturates).
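A minimal sketch of standard dropout is shown below. It uses the "inverted" formulation that rescales the surviving activations already during training (so nothing has to change at test time), which is the variant commonly used in practice; the drop probability is an illustrative choice.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=np.random.default_rng()):
    """Randomly zero units during training and rescale the rest (inverted dropout)."""
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop   # keep each unit with prob. 1 - p_drop
    return activations * mask / (1.0 - p_drop)
```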

2.4.3 Batch-Normalization

In contrast to the aforementioned methods, the main purpose of Batch-Norm is not regularization. When it was introduced in 2015 by Ioffe and Szegedy [17], the idea was that Batch-Norm reduces the internal covariate shift, i.e. the effect that during training of a neural network the distribution of the inputs to each layer is changing. As a consequence lower learning rates are necessary, which in turn slows down learning. To reduce this effect, the inputs to all layers are normalized to zero mean and unit variance. This is done by subtracting the mean and dividing by the standard deviation of a mini-batch. According to their findings, this reduces overfitting and allows for better generalization [17].


Figure 5: A visualization of the dropout process, reproduced from [35]. On the left we see the fully connected neural network with two hidden layers and four units on each hidden layer. On the right we see the same network with some units dropped (marked by a cross).

Why and how Batch-Norm really improves learning is still part of active research and not entirely understood. A recent paper from 2018 by Santurkar et al., for example, contradicts the belief that Batch-Norm reduces the internal covariate shift [30]. Instead they claim that it has a smoothing effect on the error surface of the optimization problem, which is beneficial in terms of the gradient and the learning rates we can choose.

So, although the theory behind Batch-Norm is not perfectly understood, we nevertheless want to explore its impact with respect to deep RL, as it is a commonly used technique in supervised learning. As a final remark, it is important to note that the SELU activation function with its self-normalizing properties has a similar effect and should thus not be used in combination with Batch-Norm.
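For completeness, the core normalization step of Batch-Norm is sketched below for a mini-batch of feature vectors. The learned scale and shift parameters γ and β belong to the standard formulation of [17]; the running statistics used at inference time are omitted from this sketch.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature of a mini-batch (rows = samples) to zero mean and unit
    variance, then rescale and shift with the learned parameters gamma and beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```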

3 Experimental Study

In this section, we describe the experiments we conducted to investigate the impact of regularization on RL. First, we introduce the Score Following Game, an RL environment for score following, and motivate why we use it for our study. Afterwards we describe our experimental setup and elaborate on our main experiments. Each one is treated separately, followed by a discussion of the outcome. Finally, we compare our results and discuss our overall findings.

3.1 The Score Following Game

Score following means following or tracking a musical performance within a respective representation of the currently playing piece, e.g. sheet music (the score). Colloquially, this describes the process of a computer listening to a performance or recording of a piece of music and trying to read along within the sheet representation, like a musician would do when playing a piece. In order for a computer to process the musical performance, it is usually transformed from the time domain into the frequency domain using the Short-time Fourier transform (STFT). The resulting representation is called a spectrogram. A visualization before and after the transformation is shown in Figure 6.


Figure 6: In (a) the score of a C-major scale is shown, (b) is a synthesized performance of the score in the time domain and (c) is the audio transformed from the time domain to frequency domain.

There are various application domains for score following, like the accompaniment of solo musicians or automatic page turning of sheet music [3, 6]. Current approaches utilize probabilistic models like Hidden Markov Models (HMMs) or Conditional Random Fields [4] as well as Dynamic Time Warping [2]. In recent years, deep learning has also been applied to this problem in the form of multi-modality learning, where neural networks are trained to learn directly from audio and images of scores [9, 10]. This multi-modality approach is further investigated in [11], where we propose an RL approach to score following, for which we developed the Score Following Game environment.

In order to solve score following with RL, we have to formulate it as an MDP. A sketch of the score following MDP is shown in Figure 7. As our state space S we define score and spectrogram excerpts including delta frames (Figure 8). These delta frames are the one-step differences from the current state to the previous one. By including them, we approximately arrive at a state representation satisfying the Markov property, as it encodes the transition dynamics of the environment. This should enable the agent to determine its current speed in the image and how fast the music is performed.


Figure 7: The MDP for the score following game. The agent receives a reward and the current state from the environment, where a state is given as the spectrogram and a sheet excerpt with the respective delta frames. Based on this state the agent chooses to either increase/decrease or keep its current speed. (Figure taken from [11])

For the action space A, we choose a finite set of three actions, allowing the agent to control its current tempo. The current tempo is given by vpxl in pixels per time step. The set of actions is given by A = {−∆vpxl, 0, +∆vpxl}, where ∆vpxl is the amount by which the agent adapts its current speed. For our experiments we use ∆vpxl = 0.5, meaning the agent can either increase or decrease its speed by 0.5 pixels per time step, or keep it. The reward is basically defined as a

linearly decaying function, as shown in Figure 9. If the agent is exactly at the correct position within the score, it will get the maximum reward of 1. The further away it is from the target position within the score, the less reward the agent will receive. This reward has a border; once it is reached, the agent loses the game and the game starts over (possibly on a different piece). Musically speaking, there is no exact pixel position within the score that can be matched to a certain point within the musical performance. However, for our problem it is necessary to define those positions. Therefore, we match the beginning of a note within the score (the note head) to the respective onset of the note within the audio. To calculate the reward for positions between onsets, we interpolate within the space between the notes, i.e. the middle of the notes according to the onsets. The further advantage of this is that we get a non-zero reward at almost all time steps, i.e. a dense reward signal, which should facilitate learning for the agent.

Figure 8: The state space for the score following game. The state representation consists of an excerpt of the sheet image and the spectrogram of the current piece. To approximately arrive at a state representation for which the Markov property holds, we include delta frames (∆), which are the differences from the previous to the current state. (Figure taken from [11])

Sheet Image Sheet Image ∆ ,wihaetedffrne rmtepeiu state previous the from differences the are which ), 26 11 Spectrogram ]) 9 fteaeti tteexactly the at is agent the If . Spectrogram Figure 9: The reward formulation for the score following game. Depending on the current position xˆ of the agent and its distance dx to the target x , the reward linearly decays from the maximum value of 1 (directly on target) to 0. (Figure taken from [11])
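To make the reward definition above concrete, a minimal sketch of such a linearly decaying reward; the parameter name `border` for the tracking-border width is an assumption, and the environment's exact border handling may differ:

def tracking_reward(x_agent, x_target, border):
    # 1.0 exactly on target, linear decay to 0.0 at the tracking border.
    d = abs(x_agent - x_target)
    return max(0.0, 1.0 - d / border)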

3.2 Experimental Setup

In the following, we explain how we design and conduct our experiments. First, we introduce the Nottingham dataset, a simple monophonic music dataset we use for our study. Next, we explain the shallow and deeper network architectures we are considering for our policy and value functions. Finally, we describe our training and validation process.

3.2.1 The Nottingham Dataset

For our experiments, we use the Nottingham dataset4, which is a collection of 296 monophonic folk melodies. For each piece we have a MIDI file and a score representation. Furthermore, there are annotations that exactly match the note onsets in the MIDI to the respective positions of the notes in the score. This information is necessary in order to determine the target position in the score and the reward the agent will receive. The MIDI files are rendered to audio files by Fluidsynth5, which are in turn used to compute the spectrograms.

The dataset itself is already split into a training set with 187, a validation set with 63 and a test set with 46 pieces. Usually one does not have such a train/validation/test split in the context of RL, but it is an important setup in supervised learning to determine whether a model is able to generalize to unseen future data or whether it overfits to the training data. Using this data we are able to perform tests concerned with overfitting and generalization in the context of RL. Unfortunately, it takes a long time to train the algorithms in all kinds of different settings. To further encourage overfitting and make its effects more visible, we reduce the training set to only 10 pieces. The remaining dataset splits are the same.

For computing the spectrograms we use a frame rate of 20 FPS and calculate log-frequency spectrograms using a sample rate of 22.05 kHz. For the fast Fourier transform (FFT) we use a window size of 2048 and post-process the results with a logarithmic filterbank covering frequencies in the range of 60 Hz to 6 kHz, resulting in 78 frequency bins.
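For reference, a rough, self-contained sketch of such a pipeline using librosa and numpy; the triangular, logarithmically spaced filterbank and the log1p magnitude compression are assumptions for illustration and not necessarily the exact implementation used here:

import numpy as np
import librosa

SR, FPS, N_FFT = 22050, 20, 2048
FMIN, FMAX, N_BANDS = 60.0, 6000.0, 78

def log_filtered_spectrogram(audio_path):
    y, _ = librosa.load(audio_path, sr=SR)
    hop = int(round(SR / FPS))                       # roughly 20 frames per second
    spec = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=hop))
    fft_freqs = np.linspace(0.0, SR / 2.0, spec.shape[0])
    edges = np.geomspace(FMIN, FMAX, N_BANDS + 2)    # log-spaced band edges
    bank = np.zeros((N_BANDS, spec.shape[0]))
    for i in range(N_BANDS):                         # triangular filters
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs >= lo) & (fft_freqs <= mid)
        falling = (fft_freqs > mid) & (fft_freqs <= hi)
        bank[i, rising] = (fft_freqs[rising] - lo) / max(mid - lo, 1e-9)
        bank[i, falling] = (hi - fft_freqs[falling]) / max(hi - mid, 1e-9)
    return np.log1p(bank @ spec)                     # 78 x num_frames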

4https://ifdo.ca/~seymour/nottingham/nottingham.html 5http://www.fluidsynth.org/

3.2.2 Network Architectures

Based on our previous work [11], we decide to use two different network architectures: a simpler and shallower one given in Table 1 and a deeper one given in Table 4. The basic structure of the networks is the following. We process the audio spectrogram and the sheet image separately with several convolutional layers. The learned representations are flattened, concatenated and passed through a fully connected layer with 256 and 512 units, respectively. Afterwards the output is again split into a policy part with a softmax over three actions and a value part given as a single linear output. A sketch of our architecture is shown in Figure 10.

Figure 10: A sketch of our network architecture. The sheet and the audio are separately processed by several convolutional layers. The output of those layers is afterwards concatenated and further processed by a fully connected layer. Finally, this is again split into an output network for the policy πθ and the value function v̂. (Figure reproduced and adapted from [11])

To summarize, we use two different network architectures for our experiments: a shallow and a deep network (Shallow Network/Plain and Deep Network/Plain). For each architecture we have two additional variations that include Dropout and Batch-Norm, respectively. For L2-Norm we do not create additional networks, but use the plain architectures. So, altogether this results in six networks as shown in the following tables. In Table 2 and 5, we adapt the networks to include Dropout of 0.2 at certain layers (Shallow Network/Dropout and Deep Network/Dropout). For the models in Table 3 and 6, we add Batch-Norm after every layer, except for the output layer (Shallow Network/Batch-Norm and Deep Network/Batch-Norm). Note that we do not use zero padding for the shallow network. For the deep architecture we apply padding as indicated. Apart from the output layer, we always use the same activation function on all layers, which will be specified in the experiment results.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 16×4×4 - stride-2                 Conv 16×4×8 - stride-2
Conv 32×3×3 - stride-2                 Conv 32×3×3 - stride-(1,2)
Conv 64×3×3 - stride-1                 Conv 32×3×3 - stride-(1,2)
                                       Conv 32×4×4 - stride-2
Flatten - Concatenation - Dense 256
Dense 256                              Dense 256
Dense 3 - Softmax                      Dense 1 - Linear

Table 1: Shallow Network/Plain: Shallow network architecture without regularization. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. No zero padding is used.
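A minimal PyTorch sketch of this plain shallow architecture; ReLU is shown and single-channel inputs are assumed (with delta frames, the first convolutions would take additional input channels):

import torch
import torch.nn as nn

class ShallowNet(nn.Module):
    def __init__(self, n_actions=3):
        super().__init__()
        self.audio = nn.Sequential(                      # spectrogram pathway
            nn.Conv2d(1, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=1), nn.ReLU(), nn.Flatten())
        self.sheet = nn.Sequential(                      # sheet-image pathway
            nn.Conv2d(1, 16, (4, 8), stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=(1, 2)), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=(1, 2)), nn.ReLU(),
            nn.Conv2d(32, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        with torch.no_grad():                            # infer flattened sizes
            n_audio = self.audio(torch.zeros(1, 1, 39, 20)).shape[1]
            n_sheet = self.sheet(torch.zeros(1, 1, 40, 150)).shape[1]
        self.shared = nn.Sequential(nn.Linear(n_audio + n_sheet, 256), nn.ReLU())
        self.policy = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, n_actions))
        self.value = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, spec, sheet):
        h = self.shared(torch.cat([self.audio(spec), self.sheet(sheet)], dim=1))
        return torch.softmax(self.policy(h), dim=-1), self.value(h)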

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 16×4×4 - stride-2                 Conv 16×4×8 - stride-2
Conv 32×3×3 - stride-2                 Conv 32×3×3 - stride-(1,2)
Conv 64×3×3 - stride-1 - DO            Conv 32×3×3 - stride-(1,2)
                                       Conv 32×4×4 - stride-2 - DO
Flatten - Concatenation - Dense 256 - DO
Dense 256 - DO                         Dense 256 - DO
Dense 3 - Softmax                      Dense 1 - Linear

Table 2: Shallow Network/Dropout: Shallow network architecture with Dropout (DO) of 0.2 added at certain layers. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. No zero padding is used.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 16×4×4 - stride-2 - BN            Conv 16×4×8 - stride-2 - BN
Conv 32×3×3 - stride-2 - BN            Conv 32×3×3 - stride-(1,2) - BN
Conv 64×3×3 - stride-1 - BN            Conv 32×3×3 - stride-(1,2) - BN
                                       Conv 32×4×4 - stride-2 - BN
Flatten - Concatenation - Dense 256 - BN
Dense 256 - BN                         Dense 256 - BN
Dense 3 - Softmax                      Dense 1 - Linear

Table 3: Shallow Network/Batch-Norm: Shallow network architecture with Batch-Norm (BN) added after each layer except for the output layer. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. No zero padding is used.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 32×3×3 - stride-1 - pad-1         Conv 32×5×5 - stride-(1,2) - pad-2
Conv 32×3×3 - stride-1 - pad-1         Conv 32×3×3 - stride-1 - pad-1
Conv 64×3×3 - stride-2 - pad-1         Conv 64×3×3 - stride-2 - pad-1
Conv 64×3×3 - stride-1 - pad-1         Conv 64×3×3 - stride-1 - pad-1
Conv 64×3×3 - stride-1 - pad-1         Conv 64×3×3 - stride-1 - pad-1
Conv 96×3×3 - stride-2 - pad-1         Conv 64×3×3 - stride-2 - pad-1
Conv 96×1×1 - stride-1 - pad-0         Conv 96×3×3 - stride-2 - pad-0
Flatten - Dense 512                    Conv 96×1×1 - stride-1 - pad-0
                                       Flatten - Dense 512
Concatenation - Dense 512
Dense 256                              Dense 256
Dense 3 - Softmax                      Dense 1 - Linear

Table 4: Deep Network/Plain: Deep network architecture without regularization. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. Zero padding is indicated by pad.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 32×3×3 - stride-1 - pad-1         Conv 32×5×5 - stride-(1,2) - pad-2
Conv 32×3×3 - stride-1 - pad-1         Conv 32×3×3 - stride-1 - pad-1
Conv 64×3×3 - stride-2 - pad-1         Conv 64×3×3 - stride-2 - pad-1
Conv 64×3×3 - stride-1 - pad-1 - DO    Conv 64×3×3 - stride-1 - pad-1 - DO
Conv 64×3×3 - stride-1 - pad-1         Conv 64×3×3 - stride-1 - pad-1
Conv 96×3×3 - stride-2 - pad-1         Conv 64×3×3 - stride-2 - pad-1 - DO
Conv 96×1×1 - stride-1 - pad-0 - DO    Conv 96×3×3 - stride-2 - pad-0
Flatten - Dense 512                    Conv 96×1×1 - stride-1 - pad-0 - DO
                                       Flatten - Dense 512
Concatenation - Dense 512
Dense 256 - DO                         Dense 256 - DO
Dense 3 - Softmax                      Dense 1 - Linear

Table 5: Deep Network/Dropout: Deep network architecture with Dropout (DO) of 0.2 added at certain layers. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. Zero padding is indicated by pad.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 32×3×3 - stride-1 - pad-1 - BN    Conv 32×5×5 - stride-(1,2) - pad-2 - BN
Conv 32×3×3 - stride-1 - pad-1 - BN    Conv 32×3×3 - stride-1 - pad-1 - BN
Conv 64×3×3 - stride-2 - pad-1 - BN    Conv 64×3×3 - stride-2 - pad-1 - BN
Conv 64×3×3 - stride-1 - pad-1 - BN    Conv 64×3×3 - stride-1 - pad-1 - BN
Conv 64×3×3 - stride-1 - pad-1 - BN    Conv 64×3×3 - stride-1 - pad-1 - BN
Conv 96×3×3 - stride-2 - pad-1 - BN    Conv 64×3×3 - stride-2 - pad-1 - BN
Conv 96×1×1 - stride-1 - pad-0 - BN    Conv 96×3×3 - stride-2 - pad-0 - BN
Flatten - Dense 512 - BN               Conv 96×1×1 - stride-1 - pad-0 - BN
                                       Flatten - Dense 512 - BN
Concatenation - Dense 512 - BN
Dense 256 - BN                         Dense 256 - BN
Dense 3 - Softmax                      Dense 1 - Linear

Table 6: Deep Network/Batch-Norm: Deep network architecture with Batch-Norm (BN) added after each layer except for the output layer. Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. Zero padding is indicated by pad.

3.2.3 Training and Validation

In the following, we describe our training and validation process including all preselected hyperparameter choices for our agents. Usually, when researchers evaluate RL agents, they train and evaluate them several times with different random seeds. This is necessary due to the stochasticity involved in the learning process. However, as previously mentioned, it takes a long time to train the agents. So, due to hardware limitations we only train the agents five times for all settings of the shallow network architecture and three times for the deep architecture to get an average performance with mean and standard deviation.6

For the shallow network architecture, we train each agent (Reinforce with baseline, A2C and PPO) with three activation functions (ReLU, ELU and SELU) in four different regularization settings: no regularization, L2-Norm with a weight decay factor of 10⁻⁵, Dropout and Batch-Norm. For Reinforce we do not use Batch-Norm, as the updates only involve a single batch. Furthermore, Batch-Norm is not used together with the SELU activation. Therefore, the total number of experiments for the shallow network is 31. We consider the same setup for the deep architecture, except that we are not using Reinforce, as it takes significantly longer to train than the others and also performs worse in most cases. Thus, the total number of experiments is 22.
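The resulting experiment grid can be enumerated directly; a small sketch that reproduces the counts of 31 and 22 (the setting names are illustrative only):

from itertools import product

algorithms = ["reinforce", "a2c", "ppo"]
activations = ["relu", "elu", "selu"]
regularizers = ["no", "l2", "do", "bn"]

def valid(algo, act, reg):
    if reg == "bn" and algo == "reinforce":   # Reinforce updates use a single batch
        return False
    if reg == "bn" and act == "selu":         # Batch-Norm not combined with SELU
        return False
    return True

shallow = [c for c in product(algorithms, activations, regularizers) if valid(*c)]
deep = [c for c in shallow if c[0] != "reinforce"]   # Reinforce dropped for the deep net
print(len(shallow), len(deep))                       # 31 22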

As we have a shared network architecture, we minimize the objective for the policy and value function together with a single optimizer. To do this we add the objective of the value function with a factor of 0.5 (value coefficient c1) to the policy objective including entropy regularization with a factor of 0.01 (entropy coefficient c2) for A2C and PPO. The complete objective we need to minimize is given by

L = −Lπ + c1 Lv − c2 Hπ     (37)

where Lπ is the objective for the policy, which needs to be maximized, Lv is the loss for the value function we need to minimize and Hπ is the entropy of the policy we want to maximize. As optimizer we use Adam with a learning rate of 10⁻⁴ and decay rates for the first and second moment estimates of 0.5 and 0.999, respectively. The discounting factor γ is set to 0.9. A summary of the optimization parameters and the algorithm-specific hyperparameters can be found in Table 7. Most of them were chosen according to our findings in [11].
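A minimal PyTorch-style sketch of this joint objective; the simple advantage-weighted policy term shown here is an A2C-style stand-in for illustration, and PPO would use its clipped surrogate objective instead:

def combined_loss(log_probs, advantages, values, returns, entropy, c1=0.5, c2=0.01):
    # L = -L_pi + c1 * L_v - c2 * H_pi, cf. Equation (37); all arguments are tensors.
    policy_objective = (log_probs * advantages.detach()).mean()   # L_pi (maximize)
    value_loss = ((values - returns) ** 2).mean()                 # L_v (minimize)
    return -policy_objective + c1 * value_loss - c2 * entropy.mean()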

6We know that this is a small sample size and not ideal, but it allows us to get at least a feeling for how reliable our results are.

During training, the agent will always see an excerpt of the whole spectrogram consisting of 40 frames (2 seconds) and 78 frequency bins, downsampled by a factor of two to 39×20, as well as an excerpt of the sheet image with 40×150 pixels. As our evaluation measure, we use what we call the Global Tracking Ratio (GTR). The GTR is a number between 0 and 1 and defines the relative number of notes the agent was able to track until it lost the target, i.e. fell out of the tracking border. So, for a note to be counted as correctly tracked, it is not necessary that the agent is exactly at the target position within the score. As long as the current note is still within the tracking border, it is regarded as tracked. For instance, consider a set of two pieces with 20 note events in the first one and 15 in the second one. For the first piece the agent can track 15 out of 20 notes and for the second piece 10 out of 15. This results in a GTR of 1/2 · (15/20 + 10/15) ≈ 0.7083. If the agent is able to track all pieces to the end, the GTR is 1.
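A small helper that computes the GTR exactly as in this example:

def global_tracking_ratio(tracked_notes, total_notes):
    # Mean fraction of tracked notes per piece.
    ratios = [t / n for t, n in zip(tracked_notes, total_notes)]
    return sum(ratios) / len(ratios)

print(global_tracking_ratio([15, 10], [20, 15]))   # 0.7083...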

Every 1000 updates our agents are evaluated five times on the validation set with 63 pieces. If there is no improvement in terms of the average GTR for 15 evaluation cycles (patience), the training of the agent is stopped. For the final evaluation the overall best model found up until the end of training is used. Once the training process is finished, we let the agent perform ten evaluation runs on all three dataset splits, calculating the mean and standard deviation of the GTRs.
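A sketch of this validation loop; `agent.update()`, `agent.state_dict()` and `evaluate()` are hypothetical interfaces used only for illustration, and the update budget is an assumption:

def train_with_patience(agent, evaluate, max_updates=200_000, eval_every=1000, patience=15):
    best_gtr, best_params, waited = -1.0, None, 0
    for update in range(1, max_updates + 1):
        agent.update()                          # one policy-gradient update (assumed API)
        if update % eval_every == 0:
            gtr = evaluate(agent)               # mean GTR over five validation runs
            if gtr > best_gtr:
                best_gtr, best_params, waited = gtr, agent.state_dict(), 0
            else:
                waited += 1
                if waited >= patience:          # no improvement for 15 cycles
                    break
    return best_params, best_gtr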

Hyperparameter                     Value
Patience                           15
L2-Norm weight decay               10⁻⁵
Adam learning rate                 10⁻⁴
Adam decay rates (β1, β2)          (0.5, 0.999)
Time Horizon (tmax)                15
Number of actors                   8
Value coefficient c1               0.5
Entropy coefficient c2             0.01
Discount factor γ                  0.9
GAE parameter λ                    0.95
PPO clipping parameter ϵ           0.2
PPO epochs                         4
PPO batch size                     32

Table 7: Hyperparameter overview. We did not specifically tune them, but followed our findings in [11], where they yield good results.

3.3 Results and Discussion

In this section, we present and interpret the results of our experimental study on regularization in the context of RL. First, we summarize our overall results, including statistical significance tests to determine whether a configuration was significantly better than another. Afterwards, we take a closer look at each experiment group on its own, providing visualizations of the performance of the algorithms.

3.3.1 Result Summary

In the following we summarize our results for the shallow (Tables 8-10) and the deep (Tables 11-13) network architecture. For each architecture we provide the performance on the training, validation and test set in terms of the GTR. As all configurations were trained five and three times, respectively, we calculate the mean and the standard deviation to see how much the training process varies. Furthermore, we apply Welch's t-test to determine the statistical significance of the best results on the validation and test set. This test compares the means of two groups with unequal variances and checks whether the difference between them is significant.

To apply the test, we define our null hypothesis H0 as the equality of two means and our alternative hypothesis H1 as unequal means. Using the test to compare two different experiment settings, we receive a p-value, which tells us how strong the evidence against H0 is. Strong evidence (usually a p-value ≤ 0.05) allows us to reject the null hypothesis, and we can thus conclude that there is a significant difference between the two settings. In the following tables for the validation and test performance we mark those cells as bold where there is no significant difference (p-value > 0.05) to the best performing setting. Additionally, we provide a visual comparison of the different experimental settings in Figures 11-14.
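In practice this amounts to a single call to scipy; the run results below are made-up numbers purely for illustration:

from scipy import stats

setting_a = [0.73, 0.70, 0.75, 0.71, 0.77]     # GTRs of five runs (hypothetical)
setting_b = [0.76, 0.69, 0.82, 0.73, 0.81]

# equal_var=False selects Welch's t-test (unequal variances).
t, p = stats.ttest_ind(setting_a, setting_b, equal_var=False)
print(p > 0.05)   # True -> not significantly different at the 5% level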

In almost all cases except for some settings where either SELUs or Batch-Norm were involved, the agents were able to perfectly fit the training data with a shallow network. We therefore only include the result table with the training set performance for the sake of completeness. Comparing the validation performance in Table 9 with the test performance in Table 10, we observe that the test performance is generally higher. We further see more configurations that are not significantly different from each other for the test set than for the validation set. In the following we therefore refer to the test set performance for discussing the results.

Shallow CNN Train Performance
                        NO               L2               DO               BN
Reinforce   ReLU    0.999 ± 0.001    1.000 ± 0.000    1.000 ± 0.000    -
            ELU     0.992 ± 0.006    0.993 ± 0.007    1.000 ± 0.000    -
            SELU    0.023 ± 0.018    0.035 ± 0.013    0.033 ± 0.013    -
A2C         ReLU    0.993 ± 0.013    0.995 ± 0.007    1.000 ± 0.000    1.000 ± 0.001
            ELU     1.000 ± 0.000    0.999 ± 0.001    0.991 ± 0.012    0.697 ± 0.239
            SELU    0.829 ± 0.339    0.340 ± 0.281    0.216 ± 0.101    -
PPO         ReLU    0.989 ± 0.017    0.993 ± 0.013    0.998 ± 0.003    0.841 ± 0.184
            ELU     0.992 ± 0.010    0.984 ± 0.016    0.995 ± 0.004    0.535 ± 0.131
            SELU    0.972 ± 0.025    0.960 ± 0.025    0.993 ± 0.010    -

Table 8: Shallow Network/Training Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained five times. Each column contains the result for one of the chosen regularization settings, no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN).

There are several notable observations. First, we expected Reinforce to perform worse than A2C and PPO, which is the case for most settings. However, if we include Dropout, we can improve the performance of Reinforce to perform almost on par with the others. Second, A2C with Batch-Norm performs best on average. This has to be treated with caution, as we see a high standard deviation, which is also visualized in Figure 12. In the subsequent sections we will further see that Batch-Norm is generally quite unstable. We observe a similar behavior for the SELU activation function, where the standard deviation is even higher. Especially for A2C with SELU and no regularization the performance is high for four out of five trained agents. It is worth taking a closer look at this activation function in future work. Finally, a notable observation is the robustness of PPO with respect to different regularization techniques. We see that for ReLU and ELU it does not seem to matter whether we use no regularization, L2-Normalization or Dropout. Only Batch-Norm does not work together with PPO, which might be related to its loss formulation as well as the repeated use of the same samples for updating.

Shallow CNN Val. Performance
                        NO               L2               DO               BN
Reinforce   ReLU    0.526 ± 0.055    0.574 ± 0.068    0.672 ± 0.041    -
            ELU     0.422 ± 0.030    0.399 ± 0.046    0.667 ± 0.034    -
            SELU    0.020 ± 0.016    0.035 ± 0.012    0.027 ± 0.009    -
A2C         ReLU    0.637 ± 0.048    0.576 ± 0.041    0.689 ± 0.027    0.764 ± 0.054
            ELU     0.510 ± 0.026    0.505 ± 0.075    0.665 ± 0.052    0.449 ± 0.148
            SELU    0.548 ± 0.201    0.244 ± 0.136    0.179 ± 0.029    -
PPO         ReLU    0.734 ± 0.042    0.719 ± 0.048    0.650 ± 0.033    0.265 ± 0.044
            ELU     0.698 ± 0.030    0.694 ± 0.036    0.704 ± 0.027    0.179 ± 0.021
            SELU    0.463 ± 0.052    0.439 ± 0.031    0.638 ± 0.041    -

Table 9: Shallow Network/Validation Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained five times. The best performance is marked bold. Additionally, we mark those settings bold that are not significantly different from the best one.

Figure 11: Shallow Network/Validation Set: Visualization of the mean performance (blue dot) of all agents including standard deviation (blue line) as well as the performance of each of the five trained agents separately (black dots). ppo_elu_do means the PPO algorithm with ELU activation and Dropout.

Shallow CNN Test Performance
                        NO               L2               DO               BN
Reinforce   ReLU    0.642 ± 0.074    0.638 ± 0.093    0.780 ± 0.073    -
            ELU     0.517 ± 0.052    0.498 ± 0.083    0.752 ± 0.051    -
            SELU    0.021 ± 0.017    0.042 ± 0.018    0.028 ± 0.008    -
A2C         ReLU    0.714 ± 0.055    0.646 ± 0.043    0.792 ± 0.021    0.830 ± 0.084
            ELU     0.625 ± 0.037    0.595 ± 0.083    0.751 ± 0.037    0.527 ± 0.167
            SELU    0.645 ± 0.240    0.265 ± 0.181    0.163 ± 0.035    -
PPO         ReLU    0.792 ± 0.039    0.780 ± 0.047    0.750 ± 0.047    0.284 ± 0.059
            ELU     0.754 ± 0.040    0.769 ± 0.038    0.781 ± 0.026    0.210 ± 0.030
            SELU    0.521 ± 0.072    0.484 ± 0.040    0.758 ± 0.042    -

Table 10: Shallow Network/Test Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained five times. The best performance is marked bold. Additionally, we mark those settings bold that are not significantly different from the best one.

Figure 12: Shallow Network/Test Set: Visualization of the mean performance (blue dot) of all agents including standard deviation (blue line) as well as the performance of each of the five trained agents separately (black dots). ppo_elu_do means the PPO algorithm with ELU activation and Dropout.

Deep CNN Train Performance
                        NO               L2               DO               BN
A2C         ReLU    1.000 ± 0.000    0.979 ± 0.016    0.997 ± 0.005    0.961 ± 0.017
            ELU     1.000 ± 0.000    0.987 ± 0.017    0.775 ± 0.157    0.599 ± 0.074
            SELU    0.404 ± 0.348    0.157 ± 0.009    0.134 ± 0.063    -
PPO         ReLU    0.974 ± 0.028    0.969 ± 0.036    0.982 ± 0.011    0.497 ± 0.087
            ELU     0.942 ± 0.041    0.918 ± 0.030    0.970 ± 0.027    0.528 ± 0.186
            SELU    0.947 ± 0.025    0.975 ± 0.020    0.880 ± 0.076    -

Table 11: Deep Network/Training Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained three times. Each column contains the result for one of the chosen regularization settings, no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN).

In general, we see that the deep network performs worse than the shallow one, although the better performing configurations are similar to before. One possible explanation for this is that there are now too many parameters that have to be adjusted and that we use too little training data. We conclude from this that for our monophonic dataset a smaller network is enough to achieve the best performance in terms of score following, which we will return to later on. A notable difference to the performance with a shallow architecture is that PPO with SELU now yields better results than PPO with ReLU. While we hoped to see even better results with SELUs for the deep network, this observation coincides with the assumption that SELUs can be used to better train deeper networks. Unfortunately, we were only able to train the agents three times due to hardware limitations. Therefore, the results might not be as reliable as before. Nevertheless, we see a trend regarding which configurations might yield good results in the future for possibly more complex data.

Deep CNN Val. Performance
                        NO               L2               DO               BN
A2C         ReLU    0.472 ± 0.026    0.401 ± 0.050    0.440 ± 0.012    0.545 ± 0.036
            ELU     0.466 ± 0.040    0.583 ± 0.056    0.415 ± 0.140    0.305 ± 0.034
            SELU    0.199 ± 0.083    0.133 ± 0.019    0.122 ± 0.008    -
PPO         ReLU    0.472 ± 0.028    0.425 ± 0.037    0.343 ± 0.037    0.105 ± 0.008
            ELU     0.613 ± 0.014    0.582 ± 0.024    0.544 ± 0.013    0.090 ± 0.006
            SELU    0.493 ± 0.058    0.508 ± 0.058    0.347 ± 0.115    -

Table 12: Deep Network/Validation Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained three times. Each column contains the result for one of the chosen regularization settings, no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN).

Figure 13: Deep Network/Validation Set: Visualization of the mean performance (blue dot) of all agents including standard deviation (blue line) as well as the performance of each of the three trained agents separately (black dots). ppo_elu_do means the PPO algorithm with ELU activation and Dropout.

Deep CNN Test Performance
                        NO               L2               DO               BN
A2C         ReLU    0.543 ± 0.051    0.462 ± 0.046    0.502 ± 0.027    0.603 ± 0.058
            ELU     0.552 ± 0.052    0.677 ± 0.031    0.468 ± 0.198    0.345 ± 0.037
            SELU    0.227 ± 0.086    0.138 ± 0.022    0.124 ± 0.008    -
PPO         ReLU    0.511 ± 0.049    0.461 ± 0.050    0.429 ± 0.042    0.106 ± 0.009
            ELU     0.691 ± 0.038    0.681 ± 0.042    0.644 ± 0.014    0.103 ± 0.006
            SELU    0.567 ± 0.085    0.548 ± 0.069    0.409 ± 0.153    -

Table 13: Deep Network/Test Set: Mean performance of the trained agents in terms of the GTR including standard deviation. All agents were trained three times. Each column contains the result for one of the chosen regularization settings, no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN).

Figure 14: Deep Network/Test Set: Visualization of the mean performance (blue dot) of all agents including standard deviation (blue line) as well as the performance of each of the three trained agents separately (black dots). ppo_elu_do means the PPO algorithm with ELU activation and Dropout.

3.3.2 Comparing Algorithms

To allow for a visual comparison of the different algorithms, we pick for all regularization settings the best performing activation function of each algorithm, as shown in Figure 15. For this comparison we only use the shallow network architecture, as we did not include Reinforce in the experiments with the deep network.

Considering no regularization, we see that PPO outperforms the others, both in terms of the GTR as well as the steps needed until it reaches its peak performance. Unfortunately, the figure is slightly misleading as it seems that Reinforce learns faster than A2C. This is however not the case, as we do not provide absolute timing information. In fact, Reinforce is about 4 to 5 times slower, with an average training time of 20 hours compared to A2C and PPO with approx. 4-5 hours.

For L2-Normalization, we observe a similar behavior, although A2C performs slightly worse than Reinforce and PPO. For Dropout, all algorithms settle around the same performance level, with PPO again learning the fastest. Considering Batch-Norm, where we only compare PPO and A2C, we see that A2C is now performing better. PPO does not learn properly and stagnates around a GTR of 0.2. However, also for A2C it is hard to learn a proper policy as the performance fluctuates a lot. A possible reason for PPO to not learn at all might be the definition of the surrogate objective given in Equation (19) and the repeated updates using the same samples. While Batch-Norm yields good results in supervised learning with loss formulations like the mean squared error or the cross-entropy loss, the error formulation of PPO is different. For future work it will be interesting to take a closer look at what Batch-Norm actually does in combination with RL, especially PPO.

3.3.3 Comparing Activation Functions

In the following, we compare the performance of each algorithm with different activation functions and no regularization technique applied. The results for the shallow network are shown in Figure 16 and for the deep network in Figure 17. As before, we use the GTR averaged over five runs on the validation set as our performance measure.

Figure 15: Shallow Network/Algorithm Comparison: Best performance of each algorithm in all regularization settings with the shallow network architecture: (a) no regularization, (b) L2-Normalization, (c) Dropout, (d) Batch-Norm. The best activation function was chosen according to Table 9, considering the one with the highest mean value and disregarding statistical significance. Each agent is trained five times. After 1000 update steps an agent is evaluated five times on the validation set.

Comparing the figures, we see a clear performance difference between the different activation functions in terms of the algorithms as well as the network architectures. In all cases SELU activation performs worst and barely passes a GTR of 0.4. As SELUs are intended to train deeper networks, we hoped to see a significant performance improvement for the deep architecture, which is however not the case. For ReLU and ELU activation it is hard to determine which one works best. A2C performs better and also learns slightly faster with ReLU activation than with ELU. For PPO it is the other way around, although the difference is marginal for the shallow network.

For the deep network we observe that A2C works equally well with ReLU and ELU activation. PPO performs best with ELU, but shows instabilities during learning. Between validation steps 15 and 20 we see a systematic performance drop for all three trained agents. We further see that PPO with SELU activation is now able to reach the same level as with ReLU activation.

Figure 16: Shallow Network/Activation Comparison: Performance of (a) Reinforce, (b) A2C and (c) PPO with different activation functions and a shallow network architecture. Each agent is trained five times. After 1000 update steps an agent is evaluated five times on the validation set.

Figure 17: Deep Network/Activation Comparison: Performance of (a) A2C and (b) PPO with different activation functions and a deep network architecture. Each agent is trained three times. After 1000 update steps an agent is evaluated five times on the validation set. For the deep architecture we only consider A2C and PPO.

3.3.4 Comparing Regularization Techniques

In the following we compare the effect of the regularization techniques separately for each algorithm and activation function. We further include the performance without regularization to see whether the results improve by using a certain regularization technique or not.

Figure 18 shows how regularization affects Reinforce. We observe that for SELU activation Reinforce in general does not work, regardless of which regularization technique is used. For ReLU and ELU, Dropout performs best, and for ReLU activation it is also more stable compared to L2-Norm and no regularization at all. Using no regularization, we see that Reinforce with ReLU activation starts to overfit slowly around validation step 20, where the GTR slightly decreases.

Figure 18: Reinforce/Shallow Network: Performance of Reinforce with different activation functions ((a) ReLU, (b) ELU, (c) SELU), regularization techniques and a shallow network architecture. We compare no regularization (NO), L2-Normalization (L2) and Dropout (DO). Each agent is trained five times. After 1000 update steps an agent is evaluated five times on the validation set.

Figure 19: A2C/Shallow Network: Performance of A2C with different activation functions ((a) ReLU, (b) ELU, (c) SELU), regularization techniques and a shallow network architecture. We compare no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN). Each agent is trained five times. After 1000 update steps an agent is evaluated five times on the validation set.

In Figure 19 we show the influence of regularization on the A2C agent for the shallow network architecture. SELU activation again does not work. However, as mentioned before, we observe that when using no regularization, it sometimes yields the overall best results, while failing for other trials. Considering ReLU activation, only Dropout performs slightly better than no regularization. Furthermore, we see again that while Batch-Norm steadily increases its performance, the learning process itself is unstable and also varies a lot. For ELU activation Dropout performs best and L2-Norm performs approximately the same as without regularization. Once more Batch-Norm is unstable and this time does not even reach an acceptable GTR. We see again that after around 30 to 40 validation steps A2C starts to overfit a little without regularization, but the effect seems to be less severe than with Reinforce.

Figure 20: A2C/Deep Network: Performance of A2C with different activation functions ((a) ReLU, (b) ELU, (c) SELU), regularization techniques and a deep network architecture. We compare no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN). Each agent is trained three times. After 1000 update steps an agent is evaluated five times on the validation set.

Figure 20 shows how regularization affects A2C with a deep network architecture. Comparing the results of the shallow architecture with the deep one, we see a general performance difference in favor of the shallow network. SELU activation is again not working at all, regardless of whether regularization is applied. Using ReLU activation, we cannot determine a configuration that is clearly working better. Batch-Norm is again rather unstable, but there seems to be potential for improvement as it steadily increases. For future work one could consider a higher patience value, such that the training is not stopped too early. For ELU activation, we observe that L2-Norm performs better than the other regularization settings. However, there is again a lot of instability and variation. Thus, it is hard to infer which setting is superior.

Figure 21: PPO/Shallow Network: Performance of PPO with different activation functions ((a) ReLU, (b) ELU, (c) SELU), regularization techniques and a shallow network architecture. We compare no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN). Each agent is trained five times. After 1000 update steps an agent is evaluated five times on the validation set.

In Figure 21 we show the influence of regularization on the PPO agent for the shallow network architecture. While Batch-Norm works for neither ReLU nor ELU, we observe an interesting invariance to the other regularization techniques. No regularization, L2-Norm and Dropout reach approximately the same performance level for ReLU and ELU activation. The performance with SELU activation is worse; however, in contrast to Reinforce and A2C, all regularization settings work at least to some degree. Especially the use of Dropout improves the performance over no regularization. Overall, it seems that PPO is rather robust to the choice of activation function and regularization technique, except for the use of Batch-Norm.

Figure 22: PPO/Deep Network: Performance of PPO with different activation functions ((a) ReLU, (b) ELU, (c) SELU), regularization techniques and a deep network architecture. We compare no regularization (NO), L2-Normalization (L2), Dropout (DO) and Batch-Norm (BN). Each agent is trained three times. After 1000 update steps an agent is evaluated five times on the validation set.

Figure 22 shows how regularization affects PPO with a deep network architecture. We observe a similar behavior as for the shallow network architecture, although the performance is again worse. We further see that PPO reaches a higher GTR with ELU activation than with ReLU for all regularization settings, except for Batch-Norm. No regularization, L2-Norm and Dropout are approximately on the same level, thus we cannot really determine whether one of them is better. In general the learning process seems to be more unstable for the deep network architectures, which can probably be attributed to the larger number of adaptable network parameters.

3.4 Implications on the Network Architecture

After reflecting on our results, we come to the conclusion that a shallow network architecture without regularization, trained with PPO and ReLU activation, yields the most promising results, even though there is no significant difference to some other configurations. As the deep architecture did not improve our results, we want to ascertain whether an even simpler network can achieve the same performance as the shallow architecture. To test this, we try two simplified versions of the original shallow network given in Table 1.

For the first adaptation (S1) shown in Table 14, we remove the last convolutional layers and adapt the strides, the number of kernels as well as the number of hidden units. The resulting architecture has approximately 49% fewer parameters. For the second adaptation (S2) shown in Table 15, we keep the convolutional layers from the original architecture and only adapt the number of hidden units in the fully connected part of the network. The resulting architecture is even smaller, with about 58% fewer parameters than the original configuration.
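The quoted reductions can be checked by simply counting trainable parameters, e.g. for PyTorch modules:

def count_parameters(model):
    # Number of trainable parameters of a torch.nn.Module.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. reduction = 1.0 - count_parameters(s2_net) / count_parameters(original_net)
# (s2_net and original_net are hypothetical instances of the S2 and original architectures)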

To compare the networks, we again train five agents with each architecture. The results are shown in Table 16. We see that while S1 and S2 do not reach the same average performance as the original architecture, the results are not significantly different. If we further consider the fact that S2 uses less than half as many parameters as the original architecture, the results are quite impressive. So, it is definitely possible to reduce the size of the network, which is also a type of regularization since we restrict its capacity. For our future work in the field of score following we should therefore consider further experiments to determine the best-suited network architecture.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 16×4×4 - stride-2                 Conv 16×4×8 - stride-2
Conv 32×3×3 - stride-2                 Conv 32×3×3 - stride-2
                                       Conv 32×3×4 - stride-(2,3)
Flatten - Concatenation - Dense 128
Dense 128                              Dense 128
Dense 3 - Softmax                      Dense 1 - Linear

Table 14: Simplified Network/S1: Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. No zero padding is used.

Audio (Spectrogram) 39×20              Sheet-Image 40×150
Conv 16×4×4 - stride-2                 Conv 16×4×8 - stride-2
Conv 32×3×3 - stride-2                 Conv 32×3×3 - stride-(1,2)
Conv 64×3×3 - stride-1                 Conv 32×3×3 - stride-(1,2)
                                       Conv 32×4×4 - stride-2
Flatten - Concatenation - Dense 128
Dense 64                               Dense 64
Dense 3 - Softmax                      Dense 1 - Linear

Table 15: Simplified Network/S2: Conv 16×3×3 means a convolutional layer with 16 3×3 kernels. The spectrogram and the sheet are handled separately by different convolutional layers and afterwards concatenated and processed by a fully connected (Dense) layer. On the left side is the policy output passed through a softmax function with three possible actions. On the right is a single linear output for the value function. No zero padding is used.

Architecture    Train            Validation       Test
Original        0.989 ± 0.017    0.734 ± 0.042    0.792 ± 0.039
S1              0.999 ± 0.002    0.701 ± 0.028    0.775 ± 0.029
S2              0.999 ± 0.002    0.700 ± 0.052    0.754 ± 0.037

Table 16: Mean performance of the trained agents (PPO with ReLU activation and no regularization) with the original architecture and the simplified architectures S1 and S2. Results are given in terms of the GTR on all datasets including the standard deviation. All agents were trained five times.

4 Conclusion

In this thesis, we investigated the effect of regularization in the context of RL. We compared several state-of-the-art policy gradient algorithms in combination with different regularization techniques and activation functions. Our goal was to get a better understanding of how certain configurations affect the performance and generalization abilities of RL algorithms. We conducted our experiments on the score following environment due to the possibility of having a split into training, validation and test data. As we want to further use RL in the field of score following as part of the Con Espressione7 project, we think of this experimental study as a guideline for our future work.

We confirm our expectation that Reinforce performs worse than the more advanced A2C and PPO. Nevertheless it is surprising that the use of Dropout improves the results of Reinforce to perform almost on par with the others. However, its major problem is still the training time. As Reinforce requires updates over full episodes and only involves a single actor, it takes about four to five times longer than A2C and PPO, which utilize multiple actors in parallel.

We think that for the future it is necessary to even further reduce the training time of these algorithms, which also relates to the briefly mentioned sample efficiency. If the agents require fewer interactions with the environment, they are able to solve a problem faster. In Section 1 we described ACKTR, which tries to improve the sample efficiency by using an approximation of natural gradients. Compared to A2C it improves the sample efficiency about two to three times on average, while increasing the computational cost by approx. 10% to 25% [41]. As this method requires the use of a special and rather sophisticated optimization technique, we did not yet apply it to our score following problem. However, for our future work we have to take a closer look at algorithms like ACKTR to see if they improve the sample efficiency for score following and are worth the additional computational effort.

One of the most interesting results of our experiments is the influence of Batch-Norm. While this technique did not work for PPO, it showed promising but rather unstable results for A2C. Possible future work could involve a thorough evaluation of how Batch-Norm affects these algorithms and should comprise a closer look at the gradients as well as the learned weights. In general it would be worth examining the weights during training to see what happens when the performance of the agent drops from one validation step to another. The effect of SELU activation also falls into line with the aforementioned observations. For four out of five training runs, SELUs could yield good results in combination with A2C and no regularization. For all other algorithms and configurations it did not work properly. As both Batch-Norm and SELUs have similar properties, it is interesting to thoroughly investigate their behavior in comparison. The insights gained from such experiments could be used for training deeper models, which is still a problem as we have seen in our experiments.

7http://www.cp.jku.at/research/projects/ConEspressione/

References

[1] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, O.P., Zaremba, W.: Hindsight experi- ence replay. In: Advances in Neural Information Processing Systems. pp. 5048–5058 (2017)

[2] Arzt, A.: Flexible and Robust Music Tracking. Ph.D. thesis, Dissertation, Johannes Kepler University, Linz, Austria (2016)

[3] Arzt, A., Böck, S., Flossmann, S., Frostel, H., Gasser, M., Liem, C.C., Widmer, G.: The piano music companion. In: ECAI. pp. 1221–1222 (2014)

[4] Arzt, A., Dorfer, M.: Aktuelle entwicklungen in der automatischen musikverfolgung. arXiv preprint arXiv:1708.02100 (2017)

[5] Bishop, C.: Pattern Recognition and Machine Learning. Springer (2006)

[6] Cancino-Chacón, C., Bonev, M., Durand, A., Grachten, M., Arzt, A., Bishop, L., Goebl, W., Widmer, G.: The accompanion v0. 1: An expres- sive accompaniment system. arXiv preprint arXiv:1711.02427 (2017)

[7] Chou, P.W., Maturana, D., Scherer, S.: Improving stochastic policy gra- dients in continuous control with deep reinforcement learning using the beta distribution. In: International Conference on Machine Learning. pp. 834–843 (2017)

[8] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). In: International Con- ference on Learning Representations (2015)

[9] Dorfer, M., Arzt, A., Widmer, G.: Towards score following in sheet music images. In: Proceedings of the International Society for Music Information Retrieval Conference (2016)

[10] Dorfer, M., Arzt, A., Widmer, G.: Learning audio-sheet music correspon- dences for score identification and offline alignment. In: Proceedings of the International Society for Music Information Retrieval Conference (2017)

[11] Dorfer, M., Henkel, F., Widmer, G.: Learning to listen, read, and follow: Score following as a reinforcement learning game. In: Proceedings of the International Society for Music Information Retrieval Conference (2018)

[12] Farahmand, A.m., Szepesvari, C.: Regularization in reinforcement learning. University of Alberta, Edmonton, Canada (2011)

[13] Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P.H., Kohli, P., Whiteson, S.: Stabilising experience replay for deep multi-agent re- inforcement learning. In: International Conference on Machine Learning. pp. 1146–1155 (2017)

[14] Glavic, M., Fonteneau, R., Ernst, D.: Reinforcement learning for electric power system decision and control: Past considerations and perspectives. IFAC-PapersOnLine 50(1), 6918–6927 (2017)

[15] Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep learning, vol. 1. MIT press Cambridge (2016)

[16] Graves, A., Mohamed, A.r., Hinton, G.: Speech recognition with deep recurrent neural networks. In: International Conference on Acoustics, Speech and Signal processing. pp. 6645–6649. IEEE (2013)

[17] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep net- work training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)

[18] Klambauer, G., Unterthiner, T., Mayr, A., Hochreiter, S.: Self- normalizing neural networks. In: Advances in Neural Information Pro- cessing Systems. pp. 972–981 (2017)

[19] Korzeniowski, F., Widmer, G.: A fully convolutional deep auditory model for musical chord recognition. In: 26th International Workshop on Ma- chine Learning for Signal Processing. pp. 1–6. IEEE (2016)

[20] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)

[21] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Sil- ver, D., Wierstra, D.: Continuous control with deep reinforcement learn- ing. arXiv preprint arXiv:1509.02971 (2015)

[22] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: International Conference on Machine Learning (2013)

[23] Mannion, P., Duggan, J., Howley, E.: An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: Autonomic Road Transport Support Systems, pp. 47–66. Springer (2016)

[24] Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforce- ment learning. In: International Conference on Machine Learning. pp. 1928–1937 (2016)

[25] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wier- stra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

[26] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltz- mann machines. In: Proceedings of the 27th International Conference on Machine Learning. pp. 807–814 (2010)

[27] Recht, B., Re, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in Neural Informa- tion Processing Systems. pp. 693–701 (2011)

[28] Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)

[29] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)

[30] Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604 (2018)

[31] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International Conference on Machine Learning. pp. 1889–1897 (2015)

[32] Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High- dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015)

[33] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

[34] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

[35] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)

[36] Sutton, R.S., Barto, A.G., et al.: Reinforcement learning: An introduc- tion. MIT press (1998)

[37] Wang, J.X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J.Z., Munos, R., Blundell, C., Kumaran, D., Botvinick, M.: Learning to re- inforcement learn. arXiv preprint arXiv:1611.05763 (2016)

[38] Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N.: Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224 (2016)

[39] Williams, R.J.: Simple statistical gradient-following algorithms for con- nectionist reinforcement learning. Machine Learning 8(3-4), 229–256 (1992)

[40] Williams, R.J., Peng, J.: Function optimization using connectionist rein- forcement learning algorithms. Connection Science 3(3), 241–268 (1991)

[41] Wu, Y., Mansimov, E., Grosse, R.B., Liao, S., Ba, J.: Scalable trust- region method for deep reinforcement learning using kronecker-factored approximation. In: Advances in Neural Information Processing Systems. pp. 5285–5294 (2017)

[42] Zhang, C., Vinyals, O., Munos, R., Bengio, S.: A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893 (2018)

Florian Henkel

Personal Data

Place and Date of Birth: Salzburg, Austria | 16 November 1994
Address: Eschenweg 1, Linz, Austria
Phone: +43 676 3636536
Email: [email protected]

Education

Current Master of Science in Computer Science Oct 2017 Johannes Kepler University, Linz, Austria Data Science Major

Oct 2014 - Oct 2017 Bachelor of Science in Computer Science Johannes Kepler University, Linz, Austria Thesis: Automatic Chord Estimation with a Deep Convolutional Neural Network

Sept 2005 - Jun 2013 Grammar School Privatgymnasium der Herz-Jesu-Missionare Salzburg, Austria

Work Experience

Current Student Researcher Oct 2017 Institute of Computational Perception Johannes Kepler University, Linz, Austria

Jul-Sept 2017 Software Development Intern AdRem Software Krakow, Poland

Oct 2016 - Feb 2017 Tutor for Networks and Distributed Systems Institute of Networks and Security Johannes Kepler University, Linz, Austria

Sept 2013 - May 2014 Civil Service Nursing home St. Nikolaus, Neumarkt am Wallersee, Austria

Eidesstattliche Erklärung (Statutory Declaration)

I declare under oath that I have written this master's thesis independently and without outside help, that I have not used any sources or aids other than those indicated, and that I have marked all passages taken literally or in substance from other sources as such. The present master's thesis is identical to the electronically submitted text document.

Linz, July 2018

Florian Henkel