
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 60 CREDITS
STOCKHOLM, SWEDEN 2019

Mixing Music Using Deep Reinforcement Learning

VIKTOR KRONVALL

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: October 14, 2019
Supervisor: Mats Nordahl
Examiner: Örjan Ekeberg
School of Electrical Engineering and Computer Science
Host company: Keio University
Swedish title: Musikmixning med Deep Reinforcement Learning


Abstract

Deep Reinforcement Learning has recently seen good results in tasks such as board games, computer games and the control of autonomous vehicles. State-of-the-art autonomous DJ systems that generate mixed audio commonly hard-code the mixing strategy to a cross-fade transition. This research investigates whether Deep Reinforcement Learning is an appropriate method for learning a mixing strategy that can yield more expressive and varied mixes than the hard-coded mixing strategies by adapting the strategy to the songs played. To investigate this, a system named DeepFADE was constructed.

The DeepFADE system was designed as a three-tier system of hierarchical Deep Reinforcement Learning models. The first tier selects an initial song and limits the song collection to a smaller subset. The second tier selects when to transition to the next song by loading the next song at pre-selected cue points. The third tier is responsible for generating a transition between the two loaded songs according to the mixing strategy. Two Deep Reinforcement Learning algorithms were evaluated, A3C and Dueling DQN. Convolutional and residual neural networks were used to train the reinforcement learning policies.

Reward functions were designed as combinations of heuristic functions that evaluate the mixing strategy according to several important aspects of a DJ-mix, such as alignment of beats, stability in output volume, tonal consonance, and time between transitions of songs. The trained models yield policies that are either unable to create transitions between songs or strategies that remain similar regardless of the songs played. Thus the learnt mixing strategies were not more expressive than hard-coded cross-fade mixing strategies. The training suffers from reward hacking, which was argued to be caused by the agent's tendency to focus on optimizing only some of the heuristics. The reward hacking was mitigated somewhat by the design of more elaborate reward functions that guide the policy to a larger extent.

A survey was conducted with a sample size of n = 11. The small sample size implies that no statistically significant conclusions can be drawn. However, the mixes generated by the trained policy were rated as more enjoyable than those produced by a randomized mixing strategy.

The convergence rate of the training is slow, and training time is limited not only by the optimization of the neural networks but also by the generation of audio used during training. Due to the limited available computational resources it is not possible to draw any clear conclusions about whether the proposed method is appropriate when constructing the mixing strategy.

Sammanfattning

(Summary in Swedish, translated.) Deep reinforcement learning has in recent years shown good results in areas such as board games, computer games and the control of autonomous vehicles. State-of-the-art autonomous DJ systems that generate mixed audio often hard-code the mixing strategy to a cross-fade transition. This work investigates whether deep reinforcement learning is a suitable method for learning a mixing strategy that gives rise to more expressive and varied mixes than the hard-coded strategies, by adapting the strategy to the songs being played. To investigate this, a system named DeepFADE was created.

The DeepFADE system was designed as a three-tier system of hierarchical deep reinforcement learning models. The first tier of the system selects an initial song and limits the collection of songs to a smaller subset. The second tier decides when a transition to the next song should be performed by loading the next song at pre-selected cue points. The third tier is responsible for generating a transition between the two loaded songs according to the mixing strategy. Two algorithms were evaluated: A3C and Dueling DQN. Convolutional and residual neural networks were used to train the reinforcement learning policies.

The reward functions were designed as combinations of heuristics that evaluate the mixing strategy according to important aspects of a DJ mix, such as beat synchronization, stability in volume, tonal consonance and time between song transitions. The trained models give rise to strategies that either fail to create song transitions or strategies that do not change based on the songs being played. Thus, the learnt strategies were not more expressive than the hard-coded cross-fade strategies. The training suffers from reward hacking, which was considered to be caused by the agent's tendency to focus on some heuristics at the expense of lower rewards for the remaining heuristics. This was mitigated somewhat through the design of more complex reward functions that guided the agent's policy to a larger extent.

A survey with a sample size of n = 11 was conducted. The limited sample meant that no statistically significant conclusions could be drawn. However, mixes generated by the trained policy were rated as more enjoyable than those generated by a randomized mixing strategy.

The convergence rate of the training is slow, and the training time is affected not only by the optimization of the neural networks but also by the generation of audio during training. Due to limited computational resources, no clear conclusions could be drawn as to whether the proposed method is suitable for constructing a mixing strategy.

Contents

1 Introduction
  1.1 Description
  1.2 Motivation
  1.3 Objective
  1.4 Research Question
    1.4.1 Question
    1.4.2 Problem Definition
  1.5 Evaluation
    1.5.1 Datasets
  1.6 Novelty
  1.7 Limitation of Scope
  1.8 Important Aspects of a DJ-mix
  1.9 Sustainability and Power Consumption
  1.10 Ethical and Societal Considerations

2 Background
  2.1 Machine Learning
    2.1.1 Artificial Neural Networks
    2.1.2 Backprop – The Error Back-Propagation Algorithm
    2.1.3 Convolutional Neural Networks
    2.1.4 Deep Learning
  2.2 Reinforcement Learning
    2.2.1 Markov Decision Process
    2.2.2 Partially Observable Markov Decision Process
    2.2.3 Functions and Distributions
    2.2.4 Actor-Critic Learning
    2.2.5 Q-Learning
    2.2.6 Deep Reinforcement Learning
    2.2.7 Deep Q Network
    2.2.8 Dueling Network Architectures
    2.2.9 Asynchronous Advantage Actor-Critic
    2.2.10 Epsilon Greedy Policy
    2.2.11 Partial Observability
    2.2.12 Reward Hacking
    2.2.13 Inverse Reinforcement Learning
  2.3 DJ mixes
    2.3.1 Musical Structure
    2.3.2 Musical Key
    2.3.3 Critical Bands and Perceptual Scales
    2.3.4 Tonal Consonance
    2.3.5 Music Emotion Recognition
    2.3.6 MediaEval Database for Emotional Analysis in Music
    2.3.7 Cue Point Selection
    2.3.8 Self-Similarity Based Novelty Score
  2.4 Autonomous DJ Systems

3 Method
  3.1 System Overview
  3.2 Pre-processing
  3.3 Song Recommendation Tier
  3.4 Cue Point Loading Tier
  3.5 Mixing (DJ-Controller) Tier
    3.5.1 Knobs and Faders
    3.5.2 Audio Generation
    3.5.3 Mixing and Volume
    3.5.4 Interpolation
  3.6 DJ-Controller Rewards
  3.7 Neural Network Architectures
  3.8 A3C Network
  3.9 Dueling DQN Network
  3.10 Algorithms
    3.10.1 A3C Implementation
    3.10.2 Dueling DQN Implementation
  3.11 Training
  3.12 Evaluation
    3.12.1 User Survey

4 Results
  4.1 Cue Points
  4.2 Training
  4.3 Dueling DQN
  4.4 A3C
    4.4.1 Trained on 3438 Episodes
    4.4.2 Trained on 10000 Episodes
    4.4.3 Trained on 60000 Episodes
  4.5 Cue Point Loading Environment
  4.6 User Survey

5 Discussion
  5.1 Convergence Rate
  5.2 Multi-Threading
  5.3 A3C Policy Updates
  5.4 Complex Environment
  5.5 Rewards
  5.6 Cue Points
  5.7 Expressivity
  5.8 Tiered System
  5.9 Unsupervised Learning
  5.10 Extensibility
  5.11 User Survey and Subjective Evaluation
  5.12 Improvements and Future Work

6 Conclusion

Bibliography

Chapter 1

Introduction

1.1 Description

The aim of this research is to construct a system that given a collection of songs can autonomously create a mix of songs similar to what a DJ does during a live performance. This resulting mix should be pleasing to listen to and interesting both musically and otherwise. The proposed method of constructing such a system is Deep Reinforcement Learning. This method was chosen because it does not assume any dataset of annotated DJ mixes. This is important both because of a lack of such datasets and because using supervised learning is likely to over-fit the model to the dataset leading to a degraded generalization of the model.

1.2 Motivation

Reinforcement learning has in recent years been successfully applied to self-driving cars [1] and to video games and board games such as Atari Breakout [2] and Go [3]. However, there is not yet enough research to answer whether the method is applicable to generating DJ mixes. This research aims to expand this area of research. The body of research relating to DJ mixes is limited. However, Ishizaki, Hoashi, and Takishima [4] provide some evidence that a mixing strategy with minimal adjustments leads to less discomfort when listening to the resulting mixes. Hirai, Doi, and Morishima [5] present a model for generating mixes based on a local similarity between songs, and Vande Veire and De Bie [6] present a model where transitions are performed based on a novelty measure on the audio waveform.


While the research relating to DJ mixes is limited, there is overlap with other areas of research such as recommendation systems and playlist generation. Recent work [7] has shown the possibility of creating flexible and expressive models when generating playlists.

Four major audiences have been identified for whom an autonomous DJ system might be of interest.

• Retail stores: Retail stores often want to have music continuously playing in the background to enhance the shopping experience. Having a system that can maintain a specific mood and emotion in the music played can be of interest.

• Personal use: An autonomous DJ system would provide a new way to experience one's own music collection.

• Social gatherings: At social gatherings where there is no DJ present, having a system generate the music allows the participants to focus on the gathering instead of the selection of music.

• DJ: In case the generated audio reaches an acceptable level of quality, the software could be used as an educational tool for DJs who want to learn how to create better mixes.

1.3 Objective

The objective of this research is to construct a system that can, without human input, transform a collection of songs into a DJ-mix. Equally important are the furthering of state-of-the-art methods for generating DJ-mixes and the investigation of the application of Deep Reinforcement Learning to the generation of such mixes.

1.4 Research Question

1.4.1 Question

The research questions that this thesis will cover are:

1. Can a model be constructed that learns to mix songs such that the resulting mixes are enjoyable, without hard-coding the mixing strategy?

2. Will the constructed model be sufficiently expressive to be able to adapt the mixing strategy to the selected songs?

Furthermore, this project will also explore whether the method of Deep Reinforcement Learning is appropriate when constructing such a model.

1.4.2 Problem Definition

The problem at hand requires an implementation of a system utilizing Deep Reinforcement Learning to generate DJ-mixes. The creation of such a system entails multiple hurdles that must be overcome in order for the system to perform well.

Firstly, the number of actions that can possibly be performed at every time-step when generating a mix is large, and the number of possible action sequences over a mix is exponential in the number of actions per time-step. Hence, it is important to limit the number of actions considered at every time-step.

Secondly, the evaluation of music is highly subjective, and the ordering induced by which mix sounds better than other mixes is difficult to capture in software.

Moreover, supposing such an ordering was constructed, the Deep Reinforcement Learning model is susceptible to "reward hacking" [8], where the model finds solutions which yield higher rewards according to the ordering but subjectively sound worse when evaluated by humans. This problem is further amplified when more aspects are considered when evaluating the goodness of a mix, since the model may choose to focus on one aspect to the detriment of other aspects that are also important for a mix to sound good.

1.5 Evaluation

The system will be evaluated by comparing the generated audio to human-generated mixes provided by the Mixotic dataset [9, 10]. The evaluation will consider aspects such as those presented in section 1.8.

1.5.1 Datasets

• Mixotic: The Mixotic dataset [9] contains DJ-mixes created by human disc jockeys, extracted from the Mixotic web site [10]. This dataset will be used when evaluating the performance of the system. The evaluation will be done by comparing the human-generated mixes from the dataset to mixes generated by the system.

• DEAM: The MediaEval Database for Emotional Analysis of Music (DEAM) [11] will be used to evaluate the emotional content of songs. The dataset contains features extracted by Eyben, Weninger, Gross, and Schuller [12].

1.6 Novelty

The primary distinction between the state-of-the-art [6, 5] and the proposed project is that in the state-of-the-art research the mixing strategy is fixed to perform the transitions using cross-fades and linear tempo adjustments, whereas this research will explore having the mixing strategy learned from data in a reinforcement learning setting. Furthermore, unlike the work by Vande Veire and De Bie [6], the proposed method does not assume that the music used is of a specific genre. This research aims to provide more information about which areas the method of Deep Reinforcement Learning is applicable to, as well as an expansion of data-driven approaches to generating DJ mixes.

1.7 Limitation of Scope

The system will not cover selecting the collection of songs that are used when generating the mixes. Instead, the task of choosing potential songs will be delegated to the user or some external system. The system will include a reward function that is designed to yield mixes that are enjoyable to the listeners. However, this research makes no attempt at defining a metric or order that can be used in general to compare enjoyability between different mixes.

1.8 Important Aspects of a DJ-mix

The DJ has to consider multiple aspects when creating the mix. The goal is to create a mix that flows seamlessly from one song to the next. The most noticeable aspect is to avoid drastic changes in volume, since such changes both interfere with the listeners' ability to follow the rhythm of the mix and cause sudden changes to the perceived intensity of the mix [13]. The DJ must also consider the basic technique of "beat matching", where the beat positions of all simultaneously playing songs are aligned in order to maintain the rhythm of the generated mix.

To avoid jarring changes in the tempo of the generated mix, the tempo adjustments to the playing songs should be kept minimal throughout the mix [4]. It is also important to utilize the full audio spectrum to avoid "hollow" sounding mixes. For the generated mix to sound pleasing, the harmonic content and musical keys of the constituent songs in the mix should be considered to limit the amount of dissonance in the audio [14, 15].

Apart from the audio quality of the generated mix, the DJ must also consider the frequency of transitions between songs. Too many transitions lead to mixes where the rhythm is not clearly established before the next song is introduced. Too few transitions erase the distinction between mixing the songs into a continuous stream of music and simply playing the songs in sequence.

For the mix to be seamless and coherent, the DJ should also consider the higher-level musical structure of the songs, such as phrases and musical form, and perform transitions such that phrases are aligned between the songs in a transition. Since the experience of music is subjective, some or all of the above aspects may be disregarded for emotional effect.

1.9 Sustainability and Power Consumption

Deep Learning is computationally expensive, and with the computational cost follows large power consumption [16]. The power consumption of the proposed system is also heavily influenced by the power consumption of the audio signal processing. This power consumption is linear in the number of signal transitions (from positive to negative) of the input signal [17], which in turn corresponds to the number of samples in the input signal.

Carbon dioxide emissions follow a linear relation to power consumption, and the power consumption for training one iteration is given by the amount of memory and the number of GPUs used [16]. Since the memory used is constant throughout the training, the carbon emissions of the system are linearly proportional to the number of training steps and the duration of the audio used per training step.

1.10 Ethical and Societal Considerations

The direct ethical implications of the existence of an autonomous DJ-system are on their own likely relatively limited. However, it is important to evaluate it in the context of the rapid increase in automation through the use of artificial intelligence. Manyika [18, p. 46] estimates that 40% to 55% of global wages are potentially affected by automation utilizing currently demonstrated technologies. The potential for automation in creative, social and emotional occupational categories such as DJ-ing is however estimated as being limited. Moreover, an autonomous DJ-system could also be used by professional DJs. Such tools could also allow for new and creative opportunities, such as the DJ mixing music in collaboration with the autonomous DJ-system.

Manyika et al. [19] project that in developed countries such as Germany, Japan, and the United States, large portions of the population may be required to change occupational categories. In the case of rapid automation, the estimates are 32%, 42%, and 33% respectively. Thus, given further developments of artificial intelligence in creative areas where jobs are projected to increase despite automation, the ability to move to those occupational categories might become more limited, and the percentage of unemployed people with no easy way back into employment increases. This inability to quickly adapt to changing workplace demands is likely further exacerbated by the growing demand for jobs with an education level of college or higher in developed countries.

This research also relates to the ongoing debate on copyright and ownership of art and music created with the use of artificial intelligence. At first, the problem may seem negligible since the copyright of the individual songs in a mix clearly belongs to the creators of those songs. However, given the United States Fair Use doctrine, it is unclear at what point generated audio can be considered transformative work. Does the inclusion of a three-second clip of a song imply that express permission from the original author is required? The European Union Copyright Directive [20, 21] maintains the copyright of the original author even for part of the original work. However, many member states only consider it an infringement if a "substantial part" of the original phonogram (audio) is used [22]. See [22, 23] for a more in-depth discussion on ownership and copyright relating to audio and works generated by artificial intelligence systems.

Chapter 2

Background

2.1 Machine Learning

The main goal of machine learning is to approximate some function that is difficult to construct or compute directly by providing an algorithm with data. An objective (or, equivalently, a loss) function is used to evaluate the approximation of the function. The machine learning problem can thus be formulated as an optimization problem where the objective function is maximized or the loss function is minimized. The optimization is performed by fitting the model parameters of the algorithm, with the hope that the approximation will converge towards a function whose outputs yield a high score from the objective function, and thus correct or close to correct results on data that is similar to the data provided to the model during learning.

2.1.1 Artificial Neural Networks

An artificial neural network is an instance of a machine learning algorithm. A neural network is a differentiable function composed of smaller nonlinear differentiable functions called layers. The layers may have parameters that determine the output of the layers. The layer parameters are updated such that the loss for the data passed to the neural network is minimized.


2.1.2 Backprop – The Error Back-Propagation Algorithm

The Error Back-Propagation Algorithm, or Backprop for short, is an algorithm by Werbos [24] for computing how the layer parameters in a neural network can be updated such that the loss is minimized. The algorithm is based on work within control theory [25]. Dreyfus [26] simplified the approach based on the chain rule of differentiation. Rumelhart, Hinton, and Williams [27] popularized the error back-propagation algorithm and presented the still commonly used sigmoid non-linearity. Linnainmaa [28] evaluated a very similar method and developed what is now referred to as automatic differentiation (AD) by computing the errors using Taylor series, an approach that is still commonly used in neural network frameworks today.

The algorithm works by computing partial derivatives of the loss with respect to some variables in the larger function that is the neural network. These partial derivatives are computed starting from the last layer in the network. In order to compute the partial derivatives for layers earlier in the network, the chain rule is utilized such that the partial derivative of the loss with respect to the layer outputs is used to compute the derivatives with respect to the layer parameters as well as the layer inputs. This is the back-propagation of the error gradients.

Once the partial derivative of the loss with respect to every layer parameter in the network has been computed, these layer parameters are updated by stepping in the negative direction of the partial derivative such that the loss is likely to be minimized.
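To make the chain-rule bookkeeping concrete, the following sketch trains a tiny two-layer network on random data with manually derived gradients. The shapes, data and hyper-parameters are hypothetical and chosen only for illustration; they are not taken from the thesis.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))            # mini-batch of inputs
y = rng.standard_normal((32, 1))            # regression targets

W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.1, np.zeros(1)
lr = 0.01

for _ in range(100):
    # Forward pass through the two layers.
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass: propagate dLoss/d(output) towards the input via the chain rule.
    d_y_hat = 2.0 * (y_hat - y) / len(x)
    d_W2 = h.T @ d_y_hat
    d_b2 = d_y_hat.sum(axis=0)
    d_h = d_y_hat @ W2.T
    d_h_pre = d_h * (1.0 - np.tanh(h_pre) ** 2)   # derivative of tanh
    d_W1 = x.T @ d_h_pre
    d_b1 = d_h_pre.sum(axis=0)

    # Gradient step in the negative direction of the partial derivatives.
    W1 -= lr * d_W1; b1 -= lr * d_b1
    W2 -= lr * d_W2; b2 -= lr * d_b2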

2.1.3 Convolutional Neural Networks

Convolutional neural networks (CNNs) are neural networks where one of the most prominent layers is a 2-dimensional convolution of the input with some kernel weights, which are the parameters of this layer. This kind of neural network was popularized by LeCun et al. [29], where a convolutional neural network was used to identify hand-written digits in the MNIST dataset. The convolutional neural network architectures were inspired by the neocognitron [30, 31].

2.1.4 Deep Learning

Deep learning refers to learning in artificial neural networks with multiple layers or networks with a high credit assignment path (CAP) value.

The exact depth required for a network to be considered deep is debated, but some authors place the threshold at a CAP > 2. [32]

2.2 Reinforcement Learning

In reinforcement learning an agent interacts with the environment in which it resides via observations and actions. The reinforcement learning framework is an iterative one where the environment is in some state s. The agent obtains an observation o of that state. The agent then chooses an action a depending on this observation. The action changes the state of the environment and the agent receives a reward r which signals to the agent how good that action was. The goal of the agent is to maximize the long-running sum of rewards. [33, p. 238]

2.2.1 Markov Decision Process

When the states can be fully observed by the agent, the reinforcement learning task can be modelled as a Markov Decision Process (MDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$, where $\mathcal{S}$ is the state space of the process with states $s \in \mathcal{S}$, $\mathcal{A}$ is the action space with actions $a \in \mathcal{A}$, $\mathcal{O}$ is the observation space with observations $o \in \mathcal{O}$, $\mathcal{T}^i_{jk} = P(s_{t+1} = i \mid s_t = j, a_t = k)$ is the transition operator describing the probability of transitioning to a specific state $i$ given a current state $j$ and an action $k$, and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function.

Let $\mu^j_t = P(s_t = j)$ be the probability of being in state $j$ at time $t$ and $\xi^k_t = P(a_t = k)$ be the probability of action $k$ at time $t$. This gives the probability of being in state $i$ at time $t+1$ as $\mu^i_{t+1} = \sum_{j,k} \mathcal{T}^i_{jk}\, \mu^j_t\, \xi^k_t$.
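As a concrete illustration of this state-distribution update, the following sketch computes $\mu_{t+1}$ from $\mu_t$, $\xi_t$ and the transition tensor with NumPy. The toy sizes and random tensor are hypothetical, not part of the thesis.

import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# T[i, j, k] = P(s_{t+1} = i | s_t = j, a_t = k)
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)          # normalize over the next state i

mu = np.array([1.0, 0.0, 0.0])             # mu[j] = P(s_t = j), start in state 0
xi = np.array([0.5, 0.5])                  # xi[k] = P(a_t = k), uniform actions

# mu'_i = sum_{j,k} T[i, j, k] * mu[j] * xi[k]
mu_next = np.einsum("ijk,j,k->i", T, mu, xi)
assert np.isclose(mu_next.sum(), 1.0)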

2.2.2 Partially Observable Markov Decision Process

When the states cannot be fully observed by the agent, the reinforcement learning task can be modelled as a Partially Observable Markov Decision Process (POMDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{O}$ is the observation space with observations $o \in \mathcal{O}$, and $\mathcal{T}$ is the same transition operator as in the fully observed case. $\mathcal{E}^l_i = P(o_t = l \mid s_t = i)$ denotes the emission probability of observing an observation given a state of the environment, and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. [34]

When the transition dynamics tensor $\mathcal{T}$ is unknown, the world state $s$ can be estimated by a belief state $b$. The belief state $b$ is constructed such that $b_t$ is a probability distribution over $\mathcal{S}$, with $b^i_t$ denoting $P(b_t = i)$. The belief state is updated given an observation $o_t = l$ and an action $a_t = k$ according to

$$
\begin{aligned}
b^i_{t+1} &= P(b_{t+1} = i) = P(s_{t+1} = i \mid b_t = j, a_t = k, o_t = l) \\
&= \frac{P(o_t = l \mid s_{t+1} = i, b_t = j, a_t = k)\, P(s_{t+1} = i \mid b_t = j, a_t = k)}{P(o_t = l \mid b_t = j, a_t = k)} \\
&= \frac{P(o_t = l \mid s_{t+1} = i) \sum_{s \in \mathcal{S}} P(s_{t+1} = i \mid s_t = s, b_t = j, a_t = k)\, P(s_t = s \mid b_t = j, a_t = k)}{P(o_t = l \mid b_t = j, a_t = k)} \\
&= \frac{\mathcal{E}^l_i \sum_{s \in \mathcal{S}} \mathcal{T}^i_{sk}\, b^s_t}{P(o_t = l \mid b_t = j, a_t = k)}
\end{aligned} \qquad (2.1)
$$

Since $P(o_t = l \mid b_t = j, a_t = k)$ is independent of the world state $s$, this normalizing constant can be precomputed. With this in mind, the transition probabilities of belief states can be more succinctly described using the transition tensor

$$
\tau^i_{jk} = P(b_{t+1} = i \mid b_t = j, a_t = k) = \sum_l P(b_{t+1} = i \mid b_t = j, a_t = k, o_t = l) \qquad (2.2)
$$
giving the updated belief as

$$
b^i_{t+1} = \sum_{j,k} \tau^i_{jk}\, b^j_t\, \xi^k_t \qquad (2.3)
$$
The reward function for partially observed states is given by the expected true reward
$$
\rho(b_t, a_t) = \sum_{s \in \mathcal{S}} P(b_t = s)\, r(s, a_t) \qquad (2.4)
$$
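A minimal sketch of the belief update in equation (2.1), conditioning on one action and one observation; the toy matrices and variable names below are hypothetical illustrations, not taken from the thesis.

import numpy as np

n_states, n_actions, n_obs = 3, 2, 4
rng = np.random.default_rng(1)

T = rng.random((n_states, n_states, n_actions))   # T[i, s, k] = P(s'=i | s, a=k)
T /= T.sum(axis=0, keepdims=True)
E = rng.random((n_obs, n_states))                  # E[l, i] = P(o=l | s=i)
E /= E.sum(axis=0, keepdims=True)

def belief_update(b, k, l):
    """One application of eq. (2.1): condition the belief on action k and observation l."""
    predicted = T[:, :, k] @ b          # sum_s T[i, s, k] * b[s]
    unnormalized = E[l, :] * predicted  # multiply by the emission probability E[l, i]
    return unnormalized / unnormalized.sum()

b = np.full(n_states, 1.0 / n_states)   # uniform initial belief
b = belief_update(b, k=0, l=2)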

2.2.3 Functions and Distributions

The reinforcement learning agent samples actions according to some policy. The policy $\pi(s)$ is a probability distribution over the action space $\mathcal{A}$ and determines what action the agent will take given a state $s$.

The agent generates an episode as it explores the environment. An episode is a sequence of states from an initial state $s_0$ until some terminal state $s_T$. The episode is constructed by sampling an action according to the policy $\pi(s)$ of each intermediate state, and the subsequent state in the episode is given by the resulting state returned by the environment after performing the sampled action.

The agent wishes to maximize the reward it receives from the reward function when performing the actions. However, solely considering the immediate reward might lead to a sub-optimal policy in the long run, and thus the learning process can be modified to maximize either the total reward, the average reward or the total discounted reward. The total discounted reward is often chosen due to being easier to compute than the two other options. [35, p. 41]

Let $R(s_t = s, a_t = a)$ denote the stochastic variable associated with the reward $r(s, a)$. Similarly, let $R(s_t = s, \pi)$ denote the stochastic variable of the reward in state $s$ when selecting actions according to the policy $\pi$. The value $V_\pi(s)$ of a state $s$ for a given policy $\pi$ is the expected total future discounted reward starting in state $s$ and taking actions according to the policy $\pi$.

$$
V_\pi(s_t) = \mathbb{E}\left[ R(s_t, \pi) + \gamma R(s_{t+1}, \pi) + \gamma^2 R(s_{t+2}, \pi) + \dots \right] \qquad (2.5)
$$
where $\gamma \in [0, 1)$ is the discount factor, actions are selected according to the policy $\pi$ and states are given by the actions together with the transition probabilities $\mathcal{T}$. Since the discount is exponential, the value function admits the form:

$$
V_\pi(s) = \mathbb{E}[R(s, \pi)] + \gamma \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathcal{T}^{s'}_{as}\, P(\pi(s) = a)\, V_\pi(s') \qquad (2.6)
$$

The action-value function $Q_\pi(s, a)$ computes the expected total future discounted reward given a state $s$ and an action $a$.

$$
Q_\pi(s, a) = \mathbb{E}[R(s, a)] + \gamma \sum_{s' \in \mathcal{S}} \mathcal{T}^{s'}_{as}\, V_\pi(s') \qquad (2.7)
$$
The advantage function measures the expected improvement in expected reward obtained by taking a specific action in a state $s$ compared to the average action in the state $s$. With some basic probability theory and arithmetic, it can be verified that the value $V_\pi(s)$ of a state $s$ corresponds to the average action-value $\mathbb{E}_{a \sim \pi(s)}[Q_\pi(s, a)]$ of the same state.

$$
\begin{aligned}
V_\pi(s) &= \mathbb{E}[R(s, \pi)] + \gamma \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathcal{T}^{s'}_{as}\, P(\pi(s) = a)\, V_\pi(s') \\
&= \sum_{a \in \mathcal{A}} P(\pi(s) = a)\, \mathbb{E}[R(s, a)] + \sum_{a \in \mathcal{A}} P(\pi(s) = a)\, \gamma \sum_{s' \in \mathcal{S}} \mathcal{T}^{s'}_{as}\, V_\pi(s') \\
&= \sum_{a \in \mathcal{A}} P(\pi(s) = a) \left[ \mathbb{E}[R(s, a)] + \gamma \sum_{s' \in \mathcal{S}} \mathcal{T}^{s'}_{as}\, V_\pi(s') \right] \\
&= \mathbb{E}_{a \sim \pi(s)}\left[ Q_\pi(s, a) \right]
\end{aligned} \qquad (2.8)
$$

This allows us to define the advantage $A_\pi(s, a)$ as:
$$
A_\pi(s, a) = Q_\pi(s, a) - \mathbb{E}_{a' \sim \pi(s)}\left[ Q_\pi(s, a') \right] = Q_\pi(s, a) - V_\pi(s) \qquad (2.9)
$$

From this it follows that
$$
\begin{aligned}
\mathbb{E}_{a \sim \pi(s)}[A_\pi(s, a)] &= \mathbb{E}_{a \sim \pi(s)}[Q_\pi(s, a) - V_\pi(s)] \\
&= \mathbb{E}_{a \sim \pi(s)}\left[ Q_\pi(s, a) - \mathbb{E}_{a' \sim \pi(s)}[Q_\pi(s, a')] \right] \\
&= \sum_{a \in \mathcal{A}} P(\pi(s) = a) \Big[ Q_\pi(s, a) - \sum_{a' \in \mathcal{A}} P(\pi(s) = a')\, Q_\pi(s, a') \Big] \\
&= \sum_{a \in \mathcal{A}} \sum_{a' \in \mathcal{A}} P(\pi(s) = a)\, P(\pi(s) = a') \big[ Q_\pi(s, a) - Q_\pi(s, a') \big] \\
&= 0
\end{aligned} \qquad (2.10)
$$

The probability $P(\pi(s) = a)\,P(\pi(s) = a')$ for $a \neq a'$ is 0, since the policy only yields one action for every state $s$, and $P(\pi(s) = a) = P(\pi(s) = a')$ for $a = a'$. In the case where $a = a'$ the resulting term $Q_\pi(s, a) - Q_\pi(s, a')$ evaluates to 0.

2.2.4 Actor-Critic Learning

Actor-critic algorithms [36] are algorithms where both the policy $\pi(s)$ and the value function $V_\pi(s)$ are learned. The actor is the part of the algorithm that learns the policy and the critic learns how to determine the value of a state.

2.2.5 Q-Learning

One-step Q-Learning

The one-step Q-learning algorithm [35] optimizes the action-value function iteratively according to
$$
Q_{t+1}(s_t, a_t) = (1 - \eta)\, Q_t(s_t, a_t) + \eta \left[ r(s_t, a_t) + \gamma \max_a Q_t(s_{t+1}, a) \right] \qquad (2.11)
$$
where $\eta \in [0, 1]$ is the learning rate of the Q-learning. This number is usually set to a small value close to 0. With a value of $\eta = 0$ the model does not learn anything from the observed data; however, the algorithm keeps exploring new possible actions. With higher values of $\eta$ the learning converges more quickly to exploiting the actions which have been observed to give high reward, but the algorithm might disregard untried actions, leading to convergence to a local maximum of the received reward.
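A minimal tabular sketch of the update in equation (2.11); the table sizes, state and action encodings are hypothetical placeholders, not part of the DeepFADE system.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
eta, gamma = 0.1, 0.99      # learning rate and discount factor

def q_update(s, a, r, s_next, terminal):
    """One application of eq. (2.11) for an observed transition (s, a, r, s_next)."""
    target = r if terminal else r + gamma * Q[s_next].max()
    Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target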

Asynchronous n-step Q-Learning

The n-step Q-learning algorithm by Mnih et al. [37] operates by computing the total discounted reward for $n$ steps or until a terminal state has been reached. The policy and the action-value function are then updated by first considering the last action, then the two last actions, until all actions have been considered. More formally, the estimated action-values $Q_\theta(s, a)$, parameterized by the parameters $\theta$, are updated asynchronously according to
$$
\theta \leftarrow \theta + \sum_{i=0}^{n-1} \frac{\partial \left( R_i - Q_\theta(s_i, a_i) \right)^2}{\partial \theta} \qquad (2.12)
$$
where $R_i = \sum_{k=i}^{n-1} \gamma^{\,n-1-k}\, r(s_k, a_k)$.

2.2.6 Deep Reinforcement Learning

Deep reinforcement learning (Deep RL) is the application of deep learning to the computation of the action-value function $Q_\theta(s, a)$, the value function $V_\theta(s)$ and the policy $\pi_\theta$. Here $\theta$ refers to the model parameters of the respective artificial neural networks that are used to approximate the functions. [38]

2.2.7 Deep Q Network

In the deep Q-network (DQN) algorithm the optimal action-value function
$$
Q^*(s, a) = \mathbb{E}\left[ R(s_t, \pi) + \gamma R(s_{t+1}, \pi) + \gamma^2 R(s_{t+2}, \pi) + \dots \mid s_t = s, a_t = a, \pi \right] \qquad (2.13)
$$
is approximated using a convolutional neural network. [2] The algorithm is off-policy and uses a database of stored state-transition tuples $(s_i, a_i, r_i, s_{i+1})$. Two networks are used when optimizing the parameters, a target network $Q_{\theta^-, \theta_v^-, \theta_a^-}(s, a)$ and a current network $Q_{\theta, \theta_v, \theta_a}(s, a)$. For each tuple $(s_i, a_i, r_i, s_{i+1})$ in a mini-batch of size $N$ sampled from the database, with $i \in [1, N]$, the current network parameters are updated by
$$
\{\theta, \theta_v, \theta_a\} \leftarrow \{\theta, \theta_v, \theta_a\} - \eta\, \nabla_{\theta, \theta_v, \theta_a} \left\| y_i - Q_{\theta, \theta_v, \theta_a}(s, a) \right\| \qquad (2.14)
$$
where
$$
y_i = \begin{cases} r_i & \text{if } s_{i+1} \text{ is terminal} \\ r_i + \gamma \max_{a'} Q_{\theta^-, \theta_v^-, \theta_a^-}(s_{i+1}, a') & \text{otherwise} \end{cases}
$$
and $\eta$ is the learning rate.

After each mini-batch the target network parameters are set to the current network parameters:
$$
\{\theta^-, \theta_v^-, \theta_a^-\} \leftarrow \{\theta, \theta_v, \theta_a\} \qquad (2.15)
$$

Experience Replay

During the training of a deep Q-network the experience of the agent at each time-step is recorded into a dataset $D_t = \{e_0, e_1, \dots, e_t\}$, where the experience at time-step $k$ is stored as a 4-tuple $e_k = (s_k, a_k, r(s_k, a_k), s_{k+1})$. Samples are then drawn uniformly from this dataset and used to update the network parameters. This strategy minimizes the correlation between the samples used during training, which is often much higher for consecutive samples.
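A minimal sketch of such a replay buffer; this is a generic illustration, not the DeepFADE implementation, and it additionally stores a terminal flag so that the target in eq. (2.14) can be computed.

import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s_next, terminal) tuples and samples them uniformly."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def add(self, s, a, r, s_next, terminal):
        self.buffer.append((s, a, r, s_next, terminal))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # columns: states, actions, rewards, ...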

2.2.8 Dueling Network Architectures

The Dueling Network Architecture by Wang, Schaul, Hessel, Van Hasselt, Lanctot, and De Freitas [39] builds upon the DQN algorithm. However, instead of estimating the action-value function directly, a network is constructed such that the value function and the advantage function are estimated, and a final estimate for the action-values is then computed from these estimates. The reason for decoupling the value and the advantage is that for some states knowing the exact value is not as important as for other states. In cases where the agent might end up in a terminal state by choosing the wrong action, the advantage is more important, whereas in other states the specific choice of action is not as important as the long-term value. The intuitive computation of the estimate of the action-values would be

$$
Q_{\theta, \theta_v, \theta_a}(s, a) = V_{\theta, \theta_v}(s) + A_{\theta, \theta_a}(s, a) \qquad (2.16)
$$
where $\theta$ are parameters shared between the value estimate and the advantage estimates, $\theta_v$ are parameters of the value estimator network and $\theta_a$ are parameters of the advantage estimator network. However, the estimate of the value $V_{\theta, \theta_v}(s)$ might not be a good estimator of the true value of the state $s$. Also, $V_{\theta, \theta_v}(s)$ and $A_{\theta, \theta_a}(s, a)$ are not uniquely determined by $Q_{\theta, \theta_v, \theta_a}(s, a)$. This can be observed by adding a constant to the resulting action-value estimate. For the following computation of the action-value estimates
$$
Q_{\theta, \theta_v, \theta_a}(s, a) = V_{\theta, \theta_v}(s) + \left( A_{\theta, \theta_a}(s, a) - \max_{a' \in \mathcal{A}} A_{\theta, \theta_a}(s, a') \right) \qquad (2.17)
$$
the induced optimal policy $\pi^*$ is given by $\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q_{\theta, \theta_v, \theta_a}(s, a) = \arg\max_a A_{\theta, \theta_a}(s, a)$. For this policy we obtain $Q_{\theta, \theta_v, \theta_a}(s, \pi^*(s)) = V_{\theta, \theta_v}(s)$. For stability of the learning, the Dueling Network Architecture defines the estimate in a slightly different way:
$$
Q_{\theta, \theta_v, \theta_a}(s, a) = V_{\theta, \theta_v}(s) + \left( A_{\theta, \theta_a}(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A_{\theta, \theta_a}(s, a') \right) \qquad (2.18)
$$
This computation satisfies the same properties as the computation including the max function, and the value estimate $V_{\theta, \theta_v}(s)$ can be uniquely determined from the estimated action-values $Q_{\theta, \theta_v, \theta_a}(s, a)$. However, this computation has the drawback that the estimated value will be offset from the true value by the constant $\frac{1}{|\mathcal{A}|} \sum_{a' \in \mathcal{A}} A_{\theta, \theta_a}(s, a')$. Since this network produces estimates of the action-values, optimization of the parameters can be done using similar methods as for the ordinary DQN algorithm.
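The aggregation in equation (2.18) is a simple element-wise operation on the two network heads; a NumPy sketch with hypothetical head outputs follows.

import numpy as np

def dueling_q(value, advantages):
    """Combine a state-value and per-action advantages as in eq. (2.18).

    value:      array of shape (batch, 1), output of the value head V.
    advantages: array of shape (batch, n_actions), output of the advantage head A.
    """
    return value + advantages - advantages.mean(axis=1, keepdims=True)

v = np.array([[0.5]])                      # hypothetical head outputs
a = np.array([[0.2, -0.1, 0.4]])
q = dueling_q(v, a)                        # shape (1, 3)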

2.2.9 Asynchronous Advantage Actor-Critic

The Asynchronous Advantage Actor-Critic (A3C) algorithm by Mnih et al. [37] estimates the policy $\pi_\theta(s)$ and the value function $V_{\theta_v}(s_t)$ using deep neural networks.

The algorithm is initiated by creating a global network that estimates the policy $\pi_\theta(s)$ and the value function $V_{\theta_v}(s_t)$ and spawning $N$ threads or processes. A thread-local neural network with the same architecture as the global network is created for each thread. Each thread is also allocated its own reinforcement learning environment such that training can be performed in parallel. The parameters $\theta$ and $\theta_v$ of the global network are updated asynchronously by first copying the global network parameters $\theta$ and $\theta_v$ to the local network and then accumulating the gradients with respect to the copied parameters $\theta'$ and $\theta'_v$:

$$
\begin{aligned}
\theta &\leftarrow \theta - \eta \sum_{i \in \{T-1, \dots, t\}} \left[ \nabla_{\theta'}\, \beta H(\pi_{\theta'}(s_i)) - \nabla_{\theta'} \log \pi_{\theta'}(s_i) \left( R_i - V_{\theta'_v}(s_i) \right) \right] \\
\theta_v &\leftarrow \theta_v - \eta \sum_{i \in \{T-1, \dots, t\}} \frac{\partial \left( R_i - V_{\theta'_v}(s_i) \right)^2}{\partial \theta'_v}
\end{aligned}
$$

where $\theta'$ and $\theta'_v$ denote the thread-local parameters, $H$ the entropy, $\beta < 0$ is a real-valued hyper-parameter, $T$ is the time-step at the end of the update window, $t$ is the time-step at the start of the update window, and $R_i = \sum_{n \in \{T-1, \dots, i\}} \gamma^{\,T-1-n}\, r(s_n, a_n)$ is the total discounted reward from the time-step $i$ to the end of the update window. The hyper-parameter $\beta$ was set to $10^{-3}$ for all experiments performed in this research.

In the above equations the term $R_i - V_{\theta_v}(s_i)$ can be seen as the advantage $A_\pi(s_i, a_i)$, hence the name of the algorithm. A synchronous version of A3C called Advantage Actor-Critic (A2C) has been shown to perform as well as or better than A3C for some tasks. [40, p. 7]

2.2.10 Epsilon Greedy Policy

To encourage further exploration of the action space, an epsilon-greedy policy can be used. Given a policy $\pi$, the epsilon-greedy policy is given by
$$
a \sim \begin{cases} U(\mathcal{A}) & \text{if } u < \varepsilon \\ \pi & \text{otherwise} \end{cases} \qquad (2.19)
$$
where $u$ is a uniformly random sample drawn from $U[0, 1]$, and $\varepsilon \in [0, 1]$ is a number determining whether the actions should be uniformly sampled or sampled according to the policy $\pi$.
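A sketch of equation (2.19) for a discrete action space; the policy is represented here simply as a placeholder probability vector.

import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(policy_probs, epsilon):
    """Sample an action: uniform with probability epsilon, from the policy otherwise."""
    n_actions = len(policy_probs)
    if rng.random() < epsilon:
        return rng.integers(n_actions)             # uniform over the action space
    return rng.choice(n_actions, p=policy_probs)   # sample according to the policy

action = epsilon_greedy(np.array([0.7, 0.2, 0.1]), epsilon=0.05)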

2.2.11 Partial Observability

The presented algorithms do not explicitly model the belief states, observations and emission probabilities. They can nevertheless be applied to partially observable tasks by reinterpreting the value function and the action-value function as having the observation space as their domain [41]. With this interpretation, the neural networks estimating the functions implicitly model the ordinary value function and action-value function, the belief state probabilities, as well as the emission probabilities. However, for the model to correctly estimate the rewards, the observations must clearly contain enough information to infer the expected reward from the observation.

2.2.12 Reward Hacking

The reinforcement learning agent can sometimes find ways to optimize the reward such that the resulting actions lead to high reward even though human evaluators do not perceive the result as intended or good. This is referred to as "reward hacking". The issue of reward hacking is exacerbated when the reward function is not based on some objective criteria but instead designed to guide the agent towards the intended result. [8]

2.2.13 Inverse Reinforcement Learning

Reinforcement learning assumes that a reward function is provided; see sections 2.2.1 and 2.2.2 for a more formal description. However, for some tasks it may be difficult to design a reward function that assigns rewards to the performed actions in a way that encourages the intended behavior. The inverse reinforcement learning algorithm [42] circumvents this issue by training a model to estimate the reward function. A dataset $D$ of episodes $\sigma$ containing actions and observations is sampled using some policy. A human is then tasked with constructing a partial preorder on the episodes by annotating pairs of episodes with a preference annotation $\mu$ that indicates which episode is preferred.

The probability of the episode $\sigma^0$ being preferred over the episode $\sigma^1$ is estimated as
$$
\hat{P}_\phi[\sigma^0 \succ \sigma^1] = \frac{\exp\left( \sum_t \hat{r}_\phi(o^0_t, a^0_t) \right)}{\exp\left( \sum_t \hat{r}_\phi(o^0_t, a^0_t) \right) + \exp\left( \sum_k \hat{r}_\phi(o^1_k, a^1_k) \right)} \qquad (2.20)
$$
where $\hat{r}_\phi$ is the estimated reward function.

The parameters $\phi$ of the reward function estimate are then optimized by
$$
\phi \leftarrow \phi + \eta \sum_{(\sigma^0, \sigma^1, \mu) \in D} \left[ \mu(0)\, \nabla_\phi \log\left( \hat{P}_\phi[\sigma^0 \succ \sigma^1] \right) + \mu(1)\, \nabla_\phi \log\left( \hat{P}_\phi[\sigma^1 \succ \sigma^0] \right) \right] \qquad (2.21)
$$
where $\eta$ is the learning rate and $\mu$ is the preference annotation such that $\mu(i)$ is 1 if the episode $\sigma^i$ is preferred and 0 otherwise.
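Equation (2.20) is a two-way softmax over the summed estimated rewards of the two episodes; a small sketch follows, where the per-step reward estimates are hypothetical stand-ins for the outputs of a trained reward network.

import numpy as np

def preference_probability(rewards_0, rewards_1):
    """P-hat[sigma^0 preferred over sigma^1] as in eq. (2.20).

    rewards_0, rewards_1: per-step estimated rewards r_hat(o_t, a_t) of each episode.
    """
    s0, s1 = np.sum(rewards_0), np.sum(rewards_1)
    m = max(s0, s1)                                  # subtract the max for numerical stability
    return np.exp(s0 - m) / (np.exp(s0 - m) + np.exp(s1 - m))

p = preference_probability([0.1, 0.5, 0.2], [0.3, 0.1, 0.1])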

2.3 DJ mixes

A disc jockey (DJ) is a human tasked with selecting music and creating a continuous stream of audio called a "mix" by using a mixer. The DJ's audience can be listening to the radio, dancing on a night club's dance floor, or attending a social event or party.

A sub-goal of the DJ task is to sequence the available songs into a playlist. That is, given a collection of songs, reorder them into a list such that the sequence can flow smoothly from one song to another. This is different from the playlist generation task where both the songs used in the playlist and the order of the songs are chosen. [43]

The DJ creates transitions between the subsequent songs in the sequenced playlist by using the volume, tempo and equalizer controls of the DJ mixer. Some modern mixers also include functionality to add effects to the mixed audio such as low pass filters and flangers. However, only volume, tempo and equalizer settings are considered in this research. A transition in this report refers to a segment of the mix where the mix moves from playing one song to the next by decreasing the volume of the first and increasing the volume of the latter.

"Beat matching" is the process of aligning the tempo and beat positions of two or more songs. Beat matching is an elementary technique for DJs. Once the songs are beat matched the DJ can proceed to manipulate the volume and equalizer to create a transition between the songs without any discernible change to the underlying rhythmic structure of the mix. Ishizaki, Hoashi, and Takishima [4] showed that beat matching with minimal changes to the transitioning songs' tempo leads to lower degrees of user discomfort.

Cross-fading is the act of lowering the volume of the outgoing song and simultaneously increasing the volume of the incoming song. When cross-fading it is important for the DJ to maintain the amplitude of the output mix at a stable level to ensure that the transition is seamless. [13, pp. 2-4]

Mixers specialized for DJs are often equipped with a cross-fader that incorporates symmetric interpolation between songs loaded into two channels of the mixer. This interpolation is designed to ensure the conservation of output amplitude when the two songs are playing at equal amplitude, in order to easily create a seamless transition.

2.3.1 Musical Structure

Besides aligning the beats and adjusting the volume during a transition, the DJ must also consider the musical structure of the songs that are present in a mix. One such musical structure is the phrase. Phrases are subdivisions of a melodic line where each phrase is regarded as a separate musical entity. [44, p. 259] In order to preserve the musical structure of the mix, the DJ aligns the songs such that phrases in all songs involved in a transition begin at the same time. This is below referred to as phrase alignment.

The scheme governing the structure-at-large of a composition of music is in music theory referred to as the form of that piece [45, p. 277]. The form of a piece describes how musical sections are organized to constitute the piece. In modern music some common sections are "intro", "verse", "bridge", "chorus", and "outro". The DJ makes use of this segmentation in order to achieve a balance between the occurrences of the different sections.

2.3.2 Musical Key

The musical key or tonality of a piece of music refers to the set of pitch-classes which relate to the "main key" or "tonic" according to their "loyalty to the tonic" [45, pp. 386, 752]. A key can be in a mode such as the major mode or the minor mode. The tonic together with the mode determines the relation of all other pitch-classes to the tonic.

Mixing in key [14], or harmonic mixing, is the framework of selecting songs in a mix such that all songs present during a transition have related musical keys. One common interpretation of this framework is to select songs which are proximate in the circle of fifths. The relation between consecutive keys in the circle of fifths differs only by one additional sharp in the clockwise direction. [45, p. 150]

2.3.3 Critical Bands and Perceptual Scales

A critical band is a range of audio frequencies within which auditory masking between any two frequencies causes perceptual interference [46]. That is, the two frequencies cannot be perceived individually but are instead perceived as a combination tone.

The Bark scale by Zwicker [47] is a scale where frequencies are segmented into 24 critical bands. These bands are based on the natural division of the audible frequency range of the human ear. The bandwidths of each critical band (one Bark) are chosen to model the perceptual loudness of frequencies uniformly across the frequency spectrum.

The mel scale is a scale designed such that frequencies which are perceived as equally distant are spaced equally on the scale. [48]

2.3.4 Tonal Consonance

Plomp and Levelt [15] surveyed the perception of consonance between two simultaneously sounding frequencies. From this survey, a consonance curve was constructed.

From the dissonance-consonance curve described in [15], a measure of dissonance can be defined as follows. Extract the spectral peaks with corresponding magnitudes from the audio signal. For every pair of frequencies $f_1$ and $f_2$, the fluctuation rate $|f_2 - f_1|$ is computed. The critical bandwidth of the lower frequency is calculated according to the Bark scale, and the dissonance of the fluctuation rate is found according to the dissonance curve for the given critical bandwidth. The total dissonance is then computed as the sum of dissonances weighted by their loudness. This method is implemented in the Essentia library [49].

2.3.5 Music Emotion Recognition

Music emotion recognition (MER) is the task of recognizing the emotion in a piece of music. MER can be viewed as a classification problem or a regression problem where each piece of music is to be annotated with a set of emotions. [50]

In psychology, there are multiple models for classifying the experience of emotion, referred to as affect. Ekman [51] presents a model of six basic independent emotions. In this model each basic emotion maps to some neural system of the brain.

Many models converge on identifying (at least) two fundamental dimensions of valence and arousal [52, pp. 1-2]. Valence is a dimension of affect where emotions range from unpleasant to pleasant emotions. Arousal is a dimension ranging from quiet to active. Wundt [53] identifies the dimension of pleasure-displeasure as the most dominant in classifying affect. Posner, Russell, and Peterson [54] suggest that the basic emotion model fails to explain clinical findings and that the circumplex model with arousal and valence dimensions should be preferred.

2.3.6 MediaEval Database for Emotional Analysis in Music

Alajanki, Yang, and Soleymani [11] present the MediaEval Database for Emotional Analysis in Music (DEAM), a dataset for music emotion recognition consisting of 1802 excerpts and songs annotated with valence and arousal values. Valence and arousal values are provided on a per-second basis as well as as a total for each song. The data was collected using Amazon Mechanical Turk; participants graded the arousal and valence on a -10 to 10 point scale while listening to the songs. Since the data is unstable at the beginning of all songs, the first 15 seconds were removed from the annotations.

2.3.7 Cue Point Selection

In order to limit the number of possible mixes, a DJ operates by first selecting some points in the available songs at which the transition into those songs will begin. This is traditionally done by placing small stickers on the records. Digital DJ-systems such as software or hardware controllers are commonly equipped with cue-point functionality that allows the DJ to skip to one of these pre-selected cue points in every song.

Since these cue points are commonly used to alleviate the process of finding the position in a song from which to include it in a transition, cue points are often chosen to correspond with musical section or musical phrase boundaries.

2.3.8 Self-Similarity Based Novelty Score

By noting that at the boundary of musical sections we often see larger changes both in harmony and in amplitude, we can apply this insight to segment audio according to the amount of change, or "novelty", with respect to harmony and amplitude. [55, 6]

Foote [55] describes a way to compute the novelty curve $N(t)$ for a song by utilizing a self-similarity matrix $S$. Let $S_{i,j}$ represent some similarity metric between the time-segment $i$ and the time-segment $j$ in the song. By convolving the self-similarity matrix with the matrix $K$ around the diagonal of the self-similarity matrix we obtain the novelty curve

$$
N(t) = \sum_{i = -\frac{I}{2}}^{\frac{I}{2}} \sum_{j = -\frac{J}{2}}^{\frac{J}{2}} K^{ij}\, S_{t+i,\, t+j} \qquad (2.22)
$$
where $I$ is the kernel height, $J$ is the kernel width and the kernel weights are given by
$$
K^{ij} = \begin{cases} 1 & i \cdot j \geq 0 \\ -1 & i \cdot j < 0 \end{cases}
$$

Vande Veire and De Bie [6] present a method for calculating self-similarity matrices for harmony and amplitude.

Harmonic Similarity

To compute the self-similarity matrix for harmony, the Mel-Frequency Cepstral Coefficients (MFCC) are computed and the cosine similarity of the two MFCC vectors at time segment $i$ and time segment $j$ is stored in $S^h_{i,j}$. The harmonic novelty is then computed as
$$
N^h(t) = \sum_{i = -\frac{I}{2}}^{\frac{I}{2}} \sum_{j = -\frac{J}{2}}^{\frac{J}{2}} K^{ij}\, S^h_{t+i,\, t+j} \qquad (2.23)
$$

Amplitude Similarity

In order to compute the self-similarity matrix for amplitude, the root mean squared (RMS) amplitude is computed for each time segment. The self-similarity matrix is then given by $S^a_{i,j} = |x_i - x_j|$, where $x_k$ is the root mean squared amplitude for time segment $k$. The amplitude novelty curve $N^a(t)$ is computed analogously.

The combined novelty curve $N(t)$ is obtained by computing the geometric mean of the harmonic novelty and the amplitude novelty curves
$$
N(t) = \sqrt{N^h(t)\, N^a(t)} \qquad (2.24)
$$
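A sketch of the novelty computation in equations (2.22)-(2.24) for precomputed self-similarity matrices. The kernel size is a placeholder, and clipping negative values before the square root is an added safeguard rather than something specified in the thesis.

import numpy as np

def checkerboard_kernel(size):
    """K[i, j] = 1 where i*j >= 0 and -1 otherwise, with indices centred on 0."""
    idx = np.arange(size) - size // 2
    return np.where(np.outer(idx, idx) >= 0, 1.0, -1.0)

def novelty_curve(S, kernel_size=16):
    """Slide the checkerboard kernel along the diagonal of a self-similarity matrix S."""
    K = checkerboard_kernel(kernel_size)
    half = kernel_size // 2
    n = S.shape[0]
    N = np.zeros(n)
    for t in range(half, n - half):
        window = S[t - half:t + half, t - half:t + half]
        N[t] = np.sum(K * window)
    return N

def combined_novelty(S_harm, S_amp, kernel_size=16):
    """Geometric mean of harmonic and amplitude novelty, eq. (2.24)."""
    nh = np.clip(novelty_curve(S_harm, kernel_size), 0.0, None)
    na = np.clip(novelty_curve(S_amp, kernel_size), 0.0, None)
    return np.sqrt(nh * na)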

2.4 Autonomous DJ Systems

An autonomous DJ system is a system that without user input can sequence a collection of songs and mix them into a DJ mix.

Hirai, Doi, and Morishima [5] present an autonomous DJ system that mixes according to harmonic similarity. The songs are split into 8 second long segments and each segment is allocated to topics in a latent topic space. Segments that are allocated to similar topics are deemed similar in harmony, and the system creates cross-fade transitions between songs when the similarity is high.

The autonomous DJ system described by Vande Veire and De Bie [6] operates by first selecting cue points according to a novelty score based on changes in harmony and amplitude. Transitions are then created at one of the selected cue points according to three different strategies. All three strategies utilize a cross-fade as the way of changing the volume of the mixed songs in the transition.

Chapter 3

Method

3.1 System Overview

The Fully Autonomous DJ-Emulation using Deep Reinforcement Learning (DeepFADE) system was implemented to investigate whether deep reinforcement learning can be used to learn a mixing strategy that is not hard-coded. The DeepFADE system was implemented as a three-tier system:

1. Song Recommendation Tier
2. Cue Point Loading Tier
3. Mixing Tier

Each tier was implemented as an OpenAI Gym [56] environment constituting a deep reinforcement learning environment and a corresponding agent. Figure 3.1 shows an overview of the DeepFADE system. Black arrows indicate data that initializes the next tier in the system. Red arrows indicate actions performed within a tier.

Figure 3.1: DeepFADE System Overview. [Figure: Songs → Song Recommendation → current song + 4 songs to load → Cue Point Loading → 2 songs → Mixing (DJ Controller) → Generated Mix]

3.2 Pre-processing

Pre-processing was applied to the songs in the DEAM [11] dataset for training the models of the Cue Point Loading and Mixing tiers.

The 1802 songs in the dataset were converted to 44.1 kHz, 16-bit signed linear PCM wav format using FFmpeg [57]. In order to reduce the required computation during the training of the models, all songs were further resampled to $44.1 \times 1.01^k$ kHz for integers $k \in [-10, 10]$ using the Essentia [49] library. A numpy 32-bit floating-point [58] array containing all of the resampled waveforms was saved to disk for each song.

The positions of the beats in the songs were estimated using the Essentia [49] library with a multi-feature beat tracking algorithm [59], and the beat positions in seconds were stored to disk.

To determine cue points for each song, the novelty curves with respect to harmony and amplitude were computed based on the scheme presented in section 2.3.8. The self-similarity matrices with respect to the root mean square amplitude and the mel-frequency cepstrum coefficients were computed. The amplitude novelty curve $N_a(t)$ and the harmonic novelty curve $N_h(t)$ were computed from the root mean square amplitude and mel-frequency cepstrum coefficient self-similarity matrices respectively. The combined novelty curve $N_c(t)$ was calculated as the geometric mean of the amplitude novelty curve $N_a(t)$ and the harmonic novelty curve $N_h(t)$:
$$
N_c(t) = \sqrt{N_a(t)\, N_h(t)} \qquad (3.1)
$$
The cue points were then selected as the 4 most prominent peaks of the combined novelty curve, considering the audio from the beginning of each song up to 20 seconds before its end.

A mel-frequency spectrogram was pre-computed for 30 seconds of audio from each cue point in every song for later use in the Cue Point Loading tier of the system. If the 30-second window extended past the end of the song, the audio was padded with zeroes such that all spectrograms were of the same length.
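A sketch of part of this pre-processing for a single file, assuming the Essentia Python bindings are available. The file paths are placeholders, and the exact parameters used in DeepFADE are not specified here.

import numpy as np
import essentia.standard as es

SR = 44100
song_path = "song.wav"                       # hypothetical input file

audio = es.MonoLoader(filename=song_path, sampleRate=SR)()

# Beat positions (in seconds) using the multi-feature beat tracker.
bpm, beats, _, _, _ = es.RhythmExtractor2013(method="multifeature")(audio)
np.save("song_beats.npy", beats)

# Pre-render tempo-adjusted versions of the waveform for k in [-10, 10];
# resampling by a factor of 1.01**k corresponds to a relative speed change.
for k in range(-10, 11):
    resampled = es.Resample(inputSampleRate=SR,
                            outputSampleRate=SR * 1.01 ** k)(audio)
    np.save(f"song_speed_{k}.npy", resampled.astype(np.float32))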

3.3 Song Recommendation Tier

The song recommendation tier of the system is responsible for selecting the song to start the mix with, as well as four candidate songs that can be chosen to play as the next song.

For this research, the policy of the song recommendation tier was hard-coded to sample both the initial song and the four candidate songs from the collection of songs according to a uniform distribution over all songs in the provided collection. The recommendation tier does not perform the playlist sequencing directly; instead, it provides the cue point loading tier with a smaller subset of the songs such that the sequencing can be performed in that tier.

3.4 Cue Point Loading Tier

The cue point loading tier determines when to load a candidate song starting at one of the cue points of that song. The cue points are selected as the four highest peaks in the combined novelty curve of the root mean squared amplitude and mel-frequency cepstrum coefficient similarity matrices. A mel-frequency spectrogram is created from 30 seconds of audio starting at each cue point in the candidate songs.

The policy of the cue point loading tier was trained using the implemented algorithms presented in sec. 3.10. Observations provided to the cue point loading agent were given by the 16 spectrograms corresponding to each of the 4 cue points in the 4 candidate songs, one additional spectrogram of 30 seconds of audio from the current offset in the currently playing song, as well as one more spectrogram whose values are all set to the number of actions performed since the last transition.

The available actions in the cue point loading tier are 16 actions to load the respective cue points, as well as one action to not load any cue point. The rewards are given by eq. 3.2:
$$
r = \begin{cases} R(\sigma, p) & \text{if song } \sigma \text{ is loaded at cue point } p \\ 100\left(1 - \left(\frac{\tau}{20}\right)^2\right) & \text{if no cue point is loaded} \end{cases} \qquad (3.2)
$$
where $\tau$ is the time in seconds since the last transition and $R(\sigma, p)$ is the total reward given by the DJ simulation environment of the mixing tier described in sec. 3.5. If no cue point is loaded, the currently playing song is played for 2 seconds with no changes to the DJ-controller state. After a cue point has been loaded and the transition has completed, the currently playing song is set to the song from which the cue point was loaded. The offset of the song is maintained at the time offset it had when the transition of the mixing tier ended.
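A sketch of the reward in equation (3.2); the function and argument names are illustrative, and mixing_tier_reward stands in for the total reward returned by the mixing-tier environment.

def cue_point_loading_reward(loaded, seconds_since_transition, mixing_tier_reward=None):
    """Reward for the cue point loading agent, following eq. (3.2).

    loaded: True if a cue point was loaded this step (one of the 16 load actions).
    seconds_since_transition: tau, the time in seconds since the last transition.
    mixing_tier_reward: R(sigma, p), the total reward of the resulting mixing episode.
    """
    if loaded:
        return mixing_tier_reward
    tau = seconds_since_transition
    return 100.0 * (1.0 - (tau / 20.0) ** 2)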

3.5 Mixing (DJ-Controller) Tier

The mixing tier performs the main work of the DeepFADE system. The tier is implemented as a simulation of a two-channel DJ-controller with a three-band equalizer per channel. The policy of this tier was also trained using the algorithms presented in sec. 3.10. Actions were modeled as discrete modifications to the controller. The following actions were provided for both channels.

• Increase Volume [Inc-V]
• Decrease Volume [Dec-V]
• Increase Tempo [Inc-T]
• Decrease Tempo [Dec-T]
• Increase Equalizer High Band [Inc-H]
• Decrease Equalizer High Band [Dec-H]
• Increase Equalizer Mid Band [Inc-M]
• Decrease Equalizer Mid Band [Dec-M]
• Increase Equalizer Low Band [Inc-L]
• Decrease Equalizer Low Band [Dec-L]

An additional action that made no modification to the DJ-controller was also added yielding a total of 21 available actions. The agent was allowed to perform 10 actions per second of generated audio.

3.5.1 Knobs and Faders

Each channel is controlled by five knobs and faders: a volume fader, a tempo fader, and three knobs controlling the different bands of the equalizer. The volume fader ranges from 0 to 100 with changes in increments of 10 per action. The tempo fader ranges from -10 to 10 with a neutral tempo of 0. The audio of the channel is resampled to have a relative speed of 1.01^F compared to the original tempo of the song, where F is the value of the tempo fader. The three equalizer knobs represent the decibel gain of the respective bands.

Table 3.1: Knobs and faders for each channel

Knob/Fader     Min   Max   Increase   Decrease
Volume           0   100        +10        -10
Tempo          -10    10         +1         -1
High dB Gain   -10    10         +1         -1
Mid dB Gain    -10    10         +1         -1
Low dB Gain    -10    10         +1         -1
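A minimal sketch of the per-channel controller state implied by table 3.1 (the class and method names are illustrative, not from the thesis code, and the starting values are assumptions):

```python
# Bounds and step sizes from table 3.1: (min, max, step).
BOUNDS = {
    "volume": (0, 100, 10),
    "tempo": (-10, 10, 1),
    "eq_high": (-10, 10, 1),
    "eq_mid": (-10, 10, 1),
    "eq_low": (-10, 10, 1),
}

class ChannelState:
    def __init__(self, volume=0):
        self.values = {"volume": volume, "tempo": 0, "eq_high": 0, "eq_mid": 0, "eq_low": 0}

    def apply(self, control, direction):
        # direction is +1 for an "Increase" action and -1 for a "Decrease" action.
        lo, hi, step = BOUNDS[control]
        new_value = self.values[control] + direction * step
        if new_value < lo or new_value > hi:
            return False  # the bound would be exceeded; see the illegal action penalty (eq. 3.16)
        self.values[control] = new_value
        return True
```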

3.5.2 Audio Generation

At each time-step, the DJ-controller simulator reads a frame of audio at the time offsets of the two channels. The pre-calculated waveform corresponding to the current tempo is used when reading the audio data.

IIR filters based on Bristow-Johnson [60], implemented in the Essentia library [49], modeling a low-shelf at 1 kHz with slope 0.7, a peaking equalizer at 4.5 kHz with a bandwidth of 3 octaves, and a high-shelf at 8 kHz with a slope of 0.7 were used to equalize the audio. The filters are cascaded through function composition, yielding the composed equalization in equation 3.3.

eq = high(knobhigh) ◦ mid(knobmid) ◦ low(knoblow) (3.3)
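The cascade in eq. 3.3 can be sketched as plain function composition of per-band filters. The snippet below uses a Bristow-Johnson peaking-EQ biquad applied with scipy for all three bands purely for brevity; the actual system uses Essentia's IIR filters with low-shelf and high-shelf designs for the outer bands, so the filter design below is only illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def rbj_peaking(fs, f0, gain_db, bandwidth_oct):
    # Peaking-EQ biquad coefficients from the Bristow-Johnson Audio EQ Cookbook.
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) * np.sinh(np.log(2.0) / 2.0 * bandwidth_oct * w0 / np.sin(w0))
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

def band(fs, f0, gain_db, bw=1.0):
    b, a = rbj_peaking(fs, f0, gain_db, bw)
    return lambda audio: lfilter(b, a, audio)

def equalize(audio, knob_low, knob_mid, knob_high, fs=44100):
    # eq = high o mid o low (eq. 3.3), applied right-to-left as function composition.
    low = band(fs, 1000.0, knob_low)
    mid = band(fs, 4500.0, knob_mid, 3.0)
    high = band(fs, 8000.0, knob_high)
    return high(mid(low(audio)))
```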

3.5.3 Mixing and Volume

After applying the equalization to the individual channels, the audio is weighted according to the current volume. The volume is converted to the range [0, 1] using the equation

f(vol) = 10^((0.8·vol − 80)/20)   (3.4)

This conversion balances the required precision more uniformly over the volume range, since the converted volume is a better approximation of perceived loudness than scaling the amplitude linearly. The final mixed audio is found as the sum of the two channels weighted by their converted volumes.
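A small sketch of the volume conversion and channel summation; the sign of the constant in the exponent follows the reconstruction of eq. 3.4 above, chosen so that the fader maps into [0, 1]:

```python
import numpy as np

def fader_to_gain(vol):
    # eq. 3.4: map the 0-100 fader to a linear gain in roughly [1e-4, 1] via a
    # decibel-like curve, approximating perceived loudness better than a linear scale.
    return 10.0 ** ((0.8 * vol - 80.0) / 20.0)

def mix_channels(audio0, audio1, vol0, vol1):
    # The master output is the sum of the equalized channels weighted by their gains.
    return fader_to_gain(vol0) * np.asarray(audio0) + fader_to_gain(vol1) * np.asarray(audio1)
```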

3.5.4 Interpolation

To reduce the number of spikes in the waveform causing high-frequency artifacts, an additional window of 200 samples (approximately 4.5 ms) is read and equalized at the end of the audio corresponding to the current action. For the next action, the first 200 samples are linearly interpolated with the 200-sample window from the last action.
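A sketch of the linear cross-fade between consecutive action frames (the names and the exact stitching convention are assumptions; the thesis only specifies the 200-sample overlap):

```python
import numpy as np

def stitch_frames(prev_tail, new_frame, overlap=200):
    # Linearly interpolate the first `overlap` samples of the new frame with the
    # extra `overlap` samples read and equalized at the end of the previous action,
    # reducing spikes that cause high-frequency artifacts.
    ramp = np.linspace(0.0, 1.0, overlap, endpoint=False)
    head = (1.0 - ramp) * prev_tail + ramp * new_frame[:overlap]
    return np.concatenate([head, new_frame[overlap:]])
```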

3.6 DJ-Controller Rewards

Multiple heuristic rewards were designed in order to reward the DJ-controller agent for better-sounding mixes. The total reward afforded to the DJ-controller was given by combinations of the rewards presented in this section.

A volume stability reward was introduced to penalize drastic changes in the volume of the generated mix. The reward is weighted by the absolute difference between the average RMS across the episode and the average RMS for

the most recent second of audio in the generated mix.

rVS(st, at) = e^(−5·|g(st) − c(st)|)   (3.5)

where the global average volume g and the current volume c are given by

g(st) = (1/t) · Σ_{k=0}^{t} RMS(sk)

c(st) = (1/10) · Σ_{k=t−10}^{t} RMS(sk)
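Expressed in code, with the RMS history kept per step (10 steps correspond to the most recent second of audio at 10 actions per second); a sketch, not the thesis implementation:

```python
import numpy as np

def volume_stability_reward(rms_history):
    # eq. 3.5: compare the episode-average RMS with the average RMS of the most
    # recent ten steps (one second at 10 actions per second).
    g = float(np.mean(rms_history))
    c = float(np.mean(rms_history[-10:]))
    return float(np.exp(-5.0 * abs(g - c)))
```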

A version of the volume stability reward comparing the output master volume with a pre-determined constant was designed to attempt to maintain the volume at a consistently high level. The reward is given by eq. 3.6:

rCVS(st, at) = e^(−5·|0.5 − c(st)|)   (3.6)

where c is computed in the same fashion as for the volume stability reward (eq. 3.5). The constant 0.5 was chosen to encourage mixes with a loud root mean square amplitude.

To encourage more consonant and thus more pleasant-sounding mixes, a tonal consonance reward was designed. The tonal dissonance of the audio was computed as the auditory roughness of the spectral peaks using the Essentia [49] library, based on the tonal consonance model by Plomp and Levelt [15]. The dissonance is weighted by the loudness according to the Bark scale [47]. The tonal consonance reward is given by eq. 3.7:

rTC(st, at) = 1 − Dissonance(st)   (3.7)

Sudden changes in tempo are also undesired in the generated audio. To limit the number of such changes, a tempo change penalty (eq. 3.8) was added that penalizes modifications to the tempo of the two channels of the mixer.

rTP(st, at) = { vol0(st)   if at = Inc-T0
              { vol0(st)   if at = Dec-T0
              { vol1(st)   if at = Inc-T1        (3.8)
              { vol1(st)   if at = Dec-T1
              { 0          otherwise

In order to penalize misalignment of beats and loss of rhythmic consistency, a reward based on the beat similarity metric by Hirai, Doi, and Morishima [5] was added to encourage mixes where the beat positions of the two channels coincide. The reward is computed as

rBEAT(st, at) = 1/(1 + d(beats0(st), beats1(st)))   (3.9)

where d(bs0, bs1) is the difference in beats, computed as

d(Nil, bs1) = 0

d(bs0, Nil) = 0

d(b0 :: bs0, b1 :: bs1) = { b1 − b0 + d(bs0, bs1)   if b0 < b1
                          { b0 − b1 + d(bs0, bs1)   if b1 < b0
                          { d(bs0, bs1)             if b0 = b1

for beat positions bs0 and bs1 given in seconds in the output mix for channel 0 and channel 1 respectively. Here Nil denotes the empty list and the operator :: denotes the cons operator of a list, separating the first element of the list from the remaining, potentially empty, list. The most recent 8 seconds of the output mix are used when calculating the beat similarity reward for an action. Hence the rewards are delayed from when the actions leading to the songs being beat-matched were performed.
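A literal reading of the recursive definition of d and the beat similarity reward (eq. 3.9), sketched with Python lists standing in for the cons lists:

```python
def beat_distance(bs0, bs1):
    # d(Nil, bs1) = d(bs0, Nil) = 0; otherwise accumulate the absolute gap between
    # the heads of the two lists and recurse on both tails.
    if not bs0 or not bs1:
        return 0.0
    return abs(bs0[0] - bs1[0]) + beat_distance(bs0[1:], bs1[1:])

def beat_similarity_reward(beats0, beats1):
    # eq. 3.9, computed over the beat positions of the most recent 8 seconds of the mix.
    return 1.0 / (1.0 + beat_distance(beats0, beats1))
```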

The mixing rewards encourage mixes where the songs in both channels are audible at the same time. Two variants of the mixing reward were considered. The multiplicative mixing reward (eq. 3.10) encourages higher volumes in both channels and is given by the product of the volumes of the channels after division by 100 to normalize the reward to the range [0, 1].

rMMIX(st, at) = (vol0(st)/100) · (vol1(st)/100)   (3.10)

The additive mixing reward (eq. 3.11) encourages the sum of the volumes to be close to 100 (full volume for one channel).

rAMIX(st, at) = 0.8 − |100 − (vol0(st) + vol1(st))|/100   (3.11)

When creating a transition, channel 0 is assumed to play the outgoing song and channel 1 is assumed to play the incoming song. This is because when an episode is initialized in the mixing environment the volume of channel 1 is set to 0. A reward is introduced to encourage transitioning between the two songs loaded into the DJ-controller. The transition reward encourages the agent to

increase the volume of channel 1, which is assumed to start with a lower volume than channel 0 at the beginning of the episode. The reward also encourages a decrease in the volume of channel 0 such that only the incoming song plays at the end of the transition.

Multiple variants encouraging transitions were considered. The simple transition reward (eq. 3.12) rewards a higher volume for the incoming song and penalizes volume for the outgoing song. The constant 4 weights the importance of increasing the volume of the incoming song more heavily, to avoid reducing the output master volume too quickly.

rST(st, at) = 4f(vol1(st)) − f(vol0(st))   (3.12)

where f denotes the volume conversion described in section 3.5.3.

The delta transition reward (eq. 3.13) considers the slope of the volume histories for the two channels. The reward is shaped to encourage a volume increase of 10 units per second for channel 1 and a decrease in volume of 10 units per second for channel 0.

rDT(st, at) = (1/9) Σ_{k=t−80}^{t} (1 − |10 − (vol1(sk) − vol1(sk−10))| / 10)
            + (1/9) Σ_{k=t−80}^{t} (1 − |−10 − (vol0(sk) − vol0(sk−10))| / 10)   (3.13)
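Under the reconstruction of eq. 3.13 above (the exact normalization is an interpretation), the delta transition reward can be sketched as follows, with volume histories indexed per action step and assumed to contain at least 91 entries:

```python
def delta_transition_reward(vol1_hist, vol0_hist):
    # Reward volume slopes of +10 units/second for channel 1 and -10 units/second
    # for channel 0, measured over one-second (10-step) differences across the
    # last 8 seconds (steps t-80 .. t).
    total = 0.0
    for hist, target in ((vol1_hist, 10.0), (vol0_hist, -10.0)):
        for k in range(-81, 0):
            delta = hist[k] - hist[k - 10]
            total += (1.0 - abs(target - delta) / 10.0) / 9.0
    return total
```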

The simple transition reward and the delta transition reward suffer from the rewards being delayed from the action that caused the change in volume. Hence, the instant transition reward (eq. 3.14) was introduced. Reward is given to the agent if a relevant volume change action is taken, under the condition that the action causes a modification to the controller state. The instant transition reward encourages an increase in the volume of the incoming song and a decrease in the volume of the outgoing song.

rIT(st, at) = { 1   if at = Inc-V1 and vol1(st) < 100
              { 1   if at = Dec-V0 and vol0(st) > 0      (3.14)
              { 0   otherwise

The mixing transition reward (eq. 3.15) is given by a combination of the delta transition reward and the additive mixing reward and was designed to encourage transitions of reasonable length while maintaining a consistently loud output volume.

rMT(st, at) = { rDT(st, at) · rAMIX(st, at)           if rAMIX(st, at) > 0
              { rDT(st, at) / (rAMIX(st, at) + 0.5)   if rAMIX(st, at) ≤ 0      (3.15)

If an action causes a knob or fader to exceed its bounds, the agent is penalized so that other actions are considered instead. The value range of each knob and fader is given in table 3.1. The illegal action penalty is given by eq. 3.16.

rAP = { −0.1   if the value exceeds its bounds
      { 0      otherwise                              (3.16)

To encourage all frequency bands to have a similar amplitude, the frequency band variance penalty (eq. 3.17) was introduced. The reason for this reward is to ensure that all parts of the frequency spectrum are utilized.

rBANDS(st, at) = e^(−0.2·Var(bands(st)))   (3.17)

where bands(s) is a vector of 40 mel-frequency band amplitudes.

In case the episode is terminated before a complete transition, a loss penalty is added to the reward of the last action. A transition is determined to be complete if the volume of channel 1 is 100, the volume of channel 0 is 0, and there are at least 0.1 seconds of audio left in both songs. Conversely, if the volume of both channels is below 20 or there are less than 0.1 seconds of audio left in either of the channels, the episode is terminated early.

rLP(st, at) = { −(1 + (vol0(st) − vol1(st))/100)   if st is terminated early
              { 0                                  otherwise                  (3.18)

In contrast, an episode terminates successfully if the volume of channel 0 is 0 and the volume of channel 1 is 100. In such cases the win bonus reward (eq. 3.19) is awarded to the agent to encourage more successful transitions.

rWB(st, at) = { 1   if the episode terminated successfully
              { 0   otherwise                                (3.19)

The total reward afforded to the mixing environment agent was given by a combination of the individual rewards. See sections 4.3 and 4.4 for the reward functions used during training of the mixing environment.


Figure 3.2: Residual Convolution Layer

3.7 Neural Network Architectures

A convolutional neural network inspired by the WaveNet [61] architecture was used when training the mixing tier of the DeepFADE system. A gated residual convolution layer was created, composed of dilated 1-dimensional convolutions as well as a strided convolution. Figure 3.2 shows the component layers of the gated residual convolution layer.

3.8 A3C Network

The architecture of the neural network used when training the A3C policy and value function is shown in figure 3.3. The network splits the observation into three channels of audio data and a vector of DJ-controller state. A stack of 15 gated residual layers of increasing dilation size is applied to the audio data. The skip connections of the residual layers are summed and then followed by three non-residual 1-dimensional convolution layers with rectified linear unit activation functions. The three convolution layers are then followed by a two-layer multi-layer perceptron. An equivalent multi-layer perceptron is also applied to the DJ-controller state and the outputs of the two perceptrons are added. The same vector is passed to two network modules, one for the actor and one for the critic. A two-layer multi-layer perceptron with a final softmax activation is used in the actor module to predict the policy π(s). A two-layer multi-layer perceptron with no final activation is used in the critic module to predict the value function V(s). To reduce the memory required to run the network when training in the A3C mixing environment, a simplified version of the network was used.


Figure 3.3: A3C Network Architecture


Figure 3.4: Simplified A3C Network

Figure 3.4 shows the simplified network, where the gated residual convolution layer stack is removed, yielding a more traditional convolutional neural network. While the trainable parameters of the residual convolutions are relatively few due to the small kernel sizes, performing back-propagation requires storing all intermediate audio data for each of the residual layers, causing a significant increase in required memory.
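A sketch of the simplified actor-critic network in Keras, following the dense layer sizes shown in figure 3.4 (the convolution filter counts, kernel sizes, and input shapes are assumptions based on the observation description in sec. 5.4, not the exact thesis configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_simplified_a3c_network(audio_len=4410, audio_channels=3, state_dim=170, n_actions=21):
    audio_in = keras.Input(shape=(audio_len, audio_channels), name="audio")
    state_in = keras.Input(shape=(state_dim,), name="controller_state")

    # Three plain Conv1D + ReLU layers replace the gated residual stack.
    x = audio_in
    for filters in (32, 32, 32):
        x = layers.Conv1D(filters, kernel_size=9, strides=4, activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(80, activation="relu")(x)
    x = layers.Dense(20, activation="relu")(x)

    # Equivalent two-layer perceptron on the DJ-controller state, merged by addition.
    s = layers.Dense(80, activation="relu")(state_in)
    s = layers.Dense(20, activation="relu")(s)
    h = layers.Add()([x, s])

    # Actor head: softmax over the 21 actions. Critic head: scalar value estimate.
    policy = layers.Dense(128, activation="relu")(h)
    policy = layers.Dense(n_actions, activation="softmax", name="policy")(policy)
    value = layers.Dense(128, activation="relu")(h)
    value = layers.Dense(1, name="value")(value)

    return keras.Model([audio_in, state_in], [policy, value])
```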

3.9 Dueling DQN Network

The network architecture used when training the Dueling DQN is similar to the architecture used for the A3C algorithm. However, the Dueling DQN network only outputs a single vector: the action-values Q(s, a) for each of the 21 actions a. The architecture used for the Dueling DQN network is shown in figure 3.5.


Figure 3.5: Dueling DQN Network Architecture


Figure 3.6: Overview of the A3C algorithm

3.10 Algorithms

3.10.1 A3C Implementation

The A3C algorithm was implemented in Python with the Keras [62] library using Tensorflow [63] as the backend engine. The algorithm is initialized with a function that creates an OpenAI Gym environment when called and a Keras model representing the global network, which takes observations as inputs and produces predictions of policies and values for each of the observations. See figures 3.3 and 3.4 for examples of such networks.

The implementation of the A3C algorithm is multi-threaded and begins by creating an OpenAI Gym environment for each of the N threads using the provided function. A model is created where one input per trainable parameter is added to the global network such that the network can be updated using pre-computed gradients. Then the global network is cloned N times and the cloned networks are added to the Tensorflow computation graph. The weights are copied from the global network to all newly created networks. The N networks are initialized for multi-threaded usage with the method _make_predict_function. A model that computes the gradients with respect to all trainable parameters given observations, actions, and discounted rewards is created for each of the networks.

Once all networks have been initialized, N threads are spawned, each passed its corresponding network and environment. An overview of the algorithm is given in figure 3.6. In each thread, the algorithm takes 40 steps through the environment by sampling actions with numpy [58] according to the predicted policy distribution, or fewer if a terminal state is reached. The total discounted rewards are calculated from the rewards returned by the environment. The gradients ∇θ are computed given the actions, observations, and discounted rewards. Within a critical section with mutual exclusion, the gradients are then applied to the global network using the update model, and the new weights are copied from the global network to the thread-local network. The lists of actions, observations, and rewards are cleared, and if the state was terminal the environment is reset to the beginning of a new episode. The thread then continues the loop until the desired number of steps through the environment have been trained on.
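A heavily simplified sketch of the per-thread update cycle (plain SGD stands in for the actual Keras optimizer, and the gradient computation itself is assumed to be provided by the model described above):

```python
import threading
import numpy as np

global_lock = threading.Lock()   # critical section protecting the global network

def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    # Total discounted rewards for a rollout of up to 40 steps; `bootstrap` is the
    # critic's value estimate of the last state when the rollout is not terminal.
    returns, acc = [], bootstrap
    for r in reversed(rewards):
        acc = r + gamma * acc
        returns.append(acc)
    return list(reversed(returns))

def apply_and_sync(global_weights, gradients, learning_rate=0.01):
    # Apply the thread-local gradients to the global parameters under mutual
    # exclusion, then return a copy of the new weights for the thread-local network.
    with global_lock:
        for w, g in zip(global_weights, gradients):
            w -= learning_rate * g
        return [np.copy(w) for w in global_weights]
```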

3.10.2 Dueling DQN Implementation

The implementation of Dueling DQN from the keras-rl [64] library was used. The implementation is single-threaded and stores the database of state transition tuples in an in-memory sequential storage. The implementation modifies the network shown in figure 3.5 by removing the last layer of the network and adding a module that computes the advantages A(s, a) and value V(s) using the average offset calculation described in section 2.2.8.
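The average-offset aggregation used by the dueling architecture combines the value and advantage streams as Q(s, a) = V(s) + (A(s, a) − mean_a A(s, a)); a minimal numpy sketch:

```python
import numpy as np

def dueling_q_values(value, advantages):
    # value: shape (batch, 1); advantages: shape (batch, n_actions).
    advantages = np.asarray(advantages, dtype=float)
    return np.asarray(value, dtype=float) + (advantages - advantages.mean(axis=-1, keepdims=True))
```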

3.11 Training

The DEAM [11] dataset was used to train the mixing environment. The dataset was chosen due to its variety of music styles, its relatively large size (1802 songs), and with the aim of incorporating emotional analysis into the mixing strategy. However, since simpler rewards such as volume stability and beat similarity were not satisfied, the aim of incorporating the emotional content was discarded.

The cue point loading tier was also trained on the DEAM dataset. During the training of this tier, the policy determining the actions of the mixing tier was fixed to the policy trained using the A3C algorithm.

3.12 Evaluation

The Mixotic [9, 10] dataset, consisting of 10 mixes composed of 723 songs, was fetched. The same pre-processing as for the DEAM [11] dataset was applied to these songs. The pre-processing steps are detailed in section 3.2. 7 of the 10 mixes were sampled from the dataset without repetition.

The DeepFADE system was used to generate mixes. 7 mixes were generated using the trained cue point loading tier and the trained mixing tier. 7 more mixes were generated with the trained cue point loading tier but with the policy of the mixing agent fixed, in order to construct a baseline for the mixing strategy. The fixed policy was set to increase the volume of the incoming song with a probability of 10%, decrease the volume of the outgoing song with a probability of 10%, and assign a uniform probability to the remaining 19 actions.

For each of the mixes, a 120-second clip was extracted. The starting offset was sampled uniformly in the range 0 to 120 seconds before the end of the mix. 120 seconds of audio was loaded from each mix and converted to a monaural waveform, and the resulting waveform was stored to disk. The 21 audio clips were combined into a single dataset and permuted according to a random shuffle. The permutation was stored so as to recover the original order.
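A small sketch of the shuffling step, showing how storing the permutation allows the original order of the 21 clips to be recovered (the file names and the fixed seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                     # fixed seed, so the permutation is reproducible
clips = [f"clip_{i:02d}.wav" for i in range(21)]   # placeholder clip identifiers

perm = rng.permutation(len(clips))                 # stored permutation defining the survey order
shuffled = [clips[i] for i in perm]

# Recovering the original order: entry j of the shuffled list came from position perm[j].
recovered = [None] * len(clips)
for j, i in enumerate(perm):
    recovered[i] = shuffled[j]
assert recovered == clips
```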

3.12.1 User Survey

A user survey was conducted where the participants were asked about their experience with DJing and asked to evaluate the 21 audio clips. The generation methods of the 21 clips were not revealed to the participants. For each of the 21 audio clips, the participants were asked to grade the clip with respect to comfort, groove [65], enjoyability, naturalness, and noticeability of transitions between songs according to a 5-grade scale. The participants were also asked whether the audio clip was human-generated or computer-generated. The questions asked for each of the audio clips in the survey were:

1. Do you think this mix was generated by a human or by a computer?
2. How comfortable was it to listen to this mix?
3. How movement inducing was this mix?
4. How enjoyable was it to listen to this mix?
5. How natural do you perceive this mix?
6. How noticeable were the transitions between the songs in this mix?

The questions relating to the participants’ DJ experience were

1. Do you have DJ experience? (professional or otherwise)
2. Do you have experience with audio engineering, such as mastering or mixing?
3. When did you last listen to a DJ mix?

The first two questions allowed for a binary choice of Yes or No. For the third question the participants were asked to choose one of the following answers: I have never listened to a DJ mix, More than 2 years ago, Within the last 24 months, Within the last 12 months, Within the last 3 months, Within the last 4 weeks, Within the last week.

Chapter 4

Results

4.1 Cue Points

Cue points were selected from the combined novelty curves calculated for each of the songs in the DEAM [11] dataset. 12 songs were sampled from the dataset without replacement. Figure 4.1 shows the audio waveform of the 12 sampled songs with the cue points indicated as vertical red lines. For the sampled songs, the selected cue points coincide with phrase boundaries. None of the sampled songs had a cue point assigned to the beginning of the song.

4.2 Training

The cue point loading tier and the mixing tier of the system were trained using the Dueling DQN and A3C algorithms. The training was performed on a single computer equipped with a quad-core 2.8 GHz Intel Xeon E5-1603 v4 CPU, four 8GB DDR4 memory cards, and two GPUs: a Quadro P2000 with 5GB GDDR5 GPU memory and a Quadro K620 with 2GB DDR3 GPU memory.

4.3 Dueling DQN

The Dueling DQN experiment was run for 1344952 training steps, took 335104 seconds (93 hours), and was stopped due to the steady decrease in rewards, which indicated that the model was no longer improving. This corresponds to approximately 4 training steps per second.


Figure 4.1: Cue points of the 12 songs sampled from the DEAM dataset. The cue points are visualized by red vertical lines.

The reward function used to train the Dueling DQN network is given by eq. 4.1. This function was chosen to encourage an approximately 10-second long transition with the delta transition reward rDT. The transitions were further encouraged by the win bonus rWB, where the agent is awarded rewards when an episode ends due to channel 0 having volume 0 and channel 1 having volume 100. The term 5rVS(s, a)rAMIX(s, a)rTC(s, a) was included to ensure that the master volume was at a stable level (rVS) while also having a low degree of dissonance throughout the transition (rTC). The additive mixing reward rAMIX was included to encourage the volumes of the channels to sum to 100. Given the volume conversion presented in sec. 3.5.3, this approximates maintaining the master volume at the maximum volume of only one channel playing. The product of the individual rewards in the term was used to require the agent to satisfy all three aspects in order to obtain high rewards.

r(s, a) = 5rVS(s, a)rAMIX(s, a)rTC(s, a) + rDT(s, a) + 0.4rWB(s, a) (4.1)

The rewards increase during the first 2000 episodes, as shown in figure 4.2a, and then slowly decrease over the remaining episodes to a reward similar to the value at the start of the training session. Figure 4.2a shows the moving average reward over 40 episodes as the solid blue line, and the shaded area is given by a 95% confidence interval calculated through bootstrap sampling.

(a) Episode Rewards (b) Loss

Figure 4.2: Episode rewards and loss during training with the Dueling DQN algorithm. The training was performed for 4208 episodes corresponding to 1.3M training steps.

(a) Mel spectrogram (b) Performed Actions

Figure 4.3: Spectrogram and actions when generating a mix with the Dueling DQN network after training.

The changes in rewards are also reflected in the loss curve (fig. 4.2b), where the loss is monotonically decreasing for the first 2000 episodes and then increasing for the rest of the training. The loss is computed on the episodes generated with the training set, for which the rewards are also presented. The reason for not including an additional validation set when evaluating the loss is that doing so would imply generating an extra episode using the validation set for each episode generated with the training set. This additional episode would require approximately twice the computational cost during training, and due to the limited available computational resources a validation set was not utilized.

The trained policy was evaluated by generating mixes on a subset of 58 songs sampled from the DEAM dataset. The cue point loading tier was set to

randomly sample the 17 available actions according to a uniform distribution. The resulting mixing policy is clearly sub-optimal, only increasing the volume of the incoming song, as shown in figure 4.3b. Despite the simplistic mixing policy, the root mean square amplitude throughout the generated mix is relatively stable, as shown in figure 4.3a. However, the policy takes no consideration of beat matching, and the resulting mixes consisted of the two songs in each transition playing with misaligned or “clashing” beats.

4.4 A3C

4.4.1 Trained on 3438 Episodes

The Asynchronous Advantage Actor-Critic algorithm was used to train a model for 3438 episodes. The training ran for 71055 seconds, corresponding to an average of 20.7 seconds per episode. The number of steps trained on is unavailable, and since some episodes may terminate early, the number of training steps per second cannot be calculated.

The reward function used during training is given by eq. 4.2. The transition reward was modified to not award high rewards for a transition when the volume is changed drastically, by multiplying the delta transition reward rDT with the volume stability reward rVS. The beat similarity reward rBEAT was introduced to strongly encourage beat alignment of the songs during transitions.

r(s, a) = rDT(s, a)rVS(s, a) + 100(rTC(s, a) − 0.8) + 100(rBEAT(s, a) − 0.5) + 100rWB(s, a) + 100rLP(s, a)   (4.2)

The rewards increase drastically after training for 1000 episodes. However, the rewards decrease again after 1500 episodes, resulting in the oscillating pattern shown in figure 4.4a. The loss (fig. 4.4b) also increases after 1000 episodes. However, this is expected since the change in acquired rewards leads to a higher loss for the critic. After increasing to approximately 17M, the average loss steadily decreases for the remaining episodes in a slightly oscillating pattern.

The trained policy yields a much more uniform distribution of performed actions compared to the policy of the Dueling DQN network (fig. 4.5b). However, the master volume of the generated audio experiences drastic dips, as can be seen by the dark vertical bands in figure 4.5a.

(a) Episode Rewards (b) Loss (Actor Loss + Critic Loss)

Figure 4.4: Episode rewards and loss during the training of the A3C algorithm trained for 3438 episodes.

(a) Mel spectrogram (b) Performed Actions

Figure 4.5: Spectrogram and actions when generating a mix with the A3C network after training for 3438 episodes.

(a) Episode Rewards (b) Loss

Figure 4.6: Episode rewards, actor loss (blue), and critic loss (orange) during the training of the A3C algorithm trained for 10000 episodes.

4.4.2 Trained on 10000 Episodes

Given the larger variance of the policy distribution for the A3C algorithm, this algorithm was further explored. The reward function was updated and the A3C network was trained for 10000 episodes in the mixing environment. The training ran for 100529 seconds, corresponding to an average of approximately 10 seconds per episode.

The reward function used is given by eq. 4.3. The mixing transition reward function rMT was used to more heavily penalize mixes where the master volume diverges from one channel playing at full volume. This differs from the multiplication of the volume stability reward rVS with the delta transition reward rDT used in sec. 4.4.1 primarily in that the mixing transition reward function rMT allows negative rewards.

r(s, a) = rMT(s, a) + 10(rTC(s, a) − 0.8) + 10(rBEAT(s, a) − 0.9) + 100rWB(s, a) + 100rLP(s, a)   (4.3)

The resulting rewards (fig. 4.6a) were consistently lower than in the training of the previous A3C network due to the negative rewards in the mixing transition reward function. The loss (fig. 4.6b) was measured separately for the actor (blue) and the critic (orange). The critic loss was not decreasing during the training, whereas the actor loss was consistently negative, indicating an issue with the A3C implementation. The rewards were oscillating and the policy maintained high variance throughout the training. Peaks in the rewards coincide with policies that generate transitions that end up in the winning state, where the win bonus is awarded. Valleys in the rewards coincide with episodes where the volume of the outgoing song is lowered before the volume of the incoming song is raised, resulting in a failed mix with a high loss penalty.

(a) Mel spectrogram (b) Performed Actions

Figure 4.7: Spectrogram and actions when generating a mix with the A3C network after training for 10000 episodes.

Similarly to the trained model presented in sec. 4.4.1, the resulting policy explored all of the available actions with similar frequency (fig. 4.7b) when evaluated by generating a mix from the 58 songs. The generated mix, however, was ended prematurely due to the policy decreasing the output volume to 0 at the end of a transition (fig. 4.7a).

4.4.3 Trained on 60000 Episodes

The A3C algorithm was reimplemented. This time great care was taken to ensure that the correct parameters were used in the thread-local networks when predicting the action policy from which the actions were sampled during training. The parameters were optimized with an RMSprop optimizer with ρ = 0.9, learning rate η = 0.01, and a linear decay of 0.001 per iteration.

The reward function was updated to no longer include a win bonus, to avoid rewarding very short mixes. The updated reward function is given by eq. 4.4. The loss penalty was also significantly decreased to avoid overemphasizing that decreasing the volume is bad. The illegal action penalty rAP was introduced to ensure that the agent did not get stuck in a state where all actions taken by the agent had no influence on the DJ-controller. The volume stability reward rVS was reintroduced as an independent term, and the beat matching reward rBEAT was weighted by the entropy in volume v(s) of the channels after normalization by the sum of the volumes. The reason for weighting the beat matching by this entropy is that misaligned beats only cause discomfort to the listener if both

(a) Episode Rewards

(b) Actor Loss (c) Critic Loss

Figure 4.8: Episode rewards, actor loss, and critic loss during the training of the A3C algorithm trained for 60000 episodes.

channels are playing. The instant transition reward rIT was added to mitigate the effect of the transition rewards being delayed from when the relevant action was taken.

r(s, a) = 12rMT(s, a) + 3rTC(s, a) + 3rIT(s, a) + rBANDS(s, a) + 10rVS(s, a) + 10v(s)rBEAT(s, a) + rLP(s, a) + rAP(s, a)   (4.4)

where

v(s) = −(vol0(s)/(vol0(s) + vol1(s))) log(vol0(s)/(vol0(s) + vol1(s))) − (vol1(s)/(vol0(s) + vol1(s))) log(vol1(s)/(vol0(s) + vol1(s)))
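The entropy weight v(s) as reconstructed above can be computed as follows (the small epsilon guarding the degenerate all-zero case is an addition for numerical safety, not part of the thesis definition):

```python
import numpy as np

def volume_entropy(vol0, vol1, eps=1e-8):
    # Normalize the two channel volumes by their sum and compute the entropy of the
    # resulting two-point distribution; it is near 0 when only one channel is audible
    # and maximal when both channels play at equal volume.
    total = vol0 + vol1 + eps
    p = np.clip(np.array([vol0, vol1], dtype=float) / total, eps, 1.0)
    return float(-(p * np.log(p)).sum())
```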

The rewards were unstable during the initial 10000 episodes but then increased monotonically for the following 30000 episodes before stagnating at around the 40000-episode mark (fig. 4.8a). This stagnation is likely caused by the linear decay of the learning rate. The instability in rewards was reflected in the critic loss curve, which stabilized after 10000 episodes and then decreased for the following 20000 episodes.

(a) Mel spectrogram (b) Performed Actions

Figure 4.9: Spectrogram and actions when generating a mix with the A3C network after training for 60000 episodes.

The actor loss also experienced the instability of the reward and critic loss curves. However, after stabilizing, the actor loss converged to an average loss of 0, with the absolute loss at around 300000 with a confidence of 95%. The low rewards around 2000 episodes correspond to short mixes with very drastic changes in volume.

Contrasted with the resulting A3C policies given in sections 4.4.1 and 4.4.2, the resulting policy for the mixing tier has a lower degree of exploration of the action space. The performed actions are shown in figure 4.9b. However, the policy results in a relatively stable amplitude, as can be seen in figure 4.9a.

4.5 Cue Point Loading Environment

The cue point loading tier was trained with the A3C algorithm for 100000 episodes. The mixing policy was fixed to the A3C mixing tier trained for 3438 episodes, presented in sec. 4.4.1. The rewards (fig. 4.10a) oscillated slowly but the reward curve has a slightly increasing trend, resulting in a final average reward of approximately 3000. Compared to the rewards of the mixing tier, the rewards of the cue point loading tier have a higher variance.

4.6 User Survey

A user survey was conducted to subjectively evaluate the quality of the system. Mixes were generated using the system: 7 were generated with a random mixing strategy and 7 were generated using the trained policy presented in sec. 4.4.3.

(a) Episode Rewards

(b) Actor Loss (c) Critic Loss

Figure 4.10: Episode rewards, actor loss, and critic loss during the training of the Cue Point Loading Tier with the A3C algorithm trained for 100000 episodes.

These 14 mixes were compared to 7 mixes from the Mixotic dataset [9, 10]. Mel spectrograms of the 21 mixes are found in figure 4.11. The wide black bands in the mixes generated by the random mixing strategy indicate prolonged durations of silence. The mixes generated from the trained mixing strategy contain sections where the high-frequency content of the audio is missing. This is also the case for some of the human-generated mixes (figs. 4.11g, 4.11j).

From the 21 mixes, two-minute audio clips were extracted by randomly choosing a 120-second section of each mix. This duration was chosen to be long enough that the audio clip contained at least one transition while being sufficiently short to limit the total time needed to complete the survey. The process of extracting the audio clips is further explained in sec. 3.12.

The survey was performed by letting the respondents listen to the mixes and, after each mix, grade it according to the aspects described in sec. 3.12.1. 11 respondents participated in the survey: 2 had previous DJ experience and 5 had experience with audio engineering (mixing or mastering). 9 respondents had listened to a DJ mix at some point in their lives, of which 6 had listened to a DJ mix in the last year. The mean grade for each graded aspect is given in

figs. 4.12a, 4.12b, 4.12c, 4.12d, and 4.12e. The human-generated mixes from the Mixotic dataset had a higher mean grade with respect to comfort, enjoyability, groove, and naturalness, and a lower degree of noticeability of transitions between songs. The average time between transitions is slightly longer in the mixes of the Mixotic dataset. Thus the likelihood of an audio clip not containing any transition is higher for those clips. This may influence the noticeability of the transitions. The trained mixing strategy had a higher mean grade than the random mixing strategy for comfort, enjoyability, and naturalness. However, the random mixing strategy performed better with respect to groove and noticeability of transitions.

The audio clips from the Mixotic dataset were correctly identified as human-generated 63% of the time. The audio clips from the random mixing strategy and the trained mixing strategy were misclassified as human-generated 40% and 43% of the time respectively. Given the small sample size and the high variance of the answers, no statistically significant conclusions can be drawn from the user survey. However, the data shows a tendency for the human-generated mixes to be graded the most enjoyable, followed by those from the trained mixing strategy.

(a)–(u): Mixotic (human), Random, and Trained mixes, arranged in rows of three in that order.

Figure 4.11: Mel spectrograms of the mixes used in the user survey.

(a) Comfort (b) Enjoyability (c) Groove (d) Naturalness (e) Noticeability of transitions (f) Ratio of mixes thought to be human-generated. Each panel compares Mixotic (Human), Cue Point+Random, and Cue Point+Trained; panels (a)–(e) use a 1–5 scale and panel (f) a 0–1 scale.

Figure 4.12: Survey results evaluating the audio clips. The black error bars indicate one standard deviation.

Chapter 5

Discussion

5.1 Convergence Rate

The rate of convergence is slow for the reinforcement learning methods used. Even on relatively simple Atari games, the time until convergence of the A3C algorithm to a high score (reward) is measured in hours or days [66, p. 10]. It is commonly required to train the agent on hundreds of millions of steps in the environment before reaching a near-optimal policy [67, 68, pp. 12-15].

The time per training step of the mixing environment is not only limited by the calculation of the gradients. The generation of the audio and the measurements on the generated audio also incur a significant cost for each training step. Since the fastest implementation of the mixing environment generates 22.85 steps each second, extrapolating this to training for 100 million steps gives an estimate of 1215 hours, or roughly 50 days, of required training time.

5.2 Multi-Threading

Given the data dependence on previous samples when generating audio, a GPU cannot easily be utilized to improve the computational performance of generating the audio. This is especially true for the equalization, which is modeled as an IIR filter.

The main cause of the difference in performance between the Dueling DQN algorithm and the A3C algorithm is that the keras-rl implementation of the Dueling DQN algorithm runs in a single thread. Since the time spent generating audio is significant during training, running multiple environments in parallel in multiple threads results in a higher throughput of generated audio.


Since the DQN and the Dueling DQN algorithms are off-policy algorithms, the episodes may be generated in advance and then stored to disk. With this approach the generation of audio can indeed be performed in parallel even with the Dueling DQN algorithm. However, this would be limited by the available disk space, considering that a generated episode contains the waveform of the generated audio.

5.3 A3C Policy Updates

The initial implementation of the A3C algorithm used the weights of the global network to predict the policy during training. In the updated version of the implementation described in section 4.4.3 care was taken to ensure that the local networks only updated their weights after applying the gradients to the global network. This leads to a more stable exploration of the action space and update of the policy since the multiple agents explore the environment using the same parameters and thus also the same policy.

5.4 Complex Environment

The observation space consists of 3 × 4410 samples of audio as well as 160 floating-point numbers describing the volume histories of the two channels and 10 floating-point numbers describing the current state of all knobs and faders on the simulated DJ-controller. This observation space is high-dimensional, and the training time required to establish the correlation between the observation space, the action space of 21 actions, and the received reward is thus long.

Furthermore, there are multiple compounding effects between the different actions of the mixing environment, e.g. the output root mean square amplitude is influenced not only by the actions modifying the volume but also by the actions modifying the equalizer knobs.

5.5 Rewards

The policy is sensitive to minor changes in the reward function. A slight change to how the rewards are calculated may severely change what policy is optimal. The rewards used are heuristics that are designed to optimize for

some of the important aspects of a mix (section 1.8), with the hope that the resulting mix would be subjectively enjoyable. Without rewards encouraging transitions, the optimal policy for the mixing tier agent is to not introduce the incoming song, since the mix is most stable in amplitude and tempo when simply playing one of the songs. Hence, the transition rewards are required in order to ensure that a transition is performed and a mix can be generated. However, encouraging transitions creates two large locally optimal policies: only increasing the volume of the incoming song and only decreasing the volume of the outgoing song. This is highlighted by the resulting policy of the Dueling DQN shown in figure 4.3b. The dips in the reward curve of the A3C algorithm (fig. 4.8a) can also be explained by a convergence to these locally optimal policies, since the performed actions contain multiple subsequent actions changing the volume, which leads to a lower total reward due to the transition rewards rMT and rDT.

Many of the presented reward functions award the rewards to the agent with a delay from when the responsible action was taken. The delta transition reward compares the current volume to the volume one second before, and thus the rewards are higher for the actions following the volume change than for the volume-changing action itself.

The trained mixing policies exhibit reward hacking [8]. While the rewards momentarily increase, the larger reward does not correspond to subjectively better mixes. Given that the total reward for a training step is given by a combination of multiple heuristics, the agent may optimize a subset of the heuristics and still obtain a moderately high reward. With larger computational resources available, tuning of the weights for the individual rewards through parameter search should be considered to limit the possibility of high rewards despite poor performance. Unless multiple aspects important to generating a mix are considered, the resulting mixes are likely to sound subjectively poor.

To combat reward hacking, more elaborate rewards such as the delta transition reward (eq. 3.13) and the mixing transition reward (eq. 3.15) were added to shape the reward space further. However, even with these more elaborate reward functions, reward hacking remained prevalent.

The agent is provided observations which contain 0.1 seconds of generated audio; however, the beats in most songs occur every 0.4-0.75 seconds. Hence, it is difficult for the agent to observe and correlate the beat similarity reward rBEAT with the beats in the generated audio given the short observations.

The reward function is by necessity constructed from multiple conflicting rewards. In order for the mixing tier agent to create transitions between the

songs, an increase in volume for the incoming song and a decrease in volume for the outgoing song are encouraged. However, this conflicts with the delta transition reward, which penalizes too drastic changes in volume for either of the two channels. Likewise, the tempo must be changed to beat-match the audio of the two channels, but drastically changing the tempo of a playing song leads to discomfort, and thus the tempo change penalty rTP penalizes these changes.

5.6 Cue Points

The estimated cue points correspond with phrase boundaries, allowing mixes that potentially preserve the musical structure of the songs throughout the mix. The cue point loading tier yields a relatively high average reward when mixing. Judging by the actions performed when generating the mixes, the tier seems to have learned a near-optimal time to play a song before loading the next cue point, given the reward function.

5.7 Expressivity

The reason for using Deep Reinforcement Learning when generating the mixes, as opposed to hard-coding the mixing strategy, was to increase the expressivity of the mixes. However, the actions performed by the mixing tier agent are similar regardless of which songs are playing. Hence, the goal of achieving a higher degree of expressivity by utilizing Deep Reinforcement Learning was not satisfied.

The transition rewards heavily restrict which mixing strategies are possible for the agent to adopt. Because transitioning between songs is necessary for a mix to be generated, those rewards are weighted heavily so as to not get stuck in the local maximum of not creating any transitions at all. As a result, the agent focuses only on performing the transitions, and the remaining actions are chosen simply to limit the penalty for performing illegal actions.

5.8 Tiered System

The separation of the system into three tiers resulted in a significant reduction in dimensionality. With the tiered approach, the actions relating to mixing do not need to be considered in the cue point loading tier and vice versa. The separation also allowed for the assumption that two songs are always present

when modifying the state of the DJ-controller in the mixing tier. This assumption, in turn, implied that actions for starting and stopping each channel were rendered unnecessary and that the dimensionality of the action space could be reduced further.

5.9 Unsupervised Learning

No annotated dataset is required for training the model presented in this research. This requirement is lifted due to the reinforcement learning formulation, where the reward function provides the model with rewards to maximize that are computed based on the performed actions and the state of the reinforcement learning environment. Given the size of the action space and the space of potential mixes, any supervised learning with an annotated dataset is likely to over-fit to the dataset used.

The generation of audio with the trained policy is also autonomous and does not require any human input apart from the preparation of the collection of songs from which the constituent songs in the mix are sampled.

5.10 Extensibility

The generation of the mixed audio via a simulation of a DJ-controller, as described in sec. 3.5, lends itself to future extensions of the mixing strategies. More controls can be added to the DJ-controller, such as the parametrized low-pass and high-pass filters that are commonly present on modern mixing hardware aimed at DJs. The mixing strategies are potentially versatile and can incorporate both the current state of the DJ-controller and historical information, similar to what was done with regard to the volume in eq. 3.5.

5.11 User Survey and Subjective Evaluation

The user survey (sec. 4.6) indicates that the mixes generated with the trained policy are more comfortable, natural, and enjoyable to listen to compared to those of the randomized mixing strategy. However, the transitions of the trained policy are more noticeable and the mixes less motion inducing. Although it was more obvious to the respondents when the transitions were performed in the mixes generated by the trained mixing policy, the mixes of the trained policy were generally preferred over the mixes generated with the random policy. It is

important to note, however, that the small sample size of the survey prevents any statistically significant conclusions from being drawn.

The survey also showed a misclassification rate of approximately 40% when judging whether the mixes were human-generated or computer-generated. This misclassification rate was similar for both the human-generated mixes of the Mixotic [9, 10] dataset and the mixes generated by the system. The relatively high rate can be explained by the time when a transition is performed being the most indicative of whether the mix is generated by a human or not. However, the duration during which the mixes transition between songs is short compared to the duration during which the mixes play only one song at a time. Thus, the task of classifying whether a mix is human-generated is made difficult by the high similarity of the human-generated mixes to the computer-generated mixes when no transition is performed.

5.12 Improvements and Future Work

The Dueling DQN algorithm could be improved for this task by generating episodes in parallel, which is possible since it is an off-policy algorithm. This would make the algorithm feasible for use when training the mixing environment.

While the goal of this research was to investigate whether a mixing strategy could be learned without hard-coding any part of the strategy, future research is needed to evaluate whether hard-coding only the volume changes of the mixing strategy, and providing the deep reinforcement learning agent with actions to change the tempo and equalizer settings, leads to an increase in the expressivity of the mixes.

Since the policy is very sensitive to changes of the reward function and the reward function is difficult to design, an alternative approach using Inverse Reinforcement Learning [42] may be considered instead. Here, mixes generated by the system, or by external hardware that annotates the mixes with the appropriate actions, could be classified by users in order to learn the reward function.

If more resources are available, future research should consider running the training for more episodes than was presented in this research. The instability in rewards during the first 10000 episodes of the training presented in sec. 4.4.3 indicates that training must be performed for more episodes until the policy converges.

Chapter 6

Conclusion

This research investigated whether Deep Reinforcement Learning can be used to construct a mixing strategy for generating DJ-mixes without hard-coding the strategy, such that more expressive mixes can be generated. The constructed system does not require any annotated dataset and can generate mixes autonomously.

The Deep Reinforcement Learning approach is found to be able to learn policies that can generate DJ-mixes. However, the approach yields mixing strategies that only consider single aspects important to generating the mix and do not adapt the mixing strategy based on the songs played. Hence, the learned mixing strategies are not more expressive than hard-coded mixing strategies.

A user survey was conducted that indicates that the trained mixing strategy yields more enjoyable mixes than a random mixing strategy. The respondents failed to classify whether the mixes were human-generated or computer-generated at a rate of approximately 40%.

The convergence rate of Deep Reinforcement Learning is slow, and since the audio generation is time-consuming, the training time is the main limiting factor. Designing the reward function through a combination of heuristic rewards proved difficult, since there are multiple factors that must be simultaneously satisfied when generating a DJ-mix, but due to reward hacking the mixing strategies do not consider all important factors.

Due to the limited available computational resources it is not possible to draw any clear conclusions on whether the proposed method is appropriate when constructing the mixing strategy. However, the Deep Reinforcement Learning approach appears difficult to apply to constructing the mixing strategy without several orders of magnitude larger computational resources.

Bibliography

[1] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. “Deep reinforcement learning framework for autonomous driving”. In: Electronic Imaging 2017.19 (2017), pp. 70–76.
[2] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature 518.7540 (2015), p. 529.
[3] David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550.7676 (2017), p. 354.
[4] Hiromi Ishizaki, Keiichiro Hoashi, and Yasuhiro Takishima. “Full-Automatic DJ Mixing System with Optimal Tempo Adjustment based on Measurement Function of User Discomfort.” In: International Society for Music Information Retrieval. 2009, pp. 135–140.
[5] Tatsunori Hirai, Hironori Doi, and Shigeo Morishima. “Latent Topic Similarity for Music Retrieval and Its Application to a System that Supports DJ Performance”. In: Journal of Information Processing 26 (2018), pp. 276–284. doi: 10.2197/ipsjjip.26.276.
[6] Len Vande Veire and Tijl De Bie. “From raw audio to a seamless mix: creating an automated DJ system for Drum and Bass”. In: EURASIP Journal on Audio, Speech, and Music Processing 2018.1 (Sept. 2018), p. 13. issn: 1687-4722. doi: 10.1186/s13636-018-0134-8. url: https://doi.org/10.1186/s13636-018-0134-8.
[7] Shun-Yao Shih and Heng-Yu Chi. “Automatic, Personalized, and Flexible Playlist Generation using Reinforcement Learning”. In: International Society for Music Information Retrieval. Sept. 2018, pp. 168–173.
[8] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. “Concrete Problems in AI Safety”. In: Computing Research Repository in arXiv abs/1606.06565 (2016).


[9] Reinhard Sonnleitner, Andreas Arzt, and Gerhard Widmer. “Landmark-Based Audio Fingerprinting for DJ Mix Monitoring”. In: International Society for Music Information Retrieval. 2016.
[10] Mixotic. mixotic.net - netlabel for free dj mixes. Aug. 6, 2019. url: http://www.mixotic.net/.
[11] Anna Alajanki, Yi-Hsuan Yang, and Mohammad Soleymani. “Benchmarking music emotion recognition systems”. In: PLOS ONE (2016). Under review.
[12] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. “Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor”. In: Proceedings of the 21st ACM International Conference on Multimedia. MM ’13. Barcelona, Spain: ACM, 2013, pp. 835–838. isbn: 978-1-4503-2404-5. doi: 10.1145/2502081.2502224. url: http://doi.acm.org/10.1145/2502081.2502224.
[13] Dave Cliff. “Hang the DJ: Automatic sequencing and seamless mixing of dance-music tracks”. In: HP Laboratories Technical Report HPL 104 (2000).
[14] Mixed In Key LLC. Harmonic Mixing Guide - Mixed In Key. 2019. url: https://mixedinkey.com/harmonic-mixing-guide/ (visited on 04/06/2019).
[15] R. Reinier Plomp and Willem J. M. Levelt. “Tonal consonance and critical bandwidth.” In: The Journal of the Acoustical Society of America 38.4 (1965), pp. 548–560.
[16] Emma Strubell, Ananya Ganesh, and Andrew McCallum. “Energy and Policy Considerations for Deep Learning in NLP”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 3645–3650. url: https://www.aclweb.org/anthology/P19-1355.
[17] Roger D. Chamberlain, Eric Hemmeter, Robert Morley, and Jason White. “Modeling the power consumption of audio signal processing computations using customized numerical representations”. In: 36th Annual Simulation Symposium, 2003 (2003), pp. 249–255.
[18] James Manyika. “A future that works: AI automation employment and productivity”. In: McKinsey Global Institute Research, Tech. Rep. (2017).

[19] James Manyika et al. “Jobs lost, jobs gained: Workforce transitions in a time of automation”. In: McKinsey Global Institute (2017). [20] InfoSoc Directive. “Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society”. In: Official Journal of the European Union, L/167 22 (2001), p. 2001. [21] InfoSoc Directive. “Official Journal of the European Union, L 130, 17 May 2019”. In: 62 (2019), p. 130. [22] Lionel Bently et al. “Sound Sampling, a Permitted Use Under EU Copyright Law? Opinion of the European Copyright Society in Relation to the Pending Reference before the CJEU in Case C-476/17, Pelham GmbH v. Hütter”. In: IIC - International Review of and Competition Law 50.4 (May 2019), pp. 467–490. issn: 2195-0237. doi: 10 . 1007 / s40319 - 019 - 00798 - w. url: https://doi.org/10.1007/s40319-019-00798-w. [23] Jani Ihalainen. “Computer creativity: artificial intelligence and copy- right”. In: Journal of Intellectual Property Law & Practice 13.9 (2018), pp. 724–728. issn: 1747-1532. [24] Paul J. Werbos. “Beyond regression: New tools for predicting and anal- ysis in the behavioral sciences”. PhD thesis. 1974. [25] Henry J. Kelley. “Gradient Theory of Optimal Flight Paths”. In: ARS Journal 30.10 (1960), pp. 947–954. doi: 10.2514/8.5282. eprint: https://doi.org/10.2514/8.5282. url: https://doi. org/10.2514/8.5282. [26] Stuart Dreyfus. “The numerical solution of variational problems”. eng. In: Journal of Mathematical Analysis and Applications 5.1 (1962), pp. 30–45. issn: 0022-247X. [27] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors”. In: Nature 323.6088 (Oct. 1986), pp. 533–536. doi: 10 . 1038 / 323533a0. url: https://doi.org/10.1038/323533a0. [28] Seppo Linnainmaa. “Taylor expansion of the accumulated rounding er- ror”. eng. In: BIT Numerical Mathematics 16.2 (1976), pp. 146–160. issn: 0006-3835. [29] Yann LeCun et al. “Backpropagation Applied to Handwritten Zip Code Recognition”. In: Neural Computation 1 (1989), pp. 541–551. 62 BIBLIOGRAPHY

[30] Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”. eng. In: Biological Cybernetics 36.4 (1980), pp. 193–202. issn: 0340-1200. [31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learn- ing”. In: Nature 521.7553 (May 2015), pp. 436–444. doi: 10 . 1038/nature14539. url: https://doi.org/10.1038/ nature14539. [32] Jürgen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural networks : the official journal of the International Neural Network Society 61 (2015), pp. 85–117. [33] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. “Reinforcement Learning: A Survey”. In: J. Artif. Intell. Res. 4 (1996), pp. 237–285. [34] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. “Planning and Acting in Partially Observable Stochastic Domains”. In: Artif. Intell. 101 (1998), pp. 99–134. [35] Chris Watkins. “Learning from delayed rewards”. PhD thesis. 1989. [36] Vijaymohan Konda. “Actor-critic algorithms”. In: Neural Information Processing Systems. 1999. [37] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning”. In: International conference on machine learning. 2016, pp. 1928–1937. [38] Yuxi Li. “Deep Reinforcement Learning: An Overview”. In: Computing Research Repository in arXiv abs/1701.07274 (2017). [39] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. “Dueling Network Architectures for Deep Reinforcement Learning”. In: International Conference on Machine Learning. 2016. [40] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy Optimization Algorithms”. In: Com- puting Research Repository in arXiv abs/1707.06347 (2017). [41] Tommi S. Jaakkola, Satinder P. Singh, and Michael I. Jordan. “Rein- forcement Learning Algorithm for Partially Observable Markov Deci- sion Problems”. In: Neural Information Processing Systems. 1994. BIBLIOGRAPHY 63

[42] Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. “Deep reinforcement learning from human preferences”. In: Neural Information Processing Systems. 2017. [43] Rachel M. Bittner et al. “Automatic Playlist Sequencing and Transi- tions”. In: International Society for Music Information Retrieval. 2017. [44] Thomas R. Knösche et al. “Perception of phrase structure in music.” In: Human brain mapping 24 4 (2005), pp. 259–73. [45] Willi Apel. Harvard Dictionary of Music. Harvard University Press, 1950. [46] Harvey Fletcher. “Auditory Patterns”. In: Rev. Mod. Phys. 12 (1 Jan. 1940), pp. 47–65. doi: 10 . 1103 / RevModPhys . 12 . 47. url: https://link.aps.org/doi/10.1103/RevModPhys. 12.47. [47] Eberhard Zwicker. “Subdivision of the audible frequency range into critical bands (Frequenzgruppen)”. In: The Journal of the Acoustical Society of America 33.2 (1961), pp. 248–248. [48] Stanley Smith Stevens, John Volkmann, and Edwin B Newman. “A scale for the measurement of the psychological magnitude pitch”. In: The Journal of the Acoustical Society of America 8.3 (1937), pp. 185– 190. [49] Dmitry Bogdanov et al. “Essentia: An Audio Analysis Library for Mu- sic Information Retrieval”. In: International Society for Music Informa- tion Retrieval. 2013. [50] J. Stephen Downie and Remco C. Veltkamp. ISMIR 2010 Proceedings of the 11th International Society for Music Information Retrieval Con- ference, August 9-13, 2010 Utrecht, Netherlands. eng. International So- ciety for Music Information Retrieval, 2010, pp. 225–266. [51] Paul Ekman. “Are there basic emotions?” In: Psychological review 99 3 (1992), pp. 550–3. [52] Peter Kuppens, Francis Tuerlinckx, James A Russell, and Lisa Feldman Barrett. “The relation between valence and arousal in subjective expe- rience.” In: Psychological bulletin 139 4 (2013), pp. 917–40. [53] Wilhelm Wundt. An introduction to psychology. 1924. 64 BIBLIOGRAPHY

[54] Jonathan E Posner, James A Russell, and Bradley S. Peterson. “The circumplex model of affect: an integrative approach to affective neu- roscience, cognitive development, and psychopathology.” In: Develop- ment and psychopathology 17 3 (2005), pp. 715–34. [55] Jonathan Foote. “Automatic audio segmentation using a measure of au- dio novelty”. In: Multimedia and Expo, 2000. ICME 2000. 2000 IEEE International Conference on. Vol. 1. IEEE. 2000, pp. 452–455. [56] Greg Brockman et al. OpenAI Gym. 2016. eprint: arXiv : 1606 . 01540. [57] Fabrice Bellard. FFmpeg. Version 3.4.5. July 11, 2019. url: Hadoo: //ffmpeg.org. [58] Travis Oliphant. NumPy: A guide to NumPy. 2006–. url: http:// www.numpy.org/. [59] J. R. Zapata, M. E. P. Davies, and E. Gómez. “Multi-Feature Beat Track- ing”. In: IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing 22.4 (Apr. 2014), pp. 816–825. issn: 2329-9290. doi: 10 . 1109/TASLP.2014.2305252. [60] Robert Bristow-Johnson. Cookbook formulae for audio EQ biquad filter coefficients. 2005. url: http://www.musicdsp.org/en/ latest/_downloads/3e1dc886e7849251d6747b194d482272/ Audio-EQ-Cookbook.txt. [61] Aäron van den Oord et al. “WaveNet: A Generative Model for Raw Audio”. In: SSW. 2016. [62] François Chollet. Keras. https://github.com/fchollet/ keras. 2015. [63] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. url: http://tensorflow.org/. [64] Matthias Plappert. keras-rl. https://github.com/keras-rl/ keras-rl. 2016. [65] Guy Madison, Fabien Gouyon, Fredrik Ullén, and Kalle Hörnström. “Modeling the tendency for music to induce movement in humans: First correlations with low-level audio descriptors across music gen- res.” In: Journal of experimental psychology: human perception and performance 37.5 (2011), p. 1578. BIBLIOGRAPHY 65

[66] Mohammad Babaeizadeh, Iuri Frosio, Stephen Tyree, Jason Clemons, and Jan Kautz. “Reinforcement Learning through Asynchronous Advantage Actor-Critic on a GPU”. In: International Conference on Learning Representations. 2017. [67] Hado van Hasselt, Arthur Guez, and David Silver. “Deep Reinforce- ment Learning with Double Q-learning”. In: Association for the Ad- vancement of Artificial Intelligence. 2015. [68] Oriol Vinyals et al. “StarCraft II: A New Challenge for Rein- forcement Learning”. In: Computing Research Repository in arXiv abs/1708.04782 (2017).

TRITA-EECS-EX-2019:726

www.kth.se