Mixing Music Using Deep Reinforcement Learning
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 60 CREDITS
STOCKHOLM, SWEDEN 2019

Mixing Music Using Deep Reinforcement Learning

VIKTOR KRONVALL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: October 14, 2019
Supervisor: Mats Nordahl
Examiner: Örjan Ekeberg
School of Electrical Engineering and Computer Science
Host company: Keio University
Swedish title: Musikmixning med Deep Reinforcement Learning

Abstract

Deep Reinforcement Learning has recently seen good results in tasks such as board games, computer games, and the control of autonomous vehicles. State-of-the-art autonomous DJ systems that generate mixed audio commonly hard-code the mixing strategy, typically as a cross-fade transition. This research investigates whether Deep Reinforcement Learning is an appropriate method for learning a mixing strategy that can yield more expressive and varied mixes than the hard-coded strategies by adapting the strategy to the songs being played.

To investigate this, a system named DeepFADE was constructed. DeepFADE was designed as a three-tier system of hierarchical Deep Reinforcement Learning models. The first tier selects an initial song and limits the song collection to a smaller subset. The second tier decides when to transition to the next song by loading it at pre-selected cue points. The third tier is responsible for generating a transition between the two loaded songs according to the mixing strategy. Two Deep Reinforcement Learning algorithms were evaluated: A3C and Dueling DQN. Convolutional and residual neural networks were used to train the reinforcement learning policies.

Reward functions were designed as combinations of heuristic functions that evaluate the mixing strategy according to several important aspects of a DJ mix, such as alignment of beats, stability in output volume, tonal consonance, and time between song transitions. The trained models yield policies that are either unable to create transitions between songs or strategies that remain the same regardless of the songs being played. Thus the learnt mixing strategies were not more expressive than hard-coded cross-fade strategies. The training suffers from reward hacking, which was argued to be caused by the agent's tendency to focus on optimizing only some of the heuristics. The reward hacking was mitigated somewhat by the design of more elaborate reward functions that guide the policy to a larger extent.

A survey was conducted with a sample size of n = 11. The small sample size implies that no statistically significant conclusions can be drawn. However, the mixes generated by the trained policy were rated more enjoyable than those produced by a randomized mixing strategy.

The convergence rate of the training is slow, and training time is limited not only by the optimization of the neural networks but also by the generation of audio used during training. Due to the limited available computational resources, it is not possible to draw any clear conclusions about whether the proposed method is appropriate for constructing the mixing strategy.
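The abstract describes the reward as a combination of heuristic functions covering beat alignment, volume stability, tonal consonance, and the spacing of transitions. As a rough illustration of that idea only, the Python sketch below combines four hypothetical heuristics into one scalar reward; the function names, signatures, feature choices, and weights are assumptions made for exposition and do not reproduce the reward functions actually used in the thesis.

```python
import numpy as np

# Illustrative sketch only: names, features, and weights are assumptions,
# not the DeepFADE reward implementation.

def beat_alignment(beat_offsets_ms):
    """Reward small offsets between the two songs' beats; 1.0 when beats coincide."""
    return float(np.exp(-np.mean(np.abs(beat_offsets_ms)) / 50.0))

def volume_stability(rms_frames):
    """Penalize large fluctuations in output loudness across audio frames."""
    return float(1.0 / (1.0 + np.std(rms_frames)))

def tonal_consonance(key_distance_semitones):
    """Prefer harmonically close keys (0 = same key)."""
    return float(max(0.0, 1.0 - key_distance_semitones / 6.0))

def transition_spacing(seconds_since_last_transition, target=60.0):
    """Prefer transitions that are neither too frequent nor too rare."""
    return float(np.exp(-abs(seconds_since_last_transition - target) / target))

def mixing_reward(state, weights=(0.4, 0.2, 0.2, 0.2)):
    """Combine the heuristics into a single scalar reward for the RL agent."""
    heuristics = np.array([
        beat_alignment(state["beat_offsets_ms"]),
        volume_stability(state["rms_frames"]),
        tonal_consonance(state["key_distance"]),
        transition_spacing(state["since_last_transition"]),
    ])
    return float(np.dot(weights, heuristics))

# Example call with made-up measurements
example_state = {
    "beat_offsets_ms": [12.0, -8.0, 5.0],
    "rms_frames": [0.21, 0.19, 0.22, 0.20],
    "key_distance": 2,
    "since_last_transition": 45.0,
}
print(mixing_reward(example_state))
```

In a scheme of this kind the agent can "reward hack" by maximizing the easiest heuristic while ignoring the others, which is the failure mode the abstract reports and which was mitigated by more elaborate reward designs.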
Sammanfattning

In recent years, deep reinforcement learning has seen good results in areas such as board games, computer games, and the control of autonomous vehicles. State-of-the-art autonomous DJ systems that generate mixed audio often hard-code the mixing strategy as a cross-fade transition. This work investigates whether deep reinforcement learning is a suitable method for learning a mixing strategy that gives rise to more expressive and varied mixes than the hard-coded strategies, by adapting the strategy to the songs being played. To investigate this, a system named DeepFADE was created.

The DeepFADE system was designed as a three-tier system of hierarchical deep reinforcement learning models. The first tier selects an initial song and limits the song collection to a smaller subset. The second tier decides when a transition to the next song should be performed by loading the next song at pre-selected cue points. The third tier is responsible for generating a transition between the two loaded songs according to the mixing strategy. Two algorithms were evaluated: A3C and Dueling DQN. Convolutional and residual neural networks were used to train the reinforcement learning policies.

The reward functions were designed as combinations of heuristics that evaluate the mixing strategy according to important aspects of a DJ mix, such as beat synchronization, volume stability, tonal consonance, and time between song transitions. The trained models give rise to strategies that either fail to create song transitions or that do not change based on the songs being played. Thus, the learned strategies were not more expressive than the hard-coded cross-fade strategies. The training suffers from reward hacking, which was considered to be caused by the agent's tendency to focus on a few heuristics at the expense of lower reward for the remaining ones. This was mitigated somewhat through the design of more complex reward functions that guided the agent's policy to a greater extent.

A survey with a sample size of n = 11 was conducted. The limited sample meant that no statistically significant conclusions could be drawn. However, mixes generated by the trained policy were judged more enjoyable than those generated by a random mixing strategy.

The convergence rate of the training is slow, and the training time is affected not only by the optimization of the neural networks but also by the generation of audio during training. Due to limited computational resources, no clear conclusions could be drawn as to whether the proposed method is suitable for constructing a mixing strategy.

Contents

1 Introduction
  1.1 Description
  1.2 Motivation
  1.3 Objective
  1.4 Research Question
    1.4.1 Question
    1.4.2 Problem Definition
  1.5 Evaluation
    1.5.1 Datasets
  1.6 Novelty
  1.7 Limitation of Scope
  1.8 Important Aspects of a DJ-mix
  1.9 Sustainability and Power Consumption
  1.10 Ethical and Societal Considerations

2 Background
  2.1 Machine Learning
    2.1.1 Artificial Neural Networks
    2.1.2 Backprop – The Error Back-Propagation Algorithm
    2.1.3 Convolutional Neural Networks
    2.1.4 Deep Learning
  2.2 Reinforcement Learning
    2.2.1 Markov Decision Process
    2.2.2 Partially Observable Markov Decision Process
    2.2.3 Functions and Distributions
    2.2.4 Actor-Critic Learning
    2.2.5 Q-Learning
    2.2.6 Deep Reinforcement Learning
    2.2.7 Deep Q Network
    2.2.8 Dueling Network Architectures
    2.2.9 Asynchronous Advantage Actor-Critic
    2.2.10 Epsilon Greedy Policy
    2.2.11 Partial Observability
    2.2.12 Reward Hacking
    2.2.13 Inverse Reinforcement Learning
  2.3 DJ mixes
    2.3.1 Musical Structure
    2.3.2 Musical Key
    2.3.3 Critical Bands and Perceptual Scales
    2.3.4 Tonal Consonance
    2.3.5 Music Emotion Recognition
    2.3.6 MediaEval Database for Emotional Analysis in Music
    2.3.7 Cue Point Selection
    2.3.8 Self-Similarity Based Novelty Score
  2.4 Autonomous DJ Systems

3 Method
  3.1 System Overview
  3.2 Pre-processing
  3.3 Song Recommendation Tier
  3.4 Cue Point Loading Tier
  3.5 Mixing (DJ-Controller) Tier
    3.5.1 Knobs and Faders
    3.5.2 Audio Generation
    3.5.3 Mixing and Volume
    3.5.4 Interpolation
  3.6 DJ-Controller Rewards
  3.7 Neural Network Architectures
  3.8 A3C Network
  3.9 Dueling DQN Network
  3.10 Algorithms
    3.10.1 A3C Implementation
    3.10.2 Dueling DQN Implementation
  3.11 Training
  3.12 Evaluation
    3.12.1 User Survey

4 Results
  4.1 Cue Points
  4.2 Training
  4.3 Dueling DQN
  4.4 A3C
    4.4.1 Trained on 3438 Episodes
    4.4.2 Trained on 10000 Episodes
    4.4.3 Trained on 60000 Episodes
  4.5 Cue Point Loading Environment
  4.6 User Survey

5 Discussion
  5.1 Convergence Rate
  5.2 Multi-Threading
  5.3 A3C Policy Updates
  5.4 Complex Environment
  5.5 Rewards
  5.6 Cue Points
  5.7 Expressivity
  5.8 Tiered System
  5.9 Unsupervised Learning
  5.10 Extensibility
  5.11 User Survey and Subjective Evaluation
  5.12 Improvements and Future Work

6 Conclusion

Bibliography

Chapter 1

Introduction

1.1 Description

The aim of this research is to construct a system that, given a collection of songs, can autonomously create a mix of those songs similar to what a DJ does during a live performance. The resulting mix should be pleasing to listen to and interesting, both musically and otherwise. The proposed method for constructing such a system is Deep Reinforcement Learning. This method was chosen because it does not assume any dataset of annotated DJ mixes. This is important both because such datasets are scarce and because supervised learning is likely to over-fit the model to the dataset, leading to degraded generalization.

1.2 Motivation

In recent years, reinforcement learning has been successfully applied to self-driving cars [1] as well as video games and board games such as Atari Breakout [2] and Go [3]. However, there is not yet enough research to determine whether the method is applicable to generating DJ mixes. This work aims to expand that area of research.

The body of research relating to DJ mixes is limited. However, Ishizaki, Hoashi, and Takishima [4] provide some evidence that a mixing strategy with minimal tempo adjustments leads to less discomfort when listening to the resulting mixes. Hirai, Doi, and Morishima [5] present a model for generating mixes based on a local similarity between songs, and Vande Veire and De Bie [6] present a model where transitions are performed based on a novelty measure computed on the audio waveform.
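The hard-coded cross-fade transition mentioned in the abstract is the baseline mixing strategy that a learned policy would need to improve upon. As a minimal sketch of what such a baseline looks like, the following Python snippet performs a plain equal-power cross-fade between two waveforms; the array layout, sample rate, and parameter names are assumptions for illustration and not code from the DeepFADE system.

```python
import numpy as np

# Minimal sketch of a hard-coded equal-power cross-fade baseline.
# Assumes both songs are mono float arrays at the same sample rate and that the
# transition starts at sample `start` of song_a; these details are illustrative.

def equal_power_crossfade(song_a, song_b, start, fade_samples):
    """Fade out song_a while fading in song_b over `fade_samples` samples."""
    t = np.linspace(0.0, 1.0, fade_samples)
    fade_out = np.cos(t * np.pi / 2)  # equal-power curves keep loudness roughly constant
    fade_in = np.sin(t * np.pi / 2)
    overlap = (song_a[start:start + fade_samples] * fade_out
               + song_b[:fade_samples] * fade_in)
    return np.concatenate([song_a[:start], overlap, song_b[fade_samples:]])

# Example with two seconds of synthetic audio at 44.1 kHz
sr = 44100
a = 0.1 * np.sin(2 * np.pi * 440 * np.arange(2 * sr) / sr)
b = 0.1 * np.sin(2 * np.pi * 330 * np.arange(2 * sr) / sr)
mix = equal_power_crossfade(a, b, start=sr, fade_samples=sr // 2)
print(mix.shape)
```

Because such a baseline applies the same transition shape regardless of which songs are playing, it cannot adapt to beat structure or key, which is the limitation the thesis seeks to address with a learned mixing strategy.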