RaveForce: A Deep Reinforcement Learning Environment for Music Generation

Qichao Lan
RITMO, Department of Musicology
University of Oslo
[email protected]

Jim Tørresen
RITMO, Department of Informatics
University of Oslo
[email protected]

Alexander Refsum Jensenius
RITMO, Department of Musicology
University of Oslo
[email protected]

ABSTRACT

RaveForce is a programming framework designed for a computational music generation method that involves audio sample level evaluation in symbolic music representation generation. It comprises a Python module and a SuperCollider quark. When connected with deep learning frameworks in Python, RaveForce can send the symbolic music representation generated by the neural network as Open Sound Control messages to SuperCollider for non-real-time synthesis. SuperCollider converts the symbolic representation into an audio file, which is sent back to Python as the input of the neural network. With this iterative training, the neural network can be improved with deep reinforcement learning algorithms, taking the quantitative evaluation of the audio file as the reward. In this paper, we find that the proposed method can be used to search for new synthesis parameters for a specific timbre of an electronic music note or loop.

1. INTRODUCTION

In a computational music generation task, what is essentially generated? This question leads to a debate on whether to generate music as a symbolic music representation, e.g. MIDI (Musical Instrument Digital Interface), or to generate the audio waveform directly. Symbolic music representations can generally reflect the idiosyncrasy of a music piece, but they can hardly trace detailed music information, such as micro-tonal tunings, timbre nuances and micro-timing. Signal-based music representations are better at preserving micro-level details that are not captured well by symbolic representations. Thus signal-based workflows, including raw audio generation, may be a solution for computational music generation. However, since raw audio generation requires much more computational resources than symbolic representation methods, there are still some difficulties for this method in generating long multi-track music pieces [1]. Furthermore, without a symbolic representation, these methods can be too sophisticated to explain from a music-theoretical perspective. Hence, our motivation is to find a balance between these two forms of music representation in computational music generation.

Our research question is: how can an AI system be trained to consider the musical sound while generating a symbolic music representation? Technically speaking, we hope that the neural network in an AI system can not only generate symbolic sequences but also convert the symbolic representation into an audio waveform that can be evaluated. To do so, we need to use non-real-time synthesis for the transformation from symbolic music representation to an audio file, which will become the input of the neural network, and the output will accordingly be the next symbolic representation. Compared with pure symbolic generation, this method also outputs the corresponding audio waveform, which may broaden the application fields. Besides, different from raw audio generation, we fix the transforming function for the neural network, which may let the computational resources focus more on the target music information than on the function estimation.

In this paper, we explain the proposed method and provide a programming implementation as well as two simplified music tasks as examples. We start with the background of deep learning music generation in Section 2, demonstrating the relationship between the data type and the neural network architecture. In Section 3, we present our method to improve the symbolic representation and the reason why we choose deep reinforcement learning. Section 4 introduces the implementation details of our deep reinforcement learning environment with an emphasis on how we optimise it for a musical context. Section 5 describes the reward function design in customised tasks and explains the evaluation from a running time and music quality perspective. In Section 6, we summarise the innovations and limitations of our method as well as our future directions.
2. BACKGROUND

Computational music generation has for a long time been an intriguing topic for musicologists and computer scientists [2]. Of current algorithmic methods, deep learning seems to be particularly relevant for music generation tasks [3]. Deep learning is a method that learns from data representations, so in terms of music generation, it is essential to study the background of how the music representation influences the learning process and result.

2.1 Symbolic vs signal-based representations

Music can typically be represented as either signals (audio) or symbols (score representations). Popular symbolic representation methods include MIDI, MusicXML, MEI, and others [4]. Among them, MIDI is one of the most popular data formats used in deep learning music generation tasks. In some particular styles of music, particularly the ones based on traditional music notation, MIDI data can be an efficient representation. One example is the chorale generation in the DeepBach project [5]. Another example is that of machine-assisted composition applications, in which MIDI allows for editable features [6]. However, as mentioned in the introduction, there are also many cases in which symbolic representations are inadequate in capturing the richness and nuances of the music in question.

One way to address the limitations of symbolic representations is the use of sample-level music generation, as demonstrated in WaveNet [7] and WaveRNN [8]. However, although some progress has been made, raw audio generation requires a lot of computational resources, and it is too complicated to explain how these samples get organised from a musicology perspective.

The data format can also influence the design of the neural network. In symbolic representations, supervised learning can be found in many applications [9]. For raw audio signals, techniques such as autoencoders and generative adversarial networks (GANs) are frequently adopted [10, 11].

2.2 Reinforcement learning

Reinforcement learning is different from supervised or unsupervised learning techniques in that its updating strategy relies on the interaction between an agent and the environment rather than the function gradient. In a given period, that is, an episode in reinforcement learning, the agent will try to maximise the reward it can get. The reward is calculated in each episode, and it is used to update the parameters of the agent [12].

The connection between reinforcement learning and music generation goes back to the use of Markov models in algorithmic composition. As one of the pioneers in automated music generation, Iannis Xenakis used Markov models in the piece Analogique A to decide the order of musical sections [13]. The use of Markov models in composition reveals its connection with reinforcement learning, as the action of the agent only depends on the current state. However, in previous research on reinforcement learning in computational music generation [14], the reward function calculation is not based on sample-level evaluation.

Recently, deep learning technology has brought new possibilities to reinforcement learning, as it allows the agent to examine higher-level information. In deep reinforcement learning, the agent can be represented by a neural network, which makes it capable of evaluating the raw audio signal and then outputting a decision. Deep reinforcement learning has been a success during the past few years since it shows that a virtual agent can surpass human beings in several tasks, e.g. Atari games [15]. After that, more and more algorithms have appeared, such as Proximal Policy Optimization (PPO) [16]. For testing these algorithms, there are many simulation environments, e.g. the OpenAI Gym (https://gym.openai.com). For music, deep reinforcement learning has been used for score following [17]. However, there is still no environment designed for music generation.

3. DESIGN CONSIDERATION

Though symbolic representations have shown some limitations, generating music at the audio sample level can be computationally expensive. Therefore, we propose to generate the symbolic representation first, and then use these representations to synthesise audio for evaluation.

3.1 From symbolic notation to audio

Our first step is to choose a proper method to convert a symbolic representation to an audio file. Three options are considered:

1. to send the generated sequence to an instrument and record the sound for evaluation;

2. to use other general-purpose programming languages such as C++ for the sound synthesis;

3. to use music programming languages like Max/MSP, Pure Data, Csound and SuperCollider for non-real-time synthesis.

We exclude the first option because it would be too time-consuming, considering there would be a considerable number of iterations in the deep learning training process. The second option is the most efficient in synthesis speed, but it lacks extensibility from a music perspective, as users would have to be familiar with C-style programming languages. The third option best balances efficiency and usability, as music programming languages have already become ubiquitous in the electronic music field [18].

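To make the third option concrete, SuperCollider's synthesis server can render audio offline. A minimal sketch of invoking scsynth in non-real-time mode from Python, assuming scsynth is installed and score.osc is a prepared OSC score file (the file names and render settings are illustrative):

import subprocess

# Render an OSC score to a sound file without real-time playback;
# "_" means no input sound file, followed by the output file, the
# sample rate, the header format and the sample format.
subprocess.run(
    ["scsynth", "-N", "score.osc", "_", "out.wav", "44100", "WAV", "int16"],
    check=True,
)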
However, both the second and the third option face the same challenge: the gradient. In supervised learning, we need to know all the functions and their gradients. After comparing the output of the neural network with the training data, we fine-tune the parameters of the neural network to minimise the loss with the help of these gradients [19]. In our proposed method, since we involve non-real-time synthesis, back-propagation cannot be done in this context, as the functions used for transforming the symbolic representation to audio files are unknown.

3.2 Addressing the gradient problem with deep reinforcement learning

Deep reinforcement learning can solve the gradient problem mentioned above as it relies only on the interaction reward rather than the gradient. Though we cannot get the gradient from the symbolic-to-audio transforming function, we can quantitatively evaluate the synthesised audio to get a reward. Concretely, we train a neural network to output a sequence of symbolic music notations (such as the parameters for a synthesiser) and send the information to an audio programming language for non-real-time synthesis. Then, we compare the synthesised audio file with the target file, or we can use a neural network to grade the audio file directly. When an action brings a positive reward, the probability of the action should increase, and vice versa.

There are several important concepts in deep reinforcement learning that need to be defined in the music context (see Fig. 1; a Gym-style sketch follows the list):

1. Step refers to the process of executing what has been decided to do in the next 16th note or rest.

2. Episode refers to a series of continuous interactions before the done attribute turns to true, e.g. at the end of a game. In a musical context, we use a total-step attribute to decide the length of an episode. Thus, it can vary from one single note to a note sequence.

3. Observation-space refers to the current state. In our musical context, we set the currently synthesised audio file to be the observation-space. In other words, the agent should be "aware" of the previous state (synthesised audio) and take the next step accordingly.

4. Action-space refers to the set of action choices for the agent. In a musical context, the action-space can be discrete (e.g. a note pitch) or continuous (e.g. the amplitude).
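A sketch of how these concepts map onto the Gym conventions; the sizes below (8 drum combinations, a 16-step loop of about 90000 samples) are taken from the examples later in this paper, not a fixed part of the framework:

import numpy as np
from gym import spaces

# Action-space: discrete, e.g. one of 8 drum combinations per step...
action_space = spaces.Discrete(8)
# ...or continuous, e.g. an amplitude between 0 and 1:
# action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

# Observation-space: the currently synthesised audio, padded to a fixed
# length (about 90000 samples for a 16-step loop; cf. Fig. 3).
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(90000,),
                               dtype=np.float32)

# Episode length: the total-step attribute, here one bar of 16th notes.
total_step = 16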

Figure 1. RaveForce architecture: in each note (step), the agent (a neural network) chooses an action according to its observation of the current state (the currently synthesised audio); the environment (a step sequencer) evaluates the result, and the agent's parameters are updated.

4. IMPLEMENTATION AND OPTIMISATION

As discussed above, the key to our proposed method is to have an environment that can handle the non-real-time synthesis and evaluate the result. In our implementation of RaveForce (https://github.com/chaosprint/RaveForce), we follow the OpenAI Gym interfaces in our Python module, and in SuperCollider, we create a quark to execute the non-real-time audio synthesis. In order to connect with deep learning frameworks, some optimisation is necessary for the observation space.

4.1 The idea from a live coding session

To implement the environment, we refer to a live coding session [20]. In many live coding sessions, SuperCollider (https://supercollider.github.io) has been used as the audio engine, as it tracks time and beat accurately [21]. SuperCollider employs a client-server architecture that contains two parts: scsynth (the SuperCollider Synthesiser) and sclang (the SuperCollider Language). The sclang code is translated in real-time into a simplified version of Open Sound Control (OSC) messages [22] and sent to scsynth to control the sound. This architecture allows the scsynth server to run alone, while sclang can be replaced by other domain-specific languages (DSLs) like TidalCycles (https://tidalcycles.org). In short, in a live coding session, the live coders use DSLs as a client to control the real-time sound synthesis in the SuperCollider server. For our needs, instead of using SuperCollider to output real-time audio signals, we use it for non-real-time audio synthesis.

As for the client, we choose to write it in Python because several deep learning frameworks (such as PyTorch, https://pytorch.org) have been implemented in Python, and the Python module Gym is one of the most important benchmarks for deep reinforcement learning. By designing the client part in Python, we can follow the Gym interface and connect with a deep learning framework, while we move the interaction part (the audio synthesis) to the SuperCollider server. With the help of Open Sound Control messages, we link the neural network training with the audio synthesis (see Fig. 2).

Figure 2. Python-SuperCollider communication: a neural network (agent) is trained in Python; it sends symbolic music representations (e.g. notes and synthesiser parameters) as Open Sound Control messages to the SuperCollider pattern; then the pattern will be synthesised to an audio file in non-real-time and sent back to Python as the input of the neural network, forming an iteration.
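For illustration, such an OSC message can be sent from Python with the python-osc package; the address pattern and payload below are hypothetical placeholders, not the exact RaveForce protocol:

from pythonosc.udp_client import SimpleUDPClient

# 57120 is sclang's default UDP port; the address "/raveforce/step" and
# the arguments (a drum combination plus an amplitude) are made up here.
client = SimpleUDPClient("127.0.0.1", 57120)
client.send_message("/raveforce/step", [1, 0, 1, 0.8])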

4.2 Code implementation

The pseudo-code of the implementation is as follows (a runnable sketch follows the list):

1. Use the make function in the client to create the required environment, which will send a message to the server, asking the server to load related music patterns, synthesise an empty file and return the address of the file to the client side. On receiving the returning message, the client should read the action space and the observation space.

2. Send the reset message to the server side. Empty the observation space if it is not empty.

3. According to the observation space, decide what action to take. Send the step message to the server side with the chosen actions in each step. The server will do non-real-time synthesis in each step according to the given action message. Also, the server should return the address of the synthesised file to the client.

4. The client should use the address to load the currently synthesised sound file and set it as the observation space. Calculate the reward by comparing the generated audio file with the target audio file.

5. Pass the reward to the agent for updating the neural network.

6. Repeat from step 3 until the limit of the episode length is reached.

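A minimal client-side sketch of these six steps, following the Gym conventions; the module name and exact signatures are assumptions based on the description above:

import raveforce  # stand-in name for the RaveForce Python client

env = raveforce.make("drum-loop")   # step 1: create the environment
obs = env.reset()                   # step 2: reset the observation space

done = False
while not done:
    action = env.action_space.sample()           # step 3: choose an action (random here)
    obs, reward, done, info = env.step(action)   # steps 3-4: synthesise, then score
    # step 5: a learning agent would update its network with `reward` here
# step 6: the loop stops once the episode length limit is reached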
4.3 Optimisation

In the implementation, a unique strategy is designed for the observation space. As neural networks typically require a fixed-length input, the observation space needs to be padded to the same length in every step. Hence, in the initialisation stage, we require SuperCollider to generate an empty full-step (16-step by default) long audio file corresponding to the beats per minute (BPM) parameter. The length of this empty file will be set as the total-length attribute. In the following steps, though the actual output audio file varies in length, it will be padded with zeros to match the total-length attribute. With this strategy, the observation spaces in each step can share the same length (see Fig. 3).

Figure 3. The observation space in each state has the same length. For instance, step 1 only has the audio information for the first 16th note, but it is padded to have the same length as the reset state (about 90000 samples).

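The padding itself is simple; a NumPy sketch of the strategy, with total_length as defined above:

import numpy as np

def pad_observation(audio: np.ndarray, total_length: int) -> np.ndarray:
    """Zero-pad (or truncate) the audio rendered so far to total_length."""
    padded = np.zeros(total_length, dtype=np.float32)
    n = min(len(audio), total_length)
    padded[:n] = audio[:n]
    return padded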
5. TASK DESIGN AND EVALUATION

After implementing the environment, it is necessary to examine what kind of tasks it can handle and to evaluate how the environment performs with a given task.

5.1 Challenges with the reward function design

The reward function in reinforcement learning measures how well the agent chooses the action in the current step. Its design is challenging for music generation, especially in those tasks whose evaluation criteria are subjective. It can be feasible to evaluate the similarity between the generated music piece and the songs in a music corpus. At the same time, pursuing similarity in music can lead to plagiarism, which is an essential issue to address [23].

Currently, we provide four criteria for evaluation: (1) mean square error (MSE) of all the samples; (2) MSE of the Mel-frequency cepstral coefficients (MFCCs); (3) MSE of the short-time Fourier transform (STFT) coefficients, both real and imaginary parts; (4) MSE of the constant-Q transform (CQT) coefficients, both real and imaginary parts. These four criteria are used to measure the similarities between two audio files. Also, as the whole programming framework is customisable, it can be connected with other criteria, e.g. a well-trained neural network that can grade a music file.

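Each criterion can be computed with standard tools. A sketch using librosa, with analysis parameters left at their defaults (not necessarily the settings used in RaveForce) and both signals assumed to share the same length and sample rate; the MSE is negated so that a higher reward means a closer match:

import numpy as np
import librosa

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def reward(generated, target, sr=44100, criterion="mfcc"):
    if criterion == "sample":   # (1) MSE of all audio samples
        return -mse(generated, target)
    if criterion == "mfcc":     # (2) MSE of the MFCCs
        return -mse(librosa.feature.mfcc(y=generated, sr=sr),
                    librosa.feature.mfcc(y=target, sr=sr))
    if criterion == "stft":     # (3) MSE of STFT coefficients, real and imaginary
        g, t = librosa.stft(generated), librosa.stft(target)
        return -(mse(g.real, t.real) + mse(g.imag, t.imag))
    if criterion == "cqt":      # (4) MSE of CQT coefficients, real and imaginary
        g, t = librosa.cqt(generated, sr=sr), librosa.cqt(target, sr=sr)
        return -(mse(g.real, t.real) + mse(g.imag, t.imag))
    raise ValueError(f"unknown criterion: {criterion}")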
5.2 The example tasks

In RaveForce, the music task should be defined by the user (see Fig. 4). We provide two music examples to explain the environment better.

Figure 4. RaveForce workflow: first run the SuperCollider code, and then open a Python IDE (e.g. Jupyter Notebook) to train the agent.

5.2.1 Drum loop imitation

The example task drum-loop uses music samples from three drum components (kick drum, snare drum, and hi-hat) to imitate the target drum loop as closely as possible. The action space in this example is a discrete set that contains all eight possible combinations in each note, from which the agent should choose one action, and a reward will be calculated according to the choice (see Fig. 5).

Figure 5. Drum loop combination reward per step under different criteria (audio samples, STFT, MFCC, and CQT). The green line represents the reward of the optimal drum combination, which is closest to the target drum loop, while the rest are random combination rewards. The MFCC criterion tends to outperform the others in this task.

Different from some other reinforcement learning tasks, the reward in this task is precisely the state-value function. If we use Deep Q-learning (DQN) for this task, the Q function in each step can be calculated as follows:

$Q^{\pi}(a|s_t) = V(s_{t+1}) - V(s_t)$   (1)

Also, as a specific drum combination only has a fixed reward, we can use the traditional dynamic programming method to find the best drum pattern in this case, as sketched below.

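Since the reward of each combination in each 16th-note slot is fixed, the search factorises into 16 independent choices. A brute-force sketch, assuming a hypothetical step_reward helper that synthesises one combination in a given slot and scores it against the target loop:

def best_drum_pattern(step_reward, n_steps=16, n_actions=8):
    # For each slot, pick the kick/snare/hi-hat combination (0-7) with
    # the highest fixed reward; step_reward is a hypothetical helper
    # built on the environment and a chosen similarity criterion.
    return [max(range(n_actions), key=lambda a: step_reward(step, a))
            for step in range(n_steps)]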
5.2.2 Kick drum optimal frequency search

In this example, we aim to use a sine wave oscillator, controlled by an amplitude envelope, to simulate a kick drum audio sample. To make visualisation easier, we fix the envelope shape and make the frequency of the sine wave oscillator the only controllable parameter. The relationship between the frequency and the reward is shown in Fig. 6. The total-step attribute in SuperCollider can be set to one, which makes the pattern become a single note. In each iteration, the parameter update of the whole loop is done for this single note. Also, the example can be extended to more parameters and more steps.

Figure 6. Reward against oscillator frequency for the kick drum simulation task under different criteria (audio samples, STFT, MFCC, and CQT). MFCC and CQT tend to show poor performance in this task.

With the frequency-reward distribution, we can use the neural network to search for the optimal frequency. First, we train a critic-network which takes the frequency as input and predicts the reward. When connected with the critic-network, an actor-network can be trained until it converges to the optimal frequency.

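A minimal PyTorch sketch of this critic/actor pair; the layer sizes, optimisers, and the toy frequency-reward curve are illustrative assumptions standing in for real environment rollouts:

import torch
import torch.nn as nn

# Toy frequency-reward pairs peaking at 0.3; in practice each pair comes
# from synthesising a note and scoring it against the target sample.
samples = [(float(f), -abs(float(f) - 0.3))
           for f in torch.linspace(0.0, 1.0, 64)]

critic = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Parameter(torch.tensor([0.5]))  # normalised frequency guess

# 1) Fit the critic to the observed (frequency, reward) pairs.
c_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
for _ in range(200):
    for freq, rew in samples:
        loss = ((critic(torch.tensor([[freq]])) - rew) ** 2).mean()
        c_opt.zero_grad()
        loss.backward()
        c_opt.step()

# 2) Train the actor to maximise the critic's predicted reward.
a_opt = torch.optim.Adam([actor], lr=1e-2)
for _ in range(500):
    loss = -critic(actor.unsqueeze(0)).mean()
    a_opt.zero_grad()
    loss.backward()
    a_opt.step()

print(float(actor))  # approaches the optimal frequency (0.3 in this toy)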
5.3 Evaluation

We will evaluate the environment from two angles: (1) whether the environment is fast enough for the training; (2) whether the symbolic-to-audio conversion can help the music generation.

As a support to our method, the programming framework implementation is the focal point of this paper. In previous sections, we have introduced our environment design and the optimisation we have made, which makes it feasible to use audio evaluation methods for symbolic generation within an acceptable running time. To illustrate, we provide the running time of a 16-step task in one episode (see Fig. 7), calculated with the drum-loop task mentioned above.

Figure 7. The running time of the RaveForce example task drum-loop, plotted per step. Step 0 refers to the reset state, and some time will be spent on calculating the total-length. All 16 steps take around 3.2 seconds on an Apple MacBook Pro 13-inch (Mid 2017, i5, without Touch Bar).

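Such a measurement can be reproduced by timing the step calls directly; a simple sketch, reusing the hypothetical env from the earlier client-loop sketch:

import time

t0 = time.perf_counter()
env.reset()                              # step 0: reset (computes total-length)
for _ in range(16):
    env.step(env.action_space.sample())
print(f"episode took {time.perf_counter() - t0:.2f} s")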
Regarding the quality of the music, there are still some uncertainties, for the generated music quality may change with different algorithms, tasks and music genres. Currently, limited by computational resources, we focus mainly on the programming framework implementation, and only pay particular attention to electronic music loops or notes. Also, it is arguable that the predefinition of the synthesiser architecture can be a limitation on music complexity. However, this trade-off is significant to our proposed method. With a fixed transforming function, for example, the neural network will no longer need to organise all the audio samples to form an audio waveform which is aurally similar to an FM synthesis tone. Instead, the computational resources can be used to focus on optimising the parameters of a predefined FM synthesiser. This trade-off may even bring new possibilities in music creation, because mismatching the target tone with a random synthesiser architecture can potentially generate a tone which is similar to but slightly different from the target.

6. CONCLUSION

In this project, we propose a new music generation design that employs deep reinforcement learning, and we have implemented an environment for testing the design. It follows the OpenAI Gym interfaces but moves the interaction to SuperCollider. It turns out that SuperCollider is fast enough in non-real-time audio synthesis, which makes the reward calculation and the neural network training feasible. Meanwhile, there are some uncertainties as to whether this method can improve the music generation, which should be tested with different tasks, algorithms and music genres. This can be one of our future directions. Nevertheless, the whole implementation produces an environment for researchers to explore new algorithms for music generation tasks, e.g. music sequence generation or timbre parameter searching. It provides a new perspective on music generation, especially for those tasks in which users can find a determined reward function.

Acknowledgments

This work was partially supported by the Research Council of Norway through its Centres of Excellence scheme (project number 262762) and by NordForsk's Nordic University Hub Nordic Sound and Music Computing Network NordicSMC (project number 86892).

7. REFERENCES

[1] S. Dieleman, A. van den Oord, and K. Simonyan, "The challenge of realistic music generation: modelling raw audio at scale," in Advances in Neural Information Processing Systems, 2018, pp. 8000-8010.

[2] D. Herremans, C.-H. Chuan, and E. Chew, "A functional taxonomy of music generation systems," ACM Computing Surveys (CSUR), vol. 50, no. 5, p. 69, 2017.

[3] J.-P. Briot, G. Hadjeres, and F. Pachet, "Deep learning techniques for music generation - a survey," arXiv preprint arXiv:1709.01620, 2017.

[4] I. Fujinaga, A. Hankinson, and L. Pugin, "Automatic score extraction with optical music recognition (OMR)," in Springer Handbook of Systematic Musicology. Springer, 2018, pp. 299-311.

[5] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: a steerable model for Bach chorales generation," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1362-1371.

[6] I. Simon and S. Oore, "Performance RNN: Generating music with expressive timing and dynamics," 2017.

[7] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR abs/1609.03499, 2016.

[8] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.

[9] B. L. Sturm, O. Ben-Tal, U. Monaghan, N. Collins, D. Herremans, E. Chew, G. Hadjeres, E. Deruty, and F. Pachet, "Machine learning research that matters for music creation: A case study," Journal of New Music Research, vol. 48, no. 1, pp. 36-55, 2019.

[10] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model," arXiv preprint arXiv:1612.07837, 2016.

[11] C. Donahue, J. McAuley, and M. Puckette, "Synthesizing audio with generative adversarial networks," arXiv preprint arXiv:1802.04208, 2018.

[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.

[13] I. Xenakis, Formalized Music: Thought and Mathematics in Composition. Pendragon Press, 1992, no. 6.

[14] N. Collins, "Reinforcement learning for live musical agents," in ICMC, 2008.

[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.

[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.

[17] M. Dorfer, F. Henkel, and G. Widmer, "Learning to listen, read, and follow: Score following as a reinforcement learning game," arXiv preprint arXiv:1807.06391, 2018.

[18] G. Wang, "A history of programming and music," The Cambridge Companion to Electronic Music, pp. 55-71, 2007.

[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," California Univ San Diego La Jolla Inst for Cognitive Science, Tech. Rep., 1985.

[20] A. McLean and R. T. Dean, "Musical algorithms as tools, languages, and partners," The Oxford Handbook of Algorithmic Music, p. 1, 2018.

[21] J. McCartney, "Rethinking the computer music language: SuperCollider," Computer Music Journal, vol. 26, no. 4, pp. 61-68, 2002.

[22] M. Wright, "Open Sound Control: an enabling technology for musical networking," Organised Sound, vol. 10, no. 3, pp. 193-200, 2005.

[23] A. Papadopoulos, P. Roy, and F. Pachet, "Avoiding plagiarism in Markov sequence generation," in AAAI, 2014, pp. 2731-2737.