Deep Learning Techniques for Music Generation – A Survey

Jean-Pierre Briot∗,1, Gaëtan Hadjeres† and François-David Pachet‡

∗ Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
† Sony Computer Science Laboratories, CSL-Paris, F-75005 Paris, France
‡ Spotify Creator Technology Research Lab, CTRL, F-75008 Paris, France

This paper is a survey and an analysis of different ways of using deep learning (deep artificial neural networks) to generate musical content. We propose a methodology based on five dimensions for our analysis:

• Objective
– What musical content is to be generated? Examples are: melody, polyphony, accompaniment or counterpoint.
– For what destination and for what use? To be performed by a human(s) (in the case of a musical score), or by a machine (in the case of an audio file).
• Representation
– What are the concepts to be manipulated? Examples are: waveform, spectrogram, note, chord, meter and beat.
– What format is to be used? Examples are: MIDI, piano roll or text.
– How will the representation be encoded? Examples are: scalar, one-hot or many-hot.
• Architecture
– What type(s) of deep neural network is (are) to be used? Examples are: feedforward network, recurrent network, autoencoder or generative adversarial networks.
• Challenge
– What are the limitations and open challenges? Examples are: variability, interactivity and creativity.
• Strategy
– How do we model and control the process of generation? Examples are: single-step feedforward, iterative feedforward, sampling or input manipulation.

For each dimension, we conduct a comparative analysis of various models and techniques and we propose some tentative multidimensional typology. This typology is bottom-up, based on the analysis of many existing deep-learning based systems for music generation selected from the relevant literature. These systems are described and are used to exemplify the various choices of objective, representation, architecture, challenge and strategy. The last section includes some discussion and some prospects. Supplementary material is provided at the following companion web site: www.briot.info/dlt4mg/

1 Also Visiting Professor at UNIRIO (Universidade Federal do Estado do Rio de Janeiro) and Permanent Visiting Professor at PUC-Rio (Pontifícia Universidade Católica do Rio de Janeiro), Rio de Janeiro, Brazil.

This paper is a simplified (weak DRM2) version of the following book [15]: Jean-Pierre Briot, Gaëtan Hadjeres and François-David Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 2019. Hardcover ISBN: 978-3-319-70162-2. eBook ISBN: 978-3-319-70163-9. Series ISSN: 2509-6575.

2 In addition to including high quality color figures, the book includes: a table of contents, a list of tables, a list of figures, a table of acronyms, a glossary and an index.

Chapter 1 Introduction

Deep learning has recently become a fast-growing domain and is now used routinely for classification and prediction tasks, such as image recognition, voice recognition or translation. It became popular in 2012, when a deep learning architecture significantly outperformed standard techniques relying on handcrafted features in an image classification competition, see more details in Section 5. We may explain this success and reemergence of artificial neural network techniques by the combination of:

• availability of massive data;
• availability of efficient and affordable computing power1;
• technical advances, such as:
– pre-training, which resolved initially inefficient training of neural networks with many layers [80]2;
– convolutions, which provide motif translation invariance [111];
– LSTM (long short-term memory), which resolved initially inefficient training of recurrent neural networks [83].

There is no consensual definition for deep learning. It is a repertoire of machine learning (ML) techniques, based on artificial neural networks. The key aspect and common ground is the term deep. This means that there are multiple layers processing multiple hierarchical levels of abstraction, which are automatically extracted from data3. Thus a deep architecture can manage and decompose complex representations in terms of simpler representations. The technical foundation is mostly artificial neural networks, as we will see in Chapter 5, with many extensions, such as: convolutional networks, recurrent networks, autoencoders, and restricted Boltzmann machines. For more information about the history and various facets of deep learning, see, e.g., a recent comprehensive book on the domain [63]. Driving applications of deep learning are traditional machine learning tasks4: classification (for instance, identification of images) and prediction5 (for instance, of the weather), as well as more recent ones such as translation. But a growing area of application of deep learning techniques is the generation of content. Content can be of various kinds: images, text and music, the latter being the focus of our analysis. The motivation lies in using the various corpora now widely available to automatically learn musical styles and to generate new musical content based on them.

1 Notably, thanks to graphics processing units (GPUs), initially designed for video games, which now have one of their biggest markets in data science and deep learning applications.
2 Although nowadays it has been replaced by other techniques, such as batch normalization [92] and deep residual learning [74].
3 That said, although deep learning will automatically extract significant features from the data, manual choices of input representation, e.g., spectrum vs raw wave signal for audio, may be very significant for the accuracy of the learning and for the quality of the generated content, see Section 4.9.3.
4 Tasks in machine learning are types of problems and may also be described in terms of how the machine learning system should process an example [63, Section 5.1.1]. Examples are: classification, regression and anomaly detection.
5 As a testimony of the initial DNA of neural networks: linear regression and logistic regression, see Section 5.1.

1.1 Motivation

1.1.1 Computer-Based Music Systems

The first music generated by computer appeared in 1957. It was a 17-second-long melody named “The Silver Scale” by its author Newman Guttman and was generated by a sound synthesis program named Music I, developed by Mathews at Bell Laboratories. The same year, “The Illiac Suite” was the first score composed by a computer [78]. It was named after the ILLIAC I computer at the University of Illinois at Urbana-Champaign (UIUC) in the United States. The human “meta-composers” were Lejaren A. Hiller and Leonard M. Isaacson, both musicians and scientists. It was an early example of algorithmic composition, making use of stochastic models (Markov chains) for generation as well as rules to filter generated material according to desired properties.

In the domain of sound synthesis, a landmark was the release in 1983 by Yamaha of the DX 7 synthesizer, building on groundwork by Chowning on a model of synthesis based on frequency modulation (FM). The same year, the MIDI6 interface was launched, as a way to interoperate various software and instruments (including the Yamaha DX 7 synthesizer). Another landmark was the development by Puckette at IRCAM of the Max/MSP real-time interactive processing environment, used for real-time synthesis and for interactive performances.

Regarding algorithmic composition, in the early 1960s Iannis Xenakis explored the idea of stochastic composition7 [209], in his composition named “Atrées” in 1962. The idea involved using the computer’s fast computations to calculate various possibilities from a set of probabilities designed by the composer in order to generate samples of musical pieces to be selected. In another approach, following the initial direction of “The Illiac Suite”, grammars and rules were used to specify the style of a given corpus or, more generally, tonal music theory. An example is the generation in the 1980s by Ebcioğlu’s composition software named CHORAL of a four-part chorale in the style of Johann Sebastian Bach, according to over 350 handcrafted rules [42]. In the late 1980s, David Cope’s system named Experiments in Musical Intelligence (EMI) extended that approach with the capacity to learn from a corpus of scores of a composer to create its own grammar and database of rules [27].

Since then, computer music has continued to develop for the general public, if we consider, for instance, the GarageBand music composition and production application for Apple platforms (computers, tablets and cellphones), as an offspring of the initial Cubase sequencer software, released by Steinberg in 1989.

For more details about the history and principles of computer music in general, see, for example, the book by Roads [160]. For more details about the history and principles of algorithmic composition, see, for example, [128] and the books by Cope [27] or Dean and McLean [33].

1.1.2 Autonomy versus Assistance

When talking about computer-based music generation, there is actually some ambiguity about whether the objective is

• to design and construct autonomous music-making systems – two recent examples being the deep-learning based Amper™ and Jukedeck systems/companies aimed at the creation of original music for commercials and documentaries; or
• to design and construct computer-based environments to assist human musicians (composers, arrangers, producers, etc.) – two examples being the FlowComposer environment developed at Sony CSL-Paris [153] (introduced in Section 6.11.4) and the OpenMusic environment developed at IRCAM [3].

6 Musical instrument digital interface, to be introduced in Section 4.7.1.
7 One of the first documented cases of stochastic music, long before computers, is the Musikalisches Würfelspiel (Dice Music), attributed to Wolfgang Amadeus Mozart. It was designed for using dice to generate music by concatenating randomly selected predefined music segments composed in a given style (Austrian waltz in a given key).

The quest for autonomous music-making systems may be an interesting perspective for exploring the process of composition8 and it also serves as an evaluation method. An example of a musical Turing test9 will be introduced in Section 6.14.2. It consists in presenting to various members of the public (from beginners to experts) chorales composed by J. S. Bach or generated by a deep learning system, both played by human musicians10. As we will see in the following, deep learning techniques turn out to be very good at passing such tests, due to their capacity to learn a musical style from a given corpus and to generate new music that fits into this style. That said, we consider that such a test is more a means than an end. A broader perspective lies in assisting human musicians during the various steps of music creation: composition, arranging, orchestration, production, etc. Indeed, to compose or to improvise11, a musician rarely creates new music from scratch. S/he reuses and adapts, consciously or unconsciously, features from various music that s/he already knows or has heard, while following some principles and guidelines, such as theories about harmony and scales. A computer-based musician assistant may act during different stages of the composition, to initiate, suggest, provoke and/or complement the inspiration of the human composer. That said, as we will see, the majority of current deep-learning based systems for generating music are still focused on autonomous generation, although more and more systems are addressing the issue of human-level control and interaction.

1.1.3 Symbolic versus Sub-Symbolic AI

Artificial Intelligence (AI) is often divided into two main streams12:

• symbolic AI – dealing with high-level symbolic representations (e.g., chords, harmony. . . ) and processes (harmonization, analysis. . . ); and
• sub-symbolic AI – dealing with low-level representations (e.g., sound, timbre. . . ) and processes (pitch recognition, classification. . . ).

Examples of symbolic models used for music are rule-based systems or grammars to represent harmony. Examples of sub-symbolic models used for music are machine learning algorithms for automatically learning musical styles from a corpus of musical pieces. These models can then be used in a generative and interactive manner, to help musicians in creating new music, by taking advantage of this added “intelligent” memory (associative, inductive and generative) to suggest proposals, sketches, extrapolations, mappings, etc. This is now feasible because of the growing availability of music in various forms, e.g., sound, scores and MIDI files, which can be automatically processed by computers.

A recent example of an integrated music composition environment is FlowComposer [153], which we will introduce in Section 6.11.4. It offers various symbolic and sub-symbolic techniques, e.g., Markov chains for modeling style, a constraint solving module for expressing constraints, a rule-based module to produce harmonic analysis, and an audio mapping module to produce a rendering. Another example of an integrated music composition environment is OpenMusic [3]. However, a deeper integration of sub-symbolic techniques, such as deep learning, with symbolic techniques, such as constraints and reasoning, is still an open issue13, although some partial integrations in restricted contexts already exist (see, for example, Markov constraints in [149, 7] and an example of use for FlowComposer in Section 6.11.4).

8 As Richard Feynman put it: “What I cannot create, I do not understand.”
9 Initially codified in 1950 by Alan Turing and named by him the “imitation game” [191], the “Turing test” is a test of the ability of a machine to exhibit intelligent behavior equivalent to (and more precisely, indistinguishable from) the behavior of a human. In his imaginary experimental setting, Turing proposed the test to be a natural language conversation between a human (the evaluator) and a hidden actor (another human or a machine). If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test.
10 This is to avoid the bias (synthetic flavor) of computer-rendered generated music.
11 Improvisation is a form of real-time composition.
12 With some precaution, as this division is not that strict.
13 The general objective of integrating sub-symbolic and symbolic levels into a complete AI system is among the “Holy Grails” of AI.

1.1.4 Deep Learning

The motivation for using deep learning (and more generally machine learning techniques) to generate musical content is its generality. As opposed to handcrafted models, such as grammar-based [177] or rule-based music generation systems [42], a machine learning-based generation system can be agnostic, as it learns a model from an arbitrary corpus of music. As a result, the same system may be used for various musical genres. Therefore, as more large-scale musical datasets are made available, a machine learning-based generation system will be able to automatically learn a musical style from a corpus and to generate new musical content. As stated by Fiebrink and Caramiaux [52], some benefits are

• it can make creation feasible when the desired application is too complex to be described by analytical formulations or manual brute force design, and
• learning algorithms are often less brittle than manually designed rule sets and learned rules are more likely to generalize accurately to new contexts in which inputs may change.

Moreover, as opposed to structured representations like rules and grammars, deep learning is good at processing raw unstructured data, from which its hierarchy of layers will extract higher-level representations adapted to the task.

1.1.5 Present and Future

As we will see, the research domain of deep learning-based music generation has recently become very active, building on initial work using artificial neural networks to generate music (e.g., the pioneering experiments by Todd in 1989 [190] and the CONCERT system developed by Mozer in 1994 [139]), while creating an active stream of new ideas and challenges made possible thanks to the progress of deep learning. Let us also note the growing interest of some big private actors of digital media in the computer-aided generation of artistic content, with the creation by Google in June 2016 of the Magenta research project [48] and the creation by Spotify in September 2017 of the Creator Technology Research Lab (CTRL) [176]. This is likely to contribute to blurring the line between music creation and music consumption through the personalization of musical content [2].

1.2 This Book

The lack (to our knowledge) of a comprehensive survey and analysis of this active research domain motivated the writing of this book, built in a bottom-up way from the analysis of numerous recent research works. The objective is to provide a comprehensive description of the issues and techniques for using deep learning to generate music, illustrated through the analysis of various architectures, systems and experiments presented in the literature. We also propose a conceptual framework and typology aimed at a better understanding of the design decisions for current as well as future systems.

1.2.1 Other Books and Sources

To our knowledge, there are only a few partial attempts at analyzing the use of deep learning for generating music. In [14], a very preliminary version of this work, Briot et al. proposed a first survey of various systems through a multicriteria analysis (considering as dimensions the objective, representation, architecture and strategy). We have extended and consolidated this study by integrating as an additional dimension the challenge (after having analyzed it in [16]).

In [65], Graves presented an analysis focusing on recurrent neural networks and text generation. In [90], Humphrey et al. presented another analysis, sharing some issues about music representation (see Section 4) but dedicated to music information retrieval (MIR) tasks, such as chord recognition, genre recognition and mood estimation. On MIR applications of deep learning, see also the recent tutorial paper by Choi et al. [21]. One could also consult the proceedings of some recently created international workshops on the topic, such as

• the Workshop on Constructive Machine Learning (CML 2016), held during the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016) [29];
• the Workshop on Deep Learning for Music (DLM), held during the International Joint Conference on Neural Networks (IJCNN 2017) [75], followed by a special journal issue [76]; and
• on the deep challenge of creativity, the related Series of International Conferences on Computational Creativity (ICCC) [186].

For a more general survey of computer-based techniques to generate music, the reader can refer to general books such as

• Roads’ book about computer music [160];
• Cope’s [27], Dean and McLean’s [33] and/or Nierhaus’ books [144] about algorithmic composition;
• a recent survey about AI methods in algorithmic composition [51]; and
• Cope’s book about models of musical creativity [28].

About machine learning in general, some examples of textbooks are

• the textbook by Mitchell [132];
• a nice introduction and summary by Domingos [39]; and
• a recent, complete and comprehensive book about deep learning by Goodfellow et al. [63].

1.2.2 Other Models

We have to remember that there are various other models and techniques for using computers to generate music, such as rules, grammars, automata, Markov models and graphical models. These models are either manually defined by experts or are automatically learnt from examples by using various machine learning techniques. They will not be addressed in this book as we are concerned here with deep learning techniques. However, in the following section we make a quick comparison of deep learning and Markov models.

1.2.3 Deep Learning versus Markov Models

Deep learning models are not the only models able to learn musical style from examples. Markov chain models are also widely used, see, for example, [146]. A quick comparison (inspired by the analysis of Mozer in [139]14) of the pros (+) and cons (–) of deep neural network models and Markov chain models is as follows:

+ Markov models are conceptually simple.
+ Markov models have a simple implementation and a simple learning algorithm, as the model is a transition probability table15.
– Neural network models are conceptually simple but the optimized implementations of current deep network architectures may be complex and need a lot of tuning.
– Order 1 Markov models (that is, considering only the previous state) do not capture long-term temporal structures.

14 Note that he made his analysis in 1994, long before the deep learning wave.
15 Statistics are collected from the dataset of examples in order to compute the probabilities.

– Order n Markov models (considering n previous states) are possible but require an explosive training set size16 and can lead to plagiarism17.
+ Neural networks can capture various types of relations, contexts and regularities.
+ Deep networks can learn long-term and high-order dependencies.
+ Markov models can learn from a few examples.
– Neural networks need a lot of examples in order to be able to learn well.
– Markov models do not generalize very well.
+ Neural networks generalize better through the use of distributed representations [82].
+ Markov models are operational models (automata) on which some control on the generation could be attached18.
– Deep networks are generative models with a distributed representation and therefore with no direct control to be attached19.

As deep learning implementations are now mature and large numbers of examples are available, deep learning-based models are increasingly favored for these characteristics. That said, other models (such as Markov chains, graphical models, etc.) are still useful and used, and the choice of a model and its tuning depends on the characteristics of the problem.
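To make the transition-probability-table item of the comparison above concrete, here is a minimal sketch (in Python, not taken from any of the systems surveyed) of an order-1 Markov chain over note pitches: training simply counts transitions between successive pitches, and generation samples from the resulting table. The toy corpus and pitch values are arbitrary illustrative choices.

```python
import random
from collections import defaultdict

def train_markov(sequences):
    """Estimate an order-1 transition table: counts of the next pitch given the current pitch."""
    table = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            table[current][nxt] += 1
    return table

def generate(table, start, length):
    """Sample a melody by repeatedly drawing the next pitch from the transition table."""
    melody = [start]
    for _ in range(length - 1):
        successors = table[melody[-1]]
        if not successors:          # dead end: no observed continuation in the corpus
            break
        pitches = list(successors.keys())
        weights = list(successors.values())
        melody.append(random.choices(pitches, weights=weights)[0])
    return melody

# Toy corpus of melodies given as MIDI pitch numbers (60 = middle C).
corpus = [[60, 62, 64, 62, 60], [60, 64, 67, 64, 60]]
table = train_markov(corpus)
print(generate(table, start=60, length=8))
```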

1.2.4 Requisites and Roadmap

This book does not require prior knowledge about deep learning and neural networks, nor about music.

Chapter 1 Introduction (this chapter) introduces the purpose and rationale of the book.

Chapter 2 Method introduces the method of analysis (conceptual framework) and the five dimensions at its basis (objective, representation, architecture, challenge and strategy), dimensions that we discuss within the next four chapters.

Chapter 3 Objective concerns the different types of musical content that we want to generate (such as a melody or an accompaniment to an existing melody)20, as well as their expected use (by a human and/or a machine).

Chapter 4 Representation provides an analysis of the different types of representation and techniques for encoding musical content (such as notes, durations or chords) for a deep learning architecture. This chapter may be skipped by a reader already expert in computer music, although some of the encoding strategies are specific to neural networks and deep learning architectures.

Chapter 5 Architecture summarizes the most common deep learning architectures (such as feedforward, recurrent or autoencoder) used for the generation of music. This includes a short reminder of the very basics of a simple neural network. This chapter may be skipped by a reader already expert in artificial neural networks and deep learning architectures.

Chapter 6 Challenge and Strategy provides an analysis of the various challenges that occur when applying deep learning techniques to music generation, as well as various strategies for addressing them. We will ground our study in the analysis of various systems and experiments surveyed from the literature. This chapter is the core of the book.

Chapter 7 Analysis summarizes the survey and analysis conducted in Chapter 6 through some tables as a way to identify the design decisions and their interrelations for the different systems surveyed21.

16 See the discussion in [139, page 249].
17 By copying overly long sequences from the corpus. A promising solution is to consider a variable order Markov model and to constrain the generation (through min order and max order constraints) to some sweet spot between junk and plagiarism [152].
18 Examples are Markov constraints [149] and graphs [148].
19 This issue as well as some possible solutions will be discussed in Section 6.10.1.
20 Our proposed typology of possible objectives will turn out to be useful for our analysis because, as we will see, different objectives can lead to different architectures and strategies.
21 And hopefully also for future ones. If we draw the analogy (at some meta-level) with the expected ability of a model learnt from a corpus by a machine to generalize to future examples (see Section 5.5.9), we hope that the conceptual framework presented in this book, (manually) induced from a corpus of scientific and technical literature about deep-learning-based music generation systems, will also be able to help in the design and the understanding of future systems.

Chapter 8 Discussion and Conclusion revisits some of the open issues that were touched on during the analysis of challenges and strategies presented in Chapter 6, before concluding this book.

A table of contents, a table of acronyms, a list of references, a glossary and an index complete this book. Supplementary material is provided at the following companion web site: www.briot.info/dlt4mg/

1.2.5 Limits

This book does not intend to be a general introduction to deep learning – a recent and broad-spectrum book on this topic is [63]. We do not intend to get into all the technical details of implementation, such as engineering and tuning, or of theory22, as we wish to focus on the conceptual level, whilst providing a sufficient degree of precision. Also, although having a clear pedagogical objective, we do not provide an end-to-end tutorial with all the steps and details on how to implement and tune a complete deep learning-based music generation system. Last, as this book is about a very active domain and as our survey and analysis is based on existing systems, our analysis is obviously not exhaustive. We have tried to select the most representative proposals and experiments, while new proposals are being presented at the time of our writing. Therefore, we encourage readers and colleagues to provide any feedback and suggestions for improving this survey and analysis, which is still an ongoing project.

22 For instance, we will not develop the probability theory and information theory frameworks for formalizing and interpreting the behavior of neural networks and deep learning. However, Section 5.5.6 will introduce the intuition behind the notions of entropy and cross-entropy, used for measuring the progress made during learning.


Chapter 2 Method

In our analysis, we consider five main dimensions to characterize different ways of applying deep learning techniques to generate musical content. This typology is aimed at helping the analysis of the various perspectives (and elements) leading to the design of different deep learning-based music generation systems1.

2.1 Dimensions

The five dimensions that we consider are as follows.

2.1.1 Objective

The objective2 consists in:

• The musical nature of the content to be generated. Examples are a melody, a polyphony or an accompaniment; and
• The destination and use of the content generated. Examples are a musical score to be performed by some human musician(s) or an audio file to be played.

2.1.2 Representation

The representation is the nature and format of the information (data) used to train and to generate musical content. Examples are signal, transformed signal (e.g., a spectrum, via a Fourier transform), piano roll, MIDI or text.

2.1.3 Architecture

The architecture is the nature of the assemblage of processing units (the artificial neurons) and their connections.

1 In this book, systems refers to the various proposals (architectures, systems and experiments) about deep learning-based music generation that we have surveyed from the literature. 2 We could have used the term task in place of objective. However, as task is a relatively well-defined and common term in the machine learning community (see Section 1 and [63, Chapter 5]), we preferred an alternative term.

Examples are a feedforward architecture, a recurrent architecture, an autoencoder architecture and generative adversarial networks.

2.1.4 Challenge

A challenge is one of the qualities (requirements) that may be desired for music generation. Examples are content variability, interactivity and originality.

2.1.5 Strategy

The strategy represents the way the architecture will process representations in order to generate3 the objective while matching desired requirements. Examples are single-step feedforward, iterative feedforward, decoder feedforward, sampling and input manipulation.

2.2 Discussion

Note that these five dimensions are not orthogonal. The choice of representation is partially determined by the objective and it also constrains the input and output (interfaces) of the architecture. A given type of architecture also usually leads to a default strategy of use, while new strategies may be designed in order to target specific challenges. The exploration of these five different dimensions and of their interplay is actually at the core of our analysis4. Each of the first three dimensions (objective, representation and architecture) will be analyzed with its associated typology in a specific chapter, with various illustrative examples and discussion. The challenge and strategy dimensions will be jointly analyzed within the same chapter (Chapter 6) in order to jointly illustrate potential issues (challenges) and possible solutions (strategies). As we will see, the same strategy may relate to more than one challenge and vice versa. Last, we do not expect our proposed conceptual framework (and its associated five dimensions and related typolo- gies) to be a final result, but rather a first step towards a better understanding of design decisions and challenges for deep learning-based music generation. In other words, it is likely to be further amended and refined, but we hope that it could help bootstrap what we believe to be a necessary comprehensive study.

3 Note that we consider here the strategy relating to the generation phase and not the strategy relating to the training phase, as they could be different.
4 Let us remember that our proposed typology has been constructed in a bottom-up manner from the survey and analysis of numerous systems retrieved from the literature, most of them being very recent.

Chapter 3 Objective

The first dimension, the objective, is the nature of the musical content to be generated.

3.1 Facets

We may consider five main facets of an objective:

• Type – the musical nature of the generated content. Examples are a melody, a polyphony or an accompaniment.
• Destination – the entity aimed at using (processing) the generated content. Examples are a human musician, a software or an audio system.
• Use – the way the destination entity will process the generated content. Examples are playing an audio file or performing a music score.
• Mode – the way the generation will be conducted, i.e. with some human intervention (interaction) or without any intervention (automation).
• Style – the musical style of the content to be generated. Examples are Johann Sebastian Bach chorales, Wolfgang Amadeus Mozart sonatas, Cole Porter songs or Wayne Shorter music. The style will actually be set through the choice of the dataset of musical examples (corpus) used as the training examples.

3.1.1 Type

Main examples of musical types are as follows:

• Single-voice monophonic melody, abbreviated as Melody
It is a sequence of notes for a single instrument or vocal, with at most one note at the same time.

An example is the music produced by a monophonic instrument like a flute1.
• Single-voice polyphony (also named Single-track polyphony), abbreviated as Polyphony
It is a sequence of notes for a single instrument, where more than one note can be played at the same time. An example is the music produced by a polyphonic instrument such as a piano or guitar.
• Multivoice polyphony (also named Multitrack polyphony), abbreviated as Multivoice or Multitrack
It is a set of multiple voices/tracks, which is intended for more than one voice or instrument. Examples are: a chorale with soprano, alto, tenor and bass voices or a jazz trio with piano, bass and drums.
• Accompaniment to a given melody
Such as
– Counterpoint, composed of one or more melodies (voices); or
– Chord progression, which provides some associated harmony.
• Association of a melody with a chord progression
An example is what is named a lead sheet2 and is common in jazz. It may also include lyrics3.

Note that the type facet is actually the most important facet, as it captures the musical nature of the objective for content generation. In this book, we will frequently identify an objective according to its type, e.g., a melody, as a matter of simplification. The next three facets – destination, use and mode – will turn out to be important when considering the dimension of the interaction of human user(s) with the process of content generation.
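To make the distinction between these types more concrete, the following sketch (our own illustrative data structures, not those of any surveyed system) represents a melody as a sequence of notes with at most one sounding at a time, a single-voice polyphony as one track with possibly simultaneous notes, and a multivoice piece as a set of named tracks; the field names and pitch values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Note:
    pitch: int        # MIDI note number
    start: float      # onset, in beats
    duration: float   # length, in beats

# Melody: a single voice with at most one note at a time.
melody: List[Note] = [Note(60, 0, 1), Note(62, 1, 1), Note(64, 2, 2)]

# Single-voice polyphony: one track, possibly simultaneous notes (e.g., a piano chord).
polyphony: List[Note] = [Note(60, 0, 2), Note(64, 0, 2), Note(67, 0, 2)]

# Multivoice/multitrack: several named voices or instruments sharing the same meter.
multivoice: Dict[str, List[Note]] = {
    "soprano": [Note(72, 0, 1), Note(74, 1, 1)],
    "bass": [Note(48, 0, 2)],
}
```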

3.1.2 Destination and Use

Main examples of destination and use are as follows:

• Audio system – which will play the generated content, as in the case of the generation of an audio file.
• Sequencer software – which will process the generated content, as in the case of the generation of a MIDI file.
• Human(s) – who will perform and interpret the generated content, as in the case of the generation of a music score.

3.1.3 Mode

There are two main modes of music generation:

• Autonomous and Automated – without any human intervention; or
• Interactive (to some degree) – with some control interface for the human user(s) to have some interactive control over the process of generation.

1 Although there are non-standard techniques to produce more than one note, the simplest one being to sing while playing. There are also non-standard diphonic techniques for voice.
2 Figure 4.13 in Chapter 4 Representation will show an example of a lead sheet.
3 Note that lyrics could be generated too. Although this target is beyond the scope of this book, we will see later in Section 4.7.3 that, in some systems, music is encoded as a text. Thus, a similar technique could be applied to lyric generation.

As deep learning for music generation is recent and basic neural network techniques are non-interactive, the majority of systems that we have analyzed are not yet very interactive4. Therefore, an important goal appears to be the design of fully interactive support systems for musicians (for composing, analyzing, harmonizing, arranging, producing, mixing, etc.), as pioneered by the FlowComposer prototype [153] to be introduced in Section 6.11.4.

3.1.4 Style

As stated previously, the musical style of the content to be generated will be governed by the choice of the dataset of musical examples that will be used as training examples. As will be discussed further in Section 4.12, the choice of a dataset, notably properties like coherence, coverage (versus sparsity) and scope (specialized versus large breadth), is actually fundamental for good music generation.

4 Some examples of interactive systems will be introduced in Section 6.15.


Chapter 4 Representation

The second dimension of our analysis, the representation, is about the way the musical content is represented. The choice of representation and its encoding is tightly connected to the configuration of the input and the output of the architecture, i.e. the number of input and output variables as well as their corresponding types. We will see that, although a deep learning architecture can automatically extract significant features from the data, the choice of representation may be significant for the accuracy of the learning and for the quality of the generated content. For example, in the case of an audio representation, we could use a spectrum representation (computed by a Fourier transform) instead of a raw waveform representation. In the case of a symbolic representation, we could consider (as in most systems) enharmony, i.e. A♯ being equivalent to B♭ and C♭ being equivalent to B, or instead preserve the distinction in order to keep the harmonic and/or voice leading meaning.

4.1 Phases and Types of Data

Before getting into the choices of representation for the various data to be processed by a deep learning architecture, it is important to identify the two main phases related to the activity of a deep learning architecture: the training phase and the generation phase, as well as the related four1 main types of data to be considered:

• Training phase
– Training data – the set of examples used for training the deep learning system;
– Validation data (also2 named Test data) – the set of examples used for testing the deep learning system3.
• Generation phase
– Generation (input) data – the data that will be used as input for the generation, e.g., a melody for which the system will generate an accompaniment, or a note that will be the first note of a generated melody;
– Generated (output) data – the data produced by the generation, as specified by the objective.

1 There may be more types of data depending on the complexity of the architecture, which may include intermediate processing steps. 2 Actually, a difference could be made, as will be later explained in Section 5.5.9. 3 The motivation will be introduced in Section 5.5.9.

Depending on the objective4, these four types of data may be equal or different5. For instance:

• in the case of the generation of a melody (for example, in Section 6.6.1.2), both the training data and the generated data are melodies; whereas
• in the case of the generation of a counterpoint accompaniment (for example, in Section 6.2.2), the generated data is a set of melodies.

4.2 Audio versus Symbolic

A big divide in terms of the choice of representation (both for input and output) is audio versus symbolic. This also corresponds to the divide between continuous and discrete variables. As we will see, their respective raw material is very different in nature, as are the types of techniques for possible processing and transformation of the initial representation6. They in fact correspond to different scientific and technical communities, namely signal processing and knowledge representation. However, the actual processing of these two main types of representation by a deep learning architecture is basically the same7. Therefore, actual audio and symbolic architectures for music generation may be pretty similar. For example, the WaveNet audio generation architecture (to be introduced in Section 6.10.3.2) has been transposed into the MidiNet symbolic music generation architecture (in Section 6.10.3.3). This polymorphism (possibility of multiple representations leading to genericity) is an additional advantage of the deep learning approach. That said, we will focus in this book on symbolic representations and on deep learning techniques for generation of symbolic music. There are various reasons for this choice:

• the great majority of the current deep learning systems for music generation are symbolic;
• we believe that the essence of music (as opposed to sound8) is in the compositional process, which is exposed via symbolic representations (like musical scores or lead sheets) and is subject to analysis (e.g., harmonic analysis);
• covering the details and variety of techniques for processing and transforming audio representations (e.g., spectrum, cepstrum, MFCC9, etc.) would necessitate an additional book10; and
• as stated previously, independently of considering audio or symbolic music generation, the principles of deep learning architectures as well as the encoding techniques used are actually pretty similar.

Last, let us mention a recent deep learning-based architecture which combines audio and symbolic representations. In this proposal from Manzelli et al. [126], a symbolic representation is used as a conditioning input11 in addition to the main audio representation input, in order to better guide and structure the generation of (audio) music (see more details in Section 6.10.3.2).

4.3 Audio

The first type of representation of musical content is audio signal, either in its raw form (waveform) or transformed.

4 As stated in Section 3.1.1, we identify an objective by its type as a matter of simplification.
5 Actually, training data and validation data are of the same kind, being both input data of the same architecture.
6 The initial representation may be transformed, through, e.g., data compression or extraction of higher-level representations, in order to improve learning and/or generation.
7 Indeed, at the level of processing by a deep network architecture, the initial distinction between audio and symbolic representation disappears, as only numerical values and operations are considered.
8 Without minimizing the importance of orchestration and production.
9 Mel-frequency cepstral coefficients.
10 An example entry point is the recent review by Wyse of audio representations for deep convolutional networks [208].
11 Conditioning will be introduced in Section 6.10.3.

4.3.1 Waveform

The most direct representation is the raw audio signal: the waveform. The visualization of a waveform is shown in Figure 4.1 and another one with a finer grain resolution is shown in Figure 4.2. In both figures, the x axis represents time and the y axis represents the amplitude of the signal.

Fig. 4.1 Example of a waveform

Fig. 4.2 Example of a waveform with a fine grain resolution. Excerpt from a waveform visualization (sound of a guitar) by Michael Jancsy, reproduced from “https://plot.ly/~michaeljancsy/205.embed” with permission of the author

The advantage of using a waveform is in considering the raw material untransformed, with its full initial resolution. Architectures that process the raw signal are sometimes named end-to-end architectures12. The disadvantage is the computational load: the low-level raw signal is demanding in terms of both memory and processing.

4.3.2 Transformed Representations

Using transformed representations of the audio signal usually leads to data compression and higher-level information, but as noted previously, at the cost of losing some information and introducing some bias.

12 The term end-to-end emphasizes that a system learns all features from raw unprocessed data – without any pre-processing, transformation of representation, or extraction of features – to produce the final output.

4.3.3 Spectrogram

A common transformed representation for audio is the spectrum, obtained via a Fourier transform13. Figure 4.3 shows an example of a spectrogram, a visual representation of a spectrum, where the x axis represents time (in seconds), the y axis represents the frequency (in kHz) and the axis in color represents the intensity of the sound (in dBFS14).

Fig. 4.3 Example of a spectrogram of the spoken words “nineteenth century”. Reproduced from Aquegg’s original image at “https://en.wikipedia.org/wiki/Spectrogram”
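As an illustration of how such a transformed representation can be computed, here is a minimal sketch using SciPy; the file name is a placeholder and the analysis parameters (e.g., the window length) are arbitrary choices rather than those of any specific system discussed in this book.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono waveform (hypothetical file name).
sample_rate, samples = wavfile.read("example.wav")
if samples.ndim > 1:              # keep only the first channel if the file is stereo
    samples = samples[:, 0]

# Short-time Fourier analysis: frequencies (Hz), frame times (s), power per bin.
frequencies, times, power = spectrogram(samples, fs=sample_rate, nperseg=1024)

# Intensity in dB (relative), as commonly displayed in spectrogram figures.
power_db = 10 * np.log10(power + 1e-10)
print(power_db.shape)  # (num_frequency_bins, num_time_frames)
```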

4.3.4 Chromagram

A variation of the spectrogram, discretized onto the tempered scale and independent of the octave, is a chromagram. It is restricted to pitch classes15. The chromagram of the C major scale played on a piano is illustrated in Figure 4.4. The x axis common to the four subfigures (a to d) represents time (in seconds). The y axis of the score (a) represents the note, the y axis of the chromagrams (b and d) represents the chroma (pitch class) and the y axis of the signal (c) represents the amplitude. For the chromagrams (b and d), the third axis in color represents the intensity.

13 The objective of the Fourier transform (which could be continuous or discrete) is the decomposition of an arbitrary signal into its elementary components (sinusoidal waveforms). As well as compressing the information, its role is fundamental for musical purposes as it reveals the harmonic components of the signal.
14 Decibel relative to full scale, a unit of measurement for amplitude levels in digital systems.
15 A pitch class (also named a chroma) represents the name of the corresponding note independently of the octave position. Possible pitch classes are C, C♯ (or D♭), D, . . . , A♯ (or B♭) and B.

Fig. 4.4 Examples of chromagrams. (a) Musical score of a C-major scale. (b) Chromagram obtained from the score. (c) Audio recording of the C-major scale played on a piano. (d) Chromagram obtained from the audio recording. Reproduced from Meinard Mueller’s original image at “https://en.wikipedia.org/wiki/Chroma_feature” under a CC BY-SA 3.0 licence
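A chromagram can be obtained by folding the frequency bins of a spectrogram onto the 12 pitch classes. The sketch below is a simplified illustration of this folding (reusing the frequencies and power arrays from the previous spectrogram sketch), not a reference implementation such as those found in audio analysis libraries.

```python
import numpy as np

def chromagram(frequencies, power):
    """Fold a power spectrogram onto the 12 pitch classes (C=0, C#=1, ..., B=11)."""
    chroma = np.zeros((12, power.shape[1]))
    for i, freq in enumerate(frequencies):
        if freq < 20:                      # skip DC and sub-audio bins
            continue
        # MIDI note number of the bin center, then reduce modulo 12 (A4 = 440 Hz = MIDI 69).
        midi = 69 + 12 * np.log2(freq / 440.0)
        pitch_class = int(round(midi)) % 12
        chroma[pitch_class] += power[i]
    return chroma

# chroma = chromagram(frequencies, power)   # shape: (12, num_time_frames)
```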

4.4 Symbolic

Symbolic representations are concerned with concepts like notes, duration and chords, which will be introduced in the following sections.

4.5 Main Concepts

4.5.1 Note

In a symbolic representation, a note is represented through the following main features, and for each feature there are alternative ways of specifying its value:

• Pitch – specified by
– frequency, in Hertz (Hz);
– vertical position (height) on a score; or
– pitch notation16, which combines a name, e.g., A, A♯, B, etc. – actually its pitch class – and a number (usually notated in subscript) identifying the pitch class octave, which belongs to the [−1, 9] discrete interval. An example is A4, which corresponds to A440 – with a frequency of 440 Hz – and serves as a general pitch tuning standard.
• Duration – specified by
– absolute value, in milliseconds (ms); or
– relative value, notated as a division or a multiple of a reference note duration, i.e. the whole note. Examples are a quarter note17 and an eighth note18.
• Dynamics – specified by
– absolute and quantitative value, in decibels (dB); or
– qualitative value, an annotation on a score about how to perform the note, which belongs to the discrete set {ppp, pp, p, f, ff, fff}, from pianissimo to fortissimo.
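As a small illustration of these alternative specifications of pitch, the following sketch converts a pitch notation such as A4 into a MIDI note number and then into a frequency in Hertz, using the A4 = 440 Hz tuning standard mentioned above. It only handles the simple letter + optional accidental + octave form; the function names are our own.

```python
# Semitone offset of each pitch-class letter relative to C.
PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def pitch_to_midi(name):
    """Convert a pitch notation such as 'A4', 'C#5' or 'Bb3' to a MIDI note number."""
    letter = name[0].upper()
    rest = name[1:]
    offset = 0
    if rest and rest[0] in "#b":
        offset = 1 if rest[0] == "#" else -1
        rest = rest[1:]
    octave = int(rest)                      # octave in the [-1, 9] discrete interval
    return 12 * (octave + 1) + PITCH_CLASSES[letter] + offset

def midi_to_frequency(midi):
    """Frequency in Hz, with A4 (MIDI note 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((midi - 69) / 12)

print(pitch_to_midi("A4"), midi_to_frequency(pitch_to_midi("A4")))  # 69 440.0
```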

4.5.2 Rest

Rests are important in music as they represent intervals of silence allowing a pause for breath19. A rest can be considered as a special case of a note, with only one feature, its duration, and no pitch or dynamics. The duration of a rest may be specified by

• absolute value, in milliseconds (ms); or
• relative value, notated as a division or a multiple of a reference rest duration, the whole rest having the same duration as a whole note. Examples are a quarter rest and an eighth rest, corresponding respectively to a quarter note and an eighth note.

4.5.3 Interval

An interval is a relative transition between two notes. Examples are a major third (which includes 4 semitones), a minor third (3 semitones) and a (perfect) fifth (7 semitones). Intervals are the basis of chords (to be introduced in the

16 Also named international pitch notation or scientific pitch notation. 17 Named a crotchet in British English. 18 Named a quaver in British English. 19 As much for appreciation of the music as for respiration by human performer(s)!

next section). For instance, the two main chords in classical music are major (with a major third and a fifth) and minor (with a minor third and a fifth). In the pioneering experiments described in [190], Todd discusses an alternative way of representing the pitch of a note. The idea is not to represent it in an absolute way as in Section 4.5.1, but in a relative way, by specifying the relative transition (measured in semitones), i.e. the interval, between two successive notes. For example, the melody C4, E4, G4 would be represented as C4, +4, +3. In [190], Todd points out two advantages: the fact that there is no fixed bounding of the pitch range and the fact that it is independent of a given key (tonality). However, he also points out that this second advantage may also be a major drawback, because in case of an error in the generation of an interval (resulting in a change of key), the wrong tonality (because of a wrong index) will be maintained in the rest of the generated melody. Another limitation is that this strategy applies only to the specification of a monophonic melody and cannot directly represent a single-voice polyphony, unless the parallel intervals are separated into different voices. Because of these pros and cons, an interval-based representation is actually rarely used in deep learning-based music generation systems.
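The following sketch illustrates this relative, interval-based encoding of a monophonic melody and its decoding back to absolute pitches (given as MIDI note numbers). The last line also illustrates the drawback pointed out above: a single erroneous interval transposes the whole remainder of the melody.

```python
def to_intervals(pitches):
    """Encode a monophonic melody (MIDI pitches) as a start pitch plus semitone intervals."""
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    return pitches[0], intervals

def from_intervals(start, intervals):
    """Decode back to absolute pitches by accumulating the intervals."""
    pitches = [start]
    for interval in intervals:
        pitches.append(pitches[-1] + interval)
    return pitches

start, intervals = to_intervals([60, 64, 67])   # C4, E4, G4
print(start, intervals)                         # 60 [4, 3]
print(from_intervals(start, intervals))         # [60, 64, 67]

# A single erroneous interval (+5 instead of +4) transposes everything that follows.
print(from_intervals(60, [5, 3]))               # [60, 65, 68]
```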

4.5.4 Chord

A representation of a chord, which is a set of at least 3 notes (a triad)20, could be

• implicit and extensional, enumerating the exact notes composing it. This permits the specification of the precise octave as well as the position (voicing) for each note, see an example in Figure 4.5; or
• explicit and intensional, by using a chord symbol combining
– the pitch class of its root note, e.g., C, and
– the type, e.g., major, minor, dominant 7th, or diminished21.

Fig. 4.5 C major chord with an open position/voicing: 1-5-3 (root, 5th and 3rd)

We will see that the extensional approach (explicitly listing all component notes) is more common for deep learning-based music generation systems, but there are some examples of systems representing chords explicitly with the intensional approach, as for instance the MidiNet system to be introduced in Section 6.10.3.3.
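As a small data-structure illustration (our own, not taken from a specific system), the two approaches can be contrasted as follows: the extensional form lists the exact pitches of a particular voicing, while the intensional form names only the root pitch class and the chord type, which can then be expanded into pitch classes from interval templates. The octave placement used for the extensional example is a hypothetical reading of the voicing of Figure 4.5.

```python
# Extensional: exact notes of a C major chord in the open 1-5-3 voicing of Figure 4.5,
# here as MIDI pitches C4, G4, E5 (a hypothetical octave placement for illustration).
c_major_extensional = [60, 67, 76]

# Intensional: a chord symbol (root pitch class + type), expanded from interval templates.
CHORD_TYPES = {
    "maj": [0, 4, 7],        # major triad: root, major third, perfect fifth
    "min": [0, 3, 7],        # minor triad: root, minor third, perfect fifth
    "dom7": [0, 4, 7, 10],   # dominant 7th
    "dim": [0, 3, 6],        # diminished triad
}

def chord_pitch_classes(root_pc, chord_type):
    """Pitch classes (0-11) of a chord symbol, without committing to octaves or voicing."""
    return [(root_pc + interval) % 12 for interval in CHORD_TYPES[chord_type]]

print(chord_pitch_classes(0, "maj"))   # C major -> [0, 4, 7] (C, E, G)
```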

4.5.5 Rhythm

Rhythm is fundamental to music. It conveys the pulsation as well as the stress on specific beats, indispensable for dance! Rhythm introduces pulsation, cycles and thus structure in what would otherwise remain a flat linear sequence of notes.

20 Modern music extends the original triads into a huge set of richer possibilities (diminished, augmented, dominant 7th, suspended, 9th, 13th, etc.) by adding and/or altering intervals/components.
21 There are some abbreviated notations, frequent in jazz and popular music, for example C minor = Cmin = Cm = C-; C major 7th = CM7 = Cmaj7 = C∆, etc.

4.5.5.1 Beat and Meter

A beat is the unit of pulsation in music. Beats are grouped into measures, separated by bars22. The number of beats in a measure as well as the duration between two successive beats constitute the rhythmic signature of a measure and, consequently, of a piece of music23. This is also often named the meter. It is expressed as the fraction numberOfBeats/beatDuration, where

• numberOfBeats is the number of beats within a measure; and
• beatDuration is the duration between two beats. As with the relative duration of a note (see Section 4.5.1) or of a rest, it is expressed as a division of the duration of a whole note.

The most frequent meters are 2/4, 3/4 and 4/4. For instance, 3/4 means 3 beats per measure, each one with the duration of a quarter note. It is the rhythmic signature of a waltz. The stress (or accentuation) on some beats or their subdivisions may form the actual style of a rhythm for music as well as for a dance, e.g., ternary jazz versus binary rock.

4.5.5.2 Levels of Rhythm Information

We may consider three different levels in terms of the amount and granularity of information about rhythm to be included in a musical representation for a deep learning architecture:

• None – only notes and their durations are represented, without any explicit representation of measures. This is the case for most systems.
• Measures – measures are explicitly represented. An example is the system described in Section 6.6.1.224.
• Beats – information about meter, beats, etc. is included. An example is the C-RBM system described in Section 6.10.5.1, which allows us to impose a specific meter and beat stress for the music to be generated.

4.6 Multivoice/Multitrack

A multivoice representation, also named multitrack, considers various independent voices, each being a different vocal range (e.g., soprano, alto. . . ) or a different instrument (e.g., piano, bass, drums. . . ). Multivoice music is usually modeled as parallel tracks, each one with a distinct sequence of notes25, sharing the same meter but possibly with different strong (stressed) beats26. Note that in some cases, although there are simultaneous notes, the representation will be a single-voice polyphony, as introduced in Section 3.1.1. Common examples are polyphonic instruments like a piano or a guitar. Another example is a drum or percussion kit, where each of the various components, e.g., snare, hi-hat, ride cymbal, kick, etc., will usually be considered as a distinct note for the same voice. The different ways to encode single-voice polyphony and multivoice polyphony will be further discussed in Section 4.11.2.

22 Although (and because) a bar is actually the graphical entity – the line segment “|” – separating measures, the term bar is also often used, especially in the United States, in place of measure. In this book we will stick to the term measure.
23 For more elaborate music, the meter may change within different portions of the music.
24 It is interesting to note that, as pointed out by Sturm et al. in [179], the generated music format also contains bars separating measures and that there is no guarantee that the number of notes in a measure will always fit the measure. However, errors rarely occur, indicating that this representation is sufficient for the architecture to learn to count, see [60] and Section 6.6.1.2.
25 With possibly simultaneous notes for a given voice, see Section 3.1.1.
26 Dance music is good at this, by having some syncopated bass and/or guitar not aligned with the strong drum beats, in order to create a bouncing pulse.

4.7 Format

The format is the language (i.e. grammar and syntax) in which a piece of music is expressed (specified) in order to be interpreted by a computer27.

4.7.1 MIDI

Musical Instrument Digital Interface (MIDI) is a technical standard that describes a protocol, a digital interface and connectors for interoperability between various electronic musical instruments, software and devices [133]. MIDI carries event messages that specify real-time note performance data as well as control data. We only consider here the two most important messages for our concerns:

• Note on – to indicate that a note is played. It contains
– a channel number, which indicates the instrument or track, specified by an integer within the set {0, 1, . . . , 15};
– a MIDI note number, which indicates the note pitch, specified by an integer within the set {0, 1, . . . , 127}; and
– a velocity, which indicates how loud the note is played28, specified by an integer within the set {0, 1, . . . , 127}.
An example is “Note on, 0, 60, 50”, which means “On channel 1, start playing a middle C with velocity 50”;
• Note off – to indicate that a note ends. In this situation, the velocity indicates how fast the note is released. An example is “Note off, 0, 60, 20”, which means “On channel 1, stop playing a middle C with velocity 20”.

Each note event is actually embedded into a track chunk, a data structure containing a delta-time value which specifies the timing information and the event itself. A delta-time value represents the time position of the event and could represent

• a relative metrical time – the number of ticks from the beginning. A reference, named the division and defined in the file header, specifies the number of ticks per quarter note; or
• an absolute time – useful for real performances, not detailed here, see [133].

An example of an excerpt from a MIDI file (turned into readable ASCII) and its corresponding score are shown in Figures 4.6 and 4.7. The division has been set to 384, i.e. 384 ticks per quarter note (which corresponds to 96 ticks for a sixteenth note).

2, 96, Note_on, 0, 60, 90
2, 192, Note_off, 0, 60, 0
2, 192, Note_on, 0, 62, 90
2, 288, Note_off, 0, 62, 0
2, 288, Note_on, 0, 64, 90
2, 384, Note_off, 0, 64, 0

Fig. 4.6 Excerpt from a MIDI file
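The following sketch parses note events written in the comma-separated textual form of Figure 4.6 (track, time in ticks, event type, channel, note number, velocity) and pairs each Note on with the corresponding Note off to recover pitches, onsets and durations; with the division of 384 ticks per quarter note, durations can then be expressed in beats. It is a simplified illustration that assumes well-formed input, not a general MIDI parser.

```python
DIVISION = 384   # ticks per quarter note, as defined in the file header of Figure 4.6

def parse_events(lines):
    """Parse lines such as '2, 96, Note_on, 0, 60, 90' into (tick, type, channel, pitch, velocity)."""
    events = []
    for line in lines:
        track, tick, kind, channel, pitch, velocity = [field.strip() for field in line.split(",")]
        events.append((int(tick), kind, int(channel), int(pitch), int(velocity)))
    return events

def to_notes(events):
    """Pair each Note_on with the following Note_off of the same pitch and channel."""
    pending, notes = {}, []
    for tick, kind, channel, pitch, velocity in events:
        if kind == "Note_on" and velocity > 0:
            pending[(channel, pitch)] = tick
        elif kind == "Note_off" or (kind == "Note_on" and velocity == 0):
            # a Note_on with velocity 0 is conventionally equivalent to a Note_off
            start = pending.pop((channel, pitch), None)
            if start is not None:
                notes.append((pitch, start, tick - start))   # (pitch, onset tick, duration in ticks)
    return notes

excerpt = [
    "2, 96, Note_on, 0, 60, 90",
    "2, 192, Note_off, 0, 60, 0",
    "2, 192, Note_on, 0, 62, 90",
    "2, 288, Note_off, 0, 62, 0",
]
for pitch, onset, duration in to_notes(parse_events(excerpt)):
    print(pitch, onset, duration / DIVISION)   # duration expressed in quarter notes
```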

In [87], Huang and Hu claim that one drawback of encoding MIDI messages directly is that it does not effectively preserve the notion of multiple notes being played at once through the use of multiple tracks. In their experiment, they concatenate tracks end-to-end and thus posit that it will be difficult for such a model to learn that multiple notes in the same position across different tracks can really be played at the same time. The piano roll, to be introduced in the next section, does not have this limitation, but at the cost of another one.

27 The standard format for humans is a musical score. 28 For a keyboard, it means the speed of pressing down the key and therefore corresponds to the volume.

Fig. 4.7 Score corresponding to the MIDI excerpt

4.7.2 Piano Roll

The piano roll representation of a melody (monophonic or polyphonic) is inspired by automated pianos (see Figure 4.8). This was a continuous roll of paper with perforations (holes) punched into it. Each perforation represents a piece of note control information, used to trigger a given note. The length of the perforation corresponds to the duration of the note, and its position along the other dimension corresponds to the pitch.

Fig. 4.8 Automated piano and piano roll. Reproduced from Yaledmot’s post “https://www.youtube.com/watch?v=QrcwR7eijyc” with permission of YouTube

An example of a modern piano roll representation (for digital music systems) is shown in Figure 4.9. The x axis represents time and the y axis the pitch.

Fig. 4.9 Example of symbolic piano roll. Reproduced from [71] with permission of Hao Staff Music Publishing (Hong Kong) Co Ltd.

There are several music environments using piano roll as a basic visual representation, in place of or as a complement to a score, as it is more intuitive than the traditional score notation29. An example is the Hao Staff piano roll [71], shown in Figure 4.9 with the time axis being horizontal rightward and notes represented as green cells. Another example is tabs, where the melody is represented in a piano roll-like format [85], as a complement to chords and lyrics. Tabs are used as an input by the MidiNet system, to be introduced in Section 6.10.3.3.

The piano roll is one of the most commonly used representations, although it has some limitations. An important one, compared to the MIDI representation, is that there is no note off information. As a result, there is no way to distinguish between a long note and a repeated short note30 (see the sketch below). In Section 4.9.1, we will look at different ways to address this limitation. For a more detailed comparison between MIDI and piano roll, see [87] and [200].
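To make the piano roll representation concrete, the following sketch (in Python with numpy; the note list and the choice of a sixteenth note time step are arbitrary illustrations, not taken from a specific system) builds a binary pitch × time matrix from (MIDI pitch, start, duration) triples. Note that, as discussed above, the resulting matrix cannot distinguish a repeated note from a held one.

import numpy as np

# Each note is a (midi_pitch, start_step, duration_in_steps) triple,
# with one time step per sixteenth note (an arbitrary choice here).
notes = [(60, 0, 4), (62, 4, 2), (64, 6, 2), (60, 8, 8)]

num_pitches = 128                       # MIDI note numbers 0..127
num_steps = max(start + dur for _, start, dur in notes)

piano_roll = np.zeros((num_pitches, num_steps), dtype=np.int8)
for pitch, start, dur in notes:
    piano_roll[pitch, start:start + dur] = 1   # mark the note as active

# Row 60 (middle C): a quarter note, then later a half note.
print(piano_roll[60])   # [1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1]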

4.7.3 Text

4.7.3.1 Melody

A melody can be encoded in a textual representation and processed as a text. A significant example is the ABC notation [202], a de facto standard for folk and traditional music31. Figures 4.10 and 4.11 show the original score and its associated ABC notation for a tune named “A Cup of Tea”, from the repository and discussion platform The Session [99]. The first six lines are the header and represent metadata: T is the title of the music, M is the meter, L is the default note length, K is the key, etc. The header is followed by the main text representing the melody. Some basic principles of the encoding rules of the ABC notation are as follows32:

• the pitch class of a note is encoded as the letter corresponding to its English notation, e.g., A for A or La;

29 Another notation, specific to guitar or string instruments, is the tablature, in which the six lines represent the strings of a guitar (four lines for a bass) and the note is specified by the number of the fret used to obtain it. 30 Actually, in the original mechanical paper piano roll, the distinction is made: two holes are different from a longer single hole. The end of the hole is the encoding of the end of the note. 31 Note that the ABC notation has been designed independently of computer music and machine learning concerns. 32 Please refer to [202] for more details.

Fig. 4.10 Score of “A Cup of Tea” (Traditional). Reproduced from The Session [99] with permission of the manager

X: 1
T: A Cup Of Tea
R: reel
M: 4/4
L: 1/8
K: Amix
|:eA (3AAA g2 fg|eA (3AAA BGGf|eA (3AAA g2 fg|1afge d2 gf:|2afge d2 cd||
|:eaag efgf|eaag edBd|eaag efge|afge dgfg:|

Fig. 4.11 ABC notation of “A Cup of Tea”. Reproduced from The Session [99] with permission of the manager

• its pitch is encoded as follows: A corresponds to A4, a to an A one octave up and a’ to an A two octaves up;
• the duration of a note is encoded as follows: if the default length is marked as 1/8 (i.e. an eighth note, the case for the “A Cup of Tea” example), a corresponds to an eighth note, a/2 to a sixteenth note and a2 to a quarter note33; and
• measures are separated by “|” (bars).

Note that the ABC notation can only represent monophonic melodies. In order to be processed by a deep learning architecture, the ABC text is usually transformed from a character vocabulary text into a token vocabulary text in order to properly consider concepts which could be noted on more than one character, e.g., g2. Sturm et al.’s experiment, described in Section 6.6.1.2, uses a token-based notation named the folk-rnn notation [179]. A tune is enclosed within a “<s>” begin mark and an “<\s>” end mark. Last, all example melodies are transposed to the same C root base, resulting in the notation of the tune “A Cup of Tea” shown in Figure 4.12. A sketch of such a tokenization is shown after Figure 4.12.

M:4/4 K:Cmix |: g c (3 c c c b 2 a b | g c (3 c c c d B B a | g c (3 c c c b 2 a b |1 c’ a b g f 2 b a :| |2 c’ a b g f 2 e f |: g c’ c’ b g a b a | g c’ c’ b g f d f | g c’ c’ b g a b g | c’ a b g f b a b :| <\s>

Fig. 4.12 Folk-rnn notation of “A Cup of Tea”. Reproduced from [179] with permission of the authors
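The following sketch (in Python; the input string is the beginning of the folk-rnn transcription of Figure 4.12) illustrates the difference between a character vocabulary and a token vocabulary: folk-rnn tokens are separated by spaces, so splitting on whitespace yields multi-character tokens such as “M:4/4” or “(3”, which can then be mapped to integer indices for the architecture. The exact preprocessing used by Sturm et al. may differ in its details.

# Beginning of the folk-rnn transcription of "A Cup of Tea" (Figure 4.12).
tune = "M:4/4 K:Cmix |: g c (3 c c c b 2 a b |"

chars = list(tune)              # character vocabulary view
tokens = tune.split()           # token vocabulary view

print(tokens)
# ['M:4/4', 'K:Cmix', '|:', 'g', 'c', '(3', 'c', 'c', 'c', 'b', '2', 'a', 'b', '|']

# Map each distinct token to an integer index (a common input encoding).
vocabulary = sorted(set(tokens))
token_to_index = {token: index for index, token in enumerate(vocabulary)}
encoded = [token_to_index[token] for token in tokens]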

33 Note that rests may be expressed in the ABC notation through the z letter. Their durations are expressed as for notes, e.g., z2 is a double length rest.

4.7.3.2 Chord and Polyphony

When represented extensionally, chords are usually encoded with simultaneous notes as a vector. An interesting alternative extensional representation of chords, named Chord2Vec34, has recently been proposed in [122]35. Rather than thinking of chords (vertically) as vectors, it represents chords (horizontally) as sequences of constituent notes. More precisely,

• a chord is represented as an ordered sequence of notes of arbitrary length; and
• chords are separated by a special symbol, as with sentence markers in natural language processing.

When using this representation for predicting neighboring chords, a specific compound architecture is used, named RNN Encoder-Decoder, which will be described in Section 6.10.2.3. Note that a somewhat similar model is also used for polyphonic music generation by the BachBot system [119], which will be introduced in Section 6.17.1. In this model, for each time step, the various notes (ordered in a descending pitch) are represented as a sequence and a special delimiter symbol “|||” indicates the next time frame. A minimal sketch of this sequence-based view of chords is shown below.
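The following sketch (in Python) flattens a few chords into a single sequence of notes with separator symbols. The chord list, the MIDI note numbers and the “SEP” separator symbol are illustrative assumptions, not the actual Chord2Vec vocabulary.

# Each chord is given extensionally as a list of MIDI note numbers.
chords = [[60, 64, 67],          # C major
          [62, 65, 69],          # D minor
          [55, 59, 62, 65]]      # G dominant seventh

SEPARATOR = "SEP"                # plays the role of a sentence marker

# Flatten the chords (each one as an ordered sequence of notes of
# arbitrary length) into a single sequence with separator symbols.
sequence = []
for chord in chords:
    sequence.extend(sorted(chord))
    sequence.append(SEPARATOR)

print(sequence)
# [60, 64, 67, 'SEP', 62, 65, 69, 'SEP', 55, 59, 62, 65, 'SEP']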

4.7.4 Markup Language

Let us mention the case of general text-based structured representations based on markup languages (famous examples are HTML and XML). Some markup languages have been designed for music applications, like for instance the open standard MusicXML [62]. The motivation is to provide a common format to facilitate the sharing, exchange and storage of scores by musical software systems (such as score editors and sequencers). MusicXML, as well as similar languages, is not intended for direct use by humans because of its verbosity, which is the down side of its richness and effectiveness as an interchange language. Furthermore, it is not very appropriate as a direct representation for machine learning tasks for the same reasons, as its verbosity and richness would create too much overhead as well as bias.

4.7.5 Lead Sheet

Lead sheets are an important representation format for popular music (jazz, pop, etc.). A lead sheet conveys in up to a few pages the score of a melody and its corresponding chord progression via an intensional notation36. Lyrics may also be added. Some important information for the performer, such as the composer, the author, the style and the tempo, is often also present. An example of a lead sheet is shown in Figure 4.13. Paradoxically, few systems and experiments use this rich and concise representation, and most of the time they focus on the notes. Note that Eck and Schmidhuber’s Blues generation system, to be introduced in Section 6.5.1.1, outputs a combination of melody and chord progression, although not as an explicit lead sheet. A notable contribution is the systematic encoding of lead sheets done in the Flow Machines project [49], resulting in the Lead Sheet Data Base (LSDB) repository [147], which includes more than 12,000 lead sheets. Note that there are some alternative notations, notably tabs [85], where the melody is represented in a piano roll-like format (see Section 4.7.2) and complemented with the corresponding chords. An example of the use of tabs is the MidiNet system to be analyzed in Section 6.10.3.3.

34 Chord2Vec is inspired by the Word2Vec model for natural language processing [130]. 35 For information, there is another similar model, also named Chord2Vec, proposed in [88]. 36 See Section 4.5.4.

Fig. 4.13 Lead sheet of “Very Late” (Pachet and d’Inverno). Reproduced with permission of the composers

4.8 Temporal Scope and Granularity

The representation of time is fundamental for musical processes.

4.8.1 Temporal Scope

An initial design decision concerns the temporal scope of the representation used for the generation data and for the generated data, that is the way the representation will be interpreted by the architecture with respect to time, as illustrated in Figure 4.14:

• Global – in this first case, the temporal scope of the representation is the whole musical piece. The deep network architecture (typically a feedforward or an autoencoder architecture, see Sections 5.5 and 5.6) will process the input and produce the output within a global single step37. Examples are the MiniBach and DeepHear systems introduced in Sections 6.2.2 and 6.4.1.1, respectively.
• Time step (or time slice) – in this second case, the most frequent one, the temporal scope of the representation is a local time slice of the musical piece, corresponding to a specific temporal moment (time step). The granularity of the processing by the deep network architecture (typically a recurrent network) is a time step and generation is iterative38. Note that the time step is usually set to the shortest note duration (see more details in Section 4.8.2), but it may be larger, e.g., set to a measure in the system discussed in [190].
• Note step – this third case was proposed by Mozer in his CONCERT system [139], see Section 6.6.1.1. In this approach there is no fixed time step. The granularity of processing by the deep network architecture is a note. This strategy uses a distributed encoding of duration that allows a note of any duration to be processed in a single network processing step. Note that, by considering a note rather than a time step as a single processing step, the number of processing steps to be bridged by the network is greatly reduced. The approach proposed later on by Walder in [200] is similar.

Fig. 4.14 Temporal scope for a piano roll-like representation

Note that a global temporal scope representation actually also considers time steps (separated by dashed lines in Figure 4.14). However, although time steps are present at the representation level, they will not be interpreted as distinct processing steps by the neural network architecture. Basically, the encoding of the successive time slices will be concatenated into a global representation considered as a whole by the network, as shown in Figure 6.2 for an example to be introduced in Section 6.2.2.

37 In Chapter 6, we will name it the single-step feedforward strategy, see Section 6.2.1. 38 In Chapter 6, we will name it the iterative feedforward strategy, see Section 6.5.1.

Also note that in the case of a global temporal scope the musical content generated has a fixed length (the number of time steps), whereas in the case of a time step or a note step temporal scope the musical content generated has an arbitrary length, because generation is iterative as we will see in Section 6.5.1.

4.8.2 Temporal Granularity

In the case of a global or a time step temporal scope, the granularity of the time step, corresponding to the granularity of the time discretization, must be defined. There are two main strategies:

• The most common strategy is to set the time step to a relative duration, the smallest duration of a note in the corpus (training examples/dataset), e.g., a sixteenth note. To be more precise, as stated by Todd in [190], the time step should be the greatest common factor of the durations of all the notes to be learned. This ensures that the duration of every note will be properly represented with a whole number of time steps (see the sketch below for a way to compute it). One immediate consequence of this “leveling down” is an increase in the number of processing steps necessary, independent of the duration of the actual notes.
• Another strategy is to set the time step to a fixed absolute duration, e.g., 10 milliseconds. This strategy permits us to capture expressiveness in the timing of each note during a human performance, as we will see in Section 4.10.

Note that in the case of a note step temporal scope, there is no uniform discretization of time (no fixed time step) and no need for one.
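Following Todd's recommendation quoted above, the time step may be computed as the greatest common factor (divisor) of all note durations in the corpus. A minimal sketch follows (in Python; durations are assumed to be expressed in MIDI ticks with a division of 384 ticks per quarter note, as in the example of Section 4.7.1).

from functools import reduce
from math import gcd

# Note durations in ticks (384 ticks per quarter note): quarter notes (384),
# eighth notes (192) and a dotted quarter note (576).
durations = [384, 192, 384, 576, 192]

time_step = reduce(gcd, durations)      # greatest common factor
print(time_step)                        # 192, i.e. an eighth note

# Every duration is then a whole number of time steps.
print([d // time_step for d in durations])   # [2, 1, 2, 3, 1]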

4.9 Metadata

In some systems, additional information from the score may also be explicitly represented and used as metadata, such as

• note tie39,
• fermata,
• harmonics,
• key,
• meter, and
• the instrument associated to a voice.

This extra information may lead to more accurate learning and generation.

4.9.1 Note Hold/Ending

An important issue is how to represent whether a note is held, i.e. tied to the previous note. This is actually equivalent to the issue of how to represent the ending of a note. In the MIDI representation format, the end of a note is explicitly stated (via a “Note off” event40). In the piano roll format discussed in Section 4.7.2, there is no explicit representation of the ending of a note and, as a result, one cannot distinguish between two repeated quarter notes and a half note. The main possible techniques are

39 A tied note on a music score specifies how a note duration extends across a single measure. In our case, the issue is how to specify that the duration extends across a single time step. Therefore, we consider it as metadata information, as it is specific to the representation and its processing by a neural network architecture.
40 Note that, in MIDI, a “Note on” message with a null (0) velocity is interpreted as a “Note off” message.

• to introduce a hold/replay representation, as a dual representation of the sequence of notes. This solution is used, for example, by Mao et al. in their DeepJ system [127] (to be analyzed in Section 6.10.3.4), by introducing a replay matrix similar to the piano roll-type matrix of notes;
• to divide the size of the time step41 by two and always mark a note ending with a special tag, e.g., 0. This solution is used, for example, by Eck and Schmidhuber in [43], and will be analyzed in Section 6.5.1.1;
• to divide the size of the time step as before but instead mark a new note beginning. This solution is used by Todd in [190]; or
• to use a special hold symbol “__” in place of a note to specify when the previous note is held. This solution was proposed by Hadjeres et al. in their DeepBach system [70], to be analyzed in Section 6.14.2.

Fig. 4.15 a) Extract from a J. S. Bach chorale and b) its representation using the hold symbol “__”. Reproduced from [70] with permission of the authors

This last solution considers the hold symbol as a note, see an example in Figure 4.15. The advantages of the hold symbol technique are

• it is simple and uniform as the hold symbol is considered as a note; and
• there is no need to divide the value of the time step by two and mark a note ending or beginning.

The authors of DeepBach also emphasize that the good results they obtain using Gibbs sampling rely exclusively on their choice to integrate the hold symbol into the list of notes (see [70] and Section 6.14.2). An important limitation is that the hold symbol only applies to the case of a monophonic melody, that is, it cannot directly express held notes in an unambiguous way in the case of a single-voice polyphony. In this case, the single-voice polyphony must be reformulated into a multivoice representation with each voice being a monophonic melody; then a hold symbol is added separately for each voice. Note that in the case of the replay matrix, the additional information (matrix row) is for each possible note and not for each voice. We will discuss in Section 4.11.7 how to encode a hold symbol. A sketch of how a melody may be expanded into a hold symbol representation is shown below.
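The following sketch (in Python) illustrates the hold symbol technique for a monophonic melody: each note, given as a (name, duration in time steps) pair, is expanded into one token for its first time step followed by hold symbols for the remaining steps. The “__” hold symbol and the note names are illustrative; DeepBach's actual data pipeline is more elaborate.

HOLD = "__"      # special hold symbol, considered as a note

# Melody as (note name, duration in sixteenth note time steps) pairs:
# a quarter note D4, an eighth note E4, then a half note D4.
melody = [("D4", 4), ("E4", 2), ("D4", 8)]

tokens = []
for note, duration in melody:
    tokens.append(note)                      # onset of the note
    tokens.extend([HOLD] * (duration - 1))   # held for the remaining steps

print(tokens)
# ['D4', '__', '__', '__', 'E4', '__', 'D4', '__', '__', '__', '__', '__', '__', '__']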

4.9.2 Note Denotation (versus Enharmony)

Most systems consider enharmony, i.e. in the tempered system A♯ is enharmonically equivalent to (i.e. has the same pitch as) B♭, although harmonically and in the composer’s intention they are different. An exception is the DeepBach

41 See Section 4.8.2 for details of how the value of the time step is defined.

system, described in Section 6.14.2, which encodes notes using their real names and not their MIDI note numbers. The authors of DeepBach state that this additional information leads to a more accurate model and better results [70].

4.9.3 Feature Extraction

Although deep learning is good at processing raw unstructured data, from which its hierarchy of layers will extract higher-level representations adapted to the task (see Section 1.1.4), some systems include a preliminary step of automatic feature extraction, in order to represent the data in a more compact, characteristic and discriminative form. One motivation could be to gain efficiency and accuracy for the training and for the generation. Moreover, this feature-based representation is also useful for indexing data, in order to control generation through compact labeling (see, for example, the DeepHear system in Section 6.4.1.1), or for indexing musical units to be queried and concatenated (see Section 6.10.7.1). The set of features can be defined manually (handcrafted) or automatically (e.g. by an autoencoder, see Section 5.6). In the case of handcrafted features, the bag-of-words (BOW) model is a common strategy for natural language text processing, which may also be applied to other types of data, including musical data, as we will see in Section 6.10.7.1. It consists in transforming the original text (or arbitrary representation) into a “bag of words” (the vocabulary composed of all occurring words, or more generally speaking, all possible tokens); then various measures can be used to characterize the text. The most common is term frequency, i.e. the number of times a term appears in the text42 (see the sketch below). Sophisticated methods have been designed for neural network architectures to automatically compute a vector representation which preserves, as much as possible, the relations between the items. Vector representations of words are named word embeddings43. A recent reference model for natural language processing (NLP) is the Word2Vec model [130]. It has recently been transposed to the Chord2Vec model for the vector encoding of chords, as described in [122] (see Section 4.5.4).
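As an illustration of the term frequency measure of the bag-of-words model applied to a musical text, the following sketch (in Python; the input is the beginning of the folk-rnn transcription of Figure 4.12) simply counts occurrences of each token.

from collections import Counter

tokens = "M:4/4 K:Cmix |: g c (3 c c c b 2 a b |".split()

term_frequency = Counter(tokens)    # bag of words with term frequencies
print(term_frequency['c'], term_frequency['b'])   # 4 2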

4.10 Expressiveness

4.10.1 Timing

If training examples are processed from conventional scores or MIDI-format libraries, there is a good chance that the music is perfectly quantized – i.e., note onsets44 are exactly aligned onto the metrical grid – resulting in a mechanical sound without expressiveness. One approach is to consider symbolic records – in most cases recorded directly in MIDI – from real human performances, with the musician interpreting the tempo. An example of a system for this purpose is Performance RNN [174], which will be analyzed in Section 6.7.1. It follows the absolute time duration quantization strategy, presented in Section 4.8.2.

42 Note that this bag-of-words representation is a lossy representation (i.e. without effective means to perfectly reconstruct the original data representation). 43 The term embedding comes from the analogy with mathematical embedding, which is an injective and structure-preserving mapping. Initially used for natural language processing, it is now often used in deep learning as a general term for encoding a given representation into a vector representation. Note that the term embedding, which is an abstract model representation, is often also used (we think, abusively) to define a specific instance of an embedding (which may be better named, for example, a label, see [180] and Section 6.4.1.1). 44 An onset refers to the beginning of a musical note (or sound).

4.10.2 Dynamics

Another common limitation is that many MIDI-format libraries do not include dynamics (the volume of the sound produced by an instrument), which stays fixed throughout the whole piece. One option is to take into consideration (if present on the score) the annotations made by the composer about the dynamics, from pianissimo ppp to fortissimo fff, see Section 4.5.1. As for tempo expressiveness, addressed in Section 4.10.1, another option is to use real human performances, recorded with explicit dynamics variation – the velocity field in MIDI.

4.10.3 Audio

Note that in the case of an audio representation, expressiveness as well as tempo and dynamics are entangled within the whole representation. Although it is easy to control the global dynamics (global volume), it is less easy to separately control the dynamics of a single instrument or voice45.

4.11 Encoding

Once the format of a representation has been chosen, the issue still remains of how to encode this representation. The encoding of a representation (of a musical content) consists in the mapping of the representation (composed of a set of variables, e.g., pitch or dynamics) into a set of inputs (also named input nodes or input variables) for the neural network architecture46.

4.11.1 Strategies

At first, let us consider the main possible types for a variable:

• Continuous variables – an example is the pitch of a note defined by its frequency in Hertz, that is a real value within the ]0, +∞[ interval47. The straightforward way is to directly encode the variable48 as a scalar whose domain is real values. We call this strategy value encoding.
• Discrete integer variables – an example is the pitch of a note defined by its MIDI note number, that is an integer value within the {0, 1, ..., 127} discrete set49. The straightforward way is to encode the variable as a real value scalar, by casting the integer into a real. This is another case of value encoding.
• Boolean (binary) variables – an example is the specification of a note ending (see Section 4.9.1). The straightforward way is to encode the variable as a real value scalar, with two possible values: 1 (for true) and 0 (for false).

45 More generally speaking, audio source separation, often coined as the cocktail party effect, has been known for a long time to be a very difficult problem, see the original article in [19]. Interestingly, this problem has been solved in 2015 by deep learning architectures [46], opening up ways for disentangling instruments or voices and their relative dynamics as well as tempo (by using audio time stretching techniques). 46 See Section 5.4 for more details about the input nodes of a neural network architecture. 47 The notation ]0,+∞[ is for an open interval excluding its endpoints. An alternative notation is (0,+∞). 48 In practice, the different variables are also usually scaled and normalized, in order to have similar domains of values ([0,1] or [−1,+1]) for all input variables, in order to ease learning convergence. 49 See our summary of MIDI specification in Section 4.7.1.

• Categorical variables50 – an example is a component of a drum kit; an element within a set of possible values: {snare, high-hat, kick, middle-tom, ride-cymbal, etc.}. The usual strategy is to encode a categorical variable as a vector having as its length the number of possible elements, in other words the cardinality of the set of possible values. Then, in order to represent a given element, the corresponding element of the encoding vector is set to 1 and all other elements to 0. Therefore, this encoding strategy is usually called one-hot encoding51. This frequently used strategy is also often employed for encoding discrete integer variables, such as MIDI note numbers.

4.11.2 From One-Hot to Many-Hot and to Multi-One-Hot

Note that a one-hot encoding of a note corresponds to a time slice of a piano roll representation (see Figure 4.9), with as many lines as there are possible pitches. Note also that while a one-hot encoding of a piano roll representation of a monophonic melody (with one note at a time) is straightforward, a one-hot encoding of a polyphony (with simultaneous notes, as for a guitar playing a chord) is not. One could then consider

• many-hot encoding – where all elements of the vector corresponding to the notes or to the active components are set to 1;
• multi-one-hot encoding – where different voices or tracks are considered (for multivoice representation, see Section 4.6) and a one-hot encoding is used for each different voice/track; or
• multi-many-hot encoding – which is a multivoice representation with simultaneous notes for at least one or all of the voices.

4.11.3 Summary

The various approaches for encoding are illustrated in Figure 4.16, showing from left to right

• a scalar continuous value encoding of A4 (A440), the real number specifying its frequency in Hertz;
• a scalar discrete integer value encoding of A4, the integer number specifying its MIDI note number52;
• a one-hot encoding of A4;
• a many-hot encoding of a D minor chord (D4, F4, A4);
• a multi-one-hot encoding of a first voice with A4 and a second voice with D3; and
• a multi-many-hot encoding of a first voice with a D minor chord (D4, F4, A4) and a second voice with C3 (corresponding to a on bass).

A sketch of the one-hot, many-hot and multi-one-hot encoding strategies is shown below.
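The following sketch (in Python with numpy) illustrates the one-hot, many-hot and multi-one-hot strategies for some of the examples of Figure 4.16, using MIDI note numbers as indices into vectors of length 128 (an illustrative choice; actual systems often restrict the vector to the pitch range of the corpus).

import numpy as np

NUM_NOTES = 128                       # MIDI note numbers 0..127

def one_hot(note):
    """One-hot encoding of a single note (by its MIDI note number)."""
    vector = np.zeros(NUM_NOTES)
    vector[note] = 1.0
    return vector

def many_hot(notes):
    """Many-hot encoding of simultaneous notes, e.g., a chord."""
    vector = np.zeros(NUM_NOTES)
    vector[list(notes)] = 1.0
    return vector

# One-hot encoding of A4 (MIDI note number 69).
a4 = one_hot(69)

# Many-hot encoding of a D minor chord (D4, F4, A4) = (62, 65, 69).
d_minor = many_hot([62, 65, 69])

# Multi-one-hot encoding of a first voice with A4 and a second voice
# with D3 (50): the two one-hot vectors are simply concatenated.
two_voices = np.concatenate([one_hot(69), one_hot(50)])

print(int(a4.sum()), int(d_minor.sum()), two_voices.shape)   # 1 3 (256,)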

4.11.4 Binning

In some cases, a continuous variable is transformed into a discrete domain. A common technique, named binning, or also bucketing, consists of

• dividing the original domain of values into smaller intervals53, named bins; and

50 In statistics, a categorical variable is a variable that can take one of a limited – and usually fixed – number of possible values. In computer science it is usually referred to as an enumerated type. 51 The name comes from digital circuits, one-hot referring to a group of bits among which the only legal (possible) combinations of values are those with a single high (hot!) (1) bit, all the others being low (0). 52 Note that, because the processing level of an artificial neural network only considers real values, an integer value will be cast into a real value. Thus, the case of a scalar integer value encoding boils down to the previous case of a scalar continuous value encoding. 53 This can be automated through a learning process, e.g., by automatic construction of a decision tree.

Fig. 4.16 Various types of encoding

• replacing each bin (and the values within it) by a value representative, often the central value.

Note that this binning technique may also be used to reduce the cardinality of the discrete domain of a variable. An example is the Performance RNN system described in Section 6.7.1, for which the initial MIDI set of 127 values for note dynamics is reduced into 32 bins (see the sketch below).
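As an illustration of binning, the following sketch (in Python) reduces the MIDI velocity range (taken here as 0..127) to 32 bins, as in the Performance RNN example mentioned above, and replaces each value by the central value of its bin. The exact bin boundaries used by Performance RNN may differ.

NUM_BINS = 32
BIN_WIDTH = 128 / NUM_BINS            # 4 velocity values per bin

def velocity_to_bin(velocity):
    """Map a MIDI velocity (0..127) to a bin index (0..31)."""
    return int(velocity // BIN_WIDTH)

def bin_to_velocity(bin_index):
    """Replace a bin by its representative (central) value."""
    return int(bin_index * BIN_WIDTH + BIN_WIDTH / 2)

print(velocity_to_bin(90))            # 22
print(bin_to_velocity(22))            # 90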

4.11.5 Pros and Cons

In general, value encoding is rarely used except for audio, whereas one-hot encoding is the most common strategy for symbolic representation54. A counterexample is the case of the DeepJ symbolic generation system described in Section 6.10.3.4, which is, in part, inspired by the WaveNet audio generation system. DeepJ’s authors state that: “We keep track of the dynamics of every note in an N x T dynamics matrix that, for each time step, stores values of each note’s dynamics scaled between 0

54 Let us recall (as pointed out in Section 4.2) that, at the level of the encoding of a representation and its processing by a deep network, the distinction between audio and symbolic representation boils down to nothing, as only numerical values and operations are considered. In fact the general principles of a deep learning architecture are independent of that distinction and this is one of the vectors of the generality of the approach. See also in [126] the example of an architecture (to be introduced in Section 6.10.3.2) which combines audio and symbolic representations.

and 1, where 1 denotes the loudest possible volume. In our preliminary work, we also tried an alternate representation of dynamics as a categorical value with 128 bins as suggested by Wavenet [194]. Instead of predicting a scalar value, our model would learn a multinomial distribution of note dynamics. We would then randomly sample dynamics during generation from this multinomial distribution. Contrary to Wavenet’s results, our experiments concluded that the scalar representation yielded results that were more harmonious.” [127].

The advantage of value encoding is its compact representation, at the cost of sensitivity to numerical operations (approximations). The advantage of one-hot encoding is its robustness (discrete versus analog), at the cost of a high cardinality and therefore a potentially large number of inputs. It is also important to understand that the choice of one-hot encoding at the output of the network architecture is often (albeit not always) associated with a softmax function55 in order to compute the probabilities of each possible value, for instance the probability of a note being an A, or an A♯, a B, a C, etc. This actually corresponds to a classification task between the possible values of the categorical variable. This will be further analyzed in Section 5.5.3.

4.11.6 Chords

Two methods of encoding chords, corresponding to the two main alternative representations discussed in Section 4.5.4, are

• implicit and extensional – enumerating the exact notes composing the chord. The natural encoding strategy is many-hot. An example is the RBM-based polyphonic music generation system described in Section 6.4.2.3; and
• explicit and intensional – using a chord symbol combining a pitch class and a type (e.g., D minor). The natural encoding strategy is multi-one-hot, with an initial one-hot encoding of the pitch class and a second one-hot encoding of the chord type (major, minor, dominant seventh, etc.). An example is the MidiNet system56 described in Section 6.10.3.3.

4.11.7 Special Hold and Rest Symbols

We have to consider the case of special symbols for hold (“hold previous note”, see Section 4.9.1) and rest (“no note”, see Section 4.5.2) and how they relate to the encoding of actual notes. First, note that there are some rare cases where the rest is actually implicit:

• in MIDI format – when there is no “active” “Note on”, that is when they all have been “closed” by a corresponding “Note off”; and
• in one-hot encoding – when all elements of the vector encoding the possible notes are equal to 0 (i.e. a “zero-hot” encoding, meaning that none of the possible notes is currently selected). This is for instance the case in the experiments by Todd (to be described in Section 6.8.1)57.

Now, let us consider how to encode hold and rest depending on how a note pitch is encoded:

• value encoding – In this case, one needs to add two extra boolean variables (and their corresponding input nodes) hold and rest. This must be done for each possible independent voice in the case of a polyphony; or

55 Introduced in Section 5.5.3. 56 In MidiNet, the possible chord types are actually reduced to only major and minor. Thus, a boolean variable can be used in place of one-hot encoding. 57 This may appear at first as an economical encoding of a rest, but at the cost of some ambiguity when interpreting probabilities (for each possible note) produced by the softmax output of the network architecture. A vector with low probabilities for each note may be interpreted as a rest or as an equiprobability between notes. See the threshold trick proposed in Section 6.8.1 in order to discriminate between the two possible interpretations.

• one-hot encoding – In that case (the most frequent and manageable strategy), one just needs to extend the vocabulary of the one-hot encoding with two additional possible values: hold and rest. They will be considered at the same level, and of the same nature, as possible notes (e.g., A3 or C4) for the input as well as for the output.

4.11.8 Drums and Percussion

Some systems explicitly consider drums and/or percussion. A drum or percussion kit is usually modeled as a single-track polyphony by considering distinct simultaneous “notes”, each “note” corresponding to a drum or percussion component (e.g., snare, kick, bass tom, hi-hat, ride cymbal, etc.), that is as a many-hot encoding. An example of a system dedicated to rhythm generation is described in Section 6.10.3.1. It follows the single-track polyphony approach. In this system, each of the five components is represented through a binary value, specifying whether or not there is a related event for the current time step. Drum events are represented as a binary word58 of length 5, where each binary value corresponds to one of the five drum components; for instance, 10010 represents simultaneous playing of the kick (bass drum) and the high-hat, following a many-hot encoding (see the sketch below). Note that this system also includes – as an additional voice/track – a condensed representation of the bass line part and some information representing the meter, see more details in Section 6.10.3.1. The authors [123] argue that this extra explicit information ensures that the network architecture is aware of the beat structure at any given point. Another example is the MusicVAE system (see Section 6.12.1), where nine different drum/percussion components are considered, which gives 2⁹ possible combinations, i.e. 2⁹ = 512 different tokens.
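The following sketch (in Python) illustrates the many-hot binary word encoding of drum events described above, for five components; the ordering of the components within the word is an illustrative assumption.

# Ordering of the five drum components within the binary word
# (an illustrative assumption).
COMPONENTS = ["kick", "snare", "middle-tom", "high-hat", "ride-cymbal"]

def drum_word(active):
    """Many-hot encoding of a set of active components as a binary word."""
    return "".join("1" if c in active else "0" for c in COMPONENTS)

# Simultaneous playing of the kick and the high-hat.
print(drum_word({"kick", "high-hat"}))     # 10010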

4.12 Dataset

The choice of a dataset is fundamental for good music generation. At first, a dataset should be of sufficient size (i.e. contain a sufficient number of examples) to guarantee accurate learning59. As noted by Hadjeres in [67]: “I believe that this tradeoff between the size of a dataset and its coherence is one of the major issues when building deep generative models. If the dataset is very heterogeneous, a good generative model should be able to distinguish the different subcategories and manage to generalize well. On the contrary, if there are only slight differences between subcategories, it is important to know if the “averaged model” can produce musically-interesting results.”

4.12.1 Transposition and Alignment

A common technique in machine learning is to generate synthetic data as a way to artificially augment the size of the dataset (the number of training examples)60, in order to improve accuracy and generalization of the learnt model (see Section 5.5.10). In the musical domain, a natural and easy way is transposition, i.e. to transpose all examples in all keys. In addition to artificially augmenting the dataset, this provides a key (tonality) invariance of all examples and thus makes the examples more generic. Moreover, this also reduces sparsity in the training data. This transposition technique is, for instance, used in the C-RBM system [109] described in Section 6.10.5.1. An alternative approach is to transpose (align) all examples into a single common key. This has been advocated for the RNN-RBM system [11] to facilitate learning, see Section 6.9.1.
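A minimal sketch of transposition-based dataset augmentation follows (in Python): each example, given as a list of MIDI note numbers, is transposed by every interval within an octave, keeping only transpositions that stay within the MIDI pitch range. Actual systems typically restrict the transpositions to a meaningful pitch range for the corpus.

def transpositions(melody, semitone_range=range(-6, 6)):
    """Return copies of a melody (list of MIDI note numbers) transposed
    by each interval in semitone_range, skipping out-of-range results."""
    augmented = []
    for shift in semitone_range:
        shifted = [note + shift for note in melody]
        if all(0 <= note <= 127 for note in shifted):
            augmented.append(shifted)
    return augmented

melody = [60, 62, 64, 65, 67]          # C4 D4 E4 F4 G4
augmented_dataset = transpositions(melody)
print(len(augmented_dataset))          # 12 transposed copies (including 0)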

58 In this system, encoding is made in text, similar to the format described in Section 4.7.3 and more precisely following the approach proposed in [22]. 59 Neural networks and deep learning architectures need lots of examples to function properly. However, one recent research area is about learning from scarce data. 60 This is named dataset augmentation.

4.12.2 Datasets and Libraries

A practical issue is the availability of datasets for training systems and also for evaluating and comparing systems and approaches. There are some reference datasets in the image domain (e.g., the MNIST61 dataset about handwritten digits [113]), but none yet in the music domain. However, various datasets or libraries62 have been made public, with some examples listed below:

• the Classical piano MIDI database [105];
• the JSB Chorales dataset63 [1];
• the LSDB (Lead Sheet Data Base) repository [147], with more than 12,000 lead sheets (including from all jazz and bossa nova song books), developed within the Flow Machines project [49];
• the MuseData library, an electronic library of classical music with more than 800 pieces, from CCARH at Stanford University [77];
• the MusicNet dataset [188], a collection of 330 freely-licensed classical music recordings together with over 1 million annotated labels (indicating timing and instrumental information);
• the Nottingham database, a collection of 1,200 folk tunes in the ABC notation [54], each tune consisting of a simple melody on top of chords, in other words an ABC equivalent of a lead sheet;
• The Session [99], a repository and discussion platform for Celtic music in the ABC notation containing more than 15,000 songs;
• the Symbolic Music dataset by Walder [201], a huge set of cleaned and preprocessed MIDI files;
• the TheoryTab database [85], a set of songs represented in a tab format, a combination of a piano roll melody, chords and lyrics, in other words a piano roll equivalent of a lead sheet; and
• the Yamaha e-Piano Competition dataset, in which participants’ MIDI performance records are made available [210].

61 MNIST stands for Modified National Institute of Standards and Technology. 62 The difference between a dataset and a library is that a dataset is almost ready for use to train a neural network architecture, as all examples are encoded within a single file and in the same format, although some extra data processing may be needed in order to adapt the format to the encoding of the representation for the architecture or vice-versa; whereas a library is usually composed of a set of files, one for each example. 63 Note that this dataset uses a quarter note quantization, whereas a smaller quantization at the level of a sixteenth note should be used in order to capture the smallest note duration (eighth note), see Section 4.9.1.

Chapter 5 Architecture

Deep networks are a natural evolution of neural networks, themselves being an evolution of the Perceptron, proposed by Rosenblatt in 1957 [166]. Historically speaking1, the Perceptron was criticized by Minsky and Papert in 1969 [131] for its inability to classify nonlinearly separable domains2. Their criticism also served in favoring an alternative approach of Artificial Intelligence, based on symbolic representations and reasoning.

Fig. 5.1 Example and counterexample of linear separability

Neural networks reappeared in the 1980s, thanks to the idea of hidden layers combined with nonlinear units, to resolve the initial linear separability limitation, and to the backpropagation algorithm, to train such multilayer neural networks [167]. In the 1990s, neural networks suffered declining interest3 because of the difficulty in efficiently training neural networks with many layers4 and because of the competition from support vector machines (SVMs) [197], which were efficiently designed to maximize the separation margin and had a solid formal background. An important advance was the invention of the pre-training technique5 by Hinton et al. in 2006 [80], which resolved this limitation. In 2012, an image recognition competition (the ImageNet Large Scale Visual Recognition Challenge

1 See, for example, [63, Section 1.2] for a more detailed analysis of key trends in the history of deep learning. 2 A simple example and a counterexample of linear separability (of a set of four points within a 2-dimensional space and belonging to green cross or red circle classes) are shown in Figure 5.1. The elements of the two classes are linearly separable if there is at least one straight line separating them. Note that the discrete version of the counterexample corresponds to the case of the exclusive or (XOR) logical operator, which was used as an argument by Minsky and Papert in [131]. 3 Meanwhile, convolutional networks started to gain interest, notably through handwritten digit recognition applications [112]. As Goodfellow et al. in [63, Section 9.11] put it: “In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.” 4 Another related limitation, although specific to the case of recurrent networks, was the difficulty in training them efficiently on very long sequences. This was resolved in 1997 by Hochreiter and Schmidhuber with the Long short-term memory (LSTM) architecture [83], presented in Section 5.8.3. 5 Pre-training consists in prior training in cascade (one layer at a time, also named greedy layer-wise unsupervised training) of each hidden layer [80] [63, page 528]. It turned out to be a significant improvement for the accurate training of neural networks with several layers [47].

[168]) was won by a deep neural network algorithm named AlexNet6, with a stunning margin7 over the other algorithms which were using handcrafted features. This striking victory was the event which ended the prevalent opinion that neural networks with many hidden layers could not be efficiently trained8.

5.1 Introduction to Neural Networks

The purpose of this section is to review, or to introduce, the basic principles of artificial neural networks. Our objective is to define the key concepts and terminology that we will use when analyzing various music generation systems. Then, we will introduce the concepts and basic principles of various derived architectures, like autoencoders, recurrent networks, RBMs, etc., which are used in musical applications. We will not describe the techniques of neural networks and deep learning extensively; they are covered in, for example, the recent book [63].

5.1.1 Linear Regression

Although bio-inspired (biological neurons), the foundation of neural networks and deep learning is linear regression. In statistics, linear regression is an approach for modeling the (assumed linear) relationship between a scalar variable y ∈ IR and one or more explanatory variables9 x1 ... xn, with xi ∈ IR, jointly noted as vector x. A simple example is to predict the value of a house, depending on some factors (e.g., size, height, location...). Equation 5.1 gives the general model of a (multiple) linear regression, where

h(x) = b + θ1 x1 + ... + θn xn = b + ∑_{i=1}^{n} θi xi    (5.1)

• h is the model, also named hypothesis, as this is the hypothetical best model to be discovered, i.e. learnt;
• b is the bias10, representing the offset; and
• θ1 ... θn are the parameters of the model, the weights, corresponding to the explanatory variables x1 ... xn.

A numerical sketch of this model is shown below.
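The following sketch (in Python with numpy) computes Equation 5.1 for arbitrary illustrative values of the weights, the bias and one input.

import numpy as np

theta = np.array([0.5, -1.0, 2.0, 0.1])   # weights θ1 ... θn (here n = 4)
b = 3.0                                   # bias

def h(x):
    """Linear regression model h(x) = b + sum_i θi xi."""
    return b + np.dot(theta, x)

x = np.array([1.0, 2.0, 0.5, 4.0])        # one example of explanatory variables
print(h(x))                               # 3 + 0.5 - 2 + 1 + 0.4 = 2.9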

5.1.2 Notations

We will use the following simple notation conventions

• a constant is in roman (straight) font, e.g., integer 1 and note C4.
• a variable of a model is in roman font, e.g., input variable x and output variable y (possibly vectors).
• a parameter of a model is in italics, e.g., bias b, weight parameter θ1, model function h, number of explanatory variables n and index i of a variable xi.

That said, pre-training is now rarely used and has been replaced by other more recent techniques, such as batch normalization and deep residual learning. But its underlying techniques are useful for addressing some new concerns like transfer learning, which deals with the issue of reusability (of what has been learnt, see Section 8.3). 6 AlexNet was designed by the SuperVision team headed by Hinton and composed of Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton [104]. AlexNet is a deep convolutional neural network with 60 million parameters and 650,000 neurons, consisting of five convolutional layers, some followed by max-pooling layers, and three globally-connected layers. 7 On the first task, AlexNet won the competition with a 15% error rate whereas other teams did not achieve better than a 26% error rate. 8 Interestingly, the title of Hinton et al.’s article about pre-training [80] is about “deep belief nets” and does not mention the term “neural nets” because, as Hinton remembers it in [106]: “At that time, there was a strong belief that deep neural networks were no good and could never be trained and that ICML (International Conference on Machine Learning) should not accept papers about neural networks.” 9 The case of one explanatory variable is called simple linear regression, otherwise it is named multiple linear regression. 10 It could also be notated as θ0, see Section 5.1.5.

• a probability as well as a probability distribution are in italics and upper case, e.g., probability P(note = A4) that the value of variable note is A4 and probability distribution P(note) of variable note over all possible notes (outcomes).

5.1.3 Model Training

The purpose of training a linear regression model is to find the values for each weight θi and the bias b that best fit the actual training data/examples, i.e. various pairs of values (x, y). In other words, we want to find the parameters and bias values such that for all values of x, h(x) is as close as possible11 to y, according to some measure named the cost. This measure represents the distance between h(x) (the prediction, also notated ŷ) and y (the actual ground truth value), for all examples. The cost, also named the loss, is usually notated Jθ(h)12 and could be measured, for example, by a mean squared error (MSE), which measures the average squared difference, as shown in Equation 5.2, where m is the number of examples and (x(i), y(i)) is the ith example pair.

Jθ(h) = (1/m) ∑_{i=1}^{m} (y(i) − h(x(i)))² = (1/m) ∑_{i=1}^{m} (y(i) − ŷ(i))²    (5.2)

An example is shown in Figure 5.2 for the case of simple linear regression, i.e. with only one explanatory variable x. Training data are shown as blue solid dots. Once the model has been trained, the values of the parameters are adjusted, illustrated by the blue solid bold line which mostly fits the examples. Then, the model can be used for prediction, e.g., to provide a good estimate ŷ of the actual value of y for a given value of x by computing h(x) (shown in green). A numerical sketch of this cost computation is shown after Figure 5.2.

Fig. 5.2 Example of simple linear regression
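A numerical sketch of the mean squared error of Equation 5.2 follows (in Python with numpy; the example values are arbitrary).

import numpy as np

y = np.array([1.0, 2.0, 3.0])            # ground truth values y(i)
y_hat = np.array([1.5, 1.5, 2.0])        # predictions h(x(i))

def mse(y, y_hat):
    """Mean squared error J(h) = (1/m) * sum_i (y(i) - y_hat(i))^2."""
    return np.mean((y - y_hat) ** 2)

print(mse(y, y_hat))                     # (0.25 + 0.25 + 1.0) / 3 = 0.5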

11 Actually, for the neural networks that are more complex (nonlinear models) than linear regression and that will be introduced in Section 5.5, the best fit to the training data is not necessarily the best hypothesis because it may have a low generalization, i.e. a low ability to predict yet unseen data. This issue, named overfitting, will be introduced in Section 5.5.9. 12 Or also J(θ), Lθ or L(θ).

5.1.4 Gradient Descent Training Algorithm

The basic algorithm for training a linear regression model, using the simple gradient descent method, is actually pretty simple13:

• initialize each parameter θi and the bias b to a random or some heuristic value14;
• compute the values of the model h for all examples15;
• compute the cost Jθ(h), e.g., by Equation 5.2;
• compute the gradients ∂Jθ(h)/∂θi, which are the partial derivatives of the cost function Jθ(h) with respect to each θi, as well as to the bias b;
• update simultaneously16 all parameters θi and the bias according to the update rule17 shown in Equation 5.3, with α being the learning rate.

θi := θi − α ∂Jθ(h)/∂θi    (5.3)

This represents an update in the opposite direction of the gradients in order to decrease the cost Jθ(h), as illustrated in Figure 5.3; and
• iterate until the error reaches a minimum18, or after a certain number of iterations.

A minimal sketch of this training loop is shown below.
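The following sketch (in Python with numpy) implements this gradient descent loop for linear regression; the synthetic dataset, learning rate and number of iterations are arbitrary illustrative choices. The gradients are those of the MSE cost of Equation 5.2 with respect to the parameters and the bias.

import numpy as np

# Synthetic training data: m examples with n = 2 explanatory variables,
# generated from known parameters so that learning can be checked.
rng = np.random.default_rng(0)
m, n = 100, 2
X = rng.normal(size=(m, n))                  # one example per row
true_theta, true_b = np.array([2.0, -3.0]), 0.5
y = X @ true_theta + true_b

theta = np.zeros(n)                          # initialization (here simply zeros)
b = 0.0
alpha = 0.1                                  # learning rate

for _ in range(500):
    y_hat = X @ theta + b                    # model values for all examples
    error = y_hat - y
    cost = np.mean(error ** 2)               # MSE cost (Equation 5.2)
    grad_theta = (2 / m) * X.T @ error       # partial derivatives w.r.t. each θi
    grad_b = (2 / m) * np.sum(error)         # partial derivative w.r.t. the bias
    theta -= alpha * grad_theta              # update rule (Equation 5.3)
    b -= alpha * grad_b

print(theta, b)                              # close to [2.0, -3.0] and 0.5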

Fig. 5.3 Gradient descent

13 See, e.g., [142] for more details. 14 Pre-training led to a significant advance, as it improved the initialization of the parameters by using actual training data, via sequential training of the successive layers [47]. 15 Computing the cost for all examples is the best method but also computationally costly. There are numerous heuristic alternatives to minimize the computational cost, e.g., stochastic gradient descent (SGD), where one example is randomly chosen, and minibatch gradient descent, where a subset of examples is randomly chosen. See, for example, [63, Sections 5.9 and 8.1.3] for more details. 16 A simultaneous update is necessary for the algorithm to behave correctly. 17 The update rule may also be notated as θ := θ − α ∇θ Jθ(h), where ∇θ Jθ(h) is the vector of gradients ∂Jθ(h)/∂θi. 18 If the cost function is convex (the case for linear regression), there is only one global minimum, and thus there is a guarantee of finding the optimal model.

5.1.5 From Model to Architecture

Let us now introduce in Figure 5.4 a graphical representation of a linear regression model, as a precursor of a neural network. The architecture represented is actually the computational representation of the model19. The weighted sum is represented as a computational unit20, drawn as a squared box with a ∑, taking its inputs from the xi nodes, drawn as circles. In the example shown, there are four explanatory variables: x1, x2, x3 and x4. Note that there is some convention of considering the bias as a special case of weight (thus alternatively notated as θ0) and having a corresponding input node named the bias node, which is implicit21 and has a constant value notated as +1. This actually corresponds to considering an implicit additional explanatory variable x0 with constant value +1, as shown in Equation 5.4, an alternative formulation of the linear regression initially defined in Equation 5.1.

h(x) = θ0 + θ1 x1 + ... + θn xn = ∑_{i=0}^{n} θi xi    (5.4)

Fig. 5.4 Architectural model of linear regression

5.1.6 From Model to Linear Algebra Representation

The initial linear regression equation (in Equation 5.1) may also be made more compact thanks to a linear algebra notation leading to Equation 5.5 where

h(x) = b + θx    (5.5)

• b and h(x) are scalars;
• θ is a row vector22 consisting of a single row of n elements: [θ1 θ2 ... θn];

19 We mostly use the term architecture as, in this book, we are concerned with the way to implement and compute a given model and also with the relation between an architecture and a representation. 20 We use the term node for any component of a neural network, whether it is just an interface (e.g., an input node) or a computational unit (e.g., a weighted sum or a function). We use the term unit only in the case of a computational node. The term neuron is also often used in place of unit, as a way to emphasize the inspiration from biological neural networks. 21 However, as will be explained later in Section 5.5, bias nodes rarely appear in illustrations of non-toy neural networks. 22 That is a matrix which has a single row, i.e. a matrix of dimension 1×n.

• x is a column vector23 consisting of a single column of n elements: [x1 x2 ... xn]^T.

5.1.7 From Simple to Multivariate Model

Linear regression can be generalized to multivariate linear regression, the case when there are multiple variables y1 ... yp to be predicted, as illustrated in Figure 5.5 with three predicted variables: y1, y2 and y3, each subnetwork represented in a different color.

Fig. 5.5 Architectural model of multivariate linear regression

The corresponding linear algebra equation is Equation 5.6, where

h(x) = b + Wx    (5.6)

• the b bias vector is a column vector of dimension p×1, with bj representing the weight of the connexion between the bias input node and the jth sum operation corresponding to the jth output node;
• the W weight matrix is a matrix of dimension p×n, that is with p rows and n columns, with Wi,j representing the weight of the connexion between the jth input node and the ith sum operation corresponding to the ith output node;
• n is the number of input nodes (without considering the bias node); and
• p is the number of output nodes.

For the architecture shown in Figure 5.5, n = 4 (the number of input nodes and of columns of W) and p = 3 (the number of output nodes and of rows of W). The corresponding b bias vector and W weight matrix are shown in Equations 5.7 and 5.824 and in Figure 5.625.

b = [b1, b2, b3]^T    (5.7)

23 That is a matrix which has a single column, i.e. a matrix of dimension n×1. 24 Indeed, b and W are generalizations of b and θ for the case of univariate linear regression (as shown in Section 5.1.6) to the multivariate case and thus to multiple rows, each row corresponding to an output node. 25 By showing only the connexions to one of the output nodes, in order to keep readability.

Fig. 5.6 Architectural model of multivariate linear regression showing the bias and the weights corresponding to the connexions to the third output

W = [ W1,1 W1,2 W1,3 W1,4
      W2,1 W2,2 W2,3 W2,4
      W3,1 W3,2 W3,3 W3,4 ]    (5.8)

5.1.8 Activation Function

Let us now also apply an activation function (AF) to each weighted sum unit, as shown in Figure 5.7.

Fig. 5.7 Architectural model of multivariate linear regression with activation function

This activation function allows us to introduce arbitrary nonlinear functions.

• From an engineering perspective, a nonlinear function is necessary to overcome the linear separability limitation of the single layer Perceptron (see Section 5).
• From a biological inspiration perspective, a nonlinear function can capture the threshold effect for the activation of a neuron through its incoming signals (via its dendrites), determining whether it fires along its output (axon).

• From a statistical perspective, when the activation function is the sigmoid function, a model corresponds to logistic regression, which models the probability of a certain class or event and thus performs binary classification26.

Historically speaking, the sigmoid function (which is used for logistic regression) is the most common. The sigmoid function (usually written σ) is defined in Equation 5.9 and is shown in Figure 5.8. It will be further analyzed in Section 5.5.3. An alternative is the hyperbolic tangent, often noted tanh, similar to the sigmoid but having [−1, +1] as its output interval ([0, 1] for the sigmoid). Tanh is defined in Equation 5.10 and shown in Figure 5.9. But ReLU is now widely used for its simplicity and effectiveness. ReLU, which stands for rectified linear unit, is defined in Equation 5.11 and is shown in Figure 5.10 (a sketch of these three functions is shown after Equation 5.11). Note that, as some notation convention, we use z as the name of the variable of an activation function, as x is usually reserved for input variables.

sigmoid(z) = σ(z) = 1 / (1 + e^−z)    (5.9)

Fig. 5.8 Sigmoid function

tanh(z) = (e^z − e^−z) / (e^z + e^−z)    (5.10)

ReLU(z) = max(0,z) (5.11)
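The three activation functions above may be written as follows (in Python with numpy), so as to be applied elementwise to the vector of weighted sums.

import numpy as np

def sigmoid(z):
    """Sigmoid function (Equation 5.9), with values in ]0, 1[."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent (Equation 5.10), with values in ]-1, +1[."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    """Rectified linear unit (Equation 5.11)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))    # [0.119..., 0.5, 0.880...]
print(relu(z))       # [0. 0. 2.]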

5.2 Basic Building Block

The architectural representation (of multivariate linear regression with activation function) shown in Figure 5.7 is an instance (with 4 input nodes and 3 output nodes) of a basic building block of neural networks and deep learning architectures. Although simple, this basic building block is actually a working neural network.

26 For each output node/variable. See more details in Section 5.5.3.

Fig. 5.9 Tanh function

Fig. 5.10 ReLU function

It has two layers27:

• The input layer, on the left of the figure, is composed of the input nodes xi and the bias node, which is an implicit and specific input node with a constant value of 1, therefore usually denoted as +1.
• The output layer, on the right of the figure, is composed of the output nodes yj.

Training a basic building block is essentially the same as training a linear regression model, which has been described in Section 5.1.3.

27 Although, as we will see in Section 5.5.2, it will be considered as a single-layer neural network architecture. As it has no hidden layer, it still suffers from the linear separability limitation of the Perceptron.

5.2.1 Feedforward Computation

After it has been trained, we can use this basic building block neural network for prediction. To do so, we simply feedforward the network, i.e. provide input data to the network (feed in) and compute the output values. This corresponds to Equation 5.12.

ŷ = h(x) = AF(b + Wx)    (5.12)

The feedforward computation of the prediction (for the architecture shown in Figure 5.5) is illustrated in Equation 5.13, where hj(x) (i.e. ŷj) is the prediction of the jth variable yj. A numerical sketch of this computation is shown after Equation 5.13.

ŷ = h(x) = h([x1, x2, x3, x4]^T) = AF(b + Wx)

  = AF([b1, b2, b3]^T + [ W1,1 W1,2 W1,3 W1,4
                          W2,1 W2,2 W2,3 W2,4
                          W3,1 W3,2 W3,3 W3,4 ] × [x1, x2, x3, x4]^T)

  = AF([ b1 + W1,1 x1 + W1,2 x2 + W1,3 x3 + W1,4 x4                    (5.13)
         b2 + W2,1 x1 + W2,2 x2 + W2,3 x3 + W2,4 x4
         b3 + W3,1 x1 + W3,2 x2 + W3,3 x3 + W3,4 x4 ])

  = [h1(x), h2(x), h3(x)]^T = [ŷ1, ŷ2, ŷ3]^T
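A numerical sketch of the feedforward computation of Equation 5.13 follows (in Python with numpy; the weights, biases and input are arbitrary illustrative values, and ReLU is used as the activation function). Feeding a set of examples simultaneously, as in Equation 5.15, amounts to replacing the input vector x by a matrix X whose columns are the examples.

import numpy as np

W = np.array([[0.2, -0.5, 1.0, 0.0],       # weight matrix (3 outputs, 4 inputs)
              [0.7,  0.1, 0.0, 0.3],
              [-0.4, 0.6, 0.2, 0.1]])
b = np.array([0.1, -0.2, 0.0])             # bias vector

def AF(z):
    return np.maximum(0.0, z)              # ReLU activation function

def feedforward(x):
    """Compute y_hat = h(x) = AF(b + W x) (Equation 5.12)."""
    return AF(b + W @ x)

x = np.array([1.0, 2.0, 0.5, 1.5])
print(feedforward(x))                      # predictions y_hat1, y_hat2, y_hat3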

5.2.2 Computing Multiple Input Data Simultaneously

Feedforwarding simultaneously a set of examples is easily expressed as a matrix by matrix multiplication, by substituting the single vector example x in Equation 5.12 with a matrix of examples (usually notated as X), leading to Equation 5.14. Successive columns of the matrix of examples X correspond to the different examples. We use a superscript notation X^(k) to denote the kth example, the kth column of the X matrix, to avoid confusion with the subscript notation xi which is used to denote the ith input variable. Therefore, Xi^(k) denotes the ith input value of the kth example. The feedforward computation of a set of examples is illustrated in Equation 5.15, with predictions h(X^(k)) being successive columns of the resulting output matrix.

h(X) = AF(b + WX)    (5.14)

h(X) = h([X^(1) X^(2) ... X^(m)]) = AF(b + WX)
     = AF([b1, b2, b3]ᵀ + [[W1,1, W1,2, W1,3, W1,4], [W2,1, W2,2, W2,3, W2,4], [W3,1, W3,2, W3,3, W3,4]] × [X^(1) X^(2) ... X^(m)])
     = [h(X^(1)) h(X^(2)) ... h(X^(m))], i.e. the matrix whose jth row is [hj(X^(1)), hj(X^(2)), ..., hj(X^(m))]    (5.15)

where each example X^(k) = [X_1^(k), X_2^(k), X_3^(k), X_4^(k)]ᵀ is a column of X.

Note that the main computation taking place28 is a product of matrices. This can be computed very efficiently, by using linear algebra vectorized implementation libraries and furthermore with specialized hardware like graphics processing units (GPUs).
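The following NumPy sketch (an illustration with assumed toy dimensions, 4 input and 3 output nodes as in Figure 5.7) shows that feedforwarding a single example (Equation 5.12) and feedforwarding a batch of m examples (Equation 5.14) only differ by the shape of the input; in both cases the heavy lifting is a single matrix product.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # weight matrix (3 output nodes, 4 input nodes)
b = rng.normal(size=(3, 1))    # bias column vector

# Equation 5.12: one example x (a column vector of 4 input values)
x = rng.normal(size=(4, 1))
y_hat = relu(b + W @ x)        # shape (3, 1)

# Equation 5.14: m examples as the columns of a matrix X
m = 5
X = rng.normal(size=(4, m))
Y_hat = relu(b + W @ X)        # shape (3, m); broadcasting adds b to each column
print(y_hat.shape, Y_hat.shape)
```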

5.3 Machine Learning

5.3.1 Definition

Let us now reflect a bit on the meaning of training a model, whether it is a linear regression model (Section 5.1.1) or the basic building block architecture presented in Section 5.2. In other words, let us consider what machine learning actually means. Our starting point is the following concise and general definition of machine learning provided by Mitchell in [132]: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

At first, note that the word performance actually covers different meanings, especially in the computer music context of the book:

1. the execution of (the action to perform) an action, notably an artistic act such as a musician playing a piece of music;
2. a measure (criterion of evaluation) of that action, notably for a computer system its efficiency in performing a task, in terms of time and memory29 measurements; or
3. a measure of the accuracy in performing a task, i.e. the ability to predict or classify with minimal errors.

In the remainder of the book, in order to minimize ambiguity, we will use the terms as follows:

• performance as an act by a musician,
• efficiency as a measure of computational ability, and
• accuracy as a measure of the quality of a prediction or a classification30.

Thus, we could rephrase the definition as: “A computer program is said to learn from experience E with respect to some class of tasks T and accuracy measure A, if its accuracy at tasks in T, as measured by A, improves with experience E.”

28 Apart from the computation of the AF activation function. In the case of ReLU this is fast. 29 With the corresponding analysis measurements, time complexity and space complexity, for the corresponding algorithms. 30 In fact, accuracy may not be a pertinent metric for a classification task with skewed classes, i.e. with one class being vastly more represented in the data than other(s), e.g., in the case of the detection of a rare disease. Therefore a confusion matrix and additional metrics like precision and recall, and possible combinations like the F-score, are used (see, e.g., [63, Section 11.1] for details). We will not address them in the book, because we are primarily concerned with content generation and not with pattern recognition (classification).

5.3.2 Categories

We may now consider the three main categories of machine learning with regard to the nature of the experience conveyed by the examples:

• supervised learning – the dataset is fixed and a correct (expected) answer31 is associated to each example, the general objective being to predict answers for new examples. Examples of tasks are regression (prediction), classification and translation;
• unsupervised learning – the dataset is fixed and the general objective is extracting information. Examples of tasks are feature extraction, data compression (both performed by autoencoders, to be introduced in Section 5.6), probability distribution learning (performed by RBMs, to be introduced in Section 5.7), series modeling (performed by recurrent networks, to be introduced in Section 5.8), clustering and anomaly detection; and
• reinforcement learning32 – the experience is incremental through successive actions of an agent within an environment, with some feedback (the reward) providing information about the value of the action, the general objective being to learn a near optimal policy (strategy), i.e. a suite of actions maximizing its cumulated rewards (its gain). Examples of tasks are game playing and robot navigation.

5.3.3 Components

In his introduction to machine learning [39], Domingos describes machine learning algorithms through three components:

• representation – the way to represent the model – in our case, a neural network, as it has been introduced and will be further developed in the following sections;
• evaluation – the way to evaluate and compare models – via a cost function, that will be analyzed in Section 5.5.4; and
• optimization – the way to identify (search among models for) a best model.

5.3.4 Optimization

Searching for values (of the parameters of a model) that minimize the cost function is indeed an optimization problem. One of the simplest optimization algorithms is gradient descent, which has been introduced in Section 5.1.4. There are various more sophisticated algorithms, such as stochastic gradient descent (SGD), Nesterov accelerated gradient (NAG), Adagrad, BFGS, etc. (see, for example, [63, Chapter 9] for more details).
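As an illustration of the optimization loop (a generic sketch on synthetic data, not a specific algorithm from the book), a plain gradient descent update on a mean squared error cost for the basic building block (with identity output activation) could look as follows; more sophisticated optimizers such as SGD with momentum or Adagrad refine how the step is computed, not the overall scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 100))              # 100 examples, 4 input variables
true_W = rng.normal(size=(3, 4))
true_b = rng.normal(size=(3, 1))
Y = true_W @ X + true_b                    # synthetic targets (3 output variables)

W = np.zeros((3, 4))
b = np.zeros((3, 1))
learning_rate = 0.05

for step in range(500):
    Y_hat = W @ X + b                      # feedforward (identity output activation)
    error = Y_hat - Y
    cost = 0.5 * np.mean(np.sum(error ** 2, axis=0))     # mean squared error
    grad_W = error @ X.T / X.shape[1]      # gradient of the cost w.r.t. W
    grad_b = error.mean(axis=1, keepdims=True)           # gradient w.r.t. b
    W -= learning_rate * grad_W            # gradient descent update
    b -= learning_rate * grad_b

print(cost)  # close to 0 once the parameters have converged
```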

5.4 Architectures

From this basic building block, we will describe in the following sections the main types of deep learning architectures used for music generation (as well as for other purposes):

• feedforward,
• autoencoder,
• restricted Boltzmann machine (RBM), and
• recurrent (RNN).

31 It is usually named a label in the case of a classification task and a target in the case of a prediction/regression task. 32 To be introduced in Section 5.12.

We will also introduce architectural patterns (see Section 5.13.1) which could be applied to them:

• convolutional,
• conditioning, and
• adversarial.

5.5 Multilayer Neural Network aka Feedforward Neural Network

A multilayer neural network, also named a feedforward neural network, is an assemblage of successive layers of basic building blocks:

• the first layer, composed of input nodes, is called the input layer;
• the last layer, composed of output nodes, is called the output layer; and
• any layer between the input layer and the output layer is named a hidden layer.

An example of a multilayer neural network with two hidden layers is illustrated in Figure 5.11. The combination of a hidden layer and a nonlinear activation function makes the neural network a universal approximator, able to overcome the linear separability limitation33.

Fig. 5.11 Example of a feedforward neural network (detailed)

5.5.1 Abstract Representation

Note that, in the case of practical (non-toy) illustrations of neural network architectures, in order to simplify the figures, bias nodes are very rarely illustrated. With a similar objective, the sum units and the activation function units are also almost always omitted, resulting in a more abstract view such as that shown in Figure 5.12.

33 The universal approximation theorem [86] states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate a wide variety of interesting functions when given appropriate parameters (weights). However, there is no guarantee that the neural network will be able to learn them!

Fig. 5.12 Example of a feedforward neural network (simplified)

We can further abstract each layer by representing it as an oblong form (by hiding its nodes)34, as shown in Figure 5.13.

Fig. 5.13 Example of a feedforward neural network (abstract)

5.5.2 Depth

The architecture illustrated in Figure 5.13 is called a 3-layer neural network architecture, which also indicates that the depth of the architecture is three. Note that the number of layers (depth) is indeed three and not four, even though summing up the input layer, the two hidden layers and the output layer gives four. This is because, by convention, only layers with weights (and units) are considered when counting the number of layers of a multilayer neural network; therefore, the input layer is not counted. Indeed, the input layer only acts as an input interface, without any weight or computation. In this book, we will use a superscript (power) notation35 to denote the number of layers of a neural network architecture. For instance, the architecture illustrated in Figure 5.13 could be denoted as Feedforward^3.

34 It is sometimes pictured as a rectangle, see Figure 5.14, or even as a circle, notably in the case of recurrent networks, see Figure 5.31. 35 The set of compact notations for expressing the dimension of an architecture or a representation will be introduced in Section 6.1.

The depth of the first neural network architectures was small. The original Perceptron [166], the ancestor of neural networks, has only an input layer and an output layer without any hidden layer, i.e. it is a single-layer neural network. In the 1980s, conventional neural networks were mostly 2-layer or 3-layer architectures. For modern deep networks, the depth can indeed be very large, deserving the name of deep (or even very deep) networks. Two recent examples, both illustrated in Figure 5.14, are

• the 27-layer GoogLeNet architecture [182]; and
• the 34-layer (up to 152-layer!) ResNet architecture36 [74].

Note that depth does matter. A recent theorem [44] states that there is a simple radial function37 on ℝ^d, expressible by a 3-layer neural network, which cannot be approximated by any 2-layer network to more than a constant accuracy unless its width is exponential in the dimension d. Intuitively, this means that reducing the depth (removing a layer) requires exponentially augmenting the width (the number of units) of the layer left. On this issue, the interested reader may also wish to review the analyses in [4] and [193]. Note that for both networks pictured in Figure 5.14, the flow of computation is vertical, upward for GoogLeNet and downward for ResNet. This differs from the convention for the flow of computation that we have introduced and used so far, which is horizontal, from left to right. Unfortunately, there is no consensus in the literature about the notation for the flow of computation. Note that in the specific case of recurrent networks, to be introduced in Section 5.8, the consensus notation is vertical, upward.

5.5.3 Output Activation Function

We have seen in Section 5.2 that, in modern neural networks, the activation function (AF) chosen for introducing nonlinearity at the output of each hidden layer is often the ReLU function. But the output layer of a neural network has a special status. Basically, there are three main possible types of activation function for the output layer, named in the following the output activation function38:

• identity – the case for a prediction (regression) task. It has continuous (real) output values. Therefore, we do not need and we do not want a nonlinear transformation at the last layer;
• sigmoid – the case of a binary classification task, as in logistic regression39. The sigmoid function (usually written σ) has been defined in Equation 5.9 and shown in Figure 5.8. Note its specific shape, which provides a “separation” effect, used for binary decision between two options represented by values 0 and 1; and
• softmax – the most common approach for a classification task with more than two classes but with only one label to be selected40 (and where a one-hot encoding is generally used, see Section 4.11).

The softmax function actually represents a probability distribution over a discrete output variable with n possible values (in other words, the probability of the occurrence of each possible value v, knowing the input variable x, i.e. P(y = v|x)). Therefore, softmax ensures that the sum of the probabilities for each possible value is equal to 1. The softmax function is defined in Equation 5.16 and an example of its use is shown in Equation 5.17. Note that the σ notation is used for the softmax function, as for the sigmoid function, because softmax is actually the generalization of sigmoid to the case of multiple values, being a variadic function, that is one which accepts a variable number of arguments.

36 It introduces the technique of residual learning, reinjecting the input between levels and estimating the residual function h(x) − x, a technique aimed at very deep networks, see [74] for more details. 37 A radial function is a function whose value at each point depends only on the distance between that point and the origin. More precisely, it is radial if and only if it is invariant under all rotations while leaving the origin fixed. 38 A shorthand for output layer activation function. 39 For details about logistic regression, see, for example, [63, page 137] or [73, Section 4.4]. For this reason, the sigmoid function is also called the logistic function. 40 A very common example is the estimation by a neural network architecture of the next note, modeled as a classification task of a single note label within the set of possible notes.

Fig. 5.14 (left) GoogLeNet 27-layer deep network architecture. Reproduced from [182] with permission of the authors. (right) ResNet 34-layer deep network architecture. Reproduced from [74] with permission of the authors

σ(z)_i = e^{z_i} / ∑_{j=1}^{n} e^{z_j}    (5.16)

σ([1.2, 0.9, 0.4]ᵀ) = [0.46, 0.34, 0.20]ᵀ    (5.17)

For a classification or prediction task, we can simply select the value with the highest probability (i.e. via the argmax function, the index of the one-hot vector with the highest value). But the distribution produced by the softmax function can also be used as the basis for sampling, in order to add nondeterminism and thus content variability to the generation (this will be detailed in Section 6.6).
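The softmax computation of Equation 5.16, followed by the two usages just mentioned (argmax selection and sampling), can be sketched in NumPy as follows; the three logits reproduce the example of Equation 5.17, and this is only an illustrative sketch.

```python
import numpy as np

def softmax(z):
    # Equation 5.16: exponentiate, then normalize so that the probabilities sum to 1
    e = np.exp(z - np.max(z))      # subtracting the max improves numerical stability
    return e / e.sum()

z = np.array([1.2, 0.9, 0.4])
p = softmax(z)
print(np.round(p, 2))              # ≈ [0.46 0.34 0.21], matching Equation 5.17 up to rounding

# Deterministic interpretation: pick the most probable class (e.g., the most likely note)
print(np.argmax(p))                # 0

# Stochastic interpretation: sample a class according to the distribution,
# which introduces variability in the generated content (see Section 6.6)
rng = np.random.default_rng(42)
print(rng.choice(len(p), p=p))
```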

5.5.4 Cost Function

The choice of a cost (loss) function is actually correlated to the choice of the output activation function and to the choice of the encoding of the target y (the true value). Table 5.141 summarizes the main cases.

Task                    | Type of the output (ŷ)        | Encoding of the target (y) | Output activation function | Cost (loss) function
------------------------|-------------------------------|----------------------------|----------------------------|---------------------------
Regression              | Real                          | ℝ                          | Identity (Linear)          | Mean squared error
Classification          | Binary                        | {0, 1}                     | Sigmoid                    | Binary cross-entropy
Classification          | Multiclass single label       | One-hot                    | Softmax                    | Categorical cross-entropy
Classification          | Multiclass multilabel         | Many-hot                   | Sigmoid                    | Binary cross-entropy
Multiple Classification | Multi multiclass single label | Multi one-hot              | Multi sigmoid              | Binary cross-entropy
Multiple Classification | Multi multiclass single label | Multi one-hot              | Multi softmax              | Categorical cross-entropy

Table 5.1 Relation between output activation function and cost (loss) function

A cross-entropy function measures the difference between two probability distributions, in our case (a classification task) between the target (true value) distribution y and the predicted distribution ŷ. Note that there are two types of cross-entropy cost functions:

• binary cross-entropy, when the classification is binary (Boolean), and
• categorical cross-entropy, when the classification is multiclass with a single label to be selected.

In the case of a classification with multiple labels, binary cross-entropy must be chosen jointly with sigmoid (because in such cases we want to compare the distributions independently, class per class42) and the costs for each class are summed up. In the case of multiple simultaneous classifications (multi multiclass single label), each classification is now independent from the other classifications, thus we have two approaches: apply sigmoid and binary cross-entropy for each element and sum up the costs, or apply softmax and categorical cross-entropy independently for each classification and sum up the costs.

41 Inspired by Ronaghan’s concise pedagogical presentation in [165]. 42 In case of multiple labels, the probability of each class is independent from the other class probabilities – the sum is greater than 1.
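To make the pairings of Table 5.1 concrete, the two cross-entropy costs can be written directly from their definitions (detailed in Section 5.5.6); this is a minimal NumPy sketch with made-up target and prediction vectors, not code from any of the systems surveyed.

```python
import numpy as np

def categorical_cross_entropy(y, y_hat):
    # Multiclass single label classification with softmax outputs (y is one-hot)
    return -np.sum(y * np.log(y_hat))

def binary_cross_entropy(y, y_hat):
    # Binary (or multilabel, summed class per class) classification with sigmoid outputs
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Multiclass single label: one-hot target, softmax prediction
y_onehot = np.array([0.0, 1.0, 0.0])
y_hat_softmax = np.array([0.2, 0.7, 0.1])
print(categorical_cross_entropy(y_onehot, y_hat_softmax))   # -log(0.7) ≈ 0.357

# Multiclass multilabel: many-hot target, independent sigmoid predictions
y_manyhot = np.array([1.0, 0.0, 1.0])
y_hat_sigmoid = np.array([0.8, 0.3, 0.6])
print(binary_cross_entropy(y_manyhot, y_hat_sigmoid))
```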

5.5.5 Interpretation

Let us take some examples to illustrate these subtle but important differences, starting with the cases of real and binary values in Figure 5.15. They also include the basic interpretation of the result43.

Fig. 5.15 Cost functions and interpretation for real and binary values

• An example of use of the multiclass single label type is a classification among a set of possible notes for a monophonic melody, therefore with only one single possible note choice (single label), as shown in Figure 5.16. See, for example, the BluesC system in Section 6.5.1.1.
• An example of use of the multiclass multilabel type is a classification among a set of possible notes for a single-voice polyphonic melody, therefore with several possible note choices (several labels), as shown in Figure 5.17. See, for example, the Bi-Axial LSTM system in Section 6.9.3.
• An example of use of the multi multiclass single label type is a multiple classification among a set of possible notes for multivoice monophonic melodies, therefore with only one single possible note choice for each voice, as shown in Figure 5.18. See, for example, the BluesMC system in Section 6.5.1.2.
• Another example of use of the multi multiclass single label type is a multiple classification among a set of possible notes for a set of time steps (in a piano roll representation) for a monophonic melody, therefore with only one single possible note choice for each time step. See, for example, the DeepHearM system in Section 6.4.1.1.
• An example of use of a multi^2 multiclass single label type is a 2-level multiple classification among a set of possible notes for a set of time steps for a multivoice set of monophonic melodies. See, for example, the MiniBach system in Section 6.2.2.

The three main interpretations used44 are

• argmax (the index of the output vector with the largest value), in the case of a one-hot multiclass single label (in order to select the most likely note),
• sampling from the probability distribution represented by the output vector, in the case of a one-hot multiclass single label (in order to select a note according to its likelihood), and
• argsort45 (the indexes of the output vector sorted according to their decreasing values), in the case of a many-hot multiclass multilabel, filtered by some thresholds (in order to select the most likely notes above a probability threshold and under a maximum number of simultaneous notes).

43 The interpretation is actually part of the strategy of the generation of music content. It will be explored in Chapter 6. For instance, sampling from the probability distribution may be used in order to ensure content generation variability, as will be explained in Section 6.6. 44 In various systems to be analyzed in Chapter 6. 45 argsort is a numpy library Python function.
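The third interpretation (argsort filtered by thresholds, used for many-hot multiclass multilabel outputs) can be sketched as follows; the probability threshold and the maximum number of simultaneous notes are arbitrary illustrative values, not parameters of any specific system.

```python
import numpy as np

def select_notes(probabilities, threshold=0.5, max_notes=4):
    # Indexes sorted by decreasing probability (argsort gives increasing order)
    ranked = np.argsort(probabilities)[::-1]
    # Keep only the most likely notes above the threshold, up to max_notes
    selected = [int(i) for i in ranked if probabilities[i] >= threshold]
    return selected[:max_notes]

# Example output of a sigmoid (many-hot) output layer over 8 possible notes
p = np.array([0.1, 0.8, 0.3, 0.95, 0.6, 0.2, 0.55, 0.05])
print(select_notes(p))   # [3, 1, 4, 6] – the most likely notes above 0.5
```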

Fig. 5.16 Cost function and interpretation for a multiclass single label

Fig. 5.17 Cost function and interpretation for a multiclass multilabel

Fig. 5.18 Cost function and interpretation for a multi multiclass single label

5.5.6 Entropy and Cross-Entropy

Mean squared error has been defined in Equation 5.2 in Section 5.1.3. Without getting into details about information theory, we now introduce the notion and the formulation of cross-entropy46. The intuition behind information theory is that the information content about an event with a likely (expected) outcome is low, while the information content about an event with an unlikely (unexpected, i.e. a surprise) outcome is high. Let us take the example of a neural network architecture used to estimate the next note of a melody. Suppose that the outcome is note = B and that it has a probability P(note = B). We can then introduce the self-information (notated I) of that event in Equation 5.18.

46 With some inspiration from Preiswerk’s introduction in [156].

I(note = B) = log(1/P(note = B)) = −log P(note = B)    (5.18)

Remember that a probability is by definition within the [0,1] interval. If we look at the −log function in Figure 5.19, we can see that its value is high for a low probability value (unlikely outcome) and its value is null for a probability value equal to 1 (certain outcome), which corresponds to the objective introduced above. Note that the use of a logarithm also makes self-information additive for independent events, i.e. I(P1 P2) = I(P1) + I(P2).

Fig. 5.19 -log function

Then, let us consider all possible outcomes note = Notei, each outcome having P(note = Notei) as its associated probability, and P(note) being the probability distribution for all possible outcomes. The intuition is to define the entropy (notated H) of the probability distribution for all possible outcomes as the sum of the self-information for each possible outcome, weighted by the probability of the outcome. This leads to Equation 5.19.

H(P) = ∑_{i=0}^{n} P(note = Note_i) I(note = Note_i)
     = −∑_{i=0}^{n} P(note = Note_i) log P(note = Note_i)    (5.19)

Note that we can further rewrite the definition by using the notion of expectation47, which leads to Equation 5.20.

H(P) = E_{note∼P}[I(note)] = −E_{note∼P}[log P(note)]    (5.20)

47 An expectation, or expected value, of some function f (x) with respect to a probability distribution P(x), usually notated as Ex∼P[ f (x)], is the average (mean) value that f takes on when x is drawn from P, i.e. Ex∼P [ f (x)] = ∑x P(x) f (x) (we are here considering the case of discrete variables, which is the case for classification within a set of possible notes).

Now, let us introduce in Equation 5.21 the Kullback-Leibler divergence (often abbreviated as KL-divergence, and notated D_KL)48, as a measure of how different two probability distributions P and Q over the same variable (note) are.

D_KL(P||Q) = E_{note∼P}[log (P(note)/Q(note))]
           = E_{note∼P}[log P(note) − log Q(note)]
           = E_{note∼P}[log P(note)] − E_{note∼P}[log Q(note)]    (5.21)

D_KL may be rewritten49 as in Equation 5.22, where H(P,Q), named the categorical cross-entropy, is defined in Equation 5.23.

D_KL(P||Q) = −H(P) + H(P,Q)    (5.22)

H(P,Q) = −E_{note∼P}[log Q(note)]    (5.23)

Note that categorical cross-entropy is similar to KL-divergence50, while lacking the H(P) term. But minimizing D_KL(P||Q) or minimizing H(P,Q), with respect to Q, are equivalent, because the omitted term H(P) is a constant with respect to Q. Now, remember51 that the objective of the neural network is to predict the ŷ probability distribution, which is an estimation of the true (ground truth) probability distribution y, by minimizing the difference between them. This leads to Equations 5.24 and 5.25.

D_KL(y||ŷ) = E_y[log y − log ŷ] = ∑_{i=0}^{n} y_i (log y_i − log ŷ_i)    (5.24)

H(y,ŷ) = −E_y[log ŷ] = −∑_{i=0}^{n} y_i log ŷ_i    (5.25)

As mentioned above, minimizing D_KL(y||ŷ) or minimizing H(y,ŷ), with respect to ŷ, are equivalent, because the omitted term H(y) is a constant with respect to ŷ. Last, deriving the binary cross-entropy (that we notate H_B) is easy, as there are only two possible outcomes, which leads to Equation 5.26.

H_B(y,ŷ) = −(y_0 log ŷ_0 + y_1 log ŷ_1)    (5.26)

Because y_1 = 1 − y_0 and ŷ_1 = 1 − ŷ_0 (as the sum of the probabilities of the two possible outcomes is 1), this leads to Equation 5.27.

H_B(y,ŷ) = −(y log ŷ + (1 − y) log(1 − ŷ))    (5.27)

More details and principles for the cost functions52 can be found, for example, in [63, Section 6.2.1] and [63, Section 5.5], respectively. In addition, the information theory foundation of cross-entropy as the number of bits needed for encoding information is introduced, for example, in [37].

48 Note that it is not a true distance measure as it is not symmetric. 49 By using the H(P) definition in Equation 5.20. 50 And, just like KL-divergence, it is not symmetric. 51 See Section 5.5.4. 52 The underlying principle of maximum likelihood estimation, not explained here.
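The relation D_KL(P||Q) = H(P,Q) − H(P) of Equation 5.22 can be checked numerically on a toy note distribution; the two distributions below are arbitrary illustrative values.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])   # "true" distribution over three possible notes
Q = np.array([0.4, 0.4, 0.2])   # predicted distribution

entropy_P = -np.sum(P * np.log(P))                     # H(P), Equation 5.19
cross_entropy = -np.sum(P * np.log(Q))                 # H(P, Q), Equation 5.23
kl_divergence = np.sum(P * (np.log(P) - np.log(Q)))    # D_KL(P||Q), Equation 5.21

# Equation 5.22: the KL-divergence is the cross-entropy minus the entropy
print(np.isclose(kl_divergence, cross_entropy - entropy_P))   # True
```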

5.5.7 Feedforward Propagation

Feedforward propagation in a multilayer neural network consists in injecting input data53 into the input layer and propagating the computation through its successive layers until the output is produced. This can be implemented very efficiently because it consists in a pipelined computation of successive vectorized matrix products (intercalated with AF activation function calls). Each computation from layer k − 1 to layer k is processed as in Equation 5.28, which is a generalization of Equation 5.1254, where b[k] and W[k]55 are respectively the bias and the weight matrix between layer k − 1 and layer k, and where output[0] is the input layer, as shown in Figure 5.20.

output[k] = AF(b[k] + W[k] output[k−1])    (5.28)

Fig. 5.20 Example of a feedforward neural network (abstract) pipelined computation

Multilayer neural networks are therefore often also named feedforward neural networks or multilayer Perceptrons (MLP)56. Note that neural networks are deterministic. This means that the same input will always produce the same output. This is a useful guarantee for prediction and classification purposes but may be a limitation for generating new content. However, this may be compensated by sampling from the resulting probability distribution (see Sections 5.5.3 and 6.6).
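The pipelined computation of Equation 5.28 amounts to a simple loop over layers. The following NumPy sketch uses arbitrary layer sizes, ReLU hidden layers and a softmax output activation function (all illustrative assumptions) to feed an input vector through a Feedforward^3 architecture.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def feedforward(x, weights, biases):
    # Equation 5.28: output[k] = AF(b[k] + W[k] output[k-1]), with output[0] = x
    output = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = b + W @ output
        # ReLU for the hidden layers, softmax as the output activation function
        output = softmax(z) if k == len(weights) - 1 else relu(z)
    return output

rng = np.random.default_rng(3)
layer_sizes = [4, 8, 8, 3]          # a Feedforward^3 architecture (input layer not counted)
weights = [rng.normal(size=(n, m)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=(n, 1)) for n in layer_sizes[1:]]

x = rng.normal(size=(4, 1))
print(feedforward(x, weights, biases).sum())   # ≈ 1.0, since the output is a softmax distribution
```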

5.5.8 Training

For the training phase57, computing the derivatives becomes a bit more complex than for the basic building block (with no hidden layer) presented in Section 5.1.4. Backpropagation is the standard method of estimating the derivatives (gradients) for a multilayer neural network. It is based on the chain rule principle [167], in order to estimate the contribution of each weight to the final prediction error, that is the cost. See, for example, [63, Chapter 6] for more details.

53 The x part of an example, for the generation phase as well as for the training phase. 54 Feedforward computation for one layer has been introduced in Section 5.2.1. 55 We use a superscript notation with brackets [k] to denote the kth layer, to avoid confusion with the superscript notation with parentheses (k) to denote the kth example and the subscript notation i to denote the ith input variable. 56 The original Perceptron was a neural network with no hidden layer, and thus equivalent to our basic building block, with only one output node and with the step function as the activation function. 57 Let us remember that this is a case of supervised learning (see Section 5.2).

Note that, in the most common case, the cost function of a multilayer neural network is not convex, meaning that there may be multiple local minima. Gradient descent, as well as other more sophisticated heuristic optimization methods, does not guarantee the global optimum will be reached. But in practice a clever configuration of the model (notably, its hyperparameters, see Section 5.5.11) and well-tuned optimization heuristics, such as stochastic gradient descent (SGD), will lead to accurate solutions58.

5.5.9 Overfitting

A fundamental issue for neural networks (and more generally speaking for machine learning algorithms) is their generalization ability, that is their capacity to perform well on yet unseen data. In other words, we do not want a neural network to just perform well on the training data59 but also on future data60. This is actually a fundamental dilemma, the two opposing risks being

• underfitting – when the training error (error measure on the training data) is large; and
• overfitting – when the generalization error (expected error on yet unseen data) is large.

A simple illustrative example of underfit, good fit and overfit models for the same training data (the green solid dots) is shown in Figure 5.21.

Fig. 5.21 Underfit, good fit and overfit models

In order to be able to estimate the potential for generalization, the dataset is actually divided into two portions, with a ratio of approximately 70/30:

• the training set – which will be used for training the neural network; and
• the validation set, also named test set61 – which will be used to estimate the capacity of the model for generalization.

58 On this issue, see [24], which shows that 1) local minima are located in a well-defined band, 2) SGD converges to that band, 3) reaching the global minimum becomes harder as the network size increases and 4) in practice this is irrelevant as the global minimum often leads to overfitting (see next section). 59 Otherwise, the best and simplest algorithm would be a memory-based algorithm, which simply memorizes all (x, y) pairs. It has the best fit to the training data but it does not have any generalization ability. 60 Future data is not yet known but that does not mean that it is any kind of (random) data, otherwise a machine learning algorithm would not be able to learn and generalize well. There is indeed a fundamental assumption of regularity of the data corresponding to a task (e.g., images of human faces, jazz chord progressions, etc.) that neural networks will exploit. 61 Actually, a difference could (should) be made, as explained by Hastie et al. in [73, page 222]: “It is important to note that there are in fact two separate goals that we might have in mind: Model selection: estimating the performance of different models in order to choose the best one. Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.” However, as a matter of simplification, we will not consider that difference in the book.
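A 70/30 split of a dataset into training and validation sets can be sketched as follows (pure NumPy, assuming the examples are the columns of X with targets Y; the ratio and shapes are illustrative).

```python
import numpy as np

def split_dataset(X, Y, train_ratio=0.7, seed=0):
    # X: (n_features, m examples), Y: (n_outputs, m examples)
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)               # shuffle the examples before splitting
    n_train = int(train_ratio * m)
    train_idx, valid_idx = permutation[:n_train], permutation[n_train:]
    return (X[:, train_idx], Y[:, train_idx]), (X[:, valid_idx], Y[:, valid_idx])

X = np.random.normal(size=(4, 1000))
Y = np.random.normal(size=(3, 1000))
(X_train, Y_train), (X_valid, Y_valid) = split_dataset(X, Y)
print(X_train.shape, X_valid.shape)                # (4, 700) (4, 300)
```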

5.5.10 Regularization

There are various techniques to control overfitting, i.e., to improve generalization. They are usually named regularization and some examples of well-known techniques are

• weight decay (also known as L2), by penalizing over-preponderant weights;
• dropout, by introducing random disconnections;
• early stopping, by storing a copy of the model parameters every time the error on the validation set reduces, then terminating after an absence of progress during a pre-specified number of iterations, and returning these parameters; and
• dataset augmentation, by data synthesis (e.g., by mirroring, translation and rotation for images; by transposition for music, see Section 4.12.1), in order to augment the number of training examples.

We will not further detail regularization techniques, see, for example, [63, Section 7].
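Among the techniques listed above, early stopping is easy to sketch independently of the model: keep a copy of the best parameters seen so far on the validation set and stop after a pre-specified number of iterations without improvement. In the sketch below, `train_one_epoch`, `validation_error` and the `model.parameters` attribute are placeholders for whatever model, cost and training routine are being used.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=10, max_epochs=1000):
    best_error = float("inf")
    best_parameters = copy.deepcopy(model.parameters)
    epochs_without_progress = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                       # one pass over the training set
        error = validation_error(model)              # error measured on the validation set
        if error < best_error:
            best_error = error
            best_parameters = copy.deepcopy(model.parameters)   # store a copy of the best model
            epochs_without_progress = 0
        else:
            epochs_without_progress += 1
            if epochs_without_progress >= patience:  # no progress: stop and restore
                break
    model.parameters = best_parameters
    return model
```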

5.5.11 Hyperparameters

In addition to the parameters of the model, which are the weights of the connexions between nodes, a model also includes hyperparameters, which are parameters at an architectural meta-level, concerning both structure and control. Examples of structural hyperparameters, mainly concerned with the architecture, are

• number of layers,
• number of nodes, and
• nonlinear activation function.

Examples of control hyperparameters, mainly concerned with the learning process, are

• optimization procedure,
• learning rate, and
• regularization strategy and associated parameters.

Choosing proper values for (tuning) the various hyperparameters is fundamental both for the efficiency and the accuracy of neural networks for a given application. There are two approaches for exploring and tuning hyperparameters: manual tuning or automated tuning – by algorithmic exploration of the multidimensional space of hyperparameters and, for each sample, evaluating the generalization error. The three main strategies for automated tuning are

• random search – by defining a distribution for each hyperparameter, sampling configurations, and evaluating them;
• grid search – as opposed to random search, exploration is systematic on a small set of values for each hyperparameter; and
• model-based optimization – by building a model of the generalization error and running an optimization algorithm over it.

The challenge of automated tuning is its computational cost, although trials may be run in parallel. We will not detail these approaches here; however, further information can be found in [63, Section 11.4]. Note that this tuning activity is more objective for conventional tasks such as prediction and classification because the evaluation measure is objective, being the error rate for the validation set. When the task is the generation of new musical content, tuning is more subjective because there is no preexisting evaluation measure. It then turns out to be more qualitative, for instance through a manual evaluation of generated music by musicologists. This evaluation issue will be addressed in Section 8.6.
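Random search, the first of the three automated tuning strategies, can be sketched as follows; `build_and_evaluate` stands for any routine that trains a model with the given hyperparameters and returns its validation (generalization) error, and the distributions are arbitrary illustrative choices.

```python
import random

def random_search(build_and_evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_error, best_config = float("inf"), None
    for _ in range(n_trials):
        # Sample one configuration from a distribution defined per hyperparameter
        config = {
            "n_layers": rng.choice([1, 2, 3]),
            "n_nodes": rng.choice([32, 64, 128, 256]),
            "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform sampling
            "dropout": rng.uniform(0.0, 0.5),
        }
        error = build_and_evaluate(config)                # generalization error estimate
        if error < best_error:
            best_error, best_config = error, config
    return best_config, best_error
```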

5.5.12 Platforms and Libraries

Various platforms62, such as CNTK, MXNet, PyTorch and TensorFlow, are available as a foundation for developing and running deep learning systems63. They include libraries of

• basic architectures, such as the ones we are presenting in this chapter;
• components, for example optimization algorithms;
• runtime interfaces for running models on various hardware, including GPUs or distributed Web runtime facilities; and
• visualization and debugging facilities.

Keras is an example of a higher-level framework to simplify development, with CNTK, TensorFlow and Theano as possible backends. ONNX is an open format for representing deep learning models and was designed to ease the transfer of models between different platforms and tools.

5.6 Autoencoder

An autoencoder is a neural network with one hidden layer and with an additional constraint: the number of output nodes is equal to the number of input nodes64. The output layer actually mirrors the input layer. It is shown in Figure 5.22, with its peculiar symmetric diabolo (or sand-timer) shape.

Fig. 5.22 Autoencoder architecture

Training an autoencoder represents a case of unsupervised learning, as the examples do not contain any additional label information (the effective value or class to be predicted). But the trick is that this is implemented using conventional supervised learning techniques, by presenting output data equal to the input data65. In practice, the autoencoder

62 See, for example, the survey in [154]. 63 There are also more general libraries for machine learning and data analysis, such as the SciPy library for the Python language, or the language R and its libraries. 64 The bias is not counted/considered here as it is an implicit additional input node. 65 This is sometimes called self-supervised learning [110].

tries to learn the identity function. As the hidden layer usually has fewer nodes than the input layer, the encoder component (shown in yellow in Figure 5.22) must compress information while the decoder (shown in purple) has to reconstruct, as accurately as possible, the initial information66. This forces the autoencoder to discover significant (discriminating) features to encode useful information into the hidden layer nodes (also named the latent variables67). Therefore, autoencoders may be used to automatically extract high-level features [110]. The set of features extracted is often named an embedding68. Once trained, in order to extract features from an input, one just needs to feedforward the input data and gather the activations of the hidden layer (the values of the latent variables). Another interesting use of decoders is the high-level control of content generation. The latent variables of an autoencoder constitute a compact representation of the common features of the learnt examples. By instantiating these latent variables and decoding the embedding, we can generate a new musical content corresponding to the values of the latent variables. We will explore this strategy in Section 6.4.1.
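As a concrete illustration (a minimal Keras sketch with arbitrary layer sizes and random binary data, not a system from the literature), an autoencoder is simply trained with the input data used as both input and output; once trained, the encoder alone can be used to extract the embedding.

```python
import numpy as np
from tensorflow import keras

n_input = 128      # size of the input (and output) layer
n_hidden = 32      # size of the hidden layer (the embedding / latent variables)

inputs = keras.Input(shape=(n_input,))
hidden = keras.layers.Dense(n_hidden, activation="relu")(inputs)      # encoder
outputs = keras.layers.Dense(n_input, activation="sigmoid")(hidden)   # decoder

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, hidden)      # reuses the trained encoder layers

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Self-supervised training: the output data is equal to the input data
X = np.random.randint(0, 2, size=(1000, n_input)).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

embedding = encoder.predict(X[:1])         # values of the latent variables for one example
print(embedding.shape)                     # (1, 32)
```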

5.6.1 Sparse Autoencoder

A sparse autoencoder is an autoencoder with a sparsity constraint, such that its hidden layer units are inactive most of the time. The objective is to enforce the specialization of each unit in the hidden layer as a specific feature detector. For instance, a sparse autoencoder with 100 units in its hidden layer and trained on 10×10 pixel images will learn to detect edges at different positions and orientations in images, as shown in Figure 5.23. When applied to other input domains, such as audio or symbolic music data, this algorithm will learn useful features for those domains too. The sparsity constraint is implemented by adding an additional term to the cost function to be minimized, see more details in [141] or [63, Section 14.2.1].

Fig. 5.23 Visualization of the input image motives that maximally activate each of the hidden units of a sparse autoencoder architecture. Reproduced from [141] with permission of the author

66 Compared to traditional dimension reduction algorithms, such as principal component analysis (PCA), this approach has two advantages: 1) feature extraction is nonlinear (the case of manifold learning, see [63, Section 5.11.3] and Section 5.6.2) and 2) in the case of a sparse autoencoder (see next section), the number of features may be arbitrary (and not necessarily smaller than the number of input parameters). 67 In statistics, latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). They can serve to reduce the dimensionality of data. 68 See the definition of embedding in Section 4.9.3.

5.6.2 Variational Autoencoder

A variational autoencoder (VAE) [102] has the added constraint that the encoded representation, the latent variables, by convention denoted by the variable z, follow some prior probability distribution P(z). Usually, a Gaussian distribution69 is chosen for its generality. This constraint is implemented by adding a specific term to the cost function, by computing the cross-entropy between the values of the latent variables and the prior distribution70. For more details about VAEs, an example of a tutorial can be found in [38] and there is a nice introduction to its application to music in [162]. As with an autoencoder, a VAE will learn the identity function, but furthermore the decoder part will learn the relation between a Gaussian distribution of the latent variables and the learnt examples. As a result, sampling from the VAE is immediate, one just needs to

• sample a value for the latent variables z ∼ P(z), i.e. z following the distribution P(z);
• input it into the decoder; and
• feedforward the decoder to generate an output corresponding to the distribution of the examples, following the P(x|z) conditional probability distribution learnt by the decoder.

This is in contrast to the need for indirect and computationally expensive strategies such as Gibbs sampling for other architectures such as the RBM, to be introduced in Section 5.7. By construction, a variational autoencoder is representative of the dataset that it has learnt, that is, for any example in the dataset, there is at least one setting of the latent variables which causes the model to generate something very similar to that example [38]. A very interesting characteristic of the variational autoencoder architecture for generation purposes – it is therefore often considered as one type of a class of models named generative models – is the meaningful exploration of the latent space, as a variational autoencoder is able to learn a “smooth”71 latent space mapping to realistic examples. Note that this general objective is named manifold learning and, more generally, representation learning [8], that is the learning of a representation capturing the topology of a set of examples. As defined in [63, Section 5.11.3], a manifold is a connected set of points (examples) that can be approximated by a smaller number of dimensions, each one corresponding to a local direction of variation. An intuitive example is a 2D map capturing the topology of cities dispersed on the 3D earth, where a movement on the map corresponds to a movement on the earth. To illustrate the possibilities, let us train a VAE with only two latent variables on the MNIST handwritten digits dataset [113] (with 60,000 examples, each one being an image of 28×28 pixels). Then, we scan the two-dimensional latent plane, sampling latent values for the two latent variables (i.e. sampling points within the 2-dimensional latent space) at regular intervals and generating the corresponding artificial digits by decoding the latent points72. Figure 5.24 shows examples of the artificial digits generated. Note that training the VAE has forced it to compress information about the actual examples by splitting (through the encoder) the information into two subsets:

• the specific (discriminative) part, encoded within the latent variables; and
• the common part, encoded within the weights of the decoder, in order to be able to reconstruct each original data item as closely as possible73.

The VAE actually has been forced to find out dimensions of variation for the dataset examples.
By looking at the figure, we can guess that the two dimensions could be:

69 Also named normal distribution. 70 The actual implementation is more complex and has some tricks (e.g., the encoder actually generates a mean vector and a standard deviation vector) that we will not detail here. 71 That is, a small change in the latent space will correspond to a small change in the generated examples, without any discontinuity or jump. For a more detailed discussion about which (and how) interesting effects (smoothness, parsimony and axis-alignment between data and latent variability) a VAE has on the latent representation (the vector of latent variables) learnt, see, e.g., [206]. 72 As proposed and implemented in [23]. 73 Indeed, there is no magic here, the reversible compression from 28×28 variables to 2 variables must have extracted and stored missing information somewhere.

Fig. 5.24 Various digits generated by decoding sampled latent points at regular intervals on the MNIST handwritten digits database

• from angular to round elements for the first latent variable (z1, horizontally represented), and
• the size of the compound element (circle or angle) for the second latent variable (z2, vertically represented).

Note that we cannot expect/force the VAE towards the semantics (meaning) of specific dimensions, as the VAE will automatically extract them (this depends on the dataset as well as on the training configuration), and we can only try to

interpret them a posteriori74. Examples of possible dimensions for music generation could be: the number of notes75, the range (the distance from the lowest to the highest pitch), etc. Once learnt by a VAE, the latent representation (a vector of latent variables) can be used to explore the latent space with various operations to control/vary the generation of content. Some examples of operations on the latent space, as proposed in [162] and [163] for the MusicVAE system described in Section 6.12.1, are

• translation;
• interpolation;
• averaging of some points; and
• attribute vector arithmetics, by addition or subtraction of an attribute vector capturing a given characteristic76.

Figure 5.25 shows an interesting comparison of melodies resulting from

• interpolation in the data space, that is the space of representation of melodies; and
• interpolation in the latent space, which is then decoded into the corresponding melodies.

The interpolation in the latent space produces more meaningful and interesting melodies than the interpolation in the data space (which basically just varies the ratio of notes from the two melodies), as can be heard in [161] and [164]. More details about these experiments will be provided in Section 6.12.1. Variational autoencoders are therefore elegant and promising models, and as a result they are currently among the hot approaches explored for generating content with controlled variations. Application to music generation will be illustrated in Sections 6.10.2.3 and 6.12.1.
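The latent space operations listed above reduce to simple vector arithmetic once an encoder and a decoder are available. The following sketch (with hypothetical `encoder` and `decoder` functions and NumPy latent vectors, purely for illustration) shows interpolation and attribute vector arithmetics; it is not the MusicVAE implementation.

```python
import numpy as np

def interpolate_in_latent_space(encoder, decoder, example_a, example_b, n_steps=9):
    # Encode the two examples into their latent representations
    z_a, z_b = encoder(example_a), encoder(example_b)
    # Linear interpolation in the latent space, then decoding of each intermediate point
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [decoder((1 - alpha) * z_a + alpha * z_b) for alpha in alphas]

def apply_attribute(encoder, decoder, example, examples_with_attribute, scale=1.0):
    # Attribute vector: average latent vector of a collection of examples sharing the attribute
    attribute_vector = np.mean([encoder(e) for e in examples_with_attribute], axis=0)
    # Attribute vector arithmetics: add (or, with a negative scale, subtract) it
    return decoder(encoder(example) + scale * attribute_vector)
```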

5.6.3 Stacked Autoencoder

The idea of a stacked autoencoder is to hierarchically nest successive autoencoders with decreasing numbers of hidden layer units. An example of a 2-layer stacked autoencoder77, i.e. two nested autoencoders that we could notate as Autoencoder^2, is illustrated in Figure 5.26. The chain of encoders will increasingly compress data and extract higher-level features. Stacked autoencoders, which are indeed deep networks, are therefore used for feature extraction (an example will be introduced in Section 6.10.7.1). They are also useful for music generation, as we will see in Section 6.4.1. This is because the innermost hidden layer, sometimes named the bottleneck hidden layer, provides a compact and high-level encoding (embedding) as a seed for generation (by the chain of decoders).

5.7 Restricted Boltzmann Machine (RBM)

A restricted Boltzmann machine (RBM) [81] is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. Its name comes from the fact that it is a restricted (constrained) form78 of a (general) Boltzmann machine [82], named after the Boltzmann distribution in statistical mechanics, which is used in its sampling function. The architectural restrictions of an RBM (see Figure 5.27) are that

74 However, we will see that we can construct arbitrary characteristic attributes from a subset of examples and impose them on other examples, by doing attribute vector arithmetics, as defined in the immediately following list, and as will be illustrated in Figure 6.73 in Section 6.12.1. 75 This will be illustrated in Figure 6.33 in Section 6.10.2.3. 76 This attribute vector is computed as the average latent vector for a collection of examples sharing that attribute (characteristic). 77 Note that the convention in this case is to count and notate the number of nested autoencoders, i.e. the number of hidden layers. This is different from the depth of the whole architecture, which is double. For instance, a 2-layer stacked autoencoder results in a 4-layer whole architecture, as shown in Figure 5.26. 78 Which actually makes RBM practical, as opposed to the general form, which besides its interest suffers from a learning scalability limitation.

Fig. 5.25 Comparison of interpolations between the top and the bottom melodies by (left) interpolating in the data (melody) space and (right) interpolating in the latent space and decoding it into melodies. Reproduced from [163] with permission of the authors

Fig. 5.26 A 2-layer stacked autoencoder architecture, resulting in a 4-layer full architecture

• it is organized in layers, just as for a feedforward network or an autoencoder, and more precisely two layers:
  – the visible layer (analogous to both the input layer and the output layer of an autoencoder); and
  – the hidden layer (analogous to the hidden layer of an autoencoder);
• as for a standard neural network, there cannot be connections between nodes within the same layer.

Fig. 5.27 Restricted Boltzmann machine (RBM) architecture

An RBM bears some similarity in spirit and objective to an autoencoder. However, there are some important differences:

• an RBM has no output – the input also acts as the output;
• an RBM is stochastic (and therefore not deterministic, as opposed to a feedforward network or an autoencoder);

• an RBM is trained in an unsupervised learning manner, with a specific algorithm (named contrastive divergence, see Section 5.7.1), whereas an autoencoder is trained using a standard supervised learning method, with the same data as input and output; and
• the values manipulated are Booleans79.

RBMs became popular after Hinton designed a specific fast learning algorithm for them, named contrastive divergence [79], and used them for pre-training deep neural networks [47] (see Section 5). An RBM is an architecture dedicated to learning distributions. Moreover, it can learn efficiently from only a few examples. For musical applications, this is interesting for learning (and generating) chords, as the combinatorial nature of possible notes forming a chord is large and the number of examples is usually small. We will see an example of such an application in Section 6.4.2.3.

5.7.1 Training

Training an RBM has some similarity to training an autoencoder, with the practical difference that, because there is no decoder part, the RBM will alternate between two steps:

• the feedforward step – to encode the input (visible layer) into the hidden layer, by making predictions about hidden layer node activations; and • the backward step – to decode/reconstruct the input (visible layer), by making predictions about visible layer node activations.

We will not detail here the learning technique behind RBMs, see, for example, [63, Section 20.2]. Note that the reconstruction process is an example of generative learning (and not discriminative learning, as for training autoencoders, which is based on regression)80.

5.7.2 Sampling

After the training phase has been completed, in the generation phase, a sample can be drawn from the model by randomly initializing the visible layer vector v (following a standard uniform distribution) and running sampling81 until convergence. To this end, hidden nodes and visible nodes are alternately updated (as during the training phase). In practice, convergence is reached when the energy stabilizes. The energy of a configuration (the pair of visible and hidden layers) is expressed82 in Equation 5.29, where

E(v,h) = −aᵀv − bᵀh − vᵀWh    (5.29)

• v and h, respectively, are column vectors representing the visible and the hidden layers;
• W is the matrix of weights associated with the connections between visible and hidden nodes;
• a and b, respectively, are column vectors representing the bias weights for visible and hidden nodes, with aᵀ and bᵀ being their respective transpositions into row vectors; and
• vᵀ is the transposition of v into a row vector.

79 Although there are extensions with multinoulli (categorical) or continuous values, see Section 5.7.3. 80 See, for example, a nice introduction to generative learning (and the difference with discriminative learning) in [143]. 81 More precisely Gibbs sampling (GS), see [107]. Sampling will be introduced in Section 6.4.2.1. 82 For more details, see, for example, [63, Section 16.2.4].
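The energy of Equation 5.29 is straightforward to compute; the following NumPy sketch uses arbitrary small layer sizes and random Boolean configurations (with 1-D arrays, the transposes of Equation 5.29 are implicit in the dot products).

```python
import numpy as np

def rbm_energy(v, h, W, a, b):
    # Equation 5.29: E(v, h) = -a^T v - b^T h - v^T W h
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(7)
n_visible, n_hidden = 6, 4
W = rng.normal(size=(n_visible, n_hidden))   # weights between visible and hidden nodes
a = rng.normal(size=n_visible)               # bias weights of the visible nodes
b = rng.normal(size=n_hidden)                # bias weights of the hidden nodes

v = rng.integers(0, 2, size=n_visible)       # a Boolean visible configuration
h = rng.integers(0, 2, size=n_hidden)        # a Boolean hidden configuration
print(rbm_energy(v, h, W, a, b))
```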

5.7.3 Types of Variables

Note that there are actually three possibilities for the nature of RBM variables (units, visible or hidden):

• Boolean or Bernoulli – this is the case of standard RBMs, in which units (visible and hidden) are Boolean, with a Bernoulli distribution (see [63, Section 3.9.2]);
• multinoulli – an extension with multinoulli units83, i.e. with more than two possible discrete values; and
• continuous – another extension with continuous units, taking arbitrary real values (usually within the [0,1] range). An example is the C-RBM architecture analyzed in Section 6.10.5.1.

5.8 Recurrent Neural Network (RNN)

A recurrent neural network (RNN) is a feedforward neural network extended with recurrent connexions in order to learn series of items (e.g., a melody as a sequence of notes). The input of the RNN is an element xt84 of the sequence, where t represents the index or the time, and the expected output is the next element xt+1. In other words, the RNN will be trained to predict the next element of a sequence. In order to do so, the output of the hidden layer reenters itself as an additional input (with a specific corresponding weight matrix). This way, the RNN can learn, not only based on the current item but also on its previous own state, and thus, recursively, on the whole of the previous sequence. Therefore, an RNN can learn sequences, notably temporal sequences, as in the case of musical content.

Fig. 5.28 Recurrent neural network (folded)

An example of RNN (with two hidden layers) is shown in Figure 5.28. Recurrent connexions are signaled with a solid square, in order to distinguish them from standard connexions85. The unfolded version of the visual representation

83 As explained by Goodfellow et al. in [63, Section 3.9.2]: ““Multinoulli” is a term that was recently coined by Gustavo Lacerdo and popularized by Murphy in [140]. The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0,...,n}k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.” 84 This xt notation – or sometimes st to stress the fact that it is a sequence – is very common but unfortunately introduces possible confusion with the notation of xi as the ith input variable. The context – recurrent versus nonrecurrent network – usually helps to discriminate, as well as the use of the letter t (for time) as the index. An example of an exception is the RNN-RBM system analyzed in Section 6.9.1, which uses the x(t) notation. 85 Actually, there are some variations of this basic architecture, depending on the exact nature and location of the recurrent connexions. The most standard case is a recurrent connexion for each hidden unit, as shown in Figure 5.28. But there are some other cases, see for example in [63, Section 10.2]. An example of a music generation architecture with recurrent connexions from the output to a special context input will be introduced in Section 6.8.2.

Fig. 5.29 Recurrent neural network (unfolded)

is in Figure 5.29, with a new diagonal axis representing the time dimension, in order to illustrate the previous step value of each layer (in thinner and lighter color). Note that, as for standard connexions (shown in yellow solid lines), recurrent connexions (shown in purple dashed lines) fully connect (with a specific weight matrix) all nodes corresponding to the previous step to the nodes corresponding to the current step, as illustrated in Figure 5.30.

Fig. 5.30 Standard connexions versus recurrent connexions (unfolded)

An RNN can learn a probability distribution over a sequence by being trained to predict the next element at time step t in a sequence as the conditional probability distribution P(st|st−1,...,s1), also notated as P(st|s<t).

5.8.1 Visual Representation

A more frequent visual representation for an RNN actually shows the flow upwards and time rightwards; see the folded version (of an RNN with only one hidden layer) in Figure 5.31 and the unfolded version in Figure 5.32, with ht being the value of the hidden layer at step t, and xt and yt being the values of the input and output at step t.

Fig. 5.31 Recurrent neural network (folded)

Fig. 5.32 Recurrent neural network (unfolded)

5.8.2 Training

A recurrent network is not trained in exactly the same manner as a feedforward network. The idea is to present an example element of a sequence (e.g., a note within a melody) as the input xt and the next element of the sequence (the next note) xt+1 as the output yt . This will train the recurrent network to predict the next element of the sequence. In practice, an RNN is rarely trained element by element but with a sequence as an input and the same sequence shifted left by one step/item as the output. See an example in Figure 5.3386. Therefore, the recurrent network will learn to predict87 the next element for all successive elements of the sequence.

Fig. 5.33 Training a recurrent neural network
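Preparing the training data for an RNN as shown in Figure 5.33 is a simple shift of the sequence by one step; the following NumPy sketch assumes a melody encoded as a sequence of integer note numbers, with a hypothetical end-of-sequence symbol (both are illustrative choices, not a representation prescribed by the book).

```python
import numpy as np

END = -1                                      # special end-of-sequence symbol (illustrative)
melody = np.array([60, 62, 64, 65, 67, END])  # a melody as a sequence of note numbers

# Input: the sequence; output: the same sequence shifted left by one step.
# At each step t, the network is trained to predict element t+1 from element t.
inputs = melody[:-1]      # [60, 62, 64, 65, 67]
targets = melody[1:]      # [62, 64, 65, 67, END]

for x_t, y_t in zip(inputs, targets):
    print(f"input: {int(x_t):>3}  expected output: {int(y_t):>3}")
```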

The backpropagation algorithm to compute gradients for feedforward networks, introduced in Section 5.5.8, has been extended into a backpropagation through time (BPTT) algorithm for recurrent networks. The intuition is in unfolding the RNN through time and considering an ordered sequence of input-output pairs, but with every unfolded copy of the network sharing the same parameters, and then applying the standard backpropagation algorithm. More details may be found, for example, in [63, Section 10.2.2].

86 The end of the sequence is marked by a special symbol. 87 Pictured as dashed arrows.

Note that an RNN usually has an output layer identical to its input layer88, as a recurrent network predicts the next item, which will be used iteratively as the next input in a recursive way in order to produce a sequence. Note also that training a recurrent network is usually considered as a case of supervised learning as, for each item, the next item is presented as the expected prediction, although it is not additional label information (the effective value or class to be predicted) but only the recurrent information about the next item (intrinsically present within a sequence).

5.8.3 Long Short-Term Memory (LSTM)

Recurrent networks suffered from a training problem caused by the difficulty of estimating gradients, because in backpropagation through time recurrence brings repetitive multiplications, which could thus lead to over-amplified or vanishing effects89. This problem has been addressed and resolved by the long short-term memory (LSTM) architecture, proposed by Hochreiter and Schmidhuber in 1997 [83]. As the solution has been quite effective, LSTM has become the de facto standard for recurrent networks90. The idea behind LSTM is to secure information in memory cells, within a block91, protected from the standard data flow of the recurrent network. Decisions about writing to, reading from and forgetting (erasing) the values of cells within a block are performed by the opening or closing of gates and are expressed at a distinct control level (meta-level), while being learnt during the training process. Therefore, each gate is modulated by a weight parameter, and thus is suitable for backpropagation and the standard training process. In other words, each LSTM block learns how to maintain its memory as a function of its input in order to minimize loss. See a conceptual view of an LSTM cell in Figure 5.34. We will not further detail the inner mechanism of an LSTM cell (and block) because we may consider it here as a black box (please refer to, for example, the original article [83]). Note that a more general model of memory with access customized through training has recently been proposed: neural Turing machines (NTM) [66]. In this model, memory is global and has read and write operations with differentiable controls, and thus is subject to learning through backpropagation. The memory to be accessed, specified by location or by content, is controlled via an attention mechanism (introduced in the next section).
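As an illustration of the gating idea, the following sketch (Python with NumPy) implements a single step of a standard LSTM cell with input, forget and output gates. This is the common textbook formulation (the forget gate was added after the original 1997 paper); it is given here only to make concrete the idea of gates as learned, differentiable controls over a protected memory cell, and the parameters are random placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of a standard LSTM cell.
    W, U, b hold the parameters of the three gates and the candidate, stacked:
    W: (4*n, input_size), U: (4*n, n), b: (4*n,), with n the cell size."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*n:1*n])        # input gate: how much new content to write
    f = sigmoid(z[1*n:2*n])        # forget gate: how much old memory to keep
    o = sigmoid(z[2*n:3*n])        # output gate: how much memory to expose
    g = np.tanh(z[3*n:4*n])        # candidate content
    c = f * c_prev + i * g         # protected memory cell update
    h = o * np.tanh(c)             # exposed hidden state
    return h, c

# Toy usage with random parameters (illustrative only)
rng = np.random.default_rng(0)
input_size, n = 5, 8
W = rng.normal(size=(4*n, input_size))
U = rng.normal(size=(4*n, n))
b = np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(10, input_size)):   # a sequence of 10 input vectors
    h, c = lstm_step(x, h, c, W, U, b)
```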

5.8.4 Attention Mechanism

The motivation for an attention mechanism has been inspired by the human visual system's ability to efficiently track and recognize objects by focusing its attention. It has therefore first been introduced into neural network architectures for image recognition, for instance for object tracking [35]. It has then been adapted to recurrent architectures for natural language processing (and more specifically for translation tasks) and has shown significant improvement in the management of long-term dependencies.

88 RNNs are actually more general and there are actually some rare cases of an RNN with an arbitrary output different from the input (as for a feedforward network). An example is a RNN-based architecture to generate a chord-based accompaniment, to be analyzed in Section 6.8.3. As Karpathy puts it in [98]: “Depending on your background you might be wondering: What makes Recurrent Networks so special? A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes). Not only that: These models perform this mapping using a fixed amount of computational steps (e.g. the number of layers in the model). The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: Sequences in the input, the output, or in the most general case both.” 89 This has been coined as the vanishing or exploding gradient problem and also as the challenge of long-term dependencies (see, for example, [63, Section 10.7]). 90 Although, there are a few subsequent but similar proposals, such as gated recurrent units (GRUs). See a comparative analysis of LSTM and GRU in [25]. 91 Cells within the same block share input, output and forget gates. Therefore, although each cell might hold a different value in its memory, all cell memories within a block are read, written or erased all at once [83].

Fig. 5.34 LSTM architecture (conceptual)

The idea of an attention mechanism is to focus at each time step on some specific elements of the input sequence. This is modeled by weighted connexions onto the sequence elements (or onto the sequence of hidden units). It is therefore differentiable and subject to backpropagation-based learning at a meta-level, as with the LSTM gate control described in the previous section. For more details, see, for example, [63, Section 12.4.5.1]. Interestingly, a novel architecture for the translation of sequences, named Transformer, which is solely based on an attention mechanism92, has recently been proposed and shows promising results [198]. Its very recent application to music generation will be shortly discussed in Section 8.2.
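The following sketch (Python/NumPy) shows the core computation of a simple scaled dot-product attention over a sequence of hidden states; it is a generic illustration of the "weighted focus" idea, not the exact mechanism of any of the systems cited here, and all values are arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_product_attention(query, hidden_states):
    """query: (d,), hidden_states: (T, d).
    Returns a context vector as a weighted sum of the hidden states."""
    d = query.shape[0]
    scores = hidden_states @ query / np.sqrt(d)   # one score per time step
    weights = softmax(scores)                     # where to "focus" (sums to 1)
    context = weights @ hidden_states             # weighted sum of hidden states
    return context, weights

# Toy usage: attend over 6 hidden states of dimension 4
rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))
q = rng.normal(size=4)
context, weights = dot_product_attention(q, H)
```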

5.9 Convolutional Architectural Pattern

Convolutional neural network (CNN or ConvNet) architectures for deep learning have become commonplace for image applications. The concept was originally inspired by both a model of human vision and the convolution mathematical operator93. It has been carefully adapted to neural networks and improved by LeCun, at first for handwritten character and object recognition [111]. This resulted in efficient and accurate architectures for pattern recognition, exploiting the spatial local correlation present in natural images.

92 The architecture introduces multi-head attention which allows the model to jointly attend to information from different representation subspaces at different positions [198]. 93 In mathematics, a convolution is a mathematical operation on two functions sharing the same domain (usually noted f ∗g) that produces a third function which is the integral (or the sum in the discrete case – the case of images made of pixels) of the pointwise multiplication of the two functions varying within the domain in an opposing way. In the case of a continuous domain [low high]:

$$(f * g)(x) = \int_{low}^{high} f(x - t)\, g(t)\, dt$$

In the discrete case:

$$(f * g)(n) = \sum_{m = low}^{high} f(n - m)\, g(m)$$

5.9.1 Principles

The basic idea94 is to • slide a matrix (named a filter, a kernel or a feature detector) through the entire image (seen as the input matrix); and • for each mapping position: – compute the dot product of the filter with each mapped portion of the image; and – then sum up all elements of the resulting matrix; • resulting in a new matrix (composed of the different sums for each sliding/mapping position), named the convolved feature, or also feature map. The size of the feature map is controlled by three hyperparameters: • depth – the number of filters used; • stride – the number of pixels by which we slide the filter matrix over the input matrix; and • zero-padding – the padding of the input matrix with zeros around its border95.

Fig. 5.35 Convolution, filter and feature map. Inspired by Karn’s data science blog post [97]

An example is illustrated in Figure 5.35 with some simple settings: depth = 1, stride = 1 and no zero-padding. Various filter matrices can be used with different objectives, such as the detection of different features (e.g., edges or curves) or other operations such as sharpening or blurring. The parameter sharing used by the convolution (because of the shared fixed filter) brings the important property of equivariance to translation, i.e. a motif in an image can be detected independently of its location [63, Chapter 9].
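A minimal NumPy sketch of the convolution stage just described, with depth = 1, stride = 1 and no zero-padding, in the spirit of Figure 5.35; the image and filter values are arbitrary, and (as is usual in deep learning frameworks) the operation implemented is strictly speaking a cross-correlation.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and compute, at each position,
    the sum of the elementwise products (no zero-padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy example: a 5x5 binary "image" and a 3x3 filter
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # a 3x3 feature map
```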

5.9.2 Stages

A convolution usually consists of three successive stages:

• a convolution stage, as described in Section 5.9.1; • a nonlinear rectification stage, sometimes named detector stage, which applies a nonlinear operation, usually ReLU; and • a pooling stage, also named subsampling, to reduce the dimensionality.

94 Inspired by the nice intuitive explanation provided by Karn in [97]. For more technical details see, for example, [117] or [63, Chapter 9]. 95 Zero-padding allows mapping of the filter up to the borders of the image. It also avoids shrinking the representation, which otherwise would be problematic when using multiple consecutive convolutional layers [63, Section 9.5].

5.9.3 Pooling

The motivation for pooling is to reduce the dimensionality of each feature map while retaining significant information. Operations used for pooling are, for example, max, average and sum. In addition to reducing the dimensionality of data, pooling brings the important property of the invariance to small transformations, distortions and translations in the input image. This provides an overall robustness to the processing [97]. Like convolution, pooling has hyperparameters to control the process. A simple example of max pooling with stride = 2 is illustrated in Figure 5.36.

Fig. 5.36 Pooling. Inspired by Karn’s data science blog post [97]
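A corresponding NumPy sketch of max pooling with a 2×2 window and stride = 2, in the spirit of Figure 5.36; the feature map values are arbitrary.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample a feature map by taking the maximum over each window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()
    return pooled

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [3, 1, 9, 8],
               [2, 0, 7, 4]])
print(max_pool(fm))   # [[6, 5], [3, 9]]
```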

5.9.4 Multilayer Convolutional Architecture

A typical example of a convolutional architecture with successive layers – each one including the three stages of convolution, nonlinearity and pooling – is illustrated in Figure 5.37. The final layer is a fully connected layer, as in standard feedforward networks, and typically ends with a softmax in order to classify image types.

Fig. 5.37 Convolutional deep neural network architecture. Inspired by Karn’s data science blog post [97]

Note that convolution is an architectural pattern, as it may be applied internally to almost any of the architectures previously presented.

5.9.5 Convolution over Time

For musical applications, it could be interesting to apply convolutions to the time dimension96, in order to model temporally invariant motifs. Therefore, the convolution operation will share parameters across time [63, page 374], like for RNNs97. However, the sharing of parameters is shallow, as it applies only to a small number of temporal

96 This approach is actually the basis for time-delay neural networks [108]. 97 Indeed, RNNs are invariant in time, as remarked in [94].

neighboring members of the input, in contrast to RNNs that share parameters in a deep way, for all time steps. RNNs are indeed much more frequent than convolutional networks for musical applications. That said, we have noticed the recent occurrence of some convolutional architectures as an alternative to RNN architectures, following the pioneering WaveNet architecture for audio [194], described in Section 6.10.3.2. WaveNet presents a stack of causal convolutional layers, somewhat analogous to recurrent layers. Another example is the C-RBM architecture, described in Section 6.10.5.1. If we now consider the pitch dimension, in most cases pitch intervals are not considered invariants, and thus convolutions should not a priori apply to the pitch dimension98. This issue of convolution versus recurrence (recurrent networks) for musical applications will be further discussed in Section 8.2.

5.10 Conditioning Architectural Pattern

The idea of a conditioning (sometimes also named conditional) architecture is to parametrize the architecture based on some extra conditioning information, which could be arbitrary, e.g., a class label or data from other modalities. The objective is to have some control over the data generation process. Examples of conditioning information are • a bass line or a beat structure in the rhythm generation system to be described in Section 6.10.3.1; • a chord progression in the MidiNet system to be described in Section 6.10.3.3; • some positional constraints on notes in the Anticipation-RNN system to be described in Section 6.10.3.5; and • a musical genre or an instrument in the WaveNet system to be described in Section 6.10.3.2.

In practice, the conditioning information is usually fed into the architecture as an additional and specific input layer, shown in purple in Figure 5.38.

Fig. 5.38 Conditioning architecture

98 An exception is Johnson’s architecture [94], analyzed in Section 6.9.2, which explicitly looks for invariance in pitch (although this seems to be a rare choice) and accordingly uses an RNN over the pitch dimension.

The conditioning layer could be • a simple input layer. An example is a tag specifying a musical genre or an instrument in the WaveNet system to be described in Section 6.10.3.2; or • some output of some architecture, being

– the same architecture, as a way to condition the architecture on some history99. An example is the MidiNet system to be described in Section 6.10.3.3, in which history information from previous measure(s) is injected back into the architecture; or – another architecture. An example is the DeepJ system to be described in Section 6.10.3.4, in which two successive transformation layers of a style tag produce an embedding used as the conditioning input.

In the case of conditioning a time-invariant architecture – recurrent or convolutional over time – there are two options • global conditioning – if the conditioning input is shared for all time steps; and • local conditioning – if the conditioning input is specific to each time step. The WaveNet architecture, which is convolutional over time (see Section 5.9.5), offers the two options, as will be analyzed in Section 6.10.3.2.
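A minimal PyTorch-style sketch of the conditioning idea: the conditioning information is encoded as an extra input vector and concatenated with the main input, either shared over all time steps (global conditioning) or specific to each time step (local conditioning). Names and sizes are illustrative placeholders, not taken from any of the systems cited above.

```python
import torch
import torch.nn as nn

# Illustrative sizes (placeholders)
input_size, cond_size, hidden_size, seq_len, batch = 12, 4, 32, 16, 2

class ConditionedRNN(nn.Module):
    def __init__(self):
        super().__init__()
        # The conditioning vector is simply appended to the input at each step
        self.rnn = nn.GRU(input_size + cond_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, input_size)

    def forward(self, x, cond):
        # x: (batch, seq_len, input_size)
        # cond: (batch, cond_size) for global conditioning,
        #       or (batch, seq_len, cond_size) for local conditioning
        if cond.dim() == 2:                        # global: repeat for every step
            cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)
        y, _ = self.rnn(torch.cat([x, cond], dim=-1))
        return self.out(y)

model = ConditionedRNN()
x = torch.randn(batch, seq_len, input_size)
genre_tag = torch.randn(batch, cond_size)          # e.g., an embedded style/genre tag
output = model(x, genre_tag)                       # globally conditioned pass
```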

5.11 Generative Adversarial Networks (GAN) Architectural Pattern

A significant conceptual and technical innovation was introduced in 2014 by Goodfellow et al. with the concept of generative adversarial networks (GAN) [64]. The idea is to train simultaneously two neural networks100, as illustrated in Figure 5.39:

Fig. 5.39 Generative adversarial networks (GAN) architecture. Reproduced from [158] with permission of O’Reilly Media

• a generative model (or generator) G, whose objective is to transform a random noise vector into a synthetic (faked) sample, which resembles real samples drawn from a distribution of real images; and

99 This is close in spirit to a recurrent architecture (RNN). 100 In the original version, two feedforward networks are used. But we will see that other networks may be used, e.g., recurrent networks in the C-RNN-GAN architecture (Section 6.10.2.4) and convolutional feedforward networks in the MidiNet architecture (Section 6.10.3.3).

• a discriminative model (or discriminator) D, which estimates the probability that a sample came from the real data rather than from the generator G101. This corresponds to a minimax two-player game, with one unique (final) solution102: G recovers the training data distribution and D outputs 1/2 everywhere. The generator is then able to produce user-appealing synthetic samples from noise vectors. The discriminator may then be discarded. The minimax relationship is defined in Equation 5.30.

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (5.30)$$

• D(x) represents the probability that x came from the real data (i.e. the correct estimation by D); and • Ex∼pData [log D(x)] is the expectation103 of log D(x) with respect to x being drawn from the real data.

It is thus D's objective to correctly estimate real data, that is to maximize the Ex∼pData [log D(x)] term. • D(G(z)) represents the probability that G(z) came from the real data (i.e. the incorrect estimation by D); • 1 − D(G(z)) represents the probability that G(z) did not come from the real data, i.e. that it was generated by G (i.e. the correct estimation by D); and

• Ez∼pz(z)[log(1 − D(G(z)))] is the expectation of log(1 − D(G(z))) with respect to G(z) being produced by G from the random noise z.

It is thus also D's objective to correctly estimate synthetic data, that is to maximize the Ez∼pz(z)[log(1 − D(G(z)))] term. In summary, it is D's objective to correctly estimate both real data and synthetic data and thus to maximize both Ex∼pData [log D(x)] and Ez∼pz(z)[log(1 − D(G(z)))] terms, i.e. to maximize V(G,D). Conversely, G's objective is to minimize V(G,D). In practice, training alternates between turns of training the discriminator and turns of training the generator. One of the initial motivations for GAN was, for classification tasks, to prevent adversaries from manipulating deep networks to force misclassification of inputs (this vulnerability is analyzed in detail in [183]). However, from the perspective of content generation (which is our interest), GAN improves the generation of samples, which become hard to distinguish from the actual corpus examples. To generate music, random noise is used as the input to the generator G, whose goal is to transform it into the target content, e.g., melodies104. An example of the use of GAN for generating music is the MidiNet system, to be described in Section 6.10.3.3.

101 In some ways, a GAN represents an automated Turing test setting, with the discriminator being the evaluator and the generator being the hidden actor. 102 It corresponds to the Nash equilibrium of the game. In game theory, the intuition of a Nash equilibrium is a solution where no player can benefit by changing strategies while the other players keep theirs unchanged, see, for example, [145]. 103 The expectation has been introduced in Section 5.5.6. 104 In that respect, generation from a GAN has some similarity with generation by decoding hidden layer variables of a variational autoencoder (Section 5.6.2), as in both cases generation is done from latent variables. An important difference is that, by construction, a variational autoencoder is representative of the whole dataset that it has learnt, that is, for any example in the dataset, there is at least one setting of the latent variables which causes the model to generate something very similar to that example [38]. A GAN does not offer such a guarantee and does not offer a smooth generation control interface over the latent space (by, e.g., interpolation or attribute arithmetics, see Section 5.6.2), but it can usually generate better quality (better resolution) images than a variational autoencoder [125]. Note that the resolution limitation for a VAE may also be a problem for audio generation of music, but it appears a priori to be less of a direct concern in the case of symbolic generation of music.
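The following sketch (Python/PyTorch) illustrates the alternating training scheme corresponding to Equation 5.30, with small feedforward networks as generator and discriminator. All sizes and data are toy placeholders; the generator turn uses the common non-saturating variant (maximizing log D(G(z)) rather than minimizing log(1 − D(G(z)))), and practical GAN training involves many further details.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 8, 16   # toy sizes (placeholders)

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_data = torch.randn(256, data_dim)   # stand-in for real training examples

for step in range(100):
    batch = real_data[torch.randint(0, 256, (32,))]
    z = torch.randn(32, noise_dim)

    # --- Discriminator turn: maximize log D(x) + log(1 - D(G(z))) ---
    opt_D.zero_grad()
    d_real = D(batch)
    d_fake = D(G(z).detach())             # do not backpropagate into G here
    loss_D = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    loss_D.backward()
    opt_D.step()

    # --- Generator turn: try to fool D (non-saturating objective) ---
    opt_G.zero_grad()
    d_fake = D(G(z))
    loss_G = bce(d_fake, torch.ones_like(d_fake))
    loss_G.backward()
    opt_G.step()
```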

5.11.1 Challenges

Training based on a minimax objective is known to be challenging to optimize [211], with a risk of nonconverging oscillations. Thus, careful selection of the model and its hyperparameters is important [63, page 701]. There are also some newer techniques, such as feature matching105, among others, to improve training [169]. A recently proposed alternative to both GANs and autoencoders is generative latent optimization (GLO) [9]. It is an approach to train a generator without the need to learn a discriminator, by learning a mapping from noise vectors to images. GLO can thus be viewed both as an encoder-less autoencoder and as a discriminator-less GAN. It can also be used, as with a VAE (variational autoencoder), introduced in Section 5.6.2, to control generation by exploring the latent space. GLO has been tested on images but not yet on music and needs more evaluation.

5.12 Reinforcement Learning

Reinforcement learning (RL) may appear at first glance to be outside of our interest in deep learning architectures, as it has distinct objectives and models. However, the two approaches have recently been combined. The first move, in 2013, was to use deep learning architectures to efficiently implement reinforcement learning techniques, resulting in deep reinforcement learning [134]. The second move, in 2016, is directly related to our concerns, as it explored the use of reinforcement learning to control music generation, resulting in the RL-Tuner architecture [93] to be described in Section 6.10.6.1. Let us start with a reminder of the basic concepts of reinforcement learning, illustrated in Figure 5.40:

Fig. 5.40 Reinforcement learning – conceptual model. Reproduced from [40] with permission of SAGE Publications, Inc./Corwin

• an agent sequentially selects and performs actions within an environment; • where each action performed brings it to a new state; • the agent receives a reward (reinforcement signal), which represents the fitness of the action to the environment (current situation); • the objective of the agent being to learn a near-optimal policy (sequence of actions) in order to maximize its cumulated rewards (named its gain).

105 Feature matching changes the objective for the generator (and accordingly its cost function) to minimize the statistical difference between the features of the real data and the generated samples, see more details in [169].

Note that the agent does not know beforehand the model of the environment and the reward, thus it needs to balance between exploring to learn more and exploiting (what it has learned) in order to improve its gain – this is the exploration versus exploitation dilemma. There are many approaches and algorithms for reinforcement learning (for a more detailed presentation, please refer, for example, to [96]). Among them, Q-learning [203] turned out to be a relatively simple and efficient method, and thus is widely used. The name comes from the objective to learn (estimate) the Q function Q∗(s,a), which represents the expected gain for a given pair (s,a), where s is a state and a an action, for an agent choosing actions optimally (i.e. by following the optimal policy π∗). The agent will manage a table, called the Q-table, with values corresponding to all possible pairs. As the agent explores the environment, the table is incrementally updated, with estimates becoming more accurate. A recent combination of reinforcement learning (more specifically Q-learning) and deep learning, named deep reinforcement learning, has been proposed [134] in order to make learning more efficient. As the Q-table could be huge106, the idea is to use a deep neural network in order to approximate the expected values of the Q-table through the learning of many replayed experiences. A further optimization, named double Q-learning [196], decouples the action selection from the evaluation, in order to avoid value overestimation. The task of the first network, named the Target Q-Network, is to estimate the gain (Q), while the task of the Q-Network is to select the next action. Reinforcement learning appears to be a promising approach for the incremental adaptation of the music to be generated, e.g., based on the feedback from listeners (this issue will be addressed in Section 6.16). Meanwhile, a significant move has been made in using reinforcement learning to inject some control into the generation of music by deep learning architectures, through the reward mechanism, as described in Section 6.10.6.
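A minimal sketch of the tabular Q-learning update rule (Python), on a made-up toy environment, only to illustrate how the Q-table is incrementally updated; deep Q-learning replaces the table with a neural network approximator. The environment dynamics and all constants here are placeholders.

```python
import random

n_states, n_actions = 5, 3
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

# The Q-table: estimated expected gain for each (state, action) pair
Q = [[0.0] * n_actions for _ in range(n_states)]

def toy_env_step(state, action):
    """Made-up environment dynamics and reward (placeholder only)."""
    next_state = (state + action) % n_states
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

state = 0
for _ in range(1000):
    # Exploration/exploitation trade-off (epsilon-greedy policy)
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = toy_env_step(state, action)
    # Q-learning update: move Q(s, a) towards reward + gamma * max_a' Q(s', a')
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state
```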

5.13 Compound Architectures

Compound architectures are often used. Some are homogeneous compound architectures, combining several instances of the same architecture, e.g., a stacked autoencoder (see Section 5.6.3), but most are heterogeneous compound architectures, combining different types of architectures, e.g., the RNN Encoder-Decoder, which combines an RNN and an autoencoder (see Section 5.13.3).

5.13.1 Composition Types

We will see that, from an architectural point of view, various types of composition107 may be used: • Composition – at least two architectures, of the same type or of different types, are combined, such as – a bidirectional RNN (Section 5.13.2) combining two RNNs, forward and backward in time; and – the RNN-RBM architecture (Section 5.13.5) combining an RNN architecture and an RBM architecture.

• Refinement – one architecture is refined and specialized through some additional constraint(s)108, such as – a sparse autoencoder architecture (Section 5.6.1); and – a variational autoencoder (VAE) architecture (Section 5.6.2). • Nested – one architecture is nested into the other one, for example

106 Because of the high combinatorial nature when the number of possible states and possible actions is huge. 107 We are taking inspiration from concepts and terminology in programming languages and software architectures [172], such as refinement, instantiation, nesting and pattern [55]. 108 Both cases are refinements of the standard autoencoder architecture through additional constraints, in practice adding an extra term onto the cost function.

– a stacked autoencoder architecture (Section 5.6.3); and – the RNN Encoder-Decoder architecture (Section 5.13.3), where two RNN architectures are nested within the encoder and decoder parts of an autoencoder, which we could therefore also notate as Autoencoder(RNN, RNN). • Pattern instantiation – an architectural pattern is instantiated onto a given architecture(s), for example

– the C-RBM architecture (Section 6.10.5.1) that instantiates the convolutional architectural pattern onto an RBM architecture, which we could notate as Convolutional(RBM); – the C-RNN-GAN architecture (Section 6.10.2.4), where the GAN architectural pattern is instantiated onto an RNN architecture, which we could notate as GAN(RNN, RNN); and – the Anticipation-RNN architecture (Section 6.10.3.5) that instantiates the conditioning architectural pattern onto an RNN with the output of another RNN as the conditioning input, which we could notate as Conditioning(RNN, RNN).

5.13.2 Bidirectional RNN

Bidirectional recurrent neural networks (bidirectional RNNs) were introduced by Schuster and Paliwal [171] to handle the case when the prediction depends not only on the previous elements but also on the next elements, as for instance with speech recognition. In practice, a bidirectional RNN combines109 • a first RNN that moves forward through time and begins from the start of the sequence; and • a second symmetric RNN that moves backward through time and begins from the end of the sequence.

The output yt of the bidirectional RNN at step t combines • the output h^f_t at step t of the hidden layer of the "forward RNN"; and • the output h^b_(N−t+1) at step N − t + 1 of the hidden layer of the "backward RNN". An illustration is in Figure 5.41 (see also the sketch after the list of examples below). Examples of use are

• the BLSTM architecture (Section 6.8.3); • the C-RNN-GAN architecture (Section 6.10.2.4) that encapsulates a bidirectional RNN into the discriminator of a GAN; and • the MusicVAE architecture (Section 6.12.1) that encapsulates a bidirectional RNN into the encoder of a VAE (variational autoencoder).
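A minimal PyTorch sketch of the bidirectional combination: two recurrent passes, one forward and one backward over the sequence, whose hidden states are concatenated at each step (here via the library's built-in bidirectional option). Sizes are illustrative placeholders.

```python
import torch
import torch.nn as nn

input_size, hidden_size, seq_len, batch = 10, 16, 12, 4   # illustrative sizes

# bidirectional=True runs a forward RNN and a backward RNN over the sequence
birnn = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)
y, _ = birnn(x)
# y has shape (batch, seq_len, 2 * hidden_size): at each step t, the forward
# hidden state h^f_t and the backward hidden state (computed from the end of
# the sequence) are concatenated
print(y.shape)   # torch.Size([4, 12, 32])
```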

5.13.3 RNN Encoder-Decoder

The idea of encapsulating two identical recurrent networks (RNNs) into an autoencoder, named the RNN Encoder-Decoder110, was initially proposed in [20] as a technique to encode a variable length sequence learnt by a recurrent network into another variable length sequence produced by another recurrent network111. The motivation and application target is the translation from one language to another, resulting in sentences of possibly different lengths. The idea is to use a fixed-length vector representation as a pivot representation between an encoder and a decoder architecture, see the illustration in Figure 5.42. The hidden layer(s) h^e_t of the encoder will act as a memory which

109 See more details in [171]. 110 We could also notate it as Autoencoder(RNN, RNN). 111 This is named sequence-to-sequence learning [181].

Fig. 5.41 Bidirectional RNN architecture

Fig. 5.42 RNN Encoder-Decoder architecture. Inspired from [20]

• iteratively accumulates information about some input sequence of length N, while reading its successive xt elements112, resulting in a final state h^e_N; • which is passed to the decoder as the summary c of the whole input sequence; and • the decoder then iteratively generates the output sequence of length M, by predicting the next item yt given its hidden state h^d_t and the summary c (as an additional conditioning input)113. The two components of the RNN Encoder-Decoder are jointly trained to minimize the cross-entropy between input and output. See in Figure 5.43 the example of the Audio Word2Vec architecture for processing audio phonetic structures [26].

Fig. 5.43 RNN Encoder-Decoder audio Word2Vec architecture. Reproduced from [26] with permission of the authors
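A minimal PyTorch sketch of the RNN Encoder-Decoder idea: an encoder RNN reads the input sequence and its final hidden state is used as the summary c, which here initializes the decoder's hidden state (the variant mentioned in footnote 113; the summary may alternatively be fed as a conditioning input at every decoding step). Vocabulary and sizes are illustrative, and real systems add embeddings, teacher forcing and end-of-sequence handling.

```python
import torch
import torch.nn as nn

in_dim, out_dim, hidden, N, M, batch = 8, 8, 32, 10, 7, 2   # illustrative sizes

encoder = nn.GRU(in_dim, hidden, batch_first=True)
decoder = nn.GRU(out_dim, hidden, batch_first=True)
readout = nn.Linear(hidden, out_dim)

# --- Encoding: accumulate the input sequence into a final state (the summary c) ---
x = torch.randn(batch, N, in_dim)
_, c = encoder(x)                      # c: (1, batch, hidden)

# --- Decoding: generate M items, feeding each output back as the next input ---
y_t = torch.zeros(batch, 1, out_dim)   # a conventional "start" item
state = c                              # summary used as the decoder's initial state
outputs = []
for _ in range(M):
    h, state = decoder(y_t, state)
    y_t = readout(h)                   # predict the next item
    outputs.append(y_t)
output_sequence = torch.cat(outputs, dim=1)   # (batch, M, out_dim)
```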

One limitation of the RNN Encoder-Decoder approach is the difficulty for the summary to memorize very long sequences114. Two possible directions are • using an attention mechanism115; and • using a hierarchical model, as proposed in the MusicVAE architecture, to be introduced in Section 6.12.1.

5.13.4 Variational RNN Encoder-Decoder

An interesting development is a variational version of the RNN Encoder-Decoder, in other words a variational autoencoder (VAE) encapsulating two RNNs. We could notate it as Variational(Autoencoder(RNN, RNN)). The objective is to combine

• the variational property of the VAE for controlling the generation116; and • the sequence generation property of the RNN.

112 The end of the sequence is marked by a special symbol, as when training an RNN, see Section 5.8.2. 113 As noted by Goodfellow et al. in [63, Section 10.4], an alternative is to use the summary c only to initialize the initial hidden state h^d_0 of the decoder. This is, for instance, the strategy chosen in the GLSR-VAE architecture described in Section 6.10.2.3. 114 In text translation applications, sentences have a limited size. 115 Introduced in Section 5.8.4. 116 See Section 5.6.2.

Examples of its application to music generation will be introduced in Section 6.10.2.3.

5.13.5 Polyphonic Recurrent Networks

The RNN-RBM architecture, to be introduced in Section 6.9.1, combines an RBM architecture and a recurrent (LSTM) architecture by coupling them to associate the vertical perspective (simultaneous notes) with the horizontal perspective (temporal sequences of notes) of a polyphony to be generated.

5.13.6 Further Compound Architectures

It is possible to further combine architectures that are already compound, for example • the WaveNet architecture (Section 6.10.3.2), which is a conditioning convolutional feedforward architecture with some tag as the conditioning input, which we could notate as Conditioning(Convolutional(Feedforward), Tag); and • the VRASH architecture (Section 6.10.3.6), which is a variational autoencoder encapsulating RNNs with the decoder being conditioned on history, which we could notate as Variational(Autoencoder(RNN, Conditioning(RNN, History))). There are also some more specific (ad hoc) compound architectures, for example

• Johnson’s Hexahedria architecture (Section 6.9.2), which combines two layers recurrent on the time dimension with two other layers recurrent on the pitch dimension, as an integrated alternative to the RNN-RBM architecture; and • The DeepBach architecture (Section 6.14.2), which combines two feedforward architectures with two recurrent architectures.

5.13.7 The Limits of Composition

There is a natural tendency to explore possible combinations of different architectures with the hope of combining their respective features and merits. An example of a sophisticated compound architecture is the VRASH architecture (Section 6.10.3.6), which combines

• variational autoencoder; • recurrent networks; and • conditioning (on the decoder). However, note that

• not all combinations make sense. For instance, recurrence and convolution over the time dimension would compete, as discussed in Section 5.9; and • there is no guarantee that combining a maximal variety of types will make a sound and accurate architecture117. We will see in Chapter 6 that an important additional design dimension is the strategy, which governs how an architecture will process representations in order to reach a given objective with some expected properties (the challenges).

117 As in the case of a good cook, whose aim is not to simply mix all possible ingredients but to discover original successful combinations.

Chapter 6 Challenge and Strategy

We are now reaching the core of this book. This chapter will analyze in depth how to apply the architectures presented in Chapter 5 to learn and generate music. We will start with a naive, straightforward strategy, using the basic prediction task of a neural network to generate an accompaniment for a melody. We will see that, although this simple direct strategy does work, it suffers from some limitations. We will then study these limitations, some relatively simple to solve, some more difficult and profound – the challenges. We will analyze various strategies1 for each challenge, and illustrate them through different systems2 taken from the relevant literature. This also provides an opportunity to study the possible relationships between architectures and strategies.

6.1 Notations for Architecture and Representation Dimensions

First, let us introduce some compact notations for the dimension of an architecture and for the size of a representation:

• Architecture-typen for an n-layer architecture3, e.g., Feedforward2 for the 2-layer feedforward architecture of the MiniBach system to be introduced in Section 6.2.2, • Architecture-type×n for an n-instance compound architecture, e.g., RNN×2 for the double RNN compound architecture of RL-Tuner to be introduced in Section 6.10.6.1, and • One-hot×n for a multi-one-hot encoding representation, such as:

– an n-time step one-hot encoding, e.g., One-hot×64 for the 64-time step representation of the DeepHearM system to be introduced in Section 6.4.1.1, – an n-voice one-hot encoding, e.g., One-hot×2 for the melody+chords representation of the BluesMC system to be introduced in Section 6.5.1.2, or – a combination of a multi-time step encoding and a multivoice encoding, e.g., One-hot×64×(1+3) for the 64-time step 1-voice input and 3-voice output representation of the MiniBach system to be introduced in Section 6.2.2. An example of a combination of the two notations is LSTM2×2 for the double 2-layer RNN compound architecture of the Anticipation-RNN system to be introduced in Section 6.10.3.5.

1 Remember, and this will be important for the following sections, that, as stated in Chapter 2, we consider here the strategy related to the generation phase and not the training phase (which could be different). 2 As proposed in Chapter 2, we use the term systems for various proposals – architectures, models, prototypes, systems and related experiments – for deep learning-based music generation, collected from the related literature. 3 This notation has actually already been introduced in Section 5.5.2.

6.2 An Introductory Example

6.2.1 Single-Step Feedforward Strategy

The most direct strategy is to use the prediction or classification task of a neural network in order to generate musical content. Let us consider the following objective: for a given melody we want to generate an accompaniment, for example, a counterpoint. We will consider a dataset of examples, each one being a pair (melody, counterpoint melody(ies)). We then train a feedforward neural network architecture in a supervised learning manner on this dataset. Once trained, we can choose an arbitrary melody and feedforward it into the architecture in order to produce a corresponding counterpoint accompaniment, in the style of the dataset. Generation is completed in a single step of feedforward processing. Therefore, we have named this strategy the single-step feedforward strategy.

6.2.2 Example: MiniBach Chorale Counterpoint Accompaniment Symbolic Music Generation System

Let us consider the following objective: generating a counterpoint accompaniment to a given melody for a soprano voice, through three matching parts, corresponding to alto, tenor and bass voices. We will use as a corpus the set of J. S. Bach's polyphonic chorales [5]. As we want this first introductory system to be simple, we consider only 4 measures long excerpts from the corpus. The dataset is constructed by extracting all possible 4 measures long excerpts from the original 352 chorales, also transposed in all possible keys. Once trained on this dataset, the system may be used to generate three counterpoint voices corresponding to an arbitrary 4 measures long melody provided as an input. In some way, it captures the practice of J. S. Bach, who chose various melodies for a soprano and composed the melodies of the three additional voices (alto, tenor and bass) in a counterpoint manner. First, we need to decide the input as well as the output representations. We represent four measures of 4/4 music. Both the input and the output representations are symbolic, of the piano roll type, with one-hot encoding for each voice, i.e. a multi-one-hot encoding for the output representation. The first three voices (soprano, alto and tenor) have a scope of 20 possible notes plus an additional token to encode a hold4, while the last voice (bass) has a scope of 27 possible notes plus the hold symbol. Time quantization (the value of the time step) is set at the sixteenth note, which is the minimal note duration used in the corpus. The input representation has a size of 21 possible notes × 16 time steps × 4 measures, i.e. 21×16×4 = 1,344, while the output representation has a size of (21+21+28)×16×4 = 4,480. The architecture, a feedforward network, is shown in Figure 6.1. As explained previously and because of the mapping between the representation and the architecture, the input layer has 1,344 nodes and the output layer 4,480. There is a single hidden layer with 200 units5. The nonlinear activation function used for the hidden layer is ReLU. The output layer activation function is sigmoid and the cost function used is binary cross-entropy (this is a case of multi2 multiclass single label, see Section 5.5.4). The detail of the architecture and the encoding is shown in Figure 6.2. It shows the encoding of successive music time slices into successive one-hot vectors directly mapped to the input nodes (variables). In the figure, each blackened vector element, as well as each corresponding blackened input node, illustrates the specific encoding (one-hot vector index) of a specific note time slice, depending on its actual pitch (or a hold in the case of a longer note, shown with a bracket). The dual process happens at the output. Each grey output node element illustrates the chosen note (the one with the highest probability), leading to a corresponding one-hot index, leading ultimately to a sequence of notes for each counterpoint voice.

4 See Section 4.9.1. Note that, as a simplification, MiniBach does not consider rests. 5 This is an arbitrary choice.
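A hypothetical sketch (PyTorch) of a MiniBach-like architecture, only to make the sizes and the single-step feedforward mapping concrete: 1,344 input units (21 × 16 × 4), one hidden layer of 200 ReLU units and 4,480 sigmoid output units trained with binary cross-entropy. This is a reconstruction from the description above, not the authors' code, and the training data here is a random placeholder.

```python
import torch
import torch.nn as nn

# Sizes taken from the description above
input_size = 21 * 16 * 4                 # soprano: 21 symbols x 64 time steps = 1,344
output_size = (21 + 21 + 28) * 16 * 4    # alto + tenor + bass = 4,480

model = nn.Sequential(
    nn.Linear(input_size, 200),          # single hidden layer of 200 units
    nn.ReLU(),
    nn.Linear(200, output_size),
    nn.Sigmoid(),                        # one probability per (voice, time step, symbol)
)
loss_fn = nn.BCELoss()                   # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One (hypothetical) training step on a batch of encoded (soprano, counterpoint) pairs
soprano = torch.rand(16, input_size)          # placeholder for one-hot encoded melodies
counterpoint = torch.rand(16, output_size)    # placeholder for multi-one-hot targets
optimizer.zero_grad()
loss = loss_fn(model(soprano), counterpoint)
loss.backward()
optimizer.step()
```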

Fig. 6.1 MiniBach architecture

The characteristics of this system, named MiniBach6, are summarized in our multidimensional conceptual framework (as defined in Chapter 2 Method) in Table 6.1. The notation7 One-hot×64×(1+3) means an encoding with 1 input + 3 output voices, each with 64 (for 4 measures of 16 time steps each) one-hot encodings of notes. The notation Feedforward2 means a 2-layer feedforward architecture (with 1 hidden layer). An example of a chorale counterpoint generated from a soprano melody is shown in Figure 6.3.

Objective      Accompaniment; Counterpoint; Chorale; Bach
Representation Symbolic; Piano roll; One-hot×64×(1+3); Hold
Architecture   Feedforward2
Strategy       Single-step feedforward
Table 6.1 MiniBach summary

6.2.3 A First Analysis

The chorales produced by MiniBach look convincing at first glance. But, independently of a qualitative musical evaluation, where an expert could detect some defects, objective limitations of MiniBach appear: • A structural limitation is that the music produced (as well as the input melody) has a fixed size (one cannot produce a longer or shorter piece of music).

6 MiniBach is actually a strong simplification – but with the same objective, corpus and representation principles – of the DeepBach system to be introduced in Section 6.14.2. 7 These notations, introduced in Section 6.1, will be summarized in Section 7.2.

Fig. 6.2 MiniBach architecture and encoding

• The same melody will always produce exactly the same accompaniment because of the deterministic nature of a feedforward neural network architecture. • The generated accompaniment is produced in a single atomic step, without any possibility of human intervention (i.e. without incrementality and interactivity).

Fig. 6.3 Example of a chorale counterpoint generated by MiniBach from a soprano melody

6.3 A Tentative List of Limitations and Challenges

Let us now introduce a tentative list of limitations (in most cases, properties not fulfilled) and challenges8: • Ex nihilo generation (vs accompaniment); • Length variability (vs fixed length); • Content variability (vs determinism); • Expressiveness (vs mechanization); • Melody-harmony consistency; • Control (e.g., tonality conformance, maximum number of repeated notes. . . ); • Style transfer; • Structure; • Originality (vs imitation); • Incrementality (vs one-shot generation); • Interactivity (vs automation); • Adaptability (vs no improvement through usage); and • Explainability (vs black box).

We will analyze them, along with possible matching solutions, and illustrate them through various example systems.

6.4 Ex Nihilo Generation

The MiniBach system is good at generating an accompaniment (a counterpoint composed of three distinct melodies) matching an input melody. This is an example of supervised learning, as training examples include both an input (a melody) and a corresponding output (accompaniment). Now suppose that our objective is to generate a melody on its own – not as an accompaniment of some input melody – while being based on a style learnt from a corpus of melodies. A standard feedforward architecture and its companion single-step feedforward strategy, such as those used in MiniBach (described in Section 6.2.2), are not appropriate for such an objective.

8 Our shallow distinction between a limitation and a challenge is as follows: limitations have relatively well-understood solutions, whereas challenges are more profound and still the subject of open research.

Let us introduce some strategies to generate new music content ex nihilo or from minimal seed information, such as a starting note or a high-level description.

6.4.1 Decoder Feedforward

The first strategy is based on an autoencoder architecture. As explained in Section 5.6, through the training phase an autoencoder will specialize its hidden layer into a detector of features characterizing the type of music learnt and its variations9. One can then use these features as an input interface to parameterize the generation of musical content. The idea is then to:

• choose a seed as a vector of values corresponding to the hidden layer units; • insert it in the hidden layer; and • feedforward it through the decoder. This strategy, which we name decoder feedforward, will produce a new musical content corresponding to the features, in the same format as the training examples. In order to have a minimal and high-level vector of features, a stacked autoencoder (see Section 5.6.3) is often used. The seed is then inserted at the bottleneck hidden layer of the stacked autoencoder10 and feedforwarded through the chain of decoders. Therefore, a small amount of seed information can generate a much longer (although fixed-length) musical content.
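A hypothetical sketch (PyTorch) of the decoder feedforward strategy: after training a (stacked) autoencoder, only the decoder part is kept, a seed vector is placed at the bottleneck and decoded into a full-length piece of content. Sizes, layer counts and the binarization threshold are placeholders.

```python
import torch
import torch.nn as nn

bottleneck, data_size = 16, 1024   # placeholder sizes

# Decoder half of a (previously trained) stacked autoencoder
decoder = nn.Sequential(
    nn.Linear(bottleneck, 128), nn.ReLU(),
    nn.Linear(128, 512), nn.ReLU(),
    nn.Linear(512, data_size), nn.Sigmoid(),   # e.g., piano-roll probabilities
)

# Decoder feedforward generation: choose a seed for the bottleneck units
# (random here; it could also be an interpolation between learnt feature vectors)
seed = torch.randn(1, bottleneck)
with torch.no_grad():
    generated = decoder(seed)         # fixed-length content in the training format
piano_roll = (generated > 0.5).int()  # naive binarization of the output
```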

6.4.1.1 #1 Example: DeepHear Ragtime Melody Symbolic Music Generation System

An example of this strategy is the DeepHear system by Sun [180]. The corpus used is 600 measures of Scott Joplin's ragtime music, split into 4 measures long segments. The representation used is piano roll with a multi-one-hot encoding. The quantization (time step) is a sixteenth note, thus the representation includes 4 × 16 = 64 time steps (notated as One-hot×64). The number of input nodes is around 5,000, which provides a vocabulary of about 80 possible note values. The architecture is shown in Figure 6.4 and is a 4-layer stacked autoencoder (notated as Autoencoder4) with a decreasing number of hidden units, down to 16 units.

Fig. 6.4 DeepHear stacked autoencoder architecture. Extension of a figure reproduced from [180] with permission of the author

9 To enforce this specialization, sparse autoencoders are often used (see Section 5.6.1). 10 In other words, at the exact middle of the encoder/decoder stack, as shown in Figure 6.6.

After a pre-training phase11, final training is performed, with each provided example used both as an input and as an output, in a self-supervised learning manner (see Section 5.6), as shown in Figure 6.5.

Fig. 6.5 Training DeepHear. Extension of a figure reproduced from [180] with permission of the author

Generation is performed by inputting random data as the seed into the 16 bottleneck hidden layer units12 (shown within a red rectangle) and then by feedforwarding it into the chain of decoders to produce an output (in the same 4 measures long format as the training examples), as shown in Figure 6.6. We summarize the characteristics of DeepHearM13 in Table 6.2.

Fig. 6.6 Generation in DeepHear. Extension of a figure reproduced from [180] with permission of the author

11 We do not detail pre-training here, please refer to, for example, [63, page 528]. 12 The units of the hidden layer represent an embedding (see Section 4.9.3), an arbitrary instance of which is named by Sun a label. 13 We notate this DeepHear melody generation system DeepHearM, where M stands for melody, because another experiment with the same DeepHear architecture but with a different objective will be presented later on in Section 6.10.4.1.

Objective      Melody; Ragtime
Representation Symbolic; Piano roll; One-hot×64
Architecture   Stacked autoencoder = Autoencoder4
Strategy       Decoder feedforward

Table 6.2 DeepHearM summary

In [180], Sun remarks that the system produces a certain amount of plagiarism. Some generated music is almost recopied from the corpus. He states that this is because of the small size of the bottleneck hidden layer (only 16 nodes) [180]. He measured the similarity (defined as the percentage of notes in a generated piece that are also in one of the training pieces) and found that, on average, it is 59.6%, which is indeed quite high, although it does not prevent most of the generated pieces from sounding different.

6.4.1.2 #2 Example: deepAutoController Audio Music Generation System

The deepAutoController system, by Sarroff and Casey [170], is similar to DeepHear (see Section 6.4.1.1) in that it also uses a stacked autoencoder. But the representation is audio, more precisely a spectrum generated by Fourier transform, see [170] for more details. The dataset is composed of 8,000 songs of 10 musical genres, leading to 70,000 frames of magnitude Fourier transforms14. The entire data is normalized to the [0,1] range. The cost function used is mean squared error. The architecture is a 2-layer stacked autoencoder, the bottleneck hidden layer having 256 units and the input and output layers having 1,000 nodes. The authors report that increasing the number of hidden units does not appear to improve the model performance. The system, summarized in Table 6.3, also provides a user interface, analyzed in Section 6.15, to interactively control the generation, e.g., selecting a given input (to be inserted at the bottleneck hidden layer), generating a random input, and controlling (by scaling or muting) the activation of a given unit.

Objective      Audio; User interface
Representation Audio; Spectrum
Architecture   Stacked autoencoder = Autoencoder2
Strategy       Decoder feedforward
Table 6.3 deepAutoController summary

6.4.2 Sampling

Another strategy is based on sampling. Sampling is the action of generating an element (a sample) from a stochastic model according to a probability distribution.

6.4.2.1 Sampling Basics

The main issue for sampling is to ensure that the samples generated match a given distribution. The basic idea is to generate a sequence of sample values in such a way that, as more and more sample values are generated, the distribution of values more closely approximates the target distribution. Sample values are thus produced iteratively,

14 As the authors state in [170]: “We chose to use frames of magnitude FFTs (Fast Fourier transforms) for our models because they may be reconstructed exactly into the original time domain signal when the phase information is preserved, the Fourier coefficients are not altered, and appropriate windowing and overlap-add is applied. It was thus easier to subjectively evaluate the quality of reconstructions that had been processed by the autoencoding models.”

with the distribution of the next sample being dependent only on the current sample value. Each successive sample is generated through a generate-and-test strategy, i.e. by generating a prospective candidate, accepting or rejecting it (based on a defined probability density) and, if needed, regenerating it. Various sampling strategies have been proposed: the Metropolis-Hastings algorithm, Gibbs sampling (GS), block Gibbs sampling, etc. Please see, for example, [63, Chapter 17] for more details about sampling algorithms.

6.4.2.2 Sampling for Music Generation

For musical content, we may consider two different levels of probability distribution (and sampling): • item-level or vertical dimension – at the level of a compound musical item, e.g., a chord. In this case, the distribution is about the relations between the components of the chord, i.e. describing the probability of notes to occur together; and • sequence-level or horizontal dimension – at the level of a sequence of items, e.g., a melody composed of successive notes. In this case, the distribution is about the sequence of notes, i.e. it describes the probability of the occurrence of a specific note after a given note.

An RBM (restricted Boltzmann machine) architecture is generally15 used to model the vertical dimension, i.e. which notes should be played together. As noted in Section 5.7, an RBM architecture is dedicated to learning distributions and can learn efficiently from few examples. This is particularly interesting for learning and generating chords, as the combinatorial nature of possible notes forming a chord is large and the number of examples is usually small. An example of a sampling strategy applied on an RBM for this vertical dimension will be presented in Section 6.4.2.3. An RNN (recurrent neural network) architecture is often used for the horizontal dimension, i.e. which note is likely to be played after a given note, as will be described in Section 6.5.1. As we will see in Section 6.6.1, a sampling strategy may also be added to enforce variability. We will see in Section 6.9.1 that a compound architecture named RNN-RBM may combine and articulate16 these two different approaches: • an RBM architecture with a sampling strategy for the vertical dimension; and • an RNN architecture with an iterative feedforward strategy for the horizontal dimension. An alternative approach is to use sampling as the unique strategy for both dimensions, as witnessed by the DeepBach system to be analyzed in Section 6.14.2.

6.4.2.3 Example: RBM-based Chord Music Generation System

In [11], Boulanger-Lewandowski et al. propose to use a restricted Boltzmann machine (RBM) [81] to model polyphonic music. Their objective is actually to improve the transcription of polyphonic music from audio. But prior to that, the authors discuss the generation of samples from the model that has been learnt, as a qualitative evaluation and also for music generation [12]. In their first experiment, the RBM learns from the corpus the distribution of possible simultaneous notes, i.e. a repertoire of chords. The corpus is the set of J. S. Bach's chorales (as for MiniBach, described in Section 6.2.2). The polyphony (number of simultaneous notes) varies from 0 to 15 and the average polyphony is 3.9. The input representation has 88 binary visible units that span the whole range of piano from A0 to C8, following a many-hot encoding. The sequences are aligned (transposed) onto a single common tonality (e.g., C major/minor) to ease the learning process. One can sample from the RBM through block Gibbs sampling, by performing alternating steps of sampling the hidden layer nodes (considered as variables) from the visible layer nodes, and conversely (see Section 5.7). Figure 6.7 shows various

15 A counterexample is the C-RBM convolutional RBM architecture, to be introduced in Section 6.10.5.1, which models both the vertical dimension (simultaneous notes) and the horizontal dimension (sequence of notes) for single-voice polyphonies. 16 This issue of how to articulate vertical and horizontal dimensions, i.e. harmony with melody, will be further analyzed in Section 6.9.

examples of samples. The vertical axis represents successive possible notes. Each column represents a specific sample composed of various simultaneous notes, with the name of the chord written below when the analysis is unambiguous. Table 6.4 summarizes this RBM-based chord generation system, which we notate RBMC (where C stands for chords).
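A NumPy sketch of block Gibbs sampling from an RBM with binary units (random, untrained parameters here, for illustration only): hidden units are sampled given the visible units, then visible units given the hidden units, and the alternation is repeated. In a system like the one described above, with learnt parameters, the resulting visible configurations are many-hot samples of simultaneous notes (chords).

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 88, 50        # e.g., 88 piano notes (many-hot) and 50 hidden units

# Untrained, random parameters -- in practice these are learnt from the corpus
W = 0.1 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)           # visible biases
b_h = np.zeros(n_hidden)            # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sample(n_steps=200):
    v = rng.integers(0, 2, n_visible).astype(float)   # random initial configuration
    for _ in range(n_steps):
        p_h = sigmoid(v @ W + b_h)                    # sample hidden given visible
        h = (rng.random(n_hidden) < p_h).astype(float)
        p_v = sigmoid(h @ W.T + b_v)                  # sample visible given hidden
        v = (rng.random(n_visible) < p_v).astype(float)
    return v                                          # a many-hot "chord" sample

chord = gibbs_sample()
print(np.nonzero(chord)[0])   # indices of the notes that are "on"
```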

Fig. 6.7 Samples generated by the RBM trained on J. S. Bach chorales. Reproduced from [11] with permission of the authors

Objective      Simultaneous notes (Chord)
Representation Symbolic; Many-hot
Architecture   RBM
Strategy       Sampling

Table 6.4 RBMC summary

6.5 Length Variability

An important limitation of the single-step feedforward strategy (Section 6.2.1) and of the decoder feedforward strategy (Section 6.4.1) is that the length of the music generated (more precisely, the number of time steps or measures) is fixed. It is actually fixed by the architecture, namely the number of nodes of the output layer17. To generate a longer (or shorter) piece of music, one needs to reconfigure the architecture and its corresponding representation.

6.5.1 Iterative Feedforward

The standard solution to this limitation is to use a recurrent neural network (RNN). The typical usage, as initially described for text generation by Graves in [65], is to • select some seed information as the first item (e.g., the first note of a melody); • feedforward it into the recurrent network in order to produce the next item (e.g., next note); • use this next item as the next input to produce the next next item; and • repeat this process iteratively until a sequence (e.g., of notes, i.e. a melody) of the desired length is produced.

17 In the case of an RBM, the number of nodes of the input layer (which also has the role of an output layer).

Note the iterative aspect of the generation, processed element by element. Therefore, we name this approach the iterative time step feedforward strategy, abbreviated as the iterative feedforward strategy. Actually, a recursion – the current output reenters as the next input – is also often present. However, there are a few rare exceptions, as we will see, e.g., in the Sequential (Section 6.8.2) and BLSTM (Section 6.8.3) architectures, where there is an iteration but no recursion. Note that the iterative feedforward strategy, like the decoder feedforward strategy (Section 6.4.1), is one kind of seed-based generation (see Section 6.4), as the full sequence (e.g., a melody) is generated iteratively from an initial seed item (e.g., a starting note).
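A self-contained sketch (PyTorch) of the iterative feedforward strategy: a seed item is fed into a trained next-item RNN, the predicted next item is fed back as the next input, and so on until the desired length is reached. Here the most probable item is taken at each step (the deterministic case discussed in Section 6.6); the model, vocabulary size and seed are illustrative placeholders, and the model below is untrained.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 12, 32   # e.g., 12 chord symbols (illustrative)

class NextItemRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        y, state = self.lstm(self.embed(x), state)
        return self.out(y), state

model = NextItemRNN()   # assumed to have been trained on a corpus of sequences

def generate(seed_item, length):
    """Iterative (and recursive) generation: each output becomes the next input."""
    item = torch.tensor([[seed_item]])          # shape (batch=1, seq=1)
    state, sequence = None, [seed_item]
    with torch.no_grad():
        for _ in range(length - 1):
            logits, state = model(item, state)
            next_item = int(logits[0, -1].argmax())   # deterministic choice
            sequence.append(next_item)
            item = torch.tensor([[next_item]])
    return sequence

print(generate(seed_item=0, length=16))
```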

6.5.1.1 #1 Example: Blues Chord Sequence Symbolic Music Generation System

In [43], Eck and Schmidhuber describe a double experiment undertaken with a recurrent network architecture using LSTMs18. In their first experiment, the objective is to learn and generate chord sequences. The format of representation is piano roll, with two types of sequences: melody and chords, although chords are represented as notes. The melodic range as well as the chord vocabulary is strongly constrained, as the corpus consists of 12 measures long blues and is handcrafted (melodies and chords). The 13 possible notes extend from middle C (C4) to tenor C (C5). The 12 possible chords extend from C to B. A one-hot encoding is used. Time quantization (time step) is set at the eighth note, half of the minimal note duration used in the corpus, which is a quarter note. With 12 measures long music this equates to 96 time steps. A chord training example is shown in Figure 6.8.

Fig. 6.8 A chord training example for blues generation. Reproduced from [43] with permission of the authors

The architecture for this first experiment is: an input layer with 12 nodes (corresponding to a one-hot encoding of the 12-chord vocabulary), a hidden layer with four LSTM blocks containing two cells each19 and an output layer with 12 nodes (identical to the input layer). Generation is performed by presenting a seed chord (represented by a note) and by iteratively feedforwarding the network, producing the prediction of the next time step chord, using it as the next input and so on, until a sequence of chords has been generated. The architecture and the iterative generation are illustrated in Figure 6.9. This system, which we notate BluesC (where C stands for chords), is summarized in Table 6.5.

Objective      Chord sequence; Blues
Representation Symbolic; One-hot; Note end; Chord as note
Architecture   LSTM
Strategy       Iterative feedforward

Table 6.5 BluesC summary

18 This was actually the first experiment in using LSTMs to generate music. 19 See Section 5.8.3 for the difference between LSTM cells and blocks.

Fig. 6.9 Blues chord generation architecture

6.5.1.2 #2 Example: Blues Melody and Chords Symbolic Music Generation System

In Eck and Schmidhuber’s second experiment [43], the objective is to simultaneously generate melody and chord sequences. The new architecture is an extension of the previous one: it has an input layer with 25 nodes (corresponding to a one-hot encoding of the 12 chord vocabulary and to a one-hot encoding of the 13 melody note vocabulary), a hidden layer with eight LSTM blocks (four chord blocks and four melody blocks, as we will see below), containing two cells each, and an output layer with 25 nodes (identical to the input layer). The separation between chords and melody is ensured as follows:

• chord blocks are fully connected to the input nodes and to the output nodes corresponding to chords;
• melody blocks are fully connected to the input nodes and to the output nodes corresponding to melody;
• chord blocks have recurrent connections to themselves and to the melody blocks; and
• melody blocks have recurrent connections only to themselves.

Generation is performed by presenting a seed (note and chord) and by recursively feedforwarding it into the network, producing the prediction of the next time step note and chord, and so on, until a sequence of notes with chords is generated. Figure 6.10 shows an example of the melody and chords generated. Table 6.6 summarizes this second system, which we notate BluesMC (where MC stands for melody and chords).

Objective       Melody + Chords; Blues
Representation  Symbolic; One-hot×2; Note end; Chord as note
Architecture    LSTM
Strategy        Iterative feedforward

Table 6.6 BluesMC summary

This second experiment is interesting in that it simultaneously generates melody and chords. Note that in this second architecture, the recurrent connexions are asymmetric, as the authors wanted to ensure the preponderant role of chords. Chord blocks have recurrent connexions to themselves but also to the melody blocks, whereas melody blocks do not have recurrent connexions to the chord blocks. This means that chord blocks will receive previous step information about chords and melody, whereas melody blocks cannot use previous step information about chords. This somewhat ad hoc configuration of the recurrent connexions in the architecture is a way to control the interaction between harmony and melody in a master-slave manner. The control of the interaction and consistency between melody and harmony is indeed a real issue and it will be further addressed in Section 6.9, where we will analyze alternative approaches.

Fig. 6.10 Example of blues generated (excerpt). Reproduced with permission of the authors

6.6 Content Variability

A limitation of the iterative feedforward strategy on an RNN, as illustrated by the blues generation experiment described in Section 6.5.1.2, is that generation is deterministic. Indeed, a neural network is deterministic20. As a consequence, feedforwarding the same input will always produce the same output. As the generation of the next note, the next next note, etc., is deterministic, the same seed note will lead to the same generated series of notes21. Moreover, as there are only 12 possible input values (the 12 pitch classes), there are only 12 possible melodies.

6.6.1 Sampling

Fortunately, as we will see, the usual solution is quite simple. The assumption is that the output representation of the melody is one-hot encoded. In other words, the output representation is of a piano roll type, the output activation layer is softmax and generation is modeled as a classification task. Instead of always selecting the note with the highest probability, the next note is sampled from the probability distribution computed by the softmax22. See an example in Figure 6.11, where P(x_t = C | x_{<t}) denotes the probability that the next note is a C, given the previously generated notes.
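As a minimal illustration (not the code of any surveyed system; the vocabulary and probability values are hypothetical), sampling from such a softmax output can be implemented as follows:

import numpy as np

# Hypothetical softmax output over a small pitch vocabulary (the values sum to 1).
vocabulary = ["C", "D", "E", "F", "G", "A", "B"]
probabilities = np.array([0.05, 0.05, 0.10, 0.05, 0.15, 0.50, 0.10])

# Deterministic choice: always pick the most probable note (here "A").
most_probable_note = vocabulary[int(np.argmax(probabilities))]

# Stochastic choice: sample according to the distribution, so "A" is picked
# about half of the time and the other notes in proportion to their probability.
sampled_note = np.random.choice(vocabulary, p=probabilities)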

20 There are stochastic versions of artificial neural networks – an RBM is an example – but they are not mainstream. 21 The actual length of the melody generated depends on the number of iterations. 22 The chance of sampling a given class/note is its corresponding probability. In the example shown in Figure 6.11, A♭ has around one chance in two of being selected and B♭ one chance in four.

Fig. 6.11 Sampling the softmax output

6.6.1.1 #1 Example: CONCERT Bach Melody Symbolic Music Generation System

CONCERT (an acronym for CONnectionist Composer of ERudite Tunes), developed by Mozer [139] in 1994, was actually one of the first systems for generating music based on recurrent networks (and before LSTM). It is aimed at generating melodies, possibly with some chord progression as an accompaniment. The input and output representation includes three aspects of a note: pitch, duration and harmonic chord accompaniment. The representation of a pitch, named PHCCCH, is inspired by the psychological pitch representation space of Shepard [173], and is based on five dimensions, as illustrated in Figure 6.12.

Fig. 6.12 CONCERT PHCCCH pitch representation. Inspired by [173] and [139]

The three main components are as follows:

• the pitch height (PH),
• the (modulo) chroma circle (CC) cartesian coordinates, and
• the (harmonic) circle of fifths (CH) cartesian coordinates.

The motivation is to have a more musically meaningful representation of the pitch, by capturing the similarity of octaves and also the harmonic similarity between a note and its fifth. The proximity of two pitches is determined by computing the Euclidean distance between their PHCCCH representations, that distance being invariant under transposition. The encoding of the pitch height is through a scalar variable scaled to range from -9.798 for C1 to +9.798 for C5. The encoding of the chroma circle and of the circle of fifths is through a vector of six binary values, for the reasons detailed in [139]. The resulting encoding includes 13 input variables, with some examples shown in Table 6.7. Note that a rest is encoded as a pitch with a unique code.

Pitch  PH      CC                   CH
C1     -9.798  +1 +1 +1 -1 -1 -1    -1 -1 -1 +1 +1 +1
F♯1    -7.349  -1 -1 -1 +1 +1 +1    +1 +1 +1 -1 -1 -1
G2     -2.041  -1 -1 -1 -1 +1 +1    -1 -1 -1 -1 +1 +1
C3      0      +1 +1 +1 -1 -1 -1    -1 -1 -1 +1 +1 +1
D♯3     1.225  +1 +1 +1 +1 +1 +1    +1 +1 +1 +1 +1 +1
E3      1.633  -1 +1 +1 +1 +1 +1    +1 -1 -1 -1 -1 -1
A4      8.573  -1 -1 -1 -1 -1 -1    -1 -1 -1 -1 -1 -1
C5      9.798  +1 +1 +1 -1 -1 -1    -1 -1 -1 +1 +1 +1
Rest    0      +1 -1 +1 -1 +1 -1    +1 -1 +1 -1 +1 -1

Table 6.7 Examples of PHCCCH pitch representation
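To make the use of this encoding concrete, the following sketch (the values are copied from Table 6.7; the helper names are ours, not Mozer's code) computes the Euclidean distance between two PHCCCH representations. Octave-related pitches share their chroma circle and circle of fifths coordinates, so their distance reduces to the difference of their pitch heights.

import math

# PHCCCH encodings taken from Table 6.7: pitch height (scalar),
# chroma circle (6 values) and circle of fifths (6 values).
encoding = {
    "C3": [0.0,   +1, +1, +1, -1, -1, -1,  -1, -1, -1, +1, +1, +1],
    "E3": [1.633, -1, +1, +1, +1, +1, +1,  +1, -1, -1, -1, -1, -1],
    "C5": [9.798, +1, +1, +1, -1, -1, -1,  -1, -1, -1, +1, +1, +1],
}

def distance(p1, p2):
    """Euclidean distance between two PHCCCH pitch encodings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(encoding[p1], encoding[p2])))

print(distance("C3", "C5"))  # 9.798: octaves only differ in their pitch height
print(distance("C3", "E3"))  # different pitch classes also differ in their circle coordinates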

Durations are considered at a very fine-grained level, each beat (a quarter note) being divided into twelfths, a quarter note thus having a duration of 12/12. This choice allows binary (two or four divisions) as well as ternary (three divisions) rhythms to be represented. In a similar way to the representation of pitch, a duration is represented through a scalar and two circle coordinates, for 1/4 and 1/3 beat cycles, as illustrated in Figure 6.13, resulting in five dimensions directly encoded through a vector of five binary values (see more details in [139]). The temporal scope is a note step, that is, the granularity of processing by the architecture is the note23.

Fig. 6.13 CONCERT duration representation. Inspired by [139]

Chords are represented in an extensional way as a triad or a tetrachord, through the root, the third (major or minor) and the fifth (perfect, augmented or diminished), with the possible addition of a seventh component (minor or major). To represent the next note to be predicted, the CONCERT system actually uses both this rich and distributed representation (named next-node-distributed, see Figure 6.14) and a more concise and traditional representation (named next-node-local), in order to be more intelligible. The activation function is the sigmoid function rescaled to the [−1,+1] range and the cost function is the mean squared error.

23 And not a fixed time step as for most recurrent architectures, e.g., in Section 6.5.1.1. The various types of temporal scope have been introduced in Section 4.8.1.

Fig. 6.14 CONCERT architecture. Reproduced from [139] with permission of Taylor & Francis (www.tandfonline.com)

In the generation phase, the output is interpreted as a probability distribution over a set of possible notes as a basis for deciding the next note in a nondeterministic way, following the sampling strategy. CONCERT has been tested on different examples, notably after training with melodies of J. S. Bach. Figure 6.15 shows an example of a melody generated based on the Bach training set. Although now a bit dated, CONCERT was a pioneering model and the discussion of representation issues in the article is still relevant. Note also that CONCERT (which is summarized in Table 6.8) is representative of the early generation systems, before the advent of deep learning architectures, when representations were designed with rich handcrafted features. One of the benefits of using deep learning architectures is that this kind of rich and deep representation may be automatically extracted and managed by the architecture.

Objective       Melody + Chords
Representation  Symbolic; Harmonics; Harmony; Beat
Architecture    RNN
Strategy        Iterative feedforward; Sampling

Table 6.8 CONCERT summary

6.6.1.2 #2 Example: Celtic Melody Symbolic Music Generation System

Another representative example is the system by Sturm et al. to generate Celtic music melodies [179]. The architecture used is a recurrent network with three hidden layers, which we could notate24 as LSTM3, with 512 LSTM cells in each layer.

24 Note that, as explained in Section 5.5.2, we notate the number of hidden layers without considering the input layer.

Fig. 6.15 Example of melody generation by CONCERT based on the J. S. Bach training set. Reproduced from [139] with permission of Taylor & Francis (www.tandfonline.com)

The corpus comprises folk and Celtic monophonic melodies retrieved from a repository and discussion platform named The Session [99]. Pieces that were too short or too complex (e.g., with varying meters), or that contained errors, were filtered out, leaving a dataset of 23,636 melodies. All melodies are aligned (transposed) to the single key of C. One of the specificities is that the representation chosen is textual, namely the token-based folk-rnn notation, a transformation of the character-based ABC notation (see Section 4.7.3). The number of input and output nodes is equal to the number of tokens in the vocabulary (i.e. with a one-hot encoding), in practice equal to 137. The output of the network is a probability distribution over its vocabulary. Training the recurrent network is done in an iterative way, as the network learns to predict the next item. Once trained, generation is done iteratively by inputting a random token or a specific token (e.g., corresponding to a specific starting note), feedforwarding it to generate the output, sampling from this probability distribution, and recursively using the selected vocabulary element as a subsequent input, in order to produce a melody element by element. The final step is to decode the generated folk-rnn representation into a MIDI format melody to be played. See Figure 6.16 for an example of a generated melody. One may also see and listen to results on [178]. The results are very convincing, with melodies generated in a clear Celtic style. The system is summarized in Table 6.9. As observed in [67]: “It is interesting to note that in this approach the bar lines and the repeat bar lines are given explicitly and are to be predicted as well. This can cause some issues, since there is no guarantee that the output sequence of tokens would represent a valid song in ABC format. There could be too many notes in one bar for example, but according to the authors, this rarely occurs. This would tend to show that such an architecture is able to learn to count.”25

Objective       Melody
Representation  Symbolic; Text; Token-based; One-hot
Architecture    LSTM3
Strategy        Iterative feedforward; Sampling

Table 6.9 Celtic system summary

25 On this issue, see also [60].

Fig. 6.16 Score of “The Mal’s Copporim” automatically generated. Reproduced from [179] with permission of the authors

6.7 Expressiveness

One limitation of most existing systems is that they consider fixed dynamics (amplitude) for all notes as well as an exact quantization (a fixed tempo), which makes the generated music sound too mechanical, without expressiveness or nuance. A natural approach is to consider representations recorded from real performances, and not simply from scores, and therefore with musically grounded (by skilled human musicians) variations of tempo and of dynamics, as discussed in Section 4.10. Note that an alternative approach is to automatically augment the generated music information (e.g., a standard MIDI piece) with slight transformations of the amplitude and/or the tempo. An example is the Cyber-João system [30], which performs bossa nova guitar accompaniment with expressiveness, through automatic retrieval26 and application of rhythmic patterns27. As noted in Section 4.10.3, in the case of an audio representation, expressiveness is implicit to the representation. However, it is difficult28 to separately control the expressiveness (dynamics or tempo) of a single instrument or voice, as the representation is global.

6.7.1 Example: Performance RNN Piano Polyphony Symbolic Music Generation System

In [174], Simon and Oore present their architecture and methodology named Performance RNN. It is an LSTM-based recurrent neural network architecture. One of its specificities lies in the dataset characteristics, as the corpus is composed of recorded human performances, with records of exact timing as well as dynamics for each note played. The corpus used is the Yamaha e-Piano Competition dataset, whose participants' MIDI performance records are made available to the public [210]. It captures more than 1,400 performances by skilled pianists. To create additional training examples, some time stretching (up to 5% faster or slower) as well as some transposition (up to a major third) are applied.

26 By a mixed use of production rules and case-based reasoning (CBR). 27 These patterns have been manually extracted from a corpus of performances by the guitarist and singer João Gilberto, one of the inventors of the bossa nova style. One could also consider automatic extraction, as, for example, in [32]. 28 But not impossible to achieve, regarding recent achievements made on audio source separation through deep learning techniques, as has been pointed out in Section 4.10.3.

The representation is adapted to the objective. At first look, it resembles a piano roll with MIDI note numbers but it is actually a bit different. Each time slice is a multi-one-hot vector of the possible values for each of the following possible events:

• start of a new note – with 128 possible values (MIDI pitches),
• end of a note – with 128 possible values (MIDI pitches),
• time shift – with 100 possible values (from 10 milliseconds to 1 second), and
• dynamics – with 32 possible values (128 MIDI velocities quantized into 32 bins29).

An example of a performance representation is shown in Figure 6.17.
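The following sketch only illustrates the size of such an event vocabulary and one possible (hypothetical) flat indexing of the events; it is not the authors' encoding, which is a multi-one-hot vector per time slice as described above.

# Hypothetical flat indexing of a Performance-RNN-like event vocabulary:
# 128 note-on events, 128 note-off events, 100 time-shift events and
# 32 dynamics (velocity) bins, i.e. 388 events in total.
NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = 128, 128, 100, 32
VOCABULARY_SIZE = NOTE_ON + NOTE_OFF + TIME_SHIFT + VELOCITY  # 388

def note_on(pitch):            # MIDI pitch in 0..127
    return pitch

def note_off(pitch):           # MIDI pitch in 0..127
    return NOTE_ON + pitch

def time_shift(steps):         # steps in 1..100, i.e. from 10 ms to 1 s
    return NOTE_ON + NOTE_OFF + (steps - 1)

def dynamics(midi_velocity):   # 128 MIDI velocities quantized into 32 bins
    return NOTE_ON + NOTE_OFF + TIME_SHIFT + midi_velocity // 4

# Example: set a medium dynamics, play middle C (60) for about 500 ms, then release it.
events = [dynamics(80), note_on(60), time_shift(50), note_off(60)]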

Fig. 6.17 Example of Performance RNN representation. Reproduced from [174] with permission of the authors

Some control is made available to the user, referred to as the temperature, which controls the randomness of the generated events in the following way:

• a temperature of 1.0 uses the exact distribution predicted,
• a value smaller than 1.0 reduces the randomness and thus increases the repetition of patterns, and
• a larger value increases the randomness and decreases the repetition of patterns.

Examples are available on the web page [174]. Performance RNN is summarized in Table 6.10.
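A minimal sketch of how such a temperature is typically applied (here to unnormalized scores, i.e. logits; the values are placeholders and this is not the authors' code):

import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    """Sample an event index after rescaling the predicted scores by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probabilities = np.exp(scaled - np.max(scaled))   # numerically stable softmax
    probabilities /= probabilities.sum()
    return np.random.choice(len(probabilities), p=probabilities)

logits = [2.0, 1.0, 0.5, 0.1]
sample_with_temperature(logits, temperature=1.0)   # the exact distribution predicted
sample_with_temperature(logits, temperature=0.5)   # sharper distribution: more repetition
sample_with_temperature(logits, temperature=1.5)   # flatter distribution: more randomness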

Objective       Polyphony; Performance control
Representation  Symbolic; One-hot×4; Time shift; Dynamics
Architecture    LSTM
Strategy        Iterative feedforward; Sampling

Table 6.10 Performance RNN summary

6.8 RNN and Iterative Feedforward Revisited

As we saw in previous examples, the iterative feedforward strategy is based on the idea of the recurrent neural network (RNN) architecture to iteratively generate the successive items of a sequence. It may look as if the recurrent neural network

29 See the description of the binning transformation in Section 4.11.

architecture and the iterative feedforward strategy are strongly coupled. Indeed, almost all RNN-based systems use an iterative feedforward strategy and recursively reenter the output produced (the next generated time step) into the input. But we will introduce some exceptions in this section.

6.8.1 #1 Example: Time-Windowed Melody Symbolic Music Generation System

The experiments by Todd in [190] were one of the very first attempts (in 1989) at exploring how to use artificial neural networks to generate music. Although the architectures he proposed are not directly used nowadays, his experiments and discussion were pioneering and are still an important source of information. Todd's objective was to generate a monophonic melody in some iterative way. He experimented with different choices for representing the notes (see Section 4.5.3) and the durations, but finally decided to use a conventional pitch note representation with a one-hot encoding and a time step temporal scope approach. The time step is set at the duration of an eighth note. In most of the experiments, the input melodies used for training are 34 time steps long (that is, four and a half measures long), padded at the end with rests. A note begin is represented with a specific token and is encoded as an additional value encoding node (see Sections 4.9.1 and 4.11.7). Rests are not encoded explicitly but as the absence of a note, i.e. as the note one-hot encoding being filled with all null values (see Section 4.11.7). The first experiment is what the author named the Time-Windowed architecture, where a sliding window of successive time periods of fixed size is considered. In practice this sliding window of a melody segment is one measure long, i.e. 8 time steps. Its representation may be considered as a piano roll, as in the MiniBach architecture (see Section 6.2.2), with the successive one-hot encodings of notes for the 8 successive time steps, notated as One-hot×8. The architecture is a feedforward network (and not an RNN), with a melody segment as its input, the next melody segment as its output and a single hidden layer. Generation is conducted iteratively (and recursively), melody segment by melody segment. The architecture is illustrated in Figure 6.18. For each time step of the melody segment, the predicted note is the one with the highest probability. Because of the zero-hot encoding of a rest (i.e. as all values being null), there is an ambiguity between the case where every possible note has a low probability and the case of a rest (see Section 4.11.7). For that reason, a probability threshold is introduced, namely 0.5. Thus, the predicted note is the one with the highest probability if this probability is greater than 0.5, and a rest otherwise. The network is trained in a supervised way by presenting a melody segment as an input and its corresponding next segment as the output, and repeating this for various segments. Note that, as the architecture is not recurrent, although the network will learn the pairwise correlations between two successive melody segments30, there is no explicit memory for learning longer term correlations such as in the case of a recurrent network architecture. Thus, although the author does not show a comparison with his next experiment (see the next section), the architecture appears to have a low ability to learn long term correlations. The Time-Windowed architecture is summarized in Table 6.11.

Objective       Melody
Representation  Symbolic; Piano roll; One-hot×8; Note begin; Implicit rest
Architecture    Feedforward
Strategy        Iterative feedforward

Table 6.11 Time-Windowed summary
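The threshold-based decoding rule described above may be sketched as follows (a minimal illustration, not Todd's code; the note vocabulary and the REST marker are hypothetical):

import numpy as np

REST = "rest"                                        # marker for the absence of a note
NOTES = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]   # hypothetical pitch vocabulary

def decode_time_step(output_activations, threshold=0.5):
    """Return the most probable note, or a rest if no activation exceeds the threshold."""
    output_activations = np.asarray(output_activations)
    best = int(np.argmax(output_activations))
    return NOTES[best] if output_activations[best] > threshold else REST

decode_time_step([0.1, 0.2, 0.9, 0.0, 0.1, 0.0, 0.0])  # -> "E4"
decode_time_step([0.2, 0.3, 0.1, 0.2, 0.1, 0.1, 0.1])  # -> "rest"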

30 In that respect, the Time-Windowed model is analogous to an order-1 Markov model (considering only the previous state) at the level of a melody measure.

Fig. 6.18 Time-Windowed architecture. Inspired from [190]

6.8.2 #2 Example: Sequential Melody Symbolic Music Generation System

In [190], Todd proposed another architecture, which he named Sequential, as notes are generated in sequence. It is illustrated in Figure 6.19. The input layer is divided into two parts, named the context and the plan. The context is the actual memory (of the melody generated so far) and consists of units corresponding to each note (D4 to C6), plus a unit for the note begin information (notated as “nb” in Figure 6.19). Therefore, it receives information from the output layer, which produces the next note, with a reentering connexion corresponding to each unit31. In addition, as Todd explains: “A memory of more than just the single previous output (note) is kept by having a self-feedback connection on each individual context unit.”32 The plan represents a melody (among many) that the network has learnt. Todd experimented with various encodings, one-hot or distributed (through a many-hot embedding).

31 Note that the output layer is isomorphic to the context layer. 32 This is a peculiar characteristic of this architecture, as in a standard recurrent network architecture recurrent connexions are encapsulated within the hidden layer (see Figures 5.30 and 5.34). The argument by Todd in [190] is that context units are more interpretable than hidden units: “Since the hidden units typically compute some complicated, often uninterpretable function of their inputs, the memory kept in the context units will likely also be uninterpretable. This is in contrast to [this] design, where, as described earlier, each context unit keeps a memory of its corresponding output unit, which is interpretable.”

Fig. 6.19 Sequential architecture. Inspired from [190]

Training is done by selecting a plan (melody) to be learnt. The activations of the context units are initialized to 0 in order to begin with a clean empty context. The network is then feedforwarded and its output, corresponding to the first time step note, is compared to the first time step note of the melody to be learnt, resulting in an adjustment of the weights. The output values33 are passed back to the current context. Then, the network is feedforwarded again, leading to the next time step note, again compared to the melody target, and so on until the last time step of the melody. This process is then repeated for various plans (melodies). Generation of new melodies is conducted by feedforwarding the network with a new plan embedding, corresponding to a new melody (not part of the training plans/melodies). The activations of the context units are initialized to 0 in order to begin with a clean empty context. The generation takes place iteratively, time step after time step. Note that, as opposed to most cases of the iterative feedforward strategy (Section 6.5.1), in which the output is explicitly reentered (recursively) into the input of the architecture, in Todd's Sequential architecture the reentrance is implicit because of the specific nature of the recurrent connexions: the output is reentered into the context units while the input – the plan melody – is constant. After having trained the network on a plan melody, various melodies may be generated by extrapolation by inputting new plans, as shown in Figure 6.20. A “:” indicates when the network output goes into a fixed loop. One could also interpolate between several (two or more) plan melodies that have been learnt34. Examples are shown in Figure 6.21. The Sequential architecture is summarized in Table 6.12.

33 Actually, as an optimization, Todd proposes later in his description to pass back the target values rather than the output values. 34 Note that this approach is actually a precursor of performing interpolation on embeddings of melodies to be generated, by combining a decoder feedforward strategy and an iterative feedforward strategy, as for example in the VRAE or MusicVAE systems, to be described in Sections 6.10.2.3 and 6.12.1, respectively.

Fig. 6.20 Examples of melodies generated by the Sequential architecture. (o) Original plan melody learnt. (e1 and e2) Melodies generated by extrapolating from a new plan melody. Inspired from [190]

Fig. 6.21 Examples of melodies generated by the Sequential architecture. (oA and oB) Original plan melodies learnt. (i1 and i2) Melodies generated by interpolating between oA plan and oB plan melodies. Inspired from [190]

Objective       Melody
Representation  Symbolic; Piano roll; One-hot; Note begin; Implicit rest
Architecture    RNN
Strategy        Iterative feedforward

Table 6.12 Sequential architecture summary

6.8.3 #3 Example: BLSTM Chord Accompaniment Symbolic Music Generation System

The BLSTM (Bidirectional LSTM) chord accompaniment system by Lim et al. [120] is a rare and interesting case35 of an accompaniment system based on a recurrent architecture. The objective is to generate a progression (sequence) of chords as an accompaniment to a melody (specified symbolically). The corpus is imported from a now defunct public lead sheet database. The authors selected 2,252 lead sheets of various western modern music (rock, pop, country, jazz, folk, R&B, children's songs, etc.), all in a major key and the majority with a single chord per measure (otherwise only the first chord is considered). This results in a training set of 1,802 songs (making a total of 72,418 measures) and a test set of 450 songs (17,768 measures). All songs are transposed (aligned) to the key of C major. The desired characteristics are extracted from the original XML files and converted to a CSV36 (spreadsheet) matrix format, as shown in Figure 6.22. The specificities (simplifications) of the representation are as follows:

• for the melody37, only pitch classes are considered (and octaves are not), resulting in a 12-note one-hot encoding (named 12-vector) plus the rest; and
• for the chords, only their primary triads are considered, with only two types, major and minor, resulting in a 24-chord one-hot encoding.

35 As noted in Sections 5.8.2 and 6.5.1. 36 CSV stands for Comma-separated values. 37 And obviously also for the chords.

Fig. 6.22 Example of extracted data from a single measure. Reproduced from [120] under a CC BY 4.0 licence

The architecture is a bidirectional LSTM with two LSTM layers, each with 128 units. The motivation is to provide the network with the musical context both backward and forward in time. The time step considered by the architecture is four measures long, as shown in Figure 6.23. The tanh function is used as the nonlinear activation function for the hidden layers and softmax is used as the output layer activation function, with categorical cross-entropy as its associated cost function.

Training is done with various four-measure-long samples as input and their associated four chords as output, generated by sliding a four-measure-long window over each training song. Generation is done by iteratively feedforwarding successive four-measure-long melody fragments (time slices) of a song and concatenating the resulting four-measure-long chord progression fragments. The architecture is peculiar in that, although recurrent, generation is not recursive and the output data has a different nature and structure (chords) than the input data (notes). Furthermore, note that, although the strategy is iterative and the architecture is recurrent, the granularity of each iterative step is quite coarse, as it is 4 measures long, as opposed to most systems based on recurrent architectures and an iterative feedforward strategy, which consider the time step at the level of the smallest note duration (see, e.g., the system analyzed in Section 6.5.1.1). This kind of mixed architecture/strategy between forward/single step and recurrent/iterative may have been motivated by the objective of sufficiently capturing the history of horizontal correlations (between notes of the melody and between chords of the accompaniment) while the LSTM cells focus on capturing the history of vertical correlations (between notes and chords).

The system has been evaluated by comparing it to a hidden Markov model (HMM) and to a deep neural network–HMM hybrid model (named DNN-HMM, see details in [199]), both quantitatively (by comparing accuracies and through confusion matrices) and qualitatively (through a web-based survey of 25 musically untrained participants). Results show a better accuracy for and preference towards the BLSTM model; see a simple example in Figure 6.24. The authors note that the evaluation also shows that, when songs are unknown, the preference for the BLSTM model is weaker. They conjecture that this is because BLSTM often generates a more diverse chord sequence than the original. The BLSTM system is summarized in Table 6.13.

Objective       Accompaniment; Chord sequence; Western modern music
Representation  Symbolic; CSV; One-hot×(12×* + 24×4); Rest
Architecture    LSTM2
Strategy        Iterative feedforward

Table 6.13 BLSTM summary
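To make the sliding-window construction of training pairs concrete, here is a rough sketch (the data layout and names are hypothetical, not the authors' code; a song is assumed to be a list of measures, each carrying its melody content and its single retained chord):

def sliding_window_pairs(song, window=4):
    """Build (four measures of melody, four associated chords) training pairs."""
    pairs = []
    for start in range(len(song) - window + 1):
        measures = song[start:start + window]
        melody_input = [m["melody"] for m in measures]  # e.g., per-time-step pitch-class one-hot vectors
        chord_output = [m["chord"] for m in measures]   # the chord retained for each measure
        pairs.append((melody_input, chord_output))
    return pairs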

Fig. 6.23 BLSTM architecture. Reproduced from [120] under a CC BY 4.0 licence

6.8.4 Summary

In summary, we have seen that an RNN architecture is usually coupled to an iterative feedforward strategy, which allows a recursive seed-based variable-length generation, as discussed in Section 6.5. However, there are some exceptions:

• the Time-Windowed system by Todd (Section 6.8.1) uses an iterative feedforward strategy on a feedforward architecture in order to generate a melody, and
• the BLSTM system (Section 6.8.3) uses an iterative feedforward strategy on a recurrent architecture in order to generate a chord accompaniment to a melody.

Fig. 6.24 Comparison of generated chord progressions (HMM, DNN-HMM, BLSTM and original). Reproduced from [120] under a CC BY 4.0 licence

We will see further (with the VRAE system to be described in Section 6.10.2.3) the use of an RNN Encoder-Decoder compound architecture (Section 5.13.3) as a way to decouple the length of the input sequence from the length of the output sequence, by combining the decoder feedforward strategy with the iterative feedforward strategy. Some other examples of couplings between architectures and strategies, or between challenges, will be discussed in Section 6.18. Before that, we will continue to analyze challenges and possible solutions or directions.

6.9 Melody-Harmony Interaction

When the objective is to generate a melody simultaneously with an accompaniment, expressed through some harmony or counterpoint38, an issue is the musical consistency between the melody and the harmony. Although a general architecture such as MiniBach (Section 6.2.2) is supposed to have learnt correlations, interactions between vertical and horizontal dimensions are not explicitly considered. We analyzed in Section 6.5.1.2 an example of a specific architecture to simultaneously generate melody and chords, with explicit relations between them (i.e. chords can use previous step information about the melody but not the opposite). However, this architecture is a bit ad hoc. In the following sections, we will analyze some more general architectures designed with the interactions between melody and harmony in mind.

6.9.1 #1 Example: RNN-RBM Polyphony Symbolic Music Generation System

In [11], Boulanger-Lewandowski et al. have associated a recurrent network (RNN) to the RBM-based architecture introduced in Section 6.4.2.3, in order to represent the temporal sequence of notes. The idea is to couple the RBM to a deterministic RNN with a single hidden layer, such that

• the RNN models the temporal sequence to produce successive outputs, corresponding to successive time steps,
• which are the parameters, more precisely the biases, of an RBM that models the conditional probability distribution of the accompaniment notes, i.e. which notes should be played together.

38 Harmony and counterpoint are dual approaches of accompaniment with different focus and priorities. Harmony focuses on the ver- tical relations between simultaneous notes, as objects on their own (chords), and then considers the horizontal relations between them (e.g., harmonic cadences). Conversely, counterpoint focuses on the horizontal relations between successive notes for each simultaneous melody (a voice), and then considers the vertical relations between their progression (e.g., to avoid parallel fifths). Note that, although their perspectives are different, the analysis and control of relations between vertical and horizontal dimensions are their shared objectives.

In other words, the objective is to combine a horizontal view (the temporal sequence) and a vertical view (the combination of notes for a particular time step). The resulting architecture, named RNN-RBM, is shown in Figure 6.25 and can be interpreted as follows:

Fig. 6.25 RNN-RBM architecture. Reproduced from [12] with permission of the authors

• the bottom line represents the temporal sequence of RNN hidden units u(0), u(1), ..., u(t), where the u(t) notation means39 the value of the RNN hidden layer u at time (index) t; and
• the upper part represents the sequence of each RBM instance at time t, which we could notate RBM(t), with

– v(t) its visible layer, with bv(t) its bias,
– h(t) its hidden layer, with bh(t) its bias, and
– W the weight matrix of connexions between the visible layer v(t) and the hidden layer h(t).

There is a specific training algorithm, which we will not detail here; please refer to [11]. During generation, each time step t of processing is as follows:

• compute the biases bv(t) and bh(t) of RBM(t), via Equations 6.1 and 6.2 respectively,
• sample from RBM(t) by using block Gibbs sampling to produce v(t), and
• feedforward the RNN with v(t) as the input, using the previous RNN hidden layer value u(t−1), in order to produce the new RNN hidden layer value u(t) via Equation 6.3, where

– Wvu is the weight matrix and bu the bias for the connexions between the input layer of the RBM and the hidden layer of the RNN; and
– Wuu is the weight matrix for the recurrent connexions of the hidden layer of the RNN.

bv(t) = bv + Wuv u(t−1)                        (6.1)

bh(t) = bh + Wuh u(t−1)                        (6.2)

u(t) = tanh(bu + Wuu u(t−1) + Wvu v(t))        (6.3)

Note that the biases bv(t) and bh(t) of RBM(t) are variable for each time step, in other words they are time dependent, whereas the weight matrix W of the connexions between the visible and the hidden layer of RBM(t) is shared for all time steps (for all RBMs), i.e. it is time independent40.
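The per-time-step generation procedure may be sketched as follows (a simplified illustration with placeholder parameter values, not the trained model of [11]; the block Gibbs sampling is reduced to a fixed number of alternations):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generation_step(u_prev, params, gibbs_steps=25):
    """One time step of RNN-RBM generation (Equations 6.1 to 6.3)."""
    W, bv, bh = params["W"], params["bv"], params["bh"]   # time-independent RBM parameters
    Wuv, Wuh, Wuu, Wvu, bu = (params[k] for k in ("Wuv", "Wuh", "Wuu", "Wvu", "bu"))

    bv_t = bv + Wuv @ u_prev          # Equation 6.1: time-dependent visible bias
    bh_t = bh + Wuh @ u_prev          # Equation 6.2: time-dependent hidden bias

    # Block Gibbs sampling in RBM(t) to produce the visible vector v(t) (the notes played).
    v = (rng.random(bv_t.shape) < sigmoid(bv_t)).astype(float)
    for _ in range(gibbs_steps):
        h = (rng.random(bh_t.shape) < sigmoid(W.T @ v + bh_t)).astype(float)
        v = (rng.random(bv_t.shape) < sigmoid(W @ h + bv_t)).astype(float)

    u = np.tanh(bu + Wuu @ u_prev + Wvu @ v)   # Equation 6.3: new RNN hidden state
    return v, u

# Placeholder dimensions and randomly initialized (untrained) parameters.
n_v, n_h, n_u = 88, 64, 32
params = {"W": 0.01 * rng.standard_normal((n_v, n_h)),
          "bv": np.zeros(n_v), "bh": np.zeros(n_h), "bu": np.zeros(n_u),
          "Wuv": 0.01 * rng.standard_normal((n_v, n_u)),
          "Wuh": 0.01 * rng.standard_normal((n_h, n_u)),
          "Wuu": 0.01 * rng.standard_normal((n_u, n_u)),
          "Wvu": 0.01 * rng.standard_normal((n_u, n_v))}
v, u = generation_step(np.zeros(n_u), params)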

39 Note that the usual notation would be ut, as the u(t) notation is usually reserved to index dataset examples (the tth example), see Section 5.8. 40 The Wuv, Wuh, Wuu and Wvu weight matrices are also shared and thus time independent.

Four different corpora have been used in the experiments: classical piano, folk tunes, orchestral classical music and J. S. Bach chorales. Polyphony varies from 0 to 15 simultaneous notes, with an average value of 3.9. A piano roll representation is used with a many-hot encoding of 88 units representing pitches from A0 to C8. Discretization (the time step) is a quarter note. All examples are transposed to a single common tonality: C major or C minor. An example of a sample generated in a piano roll representation is shown in Figure 6.26. The quality of the model has made RNN-RBM, summarized in Table 6.14, one of the reference architectures for polyphonic music generation.

Fig. 6.26 Example of a sample generated by RNN-RBM trained on J. S. Bach chorales. Reproduced from [12] with permission of the authors

Objective       Polyphony
Representation  Symbolic; Many-hot
Architecture    RNN-RBM
Strategy        Iterative feedforward; Sampling

Table 6.14 RNN-RBM summary

6.9.1.1 Other RNN-RBM Systems

There have been a few systems following on and extending the RNN-RBM architecture, but they are not significantly different and furthermore they have not been thoroughly evaluated. However, it is worth mentioning the following:

• the RNN-DBN architecture41, using multiple hidden layers [61]; and
• the LSTM-RTRBM architecture, using an LSTM instead of an RNN [121].

41 This is apparently the state of the art for the J. S. Bach Chorales dataset in terms of cross-entropy loss.

6.9.2 #2 Example: Hexahedria Polyphony Symbolic Music Generation Architecture

The system for polyphonic music proposed by Johnson in his Hexahedria blog [94] is hybrid and original in that it integrates into the same architecture

• a first part made of two RNN (actually LSTM) layers, each with 300 hidden units, recurrent over the time dimension, which are in charge of the temporal horizontal aspect, that is the relations between notes in a sequence. Each layer has connections across time steps, while being independent across notes; and
• a second part made of two other RNN (LSTM) layers, with 100 and 50 hidden units, recurrent over the note dimension, which are in charge of the harmony vertical aspect, that is the relations between simultaneous notes within the same time step. Each layer is independent between time steps but has transversal directed connexions between notes.

We can notate this architecture as LSTM2+2 in order to highlight the two successive 2-level recurrent layers, recurrent in two different dimensions (time and note). The architecture is actually a kind of integration within a single architecture42 of the RNN-RBM architecture described in the previous section. The main originality is in using recurrent networks not only on the time dimension but also on the note dimension, more precisely on the pitch class dimension. This latter type of recurrence is used to model the occurrence of a simultaneous note based on the other simultaneous notes. Like the time relation, which is oriented towards the future, the pitch class relation is oriented towards higher pitches, from C to B.

Fig. 6.27 Hexahedria architecture (folded). Reproduced from [94] with permission of the author

The resulting architecture is shown in its folded form in Figure 6.27 and in its unfolded form43 in Figure 6.28, with three axes represented:

• the flow axis, shown horizontally and directed from left to right, represents the flow of (feedforward) computation through the architecture, from the input layer to the output layer;
• the note axis, shown vertically and directed from top to bottom, represents the connexions between units corresponding to successive notes of each of the two last (note-oriented) recurrent hidden layers; and

42 We will see in Section 6.9.3 an alternative architecture, named Bi-Axial LSTM, where each of the 2-level time-recurrent layers is encapsulated into a different architectural module. 43 Our unfolded pictorial representation of an RNN shown in Figure 5.29 was actually inspired by Johnson's Hexahedria pictorial representation.

Fig. 6.28 Hexahedria architecture (unfolded). Reproduced from [94] with permission of the author

• the time axis, only in the unfolded Figure 6.28, shown diagonally and directed from top left to bottom right, represents the time steps and the propagation of the memory within a same unit of the two first (time-oriented) recurrent hidden layers.

The dataset is constructed by extracting 8-measure-long parts from MIDI files from the Classical piano MIDI database [105]. The input representation used is piano roll, with the pitch represented as the MIDI note number. More specific information is added: the pitch class, the previous note played (as a way to represent a possible hold), how many times a pitch class has been played in the previous time step and the beat (the position within the measure, assuming a 4/4 time signature). The output representation is also a piano roll, in order to represent the possibility of more than one note at the same time. Generation is done in an iterative way (i.e. following the iterative feedforward strategy), as for most recurrent networks. The system is summarized in Table 6.15.

Objective       Polyphony
Representation  Symbolic; Piano roll; Hold; Beat
Architecture    LSTM2+2
Strategy        Iterative feedforward; Sampling

Table 6.15 Hexahedria summary

6.9.3 #3 Example: Bi-Axial LSTM Polyphony Symbolic Music Generation Architecture

Johnson recently proposed an evolution of his original Hexahedria architecture, described in Section 6.9.2, named Bi-Axial LSTM (or BALSTM) [95]. The representation used is piano roll, with note hold and rest tokens added to the vocabulary. Various corpora are used: the JSB Chorales dataset, a corpus of 382 four-part chorales by J. S. Bach [1]; the MuseData library, an electronic classical music library from CCARH in Stanford [77]; the Nottingham database, a collection of 1,200 folk tunes in ABC notation [54]; and the Classical piano MIDI database [105]. Each dataset is transposed (aligned) into the key of C major or C minor.

The probability of playing a note depends on two types of information:

• all notes at previous time steps – this is modeled by the time-axis module; and
• all notes within the current time step that have already been generated (the order being lowest to highest) – this is modeled by the note-axis module.

There is an additional front end layer, named “Note Octaves”, which transforms each note into a vector of all its possible corresponding octave notes (i.e. an extensional version of pitch classes). The resulting architecture is illustrated in Figure 6.2944. The “x2” represents the fact that each module is stacked twice (i.e. has two layers). The time-axis module is recurrent in time (as for a classical RNN), the LSTM weights being shared across notes in order to gain note transposition invariance. The note-axis module45 is recurrent in note. For each note input of the note-axis module, ⊕ represents the concatenation of the corresponding output from the time-axis module with the already predicted lower notes. Sampling (into a binary value, by using a coin flip) is applied to each note output probability in order to compute the final prediction (whether that note is played or not).

As pointed out by Johnson [95], during the training phase, as all the notes at all time steps are known, the training process may be accelerated by processing each layer independently (e.g., on a GPU), by running the input through the two time-axis layers in parallel across all notes, and by using the two note-axis layers to compute probabilities in parallel across all time steps. The generation phase is sequential for each time step (following both the iterative feedforward strategy and the sampling strategy). An excerpt of generated music is shown in Figure 6.30.

The Bi-Axial LSTM system, summarized in Table 6.16, has been evaluated and compared to some other architectures. The author reports noticeably better results with Bi-Axial LSTM, the greatest improvements being on the MuseData [77] and the Classical piano MIDI database [105] datasets, and states in [95] that: “It is likely due to the fact that those datasets contain many more complex musical structures in different keys, which are an ideal case for a translation-invariant architecture.” Note that an extension of the Bi-Axial LSTM architecture with conditioning, named DeepJ, will be introduced in Section 6.10.3.4.
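The two nested dependencies (over time steps, then over notes within a time step) lead to the following generation order, sketched here with hypothetical time_axis and note_axis helpers standing for the two stacked LSTM modules (this is an illustration of the ordering only, not Johnson's implementation):

import numpy as np

rng = np.random.default_rng(0)

def generate(time_axis, note_axis, n_time_steps, n_notes):
    """Bi-Axial-style generation: sequential in time, then in note order (lowest to highest)."""
    piano_roll = np.zeros((n_time_steps, n_notes))
    for t in range(n_time_steps):
        time_features = time_axis(piano_roll[:t])            # summary of all previous time steps
        for n in range(n_notes):
            # The note-axis module is conditioned on the already generated lower notes.
            p_play = note_axis(time_features, piano_roll[t, :n])
            piano_roll[t, n] = float(rng.random() < p_play)  # coin flip (Bernoulli) sampling
    return piano_roll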

Objective       Polyphony
Representation  Symbolic; Piano roll; Hold; Rest
Architecture    Bi-Axial LSTM = LSTM2×2
Strategy        Iterative feedforward; Sampling

Table 6.16 Bi-Axial LSTM summary

6.10 Control

A deep architecture generates musical content matching the style learnt from the corpus. This capacity of induction from a corpus without any explicit modeling or programming is an important ability, as discussed in Chapter 1 and also in [52]. However, like a fast car that needs a good steering wheel, control is also needed as musicians usually want to adapt ideas and patterns borrowed from other contexts to their own objective and context, e.g., transposition to another key, minimizing the number of notes, finishing with a given note, etc.

44 This figure comes from the description of another system based on the Bi-Axial LSTM architecture, named DeepJ, which will be described in Section 6.10.3.4. 45 Note that, as opposed to Johnson’s first architecture (that we refer to as Hexahedria, and which has been introduced in Section 6.9.2), which integrates the 2-level time-recurrent layers with the 2-level note-recurrent layers within a single architecture and therefore notated as LSTM2+2, the Bi-Axial LSTM architecture explicitly separates each 2-level time-recurrent layers into distinct architectural modules and is therefore notated as LSTM2×2.

Fig. 6.29 Bi-Axial LSTM architecture. Reproduced from [127] with permission of the authors

6.10.1 Dimensions of Control Strategies

Arbitrary control is a difficult issue for deep learning architectures and techniques because neural networks have not been designed to be controlled. In the case of Markov chains, they have an operational model to which one can attach constraints in order to control the generation46. However, neural networks do not offer such an operational entry point, and the distributed nature of their representation does not provide a clear relation to the structure of the content generated. Therefore, as we will see, most strategies for controlling deep learning generation rely on external intervention at various entry points (hooks) and levels:

• input,
• output,
• input and output, and
• encapsulation/reformulation.

Various control strategies can be employed:

46 Two examples are Markov constraints [149] and factor graphs [148].

Fig. 6.30 Example of Bi-Axial LSTM generated music (excerpt). Reproduced from [95]

• sampling,
• conditioning,
• input manipulation,
• reinforcement, and
• unit selection.

We will also see that some strategies (such as sampling, see Section 6.10.2) are more bottom-up and others (such as structure imposition, see Section 6.10.5.1, or unit selection, see Section 6.10.7) are more top-down. Lastly, there is also a continuum between partial solutions (such as conditioning/parametrization, see Section 6.10.3) and more general approaches (such as reinforcement, see Section 6.10.6).

6.10.2 Sampling

Sampling from a stochastic architecture (such as a restricted Boltzmann machine (RBM), see Section 6.4.2), or from a deterministic architecture (in order to introduce variability, see Section 6.6.1), may be an entry point for control if we introduce constraints into the sampling process. This is called constrained sampling, see for example the C-RBM system in Section 6.10.5.1. Constrained sampling is usually implemented by a generate-and-test approach, where valid solutions are picked from a set of random samples generated from the model. But this could be a very costly process and, moreover, one with no guarantee of success. A key and difficult issue is therefore how to guide the sampling process in order to fulfill the constraints.

6.10.2.1 Sampling for Iterative Feedforward Generation

In the case of an iterative feedforward strategy on a recurrent network, some refinements in the sampling procedure can be made. In Section 6.6.1, we introduced the technique of sampling the softmax output of a recurrent network in order to introduce content variability. However, this may sometimes lead to the generation of an unlikely note (one with a low probability). Moreover, as noted in [67], generating such a “wrong” note can have a cascading effect on the remainder of the generated sequence. A countermeasure consists in adjusting the learnt RNN model (the conditional probability distribution P(s_t | s_{<t})), or the way it is sampled, in order to favor the generation of more probable sequences, e.g., by greedily selecting at each step max_{s_t} P(s_t | s_{<t}), i.e. the most probable next item.

6.10.2.2 Sampling for Incremental Generation

In the case of an incremental generation (to be introduced in Section 6.14), the user may select

• on which part (e.g., a given part of a melody and/or a given voice) sampling will occur (or reoccur), and
• the interval of possible values on which sampling will occur.

In the case of the DeepBach system (to be introduced in Section 6.15.2), this will be the basis for introducing user control on the generation, notably to regenerate only some parts of a piece of music, to restrict the note range, and to impose some basic rhythm.

6.10.2.3 Sampling for Variational Decoder Feedforward Generation

Another interesting case is the use of sampling for generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), to be introduced in Section 6.10.2.4. We will see that some nice control of the sampling, e.g., to produce an interpolation, an averaging or an attribute modification, will produce meaningful variations in the content generated by the decoder feedforward strategy. Moreover, as has been discussed in Section 5.6.2, a variational autoencoder (VAE) is interesting for its ability to control generation along significant dimensions that have been learnt.

#1 Example: VRAE Video Game Melody Symbolic Music Generation System

In [50], Fabius and van Amersfoort propose the extension of the RNN Encoder-Decoder architecture to the case of a variational autoencoder (VAE), which is therefore named a variational recurrent autoencoder (VRAE). Both the

47 See Section 6.6.1.

encoder and the decoder encapsulate an RNN (actually an LSTM), as has been explained in Section 5.13.3. In terms of strategy, the VRAE combines the iterative feedforward strategy with the decoder feedforward strategy and the sampling strategy. The corpus used in the experiment is a set of MIDI files of eight video game songs from the 1980s and 1990s (Sponge Bob, Super Mario, Tetris...), which are divided into various shorter parts of 50 time steps. A one-hot encoding of 49 possible pitches is used (pitches with too few occurrences of notes were not considered). Experiments have been conducted with 2 or 20 hidden layer units (latent variables). Training takes place as for training recurrent networks, i.e. for each input note presenting the next note as the output. After the training phase, the latent space vector can be sampled and used by the RNN encapsulated within the decoder to iteratively generate a melody. This could be done by random sampling or also by interpolating between the values of the latent variables corresponding to different songs that have been learnt, creating a sort of “medley” of these songs. Figure 6.31 visualizes the organisation of the encoded data in the latent space, each color representing the data points from one song. The result is positive, but the low musical quality of the corpus hampers a careful evaluation. The VRAE system is summarized in Table 6.17.

Fig. 6.31 Visualization of the VRAE latent space encoded data. Extended from [50] with permission of the authors

Objective       Melody; Video game songs
Representation  Symbolic; MIDI; One-hot
Architecture    Variational(Autoencoder(LSTM, LSTM))
Strategy        Decoder feedforward; Iterative feedforward; Sampling

Table 6.17 VRAE summary

#2 Example: GLSR-VAE Melody Symbolic Music Generation System

The architecture proposed by Hadjeres and Nielsen in [69] is based on a variational autoencoder (VAE) architecture (Section 5.6.2), but it proposes an improvement in the control of the variation in the generation, named geodesic latent space regularization (GLSR), with a system named GLSR-VAE. The starting point is that a straight line between two points in the latent space will not necessarily produce the best interpolation in the generated content domain space. The idea is to introduce a regularization to relate variations in the

latent space to variations in the attributes of the decoded elements. The details of the definition of the added cost term may be found in [69]. The experiment consists in generating chorale melodies in the style of J. S. Bach. The dataset comprises the monophonic soprano voices from the J. S. Bach chorales corpus [5]. GLSR-VAE shares the principles of representation initiated by the DeepBach system (Section 6.14.2), that is

• a one-hot encoding of each note,
• with the addition to the vocabulary of a hold symbol and a rest symbol to specify, respectively, a note repetition and a rest (see Section 4.11.7), and
• using the names of the notes (with no enharmony, e.g., F♯ and G♭ are considered to be different, see Section 4.9.2).

Quantization is at the level of a sixteenth note. The latent variable space is set to 12 dimensions (12 latent variables). In the experiments conducted, the regularization is applied to a first dimension which has been found48 to represent the number of notes (named z1). Figure 6.32 shows the organisation of the encoded data in the latent space, with the number of notes z1 as the abscissa axis, showing from left to right an effective progressive increase in the number of notes (shown with scales of colors). Figure 6.33 shows examples of the melodies generated (each 2 measures long, separated by double bar lines) while increasing z1, showing a progressive correlated densification of the melodies generated.
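As an illustration of such a representation (a hypothetical fragment, not taken from the corpus; “__” is used here as the hold token), a melody quantized at the sixteenth note becomes a token sequence in which the hold symbol repeats the previous note:

# Hypothetical two-beat fragment at sixteenth-note quantization:
# a dotted-eighth G4, a sixteenth A4, then a quarter-note B4.
tokens = ["G4", "__", "__", "A4", "B4", "__", "__", "__"]

def decode(tokens):
    """Decode a token sequence back into (note, duration in sixteenth notes) pairs."""
    notes = []
    for token in tokens:
        if token == "__" and notes:
            notes[-1] = (notes[-1][0], notes[-1][1] + 1)  # the hold symbol extends the previous note
        else:
            notes.append((token, 1))
    return notes

decode(tokens)  # [('G4', 3), ('A4', 1), ('B4', 4)]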

Fig. 6.32 Visualization of GLSR-VAE latent space encoded data. Reproduced from [69] with permission of the authors

GLSR-VAE is summarized in Table 6.18. More examples of sampling from variational autoencoders will be described in Section 6.12.1.

Objective       Melody; Bach
Representation  Symbolic; Piano roll; One-hot; Hold; Rest; Fermata; No enharmony
Architecture    Variational(Autoencoder(LSTM, LSTM)); Geodesic regularization
Strategy        Decoder feedforward; Sampling

Table 6.18 GLSR-VAE summary

48 See Section 5.6.2.

Fig. 6.33 Examples of 2 measures long melodies (separated by double bar lines) generated by GLSR-VAE. Reproduced from [69] with permission of the authors

6.10.2.4 Sampling for Adversarial Generation

Another example of a generative model is the generative adversarial networks (GAN) architecture. In such an architecture, after having trained the generator in an adversarial way, content is generated by sampling latent random variables.

Example: Mogren’s C-RNN-GAN Classical Polyphony Symbolic Music Generation System

The objective of Mogren's C-RNN-GAN [135] system is the generation of single voice polyphonic music. The representation chosen is inspired by MIDI and models each musical event (note) via four attributes: duration, pitch, intensity and time elapsed since the previous event, each attribute being encoded as a real value scalar. This allows the representation of simultaneous notes (in practice up to three). The corpus, of classical music genre, was retrieved in MIDI format from the Web and contains 3,697 pieces from 160 composers. C-RNN-GAN is based on a generative adversarial networks (GAN) architecture, with both the generator and the discriminator being recurrent networks49, more precisely each having two LSTM layers with 350 units each. A specificity is that the discriminator (but not the generator) has a bidirectional recurrent architecture, in order to take context from both the past and the future for its decisions. The architecture is shown in Figure 6.34 and summarized in Table 6.19. The discriminator is trained, in parallel to the generator, to classify whether a sequence input is coming from the real data. Similar to the case of the encoder part of the RNN Encoder-Decoder, which summarizes a musical sequence into the values of its hidden layer (see Section 5.13.3), the bidirectional RNN of the C-RNN-GAN discriminator summarizes the input sequence into the values of its two hidden layers (forward sequence and backward sequence) and then classifies them. An example of generated music is shown in Figure 6.35. The author conducted a number of measurements on the generated music. He states that the model trained with feature matching50 achieves a better trade-off between structure and surprise than the other variants. Note that this is consonant with the use of the feature matching regularization technique to control creativity in MidiNet (to be introduced in Section 6.10.3.3). C-RNN-GAN is summarized in Table 6.19.

49 This generative GAN architecture encapsulates two recurrent networks, in the same spirit as the generative VRAE variational autoencoder architecture encapsulates two recurrent networks, as explained in Section 6.10.2.3. 50 A regularization technique for improving GANs, see Section 5.11.1.

Fig. 6.34 C-RNN-GAN architecture. Reproduced from [135] with permission of the authors

Fig. 6.35 C-RNN-GAN generated example (excerpt). Reproduced from [135] with permission of the authors

Objective       Polyphony
Representation  Symbolic; MIDI; Value encoding×4
Architecture    GAN(Bidirectional-LSTM2, LSTM2)
Strategy        Iterative feedforward; Sampling51

Table 6.19 C-RNN-GAN summary

6.10.2.5 Sampling for Other Generation Strategies

Sampling may also be combined with other strategies for content generation, as for instance

• Conditioning, as a way to parametrize generation with constraints, in Section 6.10.3.5, or
• Input manipulation, as a way to correct the manipulation performed in order to realign the samples with the learnt distribution, in Section 6.10.5.

6.10.3 Conditioning

The idea of conditioning (sometimes also named conditional architecture) is to condition the architecture on some extra information, which could be arbitrary, e.g., a class label or data from other modalities. Examples are
• a bass line or a beat structure, in the rhythm generation architecture (Section 6.10.3.1),
• a chord progression, in the MidiNet architecture (Section 6.10.3.3),
• the previously generated note, in the VRASH architecture (Section 6.10.3.6),
• some positional constraints on notes, in the Anticipation-RNN architecture (Section 6.10.3.5),
• a musical genre or an instrument, in the WaveNet architecture (Section 6.10.3.2), and
• a musical style, in the DeepJ architecture (Section 6.10.3.4).
In practice, the conditioning information is usually fed into the architecture as an additional input layer (for example, see Figure 5.38). This distinction between standard input and conditioning input follows a good architectural modularity principle52. Conditioning is a way to have some degree of parametrized control over the generation process. The conditioning layer could be
• a simple input layer. An example is a tag specifying a musical genre or an instrument in the WaveNet system (Section 6.10.3.2), or
• the output of some architecture, being
– the same architecture, as a way to condition the architecture on some history53 – an example is the MidiNet system (Section 6.10.3.3), in which history information from previous measure(s) is injected back into the architecture, or
– another architecture – examples are the rhythm generation system (Section 6.10.3.1), in which a feedforward network in charge of the bass line and the metrical structure information produces the conditioning input, and the DeepJ system (Section 6.10.3.4), in which two successive transformation layers of a style tag produce an embedding used as the conditioning input.
If the architecture is time-invariant – i.e., recurrent or convolutional over time – there are two options:
• global conditioning – if the conditioning input is shared for all time steps, or
• local conditioning – if the conditioning input is specific to each time step.
The WaveNet architecture, which is convolutional over time (see Section 5.9.5), offers both options, as will be analyzed in Section 6.10.3.2.
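As an illustration of this additional-input-layer mechanism, here is a minimal PyTorch sketch contrasting global and local conditioning of a recurrent layer. All names and dimensions are hypothetical; it is not taken from any of the systems discussed.

import torch
import torch.nn as nn

class ConditionedRNN(nn.Module):
    def __init__(self, note_dim=128, cond_dim=16, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(note_dim + cond_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, note_dim)

    def forward(self, notes, cond):
        # notes: (batch, steps, note_dim)
        # cond:  (batch, cond_dim)        -> global conditioning (e.g., a genre or instrument tag)
        #    or  (batch, steps, cond_dim) -> local conditioning (e.g., a chord per time step)
        if cond.dim() == 2:
            cond = cond.unsqueeze(1).expand(-1, notes.size(1), -1)   # share over all time steps
        h, _ = self.lstm(torch.cat([notes, cond], dim=-1))
        return self.out(h)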

6.10.3.1 #1 Example: Rhythm Symbolic Music Generation System

The system proposed by Makris et al. [123] is specific in that it is dedicated to the generation of sequences of rhythm. Another specificity is the possibility to condition the generation on some particular information, such as a given beat or bass line.
The corpus includes 45 drum and bass patterns, each 16 measures long in 4/4 time signature, from three different rock bands and converted to MIDI. The representation of drums is described in Section 4.11.8 and summarized as follows. Different drum components (kick, snare, toms, hi-hat, cymbals) are considered as distinct simultaneous voices, following a many-hot approach, and encoded in text as a binary word of length 5; e.g., 10010 represents the simultaneous playing of kick and hi-hat. The representation also includes a condensed representation of the bass line part. It captures the voice leading perspective of the bass54, by specifying the direction of the bass pitch difference between two successive time steps.

52 Note that we do not consider conditioning as a strategy, because the essence of conditioning relates to the conditioning architecture. Generation uses a conventional strategy (e.g., single-step feedforward, iterative feedforward...) depending on the type of the architecture (e.g., feedforward, recurrent...).
53 This is close in spirit to a recurrent architecture (RNN).
54 The voice leading of the bass has proven a valuable aspect in harmonization systems, see, e.g., [72].

This is represented in a binary word of length 4, the first digit specifying the existence of a bass event (1) or a rest event (0), while the three remaining digits specify the three possible directions for voice leading: steady (000), upward (010) and downward (001). Last, the representation includes some additional information representing the metrical structure (the beat structure), also through binary words. See further details in [123].
The architecture is a combination of a recurrent network (more precisely, an LSTM) and a feedforward network representing the conditioning layer. The LSTM (two stacked LSTM layers with 128 or 512 units) is in charge of the drums part, while the feedforward network is in charge of the bass line and the metrical structure information. The outputs of these two networks are then merged55, resulting in the architecture illustrated in Figure 6.36.
The authors report that the conditioning layer (bass line and beat information) improves the quality of both the learning and the generation. It may also be used to mildly influence the generation. More details may be found in the article [123]. The architecture is summarized in Table 6.20. An example of a generated rhythm pattern is shown in Figure 6.37; Figure 6.38 shows the use of a specific, more complex bass line as the conditioning input, which produces a more elaborate rhythm. The piano roll-like visual representation shows, in its five successive lines (from top to bottom), the kick, snare, toms, hi-hat and cymbals events.
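As a concrete illustration of this encoding, the short Python sketch below builds the binary words described above (the helper names are ours, not the authors').

DRUMS = ["kick", "snare", "toms", "hi-hat", "cymbals"]
BASS_DIRECTIONS = {"steady": "000", "upward": "010", "downward": "001"}

def encode_drums(active):
    """Many-hot binary word of length 5, e.g. {'kick', 'hi-hat'} -> '10010'."""
    return "".join("1" if drum in active else "0" for drum in DRUMS)

def encode_bass(event, direction):
    """Binary word of length 4: one event/rest bit followed by the voice-leading direction."""
    return ("1" if event else "0") + BASS_DIRECTIONS[direction]

print(encode_drums({"kick", "hi-hat"}))   # -> 10010 (kick and hi-hat played together)
print(encode_bass(True, "upward"))        # -> 1010  (a bass event moving upward)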

Fig. 6.36 Rhythm generation architecture. Reproduced from [123] with permission of the authors

Fig. 6.37 Example of a rhythm pattern generated. The five lines of the piano roll correspond (downwards) to: kick, snare, toms, hi-hat and cymbals. Reproduced from [123] with permission of the authors

55 Note that in this system, the conditioning layer is added to the main architecture at its output level and not at its input level. Therefore an additional feedforward merge layer is introduced. We could notate the resulting architecture as Conditioning(Feedforward(LSTM2), Feedforward).

Fig. 6.38 Example of a rhythm pattern generated with a specific bass line as the conditioning input. Reproduced from [123] with permission of the authors

Objective Multivoice; Rhythm; Drums
Representation Symbolic; Beat; Drums; Many-hot; Bass line; Note; Rest; Hold
Architecture Conditioning(Feedforward(LSTM2), Feedforward)
Strategy Iterative feedforward; Sampling
Table 6.20 Rhythm system summary

6.10.3.2 #2 Example: WaveNet Speech and Music Audio Generation System

WaveNet, by van den Oord et al. [194], is a system for generating raw audio waveforms, quite innovative in that respect. It has been tested in three audio domains: multi-speaker generation, text-to-speech (TTS) and music.
The architecture is based on a convolutional feedforward network with no pooling layer. Convolutions are constrained to ensure that the prediction only depends on previous time steps, and are therefore named causal convolutions. The actual implementation is optimized through the use of dilated convolutions (also called "à trous"), where the convolution filter is applied over an area larger than its length by skipping input values with a certain step. Incrementally dilated successive convolution layers56 enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency (see [194] for more details). The architecture is illustrated in Figure 6.39.
Another specificity of WaveNet is the training/generation asymmetry: during the training phase, predictions for all time steps can be made in parallel, whereas during the generation phase, predictions are sequential (following the iterative feedforward strategy).
The WaveNet architecture is made conditional, as a way to guide the generation, by adding a tag as a conditioning input. We could thus notate the architecture as Conditioning(Convolutional(Feedforward), Tag). There are actually two options:
• global conditioning, if the conditioning input is shared for all time steps; and
• local conditioning, if the conditioning input is specific to each time step.
An example of conditioning for a text-to-speech application domain is to feed in linguistic features from different speakers, e.g., North American English or Mandarin Chinese speakers, in order to generate speech with a specific prosody.
The authors also conducted preliminary work on conditioning models to generate music given a set of tags specifying, for example, genre or instruments. They state (without further details) that their preliminary attempt is promising [194]. WaveNet is summarized in Table 6.21.
Last, let us mention a recent offspring of WaveNet which uses a symbolic representation (associated to the audio input) as the conditioning model/input, in order to better guide and structure the generation of (audio) music (see details in [126]).

56 The dilation is doubled for every layer up to a limit and then repeated, e.g., 1, 2, 4, ..., 512, 1, 2, 4, ..., 512, ...
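The following minimal PyTorch sketch shows a stack of dilated causal convolutions with the dilation-doubling pattern mentioned above. It is a simplification: the gated activations and the residual and skip connections of the actual WaveNet are omitted, and the channel count is arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """1D convolution padded on the left only, so an output never depends on future samples."""
    def __init__(self, channels, dilation):
        super().__init__(channels, channels, kernel_size=2, dilation=dilation)
        self.left_pad = dilation          # (kernel_size - 1) * dilation

    def forward(self, x):
        return super().forward(F.pad(x, (self.left_pad, 0)))

# Dilation doubled per layer (1, 2, 4, ..., 512), following the pattern of footnote 56.
stack = nn.Sequential(*[CausalConv1d(32, d) for d in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]])
x = torch.zeros(1, 32, 16000)             # (batch, channels, audio samples)
print(stack(x).shape)                     # temporal resolution preserved; receptive field of 1,024 samples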

Fig. 6.39 WaveNet architecture. Reproduced from [194] with permission of the authors

Objective Audio
Representation Audio; Waveform
Architecture Conditioning(Convolutional(Feedforward), Tag); Dilated convolutions
Strategy Iterative feedforward; Sampling
Table 6.21 WaveNet summary

6.10.3.3 #3 Example: MidiNet Pop Music Melody Symbolic Music Generation System

In [212], Yang et al. propose the MidiNet architecture, which is both adversarial and convolutional, for the generation of single-track or multitrack pop music monophonic melodies.
The corpus used is a collection of 1,022 pop music songs from the TheoryTab57 online database [85], which provides two channels per tab, one for the melody and the other for the underlying chord progression. This allows two versions of the system: one with only the melody channel and another that additionally uses chords to condition melody generation. After all the preprocessing steps, the dataset is composed of 526 MIDI tabs (representing 4,208 measures). Data augmentation is then performed by circularly shifting all melodies and chords to any of the 12 keys, leading to a final dataset of 50,496 measures of melody and chord pairs for training.
The representation is obtained by transforming each channel of the MIDI files into a one-hot encoded, 8-measure-long piano roll representation, using one of the encodings to represent silence (rest) and neglecting the velocity of the note events. The time step is set at the smallest note, a sixteenth note. All melodies have been transposed in order to fit within the two-octave interval [C4, B5]58. Note that the current representation does not distinguish between a long note and two short repeating notes, and the authors mention considering future extensions in order to emphasize note onsets. For chords, instead of using a many-hot extensional representation of dimension 24 (for the two octaves), the authors state that they found it more efficient to use an intensional representation of dimension 13: 12 for the pitch class (key) and 1 for the chord type (major or minor).
The architecture59 is illustrated in Figures 6.40 and 6.41. It is composed of a generator and a discriminator, which are both convolutional networks. The generator includes two fully-connected layers (with 1,024 and 512 units, respectively) followed by four convolutional layers. Generation takes place iteratively, by sampling one measure after another until reaching 8 measures. The generator is conditioned by a module (named Conditioner CNN in Figure 6.40) which includes four convolutional layers with a reverse architecture. The conditioning mechanism incorporates

57 Tabs are piano roll-like leadsheets, including melody, lyrics and notation of chords. 58 However, the authors considered all the 128 MIDI note numbers (corresponding to the [C0, G10] interval) in a one-hot encoding and state in [212] that: “In doing so, we can detect model collapsing more easily, by checking whether the model generates notes outside these octaves.” 59 The architecture is complex, please see further details in [212].

• history information from previous measures (as a memory mechanism, analogous to an RNN), and
• the chord sequence (only for the generator).
The discriminator includes two convolutional layers followed by some fully connected layers; its final output is trained with a cross-entropy loss. The discriminator is also conditioned, but without specific conditioner layers. We could thus notate the architecture as GAN(Conditioning(Convolutional(Feedforward), Convolutional(Feedforward(History, Chord sequence))), Conditioning(Convolutional(Feedforward), History)).

Fig. 6.40 MidiNet architecture. Reproduced from [212] with permission of the authors

The conditioning information could be
• only about the previous measure – named "1D conditions" (shown in yellow in Figures 6.40 and 6.41); or
• about various previous measures – named "2D conditions" (shown in blue).
Both cases are illustrated in Figure 6.40. The authors report experiments performed with different variants:
• melody generation with conditioning on the previous measure (with the previous measure as 2D conditions for the generator and as 1D conditions for the discriminator60);
• melody generation with conditioning on the previous measure and on the chord sequence (with the chord sequence as 1D conditions for the generator, or alternatively also as 2D conditions only for its last convolutional layer in order to highlight the chord condition); and
• melody generation with conditioning on the previous measure and on the chord sequence in a creative mode (with the chord sequence as 2D conditions for all convolutional layers of the generator).
For the second variant, which they name stable mode, the authors report that the generation is more chord-dominant and stable; in other words, it closely follows the chord progression and seldom generates notes violating chord constraints. For the third variant, named creative mode61, the generator sometimes violates the constraints imposed by the chords in order to better adhere to the melody of the previous measure. In other words, the creative mode allows a better balance between melody following and chord following. The authors state in [212] that: "Such violations sometimes sound unpleasant, but can be sometimes creative. Unlike the previous two variants, we need to listen to several melodies generated by this model to handpick good ones. However, we believe such a model can still be useful for assisting and inspiring human composers." MidiNet is summarized in Table 6.22.
60 To ensure that the discriminator distinguishes between real and generated melodies only from the present measure.
61 On the challenge of creativity, see Section 6.13.

Fig. 6.41 Architecture of the MidiNet generator. Reproduced from [212] with permission of the authors

Objective Melody + Chords; Pop music; Melody vs chords following balance
Representation Symbolic; Chords; Piano roll; One-hot; Rest
Architecture GAN(Conditioning(Convolutional(Feedforward), Convolutional(Feedforward(History, Chord sequence))), Conditioning(Convolutional(Feedforward), History))
Strategy Iterative feedforward; Sampling
Table 6.22 MidiNet summary

6.10.3.4 #4 Example: DeepJ Style-Specific Polyphony Symbolic Music Generation System

In [127], Mao et al. propose a system named DeepJ, with the objective of being able to control the style of the generated music. In their experiment, they consider 23 styles, each corresponding to a different composer (from Johann Sebastian Bach to Pyotr Ilyich Tchaikovsky) with his/her specific style62. They encode the style – or a combination of styles63 – as a many-hot representation over all possible styles (i.e. composers). Composers are grouped into musical genres; thus a genre is specified (extensionally) as an equal combination of the styles (composers) of that genre. For example, if the Baroque genre is defined by composers 1 to 4, the Baroque style would be equal to [0.25, 0.25, 0.25, 0.25, 0, 0, ...]. We will see below, when detailing the architecture, that this somewhat simplistic user-defined style encoding is automatically transformed through the learning phase into an adaptive distributed representation.

62 In other words, they identify a style with a composer.
63 In the case of a combination of several styles, the vector must be normalized in order for its sum to be equal to 1.

The foundation of the architecture is the Bi-Axial LSTM architecture proposed by Johnson in [95] (see Section 6.9.3). Music representation is based on piano roll, modeling a note through its MIDI note number, within a truncated range (originally within the {0, 1, ..., 127} discrete set, truncated to {36, 37, ..., 84}, i.e. four octaves) in order to reduce note input dimensionality. Quantization is 16 time steps per measure, i.e. a time step with the value of a sixteenth note. The representation is similar to that of Bi-Axial LSTM. The DeepJ representation uses a replay matrix, dual to the piano roll matrix of notes, in order to distinguish between a held note and a replayed note. It also includes information about dynamics through a scalar variable64 within the [0, 1] interval.
But the main addition is the use of style conditioning, via global conditioning65, as in WaveNet. As has been noted, the user-defined style encoding is too simplistic to be used as it is; musical styles are not necessarily orthogonal to each other and may share many characteristics. The first transformation layer linearly transforms the user-defined many-hot encoding of the style into a first embedding (a set of hidden/latent variables, pictured as the yellow Embedding box in Figure 6.42). The second transformation layer transforms this first embedding in a nonlinear way (through a tanh activation66) into a second embedding of the style (pictured as the lower yellow Fully-Connected box), to be added as a conditioning input to the time-axis module. A similar transformation and conditioning is performed for the note-axis module. Further details and discussion may be found in [127]. DeepJ is summarized in Table 6.23.
The authors have conducted an initial subjective evaluation with human listeners, comparing music generated by DeepJ (an example is shown in Figure 6.43) and by Bi-Axial LSTM. They report that DeepJ compositions were usually preferred, and they comment that the style conditioning makes the generated music more stylistically consistent. They also conducted a second subjective evaluation in order to verify whether DeepJ can generate stylistically distinct music (correctly identified by human listeners). The authors report no statistically significant difference between the classification accuracy for DeepJ music and for real composers' music. A more objective analysis has also been undertaken by visualizing the style embedding space, shown in Figure 6.44, with each composer pictured as a dot and each cluster as a color (blue, yellow and red are for the baroque, classical and romantic clusters, respectively). The authors found that composers from similar periods do cluster together (same color) and point out the interesting result that Ludwig van Beethoven appears at the limit between the classical and romantic clusters.
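The two-stage style transformation just described can be sketched as follows in PyTorch. The embedding and conditioning dimensions are our assumptions (the paper gives its own values); this is only meant to illustrate the flow from the many-hot style vector to the conditioning input.

import torch
import torch.nn as nn

NUM_STYLES = 23     # one style per composer, as in DeepJ's experiment

class StyleConditioning(nn.Module):
    def __init__(self, embed_dim=64, cond_dim=256):
        super().__init__()
        self.embed = nn.Linear(NUM_STYLES, embed_dim, bias=False)   # linear embedding layer
        self.fc = nn.Linear(embed_dim, cond_dim)                    # nonlinear (tanh) layer

    def forward(self, style):
        # style: normalized many-hot vector over composers, e.g. an "equal Baroque" mix
        return torch.tanh(self.fc(self.embed(style)))

baroque = torch.zeros(1, NUM_STYLES)
baroque[0, :4] = 0.25                        # composers 1 to 4 define the Baroque genre
conditioning = StyleConditioning()(baroque)  # added as conditioning input to the time-axis module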

Objective Polyphony; Classical; Style
Representation Symbolic; Piano roll; Replay matrix; Rest; Style; Dynamics
Architecture Conditioning(Bi-Axial LSTM, Embedding) = Conditioning(LSTM2×2, Embedding)
Strategy Iterative feedforward; Sampling
Table 6.23 DeepJ summary

6.10.3.5 #5 Example: Anticipation-RNN Bach Melody Symbolic Music Generation System

In [68], Hadjeres and Nielsen propose a system named Anticipation-RNN for generating melodies with unary constraints on notes (to enforce a given note at a given time position to have a given value). The limitation of a standard iterative feedforward strategy for generation is that enforcing the constraint at time i may retrospectively invalidate the distribution of the previously generated items67, as shown in [151]. The idea is then to condition the RNN on information summarizing the set of further (in time) constraints, as a way to anticipate oncoming constraints, in order to generate notes with a correct distribution.

64 The authors comment that they have also tried an alternate representation of dynamics as a categorical value (one-hot encoding) with 128 bins (as in WaveNet, see Section 6.10.3.2), which is actually the original MIDI discretization. But: "Contrary to WaveNet's results, our experiments concluded that the scalar representation yielded results that were more harmonious." [127]
65 This means that the conditioning input is shared for all time steps, see Section 6.10.3.2.
66 Hyperbolic tangent function.
67 As the authors put it, imposing a constraint on time index i "twists" the conditional probability distribution P(st|s<t).

Fig. 6.42 DeepJ architecture. Reproduced from [127] with permission of the authors

Fig. 6.43 Example of baroque music generated by DeepJ. Reproduced from [127] with permission of the authors

Fig. 6.44 Visualization of DeepJ embedding space. Extended from [127] with permission of the authors

Therefore, a second RNN, named Constraint-RNN, is used, which functions backward in time. Its outputs are used as additional inputs for the main RNN, which the authors name Token-RNN. The complete architecture is illustrated in Figure 6.45, with the following notation and meaning:

Fig. 6.45 Anticipation-RNN architecture. Reproduced from [68] with permission of the authors

• ci is a positional constraint; and
• oi is the output at index i (after i iterations) of Constraint-RNN – it summarizes the constraint information from step i to the final step N (the end of the sequence). It will be concatenated (the ⊕ circled plus sign) to the input si−1 of Token-RNN in order to predict the next item si.

Note that Anticipation-RNN is not a symmetric bidirectional recurrent architecture, as in the case of the BLSTM (Section 6.8.3) or the C-RNN-GAN (Section 6.10.2.4) architectures, because what is processed backwards is another sequence (the constraints associated to the first sequence). Both RNNs (Constraint-RNN and Token-RNN) are implemented as 2-layer LSTMs.
The corpus used is the set of soprano voice melodies extracted from the four-voice Chorales of J. S. Bach. Data synthesis is performed by transposing in all keys within the original voice range and by pairing the melodies with some sorted sets of constraints68. Anticipation-RNN shares the principles of representation initiated by the DeepBach system, to be presented in Section 6.14.2, that is, one-hot encoding with the addition of the hold symbol "__" and the rest symbol to specify, respectively, a note repetition and a rest, and using the names of the notes with no enharmony. Quantization is at the level of a sixteenth note.
Three examples of melodies generated with the same set of positional constraints (each one indicated with a green note within a green rectangle) are shown in Figure 6.46. The model is indeed able to anticipate each positional constraint by adjusting its direction towards the target (lower-pitched or higher-pitched note). Further details and analysis of the results are provided in [68]. Anticipation-RNN is summarized in Table 6.24.
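A minimal PyTorch sketch of this pairing is given below: a Constraint-RNN processes the constraint sequence backward in time and its outputs oi are concatenated to the previous token before being fed to the Token-RNN. Dimensions and names are our assumptions, not the authors' code.

import torch
import torch.nn as nn

class AnticipationRNN(nn.Module):
    def __init__(self, note_dim=60, constraint_dim=60, hidden=256):
        super().__init__()
        self.constraint_rnn = nn.LSTM(constraint_dim, hidden, num_layers=2, batch_first=True)
        self.token_rnn = nn.LSTM(note_dim + hidden, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, note_dim)

    def forward(self, prev_notes, constraints):
        # Run the Constraint-RNN backward in time, so that o_i summarizes the
        # constraints from step i up to the end of the sequence.
        o, _ = self.constraint_rnn(torch.flip(constraints, dims=[1]))
        o = torch.flip(o, dims=[1])
        # Concatenate each o_i with the previous token s_{i-1} to predict s_i.
        h, _ = self.token_rnn(torch.cat([prev_notes, o], dim=-1))
        return self.out(h)      # logits over the one-hot note encoding at each time step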

Fig. 6.46 Examples of melodies generated by Anticipation-RNN. Reproduced from [68] with permission of the authors

Objective Melody; Bach
Representation Symbolic; One-hot; Hold; Rest; No enharmony
Architecture Conditioning(LSTM2, LSTM2)
Strategy Iterative feedforward; Sampling
Table 6.24 Anticipation-RNN summary

6.10.3.6 #6 Example: VRASH Melody Symbolic Music Generation System

The system described by Tikhonov and Yamshchikov in [189], although similar to VRAE (see Section 6.10.2.3), uses a different representation, separately encoding the pitch, the octave and the duration in a multi-one-hot manner. The training set is composed of various songs (of different epochs and genres), derived from MIDI files after filtering and normalization (see the details in [189]). The architecture has four LSTM69 layers for the encoder and for the decoder. The authors have experimented with feeding the output of the decoder back into the decoder, as a way of including the previously generated note as additional information (therefore, they have named their final architecture VRASH, for variational recurrent autoencoder supported by history). It is illustrated in Figures 6.47 and 6.48 and summarized in Table 6.25. In their evaluation, the authors state that the melodies generated are only slightly closer to the corpus (using a cross-entropy measure) than when history information is not added, but that qualitatively the results are better.

68 This is done to reduce the combinatorial explosion, as one does not need to construct all possible pairs (melody, constraint) as long as the coverage is sufficient for good learning.

Fig. 6.47 VRASH architecture. Reproduced from [189] with permission of the authors

Objective Melody
Representation Symbolic; MIDI; Multi-one-hot
Architecture Variational(Autoencoder(LSTM4, Conditioning(LSTM4, History)))
Strategy Decoder feedforward; Iterative feedforward; Sampling
Table 6.25 VRASH summary

6.10.4 Input Manipulation

The input manipulation strategy was pioneered for images by Deep Dream. The idea is that the initial input content, or a brand new (randomly generated) input content, is incrementally manipulated in order to match a target property. Note that control of the generation is indirect, as it is not applied to the output but to the input, before generation. Examples of target properties are

• maximizing the similarity to a given target, in order to create a consonant melody, as in DeepHearC (Section 6.10.4.1);
• maximizing the activation of a specific unit, to amplify some visual element associated to this unit, as in Deep Dream (Section 6.10.4.3);

69 To be more precise, a recent evolution named recurrent highway networks (RHNs) [213].

Fig. 6.48 VRASH architecture with a focus on the decoder. Extended from [189] with permission of the authors

• maximizing the content similarity to some initial image and the style similarity to a reference style image, to perform style transfer (Section 6.10.4.4); and
• maximizing the similarity of the structure to some reference music, to perform style imposition (Section 6.10.5.1).

Interestingly, this is done by reusing standard training mechanisms, namely backpropagation to compute the gradients, as well as gradient descent (or ascent) to minimize the cost (or to maximize the objective).

6.10.4.1 #1 Example: DeepHear Ragtime Counterpoint Symbolic Music Generation System

In [180], in addition to the generation of melodies (described in Section 6.4.1.1), Sun proposed to use DeepHear for a different objective: to harmonize a melody, while using the same architecture as well as what has already been learnt70. We notate this second experiment DeepHearC, where C stands for counterpoint, in order to distinguish it from DeepHearM for melody generation (Section 6.4.1.1).
The idea is to find a label instance of the embedding, i.e. a set of values for the 16 units of the bottleneck hidden layer of the stacked autoencoder, which will result in a decoded output resembling a given melody. Therefore, a simple distance (error) function is defined to represent the distance (similarity) between two melodies (in practice, the number of unmatched notes). Then a gradient descent is conducted on the variables of the embedding, guided by the gradients corresponding to the error function, until a sufficiently similar decoded melody is found. Although this is not real counterpoint71, but rather the generation of a similar (consonant) melody, the results (tested on ragtime melodies) do produce a naive counterpoint with a ragtime flavor.
Note that in DeepHearC (summarized in Table 6.26), the input manipulated is the input of the innermost decoder (the starting point of the chain of decoders) and not the main input of the full architecture, whereas in the case of the Deep Dream system, to be introduced in Section 6.10.4.3, it is the main input of the full (feedforward) architecture which is manipulated.

70 It is a simple example of transfer learning (see [63, Section 15.2]), using the same domain and the same training but for a different task.
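The input manipulation loop just described can be sketched as follows; a hypothetical decoder callable and a differentiable mean-squared distance stand in for the chain of decoders and for the count of unmatched notes, and all names are ours.

import torch

def harmonize(decoder, target_melody, embedding_size=16, steps=1000, lr=0.1):
    # z plays the role of the 16 bottleneck units of the stacked autoencoder
    z = torch.randn(1, embedding_size, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        generated = decoder(z)                              # decoded piano roll
        loss = ((generated - target_melody) ** 2).mean()    # differentiable stand-in for the
        loss.backward()                                     # number of unmatched notes
        optimizer.step()                                    # gradient descent on the embedding only
    return decoder(z).detach()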

Objective Accompaniment; Ragtime
Representation Symbolic; Piano roll; One-hot×64
Architecture Autoencoder4
Strategy Input manipulation; Decoder feedforward

Table 6.26 DeepHearC summary

6.10.4.2 Relation to Variational Autoencoders

Note that in the case of the manipulation of the hidden layer units of an autoencoder (or of a stacked autoencoder, as in DeepHearC), the input manipulation strategy does have some analogy with variational autoencoders, such as the VRAE system (Section 6.10.2.3) or the GLSR-VAE system (Section 6.10.2.3). Indeed, in both cases there is some exploration of possible values for the hidden units in order to generate variations of musical content by the decoder (or the chain of decoders). The important difference is that

• in the case of a variational autoencoder, the exploration of values is user-directed, although it could be guided by some principle, e.g., a geodesic in GLSR-VAE, or interpolation or attribute vector arithmetic in MusicVAE (Section 6.12.1), whereas
• in the case of input manipulation, the exploration of values is automatically guided by the gradient descent (or ascent) mechanism, the user having previously specified a cost function to be minimized (or an objective to be maximized).

6.10.4.3 #2 Example: Deep Dream Psychedelic Images Generation System

Deep Dream, by Mordvintsev et al. [137], has become famous for generating psychedelic versions of standard images. The idea is to use a deep convolutional feedforward neural network architecture (see Figure 6.49) to guide the incremental alteration of an initial input image, in order to maximize the potential occurrence of a specific visual motif72 correlated to the activation of a given unit. The method is as follows:
• the network is first trained on a large dataset of images;
• instead of minimizing the cost function, the objective is to maximize the activation of some specific unit(s) which has (have) been identified to activate for some specific visual feature(s), e.g., a dog's face, see Figure 6.4973;

71 As, for example, in the case of MiniBach (Section 6.2.2) or DeepBach (Section 6.14.2) for real counterpoint generation. 72 To create a pareidolia effect, where a pareidolia is a psychological phenomenon in which the mind responds to a stimulus, like an image or a sound, by perceiving a familiar pattern where none exists. 73 Instead of exactly prescribing which feature(s) we want the network to amplify, an alternative is to let the network make that decision, by picking a layer and asking the network to enhance whatever it has detected [137].

Fig. 6.49 Deep Dream architecture (conceptual)

• an initial image is iteratively and slightly altered (e.g., by jitter74), under gradient ascent control, in order to maximize the activation of the specific unit(s). This favors the emergence of the correlated visual motif(s), see Figure 6.50.

Note that
• the activation maximization of higher-layer unit(s), as in Figure 6.49, favors the emergence in the image of correlated high-level motif(s), like a dog's face (see Figure 6.5075); whereas
• the activation maximization of lower-layer unit(s), as in Figure 6.51, results in texture insertion (see Figure 6.52).
One may imagine a direct transposition of the Deep Dream approach to music, by maximizing the activation of a specific node76.
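A minimal PyTorch sketch of one such activation maximization step is shown below. The jitter and multi-scale details of the actual Deep Dream procedure are omitted, and the model, layer index and unit index are placeholders (any convolutional network with a flat list of child modules would do).

import torch

def deep_dream_step(model, image, layer_index, unit_index, lr=0.01):
    image = image.clone().requires_grad_(True)
    activations = {}
    # record the activations of the chosen layer during the forward pass
    handle = list(model.children())[layer_index].register_forward_hook(
        lambda module, inp, out: activations.update(value=out))
    model(image)
    handle.remove()
    objective = activations["value"][:, unit_index].mean()   # activation of the chosen unit
    objective.backward()
    with torch.no_grad():
        image += lr * image.grad / (image.grad.norm() + 1e-8)  # gradient ascent on the input
    return image.detach()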

6.10.4.4 #3 Example: Style Transfer Painting Generation System

The idea in this approach, named style transfer, pioneered by Gatys et al. [56] and designed for images, is to use a deep convolutional feedforward architecture to independently capture
• the features of a first image (named the content), and
• the style (as a correlation between features) of a second image (named the style).

74 Adding a small random noise displacement of pixels. 75 As the authors put it in [137]: “The results are intriguing – even a relatively simple neural network can be used to over-interpret an image, just like as children we enjoyed watching clouds and interpreting the random shapes. This network was trained mostly on images of animals, so naturally it tends to interpret shapes as animals.” 76 Particularly if the role of a node has been identified, through a correlation analysis between node/layer activations and musical motives, as, for example, in Section 6.17.1.

Fig. 6.50 Deep Dream. Example of a higher-layer unit maximization transformation. Created by Google's Deep Dream. Original picture: Abbey Road album cover, Beatles, Apple Records (1969). Original photograph by Iain Macmillan

Gradient-based learning is then used to guide the incremental modification of an initially random third image, with the double objective of matching both the content and the style descriptions. More precisely, the method is as follows:
• capture the content information of the first image (the content reference) by feedforwarding it into the network and storing the unit activations for each layer;
• capture the style information of the second image (the style reference) by feedforwarding it into the network and storing the feature spaces, which are correlations between unit activations, for each layer; and
• synthesize a hybrid image.

Fig. 6.51 Deep Dream architecture focusing on a lower-level unit

The hybrid image is created by generating a random image, defining it as the current image, and then iterating the following loop until the two targets (content similarity and style similarity) are reached:

• capture the content and the style information of the current image,
• compute the content cost (distance between the reference and the current content) and the style cost (distance between the reference and the current style),
• compute the corresponding gradients through standard backpropagation, and
• update the current image guided by the gradients.

The architecture and process are summarized in Figure 6.53 (more details may be found in [57]). The content image (on the right) is a photograph of Tübingen's Neckarfront in Germany77 (shown in Figure 6.54) and the style image (on the left) is the painting "The Starry Night" by Vincent van Gogh (1889). Examples of transfer for the same content (Tübingen's Neckarfront) and the styles "The Starry Night" by Vincent van Gogh (1889) and "The Shipwreck of the Minotaur" by J. M. W. Turner (1805) are shown in Figures 6.55 and 6.56, respectively.
Note that one may balance the content and style targets78 (the α/β ratio) in order to favor content or style. In addition, the complexity of the capture may also be adjusted via the number of hidden layers used. These variations are shown in Figure 6.57: rightwards an increasing α/β content/style objectives ratio, and downwards an increasing number of hidden layers (from 1 to 5) used for capturing the style. The style image is the painting "Composition VII" by Wassily Kandinsky (1913).
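The content and style costs and the total loss Ltotal = αLcontent + βLstyle (see footnote 78) can be sketched as follows; the feature extractor from a pretrained convolutional network is assumed to be given, and the weighting values and names are placeholders.

import torch

def gram(feature_map):
    # correlations between unit activations within one layer (the "feature space")
    b, c, h, w = feature_map.shape
    f = feature_map.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(features, current, content_ref, style_ref, alpha=1.0, beta=1e3):
    # `features(image)` is assumed to return a list of feature maps, one per chosen layer
    f_cur = features(current)
    f_content = features(content_ref)
    f_style = features(style_ref)
    content_cost = sum(((a - b) ** 2).mean() for a, b in zip(f_cur, f_content))
    style_cost = sum(((gram(a) - gram(b)) ** 2).sum() for a, b in zip(f_cur, f_style))
    return alpha * content_cost + beta * style_cost    # L_total = alpha*L_content + beta*L_style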

6.10.4.5 Style Transfer vs Transfer Learning

Note that although style transfer shares some of the general objectives of transfer learning, it is actually different (in terms of objective and techniques). Transfer learning is about reusing what has been learnt by a neural network

77 The location of the researchers. 78 Through the α and β parameters, see at the top of Figure 6.53 the total loss defined as Ltotal = αLcontent + βLstyle.

Fig. 6.52 Deep Dream. Example of a lower-layer unit maximization transformation. Reproduced from [137] under a CC BY 4.0 licence. Original photograph by Zachi Evenor

Fig. 6.53 Style transfer full architecture/process. Extension of a figure reproduced from [56] with permission of the authors

architecture for a specific task and applying it to another task and/or domain (e.g., another type of classification). Transfer learning issues will be touched upon in Section 8.3.

6.10.4.6 #4 Example: Music Style Transfer

Transposing this style transfer technique to music (music style transfer) is a tempting direction. However, as we will see, the style of a piece of music is more multidimensional and could be related to different types of music representation (composition, performance, sound, etc.), and is thus more difficult to capture via such a simple correlation of activations. Therefore, we will analyze this issue as a specific challenge in Section 6.11.

6.10.5 Input Manipulation and Sampling

An example of the combination of the input manipulation strategy with the sampling strategy, thus acting both on the input and the output79, is exemplified in the following section.

6.10.5.1 Example: C-RBM Polyphony Symbolic Music Generation System

In the system presented by Lattner et al. in [109], the starting point is to use a restricted Boltzmann machine (RBM) to learn the local structure, seen as the musical texture, of a corpus of musical pieces. The additional idea is to impose, through constraints, a more global structure (form, e.g., AABA, as well as tonality), seen as a structural template inspired by an existing musical piece, on the new piece to be generated. This is called structure imposition80, also earlier coined templagiarism (short for template plagiarism) by Hofstadter [84]. These constraints, concerning structure, tonality and meter, will guide an iterative generation through a search process based on gradient descent which manipulates the input.
The actual objective is the generation of polyphonic music. The representation used is piano roll, with 512 time steps and a range of 64 notes (corresponding to MIDI note numbers 28 to 92). The corpus is Wolfgang Amadeus Mozart's sonatas. Each piece is transposed into all possible keys in order to have sufficient training data for all possible keys (this also helps reduce sparsity in the training data).
The architecture is a convolutional restricted Boltzmann machine (C-RBM) [115], i.e. an RBM with convolution, with 512×64 = 32,768 input nodes and 2,048 hidden units. Units have continuous values, and not Boolean values as in standard RBMs (see Section 5.7.2). Convolution is only performed on the time dimension, in order to model temporally invariant motives, but not pitch-invariant motives (there are correlations between notes over the whole pitch range), which would break the notion of tonality81.

79 Interestingly, the input is actually equal to the output because the architecture used is an RBM (see Section 5.7), where the visible layer acts both as input and output.

Fig. 6.54 Tübingen's Neckarfront. Photograph by Andreas Praefcke. Reproduced from [56] with permission of the authors

80 This is an example of score-level composition style transfer (see Section 6.10.4.6). 81 As the authors state in [109]: “Tonality is another very important higher-order property in music. It describes perceived tonal relations between notes and chords. This information can be used to, for example, determine the key of a piece or a musical section. A key is characterized by the distribution of pitch classes in the musical texture within a (temporal) window of interest. Different window lengths may lead to different key estimations, constituting a hierarchical tonal structure (on a very local level, key estimation is strongly related to chord estimation).”

Fig. 6.55 Style transfer of "The Starry Night" by Vincent van Gogh (1889) on Tübingen's Neckarfront photograph. Reproduced from [56] with permission of the authors

Training of the C-RBM is undertaken using the RBM-specific contrastive divergence algorithm (see Section 5.7), more precisely a more advanced version named persistent contrastive divergence. Generation is performed by sampling under some constraints. Three types of constraints are considered:
• Self-similarity – the purpose is to specify a global structure (e.g., AABA) in the generated music piece. This is modeled by minimizing the distance (measured through a mean squared error) between the self-similarity matrices of the reference target and of the intermediate solution.
• Tonality constraint – the purpose is to specify a key (tonality). To estimate the key in a given temporal window, the distribution of pitch classes in the window is compared with the so-called key profiles of the reference (i.e. paradigmatic relative pitch-class strengths for specific scales and modes [185], in practice the major and minor modes). They are repeated in the time and pitch dimensions of the piano roll matrix, with a modulo-octave shift in the pitch dimension. The resulting key estimation vectors are combined (see the article for more details) to obtain an overall key estimation vector. In the same way as for self-similarity, the distance between the key estimations of the target and of the intermediate solution is minimized.
• Meter constraint – the purpose is to impose a specific meter (also named a time signature, e.g., 3/4 or 4/4, see Section 4.5.5) and its related rhythmic pattern (e.g., relatively strong accents on the first and third beats of a measure in a 4/4 meter). As note intensities are not encoded in the data, only note onsets are considered. The relative occurrence of note onsets within a measure is constrained to follow that of the reference.

Fig. 6.56 Style transfer of "The Shipwreck of the Minotaur" by J. M. W. Turner (1805) on Tübingen's Neckarfront photograph. Reproduced from [56] with permission of the authors

Generation is performed via constrained sampling (CS), a mechanism used to restrict the set of possible solutions in the sampling process according to some predefined constraints. The principles of the process, illustrated in Figure 6.58, are as follows:
• A sample is randomly initialized following the standard uniform distribution.
• A step of constrained sampling (CS) is performed, comprising
– n runs of gradient descent (GD) optimization to impose the high-level structure, and
– p runs of selective Gibbs sampling (SGS)82 to selectively realign the sample onto the learnt distribution.
• A simulated annealing algorithm is applied in order to decrease exploration in relation to a decrease of variance over solutions.
The different steps of constrained sampling are further detailed in [109]. Figure 6.59 shows an example of a generated sample in piano roll format.
The results for the C-RBM (summarized in Table 6.27) are interesting and promising. One current limitation, stated by the authors, is that the constraints only apply to the high-level structure. Initial attempts at imposing low-level structure constraints are challenging because, as constraints are never purely content-invariant, when trying to transfer low-level structure, the template piece can be exactly reconstructed in the GD phase. Therefore, creating constraints for low-level structure would have to be accompanied by an increase in their content invariance. Another issue is convergence and satisfaction of the constraints. As discussed by the authors, their approach is not exact, as opposed to the Markov constraints approach (for Markov chains) proposed in [149].

82 Selective Gibbs sampling (SGS) is the authors' variant of Gibbs sampling (GS).

Fig. 6.57 Variations on the style transfer of "Composition VII" by Wassily Kandinsky (1913) on Tübingen's Neckarfront photograph. Reproduced from [56] with permission of the authors
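The constrained sampling loop described above can be outlined as follows. This is only a high-level sketch, not the authors' implementation: the gradient-descent step, the selective Gibbs sampling step and the annealing schedule are supplied by the caller as placeholders.

def constrained_sampling(init_sample, gd_step, sgs_step, anneal,
                         n_gd=30, p_sgs=5, cs_steps=200, temperature=1.0):
    # gd_step, sgs_step and anneal are caller-supplied callables standing for the
    # gradient-descent, selective-Gibbs-sampling and annealing components of [109].
    sample = init_sample                       # randomly initialized piano roll
    for _ in range(cs_steps):
        for _ in range(n_gd):                  # impose the high-level structure
            sample = gd_step(sample)           # (self-similarity, tonality and meter costs)
        for _ in range(p_sgs):                 # realign the sample onto the learnt distribution
            sample = sgs_step(sample, temperature)
        temperature = anneal(temperature)      # simulated annealing: decrease exploration
    return sample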

Fig. 6.58 C-RBM architecture. Reproduced from [109] with permission of the authors

Fig. 6.59 Piano roll sample generated by C-RBM. Reproduced with permission of the authors

Objective Polyphony; Style imposition
Representation Symbolic; Piano roll; Rest; Many-hot; Meter
Architecture Convolutional(RBM)
Strategy Input manipulation; Sampling
Table 6.27 C-RBM summary

6.10.6 Reinforcement

The idea of the reinforcement strategy is to reformulate the generation of musical content as a reinforcement learning problem: using the similarity to the output of a recurrent network trained on the dataset as a reward, and adding user-defined constraints, e.g., some tonality rules according to music theory, as an additional reward. Let us consider the case of a monophonic melody formulated as a reinforcement learning problem:

• the state represents the musical content (a partial melody) generated so far, and
• the action represents the selection of the next note to be generated.
Let us now consider a recurrent neural network (RNN) trained on the chosen corpus of melodies. Once trained, the RNN will be used as a reference for the reinforcement learning architecture. The reward of the reinforcement learning architecture is defined as a combination of two objectives:

• adherence to what has been learnt, by measuring the similarity of the action selected, i.e. the next note to be generated, to the note predicted by the recurrent network in a similar state (i.e. the partial melody generated so far); and
• adherence to user-defined constraints (e.g., consistency with the current tonality, avoidance of excessive repetitions, etc.), by measuring how well they are fulfilled.
In summary, the reinforcement learning architecture is rewarded for mimicking the RNN, while also being rewarded for enforcing some user-defined constraints.

6.10.6.1 Example: RL-Tuner Melody Symbolic Music Generation System

The reinforcement strategy was pioneered in the RL-Tuner architecture by Jaques et al. [93]. This architecture, illustrated in Figure 6.60, consists of two deep Q network reinforcement learning architectures83 and two recurrent neural network (RNN) architectures.

Fig. 6.60 RL-Tuner architecture. Reproduced from [93] with permission of the authors

• The initial RNN, named Note RNN, is trained on the dataset of melodies for the task of predicting and generating the next note, following the iterative feedforward strategy.
• A fixed copy of Note RNN is made, named Reward RNN, which will be used by the reinforcement learning architecture as a reference.
• The Q Network architecture's task is to learn to select the next note (next action a) from the generated (partial) melody so far (current state s).
• The Q Network is trained in parallel to the other Q Network, named Target Q Network, which estimates the value of the gain (accumulated rewards) and which has been initialized from what Note RNN has learnt.
• Q Network's reward r combines the two rewards defined in the previous section:
– adherence to what has been learnt, measured by the probability of Reward RNN playing that note, in practice logP(a|s), the log probability of the next note being a given the melody s; and
– adherence to music theory constraints, in practice84:

83 An implementation of the Q-learning reinforcement learning strategy through a deep learning architecture [196]. 84 This list of musical theory constraints has been selected from [58], see more details in [93].

· staying in key,
· beginning and ending with the tonic note,
· avoiding excessively repeated notes,
· preferring harmonious intervals,
· resolving large leaps,
· avoiding continuously repeating extrema notes,
· avoiding high auto-correlation,
· playing motifs, and
· playing repeated motifs.

The total reward r(s,a)85 is defined by Equation 6.5, where rMT is the reward concerning music theory and c is a parameter controlling the balance between the two competing constraints.

r(s,a) = logP(a|s) + rMT(a,s)/c                     (6.5)

Figure 6.61 shows the evolution during the training phase of the two types of rewards (adherence to Note RNN and to music theory), with three different reinforcement learning algorithms: Q-learning, Ψ-learning and G-learning (see details in [93]).
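A minimal sketch of the combined reward of Equation 6.5 is given below; the Reward RNN probability interface and the music theory reward are placeholders standing for the components described above, not the actual RL-Tuner code.

import math

def total_reward(reward_rnn, music_theory_reward, state, action, c=0.5):
    # adherence to what has been learnt: log probability that Reward RNN would
    # play `action` given the partial melody `state` (hypothetical interface)
    log_p = math.log(reward_rnn.probability(action, given=state))
    # adherence to the handcrafted music theory rules (staying in key, resolving leaps, ...)
    r_mt = music_theory_reward(state, action)
    return log_p + r_mt / c        # r(s,a) = log P(a|s) + r_MT(a,s)/c  (Equation 6.5)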

Fig. 6.61 Evolution during training of the two types of rewards for the RL-Tuner architecture. Reproduced from [93] with permission of the authors

The corpus used for the experiments is a set of monophonic melodies extracted from a corpus of 30,000 MIDI songs. The time step is set at a sixteenth note. The one-hot encoding (of dimension 38) considers three octaves of notes plus two special events: note off (encoded as 0) and no note (a rest, encoded as 1). The MIDI note numbers are translated so that the lowest note (C3) is encoded as 2 (and B5 as 37), with the special events smoothly integrated within the integer encoding. Note that, as melodies are monophonic, playing a different note implicitly ends the last played note without requiring an explicit note off event, which results in a more compact representation. Note RNN (and its copy Reward RNN) have one LSTM layer with 100 cells.
In summary, the reinforcement strategy allows arbitrary user-given constraints (control) to be combined with a style learnt by the recurrent network.
Note that in the case of RL-Tuner, the reward is known beforehand and dual-purpose: handcrafted for the music theory rules and learnt from the dataset by an RNN for the musical style. Therefore, there is an opportunity to add another type of reward, interactive feedback from the user (see Section 6.15). However, feedback at the granularity of each generated note may be too demanding and, moreover, not that accurate86. We will discuss the issue of learning from user feedback in Section 6.16. RL-Tuner is summarized in Table 6.28.

85 Which means the reward received when action a is chosen from state s.

Objective Melody
Representation Symbolic; One-hot; Note-off; Rest
Architecture LSTM×2 + RL
Strategy Iterative feedforward; Reinforcement
Table 6.28 RL-Tuner summary

6.10.7 Unit Selection

The unit selection strategy consists in querying successive musical units (e.g., one-measure-long melodies) from a database and concatenating them in order to generate a sequence according to some user characteristics. Querying uses features which have been automatically extracted by an autoencoder. Concatenation, i.e. deciding "what unit next?", is controlled by two LSTMs, one for each criterion, in order to achieve a balance between direction and transition. This strategy, as opposed to most of the others, which are bottom-up, is top-down, as it starts with a structure and fills it.

6.10.7.1 Example: Unit Selection and Concatenation Symbolic Melody Generation System

This strategy was pioneered by Bretan et al. [13]. The idea is to generate music from a concatenation of musical units queried from a database. The key process here is unit selection, which is based on two criteria: semantic relevance and concatenation cost. The idea of using unit selection to generate sequences was actually inspired by a technique commonly used in text-to-speech (TTS) systems.
The objective is to generate melodies. The corpus considered is a dataset of 4,235 lead sheets in various musical styles (jazz, folk, rock...) and 120 jazz solo transcriptions. The granularity of a musical unit is a measure. This means there are roughly 170,000 units in the dataset. The dataset is restricted to a five-octave range (MIDI note numbers 36 to 99) and augmented by transposing each unit into all keys so that all possible pitches are covered.
The architecture includes one autoencoder and two LSTM recurrent networks. The first step is feature extraction: 10 types of features, manually handcrafted, are considered, following a bag-of-words (BOW) approach (see Section 4.9.3), e.g., counts of a certain pitch class, counts of a certain pitch class rhythm tuple, whether the first note is tied to the previous measure, etc. This results in 9,675 actual features. Most features have integer values, with the exception of rests, which are represented using a negative pitch value, and of some Boolean features. Therefore, each unit is described (indexed) as a feature vector of size 9,675.
The autoencoder used has a 2-layer stacked autoencoder architecture, as illustrated in Figure 6.62. Once trained on the set of feature vectors, in the usual self-supervised way for autoencoders (see Section 5.6), the autoencoder becomes a feature extractor, encoding a feature vector of size 9,675 into an embedding vector of size 500.
There is one remaining issue for generating a melody: how to select the best (or at least, a very good) candidate successor musical unit for a given (current, named seed by the authors) musical unit? Two criteria are considered:
• Successor semantic relevance – based on a model of transition between units, as learnt by an LSTM recurrent network. In other words, relevance is based on the distance to the (ideal) next unit as predicted by the model. This

86 As Miles Davis coined it: “If you hit a wrong note, it’s the next note that you play that determines if it’s good or bad.”

Fig. 6.62 Unit selection indexing architecture. Reproduced from [13] with permission of the authors

first LSTM architecture has two hidden layers, each with 128 units. The input and output layers have 512 units (corresponding to the format of the embedding).
• Concatenation cost – based on another model of transition87, between the last note of the current unit and the first note of the next unit, as learnt by another LSTM recurrent network. This second LSTM architecture is multilayer and its input and output layers have about 3,000 units, corresponding to a multi-one-hot encoding of the characterization of an individual note (as defined by its pitch and its duration).
The combination of the two criteria (illustrated in Figure 6.63, with the current (seed) unit in blue and the next (candidate) unit in red) is handled by a heuristic-based dynamic ranking process:
1. rank all musical units according to their successor semantic relevance with the current musical unit88;
2. take the top 5% and re-rank them according to the combination of their successor semantic relevance and their concatenation cost; and
3. select the musical unit with the highest combined rank.
The process is iterated in order to generate successive musical units and thus a melody of arbitrary length. This may at first look like a standard iterative feedforward generation from a recurrent network (see Section 6.5.1.2), but there are two important differences:
• the label (instance of the embedding) of the next musical unit is computed through a multicriteria ranking algorithm; and
• the actual unit is queried from a database with the label as the index.

An initial external human evaluation has been conducted by the authors. They found that music generated using one- or two-measure-long units tends to be ranked higher in terms of naturalness and likeability than music generated from four-measure-long units or at the note level, with the ideal unit length appearing to be one measure. Note that the unit selection strategy does not directly provide control, but it does provide entry points for control, as one may extend the selection framework (currently based on two criteria: successor semantic relevance and concatenation cost) with user-defined constraints/criteria. The system is summarized in Table 6.29.

87 At a more fine-grained note-level transition than the previous model.
88 The initial musical unit of a melody to be generated may be chosen by the user or drawn at random.
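The heuristic-based dynamic ranking described above can be sketched as follows; the two scoring functions are assumed to be given (in the original system they are derived from the two LSTM models), and the names are ours.

def select_next_unit(current_unit, candidates, semantic_relevance, concatenation_cost,
                     top_fraction=0.05):
    # 1. rank all candidate units by successor semantic relevance with the current unit
    ranked = sorted(candidates,
                    key=lambda unit: semantic_relevance(current_unit, unit),
                    reverse=True)
    # 2. keep the top 5% and re-rank them by combining relevance and concatenation cost
    shortlist = ranked[:max(1, int(len(ranked) * top_fraction))]
    # 3. return the unit with the highest combined rank
    return max(shortlist,
               key=lambda unit: semantic_relevance(current_unit, unit)
                                - concatenation_cost(current_unit, unit))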

Fig. 6.63 Unit selection based on semantic cost. Reproduced from [13] with permission of the authors

Objective Melody
Representation Symbolic; Rest; BOW Features
Architecture Autoencoder2 + LSTM×2
Strategy Unit selection; Iterative feedforward
Table 6.29 Unit selection summary

6.11 Style Transfer

In Section 6.10.4.6 we introduced style transfer as one example of using the input manipulation strategy to control content generation. The style transfer technique for images (proposed by Gatys et al. [56] and described in Section 6.10.4.4) is effective and relatively straightforward to apply. However, as opposed to paintings, where the common representation is two-dimensional and uniformly digitalized in terms of pixels, music is a much more complex object with various levels and models of representation (see Chapter 4 and also Section 6.11.2.2).
In their recent analysis, Dai et al. [31] consider three main levels (or dimensions) of representation and associated types of music style transfer:
• score-level, which they name composition style transfer;
• sound-level, which they name timbre style transfer; and
• performance control-level, which they name performance style transfer.
They state that music style transfer at each level (namely, composition, timbre and performance style transfer) is very different in nature. They also point out the issue of the interrelation and the entanglement of these different levels (and natures) of representation. Therefore, they point out the need for automated learning of the disentanglement89 of different levels of music representation, in order to ease music style transfer.

89 Disentanglement is the objective of separating the different factors governing variability in the data (e.g., in the case of human images, identity of the individual and facial expression, see, for example, [36]). Recent work on disentanglement learning can be found, for example, in [10]. Also note that variational autoencoders (VAEs, see Section 5.6.2) are currently among the promising approaches for disentanglement learning because, as Goodfellow et al. put it in [63, Section 20.10.3]: “Training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture.”

6.11.1 Composition Style Transfer

Style transfer at the composition level means working on symbolic representations. An example is structure imposition, i.e., transferring some existing structure (e.g., an AABA global structure) from an initial composition onto another, newly generated composition. The C-RBM system, presented in Section 6.10.5.1, implements this kind of structure imposition by considering separately three kinds of structures and associated constraints: global structure (e.g., AABA), tonality and meter (rhythm). One may think that such structure descriptors are too low level to define a style. But they are an interesting first step, as one may consider higher-level style descriptors by aggregating such structure descriptors. Let us imagine, for instance, describing (and later on transferring) the style of a composer like Michel Legrand, with his own way of repeating transpositions of motives. Note that in the DeepJ system for controlling the style of the generation (Section 6.10.3.4) the objective is different, as the style is explicitly specified by the user via a set of musical examples, learnt and then applied at generation time through conditioning.

6.11.2 Timbre Style Transfer

For timbre style transfer, based on audio representations, some researchers have straightforwardly applied Gatys et al.'s technique (Section 6.10.4.4) to sound (audio), using various kinds of sources (various styles of music as well as speech), as explained in the next section.

6.11.2.1 Examples: Audio Timbre Style Transfer Systems

Examples of style transfer systems for audio (timbre) are:

• Ulyanov and Lebedev's system in [192], and
• Foote et al.'s system in [53].

These two systems both use a spectrogram (and not a direct wave signal) as their input representation90. In [208], Wyse points out two specificities (which he calls "two remarkable aspects") in the architecture of Ulyanov and Lebedev's system that differentiate it from the image style transfer technique:

• the network uses only a single layer. Therefore, the only difference between content and style comes from the difference between first-order and second-order (correlation) measures of activation; and
• the network was not pre-trained and uses random weights.

Wyse further adds in [208]: "The blog post claims this unintuitive approach generated results as good as any other, and the sound examples posted are indeed compelling." We also found the examples of audio style transfer convincing, although not as interesting as painting style transfer, as the result sounds similar to some modulation/merging of the style and content signals. In their own analysis in [53], Foote et al. summarize the difficulty as follows: "On this level we draw one main conclusion: audio is dissimilar enough from images that we shouldn't expect work in this domain to be as simple as changing 2D convolutions to 1D." We will try to analyze some possible reasons for this in the next section. Table 6.30 summarizes the main features of these audio (timbre) style transfer systems (which we will refer to as AST in Chapter 7).
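To make Wyse's first observation concrete, the sketch below computes, for a single convolutional layer applied to spectrograms, a first-order (content) loss and a second-order Gram-matrix (style) loss, in the spirit of Gatys et al.'s formulation transposed to audio. It is a simplified NumPy illustration, not the code of [192] or [53]; the layer weights are random, mirroring the untrained-network choice noted above, and the placeholder spectrograms stand in for real audio data.

    import numpy as np

    def conv1d_features(spectrogram, weights):
        # spectrogram: (freq_bins, time_frames); weights: (n_filters, freq_bins, kernel_size)
        n_filters, _, k = weights.shape
        t = spectrogram.shape[1] - k + 1
        feats = np.empty((n_filters, t))
        for i in range(t):
            patch = spectrogram[:, i:i + k]
            feats[:, i] = np.tensordot(weights, patch, axes=([1, 2], [0, 1]))
        return np.maximum(feats, 0.0)          # ReLU activations

    def content_loss(f_gen, f_content):
        # First-order statistics: match the activations directly.
        return np.mean((f_gen - f_content) ** 2)

    def gram(f):
        # Second-order statistics: correlations between filter activations over time.
        return f @ f.T / f.shape[1]

    def style_loss(f_gen, f_style):
        return np.mean((gram(f_gen) - gram(f_style)) ** 2)

    rng = np.random.default_rng(0)
    w = rng.standard_normal((32, 128, 9))      # random, untrained filters
    content = rng.random((128, 200))           # placeholder spectrograms
    style = rng.random((128, 200))
    generated = rng.random((128, 200))
    f_c, f_s, f_g = (conv1d_features(x, w) for x in (content, style, generated))
    total = content_loss(f_g, f_c) + 1e-3 * style_loss(f_g, f_s)

In an actual style transfer system, the generated spectrogram would then be optimized by gradient descent on such a total loss (which in practice requires automatic differentiation); the sketch only shows how the two kinds of statistics differ.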

90 For a comparison of various audio representations for audio style transfer, see the recent analysis by Wyse [208].

Objective       Audio style transfer (AST)
Representation  Audio; Spectrum
Architecture    Convolutional(Feedforward)
Strategy        Input manipulation; Single-step feedforward
Table 6.30 Audio (timbre) style transfer (AST) summary

6.11.2.2 Limits and Challenges

We believe that, in part, the difficulty of directly transposing image style transfer to music comes from the anisotropy91 of a global audio music content representation. In the case of a natural image, the correlations between visual elements (pixels) are equivalent whatever the direction (horizontal axis, vertical axis, diagonal axis or any arbitrary direction), i.e., correlations are isotropic. In the case of a global representation of audio data (where the horizontal dimension represents time and the vertical dimension represents the notes), this uniformity no longer holds, as horizontal correlations represent temporal correlations and vertical correlations represent harmonic correlations, which are very different in nature (see an illustration in Figure 6.64).

Fig. 6.64 Anisotropic music vs an isotropic image. Incorporating Aquegg’s original image from “https://en.wikipedia.org/wiki/Spectrogram” and the painting “The Starry Night” by Vincent van Gogh (1889)

One direction could be to reformulate the capture of the style information, and therefore the nature of the correlations, in order to take into account the time dimension. Another (also hypothetical) direction could be to use a "time-compressed" representation, by considering the summary learnt by an RNN Encoder-Decoder (see Section 6.10.2.3).

6.11.3 Performance Style Transfer

Although it does not directly address performance style transfer, the Performance RNN system described in Section 6.7.1 provides a background representation for modeling performance (note onsets as well as dynamics). What remains to be undertaken to develop a performance imposition system could be along the following lines (a minimal sketch follows the list):

• model mappings between performance and other feature(s) (e.g., mean duration of notes, modulation, etc.);
• learn the mappings for a given corpus (musician, context, etc.) through correlation analysis, revealing the performance style of a given musician (and corpus); and
• transpose a mapping to an existing piece in order to transfer the performance style.
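The sketch below is a purely hypothetical illustration of these steps, assuming hand-chosen score features and MIDI velocities as the performance parameter and a simple linear mapping learnt by least squares; it is not an existing system and the data are invented.

    import numpy as np

    # Hypothetical training data for one performer: per-note score features
    # (duration in beats, metric position within the measure, pitch) and the
    # observed performance parameter (MIDI velocity).
    score_features = np.array([[1.0, 0.0, 60], [0.5, 1.0, 64], [0.5, 2.0, 67],
                               [2.0, 0.0, 65], [1.0, 2.0, 62], [1.0, 3.0, 60]],
                              dtype=float)
    velocities = np.array([80, 64, 70, 92, 60, 66], dtype=float)

    # Learn a linear mapping (least squares) from score features to velocity,
    # as a crude stand-in for the "learn mappings through correlation analysis" step.
    X = np.hstack([score_features, np.ones((len(score_features), 1))])  # add bias term
    mapping, *_ = np.linalg.lstsq(X, velocities, rcond=None)

    # Transfer: apply the learnt mapping to the notes of another piece.
    new_piece = np.array([[1.0, 0.0, 62], [1.0, 1.0, 65], [2.0, 2.0, 69]], dtype=float)
    new_X = np.hstack([new_piece, np.ones((len(new_piece), 1))])
    predicted_velocities = np.clip(new_X @ mapping, 1, 127)

A real performance style transfer system would of course use a richer (likely nonlinear, learnt) mapping and would also have to disentangle control from score information, as discussed next.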

91 Isotropy means invariance of properties regardless of the direction, whereas anisotropy means direction dependence.

As noted by Dai et al. in [31], performance style transfer is closely related to expressive performance rendering, see, for instance, the example of the Cyber-João system [30] introduced in Section 6.7 and a recent system based on deep learning in [124]. But it also requires the disentanglement of control (style) and score information (content), as well as the learning of the mappings discussed above. Thus, this is still a direction to be explored.

6.11.4 Example: FlowComposer Composition Support Environment

A good example of an interactive music composition environment addressing style transfer in different dimensions is the FlowComposer system [150, 153], developed by Pachet et al. during the Flow Machines project [49]. Note that it is based on Markov chain models and not (yet) deep learning models. FlowComposer provides possibilities for music style transfer at the following levels:

• Composition style level – some style transfer may be performed, e.g., automated reharmonization based on a style (corpus) of selected music. See the examples of the automatic reharmonization of Yesterday by John Lennon and Paul McCartney (Figure 6.65) in the style of Michel Legrand (Figure 6.66) and Bill Evans (Figure 6.67).
• Timbre and performance style levels92 – rendering may be done via style transfer, by automated mapping and extrapolation from a library of various instrumental audio performances (through the ReChord component [157]).

Fig. 6.65 Yesterday (Lennon/McCartney) (first 15 measures) – original harmonization. Reproduced from [153] with permission of the authors

The FlowComposer control panel includes various fields to select the composition style and sliders to set harmonization conformance, inspiration, average note duration and chord changes, as shown in Figure 6.68. An example of a lead sheet generated in the style of Bill Evans is shown in Figure 6.69, with the following user-defined characteristics: a 3/4 time signature, a constraint on the first (C7) and last (G7) chords, and a "max order" of four beats93. Colored backgrounds indicate sequences of notes extracted as a whole from a given song in the chosen corpus (here, all of Bill Evans' compositions in 3/4).

92 Jointly, as there is no possibility yet for separating/disentangling these two concerns. 93 This very interesting feature controls the maximum number of successive notes (actually beats) copied from the corpus. It relies on the integration of a new constraint named MaxOrder [152] into the Markov constraints framework [149] underlying FlowComposer. This is one possible way to control originality (see Section 6.13).

Fig. 6.66 Yesterday (Lennon/McCartney) (first 15 measures) – reharmonization by FlowComposer in the style of Michel Legrand. Reproduced from [153] with permission of the authors

Fig. 6.67 Yesterday (Lennon/McCartney) (first 15 measures) – reharmonization by FlowComposer in the style of Bill Evans. Reproduced from [153] with permission of the authors

Fig. 6.68 Flow Composer control panel. Reproduced from [153] with permission of the authors

6.12 Structure

One challenge is that most existing systems have a tendency to generate music with no clear structure or "sense of direction"94. In other words, although the style of the generated music corresponds to the corpus learnt, the music appears to wander without any higher organization, as opposed to human-composed music, which has some global organization (usually named a form) and identified components, such as

• an overture, an allegro, an adagio or a finale in classical music;
• an AABA or an AAB form in jazz;

94 Besides the technical improvements brought by LSTMs to the learning of long-term dependencies (see Section 5.8.3).

Fig. 6.69 Example of a Flow Composer interactively generated lead sheet. Reproduced from [150] with permission of the authors

• a refrain, a verse or a bridge in song music.

Note that there are various possible levels of structure. For instance, an example of a finer-grain structure is at the level of a melodic motif that can be repeated, often being transposed in order to adapt to a new harmonic structure. The reinforcement strategy (used by RL-Tuner in Section 6.10.6.1) and the structure imposition approach (used by C-RBM in Section 6.10.5.1) can both enforce (and/or transfer, see Section 6.11) some constraints, possibly high-level, on the generation. Regarding structure imposition, see also a recent proposal combining two graphical models, one for chords and one for melody, for the generation of lead sheets with an imposed structure [148]. An alternative, top-down approach is followed by the unit selection strategy (see Section 6.10.7), which generates an abstract sequence structure and fills it with musical units, although the structure is not yet very high-level, as it effectively stays at the level of a measure. A related challenge is not the imposition of preexisting high-level structures, but the capacity for learning high-level structures and, moreover, the capacity for the invention (emergence) of high-level structures. Therefore, a natural direction is to explicitly consider and process different levels (hierarchies) of temporality and structure.

6.12.1 Example: MusicVAE Multivoice Hierarchical Symbolic Music Generation System

In [162], Roberts et al. propose an architecture named MusicVAE, based on a variational recurrent autoencoder (VRAE) with a 2-level hierarchical RNN within the decoder. The corpus comprises MIDI files collected from the web, from which three types of musical examples are extracted:

• monophonic melodies95, 2 or 16 measures long;
• drum patterns, 2 or 16 measures long; and
• trio sequences with three different voices (melody, bass line and drum pattern), 16 measures long.

95 MusicVAE has since been extended to an arbitrary number of polyphonic tracks, see details in [175].

Encoding of monophonic melodies and bass lines is through tokens representing MIDI events: the 128 "Note on" events corresponding to the 128 possible MIDI note numbers (pitches) of the defined interval, the single96 "Note off" event and the rest (silence) token. Encoding of drum patterns is done by mapping MIDI standard drum classes through a binning into 9 canonical classes, leading to 2^9 = 512 categorical tokens representing all possible combinations. Quantization is at the sixteenth note. The architecture follows the principles of a variational autoencoder encapsulating recurrent networks, such as VRAE, with two differences:

• the encoder is a bidirectional recurrent network (see Section 5.13.2) – an LSTM with the input and output layers having 2,048 nodes and a single hidden layer of 512 cells; and
• the decoder is a hierarchical 2-level recurrent network, composed of

– a high-level RNN named the conductor – an LSTM with the input and output layers having 512 nodes and a single hidden layer of 1,024 cells – that produces a sequence of embeddings; and
– a bottom-layer RNN – an LSTM with two hidden layers of 1,024 cells – that uses each embedding as an initial state and also as an additional input, concatenated to its previously generated token97, in order to produce each subsequence. In order to prioritize the conductor RNN over the bottom-layer RNN, the initial state of the bottom-layer RNN is reinitialized with the conductor-generated embedding for each new subsequence. In the case of a multivoice trio (melody, bass and drums), there are three such LSTMs, one for each voice.

The MusicVAE architecture is illustrated in Figure 6.70. The authors report that an equivalent "flat" MusicVAE architecture (without hierarchy), although accurate in modeling the style in the case of 2 measures long examples, was inaccurate in the case of 16 measures long examples, with a 27% error increase for the autoencoder reconstruction (0.883 accuracy for the flat architecture versus 0.919 accuracy for the hierarchical architecture). An example of generated trio music is shown in Figure 6.71. A preliminary evaluation has been conducted with listeners comparing three versions (flat architecture, hierarchical architecture and real music) for three types of music: melody, trio and drums. The results show a very significant gain with the hierarchical architecture, see more details in [162].

An interesting feature of the variational autoencoder architecture is the capacity for exploring the latent space via various operations, such as

• translation;
• interpolation – Figure 5.25 in Section 5.6.2 shows an interesting comparison of melodies resulting from interpolation in the data space (that is, the space of representation of melodies) and interpolation in the latent space, which is then decoded into the corresponding melodies. One can see (and hear) that the interpolation in the latent space produces much more meaningful and interesting melodies;
• averaging – Figure 6.72 shows an example of a melody (in the middle of the figure) generated from the combination (averaging) of the latent spaces of two melodies (at the top and bottom of the figure);
• attribute vector arithmetic, by addition or subtraction of an attribute vector capturing a given characteristic – Figure 6.73 shows an example of a melody (at the bottom of the figure) generated when a "high note density" attribute vector is added to the latent space of an existing melody (at the top of the figure). An attribute vector is computed as the average of the latent vectors for a collection of examples sharing that characteristic (attribute) (e.g., high density of notes, rapid change, high register, etc.).

Furthermore, Figure 6.74 shows the effect, as a percent change, of modifying individual attributes of 16 measures long melodies by adding (left matrix), or respectively subtracting (right matrix), attribute vectors in the latent space. The vertical axis of each correlation matrix denotes the attribute vector applied and the horizontal axis denotes the attribute measured. These correlation matrices show that individual attributes can be modified without affecting the others, except for the cases where correlations are expected, as for instance between the eighth and sixteenth note syncopations. Audio examples are available in [164] and [161]. MusicVAE is summarized in Table 6.31.
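The attribute-vector manipulation can be summarized in a few lines. The sketch below assumes a trained encoder and decoder (here replaced by placeholder functions so that the fragment runs on its own, with an assumed latent dimensionality of 512) and follows the description above rather than the MusicVAE code.

    import numpy as np

    latent_dim = 512                         # assumed latent dimensionality

    def encode(melody):
        # Placeholder for the trained MusicVAE encoder (returns a latent vector).
        rng = np.random.default_rng(abs(hash(melody)) % (2**32))
        return rng.standard_normal(latent_dim)

    def decode(z):
        # Placeholder for the trained hierarchical decoder (returns a melody).
        return f"<decoded melody from latent code with norm {np.linalg.norm(z):.2f}>"

    # Attribute vector: average of the latent vectors of examples sharing a
    # characteristic (e.g., high note density), as described above.
    dense_examples = ["dense_1", "dense_2", "dense_3"]
    attr_vector = np.mean([encode(m) for m in dense_examples], axis=0)

    # Apply the attribute by simple vector addition in the latent space.
    z = encode("some existing melody")
    denser_melody = decode(z + attr_vector)

    # Interpolation and averaging are analogous latent-space operations:
    z_a, z_b = encode("melody A"), encode("melody B")
    average_melody = decode((z_a + z_b) / 2.0)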

96 Only one “Note off” event is needed for all possible pitches as the melody is monophonic. It is used to differentiate a note held from two successive identical notes, see Section 4.9.1. 97 Along the iterative feedforward strategy.

Fig. 6.70 MusicVAE architecture. Reproduced from [162] with permission of the authors

Objective       Melody; Trio (Melody, Bass, Drums)
Representation  Symbolic; Drums; Note end; Rest
Architecture    Variational Autoencoder(Bidirectional-LSTM, Hierarchical2-LSTM)
Strategy        Iterative feedforward; Sampling; Latent variables manipulation
Table 6.31 MusicVAE summary

6.12.2 Other Temporal Architectural Hierarchies

There are some alternative solutions for organizing temporal hierarchies within a deep learning architecture. A first example is ClockworkRNN, by Koutník et al. [103]. The idea is to partition the hidden recurrent layer into various modules, all fully connected (in a parallel way) to the input layer nodes and to the output layer nodes, but each module with a different clock rate, with interconnections between modules according to their clock rates (neurons of a faster module are fully connected to neurons of a slower module). Another example is SampleRNN, by Mehri et al. [129]. It is an extension of the WaveNet architecture (Section 6.10.3.2) inspired by ClockworkRNN's idea of different clock rates, but with external modules (full networks) organized via conditioning98, as opposed to ClockworkRNN's internal modules. Each module is a deep RNN which summarizes the history of its inputs (successive waveform frames) into a conditioning vector for the next module downward, which operates on frames of shorter duration. More details of these two architectures may be found in their respective articles, [103] and [129]. These examples show that there is active ongoing research exploring various ways to organize temporal hierarchies in deep learning architectures (for audio or symbolic content), in order to better capture the longer-term structure of music.
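The clock-rate idea can be sketched compactly: each hidden unit belongs to a module with a given clock period and is updated only at time steps that are multiples of that period, keeping its previous value otherwise. The NumPy fragment below is a minimal, simplified illustration of this update rule (it omits the restricted inter-module connectivity of the original architecture [103] and uses arbitrary sizes and random weights; it is not the original implementation).

    import numpy as np

    def clockwork_rnn_step(t, h, x, W_in, W_rec, periods):
        # h: concatenated hidden state of all modules; periods: one clock period
        # per hidden unit (e.g., 1, 1, 2, 2, 4, 4).
        pre_activation = np.tanh(W_in @ x + W_rec @ h)
        active = (t % periods == 0)          # units whose clock "ticks" at step t
        # Active units take the new value; inactive units keep their previous state.
        return np.where(active, pre_activation, h)

    rng = np.random.default_rng(0)
    n_in, n_hidden = 4, 6
    periods = np.array([1, 1, 2, 2, 4, 4])   # two units per module, three clock rates
    W_in = rng.standard_normal((n_hidden, n_in)) * 0.1
    W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1

    h = np.zeros(n_hidden)
    for t in range(8):                       # run a few time steps on random inputs
        h = clockwork_rnn_step(t, h, rng.standard_normal(n_in), W_in, W_rec, periods)

Slow modules thus update rarely and act as a coarse, long-term memory, while fast modules track the fine-grained evolution of the input, which is precisely the intuition reused by SampleRNN at the level of whole networks.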

98 Conditioning has been introduced in Section 5.10.

Fig. 6.71 Example of a trio music generated by MusicVAE. Reproduced from [163] with permission of the authors

Fig. 6.72 Example of a melody generated (middle) by MusicVAE by averaging the latent spaces of two melodies (top and bottom). Reproduced from [162] with permission of the authors

Fig. 6.73 Example of a melody generated (bottom) by MusicVAE by adding a "high note density" attribute vector to the latent space of an existing melody (top). Reproduced from [163] with permission of the authors

Fig. 6.74 Correlation matrices of the effect of adding (left) or subtracting (right) an attribute on other attributes in MusicVAE. Reproduced from [162] with permission of the authors

6.13 Originality

The issue of the originality of the generated music is not only an artistic issue (creativity) but also an economic one, because it raises the issue of copyright99. One approach is a posteriori, by ensuring that the generated music is not too similar to an existing piece of music (e.g., that it has not recopied a significant number of notes of a melody). For this purpose, existing algorithms for detecting similarities in texts may be used. Another approach, more systematic but even more challenging, is a priori, by ensuring that the music generated will not recopy a given portion of music from the training corpus100. A solution for music generation from Markov

99 On this issue, see the recent paper by Deltorn [34]. 100 Note that this addresses the issue of significant recopying from the training corpus, but it does not prevent a system from reinventing existing music outside of the training corpus.

chains has been proposed [152]. It is based on a variable order Markov model and on constraining the order of the generation through some min order and max order constraints, in order to attain some sweet spot between junk and plagiarism. However, there is not yet such a solution for deep learning architectures. Let us now analyze some recent directions for favoring originality in the generated musical content.
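Before turning to these directions, note that the a posteriori check mentioned above can be sketched very simply: verify that no long run of consecutive notes of the generated melody appears verbatim in the corpus. The threshold max_order below plays a role analogous to the MaxOrder constraint of [152], but it is applied after generation rather than enforced during it; the melodies are invented example data.

    def contains(melody, fragment):
        # True if `fragment` occurs as a contiguous run in `melody`.
        n, k = len(melody), len(fragment)
        return any(tuple(melody[i:i + k]) == tuple(fragment) for i in range(n - k + 1))

    def longest_copied_run(generated, corpus_melodies):
        # Length of the longest contiguous fragment of `generated` found verbatim
        # in any corpus melody (a crude, note-level similarity measure).
        longest = 0
        for start in range(len(generated)):
            length = longest + 1
            while (start + length <= len(generated)
                   and any(contains(m, generated[start:start + length])
                           for m in corpus_melodies)):
                longest = length
                length += 1
        return longest

    # Example: melodies as lists of MIDI pitches (hypothetical data).
    corpus = [[60, 62, 64, 65, 67, 65, 64, 62], [67, 65, 64, 62, 60]]
    candidate = [60, 62, 64, 65, 67, 60, 61, 59]
    max_order = 4                        # tolerated length of verbatim copying
    if longest_copied_run(candidate, corpus) > max_order:
        print("generated melody recopies too long a fragment of the corpus")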

6.13.1 Conditioning

6.13.1.1 Example: MidiNet Melody Generation System

In their description of MidiNet [212] (see Section 6.10.3.3), the authors discuss two methods for controlling creativity:

• restricting the conditioning, by inserting the conditioning data only in the intermediate convolution layers of the generator architecture; and
• decreasing the values of the two control parameters of feature matching regularization, in order to reduce the requirement for the closeness of the distributions of real and generated data.

These experiments are interesting but they remain at the level of ad hoc tuning of the hyper-parameters of the architecture.
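To make the second method more concrete, the fragment below shows a generic feature-matching regularizer governed by two weights; this is an illustrative formulation with hypothetical weight values, not the exact MidiNet loss.

    import numpy as np

    def feature_matching_regularizer(real_batch, fake_batch,
                                     real_features, fake_features,
                                     lambda1=0.1, lambda2=1.0):
        # Penalize the distance between batch means of real and generated data
        # (weighted by lambda1) and of their intermediate-layer features
        # (weighted by lambda2).
        data_term = np.sum((real_batch.mean(axis=0) - fake_batch.mean(axis=0)) ** 2)
        feature_term = np.sum((real_features.mean(axis=0)
                               - fake_features.mean(axis=0)) ** 2)
        return lambda1 * data_term + lambda2 * feature_term

    # Lowering lambda1 and lambda2 relaxes the requirement that generated samples
    # stay close to the training distribution, trading fidelity for "creativity".
    rng = np.random.default_rng(0)
    reg = feature_matching_regularizer(rng.random((16, 128)), rng.random((16, 128)),
                                       rng.random((16, 64)), rng.random((16, 64)))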

6.13.2 Creative Adversarial Networks

Another more systematic and conceptual direction is the concept of creative adversarial networks (CAN) proposed by Elgammal et al. [45], as an extension of the generative adversarial networks (GAN) architecture (introduced in Section 5.11).

6.13.2.1 Creative Adversarial Networks Painting Generation System

Elgammal et al. propose in [45] to address the issue of creativity by extending the generative adversarial networks (GAN) architecture into a creative adversarial networks (CAN) architecture in order to "generate art by learning about styles and deviating from style norms" [45]. Their assumption is that, in a standard GAN architecture, the generator objective is to generate images that fool the discriminator and that, as a consequence, the generator is trained to be emulative but not creative. In the proposed creative adversarial networks (CAN), illustrated in Figure 6.75, the generator receives from the discriminator not just one but two signals:

• the first signal is analogous to that of the standard GAN (see Equation 5.30) and is the discriminator's estimation of whether the generated sample is real or fake art; and
• the second signal indicates how easily the discriminator can classify the generated sample into the predefined established styles. If the generated sample is style-ambiguous (i.e., the various classes are equiprobable), this means that the sample is difficult to fit within the existing art styles, which may be interpreted as the creation of a new style.

These two signals are contradictory forces which push the generator to explore the space in order to generate items that are close to the distribution of existing art pieces while being style-ambiguous. Experiments have been done with paintings from the WikiArt dataset [205]. This collection contains images of 81,449 paintings by 1,119 artists, ranging from the fifteenth century to the twentieth century. It has been tagged with 25 possible painting styles (e.g., cubism, fauvism, high renaissance, impressionism, pop art, realism, etc.). Some examples of images generated by CAN are shown in Figure 6.76.
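The two signals sent to the generator can be written down compactly: the usual adversarial term plus a style-ambiguity term that pushes the style classifier's posterior towards the uniform distribution over the predefined styles. The sketch below is an illustrative NumPy reconstruction from the description above (with a non-saturating adversarial term and 25 styles as in the WikiArt tagging), not the authors' code.

    import numpy as np

    def can_generator_loss(d_real_fake_prob, style_logits):
        # d_real_fake_prob: discriminator's probability that each generated sample is real art.
        # style_logits: discriminator's unnormalized scores over the K predefined styles.
        # Signal 1: standard (non-saturating) adversarial term - fool the real/fake head.
        adversarial = -np.mean(np.log(d_real_fake_prob + 1e-12))
        # Signal 2: style ambiguity - push the style posterior towards the uniform
        # distribution, i.e. minimize the cross-entropy with a uniform target.
        exp = np.exp(style_logits - style_logits.max(axis=1, keepdims=True))
        style_probs = exp / exp.sum(axis=1, keepdims=True)
        k = style_logits.shape[1]
        ambiguity = -np.mean(np.sum((1.0 / k) * np.log(style_probs + 1e-12), axis=1))
        return adversarial + ambiguity

    rng = np.random.default_rng(0)
    loss = can_generator_loss(rng.uniform(0.1, 0.9, size=32),
                              rng.standard_normal((32, 25)))   # 25 styles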

Fig. 6.75 Creative adversarial networks (CAN) architecture. Reproduced from [45] with permission of the authors

Fig. 6.76 Examples of images generated by CAN. Reproduced from [45] with permission of the authors

As the authors discuss, the generated images are not recognized as traditional art in terms of standard genres (portraits, landscapes, religious paintings, still lifes, etc.), as shown by a preliminary external human evaluation and also by a preliminary analysis of their approach. Note that the CAN approach assumes the existence of a prior style classification and reduces the idea of creativity to the exploration of new styles (which indeed has some grounding in art history). The necessary prior classification into different styles plays an important role, and it will be interesting to experiment with other types of classification, including styles that are automatically constructed. Transposing the CAN approach to music generation appears to be a tempting direction. Note that, as opposed to most techniques, which apply control constraints during the generation phase while leaving the training phase untouched (see, e.g., style imposition in the C-RBM system in Section 6.10.5.1), in the CAN approach the incentive for creativity is applied during the training phase. The important issue of originality (and creativity) will be further discussed in Section 8.6.

6.14 Incrementality

A straightforward use of deep architectures for generation leads to a one-shot generation of a musical content as a whole, in the case of a feedforward or autoencoder network architecture, or to an iterative generation of time slices of a musical content, in the case of a recurrent network architecture. This is a strong limitation if we compare it to the way a human composer creates and generates music, in most cases very incrementally, through successive refinements of arbitrary parts.

6.14.1 Note Instantiation Strategies

Let us review how notes are instantiated during generation. There are three main strategies:

• Single-step feedforward – a feedforward architecture processes, in a single processing step, a global representation which includes all time steps. An example is MiniBach (Section 6.2.2).
• Iterative feedforward – a recurrent architecture iteratively processes a local representation corresponding to a single time step. An example is CONCERT (Section 6.6.1.1).
• Incremental sampling – a feedforward architecture incrementally processes a global representation which includes all time steps, by incrementally instantiating its variables (each variable corresponding to the possibility of a note at a specific time step). An example is DeepBach (Section 6.14.2).

These three strategies are compared and illustrated in Figure 6.77. The representation is of the piano roll type, with two simultaneous voices (or tracks). The cells in blue are the notes to be played. The rectangles with a thick line labeled as "current" indicate the parts being processed, whereas the parts in light grey indicate the parts already processed. In the case of the incremental sampling strategy (right part of Figure 6.77), at each processing step a new cell representing a triplet (voice, note, time step), labeled as "current", is randomly chosen and instantiated. Triplets already instantiated are blue-filled if a note is to be played and light grey-filled otherwise. Note that, with this incremental sampling strategy, it is possible to generate or regenerate only an arbitrary part (slice) of the musical content, for a specific time interval between two time steps and/or for a specific subset of voices/tracks, without the need for regenerating the whole content. In Figure 6.77, the dashed rectangle indicates a zone selected by the user to perform a selective regeneration101.

101 With the single-step feedforward strategy, one could imagine selecting only the desired slice from the regenerated content and “copy/pasting” it into the previously generated content, but with the obvious absence of a guarantee that the old and the new parts will be consistent.

Fig. 6.77 Note generation/instantiation – three main strategies

6.14.2 Example: DeepBach Chorale Multivoice Symbolic Music Generation System

Hadjeres et al. have proposed the DeepBach architecture102 for the generation of J. S. Bach chorales [70]. The architecture, shown in Figure 6.78, combines two recurrent networks (LSTMs) and two feedforward networks. As opposed to the standard use of recurrent networks, where a single time direction is considered103, the DeepBach architecture considers two directions: forward in time and backward in time104. Therefore, two recurrent networks (more precisely, LSTMs with 200 cells) are used, one summing up past information and another summing up information coming from the future, together with a nonrecurrent network in charge of the notes occurring at the same time. Their three outputs are merged and passed to the input of a final feedforward neural network with one hidden layer of 200 units. The final output activation function is softmax.

The initial corpus is the set of J. S. Bach's polyphonic (multivoice) chorales [5], in which the composer chose various given melodies for a soprano and composed the three additional voices (for alto, tenor and bass) in a counterpoint manner. The initial dataset (352 chorales) is augmented by adding all chorale transpositions which fit within the vocal ranges defined by the initial corpus. This leads to a total corpus of 2,503 chorales. The vocal ranges contain up to 28 different pitches for each voice105.

The choice of representation in DeepBach has some specificities. A specific hold symbol is used to indicate whether a note is being held (see Section 4.9.1). The authors emphasize in [70] that this representation is well suited to the sampling method used, more precisely that the fact that they obtain good results using Gibbs sampling relies exclusively on their choice to integrate the hold symbol into the list of notes. Another specificity is that the representation encodes notes using their real names and not their MIDI note numbers (e.g., F♯ is considered separately from G♭, see Section 4.9.2). Last, the fermata symbol for Bach chorales is explicitly considered, as it helps to produce structure and coherent phrases.

102 The MiniBach architecture described in Section 6.2.2 is actually a deterministic single-step feedforward (major) simplification of the DeepBach architecture. 103 An exception is, for example, some bidirectional recurrent architecture used in the BLSTM and in the C-RNN-GAN systems analyzed, respectively, in Sections 6.8.3 and 6.10.2.4. 104 The authors state that this architectural choice somewhat matches the real compositional practice of Bach chorales. Indeed, when reharmonizing a given melody, it is often simpler to start from the cadence and write music backward [70]. 105 21 for the soprano, alto and tenor parts and 28 for the bass part.

Fig. 6.78 DeepBach architecture. Reproduced from [70] with permission of the authors

The first four lines of the example data at the top of Figure 6.78 correspond to the four voices. The two bottom lines correspond to metadata (fermata and beat information). This architecture is actually replicated four times, one for each voice (four in a chorale). Training, as well as generation, is not done in the conventional way for neural networks. The objective is to predict the value of the current note for a given voice (shown in light green with a red "?" at the top center of Figure 6.78), using as input information the surrounding contextual notes and their associated metadata, more precisely

• the three current notes for the three other voices (the thin rectangle in light blue in the top center);
• the six previous notes (the rectangle in light turquoise blue in the top left) for all voices; and
• the six next notes (the rectangle in light grey blue in the top right) for all voices.

The training set is formed on-line by repeatedly randomly selecting a note in a voice from an example of the corpus, together with its surrounding context (as previously defined). Generation is performed by incremental sampling, using a pseudo-Gibbs sampling algorithm analogous to, but computationally simpler than, the Gibbs sampling algorithm106 (see Section 6.4.2.1), to produce a set of values (each note) of a polyphony, following the distribution that the network has learnt. The algorithm for generation by incremental sampling is shown in Figure 6.79 and has been illustrated in Figure 6.77. An example of a generated chorale107 is shown in Figure 6.80. As opposed to many experiments, a systematic evaluation through a Turing-type test has been conducted (with more than 1,200 human subjects, from experts to novices,

106 The difference with Gibbs sampling (based on the non-assumption of compatibility of conditional probability distributions) and the algorithm are detailed and discussed in [70]. 107 We will see in Section 6.15.2 that DeepBach may also be used for a different objective: counterpoint accompaniment.

Create four lists V = (V1, V2, V3, V4) of length L;
Initialize them with random notes drawn from the ranges of the corresponding voices (sampled uniformly or from the marginal distributions of the notes);
for m from 1 to max number of iterations do
    Choose voice i uniformly between 1 and 4;
    Choose time t uniformly between 1 and L;
    Re-sample V_i^t from p_i(V_i^t | V_\{i,t}, θ_i)
end for

Fig. 6.79 DeepBach incremental generation/sampling algorithm
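In Python, the loop of Figure 6.79 may look as follows; model_distribution stands for the trained per-voice network p_i(V_i^t | V_\{i,t}, θ_i) and is replaced here by a uniform placeholder (with illustrative voice ranges and an arbitrary chorale length) so that the sketch runs on its own. User constraints (Section 6.15.2) can be incorporated by restricting the ranges from which notes are resampled.

    import random

    NUM_VOICES, LENGTH = 4, 64
    VOICE_RANGES = [list(range(60, 82)), list(range(55, 77)),
                    list(range(48, 70)), list(range(36, 64))]   # illustrative pitch ranges

    def model_distribution(voice, time, V):
        # Placeholder for the trained DeepBach network p_i(V_i^t | V_\{i,t}, theta_i):
        # here simply a uniform distribution over the voice's range.
        notes = VOICE_RANGES[voice]
        return notes, [1.0 / len(notes)] * len(notes)

    # Initialization: random notes drawn from the ranges of the corresponding voices.
    V = [[random.choice(VOICE_RANGES[i]) for _ in range(LENGTH)]
         for i in range(NUM_VOICES)]

    max_iterations = 10000
    for _ in range(max_iterations):
        i = random.randrange(NUM_VOICES)     # choose a voice uniformly
        t = random.randrange(LENGTH)         # choose a time step uniformly
        notes, probs = model_distribution(i, t, V)
        V[i][t] = random.choices(notes, weights=probs, k=1)[0]   # re-sample V_i^t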

via a questionnaire on the Web108), and the results are very positive, showing a significant difficulty in discriminating between chorales composed by Bach and chorales generated by DeepBach. DeepBach is summarized in Table 6.32.

Fig. 6.80 Example of a chorale generated by DeepBach. Reproduced from [70] with permission of the authors

Objective       Multivoice; Counterpoint; Chorale; Bach
Representation  Symbolic; Piano roll; Multi2-one-hot; Hold; Rest; Fermata
Architecture    Feedforward×2 + LSTM×2
Strategy        Sampling
Table 6.32 DeepBach summary

6.15 Interactivity

An important issue is that, for most current systems, the generation of musical content is an automated and autonomous process. Some interactivity with human users is fundamental to obtaining a companion system that helps humans in their musical tasks (composition, counterpoint, harmonization, analysis, arranging, etc.) in an incremental and interactive manner. An example, already introduced in Section 6.11.4, is the FlowComposer prototype [153]. A couple of examples of partially interactive incremental systems based on deep network architectures are deepAutoController (Section 6.4.1.2) and DeepBach (Section 6.14.2).

108 An evaluation was also conducted during a live program on a Dutch TV channel.

6.15.1 #1 Example: deepAutoController Audio Music Generation System

The deepAutoController system [170] introduced in Section 6.4.1.2 provides a user interface, shown in Figure 6.81, to interactively control the generation, for instance by

• selecting a given input,
• generating a random input to be feedforwarded into the decoder stack, or
• controlling (by scaling or muting) the activation of a given unit.

Fig. 6.81 Snapshot of a deepAutoController information window showing hidden units. Reproduced from [170] with permission of the authors

6.15.2 #2 Example: DeepBach Chorale Symbolic Music Generation System

The user interface of DeepBach [70] (see Section 6.14.2) is implemented as a plugin for the MuseScore music editor (see Figure 6.82). It helps the human user to interactively select and control the partial regeneration of chorales. This is made possible by the incremental nature of the generation (see Section 6.14). Moreover, the user can enforce some user-defined constraints, such as

• freezing a voice (e.g., the soprano) and resampling the other voices in order to reharmonize the fixed melody109;
• modifying the fermata list in order to impose an end to musical phrases at specific places;
• restricting the note range for a given voice and a given temporal interval; and
• imposing a rhythm by restricting the note range to the hold symbol (as it is considered a note) in specific parts.

6.15.3 Interface Definition

Let us finally mention, at the junction between control and interactivity, the interesting discussion by Morris et al. in [138] on the issue of what control parameters (for music generation by a Markov chain trained model) should be

109 In practice, this means changing from the original objective of generating a 4-voice polyphony from scratch as discussed in Section 6.14.2, to generating a 3-voice counterpoint accompaniment for a given melody.

Fig. 6.82 DeepBach user interface. Reproduced from [70] with permission of the authors

exposed at the human user level. Some examples of user-level control parameters they have experimented with are as follows:

• major vs minor;
• following melody vs following chords; and
• locking a feature (e.g., a chord).

6.16 Adaptability

One fundamental limitation of current deep learning architectures for the generation of musical content is that they paradoxically do not learn or adapt. Learning is applied during the training phase of the network, but no learning or adaptation occurs during the generation phase. However, one can imagine some feedback from a user, e.g., the composer, producer or listener, about the quality and the adequacy of the generated music. This feedback may be explicit, which places a burden on the user, but it could also be, at least partly, implicit and automated. For instance, the fact that the user quickly stops listening to the music just generated could be interpreted as negative feedback. On the contrary, the fact that the user selects a better rendering after a first quick listen to some initial reproduction could be interpreted as positive feedback.

Several approaches are possible. The most straightforward approach, considering the nature of neural networks and supervised learning, would be to add the newly generated musical piece to the training set and eventually110 retrain the network111. This would increase the number of positive examples and gradually update the learnt model and, as a consequence, future generations. However, there is no guarantee that the overall generation quality would improve. This could also lead the model to overfit and lose some generalization capability. Moreover, there is no direct possibility of negative feedback, as one cannot remove a badly generated example from the dataset, because there is almost no chance that it was already present in the dataset.

At the junction between adaptability and interactivity, an interesting approach is that of interactive machine learning for music generation, as discussed by Fiebrink and Caramiaux [52]. They report on experience with a toolkit they designed, named Wekinator, which allows users to interactively modify the training examples. For instance, they argue in [52] that: "Interactive machine learning can also allow people to build accurate models from very few training examples: by iteratively placing new training examples in areas of the input space that are most needed to improve

110 Immediately, after some time, or after some amount of new feedback, as with a minibatch (see Section 5.1.4). 111 This could be done in the background.

model accuracy (e.g., near the desired decision boundaries between classes), users can allow complicated concepts to be learned more efficiently than if all training data were representative of future data." Another approach is to work not on the training dataset but on the generation phase. This leads us back to the issue of control (see Section 6.10), via, for example, a constrained sampling strategy, an input manipulation strategy or, obviously, a reinforcement strategy. The RL-Tuner framework (Section 6.10.6.1) is an interesting step in this direction. Although the initial motivation for RL-Tuner was to introduce musical constraints on the generation, by encapsulating them into an additional reward, this approach could also be used to introduce user feedback as an additional reward.
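As an illustration of this last point, the sketch below shows how implicit user feedback could be folded into a composite reward of the kind used by RL-Tuner; the weighting scheme and the feedback values are hypothetical, not part of the original framework.

    def composite_reward(log_likelihood_reward, music_theory_reward, user_feedback,
                         w_model=1.0, w_theory=1.0, w_user=0.5):
        # log_likelihood_reward: reward derived from the trained note-prediction model.
        # music_theory_reward: hand-crafted musical constraints (as in RL-Tuner).
        # user_feedback: e.g., +1 if the user kept listening or selected a rendering,
        #                -1 if the user quickly stopped the playback, 0 if unknown.
        return (w_model * log_likelihood_reward
                + w_theory * music_theory_reward
                + w_user * user_feedback)

    # Example: a note that the model finds likely and that respects the theory rules,
    # within a piece whose playback the user stopped early.
    r = composite_reward(log_likelihood_reward=-0.2, music_theory_reward=1.0,
                         user_feedback=-1.0)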

6.17 Explainability

A common critique of sub-symbolic approaches to Artificial Intelligence (AI)112, such as neural networks and deep learning, is their black box nature, which makes it difficult to explain and justify their decisions [18]. Explainability is indeed a real issue, as we would like to be able to understand and explain what (and how) a deep learning system has learned from a corpus, as well as why it ends up generating a given musical content.

6.17.1 #1 Example: BachBot Chorale Polyphonic Symbolic Music Generation System

Although preliminary, an interesting study conducted with the BachBot system concerns the analysis of the specialization of some of the units (neurons) of the network, through a correlation analysis with some specific motives and progressions. BachBot, by Liang [119, 118], is a system designed to generate chorales in the style of J. S. Bach, an objective shared by DeepBach (Section 6.14.2). All examples from the dataset are aligned onto the same key. The initial representation is piano roll, but it is encoded as text, in a similar way to the Celtic melody generation system described in Section 6.6.1.2. One of the specificities of the encoding is the way simultaneous notes are encoded as a sequence of tokens, with a special delimiter symbol "|||" indicating the next time frame, with a constant time step of an eighth note. A chorale is actually considered in BachBot as a single-voice polyphony and not as a multivoice polyphony, as is the case, for instance, for DeepBach (Section 6.14.2) and MiniBach (Section 6.2.2). Rests are encoded as empty frames. Notes are ordered by descending pitch and are represented by their MIDI note number, with a boolean indicating whether the note is tied to a note at the same pitch from the previous time step113. An example is shown in Figure 6.83, encoding two successive chords:

• notes B3, G♯3, E3 and B2, corresponding to an E major chord with B as the bass (often notated E/B in jazz), with the duration of a quarter note, and repeated with a tied note;
• a fermata, notated as "(.)"; and
• notes A3, E3, C3 and A2, corresponding to an A minor chord with the duration of an eighth note.

The architecture is a recurrent network (LSTM). The author used a grid search in order to select the optimal setting for the hyperparameters of the architecture (number of layers, number of units, etc.). The selected architecture has three layers and, as the author notes in [119]: "Depth matters! Increasing num layers can yield up to 9% lower validation loss. The best model is 3 layers deep, any further and overfitting occurs." Generation is done time step by time step, following the iterative feedforward strategy. As for DeepBach (Sections 6.14.2 and 6.15.2), BachBot may be readapted from the initial 4-voice multivoice chorale generation objective to a melody 3-voice counterpoint accompaniment objective, see details in [119, Section 6.1]. However, as opposed to the DeepBach architecture and representation, which stay unchanged, in the case of BachBot the architecture, the representation temporal scope and the strategy all have to be structurally changed, from a time step/iterative feedforward generation approach to a global/single-step feedforward generation approach (similar to MiniBach).

112 As opposed to symbolic approaches, see Section 1.1.3. 113 This is equivalent to a hold “ ” indication.

174 (59, False) (56, False) (52, False) (47, False) ||| (59, True) (56, True) (52, True) (47, True) ||| (.) (57, False) (52, False) (48, False) (45, False) |||

Fig. 6.83 Example of score encoding in BachBot. Reproduced from [119]

An interesting preliminary study by the author was the invitation of a musicologist to manually search for possible correlations between unit activation and specific motives and progressions, as shown in Figure 6.84. Some examples of the correlations found114 are as follows:

• Neurons 64 and 138 of Layer 1 seem to detect (specifically) perfect cadences (V–I) with root position chords in the tonic key. • Neuron 87 of Layer 1 seems to detect an I (first degree) chord on the first downbeat and its reprise four measures later. • Neuron 151 of Layer 1 seems to detect A minor cadences that end phrases 2 and 4. • Neuron 37 of Layer 2 seems to be looking for I chords: strong peak for a full I and weaker for other similar chords (same bass). BachBot is summarized in Table 6.33.

Objective       Multivoice; Chorale; Bach
Representation  Symbolic; Text; One-hot; Hold; Fermata
Architecture    LSTM3
Strategy        Iterative feedforward; Sampling
Table 6.33 BachBot summary

6.17.2 #2 Example: deepAutoController Audio Music Generation System

In [170], the authors of deepAutoController (Section 6.4.1.2) discuss the musical effects of different controls over the units of the architecture: “The optimal parameters of the models were mostly inhibitory. Therefore the deactivation of a unit in a hidden layer yields a denser mixture of sounds at the output. Learning to play such an interface may prove difficult for new users, as one typically expects the opposite behavior from a musical synthesizer. <... > We explored models having non-negative weights by using an asymmetric weight decay as shown in [116]. The results are

114 More details may be found in [119, chapter 5].

Fig. 6.84 Correlation analysis of BachBot layer/unit activation. Reproduced from [119]

not presented here as they are preliminary. Reconstruction error in such models is worse than without non-negativity constraints. But we find informally that the models are somewhat more intuitive to play as synthesizers."

6.17.3 Towards Automated Analysis

The two previous examples, in Sections 6.17.1 and 6.17.2, are examples of a preliminary manual correlation analysis. Meanwhile, an active area of research relates to understanding the way deep learning architectures work and explaining their predictions or decisions via automated analyses. An example of such an approach is the use of saliency maps, with the three following categories115:

• gradient sensitivity, to estimate how a small change to the input can affect the classification task (see, for example, [6]);
• signal methods, to isolate input patterns that stimulate neuron activation in higher layers (see, for example, [101]); and
• attribution methods, to decompose the value at a specific output neuron into contributions from the individual input dimensions116 (see, for example, [136]).

Note that this type of analysis could also be used with a different objective: to optimize the configuration of the architecture by removing components that are considered to make no contribution and are therefore unnecessary, see, for example, [114] (with its provocative title). Last, let us mention some recent work targeted at image recognition which shows interesting directions and prospects, for instance interactively exploring activation atlases of the features the network has learned117 [17] or automatically exploring incorrect behaviors by generating test counterexamples [155].
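As a concrete example of the first category, a gradient-sensitivity (saliency) analysis only requires differentiating an output of interest with respect to the input. The sketch below does this with PyTorch autograd for a toy, randomly initialized model standing in for a trained music model; the input shape (a small piano-roll-like fragment) and the network are purely illustrative.

    import torch

    # Toy stand-in for a trained model: scores over 3 output choices computed
    # from a piano-roll-like input of shape (88 pitches, 16 time steps).
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Flatten(start_dim=0),
                                torch.nn.Linear(88 * 16, 64), torch.nn.Tanh(),
                                torch.nn.Linear(64, 3))

    x = torch.rand(88, 16, requires_grad=True)   # input fragment under analysis
    scores = model(x)
    scores[1].backward()                         # output neuron of interest
    saliency = x.grad.abs()                      # sensitivity of that output to each input cell

The resulting saliency map indicates which notes and time steps the chosen output is most sensitive to, which is the automated counterpart of the manual correlation analyses described above.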

6.18 Discussion

We can observe that the various limitations and challenges that we have analyzed may be partially dependent on one another and, furthermore, conflicting. Thus, resolving one may hamper another. For instance, the sampling strategy used by DeepBach (Section 6.14.2) provides incrementality but the length of the generated music is fixed, whereas the iterative feedforward strategy allows variable and unbounded length but its incrementality is only forward in time. There is probably no general solution and, as with multicriteria decisions, the selection of architectures and strategies depends on preferences and priorities. Also, as already noted in Section 5.13.7, there is no guarantee that combining a variety of different architectures and/or strategies will make a sound and accurate system. As for a good cook, the best outcome is not achieved by simply mixing together all the possible ingredients. Therefore, it is important to continue to deepen our understanding and to explore solutions, as well as their possible articulations and combinations. We hope that the survey and analysis conducted in this chapter and in the next two chapters contribute to this understanding.

115 Following Kindermans et al.'s study in [100], actually a critique of the reliability of saliency methods. 116 With an approach analogous to reverse correlation, which is used in neurophysiology for studying how sensory neurons add up signals from different sources and sum up stimuli at different times to generate a response (see, for example, [159]). 117 Using a first processing stage inspired by Deep Dream feature visualization by optimization, see Section 6.10.4.3.


Chapter 7 Analysis

We now present a preliminary analysis and summary of the various systems surveyed, following our proposed five-dimension referential, through various tables. This provides material for an analysis of the relations between the different dimensions and the corresponding design decisions.

7.1 Referencing and Abbreviations

First, we reference in Table 7.1 the various systems that we have analyzed. Then, because of space limitations, we introduce abbreviations for the various possible types of each dimension:

• objectives in Table 7.2;
• representations in Table 7.3;
• architectures in Table 7.4;
• challenges in Table 7.5; and
• strategies in Table 7.6.

System Reference name  Original name  Authors  Reference  Section
Anticipation-RNN  Anticipation-RNN  G. Hadjeres & F. Nielsen  [68]  6.10.3.5
AST (Audio Style Transfer)  D. Ulyanov & V. Lebedev  [192]  6.11.2.1
                            D. Foote et al.  [53]  6.11.2.1
BachBot  BachBot  F. Liang  [119]  6.17.1
Bi-Axial LSTM  Bi-Axial LSTM  D. Johnson  [95]  6.9.3
BLSTM  BLSTM  H. Lim et al.  [120]  6.8.3
BluesC  D. Eck & J. Schmidhuber  [43]  6.5.1.1
BluesMC  D. Eck & J. Schmidhuber  [43]  6.5.1.2
Celtic  B. Sturm et al.  [179]  6.6.1.2
CONCERT  CONCERT  M. Mozer  [139]  6.6.1.1
C-RBM  C-RBM  S. Lattner et al.  [109]  6.10.5.1
C-RNN-GAN  C-RNN-GAN  O. Mogren  [135]  6.10.2.4
deepAutoController  deepAutoController  A. Sarroff & M. Casey  [170]  6.4.1.2
DeepBach  DeepBach  G. Hadjeres et al.  [70]  6.14.2
DeepHearC  DeepHear  F. Sun  [180]  6.10.4.1
DeepHearM  DeepHear  F. Sun  [180]  6.4.1.1
DeepJ  DeepJ  L.-C. Mao et al.  [127]  6.10.3.4
GLSR-VAE  GLSR-VAE  G. Hadjeres & F. Nielsen  [69]  6.10.2.3
Hexahedria  D. Johnson  [94]  6.9.2
MidiNet  MidiNet  L.-C. Yang et al.  [212]  6.10.3.3
MiniBach  MiniBach  G. Hadjeres & al.  6.2.2
MusicVAE  MusicVAE  A. Roberts et al.  [162]  6.12.1
Performance RNN  Performance RNN  I. Simon & S. Oore  [174]  6.7.1
RBMC  N. Boulanger-Lewandowski et al.  [11]  6.9.1
Rhythm  D. Makris et al.  [123]  6.10.3.1
RL-Tuner  RL-Tuner  N. Jaques et al.  [93]  6.10.6.1
RNN-RBM  RNN-RBM  N. Boulanger-Lewandowski et al.  [11]  6.9.1
Sequential  Sequential  P. Todd  [190]  6.8.2
Time-Windowed  Time-Windowed  P. Todd  [190]  6.8.1
UnitSelection  M. Bretan et al.  [13]  6.10.7.1
VRAE  VRAE  O. Fabius & J. van Amersfoort  [50]  6.10.2.3
VRASH  VRASH  A. Tikhonov & I. Yamshchikov  [189]  6.10.3.6
WaveNet  WaveNet  A. van den Oord et al.  [194]  6.10.3.2
Table 7.1 Systems referencing

Abbreviation  Objective
Type
Au  Audio
Me  Melody (Single-voice monophonic melody)
Po  Polyphony (Single-voice polyphony)
CP  Chord progression (sequence)
MV  Multivoice (Multivoice/multitrack polyphony)
Dr  Drums
Co  Counterpoint accompaniment
CA  Chord (progression) accompaniment
ST  Style transfer
Destination & Use
AR  Audio reproduction
SP  Software processing
HI  Human interpretation
Mode
AG  Autonomous generation
IG  Interactive generation
Table 7.2 Abbreviations for the types of objective

Abbreviation  Representation
Audio
Wa  Waveform
Sp  Spectrum
Symbolic
Concept
No  Note
Re  Rest
Ch  Chord
Rh  Rhythm (Meter & Beats)
Dr  Drums
Format
MI  MIDI
Pi  Piano roll
Te  Text
Temporal Scope
Gl  Global
TS  Time step
Meta-Data
NH  Note hold or ending
NE  No enharmony (Note denotation)
Fe  Fermata
FE  Feature extraction
Expressiveness
To  Tempo
Dy  Dynamics
Encoding
VE  Value encoding
OH  One-hot encoding
MH  Many-hot encoding
Dataset
Tr  Transposition
Al  Alignment
Table 7.3 Abbreviations for the types of representation

Abbreviation  Architecture
Fd  Feedforward
Ae  Autoencoder
Va  Variational
RB  Restricted Boltzmann machine (RBM)
RN  Recurrent (RNN)
Cv  Convolutional
Cn  Conditioning
GA  Generative adversarial networks (GAN)
RL  Reinforcement learning (RL)
Cp  Compound
Table 7.4 Abbreviations for the types of architecture

Abbreviation  Challenge
EN  Ex-nihilo generation
LV  Length variability
CV  Content variability
Es  Expressiveness
MH  Melody-harmony consistency
Co  Control
St  Structure
Or  Originality
Ic  Incrementality
It  Interactivity
Ad  Adaptability
Ey  Explainability
Table 7.5 Abbreviations for the types of challenge

Abbreviation  Strategy
SF  Single-step feedforward
DF  Decoder feedforward
Sa  Sampling
IF  Iterative feedforward
IM  Input manipulation
Re  Reinforcement
US  Unit selection
Cp  Compound
Table 7.6 Abbreviations for the types of strategy

7.2 System Analysis

We summarize in Tables1 7.7 and 7.8 how each system is positioned with respect to each of the following four dimensions: objective, representation, architecture and strategy. We then analyze each system in a more detailed manner, dimension by dimension:

• objective in Table 7.9;
• representation in Tables2 7.10 and 7.11;
• architecture and strategy in Table 7.12; and
• challenge in Table 7.13.

For each table, which analyzes each system (line) with respect to the possible types (columns) for a given dimension, the occurrence of an "X" at the crossing of a given line (system) and a given column (type) means that this system does match that given type for that dimension (e.g., follows some representation facet, is based on some type of architecture, fulfills some challenge...). Note that we base this analysis on how each system is presented in the literature referenced, and not on how it could be further extended. Furthermore, we use notations such as Xn and X×n (introduced in Section 6.1) to convey additional information about the number of occurrences of a type.

1 This table is split in two because of vertical space limitations. 2 This table is split in two because of horizontal space limitations.

183 System Name Objective Representation Architecture Strategy Anticipat- Melody; Symbolic; One-hot; Condition- Iterative feedforward; ion-RNN Bach Hold; Rest; ing(LSTM2, Sampling No enharmony LSTM2) AST Audio style Audio; Spectrum Convolution- Input manipulation; transfer al(Feedforward) Single-step feedforward BachBot Polyphony; Symbolic; Text; LSTM3 Iterative feedforward; Chorale; One-hot; Hold; Sampling Bach Fermata Bi-Axial Polyphony Symbolic; Piano roll; LSTM×2 Iterative feedforward; LSTM Hold; Rest Sampling BLSTM Accompaniment; Symbolic; CSV; LSTM2 Iterative feedforward Chord sequence; One-hot Western music ×(12×* + 24×4); Rest BluesC Chord sequence; Symbolic; One-hot; LSTM Iterative feedforward Blues Note end; Chord as note BluesMC Melody Symbolic; One-hot×2; LSTM Iterative feedforward + Chords; Blues Note end; Chord as note Celtic Melody Symbolic; Text; LSTM3 Iterative feedforward; Token-based; One-hot Sampling CONCERT Melody Symbolic; Harmonics; RNN Iterative feedforward; + Chords Harmony; Beat Sampling C-RBM Polyphony; Symbolic; Convolution- Input manipulation; Style impos- Piano-roll; Rest; al(RBM) Sampling ition Many-hot; Meter C-RNN- Polyphony Symbolic; MIDI GAN(Birectio- Iterative feedforward; GAN nal(LSTM2), Sampling LSTM2) deepAuto- Audio; Audio; Spectrum Autoencoder2 Decoder feedforward Controller User interface DeepBach Multivoice; Symbolic; Piano roll; Feedforward×2 Sampling Counterpoint; Multi2-one-hot; Hold; + LSTM×2 Chorale; Bach Rest; Fermata; No enharmony 4 DeepHearC Melody Symbolic; Piano roll Autoencoder Input manipulation; accompaniment One-hot×64 Decoder feedforward 4 DeepHearM Melody; Symbolic; Piano roll; Autoencoder Decoder feedforward Ragtime One-hot×64 DeepJ Polyphony; Symbolic; Piano roll; Condition- Iterative feedforward; Classical; Style Replay matrix; Rest; ing(LSTM2×2, Sampling Style; Dynamics Embedding) GLSR-VAE Melody; Symbolic; Piano roll; Variational(Auto- Decoder feedforward; Bach One-hot; Hold; Rest encoder(LSTM, Sampling Fermata; LSTM); Geodesic No enharmony regularization Hexahedria Polyphony Symbolic; Piano roll; LSTM2+2 Iterative feedforward; Hold; Beat Sampling Table 7.7 Systems summary (1/2)

184 System Name Objective Representation Architecture Strategy MidiNet Melody Symbolic; Chords GAN(Condition- Iterative + Chords; Pop; Piano roll; ing(Convolution- feedforward; Melody vs Chor- One-hot; al(Feedforward, Sampling ds following Rest Convolutional(Feed- forward(History, Chord sequence))), Conditioning(Convol- utional(Feedforward), History)) MiniBach Accompaniment; Symbolic; Feedforward2 Single-step Counterpoint; Piano roll; feedforward Chorale; Bach One-hot×64×(1+3); Hold MusicVAE Melody; Symbolic; Variational Auto- Iterative feedforward; Trio (Melody, Drums; encoder(Bidirectional- Sampling; Bass, Drums) Note end; Rest LSTM, Latent variables Hierarchical2-LSTM) manipulation Perform- Polyphony; Symbolic; One-hot; LSTM Iterative feedforward; ance-RNN Performance Time shift; Sampling control Dynamics RBMC Simultaneous Symbolic; RBM Sampling notes (Chord) Many-hot Rhythm Multivoice; Symbolic; Beat; Conditioning(Feed- Iterative feedforward; Rhythm; Drums Drums; Bass line; forward(LSTM2), Sampling Note; Rest; Hold Feedforward) RL-Tuner Melody Symbolic; One-hot; LSTM×2 + RL Iterative feedforward; Note off; Rest Reinforcement RNN-RBM Polyphony Symbolic; RBM-RNN Iterative feedforward; Many-hot Sampling Sequential Melody Symbolic; Piano RNN Iterative feedforward roll; One-hot; Note begin; Implicit rest Time- Melody Symbolic; Piano Feedforward Iterative feedforward windowed roll; One-hot×8; Note begin; Implicit rest Unit- Melody Symbolic; Rest; Autoencoder2 Unit selection; Selection BOW Features + LSTM×2 Iterative feedforward VRAE Melody; Symbolic; Variational(Autoen- Decoder feedforward; Video game coder(LSTM, LSTM) Iterative feedforward; songs Sampling VRASH Melody Symbolic; MIDI; Variational(Autoen- Decoder feedforward; Multi-one-hot coder(LSTM4, Iterative feedforward; Conditioning(LSTM4, Sampling History))) WaveNet Audio Audio; Conditioning(Convol- Iterative feedforward; Waveform utional(Feedforward), Sampling Tag); Dilated convolutions Table 7.8 Systems summary (2/2)

185 Objective Type Dest./Use Mode Au Me Po CP MV Dr Co CA ST AR SP HI AG IG System Anticipation-RNN X X X X AST X X X X BachBot X X X X Bi-Axial LSTM X X X X BLSTM X X X X X BluesC X X X X BluesMC X X X X X X Celtic X X X X CONCERT X X X X X X C-RBM X X X X X C-RNN-GAN X X X X deepAutoController X X X X DeepBach X×3 X X X X X X DeepHearC X X X X X DeepHearM X X X X DeepJ X X X X GLSR-VAE X X X X Hexahedria X X X X X X MidiNet X X X X X X MiniBach X×3 X X X X X MusicVAE X X X X X X Performance RNN X X X X RBMC X X X X Rhythm X X X X X RL-Tuner X X X X RNN-RBM X X X X Sequential X X X X Time-Windowed X X X X UnitSelection X X X X VRAE X X X X VRASH X X X X WaveNet X X X Table 7.9 System × Objective

186 Representation Audio Concept Format TmpS Wa Sp No Re Ch Rh Dr MI Pi Te Gl TS System Anticipation-RNN X X X X Audio Style Transfer (AST) X X BachBot X X X X Bi-Axial LSTM X X X X X X BLSTM X X X X BluesC X X X X BluesMC X X X X Celtic X X X X CONCERT X X X X X X C-RBM X X X X X C-RNN-GAN X X X X deepAutoController X X DeepBach X X X X DeepHearC X X X DeepHearM X X X DeepJ X X X X X X GLSR-VAE X X X Hexahedria X X X X X MidiNet X X X X X MiniBach X X X MusicVAE X X X X X X X Performance RNN X X X X RBMC X X X X Rhythm X X X X X X RL-Tuner X X X X RNN-RBM X X X X Sequential X X X Time-Windowed X X UnitSelection X X X VRAE X X X VRASH X X X X WaveNet X X Table 7.10 System × Representation (1/2)

187 Representation Meta-data Expr. Encoding DSet NH NE Fe FE To Dy VE OH MH Tr Al System Anticipation-RNN X X X X X Audio Style Transfer (AST) X X X BachBot X X X X Bi-Axial LSTM X X X X X BLSTM X X×(12×* + 24×4) X BluesC X X BluesMC X X×2 Celtic X X CONCERT X X X C-RBM X X X C-RNN-GAN X X×4 deepAutoController X X X DeepBach X X X X X DeepHearC X DeepHearM X DeepJ X X X X X X GLSR-VAE X X X X X Hexahedria X X X X MidiNet X X MiniBach X X X×64×(1+3) X MusicVAE X X Performance RNN X X X X X×4 X RBMC X X Rhythm X X RL-Tuner X X RNN-RBM X X Sequential X X X Time-Windowed X X×8 X UnitSelection X X X X VRAE VRASH X×3 WaveNet X X X X Table 7.11 System × Representation (2/2)

188 Architecture Strategy Fd Ae Va RB RN Cv Cn GA RL Cp SF DF Sa IF IM Re US Cp System Anticipation-RNN X2×2 X X X X X AST X X X X X X BachBot X3 X X X Bi-Axial LSTM X2×2 X X X X BLSTM X X BluesC X X BluesMC X X Celtic X3 X X X CONCERT X X X X C-RBM X X X X X X C-RNN-GAN X2×2 X X X X X deepAutoController X2 X X DeepBach X×2 X×2 X X 4 DeepHearC X X X X X 4 DeepHearM X X X DeepJ X2×2 X X X X X GLSR-VAE X X X2 X X X X Hexahedria X2+2 X X X X MidiNet X X X X X X X X X MiniBach X2 X MusicVAE X X X×2+2 X X X X X Performance RNN X X X X RBMC X X Rhythm X X2 X X X X X X RL-Tuner X×2 X X X X X X RNN-RBM X X X X X X Sequential X X Time-Windower X X UnitSelection X2 X×2 X X X X VRAE X X X×2 X X X X X VRASH X X X4×2 X X X X X X WaveNet X X X X X X X Table 7.12 System × Architecture & Strategy

189 Challenge EN LV CV Es MH Co St Or Ic It Ad Ey System Anticipation-RNN X X X X X AST X BachBot X X X X X X Bi-Axial LSTM X X X X X BLSTM X X X BluesC X X X BluesMC X X X X Celtic X X X X CONCERT X X X X C-RBM X X X X X X C-RNN-GAN X X X X deepAutoController X X X DeepBach X X X X DeepHearC X X DeepHearM X DeepJ X X X X X X GLSR-VAE X X X X X Hexahedria X X X X X MidiNet X X X X X MiniBach X MusicVAE X X X X X X Performance RNN X X X X X X RBMC X X Rhythm X X X X X X RL-Tuner X X X X X X RNN-RBM X X X X X Sequential X X X Time-Windowed X X X UnitSelection X X X X X X VRAE X X X X VRASH X X X X WaveNet X X X X X Table 7.13 System × Challenge

Note that, when considering the analysis of the challenges in Table 7.13, we have to keep in mind that

• the limitations and challenges are not of equal importance and difficulty; and
• most of the systems could be further extended in order to better address some of the challenges3.

That said, we can see the emergence of a divide between systems using a global versus a time step temporal scope representation (see Section 4.8.1), depending on two requirements: ex nihilo generation and length variability. This will be further discussed in Section 8.1.

7.3 Correlation Analysis

The last series of tables analyzes some correlations between the dimensions:

• representation with respect to the objective in Table 7.14;

3 For instance, the RL-Tuner system has the potential to address the interactivity and adaptability challenges, although, to our knowledge, this has not yet been experimented with.

• architecture and strategy with respect to the objective in Table 7.15;
• objective, architecture and strategy with respect to the representation in Table 7.16;
• strategy with respect to the architecture in Table 7.17; and
• architecture and strategy with respect to the challenge in Table 7.18.

The occurrence of an “X” at the intersection of a given row (a given type for the first dimension) and a given column (a given type for the second dimension) means that there is (at least) one system4 which matches both types (for instance, is based on a particular architecture and follows a particular strategy). However, the absence of an “X” does not mean that the two types are incompatible or that such a system does not exist or could not be constructed. Therefore, we add an additional “x” notation to represent an a priori potential compatibility between types, and thus a possible direction to be explored.

Objective Type Dest./Use Mode Au Me Po CP MV Dr Co CA ST AR SP HI AG IG Representation Audio Waveform X x X X x Spectrum X X X X x Symbolic Concept Note X X X X X X X X X X X Rest X X X X X X X X X X X X Chord X X X X X X X X x Rhythm (Meter & Beats) X X X X X X x X X X X X Drums X X X x x X X X x Format MIDI X X X X X X x x X X X x Piano roll X X X X X X x X X X X X Text X X X X X x X X X x Temporal Scope Global X X X X X X X x X X X X X X Time step X X X X X X X X x X X X X x Meta-Data Note hold or ending X X X X X X x x X X X X No enharmony (Note denotation) X x x X X x x X X X X Fermata X X x X X x x X X X X Feature extraction x X x x x x x x x X X X X x Expressiveness Tempo X x X x x x x x X X X X X x Dynamics X x X x x x x x X X X X X x Encoding Value encoding X x x x x x x X X X X X X One-hot encoding X X X X×n x×n X×n X×n x X X X X X Many-hot encoding X x X x x X X X X x Dataset Transposition X X X X X x x X X X X Alignment X X x x x X x X X X x Table 7.14 Representation × Objective

4 Within the set of systems analyzed in the book.

191 Objective Type Dest./Use Mode Au Me Po CP MV Dr Co CA ST AR SP HI AG IG Architecture Feedforward X X x x X x X x X X X X X X Autoencoder X X x x X X X X X X X X Variational x X X X X X x X X X x Restricted Boltzmann machine x x X X x x x X X X x Recurrent (RNN) x X X X X X X X X x X X X x Convolutional X X X X X X X X X X X x Conditioning X X X X X X X x X X X X x Generative adversarial networks x X X X X X x X X X x Reinforcement learning (RL) x X X X X X X X X x Strategy Single-step feedforward X X x x X x X x X X X X X x Decoder feedforward X X x x x x x X X X X X Sampling X x x x X x X x x X X X X Iterative feedforward X X X X X X X X X X X x Input manipulation x x X x x x X X x X X X x Reinforcement X x x x x x X X X x Unit selection x X x x x x x x X X X x Table 7.15 Architecture & Strategy × Objective

192 Representation Audio Concept Format TmpS Meta-Data Expr. Encoding DSet Wa Sp No Re Ch Rh Dr MI Pi Te Gl TS NH NE Fe FE To Dy VE OH MH Tr Al Objective Type Au X X x x x x x x x x X x x x x X X Me X X X X X X X X X X X X x x X X X X Po X X X X X X X X X X X X X X X X X X X CP X X X X x X x X X X X X x x x X X X X x MV X X X X X X X X X X X X X x x x X X×n X X x Dr X X X X X X X X X X x x x x X X Co X X x X x X X X X X X X x x x x X x X x CA X X X x x x x x X X x x x x x x X×n x x X ST X X x X x x X x X x X x x x X X X X x X x Destination & Use AR X X x x x x x x x x X x x x X X x SP X X X X X X X X X X X X X X X X X X X X X HI X X X X X X X X X X X X X X X X X X X X X Mode AG X X X X X X X X X X X X X X X X X X X X X X X IG x X X X x X x x X x X x X X X x X X X X x X x Architecture Fd X X X X X X X x X x X X X X X x X X X X X X x Ae x X X X x X X X X X X X X X X X X X X x X x Va x x X X x X X X X X X X X X x X X x X x X x RB x x X X X X x x X x X X x X x X X x X X X RN X X X X X X X X X X X X x X X X X X X X Cv X X X X X X x x X x X x X x X x X X X X x X x Cn X x X X X X X x X x X X X X X X X X X X X X x GA x x X X X x x X X x X X x X x X X X X x X x RL X X x x x x X x X X x X x X X x X x X x Strategy SF X x X X X X X X X X X X X x x X X x X X X X DF x X X X x X X X X X X X X X x X X X X x X x Sa X x X X X X X X X X X X X X X x X X X X X X X RF X X X X X X X X X X X X X X X X X X X X X IF X X X X X X X X X X X X X X X X X X X X X IM x X X X x x X X X X X X x x x X X X X x X x Re X X x x X X X X X X x X x X X x X x X x US x x X X x x X X X X X X x x X X X X X x X x Table 7.16 Objective & Architecture & Strategy × Representation

Architecture Fd Ae Va RB RN Cv Cn GA RL Strategy Single-step feedforward X X X X Decoder feedforward X X x X x Sampling X X X X X X X X X Iterative feedforward X X X X X X Input manipulation X X x X x X x x Reinforcement X x X Unit selection X X x Table 7.17 Strategy × Architecture

193 Challenge EN LV CV Es MH Co St Or Ic It Ad Ey Architecture Feedforward X X X Autoencoder X X X X Variational X X X X Restricted Boltzmann machine (RBM) X X X Recurrent (RNN) X X X X X X X Convolutional Conditioning X X X Generative adversarial networks (GAN) X X x X Reinforcement learning (RL) X X X X x X X X Strategy Single-step feedforward Decoder feedforward X X Sampling X X X x X X Iterative feedforward X X X X X x Input manipulation X X X X X x x Reinforcement X X X X X X x X Unit selection X X X X X X x Table 7.18 Architecture & Strategy × Challenge

The analysis of the correlation tables allows us to draw a few initial observations about some design decisions and their consequences:

• audio versus symbolic representation (Table 7.16);
• global versus time step temporal scope representation (Tables 7.16 and 7.17). A global temporal scope representation is usually coupled with: a) a feedforward architecture and a single-step feedforward strategy, or b) an autoencoder architecture with a decoder feedforward strategy. A time step temporal scope representation is usually coupled with a recurrent architecture and an iterative feedforward strategy. However, there are some exceptions (e.g., Time-Windowed in Section 6.8.1 and BLSTM in Section 6.8.3); and
• accompaniment objective versus seed-based generation (Table 7.15). DeepBach is a notable exception: thanks to its sampling-only strategy, it can generate an accompaniment as well as a complete chorale (see the discussion in Section 6.17.1).

Note that in Table 7.18, the columns corresponding to the expressiveness and explainability challenges are left empty. This is because these challenges cannot be addressed solely by the choice of an architecture or a strategy. We do not comment further on these tables in this book. We consider them a first version of analysis tools related to our proposed conceptual framework, to be tried out and improved for investigating current as well as future systems.


Chapter 8 Discussion and Conclusion

We now revisit some design decision issues raised through our analysis and discuss related prospects.

8.1 Global versus Time Step

As we have seen in Sections 6.8 and 6.14.1, one important decision is to choose between the two main types of temporal scope representation:

• global, including all time steps – typically coupled with a feedforward or an autoencoder architecture; and
• time step, representing a single time step – typically coupled with a recurrent neural network (RNN) architecture1.

The pros and cons are as follows (a small code sketch contrasting the two temporal scopes is shown after this list):

• global

  + allows arbitrary output, e.g., for the objective of generating some accompaniment through the single-step feedforward strategy on a feedforward architecture, as, for example, in the MiniBach system (Section 6.2.2);
  + supports incremental instantiation (via sampling), as, for example, in the DeepBach system (Section 6.14.2);
  – does not allow variable length generation;
  – does not allow seed-based generation for a feedforward architecture;
  + allows seed-based generation through the decoder feedforward strategy on an autoencoder architecture, as, for example, in the DeepHear system (Section 6.4.1.1);

• time step

  +/– supports incremental instantiation, but only forward in time, through the iterative feedforward strategy on a recurrent network architecture;
  + allows variable length generation, as, for example, in the CONCERT system (Section 6.6.1.1).

Actually, an attempt at combining “the best of both worlds” seems to lie in using an RNN Encoder-Decoder architecture, as

• generation is iterative, which allows variable length content generation;
• the output sequence may have an arbitrary length and content2, which allows arbitrary output generation; and
• the latent variables provide a global temporal scope representation that can be manipulated.
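To make this distinction more concrete, here is a minimal sketch (our own illustration, not code from any of the surveyed systems), assuming a toy one-hot piano-roll encoding and using PyTorch: a feedforward model processes a global temporal scope (all time steps flattened into one fixed-size vector) in a single step, whereas a recurrent model processes one time step per call and can therefore be iterated for an arbitrary number of steps.

```python
import torch
import torch.nn as nn

pitch_classes, time_steps = 64, 16                # hypothetical toy sizes

# Global temporal scope: the whole fixed-length piano roll is a single input,
# as with a single-step feedforward strategy on a feedforward architecture.
global_model = nn.Sequential(
    nn.Linear(pitch_classes * time_steps, 128),
    nn.ReLU(),
    nn.Linear(128, pitch_classes * time_steps),   # e.g., an accompaniment
)
roll = torch.zeros(1, time_steps, pitch_classes)  # one (empty) piano roll
out = global_model(roll.flatten(start_dim=1))     # one shot, fixed length only

# Time step temporal scope: the model sees one time step per call and is
# iterated, as with an iterative feedforward strategy on a recurrent architecture.
rnn = nn.LSTM(input_size=pitch_classes, hidden_size=128, batch_first=True)
head = nn.Linear(128, pitch_classes)

step, state, generated = torch.zeros(1, 1, pitch_classes), None, []
for _ in range(32):                               # any length is possible
    h, state = rnn(step, state)
    logits = head(h[:, -1])
    nxt = torch.distributions.Categorical(logits=logits).sample()
    step = nn.functional.one_hot(nxt, pitch_classes).float().unsqueeze(1)
    generated.append(nxt.item())
```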

1 In general, the granularity of the time step is set at the level of the smallest note duration, as discussed in Section 4.8.2. The Time-Windowed and BLSTM architectures (respectively, Sections 6.8.1 and 6.8.3) are both peculiar cases of a coarse-grained time step (respectively, one and four measures long). The Time-Windowed architecture has the additional specificity that it is a feedforward and not a recurrent architecture.
2 As is the case for translation tasks.

Moreover, in a variational version, such as VRAE, the latent space can be explored in a disciplined and meaningful manner, as, for example, in the GLSR-VAE and MusicVAE systems (Sections 6.10.2.3 and 6.12.1). Note also that an alternative to a recurrent architecture is a convolutional architecture applied over the time dimension, as discussed in the next section (Section 8.2).

8.2 Convolution versus Recurrent

As noted in Section 5.9, convolutional architectures, while prevalent for image applications, are used less often than recurrent neural network (RNN) architectures in music applications. The few examples of nonrecurrent architectures using convolution on the time dimension that we have encountered and analyzed are

• WaveNet, a convolutional feedforward architecture for audio (Section 6.10.3.2);
• MidiNet, a GAN architecture encapsulating conditional convolutional architectures (Section 6.10.3.3)3; and
• C-RBM, a convolutional RBM (Section 6.10.5.1).

Let us try to list and analyze the relative pros and cons of using recurrent architectures or convolutional architectures to model time correlations:

• recurrent networks are popular and accurate, especially since the arrival of LSTM architectures;
• convolution should a priori be used only on the time dimension because, as opposed to images, where motifs are invariant in all dimensions, in music the pitch dimension is a priori not metrically invariant4;
• convolutional networks are typically faster to train and easier to parallelize than recurrent networks [195];
• using convolution on the time dimension in place of a recurrent network implies multiplying the number of input variables by the number of time steps considered, and thus leads to a significant increase in the volume of data to process and in the number of parameters to adjust5;
• weight sharing by convolutions only applies to a small number of temporally neighboring members of the input, in contrast to a recurrent network, which shares parameters in a deep way, for all time steps (see Section 5.9);
• the authors of WaveNet argue that layers of dilated convolutions allow the receptive field to grow longer in a much cheaper way than with LSTM units (a small sketch illustrating this is shown at the end of this section);
• the authors of MidiNet argue that using a conditioning strategy for a convolutional architecture allows the incorporation of information from previous measures into intermediate layers and therefore takes history into account as a recurrent network would do.

A potential outcome of this initial comparative analysis is that, as there are many systems using nonrecurrent architectures (such as feedforward networks or autoencoders), it may be interesting to study whether their extension into a convolutional architecture on the time dimension could bring some effective gain, in terms of efficiency and/or accuracy.

Actually, this issue of using a convolutional versus a recurrent architecture has recently been challenged by the introduction of a novel architecture, named Transformer [198], with an objective similar to that of an RNN Encoder-Decoder architecture (introduced in Section 5.13.3). This architecture does not use convolutions or recurrence and is based only on an attention mechanism (Section 5.8.4). A very recent6 application to music generation, named MusicTransformer, has been presented in [89], with apparently very good results regarding its capacity to model long-term structure.
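To illustrate the receptive field argument, here is a minimal sketch (our own illustration, not WaveNet's implementation), assuming a generic multi-channel input over the time dimension: stacking 1D convolutions whose dilation doubles at each layer makes the receptive field grow exponentially with depth.

```python
import torch
import torch.nn as nn

channels = 32                                  # hypothetical feature channels

class DilatedTimeConvStack(nn.Module):
    """Convolution applied only on the time dimension, dilation doubling per layer."""
    def __init__(self, n_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)
        )

    def forward(self, x):                      # x: (batch, channels, time)
        for conv in self.layers:
            x = torch.relu(conv(x))
        return x

    def receptive_field(self) -> int:
        # Each layer (kernel size 2, dilation d) widens the receptive field by d.
        return 1 + sum(conv.dilation[0] for conv in self.layers)

stack = DilatedTimeConvStack()
print(stack.receptive_field())                 # 1 + (1+2+4+8+16+32) = 64 time steps
y = stack(torch.randn(1, channels, 256))       # time axis shrinks (no padding used)
```

With only six layers the receptive field already spans 64 time steps, while each layer adds a small, fixed number of weights; this is the cheapness argument made for dilated convolutions over LSTM units.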

3 A comparison between MidiNet and C-RNN-GAN, both using GANs but encapsulating a convolutional network versus a recurrent network, is also interesting.
4 Otherwise this could break the notion of tonality; see the rationale for the C-RBM system in Section 6.10.5.1. However, the system analyzed in Section 6.9.2 considers recurrence on the pitch class dimension in order to model simultaneous notes (chords).
5 For that reason, recurrent networks are still the norm for learning time series of multi-dimensional data like, for example, 2-D images for weather prediction.
6 Too late to analyze it thoroughly in this book.

8.3 Style Transfer and Transfer Learning

Transfer learning is an important issue for deep learning and machine learning in general. As training can be a tedious process, the issue is to be able to reuse, at least partially, what has been learnt in one context in other contexts. Various cases may be considered, e.g., similar source and target domains, similar tasks, etc. This research subdomain, named transfer learning, is about methodologies and techniques for the transfer of what has been learnt [63, Section 15.2]. We have not addressed this important issue in our analysis because it has not yet been specifically addressed for music generation, although we think that it will become an area of investigation. Meanwhile, an example, although still simplistic, is the way the DeepHear architecture and what it has learnt are transferred from the objective of generating a melody to the objective of generating a counterpoint (see Section 6.10.4.1). Another example is the way the DeepBach architecture makes it possible to adapt the objective from ex nihilo chorale generation to multi-voice accompaniment (see Section 6.14.2). Last, let us remember that style transfer is a very specific case of transfer learning in terms of objective and techniques (see Section 6.10.4.5).
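As a purely generic illustration of this kind of reuse (a sketch of the common freeze-and-fine-tune recipe, not the actual DeepHear or DeepBach transfer procedures; the checkpoint file name is hypothetical), the lower layers of a pretrained model can be kept and frozen while a new output layer is trained for the new objective:

```python
import torch
import torch.nn as nn

# A model assumed to have been pretrained on a first objective (e.g., melody).
pretrained = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),            # lower layers: learnt features
    nn.Linear(128, 256),                       # output layer for the old objective
)
# pretrained.load_state_dict(torch.load("melody_model.pt"))  # hypothetical checkpoint

feature_layers = pretrained[:2]                # reuse the learnt lower layers
for p in feature_layers.parameters():
    p.requires_grad = False                    # freeze what has been learnt

new_head = nn.Linear(128, 256)                 # new objective, e.g., counterpoint
transfer_model = nn.Sequential(feature_layers, new_head)
optimizer = torch.optim.Adam(new_head.parameters(), lr=1e-3)  # train only the head
```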

8.4 Cooperation

All the systems surveyed are basically standalone systems (although the architecture may be compound). A more cooperative approach is natural for handling complexity, heterogeneity, scalability and openness, as, for example, pioneered by multi-agent systems [207]. An example is the system proposed by Hutchings and McCormack [91]. It is composed of two agents:

• a harmony agent, based on an RNN (LSTM) architecture, in charge of the progression of chords; and
• a melody agent, based on a rule-based system, in charge of the melody.

The two agents work in a cooperative way and alternate between leading and accompanying roles (inspired by, for example, the way musicians function in a jazz band). The authors report interesting dynamics between the two agents, as well as an interesting balance between harmonic creativity and harmonic consistency7. This approach appears to be an interesting direction to pursue and extend with more agents and roles.
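The following sketch only illustrates the alternation of leading and accompanying roles between two agents; it is a deliberately simplified toy loop of our own, not the implementation by Hutchings and McCormack, and both agents are stubbed with placeholder choices:

```python
import random

class HarmonyAgent:
    """Placeholder for an LSTM-based agent proposing chords."""
    def act(self, context, leading):
        chord = random.choice(["Cmaj7", "Dm7", "G7", "Am7"])
        return chord if leading else chord + " (supporting)"

class MelodyAgent:
    """Placeholder for a rule-based agent proposing melody notes."""
    def act(self, context, leading):
        note = random.choice(["C4", "E4", "G4", "A4"])
        return note if leading else note + " (supporting)"

harmony, melody = HarmonyAgent(), MelodyAgent()
shared_context = []                      # both agents see what has been played so far
for bar in range(8):
    melody_leads = (bar // 2) % 2 == 0   # swap leading/accompanying roles every two bars
    shared_context.append({
        "bar": bar,
        "melody": melody.act(shared_context, leading=melody_leads),
        "harmony": harmony.act(shared_context, leading=not melody_leads),
    })
print(shared_context)
```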

8.5 Specialization

A general issue is the hyper-specialization of systems designed for a specific objective and/or a specific type of corpus. This is witnessed by the diversity of the architectures and approaches surveyed. Note that this is a known issue for Artificial Intelligence (AI) research in general. There is some tendency towards hyper-specialized systems solving specific problems, especially in the case of competitions organized by conferences or other institutions, with the risk of losing the initial objective of a general problem-solving framework8. Meanwhile, the general objective of generating interesting musical content is complex and still an open issue. Thus, we need to work both on general approaches for general problems and on specific approaches for specific subproblems, as well as on top-down and bottom-up approaches, while not losing interest in how to interpret, generalize and reuse advances and lessons learnt9.

7 On this issue, see Section 6.13.
8 An interesting counterexample is the ongoing research and competition about general game playing [59].
9 We hope that the survey and analysis conducted in this book will contribute to this research agenda.

8.6 Evaluation and Creativity

Evaluation of a system generating music mostly consists in a qualitative evaluation of examples of generated music10. For many experiments, evaluation is only preliminary and, in many cases, only conducted by the designers themselves. There are of course some exceptions, with more systematic and external evaluations (by a more or less expert public). When the corpus is very precise, e.g., in the case of J. S. Bach's chorales for the BachBot or the DeepBach systems (Sections 6.17.1 and 6.14.2), a Turing test may be conducted: a piece of music is presented to the public, who have to guess whether it is one of the original pieces or music generated by a computer. But this methodology is more limited when the objective is not to generate music highly conformant to a relatively narrow style (and corpus), as in the case of J. S. Bach chorales11, but to generate more creative music.

Moreover, if we consider as a general objective for a system the capacity to assist composers and musicians12, rather than to autonomously generate music (see Section 1.1.2), we should perhaps consider as an evaluation criterion the satisfaction of the composer (notably, whether the assistance of the computer allowed them to compose and create music that they consider would not have been possible otherwise), rather than the satisfaction of the listeners (who too often remain guided by some conformance to a current musical trend).

A fundamental limitation is that there is no clear objective function associated with creativity and art quality. Therefore, the selection of a musical corpus used as the set of training examples is a first fundamental step and decision13. But in order to be able to learn something interesting, a relatively coherent/homogeneous corpus needs to be selected14. This unfortunately favors the quality (actually, the conformance) of the generated musical content with respect to the learnt style, rather than its intrinsic quality (interest).

Current experiments and directions to promote creativity rely mostly on constraints to avoid plagiarism and/or heuristics to incentivize generation outside the “comfort zone” that the deep architecture has learnt from the corpus, while balancing elements of surprise with predictability/understandability. Such creativity control may be applied during the training phase (the case of the CAN architecture, see Section 6.13.2) or (for most types of control, see Sections 6.10 and 6.13) during the generation phase. An alternative (and complementary) direction to better model such an element of surprise could be to include a model of an artificial listener with some model of expectation, e.g., one that could evaluate how well a newly generated content can be “explained” in terms of references to an existing memory, as proposed, e.g., in [41].

Last, an additional fundamental limitation is that current deep learning techniques for learning and generating music are based on artefacts, i.e., actual musical data, independently of the processes and the culture that have led to them. If we want to envision more profound systems, it is likely that we will have to incorporate some modeling of the context and the process leading to musical artefacts, and not only the artefacts themselves. Indeed, when considering art history, creation takes place within a historico-cultural context, with refinements15 as well as possible ruptures16.
One possible direction would then be not just to model content generation from a frozen artistic corpus outside of its history, but to try to model a more dynamical process of creation that includes the historico-cultural context17, with the history and dynamics perhaps addressed by recurrent architectures and/or reinforcement learning, although this still appears a long way off.

10 Regarding the possibility of more systematic objective criteria for evaluation, we can for example look at the analysis by Theis et al. for the case of image generation [187]. The authors state that the evaluation of image generative models is multicriteria, via different possible metrics such as log-likelihood, Parzen window estimates, or qualitative visual fidelity, and that a good result with respect to one criterion does not necessarily imply a good result with respect to another criterion.
11 Bach chorales, and more generally speaking Bach's music, are often used for experiments and evaluation, because the corpus is quite homogeneous regarding a given style (e.g., preludes, chorales, etc.) as well as quality. It also fits particularly well with algorithmic composition, of which Bach was somehow a precursor.
12 As, for instance, pioneered by the FlowComposer prototype, introduced in Section 6.11.4.
13 As noted in Section 4.12.
14 Constructing a corpus with the “best considered” musical pieces, independently of the style (classical, jazz, pop, etc.) – as a museum or an exhibition might do by presenting in a single room its best artefacts of different natures and origins – is not likely to produce interesting results, because such a corpus is too sparse and heterogeneous.
15 E.g., the extension of classical harmony based on triads (only root, third and fifth) to extended chords.
16 E.g., movements like dodecaphonism or free jazz.
17 The modeling of the context is one of the limitations of current deep learning architectures and is a topic of ongoing research. An illustrative real-world counterexample is the case of a Chinese woman (chairwoman of China's biggest air conditioner maker) who found her face displayed in 2018 in the port city of Ningbo on a huge screen that displays images of people caught jaywalking by surveillance cameras. It was then found that the artificial intelligence-backed monitoring system had captured her face from an advertisement on the side of a moving bus [184].


8.7 Conclusion

The use of deep learning techniques for the creation of musical content, and more generally of creative artistic content, is nowadays receiving increasing attention. This book presented a survey and an analysis of various strategies and techniques for using deep learning to generate musical content. We have proposed a multi-criteria conceptual framework based on five dimensions: objective, representation, architecture, challenge and strategy. We have analyzed and compared numerous systems and experiments proposed by researchers in the literature. We hope that the conceptual framework provided in this book will help in understanding the issues and in comparing various approaches for using deep learning for music generation, and therefore contribute to this research agenda.



References

1. Moray Allan and Christopher K. I. Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 17:25–32, 2005. 2. Giuseppe Amato, Malte Behrmann, Fred´ eric´ Bimbot, Baptiste Caramiaux, Fabrizio Falchi, Ander Garcia, Joost Geurts, Jaume Gibert, Guillaume Gravier, Hadmut Holken, Hartmut Koenitz, Sylvain Lefebvre, Antoine Liutkus, Fabien Lotte, Andrew Perkis, Rafael Redondo, Enrico Turrin, Thierry Vieville, and Emmanuel Vincent. AI in the media and creative industries, May 2019. arXiv:1905.04175v1. 3.G erard´ Assayag, Camilo Rueda, Mikael Laurson, Carlos Agon, and Olivier Delerue. Computer assisted composition at IRCAM: From PatchWork to OpenMusic. Computer Music Journal (CMJ), 23(3):59–72, September 1999. 4. Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep?, October 2014. arXiv:1312.6184v7. 5. Johann Sebastian Bach. 389 Chorales (Choral-Gesange). Alfred Publishing Company, 1985. 6. David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Muller.¨ How to explain individual classification decisions. Journal of Machine Learning Research (JMLR), (11):1803–1831, June 2010. 7. Gabriele Barbieri, Franc¸ois Pachet, Pierre Roy, and Mirko Degli Esposti. Markov constraints for generating lyrics with style. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI 2012), pages 115–120, Montpellier, France, August 2012. 8. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798–1828, August 2013. 9. Piotr Bojanowski, Armand Joulin, David Lopez-Paz, and Arthur Szlam. Optimizing the latent space of generative networks, July 2017. arXiv:1707.05776v1. 10. Diane Bouchacourt, Emily Denton, Tejas Kulkarni, Honglak Lee, Siddharth Narayanaswamy, David Pfau, and Josh Tenen- baum (Eds.). NIPS 2017 Workshop on Learning Disentangled Representations: from Perception to Control, December 2017. https://sites.google.com/view/disentanglenips2017. 11. Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional se- quences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1159–1166, Edinburgh, Scotland, U.K., 2012. 12. Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Chapter 14th – Modeling and generating sequences of poly- phonic music with the RNN-RBM. In Deep Learning Tutorial – Release 0.1, pages 149–158. LISA lab, University of Montreal,´ September 2015. http://deeplearning.net/tutorial/deeplearning.pdf. 13. Mason Bretan, Gil Weinberg, and Larry Heck. A unit selection methodology for music generation using deep neural networks. In Ashok Goel, Anna Jordanous, and Alison Pease, editors, Proceedings of the 8th International Conference on Computational Creativity (ICCC 2017), pages 72–79, Atlanta, GA, USA, June 2017. 14. Jean-Pierre Briot, Gaetan¨ Hadjeres, and Franc¸ois-David Pachet. Deep learning techniques for music generation – A survey, September 2017. arXiv:1709.01620. 15. Jean-Pierre Briot, Gaetan¨ Hadjeres, and Franc¸ois-David Pachet. Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, 2019. 16. Jean-Pierre Briot and Franc¸ois Pachet. Music generation by deep learning – Challenges and directions. 
Neural Computing and Applications (NCAA), October 2018. Special Issue on Deep Learning for Music and Audio. 17. Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, March 2019. https://distill.pub/2019/activation-atlas. 18. Davide Castelvecchi. The black box of AI. Nature, 538:20–23, October 2016. 19. E. Colin Cherry. Some experiments on the recognition of speech, with one and two ears. The Journal of the Acoustical Society of America, 25(5):975–979, September 1953. 20. Kyunghyun Cho, Bart van Merrienboer,¨ Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder-Decoder for statistical machine translation, September 2014. arXiv:1406.1078v3.

203 21. Keunwoo Choi, Gyorgy¨ Fazekas, Kyunghyun Cho, and Mark Sandler. A tutorial on deep learning for music information retrieval, September 2017. arXiv:1709.04396v1. 22. Keunwoo Choi, Gyorgy¨ Fazekas, and Mark Sandler. Text-based LSTM networks for automatic music composition. In 1st Conference on Computer Simulation of Musical Creativity (CSMC 16), Huddersfield, U.K., June 2016. 23. Franc¸ois Chollet. Building autoencoders in Keras, May 2016. https://blog.keras.io/building-autoencoders-in-keras.html. 24. Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard´ Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks, January 2015. arXiv:1412.0233v3. 25. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling, December 2014. arXiv:1412.3555v1. 26. Yu-An Chung, Chao-Chung Wu, Chia-Hao Shen, Hung-Yi Lee, and Lin-Shan Lee. Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder, June 2016. arXiv:1603.00982v4. 27. David Cope. The Algorithmic Composer. A-R Editions, 2000. 28. David Cope. Computer Models of Musical Creativity. MIT Press, 2005. 29. Fabrizio Costa, Thomas Gartner,¨ Andrea Passerini, and Franc¸ois Pachet. Constructive Machine Learning – Workshop Proceedings, December 2016. http://www.cs.nott.ac.uk/˜psztg/cml/2016/. 30.M arcio´ Dahia, Hugo Santana, Ernesto Trajano, Carlos Sandroni, and Geber Ramalho. Generating rhythmic accompaniment for guitar: the Cyber-Joao˜ case study. In Proceedings of the IX Brazilian Symposium on Computer Music (SBCM 2003), pages 7–13, Campinas, SP, Brazil, August 2003. 31. Shuqi Dai, Zheng Zhang, and Gus Guangyu Xia. Music style transfer issues: A position paper, March 2018. arXiv:1803.06841v1. 32. Ernesto Trajano de Lima and Geber Ramalho. On rhythmic pattern extraction in bossa nova music. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), pages 641–646, Philadelphia, PA, USA, September 2008. ISMIR. 33. Roger T. Dean and Alex McLean, editors. The Oxford Handbook of Algorithmic Music. Oxford Handbooks. Oxford University Press, 2018. 34. Jean-Marc Deltorn. Deep creations: Intellectual property and the automata. Frontiers in Digital Humanities, 4, February 2017. Article 3. 35. Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to attend with deep architectures for image tracking, September 2011. arXiv:1109.3737v1. 36. Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling, October 2012. arXiv:1210.5474v1. 37. Rob DiPietro. A friendly introduction to cross-entropy loss, 02/05/2016. https://rdipietro.github.io/friendly-intro-to-cross-entropy- loss/. 38. Carl Doersch. Tutorial on variational autoencoders, August 2016. arXiv:1606.05908v2. 39. Pedro Domingos. A few useful things to know about machine learning. Communications of the ACM (CACM), 55(10):78–87, October 2012. 40. Kenji Doya and Eiji Uchibe. The Cyber Rodent project: Exploration of adaptive mechanisms for self-preservation and self- reproduction. Adaptive Behavior, 13(2):149–160, 2005. 41. Shlomo Dubnov and Greg Surges. Chapter 6 – Delegating creativity: Use of musical algorithms in machine listening and composition. In Newton Lee, editor, Digital Da Vinci – Computers in Music, pages 127–158. Springer-Verlag, 2014. 42. Kemal Ebcioglu.˘ An expert system for harmonizing four-part chorales. 
Computer Music Journal (CMJ), 12(3):43–51, Autumn 1988. 43. Douglas Eck and Jurgen¨ Schmidhuber. A first look at music composition using LSTM recurrent neural networks. Technical report, IDSIA/USI-SUPSI, Manno, Switzerland, 2002. No. IDSIA-07-02. 44. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks, May 2016. arXiv:1512.03965v4. 45. Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. CAN: Creative adversarial networks generating “art” by learning about styles and deviating from style norms, June 2017. arXiv:1706.07068v1. 46. Emerging Technology from the arXiv. Deep learning machine solves the cocktail party problem. MIT Technology Review, April 2015. https://www.technologyreview.com/s/537101/deep-learning-machine-solves-the-cocktail-party-problem/. 47. Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, and Pascal Vincent. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research (JMLR), (11):625–660, 2010. 48. Douglas Eck et al. Magenta Project, Accessed on 20/06/2017. https://magenta.tensorflow.org. 49. Franc¸ois Pachet et al. Flow Machines – Artificial Intelligence for the future of music, 2012. http://www.flow-machines.com. 50. Otto Fabius and Joost R. van Amersfoort. Variational recurrent auto-encoders, June 2015. arXiv:1412.6581v6. 51. Jose David Fernandez´ and Francisco Vico. AI methods in algorithmic composition: A comprehensive survey. Journal of Artificial Intelligence Research (JAIR), (48):513–582, 2013. 52. Rebecca Fiebrink and Baptiste Caramiaux. The machine learning algorithm as creative musical tool, November 2016. arXiv:1611.00379v1. 53. Davis Foote, Daylen Yang, and Mostafa Rohaninejad. Audio style transfer – Do androids dream of electric beats?, December 2016. https://audiostyletransfer.wordpress.com. 54. Eric Foxley. Nottingham Database, Accessed on 12/03/2018. https://ifdo.ca/˜seymour/nottingham/nottingham.html. 55. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Professional Computing Series. Addison-Wesley, 1995. 56. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. A neural algorithm of artistic style, September 2015. arXiv:1508.06576v2.

204 57. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423. IEEE, June 2016. 58. Robert Gauldin. A Practical Approach to Eighteenth-Century Counterpoint. Waveland Press, 1988. 59. Michael Genesereth and Yngvi Bjornsson.¨ The international general game playing competition. AI Magazine, pages 107–111, Summer 2013. 60. Felix A. Gers and Jurgen¨ Schmidhuber. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 3, pages 189–194. IEEE, 2000. 61. Kratarth Goel, Raunaq Vohra, and J. K. Sahoo. Polyphonic music generation by modeling temporal dependencies using a RNN-DBN. In Proceedings of the International Conference on Artificial Neural Networks, number 8681 in Theoretical Computer Science and General Issues, pages 217–224. Springer International Publishing, 2014. 62. Michael Good. MusicXML for notation and analysis. In Walter B. Hewlett and Eleanor Selfridge-Field, editors, The Virtual Score: Representation, Retrieval, Restoration, pages 113–124. MIT Press, 2001. 63. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. 64. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozairy, Aaron Courville, and Yoshua Bengio. Generative adversarial nets, June 2014. arXiv:1406.2661v1. 65. Alex Graves. Generating sequences with recurrent neural networks, June 2014. arXiv:1308.0850v5. 66. Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines, December 2014. arXiv:1410.5401v2. 67. Gaetan¨ Hadjeres. Interactive Deep Generative Models for Symbolic Music. PhD thesis, Ecole Doctorale EDITE, Sorbonne Universite,´ Paris, France, June 2018. 68. Gaetan¨ Hadjeres and Frank Nielsen. Interactive music generation with positional constraints using Anticipation-RNN, September 2017. arXiv:1709.06404v1. 69. Gaetan¨ Hadjeres, Frank Nielsen, and Franc¸ois Pachet. GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures, July 2017. arXiv:1707.04588v1. 70. Gaetan¨ Hadjeres, Franc¸ois Pachet, and Frank Nielsen. DeepBach: a steerable model for Bach chorales generation, June 2017. arXiv:1612.01010v2. 71. Jeff Hao. Hao staff piano roll sheet music, Accessed on 19/03/2017. http://haostaff.com/store/index.php?main page=article. 72. Dominik Harnel.¨ ChordNet: Learning and producing voice leading with neural networks and dynamic programming. Journal of New Music Research (JNMR), 33(4):387–397, 2004. 73. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Predic- tion. Springer Series in Statistics. Springer-Verlag, 2009. 74. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, December 2015. arXiv:1512.03385v1. 75. Dorien Herremans and Ching-Hua Chuan. Deep Learning for Music – Workshop Proceedings, May 2017. http://dorienherremans.com/dlm2017/. 76. Dorien Herremans and Ching-Hua Chuan. The emergence of deep learning: new opportunities for music and audio technologies. Neural Computing and Applications (NCAA), April 2019. Special Issue on Deep Learning for Music and Audio. 77. Walter Hewlett, Frances Bennion, Edmund Correia, and Steve Rasmussen. 
MuseData – an electronic library of classical music scores, Accessed on 12/03/2018. http://www.musedata.org. 78. Lejaren A. Hiller and Leonard M. Isaacson. Experimental Music: Composition with an Electronic Computer. McGraw-Hill, 1959. 79. Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, August 2002. 80. Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006. 81. Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504– 507, 2006. 82. Geoffrey E. Hinton and Terrence J. Sejnowski. Learning and relearning in Boltzmann machines. In David E. Rumelhart, James L. McClelland, and PDP Research Group, editors, Parallel Distributed Processing – Explorations in the Microstructure of Cognition: Volume 1 Foundations, pages 282–317. MIT Press, Cambridge, MA, USA, 1986. 83. Sepp Hochreiter and Jurgen¨ Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 84. Douglas Hofstadter. Staring Emmy straight in the eye–and doing my best not to flinch. In David Cope, editor, Virtual Music – Computer Synthesis of Musical Style, pages 33–82. MIT Press, 2001. 85. Hooktheory. Theorytabs, Accessed on 26/07/2017. https://www.hooktheory.com/theorytab. 86. Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. 87. Allen Huang and Raymond Wu. Deep learning for music, June 2016. arXiv:1606.04930v1. 88. Cheng-Zhi Anna Huang, David Duvenaud, and Krzysztof Z. Gajos. ChordRipple: Recommending chords to help novice composers go beyond the ordinary. In Proceedings of the 21st International Conference on Intelligent User Interfaces (IUI 16), pages 241–250, Sonoma, CA, USA, March 2016. ACM. 89. Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure generating music with long-term structure, December 2018. arXiv:1809.04281v3. 90. Eric J. Humphrey, Juan P. Bello, and Yann LeCun. Feature learning and deep architectures: New directions for music informatics. Journal of Intelligent Information Systems (JIIS), 41(3):461–481, 2013.

205 91. Patrick Hutchings and Jon McCormack. Using autonomous agents to improvise music compositions in real-time. In Joao˜ Correia, Vic Ciesielski, and Antonios Liapis, editors, Computational Intelligence in Music, Sound, Art and Design – 6th International Conference, EvoMUSART 2017, Amsterdam, The Netherlands, April 19–21, 2017, Proceedings, number 10198 in Theoretical Computer Science and General Issues, pages 114–127. Springer International Publishing, 2017. 92. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift, March 2015. http://arxiv.org/abs/1502.03167v3. 93. Natasha Jaques, Shixiang Gu, Richard E. Turner, and Douglas Eck. Tuning recurrent neural networks with reinforcement learning, November 2016. arXiv:1611.02796. 94. Daniel Johnson. Composing music with recurrent neural networks, August 2015. http://www.hexahedria.com/2015/08/03/composing- music-with-recurrent-neural-networks/. 95. Daniel D. Johnson. Generating polyphonic music using tied parallel networks. In Joao˜ Correia, Vic Ciesielski, and Antonios Liapis, editors, Computational Intelligence in Music, Sound, Art and Design – 6th International Conference, EvoMUSART 2017, Amsterdam, The Netherlands, April 19–21, 2017, Proceedings, number 10198 in Theoretical Computer Science and General Issues, pages 128– 143. Springer International Publishing, 2017. 96. Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelli- gence Research (JAIR), (4):237–285, 1996. 97. Ujjwal Karn. An intuitive explanation of convolutional neural networks, August 2016. https://ujjwalkarn.me/2016/08/11/intuitive- explanation-convnets/. 98. Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, May 2015. http://karpathy.github.io/2015/05/21/rnn- effectiveness/. 99. Jeremy Keith. The Session, Accessed on 21/12/2016. https://thesession.org. 100. Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schutt,¨ Sven Dahne,¨ Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods, November 2017. arXiv:1711.00867v1. 101. Pieter-Jan Kindermans, Kristof T. Schutt,¨ Maximilian Alber, Klaus-Robert Muller,¨ Dumitru Erhan, Been Kim, and Sven Dahne.¨ Learning how to explain neural networks: PatternNet and PatternAttribution, 2017. arXiv:1705.05598v2. 102. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, May 2014. arXiv:1312.6114v10. 103. Jan Koutn´ık, Klaus Greff Faustino Gomez, and Jurgen¨ Schmidhuber. A Clockwork RNN, December 2014. arXiv:1402.3511v1. 104. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, volume 1 of NIPS 2012, pages 1097– 1105, Lake Tahoe, NV, USA, 2012. Curran Associates Inc. 105. Bernd Krueger. Classical Piano Midi Page, Accessed on 12/03/2018. http://piano-midi.de/. 106. Andrey Kurenkov. A ‘brief’ history of neural nets and deep learning, Part 4, December 2015. http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/. 107. Patrick Lam. MCMC methods: Gibbs sampling and the Metropolis-Hastings algorithm, Accessed on 21/12/2016. http://pareto.uab.es/mcreel/IDEA2017/Bayesian/MCMC/mcmc.pdf. 108. Kevin J. Lang, Alex H. Waibel, and Geoffrey E. Hinton. A time-delay neural network architecture for isolated word recognition. 
Neural Networks, 3(1):23–43, 1990. 109. Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints. Journal of Creative Music Systems (JCMS), 2(2), March 2018. 110. Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng. Build- ing high-level features using large scale unsupervised learning. In 29th International Conference on Machine Learning, Edinburgh, U.K., 2012. 111. Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time-series. In Michael A. Arbib, editor, The handbook of brain theory and neural networks, pages 255–258. MIT Press, Cambridge, MA, USA, 1998. 112. Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceed- ings of the IEEE, 86(11):2278–2324, Nov 1998. 113. Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/. 114. Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In David S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann, 1990. 115. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 609–616, Montreal,´ QC, Canada, June 2009. ACM. 116. Andre Lemme, Rene´ Felix Reinhart, and Jochen Jakob Steil. Online learning and generalization of parts-based image representations by non-negative sparse autoencoders. Neural Networks, 33:194–203, September 2012. 117. Fei-Fei Li, Andrej Karpathy, and Justin Johnson. Convolutional neural networks (CNNs / ConvNets) – CS231n Convolutional neural networks for visual recognition Lecture Notes, Winter 2016. http://cs231n.github.io/convolutional-networks/#conv. 118. Feynman Liang. BachBot, 2016. https://github.com/feynmanliang/bachbot. 119. Feynman Liang. BachBot: Automatic composition in the style of Bach chorales – Developing, analyzing, and evaluating a deep LSTM model for musical style. Master’s thesis, University of Cambridge, Cambridge, U.K., August 2016. M.Phil in Machine Learning, Speech, and Language Technology.

206 120. Hyungui Lim, Seungyeon Ryu, and Kyogu Lee. Chord generation from symbolic melody using BLSTM networks. In Xiao Hu, Sally Jo Cunningham, Doug Turnbull, and Zhiyao Duan, editors, Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pages 621–627, Suzhou, China, October 2017. ISMIR. 121. Qi Lyu, Zhiyong Wu, Jun Zhu, and Helen Meng. Modelling high-dimensional sequences with LSTM-RTRBM: Application to polyphonic music generation. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 4138–4139. AAAI Press, 2015. 122. Sephora Madjiheurem, Lizhen Qu, and Christian Walder. Chord2Vec: Learning musical chord embeddings. In Proceedings of the Constructive Machine Learning Workshop at 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 2016. 123. Dimos Makris, Maximos Kaliakatsos-Papakostas, Ioannis Karydis, and Katia Lida Kermanidis. Combining LSTM and feed forward neural networks for conditional rhythm composition. In Giacomo Boracchi, Lazaros Iliadis, Chrisina Jayne, and Aristidis Likas, editors, Engineering Applications of Neural Networks: 18th International Conference, EANN 2017, Athens, Greece, August 25–27, 2017, Proceedings, Communications in Computer and Information Science, pages 570–582. Springer International Publishing, 2017. 124. Iman Malik and Carl Henrik Ek. Neural translation of musical style, August 2017. arXiv:1708.03535v1. 125. Stephane´ Mallat. GANs vs VAEs, September 2018. Personal communication. 126. Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis. Conditioning deep generative raw audio models for structured automatic music. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), pages 182–189, Paris, France, September 2018. ISMIR. 127. Huanru Henry Mao, Taylor Shin, and Garrison W. Cottrell. DeepJ: Style-specific music generation, January 2018. arXiv:1801.00887v1. 128. John A. Maurer. A brief history of algorithmic composition, March 1999. https://ccrma.stanford.edu/˜blackrse/algorithm.html. 129. Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. SampleRNN: An unconditional end-to-end neural audio generation model, February 2017. arXiv:1612.07837v2. 130. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, September 2013. arXiv:1301.3781v3. 131. Marvin Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969. 132. Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997. 133. MIDI Manufacturers Association (MMA). MIDI Specifications, Accessed on 14/04/2017. https://www.midi.org/specifications. 134. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning, December 2013. arXiv:1312.5602v1. 135. Olof Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training, November 2016. arXiv:1611.09904v1. 136. Gregoire´ Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Muller.¨ Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, (65):211–222, 2017. 137. Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Deep Dream, 2015. https://research.googleblog.com/2015/06/inceptionism- going-deeper-into-neural.html. 
138. Dan Morris, Ian Simon, and Sumit Basu. Exposing parameters of a trained dynamic model for interactive music creation. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI 2008), pages 784–791, Chicago, IL, USA, July 2008. AAAI Press. 139. Michael C. Mozer. Neural network composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing. Connection Science, 6(2–3):247–280, 1994. 140. Kevin P. Murphy. Machine Learning: a Probabilistic Perspective. MIT Press, 2012. 141. Andrew Ng. Sparse autoencoder – CS294A/CS294W Lecture notes – Deep Learning and Unsupervised Feature Learning Course, Winter 2011. https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf. 142. Andrew Ng. CS229 Lecture notes – Machine Learning Course – Part I Linear Regression, Autumn 2016. http://cs229.stanford.edu/notes/cs229-notes1.pdf. 143. Andrew Ng. CS229 Lecture notes – Machine Learning Course – Part IV Generative Learning algorithms, Autumn 2016. http://cs229.stanford.edu/notes/cs229-notes2.pdf. 144. Gerhard Nierhaus. Algorithmic Composition: Paradigms of Automated Music Generation. Springer-Verlag, 2009. 145. Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press, July 1994. 146. Franc¸ois Pachet. Beyond the cybernetic jam fantasy: The Continuator. IEEE Computer Graphics and Applications (CG&A), 4(1):31– 35, January/February 2004. Special issue on Emerging Technologies. 147. Franc¸ois Pachet, Jeff Suzda, and Daniel Mart´ın. A comprehensive online database of machine-readable leadsheets for Jazz standards. In Alceu de Souza Britto Junior, Fabien Gouyon, and Simon Dixon, editors, Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR 2013), pages 275–280, Curitiba, PA, Brazil, November 2013. ISMIR. 148. Franc¸ois Pachet, Alexandre Papadopoulos, and Pierre Roy. Sampling variations of sequences for structured music generation. In Xiao Hu, Sally Jo Cunningham, Doug Turnbull, and Zhiyao Duan, editors, Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pages 167–173, Suzhou, China, October 2017. ISMIR. 149. Franc¸ois Pachet and Pierre Roy. Markov constraints: Steerable generation of Markov sequences. Constraints, 16(2):148–172, 2011. 150. Franc¸ois Pachet and Pierre Roy. Imitative leadsheet generation with user constraints. In Torsten Schaub, Gerhard Friedrich, and Barry O’Sullivan, editors, ECAI 2014 – Proceedings of the 21st European Conference on Artificial Intelligence, Frontiers in Artificial Intelligence and Applications, pages 1077–1078. IOS Press, 2014. 151. Franc¸ois Pachet, Pierre Roy, and Gabriele Barbieri. Finite-length Markov processes with constraints. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), pages 635–642, Barcelona, Spain, July 2011.

152. Alexandre Papadopoulos, François Pachet, and Pierre Roy. Generating non-plagiaristic Markov sequences with max order sampling. In Mirko Degli Esposti, Eduardo G. Altmann, and François Pachet, editors, Creativity and Universality in Language, Lecture Notes in Morphogenesis. Springer International Publishing, 2016.
153. Alexandre Papadopoulos, Pierre Roy, and François Pachet. Assisted lead sheet composition using FlowComposer. In Michel Rueher, editor, Principles and Practice of Constraint Programming: 22nd International Conference, CP 2016, Toulouse, France, September 5–9, 2016, Proceedings, Programming and Software Engineering, pages 769–785. Springer International Publishing, 2016.
154. Aniruddha Parvat, Jai Chavan, and Siddhesh Kadam. A survey of deep-learning frameworks. In Proceedings of the International Conference on Inventive Systems and Control (ICISC 2017), Coimbatore, India, January 2017.
155. Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore: Automated whitebox testing of deep learning systems, September 2017. arXiv:1705.06640v4.
156. Frank Preiswerk. Shannon entropy in the context of machine learning and AI, 04/01/2018. https://medium.com/swlh/shannon-entropy-in-the-context-of-machine-learning-and-ai-24aee2709e32.
157. Mathieu Ramona, Giordano Cabral, and François Pachet. Capturing a musician's groove: Generation of realistic accompaniments from single song recordings. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015) – Demos Track, pages 4140–4141, Buenos Aires, Argentina, July 2015. AAAI Press / IJCAI.
158. Bharath Ramsundar and Reza Bosagh Zadeh. TensorFlow for Deep Learning. O'Reilly Media, March 2018.
159. Dario Ringach and Robert Shapley. Reverse correlation in neurophysiology. Cognitive Science, 28:147–166, 2004.
160. Curtis Roads. The Computer Music Tutorial. MIT Press, 1996.
161. Adam Roberts. MusicVAE supplementary materials, Accessed on 27/04/2018. g.co/magenta/musicvae-samples.
162. Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018). ACM, Montréal, QC, Canada, July 2018.
163. Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music, June 2018. arXiv:1803.05428v2.
164. Adam Roberts, Jesse Engel, Colin Raffel, Ian Simon, and Curtis Hawthorne. MusicVAE: Creating a palette for musical scores with machine learning, March 2018. https://magenta.tensorflow.org/music-vae.
165. Stacey Ronaghan. Deep learning: Which loss and activation functions should I use?, 26/07/2018. https://towardsdatascience.com/deep-learning-which-loss-and-activation-functions-should-i-use-ac02f1c56aa8.
166. Frank Rosenblatt. The Perceptron – A perceiving and recognizing automaton. Technical report, Cornell Aeronautical Laboratory, Ithaca, NY, USA, 1957. Report 85-460-1.
167. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986.
168. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
169. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs, June 2016. arXiv:1606.03498v1.
170. Andy M. Sarroff and Michael Casey. Musical audio synthesis using autoencoding neural nets, 2014. http://www.cs.dartmouth.edu/~sarroff/papers/sarroff2014a.pdf.
171. Mike Schuster and Kuldip K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, (11):2673–2681, 1997.
172. Mary Shaw and David Garlan. Software Architecture: Perspectives on an Emerging Discipline. Prentice Hall, 1996.
173. Roger N. Shepard. Geometric approximations to the structure of musical pitch. Psychological Review, (89):305–333, 1982.
174. Ian Simon and Sageev Oore. Performance RNN: Generating music with expressive timing and dynamics, 29/06/2017. https://magenta.tensorflow.org/performance-rnn.
175. Ian Simon, Adam Roberts, Colin Raffel, Jesse Engel, Curtis Hawthorne, and Douglas Eck. Learning a latent space of multitrack measures, June 2018. arXiv:1806.00195v1.
176. Spotify for Artists. Innovating for writers and artists, Accessed on 06/09/2017. https://artists.spotify.com/blog/innovating-for-writers-and-artists.
177. Mark Steedman. A generative grammar for Jazz chord sequences. Music Perception, 2(1):52–77, 1984.
178. Bob L. Sturm and João Felipe Santos. The endless traditional music session, Accessed on 21/12/2016. http://www.eecs.qmul.ac.uk/~sturm/research/RNNIrishTrad/index.html.
179. Bob L. Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. In Proceedings of the 1st Conference on Computer Simulation of Musical Creativity (CSCM 16), Huddersfield, U.K., April 2016.
180. Felix Sun. DeepHear – Composing and harmonizing music with neural networks, Accessed on 21/12/2017. https://fephsun.github.io/2015/09/01/neural-music.html.
181. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.
182. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, September 2014. arXiv:1409.4842v1.

183. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, February 2014. arXiv:1312.6199v4.
184. Li Tao. Facial recognition snares China's air con queen Dong Mingzhu for jaywalking, but it's not what it seems. South China Morning Post, November 2018. https://www.scmp.com/tech/innovation/article/2174564/facial-recognition-catches-chinas-air-con-queen-dong-mingzhu.
185. David Temperley. The Cognition of Basic Musical Structures. MIT Press, Cambridge, MA, USA, 2011.
186. The International Association for Computational Creativity. International Conferences on Computational Creativity (ICCC), Accessed on 17/05/2018. http://computationalcreativity.net/home/conferences/.
187. Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models, 2015. arXiv:1511.01844.
188. John Thickstun, Zaid Harchaoui, and Sham Kakade. Learning features of music from scratch, December 2016. arXiv:1611.09827.
189. Alexey Tikhonov and Ivan P. Yamshchikov. Music generation with variational recurrent autoencoder supported by history, July 2017. arXiv:1705.05458v2.
190. Peter M. Todd. A connectionist approach to algorithmic composition. Computer Music Journal (CMJ), 13(4):27–43, 1989.
191. A. M. Turing. Computing machinery and intelligence. Mind, LIX(236):433–460, October 1950.
192. Dmitry Ulyanov and Vadim Lebedev. Audio texture synthesis and style transfer, December 2016. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/.
193. Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson, and Rich Caruana. Do deep convolutional nets really need to be deep (or even convolutional)?, May 2016. arXiv:1603.05691v2.
194. Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio, December 2016. arXiv:1609.03499v2.
195. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders, June 2016. arXiv:1606.05328v2.
196. Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning, December 2015. arXiv:1509.06461v3.
197. Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer-Verlag, 1995.
198. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, December 2017. arXiv:1706.03762v5.
199. Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey. Sequence-discriminative training of deep neural networks. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), pages 2345–2349, Lyon, France, August 2013. ISCA.
200. Christian Walder. Modelling symbolic music: Beyond the piano roll, June 2016. arXiv:1606.01368.
201. Christian Walder. Symbolic Music Data Version 1.0, June 2016. arXiv:1606.02542.
202. Chris Walshaw. ABC notation home page, Accessed on 21/12/2016. http://abcnotation.com.
203. Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
204. Raymond P. Whorley and Darrell Conklin. Music generation from statistical models of harmony. Journal of New Music Research (JNMR), 45(2):160–183, 2016.
205. WikiArt.org. WikiArt – Visual Art Encyclopedia, Accessed on 22/08/2017. https://www.wikiart.org.
206. Cody Marie Wild. What a disentangled net we weave: Representation learning in VAEs (Pt. 1), July 2018. https://towardsdatascience.com/what-a-disentangled-net-we-weave-representation-learning-in-vaes-pt-1-9e5dbc205bd1.
207. Michael Wooldridge. An Introduction to MultiAgent Systems. John Wiley & Sons, 2009.
208. Lonce Wyse. Audio spectrogram representations for processing with convolutional neural networks. In Proceedings of the 1st International Workshop on Deep Learning for Music, pages 37–41, Anchorage, AK, USA, May 2017.
209. Iannis Xenakis. Formalized Music: Thought and Mathematics in Composition. Indiana University Press, 1963.
210. Yamaha. e-Piano Junior Competition, Accessed on 19/03/2018. http://www.piano-e-competition.com/.
211. Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2Image: Conditional image generation from visual attributes, October 2016. arXiv:1512.00570v2.
212. Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Xiao Hu, Sally Jo Cunningham, Doug Turnbull, and Zhiyao Duan, editors, Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), pages 324–331, Suzhou, China, October 2017. ISMIR.
213. Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks, July 2017. arXiv:1607.03474v5.
