arXiv:1907.04868v1 [cs.SD] 10 Jul 2019

LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Chris Donahue¹, Huanru Henry Mao², Yiting Ethan Li², Garrison W. Cottrell², Julian McAuley²
¹ Department of Music, UC San Diego
² Department of Computer Science, UC San Diego

ABSTRACT

We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.

1. INTRODUCTION

In this paper, we extend recent results for symbolic piano music generation [1] to the multi-instrumental setting. Both piano and multi-instrumental music are polyphonic, where multiple notes may be sounding at any given point in time. However, the generation of multi-instrumental music presents an additional challenge not present in the piano domain: handling the intricate interdependencies between multiple instruments. Another obstacle for the multi-instrumental setting is that there is less data available than for piano, making it more difficult to train the types of powerful generative models used in [1].

Until recently, music generation methods struggled to capture two rudimentary elements of musical form: long-term structure and repetition. Huang et al. [1] demonstrated that powerful neural network language models, i.e., models which assign likelihoods to sequences of discrete tokens, could be used to generate classical piano music containing these elusive elements. In order to adapt this method to the multi-instrumental setting, we incorporate instrument specification directly into our language-like music representation. However, this strategy alone may be insufficient to generate high-quality multi-instrumental music, as the results of [1] also depend on access to large quantities of piano music.

To begin to address the data availability problem, we focus on an unusually large dataset of multi-instrumental music. The Nintendo Entertainment System Music Database (NES-MDB) [2] contains 46 hours of chiptunes, music written for the four-instrument ensemble of the NES (video game system) sound chip. This dataset is appealing for music generation research not only for its size but also for its structural homogeneity: all of the music is written for a fixed ensemble. It is, however, smaller than the 172 hours of piano music in the MAESTRO Dataset [3] used to train Music Transformer.

The largest available source of symbolic music data is the Lakh MIDI Dataset [4], which contains over 9000 hours of music. This dataset is structurally heterogeneous (different instruments per piece), making it challenging to model directly. However, intuition suggests that we might be able to benefit from the musical knowledge ingrained in this dataset to improve our performance on chiptune generation. Accordingly, we propose a procedure to heuristically map the arbitrary ensembles of music in Lakh MIDI into the four-voice ensemble of the NES. We then pre-train our generative model on this dataset, and fine-tune it on NES-MDB. We find that this strategy improves the quantitative performance of our generative model by 10%. Such transfer learning approaches are common practice in state-of-the-art natural language processing [5, 6], and here we develop new methodology to employ these techniques in the music generation setting (as opposed to analysis [7]).

We refer to the generative model pre-trained on Lakh MIDI and fine-tuned on NES-MDB as LakhNES. In addition to strong quantitative performance, we also conduct multiple user studies indicating that LakhNES produces strong qualitative results. LakhNES is capable of generating chiptunes from scratch, continuing human-composed material, and producing melodic material corresponding to human-specified rhythms.¹

¹ Sound examples: https://chrisdonahue.com/LakhNES
Code/data: https://github.com/chrisdonahue/LakhNES

© Chris Donahue, Huanru Henry Mao, Yiting Ethan Li, Garrison W. Cottrell, Julian McAuley. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Chris Donahue, Huanru Henry Mao, Yiting Ethan Li, Garrison W. Cottrell, Julian McAuley. “LakhNES: Improving multi-instrumental music generation with cross-domain pre-training”, 20th International Society for Music Information Retrieval Conference, Delft, Netherlands, 2019.

2. RELATED WORK

Music generation has been an active area of research for decades. Most early work involved manually encoding musical rules into generative systems or rearranging fragments of human-composed music; see [8] for an extensive overview.
Recent research has favored machine learning systems which automatically extract patterns from corpora of human-composed music.

Many early machine learning-based systems focused on modeling simple monophonic melodies, i.e., music where only one note can be sounding at any given point in time [9–11]. More recently, research has focused on polyphonic generation tasks. Here, most work represents polyphonic music as a piano roll (a sparse binary matrix of time and pitch) and seeks to generate sequences of individual piano roll timesteps [12, 13] or chunks of timesteps [14]. Other work favors an event-based representation of music, where the music is flattened into a list of musically-salient events [1, 15, 16]. None of these methods allow for the generation of multi-instrumental music.

Other research focuses on the multi-instrumental setting

Figure 1. A visual comparison between the piano roll representation of the original NES-MDB paper [2] (top) and the event representation of this work (bottom). In the piano roll representation, the majority of information is the same across timesteps. In our event representation, each timestep encodes a musically-meaningful change.
[Figure 1, bottom panel: listing of event tokens, beginning DT_645, NO_NOTEON_13, DT_732, NO_NOTEOFF, …]
