From Jigs and Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep Recurrent Networks
Eric Hallström, Simon Mossmyr, Bob L. Sturm, Victor Hansjons Vegeborn, Jonas Wedin
KTH Royal Institute of Technology, Speech, Music and Hearing Division
{erichal, mossmyr, bobs, victorhv, [email protected]}

ABSTRACT

The use of recurrent neural networks for modeling and generating music has been shown to be quite effective for compact, textual transcriptions of traditional music from Ireland and the UK. We explore how well these models perform for textual transcriptions of traditional music from Scandinavia. This type of music has characteristics that are similar to and different from those of Irish music, e.g., mode, rhythm, and structure. We investigate the effects of different architectures and training regimens, and evaluate the resulting models using three methods: a comparison of statistics between real and generated transcriptions, an appraisal of generated transcriptions via a semi-structured interview with an expert in Swedish folk music, and an exercise conducted with students of Scandinavian folk music. We find that some of our models can generate new transcriptions sharing characteristics with Scandinavian folk music, but which often lack the simplicity of real transcriptions. One of our models has been implemented online at http://www.folkrnn.org for anyone to try.

1. INTRODUCTION

Recent work [1] applies long short-term memory (LSTM) neural networks [2] to model and generate textual transcriptions of traditional music from Ireland and the UK. The data used in that work consists of over 23,000 tune transcriptions crowd-sourced online (http://thesession.org). Each transcription is expressed using a compact textual notation called ABC (http://abcnotation.com/wiki/abc:standard:v2.1). The resulting transcription models have been used and evaluated in a variety of ways, from creating material for public concerts [3] and a professionally produced album [4], to numerical analyses of the millions of parameters in the network [5, 6], to an accessible online implementation (http://www.folkrnn.org).

The success of machine learning in reproducing idiosyncrasies of Irish traditional music transcriptions comes in large part from the expressive capacity of the LSTM network, the compact data representation designed around ABC notation, and a large amount of training data. Will such a model also perform well given another kind of traditional music expressed in a similarly compact way? What happens when the amount of training data is an order of magnitude less than for the Irish transcription models?

In this paper, we present our work applying deep recurrent modeling methods to Scandinavian folk music. We explore both LSTM and Gated Recurrent Unit (GRU) networks [7], trained with and without dropout [8]. We acquire our data from a crowd-sourced repository of Scandinavian folk music, which gives 4,083 transcriptions expressed as ABC notation. Though this data is expressed the same way as the Irish transcriptions used in [1], there are subtle differences between the styles that require a different approach, e.g., key changes in tunes. This results in a larger vocabulary for the Scandinavian transcription models compared with the Irish ones (224 vs. 137 tokens) [1]. We also explore pretraining with the Irish transcription dataset, with further training using only Scandinavian transcriptions. To evaluate the resulting models, we compare low-level statistics of the generated transcriptions with the training data, conduct a semi-structured interview with an expert on Swedish folk music, and perform an exercise with students of Scandinavian folk music.

We begin by briefly reviewing recurrent neural networks, including LSTM and GRU networks. We then describe the data we use, how we have processed it to create training data, and how we train our models. We then present our evaluation of the models, and discuss the results and our future work.

2. RECURRENT NEURAL NETWORKS
A Recurrent Neural Network (RNN) [9] is a type of artificial neural network that uses directed cycles in its computations, inspired by the cyclical connections between neurons in the brain [10]. These recurrent connections allow the RNN to use its output in a sequence, while the internal states of the network act as memory. We test two different flavors of RNN: Long Short-Term Memory networks (LSTM), and Gated Recurrent Units (GRU). The final layer of these networks is a softmax layer, which produces a conditional probability distribution over a vocabulary given the previous observations. It is from this distribution one samples to generate a sequence.

Copyright: © 2019 Eric Hallström et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

2.1 Long Short-Term Memory (LSTM)

The LSTM is an RNN architecture designed to overcome problems in training conventional RNNs [2]. Each LSTM layer is defined by four "gates" transforming an input x_t at time step t and a previous state h_{t-1} as follows [11]:

    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)    (1)
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)    (2)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)    (3)
    c_t = i_t ⊙ tanh(W_u x_t + U_u h_{t-1} + b_u) + f_t ⊙ c_{t-1}    (4)

where σ() denotes the element-wise logistic sigmoid function, and ⊙ denotes the element-wise multiplication operator. The LSTM layer updates its hidden state by

    h_t = o_t ⊙ tanh(c_t).    (5)

The hidden state of an LSTM layer is the input to the next deeper layer.

2.2 Gated Recurrent Unit (GRU)

A GRU layer is similar to that of the LSTM, but each layer uses only two gates and so is much simpler to compute [7]. Each GRU layer transforms an input x_t and a previous state h_{t-1} as follows:

    r_t = σ(W_r x_t + U_r h_{t-1} + b_r)    (6)
    z_t = σ(W_z x_t + U_z h_{t-1} + b_z).    (7)

The GRU layer updates its state by

    h_t = (1 − z_t) ⊙ tanh(W_h x_t + U_h (r_t ⊙ h_{t-1})) + z_t ⊙ h_{t-1}.    (8)

Compared with the LSTM, each GRU layer has fewer parameters.
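To make the gate equations concrete, the single-step updates (1)–(5) and (6)–(8) can be sketched in plain Python. This is a minimal illustration with toy dimensions, not the multi-layer networks trained in the paper; the helper names (matvec, lstm_step, gru_step) and the tiny weight setup are our own.

```python
import math
import random

def sigmoid(x):
    # Element-wise logistic sigmoid, the sigma() of Eqs. (1)-(3) and (6)-(7)
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    # Matrix-vector product for plain list-of-lists matrices
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(*vs):
    # Element-wise sum of vectors
    return [sum(t) for t in zip(*vs)]

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM time step, Eqs. (1)-(5); W, U, b hold one matrix/bias per gate.
    i_t = [sigmoid(a) for a in vadd(matvec(W["i"], x_t), matvec(U["i"], h_prev), b["i"])]  # (1)
    f_t = [sigmoid(a) for a in vadd(matvec(W["f"], x_t), matvec(U["f"], h_prev), b["f"])]  # (2)
    o_t = [sigmoid(a) for a in vadd(matvec(W["o"], x_t), matvec(U["o"], h_prev), b["o"])]  # (3)
    u_t = [math.tanh(a) for a in vadd(matvec(W["u"], x_t), matvec(U["u"], h_prev), b["u"])]
    c_t = [i * u + f * c for i, u, f, c in zip(i_t, u_t, f_t, c_prev)]  # (4)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # (5)
    return h_t, c_t

def gru_step(x_t, h_prev, W, U, b):
    # One GRU time step, Eqs. (6)-(8); note Eq. (8) has no bias term.
    r_t = [sigmoid(a) for a in vadd(matvec(W["r"], x_t), matvec(U["r"], h_prev), b["r"])]  # (6)
    z_t = [sigmoid(a) for a in vadd(matvec(W["z"], x_t), matvec(U["z"], h_prev), b["z"])]  # (7)
    rh = [r * h for r, h in zip(r_t, h_prev)]
    cand = [math.tanh(a) for a in vadd(matvec(W["h"], x_t), matvec(U["h"], rh))]
    return [(1 - z) * n + z * h for z, n, h in zip(z_t, cand, h_prev)]  # (8)

# Toy shape check (ours, for illustration): input dimension 4, hidden dimension 3
random.seed(0)
W = {g: [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)] for g in "ifourzh"}
U = {g: [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)] for g in "ifourzh"}
b = {g: [0.0] * 3 for g in "ifourz"}
x_t = [random.gauss(0, 1) for _ in range(4)]
h_t, c_t = lstm_step(x_t, [0.0] * 3, [0.0] * 3, W, U, b)
print(len(h_t), len(c_t), len(gru_step(x_t, [0.0] * 3, W, U, b)))  # 3 3 3
```

The sketch also makes the parameter-count difference visible: the LSTM step uses four (W, U, b) triples per layer, the GRU step only two gates plus a candidate transform.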
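The sampling step described in Sec. 2 can likewise be made concrete: a softmax over the network's output scores yields the conditional distribution over the vocabulary, and the next token is one draw from it. The vocabulary, scores, and function name below are our own illustration, not the authors' implementation.

```python
import math
import random

def sample_next_token(scores: dict[str, float], rng: random.Random) -> str:
    # Softmax turns output scores into a conditional probability distribution
    # over the vocabulary; one draw from it picks the next token.
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}  # numerically stable
    total = sum(exps.values())
    r = rng.random() * total
    acc = 0.0
    for tok, e in exps.items():
        acc += e
        if r <= acc:
            return tok
    return tok  # guard against floating-point rounding

# Toy vocabulary and scores (illustrative only, not the paper's vocabulary)
rng = random.Random(1)
scores = {"A": 2.0, "B": 0.5, "|": 0.1, ":|": -1.0}
draws = [sample_next_token(scores, rng) for _ in range(1000)]
print(draws.count("A") > draws.count(":|"))  # True: higher score, drawn more often
```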
3. MODELING SCANDINAVIAN FOLK MUSIC

3.1 Data

FOLKWIKI 4 is a wiki-style site dedicated to Scandinavian folk music that allows users to submit tune transcriptions to a growing database, each expressed using ABC notation. We collect transcriptions from FOLKWIKI by using a web scraper, 5 recursively gathering them using the "key" category. 6 This produces 4,083 unique transcriptions. An example transcription is shown in the following:

%%abc-charset utf-8
X:1
T:Visa
T:ur Svenska Folkmelodier utgivna av C.E. Södling
B:http://www.smus.se/...(Edited by authors)
O:Småland
N:Se även +
M:3/4
L:1/8
R:Visa
Z:Nils L
K:Am
EE A2 cc | ee B2 d2 | cB (Ac) BA | ^G2 E4 ::
w:ung-er-sven med ett hur-tigt mod han i sving-ar sig * u-ti la-get
EE A2 B2 | cd e2 d2 | cB Ac B^G | A2 A4 :|
w:fem-ton al-nar grö-na band det bär han u-ti sin skjort-kra-ge

We process these transcriptions in the following way:

1. Remove all comments and non-musical data
2. If the tune has multiple voices, separate them as if they are individual tunes
3. Parse the head of the tune and keep the length (L:), meter (M:), and key (K:) fields
4. Parse the body of the tune
5. Clean up and substitute a few resulting tokens to keep similarity over the data set (e.g., "K:DMajor" is substituted by "K:DMaj", etc.)

We keep all the following tokens in the tune's body:

• Changes in key (K:), meter (M:), or note length (L:)
• Any note as described in the ABC standard (e.g., e, =a, or any valid note)
• Duplets (2, triplets (3, quadruplets (4, etc.
• Note length (any integer after a note, e.g., =a 4)
• Rest sign (z)
• Bars and repeat bars (:| and |:)
• Grouping of simultaneous notes ([ and ])

After processing, the transcription above appears as:

[L:1/8] [M:3/4] [K:AMin] E E A 2 c c | e e B 2 d 2 | c B A c B A | ^G 2 E 4 :| |: E E A 2 B 2 | c d e 2 d 2 | c B A c B ^G | A 2 A 4 :|

Each symbol separated by a space corresponds to one token in the model vocabulary. Notice that almost all metadata fields are removed, as well as lyrics. Reprise bars such as :: or :|: have been substituted by :| |: to minimize the vocabulary size so the models become less complex. The output produced by our text processing is a file with all transcriptions separated by a newline. We do not keep any transcriptions with fewer than 50 tokens or more than 1000 tokens. We also do not attempt to correct human errors in transcription (e.g., miscounted bars). The resulting dataset is available in a repository. 7 The parser we created to do the above is available at the project repository. 8 The total number of unique tokens in the Folkwiki dataset is 155.

4 http://www.folkwiki.se
5 http://www.scrapy.org
6 http://www.folkwiki.se/Tonarter/Tonarter
7 https://github.com/victorwegeborn/folk-rnn/tree/master/data/9 nov
8 http://www.github.com/ztime/polska

3.2 Pretraining models

The training of deep models typically begins with a random initialization of its weights, but it can also begin with weights found from previous training. In the latter case, one can think of pretraining as first making the network aware of syntactical relationships in the domain in which it is working, and then tuning the network on a subset to specialize it.
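The body-tokenization described in Sec. 3.1 can be illustrated with a simplified tokenizer. This sketch is ours, not the parser released by the authors at the project repository: the regular expression covers only the token classes listed above and ignores many details of the full ABC standard.

```python
import re

# One token per: inline field change [K:...], [M:...], [L:...]; duplet/triplet
# markers such as (2 (3 (4; repeat bars and bar lines; an accidental ^ = _
# fused to the note it modifies; a note letter (or rest z) with optional
# octave marks; a note-length integer; or chord-grouping brackets.
TOKEN_RE = re.compile(
    r"\[[KML]:[^\]]+\]"        # inline key/meter/length change
    r"|\(\d"                   # duplet, triplet, quadruplet, ...
    r"|:\||\|:|\|"             # repeat bars and bar lines
    r"|[\^=_]?[A-Ga-gz][,']*"  # (possibly modified) note, or rest z
    r"|\d+"                    # note length
    r"|[\[\]]"                 # grouping of simultaneous notes
)

def tokenize(body: str) -> list[str]:
    """Split a cleaned ABC tune body into space-separable vocabulary tokens."""
    # Substitute reprise bars :|: and :: by :| |:, as in Sec. 3.1
    body = body.replace(":|:", ":| |:").replace("::", ":| |:")
    return TOKEN_RE.findall(body)

tokens = tokenize("[L:1/8][M:3/4][K:AMin]EEA2cc|eeB2d2|^G2E4::")
print(" ".join(tokens))
# [L:1/8] [M:3/4] [K:AMin] E E A 2 c c | e e B 2 d 2 | ^G 2 E 4 :| |:
```

Note how the reprise-bar substitution happens before matching, so the vocabulary never needs separate :: or :|: tokens, mirroring the vocabulary-minimization step in the text.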
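In pseudocode terms, the pretraining regimen of Sec. 3.2 amounts to reusing trained weights as the starting point for further training. The sketch below is framework-agnostic and entirely ours: the train function only records provenance instead of running gradient descent, the corpora are placeholders, and the shared 224-token vocabulary is our reading of the requirement that pretrained weights line up with the fine-tuning data.

```python
import random

HIDDEN = 8  # toy hidden size for illustration

def init_weights(vocab_size: int, seed: int = 0) -> dict:
    # Random initialization: the usual starting point for training.
    rng = random.Random(seed)
    return {
        "vocab_size": vocab_size,
        "embedding": [[rng.gauss(0, 0.1) for _ in range(HIDDEN)]
                      for _ in range(vocab_size)],
        "history": [],
    }

def train(weights: dict, corpus: list, label: str) -> dict:
    # Stand-in for gradient-based training: a real run would update the
    # parameters from the data; here we only record what was trained on.
    weights["history"] = weights["history"] + [(label, len(corpus))]
    return weights

# A shared token vocabulary is needed so pretrained weights line up with the
# fine-tuning data; the corpora below are placeholders, not the real datasets.
irish = ["M:4/4 K:DMaj A B c d"] * 100        # large pretraining corpus
scandinavian = ["M:3/4 K:AMin E E A 2"] * 10  # smaller target corpus

pretrained = train(init_weights(vocab_size=224), irish, "irish")
finetuned = train(pretrained, scandinavian, "folkwiki")  # weights carried over
print(finetuned["history"])  # [('irish', 100), ('folkwiki', 10)]
```

The key point is the second train call: it starts from the pretrained parameters rather than from init_weights, so the network specializes on the smaller Scandinavian corpus instead of learning ABC syntax from scratch.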