
Lead Sheet Generation with Deep Learning

Stephanie Van Laere

Supervisor: Prof. dr. ir. Bart Dhoedt Counsellors: Ir. Cedric De Boom, Dr. ir. Tim Verbelen

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Information Technology Chair: Prof. dr. ir. Bart Dhoedt Faculty of Engineering and Architecture Academic year 2017-2018

Permission of use

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Stephanie Van Laere Ghent, May 2018

Acknowledgements

When I stumbled on this project, I immediately thought this would be the perfect fit for me. Being a musician and a composer myself, I saw this as the best opportunity to combine two passions of mine: computer science and music. At the final stages of one of the biggest projects of my life, I am overwhelmed by gratefulness to all the people that have helped me through this project.

Firstly, I would like to thank my counselors, dr. ir. Cedric De Boom and dr. ir. Tim Verbelen, for your enormous support. Through the hours of feedback sessions, this master dissertation wouldn't be what it is now without you. I would also like to thank my supervisor, prof. dr. ir. Bart Dhoedt, for giving me the opportunity to work on this project. I truly couldn't have imagined a better subject than this one.

I would like to thank all my friends from BEST, who provided me with great laughter through moments I truly needed it. I would especially like to thank Karim, Margot and Alexander, for guidance and occasional distraction that were invaluable to me. I would also like to thank Esti for our little thesis working moments and our friendship throughout these past six years. Laurence, thank you for the coffee and Skype dates and of course, our 11-year long friendship. Janneken, thank you for every song I ever wrote and for making my love for music grow even stronger. Lars, for helping and supporting me. Finally, I would like to thank everybody who filled in the survey and sent it to their friends. I never expected 177 responses in such a short amount of time and it helped immensely.

I would like to thank my parents for introducing me to the piano and all types of music throughout my youth. You have always supported me through thick and thin, with some advice, a great meal, a hug or a joke. My brother, for your innovative, creative mind and confidence. My sister, for your generous and altruistic heart. My grandparents, whom I've never seen without the biggest smiles. Finally, my great-grandmother, for your inspiration. I will always remember you playing 'La tartine de beurre' on the piano in our living room.

Stephanie Van Laere

Abstract

The Turing test of music, the idea that a computer creates music that becomes indistinguishable from a human-composed music piece, has been fascinating researchers for decades now. Most explorations of music generation have focused on classical music, but some research has also been done regarding modern music, partially about lead sheet music. A lead sheet is a format of music representation that is especially popular in jazz and pop music. The main elements are melody notes, chords and optional lyrics. In the field of lead sheet generation, Markov models have been mostly explored, whilst a lot of the research in the classical field incorporates Recurrent Neural Networks (RNNs). We would like to use these RNNs to generate modern-age lead sheet music by using the Wikifonia dataset. Specifically, we use a model with two components. The first component generates the chord scheme, the backbone of a lead sheet, together with the duration of the melody notes. The second component generates the pitches on this chord scheme. We evaluated this model through a survey with 177 participants.

Keywords: music generation, recurrent neural networks, deep learning, lead sheet generation

Extended Abstract

Lead Sheet Generation with Deep Learning

Stephanie Van Laere

Supervisor(s): Bart Dhoedt, Cedric De Boom, Tim Verbelen

Abstract— The Turing test of music, the idea that a computer creates music that becomes indistinguishable from a human-composed music piece, has been fascinating researchers for decades now. Most explorations of music generation have focused on classical music, but some research has also been done regarding modern music, partially about lead sheet music. A lead sheet is a format of music representation that is especially popular in jazz and pop music. The main elements are melody notes, chords and optional lyrics. In the field of lead sheet generation, Markov models have been mostly explored, whilst a lot of the research in the classical field incorporates Recurrent Neural Networks (RNNs). We would like to use these RNNs to generate modern-age lead sheet music by using the Wikifonia dataset. Specifically, we use a model with two components. The first component generates the chord scheme, the backbone of a lead sheet, together with the duration of the melody notes. The second component generates the pitches on this chord scheme. We evaluated this model through a survey with 177 participants.

Keywords— music generation, recurrent neural networks, deep learning, lead sheet generation

I. INTRODUCTION

What is art? That is a question that can be thoroughly discussed for hours by any art lover. Can the name 'art' only be used when it is made by humans, or is it also art if a human doesn't recognize computer-generated 'art'? We want to focus on the research that makes this question so enticing: improving computer-generated music so that it resembles the work of humans.

Specifically, we want to focus on lead sheet generation by using Recurrent Neural Networks (RNNs). A lead sheet is a format of music representation that is especially popular in jazz and pop music. The main elements are melody notes, chords and optional lyrics. An example can be found in Figure 1.

Figure 1 A (partial) example of a lead sheet.

We divided the problem into two components. The first component generated the chords together with the rhythm or duration of the melody notes. The second component generated the pitches of the melody notes, based on the chords and the rhythm. We evaluated our model through a survey with 177 participants.

II. RELATED WORKS

Music can be represented in two ways: (i) a signal or audio representation, which uses raw audio or waveforms, or (ii) a symbolic representation, which notes music through discrete events, such as notes, chords or rhythms. We will focus on symbolic music generation.

Researchers have been working on music generation problems for decades now. From the works of Bharucha and Todd in 1989 using neural networks (NNs) [1] to working with more complex models such as the convolutional generative adversarial network (GAN) from Yang et al. in 2017 [2], the topic clearly still has a lot left to explore.

Research specifically for lead sheet generation has also been conducted. FlowComposer [3] is part of the FlowMachines project [4] and generates lead sheets in the style of the corpus selected by the user. The user can enter a partial lead sheet, which is then completed by the model. If needed, the model can also generate from scratch. Some meter-constrained Markov chains represent the lead sheets. Pachet et al. have also conducted some other research within the FlowMachines project regarding lead sheets. In [5], they use the Mongeau & Sankoff similarity measure in order to generate melody themes similar to the imposed theme. This is definitely relevant in any type of music, since a lot of melody themes are repeated multiple times during pieces. Ramona, Cabral and Pachet have also created the ReChord Player, which generates musical accompaniments for any song [6]. Even though most research on modern lead sheet generation makes use of Markov models, we want to focus on using Recurrent Neural Networks (RNNs) for lead sheet generation using the Wikifonia dataset¹.

III. WIKIFONIA DATA PREPROCESSING AND AUGMENTATION

The pieces from the Wikifonia dataset were all in MusicXML (MXL) format [7]. To make them easier to process, we transformed the pieces into a Protocol Buffer format², inspired by Google's Magenta [8].

Some preprocessing steps were taken. Polyphonic notes, which means that multiple notes are played at the same time, are deleted until only the highest notes, the melody notes, remain. Anacruses, which are sequences of notes that precede the downbeat of the first measure, are omitted. Ornaments, which are decorative notes that are played rapidly in front of the central note, are also left out.

In order for the model to know which measures need to be played after which measures, repeat signs need to be eliminated. The piece needs to be fed to the model as it would be played. That is why we need to unfold the piece. Figure 2 shows this process.

Figure 2 The process of unfolding for a lead sheet.

In the original dataset there were 42 different rhythm types, including more complex rhythm types that were not frequent. We removed the pieces which held these less occurring rhythm types. Of the 6394 scores, 184 were removed and only 12 rhythm types remained. Since we wanted to model lead sheets, we removed 375 scores which did not have any chords.

There were flats ♭, sharps ♯ and double flats in the chords of the scores. We adapted all the chords' roots and alters to only have twelve options: A, A♯, B, C, C♯, D, D♯, E, F, F♯, G and G♯. The modes of the chords were also reduced from 47 options to four: major, minor, augmented and diminished.

For data augmentation, the scores were transposed into all 12 keys. The reason why this can be beneficial to our model is that it gives the model more data to rely on to make its decisions. It also prevents the model from learning a bias that a subset of the dataset might have towards a specific key.

IV. A NEW WAY OF LEAD SHEET GENERATION

A. Methodology

When approaching the problem of how to generate lead sheets, we wanted to first build the backbone of a lead sheet, the chord scheme, before handling the melody and the lyrics. There were two possibilities of modeling this that we have considered.

The first option is to generate a new chord scheme, just as in the training data. Once a chord scheme is generated, we can then generate the melody on this generated chord scheme. There are two difficulties that rise up when using this as a model. Firstly, many melody notes are usually played on the same chord. Therefore, if we want to generate melody notes on the generated chord scheme, we don't really know how many melody notes we should generate per chord. Should it be a dense or fast piece, or should we only play one note on the chord? The second difficulty is to make sure that the durations of the different melody notes on the same chord sum up to the duration of the chord. If, for example, the model decides that four quarter notes should be played on a half note chord, this is a problem.

In light of these two difficulties, we have opted for option two: combining the duration of the notes with the chords. As can be seen in Figure 3, we repeat the chord for each note where it should be played, so the duration of the chord becomes the duration of the note.

Figure 3 The chord is repeated for each note so the duration of the chord becomes the duration of the note.

To generate the chord scheme, we predict the next chord based on the previous time_steps chords:

P(chord_t | chord_{t-1}, chord_{t-2}, ..., chord_{t-time_steps})    (1)

In turn, we use the entire chord scheme and the time_steps previous melody notes to predict the next note:

P(note_t | note_{t-1}, note_{t-2}, ..., note_{t-time_steps}, chords)    (2)

B. Data representation

For the chord generation, we form two one-hot encoders that we concatenate to form one big vector. The first one represents the chord itself. There are [A, A♯, B, C, C♯, D, D♯, E, F, F♯, G and G♯] x [Maj, Min, Dim, Aug] or 48 possibilities. A measure bar, to represent the end of a measure, can also be added, which is optional. The second one-hot encoder represents the rhythm of the chord and is of size 13. The first element represents a measure bar, which is again optional. The other 12 elements are the 12 different rhythm types. Figure 4 visually represents the chord representation.

Figure 4 Two one-hot vectors representing the chord itself and the duration of the chord, concatenated in the chord generation problem. Measure bar representations are included.

For the melody generation, the one-hot encoded vector representing the pitch of the note is of size 131. The first 128 elements of the vector represent the (potential) pitch of the note. This pitch is determined through the MIDI standard. The 129th element is only set to one if it is a rest (no melody note is played). The 130th element represents a measure bar, which can be included or excluded. Figure 5 clarifies.

Figure 5 Representation of the melody note, where the measure bar is included. The start token kickstarts the generation process.

C. Architecture

The architecture can be found in Figure 6. In this figure, measure bars are included in the sizing. The first component is the chord scheme generation component. This consists of an input layer which will read the chords with their duration. This is followed by a number of LSTM layers, which is set to two in the figure. In this figure, the LSTM size is set to 512, but this can be adapted. The fully connected (FC) layer outputs the chord predictions of the chord generation part of the model.

The chord scheme is used as an input for the bidirectional LSTM encoder in the melody generation part of the model. A bidirectional RNN uses information about both the past and the future to generate an output. When we replace the RNN cell by an LSTM cell, we get a bidirectional LSTM. By using this as an encoder, we get information about the previous, current and future chords. This information can be used together with the previous melody notes when we generate a note, so the prediction looks ahead at how the chord scheme is going to progress and looks back at both the chord scheme and the previous melody. Again, 512 is used as a dimension for the LSTMs but this can be adapted. Multiple layers of LSTMs in both directions are possible.

Figure 6 The architecture of the model, split into a chord scheme generating component and a melody generating component.

D. Loss functions

For the chord generating component, we have two loss functions, one for the chord itself and one for the duration, that we add up. The model outputs the softmax for each of those elements. Afterwards, we perform cross entropy on the targets and the predictions:

L_chord(i) = − Σ_j chord_target_{i,j} · log(chord_pred_{i,j}),  L_rhythm(i) = − Σ_j rhythm_target_{i,j} · log(rhythm_pred_{i,j})    (3)

where index i represents the time step at which we are calculating the loss and index j is the jth element of the vector. The total loss is a combination of the two (α ∈ [0,1]):

L(i) = α · L_chord(i) + (1 − α) · L_rhythm(i)    (4)

The mean is taken over all time steps:

L = (1 / time_steps) · Σ_i L(i)    (5)

The melody generating component has one loss function:

L_note(i) = − Σ_j note_target_{i,j} · log(note_pred_{i,j})    (6)

Again, the mean is taken over all time steps:

L = (1 / time_steps) · Σ_i L_note(i)    (7)

E. Training and generation details

The chord generating component is trained using an input sequence of size time_steps and a target sequence of size time_steps. The subsequent note of each note in the input sequence is the target for that note. These sequences are selected from random scores in the training set each time and also batched together.

The melody generating component is trained by giving a set of chords of size time_steps (chord_1, chord_2, ..., chord_{time_steps}, batched together) to the encoder. The forward and backward bidirectional LSTM outputs are concatenated to form a matrix of size (batch_size, time_steps, 2·LSTM_size). Then, the previous melody notes (note_0, note_1, ..., note_{time_steps−1}) are added to form a matrix of size (batch_size, time_steps, 2·LSTM_size + note_size). This is used as the input for the decoder. The target sequences, also batched together, are the melody notes corresponding to the chords (note_1, note_2, ..., note_{time_steps}). Again, all these sequences are randomly selected from a random score in the training set.

For the generation of the chord scheme, we use an initial seed with a chosen length to kickstart the generation. In this case, we use a seed length of time_steps. A forward pass will predict the next note, which is added each time, shifting out the oldest note. We sample the next chord and its rhythm.

The generation process of the melody generating component runs the entire selected chord scheme through the encoder first. We take the start token, concatenate the corresponding encoder output and put this through the decoder. We sample the next note and add this to the input note sequence, after the start token. Again, we shift out the oldest note. This continues until each chord has a corresponding melody note.

V. EXPERIMENTS AND RESULTS

All files generated for this paper used the default hyperparameter values that are shown in Table 1.

Table 1 The hyperparameter values used for the generation of pieces.

Hyperparameter           Default Value
Time Steps               50
Size of LSTM layers      256
Number of LSTM layers    2
Inclusion measure bar    No

A. Measure bars

We found that, during the generation, measure bars didn't really fall in the places they were expected. Sometimes we had only one quarter note in between measure bars and sometimes 4 quarter notes filled the measure completely. Sometimes the pitch part indicated the measure bar, sometimes the rhythm part, sometimes the chord, and everything in between. We can clearly see that this way of modeling the measure wasn't successful. We therefore dropped the measure bar completely during the notation of the music piece in MIDI and excluded it from the generation. Further research needs to be done on how to model measures efficiently.

B. Survey

A survey was conducted in order to see how our computer-generated pieces compare to real human-composed pieces. This was done by asking the participants a set of questions for each piece in a Google Form. There were 177 participants for this particular online survey.

The pieces were exported to mp3 in MuseScore, in order to eliminate any human nuances that could hint at the real nature of the piece. To make it easier on the ears, some chords were tied together, since the output of the model repeats each chord for each note. Then, for each piece, a fragment of 30 seconds was chosen. In the survey, there were three categories of music pieces: human-composed pieces, completely computer-generated pieces and pieces where the melody was generated on an actual human-composed chord scheme. Three audio fragments were included for each.

We asked the participants about their music experience, where they could choose between experienced musician, amateur musician or listener. For each audio fragment, there were three questions:
1. Do you recognize this audio fragment? (yes or no)
2. How do you like this audio fragment? (scale from 1 to 5)
3. Is this audio fragment made by a computer or a human? (scale from 1 to 5)

We found that, in general, there was a significant correlation between the mean likeability of a piece and the mean of how much the participants thought it was made by a computer or a human. We also concluded that the computer-generated pieces were perceived as at least as human as the human-composed pieces, even outperforming them; this difference was statistically significant. The experienced or amateur musicians didn't outperform the listeners.

C. Subjective evaluation

We also evaluated the pieces ourselves, including the bad examples. In general, we have found that the claim of Waite in [9], that music made by RNNs lacks a sense of direction and becomes boring, is true in our case. Even though the model sometimes produces musically coherent fragments, a lot of the time you couldn't recognize that two fragments chosen from the same piece are in fact from the same piece.

Also, even though our pieces have some good fragments, they usually have at least one portion which doesn't sound musical at all once we start generating longer pieces. This was confirmed by the examples above. This could possibly be fixed with a larger dataset, by training longer or by letting an expert in music filter out these fragments.

VI. CONCLUSION AND FUTURE WORKS

This paper focused on generating lead sheets, specifically melody combined with chords, in the style of the Wikifonia dataset. The model discussed first used a number of LSTM layers and a fully connected layer to generate a chord scheme, which included the chord itself and also the rhythm of the melody note. After that, we used a Bi-LSTM encoder to gain information about the entire chord scheme we want to generate the melody pitches on. This chord scheme information and the information about the previous melody notes was used in the decoder (again a number of LSTM layers and a fully connected layer). This second part generates the melody pitches on the rhythm that was already established in the chord scheme.

The survey results are very positive. The participants couldn't really distinguish the computer-generated pieces from the human-composed pieces. Self-proclaimed expert musicians and amateur musicians didn't perform better than regular music listeners. In general, we found a significant correlation between the mean likeability of a piece and the mean of how much the participants think it is human-composed.

Even though this result is promising, more research needs to be done to determine a sense of direction for the pieces and to eliminate the bad fragments inside pieces.

Three main areas of further research have been discussed. The first one is adding lyrics to the lead sheets. This can be done by working on a syllable level for the training dataset and adding a third component to the melody on chord model, which generates the lyrics on the melody notes. We believe the information of the chords can be left out in this stage, since the melody guides the words, but further research needs to be done. The second area of further research is to add more structure to the lead sheets. Usually a lead sheet consists of some sort of structure, such as verse - chorus - verse - chorus - bridge - chorus. This could be established by adding for example a similarity factor to the models discussed, such as in [10]. Thirdly, ways to represent measures could still be researched further. Perhaps adding a metric to see how many notes we still need to fill up the measure could be an option.

EXAMPLES OF GENERATED MUSIC

The files that were used in the survey can be found at https://tinyurl.com/y8qpvu92. Good and bad examples that are taken from the same file can be found at https://tinyurl.com/y7tmnnah. The MIDI files can have strange notations sometimes, because the MIDI pitches were converted to standard music format by MuseScore.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to Cedric De Boom and Tim Verbelen for guiding me through this project and to Bart Dhoedt for letting me work on this subject.

REFERENCES

[1] J. J. Bharucha and P. M. Todd, "Modeling the perception of tonal structure with neural nets," Computer Music Journal, vol. 13, no. 4, pp. 44–53, 1989.
[2] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A convolutional generative adversarial network for symbolic-domain music generation," in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), Suzhou, China, 2017.
[3] A. Papadopoulos, P. Roy, and F. Pachet, "Assisted lead sheet composition using FlowComposer," in International Conference on Principles and Practice of Constraint Programming, pp. 769–785, Springer, 2016.
[4] "FlowMachines project." [Online]. Available: http://www.flow-machines.com/ [Accessed: 20-May-2018].
[5] F. Pachet, A. Papadopoulos, and P. Roy, "Sampling variations of sequences for structured music generation," in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR 2017), Suzhou, China, pp. 167–173, 2017.
[6] M. Ramona, G. Cabral, and F. Pachet, "Capturing a musician's groove: Generation of realistic accompaniments from single song recordings," in IJCAI, pp. 4140–4142, 2015.
[7] M. Good et al., "MusicXML: An internet-friendly format for sheet music," in XML Conference and Expo, pp. 03–04, 2001.
[8] "Google Magenta: Make music and art using machine learning." [Online]. Available: https://magenta.tensorflow.org/ [Accessed: 3-Dec-2017].
[9] E. Waite, "Generating long-term structure in songs and stories," 2016. [Online]. Available: https://magenta.tensorflow.org/2016/07/15/lookback-rnn-attention-rnn [Accessed: 25-May-2018].
[10] S. Lattner, M. Grachten, and G. Widmer, "Imposing higher-level structure in polyphonic music generation using convolutional restricted Boltzmann machines and constraints," arXiv preprint arXiv:1612.04742, 2016.

¹ http://www.synthzone.com/files/Wikifonia/Wikifonia.zip
² https://developers.google.com/protocol-buffers/docs/proto

Contents

Permission of use I

Acknowledgements II

Abstract III

Extended Abstract IV

Contents VIII

Listing of figures XII

List of Tables XV

List of Abbreviations XVII

List of Symbols XIX

1 Introduction 1

2 Music Theory 4 2.1 Standard music notation ...... 4 2.1.1 Clef ...... 5 2.1.2 Notes and rests ...... 5 2.1.3 Key ...... 8 2.1.4 Meter ...... 9 2.1.5 Chord ...... 9 2.1.6 Measure ...... 10 2.2 Scales ...... 10 2.2.1 Major Scale ...... 10

2.2.2 Minor Scale ...... 11 2.3 Transposition ...... 11 2.4 Circle of Fifths ...... 12 2.5 Structure of a musical piece ...... 12

3 Sequence Generation with Deep Learning 13 3.1 Deep Learning ...... 14 3.2 Sequence Generating Models ...... 20 3.2.1 Recurrent Neural Network (RNN) ...... 21 3.2.2 Long Short-Term Memory (LSTM) ...... 21 3.2.3 Encoder-Decoder ...... 22

4 State of the Art in Music Generation 24 4.1 Music Representation ...... 24 4.1.1 Signal or Audio Representation ...... 24 4.1.2 Symbolic Representations ...... 25 A Musical Instrument Digital Interface (MIDI) Protocol ...... 26 B MusicXML (MXL) ...... 27 C Piano Roll ...... 29 D ABC Notation ...... 30 E Lead Sheet ...... 32 4.2 Music Generation ...... 33 4.2.1 Preprocessing and data augmentation ...... 33 4.2.2 Music Generation ...... 34 4.2.3 Evaluation ...... 41

5 Data Analysis of the Dataset Wikifonia 43 5.1 Internal Representation ...... 43 5.2 Preprocessing ...... 45 5.2.1 Deletion of polyphonic notes ...... 45 5.2.2 Deletion of anacruses ...... 46 5.2.3 Deletion of ornaments ...... 46 5.2.4 Unfold piece ...... 47 5.2.5 Rhythm ...... 47 5.2.6 Chord ...... 49 A Root ...... 49 B Alter ...... 50 C Mode ...... 51

5.3 Data Augmentation ...... 51 5.3.1 Transposition ...... 51 5.4 Histograms of the dataset ...... 53 5.4.1 Chords ...... 53 5.4.2 Melody Notes ...... 54 5.5 Split of the dataset ...... 55

6 A new approach to Lead Sheet Generation 56 6.1 Melody generation ...... 56 6.1.1 Machine Learning Details ...... 58 6.2 Melody on Chord generation ...... 61 A Chords, with note duration ...... 63 B Melody pitches ...... 65 6.2.1 Machine Learning Details ...... 66

7 Implementation and Results 73 7.1 Experiment Details ...... 73 7.2 Melody Generation ...... 74 7.2.1 Training and generation details ...... 74 7.2.2 Subjective comparison and Loss curves of MIDI output files 75 A Temperature ...... 75 B Comparison underfitting, early stopping and overfitting ...... 76 C Time Steps ...... 78 D Inclusion of Measure bars ...... 81 E Data augmentation ...... 82 F LSTM size ...... 83 G Number of LSTM Layers ...... 84 7.2.3 Conclusion ...... 84 7.3 Melody on Chord Generation ...... 85 7.3.1 Training and generation details ...... 85 7.3.2 Subjective comparison of MIDI files ...... 86 A Comparison underfitting, early stopping and overfitting ...... 86 B Examples and intermediate conclusions ...... 88 7.3.3 Survey ...... 89 7.3.4 Conclusion ...... 100

8 Conclusion 101 8.1 Further work ...... 102

Appendix A Chords 104

References 106

Index 115

Listing of figures

2.1 Standard musical notation: Itsy-bitsy spider. The elements are as follows: orange (1) is the clef, red (2) is the key (signature), blue (3) is the time signature or meter, green (4) is the notation for the chords and the purple boxes (5) are the measure bars ...... 5 2.2 Different Clefs [11] ...... 6 2.3 Rhythmic notation of notes. Each line splits the duration in half. 6 2.4 The do-re-mi syntax and ABC syntax of notes [12] ...... 7 2.5 Rhythmic notation of rests. Each line splits the duration in half. . 7 2.6 Rhythm notes with dots ...... 7 2.7 Accidentals of notes. The accidentals in this picture are respectively a flat, a natural, a sharp, a double sharp, a double flat . . . 8 2.8 Octave of C ...... 8 2.9 Repeat signs used in a musical piece. The instructions are depicted in the figure ...... 10 2.10 A repeat sign with first and second ending. The instructions are depicted in the figure ...... 10 2.11 D major scale ...... 11 2.12 D minor scale ...... 11 2.13 Transposition of a C major piece of ‘Twinkle twinkle little star’ into an F major piece [14] ...... 11 2.14 The Circle of Fifths [15] shows the relationship between the different key signatures and their major and minor keys ...... 12

3.1 Graphical representation of a NN ...... 15

3.2 Activation functions ...... 16 3.3 Example of feedforward (a) and recurrent (b) network ...... 18 3.4 Early stopping: stop training when the validation error reaches a minimum. Otherwise, it is overfitting [36]...... 19

3.5 Sequence generation model: input and output ...... 20

3.6 A LSTM memory block ...... 22 3.7 Encoder-Decoder architecture, inspired by [47] ...... 23

4.1 Visual representation of raw audio and its Fourier transformed spectrum [49] ...... 25 4.2 MIDI extract [17] ...... 27 4.3 MusicXML example [54] ...... 28 4.4 Pianola: the self-playing piano ...... 29 4.5 Piano Roll representation ...... 29 4.6 ABC notation: Speed the Plough [59] ...... 30 4.7 ABC notation: four octaves ...... 31 4.8 Lead Sheet Example ...... 32 4.9 Soprano prediction, DeepBach’s neural network architecture . . . 37 4.10 Reinforcement learning with audience feedback [78] ...... 39

5.1 Example of an anacrusis ...... 46 5.2 Example of an ornament ...... 46 5.3 Preprocessing step: unfolding the piece by eliminating the repeat signs ...... 47 5.4 Distribution of roots of chords in dataset without data augmentation 53 5.5 Distribution of modes of chords in dataset without data augmentation 54 5.6 Distribution of the melody pitches ...... 54

6.1 Two one hot encoders concatenated of a note representing the pitch and the rhythm in the melody generation problem. Measure bars are included in this figure ...... 57 6.2 Architecture of the first simple melody generation model with two LSTM layers. Measure bars are included in this specific example . 59 6.3 Training procedure of the first simple melody generation model . 60 6.4 Generation procedure of the first simple melody generation model with seed length equal to 4 ...... 62 6.5 Some Old - John Coltrane: the original and adapted chord structure ...... 64 6.6 Two one-hot encoder vectors representing the chord itself and the duration of the chord concatenated in the chord generation problem. Measure bars representations are included in this figure. . . . 66

6.7 Melody on chord generation: representation of a note where the measure bar is included ...... 67 6.8 Architecture of the melody on chord generation model where measure bars are included ...... 68 6.9 Bidirectional RNN [94] ...... 69 6.10 Melody on chord generation: our Encoder-Decoder architecture . 70

7.1 The conversion of MIDI pitches to standard music notation by MuseScore ...... 77 7.2 Comparison Training and Validation Loss for all default values for the melody model ...... 78 7.3 Training loss curves for different time_steps values for the melody model ...... 80 7.4 Training loss curves: inclusion of the measure bars or not for the melody model ...... 81 7.5 Training loss curves using data augmentation or not ...... 82 7.6 Training loss curves for different LSTM size values for the melody model ...... 83 7.7 Training loss curves: compare number of LSTM layers for the melody model ...... 84 7.8 Comparison Training and Validation Loss for all default values . . 87 7.9 A step in preparing the MIDI files for the survey ...... 90 7.10 Survey responses: correlation between mean likeability and how much the participants of the survey think it is computer generated 95 7.11 Survey responses: number of correct answers out of six per music experience category ...... 98

A.1 Different representations for chords ...... 105

List of Tables

3.1 One-hot encoding example ...... 17

4.1 MIDI velocity table [52] ...... 26 4.2 MIDI ascii [17] ...... 27 4.3 Reinforcement Learning: Miles Davis - Little Blue Frog in Musical Acts by [80] ...... 40

5.1 Original Statistics Dataset Wikifonia ...... 44 5.2 Statistics Dataset Wikifonia after deletion of scores with rare rhythm types ...... 48 5.3 The twelve final rhythm types and their count in the dataset . . . 49 5.4 Statistics Dataset Wikifonia after deletion of scores with rare rhythm types and scores with no chords ...... 50 5.5 Roots of chords and how much they appear in the scores without transposition ...... 50 5.6 Alters of chords and how much they appear in the scores without transposition ...... 51 5.7 Modes of chords, how much they appear and their replacement in the scores without transposition ...... 52

6.1 The options for root, alter, mode and rhythm in a chord . . . . . 64 6.2 Chord generation: snippet of table for the index where the chord one hot vector is one depending on the root, alter and mode of the chord ...... 65

7.1 Default and possible values for parameters in the melody generation model ...... 74 7.2 The longest common subsequent melody sequence for five generated MIDI files using the default hyperparameter values. One time using only pitch, the other using both pitch and duration. . . . . 79

7.3 Default and possible values for parameters in the melody generation component for the melody on chord generation model . . . . 85 7.4 The pieces that were included in the survey, with their category and potential origin ...... 92 7.5 The Longest Common Subsequent Sequence (LCSS) for (partially) generated MIDI files in the survey. One time using only pitch and one time using only chords. A full copy is expected for the ‘only chord’ one for files 3.mid, 6.mid and 8.mid. ...... 93 7.6 The percentage of participants of the survey who responded that they recognized the audio fragment ...... 94 7.7 How much the participants responded they liked or disliked a piece ...... 96 7.8 The participants’ answer to the question if the fragment was computer-generated or human-composed ...... 97 7.9 Survey responses: average number of correct answers (out of six) per music experience category ...... 99

List of Abbreviations

AI Artificial Intelligence

BPTT Backpropagation Through Time

DL Deep Learning

DRL Deep Reinforcement Learning

DNN Deep Neural Network

FC Fully Connected (Layer)

FFNN Feed-forward Neural Network

GAN Generative Adversarial Network

GRU Gated Recurrent Unit

HMM Hidden Markov Model

LCSS Longest Common Subsequent Subsequence

LM Language Modeling

LSTM Long Short-Term Memory

MIDI Musical Instrument Digital Interface

ML Machine Learning

MXL MusicXML

NLG Natural Language Generation

NN Neural Network

RBM Restricted Boltzmann Machine

ReLU Rectified Linear Unit

RL Reinforcement Learning

RNN Recurrent Neural Network

RNN-LM Recurrent Neural Network Language Model

VAE Variational Auto-Encoder

VRASH Variational Recurrent Autoencoder Supported by History

List of Symbols

♭ Flat
♮ Natural
♯ Sharp
𝄪 Double sharp
♭♭ Double flat
Whole note
Half note
Half note dotted
Quarter note
Quarter note dotted
8th note
8th note dotted
16th note
32th note
32th note dotted
Two notes of a triplet linked
One note of a triplet

“Talking about music is like dancing about architecture.”
Unknown

1 Introduction

What is art? That is a question that can be thoroughly discussed for hours by any art lover. Can the name ‘art’ only be used when it is made by humans, or is it also art if a human doesn’t recognize computer generated ‘art’? Any form of art generation that tries to pass the Turing test of art puts a new nuance to this question. No matter what your specific answer to the question ‘What is art?’ is, the research to mimic the human creative mind remains fascinating.

Researchers have been working on music generation problems for decades now. From the works of Bharucha and Todd in 1989 using neural networks (NNs) [1] to working with more complex models such as the convolutional generative adversarial network (GAN) from Yang et al. in 2017 [2], the topic clearly still has a lot left to explore.

A lead sheet is a format of music representation that is especially popular in jazz and pop music. The main elements are melody notes, chords and optional lyrics. Research specifically for lead sheet generation has also been conducted. FlowComposer [3] is part of the FlowMachines project [4] and generates lead sheets in the style of the corpus selected by the user. The user can enter a partial lead sheet, which is then completed by the model. If needed, the model can also generate from scratch. Some meter-constrained Markov chains represent the lead sheets. Pachet et al. have also conducted some other research within the FlowMachines project regarding lead sheets. In [5], they use the Mongeau & Sankoff similarity measure in order to generate similar melody themes to the imposed theme. This is definitely relevant in any type of music, since a lot of melody themes are repeated multiple times during pieces. Ramona, Cabral and Pachet have also created the ReChord Player, which generates musical accompaniments for any song [6]. Even though most research on modern lead sheet generation makes use of Markov models, we want to focus on using Recurrent Neural Networks (RNNs).

There are two main forms of music representation: signal representations, which use raw audio or an audio spectrum, and symbolic representations. An example of the signal representation can be found in WaveNet [7], which is a deep neural network that uses raw audio in order to generate music. This master dissertation focuses on the other form of music representation: symbolic representation. Symbolic representations will note music through discrete events, such as notes, chords or rhythms. Specifically in this master dissertation, we will use a MusicXML (MXL) dataset called Wikifonia. This dataset mostly contains modern jazz/pop music. In this master dissertation, we want to model the lead sheets found in the Wikifonia dataset and generate similar types of music.

All types of sequence generating models have been used in the past to generate music: from Recurrent Neural Networks (RNNs) [8] to even Reinforcement learning algorithms [9]. In this master dissertation, we will discuss two models: a simple melody generating model and a more complex melody on chord generating model. Both will use RNNs, specifically Long Short-Term Memory networks (LSTMs).

The more complex model first generates a chord scheme with the rhythm or duration of the melody notes. This chord scheme will be used to generate the pitch of the melody notes in the second part. We will evaluate the models in three ways: (i) by comparing the training loss curves, (ii) by analyzing the pieces subjectively ourselves and (iii) by making a survey with 9 audio fragments. These 9 fragments had three categories: human-composed pieces, pieces where the chord schemes were human-composed but the melody was generated and fully computer-generated pieces. This survey was filled in by 177 participants.

In Chapter 2, a short introduction to music theory is given. This will give the reader a basic understanding of the musical concepts that will be used in this master dissertation. This is quickly followed by a chapter that discusses deep learning and some sequence generating models that will be used in our model (Chapter 3). Chapter 4 will discuss the state of the art in music representation and music generation. The Wikifonia dataset will be completely analyzed in Chapter 5, as well as all the preprocessing and data augmentation steps that were taken. Afterwards, the models and the results will be discussed in Chapters 6 and 7 respectively. We will end with a conclusion and a discussion on potential future work in Chapter 8.

“Music is a moral law. It gives soul to the universe, wings to the mind, flight to the imagination, and charm and gaiety to life and to everything.”
Plato

2 Music Theory

First, we will introduce the basic elements of music theory that will be used throughout the remainder of this master dissertation. Readers familiar with music theory can skip this chapter and move to Chapter 3. For a more elaborate discussion on music theory, the interested reader can refer to [10].

2.1 Standard music notation

In order to look at the different elements of standard music notation, the song ‘Itsy-Bitsy Spider’ is included in Figure 2.1 as an example. This standard music notation is composed of several elements: a clef, a key, a meter or time signature, notes and rests, chords and measures. These elements will be individually discussed in the following sections.

Figure 2.1: Standard musical notation: Itsy-bitsy spider. The elements are as follows: orange (1) is the clef, red (2) is the key (signature), blue (3) is the time signature or meter, green (4) is the notation for the chords and the purple boxes (5) are the measure bars.

2.1.1 Clef

The first highlighted piece of Figure 2.1 (the orange box or the box indicated with a ‘1’) is the clef of the piece. The clef is a reference in the sense that it indicates how to read the notes. For example, in this figure, a G-clef or Treble clef is used. This means that the first note in this piece is an A. For an F-clef or Bass clef, the same note would be an E. Figure 2.2 depicts some of the possible clefs and their respective interpretation of notes [11].

2.1.2 Notes and rests

Notes are put on a staff, indicated by five parallel lines. Notes have two important elements: pitch and duration or rhythm. The pitch is the frequency the note is played on. A note will sound higher or lower depending on the placement of the note on the staff and the clef (see Section 2.1.1). The rhythm depends on the way

the note is drawn. Figure 2.3 shows the duration of the notes and their names, dividing the duration by two with each line.

Figure 2.2: Different Clefs [11]

Figure 2.3: Rhythmic notation of notes. Each line splits the duration in half.

Notes can also be represented in a textual format. Figure 2.4 shows both the do-re-mi and the ABC syntax of the notes. Do-re-mi is usually used for singing.

Figure 2.4: The do-re-mi syntax and ABC syntax of notes [12]

Of course, notes aren’t the only important element in a musical piece. Rests are also crucial. Rests are recorded when no note is being played at the moment. Figure 2.5 shows the same division as Figure 2.3 but for rests. Dots can be added to both notes and rests to increase the duration of the note or rest by half (see Figure 2.6).

Figure 2.5: Rhythmic notation of rests. Each line splits the duration in half.

Figure 2.6: Rhythm notes with dots

Accidentals of notes are used when the composer wants to alter the pitch of the note. Figure 2.7 shows all of them. From left to right:

• A flat ♭ lowers the pitch by a semitone.

• A natural ♮ puts the flattened or sharpened pitch back to its standard form.

• A sharp ♯ raises the pitch by a semitone.

• A double sharp 𝄪 raises the pitch by two semitones.

• A double flat ♭♭ lowers the pitch by two semitones.

Figure 2.7: Accidentals of notes. The accidentals in this picture are respectively a flat, a natural, a sharp, a double sharp, a double flat

An octave is an interval between two pitches that have the same name (e.g. do or C), but one has double or half the frequency of the other. Figure 2.8 shows an octave of C.

Figure 2.8: Octave of C

2.1.3 Key

The highlighted piece in red of Figure 2.1 (or the one indicated with number ‘2’) is the key or key signature of the piece. The main reason for using a key signature is simplicity. Every musical piece can be written out by using no key signature and writing accidentals before every note, but this is tedious work. If most of the piece uses an F♯, this is indicated at the beginning of the staff, so this doesn’t need to be repeated for every F.

2.1.4 Meter

The blue section of Figure 2.1 (or the one indicated with number ‘3’) is the meter or time signature of the piece. The lower part of the meter represents the duration of the elements in the measure. For example, in Figure 2.1 the 8 represents that the meter consists of eighth notes. The upper part of the meter indicates how many of those elements there are in the measure. For example, in Figure 2.1 the 6 represents that there are 6 eighths in a measure. The most commonly used meter, especially in modern music, is 4/4, which means that there are four quarter notes in a measure.

2.1.5 Chord

The fourth or green part of Figure 2.1 is a representation of a chord. A chord is a group of notes that are played simultaneously. A chord can be represented in two ways: the normal way where every note is written out in standard musical notation or in a textual manner. This section will discuss the second form of chord notation.

This textual form of chords usually consists of a root letter in capitals (e.g. C) and an optional addition to the right of that letter. Some of the most popular additions are the following:

• No addition: This just represents the major triad, so 1-major3-5.

• m: This means that the chord is minor, so 1-minor3-5.

• A number (e.g. 7 or 9): That note (e.g. 7 or 9) is added to the chord.

• ♭ or ♯: This means that the root note is flat or sharp respectively.

Of course, these additions can be combined and there are many more, but these are out of the scope of this dissertation. A figure of common chords and their different representations can be found in Appendix A.
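To make this textual notation concrete, the short sketch below splits a chord symbol into the parts just described. It is purely illustrative and not part of this dissertation's code; the function name and the ASCII '#' and 'b' stand-ins for the sharp and flat signs are assumptions, and a real parser would need to handle many more additions (e.g. 'maj7' or 'dim').

def parse_chord_symbol(symbol):
    """Split a chord symbol such as 'C', 'Cm', 'F#7' or 'Bb' into root, quality and addition."""
    root = symbol[0]                   # the root letter in capital, e.g. 'C'
    rest = symbol[1:]
    if rest[:1] in ('#', 'b'):         # optional accidental directly after the root
        root += rest[0]
        rest = rest[1:]
    quality = 'major'                  # no addition: the major triad (1-major3-5)
    if rest.startswith('m'):           # 'm': the minor triad (1-minor3-5)
        quality = 'minor'
        rest = rest[1:]
    addition = rest or None            # e.g. '7' or '9' adds that note to the chord
    return root, quality, addition

print(parse_chord_symbol('Cm7'))       # ('C', 'minor', '7')
print(parse_chord_symbol('F#'))        # ('F#', 'major', None)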

9 2.1.6 Measure

The fifth and final (purple) highlighted pieces of Figure 2.1 are certain barriers for the piece. The single vertical lines divide the piece into measures. The double vertical lines at the end of the piece indicate the ending of a piece. Other possible barriers and how to play them are depicted in Figures 2.9 and 2.10 [13].

Figure 2.9: Repeat signs used in a musical piece. The instructions are depicted in the figure.

Figure 2.10: A repeat sign with first and second ending. The instructions are depicted in the figure.

2.2 Scales

This section will only discuss the most important scales for this master dissertation: the major and the minor scales.

2.2.1 Major Scale

There are two types of steps one can make in a musical piece: a whole tone (w) or a semitone (s). The major scale consists of seven steps in the following succession: w-w-s-w-w-w-s. An example of a major scale (D major) is given in Figure 2.11.

Figure 2.11: D major scale

2.2.2 Minor Scale

The minor scale has the following succession: w-s-w-w-s-w-w. An example of a minor scale (D minor) is given in Figure 2.12.

Figure 2.12: D minor scale
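As a small illustration of these step patterns (not taken from the dissertation), the sketch below builds a scale by walking the whole-tone/semitone successions given above over the twelve pitch classes. The sharps-only note names are an assumption, matching the twelve chord roots used for the Wikifonia preprocessing later on (A♯ corresponds to B♭, and so on).

NOTE_NAMES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']
STEP_PATTERNS = {'major': 'w-w-s-w-w-w-s', 'minor': 'w-s-w-w-s-w-w'}

def build_scale(root, kind):
    """Return the scale as note names, where 'w' moves two semitones and 's' moves one."""
    index = NOTE_NAMES.index(root)
    scale = [root]
    for step in STEP_PATTERNS[kind].split('-'):
        index = (index + (2 if step == 'w' else 1)) % len(NOTE_NAMES)
        scale.append(NOTE_NAMES[index])
    return scale

print(build_scale('D', 'major'))   # ['D', 'E', 'F#', 'G', 'A', 'B', 'C#', 'D'], cf. Figure 2.11
print(build_scale('D', 'minor'))   # ['D', 'E', 'F', 'G', 'A', 'A#', 'C', 'D'], cf. Figure 2.12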

2.3 Transposition

Transposing a piece basically means that we move the piece up or down by a set interval. In essence, the piece doesn’t change, it is just played in a higher or lower key. The most common reason for transposing a piece is to match the range of the singer. As an example, Figure 2.13 shows the transposition of ‘Twinkle twinkle little star’ from a C major key to an F major key [14].

Figure 2.13: Transposition of a C major piece of ‘Twinkle twinkle little star’ into an F major piece [14]
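The operation itself is simple when notes are represented as MIDI pitches (an illustrative sketch, not the dissertation's code): every pitch is shifted by the same number of semitones, here five semitones up to go from C major to F major.

def transpose(midi_pitches, semitones):
    """Shift every MIDI pitch by the same interval."""
    return [pitch + semitones for pitch in midi_pitches]

# Opening notes of 'Twinkle twinkle little star' in C major (C C G G A A G) as MIDI pitches.
c_major_melody = [60, 60, 67, 67, 69, 69, 67]
f_major_melody = transpose(c_major_melody, 5)   # up a perfect fourth: C major -> F major
print(f_major_melody)                           # [65, 65, 72, 72, 74, 74, 72]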

2.4 Circle of Fifths

Figure 2.14 is called the circle of fifths [15]. It depicts the relationship between different key signatures and their major and minor keys. It is often used for chord progressions. The reason why it is called the circle of fifths is because one goes up a fifth each time one moves to the right in the circle (or one fifth down if one goes counterclockwise).

Figure 2.14: The Circle of Fifths [15] shows the relationship between the different key signa- tures and their major and minor keys.
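Numerically, moving clockwise on the circle simply adds a perfect fifth (seven semitones) modulo an octave (twelve semitones). The loop below is only an illustration with sharps-only note names (F♯ corresponds to G♭ on the circle, and so on) and reproduces the order of the circle starting from C.

NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

index = 0                          # start at C, at the top of the circle
circle_of_fifths = []
for _ in range(12):
    circle_of_fifths.append(NOTE_NAMES[index])
    index = (index + 7) % 12       # up a fifth; counterclockwise would be (index - 7) % 12
print(circle_of_fifths)            # ['C', 'G', 'D', 'A', 'E', 'B', 'F#', 'C#', 'G#', 'D#', 'A#', 'F']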

2.5 Structure of a musical piece

Just as a poem, a musical piece has a structure. The most famous structure in pop music currently is verse-chorus-verse-chorus-bridge-chorus. Even though the names verse, chorus and bridge are often used in popular music nowadays, one can also distinguish a piece differently. One can for example give the first section the letter A, the second one B, and so on. That way one can, similarly to poetry, summarize the structure of a piece by e.g. AABBA.

“I visualize a time when we will be to robots what dogs are to humans, and I’m rooting for the machines.”
Claude Shannon

3 Sequence Generation with Deep Learning

Even though the term Artificial Intelligence (AI) has only been around since the mid-1950s, when John McCarthy coined the term [16], the intrigue of making machines that think on their own has been around at least since ancient Greece [17]. This desire mostly manifested in stories like the bronze warrior Talos in the Argonautica of Apollonius of Rhodes [17, 18].

John McCarthy, often referred to as the ‘father’ of AI, defined AI as ‘the science and engineering of making intelligent machines, especially intelligent computer programs’ [19]. Currently, AI not only has a wide range of topics for research [17], but it is also gaining fast momentum in the corporate industry [20].

Nowadays, the most prominent technique for AI systems is deep learning, which has recently shown state-of-the-art results in the domains of object recognition [21], sequence generation [22], image analysis [23] and many other domains.

Therefore, this chapter will give an overview of what deep learning is and how it works, after which models that can be used for sequence generation will be discussed.

3.1 Deep Learning

Deep Learning (DL) is a subfield of Machine Learning (ML), which has gained popularity in the last decade [24]. There are three types of learning: supervised, unsupervised or partially supervised [25]. Supervised means that the training data includes both the input and the desired result. Unsupervised means it only includes the input, but not the expected results. Partially supervised is somewhere in between. Deep hierarchical models have the goal of forming higher levels of abstraction through multiple layers that combine lower level features [26]. Since Deep Neural Networks (DNNs) are the most prominent form of DL models, these will be focused upon from now on [27]. In the second part of this chapter, other models are also discussed.

Firstly, traditional Neural Networks (NNs) will be discussed. A NN is made up of many interconnected processors called neurons [28]. Figure 3.1 shows the directed graph of a NN with certain weights w_ij on the edges. A more in-depth representation of a node can be found in the red box [29]. The activation function is usually a non-linear function.

Mathematically, the output of each node i can be described as seen in Equation 3.1, where f_i is the activation function of node i, y_i is the output of the node, x_j is the jth input of the node, w_ij is the weight on the edge between the two nodes and θ_i is the bias of the node [30].

y_i = f_i( ∑_{j=1}^{n} w_ij · x_j − θ_i )    (3.1)


Figure 3.1: Graphical representation of a NN
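Equation 3.1 translates almost directly into code. The following NumPy snippet is only an illustration of the formula (the names and example numbers are arbitrary): it computes the output of a single node as its activation function applied to the weighted sum of the inputs minus the bias.

import numpy as np

def node_output(activation, weights_i, inputs, bias_i):
    # y_i = f_i( sum_j w_ij * x_j - theta_i )
    return activation(np.dot(weights_i, inputs) - bias_i)

x = np.array([0.5, -1.0, 2.0])          # inputs x_j
w_i = np.array([0.1, 0.4, -0.2])        # weights w_ij of node i
theta_i = 0.05                          # bias of node i
print(node_output(np.tanh, w_i, x, theta_i))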

Commonly used activation functions are listed below. A more detailed description of each of them, including drawbacks and advantages, can be found in [31].

• Sigmoid function: This function takes a real-valued number and outputs a number between 0 and 1. It can be found in Equation 3.2 and Figure 3.2a.

σ(x) = 1 / (1 + exp(−x))    (3.2)

(a) Sigmoid function  (b) Tanh function  (c) ReLU function

Figure 3.2: Activation functions

• Tanh: This function takes a real-valued number and outputs a number between -1 and 1 (see Figure 3.2b).

• Rectified Linear Unit (ReLU): This function is defined as:

f(x) = max(0, x) (3.3)

The function is displayed in Figure 3.2c. The ReLU is the de-facto standard currently in deep learning [32].

• Leaky ReLU: This function solves the dying ReLU problem (more info can be found in [31]) through a small constant α:

f(x) = αx if x < 0,  f(x) = x if x ≥ 0

• Maxout: This function computes k linear mixes of the input with its weights and biases, after which it selects the maximum. If k = 2, then the equation is as follows:

max(w_1^T x + b_1, w_2^T x + b_2)    (3.4)

• Softmax: This function puts the largest weight on the most likely outcome(s) and normalizes to one. Therefore, this can be interpreted as a probability distribution. This probability distribution will then be used to determine the target class for the given inputs. For a classification, there are two ways to use this function. The first is to select the value with the highest probability. The second is through sampling, where each value gets chosen with its probability. This last method adds variability and non-determinism. This function can also be used for predicting a discrete value where a one-hot encoding is used [17]. One-hot encoding is the process of transforming categorical features into a vector. For example, for colors, if we use regular integer encoding (assign an integer value to each category value), it cannot be said that red > green > blue. There is no ordering between the colors and it could therefore result in bad predictions. That's when one-hot encoding is used (see an example in Table 3.1).

f(x)_i = exp(x_i) / ∑_{j=0}^{k} exp(x_j)    (3.5)
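For reference, the sketch below implements the activation functions listed above with NumPy (maxout is omitted since it also involves its own weights and biases). This is an illustration only; subtracting the maximum inside the softmax is a common numerical-stability trick and does not change the result.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Equation 3.2

def relu(x):
    return np.maximum(0.0, x)                  # Equation 3.3

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract the maximum for numerical stability
    return e / e.sum()                         # Equation 3.5: normalizes to one

z = np.array([2.0, -1.0, 0.5])
print(sigmoid(z), np.tanh(z), relu(z), leaky_relu(z), softmax(z), sep='\n')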

                            Red   Green   Blue   Vector
One-hot encoding of Red      1     0       0     [1,0,0]
One-hot encoding of Green    0     1       0     [0,1,0]
One-hot encoding of Blue     0     0       1     [0,0,1]

Table 3.1: One-hot encoding example
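The same encoding as in Table 3.1 can be produced in a few lines (an illustrative sketch; the category list and function name are arbitrary):

import numpy as np

categories = ['Red', 'Green', 'Blue']

def one_hot(value):
    vector = np.zeros(len(categories), dtype=int)
    vector[categories.index(value)] = 1        # only the index of the category is set to one
    return vector

print(one_hot('Red'))    # [1 0 0]
print(one_hot('Green'))  # [0 1 0]
print(one_hot('Blue'))   # [0 0 1]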

Neural networks can be divided into two classes: feed-forward neural networks (FFNNs) and recurrent neural networks (RNNs) [29]. A feed-forward NN is a network that has no cycles and where all the connections are going in one direction [17]. A recurrent neural network (RNN) does allow cycles or a bidirectional flow of data. Figure 3.3 shows two examples that clarify the difference between the two [33].

17 Figure 3.3: Example of feedforward (a) and recurrent (b) network

Machine Learning algorithms depend heavily on data. Therefore, we will need three disjoint datasets: a training set, a validation set and a test set. The training set is used to train the model, whilst the validation set selects the best performing model on data that was not used during training. The test set is used after the training and validating, and is used to see how well the model generalizes on data that is completely new. These three sets are all derived from one large dataset.

In supervised learning, the model has to know what a good output is. The loss function or cost function will determine the quality of the output. The loss function calculates an error between the labeled data and the output of the model. An ideal model should generate low values for the cost function. This is also important for the test set loss, since this will be an indicator of how well the model generalizes. An example of a cost function in supervised learning is the mean-squared error (see Equation 3.6), which minimizes the difference between the desired output and the one given by the model [34].

MSE = (1/N) ∑_{i=1}^{n} (ŷ_i − y_i)²    (3.6)
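Equation 3.6 in code (a minimal illustration with made-up numbers):

import numpy as np

def mean_squared_error(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)     # Equation 3.6

print(mean_squared_error(np.array([0.9, 0.2, 0.4]),
                         np.array([1.0, 0.0, 0.5])))   # 0.02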

Once the cost function is defined, we need a method to minimize this cost function with respect to the weights. A minimum can be found if the gradient of this loss function with respect to the weights ∇L(w) is zero. This minimum can be found by iteratively going in the direction of the negative gradient. This

algorithm is called Gradient Descent. To calculate this gradient, backpropagation can be used. This algorithm has two steps: a forward step and a backward step. The forward step lets data go through the network to generate a specific output. Then the error can be computed by comparing the desired output with the output of the model. In the backward step, the error is propagated throughout the network so each neuron can see how much it has contributed to the error. After that, the weights can be adjusted to optimize the model [35].
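As a toy illustration of these steps (not code from this dissertation), the sketch below runs gradient descent on the mean-squared error of a one-parameter linear model; the gradient is computed analytically rather than by backpropagating through a full network.

import numpy as np

# Toy data generated from y = 3x with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 * x + np.random.normal(scale=0.1, size=x.shape)

w = 0.0             # the single weight we want to learn
learning_rate = 0.05

for step in range(200):
    y_hat = w * x                       # forward step: compute the model output
    error = y_hat - y
    mse = np.mean(error ** 2)           # cost function, as in equation 3.6
    gradient = np.mean(2 * error * x)   # dMSE/dw
    w -= learning_rate * gradient       # step in the direction of the negative gradient

print(w)  # ends up close to 3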

A model is said to overfit if, through training, it is ‘learning the examples by heart’. If the training error keeps decreasing whilst the validation error increases, the model is overfitting. There are several approaches to avoid overfitting. One is early stopping: we stop training once the validation error reaches a minimum (see Figure 3.4) [36]. The term underfitting is used when the model can neither generalize nor model the training data well; such a model will also perform badly on the training data.

Figure 3.4: Early stopping: stop training when the validation error reaches a minimum. Otherwise, it is overfitting [36].
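A minimal sketch of the early stopping rule, applied to a given list of per-epoch validation losses (a simplification of a real training loop), could look as follows.

def early_stopping_epoch(validation_losses, patience=3):
    """Return the epoch at which training should stop: the point where the
    validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(validation_losses) - 1

# Validation error first decreases, then rises again once overfitting sets in.
losses = [1.00, 0.80, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65, 0.70]
print(early_stopping_epoch(losses))  # stops a few epochs after the minimum at epoch 3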

The number of hidden layers determines whether a NN is deep or not. Traditional NNs usually have 2-3 hidden layers whilst deep ones can have more than a hundred [27].

This was a very short overview of how deep learning works. More information about the specifics mentioned in this section can be found in dedicated literature [17, 29, 37].

3.2 Sequence Generating Models

Sequence generation is necessary for this master dissertation since a melody, or more broadly a music piece, can be seen as a sequence of notes. In our case with lead sheets, this can be seen as a sequence of syllables, notes and chords. Sequence generation essentially attempts to predict the next element(s) given the previous elements of the sequence [38]. Figure 3.5 shows the input and output of such a prediction model [39].

s_1, s_2, ..., s_j  →  [Sequence Prediction Model]  →  s_{j+1}, ..., s_k

Figure 3.5: Sequence generation model: input and output

The next element of the sequence is usually generated by sampling from a probability distribution that shows how likely each element is given the input sequence. This probability distribution is learned during training. Mathematically, for generative RNNs, this can be formulated as follows, where f and g are modeled by the RNN, s_n are the input/output sequences and h_n are the hidden states:

s_{n+1} = f(h_n, s_n)    (3.7)

h_{n+1} = g(h_n, s_n)    (3.8)

In this section, the different networks that were used in this master dissertation to generate sequences are discussed in detail. Other sequence generating models, which will also be mentioned in the state of the art, are not discussed here; they can be found in the literature.

3.2.1 Recurrent Neural Network (RNN)

As briefly discussed before, RNNs are networks with cycles, which means that the outputs of a hidden layer are also used as an additional input to compute the next value. This cycle represents a temporal aspect of the network. The networks can form short term memory elements through these cycles [40].

The training process of a RNN is similar to that of a traditional NN [41]. The backpropagation algorithm is used yet again, but the parameters are shared by every time step in the network. The gradient therefore depends not only on the current time step, but also on past time steps. Therefore, it is called Backpropagation Through Time (BPTT). Usually, the network is trained by giving a sequence as input and predicting the next element in the sequence [42]. Therefore, it can be used for music generation.

3.2.2 Long Short-Term Memory (LSTM)

In RNNs with conventional backpropagation through time (BPTT), a problem arises called the vanishing or exploding gradients problem. This problem occurs when the norm of the gradient makes the long term components either go to zero (vanishing) or to very high values (exploding) [43]. Long Short-Term Memory (LSTM) is designed to resolve this problem [44].

The LSTM contains memory blocks in the recurrent hidden layer that store the temporal state of the network. Figure 3.6 depicts a LSTM memory block, where x_t represents the memory cell input and y_t represents the output [45]. The LSTM has several gates: the input gate, the output gate and the forget gate, which control the flow of writing, reading and forgetting (or erasing) respectively [46].

(Memory block with input x_t and output y_t: the cell c_t is surrounded by the input gate i_t, the output gate o_t and the forget gate f_t.)

Figure 3.6: A LSTM memory block
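For reference, one forward step of such a memory block can be sketched in NumPy as below. The sigmoid/tanh choices and weight shapes follow the common LSTM formulation; they are an illustration, not an excerpt of the implementation used later in this dissertation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U and b hold the parameters of the input (i),
    forget (f) and output (o) gates and the cell candidate (c)."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate: what to write
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate: what to erase
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate: what to read
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    c_t = f_t * c_prev + i_t * c_tilde                       # new cell state
    h_t = o_t * np.tanh(c_t)                                 # output of the memory block
    return h_t, c_t

# Tiny example: input size 4, hidden size 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 4)) for k in "ifoc"}
U = {k: rng.normal(size=(3, 3)) for k in "ifoc"}
b = {k: np.zeros(3) for k in "ifoc"}
h_t, c_t = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, U, b)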

3.2.3 Encoder-Decoder

In this master dissertation, we will also use the encoder-decoder architecture. An encoder translates the input to a feature vector, which holds the most important information from the input. A decoder takes this feature vector and outputs a desired result. One possible result is the input of the model again, so that the input can be reconstructed from just the (smaller) feature vector. Another possibility is to input a sentence in one language and output a translation.

Usually the encoder-decoder architecture is structured as shown in Figure 3.7 [47]. This is a translation example. The first sentence, in Dutch, is encoded by RNN cells. The hidden state from the encoder, which represents the sentence, is then used as input for the decoder, which outputs the same sentence in English. The < start > token is used to kickstart the translation on the decoder side. As can be clearly seen in the figure, the sentence in Dutch has five words, whilst the sentence in English only has three.

Figure 3.7: Encoder-Decoder architecture, inspired by [47]

”Basic research is like shooting an arrow into the air and, where it lands, painting a target.”
Homer Burton Adkins

4 State of the Art in Music Generation

This chapter first discusses different music representations that can be used for machine learning, and then presents the state of the art in music generation.

4.1 Music Representation

Music can be represented in many ways, ranging from MIDI format to audio format. There are two main forms of music representation: signal representations and symbolic representations. In the next sections we will discuss both in more detail.

4.1.1 Signal or Audio Representation

The most basic type of music representation is an audio signal. This can be either raw audio or an audio spectrum, calculated using a Fourier transformation [17].

The significance of using this format as input can not only be found in music generation, but also in music recommendation systems, since software such as Spotify or Apple Music has increased in popularity over the past couple of years [48]. Figure 4.1 shows a visual representation of a raw audio file and its corresponding audio spectrum [49].

Since this master dissertation will focus on symbolic representations, from now on only symbolic representations are mentioned and used. Examples of music generation from an audio format can be found in [50, 51].

Figure 4.1: Visual representation of raw audio and its Fourier transformed spectrum [49]

4.1.2 Symbolic Representations

Most of the research in music generation focuses on symbolic representations of music. Symbolic representations notate music through discrete events, such as notes, chords or rhythms. This section discusses the most popular ways to represent music.

A Musical Instrument Digital Interface (MIDI) Protocol

Musical Instrument Digital Interface (MIDI) Protocol is a music protocol that defines two types of messages: event messages, for information such as pitch, notation and velocity, and control messages, for information such as volume, vibrato or audio panning. The main note messages are:

• NOTE ON This message is sent when a note is being played. Information such as the pitch and the velocity are also included.

• NOTE OFF This message is sent when a note has stopped. The same information can be included in the message.

The note’s pitch is an integer in the interval [0,127], where the number 60 represents middle C on a piano. The note’s velocity is an integer in the interval [1,127]; Table 4.1 shows the different nuances [52]. pppp means that the notes are played very softly and ffff means they are played very loudly.

Music notation   Velocity     Music notation   Velocity
pppp             8            mf               64
ppp              20           f                80
pp               31           ff               96
p                42           fff              112
mp               53           ffff             127

Table 4.1: MIDI velocity table [52]

Other important parameters are the channel number and number of ticks:

• Channel Number is the number of the MIDI channel and is an integer in the interval [0,15]. Channel number 9 is exclusively used for percussion instruments.

• Ticks represents the number of ticks for a quarter note.

Track   Ticks   Message type   Channel Number   Pitch   Velocity
2       96      Note_on_c      0                60      90
2       192     Note_off_c     0                60      0
2       192     Note_on_c      0                62      90
2       288     Note_off_c     0                62      0
2       288     Note_on_c      0                64      90
2       384     Note_off_c     0                64      0
2       384     Note_on_c      0                65      90
2       480     Note_off_c     0                65      0
2       480     Note_on_c      0                62      90
2       576     Note_off_c     0                62      0

Table 4.2: MIDI ascii [17]

Figure 4.2: MIDI extract [17]

Table 4.2 and Figure 4.2 respectively give the ASCII and the visual representation of the same melody [17]. 96 ticks corresponds to a sixteenth note. A drawback of MIDI messages, as stated by Huang and Hu [53], is that it is difficult to see that different notes across multiple tracks are played at the same time. Since this master dissertation only uses the lead sheet representation, this does not form a problem here.

B MusicXML (MXL)

MusicXML (MXL) is an Internet-friendly format for representing music, using XML [54, 55]. It does not replace existing formats, since it is interchangeable with other formats such as MIDI. It is based on two music formats: MuseData [56] and Humdrum [57].

Listing 4.1 and Figure 4.3 show respectively the MusicXML and the visual representation of the same note [54].

Figure 4.3: MusicXML example [54]

<score-partwise>
  <part-list>
    <score-part id="P1">
      <part-name>Music</part-name>
    </score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key><fifths>0</fifths></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
        <clef><sign>G</sign><line>2</line></clef>
      </attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration>
        <type>whole</type>
      </note>
    </measure>
  </part>
</score-partwise>

Listing 4.1: MusicXML example [54]

C Piano Roll

The idea for the piano roll representation came from the pianola or self-playing piano, which uses a perforated paper roll to play a composition without the intervention of an actual pianist (see Figure 4.4) [17]. The length of a perforation represents the duration of the note and the location of the perforation represents the pitch.

Figure 4.4: Pianola: the self-playing piano

An example of a piano roll representation can be found in Figure 4.5.

Figure 4.5: Piano Roll representation

A major drawback of the piano roll representation is that the NOTE OFF event that exists in MIDI has no equivalent [58]. In a piano roll representation, there is no difference between two short notes and one long note.

D ABC Notation

Listing 4.2 and Figure 4.6 give respectively the ABC notation and the visual representation of the same piece [59]. The first five lines are the header of the musical piece. The notes themselves are contained in the following four lines.

X:1 T:Speed the Plough M:4/4 C:Trad. K:G |:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA| GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:| |:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df| g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|

Listing 4.2: ABC notation: Speed the Plough [59]

Figure 4.6: ABC notation: Speed the Plough [59]

The components of the header are as follows:

• X: A reference number that was useful when it was first introduced for selecting specific tunes from a file. Nowadays, software doesn’t need this anymore.

• T: This field represents the title of the piece. In this case, T is ‘Speed the Plough’.

• M: Also known as the meter of the piece. A standard 4/4 is used for this piece.

• C: This field represents the composer of the piece. In this piece, C is ‘Trad’.

• K: The key of the piece is G Major.

Each note is represented by a letter, but the octave and duration depend on the formatting of the letter. Listing 4.3 and Figure 4.7 show four octaves of notes, in ABC notation and standard notation respectively.

C, D, E, F,|G, A, B, C|D E F G|A B c d| e f g a|b c' d' e'|f' g' a' b'|

Listing 4.3: Four octaves in abc notation [59]

Figure 4.7: ABC notation: four octaves

If there is a field L present (e.g. L: 1/16), then this is used as the default length of notes. If it is not specified, an eighth note is assumed. The length of a note can then be adapted through numbers. For example, c2 in case of the default length of an eighth note is a quarter note, since 2 · 1/8 = 1/4. A sixteenth note is therefore represented by c/2.

E Lead Sheet

A lead sheet is a format of music representation that is especially popular in jazz and pop music. Figure 4.8 shows an example of a song in lead sheet format. As can clearly be seen from the example, there are a few elements to a lead sheet:

Figure 4.8: Lead Sheet Example

• Melody: The melody is presented in standard musical format. This is usually the melody that the singer sings.

• Chord: The chords are placed above the melody in a textual format.

• Lyrics: The lyrics are placed under the melody, with each syllable corresponding to a specific note.

• Other: Other information such as the title, composer and performance parameters can also be added to the lead sheet.

4.2 Music Generation

This section is divided into three parts. First, preprocessing and data augmentation ideas are discussed. Afterwards, relevant works concerning actual music generation are discussed. Some possible evaluation methods are touched upon at the end.

4.2.1 Preprocessing and data augmentation

Before the data is used, two (potential) steps can be taken: preprocessing and data augmentation. Not all datasets have the ideal format for machine learning algorithms. The data still needs to be cleaned, transformed or ‘preprocessed’. The data could for example be scaled, or missing data points can be amended. This step is called preprocessing. Data augmentation, on the other hand, increases the number of data points in the dataset. The dataset may be smaller or more limited than we want, since the data was collected under a particular set of conditions. Collecting more data is costly, so we opt for data augmentation instead. In object recognition, for example, an object can be rotated and added to the dataset. In music, the pieces can be transposed. Some preprocessing and data augmentation ideas are discussed in this section.

Preprocessing

For practical reasons, lead sheets are often written in condensed formats, using repeat signs (see Section 2.1.6) and using multiple lines for lyrics. However, for machine learning purposes it is best that they are “unfolded”, so there is no ambiguity in which measure needs to be played after which one [60]. Duplicate or overlapping notes in the same MIDI channel don’t have any purpose, so these can be removed as well [61]. Certain channels can also be removed: channels where only a few (e.g. two) note events occur, or percussion channels which don’t serve any melodic purpose [61]. Transposing to a common tonality, e.g. C major, can also be used to eliminate affinity to certain keys [62].

In some datasets there are data duplicates, e.g. the same piece written by different people. When splitting the data into training, test and validation sets, these statistical dependencies need to be reduced. A possible solution, provided by Walder [61], has three steps (a sketch follows after the list):

1. Compute the normalized histogram of the MIDI pitches of the file. Do the same for the quantized durations. Concatenate the two vectors to form a signature vector for each file.

2. Perform hierarchical clustering on these vectors. Order the files by traversing the resulting hierarchy in a depth first fashion.

3. Given the ordering, split the dataset.
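A sketch of this procedure with SciPy’s hierarchical clustering is given below. It assumes each file has already been reduced to its concatenated pitch/duration histogram (the signature vector of step 1); the exact clustering settings used by Walder [61] are not reproduced here.

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def ordered_split(signatures, test_fraction=0.1):
    """Cluster the signature vectors, order the files by a depth-first
    traversal of the hierarchy, and split according to that ordering."""
    tree = linkage(np.asarray(signatures), method="average")  # step 2: hierarchical clustering
    order = leaves_list(tree)                                  # leaf order of the dendrogram
    n_test = max(1, int(len(order) * test_fraction))
    return order[n_test:], order[:n_test]                      # step 3: split the dataset

# Example with random 20-dimensional signatures for 50 files
rng = np.random.default_rng(0)
train_files, test_files = ordered_split(rng.random((50, 20)))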

Data Augmentation

The data can be augmented by transposing into all possible keys (so within the vocal range) [63]. Any affinity that a composer has for a certain key is eliminated this way [64].

4.2.2 Music Generation

Recurrent Neural Networks have a dominant role in the field of music generation, since RNNs have had successes in many fields concerning generation, for example in text [65], handwriting [66] or image generation [67]. The reason feed-forward networks (FFNN) are less popular for generation is their inability to store past information. In the music generation field, this translates to not knowing the position in the song [68]. An example of a FFNN being used in music generation is Hadjeres and Briot’s MiniBach Chorale Generation System [42]. Even though the result was acceptable, it was less favorable than using more complex RNN networks.

General RNNs lack the ability to model a more global structure. These networks usually predict on a note-by-note basis. Examples of this can be found in [1, 69]. In the context of his note-by-note approach to music generation through RNNs, Mozer stated in [8] that it still lacked thematic structure and that the pieces were not musically coherent. The reason stated was that the “architecture and training procedure do not scale well as the length of the pieces grow and as the amount of higher-order structure increases.” Eck et al. linked this to the problem of vanishing gradients [68] and suggested that structure needs to be induced at multiple levels. They introduced a RNN architecture using LSTMs. They performed two experiments, using a 12-bar blues format: the first one being the generation of chord sequences and the second one being the combined generation of melody and chords. Both melody and chords were presented to the system as one-hot encodings, and they strictly constrained the notes to 13 possible values and the chords to 12 possible chords. Each input vector represents a slice of time, which means that no distinction can be made between two short notes of the same pitch and one long note (see the piano-roll drawback in Section C). Two possible solutions to this piano-roll problem were suggested. The first one is to mark the end of a note in the time slice with a zero. The second one is to model special units in the network that indicate the beginning of a note, inspired by the work of Todd [69]. Even though the generation is deterministic and the constraints on the possible notes and chords are strict, they did capture a global and local structure.

In the field of classical music, we’ll discuss two ways of generating music. The first is a rule-based expert system, and the second is a deep learning solution. An example of a rule-based expert system was introduced by Ebcioglu et al. [70]. Considering that Bach fully integrated thorough bass, counterpoint and harmony in his compositions, a lot of rules can be formulated in order to make music in the style of Bach [71]. Experts formulated more than 300 rules, which generated excellent results; however, these results did not always sound like Bach. The main drawback of this method is the large effort needed from experts to make such a rule-based expert system. Therefore, a deep learning model is usually used for music generation, since it doesn’t require any prior knowledge. The work of Boulanger-Lewandowski et al. [62] uses an RNN-RBM to model polyphonic classical music. The main restriction in this type of generation is that generation can only be performed from left to right. DeepBach, introduced by Hadjeres et al. [63], shows more flexibility without plagiarism. In DeepBach, a MIDI representation is used and a hold symbol “_ _” is used to code whether the previous note is held. To generate steerable and strong results, LSTMs are used in both directions. Figure 4.9 shows the architecture of DeepBach. The first four lines of the input data are the four voices, whereas the other two model metadata. Generation is done through pseudo-Gibbs sampling. The authors of the article state:

“These choices of architecture somehow match real compositional prac- tice on Bach chorales. Indeed, when reharmonizing a given melody, it is often simpler to start from the cadence and write music backwards.”

Compared to BachBot, which made use of a 3-layer stacked LSTM in order to generate new chorales in the style of Bach, the DeepBach model was more general and more flexible [72]. An important remark from the BachBot authors is that the number of layers or the depth truly makes a large difference. Increasing the number of layers can decrease the validation loss by up to 9%.

For more modern music, jazz and pop, some research has also been conducted.

Figure 4.9: Soprano prediction, DeepBach’s neural network architecture

In the world of jazz, Takeshi et al. have made an offline trio synthesizing system [73]. The system uses hidden Markov models (HMMs) and DNNs to generate a bass and drum part for the piano MIDI input. In pop music, Chu et al. were inspired by a Song of Pi video1, where the music is created from the digits of π [74]. Their hierarchical RNN model, using MIDI input and generating melody, chords and drums, even outperformed Google’s Magenta [75] in a subjective evaluation with 27 human participants. There has also been research regarding GANs to generate multi-track pop/rock music [2, 22, 76].

Research specifically for lead sheet generation has also been conducted. FlowComposer [3] is part of the FlowMachines [4] project and generates lead sheets (a melody with chords) in the style of the corpus selected by the user. The user can enter a partial lead sheet, which is then completed by the model. If needed, the model can also generate from scratch. Some meter-constrained Markov chains represent

1https://www.youtube.com/watch?v=OMq9he-5HUU&feature=youtu.be

the lead sheets. Pachet et al. have also conducted some other research within the FlowMachines project regarding lead sheets. In [5], they use the Mongeau & Sankoff similarity measure in order to generate melody themes similar to the imposed theme. This is definitely relevant in any type of music, since a lot of melody themes are repeated multiple times during pieces. Ramona, Cabral and Pachet have also created the ReChord Player, which generates musical accompaniments for any song [6]. All these examples within the FlowMachines project use the FlowMachines Lead Sheet Data Base1, which requires a login to access.

An important limitation is that current techniques learn only during training, but do not adapt while generation happens [42]. A first solution could be to add the generated musical piece to the training data and to retrain. However, this could lead to overfitting and doesn’t necessarily produce better results. Another approach is through reinforcement learning. Murray-Rust et al. [9] have identified three types of feedback rewards that can be used in reinforcement learning for music generation:

1. How well the generated piece satisfies the predefined internal goals. These can be defined by the composer or can be defined during the performance.

2. Direct feedback (like or dislike) from human participants/audience members.

3. The incorporation of ideas the musical agent suggests, which results in in- teractive feedback.

An example of the first reward system was used in [77]. They used some basic rules in order to make the system improvise. Examples of these rules were: (i) using a note from the scale associated with the chord is fine and (ii) a note shouldn’t be longer than a quarter note. An example of the second possibility (direct feedback from audience members) can be found in [78]. Figure 4.10 shows the three main components of the

1http://lsdb.flow-machines.com/

system: the listener, the reinforcement agent and the musical system. Musical tension was used as a reward function, since the authors correlate tension with the rise of emotions in an audience. The biggest problem, as stated by [79], is that the interaction of the audience may disrupt the performance entirely. Therefore they suggested tracking emotions through facial recognition as a less disruptive feedback loop. However, this has not been explored further.

Figure 4.10: Reinforcement learning with audience feedback [78]

An example of the interactive feedback system with other agents can be found in [9]. The interactions between agents are modeled through Musical Acts [80], which were inspired by the Speech Acts used in common agent communication languages. An example of these Musical Acts, as stated in [80] for the song Little Blue Frog by Miles Davis, can be found in Table 4.3.

A very important remark was made by Bretan et al. [64] concerning the similarity between two musical passages. There can be a strong relationship between two chords or two musical passages even without using the same structure harmonically, rhythmically or melodically. Especially in chord prediction, an interesting approach has been suggested in order to model these strong relationships. In ChordRipple, Huang et al. used chords that are played together often in a near vector space [81], inspired by the word2vec model introduced by Mikolov et al. [82]. In this model, the major and minor chords are arranged according to the circle of fifths (see Figure 2.14) [15]. Chords were represented by attributes, such as the root, the chord type, the inversion, bass, extensions and alterations.

Time(s)       Instrument                         Performative                                     Description
2:00 - 2:13   Trumpet                            INFORM                                           A spiky, stabbing phrase, based on scale 2
              Clarinet                           CONFIRM (an agent A informs a different          Briefly seems to agree with the trumpet.
                                                 agent B, and the agent B is happy with this)
              Bass clarinet                      CONFIRM                                          Confirms scale 3
2:13 - 2:29   Trumpet                            DISCONFIRM (same as confirm,                     Ignores bass clarinet, and continues with stabs
                                                 but B is not content)
              Clarinet                           DISCONFIRM                                       Ignores bass clarinet, and continues with
                                                                                                  lyricism in scale 2
2:29 - 2:43   Trumpet, Clarinet, Bass Clarinet   ARGUE (occurs when multiple agents               All play lyrically, with clarinet on scale 2,
                                                 present conflicting ideas)                       trumpet on 1 and Bass Clarinet in 3
2:43 - 3:08   Trumpet                            PROPOSE (an agent proposes a new idea)           Proposes a resolution, by playing stabs which
                                                                                                  fit with any of the scales
3:03 - 3:08   E-Piano, Vibes                     CONFIRM                                          Supports the trumpet’s resolution

Table 4.3: Reinforcement Learning: Miles Davis - Little Blue Frog in Musical Acts by [80]

Similarly to the chord2vec model in ChordRipple, Madjiheurem et al. also introduced a chord2vec model inspired by the word2vec model [83]. In the sequence-to-sequence model they suggested, chords are represented as an ordered sequence of notes. A special symbol ε is added to mark the end of a chord. In order to predict neighboring chords, an RNN Encoder-Decoder is used.

Even though naïve sequence modeling approaches can perform well in music generation, these approaches miss important musical characteristics. For example, a sense of coherence is needed by repeating motifs, phrases or melodies, unless the musical material is restricted or simplified. Lattner et al. [84] have solved this by using a self-similarity matrix to model the repetition structure. Other constraints on the tonality and the meter were also added. The same trade-off between a global structure and locally interesting melodies was noticed by Tikhonov et al. [85]. They suggested a Variational Recurrent Autoencoder Supported by History (VRASH). VRASH combines a LSTM Language Model, which does not capture the global structure, with a Variational Auto-Encoder (VAE). Even though most research that has been conducted uses piano rolls or MIDI formats, some research can also be found using the ABC notation [86, 87, 88].

4.2.3 Evaluation

There are many ways to evaluate a generated musical piece. Possible evaluation ideas are the following:

• In many evaluation methods [62, 83], a baseline is used to compare the new method to. This can be an older method that has some similarity to the one the author wants to evaluate or a simple one, such as a Gaussian density.

• An online Turing test with human evaluators can also be used [63, 72]. Subjects first have to rate their level of expertise, after which they vote “Human” or “Computer” on different pieces.

• Finding objective metrics is quite difficult in the field of music generation. Dong et al. [76] defined some objective metrics that could be used:

1. EB: ratio of empty bars (in %)
2. UPC: number of used pitch classes, which are the classes of notes (e.g. C, C#, ...), per bar (can be 1-12)
3. QN: ratio of qualified notes (in %). If a note has a small duration (less than three time steps, or a 32nd note), it is considered as noise.
4. NIS: ratio of notes in the C scale (in %)
5. TD: tonal distance. The larger TD is, the weaker the harmonicity between a pair of tracks.

Furthermore, they tracked these metrics for the training data and compared those to the generated data.

“Data matures like wine, applications like fish.”
James Governor

5 Data Analysis of the Dataset Wikifonia

The dataset used in this master dissertation is the Wikifonia dataset. Even though it is not available anymore due to copyright issues, a download link was available1. It is a dataset with 6394 MusicXML scores in lead sheet format. Lyrics and chords are included in most scores. Table 5.1 shows some statistics of the dataset.

5.1 Internal Representation

Even though the pieces are in MusicXML format, this format can be hard to work with directly. We had to find a way to store the same information in a more readable and easily adaptable way. Inspired by Google’s Magenta [75], each MusicXML file was parsed to a Protocol Buffer [89]. Protocol Buffers are buffers that structure and encode the data in an efficient way. A part of the “music_representation.proto” file can be found in Listing 5.1. The definitions of TimeSignature, KeySignature, Tempo, Note and Chord are all defined further in the file.

1http://www.synthzone.com/files/Wikifonia/Wikifonia.zip

Type of statistic                             Number
Scores                                        6394
Scores with lyrics                            5220
Syllables in lyrics                           730192
Unique syllables in lyrics                    12922
Scores with multiple simultaneous notes       300
Pieces with more than one repeat              3443
Pieces with meter 4/4                         4408
Pieces with meter 3/4                         816
Pieces with meter 2/2                         749
Pieces with meter 2/4                         212
Pieces with meter 6/8                         147
Pieces with meter 12/8                        34
Pieces with meter 9/8                         8
Pieces with meter 5/4                         7
Pieces with meter 6/4                         6
Pieces with meter 3/8                         3
Pieces with meter 7/4                         2
Pieces with meter 3/2                         1
Pieces with meter 9/4                         1
Different rhythm types of notes in scores     42

Table 5.1: Original Statistics Dataset Wikifonia

message MusicRepresentation {
  // Unique id
  string id = 1;
  // The path of the file
  string filepath = 2;
  // Lacking a time signature, 4/4 is assumed per MIDI standard.
  repeated TimeSignature time_signatures = 4;
  // Lacking a key signature, C Major is assumed per MIDI standard.
  repeated KeySignature key_signatures = 5;
  // Lacking a tempo change, 120 qpm is assumed per MIDI standard.
  repeated Tempo tempos = 6;
  // Notes
  repeated Note notes = 7;
  // Chords
  repeated Chord chords = 8;
  // Division in divisions per quarter note;
  // 1/32 is default, which means 8 divisions per quarter note
  double shortest_note_in_piece = 3;
}

Listing 5.1: Part of the Music Representation Protocol Buffer File

The ‘repeated’ field means that the field can have multiple values (e.g. repeated TimeSignature can hold multiple TimeSignatures). As can be seen in Listing 5.1, each field ends with a specific number called a unique numbered tag. These are tags that Protocol Buffers use to find your field in the binary encoding. Tags one to fifteen only take one byte to encode, so the most frequently used fields should have one of those numbers.
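Assuming the .proto file has been compiled with protoc into a (hypothetical) music_representation_pb2 module, filling in such a message from Python looks roughly like this.

# Hypothetical module generated by: protoc --python_out=. music_representation.proto
from music_representation_pb2 import MusicRepresentation

piece = MusicRepresentation()
piece.id = "wikifonia-0001"            # example values, not taken from the dataset
piece.filepath = "scores/example.mxl"
piece.shortest_note_in_piece = 1 / 32

time_signature = piece.time_signatures.add()  # 'repeated' fields behave like lists
# ... fill in the TimeSignature fields here ...

note = piece.notes.add()
# ... fill in the Note fields (pitch, duration, syllable, ...) here ...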

5.2 Preprocessing

In this section, the different preprocessing methods that were used are discussed. We will begin by talking about deleting certain unnecessary notes from the pieces, after which we will discuss the changes to the structure of the pieces themselves. We will end this section by discussing the different preprocessing steps that were taken in the context of rhythms and chords.

5.2.1 Deletion of polyphonic notes

As this master dissertation uses the lead sheet format, the melody should be monophonic. Monophonic means that only one note plays at the same time. This melody is usually sung by the lead singer or played by a solo performer. As can be seen in Table 5.1, there were 300 scores that had polyphonic fragments in them. These were mostly scores of a more classical nature. In order to make the data consistent, we had to make sure that all melodies were monophonic. Therefore, all polyphonic notes were removed until only the top note, the melody note, remained in the piece.

5.2.2 Deletion of anacruses

An anacrusis is a sequence of notes that precedes the downbeat of the first measure. Figure 5.1 shows an example, where the first measure that can be seen is the anacrusis. These measures were deleted as a preprocessing step.

Figure 5.1: Example of an anacrusis

5.2.3 Deletion of ornaments

Ornaments or embellishments are decorative notes that are played rapidly in front of the central note [90]. Figure 5.2 shows a smaller ornament note before a main quarter note.

Figure 5.2: Example of an ornament

In some pieces, ornament notes were present. In a MusicXML file, these are the notes that have no duration. These notes were removed from the pieces, since they don’t serve any true melodic purpose; they just add variety.

5.2.4 Unfold piece

In order for the model to know which measures need to be played subsequently, repeat signs need to be eliminated. The piece needs to be fed to the model as it would be played. That’s why we need to unfold the piece. An example of such unfolding can be found in Figures 5.3a and 5.3b, which respectively show the original and unfolded piece. As can be seen, the yellow part is added again in between the green and the blue part in order to make sure the piece is given to the model as was intended by the composer.

(a) The original piece

(b) The piece after the algorithm

Figure 5.3: Preprocessing step: unfolding the piece by eliminating the repeat signs

5.2.5 Rhythm

In order to make sure that most pieces have similar formats, we had to analyze the different rhythms of notes in each piece. As can be seen from the original statistics of the dataset (see Table 5.1), there were 42 different rhythm types in the original dataset. Since we want to encode the rhythms as a one-hot vector, and make this as efficient as possible, we want to eliminate the less frequently used rhythm types. We wanted to delete pieces that had rarely occurring rhythm types, in order to make the dataset as uniform as possible. Namely, these rare rhythm types indicated that the piece was more complex than or simply different from others in the dataset. Therefore, these 184 pieces were removed. In the end, only 12 rhythm types remained. The new statistics of the dataset can be found in Table 5.2. Table 5.3 shows the twelve final rhythm types and how many times they occur in the dataset.

Type of statistic                             Number
Scores                                        6209
Scores with lyrics                            5104
Syllables in lyrics                           707968
Unique syllables in lyrics                    12712
Scores with multiple simultaneous notes       280
Pieces with more than one repeat              3365
Pieces with meter 4/4                         4294
Pieces with meter 3/4                         808
Pieces with meter 2/2                         737
Pieces with meter 2/4                         197
Pieces with meter 6/8                         139
Pieces with meter 12/8                        11
Pieces with meter 9/8                         7
Pieces with meter 5/4                         7
Pieces with meter 6/4                         3
Pieces with meter 3/8                         3
Pieces with meter 7/4                         2
Pieces with meter 3/2                         1
Pieces with meter 9/4                         0
Different rhythm types of notes in scores     12

Table 5.2: Statistics Dataset Wikifonia after deletion of scores with rare rhythm types

Note                              Occurrence of rhythm type in the scores
32nd note                         672
32nd note dotted                  1435
16th note                         60745
One note of a triplet             22616
8th note                          389947
Two notes of a triplet linked     14826
8th note dotted                   18646
Quarter note                      301286
Quarter note dotted               34549
Half note                         82011
Half note dotted                  30441
Whole note                        29602

Table 5.3: The twelve final rhythm types and their count in the dataset

5.2.6 Chord

Chords are one of the main elements of lead sheets that we would like to model. Therefore, we first wanted to delete all scores that didn’t have any chords in them. In total, these were 375 scores. The updated statistics table can be found in Table 5.4. A chord has three elements: a root, an accidental or alter, and a mode. We analyze all three in the following subsections.

A Root

There are seven different potential roots for a chord: A, B, C, D, E, F and G. Table 5.5 shows these roots and how often they appear in the scores without transposition.

Type of statistic                             Number
Scores                                        5773
Scores with lyrics                            4868
Syllables in lyrics                           679849
Unique syllables in lyrics                    12221
Scores with multiple simultaneous notes       248
Pieces with more than one repeat              3174
Pieces with meter 4/4                         3985
Pieces with meter 3/4                         758
Pieces with meter 2/2                         721
Pieces with meter 2/4                         150
Pieces with meter 6/8                         132
Pieces with meter 12/8                        11
Pieces with meter 9/8                         6
Pieces with meter 5/4                         6
Pieces with meter 6/4                         2
Pieces with meter 3/8                         1
Pieces with meter 7/4                         0
Pieces with meter 3/2                         1
Pieces with meter 9/4                         0
Different rhythm types of notes in scores     12

Table 5.4: Statistics Dataset Wikifonia after deletion of scores with rare rhythm types and scores with no chords

Root   Number
A      29892
B      29074
C      46931
D      34226
E      29441
F      39304
G      43804

Table 5.5: Roots of chords and how much they appear in the scores without transposition

B Alter

There are four alters that appeared in the scores: ♭♭, ♭, ♮ and ♯. Table 5.6 shows these alters and how much they appear in the scores without transposition.

Alter   Number
♭♭      7
♮       197845
♭       49214
♯       5606

Table 5.6: Alters of chords and how much they appear in the scores without transposition

However, we wanted each chord that has the same harmonic content to have the exact same representation. For example, A♯, B♭ and C♭♭ have the same harmonic structure, yet a different name. Therefore, all chords were adapted to fit one of the following twelve combinations of roots and alters: A, A♯, B, C, C♯, D, D♯, E, F, F♯, G and G♯.

C Mode

There are 47 different modes that appear in the scores. They can be found in Table 5.7 with their count. The last column of the table represents the new mode that we replace the original mode with, so we only end up with four different modes: major, minor, diminished and augmented. The reason these four were chosen is that they represent the structure of the basis of the chord (the first three notes, separated by a third each). More information on this can be found in music theory books.

5.3 Data Augmentation

In this section, the data augmentation of the dataset is discussed. The only data augmentation technique used was transposition.

5.3.1 Transposition

We already explained what transposition means in Section 2. In order to augment the dataset, we can transpose the original pieces in all possible keys. We did this

by incrementing the piece each time with a semitone, until we have reached the octave. This means that the size of our dataset is multiplied by twelve. The reason why this can be beneficial to our model is that it gives the model more data to rely on when making decisions. It also prevents the model from learning a bias that a large subset of the dataset might have towards a specific key.

Mode                 How many   Replace      Mode               How many   Replace
major                88621      Maj          power              217        Maj
dominant             59365      Maj          minor-11th         200        Min
minor                29577      Min          suspended-second   160        Maj
minor-seventh        25292      Min          minor-major        130        Min
major-seventh        7645       Maj          dim                127        Dim
dominant-ninth       5536       Maj          augmented-ninth    87         Maj
major-sixth          5324       Maj          9                  77         Maj
                     4957       Maj          6                  62         Maj
diminished           3850       Dim          major-minor        56         Maj
minor-sixth          2616       Min          min9               46         Min
min                  2440       Min          sus47              40         Maj
7                    2382       Maj          aug                31         Aug
suspended-fourth     1999       Maj          major-13th         24         Maj
half-diminished      1988       Dim          m7b5               23         Dim
diminished-seventh   1651       Dim          maj9               22         Maj
augmented-seventh    1605       Aug          min6               20         Min
augmented            1344       Aug          maj69              18         Maj
dominant-13th        972        Maj          dim7               11         Dim
dominant-seventh     935        Maj          minor-13th         4          Min
maj7                 835        Maj          dim7               2          Dim
minor-ninth          833        Min          /A                 1          Maj
min7                 793        Min          minMaj7            1          Min
major-ninth          510        Maj          min/G              1          Min
dominant-11th        224        Maj

Table 5.7: Modes of chords, how much they appear and their replacement in the scores without transposition

An alternative approach, which wasn’t used in this master dissertation, is to transpose all pieces to a common key (e.g. C major). This has been advocated by Boulanger-Lewandowski et al. as a preprocessing step in order to improve results [62].
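As an illustration, transposing a parsed score into all twelve keys can be written with the music21 toolkit roughly as follows; music21 is used here only to make the idea concrete and is not necessarily the tool used in this dissertation.

from music21 import converter

def all_transpositions(path):
    """Yield the piece in its original key and transposed up by 1 to 11 semitones."""
    score = converter.parse(path)            # parse a MusicXML file
    for semitones in range(12):
        yield score.transpose(semitones)     # 0 semitones gives back the original key

# twelve_versions = list(all_transpositions("scores/example.mxl"))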

5.4 Histograms of the dataset

We wanted to plot the distribution of certain elements of the chords and the notes in order to gain a better understanding of the dataset. This is done in the following two subsections.

5.4.1 Chords

As can be clearly seen in Figure 5.4, C and G are the most common chords, followed by F and D in the dataset that wasn’t augmented. This is expected since these are chords that musicians use more often than other chords.

Figure 5.4: Distribution of roots of chords in dataset without data augmentation

It is also expected that the major and minor modes are used more often than the augmented or diminished modes. That is what we find in Figure 5.5.

53 Figure 5.5: Distribution of modes of chords in dataset without data augmentation

5.4.2 Melody Notes

We expect the pitches to be mostly centered above middle C (MIDI pitch 60). This was found in both datasets, the one that wasn’t augmented as well as the one that was (see Figures 5.6a and 5.6b).

(a) Without data augmentation (b) With data augmentation

Figure 5.6: Distribution of the melody pitches

54 5.5 Split of the dataset

Of course, the dataset was split into a training set and a test set. The training part of the split was data augmented, whilst the test part was not.

”Not until a machine can write a sonnet or compose a concerto because of thoughts and emotions felt, and not by the chance fall of symbols, could we agree that machine equals brain - that is, not only write it but know that it had written it.”
Professor Jefferson Lister

6 A new approach to Lead Sheet Generation

This chapter discusses the methodology details of two models: one simple melody generating model and a lead sheet generating model we’ll call the melody on chord generating model. The second one uses some knowledge obtained from the first model.

6.1 Melody generation

First, we train a baseline model that only generates the melody of the lead sheet. The goal is to construct a model that predicts the next note conditioned on the previous time_steps notes. These previous notes represent the ‘background’ of the music piece. Mathematically, this is written as follows:

p(note_i | note_{i−1}, note_{i−2}, ..., note_{i−time_steps})    (6.1)

A note of a melody has two elements:

1. a rhythm or duration

2. a pitch (or no pitch if it is a rest)

We model a note by concatenating two one-hot encoded vectors, as depicted in Figure 6.1. This representation contains both the pitch and rhythm, as well as additional elements for modeling measure bars.

Figure 6.1: Two one hot encoders concatenated of a note representing the pitch and the rhythm in the melody generation problem. Measure bars are included in this figure.

The first one-hot encoded vector, representing the pitch of the note, is of size 130. The first 128 elements of the vector represent the (potential) pitch of the note. This pitch is determined through the MIDI standard, as was discussed in Section 4.1.2A. The 129th element is only set to one if the note is a rest. The 130th element represents a measure bar (or the end of a measure), which can be included or excluded.

The second one-hot encoder represents the rhythm of the note and is of size 13. The first element represents a measure bar, which is again optional. The other 12 are the rhythm types that were discussed in Section 5.2.5.

The reason why the measure bar is modeled twice is that two loss functions will be applied separately to the two elements of the note. If the measure bar weren’t represented once in each element, we would give the model a zero-vector as target, which results in the model not training properly. The reason why will be elaborated on in Section 6.1.1, after the discussion of the loss functions. Including these measure bars is optional.
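A sketch of how such a note vector could be assembled (with the measure bar entries included, giving 130 + 13 = 143 dimensions) is shown below. The index conventions follow the description above; it is an illustration, not an excerpt of the actual implementation.

import numpy as np

PITCH_SIZE = 130   # 128 MIDI pitches + rest + measure bar
RHYTHM_SIZE = 13   # measure bar + the 12 rhythm types of Section 5.2.5

def note_vector(midi_pitch=None, rhythm_index=None, is_rest=False, is_bar=False):
    pitch = np.zeros(PITCH_SIZE)
    rhythm = np.zeros(RHYTHM_SIZE)
    if is_bar:
        pitch[129] = 1.0          # measure bar, once in the pitch part ...
        rhythm[0] = 1.0           # ... and once in the rhythm part
    else:
        pitch[128 if is_rest else midi_pitch] = 1.0
        rhythm[1 + rhythm_index] = 1.0
    return np.concatenate([pitch, rhythm])

v = note_vector(midi_pitch=60, rhythm_index=7)  # middle C; index 7 is assumed to be a quarter note
print(v.shape)  # (143,)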

6.1.1 Machine Learning Details

As baseline model, we used a stacked LSTM with n LSTM layers (n = 2..4) and a fully connected (FC) output layer. Figure 6.2 shows the architecture. The two recurrent LSTM layers have a dimension of 512 in this figure, but this dimension can be adapted. All different possible values for the hyperparameters will be discussed in Section 7. As stated above, a note has two elements: the pitch and the rhythm or duration. We will have to model two loss functions, one for each element, after which we will combine the two to find the loss function of the model. Both problems are classification problems, and we will use the same loss function for both. Therefore, the model outputs a softmax, which is computed separately for the pitch and the rhythm. After the softmax, we compute a cross entropy between the targets and the predictions, again for the pitch and the rhythm section separately. If y is the target and ŷ is the prediction of the model, we can mathematically write:

L_{CE,pitch}(y_i, ŷ_i) = − Σ_{j=0}^{129} y^{(j)}_{i,0:129} · log(ŷ^{(j)}_{i,0:129})    (6.2)

L_{CE,rhythm}(y_i, ŷ_i) = − Σ_{j=0}^{12} y^{(j)}_{i,130:142} · log(ŷ^{(j)}_{i,130:142})    (6.3)

58 Figure 6.2: Architecture of the first simple melody generation model with two LSTM layers. Measure bars are included in this specific example

where the index i represents the time step at which we are calculating the loss and the index j is the jth element of the vector. The total loss is a combination of the two (α ∈ [0, 1]):

L(y_i, ŷ_i) = α · L_{CE,pitch}(y_i, ŷ_i) + (1 − α) · L_{CE,rhythm}(y_i, ŷ_i)    (6.4)

The mean is taken for all time steps:

L(y, ŷ) = (1 / time_steps) · Σ_{i=0}^{time_steps−1} L(y_i, ŷ_i)    (6.5)

Given the formulas above, we can further explain the reason why the measure bar is modeled in both elements of the note. If a zero-vector target were given (which would be the case if it were only modeled once), the loss of that element would always be zero. This is of course not a favorable outcome.
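A plain NumPy sketch of this combined loss for a single time step, assuming the model already outputs the two softmax distributions concatenated into one vector, is given below.

import numpy as np

def combined_loss(y_true, y_pred, alpha=0.5, pitch_size=130):
    """Weighted sum of the pitch and rhythm cross entropies (equations 6.2-6.4)."""
    eps = 1e-9  # avoid log(0)
    pitch_ce = -np.sum(y_true[:pitch_size] * np.log(y_pred[:pitch_size] + eps))
    rhythm_ce = -np.sum(y_true[pitch_size:] * np.log(y_pred[pitch_size:] + eps))
    return alpha * pitch_ce + (1 - alpha) * rhythm_ce

# Target: middle C with some rhythm type; prediction: two (here uniform) softmax outputs
target = np.zeros(143)
target[60] = 1.0
target[130 + 5] = 1.0
prediction = np.concatenate([np.full(130, 1 / 130), np.full(13, 1 / 13)])
print(combined_loss(target, prediction))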

The model is trained by using two sequences, both of size time_steps, which can be defined in the model. For each note of the first or input sequence, the next note is the target in the second or target sequence. Figure 6.3 explains this in more depth.

Figure 6.3: Training procedure of the first simple melody generation model

Of course, these sequences are batched together in a batch of batch size k for one training iteration [91]. For each element of the batch, the input and target sequence are generated from a randomly selected score in the training dataset.

In order to generate music, we need an initial seed to kickstart the generation process. Usually this seed is taken from the test dataset, but sequences from the train and validation set can also be used. The length of the seed sequence can be chosen, but usually the training sequence length time_steps is used.

60 This seed is given to the model as input. Afterwards the model does a forward pass to predict the next note, inspired by [92], since this way a strange sampling choice doesn’t propagate as much as other generation procedures. This predicted note is added to the seed to renew the process. We could just let the initial sequence grow and keep adding predicted notes to the input, but this will result in an increase of computation time for each sampling step. We will therefore opt for a different solution: we will always shift out the oldest note of the input sequence when adding the latest predicted note. This process can be repeated infinitely if necessary. Figure 6.4 clarifies this paragraph further. Seed length is set to four in this figure.

The next note is predicted through sampling. The model outputs a softmax for both the pitch and the rhythm part of the note. The pitch and the rhythm are then sampled based on these probability distributions.
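The sliding-window sampling loop described above can be sketched as follows; model.predict is a hypothetical function that returns the pitch and rhythm softmax distributions for an input window.

import numpy as np

def generate(model, seed, n_notes, rng=None):
    """Sample notes one by one, appending each to the window and shifting out the oldest note."""
    rng = rng or np.random.default_rng()
    window = list(seed)                                        # seed of length time_steps
    generated = []
    for _ in range(n_notes):
        pitch_probs, rhythm_probs = model.predict(window)      # hypothetical forward pass
        pitch = rng.choice(len(pitch_probs), p=pitch_probs)    # sample instead of taking the argmax
        rhythm = rng.choice(len(rhythm_probs), p=rhythm_probs)
        note = (pitch, rhythm)
        generated.append(note)
        window = window[1:] + [note]                           # slide the window by one note
    return generated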

Once a song of a decent length is reached (which can be specified in the code), we output the music piece to a MIDI format. The tempo of the generated piece is taken from the song from which the seed was generated. If this is not specified, the tempo defaults to 120 bpm, which is the default tempo of MuseScore [93]. It can also be specified how many scores need to be generated and where they need to be stored.

6.2 Melody on Chord generation

Once we were acquainted with the simple melody generation model, we could make a model that generates both the chords and the melody. Especially in jazz, the chord scheme is kind of like the skeleton of the music. The solo artist improvises on the chord scheme of the song. This is usually in between the actual written melody of the song. For example, the first time the chord scheme is played, the singer sings the melody of the song. The second time, they sing the second verse and the chorus. The third time the chord scheme is repeated, the solo artists start to improvise one by one. This could go on for as many chord schemes as they want. They usually end by repeating the initial melody. In pop music, this is less the case, but even then, chords usually have a repeated structure.

Figure 6.4: Generation procedure of the first simple melody generation model with seed length equal to 4

When approaching this problem, we wanted to first build this backbone, before handling the melody and the lyrics. In this master dissertation, there were two possibilities of modeling this that we have considered.

The first option is to generate a new chord scheme, just as in the training data. A chord, similarly to a note in the simple melody generation model in Section 6.1, has a pitch and a duration. This pitch consists of a root letter (A, B, C, D, E, F or G), an alter (e.g. ♯) and a mode (e.g. Major, Minor). Similarly to the simple melody model in Section 6.1, we could generate a chord scheme. Once a chord scheme is generated, we can generate the melody on this generated chord scheme. There are two difficulties that arise when using this as a model. Firstly, many melody notes are usually played on the same chord. Therefore, if we want to generate melody notes on the generated chord scheme, we don’t really know how many melody notes we should generate per chord. Should it be a dense or fast piece, or should we only play one note on the chord? The second difficulty is to make sure that the durations of the different melody notes on the same chord sum up to the duration of the chord. If, for example, the model decides that four quarter notes should be played on a half note chord, this is a problem.

In light of these two difficulties, we have opted for option two: combining the duration of the notes with the chords. As can be seen by Figures 6.5a and 6.5b, we repeat the chord for each note where it should be played, so the duration of the chord becomes the duration of the note. If we generate a chord scheme where the duration of the chord is equal to the duration of the melody note, we know that each chord needs to have exactly one melody note. This solves the two difficulties mentioned above.

With this in mind, we actually divide our problem into two subproblems:

1. The generation of chords, combined with the rhythm of the melody notes.

2. The generation of melody pitches on a provided chord scheme.

In Sections 6.2A and 6.2B, these two problems are further explained.

A Chords, with note duration

As described previously in Section 5.2.6, a chord has a root, an alter, a mode and a rhythm or duration. We limit the options for those four elements, as seen in

63 (a) Original chord scheme

(b) Repeated chord structure. The chord is repeated for each note so the duration of the chord becomes the duration of the note

Figure 6.5: Some Old Blues - John Coltrane: the original and adapted chord structure

Table 6.1.

Root     [A, B, C, D, E, F, G]                       7 elements, no changes
Alter    [♮ or 0, ♯ or 1]                            2 elements, ♭♭ and ♭ changed to one of the two
Mode     [Maj, Min, Dim and Aug]                     4 elements, the 43 other modes were replaced by one of these four
Rhythm   [the twelve rhythm types of Table 5.3]      12 elements, scores with less occurring rhythm types were deleted

Table 6.1: The options for root, alter, mode and rhythm in a chord

Since we match the duration of the chord to the duration of the note in another preprocessing step, we can model the prediction of the next chord similarly to the simple melody generation problem in Section 6.1. The next chord is based on the previous time_steps chords:

p(chord_i | chord_{i−1}, chord_{i−2}, ..., chord_{i−time_steps})    (6.6)

Similarly to Section 6.1, we form two one-hot encoders that we concatenate to form one big vector. The first one represents the chord itself. There are [A, A♯, B, C, C♯, D, D♯, E, F, F♯, G, G♯] × [Maj, Min, Dim, Aug] or 48 possibilities. Table 6.2 clarifies this further. The measure bar can also be included, which results in a one-hot vector of size 49.

Root   Alter   Mode   Index in one-hot vector
A      0       Maj    0
A      1       Maj    1
B      0       Maj    2
C      0       Maj    3
...    ...     ...    ...
A      0       Min    12
A      1       Min    13
...    ...     ...    ...

Table 6.2: Chord generation: snippet of table for the index where the chord one hot vector is one depending on the root, alter and mode of the chord

The second one-hot vector for the rhythm is the same as the second one in the simple generation model in Section 6.1. The first element models the measure bar and the other twelve elements represent the twelve different rhythm types. Again, including the measure bar is optional. Figure 6.6 shows the chord representation visually.
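The index convention of Table 6.2 can be written down explicitly as in the sketch below, assuming the roots (with their alters) are ordered chromatically from A within each mode block and that the Dim and Aug blocks follow Maj and Min in the order of Table 6.1.

ROOTS = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]
MODES = ["Maj", "Min", "Dim", "Aug"]

def chord_index(root, alter, mode):
    """Position of the one in the 48-dimensional chord one-hot vector."""
    name = root + ("#" if alter == 1 else "")
    return MODES.index(mode) * 12 + ROOTS.index(name)

print(chord_index("A", 0, "Maj"))  # 0, as in Table 6.2
print(chord_index("C", 0, "Maj"))  # 3
print(chord_index("A", 1, "Min"))  # 13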

B Melody pitches

Next, we use the generated chord scheme to generate the melody. We predict the current melody note as follows:

p(note_i | note_{i−1}, note_{i−2}, note_{i−3}, ..., note_{i−time_steps}, chords)    (6.7)

Figure 6.6: Two one-hot encoder vectors, representing the chord itself and the duration of the chord, concatenated in the chord generation problem. Measure bar representations are included in this figure.

So we predict the following notes based on time_steps previous notes and all the chords of the music piece, or all the chords of the previously generated chord scheme.

The representation of the note looks similar to the one previously discussed. The only addition is the start token. The use of this will be further explained in Section 6.2.1. Figure 6.7 shows the current representation.

6.2.1 Machine Learning Details

Research on lead sheet generation for modern music has been mostly limited to Markov models. We wanted to use RNNs in this master dissertation to gain a new perspective on lead sheet generation. Figure 6.8 shows the architecture of this model. In this figure, measure bars are included in the sizing. As explained in Section 6.2, we have two big components: a chord scheme generating component and a melody generating component based on this generated chord scheme. These individual components will be further explained in the following paragraphs.

66 Figure 6.7: Melody on chord generation: representation of a note where the measure bar is included

The first component is the chord scheme generating component. This consists of an input layer which will read the chords with their duration. This is followed by a number of LSTM layers, which is set to two in the figure. In this figure, the LSTM size is set to 512, but this can be adapted. The fully connected (FC) layer outputs the chord predictions of the chord generation part of the model.

The chord scheme is used as an input for the bidirectional LSTM encoder in the melody generation part of the model. A bidirectional RNN uses information about both the past and the future to generate an output. Figure 6.9 shows a bidirectional RNN network, where →h and ←h are the hidden states of the forward and backward RNN layer [94]. If the W_{x,y} represent the weight matrices, then the output is computed as follows:

→h_t = sigmoid(W_{x,→h} · x_t + W_{→h,→h} · →h_{t−1} + b_{→h})    (6.8)

←h_t = sigmoid(W_{x,←h} · x_t + W_{←h,←h} · ←h_{t+1} + b_{←h})    (6.9)

67 Figure 6.8: Architecture of the melody on chord generation model where measure bars are included

y_t = W_{→h,y} · →h_t + W_{←h,y} · ←h_t + b_y    (6.10)

When we replace the RNN cell by an LSTM cell, we get a bidirectional LSTM. By using this as an encoder, we get information about the previous, current and future chords. This information can be used together with the previous melody notes when we generate a note, so the prediction looks ahead at how the chord scheme is going to progress and looks back at both the chord scheme and the melody. Again, 512 is used as the dimension for the LSTMs, but this can be adapted. Multiple layers of LSTMs in both directions are possible.

In our architecture, there is exactly one melody note per chord. Therefore we can adapt the regular encoder-decoder architecture (see Section 3.2.3) to Figure 6.10. The encoder is a Bi-LSTM, so there are arrows in both directions, which results in each hidden state having information about the entire chord scheme. Then the

Figure 6.9: Bidirectional RNN [94]

< start > token is concatenated to the first chord, which is given to the decoder as input. This < start > token is used to kickstart the melody generation process. This is inspired by the start token often used in sequence-to-sequence models, such as translation [47]. The following process is similar to the one above. The decoder itself is composed of a number of LSTM layers, followed by a fully connected layer that predicts the note. Therefore we predict the note based on information about the entire chord scheme and the previous notes.
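A compact reconstruction of this encoder-decoder (Figure 6.10) is sketched below in tf.keras: the decoder input at each step is the bidirectional encoder output concatenated with the previous note, with the < start > token standing in for the first previous note. The use of tf.keras, the 132-dimensional note vector, the optimizer and the layer sizes are assumptions for illustration.

```python
import tensorflow as tf

TIME_STEPS, CHORD_SIZE, NOTE_SIZE, LSTM_SIZE = 50, 62, 132, 512

chords = tf.keras.Input(shape=(TIME_STEPS, CHORD_SIZE), name="chord_scheme")
prev_notes = tf.keras.Input(shape=(TIME_STEPS, NOTE_SIZE), name="previous_notes")

# Bi-LSTM encoder over the whole chord scheme (2 * LSTM_SIZE outputs per step).
enc = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(LSTM_SIZE, return_sequences=True))(chords)

# Decoder input: encoder output at step t concatenated with note_{t-1}
# (the first "previous note" being the <start> token).
dec_in = tf.keras.layers.Concatenate(axis=-1)([enc, prev_notes])
dec = tf.keras.layers.LSTM(LSTM_SIZE, return_sequences=True)(dec_in)
dec = tf.keras.layers.LSTM(LSTM_SIZE, return_sequences=True)(dec)
notes_out = tf.keras.layers.Dense(NOTE_SIZE, activation="softmax", name="note")(dec)

melody_model = tf.keras.Model([chords, prev_notes], notes_out)
melody_model.compile(optimizer="adam", loss="categorical_crossentropy")
```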

Again, since there are two components of the model, there are also two loss functions. For the first component, the chord scheme generating component, we have loss functions very similar to those of the melody generation model in Section 6.1. Again, there is a loss term for each of the two parts: the chord itself and the rhythm. The model outputs a softmax for each part, after which we compute the cross entropy between the targets and the predictions.

Figure 6.10: Melody on chord generation: our Encoder-Decoder architecture

$$ L_{CE,\mathrm{chord}}(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i}) = -\sum_{j=0}^{48} y_{\mathrm{chord},i,0:48}^{(j)} \cdot \log\!\left(\hat{y}_{\mathrm{chord},i,0:48}^{(j)}\right) \quad (6.11) $$

$$ L_{CE,\mathrm{rhythm}}(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i}) = -\sum_{j=0}^{12} y_{\mathrm{chord},i,49:61}^{(j)} \cdot \log\!\left(\hat{y}_{\mathrm{chord},i,49:61}^{(j)}\right) \quad (6.12) $$

where i represents the time step at which we are calculating the loss and (j) denotes the jth element of the vector. The total loss is a combination of the two (α ∈ [0, 1]):

$$ L(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i}) = \alpha \cdot L_{CE,\mathrm{chord}}(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i}) + (1-\alpha) \cdot L_{CE,\mathrm{rhythm}}(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i}) \quad (6.13) $$

The mean is taken for all time steps:

$$ L(y_{\mathrm{chord}}, \hat{y}_{\mathrm{chord}}) = \frac{\sum_{i=0}^{\mathrm{time\_steps}-1} L(y_{\mathrm{chord},i}, \hat{y}_{\mathrm{chord},i})}{\mathrm{time\_steps}} \quad (6.14) $$
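A small NumPy rendering of Equations 6.11 to 6.14 is given below, assuming the 62-dimensional chord-plus-rhythm layout described earlier; the eps term is only there for numerical safety and is not part of the original formulas.

```python
import numpy as np

def chord_loss(y, y_hat, alpha=0.5, eps=1e-12):
    """Equations 6.11-6.14: weighted cross entropy over the chord part
    (indices 0-48) and the rhythm part (indices 49-61), averaged over time.
    y and y_hat have shape (time_steps, 62): y one-hot, y_hat softmax output."""
    ce_chord = -np.sum(y[:, 0:49] * np.log(y_hat[:, 0:49] + eps), axis=1)     # (6.11)
    ce_rhythm = -np.sum(y[:, 49:62] * np.log(y_hat[:, 49:62] + eps), axis=1)  # (6.12)
    per_step = alpha * ce_chord + (1 - alpha) * ce_rhythm                     # (6.13)
    return per_step.mean()                                                    # (6.14)
```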

For the melody generating component, the target is actually the note that fits with the chord that was last given as input to the encoder/decoder. Therefore, the loss functions are as follows:

$$ L(y_{\mathrm{melody},i}, \hat{y}_{\mathrm{melody},i}) = -\sum_{j=0}^{131} y_{\mathrm{melody},i,0:131}^{(j)} \cdot \log\!\left(\hat{y}_{\mathrm{melody},i,0:131}^{(j)}\right) \quad (6.15) $$

where the index i represents the time step at which we are calculating the loss and the index j is the jth element of the vector. The mean is taken over all time steps:

$$ L(y_{\mathrm{melody}}, \hat{y}_{\mathrm{melody}}) = \frac{\sum_{i=0}^{\mathrm{time\_steps}-1} L(y_{\mathrm{melody},i}, \hat{y}_{\mathrm{melody},i})}{\mathrm{time\_steps}} \quad (6.16) $$

The chord generating component is trained very similarly to the simple melody model, using an input sequence of size time_steps and a target sequence of size time_steps. Again, the subsequent note of each note in the input sequence is the target for that note. These sequences are selected from random scores in the training set each time and also batched together. The melody generating component is trained by giving a set of chords of size time_steps (chordi, ..., chordi+time_steps, batched together) to the encoder. The forward and backward bi-directional LSTM outputs are concatenated to form a matrix of size (batch_size, time_steps, 2 · LSTM_size). Then, the previous melody notes

(note_{i−1}, ..., note_{i+time_steps−1}) are concatenated with these encoder outputs to form a matrix of size (batch_size, time_steps, 2 · LSTM_size + note_size). This is used as the input for the decoder. The target sequences, also batched together, are the melody notes corresponding to the chords (note_i, ..., note_{i+time_steps}). Again, all these sequences are randomly selected from a random score in the training set. A possible improvement for this

is to actually use the entire chord scheme during training as well. This wasn't tested due to time constraints.
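The batch construction described above can be made concrete with the following NumPy sketch for a single (unbatched) training example; the array names, the separate encoder function and the exact window alignment are assumptions for illustration.

```python
import numpy as np

def build_melody_example(chords, notes, encode_fn, i, time_steps, start_token):
    """Sketch of one training example for the melody component.
    chords: (num_chords, chord_size); notes: (num_chords, note_size), aligned
    one note per chord; encode_fn maps a chord window to Bi-LSTM outputs of
    shape (time_steps, 2 * lstm_size); start_token: (note_size,)."""
    chord_window = chords[i:i + time_steps]                 # encoder input
    enc_out = encode_fn(chord_window)                       # (T, 2 * lstm_size)
    # Previous notes note_{i-1} .. note_{i+time_steps-1}; at i == 0 the
    # <start> token stands in for the non-existent note_{-1}.
    prev = notes[i - 1:i + time_steps - 1] if i > 0 else \
        np.vstack([start_token, notes[:time_steps - 1]])
    decoder_in = np.concatenate([enc_out, prev], axis=-1)   # (T, 2*lstm + note)
    targets = notes[i:i + time_steps]                       # one target per chord
    return decoder_in, targets
```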

The generation procedure of the chord generating component is very similar to that of the simple melody model. We use an initial seed of a chosen length to kickstart the generation; in this case, we use a seed length of time_steps. Each forward pass predicts the next chord and its rhythm, which we sample and append to the input sequence, shifting out the oldest chord. The generation process of the melody generating component first runs the entire chord scheme through the encoder. This chord scheme can be selected by the user. We take the start token, concatenate the corresponding encoder output and put this through the decoder. We sample the next note and add this to the input note sequence, after the start token. Once the input note sequence reaches size time_steps, we keep shifting out the oldest note in order to reduce computation time. This continues until each chord has a corresponding melody note.
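The melody generation loop can be sketched as follows; encoder_fn, decoder_step_fn and sample_fn are stand-ins for the trained model parts, and the window alignment is a simplified assumption rather than the dissertation's exact code.

```python
import numpy as np

def generate_melody(encoder_fn, decoder_step_fn, chord_scheme, start_token,
                    time_steps, sample_fn):
    """Sketch of the melody generation loop described above."""
    enc_out = encoder_fn(chord_scheme)            # (num_chords, 2 * lstm_size)
    notes = [start_token]                         # running window of decoder inputs
    melody = []
    for t in range(len(chord_scheme)):            # one melody note per chord
        window_notes = np.stack(notes[-time_steps:])
        window_enc = enc_out[max(0, t + 1 - len(window_notes)):t + 1]
        decoder_in = np.concatenate([window_enc, window_notes], axis=-1)
        probs = decoder_step_fn(decoder_in)       # distribution over note classes
        note = sample_fn(probs)                   # e.g. temperature sampling
        melody.append(note)
        notes.append(note)
    return melody
```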

“I’m not frightened by the advent of intelligent machines. It’s the sarcastic ones I worry about.”
Quentin R. Bufogle

7 Implementation and Results

7.1 Experiment Details

To implement the models that were described in Section 6, we use the open-source library TensorFlow [95]. More specifically, Python 3.6 was used with TensorFlow 1.4.0. TensorFlow lets the programmer run computations on both CPU and GPU. The number of GPUs or CPUs used can also be specified. TensorBoard, a visualization tool for data, is included in TensorFlow [96] and is also used in this master dissertation to represent the training and validation losses.

Two music tools were used in this master dissertation: MidiUtil and MuseScore. MidiUtil is a Python library that makes it easier to write MIDI files from within a Python program [97]. This was used to write the generated music pieces to MIDI files that will be discussed in the following sections. MuseScore is a free music notation program which we used to visualize the

generated MIDI files created by MidiUtil. The only disadvantage of using this method is that MidiUtil writes MIDI pitches, whilst MuseScore writes music in standard music notation (see Section 2.1). This sometimes resulted in double flats followed by sharps, or other notations that no composer would write down, since it is not musically coherent. However, since we only need the audio of the MIDI file, this is not a problem.

7.2 Melody Generation

This section will discuss the results from the simple melody generation model that was discussed in Section 6.1.

7.2.1 Training and generation details

We compared our results by generating MIDI files while varying the parameters listed in Table 7.1. The default values used for the non-varying hyperparameters are also given. For each of these generations, we set the seed length to 50. We also generate with different temperatures for each of them, selecting from the values [0.1, 0.5, 0.8, 1.0, 2.0], and we generate 5 MIDI files each time. Temperatures are discussed further in Section A. For Equation 6.4, we set α to 0.5, so the pitch loss and the rhythm loss have the same weight in the final loss.

Parameters                     Possible values      Default value
Time steps                     [25, 50, 75, 100]    50
Data augmentation              [Yes, No]            Yes
Size of LSTM layers            [256, 512]           256
How many LSTM layers           [2, 3, 4]            2
Include measure bars or not    [Yes, No]            No

Table 7.1: Default and possible values for parameters in the melody generation model

Training was done on a server using a GPU of type Nvidia GTX 980. We trained each time for a number of batches num_batches that was obtained as follows (a short sketch is given after the list):

74 1. First, we see how many sequences of length time_steps there are in each score

2. We add all these numbers up to find how many sequences of length time_steps there are in all the scores

3. We divide that number by the batch_size
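A literal rendering of this procedure could look like the sketch below; the variable names are hypothetical, and counting overlapping windows of length time_steps per score is an assumption, since the exact counting rule is not spelled out.

```python
def compute_num_batches(scores, time_steps, batch_size):
    """Steps 1-3 above: count length-time_steps sequences per score,
    sum over all scores, then divide by the batch size."""
    sequences_per_score = [max(0, len(score) - time_steps) for score in scores]  # step 1
    total_sequences = sum(sequences_per_score)                                    # step 2
    return total_sequences // batch_size                                          # step 3
```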

Training lasted about 7 hours for each combination of parameters, except for the one where the training dataset wasn’t augmented. Then it only lasted about 35 min. We only had 9 unique combinations of the parameters, because we used defaults that were already present in the values that we were testing. Therefore the whole training time was about:

Training_time = 8 · 7h + 1 · 35min = 56h35min (7.1)

Generation lasts about 1h30min for each combination of parameters, including the temperatures. So the total generation time was about:

Generation_time = 9 · 1h30min = 13h30min (7.2)

7.2.2 Subjective comparison and Loss curves of MIDI output files

In this section, we will compare the different hyperparameter values that were discussed previously for the simple melody generation model. We will discuss their loss curves and also give a subjective listening comparison.

A Temperature

The softmax from Equation 3.5 can be redefined by including a temperature τ as follows:

$$ f(x)_i = \frac{\exp(x_i/\tau)}{\sum_{j=0}^{k} \exp(x_j/\tau)} \quad (7.3) $$

This temperature can tweak the desire for randomness in the generated music piece, whilst still holding on to some certainty in the piece. With lower temperatures (close to zero), the pitch with the highest confidence will almost always be picked. The higher the temperature, the more room exists for freedom and randomness in the piece. Therefore, it can be interesting to see how the music pieces evolve when the temperature is changed.
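A minimal sketch of temperature sampling, which is how the examples below vary with τ, is shown here; the max-subtraction is a standard numerical-stability trick and not part of Equation 7.3.

```python
import numpy as np

def sample_with_temperature(logits, tau=1.0, rng=np.random.default_rng()):
    """Equation 7.3: softmax with temperature tau, then sample one class.
    Low tau gives near-greedy choices; high tau gives more random choices."""
    scaled = np.asarray(logits, dtype=np.float64) / tau
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.0, 0.1])
print(sample_with_temperature(logits, tau=0.1))   # almost always index 0
print(sample_with_temperature(logits, tau=2.0))   # much more random
```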

These statements can also be seen from the results that were generated. The examples below use the default values of the parameters listed in Table 7.1, except that the temperature was changed. These were also generated when the validation loss value was at its minimum. The recordings for each temperature can be found on https://tinyurl.com/ycgqmzs9 with filenames temp_«insert the temperature here».mp3. The seed size was set to the same value as the time steps (50), so the actual generation only starts after about 30 seconds. The same seed was used for each of the examples, so the comparison was easier to make. The seed was taken from one of the files that were not included in the training set. It can be clearly seen from e.g. temp_0.1.mp3 that it only generates a melody note G, whilst temp_5.mp3 generates very random output. Temperatures 0.9 or 1 are preferred. From now on, we will use temperature 1 to generate the examples.

The MIDI files were also included with filenames temp_«insert the temperature here».mid. If you open these with MuseScore, it can occur that a strange notation is used (see Figure 7.1). Flats suddenly followed by sharps are not easily readable for composers or musicians. MuseScore just takes the MIDI pitch of the note and converts it to a notation (flat, natural or sharp) it chooses. The output should be cleaned by a professional musician in order to make it readable for musicians.

B Comparison underfitting, early stopping and overfitting

To find the best results, we have to know when we should stop training the model. It is said that for a lot of classification problems the lowest validation loss gives a good stopping point, so the model can generalize better. But is it truly useful to stop

at the lowest validation loss in this case? For example, if three notes are a possible logical successor to the sequence of notes we already have, it doesn't make a lot of sense to generalize which one of the three to choose. A lower training loss models the training set better, but a low validation loss doesn't necessarily say that we predict music better in general. Of course, we have to keep in mind that the overfit model doesn't copy and paste examples from the training set, so further research is required.

Figure 7.1: The conversion of MIDI pitches to standard music notation by MuseScore

Figure 7.2 shows the training and validation loss curves for the model trained with all the default hyperparameter values from Table 7.1. The model was trained for about 151500 batches (num_batches calculated as described in Section 7.2.1). We can see that the lowest validation loss is already achieved quite early, at batch number 8900.

The listening examples can be found in https://tinyurl.com/ycp6sxza. Again the first 50 notes are the seed, which is about 30 seconds or so. After that, the generated pieces start. The underfitting example (filename underfitting.mp3 or underfitting.mid) is very random and is not coherent at all. The early stopping (filename early_stopping.mp3 or early_stopping.mid) can still be very random but has some better parts. The overfitting example (filename overfitting.mp3 or overfitting.mid) is the best example of all three.

To know for sure that training for num_batches doesn't just copy examples from the dataset, we want to find the longest common subsequent melody sequence between the training files and the generated file. This was checked for five generated examples using the default hyperparameter values. It was checked in two ways:

once only including the pitch of the melody and not considering the rhythm, and once including both the pitch and rhythm of the melody notes. The rests were left out. The files can be found in https://tinyurl.com/y9f8yptc and Table 7.2 shows the results.

Figure 7.2: Comparison Training and Validation Loss for all default values for the melody model

Even though there are short sequences very similar to ones from the training dataset, we’ve found the training data set isn’t copied completely. However, there should be a more thorough examination about which fragments it takes from which scores in order to guarantee our conclusion. From now on, we will always select overfit examples for this model.
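The copying check described above boils down to a longest-common-contiguous-subsequence computation between two note sequences; a simple dynamic-programming sketch is given below, assuming notes are compared either as plain pitches or as (pitch, duration) tuples.

```python
def longest_common_subsequent_sequence(generated, original):
    """Length of the longest common contiguous run shared by two sequences,
    used here to check how much of a generated melody is copied verbatim."""
    best = 0
    # prev[j] = length of the common run ending at the previous generated
    # element and original[j]
    prev = [0] * (len(original) + 1)
    for g in generated:
        curr = [0] * (len(original) + 1)
        for j, o in enumerate(original, start=1):
            if g == o:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

# Example with pitches only: the longest shared run below has length 3.
print(longest_common_subsequent_sequence([60, 62, 64, 65, 67], [50, 62, 64, 65, 70]))
```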

C Time Steps

In this section, we compare 4 different values (25, 50, 75 and 100) for time_steps, or the size of the history we want to incorporate in the prediction. All the other values were set to their default value (see Table 7.1). Figure 7.3 shows the comparison of training loss curves for these different values. It can be seen that if we set time_steps to 25, the loss curve is higher and probably hasn't converged yet. If we would take the training loss as a parameter that should be as low as possible (without actually copying the training data), then we would think 75 or 100 is the best value.

Method              Generated file   Original training file                                        LCSS   Ratio LCSS/length generated score
Only pitch          0.mid            David & Max Sapp - There is a River                           13     13/1020 = 0.01274
                    1.mid            Jimmy McHugh - Let's Get Lost                                 48     48/963 = 0.0498
                    2.mid            Charles Fox, Norman Gimbel - Ready To Take A Chance Again     13     13/938 = 0.0138
                    3.mid            Marc Bolan - Children Of The Revolution                       14     14/995 = 0.01407
                    4.mid            Hurricane Smith - Oh Babe                                     14     14/1005 = 0.01393
Pitch and duration  0.mid            Irving Berlin - Count Your Blessings                          10     10/1020 = 0.0098
                    1.mid            Jimmy McHugh - Let's Get Lost                                 35     35/963 = 0.0363
                    2.mid            Robert Freeman - Do You Want To Dance                         12     12/938 = 0.0128
                    3.mid            Sergio Eulogio Gonzalez Siaba - El Cuarto de Tula             8      8/995 = 0.00804
                    4.mid            Enrique Francini - Azabache                                   12     12/1005 = 0.01194

Table 7.2: The longest common subsequent melody sequence (LCSS) for five generated MIDI files using the default hyperparameter values. One time using only pitch, the other using both pitch and duration.

Figure 7.3: Training loss curves for different time_steps values for the melody model

Now we will discuss the different MIDI examples in order to confirm or reject the claim that 75 or 100 is the best value. The files can be found in https://tinyurl.com/ya8rxul3 with filenames time_steps_«insert time_steps value here».mp3 or time_steps_«insert time_steps value here».mid. Again, the first 50 notes are the same in each example, since this is the seed. These examples were all trained for 150000 batches, because num_batches would be a different value for each possible time_steps value and the audio files wouldn't be comparable otherwise.

In general, all music pieces are pretty long (about 1000 notes were generated each time) and can have good moments and less coherent moments. The less coherent moments have strange melody and rhythm transitions. For example, bar 34-38 in time_steps_25.mid or bar 267-271 in time_steps_100.mid. But these sorts of examples can be found throughout a lot of examples in our corpus.

The example for time_steps equal to 25 has a lot of incoherent pieces and a lot of random high or low notes. The one for time_steps 100 is the most random, has some good moments, but is in general coherent. The best examples are for time_steps 50 and 75, where 75 is slightly preferred. The lowest training loss is therefore not the perfect indicator of what will result in the best music.

D Inclusion of Measure bars

Figure 7.4 shows the loss curves for the two cases: inclusion of the measure bars or not. We can see these curves are quite similar, so no concrete conclusion can be drawn from this figure.

Figure 7.4: Training loss curves: inclusion of the measure bars or not for the melody model

In the end, we've found that, during the generation, measure bars didn't really fall in the places they were expected. Sometimes we had only one quarter note in between measure bars and sometimes 4 quarter notes filled the measure completely. Sometimes, the pitch part indicated the measure bar, sometimes the rhythm part and sometimes both. We can clearly see that this way of modeling the measure wasn't successful. We therefore dropped the measure bar completely

during the notation of the music piece in MIDI and excluded them for generation. Further research needs to be done on how to model measures efficiently.

E Data augmentation

In a lot of research papers, as was seen in Section 4.2, the training data is augmented in order to expand the music pieces to all possible keys. The same was done for this master dissertation. Figure 7.5 shows the training loss curves with and without data augmentation of the training data. It can be seen that without data augmentation the training loss converges more quickly, as was expected, since less data is available.

Figure 7.5: Training loss curves using data augmentation or not

The files can be found in https://tinyurl.com/yca4krn3. The files with filenames data_aug.mid/mp3 and no_data_aug.mid/mp3 are trained for the same number of batches, namely the num_batches for the data augmentation model. As a reminder, the procedure to find num_batches can be found in Section 7.2.1. Since we train the models for the same number of batches and there is a lot less training data available without data augmentation, it is also more

likely for the example without data augmentation to be stealing examples from the training data set. This is, of course, not favorable.

Therefore we want to compare it to an example for which the model without data augmentation was only trained for its own num_batches. This example can be found in no_data_aug_stop_at_lowest_validation_loss.mid/mp3. This piece is a lot more random than the data augmentation one. Only if the dataset could be expanded with a lot more files in a lot of different keys could the model without data augmentation be favorable. In our case, the data augmented version is favored.

F LSTM size

Figure 7.6 shows the training loss curves for different LSTM size values. We can clearly see that an LSTM size of 512 generates a much lower training loss. We could state that this is better, but this would be a premature conclusion. We would have to compare generated examples.

Figure 7.6: Training loss curves for different LSTM size values for the melody model

The files can be found in https://tinyurl.com/y9tkrp8d with filenames lstm_size_«insert lstm size value here».mp3 or lstm_size_«insert lstm size value

here».mid. Even though it is very hard to hear a distinct difference between the two sizes, we do prefer the size 512, since the melody is a bit more coherent and dynamic. However, the model with LSTM size 256 also generates a good melody.

G Number of LSTM Layers

Figure 7.7 shows the training loss curves where the number of LSTM layers was adapted. The lowest training losses can be found with four LSTM layers, closely followed by three LSTM layers. Two LSTM layers generate a higher loss curve.

Figure 7.7: Training loss curves: compare number of LSTM layers for the melody model

The files can be found in https://tinyurl.com/yaam5479 with filenames lstm_layers_«insert number of lstm layers here».mp3 or lstm_layers_«insert number of lstm layers here».mid. Even though it wasn’t expected, the examples with four LSTM layers generate more random rhythms (this was noticed over five files with five different seeds). Two or three LSTM layers were quite similar.

7.2.3 Conclusion

This simple model gave us a basic understanding of the problem ahead, and also provided us with good examples and inspiration.

We've looked at the training losses of the music pieces and wanted to see if there was any link to which hyperparameter generated the best result. We concluded that they do not necessarily indicate which hyperparameter generates the best musical examples. Therefore, we will only discuss them superficially for the next model.

7.3 Melody on Chord Generation

This section will discuss the results for the lead sheet generation model discussed in Section 6.2.

7.3.1 Training and generation details

The exact same method as in Section 7.2.1 was used for the chord generation; even the chosen parameters were the same. α in Equation 6.13 was set to 0.5, so the chord loss and the rhythm loss have the same weight in the final loss. For the melody generating component we used the different parameters that can be seen in Table 7.3. Some parameter ranges were limited more, since training this model took a long time.

Parameter                              Value possibilities   Default value
Time steps                             [50, 75]              50
Data augmentation                      [Yes, No]             Yes
Size of LSTM layers in encoder         [256, 512]            256
Size of LSTM layers in decoder         [256, 512]            256
How many LSTM layers in encoder        [2, 3]                2
How many LSTM layers in decoder        [2, 3]                2

Table 7.3: Default and possible values for parameters in the melody generation component for the melody on chord generation model

Again, training was done on a server using Nvidia GTX 980. For the chord generating component, training lasted about 5 hours for a combination of parameters, except where the training dataset wasn't augmented. Then it only lasted

about 25 minutes. For the melody generating component, training lasted about 16h30min for each combination of parameters with data augmentation. Without data augmentation, training took about 1h20min.

Training_time(chord) = 8 · 5h + 1 · 25min = 40h25min (7.4)

Training_time(melody) = 6 · 16h30min + 1 · 1h20min = 100h20min (7.5)

These two components could be trained in parallel.

7.3.2 Subjective comparison of MIDI files

Since we've seen for the melody model that it was very hard to compare the pieces based on the hyperparameters, we will only compare certain aspects this time. We will still compare the lowest-validation model to the overfitting one, since this could be relevant to see if we would copy certain aspects of the training dataset if we use the overfitted examples. We will not discuss the loss curves of the other hyperparameters, however. We will provide both good and bad examples and give some intermediate conclusions.

A Comparison underfitting, early stopping and overfitting

First of all, even though the chord generating component has a very similar architecture to the melody generating component, the loss curves do look rather different (see Figures 7.8a and 7.8b). The chord generating component's validation curve keeps going down long after the simple melody model's validation curve does. No direct conclusion from this can be made, since we don't know if a lower validation loss actually makes the generation any better. More examination is needed.

The files can be found in https://tinyurl.com/y9wk7r7t. Two examples are given for both overfitting and lowest validation, all with the same seed. Because of this seed, the MIDI and mp3 files differ only after about 30 seconds. Overfitting is preferred again, similarly to the melody model. However, we’ve found that

86 (a) Simple melody generation model

(b) Chord generating component of melody on chord model

(c) Melody generating component of melody on chord model

Figure 7.8: Comparison Training and Validation Loss for all default values

there was more copying of the training dataset in the overfitting parts, and that the lowest validation also generated good results. We will therefore continue with examples where the model was trained until the point of lowest validation, for both the chord generating component and the melody generating component.

B Examples and intermediate conclusions

To give a more well-rounded discussion about the generated pieces, we wanted to discuss the best and worst examples. They can be found in https://tinyurl.com/y7tmnnah. We generated long pieces where both the chord scheme and melody models were trained until their lowest validation point. We took the best and worst examples of five pieces, tied the chords and sometimes played the chords an octave lower in order to hear the melody better. We named them best_«i».mid/mp3 and worst_«i».mid/mp3. If the index i is the same, that means it came from the same piece. The files original_«i».mid/mp3 show the original files they were taken from. These files were grouped together in folders Piece «i».

For the best parts, we found in a lot of pieces at least one fragment of 20 or 30 seconds that was considered 'good'. To find longer fragments, we had to search more, since usually one bad harmonic move made it audible that the fragment was generated. However, the parts we've found were melodically coherent. Most of the good examples are of a slow nature, except for the one from piece 5. In this example, you can also hear best that there is no real sense of measure in (most of) the generated pieces. More on the good outputs and their promising results can be found in the survey (Section 7.3.3). For the worst parts, some were amelodic (such as the worst example from piece 1), some were mostly bad in terms of rhythm (such as the worst example from piece 3) and some were both (such as the worst example from piece 2). For most pieces, it was easier to find the worst fragments, except for piece 4. This was probably mostly because there was a better 'base', the chord scheme.

In general, we have confirmed the claim of Waite in [98] that the music made

by RNNs lacks a sense of direction and becomes boring. Even though the model produces musically coherent fragments sometimes, a lot of the time you couldn't recognize that two melodies chosen from the same piece are in fact from the same piece. Also, even though our pieces have some good fragments, they usually have at least one portion which doesn't sound musical at all once we start generating longer pieces. This was confirmed by the examples above. This could possibly be fixed with a larger dataset, by training longer or by letting an expert in music filter out these fragments.

7.3.3 Survey

A survey was conducted in order to see how our computer-generated pieces compare to real human-composed pieces. This was done by asking the participant a set of questions for each piece in Google Form format. There were 177 participants for this particular online survey. Each chord in a piece was played in its root position¹. To make it easier on the ears, some chords were tied together, since the output of the model repeats each chord for each note. Figure 7.9 clarifies. The pieces were exported to mp3 in MuseScore, in order to eliminate any human nuances that could hint at the real nature of the piece. Then, for each piece, a fragment of 30 seconds was chosen.

We wanted to establish what background the participants of the survey had in music. When asked about their music experience, they could choose between experienced musician, amateur musician or listener. This was followed by three questions for each audio fragment:

1. Do you recognize this audio fragment? Answers: yes or no

2. How do you like this audio fragment?
Answer: They could answer on a scale from 1 to 5 where 1 represented complete dislike and 5 represented love

3. Is this audio fragment made by a computer or a human? Beware: the music is played by a computer. We only want to know if you think the music itself is composed by a human or a computer.
Answer: They could answer on a scale from 1 to 5 where 1 represented 'definitely a human' and 5 represented 'definitely a computer'

¹https://en.wikipedia.org/wiki/Inversion_(music)

(a) The initial MIDI file with no ties

(b) The MIDI file after the notes were tied together

Figure 7.9: A step in preparing the MIDI files for the survey

In the survey, there were three categories of music pieces: human-composed pieces, completely computer-generated pieces and pieces where the melody was generated on an actual human-composed chord scheme. For each of these categories, three audio fragments were included in the survey. For the human-composed pieces, we tried to select audio fragments that weren't too popular

with the public, so not like e.g. Village People - Y.M.C.A. For the generated pieces, we selected fragments that were completely generated, which means that we only started listening after the seed of 50 notes in the chord scheme. The generated pieces were taken from the default value hyperparameters, only training until lowest validation.

The files that were used in the survey can be found in https://tinyurl.com/y8qpvu92 and Table 7.4 shows their (potential) origin and category. The pieces and the answers can also be found in https://youtu.be/Qayllb1EZKU. The origin for the partially generated files is the chord scheme on which the melody was based. The human-composed scores show as origin the original name of the song in the Wikifonia dataset.

In order to make sure that these files weren't copied, we wanted to see if a melody or chord scheme was copied at all from the training dataset. This is again done by finding the Longest Common Subsequent Sequence (LCSS). Of course, we expect full copies of the chord scheme for the files 3.mid, 6.mid and 8.mid, since they use that chord scheme. The results can be found in Table 7.5. The LCSSs for the chords are longer than those for the melodies, since it is much more likely, for example, to have many of the same chords (e.g. C major) after each other than melody notes. For example, 7.mid's LCSS for only chords has 111 times the same chord (F# Major), followed by four B major chords.

For each piece, at least 2.3% of the participants noted that they recognized the fragment, with a maximum of 12.4% for 3.mid. Table 7.6 shows all the percentages. The reason this percentage could be high for 3.mid is that the chord scheme could be more recognizable, since Elton John is quite a popular artist. However, the melody is not taken from the dataset (see Table 7.5).

In general, there was some sort of correlation between how much a person liked the audio piece and how much they think it is composed by a human, if we disregard the neutral answers. If the highest percentage of people didn't like an audio fragment, they were more likely to give it a 'computer' stamp, even if it was made by a human.

File    Category                              If (partially) human, real name of chord scheme or piece
1.mid   Human-composed pieces                 Burton Lane, Ralph Freed - How About You (Arr: Wim Alderding)
2.mid   Generated pieces                      N/A
3.mid   Generated melody, human chord scheme  Bernie Taupin, Elton John - Crocodile Rock
4.mid   Human-composed pieces                 Marr Morrissey - How soon is now
5.mid   Human-composed pieces                 Maud Nugent - Sweet Rosie O'Grady
6.mid   Generated melody, human chord scheme  Diane Warren - There You'll Be
7.mid   Generated pieces                      N/A
8.mid   Generated melody, human chord scheme  Frank Mills - The Happy Song
9.mid   Generated pieces                      N/A

Table 7.4: The pieces that were included in the survey, with their category and potential origin

Method       Generated file   Original training file                                        LCSS   Ratio LCSS/length generated score
Only melody  2.mid            Coldplay - Cemeteries of London                               14     14/1050 = 0.01333
pitch        3.mid            Neil Diamond - I am....I said                                 19     19/550 = 0.03454
             6.mid            Maria Grever, Stanley Adams - What a Difference a Day Makes   14     14/454 = 0.03083
             7.mid            Street, Morrissey - SUEDEHEAD                                 21     21/1050 = 0.02
             8.mid            Giosy Cento - Viaggio Nella Vita                              14     14/409 = 0.0342
             9.mid            Bobby Timmons, Jon Hendricks - Moanin'                        29     29/1050 = 0.0276
Only chord   2.mid            Peter Gabriel - Steam                                         63     63/1050 = 0.06
             3.mid            Bernie Taupin, Elton John - Crocodile Rock                    550    550/550 = 1
             6.mid            Diane Warren - There You'll Be                                454    454/454 = 1
             7.mid            Paul Peterson - She Can't Find Her Keys                       115    115/1050 = 0.1095
             8.mid            Frank Mills - The Happy Song                                  409    409/409 = 1
             9.mid            Jerry Leiber, Mike Stoller - Lucky lips                       54     54/1050 = 0.05142

Table 7.5: The Longest Common Subsequent Sequence (LCSS) for (partially) generated MIDI files in the survey. One time using only pitch and one time using only chords. A full copy is expected for the 'only chord' method for files 3.mid, 6.mid and 8.mid.

File    Category                              % recognition
1.mid   Human-composed                        4.5%
2.mid   Generated                             2.3%
3.mid   Generated melody, human chord scheme  12.4%
4.mid   Human-composed                        5.1%
5.mid   Human-composed                        3.4%
6.mid   Generated melody, human chord scheme  5.1%
7.mid   Generated                             4%
8.mid   Generated melody, human chord scheme  5.1%
9.mid   Generated                             2.8%

Table 7.6: The percentage of participants of the survey who responded that they recognized the audio fragment

This can also be seen in Tables 7.7 and 7.8. Only 6.mid was an exception to this observation. This observation makes sense, since most people associate human-composed music with being more easy on the ears, whilst assuming computer-generated music to be more random and 'unlikeable'. This correlation can also be seen in Figure 7.10. In this figure, the mean likeability (values from one to five) and the mean of how much people think the piece is computer generated (values from one to five) were taken and plotted against each other. A correlation with r² = 0.82051 was found. We now want to test the significance of this correlation through a hypothesis test [99, 100]. These are our hypotheses:

Figure 7.10: Survey responses: correlation between mean likeability and how much the participants of the survey think it is computer generated

1. Null Hypothesis H0: The correlation coefficient is not significantly different from zero. There is not a significant linear relationship (correlation) between the mean likeability and the mean perception of the piece being computer generated.

2. Alternate Hypothesis Ha: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between the mean likeability and the mean perception of computer-generated content.

Our degrees of freedom are n − 2 = 9 − 2 = 7, where 9 is the number of pieces n. We set the level of significance to 0.05. We know our r² = 0.82051, so our r is 0.90582. We perform a one-tailed test, so we find the critical t value to be 1.895 from the one-tailed t-distribution table². We observe:

$$ t = r\sqrt{\frac{n-2}{1-r^2}} = 5.6568 \quad (7.6) $$

This t value is larger than our critical t value so our Null Hypothesis can be rejected. There is therefore a significant relationship between the mean likeability and the mean perception of computer-generated content.
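The same computation can be reproduced with a few lines of Python; the use of scipy for the critical value is an illustrative choice, and only the reported r² and n are used, not the raw survey data.

```python
import math
from scipy import stats

n, r2, alpha = 9, 0.82051, 0.05
r = math.sqrt(r2)
t = r * math.sqrt((n - 2) / (1 - r2))              # Equation 7.6
t_crit = stats.t.ppf(1 - alpha, df=n - 2)          # one-tailed critical value
print(round(r, 5), round(t, 4), round(t_crit, 3))  # ~0.90582, ~5.657, ~1.895
print("reject H0" if t > t_crit else "fail to reject H0")
```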

File    Category                              % dislike   % neutral   % like
1.mid   Human-composed                        39.5%       39.5%       20.9%
2.mid   Generated                             20.9%       32.2%       46.8%
3.mid   Generated melody, human chord scheme  14.1%       28.2%       57.6%
4.mid   Human-composed                        17.5%       27.1%       55.3%
5.mid   Human-composed                        66.1%       24.8%       9.0%
6.mid   Generated melody, human chord scheme  18.6%       33.8%       47.4%
7.mid   Generated                             28.2%       38.9%       32.7%
8.mid   Generated melody, human chord scheme  22.5%       33.3%       44.1%
9.mid   Generated                             50.8%       35.0%       14.1%

Table 7.7: How much the participants responded they liked or disliked a piece

²http://www.statisticshowto.com/tables/t-distribution-table/

File    Category                              % human   % neutral   % computer
1.mid   Human-composed                        20.3%     12.4%       67.2%
2.mid   Generated                             38.9%     28.2%       32.7%
3.mid   Generated melody, human chord scheme  55.9%     24.8%       19.2%
4.mid   Human-composed                        46.8%     16.3%       36.7%
5.mid   Human-composed                        15.2%     15.8%       68.9%
6.mid   Generated melody, human chord scheme  37.8%     21.4%       40.6%
7.mid   Generated                             48.5%     20.9%       30.5%
8.mid   Generated melody, human chord scheme  46.8%     23.1%       29.9%
9.mid   Generated                             20.9%     19.7%       59.3%

Table 7.8: The participants’ answer to the question if the fragment was computer-generated or human-composed

We will perform a bootstrap hypothesis test in order to determine if the computer-generated pieces were perceived as more human than the real pieces. We concluded that the computer-generated pieces were perceived as at least as human as the human pieces, even outperforming them. This was significant with p = 8 · 10⁻⁶. We have considered the fact that perhaps the first example, which was human, influenced the answers. This could be the case since the pieces were played by MuseScore, which sounds more computer generated; the listeners perhaps needed to adapt to the sound first. We performed a bootstrap hypothesis test, comparing the first example and the second example, and the second example was perceived as more human than the first with p = 0.0. We therefore wanted to repeat the first bootstrap test without the first human example. Even then, our results were significant, with p = 0.0026. We could therefore conclude that the computer-generated pieces were perceived to be as human as or more human than the real human pieces. The partially computer-generated examples outperformed the computer-generated pieces (p = 1.2 · 10⁻⁵).
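Since the exact bootstrap procedure is not spelled out, the sketch below shows one common formulation (a one-sided bootstrap test on the difference of group means), with made-up placeholder ratings standing in for the real survey responses.

```python
import numpy as np

def bootstrap_mean_diff_pvalue(group_a, group_b, n_boot=10_000, seed=0):
    """One-sided bootstrap test: p-value for 'mean(group_a) > mean(group_b)'
    under the null hypothesis that both groups share the same mean."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample both groups from the pooled data (null: no difference).
        ra = rng.choice(pooled, size=len(a), replace=True)
        rb = rng.choice(pooled, size=len(b), replace=True)
        diffs[i] = ra.mean() - rb.mean()
    return np.mean(diffs >= observed)

# Placeholder ratings (1 = definitely human ... 5 = definitely computer);
# the real survey responses are not reproduced here.
human_rated = [4, 3, 4, 5, 3, 4, 2, 4, 3]
generated_rated = [2, 3, 2, 1, 3, 2, 4, 2, 3]
print(bootstrap_mean_diff_pvalue(human_rated, generated_rated))
```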

Figure 7.11: Survey responses: number of correct answers out of six per music experience category

We now wanted to see if the experienced or amateur musicians estimate better

if a piece is human-composed or computer-generated. We wanted to do this by looking at the number of correct answers for the completely human-composed and completely computer-generated pieces. It was debatable what a correct answer was for the partially human-composed, partially computer-generated pieces, so these weren't included. Figure 7.11 shows the results. Between the listeners and amateurs, a slight but negligible difference was found. However, we have found that self-proclaimed experienced musicians have in general a higher likelihood of guessing right. No one guessed all six correctly and only one listener guessed five correctly. The average number of correct answers can be found in Table 7.9. The differences are not high, but self-proclaimed experienced musicians scored slightly better on average.

Music experience level   Average number of correct answers
Listener                 1.99
Amateur                  1.98
Experienced              2.33

Table 7.9: Survey responses: average number of correct answers (out of six) per music experience category

Through bootstrap testing, we can see that it is not significant that self-proclaimed experienced musicians were less likely to give a human stamp to computer-generated pieces compared to listeners (p = 0.24826) and amateurs (p = 0.16). They were also not significantly better at stamping the human pieces correctly than amateurs (p = 0.297) and listeners (p = 0.301). In turn, amateur musicians were not significantly better in any case compared to listeners. We can therefore conclude that musicians do not necessarily outperform listeners significantly in this survey.

Even though these results are very positive for the master dissertation, it would be interesting to repeat the survey in a different manner. If the music pieces would be played by real musicians, the comparison could be more reliable, but not necessarily. We would also want to shuffle the order of the music pieces

for each participant in order to guarantee that the adaptation period is filtered out. Unfortunately, this was not possible with Google Forms. An extra form field to put comments in for each piece could give us more insights in why the user responded in a certain way. We could also include 'bad' computer-generated examples in order to see the influence it could have on the results.

7.3.4 Conclusion

The survey results are very positive. The participants couldn’t really distinguish the computer-generated pieces from the human-composed pieces. Self-proclaimed expert musicians and amateur musicians didn’t perform better than listeners to music. In general, we found a significant correlation between the mean likeability of a piece and the mean of how much the participants think it is human-composed.

As was said before, the pieces can still lack a sense of direction and become a bit boring to listen to after a while, especially if listened to in the software program MuseScore. Long-term coherence is also missing: melodies taken from the same piece cannot always be linked to each other. Even though our pieces have some good fragments, they usually have at least one portion which doesn't sound musical at all once we start generating longer pieces. This was confirmed by the examples above. This could possibly be fixed with a larger dataset, by training longer or by letting an expert in music filter out these fragments. So, even though not a lot of effort was put into choosing the 30-second fragments for the survey, it would have taken a lot longer if more extensive fragments had needed to be chosen. None of the pieces have a sense of measure yet; this still needs to be researched further.

“Everything in the universe has a rhythm, everything dances.”
Maya Angelou

8 Conclusion

This master dissertation focused on generating lead sheets, specifically melody combined with chords, in the style of the Wikifonia dataset. We found that a lot of lead sheet generation research has focused on Markov models. We therefore wanted to adapt the RNN techniques used in the classical world to more modern music.

Two models were discussed: a simple melody generating model and a melody on chord generating model. The second model first used a number of LSTM layers and a fully connected layer to generate a chord scheme, which included the chord itself and also the rhythm of the melody note. After that, we used a Bi-LSTM encoder to gain some information about the entire chord scheme we want to generate the melody pitches on. This chord scheme information and the information about previous melody notes was used in the decoder (again a number of LSTM layers and a fully connected layer). This second part generates the

melody pitches on the rhythm that was already established in the chord scheme.

In general, we have confirmed the claim of Waite in [98] that the music made by RNNs lacks a sense of direction and becomes boring. Even though the model produces musically coherent pieces sometimes, a lot of the time you couldn't recognize that two melodies chosen from the same piece are in fact from the same piece. Also, our pieces have some good fragments, but once we start generating longer pieces, they usually have at least one portion which doesn't sound musical at all. This could possibly be fixed with a larger dataset, by training longer or by letting an expert in music filter out these fragments. None of the pieces have a sense of measure yet; this still needs to be researched further.

The survey results are very positive. The participants couldn't really distinguish the computer-generated pieces from the human-composed pieces. Self-proclaimed expert musicians and amateur musicians didn't perform better than listeners to music. In general, we found a significant correlation between the mean likeability of a piece and the mean of how much the participants think it is human-composed.

8.1 Further work

Three main areas of further research have been discussed. The first one is adding lyrics to the lead sheets. This can be done by working on a syllable level for the training dataset and adding a third component to the melody on chord model, which generates the lyrics on the melody notes. We believe the information of the chords can be left out in this stage, since the melody guides the words, but further research needs to be done. The second part of further research is to add more structure to the lead sheets. Usually a lead sheet consists of some sort of structure, such as verse - chorus - verse

- chorus - bridge - chorus, as was seen in Section 2.5. This could be established by adding for example a similarity factor to the models discussed, such as in [84]. Thirdly, ways to represent measures could still be researched further. We have tried one way to represent measures in this master dissertation, through representing measure bars, which was unsuccessful since, for example, some measures were more than full and others empty. Perhaps adding some sort of metric to see how many notes we still need to fill up the measure could be an option.

Smaller areas of improvement can also be found. Experimentation within the melody on chord model could still be done; for example, training the melody on chord component using the entire chord scheme instead of only one part could improve its results. Adding lead sheets to the dataset could also lead to significant improvements. Using a GRU instead of an LSTM could also be worth investigating, since it has been shown to outperform the LSTM in some cases in terms of CPU time, parameter updates and generalization [101]. Even though these results are very positive for the survey, it would be interesting to repeat it differently to gain more insight. If the music pieces would be played by real musicians, the comparison could be more reliable, but not necessarily. We would also want to shuffle the order of the music pieces for each participant in order to guarantee that the adaptation period is filtered out. An extra form field to put comments in for each piece could give us more insights in why the user responded in a certain way. We could also include 'bad' computer-generated examples in order to see the influence it could have on the results.

A Chords

Figure A.1 gives both the textual and musical representation of common chords.

Figure A.1: Different representations for chords

References

[1] J. J. Bharucha and P. M. Todd, “Modeling the perception of tonal structure with neural nets,” Computer Music Journal, vol. 13, no. 4, pp. 44–53, 1989.

[2] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “Midinet: A convolutional generative adversarial network for symbolic-domain music generation,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China, 2017.

[3] A. Papadopoulos, P. Roy, and F. Pachet, “Assisted lead sheet composition using flowcomposer,” in International Conference on Principles and Practice of Constraint Programming, pp. 769–785, Springer, 2016.

[4] “Flowmachines project.” [Online]. Available: http://www.flow-machines.com/ [Accessed:20-May-2018].

[5] F. Pachet, A. Papadopoulos, and P. Roy, “Sampling variations of sequences for structured music generation,” in Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR’2017), Suzhou, China, pp. 167–173, 2017.

[6] M. Ramona, G. Cabral, and F. Pachet, “Capturing a musician’s groove: Generation of realistic accompaniments from single song recordings.,” in IJCAI, pp. 4140–4142, 2015.

[7] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.

[8] M. C. Mozer, “Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing,” Connection Science, vol. 6, no. 2-3, pp. 247–280, 1994.

[9] D. Murray-Rust, A. Smaill, and M. Edwards, “Mama: An architecture for interactive musical agents,” Frontiers in Artificial Intelligence and Applications, vol. 141, p. 36, 2006.

[10] K. Shaffer, B. Hughes, and B. Moseley, “Open music theory.” [Online]. Available: http://openmusictheory.com/ [Accessed:4-Dec-2017].

[11] “Some clefs.” [Online]. Available: http://www2.sunysuffolk.edu/prentil/webnastics2015/zen_my-final-site/Musical_Staff.html [Accessed:4-Dec-2017].

[12] “Teaching the basic music notes, what are the symbols?,” 2016. [Online]. Available: https://music.stackexchange.com/questions/41452/teaching-the-basic-music-notes-what-are-the-symbols [Accessed:16-Dec-2017].

[13] “Repeats, codas, and endings.” [Online]. Available: http://totalguitarist.com/lessons/reading/notation/guide/repeats/ [Accessed:4-Dec-2017].

[14] “Transposition.” [Online]. Available: https://www.clementstheory.com/study/transposition/ [Accessed:13-Apr-2018].

[15] S. M. Wood, “Circle of fifths,” 2016. [Online]. Available: http://www.musiccrashcourses.com/lessons/key_relationships.html [Accessed:2-Dec-2017].

[16] J. McCarthy, M. Minsky, and N. Rochester, “A proposal for the dartmouth summer research project on artificial intelligence,” 1955.

[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[18] A. of Rhodes, Argonautica. Digireads.com Publishing, 1606.

[19] J. McCarthy, “What is artificial intelligence,” 2007.

[20] E. Brynjolfsson and A. Mcafee, “The business of artificial intelligence,” Harvard Business Review, 2017.

[21] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: towards real-time object detection with region proposal networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.

[22] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient.,” in AAAI, pp. 2852–2858, 2017.

[23] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.

[24] L. Deng, D. Yu, et al., “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.

[25] C. Donalek, “Supervised and unsupervised learning,” in Astronomy Colloquia. USA, 2011.

[26] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?,” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 625–660, 2010.

[27] “What is deep learning? how it works, techniques and applications,” 2017.

[28] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural networks, vol. 61, pp. 85–117, 2015.

[29] S. J. Russell and P. Norvig, “Artificial intelligence: a modern approach (third edition),” 2010.

[30] X. Yao, “Evolving artificial neural networks,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1423–1447, 1999.

[31] F.-F. Li, “Stanford lectures: Convolutional neural networks for visual recognition - activation functions,” 2017. [Online]. Available: http://cs231n.github.io/neural-networks-1/#actfun [Accessed: 13-Nov-2017].

[32] S. Sonoda and N. Murata, “Neural network with unbounded activation functions is universal approximator,” Applied and Computational Harmonic Analysis, 2015.

[33] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “Fpga acceleration of recurrent neural network based language model,” 2015.

[34] R. Hecht-Nielsen et al., “Theory of the backpropagation neural network.,” Neural Networks, vol. 1, no. Supplement-1, pp. 445–448, 1988.

[35] M. Nielsen, “How the backpropagation algorithm works,” 2017.

[36] “Early stopping.” [Online]. Available: http://neuron.csie.ntust.edu.tw/homework/94/neuron/Homework3/M9409204/discuss.htm [Accessed:18-Dec-2017].

[37] F.-F. Li, “Stanford lectures: Convolutional neural networks for visual recognition,” 2017. [Online]. Available: http://cs231n.stanford.edu/ [Accessed:31-Oct-2017].

[38] R. Sun and C. L. Giles, “Sequence learning: from recognition and prediction to sequential decision making,” IEEE Intelligent Systems, vol. 16, no. 4, pp. 67–70, 2001.

[39] J. Brownlee, “Making predictions with sequences,” 2017. [Online]. Available: https://machinelearningmastery.com/sequence-prediction/ [Accessed:12-Mar-2018].

[40] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.,” in Interspeech, vol. 2, p. 3, 2010.

[41] D. Britz, “Recurrent neural networks tutorial, part 1 – introduction to rnns,” 2015. [Online]. Available: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ [Accessed:12-Mar-2018].

[42] J.-P. Briot, G. Hadjeres, and F. Pachet, “Deep learning techniques for music generation-a survey,” arXiv preprint arXiv:1709.01620, 2017.

[43] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning, pp. 1310–1318, 2013.

[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[45] “The lstm memory cell figure,” 2016.

[46] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Annual Conference of the International Speech Communication Association, 2014.

[47] “Encoder decoder architecture.” [Online]. Available: https://smerity.com/articles/2016/google_nmt_arch.html [Accessed:11-May-2018].

[48] A. Van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in Advances in neural information processing systems, pp. 2643–2651, 2013.

[49] E. Cheever, “A fourier approach to audio signals,” 2015. [Online]. Available: http://lpsa.swarthmore.edu/Fourier/Series/WhyFS.html [Accessed:27-Nov-2017].

[50] A. M. Sarroff and M. A. Casey, “Musical audio synthesis using autoencoding neural nets,” in ICMC, 2014.

[51] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for polyphonic sound event detection in real life recordings,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 6440–6444, IEEE, 2016.

[52] D. Vandenneucker, “Midi tutorial,” 2012. [Online]. Available: http://www.music-software-development.com/midi-tutorial.html [Accessed:27-Nov-2017].

[53] A. Huang and R. Wu, “Deep learning for music,” arXiv preprint arXiv:1606.04930, 2016.

[54] M. Good et al., “Musicxml: An internet-friendly format for sheet music,” in XML Conference and Expo, pp. 03–04, 2001.

[55] “Musicxml,” 2017. [Online]. Available: http://www.musicxml.com/ [Accessed:27-Nov-2017].

[56] “Musedata.” [Online]. Available: http://www.musedata.org/ [Accessed:27-Nov-2017].

[57] “The humdrum toolkit: Software for music research.” [Online]. Available: http://www.humdrum.org/ [Accessed:27-Nov-2017].

[58] C. Walder, “Modelling symbolic music: Beyond the piano roll,” in Asian Conference on Machine Learning, pp. 174–189, 2016.

[59] “abc notation.” [Online]. Available: http://www.humdrum.org/ [Accessed:27-Nov-2017].

[60] F. Pachet, J. Suzda, and D. Martinez, “A comprehensive online database of machine-readable lead-sheets for jazz standards.,” in ISMIR, pp. 275–280, 2013.

[61] C. Walder, “Symbolic music data version 1.0,” arXiv preprint arXiv:1606.02542, 2016.

[62] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” arXiv preprint arXiv:1206.6392, 2012.

[63] G. Hadjeres and F. Pachet, “Deepbach: a steerable model for bach chorales generation,” arXiv preprint arXiv:1612.01010, 2016.

[64] M. Bretan, S. Oore, D. Eck, and L. Heck, “Learning and evaluating musical features with deep autoencoders,” arXiv preprint arXiv:1706.04486, 2017.

[65] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.

[66] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.

[67] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.

[68] D. Eck and J. Schmidhuber, “A first look at music composition using lstm recurrent neural networks,” Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, vol. 103, 2002.

[69] P. M. Todd, “A connectionist approach to algorithmic composition,” Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989.

[70] K. Ebcioğlu, “An expert system for harmonizing four-part chorales,” Computer Music Journal, vol. 12, no. 3, pp. 43–51, 1988.

[71] C. Wolff, Johann Sebastian Bach: the learned musician. WW Norton & Company, 2001.

[72] F. Liang, BachBot: Automatic composition in the style of Bach chorales. PhD thesis, Masters thesis, University of Cambridge, 2016.

[73] T. Hori, K. Nakamura, S. Sagayama, and M. Univerisity, “Jazz piano trio synthesizing system based on hmm and dnn,” 2017.

[74] H. Chu, R. Urtasun, and S. Fidler, “Song from pi: A musically plausible network for pop music generation,” arXiv preprint arXiv:1611.03477, 2016.

[75] “Google magenta: Make music and art using machine learning.” [Online]. Available: https://magenta.tensorflow.org/ [Accessed:3-Dec-2017].

[76] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “Musegan: Symbolic-domain music generation and accompaniment with multi-track se- quential generative adversarial networks,” arXiv preprint arXiv:1709.06298, 2017.

[77] J. A. Franklin and V. U. Manfredi, “Nonlinear credit assignment for musical sequences,” in Second international workshop on Intelligent systems design and application, pp. 245–250, 2002.

[78] S. Le Groux and P. Verschure, “Towards adaptive music generation by re- inforcement learning of musical tension,” in Proceedings of the 6th Sound and Music Conference, Barcelona, Spain, vol. 134, 2010.

[79] N. Collins, “Reinforcement learning for live musical agents.,” in ICMC, 2008.

[80] D. Murray-Rust, “Musical acts and musical agents: theory, implementation and practice,” 2008.

[81] C.-Z. A. Huang, D. Duvenaud, and K. Z. Gajos, “Chordripple: Recommend- ing chords to help novice composers go beyond the ordinary,” in Proceedings of the 21st International Conference on Intelligent User Interfaces, pp. 241– 250, ACM, 2016.

[82] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[83] S. Madjiheurem, L. Qu, and C. Walder, “Chord2vec: Learning musical chord embeddings,”

[84] S. Lattner, M. Grachten, and G. Widmer, “Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints,” arXiv preprint arXiv:1612.04742, 2016.

112 [85] A. Tikhonov and I. P. Yamshchikov, “Music generation with vari- ational recurrent autoencoder supported by history,” arXiv preprint arXiv:1705.05458, 2017. [86] B. L. Sturm, J. F. Santos, O. Ben-Tal, and I. Korshunova, “Music tran- scription modelling and composition using deep learning,” arXiv preprint arXiv:1604.08723, 2016. [87] B. L. Sturm and O. Ben-Tal, “Taking the models back to music practice: evaluating generative transcription models built using deep learning,” Jour- nal of Creative Music Systems, vol. 2, no. 1, 2017. [88] N. Agarwala, Y. Inoue, and A. Sly, “Music composition using recurrent neural networks,” 2017. [89] “Protocol buffers: Google’s data interchange format. documentation and open source release;.” [Online]. Available: https://developers.google. com/protocol-buffers/docs/proto [Accessed:13-Apr-2018]. [90] “Ornament (music).” [Online]. Available: https://en.wikipedia.org/ wiki/Ornament_(music) [Accessed:13-Apr-2018]. [91] “Batch size (machine learning).” [Online]. Available: https: //radiopaedia.org/articles/batch-size-machine-learning [Accessed:27-Apr-2018]. [92] C. De Boom, S. Leroux, S. Bohez, P. Simoens, T. Demeester, and B. Dhoedt, “Efficiency evaluation of character-level rnn training schedules,” arXiv preprint arXiv:1605.02486, 2016. [93] “Musescore: default tempo.” [Online]. Available: https://musescore.org/ en/node/16635 [Accessed:30-Apr-2018]. [94] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649, IEEE, 2013. [95] “Tensorflow: open-source machine learning framework.” [Online]. Available: https://www.tensorflow.org/ [Accessed:27-Apr-2018]. [96] “Tensorboard: a data visualization toolkit.” [Online]. Available: https:// www.tensorflow.org/programmers_guide/summaries_and_tensorboard [Accessed:27-Apr-2018].

113 [97] “Midiutil documentation.” [Online]. Available: https://media. readthedocs.org/pdf/midiutil/latest/midiutil.pdf [Accessed:12- May-2018].

[98] E. Waite, “Generating long-term structure in songs and stories,” 2016. [Online]. Available: https://magenta.tensorflow.org/2016/07/15/ lookback-rnn-attention-rnn [Accessed:25-May-2018].

[99] “Testing the significance of the correlation coefficient.” [On- line]. Available: https://www.texasgateway.org/resource/ 124-testing-significance-correlation-coefficient-optional [Accessed:22-May-2018].

[100] “Testing the significance of the correlation coefficient.” [Online]. Avail- able: http://janda.org/c10/Lectures/topic06/L24-significanceR. htm [Accessed:22-May-2018].

[101] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

Index

A
ABC notation ...... 30
accidental ...... 7
activation function ...... 15, 16
anacrusis ...... 47
Artificial Intelligence (AI) ...... 14
audio ...... 25
audio spectrum ...... 25
Auto-Encoder ...... 23

B
backpropagation ...... 20
backpropagation through time (BPTT) ...... 22
bias ...... 15
bidirectional LSTM ...... 67

C
chord ...... 9, 42
circle of fifths ...... 11, 42
classical music ...... 37
clef ...... 5
cost function ...... 19

D
Deep Learning (DL) ...... 15
Deep Neural Network (DNN) ...... 15, 20

E
early stopping ...... 20
embellishment ...... 47
exploding gradients problem ...... 22

F
feed-forward neural network (FFNN) ...... 18, 36
flat ...... 7, 9
Fourier transformation ...... 25

G
Generative Adversarial Network ...... 39
Gradient Descent ...... 20

H
hidden Markov models ...... 38

I
integer encoding ...... 18

J
jazz ...... 38

K
key ...... 8
key signature ...... 8

L
Lead Sheet ...... 33
Leaky ReLU ...... 17
Long Short-Term Memory (LSTM) ...... 22, 36
loss function ...... 19

M
Machine Learning (ML) ...... 15
Magenta ...... 44
major scale ...... 11
maxout ...... 17
mean-squared error ...... 19
measure ...... 10
meter ...... 9
MIDI ...... 27
minor scale ...... 11
monophonic ...... 46
music representation ...... 25
Musical Instrument Digital Interface ...... 27
MusicXML ...... 28, 44
MXL ...... 28

N
Neural Network (NN) ...... 15
neuron ...... 15
note ...... 5

O
octave ...... 8
one-hot encoding ...... 18, 36
ornament ...... 47
overfitting ...... 20

P
piano roll ...... 30
pitch ...... 5
polyphonic notes ...... 46
pop ...... 38
Protocol Buffer ...... 44

R
recurrent neural network (RNN) ...... 18, 22, 36
ReLU ...... 17
rest ...... 7
rhythm ...... 5

S
scale ...... 10
Sequence Generation ...... 21
sharp ...... 7, 9
sigmoid function ...... 16
signal representation ...... 25
softmax ...... 17
staff ...... 5
standard music notation ...... 4
supervised ...... 15
symbolic representation ...... 26

T
tanh ...... 17
time signature ...... 9
transpose ...... 35
transposition ...... 11

U
underfitting ...... 20
unsupervised ...... 15

V
vanishing gradients problem ...... 22
Variational Auto-Encoder (VAE) ...... 42
Variational Recurrent Auto-encoder Supported by History (VRASH) ...... 42

W
weight ...... 15
Wikifonia ...... 44
