THE BACH DOODLE: APPROACHABLE MUSIC COMPOSITION WITH MACHINE LEARNING AT SCALE

Cheng-Zhi Anna Huang*  Curtis Hawthorne*  Adam Roberts*  Monica Dinculescu*  James Wexler†  Leon Hong‡  Jacob Howcroft‡
{annahuang,fjord,adarob,noms,jwexler,leonhong,jhowcroft}@google.com
* Magenta, Google Brain   † PAIR, Google Brain   ‡ Google Doodle

1. ABSTRACT

To make music composition more approachable, we designed the first AI-powered Google Doodle, the Bach Doodle [1], where users can create their own melody and have it harmonized by a machine learning model (Coconet [22]) in the style of Bach. For users to input melodies, we designed a simplified sheet-music-based interface. To support an interactive experience at scale, we re-implemented Coconet in TensorFlow.js [32] to run in the browser and reduced its runtime from 40s to 2s by adopting dilated depthwise-separable convolutions and fusing operations. We also reduced the model download size to approximately 400KB through post-training weight quantization. We calibrated a speed test based on partial model evaluation time to determine if the harmonization request should be performed locally or sent to remote TPU servers. In three days, people spent 350 years worth of time playing with the Bach Doodle, and Coconet received more than 55 million queries. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community finds this dataset useful for applications ranging from ethnomusicological studies, to music education, to improving machine learning models.

2. INTRODUCTION

Machine learning can extend our creative abilities by offering generative models that can rapidly fill in missing parts of our composition, allowing us to see a prototype of how a piece could sound. To celebrate J.S. Bach's 334th birthday, we designed the Bach Doodle to create an interactive experience where users can rapidly explore different possibilities in harmonization by tweaking their melody and requesting new harmonizations. The harmonizations are powered by Coconet [22], a versatile generative model of counterpoint that can fill in arbitrarily incomplete scores.

Creating this first AI-powered doodle involved overcoming challenges in user interaction and interface design, and also technical challenges in both machine learning and in infrastructure for serving the models at scale. For inputting melodies, we designed a simplified sheet music interface that facilitates easy trial and error, and we found that users adapted to it quickly even when they were not familiar with western music notation.

As most users do not have dedicated hardware to run machine learning models, we re-implemented Coconet in TensorFlow.js [32] so that it could run in the browser. We reduced the model run-time from 40s to 2s by adopting dilated depthwise-separable convolutions and fusing operations, and we reduced the model download size to ∼400KB through post-training weight quantization. To prepare for large-scale deployment, we calibrated a speed test to determine if a user's device is fast enough for running the model in the browser. If not, the harmonization requests were sent to remote TPU servers.

Users in 80% of sessions explored multiple harmonizations, and 53.8% of the harmonizations were rated as positive. One complaint from advanced users was the presence of parallel fifths (P5s) and octaves (P8s). We analyzed 21.8 million harmonizations and found that P5s and P8s occur on average 0.365/measure and 0.391/measure respectively. Furthermore, P5s and P8s were more common when user input was out of distribution, and fewer P5s and P8s were correlated with positive user feedback.

3. RELATED WORK

Machine learning has been used in algorithmic music composition to support a wide range of musical tasks [5, 13, 19, 28, 29]. Melody harmonization is one of the canonical tasks [7, 11, 20, 26], encourages human-computer interaction [3, 14, 21, 25, 33], and is particularly approachable for novices. Different interfaces and tools have been developed to make the interaction experience more accessible. For example, in MySong [31], users can sing a melody and have the system harmonize it. In Hyperscore [12], users can draw multiple levels of "motifs" on a graphical sketchpad and have them harmonized according to the tension curve they specified. Startups such as JukeDeck and Amper Music offer APIs that allow users to describe a piece through timing and mood tags. Open-source libraries such as Magenta.js [30] allow machine learning models to be used in digital audio workstations. For score-based interaction, FlowComposer [27] offers an augmented lead-sheet-based interface, while DeepBach [17] demonstrates interactive chorale composition in MuseScore, which uses standard music notation.

4. THE BACH DOODLE

4.1 A walk-through of the user experience

The Bach Doodle user experience begins by demonstrating 4-part harmony using two measures of a Bach chorale, Ach wie flüchtig, ach wie nichtig, BWV 26. By playing the soprano line alone, followed by soprano and alto, and then all four voices, users are shown how the harmony enhances the melody. Users are then presented with two measures of blank sheet music, in the treble clef, with a key signature of C major, in standard time. There are four vertical lines in each measure to indicate the beats, which give the user visual cues on where to put notes.

The user enters a monophonic melody using quarter and eighth notes. The note duration is automatic: if a note is entered on a beat, it is a quarter note by default; however, if a note is added after the beat, the existing quarter note becomes an eighth note. This simple interface, shown in Figure 1, makes it easy for users with no musical knowledge to input a composition. If the user clicks on the "star" button on the left-hand side of the sheet music, they can enter "advanced mode". This allows the user to input sixteenth notes anywhere on the staff. It also enables a control where the key can be changed to any of the 12 key signatures. This mode is hidden because it can be overwhelming for new users. It is also easier to make less enjoyable music this way, for example by going off key or making the music overly complex.

Figure 1. The user interface of the Bach Doodle, where users can input a melody and then click on the green "Harmonize" button on the bottom right to request a harmonization.

Clicking the "Harmonize" button sends the generation request to either TensorFlow.js or the TPU server. When the response is ready, it is presented to the user one voice at a time, listing them out: "Soprano", "Alto", "Tenor", "Bass". The voices are color-coded to illustrate the harmonization in relation to the soprano input notes (Figure 2).

Figure 2. The harmonization returned by Coconet is notated in color, carrying the alto, tenor, and bass voices.
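As a concrete illustration of the note-entry rule above, the sketch below reconstructs the behavior in TypeScript. This is our own reading of the described interaction, not the Doodle's source code; the types and function names are hypothetical.

```typescript
// A note placed on a beat is a quarter note by default; adding a
// note on the off-beat of the same beat shortens the existing
// quarter note to an eighth.
type NoteDuration = 'quarter' | 'eighth';

interface Note { sixteenthPosition: number; duration: NoteDuration; }

function addNote(notes: Note[], sixteenthPosition: number): Note[] {
  const onBeat = sixteenthPosition % 4 === 0;
  const beat = Math.floor(sixteenthPosition / 4);
  const updated: Note[] = notes.map(n => {
    // Shorten an existing quarter note that shares this beat.
    const sameBeat = Math.floor(n.sixteenthPosition / 4) === beat;
    return sameBeat && n.duration === 'quarter'
        ? {...n, duration: 'eighth' as NoteDuration} : n;
  });
  updated.push({sixteenthPosition, duration: onBeat ? 'quarter' : 'eighth'});
  return updated;
}
```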
4.2 Design challenges

Celebrating J.S. Bach's birthday using machine learning presented many unique design opportunities as well as some user experience challenges. One of the main goals was to empower people with the feeling that they could augment their own creativity in ways not previously thought possible, by allowing them to directly collaborate with a machine learning model. Another important goal was to convey the message that machine learning is not "magical" or incomprehensible, but rather a science that can be understood. Finally, a notable challenge was to ensure that these aims would be met for a large diversity of individuals, from children who have not yet learned to read to experts in music and technology.

In order to acquire early feedback on the design, user tests were employed. Over the course of the project, dozens of people were asked to play the demos and comment on their experience through both pre-defined questions and open comments. The first user test in the development process revealed that many people do not fully understand the concept of harmony, but fortunately, further testing showed that short animated musical examples were enough for people to comprehend these concepts quickly. Also, user tests indicated that only a small subset of people could read standard music notation. Our intuition was that using standard notation, rather than a grid-based interface, would be intuitive and frictionless to anybody only if the user interface (UI) was responsive with animations and sound, and also if the note input interface was kept simple. Further user testing of the standard notation input UI proved this to be correct.

In order to accommodate people of all ages and experiences, a common technique is to remove any advanced feature or unexpected delight from the core experience and instead integrate them as "Easter eggs". This allows people of all skill levels to enjoy the full core experience without feeling frustrated, while also giving the rarer advanced user more features. While the core experience primarily allows eighth notes and tempo changes, clicking on a special button in the background additionally allowed the user to add sixteenth notes and change the key – two features that are very confusing to those without musical backgrounds.


4.3 Reusable insights

For future projects, we have shown that if the technology being used is unfamiliar or perceived as "scary" to those who know little about it, tethering the experience to a familiar story and visuals can be a successful strategy. Most people have a limited understanding of musical concepts such as harmony and standard notation, but it is possible for people of all ages to quickly acquire an intuitive grasp of these concepts when they are demonstrated through short examples with animation and sound.

5. COCONET

5.1 A versatile model of counterpoint

Coconet [22] is the generative model of counterpoint behind the Bach Doodle's harmonizations; it can fill in arbitrarily incomplete scores.¹

¹ Blog post: https://magenta.tensorflow.org/coconet. Code: https://github.com/tensorflow/magenta/tree/master/magenta/models/coconet

Coconet represents counterpoint as a stack of piano rolls encoded in a binary three-dimensional tensor x ∈ {0,1}^{I×T×P}, where I, T, and P denote the number of instruments, time steps, and pitches respectively. x_{i,t,p} = 1 if the ith instrument plays pitch p at time t. Each instrument is assumed to play exactly one pitch at a time, therefore \sum_p x_{i,t,p} = 1 for all (i, t) positions. We also focus on four-part Bach chorales as used in prior work [2, 4, 8, 16, 18, 24], and assume I = 4 throughout.

Conventional approaches often factorize the joint distribution p(x) into conditional distributions of the form p(x_k \mid x_{<k}), where k indexes a sequence in some predetermined ordering. Coconet instead models the missing parts ¬C given an arbitrary context C of known notes, and is trained to minimize the negative log-likelihood of the masked-out positions:

\mathcal{L}(x; C) = -\frac{1}{|\neg C|} \sum_{(i,t) \in \neg C} \sum_{p} x_{i,t,p} \log p(x_{i,t,p} \mid x_C, C)
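To make the objective concrete, the following TensorFlow.js sketch computes this masked negative log-likelihood. It is only an illustration of the formula under our own naming: `pitchProbs` stands in for the model's per-position output p(x_{i,t,p} | x_C, C) and is not part of the released Coconet code.

```typescript
import * as tf from '@tensorflow/tfjs';

// Masked NLL over the positions not given as context (¬C).
// x:          binary piano rolls, shape [I, T, P]
// pitchProbs: model outputs p(x_{i,t,p} | x_C, C), shape [I, T, P]
// maskC:      1 where position (i, t) belongs to the context C, shape [I, T]
function maskedNll(
  x: tf.Tensor3D, pitchProbs: tf.Tensor3D, maskC: tf.Tensor2D
): tf.Scalar {
  return tf.tidy(() => {
    const notC = tf.sub(1, maskC);                  // positions to be predicted
    const logP = tf.log(tf.add(pitchProbs, 1e-7));  // small epsilon for safety
    const perPosition = tf.sum(tf.mul(x, logP), 2); // sum over pitches p -> [I, T]
    const maskedSum = tf.sum(tf.mul(perPosition, notC));
    return tf.neg(tf.div(maskedSum, tf.sum(notC))) as tf.Scalar; // 1/|¬C|
  });
}
```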

Figure 3. Coconet can be used in a wide range of musical tasks, such as completing partial scores, harmonizing melodies and generating from scratch.

In contrast to generating from left to right in one pass, Coconet uses Gibbs sampling to improve sample quality through rewriting (see [22] for convergence analysis). Figure 4 shows how the procedure iterates between filling in missing parts and then erasing other parts so that they can be rewritten given the updated context.

5.2 Improving and speeding up Coconet

The original Coconet uses dense convolutions, where filter weights and their mixing weights W are fully connected (Equation 1). This makes it unable to fully leverage GPU parallelization in TF.js (see Section 5.3.2).


Let s, q index the pitch and time dimensions in filters, and i, j the input and output channels. In dense convolutions, each output position (p, t, j), indexed by pitch, time and output channel, is a sum over the input channels and also over the positions in each filter (given by the neighborhood function). For a 3-by-3 filter this summing is over the 9 positions.

Y^{dense}_{p,t,j} = \sum_{i} \sum_{(s,q) \in \text{neighborhood}(p,t)} W_{s,q,i,j} X_{s,q,i}    (1)

In contrast, depthwise separable convolutions [6], shown in Equation 2, factorize the dense tensor W into a depthwise tensor V and a pointwise tensor U. As a result, the multiplications between V and X can be parallelized over the input channels i in the inner sum of Equation 3.

Y^{dsep}_{p,t,j} = \sum_{i} \sum_{(s,q) \in \text{neighborhood}(p,t)} U_{i,j} V_{s,q,i} X_{s,q,i}    (2)

              = \sum_{i} U_{i,j} \sum_{(s,q) \in \text{neighborhood}(p,t)} V_{s,q,i} X_{s,q,i}    (3)

To further speed up Coconet, we adopt dilated convolutions to grow the receptive field exponentially and thereby reduce the number of layers needed. As in [36], in each block the dilation factors double in each layer for both the pitch and time dimensions, and then the block repeats.

The original Coconet was trained on eight measures (T=128). However, the Bach Doodle is designed for two measures (T=32), so we retrained the model with the original architecture and saw that the loss increased from 0.57 to 0.62 (shown in Table 1), possibly because there is less context. Switching from dense to depthwise separable convolutions reduced the loss, requiring more filters but fewer layers. Since TensorFlow.js allows for parallelization across filters, this still resulted in much faster generation (see Section 5.3.2). Dilated convolutions reduced both the number of layers and the number of filters while also reducing the loss. The particular scheme we used is 7 blocks of dilation rates (1, 2, 4, 8, 16, 16) for the pitch dimension and (1, 2, 4, 8, 16, 32) for the time dimension.

Table 1. Comparing frame-wise negative log-likelihood (NLL) at the 16th-note resolution as in [22], and the generation time (in seconds) when the model was ported to TensorFlow.js (see Section 5.3.2 for details). The bottom three rows are all trained on two-measure (T=32) random crops.

Convolution type                | NLL  | Run time
--------------------------------|------|---------
Dense (T=128), 64L, 128f        | 0.57 | –
Dense (T=32), 64L, 128f         | 0.62 | > 40s
Depthwise separable, 48L, 192f  | 0.59 | 7s
Dilated, 45L (7 blocks), 128f   | 0.58 | ∼4s
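As a point of comparison, the TF.js sketch below builds both convolution types with stock ops. The shapes and dilation rates are illustrative, and the Doodle's production layers live in the open-source Magenta.js port rather than in this form.

```typescript
import * as tf from '@tensorflow/tfjs';

// Activations shaped [batch, pitch, time, channels].
const x = tf.randomNormal([1, 46, 32, 128]) as tf.Tensor4D;

// Equation 1: a dense convolution with a fully-connected
// weight tensor W over filter positions and channel pairs.
const w = tf.randomNormal([3, 3, 128, 128]) as tf.Tensor4D;
const yDense = tf.conv2d(x, w, 1, 'same');

// Equations 2-3: a depthwise tensor V and a pointwise tensor U,
// so the per-channel multiplications can run in parallel.
const v = tf.randomNormal([3, 3, 128, 1]) as tf.Tensor4D;   // depthwise V
const u = tf.randomNormal([1, 1, 128, 128]) as tf.Tensor4D; // pointwise U
// A dilation pair (here 4 in pitch, 8 in time) grows the
// receptive field exponentially across a block (Section 5.2).
const ySep = tf.separableConv2d(x, v, u, 1, 'same', [4, 8]);
```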
5.3 Porting Coconet to the Browser

JavaScript is the standard language for browser-based computation, but native JavaScript is too inefficient to handle the amount of computation required by Coconet in a reasonable time for the interaction we desired. TF.js is a library for GPU-accelerated machine learning. It makes use of WebGL² to leverage the parallel processing power of GPUs to speed up machine learning operations, supporting the development and training of models, as well as deployment of trained models in web browsers. By enabling users to run trained models directly in their web browsers, it alleviates the need for remote servers to run those models. This can enable faster, more interactive experiences between a user and a machine learning system.

While some models can easily be ported to TF.js using a conversion script, Coconet's Python TensorFlow implementation used some ops that did not yet exist in TF.js (e.g., cumsum), and we also needed the flexibility to optimize the performance of the model for our use case. We therefore manually re-implemented Coconet in TF.js ourselves and have made the code open source³. We also contributed missing ops to TF.js with WebGL fragment shader code for GPU acceleration.

² https://developer.mozilla.org/en-US/docs/Web/API/WebGL_API
³ https://tensorflow.github.io/magenta-js/music/classes/_coconet_model_.coconet.html

5.3.1 UI Challenges

TF.js makes use of the async/await pattern for access to outputs of models and individual TensorFlow operations. During inference, users receive a callback for when GPU operations have completed and the result is ready to be consumed. In this way, there is no blocking of the UI while waiting for model results. In practice, with large models like Coconet (which includes many repeated sampling steps of a deep network), it is still important to cede control back to the UI explicitly during the course of the model operations, which can be done with the tf.nextFrame() operation. Our op-by-op code port of the network allowed us to add these occasional UI breaks, which avoided a poor user experience where the page would freeze for multiple seconds during model prediction.
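A minimal sketch of such a loop follows; `predictAndResample` is a hypothetical stand-in for one Gibbs step of the real model, not its actual API.

```typescript
import * as tf from '@tensorflow/tfjs';

// Run repeated sampling steps while yielding to the browser so
// the page keeps painting between GPU dispatches.
async function harmonize(
  pianoroll: tf.Tensor3D,
  predictAndResample: (x: tf.Tensor3D, step: number) => tf.Tensor3D,
  numSteps = 64
): Promise<tf.Tensor3D> {
  let x = pianoroll;
  for (let step = 0; step < numSteps; step++) {
    const next = tf.tidy(() => predictAndResample(x, step));
    if (x !== pianoroll) x.dispose();  // free the previous iterate
    x = next;
    await tf.nextFrame();              // cede control back to the UI
  }
  return x;
}
```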

5.3.2 Performance Challenges

The initial port of Coconet to TF.js took over 40 seconds to do one harmonization. For a satisfying user experience, we needed to lessen this latency to below 5 seconds. While TF.js is able to take advantage of GPU acceleration, WebGL does not directly support the types of tensor operations used in deep learning. Instead, these operations must be implemented as shader programs, which were originally intended to compute the color of pixels during graphics rendering. This mismatch leads to inefficiencies that sometimes vary by operation. It turned out that the (unavoidably) inefficient shader implementations of convolutional layers were the main culprit. By switching to depthwise-separable convolutions, however, we were able to avoid many of these performance issues, reducing generation time to 7 seconds.

As we run Coconet through 64 Gibbs sampling steps, any improvement to operations that are used in this loop could lead to a significant saving. We wrote a custom operation using WebGL shaders to fuse together the operations used in our initial TF.js implementation of the annealing schedule. This schedule, by [37], is a sequence of simple element-by-element operations that is run on every sampling step during harmonization. Because of the simplicity of the operations (a scalar subtraction, multiplication, division, and a max operation), we were able to easily fuse them into a single operation that avoided the overhead of executing multiple shader programs on the GPU, speeding up inference by about 5%. The combined savings of adding depthwise-separable convolutions, shrinking the model by using dilated convolutions, and using the fused schedule operation resulted in a reduction of the model latency from 40s to 2s.
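For reference, the sketch below writes out the kind of annealed masking schedule from [37] that this fused operation computes; the constants are illustrative defaults, not the Doodle's tuned values.

```typescript
// Fraction of positions to re-mask at a given Gibbs step: one
// subtraction, multiplication, division, and max, matching the
// element-by-element ops that were fused into a single shader.
function maskProbability(
  step: number, numSteps: number,
  pMax = 0.9, pMin = 0.1, eta = 0.75  // assumed values for illustration
): number {
  return Math.max(pMin, pMax - (step * (pMax - pMin)) / (eta * numSteps));
}
```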

5.3.3 Download Size

Due to the number of users we intended to reach, as well as the variety of locations, devices, and bandwidth limits they would have, we needed to ensure the download size of the model weights was as small as possible. To achieve this goal, we implemented support for post-training weight quantization and contributed it to TF.js. This quantization compresses each float32 weight tensor by mapping the full range of its values down to 256 uniformly-spaced buckets in order to represent them as int8 values, which are then stored along with float32 min and scale values used to recover the range. During model initialization, the weights are converted back to float32 tensors using linear interpolation. By using this quantization, we were able to reduce the size of the downloaded weights by approximately 4×, resulting in a payload of ∼400KB without any noticeable sacrifice in quality.
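A minimal sketch of this scheme in plain TypeScript (our own illustration of the idea rather than the TF.js implementation):

```typescript
// Quantize a float32 weight tensor into 256 uniform buckets,
// keeping the float32 min and scale needed to recover the range.
function quantize(w: Float32Array): {q: Int8Array; min: number; scale: number} {
  let min = Infinity, max = -Infinity;
  for (const v of w) { min = Math.min(min, v); max = Math.max(max, v); }
  const scale = (max - min) / 255 || 1;  // guard against constant tensors
  const q = new Int8Array(w.length);
  for (let i = 0; i < w.length; i++) {
    q[i] = Math.round((w[i] - min) / scale) - 128;  // shift into int8 range
  }
  return {q, min, scale};
}

// At model-initialization time, map the buckets back to float32.
function dequantize(q: Int8Array, min: number, scale: number): Float32Array {
  const w = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) w[i] = (q[i] + 128) * scale + min;
  return w;
}
```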

5.4 Balancing Load Between TensorFlow.js and TPU

We ideally wanted to run the harmonization model completely on end-user devices using TF.js to avoid the need for serving infrastructure, which adds cost, effort, and additional points of failure. But the speed of harmonization differs by user device, with older and lower-end devices taking longer to run the TF.js model code. For devices where harmonization takes more than a few seconds, the harmonization is instead done by the cloud-served model. The first step in checking if a device can run harmonization locally is to check if WebGL is supported on the device, since that is required for using GPU acceleration through TF.js. If WebGL is supported, then we perform a speed test on the model, running a sample melody through its first four layers. If the latency of this model inference is below a set threshold, then the TF.js version is used. As there is overhead on the first inference of a model in TF.js, due to initial loading of the model weight textures onto the GPU, we actually run the speed test twice and use the second measurement to make the decision.
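A sketch of such a check is below; the probe shape, the `firstFourLayers` stand-in, and the threshold are our own placeholders, as the paper does not publish the exact cutoff.

```typescript
import * as tf from '@tensorflow/tfjs';

// Decide whether to harmonize locally in TF.js or fall back to
// the remote TPU-served model.
async function shouldRunLocally(
  firstFourLayers: (x: tf.Tensor3D) => tf.Tensor,
  thresholdMs = 2000  // assumed threshold for illustration
): Promise<boolean> {
  // WebGL is required for GPU acceleration in TF.js.
  if (tf.getBackend() !== 'webgl') return false;
  const probe = tf.zeros([4, 32, 46]) as tf.Tensor3D;  // a sample melody
  let elapsedMs = 0;
  // The first inference pays one-time costs (uploading weight
  // textures to the GPU), so run twice and keep the second timing.
  for (let trial = 0; trial < 2; trial++) {
    const start = performance.now();
    const y = firstFourLayers(probe);
    await y.data();  // force pending GPU work to complete
    y.dispose();
    elapsedMs = performance.now() - start;
  }
  probe.dispose();
  return elapsedMs < thresholdMs;
}
```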


6. DATASET RELEASE AND ANALYSIS

6.1 Data structure

Every user who interacted with the Bach Doodle had the opportunity to add their composition to a dataset. We make this entire dataset available at https://g.co/magenta/bach-doodle-dataset under a Creative Commons license. Of more than 55 million requests, the user-contributed dataset contains over 21.6 million miniature compositions. The compositions are split across 8.5 million sessions. Each session represents an anonymous user's interaction with the Bach Doodle over a single pageload and may contain multiple data points. Each data point consists of the user's input melody, the 4-voice harmonization returned by Coconet, as well as other metadata: the country of origin, the user's rating, the composition's key signature, how long it took to compose the melody, and the number of times the composition was listened to.
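Concretely, one data point can be pictured as a record like the one below; the field names are ours for illustration, and the authoritative schema ships with the dataset at the URL above.

```typescript
// Hypothetical shape of a single Bach Doodle dataset record.
interface BachDoodleDataPoint {
  inputMelody: number[];       // the user's melody, e.g. MIDI pitches over time
  harmonization: number[][];   // the 4-voice output returned by Coconet
  country: string;             // coarse country of origin (small ones grouped)
  rating?: 'good' | 'neutral' | 'poor';  // optional user feedback
  keySignature: number;        // the composition's key signature
  compositionTimeSec: number;  // how long the melody took to compose
  loopsListened: number;       // times the composition was listened to
}
```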

6.2 Analysis

We present some preliminary analysis of the dataset to shed some light on how users interacted with the doodle. Out of the 21.6 million sequences in the dataset, about 14 million (or 65.7%) are unique pitch sequences that are not repeated anywhere in the dataset (without considering timing information). Overall, the median amount of time spent composing a sequence was 25.5 seconds, and sequences were listened to for a median of 3 loops, with a total of 78.2 million loops listened to across the entire dataset. The sequences come from 109 different countries, with the United States, Brazil, Mexico, Italy, and Spain ranking in the top 5. Countries that had a small number of sequences were all grouped together in a separate category, to minimize the possibility of identifying users. While many sessions (∼20%) contained only one request for harmonization (shown in Figure 5), most sessions had 2 or more harmonizations, either of the same melody or of a different one. As shown in Figure 6, more than 5 million of the input sequences used the maximum number of notes in the default version of the doodle, which is 16. It is interesting to note that despite being an Easter egg, 7.6% of user sessions discovered the advanced mode that allowed them to enter longer sequences.

Figure 5. Histogram: number of requests per session.

Figure 6. Histogram: length of sequences.

The doodle has 3 presets: Twinkle Twinkle Little Star, Mary Had a Little Lamb, and the beginning of Bach's Toccata and Fugue in D Minor, BWV 565, which are the 3 most repeated sequences. However, there are also some surprising runner-ups, such as Beethoven's Ode to Joy, and Megalovania, a popular song from the game Undertale, as well as some regional hits.⁴ Overall, users enjoyed their harmonizations, with 53.8% of all compositions rated as "Good". Figure 7 gives the breakdown of user ratings.

⁴ Visit https://g.co/magenta/bach-doodle-dataset to interact with visualizations of the top repeated melodies overall and in each region, as well as regional unique hits.

Figure 7. Breakdown of user ratings on harmonized compositions.

6.3 Parallel Fifths and Parallel Octaves

The Coconet model that powered the Bach Doodle was trained to produce harmonizations in the style of Bach chorales, and one well-known characteristic of Bach's counterpoint writing is how carefully he followed the rule of avoiding parallel fifths (P5s) and parallel octaves (P8s). However, one complaint from advanced users of the app was the presence of P5s and P8s in the output of the model. Here, we present some analysis of how frequently and under what circumstances such outputs occurred. To identify the P5 and P8 occurrences, we used music21 [9].

First, we looked at how frequently P5s and P8s appeared in our training data. We were surprised to find that in the 382 Bach chorale preludes we used in our train and validation sets, there were 132 instances of P5s (0.023/measure) and 51 instances of P8s (0.009/measure). Given this prevalence, the model may learn to output this kind of parallel motion. However, many of these instances can be "excused" because they occur under special circumstances, such as at phrase boundaries or when using non-chord tones [10, 15]. Unfortunately, our training data does not include key signatures, time signatures, or fermatas, so the model likely learned to treat P5s as more permissible than was actually the case in Bach's music.

We then examined the output of the model to see if P5s/P8s occurred more frequently when user input was outside the training distribution, and if the absence of P5s/P8s was correlated with positive user feedback. We first split the output based on whether users gave positive feedback or not (non-positive feedback includes neutral, negative, and the absence of feedback). Next, we split based on whether user input was within the same pitch range as the soprano lines in the training data (MIDI pitches from 60 through 81) and whether the maximum delta between consecutive pitches exceeded that of the training data (1 octave).

In total, we found 15,816,599 P5s (0.365/measure) and 16,949,818 P8s (0.391/measure) in the model output. Results split into the four categories are shown in Figure 8. As hypothesized, P5s/P8s were more common when user input was out of distribution, and their absence correlated with positive user feedback. A Kruskal-Wallis H test for both the number of P5s and P8s showed that there is at least one statistically significant difference between the four categories with p < 1e−4. Further, Mann-Whitney rank tests between the categories showed significant differences, each with p < 1e−4. The correlation between fewer P5s/P8s and positive user feedback is particularly interesting. This could either indicate that users prefer music with fewer P5s/P8s, or it could simply mean that when the model produces poor output, P5s/P8s tend to be a feature of that output. In any case, the presence of P5s/P8s seems to be a useful proxy metric for model output quality. In future work, it could be a useful signal during training (similar to [23]), evaluation, or perhaps even during inference, where it could trigger additional Gibbs sampling steps.

Figure 8. Parallel fifths and octaves per measure, for the Bach training data and for model output split by whether the input was within the pitch/delta limits and whether feedback was positive.
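As a simplified illustration of what is being counted (the analysis itself used music21 [9]), the sketch below flags parallel perfect fifths or octaves between two voices given as per-step MIDI pitches; it is an approximation of the rule, not the paper's analysis code.

```typescript
// Count parallel fifths (interval class 7) or octaves (interval
// class 0) between two voices; assumes upper[t] >= lower[t].
function countParallel(upper: number[], lower: number[], intervalClass: 7 | 0): number {
  let count = 0;
  for (let t = 1; t < upper.length; t++) {
    const prev = (upper[t - 1] - lower[t - 1]) % 12;
    const curr = (upper[t] - lower[t]) % 12;
    const bothMoved = upper[t] !== upper[t - 1] && lower[t] !== lower[t - 1];
    // Both voices move while keeping the same perfect interval.
    if (bothMoved && prev === intervalClass && curr === intervalClass) count++;
  }
  return count;
}

// Example: two voices a perfect fifth apart moving in parallel.
console.log(countParallel([67, 69], [60, 62], 7)); // -> 1
```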
7. CONCLUSION

The Bach Doodle enabled large-scale participation in baroque-style counterpoint composition through an intuitive sheet-music interface assisted by machine learning. We hope this encourages more creative apps that allow novices and artists to interact with music composition and machine learning in approachable ways. With this paper, we are releasing a dataset of 21.6 million instances of human-computer collaborative miniature compositions, along with metadata such as user rating and country of origin. We hope the community will find it useful for ethnomusicological studies, music education, or improving machine learning models.


8. ACKNOWLEDGEMENTS

Many thanks to Ann Yuan, Daniel Smilkov and Nikhil Thorat from TensorFlow.js for their expert assistance. A big shoutout to Pedro Vergani, Rebecca Thomas, Jordan Thompson and others on the Doodle team for their contribution to the core components of the Doodle. Thank you Lauren Hannah-Murphy and Chris Han for keeping us on track. Thank you to Magenta colleagues for their support. Thank you Tim Cooijmans for co-authoring the blog post.

9. REFERENCES

[1] Celebrating Johann Sebastian Bach. https://www.google.com/doodles/celebrating-johann-sebastian-bach. Accessed: 2019-04-04.

[2] Moray Allan and Christopher K. I. Williams. Harmonising chorales by probabilistic inference. Advances in Neural Information Processing Systems, 17:25–32, 2005.

[3] Gérard Assayag, Georges Bloch, Marc Chemillier, Arshia Cont, and Shlomo Dubnov. OMax brothers: a dynamic topology of agents for improvization learning. In Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, pages 125–132. ACM, 2006.

[4] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In International Conference on Machine Learning, 2012.

[5] Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation – a survey. arXiv preprint arXiv:1709.01620, 2017.

[6] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.

[7] Ching-Hua Chuan and Elaine Chew. A hybrid system for automatic generation of style-specific accompaniment. In Proceedings of the 4th International Joint Workshop on Computational Creativity, pages 57–64. Goldsmiths, University of London, 2007.

[8] David Cope. Computers and Musical Style. Oxford University Press, 1991.

[9] Michael Scott Cuthbert and Christopher Ariza. music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval, 2010.

[10] Luke Dahn. Consecutive 5ths and octaves in Bach chorales. https://lukedahn.wordpress.com/2016/04/15/consecutive-5ths-and-octaves-in-bach-chorales/. Accessed: 2019-04-12.

[11] Mary Farbood and Bernd Schöner. Analysis and synthesis of Palestrina-style counterpoint using Markov chains. In Proceedings of the International Computer Music Conference, 2001.

[12] Morwaread M. Farbood, Egon Pasztor, and Kevin Jennings. Hyperscore: a graphical sketchpad for novice composers. IEEE Computer Graphics and Applications, 24(1):50–54, 2004.

[13] Jose D. Fernández and Francisco Vico. AI methods in algorithmic composition: A comprehensive survey. Journal of Artificial Intelligence Research, 48:513–582, 2013.

[14] Rebecca Anne Fiebrink. Real-time human interaction with supervised learning algorithms for music composition and performance. PhD dissertation, Princeton University, 2011.

[15] George Fitsioris and Darrell Conklin. Parallel successions of perfect fifths in the Bach chorales. In Musical Structure, page 52, 2008.

[16] Kratarth Goel, Raunaq Vohra, and J. K. Sahoo. Polyphonic music generation by modeling temporal dependencies using a RNN-DBN. In International Conference on Artificial Neural Networks, 2014.

[17] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: a steerable model for Bach chorales generation. In International Conference on Machine Learning, pages 1362–1371, 2017.

[18] Gaëtan Hadjeres, Jason Sakellariou, and François Pachet. Style imitation and chord invention in polyphonic music with exponential families. arXiv preprint arXiv:1609.05152, 2016.

[19] Dorien Herremans, Ching-Hua Chuan, and Elaine Chew. A functional taxonomy of music generation systems. ACM Computing Surveys (CSUR), 50(5):69, 2017.

[20] Dorien Herremans and Kenneth Sörensen. Composing fifth species counterpoint music with a variable neighborhood search algorithm. Expert Systems with Applications, 40(16):6427–6437, 2013.

[21] Cheng-Zhi Anna Huang, Sherol Chen, Mark Nelson, and Doug Eck. Mixed-initiative generation of multi-channel sequential structures. In International Conference on Learning Representations Workshop Track, 2018.

[22] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. In International Society for Music Information Retrieval, 2017.

[23] Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E. Turner, and Douglas Eck. Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. In Proceedings of the 34th International Conference on Machine Learning, pages 1645–1654. JMLR.org, 2017.

[24] Feynman Liang. BachBot: Automatic composition in the style of Bach chorales. Master's thesis, University of Cambridge, 2016.

[25] François Pachet. The Continuator: Musical interaction with style. Journal of New Music Research, 32(3):333–341, 2003.

[26] François Pachet and Pierre Roy. Musical harmonization with constraints: A survey. Constraints, 6(1):7–19, 2001.

[27] Alexandre Papadopoulos, Pierre Roy, and François Pachet. Assisted lead sheet composition using FlowComposer. In International Conference on Principles and Practice of Constraint Programming, pages 769–785. Springer, 2016.

[28] George Papadopoulos and Geraint Wiggins. AI methods for algorithmic composition: A survey, a critical view and future prospects. In AISB Symposium on Musical Creativity, volume 124, pages 110–117. Edinburgh, UK, 1999.

[29] Philippe Pasquier, Arne Eigenfeldt, Oliver Bown, and Shlomo Dubnov. An introduction to musical metacreation. Computers in Entertainment (CIE), 14(2):2, 2016.

[30] Adam Roberts, Curtis Hawthorne, and Ian Simon. Magenta.js: A JavaScript API for augmenting creativity with deep learning. 2018.

[31] Ian Simon, Dan Morris, and Sumit Basu. MySong: automatic accompaniment generation for vocal melodies. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 725–734. ACM, 2008.

[32] Daniel Smilkov, Nikhil Thorat, Yannick Assogba, Ann Yuan, Nick Kreeger, Ping Yu, Kangyi Zhang, Shanqing Cai, Eric Nielsen, David Soergel, et al. TensorFlow.js: Machine learning for the web and beyond. arXiv preprint arXiv:1901.05350, 2019.

[33] Bob L. Sturm, Oded Ben-Tal, Una Monaghan, Nick Collins, Dorien Herremans, Elaine Chew, Gaëtan Hadjeres, Emmanuel Deruty, and François Pachet. Machine learning research that matters for music creation: A case study. Journal of New Music Research, 48(1):36–55, 2019.

[34] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.

[35] Benigno Uria, Iain Murray, and Hugo Larochelle. A deep and tractable density estimator. In Proceedings of the International Conference on Machine Learning, 2014.

[36] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[37] Li Yao, Sherjil Ozair, Kyunghyun Cho, and Yoshua Bengio. On the equivalence between deep NADE and generative stochastic networks. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.