Deep Neural Models for Personalized Jazz Improvisation
BEBOPNET: DEEP NEURAL MODELS FOR PERSONALIZED JAZZ IMPROVISATIONS

Shunit Haviv Hakimi∗, Nadav Bhonker∗, Ran El-Yaniv
Computer Science Department, Technion – Israel Institute of Technology
[email protected], [email protected], [email protected]

ABSTRACT

A major bottleneck in the evaluation of music generation is that music appreciation is a highly subjective matter. When considering an average appreciation as an evaluation metric, user studies can be helpful. The challenge of generating personalized content, however, has been examined only rarely in the literature. In this paper, we address the generation of personalized music and propose a novel pipeline for music generation that learns and optimizes user-specific musical taste. We focus on the task of symbol-based, monophonic, harmony-constrained jazz improvisation. Our personalization pipeline begins with BebopNet, a music language model trained on a corpus of jazz improvisations by bebop giants. BebopNet is able to generate improvisations based on any given chord progression.¹ We then assemble a personalized dataset, labeled by a specific user, and train a user-specific metric that reflects this user's unique musical taste. Finally, we employ a personalized variant of beam search with BebopNet to optimize the generated jazz improvisations for that user. We present an extensive empirical study in which we apply this pipeline to extract individual models as implicitly defined by several human listeners. Our approach enables an objective examination of subjective personalized models whose performance is quantifiable. The results indicate that it is possible to model and optimize personal jazz preferences, and they offer a foundation for future research in the personalized generation of art. We also briefly discuss opportunities, challenges, and questions that arise from our work, including issues related to creativity.

1. INTRODUCTION

Since the dawn of computers, researchers and artists have been interested in utilizing them for producing different forms of art, and notably for composing music [1]. The explosive growth of deep learning models over the past several years has expanded the possibilities for musical generation, leading to a line of work that pushed forward the state of the art [2–6]. Another recent trend is the development of consumer services such as Spotify, Deezer and Pandora, which aim to provide personalized streams of existing music content. Perhaps the crowning achievement of such personalized services would be for the content itself to be generated explicitly to match each individual user's taste. In this work we focus on the task of generating user-personalized, monophonic, symbolic jazz improvisations. To the best of our knowledge, this is the first work that aims at generating personalized jazz solos using deep learning techniques.

The common approach for generating music with neural networks is essentially the same as for language modeling. Given a context of existing symbols (e.g., characters, words, music notes), the network is trained to predict the next symbol. Once the network has learned the distribution of sequences in the training set, it can generate novel sequences by sampling from the network output and feeding the result back into itself. The products of such models are sometimes evaluated through user studies (crowd-sourcing). Such studies assess the quality of generated music by asking users for their opinion and computing the mean opinion score (MOS). While these methods may measure the overall quality of the generated music, they tend to average out evaluators' personal preferences. Another, more quantitative but rigid approach to evaluating generated music is to compute a metric based on music-theoretic principles. While such metrics can, in principle, be defined for classical music, they are less suitable for jazz improvisation, which does not adhere to such strict rules.

To generate personalized jazz improvisations, we propose a framework consisting of the following elements: (a) BebopNet: jazz model learning; (b) user preference elicitation; (c) user preference metric learning; and (d) optimized music generation via planning.

As many jazz teachers would recommend, the key to attaining great improvisation skills is studying and emulating great musicians. Following this advice, we train BebopNet, a harmony-conditioned jazz model that composes entire solos. We use a training dataset of hundreds of professionally transcribed jazz improvisations performed by saxophone giants such as Charlie Parker, Phil Woods and Cannonball Adderley (see details in Section 4.1.1). In this dataset, each solo is a monophonic note sequence given in symbolic form (MusicXML), accompanied by a synchronized harmony sequence. After training, BebopNet is capable of generating high-fidelity improvisation phrases (this is a subjective impression of the authors). Figure 1 presents a short excerpt generated by BebopNet.

Figure 1. A short excerpt generated by BebopNet.

Considering that different people have different musical tastes, our goal in this paper is to go beyond straightforward generation by this model and to optimize the generation toward personalized preferences. For this purpose, we determine a user's preference by measuring their level of satisfaction throughout the solos using a digital variant of the continuous response digital interface (CRDI) [7]. This is accomplished by playing computer-generated solos (from the jazz model) for the user and recording their good/bad feedback in real time throughout each solo.

Once we have gathered sufficient data about the user's preferences, consisting of two aligned sequences (one for the solos and one for the feedback), we train a user preference metric in the form of a recurrent regression model that predicts this user's preferences. A key feature of our technique is that the resulting model can be evaluated objectively on held-out user preference sequences (along with their corresponding solos). A major hurdle in accomplishing this step is that the signal elicited from the user is inevitably extremely noisy. To reduce this noise, we apply selective prediction techniques [8, 9] to distill cleaner predictions from the user's preference model: we allow the model to abstain whenever it is not sufficiently confident. The fact that it is possible to extract a human continuous-response preference signal on musical phrases and use it to train (and test) a model with non-trivial predictive capabilities is interesting in itself (and, to the best of our knowledge, new).

Equipped with a personalized user preference metric (via the trained model), in the last stage we employ a vari-

¹ Supplementary material and numerous MP3 demonstrations of jazz improvisations over jazz standards and pop songs generated by BebopNet are provided at https://shunithaviv.github.io/bebopnet.

© Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv, "BebopNet: Deep Neural Models for Personalized Jazz Improvisations", in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada, 2020.

Proceedings of the 21st ISMIR Conference, Montréal, Canada, October 11–16, 2020.

2. RELATED WORK

Many different techniques for algorithmic musical composition have been used over the years. For example, some are grammar-based [11] or rule-based [1, 12], while others use Markov chains [13–15], evolutionary methods [16, 17] or neural networks [18–20]. For a comprehensive summary of this broad area, we refer the reader to [21]. Here we confine the discussion to closely related works that mainly concern jazz improvisation using deep learning techniques over symbolic data. In this narrower context, most works follow a generation-by-prediction paradigm, whereby a model trained to predict the next symbol is used to greedily generate sequences. The first work on blues improvisation [22] straightforwardly applied long short-term memory (LSTM) networks to a small training set. While their results may seem limited at a distance of nearly two decades, they were the first to demonstrate long-term structure captured by neural networks.

One approach to improving naïve greedy generation from a jazz model is to use a mixture of experts. For example, Franklin et al. [23] trained an ensemble of neural networks, one specialized for each melody, and selected from among them at generation time using reinforcement learning (RL) with a handcrafted reward function. Johnson et al. [24] generated improvisations by training a network consisting of two experts, each focusing on a different note representation; the experts were combined using the product-of-experts technique [25]. Other, remotely related non-jazz works have attempted to produce context-dependent melodies [2, 3, 5, 26–30].

A common method for collecting continuous measurements from human subjects listening to music is the continuous response digital interface (CRDI), first reported by [7]. CRDI has been used successfully to measure a variety of signals from humans, such as emotional response [31], tone quality and intonation [32], beauty in a vocal performance [33], preference for music of other cultures [34], and appreciation of the aesthetics of jazz music [35]. Using CRDI, listeners rate different elements of the music by adjusting a dial (similar to a volume control dial on an amplifier).
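The generation-by-prediction scheme described in the introduction (train a network to predict the next symbol, then sample from its output and feed the result back in as context) can be sketched as follows. This is a minimal illustration only: the toy bigram table stands in for a trained network, and all names here are hypothetical.

```python
import random

# Toy stand-in for a trained music language model: P(next symbol | previous symbol).
# A real model like the one in the paper conditions on the full note context and
# the chord progression; this table is purely illustrative.
BIGRAM = {
    "C4": {"D4": 0.5, "E4": 0.3, "rest": 0.2},
    "D4": {"E4": 0.6, "C4": 0.4},
    "E4": {"C4": 0.7, "rest": 0.3},
    "rest": {"C4": 1.0},
}

def generate(start, length, seed=0):
    """Autoregressive sampling: draw the next symbol from the model's output
    distribution, then feed it back as the context for the following step."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length):
        dist = BIGRAM[seq[-1]]
        symbols = list(dist)
        weights = list(dist.values())
        seq.append(rng.choices(symbols, weights=weights, k=1)[0])
    return seq
```

Calling `generate("C4", 8)` yields a nine-symbol sequence in which every transition has nonzero probability under the table; greedy generation would instead always pick the argmax symbol.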
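The selective prediction step, in which the preference model is allowed to abstain when it is not sufficiently confident, can be illustrated with a simple confidence threshold. This is a sketch of the general idea from the selective prediction literature, not the paper's exact mechanism, and the threshold value is arbitrary.

```python
def selective_predict(predictions, confidences, threshold=0.8):
    """Keep a prediction only where the model's confidence clears the
    threshold; abstain (None) otherwise, trading coverage for cleaner output."""
    return [p if c >= threshold else None
            for p, c in zip(predictions, confidences)]

def coverage(selected):
    """Fraction of inputs on which the model did not abstain."""
    return sum(p is not None for p in selected) / len(selected)

# Noisy preference scores with per-step confidences: the low-confidence
# middle prediction is dropped rather than trusted.
selected = selective_predict([0.9, 0.2, 0.7], [0.95, 0.50, 0.85])
```

Raising the threshold lowers coverage but, for a reasonably calibrated model, leaves a cleaner subset of predictions, which is exactly the trade-off used here to combat the noisy user signal.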
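The last stage, which the abstract describes as a personalized variant of beam search, builds on standard beam search: expand each partial sequence and keep only the top-scoring candidates at every step. The sketch below is generic and hypothetical; in the paper's setting the scoring function would be the learned user preference metric, whereas here a toy "prefer ascending lines" score stands in.

```python
def beam_search(init, expand, score, width=3, steps=4):
    """Generic beam search: at each step, expand every sequence in the beam
    and keep only the `width` highest-scoring candidates under `score`."""
    beam = [init]
    for _ in range(steps):
        candidates = [seq + [sym] for seq in beam for sym in expand(seq)]
        beam = sorted(candidates, key=score, reverse=True)[:width]
    return beam[0]

# Toy example: a tiny pitch alphabet and a score rewarding ascending motion.
PITCHES = [60, 62, 64, 65]  # MIDI note numbers: C4, D4, E4, F4
best = beam_search(
    init=[60],
    expand=lambda seq: PITCHES,
    score=lambda seq: sum(b - a for a, b in zip(seq, seq[1:])),
)
```

A larger `width` explores more alternatives per step at higher cost; `width=1` degenerates to greedy generation, which is precisely what planning-based generation tries to improve on.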