Random Forest Regression of Markov Chains for Accessible Music Generation
Vivian Chen Jackson DeVico Arianna Reischer
[email protected] [email protected] [email protected]
Leo Stepanewk Ananya Vasireddy Nicholas Zhang
[email protected] [email protected] [email protected]
Sabar Dasgupta*
[email protected]

New Jersey's Governor's School of Engineering and Technology
July 24, 2020
*Corresponding Author

Abstract—With the advent of machine learning, new generative algorithms have expanded the ability of computers to compose creative and meaningful music. These advances allow for a greater balance between human input and autonomy when creating original compositions. This project proposes a method of melody generation using random forest regression, which increases the accessibility of generative music models by addressing the downsides of previous approaches. The solution generalizes the concept of Markov chains while avoiding the excessive computational costs and dataset requirements associated with past models. To improve the musical quality of the outputs, the model utilizes post-processing based on various scoring metrics. A user interface combines these modules into an application that achieves the ultimate goal of creating an accessible generative music model.

Fig. 1. A screenshot of the user interface developed for this project.

I. INTRODUCTION

One of the greatest challenges in making generative music is emulating human artistic expression. DeepMind's generative audio model, WaveNet, attempts this challenge but requires large datasets and extensive training time to produce quality musical outputs [1]. Similarly, other music generation algorithms such as MelodyRNN, while effective, are also resource intensive and time-consuming. As a result, generative music models are relatively inaccessible for experimentation by music creators and the general public.

The objective of this project was to develop a generative music model that can be run and trained on a typical computer. This was accomplished by utilizing a machine learning algorithm that can train in a matter of minutes. A post-processing algorithm was used to create harmonies and improve the quality of the final output. A user interface, shown in figure 1, was developed to simplify the use of the model. For the purposes of this project, accessible is defined as simple to use, resource efficient, and applicable to new datasets.

II. BACKGROUND

A. History of Generative Music

The term "generative music," first popularized by English musician Brian Eno in the late 20th century, describes the process of creating music through algorithmic and computer-based methods. Although the phrase is relatively new, generative music originated in the 1950s with the work of Lejaren Hiller and Leonard Isaacson at the University of Illinois [2]. Hiller and Isaacson fed randomly generated notes into parameters enforcing basic music theory standards and stylistic rules, creating one of the first generative music models in the world [3]. Their work was followed by that of Iannis Xenakis, who took a stochastic, or randomized, approach to generating computer-based music. Xenakis used probability theories as the basis for his work and composed pieces like Atrées (1962) from his generated outputs [2].

B. WaveNet

The work of pioneers such as Hiller, Isaacson, and Xenakis was built upon by DeepMind with the creation of the WaveNet model. Introduced in 2016, WaveNet is a deep neural network that uses probabilities to generate raw audio waveforms. The joint probability of a raw audio waveform can be represented through the equation

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

where the probability of some waveform x relies on numerous conditional probabilities. Thus, every audio sample, represented by x_t, is dependent on the samples at prior timesteps (x_1, \ldots, x_{t-1}). The distribution of these conditional probabilities is modeled by a stack of convolutional layers, portrayed in figure 2. These layers apply filters to inputs to perform convolution operations, extracting features from spatial and time-series data [1].

Fig. 2. Visualization of a stack of causal convolutional layers [1].

The central components of the WaveNet model are causal convolutions and dilated convolutions. Causal convolutions ensure that the network maintains the order of the data: the prediction for each audio sample x_t, as shown in the previous equation, is not conditioned on samples at future timesteps (x_{t+1}, x_{t+2}, \ldots, x_T). Dilated convolutions increase the receptive field of the model. These convolutions apply a smaller filter over a larger area by skipping certain inputs, while still producing outputs of the same size as the inputs. In WaveNet, stacks of dilated convolutions, in which the dilation is doubled after every layer up to a certain limit, are used to increase both WaveNet's receptive field and model capacity [1].

WaveNet can produce outputs resembling human voices, music compositions, and other audio types; however, the network's complexity negatively impacts its accessibility, since it requires a large dataset to produce desirable results. One of the smallest datasets used in initial WaveNet experiments consisted of 24.6 hours of audio, which is still a considerable amount of data, as the network works directly with raw, uncompressed audio: every second, thousands of data points are processed by the model. Other experiments included up to 200 hours of audio, far exceeding the capabilities of ordinary computers. WaveNet thus requires massive compute power, making the model inaccessible for adaptation by potential users [1].

C. Melody RNN

A series of recurrent neural network (RNN) models, built on TensorFlow Magenta, were also released in 2016 under the name Melody RNN. Neural networks simulate the human brain by connecting a series of trainable artificial neurons; an RNN is a special type of neural network that recursively uses its outputs as inputs and accumulates information over time. The music generation models Lookback RNN and Attention RNN use long short-term memory (LSTM) cells and specific training set designs to maintain melodic structure throughout a composition. The algorithms are trained on thousands of MIDI files of popular music and can generalize concepts they have learned to extend melodies supplied by the user [4].

The defining aspect of Melody RNN models is the LSTM layer, which provides long-term memory to standard recurrent neural networks. Although an RNN retains information about previous numerical sequences, earlier memories are eventually lost, regardless of importance. LSTM addresses this issue with an input, output, and forget gate system that collectively regulates the cell state, or network memory. The forget gate is a sigmoid layer that determines which data to remove from the cell state. The sigmoid is a nonlinear function, defined by the formula below, that restricts the output to a range between 0 and 1 [4].

f(x) = \frac{1}{1 + e^{-x}}

Conversely, the input gate is a sigmoid layer that identifies information to be added to the cell state. The final output gate selects parts of the memory to pass on to the next network layer; this is accomplished by applying both a sigmoid layer and the hyperbolic tangent (tanh) function to the cell state and multiplying the outputs [4].

Fig. 3. The structure of the LSTM unit [5].

One of the models is Lookback RNN, which attempts to detect and reproduce patterns found across several measures. The algorithm considers the content of previous measures, the current position within the measure, and whether or not there is a repeating pattern in the input. Attention RNN also creates predictions based on earlier measures, but instead learns "attention" weights that set the importance of past data segments when calculating the output. Despite their differences, both of these approaches represent long-term structures and repeating melodic elements [4].

Although the models can generate realistic compositions, their computational requirements leave them inaccessible. The training set consists of thousands of MIDI files, making it difficult for a typical user to adapt the model to a new dataset and provide the processing resources necessary to run the networks [6].

D. Markov Chains

A Markov chain is a stochastic model that uses the current state to determine the occurrence probabilities of future states. Because it depends only on the current state, a Markov chain cannot maintain a repeated sequence across an entire output [7], [8]. This dependence on only the current state is known as the Markov property, given by the following equation:

P(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1) = P(X_n = x_n \mid X_{n-1} = x_{n-1})

where X_n is the current state, x_n is an arbitrary possible successive state, and x_{n-c} is an arbitrary prior state, where c is a natural number [9].

To generate music based on this model, the Markov chain observes transitions from the current musical note to the next and calculates probabilities for the next note's occurrence, as shown in figure 4. Based on the notes' weighted probability distributions, the model uses pseudo-random number generation to choose the next note [10].

E. Random Forest

Random forest is a machine learning technique that combines numerous decision trees, or tree predictors, to perform regression or classification. A standard, lone decision tree is a model that contains several nodes, each diverging according to the best split of all the variables being considered [11]. In a random forest, as shown in figure 5, the nodes of several parallel decision trees are split according to a subgroup of the dataset that varies by node. For some decision tree k, a random vector Θ_k is created to guide the selection of the data subset. Although the subsets are chosen from the same training data, the algorithm ensures that a vector is not correlated with any of the previous vectors Θ_1, \ldots, Θ_{k-1}. This results in the classifier h(x; Θ_k), where x is an arbitrary input and Θ_k is the corresponding random vector [12].
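As a toy illustration of the sigmoid gating described in section C, the snippet below applies f(x) = 1/(1 + e^{-x}) elementwise as a forget gate; the cell-state values and gate inputs are made up for illustration and are not taken from Melody RNN.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical cell state and forget-gate pre-activations (illustrative values).
cell = [0.5, -1.2, 3.0]
gate_inputs = [-6.0, 0.0, 6.0]

# A gate output near 0 erases the entry; near 1 preserves it.
gate = [sigmoid(v) for v in gate_inputs]
new_cell = [c * g for c, g in zip(cell, gate)]
print(gate)
print(new_cell)
```

Large negative pre-activations drive the gate toward 0 (forget), large positive ones toward 1 (keep), matching the 0-to-1 range described above.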
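The note-transition scheme of section D — count observed note-to-note transitions, normalize them into probabilities, then sample the next note by weighted pseudo-random choice — can be sketched as follows. This is a minimal sketch, not the paper's implementation; the training melody and note names are invented for illustration.

```python
import random
from collections import defaultdict

def train_markov_chain(notes):
    """Count note-to-note transitions and normalize into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(notes, notes[1:]):
        counts[current][nxt] += 1
    chain = {}
    for current, nexts in counts.items():
        total = sum(nexts.values())
        chain[current] = {note: c / total for note, c in nexts.items()}
    return chain

def generate(chain, start, length, seed=None):
    """Walk the chain, picking each next note by its weighted probability."""
    rng = random.Random(seed)
    melody = [start]
    for _ in range(length - 1):
        current = melody[-1]
        if current not in chain:  # dead end: no observed transitions
            break
        notes = list(chain[current])
        weights = [chain[current][n] for n in notes]
        melody.append(rng.choices(notes, weights=weights)[0])
    return melody

# Toy training melody (illustrative only).
training = ["C4", "E4", "G4", "E4", "C4", "E4", "G4", "C5"]
chain = train_markov_chain(training)
print(chain["E4"])                    # E4 -> G4 twice, E4 -> C4 once
print(generate(chain, "C4", 8, seed=0))
```

Each row of `chain` is one state's probability distribution over successors, which is exactly the per-note distribution the Markov property reduces the model to.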
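The ensemble idea of section E — train each tree k on its own randomly drawn subset of the training data (the role played by Θ_k) and combine the trees' outputs — can be sketched with one-split "stump" trees and bootstrap sampling. This is a simplified illustration of the random forest concept, not the regression model the paper builds; the data are invented.

```python
import random

def fit_stump(xs, ys):
    """A one-split decision tree: pick the threshold minimizing squared error."""
    best = None
    for split in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= split]
        right = [y for x, y in zip(xs, ys) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    if best is None:  # degenerate sample: fall back to the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Each tree trains on its own bootstrap sample (the random vector Theta_k)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Regression output: average the individual tree predictions."""
    return sum(t(x) for t in trees) / len(trees)

# Toy 1-D regression data (illustrative): y steps from ~0 to ~1 near x = 5.
xs = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
ys = [0.1, 0.0, 0.2, 0.1, 0.0, 1.0, 0.9, 1.1, 1.0, 0.9]
forest = fit_forest(xs, ys)
print(predict(forest, 2), predict(forest, 8))  # low vs. high
```

Averaging many trees trained on different random subsets is what distinguishes the forest from the single decision tree of [11]: no one tree's split dominates the prediction.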