Under review as a conference paper at ICLR 2017

SONG FROM PI: A MUSICALLY PLAUSIBLE NETWORK FOR POP MUSIC GENERATION

Hang Chu, Raquel Urtasun, Sanja Fidler
Department of Computer Science, University of Toronto
Toronto, ON M5S 3G4, Canada
{chuhang1122,urtasun,fidler}@cs.toronto.edu

ABSTRACT

We present a novel framework for generating pop music. Our model is a hierarchical Recurrent Neural Network, where the layers and the structure of the hierarchy encode our prior knowledge about how pop music is composed. In particular, the bottom layers generate the melody, while the higher levels produce the drums and chords. We conduct several human studies that show a strong preference for our generated music over that produced by the recent method by Google. We additionally show two applications of our framework: neural dancing and karaoke, as well as neural story singing.

1 INTRODUCTION

Neural networks have revolutionized many fields. They have not only proven to be powerful in performing perception tasks such as image classification and language understanding, but have also shown to be surprisingly good "artists". In Gatys et al. (2015), photos were turned into paintings by exploiting particular drawing styles such as Van Gogh's; Kiros et al. (2015) produced stories about images biased by writing style (e.g., romance books); Karpathy et al. (2016) wrote Shakespeare-inspired novels; and Simo-Serra et al. (2015) gave fashion advice.

Music composition is another artistic domain where neural-based approaches have been proposed. Early approaches exploiting Recurrent Neural Networks (Bharucha & Todd (1989); Mozer (1996); Chen & Miikkulainen (2001); Eck & Schmidhuber (2002)) date back to the 80's. The main variation between the different models is the representation of the notes and the outputs they produced, which typically encode melody and chord. Most of these approaches were single-track, in that they produced only one note per time step. The exception is Boulanger-Lewandowski et al. (2012), which generated polyphonic music, i.e., simultaneous independent melodies.

In this paper, we aim to generate pop music, where not only the melody but also chords and other instruments make up what is typically called a song. We draw inspiration from the Song from π by Macdonald¹, a piano video on YouTube, where the pleasing music is created from a sequence of digits of π. This video shows both the randomness and the regularity of music. On one hand, since any possible digit sequence appears as a substring of the π digit sequence, this implies that pleasing music can be created even from a totally random base signal. On the other hand, the composer uses specific rules, such as the A Harmonic Minor scale and harmonies, to convert the digit sequence into a music sheet. It is these rules that play the key role in converting randomness into music.

Following the ideas of Song from π, we aim to generate both the melody as well as accompanying effects such as chords and drums. Arguably, these turn even a not particularly pleasing melody into a well-sounding song. We propose a hierarchical approach, where each level is a Recurrent Neural Network producing a key aspect of the song. The bottom layers generate the melody, while the higher levels produce drums and chords. This enables the drum and chord layers to compensate for the melody in order to produce pleasing music. Adopting the key idea from Song from π, we condition our model on the scale type, allowing the melody generator to learn the notes that are typically played in a particular scale.

¹https://youtu.be/OMq9he-5HUU
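As a rough sketch of what such a hierarchy looks like (our own illustration, not the paper's exact architecture; the module names, layer sizes, and PyTorch framing are all assumptions), a melody-level RNN can run at every time step while chord and drum RNNs run on a coarser clock and condition on the melody states:

```python
import torch
import torch.nn as nn

class HierarchicalSongRNN(nn.Module):
    """Minimal sketch of a layered song model: a melody LSTM runs at every
    time step; chord and drum LSTMs run once every `coarse_rate` steps and
    condition on the melody hidden states. All sizes are illustrative."""

    def __init__(self, n_keys=37, n_chords=72, n_drums=100,
                 hidden=256, coarse_rate=4):
        super().__init__()
        self.coarse_rate = coarse_rate            # melody steps per chord/drum step
        self.melody_rnn = nn.LSTM(n_keys, hidden, batch_first=True)
        self.chord_rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.drum_rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.chord_out = nn.Linear(hidden, n_chords)
        self.drum_out = nn.Linear(hidden, n_drums)

    def forward(self, melody_onehot):
        # melody_onehot: (batch, T, n_keys), with T divisible by coarse_rate
        h_mel, _ = self.melody_rnn(melody_onehot)            # (batch, T, hidden)
        # Subsample melody states so the higher layers tick on a coarser clock.
        h_coarse = h_mel[:, self.coarse_rate - 1::self.coarse_rate]
        chords = self.chord_out(self.chord_rnn(h_coarse)[0])  # (batch, T/4, n_chords)
        drums = self.drum_out(self.drum_rnn(h_coarse)[0])     # (batch, T/4, n_drums)
        return chords, drums

model = HierarchicalSongRNN()
x = torch.zeros(1, 32, 37)        # a dummy 32-step melody
chords, drums = model(x)          # shapes (1, 8, 72) and (1, 8, 100)
```

The design point is simply that the accompaniment layers see the melody before they emit anything, which is what lets them compensate for it.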
We train our model on 100 hours of MIDI music containing user-composed pop songs and video game music. We conduct human studies with music generated with our approach and compare it against a recent approach by Google, showing that our songs are strongly preferred over the baseline. In our human study we also perform an ablation analysis of our model. We additionally show two new applications: neural dancing and karaoke, as well as neural story singing. As part of the first application we generate a stickman dancing to our music and lyrics that can be sung along with, while in the second application we condition on the output of Kiros et al. (2015), which writes a story about an image, and convert it into a pop song. We refer the reader to http://www.cs.toronto.edu/songfrompi/ for our demos and results.

2 RELATED WORK

Generating music has been an active research area for decades. It brings together machine learning researchers that aim to capture the complex structure of music (Eck & Schmidhuber (2002); Boulanger-Lewandowski et al. (2012)), as well as music professionals (Chan et al. (2006)) and enthusiasts (Johnson; Sun) that want to see how far a computer can get to being a real composer. Real-time music generation has also been explored for gaming (Engels et al. (2015)).

Early approaches mostly instilled knowledge from music theory into generation, by using rules of how music segments can be stitched together in a plausible way, e.g., Chan et al. (2006). On the other hand, neural networks have been used for music generation since the 80's (Bharucha & Todd (1989); Mozer (1996); Chen & Miikkulainen (2001); Eck & Schmidhuber (2002)). Mozer (1996) used a Recurrent Neural Network that produced pitch, duration and chord at each time step. Unlike most other neural network approaches, this work encodes music knowledge into the representation. Eck & Schmidhuber (2002) were the first to use LSTMs to generate both melody and chord. Compared to Mozer (1996), the LSTM captured more global music structure across the song.

Like us, Kang et al. (2012) built upon the randomness of melody by trying to accompany it with drums. However, in their model the scale type is enforced. No details about the model are given, and thus it is virtually impossible to compare to. Boulanger-Lewandowski et al. (2012) proposed to learn complex polyphonic musical structure which has multiple notes playing in parallel through the song. The model is single-track in that it only produces melody, whereas in our work we aim to produce multi-track songs. Just recently, Huang & Wu (2016) proposed a 2-layer LSTM that, like Boulanger-Lewandowski et al. (2012), produces music that is more complex than a single-note sequence, and is able to produce chords. The main novelty of our work over existing approaches is a hierarchical model that incorporates knowledge from music theory to build the neural architecture, and produces multi-track pop music (melody, chord, drum). We also present two novel fun applications.

3 CONCEPTS FROM MUSIC THEORY

We start by introducing the basic notation and definitions from music theory. A note defines the basic unit that music is composed of. Music follows the 12-tone system, i.e., 12 is the cycle length of all notes. The 12 tones are: C, C♯/D♭, D, D♯/E♭, E, F, F♯/G♭, G, G♯/A♭, A, A♯/B♭, B. A bar is a short segment of time that corresponds to a specific number of beats (notes). The boundaries of the bar are indicated by vertical bar lines.
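As a minimal illustration of the 12-tone cycle (our own sketch; the paper itself does not give code), a pitch class can be obtained by reducing a MIDI pitch number modulo 12:

```python
# The 12-tone system: note names repeat with a cycle length of 12,
# so every pitch reduces to one of 12 pitch classes.
TONES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
         "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def pitch_class(midi_pitch: int) -> str:
    """Name of a MIDI pitch, e.g. 60 (middle C) -> 'C'."""
    return TONES[midi_pitch % 12]

print(pitch_class(60))  # C
print(pitch_class(70))  # A#/Bb
```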
Scale is a subset of notes. There are four types of scales most commonly used: Major (Minor), Harmonic Minor, Melodic Minor and Blues. Each scale type specifies a sequence of relative intervals (or shifts) which act relative to the starting note. For example, the sequence for the scale type Major is 2 → 2 → 1 → 2 → 2 → 2 → 1. Thus, C Major specifies the starting note to be C, and applying the relative sequence of shifts yields: C −2→ D −2→ E −1→ F −2→ G −2→ A −2→ B −1→ C. The subset of notes specified by C Major is thus C, D, E, F, G, A, and B (a subset of seven notes). All scale types have a subset of seven notes except for Blues, which has six. In total we have 48 unique scales, i.e., 4 scale types and 12 possible starting notes. We treat Major and Minor as one type, as for every Major scale there is a Minor scale that has exactly the same set of notes. In music theory, this is referred to as the Relative Minor.

[Figure 1: Overview of our framework. Only skip connections for the current time step t are plotted. From top to bottom, the hierarchy consists of a Drum Layer (y_drm), a Chord Layer (y_chd), a Press Layer (y_prs), and a Key Layer (y_key), driven by the profile input x_prf; the drum and chord layers run on a coarser time scale than the press and key layers.]

Chord is a group of notes that sound good together. Similarly to scale, a chord has a start note and a type defining a set of intervals. There are six main types of triad chords: Major Chord, Minor Chord, Augmented Chord, Diminished Chord, Suspended 2nd Chord, and Suspended 4th Chord. The Circle of Fifths is often used to produce a chord progression. It maps the 12 chord starting notes onto a circle. When changing from one chord to another, moving to a nearby chord on the circle is often preferred, as this forms a strong chord progression that produces a sense of harmony.
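To make the scale and chord definitions concrete, here is a small self-contained sketch (our own illustration; the interval patterns for the minor and blues scales and the chord offsets are standard music theory, not taken from the paper, and the helper names are ours):

```python
TONES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Relative-interval patterns for the four scale types named in the text.
SCALE_PATTERNS = {
    "Major":          [2, 2, 1, 2, 2, 2, 1],
    "Harmonic Minor": [2, 1, 2, 2, 1, 3, 1],
    "Melodic Minor":  [2, 1, 2, 2, 2, 2, 1],  # ascending form
    "Blues":          [3, 2, 1, 1, 3, 2],     # six notes, not seven
}

# Semitone offsets from the root for the six triad chord types.
CHORD_TYPES = {
    "Major":         [0, 4, 7],
    "Minor":         [0, 3, 7],
    "Augmented":     [0, 4, 8],
    "Diminished":    [0, 3, 6],
    "Suspended 2nd": [0, 2, 7],
    "Suspended 4th": [0, 5, 7],
}

def scale_notes(start, pattern):
    """Apply the shifts to the start note; the last shift just closes the octave."""
    idx = TONES.index(start)
    notes = [TONES[idx]]
    for shift in pattern[:-1]:
        idx = (idx + shift) % 12
        notes.append(TONES[idx])
    return notes

def chord_notes(root, ctype):
    """Notes of a chord, as the root shifted by each offset of its type."""
    r = TONES.index(root)
    return [TONES[(r + o) % 12] for o in CHORD_TYPES[ctype]]

# Circle of fifths: each step moves the root up 7 semitones (a fifth).
CIRCLE = [TONES[(i * 7) % 12] for i in range(12)]

print(scale_notes("C", SCALE_PATTERNS["Major"]))  # ['C','D','E','F','G','A','B']
print(len(SCALE_PATTERNS) * len(TONES))           # 48 unique scales
print(chord_notes("C", "Major"))                  # ['C', 'E', 'G']
print(CIRCLE[:4])   # ['C', 'G', 'D', 'A'] -- neighbours make strong progressions
```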