Automatic Playlist Generation

Xingting Gong and Xu Chen
Stanford University
[email protected] [email protected]

I. Introduction

Digital music applications have become an increasingly popular means of listening to music. Applications such as Spotify allow the user to add songs to a playlist without downloading them to his/her computer. The user can also be recommended songs through Spotify's "Discover" option. Pandora, an online radio service, generates a radio station based on a single user-inputted artist, genre, or composer. For these types of applications, algorithms that learn user preferences are extremely important. In this report, we explore two different methods to generate automatic playlists based on user input:

1. Gaussian Process Regression (GPR): This method takes in a set of seed songs (which can contain as little as a single song) inputted by the user and trains a preference function that predicts the user's preference for a new song under consideration for the playlist.

2. SVM + HMM: For this method, we assume that the user has generated a large number of seed songs (e.g. a user has liked hundreds of songs on Pandora over the course of a year). We also require a set of low-preference songs (e.g. the user has skipped hundreds of songs on Pandora over the same year). With a large training set of labelled data, we can apply classification algorithms such as SVM to determine whether a new song will be liked or disliked by the user. Because we believe timbre to be an important predictor for music, we combine the SVM with an HMM to model the timbre spectra.

These methods are described in much greater detail in section IV.

II. Related Work

Automatic playlist generation can be a difficult task because the user will often provide only a few seed songs. With such a small training set, it can be difficult to train a sensible similarity metric. A paper by Zheng et al. at Microsoft Corporation devised a novel "Kernel Meta-Training" (KMT) method to mitigate the problem of a small training set [1]. Instead of designing a machine learning problem that trains only on the user-provided seeds, the Microsoft group gathered 174,577 songs from 14,198 albums as a kernel "meta"-training set. The idea is that songs placed on the same album are similar, so it is appropriate to train the similarity metric on the full set of 174,577 songs. However, their selection of musical features was mostly qualitative, and consisted of: genre (e.g. jazz, rap), subgenre (e.g. heavy metal), mood (e.g. angry, happy), style (e.g. East Coast rap, Gangsta rap), rhythm type (e.g. swing, disco), rhythmic description (e.g. funky, lazy), and vocal code (e.g. duet, instrumental). We felt that some of these features were not well defined and seemed redundant in the musical qualities they were trying to capture. The example features classified under genre, subgenre, and style, for instance, seem interchangeable amongst the three categories. For our first method, we aim to expand upon the Microsoft group's work by applying KMT to data from the Million Song Dataset (described in section III). Briefly, the Million Song Dataset provides quantitative features such as tempo in beats per minute and loudness in decibels, which we use instead as features to train the similarity metric.

Our second method falls into the more standard category of binary classification problems. Notably, we want to expand upon a standard SVM by modeling the timbre sequence as an HMM. There has been a great deal of research on using timbre to analyze complex instrumental textures and even rhythmic styles [2-4]. Due to the importance of timbre in characterizing sound, we believe that an SVM armed with an HMM can be an effective classifier on large training sets.

III. Dataset and Features

We obtained our data from the Million Song Dataset, a dataset compiled by Columbia University's Laboratory for the Recognition and Organization of Speech and Audio and The Echo Nest. The full dataset consists of audio features and metadata for one million popular songs. For the sake of practicality, we downloaded the 10,000-song subset. The dataset contains approximately 54 features per track. Of this set we hand-selected a subset of 5 features upon which to perform our analysis:

1. Genre: Each track can carry anywhere from zero to multiple user-supplied genre tags from the MusicBrainz website. To keep our analysis simple, we randomly assigned each track to one of its genre tags. Each track is therefore labelled by an integer that corresponds to a particular genre.

2. Tempo: The estimated tempo in BPM. To discretize this feature, we binned the tempo values by increments of 20.

3. Loudness: The average loudness of the song in dB. We binned these values by increments of 5 dB.

4. Decade: We included the decade in which the song was released as a feature. The motivation is that songs produced in the same decade tend to sound similar in style.

5. Timbre: Timbre is represented as a 12 × N matrix of Mel-frequency cepstral coefficients (MFCCs), where N is the number of segments. Each column of the matrix is thus a 12-dimensional vector containing the 12 MFCCs of a particular time segment. We processed timbre differently for each of our two methods:

   GPR: For this method, we limited ourselves to tracks with N ≥ 200 and randomly selected 200 columns of the timbre matrix for these tracks.

   SVM + HMM: Since each column of the timbre matrix is obtained from a different time segment, it makes more sense to represent the timbre matrix as a hidden Markov model. We trained an HMM for each of the two song sets representing songs "liked" and "disliked" by the user (e.g. added or not added to the user's Spotify playlist from the application's list of recommendations). The log-likelihoods of each track under each of the HMM models are then used as features for the SVM.

Table 1: Feature Vector Examples

| Feature | Example raw values | Discretized/processed values |
|---|---|---|
| Genre | rock, indie, pop, hip-hop, country, jazz, metal, folk, rap, dance | integer values 0 to 9 |
| Tempo | 130.861, 122.174, 80.149 | 6, 6, 4 |
| Year | 1989, 1999, 2007 | 198, 199, 200 |
| Avg. loudness | -13.366, -7.928, -15.367 | -2, -1, -3 |
| Timbre (GPR) | 12 × N matrix of MFCCs | 200 randomly selected columns |
| Timbre (SVM+HMM) | 12 × N matrix of MFCCs | log-likelihoods under each HMM |
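To make the binning rules in items 2-4 concrete, here is a minimal Python sketch (our own illustration, not code from the report; the function name is ours) that reproduces the discretized values in Table 1:

```python
def discretize_track(genre_id, tempo_bpm, loudness_db, year):
    """Bin the raw Million Song Dataset values as described above:
    tempo in increments of 20 BPM, loudness in increments of 5 dB,
    and release year collapsed to a decade."""
    tempo_bin = int(tempo_bpm / 20)      # 130.861 -> 6, 80.149 -> 4
    loudness_bin = int(loudness_db / 5)  # -13.366 -> -2 (truncates toward 0)
    decade = int(year / 10)              # 1989 -> 198
    return [genre_id, tempo_bin, loudness_bin, decade]
```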
IV. Methods

Gaussian Process Regression: The first of our main methods makes use of Gaussian Process Regression. In a Gaussian process (GP), any finite subset of the points in our domain space satisfies a multivariate Gaussian distribution. For automatic playlist generation, our domain space consists of the possible user preferences f for the songs whose preference we wish to predict. To simplify calculations, the mean of the GP is often assumed to be 0, which makes sense in our case since it is reasonable to assume that, in the space of all songs, a user will probably not want to listen to most of them.

Let $\text{seedSongs} = \{x_i\}_{i=1}^{N}$ be the set of user-inputted songs that serve as the "seed" around which we will generate a playlist, where $x_i$ denotes the feature vector for seed song $i$. Let $f_i$ denote the true user preference for these songs (though $f_i$ can in principle take on any real value, for simplicity we assume it is approximately 1 with noise $\sigma$ if the user selects the song as a seed). Let $f_*$ denote the user preference for some song $x_*$ that we want to predict. Then the joint distribution $[f_i, f_*]$ is Gaussian:

$$\begin{bmatrix} f_i \\ f_* \end{bmatrix} \sim \mathcal{N}(0, K) \tag{1}$$

where $K$ is the covariance matrix. Since $[f_i, f_*]$ is jointly Gaussian, the conditional distribution $P(f_* \mid f_i)$ is therefore also Gaussian with parameters:

$$P(f_* \mid f_i) \sim \mathcal{N}(\mu, \Sigma) \tag{2a}$$

$$\mu = K(x_i, x_*)\, K(x_i, x_i)^{-1} f_i \tag{2b}$$

$$\Sigma = K(x_*, x_*) - K(x_i, x_*)\, K(x_i, x_i)^{-1} K(x_*, x_i) \tag{2c}$$

The preference function $f_*$ is then obtained by taking the posterior mean of this conditional distribution, resulting in:

$$f_* = \sum_{i=1}^{N} \alpha_i K(x_i, x_*) \tag{3a}$$

$$\alpha_i = \sum_{j=1}^{N} \left[ \left( K(x_i, x_j) + \sigma^2 \delta_{ij} \right)^{-1} \right]_{ij} \tag{3b}$$

where the inverse in (3b) is a matrix inverse, and the seed preferences $f_j \approx 1$ have been absorbed into the row sum.

Lastly, since $\sigma$ is the noise in our GP, it is obtained by maximizing the log-likelihood of obtaining the set of seed songs:

$$\log p(f \mid \sigma) = -\frac{1}{2} f^T K^{-1} f - \frac{1}{2} \log |K| - \frac{N}{2} \log 2\pi \tag{4}$$

Thus, in order to learn the preference function we must obtain a kernel $K(x, y)$ that will serve as our similarity metric between two songs $x$ and $y$. Once we have the preference function, selecting a playlist becomes as easy as computing the preference of each song under consideration and ranking the top M.
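As a minimal sketch of the prediction step in equations (3a)-(3b), assuming some kernel function `kernel(x, y)` is available (all names here are our own; in the report σ is fit via equation (4), whereas this sketch takes it as given):

```python
import numpy as np

def gpr_preference(seeds, x_star, kernel, sigma):
    """Posterior-mean preference f* for a candidate song x_star,
    following equations (3a)-(3b); `seeds` holds the seed feature
    vectors, and each seed preference is taken to be f_i ~= 1."""
    N = len(seeds)
    K = np.array([[kernel(xi, xj) for xj in seeds] for xi in seeds])
    # alpha = (K + sigma^2 I)^{-1} f  with  f = (1, ..., 1)
    alpha = np.linalg.solve(K + sigma**2 * np.eye(N), np.ones(N))
    k_star = np.array([kernel(xi, x_star) for xi in seeds])
    return float(alpha @ k_star)

# Playlist selection: score every candidate song with gpr_preference
# and keep the top M by predicted preference.
```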

For our project we tried two kernels: the first must be learned (a method called "Kernel Meta-Training"), and the second is a simple Hamming kernel against which we compare our results.

Kernel Meta-Training (KMT): The idea behind KMT is to use pre-defined playlists to learn a similarity metric K between any two songs. In our case, we separated the 10,000 songs of the Million Song Subset into a kernel-meta-training set, a seed set, and a test set. The test set consists of the "new" songs that will be considered for playlist generation. The KMT set constitutes the largest fraction of our dataset and is divided further into subsets based on some feature in order to train K; in our case, we chose to partition the KMT set by genre. In contrast, the seed set is usually very small, containing as little as one song.

The kernel we chose to learn was inspired by the one used in Zheng et al. [1], and is defined as follows:

$$K(x, y) = \sum_{n=1}^{N_\psi} \beta_n \psi_n(x, y) \tag{5}$$

where $\psi_n$ is a family of Mercer kernels indexed by $n$, and $N_\psi$ is defined below. In order to describe $\psi_n$ clearly, we first make an observation about the timbre feature. For the first 4 features, a direct comparison can be made between any pair of values (e.g. two songs have similar tempos if they fall within the same increment of 20 BPM). However, the timbre feature was processed into a still fairly large matrix of dimensions 12 × 200. To compare two timbre features $x_5$ and $y_5$, we take the Frobenius norm of the difference between the two timbre matrices:

$$\text{Norm}(x_5, y_5) = \left\| x_5 - y_5 \right\|_F \tag{6}$$

Two timbre matrices are then declared similar if this norm is below a threshold value, which we computed by taking the average norm over all pairs in the pre-made playlists. Now we are ready to define $\psi_n$:

$$\psi_n(x, y) = \begin{cases} 1 & \text{if } (a_{nl} = 0 \text{ or } x_l = y_l)\ \forall\, l \le 4, \text{ and } a_{n5} = 0 \text{ or } \text{Norm}(x_5, y_5) < \text{threshold} \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

where the vector $a_n$ is the binary representation of the integer $n$. To understand this more intuitively, $a_n$ can be thought of as a mask that allows us to compare a subset of features at a time. In other words, $\psi_n$ evaluates to 1 only when the components of $x$ and $y$ are exactly equal (or, for timbre, within the threshold) wherever the corresponding component of $a_n$ is 1. Since we have 5 features, we have a total of $2^5$ subsets of features to consider, and so $n$ ranges from 0 to $N_\psi = 2^5$.

Finally, to train K we solve for the coefficients $\beta_n$ by minimizing the cost function:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2} \sum_{i,j=1}^{N} \Big( \hat{K}_{ij} - \sum_{n=1}^{N_\psi} \beta_n \psi_n(x_i, x_j) \Big)^2 \tag{8}$$

where $\hat{K}_{ij}$ is the empirical covariance, given by

$$\hat{K}_{ij} = \begin{cases} \frac{1}{\#\,\text{genres}} = \frac{1}{9} & \text{if songs } x_i, x_j \text{ are in the same genre} \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Thus, the advantage of kernel-meta-training is that we can effectively meta-train on a much larger sample, allowing us to obtain a more accurate similarity metric.
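Because the cost in equation (8) is quadratic in β, fitting the KMT kernel reduces to an ordinary least-squares problem over the 32 mask kernels. The sketch below is our own Python illustration, not the report's code: `timbre_sim` is assumed to be a precomputed Boolean table of the thresholded Frobenius comparisons from equation (6), and all helper names are ours.

```python
import numpy as np
from itertools import product

N_PSI = 2 ** 5  # one mask kernel for each subset of the 5 features

def psi(n, x, y, timbre_similar):
    """psi_n(x, y) from equation (7): 1 iff every feature selected by
    the binary mask of n matches (features 1-4 exactly; feature 5,
    timbre, via the Frobenius-norm threshold)."""
    mask = [(n >> l) & 1 for l in range(5)]
    for l in range(4):                  # the four discrete features
        if mask[l] and x[l] != y[l]:
            return 0.0
    if mask[4] and not timbre_similar:  # the timbre feature
        return 0.0
    return 1.0

def fit_beta(features, genres, timbre_sim, n_genres=9):
    """Least-squares solution of equation (8) against the empirical
    covariance K_hat of equation (9)."""
    pairs = list(product(range(len(features)), repeat=2))
    Psi = np.array([[psi(n, features[i], features[j], timbre_sim[i][j])
                     for n in range(N_PSI)] for i, j in pairs])
    K_hat = np.array([1.0 / n_genres if genres[i] == genres[j] else 0.0
                      for i, j in pairs])
    beta, *_ = np.linalg.lstsq(Psi, K_hat, rcond=None)
    return beta
```

For the full 7,000-song KMT set the pair matrix is far too large to materialize at once, so in practice one would accumulate the normal equations over pairs incrementally; the sketch is intended for small subsets.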

Generating the seed, test, and KMT sets: Our KMT set consists of 7,000 randomly picked songs from the Million Song Subset. To generate a seed set of size N from the remaining 3,000 songs, we first randomly selected a single track to serve as the initial seed. We then compared its features to the other songs in the remaining set. If the initial seed song $s_0$ either shares an artist with a second song $s$, has 3 identical non-timbre features, or has 2 identical non-timbre features and similar timbre features (i.e. Norm < threshold), then $s_0$ and $s$ are declared "similar". However, in order to mimic user behavior more accurately, we also introduce randomness into the seed generation. For every song $s$ we compare to $s_0$, we generate a random real number $r \in [0, 1]$. If $s$ and $s_0$ are similar and $r \ge 0.3$, $s$ is added to the seed set; if $s$ and $s_0$ are dissimilar and $r < 0.3$, $s$ is also added to the seed set. Finally, the remaining $3{,}000 - N$ songs comprise the test set.

Hamming Kernel: To compare against the results we obtain with the KMT method, we also applied a simple Hamming kernel (no training required) to GPR:

$$K_{\text{ham}}(x, y) = \sum_{i=1}^{4} \mathbf{1}\{x_i = y_i\} + \mathbf{1}\{\text{Norm}(x_5, y_5) < \text{threshold}\} \tag{10}$$

This kernel simply counts the number of matching features between the two feature vectors (and, for timbre, checks whether the two matrices are similar).
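The Hamming kernel of equation (10) needs no training; a direct transcription in Python (our own sketch, with `frobenius_threshold` standing in for the average-norm threshold described above):

```python
import numpy as np

def hamming_kernel(x, y, timbre_x, timbre_y, frobenius_threshold):
    """K_ham(x, y): count of matching discrete features, plus one if
    the two 12 x 200 timbre matrices fall within the threshold."""
    matches = sum(int(x[i] == y[i]) for i in range(4))
    timbre_dist = np.linalg.norm(timbre_x - timbre_y)  # Frobenius norm, eq. (6)
    return matches + int(timbre_dist < frobenius_threshold)
```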

SVM + HMM: Our desire to give a more reasonable treatment to the timbre feature led us to explore a combined SVM and HMM model. In the case of a very small seed set, training an HMM on a set that may contain as little as a single song does not make much sense. However, it is reasonable to suppose that over time a user may have accumulated a large playlist of preferred songs while also rejecting other songs in the process (e.g. from a recommended playlist on Spotify, the user only chooses a fraction of the songs; or the user either likes or skips each song on Pandora). In this case we have two training sets: a set of "likes" and a set of "dislikes". We can then train an HMM on each of these sets, and the log-likelihoods of a training song under each HMM model become the timbre features for the SVM.

Generating the Training Sets: In order to apply an SVM we must first separate our training data into two sets, one labelled "like" and the other "dislike", denoted $S_L$ and $S_D$ respectively. To obtain these playlists, we selected an initial seed song $s_0$ and compared its features to the other songs in the dataset: if a song $s$ has 3 non-timbre features in common with $s_0$, then $s$ is similar to $s_0$. We also added the same random component used when generating the seed set for GPR, so the same rules apply for adding $s$ to $S_L$ or $S_D$.

Pre-processing Timbre: To make the size of the timbre matrices uniform across all training songs, we averaged consecutive columns to obtain a "compressed" timbre matrix of dimension 12 × 200. For example, if the original track had N = 612 segments, then each of the first 199 columns of the new matrix is the average of $\lfloor 612/200 \rfloor = 3$ consecutive columns, and the last column is the average of the remaining 15 columns.

Gaussian HMM: Since MFCCs do not take on discrete values, we model the emission probability at each state by a multivariate Gaussian, where µ has dimension 12 × 1 and the covariance matrix has dimension 12 × 12. We used an HMM toolkit for Matlab to train the HMMs [6]. We experimented with different numbers of hidden states, and picked the state counts $L_{\text{hidden}} = 6$ and $D_{\text{hidden}} = 7$ that maximized the log-likelihoods of the training sets $S_L$ and $S_D$. We then obtained the timbre features for a song $x$ by computing the log-likelihoods $\log p(x \mid \text{HMM}_L)$ and $\log p(x \mid \text{HMM}_D)$.
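The whole timbre pipeline fits in a short sketch. Note the hedge: the report used a Matlab HMM toolkit [6], whereas below we assume Python with `hmmlearn` and `scikit-learn` as substitutes, and all function names are our own.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumed stand-in for the Matlab toolkit [6]
from sklearn.svm import SVC

def compress_timbre(timbre, target=200):
    """Average consecutive columns of a 12 x N timbre matrix (N >= target)
    down to 12 x target; the last column absorbs the leftover segments
    (e.g. the last 15 columns when N = 612)."""
    N = timbre.shape[1]
    step = N // target
    cols = [timbre[:, i * step:(i + 1) * step].mean(axis=1)
            for i in range(target - 1)]
    cols.append(timbre[:, (target - 1) * step:].mean(axis=1))
    return np.column_stack(cols)

def fit_hmm(tracks, n_states):
    """Train one Gaussian-emission HMM (full 12 x 12 covariances) on a set
    of tracks; hmmlearn expects observations as rows, so each compressed
    12 x 200 matrix is transposed."""
    X = np.vstack([compress_timbre(t).T for t in tracks])
    lengths = [200] * len(tracks)
    return GaussianHMM(n_components=n_states, covariance_type="full").fit(X, lengths)

def timbre_features(track, hmm_like, hmm_dislike):
    """The two SVM features for a song x: log p(x|HMM_L) and log p(x|HMM_D)."""
    X = compress_timbre(track).T
    return [hmm_like.score(X), hmm_dislike.score(X)]

# hmm_L, hmm_D = fit_hmm(S_L, n_states=6), fit_hmm(S_D, n_states=7)
# X = [timbre_features(t, hmm_L, hmm_D) for t in S_L + S_D]
# y = [1] * len(S_L) + [0] * len(S_D)
# clf = SVC(kernel="linear").fit(X, y)  # linear kernel, as noted in section VII
```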
V. Results

GPR: We tested both the KMT and Hamming kernels on two seed sets, one containing a single seed and the other containing 3 seeds. The results are tabulated below.

Table 2: Top 5 Songs for GPR + KMT using 1 Seed

| | Title | Artist | Genre | Decade | Tempo | Loudness |
|---|---|---|---|---|---|---|
| Seed | House of Pain | Van Halen | Rock | 1980 | 200 | -10 |
| Playlist | Tell Your Momma Come | Black Eyed Peas | Pop | 2000 | 200 | -5 |
| | October | U2 | Rock | 1980 | 80 | -5 |
| | Goin' To The River | Alice Cooper | Rock | 1980 | 140 | -10 |
| | Don't Waste My Time | John Mayall | Pop | 1970 | 200 | -10 |
| | On The Road Again | The Sonics | Pop | 1960 | 140 | -10 |

Table 3: Top 5 Songs for GPR + KMT using 3 Seeds

| | Title | Artist | Genre | Decade | Tempo | Loudness |
|---|---|---|---|---|---|---|
| Seeds | House of Pain | Van Halen | Rock | 1980 | 200 | -10 |
| | Harajuku Girls | Gwen Stefani | Pop | 2000 | 100 | -5 |
| | Ease | Public Image Ltd | Rock | 1980 | 180 | -10 |
| Playlist | Performance | Happy Mondays | Rock | 1980 | 100 | -5 |
| | Laser Show | Fountains of Wayne | Rock | 1990 | 100 | -5 |
| | Looking for A Kiss | Sex Pistols | Rock | 1970 | 120 | -5 |
| | The End | My Chemical Romance | Pop | 2000 | 80 | -5 |
| | Till We Ain't Strangers Anymore | Bon Jovi | Metal | 2000 | 140 | -5 |

Table 4: Top 5 Songs for GPR + Hamming using 1 Seed

| | Title | Artist | Genre | Decade | Tempo | Loudness |
|---|---|---|---|---|---|---|
| Seed | House of Pain | Van Halen | Rock | 1980 | 200 | -10 |
| Playlist | Out Of This World | Black Flag | Rock | 1980 | 200 | -10 |
| | Don't Waste My Time | John Mayall | Pop | 1970 | 200 | -10 |
| | Stumble | R.E.M. | Rock | 1980 | 140 | -10 |
| | Still a Fool | Groundhogs | Pop | 1960 | 120 | -10 |
| | Goin' to the River | Alice Cooper | Rock | 1980 | 140 | -10 |

Table 5: Top 5 Songs for GPR + Hamming using 3 Seeds

| | Title | Artist | Genre | Decade | Tempo | Loudness |
|---|---|---|---|---|---|---|
| Seeds | House of Pain | Van Halen | Rock | 1980 | 200 | -10 |
| | Harajuku Girls | Gwen Stefani | Pop | 2000 | 100 | -5 |
| | Ease | Public Image Ltd | Rock | 1980 | 180 | -10 |
| Playlist | You | Radiohead | Rock | 2000 | 100 | -10 |
| | Start Me Up | The Rolling Stones | Pop | 1980 | 120 | 0 |
| | True Nature | Jane's Addiction | Rock | 2000 | 80 | 0 |
| | Armageddon's Raid | Belphegor | Metal | 2000 | 100 | 0 |
| | Final Straw | R.E.M. | Rock | 2000 | 100 | -5 |

SVM + HMM: In order to generate results with SVM+HMM comparable to those obtained with GPR, we generated the sets $S_L$ and $S_D$ based on the same initial seed $s_0$ used to generate the seed sets for GPR. Thus, although $S_L$ constitutes a much larger set, the idea is that most of the songs in $S_L$ are "similar" to $s_0$, and therefore to all the songs in the seed sets. We trained our SVM and HMMs on a set $S_L$ of 187 songs and a set $S_D$ of 416 songs. We then applied the trained SVM + HMM model to classify the top 20 ranked songs from each of the four GPR runs (GPR+KMT and GPR+Hamming with 1 seed, and GPR+KMT and GPR+Hamming with 3 seeds). Omitting overlaps, this meant testing on a set of 71 songs. Of these, only 3 were labelled "like"; they are listed in Table 6.

Table 6: Classification Using SVM+HMM

| Title | Artist | Genre | Decade | Tempo | Loudness |
|---|---|---|---|---|---|
| Performance | Happy Mondays | Rock | 1980 | 100 | -5 |
| Darling, I Want to Destroy You | AFI | Rock | 2000 | 140 | 0 |
| Stumble | R.E.M. | Rock | 1980 | 140 | -10 |

VI. Discussion

GPR: To compare the performance of the different learning algorithms, we used a standard collaborative filtering metric to score the generated playlist in each trial. The score of the playlist from trial $j$ is defined as:

$$R_j = \sum_{i=1}^{N_j} \frac{t_{ij}}{2^{(i-1)/(\beta-1)}} \tag{11}$$

where $t_{ij} = 1$ if the $i$th song in the generated playlist is in the pre-made playlist of trial $j$, and 0 otherwise. $\beta$ determines how fast user interest decays and is set to 10, and $N_j$ is the number of songs in the generated playlist. The scores are then summed over trials and normalized:

$$R = 100 \sum_{j=1}^{1000} R_j \Big/ \sum_{j=1}^{1000} R_j^{\max} \tag{12}$$

where $R_j^{\max}$ is the perfect score in trial $j$. R = 100 corresponds to perfect prediction, and larger R values indicate better performance.
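A direct transcription of this metric in Python (our own sketch; `trials` is assumed to be a list of (generated, pre-made) playlist pairs):

```python
def playlist_score(generated, premade, beta=10):
    """R_j of equation (11): each generated song that appears in the
    pre-made playlist earns credit that decays with its rank i."""
    return sum(1.0 / 2 ** ((i - 1) / (beta - 1))
               for i, song in enumerate(generated, start=1)
               if song in premade)

def normalized_score(trials, beta=10):
    """R of equation (12): summed scores, normalized by the perfect
    score in which every generated song is a hit (t_ij = 1 for all i)."""
    total = sum(playlist_score(gen, pre, beta) for gen, pre in trials)
    best = sum(sum(1.0 / 2 ** ((i - 1) / (beta - 1))
                   for i in range(1, len(gen) + 1))
               for gen, pre in trials)
    return 100.0 * total / best
```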

Figure 1: Histogram of Scores of the Various Methods

All variations of GPR significantly outperform the randomly generated playlist. However, GPR+KMT does not consistently beat GPR+Hamming: how we construct the pre-defined playlists matters. Treating all songs in the same genre as similar tends to over-generalize and results in worse predictions than the Hamming kernel. On the other hand, treating songs from the same artist as pre-defined playlists results in much better predictions of the user preference. Based on this observation, KMT works well only with well-designed pre-defined playlists, in which case it can consistently outperform GPR+Hamming.

SVM + HMM: Although we did not have enough time to write a scoring algorithm for the SVM, we note that 2 of the 3 songs in Table 6 were ranked in the top 10 by both GPR+KMT and GPR+Hamming. Thus the two methods do appear to overlap in terms of which songs they predict the user might like. In general, however, SVM and GPR are very different approaches to playlist generation and are therefore difficult to compare. GPR works well with very few seeds, but treats timbre in an extremely simplistic way. The SVM depends on there being two large training sets $S_L$ and $S_D$, which allow us to train an HMM for timbre on each set. Each method thus has its own advantages, and which is best depends on the situation.

VII. Conclusion/Future Work

There are many interesting avenues to explore from here. We could, for example, train our SVM + HMM model using the more intricate kernel defined in equation (5) rather than the linear kernel used in this project. We could also expand the feature set by including lyrical features, i.e. judging the meaning/content of the music rather than only its acoustic properties.

References

[1] Zheng et al. (2001). Learning a Gaussian Process Prior for Automatically Generating Music Playlists.

[2] Sandler et al. (2005). Timbre Models for Analysis and Retrieval of Music Signals. IEEE Transactions on Multimedia, Vol. 7, No. 6.

[3] Zhang, X., Ras, Z., Dardzinska, A. (2005). Discriminant Feature Analysis for Music Timbre Recognition and Automatic Indexing.

[4] Platt, J. C. (2006). Auto Playlist Generation with Multiple Seed Songs.

[5] Ebden, M. (2008). Gaussian Processes for Regression: A Quick Introduction.

[6] Murphy, K. (1998). Hidden Markov Model (HMM) Toolbox for Matlab.
