Topic Detection in a Million Songs
Topic detection in a million songs
Lucas Sterckx

Supervisors: prof. dr. ir. Chris Develder, dr. ir. Thomas Demeester
Counsellors: ir. Johannes Deleu, Laurent Mertens

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Information Technology
Chairman: prof. dr. ir. Daniël De Zutter
Faculty of Engineering and Architecture
Academic year 2012-2013

Preface

I would like to thank my supervisors and counsellors, in particular dr. ir. Thomas Demeester, Laurent Mertens and ir. Johannes Deleu, for all their dedication, interest and kindness. I also thank them for the creative freedom I was given over the past year, which ensured that working on my thesis was never dull.

I would also like to thank Lauren Virshup and the people of 'GreenbookofSongs.com' for their cooperation and for granting me free access to their database. Their contribution was essential to the result.

Finally, I want to thank my parents, grandparents and brother for their support during my studies and all the time before.

Lucas Sterckx, June 2013

Permission for loan

"The author gives permission to make this dissertation available for consultation and to copy parts of it for personal use. Any other use is subject to the restrictions of copyright law, in particular with regard to the obligation to state the source explicitly when quoting results from this dissertation."
Lucas Sterckx, June 2013

Topic detection in a million songs
by Lucas Sterckx

Dissertation submitted in order to obtain the degree of Master of Science in Computer Science Engineering
Academic year 2012-2013
Ghent University
Faculty of Engineering and Architecture
Department of Information Technology
Chairman: prof. dr. ir. D. De Zutter
Supervisors: prof. dr. ir. C. Develder, dr. ir. T. Demeester
Counsellors: ir. J. Deleu, L. Mertens

Summary

In this work topic modeling was applied to song lyrics. In addition to a large corpus of lyrics, a set of supervised label assignments from a commercial lyrics listings website was retrieved and analyzed. This subset was used to study machine learning techniques for automatic categorization based on lyrics and song titles, in a multi-label classification setting. Title words were shown to be highly informative for automatic classification. A combination of features proved beneficial for some categories and metrics. Next, community-sourced labels known as social tags were studied for lyrics-specific assignment. Semantic relations between tagged documents were studied using unsupervised clustering, which showed the textual dependency of some social tags. Social tags were then used as a feature for multi-label classification of lyrics; the highest overall F1-score was obtained using a combination of all features. Labeled Latent Dirichlet Allocation, a supervised topic model, was trained on a labeled subset and used for classification. It obtained results competitive with the baseline classifiers but no large overall improvement. Latent Dirichlet Allocation, an unsupervised topic model, was inferred from the corpus of lyrics and evaluated according to semantic coherence and interpretability. A metric for evaluation was proposed using supervised data and the kurtosis measure; this metric achieved high correlation with manual scoring. Three topic models were compared in terms of the number and quality of unique themes.
Finally, some applications of topic models for Music Information Retrieval are presented.

Keywords: Music Information Retrieval, Lyrics, Topic Models, Latent Dirichlet Allocation

Topic Detection in a Million Songs
Lucas Sterckx
Supervisor(s): prof. dr. ir. Chris Develder, dr. ir. Thomas Demeester, ir. Johannes Deleu, Laurent Mertens

Abstract: In this work topic modeling was applied to song lyrics. In addition to a large corpus of lyrics, a set of supervised label assignments from a commercial lyrics listings website was retrieved and analyzed. This subset was used to study machine learning techniques for automatic categorization based on lyrics and song titles, in a multi-label classification setting. Title words were shown to be highly informative for automatic classification. A combination of features proved beneficial for some categories and metrics. Next, community-sourced labels known as social tags were studied for lyrics-specific assignment. Semantic relations between tagged documents were studied using unsupervised clustering, which showed the textual dependency of some social tags. Social tags were then used as a feature for multi-label classification of lyrics; the highest overall F1-score was obtained using a combination of all features. Labeled Latent Dirichlet Allocation, a supervised topic model, was trained on a labeled subset and used for classification. It obtained results competitive with the baseline classifiers but no large overall improvement. Latent Dirichlet Allocation, an unsupervised topic model, was inferred from the corpus of lyrics and evaluated according to semantic coherence and interpretability. A metric for evaluation was proposed using supervised data and the kurtosis measure; this metric achieved high correlation with manual scoring. Three topic models were compared in terms of the number and quality of unique themes. Finally, some applications of topic models for Music Information Retrieval are presented.

Keywords: Music Information Retrieval, Lyrics, Topic Models, Latent Dirichlet Allocation

I. INTRODUCTION

The way people consume music has changed considerably in terms of quantity and access over the last decade, and continues to do so. Large collections of music make it difficult for users to keep an overview of the immense offering, but they also open up new ways of exploring a collection and finding music that matches one's taste. Music Information Retrieval (MIR) is the interdisciplinary science addressing this potential, developing techniques that include music recommendation. This work studies the use of themes in lyrics for this purpose, using statistical analysis to detect topics.

II. RELATED WORK

While a case is made for the importance of words and lyrical themes in music and their contribution to a musical identity, they are often treated as secondary features when determining similarity in music, as compared to the audio signal. Notable exceptions are the research presented in [1] by Logan et al. and in [2] by Kleerdorfer et al., in which attempts were made to apply thematic categorization to lyrics. Mahadero et al. [1] perform a small-scale evaluation of a probabilistic classifier, classifying lyrics into five manually applied thematic categories. Kleerdorfer et al. [2] focus solely on topic detection in lyrics, using an unsupervised statistical model called Non-negative Matrix Factorization (NMF) on 32,323 lyrics. After clustering by NMF, each cluster was manually labeled by judgement of its most significant [...] models. Supervised data is then used to evaluate an unsupervised topic model inferred from a much larger collection of lyrics.

III. TOPIC MODELS

Probabilistic topic models are a tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus.

Latent Dirichlet Allocation (LDA) is a Bayesian graphical model for collections of text documents represented as bags of words [3]. In a topic model, each document in the collection is modeled as a multinomial distribution over a chosen number of topics. Each topic is a multinomial distribution over all words. Typically, only a small number of words are important for each topic, and only a small number of topics are present in each document.

Labeled Latent Dirichlet Allocation (L-LDA) extends LDA to labeled corpora by incorporating user supervision in the form of a one-to-one mapping between topics and labels [4].

IV. THE DATASET

The main dataset used for this research is the so-called ‘Million Song Dataset’ (MSD) [5], with metadata for 1,000,000 songs. This metadata is matched with 237,662 lyrics from the commercial lyrics catalogue ‘musiXmatch’ and with a dataset containing 8,598,630 social tag assignments (community-sourced labels) from the social music service ‘Last.fm’.

A clean dataset was provided by the commercial lyrics listings website ‘GreenbookofSongs.com®’ (GOS). The GOS dataset assigns multiple labels from a large class hierarchy to 9,261 lyrics.

V. LYRICS CATEGORIZATION

A. Lyrics and Titles

First, focus is placed on the GOS dataset. This clean set of documents and assignments allows us to measure the performance of statistical text classification of lyrics. Songs from the GOS dataset are classified into the 24 super-categories recognized by its creators, and a selection of baseline classifiers from the domain of Machine Learning was applied. On average, each song is assigned two super-categories, which shows that multiple labels must be assigned to each document when classifying. A one-vs-all scheme is applied using a binary classifica-