Topic detection in a million songs

Lucas Sterckx

Supervisors: prof. dr. ir. Chris Develder, dr. ir. Thomas Demeester
Advisors: ir. Johannes Deleu, Laurent Mertens

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Engineering: Computer Science

Department of Information Technology
Chair: prof. dr. ir. Daniël De Zutter
Faculty of Engineering and Architecture
Academic year 2012-2013


Preface

I would like to thank my supervisors and advisors, in particular dr. ir. Thomas Demeester, Laurent Mertens and ir. Johannes Deleu, for all their dedication, interest and sympathy, and also for the creative freedom I was given during the past year, which ensured that working on my thesis was never dull for a moment. I would also like to thank Lauren Virshup and the people at 'GreenbookofSongs.com' for their cooperation and for granting me free access to their database; their contribution was essential to the result. Finally, I want to thank my parents, grandparents and brother for their support during my studies and all the time before.

Lucas Sterckx, June 2013

Permission for use

"The author gives permission to make this dissertation available for consultation and to copy parts of it for personal use. Any other use falls under the limitations of copyright, in particular the obligation to explicitly state the source when citing results from this dissertation."

Lucas Sterckx, June 2013

Topic detection in a million songs
by Lucas Sterckx
Dissertation submitted in order to obtain the degree of Master of Science in Engineering: Computer Science
Academic year 2012-2013
Ghent University
Faculty of Engineering and Architecture
Department of Information Technology
Chair: prof. dr. ir. D. De Zutter
Supervisors: prof. dr. ir. C. Develder, dr. ir. T. Demeester
Thesis advisors: ir. J. Deleu, L. Mertens

Summary

In this work, topic modeling was applied to song lyrics. Next to a large corpus of lyrics, a set of supervised label-assignments from a commercial lyrics listings website was retrieved and analyzed. This subset was used to study machine learning techniques for automatic categorization using lyrics and song titles in a multi-label classification. Title words were shown to be highly informative for automatic classification, and a combination of features proved beneficial for some categories and metrics. Next, community-sourced labels known as social tags were studied for lyrics-specific assignment. Semantic relations between tagged documents were studied using unsupervised clustering, which showed the textual dependency of some social tags. Social tags were then used as features for multi-label classification of lyrics; the overall highest F1-score was obtained using a combination of all features. Labeled Latent Dirichlet Allocation, a supervised topic model, was trained on a labeled subset and used for classification, obtaining results competitive with baseline classifiers but no large overall improvement. Latent Dirichlet Allocation, an unsupervised topic model, was inferred from the corpus of lyrics and evaluated according to semantic coherence and interpretability. An evaluation metric was proposed using supervised data and the kurtosis measure; this metric achieved high correlation with manual scoring. Three topic models were compared in terms of the amount and quality of unique themes. Finally, some applications of topic models for Music Information Retrieval are presented.

Keywords: Music Information Retrieval, Lyrics, Topic Models, Latent Dirichlet Allocation

Topic Detection in a Million Songs

Lucas Sterckx

Supervisor(s): prof. dr. ir. Chris Develder, dr. ir. Thomas Demeester, ir. Johannes Deleu, Laurent Mertens

Abstract—In this work topic modeling was applied to song lyrics. Next to a large corpus of lyrics, a set of supervised label-assignments from a commercial lyrics listings website was retrieved and analyzed. This subset was used to study machine learning techniques for automatic categorization using lyrics and song titles in a multi-label classification. Title words were shown to be highly informative for automatic classification, and a combination of features proved beneficial for some categories and metrics. Next, community-sourced labels known as social tags were studied for lyrics-specific assignment. Semantic relations between tagged documents were studied using unsupervised clustering, which showed the textual dependency of some social tags. Social tags were then used as features for multi-label classification of lyrics; the overall highest F1-score was obtained using a combination of all features. Labeled Latent Dirichlet Allocation, a supervised topic model, was trained on a labeled subset and used for classification, obtaining results competitive with baseline classifiers but no large overall improvement. Latent Dirichlet Allocation, an unsupervised topic model, was inferred from the corpus of lyrics and evaluated according to semantic coherence and interpretability. An evaluation metric was proposed using supervised data and the kurtosis measure; this metric achieved high correlation with manual scoring. Three topic models were compared in terms of the amount and quality of unique themes. Finally, some applications of topic models for Music Information Retrieval are presented.

Keywords—Music Information Retrieval, Lyrics, Topic Models, Latent Dirichlet Allocation

I. INTRODUCTION

The way people consume music has changed considerably in terms of quantity and access over the last decade, and is continuing to do so. Large collections of music make it difficult for users to overlook the immense offer, but also open up possibilities for new ways of exploring the collection and finding music matching one's taste. Music Information Retrieval (MIR) is the interdisciplinary science addressing this potential, developing techniques including music recommendation. This work studies the use of themes in lyrics for this purpose, using statistical analysis to detect topics.

II. RELATED WORK

While a case is made for the importance of words and lyrical themes in music and their contribution to a musical identity, lyrics are often treated as secondary features when determining similarity in music, compared to the audio signal. Notable exceptions are the studies presented in [1] by Logan et al. and in [2] by Kleedorfer et al., in which attempts were made to apply thematic categorization to lyrics. Mahadero et al. [1] perform a small-scale evaluation of a probabilistic classifier, classifying lyrics into five manually applied thematic categories. Kleedorfer et al. [2] focus solely on topic detection in lyrics, applying an unsupervised statistical model called Non-negative Matrix Factorization (NMF) to 32,323 lyrics. After clustering by NMF, each cluster was manually labeled by judgement of its most significant terms.

We expand on this work by processing a clean dataset of labeled lyrics, applying text categorization and supervised topic models. Supervised data is then used to evaluate an unsupervised topic model inferred from a much larger collection of lyrics.

III. TOPIC MODELS

Probabilistic topic models are a tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus.

Latent Dirichlet Allocation (LDA) is a Bayesian graphical model for text document collections represented by bags-of-words [3]. In a topic model, each document in the collection is modeled as a multinomial distribution over a chosen number of topics. Each topic is a multinomial distribution over all words. Typically, only a small number of words are important for each topic, and only a small number of topics are present in each document.

Labeled Latent Dirichlet Allocation (L-LDA) is an improvement upon LDA for labeled corpora, incorporating user supervision in the form of a one-to-one mapping between topics and labels [4].

IV. THE DATASET

The main dataset used for this research is the so-called 'Million Song Dataset' (MSD) [5], with metadata for 1,000,000 songs. This metadata is matched with 237,662 lyrics from the commercial lyrics catalogue 'musiXmatch' and a dataset containing 8,598,630 social tag assignments (community-sourced labels) from the social music service 'Last.fm'.

A clean dataset was provided by the commercial lyrics listings website 'GreenbookofSongs.com®' (GOS). The GOS-dataset assigns multiple labels from a large class-hierarchy to 9,261 lyrics.

V. LYRICS CATEGORIZATION

A. Lyrics and Titles

First, focus is placed on the GOS-dataset. This clean set of documents and assignments allows us to measure the performance of statistical text classification of lyrics. Songs from the GOS-dataset are classified into 24 super-categories recognized by its creators, and a selection of baseline classifiers from the domain of Machine Learning was applied. On average, each song carries two super-categories, which shows that multiple labels must be assigned to each document when classifying. A one-vs-all scheme is applied, using a binary classification for each category and averaging the results. Categorization was evaluated using the common metrics precision, recall and F1-score; results are presented in Table II.

B. Social Tags

Social tags are free-text labels applied to items such as artists and songs. Unlike traditional keyword assignment, where terms are often drawn from a controlled, static vocabulary, no restrictions are placed on social tags. In our research we investigate the assignment of song-level social tags based on lyrical theme. To this end, a hierarchical categorization was created, making a distinction between tags that are relevant and those which are not; 180 social tags were selected as lyrics-related.

To demonstrate the relation between social tags and lyrics, an unsupervised clustering algorithm is applied to the centroids of lyrics assigned one of the 180 social tags. Some clusters show strong semantic coherence between social tags, demonstrating their textual dependency; several clusters are shown in Table I.

TABLE I
CLUSTERS OF SOCIAL TAGS

love, love song, lovesongs, lovesong
animals, animal kingdom, birds, animal song
rain, weather, weather songs
heartbreak, breakup, goodbye, Heartbreaking, heartache, break up, love hurts, break-up, broken heart, Breakup songs, relationships, heartbroken, i miss you

Social tags were also applied as features in multi-label classification.

C. Supervised Topic Model

The supervised topic model L-LDA was used as a tool for feature transformation for classification using the 24 super-categories. The advantage of using L-LDA on multiply labeled documents comes from the model's document-specific topic mixture. L-LDA can effectively perform some contextual word sense disambiguation, which suggests why L-LDA could outperform SVMs. Classification was performed by thresholding the corresponding topic-contributions; a threshold was chosen to obtain optimal F1-scores.

D. Results

Of all baseline classifiers, a Support Vector Machine achieved the best results. Results for each combination of features, together with classification using L-LDA, are presented in Table II. A combination of all textual features obtained the overall best results; performance is lower than for classification in news- or book-corpora.

TABLE II
MACRO-AVERAGE RESULTS FOR CLASSIFICATION OF LYRICS USING A SUPPORT VECTOR MACHINE AND L-LDA (%)

Feature(s)                    Prec.  Rec.   F1
Lyrics                        60.58  36.76  45.76
Title                         63.75  42.63  51.09
Social Tags                   48.66  18.73  27.04
Lyrics + Title                63.16  43.13  51.25
Lyrics + Social Tags          62.05  40.30  48.86
Title + Social Tags           61.29  35.87  45.25
Lyrics + Title + Social Tags  66.11  43.57  52.52
L-LDA Topics                  44.70  62.46  52.11

VI. TOPIC DETECTION

In the second part of the thesis, Latent Dirichlet Allocation was applied to the complete set of lyrics. While unsupervised topic models make assumptions which lead to better statistical models of documents than supervised models, they offer no guarantee of producing a human-interpretable decomposition of the texts like supervised models do, so evaluation of the output is important. In the case of lyrics, there is no general consensus about the number of topics or the thematic contents present, as there is for news-corpora (Sport, Science, Entertainment, ...) or book-corpora (Comedy, Thriller, Romance, ...). Since we do not know what topics may emerge from the model, we prefer evaluation which takes semantic coherence into account.

Three topic models were inferred from 181,892 English lyrics for evaluation, with 60 (T60), 120 (T120) and 200 (T200) topics. Topics were first evaluated manually, applying quality-scores from 1 (=useless) to 3 (=useful); labels are assigned to topics of the highest quality.

A. Semantic coherence

First, evaluation was performed using metrics presented in [6]. These metrics use WordNet, a lexical ontology [7], to score topics by measuring the average semantic distance between the words of a topic. Overall, these metrics showed less rank-correlation with manually assigned quality than reported in [6].

B. Kurtosis-evaluation

As a second method for evaluation, a metric was proposed relying on supervised topics or labeled documents, and a measure for 'peakedness'. In previous sections several sources of supervised labels were presented: a supervised topic model (L-LDA) and social tags specifically assigned to lyrics. In an ideal case, an unsupervised topic model would produce the same topics as a model incorporating supervised data; we therefore match word-distributions from a supervised model, or documents assigned with social tags, with unsupervised topics. LDA-topics are scored according to the extent they match supervised data. Not only high similarity is desirable: ideally only one of the supervised topics or labels shows high similarity with the unsupervised topic, which demonstrates the distinctiveness of the theme.

For each of the unsupervised LDA-topics, cosine similarity is calculated with each of the supervised L-LDA topics, resulting in a similarity-distribution for each LDA-topic. These distributions are then scored according to the extent they show 'peakedness', a statistical measure also known as kurtosis. Kurtosis (β2) is defined as the fourth central moment divided by the square of the variance,

β2 = E[(X − µ)^4] / (E[(X − µ)^2])^2 = µ4 / σ^4

with µ4 the fourth moment about the mean and σ the standard deviation. An LDA-topic with high kurtosis and one with low kurtosis are shown in Figure 1: Topic 65 ("white black red sky color green eyes paint light"), with β2 = 19.19, versus Topic 43 ("road find lead time life walk light back follow path"), with β2 = −0.02.

[Fig. 1. Kurtosis Measure for Topic Evaluation — cosine similarity with each supervised topic, for a peaked and a flat similarity-distribution.]

This metric resulted in the highest rank-correlation with manual scores when using L-LDA topics; using tagged lyrics obtained slightly lower scores. All rank correlations, using Spearman rank correlation coefficients, are presented in Table III, with the LESK-metric [8] having the highest correlation among the WordNet-scores.

TABLE III
SPEARMAN CORRELATION WITH MANUAL EVALUATION

Evaluation Metric        T60   T120  T200
WordNet (LESK)           0.35  0.23  0.31
Kurtosis (Social Tags)   0.32  0.37  0.36
Kurtosis (L-LDA topics)  0.49  0.49  0.56

C. Topic Detection

Combining all kurtosis scores shows which topics can be inferred using LDA. In Figure 2 all supervised topics are shown versus the kurtosis scores obtained for LDA-topics of T120; for each LDA-topic, the highest kurtosis-score is displayed in the graph, and color indicates the manual score the topic received (green = good, red = bad).

[Fig. 2. Matched Topics from T120 using Supervised Topics — supervised themes (Christmas, Colors, Fire, Water, Religion, Weather, War/Peace, Places/Cities, Animals, Drugs/Alcohol, Travelling/Moving, ...) plotted against the kurtosis of their best-matching LDA-topic.]

The detection of topics is dependent on the labels included in the supervised data; quality LDA-topics not included in the supervised set are not detected. For topic-matching, an L-LDA model was inferred using the GOS-dataset containing 38 supervised topics. In Figure 2, for T120, 11 different themes are strongly (β2 > 10) linked to LDA-topics. Increasing the number of topics gives rise to new matches, but increases the number of low-quality topics and the number of LDA-topics per theme.

VII. APPLICATION

Three proof-of-concept applications were implemented using topic models.

One straightforward way of applying topic models in a music application is the automatic generation of playlists. A plug-in was implemented for the desktop software of the popular music streaming service Spotify, using topic-based representations to create playlists containing certain lyrical themes specified by users.

Artist similarity based on topics was computed and compared to community data. High scores for Mean Reciprocal Rank were obtained for a selection of artists from the genres Metal, Hip-Hop and Christian, but surprisingly also for Rock, Pop, Female vocalists and Jazz-singers.

Finally, the use of topic models in the social sciences was pitched.

VIII. CONCLUSION

A first evaluation of automatic lyrics categorization was performed, obtaining a highest F1-score of 52.52%. The use of community-sourced data was shown to be beneficial. A supervised topic model for classification is competitive with, but not superior to, other techniques. A new evaluation metric for unsupervised topics was proposed and applied, having high correlation with manual scores while showing which themes can be detected using an unsupervised topic model. Finally, some promising applications of topic models in MIR are proposed.

REFERENCES

[1] Beth Logan, Andrew Kositsky, and Pedro Moreno, "Semantic analysis of song lyrics," in Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on. IEEE, 2004, vol. 2, pp. 827–830.
[2] Florian Kleedorfer, Peter Knees, and Tim Pohle, "Oh oh oh whoah! Towards automatic topic detection in song lyrics," in Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), 2008, pp. 287–292.
[3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[4] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 2009, pp. 248–256.
[5] Brian McFee, Thierry Bertin-Mahieux, Daniel P.W. Ellis, and Gert R.G. Lanckriet, "The million song dataset challenge," in Proceedings of the 21st International Conference Companion on World Wide Web. ACM, 2012, pp. 909–916.
[6] David Newman, Sarvnaz Karimi, and Lawrence Cavedon, "External evaluation of topic models," in Australasian Document Computing Symposium (ADCS). School of Information Technologies, University of Sydney, 2009, pp. 1–8.
[7] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller, "Introduction to WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, no. 4, pp. 235–244, 1990.
[8] Satanjeev Banerjee and Ted Pedersen, "An adapted Lesk algorithm for word sense disambiguation using WordNet," in Computational Linguistics and Intelligent Text Processing, pp. 136–145. Springer, 2002.
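The kurtosis-based matching of unsupervised and supervised topics described in the extended abstract can be sketched in a few lines. This is an illustrative reconstruction, not the thesis code: the function names and toy topic vectors are ours, and β2 is computed here as the Pearson kurtosis µ4/σ^4 from the formula in the text.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two word distributions (dense vectors).
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def kurtosis(x):
    # Pearson kurtosis beta_2 = mu_4 / sigma^4: fourth central moment
    # divided by the squared variance, as defined in the text.
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    mu4 = ((x - mu) ** 4).mean()
    return mu4 / var ** 2

def topic_peakedness(lda_topic, supervised_topics):
    # Similarity of one unsupervised LDA-topic to every supervised
    # (L-LDA) topic, scored by the peakedness of that similarity profile.
    sims = [cosine(lda_topic, s) for s in supervised_topics]
    return kurtosis(sims), sims
```

A topic matching exactly one supervised theme yields a peaked, high-kurtosis similarity profile; a topic spreading its similarity evenly over many themes scores low.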

Contents

1 Introduction 1
1.1 Problem Statement...... 2
1.2 Overview...... 3

2 Lyrics in MIR 4
2.1 Early Lyrics-Related Research...... 4
2.2 Combining Features...... 4
2.3 Natural Language Processing of Lyrics...... 5
2.4 Overview and Contribution...... 6

3 Topic Models 8
3.1 Introduction...... 8
3.2 Latent Dirichlet Allocation...... 9
3.2.1 Statistical Model...... 10
3.2.2 Computation...... 11
3.3 Labeled Latent Dirichlet Allocation...... 12
3.3.1 Statistical Model...... 12
3.3.2 Computation...... 14
3.3.3 Application...... 14
3.4 Evaluation...... 14
3.4.1 Semantic Coherence...... 15
3.4.2 Matching with Supervised Data...... 16
3.5 Conclusion...... 16

4 The Dataset 17
4.1 The Million Song Dataset...... 17
4.2 The musiXmatch Dataset...... 18
4.3 Last.fm Social Tags Dataset...... 18
4.4 Greenbook of Songs...... 19
4.5 Conclusion...... 20

5 Lyrics Categorization 22
5.1 The Greenbook of Songs' Taxonomy...... 22
5.2 Text Categorization...... 24
5.2.1 Multi Label Text Categorization...... 25
5.2.2 Machine Learning Approach...... 25
5.2.3 Attributes...... 26
5.2.4 Classifiers...... 27
5.2.5 Measures for Effectiveness...... 30
5.3 Evaluation...... 31
5.4 Discussion...... 33
5.5 Conclusion...... 34

6 Social Tags 35
6.1 Introduction...... 35
6.2 Issues with Social Tags...... 35
6.3 Lyrical Themes in Social Tags...... 36
6.4 Unsupervised Clustering of Social Tags...... 38
6.5 Social Tag Features in Lyrics Categorization...... 39
6.6 Discussion...... 42
6.7 Lyrics for Auto-tagging...... 42
6.8 Conclusion...... 43

7 Supervised Topic Model for Lyrics 44
7.1 Introduction...... 44
7.2 L-LDA using the GOS-dataset...... 44
7.3 Classification using L-LDA...... 44
7.4 Conclusion...... 47

8 Unsupervised Topic Model for Lyrics 48
8.1 Introduction...... 48
8.2 Latent Dirichlet Allocation on musiXmatch-dataset...... 49
8.3 Evaluation...... 49
8.3.1 Manual Evaluation...... 49
8.3.2 Semantic Coherence...... 52
8.3.3 Match with Supervised Topic Model...... 55
8.3.4 Match with Social Tags...... 58
8.3.5 Analysis...... 61
8.4 Topic Detection...... 62
8.5 Conclusion...... 66

9 Application 68
9.1 Introduction...... 68
9.2 Spotify Plug-in...... 68
9.3 Automatic Playlist Generation...... 68
9.4 Artist Similarity...... 70
9.5 Topic-models as Tool for Social Sciences...... 72
9.6 Conclusion...... 73

10 Conclusion 74
10.1 Future Work...... 75

A LDA-topics and Manual Evaluation 76

B Supervised Topic Model for Evaluation 81

Bibliography 82

List of Figures

3.1 Example of lyrics as mixture of topics, created by a generative process...... 9
3.2 Plate-notation of the LDA-model...... 11
3.3 Plate-notation of the L-LDA model...... 14

4.1 Overview of dataset...... 21

5.1 Example of a sub-tree from hierarchical structure from the GOS-dataset.... 23

6.1 Hierarchical categorization for song-level social tags...... 37
6.2 Lyrics-related social tag frequencies...... 38

7.1 Contextual word sense disambiguation using L-LDA...... 45
7.2 Classification using L-LDA...... 46

8.1 HSO-score versus manual scores...... 54
8.2 Kurtosis measure...... 57
8.3 Correlation with kurtosis measure...... 58
8.4 Average topic-distributions for social tags...... 59
8.5 Correlation with kurtosis measure using social tags...... 60
8.6 Kurtosis measure using social tags...... 61
8.7 Label-matching...... 64
8.8 Number of topics matched versus kurtosis...... 65
8.9 Number of topics per matched theme versus kurtosis...... 65
8.10 Manual labels versus supervised labels for T60...... 67
8.11 Manual labels versus social tags for T60...... 67

9.1 Topic-based plug-in for Spotify...... 69
9.2 Spotify Lyrics Plug-in...... 69
9.3 Mean Reciprocal Rank for artists in descending order...... 71
9.4 Average 'Crime'-Topic distribution over time versus crime rate in the U.S.A...... 73

Abbreviations

DT Decision Tree
GOS GreenbookofSongs.com
ISMIR International Society for Music Information Retrieval
k-NN k Nearest Neighbor Classifier
KL Kullback-Leibler
LCS Least Common Subsumer
LDA Latent Dirichlet Allocation
L-LDA Labeled Latent Dirichlet Allocation
LR Logistic Regression
MIR Music Information Retrieval
MIREX Music Information Retrieval Evaluation eXchange
ML Machine Learning
NB Naive Bayes
SVM Support Vector Machine
T60 LDA Topic model with 60 topics
T120 LDA Topic model with 120 topics
T200 LDA Topic model with 200 topics
TC Text Classification
TF Term Frequency
TF-IDF Term Frequency - Inverse Document Frequency
VSM Vector Space Model

Chapter 1

Introduction

The way people consume music has changed considerably over the last decade and is continuing to do so. Today's music enthusiast has a much larger collection at his disposal than twenty years ago. Catalysts are the many technological developments in the fields of data storage and compression, and the rise of the internet leading to the online distribution of music. These large collections of music make it difficult for users to overlook the immense offer, but also provide possibilities for new ways of exploring the collection and finding music matching one's taste. The classic song metadata (a song's title, artist and genre) can be expanded with additional information, extracted from audio, contextual data or information provided by a community of listeners, to enhance discovery and browsing.

Music Information Retrieval (MIR) is the interdisciplinary science of extracting information from music. MIR finds its roots in a variety of research fields including musicology, cognitive science and computer science. The leading research forum on MIR is the 'International Society for Music Information Retrieval' (ISMIR), organizing annual conferences with the aim of providing a meeting place for the discussion of MIR-related research, developments, methods and tools, and the presentation of experimental results. The increasing number of papers presented during these conferences shows the growing interest in music recommendation and discovery. In the music industry, MIR is already applied in recommender systems, automatic annotation of songs, instrument recognition, computer-generated sheet music and more. The 'Music Information Retrieval Evaluation eXchange' (MIREX) is an annual evaluation of classification algorithms for music, in collaboration with ISMIR.

Assessing the similarity of music, musical artists, or musical styles is a non-trivial task, as there is no explicit definition of what makes music similar.
Techniques for recommendation focus on analysis of the audio signal or community data (collaborative filtering). Digital signal processing techniques are used to determine the intensity (mood), the timbre or the rhythmical structure from the audio; audio-based techniques are known as content-based techniques. Apart from the musical properties of a song, similarity in music is also attributed to some extent to cultural aspects. For example, two artists can share certain political views which they express in their music. These aspects are used in so-called context-based techniques. Context data is all information about a song that is not encoded in the audio file, but originates from external sources. Incorporation of this extra information has already proven to be beneficial for tasks like auto-tagging of artists, recommendation, or new intuitive interfaces for music players [19]. Combinations of both content- and context-based techniques have also been successfully used for classification of genre, instrumentation or mood. The most notable sources for context-based features are social tags, community-sourced playlists and lyrics.

1.1 Problem Statement

Lyrics are the words which are sung to music and can be an important feature in the perception of music. In some genres of music, lyrics even claim a central role; these genres are defined by a specific use of words or thematic content. Search-query statistics from search engine Google show that artists are searched for more often in combination with the keyword 'lyrics' than without [67]. The internet offers a vast selection of websites with online catalogues containing several hundreds of thousands of lyrics. Services from these websites are often incorporated in offline and online music players: lyrics are automatically fetched by software by calling an Application Programming Interface (API) offered by the web service. In streaming service 'Spotify', as of June 2013, the number one and two most popular extensions for the player are applications which display lyrics. 'MusiXmatch', one of the leading companies distributing lyrics, has recently hit ten million users of its mobile application showing lyrics, and raised 3.7 million in funding.

Lyrics can also be seen as a form of social commentary by the artist; they can contain political, social or economic themes. The effect of lyrics on listeners has been studied in multiple psychological and sociological experiments [8][18]. Recently, two websites dedicated to the interpretation of lyrics have gained attention. 'SongMeanings.com' [6] is a community of thousands of music lovers who contribute lyrics, discuss interpretations and connect over songs and artists they love. This website also allows lyrics-specific tagging of songs. 'Rap Genius' is a website dedicated to the annotation and interpretation of hip-hop music. The site's purpose is "not to translate rap into nerdspeak", but rather to critique rap as poetry [5]. As of June 2012, 'Rap Genius' receives approximately 10 million unique visitors per month. The analysis presented in this thesis shows the importance of lyrics for listeners.
Although some research has been done, the influence of the actual theme of the lyrics remains minimal in both research and commercial platforms. Large catalogues are rarely equipped with functions exceeding the searching of songs by more than title and artist. This work addresses these shortcomings and studies possibilities to explore and employ the thematic content of lyrics. Several data sources and techniques with potential for classification and topic detection will be evaluated. The following questions will be assessed in this thesis.

• What are lyrical themes common in music?
• How effective are classification algorithms in classifying lyrics into themes?
• Do unsupervised or supervised algorithms recognize topics which are interpretable for humans? Can unsupervised algorithms aid with classification?
• Does a classification correspond with the actual preferences in user communities? Can community-sourced information aid with classification?

A system which successfully assigns topics to a song can be used for several interesting goals.

• Enhance music recommendation systems by incorporating lyrical features
• Automatic playlist generation
• Filtering of certain themes or coarse language
• Assist with psychological or sociological studies involving lyrics

1.2 Overview

In the work at hand, manually labeled lyrics and community-sourced tags are evaluated for use in classification algorithms, as well as several supervised and unsupervised statistical models for topic detection. The structure of this thesis is as follows: chapter 2 gives an overview of the related research concerning lyrics, and a detailed overview of the contributions of this thesis is presented in section 2.4. In chapter 3, several state-of-the-art models for topic detection are explained. All data used for the research is presented in chapter 4. Chapter 5 focuses on a manually labeled corpus of lyrics and presents automatic classification of lyrics into themes. The community-sourced or social tags are discussed in chapter 6. In chapters 7 and 8, topic models are applied and evaluated regarding quality and usefulness. Chapter 9 presents some applications of the presented research. In chapter 10, a summary and an outlook for future work are given.

Chapter 2

Lyrics in MIR

In the field of MIR, lyrics, or contextual data as a whole, are treated as secondary features for determining similarity or classification, compared to the audio signal. This chapter gives a brief overview of the research that takes lyrical content into account, organized by focus and intended use.

2.1 Early Lyrics-Related Research

Early MIR research incorporating lyrics, up to 2002, did so only by adding lyrics to the searchable content [23][45]. A first system making use of natural language processing (NLP) of lyrics in combination with audio-based methods is presented in [10]. The system offers retrieval of songs with similar lyrics. Lyrics are transformed into the vector space model (VSM) and similarity is measured using the cosine distance. The authors also state the potential of using lyrics and IR techniques to automatically create meaningful terms and topics.

2.2 Combining Features

A second stage in the role of lyrics in MIR can be identified as the computation of music similarity by combining lyrics-based features with audio-based features. A first account of this multi-modality is found in [17]. A multi-modal mixture model is trained using the expectation-maximization algorithm; the model is applied for the retrieval of songs or searching for songs. Similar ideas are applied in [24], where artist style identification is performed using a combination of independent classifiers for both feature types. Experiments for genre classification using the multi-modal approach are presented in [52]. Support Vector Machines (SVMs) are applied to a combination of lyrics- and audio-based features. Classification results are compared for classifiers using independent and combined features; combination is shown to be beneficial for classification. The best accuracy attained is 48.4% for a classification into 41 genres.

Another application of multi-modality is mood classification. Songs are classified into four mood categories by means of lyrics and content analysis. Audio-based features perform better than lyrics-based features; however, a combination of both yields better results for some categories [39]. In [29], experiments are performed using TF-IDF, TF and Boolean vectors representing words, as well as on the impact of stemming, part-of-speech tagging and function words. Classification is performed into 18 mood categories; for seven of these, lyrics outperform audio. For those seven categories, the top-ranked words show strong semantic connections to the categories.

2.3 Natural Language Processing of Lyrics

The research most relevant to this thesis is based solely on lyrics, from a MIR point of view. One of the earliest papers in this category, and one of the most frequently cited by lyrics-related articles, was published in 2004 by Logan et al. [44]. In it, lyrics are analyzed using probabilistic Latent Semantic Analysis (pLSA), a statistical technique presented in [27], used for discovering the abstract ‘topics’ that occur in a collection of documents. pLSA is applied to a collection of 40,000 song lyrics. All lyrics by an artist are then processed using each of the extracted topics to create N-dimensional vectors, each component representing the likelihood that a song lyric corresponds to a topic. Similar artists are then found using the L1-distance between these vectors. The approach is evaluated against human judgments of the ‘uspop2002’ set [11] using the ‘survey’ data for similarity, and yields worse results than similarity data obtained via acoustic features, regardless of the number of topics chosen. However, as lyrics-based and audio-based approaches make different errors, a combination of both is suggested. Another influential paper is [46] by Mahedero et al., who performed experiments in four different areas: language identification, thematic categorization, structure extraction and similarity searches. Thematic categorization was performed into 5 distinct categories, namely ‘Love’, ‘Violent’, ‘Protest’ (antiwar), ‘Christian’ and ‘Drugs’. The classification was performed using a classical probabilistic classifier known as Naive Bayes. The corpus for this experiment consisted of 125 songs manually divided into the 5 mentioned categories. The authors state that the definition of categories is very subjective but very influential to the results. The Naive Bayes classifier yielded an accuracy of 82% on a 10-fold cross-validation. For similarity computation, a TF-IDF representation with cosine similarity is proposed as an initial step.
Songs are represented by concatenating distances to all songs in the collection into a new vector. These representations are then compared using an unspecified algorithm. Exploratory experiments indicate some potential for identification and plagiarism detection.

Research closely related to this thesis is described in [34] by Kleedorfer et al. This paper focuses solely on topic detection in lyrics using an unsupervised statistical model: non-negative matrix factorization (NMF) is used for automatic topic detection. Latent Semantic Analysis (LSA) was found unsuitable for large sparse matrices due to its space complexity, so NMF was performed on 32,323 song lyrics. Lyrics were preprocessed by deleting stop words, removing terms with high frequency and filtering out short lyrics. Documents are transformed to vectors containing binary presentations of words. After clustering by NMF, each cluster was manually labeled by judgment of its most significant terms. The cluster topics are evaluated for quality by measuring the agreement among the subjects labeling the clusters. The labels with highest significance are presented in Table 2.1.

appearance, boys and girls, broken hearted, clubbing, conflict, crime, dance, depression, future, gangsta, gospel, hard times, hiphop, home, leave, listen, loneliness, loss, love, music, nature, party, sorrow, talk, weather, world

Table 2.1: Significant topics in lyrics recognized by the unsupervised algorithm and manually labeled clusters [34]

2.4 Overview and Contribution

The idea of discovering topics and organizing songs using lyrics is not entirely new. The subject has been noted in [44] and tackled on a very small scale in [46]. Kleedorfer et al. [34] demonstrate a first complete effort for automatic topic detection in songs. They state that the reason why so few attempts have been made is the lack of a song set of realistic size and the absence of adequate ground truth. Kleedorfer et al. circumvent this problem by applying an unsupervised method that produces topics and evaluating their quality, instead of the association of songs to topics. This work contributes to research concerning semantic analysis of lyrics and topic detection in several ways:

• Next to standard datasets for MIR research, an exclusive dataset with theme-labeled lyrics, presented in chapter 4 and containing a clean ground truth, was acquired, enabling a first automatic classification of lyrics on a large scale. In chapter 5, its statistical properties are studied and used to train a model for classification of a test set; the classification is then evaluated using several metrics.

• In chapter 6, community-sourced data is studied for lyrics-related assignment and used for supervised classification.

• In chapter 7, Labeled Latent Dirichlet Allocation, a supervised topic model, is applied for lyrics classification, evaluated and compared to other classification techniques.

• In chapter 8, Latent Dirichlet Allocation, an unsupervised topic model, is applied to a large corpus of English lyrics for topic detection and evaluated. A new measure for the evaluation of topics from Latent Dirichlet Allocation, making use of supervised data and the kurtosis measure, is proposed and applied.

• In chapter 9, several proof-of-concept applications are presented. The topic-based representation of lyrics is used for automatic playlist generation and computation of artist similarity, and is applied as a tool for research in the social sciences.

Chapter 3

Topic Models

This chapter offers some insight into one of the tools that will be used for the discovery and classification of topics in lyrics, namely topic models.

3.1 Introduction

The collective knowledge that is digitized and stored continues to grow, and it is becoming increasingly difficult to find or discover what one is looking for. The current tools for working with online information rely on search functions and the interconnections between documents, but these can be expanded; think of searching and exploring documents based on the themes that run through them [13]. This thematic structure could be a new kind of window through which to explore and digest a collection. While the digital collection of text grows, we lack the capacity to annotate documents in such a way manually. To this end, researchers have developed topic models. Topic models are algorithms for discovering themes in large unstructured collections of documents. They do not require any prior annotations or labeling of the documents; the topics emerge from the analysis. Topic models enable summarization (finding concise restatements), classification and better similarity computation of texts. The next paragraphs discuss the development of the field of topic modeling. A significant step in the field of Information Retrieval (IR) was the basic methodology of representing each document as a vector of real numbers derived from word counts. A well-known scheme using this representation is ‘Term Frequency - Inverse Document Frequency’ (TF-IDF) [63]. A count is formed of all occurrences of each word in a document. After normalization, this count is multiplied by the inverse document frequency, obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of that quotient. The result is a term-by-document matrix whose columns contain the TF-IDF values for each of the documents. The scheme thus reduces each document to a fixed-length list of real numbers.
This reduction brought a first basic identification of sets of words that are discriminative for documents, but reveals little of the inter- and intra-document statistical structure. These shortcomings were addressed by the development of new dimensionality reduction techniques, most notably Latent Semantic Analysis (LSA) [20]. LSA applies Singular Value Decomposition to the TF-IDF matrix to identify a linear subspace in the total space of TF-IDF vectors that captures the most significant sources of variance within the collection. The inventors also state that the features derived from LSA can capture some aspects of basic linguistic notions such as synonymy and polysemy. However, when considering a generative model of text documents, it is not clear why one should adopt the LSA methodology, as one could proceed more directly by fitting a model using maximum likelihood or Bayesian methods. A development addressing this is probabilistic LSA (pLSA), also known as the aspect model. pLSA is based on a likelihood principle and defines a generative model: each word in a document is seen as a sample from a mixture model, where the components of this mixture are multinomial variables representing the topics. Each word in a document belongs to a single topic, and each document is represented as a list of proportions of mixture components over a fixed list of topics. A shortcoming of this model is that it provides no probabilistic model at the level of documents; therefore it is not a generative model for new documents. This led to Latent Dirichlet Allocation (LDA), which adds a Dirichlet prior to the per-document topic distribution.
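The TF-IDF scheme described above can be sketched in a few lines (a minimal illustration with a toy corpus; practical systems use refined variants of this weighting):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents.

    Returns one {term: weight} dict per document. Term frequencies are
    normalized by document length; IDF is log(N / df) as described above.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        vectors.append({term: (count / length) * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors

# Toy corpus: words occurring in fewer documents receive a higher weight.
docs = [["love", "heart", "love"], ["snow", "cold", "winter"], ["love", "cold"]]
vecs = tfidf(docs)
```

Note that a term occurring in every document receives weight log(1) = 0, which is exactly the discriminative behavior described above.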

3.2 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a standard among topic models, introduced by David Blei, Andrew Ng and Michael Jordan in 2003 [15]. The basic idea behind LDA, and topic models in general, is that documents tend to belong to a mixture of multiple topics. We demonstrate this intuition for a song’s lyrics shown in Figure 3.1.

[Figure: the lyrics of ‘California Dreamin’ by ‘The Mamas & The Papas’ shown as a mixture of four topics, each a distribution over words with most probable terms ‘Christ, Church, Pray’, ‘Snow, Winter, Cold’, ‘Travel, Home, Way’ and ‘California, States, New-York’; per-word topic assignments and the per-document topic proportions are indicated.]

Figure 3.1: Example of lyrics as a mixture of topics, created by a generative process.

It is assumed that a number of topics exists for a collection, each represented by a distribution over words. In Figure 3.1, the lyrics to “California Dreamin” by “The Mamas and the Papas” are assumed to be generated as follows: first a distribution over the topics is chosen (the histogram), then for each word a topic assignment and the corresponding word from that topic are chosen. The statistical model is further explained in the next section.

3.2.1 Statistical Model

LDA is a generative model for the documents, i.e. it specifies an imaginary probabilistic procedure by which documents can be created. A topic is defined as a distribution over a fixed vocabulary. For each document in the collection, the words are generated in a two-stage process.

1. Randomly choose a distribution over topics, Multinomial(θ), with θ drawn from a Dirichlet prior with parameter α.

2. For each word in the document:

2.1. Randomly choose a topic from the per-document distribution over topics, Multinomial(θ).

2.2. Randomly choose a word from the chosen topic’s distribution over the vocabulary, Multinomial(β), with β drawn from a Dirichlet prior with parameter η.
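The two-stage process can be illustrated with a small simulation (a sketch with a hypothetical six-word vocabulary and two topics; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and two topics, purely for illustration.
vocab = ["church", "pray", "snow", "winter", "travel", "home"]
K, V = 2, len(vocab)
eta, alpha = 0.5, 0.5

# Each topic beta_k is a distribution over the vocabulary, drawn from Dir(eta).
beta = rng.dirichlet([eta] * V, size=K)

def generate_document(n_words):
    # Step 1: draw the per-document topic proportions theta ~ Dir(alpha).
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)    # step 2.1: choose a topic
        w = rng.choice(V, p=beta[z])  # step 2.2: choose a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
```

Running the sketch repeatedly shows documents dominated by different topic mixtures, which is exactly the intuition of Figure 3.1.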

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportions (step 1); each word in each document is drawn from one of the topics (step 2.2), where the selected topic is chosen from the per-document distribution over topics (step 2.1). This is the distinguishing characteristic of LDA: all the documents in the collection share the same set of topics, but each document exhibits these topics in different proportions [13]. The distribution used to draw the per-document topic distributions is called a Dirichlet distribution. In the example document in Figure 3.1, the distribution over topics would place probabilities on the topics ‘religious’, ‘winter’, ‘traveling’ and ‘city’. The goal of topic modeling is to automatically discover the topics in a corpus. The documents are observed, while the topics, per-document topic distributions, and per-document per-word topic assignments are hidden. The computational problem is to infer this hidden structure, which can also be seen as reversing the generative process. It is emphasized that the algorithm uses no labels or human-supplied information to infer the topics; the fact that many of the inferred topics are interpretable by their most probable words comes from the statistical structure of language and the probabilistic assumptions of LDA. In generative probabilistic modeling, data is treated as arising from a generative process that includes hidden variables. This process defines a joint probability distribution over the observed

1. For each topic k ∈ {1, …, K}:
2.     Generate β_k = (β_{k,1}, …, β_{k,V})^T ∼ Dir(·|η)
3. For each document d:
4.     Generate θ_d = (θ_{d,1}, …, θ_{d,K})^T ∼ Dir(·|α)
5.     For each i in {1, …, N_d}:
6.         Generate z_i ∈ {1, …, K} ∼ Mult(·|θ_d)
7.         Generate w_i ∈ {1, …, V} ∼ Mult(·|β_{z_i})

Figure 3.2: Plate-notation and generative process of the LDA model.

variables (the words in the documents) and hidden random variables (the topic structure). Data analysis is performed using that joint distribution to compute the conditional distribution of the hidden variables given the observed variables. This conditional distribution is also called the posterior distribution.

LDA can be described using the following notation. The topics are β_{1:K}, where each β_k is a distribution over the vocabulary. The topic proportions for the d-th document are θ_d, where θ_{d,k} is the topic proportion for topic k in document d. The topic assignments for the d-th document are z_d, where z_{d,n} is the topic assignment for the n-th word in document d. Finally, the observed words for document d are w_d, where w_{d,n} is the n-th word in document d, an element of the fixed vocabulary. Using this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
\]

This distribution specifies a number of dependencies. For example, the topic assignment z_{d,n} depends on the per-document topic proportions θ_d; these dependencies define LDA. Another way to express these dependencies is a probabilistic graphical model or plate-notation. The graphical model for LDA is depicted in Figure 3.2, next to an overview of the generative process. Each node is a random variable in the generative process; hidden nodes are unshaded, observed ones are shaded. The rectangles or plates denote replication: the N-plate replicates the words within a document, the D-plate the documents in the collection. α and η are vectors with the prior weights for, respectively, the per-document topic and per-topic word Dirichlet distributions.

3.2.2 Computation

We now briefly discuss the computation of the conditional distribution of the topic structure given the observed documents. Using the notation introduced above, the posterior is defined as

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}
\]

The numerator is the joint distribution of all the random variables. The denominator is the marginal probability of the observations: the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure. The number of instantiations is however exponentially large, making this sum intractable to compute. Much research has been performed to develop efficient methods for approximating it. Topic modeling algorithms generally fall into two categories: sampling-based algorithms and variational algorithms. Gibbs sampling is the most commonly used sampling algorithm. Here a Markov chain is constructed whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm runs the chain for a long time, collects samples from the limiting distribution, and approximates the posterior with the collected samples. Variational methods are a deterministic alternative to sampling methods. Rather than approximating the posterior with samples, variational methods posit a parametrized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Loosely speaking, both types of algorithms perform a search over the topic structure.
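For illustration, a collapsed Gibbs sampler for LDA can be sketched as follows (a minimal, unoptimized sketch on a toy corpus; real implementations use many more iterations, larger corpora and hyperparameter optimization):

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in range(V).
    Returns estimated document-topic (theta) and topic-word (beta) distributions.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))  # topic counts per document
    nkw = np.zeros((K, V))  # word counts per topic
    nk = np.zeros(K)        # total words per topic
    z = []                  # current topic assignment of every word
    for d, doc in enumerate(docs):
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the word's current assignment from the counts...
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # ...and resample its topic from the full conditional.
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)
    beta = (nkw + eta) / (nk[:, None] + V * eta)
    return theta, beta

# Two tiny "documents" over a 4-word vocabulary; words 0-1 and 2-3 co-occur.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3]]
theta, beta = gibbs_lda(docs, K=2, V=4)
```

With co-occurring word groups as in this toy corpus, the sampler tends to separate the two groups into distinct topics, mirroring the search over the topic structure described above.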

3.3 Labeled Latent Dirichlet Allocation

LDA has the ability to model multiple topics per document; however, it is not appropriate for a multi-labeled corpus because, as an unsupervised model, it offers no possibility of incorporating a supervised label set into its learning procedure. LDA brings no guarantee that all learned topics will be interpretable and ready to be used in an application. In our case, some supervised labeling of song lyrics will be available (this will be discussed in the following chapter). It is therefore desirable to incorporate this supervised information in the model. Several modifications of LDA have been proposed that incorporate supervision. Supervised LDA [14] and DiscLDA [37] are two such models. Both models limit documents to being associated with only one label, and the learned topics do not correspond directly with the label set. As will be discussed later on, lyrics tend to be assigned multiple labels. Therefore a model which takes multiple labels per document into account is better suited to our needs. A third supervised model, providing this, is Labeled-LDA (L-LDA) [60]. In contrast to standard LDA and its supervised variants, L-LDA associates each label with exactly one topic. L-LDA can be seen as an extension of both LDA (by using supervision) and Multinomial Naive Bayes (by incorporating a mixture model).

3.3.1 Statistical Model

Like LDA, L-LDA models each document as a mixture of underlying topics. Unlike LDA, L-LDA constrains the topic model to use only topics that correspond to a document’s observed label set. Again each document d is represented by a list of word indices w^{(d)}_i, but now a list of binary topic presence/absence indicators Λ^{(d)} = (l_1, …, l_K) is added, where w^{(d)}_i ∈ {1, …, V} and l_k ∈ {0, 1}. V is the vocabulary size and K is the number of unique labels in the corpus.

The multinomial topic distributions over the vocabulary, β_k for each topic k, remain the same as for traditional LDA. However, in this case the multinomial distribution θ^{(d)} for word-topic assignments must be restricted to the topics that correspond to labels in Λ^{(d)}. First, the document labels are generated using a Bernoulli coin toss for each topic k, with labeling prior Φ_k. Then, the vector of document labels is defined as λ^{(d)} = {k | Λ^{(d)}_k = 1}. This allows us to define a document-specific matrix L^{(d)} of size M_d × K for each document d, where M_d = |λ^{(d)}|, as follows. For each row i ∈ {1, …, M_d} and column j ∈ {1, …, K}:

\[
L^{(d)}_{ij} =
\begin{cases}
1 & \text{if } \lambda^{(d)}_i = j \\
0 & \text{otherwise.}
\end{cases}
\]

Thus the i-th row of L^{(d)} has an entry of 1 in column j if the i-th document label λ^{(d)}_i is equal to topic j, and zero otherwise. This matrix is used to project the parameter vector of the Dirichlet topic prior α = (α_1, …, α_K)^T to a lower-dimensional vector:

\[
\alpha^{(d)} = L^{(d)} \times \alpha = (\alpha_{\lambda^{(d)}_1}, \ldots, \alpha_{\lambda^{(d)}_{M_d}})^T.
\]

The dimensions of the projected vector correspond to the topics represented by the labels of the document. For example, consider a document given labels Λ^{(d)} = (0, 1, 1, 0), which implies λ^{(d)} = {2, 3}; then

\[
L^{(d)} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}.
\]

θ^{(d)} is then drawn from a Dirichlet distribution with parameters α^{(d)} = L^{(d)} × α = (α_2, α_3)^T. This fulfills the requirement that a document’s topics are restricted to its own labels. The remaining part of the algorithm is identical to regular LDA. The dependency of θ on both α and Λ is indicated by directed edges from Λ and α to θ in the plate-notation in Figure 3.3; the generative process is again depicted next to the plate-notation.

L-LDA can also be seen as an extension of Multinomial Naive Bayes. In the singly-labeled case, the probability of each document under L-LDA is equal to its probability under the Multinomial Naive Bayes event model trained on the same document instances. Unlike Multinomial Naive Bayes, no decision parameter is encoded, and for multi-labeled corpora Naive Bayes would require a separate classifier for each label. By contrast, L-LDA assumes that each document is a mixture of topics, so the probability of a single word instance is distributed over the document’s observed labels.
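The projection can be made concrete with a small sketch (a minimal illustration using numpy; indices are 0-based here, so the text’s 1-based label set λ = {2, 3} corresponds to array positions 1 and 2):

```python
import numpy as np

def label_projection(labels, alpha):
    """Build the document-specific matrix L^(d) and the projected prior alpha^(d).

    labels: binary presence/absence vector Lambda^(d) of length K.
    alpha:  Dirichlet prior vector of length K.
    """
    lam = np.flatnonzero(labels)         # indices of the document's labels
    L = np.zeros((len(lam), len(labels)))
    L[np.arange(len(lam)), lam] = 1      # row i has a 1 in column lambda_i
    return L, L @ alpha

# The example from the text: Lambda^(d) = (0, 1, 1, 0) selects the 2nd and 3rd topics.
alpha = np.array([0.1, 0.2, 0.3, 0.4])
L, alpha_d = label_projection([0, 1, 1, 0], alpha)
# L is [[0, 1, 0, 0], [0, 0, 1, 0]] and alpha_d is (0.2, 0.3)
```

The per-document Dirichlet is then drawn with the two-dimensional parameter alpha_d, so only the document’s own labels can receive probability mass.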

1. For each topic k ∈ {1, …, K}:
2.     Generate β_k = (β_{k,1}, …, β_{k,V})^T ∼ Dir(·|η)
3. For each document d:
4.     For each topic k ∈ {1, …, K}:
5.         Generate Λ^{(d)}_k ∈ {0, 1} ∼ Bernoulli(·|Φ_k)
6.     Generate α^{(d)} = L^{(d)} × α
7.     Generate θ^{(d)} = (θ_{λ_1}, …, θ_{λ_{M_d}})^T ∼ Dir(·|α^{(d)})
8.     For each i in {1, …, N_d}:
9.         Generate z_i ∈ {λ^{(d)}_1, …, λ^{(d)}_{M_d}} ∼ Mult(·|θ^{(d)})
10.        Generate w_i ∈ {1, …, V} ∼ Mult(·|β_{z_i})

Figure 3.3: Plate-notation and generative process of the L-LDA model.

3.3.2 Computation

The computation process for learning and inference used in [60] is similar to that of traditional LDA; Gibbs sampling is commonly used for L-LDA. The only distinction is that topic assignments are restricted to the document’s set of labels.

3.3.3 Application

In [60], L-LDA is applied for visualizations of a corpus of tagged web pages. Improved expressiveness is demonstrated over traditional LDA. L-LDA outperformed Support Vector Machines (SVMs) by more than 3 to 1 when extracting tag-specific document snippets. As a multi-label text classifier, the model is competitive with SVMs across a variety of datasets.

3.4 Evaluation

While topics learned by topic models are statistically justified, they can be of little use to end users when no particular theme is recognizable. To apply topic modeling to real-world problems, some form of evaluation is required. In research concerning topic modeling, a variety of methods have been used to evaluate the quality of a topic model, depending on its application.

1. The most common methods are measures of model fit, estimating the probability of unseen held-out documents given some training documents [68][15]. A better model gives higher probability to the held-out documents. This approach only measures the probability of observations.

2. A second, smaller, category focuses on the semantic meaningfulness of the topics. This has been measured in several ways. One way is a fully manual evaluation of the semantic coherence of the words with high probability [16]: test subjects simply score the topics with a grade, or are asked to detect intruding words in a list of the most significant words per topic. Other scoring methods use co-occurrence measurements based on pointwise mutual information over Wikipedia for pairs of words from topics [54], measure semantic coherence between words in a lexical ontology called WordNet, or use results from search engines.

3. Topic models are also frequently evaluated by their performance on a secondary task independent of the topic space, such as sentiment detection or information retrieval [66][69], using supervised training and test sets of topics assigned to documents.

It is advisable to evaluate models using methods that match how the algorithms will be used. Since little is known about the thematic structure of lyrics, we wish to use topic models to organize, summarize and help listeners explore music; we therefore attach importance to the interpretation of the topics. Since there is no technical reason to suppose that held-out accuracy corresponds to better organization or easier interpretation, we will focus on the latter two evaluation methods.

3.4.1 Semantic Coherence

When evaluating semantic coherence, two types of methods can be distinguished in the literature: methods that use some form of human input, and methods that use external text data sources such as Wikipedia, WordNet or Google. In [16], topic models are evaluated using human experiments. Two evaluation tasks are proposed. First, word intrusion measures how semantically ‘cohesive’ the topics inferred by a model are, by letting subjects find the word, in a set of meaningful words from a topic, that does not belong with the others. The second, topic intrusion, measures how well a topic model’s decomposition of a document as a mixture of topics agrees with human associations of topics with a document, by letting subjects identify an intruding topic in a set of topics with high probability for the document. In [34], the topic modeling technique Non-negative Matrix Factorization was applied to a corpus of lyrics. Topics are evaluated by asking human test subjects to summarize the most important terms of a cluster. In a second phase the same terms were shown to the test subjects, who were then asked to choose the best tags from those collected in the first phase. The strength of agreement among test subjects is then measured by computing the probability of the actual result being attained by random behavior of the subjects. In [53][54] external data is used: first, topics are manually scored, then a variety of scoring methods are applied drawing on WordNet, Wikipedia and the Google search engine. Methods using WordNet score topics by applying a metric which measures the average semantic distance in the dataset between words of a topic. WordNet [51] is a lexical ontology that represents word sense via ‘synsets’, which are structured in a hypernym/hyponym hierarchy (nouns) or hypernym/troponym hierarchy (verbs). WordNet additionally links both ‘synsets’ and words via lexical relations including antonymy, morphological derivation and holonymy/meronymy.
Computational methods for calculating semantic relatedness between words have been developed, operating on the links between words. An important feature is the length of the path that connects words within the taxonomy. Another important concept is the Least Common

Subsumer, the most specific ancestor node in the taxonomy between two words. The metrics will be discussed in more detail in chapter 8. An example of a distance metric on the Wikipedia dataset is Pointwise Mutual Information (PMI), which scores word pairs using term co-occurrence. In search-engine-based scoring methods, a topic is queried in its entirety and scored according to the number of hits or title matches.
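A document-level variant of such a PMI score can be sketched as follows (a simplified illustration with a hypothetical toy corpus; [54] computes co-occurrence over sliding windows of Wikipedia rather than whole documents):

```python
import math
from itertools import combinations

def pmi_coherence(topic_words, docs, eps=1e-12):
    """Score a topic by the average pointwise mutual information of its word pairs.

    Co-occurrence is measured at the document level over a reference corpus;
    eps avoids log(0) for pairs that never co-occur.
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    def p(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in ds for w in words) for ds in doc_sets) / n
    scores = [math.log((p(w1, w2) + eps) / (p(w1) * p(w2) + eps))
              for w1, w2 in combinations(topic_words, 2)]
    return sum(scores) / len(scores)

# Toy reference corpus: words that co-occur score higher than words that never do.
docs = [["snow", "cold", "winter"], ["snow", "winter"], ["love", "heart"]]
coherent = pmi_coherence(["snow", "winter"], docs)
incoherent = pmi_coherence(["snow", "heart"], docs)
```

A semantically coherent topic, whose top words frequently co-occur, thus receives a higher score than an incoherent one.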

3.4.2 Matching with Supervised Data

One way to evaluate topic models is to simply match the unsupervised topics with a supervised labeling of documents. This can be measured by using topic distributions as features for automatically classifying lyrics, or by measuring the distance between the topic-word distributions of a supervised topic model and those of LDA. A good model will show high agreement between topics.
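Such a matching can be sketched as follows (an illustrative sketch with toy distributions, not the evaluation procedure used later in this thesis): each unsupervised topic is matched to the supervised topic whose word distribution is closest under cosine similarity.

```python
import numpy as np

def match_topics(unsup, sup):
    """Match each unsupervised topic to the closest supervised (labeled) topic.

    unsup, sup: arrays of topic-word distributions, one row per topic.
    Returns, per unsupervised topic, the best label index and its cosine similarity.
    """
    u = unsup / np.linalg.norm(unsup, axis=1, keepdims=True)
    s = sup / np.linalg.norm(sup, axis=1, keepdims=True)
    sims = u @ s.T  # pairwise cosine similarities
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy distributions over a 4-word vocabulary.
unsup = np.array([[0.7, 0.2, 0.05, 0.05], [0.05, 0.05, 0.6, 0.3]])
sup = np.array([[0.6, 0.3, 0.05, 0.05], [0.1, 0.1, 0.5, 0.3]])
best, scores = match_topics(unsup, sup)
# → best is [0, 1]: each unsupervised topic matches its labeled counterpart
```

The average of the resulting similarities then gives one simple summary of the agreement between the unsupervised and supervised topic sets.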

3.5 Conclusion

In the previous sections, topic models were introduced and their statistical foundations were discussed for the unsupervised and supervised case. An important aspect is the evaluation of these models; when evaluating, we stress the importance of semantic coherence and of the usefulness of the topics for organization.

Chapter 4

The Dataset

An important aspect of research in the domain of MIR is the dataset at hand. In this chapter, several important properties and the origins of the data and lyrics used for this research are described. The main dataset used for this research is the so-called ‘Million Song Dataset’ (MSD).

4.1 The Million Song Dataset

A point of criticism often directed toward research in MIR is that the datasets on which the research is based are not representative of the data handled by commercial systems. For a long time there was also a lack of publicly open and transparent data for academic research. To help open the door to reproducible, open evaluation of music recommendation algorithms, the ‘Million Song Dataset’ was developed [12]; it pushes the boundaries of MIR research to commercial scales. The MSD was created using data from ‘The Echo Nest’ [1], a music intelligence platform company that provides music services to developers and media companies, and ‘MusicBrainz’ [3], an open music encyclopedia that collects music metadata and makes it available to the public.

1.000.000 songs
273 GB of data
44.745 unique artists
7.643 unique artist terms from ‘The Echo Nest’
2.321 unique ‘MusicBrainz’ tags
43.943 artists with at least one term
515.576 dated tracks, starting from 1922

Table 4.1: Several statistics of the Million Song Dataset

The MSD consists of 273 GB of audio features and metadata. Included are 1.000.000 songs from 44.745 unique artists, with user-supplied tags for artists from the ‘MusicBrainz’ website, comprising 2.321 unique tags. Statistics of the dataset are given in Table 4.1. The MSD does not distribute raw acoustic signals for copyright reasons, but does distribute a range of extracted audio features, like average loudness or estimated tempo.

Field name (type): description

artist familiarity (float): numerical estimation of how familiar an artist is
artist hottnesss (float): popularity of an artist
artist id (string): Echo Nest ID
artist latitude (float): latitude
artist location (string): location name
artist longitude (float): longitude
artist mbid (string): ID from musicbrainz.org
artist mbtags (array string): tags from musicbrainz.org
artist mbtags count (array int): tag counts for musicbrainz tags
artist name (string): artist name
artist terms (array string): Echo Nest tags
artist terms freq (array float): Echo Nest tag frequencies
artist terms weight (array float): Echo Nest tag weights
similar artists (array string): Echo Nest artist IDs
song hottness (float): song popularity
song id (string): Echo Nest song ID
title (string): song title
track id (string): Echo Nest track ID
year (int): song release year from MusicBrainz, or 0

Table 4.2: Fields for each file in The Million Song Dataset (non-audio related)

For the task at hand, topic detection in song lyrics, the MSD will be used solely for its metadata about songs (the song’s name, artist, etc.) and to provide IDs to match data from the two additional datasets. A list of the non-audio-related fields for each file is given in Table 4.2. Associated with the dataset, a dataset with lyrics from the commercial lyrics service ‘musiXmatch’ [4] and a dataset containing social tags from the online social music service ‘Last.fm’ [2] were released.

4.2 The musiXmatch Dataset

The ‘musiXmatch’ dataset, associated with the MSD, provides lyrics for a total of 237.662 tracks in the MSD. The lyrics come in a bag-of-words format: each song lyric is described as the word counts over a dictionary of the top 5.000 words across the set. Copyright issues prevent the lyrics from being distributed in full. All words are also reduced using a stemming algorithm based on the ‘Porter2’ algorithm [57]. For example, the words ‘cry’, ‘cried’ and ‘crying’ are all mapped to the word ‘cri’.
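In the distributed files, each track is one line holding a track ID, a musiXmatch ID and a sparse list of (word index, count) pairs into the 5.000-word dictionary. A minimal sketch of parsing such a line into a word-count dictionary (the sample line and vocabulary are invented for illustration; word indices are assumed to be 1-based, as in the distributed files):

```python
def parse_mxm_line(line, vocabulary):
    """Parse one track line of the musiXmatch bag-of-words format:
    track_id,mxm_track_id,idx:count,idx:count,...
    Word indices are assumed 1-based into the vocabulary list."""
    parts = line.strip().split(",")
    track_id, mxm_id = parts[0], parts[1]
    counts = {}
    for pair in parts[2:]:
        idx, cnt = pair.split(":")
        counts[vocabulary[int(idx) - 1]] = int(cnt)
    return track_id, mxm_id, counts

# Hypothetical vocabulary and track line, for illustration only.
vocab = ["i", "the", "you", "love", "cri"]
track_id, mxm_id, bow = parse_mxm_line(
    "TRAAAAA128F0000000,123456,1:10,4:3,5:2", vocab)
```

The resulting dictionary is the per-song bag-of-words representation used throughout the rest of this work.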

4.3 Last.fm Social Tags Dataset

Last.fm is a music website which tracks its users’ listening habits by recording details of the songs they are listening to. Last.fm builds a detailed profile of each user’s musical taste and recommends music accordingly. The site also offers numerous social networking features. Users are able to assign tags to songs, also called social tags.

Some statistics of the Last.fm dataset are given in Table 4.3.

943.347 tracks matched from the MSD to Last.fm
505.216 tracks with at least one tag
584.897 tracks with at least one similar track
522.366 unique tags
8.598.630 (track, tag) pairs

Table 4.3: Statistics of the Last.fm Dataset

This tagging can be used to indicate genre, mood, artist or any other form of user-defined classification. Below is a list of the top tags with their total frequencies in the dataset.

rock 101.071
pop 69.159
alternative 55.777
indie 48.175
electronic 46.270
female vocalists 42.565
favorites 39.921
Love 34.901
...

Table 4.4: Popular tags with frequencies

Social tags will play an important role in our research. They will be evaluated for use in topic detection and text classification, as some social tags, like ‘Christmas’, are assigned to songs with a specific lyrical theme, a potential which, until now, was not recognized in MIR research. Chapter 6 elaborates on this topic.

4.4 Greenbook of Songs

The second important dataset is supplied to us by a commercial lyrics listings website. The company ‘GreenbookofSongs.com®’ (GOS), located in Tennessee, USA, specializes in classifying lyrics according to lyrical themes. The GOS is the self-proclaimed world leader in classifying songs by themes, concepts or topics. For 30 years, every song in the GOS database has been listened to and classified according to one or more topics discussed in the song, often with the help of numerous top music industry professionals and the artists themselves. Each song listing includes the name of one or more artists who recorded it, the album titles and the related record labels. The GOS is used by advertising agents, journalists, sociologists, teachers and many others; notable customers include MTV, Universal Music, CNN and EMI Music. The GOS licenses its contents to third parties for their own business, marketing and promotional purposes.

Needless to say, the GOS contains information that is very valuable for this research. As earlier research in topic detection stated, a problem is the lack of a decent ground truth. The GOS provides us with a large set of clean data, to be used for training and evaluation in classification tasks. Next to a ground truth, the GOS also provides an extensive classification hierarchy for lyrics.

The company was contacted and informed about the research. The representatives of the GOS provided us with access to the database, on the condition that a non-disclosure agreement was signed. This means no proprietary information from the GOS can be made publicly available when reporting results from this research. As permitted by representatives of the GOS, the whole of the GOS database was retrieved by crawling the website, using the web interface to the database to retrieve all song listings for each categorization. For each labeled song, the name, the album and the category were stored in a local database for further processing.
The MSD was then searched for a matching artist-title combination, using fuzzy string matching, with documents in the ‘musiXmatch’ set. For each lyric in the GOS, the performing artist was matched against all artists in the MSD by computing the Levenshtein or edit distance [41] between artist names and accepting a maximum edit distance of 20% as a correct match. The same is then performed for the title in the GOS and all song titles from the matched artist; if both maximums are met, a match is declared. The number of matched songs and other statistics are given in Table 4.5.
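The matching rule can be sketched as follows; a minimal sketch, assuming the 20% threshold is taken relative to the length of the longer string (the text does not specify the normalization):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(a, b, max_relative_distance=0.2):
    """Declare a match when the edit distance stays within 20% of the
    length of the longer string (the normalization is an assumption)."""
    a, b = a.lower(), b.lower()
    return levenshtein(a, b) <= max_relative_distance * max(len(a), len(b))
```

Small spelling variations in artist or title strings then still pass the threshold, while unrelated strings are rejected.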

183.898 song-category pairs
55.447 unique artist-song combinations
15.191 tracks matched to the MSD-set
9.261 tracks matched to the musiXmatch-set
24 super-categories
877 subcategories

Table 4.5: Statistics from ‘GreenbookofSongs.com®’

In the next chapter, the class-hierarchy made in the GOS will be discussed in detail.

4.5 Conclusion

Figure 4.1 shows an overview of the data used in this research. The GOS-dataset is expected to be clean data of commercial quality made by experts, while social tags are user-supplied and much more noisy and biased. Both sources will be evaluated further in the following chapters.

Figure 4.1: Overview of the datasets used in this research and how they are linked: the GreenbookofSongs dataset (55.447 songs with categories), the Million Song Dataset (1.000.000 songs with song, artist and audio metadata), the musiXmatch dataset (237.662 song lyrics as stemmed bags-of-words over 5.000 terms) and the Last.fm dataset (8.598.630 track-tag pairings, 522.366 unique tags).

A crucial aspect when researching topic modeling is the data at hand. For our research, we were supplied with several datasets coming from diverse origins, with different inherent properties: a core set containing all track names and other metadata (MSD), which is linked to a dataset providing lyrics (‘musiXmatch’). Two sources of labels for the text documents are available: one large, noisy and community-supplied set of keywords with unknown potential for lyrics classification, and one smaller dataset of commercial quality applying a hierarchical labeling scheme to text documents (GOS). Considering the field of lyrics classification, where few (labeled) datasets of considerable size are available, the GOS-dataset offers possibilities to study the automatic categorization of lyrics, and to evaluate noisy labeling or unsupervised topic models.

Chapter 5

Lyrics Categorization

This chapter focuses on the newly acquired dataset from ‘GreenbookofSongs.com®’ and on classification. First, we look at the taxonomy of song topics recognized by its creators. Then we measure the performance of statistical text classification of lyrics, using a selection of baseline classifiers trained on lyrics and titles.

5.1 The Greenbook of Songs’ Taxonomy

As noted earlier, labels in the GOS are arranged in a hierarchical structure. The upper level of the hierarchy consists of 24 super-categories, each of which can contain a sub-tree with a maximum of 5 levels of subcategories. The total number of subcategories is 877. Table 5.1 shows an overview of the 24 super-categories, along with the number of corresponding subcategories and the fraction of the total number of tracks that has at least one label from the super-category, i.e. the generality of the super-category. The fractions in Table 5.1 show the dominance of lyrics concerning love in music: more than half of the tracks in the dataset have been assigned one or more labels from this super-category. The second most popular topics according to the GOS are songs about specific people and life, followed by lyrics about music itself and entertainment. Rare categories are lyrics about tools and items, and space and stars. While the dominance of love in lyrics is well known, the popularity of a super-category can also be attributed to the category being broadly defined, which is partly indicated by the number of subcategories it has. A sub-tree of the hierarchical structure is shown in Figure 5.1, showing the super-category ‘Society, Class and Social Relationships’ and some of its subcategories.

The GOS-dataset contains 183.898 track-to-category pairs, with 55.447 unique tracks. On average, each song is assigned 3,31 labels; the median number of assignments is 3 and the standard deviation equals 2,35. The maximum number of labels assigned to a single

Super-category              Subcategories   #Tracks (%) in category   #Tracks (%) in lyrics-dataset
Arts and music              51              9.823 (17)                1.530 (16)
Animals                     38              2.165 (3)                 266 (2)
Beauty and fashion          12              1.704 (3)                 362 (3)
Alcohol and drugs           7               1.551 (2)                 290 (11)
Communication               23              6.033 (10)                1.103 (11)
Earth and nature            31              6.392 (11)                719 (11)
Education and knowledge     21              3.809 (6)                 1.580 (7)
Faith and religion          28              7.898 (14)                272 (10)
Food and beverages          10              1.953 (3)                 550 (2)
Geography and locations     175             8.333 (15)                1.096 (11)
Government and politics     23              3.317 (5)                 550 (5)
Holidays                    33              2.951 (5)                 333 (3)
House and home              11              2.089 (3)                 377 (4)
Law and order               16              3.643 (6)                 669 (7)
Monsters and magic          6               1.275 (2)                 230 (2)
People and life             75              11.998 (21)               2.215 (23)
Society and class           18              2.311 (4)                 447 (4)
Love and emotions           188             29.784 (53)               5.822 (62)
The universe                5               1.005 (1)                 146 (1)
Time                        45              7.213 (13)                1.300 (14)
Tools                       11              765 (1)                   150 (1)
Travelling                  39              7.069 (12)                1.287 (13)
Weights and measures        14              2.873 (5)                 584 (6)
Work and money              21              3.754 (6)                 654 (7)

Table 5.1: Super categories from the ‘GreenbookofSongs.com’

Figure 5.1: Example of a sub-tree from the hierarchical structure of the GOS-dataset: the super-category ‘Society, Class and Social Relationships’ with subcategories such as ‘Culture and Class’, ‘Ethnicity’, ‘Social Class’, ‘Gender Issues’, ‘Generation Gap’, ‘Pop Culture’ and ‘Prejudice’, and deeper subcategories such as ‘Feminism’ and ‘Stereotypes’.

song is 21 (the song ‘El Paso’ by Marty Robbins). In Figure 5.2a, the number of tracks is plotted versus the number of labels assigned to the tracks. In Figure 5.2b, only distinct super-categories are considered. The average number of assigned super-categories is about one label less than when considering the total number of labels.

Figure 5.2: (a) Tracks versus total assigned labels. (b) Tracks versus assigned super-categories. (Number of tracks on a logarithmic scale.)

                     All categories   Super-categories
Mean                 3,31             2,34
Median               3                2
Standard deviation   2,35             1,39
Maximum              21               12

Table 5.2: Statistics of label assignment

An important fact we conclude from these statistics is that the task of classifying lyrics according to theme is a multi-label classification task. Most of the manually classified songs are assigned more than one label. This has to be kept in mind for the remainder of this research, as the task of assigning multiple labels is intrinsically different from a single-label classification task.

5.2 Text Categorization

A first analysis is made using the lyrics and the assigned labels from the GOS-dataset. How well are lyrics automatically categorized using techniques from Machine Learning (ML), or in this case statistical Text Categorization (TC)? Text Categorization is the problem of automatically assigning predefined categories to text documents, lyrics in our case. TC is applied in many contexts, ranging from document filtering to automated metadata generation, spam filtering and language identification. This is, to the best of our knowledge, the first time Text Categorization is applied to a corpus of lyrics of this size (9.261 documents), as opposed to earlier work on 125 documents [46].

Lyrics are different from the texts in corpora usually classified in TC research. Unlike news articles or movie reviews, the goal of song lyrics is not to inform a reader; they use a more narrative writing style, comparable to poetry. Song lyrics benefit from well-applied poetic devices, such as metaphor, simile, alliteration, hyperbole, personification and onomatopoeia, and rely on effective use of descriptive imagery. A lyric is designed to be sung by the human voice and heard with music. Little is known about the effect of this different writing style on the performance of text classifiers. We evaluate the performance of a selection of baseline classifiers. This short introduction to Text Categorization is based on the guides found in [64] and [7].

5.2.1 Multi Label Text categorization

Text categorization is the task of assigning a Boolean value to each pair (dj, ci) ∈ D × C, where D is a domain of documents and C = {c1, . . . , c|C|} is a set of predefined categories [64]. A value of 1 assigned to (dj, ci) indicates a decision to file dj under ci, while a value of 0 indicates a decision not to file dj under ci. The goal is to approximate an unknown target function φ : D × C → {0, 1} using a second function, called the classifier, which coincides with φ as much as possible. This coincidence, also known as effectiveness, is measured using several metrics which are discussed later on.

A single-label classification is concerned with learning from a set of examples that are associated with a single label from a set of labels. When only two labels are in the set, the task is called a binary classification; when three or more labels are included, it is called a multi-class classification. In our case, lyrics are assigned one or more labels from the total set of 877 labels; this is called multi-label classification. We restrict the set to the 24 super-categories for the remainder of this task.

Multi-label classification methods can be categorized into two different groups: problem transformation methods and algorithm adaptation methods. The first group of methods is independent of the algorithm: the classification task is transformed into one or more single-label classification tasks. The second group of methods extends specific learning algorithms in order to handle multi-label data directly [33]. Some research has been done into classifying according to a hierarchy of labels, as is the case here. For simplicity, we do not include the full hierarchy of the GOS and focus on the upper level of the hierarchy, namely the super-categories.

The most widely used problem transformation method (and the one which we will apply) is called Binary Relevance. Prediction of each label is seen as an independent binary classification task.
It learns one binary classifier for each different label and transforms the original dataset into a collection of datasets, one for each label, in which each text is assigned a binary label (0 or 1). These classifiers are also commonly referred to as one-against-all or one-versus-rest classifiers.
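The Binary Relevance transformation itself can be sketched as follows; a minimal sketch with invented toy documents and labels (any binary classifier can then be trained on each resulting dataset):

```python
def binary_relevance_datasets(documents, label_sets, all_labels):
    """Transform a multi-label dataset into one binary dataset per label.

    documents:  list of texts
    label_sets: list of label sets, one per document
    all_labels: the full label set C
    Returns {label: [(document, 0 or 1), ...]}.
    """
    return {
        label: [(doc, int(label in labels))
                for doc, labels in zip(documents, label_sets)]
        for label in all_labels
    }

# Toy example with two invented super-categories.
docs = ["love song lyrics", "money and work", "love and money"]
labels = [{"Love"}, {"Work and money"}, {"Love", "Work and money"}]
datasets = binary_relevance_datasets(docs, labels, {"Love", "Work and money"})
```

Each label now has its own positive/negative dataset, so one independent binary classifier per super-category can be trained on it.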

5.2.2 Machine Learning Approach

In ML, pre-classified documents are the key resource for making intelligent decisions. The approach relies on the availability of an initial corpus Ω = {d1, . . . , d|Ω|} ⊂ D of documents pre-classified under the set of labels C = {c1, . . . , c|C|}. All values of the total function φ : D × C → {0, 1} are known for every pair (dj, ci) ∈ Ω × C. A document dj is a positive example of category ci if φ(dj, ci) = 1, and a negative example of ci if φ(dj, ci) = 0. Once the classifier has been built, its effectiveness is measured. For this, the initial corpus is split into two sets: one training set, on which the classifier is inductively built by observing the characteristics of these documents, and a test set used for testing the effectiveness of the classifiers. Each document from the test set is fed to the classifier, and the classifier’s decisions are compared to those of the experts (from the GOS). A measure of classification effectiveness is based on how often the predicted value matches the true value.

5.2.3 Attributes

Document indexing

Before texts are interpreted by a classifier, an indexing procedure needs to be applied that maps a text onto a more compact representation of its content. A text dj is usually represented as a vector of term weights d⃗j = (w1j, . . . , w|τ|j), where τ is the set of terms, also called features. Approaches differ in their understanding of what a term is and in the way term weights are computed. We are bound to the representation of one word per term: as our text already comes in a bag-of-words representation, we have no knowledge about the order of the words, so we cannot use any sequential information about the features. As for term weighting, weights can be binary (1 denoting presence and 0 absence of the term), Term Frequency (the number of times the term occurs), or, often, the TF-IDF representation is used (depending on the choice of algorithm). The standard TF-IDF function is defined as

TF-IDF(tk, dj) = TF(tk, dj) · log(|D| / DF(tk)),

where TF(tk, dj) denotes the number of times term tk occurs in dj, DF(tk) denotes the number of documents in which term tk occurs, and |D| is the total number of documents in the corpus. This function embodies the intuitions that the more often a term occurs in a document, the more representative it is of its content, and that the more documents a term occurs in, the less discriminating it is. In our classification, all three representations are used.
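As an illustration, the TF-IDF weight above can be computed directly from a tokenized corpus; a minimal sketch (the toy corpus is invented, and the natural logarithm is an assumption, as the log base is not specified):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF(t, d) = TF(t, d) * log(|D| / DF(t)) over a tokenized corpus."""
    tf = doc.count(term)                          # TF(t, d)
    df = sum(1 for d in corpus if term in d)      # DF(t)
    return tf * math.log(len(corpus) / df) if df else 0.0

# Toy corpus of three tokenized "lyrics".
corpus = [["love", "love", "heart"],
          ["money", "work"],
          ["love", "money"]]
weight = tf_idf("love", corpus[0], corpus)  # TF = 2, DF = 2, |D| = 3
```

A term that appears in every document gets weight 0 regardless of its frequency, which is exactly the "less discriminating" intuition above.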

Feature Selection

The performance of many of the more sophisticated learning algorithms does not scale well with a high dimensionality of the term space. Because of this, before classifiers are trained, dimensionality reduction (DR) is often applied, which reduces the vector space. DR automatically removes non-informative terms according to corpus statistics.

Various DR methods have been proposed; two of the most effective ones are Information Gain (IG) and the χ² statistic [71]. We will use the χ²-statistic to reduce the number of unique terms. The χ²-statistic measures the lack of independence between a term t and a category c, and can be compared to the χ²-distribution with one degree of freedom to judge extremes. Using the two-way contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents, the terms are valued using:

χ²(t, c) = N · (AD − CB)² / ((A + C) · (B + D) · (A + B) · (C + D)).

The χ²-statistic is a normalized value, so values are comparable across terms for the same category.
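A direct transcription of this statistic (the contingency counts in the example are invented for illustration):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category contingency table.

    a: documents containing the term AND in the category
    b: documents containing the term, NOT in the category
    c: documents in the category, NOT containing the term
    d: documents with neither
    """
    n = a + b + c + d  # total number of documents N
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Invented counts: a term that co-occurs strongly with one category.
score = chi_square(a=30, b=10, c=20, d=940)
```

Ranking all terms by this score per category and keeping the top-scoring ones is the feature-selection step applied later in the evaluation.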

Feature Transformation

A different approach is feature transformation. Feature transformation methods create a new set of features as a function of the original set of features. Not surprisingly, these methods are the same techniques used for topic modeling: Latent Semantic Analysis (LSA), its probabilistic variant pLSA, and LDA have all been used as feature transformation methods for classification tasks. These methods are blind to the underlying class distribution; the features found by topic models are not necessarily the directions along which the class distribution of the underlying documents can best be separated. Techniques have also been proposed that perform the feature transformation using the class labels for effective supervision. L-LDA was applied in this regard.

5.2.4 Classifiers

In this section some text-classification algorithms applied to the lyrics are briefly discussed. All classifiers will be used in a binary labeling scheme, as part of a multi-label classification.

Naive Bayes

The Naive Bayes classifier (NB) models the distribution of the documents in each class using a probabilistic model with independence assumptions about the distributions of the different terms [47][48]. Two variants of the NB classifier are often distinguished, the difference being whether word frequencies are taken into account. The probability of a document d being in class c is computed as

P(c | d) ∝ P(c) · ∏_{1≤k≤nd} P(tk | c),

where P(tk | c) is the conditional probability of term tk occurring in a document of class c. P(tk | c) is interpreted as a measure of how much evidence tk contributes to c being the correct class. P(c) is the prior probability of a document occurring in class c, and nd is the number of terms in document d. For classification, the goal is to find the best class for the document. The best class in NB classification is the most likely or Maximum a Posteriori (MAP) class cmap:

cmap = argmax_{c∈C} P̂(c | d) = argmax_{c∈C} P̂(c) · ∏_{1≤k≤nd} P̂(tk | c).

P̂(c) and P̂(tk | c) are estimated as follows. First, the maximum likelihood estimate is calculated; this is simply the relative frequency and corresponds to the most likely value of each parameter given the training data. For the priors this estimate is:

P̂(c) = Nc / N,

with Nc the number of documents in class c and N the total number of documents. The conditional probability P̂(tk | c) is estimated as the relative frequency of term tk in documents belonging to class c:

P̂(tk | c) = Tctk / Σ_{t′∈V} Tct′,

where Tct is the number of occurrences of t in training documents from class c, including multiple occurrences of a term in a document. The multinomial model generates one term from the vocabulary in each position of the document. The alternative to this model is the Bernoulli model. The Bernoulli model generates an indicator for each term of the vocabulary: either 1, indicating presence of the term in the document, or 0, indicating absence. The Bernoulli model estimates P(t | c) as the fraction of documents of class c that contain term t. In contrast, the multinomial model estimates P(t | c) as the fraction of tokens, or fraction of positions, in documents of class c that contain term t. When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences. As a result, the Bernoulli model typically makes many mistakes when classifying long documents.
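A minimal multinomial NB sketch following these estimates (add-one smoothing is added to avoid zero probabilities for unseen terms, a standard refinement not shown in the formulas above; the toy data is invented):

```python
import math
from collections import Counter

def train_multinomial_nb(docs_by_class):
    """Estimate log priors and smoothed log conditional probabilities."""
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for c, docs in docs_by_class.items():
        log_prior = math.log(len(docs) / n_docs)      # log P(c) = log(Nc / N)
        counts = Counter(t for d in docs for t in d)  # T_ct per term
        total = sum(counts.values())
        log_cond = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                    for t in vocab}                   # add-one smoothing
        model[c] = (log_prior, log_cond)
    return model

def classify(model, doc):
    """Return the MAP class: argmax_c log P(c) + sum_k log P(t_k | c)."""
    def score(c):
        log_prior, log_cond = model[c]
        return log_prior + sum(log_cond[t] for t in doc if t in log_cond)
    return max(model, key=score)

model = train_multinomial_nb({
    "Love":  [["love", "heart", "love"], ["heart", "kiss"]],
    "Money": [["money", "cash"], ["money", "work", "cash"]],
})
```

Working in log space avoids numerical underflow when the product runs over many terms, which matters for full-length lyrics.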

Logistic Regression

Regression modeling is a method commonly used to learn relationships between real-valued attributes [25]. An early application of regression to text classification is the Linear Least Squares Fit (LLSF) method. Suppose the predicted label for a document is pi = Ā · X̄i + b, and yi is known to be the true class label; then the aim is to learn values of A and b such that the LLSF is minimized. P is a 1 × n vector of binary values indicating the binary class to which the corresponding document belongs. Thus, if X is the n × d term matrix, we wish to determine the 1 × d vector of regression coefficients A for which ||A · Xᵀ − P|| is minimized, with || · || representing the Frobenius norm.

Logistic regression differs from LLSF in that the objective function to be optimized is the likelihood function. Instead of using pi = Ā · X̄i + b directly to fit the true label yi, we assume the probability of observing label yi is

P(C = yi | Xi) = exp(Ā · X̄i + b) / (1 + exp(Ā · X̄i + b)).

This gives a conditional generative model for yi given Xi. Logistic regression (LR) is a linear classifier, as the decision boundary is determined using a linear function of the features. In the case of binary classification, P(C = yi | Xi) can be used to determine the class label, using a threshold of 0,5.
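The decision rule can be written down directly; a minimal sketch with invented weights (learning A and b by maximizing the likelihood is not shown):

```python
import math

def predict_proba(a, x, b):
    """P(C = 1 | x) = exp(a.x + b) / (1 + exp(a.x + b)), the logistic function."""
    z = sum(ai * xi for ai, xi in zip(a, x)) + b
    return math.exp(z) / (1 + math.exp(z))

def predict(a, x, b, threshold=0.5):
    """Binary decision using the 0,5 threshold from the text."""
    return int(predict_proba(a, x, b) >= threshold)

# Invented weights over two features (e.g. two TF-IDF values).
a, b = [1.5, -2.0], 0.1
label = predict(a, [2.0, 0.5], b)
```

Because the probability is monotone in a·x + b, thresholding at 0,5 is equivalent to testing the sign of the linear function, which is why LR is a linear classifier.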

Rocchio

Rocchio’s learning algorithm originates from the field of classical Information Retrieval, originally designed to be used for relevance feedback when querying full-text databases [31]. Rocchio’s algorithm is a vector space method, representing documents in the bag-of-words representation. The algorithm computes the average vector over all training document vectors that belong to class ci (the centroid), and the distance between a test document and each centroid. The document is labeled with the class of the nearest centroid.

k-Nearest Neighbors

k-Nearest Neighbors (k-NN) is a case-based learning algorithm that is based on a distance or similarity function for pairs of observations in the vector space, such as the Euclidean distance or cosine similarity [36]. The main assumption of this classification scheme is that documents which belong to the same class are likely to be near one another in the vector space. To perform classification, the k nearest neighbors in the training data are determined, and the majority class among these k neighbors is assigned as the class label. The choice of k depends upon the size of the underlying corpus.
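Rocchio's nearest-centroid rule can be sketched as follows; a minimal sketch using Euclidean distance on dense vectors (the toy vectors are invented):

```python
import math

def centroid(vectors):
    """Average of a list of equal-length document vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def rocchio_classify(test_vec, training_by_class):
    """Label a document with the class of the nearest centroid."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    centroids = {c: centroid(vs) for c, vs in training_by_class.items()}
    return min(centroids, key=lambda c: dist(test_vec, centroids[c]))

# Invented two-dimensional document vectors for two classes.
training = {
    "Love":  [[1.0, 0.0], [0.9, 0.2]],
    "Money": [[0.0, 1.0], [0.1, 0.8]],
}
label = rocchio_classify([0.8, 0.1], training)
```

Training reduces to one averaging pass per class, which is what makes Rocchio so cheap compared to the other classifiers in this section.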

Decision Tree Classifier

A Decision Tree (DT) text classifier is a tree in which internal nodes are labeled by terms, branches departing from them are labeled by tests on the weight that the term has in the test document, and leaves are labeled by categories [36]. A test document dj is classified by recursively testing the weights that the terms labeling the internal nodes have in vector d⃗j, until a leaf node is reached; the label of this node is then assigned to dj. There are a number of standard algorithms for DT learning; among the most popular ones are ID3 [58], C4.5 [59], C5 and CART [65]. In our classification task the CART algorithm will be used. CART constructs binary trees using, at each node of the tree, the feature and threshold that yield the largest information gain.

Support Vector Machines

The Support Vector Machine (SVM) method can be seen, in geometrical terms, as the attempt to find, among all the surfaces in a vector space for documents, the surface that separates the positives from the negatives by the widest possible margin [32]. The idea can be understood by considering a case in which positives and negatives are linearly separable, in which case the decision surface is a hyperplane. The SVM method chooses the middle element from the ‘widest’ set of parallel decision surfaces, that is, from the set in which the maximum distance between two elements is highest. The best decision surface is determined by only a small set of training examples, called support vectors. Some methods do not make the assumption that positives and negatives are linearly separable. We will use SVMs that learn linear threshold functions.

5.2.5 Measures for Effectiveness

Classification effectiveness is measured in terms of the classic IR notions of precision and recall, adapted to the case of TC. Precision is defined as the probability that, if a random document d is classified under ci, this decision is correct. Analogously, recall is defined as the conditional probability that, if a random document d ought to be classified under ci, this decision is taken. Terms often used for classification tasks are true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn). The terms positive and negative refer to the classifier’s prediction; the terms true and false refer to whether the prediction corresponds to the external, expert judgment. Using these terms, precision and recall can be defined as:

Precision = tp / (tp + fp),
Recall = tp / (tp + fn).

A measure that combines these two is the harmonic mean of recall and precision, or F1-score:

F1 = 2 · (precision · recall) / (precision + recall).

For obtaining estimates of precision and recall relative to the whole category set, two different methods may be adopted. With microaveraging, the average is obtained using the sum over all individual decisions, regardless of category. In macroaveraging, precision and recall are first evaluated locally, i.e. for each category, and then averaged over the results of the different categories. These two methods may give different results, especially when categories have very different generality. Categories with few positive training instances will be emphasized by macroaveraging and much less so by microaveraging. The choice of method depends on the application’s requirements.
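The difference between the two averaging schemes can be made concrete; a minimal sketch computing micro- and macro-averaged F1 from per-category counts (the counts are invented):

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(counts):
    """Average the per-category F1 scores: every category weighs equally."""
    return sum(f1(tp, fp, fn) for tp, fp, fn in counts) / len(counts)

def micro_f1(counts):
    """Pool all decisions first: frequent categories dominate."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)

# Invented (tp, fp, fn) counts: one frequent and one rare category.
counts = [(80, 20, 20), (1, 1, 3)]
```

On these counts the micro average sits close to the large category's score while the macro average is pulled down by the rare one, which is exactly why the dominant 'Love' category distorts a micro-averaged evaluation.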

5.3 Evaluation

In the previous section, some techniques for automatic Text Categorization were discussed. We now apply these methods to our corpus of lyrics, assigning to the documents one or more labels from the 24 super-categories of the GOS-dataset. As noted earlier, the number of songs from the GOS matched to a track present in both the MSD and the ‘musiXmatch’ set equals 9.261 tracks. This will be the size of our total set.

When splitting the songs into training- and test-sets, the method of K-fold cross-validation was applied. With K-fold cross-validation, the set is split randomly into K equally sized subsets. One subset is retained as the test-set, with the remaining K − 1 sets used as training data. This process is repeated K times (the folds), with each one of the K subsets used exactly once as the validation set. The K results are then averaged to produce a single estimate. The advantage of this method is that all samples are used for testing exactly once.

Table 5.1 shows the generality of all super-categories. In section 5.1 the dominance of some categories was discussed. Because of this, our classification applies a 5-fold cross-validation, to ensure that all categories are sufficiently represented in the test-set. We opt for macroaveraging as the preferred method for overall validation: in the case of microaveraging, the dominance of the ‘Love’-category would lead to a distorted representation of the measurements, as presented in Table 5.4.

Features were filtered for English stop-words and ranked using the χ²-method; the thousand most significant features were retained for classifying the lyrics. All classifiers discussed in section 5.2.4 were applied to the data. For all classifiers, except those from the Naive Bayes family, the features were represented in the TF-IDF representation. Next to the words of the lyrics, the words of the song’s title were used. A song’s title is but a handful of words, but possibly of great value.
One would expect artists to summarize a song's content and inform listeners by assigning each song an appropriate title relevant to its thematic content. We therefore evaluate the use of song titles for classification. As a third type of features, we combine the lyrics and the words from the title by appending both strings. Words from the song title are separated from those of the lyrics by adding a label to each word. This way classifiers are aware of a word's origin and can weight the feature accordingly. Macroaverage classification results are presented in Table 5.3 using the measures discussed in section 5.2.5, microaverage results in Table 5.4. All results are presented in %. Best results per feature are indicated in bold; overall best results are underlined.
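The set-up described above (χ2 feature selection, 5-fold cross-validation, and prefixing title words so they remain distinguishable from lyric words) could be sketched along the following lines with scikit-learn. The documents, the `TITLE_` prefix and the small `k` are illustrative assumptions, not the thesis's actual code:

```python
# Sketch: chi-squared feature selection on TF-IDF features, 5-fold
# cross-validation, and a prefix that keeps title words distinct
# from lyric words. Toy corpus; duplicated docs for a quick demo.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def combine(lyrics: str, title: str) -> str:
    # Prefix title tokens so the classifier can weight them separately.
    return lyrics + " " + " ".join("TITLE_" + w for w in title.split())

docs = [combine("blue moon you saw me standing alone", "blue moon"),
        combine("jingle bells jingle all the way", "jingle bells")] * 10
labels = [0, 1] * 10

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=5),    # the thesis keeps the 1,000 best features
    LinearSVC(),
)
scores = cross_val_score(pipe, docs, labels,
                         cv=StratifiedKFold(n_splits=5), scoring="f1")
```

Because the selection and vectorization steps sit inside the pipeline, they are re-fitted on the training portion of every fold, avoiding information leaking from the test-set.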

                      |       Lyrics        |        Title        |   Lyrics + Title
Classifier            |  P      R      F    |  P      R      F    |  P      R      F
MultinomialNB         | 34,23  61,42  43,96 | 82,30  31,56  45,62 | 37,06  64,88  47,17
BernoulliNB           | 33,23  39,15  35,95 | 76,67  24,90  37,59 | 48,51  52,61  50,48
Logistic Regression   | 73,57  29,27  41,88 | 77,60  35,77  48,97 | 76,68  30,48  43,62
SVM                   | 73,76  28,25  40,85 | 74,62  39,72  51,84 | 79,52  30,07  43,64
Decision Tree         | 58,28  18,01  27,52 | 89,36  19,10  31,47 | 76,04  20,97  32,87
k-NN                  | 53,85  23,85  33,06 | 67,20  27,96  39,49 | 70,54  16,54  26,80
Rocchio               | 30,88  56,57  39,95 | 48,93  38,02  42,79 | 33,27  60,60  42,96

Table 5.3: Macroaverage results from classification

                      |       Lyrics        |        Title        |   Lyrics + Title
Classifier            |  P      R      F    |  P      R      F    |  P      R      F
MultinomialNB         | 47,97  65,15  55,26 | 78,96  47,48  59,30 | 50,30  67,62  57,69
BernoulliNB           | 45,64  48,13  46,85 | 78,10  44,97  57,08 | 57,39  58,19  57,79
Logistic Regression   | 71,28  44,13  54,51 | 76,04  49,44  59,92 | 74,50  45,61  56,58
SVM                   | 71,72  43,11  53,85 | 75,44  51,37  61,12 | 76,50  44,71  56,44
Decision Tree         | 60,45  33,72  43,29 | 84,46  35,62  50,11 | 73,67  35,70  48,09
k-NN                  | 55,42  37,63  44,82 | 65,48  40,50  50,05 | 67,35  31,64  43,05
Rocchio               | 43,14  58,78  49,76 | 60,09  38,80  47,15 | 45,23  61,36  52,07

Table 5.4: Microaverage results from classification

Some categories are inherently easier to classify than others. Standard deviations for all measurements vary between 0,10 and 0,20. Table 5.5 shows which categories were overall easier to detect and which were not, by listing the top- and bottom-3 F1-scores for single-label SVM-classification.

                          |       Lyrics        |        Title        |   Lyrics + Title
Category                  |  P      R      F    |  P      R      F    |  P      R      F
Love and emotions         | 74,84  85,71  79,90 | 75,60  89,13  81,81 | 75,41  86,26  80,47
Holidays                  | 94,37  50,15  65,35 | 81,91  45,32  58,08 | 95,31  48,34  64,04
Geography and locations   | 76,81  36,50  49,44 | 82,94  49,18  61,66 | 80,73  37,87  51,55
. . .                     |                     |                     |
Law and Order             | 60,63  13,60  22,15 | 59,27  19,57  29,24 | 65,47  13,30  22,02
Tools                     | 83,33  11,33  19,30 | 72,35  35,33  47,45 | 96,00  14,00  24,16
Education                 | 62,26   7,09  12,68 | 53,58  14,32  22,55 | 80,34  10,43  18,35

Table 5.5: Top- and bottom-3 categories from classification

It should be noted that all classifications were performed using the 'scikit-learn' module for Python [56]. All classifiers were applied using the default configurations supplied by the module. For the SVM and LR classifiers, the trade-off between recall and precision partially depends on several parameters (initially kept at their default values) that determine the level of regularization, or the penalization of errors, during training. The default parameters used for SVM and LR are slightly biased towards precision, as the results show. Table 5.6 shows results for the SVM classifier with a manually optimized regularization parameter for lyrics. A large value assigns a large penalty to errors during training of the SVM, which results in a smaller margin between the decision boundary and the positive/negative training examples.

                      |       Lyrics        |        Title        |   Lyrics + Title
Classifier            |  P      R      F    |  P      R      F    |  P      R      F
Optimized SVM         | 60,58  36,76  45,76 | 63,75  42,63  51,09 | 63,15  43,13  51,25

Table 5.6: Macroaverage results for SVM with optimized penalty parameters.

With optimized parameters, the SVM-classifier achieves a better trade-off between recall and precision for lyrics and scores an average F1 surpassing those of the Naive Bayes classifiers. Optimization also improves results when combining features.
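The manual optimization of the SVM penalty parameter could, for instance, be automated with a cross-validated grid search; the toy corpus and the parameter grid below are illustrative assumptions, not the values used in the thesis:

```python
# Sketch: tuning the SVM penalty parameter C by cross-validated
# grid search over a small grid, scored on F1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["love love heart", "jingle bells snow", "heart of gold love",
        "santa snow sleigh"] * 5
labels = [0, 1, 0, 1] * 5

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
# Larger C = larger penalty on training errors = smaller margin.
grid = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1.0, 10.0]},
                    scoring="f1", cv=5)
grid.fit(docs, labels)
best_C = grid.best_params_["svm__C"]
```

Lowering C relative to the default trades some precision for recall, which is the effect visible in Table 5.6.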

5.4 Discussion

Overall, automatic classification of lyrics performs less well than measurements reported for more standard TC corpora. The best F1-score obtained by the SVM using title-words as features, 51,84%, is well below metrics reported for classifying news articles using SVMs [32][42]. This could be due to the different nature of the writing, mentioned earlier in section 5.2, or to the taxonomy of the GOS, which may not be statistically justified. Classifiers obtain diverse results across metrics and features. LR and SVM achieve high precision for lyrics but lower recall, whereas Naive Bayes classifiers achieve high recall with below-average precision. The overall best precision of 89,36% is achieved by the DT using title-words. Combining features raises precision for some classifiers while improving recall for others; the overall best recall value is reached this way. The vector space classifiers, k-NN and Rocchio, do not perform well compared to the regression classifiers. The high recall values of the Naive Bayes classifiers can be attributed to the feature selection: when all words are used, the F1-score of the MultinomialNB-classifier drops to 33,99%, while the results of the other classifiers are less affected. Precision is higher using title-words, especially for the Naive Bayes classifiers. As expected, micro-averaged results are higher because some categories with good predictive behavior are represented in large numbers. One explanation for the good performance using titles as features could be that the people from the GOS used title-based search engines as a preliminary step in their manual classification.

5.5 Conclusion

In this chapter we applied techniques from Machine Learning to automatically classify lyrics into categories. Performance metrics are lower than those reported for classification on standard corpora. Classification using only a song's title is shown to be superior to using lyrics for most categories. An optimized SVM-classifier showed improved results for lyrics and for a combination of features.

Chapter 6

Social Tags

In this chapter we focus on a different source of data for lyrics categorization, namely community-sourced data represented by social tags.

6.1 Introduction

Social tags are free-text labels applied to items such as artists, albums and songs. Unlike traditional keyword assignment, where terms are often drawn from a controlled, static vocabulary, no restrictions are placed on social tags. Tags are generally assigned by non-experts for personal use, such as personal organization or to assist with future retrieval. The true value of these tags for MIR emerges when they are aggregated in a single, shared pool, also referred to as a folksonomy. Sharing tags improves their usefulness for other users.

6.2 Issues with Social Tags

Many positive aspects can be attributed to social tags when applied to music. However, some issues can make working with them difficult, the most prominent being the following [38]:

Cold Start  Not all artists are tagged in equal quantities: popular artists are tagged frequently while unpopular artists are not. It also takes time for new artists to build an established listener base, and thus to accumulate tags. The amount of tags is biased towards popular artists, with a very long tail of artists that are barely tagged. For tags applied to individual tracks this problem is even worse. Since social tags are not applied evenly, they are much more effective for exploring and recommending popular content.

Synonymy, Polysemy and Noise  The unstructured, free form of social tags can cause problems for those using them. Misspellings, spelling variants and synonyms are present in the pool of social tags, and words often have more than one meaning. These variants can dilute the pool of tags and add to the sparseness of the tag space.

Hacking  Sometimes taggers will tag items such as artists or tracks dishonestly. Instead of tagging items to describe them or to aid future retrieval, these taggers tag for nefarious reasons. For example, one of the tags most frequently applied to Canadian pop artist Justin Bieber is 'Brutal Death Metal'. Such malicious tags can dilute the value of the overall tag pool.

Tagger bias  A typical tagger is likely to be young, affluent, and internet-savvy. The music taste of these taggers may not be representative of the music tastes of the general population not using the social music service. This can lead to a tagging bias where some types of music receive more than their fair share of tags.

6.3 Lyrical Themes in Social Tags

Social tags can be a powerful tool to assist in the classification and exploration of music. Music service 'Last.fm' owns one of the largest collections of social tags for music and released the dataset with song-level social tags mentioned earlier. The majority of these tags are genre-related, but many other types of tags can be distinguished. In [38], a first evaluation of social tags is presented and the authors propose a first categorization of social tags; the resulting distribution is presented in Table 6.1. This categorization is based on the 500 tags most frequently applied to artists.

Tag Type         Frequency   Examples
Genre            68%         heavy metal, punk
Locale           12%         French, Seattle, NYC
Mood              5%         chill, party
Opinion           4%         love, favorite
Instrumentation   4%         piano, female vocal
Style             3%         political, humor
Misc              3%         Coldplay
Personal          1%         seen live, I own it
Organisational    1%         check out

Table 6.1: Tag-types relative frequencies proposed in [38]

In our research we investigate the application of song-level social tags based on the lyrical theme of the song, a category not previously recognized by researchers. We therefore propose a new categorization of social tags based on song-level tags. We use a hierarchical categorization, depicted in Figure 6.1, making a distinction at the top of the hierarchy between tags that are relevant and those that are not. Relevant tags are those which can be of value to other users, whereas non-relevant tags are too personal to ever be of use to anyone other than the user who assigned them, like the tags 'favorite' or 'Sh-t that my sister listened to on my pc grrrrr'. Relevant types of social tags include genre, mood, instrumentation and location, also present in the categorization for artists from [38]. A novel category is recognized for tags which refer to lyrical content; this category has subdivisions for topical content and the language spoken in the lyrics of the song.

[Tree diagram: root 'Social Tags' splits into 'Non-Relevant' and 'Relevant'; 'Relevant' branches into Genre, Mood, Artist, Instrumentation, Lyrics and Time, with sub-branches Gender, Artist/Label, Home, Topic and Language.]

Figure 6.1: Hierarchical categorization for song-level social tags

Relative frequencies for the different tag-categories are shown in Table 6.2. The categorization was applied to the 2,000 tags most frequently applied to songs. The problem of polysemy becomes apparent when classifying social tags like 'New-York', which can refer to songs about the city or to its residents. Tags from different categories are also often combined: tags like 'male vocals', 'Spanish rock', '80s metal' and 'melancholic pop' each combine several categories.

Tag Type           Frequency   Examples
Non-Relevant       25%         slgdm, heard on pandora, favourite
Relevant           75%
  Genre            40%         rock, jazz, World Music
  Mood             12%         chill, nostalgia, feel good
  Time              6%         00s, 1969
  Instrumentation   4%         acoustic, vocal
  Topic             4%         Love, summer, political
  Home              4%         USA, UK, Sweden
  Gender            2%         male, female
  Language          2%         Spanish, English

Table 6.2: Relative frequencies of different social tag types

Among the 2,000 most frequently applied social tags in the 'Last.fm'-dataset, 99 social tags referring to lyrical themes were distinguished. The most frequently applied topic-related tags are shown in Figure 6.2. Like the complete distribution of social tags, the frequencies of these social tags exhibit a long tail in terms of assignments. The most frequently applied tag, 'Love', is responsible for 18% of the total number of assignments of the 99 lyrics-related social tags.

[Figure: log-scale frequencies of the lyrics-related tags, ordered by popularity. The most frequent are: Love (34,901), summer (8,820), christian (5,283), love song (4,364), political (3,723), Christmas (3,217), sex (3,186), christian rock (2,877), comedy (2,877), freedom (2,594), night (2,413), work (2,313), Dream (2,060), names (2,030), Sleep (2,024), morning (1,943), winter (1,909), death (1,723), worship (1,619), heartbreak (1,484).]

Figure 6.2: Lyrics-related social tag frequencies

6.4 Unsupervised Clustering of Social Tags

In this section we look into the different themes tagged by users of 'Last.fm'. The 5,000 most frequently applied social tags were assessed manually for lyrics-related tags, resulting in a total of 180 social tags related to a lyrical theme. As mentioned earlier, social tags are subject to noise, synonymy and polysemy, which also means some social tags belong in the same semantic category. For example, the social tags 'heartbreak' and 'heartbreaking' could be used interchangeably by listeners. Other tags' meanings can be misinterpreted: the social tag 'Love' can label songs about the feeling, while at other times it is applied by users to indicate songs that they like very much. When manually clustering social tags into semantic categories, the 180 social tags are classified into 50 categories. To see which tags are distinct or statistically related to one another, an unsupervised clustering algorithm is applied to the lyrics assigned with social tags. First, for each of the tags, the 100 most significant lyrics (according to the relative frequency of the applied tag for the song) are retrieved from the dataset. Lyrics are then transformed to the VSM using a binary representation of documents, and the centroid for each of the social tags' lyrics is

calculated. Centroids are then clustered into 50 clusters using the standard (unsupervised) K-means algorithm. When evaluating the clusters, some, but not all, show combinations of social tags similar to the manual categorization. Clusters with high overlap, and thus semantic coherence between tags, are shown in Table 6.3. The fact that these clusters are semantically similar demonstrates the textual dependency of the application of these social tags.

Cluster   Social Tags
1         love, love song, lovesongs, lovesong
2         colors, colours
3         political, revolution
4         comedy
7         drinking, alcohol, whiskey
12        roots and culture
10        sunny, Sunday, sunshine, sun, sunny day, morning
11        politics, war, protest, anti-war
12        food
13        places, geography, cities
15        god, heaven and hell, satan and hell
16        praise & worship, praise and worship, worship
18        social commentary
19        sea, water, nature
20        death, horror
22        animals, animal kingdom, birds, animal songs, out of space
23        cowboy
25        sex, erotic, sex music, sexual
26        Christmas, xmas, holiday, Christmas tag, holidays
29        family, dad
30        Christian, Christian rock, top Christian
31        summer, summertime
32        gangsta
36        satanic, Satan
37        dreams, sleep and dreams
41        rain, weather, weather songs
42        heartbreak, breakup, goodbye, Heartbreaking, heartache, break up, love hurts, break-up, broken heart, Breakup songs, relationships, heartbroken, i miss you
43        fire
46        kids, children
50        drugs, weed, marijuana, poker

Table 6.3: Coherent clusters of social tags
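The clustering procedure described above (binary VSM, one centroid per tag, K-means over the centroids) can be sketched as follows; the tags and lyrics snippets are toy stand-ins for the 'Last.fm' data:

```python
# Sketch: binary VSM for each tag's top lyrics, a centroid per tag,
# then K-means over the centroids. Toy data only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# For each tag, the texts of its most significant lyrics.
tag_lyrics = {
    "heartbreak": ["broken heart tears goodbye", "heart aches love lost"],
    "breakup":    ["goodbye tears broken love", "lost love heart pain"],
    "christmas":  ["santa sleigh snow jingle", "merry snow bells jingle"],
    "xmas":       ["jingle bells santa snow", "snow sleigh merry santa"],
}

vec = CountVectorizer(binary=True)  # binary VSM representation
vec.fit([t for docs in tag_lyrics.values() for t in docs])

# One centroid per tag: the mean of its lyrics' binary vectors.
tags = list(tag_lyrics)
centroids = np.vstack([
    vec.transform(tag_lyrics[t]).toarray().mean(axis=0) for t in tags
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(centroids)
clusters = dict(zip(tags, km.labels_))
```

On this toy data the heartbreak-related tags end up in one cluster and the Christmas-related tags in the other, mirroring how semantically similar tags group together in Table 6.3.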

6.5 Social Tag Features in Lyrics Categorization

In the previous section, the assignment of social tags according to lyrical themes was recognized. We now attempt to assess these presumptions, and the quality of the tags, using the GOS-dataset discussed earlier. Compared to the GOS-dataset, the social tag label-set is far more noisy and sparsely applied. We now use the clean GOS-dataset to further evaluate social tags as a labeling instrument for the classification of lyrics.

All lyrics from the GOS-dataset that are also connected to the MSD have social tags assigned to them in the associated 'Last.fm' social tags dataset. The social tags are studied for use in lyrics classification by using them as features for the classification of songs. The set-up for this experiment is exactly the same as our classification task in chapter 5, except that social tags now replace the words of the lyrics in the classification task. The complete set of social tags is used, but reduced to its 5,000 most relevant tags using feature selection. Earlier we discussed feature selection for classification. One way to make a case for the potential of social tags for topic classification is to look at valuable features in the classification process. Using the χ2-method we determine the top features for classifying each of the 24 super-categories of the GOS-dataset.

Category name              top-χ2      Most valuable social tags
Arts and music             127,75      fun, upbeat, party, dance, love songs, . . .
Animals                    2.197,12    animals, animal kingdom, my zany zoo, songs with animals, . . .
Beauty and fashion         139,50      they said shoes, dedicated follow, boots, the word beauty, . . .
Alcohol and drugs          1.475,63    drugs, under the influence, alcohol, drinking, booze, . . .
Communication              113,53      question songs, questions, pox, smooth rb, gunwine, . . .
Earth and nature           379,51      rain, weather, weather songs, rain songs, water, heavy water, . . .
Education and knowledge    270,61      number songs, numbers, tnphp numbers, numbers add up, . . .
Faith and religion         254,76      Christian, heaven and hell, inspirational, gospel, angel, . . .
Food and beverages         1.582,60    food, my crazy cookbook, foods, chyzyweezie, honey, . . .
Geography and locations    831,42      places, songs with places, geography, all around the world, . . .
Government and politics    1.587,74    political, protest, anti-war, politics, war, . . .
Holidays                   11.699,51   Christmas, xmas, holiday, Christmas music, Christmas songs, . . .
House and home             122,45      tobacco road, home, harmonica, coming home, tkn, . . .
Law and order              214,71      prison, murder, law, murder ballads, police and thieves, . . .
Monsters and magic         420,03      halloween, the word strange, miracle, horror, October moon, . . .
People and life            649,12      names, songs with names, girls name, name droppers, death, . . .
Society and class          392,91      political, redneck, classic hip-hop, eastcoast rap, . . .
Love and emotions          332,27      love, Christmas, love songs, classic rock, rock, . . .
The universe               885,78      moon, space, the word moon, floating in space, blue moon, . . .
Time                       202,66      songs of day and night, time, the word time, the word night, . . .
Tools                      234,99      drink to me, wine songs, sponge, dr demento, songs with music, . . .
Traveling                  155,16      cars, trains, train songs, on the road, car, . . .
Weights and measures       93,55       the word little, gina cd, nanana, mysongs, slow and lovely, . . .
Work and money             279,95      class struggle, money, jay-z, jay-z the truth, jason, . . .

Table 6.4: Categories with most valuable social tags

Table 6.4 shows the semantic relatedness between GOS-categories and social tags; many of the informative social tags which were earlier classified as topic-related return as important features in classification. Peak values appear for the categories 'Holidays', 'Animals', 'Politics', 'Food' and 'Alcohol and Drugs'. These values can be attributed to the high semantic overlap between the social tags and the super-category, and to the popularity of the social tags. Lower values are caused by the broadness of the super-category, or by GOS-categories not recognized by the 'Last.fm' listener community or not considered 'tag-worthy'. Categories with low χ2-scores and few relevant social tags include 'Tools' and 'Weights and Measures'. When combining all categories for multi-label classification, the most significant social tags for lyrics-classification are presented

in Table 6.5.

 1. Christmas          11. food
 2. xmas               12. holidays
 3. holiday            13. country Christmas
 4. Christmas music    14. weihnachten
 5. political          15. protest
 6. animals            16. tinsel
 7. christmas songs    17. 4th of july
 8. drugs              18. animal kingdom
 9. x-mas              19. Christmas tag
10. war                20. ironman Christmas

Table 6.5: Most informative social tags

This ranking shows the dominance of tags involving holidays or Christmas in the community, as far as topics go. Not only are they frequently applied, but the community-sourced holiday labels are consistently applied in agreement with the expert assignments. We now perform classification using social tags, and combine social features with the previously used features, lyrics and the song title. Again, we also use a version of the SVM-classifier optimized for maximum F1-score.

                      |     Social Tags     | Title + Social Tags | Lyrics + Social Tags| Lyr.+Title+Soc. Tags
Classifier            |  P      R      F    |  P      R      F    |  P      R      F    |  P      R      F
MultinomialNB         | 28,53  32,44  30,36 | 51,79  44,77  48,02 | 35,73  64,71  46,04 | 36,60  66,16  47,13
BernoulliNB           | 33,23  39,15  35,95 | 37,66  47,60  42,05 | 37,06  48,76  42,11 | 41,17  54,46  46,89
Logistic Regression   | 53,23  12,73  20,55 | 83,38  23,16  36,25 | 76,70  28,80  41,88 | 78,23  30,29  43,67
SVM                   | 52,02  12,47  20,12 | 84,21  23,93  37,27 | 78,68  27,80  41,08 | 81,37  29,62  43,43
Optimized SVM         | 48,66  18,73  27,04 | 61,29  35,87  45,25 | 62,05  40,30  48,86 | 66,11  43,57  52,52
Decision Tree         | 58,28  18,01  27,52 | 83,82  22,66  35,68 | 66,27  20,17  30,93 | 74,57  22,36  34,40
k-NN                  | 16,17   6,30   9,07 | 54,41   7,68  13,46 | 71,35  15,27  25,16 | 71,85  15,55  25,57
Rocchio               | 18,99  43,91  26,51 | 31,55  51,91  39,25 | 33,96  60,70  43,55 | 34,66  60,88  44,17

Table 6.6: Macroaverage results from classification incorporating social tags

We note here that classification of the 'Holidays' category using only social tags outperforms the other features by a clear margin. Results for this category using Logistic Regression are shown in Table 6.7.

            |       Lyrics        |        Title        |     Social Tags
Category    |  P      R      F    |  P      R      F    |  P      R      F
Holidays    | 93,53  50,10  65,00 | 87,64  38,12  53,11 | 97,51  55,53  70,44

Table 6.7: Classification of holidays-category using Logistic Regression

6.6 Discussion

In Table 6.8 results are presented using each combination of features, for the MultinomialNB and the optimized SVM classifier.

                              |    MultinomialNB    |    Optimized SVM
Feature(s)                    |  P      R      F    |  P      R      F
Lyrics                        | 34,23  61,42  43,96 | 60,58  36,76  45,76
Title                         | 82,30  31,56  45,62 | 63,75  42,63  51,09
Social Tags                   | 28,53  32,44  30,36 | 48,66  18,73  27,04
Lyrics + Title                | 37,06  64,88  47,47 | 63,16  43,13  51,25
Lyrics + Social Tags          | 35,73  64,71  46,04 | 62,05  40,30  48,86
Title + Social Tags           | 51,79  44,77  48,02 | 61,29  35,87  45,25
Lyrics + Title + Social Tags  | 36,60  66,16  47,13 | 66,11  43,57  52,52

Table 6.8: Macroaverage results using all features

Using community-supplied labels directly as the only features is hard, because they are sparsely applied. Nonetheless, the highest recall and precision values are obtained using combinations of features involving social tags. Classification performance depends strongly on the frequency with which a tag is assigned and on its correspondence with the super-category. Important social tags were shown in Table 6.4 for each super-category. The super-categories whose social tags have high χ2-scores benefit the most from the inclusion of social tags. These are shown in Table 6.9, together with the improvements resulting from inclusion.

                           | Lyrics + Title + Social Tags |     Improvement
Super-category             |  P      R      F             |  P       R       F
Holidays                   | 74,04  60,96  66,68          | +12,34   +5,39   +7,8
Government and politics    | 56,72  38,91  45,93          |  +2,75   +3,82   +3,7
Food                       | 63,16  41,16  49,70          |  +2,42   −2,16   −0,96
Alcohol and drugs          | 57,78  37,24  45,15          |  +6,82   +3,1    +4,26

Table 6.9: Effect of inclusion of social tags with high χ2-scores

The overall highest F1-score is obtained by the optimized SVM-classifier using all features. It is clear that some social tags are directed at lyrical themes or topics, and are thus features of high value for classifiers. Depending on the category and the assumptions of the classifier, social tags can increase recall and/or precision.

6.7 Lyrics for Auto-tagging

One of the strategies employed to mitigate the cold start problem mentioned in section 6.2 is autotagging. Autotagging is a technique in which automated content analysis is used to predict social tags directly from audio [22]. For each tag a training set is created consisting of positive and negative examples. New and unpopular music can then be tagged automatically at rates far exceeding human taggers. One issue with autotagging is that existing autotaggers focus exclusively on audio for the prediction of social tags assigned to artists. Yet audio-related tags (like genre, mood and instrumentation) only account for 60% of the relevant tags. While a minor category in tag-assignments, some social tags are connected to lyrics: it would be impossible for an audio-trained autotagger to appropriately assign the social tag 'political' or 'Christmas'. We therefore see potential in lyrics for automatic tagging. Techniques to perform this tagging would be similar to those discussed in chapter 5. Evaluating the performance will, however, be much harder: social tags are sparsely assigned and no clean ground truth is available to measure performance for song-level prediction.

6.8 Conclusion

This chapter presented social tags from a lyrical perspective, addressing the 'lyrical theme'-specific assignment of social tags by listeners. Unsupervised clustering of lyrics groups semantically similar social tags. While not directly suited for text classification due to their sparseness, social tags used in combination with other features were shown to be beneficial. The potential of lyrics for auto-tagging was stated.

Chapter 7

Supervised Topic Model for Lyrics

In this chapter we apply Labeled Latent Dirichlet Allocation (L-LDA) to the GOS-dataset.

7.1 Introduction

This is the first of two chapters focusing on topic models. Topic models were presented in chapter 3 as generative models used for the discovery of topics in large corpora. Supervised topic models come in several varieties, depending on the use of unsupervised data or the application of a hierarchical structure when classifying. In this chapter we construct a supervised topic model using the previously discussed L-LDA model.

7.2 L-LDA using the GOS-dataset

L-LDA requires some form of labeling when devising its topic structure; documents can be assigned multiple labels, and a one-to-one relation is then constructed between each label and a topic. The GOS-dataset is used for this purpose: the 24 super-categories assigned to all documents in the GOS-dataset serve as labels. The resulting topic model is shown in Table 7.1, with words listed in descending order of probability. A software implementation by the inventors of L-LDA, 'The Stanford Topic Modeling Toolbox', was used [61].

7.3 Classification using L-LDA

We now apply L-LDA to text classification. We use the same set-up as in the previous chapter: lyrics from the GOS-dataset are labeled using its 24 super-categories. L-LDA transforms documents to vectors with one dimension per topic, measuring the contribution of the topic to the song. The advantage of using L-LDA on multiply labeled

Category name              Words with high probability
Arts-music                 dance, rock, shake, music, boogie, radio, play, party, sing, easy, song, . . .
Animals                    monkey, dog, pony, horse, bird, eagle, saddle, tiger, aye, butterfly, . . .
Beauty, fashion            beautiful, pretty, wear, jeans, boots, diamond, shoes, cool, collar, . . .
Drugs and alcohol          wine, cocaine, drugs, whiskey, junkie, beer, addicted, . . .
Communication              tell, talk, say, hello, beep, why, boo, goodbye, phone, secret, lette, . . .
Earth and nature           rain, world, fire, sky, river, sun, water, light, rose, burn, shine, . . .
Education and knowledge    one, teacher, two, sixteen, minute, zoo, seven, bottle, school, ten, . . .
Faith, religion            angel, heaven, god, faith, lai, believe, imma, jesus, help, devil, . . .
Food and beverages         sweet, candy, sugar, pie, jane, tea, egg, hungry, floating, seh, sixteen, . . .
Geography and locations    america, city, country, cowboy, georgia, tennessee, american, . . .
Government and politics    war, soldier, freedom, flag, free, king, peace, queen, se, clap, . . .
Holidays                   Christmas, jingle, santa, merry, bells, sleigh, snow, usa, wonderland, . . .
House and home             home, house, door, roof, window, clean, view, boo, room, wash, sore, . . .
Law and order              gun, danger, love, check, fight, shoot, trigger, murder, rebel, chain, . . .
Monsters and magic         miracle, magic, stranger, ghost, lucky, strange, floating, . . .
People and life            man, woman, life, girl, change, la, baby, lady, boy, mary, name, . . .
Society and class          generation, love, n!gga, indian, black, rosa, equal, white, f!ck, . . .
Love and emotions          love, Christmas, rock, heart, pop, baby, radio, city, . . .
The universe               moon, stars, moonlight, rocket, alien, galaxy, space, planet, . . .
Time                       night, tonight, waiting, summer, time, sunday, tomorrow, day, . . .
Tools                      bottle, beep, clap, rub, click, hammer, writer, paper, buzz, bang, . . .
Traveling                  train, run, road, highway, fly, walking, walk, roll, stop, cadillac, . . .
Weights and measures       little, circle, bit, big, spoon, round, open, na, woh, break, . . .
Work and money             money, work, gold, golden, buy, dollar, working, collar, love, . . .

Table 7.1: L-LDA topics with words according to descending probability

documents comes from the model's document-specific topic mixture. L-LDA can effectively perform some contextual word sense disambiguation, which suggests why L-LDA could outperform SVMs. As an example, consider an excerpt from the lyrics to the jazz-standard 'Blue Moon' in Figure 7.1. Initially, some words from this line are classified as belonging to the 'Love' (red) and 'Heartbreak' (green) topics, because the likelihood parameters p(w | topic) for words like 'love' and 'heart' are higher for 'Love' than for 'Heartbreak'. After performing inference using the topic model, the inferred document probability for 'Heartbreak' is much higher than for 'Love'. The higher probability for this label makes up for the difference in the per-word likelihood, but not for all words.

Before inference: Blue moon, you saw me standin' alone. Without a dream in my heart, without a love of my own

After inference: Blue moon, you saw me standin' alone. Without a dream in my heart, without a love of my own

Figure 7.1: Contextual word sense disambiguation using L-LDA

Classification is performed by simply thresholding the posterior topic-probabilities, as was done in [60]. The supervised model was trained on a training set containing 80% of the lyrics from the GOS-dataset, and the L-LDA representations of the remaining lyrics were then inferred. A threshold was trained on the training set for maximum F1-score; documents from the test-set with a higher topic contribution were assigned the label. Four examples of classifications are shown in Figure 7.2.

[Four panels plotting Precision, Recall and F1 (0.0–1.0) against the classification threshold for the categories (a) Arts and Entertainment, (b) Travelling, (c) Faith and Religion and (d) People and Life.]

Figure 7.2: Classification using L-LDA

The classification-threshold determines the trade-off between recall and precision. The highest F1-scores are mostly reached slightly above a threshold of 0,05 contribution by the topic, after which recall values drop sharply. Depending on the application, one could favor precision and retrieve songs according to maximum topic-contributions. For example, in the case of automatic generation of topic-based playlists, as will be discussed in section 9.3, the quality of the playlist is important: only songs with a high topic probability should then be retrieved. Again, each category was classified separately in a binary classification scheme, and metrics were macro- and micro-averaged over the 24 super-categories. Results for all standard metrics are presented in Table 7.2.

                Precision  Recall  F1
Macro-Average       39.94   47.41  43.36
Micro-Average       44.70   62.46  52.11

Table 7.2: Results of threshold classification using L-LDA vectors

As a second classification scheme, all topic scores were used as features for classification of a single category. A linear classifier, Logistic Regression (non-optimized), was used for this purpose; each document is represented in a topic-based vector space. Results are shown in Table 7.3.

                Precision  Recall  F1
Macro-Average       75.88   29.05  42.01
Micro-Average       73.57   40.61  52.33

Table 7.3: Results of Logistic Regression classification using L-LDA

Classification using thresholding obtains the third best F1-score using lyrics, with slightly higher precision than the MultinomialNB classifier but lower recall. The same holds for the linear classifier when all topics are used as classification features: precision is increased slightly. Again, penalty parameters for the Logistic Regression classifier were kept at default values, which explains the bias towards precision.

7.4 Conclusion

In this chapter a supervised topic model was created using lyrics and labels from the GOS-dataset. While competitive with the baseline classifiers, no large improvements were achieved in the classification task. The supervised model will be used for evaluation of an unsupervised model in the next chapter.

Chapter 8

Unsupervised Topic Model for Lyrics

In this chapter we apply Latent Dirichlet Allocation (LDA) on the complete corpus of lyrics and look into the contents and usefulness of the LDA-topics.

8.1 Introduction

In previous chapters, supervised data, supplied either by experts or by a community of listeners, was used to classify lyrics into topics. In this chapter we assume no taxonomy of topics is available, and thus focus on unsupervised topic detection in song lyrics. In chapter 3, techniques were discussed for learning topics which capture the latent semantics of a document collection by means of unsupervised probabilistic models, also known as topic models. We will focus on the topic model known as Latent Dirichlet Allocation (LDA). Typically, for each document only a small number of topics have a notable contribution, with a limited set of terms mainly responsible for the topic distribution. In section 3.4, the evaluation of topics learned by topic models was briefly discussed. While topic models make assumptions which lead to good statistical models of documents, they offer no guarantee of producing a human-interpretable decomposition of the texts; they anecdotally lead to semantically meaningful decompositions because of the statistical nature of the documents and of human language. A distinction was made between evaluation methods that assess the predictive model, by measuring how well the information learned from a corpus applies to unseen documents, and those which focus on human interpretability and semantic coherence. In the case of lyrics, there is no general consensus about the number of topics or thematic contents present, as there is for news corpora (Sport, Science, Entertainment, ...) or book corpora (Comedy, Thriller, Romance, ...). Since we do not know what topics may emerge from the model, we prefer evaluation which takes semantic coherence into account.

8.2 Latent Dirichlet Allocation on musiXmatch-dataset

For calculation of the unsupervised LDA topic model, the 'MALLET' topic modeling toolkit [49] was used. MALLET uses a scalable implementation of Gibbs sampling and a method for document-topic hyperparameter optimization. Since no supervised data is needed, the whole collection of lyrics in the 'musiXmatch'-dataset can be used. The 'musiXmatch'-dataset includes a variety of languages besides English, though these are present in minor quantities. Words from these non-English documents are seen by LDA as separate topics, since they are likely to occur in each other's presence and occur less frequently than most English words. A selection of these topics, obtained when calculating 60 topics over the whole corpus, is presented in Table 8.1; words are ranked from left to right by descending probability. Words in documents were represented in binary format and filtered of English stop-words.

(Spanish)  que, de, el, la, en, un, mi, se, es, ...
(Dutch)    van, al, en, de, ben, het, een, dan, ik, ...
(French)   de, la, et, le, les, je, un, que, dan, qui, ...
(German)   und, die, der, ich, das, nicht, ist, es, ein, ...

Table 8.1: Topics containing foreign languages

When calculating 60 topics, 11 topics contain mostly non-English words; these topics are of no importance for our purpose. We therefore exclude them from the topic model by excluding all documents with notable contributions to these topics. After filtering, 186,892 lyrics remain in the dataset and 55,770 were excluded. Three topic models were then inferred from this dataset for evaluation: one with 60 topics (T60), one with 120 (T120) and one with 200 (T200).

8.3 Evaluation

8.3.1 Manual Evaluation

As a baseline metric, the topics were first scored manually by the researcher, on a 3-point scale where 3 = useful and 1 = useless. A label was also applied to keep track of the different subjects of the word-topic distributions, as several word-topic distributions may contain the same theme. When scoring topics, several types of relatedness were distinguished; we present these below with some examples of each.

A strong lyrical theme

• space star earth planet sky world fly moon universe sun
• winter cold summer snow wind fall day spring
• train track back ride hear whistle blow

Word-use linked strongly to a specific genre of music

• (Reggae) dem man jah de fi ya mi pon inna ah dis run da give chorus di babylon
• (Hiphop) rhyme yo rock mic rap cause style beat check ya back em flow lyric
• (Blues) man back town little home good lord gonna road boy blue

Rhyming or clichés

• reaction action attraction situation conversation sensation
• older shoulder grow time colder little world
• els someone cause nobody everybody yeah somebody

Only topics related to a lyrical theme qualify for the highest score of 3; depending on their usefulness, other topics are assigned a 1 or 2. Topics with strong themes are interpreted as a coherent set of words all related to the same general area or category. In Table 8.2, the frequency of the different quality assignments is given for each of the topic models, along with the number of themes recognized and the themes unique to that topic model.

Quality              T60  T120  T200
1                     16    32    77
2                     12    22    39
3                     32    66    84
# Different themes    27    43    47
# Unique themes        0     1     5

Table 8.2: Results manual evaluation of topics

In the T60-model, a little over half of the topics were assigned the highest quality score, containing 26 different themes. This means that on average a theme is assigned 1.32 topics. When producing twice as many topics, the percentage of quality topics remains constant, with an increase to 43 different themes; the number of topics per theme, however, has increased to 1.53. When producing the highest number of topics, T200 contains a lower percentage of high-quality topics with only a slight increase, to 47 different themes. Themes are also spread more over different topics, at 1.78 topics per theme. When producing more topics, new lyrical themes emerge. Each theme recognized in the model containing 60 topics is also included in the models with higher numbers of topics, but distributed over several topics. The topic models with higher numbers of produced topics each contain some unique themes not included in any of the other models. In Table 8.3 the 26 themes distinguished in topics with a quality score of 3 in T60 are shown, together with the number of assigned topics in each topic model.

Lyrical Theme     T60 T120 T200 | Lyrical Theme   T60 T120 T200
Love                1    4    3 | Writing           1    1    2
House               1    2    2 | Communication     1    1    3
Coarse Language     2    3    3 | Crime             1    1    3
Day and Night       2    1    1 | Weather           1    2    3
Dancing             1    1    1 | Christmas         1    2    1
Sea                 1    2    2 | Fire              1    1    2
Heartbreak          1    2    4 | Time              1    3    4
The Body            1    2    1 | War               2    2    2
Truth and Lies      1    1    2 | Family            1    2    2
Space               1    2    1 | Money             1    1    1
Media               1    1    1 | Sky               1    1    2
Death               2    1    3 | Music             1    1    1
Sex                 1    1    4 | Christian         1    3    5

Table 8.3: Topics in a 60-topic LDA-model as interpreted by the researcher

In Table 8.4 some examples of labeled topics are shown across the different models, again words are ranked left-to-right according to descending probability.

Heartbreak
[T60]  love heart cry tear night day dream time kiss lone goodbye hold true blue only alone arm
[T120] love baby make time tri cause feel thing leave gonna pleas hurt heart wrong back stay
[T120] love heart cry tear hurt baby time goodbye eye broken feel leave make lone sad break
[T200] love heart baby cry leave pleas cause hurt try gonna make break goodbye walk time
[T200] love hurt thing made back make cause time try heart sorry feel lie wrong cry leave
[T200] pain tear year face life lost fear love heart live die cry dream shame left hope
[T200] broken heart piece fall break word left shatter love back mend start feel life nothing
[T200] heart love blue lone cry tear left day broken night dream alone sad care only break

Writing
[T60]  write letter call read time word phone line home song love hear book wrote send
[T120] write read letter word book page line song time wrote hope picture paper love story
[T200] letter write read word wrote song send love line paper hope book time make call
[T200] book page read story word write turn age line time history end learn open written

Table 8.4: Two Examples of labeled topics in the three models.

All topics for all models are shown in appendix A by their most probable words; the manual quality scores and the assigned labels are included as well. Manual evaluation is however very subjective and depends on the interpretation of the evaluating subject. A more reliable way is to use labeled data from a whole community of listeners and taggers, as is done in the following sections.

8.3.2 Semantic coherence

In section 3.4.1, the measurement of semantic coherence was discussed as an evaluation scheme for LDA-topics. In this section we experiment with scoring methods based on WordNet and Wikipedia, using an experimental set-up similar to the one used in [53]. Given some word similarity measure D(w_i, w_j), each word pair among a topic's ten most probable words is scored; all scores are then combined using the arithmetic mean, which was found to be superior to the median [53].

Mean-D(topic) = mean{ D(w_i, w_j) | i, j ∈ {1, ..., 10}, i < j }
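This scoring scheme can be sketched as a small routine in which the similarity function D is pluggable; the thesis uses WS4J's WordNet measures and DISCO, while the stand-in below is a trivial character-overlap similarity used only to make the example self-contained:

```python
from itertools import combinations
from statistics import mean

def topic_coherence(top_words, D):
    """Arithmetic mean of D over all unordered pairs of the topic's top-10 words."""
    top10 = top_words[:10]
    return mean(D(wi, wj) for wi, wj in combinations(top10, 2))

# Trivial stand-in similarity (Jaccard overlap of character sets); any
# WordNet- or Wikipedia-based measure can be plugged in as D instead.
def char_jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

space_topic = ["space", "star", "earth", "planet", "sky",
               "world", "fly", "moon", "universe", "sun"]
print(round(topic_coherence(space_topic, char_jaccard), 3))
```

With ten words there are 45 unordered pairs, so each topic score is a mean over 45 pairwise similarities.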

Metrics

For producing WordNet similarity scores, the 'WS4J' library for Java was used [28]. We now briefly describe the semantic relatedness metrics. The Least Common Subsumer (LCS), a common feature of a number of the measures, is the most specific ancestor shared by the two concepts compared. For example, 'animal' and 'fish' are both subsumers of 'shark' and 'goldfish', but 'fish' is a lower (more specific) subsumer than 'animal'.

Path Distance (PATH) Path distance counts the number of nodes on the shortest path between two words in the WordNet hypernym hierarchy.

Hirst-St Onge (HSO) [26] Two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that “does not change direction too often”.

Leacock-Chodorow (LCH) [40] The measure of semantic similarity devised by Leacock et al. [40] finds the shortest path between two WordNet synsets using hypernym and synonym relationships.

Lesk (LESK) [9] Lesk (1985) proposed that the relatedness of two words is proportional to the extent of overlaps of their dictionary definitions. Banerjee and Pedersen [9] extended this notion to use WordNet as the dictionary for the word definitions.

Wu-Palmer (WUP) [70] The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS.

Resnik Information Content (RES) [62] Resnik defined the similarity between two synsets to be the information content, IC(c) = −log p(c), of their lowest super-ordinate (the LCS).

Jiang-Conrath (JCN) [30] Also uses the notion of information content, but in the form of the conditional probability of encountering an instance of a child-synset given an instance of a parent synset.

Lin (LIN) [43] A minor modification of the JCN measure.

Wikipedia [35] As a metric using Wikipedia we use the measure proposed in [35], called DISCO (extracting DIStributionally similar words using COoccurrences). For each word, DISCO finds the words that share a maximum number of common co-occurrences in Wikipedia articles. Two words are compared by measuring the overlap between their sets of co-occurring words; sets with high overlap are deemed semantically related.

Results

Results for the semantic coherence measurements are presented in Table 8.5. These values can only be compared horizontally for each metric separately.

Distance metric      T60      T120     T200     Supervised
PATH                 0.106    0.101    0.100    0.105
HSO                  0.35     0.33     0.31     0.60
LCH                  1.13     1.07     1.06     1.14
LESK                 0.068    0.069    0.074    0.154
WUP                  0.30     0.28     0.27     0.31
RES                  0.71     0.67     0.65     0.95
JCN                  4769.21  9683.72  8804.09  12256.35
LIN                  0.086    0.078    0.077    0.095
DISCO (Wikipedia)    0.023    0.026    0.024    0.020

Table 8.5: Results for semantic coherence measurements

All metrics, except for DISCO and PATH, attribute the highest average score to the supervised topic model. For the unsupervised topics, the majority of the metrics assign the T60-model the highest score, with the JCN and LESK metrics choosing T200 and T120 respectively. We also measure the correlation between manual scores and semantic coherence metrics with the Spearman rank correlation coefficient, shown in Table 8.6. As in [53], the LESK metric attains the highest correlation with human judgment, but it assigns many topics a value of 0; the HSO metric is more consistent across the different topic models. In Figure 8.1 we present all HSO-scores versus their supervised equivalent.

Distance metric   T60    T120   T200
PATH              0.09   0.24   0.21
HSO               0.22   0.23   0.20
LCH               0.11   0.27   0.21
LESK              0.35   0.23   0.31
WUP               0.06   0.21   0.15
RES              -0.02   0.06   0.09
JCN               0.18   0.18   0.18
LIN               0.02   0.02   0.09
Wikipedia         0.21   0.13   0.16

Table 8.6: Spearman rank correlation coefficients for different scoring methods

[Figure: scatter plots of HSO-metric (0.0-3.0) versus manual score (1-3) for (a) T60, (b) T120, (c) T200, and (d) HSO-metric for all topics (T60, T120, T200, Supervised).]

Figure 8.1: HSO-score versus manual scores.

8.3.3 Match with Supervised Topic Model

In previous chapters, topics were determined using supervised data. One way to assess the quality of the unsupervised topics is to measure how well they match the topics from a supervised topic model. In chapter 7 the construction of a supervised topic model was discussed and successfully used for text classification. In the applied implementation, Labeled Latent Dirichlet Allocation (L-LDA), a topic model is constructed using the manually assigned labels of the GOS-dataset. Because of the supervised data, the resulting topics are highly interpretable and useful for topic detection, but data from the GOS-dataset is proprietary and only a small portion of the songs were matched to lyrics. The regular, unsupervised version of LDA is based solely on the text documents and operates on the complete set of 181,892 lyrics; this output was already evaluated manually in section 8.3.1. In this section we measure to what extent topics from an unsupervised model can be matched with those inferred from a smaller labeled corpus. For this task a new supervised model was constructed, different from the one inferred for classification in chapter 7. To increase the range of possible topic matches with unsupervised modeling, the supervised topic model was constructed with 38 topics; it is presented in appendix B. Super-categories from the GOS with many subcategories, like 'Love', were split up into several topics; minor categories like 'Tools' were removed; and some themes recognized by a community, as shown in Table 6.3, were also represented. Topics from LDA are assessed by the maximum amount of similarity they share with one, and just one, of the supervised topics. We measure this by calculating the cosine similarity from each of the unsupervised topics to each of the supervised topics, representing the word-topic distributions as vectors. Cosine similarity between two vectors A and B is defined as

CosineSimilarity = cos(θ) = (A · B) / (||A|| ||B||)

For each of the unsupervised LDA-topics, the cosine similarity is calculated with each of the supervised L-LDA topics, resulting in a similarity distribution for each LDA-topic. These distributions are then scored according to the extent to which they show 'peakedness', which measures how distinct a theme an LDA-topic shows; this statistical measure, known as kurtosis, is used throughout the remainder of this chapter. A clear distinction in similarity from the other L-LDA topics is found to be often linked to high interpretability.

Kurtosis

Karl Pearson introduced the idea of kurtosis to describe distributions that differ from normal distributions in terms of 'peakedness' [55]. Kurtosis (β2) is defined as the fourth central moment divided by the square of the variance.

β2 = E[(X − µ)⁴] / (E[(X − µ)²])² = µ4 / σ⁴

with µ4 the fourth moment about the mean and σ the standard deviation. This is the definition used in older works; we apply a variation called excess kurtosis (γ2), which is identical except that a value of 3 is subtracted from β2.

γ2 = µ4 / σ⁴ − 3
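SciPy's `kurtosis` computes this excess form directly (`fisher=True`); the quick check below contrasts a peaked similarity distribution with a flat one, using made-up values:

```python
import numpy as np
from scipy.stats import kurtosis

# A 'peaked' similarity distribution: one supervised topic matches strongly.
peaked = np.full(38, 0.05)
peaked[7] = 0.8

# A flat distribution: no supervised topic stands out.
flat = np.random.default_rng(3).uniform(0.0, 0.1, size=38)

# fisher=True (the default) returns excess kurtosis, gamma_2 = mu_4/sigma^4 - 3,
# so a normal distribution scores 0 and a strong single peak scores high.
print(kurtosis(peaked, fisher=True), kurtosis(flat, fisher=True))
```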

This correction makes the kurtosis of normal distributions equal to zero. In Figure 8.2 four examples of linking LDA with L-LDA topics are shown, using topics from T120: two with high values for kurtosis and two without. Cosine similarities with L-LDA topics are shown on the Y-axis, ranked from high to low values, versus the corresponding L-LDA labels on the X-axis. It is clear that LDA topics with high kurtosis measurements are strongly linked to one of the supervised topics. This means the unsupervised topic is similar to the supervised topic, which in many cases means the LDA-topic is interpretable by humans and thus useful. Often the peaking supervised topic even supplies an appropriate label for the unsupervised topic.

Correlation with Manual Scores

We now measure the correlation between the kurtosis values and the quality scores assigned manually in section 8.3.1, by calculating the Spearman (ρ) correlation coefficient between the two variables. Both variables are plotted in Figure 8.3, along with the matching correlation coefficients; color indicates the manual score the LDA-topics received (green = good, red = bad). The procedure for evaluation using word-distributions from a supervised model is presented once more in algorithm 1.

[Figure: cosine similarity to each of the 38 L-LDA topics, ranked from high to low, for four T120 topics:
(a) Topic 65: white black blue red sky color green eyes paint light — β2 = 19.19
(b) Topic 43: road find lead time life walk light back follow path — β2 = −0.02
(c) Topic 38: sea water ocean river swim wave sun drown sand deep — β2 = 17.29
(d) Topic 22: man little gonna back boy big town baby blue good — β2 = 0.59]

Figure 8.2: Kurtosis measure

Algorithm 1 Algorithm for evaluation using supervised topics
  Initialize list KurtosisValues
  for each unsupervised word-distribution UTopic_i, i ∈ 1 ... NumberOfTopics do
    Initialize list SimilarityDistribution
    for each supervised word-distribution STopic_j, j ∈ 1 ... NumberOfSupervisedTopics do
      Push CosineSimilarity(UTopic_i, STopic_j) to SimilarityDistribution
    end for
    Push Kurtosis(SimilarityDistribution) to KurtosisValues
  end for
  Calculate Spearman coefficient ρ(KurtosisValues, ManualScores)
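Algorithm 1 can be sketched in Python as follows. Synthetic word-topic distributions stand in for the LDA and L-LDA models, and the manual scores are random placeholders; variable names mirror the pseudocode:

```python
import numpy as np
from scipy.stats import kurtosis, spearmanr

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
n_unsup, n_sup, vocab_size = 60, 38, 500

# Synthetic word-topic distributions (each row sums to one), standing in
# for the unsupervised LDA model (U) and the supervised L-LDA model (S).
U = rng.dirichlet(np.ones(vocab_size), size=n_unsup)
S = rng.dirichlet(np.ones(vocab_size), size=n_sup)

kurtosis_values = []
for u in U:
    similarity_distribution = [cosine_similarity(u, s) for s in S]
    kurtosis_values.append(kurtosis(similarity_distribution))  # excess kurtosis

# Hypothetical manual quality scores (1-3), one per unsupervised topic.
manual_scores = rng.integers(1, 4, size=n_unsup)
rho, _ = spearmanr(kurtosis_values, manual_scores)
print(f"Spearman rho = {rho:.2f}")
```

With random stand-in data the resulting ρ is meaningless; with the real models and manual scores, this loop produces the coefficients reported in Figure 8.3.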

[Figure: scatter plots of kurtosis (0-30) versus manual score (1-3):
(a) T60, ρ = 0.49  (b) T120, ρ = 0.49  (c) T200, ρ = 0.56  (d) all topic models]

Figure 8.3: Correlation with kurtosis measure

8.3.4 Match with Social Tags

In this section we perform a similar evaluation of the LDA-topics using supervised labeling, now attempting to match LDA-topics with social tags. In chapter 6 a case was made for topic-specific assignment of tags by a community. Social tags are a non-proprietary form of information, made by and in some cases for a community of music lovers. Since there is no restriction on the form of the tags, there is no clear taxonomy of topics as there is for labels in the GOS-dataset. We previously linked some of the tags to labels from the GOS; we now, as with the GOS-labels, attempt to match social tags with topics from the unsupervised topic model. A similar technique is applied, but instead of matching word distributions, tagged documents are used. Social tags are sparsely and unevenly assigned to documents, with no true taxonomy; it is therefore not feasible to create a supervised topic model as was done with the GOS-dataset. Instead of matching word-topic distributions, we look at documents assigned social tags and calculate average topic distributions. Each tag assignment in the 'Last.fm'-dataset is equipped with a relative frequency of application. For each of the topic-related social tags, the 100 most significant documents (those tagged with the highest frequency) are retrieved from the 'musiXmatch'-dataset, and the average topic distribution under the LDA-model is computed for that social tag. Two average distributions are presented in Figure 8.4, for the social tags 'Christmas' and 'Politics'. In these examples certain topics are clearly dominant for documents assigned the social tag; words from these dominant topics are shown.

[Figure: average topic distribution over the 60 LDA-topics for documents tagged (a) Christmas and (b) Politics.]

Figure 8.4: Average topic-distributions for social tags

Christmas
[Topic 29] Christmas bells snow Santa ring merry years tree bright sleigh day
[Topic 60] god lord Jesus love heaven holy sing praise life heart glory

Political
[Topic 45] war people world live fight die kill man god children nation land

For each of the unsupervised topics, all averages are then combined and again the kurtosis is measured. An example with high and one with low kurtosis is shown in Figure 8.5.

[Figure: average tag-topic contribution per social tag for two topics:
(a) Topic 51: find time place feel day mind change found inside lost — β2 = 40.78
(b) Topic 57: rain sky pain day sun clouds tears feel fall cry — β2 = 0.71]

Figure 8.5: Correlation with kurtosis measure using social tags.

We again measure the correlation of kurtosis with the manually assigned scores. The procedure for evaluation using tagged documents is presented in algorithm 2.

Algorithm 2 Algorithm for evaluation using social tags
  Initialize list KurtosisValues
  for each unsupervised topic UTopic_i, i ∈ 1 ... NumberOfTopics do
    Initialize list AverageTopicScores
    for each lyrics-related social tag do
      Retrieve the 100 most frequently tagged lyrics
      Calculate Average = Mean(TopicScore(i)_{1...100}) over these lyrics
      Push Average to AverageTopicScores
    end for
    Push Kurtosis(AverageTopicScores) to KurtosisValues
  end for
  Calculate Spearman coefficient ρ(KurtosisValues, ManualScores)

[Figure: scatter plots of kurtosis (0-160) versus manual score (1-3):
(a) T60, ρ = 0.32  (b) T120, ρ = 0.37  (c) T200, ρ = 0.36  (d) all topic models]

Figure 8.6: Kurtosis measure using social tags

8.3.5 Analysis

In the previous sections, four methods for evaluation of unsupervised topics were proposed and executed. Topics were first scored manually, after which unsupervised or supervised methods were applied and the correlation with the manual scores was measured. Scoring methods using WordNet show less correlation with manual scores than reported for topic models of datasets containing news articles or books; this is of course strongly dependent on the manual scoring by the annotators. The maximum correlation with manual quality is reached using the LESK metric (as in [53]) for the T60 model, but these metrics are inconsistent across the topic models. More consistent correlation was found using social tags as indicators of documents with strong lyrical themes, together with the topic distributions for these documents. Several topics cause large peaks in the average topic distribution; this skew was used as a quality indicator, but only a selection of lyrical themes is recognized by a community.

The highest correlation was found using a supervised topic model's topics as reference for the unsupervised topics. Cosine similarity between each of the supervised and unsupervised topics was measured, and the level of 'peakedness' in the distribution of cosine similarities per unsupervised topic was computed. All Spearman correlation coefficients are shown in Table 8.7.

Evaluation Metric        T60    T120   T200
WordNet (LESK)           0.35   0.23   0.31
Kurtosis (Social Tags)   0.32   0.37   0.36
Kurtosis (L-LDA topics)  0.49   0.49   0.56

Table 8.7: Spearman correlation with manual evaluation

The highest correlation is achieved using supervised topics. Figure 8.3 shows a high concentration of low-quality topics at low kurtosis values, while topics of high quality stand out towards higher kurtosis scores. For all models, some overlap in kurtosis between low- and high-quality topics is present, as pictured in Figure 8.3d. Increasing the number of topics increases this overlap, making the separation between high- and low-quality topics less defined while increasing the overall number of good topics. In the following section, we analyze how many, and which, unique topics are detected using the kurtosis-based evaluation.

8.4 Topic detection

Metrics for semantic coherence score topics independently of any prior knowledge of lyrical themes. The two latter methods measure overlap with a taxonomy devised by experts or a community; along with evaluation, these methods provide the unsupervised topics with labels. We now look into which themes are discovered by standard LDA and the effect of increasing the number of topics on the detection of themes. Matches with a supervised model and with social tags are presented in Figure 8.7. Matching with a supervised model depends on the choice of themes in that model: some unsupervised topics of quality may be present in the LDA-model but go undetected because no appropriate supervised label is available. Consider the T60 model in Figure 8.7b: six labels are matched with very high kurtosis values ('Christmas', 'Water', 'Fire', 'Music', 'Religion' and 'War/Peace'). Most of these labels are the best-scoring label for exactly one of the unsupervised topics, indicating a strong concentration of the theme. For descending values of kurtosis, more unique labels are matched but with less distinction. Increasing the number of topics in the unsupervised model has the effect of matching more labels with high kurtosis from the supervised model, as shown for the T120 model in Figure 8.7d: a total of 11 different themes exceed a kurtosis threshold of 10, where for T60 only 6 score higher.

For example, the theme 'Drugs and Alcohol' is barely matched in the T60 model, while in T120 it emerges with high kurtosis values. Inspecting this match in the T60 model shows that the matched LDA-topic is not related to the label (topic 47 for T60 in appendix A) but is a collection of coarse language, while in the T120 model (topic 98 for T120 in appendix A) the match is good. When further increasing the number of topics to 200, the number of themes with scores above 10 reaches 13: more themes, but a smaller increase than from T60 to T120. Supervised themes are spread more across several LDA-topics, and each matched theme also has more 'junk' topics attributed to it. These observations are demonstrated in Figure 8.8, which shows the number of unique themes matched for increasing kurtosis values, and in Figure 8.9, which shows the average number of LDA-topics per matched theme for each of the models. A single topic is matched to each theme above a kurtosis threshold of 10 for T60 (matching 5 themes), a threshold of 15 for T120 (matching 7 themes) and a threshold of 16 for T200 (matching only 5 themes). All themes are matched at least once in T120 and T200, but T200 shows a steeper increase in LDA-topics per theme, and thus less concentration of quality LDA-topics, due to there being 80 more LDA-topics.
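This threshold analysis can be sketched as a small counting routine. The best-match labels and kurtosis values below are synthetic placeholders for the real matching results:

```python
import numpy as np

rng = np.random.default_rng(5)
n_lda_topics, n_themes = 120, 38

# Hypothetical matching results: for each LDA-topic, the index of its
# best-matching supervised theme and the kurtosis of its similarity distribution.
best_label = rng.integers(0, n_themes, size=n_lda_topics)
kurt = rng.gamma(shape=2.0, scale=4.0, size=n_lda_topics)

def unique_themes_matched(threshold):
    """Count distinct themes matched by at least one LDA-topic whose kurtosis
    exceeds the threshold."""
    return len(set(best_label[kurt > threshold]))

for t in (0, 10, 20):
    print(t, unique_themes_matched(t))
```

Raising the threshold can only shrink the set of matched themes, which is exactly the monotone trade-off visible in Figure 8.8.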

[Figure: per-label kurtosis of the best-matching LDA-topic, for social tags (left) and supervised L-LDA labels (right):
(a) T60 - Social Tag   (b) T60 - Supervised
(c) T120 - Social Tag  (d) T120 - Supervised
(e) T200 - Social Tag  (f) T200 - Supervised]

Figure 8.7: Label-matching

[Figure: number of unique themes matched (0-40) versus kurtosis threshold (-5 to 30), one curve per model (T60, T120, T200).]

Figure 8.8: Number of topics matched versus kurtosis.

[Figure: average number of LDA-topics per matched theme (0-6) versus kurtosis threshold (-5 to 30), one curve per model (T60, T120, T200).]

Figure 8.9: Number of topics per matched theme versus kurtosis.

Kurtosis values for tag-topic relations are shown in Figure 8.7. Figures 8.10 and 8.11 show the dominant GOS-labels and social tags for the T60 model; in most cases these are related. Sometimes a social tag even provides an LDA-topic with a more accurate label: for example, the social tags ‘death’ and ‘satanic’ summarize the contents of topic 15 from T60 better than the label ‘Life’ provided by the supervised topic model. In some cases social tags are linked with high kurtosis to topics that are neither matched with supervised topics nor assigned high manual scores. Examples are shown in Figure 8.7d: the social tag ‘roots and culture’ is matched to an LDA-topic containing terms related to the ‘Reggae’ genre, and ‘gangsta’ to a topic containing coarse language. While not appropriate labels for these topics, they provide some insight into their contents.

8.5 Conclusion

In this chapter we focused on an unsupervised topic model, using classic Latent Dirichlet Allocation. Three topic models containing 60, 120 and 200 topics were derived from a set of 181,892 lyrics. Topics were then evaluated using measures of semantic coherence. Overlap with topics from supervised data was measured using a newly proposed measure based on cosine similarity and kurtosis, which obtains a high correlation with manual evaluation. Combining all kurtosis scores shows which topics can be inferred using LDA. For topic matching, a supervised topic model was inferred from the GOS-dataset with supervised topics. For a model containing 120 topics, 11 unsupervised topics are strongly linked to supervised ones. Increasing the number of topics gives rise to new matches, but also increases the number of low-quality topics and the number of topics per matched theme. Some topics that are not matched to supervised topics are strongly correlated with the assignment of related social tags.

[Figure: each T60 LDA-topic is listed by its most probable words with its manual label in brackets and the supervised GOS-label it is matched to, plotted against kurtosis (-5 to 30); e.g. ‘god lord jesus love heaven ...’ [Christian] matched to Religion, ‘christmas bells snow santa ...’ [Christmas] matched to Christmas, ‘gun kill run dead shot ...’ [Crime] matched to Law/Crime.]

Figure 8.10: Manual labels versus supervised labels for T60

[Figure: each T60 LDA-topic is listed by its most probable words with its manual label in brackets and the social tag with the highest kurtosis, plotted against kurtosis (-20 to 120); e.g. ‘god lord jesus love heaven ...’ [Christian] linked to ‘praise and worship’, ‘gun kill run dead shot ...’ [Crime] linked to ‘police and thieves’.]

Figure 8.11: Manual labels versus social tags for T60

Chapter 9

Application

9.1 Introduction

In chapter 1, possible applications of lyrics-based MIR techniques were pitched. As a proof of concept, we apply topic models in three MIR-related tasks. The modified supervised topic model used for evaluation in chapter 8 was used for all applications.

9.2 Spotify Plug-in

All applications were implemented as a plug-in for the desktop software of the popular music-streaming service ‘Spotify’. Plug-ins inside the application are written in a similar way as web pages: a combination of HTML and CSS for layout and JavaScript for functionality. The backbone of the application, communicating with the ‘Spotify’ plug-in, is written in Python and presents its functions via a RESTful web service using the ‘web.py’ web framework. The main data source for this application is a transformed ‘musiXmatch’ dataset, built using the supervised topic model presented in chapter 7. All lyrics from the ‘musiXmatch’ dataset were transformed to their vector representations according to the topic contributions inferred using L-LDA and stored in a database. A UML diagram depicting the interaction between components of the application is shown in Figure 9.1.

9.3 Automatic Playlist Generation

One straightforward way of applying topic models in a music application is the automatic generation of playlists. Playlists based on lyrical theme can easily be constructed using the topic-vector representation provided by the topic model and ranking songs according to the desired theme. Themes can be combined to match several topics. Another use of the topic representations is filtering certain topics. Listeners could prefer not to have certain lyrical themes in a playlist,

[Figure: UML sequence diagram of the interaction between the Spotify plug-in, the Python web API, L-LDA and the database, via the calls RequestSongs(), FetchLyrics() and ProcessLyrics().]

Figure 9.1: Topic-based plug-in for Spotify

Figure 9.2: Spotify Lyrics Plug-in

possible candidates being offensive or sexual language. This can be done by filtering out songs with topic contributions higher than a certain threshold for these themes. The function was implemented as follows in the ‘Spotify’ plug-in. The user is presented with all topics present for lyrics in the database, and can select high or low contributions for the topics using radio-button lists. A possibility to retrieve only songs from a certain genre was also implemented. After selecting preferences, the user submits a list with all preferred and rejected topics to the web service. The web service then fetches a random selection of 10,000 topic distributions from the lyrics database. The suitability of a song is measured by calculating the harmonic mean of all contributions of preferred topics and the inverted contributions of all rejected topics. The harmonic mean is used because it is only high when all components of the preferred topics are high.
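The scoring rule just described can be sketched as follows; the function name, the topic labels and the two toy songs are invented for the example:

```python
def playlist_score(topic_dist, preferred, rejected):
    """Harmonic mean of the contributions of preferred topics and the
    inverted (1 - c) contributions of rejected topics.  The harmonic mean
    collapses as soon as any single component is small, so a song must
    score well on every preference at once."""
    components = [topic_dist.get(t, 0.0) for t in preferred]
    components += [1.0 - topic_dist.get(t, 0.0) for t in rejected]
    if any(c <= 0.0 for c in components):
        return 0.0
    return len(components) / sum(1.0 / c for c in components)

# Invented topic distributions for two songs.
songs = {
    "song_a": {"Love": 0.60, "Sex": 0.05, "Christmas": 0.01},
    "song_b": {"Love": 0.10, "Sex": 0.50, "Christmas": 0.02},
}
# Prefer 'Love', reject 'Sex': song_a should rank first.
ranked = sorted(songs, key=lambda s: playlist_score(songs[s], ["Love"], ["Sex"]),
                reverse=True)
print(ranked)  # ['song_a', 'song_b']
```

Because the harmonic mean is dominated by its smallest component, a song with even one near-zero preferred contribution, or one high rejected contribution, drops to the bottom of the ranking.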

9.4 Artist Similarity

The use of lyrics and lyrical themes can form a crucial part of an artist’s musical identity, or of a music genre as a whole. Two genres with a specific use of language that come to mind are ‘Hip-Hop’ and ‘Singer-Songwriter’; in these genres the audio can take a secondary place to the artist’s lyricism. Bob Dylan, for example, has incorporated a variety of political, social, philosophical, and literary influences; his lyrics defied existing conventions and appealed hugely to the then burgeoning counter-culture [21]. Although lyrics take this central role in a minority of music and are not a necessity for commercial success, one cannot deny that word use, depending on the listener’s taste, is part of how music is perceived. Topic models for lyrics bring the opportunity to define artists from a lyrical perspective, and to investigate to what extent lyrical identity is interpreted as musical identity. We perform this research by matching artists in the topic-based vector space and comparing these similar artists with manually assigned similar artists; data from ‘Last.fm’ was used to this end. First, the average topic distribution is calculated for each artist with at least 20 lyrics in the dataset, after which 3,019 artists remain. For each artist, the distance to each of the other artists is calculated. Several distance metrics were tried; the best performing was the Kullback-Leibler (KL) divergence, a non-symmetric measure of the difference between two probability distributions P and Q. The KL divergence from Q to P is defined as

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \ln \frac{P(i)}{Q(i)},
\]

or the information lost when Q is used to approximate P, with P and Q being the topic distributions of two artists. The list of distances is then sorted in increasing order, so that a ranked list of similar artists is produced for each artist. Via the API of the ‘Last.fm’ music service, the 100 most similar artists are retrieved for each of the 3,019 artists as ground truth. We evaluate the similarity of the two ranked lists using the so-called Mean Reciprocal Rank (MRR), defined as

\[
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}.
\]

The reciprocal rank of a list of proposed answers is the multiplicative inverse of the rank of the correct answer. We determine the MRR of the ranks that the ‘Last.fm’ similar artists obtain in our list, ranked according to distance in the topic-based vector space. Not all ‘Last.fm’ artists are included in our lists due to a lack of lyrics, so the 5 highest-ranked ‘Last.fm’ artists that also appear in our ranked lists were chosen. Figure 9.3
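Both measures can be sketched in a few lines of Python. The artist names and distributions below are invented toy data; the actual experiment uses the averaged topic distributions of the 3,019 artists and the ‘Last.fm’ similar-artist lists as ground truth:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * ln(P(i) / Q(i)); eps guards against zeros."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_reciprocal_rank(ranking, ground_truth):
    """Mean of 1/rank (1-based) of each ground-truth item found in `ranking`."""
    ranks = [1.0 / (ranking.index(a) + 1) for a in ground_truth if a in ranking]
    return sum(ranks) / len(ranks) if ranks else 0.0

# Invented average topic distributions for three artists and a query artist.
artists = {
    "artist_a": [0.70, 0.20, 0.10],
    "artist_b": [0.60, 0.30, 0.10],
    "artist_c": [0.10, 0.10, 0.80],
}
query = [0.68, 0.22, 0.10]

# Rank artists by increasing KL divergence from the query distribution.
ranked = sorted(artists, key=lambda name: kl_divergence(query, artists[name]))
print(ranked)  # ['artist_a', 'artist_b', 'artist_c']

# Ground truth says a and c are similar: MRR = (1/1 + 1/3) / 2.
print(mean_reciprocal_rank(ranked, ["artist_a", "artist_c"]))
```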

[Figure: Mean Reciprocal Rank (vertical axis, logarithmic, 10^0 down to 10^-4) for all artists (horizontal axis, 0-2500), sorted in descending order.]

Figure 9.3: Mean Reciprocal Rank for artists in descending order

shows the resulting scores for all artists, with decreasing agreement between the rankings. The curve shows an exponential decrease in MRR scores. A selection of artists with similar artists in the topic-model vector space is depicted in Table 9.1.

Artist               MRR   Similar Artists
No Use for a Name    0,37  Good Riddance, Lagwagon, Ten Foot Pole, Pulley, ...
Michael Bublé        0,32  Tony Bennett, Carpenters, Frank Sinatra, ...
Ella Fitzgerald      0,31  Frank Sinatra, Billie Holiday, Carly Simon, Diana Krall, ...
Amerie               0,31  Ashanti, Brandy, Jennifer Love Hewitt, Joss Stone, ...
Napalm Death         0,30  Brutal Truth, Nasum, F-Minus, Against All Authority, ...
Westlife             0,30  Boyzone, Ronan Keating, Rick Astley, Rachael Yamagata, ...
Barry Manilow        0,24  Neil Diamond, Carly Simon, Carpenters, ...
4Him                 0,24  Avalon, Larue, Newsboys, Clay Crosse, Gaither Vocal Band, ...
Outkast              0,24  Jay-Z, The Pharcyde, Tech N9ne, Murs, ...
B.B. King            0,22  Eric Clapton, Billy Dean, George Jones, The Long Blondes, ...
...                  0,21  Crosby & Nash, Scorpions, ...
My Bloody Valentine  0,21  Lush, The Cure, The Cardigans, Jewel, Ivy, ...
...                  0,12  John Fogerty, Stone Roses, Bob Dylan, John Prine, ...

Table 9.1: Artists with similar artists according to topic-based vector space.

The measure shows high agreement for some artists. An interesting observation is that in some cases high similarity is computed between a solo artist and a band of which this artist is a member, as is the case for David Crosby or Art Garfunkel.

It is clear that artists belonging to some genres are better represented than others. Retrieving the social tags for the 200 artists with the best similarity predictions gives an idea of which genres have a distinguishable use of lyrics. Similar social tags are counted only once.

44  Rock
35  Metal
33  Pop
29  Hip-Hop
27  Female Vocalists
25  Christian
17  Punk
17  Soul
13  Jazz
13  Singer-Songwriter
...

Table 9.2: Genres for artists with high MRR scores.

High Mean Reciprocal Rank scores were obtained for artists from the genres ‘Metal’, ‘Hip-Hop’ and ‘Christian’, but more surprisingly also for ‘Rock’, ‘Pop’, ‘Female Vocalists’ and ‘Jazz’ singers.

9.5 Topic Models as a Tool for the Social Sciences

Lyrics have been the subject of numerous sociological studies examining the effects of lyrics on the behavior of youth, in particular the effect of lyrics with violent and misogynistic themes [8][50]. Topic models can be put to use in this field of research by offering a fast way to analyze large collections of lyrics according to lyrical theme. An example is shown in Figure 9.4, in which the average annual contribution in lyrics of a topic containing words related to crime is plotted against the annual crime rate in the United States of America. Some similarity between both curves can be noticed during the period 1970 to 1990, which hints at a certain connection; we do not, however, make any statements about this matter and solely demonstrate this possible application of topic models.
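Producing such a curve amounts to averaging one topic's contribution over all lyrics per release year. A minimal sketch, with invented (year, contribution) pairs standing in for the real per-song data:

```python
from collections import defaultdict

def average_contribution_per_year(songs):
    """Average contribution of a single topic per release year.
    `songs` is an iterable of (year, topic_contribution) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, contribution in songs:
        totals[year] += contribution
        counts[year] += 1
    return {year: totals[year] / counts[year] for year in sorted(totals)}

# Invented toy data: per-song contribution of a 'crime' topic with its year.
songs = [(1970, 0.02), (1970, 0.04), (1985, 0.05), (1985, 0.07)]
yearly = average_contribution_per_year(songs)
print(yearly)  # average per year (1970 ≈ 0.03, 1985 ≈ 0.06)
```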

[Figure: average contribution of the ‘Law and Crime’ topic (left axis, 0.00-0.05) and the crime rate per 100,000 population (right axis, 0-7000), for the years 1965-2010.]

Figure 9.4: Average ‘Crime’-Topic distribution over time versus crime rate in the U.S.A.

9.6 Conclusion

In this chapter several lyrical theme-based applications were presented. The applications were implemented as a plug-in for the ‘Spotify’ music streaming service. A database of songs, transformed to a topic-based vector space, was constructed; the plug-in interacts with this database through Python scripts. Users are able to retrieve playlists from the web service according to one or more lyrical themes. The lyrical identity of music artists was researched using topic models; measuring correlation with community-sourced annotations shows strong similarity for some artists and genres of music. The use of topic models to study lyrics from a sociological perspective was briefly discussed: as an example, the annual average contribution in lyrics of a topic about crime was plotted against the annual crime rate in the United States of America.

Chapter 10

Conclusion

In this document a selection of subjects concerning semantic analysis and topic detection in lyrics was covered. The performed experiments and results can be categorized in two areas. In chapters 5 to 7, text classification of lyrics is performed. Lyrics assigned with a set of supervised labels, acquired from a commercial lyrics listings website, were automatically categorized using several baseline classifiers. The effect of using different features, such as lyrics and song titles, was evaluated. Next, community-sourced labels known as social tags were studied for lyrics-specific assignment. Social tags were used in an unsupervised clustering, which shows the clustering of related social tags using only textual features, and were also used as features for lyrics classification. An optimized linear Support Vector Machine achieved the highest results using a combination of all three features; performance is lower than for classification of news or book corpora. A supervised topic model trained on the small labeled subset was used for classification, showing results competitive with the baseline performers but no overall improvement. All results, using textual and topic features, are presented in Table 10.1.

Feature(s)                   Prec.  Rec.   F1
Lyrics                       60,58  36,76  45,76
Title                        63,75  42,63  51,09
Social Tags                  48,66  18,73  27,04
Lyrics + Title               63,16  43,13  51,25
Lyrics + Social Tags         62,05  40,30  48,86
Title + Social Tags          61,29  35,87  45,25
Lyrics + Title + Social Tag  66,11  43,57  52,52
L-LDA Topics                 44,70  62,46  52,11

Table 10.1: Complete macroaverage results for classification of lyrics using a Support Vector Machine and L-LDA

In chapter 8, Latent Dirichlet Allocation, an unsupervised topic model, was inferred from the corpus of lyrics and evaluated according to semantic coherence and interpretability. Three models with different numbers of topics were constructed from the lyrics and evaluated manually. Evaluation was also performed using metrics based on WordNet; these showed less correlation with the manual scores than metrics including supervised data. A metric was proposed relying on supervised topics or labeled documents and a measure of ‘peakedness’. LDA-topics are scored according to the extent to which they match supervised data; ideally only one of the supervised topics or labels shows high similarity with the unsupervised topic. This metric resulted in high rank-correlation with the manual scores when using L-LDA topics. Using tagged lyrics obtained slightly lower scores but gave useful information as to the content of an LDA-topic. Combining all kurtosis scores shows which themes can be detected using LDA. For topic matching, a supervised topic model was inferred using the GOS-dataset containing 38 supervised topics. For a model containing 120 topics, 11 unsupervised topics are strongly linked to supervised ones. Increasing the number of topics gives rise to new matches, but increases the number of low-quality topics and the number of topics per matched theme. Finally, topic models were applied in three applications with promising results: automatic generation of playlists, artist similarity based on topics (computed and compared to community data), and the use of topic models in the social sciences.

10.1 Future work

Possibilities for future research are the following:

• Application of other topic models, like the Correlated Topic Model or hierarchical LDA, and comparison of topic quality using the metrics proposed in this work.

• Application of the proposed evaluation metrics to more standardized corpora, and comparison of the results.

• Measurement of the classification performance of unsupervised models.

• Application of topic-based representations to automatic genre classification, mood detection or hit-song analysis.

Appendix A

LDA-topics and Manual Evaluation

This appendix presents all unsupervised topic models inferred from the complete set of lyrics; the manual scores (Q) and labels are included. Topics are presented by their 5 most probable words. Stemmed words were mapped back to one of their complete forms for better readability.

T60

Table A.1
Topic  High Probability Words  Q  Label
0  home time day alone long  2
1  love kiss heart eyes touch  3  Love
2  door window room light walk  3  House
3  fuck shit gonna ass ’cause  3  Coarse
4  baby love yeah gonna girl  3  Love
5  light night dark eyes dream  3  DayNight
6  dance rock move yeah music  3  Dance
7  time day things make feel  2
8  time love feel make things  1
9  sea ocean water sail wave  3  Sea
10  pain tears heart cry feel  2
11  eyes hand head skin blood  3  Anatomy
12  lies truth eyes believe words  3  Truth/Lies
13  mind life world time space  3  Space
14  stars play show tv movie  3  Media
15  soul death dark god hell  3  Death
16  knew thought time day eyes  2
17  make things time try only  1
18  yo rhyme rap mic ’cause  2
19  baby feel make love girl  3  Sex
20  write letter call read time  3  Write
21  chorus verse repeat bridge love  1
22  music record lyrics guitar vocal  2
23  hear voice words sound feel  3  Communic.
24  road find heart light time  1
25  street walk people town city  2
26  yeah baby girl hey ooh  1
27  gun kill run dead shot  3  Crime
28  sky rain sun fall wind  3  Weather
29  Christmas bells snow Santa ring  3  Christmas
30  eat little big man drink  1
31  back town ride road man  2
32  else someone find care feel  1
33  face place time feel find  1
34  love baby time make heart  1
35  love heart cry tears night  3  Heartbreak
36  girl hair dress wear little  2
37  inside eyes feel time open  1
38  blood dead death die kill  3  Death
39  love man tree young hand  1
40  feel time things try everything  1
41  fire burn desire love flame  3  Fire
42  time day life live memories  3  Time
43  fall back time ground make  1
44  fight war battle stand die  3  War
45  war people world live fight  3  War
46  mother father man home son  3  Family
47  sh!t n!gga f!ck b!tch ya  3  Coarse
48  night morning day sleep bed  3  DayNight
49  money pay work day buy  3  Money
50  life lies pain live soul  2
51  sun summer sky rain love  3  Nature
52  sky fly stars sun shine  3  Sky
53  time night gonna make tonight  2
54  song sing play hear music  3  Music
55  baby yeah gonna ’cause goin’  1
56  love feel heart life make  2
57  game play pay life make  2
58  things make good love feel  1
59  god lord Jesus love heaven  3  Christian

T120

Table A.2
Topic  High Probability Words  Q  Label
0  yeah baby ooh hey ah  1
1  west east city south north  3  Travel
2  girl baby ya yeah gotta  1
3  girl baby yeah boy gonna  1
4  love life world things make  1
5  space stars sky earth world  3  Space
6  pay attention action give make  2
7  knew thought made eyes heard  1
8  fight war battle die stand  3  War
9  song sing play music hear  3  Music
10  n!gga sh!t b!tch f!ck ya  3  Coarse
11  money pay work buy dollar  3  Money
12  feel love stronger heart make  1
13  don’ won time ve back  1
14  fire burn flame desire feel  3  Fire
15  love sing sweet ring rose  3  Love
16  sh!t n!gga f!ck yo b!tch  3  Coarse
17  sky night wind light dark  2
18  room door bed window floor  3  House
19  lies life fear pain truth  2
20  war people world fight kill  3  War
21  things girl really friend make  1
22  man little gonna back boy  1
23  chorus verse repeat bridge fade  1
24  face place time race space  1
25  back head hand tied cut  1
26  home alone long road day  1
27  car drive road ride wheel  3  Driving
28  die dead grave death life  3  Death
29  fuck shit ass bitch gonna  3  Coarse
30  wall clock window fall time  2
31  open eyes wide inside door  1
32  gonna baby yeah ’cause goin’  1
33  thousand miles ten hundred years  2
34  tree grow seed fields sun  3  Nature
35  summer winter wind cold sun  3  Weather
36  tale fairy story angel dream  3  Fantasy
37  love baby make time try  3  Heartbreak
38  sea water ocean river swim  3  Sea
39  school learn fool rules teach  3  School
40  ground back fall time head  1
41  love heart forever feel dream  1
42  things make only any good  1
43  road find lead time life  2
44  pain heart tears feel inside  3  Pain
45  blood eyes skin hand mouth  2
46  time everything nothing life things  1
47  love heart true day someone  3  Love
48  love god lord heart life  3  Christian
49  god pray Jesus heaven lord  3  Christian
50  mind form time power world  2
51  find time place feel day  1
52  light shine night stars sun  3  DayNight
53  train ride back track wind  3  Trains
54  years time long day memories  3  Time
55  music record guitar lyrics vocal  2
56  life soul mind light eternal  1
57  rain sky pain day sun  3  Weather
58  dem man jah de fi  1
59  baby love yeah gonna girl  1
60  game play win lose time  3  Game
61  hear voice sound words loud  3  Communic.
62  write read letter words book  3  Write
63  words things talk heard try  2
64  stars tv show play girl  3  Media
65  white black blue red sky  3  Colors
66  love heart kiss sweet dream  3  Love
67  lies eyes hide try time  2  Truth/Lies
68  people man hand brother things  2
69  blood death body flesh dead  3  Death
70  love hold kiss arms heart  3  Sex
71  drink wine night bottle bar  3  Drinking
72  baby love feel make time  3  Sex
73  love heart cry tears hurt  3  Heartbreak
74  sea sail ship shore wave  3  Sea
75  years day time hour long  3  Time
76  bells sleigh bright snow christmas  3  Christmas
77  love bad baby fool good  2
78  heart feel love time fall  2
79  tomorrow today sorrow day time  2
80  gun run kill bullet shot  3  Crime
81  eyes breath feel hand touch  3  Anatomy
82  smile face laugh clown wear  3  Happy
83  fly sky wings high bird  3  Sky
84  wear dress hair clothes girl  3  Beauty
85  stand strong fall fight give  2
86  world live human mind life  3  Protest
87  free set chain break prison  3  Crime
88  morning sunday day monday night  3  Time
89  gonna time yeah ’cause make  1
90  light eyes shadows night face  3  Dark/Light
91  feel confused mind emotions control  2
92  call phone alone home waiting  3  Phone
93  die alive cry live life  3  Death
94  street walk city town night  3  City
95  kids girl home school little  3  Family
96  mother father man son brother  3  Family
97  eat little big man head  2
98  dead black night dark kill  2
99  fuck make shit try feel  1
100  older shoulder grow time colder  1
101  things time make little try  1
102  god lord heaven sing Jesus  3  Christian
103  answer truth learn question believe  3  Truth/Lies
104  sleep night dream bed wake  3  Sleep
105  door floor before open knock  3  House
106  things time try made love  1
107  pay price life paid live  1
108  else someone care love feel  1
109  dance move beat  3  Dance
110  death blood soul god dark  3  Death
111  Christmas years Santa tree snow  3  Christmas
112  lips love kiss girl body  3  Love
113  time day good love feel  1
114  king man land men stone  2
115  rhyme yo rap mic rock  2
116  time make life try waiting  1
117  love baby girl honey little  2
118  time day back memories remember  2
119  inside feel hide mind pain  1

Table A.3: T200

Topic | High Probability Words | Q | Label
0 | inside hide pride try side | 1 |
1 | live survive life alive world | 2 |
2 | love young day faire man | 1 |
3 | make time life lose choose | 1 |
4 | change things feel strange make | 1 |
5 | crime pay kill die death | 3 | Crime
6 | speak talk words mouth make | 3 | Communic.
7 | don’ ve won ain’ feel | 1 |
8 | monday sunday friday day night | 3 |
9 | child young wild world man | 2 |
10 | room door window wall floor | 3 | House
11 | girl baby lips hot little | 3 | Sex
12 | baby love girl yeah chorus | 1 |
13 | feel heart breath cold touch | 1 |
14 | sad love bad good make | 2 |
15 | life eyes fear mind nothing | 1 |
16 | lies truth words believe face | 3 | Truth/Lies
17 | night light shadows dream hear | 2 |
100 | stars tv movie screen dream | 3 | Media
101 | love hurt things made back | 3 | Heartbreak
102 | hear voice sound words scream | 3 | Communic.
103 | tall big little feet man | 2 |
104 | girl love boy little baby | 1 |
105 | baby gonna woman man lord | 1 |
106 | stop time back slow try | 1 |
107 | space stars earth planet sky | 3 | Space
108 | stronger longer little grow any | 1 |
109 | baby love feel make yeah | 1 |
110 | bad tough make gonna | 1 |
111 | dem man jah de fi | 2 |
112 | love heart kiss arms hold | 3 | Love
113 | rain sky day clouds sun | 3 | Weather
114 | wrong feel strong love time | 1 |
115 | yeah baby ooh hey love | 1 |
116 | things really something friend time | 1 |
117 | time worry mind gonna things | 1 |

Table A.3 – Continued from previous page

Topic | High Probability Words | Q | Label
18 | diamond ring love things buy | 2 |
19 | tree wind grow mountain river | 3 | Nature
20 | care people place everywhere friend | 2 |
21 | grave dead die bury death | 3 | Death
22 | war people world fight live | 3 | War
23 | wall fall build stand climb | 3 | Wall
24 | shoulder older weight heavy colder | 1 |
25 | money pay work buy dollar | 3 | Money
26 | years long ago time day | 2 |
27 | tomorrow sorrow today borrow yesterday | 3 | Time
28 | face place race space grace | 1 |
29 | clock tick time waiting watch | 2 |
30 | point view give time try | 1 |
31 | empty alone feel left cold | 2 |
32 | daddy home kids mother boy | 3 | Family
33 | dumb funny make laugh money | 1 |
34 | reason time season mind change | 1 |
35 | fall ground breath feel back | 1 |
36 | line hand black cut wire | 1 |
37 | past time day life back | 3 | Time
38 | pain die life fear soul | 3 | Pain
39 | jail man gun police street | 3 | Crime
40 | god earth dark ancient rise | 3 | Christian
41 | someone else find love time | 1 |
42 | feel things make ’cause time | 1 |
43 | letter write read words wrote | 3 | Write
44 | open wide eyes door heart | 1 |
45 | love stars gold shine heart | 2 |
46 | back time left try make | 1 |
47 | sky stars sun shine light | 3 | Sky
48 | feel love heart baby make | 2 |
49 | sleep bed dream night head | 3 | Sleep
50 | lord god sing praise glory | 3 | Christian
51 | ring bells sing hear song | 1 |
52 | smile happy love make life | 3 | Happy
53 | time nothing feel make try | 1 |
54 | gonna money dime little man | 2 |
55 | doctor cure pain pills medicine | 3 | Health
56 | clown smile wear frown town | 2 |
57 | picture face eyes paint mirror | 1 |
58 | love heart kiss dream hold | 3 | Love
59 | hand understand man plan land | 1 |
60 | eyes time feel life world | 1 |
61 | cut eyes heart bleed die | 1 |
62 | hand head eyes fingers back | 3 | Anatomy
63 | desire feel love pleasure fire | 3 | Sex
64 | blood skin burn eyes dead | 2 |
118 | pain tears years face life | 3 | Heartbreak
119 | guitar vocal music | 2 |
120 | things time talk really try | 1 |
121 | inside mind feel pain insane | 1 |
122 | man back town little home | 1 |
123 | head cut make back loose | 1 |
124 | learn lesson burn time turn | 1 |
125 | baby girl yeah ya gotta | 1 |
126 | love forever heart together baby | 2 |
127 | rock roll dance music gonna | 2 |
128 | house door bed home room | 3 | House
129 | hold feel night eyes tonight | 1 |
130 | chorus verse repeat bridge fade | 1 |
131 | love cry eyes feel try | 1 |
132 | loud crowd hear shout scream | 2 |
133 | problem things life time make | 3 | Change
134 | any very people only kind | 1 |
135 | round spinning turn wheel circle | 1 |
136 | angel devil heaven soul hell | 3 | Christian
137 | together forever weather love things | 1 |
138 | winter cold summer snow wind | 3 | Weather
139 | hear ears words eyes whisper | 3 | Communic.
140 | arms harm warm safe feel | 1 |
141 | heart love blue lonely cry | 3 | Heartbreak
142 | mind life nature power exist | 2 |
143 | fire burn flame desire heart | 3 | Fire
144 | wind night dark moon cold | 3 | Weather
145 | love heart feel emotions eyes | 2 |
146 | morning night day wake sleep | 3 | DayNight
147 | god Jesus lord pray sin | 3 | Christian
148 | lips kiss love sweet eyes | 3 | Sex
149 | Christmas snow santa bells tree | 3 | Christmas
150 | day tomorrow time today stay | 3 | Time
151 | baby yeah honey love girl | 1 |
152 | play song record band show | 2 |
153 | make things try life wrong | 1 |
154 | dream free life world believe | 2 |
155 | lies truth believe words told | 3 | Truth/Lies
156 | death hell blood soul god | 3 | Death
157 | tears cry eyes love pain | 3 | Pain
158 | run dog night black kill | 3 | Animals
159 | born heaven angel sing god | 3 | Christian
160 | light night shine dark bright | 3 | Dark/Light
161 | hold tight love kiss night | 3 | Sex
162 | fire burn thunder bomb lightning | 3 | Fire
163 | river water sea flow mountain | 3 | Sea
164 | sh!t n!gga b!tch f!ck a!s | 3 | Coarse

Table A.3 – Continued from previous page

Topic | High Probability Words | Q | Label
65 | music lyrics words album song | 2 |
66 | lord god heart love life | 2 |
67 | feet street walk ground beat | 2 |
68 | play game cards win roll | 3 | Game
69 | bones stone home hole hand | 1 |
70 | man big boy little back | 3 | Cowboy
71 | book page read story words | 3 | Write
72 | mother brother father sister son | 3 | Family
73 | time make everything nothing end | 1 |
74 | remember time day memories love | 3 | Time
75 | door floor open before back | 2 |
76 | dance move music feel yeah | 3 | Dance
77 | knew thought before eyes time | 1 |
78 | dark light shadows night eyes | 3 | Dark/Light
79 | tree flowers sun sky sing | 3 | Nature
80 | time line waiting mind make | 1 |
81 | baby love girl man yeah | 1 |
82 | fool play rules game cool | 1 |
83 | heal wounds heart pain feel | 3 | Pain
84 | eat little big man make | 3 | Food
85 | road home find long walk | 2 |
86 | love heart baby cry leave | 3 | Heartbreak
87 | rhyme yo rock mic rap | 2 |
88 | love life give heart world | 3 | Love
89 | white hair red blue dress | 3 | Beauty
90 | baby yeah nothin’ goin’ gonna | 1 |
91 | car drive road ride wheel | 3 | Driving
92 | eyes lies hide play disguise | 1 |
93 | knew hand walk fell thought | 1 |
94 | dance dream magic eyes light | 1 |
95 | sh!t n!gga f!ck yo ’em | 3 | Coarse
96 | control mind machine line live | 3 | Protest
97 | free chain set price break | 2 |
98 | drink wine bottle night bar | 3 | Drinking
99 | amazing save sound grace taught | 1 |
165 | world death lies life die | 1 |
166 | birth stars shine lay worth | 2 |
167 | back time move mind step | 1 |
168 | fight war battle die stand | 3 | War
169 | sea sail ship wave ocean | 3 | Sea
170 | fly wings sky high bird | 3 | Sky
171 | song sing play music hear | 3 | Music
172 | girl ya club hit baby | 2 |
173 | home alone long day time | 1 |
174 | love told friend girl man | 1 |
175 | f!ck sh!t gonna ass b!tch | 3 | Coarse
176 | feel reaction action attraction | 1 |
177 | phone call alone home hear | 3 | Phone
178 | west east south city north | 3 | Travel
179 | else someone ’cause nobody | 1 |
180 | broken heart piece fall break | 3 | Heartbreak
181 | sun light time day fall | 1 |
182 | time things end good love | 1 |
183 | little bit make crazy baby | 1 |
184 | strong stand faith strength love | 2 |
185 | girl kids car friend guy | 1 |
186 | train track back ride hear | 3 | Trains
187 | answer question find time reason | 2 |
188 | wear hair dress shoes girl | 2 |
189 | game play pain blame shame | 2 |
190 | make feel bad try life | 1 |
191 | black white red light blue | 3 | Colors
192 | fall breath break make hand | 1 |
193 | gun kill bullet shot head | 3 | Crime
194 | blood flesh dead body death | 3 | Death
195 | street city walk town night | 3 | City
196 | line time mind mine fine | 1 |
197 | gonna everybody yeah ’cause hey | 1 |
198 | tonight fire lot sleigh nose | 2 |
199 | miles thousand million time smile | 1 |

Appendix B

Supervised Topic Model for Evaluation

Topic | High Probability Words
Home/House | home house door bring want window look said lyrics clean round road place live alright
Places and Cities | city town country new look old home blue america said place boy eyes world end
Music and Rocking | rock sing song music play roll pop hear rhythm beat long rockin’ hey wanna everybody
Life | life live hard change die want ready look gone things hey why old try think
Sports and Games | play chance game win better gotta boy world remind things round wanna toy gamble lose
Religion | believe angel heaven god faith hell lord world devil jesus little hope soul think eyes
Love | want good life think kiss hold wanna always only things true little better every fall
Anatomy | eyes hand look face want head open think read lips close butt arms hold dream
People | woman boy little want lady sweet hey think look good hand said ooh everybody long
Drugs and Alcohol | wine high little drink sweet life think drugs beer addicted kick good whiskey another play
Education and Advice | gotta wanna things bottle whoa want learn teacher look treat true believe try step matter
Travelling and Moving | run walk stop train want roll road look ride said drive walkin’ fall little fly
Communication | talk why want did hear wanna things think look try said hello only cry mind
Numbers | only minute sweet inside things miles hey seven sixteen memories sorry think million lonely second
Animals | monkey ride fly bird dog pop want pony look horse goes little big said butterfly
Sleep, Dreams | dream sleep eyes another bye live every hold life tired awake things only wake true
Time | waiting new world want morning long tomorrow turn sunday think every look light eyes said
Family | mama friend daddy said little mother want born life family things cry look try child
Work | work job hard gotta every pay week long god workin’ boss friday old high tonight
Society and Classes | black said want hey did try why boy fuck wanna look human live things hard
Heartbreak | gone think lies cry want did good why things try wanna said hurt look only
Law and Crime | check gun everybody fight trouble things head chain said kill little shot life did why
Dancing and Party | dance shake party floor want wanna everybody ooh boogie hey body round mama things tonight
Water | river water sea floating run ocean gone sun look only round lay fall boat sky
Weather | rain sun wind walk always sunshine turn thunder sky blow sunny clouds eyes rainbow look
Nature | sky rose big mountain flowers look high hard hill nature said world hey little lonely
Sex | want wanna body stop hey really touch hold think bad please why sexy sex turn
Money | money dollar buy alright poor pay nah hey want look street life since why new
Seasons | summer long hot leave winter kiss words roof yes years only took remember things september
Food | candy want sweet good pie song long tea eat hungry hey egg think sugar home

Table B.1 – Continued from previous page

Topic | High Probability Words
Colors | blue black red white hey light hot leave men alone color matter yes world head
Media and Showbiz | lost stars hollywood goin’ good hey ooh turn want light live think things people life
War and Peace | war world peace soldier people fight many god hand stand die another seed said want
Fire | burn flame light fall ashes burnin’ wanna gone long bye memories higher old life end
Space, Moon and Stars | moon stars lyrics space moonlight want look bitch sky dream little shine rocket far light
Night | tonight eyes want feelin’ light midnight stop late nobody hold everything groove crazy little turn
Christmas | christmas bells santa years snow little world tree white joy light sing merry every stars
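The topics in Table B.1 come from a supervised topic model in the spirit of Labeled LDA (Ramage et al., [60]), where each document may only draw words from the topics in its own label set. As a minimal sketch of that restriction (not the toolbox implementation used for the thesis, and with hypothetical toy data), a collapsed Gibbs sampler for Labeled LDA differs from plain LDA only in that topic assignments are resampled from the document's allowed labels:

```python
import numpy as np

def labeled_lda(docs, labels, n_topics, vocab_size,
                n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for Labeled LDA (illustration only).
    docs:   list of documents, each a list of word ids
    labels: list of allowed topic ids per document (the supervision)
    Returns the topic-word count matrix; normalize rows for distributions."""
    rng = np.random.default_rng(seed)
    n_dt = np.zeros((len(docs), n_topics))   # doc-topic counts
    n_tw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_t = np.zeros(n_topics)                 # tokens per topic
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random init within label set
        zs = [rng.choice(labels[d]) for _ in doc]
        z.append(zs)
        for w, t in zip(doc, zs):
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            allowed = np.array(labels[d])    # the Labeled-LDA constraint
            for i, w in enumerate(doc):
                t = z[d][i]                  # remove current assignment
                n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
                # standard collapsed-Gibbs conditional, restricted to `allowed`
                p = ((n_dt[d, allowed] + alpha)
                     * (n_tw[allowed, w] + beta)
                     / (n_t[allowed] + beta * vocab_size))
                t = rng.choice(allowed, p=p / p.sum())
                z[d][i] = t
                n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    return n_tw

# Toy usage: two documents, each restricted to a single (different) label.
docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
labels = [[0], [1]]
n_tw = labeled_lda(docs, labels, n_topics=2, vocab_size=4)
```

With singleton label sets as above, each document's tokens can only be counted under its own topic, which is exactly the behavior the constraint enforces.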

Bibliography

[1] The Echo Nest. http://echonest.com/.
[2] Last.fm. http://www.last.fm/.
[3] MusicBrainz. http://musicbrainz.org/.
[4] musiXmatch lyrics catalog. http://musixmatch.com/.
[5] Rap Genius. http://rapgenius.com/static/about.
[6] SongMeanings. http://www.songmeanings.net.
[7] Charu C. Aggarwal and ChengXiang Zhai. A survey of text classification algorithms. In Mining Text Data, pages 163–222. Springer, 2012.
[8] Craig A. Anderson, Nicholas L. Carnagey, and Janie Eubanks. Exposure to violent media: The effects of songs with violent lyrics on aggressive thoughts and feelings. Journal of Personality and Social Psychology, 84(5):960, 2003.
[9] Satanjeev Banerjee and Ted Pedersen. An adapted Lesk algorithm for word sense disambiguation using WordNet. In Computational Linguistics and Intelligent Text Processing, pages 136–145. Springer, 2002.
[10] Stephan Baumann and Andreas Klüter. Super convenience for non-musicians: Querying MP3 and the semantic web. In Proceedings of the International Conference on Music Information Retrieval, pages 297–298, 2002.
[11] Adam Berenzweig, Beth Logan, Daniel P. W. Ellis, and Brian Whitman. A large-scale evaluation of acoustic and subjective music-similarity measures. Computer Music Journal, 28(2):63–76, 2004.
[12] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24–28, 2011, Miami, Florida, pages 591–596. University of Miami, 2011.
[13] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
[14] David M. Blei and Jon D. McAuliffe. Supervised topic models. arXiv preprint arXiv:1003.0783, 2010.
[15] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March 2003.
[16] Jordan Boyd-Graber, Jonathan Chang, Sean Gerrish, Chong Wang, and David Blei. Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2009.
[17] Eric Brochu and Nando de Freitas. "Name that song!": A probabilistic approach to querying on music and text. Advances in Neural Information Processing Systems, 15:1505–1512, 2002.
[18] Claire Carden. From Making Love to Sexing: Historical Development of Sexual References in Popular Music 1960–2011. PhD thesis, 2012.
[19] Ò. Celma. Music Recommendation and Discovery in the Long Tail. PhD thesis, Universitat Pompeu Fabra, Barcelona, 2008.

[20] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[21] Bob Dylan, John Hammond, and Stacey Williams. Bob Dylan. Columbia, 1967.
[22] Douglas Eck, Paul Lamere, Thierry Bertin-Mahieux, and Stephen Green. Automatic generation of social tags for music recommendation. Advances in Neural Information Processing Systems, 20(20):1–8, 2007.
[23] Gustavo C. S. Frederico. Actos: a peer-to-peer application for the retrieval of encoded music. In Proceedings of the 1st International Conference on Musical Application Using XML (MAX '02), Milan, Italy, September 2002.
[24] David A. Grossman, Luis Gravano, ChengXiang Zhai, Otthein Herzog, and David A. Evans, editors. Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8–13, 2004. ACM, 2004.
[25] Alexander Hinneburg, Charu C. Aggarwal, and Daniel A. Keim. What is the nearest neighbor in high dimensional spaces? Bibliothek der Universität Konstanz, 2000.
[26] Graeme Hirst and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. WordNet: An Electronic Lexical Database, 305:305–332, 1998.
[27] Thomas Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296. Morgan Kaufmann Publishers Inc., 1999.
[28] David Hope. WordNet Similarity for Java. https://code.google.com/p/ws4j/, June 2013.
[29] Xiao Hu, J. Stephen Downie, and Andreas F. Ehmann. Lyric text mining in music mood classification. American Music, 183(5,049):2–209, 2009.
[30] Jay J. Jiang and David W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008, 1997.
[31] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, DTIC Document, 1996.
[32] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Springer, 1998.
[33] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD, 2008.
[34] Florian Kleedorfer, Peter Knees, and Tim Pohle. Oh oh oh whoah! Towards automatic topic detection in song lyrics. In Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR 2008), pages 287–292, 2008.
[35] Peter Kolb. DISCO: A multilingual database of distributionally similar words. Proceedings of KONVENS-2008, Berlin, 2008.
[36] Vandana Korde and C. Namrata Mahender. Text classification and classifiers: A survey. International Journal, 3, 2012.
[37] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS), 21, 2008.
[38] Paul Lamere. Social tagging and music information retrieval. Journal of New Music Research, 37(2):101–114, 2008.
[39] Cyril Laurier, Jens Grivolla, and Perfecto Herrera. Multimodal music mood classification using audio and lyrics. In Machine Learning and Applications, 2008 (ICMLA '08), Seventh International Conference on, pages 688–693. IEEE, 2008.
[40] Claudia Leacock and Martin Chodorow. Combining local context and WordNet similarity for word sense identification. WordNet: An Electronic Lexical Database, 49(2):265–283, 1998.

[41] Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707, 1966.
[42] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.
[43] Dekang Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, volume 1, pages 296–304, San Francisco, 1998.
[44] Beth Logan, Andrew Kositsky, and Pedro Moreno. Semantic analysis of song lyrics. In Multimedia and Expo, 2004 (ICME '04), 2004 IEEE International Conference on, volume 2, pages 827–830. IEEE, 2004.
[45] Donald MacLellan and Carola Boehm. MuTaTeD'II: A system for music information retrieval of encoded music. In ISMIR, 2000.
[46] Jose P. G. Mahedero, Álvaro Martínez, Pedro Cano, Markus Koppenberger, and Fabien Gouyon. Natural language processing of lyrics. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 475–478. ACM, 2005.
[47] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[48] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Citeseer, 1998.
[49] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[50] Douglas M. McLeod, William P. Eveland, and Amy I. Nathanson. Support for censorship of violent and misogynic rap lyrics: An analysis of the third-person effect. Communication Research, 24(2):153–174, 1997.
[51] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.
[52] Robert Neumayer and Andreas Rauber. Integration of text and audio features for genre classification in music information retrieval. In Advances in Information Retrieval, pages 724–727. Springer, 2007.
[53] David Newman, Sarvnaz Karimi, and Lawrence Cavedon. External evaluation of topic models. In Australasian Document Computing Symposium (ADCS), pages 1–8. School of Information Technologies, University of Sydney, 2009.
[54] David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics, 2010.
[55] Karl Pearson and Egon Sharpe Pearson. Biometrika, volume 4. University Press, 1906.
[56] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[57] Martin F. Porter. An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 14(3):130–137, 1980.
[58] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[59] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[60] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics, 2009.
[61] Daniel Ramage and Evan Rosen. Stanford Topic Modeling Toolbox, 2009.

[62] Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007, 1995.
[63] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[64] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
[65] Dan Steinberg and Phillip Colla. CART: Tree-structured non-parametric data analysis. San Diego, CA: Salford Systems, 1995.
[66] Ivan Titov and Ryan McDonald. A joint model of text and aspect ratings for sentiment summarization. Urbana, 51:61801, 2008.
[67] Google Trends. Search statistics. http://www.google.com/trends/explore#q=coldplay%2C%20coldplay%20lyrics&cmpt=q, June 2013.
[68] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1105–1112. ACM, 2009.
[69] Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 178–185. ACM, 2006.
[70] Zhibiao Wu and Martha Palmer. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics, 1994.
[71] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Machine Learning: International Workshop then Conference, pages 412–420. Morgan Kaufmann, 1997.