
Love Me, Love Me, Say (and Write!) that You Love Me: Enriching the WASABI Song Corpus with Lyrics Annotations

Michael Fell, Elena Cabrio, Elmahdi Korfed, Michel Buffa and Fabien Gandon
Université Côte d'Azur, CNRS, Inria, I3S, France
{michael.fell, elena.cabrio, elmahdi.korfed, michel.buffa}@unice.fr, [email protected]

Abstract
We present the WASABI Song Corpus, a large corpus of songs enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an important part of the semantics of a song, we focus here on the description of the methods we proposed to extract relevant information from the lyrics, such as their structure segmentation, their topics, the explicitness of the lyrics content, the salient passages of a song and the emotions conveyed. The creation of the resource is still ongoing: so far, the corpus contains 1.73M songs with lyrics (1.41M unique lyrics) annotated at different levels with the output of the above-mentioned methods. Such corpus labels and the provided methods can be exploited by music search engines and music professionals (e.g. journalists, radio presenters) to better handle large collections of lyrics, allowing intelligent browsing, categorization and recommendation of songs. We provide the files of the current version of the WASABI Song Corpus, the models we have built on it as well as updates here: https://github.com/micbuffa/WasabiDataset.

Keywords: Corpus (Creation, Annotation, etc.), Information Extraction, Information Retrieval, Music and Song Lyrics

1. Introduction
Let's imagine the following scenario: following David Bowie's death, a journalist plans to prepare a radio show about the artist's musical career to acknowledge his qualities. To discuss the topic from different angles, she needs to have at her disposal the artist's biographical information to know the history of his career, the song lyrics to know what he was singing about, his musical style, the emotions his songs were conveying, live recordings and interviews. Similarly, music streaming professionals such as Pandora and other major platforms aim at enriching music listening with artists' information, at offering suggestions for listening to other songs/albums from the same or similar artists, or at automatically determining the emotion felt when listening to a track in order to propose coherent playlists to the user. To support such scenarios, the need for rich and accurate musical knowledge bases and for tools to explore and exploit this knowledge becomes evident.

In this paper, we present the WASABI Song Corpus, a large corpus of songs (2.10M songs, 1.73M with lyrics) enriched with metadata extracted from music databases on the Web, and resulting from the processing of song lyrics and from audio analysis. The corpus contains songs in 36 different languages, even if the vast majority are in English. As for the song genres, the most common ones are Rock, Pop, Country and Hip Hop.

More specifically, while an overview of the goals of the WASABI project supporting the dataset creation and the description of a preliminary version of the dataset can be found in (Meseguer-Brocal et al., 2017), this paper focuses on the description of the methods we proposed to annotate relevant information in the song lyrics. Given that lyrics encode an important part of the semantics of a song, we propose to label the WASABI dataset lyrics with their structure segmentation, the explicitness of the lyrics content, the salient passages of a song, the addressed topics and the emotions conveyed.

An analysis of the correlations among the above-mentioned annotation layers reveals interesting insights about the song corpus. For instance, we demonstrate the change in corpus annotations diachronically: we show that certain topics become more important over time while others diminish. We also analyze such changes in explicit lyrics content and expressed emotion.

The paper is organized as follows. Section 2. introduces the WASABI Song Corpus and the metadata initially extracted from music databases on the Web. Section 3. describes the segmentation method we applied to decompose the lyrics in the corpus into their building blocks. Section 4. explains the method used to summarize song lyrics, leveraging their structural properties. Section 5. reports on the annotations resulting from the explicit content classifier, while Section 6. describes how information on the emotions is extracted from the lyrics. Section 7. describes the topic modeling algorithm used to label each lyric with the top 10 words, while Section 8. examines the changes in the annotations over the course of time. Section 9. reports on similar existing resources, while Section 10. concludes the paper.

2. The WASABI Song Corpus
In the context of the WASABI research project (http://wasabihome.i3s.unice.fr/) that started in 2017, a two million song database has been built, with metadata on 77k artists, 208k albums, and 2.10M songs (Meseguer-Brocal et al., 2017). The metadata has been i) aggregated, merged and curated from different data sources on the Web, and ii) enriched by pre-computed or on-demand analyses of the lyrics and audio data.

We have performed various levels of analysis, and interactive Web Audio applications have been built on top of the output. For example, the TimeSide analysis and annotation framework has been linked (Fillon et al., 2014) to make on-demand audio analysis possible.

Figure 2: The data sources connected to the WASABI Song Corpus (illustration taken from (Buffa et al., 2019b)).

In connection with the FAST project (http://www.semanticaudio.ac.uk), an offline chord analysis of 442k songs has been performed, and both an online enhanced audio player (Pauwels and Sandler, 2019) and a chord search engine (Pauwels et al., 2018) have been built around it. A rich set of Web Audio applications and plugins has been proposed (Buffa and Lebrun, 2017a; Buffa and Lebrun, 2017b; Buffa et al., 2018) that allow, for example, songs to be played along with sounds similar to those used by artists. All these metadata, computational analyses and Web Audio applications have now been gathered in one easy-to-use web interface, the WASABI Interactive Navigator (http://wasabi.i3s.unice.fr/), illustrated in Figure 1.

Figure 1: The WASABI Interactive Navigator (illustration taken from (Buffa et al., 2019a)).

We have started building the WASABI Song Corpus by collecting for each artist the complete discography, band members with their instruments, time line, equipment they use, and so on. For each song we collected its lyrics from LyricWiki (http://lyrics.wikia.com/), the synchronized lyrics when available (from http://usdb.animux.de/), the DBpedia abstracts and the categories the song belongs to (e.g. genre, label, writer, release date, awards, producers, artist and band members), the stereo audio track from Deezer, the unmixed audio tracks of the song, its ISRC, bpm and duration.

We matched the song ids from the WASABI Song Corpus with the ids from MusicBrainz, iTunes, Discogs, Spotify, Amazon, AllMusic, GoHear and YouTube. Figure 2 illustrates all the data sources we have used to create the WASABI Song Corpus. We have also aligned the WASABI Song Corpus with the publicly available LastFM dataset (http://millionsongdataset.com/lastfm/), resulting in 327k tracks in our corpus having a LastFM id.

As of today, the corpus contains 1.73M songs with lyrics (1.41M unique lyrics). 73k songs have at least an abstract on DBpedia, and 11k have been identified as "classic songs" (they have been number one, or got a Grammy award, or have lots of cover versions). About 2k songs have a multi-track audio version, and on-demand source separation using Open-Unmix (Stöter et al., 2019) or Spleeter (Hennequin et al., 2019) is provided as a TimeSide plugin.

Several Natural Language Processing methods have been applied to the lyrics of the songs included in the WASABI Song Corpus, and various analyses of the extracted information have been carried out. After providing some statistics on the WASABI corpus, the rest of the article describes the different annotations we added to the lyrics of the songs in the dataset. Based on the research we have conducted, the following lyrics annotations are added: lyrical structure (Section 3.), summarization (Section 4.), explicit lyrics (Section 5.), emotion in lyrics (Section 6.) and topics in lyrics (Section 7.).

2.1. Statistics on the WASABI Song Corpus
This section summarizes key statistics on the corpus, such as the language and genre distributions and the songs' coverage in terms of publication years, and then gives technical details on its accessibility.

Language Distribution. Figure 3a shows the distribution of the ten most frequent languages in our corpus (based on language detection performed on the lyrics). In total, the corpus contains songs in 36 different languages. The vast majority (76.1%) is English, followed by Spanish (6.3%) and by four languages in the 2-3% range (German, French, Italian, Portuguese). On the bottom end, Swahili and Latin amount to 0.1% (around 2k songs) each.

Genre Distribution. In Figure 3b we depict the distribution of the ten most frequent genres in the corpus (we take the genre of the album as ground truth since song-wise genres are much rarer). In total, 1.06M of the titles are tagged with a genre. It should be noted that the genres are very sparse, with a total of 528 different ones. This high number is partially due to many subgenres such as Alternative Rock, Indie Rock, Pop Rock, etc., which we omitted in Figure 3b for clarity. The most common genres are Rock (9.7%), Pop (8.6%), Country (5.2%), Hip Hop (4.5%) and Folk (2.7%).

Publication Year. Figure 3c shows the number of songs published in our corpus, by decade (we take the album publication date as a proxy since song-wise labels are too sparse). We find that over 50% of all songs in the WASABI Song Corpus are from the 2000s or later and only around 10% are from the seventies or earlier.

Figure 3: Statistics on the WASABI Song Corpus. (a) Language distribution (100% = 1.73M); (b) genre distribution (100% = 1.06M); (c) decade of publication distribution (100% = 1.70M).

Accessibility of the WASABI Song Corpus. The WASABI Interactive Navigator relies on multiple database engines: it runs on a MongoDB server together with an indexation by Elasticsearch, and also on a Virtuoso triple store as an RDF graph database. It comes with a REST API (https://wasabi.i3s.unice.fr/apidoc/) and an upcoming SPARQL endpoint. All the database metadata is publicly available under a CC licence through the WASABI Interactive Navigator as well as programmatically through the WASABI REST API (there is no public access to copyrighted data such as lyrics and full-length audio files; instructions on how to obtain lyrics are nevertheless provided, and audio extracts of 30s length are available for nearly all songs). We provide the files of the current version of the WASABI Song Corpus, the models we have built on it as well as updates here: https://github.com/micbuffa/WasabiDataset.
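As a quick illustration of programmatic access, the sketch below queries the REST API with Python's requests library. Only the base URL and the existence of the API come from the paper; the endpoint path and the shape of the response used here are hypothetical placeholders, so the authoritative routes should be taken from the API documentation at https://wasabi.i3s.unice.fr/apidoc/.

# Minimal sketch of programmatic access to the WASABI REST API.
# NOTE: the route below is a hypothetical placeholder; consult
# https://wasabi.i3s.unice.fr/apidoc/ for the routes actually exposed.
import requests

BASE_URL = "https://wasabi.i3s.unice.fr"

def search_artist(name: str) -> dict:
    """Query a hypothetical artist-search route and return the parsed JSON."""
    response = requests.get(f"{BASE_URL}/search/artist/{name}", timeout=10)  # hypothetical route
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = search_artist("Britney Spears")
    print(type(data))  # the exact structure depends on the actual API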

3. Lyrics Structure Annotations
Generally speaking, lyrics structure segmentation consists of two stages: text segmentation to divide lyrics into segments, and semantic labelling to label each segment with a structure type (e.g. Intro, Verse, Chorus).

In (Fell et al., 2018) we proposed a method to segment lyrics based on their repetitive structure in the form of a self-similarity matrix (SSM). Figure 4 shows a line-based SSM for the song text written on top of it (https://wasabi.i3s.unice.fr/#/search/artist/Britney%20Spears/album/In%20The%20Zone/song/Everytime). The lyrics consist of seven segments and show the typical repetitive structure of a Pop song. The main diagonal is trivial, since each line is maximally similar to itself. Notice further the additional diagonal stripes in segments 2, 4 and 7; this indicates a repeated part, typically the chorus. Based on the simple idea that eyeballing an SSM will reveal (parts of) a song's structure, we proposed a Convolutional Neural Network architecture that successfully learned to predict segment borders in the lyrics when "looking at" their SSM. Table 1 shows the genre-wise results we obtained using our proposed architecture. One important insight was that more repetitive lyrics, as often found in genres such as Country and Punk Rock, are much easier to segment than lyrics in Rap or Hip Hop, which often do not even contain a chorus.

Figure 4: Structure of the lyrics of "Everytime" by Britney Spears as displayed in the WASABI Interactive Navigator.

In the WASABI Interactive Navigator, the line-based SSM of a song text can be visualized. It is toggled by clicking on the violet-blue square on top of the song text. For a subset of songs the color opacity indicates how repetitive and representative a segment is, based on the fitness metric that we proposed in (Fell et al., 2019b). Note how in Figure 4 the segments 2, 4 and 7 are shaded more darkly than the surrounding ones. As highly fit (opaque) segments often coincide with a chorus, this is a first approximation of chorus detection. Given the variability in the set of structure types provided in the literature according to different genres (Tagg, 1982; Brackett, 1995), rare attempts have been made in the literature to achieve a more complete semantic labelling, labelling the lyrics segments as Intro, Verse, Bridge, Chorus, etc.

For each song text we provide an SSM based on a normalized character-based edit distance (in our segmentation experiments we found this simple metric to outperform more complex metrics that take into account the phonetics or the syntax) on two levels of granularity, to enable other researchers to work with these structural representations: line-wise similarity and segment-wise similarity.
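To make the line-wise representation concrete, here is a minimal sketch of how such a line-based SSM can be computed from raw lyrics with a normalized character-based edit distance. It is an illustrative re-implementation under our own naming, not the exact code used to build the released SSMs.

# Minimal sketch: line-based self-similarity matrix (SSM) over lyrics lines,
# using a normalized character-based edit distance. Illustrative only.
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def line_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical lines."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def lyrics_ssm(lyrics: str) -> list:
    """Return the symmetric line-wise SSM as a list of lists of floats."""
    lines = [l.strip() for l in lyrics.splitlines() if l.strip()]
    ssm = [[0.0] * len(lines) for _ in lines]
    for i in range(len(lines)):
        for j in range(i, len(lines)):
            s = line_similarity(lines[i], lines[j])
            ssm[i][j] = ssm[j][i] = s
    return ssm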

Genre              P     R     F1
Rock               73.8  57.7  64.8
Hip Hop            71.7  43.6  54.2
Pop                73.1  61.5  66.6
RnB                71.8  60.3  65.6
Alternative Rock   76.8  60.9  67.9
Country            74.5  66.4  70.2
Hard Rock          76.2  61.4  67.7
Pop Rock           73.3  59.6  65.8
Indie Rock         80.6  55.5  65.6
Heavy Metal        79.1  52.1  63.0
Southern Hip Hop   73.6  34.8  47.0
Punk Rock          80.7  63.2  70.9
Alternative Metal  77.3  61.3  68.5
Pop Punk           77.3  68.7  72.7
Gangsta Rap        73.6  35.2  47.7
Soul               70.9  57.0  63.0

Table 1: Lyrics segmentation performances across musical genres in terms of Precision (P), Recall (R) and F1 in %. Underlined are the performances on genres with less repetitive text. Genres with highly repetitive structure are in bold.

4. Lyrics Summary

Given the repeating forms, peculiar structure and other unique characteristics of song lyrics, in (Fell et al., 2019b) we introduced a method for extractive summarization of lyrics that takes advantage of these additional elements to more accurately identify relevant information in song lyrics. More specifically, it relies on the intimate relationship between the audio and the lyrics. So-called audio thumbnails, snippets of usually 30 seconds of music, are a popular means to summarize a track in the audio community. The intuition is that the more repeated and the longer a part is, the better it represents the song. We transferred an audio thumbnailing approach to our domain of lyrics and showed that adding the thumbnail improves summary quality. We evaluated our method on 50k lyrics belonging to the top 10 genres of the WASABI Song Corpus and according to qualitative criteria such as Informativeness and Coherence. Figure 5 shows our results for different summarization models. Our model RankTopicFit, which combines graph-based, topic-based and thumbnail-based summarization, outperforms all other summarizers. We further find that the genres RnB and Country are highly overrepresented in the lyrics sample with respect to the full WASABI Song Corpus, indicating that songs belonging to these genres are more likely to contain a chorus. Finally, Figure 6 shows an example summary of four lines length obtained with our proposed RankTopicFit method. It is toggled in the WASABI Interactive Navigator by clicking on the green square on top of the song text.

Figure 5: Human ratings per summarization model (five-point Likert scale). Models are Rank: graph-based, Topic: topic-based, Fit: thumbnail-based, and model combinations.

The four-line summaries of the 50k English lyrics used in our experiments are freely available within the WASABI Song Corpus; the Python code of the applied summarization methods is also available (https://github.com/TuringTrain/lyrics_thumbnailing).
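The following sketch illustrates the general idea of combining a graph-based centrality score with a repetition-based ("fitness"-like) score to pick summary lines. It is a simplified stand-in under assumptions of our own (function names, the 0.8 duplicate threshold, equal weights), not the released RankTopicFit implementation.

# Simplified extractive lyrics summarizer: combine a centrality score with a
# repetition score and keep the top-k lines in their original order.
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def summarize(lyrics: str, k: int = 4) -> list:
    lines = [l.strip() for l in lyrics.splitlines() if l.strip()]
    n = len(lines)
    if n <= k:
        return lines
    # Centrality: how similar a line is, on average, to the rest of the text.
    centrality = [sum(sim(lines[i], lines[j]) for j in range(n) if j != i) / (n - 1)
                  for i in range(n)]
    # Repetition: fraction of near-duplicates elsewhere in the text, a crude
    # proxy for the thumbnail/fitness idea (choruses repeat).
    repetition = [sum(sim(lines[i], lines[j]) > 0.8 for j in range(n) if j != i) / (n - 1)
                  for i in range(n)]
    score = [0.5 * c + 0.5 * r for c, r in zip(centrality, repetition)]
    top = sorted(sorted(range(n), key=lambda i: score[i], reverse=True)[:k])
    return [lines[i] for i in top]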

Model                  P     R     F1
Majority Class         45.0  50.0  47.4
Dictionary Lookup      78.3  76.4  77.3
Dictionary Regression  76.2  81.5  78.5
Tf-idf BOW Regression  75.6  81.2  78.0
TDS Deconvolution      81.2  78.2  79.6
BERT Language Model    84.4  73.7  77.7

Table 2: Performance comparison of our different models. Precision (P), Recall (R) and F-score (F1) in %.

Figure 6: Summary of the lyrics of "Everytime" by Britney Spears as displayed in the WASABI Interactive Navigator.

5. Explicit Language in Lyrics
On audio recordings, the Parental Advisory Label is placed in recognition of profanity and to warn parents of material potentially unsuitable for children. Nowadays, such labelling is carried out mainly manually on a voluntary basis, with the drawbacks of being time-consuming and therefore costly, error-prone and partly subjective. In (Fell et al., 2019a) we have tackled the task of automated explicit lyrics detection, based on the songs carrying such a label. We compared automated methods ranging from dictionary-based lookup to state-of-the-art deep neural networks to automatically detect explicit contents in English lyrics. More specifically, the dictionary-based methods rely on a swear word dictionary Dn which is automatically created from example explicit and clean lyrics. Then, we use Dn to predict the class of an unseen song text in one of two ways: (i) the Dictionary Lookup simply checks if a song text contains words from Dn; (ii) the Dictionary Regression uses a BOW made from Dn as the feature set of a logistic regression classifier. In the Tf-idf BOW Regression the BOW is expanded to the whole vocabulary of a training sample instead of only the explicit terms. Furthermore, the model TDS Deconvolution is a deconvolutional neural network (Vanni et al., 2018) that estimates the importance of each word of the input for the classifier decision. In our experiments, we worked with 179k lyrics that carry gold labels provided by Deezer (17k tagged as explicit) and obtained the results shown in Table 2. We found the very simple Dictionary Lookup method to perform on par with much more complex models such as the BERT Language Model (Devlin et al., 2018) as a text classifier. Our analysis revealed that some genres are highly overrepresented among the explicit lyrics. Inspecting the automatically induced explicit words dictionary reflects that genre bias. The dictionary of 32 terms used for the dictionary lookup method consists of around 50% of terms specific to the Rap genre, such as glock, gat, clip (gun-related) and thug, beef, gangsta, pimp, blunt (crime and drugs). Finally, the terms holla, homie, and rapper are obviously no swear words, but are highly correlated with explicit content lyrics.

Our corpus contains 52k tracks labelled as explicit and 663k clean (not explicit) tracks (labels provided by Deezer; furthermore, 625k songs have a different status such as unknown or censored version). We have trained a classifier (77.3% F-score on the test set) on the 438k English lyrics which are labelled, and classified the remaining 455k previously untagged English tracks. We provide both the predicted labels in the WASABI Song Corpus and the trained classifier to apply it to unseen text.
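A minimal sketch of the dictionary-based baselines described above is given below, assuming a list of labelled lyrics is at hand. The dictionary-induction step (a frequency-ratio heuristic) and all names are our own simplifications for illustration, not the exact procedure of (Fell et al., 2019a).

# Sketch: induce a small "explicit terms" dictionary from labelled lyrics and
# use it for the Dictionary Lookup baseline. Simplified illustration only.
from collections import Counter

def induce_dictionary(lyrics, labels, n_terms=32):
    """Keep the n_terms words whose relative frequency is most skewed towards
    the explicit class (a crude stand-in for the paper's induction method)."""
    explicit, clean = Counter(), Counter()
    for text, is_explicit in zip(lyrics, labels):
        (explicit if is_explicit else clean).update(text.lower().split())
    total_e = sum(explicit.values()) or 1
    total_c = sum(clean.values()) or 1
    def skew(word):
        return (explicit[word] / total_e) / (clean[word] / total_c + 1e-9)
    candidates = [w for w, c in explicit.items() if c >= 5]  # arbitrary cut-off
    return set(sorted(candidates, key=skew, reverse=True)[:n_terms])

def dictionary_lookup(text, dictionary):
    """Predict 'explicit' iff the song text contains any dictionary term."""
    return any(word in dictionary for word in text.lower().split())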

6. Emotional Description
In sentiment analysis the task is to predict if a text has a positive or a negative emotional valence. In recent years, a transition from detecting sentiment (positive vs. negative valence) to more complex formulations of emotion detection (e.g. joy, fear, surprise) (Mohammad et al., 2018) has become more visible, even tackling the problem of emotion in context (Chatterjee et al., 2019). One family of emotion detection approaches is based on the valence-arousal model of emotion (Russell, 1980), locating every emotion in a two-dimensional plane based on its valence (positive vs. negative) and arousal (aroused vs. calm); sometimes, a third dimension of dominance is part of the model. Figure 7 is an illustration of the valence-arousal model of Russell and shows where exemplary emotions such as joyful, angry or calm are located in the plane. Manually labelling texts with multi-dimensional emotion descriptions is an inherently hard task. Therefore, researchers have resorted to distant supervision, obtaining gold labels from social tags from last.fm. These approaches (Hu et al., 2009a; Çano and Morisio, May 2017) define a list of social tags that are related to emotion, then project them into the valence-arousal space using an emotion lexicon (Warriner et al., 2013; Mohammad, 2018).
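To illustrate the distant-supervision idea, the sketch below projects emotion-related social tags into the valence-arousal plane by averaging per-word lexicon coordinates. The tiny inline lexicon and the tag list are made-up placeholder values standing in for resources such as (Warriner et al., 2013); this is not the authors' code.

# Sketch: derive a (valence, arousal) label for a track from its emotion-
# related social tags by averaging per-word lexicon coordinates.
# The lexicon values below are illustrative placeholders, not real entries.
EMOTION_LEXICON = {            # word -> (valence, arousal), both in [0, 1]
    "happy": (0.90, 0.60),
    "sad":   (0.15, 0.30),
    "angry": (0.10, 0.85),
    "calm":  (0.70, 0.10),
}

def track_valence_arousal(tags):
    """Average the lexicon coordinates of the covered tags; None if no tag
    is found in the lexicon."""
    points = [EMOTION_LEXICON[t] for t in tags if t in EMOTION_LEXICON]
    if not points:
        return None
    valence = sum(p[0] for p in points) / len(points)
    arousal = sum(p[1] for p in points) / len(points)
    return valence, arousal

print(track_valence_arousal(["happy", "calm", "90s"]))  # approx. (0.8, 0.35)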

Recently, Deezer made valence-arousal annotations for 18,000 English tracks available (https://github.com/deezer/deezer_mood_detection_dataset), derived with the aforementioned method (Delbouys et al., 2018). We aligned the valence-arousal annotations of Deezer to our songs. In Figure 7 the green dots visualize the emotion distribution of these songs (depiction without scatterplot taken from (Parisi et al., 2019)). Based on their annotations, we train an emotion regression model using BERT, with an evaluated 0.44/0.43 Pearson/Spearman correlation for valence and 0.33/0.31 for arousal on the test set. We integrated Deezer's labels into our corpus and also provide the valence-arousal predictions for the 1.73M tracks with lyrics. We also provide the last.fm social tags (276k) and emotion tags (87k entries) to enable researchers to build variants of emotion recognition models.

Figure 7: Emotion distribution in the corpus in the valence-arousal plane.

7. Topic Modelling
We built a topic model on the lyrics of our corpus using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). We determined the hyperparameters α, η and the topic count such that the coherence was maximized on a subset of 200k lyrics. We then trained a topic model of 60 topics on the unique English lyrics (1.05M).

We have manually labelled a number of the more recognizable topics. Figures 8-13 illustrate these topics with word clouds of the most characteristic words per topic (made with https://www.wortwolken.com/). For instance, the topic Money contains words of both the field of earning money (job, work, boss, sweat) as well as spending it (pay, buy). The topic Family is both about the people of the family (mother, daughter, wife) and the land (sea, valley, tree).

Figure 8: Topic War. Figure 9: Topic Death. Figure 10: Topic Love. Figure 11: Topic Family. Figure 12: Topic Money. Figure 13: Topic Religion.

We provide the topic distribution of our LDA topic model for each song and make available the trained topic model to enable its application to unseen lyrics.
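A compact sketch of such a pipeline with the gensim library is given below; the naive tokenization, the filtering thresholds and the use of "auto" priors are illustrative choices of ours, not necessarily those used to train the released 60-topic model.

# Sketch: train an LDA topic model on lyrics and read off the top-10 words
# per topic, roughly mirroring the pipeline described above (gensim library).
from gensim import corpora, models

def train_lda(lyrics_texts, num_topics=60):
    # Very naive tokenization; the real preprocessing is not specified here.
    tokenized = [text.lower().split() for text in lyrics_texts]
    dictionary = corpora.Dictionary(tokenized)
    dictionary.filter_extremes(no_below=5, no_above=0.5)   # illustrative thresholds
    bows = [dictionary.doc2bow(tokens) for tokens in tokenized]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary,
                          alpha="auto", eta="auto", passes=1)
    return lda, dictionary

def topic_distribution(lda, dictionary, text):
    """Per-song topic probabilities, as provided for each lyric in the corpus."""
    return lda.get_document_topics(dictionary.doc2bow(text.lower().split()))

def top_words(lda, topic_id, n=10):
    """The n most characteristic words of a topic, as shown in the word clouds."""
    return [word for word, _ in lda.show_topic(topic_id, topn=n)]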
8. Diachronic Corpus Analysis
We examine the changes in the annotations over the course of time by grouping the corpus into decades of songs according to the distribution shown in Figure 3c.

Changes in Topics. The importance of certain topics has changed over the decades, as depicted in Figure 14a. Some topics have become more important, others have declined, or stayed relatively the same. We define the importance of a topic for a decade of songs as follows: first, the LDA topic model trained on the full corpus gives the probability of the topic for each song separately. We then average these song-wise probabilities over all songs of the decade. For each of the cases of growing, diminishing and constant importance, we display two topics. The topics War and Death have grown in importance over time. This is partially caused by the rise of Heavy Metal in the beginning of the 1970s, as the vocabulary of the Death topic is very typical for that genre (see for instance the "Metal top 100 words" in (Fell and Sporleder, 2014)). We measure a decline in the importance of the topics Love and Family. The topics Money and Religion seem to be evergreens, as their importance stayed rather constant over time.

Figure 14: Evolution of different annotations over the decades: (a) evolution of topic importance; (b) evolution of explicit content lyrics; (c) evolution of emotion.

Changes in Explicitness. We find that newer songs are more likely to be tagged as having explicit content lyrics. Figure 14b shows our estimates of explicitness per decade, the ratio of songs in the decade tagged as explicit to all songs of the decade. Note that the Parental Advisory Label was first distributed in 1985 and many older songs may not have been labelled retroactively. The depicted evolution of explicitness may therefore overestimate the "true explicitness" of newer music and underestimate it for music before 1985.

Changes in Emotion. We estimate the emotion of songs in a decade as the average valence and arousal of songs of that decade. We find songs to decrease both in valence and arousal over time. This decrease in positivity (valence) is in line with the decline of positively connotated topics such as Love and Family and the rise of topics with a more negative connotation such as War and Death.
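The per-decade topic importance described above boils down to a grouped average; the same pattern applies to the per-decade valence and arousal estimates. A small sketch is shown below, with an assumed input format (one (decade, topic_probabilities) pair per song) chosen purely for illustration.

# Sketch: average song-wise topic probabilities per decade to obtain the
# topic-importance curves plotted in Figure 14a. Assumed input format:
# songs = [(decade, [p_topic_0, ..., p_topic_59]), ...]
from collections import defaultdict

def topic_importance_by_decade(songs):
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for decade, probs in songs:
        if sums[decade] is None:
            sums[decade] = [0.0] * len(probs)
        sums[decade] = [s + p for s, p in zip(sums[decade], probs)]
        counts[decade] += 1
    return {decade: [s / counts[decade] for s in total]
            for decade, total in sums.items()}

example = [(1970, [0.2, 0.8]), (1970, [0.4, 0.6]), (1980, [0.5, 0.5])]
print(topic_importance_by_decade(example))
# approx. {1970: [0.3, 0.7], 1980: [0.5, 0.5]}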

9. Related Work
This section describes available songs and lyrics databases, and summarizes existing work on lyrics processing.

Songs and Lyrics Databases. The Million Song Dataset (MSD) project (http://millionsongdataset.com) (Bertin-Mahieux et al., 2011) is a collection of audio features and metadata for a million contemporary popular music tracks. Such a dataset shares some similarities with WASABI with respect to metadata extracted from Web resources (such as artist names, tags, years) and audio features, even if at a smaller scale. Given that it mainly focuses on audio data, a complementary dataset providing lyrics for the Million Song Dataset was released, called the musiXmatch dataset (http://millionsongdataset.com/musixmatch/). It consists of a collection of song lyrics in bag-of-words form (plus stemmed words), associated with MSD tracks. However, no other processing of the lyrics is done, as is the case in our work.

MusicWeb and its successor MusicLynx (Allik et al., 2018) link music artists within a Web-based application for discovering connections between them and provide a browsing experience using extra-musical relations. The project shares some ideas with WASABI, but works on the artist level, and does not perform analyses on the audio and lyrics content itself. It reuses, for example, MIR metadata from AcousticBrainz.

The WASABI project has been built on a broader scope than these projects and mixes a wider set of metadata, including ones from audio and natural language processing of lyrics. In addition, as presented in this paper, it comes with a large set of Web Audio enhanced applications (multitrack player, online virtual instruments and effects, on-demand audio processing, audio player based on extracted, synchronized chords, etc.).

Companies such as Spotify, GraceNote, Pandora, or Apple Music have sophisticated private knowledge bases of songs and lyrics to feed their search and recommendation algorithms, but such data are not available (and mainly rely on audio features).

Lyrics Segmentation. Only a few papers in the literature have focused on the automated detection of the structure of lyrics. (Watanabe et al., 2016) propose the task of automatically identifying segment boundaries in lyrics and train a logistic regression model for the task with repeated-pattern and textual features. (Mahedero et al., 2005) report experiments on the use of standard NLP tools for the analysis of music lyrics. Among the tasks they address, for structure extraction they focus on a small sample of lyrics having a clearly recognizable structure (which is not always the case) divided into segments. More recently, (Baraté et al., 2013) describe a semantics-driven approach to the automatic segmentation of song lyrics, and mainly focus on pop/rock music. Their goal is not to label a set of lines in a given way (e.g. verse, chorus), but rather to identify recurrent as well as non-recurrent groups of lines. They propose a rule-based method to estimate such structure labels of segmented lyrics.

Explicit Content Detection. (Bergelid, 2018) consider a dataset of English lyrics to which they apply classical machine learning algorithms. The explicit labels are obtained from Soundtrack Your Brand (https://www.soundtrackyourbrand.com). They also experiment with adding lyrics metadata to the feature set, such as the artist name, the release year, the music energy level, and the valence/positiveness of a song. (Chin et al., 2018) apply explicit lyrics detection to Korean song texts. They also use tf-idf weighted BOW as lyrics representation and aggregate multiple decision trees via boosting and bagging to classify the lyrics for explicit content. More recently, (Kim and Mun, 2019) proposed a neural network method to create explicit words dictionaries automatically by weighting a vocabulary according to all words' frequencies in the explicit class vs. the clean class. They work with a corpus of Korean lyrics.

Emotion Recognition. Recently, (Delbouys et al., 2018) address the task of multimodal music mood prediction based on the audio signal and the lyrics of a track. They propose a new model based on deep learning that outperforms traditional feature-engineering-based approaches. Performances are evaluated on their published dataset with associated valence and arousal values, which we introduced in Section 6. (Xia et al., 2008) model song texts in a low-dimensional vector space as bags of concepts, the "emotional units"; those are combinations of emotions, modifiers and negations. (Yang and Lee, 2009) leverage the music's emotion annotations from Allmusic, which they map to a lower-dimensional psychological model of emotion. They train a lyrics emotion classifier and show by qualitative interpretation of an ablated model (decision tree) that the deciding features leading to the classes are intuitively plausible. (Hu et al., 2009b) aim to detect emotions in song texts based on Russell's model of mood, rendering emotions continuously in the two dimensions of arousal and valence (positive/negative). They analyze each sentence as a bag of "emotional units"; they reweight sentences' emotions by both adverbial modifiers and tense and even consider progressing and adversarial valence in consecutive sentences. Additionally, singing speed is taken into account. With the fully weighted sentences, they perform clustering in the 2D plane of valence and arousal. Although the method is unsupervised at runtime, there are many parameters tuned manually by the authors in this work. (Mihalcea and Strapparava, 2012) render emotion detection as a multi-label classification problem: songs express intensities of six different basic emotions (anger, disgust, fear, joy, sadness, surprise). Their corpus (100 song texts) has time-aligned lyrics with information on musical key and note progression. Using Mechanical Turk, each line of song text is annotated with the six emotions. For emotion classification, they use bags of words and concepts, as well as the musical features key and notes. Their classification results using both modalities, textual and audio features, are significantly improved compared to a single modality.

Topic Modelling. Among the works addressing this task for song lyrics, (Mahedero et al., 2005) define five ad hoc topics (Love, Violent, Antiwar, Christian, Drugs) into which they classify their corpus of 500 song texts using supervision. Relatedly, (Fell, 2014) also use supervision to find bags of genre-specific n-grams. Employing the view from the literature that BOWs define topics, the genre-specific terms can be seen as mixtures of genre-specific topics. (Logan et al., 2004) apply the unsupervised topic model Probabilistic LSA to their ca. 40k song texts. They learn latent topics for both the lyrics corpus as well as a NYT newspaper corpus (for control) and show that the domain-specific topics slightly improve the performance in their MIR task. While their MIR task performs much better when using acoustic features, they discover that both methods err differently. (Kleedorfer et al., 2008) apply Non-negative Matrix Factorization (NMF) to ca. 60k song texts and cluster them into 60 topics. They show the so discovered topics to be intrinsically meaningful. (Sterckx, 2014) has worked on topic modelling of a large-scale lyrics corpus of 1M songs. They build models using Latent Dirichlet Allocation with topic counts between 60 and 240 and show that the 60-topic model gives a good trade-off between topic coverage and topic redundancy. Since popular topic models such as LDA represent topics as weighted bags of words, these topics are not immediately interpretable. This gives rise to the need for an automatic labelling of topics with shorter labels. A recent approach (Bhatia et al., 2016) relates the topical BOWs with titles of Wikipedia articles in a two-step procedure: first, candidates are generated, then ranked.

10. Conclusion
In this paper we have described the WASABI dataset of songs, focusing in particular on the lyrics annotations resulting from the application of the methods we proposed to extract relevant information from the lyrics. So far, lyrics annotations concern their structure segmentation, their topic, the explicitness of the lyrics content, the summary of a song and the emotions conveyed. Some of those annotation layers are provided for all the 1.73M songs included in the WASABI corpus, while some others apply to subsets of the corpus, due to various constraints described in the paper. Table 3 summarizes the most relevant annotations in our corpus.

Annotation      Labels  Description
Lyrics          1.73M   segments of lines of text
Languages       1.73M   36 different ones
Genre           1.06M   528 different ones
LastFM id       326k    UID
Structure       1.73M   SSM ∈ R^(n×n) (n: length)
Social tags     276k    S = {rock, joyful, 90s, ...}
Emotion tags    87k     E ⊂ S = {joyful, tragic, ...}
Explicitness    715k    True (52k), False (663k)
Explicitness_p  455k    True (85k), False (370k)
Summary_p       50k     four lines of song text
Emotion         16k     (valence, arousal) ∈ R^2
Emotion_p       1.73M   (valence, arousal) ∈ R^2
Topics_p        1.05M   Prob. distrib. ∈ R^60
Total tracks    2.10M   diverse metadata

Table 3: Most relevant song-wise annotations in the WASABI Song Corpus. Annotations with subscript p are predictions of our models.

As the creation of the resource is still ongoing, we plan to integrate an improved emotional description in future work. In (Atherton and Kaneshiro, 2016) the authors have studied how song writers influence each other. We aim to learn a model that detects the border between heavy influence and plagiarism.

Acknowledgement
This work is partly funded by the French National Research Agency (ANR) under the WASABI project (contract ANR-16-CE23-0017-01) and by the EU Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 690974 (MIREL).

11. Bibliographical References
Allik, A., Thalmann, F., and Sandler, M. (2018). MusicLynx: Exploring music through artist similarity graphs. In Companion Proc. (Dev. Track) The Web Conf. (WWW 2018).
Atherton, J. and Kaneshiro, B. (2016). I said it first: Topological analysis of lyrical influence networks. In ISMIR, pages 654–660.
Baraté, A., Ludovico, L. A., and Santucci, E. (2013). A semantics-driven approach to lyrics segmentation. In 2013 8th International Workshop on Semantic and Social Media Adaptation and Personalization, pages 73–79, Dec.
Bergelid, L. (2018). Classification of explicit music content using lyrics and music metadata.
Bhatia, S., Lau, J. H., and Baldwin, T. (2016). Automatic labelling of topics with neural embeddings. arXiv preprint arXiv:1612.05340.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Brackett, D. (1995). Interpreting Popular Music. Cambridge University Press.
Buffa, M. and Lebrun, J. (2017a). Real time tube guitar amplifier simulation using webaudio. In Proc. 3rd Web Audio Conference (WAC 2017).
Buffa, M. and Lebrun, J. (2017b). Web audio guitar tube amplifier vs native simulations. In Proc. 3rd Web Audio Conf. (WAC 2017).
Buffa, M., Lebrun, J., Kleimola, J., Letz, S., et al. (2018). Towards an open web audio plugin standard. In Companion Proceedings of The Web Conference 2018, pages 759–766. International World Wide Web Conferences Steering Committee.
Buffa, M., Lebrun, J., Pauwels, J., and Pellerin, G. (2019a). A 2 Million Commercial Song Interactive Navigator. In WAC 2019 - 5th WebAudio Conference 2019, Trondheim, Norway, December.
Buffa, M., Lebrun, J., Pellerin, G., and Letz, S. (2019b). WebAudio plugins in DAWs and for live performance. In 14th International Symposium on Computer Music Multidisciplinary Research (CMMR'19).
Çano, E. and Morisio, M. (May 2017). Music mood dataset creation based on last.fm tags. In 2017 International Conference on Artificial Intelligence and Applications, Vienna, Austria.
Chatterjee, A., Narahari, K. N., Joshi, M., and Agrawal, P. (2019). SemEval-2019 task 3: EmoContext contextual emotion detection in text. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 39–48.
Chin, H., Kim, J., Kim, Y., Shin, J., and Yi, M. Y. (2018). Explicit content detection in music lyrics using machine learning. In 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 517–521. IEEE.
Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier, J., and Moussallam, M. (2018). Music mood detection based on audio and lyrics with deep neural net. arXiv preprint arXiv:1809.07276.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Fell, M. and Sporleder, C. (2014). Lyrics-based analysis and classification of music. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 620–631.
Fell, M., Nechaev, Y., Cabrio, E., and Gandon, F. (2018). Lyrics segmentation: Textual macrostructure detection using convolutions. In Conference on Computational Linguistics (COLING), pages 2044–2054, Santa Fe, New Mexico, August.
Fell, M., Cabrio, E., Corazza, M., and Gandon, F. (2019a). Comparing automated methods to detect explicit content in song lyrics. In RANLP 2019 - Recent Advances in Natural Language Processing, Varna, Bulgaria, September.
Fell, M., Cabrio, E., Gandon, F., and Giboin, A. (2019b). Song lyrics summarization inspired by audio thumbnailing. In RANLP 2019 - Recent Advances in Natural Language Processing, Varna, Bulgaria, September.
Fell, M. (2014). Lyrics classification. Master's thesis, Saarland University, Germany.
Fillon, T., Simonnot, J., Mifune, M.-F., Khoury, S., Pellerin, G., and Le Coz, M. (2014). Telemeta: An open-source web framework for ethnomusicological audio archives management and automatic analysis. In Proceedings of the 1st International Workshop on Digital Libraries for Musicology, pages 1–8. ACM.
Hennequin, R., Khlif, A., Voituret, F., and Moussallam, M. (2019). Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models. Late-Breaking/Demo, ISMIR 2019, November. Deezer Research.
Hu, X., Downie, J. S., and Ehmann, A. F. (2009a). Lyric text mining in music mood classification. American Music, 183(5,049):2–209.
Hu, Y., Chen, X., and Yang, D. (2009b). Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. In ISMIR.
Kim, J. and Mun, Y. Y. (2019). A hybrid modeling approach for an automated lyrics-rating system for adolescents. In European Conference on Information Retrieval, pages 779–786. Springer.
Kleedorfer, F., Knees, P., and Pohle, T. (2008). Oh oh oh whoah! Towards automatic topic detection in song lyrics. In ISMIR.
Logan, B., Kositsky, A., and Moreno, P. (2004). Semantic analysis of song lyrics. In 2004 IEEE International Conference on Multimedia and Expo (ICME), volume 2, pages 827–830, June.
Mahedero, J. P. G., Martínez, A., Cano, P., Koppenberger, M., and Gouyon, F. (2005). Natural language processing of lyrics. In Proceedings of the 13th Annual ACM International Conference on Multimedia, MULTIMEDIA '05, pages 475–478, New York, NY, USA. ACM.

Mihalcea, R. and Strapparava, C. (2012). Lyrics, music, and emotions. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 590–599, Jeju Island, Korea, July. Association for Computational Linguistics.
Mohammad, S., Bravo-Marquez, F., Salameh, M., and Kiritchenko, S. (2018). SemEval-2018 task 1: Affect in tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17.
Mohammad, S. (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 174–184.
Parisi, L., Francia, S., Olivastri, S., and Tavella, M. S. (2019). Exploiting synchronized lyrics and vocal features for music emotion detection. CoRR, abs/1901.04831.
Pauwels, J. and Sandler, M. (2019). A web-based system for suggesting new practice material to music learners based on chord content. In Joint Proc. 24th ACM IUI Workshops (IUI2019).
Pauwels, J., Xambó, A., Roma, G., Barthet, M., and Fazekas, G. (2018). Exploring real-time visualisations to support chord learning with a large music collection. In Proc. 4th Web Audio Conf. (WAC 2018).
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161.
Sterckx, L. (2014). Topic detection in a million songs. Ph.D. thesis, Ghent University.
Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software.
Tagg, P. (1982). Analysing popular music: theory, method and practice. Popular Music, 2:37–67.
Vanni, L., Ducoffe, M., Aguilar, C., Precioso, F., and Mayaffre, D. (2018). Textual deconvolution saliency (TDS): a deep tool box for linguistic analysis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 548–557.
Warriner, A. B., Kuperman, V., and Brysbaert, M. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.
Watanabe, K., Matsubayashi, Y., Orita, N., Okazaki, N., Inui, K., Fukayama, S., Nakano, T., Smith, J., and Goto, M. (2016). Modeling discourse segments in lyrics using repeated patterns. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1959–1969.
Xia, Y., Wang, L., Wong, K.-F., and Xu, M. (2008). Sentiment vector space model for lyric-based song sentiment classification. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short '08, pages 133–136, Stroudsburg, PA, USA. Association for Computational Linguistics.
Yang, D. and Lee, W. (2009). Music emotion identification from lyrics. In 2009 11th IEEE International Symposium on Multimedia, pages 624–629, Dec.

12. Language Resource References
Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. (2011). The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011).
Meseguer-Brocal, G., Peeters, G., Pellerin, G., Buffa, M., Cabrio, E., Faron Zucker, C., Giboin, A., Mirbel, I., Hennequin, R., Moussallam, M., Piccoli, F., and Fillon, T. (2017). WASABI: a Two Million Song Database Project with Audio and Cultural Metadata plus WebAudio Enhanced Client Applications. In Web Audio Conference 2017 - Collaborative Audio #WAC2017, London, United Kingdom, August. Queen Mary University of London.
