MULTI-LABEL MUSIC GENRE CLASSIFICATION FROM AUDIO, TEXT, AND IMAGES USING DEEP FEATURES

Sergio Oramas 1, Oriol Nieto 2, Francesco Barbieri 3, Xavier Serra 1
1 Music Technology Group, Universitat Pompeu Fabra
2 Pandora Media Inc.
3 TALN Group, Universitat Pompeu Fabra
sergio.oramas, francesco.barbieri, [email protected], [email protected]

arXiv:1707.04916v1 [cs.IR] 16 Jul 2017

© Sergio Oramas, Oriol Nieto, Francesco Barbieri, Xavier Serra. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Sergio Oramas, Oriol Nieto, Francesco Barbieri, Xavier Serra. "Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features", 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.

ABSTRACT

Music genres allow us to categorize musical items that share common characteristics. Although these categories are not mutually exclusive, most related research has traditionally focused on classifying tracks into a single class. Furthermore, these categories (e.g., Pop, Rock) tend to be too broad for certain applications. In this work we aim to expand this task by categorizing musical items into multiple and fine-grained labels, using three different data modalities: audio, text, and images. To this end we present MuMu, a new dataset of more than 31k albums classified into 250 genre classes. For every album we have collected the cover image, text reviews, and audio tracks. Additionally, we propose an approach for multi-label genre classification based on the combination of feature embeddings learned with state-of-the-art deep learning methodologies. Experiments show major differences between modalities, which not only introduce new baselines for multi-label genre classification, but also suggest that combining them yields improved results.

1. INTRODUCTION

Music genres are useful labels for classifying musical items into broader categories that share similar musical, regional, or temporal characteristics. Dealing with large collections of music poses numerous challenges when retrieving and classifying information [3]. Music streaming services tend to offer catalogs of tens of millions of tracks, for which tasks such as music classification are of utmost importance.

Music genre classification is a widely studied problem in the Music Information Research (MIR) community [40]. However, almost all related work concentrates on multi-class classification of music items into broad genres (e.g., Pop, Rock), assigning a single label per item. This is problematic since there may be hundreds of more specific music genres [33], and these are not necessarily mutually exclusive (i.e., a song could be Pop and at the same time have elements of Deep House and a Reggae groove). In this work we aim to advance the field of music classification by framing it as multi-label classification of fine-grained genres.

To this end, we present MuMu, a new large-scale multimodal dataset for multi-label music genre classification. MuMu contains information on roughly 31k albums classified into one or more of 250 genre classes. For every album we analyze the cover image, text reviews, and audio tracks, with a total of approximately 147k audio tracks and 447k album reviews. Furthermore, we exploit this dataset with a novel deep learning approach to learn multiple genre labels for every album using different data modalities (i.e., audio, text, and image). In addition, we combine these modalities to study how the different combinations behave.

Results show how feature learning using deep neural networks substantially surpasses traditional approaches based on handcrafted features, reducing the gap between text-based and audio-based classification [29]. Moreover, an extensive comparison of different deep learning architectures for audio classification is provided, including the use of a dimensionality reduction approach that yields improved results. Finally, we show how the late fusion of feature vectors learned from different modalities achieves better scores than each of them individually.

2. RELATED WORK

Most published music genre classification approaches rely on audio sources [2, 40]. Traditional techniques typically use handcrafted audio features, such as Mel Frequency Cepstral Coefficients (MFCCs) [20], as input to a machine learning classifier (e.g., an SVM) [39, 44]. More recent deep learning approaches take advantage of visual representations of the audio signal in the form of spectrograms. These visual representations are used as input to Convolutional Neural Networks (CNNs) [5, 6, 8, 9, 34], following approaches similar to those used for image classification.
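As a concrete illustration of this spectrogram-plus-CNN pipeline, the following minimal sketch computes a log-scaled mel-spectrogram with librosa and feeds it to a small PyTorch network. The layer sizes, pooling choices, and parameter values are illustrative assumptions, not the architectures of the cited works.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mel_spectrogram(path, sr=22050, n_mels=96):
    """Load an audio clip and convert it to a log-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # shape: (n_mels, n_frames)

class GenreCNN(nn.Module):
    """A small CNN over the spectrogram 'image'; sizes are illustrative."""
    def __init__(self, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> fixed-size vector
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        h = self.conv(x).flatten(1)
        return self.fc(h)              # raw logits, one score per genre
```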
Text-based approaches have also been explored for this task. For instance, in [13, 29] album customer reviews are used as input for the classification, whereas in [4, 22] song lyrics are employed. By contrast, there are a limited number of papers dealing with image-based genre classification [18]. Most multimodal approaches for this task found in the literature combine audio and song lyrics as text [16, 27]. Moreover, the combination of audio and video has also been explored [37]. However, the authors are not aware of published multimodal approaches to music genre classification that involve deep learning.

Multi-label classification is a widely studied problem [14, 43]. Despite the scarcity of approaches for multi-label classification of music genres [36, 46], there is a long tradition in MIR of tag classification, which is a multi-label problem [5, 46].

3. MULTIMODAL DATASET

To the best of our knowledge, there are no publicly available large-scale datasets that encompass audio, images, text, and multi-label annotations. Therefore, we present MuMu, a new Multimodal Music dataset with multi-label genre annotations that combines information from the Amazon Reviews dataset [23] and the Million Song Dataset (MSD) [1]. The former contains millions of album customer reviews and album metadata gathered from Amazon.com. The latter is a collection of metadata and precomputed audio features for a million songs.

To map the information from both datasets we use MusicBrainz. For every album in the Amazon dataset, we query MusicBrainz with the album title and artist name to find the best possible match. Matching is performed using the same methodology described in [30], following a pairwise entity resolution approach based on string similarity. Following this approach, we were able to map 60% of the Amazon dataset. For all the matched albums, we obtain the MusicBrainz recording ids of their songs. With these, we use an available mapping from MSD to MusicBrainz to obtain the subset of recordings present in the MSD. From the mapped recordings, we only keep those associated with a unique album. This process yields a final set of 147,295 songs belonging to 31,471 albums.

The song features provided by the MSD are not generally suitable for deep learning [45], so in our experiments we instead use audio previews between 15 and 30 seconds long retrieved from 7digital.com. For the mapped set of albums, there are 447,583 customer reviews in the Amazon Dataset. In addition, the Amazon Dataset provides further information about each album, such as genre annotations, average rating, selling rank, similar products, cover image url, etc. We employ the provided image url to gather the cover art of all selected albums.

3.1 Genre Labels

Amazon has its own hierarchical taxonomy of music genres, which is up to four levels deep. In the first level there are 27 genres, and there are almost 500 genres overall. In our dataset, we keep the 250 genres that have been annotated in at least 12 albums. Every album in Amazon is annotated with one or more genres from different levels of the taxonomy. The Amazon Dataset contains complete information about the specific taxonomy branch used to classify each album. For instance, an album annotated as Traditional Pop comes with the complete branch information Pop / Oldies / Traditional Pop. To exploit both the taxonomic and the co-occurrence information, we provide every item with the labels of all of its branches. For example, an album classified as Jazz / Vocal Jazz and Pop / Vocal Pop is annotated in MuMu with the four labels Jazz, Vocal Jazz, Pop, and Vocal Pop, as shown in the sketch below. There are on average 5.97 labels for each song (3.13 standard deviation).
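This branch-expansion rule is straightforward to implement. The following sketch reproduces the example from the text; the input representation (one string per taxonomy branch) is an assumption about how the annotations could be stored, not the dataset's actual format.

```python
def expand_branches(branches):
    """Expand taxonomy branches into the full multi-label set.

    Each branch is a string such as "Pop / Oldies / Traditional Pop";
    every genre along a branch contributes one label to the item.
    """
    labels = set()
    for branch in branches:
        labels.update(g.strip() for g in branch.split("/"))
    return sorted(labels)

# The example from the text: an album filed under two branches
# receives the four labels Jazz, Vocal Jazz, Pop, and Vocal Pop.
print(expand_branches(["Jazz / Vocal Jazz", "Pop / Vocal Pop"]))
# ['Jazz', 'Pop', 'Vocal Jazz', 'Vocal Pop']
```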
Table 1. Top-10 most and least represented genres.

Genre                 % of albums     Genre               % of albums
Pop                   84.38           Tributes            0.10
Rock                  55.29           Harmonica Blues     0.10
Alternative Rock      27.69           Concertos           0.10
World Music           19.31           Bass                0.06
Jazz                  14.73           European Jazz       0.06
Dance & Electronic    12.23           Piano Blues         0.06
Metal                 11.50           Norway              0.06
Indie & Lo-Fi         10.45           Slide Guitar        0.06
R&B                   10.10           East Coast Blues    0.06
Folk                   9.69           Girl Groups         0.06

The labels in the dataset are highly unbalanced, following a distribution which might align well with those found in real-world scenarios. In Table 1 we see the top 10 most and least represented genres and the percentage of albums annotated with each label. The unbalanced character of the genre annotations poses an interesting challenge for music classification that we also aim to exploit. Among the multiple possibilities that this dataset may offer to the MIR community, we focus our work on the multi-label classification problem, described next.

4. MULTI-LABEL CLASSIFICATION

In multi-label classification, multiple target labels may be assigned to each classifiable instance. More formally: given a set of $n$ labels $L = \{l_1, l_2, \ldots, l_n\}$ and a set of $m$ items $I = \{i_1, i_2, \ldots, i_m\}$, we aim to model a function $f$ able to associate a set of $c$ labels to every item in $I$, where $c \in [1, n]$ varies for every item.
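One common way to realize such a function $f$ is a network with one sigmoid output per label, trained with a per-label binary cross-entropy loss, so that the number of active labels $c$ can vary per item. The following is a minimal PyTorch sketch of this idea; the feature dimension, the example label indices, and the 0.5 decision threshold are illustrative assumptions, not the exact setup of the experiments.

```python
import torch
import torch.nn as nn

n_labels, feat_dim = 250, 128   # |L| = 250 genres; the feature size is illustrative

# f maps an item's feature vector to one independent score per label.
f = nn.Linear(feat_dim, n_labels)

x = torch.randn(4, feat_dim)    # a batch of 4 items
logits = f(x)

# Binary cross-entropy treats each of the n labels as its own yes/no
# decision, so a variable number c of labels can be active per item.
y = torch.zeros(4, n_labels)
y[0, [3, 17, 42]] = 1.0         # e.g., item 0 carries c = 3 labels
loss = nn.BCEWithLogitsLoss()(logits, y)

# At prediction time, thresholding the per-label probabilities yields
# the predicted label set for each item.
probs = torch.sigmoid(logits)
predicted = probs > 0.5         # boolean mask over the n labels
```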