Oramas, S., et al. (2018). Multimodal Deep Learning for Music Genre Classification. Transactions of the International Society for Music Information Retrieval, 1(1), pp. 4-21. DOI: https://doi.org/10.5334/tismir.10

RESEARCH

Multimodal Deep Learning for Music Genre Classification

Sergio Oramas*,‡, Francesco Barbieri†, Oriol Nieto‡ and Xavier Serra*

* Music Technology Group, Universitat Pompeu Fabra, Barcelona, ES
† TALN Group, Universitat Pompeu Fabra, Barcelona, ES
‡ Pandora Media Inc., 94612 Oakland, US
Corresponding author: Sergio Oramas ([email protected])

Music genre labels are useful to organize songs, albums, and artists into broader groups that share similar musical characteristics. In this work, an approach to learn and combine multimodal data representations for music genre classification is proposed. Intermediate representations of deep neural networks are learned from audio tracks, text reviews, and cover art images, and further combined for classification. Experiments on single-label and multi-label genre classification are then carried out, evaluating the effect of the different learned representations and their combinations. Results of both experiments show how the aggregation of learned representations from different modalities improves the accuracy of the classification, suggesting that different modalities embed complementary information. In addition, the learning of a multimodal feature space increases the performance of pure audio representations, which may be especially relevant when the other modalities are available for training but not at prediction time. Moreover, a proposed approach for dimensionality reduction of target labels yields major improvements in multi-label classification, not only in terms of accuracy but also in terms of the diversity of the predicted genres, which implies a more fine-grained categorization. Finally, a qualitative analysis of the results sheds some light on the behavior of the different modalities in the classification task.

Keywords: information retrieval; deep learning; music; multimodal; multi-label classification

1. Introduction
The advent of large music collections has posed the challenge of how to retrieve, browse, and recommend their contained items. One way to ease access to large music collections is to keep tag annotations of all music resources (Sordo, 2012). Annotations can be added either manually or automatically. However, due to the high human effort required for manual annotations, the implementation of automatic annotation processes is more cost-effective.

Music genre labels are useful categories to organize and classify songs, albums, and artists into broader groups that share similar musical characteristics. Music genres have been widely used for music classification, from physical music stores to streaming services. Automatic music genre classification is thus a widely explored topic (Sturm, 2012; Bogdanov et al., 2016). However, almost all related work concentrates on the classification of music items into broad genres (e.g., Pop, Rock) using handcrafted audio features and assigning a single label per item (Sturm, 2012). This is problematic for several reasons. First, there may be hundreds of more specific music genres (Pachet and Cazaly, 2000), and these may not necessarily be mutually exclusive (e.g., a song could be Pop and at the same time have elements from Deep House and a Reggae groove). Second, handcrafted features may not fully represent the variability of the data. By contrast, representation learning approaches have demonstrated their superiority in multiple domains (Bengio et al., 2013). Third, large music collections contain different modalities of information, i.e., audio, images, and text, and all these data are suitable to be exploited for genre classification. Several approaches dealing with different modalities have been proposed (Wu et al., 2016; Schedl et al., 2013). However, to the best of our knowledge, no multimodal approach based on deep learning architectures has been proposed for this Music Information Retrieval (MIR) task, neither for single-label nor multi-label classification.

In this work, we aim to fill this gap by proposing a system able to predict music genre labels using deep learning architectures given different data modalities. Our approach is divided into two steps: (1) a neural network is trained on the classification task for each modality; (2) intermediate representations are extracted from each network and combined in a multimodal approach. Experiments on single-label and multi-label genre classification are then carried out, evaluating the effect of the learned data representations and their combination. Audio representations are learned from time-frequency representations of the audio signal, in the form of spectrograms, using Convolutional Neural Networks (CNNs). Visual representations are learned using a state-of-the-art CNN (ResNet) (He et al., 2016), initialized with pretrained parameters learned on a general image classification task (Russakovsky et al., 2015) and fine-tuned on the classification of music genre labels from the album cover images. Text representations are learned from music-related texts (e.g., album reviews) using a feedforward network over a Vector Space Model (VSM) representation of the texts, previously enriched with semantic information via entity linking (Oramas, 2017).
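To make the two-step scheme concrete, the following is a minimal PyTorch sketch of late fusion, not the authors' exact architecture: a small CNN over spectrograms and an ImageNet-pretrained ResNet stand in for the per-modality networks, their penultimate activations are extracted and concatenated, and a small feedforward classifier is trained on the fused representation. The ResNet-50 variant, the 128-dimensional audio embedding, the layer sizes, and the dummy input shapes are illustrative assumptions.

```python
# Step 1: train one network per modality on genre labels.
# Step 2: extract intermediate representations and classify their concatenation.
import torch
import torch.nn as nn
from torchvision import models


class AudioCNN(nn.Module):
    """Small CNN over a mel-spectrogram patch (assumed shape 1 x 96 x 128)."""

    def __init__(self, n_genres, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.embed = nn.Linear(64, emb_dim)           # intermediate representation
        self.classify = nn.Linear(emb_dim, n_genres)  # head used only in step 1

    def forward(self, x, return_embedding=False):
        h = torch.relu(self.embed(self.features(x).flatten(1)))
        return h if return_embedding else self.classify(h)


def visual_network(n_genres):
    """ResNet pretrained on ImageNet, with its head replaced for genre fine-tuning."""
    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    net.fc = nn.Linear(net.fc.in_features, n_genres)
    return net


class FusionClassifier(nn.Module):
    """Step 2: feedforward classifier over concatenated modality embeddings."""

    def __init__(self, dims, n_genres):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sum(dims), 256), nn.ReLU(),
                                 nn.Linear(256, n_genres))

    def forward(self, *embeddings):
        return self.net(torch.cat(embeddings, dim=1))


if __name__ == "__main__":
    n_genres = 15
    audio_net, visual_net = AudioCNN(n_genres), visual_network(n_genres)
    spec = torch.randn(4, 1, 96, 128)     # dummy spectrogram batch
    cover = torch.randn(4, 3, 224, 224)   # dummy album-cover batch
    # After step-1 training, take penultimate activations as representations.
    a_emb = audio_net(spec, return_embedding=True)
    backbone = nn.Sequential(*list(visual_net.children())[:-1])  # drop final fc
    v_emb = backbone(cover).flatten(1)    # 2048-d for ResNet-50
    fusion = FusionClassifier([a_emb.shape[1], v_emb.shape[1]], n_genres)
    print(fusion(a_emb, v_emb).shape)     # torch.Size([4, 15])
```

For the multi-label experiments discussed later, the fusion output would typically be paired with a sigmoid activation and a binary cross-entropy loss rather than a softmax.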
A first experiment on single-label classification is carried out from audio and images. In this experiment, in addition to the audio and visual learned representations, a multimodal feature space is learned by aligning both data representations. Results show that the fusion of audio and visual representations improves the performance of the classification over pure audio or visual approaches. In addition, the introduction of the multimodal feature space improves the quality of pure audio representations, even when no visual data are available at prediction time. Next, the performance of our learned models is compared with that of a human annotator, and a qualitative analysis of the classification results is reported. This analysis shows that audio and visual representations seem to complement each other. In addition, we study how the visual deep model focuses its attention on different regions of the input images when evaluating each genre.

These results are further expanded with an experiment on multi-label classification, which is carried out over audio, text, and images. Results from this experiment show again how the fusion of data representations learned from different modalities achieves better scores than each of them individually. In addition, we show that representation learning using deep neural networks substantially surpasses a traditional audio-based approach that employs handcrafted features. Moreover, an extensive comparison of different deep learning architectures for audio classification is provided, including the usage of a dimensionality reduction technique for labels that yields improved results. Then, a qualitative analysis of the multi-label classification experiment is finally reported.

This paper is an extended version of a previous contribution (Oramas et al., 2017a), with the main novel

2. Related Work
Traditional techniques typically use handcrafted audio features, such as Mel Frequency Cepstral Coefficients (MFCCs) (Logan, 2000), as input to a machine learning classifier (e.g., SVM, k-NN) (Tzanetakis and Cook, 2002; Seyerlehner et al., 2010a; Gouyon et al., 2004). More recent deep learning approaches take advantage of visual representations of the audio signal in the form of spectrograms. These visual representations of audio are used as input to Convolutional Neural Networks (CNNs) (Dieleman et al., 2011; Dieleman and Schrauwen, 2014; Pons et al., 2016; Choi et al., 2016a, b), following approaches similar to those used for image classification.
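As an illustration of the traditional pipeline mentioned above, the sketch below extracts frame-level MFCCs with librosa, summarizes them with simple statistics, and fits an SVM with scikit-learn. The 20-coefficient setting, the mean/standard-deviation aggregation, the RBF kernel, and the file names are assumptions for illustration, not the configurations used in the cited works.

```python
# Traditional baseline sketch: handcrafted MFCC statistics fed to an SVM.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def mfcc_features(path, n_mfcc=20):
    """Summarize a track as the mean and std of its frame-level MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def train_genre_svm(audio_paths, genre_labels):
    """Fit an RBF-kernel SVM on bag-of-frames MFCC statistics."""
    X = np.stack([mfcc_features(p) for p in audio_paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(X, genre_labels)
    return clf


# Usage with hypothetical files:
# clf = train_genre_svm(["blues_001.wav", "rock_017.wav"], ["blues", "rock"])
# print(clf.predict([mfcc_features("new_track.wav")]))
```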
Text-based approaches have also been explored for this task. For instance, one of the earliest attempts at the classification of music reviews is described by Hu et al. (2005), where experiments on multi-class genre classification and star rating prediction are reported. Similarly, Hu and Downie (2006) extend these experiments with a novel approach for predicting usages of music via agglomerative clustering, and conclude that bigram features are more informative than unigram features. Moreover, part-of-speech (POS) tags along with pattern mining techniques are applied by Downie and Hu (2006) to extract descriptive patterns for distinguishing negative from positive reviews. Additional textual evidence is leveraged by Choi et al. (2014), who consider lyrics as well as texts referring to the meaning of the song, which are used for training a kNN classifier that predicts song subjects (e.g., love, war, or drugs). In Oramas et al. (2016a), album reviews are semantically enriched and classified among 13 genre classes using an SVM classifier.

There are few papers dealing with image-based music genre classification (Libeks and Turnbull, 2011). Regarding multimodal approaches found in the literature, most of them combine audio and song lyrics (Laurier et al., 2008; Neumayer and Rauber, 2007). Other modalities, such as audio and video, have been explored (Schindler and Rauber, 2015). McKay and Fujinaga (2008) combine cultural, symbolic, and audio features for music classification. Multi-label classification is a widely studied problem in other domains (Tsoumakas and Katakis, 2006; Jain et al.,