LEARNING AUDIO EMBEDDINGS WITH USER LISTENING DATA FOR CONTENT-BASED RECOMMENDATION

Ke Chen 1, Beici Liang 2, Xiaoshuan Ma 2, Minwei Gu 2

1 CREL, Music Department, University of California San Diego, USA — [email protected]
2 QQ Music BU, Music Entertainment, China — {beiciliang, stanfordma, torogu}@tencent.com

ABSTRACT

Personalized recommendation on new track releases has always been a challenging problem in the music industry. To combat this problem, we first explore user listening history and demographics to construct a user embedding representing the user's music preference. With the user embedding and audio data from the user's liked and disliked tracks, an audio embedding can be obtained for each track using metric learning with Siamese networks. For a new track, we can decide the best group of users to recommend it to by computing the similarity between the track's audio embedding and different user embeddings, respectively. The proposed system yields state-of-the-art performance on content-based music recommendation tested with millions of users and tracks. Also, we extract audio embeddings as features for music genre classification tasks. The results show the generalization ability of our audio embeddings.

Index Terms— Audio Embedding, Music Representation Learning, Music Recommendation.

1. INTRODUCTION

With the increasing number of tracks in online streaming services, a personalized music recommendation system plays a vital role in discovering potential tracks and distributing them to the right users. Collaborative Filtering (CF) is a commonly used method that can infer similarities between items for recommendation [1]. It can be formulated as a deep neural network as in [2] to model user-item interactions and offer better recommendation performance. Using such a recommendation system as in YoutubeDNN [3], a User Embedding (UE) can be learned as a function of the user's history and context.

While CF performs well when historical data are available for each item, it suffers from the cold-start problem for novel or unpopular items. New track releases will not be recommended unless their similarities can be learned directly from the audio content. This has motivated researchers to improve content-based recommendation systems, which operate on Audio Embeddings (AE). Here AE refers to music representations that are extracted from audio. For example, the AE in [4] corresponds to probabilities of 62 semantic descriptors which include musical genre, mood, instruments and so on. These probabilistic estimates are the output of Support Vector Machines (SVMs) [5], which operate on low-level audio features. In [6, 7], the use of deep Convolutional Neural Networks (CNN) to predict AE is investigated. It outperforms traditional approaches using a Bag-of-Words representation from feature engineering. Recently, metric learning [8] has been used to learn AE. In [9], same/different-artist track pairs are used as supervision. And in [10], metadata and co-listen statistics are first used to train AE from audio inputs for music recommendation and tagging. Based on this work, attribute embeddings are obtained for playlist exploration and expansion in [11].

However, user listening data have not been fully utilized in content-based music recommendation. Since most of the existing studies [6, 12] rely on the Echo Nest Taste Profile [13] and the Million Song Dataset [14], user listening data only include play counts associated with limited users and tracks. Still, such data have shown how they surpass using audio only in mood classification [15] and estimation of contextual tags [16]. We believe a more informative UE can be obtained if more user listening data can be included. Such UE can also work together with AE to extract information that may not necessarily be present in the audio content.

In this paper, with real-world data collected from an online music platform, we propose a model with two branches as presented in Section 2. One is the user branch that considers different aspects of user listening data to train UE. This UE is further utilized in the audio branch, which uses metric learning with user-liked/disliked track pairs to obtain the AE of each track. With the trained model, accurate and efficient representations of tracks can be obtained for music recommendation and genre classification. Experimental results in Section 3 demonstrate significant improvements in related tasks.

Fig. 1: Architecture of the user branch to obtain UE using users' listening history and demographics.

Fig. 2: Architecture of the audio branch to obtain AE using UE and audio data.

2. PROPOSED MODEL

We first detail the user branch to obtain UE, and then present how the UE is used in the audio branch in a metric learning framework to obtain AE for content-based music recommendation.

2.1. User Branch

The user branch of our model encodes the user's music preference according to the user's listening history and demographics. It is designed to address a classification problem formulated as follows: given a user's listening history X_{1:t-1} = {x_1, x_2, ..., x_{t-1}} and demographics D, the model can classify x_t as a user-liked/disliked track by maximizing the conditional probability:

\max P(x_t \mid X_{1:t-1}, D) = P(x_t \mid x_1, \dots, x_{t-1}, a_1, \dots, a_{t-1}, s_1, \dots, s_{t-1}, D)   (1)

where x_t denotes music tracks, a_t denotes the albums in which these tracks appear, and s_t denotes the corresponding artists. All the data are represented by IDs which can be mapped to lookup-embeddings.

As shown in Figure 1, the network of the user branch is inspired by the structure of YoutubeDNN [3]. To accommodate music recommendation tasks, we use both explicit and implicit feedback of tracks to train the network. For instance, a user putting a track in a playlist is a positive example. To feed all lookup-embeddings to the network, embeddings of several groups are averaged into fixed-width vectors of 40 dimensions. They are concatenated into a vector, which is passed into hidden layers that are fully connected with leaky-ReLU activation. In training, a cross-entropy loss is minimized for the positive and the sampled negative classes.

The final state before the output layer is used as UE with a dimension of 40. At serving, nearest-neighbor lookup can be performed on UE to generate top-N candidates for music recommendation.

2.2. Audio Branch

The audio branch contains a Siamese model following the architecture in [12] to learn AE directly from audio. It also uses UE from the user branch as an important anchor for metric learning, as shown in Figure 2: given a user's UE (u), an AE of a track liked by this user (r^+), and n AEs of tracks disliked by this user ({r_1^-, r_2^-, ..., r_n^-}), the model should differentiate r^+ from r^-.

To get AE, we feed log-mel spectrograms of the audio inputs to the Siamese model, which consists of CNNs that share weights and configurations. As depicted in Figure 2, each CNN is composed of 5 convolution and max-pooling layers. A ReLU activation layer is used after every layer except for the final feature vector layer, which yields AE with a dimension of 40.

In training, we apply the negative sampling technique over the pairwise similarity score, which is measured as:

M(r, u) = \frac{u^T \cdot r}{|u| \cdot |r|}   (2)

The loss is then calculated using the above relevance scores obtained from the UE and the AEs of liked and disliked tracks. We use the max-margin hinge loss to set margins between positive and negative examples. This loss function is defined as follows:

L(u, R) = \sum_{i=1}^{n} \max[0, \Delta - M(r^+, u) + M(r_i^-, u)]   (3)

where \Delta is a margin hyper-parameter, which is set to 0.2 in this paper.

Using the trained CNN in the audio branch, a novel track can be represented by its AE. If the similarity between the AE and a user's UE yields a high score, the track can be a candidate that fits this user's music preference. Therefore AE can be used to recommend novel or unpopular tracks for which collaborative filtering data are unavailable. Experimental results verify that the proposed audio branch outperforms competing methods for content-based music recommendation.
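For clarity, a minimal PyTorch sketch of Eqs. (2) and (3) is given below. Tensor shapes and function names are illustrative assumptions rather than the exact implementation of our system; the sketch assumes the user branch and the Siamese CNNs already produce 40-dimensional embeddings.

import torch
import torch.nn.functional as F

def pairwise_similarity(r, u):
    # Cosine similarity M(r, u) between audio embeddings r and user embeddings u (Eq. 2).
    return F.cosine_similarity(r, u, dim=-1)

def max_margin_hinge_loss(u, r_pos, r_negs, margin=0.2):
    # u:      (batch, 40) user embeddings from the user branch.
    # r_pos:  (batch, 40) AE of the user-liked track.
    # r_negs: (batch, n, 40) AEs of the n user-disliked (negative-sampled) tracks.
    pos_score = pairwise_similarity(r_pos, u)                              # (batch,)
    neg_score = pairwise_similarity(r_negs, u.unsqueeze(1).expand_as(r_negs))  # (batch, n)
    # Hinge term max[0, margin - M(r+, u) + M(r_i-, u)], summed over the n negatives (Eq. 3).
    losses = torch.clamp(margin - pos_score.unsqueeze(1) + neg_score, min=0.0)
    return losses.sum(dim=1).mean()

In training, r_pos and r_negs would be the outputs of the weight-sharing CNNs on the liked and disliked log-mel segments, and the loss is back-propagated through the audio branch only.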
3. EXPERIMENTS

3.1. Experiment Setup and Dataset

Our deep collaborative filtering model in the user branch has been used to generate candidates for recommendation at QQ Music (https://y.qq.com). It was trained with over 160 million user-track interactions associated with 2 million tracks. To measure the performance, hit rate was used. The model can reach a 31% hit rate among the top-200 recommendations. A complete experimental report of the user branch is beyond the scope of this paper. We collected UE from the trained user branch for the following experiments on the audio branch.

For the metric learning in the audio branch, we created a dataset using user-liked/disliked tracks. Each user has at least 10 liked and 10 disliked tracks. Every track was processed into a log-mel spectrogram with 128 bins using a sample rate of 22050 Hz and a hop size of 512. For training efficiency, we randomly selected a segment of the log-mel spectrogram within the first 30 seconds, and denoted this segment as the context duration. Intuitively, we believe a user can decide to skip or keep listening to a track within 30 seconds. In total, our dataset includes approximately 1.9 million users and 1.1 million tracks.

Following the training configurations in [12], the loss was optimized using stochastic gradient descent with 0.9 Nesterov momentum and a 1e-6 learning rate decay. The initial learning rate was set to 0.1. We used PyTorch 1.5.1 (https://pytorch.org) on NVIDIA V100 GPU machines for training all the models in the audio branch.

To evaluate our proposed model, we first conducted experiments on music recommendation tasks to predict whether a track will be liked/disliked by a given user in Section 3.2. With a 60%/20%/20% split for training/validation/testing, AUC (Area Under the Receiver Operating Characteristic curve) and precision are used as evaluation metrics. Using the best-performing model to obtain AE, we applied the AE as features to genre classification in Section 3.3. This helps us explore whether our AE is capable of extracting certain genre information about a given track.
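One way to realize this preprocessing is sketched below with librosa, using the stated parameters (22050 Hz, hop 512, 128 mel bins, random crop within the first 30 seconds). The exact windowing and normalization are not specified in the text, so this should be read as an assumed reference implementation rather than our production pipeline.

import librosa
import numpy as np

SAMPLE_RATE = 22050
HOP_SIZE = 512
N_MELS = 128

def log_mel_context(path, context_duration=10.0, max_intro=30.0, rng=np.random):
    # Load only the first 30 seconds of the track.
    y, _ = librosa.load(path, sr=SAMPLE_RATE, duration=max_intro)
    # 128-bin mel spectrogram, converted to the log (dB) scale.
    mel = librosa.feature.melspectrogram(y=y, sr=SAMPLE_RATE,
                                         hop_length=HOP_SIZE, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)                        # shape (128, frames)
    # Randomly crop a context_duration-second segment from within the intro.
    frames = int(context_duration * SAMPLE_RATE / HOP_SIZE)
    start = rng.randint(0, max(1, log_mel.shape[1] - frames))
    return log_mel[:, start:start + frames]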

3.2. Evaluation on Music Recommendation Tasks

3.2.1. Models

All the models we evaluated are introduced as follows:

• Basic-Binary: The audio branch only contains one group of CNNs that uses either a user-liked or a user-disliked track as input. Then the similarity score is passed to a sigmoid function for binary classification.

• DCUE-1vs1: This refers to the Deep Content-User Embedding model in [12], which uses the user index to obtain lookup-embeddings instead of the UE from the user branch. As for the audio branch, a user-liked/disliked track pair is used as audio inputs, i.e., n = 1. This model is trained using metric learning.

• Multi-1vs1: The audio branch contains two CNN groups to obtain AEs of the user-liked/disliked track pair, respectively. Unlike Basic-Binary, this model uses a softmax function with categorical cross-entropy loss to maximize the positive relationships.

• Multi-1vs4: Similar to Multi-1vs1, it uses 4 user-disliked tracks as negative samples, i.e., n = 4.

• Metric-1vs1: This is our proposed model using metric learning with user-liked/disliked track pairs.

• Metric-1vs4: It is similar to Metric-1vs1, but uses 4 user-disliked tracks as negative samples.

3.2.2. Results

To efficiently compare the proposed model with other methods, we formed a sub-dataset including 12,881 users and 52,105 tracks. The split was performed on each user's liked and disliked tracks, respectively. Evaluation results with the two choices of context duration are presented in Table 1. We can observe that all the models using the proposed UE outperform DCUE-1vs1, which uses lookup-embeddings from user IDs as its UE. This shows our proposed UE can better capture users' music preferences, and enables the audio branch to perform better parameter fitting to the audio content. A more representative AE is expected to be obtained.

Table 1: Evaluation results on the sub-dataset with different context durations.

(a) Context duration equals 3 seconds.

Model         Precision  AUC
Basic-Binary  0.677      0.747
DCUE-1vs1     0.623      0.675
Multi-1vs1    0.745      0.752
Multi-1vs4    0.687      0.749
Metric-1vs1   0.691      0.765
Metric-1vs4   0.681      0.778

(b) Context duration equals 10 seconds.

Model         Precision  AUC
Basic-Binary  0.696      0.762
DCUE-1vs1     0.644      0.697
Metric-1vs1   0.717      0.788
Metric-1vs4   0.701      0.792
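For reference, both metrics can be computed from the similarity scores with scikit-learn as in the short sketch below. Converting scores into binary liked/disliked predictions with a fixed threshold is an assumption made here for illustration; the exact decision rule used for precision is not detailed above.

import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

def evaluate(scores, labels, threshold=0.0):
    # scores: similarity M(r, u) for each (user, track) pair in the test split.
    # labels: 1 for user-liked tracks, 0 for user-disliked tracks.
    auc = roc_auc_score(labels, scores)
    preds = (np.asarray(scores) >= threshold).astype(int)  # assumed decision rule
    precision = precision_score(labels, preds)
    return precision, auc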
beddings, which can be used as representations of new track Finally, we evaluated Metric-1vs1-10s using the big releases to solve the cold-start problem in music recommen- dataset. Unlike the sub-dataset, we no longer share the same dation systems. The model consists of a user branch and an users across different splits for the big dataset, which includes audio branch. Through our study, we showed that using a 60 millions user-track interactions. The model can achieve a more complete user listening data, including users’ listening precision score of 0.773 and the AUC score of 0.849. With history and demographics, can yield a more informative user this trained model, AE can be obtained directly from any au- embedding from the user branch. This user embedding also dio inputs. helps to infer audio embeddings using metric learning in the audio branch. We evaluated the model in music recommenda- 3.3. Evaluation on Genre Classification Tasks tion and genre classification tasks. A better performance was reported in both tasks. This verified that our audio embedding Current genre classification systems typically use audio- is an effective representation of the audio input. based features as input. To efficiently obtain a genre classi- For future works, a more detailed exploration of the model fier, we adopted transfer learning technique to put features architecture for audio branch should be done. Also, the gen- extracted from the pre-trained models into a classifier based eralization ability of the learned audio embedding in differ- on an SVM. We used musicnn [17] as the main pre-trained ent music information retrieval tasks has to be further inves- audio-based baseline. It is publicly available3 and achieves tigated. We can make extensive use of offline metrics to iter- state-of-the-art results on auto-tagging tasks. Variants of this atively improve our model. For the final determination of the model are dependent on the dataset it was trained on. We effectiveness of the proposed audio embedding, online A/B selected MSD musicnn, which was trained on roughly 200k testing on personalized recommendation of new track releases tracks from the Million Song Dataset. Baseline features were will be conducted. then extracted from the max pool layer in this model. 4https://github.com/jordipons/ To explore the potential of AE obtained from the pre- sklearn-audio-transfer-learning 5https://www.music-ir.org/nema_out/mirex2020/ 3https://github.com/jordipons/musicnn results/act/mixed_report/summary.html 5. REFERENCES recommendation and tagging,” in International Con- ference on Acoustics, Speech and Signal Processing, [1] Badrul Sarwar, George Karypis, Joseph Konstan, and ICASSP, 2020, pp. 8364–8368. John Riedl, “Item-based collaborative filtering recom- mendation algorithms,” in Proceedings of the 10th inter- [11] Ayush Patwari, Nicholas Kong, Jun Wang, Ullas Gargi, national conference on World Wide Web, WWW, 2001, Michele Covell, and Aren Jansen, “Semantically mean- pp. 285–295. ingful attributes from co-listen embeddings for playlist exploration and expansion,” in Proceedings of the 21st [2] Hao Wang, Naiyan Wang, and Dit-Yan Yeung, “Col- International Society for Music Information Retrieval laborative deep learning for recommender systems,” in Conference, ISMIR, 2020. Proceedings of the 21th ACM SIGKDD international [12] Jongpil Lee, Kyungyun Lee, Jiyoung Park, Jangyeon conference on knowledge discovery and data mining, Park, and Juhan Nam, “Deep content-user embedding 2015, pp. 1235–1244. 
4. CONCLUSION AND FUTURE WORK

In this paper, we presented a novel model to learn audio embeddings, which can be used as representations of new track releases to solve the cold-start problem in music recommendation systems. The model consists of a user branch and an audio branch. Through our study, we showed that using more complete user listening data, including users' listening history and demographics, can yield a more informative user embedding from the user branch. This user embedding also helps to infer audio embeddings using metric learning in the audio branch. We evaluated the model in music recommendation and genre classification tasks. A better performance was reported in both tasks. This verified that our audio embedding is an effective representation of the audio input.

For future work, a more detailed exploration of the model architecture for the audio branch should be done. Also, the generalization ability of the learned audio embedding in different music information retrieval tasks has to be further investigated. We can make extensive use of offline metrics to iteratively improve our model. For the final determination of the effectiveness of the proposed audio embedding, online A/B testing on personalized recommendation of new track releases will be conducted.

5. REFERENCES

[1] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web, WWW, 2001, pp. 285–295.

[2] Hao Wang, Naiyan Wang, and Dit-Yan Yeung, "Collaborative deep learning for recommender systems," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1235–1244.

[3] Paul Covington, Jay Adams, and Emre Sargin, "Deep neural networks for YouTube recommendations," in Proceedings of the 10th ACM Conference on Recommender Systems, RecSys, 2016, pp. 191–198.

[4] Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Anna Xambó, Emilia Gómez, and Perfecto Herrera, "Semantic audio content-based music recommendation and visualization based on user preference examples," Information Processing & Management, vol. 49, no. 1, pp. 13–33, 2013.

[5] Theodoros Evgeniou and Massimiliano Pontil, "Support vector machines: Theory and applications," in Machine Learning and Its Applications, Advanced Lectures, 2001, vol. 2049, pp. 249–257.

[6] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen, "Deep content-based music recommendation," in Advances in Neural Information Processing Systems, NeurIPS, 2013, pp. 2643–2651.

[7] Dawen Liang, Minshu Zhan, and Daniel P. W. Ellis, "Content-aware collaborative music recommendation using pre-trained neural networks," in Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR, 2015, pp. 295–301.

[8] Eric P. Xing, Michael I. Jordan, Stuart J. Russell, and Andrew Y. Ng, "Distance metric learning with application to clustering with side-information," in Advances in Neural Information Processing Systems, NeurIPS, 2003, pp. 521–528.

[9] Jiyoung Park, Jongpil Lee, Jangyeon Park, Jung-Woo Ha, and Juhan Nam, "Representation learning of music using artist labels," in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, 2018, pp. 717–724.

[10] Qingqing Huang, Aren Jansen, Li Zhang, Daniel P. W. Ellis, Rif A. Saurous, and John R. Anderson, "Large-scale weakly-supervised content embeddings for music recommendation and tagging," in International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2020, pp. 8364–8368.

[11] Ayush Patwari, Nicholas Kong, Jun Wang, Ullas Gargi, Michele Covell, and Aren Jansen, "Semantically meaningful attributes from co-listen embeddings for playlist exploration and expansion," in Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

[12] Jongpil Lee, Kyungyun Lee, Jiyoung Park, Jangyeon Park, and Juhan Nam, "Deep content-user embedding model for music recommendation," in Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems, DLRS@RecSys, 2018.

[13] Brian McFee, Thierry Bertin-Mahieux, Daniel P. W. Ellis, and Gert R. G. Lanckriet, "The million song dataset challenge," in Proceedings of the 21st International Conference on World Wide Web, WWW, 2012, pp. 909–916.

[14] Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere, "The million song dataset," in Proceedings of the 12th International Conference on Music Information Retrieval, ISMIR, 2011.

[15] Filip Korzeniowski, Oriol Nieto, Matthew C. McCallum, Minz Won, Sergio Oramas, and Erik M. Schmidt, "Mood classification using listening data," in Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

[16] Karim Ibrahim, Elena Epure, Geoffroy Peeters, and Gael Richard, "Should we consider the users in contextual music auto-tagging models?," in Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR, 2020.

[17] Jordi Pons and Xavier Serra, "musicnn: Pre-trained convolutional neural networks for music audio tagging," in Proceedings of the 20th International Society for Music Information Retrieval Conference, Late-Breaking/Demo Session, ISMIR, 2019.

[18] Karl Pearson, "On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559–572, 1901.

[19] George Tzanetakis and Perry R. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[20] Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, "FMA: A dataset for music analysis," in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR, 2017.