Automatic, Personalized, and Flexible Playlist Generation Using Reinforcement Learning
Shun-Yao Shih (National Taiwan University, [email protected]) and Heng-Yu Chi (KKBOX Inc., Taipei, Taiwan, [email protected])

ABSTRACT

Songs can be well arranged by professional music curators to form a riveting playlist that creates an engaging listening experience. However, it is time-consuming for curators to keep rearranging these playlists to fit future trends. By exploiting techniques from deep learning and reinforcement learning, in this paper we consider music playlist generation as a language modeling problem and solve it with the proposed attention language model trained by policy gradient. We develop a systematic and interactive approach so that the resulting playlists can be tuned flexibly according to user preferences. Considering a playlist as a sequence of words, we first train our attention RNN language model on baseline recommended playlists. By optimizing suitably imposed reward functions, the model is then refined toward the corresponding preferences. The experimental results demonstrate that our approach not only generates coherent playlists automatically but can also flexibly recommend personalized playlists for diversity, novelty, and freshness.

1. INTRODUCTION

Professional music curators or DJs are usually able to carefully select, order, and form a list of songs that gives listeners a brilliant listening experience. For a music radio show with a specific topic, they can collect songs related to the topic and sort them into a smooth context. By considering the preferences of users, curators can also find what each user likes and recommend suitable lists of songs. However, different people have different preferences toward diversity, popularity, and so on, so it would be valuable to refine playlists according to each user's preferences on the fly. Moreover, as online music streaming services grow, the demand for efficient and effective music playlist recommendation keeps rising, and automatic, personalized playlist generation thus becomes a critical issue. It is infeasible and expensive for editors to generate suitable playlists daily or hourly for all users based on their preferences about trends, novelty, diversity, etc. Therefore, most previous works deal with these problems under particular assumptions. McFee et al. [14] consider playlist generation as a language modeling problem and solve it by adopting statistical techniques; unfortunately, statistical methods do not perform well on small datasets. Pampalk et al. [16] generate playlists by exploiting explicit user behaviors such as skipping, but they do not provide a systematic way to handle implicit user preferences on playlists.

As a result, to generate personalized playlists automatically and flexibly, we develop a novel and scalable music playlist generation system that consists of three main steps. First, we adopt the work of Chen et al. [4] to generate baseline playlists based on the preferences of users about songs: given the relationships between users and songs, we construct a corresponding bipartite graph, calculate embedding features of the songs from it, and obtain a baseline playlist for each song by finding its k-nearest neighbors (a sketch of this step is given below). Second, by formulating the baseline playlists as sequences of words, we pretrain an RNN language model (RNN-LM) to obtain good initial parameters for the subsequent optimization with policy gradient reinforcement learning. We adopt an RNN-LM because it captures sequential structure better than traditional statistical methods in many generation tasks, and because neural networks can be combined with reinforcement learning to achieve better performance [10]. Finally, given the preferences from user profiles and the pretrained parameters, we generate personalized playlists by applying policy gradient reinforcement learning with corresponding reward functions. Combining these training steps, the experimental results show that we can easily generate personalized playlists that satisfy the different preferences of users.
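As an illustration of the first step, the following sketch builds a baseline playlist by k-nearest-neighbor search over song embeddings. This is a minimal sketch, not the paper's implementation: it assumes the embeddings have already been learned from the user-song bipartite graph following Chen et al. [4], and `song_vecs`, `baseline_playlist`, and all dimensions are hypothetical placeholders.

```python
import numpy as np

# Hypothetical stand-in for song embeddings learned from the user-song
# bipartite graph (Chen et al. [4]); in practice these would be loaded
# from the embedding step, not sampled at random. One row per song.
rng = np.random.default_rng(0)
song_vecs = rng.standard_normal((10000, 128))
song_vecs /= np.linalg.norm(song_vecs, axis=1, keepdims=True)  # unit norm

def baseline_playlist(seed_idx, k=30):
    """Return the k songs most similar (by cosine) to the seed song."""
    sims = song_vecs @ song_vecs[seed_idx]   # cosine similarity to every song
    order = np.argsort(-sims)                # most similar first
    return [int(i) for i in order if i != seed_idx][:k]

playlist = baseline_playlist(seed_idx=42, k=30)
```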
Our contributions are summarized as follows:

• We design an automatic playlist generation framework which is able to provide timely recommended playlists for online music streaming services.

• We remodel music playlist generation as a sequence prediction problem using an RNN-LM, which is easily combined with the policy gradient reinforcement learning method.

• The proposed method can flexibly generate suitable personalized playlists according to user profiles, using corresponding optimization goals in policy gradient.

The rest of this paper is organized as follows. In Section 2, we introduce several related works on playlist generation and recommendation. In Section 3, we provide the essential background of our work on policy gradient. In Section 4, we introduce the details of our proposed model, the attention RNN-LM with concatenation (AC-RNN-LM). In Section 5, we show the effectiveness of our method, and we conclude our work in Section 6.

2. RELATED WORK

Given a list of songs, previous works try to rearrange them into better song sequences [1, 3, 5, 12]. They first construct a song graph by considering the songs in a playlist as vertices and the relevance of audio features between songs as edges, and then find a Hamiltonian path with some desired property, such as smooth transitions between songs [3], to create a new ordering of the songs.

User feedback is also an important consideration in playlist generation [6, 7, 13, 16]. By considering several properties of users' recently played favorite songs, such as tempo, loudness, topics, and artists, the authors of [6, 7] provide personalized playlists based on each user's favorite properties. Pampalk et al. [16] treat skip behaviors as negative signals, so their approach can automatically choose the next song according to audio features while avoiding skipped songs. Maillet et al. [13] provide a more interactive approach: users manipulate the weights of tags that express high-level music characteristics and obtain the corresponding playlists they want. To better integrate user behavior into playlist generation, several works combine playlist generation algorithms with techniques from reinforcement learning [11, 19]. Xing et al. first introduce exploration into traditional collaborative filtering to learn the preferences of users. Liebman et al. bring the formulation of the Markov Decision Process into the playlist generation framework, designing algorithms that learn representations of user preferences based on hand-crafted features; using these representations, they can generate personalized playlists for users.

Beyond playlist generation, several works adopt the concept of playlist generation to facilitate recommendation systems. Given a set of songs, Vargas et al. [18] propose several scoring functions, such as diversity and novelty, and retrieve the top-K songs with the highest scores for each user as the resulting recommended list of songs. Chen et al. [4] propose a query-based music recommendation system that allows a user to select a preferred song as a seed song and obtain related songs as a recommended playlist.

3. POLICY GRADIENT REINFORCEMENT LEARNING

Reinforcement learning has attracted a great deal of public attention since Silver et al. [17] proposed a general reinforcement learning algorithm with which an agent can achieve superhuman performance in many games. Reinforcement learning has also been successfully applied to many other problems, such as dialogue generation modeled as a Markov Decision Process (MDP).

A Markov Decision Process is usually denoted by a tuple $(S, A, P, R, \gamma)$, where

• $S$ is a set of states;
• $A$ is a set of actions;
• $P(s, a, s') = \Pr[s' \mid s, a]$ is the transition probability that action $a$ in state $s$ will lead to state $s'$;
• $R(s, a) = \mathbb{E}[r \mid s, a]$ is the expected reward that an agent receives when taking action $a$ in state $s$;
• $\gamma \in [0, 1]$ is the discount factor, representing the importance of future rewards.

Policy gradient is a reinforcement learning algorithm for solving MDP problems. Modeling an agent with parameters $\theta$, the goal of this algorithm is to find the best parameters of a policy $\pi_\theta(s, a) = \Pr[a \mid s; \theta]$, as measured by the average reward per time step

    J(\theta) = \sum_{s \in S} d^{\pi_\theta}(s) \sum_{a \in A} \pi_\theta(s, a) R(s, a)    (1)

where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.

Usually, we assume that $\pi_\theta(s, a)$ is differentiable with respect to its parameters $\theta$, i.e., that $\partial \pi_\theta(s, a) / \partial \theta$ exists, and solve the optimization problem in Eqn. (1) by gradient ascent. Formally, given a small enough learning rate $\alpha$, we update the parameters $\theta$ by

    \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)    (2)

where

    \nabla_\theta J(\theta) = \sum_{s \in S} d^{\pi_\theta}(s) \sum_{a \in A} \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) R(s, a) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a) R(s, a)\big]    (3)
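As a concrete reading of Eqns. (2) and (3), the following PyTorch sketch performs one REINFORCE-style update. It is a minimal illustration under stated assumptions, not the paper's training code: `policy` is assumed to be any network mapping a batch of states to log-probabilities over actions, and `states`, `actions`, and `rewards` are tensors collected from sampled episodes.

```python
import torch

def reinforce_step(policy, optimizer, states, actions, rewards):
    """One gradient-ascent step on J(theta) via E[grad log pi(a|s) R(s, a)].

    Assumes policy(states) returns log pi(.|s) with shape (batch, n_actions),
    actions is a LongTensor of shape (batch,), rewards a FloatTensor (batch,).
    """
    log_pi = policy(states)                                     # log-probabilities
    chosen = log_pi.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    loss = -(chosen * rewards).mean()  # negate: minimizing == ascending J (Eqn 3)
    optimizer.zero_grad()
    loss.backward()                    # autograd supplies the grad-log-pi terms
    optimizer.step()                   # theta <- theta + alpha * grad J (Eqn 2)
```

In the playlist setting described above, a state corresponds to the playlist generated so far, an action to the choice of the next song, and the reward function encodes the preference being optimized, such as diversity, novelty, or freshness.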
4. THE PROPOSED MODEL

The proposed model consists of two main components. We first introduce the structure of the proposed RNN-based model in Section 4.1. Then, in Section 4.2, we formulate the problem as a Markov Decision Process and solve the formulated problem by policy gradient to generate refined playlists.

4.1 Attention RNN Language Model

Given a sequence of tokens $\{w_1, w_2, \ldots, w_t\}$, an RNN-LM estimates the probability $\Pr[w_t \mid w_{1:t-1}]$ with a recurrent function

    h_t = f(h_{t-1}, w_{t-1})    (4)

and an output function, usually a softmax,

    \Pr[w_t = v_i \mid w_{1:t-1}] = \frac{\exp(W_{v_i}^\top h_t + b_{v_i})}{\sum_k \exp(W_{v_k}^\top h_t + b_{v_k})}    (5)
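Eqns. (4) and (5) translate directly into a few lines of PyTorch. The sketch below is a plain RNN-LM for next-token prediction, not the full AC-RNN-LM (the attention and concatenation components are omitted), and the GRU cell and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model: Eqn (4) recurrence plus Eqn (5) softmax."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # f in Eqn (4)
        self.out = nn.Linear(hidden_dim, vocab_size)                # W, b in Eqn (5)

    def forward(self, tokens):
        # tokens: (batch, t-1) indices of w_1 .. w_{t-1}; each song is one token
        h, _ = self.rnn(self.embed(tokens))   # h_t = f(h_{t-1}, w_{t-1}), unrolled
        logits = self.out(h[:, -1])           # scores for the next token w_t
        return torch.softmax(logits, dim=-1)  # Pr[w_t = v_i | w_{1:t-1}]

# Example: distribution over the next song given a 3-song prefix.
model = RNNLM(vocab_size=10000)
probs = model(torch.tensor([[3, 17, 42]]))    # shape (1, 10000)
```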