AUTOMATIC, PERSONALIZED, AND FLEXIBLE PLAYLIST GENERATION USING REINFORCEMENT LEARNING

Shun-Yao Shih (National Taiwan University) [email protected]
Heng-Yu Chi (KKBOX Inc., Taiwan) [email protected]

ABSTRACT

Songs can be well arranged by professional curators to form a riveting playlist that creates engaging listening experiences. However, it is time-consuming for curators to keep rearranging these playlists to fit future trends. By exploiting the techniques of deep learning and reinforcement learning, in this paper we consider music playlist generation as a language modeling problem and solve it with the proposed attention language model trained by policy gradient. We develop a systematic and interactive approach so that the resulting playlists can be tuned flexibly according to user preferences. Considering a playlist as a sequence of words, we first train our attention RNN language model on baseline recommended playlists. By optimizing suitably imposed reward functions, the model is then refined for the corresponding preferences. The experimental results demonstrate that our approach not only generates coherent playlists automatically but is also able to flexibly recommend personalized playlists with respect to diversity, novelty, and freshness.

1. INTRODUCTION

Professional music curators or DJs are usually able to carefully select, order, and form a list of songs that gives listeners a brilliant listening experience. For a music radio program with a specific topic, they can collect songs related to the topic and order them so that the playlist flows smoothly. By considering the preferences of users, curators can also find what users like and recommend them several lists of songs. However, different people have different preferences toward diversity, popularity, and so on. Therefore, it would be valuable if we could refine playlists based on the different preferences of users on the fly.

Besides, as online music streaming services grow, there is more and more demand for efficient and effective music playlist recommendation. Automatic and personalized music playlist generation thus becomes a critical issue. However, it is infeasible and expensive for editors to generate suitable playlists daily or hourly for all users based on their preferences about trends, novelty, diversity, etc. Therefore, most previous works try to deal with such problems under particular assumptions. McFee et al. [14] consider playlist generation as a language modeling problem and solve it by adopting statistical techniques. Unfortunately, statistical methods do not perform well on small datasets. Pampalk et al. [16] generate playlists by exploiting explicit user behaviors such as skipping; however, they do not provide a systematic way to handle implicit user preferences on playlists.

As a result, to generate personalized playlists automatically and flexibly, we develop a novel and scalable music playlist generation system. The system consists of three main steps. First, we adopt Chen et al.'s work [4] to generate baseline playlists based on the preferences of users about songs. In detail, given the relationship between users and songs, we first construct a corresponding bipartite graph. With this user-song graph, we can calculate embedding features of songs and thus obtain a baseline playlist for each song by finding its k nearest neighbors. Second, by formulating baseline playlists as sequences of words, we pretrain an RNN language model (RNN-LM) to obtain better initial parameters for the subsequent optimization with policy gradient reinforcement learning. We adopt an RNN-LM because RNN-LMs not only capture sequential information better than traditional statistical methods in many generation tasks, but can also be combined with reinforcement learning to achieve better performance [10]. Finally, given preferences from user profiles and the pretrained parameters, we generate personalized playlists by exploiting policy gradient reinforcement learning with corresponding reward functions. Combining these training steps, the experimental results show that we can generate personalized playlists that satisfy different preferences of users with ease.

Our contributions are summarized as follows:

• We design an automatic playlist generation framework which is able to provide timely recommended playlists for online music streaming services.

• We recast music playlist generation as a sequence prediction problem using an RNN-LM, which is easily combined with policy gradient reinforcement learning.

• The proposed method can flexibly generate suitable personalized playlists according to user profiles by using corresponding optimization goals in policy gradient.

The rest of this paper is organized as follows. In Section 2, we introduce several related works on playlist generation and recommendation. In Section 3, we provide the essential prior knowledge of our work related to policy gradient. In Section 4, we introduce the details of our proposed model, the attention RNN-LM with concatenation (AC-RNN-LM). In Section 5, we show the effectiveness of our method, and we conclude our work in Section 6.

2. RELATED WORK

Given a list of songs, previous works try to rearrange them into better song sequences [1, 3, 5, 12]. First, they construct a song graph by treating the songs in a playlist as vertices and the relevance of audio features between songs as edges. Then they find a Hamiltonian path with certain properties, such as smooth transitions between songs [3], to create a new song sequence. User feedback is also an important consideration when generating playlists [6, 7, 13, 16]. By considering several properties of users' recently played favorite songs, such as tempo, loudness, topics, and artists, the authors of [6, 7] provide personalized playlists based on the users' favorite properties. Pampalk et al. [16] treat skip behaviors as negative signals, and their approach automatically chooses the next song according to audio features while avoiding skipped songs. Maillet et al. [13] provide users with a more interactive approach: users can manipulate the weights of tags to express high-level music characteristics and obtain the corresponding playlists they want. To better integrate user behavior into playlist generation, several works combine playlist generation algorithms with reinforcement learning techniques [11, 19]. Xing et al. [19] first introduce exploration into traditional collaborative filtering to learn the preferences of users. Liebman et al. [11] bring the Markov Decision Process formulation into a playlist generation framework to design algorithms that learn representations of user preferences based on hand-crafted features. By using these representations, they can generate personalized playlists for users.

Beyond playlist generation, several works adopt the concept of playlist generation to facilitate recommendation systems. Given a set of songs, Vargas et al. [18] propose several scoring functions, such as diversity and novelty, and retrieve the top-K songs with the highest scores for each user as the resulting recommended list of songs. Chen et al. [4] propose a query-based music recommendation system that allows users to select a preferred song as a seed song and obtain related songs as a recommended playlist.

3. POLICY GRADIENT REINFORCEMENT LEARNING

Reinforcement learning has received a great deal of public attention since Silver et al. [17] proposed a general reinforcement learning algorithm with which an agent can achieve superhuman performance in many games. Moreover, reinforcement learning has been successfully applied to many other problems, such as dialogue generation modeled as a Markov Decision Process (MDP).

A Markov Decision Process is usually denoted by a tuple (S, A, P, R, γ), where

• S is a set of states

• A is a set of actions

• P(s, a, s') = Pr[s'|s, a] is the transition probability that action a in state s will lead to state s'

• R(s, a) = E[r|s, a] is the expected reward that an agent will receive when taking action a in state s

• γ ∈ [0, 1] is the discount factor representing the importance of future rewards

Policy gradient is a reinforcement learning algorithm for solving MDP problems. Modeling an agent with parameters θ, the goal of the algorithm is to find the best θ of a policy πθ(s, a) = Pr[a|s, θ], measured by the average reward per time-step

J(θ) = Σ_{s∈S} d^{πθ}(s) Σ_{a∈A} πθ(s, a) R(s, a)    (1)

where d^{πθ}(s) is the stationary distribution of the Markov chain for πθ.

Usually, we assume that πθ(s, a) is differentiable with respect to its parameters θ, i.e., ∂πθ(s, a)/∂θ exists, and solve the optimization problem in Eqn (1) by gradient ascent. Formally, given a small enough learning rate α, we update the parameters θ by

θ ← θ + α ∇θ J(θ)    (2)

where

∇θ J(θ) = Σ_{s∈S} d^{πθ}(s) Σ_{a∈A} πθ(s, a) ∇θ log πθ(s, a) R(s, a) = E_{πθ}[∇θ log πθ(s, a) R(s, a)]    (3)
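To make the update in Eqns (2) and (3) concrete, the following is a minimal sketch of one REINFORCE-style gradient step for a categorical policy over songs, written with PyTorch. It is an illustrative example only; the names policy_net and reward_fn are hypothetical placeholders and not components released with the paper.

```python
# Illustrative sketch (not the authors' code): one REINFORCE-style update
# implementing the sampled estimator of Eqn (3) with PyTorch.
import torch

def policy_gradient_step(policy_net, optimizer, state, reward_fn):
    """One gradient-ascent step on J(theta) using a single sampled action."""
    logits = policy_net(state)                     # unnormalized scores over actions (songs)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                         # a ~ pi_theta(s, .)
    reward = reward_fn(state, action)              # R(s, a), a plain float
    # REINFORCE: grad J ~ grad log pi_theta(s, a) * R(s, a); minimize the negative.
    loss = -dist.log_prob(action) * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action.item(), reward
```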

4. THE PROPOSED MODEL

The proposed model consists of two main components. We first introduce the structure of the proposed RNN-based model in Section 4.1. Then, in Section 4.2, we formulate the problem as a Markov Decision Process and solve it by policy gradient to generate refined playlists.

4.1 Attention RNN Language Model

Given a sequence of tokens {w_1, w_2, ..., w_t}, an RNN-LM estimates the probability Pr[w_t|w_{1:t−1}] with a recurrent function

h_t = f(h_{t−1}, w_{t−1})    (4)

and an output function, usually a softmax,

Pr[w_t = v_i | w_{1:t−1}] = exp(W_{v_i}^T h_t + b_{v_i}) / Σ_k exp(W_{v_k}^T h_t + b_{v_k})    (5)

where the implementation of the function f depends on which kind of RNN cell we use, h_t ∈ R^D, W ∈ R^{D×V} with the column vector W_{v_i} corresponding to a word v_i, and b ∈ R^V with the scalar b_{v_i} corresponding to a word v_i (D is the number of units in the RNN, and V is the number of unique tokens in all sequences).

We then update the parameters of the RNN-LM by maximizing the log-likelihood on a set of N sequences, {s_1, s_2, ..., s_N}, with corresponding tokens {w^{s_i}_1, w^{s_i}_2, ..., w^{s_i}_{|s_i|}}:

L = (1/N) Σ_{n=1}^{N} Σ_{t=2}^{|s_n|} log Pr[w^{s_n}_t | w^{s_n}_{1:t−1}]    (6)

4.1.1 Attention in RNN-LM

Attention mechanisms in sequence-to-sequence models have proven effective in fields such as image caption generation, machine translation, and dialogue generation. Several previous works also indicate that attention is even more effective in RNN-LMs [15].

In an attention RNN language model (A-RNN-LM), given the hidden states from time t − C_ws to t, denoted as h_{t−C_ws:t}, where C_ws is the attention window size, we want to compute a context vector c_t as a weighted sum of the hidden states h_{t−C_ws:t−1} and then encode the context vector c_t into the original hidden state h_t:

β_i = ν^T tanh(W_1 h_t + W_2 h_{t−C_ws+i})    (7)

α_i = exp(β_i) / Σ_{k=0}^{C_ws−1} exp(β_k)    (8)

c_t = Σ_{i=0}^{C_ws−1} α_i h_{t−C_ws+i}    (9)

h'_t = W_3 [h_t ; c_t]    (10)

where β is Bahdanau's scoring style [2], W_1, W_2 ∈ R^{D×D}, and W_3 ∈ R^{D×2D}.

4.1.2 Our Attention RNN-LM with Concatenation

In our work, {s_1, s_2, ..., s_N} are playlists and {w^{s_i}_1, w^{s_i}_2, ..., w^{s_i}_{|s_i|}} are their songs, obtained by adopting Chen et al.'s work [4]. More specifically, given a seed song w^{s_i}_1 for a playlist s_i, we find the top-k approximate nearest neighbors of w^{s_i}_1 to form the list of songs {w^{s_i}_1, w^{s_i}_2, ..., w^{s_i}_{|s_i|}}.

The proposed attention RNN-LM with concatenation (AC-RNN-LM) is shown in Figure 1. We pad w_{1:t−1} to w_{1:T} and concatenate the corresponding h'_{1:T} as the input of our RNN-LM's output function in Eqn (5), where T is the maximum number of songs we consider in one playlist. Therefore, our output function becomes

Pr[w_t = v_i | w_{1:t−1}] = exp(W_{v_i}^T h' + b_{v_i}) / Σ_k exp(W_{v_k}^T h' + b_{v_k})    (11)

where W ∈ R^{DT×V}, b ∈ R^V, and

h' = [h'_1 ; h'_2 ; ... ; h'_T] ∈ R^{DT×1}    (12)

Figure 1. The structure of our attention RNN language model with concatenation.
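Below is a minimal PyTorch sketch of how the attention (Eqns 7-10) and the concatenation of per-step vectors before the softmax (Eqns 11-12) could be assembled. The module name, default sizes, and the simplified attention windowing (which includes the current step) are our assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch (assumed implementation) of an AC-RNN-LM forward pass:
# an LSTM language model with Bahdanau-style attention whose per-step vectors
# h'_t are concatenated before the output softmax.
import torch
import torch.nn as nn

class ACRNNLM(nn.Module):
    def __init__(self, vocab_size, dim=64, layers=4, max_len=30, window=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.W1 = nn.Linear(dim, dim, bias=False)   # scores the current state
        self.W2 = nn.Linear(dim, dim, bias=False)   # scores the windowed states
        self.v = nn.Linear(dim, 1, bias=False)      # Bahdanau scoring vector (Eqn 7)
        self.W3 = nn.Linear(2 * dim, dim)           # mixes h_t and c_t (Eqn 10)
        self.out = nn.Linear(dim * max_len, vocab_size)  # softmax input is the concatenation (Eqn 11)
        self.window = window

    def forward(self, songs):                        # songs: (batch, max_len) ids, padded
        h, _ = self.rnn(self.embed(songs))           # (batch, max_len, dim)
        steps = []
        for t in range(h.size(1)):
            lo = max(0, t - self.window)
            ctx = h[:, lo:t + 1, :]                  # windowed hidden states
            score = self.v(torch.tanh(self.W1(h[:, t:t + 1, :]) + self.W2(ctx)))
            alpha = torch.softmax(score, dim=1)      # attention weights (Eqn 8)
            c_t = (alpha * ctx).sum(dim=1)           # context vector (Eqn 9)
            steps.append(self.W3(torch.cat([h[:, t, :], c_t], dim=-1)))  # h'_t (Eqn 10)
        h_cat = torch.cat(steps, dim=-1)             # concatenated h' (Eqn 12)
        return self.out(h_cat)                       # logits over the song vocabulary
```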

4.2 Policy Gradient

We exploit policy gradient to optimize Eqn (1). The problem is formulated as follows.

4.2.1 Action

An action a is a song id, a unique representation of each song, that the model is about to generate. The set of actions in our problem is finite, since we only want to recommend a limited range of songs.

4.2.2 State

A state s is the set of songs we have already recommended, including the seed song, {w^{s_i}_1, w^{s_i}_2, ..., w^{s_i}_{t−1}}.

4.2.3 Policy

A policy πθ(s, a) takes the form of our AC-RNN-LM and is defined by its parameters θ.

4.2.4 Reward

The reward R(s, a) is a weighted sum of several reward functions R_i : S × A → R. In the following, we formulate four important metrics for playlist generation. The policy of our pretrained AC-RNN-LM is denoted as πθ_RNN(s, a) with parameters θ_RNN, and the policy of our AC-RNN-LM optimized by policy gradient is denoted as πθ_RL(s, a) with parameters θ_RL.

Diversity represents the variety in a recommended list of songs. Several playlists generated in Chen et al.'s work [4] are composed of songs by the same artist or from the same album, which is not regarded as a good playlist for a recommendation system because of its low diversity. We therefore measure diversity by the Euclidean distance between the embeddings of the last song in the existing playlist, w^s_{|s|}, and the predicted song, a:

R_1(s, a) = −log(|d(w^s_{|s|}, a) − C_distance|)    (13)

where d(·) is the Euclidean distance between the embeddings of w^s_{|s|} and a, and C_distance is a parameter representing the Euclidean distance that we want the model to learn.

Novelty is also important for a playlist generation system: we would like to recommend something new to users instead of something familiar. Unlike previous works, our model can easily generate playlists with novelty by applying a corresponding reward function. We model the reward for novelty as a weighted sum of normalized playcounts over several periods of time [20]:

R_2(s, a) = −log( Σ_t w(t) · log(p_t(a)) / log(max_{a'∈A} p_t(a')) )    (14)

where w(t) is the weight of time period t, with the constraint Σ_t w(t) = 1, p_t(a) is the playcount of song a in period t, and A is the set of actions. Note that songs with fewer playcounts have a higher value of R_2, and vice versa.

Freshness is a subjective metric for personalized playlist generation. For example, the latest songs are usually more interesting to young people, while older people may prefer old-school songs. Here, we arbitrarily choose one direction of optimization for the agent πθ_RL to show the feasibility of our approach:

R_3(s, a) = −log( (Y_a − 1900) / (2017 − 1900) )    (15)

where Y_a is the release year of song a.

Coherence is the major feature we should consider in order to avoid situations where the generated playlists are highly rewarded but lack a cohesive listening experience. We therefore treat the policy of our pretrained language model, πθ_RNN(s, a), which is well trained on coherent playlists, as a good enough generator of coherent playlists:

R_4(s, a) = log(Pr[a | s, θ_RNN])    (16)

Combining the above reward functions, our final reward for the action a is

R(s, a) = γ_1 R_1(s, a) + γ_2 R_2(s, a) + γ_3 R_3(s, a) + γ_4 R_4(s, a)    (17)

where the selection of γ_1, γ_2, γ_3, and γ_4 depends on the application. Note that although we only list four reward functions here, the optimization goal R can easily be extended with a linear combination of more reward functions.
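The sketch below shows how the four reward terms and their combination in Eqn (17) could be computed in practice. It is a hedged illustration only: song_embedding, playcounts, release_year, and pretrained_log_prob are hypothetical lookups supplied by the surrounding system, and the +1 smoothing is an added assumption to avoid log(0).

```python
# Illustrative sketch (assumed helpers, not the authors' code) of Eqn (17).
import numpy as np

def combined_reward(state, action, gammas, song_embedding, playcounts,
                    release_year, pretrained_log_prob, c_distance=1.0, eps=1e-8):
    """Weighted sum of the four reward terms for taking `action` in `state`."""
    last_song = state[-1]
    # R1, diversity (Eqn 13): keep the embedding distance close to C_distance.
    dist = np.linalg.norm(song_embedding[last_song] - song_embedding[action])
    r1 = -np.log(abs(dist - c_distance) + eps)
    # R2, novelty (Eqn 14): weighted, normalized log-playcounts over time periods.
    # playcounts is a list of (weight, {song_id: playcount}) pairs.
    ratio = sum(w * np.log(counts[action] + 1.0) / np.log(max(counts.values()) + 1.0)
                for w, counts in playcounts)
    r2 = -np.log(ratio + eps)
    # R3, freshness (Eqn 15): older release years receive larger rewards,
    # matching the direction arbitrarily chosen in the paper.
    r3 = -np.log((release_year[action] - 1900) / (2017 - 1900))
    # R4, coherence (Eqn 16): log-probability under the pretrained AC-RNN-LM.
    r4 = pretrained_log_prob(state, action)
    return float(np.dot(gammas, [r1, r2, r3, r4]))
```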

Table 1. Weights of the reward functions for each model.

Model         γ_1    γ_2    γ_3    γ_4
RL-DIST       0.5    0.0    0.0    0.5
RL-NOVELTY    0.0    0.5    0.0    0.5
RL-YEAR       0.0    0.0    0.5    0.5
RL-COMBINE    0.2    0.2    0.2    0.4
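As a usage note, the γ settings in Table 1 map directly onto the gammas argument of the illustrative combined_reward sketch above; the mapping below simply restates the table in code (names are hypothetical).

```python
# Gamma weights from Table 1 as configurations for the illustrative reward above.
REWARD_WEIGHTS = {
    "RL-DIST":    (0.5, 0.0, 0.0, 0.5),   # diversity + coherence
    "RL-NOVELTY": (0.0, 0.5, 0.0, 0.5),   # novelty + coherence
    "RL-YEAR":    (0.0, 0.0, 0.5, 0.5),   # freshness + coherence
    "RL-COMBINE": (0.2, 0.2, 0.2, 0.4),   # all four objectives
}
```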

5. EXPERIMENTS AND ANALYSIS

In the following experiments, we first introduce the details of the dataset and evaluation metrics in Section 5.1 and the training details in Section 5.2. In Section 5.3, we compare pretrained RNN-LMs with different combinations of mechanisms by perplexity to show that our proposed AC-RNN-LM is more effective and efficient than the others. To demonstrate the effectiveness of our proposed method, the AC-RNN-LM combined with reinforcement learning, we adopt three standard metrics, diversity, novelty, and freshness (cf. Section 5.1), to validate our models in Section 5.4. Moreover, we demonstrate that it is effortless to flexibly manipulate the properties of the resulting playlists in Section 5.5. Finally, in Section 5.6, we discuss the details of designing reward functions for given preferred properties.

5.1 Dataset and Evaluation Metrics

The playlist dataset is provided by KKBOX Inc., a leading regional music streaming company. It consists of 10,000 playlists, each of which is composed of 30 songs. There are 45,243 unique songs in total.

To validate our proposed approach, we use the following metrics.

Perplexity is calculated based on the song probability distributions as follows:

perplexity = e^{ (1/N) Σ_{n=1}^{N} Σ_x −q(x) log p(x) }

where N is the number of training samples, x is a song in our song pool, p is the predicted song probability distribution, and q is the song probability distribution in the ground truth.

Diversity is computed as the number of distinct artist unigrams scaled by the total length of each playlist, measured by Distinct-1 [9].

Novelty measures whether a system recommends something new to users [20]. The more novel the recommendations are, the lower this metric is.

Freshness is directly measured by the average release year of the songs in each playlist.
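As a rough illustration, the sketch below shows how the metrics above could be computed for a generated playlist; the function names and the assumed data layout (lists of artist names, release years, and probability distributions) are ours, not the authors'. The novelty score follows Zhang et al. [20] and is omitted here.

```python
# Illustrative sketch (assumed data layout, not the authors' evaluation code)
# of the evaluation metrics in Section 5.1.
import numpy as np

def distinct_1(artists):
    """Diversity: distinct artist unigrams scaled by playlist length (Distinct-1)."""
    return len(set(artists)) / len(artists)

def freshness(release_years):
    """Freshness: average release year of the songs in a playlist."""
    return float(np.mean(release_years))

def perplexity(pred_dists, true_dists):
    """Perplexity over N samples; each row is a distribution over the song pool."""
    cross_entropies = [-(q * np.log(p + 1e-12)).sum()
                       for p, q in zip(pred_dists, true_dists)]
    return float(np.exp(np.mean(cross_entropies)))
```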
5.2 Training Details

In both the pretraining and the reinforcement learning stages, we use 4 layers and 64 units per layer in all RNN-LMs with LSTM units, and we choose T = 30 for all RNN-LMs with padding and concatenation. The optimizer we use is Adam [8]. The learning rates for the pretraining stage and the reinforcement learning stage are empirically set to 0.001 and 0.0001, respectively.

5.3 Pretrained Model Comparison

In this section, we compare the training error of RNN-LMs combined with different mechanisms. The RNN-LM with attention is denoted as A-RNN-LM, the RNN-LM with concatenation is denoted as C-RNN-LM, and the RNN-LM with both attention and concatenation is denoted as AC-RNN-LM. Figure 2 reports the training error of the different RNN-LMs as log-perplexity, which is equal to the negative log-likelihood, over training steps 1 to 500,000. Here, one training step means that we update our parameters with one batch. As shown in Figure 2, the training error of our proposed model, AC-RNN-LM, not only decreases faster than that of the other models but also reaches the lowest value at the end of training. Therefore, we adopt AC-RNN-LM as our pretrained model.

Figure 2. Log-perplexity of different pretrained models on the dataset under different training steps.

It is worth noting that the pretrained model serves two purposes. One is to provide a good basis for further optimization, and the other is to estimate the transition probabilities of songs in the reward function. Therefore, we simply select the model with the lowest training error to be optimized by policy gradient and to serve as the estimator of Pr[a|s, θ_RNN] (cf. Eqn (16)).

5.4 Playlist Generation Results

As shown in Table 2, to evaluate our method we compare 6 models on 3 important features of a playlist generation system: diversity, novelty, and freshness (cf. Section 5.1). The details of the models are as follows. Embedding represents the model of Chen et al.'s work [4], which constructs song embeddings from the relationships between users and songs and then finds the approximate k nearest neighbors of each song. RL-DIST, RL-NOVELTY, RL-YEAR, and RL-COMBINE are models that are pretrained and then optimized by the policy gradient method (cf. Eqn (17)) with the different weights shown in Table 1.

Table 2. Comparison on different metrics for playlist generation systems (the bold text represents the best, and the underlined text the second best, in the typeset table).

Model           Diversity   Novelty   Freshness
Embedding [4]   0.32        0.19      2007.97
AC-RNN-LM       0.39        0.20      2008.41
RL-DIST         0.44        0.20      2008.37
RL-NOVELTY      0.42        0.05      2012.89
RL-YEAR         0.40        0.19      2006.23
RL-COMBINE      0.49        0.18      2007.64

The experimental results show that, for a single objective such as diversity, our models can accurately generate playlists with the corresponding property. For example, RL-YEAR generates playlists consisting of songs with earlier release years than Embedding and AC-RNN-LM. Moreover, even when we impose multiple reward functions on our model, we still obtain better resulting playlists than Embedding and AC-RNN-LM. A sample result is shown in Figure 3.

Figure 3. Sample playlists generated by our approach. The left one is generated by Embedding [4] and the right one is generated by RL-COMBINE.

From Table 2, we demonstrate that by using appropriate reward functions, our approach can easily generate playlists that fit the corresponding needs. We can systematically find more songs from different artists (RL-DIST), more songs heard by fewer people (RL-NOVELTY), or more old songs for particular groups of users (RL-YEAR).

5.5 Flexibly Manipulating Playlist Properties

After showing that our approach can easily fit several needs, we further investigate the influence of γ on the resulting playlists. In this section, several models are trained with the weight γ_2 ranging from 0.0 to 1.0 to show the variation in the novelty of the resulting playlists. Here we keep γ_2 + γ_4 = 1.0 and γ_1 = γ_3 = 0, and fix the number of training steps to 10,000.

As shown in Figure 4, the novelty score generally decreases as γ_2 increases from 0.0 to 1.0, although the model may sometimes find the optimal policy earlier than expected, as with γ_2 = 0.625. Nevertheless, in general, our approach not only lets the model generate more novel songs but also makes the extent of novelty controllable. Besides automation, this kind of flexibility is also important in applications.

Figure 4. Novelty score of playlists generated with different imposed weights.

Take an online music streaming service as an example. When the service provider wants to recommend playlists to a user who usually listens to non-mainstream but familiar songs (i.e., a novelty score of 0.4), it is more suitable to generate playlists consisting of songs whose novelty scores equal 0.4, rather than playlists composed of 60% songs with novelty scores of 0.0 and 40% songs with novelty scores of 1.0. Since users usually have different preferences for each property, automatically generating playlists that fit the needs of each user, such as for novelty, becomes indispensable. The experimental results verify that our proposed approach can satisfy this application.

5.6 Limitations of Reward Function Design

When we try to define a reward function R_i for a property, we should carefully avoid bias from the state s. In other words, reward functions should be specific to the corresponding feature we want. One common issue is that the reward function may be influenced by the number of songs in state s. For example, in our experiments we adopt Distinct-1 as the metric for diversity. However, we cannot also adopt Distinct-1 directly as our reward function, because it is scaled by the length of the playlist, which means that all actions taken from states with fewer songs would be favored. This difference between R_1 and Distinct-1 is the reason that RL-DIST does not achieve the best performance in Distinct-1 (cf. Table 2). In summary, we should be careful when designing reward functions, and sometimes we may need to formulate another approximate objective function to avoid such biases.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we develop a playlist generation system which is able to generate personalized playlists automatically and flexibly. We first formulate playlist generation as a language modeling problem. Then, by exploiting the techniques of RNN-LMs and reinforcement learning, the proposed approach can flexibly generate suitable playlists for different user preferences.

In future work, we will further investigate the possibility of automatically generating playlists by considering qualitative feedback. For online music streaming service providers, professional music curators give qualitative feedback on generated playlists so that developers can improve the quality of the playlist generation system. Qualitative feedback such as "songs from diverse artists but similar genres" is relatively easy to quantify: we can design suitable reward functions and generate the corresponding playlists with our approach. However, other feedback, such as "a falling-in-love playlist", is more difficult to quantify. Therefore, we will further incorporate audio features and explicit tags of songs in order to provide a better playlist generation system.
7. REFERENCES

[1] Masoud Alghoniemy and Ahmed H. Tewfik. A network flow model for playlist generation. In Proceedings of the International Conference on Multimedia and Expo, pages 329–332, 2001.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

[3] Rachel M. Bittner, Minwei Gu, Gandalf Hernandez, Eric J. Humphrey, Tristan Jehan, P. Hunter McCurry, and Nicola Montecchio. Automatic playlist sequencing and transitions. In Proceedings of the 18th International Conference on Music Information Retrieval, pages 472–478, 2017.

[4] Chih-Ming Chen, Ming-Feng Tsai, Yu-Ching Lin, and Yi-Hsuan Yang. Query-based music recommendations via preference embedding. In Proceedings of the ACM Conference Series on Recommender Systems, pages 79–82, 2016.

[5] Arthur Flexer, Dominik Schnitzer, Martin Gasser, and Gerhard Widmer. Playlist generation using start and end songs. In Proceedings of the 9th International Society for Music Information Retrieval Conference, pages 173–178, 2008.

[6] Negar Hariri, Bamshad Mobasher, and Robin Burke. Context-aware music recommendation based on latent-topic sequential patterns. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys '12, pages 131–138, New York, NY, USA, 2012. ACM.

[7] Dietmar Jannach, Lukas Lerche, and Iman Kamehkhosh. Beyond "hitting the hits": Generating coherent music playlist continuations with the right tracks. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys '15, pages 187–194, New York, NY, USA, 2015. ACM.

[8] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[9] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. A diversity-promoting objective function for neural conversation models. CoRR, abs/1510.03055, 2015.

[10] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. CoRR, abs/1606.01541, 2016.

[11] Elad Liebman and Peter Stone. DJ-MC: A reinforcement-learning agent for music playlist recommendation. CoRR, abs/1401.1880, 2014.

[12] Beth Logan. Content-based playlist generation: Exploratory experiments. 2002.

[13] François Maillet, Douglas Eck, Guillaume Desjardins, and Paul Lamere. Steerable playlist generation by learning song similarity from radio station playlists. In Proceedings of the 10th International Society for Music Information Retrieval Conference, 2009.

[14] Brian McFee and Gert Lanckriet. The natural language of playlists. In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 537–542, 2011.

[15] Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. Coherent dialogue with attention-based language models. In The Thirty-First AAAI Conference on Artificial Intelligence, 2016.

[16] Elias Pampalk, Tim Pohle, and Gerhard Widmer. Dynamic playlist generation based on skipping behavior. In Proceedings of the 6th International Society for Music Information Retrieval Conference, pages 634–637, 2005.

[17] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 2017.

[18] Saúl Vargas. New approaches to diversity and novelty in recommender systems. In Proceedings of the Fourth BCS-IRSG Conference on Future Directions in Information Access, FDIA'11, pages 8–13, Swindon, UK, 2011. BCS Learning & Development Ltd.

[19] Zhe Xing, Xinxi Wang, and Ye Wang. Enhancing collaborative filtering music recommendation by balancing exploration and exploitation. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 445–450, 2014.

[20] Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor. Auralist: Introducing serendipity into music recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM '12, pages 13–22, New York, NY, USA, 2012. ACM.