
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Adapting to User Preference Changes in Interactive Recommendation ∗

Negar Hariri, Bamshad Mobasher, Robin Burke
DePaul University, Chicago, IL
[email protected]  [email protected]  [email protected]

∗ This paper was invited for submission to the Best Papers From Sister Conferences Track, based on a paper that appeared in the Eighth ACM Conference on Recommender Systems, RecSys '14.

Abstract

Recommender systems have become essential tools in many application areas as they help alleviate information overload by tailoring their recommendations to users' personal preferences. Users' interests in items, however, may change over time depending on their current situation. Without considering the current circumstances of a user, recommendations may match the general preferences of the user, but they may have small utility for the user in his/her current situation. We focus on designing systems that interact with the user over a number of iterations and at each step receive feedback from the user in the form of a reward or utility value for the recommended items. The goal of the system is to maximize the sum of obtained utilities over each interaction session. We use a multi-armed bandit strategy to model this online learning problem and we propose techniques for detecting changes in user preferences. The recommendations are then generated based on the most recent preferences of a user. Our evaluation results indicate that our method can improve the existing bandit algorithms by considering the sudden variations in the user's feedback behavior.

2 Introduction

Users' interests in items may change depending on their current situation. For example, users may have different preferences for music depending on their current mood or activity. Similarly, users' preferences for travel packages can depend on the current season or the weather conditions. In these and many other similar applications, recommendations are not useful for the user if they are generated based on previous selections and preferences of the user without considering his/her current situation.

In this paper, we focus on a common setting where a user's change of interests is not explicitly available and the system should infer this information based on user feedback data. In this case, the system tracks the user's set of interactions over time and detects any significant changes in the user's behavior. User interaction data can include various types of feedback given by the user to the recommended items, such as ratings or clicks on the recommendations. Detecting the preference changes enables the recommender to distinguish between the new and old preferences of the user and to generate the recommendations that match the user's current interests.

We investigate this problem in the framework of an interactive recommender system that can dynamically adapt to changes in users' interests. At each step of the interaction, the system presents a list of recommendations and receives feedback indicating recommendation utilities. The main objective of the proposed recommendation approach is to identify the recommendation list at each step such that the average obtained utility is maximized over the interaction session with a user. Choosing the best recommendation list at each step requires a strategy in which the best trade-off between exploitation and exploration is achieved. During an exploitation phase, the recommender system generates the recommendation list that maximizes the immediate utility for the current step. During exploration, the system may recommend items that do not optimize the immediate utility, but reduce the uncertainty in modeling a user's preferences and eventually help maximize the utility over the whole session. This problem can be formulated as a multi-armed bandit problem (MAB), which has been studied in various systems and applications [Chapelle and Li, 2011], [Li et al., 2010]. For the application of personalized recommendation, most of the existing bandit algorithms assume that a user's interests in items are static over time, despite the fact that a change of preferences can greatly impact the utility or usefulness of the recommendations for a user. We address this problem by proposing a change detection algorithm that predicts sudden variations in a user's preferences based on his/her feedback on the recommendations. These changes are then considered by our bandit recommendation method when generating the recommendations for a user.

Our evaluation results show that by considering users' change of interests in a bandit strategy, the recommender can provide more interesting and useful recommendations to users.

3 Related Work

Bandit algorithms have been widely used for the application of producing personalized recommendations [Chapelle and Li, 2011], [Li et al., 2010]. The news recommender introduced in [Li et al., 2010] adapts to the temporal changes of existing content. In the proposed system, both users and items are represented as sets of features. Features of a user can include demographic attributes as well as features extracted from the user's historical activities, such as clicking or viewing items, while item features may contain descriptive information and categories. They formulate this problem as a bandit problem with linear payoffs and introduce an upper confidence bound bandit strategy which is shared among all the users. Similarly, the system in [Chapelle and Li, 2011] uses a bandit algorithm based on Thompson sampling for the task of news recommendation and display advertisement. Similar to [Li et al., 2010], several features are used to represent the user and the items, and the recommender learns a single bandit strategy for all users while representing each user as a fixed set of features. Our approach is different from [Li et al., 2010] and [Chapelle and Li, 2011] as we are learning a bandit strategy for each user. While these two methods focus on adapting to the changes in item popularities, our system captures and adapts to the changes in users' interests. Representing each user as a fixed set of features, as in [Li et al., 2010] and [Chapelle and Li, 2011], would prevent us from detecting changes in the users' preferences and capturing their current interests. Similarly, other bandit strategies such as [Deshpande and Montanari, 2013] and [Abbasi-Yadkori et al., 2011] disregard the variations in user preferences and give all the interactions the same weight in defining the bandit strategy. Unlike these methods, our approach enables us to monitor changes in the users' behaviors and to produce recommendations that match the current interests of the user.

Tailoring the recommendations to match the current preferences of users follows the general idea of context-awareness [Adomavicius et al., 2011] in recommender systems. Integrating context in the recommendation model, when this information is explicitly given to the system, has been investigated in various research works [Panniello et al., 2009], [Baltrunas et al., 2011], [Adomavicius et al., 2005]. However, there has been less focus on interactional systems where context should be inferred from users' interactions. This is partly due to the fact that it is more difficult to capture the contextual information when it is not known to the system at design time.

4 Interactive Recommendation

Interactive recommenders have been extensively used in a variety of application domains. These systems usually rely on an online learning algorithm that gradually learns users' preferences. At each step of the interaction, the system generates a list of recommendations and observes the user's feedback on the recommended items indicating the utility of the recommendations. The goal of such a system is to maximize the total utility obtained over the whole interaction session.

4.1 Multi-Arm Bandit Approach to Online Recommendation

Choosing the best strategy to select the recommendation list at each step can be formulated as a multi-armed bandit problem (MAB). Consider a gambler who can choose from a row of slot machines known as one-armed bandits. The gambler receives a reward from playing each of these arms. To maximize his or her benefit, the gambler should decide which slot machines to play, the order in which to play them, and the number of times each slot machine is played. A similar type of decision problem occurs in many real-world applications, including interactive recommender systems. For an interactive recommender, the set of arms corresponds to a collection of items that can be recommended at each step, and the rewards correspond to the user's feedback, such as ratings or a clickthrough on a recommended item.

An important consideration in a bandit algorithm is the trade-off between exploration and exploitation. Exploitation refers to the phase where the information gathered during the previous steps is used to choose the most profitable item. During exploration, the most profitable alternative may not necessarily be picked, but the algorithm may choose other items in order to acquire more information about the preferences of the user, which helps to gain more rewards in future interactions.

Several techniques have been proposed for solving the multi-armed bandit problem. Optimal solutions for multi-armed bandit problems are generally difficult to obtain computationally. However, various heuristics have been proposed to compute sub-optimal solutions effectively. The ε-greedy approach is one of the most famous and widely used techniques for solving bandit problems. At each round t, the algorithm selects the arm with the highest expected reward with probability 1 − ε, and selects a random arm with probability ε. The Upper Confidence Bounds (UCB) class of algorithms, on the other hand, attempts to guarantee an upper bound on the total regret (the difference of the total reward from that of an optimal strategy).
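As an aside (not part of the original paper), the ε-greedy rule just described can be sketched in a few lines of Python; the arm names and reward estimates below are made up purely for illustration:

    import random

    def epsilon_greedy_select(mean_rewards, epsilon=0.1):
        # With probability epsilon explore a random arm,
        # otherwise exploit the arm with the highest estimated mean reward.
        arms = list(mean_rewards)
        if random.random() < epsilon:
            return random.choice(arms)
        return max(arms, key=mean_rewards.get)

    # Hypothetical per-item mean reward estimates.
    estimates = {"item_a": 0.42, "item_b": 0.57, "item_c": 0.13}
    next_item = epsilon_greedy_select(estimates, epsilon=0.1)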

Thompson Sampling is another heuristic that plays each arm in proportion to its probability of being optimal. As we are using this heuristic in our system, we briefly describe it in more detail. In the next section, we discuss how this heuristic is used in our system to generate the recommendation list at each interaction step.

The reward (or utility) distribution for each item can depend on various factors, including characteristics of the item, general preferences of the user, and the current interests of the user. Let p(r|a, θ) represent the utility distribution for item a, where θ indicates the unknown parameter characterizing the distribution. Also, let E_r(r|a, θ) represent the expected reward for item a for the given θ.

At the first step of an interaction, the set of observations, denoted by D, is empty and p(θ|D) is initialized with a prior distribution for θ. As more rewards are observed at each step, Bayesian updating is performed to update the θ distribution based on the observed rewards.

Algorithm 1 describes the Thompson sampling procedure [Chapelle and Li, 2011], [Scott, 2010]. At each step, θ is drawn as a sample from p(θ|D). The expected reward for each of the items is then computed, the arm having the maximum expected reward is played, and the reward for that arm is observed. The selected arm and the observed reward are then added to the observation set.

Algorithm 1: Thompson Sampling

    D = ∅
    for t = 1 to T do
        Draw θ^t ∼ p(θ|D)
        Select a_t = argmax_a E_r(r|a, θ^t)
        Observe reward r_t
        D = D ∪ {(a_t, r_t)}
    end for

4.2 Generating the Recommendations

We adopt the Thompson sampling heuristic as the bandit strategy for generating a recommendation list at each step of interaction with a user. In this setting, θ, which characterizes the utility distribution for each item, represents a user's preference model. It is a k-dimensional random vector drawn from an unknown multivariate distribution. The user model is updated after each interaction.

Linearly parameterized bandits [Rusmevichientong and Tsitsiklis, 2010] are a special type of bandits where the expected reward of each item is a linear function of θ. It has been shown that Thompson sampling with linear rewards can achieve theoretical regret bounds that are close to the best lower bounds [Agrawal and Goyal, 2013]. In our approach we also assume that the rewards follow a linear Gaussian distribution.

Given d observations, let r ∈ R^d be the observed rewards for the recommended items and let F represent a d × k constant matrix containing the features of the recommended items. For example, given two observations (a_1, r_1) and (a_2, r_2), r = [r_1, r_2] has two elements r_1 and r_2, and F = [f_{a_1}, f_{a_2}], where f_{a_i} represents the k-dimensional feature vector for item a_i. The item features can be obtained in various ways. They can be extracted based on the available content data for the items, or they can be extracted from the users' feedback using techniques such as matrix factorization. Given a training data set represented as a matrix of users' preferences on items, we apply Principal Component Analysis to represent each item in a k-dimensional space (where k ≪ the number of users in the training data).

Assuming a linear Gaussian distribution, we have the following prior and likelihood distributions:

    p(θ) = N(θ; µ_θ, Σ_θ)    (1)
    p(r|θ) = N(r; Fθ, Σ_r)   (2)

Also, we assume Σ_θ and Σ_r are diagonal matrices with diagonal values of σ_θ and σ_r, respectively. Given a linear Gaussian system, the posterior p(θ|r) can be computed as follows (see [Murphy, 2012] for a proof):

    p(θ|r) = N(µ_{θ|r}, Σ_{θ|r})                       (3)
    Σ_{θ|r}^{-1} = Σ_θ^{-1} + F^T Σ_r^{-1} F            (4)
    µ_{θ|r} = Σ_{θ|r} [F^T Σ_r^{-1} r + Σ_θ^{-1} µ_θ]   (5)

At each step t, θ^t is drawn as a sample from the posterior distribution. The expected reward for each item x is then estimated as r_x = f_x · θ, where f_x represents the extracted features for item x. The items are then ranked based on the estimated expected reward values, and the item with the highest reward is recommended to the user.
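To make the update rule concrete, the sketch below (our illustration, not code from the paper; the function names and the use of NumPy are our own choices) implements the diagonal-prior posterior of equations 3-5 and one Thompson-sampling step from Section 4.2, assuming the k-dimensional item features have already been extracted, e.g. with PCA on the training rating matrix:

    import numpy as np

    def posterior(F, r, mu_theta, var_theta, var_r):
        # Linear Gaussian posterior p(theta | r), equations (3)-(5).
        # F: d x k features of the recommended items, r: d observed rewards.
        # var_theta / var_r are the diagonal entries of Sigma_theta / Sigma_r
        # (denoted sigma_theta and sigma_r in the text).
        k = F.shape[1]
        prior_prec = np.eye(k) / var_theta                 # Sigma_theta^{-1}
        post_prec = prior_prec + F.T @ F / var_r           # equation (4)
        post_cov = np.linalg.inv(post_prec)
        post_mean = post_cov @ (F.T @ r / var_r + prior_prec @ mu_theta)  # equation (5)
        return post_mean, post_cov

    def thompson_recommend(item_features, F, r, mu_theta, var_theta, var_r, rng):
        # Sample theta from the posterior and rank all items by f_x . theta.
        mean, cov = posterior(F, r, mu_theta, var_theta, var_r)
        theta = rng.multivariate_normal(mean, cov)
        return int(np.argmax(item_features @ theta))

    # Illustrative usage with k = 5 features and arbitrary data.
    rng = np.random.default_rng(0)
    items = rng.normal(size=(50, 5))              # candidate item feature vectors
    F, r = items[[3, 7]], np.array([0.9, 0.4])    # two past recommendations and rewards
    best_item = thompson_recommend(items, F, r, np.zeros(5), 0.5, 0.5, rng)

With µ_θ = 0 and σ_θ = σ_r = 0.5, the call above mirrors the parameter settings later reported in Table 1.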
4.3 Adaptation to Changes in Users' Interests

This section describes our approach for adapting the recommendations to changes in users' interests. Our method has two main steps: the first step is to detect preference changes, and the second step is to adapt the recommendations based on the detected changes.

Our system includes a change detection module that tracks users' interactions and detects any sudden and significant changes in the users' behaviors. At each step t of interaction with a given user u, the change detection module models the user's interactions in the interval I_t = (t − N, t] as well as the interactions in the interval I_{t−N} = (t − 2N, t − N], where N is a fixed pre-specified parameter representing the length of the interval. These two windows are modeled by computing the following posterior distributions:

    W_{I_t} = p(θ|I_t) = N(µ_t, Σ_t)                    (6)
    W_{I_{t−N}} = p(θ|I_{t−N}) = N(µ_{t−N}, Σ_{t−N})    (7)

The above two distributions can each be computed according to equation 3, where r and F are substituted with the rewards and the item features of the recommendations in the corresponding window.

The distance between W_{I_t} and W_{I_{t−N}} is used as a measure of dissimilarity between these two windows. Various measures such as the KL-divergence can be used to compute the distance between the two windows. In our experiments, we used the Mahalanobis distance, which is computed as follows:

    Σ = (Σ_t + Σ_{t−N}) / 2                             (8)
    distance = (µ_t − µ_{t−N})^T Σ^{−1} (µ_t − µ_{t−N})

If the user feedback behavior is significantly different between the two windows, then the models W_{I_t} and W_{I_{t−N}} will differ. Therefore, the change detection module in our system is based on the heuristic that sudden changes in the users' feedback behavior result in sudden changes of the distance measure computed in equation 8. Based on this heuristic, we compute the distance measure at each step t and exploit a change point analysis approach to detect sudden changes in a user's feedback behavior.

In our system, we utilize the technique described in [Wayne, 2000] for change point analysis. This procedure involves iteratively applying a combination of cumulative sum charts (CUSUM) and bootstrapping to detect the changes. For each of the detected changes, a confidence level is provided which represents the confidence of the predicted change point. Given a pre-specified threshold, only those change points that have a confidence level above that threshold are accepted. Also, in our adaptation of this approach, we have incorporated a look ahead (LA) parameter to improve the accuracy. Our change detection module accepts a change point only if there are at least LA observations after the predicted change point.
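The following sketch (again ours, not the authors' code) shows how the window comparison of equations 6-8 can be computed once the two window posteriors have been obtained with the update shown earlier; the CUSUM-and-bootstrapping procedure of [Wayne, 2000] is not reproduced here, so a plain distance threshold stands in for its confidence test:

    import numpy as np

    def window_distance(mu_t, cov_t, mu_prev, cov_prev):
        # Mahalanobis-style distance of equation (8) between the posteriors
        # fitted on the windows (t-N, t] and (t-2N, t-N].
        cov = (cov_t + cov_prev) / 2.0
        diff = mu_t - mu_prev
        return float(diff @ np.linalg.solve(cov, diff))

    def accept_change(distances, t, threshold, look_ahead=3, splitting_threshold=10):
        # Stand-in for the change point test: accept a change at step t only if
        # the distance is large, at least `look_ahead` observations follow t,
        # and the profile already holds at least `splitting_threshold` interactions.
        return (t >= splitting_threshold
                and len(distances) - t >= look_ahead
                and distances[t] > threshold)

When a change point is accepted, only the feedback observed after it is used in the subsequent posterior updates, as described next.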

In our system, we assume that a user's interactions after the change point reflect his or her most recent preferences, while the interactions before the change point reflect the user's preferences in a different context. The strategy for aggregating the old and new preferences depends on the specific application in which the recommender is being used. In some applications, a user's preferences in a different context may still contain some useful information about the general preferences of that user. In this case, a good strategy for the recommender would be to consider all the user's preferences while giving more weight to the feedback after the change point. In other applications, a user's preferences in different contexts are almost independent. For example, the user may follow different independent tasks at different contextual states. In this situation, a good strategy would be to provide recommendations based only on the current preferences of the user, or in other words, to discard the interactions before the change of interests. The recommender presented in this paper discards users' interactions before the last detected change point and produces the recommendations only based on the interactions in the current context.

Depending on the type of utility (binary, rating), the change point detection may falsely detect change points at earlier interactions where it starts to learn the users' preferences. To avoid this situation, we have incorporated another parameter in our model that prevents splitting the user profile if there are fewer interactions in the user's profile than the pre-specified splitting threshold.

In our future work, we plan to investigate adapting a bandit strategy that considers all of a user's preferences. This would include defining a utility function that, while giving more weight to the current preferences of a user, also considers the interactions before the change point.

5 Evaluations

Evaluation of change adaptation in recommenders can be a challenging task. There are few publicly available datasets that explicitly contain users' changes of preferences that could be used as the ground truth for evaluation. In our evaluations, we designed an experiment for simulating users' changes of behavior. Given a dataset containing users' ratings for items, we simulated sudden changes in a user's rating behavior by building test hybrid profiles that merge two random user profiles. Treating the two selected user profiles as the rating behavior of a single hybrid user, we evaluated our method in correctly predicting the change of preferences of the user and producing recommendations matching the user's current interests.

We used the Yahoo! Music ratings of musical artists version 1.0 dataset in our experiments. This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists. It contains over ten million ratings of musical artists given by Yahoo! Music users over the course of a one-month period sometime prior to March 2004. It contains ratings in the range of 0 to 100 given by 1,948,882 users to 98,213 artists.

The purpose of this experiment was two-fold: first, we evaluated the accuracy of our change detection approach in real-time detection of changes. Secondly, we examined the performance of our recommender compared with conventional recommendation methods that disregard user preference changes. The evaluation was performed in the framework of an interactive recommender that interacts with the user in a number of consecutive rounds. In each round, the recommender presents a recommendation and observes the user's feedback (utility) for the recommended item. The recommenders were evaluated based on the average utility gained over the session of interaction.

We followed a five-fold cross validation approach for evaluation. In each round, one of the folds was used for testing, one was used for tuning the parameters, and the remaining folds were used for training the model. To simulate the changes in a user's rating behavior, we randomly selected two users u1 and u2 from our test data and combined them to create a hybrid user uh. We treated each of these two users as the rating behavior of a single hybrid user in two different situations.

To simulate an interactive recommendation process, at each step of interaction the recommender used the ratings in the hybrid test user profile to recommend a new item that has not been presented to the user before. We randomly chose an item rating from the u1 profile and used it as the initial profile of the hybrid user. For the first T = 30 rounds of interaction, the observed utility for each recommended item x was assumed to be the rating assigned to x by the user u1. We assumed that the interest change occurs after T steps of interaction. Therefore, for the next T = 30 steps, the same process was repeated with the difference that the observed utility for a recommended item was set to the rating for that item in the profile of u2. For this evaluation, we filtered the test users so that each user had at least T ratings in the user's profile.

If the user has not rated a recommended item, one option would be to ignore the recommendation and discard it from the evaluation results. However, this would create a bias in the experiment. If the user rates none of the recommendations, then the recommender should be negatively scored (in comparison to a recommender that recommends items that are at least rated by the user). Setting the utility to zero for the non-rated items would also be too strict, as the user may not necessarily dislike the items that she hasn't rated yet. Another possibility would be to set the utility to the average rating of all the users for that item. This also creates a bias, as item popularities are not uniform. For example, if an algorithm suggests very rare items with high average ratings, it will gain a high utility score in the evaluation. A similar bias occurs if the average rating of the user is used as the utility for non-rated items. In our evaluation, if a recommended item had not been rated in the user's profile, the observed utility was set to the average rating of all items in the dataset.
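For concreteness, a simplified sketch of this simulation loop (our own illustration; `recommend_next` stands for whichever recommender is being evaluated, and the rating dictionaries are hypothetical):

    def simulate_hybrid_session(recommend_next, ratings_u1, ratings_u2, global_avg, T=30):
        # Interact for 2*T rounds with a hybrid user: utilities come from u1's
        # ratings for the first T rounds and from u2's ratings afterwards.
        # Items the active profile has not rated receive the dataset-wide average.
        presented, utilities = [], []
        for step in range(2 * T):
            profile = ratings_u1 if step < T else ratings_u2
            item = recommend_next(presented, utilities)   # must be a new item
            utility = profile.get(item, global_avg)
            presented.append(item)
            utilities.append(utility)
        return sum(utilities) / len(utilities)            # average session utility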

The parameters used in this experiment for each of the algorithms are specified in Table 1.

Table 1: Method parameters

    Method                                    | Parameters
    Bandit recommender                        | Number of features = 5, µ_θ = 0, σ_θ = 0.5, σ_r = 0.5
    Contextual Recommender (proposed method)  | Number of features = 5, µ_θ = 0, σ_θ = 0.5, σ_r = 0.5, change detection window size = 5, change detection confidence = 0.95, change detection look ahead = 3, splitting threshold = 10
    User-based kNN                            | Number of neighbors = 10

User-based kNN was used as one of our baselines for this experiment. The cosine similarity metric was used to compute the similarity of users. In each round of interaction, the kNN algorithm computed the item scores and recommended the new item with the highest score.

The optimal recommender was chosen as another baseline to compare our method with an ideal algorithm. This recommender has full knowledge of users' preferences and the preference changes of users. For the first T = 30 iterations, it followed a greedy strategy and recommended new items that had the highest rating in the u1 profile. After the change of interest at T = 30, in each round it recommended a new item that had the highest rating in the u2 profile. The standard bandit algorithm with the Thompson sampling heuristic (as described in Section 4.2) was used as another baseline in our experiment and was trained with the parameters specified in Table 1. In the remainder of the paper, we refer to this baseline as the standard bandit recommender.

Table 2 illustrates an example interaction session generated based on the standard bandit recommender for this experiment. The recommendations that are not rated in the user's profile are marked as '-'. For each recommended artist, a few of the most similar artists that have previously been presented to the user are also shown. The artist similarities are determined based on the Last.FM data. For any given artist, the Last.FM API can be used to find the most similar artists as well as the degree of similarity between the artists. In Table 2, the similarity levels are abbreviated as M (medium), H (high), VH (very high), and SH (super high). As can be seen, most of the first few (5) recommendations are not rated by u1. After iteration 6, the recommender starts generating better recommendations, as most of them have high ratings in the u1 profile. After the simulated change of interests occurs at iteration 31, the user (u2) has not rated most of the recommendations, showing the dramatic degradation of the recommender's performance. Note that many of the recommendations are artists similar to those rated highly by u1. For example, Foo Fighters, recommended at iteration 31, is highly similar to Nirvana, recommended at iteration 21, and Hoobastank, recommended at iteration 32, has super-high similarity to 3 Doors Down, which received a utility of 100 at iteration 27. This example shows that a traditional bandit approach cannot adapt to the changes in the users' preferences.

Figure 1: Experiment I - Average utility at each iteration
Figure 2: Experiment I - Average distance at each iteration

Figure 1 presents the average utility at each iteration computed over the set of test users. The optimal recommender achieves the highest utility of 96 at the first step and its utility decreases as more iterations pass. This can be explained by the fact that in the first T iterations items are recommended in decreasing order of their ratings in the u1 profile; therefore the average utility gradually decreases. After iteration 30, the optimal baseline recommends items based on their ratings in the u2 profile, and so a jump in utility can be observed at iteration 31. Both the standard bandit recommender and the contextual recommender (proposed method) need 5-6 iterations to learn users' preferences and then show a dramatic improvement in utility. As indicated in Table 1, the number of factors is set to 5 for these two methods. We repeated this experiment for different numbers of factors. The experiments revealed that by lowering the number of factors, it takes fewer iterations for the algorithm to reach its best performance. However, the highest achieved utility would be smaller as well.

Table 2: An example interaction session generated using the standard bandit recommender

    Step | Artist Name              | Utility | Similar artists
    1    | Chingy                   | -       |
    2    | Disturbed                | 90      |
    3    | Led Zeppelin             | -       |
    4    | Faith Hill               | -       |
    5    | Eminem                   | -       |
    6    | Linkin Park              | 100     | Disturbed(M), Eminem(M)
    7    | Staind                   | 100     | Disturbed(H)
    8    |                          | -       |
    9    |                          | 90      | Led Zeppelin(H), Disturbed(H)
    10   |                          | 90      | Good Charlotte(M), Linkin Park(L)
    11   | Red Hot Chili Peppers    | 90      | Metallica(M)
    12   | The Offspring            | -       | Metallica(VH), Green Day(VH)
    13   | Korn                     | -       | Disturbed(H), Metallica(H)
    14   | Evanescence              | 100     | Disturbed(H), Linkin Park(H)
    15   | Blink 182                | 90      | Good Charlotte(H), Green Day(H)
    16   | Limp Bizkit              | 50      | Korn(SH), Disturbed(H)
    17   | Nickelback               | 100     | Linkin Park(VH), Staind(VH)
    18   | Trapt                    | 90      | Staind(VH), Disturbed(M)
    19   | Simple Plan              | -       | Green Day(VH), Blink 182(H)
    20   |                          | 90      | Nickelback(VH), Linkin Park(VH)
    21   | Nirvana                  | 90      | Metallica(H)
    22   | Godsmack                 | 100     | Disturbed(SH), Staind(H)
    23   | P.O.D.                   | 100     | Papa Roach(VH), Staind(VH)
    24   | All-American Rejects     | 90      | Good Charlotte(SH), Blink 182(VH)
    25   | Puddle Of Mudd           | 100     | Staind(SH), Trapt(VH)
    26   | Sum 41                   | -       | Blink 182(SH), Green Day(SH)
    27   | 3 Doors Down             | 100     | Nickelback(SH), Trapt(SH)
    28   | Incubus                  | 100     | Staind(H)
    29   | A.F.I.                   | 90      | Good Charlotte(L)
    30   | Jimmy Eat World          | 90      | Blink 182(VH), A.F.I(H)
    31   | Foo Fighters             | -       | Nirvana(SH), Red Hot Chili Peppers(SH)
    32   | Hoobastank               | -       | (SH), 3 Doors Down(SH)
    33   | System Of A Down         | -       | (H), Disturbed(H)
    34   | No Doubt                 | 70      |
    35   | U2                       | -       | Red Hot Chili Peppers(L)
    36   | Alien Ant Farm           | -       | Incubus(SH), (SH)
    37   | Creed                    | -       | Staind(VH), 3 Doors Down(H)
    38   | Smile Empty Soul         | -       | Trapt(SH), Staind(VH)
    39   | Slipknot                 | -       | Korn(VH), System of a Down(VH)
    40   | Michelle Branch          | -       | -
    41   |                          | -       | Jimmy Eat World(VH)
    42   | The White Stripes        | -       | Nirvana(M)
    43   | Avril Lavigne            | -       |
    44   | Dashboard Confessional   | -       | Jimmy Eat World(SH)
    45   | Matchbox Twenty          | -       | -
    46   | New Found Glory          | -       | blink-182(SH)
    47   | Chevelle                 | -       | Staind(VH), Trapt(H)
    48   | Brand New                | -       | Jimmy Eat World(H)
    49   | Nine Inch Nails          | -       | -
    50   | Switchfoot               | -       |
    51   | Goo Goo Dolls            | -       | 3 Doors Down(H), Trapt(H)
    52   | Rob Zombie               | -       | Korn(M)
    53   | Audioslave               | -       | Foo Fighters(SH), Incubus(VH)
    54   | Sugarcult                | -       | Good Charlotte(VH)
    55   | Deftones                 | -       | Korn(VH), Limp Bizkit(H)
    56   | The Used                 | -       | Blink-182(H), Papa Roach(H)
    57   | Lifehouse                | -       | 3 Doors Down(SH)
    58   | Static-X                 | -       | Korn(VH), Disturbed(VH)
    59   | Fountains Of Wayne       | -       |
    60   | Taking Back Sunday       | -       | Jimmy Eat World(VH)

Figure 3: Experiment I - Average frequency of change points at each iteration

Figure 2 illustrates the average distance computed as in equation 8 in each round of interaction. As can be seen in the figure, the distance reaches its maximum value at iterations 33 to 35. As the window size for change detection is set to 5, it takes at least 3 rounds until W_{I_t} mostly contains the interactions after the change of interests while W_{I_{t−N}} only has the interactions before the change of interests; therefore, the distance metric becomes maximized.

Figure 3 presents the average number of change points detected at each iteration. As evident in the figure, the average frequency of change points reaches the maximum value of 0.5 at iteration 6, where W_{I_{t−N}} has one interaction and the W_{I_t} window contains 5 (window size) interactions. Setting the splitting threshold to 10 ensures that the user profile is not split for the change points detected before round 10. Also, at iteration 38 there is another local maximum, where the average frequency of change points reaches 0.4. This peak corresponds to the change of preferences in the hybrid user profile.

6 Conclusions

In this paper we have presented a novel approach for adapting to changes in users' preferences in interactive recommender systems. In an online recommendation scenario, at each step the system provides the user with a list of recommendations and receives feedback from the user on the recommended items. Various multi-armed bandit algorithms can be used as recommendation strategies to maximize the average utility over each user's session. Our approach extends the existing bandit algorithms by considering the sudden variations in the user's feedback behavior as a result of preference changes. Our system contains a change detection component that determines any significant changes in the users' preferences. If a change is detected, the bandit algorithm based on Thompson sampling prioritizes the feedback observed after the detected change point, reflecting the current preferences of the user. The proposed recommender discards users' interactions before the last detected change point and produces the recommendations based only on the interactions in the current context. In our future work, we plan to investigate various approaches to take into account the impact of users' interactions before the change point on the current user behavior.

References

[Abbasi-Yadkori et al., 2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, pages 2312–2320, 2011.

[Adomavicius et al., 2005] Gediminas Adomavicius, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin. Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. Inf. Syst., 23(1):103–145, January 2005.

[Adomavicius et al., 2011] Gediminas Adomavicius, Bamshad Mobasher, Francesco Ricci, and Alexander Tuzhilin. Context-aware recommender systems. AI Magazine, 32(3):67–80, 2011.

[Agrawal and Goyal, 2013] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In volume 28 of JMLR Proceedings, pages 127–135, 2013.

[Baltrunas et al., 2011] Linas Baltrunas, Bernd Ludwig, and Francesco Ricci. Matrix factorization techniques for context aware recommendation. In ACM RecSys, pages 301–304, 2011.

[Chapelle and Li, 2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.

[Deshpande and Montanari, 2013] Yash Deshpande and Andrea Montanari. Linear bandits in high dimension and recommendation systems. CoRR, abs/1301.1722, 2013.

[Li et al., 2010] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010.

[Murphy, 2012] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series). The MIT Press, 2012.

[Panniello et al., 2009] Umberto Panniello, Alexander Tuzhilin, Michele Gorgoglione, Cosimo Palmisano, and Anto Pedone. Experimental comparison of pre- vs. post-filtering approaches in context-aware recommender systems. In RecSys, pages 265–268. ACM, 2009.

[Rusmevichientong and Tsitsiklis, 2010] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395–411, 2010.

[Scott, 2010] Steven L. Scott. A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Ind., 26(6):639–658, November 2010.

[Wayne, 2000] Taylor Wayne. Change-point analysis: a powerful new tool for detecting changes. 2000.
