Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Adapting to User Preference Changes in Interactive Recommendation ∗

Negar Hariri, Bamshad Mobasher, Robin Burke
DePaul University, Chicago, IL
[email protected], [email protected], [email protected]

∗ This paper was invited for submission to the Best Papers From Sister Conferences Track, based on a paper that appeared in Eighth ACM Conference on Recommender Systems, RecSys '14.

1 Abstract

Recommender systems have become essential tools in many application areas as they help alleviate information overload by tailoring their recommendations to users' personal preferences. Users' interests in items, however, may change over time depending on their current situation. Without considering the current circumstances of a user, recommendations may match the general preferences of the user, but they may have small utility for the user in his/her current situation. We focus on designing systems that interact with the user over a number of iterations and at each step receive feedback from the user in the form of a reward or utility value for the recommended items. The goal of the system is to maximize the sum of obtained utilities over each interaction session. We use a multi-armed bandit strategy to model this online learning problem and we propose techniques for detecting changes in user preferences. The recommendations are then generated based on the most recent preferences of a user. Our evaluation results indicate that our method can improve the existing bandit algorithms by considering the sudden variations in the user's feedback behavior.

2 Introduction

Users' interests in items may change depending on their current situation. For example, users may have different preferences for music depending on their current mood or activity. Similarly, users' preferences for travel packages can depend on the current season or the weather conditions. In these and many other similar applications, recommendations are not useful for the user if they are generated based on previous selections and preferences of the user without considering his/her current situation.

In this paper, we focus on a common setting where change of interests of a user is not explicitly available and the system should infer this information based on user feedback data. In this case, the system tracks the user's set of interactions over time and detects any significant changes in the user's behavior. User interaction data can include various types of feedback given by the user to the recommended items such as ratings or clicking on the recommendations. Detecting the preference changes enables the recommender to distinguish between the new and old preferences of the user and to generate the recommendations that match the user's current interests.

We investigate this problem in the framework of an interactive recommender system that can dynamically adapt to changes in users' interests. At each step of the interaction, the system presents a list of recommendations and receives feedback indicating recommendation utilities. The main objective of the proposed recommendation approach is to identify the recommendation list at each step such that the average obtained utility is maximized over the interaction session with a user. Choosing the best recommendation list at each step requires a strategy in which the best trade-off between exploitation and exploration is achieved. During an exploitation phase, the recommender system generates the recommendation list that maximizes the immediate utility for the current step. During exploration, the system may recommend items that do not optimize the immediate utility, but reduce the uncertainty in modeling a user's preferences and eventually help maximize the utility over the whole session. This problem can be formulated as a multi-armed bandit problem (MAB) which has been studied in various systems and applications [Chapelle and Li, 2011], [Li et al., 2010]. For the application of personalized recommendation, most of the existing bandit algorithms assume that a user's interests in items are static over time despite the fact that change of preferences can greatly impact the utility or usefulness of the recommendations for a user. We address this problem by proposing a change detection algorithm that predicts sudden variations in a user's preferences based on his/her feedback on the recommendations. These changes are then considered by our bandit recommendation method when generating the recommendations for a user.

Our evaluation results show that by considering users' change of interests in a bandit strategy, the recommender can provide more interesting and useful recommendations to users.
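To make the interaction protocol above concrete, the following is a minimal sketch and not the system described in this paper: a session runs for a fixed number of steps, a policy proposes a list of items, the user's feedback is treated as the reward, and the quantity of interest is the cumulative utility obtained over the session. The `BanditPolicy` interface, `feedback_fn`, and all parameter values are assumptions made only for this illustration.

```python
# Illustrative sketch of one interactive recommendation session framed as a
# bandit loop. Names and defaults are placeholders, not the paper's code.
from typing import Callable, List, Protocol


class BanditPolicy(Protocol):
    def recommend(self, k: int) -> List[int]: ...
    def update(self, item: int, utility: float) -> None: ...


def run_session(policy: BanditPolicy,
                feedback_fn: Callable[[int], float],
                steps: int = 50,
                list_size: int = 3) -> float:
    """Interact for `steps` rounds; return the cumulative obtained utility."""
    total_utility = 0.0
    for _ in range(steps):
        recommendations = policy.recommend(list_size)  # exploration vs. exploitation
        for item in recommendations:
            utility = feedback_fn(item)                # e.g., a rating or a click (0/1)
            policy.update(item, utility)               # learn from the feedback
            total_utility += utility
    return total_utility
```

Any concrete selection strategy, such as the ε-greedy or Thompson Sampling rules discussed later in the background section, would supply the `recommend` and `update` methods.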
3 Related Work

Bandit algorithms have been widely used for the application of producing personalized recommendations [Chapelle and Li, 2011], [Li et al., 2010]. The news recommender introduced in [Li et al., 2010] adapts to the temporal changes of existing content. In the proposed system, both users and items are represented as sets of features. Features of a user can include demographic attributes as well as features extracted from the user's historical activities such as clicking or viewing items, while item features may contain descriptive information and categories. They formulate this problem as a bandit problem with linear payoffs and introduce an upper confidence bound bandit strategy which is shared among all the users. Similarly, the system in [Chapelle and Li, 2011] uses a bandit algorithm based on Thompson sampling for the task of news recommendation and display advertisement. Similar to [Li et al., 2010], several features are used to represent the user and the items, and the recommender learns a single bandit strategy for all users while representing each user as a fixed set of features. Our approach is different from [Li et al., 2010] and [Chapelle and Li, 2011] as we are learning a bandit strategy for each user. While these two methods focus on adapting to changes in item popularities, our system captures and adapts to the changes in users' interests. Representing each user as a fixed set of features, as in [Li et al., 2010] and [Chapelle and Li, 2011], would prevent us from detecting changes in a user's preferences and capturing his/her current interests. Similarly, other bandit strategies such as [Deshpande and Montanari, 2013] and [Abbasi-Yadkori et al., 2011] disregard the variations in user preferences and give all the interactions the same weight in defining the bandit strategy. Unlike these methods, our approach enables us to monitor changes in the users' behaviors and to produce recommendations that match the current interests of the user.

Tailoring the recommendations to match the current preferences of users follows the general idea of context-awareness [Adomavicius et al., 2011] in recommender systems. Integrating context in the recommendation model, when this information is explicitly given to the system, has been investigated in various works [Panniello et al., 2009], [Baltrunas et al., 2011], [Adomavicius et al., 2005]. However, there has been less focus on interactional systems where context should be inferred from users' interactions. This is partly due to the fact that it is more difficult to capture the contextual information when it is not known to the system at design time.

4 The Multi-Armed Bandit Problem

In the classical multi-armed bandit problem, a gambler faces a row of slot machines known as one-armed bandits. The gambler receives a reward from playing each of these arms. To maximize his or her benefit, the gambler should decide on which slot machines to play, the order of playing them, and the number of times that each slot machine is played. A similar type of decision problem occurs in many real-world applications including interactive recommender systems. For an interactive recommender, the set of arms corresponds to a collection of items that can be recommended at each step, and the rewards correspond to the user's feedback, such as ratings or a clickthrough on a recommended item.

An important consideration in a bandit algorithm is the trade-off between exploration and exploitation. Exploitation refers to the phase where the information gathered during the previous steps is used to choose the most profitable item. During exploration, the most profitable alternative may not necessarily be picked, but the algorithm may choose other items in order to acquire more information about the preferences of the user, which would help to gain more rewards in future interactions.

Several techniques have been proposed for solving the multi-armed bandit problem. Optimal solutions for multi-armed bandit problems are generally difficult to obtain computationally. However, various heuristics and techniques have been proposed to compute sub-optimal solutions effectively. The ε-greedy approach is one of the most famous and widely used techniques for solving bandit problems. At each round t, the algorithm selects the arm with the highest expected reward with probability 1 − ε, and selects a random arm with probability ε. The Upper Confidence Bounds (UCB) class of algorithms, on the other hand, attempts to guarantee an upper bound on the total regret (the difference of the total reward from that of an optimal strategy).

Thompson Sampling is another heuristic that plays each arm in proportion to its probability of being optimal. As we are using this heuristic in our system, we briefly describe it in more detail. In the next section, we will discuss how this heuristic is used in our system to generate the recommendation list at each interaction step.
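As a concrete illustration of the ε-greedy rule just described, here is a minimal sketch; it is not the paper's method, and the class name, the incremental mean update, and the default ε = 0.1 are assumptions made for this example. A UCB-style strategy could expose the same select/update interface.

```python
import random


class EpsilonGreedy:
    """Illustrative epsilon-greedy arm selection for a k-armed bandit."""

    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon       # probability of exploring
        self.counts = [0] * n_arms   # number of plays per arm
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self) -> int:
        # Explore a random arm with probability epsilon ...
        if random.random() < self.epsilon:
            return random.randrange(len(self.means))
        # ... otherwise exploit the arm with the highest estimated reward.
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm: int, reward: float) -> None:
        # Incremental update of the running mean for the played arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```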
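A minimal Beta-Bernoulli Thompson Sampling sketch for binary feedback such as clicks is shown below; it illustrates only the generic heuristic described above, not the paper's change-aware method, and the Beta(1, 1) priors and 0/1 rewards are assumptions made for this example.

```python
import random


class BernoulliThompsonSampling:
    """Plays each arm in proportion to its probability of being optimal."""

    def __init__(self, n_arms: int):
        # Beta(1, 1) prior on each arm's success (e.g., click) probability.
        self.alpha = [1.0] * n_arms
        self.beta = [1.0] * n_arms

    def select(self) -> int:
        # Draw one plausible success rate per arm from its posterior,
        # then play the arm with the largest sampled value.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda a: samples[a])

    def update(self, arm: int, reward: int) -> None:
        # Conjugate posterior update with a 0/1 reward.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

In the recommendation setting, `select` could be called repeatedly to assemble the recommendation list, and `update` applied to the feedback observed for each recommended item.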
The reward (or utility) distribution for each item can depend on various factors including characteristics of the item,