Sequential Learning of Product Recommendations with Customer Disengagement
Hamsa Bastani, Wharton School, Operations Information and Decisions, [email protected]
Pavithra Harsha, IBM Thomas J. Watson Research, [email protected]
Georgia Perakis, MIT Sloan School of Management, Operations Management, [email protected]
Divya Singhvi, MIT Operations Research Center, [email protected]

We consider the problem of sequential product recommendation when customer preferences are unknown. First, we present empirical evidence of customer disengagement using a sequence of ad campaigns from a major airline carrier. In particular, customers decide to stay on the platform based on the relevance of recommendations. We then formulate this problem as a linear bandit, with the notable difference that the customer's horizon length is a function of past recommendations. We prove that any algorithm in this setting incurs linear regret. Thus, no algorithm can keep all customers engaged; however, we can hope to keep a subset of customers engaged. Unfortunately, we find that classical bandit algorithms as well as greedy algorithms provably over-explore, thereby incurring linear regret for every customer. We propose modifying bandit learning strategies by constraining the action space upfront using an integer program. We prove that this simple modification allows our algorithm to achieve sublinear regret for a significant fraction of customers. Furthermore, numerical experiments on real movie recommendation data demonstrate that our algorithm can improve customer engagement with the platform by up to 80%.

Key words: bandits, online learning, recommendation systems, disengagement, cold start
History: This paper is under preparation.

1. Introduction

Personalized customer recommendations are a key ingredient of the success of platforms such as Netflix, Amazon, and Expedia. Product variety has exploded, catering to the heterogeneous tastes of customers. However, this variety has also increased search costs, making it difficult for customers to find products that interest them. Platforms add value by learning a customer's preferences over time and leveraging this information to match her with relevant products.

The personalized recommendation problem is typically formulated as an instance of collaborative filtering (Sarwar et al. 2001, Linden et al. 2003). In this setting, the platform observes different customers' past ratings or purchase decisions for random subsets of products. Collaborative filtering techniques use the feedback across all observed customer-product pairs to infer a low-dimensional model of customer preferences over products. This model is then used to make personalized recommendations over unseen products for any specific customer. While collaborative filtering has found industry-wide success (Breese et al. 1998, Herlocker et al. 2004), it is well known that it suffers from the "cold start" problem (Schein et al. 2002). In particular, when a new customer enters the platform, no data is available on her preferences over any products. Collaborative filtering can only make sensible personalized recommendations for the new customer after she has rated at least O(d log n) products, where d is the dimension of the low-dimensional model learned via collaborative filtering and n is the total number of products. Consequently, bandit approaches have been proposed in tandem with collaborative filtering (Bresler et al. 2014, Li et al. 2016, Gopalan et al. 2016) to tackle the cold start problem using a combination of exploration and exploitation.

The basic idea behind these algorithms is to offer random products to customers during an exploration phase, learn the customer's low-dimensional preference model, and then exploit this model to make good recommendations. A key assumption underlying this literature is that customers are patient, and will remain on the platform for the entire (possibly unknown) time horizon T regardless of the quality of the recommendations made thus far. However, this is a tenuous assumption, particularly when customers have strong outside options (e.g., a Netflix user may abandon the platform for Hulu after receiving a series of bad entertainment recommendations). We demonstrate this effect using customer panel data on a series of ad campaigns from a major commercial airline. Specifically, we find that a customer is far more likely to click on a suggested travel product in the current ad campaign if the previous ad campaign's recommendation was relevant to her. In other words, customers may disengage from the platform and ignore new recommendations entirely if past recommendations were irrelevant.

In light of this issue, we introduce a new formulation of the bandit product recommendation problem in which customers may disengage from the platform depending on the rewards of past recommendations, i.e., the customer's time horizon T on the platform is no longer fixed, but is a function of the platform's actions thus far.

Customer disengagement introduces a significant difficulty to the dynamic learning or bandit literature. We prove lower bounds showing that any algorithm in this setting incurs regret that scales linearly in T (the customer's time horizon on the platform if she is given good recommendations). This hardness result arises because no algorithm can satisfy every customer early on, when we have limited knowledge of their preferences; thus, no matter what policy we use, at least some customers will disengage from the platform. The best we can hope to accomplish is to keep a large fraction of customers engaged on the platform for the entire time horizon, and to match these customers with their preferred products.

However, classical bandit algorithms perform particularly badly in this setting: we prove that every customer disengages from the platform with probability one as T grows large. This is because bandit algorithms over-explore: they rely on an early exploration phase in which customers are offered random products that are likely to be irrelevant to them. Thus, it is highly probable that the customer receives several bad recommendations during exploration and disengages from the platform entirely. Moreover, this exploration continues over the entire time horizon T under the principle of optimism. This is not to say that learning through exploration is a bad strategy. We show that a greedy exploitation-only algorithm also underperforms, either over-exploring through natural exploration or under-exploring by getting stuck at suboptimal fixed points. Consequently, the platform misses out on its key value proposition of learning customer preferences and matching customers to their preferred products. Our results demonstrate that one must balance the exploration-exploitation tradeoff more carefully in the presence of customer disengagement.
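To see these dynamics concretely, the following stylized Python simulation may help. It is a minimal sketch under simplifying assumptions of our own (deterministic rewards, a fixed disengagement tolerance, customers drawn near a handful of known prototype types, and a heuristic max-min choice of the restricted product set), not the formal model or the Constrained Bandit algorithm analyzed below.

import numpy as np

rng = np.random.default_rng(0)

d, n_products, T = 5, 100, 200
# Unit-norm product feature vectors, as in a standard linear bandit.
products = rng.normal(size=(n_products, d))
products /= np.linalg.norm(products, axis=1, keepdims=True)

# Mainstream customer types: a few prototype preference vectors, assumed
# known from existing customers on the platform.
n_types = 4
prototypes = rng.normal(size=(n_types, d))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

def draw_customer():
    """Most customers lie near a mainstream prototype; a few are tail types."""
    if rng.random() < 0.9:
        theta = prototypes[rng.integers(n_types)] + 0.1 * rng.normal(size=d)
    else:
        theta = rng.normal(size=d)
    return theta / np.linalg.norm(theta)

def fraction_engaged(policy, n_customers=500, tolerance=-0.3):
    """Fraction of customers who stay engaged for all T rounds under a policy."""
    engaged = 0
    for _ in range(n_customers):
        theta = draw_customer()
        alive = True
        for t in range(T):
            reward = products[policy(t)] @ theta
            if reward < tolerance:   # an irrelevant recommendation: the
                alive = False        # customer disengages and ignores all
                break                # future recommendations
        engaged += alive
    return engaged / n_customers

# Unconstrained random exploration, mimicking the early exploration phase
# of a classical bandit algorithm.
explore = lambda t: rng.integers(n_products)

# Exploration restricted to a small subset of products with the best
# worst-case appeal across the mainstream prototypes (a crude stand-in for
# the integer program discussed below).
safe = np.argsort((products @ prototypes.T).min(axis=1))[-10:]
constrained = lambda t: rng.choice(safe)

print("engaged, unconstrained exploration:", fraction_engaged(explore))
print("engaged, constrained exploration:  ", fraction_engaged(constrained))

Under these illustrative parameters, unconstrained random exploration loses essentially every customer before the horizon ends, while restricting exploration to the heuristically chosen subset typically retains a substantially larger fraction of mainstream customers; this gap is what motivates constraining the action space upfront.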
We propose a simple modification of classical bandit algorithms: we constrain the space of possible product recommendations upfront. We leverage the rich information available from existing customers on the platform to identify a diverse subset of products that are palatable to a large segment of potential customer types; all recommendations made by the platform for new customers are then constrained to lie in this set. This approach guarantees that mainstream customers remain on the platform with high probability, and that they are matched to their preferred products over time. We compromise on tail customers, but these customers are unlikely to show up on the platform, and catering recommendations to them endangers the engagement of mainstream customers. We formulate the initial optimization of the product offering as an integer program (one illustrative formulation is sketched at the end of this section). We then prove that our proposed algorithm achieves sublinear regret in T for a large fraction of customers, i.e., it succeeds in keeping a large fraction of customers on the platform for the entire time horizon and matches them with their preferred products. Numerical experiments on synthetic and real data demonstrate that our approach significantly improves both regret and the length of time a customer remains engaged with the platform, compared to both classical bandit and greedy algorithms.

1.1. Main Contributions

We highlight our main contributions below:

1. Empirical evidence of disengagement: We first present empirical evidence of customer disengagement using a sequence of ad campaigns from a major airline carrier. Our results strongly suggest that customers decide to stay on the platform based on the quality of its recommendations.

2. Disengagement model: A linear bandit is the classical formulation for learning product recommendations for new customers. Motivated by our empirical results on customer disengagement, we propose a novel formulation in which the customer's horizon length is endogenously determined by past recommendations, i.e., the customer may exit if given poor recommendations.

3. Hardness & classical approaches: We show that no algorithm can achieve sublinear regret in this setting, i.e., customer disengagement introduces substantial difficulty to the dynamic learning problem. Even worse, we show that classical bandit and greedy algorithms over-explore and fail to keep any customer engaged on the platform.

4. Algorithm: We propose the Constrained Bandit algorithm, which modifies standard bandit strategies by constraining the product set
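As promised above, the product-set selection step can be read as a covering-style integer program. The formulation below is our illustrative reconstruction from the surrounding description, not necessarily the paper's exact program: x_j indicates whether product j enters the constrained set (with a budget of K products), p_k is the probability mass of customer type k, and a_jk = 1 if product j is palatable to type k.

% Illustrative reconstruction; the paper's exact formulation may differ.
\begin{align*}
\max_{x,\,y} \quad & \sum_{k} p_k \, y_k && \text{(mass of customer types kept engaged)} \\
\text{s.t.} \quad & y_k \le \sum_{j} a_{jk} \, x_j \quad \forall k && \text{(type $k$ counts only if some palatable product is offered)} \\
& \sum_{j} x_j \le K && \text{(at most $K$ products in the constrained set)} \\
& x_j,\, y_k \in \{0,1\} \quad \forall j, k.
\end{align*}

Maximizing coverage of high-probability (mainstream) types while sacrificing low-probability (tail) types reflects exactly the compromise described above: a small, diverse constrained set keeps exploration from straying into products that would drive mainstream customers off the platform.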