
Concurrent Reinforcement Learning from Customer Interactions

David Silver ([email protected])
Department of Computer Science, CSML, University College London, London WC1E 6BT

Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall ([email protected])
Causata Ltd., 33 Glasshouse Street, London W1B 5DG

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

In this paper, we explore applications in which a company interacts concurrently with many customers. The company has an objective function, such as maximising revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer. A key aspect of this setting is that interactions with different customers occur in parallel. As a result, it is imperative to learn online from partial interaction sequences, so that information acquired from one customer is efficiently assimilated and applied in subsequent interactions with other customers. We present the first framework for concurrent reinforcement learning, using a variant of temporal-difference learning to learn efficiently from partial interaction sequences. We evaluate our algorithms in two large-scale test-beds for online and email interaction respectively, generated from a database of 300,000 customer records.

1. Introduction

In many commercial applications, a company or organisation interacts concurrently with many customers. For example, a supermarket might offer customers discounts or promotions at point-of-sale; an online store might serve targeted content to its customers; or a bank might email appropriate customers with loan or mortgage offers. In each case, the company seeks to maximise an objective function, such as revenue, customer satisfaction, or customer loyalty. This objective can be represented as the discounted sum of a reward function. A stream of interactions occurs between the company and each customer, including actions from the company (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website).

Typically, thousands or millions of these interaction streams occur with different customers in parallel. Our goal is to maximise the future rewards for each customer, given their history of interactions with the company. This setting differs from traditional reinforcement learning paradigms, due to the concurrent nature of the customer interactions. This distinction leads to new considerations for reinforcement learning algorithms. In particular, when large numbers of interactions occur simultaneously, it is imperative to learn both online and to bootstrap, so that feedback from one customer can be assimilated and applied immediately to other customers.

The majority of prior work in customer analytics and customer relationship management collects data after-the-fact (i.e. once interaction sequences have been completed), and analyses the data offline (for example, Tsiptsis and Chorianopoulos, 2010). However, learning offline or from complete interaction sequences has a fundamental inefficiency: it is not possible to perform any learning until interaction sequences have terminated. This is particularly significant in situations with high concurrency and delayed feedback, for example during an email campaign. In these situations it is imperative to learn online from partial interaction sequences, so that information acquired from one customer is efficiently assimilated and applied in subsequent interactions with other customers. In these cases the final outcome of the sequence is unknown, and therefore it is necessary to bootstrap from a prediction of the final outcome.

It is often also important to learn from the absence of customer interaction. For example, customer attrition (where a customer leaves the company due to an undesirable sequence of interactions with the company) is often only apparent after months of inactivity by that customer. Waiting until a time-out event occurs is again inefficient, because learning only occurs after the time-out event is triggered. However, the lack of interaction by the customer provides accumulating evidence that the customer is likely to attrite. Bootstrapping can again be used to address this problem; by learning online from predicted attrition, the company can avoid repeating the undesirable sequence of interactions with other customers.

In addition to concurrency, customer-based reinforcement learning raises many challenges. Observations may be received, and decisions requested, asynchronously at different times for each customer. The learning algorithm must scale to very large quantities of data, whilst supporting rapid response times (for example, < 10 ms for online website targeting). In addition, each customer is only partially observed via a sparse series of interactions; as a result it can be very challenging to predict subsequent customer behaviour. Finally, there is usually a significant delay between an action being chosen and the effects of that action occurring. For example, offering free shipping to a customer may result in several purchases over the following few days. More sophisticated sequential interactions may also occur, for example where customers are channelled through a "sales funnel" by a sequence of progressively encouraging interactions.

In this paper we formalise the customer interaction problem as concurrent reinforcement learning. This formalism allows for interactions to occur asynchronously, by incorporating null actions and null observations to represent the absence of interaction. We then develop a concurrent variant of temporal-difference (TD) learning which bootstraps online from partial interaction sequences. To increase computational efficiency, we allow decisions to be taken at any time, using an instance of the options framework (Sutton et al., 1999); and we allow updates to be performed at any time, using multi-step TD learning (Sutton and Barto, 1998). We demonstrate the performance of concurrent TD on two large-scale test-beds for online and email interaction respectively, generated from real data about 300,000 customers.

2. Prior Work

Reinforcement learning has previously been applied to sequential marketing problems (Pednault et al., 2002; Abe et al., 2002), cross-channel marketing (Abe et al., 2004), and market discount selection (Gomez-Perez et al., 2008). This prior work has found that model-free methods tend to outperform model-based methods (Abe et al., 2002); it has applied model-free methods such as batch Sarsa (Abe et al., 2002) and batch Monte-Carlo learning (Gomez-Perez et al., 2008), using regression trees (Pednault et al., 2002) and neural networks (Gomez-Perez et al., 2008) to approximate the value function. However, this prior work ignored the concurrent aspect of the problem setting: learning was applied either in batch (incurring opportunity loss by not learning online), and/or by Monte-Carlo learning (incurring opportunity loss by waiting until episodes complete before learning). Our approach to concurrent reinforcement learning both learns online and bootstraps using TD learning, avoiding both forms of opportunity loss. In Section 5 we provide empirical evidence that both components are necessary to learn efficiently in concurrent environments.

Much recent research has focused on a special case of the customer-based reinforcement learning framework, using contextual bandits. In this setting, actions lead directly to immediate rewards, such as the click-through on an advert (Graepel et al., 2010) or news story (Li et al., 2010). A key assumption of this setting is that the company's actions do not affect the customer's future interactions with the company. However, in many cases this assumption is false. For example, advertising too aggressively to a customer in the short term may irritate or desensitise the customer and make them less likely to respond to subsequent interactions in the long term. We focus specifically on applications where the sequential nature of the problem is significant and contextual bandits are therefore a poor model of customer behaviour.

To avoid any ambiguity, we note that there has been significant prior work on distributed (sometimes also referred to as parallel) reinforcement learning. This body of work has focused on how a serial (i.e. continuing or episodic) reinforcement learning environment can be efficiently solved by distributing the algorithm over multiple processors. Perhaps the best known approach is distributed dynamic programming (Bertsekas, 1982; Archibald, 1992), in which Bellman backups can be applied to different states, in parallel and asynchronously; the value function is then communicated between all processors. More recently, a distributed TD learning algorithm was developed (Grounds and Kudenko, 2007). Again, this focused on efficient distributed computation of the solution, in this case applying TD backups in parallel to different states, and then communicating the value function between processors. Other work has investigated multi-agent reinforcement learning, where multiple agents interact together within a single environment instance (Littman, 1994). Our focus is very different to these approaches: we consider a single-agent reinforcement learning problem that is fundamentally concurrent (because the agent is interacting with many instances of the environment), rather than a distributed solution to a serial problem (where the agent interacts with a single instance of the environment at any given time).

We note that algorithms for concurrent reinforcement learning may themselves utilise distributed processing; however, that is not the focus of this paper and is not discussed further.

3. Concurrent Reinforcement Learning

Reinforcement learning optimises long-term reward over the sequential interactions between an agent and its environment. Environments are typically divided into two categories: episodic and continuing environments (Sutton and Barto, 1998). In an episodic environment, the interaction sequence eventually terminates, at which point the environment resets and a new interaction sequence begins; whereas in continuing environments, there is a single interaction sequence that never terminates. In both cases, interactions occur serially: the agent receives an observation, takes an action, and receives a reward.

We introduce a third category of reinforcement learning environments. In a concurrent environment, the agent interacts in parallel with many instances of the environment. In our motivating application, the agent is a company and the environment instances are customers. At any given time, the company may be involved in interaction sequences with many different customers; furthermore, new customers may arrive, or old customers may leave, at any time. This is quite different from episodic reinforcement learning, in which one customer must complete its sequence before a new customer arrives; or from continuing reinforcement learning, which considers a single customer forever. The distinct nature of concurrent environments leads to new challenges and considerations for reinforcement learning algorithms.

One challenge that arises naturally in concurrent environments is that actions and observations may occur asynchronously: actions may be executed, and observations or rewards received, at different times for each customer. We address this asynchronicity by representing customer interaction sequences at a fine time-scale, and introducing null actions and observations to represent the absence of interaction with a given customer. Specifically, we define a^i_t to be the decision (or decisions) taken by the company with respect to customer i at time t. If the company does not take any action at time t, then that action is defined to be null, a^i_t = a_∅. Similarly, we define o^i_t to be the observation of customer i at time t (which might include actions by the customer, or any new information acquired about the customer); if we do not observe anything for that customer, then that observation is defined to be null, o^i_t = o_∅. Finally, we define r^i_t to be the reward for customer i at time t; in the absence of any response, the reward is defined to be zero, r^i_t = 0. Real (non-null) observations and real actions may occur at different time-steps, with many intervening null interactions and zero rewards. Each time-step has a duration of d; this is typically short and real interactions with each customer are typically sparse.[1] We assume that there are a total of N customers; if customers arrive in or exit the system, then their periods of inactivity are represented by null actions.

[1] The continuous-time case is derived from the limit d → 0; for simplicity we treat time as discrete.

A second challenge is partial observability. Reinforcement learning often focuses on the fully observable setting where observations o^i_t satisfy the Markov property, P(o^i_{t+1} | o^i_t, a^i_t) = P(o^i_{t+1} | o^i_1, a^i_1, ..., o^i_t, a^i_t); in other words, the agent is provided with full knowledge of the state of the environment. However, when interacting with a customer, the environmental state includes the latent "mental" state of the customer. Rather than attempting to model beliefs over this unwieldy environmental state – the approach taken by POMDPs (Kaelbling et al., 1995) – we represent the customer's state directly by the history of their interactions. This is, by definition, a Markov state representation: the interaction histories form an infinite, tree-structured state space with the empty history at the root, and histories of length t at depth t.

We make a simplifying assumption that customer behaviour is identically distributed and conditionally independent given their personal interaction history. In other words, customers are fully described by the observable facts about that customer: interactions with one customer do not affect the behaviour of other customers (for example via communication between customers); and unobservable variations between customers (for example due to personality or mood) are modelled as noise.[2]

[2] Note however that fine distinctions between the behaviour of individual customers may still be modelled, for example by including customer attributes such as gender or location, as an initial observation.

Informally, our goal is to maximise future rewards for a customer given the interactions with that customer. We therefore focus on predicting future rewards directly from interaction histories, rather than marginalising over latent variables that model human decision-making behaviour.

Formally, we define an interaction history to be a sequence h_t = {a_1, o_1, r_1, ..., a_t, o_t, r_t}, containing actions a_k ∈ A ∪ {a_∅}, observations o_k ∈ O ∪ {o_∅}, and rewards r_k ∈ ℝ. There are a total of N interaction histories, h_t = [h^1_t; ...; h^N_t], occurring concurrently (the superscript i is henceforth used to denote the customer, which will be suppressed when discussing a single customer; and bold font to denote the vector of all customers). The observations and rewards are assumed to be i.i.d. given their interaction history, P(o_{t+1}, r_{t+1} | h_t, a_t) = ∏_{i=1}^N P(o^i_{t+1}, r^i_{t+1} | h^i_t, a^i_t).

We define a policy π(a_{t+1} | h_t) = P(a_{t+1} | h_t) to be a single strategy that can be applied to any customer; the next action is selected according to the interaction history for that customer only. We define a concurrent algorithm π(a_{t+1} | h_t) = P(a_{t+1} | h_t), over the vector of all histories, to be a strategy that depends on the interaction histories of all customers. In particular, a concurrent algorithm can apply experience gained from interactions with one customer so as to improve the policy for another customer.

The return is the discounted sum of rewards for a single customer, R_{t:∞} = ∑_{k=0}^∞ γ^k r_{t+k+1}, where γ ∈ [0, 1] is the discount factor. This represents, for example, the total revenue from a customer. The action-value function Q^π is the expected return for a customer, given their interaction history followed by action a, Q^π(h_t a) = E(R_{t:∞} | h_t a). The optimal value function is the maximum action-value function achievable by any policy, Q*(h_t a) = max_π Q^π(h_t a); our goal is to approximate the optimal policy π*(a_{t+1} | h_t) that achieves this maximum, i.e. the best policy given the accumulated information about that customer.

There are an infinite number of possible interaction histories, and therefore it is not feasible to represent the value function for all distinct histories. We utilise linear value function approximation, Q^π(h_t a) ≈ Q_θ(h_t a) = φ(h_t a)^⊤ θ, to estimate the value function for the current policy π, where φ(h_t a) is an n × 1 feature vector and θ is an n × 1 parameter vector. The feature vector φ(h_t a) summarises the relevant information about both the interaction history h_t and the proposed action a into a compact representation.

4. Concurrent TD Learning

In concurrent reinforcement learning, the importance of bootstrapping is accentuated and it is crucial to use TD learning. However, naive application of TD(λ) is computationally inefficient and does not exploit the asynchronous structure of concurrent reinforcement learning. In this section we develop a simple variant of TD learning that is practical for large-scale concurrent reinforcement learning environments.

If the time interval d is very small, and the number of customers N is large, it is computationally inefficient to update all customers at all time-steps. Typically, little changes between time-steps in which only null actions and null observations are observed. Learning at microscopic time-scales may also be data inefficient: for example, when using TD(0), backups must propagate over many time-steps, and the algorithm's behaviour depends on the choice of time interval d.

Figure 1. Concurrent TD applied to two example customers. Only real (non-null) actions and observations are shown. Curved red lines indicate concurrent TD backups; curved blue lines indicate concurrent TD options.

In our approach, the system only makes decisions at specific time-steps where a decision has been requested, and "executes" null actions (i.e. does nothing) at all other time-steps, without any further computation. Decisions may be requested for a variety of reasons: for online content serving, a real customer response will typically trigger a decision request; for email serving, a timer event might trigger a decision request (note that the decision may be a null action, i.e. choosing to do nothing). In addition, it is not necessary to perform learning updates at all time-steps, but only at those time-steps where an update has been requested. Again, updates might be triggered by real customer responses, but also by timer events. In general, our approach is fully asynchronous and there is no requirement that decisions and updates occur at the same time-step.

To implement asynchronous decision requests, we define a set of options {ω(a) | a ∈ A}, corresponding to actions a ∈ A. Each option ω(a) performs the corresponding real action a exactly once, then performs null actions at all subsequent time-steps until the next decision request or a time-out event. We seek the optimal policy over these options, taking account of the randomness in the decision requests; i.e. the best we can do under the constraint that non-null actions can only be taken on time-steps when decisions are requested.[3]

[3] We note that the history h_t a uniquely identifies the event of starting an option ω(a) at time-step t, and so the action-value function Q(h_t a) also defines the value of each corresponding option ω(a). Unlike options in general, which require a semi-Markov representation (Sutton et al., 1999), these simple options do not violate the Markov property, since the event of starting the macro-action ω(a) at time-step t is represented in the interaction history h_t a. As a result, whenever a decision request is received, an option is selected simply by maximising (with some exploration) over the action-value function, a_t = argmax_{a ∈ A} Q(h_t a).
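As a concrete illustration of this representation, the following is a minimal Python sketch, not the implementation used in the paper, of per-customer interaction histories with explicit null actions and a linear action-value estimate Q_θ(h_t a) = φ(h_t a)^⊤ θ; the feature map phi below is only a stand-in for the binned customer and time-based features described in Section 5.

    import numpy as np

    A_NULL = None                  # null action a_∅: the company does nothing this step
    O_NULL = None                  # null observation o_∅: no customer response this step
    REAL_ACTIONS = list(range(7))  # e.g. seven real actions, as in the experiments

    class InteractionHistory:
        """Per-customer history h_t = {a_1, o_1, r_1, ..., a_t, o_t, r_t}."""
        def __init__(self):
            self.steps = []  # list of (action, observation, reward) tuples

        def append(self, action=A_NULL, observation=O_NULL, reward=0.0):
            # Absence of interaction is recorded explicitly as a null action,
            # a null observation and zero reward.
            self.steps.append((action, observation, reward))

    def phi(history, action, n_features=2916):
        """Placeholder feature map phi(h_t a): summarise the history and the
        proposed action as an n x 1 feature vector (arbitrary stand-in features,
        deterministic in the pair (history length, action))."""
        seed = (len(history.steps) * 31 + (action if action is not None else 7)) % (2 ** 32)
        return np.random.default_rng(seed).random(n_features)

    def q_value(theta, history, action):
        """Linear value estimate Q_theta(h_t a) = phi(h_t a)^T theta."""
        return float(phi(history, action) @ theta)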

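Continuing that sketch, asynchronous decision requests might be handled as follows: when a decision is requested, an option ω(a) is started by ε-greedy maximisation of Q, and at all other time-steps the null action is returned at no computational cost. Here q_value, REAL_ACTIONS and A_NULL are the placeholders defined in the sketch above.

    import random

    EPSILON = 0.1  # exploration rate used by all algorithms in the experiments

    def select_action(theta, history, decision_requested):
        """Start an option omega(a) only when a decision has been requested;
        otherwise do nothing (null action)."""
        if not decision_requested:
            return A_NULL
        if random.random() < EPSILON:
            return random.choice(REAL_ACTIONS)
        # Greedy choice: a_t = argmax_a Q(h_t a). The candidate set could also
        # include the null action, e.g. for email campaigns where choosing to
        # send nothing is itself a decision.
        return max(REAL_ACTIONS, key=lambda a: q_value(theta, history, a))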
To implement asynchronous update requests, we utilise multi-step TD learning (Sutton and Barto, 1998) to evaluate the current option-selection policy. K-step TD updates provide a consistent learning algorithm for any K; mixing over different K leads to the well-known TD(λ) algorithm (Sutton, 1988). In our algorithm, we apply TD updates between successive update requests at time-steps t and t′. In particular, these time-steps need not correspond to decision requests, since the value of taking an option may legitimately be evaluated by intra-option value learning at any time-step during its execution (Sutton et al., 1999). Specifically, we perform a (t′ − t)-step Sarsa update. This updates the action-value function Q(h_t a_{t+1}) towards the subsequent action-value function Q(h_{t′} a_{t′+1}), discounted t′ − t times, plus the discounted reward accumulated along the way, R_{t:t′} = ∑_{k=0}^{t′−t} γ^k r_{t+k+1}:

    Δθ = α ( R_{t:t′} + γ^{t′−t} Q(h_{t′} a_{t′+1}) − Q(h_t a_{t+1}) ) φ(h_t a_{t+1})        (1)

The concurrent temporal-difference learning algorithm combines the options framework for asynchronous decision requests with multi-step TD updates for asynchronous update requests (Figure 1). The implementation of this algorithm is straightforward (Algorithm 1): at every decision request, an action is selected by ε-greedy maximisation of the action-value function Q(h_t a); and at every update request, the action-value function is updated according to Equation 1.

Algorithm 1 Concurrent TD
  Initialise θ
  Initialise u^i ← 0, R^i ← 0, ∀i ∈ [1, N]
  for each time-step t do
    for each customer i do
      if decision requested for customer i then
        a^i_{t+1} ← ε-greedy(Q)
      else
        a^i_{t+1} ← a_∅
      end if
      if update requested for customer i then
        δ ← R^i + γ^{t−u^i} Q(h^i_t a^i_{t+1}) − Q(h^i_{u^i} a^i_{u^i+1})
        θ ← θ + α δ φ(h^i_{u^i} a^i_{u^i+1})
        u^i ← t, R^i ← 0
      end if
      if reward observed for customer i then
        R^i ← R^i + γ^{t−u^i} r^i_{t+1}
      end if
    end for
  end for

As discussed earlier, it is often desirable to learn from the absence of customer feedback. To allow for this possibility, we include updates for time-steps at which no real observation occurred (for example, the final update for each customer in Figure 1). These updates can be scheduled so as to minimise computational expense, while ensuring that learning takes place for any significant change in customer value. In the following experiments these updates are scheduled at exponentially increasing time intervals after the last real observation, eventually resulting in a time-out and a final update upon termination of the interaction sequence.

5. Experiments

To evaluate our concurrent reinforcement learning algorithms, we used a commercial simulator for internet targeting and email campaign scenarios.[4] The simulator was based on a nearest neighbour model, constructed from a database containing 300,000 anonymised customer records from an online bank. Our experiments were designed to evaluate two questions. First, how important is it to learn online, and to bootstrap, when operating at different levels of concurrency (i.e. when interacting with different numbers of customers in parallel)? Second, in a concurrent setting, how important is it to learn from delayed rewards (i.e. the full reinforcement learning problem), compared to learning from immediate rewards (i.e. the popular contextual bandit setting)? To answer these questions, we compared TD learning (which learns online from partial interaction sequences) to Monte-Carlo learning (which learns after-the-fact from complete interaction sequences) and to a contextual bandit algorithm (which learns online from immediate rewards). The simulator included eight scenarios for internet targeting and email campaigns, which varied considerably in their "sequentiality" (i.e. whether actions have long-term consequences). We investigated how the relative performance of the three algorithms changed across different levels of sequentiality. We now give details for both the simulator and the algorithms.

[4] Simulator developer anonymised for blind review.

Each customer was represented by 30 variables such as session count, page view counters, and maximum credit. In addition, time-based variables have a special importance, due to the time-dependence of many customer interactions. For example, the probability of a positive customer response is typically dependent on the recency and frequency with which the company has been interacting with that customer: too little interaction and the customer is unlikely to respond; too much interaction and the customer is likely to be desensitised or annoyed; similarly, the current enthusiasm of the customer is often captured by the recency and frequency with which the customer has been interacting with the company. To represent these factors, we also included two time-based variables that depend directly on the recent interaction history. One of these variables, τ_o, measured the time since the last real observation; the second variable, τ_a, measured the time since the last real action.

Each simulated customer was initialised with a real behavioural history taken from the customer database. Subsequently, the reward and transition dynamics were generated from a commercial simulator. The simulator included four internet targeting and four email campaign scenarios. In all scenarios, the agent was able to select amongst 7 real actions plus the null (do nothing) action a_∅. Customers were limited to a binary response at each time-step (i.e. they either respond or fail to respond). The reward was r = 1 for a positive response and r = 0 for no response. The probability of response is a Bernoulli random variable with a mean that depends on all 30 customer variables, based on a nearest-neighbour representation, multiplied by a function of the two time-based variables. Customers returned after a mean time interval of 13 time-steps, but with a probability of attrition of 0.05. Once a customer attrites, she never returns (this is therefore equivalent to a discount factor of γ = 0.996). Action decisions were requested for each customer: once initially, and subsequently once each time they return.

In high-dimensional problems based on real customer data, it is common for some variables to be more significant than others; for some variables to be collinear; and for other variables to be predominantly noisy. This effect was modelled in the simulator by weighting the variables appropriately in the nearest neighbour response model. We combined two techniques to deal with the high-dimensional nature of the input.

First, each of the 30 customer variables and 2 time-based variables was discretised into twelve bins with dynamically adapted bin boundaries, using a variant of online k-means in each dimension. The feature vector φ(h_t a) contained one feature for each combination of bin and action (for customer variables), and one feature for each combination of bin and real or null action (for time-based variables), to give a total of n = 2916 features, each of which captures an aspect of the relationship between one variable and one action.

Second, we automatically adapted the step-size for each feature, by stochastic gradient meta-descent (Sutton, 1992; Schraudolph, 1999). The key idea of this algorithm is to adapt a gain vector of step-sizes, by gradient descent, so as to minimise the mean-squared prediction error. Using this algorithm, the step-size of noise variables is gradually reduced, allowing more credit to be assigned to the significant variables.

We compared three reinforcement learning algorithms, all of which used the adaptive discretisation and adaptive step-size algorithm described above. The computation required by each of these algorithms is minimal, and they are therefore suitable for large-scale implementation with rapid response times. In each algorithm we used a naive exploration policy, based on ε-greedy with ε = 0.1. We note that, although more sophisticated exploration strategies have been considered in the literature (especially in the contextual bandit setting), these do not always perform significantly better in real applications, and often perform less robustly, than a naive ε-greedy algorithm (Li et al., 2010).

Monte-Carlo. This algorithm learns after-the-fact from complete interaction sequences. A time-out event is generated if a customer i does not respond for 100 time-steps. No updates are performed until the time-out event occurs, at which point the value function is retrospectively updated towards the observed return: for each time-step at which a real action was selected, {t | a^i_t ≠ a_∅}, the parameter vectors are updated by Δθ = α (R^i_{t:∞} − Q(h^i_t a^i_t)) φ(h^i_t a^i_t).

Contextual bandit. This was a simple instance of a contextual bandit algorithm, under the simplifying assumption that actions do not affect the customer's state. This was implemented in the same way as the Monte-Carlo algorithm, but a time-out was generated before the next real action.

Concurrent TD. A (t′ − t)-step TD update was applied between successive time-steps (t, t′) at which decisions were requested, {t | a^i_t ≠ a_∅}, or where a time-out event has occurred. The update is identical to that described in Algorithm 1, Δθ = α (R^i_{t:t′} + γ^{t′−t} Q(h^i a^i_{t′}) − Q(h^i a^i_t)) φ(h^i a^i_t).

In all cases we compared a variety of constant step-size parameters α, and also an automatically adapted step-size, using a variety of meta-step-size parameters β. For each algorithm we report the results for the best-performing parameters.

We compared the three algorithms on all eight scenarios of the simulator. To evaluate the performance of the algorithms, we measured their efficiency.

We define the efficiency E^π(t_1, t_2) of an algorithm π to be the mean total reward over the time-period [t_1, t_2], normalised such that the uniform random policy has efficiency 0 and the optimal policy[5] has efficiency 100. To estimate the sequentiality of each scenario, we used an oracle that greedily maximises immediate reward, when provided with the true reward function. We measured the efficiency of this "greedy oracle", E^greedy(0, T), for large T (Table 1). We note that the Internet A scenario is a true bandit scenario (i.e. actions have no effect on customers beyond their immediate response), whereas the other seven scenarios are increasingly sequential.

[5] The simulator was designed such that the optimal policy and optimal greedy policy could both be computed.

Table 1. Efficiency of the greedy oracle in each scenario. In a true bandit, the greedy oracle has efficiency 100.

                 Internet              Email
              A    B    C    D     A    B    C    D
            100   99   66   52    40   30   22   15

Each algorithm was run approximately 100 times for each setting. In our first experiment, we measured the efficiency of the algorithms over time, using 100 concurrent customers and online updates. We measured the efficiency over windows of 1,000 time-steps, E(t, t + 1000), and plotted the learning curves (Figure 2a). For the true bandit scenario, Internet A, the Contextual Bandit outperformed Concurrent TD. This is not surprising, since the contextual bandit exploits prior knowledge that the immediate response by the customer is independent of subsequent visits.[6] However, whenever there was any significant degree of sequentiality, such as in the Internet C, D and email scenarios, Concurrent TD outperformed the Contextual Bandit, by learning the sequential behaviour of customers during repeated visits. The advantage of Concurrent TD over the Contextual Bandit became more pronounced with increasing levels of sequentiality. Monte-Carlo learning was more effective in the email scenario; however, it was slower to learn than Concurrent TD, due to the opportunity loss incurred by waiting until a time-out before learning.[7]

[6] This could be implemented in Concurrent TD by setting γ = 0.
[7] If the time-out is too small, Monte-Carlo truncates sequences too early; if it is too big, opportunity loss is exaggerated; results are only reported for the optimal time-out.

In our second experiment, we measured the efficiency of the algorithms at different levels of concurrency. For this experiment we continued until customers had returned 100,000 times in total (requiring 100,000 decisions to be made), but we varied the number of concurrently interacting customers at any given time. In other words, we maintained a constant-size pool of active customers; if one customer attrited then a new customer would join the pool. A concurrency level of 1 is just like an episodic environment: the company interacts with a single customer at a time, until 100,000 decisions have been completed. At the other extreme, at a concurrency level of 10,000, there is a pool of 10,000 customers interacting in parallel, each of which returns just 10 times. We started each run with a "warm-up" period (selecting actions randomly, without any learning updates) to ensure that the customer pool included customers at a variety of different stages of interaction. The efficiency was measured over the second half of each run (Figure 2c). The results show a general degradation of performance with larger concurrency, illustrating the increased opportunity loss. In all cases, the Monte-Carlo algorithm was most sensitive to concurrency; at high concurrency levels Monte-Carlo barely outperformed the uniform random baseline. This demonstrates that after-the-fact learning incurs a prohibitively large opportunity loss at high concurrency. In contrast, the Concurrent TD algorithm is more robust at high concurrency, thanks to bootstrapping. In non-sequential scenarios such as Internet A, the Contextual Bandit was the most robust to high concurrency. However, for the non-sequential Email scenarios, the Contextual Bandit performed poorly, only occasionally stumbling on important sequences of null actions through random exploration. For all algorithms, the degradation of performance with concurrency was most apparent in the strongly sequential scenarios. In these scenarios customers must return several times before positive responses are observed, increasing the potential for opportunity loss.

Finally, we compared the efficiency of Concurrent TD between online and batch learning. Rather than learning online at each time-step, updates were batched together and executed every m time-steps, where m is the batch length; m = 1 is closest to online learning. Efficiency was measured over the full duration of each run. The results clearly show a very significant drop in performance for higher batch lengths. Furthermore, the greater the concurrency level, the faster this drop occurred. At low concurrency, batch length was not too important; for example, when the concurrency level is 1 (corresponding to an episodic environment in classic reinforcement learning), performance was maintained up to m = 100,000. However, at high concurrency batching becomes much more problematic; for example, at a concurrency level of 10,000 the performance started dropping significantly at m = 50. The natural conclusion is that, when interacting concurrently with large numbers of customers, it is imperative to use online updates rather than the offline updates or large batch lengths used in prior work, e.g. (Pednault et al., 2002; Abe et al., 2002; 2004).

Our experiments show the importance of bootstrapping online. We would expect similar performance for a well-tuned TD(λ) algorithm. Unfortunately, TD(λ) requires orders of magnitude more computation, as it updates all weights at every microscopic time-step.
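The efficiency measure used throughout these experiments is a normalised score: mean reward over a period, rescaled so that the uniform random policy maps to 0 and the optimal policy to 100. One natural way to compute that normalisation, assuming the mean rewards of the random and optimal baselines over the same period are available (the simulator was designed so that the optimal policy can be computed), is the following sketch:

    def efficiency(mean_reward_pi, mean_reward_random, mean_reward_optimal):
        """Normalised efficiency E^pi(t1, t2): 0 for the uniform random policy,
        100 for the optimal policy."""
        return 100.0 * (mean_reward_pi - mean_reward_random) / (
            mean_reward_optimal - mean_reward_random)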

Finally, we note that the performance of all algorithms is significantly below the optimal policy, due largely to limitations of the linear function approximator; this could perhaps be addressed by a richer set of features.

Figure 2. a) Efficiency of algorithms over time (on a scale from random = 0 to optimal = 100), when applied to four internet targeting and four email campaign scenarios of increasing sequentiality, at concurrency 100. Each point represents the average result over a window of 1,000 time-steps; results were averaged over approximately 100 runs. b) Efficiency of algorithms at various levels of concurrency (number of customers interacting with the system at any given time).

Figure 3. Efficiency of Concurrent TD at various levels of concurrency (number of customers interacting with the system at any given time) and batch lengths (number of time-steps between batch updates).

6. Conclusion

We have introduced a framework for reinforcement learning from customer interaction histories. Its crucial component, compared to traditional reinforcement learning approaches, is that the agent interacts with many customers concurrently. Our solution is a temporal-difference learning algorithm that bootstraps online from concurrent interactions. Our algorithm is computationally efficient and suitable for large-scale online learning. We evaluated Concurrent TD in a high-dimensional commercial simulator against non-bootstrapping (Monte-Carlo), non-online (batch TD), and non-sequential (Contextual Bandit) algorithms respectively. Our results clearly demonstrate that, in highly concurrent and sequential scenarios, it is vitally important to bootstrap from partial interaction sequences, to learn online, and to use sequential reinforcement learning algorithms. Unlike prior work, our algorithm combines all three of these key properties.

References

Abe, N., Pednault, E., Wang, H., Zadrozny, B., Fan, W., and Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In International Conference on Data Mining, pages 3–10.

Abe, N., Verma, N., Schroko, R., and Apte, C. (2004). Cross channel optimized marketing by reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 767–772.

Archibald, T. (1992). Parallel dynamic programming. In Kronsjö, L. and Shumsheruddin, D., editors, Advances in Parallel Algorithms, pages 343–367. John Wiley & Sons, Inc.

Bertsekas, D. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27(3):610–616.

Gomez-Perez, G., Martin-Guerrero, J. D., Soria-Olivas, E., Balaguer-Ballester, E., Palomares, A., and Casariego, N. (2008). Assigning discounts in a marketing campaign by using reinforcement learning and neural networks. Expert Systems with Applications (doi: 10.1016/j.eswa.2008.10.064).

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In 27th International Conference on Machine Learning, pages 13–20.

Grounds, M. and Kudenko, D. (2007). Parallel reinforcement learning with linear function approximation. In Adaptive Agents and Multi-Agent Systems, pages 60–74.

Kaelbling, L., Littman, M., and Cassandra, A. (1995). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. CoRR, abs/1003.0146.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In 11th International Conference on Machine Learning, pages 157–163.

Pednault, E., Abe, N., Zadrozny, B., Wang, H., Fan, W., and Apte, C. (2002). Sequential cost-sensitive decision making with reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD).

Schraudolph, N. N. (1999). Local gain adaptation in stochastic gradient descent. In International Conference on Artificial Neural Networks, pages 569–574.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In 10th National Conference on Artificial Intelligence, pages 171–176.

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Tsiptsis, K. and Chorianopoulos, A. (2010). Data Mining Techniques in CRM: Inside Customer Segmentation. Wiley.