
Concurrent Reinforcement Learning from Customer Interactions

David Silver ([email protected])
Department of Computer Science, CSML, University College London, London WC1E 6BT

Leonard Newnham, Dave Barker, Suzanne Weller, Jason McFall ([email protected])
Causata Ltd., 33 Glasshouse Street, London W1B 5DG

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Abstract

In this paper, we explore applications in which a company interacts concurrently with many customers. The company has an objective function, such as maximising revenue, customer satisfaction, or customer loyalty, which depends primarily on the sequence of interactions between company and customer. A key aspect of this setting is that interactions with different customers occur in parallel. As a result, it is imperative to learn online from partial interaction sequences, so that information acquired from one customer is efficiently assimilated and applied in subsequent interactions with other customers. We present the first framework for concurrent reinforcement learning, using a variant of temporal-difference learning to learn efficiently from partial interaction sequences. We evaluate our algorithms in two large-scale test-beds for online and email interaction respectively, generated from a database of 300,000 customer records.

1. Introduction

In many commercial applications, a company or organisation interacts concurrently with many customers. For example, a supermarket might offer customers discounts or promotions at point-of-sale; an online store might serve targeted content to its customers; or a bank might email appropriate customers with loan or mortgage offers. In each case, the company seeks to maximise an objective function, such as revenue, customer satisfaction, or customer loyalty. This objective can be represented as the discounted sum of a reward function. A stream of interactions occurs between the company and each customer, including actions from the company (such as promotions, advertisements, or emails) and actions by the customer (such as point-of-sale purchases, or clicks on a website).

Typically, thousands or millions of these interaction streams occur with different customers in parallel. Our goal is to maximise the future rewards for each customer, given their history of interactions with the company. This setting differs from traditional reinforcement learning paradigms, due to the concurrent nature of the customer interactions. This distinction leads to new considerations for reinforcement learning algorithms. In particular, when large numbers of interactions occur simultaneously, it is imperative to learn both online and to bootstrap, so that feedback from one customer can be assimilated and applied immediately to other customers.

The majority of prior work in customer analytics and customer relationship management collects data after-the-fact (i.e. once interaction sequences have been completed), and analyses the data offline (for example, Tsiptsis and Chorianopoulos, 2010). However, learning offline or from complete interaction sequences has a fundamental inefficiency: it is not possible to perform any learning until interaction sequences have terminated. This is particularly significant in situations with high concurrency and delayed feedback, for example during an email campaign. In these situations it is imperative to learn online from partial interaction sequences, so that information acquired from one customer is efficiently assimilated and applied in subsequent interactions with other customers. In these cases the final outcome of the sequence is unknown, and therefore it is necessary to bootstrap from a prediction of the final outcome.

It is often also important to learn from the absence of customer interaction. For example, customer attrition (where a customer leaves the company due to an undesirable sequence of interactions with the company) is often only apparent after months of inactivity by that customer. Waiting until a time-out event occurs is again inefficient, because learning only occurs after the time-out event is triggered. However, the lack of interaction by the customer provides accumulating evidence that the customer is likely to attrite. Bootstrapping can again be used to address this problem; by learning online from predicted attrition, the company can avoid repeating the undesirable sequence of interactions with other customers.

In addition to concurrency, customer-based reinforcement learning raises many challenges. Observations may be received, and decisions requested, asynchronously at different times for each customer. The learning algorithm must scale to very large quantities of data, whilst supporting rapid response times (for example, < 10 ms for online website targeting). In addition, each customer is only partially observed via a sparse series of interactions; as a result it can be very challenging to predict subsequent customer behaviour. Finally, there is usually a significant delay between an action being chosen and the effects of that action occurring. For example, offering free shipping to a customer may result in several purchases over the following few days. More sophisticated sequential interactions may also occur, for example where customers are channelled through a "sales funnel" by a sequence of progressively encouraging interactions.

In this paper we formalise the customer interaction problem as concurrent reinforcement learning. This formalism allows for interactions to occur asynchronously, by incorporating null actions and null observations to represent the absence of interaction. We then develop a concurrent variant of temporal-difference (TD) learning which bootstraps online from partial interaction sequences. To increase computational efficiency, we allow decisions to be taken at any time, using an instance of the options framework (Sutton et al., 1999); and we allow updates to be performed at any time, using multi-step TD learning (Sutton and Barto, 1998). We demonstrate the performance of concurrent TD on two large-scale test-beds for online and email interaction respectively, generated from real data about 300,000 customers.

2. Prior Work

Reinforcement learning has previously been applied to sequential marketing problems (Pednault et al., 2002; Abe et al., 2002), cross-channel marketing (Abe et al., 2004), and market discount selection (Gomez-Perez et al., 2008). This prior work has found that model-free methods tend to outperform model-based methods (Abe et al., 2002); it has applied model-free methods such as batch Sarsa (Abe et al., 2002) and batch Monte-Carlo learning (Gomez-Perez et al., 2008), using regression trees (Pednault et al., 2002) and neural networks (Gomez-Perez et al., 2008) to approximate the value function. However, this prior work ignored the concurrent aspect of the problem setting: learning was applied either in batch (incurring opportunity loss by not learning online), and/or by Monte-Carlo learning (incurring opportunity loss by waiting until episodes complete before learning). Our approach to concurrent reinforcement learning both learns online and bootstraps using TD learning, avoiding both forms of opportunity loss. In Section 5 we provide empirical evidence that both components are necessary to learn efficiently in concurrent environments.

Much recent research has focused on a special case of the customer-based reinforcement learning framework, using contextual bandits. In this setting, actions lead directly to immediate rewards, such as the click-through on an advert (Graepel et al., 2010) or news story (Li et al., 2010). A key assumption of this setting is that the company's actions do not affect the customer's future interactions with the company. However, in many cases this assumption is false. For example, advertising too aggressively to a customer in the short term may irritate or desensitise the customer and make them less likely to respond to subsequent interactions in the long term. We focus specifically on applications where the sequential nature of the problem is significant and contextual bandits are therefore a poor model of customer behaviour.

To avoid any ambiguity, we note that there has been significant prior work on distributed (sometimes also referred to as parallel) reinforcement learning. This body of work has focused on how a serial (i.e. continuing or episodic) reinforcement learning environment can be efficiently solved by distributing the algorithm over multiple processors. Perhaps the best known approach is distributed dynamic programming (Bertsekas, 1982; Archibald, 1992), in which Bellman backups can be applied to different states, in parallel and asynchronously; the value function is then communicated between all processors. More recently, a distributed TD learning algorithm was developed (Grounds and Kudenko, 2007). Again, this focused on efficient distributed computation of the solution, in this case applying TD backups in parallel to different states, and then communicating the value function between processors. Other work has investigated multi-agent reinforcement learning, where multiple agents interact together within a single environment instance (Littman, 1994). Our focus is very different to these approaches: we consider a single-agent reinforcement learning problem that is fundamentally concurrent (because the agent is interacting with many instances of the environment), rather than a distributed solution to a serial problem (where the agent interacts with a single instance of the environment at any given time).

We note that algorithms for concurrent reinforcement learning may themselves utilise distributed processing; however, that is not the focus of this paper and is not discussed further.

3. Concurrent Reinforcement Learning

Reinforcement learning optimises long-term reward over the sequential interactions between an agent and its environment. Environments are typically divided into two categories: episodic and continuing environments (Sutton and Barto, 1998). In an episodic environment, the interaction sequence eventually terminates, at which point the environment resets and a new interaction sequence begins; whereas in continuing environments, there is a single interaction sequence that never terminates. In both cases, interactions occur serially: the agent receives an observation, takes an action, and receives a reward.

We introduce a third category of reinforcement learning environments. In a concurrent environment, the agent interacts in parallel with many instances of the environment. In our motivating application, the agent is a company and the environment instances are customers. At any given time, the company may be involved in interaction sequences with many different customers; furthermore, new customers may arrive, or old customers may leave, at any time. This is quite different from episodic reinforcement learning, in which one customer must complete its sequence before a new customer arrives; or from continuing reinforcement learning, which considers a single customer forever. The distinct nature of concurrent environments leads to new challenges and considerations for reinforcement learning algorithms.

One challenge that arises naturally in concurrent environments is that actions and observations may occur asynchronously: actions may be executed, and observations or rewards received, at different times for each customer. We address this asynchronicity by representing customer interaction sequences at a fine time-scale, and introducing null actions and observations to represent the absence of interaction with a given customer. Specifically, we define a^i_t to be the decision (or decisions) taken by the company with respect to customer i at time t. If the company does not take any action at time t, then that action is defined to be null, a^i_t = a_∅. Similarly, we define o^i_t to be the observation of customer i at time t (which might include actions by the customer, or any new information acquired about the customer); if we do not observe anything for that customer, then that observation is defined to be null, o^i_t = o_∅. Finally, we define r^i_t to be the reward for customer i at time t; in the absence of any response, the reward is defined to be zero, r^i_t = 0. Real (non-null) observations and real actions may occur at different time-steps, with many intervening null interactions and zero rewards. Each time-step has a duration of d; this is typically short and real interactions with each customer are typically sparse.[1] We assume that there are a total of N customers; if customers arrive in or exit the system, then their periods of inactivity are represented by null actions.

[1] The continuous-time case is derived from the limit d → 0; for simplicity we treat time as discrete.

A second challenge is partial observability. Reinforcement learning often focuses on the fully observable setting where observations o^i_t satisfy the Markov property, P(o^i_{t+1} | o^i_t, a^i_t) = P(o^i_{t+1} | o^i_1, a^i_1, ..., o^i_t, a^i_t); in other words, the agent is provided with full knowledge of the state of the environment. However, when interacting with a customer, the environmental state includes the latent "mental" state of the customer. Rather than attempting to model beliefs over this unwieldy environmental state – the approach taken by POMDPs (Kaelbling et al., 1995) – we represent the customer's state directly by the history of their interactions. This is, by definition, a Markov state representation: the interaction histories form an infinite, tree-structured state space with the empty history at the root, and histories of length t at depth t.

We make a simplifying assumption that customer behaviour is identically distributed and conditionally independent given their personal interaction history. In other words, customers are fully described by the observable facts about that customer: interactions with one customer do not affect the behaviour of other customers (for example via communication between customers); and unobservable variations between customers (for example due to personality or mood) are modelled as noise.[2]

[2] Note however that fine distinctions between the behaviour of individual customers may still be modelled, for example by including customer attributes such as gender or location, as an initial observation.

Informally, our goal is to maximise future rewards for a customer given the interactions with that customer. We therefore focus on predicting future rewards directly from interaction histories, rather than marginalising over latent variables that model human decision-making behaviour.

Formally, we define an interaction history to be a sequence h_t = {a_1, o_1, r_1, ..., a_t, o_t, r_t}, containing actions a_k ∈ A ∪ {a_∅}, observations o_k ∈ O ∪ {o_∅}, and rewards r_k ∈ ℝ. There are a total of N interaction histories, h_t = [h^1_t; ...; h^N_t], occurring concurrently (the superscript i is henceforth used to denote the customer, which will be suppressed when discussing a single customer; and bold font to denote the vector of all customers). The observations and rewards are assumed to be i.i.d. given their interaction history, P(o_{t+1}, r_{t+1} | h_t, a_t) = ∏_{i=1}^N P(o^i_{t+1}, r^i_{t+1} | h^i_t, a^i_t).

We define a policy π(a_{t+1} | h_t) = P(a_{t+1} | h_t) to be a single strategy that can be applied to any customer; the next action is selected according to the interaction history for that customer only. We define a concurrent algorithm π(a_{t+1} | h_t) = P(a_{t+1} | h_t), over the vector of all histories, to be a strategy that depends on the interaction histories of all customers. In particular, a concurrent algorithm can apply experience gained from interactions with one customer so as to improve the policy for another customer.

The return is the discounted sum of rewards for a single customer, R_{t:∞} = ∑_{k=0}^∞ γ^k r_{t+k+1}, where γ ∈ [0, 1] is the discount factor. This represents, for example, the total revenue from a customer. The action-value function Q^π is the expected return for a customer, given their interaction history followed by action a, Q^π(h_t a) = E(R_{t:∞} | h_t a). The optimal value function is the maximum action-value function achievable by any policy, Q*(h_t a) = max_π Q^π(h_t a); our goal is to approximate the optimal policy π*(a_{t+1} | h_t) that achieves this maximum, i.e. the best policy given the accumulated information about that customer.

There are an infinite number of possible interaction histories, and therefore it is not feasible to represent the value function for all distinct histories. We utilise linear value function approximation, Q^π(h_t a) ≈ Q_θ(h_t a) = φ(h_t a)^⊤ θ, to estimate the value function for the current policy π, where φ(h_t a) is an n × 1 feature vector and θ is an n × 1 parameter vector. The feature vector φ(h_t a) summarises the relevant information about both the interaction history h_t and the proposed action a into a compact representation.

4. Concurrent TD Learning

In concurrent reinforcement learning, the importance of bootstrapping is accentuated and it is crucial to use TD learning. However, naive application of TD(λ) is computationally inefficient and does not exploit the asynchronous structure of concurrent reinforcement learning. In this section we develop a simple variant of TD learning that is practical for large-scale concurrent reinforcement learning environments.

If the time interval d is very small, and the number of customers N is large, it is computationally inefficient to update all customers at all time-steps. Typically, little changes between time-steps in which only null actions and null observations are observed. Learning at microscopic time-scales may also be data inefficient: for example, when using TD(0), backups must propagate over many time-steps, and the algorithm's behaviour depends on the choice of time interval d.

Figure 1. Concurrent TD applied to two example customers. Only real (non-null) actions and observations are shown. Curved red lines indicate concurrent TD backups; curved blue lines indicate concurrent TD options.

In our approach, the system only makes decisions at specific time-steps where a decision has been requested, and "executes" null actions (i.e. does nothing) at all other time-steps, without any further computation. Decisions may be requested for a variety of reasons: for online content serving, a real customer response will typically trigger a decision request; for email serving, a timer event might trigger a decision request (note that the decision may be a null action, i.e. choosing to do nothing). In addition, it is not necessary to perform learning updates at all time-steps, but only at those time-steps where an update has been requested. Again, updates might be triggered by real customer responses, but also by timer events. In general, our approach is fully asynchronous and there is no requirement that decisions and updates occur at the same time-step.

To implement asynchronous decision requests, we define a set of options {ω(a) | a ∈ A}, corresponding to actions a ∈ A. Each option ω(a) performs the corresponding real action a exactly once, then performs null actions at all subsequent time-steps until the next decision request or a time-out event. We seek the optimal policy over these options, taking account of the randomness in the decision requests; i.e. the best we can do under the constraint that non-null actions can only be taken on time-steps when decisions are requested.[3]

[3] We note that the history h_t a uniquely identifies the event of starting an option ω(a) at time-step t, and so the action-value function Q(h_t a) also defines the value of each corresponding option ω(a). Unlike options in general, which require a semi-Markov representation (Sutton et al., 1999), these simple options do not violate the Markov property, since the event of starting the macro-action ω(a) at time-step t is represented in the interaction history h_t a. As a result, whenever a decision request is received, an option is selected simply by maximising (with some exploration) over the action-value function, a_t = argmax_{a ∈ A} Q(h_t a).
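As a concrete illustration of this representation, the following is a minimal Python sketch, not the implementation used in the paper, of per-customer interaction histories with explicit null actions and a linear action-value estimate Q_θ(h_t a) = φ(h_t a)^⊤ θ; the feature map phi below is only a stand-in for the binned customer and time-based features described in Section 5.

    import numpy as np

    A_NULL = None                  # null action a_∅: the company does nothing this step
    O_NULL = None                  # null observation o_∅: no customer response this step
    REAL_ACTIONS = list(range(7))  # e.g. seven real actions, as in the experiments

    class InteractionHistory:
        """Per-customer history h_t = {a_1, o_1, r_1, ..., a_t, o_t, r_t}."""
        def __init__(self):
            self.steps = []  # list of (action, observation, reward) tuples

        def append(self, action=A_NULL, observation=O_NULL, reward=0.0):
            # Absence of interaction is recorded explicitly as a null action,
            # a null observation and zero reward.
            self.steps.append((action, observation, reward))

    def phi(history, action, n_features=2916):
        """Placeholder feature map phi(h_t a): summarise the history and the
        proposed action as an n x 1 feature vector (arbitrary stand-in features,
        deterministic in the pair (history length, action))."""
        seed = (len(history.steps) * 31 + (action if action is not None else 7)) % (2 ** 32)
        return np.random.default_rng(seed).random(n_features)

    def q_value(theta, history, action):
        """Linear value estimate Q_theta(h_t a) = phi(h_t a)^T theta."""
        return float(phi(history, action) @ theta)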

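Continuing that sketch, asynchronous decision requests might be handled as follows: when a decision is requested, an option ω(a) is started by ε-greedy maximisation of Q, and at all other time-steps the null action is returned at no computational cost. Here q_value, REAL_ACTIONS and A_NULL are the placeholders defined in the sketch above.

    import random

    EPSILON = 0.1  # exploration rate used by all algorithms in the experiments

    def select_action(theta, history, decision_requested):
        """Start an option omega(a) only when a decision has been requested;
        otherwise do nothing (null action)."""
        if not decision_requested:
            return A_NULL
        if random.random() < EPSILON:
            return random.choice(REAL_ACTIONS)
        # Greedy choice: a_t = argmax_a Q(h_t a). The candidate set could also
        # include the null action, e.g. for email campaigns where choosing to
        # send nothing is itself a decision.
        return max(REAL_ACTIONS, key=lambda a: q_value(theta, history, a))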
To implement asynchronous update requests, we utilise multi-step TD learning (Sutton and Barto, 1998) to evaluate the current option-selection policy. K-step TD updates provide a consistent learning algorithm for any K; mixing over different K leads to the well-known TD(λ) algorithm (Sutton, 1988). In our algorithm, we apply TD updates between successive update requests at time-steps t and t′. In particular, these time-steps need not correspond to decision requests, since the value of taking an option may legitimately be evaluated by intra-option value learning at any time-step during its execution (Sutton et al., 1999). Specifically, we perform a (t′ − t)-step Sarsa update. This updates the action-value function Q(h_t a_{t+1}) towards the subsequent action-value function Q(h_{t′} a_{t′+1}), discounted t′ − t times, plus the discounted reward accumulated along the way, R_{t:t′} = ∑_{k=0}^{t′−t} γ^k r_{t+k+1}:

    Δθ = α ( R_{t:t′} + γ^{t′−t} Q(h_{t′} a_{t′+1}) − Q(h_t a_{t+1}) ) φ(h_t a_{t+1})        (1)

The concurrent temporal-difference learning algorithm combines the options framework for asynchronous decision requests with multi-step TD updates for asynchronous update requests (Figure 1). The implementation of this algorithm is straightforward (Algorithm 1): at every decision request, an action is selected by ε-greedy maximisation of the action-value function Q(h_t a); and at every update request, the action-value function is updated according to Equation 1.

Algorithm 1 Concurrent TD
  Initialise θ
  Initialise u^i ← 0, R^i ← 0, ∀i ∈ [1, N]
  for each time-step t do
    for each customer i do
      if decision requested for customer i then
        a^i_{t+1} ← ε-greedy(Q)
      else
        a^i_{t+1} ← a_∅
      end if
      if update requested for customer i then
        δ ← R^i + γ^{t−u^i} Q(h^i_t a^i_{t+1}) − Q(h^i_{u^i} a^i_{u^i+1})
        θ ← θ + α δ φ(h^i_{u^i} a^i_{u^i+1})
        u^i ← t, R^i ← 0
      end if
      if reward observed for customer i then
        R^i ← R^i + γ^{t−u^i} r^i_{t+1}
      end if
    end for
  end for

As discussed earlier, it is often desirable to learn from the absence of customer feedback. To allow for this possibility, we include updates for time-steps at which no real observation occurred (for example, the final update for each customer in Figure 1). These updates can be scheduled so as to minimise computational expense, while ensuring that learning takes place for any significant change in customer value. In the following experiments these updates are scheduled at exponentially increasing time intervals after the last real observation, eventually resulting in a time-out and a final update upon termination of the interaction sequence.

5. Experiments

To evaluate our concurrent reinforcement learning algorithms, we used a commercial simulator for internet targeting and email campaign scenarios.[4] The simulator was based on a nearest neighbour model, constructed from a database containing 300,000 anonymised customer records from an online bank. Our experiments were designed to evaluate two questions. First, how important is it to learn online, and to bootstrap, when operating at different levels of concurrency (i.e. when interacting with different numbers of customers in parallel)? Second, in a concurrent setting, how important is it to learn from delayed rewards (i.e. the full reinforcement learning problem), compared to learning from immediate rewards (i.e. the popular contextual bandit setting)? To answer these questions, we compared TD learning (which learns online from partial interaction sequences) to Monte-Carlo learning (which learns after-the-fact from complete interaction sequences) and to a contextual bandit algorithm (which learns online from immediate rewards). The simulator included eight scenarios for internet targeting and email campaigns, which varied considerably in their "sequentiality" (i.e. whether actions have long-term consequences). We investigated how the relative performance of the three algorithms changed across different levels of sequentiality. We now give details for both the simulator and the algorithms.

[4] Simulator developer anonymised for blind review.

Each customer was represented by 30 variables such as session count, page view counters, and maximum credit. In addition, time-based variables have a special importance, due to the time-dependence of many customer interactions. For example, the probability of a positive customer response is typically dependent on the recency and frequency with which the company has been interacting with that customer: too little interaction and the customer is unlikely to respond; too much interaction and the customer is likely to be desensitised or annoyed; similarly, the current enthusiasm of the customer is often captured by the recency and frequency with which the customer has been interacting with the company. To represent these factors, we also included two time-based variables that depend directly on the recent interaction history. One of these variables, τ_o, measured the time since the last real observation; the second variable, τ_a, measured the time since the last real action.

Each simulated customer was initialised with a real behavioural history taken from the customer database. Subsequently, the reward and transition dynamics were generated from a commercial simulator. The simulator included four internet targeting and four email campaign scenarios. In all scenarios, the agent was able to select amongst 7 real actions plus the null (do nothing) action a_∅. Customers were limited to a binary response at each time-step (i.e. they either respond or fail to respond). The reward was r = 1 for a positive response and r = 0 for no response. The probability of response is a Bernoulli random variable with a mean that depends on all 30 customer variables, based on a nearest-neighbour representation, multiplied by a function of the two time-based variables. Customers returned after a mean time interval of 13 time-steps, but with a probability of attrition of 0.05. Once a customer attrites, she never returns (this is therefore equivalent to a discount factor of γ = 0.996). Action decisions were requested for each customer: once initially, and subsequently once each time they return.

In high-dimensional problems based on real customer data, it is common for some variables to be more significant than others; for some variables to be collinear; and for other variables to be predominantly noisy. This effect was modelled in the simulator by weighting the variables appropriately in the nearest neighbour response model. We combined two techniques to deal with the high-dimensional nature of the input.

First, each of the 30 customer variables and 2 time-based variables was discretised into twelve bins with dynamically adapted bin boundaries, using a variant of online k-means in each dimension. The feature vector φ(h_t a) contained one feature for each combination of bin and action (for customer variables), and one feature for each combination of bin and real or null action (for time-based variables), to give a total of n = 2916 features, each of which captures an aspect of the relationship between one variable and one action.

Second, we automatically adapted the step-size for each feature, by stochastic gradient meta-descent (Sutton, 1992; Schraudolph, 1999). The key idea of this algorithm is to adapt a gain vector of step-sizes, by gradient descent, so as to minimise the mean-squared prediction error. Using this algorithm, the step-size of noise variables is gradually reduced, allowing more credit to be assigned to the significant variables.

We compared three reinforcement learning algorithms, all of which used the adaptive discretisation and adaptive step-size algorithm described above. The computation required by each of these algorithms is minimal, and they are therefore suitable for large-scale implementation with rapid response times. In each algorithm we used a naive exploration policy, based on ε-greedy with ε = 0.1. We note that, although more sophisticated exploration strategies have been considered in the literature (especially in the contextual bandit setting), these do not always perform significantly better in real applications, and often perform less robustly, than a naive ε-greedy algorithm (Li et al., 2010).

Monte-Carlo. This algorithm learns after-the-fact from complete interaction sequences. A time-out event is generated if a customer i does not respond for 100 time-steps. No updates are performed until the time-out event occurs, at which point the value function is retrospectively updated towards the observed return: for each time-step at which a real action was selected, {t | a^i_t ≠ a_∅}, the parameter vectors are updated by Δθ = α (R^i_{t:∞} − Q(h^i_t a^i_t)) φ(h^i_t a^i_t).

Contextual bandit. This was a simple instance of a contextual bandit algorithm, under the simplifying assumption that actions do not affect the customer's state. This was implemented in the same way as the Monte-Carlo algorithm, but a time-out was generated before the next real action.

Concurrent TD. A (t′ − t)-step TD update was applied between successive time-steps (t, t′) at which decisions were requested, {t | a^i_t ≠ a_∅}, or where a time-out event has occurred. The update is identical to that described in Algorithm 1, Δθ = α (R^i_{t:t′} + γ^{t′−t} Q(h^i a^i_{t′}) − Q(h^i a^i_t)) φ(h^i a^i_t).

In all cases we compared a variety of constant step-size parameters α, and also an automatically adapted step-size, using a variety of meta-step-size parameters β. For each algorithm we report the results for the best-performing parameters.

We compared the three algorithms on all eight scenarios of the simulator. To evaluate the performance of the algorithms, we measured their efficiency.

We define the efficiency E^π(t_1, t_2) of an algorithm π to be the mean total reward over the time-period [t_1, t_2], normalised such that the uniform random policy has efficiency 0 and the optimal policy[5] has efficiency 100. To estimate the sequentiality of each scenario, we used an oracle that greedily maximises immediate reward, when provided with the true reward function. We measured the efficiency of this "greedy oracle", E^greedy(0, T), for large T (Table 1). We note that the Internet A scenario is a true bandit scenario (i.e. actions have no effect on customers beyond their immediate response), whereas the other seven scenarios are increasingly sequential.

[5] The simulator was designed such that the optimal policy and optimal greedy policy could both be computed.

Table 1. Efficiency of the greedy oracle in each scenario. In a true bandit, the greedy oracle has efficiency 100.

                 Internet              Email
              A    B    C    D     A    B    C    D
            100   99   66   52    40   30   22   15

Each algorithm was run approximately 100 times for each setting. In our first experiment, we measured the efficiency of the algorithms over time, using 100 concurrent customers and online updates. We measured the efficiency over windows of 1,000 time-steps, E(t, t + 1000), and plotted the learning curves (Figure 2a). For the true bandit scenario, Internet A, the Contextual Bandit outperformed Concurrent TD. This is not surprising, since the contextual bandit exploits prior knowledge that the immediate response by the customer is independent of subsequent visits.[6] However, whenever there was any significant degree of sequentiality, such as in the Internet C, D and email scenarios, Concurrent TD outperformed the Contextual Bandit, by learning the sequential behaviour of customers during repeated visits. The advantage of Concurrent TD over the Contextual Bandit became more pronounced with increasing levels of sequentiality. Monte-Carlo learning was more effective in the email scenario; however, it was slower to learn than Concurrent TD, due to the opportunity loss incurred by waiting until a time-out before learning.[7]

[6] This could be implemented in Concurrent TD by setting γ = 0.
[7] If the time-out is too small, Monte-Carlo truncates sequences too early; if it is too big, opportunity loss is exaggerated; results are only reported for the optimal time-out.

In our second experiment, we measured the efficiency of the algorithms at different levels of concurrency. For this experiment we continued until customers had returned 100,000 times in total (requiring 100,000 decisions to be made), but we varied the number of concurrently interacting customers at any given time. In other words, we maintained a constant-size pool of active customers; if one customer attrited then a new customer would join the pool. A concurrency level of 1 is just like an episodic environment: the company interacts with a single customer at a time, until 100,000 decisions have been completed. At the other extreme, at a concurrency level of 10,000, there is a pool of 10,000 customers interacting in parallel, each of which returns just 10 times. We started each run with a "warm-up" period (selecting actions randomly, without any learning updates) to ensure that the customer pool included customers at a variety of different stages of interaction. The efficiency was measured over the second half of each run (Figure 2c). The results show a general degradation of performance with larger concurrency, illustrating the increased opportunity loss. In all cases, the Monte-Carlo algorithm was most sensitive to concurrency; at high concurrency levels Monte-Carlo barely outperformed the uniform random baseline. This demonstrates that after-the-fact learning incurs a prohibitively large opportunity loss at high concurrency. In contrast, the Concurrent TD algorithm is more robust at high concurrency, thanks to bootstrapping. In non-sequential scenarios such as Internet A, the Contextual Bandit was the most robust to high concurrency. However, for the non-sequential Email scenarios, the Contextual Bandit performed poorly, only occasionally stumbling on important sequences of null actions through random exploration. For all algorithms, the degradation of performance with concurrency was most apparent in the strongly sequential scenarios. In these scenarios customers must return several times before positive responses are observed, increasing the potential for opportunity loss.

Finally, we compared the efficiency of Concurrent TD between online and batch learning. Rather than learning online at each time-step, updates were batched together and executed every m time-steps, where m is the batch length; m = 1 is closest to online learning. Efficiency was measured over the full duration of each run. The results clearly show a very significant drop in performance for higher batch lengths. Furthermore, the greater the concurrency level, the faster this drop occurred. At low concurrency, batch length was not too important; for example, when the concurrency level is 1 (corresponding to an episodic environment in classic reinforcement learning), performance was maintained up to m = 100,000. However, at high concurrency batching becomes much more problematic; for example, at a concurrency level of 10,000 the performance started dropping significantly at m = 50. The natural conclusion is that, when interacting concurrently with large numbers of customers, it is imperative to use online updates rather than the offline updates or large batch lengths used in prior work, e.g. (Pednault et al., 2002; Abe et al., 2002; 2004).

Our experiments show the importance of bootstrapping online. We would expect similar performance for a well-tuned TD(λ) algorithm. Unfortunately, TD(λ) requires orders of magnitude more computation, as it updates all weights at every microscopic time-step.
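The efficiency measure used throughout these experiments is a normalised score: mean reward over a period, rescaled so that the uniform random policy maps to 0 and the optimal policy to 100. One natural way to compute that normalisation, assuming the mean rewards of the random and optimal baselines over the same period are available (the simulator was designed so that the optimal policy can be computed), is the following sketch:

    def efficiency(mean_reward_pi, mean_reward_random, mean_reward_optimal):
        """Normalised efficiency E^pi(t1, t2): 0 for the uniform random policy,
        100 for the optimal policy."""
        return 100.0 * (mean_reward_pi - mean_reward_random) / (
            mean_reward_optimal - mean_reward_random)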

Finally, we note that the performance of all algorithms is significantly below the optimal policy, due largely to limitations of the linear function approximator; this could perhaps be addressed by a richer set of features.

Figure 2. a) Efficiency of algorithms over time (on a scale from random = 0 to optimal = 100), when applied to four internet targeting and four email campaign scenarios of increasing sequentiality, at concurrency 100. Each point represents the average result over a window of 1,000 time-steps; results were averaged over approximately 100 runs. b) Efficiency of algorithms at various levels of concurrency (number of customers interacting with the system at any given time).

Figure 3. Efficiency of Concurrent TD at various levels of concurrency (number of customers interacting with the system at any given time) and batch lengths (number of time-steps between batch updates).

6. Conclusion

We have introduced a framework for reinforcement learning from customer interaction histories. Its crucial component, compared to traditional reinforcement learning approaches, is that the agent interacts with many customers concurrently. Our solution is a temporal-difference learning algorithm that bootstraps online from concurrent interactions. Our algorithm is computationally efficient and suitable for large-scale online learning. We evaluated Concurrent TD in a high-dimensional commercial simulator against non-bootstrapping (Monte-Carlo), non-online (batch TD), and non-sequential (Contextual Bandit) algorithms respectively. Our results clearly demonstrate that, in highly concurrent and sequential scenarios, it is vitally important to bootstrap from partial interaction sequences, to learn online, and to use sequential reinforcement learning algorithms. Unlike prior work, our algorithm combines all three of these key properties.

References

Abe, N., Pednault, E., Wang, H., Zadrozny, B., Fan, W., and Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In International Conference on Data Mining, pages 3–10.

Abe, N., Verma, N., Schroko, R., and Apte, C. (2004). Cross channel optimized marketing by reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 767–772.

Archibald, T. (1992). Parallel dynamic programming. In Kronsjö, L. and Shumsheruddin, D., editors, Advances in Parallel Algorithms, pages 343–367. John Wiley & Sons, Inc.

Bertsekas, D. (1982). Distributed dynamic programming. IEEE Transactions on Automatic Control, 27(3):610–616.

Gomez-Perez, G., Martin-Guerrero, J. D., Soria-Olivas, E., Balaguer-Ballester, E., Palomares, A., and Casariego, N. (2008). Assigning discounts in a marketing campaign by using reinforcement learning and neural networks. Expert Systems with Applications (doi: 10.1016/j.eswa.2008.10.064).

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In 27th International Conference on Machine Learning, pages 13–20.

Grounds, M. and Kudenko, D. (2007). Parallel reinforcement learning with linear function approximation. In Adaptive Agents and Multi-Agent Systems, pages 60–74.

Kaelbling, L., Littman, M., and Cassandra, A. (1995). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134.

Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. CoRR, abs/1003.0146.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In 11th International Conference on Machine Learning, pages 157–163.

Pednault, E., Abe, N., Zadrozny, B., Wang, H., Fan, W., and Apte, C. (2002). Sequential cost-sensitive decision making with reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD).

Schraudolph, N. N. (1999). Local gain adaptation in stochastic gradient descent. In International Conference on Artificial Neural Networks, pages 569–574.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In 10th National Conference on Artificial Intelligence, pages 171–176.

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Tsiptsis, K. and Chorianopoulos, A. (2010). Data Mining Techniques in CRM: Inside Customer Segmentation. Wiley.