PRELIMINARY VERSION: DO NOT CITE. The AAAI Digital Library will contain the published version some time after the conference.

Meta-Learning Effective Exploration Strategies for Contextual Bandits

Amr Sharaf,1 Hal Daumé III1,2
1 University of Maryland  2 Microsoft Research
[email protected], [email protected]

Abstract

In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, MÊLÉE, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. MÊLÉE uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate MÊLÉE on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. MÊLÉE outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.

1 Introduction

In a contextual bandit problem, an agent attempts to optimize its behavior over a sequence of rounds based on limited feedback (Kaelbling 1994; Auer 2003; Langford and Zhang 2008). In each round, the agent chooses an action based on a context (features) for that round, and observes a reward for that action but no others (§2). The feedback is partial: the agent only observes the reward for the action it selected, and for no other actions. This is strictly harder than supervised learning, in which the agent observes the reward for all available actions. Contextual bandit problems arise in many real-world settings such as learning to rank for information retrieval, online recommendation, and personalized medicine.

As in reinforcement learning, the agent must learn to balance exploitation (taking actions that, based on past experience, it believes will lead to high instantaneous reward) and exploration (trying new actions). However, contextual bandit learning is easier than reinforcement learning: the agent needs to take only one action, not a sequence of actions, and therefore does not face a credit assignment problem.

In this paper, we present a meta-learning approach to automatically learn a good exploration strategy from data. To achieve this, we use synthetic datasets on which we can simulate contextual bandit tasks in an offline setting. Based on these simulations, our algorithm, MÊLÉE (MEta LEarner for Exploration), learns a good heuristic exploration strategy that generalizes to future contextual bandit problems. MÊLÉE contrasts with more classical approaches to exploration (like ε-greedy or LinUCB), in which exploration strategies are constructed by expert algorithm designers. These approaches often achieve provably good exploration strategies in the worst case, but are potentially overly pessimistic and are sometimes computationally intractable.

MÊLÉE is an example of meta-learning in which we replace a hand-crafted learning algorithm with a learned learning algorithm. At training time (§3), MÊLÉE simulates many contextual bandit problems from fully labeled synthetic data. Using this data, in each round, MÊLÉE is able to counterfactually simulate what would happen under all possible action choices. We can then use this information to compute regret estimates for each action, which can be optimized using the AggreVaTe imitation learning algorithm (Ross and Bagnell 2014). Our imitation learning strategy mirrors the meta-learning approach of Bachman, Sordoni, and Trischler (2017) in the active learning setting. We present a simplified, stylized analysis of the behavior of MÊLÉE to ensure that our cost function encourages good behavior (§4), and show that MÊLÉE enjoys the no-regret guarantees of the AggreVaTe imitation learning algorithm.

Empirically, we use MÊLÉE to train an exploration policy on only synthetic datasets and evaluate this policy on both a contextual bandit task based on a natural learning to rank dataset as well as three hundred simulated contextual bandit tasks (§5). We compare the trained policy to a number of alternative exploration algorithms, and show that MÊLÉE outperforms alternative exploration strategies in most settings.
2 Preliminaries: Contextual Bandits and Policy Optimization

Contextual bandits is a model of interaction in which an agent chooses actions (based on contexts) and receives immediate rewards for that action alone. For example, in a simplified news personalization setting, at each time step t, a user arrives and the system must choose a news article to display to them. Each possible news article corresponds to an action a, and the user corresponds to a context x_t. After the system chooses an article a_t to display, it can observe, for instance, the amount of time that the user spends reading that article, which it can use as a reward r_t(a_t).

Formally, we largely follow the setup and notation of Agarwal et al. (2014). Let X be an input space of contexts (users) and [K] = {1, ..., K} be a finite action space (articles). We consider the statistical setting in which there exists a fixed but unknown distribution D over pairs (x, r) ∈ X × [0,1]^K, where r is a vector of rewards (for convenience, we assume all rewards are bounded in [0,1]). In this setting, the world operates iteratively over rounds t = 1, 2, .... Each round t:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent (randomly) chooses action a_t ∈ [K] based on x_t, and observes reward r_t(a_t).

The goal of an algorithm is to maximize the cumulative sum of rewards over time. Typically the primary quantity considered is the average regret of a sequence of actions a_1, ..., a_T with respect to the behavior of the best possible function in a prespecified class F:

  Reg(a_1, ..., a_T) = max_{f ∈ F} (1/T) Σ_{t=1}^T [ r_t(f(x_t)) − r_t(a_t) ]    (1)

A no-regret agent has zero average regret in the limit of large T.
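To make the protocol above concrete, the following is a minimal sketch (ours, not the authors' code) of the round-by-round interaction on a synthetic task where the simulator knows the full reward table but the agent sees only the reward of its chosen action; the `agent` object with `act`/`update` methods and the uniform-random baseline are illustrative assumptions.

```python
import numpy as np

class UniformAgent:
    """Trivial baseline: explore uniformly at random, ignore the history."""
    def act(self, x_t, num_actions, rng):
        a_t = int(rng.integers(num_actions))
        return a_t, 1.0 / num_actions          # chosen action and its logging probability

    def update(self, x_t, a_t, r_t, p_t):
        pass                                    # a real agent would append to its history here

def run_contextual_bandit(agent, contexts, reward_table, rng):
    """Simulate the interaction protocol of Section 2 and return the average reward."""
    T, K = reward_table.shape
    total = 0.0
    for t in range(T):
        x_t = contexts[t]                       # 1. world reveals context x_t
        a_t, p_t = agent.act(x_t, K, rng)       # 2. agent (randomly) chooses a_t
        r_t = reward_table[t, a_t]              # only r_t(a_t) is revealed to the agent
        agent.update(x_t, a_t, r_t, p_t)
        total += r_t
    return total / T                            # the quantity regret in Eq. (1) is measured against

# Tiny usage example with random contexts and rewards.
rng = np.random.default_rng(0)
avg_reward = run_contextual_bandit(UniformAgent(), rng.normal(size=(100, 5)),
                                   rng.uniform(size=(100, 3)), rng)
```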
To produce a good agent for interacting with the world, we assume access to a function class F and to an oracle policy optimizer for that function class, POLOPT. For example, F may be a set of single-layer neural networks mapping user features x ∈ X to predicted rewards for actions a ∈ [K]. Formally, the observable record of interaction resulting from round t is the tuple (x_t, a_t, r_t(a_t), p_t(a_t)) ∈ X × [K] × [0,1] × [0,1], where p_t(a_t) is the probability that the agent chose action a_t, and the full history of interaction is h_t = ⟨(x_i, a_i, r_i(a_i), p_i(a_i))⟩_{i=1}^t. The oracle policy optimizer, POLOPT, takes as input a history of user interactions and outputs an f ∈ F with low expected regret.

An example of the oracle policy optimizer POLOPT is to combine inverse propensity scaling (IPS) with a regression algorithm (Horvitz and Thompson 1952). Here, given a history h, each tuple (x, a, r, p) in that history is mapped to a multiple-output regression example. The input for this regression example is the same x; the output is a vector of K costs, all of which are zero except the a-th component, which takes value r/p. This mapping is done for all tuples in the history, and a supervised learning algorithm on the function class F is used to produce a low-regret regression function f. This is the function returned by the oracle policy optimizer POLOPT. IPS has the property of being unbiased; however, it often suffers from large variance.

The direct method (DM) (Dudik et al. 2011) is another kind of oracle policy optimizer POLOPT that has lower variance than IPS. The direct method estimates the reward function directly from the history h without importance sampling, and uses this estimate to learn a low-regret function f. In our experiments, we use the direct method, largely for its low variance and simplicity. However, MÊLÉE is agnostic to the type of the estimator used by the oracle policy optimizer POLOPT.
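As an illustration only (not the authors' implementation), the sketch below realizes an IPS-style POLOPT: each logged tuple (x, a, r, p) becomes a K-output regression example whose a-th target is r/p, and a standard regressor standing in for F is fit to the result; the ridge regressor and its hyperparameters are stand-in assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def polopt_ips(history, num_actions):
    """IPS-based oracle policy optimizer: history -> reward predictor f(x) in R^K.

    history: list of (x, a, r, p) tuples as defined in Section 2.
    """
    if not history:
        # No data yet: predict zero reward for every action.
        return lambda x: np.zeros(num_actions)
    X = np.stack([np.asarray(x, dtype=float) for (x, _, _, _) in history])
    Y = np.zeros((len(history), num_actions))
    for i, (_, a, r, p) in enumerate(history):
        Y[i, a] = r / p                      # IPS target: unbiased, but variance grows as p shrinks
    model = Ridge(alpha=1.0).fit(X, Y)       # any supervised learner over F could be used here
    return lambda x: model.predict(np.asarray(x, dtype=float).reshape(1, -1))[0]
```

A direct-method variant would instead set Y[i, a] = r (regressing on the observed rewards without the 1/p weighting), trading unbiasedness for the lower variance the paper prefers.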
3 Approach: Learning an Effective Exploration Strategy

In order to have an effective approach to the contextual bandit problem, one must be able both to optimize a policy based on historic data and to make decisions about how to explore. The exploration/exploitation dilemma is fundamentally about long-term payoffs: is it worth trying something potentially suboptimal now in order to learn how to behave better in the future? A particularly simple and effective form of exploration is ε-greedy: given a function f output by POLOPT, act according to f(x) with probability (1 − ε) and act uniformly at random with probability ε. Intuitively, one would hope to improve on a strategy like ε-greedy by taking more (any!) information into account, for instance, basing the probability of exploration on f's uncertainty.

In this section, we describe MÊLÉE, first by showing how it operates in a Markov Decision Process (§3), and then by showing how to train it using synthetic simulated contextual bandit problems based on imitation learning (§3).

More formally, let π be the exploration policy we are learning, which takes two inputs: a function f ∈ F and a context x, and outputs an action. In our example, f will be the output of the policy optimizer on all historic data, and x will be the current user. This is used to produce an agent which interacts with the world, maintaining an initially empty history buffer h, as:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent computes f_t = POLOPT(h_{t−1}) and a greedy action ã_t = π(f_t, x_t).
3. The agent plays a_t = ã_t with probability (1 − µ), and a_t uniformly at random otherwise.
4. The agent observes r_t(a_t).
5. The agent appends (x_t, a_t, r_t(a_t), p_t) to the history to form h_t, where p_t = µ/K if a_t ≠ ã_t, and p_t = 1 − µ + µ/K if a_t = ã_t.

Here, f_t is the function optimized on the historical data, and π uses it and x_t to choose an action. Intuitively, π might choose to use the prediction f_t(x_t) most of the time, unless f_t is quite uncertain on this example, in which case π might choose to return the second (or third) most likely action according to f_t. The agent then performs a small amount of additional µ-greedy-style exploration: most of the time it acts according to π, but occasionally it explores some more. In practice (§5), we find that setting µ = 0 is optimal in aggregate, but non-zero µ is necessary for our theory (§4).
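The five steps above translate directly into a short loop; this is a sketch under our own naming assumptions (`pi`, `polopt`, and `reward_oracle` are stand-ins for the learned exploration policy, the oracle optimizer, and the environment), and it records the logging probability p_t exactly as in step 5.

```python
import numpy as np

def deploy(pi, polopt, contexts, reward_oracle, num_actions, mu, rng):
    """Test-time behavior of Section 3: act via pi(f_t, x_t) with extra mu-greedy exploration."""
    history, rewards = [], []
    for t, x_t in enumerate(contexts):
        f_t = polopt(history, num_actions)          # step 2: refit on the history so far
        a_greedy = pi(f_t, x_t)                     # pi decides whether to follow f_t or deviate
        if rng.random() < mu:                       # step 3: occasional uniform exploration
            a_t = int(rng.integers(num_actions))
        else:
            a_t = a_greedy
        p_t = (1 - mu + mu / num_actions) if a_t == a_greedy else mu / num_actions
        r_t = reward_oracle(t, a_t)                 # step 4: observe reward for a_t only
        history.append((x_t, a_t, r_t, p_t))        # step 5: append to the history
        rewards.append(r_t)
    return float(np.mean(rewards))
```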
Markov Decision Process Formulation

We model the exploration/exploitation task as a Markov Decision Process (MDP). Given a context vector x and a function f output by POLOPT on rounds 1, ..., t−1, the agent learns an exploration policy π to decide whether to exploit by acting according to f(x), or explore by choosing a different action a ≠ f(x). We model this as an MDP, represented as a tuple ⟨A, S, s_0, T, R⟩, where A = {a} is the set of all actions, S = {s} is the space of all possible states, s_0 is a starting state (or initial distribution), R(s, a) is the reward function, and T(s′ | s, a) is the transition function. We describe these below.

States  A state s in our MDP represents all past information seen in the world (there are a very large number of states). At time t, the state s_t includes: all past experiences (x_i, a_i, r_i(a_i))_{i=1}^{t−1}, the current context x_t, and the current function f_t computed by running POLOPT on the past experiences. For the purposes of learning, this state space is far too large to be practical, so we model each state using a set of exploration features. The exploration policy π is trained based on exploration features (Alg 1, line 12). These features are allowed to depend on the current classifier f_t, and on any part of the history except the inputs x_t, in order to maintain task independence. We additionally ensure that these features are independent of the dimensionality of the inputs, so that π can generalize to datasets of arbitrary dimensions. The specific features we use are listed below; they are largely inspired by Konyushkova, Sznitman, and Fua (2017), but adapted to our setting.

Importantly, we wish to train π using one set of tasks (for which we have fully supervised data on which to run simulations) and apply it to wholly different tasks (for which we only have bandit feedback). To achieve this, we allow π to depend representationally on f_t in arbitrary ways: for instance, it might use features that capture f_t's uncertainty on the current example. We additionally allow π to depend in a task-independent manner on the history (for instance, which actions have not yet been tried): it can use features of the actions, rewards and probabilities in the history, but not depend directly on the contexts x. This is to ensure that π only learns to explore and not also to solve the underlying task-dependent classification problem. Because π needs to learn to be task independent, we found that if f_t's predictions were uncalibrated, it was very difficult for π to generalize well to unseen tasks. Therefore, we additionally allow π to depend on a very small amount of fully labeled data from the task at hand, which we use to allow π to calibrate f_t's predictions. In our experiments we use only 30 fully labeled examples, but alternative approaches to calibrating f_t that do not require this data would be preferable.

The features of f_t that we use are: a) the predicted probability p(a_t | f_t, x_t), where we use a softmax over the predicted rewards from f_t to convert them to probabilities; b) the entropy of the predicted probability distribution; c) a one-hot encoding of the predicted action f_t(x_t). The features of h_{t−1} that we use are: a) the current time step t; b) normalized counts for all previous actions predicted so far; c) the average observed reward for each action; d) the empirical variance of the observed rewards for each action in the history. We use Platt's scaling (Platt 1999; Lin, Lin, and Weng 2007) to calibrate the predicted probabilities; Platt's scaling works by fitting a logistic regression model to the classifier's predicted scores.

Actions  At each state, our learned exploration policy π must take an input state s_t (described above) and make a decision. Its action space A is the same action space as that of the contextual bandit problem it is trying to solve. If π chooses to take the same action as f, then we interpret this as an "exploitation" step, and if it takes another action, we interpret this as an "exploration" step.

Transitions  Each episode starts off with a new contextual bandit task and an empty history h_0 = {}. The subsequent steps in the episode involve observing context vectors x_1, ..., x_T from the new contextual bandit task. A single transition in the episode consists of the exploration policy π being given the state s containing information about the current context vector x_t and the history h_{t−1}, using which the exploration policy π chooses the next action a. The transition function T(s′ | s, a) incorporates the action a chosen by the exploration policy in state s along with the features representing the current state s, and produces the next state s′, which represents a new feature vector x_{t+1}. The episode terminates whenever all the context vectors in the contextual bandit task have been exhausted. During the test phase, each contextual bandit task is handled only once, in a single episode.

Rewards  The reward function is chosen so that reward maximization by the learned policy is equivalent to low regret in the contextual bandit problem. Formally, at state s_t, let r_t(·) be the reward function for the contextual bandit task at that state. The reward function is R(s, a) = r_t(a).

Initial State  The initial state distribution is formed by drawing a new contextual bandit task at random, setting the history to the empty set, and initializing the first context x_1 as the first example in that contextual bandit task.
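For concreteness, here is a sketch of a feature extractor in the spirit of the lists above (the exact normalizations are our assumptions, and Platt-scaling calibration of the scores is omitted for brevity); note that its output dimension depends only on the number of actions, never on the context dimension.

```python
import numpy as np

def exploration_features(f_t, x_t, history, t, num_actions):
    """Feature map Phi(s) over (f_t, x_t, h_{t-1}); independent of the context dimension."""
    scores = np.asarray(f_t(x_t), dtype=float)          # predicted rewards for each action
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                 # softmax over predicted rewards
    entropy = -np.sum(probs * np.log(probs + 1e-12))     # uncertainty of f_t on this example
    one_hot = np.eye(num_actions)[int(np.argmax(scores))]

    # History statistics: counts, mean reward, and reward variance per action.
    counts = np.zeros(num_actions)
    sums = np.zeros(num_actions)
    sq_sums = np.zeros(num_actions)
    for (_, a, r, _) in history:
        counts[a] += 1.0
        sums[a] += r
        sq_sums[a] += r * r
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    variances = np.divide(sq_sums, counts, out=np.zeros_like(sq_sums), where=counts > 0) - means ** 2
    norm_counts = counts / max(1, len(history))

    return np.concatenate([probs, [entropy], one_hot, [float(t)], norm_counts, means, variances])
```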
Training MÊLÉE by Imitation Learning

The meta-learning challenge is: how do we learn a good exploration policy π? We assume we have access to fully labeled data on which we can train π; this data must include context/reward pairs, where the reward for all actions is known. This is a weak assumption: in practice, we use purely synthetic data as this training data; one could alternatively use any fully labeled classification dataset, as in Beygelzimer and Langford (2009). Under this assumption about the data, and with our model of behavior as an MDP, a natural class of learning algorithms to consider for learning π are imitation learning algorithms (Daumé, Langford, and Marcu 2009; Ross, Gordon, and Bagnell 2011; Ross and Bagnell 2014; Chang et al. 2015). In other work on meta-learning, such problems are often cast as full reinforcement-learning problems. We opt for imitation learning instead because it is computationally attractive and effective when a simulator exists.

Informally, at training time, MÊLÉE will treat one of these synthetic datasets as if it were a contextual bandit dataset. At each time step t, it will compute f_t by running POLOPT on the historical data, and then consider: for each action, what would the long-term reward look like if I were to take this action? Because the training data for MÊLÉE is fully labeled, this can be evaluated for each possible action, and a policy π can be learned to maximize these rewards.

More formally, in imitation learning, we assume training-time access to an expert, π*, whose behavior we wish to learn to imitate at test time. Because we can train on fully supervised training sets, we can easily define an optimal reference policy π*, which "cheats" at training time by looking at the true labels: in particular, π* can always pick the correct action (i.e., the action that maximizes future rewards) at any given state. The learning problem is then to estimate π to have behavior as similar to π* as possible, but without access to those labels.

Suppose we wish to learn an exploration policy π for a contextual bandit problem with K actions. We assume access to M supervised learning datasets S_1, ..., S_M, where each S_m = {(x_1, r_1), ..., (x_{N_m}, r_{N_m})} is of size N_m, each x_n is from a (possibly different) input space X_m, and the reward vectors are all in [0,1]^K. In particular, multi-class classification problems are modeled by setting the reward for the correct label to one and the reward for all other labels to zero.

The imitation learning algorithm we use is AggreVaTe (Ross and Bagnell 2014) (closely related to DAgger (Ross, Gordon, and Bagnell 2011)), instantiated for the contextual bandit meta-learning problem in Alg 1. AggreVaTe learns to choose actions to minimize the cost-to-go of the expert rather than the zero-one classification loss of mimicking its actions. On the first iteration, AggreVaTe collects data by observing the expert perform the task, and in each trajectory, at time t, explores an action a in state s and observes the cost-to-go Q*_t(s, a) of the expert after performing this action, defined as:

  Q*_t(s, a) = r_t(a) + E_{s′ ∼ T(· | s, a), a′ ∼ π*(· | s′)} [ Q*_{t+1}(s′, a′) ]    (2)

where the expectation is taken over the randomness of the policy π* and the MDP.

Each such step generates a cost-weighted training example (s, t, a, Q*), and AggreVaTe trains a policy π_1 to minimize the expected cost-to-go on this dataset. At each following iteration n, AggreVaTe collects data through interaction with the learner as follows: for each trajectory, begin by using the current learner's policy π_n to perform the task, interrupt at time t, explore a roll-in action a in the current state s, after which control is given back to the expert to continue up to the time horizon T. This results in new examples of the cost-to-go (roll-out value) of the expert, (s, t, a, Q*), under the distribution of states visited by the current policy π_n. This new data is aggregated with all previous data to train the next policy π_{n+1}; more generally, this data can be used by a no-regret online learner to update the policy and obtain π_{n+1}. This is iterated for some number of iterations N, and the best policy found is returned.

Algorithm 1  MÊLÉE (supervised training sets {S_m}, hypothesis class F, exploration rate µ, number of validation examples N_Val, feature extractor Φ)
 1: initialize meta-dataset D = {}
 2: for episode n = 1, 2, ..., N do
 3:   choose S at random from {S_m}, and set history h_0 = {}
 4:   partition and permute S randomly into train Tr and validation Val, where |Val| = N_Val
 5:   for round t = 1, 2, ..., |Tr| do
 6:     let (x_t, r_t) = Tr_t and s_t = ⟨(x_i, a_i, r_i(a_i))_{i=1}^{t−1}, x_t, f_t⟩
 7:     for each action a = 1, ..., K do
 8:       f_{t,a} = POLOPT(F, h_{t−1} ⊕ (x_t, a, r_t(a), 1 − (K−1)µ/K)) on the augmented history
 9:       roll-out: estimate Q*(s_t, a), the cost-to-go of a, using r_t(a) and a roll-out policy π_out on f_{t,a}
10:     end for
11:     compute f_t = POLOPT(F, h_{t−1})
12:     D ← D ⊕ ⟨Φ(s_t), (Q*(s_t, 1), ..., Q*(s_t, K))⟩
13:     roll-in: sample a_t ∼ (µ/K) 1_K + (1 − µ) π_{n−1}(f_t, x_t), with probability p_t, where 1_K is the ones vector
14:     append history h_t ← h_{t−1} ⊕ (x_t, a_t, r_t(a_t), p_t)
15:   end for
16:   update π_n = LEARN(D)
17: end for
18: return best policy in {π_n}_{n=1}^N
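A compressed Python rendering of Algorithm 1 is sketched below, under simplifying assumptions: the per-action roll-out of lines 7–9 is collapsed to the expert's one-step roll-out value r_t(a) (which, as described under "Roll-out values" below, is the estimate the paper uses), the validation/calibration split of line 4 is omitted, and `learn` stands in for LEARN and must return a policy with the same calling convention as `pi`. This is illustrative, not the authors' implementation.

```python
import numpy as np

def melee_train(datasets, polopt, features, learn, num_actions, mu, n_episodes, rng):
    """AggreVaTe-style training of the exploration policy pi (Algorithm 1, compressed)."""
    D_phi, D_q = [], []                                        # aggregated meta-dataset D
    pi = lambda f, x, phi: int(np.argmax(f(x)))                # arbitrary initial policy: follow f_t
    for _ in range(n_episodes):
        S = datasets[rng.integers(len(datasets))]              # line 3: pick a random supervised set
        history = []
        for t, (x_t, r_t) in enumerate(S):                     # r_t is the full reward vector
            f_t = polopt(history, num_actions)                 # line 11
            phi = features(f_t, x_t, history, t, num_actions)
            q_star = np.asarray(r_t, dtype=float)              # line 9 (simplified): expert roll-out = true rewards
            D_phi.append(phi)                                  # line 12: aggregate the training example
            D_q.append(q_star)
            if rng.random() < mu:                              # line 13: roll-in action
                a_t, p_t = int(rng.integers(num_actions)), mu / num_actions
            else:
                a_t, p_t = pi(f_t, x_t, phi), 1 - mu + mu / num_actions
            history.append((x_t, a_t, r_t[a_t], p_t))          # line 14
        pi = learn(np.stack(D_phi), np.stack(D_q))             # line 16: retrain pi on all data so far
    return pi
```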
Following the AggreVaTe template, MÊLÉE operates in an iterative fashion, starting with an arbitrary π and improving it through interaction with an expert. Over N episodes, MÊLÉE selects random training sets and simulates the test-time behavior on each. The core functionality is to generate a number of states (f_t, x_t) on which to train π, and to use the supervised data to estimate the value of every action from those states. MÊLÉE achieves this by sampling a random supervised training set and setting aside some validation data from it (line 4). It then simulates a contextual bandit problem on this training data; at each time step t, it tries all actions and "pretends" that they were appended to the current history (line 8), on which it trains a new policy and evaluates its roll-out value (line 9). This yields, for each t, a new training example for π, which is added to π's training set (line 12); the features for this example are features of the classifier based on the true history (line 11) (and possibly statistics of the history itself), with a label that gives, for each action, the corresponding cost-to-go of that action (the Q*s computed in line 9). MÊLÉE then must commit to a roll-in action to actually take; it chooses this according to a roll-in policy (line 13). MÊLÉE has no explicit "exploitation policy": exploitation happens when π chooses the same action as f_t, while exploration happens when it chooses a different action. In learning to explore, MÊLÉE simultaneously learns when to exploit.

Roll-in actions. The distribution over states visited by MÊLÉE depends on the actions taken, and in general it is good to have that distribution match what is seen at test time. This distribution is determined by a roll-in policy (line 13), controlled in MÊLÉE by an exploration parameter µ ∈ [0,1]. As µ → 1, the roll-in policy approaches a uniform random policy; as µ → 0, the roll-in policy becomes deterministic. When the roll-in policy does not explore, it acts according to π(f_t, ·).

Roll-out values. The ideal value to assign to an action (from the perspective of the imitation learning procedure) is the total reward (or advantage) that would be achieved in the long run if we took this action and then behaved according to our final learned policy. Unfortunately, during training, we do not yet know the final learned policy. Thus, a surrogate roll-out policy π_out is used instead. A convenient, and often computationally efficient, alternative is to evaluate the value assuming all future actions were taken by the expert (Langford and Zadrozny 2005; Daumé, Langford, and Marcu 2009; Ross and Bagnell 2014). In our setting, at any time step t, the expert has access to the fully supervised reward vector r_t for the context x_t. When estimating the roll-out value for an action a, the expert returns the true reward value r_t(a) for this action, and we use this as our estimate for the roll-out value.
4 Theoretical Guarantees

We analyze MÊLÉE, showing that the no-regret property of AggreVaTe can be leveraged in our meta-learning setting for learning contextual bandit exploration. In particular, we first relate the regret of the learner in line 16 to the overall regret of π. This shows that, if the underlying classifier improves sufficiently quickly, MÊLÉE will achieve sublinear regret. We then show that for a specific choice of underlying classifier (BANDITRON), this is achieved. MÊLÉE is an instantiation of AggreVaTe (Ross and Bagnell 2014); as such, it inherits AggreVaTe's regret guarantees.

Theorem 1  After N episodes, if LEARN (line 16) is a no-regret algorithm, then as N → ∞, with probability 1, it holds that J(π̄) ≥ J(π*) − 2T √(K ε̂_class(T)), where J(·) is the reward of the exploration policy, π̄ is the average policy returned, and ε̂_class(T) is the average regression regret for each π_n accurately predicting Q*:

  ε̂_class(T) = min_{π ∈ Π} (1/N) Σ_{i=1}^N E[ Q̂*_{T−t+1}(s, π) − min_a Q*_{T−t+1}(s, a) ]

the empirical minimum expected cost-sensitive classification regret achieved by policies in the class Π on all the data over the N iterations of training, when compared to the Bayes-optimal regressor, for t ∼ U(T), s ∼ d^t_{π_i}, with U(T) the uniform distribution over {1, ..., T}, d^t_π the distribution of states at time t induced by executing policy π, and Q* the cost-to-go of the expert.

Thus, achieving low regret on the problem of learning π from the training data it observes ("D" in MÊLÉE), i.e., making ε̂_class(T) small, translates into low regret in the contextual-bandit setting. At first glance this bound looks like it may scale linearly with T. Note, however, that the bound in Theorem 1 depends on ε̂_class(T), and that s is a combination of the context vector x_t and the classification function f_t. As T → ∞, one would hope that f_t improves significantly and ε̂_class(T) decays quickly. Thus, sublinear regret may still be achievable when f learns sufficiently quickly as a function of T. For instance, if f is optimizing a strongly convex loss function, online gradient descent achieves a regret guarantee of O(log T / T) (Hazan et al. 2016, Theorem 3.3), potentially leading to a regret for MÊLÉE of O(√((log T)/T)).

The above statement is informal (it does not take into account the interaction between learning f and π). However, we can give a specific concrete example: we analyze MÊLÉE's test-time behavior when the underlying learning algorithm is BANDITRON. BANDITRON is a variant of the multiclass perceptron that operates under bandit feedback. Details of this analysis (and proofs, which directly follow the original BANDITRON analysis) are given in the appendix.
5 Experimental Setup and Results

Using a collection of synthetically generated classification problems, we train an exploration policy π using MÊLÉE (Alg 1). This exploration policy learns to explore on the basis of calibrated probabilistic predictions from f together with a predefined set of exploration features (§3). Once π is learned and fixed, we follow the test-time behavior described in §3 to evaluate π on a set of contextual bandit problems. We evaluate MÊLÉE on a natural learning to rank task (§5). To ensure that the performance of MÊLÉE generalizes beyond this single learning to rank task, we additionally perform a thorough evaluation on 300 "simulated" contextual bandit problems derived from standard classification tasks.

In all cases, the underlying classifier f is a linear model trained with an optimizer that runs stochastic gradient descent. We seek to answer two questions experimentally:
1. How does MÊLÉE compare empirically to alternative (expert-designed) exploration strategies?
2. How important are the additional features used by MÊLÉE in comparison to using only calibrated probability predictions from f as features?

Training Datasets

In our experiments, we follow Konyushkova, Sznitman, and Fua (2017) (and also Peters et al. (2014), in a different setting) and train the exploration policy π only on synthetic data. This is possible because the exploration policy π never makes use of x explicitly and instead only accesses it via f_t's behavior on it. We generate datasets with uniformly distributed class-conditional distributions. The datasets are always two-dimensional.
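The paper specifies the training tasks only as two-dimensional datasets with uniformly distributed class-conditional distributions; the sketch below is one plausible reading of that recipe (the box bounds, the number of classes, and the task sizes are our assumptions), producing the (x, r) pairs with one-hot rewards used to model classification problems in Section 3.

```python
import numpy as np

def make_synthetic_task(num_classes, n_examples, rng):
    """One synthetic meta-training task: 2-D inputs drawn uniformly from a per-class box,
    paired with one-hot reward vectors (reward 1 for the true label, 0 otherwise)."""
    lows = rng.uniform(0.0, 0.5, size=(num_classes, 2))            # random box per class
    highs = lows + rng.uniform(0.2, 0.5, size=(num_classes, 2))
    task = []
    for _ in range(n_examples):
        y = int(rng.integers(num_classes))
        x = rng.uniform(lows[y], highs[y])                         # uniform class-conditional draw
        r = np.zeros(num_classes)
        r[y] = 1.0                                                 # one-hot reward vector
        task.append((x, r))
    return task

# Example usage: a pool of synthetic tasks for meta-training.
rng = np.random.default_rng(0)
training_tasks = [make_synthetic_task(num_classes=3, n_examples=200, rng=rng) for _ in range(32)]
```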

Evaluation Methodology

For evaluation, we use progressive validation (Blum, Kalai, and Langford 1999), which exactly computes the reward of the algorithm. Specifically, to evaluate the performance of an exploration algorithm A on a dataset S of size n, we compute the progressive validation return G(A) = (1/n) Σ_{t=1}^n r_t(a_t), the average reward up to n, where a_t is the action chosen by the algorithm A and r_t is the true reward. Progressive validation is particularly suitable for measuring the effectiveness of an exploration algorithm, since the decision on whether to exploit or explore at earlier time steps will affect the performance on the examples observed later.

Because our evaluation is over 300 datasets, we report aggregate results in terms of Win/Loss Statistics: we compare two exploration methods by counting the number of statistically significant wins and losses. An exploration algorithm A wins over another algorithm B if the progressive validation return G(A) is statistically significantly larger than B's return G(B) at the 0.01 level, using a paired sample t-test.
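Both quantities are easy to compute from logged per-run rewards; the sketch below assumes each algorithm's progressive validation returns are recorded for paired runs (e.g., shuffles of the same dataset) and uses SciPy's paired t-test, which is one standard implementation of the test described above.

```python
import numpy as np
from scipy.stats import ttest_rel

def progressive_validation_return(rewards):
    """G(A) = (1/n) * sum_t r_t(a_t): average observed reward over the interaction."""
    return float(np.mean(np.asarray(rewards, dtype=float)))

def wins(returns_a, returns_b, alpha=0.01):
    """True if algorithm A statistically significantly beats B across paired runs."""
    t_stat, p_value = ttest_rel(returns_a, returns_b)
    return bool(p_value < alpha and np.mean(returns_a) > np.mean(returns_b))

# Win-minus-loss score for A against B over many datasets, each with paired runs:
# score = sum(wins(a_runs, b_runs) for a_runs, b_runs in per_dataset_runs) \
#         - sum(wins(b_runs, a_runs) for a_runs, b_runs in per_dataset_runs)
```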
Experimental Results

Learning to Rank  We evaluate MÊLÉE on a natural learning to rank dataset. The dataset we consider is the Microsoft Learning to Rank dataset, variant MSLR-10K (Qin and Liu 2013). The dataset consists of feature vectors extracted from query-url pairs along with relevance judgment labels. The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant). In our experiments, we limit the labels to the two extremes, 0 and 4, and drop the queries not labelled with either extreme. A query-url pair is represented by a 136-dimensional feature vector. The dataset is highly imbalanced, as the number of irrelevant queries is much larger than the number of relevant ones. To address this, we subsample the irrelevant queries to match the number of relevant ones. To avoid correlations between the observed query-url pairs, we group the queries by query ID and sample a single query from each group. We convert relevance scores to losses, with 0 indicating a perfectly relevant document and 1 an irrelevant one.

Figure 1: Win/Loss counts for all pairs of algorithms over 16 random shuffles for the MSLR-10K dataset.
Figure 2: Learning curve on the MSLR-10K dataset: the x-axis shows the number of queries observed, and the y-axis shows the progressive reward.

Figure 2 shows the evaluation results on a subset of the MSLR-10K dataset. Since performance is closely matched between the different exploration algorithms, we repeat the experiment 16 times with randomly shuffled permutations of the MSLR-10K dataset. Figure 2 shows the learning curve of the trained policy π as well as the baselines. Here, we see that MÊLÉE quickly achieves high reward; after about 100 examples the two strongest baselines catch up, and by 200 examples all approaches have asymptoted. We exclude LinUCB from these runs because the required matrix inversions made it too computationally expensive.¹ Figure 1 shows statistically significant win/loss differences for each of the algorithms across these 16 shuffles. Each row/column entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE is the only algorithm that always wins more than it loses against other algorithms, and it outperforms the nearest competition (ε-decreasing) by 3 points.

¹ In a single run of LinUCB we observed that its performance is on par with ε-greedy.

Figure 3: Behavior of MÊLÉE in comparison to baseline and state-of-the-art exploration algorithms: a representative learning curve on dataset #1144.
Figure 4: Behavior of MÊLÉE in comparison to baseline and state-of-the-art exploration algorithms. Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE outperforms the nearest competition (ε-decreasing) by 23.
Figure 5: MÊLÉE vs. ε-decreasing; every point represents one dataset; the x-axis shows the reward of MÊLÉE, the y-axis shows ε-decreasing, and red dots represent statistically significant runs.

Simulated Contextual Bandit Tasks  We perform an exhaustive evaluation on simulated contextual bandit tasks to ensure that the performance of MÊLÉE generalizes beyond learning to rank. Following Bietti, Agarwal, and Langford (2018), we use a collection of 300 binary classification datasets from openml.org for evaluation. These datasets cover a variety of domains, including text and image processing, medical data, and sensory data. We convert the classification datasets into cost-sensitive classification problems using a 0/1 encoding. Given these fully supervised cost-sensitive multi-class datasets, we simulate the contextual bandit setting by revealing only the reward for the selected action.

In Figure 3, we show a representative learning curve. Here, we see that as more data becomes available, all the approaches improve (except τ-first, which has ceased to learn after 2% of the data). MÊLÉE, in particular, is able to very quickly achieve near-optimal performance (in around 40 examples), compared with at least 200 examples for the best baseline.

In Figure 4, we show statistically significant win/loss differences for each of the algorithms. Here, each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE is the only algorithm that always wins more than it loses against other algorithms, and it outperforms the nearest competition (ε-decreasing) by 23 points.

To understand more directly how MÊLÉE compares to ε-decreasing, in Figure 5 we show a scatter plot of the rewards achieved by MÊLÉE (x-axis) and ε-decreasing (y-axis) on each of the 300 datasets, with statistically significant differences highlighted in red and insignificant differences in blue. Points below the diagonal line correspond to better performance by MÊLÉE (147 datasets) and points above to ε-decreasing (124 datasets); the remaining 29 showed no statistically significant difference.

Finally, we consider the effect that the additional features have on MÊLÉE's performance. In particular, we compare a version of MÊLÉE with all features (the version used in all other experiments) against an ablated version that only has access to the (calibrated) probabilities of each action from the underlying classifier f. The comparison is shown as a scatter plot in Figure 6. Here, we can see that the full feature set does provide lift over just the calibrated probabilities, with a win-minus-loss improvement of 24 from adding the additional features from which to learn to explore.

Figure 6: MÊLÉE vs. MÊLÉE using only the calibrated prediction probabilities (x-axis). MÊLÉE gains additional leverage when using all the features.
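The supervised-to-bandit conversion described above is mechanical; the sketch below shows one way to wrap a fully labeled dataset so that only the chosen action's reward is revealed (the class and method names are our own, not from the paper's code).

```python
import numpy as np

class SimulatedBandit:
    """Wrap a fully supervised dataset so an agent sees only the reward of its chosen action."""
    def __init__(self, features, labels, num_classes):
        self.features = np.asarray(features, dtype=float)
        self.labels = np.asarray(labels, dtype=int)
        self.num_classes = num_classes

    def __len__(self):
        return len(self.labels)

    def context(self, t):
        return self.features[t]

    def pull(self, t, action):
        # 0/1 encoding: reward 1 if the chosen action equals the true label, else 0.
        return 1.0 if action == self.labels[t] else 0.0
```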
Broader Impacts

The motivation of this work is to give practitioners another tool in their toolbox to learn better exploration strategies for contextual bandits. Our primary target stakeholder population is such machine learning practitioners and data scientists. Secondarily, as that primary stakeholder population builds and deploys algorithms, those who are impacted by those algorithms through direct or indirect use will, we hope, benefit from using a better exploration algorithm. A possible risk of our algorithm, if deployed, is how the explored actions would affect the fairness of the learned model with respect to different demographic groups. On the positive side, we provide theoretical guarantees for the regret bounds achieved by MÊLÉE in a simplified setting. Overall, while there are real concerns about how this technology might be deployed, our hope is that the positive impacts outweigh the negatives, specifically because standard best-use practices should mitigate most of the risks.

Acknowledgments

We thank members of the CLIP lab for reviewing earlier versions of this work. This material is based upon work supported by the National Science Foundation under Grant No. 1618193. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. E. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML), 1638–1646.
Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 3981–3989.
Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3: 397–422.
Bachman, P.; Sordoni, A.; and Trischler, A. 2017. Learning algorithms for active learning. In ICML.
Beygelzimer, A.; and Langford, J. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–138. ACM.
Bietti, A.; Agarwal, A.; and Langford, J. 2018. A Contextual Bandit Bake-off. Working paper or preprint.
Blum, A.; Kalai, A.; and Langford, J. 1999. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 203–208. ACM.
Breiman, L. 2001. Random Forests. Machine Learning 45(1): 5–32. doi:10.1023/A:1010933404324.
Chang, K.; Krishnamurthy, A.; Agarwal, A.; Daumé III, H.; and Langford, J. 2015. Learning to Search Better Than Your Teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2058–2066. JMLR.org.
Daumé III, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3): 297–325. doi:10.1007/s10994-009-5106-x.
Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
Fang, M.; Li, Y.; and Cohn, T. 2017. Learning how to Active Learn: A Deep Reinforcement Learning Approach. In EMNLP.
Feraud, R.; Allesiardo, R.; Urvoy, T.; and Clérot, F. 2016. Random Forest for the Contextual Bandit Problem. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51 of Proceedings of Machine Learning Research, 93–101. Cadiz, Spain: PMLR.
Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-Reinforcement Learning of Structured Exploration Strategies. arXiv preprint arXiv:1802.07245.
Hazan, E.; et al. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3-4): 157–325.
Horvitz, D. G.; and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260): 663–685. doi:10.1080/01621459.1952.10483446.
Kaelbling, L. P. 1994. Associative reinforcement learning: Functions in k-DNF. Machine Learning 15(3): 279–298.
Karnin, Z. S.; and Anava, O. 2016. Multi-armed Bandits: Competing with Optimal Sequences. In Advances in Neural Information Processing Systems 29, 199–207. Curran Associates, Inc.
Konyushkova, K.; Sznitman, R.; and Fua, P. 2017. Learning Active Learning from Data. In Advances in Neural Information Processing Systems.
Langford, J.; and Zadrozny, B. 2005. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 473–480. ACM.
Langford, J.; and Zhang, T. 2008. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In Advances in Neural Information Processing Systems 20, 817–824. Curran Associates, Inc.
Li, K.; and Malik, J. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010a. A Contextual-bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), 661–670. New York, NY, USA: ACM. doi:10.1145/1772690.1772758.
Li, W.; Wang, X.; Zhang, R.; Cui, Y.; Mao, J.; and Jin, R. 2010b. Exploitation and Exploration in a Performance Based Contextual Advertising System. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 27–36. New York, NY, USA: ACM.
Lin, H.-T.; Lin, C.-J.; and Weng, R. C. 2007. A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68(3): 267–276. doi:10.1007/s10994-007-5018-6.
Maes, F.; Wehenkel, L.; and Ernst, D. 2012. Meta-learning of exploration/exploitation strategies: The multi-armed bandit case. In International Conference on Agents and Artificial Intelligence, 100–115. Springer.
Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems 29, 4026–4034. Curran Associates, Inc.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Peters, J.; Mooij, J. M.; Janzing, D.; and Schölkopf, B. 2014. Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15(1): 2009–2053.
Platt, J. C. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, 61–74. MIT Press.
Qin, T.; and Liu, T. 2013. Introducing LETOR 4.0 Datasets. CoRR abs/1306.2597. URL http://arxiv.org/abs/1306.2597.
Ross, S.; and Bagnell, J. A. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.
Ross, S.; Gordon, G.; and Bagnell, J. A. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of Proceedings of Machine Learning Research, 627–635. Fort Lauderdale, FL, USA: PMLR.
Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1038–1044.
Sutton, R. S.; and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition. ISBN 0262193981.
Woodward, M.; and Finn, C. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.
Xu, T.; Liu, Q.; Zhao, L.; Xu, W.; and Peng, J. 2018. Learning to Explore with Meta-Policy Gradient. arXiv preprint arXiv:1803.05044.
Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.