PRELIMINARY VERSION: DO NOT CITE. The AAAI Digital Library will contain the published version some time after the conference.

Meta-Learning Effective Exploration Strategies for Contextual Bandits

Amr Sharaf,1 Hal Daumé III1,2
1 University of Maryland  2 Microsoft Research
[email protected], [email protected]

Abstract

In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, MÊLÉE, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. MÊLÉE uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate MÊLÉE on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. MÊLÉE outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.

1 Introduction

In a contextual bandit problem, an agent attempts to optimize its behavior over a sequence of rounds based on limited feedback (Kaelbling 1994; Auer 2003; Langford and Zhang 2008). In each round, the agent chooses an action based on a context (features) for that round, and observes a reward for that action but no others (§2). The feedback is partial: the agent only observes the reward for the action it selected, and for no other actions. This is strictly harder than supervised learning, in which the agent observes the reward for all available actions. Contextual bandit problems arise in many real-world settings such as learning to rank for information retrieval, online recommendation, and personalized medicine.

As in reinforcement learning, the agent must learn to balance exploitation (taking actions that, based on past experience, it believes will lead to high instantaneous reward) and exploration (trying new actions). However, contextual bandit learning is easier than reinforcement learning: the agent needs to take only one action, not a sequence of actions, and therefore does not face a credit assignment problem.

In this paper, we present a meta-learning approach to automatically learn a good exploration strategy from data. To achieve this, we use synthetic datasets on which we can simulate contextual bandit tasks in an offline setting. Based on these simulations, our algorithm, MÊLÉE (MEta LEarner for Exploration), learns a good heuristic exploration strategy that generalizes to future contextual bandit problems. MÊLÉE contrasts with more classical approaches to exploration (like ε-greedy or LinUCB), in which exploration strategies are constructed by expert algorithm designers. These approaches often achieve provably good exploration strategies in the worst case, but are potentially overly pessimistic and are sometimes computationally intractable.

MÊLÉE is an example of meta-learning in which we replace a hand-crafted learning algorithm with a learned learning algorithm. At training time (§3), MÊLÉE simulates many contextual bandit problems from fully labeled synthetic data. Using this data, in each round, MÊLÉE is able to counterfactually simulate what would happen under all possible action choices. We can then use this information to compute regret estimates for each action, which can be optimized using the AggreVaTe imitation learning algorithm (Ross and Bagnell 2014). Our imitation learning strategy mirrors the meta-learning approach of Bachman, Sordoni, and Trischler (2017) in the active learning setting. We present a simplified, stylized analysis of the behavior of MÊLÉE to ensure that our cost function encourages good behavior (§4), and show that MÊLÉE enjoys the no-regret guarantees of the AggreVaTe imitation learning algorithm.

Empirically, we use MÊLÉE to train an exploration policy on only synthetic datasets and evaluate this policy on both a contextual bandit task based on a natural learning to rank dataset as well as three hundred simulated contextual bandit tasks (§5). We compare the trained policy to a number of alternative exploration algorithms, and show that MÊLÉE outperforms alternative exploration strategies in most settings.
2 Preliminaries: Contextual Bandits and Policy Optimization

Contextual bandits is a model of interaction in which an agent chooses actions (based on contexts) and receives immediate rewards for that action alone. For example, in a simplified news personalization setting, at each time step t, a user arrives and the system must choose a news article to display to them. Each possible news article corresponds to an action a, and the user corresponds to a context x_t. After the system chooses an article a_t to display, it can observe, for instance, the amount of time that the user spends reading that article, which it can use as a reward r_t(a_t).

Formally, we largely follow the setup and notation of Agarwal et al. (2014). Let X be an input space of contexts (users) and [K] = {1, ..., K} be a finite action space (articles). We consider the statistical setting in which there exists a fixed but unknown distribution D over pairs (x, r) ∈ X × [0,1]^K, where r is a vector of rewards (for convenience, we assume all rewards are bounded in [0,1]). In this setting, the world operates iteratively over rounds t = 1, 2, .... Each round t:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent (randomly) chooses action a_t ∈ [K] based on x_t, and observes reward r_t(a_t).

The goal of an algorithm is to maximize the cumulative sum of rewards over time. Typically the primary quantity considered is the average regret of a sequence of actions a_1, ..., a_T with respect to the behavior of the best possible function in a prespecified class F:

  Reg(a_1, ..., a_T) = max_{f ∈ F} (1/T) Σ_{t=1}^T [ r_t(f(x_t)) − r_t(a_t) ]    (1)

A no-regret agent has zero average regret in the limit of large T.
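To make the protocol above concrete, the following is a minimal sketch (ours, not the authors' code) of the round-by-round interaction on a synthetic task where the simulator knows the full reward table but the agent sees only the reward of its chosen action; the `agent` object with `act`/`update` methods and the uniform-random baseline are illustrative assumptions.

```python
import numpy as np

class UniformAgent:
    """Trivial baseline: explore uniformly at random, ignore the history."""
    def act(self, x_t, num_actions, rng):
        a_t = int(rng.integers(num_actions))
        return a_t, 1.0 / num_actions          # chosen action and its logging probability

    def update(self, x_t, a_t, r_t, p_t):
        pass                                    # a real agent would append to its history here

def run_contextual_bandit(agent, contexts, reward_table, rng):
    """Simulate the interaction protocol of Section 2 and return the average reward."""
    T, K = reward_table.shape
    total = 0.0
    for t in range(T):
        x_t = contexts[t]                       # 1. world reveals context x_t
        a_t, p_t = agent.act(x_t, K, rng)       # 2. agent (randomly) chooses a_t
        r_t = reward_table[t, a_t]              # only r_t(a_t) is revealed to the agent
        agent.update(x_t, a_t, r_t, p_t)
        total += r_t
    return total / T                            # the quantity regret in Eq. (1) is measured against

# Tiny usage example with random contexts and rewards.
rng = np.random.default_rng(0)
avg_reward = run_contextual_bandit(UniformAgent(), rng.normal(size=(100, 5)),
                                   rng.uniform(size=(100, 3)), rng)
```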
To produce a good agent for interacting with the world, we assume access to a function class F and to an oracle policy optimizer for that function class, POLOPT. For example, F may be a set of single-layer neural networks mapping user features x ∈ X to predicted rewards for actions a ∈ [K]. Formally, the observable record of interaction resulting from round t is the tuple (x_t, a_t, r_t(a_t), p_t(a_t)) ∈ X × [K] × [0,1] × [0,1], where p_t(a_t) is the probability that the agent chose action a_t, and the full history of interaction is h_t = ⟨(x_i, a_i, r_i(a_i), p_i(a_i))⟩_{i=1}^t. The oracle policy optimizer, POLOPT, takes as input a history of user interactions and outputs an f ∈ F with low expected regret.

An example of the oracle policy optimizer POLOPT is to combine inverse propensity scaling (IPS) with a regression algorithm (Horvitz and Thompson 1952). Here, given a history h, each tuple (x, a, r, p) in that history is mapped to a multiple-output regression example. The input for this regression example is the same x; the output is a vector of K costs, all of which are zero except the a-th component, which takes value r/p. This mapping is done for all tuples in the history, and a supervised learning algorithm on the function class F is used to produce a low-regret regression function f. This is the function returned by the oracle policy optimizer POLOPT. IPS has the property of being unbiased; however, it often suffers from large variance.

The direct method (DM) (Dudik et al. 2011) is another kind of oracle policy optimizer POLOPT that has lower variance than IPS. The direct method estimates the reward function directly from the history h without importance sampling, and uses this estimate to learn a low-regret function f. In our experiments, we use the direct method, largely for its low variance and simplicity. However, MÊLÉE is agnostic to the type of the estimator used by the oracle policy optimizer POLOPT.
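As an illustration only (not the authors' implementation), the sketch below realizes an IPS-style POLOPT: each logged tuple (x, a, r, p) becomes a K-output regression example whose a-th target is r/p, and a standard regressor standing in for F is fit to the result; the ridge regressor and its hyperparameters are stand-in assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def polopt_ips(history, num_actions):
    """IPS-based oracle policy optimizer: history -> reward predictor f(x) in R^K.

    history: list of (x, a, r, p) tuples as defined in Section 2.
    """
    if not history:
        # No data yet: predict zero reward for every action.
        return lambda x: np.zeros(num_actions)
    X = np.stack([np.asarray(x, dtype=float) for (x, _, _, _) in history])
    Y = np.zeros((len(history), num_actions))
    for i, (_, a, r, p) in enumerate(history):
        Y[i, a] = r / p                      # IPS target: unbiased, but variance grows as p shrinks
    model = Ridge(alpha=1.0).fit(X, Y)       # any supervised learner over F could be used here
    return lambda x: model.predict(np.asarray(x, dtype=float).reshape(1, -1))[0]
```

A direct-method variant would instead set Y[i, a] = r (regressing on the observed rewards without the 1/p weighting), trading unbiasedness for the lower variance the paper prefers.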
3 Approach: Learning an Effective Exploration Strategy

In order to have an effective approach to the contextual bandit problem, one must be able both to optimize a policy based on historic data and to make decisions about how to explore. The exploration/exploitation dilemma is fundamentally about long-term payoffs: is it worth trying something potentially suboptimal now in order to learn how to behave better in the future? A particularly simple and effective form of exploration is ε-greedy: given a function f output by POLOPT, act according to f(x) with probability (1 − ε) and act uniformly at random with probability ε. Intuitively, one would hope to improve on a strategy like ε-greedy by taking more (any!) information into account, for instance, basing the probability of exploration on f's uncertainty.

In this section, we describe MÊLÉE, first by showing how it operates in a Markov Decision Process (§3), and then by showing how to train it using synthetic simulated contextual bandit problems based on imitation learning (§3).

More formally, let π be the exploration policy we are learning, which takes two inputs: a function f ∈ F and a context x, and outputs an action. In our example, f will be the output of the policy optimizer on all historic data, and x will be the current user. This is used to produce an agent which interacts with the world, maintaining an initially empty history buffer h, as:

1. The world draws (x_t, r_t) ∼ D and reveals context x_t.
2. The agent computes f_t = POLOPT(h_{t−1}) and a greedy action ã_t = π(f_t, x_t).
3. The agent plays a_t = ã_t with probability (1 − µ), and a_t uniformly at random otherwise.
4. The agent observes r_t(a_t).
5. The agent appends (x_t, a_t, r_t(a_t), p_t) to the history to form h_t, where p_t = µ/K if a_t ≠ ã_t, and p_t = 1 − µ + µ/K if a_t = ã_t.

Here, f_t is the function optimized on the historical data, and π uses it and x_t to choose an action. Intuitively, π might choose to use the prediction f_t(x_t) most of the time, unless f_t is quite uncertain on this example, in which case π might choose to return the second (or third) most likely action according to f_t. The agent then performs a small amount of additional µ-greedy-style exploration: most of the time it acts according to π, but occasionally it explores some more. In practice (§5), we find that setting µ = 0 is optimal in aggregate, but non-zero µ is necessary for our theory (§4).
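The five steps above translate directly into a short loop; this is a sketch under our own naming assumptions (`pi`, `polopt`, and `reward_oracle` are stand-ins for the learned exploration policy, the oracle optimizer, and the environment), and it records the logging probability p_t exactly as in step 5.

```python
import numpy as np

def deploy(pi, polopt, contexts, reward_oracle, num_actions, mu, rng):
    """Test-time behavior of Section 3: act via pi(f_t, x_t) with extra mu-greedy exploration."""
    history, rewards = [], []
    for t, x_t in enumerate(contexts):
        f_t = polopt(history, num_actions)          # step 2: refit on the history so far
        a_greedy = pi(f_t, x_t)                     # pi decides whether to follow f_t or deviate
        if rng.random() < mu:                       # step 3: occasional uniform exploration
            a_t = int(rng.integers(num_actions))
        else:
            a_t = a_greedy
        p_t = (1 - mu + mu / num_actions) if a_t == a_greedy else mu / num_actions
        r_t = reward_oracle(t, a_t)                 # step 4: observe reward for a_t only
        history.append((x_t, a_t, r_t, p_t))        # step 5: append to the history
        rewards.append(r_t)
    return float(np.mean(rewards))
```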
Markov Decision Process Formulation

We model the exploration/exploitation task as a Markov Decision Process (MDP). Given a context vector x and a function f output by POLOPT on rounds 1, ..., t−1, the agent learns an exploration policy π to decide whether to exploit by acting according to f(x), or explore by choosing a different action a ≠ f(x). We model this as an MDP, represented as a tuple ⟨A, S, s_0, T, R⟩, where A = {a} is the set of all actions, S = {s} is the space of all possible states, s_0 is a starting state (or initial distribution), R(s, a) is the reward function, and T(s′ | s, a) is the transition function. We describe these below.

States  A state s in our MDP represents all past information seen in the world (there are a very large number of states). At time t, the state s_t includes: all past experiences (x_i, a_i, r_i(a_i))_{i=1}^{t−1}, the current context x_t, and the current function f_t computed by running POLOPT on the past experiences. For the purposes of learning, this state space is far too large to be practical, so we model each state using a set of exploration features. The exploration policy π is trained based on exploration features (Alg 1, line 12). These features are allowed to depend on the current classifier f_t, and on any part of the history except the inputs x_t, in order to maintain task independence. We additionally ensure that these features are independent of the dimensionality of the inputs, so that π can generalize to datasets of arbitrary dimensions. The specific features we use are listed below; they are largely inspired by Konyushkova, Sznitman, and Fua (2017), but adapted to our setting.

Importantly, we wish to train π using one set of tasks (for which we have fully supervised data on which to run simulations) and apply it to wholly different tasks (for which we only have bandit feedback). To achieve this, we allow π to depend representationally on f_t in arbitrary ways: for instance, it might use features that capture f_t's uncertainty on the current example. We additionally allow π to depend in a task-independent manner on the history (for instance, which actions have not yet been tried): it can use features of the actions, rewards and probabilities in the history, but not depend directly on the contexts x. This is to ensure that π only learns to explore and not also to solve the underlying task-dependent classification problem. Because π needs to learn to be task independent, we found that if f_t's predictions were uncalibrated, it was very difficult for π to generalize well to unseen tasks. Therefore, we additionally allow π to depend on a very small amount of fully labeled data from the task at hand, which we use to allow π to calibrate f_t's predictions. In our experiments we use only 30 fully labeled examples, but alternative approaches to calibrating f_t that do not require this data would be preferable.

The features of f_t that we use are: a) the predicted probability p(a_t | f_t, x_t), where we use a softmax over the predicted rewards from f_t to convert them to probabilities; b) the entropy of the predicted probability distribution; c) a one-hot encoding of the predicted action f_t(x_t). The features of h_{t−1} that we use are: a) the current time step t; b) normalized counts for all previous actions predicted so far; c) the average observed reward for each action; d) the empirical variance of the observed rewards for each action in the history. We use Platt's scaling (Platt 1999; Lin, Lin, and Weng 2007) to calibrate the predicted probabilities; Platt's scaling works by fitting a logistic regression model to the classifier's predicted scores.

Actions  At each state, our learned exploration policy π must take an input state s_t (described above) and make a decision. Its action space A is the same action space as that of the contextual bandit problem it is trying to solve. If π chooses to take the same action as f, then we interpret this as an "exploitation" step, and if it takes another action, we interpret this as an "exploration" step.

Transitions  Each episode starts off with a new contextual bandit task and an empty history h_0 = {}. The subsequent steps in the episode involve observing context vectors x_1, ..., x_T from the new contextual bandit task. A single transition in the episode consists of the exploration policy π being given the state s containing information about the current context vector x_t and the history h_{t−1}, using which the exploration policy π chooses the next action a. The transition function T(s′ | s, a) incorporates the action a chosen by the exploration policy in state s along with the features representing the current state s, and produces the next state s′, which represents a new feature vector x_{t+1}. The episode terminates whenever all the context vectors in the contextual bandit task have been exhausted. During the test phase, each contextual bandit task is handled only once, in a single episode.

Rewards  The reward function is chosen so that reward maximization by the learned policy is equivalent to low regret in the contextual bandit problem. Formally, at state s_t, let r_t(·) be the reward function for the contextual bandit task at that state. The reward function is R(s, a) = r_t(a).

Initial State  The initial state distribution is formed by drawing a new contextual bandit task at random, setting the history to the empty set, and initializing the first context x_1 as the first example in that contextual bandit task.
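For concreteness, here is a sketch of a feature extractor in the spirit of the lists above (the exact normalizations are our assumptions, and Platt-scaling calibration of the scores is omitted for brevity); note that its output dimension depends only on the number of actions, never on the context dimension.

```python
import numpy as np

def exploration_features(f_t, x_t, history, t, num_actions):
    """Feature map Phi(s) over (f_t, x_t, h_{t-1}); independent of the context dimension."""
    scores = np.asarray(f_t(x_t), dtype=float)          # predicted rewards for each action
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                                 # softmax over predicted rewards
    entropy = -np.sum(probs * np.log(probs + 1e-12))     # uncertainty of f_t on this example
    one_hot = np.eye(num_actions)[int(np.argmax(scores))]

    # History statistics: counts, mean reward, and reward variance per action.
    counts = np.zeros(num_actions)
    sums = np.zeros(num_actions)
    sq_sums = np.zeros(num_actions)
    for (_, a, r, _) in history:
        counts[a] += 1.0
        sums[a] += r
        sq_sums[a] += r * r
    means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    variances = np.divide(sq_sums, counts, out=np.zeros_like(sq_sums), where=counts > 0) - means ** 2
    norm_counts = counts / max(1, len(history))

    return np.concatenate([probs, [entropy], one_hot, [float(t)], norm_counts, means, variances])
```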
Training MÊLÉE by Imitation Learning

The meta-learning challenge is: how do we learn a good exploration policy π? We assume we have access to fully labeled data on which we can train π; this data must include context/reward pairs, where the reward for all actions is known. This is a weak assumption: in practice, we use purely synthetic data as this training data; one could alternatively use any fully labeled classification dataset, as in Beygelzimer and Langford (2009). Under this assumption about the data, and with our model of behavior as an MDP, a natural class of learning algorithms to consider for learning π are imitation learning algorithms (Daumé, Langford, and Marcu 2009; Ross, Gordon, and Bagnell 2011; Ross and Bagnell 2014; Chang et al. 2015). In other work on meta-learning, such problems are often cast as full reinforcement-learning problems. We opt for imitation learning instead because it is computationally attractive and effective when a simulator exists.

Informally, at training time, MÊLÉE will treat one of these synthetic datasets as if it were a contextual bandit dataset. At each time step t, it will compute f_t by running POLOPT on the historical data, and then consider: for each action, what would the long-term reward look like if I were to take this action? Because the training data for MÊLÉE is fully labeled, this can be evaluated for each possible action, and a policy π can be learned to maximize these rewards.

More formally, in imitation learning, we assume training-time access to an expert, π*, whose behavior we wish to learn to imitate at test time. Because we can train on fully supervised training sets, we can easily define an optimal reference policy π*, which "cheats" at training time by looking at the true labels: in particular, π* can always pick the correct action (i.e., the action that maximizes future rewards) at any given state. The learning problem is then to estimate π to have behavior as similar to π* as possible, but without access to those labels.

Suppose we wish to learn an exploration policy π for a contextual bandit problem with K actions. We assume access to M supervised learning datasets S_1, ..., S_M, where each S_m = {(x_1, r_1), ..., (x_{N_m}, r_{N_m})} is of size N_m, each x_n is from a (possibly different) input space X_m, and the reward vectors are all in [0,1]^K. In particular, multi-class classification problems are modeled by setting the reward for the correct label to one and the reward for all other labels to zero.

The imitation learning algorithm we use is AggreVaTe (Ross and Bagnell 2014) (closely related to DAgger (Ross, Gordon, and Bagnell 2011)), instantiated for the contextual bandit meta-learning problem in Alg 1. AggreVaTe learns to choose actions to minimize the cost-to-go of the expert rather than the zero-one classification loss of mimicking its actions. On the first iteration, AggreVaTe collects data by observing the expert perform the task, and in each trajectory, at time t, explores an action a in state s and observes the cost-to-go Q*_t(s, a) of the expert after performing this action, defined as:

  Q*_t(s, a) = r_t(a) + E_{s′ ∼ T(· | s, a), a′ ∼ π*(· | s′)} [ Q*_{t+1}(s′, a′) ]    (2)

where the expectation is taken over the randomness of the policy π* and the MDP.

Each such step generates a cost-weighted training example (s, t, a, Q*), and AggreVaTe trains a policy π_1 to minimize the expected cost-to-go on this dataset. At each following iteration n, AggreVaTe collects data through interaction with the learner as follows: for each trajectory, begin by using the current learner's policy π_n to perform the task, interrupt at time t, explore a roll-in action a in the current state s, after which control is given back to the expert to continue up to the time horizon T. This results in new examples of the cost-to-go (roll-out value) of the expert, (s, t, a, Q*), under the distribution of states visited by the current policy π_n. This new data is aggregated with all previous data to train the next policy π_{n+1}; more generally, this data can be used by a no-regret online learner to update the policy and obtain π_{n+1}. This is iterated for some number of iterations N, and the best policy found is returned.

Algorithm 1  MÊLÉE (supervised training sets {S_m}, hypothesis class F, exploration rate µ, number of validation examples N_Val, feature extractor Φ)
 1: initialize meta-dataset D = {}
 2: for episode n = 1, 2, ..., N do
 3:   choose S at random from {S_m}, and set history h_0 = {}
 4:   partition and permute S randomly into train Tr and validation Val, where |Val| = N_Val
 5:   for round t = 1, 2, ..., |Tr| do
 6:     let (x_t, r_t) = Tr_t and s_t = ⟨(x_i, a_i, r_i(a_i))_{i=1}^{t−1}, x_t, f_t⟩
 7:     for each action a = 1, ..., K do
 8:       f_{t,a} = POLOPT(F, h_{t−1} ⊕ (x_t, a, r_t(a), 1 − (K−1)µ/K)) on the augmented history
 9:       roll-out: estimate Q*(s_t, a), the cost-to-go of a, using r_t(a) and a roll-out policy π_out on f_{t,a}
10:     end for
11:     compute f_t = POLOPT(F, h_{t−1})
12:     D ← D ⊕ ⟨Φ(s_t), (Q*(s_t, 1), ..., Q*(s_t, K))⟩
13:     roll-in: sample a_t ∼ (µ/K) 1_K + (1 − µ) π_{n−1}(f_t, x_t), with probability p_t, where 1_K is the ones vector
14:     append history h_t ← h_{t−1} ⊕ (x_t, a_t, r_t(a_t), p_t)
15:   end for
16:   update π_n = LEARN(D)
17: end for
18: return best policy in {π_n}_{n=1}^N
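A compressed Python rendering of Algorithm 1 is sketched below, under simplifying assumptions: the per-action roll-out of lines 7–9 is collapsed to the expert's one-step roll-out value r_t(a) (which, as described under "Roll-out values" below, is the estimate the paper uses), the validation/calibration split of line 4 is omitted, and `learn` stands in for LEARN and must return a policy with the same calling convention as `pi`. This is illustrative, not the authors' implementation.

```python
import numpy as np

def melee_train(datasets, polopt, features, learn, num_actions, mu, n_episodes, rng):
    """AggreVaTe-style training of the exploration policy pi (Algorithm 1, compressed)."""
    D_phi, D_q = [], []                                        # aggregated meta-dataset D
    pi = lambda f, x, phi: int(np.argmax(f(x)))                # arbitrary initial policy: follow f_t
    for _ in range(n_episodes):
        S = datasets[rng.integers(len(datasets))]              # line 3: pick a random supervised set
        history = []
        for t, (x_t, r_t) in enumerate(S):                     # r_t is the full reward vector
            f_t = polopt(history, num_actions)                 # line 11
            phi = features(f_t, x_t, history, t, num_actions)
            q_star = np.asarray(r_t, dtype=float)              # line 9 (simplified): expert roll-out = true rewards
            D_phi.append(phi)                                  # line 12: aggregate the training example
            D_q.append(q_star)
            if rng.random() < mu:                              # line 13: roll-in action
                a_t, p_t = int(rng.integers(num_actions)), mu / num_actions
            else:
                a_t, p_t = pi(f_t, x_t, phi), 1 - mu + mu / num_actions
            history.append((x_t, a_t, r_t[a_t], p_t))          # line 14
        pi = learn(np.stack(D_phi), np.stack(D_q))             # line 16: retrain pi on all data so far
    return pi
```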
Following the AggreVaTe template, MÊLÉE operates in an iterative fashion, starting with an arbitrary π and improving it through interaction with an expert. Over N episodes, MÊLÉE selects random training sets and simulates the test-time behavior on each. The core functionality is to generate a number of states (f_t, x_t) on which to train π, and to use the supervised data to estimate the value of every action from those states. MÊLÉE achieves this by sampling a random supervised training set and setting aside some validation data from it (line 4). It then simulates a contextual bandit problem on this training data; at each time step t, it tries all actions and "pretends" that they were appended to the current history (line 8), on which it trains a new policy and evaluates its roll-out value (line 9). This yields, for each t, a new training example for π, which is added to π's training set (line 12); the features for this example are features of the classifier based on the true history (line 11) (and possibly statistics of the history itself), with a label that gives, for each action, the corresponding cost-to-go of that action (the Q*s computed in line 9). MÊLÉE then must commit to a roll-in action to actually take; it chooses this according to a roll-in policy (line 13). MÊLÉE has no explicit "exploitation policy": exploitation happens when π chooses the same action as f_t, while exploration happens when it chooses a different action. In learning to explore, MÊLÉE simultaneously learns when to exploit.

Roll-in actions. The distribution over states visited by MÊLÉE depends on the actions taken, and in general it is good to have that distribution match what is seen at test time. This distribution is determined by a roll-in policy (line 13), controlled in MÊLÉE by an exploration parameter µ ∈ [0,1]. As µ → 1, the roll-in policy approaches a uniform random policy; as µ → 0, the roll-in policy becomes deterministic. When the roll-in policy does not explore, it acts according to π(f_t, ·).

Roll-out values. The ideal value to assign to an action (from the perspective of the imitation learning procedure) is the total reward (or advantage) that would be achieved in the long run if we took this action and then behaved according to our final learned policy. Unfortunately, during training, we do not yet know the final learned policy. Thus, a surrogate roll-out policy π_out is used instead. A convenient, and often computationally efficient, alternative is to evaluate the value assuming all future actions were taken by the expert (Langford and Zadrozny 2005; Daumé, Langford, and Marcu 2009; Ross and Bagnell 2014). In our setting, at any time step t, the expert has access to the fully supervised reward vector r_t for the context x_t. When estimating the roll-out value for an action a, the expert returns the true reward value r_t(a) for this action, and we use this as our estimate for the roll-out value.
4 Theoretical Guarantees

We analyze MÊLÉE, showing that the no-regret property of AggreVaTe can be leveraged in our meta-learning setting for learning contextual bandit exploration. In particular, we first relate the regret of the learner in line 16 to the overall regret of π. This shows that, if the underlying classifier improves sufficiently quickly, MÊLÉE will achieve sublinear regret. We then show that for a specific choice of underlying classifier (BANDITRON), this is achieved. MÊLÉE is an instantiation of AggreVaTe (Ross and Bagnell 2014); as such, it inherits AggreVaTe's regret guarantees.

Theorem 1  After N episodes, if LEARN (line 16) is a no-regret algorithm, then as N → ∞, with probability 1, it holds that J(π̄) ≥ J(π*) − 2T √(K ε̂_class(T)), where J(·) is the reward of the exploration policy, π̄ is the average policy returned, and ε̂_class(T) is the average regression regret for each π_n accurately predicting Q*:

  ε̂_class(T) = min_{π ∈ Π} (1/N) Σ_{i=1}^N E[ Q̂*_{T−t+1}(s, π) − min_a Q*_{T−t+1}(s, a) ]

the empirical minimum expected cost-sensitive classification regret achieved by policies in the class Π on all the data over the N iterations of training, when compared to the Bayes-optimal regressor, for t ∼ U(T), s ∼ d^t_{π_i}, with U(T) the uniform distribution over {1, ..., T}, d^t_π the distribution of states at time t induced by executing policy π, and Q* the cost-to-go of the expert.

Thus, achieving low regret on the problem of learning π from the training data it observes ("D" in MÊLÉE), i.e., making ε̂_class(T) small, translates into low regret in the contextual-bandit setting. At first glance this bound looks like it may scale linearly with T. Note, however, that the bound in Theorem 1 depends on ε̂_class(T), and that s is a combination of the context vector x_t and the classification function f_t. As T → ∞, one would hope that f_t improves significantly and ε̂_class(T) decays quickly. Thus, sublinear regret may still be achievable when f learns sufficiently quickly as a function of T. For instance, if f is optimizing a strongly convex loss function, online gradient descent achieves a regret guarantee of O(log T / T) (Hazan et al. 2016, Theorem 3.3), potentially leading to a regret for MÊLÉE of O(√((log T)/T)).

The above statement is informal (it does not take into account the interaction between learning f and π). However, we can give a specific concrete example: we analyze MÊLÉE's test-time behavior when the underlying learning algorithm is BANDITRON. BANDITRON is a variant of the multiclass perceptron that operates under bandit feedback. Details of this analysis (and proofs, which directly follow the original BANDITRON analysis) are given in the appendix.
5 Experimental Setup and Results

Using a collection of synthetically generated classification problems, we train an exploration policy π using MÊLÉE (Alg 1). This exploration policy learns to explore on the basis of calibrated probabilistic predictions from f together with a predefined set of exploration features (§3). Once π is learned and fixed, we follow the test-time behavior described in §3 to evaluate π on a set of contextual bandit problems. We evaluate MÊLÉE on a natural learning to rank task (§5). To ensure that the performance of MÊLÉE generalizes beyond this single learning to rank task, we additionally perform a thorough evaluation on 300 "simulated" contextual bandit problems derived from standard classification tasks.

In all cases, the underlying classifier f is a linear model trained with an optimizer that runs stochastic gradient descent. We seek to answer two questions experimentally:
1. How does MÊLÉE compare empirically to alternative (expert-designed) exploration strategies?
2. How important are the additional features used by MÊLÉE in comparison to using only calibrated probability predictions from f as features?

Training Datasets

In our experiments, we follow Konyushkova, Sznitman, and Fua (2017) (and also Peters et al. (2014), in a different setting) and train the exploration policy π only on synthetic data. This is possible because the exploration policy π never makes use of x explicitly and instead only accesses it via f_t's behavior on it. We generate datasets with uniformly distributed class-conditional distributions. The datasets are always two-dimensional.
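The paper specifies the training tasks only as two-dimensional datasets with uniformly distributed class-conditional distributions; the sketch below is one plausible reading of that recipe (the box bounds, the number of classes, and the task sizes are our assumptions), producing the (x, r) pairs with one-hot rewards used to model classification problems in Section 3.

```python
import numpy as np

def make_synthetic_task(num_classes, n_examples, rng):
    """One synthetic meta-training task: 2-D inputs drawn uniformly from a per-class box,
    paired with one-hot reward vectors (reward 1 for the true label, 0 otherwise)."""
    lows = rng.uniform(0.0, 0.5, size=(num_classes, 2))            # random box per class
    highs = lows + rng.uniform(0.2, 0.5, size=(num_classes, 2))
    task = []
    for _ in range(n_examples):
        y = int(rng.integers(num_classes))
        x = rng.uniform(lows[y], highs[y])                         # uniform class-conditional draw
        r = np.zeros(num_classes)
        r[y] = 1.0                                                 # one-hot reward vector
        task.append((x, r))
    return task

# Example usage: a pool of synthetic tasks for meta-training.
rng = np.random.default_rng(0)
training_tasks = [make_synthetic_task(num_classes=3, n_examples=200, rng=rng) for _ in range(32)]
```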

Evaluation Methodology

For evaluation, we use progressive validation (Blum, Kalai, and Langford 1999), which exactly computes the reward of the algorithm. Specifically, to evaluate the performance of an exploration algorithm A on a dataset S of size n, we compute the progressive validation return G(A) = (1/n) Σ_{t=1}^n r_t(a_t), the average reward up to n, where a_t is the action chosen by the algorithm A and r_t is the true reward. Progressive validation is particularly suitable for measuring the effectiveness of an exploration algorithm, since the decision on whether to exploit or explore at earlier time steps will affect the performance on the examples observed later.

Because our evaluation is over 300 datasets, we report aggregate results in terms of Win/Loss Statistics: we compare two exploration methods by counting the number of statistically significant wins and losses. An exploration algorithm A wins over another algorithm B if the progressive validation return G(A) is statistically significantly larger than B's return G(B) at the 0.01 level, using a paired sample t-test.
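Both quantities are easy to compute from logged per-run rewards; the sketch below assumes each algorithm's progressive validation returns are recorded for paired runs (e.g., shuffles of the same dataset) and uses SciPy's paired t-test, which is one standard implementation of the test described above.

```python
import numpy as np
from scipy.stats import ttest_rel

def progressive_validation_return(rewards):
    """G(A) = (1/n) * sum_t r_t(a_t): average observed reward over the interaction."""
    return float(np.mean(np.asarray(rewards, dtype=float)))

def wins(returns_a, returns_b, alpha=0.01):
    """True if algorithm A statistically significantly beats B across paired runs."""
    t_stat, p_value = ttest_rel(returns_a, returns_b)
    return bool(p_value < alpha and np.mean(returns_a) > np.mean(returns_b))

# Win-minus-loss score for A against B over many datasets, each with paired runs:
# score = sum(wins(a_runs, b_runs) for a_runs, b_runs in per_dataset_runs) \
#         - sum(wins(b_runs, a_runs) for a_runs, b_runs in per_dataset_runs)
```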
Experimental Results

Learning to Rank  We evaluate MÊLÉE on a natural learning to rank dataset. The dataset we consider is the Microsoft Learning to Rank dataset, variant MSLR-10K (Qin and Liu 2013). The dataset consists of feature vectors extracted from query-url pairs along with relevance judgment labels. The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant). In our experiments, we limit the labels to the two extremes, 0 and 4, and drop the queries not labelled with either extreme. A query-url pair is represented by a 136-dimensional feature vector. The dataset is highly imbalanced, as the number of irrelevant queries is much larger than the number of relevant ones. To address this, we subsample the irrelevant queries to match the number of relevant ones. To avoid correlations between the observed query-url pairs, we group the queries by query ID and sample a single query from each group. We convert relevance scores to losses, with 0 indicating a perfectly relevant document and 1 an irrelevant one.

Figure 1: Win/Loss counts for all pairs of algorithms over 16 random shuffles for the MSLR-10K dataset.
Figure 2: Learning curve on the MSLR-10K dataset: the x-axis shows the number of queries observed, and the y-axis shows the progressive reward.

Figure 2 shows the evaluation results on a subset of the MSLR-10K dataset. Since performance is closely matched between the different exploration algorithms, we repeat the experiment 16 times with randomly shuffled permutations of the MSLR-10K dataset. Figure 2 shows the learning curve of the trained policy π as well as the baselines. Here, we see that MÊLÉE quickly achieves high reward; after about 100 examples the two strongest baselines catch up, and by 200 examples all approaches have asymptoted. We exclude LinUCB from these runs because the required matrix inversions made it too computationally expensive.¹ Figure 1 shows statistically significant win/loss differences for each of the algorithms across these 16 shuffles. Each row/column entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE is the only algorithm that always wins more than it loses against other algorithms, and it outperforms the nearest competition (ε-decreasing) by 3 points.

¹ In a single run of LinUCB we observed that its performance is on par with ε-greedy.

Figure 3: Behavior of MÊLÉE in comparison to baseline and state-of-the-art exploration algorithms: a representative learning curve on dataset #1144.
Figure 4: Behavior of MÊLÉE in comparison to baseline and state-of-the-art exploration algorithms. Win statistics: each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE outperforms the nearest competition (ε-decreasing) by 23.
Figure 5: MÊLÉE vs. ε-decreasing; every point represents one dataset; the x-axis shows the reward of MÊLÉE, the y-axis shows ε-decreasing, and red dots represent statistically significant runs.

Simulated Contextual Bandit Tasks  We perform an exhaustive evaluation on simulated contextual bandit tasks to ensure that the performance of MÊLÉE generalizes beyond learning to rank. Following Bietti, Agarwal, and Langford (2018), we use a collection of 300 binary classification datasets from openml.org for evaluation. These datasets cover a variety of domains, including text and image processing, medical data, and sensory data. We convert the classification datasets into cost-sensitive classification problems using a 0/1 encoding. Given these fully supervised cost-sensitive multi-class datasets, we simulate the contextual bandit setting by revealing only the reward for the selected action.

In Figure 3, we show a representative learning curve. Here, we see that as more data becomes available, all the approaches improve (except τ-first, which has ceased to learn after 2% of the data). MÊLÉE, in particular, is able to very quickly achieve near-optimal performance (in around 40 examples), compared with at least 200 examples for the best baseline.

In Figure 4, we show statistically significant win/loss differences for each of the algorithms. Here, each (row, column) entry shows the number of times the row algorithm won against the column, minus the number of losses. MÊLÉE is the only algorithm that always wins more than it loses against other algorithms, and it outperforms the nearest competition (ε-decreasing) by 23 points.

To understand more directly how MÊLÉE compares to ε-decreasing, in Figure 5 we show a scatter plot of the rewards achieved by MÊLÉE (x-axis) and ε-decreasing (y-axis) on each of the 300 datasets, with statistically significant differences highlighted in red and insignificant differences in blue. Points below the diagonal line correspond to better performance by MÊLÉE (147 datasets) and points above to ε-decreasing (124 datasets); the remaining 29 showed no statistically significant difference.

Finally, we consider the effect that the additional features have on MÊLÉE's performance. In particular, we compare a version of MÊLÉE with all features (the version used in all other experiments) against an ablated version that only has access to the (calibrated) probabilities of each action from the underlying classifier f. The comparison is shown as a scatter plot in Figure 6. Here, we can see that the full feature set does provide lift over just the calibrated probabilities, with a win-minus-loss improvement of 24 from adding the additional features from which to learn to explore.

Figure 6: MÊLÉE vs. MÊLÉE using only the calibrated prediction probabilities (x-axis). MÊLÉE gains additional leverage when using all the features.
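The supervised-to-bandit conversion described above is mechanical; the sketch below shows one way to wrap a fully labeled dataset so that only the chosen action's reward is revealed (the class and method names are our own, not from the paper's code).

```python
import numpy as np

class SimulatedBandit:
    """Wrap a fully supervised dataset so an agent sees only the reward of its chosen action."""
    def __init__(self, features, labels, num_classes):
        self.features = np.asarray(features, dtype=float)
        self.labels = np.asarray(labels, dtype=int)
        self.num_classes = num_classes

    def __len__(self):
        return len(self.labels)

    def context(self, t):
        return self.features[t]

    def pull(self, t, action):
        # 0/1 encoding: reward 1 if the chosen action equals the true label, else 0.
        return 1.0 if action == self.labels[t] else 0.0
```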
Broader Impacts

The motivation of this work is to give practitioners another tool in their toolbox to learn better exploration strategies for contextual bandits. Our primary target stakeholder population is such machine learning practitioners and data scientists. Secondarily, as that primary stakeholder population builds and deploys algorithms, those who are impacted by those algorithms through direct or indirect use will, we hope, benefit from using a better exploration algorithm. A possible risk of our algorithm, if deployed, is how the explored actions would affect the fairness of the learned model with respect to different demographic groups. On the positive side, we provide theoretical guarantees for the regret bounds achieved by MÊLÉE in a simplified setting. Overall, while there are real concerns about how this technology might be deployed, our hope is that the positive impacts outweigh the negatives, specifically because standard best-use practices should mitigate most of the risks.

Acknowledgments

We thank members of the CLIP lab for reviewing earlier versions of this work. This material is based upon work supported by the National Science Foundation under Grant No. 1618193. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Agarwal, A.; Hsu, D.; Kale, S.; Langford, J.; Li, L.; and Schapire, R. E. 2014. Taming the monster: A fast and simple algorithm for contextual bandits. In Proceedings of the 31st International Conference on Machine Learning (ICML), 1638–1646.
Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 3981–3989.
Auer, P. 2003. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3: 397–422.
Bachman, P.; Sordoni, A.; and Trischler, A. 2017. Learning algorithms for active learning. In ICML.
Beygelzimer, A.; and Langford, J. 2009. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 129–138. ACM.
Bietti, A.; Agarwal, A.; and Langford, J. 2018. A Contextual Bandit Bake-off. Working paper or preprint.
Blum, A.; Kalai, A.; and Langford, J. 1999. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 203–208. ACM.
Breiman, L. 2001. Random Forests. Machine Learning 45(1): 5–32. doi:10.1023/A:1010933404324.
Chang, K.; Krishnamurthy, A.; Agarwal, A.; Daumé III, H.; and Langford, J. 2015. Learning to Search Better Than Your Teacher. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2058–2066. JMLR.org.
Daumé III, H.; Langford, J.; and Marcu, D. 2009. Search-based structured prediction. Machine Learning 75(3): 297–325. doi:10.1007/s10994-009-5106-x.
Dudik, M.; Hsu, D.; Kale, S.; Karampatziakis, N.; Langford, J.; Reyzin, L.; and Zhang, T. 2011. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369.
Fang, M.; Li, Y.; and Cohn, T. 2017. Learning how to Active Learn: A Deep Reinforcement Learning Approach. In EMNLP.
Feraud, R.; Allesiardo, R.; Urvoy, T.; and Clérot, F. 2016. Random Forest for the Contextual Bandit Problem. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51 of Proceedings of Machine Learning Research, 93–101. Cadiz, Spain: PMLR.
Gupta, A.; Mendonca, R.; Liu, Y.; Abbeel, P.; and Levine, S. 2018. Meta-Reinforcement Learning of Structured Exploration Strategies. arXiv preprint arXiv:1802.07245.
Hazan, E.; et al. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3-4): 157–325.
Horvitz, D. G.; and Thompson, D. J. 1952. A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association 47(260): 663–685. doi:10.1080/01621459.1952.10483446.
Kaelbling, L. P. 1994. Associative reinforcement learning: Functions in k-DNF. Machine Learning 15(3): 279–298.
Karnin, Z. S.; and Anava, O. 2016. Multi-armed Bandits: Competing with Optimal Sequences. In Advances in Neural Information Processing Systems 29, 199–207. Curran Associates, Inc.
Konyushkova, K.; Sznitman, R.; and Fua, P. 2017. Learning Active Learning from Data. In Advances in Neural Information Processing Systems.
Langford, J.; and Zadrozny, B. 2005. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning, 473–480. ACM.
Langford, J.; and Zhang, T. 2008. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In Advances in Neural Information Processing Systems 20, 817–824. Curran Associates, Inc.
Li, K.; and Malik, J. 2016. Learning to optimize. arXiv preprint arXiv:1606.01885.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010a. A Contextual-bandit Approach to Personalized News Article Recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), 661–670. New York, NY, USA: ACM. doi:10.1145/1772690.1772758.
Li, W.; Wang, X.; Zhang, R.; Cui, Y.; Mao, J.; and Jin, R. 2010b. Exploitation and Exploration in a Performance Based Contextual Advertising System. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 27–36. New York, NY, USA: ACM.
Lin, H.-T.; Lin, C.-J.; and Weng, R. C. 2007. A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68(3): 267–276. doi:10.1007/s10994-007-5018-6.
Maes, F.; Wehenkel, L.; and Ernst, D. 2012. Meta-learning of exploration/exploitation strategies: The multi-armed bandit case. In International Conference on Agents and Artificial Intelligence, 100–115. Springer.
Osband, I.; Blundell, C.; Pritzel, A.; and Van Roy, B. 2016. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems 29, 4026–4034. Curran Associates, Inc.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
Peters, J.; Mooij, J. M.; Janzing, D.; and Schölkopf, B. 2014. Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15(1): 2009–2053.
Platt, J. C. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, 61–74. MIT Press.
Qin, T.; and Liu, T. 2013. Introducing LETOR 4.0 Datasets. CoRR abs/1306.2597. URL http://arxiv.org/abs/1306.2597.
Ross, S.; and Bagnell, J. A. 2014. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979.
Ross, S.; Gordon, G.; and Bagnell, J. A. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of Proceedings of Machine Learning Research, 627–635. Fort Lauderdale, FL, USA: PMLR.
Sutton, R. S. 1996. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, 1038–1044.
Sutton, R. S.; and Barto, A. G. 1998. Introduction to Reinforcement Learning. Cambridge, MA, USA: MIT Press, 1st edition. ISBN 0262193981.
Woodward, M.; and Finn, C. 2017. Active one-shot learning. arXiv preprint arXiv:1702.06559.
Xu, T.; Liu, Q.; Zhao, L.; Xu, W.; and Peng, J. 2018. Learning to Explore with Meta-Policy Gradient. arXiv preprint arXiv:1803.05044.
Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.