Thompson Sampling for Dynamic Multi-Armed Bandits
Neha Gupta (Department of Computer Science, University of Maryland, College Park 20742)
Ole-Christoffer Granmo (Department of ICT, University of Agder, Norway)
Ashok Agrawala (Department of Computer Science, University of Maryland, College Park 20742)

2011 10th International Conference on Machine Learning and Applications. DOI: 10.1109/ICMLA.2011.144

Abstract—The importance of multi-armed bandit (MAB) problems is on the rise due to their recent application in a large variety of areas such as online advertising, news article selection, wireless networks, and medicinal trials, to name a few. The most common assumption made when solving such MAB problems is that the unknown reward probability θ^k of each bandit arm k is fixed. However, this assumption rarely holds in practice because real-life problems often involve underlying processes that are dynamically evolving. In this paper, we model problems where reward probabilities θ^k are drifting, and introduce a new method called Dynamic Thompson Sampling (DTS) that facilitates Order Statistics based Thompson Sampling for these dynamically evolving MABs. The DTS algorithm adapts its success probability estimates, θ̂^k, faster than traditional Thompson Sampling schemes and thus leads to improved performance in terms of lower regret. Extensive experiments demonstrate that DTS outperforms current state-of-the-art approaches, namely pure Thompson Sampling, UCB-Normal and UCBf, for the case of dynamic reward probabilities. Furthermore, this performance advantage increases persistently with the number of bandit arms.

I. INTRODUCTION

The multi-armed bandit (MAB) problem forms a classical arena for the conflict between exploration and exploitation, well-known in reinforcement learning. Essentially, a decision maker iteratively pulls the arms of the MAB, one arm at a time, with each arm pull having a chance of releasing a reward, specified as the arm's reward probability θ^k. The goal of the decision maker is to maximize the total number of rewards obtained without knowing the reward probabilities. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas.

Thompson Sampling based solution strategies have recently been established as top performers for MABs with Bernoulli distributed rewards [1]. Such strategies gradually move from exploration to exploitation, converging towards only selecting the optimal arm, simply by pulling the available arms with frequencies that are proportional to their probabilities of being optimal. This behavior is ideal when the reward probabilities of the bandit arms are fixed. However, in cases where the reward probabilities are dynamically evolving, referred to as Dynamic Bandits, one would instead prefer schemes that explore and track potential reward probability changes. Apart from the Kalman filter based scheme proposed in [2], the latter problem area is largely unexplored when it comes to Thompson Sampling. Another important obstacle is that we cannot sample noisy instances of θ^k directly, as done in [2]. Instead, we must rely on samples obtained from Bernoulli trials with reward probability θ^k, which renders the problem unique.

In this paper, we introduce a novel strategy, Dynamic Thompson Sampling. Order Statistics based Thompson Sampling is used for arm selection, but the reward probabilities θ^k are tracked using an exponential filtering technique, allowing adaptive exploration. In brief, we explicitly model changing θ^k's as an integrated part of the Thompson Sampling, considering changes in reward probability to follow a Brownian motion, one of the most well-known stationary stochastic processes, extensively applied in many fields, including modeling of stock markets and commodity pricing in economics.
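To make the dynamic setting concrete, the following minimal Python sketch simulates a Bernoulli bandit arm whose reward probability drifts between pulls via Gaussian increments. The class name, the drift standard deviation sigma_drift, and the simple clipping of θ to [0, 1] are illustrative assumptions made here; the boundary behaviors actually considered in the paper (absorbing and "cutoff" boundaries) are discussed below.

import random

class DynamicBernoulliArm:
    """A bandit arm whose success probability theta drifts between pulls.

    The drift is a bounded random walk (Gaussian increments clipped to [0, 1]);
    sigma_drift and the clipping rule are illustrative, not taken from the paper.
    """

    def __init__(self, theta, sigma_drift=0.01):
        self.theta = theta
        self.sigma_drift = sigma_drift

    def pull(self):
        # Bernoulli reward drawn with the arm's current success probability.
        return 1 if random.random() < self.theta else 0

    def drift(self):
        # One Brownian-motion-like step, kept inside [0, 1] by clipping.
        self.theta = min(1.0, max(0.0, self.theta + random.gauss(0.0, self.sigma_drift)))

if __name__ == "__main__":
    random.seed(0)
    arms = [DynamicBernoulliArm(t) for t in (0.3, 0.5, 0.7)]
    for trial in range(5):
        rewards = [arm.pull() for arm in arms]
        for arm in arms:
            arm.drift()
        print(trial, rewards, [round(arm.theta, 3) for arm in arms])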
II. RELATED WORK

In their seminal work on MAB problems, Lai and Robbins proved that for certain reward distributions, such as the Bernoulli, Poisson, and uniform distributions, there exists an asymptotic bound on regret that only depends on the logarithm of the number of trials and the Kullback-Leibler value of each reward distribution [3]. The main idea behind the strategy following from this insight is to calculate an upper confidence index for each arm. At each trial the arm which has the maximum upper confidence value is played, thus enabling deterministic arm selection. Auer et al. [4] further proved that instead of an asymptotic logarithmic upper bound, an upper bound could be obtained in finite time, and introduced the algorithms UCB-1, UCB-2 and their variants to this end. The pioneering Gittins Index based strategy [5] performs a Bayesian look-ahead at each step in order to decide which arm to play. Although allowing optimal play for discounted rewards, this technique is intractable for MAB problems in practice.

Dynamic Bandits have also been known as Restless Bandits. Restless Bandits were introduced by Whittle [6] and are considered to be PSPACE-hard. Guha et al. [7] introduced approximation algorithms for a special setting of the Restless Bandit problem. Auer et al. [8] introduced a version of Restless Bandits called Adversarial Bandits, but the technique suggested was designed to perform against an all-powerful adversary and hence led to very loose bounds for the reward probabilities.

In this work, we look at the problem of Dynamic Bandits in which the reward probabilities of the arms follow bounded Brownian motion. In [9], the authors consider a similar scenario of Brownian bandits with reflective boundaries, assuming that a sample from the current distribution of θ^k itself is observed at each trial. Granmo et al. introduced the Order Statistics based Kalman Filter Multi-Armed Bandit Algorithm [2]. In their model, the reward obtained from an arm is affected by Gaussian observation noise ∼ N(0, σ²_ob) and independent Gaussian perturbations ∼ N(0, σ²_tr) at each trial. A key assumption in [2] is again that at each trial a noisy sample of the true reward is observed. In contrast, in our work, estimation of the reward probabilities θ^k is done by only using Bernoulli outcomes r^k ∼ Bernoulli(θ^k). Our work is thus well suited for modeling of problems such as click-through rate optimization in the Internet domain, where a click on a newspaper article or advertisement results in a binary reward, from which the click-through rate θ^k is estimated. Also, instead of using reflective boundaries we consider absorbing and "cutoff" boundaries, which are more suited for the Internet domain.

III. PROBLEM DEFINITION

A. Constant Rewards

For the MAB problems we study here, each pull of an arm can be considered as a Bernoulli trial having the output set {0, 1}, with the probability θ^k denoting the probability of success (event {1}). The probability distribution of the number of successes S obtained in n^k Bernoulli trials is known to have a Binomial distribution, S ∼ Binomial(n^k, θ^k):

  p(S = s | θ^k) = C(n^k, s) (1 − θ^k)^(n^k − s) (θ^k)^s,    (1)

where C(n^k, s) denotes the binomial coefficient. This means that since the Beta distribution is a conjugate prior for the Binomial distribution [10], Bayesian estimation is a viable option for estimating θ^k. Starting from a Beta(α_0, β_0) prior, observing s successes and r failures yields the posterior Beta(α_0 + s, β_0 + r). The mean and variance of this posterior can be used to characterize θ^k:

  μ̂_n = α_n / (α_n + β_n),    (5)

  σ̂²_n = α_n β_n / [(α_n + β_n + 1)(α_n + β_n)²].    (6)
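As a small illustration of the conjugate update and of Eqs. (5) and (6), the Python sketch below computes the posterior parameters, mean, and variance from observed success and failure counts. The function name and the example counts are hypothetical; the initialization α_0 = β_0 = 2 matches the one used in Algorithm 1 below.

def beta_posterior_moments(successes, failures, alpha0=2.0, beta0=2.0):
    """Conjugate Beta update for Bernoulli data, with the posterior mean and
    variance of Eqs. (5) and (6). alpha0 = beta0 = 2 mirrors Algorithm 1."""
    alpha_n = alpha0 + successes
    beta_n = beta0 + failures
    mean = alpha_n / (alpha_n + beta_n)  # Eq. (5)
    variance = alpha_n * beta_n / ((alpha_n + beta_n + 1) * (alpha_n + beta_n) ** 2)  # Eq. (6)
    return alpha_n, beta_n, mean, variance

if __name__ == "__main__":
    # Example: 7 successes and 3 failures observed on one arm.
    print(beta_posterior_moments(successes=7, failures=3))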
B. Pure Thompson Sampling (TS)

Thompson Sampling is a randomized algorithm that takes advantage of Bayesian estimation to reason about the reward probability θ^k associated with each arm k of a MAB, as summarized in Algorithm 1. After conducting n MAB trials, the reward probability θ^k of each arm k is estimated using a posterior distribution over possible estimates, Beta(α^k_n, β^k_n) [11], and the state of a system designed for K-armed MABs can therefore be fully specified by {(α^1_n, β^1_n), (α^2_n, β^2_n), ..., (α^K_n, β^K_n)}.

For arm selection at each trial, one sample θ̂^k is drawn for each arm k from the random variable Θ̂^k_n, Θ̂^k_n ∼ Beta(α^k_n, β^k_n), k = 1, 2, 3, ..., K, and the arm obtaining the largest sample value is played. The above means that the probability of arm k being played is P(θ̂^k > θ̂^1 ∧ θ̂^k > θ̂^2 ∧ θ̂^k > θ̂^3 ... ∧ θ̂^k > θ̂^K); however, the beauty of Thompson Sampling is that there is no need to explicitly compute this value. Formal convergence proofs for this method have been discussed in [1], [12].

Algorithm 1 Thompson Sampling (TS)
  Initialize α^k_0 = 2, β^k_0 = 2.
  loop
    Sample reward probability estimate θ̂^k randomly from Beta(α^k_{n−1}, β^k_{n−1}) for k ∈ {1, ..., K}.
    Arrange the samples in decreasing order.
    Select the arm A s.t. θ̂^A = max_k {θ̂^1, ..., θ̂^K}.
    Pull arm A and receive reward r_n.
    Obtain α^A_n and β^A_n: α^A_n = α^A_{n−1} + r_n; β^A_n = β^A_{n−1} + (1 − r_n).
  end loop
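The sketch below is a straightforward Python rendering of Algorithm 1. The pull_arm callback, which stands in for the bandit environment, and the fixed reward probabilities in the demonstration are illustrative assumptions only; they are not part of the algorithm.

import random

def thompson_sampling(pull_arm, num_arms, num_trials, alpha0=2.0, beta0=2.0):
    """Pure Thompson Sampling (TS) following Algorithm 1.

    pull_arm(k) must return a Bernoulli reward (0 or 1) for arm k; it is a
    placeholder for the bandit environment."""
    alpha = [alpha0] * num_arms
    beta = [beta0] * num_arms
    total_reward = 0
    for _ in range(num_trials):
        # Draw one sample per arm from its Beta posterior; taking the maximum
        # replaces the explicit sorting step of Algorithm 1.
        samples = [random.betavariate(alpha[k], beta[k]) for k in range(num_arms)]
        arm = max(range(num_arms), key=lambda k: samples[k])
        reward = pull_arm(arm)
        total_reward += reward
        # Conjugate update: a success increments alpha, a failure increments beta.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return total_reward, alpha, beta

if __name__ == "__main__":
    random.seed(1)
    true_theta = [0.2, 0.5, 0.8]  # fixed reward probabilities, for illustration only
    print(thompson_sampling(lambda k: 1 if random.random() < true_theta[k] else 0,
                            num_arms=3, num_trials=1000))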