
Thompson Sampling for Dynamic Multi-Armed Bandits

Neha Gupta, Department of Computer Science, University of Maryland, College Park 20742, Email: [email protected]
Ole-Christoffer Granmo, Department of ICT, University of Agder, Norway, Email: [email protected]
Ashok Agrawala, Department of Computer Science, University of Maryland, College Park 20742, Email: [email protected]

Abstract—The importance of multi-armed bandit (MAB) problems is on the rise due to their recent application in a large variety of areas such as online advertising, news article selection, wireless networks, and medicinal trials, to name a few. The most common assumption made when solving such MAB problems is that the unknown reward probability θ^k of each bandit arm k is fixed. However, this assumption rarely holds in practice, simply because real-life problems often involve underlying processes that are dynamically evolving. In this paper, we model problems where reward probabilities θ^k are drifting, and introduce a new method called Dynamic Thompson Sampling (DTS) that facilitates Order Statistics based Thompson Sampling for these dynamically evolving MABs. The DTS algorithm adapts its success probability estimates, θ̂^k, faster than traditional Thompson Sampling schemes and thus leads to improved performance in terms of lower regret. Extensive experiments demonstrate that DTS outperforms current state-of-the-art approaches, namely pure Thompson Sampling, UCB-Normal and UCBf, for the case of dynamic reward probabilities. Furthermore, this performance advantage increases persistently with the number of bandit arms.

I. INTRODUCTION

The multi-armed bandit (MAB) problem forms a classical arena for the well-known conflict between exploration and exploitation. Essentially, a decision maker iteratively pulls the arms of the MAB, one arm at a time, with each arm pull having a chance of releasing a reward, specified by the arm's reward probability θ^k. The goal of the decision maker is to maximize the total number of rewards obtained without knowing the reward probabilities. Although seemingly a simplistic problem, solution strategies are important because of their wide applicability in a myriad of areas.

Thompson Sampling based solution strategies have recently been established as top performers for MABs with Bernoulli distributed rewards [1]. Such strategies gradually move from exploration to exploitation, converging towards only selecting the optimal arm, simply by pulling the available arms with frequencies that are proportional to their probabilities of being optimal. This behavior is ideal when the reward probabilities of the bandit arms are fixed. However, in cases where the reward probabilities are dynamically evolving, referred to as Dynamic Bandits, one would instead prefer schemes that explore and track potential reward probability changes. Apart from the Kalman filter based scheme proposed in [2], the latter problem area is largely unexplored when it comes to Thompson Sampling. Another important obstacle in solving the problem is the fact that we cannot sample noisy instances of θ^k directly, as done in [2]. Instead, we must rely on samples obtained from Bernoulli trials with reward probability θ^k, which renders the problem unique.

In this paper, we introduce a novel strategy: Dynamic Thompson Sampling. Order Statistics based Thompson Sampling is used for arm selection, but the reward probabilities θ^k are tracked using an exponential filtering technique, allowing adaptive exploration. In brief, we explicitly model changing θ^k's as an integrated part of the Thompson Sampling, considering changes in reward probability to follow a Brownian motion, one of the most well-known stationary stochastic processes, extensively applied in many fields, including modeling of stock markets and commodity pricing in economics.

II. RELATED WORK

In their seminal work on MAB problems, Lai and Robbins proved that for certain reward distributions, such as the Bernoulli, Poisson, and uniform distributions, there exists an asymptotic bound on regret that only depends on the logarithm of the number of trials and the Kullback-Leibler value of each reward distribution [3]. The main idea behind the strategy following from this insight is to calculate an upper confidence index for each arm. At each trial the arm which has the maximum upper confidence value is played, thus enabling deterministic arm selection. Auer et al. [4] further proved that instead of an asymptotic logarithmic upper bound, an upper bound could be obtained in finite time, and introduced the algorithms UCB-1, UCB-2 and their variants to this end. The pioneering Gittins Index based strategy [5] performs a Bayesian look-ahead at each step in order to decide which arm to play. Although allowing optimal play for discounted rewards, this technique is intractable for MAB problems in practice.

Dynamic Bandits have also been known as Restless Bandits. Restless Bandits were introduced by Whittle [6] and are considered to be PSPACE-hard. Guha et al. [7] introduced approximation algorithms for a special setting of the Restless Bandit problem. Auer et al. [8] introduced a version of Restless Bandits called Adversarial Bandits, but the technique suggested was designed to perform against an all-powerful adversary and hence led to very loose bounds for the reward probabilities.

In this work, we look at the problem of Dynamic Bandits in which the reward probabilities of the arms follow bounded Brownian motion. In [9], the authors consider a similar scenario of Brownian bandits with reflective boundaries, assuming that a sample from the current distribution of θ^k itself is observed at each trial. Granmo et al. introduced the Order Statistics based Kalman Filter Multi-Armed Bandit Algorithm [2]. In their model, the reward obtained from an arm is affected by Gaussian observation noise ∼ N(0, σ²_ob) and an independent Gaussian perturbation ∼ N(0, σ²_tr) at each trial. A key assumption in [2] is again that at each trial a noisy sample of the true reward is observed. In contrast, in our work, estimation of the reward probabilities θ^k is done by only using Bernoulli outcomes r^k ∼ Bernoulli(θ^k). Our work is thus well suited for modeling problems such as click-through rate optimization in the Internet domain, where a click on a newspaper article or advertisement results in a binary reward, from which the click-through rate θ^k is estimated. Also, instead of using reflective boundaries we consider absorbing and "cutoff" boundaries, which are more suited for the Internet domain.

III. PROBLEM DEFINITION

A. Constant Rewards

For the MAB problems we study here, each pull of an arm can be considered a Bernoulli trial having the output set {0, 1}, with θ^k denoting the probability of success (event {1}). The probability distribution of the number of successes S obtained in n Bernoulli trials is known to be a Binomial distribution, S ∼ Binomial(n^k, θ^k):

\[ p(S = s \mid \theta^k) = \binom{n}{s} (1 - \theta^k)^{n-s} (\theta^k)^{s}. \tag{1} \]

This means that, since the Beta distribution is a conjugate prior for the Binomial distribution [10], Bayesian estimation is a viable option for estimating θ^k. It is thus natural to use the Beta distribution to obtain a prior fully specified by the parameters (α_0, β_0):

\[ p(\hat{\theta}^k; \alpha_0, \beta_0) = \frac{(\hat{\theta}^k)^{\alpha_0 - 1} (1 - \hat{\theta}^k)^{\beta_0 - 1}}{B(\alpha_0, \beta_0)}. \tag{2} \]

The posterior distribution after the nth trial can be defined recursively. If a success is received at the nth trial, α_n and β_n are identified as follows:

\[ \alpha_n = \alpha_{n-1} + 1, \qquad \beta_n = \beta_{n-1}. \tag{3} \]

Conversely, if a failure is received at the nth trial, we have:

\[ \alpha_n = \alpha_{n-1}, \qquad \beta_n = \beta_{n-1} + 1. \tag{4} \]

After s successes and r failures, the parameters of the posterior Beta distribution thus become (α_0 + s, β_0 + r). The mean and variance of this posterior, Beta(α_0 + s, β_0 + r), can be used to characterize θ^k:

\[ \hat{\mu}_n = \frac{\alpha_n}{\alpha_n + \beta_n}, \tag{5} \]

\[ \hat{\sigma}_n^2 = \frac{\alpha_n \beta_n}{(\alpha_n + \beta_n + 1)(\alpha_n + \beta_n)^2}. \tag{6} \]

B. Pure Thompson Sampling (TS)

Thompson Sampling is a randomized algorithm that takes advantage of Bayesian estimation to reason about the reward probability θ^k associated with each arm k of a MAB, as summarized in Algorithm 1. After conducting n MAB trials, the reward probability θ^k of each arm k is estimated using a posterior distribution over possible estimates, Beta(α^k_n, β^k_n) [11], and the state of a system designed for K-armed MABs can therefore be fully specified by {(α^1_n, β^1_n), (α^2_n, β^2_n), ..., (α^K_n, β^K_n)}.

For arm selection at each trial, one sample θ̂^k is drawn for each arm k from the random variable Θ̂^k_n, Θ̂^k_n ∼ Beta(α^k_n, β^k_n), k = 1, 2, 3, ..., K, and the arm obtaining the largest sample value is played. The above means that the probability of arm k being played is P(θ̂^k > θ̂^1 ∧ θ̂^k > θ̂^2 ∧ ... ∧ θ̂^k > θ̂^K); however, the beauty of Thompson Sampling is that there is no need to explicitly compute this value. Formal convergence proofs for this method have been discussed in [1], [12].

Algorithm 1 Thompson Sampling (TS)
  Initialize α^k_0 = 2, β^k_0 = 2.
  loop
    Sample reward probability estimate θ̂^k randomly from Beta(α^k_{n-1}, β^k_{n-1}) for k ∈ {1, ..., K}.
    Arrange the samples in decreasing order.
    Select the arm A s.t. θ̂^A = max_k {θ̂^1, ..., θ̂^K}.
    Pull arm A and receive reward r_n.
    Obtain α^A_n and β^A_n: α^A_n = α^A_{n-1} + r_n; β^A_n = β^A_{n-1} + (1 - r_n).
  end loop
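As an illustration, a minimal Python sketch of Algorithm 1 for Bernoulli rewards is given below. This is our own sketch, not the authors' code; the function and variable names are ours. Each arm keeps a Beta(α, β) posterior, one sample is drawn per arm, the arm with the largest sample is pulled, and its posterior is updated with the observed binary reward (Eqns. 3-4).

```python
import random

def thompson_sampling(pull_arm, num_arms, trials):
    """Illustrative sketch of pure Thompson Sampling (Algorithm 1).

    pull_arm(k) must return a 0/1 reward for arm k."""
    alpha = [2.0] * num_arms   # prior alpha_0 = 2 for every arm
    beta = [2.0] * num_arms    # prior beta_0 = 2 for every arm
    for _ in range(trials):
        # Draw one sample per arm from its Beta posterior and play the largest.
        samples = [random.betavariate(alpha[k], beta[k]) for k in range(num_arms)]
        arm = max(range(num_arms), key=samples.__getitem__)
        reward = pull_arm(arm)               # 0 or 1
        alpha[arm] += reward                 # Eqn. 3 (success)
        beta[arm] += 1 - reward              # Eqn. 4 (failure)
    return alpha, beta

# Example: three arms with fixed (hypothetical) reward probabilities.
thetas = [0.2, 0.5, 0.8]
alpha, beta = thompson_sampling(lambda k: int(random.random() < thetas[k]), 3, 5000)
print([a / (a + b) for a, b in zip(alpha, beta)])   # posterior means (Eqn. 5)
```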

C. UCB Algorithm

The UCB-1 algorithm [4] computes an Upper Confidence Bound (UCB) for each arm k: $E[\hat{\theta}^k_n] + \sqrt{2\ln N / n^k}$. Here $E[\hat{\theta}^k_n]$ is the average reward obtained from arm k, $n^k$ is the number of times arm k has been played, and N is the overall number of trials so far. In this algorithm, the arm which produces the largest UCB is played at each trial, and the UCBs are updated accordingly.

UCB-Normal is a modification of the UCB-1 algorithm for the case of Gaussian rewards. The bound used in UCB-Normal is $E[\hat{\theta}^k_n] + \sqrt{16 \cdot \frac{q^k - n^k (\hat{\theta}^k_n)^2}{n^k - 1} \cdot \frac{\ln(N-1)}{n^k}}$, where $q^k$ is the sum of the squared rewards of arm k.

The UCBf algorithm [9] is a more general form of the UCB-1 algorithm that also incorporates Brownian motion with reflecting boundaries. In brief, the UCB-1 bound is extended with an additional component that accounts for the drift of the reward probabilities; see [9] for its exact form.
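For comparison, the UCB-1 index rule described above can be sketched as follows. This is our illustration, not code from the paper, and the names are ours: each arm is played once initially, and thereafter the arm with the largest index (average reward plus $\sqrt{2\ln N / n^k}$) is played.

```python
import math
import random

def ucb1(pull_arm, num_arms, trials):
    """Illustrative sketch of UCB-1 [4]: play the arm with the largest index."""
    counts = [0] * num_arms   # n^k: number of times arm k has been played
    sums = [0.0] * num_arms   # total reward obtained from arm k
    for n in range(1, trials + 1):
        if n <= num_arms:
            arm = n - 1       # play each arm once to initialize its average
        else:
            arm = max(range(num_arms),
                      key=lambda k: sums[k] / counts[k]
                      + math.sqrt(2 * math.log(n) / counts[k]))
        reward = pull_arm(arm)
        counts[arm] += 1
        sums[arm] += reward
    return [s / c for s, c in zip(sums, counts)]

# Example: three Bernoulli arms with fixed (hypothetical) reward probabilities.
thetas = [0.2, 0.5, 0.8]
print(ucb1(lambda k: int(random.random() < thetas[k]), 3, 5000))
```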

D. Dynamic Rewards

The key assumption made in most MAB algorithms is that the reward probabilities remain constant. In practice, it is rare to have constant reward probabilities, and the algorithm that we propose here explicitly takes changing reward probabilities into account.

Brownian motion is a simple stochastic process in which the value of a random variable at step n is the sum of its value at time n−1 and a Gaussian noise term ∼ N(0, σ²). In this paper, we consider the time-varying reward probability θ_n to follow Brownian motion restricted to the interval [0, 1]:

\[ \theta_n = \theta_{n-1} + \nu_n, \qquad \nu_n \sim N(0, \sigma^2), \tag{7} \]

where the behavior at the boundaries 0 and 1 is governed by one of the following two boundary types:

• Cutoff Boundary: With cutoff boundaries, θ_n is truncated at the boundaries 0 and 1:

\[ \theta_n = \begin{cases} \theta_{n-1} + \nu_n & \text{if } 0 \le \theta_{n-1} + \nu_n \le 1 \\ 1 & \text{if } \theta_{n-1} + \nu_n > 1 \\ 0 & \text{if } \theta_{n-1} + \nu_n < 0 \end{cases} \]

• Absorbing Boundary: With absorbing boundaries, θ_n remains at the boundary forever after reaching it:

\[ \theta_n = \begin{cases} \theta_{n-1} + \nu_n & \text{if } \nexists\, i \le n : \theta_i \ge 1 \vee \theta_i \le 0 \\ 1 & \text{if } \exists\, i \le n : \theta_i \ge 1 \\ 0 & \text{if } \exists\, i \le n : \theta_i \le 0 \end{cases} \]

The performance of MAB solution schemes can be measured in terms of Regret, defined as:

\[ \mathrm{Regret} = \sum_{n=0}^{N} (r^*_n - r^k_n). \]

Above, N is the total number of trials, r^*_n is the Bernoulli output one would receive by playing the arm with the highest θ^k_n at trial n, while r^k_n is the reward obtained after sampling the kth arm as determined by the algorithm being evaluated. Note that the arm corresponding to r^*_n may change as the values of θ^k_n evolve. Hence, regret is a measure of the loss suffered by not always playing the optimal arm.

E. Dynamic Thompson Sampling Algorithm (DTS)

Unlike the algorithms for static MAB problems, the goal of the DTS algorithm proposed presently is to minimize the regret by tracking the changing values of θ^k_n as closely as possible. Note that in our model θ^k_n changes according to Eqn. 7 whether arm k is played or not. The DTS algorithm is able to track reward probabilities by replacing the update rules specified in Eqn. 3 and 4 with two sets of update rules and a threshold C governing which set of update rules to use:

1) If α_{n-1} + β_{n-1} < C,

\[ \alpha_n = \alpha_{n-1} + r_n, \tag{8} \]

\[ \beta_n = \beta_{n-1} + (1 - r_n). \tag{9} \]

2) If α_{n-1} + β_{n-1} ≥ C,

\[ \alpha_n = (\alpha_{n-1} + r_n)\,\frac{C}{C+1}, \tag{10} \]

\[ \beta_n = (\beta_{n-1} + (1 - r_n))\,\frac{C}{C+1}. \tag{11} \]

Notice that the first set of update rules makes the scheme behave identically to Pure Thompson Sampling when α_{n-1} + β_{n-1} < C, while the second set of update rules discounts older observations so that the sum α_n + β_n is capped at (approximately) C. In this capped regime, applying the second set of update rules over two consecutive trials gives

\[ \alpha_n = (\alpha_{n-2} + r_{n-1})\,\frac{C}{C+1}\cdot\frac{C}{C+1} + r_n\,\frac{C}{C+1} \tag{15} \]

\[ \phantom{\alpha_n} = \alpha_{n-2}\Big(\frac{C}{C+1}\Big)^2 + r_{n-1}\Big(\frac{C}{C+1}\Big)^2 + r_n\,\frac{C}{C+1}, \tag{16} \]

whence it becomes apparent that the weighting is exponential.

To summarize, the above strategy provides exponential weighting of the outcomes of the trials, with the more recent outcomes getting more weight. In the same manner, we could express β_n as a discounted sum of previous outputs of the Bernoulli trials. Similarly, once α_{n-1} + β_{n-1} = C, we observe that the mean μ_n of Beta(α_n, β_n) at trial n is

\[ \mu_n = \frac{\alpha_n}{\alpha_n + \beta_n} \tag{17} \]

\[ \phantom{\mu_n} = \frac{\alpha_{n-1} + r_n}{C}\cdot\frac{C}{C+1} \tag{18} \]

\[ \phantom{\mu_n} = \frac{\alpha_{n-1}}{C}\cdot\frac{C}{C+1} + r_n\,\frac{1}{C+1} \tag{19} \]

\[ \phantom{\mu_n} = \frac{C}{C+1}\cdot\frac{\alpha_{n-1}}{\alpha_{n-1} + \beta_{n-1}} + r_n\,\frac{1}{C+1} \tag{20} \]

\[ \phantom{\mu_n} = \Delta \cdot \mu_{n-1} + (1 - \Delta)\, r_n, \tag{21} \]

where Δ = C/(C+1). Clearly, this approach yields exponential filtering of r_n [13]. Observe finally that the variance σ²_n of Beta(α_n, β_n) is bounded as follows:

\[ 0 \le \sigma_n^2 \le \frac{1}{4(C+1)}. \tag{22} \]

This is the case because the variance of Beta(α_n, β_n) is

\[ \sigma_n^2 = \frac{\alpha_n \beta_n}{(\alpha_n + \beta_n + 1)(\alpha_n + \beta_n)^2}, \]

and because, with α_n + β_n capped at C, the product α_n β_n is maximized when α_n = β_n = C/2 and minimized when either α_n or β_n approaches 0.
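To make the exponential-filtering interpretation concrete, the short check below (a sketch under our own naming, not from the paper) applies the second set of update rules (Eqns. 10-11) repeatedly in the capped regime and confirms numerically that the posterior mean follows Eqn. 21 with Δ = C/(C+1), and that the posterior variance stays below 1/(4(C+1)) as stated in Eqn. 22.

```python
from random import random

def capped_update(alpha, beta, r, C):
    """Second set of DTS update rules (Eqns. 10-11), used once alpha + beta >= C."""
    return (alpha + r) * C / (C + 1), (beta + (1 - r)) * C / (C + 1)

C = 50.0
alpha, beta = C / 2, C / 2          # start in the capped regime: alpha + beta = C
delta = C / (C + 1)

for n in range(1000):
    r = 1 if random() < 0.7 else 0  # Bernoulli reward with a fixed theta = 0.7
    mu_prev = alpha / (alpha + beta)
    alpha, beta = capped_update(alpha, beta, r, C)
    mu = alpha / (alpha + beta)
    # Eqn. 21: the new mean is an exponentially weighted average of mu_prev and r.
    assert abs(mu - (delta * mu_prev + (1 - delta) * r)) < 1e-9
    # Eqn. 22: the posterior variance is bounded by 1/(4(C+1)).
    var = alpha * beta / ((alpha + beta + 1) * (alpha + beta) ** 2)
    assert var <= 1 / (4 * (C + 1)) + 1e-12
```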

Algorithm 2 Dynamic Thompson Sampling (DTS)
  Initialize α^k_0 = 2, β^k_0 = 2 for k ∈ {1, ..., K}.
  loop
    Sample reward probability estimate θ̂^k randomly from Beta(α^k_{n-1}, β^k_{n-1}) for k ∈ {1, ..., K}.
    Arrange the samples in decreasing order.
    Select the arm A s.t. θ̂^A = max_k {θ̂^1, ..., θ̂^K}.
    Pull arm A and receive reward r_n.
    if α^A_{n-1} + β^A_{n-1} < C then
      Update α^A_n and β^A_n according to Eqns. 8 and 9.
    else
      Update α^A_n and β^A_n according to Eqns. 10 and 11.
    end if
  end loop

Fig. 1. Typical variations of the reward probability θ for different values of the standard deviation. θ_0 = 0.5 in all cases.

The DTS algorithm introduced in this paper is based on the above two sets of update rules and is specified in Algorithm 2 for the K-armed bandit case, where the motion of the corresponding reward probabilities (θ^1, θ^2, θ^3, ..., θ^K) is Brownian. The algorithm starts by initializing the priors α^k_0 = 2, β^k_0 = 2 for all the arms, and then proceeds by gradually updating the α^k's and β^k's as penalties and rewards are received. Because of the exponential weighting of rewards, drifting reward probabilities are tracked, which in turn leads to better performance, as will be shown presently.
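A compact Python sketch of Algorithm 2 is given below. This is our illustrative implementation, not the authors' code; the class and parameter names (DynamicThompsonSampling, C, and so on) are ours. It keeps one (α, β) pair per arm, samples from the corresponding Beta posteriors to pick an arm, and applies the two sets of update rules governed by the threshold C.

```python
import random

class DynamicThompsonSampling:
    """Illustrative sketch of Algorithm 2 (DTS) for a K-armed Bernoulli bandit."""

    def __init__(self, num_arms, C=100.0):
        self.C = C
        # Priors alpha_0 = beta_0 = 2 for every arm, as in the paper.
        self.alpha = [2.0] * num_arms
        self.beta = [2.0] * num_arms

    def select_arm(self):
        # Order-statistics based selection: sample one estimate per arm
        # from its Beta posterior and play the arm with the largest sample.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        a, b, C = self.alpha[arm], self.beta[arm], self.C
        if a + b < C:
            # Eqns. 8-9: behave exactly like pure Thompson Sampling.
            a, b = a + reward, b + (1 - reward)
        else:
            # Eqns. 10-11: discount old evidence so alpha + beta stays near C.
            a = (a + reward) * C / (C + 1)
            b = (b + (1 - reward)) * C / (C + 1)
        self.alpha[arm], self.beta[arm] = a, b
```

Note that pure Thompson Sampling (Algorithm 1) is recovered by letting C grow without bound, since the first set of update rules then always applies.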

IV. EXPERIMENTS

In this section, we primarily evaluate the performance of the DTS algorithm by comparing it with UCBf, TS and UCB-Normal. Even though we have performed a large number of experiments using a wide range of reward distributions, we here only report the most important and relevant ones due to limited space. We report the regret obtained as the measure of performance of the different algorithms. As DTS is a randomized algorithm, the regret becomes a random variable. The expected value of the regret is estimated by repeating each experiment 400 times.

A. Varying value of standard deviation σ

To get an insight into the Brownian motion of the reward probability θ, we performed experiments in which we simulated the dynamics of θ for different values of the standard deviation. In Fig. 1, we show a sample plot of the curves for four values of σ = {0.05, 0.01, 0.005, 0.001}, starting with θ_0 = 0.5. The curve with standard deviation σ = 0.05 cuts across the boundaries 0 and 1 very often, and accurate learning seems unrealistic in this situation. The other curves, with standard deviations σ = {0.01, 0.005, 0.001}, are more stable and seem more appropriate for modeling realistic learning problems.

B. Estimation vs. Actual

We perform these experiments to show how close the estimated values of θ are to the actual value of θ for the TS and DTS algorithms for a single arm. The two graphs, Fig. 2 and Fig. 3, show the results for the estimated and actual values of θ. We see that the DTS algorithm provides a much more accurate estimate of θ, based on its exponential filtering, when compared to the TS algorithm.

Fig. 2. Plot shows the estimated and actual values of θ for the case of a single arm. Estimated values are calculated based on the TS algorithm.

C. Tuning parameter C for the DTS algorithm

Fig. 4 shows a plot of the root mean square error (RMSE) obtained for different values of C and standard deviation σ for 10,000 trials in the DTS and TS algorithms for a single arm. RMSE is measured as:

\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{n=1}^{N} (\theta_n - \hat{\theta}_n)^2}{N}}. \tag{23} \]

Note here that RMSE values averaged over 400 runs are reported in the graph. In this experiment, we take two different values of θ = {0.8, 0.5} and choose the standard deviation in the set {0.005, 0.01, 0.05}.
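The C-tuning experiment can be sketched as follows: a single arm whose reward probability drifts according to Eqn. 7 with a cutoff boundary is tracked by the DTS posterior, and the tracking accuracy is summarized by the RMSE of Eqn. 23. This is our illustrative reconstruction of the setup, not the authors' experiment code; function names and the specific simulation loop are ours.

```python
import math
import random

def simulate_rmse(theta0=0.5, sigma=0.005, C=100.0, trials=10_000, seed=0):
    """Track a single drifting arm with DTS updates and return the RMSE (Eqn. 23)."""
    rng = random.Random(seed)
    theta, alpha, beta = theta0, 2.0, 2.0
    sq_err = 0.0
    for _ in range(trials):
        # Brownian drift (Eqn. 7) with a cutoff boundary.
        theta = min(1.0, max(0.0, theta + rng.gauss(0.0, sigma)))
        reward = 1 if rng.random() < theta else 0
        # DTS update rules (Eqns. 8-11) for the single arm.
        if alpha + beta < C:
            alpha, beta = alpha + reward, beta + (1 - reward)
        else:
            alpha = (alpha + reward) * C / (C + 1)
            beta = (beta + (1 - reward)) * C / (C + 1)
        theta_hat = alpha / (alpha + beta)       # posterior mean estimate (Eqn. 5)
        sq_err += (theta - theta_hat) ** 2
    return math.sqrt(sq_err / trials)

# Sweep C for one drift rate; per Fig. 4, larger sigma should favor smaller C,
# since a shorter effective memory is needed to track faster-moving probabilities.
for C in (20, 50, 100, 200, 500):
    print(C, round(simulate_rmse(sigma=0.01, C=C), 4))
```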


Fig. 4. Plots of RMSE for two different values of θ, three different values of the standard deviation σ, and with/without the exponential filtering of θ.


Fig. 3. Plot shows the estimated and actual values of θ_t for the case of a single arm. Estimated values are calculated based on the DTS algorithm.

Fig. 5. Plots of Regret comparing DTS with UCBf, UCB-Normal and TS algorithms for the case of Cutoff Boundaries (varying standard deviation).

We notice that the graphs for different values of θ, but the same standard deviation, are overlapping. We also observe that the value of C at which the RMSE is minimum drops with increasing σ. This is because higher values of σ lead to more dynamic arm probabilities, hence a shorter reward history is required for estimating θ. We next present an empirical evaluation of the different MAB algorithms using tuned values of the model parameters.

D. Varying Standard Deviation

In the first experiment to evaluate the performance of the different MAB strategies, we vary the standard deviation σ of θ. We consider a total of 10 arms, θ_opt = 0.6, and the reward probabilities of the other 9 arms are generated from the Uniform distribution U(0, 0.6). Fig. 5 and 6 show the regret obtained for different standard deviations over 10,000 trials for the cases of cutoff and absorbing boundaries. The different values of the standard deviation are {0.001, 0.005, 0.008, 0.01, 0.02}. We see that the DTS algorithm shows the least regret compared to the other MAB strategies for both the cases of absorbing as well as cutoff boundaries.

E. Changing the number of arms

We perform this experiment to show the effect of increasing the number of arms on the regret obtained for the case of Brownian bandits. We set θ_max = 0.6 and initially randomly generate a set of 9 arms with reward probabilities in the interval (0, 0.6) using the Uniform distribution, and add four arms from the same distribution U(0, 0.6), for a total of 10K trials. We use σ = 0.005 as the standard deviation for the DTS algorithm. As shown in Fig. 7 and 8, the DTS algorithm performs much better than the UCBf, Thompson Sampling and UCB-Normal algorithms. The difference between the UCBf and DTS algorithms grows as the number of arms increases, which shows that the UCBf algorithm does not scale with the number of arms for either absorbing or cutoff boundaries.
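For reference, a minimal version of this kind of regret experiment is sketched below: K arms drift independently according to Eqn. 7 with cutoff boundaries, and the cumulative regret of DTS is compared against pure TS (obtained here as the limiting case of an unbounded C). This is our simplified reconstruction for illustration only; it omits UCBf and UCB-Normal, uses a single run rather than 400 repetitions, and all function names are ours. The regret uses independent Bernoulli draws for the momentarily optimal arm, as in the paper's definition of r^*_n.

```python
import random

def dts_update(a, b, r, C):
    """DTS update rules (Eqns. 8-11); C = float('inf') gives pure Thompson Sampling."""
    if a + b < C:
        return a + r, b + (1 - r)
    return (a + r) * C / (C + 1), (b + (1 - r)) * C / (C + 1)

def run_regret(num_arms=10, sigma=0.005, C=100.0, trials=10_000, seed=1):
    """Cumulative regret on drifting Bernoulli arms with cutoff boundaries (sketch)."""
    rng = random.Random(seed)
    thetas = [0.6] + [rng.uniform(0.0, 0.6) for _ in range(num_arms - 1)]
    alpha = [2.0] * num_arms
    beta = [2.0] * num_arms
    regret = 0.0
    for _ in range(trials):
        # Every reward probability drifts whether its arm is played or not (Eqn. 7).
        thetas = [min(1.0, max(0.0, t + rng.gauss(0.0, sigma))) for t in thetas]
        samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = max(range(num_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < thetas[arm] else 0
        best = max(range(num_arms), key=thetas.__getitem__)
        regret += (1 if rng.random() < thetas[best] else 0) - reward
        alpha[arm], beta[arm] = dts_update(alpha[arm], beta[arm], reward, C)
    return regret

print("DTS :", run_regret(C=100.0))
print("TS  :", run_regret(C=float("inf")))
```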


Fig. 6. Plots of Regret comparing DTS with UCBf, UCB-Normal and TS algorithms for the case of Absorbing Boundaries (varying standard deviation).

Fig. 8. Plots of Regret comparing DTS with UCBf, UCB-Normal and TS algorithms for the case of Absorbing Boundaries (varying number of arms).

Fig. 7. Plots of Regret comparing DTS with UCBf, UCB-Normal and TS algorithms for the case of Cutoff Boundaries (varying number of arms).

We do not show the results of the UCB-Normal algorithm in Fig. 8, as it consistently shows poor results for the case of absorbing boundaries as well.

V. CONCLUSION

In this paper, we presented the Dynamic Thompson Sampling (DTS) algorithm. DTS builds upon the Order Statistics based Thompson Sampling framework by extending the framework with an exponential filtering capability. The purpose is to allow dynamically changing reward probabilities to be tracked over time. The experimental results and analysis presented in this paper show that the DTS algorithm significantly outperforms current state-of-the-art methods such as UCBf, Thompson Sampling and UCB-Normal for the case of dynamic reward probabilities possessing bounded Brownian motion. We also observe an increasing performance improvement as the number of arms increases, which demonstrates the usefulness of our proposed algorithm in large-scale MAB problems. The DTS strategy can be further extended to include variations such as mortal bandits, hierarchical bandits, as well as strategies for identifying the k-best arms by introducing immunity from elimination. We are also working on proving the theoretical bounds of the DTS algorithm.

REFERENCES

[1] O.-C. Granmo, "Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton," International Journal of Intelligent Computing and Cybernetics, vol. 2, no. 3, pp. 207-234, 2010.
[2] O.-C. Granmo and S. Berg, "Solving non-stationary bandit problems by random sampling from sibling Kalman filters," in Twenty Third International Conference on Industrial, Engineering, and Other Applications of Applied Intelligent Systems (IEA-AIE 2010), 2010.
[3] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, 1985.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235-256, 2002.
[5] J. C. Gittins, "Bandit processes and dynamic allocation indices," Journal of the Royal Statistical Society, Series B, vol. 41, no. 2, pp. 148-177, 1979.
[6] P. Whittle, "Restless bandits: Activity allocation in a changing world," Journal of Applied Probability, vol. 25, pp. 287-298, 1988. [Online]. Available: http://www.jstor.org/stable/3214163
[7] S. Guha, K. Munagala, and P. Shi, "Approximation algorithms for restless bandit problems," CoRR, vol. abs/0711.3861, 2007.
[8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," in 36th Annual Symposium on Foundations of Computer Science. IEEE, 1995.
[9] A. Slivkins and E. Upfal, "Adapting to a changing environment: the Brownian restless bandits," in COLT, 2008.
[10] A. K. Gupta and S. Nadarajah, Handbook of Beta Distribution and Its Applications. New York: Marcel Dekker Inc., 2004.
[11] J. Wyatt, "Exploration and inference in learning from reinforcement," Ph.D. thesis, University of Edinburgh, 1997.
[12] B. C. May, N. Korda, A. Lee, and D. S. Leslie, "Optimistic Bayesian sampling in contextual-bandit problems," submitted to the Annals of Applied Probability.
[13] S. Makridakis, S. C. Wheelwright, and R. J. Hyndman, Forecasting: Methods and Applications, 3rd ed. John Wiley and Sons, Inc., 1998, ch. 4, pp. 135-179.

