Thompson Sampling on Symmetric α-Stable Bandits


Abhimanyu Dubey and Alex Pentland
Massachusetts Institute of Technology
{dubeya, [email protected]}

Abstract

Thompson Sampling provides an efficient technique to introduce prior knowledge in the multi-armed bandit problem, along with providing remarkable empirical performance. In this paper, we revisit the Thompson Sampling algorithm under rewards drawn from α-stable distributions, which are a class of heavy-tailed probability distributions utilized in finance and economics, in problems such as modeling stock prices and human behavior. We present an efficient framework for α-stable posterior inference, which leads to two algorithms for Thompson Sampling in this setting. We prove finite-time regret bounds for both algorithms, and demonstrate through a series of experiments the stronger performance of Thompson Sampling in this setting. With our results, we provide an exposition of α-stable distributions in sequential decision-making, and enable sequential Bayesian inference in applications from diverse fields in finance and complex systems that operate on heavy-tailed features.

1 Introduction

The multi-armed bandit (MAB) problem is a fundamental model for understanding the exploration-exploitation dilemma in sequential decision-making. The problem and several of its variants have been studied extensively over the years, and a number of algorithms have been proposed that optimally solve the bandit problem when the reward distributions are well-behaved, i.e. have finite support or are sub-exponential.

The most prominently studied class of algorithms are the Upper Confidence Bound (UCB) algorithms, which employ an "optimism in the face of uncertainty" heuristic [ACBF02] and have been shown to be optimal (in terms of regret) in several cases [CGM+13, BCBL13]. Over the past few years, however, there has been a resurgence of interest in the Thompson Sampling (TS) algorithm [Tho33], which approaches the problem from a Bayesian perspective. Rigorous empirical evidence in favor of TS demonstrated by [CL11] sparked new interest in the theoretical analysis of the algorithm, and the seminal work of [AG12, AG13, RVR14] demonstrated the optimality of TS when rewards are bounded in [0, 1] or are Gaussian. These results were extended in the work of [KKM13] to more general, exponential family reward distributions. The empirical studies, along with theoretical guarantees, have established TS as a powerful algorithm for the MAB problem.

However, when designing decision-making algorithms for complex systems, we see that interactions in such systems often lead to heavy-tailed and power-law distributions, such as those arising when modeling stock prices [BT03], preferential attachment in social networks [MCM+13], and online behavior on websites [KT10].

Specifically, we consider a family of extremely heavy-tailed reward distributions known as α-stable distributions. This family refers to a class of distributions parameterized by the exponent α, which includes the Gaussian (α = 2), Lévy (α = 1/2) and Cauchy (α = 1) distributions, all of which are used extensively in economics [Fra], finance [CW03] and signal processing [SN93].

The primary hurdle in creating machine learning algorithms that account for α-stable distributions, however, is their intractable probability density, which cannot be expressed analytically. This prevents even a direct evaluation of the likelihood under these distributions. Their heavy-tailed nature, additionally, often leads to standard algorithms (such as Thompson Sampling assuming Gaussian rewards) concentrating on incorrect arms.
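To make this concrete, the short sketch below (an illustration of ours, not material from the paper) draws symmetric α-stable rewards with SciPy's levy_stable distribution and prints the empirical mean for several values of α. It assumes SciPy is available, and levy_stable's default parameterization may differ slightly in convention from the S_α(β, μ, σ) notation used later; the point is only that as α drops below 2, rare extreme samples dominate and the naive empirical mean becomes unreliable.

```python
# Illustrative sketch (not from the paper): symmetric alpha-stable rewards
# drawn with SciPy, showing how heavy tails destabilize the empirical mean.
# Assumes scipy is installed; levy_stable's default parameterization may
# differ slightly in convention from S_alpha(beta, mu, sigma) as used here.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

def sample_rewards(alpha, mu=0.0, sigma=1.0, n=10_000):
    """Draw n symmetric (beta = 0) alpha-stable rewards centered at mu."""
    return levy_stable.rvs(alpha, 0.0, loc=mu, scale=sigma, size=n,
                           random_state=rng)

for alpha in (2.0, 1.8, 1.3, 1.1):
    x = sample_rewards(alpha)
    # alpha = 2 recovers a Gaussian; as alpha decreases, rare extreme draws
    # dominate and the empirical mean drifts far from the true location mu = 0
    # (for alpha <= 1 the mean does not even exist, hence the paper's alpha > 1).
    print(f"alpha={alpha}: empirical mean={x.mean():9.3f}, "
          f"largest |reward|={np.abs(x).max():12.1f}")
```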
In this paper, we create two algorithms for Thompson Sampling under symmetric α-stable rewards with finite means. Our contributions can be summarized as follows:

1. Using auxiliary variables, we construct a framework for posterior inference in symmetric α-stable bandits that leads to the first efficient algorithm for Thompson Sampling in this setting, which we call α-TS.

2. To the best of our knowledge, we provide the first finite-time polynomial bound on the Bayesian Regret of Thompson Sampling, achieved by α-TS, in this setting.

3. We improve on this regret by proposing a modified Thompson Sampling algorithm, called Robust α-TS, that utilizes a truncated mean estimator, and obtain the first $\tilde{O}(N^{1/(1+\epsilon)})$ Bayesian Regret in the α-stable setting. Our bound matches the optimal bound for α ∈ (1, 2) (within logarithmic factors).

4. Through a series of experiments, we demonstrate the proficiency of our two Thompson Sampling algorithms for α-stable rewards, which consistently outperform all existing benchmarks.

Our paper is organized as follows: we first give a technical overview of the MAB problem, the Thompson Sampling algorithm and α-stable distributions. Next, we present the central algorithm α-TS and its analysis, followed by the same for the Robust α-TS algorithm. We then provide experimental results on multiple simulation benchmarks and, finally, discuss related work in this area prior to closing remarks.

2 Preliminaries

2.1 Thompson Sampling

The K-Armed Bandit Problem: In any instance of the K-armed bandit problem, there exists an agent with access to a set of K actions (or "arms"). The learning proceeds in rounds, indexed by t ∈ [1, T]. The total number of rounds, known as the time horizon T, is known in advance. The problem is iterative, wherein for each round t ∈ [T]:

1. The agent picks arm a_t ∈ [K].
2. The agent observes reward r_{a_t}(t) from that arm.

For arm k ∈ [K], rewards come from a distribution D_k with mean μ_k = E_{D_k}[r].¹ The largest expected reward is denoted by μ* = max_{k ∈ [K]} μ_k, and the corresponding arm(s) are denoted as the optimal arm(s) k*. In our analysis, we will focus exclusively on the i.i.d. setting, that is, for each arm, rewards are drawn independently and identically from D_k every time arm k is pulled.

To measure the performance of any (possibly randomized) algorithm, we utilize a measure known as the Regret R(T), which, at any round T, is the difference between the expected reward of always playing an optimal arm and the cumulative mean reward of the algorithm:

$$R(T) = \mu^* T - \sum_{t=0}^{T} \mu_{a_t} \qquad (1)$$

¹ α-stable distributions with α ≤ 1 do not admit a finite first moment. To continue with existing measures of regret, we only consider rewards with α > 1.

Thompson Sampling (TS): Thompson Sampling [Tho33] proceeds by maintaining a posterior distribution over the parameters of the bandit arms. If we assume that, for each arm k, the reward distribution D_k is parameterized by a (possibly vector-valued) parameter θ_k that comes from a set Θ with a prior probability distribution p(θ_k) over the parameters, the Thompson Sampling algorithm proceeds by selecting arms based on the posterior probability of the reward under the arms. For each round t ∈ [T], the agent:

1. Draws parameters θ̂_k(t) for each arm k ∈ [K] from the posterior distribution of the parameters, given the previous rewards r_k(t − 1) = {r_k^(1), r_k^(2), ...} obtained up to round t − 1 (note that the posterior distribution for each arm only depends on the rewards obtained using that arm). When t = 1, this is just the prior distribution over the parameters.

$$\hat{\theta}_k(t) \sim p(\theta_k \mid r_k(t-1)) \propto p(r_k(t-1) \mid \theta_k)\, p(\theta_k) \qquad (2)$$

2. Given the drawn parameters θ̂_k(t) for each arm, chooses the arm a_t with the largest mean reward under the posterior distribution.

$$a_t = \arg\max_{k \in [K]} \mu_k(\hat{\theta}_k(t)) \qquad (3)$$

3. Obtains reward r_t after taking action a_t and updates the posterior distribution for arm a_t.

In the Bayesian case, the measure of performance we will utilize in this paper is the Bayes Regret (BR) [RVR14], which is the expected regret over the priors. Denoting the parameters over all arms as θ̄ = {θ_1, ..., θ_K} and their corresponding product distribution as D̄ = ∏_i D_i, for any policy π the Bayes Regret is given by:

$$\mathrm{BayesRegret}(T, \pi) = \mathbb{E}_{\bar{\theta} \sim \bar{D}}[R(T)] \qquad (4)$$

While the regret provides a stronger analysis, any bound on the Bayes Regret is essentially a bound on the expected regret, since if an algorithm admits a Bayes Regret of O(g(T)), then its expected regret is also stochastically bounded by g(·) [RVR14]. Formally, we have, for constants M, δ:

$$\mathbb{P}\left( \frac{\mathbb{E}[R(T) \mid \bar{\theta}]}{g(T)} \ge M \right) \le \delta \quad \forall\, T \in \mathbb{N} \qquad (5)$$
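The loop above fits in a few lines of code. The sketch below is illustrative only: it instantiates steps 1–3 with Gaussian rewards and a conjugate N(0, 1) prior on each arm's mean (our choices, not the paper's), and tracks the regret of Eq. (1); the paper's α-TS replaces the conjugate update with posterior inference over α-stable parameters, introduced later.

```python
# Illustrative sketch of the generic Thompson Sampling loop of Section 2.1,
# instantiated with Gaussian rewards and a conjugate N(0, 1) prior on each
# arm's mean. This is NOT the paper's alpha-TS; it only shows steps 1-3 and
# the regret bookkeeping of Eq. (1). All names below are our own.
import numpy as np

rng = np.random.default_rng(1)
K, T = 5, 2000

true_means = rng.normal(0.0, 1.0, size=K)   # mu_k, drawn from the prior
mu_star = true_means.max()                   # mu*

post_mean = np.zeros(K)   # posterior N(post_mean, post_var) over each arm's mean
post_var = np.ones(K)     # starts at the N(0, 1) prior; rewards assumed N(mu_k, 1)

regret = 0.0
for t in range(1, T + 1):
    theta_hat = rng.normal(post_mean, np.sqrt(post_var))  # step 1: sample theta_k(t)
    a = int(np.argmax(theta_hat))                          # step 2: a_t = argmax_k mu_k(theta_hat)
    r = rng.normal(true_means[a], 1.0)                     # observe reward r_t
    # Step 3: conjugate posterior update for arm a_t (known unit reward variance).
    new_var = 1.0 / (1.0 / post_var[a] + 1.0)
    post_mean[a] = new_var * (post_mean[a] / post_var[a] + r)
    post_var[a] = new_var
    regret += mu_star - true_means[a]                      # Eq. (1), in expectation

print(f"expected regret after T={T} rounds: {regret:.1f}")
```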
2.2 α-Stable Distributions

α-Stable distributions, introduced by Lévy [Lév25], are a class of probability distributions defined over ℝ whose members are closed under linear transformations.

Definition 1 ([BHW05]). Let X_1 and X_2 be two independent instances of the random variable X. X is stable if, for a_1 > 0 and a_2 > 0, a_1 X_1 + a_2 X_2 follows the same distribution as cX + d for some c > 0 and d ∈ ℝ.

A random variable X ∼ S_α(β, μ, σ) follows an α-stable distribution described by the parameters α ∈ (0, 2] (characteristic exponent) and β ∈ [−1, 1] (skewness), which are responsible for the shape and concentration of the distribution, and the parameters μ ∈ ℝ (shift) and σ ∈ ℝ₊ (scale), which correspond to the location and scale respectively. While it is not possible to analytically express the density function for generic α-stable distributions, they are known to admit the characteristic function φ(x; α, β, σ, μ):

$$\phi(x; \alpha, \beta, \sigma, \mu) = \exp\left\{ i x \mu - |\sigma x|^{\alpha} \left( 1 - i \beta\, \mathrm{sign}(x)\, \Phi_{\alpha}(x) \right) \right\},$$

where Φ_α(x) is given by

$$\Phi_{\alpha}(x) = \begin{cases} \tan\!\left(\frac{\pi \alpha}{2}\right) & \text{when } \alpha \ne 1, \\ -\frac{2}{\pi} \log |x| & \text{when } \alpha = 1. \end{cases}$$

Figure 1: Sample probability densities of standard (μ = 0, σ = 1) symmetric α-stable distributions for various values of α [Wik19].

For fixed values of α, β, σ and μ, we can recover the density function from φ(·) via the inverse Fourier transform:

$$p(z) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(x; \alpha, \beta, \sigma, \mu)\, e^{-izx}\, dx$$

Most of the attention in the analysis of α-stable distributions has been focused on the stability parameter α, which is responsible for the tail "fatness".
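Although the density has no closed form, the inverse Fourier transform above can be evaluated numerically in the symmetric (β = 0) case. The sketch below is an illustration of ours, not the paper's inference machinery; the grid settings are ad hoc choices, and the two printed values check the closed-form special cases of the family.

```python
# Illustrative sketch (not from the paper): numerically inverting the
# characteristic function phi(x) = exp(i*x*mu - |sigma*x|**alpha) (beta = 0)
# to recover the symmetric alpha-stable density. Grid settings are ad hoc.
import numpy as np

def symmetric_stable_pdf(z, alpha, mu=0.0, sigma=1.0,
                         x_max=100.0, n_grid=200_001):
    """p(z) = (1/pi) * integral_0^inf exp(-(sigma*x)**alpha) * cos(x*(z - mu)) dx."""
    x = np.linspace(0.0, x_max, n_grid)
    integrand = np.exp(-(sigma * x) ** alpha) * np.cos(x * (z - mu))
    return np.trapz(integrand, x) / np.pi

# Sanity checks against the closed-form members of the family:
# alpha = 2 is N(mu, 2*sigma**2), alpha = 1 is Cauchy(mu, sigma).
print(symmetric_stable_pdf(0.0, alpha=2.0))  # ~ 1/(2*sqrt(pi)) ~ 0.2821
print(symmetric_stable_pdf(0.0, alpha=1.0))  # ~ 1/pi           ~ 0.3183
```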

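Returning to the tail behavior noted above, a quick empirical check (again a sketch of ours using SciPy, not material from the paper) compares tail probabilities for a few values of α: for α < 2, P(|X| > x) decays only polynomially, roughly like x^(−α), rather than at the Gaussian's exponential rate.

```python
# Illustrative sketch (not from the paper): estimating P(|X| > x) for a few
# values of alpha. Smaller alpha gives fatter tails; for alpha < 2 the tail
# probability decays roughly like x**(-alpha).
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2)
threshold = 10.0

for alpha in (2.0, 1.5, 1.1):
    x = levy_stable.rvs(alpha, 0.0, size=200_000, random_state=rng)
    tail = np.mean(np.abs(x) > threshold)
    print(f"alpha={alpha}: estimated P(|X| > {threshold}) = {tail:.5f}")
```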