Bandits for BMO Functions
Tianyu Wang¹  Cynthia Rudin¹
Abstract

We study the bandit problem where the underlying expected reward is a Bounded Mean Oscillation (BMO) function. BMO functions are allowed to be discontinuous and unbounded, and are useful in modeling signals with infinities in the domain. We develop a toolset for BMO bandits, and provide an algorithm that can achieve poly-log δ-regret – a regret measured against an arm that is optimal after removing a δ-sized portion of the arm space.

1. Introduction

Multi-Armed Bandit (MAB) problems model sequential decision making under uncertainty. Algorithms for this problem have important real-world applications, including medical trials (Robbins, 1952) and web recommender systems (Li et al., 2010). While bandit methods have been developed for various settings, one problem setting that has not been studied, to the best of our knowledge, is when the expected reward function is a Bounded Mean Oscillation (BMO) function in a metric measure space. Intuitively, a BMO function does not deviate too much from its mean over any ball, and can be discontinuous or unbounded.

Such unbounded functions can model many real-world quantities. Consider the situation in which we are optimizing the parameters of a process (e.g., a physical or biological system) whose behavior can be simulated. The simulator is computationally expensive to run, which is why we cannot exhaustively search the (continuous) parameter space for the optimal parameters. The "reward" of the system is sensitive to parameter values and can increase very quickly as the parameters change. In this case, by failing to model the infinities, even state-of-the-art continuum-armed bandit methods fail to compute valid confidence bounds, potentially leading to under-exploration of the important part of the parameter space, and they may completely miss the optima.

As another example, when we try to determine failure modes of a system or simulation, we might try to locate singularities in the variance of its outputs. These are cases where the variance of the outputs becomes extremely large. In this case, we can use a bandit algorithm for BMO functions to efficiently find where the system is most unstable.

There are several difficulties in handling BMO rewards. First and foremost, due to the unboundedness of the expected reward function, traditional regret metrics are doomed to fail. To handle this, we define a new performance measure, called δ-regret. The δ-regret measures regret against an arm that is optimal after removing a δ-sized portion of the arm space. Under this performance measure, and because the reward is a BMO function, our attention is restricted to a subspace on which the expected reward is finite. Subsequently, strategies that conform to the δ-regret are needed.

To develop a strategy that handles δ-regret, we leverage the John-Nirenberg inequality, which plays a crucial role in harmonic analysis. We construct our arm index using the John-Nirenberg inequality, in addition to a traditional UCB index. In each round, we play an arm with the highest index. As we play more and more arms, we focus our attention on regions that contain good arms. To do this, we discretize the arm space adaptively, and carefully control how the index evolves with the discretization. We provide two algorithms – Bandit-BMO-P and Bandit-BMO-Z – that discretize the arm space in different ways. Bandit-BMO-P keeps a strict partitioning of the arm space. Bandit-BMO-Z keeps a collection of cubes in which a subset of cubes forms a discretization. Bandit-BMO-Z achieves poly-log δ-regret with high probability.

2. Related Works

Bandit problems in different settings have been actively studied since as far back as Thompson (1933). Upper confidence bound (UCB) algorithms remain popular (Robbins, 1952; Lai and Robbins, 1985; Auer, 2002), and many variations of upper confidence bound algorithms have been studied.

¹Department of Computer Science, Duke University, Durham, NC, USA. Correspondence to: Tianyu Wang.
Some works use KL-divergence to construct the confidence bound (Lai and Robbins, 1985; Garivier and Cappé, 2011; Maillard et al., 2011), and some works include variance estimates within the confidence bound (Audibert et al., 2009; Auer and Ortner, 2010). UCB is also used in the contextual setting (e.g., Li et al., 2010; Krause and Ong, 2011; Slivkins, 2014).

Perhaps Lipschitz bandits are closest to BMO bandits. The Lipschitz bandit problem was termed "continuum-armed bandits" in its early stages (Agrawal, 1995). In continuum-armed bandits, the arm space is continuous – e.g., [0, 1]. Along this line, bandits whose reward functions are Lipschitz continuous (or Hölder continuous) have been studied. In particular, Kleinberg (2005) proves an Ω(T^{2/3}) lower bound and proposes an Õ(T^{2/3}) algorithm. Under other extra conditions on top of Lipschitzness, a regret rate of Õ(T^{1/2}) was achieved (Cope, 2009; Auer et al., 2007). For general (doubling) metric spaces, the Zooming bandit algorithm (Kleinberg et al., 2008) and the Hierarchical Optimistic Optimization algorithm (Bubeck et al., 2011a) were developed. In more recent years, some attention has been given to Lipschitz bandit problems with certain extra conditions. To name a few, Bubeck et al. (2011b) study Lipschitz bandits for differentiable rewards, which enables algorithms to run without explicitly knowing the Lipschitz constants. The idea of robust mean estimators (Bubeck et al., 2013; Bickel et al., 1965; Alon et al., 1999) was applied to the Lipschitz bandit problem to cope with heavy-tail rewards, leading to the development of a near-optimal algorithm (Lu et al., 2019). Lipschitz bandits with an unknown metric, where a clustering is used to infer the underlying unknown metric, have been studied by Wanigasekara and Yu (2019). Lipschitz bandits with discontinuous but bounded rewards were studied by Krishnamurthy et al. (2019).

An important setting that is beyond the scope of the aforementioned works is when the expected reward is allowed to be unbounded. This setting breaks the previous Lipschitzness assumption, and also the "almost Lipschitzness" assumption (Krishnamurthy et al., 2019), which allows discontinuities but requires boundedness. To the best of our knowledge, this paper is the first work that studies the bandit learning problem for BMO functions.

3. Preliminaries

We review the concept of (rectangular) Bounded Mean Oscillation (BMO) in Euclidean space (e.g., Fefferman, 1979; Stein and Murphy, 1993).

Definition 1 (BMO Functions). Let (R^d, µ) be the Euclidean space with the Lebesgue measure, and let L¹_loc(R^d, µ) denote the space of measurable functions (on R^d) that are locally integrable with respect to µ. A function f ∈ L¹_loc(R^d, µ) is said to be a Bounded Mean Oscillation function, f ∈ BMO(R^d, µ), if there exists a constant C_f such that for any hyper-rectangle Q ⊂ R^d,

    (1/µ(Q)) ∫_Q |f − ⟨f⟩_Q| dµ ≤ C_f,   where ⟨f⟩_Q := (1/µ(Q)) ∫_Q f dµ.   (1)

For such a function f, the infimum of the admissible constants C_f (over all hyper-rectangles Q) is denoted by ‖f‖_BMO, or simply ‖f‖. We use ‖f‖_BMO and ‖f‖ interchangeably in this paper.

A BMO function can be discontinuous and unbounded. The function in Figure 1 illustrates the singularities a BMO function can have over its domain. Our problem is most interesting when multiple singularities of this kind occur. To properly handle the singularities, we will need the John-Nirenberg inequality (Theorem 1), which plays a central role in our paper.

Theorem 1 (John-Nirenberg inequality). Let µ be the Lebesgue measure and let f ∈ BMO(R^d, µ). Then there exist constants C₁ and C₂ such that, for any hypercube q ⊂ R^d and any λ > 0,

    µ({x ∈ q : |f(x) − ⟨f⟩_q| > λ}) ≤ C₁ µ(q) exp( −λ / (C₂ ‖f‖) ).   (2)

The John-Nirenberg inequality dates back to at least John (1961), and a proof is provided in Appendix C.

As shown in Appendix C, C₁ = e and C₂ = e2^d provide a pair of legitimate C₁, C₂ values. However, this pair may be overly conservative. Tight values of C₁ and C₂ are not known in general (Lerner, 2013; Slavin and Vasyunin, 2017), and it is also conjectured that C₁ and C₂ might be independent of dimension (Cwikel et al., 2012). For the rest of the paper, we use ‖f‖ = 1, C₁ = 1, and C₂ = 1, which permits cleaner proofs. Our results generalize to cases where C₁, C₂ and ‖f‖ are other constant values.

In this paper, we will work in Euclidean space with the Lebesgue measure. For our purposes, Euclidean space is as general as doubling spaces, since we can always embed a doubling space into a Euclidean space with some distortion of the metric. This fact is formally stated in Theorem 2.

Theorem 2 (Assouad, 1983). Let (X, d) be a doubling metric space and ς ∈ (0, 1). Then (X, d^ς) admits a bi-Lipschitz embedding into R^n for some n ∈ N.

In a doubling space, any ball of radius ρ can be covered by M_d balls of radius ρ/2, where M_d is the doubling constant. In the space (R^d, ‖·‖_∞), the doubling constant is M_d = 2^d. In domains with other geometries, the doubling constant can be much smaller than exponential in the dimension. Throughout the rest of the paper, we use M_d to denote the doubling constant.
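To make Definition 1 concrete, the sketch below numerically estimates the mean oscillation of the unbounded function f(x) = −log|x| (the example of Figure 1) over shrinking intervals Q = (0, 2^−k). The choice of function, interval family, and midpoint-rule discretization are our illustration, not part of the paper; for this f, a short calculus exercise gives mean oscillation 2/e ≈ 0.736 at every scale, even though ⟨f⟩_Q (and sup f) blows up near the singularity.

```python
import math

def mean_oscillation(f, lo, hi, n=50_000):
    """Midpoint-rule estimate of the mean oscillation of f on Q = (lo, hi):
    (1/µ(Q)) * ∫_Q |f - <f>_Q| dµ, where <f>_Q is the average of f over Q."""
    h = (hi - lo) / n
    xs = [lo + (i + 0.5) * h for i in range(n)]
    avg = sum(f(x) for x in xs) / n              # <f>_Q
    osc = sum(abs(f(x) - avg) for x in xs) / n   # mean oscillation over Q
    return osc, avg

f = lambda x: -math.log(abs(x))

for k in range(1, 6):
    osc, avg = mean_oscillation(f, 0.0, 2.0 ** -k)
    # the oscillation stays near 2/e ≈ 0.736 at every scale, while the
    # average 1 + k*log(2) grows as Q shrinks toward the singularity at 0
    print(f"Q = (0, 2^-{k}): oscillation = {osc:.3f}, average = {avg:.3f}")
```

This scale-invariance of the oscillation is exactly what membership in BMO captures, and what a uniform (Lipschitz or bounded-reward) assumption cannot.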
4. Problem Setting: BMO Bandits

The goal of a stochastic bandit algorithm is to exploit the current information, and explore the space efficiently. In this paper, we focus on the following setting: a payoff function is defined over the arm space ([0, 1)^d, ‖·‖_max, µ), where µ is the Lebesgue measure (note that [0, 1)^d is a Lipschitz domain). The payoff function is

    f : [0, 1)^d → R,  where f ∈ BMO([0, 1)^d, µ).   (3)

The actual observations are given by y(a) = f(a) + E_a, where E_a is a zero-mean noise random variable whose distribution can change with a. We assume that for all a, |E_a| ≤ D_E almost surely for some constant D_E (N1). Our results generalize to the setting with sub-Gaussian noise (Shamir, 2011). We also assume that the expected reward function does not depend on the noise.

In our setting, an agent interacts with this environment in the following fashion. At each round t, based on past observations (a₁, y₁, ..., a_{t−1}, y_{t−1}), the agent makes a query at point a_t and observes the (noisy) payoff y_t, where y_t is revealed only after the agent has made the decision a_t. For a payoff function f and an arm sequence a₁, a₂, ..., a_T, we use the δ-regret incurred up to time T as the performance measure (Definition 2).

Definition 2 (δ-regret). Let f ∈ BMO([0, 1)^d, µ). A number δ ≥ 0 is called f-admissible if there exists a real number z₀ that satisfies

    µ({a ∈ [0, 1)^d : f(a) > z₀}) = δ.   (4)

For an f-admissible δ, define the set F^δ to be

    F^δ := { z ∈ R : µ({a ∈ [0, 1)^d : f(a) > z}) = δ }.   (5)

Define f^δ := inf F^δ. For a sequence of arms A₁, A₂, ..., and σ-algebras F₁, F₂, ..., where F_t describes all randomness before arm A_t, define the δ-regret at time t as

    r_t^δ := max{0, f^δ − E_t[f(A_t)]},   (6)

where E_t is the expectation conditioned on F_t. The total δ-regret up to time T is then R_T^δ := Σ_{t=1}^T r_t^δ.

Intuitively, the δ-regret is measured against an amended reward function that is created by chopping off a small portion of the arm space where the reward may become unbounded. As an example, Figure 1 plots a BMO function and its f^δ value. A problem defined as above, with performance measured by δ-regret, is called a BMO bandit problem.

[Figure 1. Graph of f(x) = −log(|x|), with δ and f^δ annotated. This function is an unbounded BMO function.]

Remark 1. The definition of δ-regret, or a definition of this kind, is needed for a reward function f ∈ BMO([0, 1)^d, µ): for an unbounded BMO function f, the max value is infinity, while f^δ is a finite number as long as δ is f-admissible.

Remark 2 (Connection to bandits with heavy tails). In the definition of bandits with heavy tails (Bubeck et al., 2013; Medina and Yang, 2016; Shao et al., 2018; Lu et al., 2019), the reward distribution at a fixed arm is heavy-tailed – having a bounded expectation and a bounded (1+β)-moment (β ∈ (0, 1]). In the case of BMO rewards, the expected reward itself can be unbounded. Figure 1 gives an instance of an unbounded BMO reward, which means the BMO bandit problem is not covered by the settings of bandits with heavy tails.

A quick consequence of the definition of δ-regret is the following lemma, which is used in the regret analysis when handling the concentration around good arms.

Lemma 1. Let f be the reward function. For any f-admissible δ ≥ 0, let S^δ := {a ∈ [0, 1)^d : f(a) > f^δ}. Then S^δ is measurable and µ(S^δ) = δ.

Before moving on to the algorithms, we put forward the following assumption.

Assumption 1. We assume that the expected reward function f ∈ BMO([0, 1)^d, µ) satisfies ⟨f⟩_{[0,1)^d} = 0.

Assumption 1 does not sacrifice generality. Since f is a BMO function, it is locally integrable. Thus ⟨f⟩_{[0,1)^d} is finite, and we can translate the reward function up or down so that ⟨f⟩_{[0,1)^d} = 0.

5. Solve BMO Bandits via Partitioning

BMO bandit problems can be solved by partitioning the arm space and treating the problem as a finite-arm problem among the partitions. For our purpose, we maintain a sequence of partitions using dyadic cubes.
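Since f^δ is defined through the measure of super-level sets, when the level-set measure is continuous and strictly decreasing in z it is simply a quantile of f under the uniform measure on the arm space: µ({a : f(a) > z}) = δ exactly when z is the (1−δ)-quantile of f(U) with U ~ Uniform[0, 1). The sketch below (our illustration; the grid size is an arbitrary choice) estimates f^δ this way for the Figure 1 function f(x) = −log|x|, for which the closed form is f^δ = log(1/δ).

```python
import math

def f_delta(f, delta, n=200_000):
    """Grid estimate of f^δ = inf{ z : µ({a in [0,1): f(a) > z}) = δ }.
    With µ the Lebesgue measure on [0,1), this is the (1-δ)-quantile of f
    along a uniform grid of the arm space (d = 1 for illustration)."""
    vals = sorted(f((i + 0.5) / n) for i in range(n))
    return vals[min(n - 1, int((1.0 - delta) * n))]

f = lambda x: -math.log(abs(x))
for delta in (0.1, 0.01):
    print(delta, f_delta(f, delta), math.log(1 / delta))  # estimate vs closed form log(1/δ)
```

Note that shrinking δ pushes f^δ to infinity for this f, which is exactly why the regret must be measured relative to an f-admissible δ > 0 rather than to sup f.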
By dyadic cubes of R^d, we refer to the collection of all cubes of the form

    Q_{R^d} := { Π_{i=1}^d [m_i 2^{−k}, m_i 2^{−k} + 2^{−k}) : m₁, ..., m_d, k ∈ Z },   (7)

where Π denotes the Cartesian product. Dyadic cubes of [0, 1)^d are Q_{[0,1)^d} := {q ∈ Q_{R^d} : q ⊂ [0, 1)^d}. For example, the dyadic cubes of [0, 1)² are {[0, 1)², [0, 0.5)², [0.5, 1)², [0.5, 1) × [0, 0.5), ...}.

We say a dyadic cube Q is a direct sub-cube of a dyadic cube Q' if Q ⊆ Q' and the edge length of Q' is twice the edge length of Q. By the definition of the doubling constant, any cube Q has M_d direct sub-cubes, and these direct sub-cubes form a partition of Q. If Q is a direct sub-cube of Q', then Q' is a direct super cube of Q.

At each step t, Bandit-BMO-P treats the problem as a finite-arm bandit problem with respect to the cubes in the dyadic partition at time t; each cube possesses a confidence bound. The algorithm then chooses a best cube according to UCB, and chooses an arm uniformly at random within the chosen cube. Before formulating our strategy, we put forward several functions that summarize cube statistics.

Let Q_t be the collection of dyadic cubes of [0, 1)^d at time t (t ≥ 1), and let (a₁, y₁, a₂, y₂, ..., a_t, y_t) be the observations received up to time t. We define

• the cube count n_t : Q_t → R, such that for q ∈ Q_t,

    n_t(q) := Σ_{i=1}^{t−1} 1[a_i ∈ q];   ñ_t(q) := max(1, n_t(q));   (8)

• the cube average m_t : Q_t → R, such that for q ∈ Q_t,

    m_t(q) := ( Σ_{i=1}^{t−1} y_i 1[a_i ∈ q] ) / n_t(q)  if n_t(q) > 0,  and m_t(q) := 0 otherwise.   (9)

At time t, based on the partition Q_{t−1} and the observations (a₁, y₁, a₂, y₂, ..., a_{t−1}, y_{t−1}), our bandit algorithm picks a cube (and plays an arm within the cube uniformly at random). More specifically, the algorithm picks

    Q_t ∈ arg max_{q ∈ Q_{t−1}} U_t(q),  where   (10)
    U_t(q) := m_t(q) + H_t(q) + J(q),   (11)
    H_t(q) := (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(q)),
    J(q) := ⌊log₂(µ(q)/η)⌋₊,
    Ψ := max{ log(T²/ε), 2 log₂(1/η) },   (12)

where T is the time horizon, D_E is the a.s. bound on the noise, ε and η are algorithm parameters (to be discussed in more detail later), and ⌊z⌋₊ := max{0, z}. Here Ψ is the "effective bound" of the expected reward, and η controls the minimal cube size in the partition Q_t (Proposition 3 in Appendix A.4). All these quantities will be discussed in more detail as we develop our algorithm.

After playing an arm and observing the reward, we update the partition into a finer one if needed. Next, we discuss our partition refinement rule and the tie-breaking mechanism.

Partition refinement: We start with Q₀ = {[0, 1)^d}. At time t, we split cubes in Q_{t−1} to construct Q_t so that the following is satisfied for every q ∈ Q_t:

    H_t(q) ≥ J(q),  or equivalently
    (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(q)) ≥ ⌊log₂(µ(q)/η)⌋₊.   (13)

In (13), the left-hand side does not decrease as we make splits (the numerator remains constant while the denominator can only decrease), while the right-hand side decreases until it hits zero as we make more splits. Thus (13) can always be satisfied with additional splits.

Tie-breaking: We break our tie-breaking mechanism down into two steps. In the first step, we choose a cube Q_t ∈ Q_{t−1} such that

    Q_t ∈ arg max_{q ∈ Q_{t−1}} U_t(q).   (14)

After deciding from which cube to choose an arm, we play an arm A_t uniformly at random within the cube Q_t. If the measure µ is non-uniform, we play the arm A_t so that for any subset S ⊂ Q_t, P(A_t ∈ S) = µ(S)/µ(Q_t).

The random variables {(Q_{t'}, A_{t'}, Y_{t'})}_{t'} (cube selection, arm selection, reward) describe all randomness in the learning process up to time t. We summarize this strategy in Algorithm 1. An analysis of Algorithm 1 is found in Section 5.1, which also provides some tools for handling the δ-regret. Then, in Section 6, we provide an improved algorithm that exhibits a stronger performance guarantee.

5.1. Regret Analysis of Bandit-BMO-P

In this section we provide a theoretical guarantee for the algorithm. We will use capital letters (e.g., Q_t, A_t, Y_t) to denote random variables, and lower-case letters (e.g., a, q) to denote non-random quantities, unless otherwise stated.

Theorem 3. Fix any T. With probability at least 1 − 2Tε, for any δ > |Q_T|η such that δ is f-admissible, the total δ-regret for Algorithm 1 up to time T satisfies

    Σ_{t=1}^T r_t^δ ≲_d Õ( √(T |Q_T|) ),   (15)

where the ≲_d sign omits constants that depend on d, and |Q_T| is the cardinality of Q_T.
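The cube statistics (8)–(9) and the index (10)–(12) are straightforward to compute from the observation history. The sketch below implements them for d = 1, so each dyadic cube is just an interval; the concrete values of T, ε, η, D_E, the fixed 4-cube partition, and the dummy observations are illustrative placeholders of ours, not the adaptively refined partition Q_{t−1} of the actual algorithm.

```python
import math, random

# illustrative parameter values (placeholders, not from the paper)
T, eps, eta, D_E = 1000, 0.01, 1e-3, 1.0
Psi = max(math.log(T**2 / eps), 2 * math.log2(1 / eta))   # (12)

def n_t(obs, q):
    """Cube count (8): number of past arms a_i falling in the interval q = (lo, hi)."""
    return sum(1 for a, _ in obs if q[0] <= a < q[1])

def m_t(obs, q):
    """Cube average (9): mean observed payoff inside q, or 0 if q is empty."""
    ys = [y for a, y in obs if q[0] <= a < q[1]]
    return sum(ys) / len(ys) if ys else 0.0

def H_t(obs, q):
    """Hoeffding-type width, with n~_t(q) = max(1, n_t(q))."""
    return (Psi + D_E) * math.sqrt(2 * math.log(2 * T**2 / eps)) / math.sqrt(max(1, n_t(obs, q)))

def J(q):
    """Discretization bonus floor(log2(µ(q)/η))_+ coming from the John-Nirenberg inequality."""
    return max(0, math.floor(math.log2((q[1] - q[0]) / eta)))

def U_t(obs, q):
    """Index (11): empirical mean + confidence width + discretization bonus."""
    return m_t(obs, q) + H_t(obs, q) + J(q)

# one UCB step over a fixed partition of [0,1) into 4 intervals
obs = [(random.random(), 0.0) for _ in range(20)]
partition = [(i / 4, (i + 1) / 4) for i in range(4)]
q_best = max(partition, key=lambda q: U_t(obs, q))
a_next = random.uniform(*q_best)   # play uniformly at random inside the chosen cube
```

The J(q) term is what distinguishes this index from a vanilla UCB: it compensates for the possibility that the within-cube average understates an unbounded peak hidden inside a large cube.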
Algorithm 1 Bandit-BMO-Partition (Bandit-BMO-P)
1: Problem intrinsics: µ(·), D_E, d, M_d.
2: {µ(·) is the Lebesgue measure. D_E bounds the noise.}
3: {d is the dimension of the arm space.}
4: {M_d is the doubling constant of the arm space.}
5: Algorithm parameters: η > 0, ε > 0, T.
6: {T is the time horizon. ε and η are parameters.}
7: for t = 1, 2, ..., T do
8:   Let m_t and n_t be defined as in (9) and (8).
9:   Select a cube Q_t ∈ Q_{t−1} such that Q_t ∈ arg max_{q ∈ Q_{t−1}} U_t(q), where U_t is defined in (11).
10:  Play arm A_t ∈ Q_t uniformly at random. Observe Y_t.
11:  Refine the partition Q_{t−1} into Q_t according to (13).
12: end for

From uniform tie-breaking, we have

    E[f(A_t) | F_t] = (1/µ(Q_t)) ∫_{a ∈ Q_t} f(a) da = ⟨f⟩_{Q_t},   (16)
    F_t = σ(Q₁, A₁, Y₁, ..., Q_{t−1}, A_{t−1}, Y_{t−1}, Q_t),   (17)

where F_t is the σ-algebra generated by the random variables Q₁, A₁, Y₁, ..., Q_{t−1}, A_{t−1}, Y_{t−1}, Q_t – all randomness right after selecting the cube Q_t. At time t, the expected reward is the mean function value over the selected cube.

The proof of the theorem is divided into two parts. In Part I, we show that a "good event" holds with high probability. In Part II, we bound the δ-regret under the good event.

Part I: For t ≤ T and q ∈ Q_t, we define

    E_t(q) := { |⟨f⟩_q − m_t(q)| ≤ H_t(q) },   (18)
    H_t(q) = (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(q)).   (19)

The event E_t(q) is essentially saying that the empirical mean within the cube q concentrates to ⟨f⟩_q. Lemma 2 shows that E_t(q) happens with high probability for any t and q.

Lemma 2. With probability at least 1 − Tε, the event E_t(q) holds for every q ∈ Q_t at every time t ≤ T.

To prove Lemma 2, we apply a variation of Azuma's inequality (Vu, 2002; Tao and Vu, 2015). We also need some additional effort to handle the case where a cube q contains no observations. The details are in Appendix A.3.

Part II: Next, we link the δ-regret to the J(q) term.

Lemma 3. Recall that J(q) = ⌊log₂(µ(q)/η)⌋₊. For any partition Q of [0, 1)^d, there exists q ∈ Q such that

    f^δ − ⟨f⟩_q ≤ J(q),   (20)

for any f-admissible δ > η|Q|, where |Q| is the cardinality of Q.

In the proof of Lemma 3, we suppose, in order to get a contradiction, that there is no such cube; under this assumption, we derive a contradiction to the definition of f^δ.

By Lemma 3, at any time t ≤ T there exists a "good" cube q̃_t such that (20) holds for q̃_t. Let δ be an arbitrary number satisfying (i) δ > |Q_T|η and (ii) δ is f-admissible. Then, under the event E_t(q̃_t),

    f^δ = (f^δ − ⟨f⟩_{q̃_t}) + (⟨f⟩_{q̃_t} − m_t(q̃_t)) + m_t(q̃_t)
        ≤ J(q̃_t) + H_t(q̃_t) + m_t(q̃_t),   (21)

where the inequality uses Lemma 3 for the first bracket and Lemma 2 (with the event E_t(q̃_t)) for the second bracket.

The event in which all "good" cubes and all selected cubes (for t ≤ T) have nice estimates, namely (∩_{t=1}^T E_t(q̃_t)) ∩ (∩_{t=1}^T E_t(Q_t)), occurs with probability at least 1 − 2Tε. This follows from Lemma 2 and a union bound; note that E_t(q) depends on ε (and T), as in (19). Under this event, (18) gives ⟨f⟩_{Q_t} − m_t(Q_t) ≤ H_t(Q_t). This and (16) give us

    E[f(A_t) | F_t] = ⟨f⟩_{Q_t} ≥ m_t(Q_t) − H_t(Q_t).   (22)

We can then use the above to get, under the good event,

    f^δ − E[f(A_t) | F_t]
        ≤ m_t(q̃_t) + H_t(q̃_t) + J(q̃_t) − m_t(Q_t) + H_t(Q_t)
        ≤ m_t(Q_t) + H_t(Q_t) + J(Q_t) − m_t(Q_t) + H_t(Q_t)
        = 2H_t(Q_t) + J(Q_t) ≤ 3H_t(Q_t),   (23)

where the first inequality uses (21) for the first three terms and (22) for the remaining terms, the second inequality uses U_t(Q_t) ≥ U_t(q̃_t) since Q_t maximizes the index U_t(·) according to (14), and the last inequality uses the refinement rule (13).

Next, we use Lemma 4 to link the number of cubes up to a time t to the Hoeffding-type tail bound in (23). Intuitively, Lemma 4 states that the number of points within the cubes grows fast enough to be bounded by a function of the number of cubes.

Lemma 4. We say a partition Q' is finer than a partition Q if for any q' ∈ Q' there exists q ∈ Q such that q' ⊂ q. Consider an arbitrary sequence of points x₁, x₂, ..., x_t, ... in a space X, and a sequence of partitions Q₁, Q₂, ... of X such that Q_{t+1} is finer than Q_t for all t = 1, 2, ..., T − 1. Then for any T and any {q_t ∈ Q_t}_{t=1}^T,

    Σ_{t=1}^T 1/ñ_t(q_t) ≤ e |Q_T| log( 1 + (e − 1) T / |Q_T| ),   (24)
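Lemma 4 can be sanity-checked numerically. The sketch below uses a fixed partition of [0, 1) into m equal intervals (a constant sequence of partitions is trivially nested), draws points uniformly, takes q_t to be the cube containing x_t, and compares Σ_t 1/ñ_t(q_t) against the bound e|Q_T| log(1 + (e−1)T/|Q_T|). The parameters T, m and the uniform point sequence are our illustrative choices; the lemma itself holds for arbitrary point sequences and nested partitions.

```python
import math, random

def check_lemma4(T=500, m=4, seed=0):
    """Numeric sanity check of (24) with a fixed m-cube partition of [0,1),
    uniform points x_t, and q_t = the cube containing x_t."""
    rng = random.Random(seed)
    xs = []
    total = 0.0
    for t in range(T):
        x = rng.random()
        q = int(x * m)                                           # index of the cube containing x_t
        n_prev = sum(1 for x_old in xs if int(x_old * m) == q)   # n_t(q_t): earlier points in q_t
        total += 1.0 / max(1, n_prev)                            # 1 / n~_t(q_t)
        xs.append(x)
    bound = math.e * m * math.log(1 + (math.e - 1) * T / m)
    return total, bound

total, bound = check_lemma4()
print(f"sum = {total:.2f} <= bound = {bound:.2f}: {total <= bound}")
```

The sum behaves like |Q_T| · (1 + log(T/|Q_T|)) here, which is why, after Cauchy-Schwarz, the H_t terms in (23) accumulate to only Õ(√(T|Q_T|)).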
where ñ_t is defined as in (8) (using the points x₁, x₂, ...), and |Q_T| is the cardinality of the partition Q_T.

A proof of Lemma 4 is in Appendix B. We can apply Lemma 4 and the Cauchy-Schwarz inequality to (23) to prove Theorem 3. The details can be found in Appendix A.7.

6. Achieve Poly-log Regret via Zooming

In this section we study an improved version of the algorithm from the previous section, which uses the Zooming machinery (Kleinberg et al., 2008; Slivkins, 2014) and takes inspiration from Bubeck et al. (2011a). Similar to Algorithm 1, this algorithm runs by maintaining a set of dyadic cubes Q_t. In this setting, we divide the time horizon into episodes. In each episode t, we are allowed to play multiple arms, and all arms played can incur regret. This is also a UCB strategy, and the index of q ∈ Q_t is defined in the same way as (11):

    U_t(q) := m_t(q) + H_t(q) + J(q).   (25)

Before we discuss in more detail how to select cubes and arms based on the index U_t(·), we first describe how we maintain the collection of cubes. Let Q_t be the collection of dyadic cubes at episode t. We first define terminal cubes, which are cubes that do not have sub-cubes in Q_t. More formally, a cube Q ∈ Q_t is a terminal cube if there is no other cube Q' ∈ Q_t such that Q' ⊂ Q. A pre-parent cube is a cube in Q_t that "directly" contains a terminal cube: for a cube Q ∈ Q_t, if Q is a direct super cube of any terminal cube, we say Q is a pre-parent cube. Finally, for a cube Q ∈ Q_t, if Q is a pre-parent cube and no super cube of Q is a pre-parent cube, we call Q a parent cube. Intuitively, no "sibling" cube of a parent cube is a terminal cube. As a consequence of this definition, a parent cube cannot contain another parent cube. Note that some cubes are none of these three types. Figure 2 gives examples of terminal cubes, pre-parent cubes and parent cubes.

[Figure 2. Example of terminal cubes, pre-parent and parent cubes.]

Algorithm description: Pick a zooming rate α ∈ (0, (Ψ + D_E) √(2 log(2T²/ε)) / log(M_d/η)). The collection of cubes grows following the rules below.

(1) Initialize Q₀ = {[0, 1)^d}. Warm-up: play n_warm arms uniformly at random from [0, 1)^d so that

    (Ψ + D_E) √(2 log(2T²/ε)) / √(n_warm) ≥ α log(M_d/η) > (Ψ + D_E) √(2 log(2T²/ε)) / √(n_warm + 1).   (26)

(2) After each episode t (t = 1, 2, ..., T), ensure that

    (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(Q^ter)) ≥ α log( M_d µ(Q^ter) / η )   (27)

for every terminal cube Q^ter. If (27) is violated for a terminal cube Q^ter, we include the M_d direct sub-cubes of Q^ter into Q_t. Then Q^ter is no longer a terminal cube, and the direct sub-cubes of Q^ter become terminal cubes. We repeatedly include direct sub-cubes of (what were) terminal cubes into Q_t until all terminal cubes satisfy (27). We choose α smaller than (Ψ + D_E) √(2 log(2T²/ε)) / log(M_d/η) so that (27) can be satisfied with ñ_t(Q^ter) = 1 and µ(Q^ter) = 1. As a consequence, any non-terminal cube Q^par (regardless of whether it is a pre-parent or a parent cube) satisfies

    (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(Q^par)) < α log( M_d µ(Q^par) / η ).   (28)

After the splitting rule is satisfied, we select a parent cube. Specifically, Q_t is chosen to maximize the index:

    Q_t ∈ arg max_{q ∈ Q_t, q is a parent cube} U_t(q).

Within each direct sub-cube of Q_t (whether a pre-parent or a terminal cube), we uniformly randomly play one arm. Thus, in each episode t, M_d arms are played. This algorithm is summarized in Algorithm 2.

Regret analysis: For the rest of the paper, we define

    F_t := σ( (Q_{t'}, {A_{t',j}}_{j=1}^{M_d}, {Y_{t',j}}_{j=1}^{M_d})_{t'=1}^{t−1}, Q_t ),

which is the σ-algebra describing all randomness right after selecting the parent cube for episode t. We use E_t to denote the expectation conditioned on F_t. We will show that Algorithm 2 achieves O(poly-log(T)) δ-regret with high probability (formally stated in Theorem 4).

Let A_{t,i} be the i-th arm played in episode t, and denote ∆_{t,i}^δ := f^δ − E_t[f(A_{t,i})]. Since each A_{t,i} is selected uniformly at random within a direct sub-cube of Q_t, we have

    Σ_{i=1}^{M_d} E_t[f(A_{t,i})] = M_d ⟨f⟩_{Q_t},   (29)
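The terminal / pre-parent / parent classification can be made concrete with a small sketch. Below, a dyadic cube in d = 1 is encoded as a pair (k, m), meaning the interval [m·2^−k, (m+1)·2^−k); this encoding, and the small example collection of cubes, are our illustrative choices, not the paper's.

```python
def contains(q, r):
    """True if dyadic cube q strictly contains dyadic cube r (both encoded as (k, m))."""
    (kq, mq), (kr, mr) = q, r
    return kr > kq and (mr >> (kr - kq)) == mq

def is_terminal(q, cubes):
    """Terminal: no other cube in the collection is strictly inside q."""
    return not any(contains(q, r) for r in cubes if r != q)

def is_pre_parent(q, cubes):
    """Pre-parent: a direct super cube of some terminal cube.
    In d = 1 the direct sub-cubes of (k, m) are (k+1, 2m) and (k+1, 2m+1)."""
    k, m = q
    kids = [(k + 1, 2 * m), (k + 1, 2 * m + 1)]
    return any(r in cubes and is_terminal(r, cubes) for r in kids)

def is_parent(q, cubes):
    """Parent: a pre-parent cube none of whose strict super cubes is a pre-parent cube."""
    return is_pre_parent(q, cubes) and not any(
        contains(r, q) and is_pre_parent(r, cubes) for r in cubes)

# example collection: [0,1), its two halves, and the two quarters of [0,0.5)
cubes = {(0, 0), (1, 0), (1, 1), (2, 0), (2, 1)}
terminals = sorted(q for q in cubes if is_terminal(q, cubes))
parents = sorted(q for q in cubes if is_parent(q, cubes))
print("terminal:", terminals)
print("parent:  ", parents)
```

In this example [0.5, 1) and the two quarters are terminal, [0, 0.5) is a pre-parent but not a parent (its super cube [0, 1) is also a pre-parent), and [0, 1) is the unique parent cube, matching the "no sibling of a parent is terminal" intuition.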
Algorithm 2 Bandit-BMO-Zooming (Bandit-BMO-Z)
1: Problem intrinsics: µ(·), D_E, d, M_d.
2: {µ(·), D_E, d, M_d are the same as in Algorithm 1.}
3: Algorithm parameters: η, ε, T > 0, and α ∈ (0, (Ψ + D_E) √(2 log(2T²/ε)) / log(M_d/η)).
4: {η, ε, T are the same as in Algorithm 1. α is the zooming rate.}
5: Initialize: let Q₀ = {[0, 1)^d}. Play the warm-up phase (26).
6: for episode t = 1, 2, ..., T do
7:   Let m_t, n_t, U_t be defined as in (9), (8) and (25).
8:   Select a parent cube Q_t ∈ Q_t such that Q_t ∈ arg max_{q ∈ Q_t, q is a parent cube} U_t(q).
9:   for j = 1, 2, ..., M_d do
10:    Locate the j-th direct sub-cube of Q_t: Q_j^sub.
11:    Play A_{t,j} ∈ Q_j^sub uniformly at random, and observe Y_{t,j}.
12:  end for
13:  Update the collection of dyadic cubes Q_t to Q_{t+1} according to (27).
14: end for

For any δ that is f-admissible, under the event Ẽ_T,

    Σ_{i=1}^{M_d} ∆_{t,i}^δ = M_d ( f^δ − ⟨f⟩_{Q_t} )   (32)
        ≤ M_d ( 2 (Ψ + D_E) √(2 log(2T²/ε)) / √(ñ_t(Q_t)) + log( M_d µ(Q_t) / η ) )
        ≤ M_d (1 + 2α) log( M_d µ(Q_t) / η ),   (33)

where (32) uses (30) and the last inequality uses (28).

Next, we extend some definitions from Kleinberg et al. (2008) to handle the δ-regret setting. First, we define the set of (λ, δ)-optimal arms as the union

    X_δ(λ) := ∪ { Q ⊂ [0, 1)^d : f^δ − ⟨f⟩_Q ≤ λ }.   (34)

We also need to extend the definition of the zooming number (Kleinberg et al., 2008) to our setting. We denote by N_δ(λ, ξ) the number of cubes of edge-length ξ needed to cover the set X_δ(λ). Then we define the (δ, η)-Zooming Number with zooming rate α as