Supplementary Material for “Combinatorial Partial Monitoring Game with Linear Feedback and Its Application”
A. Full proof for Theorems 4.1 and 4.2

If the reader will recall, we have the following problem-specific constants in the main text: the size of the global observer set $|\sigma|$, the parameter $L > 0$ from the continuity assumption, the error bound
$$\beta_\sigma = \max \left\| (M_\sigma^\top M_\sigma)^{-1} \sum_{i=1}^{|\sigma|} M_{x_i}^\top M_{x_i} (\nu_i - \nu_0) \right\|_2,$$
where the maximum is taken over $\nu_0, \nu_1, \cdots, \nu_{|\sigma|} \in [0,1]^n$, and the maximum difference in the expected reward $R_{\max} = \max_{x_1, x_2 \in \mathcal{X},\, \nu \in [0,1]^n} |r(x_1, \nu) - r(x_2, \nu)|$. For technical reasons, we also defined $\phi(\nu) = \max(\min(\nu, \vec{1}), \vec{0})$ to map $\nu$ to the nearest vector in $[0,1]^n$, and set $r(x, \nu) = r(x, \phi(\nu))$ for all $\nu \in \mathbb{R}^n \setminus [0,1]^n$ to preserve the Lipschitz continuity throughout $\mathbb{R}^n$.

To make our proof clearer, we define $v(t)$ as the state of any variable $v$ by the end of time step $t$. Our analysis is based on the snapshot of all variables just before the statement $t \leftarrow t + 1$ (Lines 14 and 30). One batch of processing in the exploration phase is called one round, after which $n_\sigma$ is increased by 1. Denote $\hat{\nu}^{(j)}$ as the estimated mean of outcomes after $j$ rounds of exploration. For example, at time $t$, the estimated mean of outcomes is $\hat{\nu}(t)$ and the exploration counter is $n_\sigma(t)$, so we have $\hat{\nu}^{(n_\sigma(t))} = \hat{\nu}(t)$. At time step $t+1$, the player uses the previous knowledge $\hat{\nu}(t)$ to compute $\hat{x}(t+1) = \mathrm{argmax}_{x \in \mathcal{X}}\, r(x, \hat{\nu}(t))$ and $\hat{x}^-(t+1) = \mathrm{argmax}_{x \in \mathcal{X} \setminus \{\hat{x}(t+1)\}}\, r(x, \hat{\nu}(t))$.

In the following analysis, the frequency function is set to $f_{\mathcal{X}}(t) = \ln t + 2 \ln |\mathcal{X}|$. Using $f_{\mathcal{X}}(t)$, we can construct the confidence interval $\sqrt{\alpha f_{\mathcal{X}}(t) / n_\sigma}$ to eliminate failures with high probability. Define $N(t) = |\mathcal{X}|^2 t$, which will be used frequently in our analysis; then $\exp\{-f_{\mathcal{X}}(t)\} = \frac{1}{t |\mathcal{X}|^2} = N(t)^{-1}$. Let $\alpha = \frac{8 L^2 \beta_\sigma^2}{a^2}$, where $a > 0$ is a parameter to be tuned later. The symbols used in the proof are listed in Table 1, and to facilitate the understanding of our proof, each lemma is briefly summarized in Table 2.

Below we give an outline of the complete proof, which consists of three main parts:

• First, we introduce some general insights, present the preliminaries, and prove basic properties of our model and the algorithm GCB that are shared by the proofs of both the distribution-independent and the distribution-dependent regret bounds. Next, we obtain the concentration property of the empirical mean of outcomes via the Azuma-Hoeffding inequality (Fact A.2) in Lemma A.3. Lemma A.13 shows that our algorithm bounds the number of exploration rounds by $O(T^{2/3} \log T)$, which implies that our algorithm will not play exploration for too long. In Lemma A.14, we prove that when the gap between the estimated optimal action $\hat{x}$ and the second optimal action $\hat{x}^-$ is large (i.e., the first condition in Line 8 holds), the estimated optimal action is sub-optimal only with low probability. This means that our global confidence bound excludes sub-optimal actions effectively.

• In Section A.1, we prove the distribution-independent regret bound of $O(T^{2/3} \log T)$ (Theorem 4.1 in the main text). In Lemma A.15, we show that the gap between the estimated optimal action $\hat{x}$ and the real one $x^*$ decays exponentially with the number of exploration rounds. Thus, the penalty in the exploitation phase can be derived in Lemma A.16. Then, we use Lemmas A.13 and A.16 to prove Theorem 4.1 in the main text, and the distribution-independent bound $O(T^{2/3} \log T)$ follows.

• In Section A.2, we prove the distribution-dependent bound of $O(\log T)$ related to the predetermined distribution $p$, assuming that the optimal action $x^*$ is unique (Theorem 4.2 in the main text). First, we show in Lemma A.17 that, when the algorithm plays $\Omega\!\left(\frac{\ln t + 2 \ln |\mathcal{X}|}{\Delta_{\min}^2}\right)$ rounds of exploration, the probability of the estimated optimal action $\hat{x}$ being sub-optimal is low. Then, in Lemma A.18, we combine the results of Lemmas A.14 and A.17 and show that the algorithm exploits with a sub-optimal action only with low probability. Thus, Lemma A.18 is enough to bound the regret of exploitation. Next, we bound the regret of exploration by bounding the number of exploration rounds to $O\!\left(\frac{\ln T + 2 \ln |\mathcal{X}|}{\Delta_{\min}^2}\right)$ in Lemma A.22. This is done by showing that whenever the algorithm has conducted $\Theta\!\left(\frac{\ln t + 2 \ln |\mathcal{X}|}{\Delta_{\min}^2}\right)$ rounds of exploration, with high probability it switches to exploitation (Lemma A.19), and then aggregating multiple switches between exploration and exploitation in the proof of Lemma A.22. Finally, we combine Lemmas A.18 and A.22 to prove Theorem 4.2 in the main text.
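As a concrete illustration of the quantities defined in the preliminaries above, the following minimal Python sketch computes $f_{\mathcal{X}}(t)$, the confidence radius $\sqrt{\alpha f_{\mathcal{X}}(t)/n_\sigma}$, and the clipping map $\phi$; the values chosen for $L$, $\beta_\sigma$, $a$, and $|\mathcal{X}|$ are illustrative assumptions, not constants from the paper.

```python
import numpy as np

# Minimal sketch (not the paper's code): the quantities defined above,
# with assumed toy values for the problem-specific constants.
L_const = 1.0        # Lipschitz constant L (assumed value)
beta_sigma = 2.0     # error bound beta_sigma of the global observer set (assumed value)
a = 1.0              # tuning parameter a > 0 (assumed value)
num_actions = 50     # |X|, size of the action set (assumed value)

alpha = 8 * L_const**2 * beta_sigma**2 / a**2      # alpha = 8 L^2 beta_sigma^2 / a^2

def f_X(t):
    """Frequency function f_X(t) = ln t + 2 ln |X|."""
    return np.log(t) + 2 * np.log(num_actions)

def confidence_radius(t, n_sigma):
    """Confidence interval sqrt(alpha * f_X(t) / n_sigma) after n_sigma exploration rounds."""
    return np.sqrt(alpha * f_X(t) / n_sigma)

def phi(nu):
    """phi(nu) = max(min(nu, 1), 0): map nu to the nearest vector in [0, 1]^n."""
    return np.clip(nu, 0.0, 1.0)

# Sanity check of exp{-f_X(t)} = N(t)^{-1} with N(t) = |X|^2 * t.
t = 1000
N_t = num_actions**2 * t
assert np.isclose(np.exp(-f_X(t)), 1.0 / N_t)
print(confidence_radius(t, n_sigma=20))
```

The check at the end simply verifies numerically that $\exp\{-f_{\mathcal{X}}(t)\} = N(t)^{-1}$ for the chosen toy values.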
Symbols in the main text | Definition
$v(t)$ | state of any variable $v$ by the end of time step $t$
$v \in [0,1]^n$ | outcomes of the environment
$\tilde{v} \in [0,1]^n$ | estimation of outcomes through inversion
$\nu$ | mean of outcomes
$\hat{\nu}$ | empirical mean of outcomes
$x \in \mathcal{X}$ | action $x$ in action set $\mathcal{X}$
$r(x, v), r(x, \nu) \in \mathbb{R}$ | reward function taking $x$ and $v$, and expected reward function taking $x$ and $\nu$
$M_x \in \mathbb{R}^{m_x \times n}$ | transformation matrix of action $x$, where $m_x$ depends on $x$
$y(t) \in \mathbb{R}^{m_{x(t)}}$ | feedback vector under the choice of $x(t)$
$\tilde{y}$ | vector that stacks feedbacks from different times
$\Delta_x, \Delta_{\max}, \Delta_{\min}$ | reward gap of action $x$, the maximum gap, and the minimum (positive) gap
$\sigma \subset \mathcal{X}$ | global observer set of actions
$L$ | Lipschitz constant
$\beta_\sigma$ | distribution-independent error bound from $\sigma$
$R_{\max}$ | distribution-independent largest gap of expected reward
$f_{\mathcal{X}}(t)$ | frequency function
$x^*$ | real optimal action
$\hat{x}(t)$ | estimated optimal action at time $t$
$\hat{x}^-(t)$ | estimated second optimal action at time $t$

Symbols in the proof | Definition
$n_\sigma$ | exploration counter
$\mu(t), \eta(t)$ | threshold functions
$\mathcal{X}^{\mathrm{Good}}, \mathcal{X}^{\mathrm{Bad}} \subset \mathcal{X}$ | good action set and bad action set
$F^{\mathrm{Good}}, F^{\mathrm{Bad}}$ | events of choosing from the good action set and from the bad action set
$L_{CI}, L_{CI}^{c}$ | event that the occurring gap is larger than the confidence interval, and its complement
$E_{\mathrm{Explore}}, E_{\mathrm{InExplore}}, E_{\mathrm{FinExplore}}$ | events of doing exploration, being in the middle of it, and being at its end
$E_{\mathrm{Exploit}}$ | event of exploitation
$\delta_{x_i,x_j}, \hat{\delta}_{x_i,x_j}(t) \in \mathbb{R}$ | reward gap between actions $x_i$ and $x_j$, and its estimated value at time $t$
$G_k$ | event indicating the first occurrence of $k$ rounds of exploration

Table 1. List of symbols in the proof.

Result | Succinct interpretation of the results | Dependence
Lemma A.3 | Estimate of outcomes concentrates around the mean. | Fact A.2
Lemma A.7 | Difference of the real and the estimated gap is bounded. |
Lemma A.8 | Estimated error of outcomes is small compared to the confidence interval. | Lemma A.3
Lemma A.13 | The counter of exploration is bounded within $O(T^{2/3} \log T)$. |
Lemma A.14 | Finding a bad action that fails the confidence interval occurs rarely. | Lemma A.8
Lemma A.15 | Incurring a large penalty for the current optimal action is rare. | Lemma A.3
Lemma A.16 | The penalty in the exploitation phase is bounded. | Lemmas A.14, A.15
Theorem 4.1 | Distribution-independent bound: $O(T^{2/3} \log T)$. | Lemmas A.13, A.16
Lemma A.17 | With enough exploration, finding a bad action is rare. | Lemma A.3
Lemma A.18 | Finding a bad action and exploiting it become rare as time elapses. | Lemmas A.14, A.17
Lemma A.19 | With enough exploration, finding a good action but still exploring becomes rare. | Lemmas A.3, A.7, A.13
Lemma A.20 | Once the algorithm performs enough exploration, it switches to exploitation. | Lemmas A.17, A.19
Lemma A.22 | Exploration rounds are bounded. | Lemma A.20
Theorem 4.2 | Distribution-dependent bound: $O(\log T)$. | Lemmas A.18, A.22

Table 2. List of lemmas and their dependencies in the proof.
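To make the inversion symbols of Table 1 ($M_x$, $\tilde{y}$, $\tilde{v}$) and the estimator $\hat{\nu}^{(j)}$ of Lemma A.3 below concrete, here is a minimal Python sketch of one exploration round under an assumed toy dimension and a hand-made global observer set; it illustrates the pseudo-inverse recovery step, not the paper's implementation.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code) of one exploration round:
# play every action of the global observer set sigma, stack the linear feedback,
# and recover an outcome estimate through the pseudo-inverse M_sigma^+.
rng = np.random.default_rng(0)

n = 4                                    # dimension of the outcome vector (toy value)
M_list = [                               # transformation matrices M_x for x in sigma (toy choice)
    np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]]),
    np.array([[0., 0., 1., 0.],
              [0., 0., 0., 1.]]),
    np.array([[1., 1., 1., 1.]]),
]
M_sigma = np.vstack(M_list)              # stacked matrix of sigma; full column rank here
nu = np.array([0.2, 0.5, 0.7, 0.9])      # unknown mean of outcomes (toy ground truth)

def explore_once():
    """One round: each action of sigma observes M_x v for its own outcome v with mean nu;
    the stacked feedback y_tilde is inverted to v_tilde = M_sigma^+ y_tilde."""
    feedbacks = []
    for M_x in M_list:
        v = (rng.uniform(size=n) < nu).astype(float)   # Bernoulli outcome with mean nu
        feedbacks.append(M_x @ v)                      # linear feedback of this action
    y_tilde = np.concatenate(feedbacks)
    return np.linalg.pinv(M_sigma) @ y_tilde

j = 500
nu_hat_j = np.mean([explore_once() for _ in range(j)], axis=0)  # empirical mean after j rounds
print(np.round(nu_hat_j, 2), np.linalg.norm(nu_hat_j - nu))
```

Because the stacked $M_\sigma$ in this toy setup has full column rank, $M_\sigma^+ M_\sigma = I$, so each $\tilde{v}_i$ is an unbiased estimate of $\nu$ and the average $\hat{\nu}^{(j)}$ concentrates around $\nu$, which is the behavior quantified in Lemma A.3.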
Fact A.1. The following probability laws will be used in the analysis.

• Law of Conditional Probability: $\Pr[A \wedge B] = \Pr[A \mid B] \cdot \Pr[B]$.

• Law of Total Probability: if $\{B_n : n = 1, 2, \cdots\}$ is a set of disjoint events whose union is the entire sample space, then
$$\Pr[A] = \sum_n \Pr[A \wedge B_n].$$

Fact A.2 (Azuma-Hoeffding Inequality in Euclidean Space, Theorem 1.8 of (Hayes, 2003)). Let $X = (X_0, \cdots, X_n)$ be a very-weak martingale, i.e., $\mathbb{E}[X_i \mid X_{i-1}] = X_{i-1}$ for every $i$, taking values in Euclidean space, so that $X_i \in \mathbb{R}^d$ for every $i$. Suppose $X_0 = 0$ and $\|X_i - X_{i-1}\|_2 \le 1$ for $i = 1, \cdots, n$. Then, for every $\epsilon > 0$,
$$\Pr[\|X_n\|_2 \ge \epsilon] < 2e^{1 - \frac{(\epsilon-1)^2}{2n}} < 2e^{2} e^{-\frac{\epsilon^2}{2n}}. \qquad (1)$$

We can use the preceding fact to obtain the concentration property of outcomes during exploration.

Lemma A.3 (Concentration during exploration). After the exploration rounds $i = 1, 2, \cdots, j$ at times $t_1, t_2, \cdots, t_j$ respectively, we use the inversion to get $\tilde{v}_i = I(M_\sigma, \tilde{y}(t_i)) = M_\sigma^{+} \tilde{y}_i$, and their mean is $\hat{\nu}^{(j)} = \frac{1}{j} \sum_{i=1}^{j} \tilde{v}_i$.