Arxiv:2105.11126V2 [Cs.LG] 4 Jun 2021

Cascading Bandit under Differential Privacy

Kun Wang1, Jing Dong2, Baoxiang Wang3 Shuai Li1, Shuo Shao1 1Shanghai Jiao Tong University, 2University of Michigan, 3The Chinese University of Hong Kong, Shenzhen. [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

This paper studies differential privacy (DP) and local differential privacy (LDP) in cascading bandits. Under DP, we propose an algorithm which guarantees - log T 1+ξ indistinguishability and a regret of O(( ) ) for an arbitrarily small ξ. This is log3 T a signiﬁcant improvement from the previous work of O( ) regret. Under (,δ)- LDP, we relax the K2 dependence through the tradeoff between privacy budget K log(1/δ) log T and error probability δ, and obtain a regret of O( 2 ), where K is the size of the arm subset. This result holds for both Gaussian mechanism and Laplace mechanism by analyses on the composition. Our results extend to combinatorial semi-bandit. We show respective lower bounds for DP and LDP cascading bandits. Extensive experiments corroborate our theoretic ﬁndings.

1 Introduction

There exists a rich literature on multi-armed bandits (MAB) as it captures a most fundamental problem in sequential decision making - the exploration-exploitation dilemma [Auer et al., 2002a,b, Bubeck and Cesa-Bianchi, 2012, Besbes et al., 2014]. Despite its simpleness, the MAB problem is found use in many important applications such as online advertising and clinical trials. In the original stochastic case of MAB, an agent is presented with K arms and is asked to pull an arm at each round through a ﬁnite-time horizon. Depending on the agent’s choice, the agent will receive a reward and its goal is set to maximizing the cumulative reward. The natural choice of performance metric is thus the difference between the optimal reward possible and the actual reward received by the agent, which is termed regret in bandit literature.

arXiv:2105.11126v2 [cs.LG] 4 Jun 2021 The original stochastic case of the multi-armed bandits model, though powerful, is insufﬁcient to cope with the complexity of real applications. The reward feedback is rarely as straightforward as depicted in stochastic MAB. The cascading bandit problem is then proposed with the intention to better model complicated situations common in applications such as recommendation systems and search engines [Kveton et al., 2015a,b, Zong et al., 2016, Li et al., 2016, Cheung et al., 2019, Wang et al., 2021]. This variant of MAB gives a realistic depiction of user behavior when searching for attractive things. Cascading bandit is also a sequential learning process, where in each round the agent recommends a list of arms to the user. The user checks from the start of the list and stops at the ﬁrst attractive item, which is manifested by clicks in web recommendation. Then the agent receives the feedback in the form of user’s click information. While the cascading bandit model greatly assists the development of real applications such as recommendation systems, it raises concern of privacy. Many applications rely heavily on this sensitive user data which reveals the preference of a particular user. If additional measures were not taken, one can easily get this information by observing the difference of the algorithm’s output [Dwork et al., 2006]. For example, in a shopping website, users’ click information and browsing

Preprint. Under review. history show the preference for some commodities, and companies can use this information to conduct price discrimination. Since first proposed in 2006, differential privacy (DP) becomes a standard in privacy notion because of its well-defined measure of privacy [Dwork et al., 2006, 2010, 2014, Abadi et al., 2016]. Modifying algorithms to make them possess privacy guarantees is emergent nowadays. An algorithm is regarded to own the capability of protecting differential privacy if the difference between outputs of this algorithm over two adjacent inputs is insignificant, as desired.

Technical difficulty: First, the cascading bandits involve a non-linear reward function. Directly injecting noise into the data may lead to much larger regret. Under the canonical definition of differential privacy, bandits algorithms are known to enjoy an upper bound of O(log3 T ) [Mishra and Thakurta, 2015, Chen et al., 2020]. Compare to the well known O(log T ) non private upper bound [Auer et al., 2002a, Chen et al., 2013, Kveton et al., 2015a], an open question remains on whether it is possible to close this gap. Moreover, compared to the definition of differential privacy, there is a much stricter definition of privacy guarantee known as local differential privacy (LDP). Under LDP, the data has to be injected with noise before collecting the data and thus it does not need a trusted center, which results in a stronger privacy guarantee. There are work on the design of bandit algorithms under LDP [Chen et al., 2020, Ren et al., 2020]. However, in combinatorial (including cascading) bandits, if one directly injects noise to the data and modifies the algorithm in each round as before, the regret scales as O(K2), where K is the maximum number of chosen arms in each round. This is a notorious side- effect. In the work of Chen et al. [2020] for combinatorial bandits, they discard information in each round that does not influence the order of the regret to avoid the dependence on K2. Nevertheless, in real applications, the data is scarce and valuable and need to be well stored to be used in other information analysis tasks. In this case, the natural problem to protect privacy without discarding data and still enjoying a preferable regret bound remains open.

Our contribution: In this paper, we overcome the difﬁculties and solve the problems. Under DP, through analyzing the dominant term in the regret, we utilize the post-processing lemma from Dwork et al. [2014] and a tighter bound for private continual release from Chan et al. [2011], proving the O(log1+ξ T ) regret upper bound, for an arbitrarily small ξ. This matches the lower bound up to a O(logξ T ) factor. Under LDP, we relax the regret dependence on K2 through the balance between privacy budget and error probability δ. This holds for both Laplace mechanism by analyses on composition and Gaussian mechanism. The main contributions of this paper are summarized as follows:

• For the cascading bandit problem under -DP, we propose novel algorithms with regret bound of L log T 1+ξ O(( ) ). It matches the lower bound up to an arbitrary small poly-log factor. • For the cascading bandit problem under (, δ)-LDP, we propose two algorithms upper bounded by K log 1/δ log T O( 2 ) which do not discard information. This holds for both Laplace mechanism and Gaussian mechanism. K log 1/δ log T • We extend our mechanism to combinatorial semi-bandit and proved a O( 2 ) regret, which is of the same order of the regret bound as we achieved in the cascading bandit setting under LDP. (L−K) log T 1 1 • We give two lower bounds, Ω( ∆2 ) and Ω(( ∆ + )(L − K) log T ), for cascading bandits under LDP and DP, respectively.

2 Related work

Cascading bandit The cascading bandits problem was ﬁrst proposed by Kveton et al. [2015a], in which a O(log T ) upper bound by UCB-based methods and a matching lower bound were presented. This problem is then extended to its linear variant by Zong et al. [2016],√ in which a norm bound for contexts from Abbasi-Yadkori et al. [2011] is employed and a O˜(Kd T ) upper bound is presented. The combinatorial semi-bandit, which resembles cascading bandit in many aspects is ﬁrst introduced by Chen et al. [2013]. They also give corresponding O(K log T ) upper bound.

2 Differential privacy Chan et al. [2011] first study the private continual release of statistics, which provides the basis for research for multi-armed bandits. Mishra and Thakurta [2015] give an K log3 T O( ) upper bound for the differential private bandit for the first time, where K is the number of arms. Tossou and Dimitrakakis [2016] tighten the upper bound to O(log T ) but under a relaxed definition of privacy. Tossou and Dimitrakakis√ [2017] also study the adversarial bandit under T log T differential privacy and give a first O( ) upper regret bound. Shariff and Sheffet [2018] give first lower bound for multi-armed bandit with DP. Basu et al. [2019] give essential lower bounds for different privacy definitions.

Local differential privacy Under LDP, Ren et al. [2020] use the Laplace noise and provide the log T ﬁrst O( 2 ) upper bound for the multi-armed bandit. Chen et al. [2020] study the combinatorial log T semi-bandit under LDP and give the ﬁrst O( 2 ) regret bound for the combinatorial bandit. They use the trick, each round only update the worse arm in the group, to avoid dependency on K, which is the size of each round’s chosen arms.

3 Problem formulation

3.1 Cascading bandit

A cascading bandit problem can be denoted by a tuple B = (E,P,K), where E = {1, ..., L} is a set E L of L base items, P is a probability distribution over [0, 1] and its expectation is (w ¯)i=1. K ≤ L represents the number of recommended items each time. Through a time horizon of T , at each round E t, wt ∈ [0, 1] is instant Bernoulli reward drawn from P . It indicates user’s preference by wt(e) = 1 if and only if item e attracts user at time t. The problem proceeds iteratively, at each round t, the algorithm is asked to recommend a list of items t t t At = (a1, a2, ..., aK ) ∈ ΠK (E). When the users receive the recommended list, he reviews the list Ct from the top and stop at ﬁrst attractive item Ct. The feedback to the algorithm is (wt)i=1, where Ct = t arg min{1 ≤ k ≤ K : wt(ak) = 1} with the assumption that arg min ∅ = ∞. The reward function QK t 1 f(A, w) = 1 − k=1(1 − w(ak)). Note that, wt(ak) = {Ct = k} k = 1, ..., min{Ct,K}. We t say item e is observed at time t if e = ak for some 1 ≤ k ≤ min{Ct,K}. The cumulative regret R(T ) is deﬁned as T X ∀T > 1,R(T ) = E[ R(At, wt)] , t=1 ∗ ∗ where R(At, wt) = f(A , wt) − f(At, wt) is the instant stochastic regret at time t and A = arg maxA∈ΠK (E)f(A, w¯) . Next, we maintain two mild assumptions that are common among cascading bandits literature [Kveton et al., 2015a]. Q Assumption 1. The weights w are distributed as, P (w) = e∈E Pe(w(e)) , where Pe is a Bernoulli distribution with mean w¯(e) . ∗ Assumption 2. The optimal action is unique, i.e. ∆min = mine∈[L]\A∗ {w¯ − w¯(e)} > 0 .

With Assumption 1, one can easily verify that E[f(A, w)] = f(A, w¯). This make it possible to decompose the regret to a solvable form. The Assumption 2 is necessary for any problem-dependent regrets. Without loss of generality, for all proofs, we assume w1 ≥ w2 ≥ ... ≥ wL.

3.2 (Locally) differential privacy

We study the problem of cascading bandits under both the classic (, δ)-differential privacy definition and (, δ)-local differential privacy. We first give the rigorous definition of differential privacy.

Deﬁnition 1 (Differential Privacy). Let D = hx1, x2, . . . , xT i be a sequence of data with domain T T X . Let A(D) = Y , where Y = hy1, y2, . . . , yT i ∈ Y be T outputs of the randomized algorithm A on input D. A is said to preserve (, δ)-differential privacy, if for any two data sequences D,D0 that differ in at most one entry, and for any subset U ⊂ Y T , it holds that P (A(D) ∈ U) ≤ e · P (A(D0) ∈ U) + δ . If δ = 0, then we say the algorithm satisﬁes -differential privacy.

3 The local differential privacy model requires masking data with noise before the accumulation of data in order to circumvent the need of a trusted center, which leads to a more promising privacy guarantee. Formally, the (, δ)-local differential privacy is defined as follows. Definition 2 (Local Differential Privacy). A mechanism A : X → Y is said to be (, δ)-local differential private or (, δ)-LDP, if for any x, x0 ∈ X, and any (measurable) subset U ⊂ Y , there is 0 P (A(x) ∈ U) 6 e · P (A(x ) ∈ U) + δ . If δ = 0, then we say the algorithm satisfies -local differential privacy.

4 Cascading bandit under differential privacy

In this section, we study cascading bandit under DP. In this situation, each round, the user’s click information is submitted to a database as a trusted center. When the algorithm needs data each round, it receives the noisy data from the database. The difficulty of this problem is governed by the control of the overall injected noise in the stream data. Allocating the injected noise and adjusting the confidence bound in the algorithm to guarantee better regret is the core in this problem. In the previous work, people use a tree-based mechanism or hybrid mechanism from Chan et al. [2011] to maintain the privacy of the stream data in the MAB setting. It means constructing a tree based on previous data to have a logarithmic number of sum when adding noise. This can control all noises to an acceptable content. However, people either not use best confidence bound or improperly construct the tree over stream, leading to some sub-optimal regret bounds. In this paper, through careful consideration, we find the order of the regret in the bandit algorithm is dominated by the sequence length of the constructed segment tree. If we see each round’s data as a K length vector such as Chen et al. [2020], then the sequence length is related to time t. In this case, it is unavoidable with the log2.5 t by direct derivation from the utility bound of the hybrid mechanism. It is hard to accept. Nonetheless, if the sequence length is relevant with the number of each arm Ti rather than time t, we prove that the overall regret can be bounded by log1+ξ t regret by a common inequality. In this work, we utilize the post-processing lemma well-known in differential privacy, converting the privacy guarantee of algorithm’s output to L arms. Then just equally distribute the privacy budget to all L arms. We can control the pulled numbers of any sub-optimal arm to log1+ξ t, which is the dominant term in regret. We now describe our algorithm under DP. It is inspired by Tossou and Dimitrakakis [2016], but under a much stronger definition of privacy. First, the agent receives data from hybrid mechanism and calculate the upper confidence bound for each arm. Compared to classic UCB algorithm, we add an extra additional term from the noise from the injected noise. The algorithm then outputs K arms. The agent of the algorithm sends chosen K arms to users. Finally, the users pull arms receiving rewards and insert the rewards to the hybrid mechanism for each arm, respectively. Our method is depicted in Algorithm 1. Algorithm 1: Cascading-UCB under DP. Input: , the differential privacy parameter 0 Instantiate L Hybrid Mechanisms with = L . Observe w0 ∀e ∈ E : T0(e) ← 1, wˆ1(e) ← w0(e). for t = 1, 2, ..., n do q 1.5 1.5 log t 3c1L log n log t ∀e ∈ E : Ut(e) =w ˆ + + Tt−1(e) Tt−1(e) t t Let a1, ..., aK be K items with largest private UCBs. t t At ← (a1, ..., aK ) Observe click Ct ∈ {1, ..., K, ∞} for k = 1, ..., min{Ct,K} do t e ← ak Tt(e) ← Tt−1(e) + 1 Insert wt(e) to the hybrid mechanism for arm e.

Next, we introduce two important lemmas to support our algorithm and proof.

4 Lemma 1 (Chan et al. [2011], Utility Bound For Hybrid Mechanism). In the continual release procedure, the hybrid mechanism preserves -differential privacy and with probability 1 − γ, the added noise to the data at time n satisﬁes following inequality: c log1.5 n log 1/γ |Noise| ≤ 1 , where c1 is a constant. Te Lemma 2 (Post-Processing Lemma). If the sequence (wt(e))t=1 for all arm e is -differentially private, then the Algorithm 1 is -differentially private.

Based on Lemma 1, we can construct high probability events that estimation outside of this conﬁdence interval happens with arbitrary small probability. This is the basis of UCB algorithms. The Lemma 2 ensures Algorithm 1 is -differentially private. Theorem 1. Algorithm 1 guarantees -DP. Theorem 2. The regret of Algorithm 1 is upper bounded by:

L 1+ξ 2 L 1+ξ! X 192 c1L 2π X L log T R(T ) ≤ log T + L + c2 = O , ∆e,K 3 e=K+1 e=K+1 where ξ is an arbitrary small positive real value and c1, c2 are constants independent of the problem.

To the best of our knowledge, this is the ﬁrst O(log1+ξ T ) regret in the common deﬁnition of differential privacy. We greatly improve the existing regret bound by the post-processing lemma and the utility bound on hybrid mechanism. Judging from the regret, we get this better dependence on t at the cost of extra dependence on L. However, in view of time t’s dominance in the bandit setting, we believe this is a great improvement compared to before. Besides, our proof method has the universality. It can be used in improving bound on other bandits model under differential privacy such as combinatorial semi-bandit and basic MAB. We do not give them in this paper.

5 Cascading bandit under local differential privacy

Under local differential privacy, each round when the user browses through the list, he directly sends noisy data to the database rather than sending original data and let database inject noise. This circumvents the need of trusted center, so it has much stronger privacy guarantee. In this situation, we need to protect every output at time t and let it satisfy (, δ)-local differential privacy. In previous work, Chen et al. [2020] discard much information each round to reduce the privacy budget in the combinatorial semi-bandit. This is not practical as the importance of information in the real world. Therefore, in this paper, we study if there exists a method can both satisfy (, δ)-LDP and appreciable regret not discarding information.

5.1 Warm up: Laplace mechanism

First, we use Laplace mechanism to provide LDP guarantee. Our algorithm is based on previous non-private cascading bandit. In order to guarantee privacy constraint, each round we have to inject noise to the data, so original conﬁdence interval is no longer applicable. Naturally, the conﬁdence set has to expand to adapt to the introduction to new Laplace noises. Then, we give our method in Algorithm 2. Theorem 3. Algorithm 2 guarantees -LDP. Theorem 4. The regret of Algorithm 2 in time horizon T is upper bounded by:

L √ √ L ! X 4( 1.5 + K/ 24)2 2π2 X K2 R(T ) ≤ log T + L = O 2 log T . ∆e,K 3 ∆e,K e=K+1 e=K+1

Next, we discuss the proof outline of Theorem 4: First, the expected mean outside the con- ﬁdence set is bounded by a small probability because of the sub-Gaussian property of two noises. Then we just consider event in high-probability set. The upper conﬁdence bound

5 K has a factor because of Laplace noise. Then follow the same procedure as non-private K2 log T cascading bandit [Kveton et al., 2015a], we get Ti ≤ O( 2 ) for all arm i. Comb- ing with the decomposed form of regret on account of nonlinear reward function, the ﬁ- K2 nal regret is the weighted sum of all Ti. Thus the ﬁnal regret has O( 2 ) dependence. Algorithm 2: Cascading-UCB under LDP (Laplace Mechanism). Input: : the differential privacy parameter,K: the maximum arm each round recommends. Initialization: Observe w0 ∀e ∈ E : T0(e) ← 1, wˆ1(e) ← w0(e) for t = 1, 2, ..., n do q 1.5 log t K q 24 log t ∀e ∈ E : Ut(e) =w ˆ(e) + + Tt−1(e) Tt−1(e) t t Let a1, ..., aK be K items with largest private UCBs. t t At ← (a1, ..., aK ) Observe click Ct ∈ {1, ..., K, ∞} for k = 1, ..., min{Ct,K} do t e ← ak Tt(e) ← Tt−1(e) + 1 T (e)w ˆ (e)+1{C =k}+Lap( K ) wˆ (e) ← t−1 Tt−1(e) t Te(e) Tt(e)

In our ﬁrst given Laplace mechanism, the dependence on K2 is a notorious side-effect during the learning process. In recommender system, K means the recommended items to user and it is usually very large. In scenarios where recommendation items come from a stream of information such as news feed, K often tends to inﬁnity. This leads to much larger regret. Therefore, a tough work about reducing dependence on K is in emergency. In subsection (5.2)(5.3)(5.4), we make great efforts in reducing this dependence on K.

5.2 Improvement: Gaussian mechanism

The Laplace mechanism, though offering promising privacy protection, is quadratically dependent on K. This thus makes the Laplace mechanism impractical when K is large. Previous solution to this catastrophic effect is to discard information with negligible effect on the order of regret bound [Chen et al., 2020]. Despite the alleviation of the effect, this method is inaccessible in many situations where data disposal is impossible. We now present an improved algorithm to circumvent this dependency K subtly, which is based on the Gaussian mechanism.

Algorithm 3: Cascading-UCB under LDP(Gaussian Mechanism). Input: : the differential privacy parameter,K: the maximum arm each round recommends. 1 q 1.25 Initialization:σ = 2K ln δ Observe w0. ∀e ∈ E : T0(e) ← 1, wˆ1(e) ← w0(e). for t = 1, 2, ..., n do q 1.5 log t q 2 log(2t3) ∀e ∈ E : Ut(e) =w ˆ + + σ Tt−1(e) Tt−1(e) t t Let a1, ..., aK be K items with largest private UCBs. t t At ← (a1, ..., aK ) Observe click Ct ∈ {1, ..., K, ∞} for k = 1, ..., min{Ct,K} do t e ← ak Tt(e) ← Tt−1(e) + 1 T (e)w ˆ (e)+1{C =k}+N (0,σ2) wˆ (e) ← t−1 Tt−1(e) t Te(e) Tt(e)

The Gaussian mechanism, compared to Laplace mechanism,√ is sensitive to L2-norm instead of L1-norm. This key property leads to our design of a K-dependent of upper conﬁdence bound. The

6 adjustment to the confidence interval, directly lessen the dominating effect of the interval on the regret bound and eliminate the K2 dependency. We now give the detailed algorithm description for Gaussian mechanism based cascading UCB algorithm with LDP guarantee. Before going on our work, we first introduce a lemma. 0 Lemma 3 (Zhao et al. [2019]). Let ∆f = max kf(D) − f(D )kL , then ∀δ ∈ (0, 1), and σ > D,D0 2 q ∆f 1.25 2 2 log δ ,M(D) = f(D) + N (0, σ ) satisfied (, δ)-differential privacy.

Lemma 3 gives the Gaussian mechanism privacy guarantee, if we inject N (0, σ2) to reward of each item according to lemma. Based on the concentration bound for Gaussian distribution, we get q q ¯ 2 log 2/γ ¯ 2 log 2/γ with probability of at least 1 − γ, u ∈ [X − σ s , X + σ s ]. By the concentration bound for Gaussian noise, we can use the Optimism in the Face of Uncertainty principle to construct UCB-based method. The bias comes from two noises: the noise from the Bernoulli distribution, this part we can control by sub-Gaussian quality; the noise from the Gaussian noise due to the privacy guarantee, this part can be controlled by the concentration bound for Gaussian distribution. Then follow the same procedure as Laplace mechanism. Our method is depicted in Algorithm 3. Theorem 5. Algorithm 3 guarantees (, δ)-LDP. Theorem 6. The regret of Algorithm 3 is upper bounded by: √ q L 2(2 1.5 + 8 K log 1.25 )2 L ! X δ X K R(T ) ≤ log T = O 2 log T . ∆e,K ∆e,K e=K+1 e=K+1

During the proof, the only position where the Gaussian mechanism is√ different from the Laplace K mechanism, is the conﬁdence interval. We update the interval to O( ) dependence based on concentration bound for Gaussian distribution. As we have discussed in the above, the regret has the linear dependence on K. It is a great improvement from the Laplace mechanism. However, more fundamental reasons need to be explored to guide our algorithm’s design.

5.3 Generalization: composition theorem

After careful investigation the regret of Gaussian mechanism, we identify a trade-off between privacy budget and error probability δ in regret. Comparing the regret of Gaussian mechanism and that of Laplace mechanism, the dependence on privacy budget ( K ) is exchanged by an additional 1 multiplicative δ(log δ ) regret. This motivation of exchanging for by a little δ is just why advanced composition theorem in differential privacy is proposed. Thus we can use this composition theorem at our situation to reduce the dependence on K. Lemma 4 (Kairouz et al. [2015], Corollary 4.1). For any ∈ (0, 0.9] and δ ∈ (0, 1], if the database access mechanism satisﬁes (p2/4k log(e + /δ), δ/2k)-differential privacy on each query output, then it satisﬁes (, δ)-differential privacy under k-fold composition.

Before, we have to ensure every item -indistinguishable. Now, we just to ensure each item √ - K K indistinguishable based on Lemma 4. Thus any -dp mechanism can achieve half dependence on K at the cost of a log 1/δ term in regret while ensuring their (, δ)-local differential privacy. We use this composition theorem in our situation. We give an illustrating example with the Laplace mechanism to highlights the impact of the above theorem. By this theorem, we can improve the regret of Laplace mechanism to O(K) dependence in regret, which is the same as Gaussian mechanism. Theorem 7. Cascading-UCB algorithm using Laplace mechanism with parameter 0 = K log 1/δ log T √ under K-fold composition each round achieves O( 2 ) regret while ensur- 4K log(e+/δ) ing (, δ)-local differential privacy. By the deﬁnition of DP, Laplace mechanism that attains (, 0)-DP is also capable of offering (, δ)- DP protection.√ Using the generalized theorem, every item observed in the list masked with a 4K log(e+/δ) Lap( ) noise is enough to guarantee (, δ)-dp. An UCB algorithm with conﬁdence in- 4 q 6K log(e+/δ) log t q 3 log t K log 1/δ log T terval of + then suffers only O( 2 ) regret while ensuring Tt−1(e) 2Tt−1(e) privacy guarantee.

7 5.4 Relationship with combinatorial semi-bandit

Our proposed algorithms also shed light on privacy persevering bandits algorithm under combinatorial semi-bandit, which holds great similarity with cascading bandit set up. We take the Gaussian mechanism as an example and give the following theorem on the regret upper bound of UCB algorithm under LDP with combinatorial bandit, though we defer the detailed algorithm description to appendix as Algorithm 4. Theorem 8. Algorithm 4 guarantees (, δ)-LDP. Theorem 9. The regret of Algorithm 4 is upper bounded by: 1.25 2 128K log δ log T π R(T ) ≤ 2 −1 2 + + 1 m∆max , min{ , 2}(f (∆min)) 3 where is f is strictly increasing function satisfying B∞ bounded smoothness and K is maximum number of each round chosen arm.

6 Lower bounds

We now discuss the lower bound for cascading bandit under DP and LDP and the optimality of our upper bounds. We ﬁrst decompose the lower bound for cascading bandit into a weighted sum of any sub-optimal arm’s pulled number. Under LDP, this can be further converted into the KL-divergence of two arms and thus implies a Ω(1/2) dependence according to the results from [Basu et al., 2019]. Under DP, we decompose the number of any sub-optimal arm into two parts: the number of pulls with no privacy guarantee and the number of pulls with the cost of privacy. The following theorems formalize our results. Theorem 10. For any cascading bandit of -local differential privacy, the regret of any consistent algorithm R(T ) is lower bounded by R(T ) (L − K)p(1 − p)K lim inf ≥ . T →∞ log T 2 min{4, e2}(e − 1)2∆

(L−K) Note that when p = 1/K and → 0, lim infT →∞ R(T ) ≥ Ω( ∆2 log T ). Theorem 11. For any cascading bandit of -differential privacy, the regret of any consistent algorithm R(T ) is lower bounded by (L − K)(1 − p)K−1 log T (L − K)(1 − p)K−1 log T R(T ) ≥ Ω + . ∆

It can be observed that the deﬁnition of privacy protection, whether it is differential privacy or local differential privacy, has negligible effect on the lower bound of cascading bandits. Our proposed algorithm for cascading bandits achieves matching upper bound under the LDP while there exists a O(logξ T ) gap under DP.

7 Empirical results

In this section, we provide extensive empirical results that corroborate our theoretical ﬁndings. We conduct experiments under DP and LDP for Gaussian and Laplace mechanism with varying value of . In order to highlight our improvement, Laplace mechanism uses the form of Algorithm 2. The experiments were proceeded with L = 20,K = 4, δ = 10−3 unless other wise indicated. The time horizon is set to 100000 for all experiments. To sure reproducible results, each experiment is repeated 10 times and averaged result is presented. Without loss of generality, all noises are multiplied by 0.01 to ensure most of the samples distributed among [0, 1].

Cascading bandits under local differential privacy. The algorithms were analysed empirically under LDP with varying level of . Figure 1(a) 1(b), 1(c), 1(d) shows the cumulative regret incurred by three algorithms mentioned with = {0.2, 0.5, 1, 2}. Laplace based-UCB achieves the most cumulative regret among all algorithms analyzed. Though non-private UCB incurred the lowest

8 cumulative regret, the private algorithms achieve regrets of order O(log T ) as well. Figure 1(e) shows the performance of private algorithms varies according to various number of in range of 0 to 2, where takes values of every 0.02 interval. When = 0, the regret is unaffordable for private algorithm. However, as increases, the cumulative regret of the algorithms quickly decays and soon achieve comparable cumulative regret with non private algorithm, especially in the case of where Gaussian mechanism is applied. We also test CUCB under = 0.2 and = 2 in ﬁgure 1(h), 1(i). It can be observed that, when is set to relatively small value, the gap between performance of private and non private algorithm is comparably larger when is large.

350 500 350 Non-private Non-private Non-private 300 Laplace-LDP Laplace-LDP 300 Laplace-LDP Gaussian-LDP 400 Gaussian-LDP Gaussian-LDP 250 250

300 200 200

150 150 200

Cumulative Reget 100 Cumulative Reget Cumulative Reget 100 100 50 50

0 0 0 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 Rounds Rounds Rounds (a) LDP, = 0.2 (b) LDP, = 0.5 (c) LDP, = 1

250 0.08 Non-private Non-private 3000 =0.1 Laplace-LDP 0.07 Laplace-LDP =0.5 200 Gaussian-LDP 0.06 Gaussian-LDP 2500 =1 =2 150 0.05 2000 Non-private 0.04 1500 100 0.03 1000 Cumulative Reget 0.02 Cumulative Reget 50 Average Cumulative Reget 500 0.01

0 0.00 0 0 20000 40000 60000 80000 100000 0.0 0.5 1.0 1.5 2.0 0 20000 40000 60000 80000 100000 Rounds epsilon Rounds (d) LDP, = 2 (e) LDP, vary (f) DP, vary

3500 30000 L=20 60000 Non-private Non-private 3000 L=16 Laplace-LDP 25000 Laplace-LDP L=12 50000 Gaussian-LDP Gaussian-LDP 2500 L=8 20000 40000 2000 15000 30000 1500

20000 10000 Cumulative Reget 1000 Cumulative Reget Cumulative Reget

500 10000 5000

0 0 0 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 Rounds Rounds Rounds (g) DP, vary L (h) CUCB under LDP, = 0.2 (i) CUCB under LDP, = 2

Figure 1: Empirical results for private and non private algorithms under LDP and DP.

Cascading bandits under differential privacy. We provide two sets of empirical results for cascading bandits under DP. Figure 1(f) validates the algorithms under = {0.2, 0.5, 1, 2}. When is taken to be a relatively larger value, the performance of private algorithm is closer to the performance of non-private algorithm. Figure 1(g) validates the private algorithms under varying number of arms, for L = {8, 12, 16, 20}. It can be observed that larger L leads to larger regret.

8 Conclusion and future work

In this paper, we study the differential privacy and local differential privacy in the cascading bandit setting. Under DP, we give the state-of-the-art O(log1+ξ T ) regret in the common deﬁnition of DP. Under LDP, we utilize the tradeoff between and δ, relaxing the O(K2) dependence to O(K). Two respective lower bounds for both situations are provided. We conjecture the algorithm under DP is not good enough to achieve the best regret. For example, during the allocation of privacy budget, all previous work just equally distributes them. However, distributing privacy budget based on arm’s difference may improve regret order. We leave this direction for future work.

9 References Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016. Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011. Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002a. Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b. Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019. Omar Besbes, Yonatan Gur, and Assaf Zeevi. Stochastic multi-armed-bandit problem with non- stationary rewards. Advances in neural information processing systems, 2014. Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi- armed bandit problems. Found. Trends Mach. Learn., 5(1):1–122, 2012. T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 14(3):1–24, 2011. Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, 2013. Xiaoyu Chen, Kai Zheng, Zixin Zhou, Yunchang Yang, Wei Chen, and Liwei Wang. (locally) differentially private combinatorial semi-bandits. In International Conference on Machine Learning, 2020. Wang Chi Cheung, Vincent Tan, and Zixin Zhong. A thompson sampling algorithm for cascading bandits. In The 22nd International Conference on Artiﬁcial Intelligence and Statistics, 2019. Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, 2006. Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, 2010. Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014. Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, 2015. Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, 2015a. Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, 2015b. Shuai Li, Baoxiang Wang, Shengyu Zhang, and Wei Chen. Contextual combinatorial cascading bandits. In International conference on machine learning, 2016. Nikita Mishra and Abhradeep Thakurta. (nearly) optimal differentially private stochastic multi-arm bandits. In Proceedings of the Thirty-First Conference on Uncertainty in Artiﬁcial Intelligence, 2015. Wenbo Ren, Xingyu Zhou, Jia Liu, and Ness B Shroff. Multi-armed bandits with local differential privacy. arXiv preprint arXiv:2007.03121, 2020.

10 Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813:814, 2015. Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. In Advances in Neural Information Processing Systems, 2018. Aristide Tossou and Christos Dimitrakakis. Algorithms for differentially private multi-armed bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016. Aristide Tossou and Christos Dimitrakakis. Achieving privacy in the adversarial multi-armed bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017. Kun Wang, Canzhe Zhao, Shuai Li, and Shuo Shao. Conservative contextual combinatorial cascading bandit. arXiv preprint arXiv:2104.08615, 2021. Jun Zhao, Teng Wang, Tao Bai, Kwok-Yan Lam, Zhiying Xu, Shuyu Shi, Xuebin Ren, Xinyu Yang, Yang Liu, and Han Yu. Reviewing and improving the gaussian mechanism for differential privacy. arXiv preprint arXiv:1911.12060, 2019. Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, and Branislav Kveton. Cascading bandits for large-scale recommendation problems. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.

11 A Technical Lemmas

∗ Lemma 5 (Kveton et al. [2015a]). For any item e and optimal item e , let: Ge,e∗,t = {∃1 ≤ k ≤ t ∗ t t K s.t. ak = e, πt(k) = e , wt(a1) = ... = wt(ak−1) = 0}, then there exists a permutation πt t of optimal items {1, ..., K}, which is a deterministic function of Ht, such that Ut(ak) ≥ Ut(πt(k)) for all k and the following holds:

L K X X 1 Et[R(At, wt)] ≥ α ∆e,e∗ Et[ {Ge,e∗,t}] , e=K+1 e∗=1 L K X X 1 Et[R(At, wt)] ≤ ∆e,e∗ Et[ {Ge,e∗,t}] , e=K+1 e∗=1 where α = (1 − w¯(1))K−1 and w¯(1) is the attraction probability of the most attractive item.

Lemma 6 (Chan et al. [2011]). Let X1,X2, ...Xn be i.i.d. random variables√ following the Lap(b) √ 2 2v2 distribution and Y = X1 + X2 + ... + Xn. For v ≥ b n and 0 < λ < b , we have λ2 P {Y > λ} ≤ exp(− ) . 8v2 2 ¯ 1 Pn Lemma 7 (Rigollet and Hütter [2015]). If X1, ...Xn are i.i.d N (u, σ ). X = n i=1 Xi ∼ N (u, σ2/n). The following inequality holds: nt2 P (|X¯ − u| > t) ≤ 2 exp(− ) . 2σ2

B CUCB under LDP

Algorithm 4: CUCB under LDP (Gaussian Mechanism). Input: : the differential privacy parameter, m: the number of arm, K: the maximum arm each round recommends. 1 q 1.25 −2 Initialization:σ = 2K ln δ , γ = 2t Observe u0. ∀i ∈ [m]: Ti ← 1 ∀i ∈ [m] :u ˆ(i) ← u0(i) for t = m + 1, m + 2, ..., n do q q 1.25 3 log t 2 2K log δ log t ∀i ∈ [m] :u ¯i =u ˆi + + 2Ti Ti St = Oracle(¯u1, u¯2, ..., u¯K ). Play S and observe ut. for e = 1, ..., K do Tt(e) = Tt−1(e) + 1. T (e)ˆu (e)+u (e)+N (0,σ2) uˆ (e) = t−1 Tt−1(e) t Tt(e) Tt(e)

C Proof of Theorem 2

Proof. According to Lemma 5, the instant regret at time t satisﬁes following inequality:

L K X X 1 Rt ≤ ∆e,e∗ Et[ {Ge,e∗,t}] . e=K+1 e∗=1

Let w˜ denote the empirical mean not adding Laplace noise, Λ1,t = {∀e ∈ E, s.t. |w¯(e) − w˜(e)| ≤ Noise t,Tt−1(e) c }, Λ2,t = {∀e ∈ E, s.t. | | ≤ v }, and Tt−1(e) denotes the number of t,Tt−1(e) Tt−1(e) t,Tt−1(e) arm e pulled until time t − 1,

12 T T t T t T X X X X X X X X X π2 [ 1{Λ¯ }] ≤ P (|w¯(e)−w˜(e)| ≥ c ) ≤ 2 e−3 log t ≤ 2 t−2 ≤ L. E 1,t t,s 3 t=1 e∈E t=1 s=1 e∈E t=1 s=1 e∈E t=1

1.5 Based on Lemma 1, let γ = t−3, then v = 3c1L log Tt−1(e) log t , t,Tt−1(e) Tt−1(e)

T T t T t T 2 X 1 ¯ X X X Noiset,s X X X −3 X X −2 π E[ {Λ2,t}] ≤ P ( ≥ vt,s) ≤ t ≤ t < L. s 3 t=1 e∈E t=1 s=1 e∈E t=1 s=1 e∈E t=1

T T T X 1 X 1 ¯ X 1 ¯ R(T ) ≤ E[ {Λ1,t, Λ2,t}Rt] + E[ {Λ1,t}Rt] + E[ {Λ2,t}Rt] . t=1 t=1 t=1

2π2 Then since Rt ≤ 1, the last two terms can be bounded by 3 L, and for all time t with probability of 1 − t−3, 1.5 Noiset,Tt−1(e) 3c1L log Tt−1(e) log t ≤ = vt,Tt−1(e) . Tt−1(e) Tt−1(e) Therefore, the cumulative regret until time T can be bounded as following:

L K T X X X 2π2L R(T ) ≤ [ ∆ ∗ 1{Λ , Λ ,G ∗ }] + . E e,e 1,t 2,t e,e ,t 3 e=K+1 e∗=1 t=1

If at time t, the algorithm selects a sup-optimal arm, then ∗ ∗ w¯(e) + 2ct−1,Tt−1(e) + 2vt−1,Tt−1(e) ≥ Ut(e) ≥ Ut(e ) ≥ w¯(e ) .

∗ ∆e,e ≤ 2ct−1,Tt−1(e) + 2vt−1,Tt−1(e) .

∗ We consider its complementary set, i.e. Et = {∆e,e > 2ct−1,Tt−1(e) + 2vt−1,Tt−1(e)}. It indicates the event the algorithm always chooses the optimal arm. It is enough to bound the following two inequalities:

λ0∆ > 2ct,Tt−1(e) ,

(1 − λ0)∆ > 2vt−1,Tt−1(e) , For the ﬁrst inequality, we get: 6 Tt−1(e) ≥ 2 2 log t . λ0∆e,e∗ For the second inequality, we get:

1.5 Tt−1(e) ≥ B log Tt−1(e) ,

6c1L log t where B = (1−λ)∆ . 1+ξ If we let Tt−1(e) = B , we get Bξ ≥ (1 + ξ)1.5 log1.5 B.

When t is sufﬁciently large, it must exist a constant c2, when B > c2, the above inequality holds. Thus, 6 1+ξ Tt−1(e) ≥ max{ 2 2 log t, B } . λ0∆e,e∗ 6 1+ξ It means that, when Tt−1(e) ≥ max{ 2 2 log t, B }, the algorithm must choose the optimal λ0∆e,e∗ arm at this time. What’s more, we consider t as the dominant term, and return back to previous situation, 1+ξ ∗ ∗ ∆e,e ≤ 2ct−1,Tt−1(e) + 2vt−1,Tt−1(e) ⇒ Tt−1(e) ≤ B = τe,e .

13 Therefore,

K T K T X X X X ∆e,e∗ 1{Λ1,t, Λ2,t,Ge,e∗,t} ≤ ∆e,e∗ 1{Tt−1(e) ≤ τe,e∗ ,Ge,e∗,t} . (1) e∗=1 t=1 e∗=1 t=1

Let T X Me,e∗ = 1{Tt−1(e) ≤ τe,e∗ ,Ge,e∗,t}. t=1

Now note that (i) the counter Tt−1(e) of item e increases by one when the event Ge,e∗,t happens ∗ ∗ for any optimal item e , (ii) the event Ge,e∗,t happens for at most one optimal e at any time t; and (iii) τe,1 ≤ ... ≤ τe,K . Based on these facts, it follows that Me,e∗ ≤ τe,e∗ , and moreover PK e∗=1 Me,e∗ ≤ τe,K . Therefore, the right-hand side of (1) can be bounded by: ( K K ) X X max ∆e,e∗ me,e∗ : 0 ≤ me,e∗ ≤ τe,e∗ , me,e∗ ≤ τe,K . e∗=1 e∗=1 ∗ Since the gaps are decreasing, ∆e,1 ≥ ... ≥ ∆e,K , the solution to the above problem is me,1 = τe,1, ∗ ∗ me,2 = τe,2 − τe,1, ... , me,K = τe,K − τe,K−1. Therefore, the above can be bounded by:

" K # √ 1 X 1 1 8c1L 2 1+ξ ∆ + ∆ ∗ ( − ) ( ln 4T ) . e,1 1+ξ e,e 1+ξ 1+ξ (1 − λ ) ∆e,1 e∗=2 ∆e,e∗ ∆e,e∗−1 0

Therefore, let λ0 = 1/2, we get:

K T 1+ξ 1+ξ X X 192c1 L 1+ξ ∆e,e∗ 1{Λ1,t, Λ2,t,Ge,e∗,t} ≤ log T. ξ 1+ξ e∗=1 t=1 ∆e,K

L 1+ξ 1+ξ 2 X 192c1 L 1+ξ 2π R(T ) ≤ 1+ξ log T + L + c2 . ∆e,K 3 e=K+1

D Proof of Theorem 4

Proof. Let w˜ denote the empirical mean not adding differential private noise, Λ1,t = {∀e ∈ PTt−1(e) q t=1 Lap K 24 log t E s.t. |w¯(e) − w˜(e)| ≤ c }, Λ2,t = {∀e ∈ E s.t. | | ≤ }, the cu- t,Tt−1(e) Tt−1(e) Tt−1(e) mulative regret until time t is:

T T T X 1 X 1 ¯ X 1 ¯ R(T ) ≤ E[ {Λ1,t, Λ2,t}Rt] + E[ {Λ1,t}Rt] + E[ {Λ2,t}Rt] . t=1 t=1 t=1

T T t T t T X X X X X X X X X π2 [ 1{Λ¯ }] ≤ P (|w¯(e)−w˜(e)| ≥ c ) ≤ 2 e−3 log t ≤ 2 t−2 ≤ L. E 1,t t,s 3 t=1 e∈E t=1 s=1 e∈E t=1 s=1 e∈E t=1 √ √ K √ K K s 2 2Ks According to Lemma 6, let λ = 24s log t,b = ,v = ,λ < , when s ≥ 3 log t (this part can be bounded by 3 log t),

T T t s r T t T P Lap K 24 log t π2 X 1 ¯ X X X t=1 X X X −3 X X −2 E[ {Λ2,t}] ≤ 2P ( ≥ ) ≤ 2 t ≤ 2 t ≤ L, s s 3 t=1 e∈E t=1 s=1 e∈E t=1 s=1 e∈E t=1 Then based on Lemma 5, we get:

T L K T X 1 X X X 1 E[ {Λ1,t, Λ2,t}Rt] ≤ ∆e,e∗ E[ {Λ1,t, Λ2,t,Ge,e∗,t}] . t=1 e=K+1 e∗=1 t=1

14 K q 24 log t Now we bound the above inequality for any sub-optimal arms, let ht−1,s = s , if at time t, the algorithm selects a sub-optimal arm e, then: ∗ ∗ w¯(e) + 2ct−1,Tt−1(e) + 2ht−1,Tt−1(e) ≥ Ut(e) ≥ Ut(e ) ≥ w¯(e ) , which implies:

∗ ∆e,e ≤ 2ct−1,Tt−1(e) + 2ht−1,Tt−1(e) .

Then for any sub-optimal arm e, we have: √ √ K 2 4 log t( 1.5 + 24) Tt−1(e) ≤ 2 = τe,e∗ . ∆e,e∗ Therefore,

K T K T X X X X ∆e,e∗ 1{Λ1,t, Λ2,t,Ge,e∗,t} ≤ ∆e,e∗ 1{Tt−1(e) ≤ τe,e∗ ,Ge,e∗,t} . (2) e∗=1 t=1 e∗=1 t=1

Let T X Me,e∗ = 1{Tt−1(e) ≤ τe,e∗ ,Ge,e∗,t} t=1

( K K ) X X max ∆e,e∗ me,e∗ : 0 ≤ me,e∗ ≤ τe,e∗ , me,e∗ ≤ τe,K . e∗=1 e∗=1 ∗ Since the gaps are decreasing, ∆e,1 ≥ ... ≥ ∆e,K , the solution to the above problem is me,1 = τe,1, ∗ ∗ me,2 = τe,2 − τe,1, ... , me,K = τe,K − τe,K−1. Therefore, the above can be bounded by:

" K !# √ √ 1 X 1 1 2 ∆e,1 2 + ∆e,e∗ 2 − 2 4( 1.5 + K/ 24) log T. ∆ ∆ ∗ ∆ ∗ e,1 e∗=2 e,e e,e −1

The above term is bounded by √ √ 8( 1.5 + K 24)2 log T. ∆e,K Therefore, √ √ L K 2 2 L 2 ! X 8( 1.5 + 24) 2π X K R(T ) ≤ log T + L = O 2 log T . ∆e,K 3 ∆e,K e=K+1 e=K+1

E Proof of Theorem 6

Proof. Following the same procedure as Appendix D, let w˜ denote the empirical mean not adding differential private noise, Λ1,t = {∀e ∈ E s.t. |w¯(e) − w˜(e)| ≤ ct,Tt−1(e)}, Λ2,t = {∀e ∈ T (e) r P t−1 N K log 1.25 log 2 E s.t. | t=1 t | ≤ 2 δ γ }. Tt−1(e) Tt−1(e)

T T T X 1 X 1 ¯ X 1 ¯ R(T ) ≤ E[ {Λ1,t, Λ2,t}Rt] + E[ {Λ1,t}Rt] + E[ {Λ2,t}Rt] . t=1 t=1 t=1

15 r K log 1.25 log 2 Based on Lemma 7, we get with probability at least 1 − γ, u ∈ [X¯ − 2 δ γ , X¯ + Tt−1(e) r K log 1.25 log 2 2 δ γ ]. Let error probability for Gaussian distribution γ = t−3, then the latter two terms Tt−1(e) 2π2 can be bounded by 3 L. As for the ﬁrst term,

T L K T X 1 X X X 1 E[ {Λ1,t, Λ2,t}Rt] ≤ ∆e,e∗ E[ {Λ1,t, Λ2,t,Ge,e∗,t}] . t=1 e=K+1 e∗=1 t=1

q 1.25 3 2 K log δ log 2t Let ht−1,s = s , if the algorithm chooses a sub-optimal arm at round t,

∗ ∗ w¯(e) + 2ct−1,Tt−1(e) + 2ht−1,Tt−1(e) ≥ Ut(e) ≥ Ut(e ) ≥ w¯(e ) ,

∗ 2ct−1,Tt−1(e) + 2ht−1,Tt−1(e) ≥ ∆e,e .

s s 1.25 3 1.5 log t 4 K log δ log 2t 2 + ≥ ∆e,e∗ . Tt−1(e) Tt−1(e)

Then we get, √ 8 q 1.25 2 (2 1.5 + K log δ ) log t Tt−1(e) ≤ 2 . ∆e,e∗ Follow the same procedure as Appendix D, ﬁnally we get: √ q L 2(2 1.5 + 8 K log 1.25 )2 2 L ! X δ 2π X K log 1/δ R(T ) ≤ log T + L = O 2 log T . ∆e,K 3 ∆e,K e=K+1 e=K+1

F Proof of Theorem 9

Proof. Before the proof, we ﬁrst deﬁne some notations based on Chen et al. [2013]: let Ti,t be the number of arm i pulled at the end of time t. The oracle in the algorithm is a (α, β)-approximation oracle, which means P [ru(S) ≥ αoptu] ≥ β. A super arm S is bad if ru(S) < αoptu. And let SB = {S|ru(S) < αoptu}. Under this circumstance,

i ∆min = αoptu − max{ru(S)|S ∈ SB, i ∈ S},

i ∆max = αoptu − min{ru(S)|S ∈ SB, i ∈ S}. Following above, the cumulative regret has the following form:

T X R(T ) = T αβoptu − E[ ru(St)] i=1

Round t is bad if the oracle selects a bad super arm St ∈ SB. Let Ft be the event that the oracle fails to produce an α-approximate answer. We have P [Ft] ≤ 1 − β. Moreover, we maintain counter Ni for each arm i after m initialization rounds. If at round t ≥ m, the oracle selects a bad super arm, let i = arg minj∈St Nj,t−1, then Ni,t = Ni,t−1 + 1. By deﬁnition, Ni,t ≤ Ti,t. Note that in every bad m round, exactly one counter in {Ni} is incremented, so the total number of bad rounds in the ﬁrst P i=1 T rounds is less than or equal i Ni,T .

16 1.25 128K log δ log t Let lt = 2 −1 2 . Consider a bad round t, i.e. St ∈ SB is selected at time t, then we min{ ,2}(f (∆min)) have:

m X Ni,T − m(lT + 1) i=1 T X = 1{St ∈ SB} − mlT t=m+1 T X X ≤ 1{St ∈ SB,Ni,t > Ni,t−1,Ni,t−1 > lT } t=m+1 i∈[m] T X X ≤ 1{St ∈ SB,Ni,t > Ni,t−1,Ni,t−1 > lt} (3) t=m+1 i∈[m] T X = 1{St ∈ SB, ∀i ∈ St,Ni,t−1 > lt} t=m+1 T X ≤ (1{Ft} + 1{¬Ft,St ∈ SB, ∀i ∈ St,Ni,t−1 > lt}) t=m+1 T X ≤ (1{Ft} + 1{¬Ft,St ∈ SB, ∀i ∈ St,Ti,t−1 > lt}) . t=m+1 P t Nt Let u˜ denote the empirical mean not adding differential private noise, and Gt = {∀i ∈ [m], ≤ Ti q 2K log 1.25 log t 2 δ }, Ti q P [|u˜i − ui| ≥ 3 log t/(2Ti,t−1)] t−1 X q ≤ P r[{|uˆ − ui| = 3 log t/(2s),Ti,t−1 = s}] s=1 (4) t−1 X p ≤ P r[|uˆi,s − ui| ≥ 3 log t/(2s)] s=1 ≤ 2te−3 log t = 2t−2 .

Let γ = 2t−2, then with probability of 1 − 2t−2, s P 1.25 t Nt 2 2K log δ log t ≤ . Ti Ti

q 8K log 1.25/δ log t Let Λi,t = 2 2 , then Λi,t is the max conﬁdence interval each round because: min{ ,2}Ti s 1.25 r 2 2K log δ log t 3 log t Λi,t ≥ + . Ti 2Ti

q 8K log 1.25/δ log t Let Et = {∀i ∈ [m], |uˆi,T − ui| ≤ Λi,t}, Λt = max{Λi,t|i ∈ St}, Λ = 2 2 i,t−1 min{ ,2}lt which is not a random variable, we have: −2 P [¬Et] ≤ 4mt ,

Et ⇒ ∀i ∈ St, |u¯i,t − ui| < 2Λt ,

17 {St ∈ SB, ∀i ∈ St,Ti,t−1 > lt} ⇒ Λ > Λt ,

Et ⇒ u¯t ≥ u .

If {Et, ¬Ft,St ∈ SB, ∀i ∈ St,Ti,t−1 > lt} holds at time t, we have the following important derivation: r (S ) + f(2Λ) > r (S ) + f(2Λ ) ≥ r (S ) ≥ α ≥ αr (S∗) ≥ r (S∗) = α . u t u t t u¯t t optu¯t u¯t t u u optu So we have ru(St) + f(2Λ) > αoptu .

1.25 128K log δ log t Since lt = 2 −1 2 , we have f(2Λ) = ∆min. This contradicts the deﬁnition of ∆min min{ ,2}(f (∆min)) and the fact St ∈ SB. Thus we have:

P [{Et, ¬Ft,St ∈ SB, ∀i ∈ St,Ti,t−1 > lt}] = 0 ⇒ −2 (5) P [{¬Ft,St ∈ SB, ∀i ∈ St,Ti,t−1 > lt}] ≤ P [¬Et] ≤ 4mt . m T X X 4m [ N ] ≤ m(l + 1) + (1 − β)(T − m) + E i,T T t2 i=1 t=1 (6) 1.25 2 128mK log δ log T 2π ≤ 2 −1 2 + ( + 1)m + (1 − β)(T − m) . min{ , 2}(f (∆min)) 3

" m # X R(T ) ≤ T αβoptu − (T αoptu − E Ni,T ∆max) i=1 " m # X ≤ T α(β − 1)optu + E Ni,T ∆max i=1 1.25 2 128K log δ log T 2π ≤ 2 −1 2 + + 1 m∆max min{ , 2}(f (∆min)) 3 mK log 1 log T = O δ . 2

G Proof of Theorem 10

Proof. Our lower bound is based on the following problem. There is L items which all obey the Bernoulli distribution P . Each arm’s weight is parameterized by: p e ≤ K w¯(e) = (7) p − ∆ otherwise , where ∆ ∈ (0, p). Based on Lemma 5,

L K K−1 X X 1 E[R(At, wt)] ≥ ∆(1 − p) E[ {Ge,e∗,t}] . e=K+1 e∗=1

L n K K−1 X X X 1 R(n) ≥ ∆(1 − p) E[ {Ge,e∗,t}] e=K+1 t=1 e∗=1 (8) L K−1 X = ∆(1 − p) E[Tn(e)] . e=K+1

18 By the conclusion from the work of Theorem 4 in Basu et al. [2019], we have that for any bandit [T (e)] 1 lim inf E n ≥ . n→∞ log n 2 min{4, e2}(e − 1)2d(pkp − ∆)

∆2 Finally, using inequality d(pkp − ∆) ≤ p(1−p) , we get the ﬁnal bound:

R(n) (L − K)p(1 − p)K lim inf ≥ . n→∞ log n 2∆ min{4, e2}(e − 1)2

H Proof of Theorem 11

Proof. Our lower bound is based on the following problem. There is L items which all obey the Bernoulli distribution P . Each arm’s weight is parameterized by: p e ≤ K w¯(e) = (9) p − ∆ otherwise , where ∆ ∈ (0, p). According to Shariff and Sheffet [2018], the regret lower bound of bandit under DP has to suffer two terms: one term comes from the non-private bandit, one term comes from private bandit, as private cascading bandit is strictly harder than non-private cascading bandit (by reduction). (L−K)(1−p)K−1 log n The ﬁrst term is at least Ω( ∆ )(Kveton et al. [2015a]). Then we just have to prove (L−K)(1−p)K−1 log n cascading bandit under DP suffers regret at least Ω( ).

n L K X X X 1 R(n)private ≥ α ∆e,e∗ Et[ {Ge,e∗,t}] t=1 e=K+1 e∗=1 L K−1 X ≥ ∆(1 − p) E[Tn(e)] . e=K+1 Then according to Claim 14 from Shariff and Sheffet [2018], for any sub-optimal arm, we get log n 1 P T (e) ≥ ≥ . n 100∆ 2 1 log n log n [T (e)] ≥ × = . E n 2 100∆ 200∆ So we get the conclusion: (1 − p)K−1(L − K) log n (L − K)(1 − p)K−1 log n R(n) ≥ = Ω( ) . private 200 Therefore, we get the ﬁnal form of the result.

I Additional Experiments

In this section, we supplement some experiments with different number of arms under LDP. According to results, as K increases, the gap between the Laplace mechanism and Gaussian mechanism becomes larger. This phenomenon reﬂects the inﬂuence of K in the regret.

19 350 175 Non-private Non-private 300 Laplace-LDP 150 Laplace-LDP Gaussian-LDP Gaussian-LDP 250 125

200 100

150 75

Cumulative Reget 100 Cumulative Reget 50

50 25

0 0 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 Rounds Rounds (a) LDP,K = 4 (b) LDP, K = 8

20.0 1.0 Non-private Non-private 17.5 Laplace-LDP Laplace-LDP 0.8 15.0 Gaussian-LDP Gaussian-LDP

12.5 0.6 10.0

7.5 0.4

Cumulative Reget 5.0 Cumulative Reget 0.2 2.5

0.0 0.0 0 20000 40000 60000 80000 100000 0 20000 40000 60000 80000 100000 Rounds Rounds (c) LDP, K = 12 (d) LDP, K = 16

Figure 2: Empirical results for private and non private algorithms under LDP when = 0.2.