
Action-Manipulation Attacks Against Stochastic Bandits: Attacks and Defense

Guanlin Liu and Lifeng Lai

Abstract—Due to the broad range of applications of the stochastic multi-armed bandit model, understanding the effects of adversarial attacks and designing bandit algorithms robust to such attacks are essential for the safe application of this model. In this paper, we introduce a new class of attack named action-manipulation attack. In this attack, an adversary can change the action signal selected by the user. We show that without knowledge of the mean rewards of the arms, our proposed attack can manipulate the Upper Confidence Bound (UCB) algorithm, a widely used bandit algorithm, into pulling a target arm very frequently by spending only logarithmic cost. To defend against this class of attacks, we introduce a novel algorithm that is robust to action-manipulation attacks when an upper bound for the total attack cost is given. We prove that our algorithm has a pseudo-regret upper bounded by $O(\max\{\log T, A\})$, where $T$ is the total number of rounds and $A$ is the upper bound of the total attack cost.

Index Terms—Stochastic bandits, action-manipulation attack, UCB.

I. INTRODUCTION

In order to develop trustworthy machine learning systems, understanding adversarial attacks on learning systems and correspondingly building robust defense mechanisms have attracted significant recent research interest [2]–[9]. In this paper, we focus on multi-armed bandits (MABs), a simple but very powerful framework of online learning that makes decisions over time under uncertainty. The MAB problem is widely investigated in machine learning and signal processing [10]–[16] and has many applications in a variety of scenarios such as displaying advertisements [17], article recommendation [18], cognitive radios [19], [20] and search engines [21], to name a few.

Of particular relevance to our work is a line of interesting recent work on online reward-manipulation attacks on stochastic MABs [22]–[25]. In the reward-manipulation attacks, there is an adversary who can change the reward signal from the environment, and hence the reward signal received by the user is not the true reward signal from the environment. In particular, [22] proposes an interesting attack strategy that can force a user, who runs either the ε-greedy or the Upper Confidence Bound (UCB) algorithm, to select a target arm while only spending effort that grows in logarithmic order. [23] proposes an optimization-based framework for offline reward-manipulation attacks. Furthermore, it studies a form of online attack strategy that is effective in attacking any bandit algorithm whose regret scales in logarithmic order, without knowing what particular algorithm the user is using. [25] considers an attack model where an adversary attacks with a certain probability at each round but its attack value can be arbitrary and unbounded, and proposes algorithms that are robust to these types of attacks. [24] considers how to defend against reward-manipulation attacks, a complementary problem to [22], [23]. In particular, [24] introduces a bandit algorithm that is robust to reward-manipulation attacks under a certain attack cost, by using a multi-layer approach. [26] introduces another adversary setting where each arm is able to manipulate its own reward and seeks to maximize its own expected number of pulls. Under this setting, [26] analyzes the robustness of Thompson Sampling, UCB, and ε-greedy against the adversary, and proves that all three algorithms achieve a regret upper bound that increases over rounds in a logarithmic order or increases with attack cost in a linear order. This line of reward-manipulation attacks has also recently been investigated for contextual bandits in [27], which develops an attack algorithm that can force the bandit algorithm to pull a target arm for a target contextual vector by slightly manipulating rewards in the data.

In this paper, we introduce a new class of attacks on MABs named action-manipulation attack. In the action-manipulation attack, an attacker, sitting between the environment and the user, can change the action selected by the user to another action. The user will then receive a reward from the environment corresponding to the action chosen by the attacker. Compared with the reward-manipulation attacks discussed above, the action-manipulation attack is more difficult to carry out. In particular, as the action-manipulation attack only changes the action, it can influence but does not have direct control of the reward signal, because the reward signal will be a random variable drawn from a distribution depending on the action chosen by the attacker. This is in contrast to reward-manipulation attacks, where the attacker has direct control and can change the reward signal to any value.

In order to demonstrate the significant security threat of action-manipulation attacks to stochastic bandits, we propose an action-manipulation attack strategy against the widely used UCB algorithm. The proposed attack strategy aims to force the user to pull a target arm chosen by the attacker frequently. We assume that the attacker does not know the true mean reward of each arm. The assumption that the attacker does not know the mean rewards of the arms is necessary, as otherwise the attacker can perform the attack trivially. To see this, with knowledge of the mean rewards, the attacker knows which arm has the worst mean reward and can perform the following oracle attack: when the user pulls a non-target arm, the attacker changes the arm to the worst arm. This oracle attack makes all non-target arms have expected rewards less than that of the target arm, provided the target arm selected by the attacker is not the worst arm. In addition, under this attack, all sublinear-regret bandit algorithms will pull the target arm $O(T)$ times. However, the oracle attack is not practical. The goal of our work is to develop an attack strategy that has performance similar to the oracle attack strategy without requiring knowledge of the true mean rewards.

When the user pulls a non-target arm, the attacker could decide to attack by changing the action to the arm that appears to be the worst. As the attacker does not know the true values of the arms, our attack scheme relies on lower confidence bounds (LCBs) of the value of each arm in making attack decisions. Correspondingly, we name our attack scheme the LCB attack strategy. Our analysis shows that, if the target arm selected by the attacker is not the worst arm, the LCB attack strategy can successfully manipulate the user into selecting the target arm almost all the time with only a logarithmic cost. In particular, the LCB attack strategy can force the user to pull the target arm $T - O(\log(T))$ times over $T$ rounds, with a total attack cost of only $O(\log(T))$. On the other hand, we also show that, if the target arm is the worst arm and the attacker can only incur logarithmic costs, no attack algorithm can force the user to pull the worst arm more than $T - O(T^{\alpha})$ times.

Motivated by the analysis of the action-manipulation attacks and the significant security threat to MABs, we then design a bandit algorithm which can defend against the action-manipulation attacks and is still able to achieve a small regret. The main idea of the proposed algorithm is to bound the maximum amount of offset, in terms of the user's estimates of the mean rewards, that can be introduced by the action-manipulation attacks. We then use this estimate of the maximum offset to properly modify the UCB algorithm and build specially designed high-probability upper bounds of the mean rewards so as to decide which arm to pull. We name our bandit algorithm maximum offset upper confidence bound (MOUCB). In particular, our algorithm first pulls every arm a certain number of times and then pulls the arm whose modified upper confidence bound is largest. Furthermore, we prove that the MOUCB bandit algorithm has a pseudo-regret upper bounded by $O(\max\{\log T, A\})$, where $T$ is the total number of rounds and $A$ is an upper bound for the total attack cost. In particular, if $A$ scales as $\log(T)$, MOUCB achieves a logarithmic pseudo-regret, which is the same as the regret of the UCB algorithm.

The remainder of the paper is organized as follows. In Section II, we describe the model. In Section III, we describe the LCB attack strategy and analyze its cumulative attack cost. In Section IV, we propose MOUCB and analyze its regret. In Section V, we provide numerical examples to validate the theoretic analysis. Finally, we offer several concluding remarks in Section VI. The proofs are collected in the Appendix.

G. Liu and L. Lai are with the Department of Electrical and Computer Engineering, University of California, Davis, CA, 95616. Email: {glnliu,lflai}@ucdavis.edu. The work of G. Liu and L. Lai was supported by the National Science Foundation under Grants CCF-1717943, ECCS-1711468, CNS-1824553 and CCF-1908258. This paper will be presented in part in the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing [1].

II. MODEL

In this section, we introduce our model. We consider the standard stochastic multi-armed bandit setting. The environment consists of $K$ arms, with each arm corresponding to a fixed but unknown reward distribution. The bandit algorithm, which is also called the "user" in this paper, proceeds in discrete time $t = 1, 2, \ldots, T$, in which $T$ is the total number of rounds. At each round $t$, the user pulls an arm (or action) $I_t \in \{1,\ldots,K\}$ and receives a random reward $r_t$ drawn from the reward distribution of arm $I_t$. Denote $\tau_i(t) := \{s : s \le t, I_s = i\}$ as the set of rounds up to $t$ where the user chooses arm $i$, $N_i(t) := |\tau_i(t)|$ as the number of rounds that arm $i$ was pulled by the user up to time $t$, and

$$\hat{\mu}_i(t) := N_i(t)^{-1} \sum_{s \in \tau_i(t)} r_s \qquad (1)$$

as the empirical mean reward of arm $i$. The pseudo-regret $\bar{R}(T)$ is defined as

$$\bar{R}(T) = T \max_{i \in [K]} \mu_i - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right]. \qquad (2)$$

The goal of the user is to minimize $\bar{R}(T)$.

In this paper, we introduce a novel adversary setting, in which the attacker sits between the user and the environment. The attacker can monitor the actions of the user and the reward signals from the environment. Furthermore, the attacker can introduce action-manipulation attacks on stochastic bandits. In particular, at each round $t$, after the user chooses an arm $I_t$, the attacker can manipulate the user's action by changing $I_t$ to another $I_t' \in \{1,\ldots,K\}$. If the attacker decides not to attack, $I_t' = I_t$. The environment generates a random reward $r_t$ from the reward distribution of the post-attack arm $I_t'$. Then the user and the attacker receive the reward $r_t$ from the environment.

Fig. 1. Action-manipulation attack model.

Without loss of generality and for notational convenience, we assume arm $K$ is the "attack target" arm, or target arm. The attacker's goal is to manipulate the user into pulling the target arm very frequently while making attacks as rarely as possible. Define the set of rounds when the attacker decides to attack as $\mathcal{C} := \{t : t \le T, I_t' \ne I_t\}$. The cumulative attack cost is the total number of rounds where the attacker decides to attack, i.e., $|\mathcal{C}|$.
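To make this interaction protocol concrete, the following Python sketch simulates the user-attacker-environment loop described above with Gaussian rewards and reports the pseudo-regret in (2). It is an illustration only; the function names, the callback interfaces, and the Gaussian noise model are our own illustrative assumptions and are not part of the formal model.

import numpy as np

def run_protocol(mu, sigma, T, user_policy, attacker=None, seed=0):
    # mu: true mean rewards (unknown to both user and attacker); sigma: noise level.
    # user_policy(t, pulls, rewards) returns the arm I_t the user wants to pull.
    # attacker(t, I_t) returns the post-attack arm I'_t (or I_t if it does not attack).
    rng = np.random.default_rng(seed)
    pulls, post_arms, rewards = [], [], []
    for t in range(T):
        I_t = user_policy(t, pulls, rewards)
        I_post = attacker(t, I_t) if attacker is not None else I_t
        r_t = rng.normal(mu[I_post], sigma)      # reward is drawn from the POST-attack arm
        pulls.append(I_t); post_arms.append(I_post); rewards.append(r_t)
    # Pseudo-regret (2), with the expectation replaced by the means of the arms
    # that were actually played after the attack.
    pseudo_regret = T * max(mu) - sum(mu[a] for a in post_arms)
    return pulls, rewards, pseudo_regret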

In this paper, we assume that the reward distribution of arm $i$ is $\sigma^2$-sub-Gaussian with mean $\mu_i$. Denote the true reward vector as $\mu = [\mu_1, \cdots, \mu_K]$. Neither the user nor the attacker knows $\mu$, but $\sigma^2$ is known to both the user and the attacker. Define the difference between the mean values of arms $i$ and $j$ as $\Delta_{i,j} = \mu_i - \mu_j$. Furthermore, we refer to the best arm as $i_O = \arg\max_i \mu_i$ and the worst arm as $i_W = \arg\min_i \mu_i$.

Note that the assumption that the attacker does not know $\mu$ is important. If the attacker knew these values, the attacker could adopt a trivial oracle attack scheme: whenever the user pulls a non-target arm $I_t$, the attacker changes $I_t$ to the worst arm $i_W$. It is easy to show that, if the user uses a bandit algorithm that has a regret upper bound of $O(\log(T))$ when there is no attack, the oracle attack scheme can force the user to pull the target arm $T - O(\log(T))$ times, using a cumulative cost $|\mathcal{C}| = O(\log(T))$. However, the oracle attack scheme is not practical when the true reward vector is unknown. In this paper, we will first design an effective attack scheme, which does not assume knowledge of the true reward vector and nearly matches the performance of the oracle attack scheme, to attack the UCB algorithm. We will then design a new bandit algorithm that is robust against the action-manipulation attack.

The action-manipulation attack considered here is different from the reward-manipulation attacks introduced by the interesting recent work [22], [23], where the attacker can change the reward signal from the environment. In the setting considered in [22], [23], the attacker can change the reward signal $r_t$ from the environment to an arbitrary value chosen by the attacker. Correspondingly, the cumulative attack cost in [22], [23] is defined to be the sum of the absolute values of the changes to the rewards. Compared with these reward-manipulation attacks, the action-manipulation attack is more difficult to carry out. In particular, as the action-manipulation attack only changes the action, it can influence but does not have direct control of the reward signal, which will be a random variable drawn from a distribution depending on the action chosen by the attacker. This is in contrast to reward-manipulation attacks, where an attacker can change the rewards to any value.

III. ATTACK ON UCB AND COST ANALYSIS

In this section, we use the UCB algorithm as an example to illustrate the effects of action-manipulation attacks. We will introduce the LCB attack strategy on the UCB bandit algorithm and analyze its cost.

A. Attack strategy

The UCB algorithm [28] is one of the most popular bandit algorithms. In the UCB algorithm, the user initially pulls each of the $K$ arms once in the first $K$ rounds. After that, the user chooses arms according to

$$I_t = \arg\max_i \left\{ \hat{\mu}_i(t-1) + 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} \right\}. \qquad (3)$$

Under the action-manipulation attack, as the user does not know that $r_t$ is generated from arm $I_t'$ instead of $I_t$, the empirical mean $\hat{\mu}_i(t)$ computed using (1) is no longer a proper estimate of the true mean reward of arm $i$. On the other hand, the attacker is able to obtain a good estimate of $\mu_i$ by

$$\hat{\mu}_i'(t) := N_i'(t)^{-1} \sum_{s \in \tau_i'(t)} r_s, \qquad (4)$$

where $\tau_i'(t) := \{s : s \le t, I_s' = i\}$ is the set of rounds up to $t$ when the attacker changes an arm to arm $i$, and $N_i'(t) = |\tau_i'(t)|$ is the number of pulls of post-attack arm $i$ up to round $t$. This information gap provides an opportunity for the attack. In this section, we assume that the target arm is not the worst arm, i.e., $\mu_K > \mu_{i_W}$. We will discuss the case where the target arm is the worst arm in Section III-C.

The proposed attack strategy works as follows. In the first $K$ rounds, the attacker does not attack. After that, at round $t$, if the user chooses a non-target arm $I_t$, the attacker changes it to the arm $I_t'$ that has the smallest lower confidence bound (LCB):

$$I_t' = \arg\min_i \left\{ \hat{\mu}_i'(t-1) - \mathrm{CB}\left(N_i'(t-1)\right) \right\}, \qquad (5)$$

where

$$\mathrm{CB}(N) = \sqrt{\frac{2\sigma^2}{N} \log \frac{\pi^2 K N^2}{3\delta}}. \qquad (6)$$

Here $\delta$ is a parameter related to the probability statements in the analytical results presented in Section III-B. We call our scheme the LCB attack strategy. If at round $t$ the user chooses the target arm, the attacker does not attack. Thus the cumulative attack cost of our LCB attack scheme is equal to the total number of times the non-target arms are selected by the user. The algorithm is summarized in Algorithm 1.

Algorithm 1: LCB attack strategy on UCB algorithm
Input: The user's bandit algorithm, target arm $K$.
1: for $t = 1, 2, \ldots$ do
2:   The user chooses arm $I_t$ to pull according to the UCB algorithm (3).
3:   if $I_t = K$ then
4:     The attacker does not attack, and $I_t' = I_t$.
5:   else
6:     The attacker attacks and changes arm $I_t$ to $I_t'$ chosen according to (5).
7:   end if
8:   The environment generates reward $r_t$ from arm $I_t'$.
9:   The attacker and the user receive $r_t$.
10: end for

Here, we highlight the main idea of why the LCB attack strategy works. As discussed in Section II, if the attacker knows which arm is the worst, the attacker can simply change the action to the worst arm whenever the user pulls a non-target arm. The main idea of the attack scheme is to estimate the mean of each arm, and to change the non-target arm to the arm whose lower confidence bound is the smallest.
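For illustration, the two decision rules in Algorithm 1 can be written down directly from (3), (5) and (6). The Python sketch below is a hypothetical implementation of the user's UCB choice and the attacker's LCB choice; the function names, the 0-based round index, and the bookkeeping conventions (the count and mean arrays are maintained by the caller) are our own assumptions rather than part of the algorithm's specification.

import numpy as np

def ucb_arm(t, counts, means, sigma):
    # User's rule (3): pull each arm once in the first K rounds (t is 0-based here),
    # then maximize the UCB index.
    K = len(counts)
    if t < K:
        return t
    return int(np.argmax(means + 3.0 * sigma * np.sqrt(np.log(t) / counts)))

def cb(N, K, sigma, delta):
    # CB(N) in (6).
    return np.sqrt(2.0 * sigma**2 / N * np.log(np.pi**2 * K * N**2 / (3.0 * delta)))

def lcb_attack_arm(t, I_t, target, post_counts, post_means, sigma, delta):
    # Attacker's rule in Algorithm 1: no attack in the first K rounds or when the
    # target arm is pulled; otherwise redirect to the arm with the smallest LCB (5).
    K = len(post_counts)
    if t < K or I_t == target:
        return I_t
    lcb = post_means - cb(post_counts, K, sigma, delta)
    return int(np.argmin(lcb))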

Effectively, this will almost always change the non-target arm to the worst arm. More formally, for $i \ne K$, we will show that this attack strategy ensures that $\hat{\mu}_i$, computed using (1) by the user, converges to $\mu_{i_W}$. On the other hand, as the attacker does not attack when the user selects arm $K$, $\hat{\mu}_K$ computed by the user still converges to the true mean $\mu_K$ as $N_K$ increases. Because of the assumption that the target arm is not the worst, which implies that $\mu_K > \mu_{i_W}$, $\hat{\mu}_i$ will eventually be smaller than $\hat{\mu}_K$. Then the user will rarely pull the non-target arms, as $\hat{\mu}_i$ is smaller than $\hat{\mu}_K$. Hence, the attack cost will also be small. The rigorous analysis of the cost is provided in Section III-B.

B. Cost analysis

To analyze the cost of the proposed scheme, we need to track $\hat{\mu}_i'(t)$, the estimate obtained by the attacker using (4), and $\hat{\mu}_i(t)$, the estimate obtained by the user using (1).

The analysis of $\hat{\mu}_i'(t)$ is relatively simple, as the attacker knows which arm is truly pulled and hence $\hat{\mu}_i'(t)$ is a proper estimate of the mean of arm $i$. Define the event

$$E_1 := \{\forall i, \forall t > K : |\hat{\mu}_i'(t) - \mu_i| < \mathrm{CB}(N_i'(t))\}. \qquad (7)$$

Roughly speaking, event $E_1$ is the event that the empirical means computed by the attacker using (4) are close to the true means. The following lemma, proved in [22], shows that the attacker can accurately estimate the average reward of each arm.

Lemma 1. (Lemma 1 in [22]) For $\delta \in (0, 1)$, $\mathbb{P}(E_1) > 1 - \delta$.

The analysis of $\hat{\mu}_i(t)$ computed by the user is more complicated. When the user pulls arm $i$, because of the action-manipulation attacks, the random rewards may be drawn from different reward distributions. Define $\tau_{i,j}(t) := \{s : s \le t, I_s = i \text{ and } I_s' = j\}$ as the set of rounds up to $t$ when the user chooses arm $i$ and the attacker changes it to arm $j$. Lemma 2 gives a high-probability confidence bound on $\hat{\mu}_{i,j}(t) := N_{i,j}(t)^{-1} \sum_{s \in \tau_{i,j}(t)} r_s$, the empirical mean reward of the portion of pulls of arm $i$ whose post-attack arm is $j$, where $N_{i,j}(t) := |\tau_{i,j}(t)|$. Define the event

$$E_2 := \left\{\forall i, \forall j, \forall t > K : |\hat{\mu}_{i,j}(t) - \mu_j| < \sqrt{\frac{2\sigma^2}{N_{i,j}(t)} \log \frac{\pi^2 K^2 (N_{i,j}(t))^2}{3\delta}}\right\}. \qquad (8)$$

Lemma 2. For $\delta \in (0, 1)$, $\mathbb{P}(E_2) > 1 - \delta$.

Proof. Please refer to Appendix A.

Although the rewards $r_s$ in (1), used to calculate $\hat{\mu}_i(t)$, may be drawn from different reward distributions, we can build a high-probability bound on $\hat{\mu}_i(t)$ with the help of Lemma 2.

Lemma 3. Under event $E_2$, for every arm $i$ and all $t > K$, we have

$$\left| \hat{\mu}_i(t) - \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{I_s'} \right| < \beta(N_i(t)), \qquad (9)$$

where

$$\beta(N_i(t)) = \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}}. \qquad (10)$$

Proof. Please refer to Appendix B.

Under events $E_1$ and $E_2$, we can build a connection between $\hat{\mu}_i(t)$ and $\mu_{i_W}$. In the proposed LCB attack strategy, the attacker explores and exploits the worst arm by a lower confidence bound method. Thus, when the user pulls a non-target arm, the attacker changes it to the worst arm in most of the rounds, which means that for all $i \ne K$, $\hat{\mu}_i(t)$ converges to $\mu_{i_W}$ as $N_i(t)$ increases. Lemma 4 quantifies the relationship between $\hat{\mu}_i(t)$ and $\mu_{i_W}$.

Lemma 4. Under events $E_1$ and $E_2$, using the LCB attack strategy in Algorithm 1, we have

$$\hat{\mu}_i(t) \le \mu_{i_W} + \frac{1}{N_i(t)} \sum_{j \ne i_W} \frac{8\sigma^2}{\Delta_{j,i_W}} \log \frac{\pi^2 K t^2}{3\delta} \qquad (11)$$
$$\qquad + \sqrt{\frac{2\sigma^2 K}{N_i(t)} \log \frac{\pi^2 (N_i(t))^2}{3\delta}}, \quad \forall i, t. \qquad (12)$$

Proof. Please refer to Appendix C.

Lemma 4 gives an upper bound on the empirical mean reward of each pre-attack arm $i \ne K$. Our main result is the following upper bound on the attack cost $|\mathcal{C}|$.

Theorem 1. With probability at least $1 - 2\delta$, when $T \ge \left(\frac{\pi^2 K}{3\delta}\right)^{\frac{2}{5}}$, using the LCB attack strategy specified in Algorithm 1, the attacker can manipulate the user into pulling the target arm in at least $T - |\mathcal{C}|$ rounds, with an attack cost

$$|\mathcal{C}| \le \frac{K-1}{4\Delta_{K,i_W}^2} \left( \sqrt{3\sigma^2 \log T} + \sqrt{2\sigma^2 K \log \frac{\pi^2 T^2}{3\delta}} + \left[ \left( \sqrt{3\sigma^2 \log T} + \sqrt{2\sigma^2 K \log \frac{\pi^2 T^2}{3\delta}} \right)^2 + 4\Delta_{K,i_W} \sum_{j \ne i_W} \frac{8\sigma^2}{\Delta_{j,i_W}} \log \frac{\pi^2 K T^2}{3\delta} \right]^{\frac{1}{2}} \right)^2. \qquad (13)$$

Proof. Please refer to Appendix D.

The expression of the cost bound in Theorem 1 is complicated. The following corollary provides a simpler bound that is more explicit and interpretable.

Corollary 1. Under the same assumptions as in Theorem 1, the total attack cost $|\mathcal{C}|$ of Algorithm 1 is upper bounded by

$$O\left( \frac{K\sigma^2 \log T}{\Delta_{K,i_W}^2} \left( K + \sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}} + \sqrt{K \sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}}} \right) \right), \qquad (14)$$

and the total number of target arm pulls is $T - |\mathcal{C}|$.

From Corollary 1, we can see that the attack cost scales as $\log T$. Two important constants, $\frac{\sigma}{\Delta_{K,i_W}}$ and $\sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}}$, have an impact on the pre-log factor. In Section V, we provide numerical examples to illustrate the effects of these two constants on the attack cost.

C. Attacks fail when the target arm is the worst arm

One limitation of our LCB attack strategy is that the target arm must be a non-worst arm. With the LCB attack strategy, the attacker cannot force the user to pull the worst arm very frequently by spending only logarithmic cost. The main reason is that, when the target arm is the worst, the average reward of every arm is larger than or equal to that of the target arm. As a result, our attack scheme is not able to ensure that the target arm has a higher expected reward than the user's estimates of the rewards of the other arms. In fact, the following theorem shows that no action-manipulation attack can manipulate the UCB algorithm into pulling the worst arm more than $T - O(\log(T))$ times by spending only logarithmic cost.

Theorem 2. Let $\delta < \frac{1}{2}$. Suppose the attack cost is limited to $O(\log(T))$. Then there is no attack that can force the UCB algorithm to pick the worst arm more than $T - O(T^{\alpha})$ times with probability at least $1 - \delta$, in which $\alpha < \frac{9}{64K}$.

Proof. Please refer to Appendix E.

This theorem shows a contrast between the case where the target arm is not the worst arm and the case where it is. If the target arm is not the worst arm, our scheme is able to force the user to pick the target arm $T - O(\log(T))$ times with only logarithmic cost. On the other hand, if the target arm is the worst, Theorem 2 shows that there is no attack strategy that can force the user to pick the worst arm more than $T - O(T^{\alpha})$ times while incurring only logarithmic cost.

IV. ROBUST ALGORITHM AND REGRET ANALYSIS

The results in Section III expose a significant security threat of action-manipulation attacks on MABs. With only $O(\log(T))$ attacks carried out using the proposed LCB strategy, the UCB algorithm will almost always pull the target arm selected by the attacker. Although there are defense algorithms [24] and universal best arm identification schemes [29] for stochastic or adversarial bandits, they do not apply to the action-manipulation attack setting. This motivates us to design a new bandit algorithm that is robust against action-manipulation attacks. In this section, we propose such a robust bandit algorithm and analyze its regret.

A. Robust bandit algorithm

In this section, we assume that a valid upper bound $A$ on the cumulative attack cost $|\mathcal{C}|$ is known to the user, although the user does not have to know the exact cumulative attack cost $|\mathcal{C}|$. $A$ does not need to be constant; it can scale with $T$. In other words, for a given $A$, our proposed algorithm is robust to all action-manipulation attacks with a cumulative attack cost $|\mathcal{C}| < A$. This assumption is reasonable, as if the cost is unbounded, it is not possible to design a robust scheme.

We first introduce some notation. Denote $\mathbf{N}(t-1) := (N_1(t-1),\ldots,N_K(t-1))$ as the vector counting how many times each action has been taken by the user, and $\hat{\boldsymbol{\mu}}(t-1) = (\hat{\mu}_1(t-1),\ldots,\hat{\mu}_K(t-1))$ as the vector of sample means computed by the user. The proposed algorithm is a modified UCB method that takes the maximum possible mean-estimate offset due to attacks into consideration. We name our scheme maximum offset UCB (MOUCB).

The proposed MOUCB works as follows. In the first $2AK$ rounds, the MOUCB algorithm pulls each arm $2A$ times. After that, at round $t$, the user chooses an arm $I_t$ by a modified UCB rule:

$$I_t = \arg\max_a \left\{ \hat{\mu}_a(t-1) + \beta(N_a(t-1)) + \gamma(\hat{\boldsymbol{\mu}}(t-1), \mathbf{N}(t-1)) \right\}, \qquad (15)$$

where

$$\gamma(\hat{\boldsymbol{\mu}}(t-1), \mathbf{N}(t-1)) = \frac{2A}{N_a(t-1)} \max_{i,j} \left\{ \hat{\mu}_i(t-1) - \hat{\mu}_j(t-1) + \beta(N_i(t-1)) + \beta(N_j(t-1)) \right\},$$

and

$$\beta(N) = \sqrt{\frac{2\sigma^2 K}{N} \log \frac{\pi^2 N^2}{3\delta}}. \qquad (16)$$

The algorithm is summarized in Algorithm 2.

Algorithm 2: Proposed MOUCB bandit algorithm
Input: A valid upper bound $A$ for the cumulative attack cost.
1: for $t = 1, 2, \ldots$ do
2:   if $t \le 2AK$ then
3:     The user pulls the arm whose pull count is smallest, i.e., $I_t = \arg\min_i N_i(t-1)$.
4:   else
5:     The user chooses arm $I_t$ to pull according to (15).
6:   end if
7:   if the attacker decides to attack then
8:     The attacker attacks and changes $I_t$ to $I_t'$.
9:   else
10:     The attacker does not attack and $I_t' = I_t$.
11:   end if
12:   The environment generates reward $r_t$ from arm $I_t'$.
13:   The attacker and the user receive $r_t$.
14: end for

Compared with the original UCB algorithm in (3), the main difference is the additional term $\gamma(\hat{\boldsymbol{\mu}}(t-1), \mathbf{N}(t-1))$ in (15).
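For illustration, the MOUCB selection rule (15)-(16) can be sketched in Python as follows. The sketch uses the identity $\max_{i,j}\{\hat{\mu}_i - \hat{\mu}_j + \beta(N_i) + \beta(N_j)\} = \max_i\{\hat{\mu}_i + \beta(N_i)\} - \min_j\{\hat{\mu}_j - \beta(N_j)\}$ to compute the $\gamma$ term; the function names and calling conventions are our own assumptions, and this is a sketch rather than a reference implementation.

import numpy as np

def beta(N, K, sigma, delta):
    # beta(N) in (16).
    return np.sqrt(2.0 * sigma**2 * K / N * np.log(np.pi**2 * N**2 / (3.0 * delta)))

def moucb_arm(t, counts, means, A, sigma, delta):
    # MOUCB: 2A warm-up pulls of every arm, then the modified UCB rule (15).
    K = len(counts)
    if t < 2 * A * K:
        return int(np.argmin(counts))          # pull the least-pulled arm
    b = beta(counts, K, sigma, delta)
    # max_{i,j} { mu_hat_i - mu_hat_j + beta(N_i) + beta(N_j) }
    spread = np.max(means + b) - np.min(means - b)
    gamma = 2.0 * A / counts * spread          # per-arm maximum-offset term
    return int(np.argmax(means + b + gamma))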

We now highlight the main idea of why our bandit algorithm works and the role of this additional term. In the standard stochastic multi-armed bandit problem, $\hat{\mu}_i(t)$ is a proper estimate of $\mu_i$, the true mean reward of arm $i$. Under action-manipulation attacks, however, as the user does not know which arm was actually used to generate $r_t$, $\hat{\mu}_i(t)$ is no longer a proper estimate of the true mean reward. Nevertheless, we can try to find a good bound on the true mean reward. If we knew $\Delta_{i_O,i_W}$, the reward difference between the optimal arm and the worst arm, we could bound the maximum offset of the mean rewards caused by the attack. In particular, we have

$$\mu_i - \frac{A}{N_i(t)} \Delta_{i_O,i_W} \le \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{I_s'} \le \mu_i + \frac{A}{N_i(t)} \Delta_{i_O,i_W}, \qquad (17)$$

which implies

$$\mu_i \le \frac{A}{N_i(t)} \Delta_{i_O,i_W} + \frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{I_s'}. \qquad (18)$$

In (18), the first term on the right-hand side is the maximum offset that an attacker can introduce regardless of the attack strategy. The second term on the right-hand side is related to the mean estimated by the user. In particular, under event $E_2$, as shown in Lemma 3, we have

$$\frac{1}{N_i(t)} \sum_{s \in \tau_i(t)} \mu_{I_s'} < \hat{\mu}_i(t) + \beta(N_i(t)). \qquad (19)$$

Hence, regardless of the attack strategy, we have an upper confidence bound on $\mu_i$:

$$\mu_i \le \hat{\mu}_i(t) + \frac{A}{N_i(t)} \Delta_{i_O,i_W} + \beta(N_i(t)). \qquad (20)$$

In our case, however, $\Delta_{i_O,i_W}$ is also unknown. In our algorithm, we obtain a high-probability bound on $\Delta_{i_O,i_W}$:

$$\Delta_{i_O,i_W} \le 2 \max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\}, \qquad (21)$$

which will be proved in Lemma 5 below. The second term of (20) becomes $\gamma(\hat{\boldsymbol{\mu}}(t-1), \mathbf{N}(t-1))$ if we replace $\Delta_{i_O,i_W}$ with the bound (21), and we obtain our final algorithm.

B. Regret analysis

Lemma 5 gives a bound on $\Delta_{i_O,i_W}$, the maximum reward difference between any two arms, under event $E_2$.

Lemma 5. For $\delta \le \frac{1}{3}$, $t > 2AK$ and under event $E_2$, the MOUCB algorithm satisfies

$$\Delta_{i_O,i_W} \le 2 \max_{i,j} \left\{ \hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t)) \right\} \le 2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}}. \qquad (22)$$

Proof. Please refer to Appendix F.

Using Lemma 5, we now bound the regret of Algorithm 2.

Theorem 3. Let $A$ be an upper bound on the total attack cost $|\mathcal{C}|$. For $\delta \le \frac{1}{3}$ and $T \ge 2AK$, the MOUCB algorithm has pseudo-regret

$$\bar{R}(T) \le \sum_{a \ne i_O} \max\left\{ \frac{8\sigma^2 K}{\Delta_{i_O,a}} \log \frac{\pi^2 T^2}{3\delta}, \; A\left( \Delta_{i_O,a} + 2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A} \log \frac{4\pi^2 A^2}{3\delta}} \right) \right\}, \qquad (23)$$

with probability at least $1 - \delta$.

Proof. Please refer to Appendix G.

Theorem 3 shows that our bandit algorithm is robust to action-manipulation attacks. If the total attack cost is bounded by $O(\log T)$, the pseudo-regret of the MOUCB bandit algorithm is still bounded by $O(\log T)$. This is in contrast with UCB, for which we have shown in Section III that the pseudo-regret is $O(T)$ under an attack cost of $O(\log T)$. If the total attack cost grows up to $\Omega(\log T)$, the pseudo-regret of the MOUCB bandit algorithm is bounded by $O(A)$, which is linear in $A$.

V. NUMERICAL RESULTS

In this section, we provide numerical examples to illustrate the analytical results. In our simulations, the bandit has 10 arms. The reward distribution of arm $i$ is $\mathcal{N}(\mu_i, \sigma)$. The attacker's target arm is $K$. We let $\delta = 0.05$. We run the experiment for multiple trials, and in each trial we run $T = 10^7$ rounds.

A. LCB attack strategy

We first illustrate the impact of the proposed LCB attack strategy on the UCB algorithm.

Fig. 2. Number of rounds the target arm was pulled.

In Figure 2, we fix $\sigma = 0.1$ and $\Delta_{K,i_W} = 0.1$ and compare the number of rounds in which the target arm is pulled with and without attack. In this experiment, the mean rewards of the arms are 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1, and 0.2, respectively. Arm $K$ is not the worst arm, but its average reward is lower than those of most arms. The results are averaged over 20 trials. The attacker successfully manipulates the user into pulling the target arm very frequently.
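As a rough, scaled-down illustration of this comparison, the earlier sketches can be combined as below. The horizon is much shorter than the $T = 10^7$ used in the paper, the helper functions (ucb_arm, lcb_attack_arm) are the hypothetical ones sketched earlier, and the constants are illustrative only.

import numpy as np

# Illustrative setup loosely following the experiment above: 10 Gaussian arms,
# target arm K (index 9), sigma = 0.1, delta = 0.05, but a much shorter horizon.
mu = np.array([1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1, 0.2])
sigma, delta, T = 0.1, 0.05, 20000
K, target = len(mu), len(mu) - 1
rng = np.random.default_rng(0)

def run(attacked):
    counts = np.zeros(K); sums = np.zeros(K)        # user's statistics (pre-attack arms)
    p_counts = np.zeros(K); p_sums = np.zeros(K)    # attacker's statistics (post-attack arms)
    cost = 0
    for t in range(T):
        means = sums / np.maximum(counts, 1)
        I = ucb_arm(t, counts, means, sigma)        # user's UCB choice, from the earlier sketch
        if attacked:
            p_means = p_sums / np.maximum(p_counts, 1)
            J = lcb_attack_arm(t, I, target, p_counts, p_means, sigma, delta)
        else:
            J = I
        cost += int(J != I)
        r = rng.normal(mu[J], sigma)                 # reward comes from the post-attack arm
        counts[I] += 1; sums[I] += r                 # but the user attributes it to I
        p_counts[J] += 1; p_sums[J] += r
    return int(counts[target]), cost

print("target-arm pulls, attack cost (no attack): ", run(False))
print("target-arm pulls, attack cost (LCB attack):", run(True))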

Fig. 3. Attack cost vs $\frac{\sigma}{\Delta_{K,i_W}}$.

In Figure 3, in order to study how $\frac{\sigma}{\Delta_{K,i_W}}$ affects the attack cost, we fix $\Delta_{K,i_W} = 0.1$ and set $\sigma$ to 0.1, 0.3 and 0.5, respectively. The mean rewards of the arms are the same as above. From the figure, we can see that as $\frac{\sigma}{\Delta_{K,i_W}}$ increases, the attack cost increases. In addition, as predicted by our analysis, the attack cost increases with $T$, the total number of rounds, in a logarithmic order.

Fig. 4. Attack cost vs $\sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}}$.

Figure 4 illustrates how $\sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}}$ affects the attack cost. In this experiment, we fix $\frac{\sigma}{\Delta_{K,i_W}} = 1$ and set $\Delta_{K,i_W}$ to 0.2, 0.6 and 0.9, respectively. The mean rewards of the arms are the same as above. The figure illustrates that, as $\sum_{j \ne i_W} \frac{\Delta_{K,i_W}}{\Delta_{j,i_W}}$ increases, the attack cost also increases. This is consistent with our analysis in Corollary 1.

B. MOUCB bandit algorithm

We now illustrate the effectiveness of the MOUCB bandit algorithm. In this experiment, we use a setting similar to that of the LCB attack simulations. The mean rewards of the arms are set to 1.0, 0.8, 0.9, 0.5, 0.2, 0.3, 0.1, 0.4, 0.7, and 0.6, respectively. The total attack cost $|\mathcal{C}|$ is limited to 2000, and the valid upper bound given to the user is $A = 3000$. The results are averaged over 20 trials.

Fig. 5. Comparison of the number of rounds the optimal arm was pulled.

In Figure 5, we run the MOUCB algorithm under two different attacks and compare the numbers of rounds in which the optimal arm is pulled. The first attack is the LCB attack discussed in Section III. The second attack is the oracle attack, in which the attacker knows the true mean rewards of the arms and changes any non-target arm to the worst arm (see the discussion in Section II). For comparison purposes, we also add the curve for MOUCB under no attack and the curve for UCB under no attack. The results show that, even under the oracle attack, the proposed MOUCB bandit algorithm achieves almost the same performance as UCB without attack.

Fig. 6. Number of rounds the optimal arm was pulled using the UCB algorithm.

To further compare the performance of UCB and MOUCB, in Figure 6 we illustrate the performance of the UCB algorithm in the three scenarios discussed above: under the LCB attack, under the oracle attack, and under no attack. The results show that both the LCB and oracle attacks successfully manipulate the UCB algorithm into pulling a non-optimal arm very frequently, as the curves for the LCB attack and the oracle attack are far away from the curve for no attack. This is in sharp contrast with the situation for the MOUCB algorithm shown in Figure 5, where all the curves are almost identical.

Fig. 7. Pseudo-regret of the MOUCB algorithm.

Fig. 8. Pseudo-regret of the UCB algorithm.

Figures 7 and 8 illustrate the pseudo-regret of the MOUCB bandit algorithm and of the UCB bandit algorithm, respectively. In Figure 7, as predicted by our analysis, the MOUCB algorithm achieves logarithmic pseudo-regret under both the LCB attack and the oracle attack. Furthermore, the curves under both attacks are very close to that of the case without attacks. However, as shown in Figure 8, the pseudo-regret of UCB grows linearly under both attacks, while it grows logarithmically under no attack. The figures again show that UCB is vulnerable to action-manipulation attacks, while the proposed MOUCB is robust to these attacks (even to oracle attacks).

VI. CONCLUSION

In this paper, we have introduced a new class of attacks on stochastic bandits: action-manipulation attacks. We have analyzed the attack against the UCB algorithm and proved that the proposed LCB attack scheme can force the user to almost always pull a non-worst arm with only logarithmic effort. To defend against this type of attack, we have further designed a new bandit algorithm, MOUCB, that is robust to action-manipulation attacks. We have analyzed the regret of MOUCB under any attack with bounded cost, and have shown that the proposed algorithm is robust to action-manipulation attacks.

APPENDIX A
PROOF OF LEMMA 2

The proof is similar to that of Lemma 1, which was proved in [22]. Let $\{X_j\}_{j=1}^{\infty}$ be a sequence of i.i.d. $\sigma^2$-sub-Gaussian random variables with mean $\mu$, and let $\hat{\mu}'(t) = \frac{1}{N(t)} \sum_{j=1}^{N(t)} X_j$. By Hoeffding's inequality,

$$\mathbb{P}\left(|\hat{\mu}'(t) - \mu| \ge \eta\right) \le 2\exp\left(-\frac{N(t)\eta^2}{2\sigma^2}\right). \qquad (24)$$

In order to ensure that $E_2$ holds for all arms $i$, all arms $j$ and all pull counts $N = N_{i,j}(t)$, we set $\delta_{i,j,N} := \frac{6\delta}{\pi^2 K^2 N^2}$. We have

$$\mathbb{P}\left(\exists i, \exists j, \exists N : |\hat{\mu}_{i,j}(t) - \mu_j| \ge \sqrt{\frac{2\sigma^2}{N} \log \frac{\pi^2 K^2 N^2}{3\delta}}\right) \le \sum_{i=1}^{K}\sum_{j=1}^{K}\sum_{N=1}^{\infty} \delta_{i,j,N} = \delta. \qquad (25)$$
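The choice $\delta_{i,j,N} = \frac{6\delta}{\pi^2 K^2 N^2}$ makes the union bound in (25) sum exactly to $\delta$, since $\sum_{N \ge 1} 1/N^2 = \pi^2/6$. A quick numerical check, with arbitrary illustrative values of $K$ and $\delta$, is shown below.

import numpy as np

K, delta = 10, 0.05                      # arbitrary illustrative values
N = np.arange(1, 1_000_000)
total = K * K * np.sum(6 * delta / (np.pi**2 * K**2 * N**2))
print(total)                             # approaches delta = 0.05 as the truncation grows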

0.2 According to event E2, we have 0.1

Pseudo-Regret 0 1 X µˆ (t) − µ 0 i Is Ni(t) -0.1 s∈τi(t) 0 2 4 6 8 10 K T 106 X Ni,j(t) = (ˆµi,j(t) − µj) N (t) j=1 i Fig. 8. Pseudo-regret of UCB algorithm (26) K X Ni,j(t) Figure 7 and Figure 8 illustrate the pseudo-regret of ≤ |µˆi,j(t) − µj| Ni(t) MOUCB bandit algorithm and UCB bandit algorithm respec- j=1 tively. In Figure 7, as predicted in our analysis, MOUCB K r 2 2 2 1 X 2 π K (Ni,j(t)) < 2σ Ni,j(t) log . algorithm archives logarithmic pseudo-regrets under both LCB Ni(t) 3δ attacks and the oracle attacks. Furthermore, the curves under j=1 r both attacks are very close to that of the case without attacks. π2K2N 2 Define a function f(N) = 2σ2N log : → , However, as shown in Figure 8, the pseudo-regret of UCB 3δ R R grows linearly under both attacks, while grows logarithmically and we have r under no attack. The figures again show that UCB is vulnerable ∂2 π2K2N 2 to action-manipulation attacks while the proposed MOUCB is f 00(N) = 2σ2N log ∂N 2 3δ robust to the attacks (even for oracle attacks).  2 2 2 2 2σ2 log π K N + 16σ4 VI.CONCLUSION 3δ (27) = − 3 2 π2K2N 2  2 In this paper, we have introduced a new class of attacks 4 2σ N log 3δ on stochastic bandits: action-manipulation attacks. We have <0, analyzed the attack against on the UCB algorithm and proved that the proposed LCB attack scheme can force the user when N ≥ 1. to almost always pull a non-worst arm with only logarithm Hence f is strictly concave when N ≥ 1, and we have   effort. To defend against this type of attacks, we have further K K   X 1 X Ni(t) designed a new bandit algorithm MOUCB that is robust f(N (t)) < Kf N (t) = Kf . i,j K i,j  K to action-manipulation attacks. We have analyzed the regret j=1 j=1 of MOUCB under any attack with bounded cost, and have (28)

8 Thus, Hence, under event E2, we have s 2 2 2 1 X 1 X 2σ K π (Ni(t)) µˆ (t) < µ 0 + log µˆi(t) − µI0 i Is s Ni(t) Ni(t) 3δ Ni(t) s∈τ (t) s∈τi(t) i v s u  2 1 2σ2K π2(N (t))2 2 2 Ni(t) X X i u π K = µ 0 + log K (29) Is 1 t 2 Ni(t) N (t) N (t) 3δ < K 2σ log i j i N (t) K 3δ s∈τi,j (t) i s s 2 2 2 2 2 2 1 X 2σ K π (Ni(t)) 2σ K π (Ni(t)) = Ni,j(t)µj + log = log . Ni(t) Ni(t) 3δ Ni(t) 3δ j s 2 2 2 X Ni,j(t) 2σ K π (Ni(t)) APPENDIX C = (∆ + µ ) + log N (t) j,iW iW N (t) 3δ PROOFOF LEMMA 4 j i i s 2σ2K π2(N (t))2 <µ + log i iW N (t) 3δ The LCB attack scheme uses lower confidence bound to i 1 X 8σ2 π2Kt2 exploit the worst arm, so we need to prove that the attacker’s + log . pull counts of all non-worst arms should be limited at round Ni(t) ∆j,iW 3δ j6=iW t. (36) Consider the case that in round t + 1, the user chooses a The lemma is proved. non-target arm It+1 = i 6= K and the attacker changes it to a 0 non-worst arm It+1 = j 6= iW . On one hand, under event E1, we have µˆ0 (t) − µ < (N 0 (t)), APPENDIX D iW iW CB iW 0 0 (30) PROOFOF THEOREM 1 and µˆj (t) − µj > −CB(Nj (t)). On the other hand, according to the attack scheme, it must be the case that By inferring from Lemma 1, we have that with probability δ 0 0 1 − K , ∀t > K : |µˆK (t) − µK | < CB(NK (t)). µˆ0 (t) − CB(N 0 (t)) > µˆ0(t) − CB(N 0(t)), (31) iW iW j j Because the LCB attack scheme does not attack the target arm, we can also conclude that with probability 1 − δ , ∀t > which is equivalent to K K : |µˆK (t) − µK | < CB(NK (t)). (N 0(t)) > µˆ0(t) − (ˆµ0 (t) − (N 0 (t))). CB j j iW CB iW (32) The user relies on the UCB algorithm to choose arms. If at round t, the user chooses an arm I = i 6= K, which is not Combining (32) with (30), we have t the target arm, we have 0 0 CB(Nj (t)) > µj − CB(Nj (t)) − µiW s s ∆ (33) log t log t and CB(N 0(t)) > j,iW . µˆi(t − 1) + 3σ > µˆK (t − 1) + 3σ , j 2 Ni(t − 1) NK (t − 1) 0 0 (37) Using on the fact that Nj (t) ≤ t and Ni,j(t) ≤ Nj (t), we have which is equivalent to s s ∆j,iW 0 −µˆi(t − 1) +µ ˆK (t − 1) + 3σ . s Ni(t − 1) NK (t − 1) 2 π2K(N 0(t))2 2σ j (38) = 0 log Nj (t) 3δ s Under event E1, we have 2σ2 π2Kt2 (34) ≤ 0 log µˆK (t) > µK − CB(NK (t)). (39) Nj (t) 3δ s 2σ2 π2Kt2 Under event E1 ∩ E2, according to Lemma 4, we have ≤ log , s N (t) 3δ 2 2 2 i,j 2σ K π (Ni(t)) µˆi(t) ≤ µiW + log which is equivalent to Ni(t) 3δ (40) 2 2 2 1 X 8σ2 π2Kt2 8σ π Kt + log . Ni,j(t) < 2 log . (35) ∆ 3δ Ni(t) ∆j,iW 3δ j,iW j6=iW

9 Combing the inequalities above, APPENDIX E s PROOFOF THEOREM 2 s 2 2 2 log t 2σ K π (Ni(t − 1)) 3σ > −µiW − log Ni(t − 1) Ni(t − 1) 3δ 1 X 8σ2 π2K(t − 1)2 − log Because the target arm is the worst arm, the mean rewards Ni(t − 1) ∆j,iW 3δ of all arms are larger than or equal to that of the target arm. j6=iW s Thus, for any attack scheme, we have log t + µK − CB(NK (t − 1)) + 3σ . 1 X N (t − 1) µ 0 ≥ µ . (45) K Is K Ni(t) (41) s∈τi(t)

Combining the inequalities above, we have

$$3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > -\mu_{i_W} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)}\log\frac{\pi^2 (N_i(t-1))^2}{3\delta}} - \frac{1}{N_i(t-1)}\sum_{j\ne i_W}\frac{8\sigma^2}{\Delta_{j,i_W}}\log\frac{\pi^2 K (t-1)^2}{3\delta} + \mu_K - \mathrm{CB}(N_K(t-1)) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}. \qquad (41)$$

When $t \ge \left(\frac{\pi^2 K}{3\delta}\right)^{\frac{2}{5}}$, we have $5\sigma^2\log t \ge 2\sigma^2\log\frac{\pi^2 K}{3\delta}$, and hence

$$3\sigma\sqrt{\frac{\log t}{N_K(t-1)}} \ge \sqrt{\frac{2\sigma^2}{N_K(t-1)}\log\frac{\pi^2 K t^2}{3\delta}} \ge \sqrt{\frac{2\sigma^2}{N_K(t-1)}\log\frac{\pi^2 K (N_K(t-1))^2}{3\delta}} = \mathrm{CB}(N_K(t-1)). \qquad (42)$$

Now the inequality depends only on $N_i(t-1)$ and some constants:

$$3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} > \Delta_{K,i_W} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)}\log\frac{\pi^2 (N_i(t-1))^2}{3\delta}} - \frac{1}{N_i(t-1)}\sum_{j\ne i_W}\frac{8\sigma^2}{\Delta_{j,i_W}}\log\frac{\pi^2 K (t-1)^2}{3\delta} > \Delta_{K,i_W} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)}\log\frac{\pi^2 t^2}{3\delta}} - \frac{1}{N_i(t-1)}\sum_{j\ne i_W}\frac{8\sigma^2}{\Delta_{j,i_W}}\log\frac{\pi^2 K t^2}{3\delta}. \qquad (43)$$

By solving the inequality above, we have

$$N_i(t-1) < \frac{1}{4\Delta_{K,i_W}^2}\left(\sqrt{3\sigma^2\log t} + \sqrt{2\sigma^2 K\log\frac{\pi^2 t^2}{3\delta}} + \left[\left(\sqrt{3\sigma^2\log t} + \sqrt{2\sigma^2 K\log\frac{\pi^2 t^2}{3\delta}}\right)^2 + 4\Delta_{K,i_W}\sum_{j\ne i_W}\frac{8\sigma^2}{\Delta_{j,i_W}}\log\frac{\pi^2 K t^2}{3\delta}\right]^{\frac{1}{2}}\right)^2. \qquad (44)$$

Since event $E_1 \cap E_2$ occurs with probability at least $1 - 2\delta$, (44) holds with probability at least $1 - 2\delta$. Theorem 1 follows immediately from the definition of the attack cost and (44).

APPENDIX E
PROOF OF THEOREM 2

Because the target arm is the worst arm, the mean rewards of all arms are larger than or equal to that of the target arm. Thus, for any attack scheme, we have

$$\frac{1}{N_i(t)}\sum_{s\in\tau_i(t)}\mu_{I_s'} \ge \mu_K. \qquad (45)$$

If the user pulls arm $K$ at round $t$, according to the UCB algorithm, we have for any arm $i \ne K$,

$$\hat{\mu}_i(t-1) + 3\sigma\sqrt{\frac{\log t}{N_i(t-1)}} < \hat{\mu}_K(t-1) + 3\sigma\sqrt{\frac{\log t}{N_K(t-1)}}. \qquad (46)$$

Under event $E_2$, Lemma 3 and (29) hold for every arm $i$, which implies

$$\hat{\mu}_i(t-1) > \frac{1}{N_i(t-1)}\sum_{s\in\tau_i(t-1)}\mu_{I_s'} - \sqrt{\frac{2\sigma^2 K}{N_i(t-1)}\log\frac{\pi^2 (N_i(t-1))^2}{3\delta}}, \qquad (47)$$

and

$$\hat{\mu}_K(t-1) < \frac{1}{N_K(t-1)}\sum_{s\in\tau_K(t-1)}\mu_{I_s'} + \sqrt{\frac{2\sigma^2 K}{N_K(t-1)}\log\frac{\pi^2 (N_K(t-1))^2}{3\delta}}. \qquad (48)$$

Note that for $\delta \le \frac{1}{2}$, $h(N) = \sqrt{\frac{2\sigma^2 K}{N}\log\frac{\pi^2 N^2}{3\delta}}$ is monotonically decreasing in $N \ge 1$. If $N_i(t-1) < \frac{1}{16}N_K(t-1)$ and $N_i(t-1) < \frac{\sqrt{3\delta}}{\pi}t^{\frac{9}{64K}}$ hold for an arm $i$, we have

$$3\sigma\sqrt{\frac{\log t}{N_K(t-1)}} < \frac{3}{4}\sigma\sqrt{\frac{\log t}{N_i(t-1)}}, \qquad (49)$$

and

$$\sqrt{\frac{2\sigma^2 K}{N_K(t-1)}\log\frac{\pi^2 (N_K(t-1))^2}{3\delta}} < \sqrt{\frac{2\sigma^2 K}{N_i(t-1)}\log\frac{\pi^2 (N_i(t-1))^2}{3\delta}} < \frac{3}{4}\sigma\sqrt{\frac{\log t}{N_i(t-1)}}. \qquad (50)$$

Combining the inequalities above, we find

$$\frac{3}{4}\sigma\sqrt{\frac{\log t}{N_i(t-1)}} < \frac{1}{N_K(t-1)}\sum_{s\in\tau_K(t-1)}\mu_{I_s'} - \mu_K. \qquad (51)$$

Since the attack cost is limited to $O(\log t)$,

$$\frac{1}{N_K(t-1)}\sum_{s\in\tau_K(t-1)}\mu_{I_s'} - \mu_K = \frac{O(\log t)}{N_K(t-1)}, \qquad (52)$$

so

$$N_i(t-1) = \Omega\left(\sigma(N_K(t-1))^2\right). \qquad (53)$$

In summary, as long as event $E_2$ holds, at least one of the following three conditions must be true:

$$N_i(t-1) = \Omega\left(\sigma(N_K(t-1))^2\right), \quad N_i(t-1) \ge \frac{1}{16}N_K(t-1), \quad N_i(t-1) \ge \frac{\sqrt{3\delta}}{\pi}t^{\frac{9}{64K}}. \qquad (54)$$

In addition, any one of the three conditions shows that the user pulls the non-target arms more than $O(t^{\alpha})$ times, in which $\alpha < \frac{9}{64K}$. Since event $E_2$ holds with probability at least $1-\delta$, the conclusion of the theorem holds with probability at least $1-\delta$.

APPENDIX F
PROOF OF LEMMA 5

Note that for $\delta \le \frac{1}{3}$, $\beta(N) = \sqrt{\frac{2\sigma^2 K}{N}\log\frac{\pi^2 N^2}{3\delta}}$ is monotonically decreasing in $N$, as

$$\frac{\partial}{\partial N}\beta^2(N) = \frac{2\sigma^2 K}{N^2}\left(2 - \log\frac{\pi^2 N^2}{3\delta}\right) \le \frac{2\sigma^2 K}{N^2}\left(2 - \log\frac{\pi^2}{3\delta}\right) < 0. \qquad (55)$$

We first prove the first inequality in Lemma 5. Consider the optimal arm $i_O$ and the worst arm $i_W$. Define $C_i := |\{t : t \le T, I_t' \ne I_t = i\}|$. In the action-manipulation setting, when $t > 2AK$, the MOUCB algorithm satisfies

$$\frac{1}{N_{i_O}(t)}\sum_{s\in\tau_{i_O}(t)}\mu_{I_s'} \ge \frac{N_{i_O}(t)-C_{i_O}}{N_{i_O}(t)}\mu_{i_O} + \frac{C_{i_O}}{N_{i_O}(t)}\mu_{i_W} = \mu_{i_O} - \frac{C_{i_O}}{N_{i_O}(t)}\Delta_{i_O,i_W} \ge \mu_{i_O} - \frac{C_{i_O}}{2A}\Delta_{i_O,i_W}, \qquad (56)$$

and

$$\frac{1}{N_{i_W}(t)}\sum_{s\in\tau_{i_W}(t)}\mu_{I_s'} \le \frac{N_{i_W}(t)-C_{i_W}}{N_{i_W}(t)}\mu_{i_W} + \frac{C_{i_W}}{N_{i_W}(t)}\mu_{i_O} = \mu_{i_W} + \frac{C_{i_W}}{N_{i_W}(t)}\Delta_{i_O,i_W} \le \mu_{i_W} + \frac{C_{i_W}}{2A}\Delta_{i_O,i_W}. \qquad (57)$$

Combining (56) and (57), we have

$$\frac{1}{N_{i_O}(t)}\sum_{s\in\tau_{i_O}(t)}\mu_{I_s'} - \frac{1}{N_{i_W}(t)}\sum_{s\in\tau_{i_W}(t)}\mu_{I_s'} \ge \mu_{i_O} - \mu_{i_W} - \Delta_{i_O,i_W}\frac{C_{i_W}}{2A} - \Delta_{i_O,i_W}\frac{C_{i_O}}{2A} \ge \mu_{i_O} - \mu_{i_W} - \Delta_{i_O,i_W}\frac{A}{2A} = \frac{\Delta_{i_O,i_W}}{2}. \qquad (58)$$

From (29), we find

$$\frac{1}{N_{i_O}(t)}\sum_{s\in\tau_{i_O}(t)}\mu_{I_s'} - \frac{1}{N_{i_W}(t)}\sum_{s\in\tau_{i_W}(t)}\mu_{I_s'} \le \hat{\mu}_{i_O}(t) + \beta(N_{i_O}(t)) - \left(\hat{\mu}_{i_W}(t) - \beta(N_{i_W}(t))\right) \le \max_{i,j}\left\{\hat{\mu}_i(t) + \beta(N_i(t)) - \left(\hat{\mu}_j(t) - \beta(N_j(t))\right)\right\}. \qquad (59)$$

We now prove the second inequality in Lemma 5:

$$\max_{i,j}\left\{\hat{\mu}_i(t) + \beta(N_i(t)) - \left(\hat{\mu}_j(t) - \beta(N_j(t))\right)\right\} \le \max_{i,j}\left\{\frac{1}{N_i(t)}\sum_{s\in\tau_i(t)}\mu_{I_s'} + 2\beta(N_i(t)) - \left(\frac{1}{N_j(t)}\sum_{s\in\tau_j(t)}\mu_{I_s'} - 2\beta(N_j(t))\right)\right\} \le \Delta_{i_O,i_W} + \max_{i,j}\left\{2\beta(N_i(t)) + 2\beta(N_j(t))\right\}. \qquad (60)$$

Recall that for $\delta \le \frac{1}{3}$, $\beta(N) = \sqrt{\frac{2\sigma^2 K}{N}\log\frac{\pi^2 N^2}{3\delta}}$ is monotonically decreasing in $N$. Since every arm has been pulled at least $2A$ times, we therefore have

$$\max_{i,j}\left\{2\beta(N_i(t)) + 2\beta(N_j(t))\right\} \le 4\beta(2A). \qquad (61)$$

APPENDIX G
PROOF OF THEOREM 3

The MOUCB algorithm first pulls each arm $2A$ times. Then, for $t > 2AK$ and under event $E_2$, if at round $t+1$ the MOUCB algorithm chooses a non-optimal arm $I_{t+1} = a \ne i_O$, we have

$$\hat{\mu}_a + \beta(N_a(t)) + \frac{2A}{N_a(t)}\max_{i,j}\left\{\hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t))\right\} \ge \hat{\mu}_{i_O} + \beta(N_{i_O}(t)) + \frac{2A}{N_{i_O}(t)}\max_{i,j}\left\{\hat{\mu}_i - \hat{\mu}_j + \beta(N_i(t)) + \beta(N_j(t))\right\},$$

which, according to Lemma 5, implies

$$\hat{\mu}_a + \frac{A}{N_a(t)}\left(2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A}\log\frac{4\pi^2 A^2}{3\delta}}\right) + \beta(N_a(t)) \ge \hat{\mu}_{i_O} + \frac{A}{N_{i_O}(t)}\Delta_{i_O,i_W} + \beta(N_{i_O}(t)).$$

From equation (29), we find

$$\hat{\mu}_a \le \frac{1}{N_a(t)}\sum_{s\in\tau_a(t)}\mu_{I_s'} + \beta(N_a(t)) \le \mu_a + \frac{C_a}{N_a(t)}\Delta_{i_O,a} + \beta(N_a(t)) \le \mu_a + \frac{A}{N_a(t)}\Delta_{i_O,a} + \beta(N_a(t)),$$

and

$$\hat{\mu}_{i_O} \ge \frac{1}{N_{i_O}(t)}\sum_{s\in\tau_{i_O}(t)}\mu_{I_s'} - \beta(N_{i_O}(t)) \ge \mu_{i_O} - \frac{C_{i_O}}{N_{i_O}(t)}\Delta_{i_O,i_W} - \beta(N_{i_O}(t)) \ge \mu_{i_O} - \frac{A}{N_{i_O}(t)}\Delta_{i_O,i_W} - \beta(N_{i_O}(t)).$$

By combining the inequalities above, we have

$$\mu_{i_O} \le \mu_a + \frac{A}{N_a(t)}\Delta_{i_O,a} + 2\beta(N_a(t)) + \frac{A}{N_a(t)}\left(2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A}\log\frac{4\pi^2 A^2}{3\delta}}\right),$$

which is equivalent to

$$\Delta_{i_O,a} \le \frac{A}{N_a(t)}\Delta_{i_O,a} + 2\sqrt{\frac{2\sigma^2 K}{N_a(t)}\log\frac{\pi^2 (N_a(t))^2}{3\delta}} + \frac{A}{N_a(t)}\left(2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A}\log\frac{4\pi^2 A^2}{3\delta}}\right) \le 2\sqrt{\frac{2\sigma^2 K}{N_a(t)}\log\frac{\pi^2 t^2}{3\delta}} + \frac{A}{N_a(t)}\left(\Delta_{i_O,a} + 2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A}\log\frac{4\pi^2 A^2}{3\delta}}\right).$$

Therefore,

$$N_a(t) \le \max\left\{\frac{8\sigma^2 K}{\Delta_{i_O,a}^2}\log\frac{\pi^2 t^2}{3\delta},\; \frac{A}{\Delta_{i_O,a}}\left(\Delta_{i_O,a} + 2\Delta_{i_O,i_W} + 8\sqrt{\frac{\sigma^2 K}{A}\log\frac{4\pi^2 A^2}{3\delta}}\right)\right\}. \qquad (62)$$

Since event $E_2$ holds with probability at least $1-\delta$, (62) holds with probability at least $1-\delta$. Theorem 3 then follows immediately from the definition of the pseudo-regret in (2) and equation (62).

REFERENCES

[1] G. Liu and L. Lai, "Action-manipulation attacks on stochastic bandits," in Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, May 2020.
[2] I. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[3] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, "Adversarial attacks on neural network policies," arXiv preprint arXiv:1702.02284, 2017.
[4] Y. Lin, Z. Hong, Y. Liao, M. Shih, M. Liu, and M. Sun, "Tactics of adversarial attack on deep reinforcement learning agents," arXiv preprint arXiv:1703.06748, 2017.
[5] S. Mei and X. Zhu, "Using machine teaching to identify optimal training-set attacks on machine learners," in Proc. of AAAI Conference on Artificial Intelligence, Austin, TX, Jan. 2015, pp. 2871–2877.
[6] B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," arXiv preprint arXiv:1206.6389, 2012.
[7] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli, "Is feature selection secure against training data poisoning?," in Proc. of International Conference on Machine Learning, Lille, France, July 2015, vol. 37 of Proceedings of Machine Learning Research, pp. 1689–1698.
[8] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik, "Data poisoning attacks on factorization-based collaborative filtering," in Advances in Neural Information Processing Systems, 2016, pp. 1885–1893.
[9] S. Alfeld, X. Zhu, and P. Barford, "Data poisoning attacks against autoregressive models," in Proc. of AAAI Conference on Artificial Intelligence, Phoenix, AZ, Feb. 2016, pp. 1452–1458.
[10] H. S. Chang, J. Hu, M. C. Fu, and S. I. Marcus, "Adaptive adversarial multi-armed bandit approach to two-person zero-sum Markov games," IEEE Transactions on Automatic Control, vol. 55, no. 2, pp. 463–468, Feb. 2010.
[11] C. Tekin and M. van der Schaar, "Distributed online learning via cooperative contextual bandits," IEEE Transactions on Signal Processing, vol. 63, no. 14, pp. 3700–3714, July 2015.
[12] N. M. Vural, H. Gokcesu, K. Gokcesu, and S. S. Kozat, "Minimax optimal algorithms for adversarial bandit problem with multiple plays," IEEE Transactions on Signal Processing, vol. 67, no. 16, pp. 4383–4398, Aug. 2019.
[13] K. Liu, Q. Zhao, and B. Krishnamachari, "Dynamic multichannel access with imperfect channel state detection," IEEE Transactions on Signal Processing, vol. 58, no. 5, pp. 2795–2808, May 2010.
[14] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, Nov. 2010.
[15] S. Shahrampour, M. Noshad, and V. Tarokh, "On sequential elimination algorithms for best-arm identification in multi-armed bandits," IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4281–4292, Aug. 2017.
[16] C. Tekin and M. Liu, "Online learning of rested and restless bandits," IEEE Transactions on Information Theory, vol. 58, no. 8, pp. 5588–5611, Aug. 2012.
[17] O. Chapelle, E. Manavoglu, and R. Rosales, "Simple and scalable response prediction for display advertising," ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, pp. 61:1–61:34, Dec. 2014.
[18] L. Li, W. Chu, J. Langford, and R. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of International Conference on World Wide Web, New York, NY, Apr. 2010, pp. 661–670.
[19] L. Lai, H. El Gamal, H. Jiang, and H. V. Poor, "Cognitive medium access: Exploration, exploitation and competition," IEEE Transactions on Mobile Computing, vol. 10, no. 2, pp. 239–253, Feb. 2011.
[20] M. Bande and V. V. Veeravalli, "Adversarial multi-user bandits for uncoordinated spectrum access," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, United Kingdom, May 2019, pp. 4514–4518.
[21] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan, "Cascading bandits: Learning to rank in the cascade model," in Proc. of International Conference on Machine Learning, Lille, France, July 2015, vol. 37 of Proceedings of Machine Learning Research, pp. 767–776.
[22] K. Jun, L. Li, Y. Ma, and X. Zhu, "Adversarial attacks on stochastic bandits," in Proc. of International Conference on Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 3644–3653.
[23] F. Liu and N. Shroff, "Data poisoning attacks on stochastic bandits," in Proc. of International Conference on Machine Learning, Long Beach, CA, June 2019, vol. 97, pp. 4042–4050.
[24] T. Lykouris, V. Mirrokni, and R. Paes Leme, "Stochastic bandits robust to adversarial corruptions," in Proc. of Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, June 2018, pp. 114–122.

[25] Z. Guan, K. Ji, D. J. Bucci Jr., T. Y. Hu, J. Palombo, M. Liston, and Y. Liang, "Robust stochastic bandit algorithms under probabilistic unbounded adversarial attack," in Proc. AAAI Conference on Artificial Intelligence, New York City, NY, Feb. 2020.
[26] Z. Feng, D. Parkes, and H. Xu, "The intrinsic robustness of stochastic bandits to strategic manipulation," CoRR, vol. abs/1906.01528, 2019.
[27] Y. Ma, K. Jun, L. Li, and X. Zhu, "Data poisoning attacks in contextual bandits," CoRR, vol. abs/1808.05760, 2018.
[28] S. Bubeck and N. Cesa-Bianchi, "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
[29] C. Shen, "Universal best arm identification," IEEE Transactions on Signal Processing, vol. 67, no. 17, pp. 4464–4478, Sep. 2019.
