Repeated Dollar Auctions: A Multi-Armed Bandit Approach

Marcin Waniek, University of Warsaw, [email protected]
Long Tran-Thanh, University of Southampton, [email protected]
Tomasz Michalak, University of Oxford & University of Warsaw, [email protected]

ABSTRACT

We investigate the repeated version of Shubik's [34] dollar auction, in which the type of the opponent and her level of rationality are not known in advance. We formulate the problem as an adversarial multi-armed bandit, and we show that a modified version of the ELP algorithm [25], tailored to our setting, can achieve Õ(√(|S_0|T)) performance loss (compared to the best fixed strategy), where |S_0| is the cardinality of the set of available strategies and T is the number of (sequential) auctions. We also show that under some further conditions, these bounds can be improved to Õ(|S_0|^{1/4}·√T) and Õ(√T), respectively. Finally, we consider the case of spiteful players. We prove that when a non-spiteful player bids against a malicious one, the game converges in performance to a Nash equilibrium if both players apply our strategy to place their bids.

CCS Concepts

• Theory of computation → Online learning algorithms; Computational pricing and auctions;

Keywords

dollar auction; online learning; multi-armed bandits

1. INTRODUCTION

Conflicts are an integral part of human culture [8]. Whether it is a competition of rival technological companies [38], a battle of lobbyists trying to acquire a government contract [15] or the arms race of modern empires [31], conflict analysis has always been of interest to game theorists and economists alike [18]. The analysis of conflict is also of natural interest in AI and, especially, in the Multi-Agent Systems (MAS) literature [27, 37, 11]. Typically, conflicts in MAS used to be considered as failures or synchronisation problems [40, 36]. More recently, however, the need for more advanced studies has been recognized [10], including analyses of conflict generation, escalation, or detection in MAS.

A very simple dollar auction introduced by Shubik [34] provides a powerful framework to formalise and study conflict situations. In this auction two participants compete for a dollar by making bids just as in the regular English auction. However, both the winner and the loser have to pay their final bids. In other words, this is an all-pay auction, which means that, once a player decides to participate and make her first bid, any invested resources are irretrievably lost. Consequently, the only way to gain any profit (or minimize the loss) is to win the dollar. However, since both players are in the same situation, the dollar auction often ends up with an irrational conflict escalation [23].

Interestingly, however, in a celebrated work, O'Neill [31] proved that the first player has a winning strategy in the dollar auction assuming pure strategies. In particular, if she makes a bid that is a specific combination of the auction's stake, the budget of both players and the minimal bid increment, her opponent will never win the auction and should therefore pass. For instance, if the price increment is $0.05 and both players have budgets of $2.50 each, then the first player that has a chance to bid should offer precisely $0.60 and the other player should drop out. As a result, escalation should never occur. O'Neill's result thus suggests that if escalation does occur in real-life experiments, it has to be related to human bounded rationality or other factors not accounted for by the auction model.

However, O'Neill's result was re-examined by Leininger [23], who showed that the advantage of the first bidder vanishes when mixed-strategy equilibria are considered, and the auction escalates. Demange [13] proved the same for settings where players are uncertain about each other's strength. In addition, Waniek et al. [39] showed that when used against a spiteful or malicious player, the optimal strategy proposed by O'Neill can lead to severe loss.

One of the key deficiencies of all of the above results is that they concern the one-shot auction case, where the goal is to investigate what would be an optimal bidding strategy in a single auction. However, many real-world applications in MAS consist of a sequence of auctions repeatedly played against the same opponent [7, 20, 17]. In fact, many examples of R&D competition, arms races, lobbying, or interactions in MAS can be seen not as a single instance of an auction, but rather as a series of confrontations. We will show later in this paper that the nature of repeated encounters can be leveraged in order to achieve good performance.

Typically in such repeated settings, there is little (if any) prior knowledge about the opponent's incentives, nor about her rationality level. Second, while it is typical to assume some (perfect or bounded) rationality model of the players, it might be the case that the chosen bids do not follow such assumptions (e.g., due to lack of computational power, or lack of perfect execution). As such, efficient bidding strategies need to consider the uncertainty in both the incentives and the rationality of the opponent.

In this paper, we reconsider Shubik's dollar auction, but now in the repeated setting. We ask whether it is possible to design efficient bidding strategies, given the high level of uncertainty and no prior knowledge of the opponent. Put differently, we investigate whether a participant's knowledge about the opponent (gained in the previous encounters) can give the participant an edge over her opponent so that escalation does not occur. To this end, we view the problem from the perspective of online learning theory, where the auctions are played against an adversary.

More specifically, we formulate the problem using an adversarial multi-armed bandit model [5, 25], where at each round we choose from a (finite) set of possible strategies S_0 to play against the opponent. At the end of each round, we observe the outcome of the auction and update our estimation of the strategy's fitness (i.e., how much expected profit the strategy can achieve). Our model also assumes that there is some level of interdependence between the different strategies. That is, by observing an outcome of a specific strategy, we can get some side information about the fitness of other strategies. For example, if a particular strategy wins the auction with a (final) bid b, then any other strategy that shares the same behaviour as the current one up to b (i.e., the latter is a prefix of the former) would also win the auction. As such, we can also update our fitness estimates of the interdependent strategies.

Given this, we prove that by applying ELP (i.e., "Exponentially weighted algorithm with Linear Programming"), a state-of-the-art online learning algorithm proposed by [25], we can achieve Õ(√(|S_0|T))¹ performance loss against the best fixed bidding strategy (i.e., playing the same strategy each round) in hindsight. In particular, since the regret is sub-linear due to the square root in Õ(√(|S_0|T)), the average regret per time step converges to 0 as T tends to infinity. Thus, the behaviour of ELP converges to that of the best fixed strategy in hindsight. While this result is a direct consequence of the performance analysis of ELP from [25], our main contributions are to improve this result in more concrete cases where we have some additional conditions. Specifically:

• By allowing some additional side-information structure (about the other strategies) that we obtain when observing the outcome of the chosen strategy, we prove that the performance loss can be bounded with Õ(|S_0|^{1/4}·√T), which is an improvement of |S_0|^{1/4} (note that this improvement is significant, especially when |S_0| is large);

• When S_0 is the class of non-spiteful optimal bidding strategies with different budget limits, and the opponent is deterministic (i.e., she chooses the next bids in a deterministic manner), the performance loss of the ELP-based bidding strategy is at most Õ(√T).

Finally, we show that when the auction is played between a non-spiteful and a malicious player, if both use ELP, they will converge to a mixed Nash equilibrium in performance.

The remainder of the paper is organised as follows. We start by introducing the basic definitions and notation, and then outline our multi-armed bandit based bidding model. We then describe the ELP algorithm, and analyse its performance in different settings. The case of a non-spiteful against a malicious player is discussed afterwards, followed by the conclusions. The Appendix provides a notation table for the reader's convenience.

¹The logarithmic terms are hidden in the Õ notation.

2. PRELIMINARIES

In this section, we formally describe the two building blocks of our problem, namely the dollar auction and the multi-armed bandit model.

2.1 The Dollar Auction

The dollar auction [34] is an all-pay auction between two players N = {0, 1}. We will often denote them by i and j. The winner of an auction is given the stake s ∈ ℕ. The players can make bids that are multiples of a minimal bid increment δ. Without loss of generality, we assume that δ = 1.

While Shubik [34] considered the dollar auction without any budget limit, following O'Neill [31], most subsequent studies considered the setting in which players have limited budgets b_0, b_1 ∈ ℕ. In this paper, we assume that both players have equal budgets, i.e., b_0 = b_1 = b.

The players move in turns and the starting player is determined randomly at the beginning of the auction. Moreover, the starting player i can either make the first bid, pass and leave the auction, or let j move first. Once the first bid has been made, any player making the move can either make a bid higher than the opponent's, or pass. In other words, offering the turn to the opponent is only permitted before the first bid is made. We denote by X_b the set {(x, y) ∈ {0, . . . , b − 1} × {0, . . . , b − 1} : y > x} ∪ {(0, 0)}, where x is the last bid of the player currently choosing her bid and y is the last bid of her opponent. Whenever it is a player's turn to make a bid, the dollar auction is in one of the states in X_b.

We assume that all the strategies in our setting are deterministic. Let us now consider S, the set of all possible strategies in the dollar auction. We have S ⊂ X_b → {0, . . . , b}, where for any strategy f ∈ S, the value f(x, y) represents the bid to be made by the player whose last bid was x and whose opponent's last bid was y. For valid strategies we have f(x, y) ≥ y, where f(0, 0) = 0 represents the decision to let the opponent move first if the player starts (as it is never rational for any player to pass if her opponent has not made any bids yet), and f(x, y) = y represents the decision to pass in all other cases. The set of strategies S contains all valid strategies, i.e., strategies that meet the conditions described above.

An auction ends when one of the players either passes or makes a bid that her opponent cannot top (in a setting with equal budgets this means bidding the entire budget). The stake is then given to the higher bidder. However, since this is an all-pay auction, both players have to pay their final bids. The profit of player i from an auction that ends with bids y_i and y_j is therefore:

  p_i = s − y_i  if y_i > y_j,
  p_i = −y_i   if y_i < y_j.
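To make the auction dynamics above concrete, the following is a minimal Python sketch (our own illustration, not from the paper) that plays a single dollar auction between two deterministic strategies given as functions of the state (x, y), using the stopping rule and the all-pay profit p_i defined above. The function name and the fixed choice of player A as the opening bidder are simplifying assumptions (the paper randomises the starter).

```python
def play_dollar_auction(strategy_a, strategy_b, stake, budget):
    """Play one dollar auction and return the pair (profit_a, profit_b).

    Each strategy maps the state (own last bid x, opponent's last bid y)
    to its next bid; returning y (not raising) means passing, and
    returning 0 in the initial state (0, 0) means offering the first move
    to the opponent, as in Section 2.1.
    """
    bids = [0, 0]
    strategies = [strategy_a, strategy_b]
    turn, first_move_offered = 0, False
    while True:
        x, y = bids[turn], bids[1 - turn]
        bid = strategies[turn](x, y)
        if x == 0 and y == 0 and bid == 0:        # let the opponent move first
            if first_move_offered:                # nobody wants to open: no bids at all
                return (0, 0)
            first_move_offered, turn = True, 1 - turn
            continue
        if bid <= y:                              # pass: the opponent wins
            break
        bids[turn] = min(bid, budget)             # bids cannot exceed the budget
        if bids[turn] == budget:                  # the opponent cannot top this bid
            break
        turn = 1 - turn
    winner = 0 if bids[0] > bids[1] else 1
    profits = [-bids[0], -bids[1]]                # all-pay: both pay their final bids
    profits[winner] += stake
    return tuple(profits)

# A strategy that always raises by 1 against one that never enters the auction.
raise_by_one = lambda x, y: y + 1
never_bid = lambda x, y: y if (x, y) != (0, 0) else 0
print(play_dollar_auction(raise_by_one, never_bid, stake=20, budget=50))   # (19, 0)
```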

2.2 The Multi-Armed Bandit Model

We now describe the adversarial bandit model [6], a variation of the multi-armed bandit problem [33]. In particular, it consists of T rounds, and a player can choose one of K actions (traditionally described as different slot machines, or bandits, hence the name of the multi-armed bandit model) to play at each round. However, before the player chooses her action to play, an adversary assigns an arbitrary reward vector r(t) ∈ [0, 1]^K, where r(t) = (r_1(t), . . . , r_K(t)), without revealing it to the player. After the player chooses her action i_t to play, the value r_{i_t}(t) ∈ [0, 1], which represents the reward for performing action i_t in round t, is revealed to the player as her (collected) reward. The goal of the player then is to maximise her total collected reward over T rounds.

Let A denote the algorithm of playing actions chosen by the player, which selects the action i_{t+1} to perform in round t + 1 (the choice of action may be based on the observed results r_{i_1}(1), . . . , r_{i_t}(t) of the previously chosen actions i_1, . . . , i_t). The total collected reward, or the return, of algorithm A after T rounds is as follows:

  R_A(T) = Σ_{t=1}^{T} r_{i_t}(t).

Since the value of each r(t) is not revealed to the player, it is impossible to maximise the term above without having any additional information about r(t). Instead, the performance of an algorithm is measured with its regret, which is defined as the difference between the return of the best fixed action over time and the return of the algorithm, namely:

  U_A(T) = max_i Σ_{t=1}^{T} r_i(t) − Σ_{t=1}^{T} r_{i_t}(t).

Within our bandit formulation, the repeated dollar auction problem provides side information. Within the bandit literature, the first work that fits our setting is the work of Mannor and Shamir [25], which introduced the multi-armed bandit model with side information. In particular, within their model, action j is a neighbour of action i chosen by the algorithm in round t if and only if for some fixed parameter d we are able to create an unbiased estimator r̂_j(t) of the reward r_j(t), i.e., E[r̂_j(t) | action i chosen in round t] = r_j(t) and P(|r̂_j(t)| ≤ d) = 1. Intuitively, we can use the knowledge gained from the execution of action i to estimate the rewards of the actions in its neighbourhood.

This extra information can be represented as a sequence of graphs G_1, . . . , G_T with K nodes each. Let N_i(t) denote the neighbourhood of node i in graph G_t, i.e., an edge from i to j exists in graph G_t if and only if j ∈ N_i(t). Mannor and Shamir proposed an algorithm, called ELP, to efficiently tackle this bandit model. Here we will rely on the model of Mannor and Shamir and the ELP algorithm, and fit them to our setting of the repeated dollar auction.
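The return and regret definitions above can be phrased in a few lines of code. The sketch below (our own illustration) computes U_A(T) for a sequence of chosen actions given the full reward matrix, which in the bandit setting is of course never revealed to the player.

```python
import numpy as np

def regret(rewards, chosen):
    """Regret U_A(T) of a sequence of chosen actions.

    rewards: T x K array, rewards[t, i] = r_i(t) assigned by the adversary.
    chosen:  length-T list, chosen[t] = action i_t picked by the algorithm.
    """
    T = rewards.shape[0]
    algorithm_return = rewards[np.arange(T), chosen].sum()      # R_A(T)
    best_fixed_action_return = rewards.sum(axis=0).max()        # max_i sum_t r_i(t)
    return best_fixed_action_return - algorithm_return

# Tiny example: two actions over three rounds.
r = np.array([[1.0, 0.0], [0.2, 0.8], [0.9, 0.1]])
print(regret(r, chosen=[1, 1, 0]))   # best fixed action is 0: 2.1 - (0.0 + 0.8 + 0.9) = 0.4
```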
3. META-BIDDING ALGORITHM

We now turn to the description of our bidding algorithm. In particular, this algorithm is a meta-strategy that consists of multiple dollar auction strategies played over the rounds, in order to minimize the regret from the series of auctions. To do so, we first describe the multi-armed bandit formulation of our problem. We then define the neighbourhood of a strategy within the strategy space. Finally, we discuss how to tailor the ELP algorithm to our setting.

3.1 Model Description

In our setting, a player takes part in a series of T dollar auctions against the same opponent; each auction corresponds to a single round in the multi-armed bandit model. We assume that the player has at her disposal a set of strategies S_0 ⊆ S. Before each round t the player has to choose a strategy f_t ∈ S_0 to be used in this particular round. The choice of strategy corresponds to the choice of an action (a playing arm) in the multi-armed bandit model. On the other hand, her opponent also chooses her strategy (the equivalent of setting the payoffs for each action). We do not put any explicit restrictions on the strategy of the opponent.

The auction is played using the chosen strategies and the player achieves profit p_{f_t}(t) ∈ {−b + 1, . . . , s}. The profit is then mapped to a reward r_{f_t}(t), corresponding to the reward in the multi-armed bandit model. We assume that always r_{f_t}(t) ∈ [0, 1] (we normalise the rewards for the sake of simplicity). In the case of a non-spiteful player (i.e., a player who just wishes to maximize her profit) the reward depends only on her profit. We discuss the case of a malicious player, whose reward depends on the profit of her opponent, in Section 6.

Given this, a meta-strategy is an algorithm that chooses a strategy to be used in each auction, based on the knowledge about previous auctions. Note that it corresponds to the algorithm choosing an arm for each round in the multi-armed bandit model. The goal of the meta-strategy A is to minimize the regret after a series of auctions, the regret being the difference between the sum of rewards after using the single best strategy in hindsight and after using the sequence of strategies chosen by the meta-strategy:

  U_A(T) = max_{g∈S_0} Σ_{t=1}^{T} r_g(t) − Σ_{t=1}^{T} r_{f_t}(t).

3.2 Neighbourhood of a Strategy

We now define the relation of being a prefix strategy. In particular, strategy g ∈ S is a prefix of strategy f ∈ S if and only if ∀(x,y)∈X_b: g(x, y) = f(x, y) ∨ g(x, y) = y. Intuitively, at every moment of the auction g either makes the same bid as f or passes. Notice that f is also a prefix of itself. We can finally define the neighbourhood of a strategy f as N_f = {g ∈ S : g is a prefix of f}. In fact, the following lemma justifies that playing a strategy provides side information about its prefixes.

Lemma 1. Assuming that the player used strategy f in the dollar auction, we know the result of the auction had the player used a strategy g that is a prefix of f.

Proof. After the auction using strategy f in round t we know the sequence of states (x_1, y_1), . . . , (x_n, y_n) in which the player was asked for a bid. If ∀i: g(x_i, y_i) = f(x_i, y_i), then the reward r_g(t) for using g in the auction would have been exactly the actual reward for using f, i.e., r_g(t) = r_f(t). Otherwise, the reward for using g would have been r_g(t) = −x_j, where j is the lowest index such that g(x_j, y_j) = y_j, as the player would pass earlier than when using strategy f.
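A minimal sketch (our own illustration, with strategies represented as Python dictionaries over the states in X_b) of the prefix test and of the reward reconstruction used in the proof of Lemma 1. The helper names are assumptions, and for simplicity the reconstruction returns the raw profit at the divergence point rather than the normalised reward.

```python
def is_prefix(g, f, states):
    """g is a prefix of f iff at every state it either copies f's bid or passes (bids y)."""
    return all(g[(x, y)] == f[(x, y)] or g[(x, y)] == y for (x, y) in states)

def reconstruct_reward(g, f, visited_states, reward_f):
    """Side information from Lemma 1: the outcome g (a prefix of f) would have
    obtained in an auction where f visited `visited_states` and earned `reward_f`."""
    for (x, y) in visited_states:
        if g[(x, y)] != f[(x, y)]:     # first state where the prefix g passes instead of bidding
            return -x                  # g would have lost, paying its last bid x
    return reward_f                    # g bids exactly like f along the play, so same outcome
```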

3.3 The ELP Algorithm

We now describe how the ELP algorithm [25] can be adapted as a meta-bidding method in the dollar auction. The adaptation is based on using the previously defined set of prefixes as the neighbourhood of a strategy, in order to determine the potential profit of other strategies.

Algorithm 1 presents the pseudocode of ELP. The algorithm uses strategies from the provided set S_0 ⊆ S to play a series of T dollar auctions, selecting a single strategy for each auction. Typically for multi-armed bandit algorithms, the ELP algorithm chooses between exploration (checking new strategies) and exploitation (using the so far most promising strategies to generate profit). This trade-off is expressed by the parameters γ(t) and q_f(t) of the algorithm. We will discuss the exact values of these parameters, which provide low regret bounds, later in the paper.

Algorithm 1: The ELP algorithm
  Input: S_0 ⊆ S, β, (γ(t))_{t=1}^{T}, (∪_{f∈S_0}{q_f(t)})_{t=1}^{T}
  ∀f∈S_0: w_f(1) ← 1/|S_0|
  for t = 1, . . . , T do
    for g ∈ S_0 do
      P_g(t) ← (1 − γ(t)) · w_g(t) / Σ_{h∈S_0} w_h(t) + γ(t)·q_g(t)
    f_t ∼ P(t)
    Play f_t, getting reward r_{f_t}(t)
    for g ∈ N_{f_t} do
      Compute r̂_g(t)
    for g ∈ S_0 do
      if f_t ∈ N_g then
        r̃_g(t) ← r̂_g(t) / Σ_{h∈N_g(t)} P_h(t)
      else
        r̃_g(t) ← 0
    for g ∈ S_0 do
      w_g(t + 1) ← w_g(t) · exp(β·r̃_g(t))
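A compact Python rendering of Algorithm 1 follows. It is a sketch under our own design choices: the rewards are assumed to be already normalised to [0, 1], the exploration distribution q_g(t) is taken as uniform instead of the LP-optimised one from Theorem 2, the neighbourhood graph is assumed constant over rounds, and `play_and_estimate` is a hypothetical callback (not part of the paper) that plays an auction and returns the side-information rewards.

```python
import math
import random

def elp(strategies, neighbours, play_and_estimate, T, beta, gamma):
    """Sketch of Algorithm 1 (ELP) with a uniform exploration distribution.

    strategies: the set S_0 (a list of hashable strategy identifiers).
    neighbours: dict g -> set N_g of strategies whose play reveals g's reward.
    play_and_estimate(f): plays one auction with f and returns a dict
        {g: estimated reward r_hat_g in [0, 1]} for every strategy g whose
        reward the auction reveals (cf. Lemma 1).
    """
    w = {g: 1.0 / len(strategies) for g in strategies}
    q = {g: 1.0 / len(strategies) for g in strategies}      # uniform instead of the LP solution
    for _ in range(T):
        total = sum(w.values())
        P = {g: (1 - gamma) * w[g] / total + gamma * q[g] for g in strategies}
        f = random.choices(strategies, weights=[P[g] for g in strategies])[0]
        r_hat = play_and_estimate(f)                         # rewards revealed by playing f
        for g in strategies:
            if f in neighbours[g]:                           # playing f gives information about g
                r_tilde = r_hat.get(g, 0.0) / sum(P[h] for h in neighbours[g])
            else:
                r_tilde = 0.0
            w[g] *= math.exp(beta * r_tilde)                 # exponential weights update
    return w
```

In the dollar auction, `neighbours` would be built from the prefix relation of Section 3.2 and the callback would use the reconstruction from Lemma 1.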

In round t the algorithm selects a strategy f_t ∈ S_0 using a probability distribution P(t) that incorporates the knowledge gained in previous rounds. Knowing the reward r_{f_t}(t) received for using the selected strategy, the algorithm computes the reward r̂_g(t) that would have been received for playing each strategy g from the neighbourhood of f_t in round t. The method of computing this reward is given in the proof of Lemma 1. Basically, for an auction where the player was making decisions in states (x_1, y_1), . . . , (x_n, y_n), we have r̂_g(t) = r_f(t) if ∀i: g(x_i, y_i) = f(x_i, y_i), and r̂_g(t) = −x_j (where j is the lowest index such that g(x_j, y_j) = y_j) otherwise.

This reward is used to construct the probability distribution in the next round. The higher the reward, the more probable it is that the strategy will be used in the subsequent rounds.

3.4 Regret Analysis

We now give an upper bound for the regret of the ELP algorithm:

Theorem 2. Assume that the ELP algorithm is run using β ∈ (0, 1/(2|S_0|)), with the parameters (q_f(t))_{f∈S_0} set such that ∀f: q_f(t) ≥ 0, Σ_f q_f(t) = 1, and the value of

  min_{f∈S_0} Σ_{g∈N_f(t)} q_g(t)

is maximal (such values can be computed with linear programming). Moreover, assume that the γ(t) parameters are set to

  γ(t) = β / min_{f∈S_0} Σ_{g∈N_f(t)} q_g(t).

In addition, let ε = min_{t,g} γ(t)·q_g(t) > 0. Then the upper bound on the regret of ELP is

  U_A(T) ≤ 9βT·α(G)·log(6|S_0|/(α(G)ε)) + log(|S_0|)/β,

where α(G) is the independence number of the underlying neighbourhood graph G of the strategies in S_0.

Proof. Following the proof of Theorem 3 in [25]², we can show that

  U_A(T) ≤ βT·α(G) + 2β·Σ_{t=1}^{T} P_{f_t}(t) / Σ_{s∈N_{f_t}(t)} P_s(t) + log(|S_0|)/β.

From Lemma 13 of [2], we can further bound the second term of the RHS as follows:

  β·Σ_{t=1}^{T} P_{f_t}(t) / Σ_{s∈N_{f_t}(t)} P_s(t) ≤ 2α(G)·log(1 + (|S_0|²/(α(G)ε) + |S_0| + 1)/α(G)) + 2α(G).

By using elementary algebra, and exploiting the facts that 1 ≤ α(G) ≤ |S_0| and 0 < ε < 1, the RHS can be further bounded by 2α(G)·log(6|S_0|/(α(G)ε)). This implies that

  U_A(T) ≤ 9βT·α(G)·log(6|S_0|/(α(G)ε)) + log(|S_0|)/β,

which concludes the proof.

Choosing the value of β such that both components of the bound are equal, we get the following:

Corollary 3. For β = √( log(|S_0|) / (9T·α(G)·log(6|S_0|/(α(G)ε))) ), the bound defined in Theorem 2 equals

  U_A(T) ≤ 6·√( α(G)·log(6|S_0|/(α(G)ε))·log(|S_0|)·T ).

Note that here we have U_A(T) = Õ(√(α(G)T)). However, as S_0 can be arbitrary, in the worst case scenario (when playing any particular strategy from S_0 provides no side information at all), we have U_A(T) = Õ(√(|S_0|T)), which can be quite large if the set of possible strategies S_0 is large.

²Note that the original proof of this theorem only considers oblivious opponents. However, by using the argument introduced in [32], we can apply the proof to adaptive adversaries as well.
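Theorem 2 asks for (q_f(t)) maximising min_{f∈S_0} Σ_{g∈N_f(t)} q_g(t) over the probability simplex, which is a small linear program. The sketch below is our own formulation of that LP (the paper only states that such values can be computed with linear programming); the example neighbourhood structure is purely illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def exploration_distribution(strategies, neighbours):
    """Find q on the simplex maximising min_f sum_{g in N_f} q_g.
    Returns (q as a dict, the attained minimum coverage)."""
    n = len(strategies)
    index = {g: k for k, g in enumerate(strategies)}
    # Variables: q_1, ..., q_n, z.  Maximise z  <=>  minimise -z.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    A_ub = np.zeros((n, n + 1))
    for row, f in enumerate(strategies):           # z - sum_{g in N_f} q_g <= 0
        for g in neighbours[f]:
            A_ub[row, index[g]] = -1.0
        A_ub[row, -1] = 1.0
    A_eq = np.ones((1, n + 1))
    A_eq[0, -1] = 0.0                              # sum_g q_g = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * (n + 1))
    return {g: res.x[index[g]] for g in strategies}, res.x[-1]

# Illustrative cyclic coverage structure; the optimal value is 2/3 (e.g., uniform q).
neigh = {'a': {'a', 'b'}, 'b': {'b', 'c'}, 'c': {'c', 'a'}}
print(exploration_distribution(['a', 'b', 'c'], neigh))
```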

A possible way to improve the bound is to make the underlying neighbourhood graph denser. However, it is not trivial to provide such improvements. For example, even assuming a neighbourhood graph with a minimum degree of constant m (i.e., m independent from |S_0|) would not help much. In fact, we state the following:

Theorem 4. There exists S_0 ⊆ S with minimum neighbourhood degree m and a (randomised) opponent strategy, such that if T > 374|S_0|³, we have U_A(T) ≥ 0.06·√(|S_0|T/(m+1)) for any meta-strategy A.

Proof. From Theorem 4 of [25], we know that under the conditions given in the theorem, the following holds:

  U_A(T) ≥ 0.06·√(α(G)T).

Now, we just need to show that there exists S_0 with minimum neighbourhood degree m but α(G) ≥ |S_0|/(m+1). To do so, consider a set of groups of m + 1 strategies such that within each group there is one chosen strategy and the rest are prefixes of the chosen one. It is clear that the underlying neighbourhood graph G has at least |S_0|/(m+1) cliques, which implies that α(G) ≥ |S_0|/(m+1).

That is, even in this case, we still have U_A(T) = Ω(√(|S_0|T)). Nevertheless, in the following sections, we investigate some special cases of the problem, in which this regret bound can be significantly improved.

4. BIDDING WITH RANDOM NEIGHBOURHOOD GRAPH

We first start with the scenario where some additional information is randomly added to the model. In particular, we assume that apart from the side information observed for the prefixes of the chosen strategy, we have access to some additional information that can reveal the potential outcome of strategies other than the prefixes of the chosen one.

4.1 Random Neighbourhood

By random additional information, we mean that there is an oracle in the system, to whom we can send requests to infer the outcome of other (not-played) strategies, after observing the outcome of our chosen one. The assumption of having such an oracle is quite common in the online learning literature [42, 25]. Note that we do not consider query costs in our model. In addition, we also assume that the result of a query is probabilistic, that is, it is revealed with some probability v > 0. As such, we can represent the neighbourhood graph of the dollar auction as a random graph G(S_0, v).

A random neighbourhood can model some additional expert knowledge acquired by the player. For example, during negotiations (modelled by the dollar auction) some strategies can be considered more aggressive than others (and likely to give similar results), despite not being prefixes of each other (in the strict sense described in the previous sections). Other possible interpretations of the additional information include the results of intelligence gathering in the arms race scenario (identifying the maximal number of troops or weapon units that our opponent can use during the conflict) or industrial espionage in the R&D competition scenario (abandoning research on technologies already tested by the competition).
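The random-neighbourhood model can be simulated directly: prefix-based side information is always available, and every other potential observation is revealed independently with probability v. A short sketch of one possible construction (our own; it reuses the hypothetical `is_prefix` helper from the earlier sketch):

```python
import random

def random_neighbourhood_graph(strategies, states, v, is_prefix, seed=None):
    """Build a side-information graph in the spirit of Section 4: prefix edges
    are always present, other observations are revealed with probability v."""
    rng = random.Random(seed)
    neighbours = {f: set() for f in strategies}
    for f in strategies:
        for g in strategies:
            if is_prefix(g, f, states):          # deterministic side information
                neighbours[f].add(g)
            elif rng.random() < v:               # oracle answer revealed with probability v
                neighbours[f].add(g)
    return neighbours
```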

4.2 Regret Analysis

Given the random neighbourhood model, we now turn to the regret analysis of the algorithm. In particular, we add the following conditions to the model:

  |S_0| ≥ 8000    (1)

  1 ≥ v ≥ |S_0|^{−1/10}    (2)

The first condition is reasonable as we typically consider large sets of strategies. The second condition guarantees that the neighbourhood graph is quite dense in general. We state the following:

Theorem 5. Suppose that conditions (1)-(2) hold. Under the same conditions as Theorem 2, with probability at least 1 − o(|S_0|), we have the following regret bound for ELP³:

  U_A(T) ≤ 3|S_0|^{1/4}·√( 2·log(36|S_0|/ε²)·log(|S_0|)·T ).

Proof. From [16] we have that with probability at least 1 − o(|S_0|) the following holds:

  α(G) ≤ (2/v)·( log(|S_0|v) − log log(|S_0|v) − log 2 + 2 ).

Since |S_0| > 8000 and v ≥ |S_0|^{−1/10}, this can be further bounded as:

  α(G) ≤ (2/v)·( log(|S_0|v) + 2 ) ≤ 4·log(|S_0|)/v.

Note that we use v ≤ 1 here. Now, since v ≥ |S_0|^{−1/10}, we have that

  α(G) ≤ 4·log(|S_0|)/|S_0|^{−1/10} ≤ √|S_0|.

The last inequality holds if |S_0| ≥ 8000. Applying this to Theorem 2 concludes the proof.

Note that here we have U_A(T) = Õ(|S_0|^{1/4}·√T), which is a |S_0|^{1/4} improvement compared to the regret bound of the general case. In the next section, we will show how this bound can be further improved by taking a different approach of modifying the neighbourhood model.

5. BIDDING WITH THRESHOLD

Besides adding further side information to the model, another way to simplify the underlying structure of the neighbourhood graph, and thus to improve the induced regret bound, is to restrict the type of behaviour each player can follow. In this spirit, in this section we investigate a class of meta-strategies that provides this simplification.

To do so, we first discuss the optimal non-spiteful strategy, proposed by O'Neill, that provides the basis of our class of behaviours. We then describe how to create a class of strategies which we call optimal strategies with threshold. Finally, we show that by restricting ourselves to this class of strategies, we can further improve the previous regret bound by another factor of |S_0|^{1/4}.

³Note that the o(·) notation here means that o(|S_0|) → 0 as |S_0| tends to infinity.

5.1 The Optimal Non-Spiteful Strategy

We first start with the discussion of the pure strategy for the dollar auction against a non-spiteful opponent that was proposed by O'Neill [31]. In particular, O'Neill proved that this strategy is optimal in terms of maximising the expected utility of a non-spiteful player, when played against a non-spiteful opponent.

Let the initial bid of the strategy be x_0 = (b − 1) mod (s − 1) + 1, where b is the budget of both players, s is the stake and 1 is the minimal bid increment. In addition, let x_i = x_0 + i(s − 1) for i > 0 and let x_{−1} = 0. Given this, the optimal strategy of O'Neill, f̂, for y ∈ {x_{i−1}, . . . , x_i − 1} is:

  f̂(x, y) = x_i  if x = x_{i−1},
  f̂(x, y) = y   otherwise.

5.2 Optimal Strategies with Threshold

Based on the optimal strategy of O'Neill, we now propose a special class of strategies, the use of which allows us to achieve a better regret bound for ELP. In particular, let f̂ ∈ S be the optimal strategy of O'Neill described in the previous section. We call a strategy f_θ ∈ S a strategy with threshold θ if and only if ∀y<θ: f_θ(x, y) = f̂(x, y) and ∀y≥θ: f_θ(x, y) = y.

Note that a strategy with threshold follows the optimal strategy of O'Neill up to the moment when the opponent's bid exceeds the threshold, and then passes. The underlying intuition of this class of strategies is that in many cases the bidder is rational (i.e., bidding optimally), but is also risk-aware. To model the latter, we assume that the bidder is only willing to bid up to a certain threshold value, although her bidding budget would allow her to go beyond that threshold. As such, we argue that this class of strategies with threshold represents realistic behaviour in many real-world situations.
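A sketch of the strategy f̂ and of its threshold variants as Python functions (our own encoding; all quantities are in units of the minimal bid increment). With the introduction's numbers, s = 20 and b = 50 units of $0.05, the opening bid is x_0 = (b − 1) mod (s − 1) + 1 = 12 units, i.e., $0.60.

```python
def oneill_strategy(b, s):
    """O'Neill's optimal non-spiteful strategy f_hat for budget b and stake s."""
    x0 = (b - 1) % (s - 1) + 1
    milestones = [0, x0]                  # x_{-1} = 0, x_0, x_1 = x_0 + (s - 1), ..., ending at b
    while milestones[-1] < b:
        milestones.append(milestones[-1] + (s - 1))

    def f_hat(x, y):
        # Find the interval {x_{i-1}, ..., x_i - 1} containing the opponent's bid y.
        for prev, nxt in zip(milestones, milestones[1:]):
            if prev <= y < nxt:
                return nxt if x == prev else y   # jump to the next milestone or pass
        return y                                  # outside all intervals: pass
    return f_hat

def with_threshold(f_hat, theta):
    """Strategy with threshold theta: follow f_hat while the opponent's bid is
    below theta, pass otherwise (Section 5.2)."""
    return lambda x, y: f_hat(x, y) if y < theta else y

f = oneill_strategy(b=50, s=20)
print(f(0, 0))   # 12, i.e. an opening bid of $0.60 when the increment is $0.05
```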
5.3 Regret Analysis

We now turn to the regret analysis of ELP for the case of bidding with threshold. In particular, we assume that we only have access to the above defined class of non-spiteful strategies with threshold.

Theorem 6. Assume that the set S_0 contains all strategies with threshold θ ∈ {0, . . . , b} and no other strategies. Under the abovementioned conditions and the settings of Theorem 2, the regret bound of the ELP algorithm is at most:

  U_A(T) ≤ 9βT·log(6|S_0|/ε) + log(|S_0|)/β.

Proof. From Theorem 2 we have:

  U_A(T) ≤ 9βT·α(G)·log(6|S_0|/(α(G)ε)) + log(|S_0|)/β,

where α(G) is the independence number of the information graph G.

Given two strategies with thresholds f_θ and f_Θ, where θ < Θ, there is always an edge f_Θ → f_θ in the information graph for all rounds. If the auction was lost while using f_Θ, then the auction would have also been lost using f_θ and the result would be equal to −x, where x is the last bid of f_θ. If the auction was won while using f_Θ, we know the last bid y of the opponent's strategy. Since f_θ and f_Θ follow the same sequence of bids, then for θ ≤ y the auction would have been lost with the result of −x (x being the last bid made by f_θ), and for θ > y the auction would have been won with the result of s − x (x being the lowest bid made by f_θ higher than y). Therefore, for the setting described in the theorem, G is a clique and α(G) = 1, which concludes the proof.

Again, choosing the value of β such that both components of the bound are equal, we get the following:

Corollary 7. For β = √( log(|S_0|) / (9T·log(6|S_0|/ε)) ), the bound defined in Theorem 6 equals:

  U_A(T) ≤ 6·√( log(6|S_0|/ε)·log(|S_0|)·T ).

Note that U_A(T) = Õ(√T), which is an improvement by a factor of |S_0|^{1/2} compared to the regret bound of the general case.

6. AUCTION AGAINST MALICIOUS OPPONENT

So far we have shown that by applying ELP, we can still achieve low-regret performance, despite the fact that there is no prior knowledge about the opponent's motives, nor about her rationality. We now introduce the concept of spitefulness and consider an auction against a malicious opponent. As argued by Waniek et al. [39], there are various real-world applications in which the dollar auctions are played between spiteful players, whose objective function involves some linear combination of maximising one's own utility and minimising the opponent's utility. For instance, one may argue that certain moves of the USA or the Soviet Union during the Cold War were designed to harm the opponent even if it meant incurring some extra cost.

6.1 The Concept of Spitefulness

We interpret spitefulness as the desire of a player to hurt her opponent. We follow the definition of spitefulness introduced by Brandt et al. [9].

Any spiteful player i is characterised by a spitefulness coefficient σ_i ∈ [0, 1]. The higher the coefficient, the more player i is interested in minimizing the profit of her opponent j. The utility function of a spiteful player is then:

  u_i = (1 − σ_i)·p_i − σ_i·p_j,

where p_i and p_j are the profits of the corresponding players. A player with spitefulness coefficient σ = 0 is called a non-spiteful player, and she is only interested in maximizing her own profit. On the other hand, a player with spitefulness coefficient σ = 1 is called a malicious player, and she is only interested in minimizing the profit of her opponent.

6.2 Mixed Nash Equilibrium

We now consider the case when a non-spiteful bidder plays the repeated dollar auctions against a malicious opponent, not knowing that the opponent is malicious. In addition, our opponent might not have full information about our non-spitefulness either and might not follow rational behaviour. Given this, we assume that her strategy is taken from a set S_1 ⊆ S.

If we knew the set S_1 in advance (and the opponent also knew our set of strategies S_0), a mixed Nash equilibrium could be easily calculated using standard arguments.

However, as this knowledge is typically not available in advance, it is impossible to calculate the Nash equilibrium. Nevertheless, we show that if both players apply the ELP algorithm to play their strategies, they converge to a Nash equilibrium point in performance.

Theorem 8. Let f_i(t) ∈ S_i, i ∈ {0, 1}, denote the bidding strategy of player i at round t, and denote by f_j(t) the strategy chosen by i's opponent. Let the mixed strategies

  f̃_i = (1/T)·Σ_{t=1}^{T} f_i(t),    f̃_j = (1/T)·Σ_{t=1}^{T} f_j(t)

denote the averages of these chosen strategies. Given this, for each T, there exists a mixed Nash equilibrium profile (f*_i, f*_j) such that

  r_i(f*_i, f*_j) − r_i(f̃_i, f̃_j) ≤ Õ(√(|S_0|/T)).

Proof. For the sake of simplicity we now map the profit of player i to a reward r_i ∈ [−1, 1] (instead of r_i ∈ [0, 1] as before). Reward r_i = 1 of the non-spiteful player i corresponds to a profit equal to the budget, p_i = b, and reward r_i = −1 of the non-spiteful player corresponds to a loss equal to the budget, p_i = −b. The reward of the malicious player is expressed as the utility described in the previous section, i.e., reward r_i = 1 of the malicious player i corresponds to a loss of the non-spiteful player j equal to the budget, p_j = −b, and reward r_i = −1 of the malicious player i corresponds to a profit of the non-spiteful player j equal to the budget, p_j = b.

Given that assumption, note that when a non-spiteful player bids against a malicious one, our problem can be regarded as a repeated zero-sum game. In fact, within this game, each player i (with i ∈ {0, 1}) chooses from a set of bidding strategies S_i, the received reward of player i is r_i, and the received reward of her opponent (due to the malicious behaviour) is r_j = −r_i.

It is well known that in a 2-player zero-sum game, the minimax value exists and coincides with the (mixed) Nash equilibrium. Given this, we only need to show that the average behaviour of the players, when both apply ELP, converges in performance to the minimax solution. To do so, let u* denote the minimax value, which can be defined as follows:

  u* = max_{f_i} min_{f_j} r_i(f_i, f_j) = min_{f_j} max_{f_i} r_i(f_i, f_j),

where f_i and f_j are mixed strategies of players i and j, respectively. For now we consider the case that player i (non-spiteful) uses ELP, and player j can play any meta-strategy {f_j(t)}_{t=1}^{T}. In addition, let BR(f_i(t)) denote the best response mixed strategy against player i's bidding strategy f_i(t). By denoting

  ∆_T = 6·√( α(G)·log(6|S_0|/(α(G)ε))·log(|S_0|) / T ),
from Theorem 2 we have that

  (1/T)·Σ_{t=1}^{T} r_i(t) ≥ max_{f∈S_i} (1/T)·Σ_t r_i(f, f_j(t)) − ∆_T = max_f (1/T)·Σ_t r_i(f, f_j(t)) − ∆_T,

where f is a mixed strategy over S_i. The first inequality is obtained from Theorem 2, while the second step follows from the fact that a mixed strategy belongs to the convex hull of the pure strategies, and thus cannot exceed the best pure strategy. This can be further bounded as follows:

  max_f (1/T)·Σ_t r_i(f, f_j(t)) − ∆_T ≥ max_f min_g (1/T)·Σ_t r_i(f, g) − ∆_T,

where g is a mixed strategy over S_j. This can be further bounded as:

  max_f min_g (1/T)·Σ_t r_i(f, g) ≥ max_f min_g r_i(f, g) − ∆_T = u* − ∆_T.

That is, we have

  (1/T)·Σ_{t=1}^{T} r_i(t) ≥ u* − ∆_T.

Similarly, we can prove (from the malicious player's perspective) that

  (1/T)·Σ_{t=1}^{T} r_i(t) ≤ u* + ∆_T,

which concludes the proof.

Note that this convergence in performance is a weaker property than convergence in strategy (i.e., the average behaviour of the players converging to the distribution of the mixed strategies in the equilibrium point), as the latter automatically implies the former, while the converse is not true in general.
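Since the non-spiteful versus malicious case is a two-player zero-sum game, the minimax value u* that Theorem 8 refers to can be computed with the standard linear program over the row player's mixed strategies. The sketch below is our own illustration using scipy; the payoff matrix is a stand-in for r_i(f, g) over small strategy sets S_0 and S_1.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(payoff):
    """Value u* of a zero-sum game given the row player's payoff matrix,
    together with the row player's optimal mixed strategy."""
    m, n = payoff.shape
    # Variables: x_1, ..., x_m (mixed strategy), v (game value).  Maximise v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-payoff.T, np.ones((n, 1))])   # v - sum_i x_i * A[i, j] <= 0 for every column j
    A_eq = np.ones((1, m + 1))
    A_eq[0, -1] = 0.0                                # x is a probability vector
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * m + [(None, None)])
    return res.x[-1], res.x[:-1]

# Matching pennies as a stand-in payoff matrix: the value is 0, reached by (1/2, 1/2).
value, strategy = minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(round(value, 3), strategy)
```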

7. RELATED WORK

In this section we describe the bodies of literature concerning the important aspects of our setting: all-pay auctions, repeated auctions and multi-armed bandit models.

7.1 All-Pay Auctions and the Dollar Auction

All-pay auctions have been studied as a model of multiple settings where the expenditure of resources does not always guarantee profit [24]. Some of them include political campaigns, R&D competitions of oligopolistic companies [38], realizing crowdsourcing enterprises [14] and international situation analysis [34]. Aside from theoretical studies, there is also experimental research [12]. The dollar auction is a widely studied type of an all-pay auction [23, 22]. It is often used in classroom experiments serving the study of conflict escalation among people [29, 19].

Recently, Waniek et al. [39] offered a preliminary study of a dollar auction in which a spiteful player plays against a rational opponent. The authors considered the setting in which a non-spiteful bidder unwittingly bids against a spiteful one. In various scenarios in this setting the conflict escalates. In particular, the spiteful bidder is able to force the non-spiteful opponent to spend most of the budget. Still, it is the spiteful bidder who often wins the prize. Furthermore, a malicious player with a smaller budget is likely to plunge the opponent more than a malicious player with a bigger budget. Thus, a malicious player should not only hide his real preferences but also the real size of his budget.

7.2 Repeated Auctions

Modelling various processes as repeated auctions is a well established idea [41]. Among other applications, it was used for allocating tasks to robots [30] and assigning channels in a radio network [17]. Many authors concentrate on the problem of possible collusion in repeated auctions [3, 35, 26], focusing their attention on the effect of cooperation between players on the income of an auctioneer [7]. Most of the published results consider second-price auctions [20, 7]; however, the all-pay auction format has also been studied [28].

7.3 Multi-Armed Bandits

The first multi-armed bandit models were designed to tackle problems with stochastic rewards [33, 1, 4]. The first adversarial setting was proposed by Auer et al. [5, 6]. However, in the early years, adversarial bandit models were believed to be suitable only against non-adaptive opponents. The argument that justified the usage of multi-armed bandits against adaptive opponents was first proposed by Poland [32], whose techniques were adopted by many more recent works. Mannor and Shamir [25] extended the adversarial bandit model to the case when additional side information is provided to improve the regret bounds of the model. Their work was later improved by [2] and [21]. In particular, the former improved the regret bound, while the latter focuses on the cases when the neighbourhood graph cannot be fully revealed in advance (i.e., before the chosen arm is played).

8. CONCLUSIONS

In this paper, we investigated the problem of repeated dollar auctions in which players do not have prior knowledge about the opponent's spitefulness, nor her rationality level. We proposed an adversarial multi-armed bandit model to efficiently tackle this problem, in which the goal is to maximise the players' total utility over time. We showed that by using an adversarial bandit based meta-strategy, ELP, we can indeed provably achieve good performance. We then further improved this performance by considering additional settings of the model: (i) dollar auctions with random neighbourhood graphs; and (ii) playing optimal strategies with thresholds. We also proved that in the special case of a non-spiteful player versus a malicious player, if both use ELP to make their decisions, the game converges to a mixed Nash equilibrium in performance.

As potential future work, we aim to extend our work to repeated auctions where the auctions share the same budget. That is, apart from making a decision about which strategy to use in each auction, the players also have to choose the budget allocated to a particular auction, without exceeding an a priori given global limit. As our current techniques significantly rely on the fact that the current budgets are given, it seems to be highly non-trivial to extend our model to such settings. Furthermore, until now, the literature on the dollar auction has focused on the case of two players. Hence, it would be very interesting to extend all the results to multiple players.

Acknowledgments

Tran-Thanh gratefully acknowledges funding from the UK Research Council for project "ORCHID", grant EP/I011587/1. Tomasz Michalak was supported by the European Research Council under Advanced Grant 291528 ("RACE").

APPENDIX

Table 1: Notation

Symbol — Meaning
i, j — Players participating in the auction
s — Stake of the auction
δ — Minimal bid increment
b — Budget of both players
X_b — States where a player can make a bid
S — Set of all valid strategies in the auction
S_0 — Set of available strategies
p_i — Profit of player i
T — Number of rounds (separate auctions)
A — Algorithm selecting strategies
R_A(T) — Total reward of algorithm A after T rounds
U_A(T) — Regret of algorithm A after T rounds
r_f(t) — Reward for using strategy f in round t
G — Neighbourhood graph
α(G) — Independence number of graph G
N_f — Neighbourhood of strategy f
f_t — Strategy selected for use in round t
β, γ(t), q_f(t) — Parameters of the ELP algorithm
w_f(t) — Weight of choosing strategy f in round t
P_g(t) — Probability of selecting strategy g in round t
r̂_f(t) — Estimated reward for using strategy f in round t
v — Probability of revealing additional information
G(S_0, v) — Random neighbourhood graph of strategies S_0
θ — Threshold of a strategy
σ_i — Spitefulness coefficient of player i

REFERENCES

[1] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27:1054–1078, 1995.
[2] N. Alon, N. Cesa-Bianchi, C. Gentile, and Y. Mansour. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, pages 1610–1618, 2013.
[3] M. Aoyagi. Bid rotation and collusion in repeated auctions. Journal of Economic Theory, 112(1):79–105, 2003.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on, pages 322–331. IEEE, 1995.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[7] S. Bikhchandani. Reputation in repeated second-price auctions. Journal of Economic Theory, 46(1):97–119, 1988.
[8] K. E. Boulding. Conflict and defense: A general theory. 1962.
[9] F. Brandt, T. Sandholm, and Y. Shoham. Spiteful bidding in sealed-bid auctions. In Computing and Markets, 2005.
[10] J. Campos, C. Martinho, and A. Paiva. Conflict inside out: A theoretical approach to conflict from an agent point of view. In AAMAS '13, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[11] C. Castelfranchi. Conflict ontology. In Computational Conflicts, pages 21–40. Springer, 2000.
[12] E. Dechenaux, D. Kovenock, and R. M. Sheremeta. A survey of experimental research on contests, all-pay auctions and tournaments. WZB Discussion Paper SP II 2012-109, Berlin, 2012.
[13] G. Demange. Rational escalation. Annales d'Économie et de Statistique, (25/26):227–249, 1992.
[14] D. DiPalantino and M. Vojnovic. Crowdsourcing and all-pay auctions. In EC '09, pages 119–128. ACM, 2009.
[15] H. Fang. Lottery versus all-pay auction models of lobbying. Public Choice, 112(3-4):351–371, 2002.
[16] A. M. Frieze. On the independence number of random graphs. Discrete Mathematics, 81(2):171–175, 1990.
[17] Z. Han, R. Zheng, and H. V. Poor. Repeated auctions with Bayesian nonparametric learning for spectrum access in cognitive radio networks. Wireless Communications, IEEE Transactions on, 10(3):890–900, 2011.
[18] A. J. Jones. Game theory: Mathematical models of conflict. Elsevier, 2000.
[19] J. H. Kagel and D. Levin. Auctions: a survey of experimental research, 1995–2008. 2008.
[20] J. L. Knetsch, F.-F. Tang, and R. H. Thaler. The endowment effect and repeated market trials: Is the Vickrey auction demand revealing? Experimental Economics, 4(3):257–269, 2001.
[21] T. Kocák, G. Neu, M. Valko, and R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.
[22] V. Krishna and J. Morgan. An analysis of the war of attrition and the all-pay auction. Journal of Economic Theory, 72(2):343–362, 1997.
[23] W. Leininger. Escalation and cooperation in conflict situations: the dollar auction revisited. Journal of Conflict Resolution, 33(2):231–254, 1989.
[24] Y. Lewenberg, O. Lev, Y. Bachrach, and J. S. Rosenschein. Agent failures in all-pay auctions. IJCAI/AAAI, 2013.
[25] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
[26] R. C. Marshall and L. M. Marx. Bidder collusion. Journal of Economic Theory, 133(1):374–402, 2007.
[27] H. J. Müller and R. Dieng. On conflicts in general and their use in AI in particular. In Computational Conflicts, pages 1–20. Springer, 2000.
[28] J. Münster. Repeated contests with asymmetric information. Journal of Public Economic Theory, 11(1):89–118, 2009.
[29] J. K. Murnighan. A very extreme case of the dollar auction. Journal of Management Education, 26(1):56–69, 2002.
[30] M. Nanjanath and M. Gini. Repeated auctions for robust task execution by a robot team. Robotics and Autonomous Systems, 58(7):900–909, 2010.
[31] B. O'Neill. International escalation and the dollar auction. Journal of Conflict Resolution, 30(1):33–50, 1986.
[32] J. Poland. FPL analysis for adaptive bandits. In 3rd Symposium on Stochastic Algorithms, Foundations and Applications (SAGA '05), pages 58–69, 2005.
[33] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the AMS, 55:527–535, 1952.
[34] M. Shubik. The dollar auction game: A paradox in noncooperative behavior and escalation. Journal of Conflict Resolution, pages 109–111, 1971.
[35] A. Skrzypacz and H. Hopenhayn. Tacit collusion in repeated auctions. Journal of Economic Theory, 114(1):153–169, 2004.
[36] C. Tessier, L. Chaudron, and H. J. Müller. Agents' conflicts: New issues. In Conflicting Agents, volume 1 of Multiagent Systems, Artificial Societies, and Simulated Organizations, pages 1–30. Springer US, 2002.
[37] W. W. Vasconcelos, M. J. Kollingbaum, and T. J. Norman. Normative conflict resolution in multi-agent systems. AAMAS '09, 19(2):124–152, 2009.
[38] X. Vives. Technological competition, uncertainty, and oligopoly. Journal of Economic Theory, 48(2):386–415, 1989.
[39] M. Waniek, A. Niescieruk, T. Michalak, and T. Rahwan. Spiteful bidding in the dollar auction. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 667–673. AAAI Press, 2015.
[40] D. Weyns and T. Holvoet. Regional synchronization for simultaneous actions in situated multi-agent systems. pages 497–510, 2003.
[41] E. Wolfstetter. Auctions: an introduction. Journal of Economic Surveys, 10(4):367–420, 1996.
[42] J. Y. Yu and S. Mannor. Piecewise-stationary bandit problems with side observations. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1177–1184. ACM, 2009.
