Online Learning, Regret Minimization, Minimax Optimality, and Correlated Equilibrium

Avrim Blum

High level
Last time we discussed the notion of Nash equilibrium.
• Static concept: a set of prob. distributions (p, q, …) such that nobody has any incentive to deviate.
• But it doesn't talk about how the system would get there. Troubling that even finding one can be hard in large games.
What if agents adapt (learn) in ways that are well-motivated in terms of their own rewards? What can we say about the system then?

High level
Consider the following setting…
• Each morning, you need to pick one of N possible routes to drive to work.
• But traffic is different each day. Not clear a priori which route will be best.
• When you get there, you find out how long your route took. (And maybe others too, or maybe not.)
• Is there a strategy for picking routes so that in the long run, whatever the sequence of traffic patterns has been, you've done nearly as well as the best fixed route in hindsight? (In expectation, over internal randomness in the algorithm.)
• Yes.

Today:
• Online learning guarantees that are achievable when acting in a changing and unpredictable environment.
• What happens when two players in a zero-sum game both use such strategies?
  – Approach minimax optimality.
  – Gives an alternative proof of the minimax theorem.
• What happens in a general-sum game?
  – Approaches (the set of) correlated equilibria.

“No-regret” algorithms for repeated decisions
A bit more generally:
• Algorithm has N options. World chooses a cost vector. Can view this as a matrix like the one below (maybe with an infinite # of columns).
  [Matrix: rows = the Algorithm's N options, columns = World – life – fate.]
• At each time step, the algorithm picks a row and life picks a column.
• Alg pays the cost of the action chosen.
• Alg gets the whole column as feedback (or just its own cost, in the “bandit” model).
• Need to assume some bound on the max cost. Let's say all costs are between 0 and 1.
• Define the average regret in T time steps as:
  (avg per-day cost of alg) – (avg per-day cost of the best fixed row in hindsight).
  We want this to go to 0 or better as T gets large. [An algorithm achieving this is called a “no-regret” algorithm.]
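In symbols (just restating the definition above; here c^t(i) ∈ [0,1] is the cost of row i on day t and i_t is the row the algorithm picks):

$$ \text{avg regret} \;=\; \frac{1}{T}\sum_{t=1}^{T} c^t(i_t) \;-\; \min_{i}\,\frac{1}{T}\sum_{t=1}^{T} c^t(i), $$

and a no-regret algorithm drives the expectation of this quantity to 0 (or below) as T → ∞.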

Some intuition & properties of no-regret algs.
• Let's look at a small example:

                   World – life – fate
   Algorithm           1   0
   (paths to dest)     0   1

  Note: we are not trying to compete with the best adaptive strategy – just the best fixed row (path) in hindsight.
• Why is this dependence on T optimal? (Use the same 2x2 matrix.)
  – Say the world flips a fair coin each day to pick a column.
  – Any alg, in T days, has expected cost T/2.
  – But E[min(#heads, #tails)] = T/2 – Θ(√T).
  – So, the expected avg per-day gap is Θ(1/√T).
• No-regret algorithms can do much better than playing minimax optimal, and never much worse.
• Existence of no-regret algs yields an immediate proof of the minimax theorem (will see in a bit).
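A tiny simulation sketch (Python, illustrative; not from the slides) of the coin-flip argument above: the gap T/2 − E[min(#heads, #tails)] grows like √T, so the per-day gap shrinks like 1/√T.

```python
import random

def average_gap(T, trials=2000):
    """Estimate E[T/2 - min(#heads, #tails)] over T fair coin flips."""
    total = 0.0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(T))
        total += T / 2 - min(heads, T - heads)
    return total / trials

for T in (100, 400, 1600, 6400):
    gap = average_gap(T)
    # gap / sqrt(T) should hover around a constant (~0.4), consistent with Theta(sqrt(T)).
    print(T, round(gap, 1), round(gap / T ** 0.5, 2))
```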

History and development (abridged)
• [Hannan'57, Blackwell'56]: Alg with avg regret O(√(N/T)).
  – Re-phrasing: need only T = O(N/ε²) steps to get the time-average regret down to ε. (Will call this quantity T_ε.)
  – Optimal dependence on T (or ε). Game-theorists viewed the #rows N as a constant, not so important as T, so pretty much done.
• Learning theory, '80s–'90s: "combining expert advice". Imagine a large class C of N prediction rules.
  – Perform (nearly) as well as the best f ∈ C.
  – [LittlestoneWarmuth'89]: Weighted-majority algorithm.
  – E[cost] ≤ OPT·(1+ε) + (log N)/ε.
  – Regret O(√((log N)/T)). T_ε = O((log N)/ε²).
  – Optimal as a function of N too, plus lots of work on exact constants, 2nd-order terms, etc. [CFHHSW93]…
• Extensions to the bandit model (adds an extra factor of N).

To think about this, let's look at the problem of "combining expert advice".

Using "expert" advice
Say we want to predict the stock market.
• We solicit n "experts" for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction.
• Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight?
["expert" = someone with an opinion. Not necessarily someone who knows anything.]

Simpler question
• We have n "experts".
• One of these is perfect (never makes a mistake). We just don't know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?
Answer: sure. Just take a majority vote over all experts that have been correct so far.
• Each mistake cuts the # of available experts by a factor of 2.
• Note: this means it is OK for n to be very large.
This is the "halving algorithm".
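A minimal sketch of the halving algorithm in Python (illustrative; the function name and input format are my own, assuming boolean up/down predictions and that some expert is perfect):

```python
def halving_algorithm(expert_predictions, outcomes):
    """Predict by majority vote over the experts that have never been wrong.

    expert_predictions: list of T lists, each holding n boolean predictions.
    outcomes: list of the T boolean outcomes.
    If some expert is perfect, the number of mistakes is at most log2(n)."""
    n = len(expert_predictions[0])
    alive = set(range(n))              # experts with no mistakes so far
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        votes_up = sum(1 for i in alive if preds[i])
        prediction = votes_up >= len(alive) / 2   # majority vote (ties -> up)
        if prediction != outcome:
            mistakes += 1              # a mistake means >= half of 'alive' was wrong
        alive = {i for i in alive if preds[i] == outcome}
    return mistakes
```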

What if no expert is perfect?
One idea: just run the above protocol until all experts are crossed off, then repeat.
• Makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).
• Seems wasteful. Constantly forgetting what we've "learned". Can we do better?

Weighted Majority Algorithm
Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing it off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on a weighted majority vote.
– Penalize mistakes by cutting the weight in half.
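A sketch of the Weighted Majority algorithm just described (Python, illustrative; the name, the `penalty` parameter, and the input format are my own):

```python
def weighted_majority(expert_predictions, outcomes, penalty=0.5):
    """Deterministic Weighted Majority: vote with weights, and multiply the
    weight of each mistaken expert by `penalty` (1/2 in the slides)."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        weight_up = sum(w[i] for i in range(n) if preds[i])
        weight_down = sum(w) - weight_up
        prediction = weight_up >= weight_down      # weighted majority vote
        if prediction != outcome:
            mistakes += 1     # with penalty=1/2, total weight drops >= 25% here
        for i in range(n):
            if preds[i] != outcome:
                w[i] *= penalty                    # lower, don't eliminate
    return mistakes
```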

Analysis: do nearly as well as best expert in hindsight
• M = # mistakes we've made so far.
• m = # mistakes the best expert has made so far.
• W = total weight (starts at n).
• After each of our mistakes, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• Weight of the best expert is (1/2)^m. So,
  (1/2)^m ≤ n(3/4)^M, which solves to M ≤ 2.4(m + lg n): a constant ratio, plus the log term.
So, if m is small, then M is pretty small too.

Randomized Weighted Majority
• 2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
• Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then predict up with probability 0.7 and down with probability 0.3.) Idea: smooth out the worst case.
• Also, generalize ½ to 1 − ε.
• M = expected # of mistakes.
• Unlike most worst-case bounds, the numbers here are pretty good.

Analysis
• Say at time t we have a fraction F_t of the weight on experts that made a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
  – W_final = n(1 − εF_1)(1 − εF_2)…
  – ln(W_final) = ln(n) + Σ_t ln(1 − εF_t) ≤ ln(n) − ε Σ_t F_t   (using ln(1−x) < −x)
                = ln(n) − εM.            (Σ_t F_t = E[# mistakes] = M)
• If the best expert makes m mistakes, then ln(W_final) > ln((1−ε)^m).
• Now solve: ln(n) − εM > m ln(1−ε).

Summarizing
• E[# mistakes] ≤ (1+ε)m + ε⁻¹ log(n).
• If we set ε = √(log(n)/m) to balance the two terms (or use guess-and-double), we get the bound
  E[# mistakes] ≤ m + 2√(m·log n).
• Since m ≤ T, this is at most m + 2√(T log n).
• So, avg regret = 2√(T log n)/T → 0.
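A sketch of Randomized Weighted Majority matching the analysis above (Python, illustrative; same made-up input format as the earlier sketches, 0/1 losses):

```python
import random

def randomized_weighted_majority(expert_predictions, outcomes, eps=0.1):
    """Follow a random expert chosen with probability proportional to its weight,
    and multiply the weight of every mistaken expert by (1 - eps).
    Expected mistakes <= (1 + eps)*m + log(n)/eps, where m is the best expert's count."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        i = random.choices(range(n), weights=w, k=1)[0]   # prob proportional to weight
        if preds[i] != outcome:
            mistakes += 1
        for j in range(n):
            if preds[j] != outcome:
                w[j] *= 1 - eps
    return mistakes
```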

What can we use this for?
• Can use it to combine multiple algorithms and do nearly as well as the best one in hindsight.
• But what about cases like choosing paths to work, where the "experts" are different actions, not different predictions?

Game-theoretic version
• What if experts are actions? (Paths in a network, rows in a matrix game, …)
• At each time t, each action has a loss (cost) in {0,1}.
• Can still run the algorithm:
  – Rather than viewing it as "pick a prediction with probability proportional to its weight",
  – view it as "pick an expert with probability proportional to its weight".
  – Choose expert i with probability p_i = w_i / Σ_i w_i.
• The same analysis applies.

Game-theoretic version, contd
• What if the losses (costs) are in [0,1]?
• If expert i has cost c_i, do: w_i ← w_i(1 − εc_i).
• Our expected cost = Σ_i c_i w_i / W.
• Amount of weight removed = ε Σ_i w_i c_i.
• So, the fraction of weight removed = ε · (our expected cost).
• The rest of the proof continues as before…
Guarantee: E[cost] ≤ OPT + 2√(OPT·log n).
• Since OPT ≤ T, this is at most OPT + 2√(T log n).
• So, regret per time step ≤ 2√(T log n)/T → 0.
So, now we can drive to work! (Assuming full feedback.)

Illustration
[Figure: the World – life – fate matrix again; after t days, row i's weight is (1 − εc_i^1)(1 − εc_i^2)…, where c_i^t is row i's cost on day t, with costs scaled to lie in [0,1].]
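A sketch (Python, illustrative; names are my own) of the [0,1]-cost version: each route is an "expert", the full cost vector is observed each day, and weights are updated by w_i ← w_i(1 − εc_i).

```python
import random

def no_regret_route_picker(cost_vectors, eps=0.1):
    """RWM for actions with costs in [0, 1]: expected total cost is at most
    OPT + 2*sqrt(OPT * log n), where OPT is the best fixed route's total cost.
    cost_vectors: list of T lists, each giving that day's cost of the n routes."""
    n = len(cost_vectors[0])
    w = [1.0] * n
    total_cost = 0.0
    for costs in cost_vectors:
        route = random.choices(range(n), weights=w, k=1)[0]  # prob proportional to weight
        total_cost += costs[route]
        for i in range(n):
            w[i] *= 1 - eps * costs[i]     # multiplicative update with cost in [0, 1]
    return total_cost
```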

Connections to Minimax Optimality

Minimax-optimal strategies
• Can solve for minimax-optimal strategies using linear programming.
• Claim: no-regret strategies will do nearly as well or better against any sequence of opponent plays.
  – They do nearly as well as the best fixed choice in hindsight.
  – This implies doing nearly as well as the best distribution in hindsight.
  – Which implies doing nearly as well as minimax optimal!

                Left        Right
   Left      (½, -½)      (1, -1)
   Right     (1, -1)      (0, 0)
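For reference, a standard LP formulation for the Row player's minimax-optimal strategy (standard material, not spelled out on the slides; M(i,j) denotes Row's payoff):

$$ \max_{p,\,v}\; v \quad \text{s.t.} \quad \sum_i p_i\,M(i,j) \ge v \;\;\forall j, \qquad \sum_i p_i = 1, \qquad p_i \ge 0. $$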

Proof of minimax thm using RWM
• Suppose for contradiction it was false.
• This means some game G has V_C > V_R:
  – If the Column player commits first, there exists a row that gets the Row player at least V_C.
  – But if the Row player has to commit first, the Column player can make him get only V_R.
• Scale the matrix so payoffs to Row are in [-1, 0]. Say V_R = V_C − δ.

Proof contd
• Now, consider playing the randomized weighted-majority alg as Row, against a Col who plays optimally against Row's distribution.
• In T steps,
  – Alg gets ≥ [best row in hindsight] − 2√(T log n).
  – BRiH ≥ T·V_C   [best response to the opponent's empirical distribution].
  – Alg ≤ T·V_R    [each time, the opponent knows your randomized strategy].
  – The gap is δT. This contradicts the assumption once δT > 2√(T log n), i.e., once T > 4 log(n)/δ².
• Note that our procedure gives a fast way to compute apx minimax-optimal strategies, if we can simulate Col (best-response) quickly. (A sketch follows below.)

What if two RWMs play each other?
• Can anyone see the argument that their time-average strategies must be approaching minimax optimality?
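A sketch (Python, illustrative; the function name and parameters are my own) of the procedure noted above: run RWM as Row against a Column player that best-responds each round; Row's time-averaged mixed strategy is then approximately minimax optimal.

```python
def approx_minimax_row_strategy(M, T=20000, eps=0.05):
    """Approximate a minimax-optimal mixed strategy for Row by running RWM
    against a best-responding Column player.
    M[i][j] = payoff to Row, assumed to lie in [0, 1] (Row maximizes)."""
    n, m = len(M), len(M[0])
    w = [1.0] * n
    avg = [0.0] * n
    for _ in range(T):
        total = sum(w)
        p = [wi / total for wi in w]            # Row's current mixed strategy
        # Column best-responds to p: the column minimizing Row's expected payoff.
        j = min(range(m), key=lambda col: sum(p[i] * M[i][col] for i in range(n)))
        for i in range(n):
            cost_i = 1 - M[i][j]                # payoff in [0,1] -> cost in [0,1]
            w[i] *= 1 - eps * cost_i
        for i in range(n):
            avg[i] += p[i] / T
    return avg

# For the 2x2 game above (Row payoffs 1/2, 1 / 1, 0), this should return
# roughly (2/3, 1/3), whose value to Row is 2/3.
print(approx_minimax_row_strategy([[0.5, 1.0], [1.0, 0.0]]))
```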

Internal/Swap Regret and Correlated Equilibria

What if all players minimize regret?
• In zero-sum games, the empirical frequencies quickly approach minimax optimal.
• In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium?
  – After all, a Nash equilibrium is exactly a set of distributions that have no regret with respect to each other. So if the distributions stabilize, they must converge to a Nash equilibrium.
  – Well, unfortunately, no.

What can we say?
If algorithms minimize "internal" or "swap" regret, then the empirical distribution of play approaches a correlated equilibrium.
• Foster & Vohra, Hart & Mas-Colell, …
• Though this doesn't imply that play is stabilizing.
What are internal/swap regret and correlated equilibria?

More general forms of regret
1. "best expert" or "external" regret:
   – Given n strategies. Compete with the best of them in hindsight.
2. "sleeping expert" or "regret with time-intervals":
   – Given n strategies and k properties. Let S_i be the set of days satisfying property i (these might overlap). Want to simultaneously achieve low regret over each S_i.
3. "internal" or "swap" regret: like (2), except that S_i = the set of days on which we chose strategy i.
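In symbols (formalizing item 3; j_t is the action chosen on day t and c^t is that day's cost vector):

$$ \text{swap regret} \;=\; \sum_{t=1}^{T} c^t(j_t) \;-\; \min_{f:\{1,\dots,n\}\to\{1,\dots,n\}} \sum_{t=1}^{T} c^t\big(f(j_t)\big). $$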

Internal/swap-regret
• E.g., each day we pick one stock to buy shares in.
  – We don't want to have regret of the form "every time I bought IBM, I should have bought Microsoft instead".
• Formally, swap regret is with respect to the optimal function f: {1,…,n} → {1,…,n} such that every time you played action j, it plays f(j).
• In general-sum games, if all players have low swap regret, then the empirical distribution of play is an approximate correlated equilibrium.

Weird… why care?
"Correlated equilibrium"
• A distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
• E.g., the Shapley game:

           R          P          S
   R    (-1,-1)    (-1, 1)    ( 1,-1)
   P    ( 1,-1)    (-1,-1)    (-1, 1)
   S    (-1, 1)    ( 1,-1)    (-1,-1)

  (For instance, the uniform distribution over the six off-diagonal entries is a correlated equilibrium.)

Connection
• If all parties run a low-swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium.
  – Correlator chooses a random time t ∈ {1,2,…,T}. Tells each player to play the action j they played at time t (but does not reveal the value of t).
  – Expected incentive to deviate: Σ_j Pr(j)·(Regret | j) = the swap regret of the algorithm.
  – So, this suggests correlated equilibria may be natural things to see in multi-agent systems where individuals are optimizing for themselves.

Correlated vs Coarse-correlated Eq
In both cases: a distribution over entries in the matrix. Think of a third party choosing from this distribution and telling you your part as "advice".
"Correlated equilibrium"
• You have no incentive to deviate, even after seeing what the advice is.
"Coarse-correlated equilibrium"
• If your only choice is to see and follow the advice, or not to see it at all, you would prefer the former.
Low external regret ⇒ approximate coarse correlated equilibrium.
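A sketch (Python, illustrative; the names and the payoff interface are my own) that makes the connection concrete: a player's incentive to deviate from the "replay a random time step" correlator above equals that player's average swap regret, so if every player's average swap regret is at most ε, the empirical play is an ε-correlated equilibrium.

```python
def average_swap_regret(history, payoff, player, n_actions):
    """history: list of joint action profiles (tuples of ints), one per time step.
    payoff(profile, player): that player's payoff for a joint profile.
    Returns the player's average swap regret: the per-step gain of the best
    function f that replaces every played action j by f(j)."""
    T = len(history)
    gain = 0.0
    for j in range(n_actions):
        rounds_j = [prof for prof in history if prof[player] == j]
        if not rounds_j:
            continue
        actual = sum(payoff(prof, player) for prof in rounds_j)
        best_swap = max(
            sum(payoff(prof[:player] + (k,) + prof[player + 1:], player)
                for prof in rounds_j)
            for k in range(n_actions)
        )
        gain += best_swap - actual   # >= 0, since k = j is allowed
    return gain / T                  # <= eps for all players  =>  eps-correlated eq
```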

Internal/swap-regret, contd
Algorithms for achieving low regret of this form:
– Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
– Will present the method of [BM05], showing how to convert any "best expert" algorithm into one achieving low swap regret.
– Unfortunately, the # of steps needed to achieve low swap regret is O(n log n) rather than O(log n).

Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for our expected regret over the times we play j.
  [Diagram: the cost vector c arrives; each copy A_j proposes a distribution q_j; the rows q_j form a matrix Q; we play p = pQ, and copy A_j receives the scaled cost vector p_j·c.]
– This allows us to view p_j either as the prob we play action j, or as the prob we play alg A_j.
– Give A_j feedback of p_j·c.
– A_j guarantees Σ_t (p_j^t c^t)·q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Write this as: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term].
  The left side is our total cost (since we play p^t = p^t Q^t). The right side says that, for each j, we could have moved all the probability we put on j to its own best fixed action i = f(j).
– So we get swap regret at most n times the original external regret.
