15-859(B) Machine Learning Theory
Learning and Game Theory
Avrim Blum

Plan for Today & Next Time
• 2-player zero-sum games
• 2-player general-sum games
  – Nash equilibria
  – Correlated equilibria
• Internal/swap regret and connection to correlated equilibria
• Many-player games with structure: congestion games / exact potential games
  – Best-response dynamics
  – Price of anarchy, price of stability

2-Player Zero-Sum Games
• Two players R and C. Zero-sum means that what's good for one is bad for the other.
• Game defined by a matrix with a row for each of R's options and a column for each of C's options. The matrix tells who wins how much.
  – An entry (x,y) means: x = payoff to the row player, y = payoff to the column player. "Zero sum" means that y = -x.
• E.g., penalty shot:

                        goalie
                   Left        Right
  shooter  Left    (0,0)       (1,-1)
           Right   (1,-1)      (0,0)

  (Off-diagonal: GOAALLL!!!  Diagonal: no goal.)

Game Theory Terminology
• Rows and columns are called pure strategies.
• Randomized algorithms are called mixed strategies.
• "Zero sum" means the game is purely competitive: every entry (x,y) satisfies x+y = 0. (The game doesn't have to be fair.)

Minimax-Optimal Strategies
• A minimax-optimal strategy is a (randomized) strategy that has the best guarantee on its expected gain, over choices of the opponent. [It maximizes the minimum.]
• I.e., the thing to play if your opponent knows you well.
• Can solve for minimax-optimal strategies using linear programming.
• No-regret strategies will do nearly as well or better.
• E.g., in the penalty-shot game above, the minimax-optimal strategy for each player is 50/50.

Minimax Theorem (von Neumann 1928)
• Every 2-player zero-sum game has a unique value V.
• The minimax optimal strategy for R guarantees R's expected gain at least V.
• The minimax optimal strategy for C guarantees C's expected loss at most V.
• Existence of no-regret strategies gives one way of proving the theorem.

Interesting Game to Think About
• Graph G, source s, sink t.
• Player A chooses a path P from s to t. Player B chooses an edge e in G.
• If e is in P, B wins. Else A wins.
• What is the minimax optimal strategy for B? For A?
• Note: can run RWM (Randomized Weighted Majority) for B, and best-response for A (a shortest-path algorithm on B's weights), to get approximately-minimax-optimal strategies.
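The "no-regret player vs. best-responding opponent" idea can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: the row player (shooter) runs multiplicative weights on the penalty-shot matrix while the column player (goalie) best-responds each round; the average gain approaches the game value 0.5.

```python
import numpy as np

# Payoff matrix to the row player (shooter) in the penalty-shot game:
# rows = shooter's action, cols = goalie's action.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def rwm_vs_best_response(M, T=5000, eta=0.05):
    """Row player runs multiplicative weights on gains; the column player
    best-responds to the row player's current mixed strategy each round."""
    n = M.shape[0]
    w = np.ones(n)
    avg_row = np.zeros(n)
    total_gain = 0.0
    for _ in range(T):
        p = w / w.sum()
        # Column player picks the column minimizing the row player's gain.
        col = int(np.argmin(p @ M))
        total_gain += p @ M[:, col]
        # Multiplicative update on the realized gain vector.
        w *= np.exp(eta * M[:, col])
        w /= w.max()  # keep weights bounded
        avg_row += p
    return total_gain / T, avg_row / T

avg_gain, avg_strategy = rwm_vs_best_response(M)
# avg_gain approaches the value 0.5; avg_strategy approaches (0.5, 0.5).
```

Since the opponent best-responds, each round's gain is at most the value V = 0.5, and the no-regret guarantee keeps the average within the regret term of V.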

Now, to General-Sum Games...

General-Sum Games
• In general-sum games, can get win-win and lose-lose situations.
• E.g., "what side of sidewalk to walk on?":

                      person walking towards you
                      Left          Right
  you    Left         (1,1)         (-1,-1)
         Right        (-1,-1)       (1,1)

General-Sum Games
• Another example, "which movie should we go to?":

                      Bully         Hunger Games
  Bully               (8,2)         (0,0)
  Hunger Games        (0,0)         (2,8)

• No longer a unique "value" to the game.

Nash Equilibrium
• A Nash Equilibrium is a stable pair of strategies (could be randomized).
• "Stable" means that neither player has an incentive to deviate on their own.
• E.g., in the sidewalk game, the NE are: both Left, both Right, or both 50/50.
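The stability condition for pure strategies is easy to check mechanically. Below is an illustrative sketch (not from the lecture) that enumerates the pure-strategy Nash equilibria of the movie game; the function name `pure_nash` is my own.

```python
import itertools
import numpy as np

# Payoffs for the "which movie?" game: actions 0 = Bully, 1 = Hunger Games.
R = np.array([[8, 0],
              [0, 2]])   # row player's payoffs
C = np.array([[2, 0],
              [0, 8]])   # column player's payoffs

def pure_nash(R, C):
    """All pure-strategy Nash equilibria (i, j): i is a best response to
    column j, and j is a best response to row i."""
    eq = []
    for i, j in itertools.product(range(R.shape[0]), range(C.shape[1])):
        if R[i, j] >= R[:, j].max() and C[i, j] >= C[i, :].max():
            eq.append((i, j))
    return eq

print(pure_nash(R, C))  # → [(0, 0), (1, 1)]: both Bully, or both Hunger Games
```

(The 50/50-style mixed equilibrium is not found by this enumeration; it only checks the finitely many pure profiles.)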

Uses
• Economists use games and equilibria as models of interaction.
• E.g., pollution / prisoner's dilemma (imagine pollution controls cost $4 but improve everyone's environment by $3):

                     don't pollute   pollute
  don't pollute      (2,2)           (-1,3)
  pollute            (3,-1)          (0,0)

  Need to add extra incentives to get good overall behavior.

NE Can Do Strange Things
• Braess paradox:
  – Road network, traffic going from s to t; travel time on an edge is a function of the fraction x of traffic using that edge.
  – Two routes, each with one edge of travel time t(x) = x and one edge of travel time 1, independent of traffic.
  – Fine. NE is a 50/50 split; travel time = 1.5.

• Braess paradox, continued: add a new zero-cost "superhighway" connecting the midpoints of the two routes. Now the NE is for everyone to use the zig-zag path (x-edge, superhighway, x-edge). Travel time = 2: adding a road made everyone worse off.

Existence of NE
• Nash (1950) proved: any general-sum game must have at least one such equilibrium.
  – Might require mixed strategies.
• This also yields the minimax theorem as a corollary:
  – Pick some NE and let V = value to the row player in that equilibrium.
  – Since it's a NE, neither player can do better even knowing the (randomized) strategy their opponent is playing.
  – So, they're each playing minimax optimal.
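The Braess computation is short enough to check directly. An illustrative sketch, assuming the standard four-node network: top route is an x-edge then a 1-edge, bottom route a 1-edge then an x-edge, and the superhighway (cost 0) links the two middle nodes.

```python
# Braess paradox sketch: fraction x of traffic on a variable edge costs x;
# constant edges cost 1.

def travel_time_no_shortcut(frac_top):
    """Travel time on each route when frac_top of traffic takes the top route."""
    top = frac_top + 1            # x-edge + constant edge
    bottom = 1 + (1 - frac_top)   # constant edge + x-edge
    return top, bottom

# Equilibrium without the shortcut: 50/50 split, both routes cost 1.5.
print(travel_time_no_shortcut(0.5))  # → (1.5, 1.5)

# With the zero-cost superhighway, the zig-zag path costs x + 0 + x, which
# (weakly) beats each original route at every split, so at the NE everyone
# uses it and both x-edges carry all traffic:
zigzag_time = 1.0 + 0.0 + 1.0
print(zigzag_time)  # → 2.0  (worse than the 1.5 before the new road)
```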

Existence of NE in 2-Player Games
• The proof will be non-constructive.
• Unlike the case of zero-sum games, we do not know any polynomial-time algorithm for finding Nash Equilibria in n x n general-sum games. [Known to be "PPAD-hard".]
• Notation:
  – Assume an n x n matrix.
  – Use (p_1,...,p_n) to denote a mixed strategy for the row player, and (q_1,...,q_n) to denote a mixed strategy for the column player.

Proof
• We'll start with Brouwer's fixed point theorem:
  – Let S be a compact convex region in R^n and let f: S -> S be a continuous function.
  – Then there must exist x in S such that f(x) = x.
  – x is called a "fixed point" of f.
  – Simple case: S is the interval [0,1].
• We will care about:
  – S = {(p,q): p,q are legal probability distributions on 1,...,n}. I.e., S = simplex_n x simplex_n.

Proof (cont)
• S = {(p,q): p,q are mixed strategies}.
• Want to define f(p,q) = (p',q') such that:
  – f is continuous. This means that changing p or q a little bit shouldn't cause p' or q' to change a lot.
  – Any fixed point of f is a Nash Equilibrium.
• Then Brouwer will imply existence of NE.

Try #1
• What about f(p,q) = (p',q') where p' is a best response to q, and q' is a best response to p?
• Problem: not necessarily well-defined:
  – E.g., penalty shot: if p = (0.5, 0.5) then q' could be anything.

Try #1 (cont)
• Problem: also not continuous:
  – E.g., in the penalty-shot game, if p = (0.51, 0.49) then q' = (1,0); if p = (0.49, 0.51) then q' = (0,1).

Instead We Will Use...
• f(p,q) = (p',q') such that:
  – q' maximizes [(expected gain wrt p) - ||q - q'||^2]
  – p' maximizes [(expected gain wrt q) - ||p - p'||^2]
• Note: quadratic + linear = quadratic.

• f is well-defined and continuous, since a quadratic has a unique maximum and a small change to p,q moves that maximum only a little.
• Also, fixed point = NE: even a tiny incentive to move would make the maximizer move a little bit, so at a fixed point there is no incentive to deviate at all.
• So, that's it!
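One way to see that this map is computable: completing the square shows that maximizing (expected gain wrt q) - ||p - p'||^2 over the simplex is exactly the Euclidean projection of p + Rq/2 onto the simplex. A sketch (not from the lecture; `project_simplex` uses a standard sort-based projection, and `nash_map` is my own name):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def nash_map(p, q, R, C):
    """One application of the map from the proof: p' maximizes
    (expected gain wrt q) - ||p - p'||^2 over the simplex, and similarly q'.
    Completing the square gives p' = Proj_simplex(p + R q / 2)."""
    p_new = project_simplex(p + R @ q / 2)
    q_new = project_simplex(q + C.T @ p / 2)
    return p_new, q_new

# Penalty-shot (zero-sum) payoffs: column player's payoffs are the negatives.
R_pay = np.array([[0.0, 1.0], [1.0, 0.0]])
C_pay = -R_pay

# The NE p = q = (0.5, 0.5) is a fixed point of the map:
p = q = np.array([0.5, 0.5])
p2, q2 = nash_map(p, q, R_pay, C_pay)
print(p2, q2)  # both (0.5, 0.5) again
```

Note this only illustrates that the map is well-defined and that the NE is a fixed point; the existence proof uses Brouwer, not iteration of the map.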

Internal Regret and Correlated Equilibria

What If All Players Minimize Regret?
• In zero-sum games, the empirical frequencies quickly approach minimax optimal.
• In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium? (After all, a Nash Eq is exactly a set of distributions that are no-regret wrt each other.)
• Well, unfortunately, no.

A Bad Example for General-Sum Games
• Augmented Shapley game from [Z04]: "RPSF".
  – The first 3 rows/cols are the Shapley game (rock/paper/scissors, but if both do the same action then both lose).
  – The 4th action, "play foosball", has a slight negative payoff if the other player is still doing r/p/s, but a positive payoff if the other player does the 4th action too.
• No-regret algorithms will cycle among the first 3 actions and have no regret, but do worse than the only Nash Equilibrium: both playing foosball.
• We didn't really expect this to work, given how hard NE can be to find...

What Can We Say?
• If algorithms minimize "internal" or "swap" regret, then the empirical distribution of play approaches a correlated equilibrium.
  – Foster & Vohra, Hart & Mas-Colell, ...
  – Though this doesn't imply play is stabilizing.
• What are internal regret and correlated equilibria?

More General Forms of Regret
1. "Best expert" or "external" regret:
   – Given n strategies, compete with the best of them in hindsight.
2. "Sleeping expert" or "regret with time-intervals":
   – Given n strategies and k properties, let S_i be the set of days satisfying property i (these might overlap). Want to simultaneously achieve low regret over each S_i.
3. "Internal" or "swap" regret: like (2), except that S_i = the set of days on which we chose strategy i.

Internal/Swap Regret
• E.g., each day we pick one stock to buy shares in.
  – Don't want to have regret of the form "every time I bought IBM, I should have bought Microsoft instead".
• Formally, regret is wrt the optimal function f: {1,...,N} -> {1,...,N} such that every time you played action j, it plays f(j).
• Motivation: connection to correlated equilibria.
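The definition of swap regret can be computed directly from a play history: for each action j, find the best fixed replacement f(j) over the days we played j. An illustrative sketch (the function name is my own):

```python
import numpy as np

def swap_regret(actions, gains):
    """Swap regret of a play sequence.
    actions[t] = index played at time t; gains[t] = gain vector at time t.
    Compares to the best modification rule f: every time we played j,
    f would have played f(j) instead."""
    actions = np.asarray(actions)
    gains = np.asarray(gains)
    n = gains.shape[1]
    actual = gains[np.arange(len(actions)), actions].sum()
    best_swapped = 0.0
    for j in range(n):
        rows = gains[actions == j]                   # all times we played j
        if len(rows):
            best_swapped += rows.sum(axis=0).max()   # best fixed f(j)
    return best_swapped - actual

# Toy sequence: whenever we played 0, action 1 was better, and vice versa.
acts = [0, 0, 1]
g = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(swap_regret(acts, g))  # → 3.0  (f(0)=1 gains 2, f(1)=0 gains 1)
```

External regret would only compare to the single best fixed action; swap regret is at least as large, since f is chosen per-action.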

Correlated Equilibria
• "Correlated equilibrium":
  – A distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
  – E.g., the Shapley game:

           R          P          S
  R     (-1,-1)    (-1,1)     (1,-1)
  P     (1,-1)     (-1,-1)    (-1,1)
  S     (-1,1)     (1,-1)     (-1,-1)

Internal/Swap Regret and Correlated Equilibria
• If all parties run a low internal/swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium.
  – The correlator chooses a random time t in {1,2,...,T} and tells each player to play the action j they played at time t (but does not reveal the value of t).
  – Expected incentive to deviate: sum_j Pr(j) * (Regret | j) = swap regret of the algorithm.
  – So, this says that correlated equilibria are a natural thing to see in multi-agent systems where individuals are optimizing for themselves.
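The "no incentive to deviate" condition is a finite set of linear inequalities, so it can be checked mechanically. An illustrative sketch (names are my own) that verifies the uniform distribution over the six off-diagonal cells of the Shapley game is a correlated equilibrium, while the uniform diagonal is not:

```python
import itertools
import numpy as np

def max_deviation_gain(A, B, D):
    """Largest gain any player gets from any swap rule applied to the signals
    of joint distribution D (rows x cols). <= 0 means D is a correlated eq."""
    worst = 0.0
    # Row player: told to play i, considers switching to i2.
    for i, i2 in itertools.product(range(A.shape[0]), repeat=2):
        worst = max(worst, (D[i] * (A[i2] - A[i])).sum())
    # Column player: told j, considers j2.
    for j, j2 in itertools.product(range(B.shape[1]), repeat=2):
        worst = max(worst, (D[:, j] * (B[:, j2] - B[:, j])).sum())
    return worst

# Shapley game payoffs from the slide (A = row player, B = column player).
A = np.array([[-1, -1, 1], [1, -1, -1], [-1, 1, -1]])
B = np.array([[-1, 1, -1], [-1, -1, 1], [1, -1, -1]])

D = (np.ones((3, 3)) - np.eye(3)) / 6   # uniform over off-diagonal cells
print(max_deviation_gain(A, B, D))      # → 0.0: a correlated equilibrium
```

Note this distribution is correlated (the signals are not independent), so it is not a product of mixed strategies: it is a correlated equilibrium that is not a Nash equilibrium.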

Internal/Swap Regret, contd
• Algorithms for achieving low regret of this form:
  – Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
  – Can also convert any "best expert" algorithm into one achieving low swap regret.
  – Unfortunately, the time to achieve low regret is linear in n rather than log(n)...

• Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
  – Instantiate one copy A_i responsible for expected regret over the times we play i.
  – At each time step, if we play p = (p_1,...,p_n) and get cost vector c = (c_1,...,c_n), then A_i gets cost vector p_i * c.
  – If each A_i proposes to play q_i, so that all together we have a matrix Q, then define p by p = pQ.
  – This allows us to view p_i either as the probability we chose action i, or as the probability we chose algorithm A_i.
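The step p = pQ asks for a stationary distribution of the row-stochastic matrix Q, which always exists and can be found as an eigenvector of Q^T with eigenvalue 1. An illustrative sketch of just this combining step (not a full swap-regret algorithm; the function name is my own):

```python
import numpy as np

def combine_proposals(Q):
    """Given a matrix Q whose i-th row is the distribution proposed by copy
    A_i, return p solving p = p Q: a stationary distribution of Q, found as
    the eigenvector of Q^T with eigenvalue 1."""
    vals, vecs = np.linalg.eig(Q.T)
    k = np.argmin(np.abs(vals - 1.0))
    p = np.abs(np.real(vecs[:, k]))
    return p / p.sum()

# Example: two copies propose q_1 = (0.9, 0.1) and q_2 = (0.2, 0.8).
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
p = combine_proposals(Q)
print(p, p @ Q)  # p ≈ (2/3, 1/3), and p @ Q reproduces p
```

This self-consistency is exactly what lets us interpret p_i both as the probability of playing action i and as the probability of "being" algorithm A_i.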

Congestion Games
• Many multi-agent interactions have structure. One nice class: congestion games.
• These always have a pure-strategy equilibrium.
• They have a potential function s.t. whenever a player switches, the potential drops by exactly that player's improvement.
  – So, best-response dynamics always reaches an equilibrium.
• Let's start with an example.

Fair Cost-Sharing
• n players in a weighted directed graph G. Player i wants to get from s_i to t_i, and players share the cost of the edges they use with others.

Good Equilibria, Bad Equilibria
• Fair cost-sharing, example 1: two parallel s-t edges, of cost 1 and cost n, with n players.
  – Good equilibrium: all use the cost-1 edge (cost 1/n per player).
  – Bad equilibrium: all use the cost-n edge (cost 1 per player).
  – Cost(bad equilib) = n * Cost(good equilib).
• Example 2, cars vs. shared transit: each of n players can drive their own car (a private edge of cost 1) or take shared transit (cost-0 connector edges into a shared edge of cost k, with k << n).
  – Good equilibrium: all share transit, paying k/n each.
  – Bad equilibrium: all drive, paying 1 each.
  – Note that here the bad equilibrium is what you'd expect from natural dynamics (players entering one at a time, etc.).

Price of Anarchy and Price of Stability
• Price of Anarchy (PoA): ratio of the worst equilibrium to the social optimum (worst-case over games in the class).
  – We saw that for cost-sharing, PoA = Omega(n). It is also O(n).
• Price of Stability (PoS): ratio of the best equilibrium to the social optimum (worst-case over games in the class).
  – For cost-sharing, PoS = Theta(log n).
• Exact potential function: a function Phi s.t. if a player moves, the potential changes by exactly as much as the cost of the player who moved.
  – Guarantees that best-response dynamics will reach a Nash equilibrium.

Potential Functions and PoS
For cost-sharing, PoS = O(log n):
• Given state S, let n_e = # players on edge e and c_e = cost of edge e. Then cost(S) = sum of c_e over used edges (those with n_e >= 1).
• Define the potential Phi(S) = sum over edges e of c_e * (1 + 1/2 + ... + 1/n_e).
• So, cost(S) <= Phi(S) <= log(n) * cost(S).
• Now consider best-response dynamics starting from OPT. Phi can only decrease. So, if we could tell people to play OPT, and everyone went along, then BR dynamics would lead to a good state.

Congestion Games More Generally
• Game defined by n players and m resources.
• Each player i chooses a set of resources (e.g., a path) from a collection S_i of allowable sets of resources (e.g., paths from s_i to t_i).
• The cost of resource j is a function f_j(n_j) of the number n_j of players using it.
• The cost incurred by player i is the sum, over all resources i is using, of the cost of the resource.
• Generic (Rosenthal) potential function: Phi(S) = sum over resources j of [f_j(1) + f_j(2) + ... + f_j(n_j)].
• Best-response dynamics may take a long time to reach equilibrium, but if the gap between Phi and cost is small, one can get to an approximate equilibrium fast.

Current/Recent Research Directions
(esp. in relation to machine learning)
• How much effort is needed to "nudge" simple best-response dynamics from a bad equilibrium to a good one?
• Are there natural dynamics that can manage to reach good equilibria on their own?
• Can one say anything interesting about "combining expert advice" types of problems where the quality of an expert depends on what the other players are doing? (In particular, in comparison to the best equilibrium.)
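Best-response dynamics in a congestion game can be sketched directly from these definitions. An illustrative sketch (not from the lecture; the encoding of strategies as tuples of resource indices is my own). Each accepted switch strictly decreases the Rosenthal potential, so the loop terminates at a pure Nash equilibrium:

```python
def resource_loads(profile, m):
    """n_j for each resource j, given each player's chosen resource set."""
    loads = [0] * m
    for choice in profile:
        for j in choice:
            loads[j] += 1
    return loads

def player_cost(profile, i, m, f):
    """Sum of resource costs f_j(n_j) over player i's chosen resources."""
    loads = resource_loads(profile, m)
    return sum(f[j](loads[j]) for j in profile[i])

def rosenthal_potential(profile, m, f):
    """Phi(S) = sum over resources j of f_j(1) + ... + f_j(n_j)."""
    loads = resource_loads(profile, m)
    return sum(f[j](k) for j in range(m) for k in range(1, loads[j] + 1))

def best_response_dynamics(profile, strategy_sets, m, f):
    """Let players switch to cheaper strategies until no one can improve."""
    profile = list(profile)
    improved = True
    while improved:
        improved = False
        for i, options in enumerate(strategy_sets):
            current = player_cost(profile, i, m, f)
            for s in options:
                trial = profile[:i] + [s] + profile[i + 1:]
                if player_cost(trial, i, m, f) < current:
                    profile = trial
                    improved = True
                    break
    return profile

# Toy example: 2 players, 2 resources, cost of a resource = number of users.
f = [lambda k: k, lambda k: k]
strategies = [[(0,), (1,)], [(0,), (1,)]]
eq = best_response_dynamics([(0,), (0,)], strategies, 2, f)
print(eq)  # the players separate onto different resources
```

Starting with both players crowded onto resource 0 (potential 1 + 2 = 3), one switch reaches the equilibrium with one player per resource (potential 1 + 1 = 2), matching the player's improvement of 1 exactly.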
