15-859(B) Machine Learning Theory
Learning and Game Theory
Avrim Blum

Plan for Today & Next Time
• 2-player zero-sum games
• 2-player general-sum games
  – Nash equilibria
  – Correlated equilibria
• Internal/swap regret and connection to correlated equilibria
• Many-player games with structure: congestion games / exact potential games
  – Best-response dynamics
  – Price of anarchy, price of stability
2-Player Zero-Sum Games
• Two players R and C. Zero-sum means that what's good for one is bad for the other.
• Game defined by a matrix with a row for each of R's options and a column for each of C's options. The matrix tells who wins how much.
• An entry (x,y) means: x = payoff to row player, y = payoff to column player. "Zero sum" means that y = -x.
• E.g., penalty shot:

                     goalie
                  Left      Right
  shooter Left    (0,0)     (1,-1)
          Right   (1,-1)    (0,0)

  (Mismatched sides: GOAALLL!!! — payoff 1 to the shooter. Same sides: no goal.)

Game Theory Terminology
• Rows and columns are called pure strategies.
• Randomized algorithms are called mixed strategies.
• "Zero sum" means that the game is purely competitive: every entry (x,y) satisfies x+y=0. (The game doesn't have to be fair.)
Minimax-Optimal Strategies
• A minimax-optimal strategy is a (randomized) strategy that has the best guarantee on its expected gain, over choices of the opponent. [Maximizes the minimum.]
• I.e., the thing to play if your opponent knows you well.

Minimax-Optimal Strategies, contd
• Can solve for minimax-optimal strategies using linear programming.
• No-regret strategies will do nearly as well or better.
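The LP formulation mentioned above can be sketched as follows; this is a minimal illustration (not from the slides), assuming `scipy` is available. We maximize a value v subject to the row player's mixed strategy p guaranteeing at least v against every column:

```python
# A minimal sketch (not from the slides) of solving for a minimax-optimal
# strategy by linear programming, assuming scipy is available.
# Variables: p_1..p_n (row player's mixed strategy) and v (guaranteed value).
# Maximize v subject to: for every column j, sum_i p_i * M[i][j] >= v.
import numpy as np
from scipy.optimize import linprog

def minimax_strategy(M):
    """M[i][j] = payoff to the row player. Returns (p, v)."""
    M = np.asarray(M, dtype=float)
    n, m = M.shape
    # Objective: minimize -v (i.e., maximize v). Variable vector x = (p, v).
    c = np.zeros(n + 1); c[-1] = -1.0
    # Constraints: v - sum_i p_i * M[i][j] <= 0 for each column j.
    A_ub = np.hstack([-M.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # p must be a probability distribution.
    A_eq = np.ones((1, n + 1)); A_eq[0, -1] = 0.0
    b_eq = np.ones(1)
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

# Penalty shot (payoffs to the shooter): minimax optimal is 50/50, value 1/2.
p, v = minimax_strategy([[0, 1], [1, 0]])
```

The column player's minimax-optimal strategy is the dual solution of the same LP, which is one way to see why the two guarantees meet at a single value V.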
Minimax Theorem (von Neumann 1928)
• Every 2-player zero-sum game has a unique value V.
• The minimax-optimal strategy for R guarantees R's expected gain is at least V.
• The minimax-optimal strategy for C guarantees C's expected loss is at most V.
• Existence of no-regret strategies gives one way of proving the theorem.

Interesting Game to Think About
• Graph G, source s, sink t.
• Player A chooses a path P from s to t.
• Player B chooses an edge e in G.
• If e is in P, B wins. Else A wins.
• What is the minimax-optimal strategy for B? For A?
• Note: we can run RWM for B, and best-response for A (a shortest-path algorithm on B's edge weights), to get approximately-minimax-optimal strategies.
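The RWM-vs-best-response dynamics just described can be sketched as follows. The tiny graph (one direct s-t edge versus a two-edge path) and the fixed learning rate are my choices for illustration; B runs RWM over edges while A best-responds with the path least likely to be hit:

```python
# A sketch (graph and learning rate are illustrative choices, not from the
# slides): player B runs RWM over edges; player A best-responds each round.
import numpy as np

# Two s-t paths: P0 = {direct edge}, P1 = {edge1, edge2}.
# paths[k] is a 0/1 indicator vector over the 3 edges (direct, edge1, edge2).
paths = np.array([[1, 0, 0], [0, 1, 1]], dtype=float)

eta, T = 0.02, 20000
w = np.ones(3)                     # RWM weights for B, one per edge
avg_q = np.zeros(3)                # running average of B's mixed strategy
for _ in range(T):
    q = w / w.sum()                # B's current edge distribution
    avg_q += q / T
    # A best-responds: pick the path least likely to contain B's edge.
    k = int(np.argmin(paths @ q))
    # B's gain vector: 1 for each edge on A's chosen path.
    w *= np.exp(eta * paths[k])
    w /= w.max()                   # rescale to avoid overflow

# The value of this game is 1/2: B should split its mass evenly between the
# direct edge and the two-edge path, and B's average strategy approaches that.
```

The no-regret guarantee implies B's time-averaged strategy is approximately minimax optimal, even though its per-round strategy keeps oscillating.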
Now, to General-Sum Games…

General-Sum Games
• In general-sum games, we can get win-win and lose-lose situations.
• E.g., "what side of the sidewalk to walk on?":

                 person walking towards you
                 Left       Right
  you   Left     (1,1)      (-1,-1)
        Right    (-1,-1)    (1,1)
General-Sum Games
• In general-sum games, we can get win-win and lose-lose situations.
• E.g., "which movie should we go to?":

                   Bully     Hunger Games
  Bully            (8,2)     (0,0)
  Hunger Games     (0,0)     (2,8)

• No longer a unique "value" to the game.

Nash Equilibrium
• A Nash equilibrium is a stable pair of strategies (could be randomized).
• Stable means that neither player has an incentive to deviate on their own.
• E.g., "what side of the sidewalk to walk on":

           Left      Right
  Left     (1,1)     (-1,-1)
  Right    (-1,-1)   (1,1)

• The NE are: both left, both right, or both 50/50.
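The three claimed equilibria of the sidewalk game can be checked mechanically: a strategy pair is a Nash equilibrium iff no pure deviation beats either player's current strategy. A small check (the code is my illustration, not from the slides):

```python
# Verify the claimed Nash equilibria of the sidewalk game: for each player,
# no pure deviation should do better than their current mixed strategy.
import numpy as np

A = np.array([[1, -1], [-1, 1]], dtype=float)   # row player's payoffs
B = A.copy()                                    # symmetric game: same payoffs

def is_nash(p, q, eps=1e-9):
    p, q = np.asarray(p, float), np.asarray(q, float)
    row_val = p @ A @ q          # row player's expected payoff
    col_val = p @ B @ q          # column player's expected payoff
    return (A @ q).max() <= row_val + eps and (p @ B).max() <= col_val + eps

# Both-left, both-right, and both 50/50 are equilibria:
for p, q in [([1, 0], [1, 0]), ([0, 1], [0, 1]), ([.5, .5], [.5, .5])]:
    assert is_nash(p, q)

# By contrast, mismatched sides are not an equilibrium:
assert not is_nash([1, 0], [0, 1])
```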
Uses
• Economists use games and equilibria as models of interaction.
• E.g., pollution / prisoner's dilemma (imagine pollution controls cost $4 but improve everyone's environment by $3):

                    don't pollute   pollute
  don't pollute     (2,2)           (-1,3)
  pollute           (3,-1)          (0,0)

• Need to add extra incentives to get good overall behavior.

NE Can Do Strange Things
• Braess paradox:
  – Road network, traffic going from s to t.
  – Travel time on each edge is a function of the fraction x of traffic on that edge: either travel time = 1 (independent of traffic) or t(x) = x.
  – Two parallel routes from s to t, each consisting of one t(x)=x edge and one constant-time-1 edge.
  – Fine. NE is 50/50. Travel time = 1.5.
NE Can Do Strange Things, contd
• Now add a new 0-cost superhighway connecting the two routes. NE: everyone uses the zig-zag path (the two t(x)=x edges joined by the superhighway). Travel time = 2.

Existence of NE
• Nash (1950) proved: any general-sum game must have at least one such equilibrium.
  – Might require mixed strategies.
• This also yields the minimax theorem as a corollary:
  – Pick some NE and let V = value to the row player in that equilibrium.
  – Since it's a NE, neither player can do better even knowing the (randomized) strategy their opponent is playing.
  – So, they're each playing minimax optimal.
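The Braess numbers can be verified with a few lines. The encoding of the network is mine (top route: x-edge then 1-edge; bottom route: 1-edge then x-edge; zig-zag: both x-edges joined by the free superhighway):

```python
# Check the Braess paradox arithmetic: the 50/50 split gives travel time 1.5
# before the superhighway; afterwards, everyone on the zig-zag path gives 2.

def route_times(flows):
    """flows = (f_top, f_bottom, f_zigzag): fraction of traffic per route.
    Congestible edges have t(x) = x; fixed edges have time 1; the
    superhighway has time 0."""
    f_top, f_bot, f_zig = flows
    x_first = f_top + f_zig      # load on the first congestible edge
    x_second = f_bot + f_zig     # load on the second congestible edge
    return (x_first + 1, 1 + x_second, x_first + x_second)

# Before the superhighway: 50/50 over the two routes, travel time 1.5.
t_top, t_bot, t_zig = route_times((0.5, 0.5, 0.0))
assert t_top == t_bot == 1.5
# Once the highway exists, the zig-zag route at this split takes only 1.0,
# so the 50/50 split is no longer an equilibrium and traffic migrates.
assert t_zig == 1.0

# After migration: everyone on the zig-zag path, travel time 2, and no
# (infinitesimal) deviator can do better.
t_top, t_bot, t_zig = route_times((0.0, 0.0, 1.0))
assert t_zig == 2.0
assert t_top >= t_zig and t_bot >= t_zig
```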
Existence of NE in 2-Player Games
• The proof will be non-constructive.
• Unlike the case of zero-sum games, we do not know any polynomial-time algorithm for finding Nash equilibria in n × n general-sum games. [Known to be "PPAD-hard".]
• Notation:
  – Assume an n×n matrix.
  – Use (p_1,...,p_n) to denote a mixed strategy for the row player, and (q_1,...,q_n) to denote a mixed strategy for the column player.

Proof
• We'll start with Brouwer's fixed point theorem:
  – Let S be a compact convex region in R^n and let f: S → S be a continuous function.
  – Then there must exist x ∈ S such that f(x) = x.
  – x is called a "fixed point" of f.
• Simple case: S is the interval [0,1].
• We will care about: S = {(p,q): p,q are legal probability distributions on 1,...,n}. I.e., S = simplex_n × simplex_n.
Proof, contd
• S = {(p,q): p,q are mixed strategies}.
• Want to define f(p,q) = (p',q') such that:
  – f is continuous. This means that changing p or q a little bit shouldn't cause p' or q' to change a lot.
  – Any fixed point of f is a Nash equilibrium.
• Then Brouwer will imply existence of NE.

Try #1
• What about f(p,q) = (p',q') where p' is the best response to q, and q' is the best response to p?
• Problem: not necessarily well-defined:
  – E.g., penalty shot: if p = (0.5, 0.5) then q' could be anything.

           Left      Right
  Left     (0,0)     (1,-1)
  Right    (1,-1)    (0,0)
Try #1, contd
• Problem: also not continuous:
  – E.g., if p = (0.51, 0.49) then q' = (1,0). If p = (0.49, 0.51) then q' = (0,1).

Instead We Will Use…
• f(p,q) = (p',q') such that:
  – q' maximizes [(expected gain wrt p) - ||q-q'||^2]
  – p' maximizes [(expected gain wrt q) - ||p-p'||^2]
• Note: quadratic + linear = quadratic.
Instead We Will Use…, contd
• f(p,q) = (p',q') such that:
  – q' maximizes [(expected gain wrt p) - ||q-q'||^2]
  – p' maximizes [(expected gain wrt q) - ||p-p'||^2]
• f is well-defined and continuous, since the quadratic has a unique maximum and a small change to p,q only moves this maximum a little.
• Also, fixed point = NE (even with a tiny incentive to move, the player would move a little bit).
• So, that's it!
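The map f can be computed explicitly: maximizing [linear gain minus squared distance] over the simplex is a Euclidean projection, argmax_{q'} [q'·g - ||q-q'||^2] = proj_simplex(q + g/2). A numerical sketch (the encoding is mine, not from the slides) checking that the penalty-shot NE is a fixed point:

```python
# Compute the map f from the proof via projection onto the simplex, and
# check that the NE of the penalty shot game is a fixed point of f.
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def f(p, q, A, B):
    """A, B = payoff matrices for the row/column player; returns (p', q')."""
    gp = A @ q           # expected gain of each row against q
    gq = p @ B           # expected gain of each column against p
    # argmax over the simplex of [gain - squared distance] = projection:
    return proj_simplex(p + gp / 2.0), proj_simplex(q + gq / 2.0)

# Penalty shot: (1/2,1/2) vs (1/2,1/2) is the NE, hence a fixed point of f.
A = np.array([[0., 1.], [1., 0.]])
B = -A                               # zero-sum
p = q = np.array([0.5, 0.5])
p2, q2 = f(p, q, A, B)
```

Note this only illustrates that fixed points of f are equilibria; iterating f from an arbitrary start need not converge, which is consistent with the proof being non-constructive.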
Internal Regret and Correlated Equilibria

What If All Players Minimize Regret?
• In zero-sum games, empirical frequencies quickly approach minimax optimal.
• In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium? (After all, a Nash equilibrium is exactly a set of distributions that are no-regret with respect to each other.)
• Well, unfortunately, no.
A Bad Example for General-Sum Games
• Augmented Shapley game from [Z04]: "RPSF".
  – The first 3 rows/cols are the Shapley game (rock/paper/scissors, but if both do the same action then both lose).
  – The 4th action, "play foosball", has a slight negative payoff if the other player is still doing r/p/s, but a positive payoff if the other player does the 4th action too.
• No-regret algorithms will cycle among the first 3 actions and have no regret, but do worse than the only Nash equilibrium: both playing foosball.
• We didn't really expect this to work, given how hard NE can be to find…

What Can We Say?
• If algorithms minimize "internal" or "swap" regret, then the empirical distribution of play approaches a correlated equilibrium. [Foster & Vohra, Hart & Mas-Colell, …]
• Though this doesn't imply that play is stabilizing.
• What are internal regret and correlated equilibria?
More General Forms of Regret
1. "Best expert" or "external" regret: given n strategies, compete with the best of them in hindsight.
2. "Sleeping expert" or "regret with time-intervals": given n strategies and k properties, let S_i be the set of days satisfying property i (these might overlap). Want to simultaneously achieve low regret over each S_i.
3. "Internal" or "swap" regret: like (2), except that S_i = the set of days on which we chose strategy i.

Internal/Swap-Regret
• E.g., each day we pick one stock to buy shares in.
• We don't want to have regret of the form "every time I bought IBM, I should have bought Microsoft instead".
• Formally, swap regret is measured with respect to the optimal function f: {1,…,N} → {1,…,N}: every time you played action j, the comparator plays f(j) instead.
• Motivation: connection to correlated equilibria.
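The definitions above can be made concrete by computing both regret notions for a fixed play sequence. The tiny example data is mine; it is built so that no single fixed action beats our play by much, yet the swap comparator (which may remap each action separately) does much better:

```python
# Compute external vs. swap regret for a fixed sequence of plays and costs.
import numpy as np

def external_regret(plays, costs):
    """Our total cost minus the cost of the best single action in hindsight."""
    mine = sum(costs[t][a] for t, a in enumerate(plays))
    best = np.asarray(costs).sum(axis=0).min()
    return mine - best

def swap_regret(plays, costs, n):
    """Our total cost minus the cost of the best modification function f:
    on every day we played action j, the comparator plays f(j) instead."""
    plays, costs = np.asarray(plays), np.asarray(costs)
    total = 0.0
    for j in range(n):
        days = plays == j
        mine = costs[days, j].sum()
        best = costs[days].sum(axis=0).min()   # best fixed replacement f(j)
        total += mine - best
    return total

# Two actions; we alternate, and each day the other action was cheaper.
plays = [0, 1, 0, 1]
costs = [[1, 0], [0, 1], [1, 0], [0, 1]]
```

Here each fixed action also pays 2 in total (external regret 2), but the swap comparator f(0)=1, f(1)=0 pays 0 (swap regret 4), showing swap regret is the stronger notion.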
Internal/Swap-Regret and Correlated Equilibria
• "Correlated equilibrium": a distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
• E.g., the Shapley game:

       R        P        S
  R    -1,-1    -1,1     1,-1
  P    1,-1     -1,-1    -1,1
  S    -1,1     1,-1     -1,-1

Internal/Swap-Regret and Correlated Equilibria, contd
• If all parties run a low internal/swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium.
  – Correlator chooses a random time t ∈ {1,2,…,T}. It tells each player to play the action j they played at time t (but does not reveal the value of t).
  – Expected incentive to deviate: Σ_j Pr(j)·(Regret | j) = swap regret of the algorithm.
  – So, this says that correlated equilibria are a natural thing to see in multi-agent systems where individuals are optimizing for themselves.
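As an illustration (the chosen distribution is mine, not from the slides), one can check numerically that the uniform distribution over the six off-diagonal cells of the Shapley game is a correlated equilibrium, while a point mass on (R,R) is not:

```python
# Check the correlated-equilibrium condition for the Shapley game: given the
# recommended action, no swap to another action improves the expected payoff.
import numpy as np

R = np.array([[-1., -1., 1.], [1., -1., -1.], [-1., 1., -1.]])  # row payoffs
C = np.array([[-1., 1., -1.], [-1., -1., 1.], [1., -1., -1.]])  # col payoffs
D = (np.ones((3, 3)) - np.eye(3)) / 6.0     # uniform over off-diagonal cells

def is_correlated_eq(D, R, C, eps=1e-9):
    for i in range(3):          # row player told to play i
        cond = D[i]             # (unnormalized) dist. of opponent's action
        for k in range(3):      # any deviation k
            if cond @ R[k] > cond @ R[i] + eps:
                return False
    for j in range(3):          # column player told to play j
        cond = D[:, j]
        for k in range(3):
            if cond @ C[:, k] > cond @ C[:, j] + eps:
                return False
    return True

assert is_correlated_eq(D, R, C)

# A point mass on (R,R) is not a correlated equilibrium: told "rock",
# either player would rather switch to paper.
Dpoint = np.zeros((3, 3)); Dpoint[0, 0] = 1.0
assert not is_correlated_eq(Dpoint, R, C)
```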
Internal/Swap-Regret, contd
• Algorithms for achieving low regret of this form: Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
• Can also convert any "best expert" algorithm into one achieving low swap regret.
• Unfortunately, the time to achieve low regret is linear in n rather than log(n)….

Internal/Swap-Regret, contd
Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
  – Instantiate one copy A_i responsible for expected regret over the times we play i.
  – Each time step, if we play p = (p_1,…,p_n) and get cost vector c = (c_1,…,c_n), then A_i gets cost vector p_i·c.
  – If each A_i proposes to play q_i, so that all together we have a matrix Q, then define p by p = pQ (a stationary distribution of Q).
  – This allows us to view p_i as the probability we chose action i, or equivalently the probability we chose algorithm A_i.
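The reduction above can be sketched with Hedge as the "best expert" subroutine. This is a minimal illustration (the learning rate and the random cost sequence are my choices): each round we collect the copies' proposals into a row-stochastic matrix Q, play the stationary distribution p = pQ, and charge copy A_i the scaled costs p_i·c:

```python
# Sketch of the swap-regret reduction: n copies of Hedge, combined by playing
# a stationary distribution p = pQ of the proposal matrix Q each round.
import numpy as np

n, eta = 3, 0.1
W = np.ones((n, n))        # W[i] = weights of copy A_i over the n actions

def stationary(Q):
    """Solve p = pQ (with p a distribution) as a least-squares system."""
    m = len(Q)
    A = np.vstack([Q.T - np.eye(m), np.ones(m)])
    b = np.zeros(m + 1); b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    p = np.clip(p, 0, None)
    return p / p.sum()

rng = np.random.default_rng(0)
for t in range(500):
    Q = W / W.sum(axis=1, keepdims=True)   # row i = proposal q_i of copy A_i
    p = stationary(Q)                      # play p satisfying p = pQ
    c = rng.random(n)                      # cost vector c_t for this round
    for i in range(n):
        W[i] *= np.exp(-eta * p[i] * c)    # copy A_i is charged p_i * c
```

The identity p = pQ is exactly what lets the combined regret decompose into the copies' regrets: the expected cost of "playing p then following the chosen copy's proposal" equals the cost of playing p itself.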
Congestion Games
• Many multi-agent interactions have structure. One nice class: congestion games.
• They always have a pure-strategy equilibrium.
• They have a potential function such that whenever a player switches, the potential drops by exactly that player's improvement.
  – So, best-response dynamics always reaches an equilibrium.
• Let's start with an example.

Fair Cost-Sharing
• Fair cost-sharing: n players in a weighted directed graph G. Player i wants to get from s_i to t_i, and players share the cost of the edges they use with others.
Good Equilibria, Bad Equilibria
• Fair cost-sharing: n players in a weighted directed graph G. Player i wants to get from s_i to t_i, and players share the cost of the edges they use with others.
• Example: all n players go from the same s to the same t, with two parallel edges of cost 1 and cost n.
  – Good equilibrium: all use the edge of cost 1 (cost 1/n per player).
  – Bad equilibrium: all use the edge of cost n (cost 1 per player).
  – Cost(bad equilib) = n · Cost(good equilib).
• Note that here, the bad equilibrium is what you'd expect from natural dynamics (players entering one at a time, etc.).
  [Figure: a shared-transit example with k ≪ n cars choosing between individual driving edges of cost 1 and a shared transit line.]
Price of Anarchy and Price of Stability
• Price of anarchy: ratio of the worst equilibrium to the social optimum (worst-case over games in the class).
  – We saw that for cost-sharing, PoA = Ω(n). It is also O(n).
• Price of stability: ratio of the best equilibrium to the social optimum (worst-case over games in the class).
  – For cost-sharing, PoS = Θ(log n).
• Exact potential function: a function Φ such that if a player moves, the potential changes by exactly as much as the cost of the player who moved.
  – Guarantees that best-response dynamics will reach a Nash equilibrium.

Potential Functions and PoS
For cost-sharing, PoS = O(log n):
• Given state S, let n_e = # players on edge e and c_e = cost of edge e. Cost(S) = Σ_{e: n_e ≥ 1} c_e.
• Define the potential Φ(S) = Σ_e c_e · (1 + 1/2 + … + 1/n_e) = Σ_e c_e · H(n_e). (When a player joins edge e, both their cost share and Φ increase by c_e/(n_e+1), so Φ is an exact potential.)
• So, cost(S) ≤ Φ(S) ≤ H(n) · cost(S) = O(log n) · cost(S).
• Now consider best-response dynamics starting from OPT. Φ can only decrease, so the final cost is at most Φ(OPT) ≤ O(log n) · cost(OPT).
• So, if we could tell people to play OPT, and everyone went along, then best-response dynamics would lead to a good state.
Congestion Games More Generally
• Game defined by n players and m resources.
• Each player i chooses a set of resources (e.g., a path) from a collection S_i of allowable sets of resources (e.g., paths from s_i to t_i).
• The cost of a resource j is a function f_j(n_j) of the number n_j of players using it.
• The cost incurred by player i is the sum, over all resources player i is using, of the cost of the resource.
• Generic potential function: Φ(S) = Σ_j Σ_{k=1}^{n_j} f_j(k).
• Best-response dynamics may take a long time to reach equilibrium, but if the gap between Φ and cost is small, we can get to an approximate equilibrium fast.

Current/Recent Research Directions (esp. in relation to machine learning)
• How much effort is needed to "nudge" simple best-response dynamics from a bad equilibrium to a good one?
• Are there natural dynamics that can manage to reach good equilibria on their own?
• Can one say anything interesting about "combining expert advice" types of problems where the quality of an expert depends on what the other players are doing? (In particular, in comparison to the best equilibrium.)
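The generic potential argument can be sketched on a random congestion game (the instance is my illustration): best-response dynamics strictly decreases Φ(S) = Σ_j Σ_{k≤n_j} f_j(k) at every improving move, by exactly the mover's improvement, so with integer costs it must terminate at a pure-strategy equilibrium:

```python
# Best-response dynamics in a general congestion game, checking that the
# generic (Rosenthal) potential drops by exactly each mover's improvement.
import numpy as np

rng = np.random.default_rng(1)
n_players, m = 5, 4
f = rng.integers(1, 10, size=(m, n_players + 1)).astype(float)  # f[j][k]
# Allowable strategies here: each player picks any pair of resources.
strategies = [(a, b) for a in range(m) for b in range(a + 1, m)]

def counts(state):
    c = np.zeros(m, dtype=int)
    for s in state:
        for j in s: c[j] += 1
    return c

def player_cost(state, i):
    c = counts(state)
    return sum(f[j][c[j]] for j in state[i])

def phi(state):
    c = counts(state)
    return sum(f[j][k] for j in range(m) for k in range(1, c[j] + 1))

state = [strategies[0]] * n_players
for _ in range(1000):                    # best-response dynamics
    moved = False
    for i in range(n_players):
        best = min(strategies,
                   key=lambda s: player_cost(state[:i] + [s] + state[i+1:], i))
        trial = state[:i] + [best] + state[i+1:]
        if player_cost(trial, i) < player_cost(state, i) - 1e-12:
            # Exact potential: Phi drops by exactly the mover's improvement.
            assert abs((phi(state) - phi(trial)) -
                       (player_cost(state, i) - player_cost(trial, i))) < 1e-9
            state = trial; moved = True
    if not moved:
        break
```

Since Φ is bounded and integer-valued here, the number of improving moves is at most Φ's initial value, which is why the loop reliably terminates at an equilibrium.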