
Outline

A. Introduction
B. Single Agent Learning
C. Game Theory
D. Multiagent Learning
E. Future Issues and Open Problems

Overview of Theory

Models of Interaction
– Normal-Form Games
– Repeated Games
– Stochastic Games

Solution Concepts

Normal-Form Games

A normal-form game is a tuple $(n, A_{1 \ldots n}, R_{1 \ldots n})$, where
– $n$ is the number of players,
– $A_i$ is the set of actions available to player $i$,
– $A = A_1 \times \ldots \times A_n$ is the joint action space,
– $R_i$ is player $i$'s payoff function, $R_i : A \to \mathbb{R}$.

Example — Rock-Paper-Scissors

Two players. Each simultaneously picks an action: Rock, Paper, or Scissors.

The rewards:
– Rock beats Scissors
– Scissors beats Paper
– Paper beats Rock

The matrices (payoff pairs $(R_1, R_2)$; win $= 1$, loss $= -1$, tie $= 0$):

              R        P        S
    R     ( 0, 0)  (-1, 1)  ( 1,-1)
    P     ( 1,-1)  ( 0, 0)  (-1, 1)
    S     (-1, 1)  ( 1,-1)  ( 0, 0)
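As a minimal sketch of how such a game can be represented (assuming NumPy; the array convention is mine, not from the slides), each player's payoff function becomes a matrix, and a mixed-strategy profile is evaluated as a bilinear form:

    import numpy as np

    # Row player's payoffs in rock-paper-scissors (rows/cols: R, P, S).
    # The game is zero-sum, so the column player's payoffs are -R1.
    R1 = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

    def expected_payoff(R, sigma1, sigma2):
        """Expected payoff under independent mixed strategies sigma1, sigma2."""
        return sigma1 @ R @ sigma2

    uniform = np.array([1, 1, 1]) / 3
    print(expected_payoff(R1, uniform, uniform))  # 0.0 for uniform play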

More Examples

Matching Pennies:

              H        T
    H     ( 1,-1)  (-1, 1)
    T     (-1, 1)  ( 1,-1)

Prisoner's Dilemma:

              C        D
    C     ( 3, 3)  ( 0, 5)
    D     ( 5, 0)  ( 1, 1)

Coordination Game:

              A        B
    A     ( 1, 1)  ( 0, 0)
    B     ( 0, 0)  ( 1, 1)

Bach or Stravinsky:

              B        S
    B     ( 2, 1)  ( 0, 0)
    S     ( 0, 0)  ( 1, 2)


Three-Player Matching Pennies

Three players. Each simultaneously picks an action: Heads or Tails.

The rewards:
– Player One wins by matching Player Two,
– Player Two wins by matching Player Three,
– Player Three wins by not matching Player One.

The matrices assign each winner $+1$ and each loser $-1$: for a joint action $(a_1, a_2, a_3)$,
$R_1 = 1$ if $a_1 = a_2$ (else $-1$), $R_2 = 1$ if $a_2 = a_3$ (else $-1$), $R_3 = 1$ if $a_3 \neq a_1$ (else $-1$).

Strategies

What can players do?
– Pure strategies ($a_i \in A_i$): select an action.
– Mixed strategies ($\sigma_i \in PD(A_i)$, the set of probability distributions over $A_i$): select an action according to some probability distribution.

Notation.
– $\sigma$ is a joint strategy for all players.
– $\sigma_{-i}$ is a joint strategy for all players except $i$.
– $\langle \sigma_i, \sigma_{-i} \rangle$ is the joint strategy where $i$ uses strategy $\sigma_i$ and everyone else follows $\sigma_{-i}$.


Types of Games

Zero-Sum Games (a.k.a. constant-sum games)
– $R_1 = -R_2$
– Examples: rock-paper-scissors, matching pennies.

Team Games
– $R_i = R_j$ for all players $i$ and $j$.

General-Sum Games (a.k.a. all games)
– Examples: Bach or Stravinsky, three-player matching pennies, prisoner's dilemma.

Repeated Games

You can't learn if you only play a game once. Repeatedly playing a game raises new questions.
– How many times? Is this known?
  Finite horizon vs. infinite horizon.
– How to trade off present and future reward? (See the numeric sketch below.)
  Average reward: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} r_t$
  Discounted reward: $\sum_{t=1}^{\infty} \gamma^{t-1} r_t$
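A quick numeric illustration of the two criteria (the reward stream and discount factor below are made-up values, not from the slides):

    # A reward stream from repeatedly playing some game (illustrative values).
    rewards = [5, 1, 1, 1, 1, 1, 1, 1]
    gamma = 0.9  # discount factor

    average = sum(rewards) / len(rewards)                          # finite-horizon average
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))  # truncated discounted sum

    print(average, discounted)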

Repeated Games — Strategies

What can players do?
– Strategies can depend on the history of play:
  $\sigma_i : H \to PD(A_i)$, where $H = \bigcup_t A^t$ is the set of histories of joint actions.
– Markov strategies, a.k.a. stationary strategies:
  $\sigma_i(h) = \sigma_i(h')$ for all histories $h$ and $h'$.
– $k$-Markov strategies:
  $\sigma_i(h) = \sigma_i(h')$ whenever $h$ and $h'$ agree on the last $k$ joint actions.

Repeated Games — Examples

Iterated Prisoner's Dilemma:

              C        D
    C     ( 3, 3)  ( 0, 5)
    D     ( 5, 0)  ( 1, 1)

– The single most examined repeated game!
– Repeated play can justify behavior that is not rational in the one-shot game.
– Tit-for-Tat (TFT): play the opponent's last action (C on round 1). A 1-Markov strategy, sketched below.
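A minimal sketch of TFT as a 1-Markov strategy (the history representation is an assumption of mine):

    def tit_for_tat(history):
        """Tit-for-Tat: depends only on the last joint action, hence 1-Markov.
        `history` is a list of (my_action, opponent_action) pairs."""
        if not history:
            return "C"         # cooperate on round 1
        return history[-1][1]  # echo the opponent's last action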


Stochastic Games

– MDPs: single agent, multiple states.
– Repeated games: multiple agents, single state.
– Stochastic games: multiple agents, multiple states.

Stochastic Games — Definition

A stochastic game is a tuple $(n, S, A_{1 \ldots n}, T, R_{1 \ldots n})$, where
– $n$ is the number of agents,
– $S$ is the set of states,
– $A_i$ is the set of actions available to agent $i$,
– $A = A_1 \times \ldots \times A_n$ is the joint action space,
– $T$ is the transition function, $T : S \times A \times S \to [0, 1]$,
– $R_i$ is the reward function for the $i$th agent, $R_i : S \times A \to \mathbb{R}$.
Stochastic Games — Policies

What can players do?
– Policies can depend on the history and the current state:
  $\pi_i : H \times S \to PD(A_i)$.
– Markov policies, a.k.a. stationary policies:
  $\pi_i(s)$ depends only on the current state.
– Focus on learning Markov policies, but the learning itself is a non-Markovian policy.

Example — Soccer (Littman, 1994)

Players: two.
States: player positions and ball possession (780).
Actions: N, S, E, W, Hold (5).
Transitions:
– Simultaneous action selection, random execution.
– Collision could change ball possession.
Rewards: the ball enters a goal.


Example — Goofspiel

Players' hands and the deck each contain $n$ cards.
– A card from the deck is bid on secretly.
– The highest card played gets points equal to the card from the deck.
– Both players discard the cards bid.
– Repeat for all deck cards (a simulation is sketched below).

      n    states    entries    SIZEOF(π or Q)    V(det)    V(random)
      4    692       15150      59KB              …         …
      8    …         …          47MB              …         …
     13    …         …          2.5TB             …         …

Stochastic Games — Facts

– If $n = 1$, it is an MDP.
– If $|S| = 1$, it is a repeated game.
– If the other players play a stationary policy, it is an MDP to the remaining player.
– The interesting case, then, is when the other agents are not stationary, i.e., are learning.
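A minimal sketch of the Goofspiel dynamics described above (the bidding strategies are placeholders; uniform random bids are just for illustration):

    import random

    def play_goofspiel(n, bid1, bid2):
        """One game of n-card Goofspiel. bid1/bid2 map (hand, prize) to a
        card from the hand; on tied bids, nobody scores the prize."""
        deck = random.sample(range(1, n + 1), n)  # prize cards in random order
        hand1 = set(range(1, n + 1))
        hand2 = set(range(1, n + 1))
        score1 = score2 = 0
        for prize in deck:
            c1, c2 = bid1(hand1, prize), bid2(hand2, prize)
            hand1.remove(c1)
            hand2.remove(c2)
            if c1 > c2:
                score1 += prize
            elif c2 > c1:
                score2 += prize
        return score1, score2

    random_bid = lambda hand, prize: random.choice(sorted(hand))
    print(play_goofspiel(4, random_bid, random_bid))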

Overview of Game Theory

Models of Interaction

Solution Concepts
– Normal Form Games: Dominance, Minimax, Pareto Efficiency, Nash Equilibria, Correlated Equilibria.
– Repeated/Stochastic Games: Nash Equilibria, Universally Consistent.

Dominance

An action $a_i$ is strictly dominated if another action $a_i'$ is always better, i.e.,

$\forall a_{-i} \in A_{-i} \qquad R_i(\langle a_i', a_{-i} \rangle) > R_i(\langle a_i, a_{-i} \rangle)$

Consider the prisoner's dilemma:

              C        D
    C     ( 3, 3)  ( 0, 5)
    D     ( 5, 0)  ( 1, 1)

– For both players, D dominates C.
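This check is mechanical; a minimal sketch against the matrix above (NumPy assumed):

    import numpy as np

    # Row player's prisoner's dilemma payoffs (rows and columns: C, D).
    R1 = np.array([[3, 0],
                   [5, 1]])

    def strictly_dominates(R, a_prime, a):
        """True if row a_prime beats row a against every opponent action."""
        return bool(np.all(R[a_prime] > R[a]))

    print(strictly_dominates(R1, 1, 0))  # True: D strictly dominates C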


Iterated Dominance

Actions may be dominated by mixed strategies. For example (the row player's payoffs; numbers illustrative):

              D   E
    A     (   3   0 )
    B     (   0   3 )
    C     (   1   1 )

– C is strictly dominated by the mixed strategy (1/2 A, 1/2 B), which earns 1.5 against both D and E.

If strictly dominated actions should not be played, they can be removed and the argument repeated on the reduced game. A game that iterated elimination reduces to a single joint action is said to be dominance solvable.

Minimax

Q: What do we do when the world is out to get us?
A: Make sure it can't. Play the strategy with the best worst-case value:

$\max_{\sigma_i} \min_{\sigma_{-i}} R_i(\langle \sigma_i, \sigma_{-i} \rangle)$

This is the minimax optimal strategy.
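Dominance by a mixture can be tested with a small linear program: maximize the margin eps by which some mixture of the other rows beats the candidate row in every column. A sketch assuming NumPy/SciPy (the function name and formulation details are mine):

    import numpy as np
    from scipy.optimize import linprog

    def dominated_by_mixture(R, c):
        """Test whether row action c is strictly dominated by a mixture of the
        other rows: maximize eps s.t. sum_a x_a R[a, j] >= R[c, j] + eps."""
        others = [a for a in range(R.shape[0]) if a != c]
        n_cols = R.shape[1]
        # Variables: mixture weights x over `others`, then eps. Minimize -eps.
        obj = np.zeros(len(others) + 1)
        obj[-1] = -1.0
        A_ub = np.hstack([-R[others].T, np.ones((n_cols, 1))])  # eps - x.R[:,j] <= -R[c,j]
        b_ub = -R[c]
        A_eq = np.append(np.ones(len(others)), 0.0).reshape(1, -1)  # weights sum to 1
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * len(others) + [(None, None)])
        return res.x[-1] > 1e-9  # strictly dominated iff achievable eps > 0

    R1 = np.array([[3, 0], [0, 3], [1, 1]])  # rows A, B, C from the example above
    print(dominated_by_mixture(R1, 2))       # True: C is dominated by (1/2 A, 1/2 B)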

Minimax

Back to matching pennies (row player's payoffs):

              H   T
    H     (   1  -1 )
    T     (  -1   1 )

– Minimax optimal is (1/2, 1/2), guaranteeing a value of 0.

Consider Bach or Stravinsky (row player's payoffs):

              B   S
    B     (   2   0 )
    S     (   0   1 )

– Minimax optimal is (1/3, 2/3), guaranteeing a value of 2/3.

Minimax optimal guarantees the safety value.
Minimax optimal never plays dominated strategies.

Minimax — Linear Programming

Minimax optimal strategies can be computed via linear programming:

$\max_{\sigma_i} \min_{\sigma_{-i}} R_i(\langle \sigma_i, \sigma_{-i} \rangle) = \max_{\sigma_i} \min_{a_{-i} \in A_{-i}} \sum_{a_i \in A_i} \sigma_i(a_i) \, R_i(\langle a_i, a_{-i} \rangle)$

i.e., maximize $v$ subject to

$\sum_{a_i} \sigma_i(a_i) \, R_i(\langle a_i, a_{-i} \rangle) \ge v \quad \forall a_{-i} \in A_{-i}, \qquad \sum_{a_i} \sigma_i(a_i) = 1, \quad \sigma_i(a_i) \ge 0.$
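A minimal solver for this LP using SciPy (names are mine; the last LP variable is the game value v):

    import numpy as np
    from scipy.optimize import linprog

    def minimax_strategy(R):
        """Solve max_sigma min_col sigma.R[:, col] as a linear program."""
        n_rows, n_cols = R.shape
        obj = np.append(np.zeros(n_rows), -1.0)          # minimize -v
        A_ub = np.hstack([-R.T, np.ones((n_cols, 1))])   # v - sigma.R[:,j] <= 0
        A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)
        res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(n_cols), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n_rows + [(None, None)])
        return res.x[:-1], res.x[-1]

    pennies = np.array([[1, -1], [-1, 1]])
    bos = np.array([[2, 0], [0, 1]])
    print(minimax_strategy(pennies))  # ~((0.5, 0.5), 0.0)
    print(minimax_strategy(bos))      # ~((1/3, 2/3), 2/3)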


Pareto Efficiency

A joint strategy $\sigma$ is Pareto efficient if no joint strategy is better for all players, i.e.,

$\neg \exists \sigma' \quad \forall i \quad R_i(\sigma') > R_i(\sigma)$

In zero-sum games, all strategies are Pareto efficient.

Consider the prisoner's dilemma:

              C        D
    C     ( 3, 3)  ( 0, 5)
    D     ( 5, 0)  ( 1, 1)

– $\langle D, D \rangle$ is not Pareto efficient.

Consider Bach or Stravinsky:

              B        S
    B     ( 2, 1)  ( 0, 0)
    S     ( 0, 0)  ( 1, 2)

– $\langle B, B \rangle$ and $\langle S, S \rangle$ are Pareto efficient.

Nash Equilibria

What action should we play if there are no dominated actions?
– The optimal action depends on the actions of the other players.

A best response set $BR_i(\sigma_{-i})$ is the set of all strategies that are optimal given the strategies of the other players:

$BR_i(\sigma_{-i}) = \{ \sigma_i^* : R_i(\langle \sigma_i^*, \sigma_{-i} \rangle) \ge R_i(\langle \sigma_i, \sigma_{-i} \rangle) \;\; \forall \sigma_i \}$

A Nash equilibrium is a joint strategy where all players are playing best responses to each other:

$\forall i \quad \sigma_i \in BR_i(\sigma_{-i})$

Since each player is playing a best response, no player can gain by unilaterally deviating.

Dominance solvable games have obvious equilibria.
– Strictly dominated actions are never best responses.
– The prisoner's dilemma has a single Nash equilibrium: $\langle D, D \rangle$.


Examples of Nash Equilibria

Consider the coordination game:

              A        B
    A     ( 1, 1)  ( 0, 0)
    B     ( 0, 0)  ( 1, 1)

– $\langle A, A \rangle$ and $\langle B, B \rangle$ are both pure strategy Nash equilibria.

Consider Bach or Stravinsky:

              B        S
    B     ( 2, 1)  ( 0, 0)
    S     ( 0, 0)  ( 1, 2)

– $\langle B, B \rangle$ and $\langle S, S \rangle$ are both Nash equilibria.

Consider matching pennies:

              H        T
    H     ( 1,-1)  (-1, 1)
    T     (-1, 1)  ( 1,-1)

– No pure strategy Nash equilibria. Mixed strategies?
– $\langle (1/2, 1/2), (1/2, 1/2) \rangle$ is a Nash equilibrium (verified in the sketch below).
– Corresponds to the minimax strategy.
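Checking a mixed profile only requires comparing against pure deviations; a small sketch (NumPy assumed, names mine):

    import numpy as np

    # Matching pennies: zero-sum, so R2 = -R1.
    R1 = np.array([[1, -1], [-1, 1]])
    R2 = -R1

    def is_nash(sigma1, sigma2, R1, R2, tol=1e-9):
        """Check the profile against all pure-strategy deviations.
        (For mixed deviations it suffices to check pure ones.)"""
        v1 = sigma1 @ R1 @ sigma2
        v2 = sigma1 @ R2 @ sigma2
        best1 = max(R1[a] @ sigma2 for a in range(R1.shape[0]))
        best2 = max(sigma1 @ R2[:, a] for a in range(R2.shape[1]))
        return best1 <= v1 + tol and best2 <= v2 + tol

    half = np.array([0.5, 0.5])
    print(is_nash(half, half, R1, R2))  # True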

Existence of Nash Equilibria

All finite normal-form games have at least one Nash equilibrium. (Nash, 1950)

In zero-sum games. . .
– Equilibria all have the same value and are interchangeable:
  if $\langle \sigma_1, \sigma_2 \rangle$ and $\langle \sigma_1', \sigma_2' \rangle$ are Nash, then $\langle \sigma_1, \sigma_2' \rangle$ is Nash.
– Equilibria correspond to minimax optimal strategies.

Computing Nash Equilibria

The exact complexity of computing a Nash equilibrium is an open problem. (Papadimitriou, 2001)
– Likely to be NP-hard. (Conitzer & Sandholm, 2003)
– The Lemke-Howson algorithm.
– For two-player games, a bilinear programming solution.


Fictitious Play

(Brown, 1949; Robinson, 1951) (Fudenberg & Levine, 1998)

An iterative procedure for computing an equilibrium.

1. Initialize $C_i(a_i)$, which counts the number of times player $i$ chooses action $a_i$.
2. Repeat:
   (a) Choose $a_i \in BR_i(\hat{\sigma}_{-i})$, where $\hat{\sigma}_{-i}$ is the empirical distribution given by the counts $C_{-i}$.
   (b) Increment $C_i(a_i)$.

This could be turned into a learning rule.

If $\hat{\sigma}$ converges, then what it converges to is a Nash equilibrium.

When does $\hat{\sigma}$ converge?
– Two-player, two-action games.
– Dominance solvable games.
– Zero-sum games.
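A minimal sketch of fictitious play on matching pennies (initializing counts at 1 is my arbitrary tie-breaking choice):

    import numpy as np

    def fictitious_play(R1, R2, iterations=10000):
        """Both players best-respond to the opponent's empirical action counts."""
        c1 = np.ones(R1.shape[0])  # counts of player 1's actions
        c2 = np.ones(R2.shape[1])  # counts of player 2's actions
        for _ in range(iterations):
            a1 = int(np.argmax(R1 @ (c2 / c2.sum())))  # best response to empirical mix
            a2 = int(np.argmax((c1 / c1.sum()) @ R2))
            c1[a1] += 1
            c2[a2] += 1
        return c1 / c1.sum(), c2 / c2.sum()

    R1 = np.array([[1, -1], [-1, 1]])  # matching pennies (zero-sum)
    print(fictitious_play(R1, -R1))    # both empirical mixes approach (1/2, 1/2)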

Correlated Equilibria

Is there a way to be fair in Bach or Stravinsky?

              B        S
    B     ( 2, 1)  ( 0, 0)
    S     ( 0, 0)  ( 1, 2)

– Suppose we wanted to both go to Bach or both go to Stravinsky with equal probability?
– We want to correlate our action selection:

              B     S
    B     ( 1/2    0  )
    S     (  0    1/2 )

  but this joint distribution is not a product of independent mixed strategies $\sigma_1 \times \sigma_2$.

Assume a shared randomizer (e.g., a coin flip) exists.

Define a new concept of equilibrium.
– Let $\sigma$ be a probability distribution over joint actions.
– Each player observes their own action in a joint action sampled from $\sigma$.
– $\sigma$ is a correlated equilibrium if no player can gain by deviating from their prescribed action.


Correlated Equilibria

Back to Bach or Stravinsky:

              B        S
    B     ( 2, 1)  ( 0, 0)
    S     ( 0, 0)  ( 1, 2)

The distribution placing probability 1/2 on $\langle B, B \rangle$ and 1/2 on $\langle S, S \rangle$ is a correlated equilibrium (checked in the sketch below).

– All Nash equilibria are correlated equilibria.
– All mixtures of Nash equilibria are correlated equilibria.

Overview of Game Theory

Models of Interaction

Solution Concepts
– Normal Form Games: Dominance, Minimax, Pareto Efficiency, Nash Equilibria, Correlated Equilibria.
– Repeated/Stochastic Games: Nash Equilibria, Universally Consistent.
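A minimal check of the incentive constraints for the distribution above (NumPy assumed; names mine):

    import numpy as np

    # Bach or Stravinsky payoffs (rows: player 1's action B, S).
    R1 = np.array([[2, 0], [0, 1]])
    R2 = np.array([[1, 0], [0, 2]])

    # Correlating distribution: half on (B, B), half on (S, S).
    sigma = np.array([[0.5, 0.0], [0.0, 0.5]])

    def correlated_eq(sigma, R1, R2, tol=1e-9):
        """Given a recommended action, no player prefers any deviation."""
        for a in range(2):      # player 1 is recommended row a
            for dev in range(2):
                if np.dot(sigma[a], R1[dev]) > np.dot(sigma[a], R1[a]) + tol:
                    return False
        for a in range(2):      # player 2 is recommended column a
            for dev in range(2):
                if np.dot(sigma[:, a], R2[:, dev]) > np.dot(sigma[:, a], R2[:, a]) + tol:
                    return False
        return True

    print(correlated_eq(sigma, R1, R2))  # True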

Nash Equilibria in Repeated Games

Obviously, Markov strategy equilibria exist.

Consider the iterated prisoner's dilemma and TFT:

              C        D
    C     ( 3, 3)  ( 0, 5)
    D     ( 5, 0)  ( 1, 1)

– With average reward, what's a best response to TFT?
  – Always D has a value of 1.
  – Alternating D and C has a value of 2.5.
  – Always C, and TFT itself, have a value of 3.
– Hence, both players following TFT is a Nash equilibrium (see the simulation below).

The TFT equilibrium is strictly preferred to all Markov strategy equilibria.
– The TFT strategy plays a dominated action.
– TFT uses a threat to enforce compliance.
– TFT is not a special case.
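An empirical check of these three values against TFT (a sketch; the round count is arbitrary):

    # Prisoner's dilemma payoffs from the matrix above: (ours, theirs) -> our reward.
    PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

    def average_vs_tft(strategy, rounds=10000):
        """Average reward of strategy(t) -> 'C' or 'D' against Tit-for-Tat."""
        tft_action = "C"       # TFT opens with cooperation
        total = 0
        for t in range(rounds):
            ours = strategy(t)
            total += PAYOFF[(ours, tft_action)]
            tft_action = ours  # TFT echoes our last action next round
        return total / rounds

    print(average_vs_tft(lambda t: "D"))           # ~1.0 (always D)
    print(average_vs_tft(lambda t: "DC"[t % 2]))   # ~2.5 (alternate D and C)
    print(average_vs_tft(lambda t: "C"))           # 3.0 (always C)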


Nash Equilibria in Repeated Games

Folk Theorem. For any repeated game with average reward, every feasible and enforceable vector of payoffs for the players can be achieved by some Nash equilibrium strategy. (Osborne & Rubinstein, 1994)

– A payoff vector is feasible if it is a linear combination of individual action payoffs.
– A payoff vector is enforceable if all players get at least their minimax value.

The construction mirrors TFT:
– Players follow a deterministic sequence of play that achieves the payoff vector.
– Any deviation is punished.
– The threat keeps players from deviating, as in TFT.

Computing Repeated Game Equilibria

(Littman & Stone, 2003)

A polynomial time algorithm for finding a Nash equilibrium in a repeated game:
– Find a feasible and enforceable payoff vector.
– Construct a strategy that punishes deviance.

Universally Consistent

A.k.a. Hannan consistent, regret minimizing.

For a history $h^t = (a^1, \ldots, a^t)$, define the regret for player $i$:

$\mathrm{Regret}_i(h^t) = \max_{a_i \in A_i} \frac{1}{t} \sum_{\tau = 1}^{t} \left( R_i(\langle a_i, a_{-i}^\tau \rangle) - R_i(a^\tau) \right)$

i.e., the difference between the reward that could have been received by a stationary strategy and the actual reward received.


Universally Consistent

A strategy $\sigma_i$ is universally consistent if for any $\epsilon > 0$ there exists a $T$ such that for all $\sigma_{-i}$ and all $t > T$,

$\Pr\left( \mathrm{Regret}_i(h^t) > \epsilon \right) < \epsilon$

i.e., with high probability the average regret is low for all strategies of the other players.

If regret is zero, then the player must be getting at least the minimax value.

Nash Equilibria in Stochastic Games

Consider Markov policies.

A best response set $BR_i(\pi_{-i})$ is the set of all Markov policies that are optimal given the other players' policies.

A Nash equilibrium is a joint policy where all players are playing best responses to each other:

$\forall i \quad \pi_i \in BR_i(\pi_{-i})$

Nash Equilibria in Stochastic Games

All discounted reward and zero-sum average reward stochastic games have at least one Nash equilibrium. (Shapley, 1953; Fink, 1964)

Stochastic games are the general model.

Nash equilibria in stochastic games have certainly received the most attention.