Prisoner's Dilemma, Tit-For-Tat, and ZD Strategies


Lecture 3: Prisoner’s Dilemma, Tit-for-tat, and ZD strategies

The Prisoner’s Dilemma (PD): Two players cooperate or defect. If A cooperates and B defects, A loses; if A defects and B cooperates, A wins; if both cooperate, both gain; if both defect, both lose.

Standard PD payoff matrix (payoff to A, payoff to B; values as used in this lecture)

                 B cooperates   B defects
A cooperates     3, 3           0, 5
A defects        5, 0           1, 1
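A minimal sketch of the one-shot game, assuming the payoff values R = 3, S = 0, T = 5, P = 1 used in this lecture. The names PAYOFF, best_reply, is_pd and joint are illustrative only. It checks the two inequalities that define a PD, finds the best reply to each move, and sums the payoffs of each outcome to show that the game is not zero sum.

```python
# Payoff to me for (my_move, your_move), with the values used in this lecture.
R, S, T, P = 3, 0, 5, 1
PAYOFF = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}

def best_reply(your_move):
    """Return the move that maximizes my one-shot payoff against your_move."""
    return max("CD", key=lambda my_move: PAYOFF[(my_move, your_move)])

# The two inequalities that make the game a Prisoner's Dilemma.
is_pd = T > R > P > S and 2 * R > T + S

# Joint (summed) payoff of each outcome.
joint = {a + b: PAYOFF[(a, b)] + PAYOFF[(b, a)] for a in "CD" for b in "CD"}

print(is_pd)                             # True
print(best_reply("C"), best_reply("D"))  # D D
print(joint)                             # {'CC': 6, 'CD': 5, 'DC': 5, 'DD': 2}
```

D is the best reply to both moves, which is why the one-shot solution is mutual defection even though CC has the highest joint payoff.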

The payoffs: in joint terms CC > DC/CD > DD, and CC > [DC + CD]/2 (so an alternating strategy is not best). Since the joint totals are CC = 6 > CD = 5 > DD = 2, PD is not a zero-sum game. Cooperation has a productive value. (CC can still be socially negative, e.g. a cartel ripping off consumers.) For the individual player the payoffs are often written as T (DC) > R (CC) > P (DD) > S (CD), with R = 3, S = 0, T = 5, P = 1.

In a one-shot game, the solution is D. If I play C, you will play D and win, and vice versa. So we both play D. The same holds if we know the game ends in a known final period N: period N is a one-shot game, so both play D. But then period N-1 is in effect a one-shot game again. And so on back to the first period. Thus a known number of interactions yields ALLD.

But there is much less defecting in the world than this predicts. To explain this we need:

1. Expected future dealings: we expect to interact again with no certain endpoint. This is the repeated or iterated PD game (IPD), for which there is no best action. Returns depend on what others do and on the discounting of future payoffs. Rip off the tourist but cooperate with your spouse.

2. Low discount rate so future dealings matter. Play C today to encourage C tomorrow. Total value CC 6 > CD 5. The discount factor is w = 1/(1+r) < 1, where r = interest rate. w^2 is the value of a payoff two periods from now, w^3 is the value three periods on, etc. The future matters more when w is large (r is small), so a strategy that pays off in the future can beat all D.

3. Conditional retaliatory strategies. If I play D against your C and you do not change to D, I win. If you shift to D against me, I get 5+1 = 6 in two rounds and 5+1+1 = 7 in three rounds, while if you play C with another C player you get 6 in two rounds and 9 in three rounds, and 9 > 7. Retaliation drops D's payoff to 1 in the following rounds. Spiteful monkeys.

4. World of strategies beyond all D or all C. In an all-D world, best is D. In an all-C world, best is D. But with other strategies around it may be better to be nicer. The key other strategy is TFT, tit-for-tat: cooperate until the opponent defects, then defect until the opponent changes. An eye for an eye, a tooth for a tooth.

Round      1    2    3      4
Opponent   C    D    C/D
TFT        C    C    D      C/D

Consider three periods (the minimum for TFT to work better than All D given the payoff matrix above), where the population is half TFT and half All D, so each strategy meets TFT half the time and All D half the time. For simplicity let w = 1 so the future is worth as much as the present.

TFT meets TFT: rewards = 3(1 + w + w^2) = 9
TFT meets All D: rewards = 0 + w + w^2 = 2
All D meets TFT: rewards = 5 + w + w^2 = 7
All D meets All D: rewards = 1 + w + w^2 = 3

TFT gets 11 (= 9 + 2) from playing TFT and D once each; D gets 10 (= 7 + 3). TFT cooperation > defect.

5. Winning strategy varies with the distribution of strategies in the world. In an all-D world, best is all-D. In a TFT world, best is a TFT-type strategy. Consider how payoffs vary with the all-D and TFT population in a 3 period model:

%D     TFT score                      D score                        Result
1/3    20/3 (= 1/3 x 2 + 2/3 x 9)     17/3 (= 1/3 x 3 + 2/3 x 7)     TFT WINS
1/2    11/2 (= 1/2 x 2 + 1/2 x 9)     10/2                           TFT WINS
2/3    13/3                           13/3                           EQUAL SCORES
3/4    15/4                           16/4                           D WINS

So when %D > 2/3, D wins; when %D < 2/3, D loses; at 2/3 there is an unstable mixed equilibrium. Note that TFT requires a smaller proportion of itself to win (1/3 +) than D does (2/3 +). The reason is 6 > 5.

6. Addition of all-Cooperate (turn the other cheek) helps all-D and hurts TFT: too many suckers destroys the world.

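Three-period score of the row strategy against each opponent, and its average score in three populations (* = winner):

       vs TFT   vs C   vs D   1/3 of each   2/5 TFT, 2/5 C, 1/5 D   7/10 TFT, 2/10 C, 1/10 D
TFT    9        9      2      20/3          38/5                    8.3*
C      9        9      0      18/3          36/5                    8.1
D      7        15     3      25/3*         47/5*                   8.2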

D wins because it exploits C. With 2/5 TFT and 2/5 C (and 1/5 D), D wins. With 7/10 TFT and 2/10 C, TFT wins.
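The three-period scores and population averages above can be reproduced with a short simulation. This is a sketch under the lecture's assumptions (R = 3, S = 0, T = 5, P = 1, three rounds); the function names and the optional discount argument w are illustrative additions, with w = 1 as the default used in the text.

```python
from fractions import Fraction as F

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tft(opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_history[-1] if opp_history else "C"

def all_c(opp_history):
    return "C"

def all_d(opp_history):
    return "D"

STRATEGIES = {"TFT": tft, "C": all_c, "D": all_d}

def match_score(me, other, rounds=3, w=1):
    """Discounted total payoff to `me` over `rounds` rounds against `other`."""
    my_moves, their_moves, total = [], [], 0
    for t in range(rounds):
        a = STRATEGIES[me](their_moves)
        b = STRATEGIES[other](my_moves)
        total += w**t * PAYOFF[(a, b)]
        my_moves.append(a)
        their_moves.append(b)
    return total

def expected_score(me, mix, w=1):
    """Average score of `me` when opponents are drawn with the shares in `mix`."""
    return sum(share * match_score(me, other, w=w) for other, share in mix.items())

mixes = {
    "1/3 each": {"TFT": F(1, 3), "C": F(1, 3), "D": F(1, 3)},
    "2/5 TFT, 2/5 C, 1/5 D": {"TFT": F(2, 5), "C": F(2, 5), "D": F(1, 5)},
    "7/10 TFT, 2/10 C, 1/10 D": {"TFT": F(7, 10), "C": F(2, 10), "D": F(1, 10)},
}
for label, mix in mixes.items():
    scores = {name: expected_score(name, mix) for name in STRATEGIES}
    winner = max(scores, key=scores.get)
    print(label, {k: str(s) for k, s in scores.items()}, "winner:", winner)
```

Passing a smaller discount factor, for example expected_score("TFT", mix, w=F(1, 2)), shows how a heavily discounted future shifts the balance back toward D.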

Thus, NO BEST CHOICE IN iterated PD. SUCCESS DEPENDS ON THE ECOLOGY OF STRATEGIES. For any payoff matrix, there is a distribution of All D, All C, and TFT under which D wins and another under which TFT wins. C never wins. One on one, TFT never wins: when TFT meets D, D scores more. Nice strategies gain from interactions with nice strategies. TFT beats D through its interaction with TFT. PD game on TV: http://gawker.com/5903692/must-watch-golden-balls-contestant-wins-with-most-ballsy-move-ever

Axelrod 1979 Computer Tournament

R. Axelrod asked experts to submit programs for the PD: code giving a response to any action by another player. Fifteen programs entered, including D and C. Several complex programs tried to infer and exploit the opponent's strategy. Anatol Rapoport entered TFT. TFT won. Axelrod announced the results and held a second contest. Analysis of round 1 showed that a more generous/forgiving strategy could have beaten TFT: tit for two tats (TFTT), which retaliates against DD but not against a single D. There were 63 entrants in the 2nd tournament and TFT (Rapoport) won again.

Axelrod then simulated what would happen to the population of strategies in the next generation if higher scoring strategies increase their share of the population. TFT and other nice rules did well over time. TFT/nice strategies win because they never defect first but retaliate quickly against D, which limits D's points.

Can a TFT world survive an invasion of Ds, where survive means outscore D? It depends on the share of Ds that invades (p). Over three periods with w = 1, TFT scores 9(1-p) + 2p while D gets 7(1-p) + 3p, so TFT beats D when 2(1-p) > p, i.e. p < 2/3. So if population change depends on relative scores, an initial invasion of fewer than 2/3 Ds would fail. Can a world of Ds survive an invasion of TFTs? Yes, unless the TFTs make up 1/3 or more with the given matrix. Can a TFT world survive an invasion of Cs? No, because TFT and C score the same. Cs open the door to a D invasion.
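A small sketch of the invasion threshold, assuming the lecture's payoffs and no discounting (w = 1). The functions scores and threshold are illustrative names. It computes the share of All D at which TFT and All D score the same for a match of n rounds, which also shows why three rounds is the minimum for TFT to get ahead.

```python
from fractions import Fraction

R, S, T, P = 3, 0, 5, 1

def scores(p, n):
    """Expected n-round scores (TFT, All D) when a share p of opponents is All D."""
    tft_vs_tft, tft_vs_d = n * R, S + (n - 1) * P
    d_vs_tft, d_vs_d = T + (n - 1) * P, n * P
    tft = (1 - p) * tft_vs_tft + p * tft_vs_d
    all_d = (1 - p) * d_vs_tft + p * d_vs_d
    return tft, all_d

def threshold(n):
    """Share of All D at which the two strategies tie (intended for n >= 2)."""
    edge_among_tft = n * R - (T + (n - 1) * P)    # TFT's lead when the opponent is TFT
    deficit_among_d = n * P - (S + (n - 1) * P)   # TFT's shortfall when the opponent is All D
    return Fraction(edge_among_tft, edge_among_tft + deficit_among_d)

for n in (2, 3, 4, 10):
    print(n, threshold(n))   # 2 -> 0, 3 -> 2/3, 4 -> 4/5, 10 -> 16/17
```

With two rounds the threshold is 0, so TFT can never outscore All D; with three rounds it is the 2/3 found above, and it rises toward 1 as matches get longer.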

Spatial interactions and n-hoods: CA models of PD

If TFTs interact more with each other in a local neighborhood (n-hood) than with the entire population, TFT is more likely to survive. Say a share 1-p of TFTs enters an All-D world and each TFT has 2 of its 4 interactions with TFTs. Then their score is equivalent to that in a world with 50% TFTs. But the Ds still interact largely with Ds, so TFT could win.

CA (cellular automaton) models show how n-hood interactions affect outcomes in spatial PD games. Assume that players interact with others in their n-hood and change strategy depending on what they win in the n-hood. Surrounded by Ds you turn D. Surrounded by TFTs you play TFT. Conflicts occur on the borders. Compare three cells marked ??, each with four neighbors: one with 3 TFT neighbors and 1 D, one with 2 TFT and 2 D, and one with 1 TFT and 3 D.

      TFT                TFT                TFT
TFT * ?? * TFT      D  * ?? * D        D  * ?? * D
       D                 TFT                 D

The rule for ?? is to compute the profits from playing D and from playing TFT and pick the more profitable. Consider the rewards using the payoffs for three period interactions: TFT-TFT 9, TFT-D 2, D-D 3, D-TFT 7.

NEIGHBORHOOD    1 D, 3 TFT    2 D, 2 TFT    3 D, 1 TFT
PICK TFT        29            22            15
PICK D          24            20            16
Decision        TFT           TFT           D

Surrounded by 2 or 3 TFTs, choose TFT. Surrounded by 3 or more Ds, choose D.
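The neighborhood rule can be sketched in a few lines, assuming four neighbors per cell and the three-period payoffs quoted above. The names SCORE, neighborhood_score and choose are illustrative, and this shows only the decision rule for a single cell, not a full cellular automaton.

```python
# Three-period payoffs from the text: (my strategy, neighbor's strategy) -> my score.
SCORE = {("TFT", "TFT"): 9, ("TFT", "D"): 2, ("D", "TFT"): 7, ("D", "D"): 3}

def neighborhood_score(me, neighbors):
    """Total score from playing `me` against every neighbor."""
    return sum(SCORE[(me, n)] for n in neighbors)

def choose(neighbors):
    """Pick whichever of TFT and D earns more in this neighborhood."""
    return max(("TFT", "D"), key=lambda s: neighborhood_score(s, neighbors))

for n_d in (1, 2, 3, 4):
    neighbors = ["D"] * n_d + ["TFT"] * (4 - n_d)
    print(n_d, "D neighbors:",
          "TFT earns", neighborhood_score("TFT", neighbors),
          "D earns", neighborhood_score("D", neighbors),
          "-> choose", choose(neighbors))
```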

Go to http://ccl.northwestern.edu/netlogo/models/PDBasicEvolutionaryl and experiment with the PD games.

New Material on Spatial PD

1) Review of experiments on Prisoner’s Dilemmas on lattices to test interpretations of human behavior. The authors find that “moody conditional cooperation” (note 1), rather than non-innovative game dynamics such as imitate-the-best or pairwise comparison rules, fits the data. The results suggest that an imposed lattice structure does not influence global cooperation. (Grujić et al., 2014, “A comparative analysis of spatial Prisoner’s Dilemma experiments: Conditional cooperation and payoff irrelevance”, www.nature.com/articles/srep04615)

2) The existence of a zealot who stays a cooperator irrespective of the result of an interaction has been reported to add “social viscosity” to a population and thereby help increase the cooperation level in prisoner's dilemma games. That result presumes a so-called well-mixed population. The authors find that this is not always true when a spatial structure, i.e. a network connecting agents, is introduced. Deploying zealots is counterproductive, especially when the underlying topology is homogeneous, similar to that of a lattice. Their simulation reveals how the existence of never-converting cooperators destroys rather than boosts cooperation. (Matsuzawa et al., “Spatial prisoner’s dilemma games with zealous cooperators”, Physical Review E 94, 022114 (2016))

Better than TFT: Nicer and Conditional

TFT has problems with errors in communication (a mistaken defection, D'). If TFT meets TFT and one errs, they fall into an alternating cycle, with lower rewards than mutual C:

TFT     CCC D' CDCD ...
TFT     CCC C  DCDC ...

More forgiving is TFTT, which ignores a single D' but retaliates after DD:

TFTT    CCC D' CC        CCC DD CC
TFTT    CCC C  CC        CCC CC DD

To generalize strategies via conditional probabilities, let P be the probability you cooperate if X cooperated and Q be the probability you cooperate if X defected. This gives strategies below (Sigmund, Games of Life).

1 MCC (moody conditional cooperation) is CC (conditional cooperation) if the player cooperated the last time. If the player defected the last time, a player adopting MCC decides without taking into account what the neighbors in the contact network have done previously: mutation rather than copying neighbors.

Nowak and Sigmund simulate a world of (p,q) strategies with random ps and qs and NO neighborhoods. PAVLOV responds to the previous round by switching if it loses: if its D meets a D, it tries C, while if its C meets a D, it tries D. WIN-STAY, LOSE-SHIFT. Pavlov would fail in an Axelrod tournament until TFT has destroyed most Ds.
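The behavior of TFT, TFTT and Pavlov after a mistaken defection, and of Pavlov against All D, can be traced with a deterministic sketch. It assumes the lecture's payoffs and eight rounds, and injects a single error by forcing player 1 to defect in one round. The helper names play and totals are illustrative.

```python
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tft(mine, theirs):
    """Tit for tat: cooperate first, then copy the opponent's last move."""
    return theirs[-1] if theirs else "C"

def tftt(mine, theirs):
    """Tit for two tats: defect only after two consecutive defections."""
    return "D" if theirs[-2:] == ["D", "D"] else "C"

def pavlov(mine, theirs):
    """Win-stay, lose-shift: keep my last move if the opponent cooperated, else switch."""
    if not mine:
        return "C"
    if theirs[-1] == "C":
        return mine[-1]
    return "C" if mine[-1] == "D" else "D"

def all_d(mine, theirs):
    return "D"

def play(s1, s2, rounds=8, error_round=None):
    """Play s1 against s2; in error_round (1-based) player 1's move is forced to D."""
    h1, h2 = [], []
    for r in range(1, rounds + 1):
        a, b = s1(h1, h2), s2(h2, h1)
        if r == error_round:
            a = "D"  # the mistaken defection D'
        h1.append(a)
        h2.append(b)
    return "".join(h1), "".join(h2)

def totals(moves1, moves2):
    """Total payoffs (player 1, player 2) for two move strings."""
    return (sum(PAYOFF[(a, b)] for a, b in zip(moves1, moves2)),
            sum(PAYOFF[(b, a)] for a, b in zip(moves1, moves2)))

for name, strat in [("TFT", tft), ("TFTT", tftt), ("Pavlov", pavlov)]:
    print(name, "self-play, error in round 4:", play(strat, strat, error_round=4))

for name, strat in [("TFT", tft), ("Pavlov", pavlov)]:
    moves = play(strat, all_d)
    print(name, "vs All D:", moves, totals(*moves))
```

In this sketch TFT self-play locks into the alternating cycle, TFTT and Pavlov return to mutual cooperation, and Pavlov facing All D keeps switching back to C, which lets All D collect the temptation payoff every other round.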

Psychology Experiments: Framing matters

Study 1: More cooperation when the PD was called the ‘‘Community Game’’ than when it was called the ‘‘Wall Street Game’’, in the Israeli Air Force. Instructors guessed who would cooperate based on behavior during training. (Liberman, V., S. M. Samuels, and L. Ross. 2004. Personality and Social Psychology Bulletin 30:1175-85.)

Study 2: Interpretive labels for the game, the choices, and the outcomes led to different outcomes. (Zhong, Loewenstein, and Murnighan, Journal of Conflict Resolution, Vol. 51, No. 3, 431-456 (2007).)

6. One Strategy to Rule Them All: the ZD Condition

“It would be surprising if any significant mathematical feature of IPD has remained undescribed, but that appears to be the case” (Freeman Dyson and William Press, 2012). Also surprising: Dyson, then 88 years old, added it! Dyson is also a climate skeptic: “he thinks the computer-generated models being used to predict long-term climate consequences are flawed because scientists have too little information about many of the variables that must be taken into account” (http://noconsensus.org/scientists/freeman_dyson.php). See the 2015 Dyson interview at www.youtube.com/watch?v=BiKfWdXXfIs. www.realclimate.org/index.php/archives/2008/05/freeman-dysons-selective-vision/ criticizes him. Suggested paper: How valid are his criticisms of climate change models? Putting other problems first?

ZD as slogan: "Robert Axelrod's 1980 tournaments of iterated prisoner's dilemma strategies have been condensed into the slogan, Don't be too clever, don't be unfair. Press and Dyson have shown that cleverness and unfairness triumph after all." (William Poundstone)

The Zero Determinant model presents a strategy that “controls” outcomes regardless of what an opposing non-ZD strategy does. ZD plays C with a conditional probability between 0 and 1 depending on the last period's play: a memory-1 strategy. Press & Dyson show that an opponent who considers earlier encounters does no better against a memory-1 player than it would with a memory-1 strategy, so the analysis need only consider strategies that remember the previous round. Here is ZD compared to four major strategies that give 0/1 responses to the previous round.

Conditional Probability of Playing C
(previous round = own move, opponent's move; own payoff in parentheses)

Previous round   ZD Strategy   All Coop   All Defect   TFT   Pavlov (WSLS)
CC (R)           Pcc           1          0            1     1
CD (S)           Pcd           1          0            0     0
DC (T)           Pdc           1          0            1     0
DD (P)           Pdd           1          0            0     1

To do well ZD will likely set Pcc high; Pcd low; Pdc high; Pdd low but not 0. Why? Since ZD can replicate other strategies, it has to have some “edge”.

Let Qcc, Qcd, Qdc, Qdd represent the conditional probabilities of the 2nd player, indexed from that player's own point of view. Then the four probabilities are each player's strategy. Putting them together gives a probability distribution for the outcome of each round, conditional on the outcome of the previous round: a 4 by 4 Markov chain transition matrix M from the four outcomes in this period (rows) to the next (columns).

        CC           CD               DC               DD
CC      Pcc Qcc      Pcc (1-Qcc)      (1-Pcc) Qcc      (1-Pcc) (1-Qcc)
CD      Pcd Qdc      Pcd (1-Qdc)      (1-Pcd) Qdc      (1-Pcd) (1-Qdc)
DC      Pdc Qcd      Pdc (1-Qcd)      (1-Pdc) Qcd      (1-Pdc) (1-Qcd)
DD      Pdd Qdd      Pdd (1-Qdd)      (1-Pdd) Qdd      (1-Pdd) (1-Qdd)

Let v be the 4 element (row) vector of the distribution of outcomes among CC, CD, DC, and DD. The v that solves v = v M gives the stationary distribution/equilibrium of the four outcomes, in which the row player gets v times the rewards SX = (R, S, T, P) while the column player gets v times the rewards SY = (R, T, S, P).

Press and Dyson show that the row player's payoff can be expressed through a determinant in which one column involves only the four probabilities of one player’s strategy and another column involves only the probabilities of the other player's strategy. This allows a player to force a given linear relation between the payoffs of both players independently of whatever strategy the other might choose. This control is obtained by setting the determinant to zero, hence the name ZD. A ZD strategy can use the linear relation to set the average score of the opponent regardless of the opponent's strategy.

See http://s3.boskent.com/prisoners-dilemma/fixed.html, which plays conditional probabilities obtained by solving the Press and Dyson equations for a target of 2. If you cooperated last time, it cooperates with probability 2/3. If you defected while it cooperated, it cooperates with probability 0. If you defected last time and it defected, it cooperates with probability 1/3.
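The fixed-score strategy just described can be checked numerically. The sketch below builds the transition matrix M for two memory-1 strategies and approximates the stationary distribution by repeated multiplication (a simple choice for illustration that assumes the chain settles to a single stationary distribution; it is not the determinant method of Press and Dyson). The function names and the opponent strategies tried are illustrative; the payoffs are the lecture's R = 3, S = 0, T = 5, P = 1.

```python
R, S, T, P = 3, 0, 5, 1
S_X = (R, S, T, P)   # row player's payoff in states CC, CD, DC, DD
S_Y = (R, T, S, P)   # column player's payoff in the same states

def transition_matrix(p, q):
    """4x4 matrix M; p and q are (cc, cd, dc, dd), each from its own player's viewpoint."""
    q_seen = (q[0], q[2], q[1], q[3])   # state CD for X is DC for Y, and vice versa
    return [[pi * qi, pi * (1 - qi), (1 - pi) * qi, (1 - pi) * (1 - qi)]
            for pi, qi in zip(p, q_seen)]

def stationary(M, steps=5000):
    """Approximate the row vector v with v = vM by iterating from a uniform start."""
    v = [0.25] * 4
    for _ in range(steps):
        v = [sum(v[i] * M[i][j] for i in range(4)) for j in range(4)]
    return v

def payoffs(p, q):
    """Long-run average payoffs (to X, to Y)."""
    v = stationary(transition_matrix(p, q))
    return (sum(a * b for a, b in zip(v, S_X)), sum(a * b for a, b in zip(v, S_Y)))

equalizer = (2/3, 0, 2/3, 1/3)   # the probabilities quoted above for a target of 2
opponents = {"All C": (1, 1, 1, 1), "All D": (0, 0, 0, 0),
             "TFT": (1, 0, 1, 0), "mixed": (0.9, 0.2, 0.6, 0.1)}
for name, q in opponents.items():
    sx, sy = payoffs(equalizer, q)
    print(name, "X gets", round(sx, 4), "Y gets", round(sy, 4))
```

For each of these opponents the column player's long-run payoff comes out at 2, while the row player's own payoff varies with the opponent.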

Whatever the non-ZD player does, its long-term outcome is 2. ZD can also “extort” gains by defecting enough times to win in any one-on-one contest with the other player. Extort-2 forces the relationship (ZD's payoff - P) = 2 x (opponent's payoff - P): ZD gains twice the share of payoffs above P compared with those received by the opponent, where P is … .

Where does the solution come from? Press and Dyson prove that v · SX is proportional to the determinant of a matrix obtained by replacing the last column of M - I (where I is the identity matrix) by SX.

Denote this determinant as D(p,q,SX); then player X’s expected payoff is EX = D(p,q,SX)/D(p,q,1), where 1 is an all-ones vector. In the determinant, the second column is determined solely by the strategy of player X and the third column solely by the strategy of player Y. We record the second column as p~ = (-1 + p1, -1 + p2, p3, p4) and the third column as q~ = (-1 + q1, q3, -1 + q2, q4), where (p1, p2, p3, p4) = (Pcc, Pcd, Pdc, Pdd) and likewise for q.

The ZD solution links the conditional probabilities to the R, S, T, P rewards through three parameters a, b, and v (here v is a scalar, not the stationary vector):

Pcc = a R + b R + v + 1
Pcd = a S + b T + v + 1
Pdc = a T + b S + v
Pdd = a P + b P + v

which yields a linear relation connecting the player's own rewards A(p,q) to the rewards of the other player A(q,p):

a A(p,q) + b A(q,p) + v = 0
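As a hedged illustration of the parameterization, the sketch below builds a strategy from a, b and v and checks the linear relation a A(p,q) + b A(q,p) + v = 0 numerically. The particular choice a = phi, b = -phi*chi, v = phi*(chi - 1)*P with chi = 2 and phi = 1/18 is an assumption made here to obtain an Extort-2 strategy; phi and chi are illustrative names, not notation from the lecture. Payoffs are approximated by iterating the Markov chain rather than by the determinant formula.

```python
from fractions import Fraction

R, S, T, P = 3, 0, 5, 1
S_X, S_Y = (R, S, T, P), (R, T, S, P)   # payoffs to X and Y in states CC, CD, DC, DD

def zd_strategy(a, b, v):
    """Cooperation probabilities (Pcc, Pcd, Pdc, Pdd) from the parameters a, b, v."""
    p = (a*R + b*R + v + 1, a*S + b*T + v + 1, a*T + b*S + v, a*P + b*P + v)
    if not all(0 <= x <= 1 for x in p):
        raise ValueError("a, b, v do not give valid probabilities")
    return tuple(float(x) for x in p)

def payoffs(p, q, steps=5000):
    """Long-run payoffs (to X, to Y), approximating the stationary v = vM by iteration."""
    q_seen = (q[0], q[2], q[1], q[3])   # Y sees state CD as DC and vice versa
    M = [[pi*qi, pi*(1-qi), (1-pi)*qi, (1-pi)*(1-qi)] for pi, qi in zip(p, q_seen)]
    dist = [0.25] * 4
    for _ in range(steps):
        dist = [sum(dist[i] * M[i][j] for i in range(4)) for j in range(4)]
    return (sum(x*y for x, y in zip(dist, S_X)), sum(x*y for x, y in zip(dist, S_Y)))

# Assumed parameter choice for Extort-2 (phi and chi are hypothetical helper names).
phi, chi = Fraction(1, 18), 2
a, b, v = phi, -phi * chi, phi * (chi - 1) * P
extort2 = zd_strategy(a, b, v)
print(extort2)   # approximately (0.889, 0.5, 0.333, 0.0)

for q in [(1, 1, 1, 1), (1, 0, 1, 0), (0.9, 0.2, 0.6, 0.1), (0, 0, 0, 0)]:
    sx, sy = payoffs(extort2, q)
    residual = float(a) * sx + float(b) * sy + float(v)
    print(q, round(sx, 4), round(sy, 4), round(residual, 8))
```

With these parameters the relation reduces to (sx - P) = 2 (sy - P). Setting a = 0, b = -1/3, v = 2/3 instead gives the probabilities (2/3, 0, 2/3, 1/3), which pin the opponent's payoff at 2.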

Revolution in Game Theory? Response of Researchers to the New Solution

Hao Dong, Rong Zhi-Hai, and Zhou Tao, “Zero-determinant strategy: An underway revolution in game theory” (Chinese Physics B, 2014): “ZD ... fundamentally changes the research paradigm of game theory. In the framework of ZD … are dozens of ingenious ideas and untraditional approaches for analyzing not only prisoner’s dilemma but also bi-matrix games, which dramatically expand our understanding of the stochastic process, the mutual benefit, the cooperation incentive, and even the optimal control in the repeated games.”

William Press: “When both players have a theory of mind (that is, are not just evolving to maximize their own score) are all games in some deep way, actually Ultimatum Games.”

Freeman Dyson: “Cooperation loses and defection wins ... My view of the evolution of cooperation is colored by my memories of childhood … two important days, Christmas and Guy Fawkes. Christmas was the festival of love and forgiveness. Guy Fawkes was the festival of hate and punishment... (for) the guy who tried to blow up the King and the Parliament in 1605 and was gruesomely punished by torture and burning. For the children, Christmas was boring and Guy Fawkes was fun. We were born with an innate reward system that finds joy in punishing cheaters. The system evolved to give cooperative tribes an advantage over noncooperative tribes, using punishment to give cooperation an evolutionary advantage within the tribe. This double selection of tribes and individuals goes way beyond the Prisoners' Dilemma model.”

Chad English (comment on PLOS blog, http://blogs.plos.org/neuroanthropology/2012/06/24/prisoners-dilemma-and-the-evolution-of-inequality-does-unfairness-triumph-after-all/): “this provides the demonstrable benefit of unions and governments … companies in capitalist societies make use of ZD strategies to exploit their shorter term acting employees (“employees who do not know ZD strategies”). It is therefore in the interests of workers to create their ZD strategy organization (aka a union).”

What is the change in the PD model from the previous model? ZD assumes different information: the player who uses ZD knows ZD, and the other does not and just adjusts to gain the best it can. ZD strategies provide the player with strong unilateral control in games, but this does not mean that ZD triumphs in evolutionary games. They are strategies that provide a unilateral advantage to sentient players pitted against unwitting opponents. As we shall see in lecture 4, ZD does poorly in evolutionary settings, in part because if two extortionary ZD strategies meet, they end up at DD, the lowest value.
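The claim about two extortioners can be illustrated with a seeded simulation. This is a sketch that assumes both players use the Extort-2 probabilities (8/9, 1/2, 1/3, 0) for the payoffs R = 3, S = 0, T = 5, P = 1 and start from mutual cooperation; the starting state, the seed and the names EXTORT2 and simulate are all illustrative choices.

```python
import random

# Probability of cooperating, keyed by (my last move, opponent's last move).
# Assumed Extort-2 values for R = 3, S = 0, T = 5, P = 1.
EXTORT2 = {("C", "C"): 8/9, ("C", "D"): 1/2, ("D", "C"): 1/3, ("D", "D"): 0.0}

def simulate(rounds=10000, seed=0):
    """Play Extort-2 against itself and return the list of (x, y) outcomes."""
    rng = random.Random(seed)
    x, y = "C", "C"               # assumed starting state
    history = []
    for _ in range(rounds):
        history.append((x, y))
        next_x = "C" if rng.random() < EXTORT2[(x, y)] else "D"
        next_y = "C" if rng.random() < EXTORT2[(y, x)] else "D"
        x, y = next_x, next_y
    return history

history = simulate()
first_dd = history.index(("D", "D"))
print("first mutual defection in round", first_dd + 1)
print("mutual defection ever after:", all(s == ("D", "D") for s in history[first_dd:]))
```

Because each extortioner cooperates with probability 0 after DD, mutual defection is absorbing: once the pair reaches it, both earn P in every later round.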

How does it work with noise? Hao et al., “Extortion under uncertainty: Zero-determinant strategies in noisy games”, Phys. Rev. E 91, 052803: “The original ZD strategy does not capture the notion of robustness when the game is subjected to stochastic errors. We find that ZD strategies have high robustness against errors. We further derive the pinning strategy under noise, by which the ZD strategy player coercively sets the opponent's expected payoff to his desired level, although his payoff control ability declines with the increase of noise strength. Due to the uncertainty caused by noise, the ZD strategy player cannot ensure his payoff to be permanently higher than the opponent's, which implies dominant extortions do not exist even under low noise. But the ZD strategy player can establish a novel kind of extortion, named contingent extortions, where any increase of his own payoff always exceeds that of the opponent's by a fixed percentage; the conditions under which the contingent extortions can be realized are more stringent as the noise becomes stronger.”

Conclusion

1. In one-on-one contests where the winner lives and the loser dies, D triumphs because it beats TFT (and other reciprocal cooperators) on round one and ties afterwards. TFT/Coop never wins one on one.

2. In a tournament where a strategy interacts with many others, including strategies like itself, cooperation can score more than D-strategies. Both cooperative and defect strategies score most when playing against C strategies, but D scores more. But cooperative strategies gain more against TFT-type strategies than defect-type strategies do.

3. Outcomes in a tournament environment depend on the ecology/population of strategies and how it evolves over time. This leads to the analysis of evolutionary stable strategies (ESS) next class, where strategies evolve depending on their total points relative to others. Although a TFT-type strategy may lose every one-on-one contest, it will “win” by scoring relatively more points than a defect-type strategy.

4. Issues not treated in the basic model: robustness to mistakes in playing (mis-communication); the cost of developing more complex strategies (the number of parameters; ZD requires more “brainpower” than all D); the prevalence of PD games vs others (ultimatum, dictator, etc.) in the world.