The Pennsylvania State University

TOPICS IN LEARNING AND INFORMATION DYNAMICS

IN

A Dissertation in Mathematics by Matthew Young

© 2020 Matthew Young

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2020

The dissertation of Matthew Young was reviewed and approved by the following:

Andrew Belmonte Professor of Mathematics Dissertation Advisor, Chair of Committee

Christopher Griffin Professor of Operations Research

Jan Reimann Professor of Mathematics

Sergei Tabachnikov Professor of Mathematics

Syed Nageeb Ali Professor of Economics

Alexei Novikov Professor of Mathematics Chair of Graduate Program

Abstract

We discuss the role of learning and information in game theory, and design and investigate four models using different learning mechanisms. The first consists of rational players, one of which has the option of purchasing information about the other player's strategy choice before choosing their own. The second is an agent-based public goods model where players incrementally adjust their contribution levels after each game they play. The third is an agent-based rock, paper, scissors model where players choose their strategies by flipping cards with a win-stay lose-shift strategy. The fourth is a machine learning model where we train adversarial neural networks to play arbitrary 2 × 2 games together. We study various aspects of each of these models and how the different learning dynamics in them influence their behavior.

Table of Contents

List of Figures vii

List of Tables xi

Acknowledgments xii

Chapter 1 The Role of Learning in Game Theory 1
1.1 Introduction ...... 1
1.2 Rational learning after purchasing information ...... 3
1.3 Agent-based models with simple algorithm learning rules ...... 4
1.4 Machine learning by adversarial neural networks ...... 5

Chapter 2 Oracle Games 8
2.1 Introduction ...... 8
2.2 Preliminary Considerations ...... 14
2.2.1 Motivating Examples ...... 15
2.2.2 Definitions ...... 22
2.3 Fundamental Properties of Oracle Games ...... 23
2.4 Main Results ...... 26
2.5 Harmful Information ...... 34
2.6 Helpful Information ...... 37
2.7 Multiple Equilibria ...... 38
2.8 Discussion ...... 39

Chapter 3 Fair Contribution in a Nonlinear Stochastic Public Goods Model 42
3.1 Introduction ...... 42
3.1.1 Public Goods Games ...... 42
3.1.2 Cooperative Behavior in Biology ...... 44
3.1.3 Modifications of Public Goods Models ...... 45
3.1.4 Fairness ...... 47
3.1.5 Our model ...... 49
3.2 Definition of the ...... 49
3.3 Population Dynamics ...... 51
3.4 Numerical Simulations ...... 56
3.5 Dynamics in the Presence of a Permanent Freeloader ...... 68
3.5.1 Multiple Permanent Freeloaders ...... 70
3.6 Discussion ...... 72

Chapter 4 Population dynamics in a model with restricted strategy transitions 74
4.1 Introduction ...... 74
4.1.1 Win-Stay Lose-Shift ...... 74
4.1.2 A Biological Basis for a Restriction to Two Strategies ...... 76
4.1.3 A Biological Basis for Win-Stay Lose-Shift ...... 77
4.1.4 Our Model ...... 79
4.2 Discrete Model ...... 80
4.3 Extinction in a Restricted Transition Population ...... 83
4.4 Continuous Models ...... 85
4.5 Discussion ...... 92

Chapter 5 Neural Networks Playing Games 95
5.1 Introduction ...... 95
5.1.1 Machine Learning ...... 95
5.1.2 Game theory and adversarial networks ...... 97
5.1.3 Our model ...... 99
5.2 The Model ...... 99
5.2.1 Construction ...... 99
5.2.2 Errors ...... 102
5.2.3 Temptation Games ...... 108
5.3 Pruning ...... 111
5.4 Network Comparisons ...... 114
5.4.1 Euclidean Distance in Weight Space ...... 114
5.4.2 Polar Projection ...... 118
5.4.3 Paths Between Networks ...... 121
5.4.4 Correlation metrics ...... 124
5.5 Level-k hierarchy in networks ...... 128
5.6 Discussion ...... 133
5.6.1 Summary ...... 133
5.6.2 Extensions of this model ...... 134
5.6.3 Applications ...... 135

Chapter 6 Conclusion 137
6.1 Discussion ...... 137
6.1.1 Types of Information ...... 137
6.1.2 Benefit to players ...... 139
6.1.3 Convergence to a ...... 141
6.1.4 Intelligence ...... 142
6.2 Future Research ...... 143

Appendix Master Equation to Langevin Derivation from Chapter 4 146

Bibliography 150

List of Figures

2.1 Extensive form game ...... 15

2.2 Game tree for the motivating construction of Oracle Games ...... 16

2.3 Game tree for the standard construction of Oracle Games ...... 17

2.4 Payment in equilibrium shown for oracle functions I(x) = (a) √(x + 1) − 1; (b) √x; (c) 2√x ...... 19

2.5 Amount paid by Player A at the equilibrium (above), and the resulting response rate I (below), as functions of k when I(x) = √(kx) ...... 20

2.6 Illustration of the construction to prove Proposition 2 (see text): (top) original given oracle function I(x); (middle) nondecreasing equivalent oracle function J1(x); (bottom) final nondecreasing, concave oracle function J(x) ...... 25

2.7 Harmful Information Extensive form game ...... 35

2.8 Ea as a function of k when I(x) = √(kx) ...... 37

3.1 Graph of the benefit function b(C) = √(400C), with equilibrium and socially optimal values marked in the case where m = 2 ...... 51

3.2 Examples of the Nash equilibrium value Ce plotted as a function of α, for three different return values R = 0.4 (bottom), 0.7 (middle), 0.95 (top). . 52

3.3 Representation of a single round of play...... 53

3.4 Simulation with m = n = 4 ...... 55

3.5 Numerical simulation of the model with n = 100 players and group size m = 10, showing the distribution of contributions C of each player around the fairpoint f: t = 0 (top), t = 800 (middle), t = 1600 (bottom) ...... 56

3.6 Average contribution cavg for population of n = 100 players over time, shown for: m = 10 (top), m = 50 (middle), m = 100 (bottom)...... 57

3.7 Examples of five individual players’ ci over time, in a population of n = 50 players and group size m = 10...... 58

3.8 E1 behavior ...... 58

3.9 E1 over time shown for three different initial conditions with n = 100, each with different subgroup sizes: m = 10, 50, 80...... 59

3.10 The initial states at t= 0 (left) and at t = 1600 for each group size (right) from the simulation used in Fig. 3.9...... 59

3.11 Average decrease rate for E1 as a function of m...... 60

3.12 The average value of ci for populations of n = 100 players with one permanent freeloader, as a function of group size m. Each population was numerically simulated and measured ten times at regular intervals between t = 200,000 and 300,000. We fixed d = 10,000, and for each m let ui = √(40000mC) − ci, which gives f = 10,000 even when m changes ...... 68

3.13 The average value of ci for populations of n = 100 players with two permanent freeloaders, as a function of m. In this case, two transitions are observed (see text)...... 71

4.1 Illustration of the population dynamics for one time step ...... 81

4.2 Trajectory of one simulation for N = 8 ...... 84

4.3 The average time for strategies played to reach a monoculture (average extinction time) as a function of the total number of cards N, with A = 1/2,B = 1/2,C = 0...... 85

4.4 Heat map of the discrete system on a grid for (a) N = 40; (b) N = 200. . 86

4.5 Trajectory for the deterministic system, with equally spaced initial conditions around the plane, for A = 1/2 ...... 87

4.6 Parameterized curve of the x and y coordinates of the equilibrium point as functions of A, with A = 0 at (0, 0), and A = 1 at (1, 0) ...... 87

4.7 Individual system trajectories for total population size N = 800 cards, with the same initial conditions comparing (a) the discrete stochastic system and (b) the Fokker-Planck system...... 90

4.8 Heat maps for the discrete and continuous systems with N = 40 after 1,000,000 steps ...... 91

4.9 Heat map of the difference in the discrete system and continuous system 91

4.10 Extinction Times of discrete system and Fokker-Planck system on a log scale plot, as well as lines of best fit...... 92

4.11 Trajectory for Fokker-Planck system with N = 8 ...... 93

5.1 Network Playing Prisoner Dilemma. Blue lines are positive weights . . . 102

5.2 Error of a network over time ...... 103

5.3 Coordination behavior over time ...... 105

5.4 Error (y) as a function of number of hidden layers (x) ...... 107

5.5 Error (y) as a function of neurons per hidden layer (x) ...... 108

5.6 Probability A of choosing strategy sA as a function of x in temptation game 109

5.7 Probability A of choosing strategy sA as a function of x in temptation game after special training ...... 110

5.8 Log-Log plot of error (y) as a function of neurons pruned (x) ...... 112

5.9 Histogram of pairwise Euclidean distances for 10 networks ...... 116

5.10 Error for network pr as a function of r ...... 117

5.11 Examples of paths in R2 (left) and their corresponding polar projections (right)...... 119

5.12 Projection of a network path created via backpropagation (left), and a random walk (right) ...... 120

5.13 Polar projection of three network paths, with θ0 = 0, and paths colored based on the error of the adjacent networks ...... 123

5.14 Errors of network paths from Figure 5.13 ...... 124

5.15 Correlation score ci (y) vs Brute Force Importance bi (x) for neurons in the first layer of 10 networks...... 128

5.16 Example of a network emulating a level 1 player ...... 130

5.17 Neural Network Transistor ...... 130

5.18 Correlation score (y) vs BFI (x) for neurons in the first layer of 10 networks, with correlations to closest ideal neurons...... 131

List of Tables

2.1 Payoff matrix for Battle of the Sexes ...... 10

3.1 Payoffs for a 2 player public goods game...... 43

3.2 Important Variables ...... 52

5.1 Payoffs for an arbitrary 2 × 2 game...... 100

5.2 Errors for networks playing against their original opponent (col. 1), against a different opponent (col. 2), as well as errors for networks trained against a frozen opponent against that opponent (col. 3) and against a different opponent (col. 4) ...... 104

5.3 Payoffs and Errors for networks playing against each other (col. 1), against a Nash player (col. 2), and for Nash players playing against networks (col. 3), and other Nash players (col. 4) ...... 107

5.4 Payoffs for the temptation game...... 108

Acknowledgments

I would like to acknowledge my thesis advisor Andrew Belmonte whose guidance and support have helped make my time at Penn State a wonderful experience. Mathematics is filled with wonderful and exciting things and he has an unrivaled expertise in opening lines of inquiry that are both fun and useful at the same time. I would like to acknowledge Chris Griffin for helpful advice and discussions throughout my time here. His vast knowledge of various topics was often vital in identifying techniques to use and literature to consider. I would like to acknowledge the members of my committee for giving their time and attention, and for being available to meet remotely during the quarantine.

Chapter 1 | The Role of Learning in Game Theory

1.1 Introduction

What is the role of learning and information in game theory? We broadly define “learning” to be the process by which an agent changes their strategy in response to information obtained from previous events. Classical game theory assumes that all players are perfectly rational, that they have complete knowledge of all rules of the games and all possible payoffs, that they are capable of performing all mathematical calculations, and that they know every other player is likewise rational. There isn't much players could learn, since they start knowing everything that there is to know. And in a one-shot game, players do not make multiple decisions that would allow them to express learning. Learning requires the passage of time, and thus has no role here. On the other hand, evolutionary game theory includes the passage of time, but goes so far as to strip away all decision-making from the individual, instead hard-wiring their decisions into their biology. The connection between evolutionary games and biology was first made by Smith and Price [1], using games to make a simplified model of animal behavior and create plausible explanations for observed behavior. They observe that animals in conflict often fight physically while deliberately avoiding use of their more dangerous weapons such as claws or horns. They construct a model bearing some resemblance to the iterated prisoner's dilemma, and run simulations with several plausible strategies they come up with. They then compare the interactions of the strategies in their simulations with the behavior of various species that seem to engage in similar behavior in conflicts. Much other research has since been done studying biological systems using methods

and models from game theory. Often, population dynamics are defined by having each species correspond to a fixed pure strategy, and the proportions of species in the population evolve according to the replicator equation, which increases the proportion of species that score higher in the games they play and decreases the proportion of species that score lower. The most general form of the replicator equation is given by

\[
\dot{x}_i = x_i\,[\,f_i(x) - \phi(x)\,]
\]

where x_i is the frequency of species i in the population, f_i(x) is the average fitness of species i, and φ(x) is the average fitness of all species in the population [2]. This has nice properties, such as the rate of change of a population, corresponding to birth and death processes, being proportional to the amount of that population that already exists. Additionally, subtracting the average fitness in the equation guarantees that the total sum of all species frequencies remains constant, and so each x_i is meaningfully interpreted as a frequency if all values are initialized such that they sum to 1 [3]. We discuss various biological models throughout this dissertation in the context of the chapter they're most relevant to, many of which use replicator dynamics.

However, the replicator equation, and other similar evolutionary models, implicitly represent individuals who have fixed strategies, and their frequencies change via reproduction. Individual players are simplistic and instinctive, and incapable of changing their strategy, so they are also incapable of learning. One could consider the population as a whole to have some form of learning, as it changes in composition over time based on the game's structure, but this is typically not the perspective taken in such models. Somewhere between the extremes of perfectly rational agents and simple unchanging agents are models using bounded rationality, where players reason using simplified internal models of their opponents and the environment [4]. It is this middle region that we are primarily interested in, as this is where learning can occur. Agents do not begin by knowing everything that can be known, but have some capacity for understanding by which they might gain this knowledge, either implicitly or explicitly, and then adapt their behavior in response.

In game theory, information is defined by the ability of an agent to condition their strategy on different states. If at the time they take an action, a player knows whether they are in state A or state B, their strategy might specify taking one action when in state A and a different action in state B. If a player knows they are in one of states A or B but not which, then they must choose a single action to be played in either case.

A player with access to a finer partition of states therefore has more information than a player with a coarser partition. Although learning may occur by a player acquiring new information and rationally calculating the best response to each possible state, this is not the only form of learning we consider. We allow any algorithm that updates its strategies based on new information to be classified as learning, provided there is some reasonable justification suggesting such updates will tend to increase the player's utility (as opposed to a player which changes strategies completely at random). In this dissertation, we consider four models falling under three main categories of learning: rational learning after purchasing information, agent-based population models with simple algorithm update rules, and machine learning by competing neural networks. All numerical simulations were written in Java using Eclipse IDE by the author.
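For instance, replicator dynamics of the kind described above can be integrated numerically in a few lines. The sketch below is purely illustrative: the payoff matrix, class name, and step size are our own choices, not taken from the dissertation's simulation code.

```java
import java.util.Arrays;

/** Minimal Euler-step integration of the replicator equation
 *  dx_i/dt = x_i [ f_i(x) - phi(x) ] for a matrix game (illustrative sketch only). */
public class ReplicatorDemo {
    // Example payoff matrix (a rock-paper-scissors style game); not from the dissertation.
    static final double[][] A = {
        { 0, -1,  1},
        { 1,  0, -1},
        {-1,  1,  0}
    };

    public static void main(String[] args) {
        double[] x = {0.5, 0.3, 0.2};    // initial frequencies, summing to 1
        double dt = 0.01;
        for (int step = 0; step < 1000; step++) {
            double[] f = new double[x.length];
            double phi = 0.0;
            for (int i = 0; i < x.length; i++) {
                for (int j = 0; j < x.length; j++) f[i] += A[i][j] * x[j]; // fitness f_i(x)
                phi += x[i] * f[i];                                        // average fitness phi(x)
            }
            for (int i = 0; i < x.length; i++) x[i] += dt * x[i] * (f[i] - phi); // replicator update
        }
        System.out.println(Arrays.toString(x));
    }
}
```

Because the average fitness is subtracted in each update, the frequencies continue to sum to 1 (up to the small error of the Euler step).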

1.2 Rational learning after purchasing information

Rational players are defined as always choosing the strategy that maximizes their expected utility, given the information that they possess. This requires that they know everything they need to know in order to perfectly compute this strategy. However, learning can still occur if players do not have all of the information that exists about the game. In general, missing information in games falls into one of two main categories: incomplete information, where players lack information about some of the rules of the game or the payoffs that will result from outcomes, and imperfect information, where players lack information about the current state of the game, such as decisions made by other players or random events that have occurred in secret. Under certain assumptions a game with incomplete information will be equivalent to a Bayesian game with complete but imperfect information [5]. However, generally the two types of information and their motivations from real world scenarios remain distinct. Classic game theory considering one-shot simultaneous games has complete but imperfect information, since neither player knows the strategy of the other player until they have both played out their choices. The role of information and the willingness of players to purchase it has been studied in various forms in games with imperfect information and in games with incomplete information. In games with imperfect information, where one player has the ability to acquire information about the second player's strategy, it is necessary to consider both the actions of the first player in acquiring this information and responding to it, and the second player's actions taking into account the possibility of their strategy being revealed. Much research has been done studying this dynamic. Typically, players are either given

a binary choice to purchase information for a fixed cost or not, or are given a noisy signal that correlates with useful information, and are given the ability to pay a variable amount in order to increase the signal's accuracy according to some cost function. In Chapter 2 we introduce a formalism into standard two player games for the purchase of information about strategy choices, and study the effect of its cost on mixed strategy equilibria. In particular, we replace the standard practice of implementing partial information, in which players pay for the increased accuracy of noisy signals, with a partial response approach in which completely accurate information is only sometimes received, and players pay for a higher probability of receiving it. Rather than focus on the role of learning on the player's strategy choice after receiving the information, which ends up being trivial, we consider the willingness of players to pay for this opportunity, as well as the response of the other player to the ability of their opponent to learn about their action.

1.3 Agent-based models with simple algorithm learning rules

Agent-based modelling is a category of models focused on populations of agents which make decisions individually according to a simple set of rules. Often, the system as a whole will display complex emergent phenomena as a result of the combination of many simple interactions. Typically in a game theory context, subsets of players are sampled from a larger population to play a game, and their actions consist of what strategy to play in each game. These agents are usually subrational: they have some simple decision rule which allows them to adapt to local conditions in a way that hardwired creatures in standard replicator dynamics cannot, but much less optimally than a purely rational player with complete information would be able to [6]. We coded a general-purpose agent-based modeller, which can be initialized with payoffs for a game and a population of players, each of which has some initial strategy and a simple update rule telling them how to change their strategy as a result of the game's outcome. Each time step, a set of players is sampled from the population to play one instance of the game together. They all play their currently stored strategy, receive the results of the game, and then update their strategy according to the rule before being returned to the larger population to wait until the next time they are selected. Players are not given the full set of information about the game, nor do they possess

the reasoning faculties required to process this information in order to maximize their payoffs. Instead, they update instinctively in ways that would tend to increase their payoff in the game they just played, but may be shortsighted and naive. Nevertheless, they still can learn over time as the individual adjustments of each player drive the distribution of the strategies in the entire population. Players learn and adapt by accumulating small insights from each game that is played, and the update rule that they use plays a large role in the population dynamics that emerge from this. Selecting different games and different update rules results in completely different models with completely different behaviors. We focus on two different agent-based models derived from this general modeller in this dissertation. In Chapter 3 we consider a nonlinear public goods game with continuous contribution amounts. Players' strategies are a single number corresponding to the amount they contribute to public goods games, and their update rule increments their contribution up or down by a small amount based on the direction of best response for the group in which they participate. We analyze the dynamics that occur for various parameter values, and find that the players cluster near the “fair” Nash equilibrium, in which the players share equally in contributing to the public good. In Chapter 4 we consider the game Rock, Paper, Scissors, and give players an unusual Win-Stay Lose-Shift update rule. Each player is given a "card" with two of the three possible strategies on it, one of which begins face up. Players are selected in sets of two, play the face-up strategy on their card, and each player's update rule is to flip their card over if they lose, switching the face-up and face-down strategies. This means we have a heterogeneous population, where different players are restricted to only being able to play certain strategies. We discuss biological parallels for this system, and analyze the emergent behavior of the population based on the distribution of cards and the number of players. In both of these models, players obey a simplistic and short-sighted learning rule that seeks to maximize payoffs in the game most recently played, ignoring the larger population as a whole. Nevertheless, the population as a whole will adapt to larger trends in aggregate.
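The round structure of the general modeller described above can be summarized in a short simulation loop. The sketch below is a hypothetical reconstruction under stated assumptions: the interface and method names (Agent, currentStrategy, update) and the placeholder payoff rule are ours, while the actual payoff and update rules are the chapter-specific ones described in Chapters 3 and 4.

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Skeleton of the round-based agent simulation described above (illustrative only). */
public class AgentBasedLoop {

    /** A hypothetical agent: stores a strategy and a simple update rule. */
    interface Agent {
        double currentStrategy();                                // e.g. a contribution level
        void update(double[] groupStrategies, double ownPayoff); // simple, possibly shortsighted rule
    }

    /** Placeholder payoff rule: every group member receives the group average (illustration only). */
    static double[] payoffs(double[] strategies) {
        double avg = 0;
        for (double s : strategies) avg += s / strategies.length;
        double[] p = new double[strategies.length];
        java.util.Arrays.fill(p, avg);
        return p;
    }

    /** Each time step: sample a group, play one game, let each member update its stored strategy. */
    static void run(List<Agent> population, int groupSize, int steps, Random rng) {
        for (int t = 0; t < steps; t++) {
            Collections.shuffle(population, rng);
            List<Agent> group = population.subList(0, groupSize);
            double[] s = new double[groupSize];
            for (int i = 0; i < groupSize; i++) s[i] = group.get(i).currentStrategy();
            double[] p = payoffs(s);
            for (int i = 0; i < groupSize; i++) group.get(i).update(s, p[i]);
        }
    }
}
```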

1.4 Machine learning by adversarial neural networks

Machine learning has seen increasing amounts of attention in recent years, in both academic and industrial applications. Although the concept originates in the 1950s,

the increasing speed and performance of computers have made this field increasingly important, as many tasks can now be performed by deep neural networks that simply would not have been feasible to train on the less powerful computers of previous years, such as image recognition [7]. Neural networks store a collection of real numbers corresponding to the weights of the connections between the various artificial neurons in the network. The network is given some sort of training data: it receives inputs, computes an output, and this output is compared to a desired output. The network is then updated, typically via a process called backpropagation, where the stored weights are adjusted based on their role in causing the network's output to deviate from the desired output. This is similar to the incremental updates we use in our public goods based model in Chapter 3, in that the agent participates in a series of activities and adjusts stored values slightly after each round in an attempt to get closer to the optimal value. However, instead of having a large number of simple agents that each store a single value, the neural network is a single agent that has a large number of stored values, with each value performing a different role within the network. Neural networks thus learn via algorithms that are designed to update them towards whatever the training data they are given represents. Their weights are adjusted to match patterns in the input data to the desired output labelled in the training data. A network learns by example, and thus the quality of its learning, and its ability to transfer that learning outside of the training data, is constrained by the quality and variety of the training data, as well as by the ability of the network to actually enact the correct function. In Chapter 5 we construct pairs of neural networks and train them to play arbitrary 2 × 2 games against each other. Both players are trained simultaneously against each other, and thus learn not only general insights about game theory, but also about the specific opponent they're being trained against. Unlike much research in machine learning, we focus not on trying to design the best network to solve a problem, but instead on attempting to overcome transparency issues. We develop and adopt various tools to understand the internal structures of our networks and how they go about making decisions.
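The idea of nudging stored weights in whatever direction reduces the output error can be illustrated with a single weight. The toy delta-rule example below is our own illustration and is far simpler than the networks of Chapter 5.

```java
/** One-weight illustration of error-driven weight updates (a toy delta rule, illustrative only). */
public class DeltaRuleDemo {
    public static void main(String[] args) {
        double w = 0.0;                          // the single stored weight
        double learningRate = 0.1;
        double[] inputs  = {1.0, 2.0, 3.0};
        double[] targets = {2.0, 4.0, 6.0};      // the "training data": output should be 2 * input
        for (int epoch = 0; epoch < 100; epoch++) {
            for (int k = 0; k < inputs.length; k++) {
                double output = w * inputs[k];
                double error  = output - targets[k];
                w -= learningRate * error * inputs[k];   // gradient of 0.5 * error^2 with respect to w
            }
        }
        System.out.println("learned weight ~ " + w);     // converges to 2
    }
}
```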

Funding

This material is based upon work supported by the National Science Foundation under Award No. CMMI-1463482 and Award No. CMMI-1932991. Any opinions, findings, and

conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation.

Chapter 2 | Oracle Games

Most of the material in this chapter appears in [8], available on the arXiv.

2.1 Introduction

What is the value of information? In classical game theory, players have complete knowledge of the rules, the strategy choices available, and the payoffs in the game, but they have imperfect information about what actual strategy choice is made by the opposing player. In general, more information allows players to discriminate between different situations, and thus have more control over which decisions they make in each case. In fact, there is experimental evidence that human subjects will pay to know what strategy is being played against them, even when that information has no impact on their strategy choice [9]. Correspondingly, people will also sometimes attempt to deceive an opponent about their own strategy [10]. In various forms of competition, whether in business or biology, individuals are often willing to expend time, capital, or valuable energy resources in order to obtain information which allows them to make better decisions [11–13]. How should the value of information be included in game theory? Additionally, how does the value attributed to information as a commodity play a role among other choices made in a game? If players recognize the importance of learning insofar as it increases their ability to receive higher payoffs, then they will be willing to pay for the privilege of doing so. Additionally, if players are aware of the ability of others to learn certain information, then they will behave differently than if they expect the other players to remain ignorant of that information. Missing information in game theory can be broadly categorized into two types: imperfect information, where the missing information is about the players' strategies themselves, and incomplete information, where the missing information concerns the

game rules and payoffs. Classic game theory with simultaneous games, such as rock paper scissors, contains complete but imperfect information, since players know all of the payoffs but do not know what strategy the other will play (except what they can deduce via rational thought). Extensive form games, such as Tic-Tac-Toe, have complete and perfect information. Games with incomplete information typically require that players have some prior beliefs about the unknown values (some knowledge of the set and probability distribution the values are being chosen from) in order to make decisions which maximize their expected payoff. Under certain assumptions a game with incomplete information will be equivalent to a Bayesian game with complete but imperfect information (Harsanyi, 1967). This is done by considering nature to be an additional player that plays a mixed strategy determining which game players end up playing. However, generally the two types of information and their motivations from real world scenarios remain distinct. The role of information and the willingness of players to purchase it in games with incomplete information, where the missing information is regarding the game rules and payoffs, has been studied in various forms. Hellwig and Veldkamp [14], Myatt and Wallace [15], and Rigos [16] all study different versions of Beauty Contest games where players choose a real number, and receive a payoff which increases based on how close their number is to the average of other players' choices, and also increases based on how close it is to some randomly determined state variable θ. Rigos defines the payoff as

\[
u_i(a) = -(1 - \gamma)\,(a_i - \theta)^2 - \gamma\,(a_i - \bar{a})^2,
\]
where $\bar{a}$ is the average of the other players' choices, and the other works make nearly identical definitions. The two goals of proximity to the target and proximity to other players are combined in a weighted average, with γ influencing which of them is a higher priority. Before choosing their strategy, players are given the option to sacrifice some of their payoff to purchase access to a noisy signal about θ, which allows them to get closer to it, as well as to other players who purchase the same (public) signal or separate (private) signals. A public signal is more valuable, since all players who purchase it will have identical information and be able to coordinate with each other. However, a private signal still allows players to end up near θ, and since different players' signals are all noisy versions of the same value θ, they will be correlated with each other as well. In another paper, Myatt and Wallace [17] study Cournot games where multiple firms produce similar products and earn profits based on supply and demand functions, and are given the opportunity to purchase noisy information about the demand before they decide how much to produce. Yang [18], as well as Szkup and Trevino [19], study investment

coordination games where players can choose to invest in a risky project, receiving a random value, or opt out, and have the opportunity to pay for a noisy signal about the payoff of investing before making the decision. Li and Yang [20] study the effects of pre-communication in a variant of Battle of the Sexes. In the standard Battle of the Sexes, two players are given a choice between two options. Each player i has a preference for option i and receives some payoff θ for choosing that option. However, they also want to coordinate on the same option, and will each lose utility C (typically greater than θ) if they select different options. Thus, the players would prefer to coordinate on the same option even if it is not their preferred option, but each still wants the other player to compromise rather than having to compromise themselves.

Player 1 \ Player 2      B1              B2
A1                       θ, 0            θ − C, θ − C
A2                       −C, −C          0, θ

Table 2.1. Payoff matrix for Battle of the Sexes

This game has two pure strategy equilibria, where the players coordinate together on one of the options, and a mixed strategy equilibrium where the two players often miscoordinate. Li and Yang’s model assigns the players different preference intensities, so rather than having the same θ in each outcome, each player has their own θi assigned from some probability distribution. Each player’s preference intensity is known to them, but is hidden from the other player, who only knows the general probability distribution. The players are then given the option to send a signal about their preference intensity to the other player, having a choice between declaring it to be "low", "medium", or "high". However this is done via "" meaning there is no cost to the signal and nothing prevents the players from lying or exaggerating. In the primary equilibrium the authors focus on, players will defer to the player who sent the highest signal, and will exaggerate to some degree to take advantage of that, but for higher miscoordination costs they will refrain from doing so too severely because when both players send the same signal they each respond by choosing their own preferred strategy, failing to coordinate. Thus they find that useful information can be conveyed and improve the payoffs of players compared to the payoffs in a game without such signals, even when the signals are sent by players with incentives to deceive the other player. Hu et al. [21] adapt this model to experiments with human participants and find similar dynamics: players exaggerate to

some degree, but do so less when the cost of miscoordination is higher. Martinelli [22] studies an election model with the ability to purchase information about candidates. A large population of players are voters in an election with two candidates, one of which is a better candidate and will result in a higher utility for all of the players if elected, but players don't know which one it is. Each player is given the option to purchase information for some cost, which differs between players. Players then receive a signal which they use to determine which candidate to vote for. Players who chose not to purchase information favor each candidate with probability 1/2, while players who purchase information favor the better candidate with probability 1/2 + q for some fixed value q ∈ (0, 1/2]. Although this means for q < 1/2 there's some chance that voters elect the wrong candidate even if all are informed, for large populations this becomes increasingly improbable. In fact, because variance only increases as the square root of population size, as the population size increases a partially informed electorate can maintain a reasonable success rate even as the proportion of informed voters decreases. Martinelli finds this is precisely what happens in equilibrium, with only players who have a cheap cost of information choosing to purchase it, while the rest choose to remain ignorant in order to avoid paying the cost. In some sense, this acts much like a public goods game, where purchasing information acts like contributing to the public good: it costs the individual something and increases the expected payoff of the entire population. And the equilibrium in this model is much like one in a public goods game where players have varying costs of contribution: those with cheap costs will choose to contribute, while those with more expensive costs will not. However, because individuals are rational, they will only pay the cost if the expected increase in their own payoff will be at least as large as the cost, regardless of the higher increase to the population as a whole. Thus the equilibrium behavior involves some purchase of information, and increases the utility of all players to some degree, but it is less than the amount that would be socially optimal to maximize the total utility of the population. This also matches the equilibrium behavior of the public goods game we study in Chapter 3.

accuracy, since more complex machines will be able to process the data more accurately and achieve higher expected payoffs. An interesting thing to note is that information purchase in almost all of these examples is socially beneficial. It doesn't just benefit the player who purchases information, but also benefits other players by enabling coordination and predictability. This happens primarily because these models use mostly cooperative games, where the payoffs of players are highly correlated with each other. In more competitive games, information a player gains that enables them to increase their payoff tends to decrease the payoffs of other players. In games with imperfect information, where one player has the ability to acquire information about the second player's strategy, it is necessary to consider both the actions of the first player in acquiring this information and responding to it, and the second player's actions taking into account the possibility of their strategy being revealed. Many studies have investigated this for various games. For instance, Ben-Porath and Kahneman [24] study a general model in which players play iterated games where they only observe their own actions and payoffs, but can pay some cost in order to observe the actions of other players in the same round. Players are able to punish each other in response to actions they observe. They find that players can achieve an equilibrium close to the Pareto frontier if they monitor sparsely to keep costs down and inflict harsh enough punishments for selfish actions to keep players in line. Players also have a method of sending public signals to prove that they are actually monitoring rather than trying to increase their utility by avoiding the monitoring cost, and other players will punish them for failing to monitor properly. In some sense, monitoring acts like a meta-level public goods game on top of the regular game, where monitoring provides a benefit to all players by enforcing equilibria, but at some cost to themselves. (We discuss public good games and punishment in more detail in Chapter 3.) Flesch and Perea [25] construct a very similar model, with a few differences such as players paying to receive information about the actions of other players in the past, rather than their actions in the present when the cost is paid. They find largely similar results for equilibria of their model. Miklós-Thal and Schumacher [26] create a variation on this concept by having information sold by a rational third party that observes noisy signals about some players' actions, which might be costly but helpful to others, or selfish and harmful to others. The monitor can sell positive or negative recommendations to other players about whether they should opt in to interactions with particular players. They find that there are

multiple equilibria. If the monitor provides accurate information cheaply, players no longer behave selfishly, which makes them predictable and lessens the value of the monitor's information. But if the monitor deliberately obfuscates some of its information so that it's useful but not maximally so, then players will still sometimes behave selfishly, creating an unreliable environment where the information has a higher value despite occasional inaccuracies, and thus can be sold for a higher price. There are many other examples of research involving signalling and sharing of information in games with imperfect information. Sakai [27] studies Cournot games where firms with differentiated products receive private information about demand for their own product, and choose whether or not to freely offer this information to their competitor in order to influence their decision. Ruiz-Hernández et al. [28] study variations of a model that compare Cournot and Bertrand games, where firms make decisions simultaneously, to Stackelberg games, where one firm makes a decision and the other can observe this decision before making their own. Halpern and Pass [29] construct a game theoretic framework for "translucent players". Rather than being able to change strategies unilaterally, as in a Nash equilibrium, each player in an equilibrium believes there is a small probability that any attempt to deviate will be leaked to the other players, who will have an opportunity to respond by also changing their strategies. This can enable equilibria such as cooperation in a Prisoner's Dilemma, as players can credibly threaten to defect in response to a defection from the other player despite play being simultaneous most of the time. Antonioni et al. [30] study costly information in network games. Experimental subjects are connected to each other in a graph network and each round each player plays the Prisoner's Dilemma with all adjacent players in the network. They then have some ability to alter their connections in an attempt to avoid defecting players and connect to new players. Players have the ability to pay a cost in order to learn the most recent decision of potential new neighbors before approving the connection. They find that higher costs for this decrease the overall rate of cooperation in the population, as consistent cooperators are less likely to be rewarded with more connections than when costs are low. The most similar work to ours is by Solan and Yariv [31], who study a modification to two-player games that gives one player the opportunity to purchase information about the other player's action before responding. In their paper, if the player pays, he always receives a noisy signal which is correlated with the opponent's action, but has some probability of signalling a different action and misleading the player. Higher payments will increase the reliability of the signal, depending on the cost function associated with

the signalling device. This leads to a very broad and complex set of possible cost-to-signal functions that can be included, making it difficult to prove strong results except when restricted to specific types of games. They find that if sufficiently reliable information can be purchased cheaply enough, the player will purchase it and act on the information as if it were completely true, and also that the information cost affects the game's equilibrium only insofar as it determines whether information is worth purchasing or not. The actual amount of information purchased, if any, depends only on the payoffs for the player without any information. This corresponds to the amount that causes the player's strategies to become dominated, and is related to our notion of nodes. Our work differs from previous work primarily by replacing the standard practice of implementing partial information, in which players pay for the increased accuracy of noisy signals, with a partial response approach in which accurate information is only sometimes received, and players pay for a higher probability of receiving it. This chapter is organized as follows. We first introduce an extrinsic third player into a classic two player game: an all-knowing “oracle” who can be paid for a chance to reveal information about one of the players to the other. After exploring the consequences of such an oracle in representative examples, we define the properties of these games, and prove results on how any mixed Nash equilibria will be modified by the cost function of the oracle.

2.2 Preliminary Considerations

We begin by considering a standard normal form game G, focusing on cases where there is exactly one mixed strategy Nash equilibrium; as we will show later, the modifications we propose here do not affect the pure strategy Nash equilibria, so games with only pure strategy equilibria will be unchanged. We briefly discuss games with multiple mixed equilibria in Section 2.7. We define an oracle to be an external agent to G who knows and can potentially reveal information about each player’s actual choice of strategy, before these choices are played and payoffs resolve. The oracle is defined to have an associated oracle function I(x) which determines its probability of response as a function of the amount it is paid. When paid x, the oracle either reveals completely accurate information about a player’s realized strategy choice with probability I(x), or remains silent and gives no information with probability 1 − I(x). In this way, the oracle allows for partial purchase of information about a player’s choice without introducing anything other than factual information (i.e. the oracle either tells the truth or says nothing). To

our knowledge, we are the first to implement partial information in this manner. In principle, $I : [0, +\infty) \to [0, 1]$; however, the domain of I may be effectively bounded, since a rational player will not pay beyond some fixed amount $x_m$, determined for instance by the largest variation in payoffs in the game, $x_m < P_{\max} - P_{\min}$. Note also that x = 0 is included, which represents the option of not paying the oracle at all.
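One concrete family, of the type used in the examples later in this chapter (and capped at 1, as assumed there), is
\[
I(x) = \min\!\left(\sqrt{kx},\ 1\right), \qquad k > 0,
\]
which is continuous, nondecreasing, and concave, with $I(0) = 0$ and $I(x) = 1$ for every $x \ge 1/k$; larger k corresponds to cheaper information.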

2.2.1 Motivating Examples

We first consider the following two-player game, in order to illustrate our approach.

A \ B      B1        B2
A1         1, −1     0, 0
A2         0, 0      2, −2

Note that this is equivalent to a matching pennies (anticoordination) game with scaled payoffs. Here the only Nash equilibrium is when A and B each play the mixed strategy (2/3, 1/3).


Figure 2.1. Matching Pennies Extensive form game

Since every simultaneous game is equivalent to a sequential game in which neither player observes the actions taken by the other [32], we consider this example as a sequential game in which player B selects a strategy first, as illustrated in Fig 2.1. We next introduce an oracle, and modify the game by inserting additional stages into the standard sequence. In our first construction, this goes as follows:

1. Player A chooses a nonnegative amount x to pay to the oracle.

2. Player B chooses a strategy.

3. With probability I(x) the oracle informs player A of player B's realized strategy, and with probability 1 − I(x) remains silent.

4. Player A then chooses a strategy and the game resolves, with player A’s final payoff being decreased by the payment x chosen earlier.

Fig 2.2 shows the extensive form of this game. Note that since no information is given to player B at any point, the order of stages 1-3 may be rearranged in several ways, which allows for easier analysis without affecting the game.


Figure 2.2. Game tree for the motivating construction of Oracle Games

First, note that for each of player B’s strategies, player A’s best response to that strategy is unique (and we restrict ourselves to games with this property for the remainder of the chapter). Therefore, in the case that the oracle responds and provides B’s strategy, the rational response of player A will already be determined. Thus player A only makes two choices: the amount x of payment to the oracle, and the strategy choice when the oracle does not respond. It is equivalent to consider the circumstance where A makes his decision of what to play at the beginning of the game, but changes his mind if the oracle responds. The following structure leads to equivalent behavior and payoffs for every payoff matrix:

1. Player A chooses any nonnegative amount x to pay to the oracle.

2. With probability I(x) the oracle commits to informing player A of B's strategy at a later time, and with probability 1 − I(x) commits to remaining silent.

3. Player A tentatively chooses a strategy to play if not given a response.

4. Player B chooses a strategy to play.

5. If the oracle committed to respond, it does so now, and Player A ignores his previous choice and chooses the best response to player B's realized strategy. If the oracle committed to remaining silent, then player A uses the tentative choice. In either case, the game resolves and player A's payoff is reduced by x.
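A single round of this protocol is easy to simulate. The Java sketch below is illustrative only: the class and method names are hypothetical, B's strategy is drawn uniformly at random purely for demonstration (rather than from an equilibrium), and the oracle function is a capped square root (k = 1 in the family √(kx) used later in this chapter).

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** One round of the oracle-game protocol described above (illustrative sketch, not the paper's code). */
public class OracleRound {
    static final Random RNG = new Random();

    /** payoffA[i][j] = payoff to A when A plays row i and B plays column j. */
    static double playOnce(double[][] payoffA, double x, DoubleUnaryOperator oracle,
                           int tentativeA, int choiceB) {
        boolean responds = RNG.nextDouble() < oracle.applyAsDouble(x); // oracle responds with probability I(x)
        int a = tentativeA;
        if (responds) {
            // A learns B's realized choice and switches to the best response in that column.
            a = 0;
            for (int i = 1; i < payoffA.length; i++)
                if (payoffA[i][choiceB] > payoffA[a][choiceB]) a = i;
        }
        return payoffA[a][choiceB] - x;                                // A's payoff is reduced by the payment
    }

    public static void main(String[] args) {
        double[][] m = {{1, 0}, {0, 2}};   // Player A's payoffs in the scaled matching pennies example
        DoubleUnaryOperator oracle = p -> Math.min(Math.sqrt(p), 1.0);
        double payment = 0.1, avg = 0;
        int trials = 100_000;
        for (int t = 0; t < trials; t++)
            avg += playOnce(m, payment, oracle, 0, RNG.nextInt(2)) / trials;
        System.out.println("average payoff to A ~ " + avg);
    }
}
```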


Figure 2.3. Game tree for the standard construction of Oracle Games

Fig 2.3 shows the extensive form game for this version. Given this structure, we can model the subgame at stage 2 as a Bayesian game with two possible states [32]. When the oracle does not respond, the payoff matrix for the Oracle Game is the same as the game without an oracle. When the oracle does respond, the payoffs for each player are given by player A's best response in the column determined by player B's choice (since x is constant in this subgame, we leave it out of the payoff matrices since it will not affect equilibria). We represent this in normal form as the matrix R below, where A1 and A2 are the tentative decisions for player A:

A \ B      B1        B2
A1         1, −1     2, −2
A2         1, −1     2, −2

We refer to R as the maximal matrix of the game, since the payoffs for player A are equal to the maximum in each column of the original payoff matrix. Since this matrix shows the payoffs when the oracle does respond, it is natural that the payoffs in each column are identical, since A changes his mind and ignores his previous decision. If M is the original payoff matrix, then the matrix of the expected values that the players perceive in the subgame is given by M · (1 − I(x)) + R · I(x). In this example, it becomes:

A \ B      B1                 B2
A1         1, −1              2I(x), −2I(x)
A2         I(x), −I(x)        2, −2
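Written out explicitly, this is just the entrywise convex combination described above (a restatement, for concreteness):
\[
M_{I(x)} \;=\; (1 - I)\,M + I\,R
\;=\; (1-I)\begin{pmatrix} (1,-1) & (0,0) \\ (0,0) & (2,-2) \end{pmatrix}
+ I \begin{pmatrix} (1,-1) & (2,-2) \\ (1,-1) & (2,-2) \end{pmatrix},
\]
so that, for example, the (A1, B2) entry is $(1-I)\,(0,0) + I\,(2,-2) = (2I, -2I)$.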

The equilibria of this game matrix will depend on the value of I, which will depend on both the oracle function I(x) and the value x that player A has chosen to pay. The strategy space may be described as S = {sa, sb, x} where sb is B's strategy, sa is A's tentative strategy, and x is A's payment to the oracle. Unless otherwise specified, we use I to denote I(x) evaluated at the value of x being played by A, and likewise I′ denotes dI/dx. For any c ∈ R we define x_c to be the smallest payment x such that I(x) = c, and y_c to be any x such that I′(x) = c. Since I(x) is concave, I′(x) may be constant and equal to c on some interval. If so, then y_c can refer to any one of the values on that interval, and any statement we make about y_c is true for all such values. When this occurs in a Nash Equilibrium, the Oracle Game will have multiple equilibria, one for each choice of y_c. For this particular game matrix, we can classify the equilibrium into one of the following cases depending on which oracle function I(x) is attached to the game. Assuming

I(x) is differentiable at 0 and at $x_{1/2}$, these take the form:

Case 1: If $I'(0) \le \frac{3}{2}$, the equilibrium is $\left\{ \left(\frac{2}{3}, \frac{1}{3}\right), \left(\frac{2}{3}, \frac{1}{3}\right), 0 \right\}$; since player A pays x = 0, the players behave as they would if there were no oracle.

Case 2: If $I'(0) \ge \frac{3}{2} \ge I'(x_{1/2})$, the equilibrium is $\left\{ \left(\frac{2-I}{3(1-I)}, \frac{1-2I}{3(1-I)}\right), \left(\frac{2}{3}, \frac{1}{3}\right), y_{3/2} \right\}$.

Case 3: If $I'(x_{1/2}) \ge \frac{3}{2}$, the equilibrium is $\left\{ (1, 0), \left(\frac{2I'-1}{2I'}, \frac{1}{2I'}\right), x_{1/2} \right\}$.


Figure 2.4. Payment in equilibrium shown for oracle functions I(x) = (a) √(x + 1) − 1; (b) √x; (c) 2√x.

The derivation of these equilibria follows from Theorem 1, which is presented at the beginning of Section 2.4. The graphs in Fig 2.4 show examples of different oracle functions satisfying each of these cases for the game matrix given. We assume I(x) = 1 for any values where these functions would be greater than 1. In the case that I(x) is not differentiable everywhere, Proposition 2 will ensure this occurs at countably many points, $I'(x)$ will be nonincreasing, and we can replace $y_{3/2}$ with $\inf\{x : I'(x) < \frac{3}{2}\}$. Similarly, in other examples we consider, the description will assume differentiability of I(x) at key points, but can be modified to account for arbitrary oracle functions. We next consider a 3 × 3 game, defined by the matrix:

A \ B      B1        B2        B3
A1         1, −1     0, 0      0, 0
A2         0, 0      2, −2     0, 0
A3         0, 0      0, 0      4, −4

Note that this matrix contains the previous example as a submatrix. With no oracle, the only equilibrium is when A and B both play the mixed strategy (4/7, 2/7, 1/7). If Player A is given access to an oracle, then using the same process as before, the matrix becomes:

A \ B      B1          B2          B3
A1         1, −1       2I, −2I     4I, −4I
A2         I, −I       2, −2       4I, −4I
A3         I, −I       2I, −2I     4, −4


Figure 2.5. Amount paid by Player A at the equilibrium (above), and the resulting response rate I (below), as functions of k when I(x) = √(kx).

For this game matrix, the equilibrium will fall into one of the following cases, depending on I(x):

Case 1: If $I'(0) \le \frac{7}{8}$, the equilibrium is $\left\{ \left(\frac{4}{7}, \frac{2}{7}, \frac{1}{7}\right), \left(\frac{4}{7}, \frac{2}{7}, \frac{1}{7}\right), 0 \right\}$.

Case 2: If $I'(0) \ge \frac{7}{8} \ge I'(x_{1/5})$, the equilibrium is
\[
\left\{ \left( \frac{4+I}{7(1-I)},\ \frac{2-3I}{7(1-I)},\ \frac{1-5I}{7(1-I)} \right),\ \left( \frac{4}{7}, \frac{2}{7}, \frac{1}{7} \right),\ y_{7/8} \right\}.
\]

At $I = \frac{1}{5}$, the probability of A3 reaches 0, and A can no longer maintain B's indifference, since B's strategy 3 is weakly dominated by a mixed strategy of 1 and 2.

Case 3: If $\frac{7}{8} \le I'(x_{1/5}) \le \frac{3}{2}$, the equilibrium is
\[
\left\{ \left( \frac{3}{4}, \frac{1}{4}, 0 \right),\ \left( \frac{8I'-2}{10I'},\ \frac{4I'-1}{10I'},\ \frac{3-2I'}{10I'} \right),\ x_{1/5} \right\}.
\]

At $I' = \frac{3}{2}$, the probability of B3 reaches 0, and B can no longer prevent A from increasing x. At this point, A and B have both effectively eliminated strategy 3 (B will never play it as it is now a dominated strategy, and A will never play it since after eliminating

column 3 it is also dominated). Thus the game reduces to the matrix:

A \ B      B1          B2
A1         1, −1       2I, −2I
A2         I, −I       2, −2

Note that this is identical to the matrix from the first game; thus, all equilibria that come from this matrix will be identical.

Case 4: If $I'(x_{1/5}) \ge \frac{3}{2} \ge I'(x_{1/2})$, the equilibrium is
\[
\left\{ \left( \frac{2-I}{3(1-I)},\ \frac{1-2I}{3(1-I)},\ 0 \right),\ \left( \frac{2}{3}, \frac{1}{3}, 0 \right),\ y_{3/2} \right\}.
\]

Case 5: If $I'(x_{1/2}) \ge \frac{3}{2}$, the equilibrium is
\[
\left\{ (1, 0, 0),\ \left( \frac{2I'-1}{2I'},\ \frac{1}{2I'},\ 0 \right),\ x_{1/2} \right\}.
\]

In general, if G is a game, and H is a game which contains G as a subgame, then if I(x) is an oracle function that causes all strategies in H which are not in G to become dominated, the equilibria of the oracle games for G and H will be the same when given I(x).

Using the family of oracle functions $I(x) = \sqrt{kx}$ and varying k, we illustrate how the amount $x_e$ paid by Player A at equilibrium, as well as the purchased probability of response $I(x_e)$, varies as the cost of information decreases for Player A (i.e. as k increases). The dependence of these two quantities on k is shown in Fig 2.5. The numbers between the dotted lines indicate which case the equilibrium falls under in that region (case 1 does not occur for any k since $\sqrt{kx}$ has infinite slope at x = 0). In Cases 2 and 4, Player A gradually increases x in response to the cheaper information, while maintaining B's indifference by adjusting sa to compensate. In Cases 3 and 5, A maintains I at a constant value (which costs less to maintain as information becomes cheaper), and B maintains

A’s indifference by adjusting sb away from exploitable strategies. In the following sections we prove results about Oracle Games indicating that most well-behaved games have equilibria similar to these example cases (for certain notions of “well-behaved" and “similar"), and we show how these equilibria are determined.
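For instance, in Case 3 of the first example the equilibrium payment is $x_{1/2}$, the smallest payment at which the response probability reaches $\frac{1}{2}$; for the family $I(x) = \sqrt{kx}$ this gives
\[
\sqrt{k\,x_{1/2}} = \tfrac{1}{2} \quad\Longrightarrow\quad x_{1/2} = \frac{1}{4k},
\]
so the amount paid falls off like $1/k$ while the purchased response rate stays fixed at $\frac{1}{2}$, consistent with the behavior described for Cases 3 and 5 in Fig 2.5.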

2.2.2 Definitions

Let G be a simultaneous, two-player game with the m × n payoff matrix M and players A and B. Let G|I be the game where A and B play game G but A is given access to an oracle with function I(x). If A’s maximal payoff in each column of M is unique, then A’s best response to an oracle response is predetermined, which means that A does not have to specify a strategy choice when the oracle responds. The set of strategy profiles is then expressed as S = {sa, sb, x} where sb is B’s strategy, sa is A’s strategy when the oracle does not respond, and x is A’s payment to the oracle. We make no meaningful distinction between pure and mixed strategies, except to note that an assumption we make later allows all oracle payments x to be considered as pure strategies. For each j, let αj be the index of the row corresponding to the highest payoff to A in column j of M

(we assume this is unique for each j). We define the maximal matrix R by $R_{i,j} = M_{\alpha_j, j}$, such that every outcome in R is a copy of the outcome in M corresponding to A's best response to strategy j. Let C be the m × n matrix where the payoff to A is 1 and the payoff to B is 0 in every cell. Then for every x ∈ [0, ∞), A paying the oracle x induces a Bayesian game with expected payoffs

M · (1 − I(x)) + R · I(x) − C · x.

Let MI(x) := M · (1 − I(x)) + R · I(x). Since the equilibria of a payoff matrix do not change with a constant reduction in all of the payoffs for either player, for each fixed x this will have the same equilibria as the actual induced payoff matrix. Then s ∈ S is a Nash equilibrium if and only if A and B are both indifferent on changing each of their strategies. Thus a necessary condition for an equilibrium must be that for whichever x player A is paying, sa, sb must be an equilibrium for the matrix

MI(x), since otherwise A or B could profit by changing their strategies.
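To make this construction concrete, here is a minimal sketch (with a hypothetical 2 × 2 bimatrix and an assumed oracle function, neither taken from the text) of how the induced expected payoffs M · (1 − I(x)) + R · I(x) − C · x can be computed.

```python
# Sketch: build the induced expected payoffs M*(1 - I(x)) + R*I(x) - C*x.
# The bimatrix and oracle function below are hypothetical placeholders.
import numpy as np

A = np.array([[3.0, 0.0],    # A's payoffs (the matrix M, rows are A's strategies)
              [1.0, 2.0]])
B = np.array([[1.0, 4.0],    # B's payoffs in M
              [3.0, 0.0]])

def I(x, k=0.5):
    """An assumed concave oracle function, I(x) = sqrt(k*x), capped at 1."""
    return min(1.0, np.sqrt(k * x))

def induced_payoffs(x):
    """Return (A's payoffs, B's payoffs) of the induced game for oracle payment x."""
    best = A.argmax(axis=0)                           # alpha_j: A's best response to column j
    RA = A[best, np.arange(A.shape[1])]               # A's payoffs in the maximal matrix R
    RB = B[best, np.arange(B.shape[1])]               # B's payoffs in R
    p = I(x)
    # Each column of R is constant, a copy of the best-response outcome of that column.
    MA = (1 - p) * A + p * RA - x                     # C gives payoff 1 to A in every cell, scaled by x
    MB = (1 - p) * B + p * RB                         # B neither pays nor receives the oracle fee
    return MA, MB

print(induced_payoffs(0.5))
```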

We can also write the expected payoff Ea of A playing G|I in terms of A’s expected payoff given a response Er, and the expected payoff given no response En, as

Ea(sa, sb, x) = En(sa, sb) · (1 − I) + Er(sb) · I − x.

Definition 1 We define the Value of Information V to be the marginal increase in expected value A gains from increasing I, that is

V := ∂Ea/∂I = Er − En.

This means the value of information is given by the change in expected benefit for A from receiving a response. If we assume that player A always chooses the optimal sa for the particular cross section MI(x), then V can be expressed solely as a function of sb. It immediately follows that V ≥ 0 for all sa, sb since A’s payoff when the oracle responds is always at least as good as his payoff when it is silent. And whenever sb is a pure strategy then A will play the best response to sb regardless of whether the oracle responds, so

Er = En, which means that V = 0.

Note also that Ea is linear with respect to sb: if s1 and s2 are strategies for B, and p ∈ [0, 1], then Ea(sa, ps1 + (1 − p)s2, x) = pEa(sa, s1, x) + (1 − p)Ea(sa, s2, x). This also implies V is linear with respect to sb.

Definition 2 We say that G|I has a node at c if one of B’s strategies changes from dominated to undominated (or vice versa) in MI(x) at x = c.

We observe that case 3 in the first example and cases 3 and 5 in the second example above correspond to equilibria with oracle payment x at a node.

2.3 Fundamental Properties of Oracle Games

We first derive some fundamental results that elucidate the basic properties of these Oracle Games.

Proposition 1 If {sa, sb} is a pure strategy Nash equilibrium of G, then {sa, sb, 0} is a Nash equilibrium of G|I.

Proof: If {sa, sb} is a pure strategy Nash equilibrium in G, then sa is a best response to sb, and sb is a best response to sa in M, and if x = 0 then the oracle never responds, so MI(0) = M. And since B is playing a pure strategy, sa will be a best response to sb regardless of whether the oracle responds or not, so A cannot benefit by increasing x.

Thus, no player has an incentive to change their strategies in any way, and {sa, sb, 0} is a Nash equilibrium of G|I. □ In other words, the presence of the oracle does not affect pure strategy equilibria. This is natural, since in a pure strategy equilibrium both players are playing pure strategies, so information confirming what is already known adds no value.

Definition 3 We define two oracle functions I(x) and J(x) to be equivalent (I ≅ J) if for every game G, the set of equilibrium strategies (excluding the oracle payment) and resulting expected payoffs (including the payment) are identical for G|I and G|J.

This definition is useful because of the following results, which are based on the fact that a rational player will never pay more for less information (or in our case, for a less probable response).

Proposition 2 Every oracle function is equivalent to one which is continuous, nondecreasing, and concave.

Proof: Given any oracle function I(x), we will construct another oracle function J(x) based on I such that J is continuous, nondecreasing and (weakly) concave, and then show that J is equivalent to I. We first construct a nondecreasing version of I. Suppose that there exists some c2 > c1 such that I(c2) < I(c1); player A will never pay c2 since it’s dominated by c1

(choosing c2 over c1 means paying more for less information). The value of information is always nonnegative, therefore A’s expected value must be nondecreasing with respect to

I, and strictly decreasing with respect to x. If we let J1 be an oracle function with

J1(x) = sup(I(a): a ≤ x)

then J1 is nondecreasing since it’s taking the supremum over a growing set. And J1 is equivalent to I because any values of x that differ between I and J1 are ones for which I has dropped below sup(I), which are also x that A would never pay. Similarly in G|J1, A will also never pay them because that would be paying more for the same amount of information. The fact that Player A can play a mixed strategy between two oracle payments leads to the second result, which we show by constructing a non-concave-up version of J1. Let c1 and c2 be any numbers in [0, ∞), and A’s mixed strategy be to pay c1 with probability p and c2 with probability (1 − p). The expected amount A will pay is then pc1 + (1 − p)c2 = x̄, and the expected probability that the oracle will respond will be pI(c1) + (1 − p)I(c2) = Ī. The combination of these two yields the same results as another oracle function which took on the value Ī at the point x̄. Thus the oracle function J1 is equivalent to the supremum of the convex hull of its graph:

J(x) = sup(pJ1(c1) + (1 − p)J1(c2))


Figure 2.6. Illustration of the construction to prove Proposition 2 (see text): (top) original given oracle function I(x); (middle) nondecreasing equivalent oracle function J1(x); (bottom) final nondecreasing, concave oracle function J(x).

where the supremum is over all c1 and c2 in [0, ∞) and all p in [0, 1]. The supremum of the convex hull of any function is automatically continuous and concave. Note also that

J is nondecreasing because J1 is. □

In Fig. 2.6 we show an example of this construction process for a particular I(x) (Fig. 2.6a), with the equivalent nondecreasing oracle function J1(x) (Fig. 2.6b), and the full simplification of the proposition, the continuous, nondecreasing, concave J(x) (Fig. 2.6c).
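The two-step construction in this proof can also be carried out numerically. The sketch below samples an arbitrary (assumed) oracle function on a grid of payments, takes the running supremum to obtain J1, and then takes the upper concave envelope to obtain J; it is an illustration of the proof idea, not code from the dissertation.

```python
# Sketch: numerical version of the Proposition 2 construction on a grid of payments.
# I_vals below is an arbitrary, non-monotone oracle function sampled on xs.
import numpy as np

xs = np.linspace(0.0, 4.0, 401)
I_vals = np.clip(0.6 * np.abs(np.sin(2.0 * xs)) + 0.1 * xs, 0.0, 1.0)

# Step 1: J1(x) = sup{ I(a) : a <= x }, the nondecreasing running supremum.
J1 = np.maximum.accumulate(I_vals)

# Step 2: J = upper concave envelope of J1 (mixing two payments c1, c2 with weight p).
def concave_envelope(x, y):
    """Upper hull of the points (x, y), evaluated back on x by linear interpolation."""
    hull = [0]
    for i in range(1, len(x)):
        # Pop hull points lying on or below the chord to the new point (keeps slopes decreasing).
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            if (y[b] - y[a]) * (x[i] - x[b]) <= (y[i] - y[b]) * (x[b] - x[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(x, x[hull], y[hull])

J = concave_envelope(xs, J1)
assert np.all(J + 1e-9 >= J1) and np.all(np.diff(J) >= -1e-9)  # J dominates J1 and is nondecreasing
```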

We next show that any G|I whose oracle already responds with nonzero probability at zero payment (I(0) > 0) can be shifted to an equivalent game for which I(0) = 0.

Proposition 3 Suppose G|I is a game with payoff matrix M and oracle function I(x) with I(0) > 0. Then there exist a game H and oracle function J(x) with J(0) = 0 such that G|I ∼= H|J.

For G|I, let I(0) = c > 0, and note that c ≤ 1. Define the following

N = MI(0) = (1 − c)M + cR,     J(x) = (I(x) − c)/(1 − c).

First note that the maximal matrix R is the same for M and N, since the highest payoff

to A in each column of MI is the same for all values of I. Also, J(x) will be continuous, nondecreasing, and concave if I(x) is, and J(0) = 0 since I(0) = c. Additionally, J(x) will reach 1 at the same x value that I(x) does.

Then for any x,

NJ(x) = (1 − J(x))N + J(x)R
      = (1 − (I(x) − c)/(1 − c)) ((1 − c)M + cR) + ((I(x) − c)/(1 − c)) R
      = (1 − c)M − (I(x) − c)M + cR − cR(I(x) − c)/(1 − c) + R(I(x) − c)/(1 − c)
      = (1 − I(x))M + cR + (I(x) − c)R
      = (1 − I(x))M + I(x)R,

which is MI(x) by definition. □ Therefore it suffices to consider only oracle functions with I(0) = 0. For the remainder of this chapter, we assume without loss of generality that all oracle functions are continuous, nondecreasing, concave, and satisfy I(0) = 0.
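As a quick sanity check of Proposition 3, the following sketch (using a randomly generated payoff matrix for A and an assumed oracle function with I(0) = c > 0) verifies the identity NJ(x) = MI(x) numerically.

```python
# Sketch: check Proposition 3 numerically, N_J(x) equals M_I(x) entrywise.
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-2, 2, size=(3, 4))                   # A's payoffs in M (B's part behaves analogously)
R = A[A.argmax(axis=0), np.arange(A.shape[1])]        # column maxima: A's payoffs in the maximal matrix
c = 0.2
I = lambda x: min(1.0, c + 0.5 * np.sqrt(x))          # assumed oracle with I(0) = c > 0
J = lambda x: (I(x) - c) / (1 - c)                    # shifted oracle, J(0) = 0
N = (1 - c) * A + c * R                               # shifted game matrix

for x in [0.0, 0.3, 1.0]:
    MI = (1 - I(x)) * A + I(x) * R
    NJ = (1 - J(x)) * N + J(x) * R
    assert np.allclose(MI, NJ)
print("N_J(x) = M_I(x) on the sampled payments")
```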

2.4 Main Results

Theorem 1 If I(x) is differentiable at c in the interior of its domain, then {sa, sb, c} is an equilibrium of G|I if and only if

1. {sa, sb} is an equilibrium of MI(c), and
2. V(sb) · I′(c) = 1.

Proof: Condition 1 holds if and only if neither player A nor player B has an incentive to change sa or sb, respectively. If we express player A’s payoff as Ea(sa, sb, I(x)) − x, then it suffices to find a global maximum of this function on its domain. Taking the derivative with respect to x and setting it equal to zero yields

(∂Ea/∂I) · (dI/dx) − 1 = 0,

assuming that sb is constant. Since V = ∂Ea/∂I by definition, this is equivalent to condition 2, and shows that it yields a local maximum. V ≥ 0 and I concave imply that Ea(sa, sb, I(x)) − x is also concave with respect to x, so any local maximum must be a global maximum. □
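Condition 2 is easy to solve numerically. The sketch below assumes the oracle family I(x) = √(kx) used in the examples and a given constant value of information V, and locates the payment where V · I′(x) = 1 by bisection; the family and the parameter values are illustrative assumptions.

```python
# Sketch: solve V * I'(x) = 1 for the equilibrium oracle payment, assuming I(x) = sqrt(k*x).
import math

def equilibrium_payment(V, k, lo=1e-9, hi=1e6):
    """Bisection on g(x) = V * I'(x) - 1, which is decreasing because I is concave."""
    Iprime = lambda x: 0.5 * math.sqrt(k / x)
    g = lambda x: V * Iprime(x) - 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid          # marginal value still exceeds marginal cost: pay more
        else:
            hi = mid
    return 0.5 * (lo + hi)

V, k = 2.0, 0.5
x_star = equilibrium_payment(V, k)
print(x_star, k * V**2 / 4)   # bisection agrees with the closed form x* = k*V^2/4
```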

Lemma 1
1. If {sa, sb} is an equilibrium of MI(0) and lim_{x→0+} I′(x) ≤ 1/V(sb), then {sa, sb, 0} will be an equilibrium of G|I.
2. If {sa, sb} is an equilibrium of MI(x1) and lim_{x→x1−} I′(x) ≥ 1/V(sb), then {sa, sb, x1} will be an equilibrium of G|I.

Proof: Although I(x) will not be differentiable at the endpoints (i.e., at 0 and x1, since player A cannot choose values of x < 0 and gains no benefit beyond I(x) = 1), we only need to look at the one-sided limit in these cases. If lim_{x→0+} I′(x) ≤ 1/V(sb), then player A will gain less benefit from increasing the oracle payment than the increase in cost, and has no incentive to do so. Similarly, if lim_{x→x1−} I′(x) ≥ 1/V(sb), then player A will lose more benefit from decreasing the oracle payment than the reduction in cost (and can gain no more benefit from increasing the cost, since I is capped at 1), so has no incentive to change it. □

We also note that, even if I′(x) has discontinuities, condition 2 of Theorem 1 can be modified to say that c must equal the supremum over all points with V(sb) · I′(x) ≥ 1.

For the remainder of the chapter, we assume M is a payoff matrix such that MI(x) has a unique Nash equilibrium for each x, except possibly at nodes. This is a slightly stronger condition than requiring that the game has a unique Nash equilibrium, as it is possible to construct a game matrix M with a unique Nash equilibrium which disappears in MI(x) for some values of x, and is then replaced by multiple equilibria. However, such examples are unusual: most games we naturally considered with a single Nash equilibrium remained at one equilibrium for all x, and only when deliberately attempting to construct a counterexample was one discovered. Additionally, this restriction is mostly a matter of convenience and simplicity, as most of our results will apply in a slightly modified form to games with multiple equilibria, which we discuss in Section 2.7.

Then we can define sa(x) and sb(x) as the strategies sa and sb in the unique equilibrium of MI(x) for all x except at nodes. If x is a node and all equilibria at that node have the same sa or sb, then sa(x) or sb(x) are defined as the appropriate strategy, while if sa or sb vary across equilibria, then the corresponding function is undefined at that node (in most games we consider, sa(x) will be defined at nodes and sb(x) will not). We also add the assumption that I is strictly increasing and strictly concave.

Proposition 4 If strategy s for player B is not dominated in M (weakly or strongly), but is dominated in MI(y) for some y, then it is strictly dominated for all MI(x) with x > y. That is, a strategy which becomes dominated after increasing x will stay dominated as x increases further.

Proof: Suppose strategy s becomes dominated by strategy t. Let rj be the payoffs to B for strategy j in the matrix R (A’s best strategies when the oracle responds). Let

bi,j be the entry in M in the ith row and jth column for B’s payoff, then the entry in the ith row and jth column of MI(x) will be

ci,j,x = (1 − I(x))bi,j + I(x)rj.

Let mi,j = rj − bi,j; then ci,j,x = bi,j + mi,jI(x). That is, the entries in the matrix scale linearly with I, going from bi,j when I = 0 and reaching rj when I = 1. If t is a pure strategy, then rt, bi,t, and mi,t are already defined. If t is a mixed strategy which selects strategy j with probability wj, define rt = Σj wjrj, bi,t = Σj wjbi,j, and mi,t = rt − bi,t. If strategy s is not dominated by t when x = 0 (and I(0) = 0), this means there is a row k such that bk,s ≥ bk,t. But if s is dominated by t for some y, then bk,s + mk,sI(y) ≤ bk,t + mk,tI(y). Together these imply that mk,t > mk,s and thus rt > rs.

Then for any row i, s dominated by t at y implies bi,s + mi,sI(y) ≤ bi,t + mi,tI(y).

Case 1: bi,s > bi,t. Using the same argument as above, we get that mi,t > mi,s. Then each of bi,s + mi,sI(y) and bi,t + mi,t · I(y) can be viewed as linear functions of I, with slopes mi,s and mi,t. Then line t has a greater slope and is above line s at I(y), so for any x > y, it will also be greater at I(x) since I is an increasing function.

Case 2: bi,s ≤ bi,t. Going back to ci,j,x = (1 − I(x))bi,j + I(x)rj, for every x the elements ci,j,x are weighted averages of bi,j and rj. Then since bi,s ≤ bi,t and rs < rt, we have

(1 − I(x))bi,s + I(x)rs < (1 − I(x))bi,t + I(x)rt for all values of x.  Note that if a strategy starts out dominated at x = 0, Proposition 4 does not apply, and it may become undominated at one x value, but be redominated by another strategy at a greater x value. However, once it becomes dominated after being undominated, Proposition 4 will apply, and it remains dominated. Thus each strategy corresponds to at most two nodes. Thus G|I having finitely many strategies implies that it has finitely many nodes.

Proposition 5 In each interval between two nodes, and for each strategy i in the support of sa(x), there exist ai, bi, ci ∈ R such that the probability of playing strategy i in sa(x) can be expressed as (ai + biI)/(ci(1 − I)) for all x in the interval.

Proof: Let M be the payoff matrix. Recall that the cross section matrix MI(x) has

payoffs to B, (1 − I)bi,j + Irj, where bi,j are the payoffs to B in M, and rj are the payoffs to

B corresponding to A’s best response in column j. Suppose sa(x) = (A1, A2, ..., Am). Then for each j, player B’s expected value when playing strategy j is Ej = Σ_{i=1}^{n} Ai[(1 − I)bi,j + Irj]. If we fix a particular interval between two nodes, then the set of B’s pure strategies which are undominated is constant on that interval. Let n be the number of undominated pure strategies for B in that interval. Note that the condition that MI(x) has unique equilibria between nodes implies that A also has n undominated pure strategies. Fix k as the index of one of B’s undominated pure strategies. Then sa(x) can be expressed as the solution to the simultaneous equations Ek = Ej for all j ≠ k, together with Σ_{i=1}^{n} Ai = 1. Using the last condition, we obtain

An = 1 − Σ_{i=1}^{n−1} Ai,

Ej = (1 − Σ_{i=1}^{n−1} Ai)[(1 − I)bn,j + Irj] + Σ_{i=1}^{n−1} Ai[(1 − I)bi,j + Irj].

Distributing the sum on the left and combining terms with like indices gives

Ej = (1 − I)bn,j + Irj + Σ_{i=1}^{n−1} Ai(1 − I)(bi,j − bn,j).

Since every instance of Ai is multiplied by 1 − I, we can make substitutions by defining ui = (1 − I)Ai to get

Ej = (1 − I)bn,j + Irj + Σ_{i=1}^{n−1} ui(bi,j − bn,j).

Then we have n − 1 simultaneous equations in the n − 1 variables ui, with the coefficients on all ui in R, and constant terms in which I appears with degree at most 1. It follows that the solutions must be of the form ui = ai + biI for some ai, bi ∈ R. Thus all Ai are of the form (ai + biI)/(1 − I) for some ai, bi in R. This suffices to prove the proposition. Additionally, if all bi,j ∈ Q, then ai, bi ∈ Q, and by using the least common denominator of ai and bi we can express this as Ai = (ai + biI)/(ci(1 − I)) with ai, bi, ci ∈ Z. □
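The form guaranteed by Proposition 5 can be checked symbolically on a small example. The sketch below uses a hypothetical 2 × 2 bimatrix (not one from the text) and solves B's indifference condition in MI for A's mixing probability, which indeed comes out as (a + bI)/(c(1 − I)).

```python
# Sketch: verify the (a + b*I)/(c*(1 - I)) form of Proposition 5 on a hypothetical 2x2 game.
import sympy as sp

I, p = sp.symbols('I p', real=True)      # I = oracle response probability, p = prob A plays row 1

# Hypothetical payoffs (not from the dissertation).
A_pay = [[3, 0],
         [1, 2]]
B_pay = [[1, 4],
         [3, 0]]

# r_j: B's payoff in column j when A best-responds to that column (the maximal matrix R).
r = [B_pay[max(range(2), key=lambda i: A_pay[i][j])][j] for j in range(2)]

def EB(j):
    """B's expected payoff for column j in M_I when A plays (p, 1-p) as the tentative strategy."""
    return sum(prob * ((1 - I) * B_pay[i][j] + I * r[j])
               for i, prob in enumerate([p, 1 - p]))

# B's indifference between the two columns pins down A's equilibrium mix.
p_star = sp.together(sp.solve(sp.Eq(EB(0), EB(1)), p)[0])
num, den = sp.fraction(p_star)
print(p_star)                              # here: (3 - 2I)/(6(1 - I)), i.e. (a + bI)/(c(1 - I))
print(sp.factor(num), sp.factor(den))
```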

Note that this implies that sa(x) is continuous between nodes. Additionally, the

support of sa(x) must be constant between nodes because the support of sb(x) is.

Proposition 6 sb(x) is piecewise constant with respect to x, with discontinuities only at the nodes.

Proof: Suppose {sa(c), sb(c)} is an equilibrium of MI(c) for some particular c ∈ R. Thus sb(c) causes player A to be indifferent among all strategies included in sa(c). But since sa(c) is player A’s strategy when the oracle does not respond, his indifference does not depend on I. So sb(c) causes A to be indifferent on the support of sa(c) in MI(x) for all x.

Additionally, sa(c) causes player B to be indifferent on the support of sb(c) in MI(c). Let d be any value such that there are no nodes between c and d. This means a strategy is dominated for B in MI(c) if and only if it is dominated in MI(d). sb(c) is part of an equilibrium in MI(c), so none of the strategies in its support are dominated. Thus they are also undominated in MI(d), so there must be some sa′ which causes player B to be indifferent on the support of sb(c) in MI(d). Proposition 5 implies that the support of sa′ is the same as the support of sa(c), since the formulas that define each strategy’s probability are nonzero between nodes. Then B must play a strategy that causes A to be indifferent on all strategies in this support. sb(c) accomplishes this, thus (sa′, sb(c)) is an equilibrium in MI(d). And by assumption the equilibrium is unique at each point, so sb(c) = sb(d). □

Since we have expressed the equilibrium strategies sa(x) and sb(x) as functions of x, we can also express the expected payoff of player A as

Ea(x) = Er(x) · I(x) + En(x) · (1 − I(x)) − x.

where Er(x) is player A’s expected payoff when sa(x) and sb(x) are played with the payoff matrix R (the oracle responds) and En(x) is A’s expected payoff when sa(x) and sb(x) are played with the payoff matrix M (the oracle does not respond). Then the value

of information, which we defined as V = ∂Ea/∂I and which we showed earlier depends only on sb, can also be expressed as a function of x:

V (x) = Er(sb(x)) − En(sb(x)).

It then follows immediately that V(x) is piecewise constant with discontinuities only at the nodes, which comes from its direct dependence on sb(x).

If we consider a simplified construction where a player has the binary option to purchase information or not for a fixed cost c, this corresponds to a stepwise oracle function I(x). But by Proposition 2, this is equivalent to an Oracle Game with a linear oracle function with slope 1/c. This, together with V(x) piecewise constant and Theorem 1, means equilibria will only occur at nodes, except when c = V, in which case there are infinitely many equilibria: A will be indifferent between all values of x, and thus {sa(x), sb(x), x} will be an equilibrium for every x in the region where V(x) = c. In the construction where A has a binary option to purchase information or not, this corresponds to A playing a mixed strategy where he chooses to purchase information with probability x/c, and this case would correspond to infinitely many such mixed strategies being equilibria. The following Lemma is a stronger version of Proposition 4 for strictly competitive games, as it eliminates the possibility of strategies dominated at x = 0 which become undominated for x > 0. Thus, the only nodes that can occur are ones corresponding to strategies becoming dominated.

Lemma 2 If G is strictly competitive, then any strategy for player B which is dominated in M will be dominated in MI(x) for all x.

Proof: Suppose that strategy t dominates strategy k for player B in M. Let rj be B’s payoff in column j of the maximal matrix R. Since G is strictly competitive, this will also correspond to the lowest payoff for B in column j of M. That is, rj ≤ bi,j for all i, j. Then rt = bi,t for some i. Then rk ≤ bi,k and k dominated by t implies bi,k ≤ bi,t.

And thus rk ≤ rt. Now for any x, let ci,j,x be the i, j th entry for B in MI(x). From the definition of MI(x), we get ci,j,x = (1 − I(x))bi,j + I(x)(rj). Then for any i, both bi,k ≤ bi,t and rk ≤ rt implies that ci,k,x ≤ ci,t,x. Thus k is dominated by t in MI(x). Note that if k is dominated by a mixed strategy, this argument extends in the same way as in Proposition 4. 

Note also that strict dominance in M will imply strict dominance in MI(x).

Proposition 7 If G is a strictly competitive game, then V(x) is nonincreasing with respect to x.

Proof: Since V(x) = Er(x) − En(x), it is sufficient to show that Er(x) is nonincreasing and En(x) is nondecreasing. Let Er′ be player B’s payoff when the oracle responds, and

En′ be her payoff when the oracle does not respond. G strictly competitive implies that En is nondecreasing if and only if En′ is nonincreasing. Since R is made from entries in G, it is also a strictly competitive game matrix, so Er is nonincreasing if and only if Er′ is nondecreasing. So it suffices to show these properties for Er′ and En′. Note that these are both locally constant with discontinuities only at nodes since they are based on sb. Let x′ be any node. Lemma 2 implies that this node occurs when a strategy becomes dominated, so let j be one such strategy, and let s be the (possibly mixed) strategy that dominates it at x′. Then from the proof of Proposition 4 we have rj < rs. This means that j is dominated in the matrix R. Then, when player B shifts some of her mixed strategy probability from strategy j to strategy s as x passes x′, Er′ will increase. Thus at every node, Er′ must increase, and since it is constant on intervals between nodes, we conclude that Er′ is nondecreasing.

Since En′ is also constant except at the nodes, the only place where it could possibly increase would be at a node. Suppose, for the sake of contradiction, that En′ increases at the node x′. Let s1 be player B’s strategy before the node, and s2 be player B’s strategy after the node. For any value of I, we have

Eb(sb) = I · Er′(sb) + (1 − I) · En′(sb).

Then En′ increasing at x′ implies En′(s2) > En′(s1). We also showed above that Er′(s2) > Er′(s1). Both of these mean that for any value of I, Eb(s2) > Eb(s1), which means s2 will always yield a higher payoff to player B than s1, assuming player A chooses sa optimally, regardless of how often the oracle responds. This contradicts the assumption that s1 was part of an equilibrium before the node, since player B could achieve a higher payoff by switching to s2 immediately. Therefore En′ must be nonincreasing. □

Decreases in V occur for two reasons. The first is that as I increases, B abandons riskier strategies in favor of safer strategies. In a strictly competitive game, the only strategy that never becomes dominated for B is her security strategy, which is the strategy that maximizes the minimum payoff, and is thus the best response to an oracle which frequently responds. This simultaneously minimizes the payoff for A given a response to the oracle; thus the more frequently B chooses safe strategies, the lower Er will be. The second is that as B abandons any strategies, her mixed strategy becomes closer to a pure strategy, and is thus easier to predict without requiring a response to the oracle.

This tends to increase En.

Theorem 2 If MI(x) has a unique equilibrium for each x except at nodes, I is strictly concave, and V(x) is nonincreasing, then G|I will have a unique equilibrium.

Proof:

Since we assume that each MI(x) has a unique equilibrium {sa, sb}, this covers condition 1 of Theorem 1, except at the nodes. It suffices to show that there is exactly one value of x that satisfies condition 2 of Theorem 1, and if it is a node then there is only one {sa, sb} that still meets condition 1. We can express the change in the expected value for Player A

∂Ea/∂x = (∂Ea/∂I) · (dI/dx) − 1 = V·I′ − 1.

Since I′ is everywhere continuous and V is continuous except at nodes, the expression will be continuous except at nodes. I nondecreasing and strictly concave imply I is strictly increasing, and thus I′ ≥ 0 and is strictly decreasing. We’ve previously shown

V ≥ 0, and is nonincreasing. These together imply ∂Ea/∂x is strictly decreasing. For any x, {sa(x), sb(x)} satisfy the first condition in Theorem 1, by definition. We now demonstrate that there is exactly one value of x that satisfies either Lemma 1 or the second condition of Theorem 1:

Case 1: ∂Ea/∂x(0) < 0. This satisfies Lemma 1, and ∂Ea/∂x strictly decreasing means it is negative for all x, so there are no values of x that satisfy V·I′ = 1, which means that {sa(0), sb(0), 0} will be the unique equilibrium of G|I.

Case 2: There exists a c such that ∂Ea/∂x(c) = 0. Then {sa(c), sb(c), c} satisfies Theorem 1, and is an equilibrium for G|I. Since ∂Ea/∂x is strictly decreasing, for any x < c we get ∂Ea/∂x > 0, and for any x > c, ∂Ea/∂x < 0, so this equilibrium is unique.

Case 3: ∂Ea/∂x changes from positive to negative values discontinuously at a node z. Let s1 = sb(c1), where c1 is any value in the region immediately before the node z, and let s2 = sb(c2), where c2 is any value in the region immediately after z. Then since

∂Ea/∂x = V·I′ − 1, we have 1/V(s1) < I′ and 1/V(s2) > I′, so there exists p ∈ (0, 1) such that pV(s1) + (1 − p)V(s2) = 1/I′. Let β = ps1 + (1 − p)s2. Since V is linear, this implies V(β) = 1/I′, and thus V(β)·I′ = 1. Since some strategy gets dominated at z, we have supp(s2) ⊂ supp(s1), and thus supp(β) = supp(s1).

Let α = lim_{x→z−} sa(x), the strategy approached by Player A as x approaches node z from below. Since the strategies sa(x) in this region make B indifferent on all strategies in supp(s1) in MI(x), α will also make B indifferent on all strategies in supp(β) in MI(z), since this is preserved by the limit. And B’s indifference to a strategy despite it being weakly dominated can only occur when one of A’s strategies goes to probability 0 at z. In particular, supp(α) = supp(sa(c2)) ⊂ supp(sa(c1)). Then, since s1 makes A indifferent on all strategies in supp(sa(c1)), and s2 makes A indifferent on all strategies in supp(sa(c2)), β will make A indifferent on all strategies in supp(α). Thus, {α, β} is an equilibrium of MI(z). Further, only linear combinations of s1 and s2 will make A indifferent on supp(α), and of those, only β sets V·I′ = 1, so this equilibrium is unique.

Case 4: ∂Ea/∂x > 0 for all x up until x1 such that I(x1) = 1. This satisfies Lemma 1. Additionally, for any x < x1 we have ∂Ea/∂x > 0, in which case A could profit from increasing x. Thus {sa(x1), sb(x1), x1} will be the unique equilibrium of G|I.

Finally, we note that V being continuous everywhere except at the nodes, positive,

and weakly decreasing, combined with I′(x) strictly decreasing, implies that ∂Ea/∂x is continuous everywhere except at nodes, and strictly decreasing. Thus exactly one of

these cases must occur, depending on if and where ∂Ea/∂x changes from positive to negative. □

Although each combination of G and I will result in a unique Nash equilibrium, any particular G could have a different equilibrium with a different oracle payment x depending on which function I it is attached to, which is why we break each example into cases.

2.5 Harmful Information

In most games we considered, cheaper oracle functions (ones with greater I(x)) cause A’s payoffs to increase when compared to more expensive ones. However, this is not always the case. We now consider a particular game matrix where cheaper information can be harmful to player A (in terms of decreasing his payoff in the equilibrium). Let G be the game given by the payoff matrix M below. This is essentially a weighted matching pennies game where player B has the option to avoid playing altogether by choosing strategy B3.

        B1        B2        B3
A1     4, −1     0, 2      0, 0
A2     0, 2      4, −1     0, 0

Played simultaneously, the only Nash equilibrium is where A plays (1/2, 1/2) and B plays (1/2, 1/2, 0). Then their expected values are EA = 2 and EB = 1/2. B will not choose strategy 3 because she can gain a nonzero amount of points by playing the mixed strategy. If we consider the same strategies and payoffs but in a sequential game where B has to choose first, this is equivalent to giving A complete information about what B chooses (such as an oracle with the constant function I(x) = 1). The resulting game tree is shown in Fig. 2.7.


Figure 2.7. The harmful information game in extensive form.

If B ever chooses B1 or B2, A will choose the best response and the payoff will be

(4,-1). Knowing this, player B will only ever choose B3, and both players will get a payoff of 0. Note that this is worse for both players than the mixed strategy was. That is, player A knowing what strategy player B has played is detrimental to both players. If it were possible, player A would prefer not to have the information, or to be able to commit to ignoring the information and play a mixed strategy in order to incentivize player B to play B1 or B2.

So how is this reflected in the Oracle Game G|I? If A is given access to oracle I(x), the payoff matrix MI(x) becomes:

        B1              B2              B3
A1     4, −1          4I, 2 − 3I       0, 0
A2     4I, 2 − 3I      4, −1           0, 0

1 1 1 1 1 is {( 2 , 2 ), ( 2 , 2 , 0), 0} with expected values Ea = 2 and Eb = 2

Case 2 (interval): If I′(0) ≥ 1/2 ≥ I′(x_{1/3}), the equilibrium is {(1/2, 1/2), (1/2, 1/2, 0), y_{1/2}} (where y_{1/2} is the payment at which I′(y_{1/2}) = 1/2), and neither player adjusts strategies due to the symmetry between strategies 1 and 2. Then

A’s expected value is Ea = 2 + 2I(y_{1/2}) − y_{1/2}. Note that since A is deliberately trying to maximize this, he is only paying the oracle where the function has steepness at least 1/2, so this payoff is greater* than his payoff in Case 1 (*if the oracle is a straight line of slope 1/2 it will be equal). Player B’s payoff is EB = 1/2 − (3/2)I(y_{1/2}), which is worse than her payoff in Case 1, but still more than 0.

Case 3 (node): If 1/2 ≤ I′(x_{1/3}), the equilibrium is {(1/2, 1/2), (1/(4I′), 1/(4I′), (4I′ − 2)/(4I′)), x_{1/3}}, where I′ = I′(x_{1/3}). When I reaches 1/3, B becomes indifferent between all three strategies, since her expected value from any of them is 0, so she would be willing to play any mixed strategy involving them. But the only equilibrium is one where A is indifferent between A1 and A2 and also indifferent on increasing x any further. B’s strategy in the equilibrium is the only one that satisfies both conditions. In this equilibrium, B’s expected value is Eb = 0, and A’s is Ea = 4/(3I′(x_{1/3})) − x_{1/3}. Note that this will be positive because I is concave and must have slope at least 1/2 in order to reach Case 3, so x_{1/3} will be small. However, Ea is decreasing with respect to I′(x_{1/3}). The reason for this is that A gets a higher payoff the more often B plays strategies 1 and 2, but as I′(x_{1/3}) increases, player B will opt out (strategy 3) more often in order to satisfy the condition V·I′ = 1. As I′(x_{1/3}) approaches ∞, B’s strategy will approach (0, 0, 1), causing Ea to approach 0.
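The case formulas above can be evaluated directly. The sketch below assumes the oracle family I(x) = √(kx) used for Fig. 2.8 and computes A's equilibrium payoff as a function of k; the specific grid of k values is arbitrary.

```python
# Sketch: A's equilibrium payoff versus k for the harmful-information game,
# assuming the oracle family I(x) = sqrt(k*x) used in Fig. 2.8.
# (Case 1 never occurs for this family, since sqrt(k*x) has infinite slope at 0.)
import numpy as np

def payoff_A(k):
    """Evaluate the case formulas above for I(x) = sqrt(k*x)."""
    # Case 2 (interval): I'(y) = 1/2 gives y = k and I(y) = k, valid while I <= 1/3.
    if k <= 1/3:
        y = k
        return 2 + 2*np.sqrt(k*y) - y          # = 2 + k
    # Case 3 (node at I = 1/3): x_{1/3} = 1/(9k) and I'(x_{1/3}) = 3k/2.
    x_node = 1/(9*k)
    return 4/(3*(1.5*k)) - x_node              # = 7/(9k)

for k in np.linspace(0.05, 2.0, 40):
    print(f"k = {k:4.2f}   E_A = {payoff_A(k):5.3f}")
# The payoff rises with k up to k = 1/3 (cheaper information helps A),
# then falls as B is pushed toward opting out: the information becomes harmful.
```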

Thus, A will benefit the most at the boundary between Case 2 and Case 3. Fig. 2.8 shows how his expected payoff changes as information becomes cheaper (as k increases). If the oracle is very expensive he will have to lose most of his benefit from the information to the oracle’s cost. But if information is too cheap then player B will be dissuaded from playing B1 and B2, and A will receive a lower payoff than if the oracle did not exist in the first place. It is not extremely surprising that this example demonstrates harmful information, given that we started with a known game in which A’s payoff in the simultaneous game (with no information) is lower than his payoff in the sequential game (with perfect information).

Figure 2.8. Ea as a function of k when I(x) = √(kx).

However, because Oracle Games allow for the continuous purchasing of information, our model demonstrates that having access to small amounts of information is beneficial to player A, and only when a certain threshold is reached does it become harmful by incentivizing player B to change strategies. This is related to the game-theoretic notion of second-mover advantage, where a player achieves a higher payoff by going second and responding to the other player’s strategy compared to choosing first or simultaneously. The oracle effectively allows A to pay to become the second mover, and is thus beneficial when there is an advantage to being in this position, and more likely to be harmful when there is a disadvantage to being in this position, although the continuous nature of payments and response rates complicates this somewhat.

2.6 Helpful Information

In most games we’ve considered, the existence of the oracle causes B’s payoffs to decrease when compared to the same game without an oracle. Cheaper oracle functions typically cause more of a decrease compared to more expensive ones, up to the point where the final node is reached and B plays the safest available strategy. This will always be the case for strictly competitive games; however, for certain other games B’s payoffs may increase due to the presence of the oracle. Consider the coordination game:

        B1       B2
A1     1, 1     0, 0
A2     0, 0     1, 1

This game has two pure strategy equilibria, and one mixed, where both players play (1/2, 1/2). When an oracle is introduced, both pure strategy equilibria still exist, and in them A will not pay the oracle. There will also be one mixed strategy equilibrium, which will be as follows:

Case 1: If I′(0) ≤ 2, the equilibrium is {(1/2, 1/2), (1/2, 1/2), 0}.

Case 2 (interval): If I′(0) ≥ 2 ≥ I′(x1), the equilibrium is {(1/2, 1/2), (1/2, 1/2), y_2}.

Case 3 (node): If 2 ≤ I′(x1), A pays x1, which causes the oracle to always respond. Thus, A needs no tentative strategy. For very cheap I(x), there will be infinitely many strategies B could play that keep A indifferent, but (1/2, 1/2) will always be one of them.

We note that in the mixed strategy equilibrium for the coordination game with no oracle, or in Case 1, the expected payoff for both players is 1/2. In general the payoff for player A is EA = (1/2)(1 + I(x)) − x and the payoff for player B is EB = (1/2)(1 + I(x)). As the oracle function becomes cheaper, the players successfully coordinate more often, causing both players’ payoffs to increase. However, for any nonzero x, B’s payoff is greater than A’s, since she benefits from the increased coordination without paying any of the cost. However, for any oracle function, these payoffs are still less than in the pure strategy equilibria, where both players receive 1 (aside from Case 3, where B’s payoff is also 1 in the mixed equilibrium). Examples where B will benefit from the oracle function can only occur in games where A’s best response to B is also good for B, but which also contain a mixed strategy equilibrium. In the absence of some way to coordinate on these strategies as a pure strategy equilibrium, players may default to the mixed strategy equilibrium. In this case, A’s incentive to pay the oracle for his own benefit may allow B to also benefit without having to pay any cost.
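The payoff expressions above are easy to tabulate. The sketch below assumes an illustrative oracle function I(x) = √(kx) and simply evaluates EA and EB for a few payments x, showing that B gains from the oracle without paying for it.

```python
# Sketch: payoffs in the coordination game's mixed equilibrium as A's payment varies,
# assuming an illustrative oracle function I(x) = sqrt(k*x).
import numpy as np

k = 1.0
I = lambda x: np.minimum(1.0, np.sqrt(k * x))
xs = np.linspace(0.0, 0.25, 6)
EA = 0.5 * (1 + I(xs)) - xs      # player A pays the oracle
EB = 0.5 * (1 + I(xs))           # player B benefits from the extra coordination for free
for x, ea, eb in zip(xs, EA, EB):
    print(f"x = {x:5.3f}   E_A = {ea:5.3f}   E_B = {eb:5.3f}")
```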

2.7 Multiple Equilibria

Although so far we have restricted our focus to games with one mixed equilibrium, games with multiple equilibria will tend to behave in a similar way: each equilibrium can be analyzed separately using the same techniques. Consider the game defined by the matrix:

        B1           B2           B3           B4
A1     1, −1        0, 0       −10, −10     −10, −10
A2     0, 0         2, −2      −10, −10     −10, −10
A3    −10, −10     −10, −10      2, −2        0, 0
A4    −10, −10     −10, −10      0, 0         3, −3

With no oracle, this game has three mixed strategy equilibria: one where they play their first two strategies, one where they play their last two strategies, and one where both players play all four strategies. If an oracle is introduced with oracle function I(x), there will still be three mixed strategy equilibria: two corresponding to the equilibria induced by the submatrix with either the first two strategies or the last two strategies of both players, and one involving both sets. In this last case, the number of strategies played ranges from 2 to 4 depending on the oracle function. If the oracle function is shallow, so payments to it are low, all four strategies will be played for both players. However if the oracle is cheap enough that B2 or B4 become dominated, they will be dropped (as will A2 and A4 respectively). If both are dropped (which happens at I = 2/3), this third equilibrium will be the same as the equilibrium of the 2 × 2 game where those dropped strategies never existed in the first place, which is a game with one equilibrium and falls under the purview of the rest of this chapter. Thus, most of our results can be adapted and applied separately to each individual equilibrium in larger games. More complicated situations can occur in game matrices where many equilibria are dropped and added as x changes, but we leave the details of this for future research.

2.8 Discussion

The Oracle Games defined here provide a method for investigating how players pay to acquire information, as well as how players respond to information about them being acquired. In particular, we have shown that the nodes, which occur when one of player B’s strategies becomes dominated or undominated, play an important role in considering which strategies will be played and how much information will be purchased. Oracle Games could potentially have applications involving industrial espionage, or any situation where competing decision-makers are not immediately aware of each other’s strategies, but can invest time or resources to obtain them at some cost. In general, we anticipate that our model will be useful whenever information is difficult to acquire, but is always reliable once acquired. For example, a firm hiring spies to

steal files from their competitors will have to pay regardless of whether they succeed in their operations or not, but if they succeed the files would be unlikely to contain false information. The model by Solan and Yariv [31] is similar to ours, but has different results about what sorts of equilibria will occur. They find that if sufficiently reliable information can be purchased cheaply enough, the player will purchase it and act on the information as if it were completely true, and also that the information cost affects the game’s equilibrium only insofar as it determines whether information is worth purchasing or not; the actual amount of information purchased, if any, depends only on the payoffs for the player without any information. This corresponds to the amount that causes the player’s strategies to become dominated, and is related to our notion of nodes. In our model, player A will purchase more information as it becomes cheaper in a continuous way, until the point where they reach the final node. Thus our mechanism of purchasing a random chance of perfect information is distinct from purchasing noisy information, and leads to different results even when attached to the same games. In cases where information can always be acquired for a fixed cost and is always accurate (a stepwise oracle function), both models should yield identical results. Oracle Games are fundamentally asymmetric because only one player has access to the oracle and its information. There is no straightforward way to directly extend this to a symmetric system where both players have an oracle, since one player must commit to a decision before the oracle can know his action and provide it to the other player. More complicated constructions could potentially resolve this. One such alternative would be to have both players bid payments to the oracle, and only the player with the higher bid gets access to information about the other player. Such a construction would have to specify what happens in the event of a tie, which may lead to discontinuities in payoff functions. We speculate that for some games, equilibria would be identical to ours, but for others, one player might be willing to pay for the oracle simply to deny it to the other player, not because they value the information themselves highly. An asymmetric but potentially interesting alternative this suggests is to take our model and give player B the ability to pay the other player’s oracle in order to decrease its probability of response. This could be done as a subtraction of inputs, i.e., the probability of response is I(xa − xb), where xa and xb are the payments of players A and B respectively. We speculate that in cases where A’s value of information is higher than B’s value of secrecy (the amount she loses from A getting a response), the equilibrium would be identical to those in this chapter. When B’s value of secrecy is higher than A’s value of information, the equilibrium would be no oracle payments, since anything A paid

would be immediately countered by B. Alternatively, this could be implemented as a subtraction of outputs, i.e., the probability of response is max(0, Ia(xa) − Ib(xb)), where Ia and Ib are functions given separately to players A and B. We speculate that when B’s oracle function is expensive, the equilibrium would be identical to those in this chapter. When B’s oracle function is somewhere in the middle and concave, both A and B would pay some amount based on the marginal rate of return. When B’s oracle function is cheap, the equilibrium would be no oracle payments, since anything A paid would be immediately countered by B. Another method of creating a symmetric version would be to have an extensive form game with both players having multiple actions, and each player having the ability to pay the oracle to learn information about the other player’s earlier actions. We speculate that this would be resolvable using the same methods used on normal extensive form games, but each subgame would be an oracle game as in this chapter. We also note that, while we briefly address the issue of games with multiple equilibria, we do not go into it in much detail; a fuller treatment may yield interesting results that do not occur in games with a single equilibrium. Future research might explore these in more detail and develop better techniques for describing and comparing multiple equilibria in the same game. Finally, the practice of using randomly supplied accurate information as an alternative to noisy signals could be investigated in other games with information acquisition. This could reduce the complexity of some models by eliminating the need for players to condition their actions on uncertain beliefs, while still retaining the incentive to increase the amount of information available through the response rate. We expect such modifications may yield results similar to the approach with noisy signals, but possibly with a simpler analysis which might lead to novel or more useful results.

Chapter 3 | Fair Contribution in a Nonlinear Stochastic Public Goods Model

3.1 Introduction

3.1.1 Public Goods Games

In this chapter, we study an agent-based model where players with a simple learning rule play a public goods game together. In nature, interactions between individuals often occur in groups of more than two. Such groups might accomplish more than the individuals comprising them could alone, enhancing the welfare of the individuals who contribute to the group’s success. However, individuals who seek to exploit the group by benefiting from it without contributing to its success may emerge. Much research in game theory has focused on the presence of cooperation in games that model this dynamic [33]. When there is no method to enforce cooperation, rational players will seek to maximize their own payoffs at the expense of others. But when all players make this decision, all of them end up with lower payoffs than if they had coordinated to cooperate together. The most well-known example of this is the Prisoner’s Dilemma, where each player is given a choice between Cooperating or Defecting, the latter of which increases their payoff a small amount compared to Cooperating, at the cost of decreasing the other player’s payoff by a large amount. In the absence of any other modifications to this game, the only equilibrium is one in which both players Defect, leading to both players having a lower payoff than if both Cooperate. While the Prisoner’s Dilemma may be the simplest way to capture this conflict between group and individual benefit, the absence of more than two players restricts its applications to larger group settings. The related public goods game, introduced by

Hamburger in 1978 [34], plays a similar role, but is better suited to describing interactions involving more than two players at a time. In the standard construction, each player is given a binary choice either to contribute a fixed amount c to the public good (cooperate), or to freeload off of the contribution of others while contributing nothing (defect). Players who contribute have the cost deducted from their final payout, which is then put into the public good, multiplied by a factor r (the rate of return), and then divided evenly among all players regardless of whether they contributed or not. Payoffs are given by

u_c = (r c n_c)/N − c     (3.1)
u_d = (r c n_c)/N         (3.2)

where u_c is the payoff for contributors, u_d is the payoff for defectors, n_c is the number of players who contribute, and N is the total population size. If r > 1, then it is socially desirable (Pareto superior) for players to contribute: all players contributing leads to higher payoffs for each player compared to all players defecting. But if r < N, then each individual player maximizes their own utility by unilaterally choosing to defect, assuming their actions do not change the actions of the other players. When both of these hold we have a dilemma where the only Nash equilibrium is one where no player contributes, which is socially suboptimal (Pareto inferior) since every player’s payoff is lower than it would be had everyone contributed, just as in the Prisoner’s Dilemma. In fact, when N = 2 and 1 < r < 2 this is a Prisoner’s Dilemma.

        C                     D
C    rc − c, rc − c       rc/2 − c, rc/2
D    rc/2, rc/2 − c       0, 0

Table 3.1. Payoffs for a 2 player public goods game.
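A minimal sketch of the payoffs in Eqs. (3.1)-(3.2), with arbitrary illustrative values of c and r, is given below; the two-player case reproduces Table 3.1.

```python
# Sketch: payoffs of the standard binary public goods game, Eqs. (3.1)-(3.2).
def pgg_payoffs(contributes, c=1.0, r=1.6):
    """contributes: list of booleans, one per player. Returns a list of payoffs."""
    N = len(contributes)
    n_c = sum(contributes)
    share = r * c * n_c / N                    # public good divided evenly among all players
    return [share - c if x else share for x in contributes]

# Two-player case reproduces Table 3.1 (here c = 1 and r = 1.6, so 1 < r < 2: a Prisoner's Dilemma).
print(pgg_payoffs([True, True]))    # [rc - c, rc - c]   -> [0.6, 0.6]
print(pgg_payoffs([True, False]))   # [rc/2 - c, rc/2]   -> [-0.2, 0.8]
print(pgg_payoffs([False, False]))  # [0, 0]
```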

Similarly, a Prisoner’s Dilemma can be scaled up into an N player variant that is equivalent to an N player public goods game. One method is to have N players in a population, where each pair plays one round of the Prisoner’s Dilemma together, with the restriction that each player must choose a fixed strategy to play versus every partner. They then receive the sum of the payoffs for all of their games. If the Prisoner’s Dilemma has payoffs T > R > P > S with P = 0 and T − R = −S, then this is equivalent to a

public goods game with c = (R − NS)/(N − 1) and r = NT/(R − NS).

3.1.2 Cooperative Behavior in Biology

There are many interactions in nature which involve something similar to the public goods game, but which appear to have avoided the dilemma, resulting in a population of contributors. Payoffs are often interpreted in evolutionary game theory as fitness: the ability for an individual to reproduce and spread its genes. A naive interpretation of evolution would suggest that any behavior that reduces one’s fitness will not evolve, and thus individuals will not sacrifice their own fitness for the sake of others, and yet we observe that they do. West and Griffin [35] discuss a number of evolutionary reasons why this might occur. A common cause is kin-selection, where cooperation increases the frequency of a strategy performed by individual even if it decreases their own personal fitness, so long as they are cooperating with individuals who are closely related to them. In particular, Hamilton’s rule states that a sacrificial behavior that helps one individual at the cost of the one performing it will be selected for if rb − c > 0, where here r is the relatedness of the individuals, b is the benefit to the fitness of the recipient, and c is the cost to the fitness of the one performing the action [36]. Even if an individual decreases their own direct fitness by such actions, they increase the total frequency of their shared genes in the population. An obvious example is parents taking care of their offspring, but it also applies to individuals caring for their siblings or cousins. Grafen [37] describe this process in detail and discuss a number of examples of this occurring in nature. However, in order for kin-selection to enable cooperation, individuals need to have some means of cooperating more with their relatives than they do with the general population as a whole, or else they sacrifice their own fitness to benefit unrelated individuals who do not share many genes. One method is kin-discrimination, where individuals are able to recognize individuals related to them, even if imperfectly. Brown and Brown [38] discuss the mechanisms and benefits of kin-discrimination in fish, particularly Salmonids, which recognize their kin via scents in the water and behave less aggressively in territory disputes with related individuals. Note that this kin-discrimination does not need to be deliberate in order to take place. Strassman et al. [39] discuss various types of kin-selection in microbes. One form of kin-selection they discuss is poison-antidote systems, where a microbe simultaneously releases a poison into its environment, and produces an internal antidote so that it doesn’t suffer from its own poison. Genetically related microbes will produce the same antidote, while unrelated microbes will not and will suffer from it. The microbe discriminates in

effect without having to discriminate in behavior. Strassman et al. also discuss assortment, a form of kin-selection that does not require discrimination, where individuals are more likely to interact with individuals related to them, often due to proximity. They find that most microbes in solid substrates lack mobility and will tend to be near their relatives since reproduction causes them to emerge near each other. In other cases, microbes have bonding mechanisms that cause them to attach to others only if they are genetically related, leading to drifting clonal colonies made up of related individuals who then interact primarily with each other.

3.1.3 Modifications of Public Goods Models

Many game theory models have been constructed that also allow for the presence of cooperation. For example, assortment can be observed in models with spatial or network dynamics. Miller and Knowles [40] study a model where agents in a changing graph network play public goods games with their neighbors, and tend to change their strategies to copy neighbors with high fitness. They find that although in any pairwise interaction defectors will score higher than cooperators, the frequency of cooperators tends to increase over time and they eventually dominate the population. The method of changing strategies, where individuals copy their neighbors, causes clusters of individuals that have the same type. When defectors have high fitness, they convert their neighbors into defectors, which lowers their own fitness in future rounds. When cooperators have high fitness, they convert their neighbors into cooperators, which increases their fitness in future rounds. This difference in positive versus negative feedback is what enables cooperators to dominate in the long term, even if they initially perform worse. Additionally, many studies have used more sophisticated forms of public goods games to explain the presence of cooperation even when players are only self-interested and have no genetic ties to other players. Hauert et al. [41] construct a volunteering model where players play a public goods game with a third option to opt out of the group and act as a loner, which is preferable to playing with all defectors but less desirable than playing with all cooperators. Under some parameters, this creates a cyclic dynamic, where defectors are replaced by loners, who are eventually replaced by cooperators, who are then replaced by defectors. Under other parameters, one population will completely take over the population, although with mutation dynamics the dominant population will occasionally switch. Fehr and Gachter [42] explore the effect of punishment in a human experiment. Subjects play multiple versions of a prisoner’s dilemma, some of which allow for the

ability to pay a cost to punish the other player, and find that players are willing to do this even if they will never play against the other player again, so the punisher cannot benefit from changing that player’s behavior. They find that games with sufficiently harsh punishment have more cooperative behavior compared to games without punishment. Hauert et al. [43] combine the notions of volunteering and punishment and create a public goods model with both, which includes four types of players: defectors, cooperators, loners and punishers, the latter behaving like cooperators but after each round also sacrificing a small amount of their payoff to inflict a cost on defectors as retribution for selfish behavior. They find that this leads to a higher frequency of cooperation compared to similar models with only volunteering or only punishment. Hauert [44] studies a complementary approach to punishment, where players are given the ability to sacrifice some of their payoff to provide an additional reward to contributors. They find that if the reward is large enough, this can lead to the emergence of cooperation, since individuals will be willing to pay for this reward in order to incentivize others to cooperate, but this is unstable since once everyone is cooperating individuals stop rewarding the behavior. In some sense, there are multiple public goods in the same model: the contribution to the primary public good, and the contribution to the extra rewards, and both are potentially vulnerable to free-riders. Another method of incentivizing cooperation is to have a payoff function which is nonlinear in the number of cooperators [45]. A classic example is the n-player snowdrift game, where the return from the public good is b if at least one player contributes, and 0 otherwise, and the cost to contribute is c/n where n is the number of contributors. In this game, all of the social benefit is gained from the first cooperator, and all that subsequent contributors accomplish is sharing the burden of the cost. If b > c then an equilibrium will exist where some players cooperate. This can be in the form of an "unfair" pure strategy equilibrium, in which one player contributes and the remaining players free-ride, safe in the knowledge that the benefit will be gained, or it can be in the form of a "fair" mixed strategy equilibrium in which every player contributes with a nonzero probability that balances the desire to maximize the chances of gaining the benefit for themselves against the cost of having to contribute. Many examples of nonlinear public goods exist in nature, such as enzyme production [46] and cooperative hunting [47]. If the payoff function is sufficiently steep, then players will be incentivized to contribute whenever they can increase their own payoff by more than the cost of their contribution, leading to a nontrivial equilibrium with some cooperation.
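As a concrete illustration of such a nonlinear payoff, the following sketch implements the n-player snowdrift payoffs described above, with arbitrary illustrative values of b and c.

```python
# Sketch: n-player snowdrift game payoffs as described above.
# The public benefit b is produced if at least one player contributes;
# contributors split the cost c among themselves.
def snowdrift_payoffs(contributes, b=4.0, c=2.0):
    """contributes: list of booleans. Returns one payoff per player."""
    n = sum(contributes)
    if n == 0:
        return [0.0 for _ in contributes]
    return [b - c / n if x else b for x in contributes]

# With b > c, a lone contributor still prefers contributing (b - c > 0) to losing the
# benefit entirely, so "unfair" equilibria with a single contributor can exist.
print(snowdrift_payoffs([True, False, False]))   # [2.0, 4.0, 4.0]
print(snowdrift_payoffs([False, False, False]))  # [0.0, 0.0, 0.0]
```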

Bshary [48] provides a classification of these and other methods enabling the evolutionary stability of altruistic behaviors into several categories. Reciprocity, where agents take costly actions that benefit other players in exchange for other players doing the same for them, encompasses cooperation in games with reputation or punishment. This generally requires some level of repeated interaction over time in a stable community, as well as more social agents such as humans [49]. By-product mutualism, where agents simply take the action that selfishly maximizes their own fitness and other players happen to benefit as a side effect, encompasses cooperation in games with kin selection, volunteering, or nonlinear payoffs. This can be seen in a wide range of interactions, including cleaner fish [50], herding animals [51], and meerkats [52].

3.1.4 Fairness

The idea that everyone contributes something to a publicly available and beneficial commodity, or that all organisms share in the cost of a common benefit, seems intrinsically related to the notion of fairness. Experiments with human subjects have shown that rather than making decisions purely to maximize their own payoff, players also care about the payoff of other players, or more specifically about a certain appropriate equality of treatment (see e.g. [53]). Subjects in experiments are often willing to cooperate in the prisoner’s dilemma if they believe that the other players will cooperate with them [54]. This is more accurately described as fairness rather than altruism, as subjects will punish other players who defect even in one-shot games [55]. In the Ultimatum Game, two players are placed in different roles: a dictator and a responder. A fixed quantity p of utility (or money, in experiments) is available. The dictator chooses a distribution of p among the two players, and the responder has the choice of accepting the offer, in which case both players receive their share according to the offer, or rejecting it, in which case neither player receives anything. While there are infinitely many Nash equilibria, each corresponding to a particular offer from the dictator and a responder who refuses to accept any offer except that particular one, the only subgame perfect equilibrium is one in which the dictator offers the minimum nonzero amount and the responder accepts it. In the absence of the ability to communicate or make threats, a perfectly rational responder who only cares about their own payoff has no incentive to reject any nonzero offer once they receive it, since a rejection gives them nothing and it’s too late to influence the dictator into changing their offer. However, in experiments nontrivial offers often occur, and nonzero offers are often rejected, which may be partially explained by notions of fairness. Responders are often willing to forgo any payoff if it will punish the maker of an offer that they deem to be unfair [56]. Rabin [57]

47 construct an alternate version of game theoretic equilibrium where players explicitly make decisions based on fairness in addition to their own payoffs, which allows for such behavior. A different approach to this was taken by Zhu et al. [58], who find that even when fairness is not explicitly valued by players, it can emerge from simple algorithmic rules. They construct a dynamical system in an agent-based approach to the Ultimatum Game, in which each player possesses a pair of values: the amount they are willing to offer when chosen as the dictator, and the minimum amount they are willing to accept when chosen as the responder. Players are randomly chosen from the population to play one of these roles, and non-rationally play the ultimatum game with their pre-selected strategy. Players then update their values according to two algorithms based on their most recent interaction: one focused on success of outcome, and one on greed. This eventually leads to the unique fair equilibrium in which each player offers and accepts approximately half of the total. This model was also used by Rajtmajer et al. [59] to model privacy concerns in social media. These models also use learning in some sense, as players do not start in such an equilibrium, but incrementally update in response to their own experiences. A similar model was made by Santos et al. [60] who construct an N-player version of the Ultimatum game in which a single dictator offers an amount to a group of potential recipients, and the deal is made if a sufficient number of them vote to accept the offer. Like in Zhu et al. [58], each player possesses a pair of values that determine their behavior, but here these values are updated via imitation dynamics. Players are randomly sampled in groups of some fixed size N from a larger population, players’ payoffs are averaged over several games, and then players randomly select other players to imitate, with higher probability of imitating players who received higher payoffs. They find that typically offers and acceptance rates will end up low, but due to noise added to the imitation dynamics, offers and acceptances will fluctuate around some nonzero average values. The higher the number of votes required for an offer to complete, the higher these values will be. Though offers will typically remain partially unfair in favor of the dictator, certain values of group size and noise will lead to an equilibrium which is close to fair. These mechanisms allow for players to learn and change their behavior over time in an incremental way based on feedback from the games that they play. Players have imperfect information about what the other players are going to do, and are not rational agents which could calculate the optimal strategy even if they had such information. Instead, they learn and adapt instinctively, adjusting their strategy after each game that makes them more likely to perform better if they encounter similar situations in the

48 future.

3.1.5 Our model

In this chapter, we construct a nonlinear public goods game where players are given the option to contribute any nonnegative amount to the public good rather than the typical binary choice between nothing and a fixed amount. We then construct a stochastic dynamical system consisting of a population of players who are randomly sampled to play this public goods game with each other. Similar to the model in Zhu et al. [58], players possess a value which they use to decide their contribution amount, and which they incrementally update in response to the result of games that they participate in. Players possess no knowledge of the global shape of the payoff function or the strategies of other players, only being informed of their own payoff and the local gradient of best response each time they play a game. Yet we find that not only does this lead players to cluster near a Nash equilibrium, but when the population is larger than the number of players in each game, they approach the unique fair equilibrium in which every player contributes the same amount. This suggests that repeated pairings with different players can cause players to contribute fairly purely out of self interest without the need for punishment or assortment. We also test the robustness of these players to exploitation. When permanent free riders who do not update their strategy are introduced into a population using the incremental update rule, we find that when the model’s parameters cause interactions with the free riders to be rare, regular players will contribute in order to reach the fair equilibrium regardless of the free rider’s presence. This causes the free rider to suffer from low payoffs since every group they participate in undercontributes. But when the model’s parameters cause interactions with the free riders to be frequent, regular players will increase their contributions to compensate for the free rider. This causes the free rider to benefit by being part of groups with equilibrium contributions without having to pay the cost.

3.2 Definition of the Public Goods Game

In the following m player public goods game, each player i possesses a real number ci ≥ 0 which defines their contribution to the public good. We define the total of these

contributions to be

C = Σ_{i=1}^{m} ci.

The game is defined by the general return function b(x), and each player i receives a payoff

ui(c1, c2...cm) = b(C) − ci, (3.3)

For simplicity, we will assume a power law return function

b(C) = RC^α (3.4)

with constants R, α ≥ 0. If α = 1, this is like the classic linear public goods game, but with no upper limit to contribution, with rate of return r = Rm: each player’s contribution is multiplied by r and then distributed among the m players, so that each player receives R times the total contribution. It follows that if R > 1, every player is incentivized to contribute as much as possible, since each receives more in return than she contributes, thus there will be no equilibrium unless we impose a maximum contribution limit. If R < 1 then the normal public goods dilemma applies, and self-interested players will contribute nothing even if it would be socially beneficial. However if α < 1, the return function b(C) is nonlinear and concave. This has two important effects: there are diminishing returns at high contribution rates, and there are also very high rates of return for low contributions. The ith player will seek to maximize ui by choice of contribution ci, defined by taking the derivative of Eq. 3.3 with respect to ci, setting that equal to 0 and solving. This is equivalent to b′(C) − 1 = 0, since dC/dci = 1 for each i. For the power law return function Eq. 3.4, we find this yields a Nash Equilibrium at

C = (1/(Rα))^{1/(α−1)} = (Rα)^{1/(1−α)} ≡ Ce
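As a quick numerical sanity check of this formula (a short Python sketch; the parameter values R, α, and m below are illustrative assumptions, not values used in the text), one can verify that when every player contributes an equal share of Ce, no unilateral deviation is profitable:

```python
# Sketch: a quick numerical check of the equilibrium formula above.  The parameter
# values R, alpha, m are illustrative assumptions, not taken from the text.

R, alpha, m = 20.0, 0.5, 4

C_e = (R * alpha) ** (1.0 / (1.0 - alpha))    # equilibrium total contribution (= 100 here)
f = C_e / m                                   # fairpoint: the equal share of C_e

def payoff(c_i, others_total):
    """u_i = b(C) - c_i with b(C) = R * C**alpha (Eqs. 3.3 and 3.4)."""
    return R * (c_i + others_total) ** alpha - c_i

# With every other player at f, no unilateral deviation from f should pay.
others = (m - 1) * f
u_star = payoff(f, others)
for eps in (0.001, 0.01, 0.1, 1.0):
    assert payoff(f + eps, others) < u_star
    assert payoff(f - eps, others) < u_star

print(f"C_e = {C_e:.2f}, fairpoint f = {f:.2f}, equilibrium payoff = {u_star:.2f}")
```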

The value of Ce represents the total contribution such that each player’s marginal rate of return to himself equals 1; changing this contribution would change the payoff of every player, including himself, by the same amount that his cost increased by, so he is indifferent to increasing or decreasing it. If the players were contributing ci such that

C > Ce, then each player has an incentive to decrease her contribution, while if C < Ce, each player has an incentive to increase her contribution, so this is the only form of Nash equilibrium in this game. However, there are infinitely many such equilibria, each corresponding to a combination

Figure 3.1. Graph of the benefit function b(C) = √(400C), with equilibrium and socially optimal values marked in the case where m = 2.

of ci summing up to Ce. We will be particularly interested in one such equilibrium, in which every player contributes equally to the public good; we define this as the fairpoint

f = Ce/m.

At this special Nash equilibrium, ci = f for each player. If we fix some R < 1, as it would be in a standard public goods game, then we can consider Ce = (Rα)^{1/(1−α)} as a function of α. When α = 0, Ce = 0, corresponding to the case where the public goods game returns a constant payout of R to each player regardless of how much is contributed, so nobody has incentive to contribute. As α gradually increases,

Ce will increase, as shown in Fig. 3.2. This reflects the fact that functions of the type xα with α < 1 have arbitrarily large slope near 0. Since the players’ marginal rate of return is effectively multiplied by the derivative of Cα, they are incentivized to increase their payments in this region up to the point where the derivative becomes equal to 1/R. This point is near 0 for very small α, but gradually moves further away as α increases.

However, at some point Ce reaches a peak, and then decreases as α increases. As α approaches 1, Ce once again approaches 0, reflecting the fact that C^α is flattening out into a line in this limit, which corresponds to an ordinary, linear public goods game.

3.3 Population Dynamics

We now define a model for the population dynamics. Fix a population size n ∈ N, a group size m ≤ n, a step size s ∈ R, and an R and α for the public goods game. In this population of n players, each has some initial strategy ci ∈ R corresponding to how much they plan to contribute to any public goods game they participate in. The state z of the population is defined by the set {ci} of these contributions. At each time step, a group σ


Figure 3.2. Examples of the Nash equilibrium value Ce plotted as a function of α, for three different return values R = 0.4 (bottom), 0.7 (middle), 0.95 (top).

Table 3.2. Important Variables
n      Population size
m      Group size
ci     Contribution of player i to each game
C      Sum of contributions of all players in a particular group
Ce     Contribution value at the equilibrium point
f      Individual contribution value in the fair equilibrium; f = Ce/m
L(z)   Modulus of the population; L(z) = Σ_{i=1}^{n} (ci − f)/s (mod m)
E1     E1 = Σ|ci − f| / n
E2     E2 = Σ(ci − f)^2 / n

of m players is chosen uniformly at random from the population. This group plays one round of the public goods game together, and each player receives a payoff of ui defined in Equation 3.3. Each player i also considers what their payoff would have been had they contributed ci + s or ci − s instead of ci, assuming the other players contributed the same amounts they did in the actual game. Each player then compares these three possible utilities and updates their strategy by copying the one that would have given them the highest payoff (remaining at ci if there are any ties, though this will not occur for almost all b and s). This will in general also change the population state z. If s is sufficiently small relative to the smoothness of b, this dynamic corresponds to

each player incrementally adjusting their strategy along the gradient of best response in an attempt to (locally) maximize their own utility. If C < Ce, then dui/dci > 0, which means that each player will increase ci by s. If C > Ce then dui/dci < 0 for each player, and each will decrease ci by s (if C = Ce then dui/dci = 0, and the ci will not change). Thus the entire group will move in the same direction, by the same amount.
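A minimal sketch of one time step of this dynamic is given below (in Python; the helper names, parameter values, and the treatment of the ci ≥ 0 boundary are my own illustrative choices, not specifications from the text):

```python
import random

# Sketch of one round of the dynamic described above: a random group of m players
# plays the game, and each member keeps whichever of {c_i - s, c_i, c_i + s} would
# have earned the most against the others' actual contributions.  The return
# function b(C) = R*C**alpha and all parameter values are illustrative assumptions.

R, ALPHA = 20.0, 0.5                       # C_e = (R*ALPHA)**(1/(1-ALPHA)) = 100

def payoff(c_i, others_total):
    return R * (c_i + others_total) ** ALPHA - c_i

def play_one_round(c, m, s):
    """One time step: mutates the list of contributions c in place."""
    group = random.sample(range(len(c)), m)
    played = {i: c[i] for i in group}      # what was actually contributed this round
    total = sum(played.values())
    for i in group:
        others = total - played[i]
        options = [x for x in (played[i] - s, played[i], played[i] + s) if x >= 0]
        # ties (rare) are broken in favor of staying put, as in the text
        c[i] = max(options, key=lambda x: (payoff(x, others), x == played[i]))

# Example: n = 20 players, groups of m = 5, so the fairpoint is f = 100/5 = 20.
random.seed(0)
n, m, s = 20, 5, 0.1
c = [random.uniform(0.0, 50.0) for _ in range(n)]
for _ in range(5000):
    play_one_round(c, m, s)
print("average contribution after 5000 rounds:", round(sum(c) / n, 2))  # near 20
```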

Figure 3.3. Representation of a single round of play.

Note that as long as u is concave and s is sufficiently small, the actual values and shape of u do not matter except insofar as they determine Ce. Each player will increase or decrease their contribution by the same amount, regardless of how steep the gradient is.

Thus, any two utility functions with equal Ce will result in identical population dynamics. Players might not receive the same payoffs from each game, and the gradient of utility might have different steepness, but it will have the same sign, so players will respond the same in every possible scenario.

Additionally, the actual value of Ce only amounts to a translation on the dynamics, and s amounts to a scaling. If we fix n and m, then any two models are isomorphic as dynamical systems. If M1 is a model with Ce = C1, s = s1, and f1 = C1/m, and M2 is a model with Ce = C2, s = s2, then if we express every player’s contribution as ci = fi + kis for some ki ∈ R, then φ(f1 + kis1) = f2 + kis2 induces an isomorphism between M1 and

M2 when applied to the contribution value of each player. We define the modulus L of a population state to be

L(z) = Σ_{i=1}^{n} (ci − f)/s   (mod m).

For a fixed system, this will give us a real number 0 ≤ L < m for each population which will be invariant under the update function. Whenever a group is chosen to play, every player in the group will move in the same direction, either all increasing their payouts by s, decreasing them by s, or not changing them. Thus the modulus of the population will remain fixed at every time step. It follows that the modulus of a population is entirely

determined by its initial conditions, so a population that starts with modulus L can only reach states that also have modulus L.
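A quick empirical check of this invariance, using the same kind of one-round update as in the sketch above (again with illustrative parameter values of my own):

```python
import random

# Sketch: check empirically that the modulus L(z) = sum_i (c_i - f)/s (mod m) is
# preserved by the update rule.  The one-round update is the same
# best-of-{c - s, c, c + s} rule as above; parameter values are illustrative.

R, alpha, n, m, s = 20.0, 0.5, 12, 3, 0.5
C_e = (R * alpha) ** (1.0 / (1.0 - alpha))        # = 100
f = C_e / m

def modulus(c):
    return sum((ci - f) / s for ci in c) % m

def step(c):
    group = random.sample(range(n), m)
    played = {i: c[i] for i in group}
    total = sum(played.values())
    for i in group:
        oth = total - played[i]
        u = lambda x: R * max(x + oth, 0.0) ** alpha - x
        c[i] = max((played[i] - s, played[i], played[i] + s), key=u)

random.seed(1)
c = [f + random.randint(-30, 30) * s for _ in range(n)]
L0 = modulus(c)
for _ in range(5000):
    step(c)
drift = abs(modulus(c) - L0)
print("modulus before:", round(L0, 6), "after:", round(modulus(c), 6))
assert min(drift, abs(drift - m)) < 1e-6          # unchanged up to rounding
```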

Proposition 8 1) The population state with ci = f for all i is an absorbing state. 2) If m < n, this is the only absorbing state.

Proof: 1) Suppose ci = f for all i; then at each time step, any chosen group will have

C = Ce, and thus no player will change ci.

2) Suppose m < n, and let z be a population state with ci ≠ f for some i. There are (n choose m) possible groups that could be chosen at each time step. For each group σ, let Cσ = Σci, where the sum is over players in σ. Then if we can show that Cσ ≠ Ce for some group σ, that corresponds to a group that will cause players to update their contribution if chosen, so there is a nonzero probability that the population will move away from state z. Let group 1 correspond to a group of minimal contribution (pick the m players with the least individual contributions). If C1 ≠ Ce then we are done. If C1 = Ce, then that means the average ci in this group is f, so either ci = f for all players in this group, or some ci < f and some ci > f. Let group 2 correspond to a group with the same players as group 1 except the player with the least ci is replaced by the player with the greatest ci in the population, who cannot already be part of group 1 because it chose the m least contributions. The contributions of these two players cannot be equal unless every player had the same contribution, which would have to be f, and we assumed this wasn’t the case. Thus C2 > C1 = Ce. □ Note that the absorbing state has modulus 0, so populations with nonzero modulus and m < n cannot reach any absorbing state. For simplicity in our simulations, we always select b(C) such that f is an integer multiple of s, and set all initial ci to be integer multiples of s. In this case, L(z) will yield integer values. However, this simplification is only required in Propositions 9, 11, and 12, and all of our other results work in general.

Proposition 9 If s divides (ci − f) for all i, and m = 1, then each player will monotonically shift their ci towards f until they reach it. The population will end up in the absorbing state with probability 1.

Proof: Note that when m = 1, f = Ce. At each time step, one player is randomly selected to play the public goods game alone; all of the return from the player’s contribution is returned directly back to the player. Ce is defined as precisely the value at which a

player’s marginal increase in return to itself is equal to the marginal increase in cost, so the optimum value to maximize utility is Ce, and players will shift their ci towards it each time they are chosen to play the game, until eventually all players reach it. □

Proposition 10 Let M be a model with m = n. 1) If L(z) = 0, the population will deterministically approach a specific fixed state determined by the initial conditions. 2) If L(z) ≠ 0, the population will deterministically approach a set of two states, and will oscillate between them.

Proof: If m = n, then there is only one possible group of m players that can be chosen from the population, so there is only one group with one value C1 depending on the population state. At each time step, the same population will be chosen to play together, so the entire population evolution is deterministic. On each time step, if C1 < Ce, every player will increase their contribution by s, so C1 will increase by ms. If C1 > Ce, every player will decrease their contribution by s, so C1 will decrease by ms. If L(z) = 0, then

C1 will eventually reach Ce, at which point the population is in an absorbing state (not necessarily of the form described by Proposition 9). If L(z) = ℓ ≠ 0, C1 will eventually reach Ce + ℓs. On the following timestep, every player will decrease ci, and C1 will change to Ce + (ℓ − m)s, oscillating between these two states every other step. Note that in either case, since every player changes their contribution at the same time and by the same amount as every other player, the difference between each pair of players’ contributions, ci − cj, will remain constant throughout time, depending only on the initial conditions. □

Figure 3.4. Simulation with m = n = 4

3.4 Numerical Simulations

In simulation, players in a population tend to move towards the fairpoint until they end up close to it, often spreading out in a small Gaussian-like distribution around it. An example is shown in Fig. 3.5 for n = 100 players and group size m = 10.

Figure 3.5. Numerical simulation of the model with n = 100 players and group size m = 10, showing the distribution of contributions C of each player around the fairpoint f: t = 0 (top), t = 800 (middle), t = 1600 (bottom).

In Fig. 3.6 we show how the average contribution of the entire population evolves over time for n = 100 and three different values of m.

In Fig. 3.7, we show the time evolution of several different ci over time, for five particular players in a population of n = 50, with m = 10. We see in these simulations that on a short timeframe, the average of players’ contributions quickly goes to f, and then on a longer timeframe the variance in contributions decreases until players are clustered near the fairpoint. However in both cases the values fluctuate around this point without settling on it. We would like to formally define this trend and prove why it must happen. To do this, we note that there are several possible ways to denote how far a population is from the absorbing state. Our first type of distance is simply the average distance of each player to the fairpoint:

E1(z) = Σ|ci − f| / n

Figure 3.6. Average contribution cavg for population of n = 100 players over time, shown for: m = 10 (top), m = 50 (middle), m = 100 (bottom).

For instance, if every player has ci = f ± 5, then E1 = 5. Note that E1 = 0 iff z is the absorbing state. In Fig. 3.9, we show nine examples of E1 vs time, starting from three different initial conditions or population states z, each with three different group sizes m.

While the initial values of E1 clearly start with common initial conditions together, we

Figure 3.7. Examples of five individual players’ ci over time, in a population of n = 50 players and group size m = 10.

Figure 3.8. E1 behavior
observe that as the simulation progresses, the final values approached by this distance function appear to be associated more with their group size values, with the smallest m value approaching the smallest E1 value. Fig. 3.10 shows a diagram of the initial conditions and approximate distribution after enough time has passed for them to cluster near f, using the same color for each population as in Fig. 3.9.

We observe that E1 seems to decrease at a roughly constant rate, creating a linear graph in the time before it reaches the value it ends up fluctuating around. This makes sense, as the step size s does not depend on the distance the group is from equilibrium, so the only factor influencing changes in E1 is how many players are moving towards or away from the fairpoint.


Figure 3.9. E1 over time shown for three different initial conditions with n = 100, each with different subgroup sizes: m = 10, 50, 80.

Figure 3.10. The initial states at t = 0 (left) and at t = 1600 for each group size (right) from the simulation used in Fig. 3.9.
However, the speed seems to differ based on m, since more players are moving in each round. To measure this speed, for each m we initialize a population with n = 100, allow the simulation to run for 50 time steps to allow the population to adjust their average, then measure E1. We then run the simulation for 500 games and measure E1 again. This allows the system to run for long enough to get a good average, but not long enough for

Figure 3.11. Average decrease rate for E1 as a function of m.

it to reach the stable E1, which tends to happen around t = 1000. We then compute the average decrease in E1 per time step during this interval. We average this speed over 100 simulations with each m to reduce variation due to noise. The results are shown in Figure 3.11. We observe that the speed is low when m is near 0 or n, and greatest in the middle, though the distribution is asymmetric and peaks around m = 40 rather than m = 50. There appear to be two competing forces driving this behavior. When a group is chosen to play, m players will update their contribution by s. Since E1 is defined as a population average, each player who moves towards the fairpoint will decrease E1 by s/n. When m is small, fewer players play each game, and thus the E1 changes at a slow rate.

However, sometimes players move away from the fairpoint, each causing E1 to increase by s/n. Sometimes, a few players can push a larger number of players away from the fairpoint, as shown in Fig. 3.8. This is more likely to happen for larger m, which means that even though more players are moving, they tend to move back and forth rather than directly towards the fairpoint. Eventually we reach the case where m = n as in

Proposition 10 and E1 does not decrease at all.
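The following sketch carries out the speed measurement just described (a burn-in of 50 steps, a further 500 steps, averaged over repeated runs), though with fewer repeats than the 100 used here and with an illustrative return function, so the exact numbers will differ from Fig. 3.11:

```python
import math, random

# Sketch of the speed measurement described above: estimate the average per-step
# decrease of E1 for a few group sizes m, with n = 100.  Each run uses a burn-in of
# 50 steps and then 500 further steps, as in the text, but is averaged over fewer
# repeat runs than the 100 used there.  The return function b(C) = sqrt(40000*m*C)
# (borrowed from the caption of Fig. 3.12) keeps the fairpoint at f = 10000 for
# every m; all other choices here are illustrative assumptions.

n, s, f = 100, 1.0, 10000.0

def step(c, m, R):
    group = random.sample(range(n), m)
    played = {i: c[i] for i in group}
    total = sum(played.values())
    for i in group:
        oth = total - played[i]
        u = lambda x: R * math.sqrt(x + oth) - x
        c[i] = max((played[i] - s, played[i], played[i] + s), key=u)

def E1(c):
    return sum(abs(ci - f) for ci in c) / n

def decrease_rate(m, repeats=20):
    R = math.sqrt(40000.0 * m)            # C_e = (R/2)**2 = 10000*m, so f = 10000
    rate = 0.0
    for _ in range(repeats):
        c = [f + random.randint(-2000, 2000) * s for _ in range(n)]
        for _ in range(50):               # burn-in, as in the text
            step(c, m, R)
        e0 = E1(c)
        for _ in range(500):
            step(c, m, R)
        rate += (e0 - E1(c)) / 500.0
    return rate / repeats

random.seed(3)
for m in (10, 40, 80):
    print(f"m = {m:3d}: average decrease in E1 per step ~ {decrease_rate(m):.4f}")
```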

Proposition 11 Consider a model M with n > 2, m = 2, and s divides (ci − f) for all i. Then: 1) E1 is nonincreasing with respect to time; 2) If the modulus of the population is 0 and E1 > 0 then there is a nonzero chance of E1 decreasing within the next three time steps; 3) If the modulus of the population is 1 and E1 > s/n then there is a nonzero chance of E1 decreasing within the next three time steps.

Proof: Since changes in E1 at each time step depend only on the arrangement of the two players chosen relative to f, there are only five cases for which we need to show that changes in E1 are nonpositive:

Case 1: If both players chosen have ci < f, then C < Ce and they will increase their contributions by s, and get closer to the fairpoint. Thus E1 will decrease by 2s/n.

Case 2: If both players chosen have ci > f then C > Ce, so they will decrease their contributions by s and get closer to the fairpoint. Thus E1 will decrease by 2s/n.

Case 3: If one player chosen has ci < f and the other has cj > f, then if C = Ce neither will move, so E1 will not change. If C 6= Ce, then regardless of the direction of their motion, one will get closer to f while the other gets further from f by the same amount, so E1 will not change.

Case 4: If one player chosen has ci < f or ci > f and the other has cj = f, then the first player will move closer to f while the second player moves off it by the same amount, so E1 will not change.

Case 5: If both players have ci = f then they will not move, so E1 will not change. This proves statement 1.

Note that E1 will strictly decrease when either case 1 or 2 are chosen, so if we can show a nonzero probability of these states occurring within three time steps of any state, then this proves statements 2 and 3.

Case A: If there are at least two players with ci < f, or at least two players with ci > f, then there is a nonzero probability of those players being chosen together, which would cause case 1 or case 2 to occur and decrease E1. Note that since n > 2, not being in case A implies that at least one player must have ci = f by the pigeonhole principle. Choose one, call it player k.

Case B: If there are players i, j with ci < f and cj > f, then there is a nonzero probability that in the first time step, player i will be chosen with player k (putting us in case 4), increasing both their contributions so that player k switches to ck = f + s. In the second time step, player k is chosen with player j, putting us in case 2, so E1 decreases by 2s/n.

Case C: There is one player with ci < f − s or ci > f + s, and all other players have cj = f. Suppose first that ci < f − s. Note that player i is at least two steps away from f, so there is a nonzero probability that in the first time step, player i is chosen with player k, increasing both contributions so that ck = f + s and ci < f still. Then in the second time step player i is chosen with a different player j ≠ k, so that the new cj = f + s. Then in the third time step players j and k are chosen together, so we are in Case 2 and E1 will decrease by 2s/n. If we started with ci > f + s then it follows by symmetry that we have Case 1. Note that if the modulus of the population is 0, then Cases A, B and C are exhaustive, so we have proved statement 2.

Case D: There is one player with ci = f − s or ci = f + s and all other players have cj = f. Then E1 = s/n. Thus if the modulus is 1 and E1 ≠ s/n we must also be in case A, B or C. □ Thus any population with m = 2 and n > 2 will gradually converge towards the absorbing state, at least until it reaches within a certain distance of it. Note that the assumption that s divides (ci − f) for all i is required to prevent overshooting in Cases 1 and 2. An example of when this assumption does not hold is if s = 1, c1 = c2 = f − 0.1.

Then if these players are chosen, in the next time step they will reach c1 = c2 = f + 0.9, and E1 will increase. It is useful to consider other distance or energy functions for the population states, such as

E2(z) = Σ(ci − f)^2 / n

This measures the average of the squared distance of each player from the fairpoint, so will behave similarly to E1 in many respects, but has the advantage that when a player updates their strategy, their contribution to E2 will change proportionately to their current distance from f. Since groups always move in the direction influenced by the sum of all ci, this means that players further from f, with more influence over the group’s direction, will also have more influence over E2, causing it to decrease more.

E2 is almost the same as the variance of the population’s contributions, but will differ slightly if the average contribution is not exactly equal to f.

Proposition 12 Let M be a model with m = 2, n > 2, and s divides (ci − f) for all i. Then for any population state z,

1) E2 is nonincreasing with respect to time

2) If L(z) = 0 and E2 > 0 then there is a nonzero chance of E2 decreasing within the

next three time steps; 3) If L(z) = 1 and E2 > s²/n then there is a nonzero chance of E2 decreasing within the next three time steps.

Proof: The proof is almost entirely identical to the proof of Proposition 11, except that in Cases 3 and 4, E2 sometimes decreases instead of remaining constant. We omit the details. □ When m > 2, there are possible configurations of players that cause E2 to increase. For example, if m = 3, and the group chosen has one player at f − s, and two players at f, then the players will increase their contributions and we will end up with one player at f and two at f + s, which means that E2 will increase by s²/n. However events like this seem to occur rarely in the space of all possible configurations of players and produce only small increases in E2, while most configurations cause E2 to decrease.

For any group σ, let dσ = |Ce − C|. By comparing the contribution of each player to

E2 before and after that group plays, the change in E2 when group σ is played is found to be

Δσ E2 = (s/n)(ms − 2dσ)   if dσ ≠ 0,

while Δσ E2 = 0 if dσ = 0, corresponding to the case where C = Ce, so the players don’t change strategies. Since there are (n choose m) possible groups of players that could be chosen, each with probability 1/(n choose m), we find that for a fixed population state, the expected change in total E2 on the next timestep is

⟨ΔE2⟩ = (1/(n choose m)) Σσ Δσ E2 = (s/n) [ mps − 2 Σσ dσ / (n choose m) ],   (3.5)

where p is the probability that a chosen group will have nonzero dσ. (For most population states, p will be close to or equal to 1). This means that if the average dσ is greater than mps/2, we expect E2 to decrease over time, and if the average dσ is less than mps/2 we expect E2 to increase over time. The further players tend to be from the fairpoint in a population, the larger the dσ will tend to be, so this suggests that populations that are spread out will tend to have decreasing E2 and draw closer to the fairpoint, while populations close to the fair point will spread out, behaving stochastically in a drift towards some sort of equilibrium when the players are spread out just enough that ⟨ΔE2⟩ = 0.
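As a sanity check on Equation 3.5, the following sketch enumerates every possible group in a small population and compares the expected change in E2 computed directly (under the same small-s description, in which the whole group moves by s toward Ce whenever dσ ≠ 0) with the value given by Eq. 3.5; all helper names and parameter values are illustrative assumptions:

```python
import itertools, random

# Sketch: check the algebra behind Eq. 3.5 on a small population by brute force.
# Every possible group is enumerated; the expected one-step change in E2 is
# computed directly and compared with the closed form of Eq. 3.5.

R, alpha, n, m, s = 20.0, 0.5, 6, 2, 0.01
C_e = (R * alpha) ** (1.0 / (1.0 - alpha))        # = 100
f = C_e / m

random.seed(4)
c = [f + random.uniform(-3.0, 3.0) for _ in range(n)]
groups = list(itertools.combinations(range(n), m))

# Direct expectation over a uniformly chosen group.
direct = 0.0
for g in groups:
    C = sum(c[i] for i in g)
    if abs(C - C_e) < 1e-12:
        continue
    move = s if C < C_e else -s                    # whole group steps toward C_e
    dE2 = sum((c[i] - f + move) ** 2 - (c[i] - f) ** 2 for i in g) / n
    direct += dE2 / len(groups)

# Eq. 3.5:  <dE2> = (s/n) * (m*p*s - 2 * (sum of d_sigma) / (n choose m))
d = [abs(C_e - sum(c[i] for i in g)) for g in groups]
p = sum(1 for x in d if x > 1e-12) / len(groups)
eq35 = (s / n) * (m * p * s - 2.0 * sum(d) / len(groups))

print(direct, eq35)
assert abs(direct - eq35) < 1e-9                   # the two computations agree
```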

Lemma 3 For any population state, (E1)² ≤ E2 ≤ n(E1)²

Proof: Fix n, and for the population state z define

F(z) = E2(z) / (E1(z))²

Because E1 is the average of some set of numbers, and E2 is the average of the squares of those same numbers, F will achieve a maximum when all of those numbers are zero except one of them, and will achieve a minimum when all of those numbers have the same value. In particular, let z1 be the state where c1 = f + d for some constant d, and ci = f for all i ≠ 1. Then E2(z1) = d²/n, and E1(z1) = d/n, and therefore F(z1) = n is a maximum for F. Let z2 be the state where ci = f + d for all i. Then E2(z2) = d² and

E1(z2) = d, and therefore F(z2) = 1 is a minimum for F. Thus we get 1 ≤ F(z) ≤ n for all z, and multiplying by (E1)² yields (E1)² ≤ E2 ≤ n(E1)². □ Rearranging the inequalities yields the immediate corollary that √(E2/n) ≤ E1 ≤ √E2.

Lemma 4 For any fixed constant Q, and population variables m, n with m < n, there exists a constant dn,m,Q > 0 such that E1 ≥ Q implies the average dσ over all groups, davg ≥ dn,m,Q. Further, this dn,m,Q is linear with respect to Q, i.e. dn,m,Q = Qdn,m for some dn,m > 0.

Proof: Consider the function g from Z to R, which maps population states to their davg. It should be clear that this is a continuous function (changing population states by a small amount will change davg by a small amount). For each fixed Q, let gQ be g restricted to populations with E1 = Q. We wish to show that each gQ attains a minimum, and that these minima will be linear with respect to Q.

First, we will show that if gQ attains a minimum, it must be in a "stacked" population, where all players contribute one of two values. Next, we will show that for each number of players in each stack, there is one particular contribution value for each stack that locally minimizes gQ. Then since there are finitely many ways to arrange n players into two stacks, one of these local minima must be the global minimum for gQ. To do this, first consider the following transformations from the space of populations states to itself, all of which preserve E1 and preserve or decrease davg. Let i and j denote two players on the same side of the fairpoint (ci and cj both ≥ f or ci and cj both ≤ f), assume w.l.o.g. that ci ≤ cj. Let x be any value with 0 ≤ x ≤ cj − ci. Then let G(i, j; x) be the transformation that increases ci by x and decreases cj by x. This will preserve

E1 since both players are on the same side of the fairpoint and moved towards each other, so one got closer to the fairpoint by x, and the other got further from it by x.

To compute the effect on davg, partition the set of player groups into four sets: groups containing neither player i nor j, groups containing i but not j, groups containing j but not i, and groups containing both i and j. It should be clear that for any group

σ containing neither player, dσ will not change since none of the players in it moved.

Additionally, for groups containing both i and j, dσ will not change, because one player increased by x and the other decreased by x, so the net change is zero. There is a natural bijection between groups containing i but not j, and groups containing j but not i, made by substituting player i for player j. Consider a single pair, let σ{i} be the group containing player i, with total contributions C{i} and distance d{i}, while σ{j} is the group containing player j with sum C{j} and distance d{j}. Let Co be the sum of the contributions all of the players other than i and j (which are common to both groups). Then we can write

d{i} = |Co + ci − Ce| and d{j} = |Co + cj − Ce|.

After the transformation, we obtain G(d{i}) = |Co + ci + x − Ce| and G(d{j}) = |Co + cj − x − Ce|. It follows that d{i} + d{j} ≥ G(d{i}) + G(d{j}) since ci ≤ cj (if C{i} and C{j} are on the same side of Ce, one of di and dj will increase and the other will decrease, if they’re on opposite sides both will decrease, or one will increase by less than the other one increases). Thus the net change in contribution to davg from this pair will either remain the same or decrease. Since this is true of every pair, davg will remain constant or decrease after G is applied. Any population state can, using a finite composition of such G(i, j; x), be transformed into a “stacked" population, a population state where all players on the left of the fairpoint are playing the same value cL and all players on the right are playing the value cR, where these values are simply the average values that the players originally had on the left and right, respectively. Note that if any player was originally playing f, he can be moved to either side as part of this averaging process, so this stacking process is not necessarily unique for some initial populations; but there are finitely many choices for each initial population state. Since each G(i, j; x) preserves E1, and also preserves or decreases davg, we have shown that for every population state with E1 = Q, there is at least one stacked population with davg less than or equal to the original, so we can restrict our search for a minimum to stacked populations. Fix n, m. Then the set of all stacked populations can be characterized by three variables, cL, cR, and the number of players on the left of the fairpoint, which we will

65 call γ (the number of players on the right is n − γ). Since there are only finitely many choices for γ, let’s further restrict our search to populations with a fixed γ. Let gγ(cL, cR) be the function which takes cL and cR as inputs and gives the output davg, corresponding to the stacked population with γ, cR, and cL. We can compute this by summing over all possible groups chosen from the stacks:

gγ(cL, cR) = (1/(n choose m)) Σ_{i=1}^{m} (γ choose i) ((n − γ) choose (m − i)) |(m − i)cR − i·cL|

Since we’ve fixed E1 = Q then cR is automatically determined after choosing γ and cL, so we can consider g to be a function of one variable:

gγ(cL) = (1/(n choose m)) Σ_{i=1}^{m} (γ choose i) ((n − γ) choose (m − i)) |(m − i)(Q − γcL)/(n − γ) − i·cL|.

If we define x = cL/Q, then we can factor a common term of Q out of gγ(cL), yielding:

gQ,γ(x) = (Q/(n choose m)) Σ_{i=1}^{m} (γ choose i) ((n − γ) choose (m − i)) |(m − i)(1 − γx)/(n − γ) − i·x|.

Since this is a sum of finitely many absolute value terms, each of which is linear in x, then gQ,γ(x) must attain a minimum value for some (not necessarily unique) value xγ.

Note that since we’ve completely factored out Q, the value xγ which minimizes g will be independent of Q. That is, each possible stack configuration has its own arrangement that minimizes davg, and the only thing Q does is linearly scale this arrangement. Then xγ will correspond to a local minimum of g. Since there are finitely many possible γ, there must be one which is a global minimum. Since we fixed n, m earlier, let xn,m be the xγ which yields this global minimum. This xn,m will give the relative position of the stacks with minimum davg out of all possible population states with that n and m.

Then Ln,m,Q := Q · xn,m and rn,m,Q := Q(1 − xn,m) give the actual positions of the stacks with minimum davg for population states with E1 = Q. Let dn,m,Q be the minimum davg obtained. Note that when m < n, the only population state with davg = 0 is the one with all ci = f, and thus has E1 = 0, so we must have dn,m,Q > 0 for nonzero Q. Note also that this dn,m,Q is linear with respect to Q, since it’s a minimum of g, which is linear with respect to Q, so there exists dn,m such that dn,m,Q = Q · dn,m. Therefore, if z is any population state with E1 = Q′ ≥ Q, we must have davg ≥ Q′ · dn,m ≥ Q · dn,m = dn,m,Q. □
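The following sketch illustrates Lemma 4 numerically for a small case, estimating dn,m by a grid search over stacked states built directly from the definitions of E1 and dσ (the grid resolution and parameter choices are my own, and this is only a numerical illustration, not part of the proof):

```python
import itertools, random

# Sketch: a numerical illustration of Lemma 4 for a small case, working with
# offsets c_i - f (so the fairpoint is 0 and d_sigma is just |sum of a group's
# offsets|).  We estimate d_{n,m} by scanning stacked states with E1 = 1 over a
# grid, then compare d_avg / E1 for random states against it.

n, m = 5, 2
groups = list(itertools.combinations(range(n), m))

def d_avg(offsets):
    return sum(abs(sum(offsets[i] for i in g)) for g in groups) / len(groups)

def E1(offsets):
    return sum(abs(x) for x in offsets) / n

# Stacked states with E1 = 1: gamma players at -cL and n - gamma players at +cR,
# with gamma*cL + (n - gamma)*cR = n.
d_nm = float("inf")
for gamma in range(1, n):
    for k in range(0, 1001):
        cL = (k / 1000.0) * n / gamma
        cR = (n - gamma * cL) / (n - gamma)
        d_nm = min(d_nm, d_avg([-cL] * gamma + [cR] * (n - gamma)))

print("estimated d_{n,m}:", round(d_nm, 4))

random.seed(5)
worst = min(d_avg(state) / E1(state)
            for state in ([random.uniform(-10, 10) for _ in range(n)]
                          for _ in range(2000)))
print("smallest d_avg / E1 over random states:", round(worst, 4))
# Lemma 4 says d_avg >= E1 * d_{n,m}; the second number should not fall below the first.
```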

Theorem 3 For a given n, m, s, with m < n, there exists a constant Kn,m,s > 0 such that E2 > Kn,m,s implies ⟨ΔE2⟩ < 0.

Proof: Given n, m, s, let dn,m be the quantity defined in Lemma 4, and then define

Q = ms/(2dn,m)   and   Kn,m,s = nQ².

Then whenever E2 > Kn,m,s, Lemma 3 forces n(E1)² ≥ E2 > Kn,m,s = nQ², and thus E1 > Q = ms/(2dn,m). By Lemma 4, this implies that davg > Q·dn,m, and thus davg > ms/2 ≥ mps/2 for any 0 ≤ p ≤ 1. By Equation 3.5, this implies that ⟨ΔE2⟩ < 0. □ Note that E2 does not depend on s, so if we consider two models with the same population state and the same parameters other than s, they will have the same E2.

However since Kn,m,s is quadratic with respect to s, the threshold beyond which E2 is decreasing will depend on s. Thus, the smaller s is, the closer the population states have to get to the fairpoint before they stop getting closer. Thus, if we consider any fixed n, m, and population state z, then there exists some s0 such that E2(z) > Kn,m,s for all s < s0. In other words, we can force the population to cluster arbitrarily close to the fairpoint by decreasing the stepsize. Taking this to the extreme, we can construct a continuous version as follows. For a given m, n, and initial population state z, let xs(t) ∈ R^n be the expected value of each player after ⌊t/s⌋ steps. Although for each nonzero s this function will be discontinuous, these functions will converge as s approaches zero, thus we define x(t) = lim_{s→0} xs(t). This will be a deterministic continuous system, with each player’s payoff moving at a velocity equal to the average of the direction of best response over all groups that player belongs to. The change of E2 will become

dE2/dt = −2dσ/n,

and since dσ takes on discrete values this means E2 will continue to decrease until dσ reaches zero, which only occurs in an absorption state. It immediately follows that any population with m < n will reach the equilibrium where all ci = f after a finite amount of time, while any population with n = m will reach an equilibrium where the average ci is f after a finite amount of time.

As an example, consider a population with n = 3, m = 2, and initial values c1 =

−1, c2 = c3 = 1. For the discrete model with small s, then player 1 pairing with either player 2 or 3 will not cause E1 to decrease, but 1/3 of the time players 2 and 3 will play

and shift to the left. This will cause c2 and c3 to be less than c1 in magnitude, which will allow it to decrease as well, but increasing the other player’s contribution in the process. In the long run, this will cause each of the three players to move towards the fairpoint at an average rate of s/6 per timestep. This means that in the continuous version, each player will move towards the fairpoint at exactly a rate of 1/6 per unit of time, until they reach it and stop.

3.5 Dynamics in the Presence of a Permanent Freeloader

We now consider the robustness of the population dynamics described above to invasion from other types of players. Consider a population with n − 1 regular players following the standard update rules discussed above, and one permanent freeloader who never updates his strategy, and always contributes a fixed and low amount c0 to the public good, such that 0 ≤ c0 < f. While we might have c0 = 0 as for a perfect freeloader, to be general we will consider c0 = f − δ for some δ > 0. If the regular players have ci near f, then whenever a group is chosen that includes the freeloader and m − 1 regular players, the group will have an unusually small total contribution C, and the regular players will increase their values. Thus the players will tend to have contributions larger than f and as a result, when a group is chosen that does not include the freeloader it will tend to have higher total contribution, which will lead to a decrease in ci values.

Figure 3.12. The average value of ci for populations of n = 100 players with one permanent freeloader, as a function of group size m. Each population was numerically simulated and measured ten times at regular intervals between t = 200,000 and 300,000. We fixed δ = 10,000, and for each m let ui = √(40000mC) − ci, which gives f = 10,000 even when m changes.

In order to determine which tendency will dominate, we need to consider the expected changes of each occurrence. In each round, the freeloader has an m/n chance of being chosen to participate. If we consider the case when all regular players have

f < ci < f + δ/(m − 1),

then with probability m/n the freeloader will be chosen and m − 1 players will increase their value, and with probability (n − m)/n the freeloader will not be chosen and m players will decrease their value. This means the expected change in time of the average value in the population will be

(m − 1)m/n − m(n − m)/n.

If we make an inequality setting this less than 0 and solve for m, we find that this expected change is negative when m < (n + 1)/2. This means whenever the population average increases past the fairpoint, it will tend to decrease back towards it. Thus the population will cluster near the fairpoint, with a slight shift off center due to the freeloader. The regular players play with the freeloader rarely, so any shifts caused by the freeloader are undone in his absence. The expected change in this region is positive when m > (n + 1)/2, in which case the regular players will continue to increase their contributions over time.

However, once the regular players have sufficiently high ci that they contribute an average of more than f + µ, where µ := δ/(m − 1), this will compensate for the freeloader, causing C > Ce, so the players no longer increase their ci even when the freerider participates. This will cause the players to cluster near f + µ. Regular players still all decrease their values whenever they play without the freeloader, but he participates often enough that this can’t compensate for his effect unless players contribute enough to sometimes reach Ce even when paired with the freeloader. When m = (n + 1)/2, the expected motion of players with f < ci < f + µ is zero, which means players will drift back and forth in a type of random walk with soft bounds that increase the probability of moving towards the center if they drift outside of the interval [f, f + µ].
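A short simulation sketch of this experiment is given below; it uses the return function from the caption of Fig. 3.12 but much shorter runs, and the remaining choices (step size, run length, a perfect freeloader with δ = f) are illustrative assumptions:

```python
import math, random

# Sketch of the single-freeloader experiment: n - 1 adaptive players plus one
# permanent freeloader who always contributes c0 = f - delta (here delta = f, a
# perfect freeloader).  The return function b(C) = sqrt(40000*m*C) follows the
# caption of Fig. 3.12, so f = 10000 for every m; the run length is much shorter
# than the runs used for that figure.

n, s, f = 100, 1.0, 10000.0
delta = 10000.0                              # freeloader contributes f - delta = 0

def run(m, steps=30_000):
    R = math.sqrt(40000.0 * m)
    c = [f] * (n - 1) + [f - delta]          # last index is the freeloader
    for _ in range(steps):
        group = random.sample(range(n), m)
        played = {i: c[i] for i in group}
        total = sum(played.values())
        for i in group:
            if i == n - 1:                   # the freeloader never updates
                continue
            oth = total - played[i]
            u = lambda x: R * math.sqrt(x + oth) - x
            c[i] = max((played[i] - s, played[i], played[i] + s), key=u)
    return sum(c[:-1]) / (n - 1)

random.seed(7)
for m in (10, 80):                           # one side of (n + 1)/2 each
    avg = run(m)
    target = f if m < (n + 1) / 2 else f + delta / (m - 1)
    print(f"m = {m}: regular-player average = {avg:.1f}, expected near {target:.1f}")
```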

Figure 3.12 shows the average ci for the non-freeriders in simulations as a function of m. It remains around f up until around m = 50 at which point it transitions. The graph displays a downward curve in this region, since larger group sizes means each player doesn’t need to contribute as much to compensate for the freerider. Because the update rules are symmetric, if we introduce a permanent overcontributor who always contributes a fixed amount c0 = f + d for some d > 0, then all of the above dynamics will occur but in the opposite direction. Regular players will end up near f but

centered slightly to the left when m < (n + 1)/2, will end up near f − d/(m − 1) when m > (n + 1)/2, and will randomly drift between these two values when m = (n + 1)/2.

3.5.1 Multiple Permanent Freeloaders

We also studied the effect of including a second permanent freeloader, which we found required significantly longer to settle into an equilibrium state. As expected, we now observe two transitions, corresponding to the probabilities of having one or both of the unmovable players more than half of the time. The transitions in the average contribution value cavg as a function of the group size m are shown in Fig. 3.13, for n = 100. Using the same approach as before, we can also calculate the specific values of m where these transitions should occur, though the process is more algebraically involved. The probability that a group contains no freeloaders is (n − m)(n − m − 1)/(n(n − 1)). The probability that a group contains one freeloader is 2(n − m)m/(n(n − 1)). The probability that a group contains two freeloaders is m(m − 1)/(n(n − 1)). When regular players have contributions between f and f + µ, they will not be able to offset either freeloader, so the expected change of their contributions will be:

−m · (n − m)(n − m − 1)/(n(n − 1)) + (m − 1) · 2(n − m)m/(n(n − 1)) + (m − 2) · m(m − 1)/(n(n − 1)),

due to the number of regular players who move in each of the three situations. This simplifies to

−(2m/(n(n − 1))) (m² + (1 − 2n)m + (n² + n − 2)/2),

which is negative when

m < m1c := ((2n − 1) − √(2n² − 6n + 5))/2.

Let µ1 := δ/(m − 1), which is the amount players need to overcontribute on average to compensate for a single freeloader, and µ2 := 2δ/(m − 2), which is the amount players need to overcontribute on average to compensate for two freeloaders. Then when players’ contributions are between f + µ1 and f + µ2, they are capable of reaching Ce in groups with one freeloader, but not two. Then the expected change of their contributions will be:

−m · (n − m)(n − m − 1)/(n(n − 1)) − (m − 1) · 2(n − m)m/(n(n − 1)) + (m − 2) · m(m − 1)/(n(n − 1)),

Figure 3.13. The average value of ci for populations of n = 100 players with two permanent freeloaders, as a function of m. In this case, two transitions are observed (see text).

which simplifies to

(2m/(n(n − 1))) (m² − 3m + (−n² + 3n + 2)/2),

which is negative when

m < m2c := (3 + √(2n² − 6n + 5))/2.
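Evaluating these two expressions numerically (a small sketch; both quadratics above share the discriminant 2n² − 6n + 5, and n = 101 is the value used for Fig. 3.13):

```python
import math

# Sketch: evaluate the two transition group sizes for n = 101, the population used
# for Fig. 3.13.  Both quadratics above share the discriminant 2n^2 - 6n + 5.

n = 101
disc = math.sqrt(2 * n**2 - 6 * n + 5)
m_1c = ((2 * n - 1) - disc) / 2
m_2c = (3 + disc) / 2
print(f"m_1c ~ {m_1c:.1f}, m_2c ~ {m_2c:.1f}")    # roughly 30 and 72
```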

For Fig. 3.13 (n = 101) these yield m1c ≈ 30 and m2c ≈ 72. Note that at each transition, the group average increases by µ1 or µ2 respectively; however, since these are not constant with respect to m, the group does not need to contribute as much to offset a freeloader as m increases. Gore et al. [61] find that the yeast Saccharomyces cerevisiae displays behaviors that match some of the behaviors we see in our model. Cells in the presence of complex sugars will produce the enzyme invertase and release it into their surroundings in order to metabolize the sugars into glucose before absorbing them. This invertase is costly to produce, and diffuses in the region around each cell, such that cells near each other share the benefits of others. This acts like a nonlinear public goods game with variable contributions and subgrouping according to spatial proximity, although this grouping is nonrandom. Each cell can produce a variable amount of invertase to contribute to the cells in its immediate vicinity, and the fitness gain by cells is nonlinear in the amount produced, as there is a finite amount of sugars that can be metabolized and absorbed so excessive amounts of invertase lead to diminished benefits.

71 Additionally, they observe that some mutant cells do not produce invertase, and act as permanent freeloaders. They find that when the contributing cells encounter mutant cells, they increase their production of invertase in response to the resulting low glucose levels, which compensates for the lack of production by others and helps maintain the desired invertase levels in their neighborhood. This is very similar to what happens in our model.

3.6 Discussion

In this chapter we have presented an agent-based approach to the public goods dilemma in which randomly chosen subgroups of players follow an algorithmic learning mechanism to find the optimal point of a nonlinear return function. In this random subgroup approach, each player has a variable contribution level (unlike the All or None contribution in the standard public goods game), and shifts this contribution in a predefined way in the direction of best response. The contribution levels of the overall population thus adjust over time, as the players participate in various subgroups. We fix the subgroup size at m, and observe that for any value less than the size of the population itself, the total group contribution evolves towards the socially optimal value Ce, while all individuals move towards sharing equally in contributing to this. The approach of each player to this fairpoint f = Ce/m is independent of their original random starting values, and is statistical in nature; the players cluster around the fairpoint, but in most cases never remain on it. Our approach demonstrates how random association into subgroups and incremental learning can lead to fairness in shared contributions, despite the players not having any explicit preference for fairness. Players only care about their own utility, and make local decisions to improve their utlity based on their interaction with the group σ in which they find themselves at any given moment. Undercontributors will tend to be part of subgroups that undercontribute as a group, while overcontributors will tend to be part of subgroups that overcontribute, an effect which leads all players closer to the fairpoint. We could summarize this as saying that when the environment is fluctuating and unpredictable, the most consistent factor in any player’s group is the player itself. Just as nonlinearity is sufficient to enable cooperation in public goods games without the need for punishment or other forces aside from a player’s own direct payout from the game, we demonstrate how playing the public goods game in random subgroups is sufficient for fairness to emerge, despite the lack of an explicit incentive towards fair

behavior. This concept could potentially be useful in explaining the emergence of fair behavior in nature and in human behavior without models that require players to prefer or even be aware of the concept of fairness [58]. Future research could be done investigating how robust the behavior in our model is to alterations in the model’s systems, such as changing the players’ learning rule to move depending on the steepness of the gradient of best response. We have also chosen a single fixed value m in each population, such that all players participate in groups of the same size at all times. However this should be relaxed to a distribution of this important variable, allowing the subgroup association to occur on multiple scales. Additionally, the role of random subgroup association and fairness could be investigated in other games, seeing if there was a similar distinction between cases when m = n and m ≠ n. Finally, the stability of this fairpoint to multiple permanent freeloaders should be tested, with a distribution of differing (low) contributions.

Chapter 4 | Population dynamics in a rock paper scissors model with restricted strategy transitions

4.1 Introduction

4.1.1 Win-Stay Lose-Shift

What happens in a game with more than two strategies, in which individual players are each restricted to changing between two of those strategies? The idea for this chapter originated in a simple card game created for a lecture demonstration of evolutionary games by A. Belmonte. In the game, all members of the audience are given a single card which has been printed with two nonidentical strategies of Rock-Paper-Scissors (RPS) on either side. Following what is essentially a "Win-Stay / Lose-Shift" strategy, each person holds their card with one side face up. Participants then group into pairs and play one round of the game together, each playing the strategy represented by the face up side of their card. Players who lose then flip their card over so that their other strategy is face up, while players who win or tie maintain the same strategy. After playing one round together, a pair splits up and each person goes to find someone else to pair with for the following round. Within the mathematical field of strategic game theory, much research has focused on learning dynamics in repeated games involving simple rules for changing strategies in response to feedback. In the repeated Prisoner’s Dilemma, the strategy Tit for Tat, which repeats the action done by its opponent in the previous round, tends to perform well in a diverse population, even out-competing more complex strategies [62,63]. It performs

well in large part because it mutually cooperates with copies of itself, creating stability in populations with a high frequency of this strategy. Additionally, it will retaliate against defectors, which discourages this behavior, or at the very least provides a higher payoff than cooperating with them. However, the strategy has some weaknesses that lead to it being unstable in certain circumstances. For example, if mutations occur in the population dynamics it can be slowly replaced by more naive cooperative strategies, which it will also cooperate with. This in turn allows pure defectors to thrive and gain a foothold in the population. Additionally, Tit for Tat performs poorly in game variants with some form of "trembling hand". Suppose during each round of play, each player chooses a strategy, and plays it with probability 1 − ε, but plays the other strategy by mistake with probability ε. When two Tit for Tat players play together, they begin by mutually cooperating, however as soon as a mistake occurs the other player retaliates by defecting in the next round, which causes a new paradigm of alternating defections and cooperations. The next mistake might cause both to cooperate and re-enter the cooperating paradigm, or it might cause both to defect and enter a mutual defection paradigm, which will continue until the next mistake. In the long-run, each player will receive each of the four possible payoffs in the prisoner’s dilemma equally often, yielding a lower average score for both than mutual cooperation.

75 they were not retaliated against, making populations of Win-Stay Lose-Shift players less vulnerable to invasion by outsiders. [66,67].

4.1.2 A Biological Basis for a Restriction to Two Strategies

In this chapter, we consider the idea of individuals who have some capacity to learn and change their strategy, but have a partial restriction to a subset of all possible strategies. Biology provides a basis for why such a restriction might apply to individuals. For instance, if one species of animal is capable of eating nuts or berries, and another species is capable of eating berries or meat, then if we construct a game where strategies correspond to choosing which type of food to pursue then individuals from the two species can be seen as participating in the same game, but each with a restriction on which strategies they are able to pursue based on unchangeable biological constraints. Individuals could change strategies in response to previous events, while still being the same individual of the same species, rather than dying and being replaced by a new generation as replicator dynamics typically assume. For a more concrete example, Sinervo and Lively [68], as well as Bleay et al. [69] study males of the side-blotched lizard Uta stansburiana, which mature into one of three primary phenotypes depending on the amount of testosterone they produce. These phenotypes develop different colors on their throats, as well as different reproductive strategies. Orange-throated males aggressively defend a large territory with many females, blue-throated males defend a smaller territory more carefully, and yellow-throated males sneak through the territories of other males to mate with the inhabiting females and avoid conflict when confronted by mimicking female behavior. The authors represent these phenotypes as strategies in a game with cyclic dominance, as in Rock-Paper-Scissors. Orange-throats outcompete blue-throats by aggressively conquering their territory and access to females, blue-throats outcompete yellow-throats by carefully guarding their territory from infiltrators, and yellow-throats outcompete orange-throats by sneaking through many territories and mating with females that the orange-throats cannot guard as carefully due to the large area each controls. This leads to frequency dependent selection creating a cycle in the numbers of each phenotype over generations. Mills et al. [70] find that these lizards can also develop mixed phenotypes seeming to correspond to alleles mixing two of the primary phenotypes, with throats striped with the two corresponding colors. The lizards appear to display behavior according to Orange-Yellow-Blue dominance in the genes, but with some noticeable differences. Sinervo et al. [71] find that individuals with blue and yellow alleles can change in behavior

from yellow to blue. As the breeding season goes on, some lizards die out, freeing up the territory that they held. This tends to be disproportionately the aggressive orange-throats, which reduces their frequency in the population and increases the fitness of blue-throats. Late in the breeding season, yellow-throats in unclaimed territory that have the blue allele are able to undergo a transformation that increases testosterone production, and alters their behavior and appearance to become more like blue-throats. They then claim territory and act as a blue-throat, taking advantage of the lower frequency of orange-throats later in the season. However, the authors find this ability to change phenotypes is limited, as it cannot be reversed, and none of the other genotypes can transform. This example shows a biological basis for why individuals might be restricted to two strategies in particular, as genes carry two alleles that could be important for determining strategic behavior. Although Win-Stay Lose-Shift is not the best description of the transformation here, it can be seen as a learning mechanism in other animals.

4.1.3 A Biological Basis for Win-Stay Lose-Shift

In some sense, Win-stay Lose-shift can be viewed as a discrete and short-memory version of reinforcement learning or instrumental conditioning (hence its alternate name "Pavlov"), so its presence in instinctive behavior is expected. Many studies have observed this kind of behavior in animals. For example, Chalfoun and Martin [72] study the Brewer's sparrow (Spizella breweri) and their nesting habits. They find that nesting locations vary along many dimensions, such as shrub height and density, and that sparrow couples seem to enact a win-stay, lose-shift strategy in choosing these characteristics: being more likely to change them from their previous nest if the nest was predated upon compared to if it was successful. McCoy and Platt [73] study risk-seeking behavior in rhesus macaques, giving them a choice between a safe option with a medium reward of juice, or a risky option with a chance of a small reward or a large reward. The authors primarily focus on neural activity in the monkeys based on the outcomes, but they also find that the monkeys are more likely to choose the risky option again after receiving a large reward compared to a small reward, which corresponds to win-stay lose-shift behavior. The same behavior can also be observed in humans. Hayden and Platt [74] perform an experiment on humans similar to that of McCoy and Platt, using Gatorade instead of juice, and find that humans also have a tendency to use a Win-Stay Lose-Shift strategy: they are more likely to continue choosing the riskier strategy after receiving a good outcome from

it. Worthy et al. [75] examine human behavior in the Iowa Gambling Task, a game in which participants repeatedly choose one of four decks to draw cards from, and receive negative or positive scores based on each card. Two of the decks have many cards with high positive payoffs, but enough low payoffs that drawing from them yields negative expected value. The other two decks have cards with low positive payoffs, but smaller or rarer negatives such that drawing from them yields positive expected value. They analyze previous literature on the Iowa Gambling Task, and compare several decision-making models used in that literature to Win-Stay Lose-Shift. They then perform their own experiment with human subjects, and find that, of the models they consider, Win-Stay Lose-Shift provides the best fit to the behavior of about half of the subjects in their study, and another called Prospect Valence Learning, where players keep track of their expectancy for each deck with some distortions such as scope-insensitivity, best fits the behavior of the other half. However, Win-Stay Lose-Shift behavior is not universal. Olton and Schlosberg [76] find that rats seem to show the opposite behavior: a Win-Shift strategy. They placed rats in a maze with several paths, put food in some of the paths, and had the rats run through the maze multiple times, replacing food according to different protocols to reward different strategies. They find that the rats more easily adapt to protocols that reward Win-Shift behavior, where food is placed in different paths each round, compared to Win-Stay protocols, where food is placed in the same paths. They also find that when food is replaced such that every choice results in an equal reward, rats still prefer a Win-Shift strategy. They speculate that this behavior is due to natural foraging behavior in rats, where scavenged food will be exhausted and exploratory behavior is more likely to yield success. Means [77] find that this tendency in rats to prefer win-shift strategies does not apply in all situations. They conduct experiments with a water-escape scenario, where a maze is partially flooded and rats must find an elevated platform in order to escape. Rats were either trained in win-stay trials, where escape platforms were placed in the same locations in succession, or win-shift trials, where escape platforms were placed in opposite locations. They find the rats were more likely to learn the win-stay behavior than the win-shift behavior. They also find that when rats did not learn the correct behavior, they instead "perseverated", retracing their steps to check the same locations in the same order as they did in their previous trial, regardless of the final location of the platform. This is sort of a Win-Stay strategy if the entire maze is counted as a single strategy,

given that this exact sequence of turns did eventually lead to an escape last time. But it is distinct from the study's defined Win-Stay response, which would be to remember the actual location of the platform in the previous round and head directly there.

4.1.4 Our Model

We wish to create a model implementing Win-Stay Lose-Shift dynamics on a game with more than two strategies, in particular the game Rock Paper Scissors. This immediately creates an issue of ambiguity. When a player decides to stay, they should make the same choice they did previously, which is unambiguous. When a player decides to shift, they have multiple strategies which they did not play previously and to which they could shift. All of the game theory models of Win-Stay Lose-Shift that we've discussed involve games with two strategies. The experimental models occasionally have more than two possible choices, but their analysis groups alternative choices into a "shift" category and does not specify which of these alternative choices is chosen. This is appropriate when these choices are not meaningfully distinct, such as in a maze where one path leads to a reward and all other paths do not. However, it does not work in a game theoretic model where the best strategy will depend on what strategy the opponent is playing. Here, such a decision must be made explicitly. There are multiple different ways this can be decided, and therefore multiple different versions of Win-Stay Lose-Shift that might be implemented. For example, players could shift between strategies in a periodic orbit, moving to the next strategy each time they lose; players could shift to copy the opponent that beat them; players could shift to the best response to their most recent opponent; or players could choose randomly between strategies with some prespecified probabilities. Players could also switch to copy the strategies of opponents in a larger population that have higher average payoffs (which, if implemented with the right parameters in a continuous population, would yield dynamics identical to the replicator equation). Each one of these, and other possible implementations, would lead to different dynamics in many circumstances, despite all being reasonably described as "Win-Stay Lose-Shift". Our model makes this decision by using the idea of restriction to subsets of strategies in a manner analogous to the previously described demonstration using cards. We initialize a heterogeneous population where each player has a fixed card including only two of the three possible strategies in the game we study, Rock Paper Scissors, only one of which is active at a time. Each player then implements Win-Stay Lose-Shift by

switching between the two strategies on their card each time they lose. This learning rule is simplistic: players have barely any memory and change their decision based entirely on the results of their most recently played game. However, it still leads to interesting results when the population as a whole is considered.

4.2 Discrete Model

We consider a stochastic dynamical system consisting of a game, a population of players, and a "card" assigned to each player containing two distinct pure strategies from the game. Let G be the game "Rock, Paper, Scissors" with the symmetric payoff matrix

                Player 2
                 R         P         S
Player 1   R    0, 0     −1, 1     1, −1
           P    1, −1     0, 0    −1, 1
           S   −1, 1      1, −1    0, 0

Let P be a finite population of players of size N. Each player begins with a card containing two of the three possible strategies in {R,P,S}, one of which is "face up", representing the player's currently planned active strategy, and the other "face down", representing a strategy in reserve that the player knows how to play but is not currently using. At each time step, two players are chosen uniformly at random from the population to play against each other. Each player plays the strategy that is currently face up on their own card and receives the corresponding payoffs from the game. Finally, players update their strategy according to a "Win-stay, Lose-shift" rule: if the player receives a 1 or 0, indicating a win or tie, he does not change his active strategy; if he receives a −1, indicating a loss, he flips his card, switching the places of the face up and face down strategies. In this way, players adjust their strategies in an attempt to avoid exploitation, but each player only ever plays two of the three strategies available in the game. Figure 4.1 shows an illustration of this rule occurring for one time step.

Figure 4.1. Illustration of the population dynamics for one time step

Note that there are three possible cards a player can have, corresponding to each of the three subsets of {R,P,S} of cardinality two: {R,P}, {R,S}, and {P,S}. In our model, the type of card a player has does not change during play, so one could consider the population to consist of three species of player which interact but do not increase or decrease in size. Alternatively, if we consider both the card of a player and its current facing, we can consider the population to consist of six phenotypes of player corresponding to each combination of card and facing. We denote each of these six groups by the letter corresponding to the face up strategy followed by the face down strategy (e.g., players with S face up and R face down belong to the group SR). The six groups are RP, PR, SP, PS, SR, and RS. It then follows that each of these groups can change in size when a card is flipped (e.g., players in RP can change to PR and vice versa). Let #PR(t) be the number of players in state PR at time t, and define #SR(t), #PS(t), etc. analogously. Define

A = (#PR(t) + #RP(t))/N,   B = (#SR(t) + #RS(t))/N,   C = (#PS(t) + #SP(t))/N,

to represent the fraction of players with each card {R,P}, {R,S}, and {P,S} respectively. Note these are determined by the initial conditions and are constant with respect to time. We immediately get A + B + C = 1, since the sets they count form a partition of all players. We can visualize the set of possible states S as a bounded lattice in R^3 with each coordinate of the state space corresponding to the fraction of the population currently in a particular orientation. At each time t, let

x(t) = #PR(t)/N,   y(t) = #SR(t)/N,   z(t) = #PS(t)/N,

represent the fraction of players currently in states PR, SR, and PS respectively. The remaining group sizes can then be written in terms of these variables, e.g. #RP(t)/N = A − x(t). We then have the constraints 0 ≤ x ≤ A, 0 ≤ y ≤ B, and 0 ≤ z ≤ C. In what follows we often choose A = B = 1/2, C = 0, which means the dynamics occur in the plane 0 ≤ x, y ≤ 1/2. Using this x, y, z we can define the natural isomorphism Ψ: S → R^3 which sends each state to the point with coordinates (x(t), y(t), z(t)). This then induces an isomorphism

between the dynamical system and a weighted random walk on a lattice in R^3. At each time step, at most one player will flip her card, which will correspond to exactly one coordinate either increasing or decreasing by 1/N. Thus the system will enact a random walk with the probability of the system stepping in any direction determined by the probability of the appropriate players being chosen to play.
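As a concrete illustration, the following is a minimal agent-level sketch of these dynamics. It is not the code used for the simulations reported in this chapter; it assumes Python with numpy, and all function and variable names are illustrative. Each player holds a two-sided card, two players meet at random each time step, and only the loser flips.

import numpy as np

BEATS = {("R", "S"), ("S", "P"), ("P", "R")}   # (winner, loser) pairs in Rock-Paper-Scissors

def simulate(cards, steps, seed=0):
    """cards: list of [face_up, face_down] pairs, one per player; the loser of each game flips."""
    rng = np.random.default_rng(seed)
    n = len(cards)
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)   # two distinct players are chosen
        a, b = cards[i][0], cards[j][0]               # each plays its face-up strategy
        if (b, a) in BEATS:                           # player i lost: flip player i's card
            cards[i].reverse()
        elif (a, b) in BEATS:                         # player j lost: flip player j's card
            cards[j].reverse()
        # on a tie (a == b) neither card changes
    return cards

# Example: N = 8 with A = B = 1/2 and C = 0, started at x = y = 1/4 as in Figure 4.2.
population = [["P", "R"], ["P", "R"], ["R", "P"], ["R", "P"],
              ["S", "R"], ["S", "R"], ["R", "S"], ["R", "S"]]
simulate(population, steps=1000)
print([card[0] for card in population])               # face-up strategies after 1,000 games

Tracking the face-up strategies over time in such a run recovers the x and y trajectories analyzed below.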

Let φx+(t) be the probability that at time t, x will increase by 1/N in the next time step. This is simply the probability that an RP player is chosen and loses, switching to PR. Making analogous definitions for other variables, we get

φx+(t) = 2γ(A − x)(x + z)

φx−(t) = 2γx(y + C − z)

φy+(t) = 2γ(B − y)(x + z)

φy−(t) = 2γy(A − x + B − y)

φz+(t) = 2γ(C − z)(A − x + B − y)

φz−(t) = 2γz(y + C − z)

where γ = N/(N − 1). Each equation corresponds to a specific phenotype of player flipping their card, and is thus derived by calculating the probability of choosing one player of that phenotype, and one player from either of the two phenotypes that cause the first player to lose. The factor of 2 appears because order does not matter in the selection, and the γ appears due to selection without replacement. We use φ∗(t) to refer to an arbitrary one of these six functions. Note that these six probabilities sum to less than 1, since there is also some probability that the players will tie, resulting in no change to the current state.

Proposition 13 If the population contains at least one of each card type, the system has no absorbing sets.

Proof: We first consider x. If φx+ > 0 then there is a nonzero probability that x will increase in the next time step. Additionally, φx+ = 0 iff x = A or x = z = 0. If x = A then x is at its maximum value. If x = z = 0 then A > 0 and C > 0 imply φz+ > 0, so there is a nonzero probability that in one time step z will increase, and in the second step x will increase since z increasing makes φx+ positive. Thus, from any state where x is not at its maximum value of A, there is a nonzero probability of x increasing within two time steps.

Likewise, if φx− > 0 then there is a nonzero probability that x will decrease in the next time step. φx− = 0 iff x = 0 or both y = 0 and z = C. If x = 0 then x is at its

minimum value. If y = 0 and z = C then B > 0 and C > 0 imply φy+ > 0, so there is a nonzero probability that in one time step y will increase, and in the second step x will decrease since y increasing makes φx− positive. Thus, from any state where x is not at its minimum value of 0, there is a nonzero probability of x decreasing within two time steps. Since the game matrix, and thus the dynamical system, is symmetric via a cyclic permutation, this suffices to show that all 3 variables have a nonzero probability of increasing or decreasing within at most two time steps from any state, as long as doing so does not exceed the boundaries of the lattice. Thus, the system is transitive and has no absorbing sets. □

Proposition 14 The dynamical system will eventually reach a fixed point iff the population does not contain at least one of each card type.

Proof: (⇒) If the population contains at least one of each card type, Proposition 13 applies, so the system has no absorbing sets and never settles at a fixed point. (⇐) Suppose WLOG C = 0. This automatically forces z = 0; there are no PS or

SP players. Then the state s0 with Ψ(s0) = (0, 0, 0) is a fixed point. All players in this state have RP or RS cards, and thus every game results in a tie of Rock against Rock. Further, for any state s with x = 0 (there are no PR players), RS and RP players cannot lose, so all φ∗ are zero except possibly φy−, which will be positive iff y > 0. Thus, the set {s | x = 0} is an absorbing set which will eventually end up in s0 with probability 1. There are no other absorbing sets, so the state will randomly change until eventually it enters the absorbing set {s | x = 0} by chance, and shortly afterwards it will reach s0. □ It should be noted that we chose to set the threshold for staying to be greater than or equal to 0 points, which results in players staying on the same strategy in a tie, and causes a monoculture population to be a fixed point. If instead players shifted strategies when tying against a player playing the same strategy, then this extinction behavior would not occur.

4.3 Extinction in a Restricted Transition Population

We now restrict ourselves to considering the case when C = 0, which means z = 0 and

A + B = 1. By substituting 1 − A for B, the φ∗ equations simplify to

φx+(s) = 2γx(A − x)

φx−(s) = 2γxy

φy+(s) = 2γx(1 − A − y)

φy−(s) = 2γy(1 − x − y)
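These simplified probabilities translate directly into a state-level random walk on the (x, y) lattice. The sketch below is illustrative only (it assumes Python with numpy and is not the simulation code used for the figures); it advances the state one game at a time using exactly these four expressions, with the leftover probability corresponding to a tie.

import numpy as np

def step(i, j, N, A, rng):
    """One time step of the lattice walk; the state is (x, y) = (i/N, j/N), with C = 0."""
    gamma = N / (N - 1)
    x, y = i / N, j / N
    p = [2 * gamma * x * (A - x),        # phi_x+ : x increases by 1/N
         2 * gamma * x * y,              # phi_x- : x decreases by 1/N
         2 * gamma * x * (1 - A - y),    # phi_y+ : y increases by 1/N
         2 * gamma * y * (1 - x - y)]    # phi_y- : y decreases by 1/N
    p.append(1 - sum(p))                 # remaining probability: the game is a tie, no move
    di, dj = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)][rng.choice(5, p=p)]
    return i + di, j + dj

rng = np.random.default_rng(1)
N, A = 40, 0.5
i, j = N // 4, N // 4                    # start at x = y = 1/4
for _ in range(100_000):
    i, j = step(i, j, N, A, rng)
print(i / N, j / N)                      # the final (x, y) after 100,000 games

Accumulating the lattice points visited by such a walk over a long run is what produces the residence-time heat maps discussed in Section 4.4.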

Simulations of the system with these parameters demonstrate that while extinction, in which everyone ends up playing Rock, does eventually occur, it takes longer to happen the larger N is. Figure 4.2 shows a trajectory of the x and y coordinates of one simulation of the system with N = 8, A = 1/2, B = 1/2, initialized with x = 1/4, y = 1/4. Although some simulations wander around near the center for some time, all of them eventually hit a state where x = 0, at which point φx+ = 0 and x stays at 0 forever, with y = 0 occurring shortly afterwards. Figure 4.3 shows the average time to extinction T on a semi-log plot, measured for various population sizes with A = 1/2, B = 1/2, and averaged over many trials (ranging from 10,000 trials for the smallest populations down to 20 for the largest, as N increases and simulation time costs increase).

Figure 4.2. Trajectory of one simulation for N = 8

We see that T appears to grow exponentially. A regression done on this data yields T(N) ≈ 5.2(1.4)^N with correlation coefficient r = 0.9939. Thus, for large N we have plenty of time to study the dynamics of the system in detail prior to extinction. We speculate that the base of the exponent may in fact be √2. Since we set A = B = 1/2, each time N increases by 2 the number of lattice points available to x and to y each increases by 1, making the lattice finer, and thus random drifts less significant in absolute terms. Given that x reaching zero is the precondition for extinction, this would suggest that adding one lattice point in the x direction

causes the extinction time to double. However we leave the proof of this claim for future research.

Figure 4.3. The average time for strategies played to reach a monoculture (average extinction time) as a function of the total number of cards N, with A = 1/2, B = 1/2, C = 0.
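The exponential trend can be checked directly. The sketch below is again illustrative rather than the original code (it assumes numpy); it measures average extinction times for a range of small N and fits log T against N by least squares, so that the exponential of the fitted slope estimates the base of the growth, which the regression above puts near 1.4.

import numpy as np

def time_to_extinction(N, A=0.5, seed=None):
    """Run the lattice walk of Section 4.2 until x first hits 0; return the number of games played."""
    rng = np.random.default_rng(seed)
    gamma = N / (N - 1)
    i, j = round(A * N) // 2, round((1 - A) * N) // 2     # start at x = A/2, y = B/2
    t = 0
    while i > 0:
        x, y = i / N, j / N
        p = [2 * gamma * x * (A - x), 2 * gamma * x * y,
             2 * gamma * x * (1 - A - y), 2 * gamma * y * (1 - x - y)]
        p.append(1 - sum(p))                              # probability of a tie (no move)
        di, dj = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)][rng.choice(5, p=p)]
        i, j, t = i + di, j + dj, t + 1
    return t

Ns = np.arange(8, 21, 2)
T = [np.mean([time_to_extinction(N, seed=s) for s in range(100)]) for N in Ns]
slope, intercept = np.polyfit(Ns, np.log(T), 1)           # fit log T = intercept + slope * N
print("estimated base of the exponential growth:", np.exp(slope))

Whether the fitted base settles at √2 as N grows is exactly the open question raised above.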

4.4 Continuous Models

Simulations of actual trajectories of this random process are difficult to display visually, since it moves on a lattice and ends up overlapping its own path many times. To capture the essential aspects of this system, we plot the residence times at each lattice point as a heat map, shown in Figure 4.4(a) for N = 40, with A = 1/2, B = 1/2, C = 0 after t = 10,000,000 iterations. To generate this, we simulate the system for t steps, track the number of times the system has visited each (x, y) coordinate, divide by t to get the frequency, and then plot it using a fixed color gradient. The resulting heat map shows the system is most often in a particular spot somewhere around x = 3/5, y = 2/5, and spends less time the further away from this point it is. Simulations run using different N show a similar heat map, centered at the same x and y values despite those corresponding to different absolute values for #RP and #RS. As N gets larger, the plots have a higher resolution in the (x, y) plane, spending less time in each particular state, and also showing a tighter spread around this center point. Figure 4.4(b) shows a heat map using the same process for N = 200, with an adjusted heat gradient to account for the finer lattice, and thus less time spent on each individual point.

Figure 4.4. Heat map of the discrete system on a grid for (a) N = 40; (b) N = 200.

Simulations also show that the system seems to move around this center point in a roughly counterclockwise orbit more often than other possible directions. We can quantify the existence of these behaviors by considering the following system:

∆x = φx+ − φx− = 2γ(Ax − x² − xy)
∆y = φy+ − φy− = 2γ(x − Ax − y + y²)

This shows the expected changes in x and y from any particular state. Our system will follow the direction of these equations in expectation, but will diverge from them at random due to the stochastic nature of the system. However, the larger N is the smaller the relative change each single step makes, so the less noisy the system becomes. We can make a deterministic, continuous version of this system, given by the ODE system:

ẋ = 2(Ax − x² − xy)

ẏ = 2(x − Ax − y + y²)

This has two fixed points: one at x = 0, y = 0, and one at

x0 = (3A − 2 + √(5A² − 8A + 4))/2,   y0 = A − x0,

which we compute by setting ẋ and ẏ equal to 0 and solving them as simultaneous equations. When A = 1/2 this yields x0 = (√5 − 1)/4 ≈ 0.31 and y0 = (3 − √5)/4 ≈ 0.19. This corresponds to an equilibrium in which 31% of players play Paper, 19% play Scissors, and 50% play Rock. This differs from the Nash Equilibrium of the game, in which every player plays each strategy with probability 1/3, because every player in the population has Rock on one side of their card, while only half possess each of Paper and Scissors. As

a result, Rock cards show up more often relative to their win-rate, which in turn causes Paper to have a higher win-rate and Scissors to have a lower win-rate. The Jacobian of this system at the central fixed point has characteristic polynomial λ² + (1 − 2A + 3x0)λ + (2x0² + 2x0 − 3Ax0), whose eigenvalues have negative real part for all 0 < A < 1. Thus, this fixed point is stable in the continuous model. The fixed point (0, 0) is unstable. Figure 4.5 shows trajectories for several simulations of this deterministic system with various initial conditions, all converging towards the central fixed point.

Figure 4.5. Trajectory for the deterministic system, with equally spaced initial conditions around the plane, for A = 1/2.

Figure 4.6. Parameterized curve of the x and y coordinates of the equilibrium point as functions of A, with A = 0 at (0, 0), and A = 1 at (1, 0)
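The location and stability of this equilibrium are easy to check numerically. The following sketch is illustrative only (it assumes numpy; the partial derivatives are taken directly from the ODE system above) and evaluates the fixed point and the eigenvalues of its Jacobian.

import numpy as np

def fixed_point(A):
    x0 = 0.5 * (3 * A - 2 + np.sqrt(5 * A**2 - 8 * A + 4))
    return x0, A - x0

def jacobian(A):
    """Jacobian of (xdot, ydot) = (2(Ax - x^2 - xy), 2(x - Ax - y + y^2)) at the interior fixed point."""
    x0, y0 = fixed_point(A)
    return np.array([[2 * (A - 2 * x0 - y0), -2 * x0],
                     [2 * (1 - A), 2 * (2 * y0 - 1)]])

A = 0.5
print(fixed_point(A))                      # approximately (0.309, 0.191)
print(np.linalg.eigvals(jacobian(A)))      # a complex pair with negative real part: a stable spiral

The complex conjugate pair is consistent with the roughly counterclockwise circulation visible in the stochastic simulations.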

In the limiting case when N → ∞, the stochastic system will approach the continuous system in behavior. More formally, we relax the assumption that C = 0, and fix

A, B, C ∈ [0, 1], and some initial state s ∈ [0,A] × [0,B] × [0,C]. Let XN be the dynamical system with population N and card ratios AN, BN, CN equal to the rational numbers with denominator N that most closely approximate A, B, C. Choose initial

state sN such that Ψ(sN) most closely approximates s. Let fN(t) = Ψ(st,N), where st,N is the state with the highest probability of being reached in XN starting from state s and iterating for ⌊Nt⌋ steps. Then limN→∞ fN(t) exists, and corresponds to a continuous dynamical system governed by

ẋ = φx+ − φx− = 2(A − x)(x + z) − 2x(y + C − z)
ẏ = φy+ − φy− = 2(B − y)(x + z) − 2y(B − y + A − x)
ż = φz+ − φz− = 2(C − z)(A − x + B − y) − 2z(C − z + y)

(Note that γ goes to 1 in the limit, so it no longer needs to be included.) However this system is difficult to work with, aside from noting that it has one fixed point somewhere in the interior, which in the symmetric case when A = B = C = 1/3 will occur at (1/6, 1/6, 1/6). So we once again restrict ourselves to the case where C = 0, noting that the continuous dynamical system this converges to is the same one we defined earlier. Since the flow rates in the continuous model are proportional to the difference in probability of motion in the discrete model, this means the discrete model will be a weighted random walk with a tendency of moving towards this same fixed point. On average the system will travel in the same direction at the same rate as the continuous model, but will wander somewhat randomly along the way. For large N this noise will be small and the system will stay very close to the equilibrium point, but for small N it will wander far. If at any point the stochastic system reaches a state with x = 0, it will have entered an absorbing set in which PR players are extinct, and from there will shortly reach the extinction point at (0, 0). Meanwhile the continuous model will never go extinct unless it begins in a state with x = 0. This also matches intuitively with the observation that extinction times increase exponentially with N. The continuous system acts like a population with infinite N, so also has infinite extinction time. In general it is possible to convert a stochastic discrete system on a lattice into a continuous deterministic system by having each variable change according to the expected values of its probabilities, as we have done above. It is also possible to instead convert it into a stochastic continuous system that maintains its noisy wandering by using Fokker-Planck equations. For the discrete model with C = 0, let px,y(t) be the probability that the population state is in the position (x, y) at time t. Then px,y(t) will evolve approximately according to the master equation

d/dt p_{x,y}(t) = 2γ [ −(x(A − x) + xy) p_{x,y}
        − (x(1 − A − y) + y(1 − x − y)) p_{x,y}
        + (x − 1/N)(A − x + 1/N) p_{x−1/N, y} + (x + 1/N) y p_{x+1/N, y}
        + x(1 − A − y + 1/N) p_{x, y−1/N} + (y + 1/N)(1 − x − y − 1/N) p_{x, y+1/N} ]

Effectively, this acts as a conservation law tracking the probability mass of what state the system is in. The two terms with px,y as a factor have negative sign, and correspond to the probability that the system is already in state (x, y), and then a player loses, causing the state to shift away. The remaining four terms have positive sign, and correspond to the probability that the system is in a state immediately adjacent to (x, y), and then the appropriate phenotype of player loses, causing the state to shift into (x, y). Note however that while this model retains the discrete states for x and y, the time t is a continuous variable, corresponding to games being played randomly at an average rate of once per unit of time. It is possible to treat this as its own deterministic dynamical system with (NA +

1) · (NB + 1) equations and variables corresponding to each possible px,y and simulate it numerically. In the short to medium timescale the resulting values will converge to something much like the heat maps in Figure 4.4, since the value of each px,y corresponds to the probability of being in one of those states. However d/dt p0,0 will be small but positive at all times, and therefore it will slowly absorb all of the probability mass. Thus in the long term all of the other px,y terms will slowly exponentially decay while p0,0 approaches 1. For large but finite N, we can use the master equation to construct a parallel model that approximates the original system but with continuous t, x, and y, given by the Langevin equations

dx = 2γ(x(A − x) − xy) dt + (1/√N) √(2γ(x(A − x) + xy)) dW^x_t

dy = 2γ(x(1 − A − y) − y(1 − x − y)) dt + (1/√N) √(2γ(x(1 − A − y) + y(1 − x − y))) dW^y_t

where dW^x_t and dW^y_t are independent Wiener processes. The details of the derivation can be found in Appendix A. In short, we rewrite the p_{x±1/N, y} and p_{x, y±1/N} terms using operators, and take Taylor expansions of those. We then group the first order terms to make a "drift" term, and the second order terms to make a "diffusion" term, which allows the system to wander with magnitude proportional to the fluctuations of the discrete

system. We call this the Fokker-Planck system due to the Fokker-Planck equations used in its derivation. We then simulate the resulting system numerically and explore its behavior. Figure 4.7 shows a heatmap trajectory of the original system towards the central equilibrium point for a large value of N, and compares it to a trajectory of the Fokker-Planck system with the same initial conditions and parameters.

Figure 4.7. Individual system trajectories for total population size N = 800 cards, with the same initial conditions comparing (a) the discrete stochastic system and (b) the Fokker-Planck system.
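A direct way to simulate the Langevin system is an Euler-Maruyama discretization. The sketch below is illustrative only and assumes numpy; the step size dt and the clipping to the domain are choices made here for the sketch rather than taken from the original simulations.

import numpy as np

def simulate_langevin(N, A=0.5, x=0.25, y=0.25, steps=200_000, dt=0.1, seed=0):
    """Euler-Maruyama integration of the dx, dy equations above, with C = 0."""
    rng = np.random.default_rng(seed)
    gamma = N / (N - 1)
    path = np.empty((steps, 2))
    for t in range(steps):
        drift_x = 2 * gamma * (x * (A - x) - x * y)
        drift_y = 2 * gamma * (x * (1 - A - y) - y * (1 - x - y))
        diff_x = np.sqrt(2 * gamma * (x * (A - x) + x * y) / N)
        diff_y = np.sqrt(2 * gamma * (x * (1 - A - y) + y * (1 - x - y)) / N)
        dWx, dWy = rng.normal(0.0, np.sqrt(dt), size=2)        # independent Wiener increments
        x = min(max(x + drift_x * dt + diff_x * dWx, 0.0), A)  # keep the state inside [0, A] x [0, B]
        y = min(max(y + drift_y * dt + diff_y * dWy, 0.0), 1 - A)
        path[t] = x, y
    return path

path = simulate_langevin(N=800)
print(path[-1])      # for large N the trajectory settles near the interior fixed point

For small N the diffusion term dominates, the path eventually hits x = 0, and both terms vanish there, mirroring the absorbing extinction behavior of the discrete system discussed below.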

Figure 4.8 shows a comparison between a heat map of the original discrete system and one of the noisy continuous system after t = 1,000,000 steps. For the Fokker-Planck system, we measure the location of the system at discrete time intervals and count which lattice point it is closest to, so that the two heat maps have the same size and sections that can be compared. These maps look very similar, confirming that the Fokker-Planck system is a good approximation of the original system. Figure 4.9 shows the difference of these heat maps: the frequency of the original stochastic system minus the frequency of the noisy continuous system in each location. The maximum size of these differences is on the order of 5% as large as the measured frequencies, meaning the original system spent about 5% more time in the red region and the Fokker-Planck system spent about 5% more time in the blue region. However, due to the seemingly random spread, this appears to be mostly random noise due to the finite amount of simulation time allowed before extinction occurs. We also observed that for larger values of N and t these differences shrink in relative size compared to the measured frequencies, suggesting that this is random noise or that the Fokker-Planck system becomes a more accurate approximation as N increases.

Figure 4.8. Heat maps for the discrete and continuous systems with N = 40 after 1,000,000 steps

Figure 4.9. Heat map of the difference in the discrete system and continuous system

We also observe that the Fokker-Planck system displays similar extinction behavior to the discrete system. It also wanders noisily around the central point, but for small values of N the diffusion term is large relative to the drift term, making it wander further. Once it reaches a point with x = 0, both the drift and diffusion terms for dx become 0, guaranteeing that x will remain 0, and the system will shortly reach (0, 0) just as the discrete system would. Figure 4.11 shows the trajectory of one simulation of the Fokker-Planck system for N = 8. It displays similar behavior to the system in Figure 4.2, and reaches (0, 0) in approximately the same amount of time. We measured

extinction times for many N and found that extinction times for the Fokker-Planck system also grew exponentially, and were very close to those of the discrete system, shown in Figure 4.10. Extinction times were nearly identical for the Fokker-Planck system for small N, but seemed to grow more slowly as N increased, being around half as long by N = 30. This is somewhat surprising, as we expect the Fokker-Planck model to become a better mimic of the discrete model as N increases. We speculate that this has to do with the proximal cause of extinction, when the system happens to hit x = 0 in its random wandering, being sensitive to even small changes in parameters or wandering behavior. However, the actual explanation for this difference in trends is not immediately clear, and is worth looking into in future research.

Figure 4.10. Extinction Times of discrete system and Fokker-Planck system on a log scale plot, as well as lines of best fit.

4.5 Discussion

The model defined here demonstrates one method of implementing Win-Stay Lose-Shift on games with more than two strategies. We find that the system wanders stochastically around a point corresponding to a central fixed point in the continuous version, with less noise in systems with larger populations. We find the fixed point is influenced by how many players there are with each type of card, and does not correspond to the Nash Equilibrium of the game (except when there are equal amounts of all cards). Additionally, we find that the system will eventually reach a fixed point where all but one strategy is extinct if and only if at least one type of card is completely absent from the population.

Figure 4.11. Trajectory for Fokker-Planck system with N = 8

Future research might analytically derive some of the trends we observe numerically, such as the extinction time of a population in the discrete or Fokker-Planck models, or the spread and shape of the heat maps that center around the fixed point. Additionally, this method of learning dynamics with cards could be introduced to other games in a similar way, which we anticipate would lead to similar dynamics, depending on the specific game chosen. The system would likely wander stochastically around some fixed point, but if the game and cards given to players allowed any states in which certain players could no longer lose, the system would eventually reach that state or set of states by chance and would no longer leave it. The method of restricting players to strategies on cards emulates biological systems with differing phenotypes of organisms with limited but overlapping access to the total strategy space. Organisms which can only perform one strategy are modeled using replicator or other evolutionary dynamics, where strategies only update via death and birth. Organisms which compete only with other organisms of the same phenotype are typically modeled using various learning rules that treat all organisms equally. Here we demonstrate modeling techniques that could be used for shared environments between different species, or phenotypes of the same species, that have overlapping but nonidentical strategy spaces. This could be expanded by considering other learning dynamics for selecting behavior on a restricted subset of strategies, such as keeping track of payoffs and switching to strategies within their subset that score higher on average. This could also be expanded by combining our techniques with long-term birth/death dynamics, where instead of having the number of players of each card type be fixed, we treat each

type of card as a species and allow the population to evolve over time based on the average fitness of each species of card. Players could react to short-term trends in the environment by switching strategies, but long-term trends which made certain subsets of strategies more valuable would lead to changes in the frequency of each card in the overall population.

Chapter 5 | Neural Networks Playing Games

5.1 Introduction

5.1.1 Machine Learning

In this chapter we explore machine learning in neural networks in the context of game theory. Machine learning is a technique where programs develop the ability to perform tasks or solve problems by learning from examples rather than being explicitly programmed by a human designer. One of the first such concepts was put forth by Rosenblatt [78] in 1958 in the form of a perceptron, an artificial neuron that can convert inputs to outputs using a linear function, and can adjust the function in response to feedback from training sets. Later, more complicated networks were made of several such neurons connected to each other. A standard feed-forward network consists of several ordered layers of neurons: an input layer, some number of hidden layers, and an output layer. Each neuron has weighted connections to each of the neurons in the adjacent layers. When data is given to the network, the neurons in the input layer have values representing the data. The data is then fed forward by having each neuron in the next layer receive the weighted sum of all of the values in the neurons it is connected to, based on the strength of those connections. This value is then used by an activation function (typically a step function or sigmoid curve) to create the output of that neuron. This continues forward until it reaches the output layer, which represents the output of the network. During training, the network is given data which has already been labelled with the correct output, or some other mechanism for measuring success. The network takes the input, computes its output, and then compares that to the desired response. If its output was not correct, the weights are then adjusted in response by some backpropagation algorithm, which changes the weights of each connection in proportion to how influential they were in

causing the incorrect output [79–81]. This might be accomplished by gradient descent, which computes the gradient of the error or some related loss function E with respect to the weights of the network, and then adjusts the network weights by −k∇E, where k is a constant that adjusts the learning rate of the network. Higher values of k will cause the network to learn faster and thus reduce training times, but can also cause it to overshoot minima in the error function and hinder its ability to converge. In a feed-forward network, the relevant derivatives corresponding to weights connected directly to the output are simple to calculate directly, while weights in previous layers can be computed using the chain rule. Many terms in these computations are shared between several weights, so rather than compute each term separately for each weight, backpropagation algorithms compute and keep track of values and use them multiple times to avoid redundant calculations. This training process is then repeated multiple times, each repetition changing the network slightly in an attempt to decrease its error. For a finite training set, the loss function can be averaged over all of the data with updates done all at once, which is known as batch training. It can also be computed separately over each individual set of inputs, and then backpropagated in sequence. Because the gradients add linearly, for low k these will yield approximately the same results, although there will be a slight difference because each time the network's weights change its performance on future games will be different. This process is then repeated multiple times. As the network's error decreases, the training process slows. The gradient flattens out if it reaches a local minimum in the space of possible weights, typically causing the network to converge to such a minimum. However, sometimes the network gets trapped in a local minimum which is not a global minimum. If the error function is shaped in a certain way, then networks can end up not training to their fullest potential and getting stuck with a high error because the training algorithm found an area with a basin of attraction in the gradients around it. Stochastic gradient descent is a similar training method which randomly samples the training data and thus causes the network's path through weightspace to be more erratic. Networks might end up trapped in local minima temporarily, but the stochastic training is more likely to wander outside of small basins of attraction, so if a better global minimum is nearby then it is likely that the network will eventually find it. Neural networks, especially large ones, often suffer from a transparency issue, where they are able to perform the task they are intended to perform, but it is difficult for humans to understand how they are performing it. For example, when viewing a photo and attempting to identify if it is a picture of a bird or not, a human might look for

components such as wings, a beak, and a general body shape that she associates with birds. If she finds enough of these subcomponents and they fit together nicely then she concludes there is a bird. An artificial neural network might likewise have neurons that correspond to components of the image that it associates with birds, but these components are not clearly labelled as "wings" or "beak", and likely do not individually correspond to such human-legible concepts. Rather than have a neuron that fires when the picture has wings in it, the network might have a hundred different neurons that play a role in wing identification, each of which represents some collection of pixels and only fires 20% of the time that wings are present. None of these neurons would represent something a human can easily and intuitively identify, and only when included in the whole network do such neurons accomplish their purpose of increasing the probability that the network correctly identifies a bird. The training algorithm, like evolution, is much better at ensuring that the network eventually accomplishes its task than ensuring that it does so in the simplest or most human-legible manner [82,83].

5.1.2 Game theory and adversarial networks

Some research has been done studying neural networks in the context of game theory. Bhatia and Golman [84] study a Bidirectional Associative Memory network, which contains neurons arranged in a bipartite directed graph, with one neuron corresponding to each possible strategy from one of the two players, and each neuron is connected to another with a directed connection if and only if it is a best response to that strategy from the other player. It then iterates through time, like a finite Markov process or cellular automata, with neurons firing or not based on the neuron values in the previous state. The authors show that if the network eventually stabilizes it must stabilize on a state corresponding to a Nash Equilibrium, and they demonstrate its behavior in a number of examples. However their model must construct a new network with fixed connections specifically designed for each game. In some sense, each of these networks does some form of learning as it changes its strategy over time based on the best responses it has recorded until it finds a Nash equilibrium, but a more accurate description would be that it computes a Nash equilibrium from a set of best responses hard-coded into it. Each network is fixed and does not update its connections via backpropagation or any other method, and can only solve a single game that it’s designed to solve. Schuster and Yamaguchi [85] discuss the application of game theory to the behavior of neurons themselves. They model two neurons with a connection between them as players in a coordination game, where each chooses to fire or not fire and receives a

higher payoff if they do the same action as the other. They discuss various aspects of game theory such as equilibria and incomplete information, and how they relate to the neuron behavior, as well as some learning rules for the neurons. Choudhury and Swapan [86] design neural networks that can approximate a game theoretic solution to the allocation of power transmission losses more quickly than conventional methods. Companies generating electric power in an area share the same transmission lines to distribute their power to consumers. During transmission, some power is lost, the cost of which must be borne by the generating companies. However, the power loss over each line is quadratic in the amount of power transmitted, not linear, and therefore the assignment of loss to each company is a nontrivial problem. Several game theoretic approaches produce good solutions to this allocation, but require exhaustive information and lengthy simulations that are not feasible to compute in real time as power usage fluctuates. Choudhury and Swapan take one such approach, and design a neural network that can be trained on solutions generated by the game theory simulations. The network can then generate solutions that approximate this approach, but quickly enough to be usable in real time. However, none of these studies contain multiple networks competing against each other. Research has been done involving competing networks outside of game theory, such as in generative adversarial networks. Two neural networks are constructed in distinct roles: a generator and a discriminator. The discriminator is randomly given data either from some training data, or from the generator, and is rewarded for correctly identifying which source the data came from. The generator is given a source of random noise, and uses that to construct data of the same type as the training data, and is rewarded for successfully tricking the discriminator. Thus, the networks work in an adversarial manner, each trying to outcompete the other since their goals are in direct opposition. But in the long run they strengthen each other, since any easily detected flaws in one will be exploited by the other and cause the flawed network to remove the flaws. This technique not only exposes the discriminator to a wider variety of data to learn from, but allows the unsupervised training of generative networks that can create new content to mimic existing content, since the discriminator obviates the need for a human to manually evaluate the generator's outputs during training [87]. We note that generative adversarial networks are fundamentally asymmetric, each having different types of inputs and outputs that accomplish different tasks. While this means this method is not directly applicable to a game theoretic approach, it still provides a foundation we can use in our approach.

5.1.3 Our model

In this chapter we seek to apply the notion of adversarial networks to game theory by constructing neural networks which play arbitrary 2x2 games against each other. Each network receives the payoffs of the game as inputs, computes a mixed strategy, and plays it against the other network. They then receive the expected value of their strategy given the mixed strategies both played as payoffs. The networks are then updated via backpropagation based on how their received score compares with how well it was possible to do given their opponent's strategy. Both players are trained simultaneously against each other, and thus learn not only general techniques in game theory, but also about the specific opponent they're being trained against. It is not sufficient simply to figure out what the Nash equilibrium of a game is, because if the opponent is an imperfect reasoner in a predictable way, then it is rational to exploit its weaknesses. This is not the same as generative adversarial networks, since neither network is attempting to replicate the input data; instead both networks are placed in symmetric roles as players. However, it still maintains a somewhat adversarial relationship between them, where in some games each network's goals will be in direct opposition to the other's, while in other games their payoffs will be aligned. Thus the two players strengthen each other by exploiting weaknesses and forcing the other player to adapt. The learning process of these neural networks may be thought of as a complex dynamical system which is sensitive to the initial conditions of the networks as well as the games randomly chosen during training. This makes it difficult to prove general theorems that apply to all such networks. We attempt to simplify things by using networks which are smaller than networks typically used in the literature. We then employ techniques from experimental mathematics to detect patterns in the data we present and make conjectures about how these patterns emerge and what they represent in the context of game theory.

5.2 The Model

5.2.1 Construction

We construct neural networks and train them to play arbitrary 2x2 games against each other. Each network is a feed forward fully connected network, with 8 neurons in the input layer, nl hidden layers containing nn neurons in each layer, and 2 output neurons. In particular, a feed forward network consists of a collection of neurons which act as

vertices in a graph, each containing some integer corresponding to their layer. Each neuron in a layer is connected by a weighted edge to every neuron in the previous and subsequent layers, but not those in its own. We typically set nl = 2 and nn = 8, as experimentation with these values suggests this yields a good balance between accuracy and simplicity. We initialize each connection with a weight chosen at random from a Gaussian distribution with mean 0 and variance 1, but the specific mechanism for choosing initial values does not have much importance in the eventual structure of the network. Some networks contain bias weights, which connect to an extra neuron that always outputs one. This allows the network to add constants to the functions and thus adjust the threshold value necessary to fire. Due to the symmetry of the two strategies in the games and our error function, we do not anticipate the need for bias terms in playing games, as typically neurons will be comparing payoffs to other payoffs and not to constant values. Thus we choose not to include bias terms in our networks for the sake of simplicity, though future research might investigate whether their inclusion would lead to interesting results in some way. Two such networks are paired together, one in the role of Player 1, and the other in the role of Player 2. In each round, a random 2 × 2 game is generated, with the 8 values in both players' payoff matrices all chosen uniformly at random from the interval [−1, 1].

                Player 2
                 sC         sD
Player 1   sA    a1, b1     a2, b2
           sB    a3, b3     a4, b4

Table 5.1. Payoffs for an arbitrary 2 × 2 game.

These 8 values are placed in the input neurons of the network, and are then propagated forward through the network, with each neuron using the activation function S(x) = 1/(1 + e^−x). In particular, neuron j receives xj = Σ wi,j vi, summed over all neurons in the previous layer, where vi is the value of neuron i in the previous layer, and wi,j is the weight of the connection between neurons i and j. Neuron j then computes vj = S(xj), and, along with all of the other neurons in its layer, sends this to the next layer. The process is repeated until all neurons compute their values. We then interpret the outputs by having the network choose to play the mixed strategy s = (A/(A+B), B/(A+B)), where A is the value of its first output neuron, and B is the value of its second output neuron. (Due to the symmetry of the backpropagation algorithm we use, A + B ≈ 1 except in unusual circumstances, though this property is

not required for the network to function.) Note that we could get a similar network by using one output neuron and interpreting its output as the probability of strategy 1 and its complement as the probability of strategy 2, but the construction with multiple output neurons can be generalized to larger game matrices more easily, and the behavior on 2x2 games is identical. Both players then play their mixed strategies against each other and receive payoffs according to the expected value of their strategies (rather than instantiating a particular instance stochastically). Each network is then updated via backpropagation according to the error function E(s1, s2) = Pbest(s2) − Pactual(s1, s2), where Pbest is the expected payoff the player could have received by playing the best response to their opponent's strategy, and

Pactual is the expected payoff they receive given the strategy they actually chose against their opponent. Note this error is 0 if and only if the player chooses a best response to their opponent's strategy. This error value is backpropagated through the output neuron corresponding to the best response, and the negative of this value is backpropagated through the other output neuron. For example, if A is the best response, then each weight is adjusted by EδA − EδB. This is possible because the best response is always a pure strategy, except in the case where the player is indifferent between all strategies, which would make the error function 0 for all choices. Note that this error function does not necessarily incentivize players to play Nash equilibria, but instead rewards exploitation of the opponent's mistakes. Further, it is only possible for a network to receive 0 error if it responds perfectly to its opponent in each particular game, which even a perfectly rational player could not do unless they also had the ability to predict their opponent's actions. This means the networks don't just learn about game theory in general, but also about predicting the particular opponent they are playing against, which is also learning and thus changing behaviors at the same time. Figure 5.1 shows a network that has been trained for 1 million games in the process of playing a round of the Prisoner's Dilemma. Circles represent individual neurons, while blue lines represent connections with positive weights and red lines represent connections with negative weights. The bolder the color and thickness of a connection, the greater the absolute value of its weight. The small upper number in each neuron is the sum of the values passed to it from the previous layer for this particular game, x, and the large number in the center is the activation value S(x). In this particular version of Prisoner's Dilemma, strategy 2 corresponds to defection, and the network's output is approximately the pure strategy that chooses this.

Figure 5.1. Network playing the Prisoner's Dilemma. Blue lines are positive weights
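For concreteness, the following is a minimal sketch of the forward pass described in this section. It is illustrative rather than the training code used here, and assumes Python with numpy, nl = 2 hidden layers of nn = 8 sigmoid neurons, no bias terms, and N(0, 1) initial weights, as above; all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def make_network(sizes=(8, 8, 8, 2)):
    """One weight matrix per pair of adjacent layers, weights drawn from a standard Gaussian."""
    return [rng.normal(0.0, 1.0, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def play(network, payoffs):
    """payoffs: the eight entries a1..a4, b1..b4 of a 2 x 2 game; returns the player's mixed strategy."""
    v = np.asarray(payoffs, dtype=float)
    for W in network:
        v = sigmoid(v @ W)              # each neuron applies S(x) = 1/(1 + e^-x) to its weighted input sum
    A, B = v                            # the two output neurons
    return np.array([A, B]) / (A + B)   # normalize to the mixed strategy (A/(A+B), B/(A+B))

game = rng.uniform(-1, 1, size=8)       # a random 2 x 2 game with payoffs in [-1, 1]
network = make_network()
print(play(network, game))              # probabilities of playing strategies 1 and 2

An untrained network like this one plays essentially arbitrary mixed strategies; the error function defined above is what pushes it toward best responses during training.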

5.2.2 Errors

We define the accuracy of a network by summing the score it receives over many games, and dividing it by the total score it could have received over the same games against the same opponent had it always chosen the best response. We then define the error score of a network as one minus this number. This is technically different from the error function we use in training, being a normalized version of it that divides by the maximum possible payoff averaged over many games. Under this measuring method, an untrained network or a player who chooses strategies completely at random will have an accuracy of approximately 0%, or an error of 100%. A network which seeks to minimize its payoff could receive a negative accuracy, and one which always chooses the worst response would have an accuracy of approximately −100%, or an error of 200%. Note that we could adjust this by randomly generating games with payoffs in [0, 1], which would lead to the maximum error being 100% and untrained networks averaging 50%. Networks trained under our usual parameters gradually improve from 0% to about 95% accuracy, or a 5% error rate, at which point they no longer consistently improve, and

gradually fluctuate around this rate as they adjust weights to improve at recently played games but lose accuracy in other games at about the same rate. Note that this does not mean that networks choose the best response 95% of the time, but that they receive 95% of the total possible points averaged across all games. Networks will be incentivized to make changes that increase performance on games with large differences between payoffs, even if such changes reduce their performance on games with smaller payoff differences. Figure 5.2 shows the error of a new network being trained over time.

Figure 5.2. Error of a network over time
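The error score can be written down compactly. The sketch below is illustrative only (it assumes numpy, with payoff matrices laid out as in Table 5.1); it computes the per-game training error E = Pbest − Pactual and the normalized error score over a batch of games, and checks that a uniformly random player scores an error near 100%.

import numpy as np

def expected_payoff(P, s1, s2):
    """Expected payoff to the row player with payoff matrix P when mixed strategies s1, s2 are played."""
    return s1 @ P @ s2

def best_response_payoff(P, s2):
    """Payoff of the best pure response to the opponent's mixed strategy s2."""
    return max(expected_payoff(P, np.array([1.0, 0.0]), s2),
               expected_payoff(P, np.array([0.0, 1.0]), s2))

def game_error(P, s1, s2):
    """Per-game training error E(s1, s2) = P_best(s2) - P_actual(s1, s2)."""
    return best_response_payoff(P, s2) - expected_payoff(P, s1, s2)

def error_score(payoff_matrices, strategy_pairs):
    """One minus accuracy: total received payoff divided by total best-response payoff."""
    received = sum(expected_payoff(P, s1, s2) for P, (s1, s2) in zip(payoff_matrices, strategy_pairs))
    possible = sum(best_response_payoff(P, s2) for P, (s1, s2) in zip(payoff_matrices, strategy_pairs))
    return 1.0 - received / possible

rng = np.random.default_rng(0)
matrices, pairs = [], []
for _ in range(10_000):
    matrices.append(rng.uniform(-1, 1, size=(2, 2)))            # player 1's payoff matrix
    p, q = rng.uniform(size=2)
    pairs.append((np.array([p, 1 - p]), np.array([q, 1 - q])))  # two random mixed strategies
print(error_score(matrices, pairs))                             # close to 1.0, i.e. an error near 100%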

Networks trained in this manner have almost all of their error in games without any pure strategy equilibria. We trained networks together initially, then tested them on games against the same opponent without any further learning and measured their error rates. We trained three pairs of networks together, measured their errors, and averaged the errors from the pairs. The networks received about 4.5% error averaged over all games. We classified 2x2 games into three categories based on their Nash equilibria and measured errors on each category: games with one pure strategy equilibrium, games with two pure and one mixed equilibria, and games with only mixed strategy equilibria. The first column of Table 5.2 shows the average error of a network in games of each subcategory. Note that the three categories occur with different frequencies when payoffs are chosen uniformly at random, such that the weighted average of the errors is 4.52% despite such a high error in games with only mixed equilibria, which occur infrequently. The networks' weights never converge, fluctuating and causing the networks to change in structure over time. This could be due to the training method being inadequate

to properly train mixed strategies in the way we represent them, keeping errors above 0 and causing the network to constantly attempt to adapt. This results in a sort of Red Queen dynamic, with each network constantly adapting in an attempt to exploit the behavior of the other and to avoid exploitation in the games that are chosen to be played, but such changes make them vulnerable to different strategies in other games that have not been played recently. The weights in our networks tend to average about 5 in magnitude, which yields neurons that sometimes activate near 1 or 0, but sometimes in between, allowing for proper mixed strategies to be played.

Errors of networks          Same opponent   Different opponent   Frozen, same opponent   Frozen, different opponent
Average on all games            4.52%            5.76%                 3.34%                    6.25%
One pure equilibrium            1.04%            1.04%                 0.91%                    1.14%
Two pure equilibria             7.39%           16.14%                 7.86%                   16.93%
Only mixed equilibria          72.26%           69.25%                42.33%                   82.86%

Table 5.2. Errors for networks playing against their original opponent (col. 1), against a different opponent (col. 2), as well as errors for networks trained against a frozen opponent, against that opponent (col. 3) and against a different opponent (col. 4).

To test the robustness of the training, we also had the networks compete against opponents who had been trained in different pairs, and thus were not optimized for playing against each other, shown in column 2 of Table 5.2. This leads to a higher average error, coming exclusively from games with two pure strategy equilibria, which are coordination games. This is indicative of the fact that networks that have not been trained against each other do not have the ability to learn from the past behavior of their opponent, while networks being tested against the same opponent they trained with do, and can thus achieve consensus on which equilibrium to select over time. To investigate this notion further, we trained two new networks together for 1,000,000 games to initialize them, then periodically measured their strategy choice in a standard coordination game over time as they continued to train for another 100,000 random games. Figure 5.3 shows the strategy chosen by each network at times sampled periodically during this interval. This shows that behaviors drift back and forth, but that a strong drift away from equilibrium by one player tends to be quickly followed either by that player shifting back, or by the other player shifting to match the new equilibrium.

Figure 5.3. Coordination behavior over time

We also investigate the possibility of error arising from constantly moving target functions. Because the error function depends on the best response to the other player's actual behavior, the behavior that will yield 0 error when played against one network on a certain game may have significant error against another network that chooses a different action.

Figure 5.3. Coordination behavior over time

Even when playing against the same network over time, if that network is learning then its behavior will change over time. So even if a network encounters the same set of inputs twice at different times during training, the best response may not be the same both times if the other network has changed its behavior in the meantime. To measure the impact this might have on the error, we allow a network to train against an unchanging opponent. In particular, we first train two networks together for 1,000,000 games. We then freeze one of the networks and no longer backpropagate it in response to games. We then play another 1,000,000 games while only allowing the other network to learn from the results, allowing it to fine-tune its responses to the behavior of the frozen network. We then measure the error of the unfrozen network without further training. Column 3 of Table 5.2 shows the error received by the unfrozen network, averaged over several networks trained using this technique. We find that the average error is lower, but broken down by subcategory this improvement comes almost entirely from games with only mixed strategy equilibria. This makes sense, as these are the games where behavior fluctuates most strongly during training, and where the players' interactions are most adversarial. If a network is trained against a frozen network, it has more time to fine-tune itself to take advantage of its opponent's mistakes and find the best response in each scenario. However, the error is still much greater than 0, so the changing opponent does not account for the entirety of the error.

Is it possible that this increase in performance against a frozen opponent behaves as some form of over-fitting, which would reduce the network's performance against other networks? It is entirely possible that the network would abandon many well-rounded safe strategies in favor of aggressive exploitation if those worked against an unchanging opponent that cannot take advantage of any exposed weaknesses. To test this, we paired networks that were trained against a frozen network against other networks, the results of which are in column 4 of Table 5.2. The mean errors show a small increase in all categories compared to column 2. A 2-tailed t-test on the overall average between these categories yielded p = 0.539, which is not statistically significant. However, this data was measured from 6 pairs of networks, so there remains a possibility of a small difference that this test was not high-powered enough to detect. We also investigate the performance of our networks when paired with players who always play Nash equilibria. When a game has a unique Nash equilibrium, the Nash player always chooses it, whether it is pure or mixed. However, when a game has more than one Nash equilibrium (two pure and a mixed), there is some ambiguity in what a Nash player ought to do. Rather than creating a mechanism to select between the two pure equilibria, we have the Nash player always choose the unique mixed equilibrium. We find that when paired against such players, networks obtain about 1% error. This is much lower than networks playing against each other typically score, and occurs primarily because a mixed strategy Nash equilibrium requires that the opponent be indifferent between all of their strategies, receiving the same payoff regardless of their choice. This means that the network automatically receives 0 error when it plays a game against an equilibrium mixed strategy, regardless of its own decision. Given that most of the error came from mixed strategy games, most of it goes away immediately. Note, however, that this seems to be an artifact of our error measuring function, and does not correspond to an increase in the actual payoffs of the network. Table 5.3 shows the actual and maximum possible payoffs received by networks and Nash players playing against each other. Because Nash players always play mixed strategy equilibria when available, there are fewer opportunities to score points with them, either by exploiting them or by successfully coordinating with them, and therefore the maximum possible payoff when playing against them is lower than when playing against networks. So even though networks receive a lower total payoff when playing against a Nash player, the cause of this is not their own actions, so they do not receive error for it. In fact, the only possible source of error when playing against a Nash player is in failing to play a pure strategy equilibrium when a unique one exists, which is why the error for networks against Nash players is about the same as their error when playing such games against each other.
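For reference, the mixed equilibrium used by such a Nash player can be computed directly from the indifference conditions. The sketch below is illustrative only (it is not the code used for our simulations), and it assumes a payoff convention in which A[i][j] and B[i][j] are the payoffs of Players 1 and 2 when Player 1 plays row i and Player 2 plays column j.

# Illustrative sketch: the fully mixed equilibrium of a 2 x 2 game. Each player mixes
# so as to make the other player indifferent between their two strategies.
def mixed_equilibrium(A, B):
    denom_p = B[0][0] - B[0][1] - B[1][0] + B[1][1]   # determines Player 1's mixture
    denom_q = A[0][0] - A[0][1] - A[1][0] + A[1][1]   # determines Player 2's mixture
    if denom_p == 0 or denom_q == 0:
        return None  # degenerate game: no fully mixed equilibrium of this form
    p = (B[1][1] - B[1][0]) / denom_p   # probability Player 1 plays row 0
    q = (A[1][1] - A[0][1]) / denom_q   # probability Player 2 plays column 0
    if 0 <= p <= 1 and 0 <= q <= 1:
        return p, q
    return None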

Errors of networks and Nash players

                   Network vs Network   Network vs Nash   Nash vs Network   Nash vs Nash
Actual Payoff      0.290                0.248             0.238             0.249
Maximum Possible   0.304                0.251             0.305             0.249
Error              4.52%                0.99%             22.04%            0%

Table 5.3. Payoffs and errors for networks playing against each other (col. 1), against a Nash player (col. 2), and for Nash players playing against networks (col. 3) and against other Nash players (col. 4).

Similarly, although Nash players receive approximately the same average payoffs when playing against networks as they do against other Nash players, the fact that they don't exploit the networks' mistakes means they forgo a large number of points, and the error function penalizes them accordingly. Additionally, we note that Nash players score a lower total payoff when paired with each other compared to networks paired with each other, despite having 0% error. This comes down primarily to them playing mixed strategy equilibria in coordination games rather than positive-sum pure strategy equilibria. More sophisticated Nash players with some mechanism for coordinating on pure strategy equilibria in such games would likely receive higher total payoffs, though we leave such investigations for future research. We also investigate the effect of the network's size on its performance. In general, a larger network has a larger space of possible configurations, so one would expect it to perform better once fully trained, but this comes at the cost of increasing the training time, as well as increasing the difficulty of analyzing its internal structure.

Figure 5.4. Error (y) as a function of number of hidden layers (x)

In Figure 5.4, we fix the number of neurons per hidden layer at 8, generate a new neural network with x hidden layers alongside a new opponent with standard parameters (2 hidden layers, 8 neurons per layer), train them initially for 5 million games, then measure and average the variable network's error while they train for another million games. We observe that the error drops rapidly up to 2 hidden layers, then stays about the same, with slight increases.

This apparent increase could be an artifact of noise, or could be caused by insufficient training time for the larger networks, or by some inability of the larger networks to handle the training's lack of convergence.

Figure 5.5. Error (y) as a function of neurons per hidden layer (x)

In Figure 5.5, we fix the number of hidden layers at 2 and repeat the above procedure while varying the number of neurons in each hidden layer. We observe that the error drops quickly until around 8 neurons per layer, and decreases very slowly afterwards. Although this does not cover the full space of possible combinations of these two values, it covers a broad enough range to suggest that our choice of 2 hidden layers and 8 neurons per layer is reasonable. Although the error could be decreased by some margin by using networks with more neurons per layer, our purpose is not simply to create the best network for solving games, but also to study their internal structures, so the increase in accuracy from a larger network is unlikely to be worth the cost in complexity.

5.2.3 Temptation Games

We investigate the networks' behavior when differences between payoffs become small in some comparisons but not others. Consider the temptation game shown in Table 5.4, where x ∈ [0, 1].

                      Player 2
                      sC          sD
Player 1     sA       0.5, 0.5    0, 0
             sB       x, 1        1, 0.5

Table 5.4. Payoffs for the temptation game.

Strategy sC is dominant for Player 2, and thus when x < 1/2 the unique Nash equilibrium is {sA, sC}, and when x > 1/2 it is {sB, sC}. When x = 1/2, strategy sB weakly dominates sA, and any mixed strategy between the two paired with sC is a Nash equilibrium. We focus on the case where x = 1/2 − ε for some small ε. The Nash equilibrium is still {sA, sC}, but this is a risky option for Player 1 if their opponent is unreliable or irrational. If Player 2 plays sC then sA beats sB by ε, but if Player 2 plays sD then sB beats sA by 1. Therefore, sA is only a best response if Player 1 expects Player 2 to select sC with probability at least 1/(1 + ε). For a neural network, it might be difficult to distinguish between the cases x = 1/2 − ε and x > 1/2. Any neurons that detect such differences may not be perfectly calibrated, or may have inputs close to 0 and thus send out a signal close to 1/2, and might be overruled by other neurons. As a result, the network may choose sB even when playing against an opponent who consistently chooses sC.
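Spelled out, the threshold above follows from a short indifference calculation using only the payoffs in Table 5.4. If Player 1 believes Player 2 will play sC with probability p, then

E[sA] = p · (1/2),        E[sB] = p · (1/2 − ε) + (1 − p) · 1,

so sA is a best response exactly when p/2 ≥ p(1/2 − ε) + 1 − p, which simplifies to εp ≥ 1 − p, that is, p ≥ 1/(1 + ε).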

We measure this by training two networks together, then freezing them both and taking measurements of Player 1's strategy in this temptation game as x varies. Figure 5.6 shows a plot of Player 1's strategy choice (the probability of choosing sA in the mixed strategy it outputs) as a function of x.

Figure 5.6. Probability A of choosing strategy sA as a function of x in temptation game

This appears to be a sigmoid curve centered around x = 0.4. It is worth noting that the opponent network always played sC during these games, which means that for 0.4 < x < 0.5 Player 1 is making a suboptimal choice. When we averaged over all temptation games with 0 ≤ x ≤ 1, the network received an error of 0.3%, which is less than its overall average, and less even than its average restricted to games with a single pure strategy equilibrium, where it averages 1.04%. However, this range includes many games with no temptation conflict at all, where choosing sB simply yields a high payoff.

When we restricted to games with 0.4 ≤ x ≤ 0.5, the network received 2.92% error, which is higher than its average for pure strategy games, though still less than its average among all games. During regular training, games like this occasionally show up, and such mistakes will change the network in a way that causes it to perform better on these games in the future, but such changes will be small given the small amount of utility that is missed in making such a decision. Additionally, such changes may not persist as the network makes errors in other games that change the network in other ways, and such temptation games do not occur frequently enough for the network to optimize strongly towards them in particular. To demonstrate this, we trained two networks with adjusted probabilities for their training games. With probability 3/4 a game is randomly generated as usual, with all inputs chosen uniformly at random from [−1, 1]. With probability 1/4, a temptation game is generated with x chosen uniformly at random from [0, 1], and all other values fixed as in Table 5.4. This creates networks which are still incentivized to perform well in all games, but which put special emphasis on solving this particular type of game. Figure 5.7 shows a plot of a network's strategy choice as a function of x after receiving this special training.
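The adjusted training distribution just described can be sampled in a few lines. The sketch below is illustrative only; in particular, the flattened ordering of the eight payoffs fed to the networks is an assumption made for this example, not a specification of our implementation.

import random

def sample_training_game():
    # With probability 3/4, a fully random game with payoffs uniform on [-1, 1]
    if random.random() < 0.75:
        return [random.uniform(-1.0, 1.0) for _ in range(8)]
    # With probability 1/4, a temptation game with x uniform on [0, 1];
    # the payoffs are those of Table 5.4 (Player 1's four payoffs, then Player 2's)
    x = random.uniform(0.0, 1.0)
    return [0.5, 0.0, x, 1.0, 0.5, 0.0, 1.0, 0.5]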

Figure 5.7. Probability A of choosing strategy sA as a function of x in temptation game after special training

This is likewise a sigmoid curve, but centered at x = 1/2. This network received an average error of 0.17% on temptation games where x ranged from 0 to 1, and 0.39% error where x ranged from 0.4 to 0.5. We considered the possibility that this improved performance on temptation games would come at the cost of worse performance on other games, but when we tested this network on random games it averaged 4.45% error, which is close

enough to that of normally trained networks as to be statistically indistinguishable. If there is a trade-off, it is small enough to be hidden by random noise.

5.3 Pruning

The visual and computational complexity of a network lies not simply in the total number of neurons it contains, but also in the connections between them. Although we have been creating fully connected networks, where every neuron is connected to every neuron in the adjacent layers, many of these connections are redundant or unnecessary, and during training many of these weights end up near 0. Thus, if we can detect edges which are not serving a useful purpose, it is possible to "prune" a network to remove them and simplify the network without increasing its error by any meaningful amount. The naive way to remove weights would be to select those with the smallest magnitude; however, weights with small magnitude can still have a large impact on the network's performance if they play some important but subtle role in the network. We could brute force measurements of each weight's importance by copying the network, deleting one weight, and measuring the error of the pruned copy. We could then find which pruned copy has the smallest error, keep it, and repeat the procedure until the increases in error become too large, meaning all of the unimportant connections have been removed. However, this procedure would take an extraordinary amount of time, especially if the average error needed to be computed to high accuracy. Hassibi and Stork [88] construct a method they call the Optimal Brain Surgeon, which estimates the second order partial derivatives of the network's error with respect to the weights to compute the Hessian matrix H, then uses the inverse Hessian to compute each weight's saliency Lq = Wq^2 / (2[H^{-1}]qq), where Wq is the magnitude of weight q. This estimates the increase in error that will occur after weight q is removed and the other weights are adjusted to compensate for it. Thus, by removing weights with the lowest saliency, we can remove not only connections that serve no purpose, but also redundant connections that serve the same purpose as some other connection, which should also have low saliency. This process removes one weight each time it is run, removing what it considers to be the least important connection it finds.
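To make the procedure concrete, the following is a minimal sketch of a single Optimal Brain Surgeon step, written as if the network's weights were flattened into a vector w. It assumes an approximate inverse Hessian H_inv has already been computed (obtaining that approximation is the expensive part of the method and is omitted here); the function name and array layout are illustrative, not part of our implementation.

import numpy as np

def obs_prune_step(w, H_inv):
    # saliency L_q = w_q^2 / (2 [H^{-1}]_qq): the estimated error increase from removing weight q
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = int(np.argmin(saliency))                     # least salient remaining weight
    # adjust the surviving weights to compensate, then delete weight q entirely
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = 0.0
    return w_new, q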

Thus, assuming we wish to prune more than one weight but not every single one, we need some form of stopping criterion to determine how many times to repeat the process. Each subsequent weight removed is more important than the last, so we could select a maximum marginal increase in error and only remove weights that increase the error by less than that amount, or we could select a total maximum error for the network and stop removing weights once the network reaches that amount, although it is not immediately obvious what values these should be set at. However, if we look at all of the connections, there appears to be a partition into two regions of "important" connections and "unimportant" connections. We do not rigorously define such labels; however, we do offer some support for this intuitive notion. We construct a new network, train it, then begin pruning it using the Optimal Brain Surgeon method, giving it an additional 5000 games of training after each removal so it can adjust. We measure its error after each prune, and repeat this until all weights are removed. The data suggests a nonlinear increase in error as weights are removed, so we display the results on a log-log plot, shown in Figure 5.8.

Figure 5.8. Log-log plot of error (y) as a function of weights pruned (x)

There appears to be a noticeable increase near x = 50, which suggests that some threshold is being crossed here. In some sense, this means that around 94 of the original 144 weights are performing important functions for the network, and around 50 of them are unimportant and can be removed with minimal impact. Repeating this process with different networks yielded data with the same pattern and a visible threshold near x = 50, though we did not perform rigorous analysis to precisely quantify

this value. Pruning of networks is often done in order to avoid overfitting. When large neural networks are given training data from a relatively small set, they often develop specialized techniques that exploit unintended patterns from noise in the training data. For example, a network attempting to detect birds in images might learn a rule such as "images with more than 20% of the background being green never have birds in them." This rule is not true in general, but if the training data happens to lack any counterexamples, then this rule would be genuinely useful at correctly classifying images in the training data. As a result, an overfit network becomes better at classifying the training data, but worse at generalizing to examples outside of the training data that don't share those patterns. Pruning the network reduces the number of degrees of freedom the network has to work with, and forces it to solve problems using simpler processes that are more likely to generalize successfully [89,90]. Another mechanism used to combat overfitting is random dropout. During each training game, some neurons are chosen at random from inside the network and temporarily removed, along with any connections going to and from those neurons. This forces neurons to be more robust, since they cannot rely on specific sequences of events to reliably trigger. Effectively it trains multiple different versions of the network to accomplish the same task, and then puts them all back together at the end [91]. We do not anticipate having issues with overfitting, and so do not employ specific techniques in an attempt to avoid it. Our networks are small, so they are less vulnerable to overfitting on tasks in general. More importantly, our training data is the set of all 2 × 2 games. Because the data is the set of all combinations of values within a range, and the error function can be computed directly from those values without needing to be manually labelled by a human, the training process can randomly generate training data from the full set of possibilities. This cannot feasibly be done on tasks such as image recognition, where the set of all possible values consists mostly of random visual noise, and networks are being trained to accomplish tasks that are not already easy to compute from a simple algorithm. Our networks encounter orders of magnitude more unique examples, chosen at random from the entire set of games, so there will not be biases in the training data. We note that it is because we are running networks on an already solved problem that we can avoid this potential issue, since the ability to both generate data and assign error to the networks automatically enables such a large number of unique inputs to be used in training. If we wish to prune our networks, removing 50 weights seems like a reasonable amount

to remove given the results of this analysis. However, although this pruning can help reduce the visual and structural complexity of networks, it also dramatically increases the degrees of freedom for what shapes networks can have, and makes comparisons between networks more difficult unless they have been pruned in exactly the same way as each other. Therefore, networks in the remainder of this chapter are not pruned unless otherwise stated.

5.4 Network Comparisons

5.4.1 Euclidean Distance in Weight Space

We consider the question of how our neural networks are solving games, and how many different ways there are to do so. Are all of the networks essentially the same inside? Or are there completely different types of networks that solve the same problem? Such notions are not well-defined, as the continuous nature of the space of all neural networks makes distinguishing networks a somewhat murky prospect. If we take a neural network and then construct a copy with some number of its weights perturbed by a small amount ε, then unless the network's internal structure is extremely sensitive, the copy will behave nearly identically to the original, giving approximately the same outputs from the same inputs. It is technically a different network, but intuitively we would consider its method of solving games to be basically the same as the original. We might consider some sort of metric space of networks, in which the copy would be within ε distance of the original. This would then translate our notion of networks being the same or not into questions such as: "Do all networks with minimum error live within a small ball near each other?", "Are there multiple such clusters?", "Are there continuous paths of minimal error networks all throughout such space?" In fact, if we fix the number of hidden layers and neurons per layer at 2 and 8 respectively, as we typically do, and have not pruned any weights, then each network has 144 weights, each a real number. So by fixing some ordering of the weights, each network can be represented by a single point in R^144. This is a metric space; however, the Euclidean metric it comes with is not ideal for the purpose of classifying networks. Even when we fix a numbering of the weights and use it for all networks, two networks might arise which have the same weights connected to different neurons in the right pattern such that the two networks perform identically, but have weights in different slots that map to different points in R^144. If we take a network and permute

some neurons in the same layer, that is, we take neurons i and j and for every neuron k in an adjacent layer exchange the values of wi,k and wj,k, then such a copy will behave identically to the original, giving exactly the same output for each set of inputs. Intuitively it is the same network; however, since the ordering of the weights has changed, it will correspond to a different point in R^144. Whatever metric or other method of comparing networks we use, we would like it to be invariant under permutations of neurons. Our first way to resolve this is to compare two networks at a time, and permute the neurons in one network to most closely match the other. We take two networks of the same size that have already been trained separately, and randomly generate a set of games to test them on. Rather than train them on these games, we record the values of their neurons on each game in a set of lists, Xi for each neuron i in the first network, and

Yj for each neuron j in the second network. We then take a pair of neurons in the same layer from each network, neurons i and j, and compute Pearson’s correlation coefficient:

ρi,j = E[(Xi − µi)(Yj − µj)] / (σi σj)

where E denotes the expected value, µi, µj are the mean values of the respective lists, and σi and σj are their standard deviations. For each hidden layer, we pair together the neurons with the highest pairwise correlation, remove them from their respective lists, and recursively repeat this until all neurons in the layer have been paired. Note that we do not need to pair neurons in the input layer, whose values always match the inputs of the game and automatically have correlation 1, or neurons in the output layer, whose meaning is fixed by their location and which are automatically aligned in normal training. We then take the second network and permute each of its neurons to be in the same location as the neuron of the first network it was paired with.
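A minimal sketch of this matching step is given below (illustrative only, not our simulation code). It assumes the recorded activations are stored as arrays with one row per neuron and one column per test game; the greedy pairing mirrors the recursive procedure described above.

import numpy as np

def match_neurons(X, Y):
    # X, Y: arrays of shape (num_neurons, num_test_games) of recorded activations
    n = len(X)
    corr = np.corrcoef(X, Y)[:n, n:]     # Pearson correlation of each neuron in X with each in Y
    pairs = []
    available_i = set(range(n))
    available_j = set(range(len(Y)))
    while available_i and available_j:
        # pair the remaining neurons with the highest correlation, then remove them
        i, j = max(((i, j) for i in available_i for j in available_j),
                   key=lambda ij: corr[ij[0], ij[1]])
        pairs.append((i, j))
        available_i.remove(i)
        available_j.remove(j)
    return pairs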

For neural networks N1, N2, let d1(N1, N2) be the Euclidean distance after permuting N2 to match N1 using the above method and projecting both to R^144. This is reflexive, as identical networks map to the same point, and the correlation-based permuting makes it invariant under permutations: if N1' is a permutation of N1 and N2' is a permutation of N2, then d1(N1', N2') = d1(N1, N2). d1 is also symmetric: although the permutation process is asymmetric, since it permutes the second network to match the first, the points the two networks project to are reorderings of the same coordinate pairs, and thus have the same Euclidean distance.

However, d1 is not a metric, since it violates the triangle inequality. We can construct

networks N1, N2, N3, where each network has one large weight wi = 1/ε that has little impact on the actual behavior of the network or its neurons. Then we arrange things such that when N1 and N3 are permuted, the large weights end up in different spots and the Euclidean distance ends up large (∼ 2/ε), but when N1 and N2 are compared, or

N2 and N3 are compared, the large weights end up in the same spot and the Euclidean distance is small. Such occurrences seem to be the exception, however, as discrepancies only occur when neurons switch permutations, and this should behave nicely locally. So we will still interpret this mostly like a metric. We construct and train 10 networks independently for 1 million games, then measure their pairwise distances under d1. Figure 5.9 shows a histogram of the results.

Figure 5.9. Histogram of pairwise Euclidean distances for 10 networks

For comparison, we measured random points in R^144 with coordinates of the same magnitude (obtained by taking the network weights and permuting them randomly) and found they had an average Euclidean distance of 110.6. The fact that all of the networks had distance less than this demonstrates some similarity in their weights after permuting them based on correlations. However, we do not see clusters of networks with distance close to 0, which is what we would expect if all of the networks were doing the same thing. Perturbing a network's weights by a small amount will perturb its behavior, and thus its error, by a small amount, creating similar networks with similar error. The fact that we do not observe such similar networks suggests there is a vast range of different networks, at least from the perspective of d1, all with approximately equal performance when solving games. If there are many different neural networks in different locations in R^144, are they dense in the space? Technically, every point in R^144 corresponds to some network, but we wish to consider networks that actually perform well in games. If we fix a neural network B as player two, we can consider the error to be a function EB : R^144 → R which takes a point, constructs a neural network with the corresponding weights, and then computes its average error over all games when playing against network B. We then wish to know the

structure of the set of points with low error. Such questions could be answered analytically if one were to write down a functional description of a neural network containing 144 variables and then integrate over all possible payoffs, but such an attempt is infeasible even for relatively small neural networks such as ours, so we make numerical attempts at answering them. We consider whether a linear combination of two low-error networks is also a low-error network. If we train networks independently they will have different opponents, and hence different standards for error measurement; to avoid this, we first train a network B for 1,000,000 games and freeze it. We then train other networks against network B without allowing it to update, so that all of them are trained and measured against the same standard. This yields networks with error around 3.34%, as shown in Table 5.2. We take two such networks, permute one network's neurons to line up with the other's, and map them to R^144; call the resulting points x and y. We then consider the linear combinations pr = r · x + (1 − r) · y for r ∈ [0, 1], construct the networks corresponding to these points, and measure their error. Figure 5.10 shows the error values for the resulting networks.

Figure 5.10. Error for network pr as a function of r

The graph looks somewhat Gaussian in shape, being small when close to either of

the trained networks, and highest in the middle. This suggests the two networks are in distinct local minima of the error function, rather than in one wide basin of minima. The same pattern occurred when we repeated this using many different pairs of networks, suggesting there are many separate local minima scattered throughout R^144. However, there do seem to be some sort of connected paths in this space, since training a network under backpropagation will cause its weights to wander around until it ends up bearing little resemblance to the original, and yet it gets there without its error ever increasing drastically.

5.4.2 Polar Projection

Visually displaying a path in any high-dimensional space is difficult. To assist with this, we develop a polar projection technique that reduces the variables to a "radial distance" and an "orthogonal distance." First we define the continuous version of our polar projection. We start with a given point p ∈ R^n, which we call the "anchor point", a path f(t), f : [0, T] → R^n such that

f(t) ≠ p for all t, and an initial angle θ0 (typically 0). We then compute the radial vector r⃗(t) = f(t) − p and the radial distance r(t) = |r⃗(t)|. We then define the angle θ(t) by θ(0) = θ0 and θ'(t) = |f'(t) − r'(t) r⃗(t)/r(t)| / r(t), the speed of f orthogonal to the radial direction divided by the radial distance. This is finite since f(t) ≠ p implies r(t) > 0. We then interpret (r(t), θ(t)) as polar coordinates for a path g(t) in R^2 with p at (0, 0). Essentially this process preserves radial distances from p and arc length, and discards all other information. θ tracks the angular distance the path has traveled in all directions orthogonal to r⃗, which, when displayed in polar coordinates, preserves arc lengths. For every t the distance from f(t) to p will be the same as the distance from g(t) to p, and for any t1, t2, the arc length between f(t1) and f(t2) will be the same as the arc length between g(t1) and g(t2). We might visualize this process as unravelling the path f and rewinding it in a spiral around p. We can implement this polar projection for a discrete time path f : N → R^n by defining r⃗(t), r(t), and θ'(t) as before and recursively defining θ(t + 1) = θ(t) + θ'(t), although this no longer perfectly preserves arc length, as changes in r cause θ' to over- or under-represent the underlying changes in f(t). However, such discrepancies will be small for paths with short lengths between each time step. Figure 5.11 shows several examples of paths in R^2 and the paths they project to using this technique. We note that this process is a projection, since repeating it will yield the same spiral path.


Figure 5.11. Examples of paths in R^2 (left) and their corresponding polar projections (right).
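The discrete projection described above can be implemented directly; the following is a minimal sketch (illustrative only, not our simulation code), assuming the path is a sequence of numpy arrays that never passes through the anchor point p.

import numpy as np

def polar_projection(path, p, theta0=0.0):
    rs, thetas = [], [theta0]
    for t in range(len(path)):
        radial = path[t] - p
        r = np.linalg.norm(radial)                   # radial distance is preserved exactly
        rs.append(r)
        if t + 1 < len(path):
            step = path[t + 1] - path[t]
            towards = np.dot(step, radial) / r       # component of the step towards/away from p
            orthogonal = np.sqrt(max(np.dot(step, step) - towards ** 2, 0.0))
            thetas.append(thetas[-1] + orthogonal / r)   # angular advance approximately preserves arc length
    return np.array(rs), np.array(thetas)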

Additionally, any path which can be represented as a parametric curve in polar coordinates centered around p, with r(t) > 0 and monotonic θ(t), will project to itself, or to some simple transformation of itself such as a rotation. This includes straight lines headed directly towards or away from p. Any travel in directions orthogonal to the vector between p and the path's current location simply increases θ, regardless of which direction it goes in. Thus, an intersection of the projected path with itself does not indicate an intersection of the original path with itself. Moreover, this is a projection of paths, rather than of points. A path which intersects itself need not

intersect itself in the projection, as the same point can be given a different θ depending on the arc length of the path preceding it. There is no particular relationship between two points projected to co-terminal angles other than there being enough arc length between them for this to happen. A path in R^n restricted to an (n − 1)-sphere of radius r centered at p would project onto an arc of a circle of radius r with equal length, regardless of the shape of the path.

If such a path is expressed as a parametric curve xi(t) with constant speed c along its arc, then its projection will have θ(t) = ct/r. In high dimensions, most directions are orthogonal to the radius, so most paths will project to a spiral around p. However, changes in the radial distance from p will be preserved, and show the spiral going out or in at the same rate. Thus the projection can be used to detect a path's behavior towards p relative to other directions. A path that wanders around without getting closer to or further from p will be noticeably different from a path with a general trend towards or away from p.

Figure 5.12. Projection of a network path created via backpropagation (a, left), and a random walk (b, right)

We train a neural network for 1,000,000 games against a frozen opponent B and let p be the point in R^144 it corresponds to. We initialize the path f with f(0) = p. We then continue to train the network against the same opponent B, recording the new point it corresponds to every 10,000 games and setting that as the next point in the path. We repeat this until we have 100 points. Figure 5.12a shows the polar projection of this path. For comparison, we also make a random walk in R^144 starting at p, where at each time step the point changes by a random vector with each coordinate chosen independently from a Gaussian distribution with mean 0 and standard deviation 0.3, shown in Figure 5.12b. Decreasing the standard deviation will create a twistier path, which will project to a tighter spiral, but also travel less total distance in the same number of steps, and

remain closer to p. Setting it to 0.3 yields a path with approximately the same radial distance from p as the backpropagation path after 100 steps. These spirals look somewhat similar, but we observe a much tighter spiral from the backpropagation path, and see the radius decrease several times near the beginning, which the random walk does not do. A random walk in any number of dimensions will tend to get further from its starting point. The backpropagation path, however, is driven by the error function, which draws it towards points with low error. Any time the path wanders away from areas of low error, it will be driven back towards them in future games. We can measure this trend more rigorously by taking the dot products of adjacent vectors along the path. For each pair of adjacent points in the path, define the vector vi = pi+1 − pi. Then for each pair of adjacent vectors, the normalized dot product is ri = (vi+1 · vi) / (|vi+1||vi|). Averaging the ri together yields a value r for the path which measures the straightness of the path. A path where each step is more likely to be followed by a step in the same general direction will have a positive value of r. A random walk in which each step is chosen independently, without any overarching bias, will have an r of approximately 0. A path which often backtracks on itself and creates zigzag patterns will have a negative value of r. We measured r for these paths and found that the random walk had r = 0.0032, consistent with random noise, and the backpropagation path had r = −0.34. This demonstrates that the backpropagation path is indeed twisting back on itself more often than would be expected by random chance, while still retaining enough degrees of freedom to eventually wander away from its starting point.
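The straightness statistic r can be computed from a recorded path as follows (a minimal sketch, not our simulation code), where the path is a sequence of points in R^n stored as numpy arrays.

import numpy as np

def straightness(path):
    steps = [path[t + 1] - path[t] for t in range(len(path) - 1)]
    dots = [np.dot(w, v) / (np.linalg.norm(w) * np.linalg.norm(v))
            for v, w in zip(steps, steps[1:])]       # normalized dot products of adjacent steps
    return float(np.mean(dots))                      # ~0 for a random walk, negative if the path backtracks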

5.4.3 Paths Between Networks

We now reconsider the notion of a path between two different networks. Even if a network directly on the line segment between two trained networks has high error, there may be some continuous path of low-error networks that bends or twists and connects the two networks in a roundabout way. To detect such paths, we construct a sequence of networks via a directed genetic algorithm [92] to balance error minimization with a bias towards a target.

We fix a target network T and a starting network N0 that has been trained against the same opponent as T and has been permuted to align its neurons with those of T. Define pi as the point network Ni maps to in R^144. As model parameters we fix a number of steps n, a number of test games g, and a number of candidate networks c, all of which improve the performance of the algorithm as they increase, but also increase

its computation time (which is O(ngc)). We also fix a diffusion coefficient σ which will influence how direct the resulting path is. We then recursively define network Ni+1 as follows:

• Define the drift vector d = (pT − pi)/(n − i).

• Create c copies of Ni as "candidates".

• For each candidate k, create a diffusion vector Dk, where each coordinate of Dk is chosen randomly from a Gaussian distribution with mean 0 and variance σ.

• For each weight wk,j in each candidate, add dj + Dk,j.

• Measure the error of each candidate using g random test games.

• Let Ni+1 be the candidate with the lowest measured error.

We then repeat for n steps. Note that d becomes larger in magnitude as i approaches n, but becomes smaller as pi approaches pT. This is a soft requirement for pi to approach pT at a constant rate and reach it in n steps: d stays constant in magnitude when this requirement is fulfilled, increases if pi has previously approached at a slower rate, and decreases if pi has approached at a faster rate. If this algorithm is performed without a drift term (or in the limit as n → ∞), it is a regular genetic algorithm which acts as a form of neural network training: create a batch of offspring with slightly perturbed values, measure their error, and keep the best one to repeat the process with. In fact, we ran such an algorithm and found it yielded a series of neural networks with approximately 3% error, in a manner similar to (though less computationally efficient than) our standard backpropagation. If instead the algorithm were performed without a diffusion term (or by setting σ = 0), every candidate in each round would be identical and would correspond to the same pr on the straight line we used earlier to make Figure 5.10. When the two terms are put together as in the general case, the result is a directed selection process, sketched below. Many candidates are created with scattered random values, but these values are biased in the direction of the target network, thus influencing the selection process to head in that direction. Low values of σ will force the networks to take a more direct route towards pT, while higher values will allow the networks to take more roundabout paths. If there are low-error networks along a continuous path connecting the starting network to the target network, then candidates which fall near that path will be selected.
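A minimal sketch of one step of this directed search is given below (illustrative only, not our simulation code). It assumes networks are represented by flattened weight vectors and that measure_error is a hypothetical helper returning the average error of a weight vector on a list of test games.

import numpy as np

def directed_step(w_current, w_target, i, n, c, sigma, test_games, measure_error):
    drift = (w_target - w_current) / (n - i)              # bias towards the target network
    candidates = [w_current + drift
                  + np.random.normal(0.0, np.sqrt(sigma), size=w_current.shape)  # variance sigma
                  for _ in range(c)]
    errors = [measure_error(w, test_games) for w in candidates]
    return candidates[int(np.argmin(errors))]             # keep the lowest-error candidate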

Since the error function is continuous, candidates heading in the general direction of such a path will outcompete ones in other directions, even if they do not land perfectly on it. If the diffusion coefficient σ is calibrated properly, it should allow candidates to scatter widely enough to find such low-error paths, while still heading in the direction of pT. We ran this algorithm multiple times with n = 40, g = 10000, c = 200, and σ ∈ {1/16, 1/4, 1}. Figure 5.13 shows a polar projection of the resulting paths. Note that each path is projected independently, so intersections between paths in the figure do not correspond to intersections between the original paths.

Figure 5.13. Polar projection of three network paths, with θ0 = 0, and paths colored based on the error of the adjacent networks

Figure 5.14 shows the errors of the networks in these paths in more detail. We observe that the path with the lowest σ had the highest error overall. We note that the shape of its errors looks similar to that in Figure 5.10, but with a lower peak: it was able to wander to some degree, but still followed a path close to the straight line towards the target network. However, the lowest error occurred in the path with σ = 1/4, the middle of the three diffusion coefficients we sampled. We speculate that this is due to the parameter controlling the number of candidate networks, c. The high-dimensional space offers many possible directions in which networks could travel, and if paths of low-error networks are sparse and thin, it is entirely possible for the algorithm to generate c different networks none of which lie in the optimal direction. Paths generated with higher noise are more vulnerable to such occurrences, while paths generated with

lower noise will sample over a smaller volume, and will therefore be more likely to find local gradients within that volume. We expect that increasing c would improve the performance of paths with high σ more than paths with low σ. These results demonstrate that there is some connected path of low-error networks, with errors at least as low as those found by the path with σ = 1/4, and that while it is not a straight line, it is at least direct enough to be found by our method for moderate σ. Since these networks were trained independently, and we found similar results when comparing other pairs of trained networks, this suggests that all or most low-error networks lie in some sort of connected space of low-error networks, though the underlying connections are not necessarily as low-error as the properly trained networks. The subset of low-error networks contains curves and branching paths, since each pair of networks is able to find some path connecting them, but direct linear paths tend to have much higher error.

Figure 5.14. Errors of network paths from Figure 5.13

5.4.4 Correlation metrics

We now consider methods of comparing networks other than the Euclidean metric. One of the primary flaws of using this metric to compare neural networks is that locations in Euclidean space are only indirectly related to the actual behavior of the network. Because the activation function saturates at high values, there is little practical difference

between a connection with a weight of 100 and a connection with a weight of 1000, but in the Euclidean metric this difference would completely dominate all of the other weights. Additionally, the high dimensionality of considering all 144 weights makes analysis difficult. We step away from this space and instead construct a simple metric based on the correlations we used earlier for permutations. We take two networks, compute the correlation coefficient ρX,Y for each pair of corresponding output neurons between the two networks, and average them (though this is approximately equal to simply computing it for one output neuron on each network) to get ρ.

We then define the distance between the two networks to be d2 := 1 − ρ. In practice we compute the correlation using finitely many games, usually 100,000, which gives a good approximation, but d2 is well-defined as the converged value in the limit as the number of test games goes to infinity. This is an actual metric, provided two networks with identical behavior are considered to be the same network, since a network always has correlation ρ = 1 with itself, correlations are symmetric, and they satisfy the triangle inequality. Additionally, d2 is invariant under internal permutations, since it only measures the firing of the output layer; it is therefore unnecessary to permute networks to match each other before comparing them. And it is continuous as a function of the weights, meaning small perturbations in the weights will only perturb the metric by a small amount. We constructed 10 networks, trained them separately, then tested their output correlations together. We found the average correlation was 0.951 with little variance: all pairs of networks had correlations between 0.93 and 0.97. This corresponds to a metric distance of 0.049, which is rather similar to the error rate of such networks. Given that this correlation only measures the output neurons, this metric is only capable of detecting differences in the final decision of a network, and all of the networks are attempting to accomplish the same task. Two "perfect" networks which always chose the best response to each game would have a metric distance of 0 regardless of the internal structures they used to compute that response. However, this metric still tells us something. Because the error rate and correlation distance are both ∼ 0.05, this suggests that players are making the correct decision 95% of the time and an incorrect decision 5% of the time, but that different networks make their mistakes at different times. If the networks were making the same mistakes, we would expect the correlation distance to be lower than the error rate. Although this reasoning is on somewhat shaky ground, because the error is weighted by how many points each game is worth and the correlation coefficient is not, it is still suggestive that all 10 independently trained networks seem to be "different"

from each other in some detectable aspect of their behavior. To better account for the internal structure of the networks, we can apply the correlation technique to the neurons in the hidden layers as well. However, we still wish for any metric to be invariant under permutations of internal neurons. If we naively compare the pairwise correlations of the nth neuron in one network with the nth neuron in another network, there may be a high distance even if the networks are permutations of each other, since a copy of that neuron may occur in a different location. Therefore, we perform a slightly altered version of the permutation process used earlier. For each hidden layer, we make lists containing all neurons in each network, and compute the pairwise correlation of each neuron in one with each neuron in the other. In this case we take the absolute value of each correlation, reasoning that two neurons with a strong negative correlation can play essentially the same role within a network simply by negating their output weights. We then pair the neurons with the highest pairwise correlation together, and repeat for each layer, excluding the input layer, whose values always match the inputs of the game and automatically have correlation 1. We then let ρ2 be the average correlation among the pairs of neurons we made, and define d3 := 1 − ρ2. Technically this is not a metric, as the absolute value allows the construction of two networks which make opposite decisions on every game but have d3 = 0; however, the errors of such networks would be x and 200% − x respectively, and thus both would not arise naturally via our training process, so this is not much cause for concern. We took the same 10 networks from before and tested them using this measurement, and found the average correlation was 0.747, lower than the correlation for the output neurons alone. We found that correlations for pairs of neurons varied wildly, ranging from 0.038 to 0.996. Interestingly, every single pair of networks had at least four neurons in their first hidden layer with pairwise correlations above 0.9, often much higher. If we consider only the top four correlations in the first hidden layer, we get an average correlation of 0.978. In some sense, two neurons with a correlation of 1 are the "same neuron" placed into different networks. This suggests that all of the networks are doing almost exactly the same thing with some of their neurons: there are some important neurons that every network creates copies of in order to play games with low error. But it also suggests that the remaining neurons are optional, or are performing some task that is less significant to the performance of the network, since some networks have certain of these neurons and others have different ones. This seems related to the notion of "important" and "unimportant" weights we discussed during pruning, as these weakly correlated

neurons may be connected to mostly unimportant weights which can be removed without hindering the primary functioning of the network. To investigate this, we define the Brute Force Importance (BFI) of a neuron to be the amount by which the network's error increases when that neuron is removed. To measure this, for each neuron we create a copy of the network, remove that neuron and all connections attached to it, and then measure the error of the two networks and take their difference. This is not the most sophisticated method of measuring importance, as it cannot detect things like redundancy: two neurons which both play the same important role, a role the network needs but which can be performed by a single one of them, will each be measured as having a low BFI, because only the removal of both would significantly hurt the network. Additionally, it is technically possible for a neuron to have a negative BFI if its removal would decrease the network's error, though such a neuron is unlikely to arise from training. However, this measurement is simple to implement and interpret, and should provide some rigorous backing for vague intuitive judgements of "importance". We take 10 networks and consider only the neurons in the first hidden layer, adjacent to the input neurons, giving us 80 different neurons. For each neuron i, we find the best matching neuron in each of the 9 other networks according to the highest absolute value of the correlation, then average those values together to make a correlation score ci for the neuron. If every network has a copy of the same neuron, then these copies will correlate highly with each other and all score high ci. If only some networks contain copies of a neuron, then those will correlate highly when paired together, but have lower correlations with the other networks, so the average will be lower. Thus, ci should be some measure of how common a neuron, or the role it plays, is across all of the networks. We then also compute the Brute Force Importance bi of each neuron in each network. Figure 5.15 shows a scatter plot comparing these two scores for each neuron. The distribution of scores does not cluster unambiguously into "important" and "unimportant" neurons, but there does seem to be a general trend of higher correlation accompanying higher BFI. Most neurons with ci < 0.9 have bi < 0.1, and vice versa. This does not appear to be a linear trend, as there are many more neurons with low bi but high ci, which may be due to our concerns about BFI in the presence of redundancy. Nevertheless, this data provides further evidence that some neurons are playing an important role, such that multiple independent networks all contain such neurons, and removing them tends to increase the network's error more than removing other neurons.
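The BFI measurement amounts to a copy, a deletion, and two error estimates. A minimal sketch is given below (illustrative only), for a network stored as a list of numpy weight matrices with weights[l][i, j] connecting neuron j of layer l to neuron i of layer l + 1; error_fn is a hypothetical helper returning the average error against a fixed opponent on random test games.

import copy

def brute_force_importance(weights, layer, neuron, error_fn):
    baseline = error_fn(weights)
    pruned = copy.deepcopy(weights)
    pruned[layer][:, neuron] = 0.0          # sever all connections leaving the neuron
    pruned[layer - 1][neuron, :] = 0.0      # and all connections entering it (layer >= 1 for hidden neurons)
    return error_fn(pruned) - baseline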

We then inspected the neurons with the highest ci and bi, which ought to all have similar structures, to determine what that structure was, and found that they appear to

be doing something best described as level 2 play under level-k theory.

Figure 5.15. Correlation score ci (y) vs Brute Force Importance bi (x) for neurons in the first layer of 10 networks.

5.5 Level-k hierarchy in networks

Level-k theory is a concept in game theory where, rather than players being perfectly rational and computing the result of infinite regressions of knowledge, they have a finite hierarchy of steps in how they behave and expect their opponents to behave. A level 0 player behaves naively, choosing strategies completely at random, or in some other way uncorrelated with the utility of their decisions. A level k + 1 player is recursively defined as choosing the optimal strategy under the assumption that all other players are level k [93,94]. For example, consider the p-beauty contest game, where a large set of players each choose a real number between 0 and 100. The winner is the player whose value is closest to p times the average of all players' numbers, for some 0 < p < 1. Consider p = 1/2. A level 0 player would ignore all incentives and choose a random number. A level 1 player would expect all of the other players' numbers to be random, making the expected average 50, and so would choose 25. A level 2 player would expect every other player to go through that exact reasoning and all choose 25, and thus this player would choose 12.5. In the p-beauty contest game, as k approaches infinity, the behavior of a level k player converges to the Nash equilibrium of 0. Convergence to the equilibrium happens in

many games, but in some games behavior will cycle periodically as k increases, such as in an asymmetric matching pennies game. Cognitive Hierarchy Theory is almost the same as level-k theory, except that a level k + 1 player is defined as choosing the optimal strategy under the assumption that the population consists of some mixture of levels 0 through k. This leads to similar but more nuanced behavior in many games, and is more likely to converge as k increases. Experiments with human subjects show that this does a good job of describing most people's actual behavior in such games, with the majority of people behaving as level 1 players [94,95]. Our neural networks seem to resemble this behavior to some degree in their internal structure, with higher levels of play requiring larger networks due to the increasing complexity of the computations involved. In 2 × 2 games, the simplest level 0 player is one that chooses either strategy with equal probability. A network with only output neurons and no inputs, or one with 0 weight on all its inputs, would send 0 to its outputs, which under the activation function would cause each of them to output 1/2. A brand new network with no training and random initial weights will also act like a sort of level 0 player: it will choose different strategies for different games, but in a way completely uncorrelated with its utility. A level 1 player would then maximize its utility, given its beliefs, by picking whichever of its own strategies has the highest average payoff; it has no need to consider the opponent's payoffs, since they do not influence its decision-making. A network with 8 inputs connected directly to its outputs, with no hidden layers, will train to become effectively a level 1 player. There is not enough space to process its opponent's strategy in a useful way, so it simply sends positive weights from each of its own payoffs to the output that corresponds to that strategy, and negative weights to the other output, and whichever strategy has the higher average gets played. This gives it the maximum payoff (and thus 0 error) under the assumption that the opponent is a level 0 player. It also plays its dominant strategy whenever it has one, in which case it receives 0 error in such games regardless of what opponent it faces. Figure 5.16 shows an example of such a network. A level 2 player would consider its opponent's payoffs, find which of the opponent's strategies has the highest average payoff, and then pick the best response to that strategy. A network with at least two hidden layers appears to do something similar to this, though different networks do it in different ways, alongside other computations.

Using the variable names from Table 5.1, let P be the proposition that a1 > a2, let

Q be the proposition that a3 > a4, and let R be the proposition that b1 + b3 > b2 + b4.

Figure 5.16. Example of a network emulating a level 1 player

Then a level 2 player can be precisely defined by: if (R ∧ P) ∨ (¬R ∧ Q), choose strategy 1; otherwise choose strategy 2. Further, each of these propositions can be represented by a single neuron in the first hidden layer which fires at ∼1 when the proposition is true and ∼0 when it is false, simply by having positive weights from the inputs on one side of the inequality and negative weights of approximately equal magnitude from the inputs on the other side. The logical connectives required can also be represented in a neural network by using an arrangement of neurons that acts as a transistor, as shown in Figure 5.17. If neuron R fires, it causes both neurons in the next layer to fire, regardless of what P does. Because the activation function is nonlinear, R saturates the signal and causes the top neuron to act as P ∨ R. If both connections leading to the output have equal weight, then they will cancel, causing the output to receive no net signal from P or R. If R does not fire, then the output will receive signal from P. If this is embedded as a substructure in a larger network, then the output could receive input from two or more similar transistors, which let P through if R is true and let Q through if R is false. Thus, it is possible to manually construct a network which behaves as a true level 2 player.

Figure 5.17. Neural Network Transistor
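For comparison with the trained networks, the level 1 and level 2 decision rules themselves are only a few lines. The sketch below is illustrative; it assumes the reading of Table 5.1 in which a1, a2 (respectively a3, a4) are Player 1's payoffs for its two strategies when the opponent plays their first (respectively second) strategy, and b1 + b3 versus b2 + b4 compare the totals for the opponent's two strategies.

def level1_choice(a1, a2, a3, a4):
    # choose whichever of our strategies has the higher average payoff (opponent treated as level 0)
    return 1 if a1 + a3 > a2 + a4 else 2

def level2_choice(a1, a2, a3, a4, b1, b2, b3, b4):
    P = a1 > a2              # strategy 1 is better if the opponent plays their strategy 1
    Q = a3 > a4              # strategy 1 is better if the opponent plays their strategy 2
    R = b1 + b3 > b2 + b4    # a level 1 opponent would play their strategy 1
    return 1 if (R and P) or ((not R) and Q) else 2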

In trained neural networks, we observe neurons corresponding to at least one of P or Q or their negations in every network trained under standard parameters with two or more hidden layers, and often several. They can be found by eye in a diagram by looking for a neuron with a single positive and a single negative weight of significant magnitude, with all other weights close to zero. These neurons tend to be the ones with the highest correlation scores ci, since every network develops some form of them and they correlate highly with each other.

We define ideal neurons I1 through I6 as neurons that fire in perfect alignment with the propositions P, ¬P, Q, ¬Q, R, ¬R respectively. For each neuron in an actual network, we compute its correlation with each of these ideal neurons, and label it according to the ideal neuron with the highest correlation. Figure 5.18 shows the same data as Figure 5.15, but with each point labelled according to the ideal neuron it correlates with most strongly, and color coded to indicate the strength of that correlation.

Figure 5.18. Correlation score (y) vs BFI (x) for neurons in the first layer of 10 networks, with correlations to closest ideal neurons.
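The labelling procedure can be summarized by the following minimal Python sketch (ours, not the code used to generate Figure 5.18). It assumes neuron activations have been recorded over a batch of games, uses the same assumed payoff indexing as the earlier sketches, and uses a plain Pearson correlation as a stand-in for the correlation score defined earlier in this chapter.

import numpy as np

def ideal_activations(games):
    """Ideal neurons I1..I6 fire exactly when P, not-P, Q, not-Q, R, not-R hold.
    `games` is an (n_games, 8) array of payoffs [a1..a4, b1..b4]."""
    a1, a2, a3, a4, b1, b2, b3, b4 = games.T
    P = (a1 > a2).astype(float)
    Q = (a3 > a4).astype(float)
    R = (b1 + b3 > b2 + b4).astype(float)
    return np.stack([P, 1 - P, Q, 1 - Q, R, 1 - R])   # shape (6, n_games)

def label_neuron(activations, games):
    """Label one real neuron (its activations over the same batch of games) by
    the ideal neuron it correlates with most strongly. A constant ideal column
    (e.g. a batch where P is always true) would need a guard in practice."""
    ideals = ideal_activations(games)
    corrs = [np.corrcoef(activations, ideal)[0, 1] for ideal in ideals]
    best = int(np.argmax(corrs))
    return "I{}".format(best + 1), corrs[best]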

This shows an approximately equal prevalence and importance of neurons corresponding to ideal neurons 1 through 4, which makes sense given the symmetry of their roles. It also shows that the neurons of highest importance correspond to propositions P and Q and correlate with them very strongly, while neurons corresponding to proposition R are less important. This makes sense, since a network that cannot predict its opponent's strategy can still act as a level 1 player and choose strategies with high average payoffs,

while a network that does not compare its own payoffs properly will make incorrect decisions even if it knows what the opponent will do. Propositions P and Q are important in every game, while proposition R is only important when the player does not have a dominant strategy.

In addition to having lower BFI, neurons corresponding to proposition R also tend to have weaker correlations. We also note that several such neurons have high ci, meaning they correlate strongly with each other, but not with the ideal neurons. This suggests that they are playing some role that is similar to proposition R, but not quite the same. We made an actual level 2 player and paired it against several trained neural networks, and the level 2 player averaged a 10% error against the networks. This indicates that the networks are doing useful computations beyond level 2 play, though it is not obvious which component of their structure performs them.

Neurons corresponding to R look like a neuron with weights coming from the opponent's payoffs, with positive weights from two and negative weights from two. In fact, the substructure created by a neuron for R and ¬R along with the input neurons is precisely a level one player. If a network with two hidden layers is paired against an opponent with none, the larger network will often develop a nearly identical copy of the opponent. We also found that if, after some initial training, the opponent is manually altered in some way that would not normally occur during training, such as crippling it by pruning important weights, the first network will adjust its internal model to reflect these changes after they are trained together some more. The network does not have an abstract reasoning process that determines what a level 0 player ought to do and then make decisions based on that; its structure develops by actually playing games against an opponent and, when trained consistently against the same opponent, it minimizes error when it accurately predicts that opponent's actual behavior. Networks playing against opponents of the same size cannot contain a full copy of their opponent due to a lack of space, but in some sense they contain a compressed level 1 representation of their opponent as a means of predicting its behavior. This prediction is not perfectly accurate, and tends to fail when the opponent's choice deviates from level 1 play. However, the smaller version of the opponent differs from a level 1 player in ways that depend on the specific opponent, which is how the networks tend to achieve lower error than an actual level 2 player.

However, this internal modelling of opponents does not seem to extend to higher levels of play. Larger networks trained using our error function do not tend to contain level 2 descriptions of their opponent, even when paired against such an opponent. If we

take a network and manually construct a larger opponent which contains an internal copy of the smaller network, along with transistors making it choose the best response, then this constructed network will receive approximately 0 error against the smaller one (it still receives a small error because the finite values of the weights mean it always plays a slightly mixed strategy). However, such a structure is unstable: if we enable learning, the opponent will update, quickly causing the larger network to lose track of it and revert to typical error rates.

5.6 Discussion

5.6.1 Summary

In this chapter we develop a method for interpreting neural network outputs as mixed strategies in arbitrary games, and an error function that weights the strength of training based on the payoffs the network receives. We use this to train networks under various parameters, and develop several techniques to compare them and analyze their internal structures. We investigate the effect of pruning on the errors of networks. We measure the correlation of neuron firing, which allows us to identify neurons in different networks that play similar roles. We use this to permute networks in order to make them more closely resemble each other in Euclidean space, which we then use to make paths through this space. We display results using a novel polar projection technique, which preserves radial distances from a fixed point and the arc length of paths. We also use neuron correlations to compare networks directly by averaging these correlations, and identify similarities in neurons with high correlation. We find that all networks share neurons that correspond to key propositions in level-k hierarchy theory, and investigate this further.

We did not explicitly design or bias our networks towards learning how to play games in any particular manner, but find that they all seem to learn it in a similar way. Networks with different initial conditions and different training games will develop in different ways. However, they all share the learning mechanism of backpropagation via our error function, with the common goal of decreasing error. The fact that they develop the same types of neurons that approximate level 2 play suggests that level 2 play occupies a special position in the set of possible behaviors, balancing low error with enough simplicity that neural networks of the size we use are able to implement it and find it via backpropagation.

5.6.2 Extensions of this model

In this chapter, we have covered a broad range of topics in investigating these neural networks. Each of these topics could be investigated more thoroughly using different techniques, and many have parameters that would benefit greatly from more computation power. We also make several broad observations and conjectures based on patterns we observe in the data, which could be converted into more formal definitions and then proven by analytical or statistical methods.

The largest avenue of further research would be considering larger games than 2 × 2 matrices. Because of how we convert from multiple output neurons to a mixed strategy, it is possible to construct networks that can play arbitrary n × n games by giving each network n² input neurons and n output neurons, and having the network play strategy i with probability A_i / Σ_j A_j, where A_i is the value of output neuron i (a minimal sketch of this conversion appears at the end of this subsection). Solving larger games requires more computation, so networks would likely require more hidden layers and neurons per layer in order to perform this task. This would greatly increase the simulation times for playing and training networks, as well as increase their internal complexity. All of the techniques we use for analyzing networks trained on 2 × 2 games should be applicable to these larger networks in some form, but with greatly increased time cost and often with less clear results. Thus, future research in this direction would also include refining these techniques and developing new ones that can better handle the larger networks.

We also observe that most of the remaining error in trained networks comes from their performance in mixed-strategy games. This is also the primary source of their lack of convergence, as players continue to update their behavior based on these errors. Our decision to interpret the outputs of neurons as mixed strategies in each game and update players immediately based on the results is not the only way of modeling mixed strategies. Additionally, our error function does not distinguish between points lost from the player making a poor game-theoretic decision, such as playing a dominated strategy, and from less obvious situations, such as losing a matching pennies game against their opponent. Computing the error as the difference between the actual payoff and the maximum possible means that changes in the opponent's behavior can change a player's error without changing their actual received payoffs. A method which averaged payoffs over multiple games before updating could resolve some of these issues and create more stable networks. A different error function could be used in a way that still encouraged "good" play in games, but under different criteria for what that means. Alternatively, we could construct a population of neural networks which all play together simultaneously, and average their payoffs over all such opponents. Similar to the population in Chapter 3, a large population could have subgroups sampled from it in various sizes, ranging from 2 up to the entire population. This may create more robust networks that perform better in mixed games, or against opponents they haven't encountered before, though it could also negatively affect the ability of networks to reach a consensus in coordination games.

Different structures of networks could also yield different behavior. We only consider feed-forward networks without biases, but including biases may enable certain strategies that couldn't be implemented without them. Recurrent neural networks allow networks to retain information between games, which may help with remembering and predicting opponents' actions. These and other structures may lead to interesting behaviors worth investigating.

Our investigation into low-error networks in Euclidean space yielded some results, but leaves much unanswered. More sophisticated techniques for analyzing paths and gradients in high dimensional spaces could produce a better description of how these networks are arranged and connect to each other. We find that neurons inside networks correlate highly with neurons in other networks, and that these correlations are mostly explained by level-k hierarchy theory. However, this description is imperfect, as neurons correlate with each other more strongly than with the ideal neurons given by this description, and our networks perform better than a level 2 player in the same situation. Future research could attempt to elucidate this discrepancy and provide a more accurate description of the typical network's behavior, as well as of differences within networks, which seem to have some less important neurons that still benefit the network in some way. A more sophisticated notion of "importance" that better accounted for redundant neurons may also help in this analysis.

We also observed that larger networks did not seem to emulate higher levels of play after training, despite the possibility of such behavior in a network. Future research could look more into the reason for this, as well as create a better description of larger networks' structures if they are doing something other than level-k behavior. We also briefly investigate pruning weights off neurons, but leave most networks unpruned for the remainder of the chapter. Future research could investigate how pruning networks affects the behavior in each of the other sections.
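The output-to-strategy conversion referred to above amounts to a normalization of the output neuron values; a minimal Python sketch (ours, with illustrative function names) follows.

import numpy as np

def mixed_strategy(outputs):
    """Convert n output neuron values A_1..A_n into a mixed strategy: strategy i
    is played with probability A_i / sum_j A_j. Assumes the outputs are positive,
    as they are under the activation used in this chapter."""
    outputs = np.asarray(outputs, dtype=float)
    return outputs / outputs.sum()

def play(outputs, rng=np.random.default_rng()):
    """Sample a strategy index according to the mixed strategy."""
    p = mixed_strategy(outputs)
    return rng.choice(len(p), p=p)

print(mixed_strategy([0.9, 0.3, 0.6]))   # -> [0.5, 0.1667, 0.3333] (rounded), for a 3 x 3 game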

5.6.3 Applications

Additionally, some of the techniques developed here could be applied in other contexts. Our polar projection is interesting as a technique to visually display high dimensional data in two

dimensions, but is unlikely to yield analytical results that couldn't be found simply by using radial distance and arc length on their own. Nevertheless, it may convey information more effectively and help recognize patterns in complicated high dimensional paths. Other techniques we use may help with studying neural networks with different training tasks. Permuting networks to align the neurons within them and projecting them into Euclidean space would allow for a similar analysis of paths and low-error networks. Computing correlations between neurons in different networks trained on the same task could help identify features that the networks share, effectively allowing one to detect how common or rare a neuron is based on how likely it is to show up in a set of networks.

Chapter 6 | Conclusion

6.1 Discussion

In this dissertation we have explored four different models, each with its own learning mechanism that results in different dynamics. Although some of the differences in these models come from the differences in the games they play, the majority are from the different types of players and the mechanisms they use to reason and learn. Here we discuss some of the similarities and differences of these models.

6.1.1 Types of Information

In the oracle games model, players are rational agents and start with complete but imperfect information. One player is given the opportunity to purchase perfect information, and if he does so and the oracle responds, he uses it to rationally compute the best response. This is the only model in which players are fully rational and consciously represent information in this way. Additionally, in this model learning does not accumulate over time: players play a one-shot game with a single moment in which information can be acquired, as opposed to slowly gathering more and more data over repeated games.

In the public goods model, players are informed of the gradient of best response when they play games, which is a sort of complete information, but they do not use the full payoff function to compute the Nash equilibrium, and only update incrementally. Players have imperfect information, and initially have no knowledge of what strategies other players intend to play. However, as they play games they interact with different subsets of the population, and thus gain understanding of their environment over time. Although players never create an actual mental representation of the population and

the strategies being played in it, the incremental update takes care of this for them. Players in a population of overcontributors will decrease their contribution on average, which allows them to improve their payoff in such a population. Similarly, players in a population of undercontributors will increase their contribution over time. Thus, players' contributions respond to the overall distribution of players, in some sense approximating this missing information.

In the rock paper scissors model, players can distinguish between a win/tie and a loss, which is sort of like complete information, but again they don't ever make calculations using actual payoffs, so it's ambiguous whether they truly "know" the payoffs of the game. Players also have imperfect information, with no initial knowledge about what the general population of players is going to play. The win-stay lose-shift strategy only allows players to store one bit of memory, so it's hard to say if individual players ever really acquire this information, but they seem to do so probabilistically. In a population with many paper players, each individual with a {S,R} card will be more likely to be playing scissors than rock at any given time after enough games have been played for them to update, which increases the average payoff of such players compared to playing either strategy with equal probability.

In the neural network model, players are given all 8 values of the payoff matrix, but have no preconceptions about what these mean or what the rules of game theory are, so they could be considered to have incomplete information. (Though an argument could be made that they do have complete information, depending on whether one considers the error function to be something the player computes for itself based on the strategies it observes, or something computed and attached to the training data externally, as is typical of other forms of training data.) Players also have imperfect information, since they play simultaneously and don't get to see the other player's strategy until after both have played. During training, players end up learning both types of information. They learn general strategies in game theory, allowing them to perform well even when paired against different opponents, but they also learn and adapt to trends in the behavior of their specific training partner, allowing them to perform even better when playing against that network.

In all of these models, players start with imperfect information, but can gain it in one form or another and thus adjust their strategies in an attempt to do better in the environment they are actually in. Determining the completeness of information is somewhat ambiguous for all of the subrational players, since it's not necessarily clear whether a player lacks information, or possesses the information but fails to use it in their

decision-making algorithm. We could resolve this in each model by explicitly defining whether or not players "know" the full payoffs and rules of the game, but since such a decision would be arbitrary and have no impact on the actual dynamics of the model, we refrain from doing so.

6.1.2 Benefit to players

In general, the goal of learning is to alter one's strategy in a way that improves one's payoff. Thus, we expect a player with a learning mechanism to receive higher payoffs than one who stays at the same strategy. Although we have focused primarily on population dynamics rather than the actual payoffs of players, we can still discuss qualitatively how payoffs are impacted by learning in each model.

In the oracle games model, the player with access to the oracle usually receives higher payoffs in equilibria of the game with the oracle than in equilibria of the game without it. The cheaper the oracle is, the more information he'll purchase and the higher his payoff will be. However, in Section 2.5 we discuss a counterexample where the presence of the oracle decreases his payoff in equilibrium due to the other player's reaction. Although the information is still useful, in that a unilateral decision not to purchase it would decrease his payoff even further, its very existence decreases the payoff of the player.

In the public goods model, learning tends to be beneficial for each individual player, though this depends on the overall structure of the population. If the population is undercontributing on average, a player is more likely to play in an undercontributing subgroup, and thus increase their future contribution and improve their future payoffs. However, if they happen to end up in an overcontributing group, they will decrease their contribution. This decrease might improve their payoff if they are paired with the exact same subgroup again, but it decreases their performance in other groups. Thus, an individual's expected payoff will tend to increase over time, but with infrequent decreases. Although the permanent freeloader we discuss in Section 3.5 can benefit from not learning for large group sizes, they will have decreased payoff for small group sizes. Additionally, a permanent overcontributor will have decreased payoff in both cases, and by a larger amount. Thus, if players' initial contributions are chosen randomly, a non-learning player will have lower payoff in expectation compared to a player that can learn.

The rock paper scissors model has similar results. The expected payoff of a player is simply the frequency of players they win against minus the frequency of players they lose against in the population. In any given population state, each player has two possible expected payoffs corresponding to their two strategies. Although Win-Stay Lose-Shift

will sometimes cause a player to shift to a strategy with worse payoff, by definition it shifts when a player loses, which makes it more likely to shift away from a strategy that is more likely to lose. Thus, a player using this learning mechanism will spend more time playing a strategy with a higher win rate, and so have a slightly higher expected payoff than a hypothetical player with a fixed strategy that is never updated.

In the neural network model, the learning method is incredibly important. Untrained networks have no built-in concept of game theory; they play random strategies and receive an expected payoff of 0, or an error of 100%. It is only by learning that they gain the ability to think strategically and play strategies with higher expected payoffs. Once they've trained for some time, further training tends to decrease their performance in certain games at about the same rate it increases their performance in other games, causing their error rate to stagnate or fluctuate, but overall the learning process is highly beneficial to the network.

Thus, we see that in most cases learning is beneficial to players, but there are some exceptions. Often this benefit is in expectation, with common increases in payoff mixed with occasional decreases.

Aside from the benefit of learning to the players doing the learning, we also consider its effect on the other players in the population. In the oracle games model, the player without the oracle usually receives a lower payoff in equilibria with the oracle compared to ones without the oracle. However, this is primarily due to our focus on strictly competitive games, and we discuss a counterexample in Section 2.6. It is not obvious which type of game is more common. In fact, for such a question to even be well-defined, one would have to select a measure on the space of matrix games, and the answer would be determined by this measure.

In the public goods model, players benefit from other players' contributions being as high as possible. Thus, when a player updates their strategy, they increase the expected payoffs of all other players whenever they increase their own contribution, and decrease them whenever they decrease their own contribution. If players' initial contributions are randomly chosen from a distribution centered around the fairpoint, as we typically do, then increases and decreases will be equally common. So as a first-order effect, a player's learning is neutral on the payoffs of other players. However, due to the nonlinearity of the payoff function, players receive a higher payoff for playing at the equilibrium point compared to a mix of over- and under-contributing groups; thus the convergence of the population towards the fairpoint, which happens due to their learning, slightly improves the payoffs of players, since they can play in more consistent groups.

In the rock paper scissors model, the players are playing a zero-sum game. Therefore, any increase in expected payoff players gain for themselves by learning automatically causes an equal decrease in expected payoff for other players.

In the neural network model, learning by one player is generally negative for the other player. Although it can allow both players to benefit in some games, such as coordination games, the continued learning and updating of a network creates a constantly changing target for the error function of its opponent. Networks perform better against a frozen opponent that is not able to learn than against a regular network that is, since the learning network is able to learn and exploit the frozen network's behavior in games with mixed-strategy equilibria without that behavior changing in response.

We see a mixture of positive and negative effects of learning throughout these models. This seems to depend more on the games players are playing than on which learning mechanism they are using, as better performance by one player in cooperative games will tend to help other players, while better performance by one player in competitive games will tend to harm other players.

6.1.3 Convergence to a Nash equilibrium

In the oracle model, players are defined as rational agents which always compute and play the Nash equilibrium. This is not surprising: there is no dynamical system here that could converge; players simply play the best strategy out of those available to them.

In the public goods model, there are infinitely many Nash equilibria. Each individual player updates their strategy in the direction of best response, but the actual best response depends on what other players are doing. When m = n, all players stay in the same formation until they eventually hit the Nash equilibrium that maintains this formation. When m < n, players are constantly playing in different groups, and thus the direction of best response often changes. However, in aggregate this causes the population to drift towards the fair point, which is the unique symmetric Nash equilibrium. Each player individually seeking to maximize their own payoff leads to a Nash equilibrium, despite the fact that players do not consciously calculate where it is. We note that due to the discrete stochastic nature of the model, players do not actually converge on it (except in the rare extinction case) but instead cluster within a few steps of it, though this clustering can be made arbitrarily close by decreasing the step-size of the update rule.

In the rock paper scissors model, players do not converge to the Nash equilibrium. Rock paper scissors does not have a pure strategy equilibrium, and players only play

pure strategies, so it is impossible for any individual game between two players to be an equilibrium. The population as a whole could converge to the mixed strategy equilibrium (1/3, 1/3, 1/3), but in general it clusters around an interior fixed point based on the proportions of each card in the population, which is only equal to the Nash equilibrium in the symmetric case where A = B = C = 1/3. Further, because the Win-Stay Lose-Shift strategy only cares whether values are above or below a certain threshold, we could change the payoffs of the game matrix in a way that kept all of them on the same side of the threshold value, which would change the Nash equilibrium without affecting the model's behavior in any way. This is the only one of the four models we consider in this dissertation in which such a change is possible.

In the neural network model, players play pure strategy equilibria for the most part, but not mixed strategy equilibria. Although players can play mixed strategies, these typically change rapidly as the networks continue to receive errors, and do not converge to any particular strategy. Players often play pure strategy equilibria in coordination games when they are faced with a player they are familiar with. Even though players have no hard-coded notion of Nash equilibria, the goal of maximizing their own payoff naturally drives them towards equilibria.

6.1.4 Intelligence

Intelligence is a complicated and nebulous concept which is difficult to rigorously define in a mathematical sense, though some research has been done in this direction [96]. There are several parameters that vary in our models which seem correlated with intelligence, such as the amount of memory a player has, the length of the player's learning algorithm or the simulation time required to compute it, or how well we would expect players with this learning mechanism to perform in a variety of situations with different games. Although these measures do not correlate perfectly in general, we feel that a strong case can be made for an ordering of intelligence among the four models considered in this dissertation.

The Win-Stay Lose-Shift method only remembers one bit of information: which state the player is currently in. Players only compute whether their received payoff is above a fixed threshold, and are thus insensitive to changes beyond this value. They cannot accumulate data over time, and often flip their strategy to one with a worse expected payoff because they only react to the results of a single game. Win-Stay Lose-Shift could be implemented in a variety of games using cards, but players with such strategies might not perform well at all, especially if the payoffs did not interact nicely with their threshold value. Thus, we consider these players to be the least intelligent.

Next are the players in the public goods model. Their memory consists of a real number, and thus they can aggregate data over a longer period of time. They move in the gradient of best response, which would potentially enable them to find Nash equilibria, or at least moderately good strategies, in a variety of games with variable strategies. However, if the payoff function contains local maxima that are inferior to distant global maxima, players may converge on suboptimal equilibria. Thus, we consider these players to be more intelligent than those in the rock paper scissors model, but less than those in the other models.

In the neural network model, players' memory consists of a real number for each of many weights (144 under our standard parameters), which interact with each other in a sophisticated pattern. Networks require many games in order to learn properly; however, once they do, they can perform well in a large range of games. Players are fairly robust since they are trained on all 2 × 2 games, though not perfectly, as they perform poorly on games with only mixed strategy equilibria. Thus, we consider these to be the second most intelligent type of player.

In the oracle games model, players are defined as rational agents. This means they are perfectly intelligent, able to take any information in any game and compute the optimal strategy with perfect accuracy. Thus, we consider these to be the most intelligent type of player. However, we note that in some sense this intelligence is artificial, defined into these hypothetical players. The players in the other three models have their reasoning processes defined by an algorithm and can be programmed and simulated, with their actions determined by the results. Meanwhile, rational players are defined as always taking the best action, and their method of deducing this action is left ambiguous. In practice, we mathematically calculate the Nash equilibrium for each particular game and then post hoc declare it to be the action of the rational agents, while for simulated agents it is the other way around.

6.2 Future Research

We discuss possible directions for future research related to each individual model in the discussion section of the corresponding chapter. However, there is much more research that could be done studying learning and information in game theory. Although we study four different methods of learning in this dissertation, we have by no means exhausted the set of possible methods, so future research might construct models with different learning mechanisms and study the behavior that emerges as a result.

Aside from looking at individual models with different learning mechanisms, research should be done directly comparing different learning methods within the same model. Similar to the different cards in our rock paper scissors model, a heterogeneous population could be made whose members interact with each other and play the same game. Rather than varying players by restricting them to different subsets of strategies, players could vary in how they acquire and use information to update their strategy. This could be done by having completely different learning mechanisms, such as having one type of player incrementally update their strategy while the other switches between two very different strategies using Win-Stay Lose-Shift, or by having agents with the same fundamental learning mechanism but different parameters, such as two players who incrementally update their strategy with different step sizes.

As a basic example, one could put agents using learning mechanisms from all four of our models into the public goods model from Chapter 3. Some players would be incrementally updating players and would behave as normal in this model. Some players would have cards with two contribution amounts on them and some payoff threshold, and would flip their card every time they received a payoff below that threshold. Some players would determine their contribution using a neural network. Since the game's rules are the same every time, the neural network would have to use other data as inputs, such as the number of each type of player in the currently chosen group, or a history of the most recent group contributions that have been played. Some players would be rational players with the option to pay an oracle for a chance of it telling them the contributions of the other players, though this would have to be implemented in a way that allowed multiple oracle players to play in the same group. The long term behavior of each type of player could then be studied, as well as their average payoffs, and how these depended on the frequencies of each type of player in the population (a minimal skeleton of such a mixed population is sketched at the end of this section).

We do not anticipate this specific model being the most fruitful version of this idea; it is just one example out of many possibilities. A more sophisticated model would choose learning types and the underlying game in a way that fit together nicely and would elucidate certain details that the researchers wanted to explore. A model with multiple different games that players switch between could explore a tradeoff between a learning mechanism with good robustness among different games but mediocre performance in each, and one which performs well in certain games but poorly in others. A more rigorous measurement of intelligence, combined with a penalty for more intelligent agents, could explore which situations made such a trade worth the cost, similar to the costly computation model by Halpern and Pass [23]. A model with a gradually changing ruleset

could explore the tradeoff between quickly adapting agents, which struggle to converge, and slowly changing agents, which take longer to reach a new equilibrium but settle closely into it once they do.

The models we explore demonstrate the usefulness of learning in games, but also show that this learning can take a wide range of forms and still be useful. Simpler learning rules tend to be less robust than the more complex ones, but can still perform well when put in the right situation. The possibilities are too numerous to list exhaustively, and future research into learning dynamics may go in a completely different direction from what we have postulated here. However, we expect that some of the models and techniques developed in this dissertation will be useful for future investigations into the areas of game theory and learning.
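The following minimal Python skeleton illustrates the kind of heterogeneous population described above, with incremental updaters and win-stay lose-shift card players sharing one public goods loop. It is only a structural sketch: the payoff function, update rules, thresholds, and all parameter values are crude placeholders of our own, not the nonlinear model of Chapter 3, and the neural network and oracle player types are omitted.

import numpy as np

rng = np.random.default_rng(1)

# Placeholder public-goods payoff: the group's contributions are scaled and
# shared equally. This is NOT the nonlinear payoff function of Chapter 3.
def payoff(own, group_total, group_size, r=1.6):
    return r * group_total / group_size - own

class IncrementalPlayer:
    """Nudges its contribution up if its payoff improved on the previous round,
    down otherwise (a crude stand-in for an incremental update rule)."""
    def __init__(self, step=0.02):
        self.c, self.step, self.last = rng.uniform(0, 1), step, 0.0
    def contribute(self):
        return self.c
    def update(self, pay):
        self.c = float(np.clip(self.c + self.step * np.sign(pay - self.last), 0, 1))
        self.last = pay

class CardPlayer:
    """Win-stay lose-shift between two fixed contribution levels: flips its card
    whenever its payoff falls below a threshold."""
    def __init__(self, low=0.1, high=0.9, threshold=0.0):
        self.levels, self.side, self.threshold = (low, high), int(rng.integers(2)), threshold
    def contribute(self):
        return self.levels[self.side]
    def update(self, pay):
        if pay < self.threshold:
            self.side = 1 - self.side

players = [IncrementalPlayer() for _ in range(20)] + [CardPlayer() for _ in range(20)]
for _ in range(5000):
    group = rng.choice(players, size=4, replace=False)   # a randomly sampled subgroup plays one round
    contribs = [p.contribute() for p in group]
    for p, c in zip(group, contribs):
        p.update(payoff(c, sum(contribs), len(group)))

print(np.mean([p.contribute() for p in players[:20]]),   # mean contribution, incremental updaters
      np.mean([p.contribute() for p in players[20:]]))   # mean contribution, card players

The long-term contribution levels and payoffs of each type, and how they depend on the population mix, are the quantities such a study would track.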

Appendix | Master Equation to Langevin Derivation from Chapter 4

Starting with the master equation,

\[
\begin{aligned}
\frac{d}{dt}p_{x,y}(t) = 2\gamma\Big[ &-\big(x(A-x)+xy\big)p_{x,y} - \big(x(1-A-y)+y(1-x-y)\big)p_{x,y} \\
&+ \Big(x-\tfrac{1}{N}\Big)\Big(A-x+\tfrac{1}{N}\Big)p_{x-\frac{1}{N},y} + \Big(x+\tfrac{1}{N}\Big)y\,p_{x+\frac{1}{N},y} \\
&+ x\Big(1-A-y+\tfrac{1}{N}\Big)p_{x,y-\frac{1}{N}} + \Big(y+\tfrac{1}{N}\Big)\Big(1-x-y-\tfrac{1}{N}\Big)p_{x,y+\frac{1}{N}} \Big]
\end{aligned}
\]

To simplify the notation in future steps, let

\[
\begin{aligned}
B_x &= \phi_{x+}/2\gamma = x(A-x), \\
D_x &= \phi_{x-}/2\gamma = xy, \\
B_y &= \phi_{y+}/2\gamma = x(1-A-y), \\
D_y &= \phi_{y-}/2\gamma = y(1-x-y).
\end{aligned}
\]

Substituting these in yields

\[
\begin{aligned}
\frac{d}{dt}p_{x,y}(t) = 2\gamma\Big[ &(-B_x - D_x - B_y - D_y)p_{x,y} \\
&+ \Big(B_x - \tfrac{A}{N} - \tfrac{1}{N^2}\Big)p_{x-\frac{1}{N},y} + \Big(D_x + \tfrac{y}{N}\Big)p_{x+\frac{1}{N},y} \\
&+ \Big(B_y + \tfrac{x}{N}\Big)p_{x,y-\frac{1}{N}} + \Big(D_y + \tfrac{1-x-2y}{N} - \tfrac{1}{N^2}\Big)p_{x,y+\frac{1}{N}} \Big]
\end{aligned}
\]

Let ν = 1/N. Let E_x^ν be an operator such that E_x^ν p_{x,y} = p_{x+ν,y}, and E_y^ν an operator such that E_y^ν p_{x,y} = p_{x,y+ν}. Then we get

\[
\begin{aligned}
\frac{d}{dt}p_{x,y}(t) = 2\gamma\Big[ &(-B_x - D_x - B_y - D_y) \\
&+ (B_x - \nu A - \nu^2)E_x^{-\nu} + (D_x + \nu y)E_x^{\nu} \\
&+ (B_y + \nu x)E_y^{-\nu} + \big(D_y + \nu(1-x-2y) - \nu^2\big)E_y^{\nu} \Big] p_{x,y}
\end{aligned}
\]

In the limit as ν goes to zero, we can take the Taylor expansion of these operators,

\[
E_x^{\nu} = 1 + \nu\frac{\partial}{\partial x} + \frac{\nu^2}{2}\frac{\partial^2}{\partial x^2} + \ldots, \qquad
E_y^{\nu} = 1 + \nu\frac{\partial}{\partial y} + \frac{\nu^2}{2}\frac{\partial^2}{\partial y^2} + \ldots
\]
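As a worked intermediate step (ours, not spelled out in the original), applying these expansions to p_{x,y} simply recovers the shifted probabilities; in particular, the negative-shift operator used below picks up alternating signs:

\[
E_x^{-\nu} p_{x,y} = p_{x-\nu,y} = p_{x,y} - \nu\frac{\partial p_{x,y}}{\partial x} + \frac{\nu^2}{2}\frac{\partial^2 p_{x,y}}{\partial x^2} - \ldots
\]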

Substituting these expansions into the previous equation yields

\[
\begin{aligned}
\frac{d}{dt}p_{x,y}(t) = 2\gamma\Big[ &(-B_x - D_x - B_y - D_y) \\
&+ (B_x - \nu A - \nu^2)\Big(1 - \nu\frac{\partial}{\partial x} + \frac{\nu^2}{2}\frac{\partial^2}{\partial x^2} + \ldots\Big) \\
&+ (D_x + \nu y)\Big(1 + \nu\frac{\partial}{\partial x} + \frac{\nu^2}{2}\frac{\partial^2}{\partial x^2} + \ldots\Big) \\
&+ (B_y + \nu x)\Big(1 - \nu\frac{\partial}{\partial y} + \frac{\nu^2}{2}\frac{\partial^2}{\partial y^2} + \ldots\Big) \\
&+ \big(D_y + \nu(1-x-2y) - \nu^2\big)\Big(1 + \nu\frac{\partial}{\partial y} + \frac{\nu^2}{2}\frac{\partial^2}{\partial y^2} + \ldots\Big) \Big] p_{x,y}
\end{aligned}
\]

We now distribute, rearrange and simplify terms by differential order, and as an approximation drop all terms of order greater than 2. Additionally, as N increases, the change in x and y each time a game is played goes to zero, so to compensate for this we wish to increase the number of games played per unit of time at the same rate. This means we multiply everything by N, which is equivalent to dividing by ν. All of this together yields:

\[
\begin{aligned}
\frac{d}{dt}p_{x,y}(t) = 2\gamma\Big[ &\big[(1-y-A) - \nu\big] + \big[(-B_x + D_x) + (A+y)\nu + \nu^2\big]\frac{\partial}{\partial x} \\
&+ \frac{1}{2}\big[(B_x + D_x)\nu + (1-y-A)\nu^2 - \nu^3\big]\frac{\partial^2}{\partial x^2} \\
&+ \big[(1-2y) - \nu\big] + \big[(-B_y + D_y) + (1-2x-2y)\nu - \nu^2\big]\frac{\partial}{\partial y} \\
&+ \frac{1}{2}\big[(B_y + D_y)\nu + (1-2y)\nu^2 - \nu^3\big]\frac{\partial^2}{\partial y^2} \Big] p_{x,y}
\end{aligned}
\]

which is the Fokker-Planck equation for this model. This is like the master equation, in that it is a set of deterministic differential equations in p_{x,y} which represents the probability mass of the system being in each state (x, y). However, it is now continuous in x and y as well as t. Once in this form, we can use the continuous nature of the Fokker-Planck equation to construct a stochastic model that corresponds to the probabilities it represents. This will mimic the original model, but with continuous x, y, and t. We note that our Fokker-Planck equation has the form

\[
\frac{d}{dt}p_{x,y}(t) = C\Big[ -F_1\frac{\partial}{\partial x} + \frac{1}{2}G_1\frac{\partial^2}{\partial x^2} - F_2\frac{\partial}{\partial y} + \frac{1}{2}G_2\frac{\partial^2}{\partial y^2} \Big] p_{x,y}
\]

where F_1 and F_2 are drift terms and G_1 and G_2 are diffusion terms [97]. For a Fokker-Planck equation in our form, the Langevin equation

\[
\begin{aligned}
dx &= F_1(x,y)\,dt + \sqrt{G_1(x,y)}\,dW_t^x \\
dy &= F_2(x,y)\,dt + \sqrt{G_2(x,y)}\,dW_t^y
\end{aligned}
\]

will approximate the original model [97,98]. When N is very large, ν is very small, so higher powers of ν will be insignificant compared to lower powers, and we additionally approximate by dropping the higher powers of ν in each term. Putting all of these together gives the Langevin equations

\[
\begin{aligned}
dx &= (B_x - D_x)\,dt + \sqrt{(B_x + D_x)\nu}\,dW_t^x \\
dy &= (B_y - D_y)\,dt + \sqrt{(B_y + D_y)\nu}\,dW_t^y
\end{aligned}
\]

which, substituting our original variables back in, gives

\[
\begin{aligned}
dx &= 2\gamma\big(x(A-x) - xy\big)\,dt + \frac{1}{\sqrt{N}}\sqrt{2\gamma\big(x(A-x) + xy\big)}\,dW_t^x \\
dy &= 2\gamma\big(x(1-A-y) - y(1-x-y)\big)\,dt + \frac{1}{\sqrt{N}}\sqrt{2\gamma\big(x(1-A-y) + y(1-x-y)\big)}\,dW_t^y
\end{aligned}
\]

which is the Langevin equation for the Fokker-Planck model we define on page 89.
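As a sanity check on this result (our own illustration, not part of Chapter 4), the Langevin equations above can be integrated numerically with a simple Euler-Maruyama scheme; the values of γ, A, N, the initial condition, and the step size below are arbitrary placeholders.

import numpy as np

def simulate_langevin(A=0.4, gamma=1.0, N=1000, x0=0.2, y0=0.3,
                      dt=1e-3, steps=20_000, rng=np.random.default_rng(0)):
    """Euler-Maruyama integration of the Langevin equations derived above.
    Parameter values here are illustrative, not those used in Chapter 4."""
    x, y = x0, y0
    path = np.empty((steps, 2))
    for t in range(steps):
        bx, dx_ = x * (A - x), x * y                  # B_x, D_x
        by, dy_ = x * (1 - A - y), y * (1 - x - y)    # B_y, D_y
        dWx, dWy = rng.normal(scale=np.sqrt(dt), size=2)
        x += 2 * gamma * (bx - dx_) * dt + np.sqrt(2 * gamma * max(bx + dx_, 0.0) / N) * dWx
        y += 2 * gamma * (by - dy_) * dt + np.sqrt(2 * gamma * max(by + dy_, 0.0) / N) * dWy
        x, y = np.clip(x, 0.0, 1.0), np.clip(y, 0.0, 1.0)   # keep fractions in [0, 1]
        path[t] = (x, y)
    return path

path = simulate_langevin()
print(path[-1])   # final (x, y) of one sample path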

Bibliography

[1] Smith, J. M. and G. R. Price (1973) “The logic of animal conflict,” Nature, 246(5427), p. 15.

[2] Hofbauer, J. and K. Sigmund (2003) “Evolutionary game dynamics,” Bulletin of the American Mathematical Society, 40(4), pp. 479–519.

[3] Schuster, P. and K. Sigmund (1983) "Replicator dynamics," Journal of theoretical biology, 100(3), pp. 533–538.

[4] Arthur, W. B. (1994) "Inductive reasoning and bounded rationality," The American economic review, 84(2), pp. 406–411.

[5] Harsanyi, J. C. (1967) “Games with incomplete information played by “Bayesian” players, I–III Part I. The basic model,” Management Science, 14(3), pp. 159–182.

[6] Bonabeau, E. (2002) "Agent-based modeling: Methods and techniques for simulating human systems," Proceedings of the national academy of sciences, 99(suppl 3), pp. 7280–7287.

[7] Yang, C., S. O. Prasher, J. Landry, and A. DiTommaso (2000) “Application of artificial neural networks in image recognition and classification of crop and weeds,” Canadian agricultural engineering, 42(3), pp. 147–152.

[8] Young, M. J. and A. Belmonte (2020) "Simultaneous games with purchase of randomly supplied perfect information: Oracle Games," arXiv preprint arXiv:2002.08309.

[9] Eliaz, K. and A. Schotter (2010) “Paying for confidence: An experimental study of the demand for non-instrumental information,” Games and Economic Behavior, 70(2), pp. 304–324.

[10] McDonald, J. (1996) Strategy in poker, business and war, WW Norton and Company.

[11] Morris, S. and H. S. Shin (2002) “Social value of public information,” American Economic Review, 92(5), pp. 1521–1534.

[12] Asahina, K., V. Pavlenkovich, and L. B. Vosshall (2008) "The survival advantage of olfaction in a competitive environment," Current Biology, 18(15), pp. 1153–1155.

[13] Gabaix, X., D. Laibson, G. Moloche, and S. Weinberg (2006) “Costly information acquisition: Experimental analysis of a boundedly rational model,” American Economic Review, 96(4), pp. 1043–1068.

[14] Hellwig, C. and L. Veldkamp (2009) “Knowing what others know: Coordination motives in information acquisition,” The Review of Economic Studies, 76(1), pp. 223–251.

[15] Myatt, D. P. and C. Wallace (2012) “Endogenous information acquisition in coordination games,” The Review of Economic Studies, 79(1), pp. 340–374.

[16] Rigos, A. (2018) “Flexible Information Acquisition in Large Coordination Games,” Preprint at https://swopec.hhs.se/lunewp/abs/lunewp2018_030.htm.

[17] Myatt, D. P. and C. Wallace (2015) "Cournot competition and the social value of information," Journal of Economic Theory, 158, pp. 466–506.

[18] Yang, M. (2015) “Coordination with flexible information acquisition,” Journal of Economic Theory, 158, pp. 721–738.

[19] Szkup, M. and I. Trevino (2015) “Information acquisition in global games of regime change,” Journal of Economic Theory, 160, pp. 387–428.

[20] Li, Z., H. Yang, and L. Zhang (2019) “Pre-communication in a coordination game with incomplete information,” International Journal of Game Theory, 48(1), pp. 109–141.

[21] Hu, Y., J. Kagel, H. Yang, and L. Zhang (2018) “The effects of pre-play communication in a coordination game with incomplete information,” SSRN.

[22] Martinelli, C. (2007) “Rational ignorance and voting behavior,” International Journal of Game Theory, 35(3), pp. 315–335.

[23] Halpern, J. Y. and R. Pass (2015) “Algorithmic rationality: Game theory with costly computation,” Journal of Economic Theory, 156, pp. 246–268.

[24] Ben-Porath, E. and M. Kahneman (2003) “Communication in repeated games with costly monitoring,” Games and Economic Behavior, 44(2), pp. 227–250.

[25] Flesch, J. and A. Perea (2009) “Repeated games with voluntary information purchase,” Games and Economic Behavior, 66(1), pp. 126–145.

[26] Miklós-Thal, J. and H. Schumacher (2013) “The value of recommendations,” Games and Economic Behavior, 79, pp. 132–147.

[27] Sakai, Y. (1986) "Cournot and Bertrand equilibria under imperfect information," Journal of Economics, 46(3), pp. 213–232.

[28] Ruiz-Hernández, D., J. Elizalde, and D. Delgado-Gómez (2017) "Cournot–Stackelberg games in competitive delocation," Annals of Operations Research, 256(1), pp. 149–170.

[29] Halpern, J. Y. and R. Pass (2018) “Game theory with translucent players,” International Journal of Game Theory, 47(3), pp. 949–976.

[30] Antonioni, A., M. P. Cacault, R. Lalive, and M. Tomassini (2014) “Know thy neighbor: Costly information can hurt cooperation in dynamic networks,” PloS One, 9(10), p. e110788.

[31] Solan, E. and L. Yariv (2004) “Games with espionage,” Games and Economic Behavior, 47(1), pp. 172–199.

[32] González-Díaz, J., I. García-Jurado, and M. G. Fiestras-Janeiro (2010) An introductory course on mathematical game theory, American Mathematical Society.

[33] Nowak, M. (2006) “Five rules for the evolution of cooperation,” Science, 314, pp. 1560–1563.

[34] Hamburger, H. (1973) "N-person prisoner's dilemma," Journal of Mathematical Sociology, 3(1), pp. 27–48.

[35] West, S., A. Griffen, and A. Gardner (2007) “Evolutionary explanations for cooperation.” Current Biology, 17(16), pp. R661–R672.

[36] Hamilton, W. (1964) “The genetical theory of social behaviour, I, II.” Journal of Theoretical Biology, 7, pp. 17–52.

[37] Grafen, A. (1984) "Natural selection, kin selection and group selection," Behavioural ecology, 2, pp. 62–84.

[38] Brown, G. E. and J. A. Brown (1996) “Kin discrimination in salmonids,” Reviews in Fish Biology and Fisheries, 6(2), pp. 201–219.

[39] Strassmann, J. E., O. M. Gilbert, and D. C. Queller (2011) "Kin discrimination and cooperation in microbes," Annual review of microbiology, 65, pp. 349–367.

[40] Miller, S. and J. Knowles (2016) “The emergence of cooperation in public goods games on randomly growing dynamic networks,” , pp. 363–378.

[41] Hauert, C., S. D. Monte, J. Hofbauer, and K. Sigmund (2002) “Volunteering as red queen mechanism for cooperation in public goods games,” Science, 296(5570), pp. 1129–1132.

[42] Fehr, E. and S. Gächter (2000) "Cooperation and Punishment in Public Goods Experiments," American Economic Review, 90(4), pp. 980–994.

[43] Hauert, C., A. Traulsen, H. D. Silva, M. Nowak, and K. Sigmund (2008) “Public goods with punishment and abstaining in finite and infinite populations,” Biological Theory, 3(2), pp. 114–122.

[44] Hauert, C. (2010) “Replicator dynamics of reward and reputation in public goods games,” Journal of Theoretical Biology, 267, pp. 22–8.

[45] Archetti, M. and I. Scheuring (2012) “Game theory of public goods in one-shot social dilemmas without assortment,” Journal of theoretical biology, 299, pp. 9–20.

[46] Hemker, H. and P. Hemker (1969) "The kinetics of enzyme cascade systems: General kinetics of enzyme cascades," Proceedings of the Royal Society of London. Series B. Biological Sciences, 173(1032), pp. 411–420.

[47] Packer, C., D. Scheel, and A. E. Pusey (1990) “Why lions form groups: food is not enough,” The American Naturalist, 136(1), pp. 1–19.

[48] Bshary, R. (2010) “Cooperation between unrelated individuals—a game theoretic approach,” in Animal behaviour: evolution and mechanisms (P. Kappeler, ed.), chap. 8, Springer, pp. 213–240.

[49] Wedekind, C. and M. Milinski (2000) “Cooperation through image scoring in humans,” Science, 288(5467), pp. 850–852.

[50] Soares, M. C., R. Bshary, S. C. Cardoso, and I. M. Côté (2008) “The meaning of jolts by fish clients of cleaning gobies,” Ethology, 114(3), pp. 209–214.

[51] Hamilton, W. D. (1971) “Geometry for the selfish herd,” Journal of theoretical Biology, 31(2), pp. 295–311.

[52] Clutton-Brock, T., P. Brotherton, M. O'Riain, A. Griffin, D. Gaynor, L. Sharpe, R. Kansky, M. B. Manser, and G. McIlrath (2000) "Individual contributions to babysitting in a cooperative mongoose, Suricata suricatta," Proceedings of the Royal Society of London. Series B: Biological Sciences, 267(1440), pp. 301–305.

[53] Camerer, C. F. (2010) "Behavioural game theory," in Behavioural and Experimental Economics, Springer, pp. 42–50.

[54] Orbell, J. M., A. J. Van de Kragt, and R. M. Dawes (1988) “Explaining discussion-induced cooperation.” Journal of Personality and social Psychology, 54(5), p. 811.

[55] Ostrom, E., R. Gardner, J. Walker, and J. Walker (1994) Rules, games, and common-pool resources, University of Michigan Press.

[56] Güth, W., R. Schmittberger, and B. Schwarze (1982) "An experimental analysis of ultimatum bargaining," Journal of economic behavior & organization, 3(4), pp. 367–388.

[57] Rabin, M. (1993) “Incorporating fairness into game theory and economics,” The American economic review, pp. 1281–1302.

[58] Zhu, Q., S. Rajtmajer, and A. Belmonte (in preparation) "The emergence of fairness in an agent-based ultimatum game."

[59] Rajtmajer, S., A. Squicciarini, J. M. Such, J. Semonsen, and A. Belmonte (2017) “An ultimatum game model for the evolution of privacy in jointly managed content,” in International Conference on Decision and Game Theory for Security, Springer, pp. 112–130.

[60] Santos, F. P., F. C. Santos, A. Paiva, and J. M. Pacheco (2015) "Evolutionary dynamics of group fairness," Journal of theoretical biology, 378, pp. 96–102.

[61] Gore, J., H. Youk, and A. Van Oudenaarden (2009) “Snowdrift game dynamics and facultative cheating in yeast,” Nature, 459(7244), p. 253.

[62] Axelrod, R. (1980) “Effective choice in the prisoner’s dilemma,” Journal of conflict resolution, 24(1), pp. 3–25.

[63] ——— (1980) “More effective choice in the prisoner’s dilemma,” Journal of Conflict Resolution, 24(3), pp. 379–403.

[64] Robbins, H. (1952) “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, 58(5), pp. 527–535.

[65] Kraines, D. and V. Kraines (1989) “Pavlov and the prisoner’s dilemma,” Theory and decision, 26(1), pp. 47–79.

[66] Nowak, M. and K. Sigmund (1993) “A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game,” Nature, 364(6432), p. 56.

[67] Imhof, L. A., D. Fudenberg, and M. A. Nowak (2007) “Tit-for-tat or win-stay, lose-shift?” Journal of theoretical biology, 247(3), pp. 574–580.

[68] Sinervo, B. and C. M. Lively (1996) “The rock–paper–scissors game and the evolution of alternative male strategies,” Nature, 380(6571), p. 240.

[69] Bleay, C., T. Comendant, and B. Sinervo (2007) “An experimental test of frequency-dependent selection on male mating strategy in the field,” Proceedings of the Royal Society B: Biological Sciences, 274(1621), pp. 2019–2025.

[70] Mills, S. C., L. Hazard, L. Lancaster, T. Mappes, D. Miles, T. A. Oksanen, and B. Sinervo (2008) "Gonadotropin hormone modulation of testosterone, immune function, performance, and behavioral trade-offs among male morphs of the lizard Uta stansburiana," The American Naturalist, 171(3), pp. 339–357.

[71] Sinervo, B., D. B. Miles, W. A. Frankino, M. Klukowski, and D. F. DeNardo (2000) "Testosterone, endurance, and Darwinian fitness: natural and sexual selection on the physiological bases of alternative male behaviors in side-blotched lizards," Hormones and Behavior, 38(4), pp. 222–233.

[72] Chalfoun, A. D. and T. E. Martin (2010) “Facultative nest patch shifts in response to nest predation risk in the Brewer’s sparrow: a “win-stay, lose-switch” strategy?” Oecologia, 163(4), pp. 885–892.

[73] McCoy, A. N. and M. L. Platt (2005) “Risk-sensitive neurons in macaque posterior cingulate cortex,” Nature neuroscience, 8(9), p. 1220.

[74] Hayden, B. Y. and M. L. Platt (2009) “Gambling for Gatorade: risk-sensitive decision making for fluid rewards in humans,” Animal cognition, 12(1), pp. 201–207.

[75] Worthy, D. A., M. J. Hawthorne, and A. R. Otto (2013) “Heterogeneity of strategy use in the Iowa gambling task: A comparison of win-stay/lose-shift and reinforcement learning models,” Psychonomic bulletin & review, 20(2), pp. 364–371.

[76] Olton, D. S. and P. Schlosberg (1978) “Food-searching strategies in young rats: Win-shift predominates over win-stay.” Journal of Comparative and Physiological Psychology, 92(4), p. 609.

[77] Means, L. W. (1988) “Rats acquire win-stay more readily than win-shift in a water escape situation,” Animal Learning & Behavior, 16(3), pp. 303–311.

[78] Rosenblatt, F. (1958) “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review, 65(6), p. 386.

[79] Marini, F., R. Bucci, A. Magrì, and A. Magrì (2008) “Artificial neural networks in chemometrics: History, examples and perspectives,” Microchemical journal, 88(2), pp. 178–185.

[80] Wasserman, P. D. and T. Schwartz (1988) “Neural networks. II. What are they and why is everybody so interested in them now?” IEEE Expert, 3(1), pp. 10–15.

[81] Warner, B. and M. Misra (1996) “Understanding neural networks as statistical tools,” The american statistician, 50(4), pp. 284–293.

[82] Iyer, R., Y. Li, H. Li, M. Lewis, R. Sundar, and K. Sycara (2018) "Transparency and explanation in deep reinforcement learning neural networks," in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 144–150.

[83] Bologna, G. and Y. Hayashi (2017) "Characterization of symbolic rules embedded in deep DIMLP networks: a challenge to transparency of deep learning," Journal of Artificial Intelligence and Soft Computing Research, 7(4), pp. 265–286.

[84] Bhatia, S. and R. Golman (2014) “A recurrent neural network for game theoretic decision making,” in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36.

[85] Schuster, A. and Y. Yamaguchi (2010) “Application of game theory to neuronal networks,” Advances in Artificial Intelligence, 2010.

[86] Choudhury, N. D. and S. Goswami (2009) "Transmission loss allocation using game theory based artificial neural networks," in 2009 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, vol. 1, IEEE, pp. 186–189.

[87] Radford, A., L. Metz, and S. Chintala (2015) “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434.

[88] Hassibi, B. and D. G. Stork (1993) “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, pp. 164–171.

[89] Fürnkranz, J. (1997) “Pruning algorithms for rule learning,” Machine learning, 27(2), pp. 139–172.

[90] Quinlan, J. R. (1987) "Simplifying decision trees," International journal of man-machine studies, 27(3), pp. 221–234.

[91] Hinton, G. E., N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580.

[92] Mitchell, M. (1998) An introduction to genetic algorithms, MIT press.

[93] Stahl, D. O. (1993) "Evolution of Smart_n players," Games and Economic Behavior, 5(4), pp. 604–617.

[94] Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004) “A cognitive hierarchy model of games,” The Quarterly Journal of Economics, 119(3), pp. 861–898.

[95] Brañas-Garza, P., T. Garcia-Munoz, and R. H. González (2012) “Cognitive effort in the beauty contest game,” Journal of Economic Behavior & Organization, 83(2), pp. 254–260.

[96] Legg, S. and M. Hutter (2007) “Universal intelligence: A definition of machine intelligence,” Minds and machines, 17(4), pp. 391–444.

[97] Öttinger, H. C. (2012) Stochastic processes in polymeric fluids: tools and examples for developing simulation algorithms, Springer Science & Business Media.

[98] Van Kampen, N. G. (1992) Stochastic processes in physics and chemistry, vol. 1, Elsevier.

Vita

Matthew Young

Education

• Ph.D. Mathematics, Pennsylvania State University, 2020

• B.S. Mathematics and Physics, University of Oklahoma, 2014

Selected Presentations

Non-cooperative strategic games with cost of partial information: Oracle Games poster, SIAM Conference on Applications of Dynamical Systems, Snowbird UT, May 2017.

Convergence to Fair Contributions in a Stochastic Nonlinear Public Goods Game With Random Subgroup Associations, Joint Mathematics Meetings, Baltimore MD, January 2019.

Population dynamics in cyclic games with restricted strategy transitions, Barry Sinervo’s game theory class, University of California Santa Cruz, March 2020.