Different Modes of Discounting in Repeated Games

Stephen Wolff*, under the guidance of Jérôme Renault and Thomas Mariotti
Toulouse School of Economics

September 3, 2011

Abstract

In this paper we study different modes of discounting in infinitely repeated games with perfect monitoring. In particular, we explore the equilibrium payoff set of the infinitely repeated symmetric two-player prisoner's dilemma under geometric and quasi-hyperbolic discounting. For each mode of discounting, we analyze the settings in which players have identical and different discount factors, and we provide an explicit characterization of a payoff profile on the Pareto frontier.

1 Introduction

Repeated games have become a centerpiece of economic theory, describing repeated interactions among strategically thinking individuals. In most studies involving repeated games, two assumptions are made. The first is that players discount geometrically. Geometric discounting is one of the simplest and most mathematically tractable modes of discounting. However, research suggests that humans discount hyperbolically (see Ainslie (1992)), assigning more weight to payoffs that occur closer to the present, a phenomenon known as "present bias". Laibson (1997) popularized a family of discount functions that approximate hyperbolic discounting, capturing its key features while still allowing tractable mathematical analysis. This family of functions, called "quasi-hyperbolic" discount functions, is the form usually encountered in the literature and the form we will use in this paper. The second assumption commonly made in the game-theoretic literature is that players use identical discount factors. While valid for certain settings, this assumption is by no means universally appropriate, and in fact it hides a time-dynamic aspect of the more general case. When players have different discount factors, the possibility of intertemporal trade becomes available. In particular, a relatively patient player can agree to grant a relatively impatient player higher rewards in earlier periods (which the impatient player weights more heavily) in exchange for higher rewards to the patient player in later periods (which the patient player weights more heavily). Such intertemporal trade can yield discounted total payoffs for the players that lie outside the set of feasible rewards of the stage game. With hyperbolic discounting another novel aspect arises, namely, time inconsistency of players' planned actions. A player exhibiting present bias may plan to follow one strategy to maximize her total repeated-game payoff; but because of her present bias, when she actually finds herself in later periods, she may prefer to play a different strategy.

∗Contact Stephen.Wolff@rice.edu with comments, corrections, or suggestions.


      L       R
 T  1, 1   −1, 2
 B  2, −1   0, 0

Figure 1: The Prisoner’s Dilemma.

This time inconsistency means that subgame-perfect equilibria are no longer stable across time a priori, and we must refine our equilibrium concept, using instead the idea of Strotz-Pollak equilibrium. This paper is organized as follows. We start by detailing the model we will use in our study. In section 3 we analyze the model under geometric discounting, first laying out the well-studied baseline case of identical discount factors before moving to the case of different discount factors. Next, in section 4 we turn to hyperbolic discounting, again looking at the cases of identical and different discount factors. In section 5 we summarize our discussion and give some concluding remarks.

2 The Model

Consider a stage game $G = (N, (A_i)_{i \in N}, (u_i)_{i \in N})$, where $N = \{1, \ldots, n\}$ is the (finite) set of players, $A_i$ is the (finite) set of actions for player $i$, and $u_i : A \to \mathbb{R}$ is the utility function for player $i$, describing the utility player $i$ receives from each pure action profile in $A = A_1 \times \cdots \times A_n$. We allow players to use mixed actions in the stage game, so that the relevant stage game is the mixed extension of $G$, in which each player $i$ chooses an action $\alpha_i \in \Delta A_i$ and receives the expected stage-game utility, or reward, given by
$$u_i(\alpha) = \sum_{a \in A} \left( \prod_{j=1}^{n} \alpha_j(a_j) \right) u_i(a),$$
where $\alpha_j(a_j)$ denotes the probability assigned by the mixed action $\alpha_j$ to the pure action $a_j$.

The particular stage game that we will use throughout this study is the symmetric two-player prisoner's dilemma shown in Figure 1. We will assume the existence of a public randomization device; this ensures that the set of stage-game rewards (and, later in the analysis, the set of continuation payoffs and the set of equilibrium payoffs) is convex. Let $P$ denote the set of outcomes of the public randomization device. Given a public randomization device, the set $V$ of feasible stage-game rewards is the convex hull of the set of pure-action rewards:

V = conv{u(a) | a ∈ A}.

Let $\underline{v}_i$ denote the minimax reward for player $i$ in the stage game $G$:

$$\underline{v}_i = \min_{\alpha_{-i} \in \Delta A_{-i}} \; \max_{a_i \in A_i} u_i(a_i, \alpha_{-i}).$$

Let $IR_i = \{v \in V \mid v_i \ge \underline{v}_i\}$ denote the half-space of individually rational rewards for player $i$, and let $IR$ denote the intersection of all such half-spaces over all players:
$$IR = \bigcap_{i \in N} IR_i = \{v \in V \mid v_i \ge \underline{v}_i \;\; \forall i \in N\}.$$


Similarly, let $IR_i^* = \{v \in V \mid v_i > \underline{v}_i\}$ denote the half-space of strictly individually rational rewards for player $i$, and let $IR^*$ denote the intersection of all such half-spaces over all players:

$$IR^* = \bigcap_{i \in N} IR_i^* = \{v \in V \mid v_i > \underline{v}_i \;\; \forall i \in N\}.$$

For the prisoner's dilemma in Figure 1, the set $V$ of feasible stage-game rewards is the solid parallelogram with its four vertices at the four pure reward profiles. The minimax reward for each player $i$ is $\underline{v}_i = 0$. The set $IR$ of individually rational reward profiles is thus this parallelogram $V$ intersected with the nonnegative quadrant, and the set $IR^*$ of strictly individually rational rewards is $V$ intersected with the strictly positive quadrant.

The stage game $G$ is repeated infinitely many times, starting at period $t = 0$. We will assume perfect monitoring; that is, at the end of each stage, all players can perfectly observe the actions chosen in that stage (as well as the outcomes of all previous stages) and can condition their future actions on these outcomes. The stage-$t$ history of the repeated game is a vector $h^t$ that includes all past actions of all players, i.e. $(a^0, \ldots, a^{t-1}) \in A^t$, and all past realized values of the public randomization device, i.e. $(p^0, \ldots, p^{t-1}) \in P^t$. We denote the set of all possible stage-$t$ histories by $H^t = A^t \times P^t$. Players observe the realization of the public randomization device in period $t$ before choosing their actions, so that a stage-$t$ pure strategy for player $i$ is a map $s_i^t : H^t \times P \to A_i$.
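To make the minimax computation concrete, the following sketch (our own illustration, not part of the original analysis; the payoff arrays transcribe Figure 1 and the function name is hypothetical) recovers each player's minimax reward by a coarse grid search over the opponent's mixed actions:

```python
import numpy as np

# Stage-game payoffs from Figure 1 (player 1 rows: T, B; player 2 columns: L, R).
U1 = np.array([[1, -1],
               [2,  0]])   # player 1's rewards
U2 = np.array([[1,  2],
               [-1, 0]])   # player 2's rewards

def minimax(U):
    """Minimax reward of the player whose payoffs are the rows of U:
    the opponent picks a mixture over columns to minimize this player's
    best pure-action reply (a grid search stands in for the exact min)."""
    worst = np.inf
    for q in np.linspace(0.0, 1.0, 10001):
        mix = np.array([q, 1.0 - q])
        worst = min(worst, np.max(U @ mix))
    return worst

print(minimax(U1))    # 0.0: player 2 minimaxes player 1 by playing R
print(minimax(U2.T))  # 0.0: player 1 minimaxes player 2 by playing B
```

Both values come out to 0, matching the claim that $\underline{v}_i = 0$ for each player.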

2.1 A Remark on Discounting

In what follows we consider cases in which players use various modes of discounting to evaluate their payoff streams. However, we should note that discounting is not the only method for representing players' preferences over time. Osborne and Rubinstein (1994) detail two other methods. Under the limit of means criterion, player $i$ strictly prefers the stream of rewards $(v_i^t)_{t=0}^{+\infty}$ to the stream of rewards $(w_i^t)_{t=0}^{+\infty}$ if and only if

$$\liminf_{T \to +\infty} \; \sum_{t=0}^{T} \frac{v_i^t - w_i^t}{T} > 0.$$

Under the overtaking criterion, player $i$ strictly prefers the stream of rewards $(v_i^t)_{t=0}^{+\infty}$ to $(w_i^t)_{t=0}^{+\infty}$ if and only if

$$\liminf_{T \to +\infty} \; \sum_{t=0}^{T} (v_i^t - w_i^t) > 0.$$

Whereas discounting weights a given reward less and less the further away in time the reward is received (for discount factors less than one), the limit of means and overtaking criteria weight all time periods equally. There are convincing economic arguments for using discounting, and it is by far the most commonly encountered method of representing time-dependent preferences in the literature. Therefore, we will restrict our attention to discounting in the remainder of the paper.
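A small numerical sketch (our own illustration; the streams are hypothetical) makes the contrast concrete: a finite head of a reward stream never matters under the limit of means, but it always matters under discounting.

```python
K, T = 100, 10**6
v = [1] * T                  # reward 1 in every period
w = [0] * K + [1] * (T - K)  # identical except for the first K periods

# Limit of means: the average difference vanishes as T grows,
# so neither stream is strictly preferred.
print(sum(a - b for a, b in zip(v, w)) / T)   # 1e-4, -> 0 as T -> infinity

# Discounting: the head of the stream matters for any delta < 1.
delta = 0.9
diff = (1 - delta) * sum(delta**t * (a - b) for t, (a, b) in enumerate(zip(v, w)))
print(diff)                                   # ~0.99997, bounded away from 0
```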

3 Geometric Discounting

Geometric discounting is the most widely used representation of time-dependent preferences. Under geometric discounting, player $i$ has a discount factor $\delta_i \in (0, 1)$; a reward obtained $t$ periods after the current period is multiplied by the weight $\delta_i^t$. Note that, as $t \to +\infty$, the weight $\delta_i^t \to 0$, so that if the set $V_i$ of player $i$'s feasible stage-game rewards is bounded, then the infinite discounted sum of any reward stream $(v_i^t)_{t=0}^{+\infty}$ is well-defined, where $v_i^t \in V_i$ for all $t$. In practice, we usually define player $i$'s payoff in the infinitely repeated game to be

"+∞ # X t t πi(σ) = (1 − δi)Eσ δi ui(a ) , (1) t=0 where σ is any behavior-strategy profile of the infinitely repeated game and where the expec- tation is taken with respect to the probability distribution generated by σ over the (infinite) terminal histories. The normalization factor 1 − δi is included to render the payoffs of the infinitely repeated game directly comparable to the rewards of the stage game. Under geometric discounting, the repeated game assumes a symmetric structure through k time: The set of player i’s continuation payoffs Ci in any future period k is identical to her set of feasible payoffs:

" +∞ # k X t−(k+1) t Ci (σ) = (1 − δi)Eσ δi ui(a ) . (2) t=k+1

The equivalence between the continuation payoff (2) and the feasible payoff (1) is easily seen by reindexing the sum in (2) with $t' = t - (k+1)$.
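The following sketch (our own check; the random reward stream and function names are hypothetical) computes the normalized payoff (1) for a truncated reward stream and verifies the reindexing argument behind (2) numerically:

```python
import numpy as np

def normalized_payoff(delta, rewards):
    """Normalized discounted value (1) of a (truncated) reward stream."""
    t = np.arange(len(rewards))
    return (1 - delta) * np.sum(delta**t * np.asarray(rewards))

delta, k = 0.9, 7
rewards = np.random.default_rng(0).uniform(-1, 2, size=5000)

# Continuation payoff (2): discount the stage-t reward by delta^(t - (k+1)).
t = np.arange(k + 1, len(rewards))
continuation = (1 - delta) * np.sum(delta**(t - (k + 1)) * rewards[k + 1:])

# Reindexing with t' = t - (k+1) turns (2) into (1) applied to the tail stream.
assert np.isclose(continuation, normalized_payoff(delta, rewards[k + 1:]))
```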

3.1 Identical Discount Factors

Consider the case in which all players discount geometrically using an identical discount factor $\delta = \delta_1 = \cdots = \delta_n$. In this case, the set of feasible payoffs of the infinitely repeated game coincides with the set of feasible rewards of the stage game. Furthermore, the famous folk theorem of Fudenberg and Maskin (1986) states that any feasible and strictly individually rational reward vector of the stage game can be sustained as a subgame-perfect Nash equilibrium (SPNE) payoff of the infinitely repeated game, provided players are sufficiently patient. That is, there exists some $\underline{\delta} \in (0, 1)$ such that, if $\delta > \underline{\delta}$, then any $v \in V \cap IR^*$ can be sustained as a payoff profile of some SPNE.
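For the prisoner's dilemma of Figure 1, the "sufficiently patient" threshold for the cooperative profile is easy to compute under grim-trigger strategies. The sketch below is our own illustration (the function name is hypothetical): cooperating forever yields a normalized payoff of 1, while the best one-shot deviation yields 2 today followed by Nash reversion to 0.

```python
def cooperation_sustainable(delta: float) -> bool:
    """Grim trigger in the Figure 1 prisoner's dilemma: compare the
    normalized payoff of permanent cooperation (1 each period) with the
    best one-shot deviation (2 today, then 0 forever after Nash reversion)."""
    payoff_cooperate = 1.0
    payoff_deviate = 2.0 * (1.0 - delta)   # (1-delta)*(2 + delta*0 + ...)
    return payoff_cooperate >= payoff_deviate

# The threshold is delta >= 1/2.
assert not cooperation_sustainable(0.4)
assert cooperation_sustainable(0.6)
```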

3.2 Different Discount Factors

When players discount geometrically using different discount factors, they have the possibility to benefit from intertemporal trade. A relatively impatient player (i.e. a player whose discount factor is closer to zero) cares comparatively more about the rewards in time periods close to the present than does a relatively patient player, who cares comparatively more about the rewards received far in the future. Thus, in time periods close to the present, a relatively patient player can agree to play action profiles that give greater rewards to a relatively impatient player; in exchange, the relatively impatient player agrees to play, in future periods, action profiles that yield the patient player greater rewards. As a result of intertemporal trade, players can achieve repeated-game payoffs that lie outside the set of stage-game rewards. The possibility of intertemporal trade, along with individual rationality considerations, leads to two novel features of the repeated game. The first is that, as we have mentioned, the set of feasible payoffs of the repeated game can contain points that lie outside the set of feasible rewards of the stage game. The second is that there can be feasible and individually rational payoffs of the repeated game that cannot be sustained as equilibria. This latter feature occurs

because a player cannot be trusted to accept a tail-game continuation payoff that is below her individually rational level, thus making some trade agreements non-credible.

Lehrer and Pauzner (1999) provide explicit characterizations of the sets of feasible payoffs and equilibrium payoffs of two-player repeated games. Their analysis of the equilibrium payoff set relies on the fact that along any Pareto-optimal path of a two-player game, each player's stream of rewards is monotone.¹ For the Pareto frontier of the equilibrium payoff set, this monotonicity property implies that the rewards to the patient player must be (weakly) increasing, while those to the impatient player must be (weakly) decreasing. Let a subscript $I$ indicate a quantity of the impatient player, and a subscript $P$ a quantity of the patient player. Individual rationality implies the following results. The relatively impatient player will never accept an expected stage-game reward less than her individually rational (IR) level in any period: since her reward stream is weakly decreasing, if she ever received a reward less than her IR level, her continuation payoff would be less than her IR level, which she can always guarantee. Thus, along a Pareto-optimal path, the allowable stage-game rewards come from the set $V \cap IR_I$. The relatively patient player, on the other hand, is willing to accept stage-game rewards less than his IR level, so long as his overall payoff satisfies his IR level. Thus the set of equilibrium payoffs is given by

$$F^\Delta(V \cap IR_I) \cap IR_P,$$

where, following Lehrer and Pauzner (1999), the notation $F^\Delta(\cdot)$ denotes the feasible payoff set of the repeated game with stage length $\Delta$. We can then bound the set of Nash equilibrium payoffs (denoted $NE$) and subgame-perfect Nash equilibrium payoffs (denoted $SPNE$) between this set and the corresponding set built from the $\varepsilon$-strong IR levels:

$$F^\Delta(V \cap IR_I^\varepsilon) \cap IR_P^\varepsilon \;\subseteq\; SPNE^\Delta \;\subseteq\; NE^\Delta \;\subseteq\; F^\Delta(V \cap IR_I) \cap IR_P.$$

The equilibrium set of total discounted payoffs for the infinitely repeated prisoner's dilemma is then the set bounded by the two axes and a monotonically decreasing concave curve connecting $(0, \frac{3}{2})$ to $(2, 0)$. In particular, note that the payoff profile $(2, 2)$ is not sustainable as an equilibrium: the impatient player will never accept a stage-game reward less than her individually rational level of $\underline{v}_I = 0$, making the reward profile $(-1, 2)$ inaccessible to the patient player. In the limit as the patience ratio $r = \log \delta_P / \log \delta_I$ goes to zero (that is, as the patient player becomes arbitrarily more patient than the impatient one), the payoff profile $(2, \frac{3}{2})$ becomes the unique Pareto optimum for any non-dictatorial weighting vector.²

3.2.1 Explicit Construction of Equilibria and Analysis

To illustrate the forces at play, we will now construct a subgame-perfect Nash equilibrium for the prisoner's dilemma and show that, for suitable choices of the discount factors $\delta_I$ and $\delta_P$, we can achieve total payoffs arbitrarily close to the Pareto optimum $(2, \frac{3}{2})$. Consider the infinitely repeated symmetric two-player prisoner's dilemma whose stage game is shown in Figure 1. Assume that both players discount geometrically; denote the "impatient" player's discount factor by $\delta_I$ and the "patient" player's discount factor by $\delta_P$, with $0 < \delta_I < \delta_P < 1$. In the context of this game, consider the following strategy profile $\sigma^*$, which can conveniently be thought of as consisting of two phases.

¹In their conclusion, Lehrer and Pauzner (1999) show that this monotonicity property of Pareto-optimal paths need not hold in games with more than two players. As a result, their results about the set of equilibrium payoffs do not extend to games with more than two players.

²Nowhere in this paper does the particular base of the logarithm matter; for convenience we take the notation log to denote the natural logarithm.

The first phase comprises the initial $T$ stages (stage 0 to stage $T - 1$); we will specify the value of $T$ later. At each stage of the first phase, players play according to the action profile $(B, L)$. Adherence to this strategy yields, in each stage of the first phase, a reward of 2 to the impatient player and a reward of $-1$ to the patient player. In the second phase, which begins in stage $T$ and continues in all future stages, the impatient player plays $T$, and the patient player mixes $(\frac{1}{2} + e)L + (\frac{1}{2} - e)R$, where $e$ is a strictly positive real number satisfying $0 < e < \frac{1}{2}$; as with $T$, we will determine the exact value of $e$ later. If both players adhere to this strategy, then in each stage of the second phase, the impatient player gets an expected reward of $2e$, and the patient player an expected reward of $\frac{3}{2} - e$. Any deviation from this strategy, in either phase, is punished by immediate and permanent reversion to the stage-game Nash equilibrium $(B, R)$, which yields the stage-game payoffs $(0, 0)$.
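As a quick check of the second-phase expected rewards (a sketch of our own; the variable names are hypothetical), with the impatient player on $T$ and the patient player mixing over $L$ and $R$:

```python
# Second-phase expected stage rewards under sigma*, from Figure 1's payoffs:
# impatient plays T; patient plays L with prob 1/2 + e and R with prob 1/2 - e.
e = 0.1                              # any 0 < e < 1/2
p_L, p_R = 0.5 + e, 0.5 - e
r_impatient = p_L * 1 + p_R * (-1)   # u_I(T,L) = 1, u_I(T,R) = -1
r_patient   = p_L * 1 + p_R * 2      # u_P(T,L) = 1, u_P(T,R) = 2
assert abs(r_impatient - 2 * e) < 1e-12       # equals 2e
assert abs(r_patient - (1.5 - e)) < 1e-12     # equals 3/2 - e
```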

Proposition 1. The strategy profile $\sigma^*$ constitutes a subgame-perfect Nash equilibrium.

Proof. By the one-shot deviation principle, we can prove that $\sigma^*$ is an SPNE by showing there exist no profitable one-shot deviations. The impatient player clearly cannot gain from deviating during the first phase, since the reward she receives in each stage under the strategy profile $\sigma^*$, namely 2, is the largest possible reward she can receive. If the patient player could gain from deviating at any stage during the first phase, then he would gain at least as much by deviating at stage 0, since the punishment for deviation is Nash reversion, which yields a stage-game payoff of 0, and this is strictly greater than the stage-game payoff of $-1$ he incurs in all first-phase stages prior to the deviation. This logic implies that the optimal deviation for the patient player is to deviate to the pure action $R$ in stage 0, giving him a reward of zero in stage 0 and in every stage thereafter. Thus, to ensure that the patient player has no incentive to deviate in the first phase, we must ensure that his total discounted payoff is greater than zero:

"T −1 +∞ # X X 3  0 < (1 − δ ) δt (−1) + δs − e P P P 2 t=0 s=T  T    1 − δP T 3 1 = (1 − δP ) (−1) + δP − e 1 − δP 2 1 − δP 5  = − e δT − 1, 2 P so that

$$\delta_P^T > \frac{1}{\frac{5}{2} - e}. \qquad (3)$$

In the second phase, if the patient player had any profitable deviation, it would involve shifting some weight in his mixed action from $L$ to $R$ (recall that $R$ is a strictly dominant action in the stage game). Any deviation is punished by immediate and permanent reversion to the minmax payoffs $(0, 0)$, so the patient player's optimal deviation is to play the pure action $R$. To ensure that the patient player has no (weakly) profitable deviation, then, we require that

" +∞ # "+∞ # X X 3  (1 − δ ) 2 + δt 0 < (1 − δ ) δs − e , P P P P 2 t=1 s=0 which yields the result

$$\delta_P > \frac{1}{2}\left(e + \frac{1}{2}\right). \qquad (4)$$


We can show³ that, for $e$ satisfying $0 < e < \frac{1}{2}$, condition (4) holds whenever (3) does. Turning to the impatient player, any deviation by her in the second phase requires placing some non-zero weight on her strictly dominant action $B$. Again, since any deviation is punished by immediate and permanent reversion to the minmax payoffs $(0, 0)$, her optimal deviation would be to play the pure action $B$. To ensure that this deviation is not (weakly) profitable, we need

" +∞ # "+∞ # 1  1  X X (1 − δ ) + e 2 + − e 0 + δt 0 < (1 − δ ) δs(2e) , I 2 2 I I I t=1 s=0 which, solving for δI , yields the result 1 δ > . (6) I 1 + 2e Conditions (3) and (6), taken together, ensure that no profitable one-shot deviations from σ∗ exist for either player. Hence by the one-shot deviation principle, σ∗ constitutes a subgame- perfect Nash equilibrium.

Next we show that we can get arbitrarily close to the Pareto-optimal payoff profile $(2, \frac{3}{2})$ with suitable choices of the discount factors. Suppose we are given two strictly positive real numbers $\varepsilon_I > 0$ and $\varepsilon_P > 0$. Assuming that both players play according to the equilibrium strategy profile $\sigma^*$ detailed above, we ask what restrictions we must impose on the discount factors $\delta_I$ and $\delta_P$ so that the total discounted payoffs to the impatient player, denoted $\pi_I^*$, are within $\varepsilon_I$ of 2, and the total discounted payoffs to the patient player, denoted $\pi_P^*$, are within $\varepsilon_P$ of $\frac{3}{2}$. Under the strategy profile $\sigma^*$, the total discounted payoffs to the impatient player are

"T =1 +∞ # ∗ X t X s T πI = (1 − δI ) δI (2) + δI (2e) = 2 − 2(1 − e)δI . (7) t=0 s=T

The impatient player’s total discounted payoffs are within εI of 2 if

$$\pi_I^* > 2 - \varepsilon_I.$$

³To see that inequality (3) implies inequality (4) when $0 < e < \frac{1}{2}$, we begin by noting that, since $0 < \delta_P < 1$ by assumption, inequality (3) gives the following chain of inequalities for $T > 0$:

$$1 > \delta_P > \delta_P^2 > \cdots > \delta_P^{T-1} > \delta_P^T > \frac{1}{\frac{5}{2} - e}.$$

Thus, in particular, (3) implies
$$\delta_P > \frac{1}{\frac{5}{2} - e}. \qquad (5)$$

If we can now show that (5) implies (4) for $e$ satisfying $0 < e < \frac{1}{2}$, our proof will be complete. This implication is true if the following inequality holds:
$$\frac{1}{\frac{5}{2} - e} > \frac{1}{2}\left(e + \frac{1}{2}\right),$$

which for $e < \frac{5}{2}$ can be rearranged to yield $0 > -(2e - 3)(2e - 1)$.

For $e < \frac{1}{2}$, both terms in parentheses are negative, so their product is positive, and hence the right side of the inequality is negative; thus the inequality holds, and our proof is complete.


Substituting in expression (7) and simplifying, we obtain
$$\delta_I^T < \frac{\varepsilon_I}{2(1 - e)}. \qquad (8)$$

At first the direction of the inequality in (8) may seem surprising: to ensure that the impatient player's payoffs are sufficiently close to 2, we demand that her discount factor $\delta_I$ be less than the $T$th root of the quantity on the right side of (8). However, this makes sense for two reasons. First, as the impatient player's discount factor becomes larger, she places more and more weight on the payoffs in the second phase, which pull down her total discounted payoffs. If her discount factor becomes too large, her total discounted payoffs will be pulled below $2 - \varepsilon_I$. This explains the direction of the inequality in (8). Second, recall that we do have a lower bound on $\delta_I$ in condition (6), due to individual rationality considerations, so $\delta_I$ cannot be arbitrarily close to zero. As computed above, the total discounted payoffs to the patient player under $\sigma^*$ are

"T −1 +∞ # X X 3  5  π∗ = (1 − δ ) δt (−1) + δs − e = − e δT − 1. (9) P P P 2 2 P t=0 s=T

To ensure that the patient player's total discounted payoffs are within $\varepsilon_P$ of $\frac{3}{2}$ is to require that
$$\pi_P^* > \frac{3}{2} - \varepsilon_P,$$
which gives the condition
$$\delta_P^T > \frac{5 - 2\varepsilon_P}{5 - 2e}. \qquad (10)$$

Note that condition (10) is stronger than condition (3) whenever $5 - 2\varepsilon_P > 2$, i.e. whenever $\varepsilon_P < \frac{3}{2}$. Any $\varepsilon_P \ge \frac{3}{2}$ results in a total discounted payoff for the patient player that is nonpositive; since the patient player can always guarantee a total discounted payoff of 0 by always playing his dominant action, only condition (10) is relevant. Note also that the inequality in (10) implicitly bounds $e$, the added weight above $\frac{1}{2}$ that the patient player puts on the action $L$ in each stage of the second phase. This is because we must have $\delta_P < 1$; hence (10) implies that

$$e < \varepsilon_P. \qquad (11)$$

Condition (4) also provided an implicit bound on $e$, namely, $e < 2\delta_P - \frac{1}{2}$. However, as we will see, $\delta_P$ will typically be close to 1, so for small $\varepsilon_P$ it will be the constraint $e < \varepsilon_P$ that binds. It remains for us to determine what bounds exist on $T$, the length of the first phase. A lower bound for $T$ comes from (8), the requirement that the impatient player's total discounted payoff be within $\varepsilon_I$ of 2. For $\varepsilon_P$ such that $e < \varepsilon_P < 1$, we can use (11) to rewrite (8) as

$$\delta_I^T < \frac{\varepsilon_I}{2(1 - \varepsilon_P)}.$$

Taking logs of both sides and solving for $T$, we have the following lower bound for $T$:

$$T > \frac{\log \varepsilon_I - \log(2(1 - \varepsilon_P))}{\log \delta_I}. \qquad (12)$$


An upper bound for $T$ comes from (10), the requirement that the patient player's total discounted payoff be within $\varepsilon_P$ of $\frac{3}{2}$. Taking logs of both sides of (10) and solving for $T$, we have
$$T < \frac{\log(5 - 2\varepsilon_P) - \log(5 - 2e)}{\log \delta_P}. \qquad (13)$$

Practically, given $\varepsilon_P$, we choose $e < \varepsilon_P$, which then determines the upper bound on $T$. Table 1 summarizes the bounds on the parameters.

Parameter   Reference     Bounds
e           (4) & (11)    0 < e < min{ε_P, 2δ_P − 1/2}
T           (12) & (13)   (log ε_I − log(2(1 − ε_P)))/log δ_I < T < (log(5 − 2ε_P) − log(5 − 2e))/log δ_P
δ_I         (6) & (8)     1/(1 + 2e) < δ_I < (ε_I/(2(1 − e)))^{1/T}
δ_P         (10)          ((5 − 2ε_P)/(5 − 2e))^{1/T} < δ_P < 1

Table 1: Parameter bounds for the infinitely repeated symmetric two-player prisoner's dilemma in which players use the different geometric discount factors δ_I < δ_P.
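The bounds in Table 1 translate directly into code. The following sketch (our own transcription; the function and variable names are hypothetical) computes the admissible bounds for given tolerances:

```python
import math

def table1_bounds(eps_I, eps_P, e, delta_I, T):
    """Transcription of Table 1's bounds (our own helper, not part of the
    paper's formal apparatus). Returns the lower bound on T from (12),
    the bounds on delta_I from (6) and (8), and the lower bound on
    delta_P from (10), for a given first-phase length T."""
    T_lower = (math.log(eps_I) - math.log(2 * (1 - eps_P))) / math.log(delta_I)
    delta_I_lower = 1 / (1 + 2 * e)                             # from (6)
    delta_I_upper = (eps_I / (2 * (1 - e))) ** (1 / T)          # from (8)
    delta_P_lower = ((5 - 2 * eps_P) / (5 - 2 * e)) ** (1 / T)  # from (10)
    return T_lower, delta_I_lower, delta_I_upper, delta_P_lower
```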

3.2.2 Different Geometric Discount Factors: An Example

We conclude this section by offering a specific example of the above computations. The example will illustrate our claim that in the situations we generally wish to consider, the relevant upper bound on $e$ is $\varepsilon_P$.

Suppose we wish each player's total discounted payoff to be within $\frac{1}{100}$ of the payoff profile $(2, \frac{3}{2})$; that is, let $\varepsilon_I = \varepsilon_P = \frac{1}{100}$. Since $\varepsilon_P \le \frac{1}{2}$, we must choose $e$ such that $0 < e < \varepsilon_P$; setting $e = \frac{1}{2}\varepsilon_P = \frac{1}{200}$ satisfies these constraints. The lower bound on $\delta_I$, specified by (6), is then
$$\underline{\delta}_I = \frac{1}{1 + 2e} = \frac{100}{101}.$$
If we take this value to be our $\delta_I$,⁴ then the lower bound on $T$ is
$$\underline{T} = \frac{\log \varepsilon_I - \log(2(1 - \varepsilon_P))}{\log \delta_I} \approx 531.466452,$$

⁴The fastidious reader may object that our lower bound on $\delta_I$ is a strict inequality, and therefore we cannot set $\delta_I = \underline{\delta}_I$. This is a valid complaint, to which we offer the following three defenses. First, strict inequality was chosen in (6) rather than weak inequality to ensure that deviation is not even weakly profitable, i.e. to rule out deviations that yield a payoff equal to or greater than the payoff afforded by adhering to the given strategy. If we are willing to allow such a possibility, then we can relax the lower bound on $\delta_I$ to a weak inequality, and setting $\delta_I = \underline{\delta}_I$ presents no problems. Second, even if we demand strict inequality, we can make $\delta_I$ as close as we like to $\underline{\delta}_I$; setting the two equal is simply the limiting case of this process. If neither of these two arguments convinces the reader, then our final defense is to simply choose $e$ slightly larger, $e = \frac{1}{199}$ say, so that the lower bound $\underline{\delta}_I$ is slightly smaller; then the value $\delta_I = \frac{100}{101}$ satisfies the strict inequality.

so that the minimum length of the first phase is $T = 532$. Taking this to be our value of $T$, we can compute the lower bound on $\delta_P$ to be

$$\underline{\delta}_P = \left(\frac{5 - 2\varepsilon_P}{5 - 2e}\right)^{1/T} \approx 0.999996229.$$

Thus for this example, the patience ratio $r$ is, at maximum,
$$r = \frac{\log \underline{\delta}_P}{\log \delta_I} \approx 0.000378983088.$$
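These numbers are straightforward to reproduce; the short script below (our own check) recomputes them from the bounds of Table 1:

```python
import math

eps_I = eps_P = 1e-2        # tolerances of 1/100
e = eps_P / 2               # e = 1/200
delta_I = 1 / (1 + 2 * e)   # lower bound from (6); equals 100/101

T_lower = (math.log(eps_I) - math.log(2 * (1 - eps_P))) / math.log(delta_I)
print(T_lower)              # ~531.4665  ->  minimum first-phase length T = 532

T = 532
delta_P_lower = ((5 - 2 * eps_P) / (5 - 2 * e)) ** (1 / T)
print(delta_P_lower)        # ~0.999996229

print(math.log(delta_P_lower) / math.log(delta_I))  # patience ratio ~0.000378983
```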

4 Hyperbolic Discounting

The term hyperbolic discounting refers to a discount factor $f_H(t)$ that can be written in the form
$$f_H(t) = \frac{1}{1 + kt} \qquad (14)$$
for some parameter $k \ge 0$ that captures the degree of discounting. The greater the value of $k$, the more the agent discounts the future compared to the present, and the more severe is her present bias.

Hyperbolic discounting does a better job than geometric discounting at capturing the behavior actually observed in humans (and other animals). However, the hyperbolic discounting function given by equation (14) destroys the recursive framework present under geometric discounting, thus making mathematical analysis extremely difficult. A compromise mode of discounting that captures the key features of true hyperbolic discounting while preserving a recursive structure⁵ in future periods is quasi-hyperbolic discounting, first studied by the psychologist Herrnstein (1961), introduced to economics by Phelps and Pollak (1968), and popularized by Laibson (1997). A discount factor $f_{QH}(t)$ is called quasi-hyperbolic if it can be written in the form
$$f_{QH}(t) = \begin{cases} 1 & \text{for } t = 0 \\ \beta\delta^t & \text{for } t > 0, \end{cases} \qquad (15)$$
where $\delta$ serves a role analogous to the one it played under geometric discounting and where $\beta$ describes the degree of present bias. These parameters satisfy $0 < \delta < 1$ and $\beta > 0$. For present-biased preferences, the parameter $\beta$ satisfies $0 < \beta < 1$; the discount factor falls more steeply between the present period $t = 0$ and the next period $t = 1$ than between any two future periods $t'$ and $t' + 1$ (for any $t' \ge 1$). Because of the parametric form of (15), an individual who uses quasi-hyperbolic discounting is often said to have beta-delta preferences.

As with geometric discounting, we use a normalization factor to make the repeated-game payoffs comparable to the stage-game rewards. To this end we define an effective discount factor $\Delta(\beta, \delta)$ to be the geometric discount factor that yields an equivalent total payoff under a uniform reward stream:

$$1 + \Delta(\beta, \delta) + \Delta(\beta, \delta)^2 + \cdots = 1 + \beta\delta + \beta\delta^2 + \cdots.$$

⁵The discount factor between stage 0 and stage 1 is $\beta\delta$, whereas the discount factor between any two consecutive future stages is $\delta$. If we restrict our attention solely to future stages, however, then discounting appears geometric. Chade et al. (2008) exploit this recursive structure with great success.


Solving for the effective discount factor, we find
$$\Delta(\beta, \delta) = \frac{\beta\delta}{1 + \beta\delta - \delta}. \qquad (16)$$
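In code, (16) and its defining identity look as follows (our own sketch; the function name is hypothetical):

```python
def effective_delta(beta: float, delta: float) -> float:
    """Effective geometric discount factor Delta(beta, delta) of (16)."""
    return beta * delta / (1 + beta * delta - delta)

# Check the defining identity 1/(1 - Delta) = 1 + beta*delta/(1 - delta),
# i.e. equal total weight on a uniform reward stream.
beta, delta = 0.25, 0.5
D = effective_delta(beta, delta)   # = 0.2 for these parameters
assert abs(1 / (1 - D) - (1 + beta * delta / (1 - delta))) < 1e-12
```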

To simplify notation, we will write player $i$'s effective discount factor as $\Delta_i := \Delta(\beta_i, \delta_i)$. We again denote player $i$'s stage-$t$ continuation payoff by $C_i^t$:

$$C_i^t = \beta\delta \sum_{j=t+1}^{+\infty} \delta^{j-1} u_i(\sigma^j).$$

Her total discounted average payoff is then given by

" +∞ # 0 X t−1 t πi(σ) = (1 − ∆i) ui(σ ) + βδ δ ui(σ ) t=1 0 0 |h1,σ0 = (1 − ∆i)ui(σ ) + ∆iCi (σ ). The new issue that arises with hyperbolic discounting (and, therefore, with quasi-hyperbolic discounting) is time inconsistency of players’ strategies. As seen from the present period t = 0, the stream of discount factors is 1, βδ, βδ2,.... Notice that the relative weight between periods 1 and 0, namely βδ, is different from the relative weight between any two adjacent future periods, namely δ. This discrepancy in relative weights means that a rational player, evaluating various streams of rewards, may optimize by choosing some strategy σA in the present, but on arriving in some future period, this player finds it optimal to play a different strategy σB. As an example, let β = 1/4, δ = 1/2, so ∆ = 1/5, and consider a player deciding between the strategies σA and σB yielding the two reward streams (0, 2, 3, 3,...) and (0, 4, 0, 0,...), respectively. Assume that the two strategies prescribe the same action in period t = 0, so that both reward streams are available to the player in period t = 1. As seen from the present, strategy σA yields a total discounted payoff of  1  1  1 1  4 1  1 1 − 0 + 2 + 3 + 3 + ... = (2 + 3) = , 5 8 2 4 5 8 2 whereas strategy σB yields a total discounted payoff of  1  1  2 1 − 0 + (4 + 0 + 0 + ...) = . 5 8 5

Since $\frac{1}{2} > \frac{2}{5}$, in the present period $t = 0$ a rational player would choose strategy $\sigma_A$ to obtain the first reward stream. However, upon entering period $t = 1$, the player would discount the remaining reward streams as
$$\left(1 - \frac{1}{5}\right)\left[2 + \frac{1}{8}\left(3 + \frac{1}{2} \cdot 3 + \cdots\right)\right] = \frac{11}{5}$$
and
$$\left(1 - \frac{1}{5}\right)\left[4 + \frac{1}{8}(0 + 0 + \cdots)\right] = \frac{16}{5},$$
respectively. Since $\frac{11}{5} < \frac{16}{5}$, she would now prefer to play strategy $\sigma_B$ to obtain the second reward stream. Her preferences have changed; in particular, any promise by this player in period $t = 0$ to adhere to strategy $\sigma_A$ is non-credible, assuming she is free to choose her actions in each period.
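The reversal is easy to reproduce numerically. The sketch below (our own check; the function name is hypothetical) evaluates both truncated reward streams from period 0 and again from period 1:

```python
def qh_value(stream, beta, delta):
    """Normalized quasi-hyperbolic value of a reward stream, seen from
    the stream's first period: weights 1, beta*delta, beta*delta^2, ..."""
    D = beta * delta / (1 + beta * delta - delta)   # effective factor (16)
    tail = sum(delta**t * r for t, r in enumerate(stream[1:], start=1))
    return (1 - D) * (stream[0] + beta * tail)

beta, delta = 0.25, 0.5
sA = [0, 2] + [3] * 200   # stream under sigma_A (truncated tail)
sB = [0, 4] + [0] * 200   # stream under sigma_B

print(qh_value(sA, beta, delta))       # ~0.5 : sigma_A preferred at t = 0 ...
print(qh_value(sB, beta, delta))       # ~0.4
print(qh_value(sA[1:], beta, delta))   # ~2.2 : ... but at t = 1 ...
print(qh_value(sB[1:], beta, delta))   # ~3.2 : ... sigma_B is preferred
```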


The time inconsistency encountered above renders the notion of subgame-perfect equilibrium unstable across time and requires us to introduce a new equilibrium concept, that of Strotz-Pollak equilibrium. The strategy profile $\sigma$ is a Strotz-Pollak equilibrium if, for all histories $h \in H$, for all players $i \in N$, and for all actions $a_i \in A_i$,

$$(1 - \Delta_i) u_i(\sigma(h)) + \Delta_i C_i(\sigma|_{h, \sigma(h)}) \ge (1 - \Delta_i) u_i(a_i, \sigma_{-i}(h)) + \Delta_i C_i(\sigma|_{h, (a_i, \sigma_{-i}(h))}).$$

In words, a Strotz-Pollak equilibrium is a strategy profile $\sigma$ for which, in any stage $t$, no player can do better by deviating, given that all other players adhere to $\sigma$. Note that there are no strictly profitable one-shot deviations under a Strotz-Pollak equilibrium, where profitability is evaluated taking the given period $t$ as the present. Note also that Strotz-Pollak equilibrium does not require subgame perfection; it requires only that, for all $t$, each player will find it optimal to play according to $\sigma$ when she finds herself in stage $t$.

4.1 Identical Discount Factors

Let us consider first the case in which both players discount quasi-hyperbolically using the same discount parameters $\beta$ and $\delta$. (The effective discount factor $\Delta$ for both players is then given by equation (16).) Is it possible to construct a Strotz-Pollak equilibrium that supports the cooperative total discounted payoff profile $(1, 1)$?

As a first attempt, let us try grim-trigger Nash reversion: both players use a strategy $\sigma_i^*$ that begins by cooperating ($T$ for player 1, $L$ for player 2) in stage 0 and continues cooperating so long as both players have always cooperated in the past, i.e., so long as $h^t = ((T, L), \ldots, (T, L))$ for any $t > 0$. Any deviation is punished by immediate and permanent reversion to $B$ by player 1 and to $R$ by player 2. Provided her opponent follows this strategy, adhering to the strategy $\sigma_i^*$ yields player $i$ a total discounted payoff of 1, as the following computation verifies:
$$\pi_i(\sigma^*) = (1 - \Delta)\left[1 + \beta\delta \sum_{t=1}^{+\infty} \delta^{t-1}(1)\right] = \frac{1 - \delta}{1 + \beta\delta - \delta}\left[1 + \frac{\beta\delta}{1 - \delta}\right] = 1.$$

Proposition 2. The strategy profile $\sigma^*$ defines a Strotz-Pollak equilibrium, provided that the effective discount factor $\Delta > \frac{1}{2}$.

Proof. What deviations do we need to check? Certainly we need to check the total discounted payoff if a player deviates in the current period. For Strotz-Pollak equilibrium, that is all we need to check. The strategy of Nash reversion results in only two states of the repeated game: that in which no player has yet deviated, and that in which a deviation has occurred. In the latter state, $\sigma_i^*$ stipulates that each player play her strictly dominant action, and hence no profitable deviation is possible. In the former state, if any profitable deviation exists, then deviating to the strictly dominant action will be the best deviation. Thus we need to compare the total discounted payoff when a player adheres to $\sigma_i^*$ with the payoff when she deviates to her strictly dominant action. Checking for profitable deviations in future stages would tell us how a player plans to behave, but due to the time inconsistency associated with quasi-hyperbolic discounting, it is only the comparison of actions in the present stage that is credible and relevant. If player $i$ deviates to her dominant action in the present stage, her total discounted payoff is
$$\pi_i(\sigma_i^{dev}, \sigma_{-i}^*) = (1 - \Delta)\left[2 + \beta\delta \sum_{t=1}^{+\infty} \delta^{t-1}(0)\right] = 2(1 - \Delta).$$


Thus, to ensure no strictly profitable deviation exists, we must have that

$$\pi_i(\sigma^*) > \pi_i(\sigma_i^{dev}, \sigma_{-i}^*),$$
which yields the restriction
$$\Delta > \frac{1}{2}.$$

Note the similarity between this and the restriction we had for identical geometric discount factors; this similarity arises because the grim-trigger Nash reversion strategy implies that if any profitable deviation exists, it will be profitable to deviate in the current stage, and hence only the actions in the current stage need be compared, provided that the discount factor in all future periods is weakly less than the weight assigned to the present, which is the case for geometric discount factors less than one and for present-biased quasi-hyperbolic discounting. Note also that this restriction grants quite a bit of freedom to the parameters $\beta$ and $\delta$, requiring only that $\Delta(\beta, \delta)$ as given by (16) satisfy $\frac{1}{2} < \Delta(\beta, \delta) < 1$.
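A two-line check of Proposition 2's condition (our own sketch; the function name is hypothetical):

```python
def grim_trigger_is_strotz_pollak(beta: float, delta: float) -> bool:
    """Adhering yields a normalized payoff of 1; the best one-shot deviation
    yields 2 today and 0 forever after Nash reversion, i.e. 2*(1 - Delta)."""
    D = beta * delta / (1 + beta * delta - delta)   # effective factor (16)
    return 1.0 > 2.0 * (1.0 - D)                    # equivalent to D > 1/2

assert grim_trigger_is_strotz_pollak(0.75, 0.80)       # Delta = 0.75
assert not grim_trigger_is_strotz_pollak(0.25, 0.50)   # Delta = 0.20
```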

4.2 Different Discount Factors

We now generalize to the case in which players use different quasi-hyperbolic discount factors. We would like to use the methods developed in section 3.2, but because each player has two discount parameters, $\beta_i$ and $\delta_i$, denoting one player as "patient" and the other as "impatient" potentially becomes trickier. If $\beta_i > \beta_j$ and $\delta_i > \delta_j$, then player $i$ is unequivocally the (relatively) patient player: her discounting weights are everywhere greater than those of player $j$ (equal in the current stage and strictly greater in all future stages). If, however, the inequalities in the two comparisons go in opposite directions, then the players' discount factors potentially⁶ cross at some point; that is, there exists a value of $t$, which we shall denote $\tau$, such that for $t < \tau$ one player has the greater discount factor, and for $t \ge \tau$ the other player has the greater discount factor.

Let us start with the special case in which one player, player 2 say, has $\beta_2 = 1$ (so that his discounting reduces to geometric discounting with discount factor $\delta_2$), and the other player has $\beta_1 < 1$. If $\delta_2 \ge \delta_1$, then the quasi-hyperbolically discounting player 1 is unequivocally the relatively impatient player; if instead $\delta_2 < \delta_1$, then the geometrically discounting player 2 is the relatively impatient player starting at some stage $\tau$. However, note that if the first-stage discount factors satisfy $\delta_2 > \beta_1\delta_1$, then in any current stage the quasi-hyperbolic discounter will view herself as the relatively impatient player for the next stage, and she will select her action and future strategy accordingly.

⁶In discrete time, it is possible that the inequalities go in opposite directions but the relevant discount factors, i.e. the discount functions evaluated at integer values of $t$, never switch dominance. Take for example $\beta_1 = \frac{4}{5} < \frac{5}{6} = \beta_2$ and $\delta_1 = \frac{9}{10} > \frac{1}{10} = \delta_2$; for all integers $t > 0$, we have that $\beta_1\delta_1^t > \beta_2\delta_2^t$. Note that the continuous discount functions always cross at a strictly positive value of $t$, namely
$$t^* = \frac{\log(\beta_2/\beta_1)}{\log(\delta_1/\delta_2)}.$$
If $t^* < 1$, then the discrete-time discount factors do not switch dominance. (For the particular values of the discount parameters stated earlier, $t^* \approx 0.0186$.) Note that this condition for $t^*$ applies equally to cases in which the inequalities go the same way. In these cases, exactly one of the ratios $\beta_2/\beta_1$ and $\delta_1/\delta_2$ is greater than 1 and the other is less than 1, so that $t^*$ is negative, and we conclude (correctly) that the discount factors do not switch dominance in the nonnegative time range we consider.


Let us see if we can sustain an equilibrium similar to the one we had in section 3.2. In particular, let player 1 be a quasi-hyperbolic discounter with $\beta_1 < 1$, and let player 2 be a geometric discounter (i.e. let $\beta_2 = 1$). Hereafter we will subscript the quasi-hyperbolic discounter's values with $qh$ and the geometric discounter's values with $g$. The equilibrium concept we apply is Strotz-Pollak equilibrium. We seek the reward profile stream
$$\underbrace{(2, -1), (2, -1), \ldots, (2, -1)}_{\text{stages } 0 \text{ through } T-1}, \; \underbrace{\left(2e, \tfrac{3}{2} - e\right), \left(2e, \tfrac{3}{2} - e\right), \ldots}_{\text{stages } T, T+1, \ldots}$$
as earlier, obtained by having players play according to $(B, L)$ from stage 0 through stage $T - 1$, play according to $\left(T, \left(\tfrac{1}{2} + e\right)L + \left(\tfrac{1}{2} - e\right)R\right)$ from stage $T$ onward, and punish any deviation by immediate and permanent Nash reversion. The condition necessary for the geometric discounter to adhere to this strategy is the same as that given in section 3.2, namely,

$$\delta_g^T > \frac{1}{\frac{5}{2} - e}. \qquad (17)$$

Analysis of the deviations for the quasi-hyperbolic discounter is similar to section 3.2. She clearly has no incentive to deviate in stages 0 through $T - 1$: her reward in these stages is the greatest possible reward, her reward under the given strategy is strictly positive in all stages, and deviation is punished by permanent Nash reversion, which yields a reward stream of zeros. Move now to the second phase of the game, during which she receives the stage-game reward $2e$. Under the given strategy profile, the game reduces to a one-phase, two-state repeated prisoner's dilemma. The quasi-hyperbolic discounter may plan to deviate, but for Strotz-Pollak equilibrium only her actual actions matter. In any stage of the second phase, the total discounted payoff to the quasi-hyperbolic discounter if she follows the given strategy is

" +∞ # ∗ X t−1 πqh(σ ) = (1 − ∆qh) 2e + βqhδqh δqh (2e) = 2e, t=0 whereas if she deviates in stage T and plays her dominant strategy B she receives the total discounted payoff

" +∞ # 1 X π (σdev, σ∗) = (1 − ∆ ) 2( + e) + β δ δt−10 = (1 − ∆ )(1 + 2e). qh qh g qh 2 qh qh qh qh t=1 Thus the quasi-hyperbolic discounter has no profitable deviations if

$$\pi_{qh}(\sigma^*) > \pi_{qh}(\sigma_{qh}^{dev}, \sigma_g^*),$$
that is,
$$2e > (1 - \Delta_{qh})(1 + 2e) \quad \Longleftrightarrow \quad e > \frac{1 - \Delta_{qh}}{2\Delta_{qh}}. \qquad (18)$$

Recall that $e$, the additional weight above $\frac{1}{2}$ that player 2 (here, the geometric discounter) puts on the action $L$ in the second phase, must satisfy $-\frac{1}{2} \le e \le \frac{1}{2}$. Thus if the proposed strategy is to constitute an equilibrium, we must have
$$\frac{1 - \Delta_{qh}}{2\Delta_{qh}} \le \frac{1}{2},$$


which, solving for $\Delta_{qh}$, yields the lower bound
$$\Delta_{qh} \ge \frac{1}{2}. \qquad (19)$$

Notice that as $\Delta_{qh} \to 1$, condition (18) approaches $e > 0$; that is, the more patient the quasi-hyperbolic discounter becomes, the less "extra" reward the geometric discounter needs to give her to ensure compliance. If conditions (17), (18), and (19) hold, then the given strategy profile $\sigma^*$ constitutes a Strotz-Pollak equilibrium. The total discounted payoffs under this strategy profile to the quasi-hyperbolic discounter are

" T −1 +∞ # ∗ X s−1 T X t−T πqh(σ ) = (1 − ∆qh) 2 + βqhδqh δqh (2) + βqhδqh δqh (2e) s=1 t=T T ! βqhδ = 2 1 − qh (1 − e) , 1 + βqhδqh − δqh and to the geometric discounter are

"T −1 +∞ # X X 3  π (σ∗) = (1 − δ ) δs(−1) + δt − e g g g g 2 s=0 T 5  = − e δT − 1. 2 g

Using the infimum value of e (i.e., the limit of the lowest possible “extra” reward) given by condition (18), these expressions for the total discounted payoff become

$$\pi_{qh}(\sigma^*) = 2\left(1 - \frac{\beta_{qh}\delta_{qh}^T}{1 - \delta_{qh}}(1 - \Delta_{qh})\,\frac{2\Delta_{qh} - 1}{\Delta_{qh}}\right)$$
and
$$\pi_g(\sigma^*) = \left(\frac{7}{2} - \frac{1}{\Delta_{qh}}\right)\delta_g^T - 1.$$

As $\Delta_{qh} \to 1$, we have $\pi_{qh}(\sigma^*) \to 2$; further imposing the limit $\delta_g \to 1$ implies that $\pi_g(\sigma^*) \to \frac{3}{2}$. As with different geometric discount factors, here we can get as close as we like to the point $(2, \frac{3}{2})$ on the Pareto frontier by making the discount factors sufficiently close to 1. Suppose we wish to make the quasi-hyperbolic discounter's total discounted payoff within $\varepsilon_{qh}$ of 2. Solving for $T$, this gives us the lower bound
$$T > \frac{1}{\log \delta_{qh}} \log\left(\frac{\varepsilon_{qh}(1 + \beta_{qh}\delta_{qh} - \delta_{qh})}{2\beta_{qh}(1 - e)}\right). \qquad (20)$$
The upper bound on $T$ comes from requiring the geometric discounter's total discounted payoff to be within $\varepsilon_g$ of $\frac{3}{2}$, and is therefore the same as in (13).


4.2.1 Quasi-Hyperbolic and Geometric Discounter: An Example

We conclude this discussion with an example computation, which illustrates a new feature arising from the fact that $e$ is bounded away from 0 by condition (18). Consider a quasi-hyperbolic discounter with parameters $\beta_{qh} = \frac{3}{4}$ and $\delta_{qh} = \frac{4}{5}$ (so that $\Delta_{qh} = \frac{3}{4}$) playing against a geometric discounter with discount factor $\delta_g = \frac{99}{100}$. Taking the infimum value of $e$ given in (18), we have $\inf e = \frac{1 - \Delta_{qh}}{2\Delta_{qh}} = \frac{1}{6}$. From condition (20), the minimum value of $T$ such that $\pi_{qh}(\sigma^*)$ is within $\varepsilon_{qh}$ of 2 is then

$$\underline{T} = \frac{1}{\log \frac{4}{5}} \log\left(\frac{4}{5}\,\varepsilon_{qh}\right).$$

For $\varepsilon_{qh} = \frac{1}{100}$, this gives $\underline{T} \approx 21.6377$, so that the minimum length of the first phase is $T = 22$. For these parameters, the total discounted payoff to the quasi-hyperbolic discounter is $\pi_{qh}(\sigma^*) \approx 1.9908$. Recall that the total discounted payoff to the geometric discounter is

$$\pi_g(\sigma^*) = \left(\frac{5}{2} - e\right)\delta_g^T - 1.$$

Taking $\delta_g = \frac{99}{100}$, we compute $\pi_g(\sigma^*) \approx 0.8705$. Even in the limit as $\delta_g \to 1$, we have $\pi_g(\sigma^*) \to \frac{4}{3}$, less than the payoff $\frac{3}{2}$ the patient player was able to achieve when playing against an opponent using a different geometric discount factor. As before, the value of $e$ puts an upper bound on the geometric discounter's payoff; but unlike before, here the value of $e$ cannot be made arbitrarily close to 0: it is bounded away from 0 by the value of $\Delta_{qh}$. The patient geometric discounter can only obtain the payoff $\frac{3}{2}$ if $e \to 0$, which requires that $\Delta_{qh} \to 1$. The end result is that the closer $\Delta_{qh}$ is to 1, the smaller we can make $e$ and still satisfy (18), and hence the closer to $\frac{3}{2}$ we can drive the patient geometric discounter's payoff.
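Parts of this example are easy to reproduce (our own sketch; we recompute only the effective discount factor, the infimum of $e$, and the geometric discounter's payoff, taking $T = 22$ from the text):

```python
beta_qh, delta_qh, delta_g = 0.75, 0.80, 0.99

D = beta_qh * delta_qh / (1 + beta_qh * delta_qh - delta_qh)
print(D)                       # 0.75, the effective discount factor (16)

inf_e = (1 - D) / (2 * D)
print(inf_e)                   # 0.1666..., the bound 1/6 from (18)

T = 22                         # first-phase length from the text
pi_g = (2.5 - inf_e) * delta_g**T - 1
print(pi_g)                    # ~0.8705, the geometric discounter's payoff
```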

5 Conclusion

This paper represents a first exploration into the theory of repeated games. After defining some general concepts in repeated games, we considered geometric discounting, presenting celebrated results for the cases of identical and different discount factors. We then constructed an explicit equilibrium of the infinitely repeated two-player prisoner's dilemma that allows us to get arbitrarily close to the Pareto-optimal payoff profile $(2, \frac{3}{2})$. We next turned our attention to quasi-hyperbolic discounting. For the prisoner's dilemma in which players have the same quasi-hyperbolic discount factors, we showed that the mutually cooperative payoff profile $(1, 1)$ is sustainable as a Strotz-Pollak equilibrium under a condition analogous to that for identical geometric discount factors. We then showed that, when player 1 discounts quasi-hyperbolically and player 2 geometrically, we can again get arbitrarily close to the payoff profile $(2, \frac{3}{2})$ by choosing suitable discount factors.

This paper suggests many directions for future research. Under geometric discounting with different discount factors, the payoff profile $(2, \frac{3}{2})$ lies on the Pareto frontier; the reward stream to each player is monotone, and therefore individual rationality puts a lower bound of 0 on any reward received by the relatively impatient player in the second phase. We need to determine whether this point is on the Pareto frontier in the case of quasi-hyperbolic discounting as well, or whether there is a feasible and individually rational payoff profile that Pareto dominates it. An intriguing project is to try to generalize the discussion of section 4.2 to the case in which both players discount quasi-hyperbolically with potentially different discount parameters, $\beta_i \le 1$ and $0 < \delta_i < 1$.


Another vein of research, stepping outside the boundaries of the repeated prisoner's dilemma, is to explore what we can say if players' discount factors change over time. When discount factors are unequal, the relatively impatient player enjoys the greater reward (at least in the case of geometric discounting). Might a player be willing to pay a portion of her current gains to affect the discount factor of her opponent? The main contribution of this paper is to show explicit constructions of payoffs arbitrarily close to a given point. What we would really like to do is characterize the feasible and individually rational payoff sets, à la Lehrer and Pauzner (1999). It is this problem that calls out most loudly for future attention.

References

G. Ainslie. Picoeconomics. Cambridge University Press, Cambridge, 1992.

H. Chade, P. Prokopovych, and L. Smith. Repeated games with present-biased preferences. Journal of Economic Theory, 139(1):157–175, 2008.

D. Fudenberg and E. Maskin. The folk theorem in repeated games with discounting or with incomplete information. Econometrica, 54(3):533–554, 1986.

R. Herrnstein. Relative and absolute strength of response as a function of frequency of reinforcement. Journal of the Experimental Analysis of Behavior, 4(3):267–272, 1961.

D. Laibson. Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2):443–477, 1997.

E. Lehrer and A. Pauzner. Repeated games with differential time preferences. Econometrica, 67(2):393–412, 1999.

M. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, Cambridge, 1994.

E. Phelps and R. Pollak. On second-best national saving and game-equilibrium growth. The Review of Economic Studies, 35(2):185–199, 1968.
