
–Independent Work Report Spring 2014–

Information Theoretic Approaches to the Unexpected Hanging

Stefani Karp Advisor: Mark Braverman

May 6, 2014

“This paper represents my own work in accordance with University regulations.” - Stefani Karp

Abstract

In this paper, we analyze the average surprise caused by the revelation of the day of the hanging in the unexpected hanging paradox. We aim to shed new light on the unexpected hanging paradox by quantifying the gap in average surprise between two cases: (Case 1) the day of the hanging is revealed prior to the start of the week, and (Case 2) the day of the hanging is only revealed when the hanging actually occurs, where Case 2 is the standard formulation of the paradox. As the length of the week increases, we find that this gap in average surprise converges and is upper-bounded (for the distributions we analyze). This indicates that, on average, on the day of the hanging there is a finite amount of information contained in the knowledge that the prisoner has survived until that day.

1. Introduction

The unexpected hanging paradox (often called the surprise examination paradox) has received significant attention in philosophy and mathematics since the 1940s [7]. We present here a brief summary of the paradox: A prisoner is told that he will be hung on one of the days (Monday through Friday) of the upcoming week and that the hanging will be a surprise when it occurs. The prisoner reasons as follows: If I am still alive by Friday morning, then I know the hanging must occur on Friday. Therefore, a Friday hanging could not be a surprise, so I can rule out Friday. If I rule out Friday, then the hanging cannot occur on Thursday either, because if I am still alive by Thursday morning and I have already ruled out Friday, then I know the hanging must occur on Thursday. Therefore, a Thursday hanging could not be a surprise either. By this reasoning, I can rule out every day of the week, and so I conclude that a surprise hanging is impossible. However, on Wednesday afternoon, the prisoner is hung - to his surprise. Therein lies the paradox.

This paradox is known by a variety of names due to its wide generalizability. For example, in Chow [3], rather than a hanging, the event of interest is a surprise examination in class. For several decades, countless attempts have been made to resolve the paradox from a variety of academic perspectives (including analysis by philosophers as renowned as Quine [8]). The paradox has received particular attention in epistemology, as the paradox can be said to hinge on our definitions of "knowledge, belief and memory" [3]. In mathematics, the paradox has been analyzed in relation to Gödel's second incompleteness theorem [6] and from various game-theoretic perspectives [5, 9], to name just a few of the different mathematical lenses through which the paradox has been studied thus far. As of yet, no general consensus on the paradox's resolution has been reached within the academic community [3]. However, in this paper, our goal is not to resolve the paradox, but rather to explore one particular approach to the study of the paradox: an information-theoretic approach. We believe information theory presents a natural lens through which to view the paradox, as it allows us to answer the question: How can we quantify the prisoner's surprise, and what is that quantity? This is a highly significant question, as many of the debates surrounding the paradox hinge upon the definition of what it means to be "surprised" [3]. According to information theory, the quantity of surprise caused by a particular outcome is the amount of information revealed when it occurs.
The most widely accepted definition of information content is closely tied to Shannon entropy (which is often defined as the expected value of a random variable's information content). Specifically, the information content $I(x)$ (or, more formally, the self-information) contained in outcome $x$ is defined as $I(x) = \log \frac{1}{P(x)}$, where $P(x)$ is the probability of outcome $x$ [1]. This paper is not the first to examine the unexpected hanging paradox from an information-theoretic perspective. Chow [3] discusses the work of Karl Narveson, who derived a probability distribution (for the day of the hanging) that maximizes the average surprise when the hanging occurs. Borwein [2] builds on this work by further analyzing the mathematical properties of the surprise function and solving various surprise-maximization problems (i.e., optimization problems) derived from the paradox. Extending the work of Chow and Borwein, in this paper we primarily focus on the gap in average surprise between (1) the prisoner told at the beginning of the week what day he will be hung and (2) the prisoner discovering the day of his hanging later in the week at the time of the hanging.

As we will show, in the first case, the prisoner's average surprise is simply the Shannon entropy of the probability distribution for the day of the hanging. In the second case, the prisoner's average surprise is diminished; by the time the hanging occurs, the prisoner has already gained some information (i.e., the prisoner knows that he has not yet died). Therefore, in this second case, the prisoner's average surprise is smaller than in the first case. Specifically, we determine the maximum average surprise achievable in the first case, and we compare this value with the average surprise in the second case under two different prior probability distributions (the uniform distribution and the surprise-maximizing distribution). As the number of days in the "week" approaches infinity, we find that the average amount of surprise lost approaches $\frac{1}{\ln 2}$ (from below). Our results confirm our intuition that the prisoner can be surprised, since we see that the prisoner's average surprise is less than the entropy by at most $\frac{1}{\ln 2} \approx 1.44$ bits of information. Since the entropy grows without bound but the gap in average surprise is bounded, we see that the average amount of information gained prior to the hanging eventually becomes negligible compared to the entropy (as the length of the week increases).

2. Problem Definition

As in Borwein [2], we consider an n-day week, during which a prisoner is hung on exactly one of the n days. (We note that the problem can be generalized from a "hanging" to any event of interest that occurs once in a particular length of time.) Let $D_i$ be the event that the prisoner is hung (i.e., the prisoner dies) on day $i$, where $i \in \{1,\ldots,n\}$. Then let $A_i$ be the event that the prisoner is still alive by the start of day $i$, such that $A_i = \neg D_1 \cap \cdots \cap \neg D_{i-1}$, where $\neg D_i$ means that the prisoner is not hung on day $i$. We let $P(D_i)$ be the prior probability that the hanging occurs on day $i$, and we let $P(D_i \mid A_i)$ be the posterior probability that the hanging occurs on day $i$ (i.e., the conditional probability that the prisoner dies on day $i$ given that the prisoner has survived days 1 through $i-1$).

Our goal is to find the average surprise caused by the revelation of the day of the hanging. We use $S_n$ to represent the average surprise for an n-day week. Based on the Shannon entropy-based definition of self-information, we define the surprise caused by an event $E$ as the information gain $I(E)$ upon $E$'s occurrence, where $I(E) = -\log_2 P(E)$ if $E$ occurs with probability $P(E)$.

If the prisoner is told at the beginning of the week what day he will be hung, then the average surprise $S_n$ is simply:

\[ S_n = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i) \]

since, at the start of the week, the probability of the hanging occurring on any day $i$ is just the prior probability $P(D_i)$. However, if the prisoner only discovers the day of his hanging at the time of the hanging, then the average surprise $S_n$ is:

\[ S_n = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i \mid A_i) \]

since, at the time of the hanging, the probability of the hanging occurring on day $i$ must account for the fact that there are now effectively only $n - i + 1$ days in the week, where $i \in \{1,\ldots,n\}$.
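As a concrete illustration (not part of the original analysis), the following short Python sketch computes both quantities for an arbitrary prior over an n-day week; the function names are ours.

    import math

    def avg_surprise_case1(prior):
        # Case 1: the day is revealed before the week starts, so the average
        # surprise is the Shannon entropy -sum_i P(D_i) log2 P(D_i).
        return -sum(p * math.log2(p) for p in prior if p > 0)

    def avg_surprise_case2(prior):
        # Case 2: the day is revealed only at the hanging, so the average
        # surprise is -sum_i P(D_i) log2 P(D_i | A_i), with
        # P(D_i | A_i) = P(D_i) / sum_{j >= i} P(D_j).
        total, tail = 0.0, 1.0  # tail = P(A_i), probability of surviving days 1..i-1
        for p in prior:
            if p > 0:
                total -= p * math.log2(p / tail)
            tail -= p
        return total

    # Example: a uniform prior over a 5-day week.
    uniform5 = [1/5] * 5
    print(avg_surprise_case1(uniform5))  # log2(5) ≈ 2.3219 bits
    print(avg_surprise_case2(uniform5))  # (1/5) * log2(5!) ≈ 1.3814 bits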

3. The Average Surprise Gap

We begin with a discussion of the maximum average surprise when the prisoner is told at the beginning of the week what day he will be hung. We call this Case 1.

We then consider the average surprise when the day of the hanging is revealed at the time of the hanging instead. We call this Case 2.

Throughout this section, we use the following notation: $S_n^*$ is the maximum average surprise for an n-day week when the day of the hanging is revealed at the start of the week (Case 1), and $S_n^{\mathcal{D}}$ is the average surprise under the prior probability distribution $\mathcal{D}$ in Case 2.

3.1. Case 1 - Maximum Average Surprise

Theorem 3.1.1. The prior probability distribution on $\{1,\ldots,n\}$ that yields $S_n^*$ is the discrete uniform distribution.

Proof: As defined in Section 2, for Case 1 we have:

\[ S_n^* = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i) \]

which is simply the Shannon entropy of the prior probability distribution.

It is well-known that the maximum entropy probability distribution on {1,...,n} is the discrete uniform distribution.

Therefore, the prior distribution that maximizes the average surprise for Case 1 is the discrete uniform distribution.

Theorem 3.1.2. $S_n^* = \log_2 n$.

Proof: From the proof of Theorem 3.1.1, we have:

\[ S_n^* = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i) \]

where $P(D_i) = \frac{1}{n}$. Thus, we find:

\[ S_n^* = -\sum_{i=1}^{n} \frac{1}{n} \cdot \log_2\left(\frac{1}{n}\right) = \log_2 n \]

(This result is as expected, as the entropy of the discrete uniform distribution is known to be $\log_2 n$.)

3.2. Case 2 - Uniform Distribution

Based on the results of Section 3.1, we now consider the uniform distribution for Case 2.

Theorem 3.2.1. Let $\mathcal{U}$ be the uniform distribution on $\{1,\ldots,n\}$. Then, $S_n^{\mathcal{U}} = \frac{1}{n}\log_2(n!)$.

Proof: Under distribution $\mathcal{U}$, we have $P(D_i) = \frac{1}{n}$ and $P(D_i \mid A_i) = \frac{1}{n-i+1}$. Therefore, we have:

\[
\begin{aligned}
S_n^{\mathcal{U}} &= -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i \mid A_i) \\
&= -\sum_{i=1}^{n} \frac{1}{n} \log_2\left(\frac{1}{n-i+1}\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \log_2(n-i+1) \\
&= \frac{1}{n} \log_2\left[\prod_{i=1}^{n}(n-i+1)\right] \\
&= \frac{1}{n} \log_2(n!)
\end{aligned}
\]

Theorem 3.2.2. Let $g_n^{\mathcal{U}} = S_n^* - S_n^{\mathcal{U}}$. Then, $\lim_{n\to\infty} g_n^{\mathcal{U}} = \frac{1}{\ln 2}$.

Proof: By Theorems 3.1.2 and 3.2.1, we have $S_n^* = \log_2 n$ and $S_n^{\mathcal{U}} = \frac{1}{n}\log_2(n!)$. Therefore, we have $g_n^{\mathcal{U}} = \log_2 n - \frac{1}{n}\log_2(n!)$, and thus:

\[ \lim_{n\to\infty} g_n^{\mathcal{U}} = \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2(n!)\right] \]

Stirling's approximation gives us the asymptotic formula $n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$. With this, we have:

\[
\begin{aligned}
\lim_{n\to\infty} g_n^{\mathcal{U}} &= \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2\left(\sqrt{2\pi n}\left(\frac{n}{e}\right)^n\right)\right] \\
&= \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2\sqrt{2\pi n} - \log_2\frac{n}{e}\right] \\
&= \lim_{n\to\infty}\left[\log_2 e - \frac{1}{n}\log_2\sqrt{2\pi n}\right] \\
&= \log_2 e = \frac{1}{\ln 2}
\end{aligned}
\]

Analysis: This result shows that although $S_n^*$ and $S_n^{\mathcal{U}}$ both increase without bound as $n \to \infty$, the gap between them converges. Furthermore, experiments in Mathematica strongly indicate that $g_n^{\mathcal{U}}$ is nondecreasing (see Figure 1 below). This implies that the gap between $S_n^*$ and $S_n^{\mathcal{U}}$ is upper-bounded by $\frac{1}{\ln 2}$, as the plot appears to confirm. Therefore, the average amount of information contained in the fact that the prisoner has not yet died by the day of the hanging is at most $\frac{1}{\ln 2}$.

Figure 1: Plot of $g_n^{\mathcal{U}}$.
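As a quick numerical companion to Figure 1 (our own sketch, in Python rather than the Mathematica used in the paper), the closed form $g_n^{\mathcal{U}} = \log_2 n - \frac{1}{n}\log_2(n!)$ can be evaluated directly:

    import math

    def gap_uniform(n):
        # g_n^U = log2(n) - (1/n) * log2(n!), using lgamma(n+1) = ln(n!)
        return math.log2(n) - math.lgamma(n + 1) / (n * math.log(2))

    prev = 0.0
    for n in [2, 10, 100, 1000, 10**6]:
        g = gap_uniform(n)
        assert g >= prev  # consistent with the plot: the gap looks nondecreasing
        prev = g
        print(n, round(g, 5))

    print(1 / math.log(2))  # the limiting value 1/ln 2 ≈ 1.44270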

3.3. Case 2 - Maximum Average Surprise

As in Section 3.1 (for Case 1), we now find the maximum average surprise for Case 2.

Although the uniform distribution maximizes the average surprise for Case 1, there is no reason to believe that the uniform distribution will also maximize the average surprise for Case 2. Therefore, we proceed as follows:

Theorem 3.3.1. Let $\mathcal{D}_n^* = \arg\max_{\mathcal{D}} S_n^{\mathcal{D}}$. Let $g_n = S_n^* - S_n^{\mathcal{D}_n^*}$. Then $g_n$ is defined recursively as:

\[ g_n = \frac{\ln n + g_{n-1}\ln 2 - \ln(n-1) - e^{[g_{n-1}\ln 2 - \ln(n-1) - 1]}}{\ln 2}, \qquad g_1 = 0. \]

Proof: As presented in Chow [3], we can define $\mathcal{D}_n^*$ in terms of the conditional probability $q_k$, where $q_k$ is the probability that the hanging occurs on the $k$th-to-the-last day of an n-day week for $k \in \{0,\ldots,n-1\}$. According to Karl Narveson, we find the mutual recursions:

\[ q_k = e^{(s_{k-1} - 1)} \qquad (1) \]
\[ s_k = s_{k-1} - q_k \qquad (2) \]

where $s_0 = 0$. As we can see, the conditional probability $q_k$ is independent of the total length of the week (i.e., $q_k$ is independent of $n$). The intuition for this recursive definition is the following: if the hanging has not occurred at the start of the $k$th-to-the-last day, then the residual probability distribution for the remaining $k+1$ days given this knowledge must be $\mathcal{D}_{k+1}^*$. Therefore, we have:

\[ S_n^{\mathcal{D}_n^*} = -q_{n-1}\log_2 q_{n-1} + (1 - q_{n-1})\, S_{n-1}^{\mathcal{D}_{n-1}^*}, \quad \text{for } n \geq 2, \]

a simple variant of which serves as the foundation for Narveson's mutual recursions (1) and (2) above. We therefore find that $S_n^{\mathcal{D}_n^*} = -\frac{1}{\ln 2}\, s_{n-1}$ for $n \geq 1$.

Then, we have:

\[
\begin{aligned}
g_n &= \log_2 n - \left(-\frac{1}{\ln 2}\, s_{n-1}\right) \qquad (3) \\
&= \frac{\ln n}{\ln 2} + \frac{s_{n-1}}{\ln 2} \\
&= \frac{\ln n}{\ln 2} + \frac{s_{n-2} - q_{n-1}}{\ln 2} \\
&= \frac{\ln n + s_{n-2} - e^{(s_{n-2} - 1)}}{\ln 2} \qquad (4)
\end{aligned}
\]

From equation (3) above, we find: $s_{n-1} = g_n \ln 2 - \ln n \;\Rightarrow\; s_{n-2} = g_{n-1}\ln 2 - \ln(n-1)$.

Plugging this back into equation (4), we have:

\[ g_n = \frac{\ln n + [g_{n-1}\ln 2 - \ln(n-1)] - e^{[g_{n-1}\ln 2 - \ln(n-1) - 1]}}{\ln 2} \]

When $n = 1$, we have $s_{n-1} = s_0 = 0$ and therefore $g_n = \log_2 n - \left(-\frac{1}{\ln 2}\, s_{n-1}\right) = 0$.

Therefore, we have $g_n = \frac{\ln n + g_{n-1}\ln 2 - \ln(n-1) - e^{[g_{n-1}\ln 2 - \ln(n-1) - 1]}}{\ln 2}$, with $g_1 = 0$.

Analysis: We then plotted $g_n$ in Mathematica for large values of $n$, as seen below in Figure 2. This plot provided the following intuition: (1) $g_n$ appears to be nondecreasing for all $n$. (2) $g_n$ appears to approach a value slightly above 1.442. Therefore, we hypothesized that $g_n$ is upper-bounded by $\frac{1}{\ln 2}$, the proof of which follows in Lemma 3.3.2. We further hypothesized that not only is $g_n$ upper-bounded by $\frac{1}{\ln 2}$, but also that $g_n$ converges to $\frac{1}{\ln 2}$. The proof of a lower bound in Lemma 3.3.3 is used in conjunction with Lemma 3.3.2 to prove convergence (as shown in Theorem 3.3.4).

Figure 2: Plot of $g_n$.
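The recursions above are straightforward to evaluate numerically. The sketch below (ours, in Python rather than Mathematica) computes $s_k$ via Narveson's recursion (1)-(2) and $g_n = \log_2 n + s_{n-1}/\ln 2$ from equation (3), reproducing the values behind Figure 2.

    import math

    def gaps(n_max):
        # Returns [g_1, ..., g_{n_max}] using Narveson's recursion for s_k
        # and g_n = log2(n) + s_{n-1} / ln 2 (equation (3)).
        s = 0.0        # s_0 = 0
        g = [0.0]      # g_1 = log2(1) + s_0 / ln 2 = 0
        for n in range(2, n_max + 1):
            q = math.exp(s - 1.0)  # q_{n-1} = e^(s_{n-2} - 1)
            s -= q                 # s_{n-1} = s_{n-2} - q_{n-1}
            g.append(math.log2(n) + s / math.log(2))
        return g

    g = gaps(500)
    print(g[499])            # g_500 ≈ 1.43 (the paper reports 1.43005), used in Lemma 3.3.3
    print(1 / math.log(2))   # 1/ln 2 ≈ 1.44270, the limit proved in Theorem 3.3.4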

Lemma 3.3.2. $g_n < \frac{1}{\ln 2}$ for all $n \geq 1$.

Proof: By induction on $n$. By Theorem 3.3.1, we have $g_1 = 0 < \frac{1}{\ln 2}$.

For some $n \geq 2$, we assume that $g_{n-1} < \frac{1}{\ln 2}$. This is equivalent to $g_{n-1} = \frac{1}{\ln 2}(1 - \varepsilon)$ for some $\varepsilon > 0$. Therefore, we find:

\[
\begin{aligned}
g_n &= \frac{\ln\left(\frac{n}{n-1}\right)}{\ln 2} + g_{n-1} - \frac{2^{g_{n-1}}}{e(n-1)\ln 2} \\
&= \frac{\ln\left(\frac{n}{n-1}\right)}{\ln 2} + \frac{(1-\varepsilon)}{\ln 2} - \frac{2^{(1-\varepsilon)/\ln 2}}{e(n-1)\ln 2} \\
&= \frac{\ln\left(\frac{n}{n-1}\right)}{\ln 2} + \frac{(1-\varepsilon)}{\ln 2} - \frac{e^{(1-\varepsilon)}}{e(n-1)\ln 2} \\
\Rightarrow g_n &= \frac{\ln\left(\frac{n}{n-1}\right)}{\ln 2} + \frac{(1-\varepsilon)}{\ln 2} - \frac{e^{-\varepsilon}}{(n-1)\ln 2} \qquad (5)
\end{aligned}
\]

Since $e^x > 1 + x$ for $x \neq 0$, we find: $x > \ln(1+x)$ for $x > -1$, $x \neq 0$.

Therefore, for n ≥ 2, we have:

\[ \frac{1}{n-1} > \ln\left(1 + \frac{1}{n-1}\right) \;\Rightarrow\; \frac{1}{n-1} > \ln\left(\frac{n}{n-1}\right) \qquad (6) \]

Using $e^x > 1 + x$ for $x \neq 0$, as well as inequality (6), we can upper-bound $g_n$ from equation (5):

\[
\begin{aligned}
g_n &< \frac{1}{(n-1)\ln 2} + \frac{(1-\varepsilon)}{\ln 2} - \frac{(1-\varepsilon)}{(n-1)\ln 2} \\
&= \frac{1}{(n-1)\ln 2}\left[1 - (1-\varepsilon)\right] + \frac{(1-\varepsilon)}{\ln 2} \\
&= \frac{\varepsilon}{(n-1)\ln 2} + \frac{(1-\varepsilon)}{\ln 2} \\
&< \frac{\varepsilon}{\ln 2} + \frac{(1-\varepsilon)}{\ln 2} \\
&= \frac{1}{\ln 2}
\end{aligned}
\]

Therefore, we have: $g_{n-1} < \frac{1}{\ln 2} \Rightarrow g_n < \frac{1}{\ln 2}$ for $n \geq 2$. By induction, $g_n < \frac{1}{\ln 2}$ for all $n \geq 1$.

Lemma 3.3.3. $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$ for all $n \geq 500$.

Proof: By induction on n.

From calculations in Mathematica, we have: $g_{500} \approx 1.43005 > \frac{1}{\ln 2} - \frac{7\ln 500}{500} \approx 1.35569$.

Then, for some $n \geq 500$, we assume that $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$. There are exactly two cases:

(1) $g_n > \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$

(2) $\frac{1}{\ln 2} - \frac{7\ln n}{n} < g_n \leq \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$

Case (1) is simple. Based on Figure 2 (and strongly supported by further calculations in Mathematica), we assume that $g_n$ is nondecreasing in $n$. Therefore, we have $g_{n+1} \geq g_n > \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$. Therefore, $g_{n+1} > \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$.

For case (2), we show that $g_{n+1} - g_n > \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1}$.

Let $\Delta g_n = g_{n+1} - g_n$. Then, we have:

" n+1  # ln 2gn ∆g = n + g − − g n ln2 n enln2 n n+1  ln 2gn = n − ln2 enln2 ln n+1  2[1/ln2−7ln(n+1)/(n+1)] ≥ n − ln2 enln2 ln n+1  2[−7ln(n+1)/(n+1)] = n − ln2 nln2 ln n+1  e[−7ln2· ln(n+1)/(n+1)] ⇒ ∆g ≥ n − (7) n ln2 nln2

We use the following two inequalities: $\ln(1+\varepsilon) > \varepsilon - \frac{\varepsilon^2}{2}$ for $\varepsilon > 0$ and $e^{-\varepsilon} < 1 - \varepsilon + \frac{\varepsilon^2}{2}$ for $\varepsilon > 0$. Therefore, inequality (7) above simplifies as follows:

\[
\begin{aligned}
\Delta g_n &> \frac{\frac{1}{n} - \frac{1}{2n^2}}{\ln 2} - \frac{1 - 7\ln 2 \cdot \frac{\ln(n+1)}{n+1} + \left[7\ln 2 \cdot \frac{\ln(n+1)}{n+1}\right]^2 / 2}{n\ln 2} \\
&= \frac{1 - \frac{1}{2n} - 1 + 7\ln 2 \cdot \frac{\ln(n+1)}{n+1} - \left[7\ln 2 \cdot \frac{\ln(n+1)}{n+1}\right]^2 / 2}{n\ln 2} \\
&= \frac{-\frac{1}{2n} + 7\ln 2 \cdot \frac{\ln(n+1)}{n+1} - \left[7\ln 2 \cdot \frac{\ln(n+1)}{n+1}\right]^2 / 2}{n\ln 2} \\
\Rightarrow \Delta g_n &> \frac{-\frac{1}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1} - (7\ln 2)^2 \cdot \frac{[\ln(n+1)]^2}{(n+1)^2} / 2}{n\ln 2} \qquad (8)
\end{aligned}
\]

It is known that $(\ln n)^2 < n$ for sufficiently large $n$. Based on this intuition, we find that:

\[ \frac{(7\ln 2)^2\,[\ln(n+1)]^2}{2} < n + 1 \]

for $n \geq 500$. This allows us to simplify inequality (8) as follows:

\[
\begin{aligned}
\Delta g_n &> \frac{-\frac{1}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1} - \frac{n+1}{(n+1)^2}}{n\ln 2} \\
&= \frac{-\frac{1}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1} - \frac{1}{n+1}}{n\ln 2} \\
\Rightarrow \Delta g_n &> \frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2} \qquad (9)
\end{aligned}
\]

Thus, we have $\Delta g_n > \frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2}$, and our goal is to show $\Delta g_n > \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1}$. We therefore proceed to show that $\frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2} > \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1}$:

\[
\begin{aligned}
\frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2} &\overset{?}{>} \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1} \qquad (10) \\
\frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2} &\overset{?}{>} 7\left[\frac{n\ln\left(\frac{n}{n+1}\right) + \ln n}{n(n+1)}\right] \\
-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1} &\overset{?}{>} 7\ln 2\left[\frac{n\ln\left(\frac{n}{n+1}\right) + \ln n}{n+1}\right] \\
-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1} &\overset{?}{>} 7\ln 2 \cdot \frac{\ln n}{n+1} - 7\ln 2 \cdot \frac{n}{n+1} \cdot \ln\left(1 + \frac{1}{n}\right) \\
-\frac{2}{n} &\overset{?}{>} -7\ln 2 \cdot \frac{n}{n+1} \cdot \ln\left(1 + \frac{1}{n}\right) \\
\frac{2}{n} &\overset{?}{<} 7\ln 2 \cdot \frac{n}{n+1} \cdot \ln\left(1 + \frac{1}{n}\right) \qquad (11)
\end{aligned}
\]

For $n > 1$, we have $\ln\left(1 + \frac{1}{n}\right) > \frac{1}{2n}$. Therefore, the truth of the following inequality implies the truth of (11) as well:

\[
\begin{aligned}
\frac{2}{n} &\overset{?}{<} 7\ln 2 \cdot \frac{n}{n+1} \cdot \frac{1}{2n} \\
\frac{4}{7\ln 2} &\overset{?}{<} \frac{n}{n+1} \qquad (12)
\end{aligned}
\]

Since $\frac{4}{7\ln 2} \approx 0.824$, inequality (12) is true for $n \geq 500$.

Therefore, inequality (10) is true, and combining this with inequality (9), we have:

\[ \Delta g_n > \frac{-\frac{2}{n} + 7\ln 2 \cdot \frac{\ln n}{n+1}}{n\ln 2} > \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1} \]

Therefore, we have $g_{n+1} > g_n + \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1}$, and since $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$ by the inductive hypothesis, we have $g_{n+1} > \frac{1}{\ln 2} - \frac{7\ln n}{n} + \frac{7\ln n}{n} - \frac{7\ln(n+1)}{n+1} = \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$, i.e., $g_{n+1} > \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$.

This completes case (2).

For $n \geq 500$, we have therefore shown that if $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$, then $g_{n+1} > \frac{1}{\ln 2} - \frac{7\ln(n+1)}{n+1}$.

By induction, $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$ for all $n \geq 500$.

Theorem 3.3.4. $\lim_{n\to\infty} g_n = \frac{1}{\ln 2}$.

Proof: From Lemma 3.3.2, we have $g_n < \frac{1}{\ln 2}$ for all $n \geq 1$. From Lemma 3.3.3, we have $g_n > \frac{1}{\ln 2} - \frac{7\ln n}{n}$ for all $n \geq 500$. Thus, we have:

\[
\begin{aligned}
\frac{1}{\ln 2} - \frac{7\ln n}{n} &< g_n < \frac{1}{\ln 2} \\
\lim_{n\to\infty}\left(\frac{1}{\ln 2} - \frac{7\ln n}{n}\right) &\leq \lim_{n\to\infty} g_n \leq \lim_{n\to\infty} \frac{1}{\ln 2} \\
\frac{1}{\ln 2} &\leq \lim_{n\to\infty} g_n \leq \frac{1}{\ln 2} \\
\Rightarrow \lim_{n\to\infty} g_n &= \frac{1}{\ln 2}
\end{aligned}
\]

Analysis: The fact that this limit is also $\frac{1}{\ln 2}$ means that $g_n$ and $g_n^{\mathcal{U}}$ converge to the same value. This implies that the average surprise in Case 2 converges to the same value for both the uniform distribution and the maximum-surprise distribution. An interesting next step might be to determine whether the maximum-surprise distribution itself converges to the uniform distribution as $n \to \infty$. To some extent, this question is answered by Section 4.

4. The Continuous Version

In this section, we extend our results from Section 3 to a continuous formulation of the problem, in which an event of interest occurs at some time $t \in [0,T]$, as opposed to once during an n-day week (the discrete version of the problem).

To analyze the information content in a continuous probability distribution, we use the concept of the differential entropy $H$, defined in [4] for a continuous probability distribution $f(x)$ as follows:

\[ H = \int_x f(x)\log_2\frac{1}{f(x)}\,dx = -\int_x f(x)\log_2 f(x)\,dx \]

Just as in the discrete formulation of the problem, the entropy H is equal to the average surprise S when the time of the event is revealed prior to the start of the interval (Case 1).

However, in Case 2, the time of the event is only revealed when it occurs, and so the average surprise $S$ is instead:

\[ S = \int_0^T f(t)\log_2\frac{1}{g(t)}\,dt = -\int_0^T f(t)\log_2 g(t)\,dt \]

where $f(t)$ is the probability density of the event occurring at time $t$ and $g(t)$ is the density of the event occurring at time $t$ conditioned on the event not yet having occurred by time $t$.

Thus, we have:

\[ g(t) = \frac{f(t)}{\int_t^T f(x)\,dx} \]

And therefore:

\[ S = -\int_0^T f(t)\log_2\left[\frac{f(t)}{\int_t^T f(x)\,dx}\right]dt \]

for a continuous probability distribution $f(t)$.

In the theorems that follow, we use notation similar to that of Section 3. However, we eliminate the subscript $n$ (since $n$ lacks meaning for a continuous probability distribution), and we let $T = 1$ for simplicity.

Theorem 4.1. In Case 1, the maximum average surprise $S^* = 0$.

Proof: As explained above, in Case 1 we have $S^* = H$, where $H$ is the entropy of the distribution that maximizes the average surprise. It is a known fact that the maximum entropy distribution on a continuous interval $[a,b]$ is the uniform distribution. Therefore, $S^*$ is the entropy of the uniform distribution on $[0,1]$:

\[ S^* = -\int_0^1 f(t)\log_2 f(t)\,dt = -\int_0^1 1\cdot\log_2(1)\,dt = 0. \]

Theorem 4.2. In Case 2, let $\mathcal{U}$ be the uniform distribution on $[0,1]$. Then $S^{\mathcal{U}} = -\frac{1}{\ln 2}$.

Proof: The uniform distribution on $[0,1]$ is $f(t) = 1$. Thus,

\[ S^{\mathcal{U}} = -\int_0^1 f(t)\log_2\left[\frac{f(t)}{\int_t^1 f(x)\,dx}\right]dt = -\int_0^1 1\cdot\log_2\left[\frac{1}{\int_t^1 1\,dx}\right]dt = -\frac{1}{\ln 2}. \]
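A simple numerical check (our own sketch, not part of the original proof) approximates the integral with a midpoint rule and recovers the same value:

    import math

    # S^U = -∫_0^1 log2(1 / (1 - t)) dt for the uniform density f(t) = 1 on [0, 1]
    N = 1_000_000
    S_uniform = sum(-math.log2(1.0 / (1.0 - (k + 0.5) / N)) for k in range(N)) / N
    print(S_uniform)             # ≈ -1.4427
    print(-1.0 / math.log(2))    # exact value: -1/ln 2 ≈ -1.44270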

Theorem 4.3. In Case 2, let $\mathcal{D}^*$ be the distribution on $[0,1]$ that maximizes the average surprise. Then $\mathcal{D}^* = \mathcal{U}$.

Proof: The maximum-surprise probability distribution $f(t)$ maximizes:

\[ S = -\int_0^1 f(t)\log_2\left[\frac{f(t)}{\int_t^1 f(x)\,dx}\right]dt \]

To find the maximum-surprise probability distribution, we use the following intuition: analogous to the recursive equation for maximum surprise in the discrete version of the problem, the maximum-surprise continuous probability distribution must display the following self-similarity:

\[ \frac{f\left(\frac{\varepsilon}{1-t}\right)}{f(0)} = \frac{f(t+\varepsilon)}{f(t)} \]

We use the above equation to find a differential equation as follows:

\[
\begin{aligned}
f(t+\varepsilon) &= \frac{f\left(\frac{\varepsilon}{1-t}\right) f(t)}{f(0)} \\
\frac{f(t+\varepsilon) - f(t)}{\varepsilon} &= \frac{f\left(\frac{\varepsilon}{1-t}\right) f(t) - f(t) f(0)}{f(0)\cdot\varepsilon} \\
\frac{f(t+\varepsilon) - f(t)}{\varepsilon} &= \left[\frac{f\left(\frac{\varepsilon}{1-t}\right) - f(0)}{\varepsilon}\right]\cdot\frac{f(t)}{f(0)} \\
\frac{df}{dt} &= \left(\frac{1}{1-t}\right)\cdot f'(0)\cdot\frac{f(t)}{f(0)} \\
\frac{df}{dt} &= c\cdot\frac{1}{1-t}\cdot f(t)
\end{aligned}
\]

which we can solve using a program such as Mathematica to find:

\[ f(t) = \frac{C_1}{|t-1|^c} \]

on the interval $t \in (0,1)$, where $C_1$ is some constant.
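The differential equation can also be checked symbolically; a small sketch (ours, using SymPy rather than Mathematica) recovers the same one-parameter family of solutions.

    import sympy as sp

    t, c = sp.symbols('t c')
    f = sp.Function('f')

    # df/dt = c * f(t) / (1 - t), the ODE derived from the self-similarity condition
    ode = sp.Eq(f(t).diff(t), c * f(t) / (1 - t))
    solution = sp.dsolve(ode, f(t))
    print(solution)  # expect a solution equivalent to f(t) = C1 / (1 - t)**c on (0, 1)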

We then solve for the value of $C_1$ that makes $f(t)$ a probability distribution over $(0,1)$:

\[ \int_0^1 \frac{C_1}{|t-1|^c}\,dt = 1 \]

and find $C_1 = 1 - c$, where $c < 1$.

Plugging $C_1$ back in, we get:

\[ f(t) = \frac{1-c}{|t-1|^c} \]

on the interval $t \in (0,1)$, where $c < 1$.

The average surprise is then:

\[ S = -\int_0^1 \frac{1-c}{(1-t)^c}\,\log_2\left[\frac{\frac{1-c}{(1-t)^c}}{\int_t^1 \frac{1-c}{(1-x)^c}\,dx}\right]dt = -\frac{1}{(1-c)\ln 2} - \log_2(1-c) \]

We then maximize this function with respect to c, yielding c = 0.
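To double-check this maximization (our own sketch, not part of the original derivation), we can evaluate the average surprise directly by numerical integration for several values of c and compare it with the closed form above; the maximum occurs at c = 0.

    import math

    def surprise(c, N=200_000):
        # Midpoint-rule approximation of S = -∫_0^1 f(t) * log2((1-c)/(1-t)) dt,
        # where f(t) = (1-c)/(1-t)^c and (1-c)/(1-t) is the ratio inside the log
        # after evaluating ∫_t^1 f(x) dx = (1-t)^(1-c).
        total = 0.0
        for k in range(N):
            t = (k + 0.5) / N
            f_t = (1.0 - c) / (1.0 - t) ** c
            total -= f_t * math.log2((1.0 - c) / (1.0 - t))
        return total / N

    for c in [-1.0, -0.5, 0.0, 0.3]:
        closed_form = -1.0 / ((1.0 - c) * math.log(2)) - math.log2(1.0 - c)
        # The two columns agree up to discretization error; both peak at c = 0,
        # where the value is -1/ln 2 ≈ -1.4427.
        print(c, round(surprise(c), 4), round(closed_form, 4))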

The maximum-surprise probability distribution is therefore:

\[ f(t) = 1 \]

which is the uniform distribution and therefore satisfies $c = \frac{f'(0)}{f(0)}$.

Analysis: In the continuous case, the surprise-maximizing distribution is the uniform distribution. This is interesting, as it seems to imply that in the discrete case the surprise-maximizing probability distribution approaches the uniform distribution as $n \to \infty$ (as we hypothesized based on our analyses in Section 3). Furthermore, we see that the gap in the continuous case is $0 - \left(-\frac{1}{\ln 2}\right) = \frac{1}{\ln 2}$, just as in the limit of the discrete case, which seems to confirm our results from Section 3.

5. Conclusions and Future Work

We briefly review our process and results before discussing future research questions. First, we review the discrete case (i.e., the standard formulation of the paradox with an n-day week). For a given prior probability distribution, the average surprise caused by the revelation of the day of the hanging is maximized when the day of the hanging is revealed prior to the start of the week. In this case (referred to as Case 1 throughout the paper), the average surprise caused by the revelation of the day of the hanging is simply the entropy of the probability distribution over the days of the week. Since the discrete uniform distribution on $\{1,\ldots,n\}$ is the maximum entropy probability distribution, we therefore conclude the following: The maximum average surprise caused by the revelation of the day of the hanging occurs when the following two conditions are met: (a) The day of the hanging is revealed prior to the start of the week, and (b) The probability distribution over the days of the week is the discrete uniform distribution. Our calculations show that the value of the maximum average surprise (i.e., the value of the average surprise when these two conditions are met) is $\log_2 n$, where $n$ is the number of days in the week.

Our main goal was to determine how the average surprise in the standard formulation of the paradox (referred to as Case 2 throughout this paper) differs from the maximum quantity described above. In Case 2, the day of the hanging is only revealed on the day of the hanging itself. Therefore, when the day of the hanging is revealed, the prisoner has already gained the information that the hanging is not on any of the previous days. As a result, immediately prior to the revelation of the day of the hanging, there is less uncertainty in the day of the hanging than there was initially, and so the prisoner's surprise is less than it would have been had the day of the hanging been revealed prior to the start of the week (as in Case 1).

For Case 2, we first analyzed the uniform distribution, since this is the distribution that generates the maximum average surprise in Case 1. We found that, in the limit (as $n \to \infty$), the prisoner experiences $\frac{1}{\ln 2}$ bits of surprise more in Case 1 than in Case 2 and, further, that this gap in surprise does not exceed $\frac{1}{\ln 2}$. We then analyzed the gap in surprise for Case 2's maximum surprise probability distribution (in comparison with Case 1's maximum surprise probability distribution). In the limit, this gap in surprise also approaches $\frac{1}{\ln 2}$ and, as we proved, this gap converges to $\frac{1}{\ln 2}$.

What can we conclude from this? To begin, we clearly see that the prisoner can in fact experience surprise in Case 2, since the average surprise in Case 1 grows without bound, but the gap in average surprise between Case 1 and Case 2 is at most a small constant. This provides quantitative confirmation of our belief that the prisoner can (despite his paradoxical reasoning) be surprised by the day of the hanging. Further, our analysis of the continuous probability distribution seems to indicate that, in the discrete case limit (as $n \to \infty$), the surprise-maximizing probability distribution approaches the discrete uniform distribution.

Analyzing our results from the perspective of "information content," our results also mean the following: The expected amount of information contained in the fact that the prisoner has not died by the day of the hanging is at most $\frac{1}{\ln 2}$ bits of information. In other words, the gap in surprise stems from the fact that, as the days go by, the prisoner is gaining information. By the day of the hanging, how much information has the prisoner gained? On average, at most $\frac{1}{\ln 2}$ bits of information, according to our results. Thus, on average (and for sufficiently large $n$), there are still at least $\log_2 n - \frac{1}{\ln 2}$ bits of information left to be gained upon revelation of the day of the hanging.

These conclusions lead to a variety of interesting possibilities for future work. Although the two probability distributions that we chose to analyze in Case 2 seem to be the two distributions most relevant to the paradox (i.e., the uniform distribution and the surprise-maximizing distribution), it might be interesting to analyze the asymptotic behavior of other probability distributions as well. In particular, for another well-defined prior probability distribution, it would be interesting to see how the gap in average surprise behaves when we assign this prior distribution to both Cases 1 and 2 (as we did with the uniform distribution). We can envision fairly trivial probability distributions that would make the gap zero for all values of $n$ (such as assigning a probability of 1 to the first day of the week); however, is there some prior probability distribution that can make the gap in average surprise larger than $\frac{1}{\ln 2}$? In other words, what is the maximum amount of average information contained in the fact that the prisoner has not died by the day of the hanging?

Another interesting question is the following: since the discrepancy between Cases 1 and 2 is derived from the prisoner's changing perspective as time elapses, we naturally wonder what happens, generally speaking, when the prisoner's beliefs are allowed to differ from reality. This question is likely more relevant to game theory, but there are almost certainly areas in which information theory could enhance the analysis. Finally, we suggest the exploration of other formulations of the problem, such as having more than just one event of interest or having more than one week (i.e., having cycles).

Therefore, through the methods and results presented in this paper, as well as the suggestions for further research, our ultimate hope is that the work in this paper can serve as a foundation for future information-theoretic approaches to the unexpected hanging paradox.

References

[1] M. Borda, Fundamentals in Information Theory and Coding. Springer, 2011.
[2] D. Borwein, J. Borwein, and P. Maréchal, "Surprise maximization," American Mathematical Monthly, pp. 517–527, 2000.
[3] T. Y. Chow, "The surprise examination or unexpected hanging paradox," American Mathematical Monthly, pp. 41–51, 1998.
[4] T. Cover and J. Thomas, Elements of Information Theory. Wiley, 2012.
[5] J. L. Ferreira and J. Z. Bonilla, "The surprise exam paradox, rationality, and pragmatics: a simple game-theoretic analysis," Journal of Economic Methodology, vol. 15, no. 3, pp. 285–299, 2008.
[6] S. Kritchman and R. Raz, "The surprise examination paradox and the second incompleteness theorem," Notices of the AMS, vol. 57, no. 11, pp. 1454–1458, 2010.
[7] D. J. O'Connor, "Pragmatic paradoxes," Mind, vol. 57, no. 227, pp. 358–359, 1948.
[8] W. Quine, "On a so-called paradox," Mind, vol. 62, no. 245, pp. 65–67, 1953.
[9] E. Sober, "To give a surprise exam, use game theory," Synthese, vol. 115, no. 3, pp. 355–373, 1998.
