–Independent Work Report Spring 2014–
Information Theoretic Approaches to the Unexpected Hanging Paradox
Stefani Karp Advisor: Mark Braverman
May 6, 2014
“This paper represents my own work in accordance with University regulations.” - Stefani Karp
Abstract

In this paper, we analyze the average surprise caused by the revelation of the day of the hanging in the unexpected hanging paradox. We aim to shed new light on the paradox by quantifying the gap in average surprise between two cases: (Case 1) the day of the hanging is revealed prior to the start of the week, and (Case 2) the day of the hanging is only revealed when the hanging actually occurs, where Case 2 is the standard formulation of the paradox. As the length of the week increases, we find that this gap in average surprise converges and is upper-bounded (for the probability distributions we analyze). This indicates that, on average, on the day of the hanging there is a finite amount of information contained in the knowledge that the prisoner has survived until that day.

1. Introduction

The unexpected hanging paradox (often called the surprise examination paradox) has received significant attention in philosophy and mathematics since the 1940s [7]. We present here a brief summary of the paradox:

A prisoner is told that he will be hung on one of the days (Monday through Friday) of the upcoming week and that the hanging will be a surprise when it occurs. The prisoner reasons as follows: If I am still alive by Friday morning, then I know the hanging must occur on Friday. Therefore, a Friday hanging could not be a surprise, so I can rule out Friday. If I rule out Friday, then the hanging cannot occur on Thursday either, because if I am still alive by Thursday morning and I have already ruled out Friday, then I know the hanging must occur on Thursday. Therefore, a Thursday hanging could not be a surprise either. By this reasoning, I can rule out every day of the week, and so I conclude that a surprise hanging is impossible. However, on Wednesday afternoon, the prisoner is hung, to his surprise. Therein lies the paradox.

This paradox is known by a variety of names due to its wide generalizability. For example, in Chow [3], rather than a hanging, the event of interest is a surprise examination in class. For several decades, countless attempts have been made to resolve the paradox from a variety of academic perspectives (including analysis by philosophers as renowned as Quine [8]). The paradox has received particular attention in epistemology, as it can be said to hinge on our definitions of “knowledge, belief and memory” [3]. In mathematics, the paradox has been analyzed in relation to Gödel’s second incompleteness theorem [6] and from various game-theoretic perspectives [5, 9], to name just a few of the mathematical lenses through which it has been studied thus far. As yet, no general consensus on the paradox’s resolution has been reached within the academic community [3].

However, our goal in this paper is not to resolve the paradox, but rather to explore one particular approach to its study: an information-theoretic approach. We believe information theory presents a natural lens through which to view the paradox, as it allows us to answer the question: How can we quantify the prisoner’s surprise, and what is that quantity? This question is significant because many of the debates surrounding the paradox hinge upon the definition of what it means to be “surprised” [3]. According to information theory, the quantity of surprise caused by a particular outcome is the amount of information revealed when it occurs.
The most widely accepted definition of information content is closely tied to Shannon entropy (which is often defined as the expected value of a random variable’s information content). Specifically, the information content $I(x)$ (or, more formally, the self-information) contained in outcome $x$ is defined as $I(x) = \log \frac{1}{P(x)}$, where $P(x)$ is the probability of outcome $x$ [1].

This paper is not the first to examine the unexpected hanging paradox from an information-theoretic perspective. Chow [3] discusses the work of Karl Narveson, who derived a probability distribution (for the day of the hanging) that maximizes the average surprise when the hanging occurs. Borwein [2] builds on this work by further analyzing the mathematical properties of the surprise function and solving various surprise-maximization problems (i.e., optimization problems) derived from the paradox.

Extending the work of Chow and Borwein, in this paper we primarily focus on the gap in average surprise between (1) the prisoner being told at the beginning of the week what day he will be hung and (2) the prisoner discovering the day of his hanging later in the week at the time of the hanging. As we will show, in the first case, the prisoner’s average surprise is simply the Shannon entropy of the probability distribution for the day of the hanging. In the second case, the prisoner’s average surprise is diminished; by the time the hanging occurs, the prisoner has already gained some information (namely, the prisoner knows that he has not yet died). Therefore, in this second case, the prisoner’s average surprise is smaller than in the first case. Specifically, we determine the maximum average surprise achievable in the first case, and we compare this value with the average surprise in the second case under two different prior probability distributions (the uniform distribution and the surprise-maximizing distribution). As the number of days in the “week” approaches infinity, we find that the average amount of surprise lost approaches $\frac{1}{\ln 2}$ (from below). Our results confirm our intuition that the prisoner can be surprised, since we see that the prisoner’s average surprise is less than the entropy by at most $\frac{1}{\ln 2} \approx 1.44$ bits of information. Since the entropy grows without bound but the gap in average surprise is bounded, the average amount of information gained prior to the hanging eventually becomes negligible compared to the entropy (as the length of the week increases).
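As a quick illustration of the self-information definition above, the following minimal Python sketch (the helper name `self_information` is ours) computes the surprise of an outcome in bits:

```python
import math

def self_information(p):
    """Self-information I(x) = log2(1 / P(x)), in bits, of an
    outcome that occurs with probability p."""
    return math.log2(1.0 / p)

print(self_information(0.5))    # 1.0 bit: a fair coin flip
print(self_information(0.125))  # 3.0 bits: rarer outcomes surprise more
```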
2. Problem Definition

As in Borwein [2], we consider an $n$-day week, during which a prisoner is hung on exactly one of the $n$ days. (We note that the problem can be generalized from a “hanging” to any event of interest that occurs once in a particular length of time.) Let $D_i$ be the event that the prisoner is hung (i.e., the prisoner dies) on day $i$, where $i \in \{1,\ldots,n\}$. Then let $A_i$ be the event that the prisoner is still alive by the start of day $i$, such that $A_i = \neg D_1 \cap \cdots \cap \neg D_{i-1}$, where $\neg D_i$ means that the prisoner is not hung on day $i$. We let $P(D_i)$ be the prior probability that the hanging occurs on day $i$, and we let $P(D_i \mid A_i)$ be the posterior probability that the hanging occurs on day $i$ (i.e., the conditional probability that the prisoner dies on day $i$ given that the prisoner has survived days 1 through $i-1$).

Our goal is to find the average surprise caused by the revelation of the day of the hanging. We use $S_n$ to represent the average surprise for an $n$-day week. Based on the definition of self-information above, we define the surprise caused by an event $E$ as the information gain $I(E)$ upon $E$’s occurrence, where $I(E) = -\log_2 P(E)$ if $E$ occurs with probability $P(E)$.

If the prisoner is told at the beginning of the week what day he will be hung, then the average surprise $S_n$ is simply:

$$S_n = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i)$$

since, at the start of the week, the probability of the hanging occurring on any day $i$ is just the prior probability $P(D_i)$. However, if the prisoner only discovers the day of his hanging at the time of the hanging, then the average surprise $S_n$ is:

$$S_n = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i \mid A_i)$$

since, at the time of the hanging, the probability of the hanging occurring on day $i$ must be conditioned on the prisoner’s survival through day $i-1$, so that there are now effectively only $n - i + 1$ possible days left in the week, where $i \in \{1,\ldots,n\}$.
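To make the two cases concrete, here is a minimal Python sketch (function names are ours) that computes both average surprises for an arbitrary prior over the $n$ days; Case 2 uses the posterior $P(D_i \mid A_i) = P(D_i)/P(A_i)$, where $P(A_i) = \sum_{j \ge i} P(D_j)$:

```python
import math

def avg_surprise_case1(prior):
    """Case 1: the day is revealed up front, so the average surprise
    is the Shannon entropy of the prior distribution."""
    return -sum(p * math.log2(p) for p in prior if p > 0)

def avg_surprise_case2(prior):
    """Case 2: the day is revealed only at the hanging, so each term
    uses the posterior P(D_i | A_i) = P(D_i) / P(A_i)."""
    total = 0.0
    alive = 1.0  # P(A_i): probability of surviving days 1 through i-1
    for p in prior:
        if p > 0:
            total -= p * math.log2(p / alive)
        alive -= p
    return total

week = [0.2] * 5  # uniform prior over a 5-day week
print(avg_surprise_case1(week))  # log2(5)         ~ 2.32 bits
print(avg_surprise_case2(week))  # (1/5)*log2(5!)  ~ 1.38 bits
```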
3. The Average Surprise Gap

We begin with a discussion of the maximum average surprise when the prisoner is told at the beginning of the week what day he will be hung. We call this Case 1.
We then consider the average surprise when the day of the hanging is revealed at the time of the hanging instead. We call this Case 2.
Throughout this section, we use the following notation: $S_n^*$ is the maximum average surprise for an $n$-day week when the day of the hanging is revealed at the start of the week (Case 1), and $S_n^D$ is the average surprise under the prior probability distribution $D$ in Case 2.
3.1. Case 1 - Maximum Average Surprise

Theorem 3.1.1. The prior probability distribution on $\{1,\ldots,n\}$ that yields $S_n^*$ is the discrete uniform distribution.
Proof: As defined in Section 2, for Case 1 we have:

$$S_n^* = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i)$$

which is simply the Shannon entropy of the prior probability distribution.
It is well known that the maximum-entropy probability distribution on $\{1,\ldots,n\}$ is the discrete uniform distribution.
Therefore, the prior distribution that maximizes the average surprise for Case 1 is the discrete uniform distribution.
Theorem 3.1.2. $S_n^* = \log_2 n$.
Proof: From the proof of Theorem 3.1.1, we have:

$$S_n^* = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i)$$

where $P(D_i) = \frac{1}{n}$.
Thus, we find:

$$S_n^* = -\sum_{i=1}^{n} \frac{1}{n} \cdot \log_2 \frac{1}{n} = \log_2 n$$
(This result is as expected, as the entropy of the discrete uniform distribution is known to be $\log_2 n$.)
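This is easy to confirm numerically; a self-contained sketch mirroring the Case 1 formula:

```python
import math

# The Case 1 average surprise of the uniform prior equals log2(n),
# the entropy bound from Theorem 3.1.1.
n = 5
entropy = -sum((1.0 / n) * math.log2(1.0 / n) for _ in range(n))
assert abs(entropy - math.log2(n)) < 1e-9
```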
3.2. Case 2 - Uniform Distribution

Based on the results of Section 3.1, we now consider the uniform distribution for Case 2.
Theorem 3.2.1. Let $U$ be the uniform distribution on $\{1,\ldots,n\}$. Then, $S_n^U = \frac{1}{n}\log_2(n!)$.
Proof: Under distribution $U$, we have $P(D_i) = \frac{1}{n}$ and $P(D_i \mid A_i) = \frac{1}{n-i+1}$. Therefore, we have:

$$S_n^U = -\sum_{i=1}^{n} P(D_i) \cdot \log_2 P(D_i \mid A_i) = -\sum_{i=1}^{n} \frac{1}{n} \log_2 \frac{1}{n-i+1}$$
$$= \frac{1}{n} \sum_{i=1}^{n} \log_2 (n-i+1) = \frac{1}{n} \log_2 \left[ \prod_{i=1}^{n} (n-i+1) \right] = \frac{1}{n} \log_2 (n!)$$
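As a numerical sanity check of this closed form (not part of the proof), the direct Case 2 sum under the uniform prior can be compared against $\frac{1}{n}\log_2(n!)$:

```python
import math

# Compare the direct Case 2 sum under the uniform prior with the
# closed form (1/n) * log2(n!) from Theorem 3.2.1.
for n in [5, 10, 100]:
    direct = sum(math.log2(n - i + 1) for i in range(1, n + 1)) / n
    closed_form = math.log2(math.factorial(n)) / n
    assert abs(direct - closed_form) < 1e-9
```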
Theorem 3.2.2. Let $g_n^U = S_n^* - S_n^U$. Then, $\lim_{n\to\infty} g_n^U = \frac{1}{\ln 2}$.
Proof: By Theorems 3.1.2 and 3.2.1, we have $S_n^* = \log_2 n$ and $S_n^U = \frac{1}{n}\log_2(n!)$. Therefore, we have $g_n^U = \log_2 n - \frac{1}{n}\log_2(n!)$, and thus:

$$\lim_{n\to\infty} g_n^U = \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2(n!)\right]$$
Stirling’s approximation gives us the asymptotic formula $n! \sim \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$. With this, we have:

$$\lim_{n\to\infty} g_n^U = \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2\left(\sqrt{2\pi n}\left(\frac{n}{e}\right)^n\right)\right]$$
$$= \lim_{n\to\infty}\left[\log_2 n - \frac{1}{n}\log_2\sqrt{2\pi n} - \log_2\frac{n}{e}\right]$$
$$= \lim_{n\to\infty}\left[\log_2 e - \frac{1}{n}\log_2\sqrt{2\pi n}\right]$$
$$= \log_2 e = \frac{1}{\ln 2}$$
Analysis: This result shows that although $S_n^*$ and $S_n^U$ both increase without bound as $n \to \infty$, the gap between them converges. Furthermore, experiments in Mathematica strongly indicate that $g_n^U$ is nondecreasing (see Figure 1 below). If $g_n^U$ is indeed nondecreasing, then its limit of $\frac{1}{\ln 2}$ is also an upper bound, so the gap between $S_n^*$ and $S_n^U$ is upper-bounded by $\frac{1}{\ln 2}$, as the plot appears to confirm. Therefore, the average amount of information contained in the fact that the prisoner has not yet died by the day of the hanging is at most $\frac{1}{\ln 2}$.
Figure 1: Plot of $g_n^U$.
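The Mathematica experiments are easy to reproduce; the sketch below (ours, using `math.lgamma` to evaluate $\log_2(n!)$ without overflow) checks that $g_n^U$ is nondecreasing over a finite range and approaches $\frac{1}{\ln 2} \approx 1.4427$ from below:

```python
import math

def gap_uniform(n):
    """g_n^U = log2(n) - log2(n!)/n, computed via lgamma to avoid
    overflow: log2(n!) = lgamma(n + 1) / ln(2)."""
    return math.log2(n) - math.lgamma(n + 1) / (n * math.log(2))

gaps = [gap_uniform(n) for n in range(1, 100001)]
assert all(a <= b for a, b in zip(gaps, gaps[1:]))  # nondecreasing (empirically)
print(gaps[-1], 1 / math.log(2))  # 1.44260... vs 1/ln 2 = 1.44269...
```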
3.3. Case 2 - Maximum Average Surprise

As in Section 3.1 (for Case 1), we now find the maximum average surprise for Case 2.
Although the uniform distribution maximizes the average surprise for Case 1, there is no reason to believe that the uniform distribution will also maximize the average surprise for Case 2. Therefore, we proceed as follows:
Theorem 3.3.1. Let $D_n^* = \arg\max_{D} S_n^D$. Let $g_n = S_n^* - S_n^{D_n^*}$. Then $g_n$ is defined recursively as:

$$g_n = \frac{\ln n + g_{n-1}\ln 2 - \ln(n-1) - e^{g_{n-1}\ln 2 - \ln(n-1) - 1}}{\ln 2}, \qquad g_1 = 0.$$
Proof: As presented in Chow [3], we can define $D_n^*$ in terms of the conditional probability $q_k$, where $q_k$ is the probability that the hanging occurs on the $k$th-to-the-last day of an $n$-day week, for $k \in \{0,\ldots,n-1\}$. According to Karl Narveson, we find the mutual recursions:
$$q_k = e^{s_{k-1} - 1} \qquad (1)$$
$$s_k = s_{k-1} - q_k \qquad (2)$$

where $s_0 = 0$. As we can see, the conditional probability $q_k$ is independent of the total length of the week (i.e., $q_k$ is independent of $n$). The intuition for this recursive definition is the following: if the hanging has not occurred by the start of the $k$th-to-the-last day, then the residual probability distribution for the remaining $k+1$ days given this knowledge must be $D_{k+1}^*$. Therefore, we have:
$$S_n^{D_n^*} = -q_{n-1}\log_2 q_{n-1} + (1 - q_{n-1})\, S_{n-1}^{D_{n-1}^*}, \qquad \text{for } n \ge 2,$$

a simple variant of which serves as the foundation for Narveson’s mutual recursions (1) and (2) above. We therefore find that $S_n^{D_n^*} = -\frac{1}{\ln 2}\, s_{n-1}$ for $n \ge 1$.
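For concreteness, Narveson’s recursions are straightforward to implement; the following Python sketch (helper names ours) computes $s_{n-1}$ and hence $S_n^{D_n^*}$:

```python
import math

def narveson_s(n):
    """Recursions (1) and (2): s_0 = 0, q_k = e^(s_{k-1} - 1),
    s_k = s_{k-1} - q_k. Returns [s_0, ..., s_{n-1}]."""
    s = [0.0]
    for k in range(1, n):
        q_k = math.exp(s[k - 1] - 1)
        s.append(s[k - 1] - q_k)
    return s

def max_avg_surprise_case2(n):
    """S_n^{D_n*} = -(1/ln 2) * s_{n-1}."""
    return -narveson_s(n)[-1] / math.log(2)

print(max_avg_surprise_case2(2))  # (1/e)/ln 2 ~ 0.5307, since q_1 = 1/e
```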
Returning to $g_n$, we then have:
$$g_n = \log_2 n - \left(-\frac{1}{\ln 2}\, s_{n-1}\right) \qquad (3)$$
$$= \frac{\ln n}{\ln 2} + \frac{s_{n-1}}{\ln 2}$$
$$= \frac{\ln n}{\ln 2} + \frac{s_{n-2} - q_{n-1}}{\ln 2}$$
$$= \frac{\ln n + s_{n-2} - e^{s_{n-2} - 1}}{\ln 2} \qquad (4)$$
From equation (3) above, we find: $s_{n-1} = g_n \ln 2 - \ln n \;\Rightarrow\; s_{n-2} = g_{n-1} \ln 2 - \ln(n-1)$.
Plugging this back into equation (4), we have:
$$g_n = \frac{\ln n + [g_{n-1}\ln 2 - \ln(n-1)] - e^{[g_{n-1}\ln 2 - \ln(n-1) - 1]}}{\ln 2}$$
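The recursion can be cross-checked against the direct computation $g_n = \log_2 n + \frac{s_{n-1}}{\ln 2}$ from equation (3); a minimal sketch (helper names ours):

```python
import math

def gap_recursive(n):
    """g_n from the recursion of Theorem 3.3.1, with g_1 = 0."""
    g = 0.0
    for m in range(2, n + 1):
        x = g * math.log(2) - math.log(m - 1)  # x = s_{m-2} by equation (3)
        g = (math.log(m) + x - math.exp(x - 1)) / math.log(2)
    return g

def gap_direct(n):
    """g_n = log2(n) + s_{n-1}/ln 2, with s_{n-1} from recursions (1)-(2)."""
    s = 0.0
    for _ in range(n - 1):
        s -= math.exp(s - 1)
    return math.log2(n) + s / math.log(2)

for n in [2, 5, 50]:
    assert abs(gap_recursive(n) - gap_direct(n)) < 1e-9
```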