Coincidence in Runs and Clusters MAA
Total Page:16
File Type:pdf, Size:1020Kb
5/28/2012 Coincidence in Runs and Clusters MAA COINCIDENCE IN RUNS AND CLUSTERS Milo Schield, W. M. Keck Statistical Literacy Project Abstract: Coincidence may be the most common “Coincidences fascinate us; they seem to compel a manifestation of chance in everyday life. Chance- search for their significance.” Paulos (1989). “My based coincidences are a necessary result of the Law favorite is this little known fact: In Psalm 46 of the of Very Large Numbers: the unlikely is expected (the King James Bible, published in the year that Shake- “impossible” becomes almost certain) given enough speare turned 46, the 46th word is "shake" and the tries. This paper examines: runs in flipped coins, one 46th word from the end is "spear." (More remarkable and two-dimensional clusters of rare events and the than this coincidence is that someone should have birthday problem using Excel spreadsheets. This noted this!)” Myers (2002). paper has three goals: 1) to show that unlikely coin- “One famous coincidence is that John Adams and cidences are much more common than expected, 2) to Thomas Jefferson, two men who shaped the Declara- show the ambiguity in the question “What is the tion of Independence, both died on July 4, 1826, the chance of that?” and 3) to show that as a rule-of fiftieth anniversary of the signing of that historic thumb a run with one chance in N is generally found document.” (Neimark, 2004) in a string of N tries. This rule of thumb is simple, memorable and useful since it is within 10% of the Coincidences that are extremely unlikely due to expected value for large N Conjecture: if the chance chance are argued to be due to something else. So of run of length k is pk, then that length run is ex- how likely are coincidences? pected in N tries when N = (1/p)(k+1). The short answer: “Coincidences are more likely than Keywords: statistical literacy, Excel, visual display you ever imagined.” The longer answer: “Coinci- dences are expected given enough tries.” The goal of 1. STATISTICAL LITERACY this paper is to illustrate these two points. No matter how one defines statistical literacy, it must include randomness or chance. The two most com- Coincidences are a matter of a statistical law: the law mon references to chance in the everyday media in- of Very-Large Numbers: the “impossible” is almost volve the margin of error and coincidences. certain given enough tries. As described by Diaconis and Mostellar (1989), ''With a large enough sample, Coincidences are newsworthy. Raymond and Schield any outrageous thing is apt to happen.” (2008) analyzed 273 statistics-based news stories. Of these, 10% included the phrase “unlikely due to To illustrate the omnipresence of coincidences, this chance” whereas 5% mentioned "confidence level" paper begins with runs in flipping a fair coin, exam- and only 3% mentioned "statistically significant". ines patterns or clusters of rare events in one and two-dimensional spaces and reviews the classic The goal of this paper is to investigate runs and clus- “birthday” problem. Excel spreadsheets generate the ters, and show why coincidences are expected. data in this paper. 2. COINCIDENCES 3. COINCIDENCE: RUNS IN FLIPING COINS Coincidences are event-conjunctions that are unlikely Consider seeing a run in flipping a fair coin. A “run” and memorable. Life is filled with unlikely events. identifies a group activity as in a “run” on a bank Recall the first name of the last stranger you talked where depositors withdraw their money in large to. Unlikely? Yes. Memorable? Not likely. numbers or when a group of people compete to be Now recall an unlikely conjunction that was memo- first in a 10-K “run”. In statistics, a run involves rable for you. Calling someone just as they were repetitions of a rare outcome. For coins, a run is se- about to call you (or vice versa). Meeting someone ries of heads (or tails). you know in a distant or unlikely place. Finding Runs in flipping coins are a bit more complex than something lost long ago, just before you were going expected. See Appendix A for technical details. to buy a replacement. Having a dream about some- thing before it happened. Figure 1 shows the results of flipping 16 rows of 100 coins each. It shows if the result is a head (a red John Allen Paulos, author of Innumeracy stated that color), tabulates the length of the longest run in each the most incredible coincidence of all might be the row, and then show the maximum length of all the absence of all coincidence. Auditors sometimes use runs in that spreadsheet. the lack of coincidence as evidence of fraud. 2012Schield-MAA4A.doc Page 1 5/28/2012 Coincidence in Runs and Clusters MAA Figure 1: Runs of Heads in 1,600 cells The longest run for each row is shown at the left in then in 1,024 trials of 10 coins each, one set of 10 column A. The longest run of all is shown at the top heads is expected. It can be shown that in this case left corner. the chance of 10 or more heads is at least 50% (is more likely than not). See Appendix B and Notice the run of ten heads circled in the first row of Appendix C. When Np is an integer, Lord (2010) Figure 1. What is the chance of that? showed that the mean is also the median and the A run of 10 heads seems very unlikely. Yet runs of mode. at least ten heads occur more often than not when the But the Binomial distribution does not give any in- data if refreshed (press F9). How is this possible? formation about the distribution of the longest run in The short answer: the question is ambiguous. “What a string of size N. But knowing that this coincidence is the chance of that?” has two interpretations: (a run of k heads) is expected in T sets of size k is a useful first step that matches with our intuitions. 1. What is the chance that on the next 10 tries, heads will result in exactly the cells circled? Consider a simpler case of a run of 3 heads: one chance in eight. A run of three heads is expected in T 2. What is the chance that ten adjacent cells will all trials of 3 coins each where T = 8 since T*P = 1. have heads -- somewhere in the 1,600 cells? This would involve flipping 24 coins. The answer to the first question is one chance in 10 These eight sets of 3 coins each can be mapped onto 1,024: P = (1/2) . The answer to the second depends a series of 10 coin flips as shown in Figure 2. on what length longest run is expected in N tries. In general, T trials of K coins each can be mapped The expected length of the longest run is not deter- into N flips when N = T + (k-1) and T = 2k. For runs mined by the binomial distribution. But if there are T of length k, consider N = (1/p)k + (k-1). trials (where T = 1/P), of 10 flips each, then one run is expected: on average. If P is one chance in 1024, 2012Schield-MAA4A.doc Page 2 5/28/2012 Coincidence in Runs and Clusters MAA Figure 2: Compressing 24 flips (8 sets of 3 each) into 10 flips in a row When the event is a run of two heads (chance of one Alternatively, when the run of interest is one chance in four) with N = 5, the chance of that run or longer is in N, then one needs N cells so that one such run is greater than 50% and the mean length is 1.94. generally found. This simple rule of thumb is intui- tive, memorable and visually convincing. When the event is a run of three heads (one chance in eight) with N = 10, the chance of that run or longer is If the chance of success on the next try is p, the greater than 50% and the mean length is 2.80. chance of a run of k or more successes is generally more likely than not in N tries where N = (1/p)k. Appendix E thru Appendix J support the claim that the longest run generally found in series of N flips of Note the relentless use of “generally.” This is a rule a fair coin is given by Log(N) base 2 for N >> k. of thumb, not a mathematical theorem. Table 1 shows the number of flips generally needed This run of 10 in Figure 1 is unexpected for the non- given the length of the longest run. statistician who doesn’t see the ambiguity in the question and who fails to see the large number of Table 1 Number of flips needed by longest run length potential groups of size 10 within these 1,600 cells. Length N Length N This run of 10 heads in more than a thousand tries is Two 4 Seven 128 not unexpected by anyone who is statistically-literate. Three 8 Eight 256 Four 16 Nine 512 Recall again the ambiguity in the question: How Five 32 Ten 1,024 likely (unlikely) is that? This ambiguity explains Six 64 Eleven 2,048 why coincidences can be both rare and common. Appendix D examines the distribution of run lengths. With this background, we can return to the horizontal runs in Figure 1. In a row of 100 cells, we generally The real key to these very long runs is overlap.