University of Pennsylvania ScholarlyCommons

Statistics Papers Wharton Faculty Research

2001

Interval Estimation for a Binomial Proportion

Lawrence D. Brown University of Pennsylvania

T. Tony Cai University of Pennsylvania

Anirban Dasgupta University of Pennsylvania

Follow this and additional works at: https://repository.upenn.edu/statistics_papers

Part of the Statistics and Probability Commons

Recommended Citation Brown, L. D., Cai, T., & Dasgupta, A. (2001). Interval Estimation for a Binomial Proportion. Statistical Science, 16 (2), 101-133. http://dx.doi.org/10.1214/ss/1009213286

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/statistics_papers/161

Abstract We revisit the problem of interval estimation of a binomial proportion. The erratic behavior of the coverage probability of the standard Wald confidence interval has previously been remarked on in the literature (Blyth and Still, Agresti and Coull, Santner and others). We begin by showing that the chaotic coverage properties of the Wald interval are far more persistent than is appreciated. Furthermore, common textbook prescriptions regarding its safety are misleading and defective in several respects and cannot be trusted.

This leads us to consideration of alternative intervals. A number of natural alternatives are presented, each with its motivation and context. Each interval is examined for its coverage probability and its length. Based on this analysis, we recommend the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n. We also provide an additional frequentist justification for use of the Jeffreys interval.

Keywords Bayes, binomial distribution, confidence intervals, coverage probability, Edgeworth expansion, expected length, Jeffreys prior, normal approximation, posterior

Disciplines Statistics and Probability

This journal article is available at ScholarlyCommons: https://repository.upenn.edu/statistics_papers/161

Statistical Science 2001, Vol. 16, No. 2, 101–133

Interval Estimation for a Binomial Proportion

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta

Abstract. We revisit the problem of interval estimation of a binomial proportion. The erratic behavior of the coverage probability of the standard Wald confidence interval has previously been remarked on in the literature (Blyth and Still, Agresti and Coull, Santner and others). We begin by showing that the chaotic coverage properties of the Wald interval are far more persistent than is appreciated. Furthermore, common textbook prescriptions regarding its safety are misleading and defective in several respects and cannot be trusted. This leads us to consideration of alternative intervals. A number of natural alternatives are presented, each with its motivation and context. Each interval is examined for its coverage probability and its length. Based on this analysis, we recommend the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n. We also provide an additional frequentist justification for use of the Jeffreys interval.

Key words and phrases: Bayes, binomial distribution, confidence intervals, coverage probability, Edgeworth expansion, expected length, Jeffreys prior, normal approximation, posterior.

Lawrence D. Brown is Professor of Statistics, The Wharton School, University of Pennsylvania, 3000 Steinberg Hall-Dietrich Hall, 3620 Locust Walk, Philadelphia, Pennsylvania 19104-6302. T. Tony Cai is Assistant Professor of Statistics, The Wharton School, University of Pennsylvania, 3000 Steinberg Hall-Dietrich Hall, 3620 Locust Walk, Philadelphia, Pennsylvania 19104-6302. Anirban DasGupta is Professor, Department of Statistics, Purdue University, 1399 Mathematical Science Bldg., West Lafayette, Indiana 47907-1399.

1. INTRODUCTION

This article revisits one of the most basic and methodologically important problems in statistical practice, namely, interval estimation of the probability of success in a binomial distribution. There is a textbook confidence interval for this problem that has acquired nearly universal acceptance in practice. The interval, of course, is $\hat p \pm z_{\alpha/2}\, n^{-1/2} (\hat p (1 - \hat p))^{1/2}$, where $\hat p = X/n$ is the sample proportion of successes, and $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution. The interval is easy to present and motivate and easy to compute. With the exceptions of the t test, linear regression, and ANOVA, its popularity in everyday practical statistics is virtually unmatched. The standard interval is known as the Wald interval as it comes from the Wald large sample test for the binomial case.

So at first glance, one may think that the problem is too simple and has a clear and present solution. In fact, the problem is a difficult one, with unanticipated complexities. It is widely recognized that the actual coverage probability of the standard interval is poor for p near 0 or 1. Even at the level of introductory statistics texts, the standard interval is often presented with the caveat that it should be used only when $n \cdot \min(p, 1 - p)$ is at least 5 (or 10). Examination of the popular texts reveals that the qualifications with which the standard interval is presented are varied, but they all reflect the concern about poor coverage when p is near the boundaries.

In a series of interesting recent articles, it has also been pointed out that the coverage properties of the standard interval can be erratically poor even if p is not near the boundaries; see, for instance, Vollset (1993), Santner (1998), Agresti and Coull (1998), and Newcombe (1998). Slightly older literature includes Ghosh (1979), Cressie (1980) and Blyth and Still (1983). Agresti and Coull (1998) particularly consider the nominal 95% case and show the erratic and poor behavior of the standard interval's coverage probability for small n even when p is not near the boundaries. See their Figure 4 for the cases n = 5 and 10.

We will show in this article that the eccentric behavior of the standard interval's coverage probability is far deeper than has been explained or is appreciated by statisticians at large. We will show that the popular prescriptions the standard interval comes with are defective in several respects and are not to be trusted. In addition, we will motivate, present and analyze several alternatives to the standard interval for a general confidence level. We will ultimately make recommendations about choosing a specific interval for practical use, separately for different intervals of values of n. It will be seen that for small n (40 or less), our recommendation differs from the recommendation Agresti and Coull (1998) made for the nominal 95% case. To facilitate greater appreciation of the seriousness of the problem, we have kept the technical content of this article at a minimal level. The companion article, Brown, Cai and DasGupta (1999), presents the associated theoretical calculations on Edgeworth expansions of the various intervals' coverage probabilities and asymptotic expansions for their expected lengths.

In Section 2, we first present a series of examples on the degree of severity of the chaotic behavior of the standard interval's coverage probability. The chaotic behavior does not go away even when n is quite large and p is not near the boundaries. For instance, when n is 100, the actual coverage probability of the nominal 95% standard interval is 0.952 if p is 0.106, but only 0.911 if p is 0.107. The behavior of the coverage probability can be even more erratic as a function of n. If the true p is 0.5, the actual coverage of the nominal 95% interval is 0.953 at the rather small sample size n = 17, but falls to 0.919 at the much larger sample size n = 40.

This eccentric behavior can get downright extreme in certain practically important problems. For instance, consider defective proportions in industrial quality control problems. There it would be quite common to have a true p that is small. If the true p is 0.005, then the coverage probability of the nominal 95% interval increases monotonically in n all the way up to n = 591 to the level 0.945, only to drop down to 0.792 if n is 592. This unlucky spell continues for a while, and then the coverage bounces back to 0.948 when n is 953, but dramatically falls to 0.852 when n is 954. Subsequent unlucky spells start off at n = 1279, 1583 and on and on. It should be widely known that the coverage of the standard interval can be significantly lower at quite large sample sizes, and this happens in an unpredictable and rather random way.

Continuing, also in Section 2 we list a set of common prescriptions that standard texts present while discussing the standard interval. We show what the deficiencies are in some of these prescriptions. Proposition 1 and the subsequent Table 3 illustrate the defects of these common prescriptions.

In Sections 3 and 4, we present our alternative intervals. For the purpose of a sharper focus we present these alternative intervals in two categories. First we present in Section 3 a selected set of three intervals that clearly stand out in our subsequent analysis; we present them as our "recommended intervals." Separately, we present several other intervals in Section 4 that arise as clear candidates for consideration as a part of a comprehensive examination, but do not stand out in the actual analysis.

The short list of recommended intervals contains the score interval, an interval recently suggested in Agresti and Coull (1998), and the equal-tailed interval resulting from the natural noninformative Jeffreys prior for a binomial proportion. The score interval for the binomial case seems to have been introduced in Wilson (1927); so we call it the Wilson interval. Agresti and Coull (1998) suggested, for the special nominal 95% case, the interval $\tilde p \pm z_{0.025}\, \tilde n^{-1/2} (\tilde p (1 - \tilde p))^{1/2}$, where $\tilde n = n + 4$ and $\tilde p = (X + 2)/(n + 4)$; this is an adjusted Wald interval that formally adds two successes and two failures to the observed counts and then uses the standard method. Our second interval is the appropriate version of this interval for a general confidence level; we call it the Agresti–Coull interval. By a slight abuse of terminology, we call our third interval, namely the equal-tailed interval corresponding to the Jeffreys prior, the Jeffreys interval.

In Section 3, we also present our findings on the performances of our "recommended" intervals. As always, two key considerations are their coverage properties and parsimony as measured by expected length. Simplicity of presentation is also sometimes an issue, for example, in the context of classroom presentation at an elementary level. On consideration of these factors, we came to the conclusion that for small n (40 or less), either the Wilson or the Jeffreys prior interval should be used. They are very similar, and either may be used depending on taste. The Wilson interval has a closed-form formula. The Jeffreys interval does not. One can expect that there would be resistance to using the Jeffreys interval solely due to this reason. We therefore provide a table simply listing the limits of the Jeffreys interval for n up to 30 and, in addition, give closed-form and very accurate approximations to the limits.

For larger n (n > 40), the Wilson, the Jeffreys and the Agresti–Coull interval are all very similar, and so for such n, due to its simplest form, we come to the conclusion that the Agresti–Coull interval should be recommended. Even for smaller sample sizes, the Agresti–Coull interval is strongly preferable to the standard one and so might be the choice where simplicity is a paramount objective.

The additional intervals we considered are two slight modifications of the Wilson and the Jeffreys intervals, the Clopper–Pearson "exact" interval, the arcsine interval, the logit interval, the actual Jeffreys HPD interval and the likelihood ratio interval. The modified versions of the Wilson and the Jeffreys intervals correct disturbing downward spikes in the coverages of the original intervals very close to the two boundaries. The other alternative intervals have earned some prominence in the literature for one reason or another. We had to apply a certain amount of discretion in choosing these additional intervals as part of our investigation. Since we wish to direct the main part of our conversation to the three "recommended" intervals, only a brief summary of the performances of these additional intervals is presented along with the introduction of each interval. As part of these quick summaries, we indicate why we decided against including them among the recommended intervals.

We strongly recommend that introductory texts in statistics present one or more of these recommended alternative intervals, in preference to the standard one. The slight sacrifice in simplicity would be more than worthwhile. The conclusions we make are given additional theoretical support by the results in Brown, Cai and DasGupta (1999). Analogous results for other one parameter discrete families are presented in Brown, Cai and DasGupta (2000).

2. THE STANDARD INTERVAL

When constructing a confidence interval we usually wish the actual coverage probability to be close to the nominal confidence level. Because of the discrete nature of the binomial distribution we cannot always achieve the exact nominal confidence level unless a randomized procedure is used. Thus our objective is to construct nonrandomized confidence intervals for p such that the coverage probability $P_p(p \in CI) \approx 1 - \alpha$, where $\alpha$ is some prespecified value between 0 and 1. We will use the notation $C(p, n) = P_p(p \in CI)$, $0 < p < 1$, for the coverage probability.

A standard confidence interval for p based on the normal approximation has gained universal recommendation in the introductory statistics textbooks and in statistical practice. The interval is known to guarantee that for any fixed $p \in (0, 1)$, $C(p, n) \to 1 - \alpha$ as $n \to \infty$.

Let $\phi(z)$ and $\Phi(z)$ be the standard normal density and distribution functions, respectively. Throughout the paper we denote $\kappa \equiv z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, $\hat p = X/n$ and $\hat q = 1 - \hat p$. The standard normal approximation confidence interval $CI_s$ is given by

(1)  $CI_s = \hat p \pm \kappa\, n^{-1/2} (\hat p \hat q)^{1/2}$.

This interval is obtained by inverting the acceptance region of the well known Wald large-sample normal test for a general problem:

(2)  $|(\hat\theta - \theta)/\widehat{se}(\hat\theta)| \le \kappa$,

where $\theta$ is a generic parameter, $\hat\theta$ is the maximum likelihood estimate of $\theta$ and $\widehat{se}(\hat\theta)$ is the estimated standard error of $\hat\theta$. In the binomial case, we have $\theta = p$, $\hat\theta = X/n$ and $\widehat{se}(\hat\theta) = (\hat p \hat q)^{1/2} n^{-1/2}$.

The standard interval is easy to calculate and is heuristically appealing. In introductory statistics texts and courses, the confidence interval $CI_s$ is usually presented along with some heuristic justification based on the central limit theorem. Most students and users no doubt believe that the larger the number n, the better the normal approximation, and thus the closer the actual coverage would be to the nominal level $1 - \alpha$. Further, they would believe that the coverage probabilities of this method are close to the nominal value, except possibly when n is "small" or p is "near" 0 or 1. We will show how completely both of these beliefs are false. Let us take a close look at how the standard interval $CI_s$ really performs.

2.1 Lucky n, Lucky p

An interesting phenomenon for the standard interval is that the actual coverage probability of the confidence interval contains nonnegligible oscillation as both p and n vary. There exist some "lucky" pairs (p, n) such that the actual coverage probability C(p, n) is very close to or larger than the nominal level. On the other hand, there also exist "unlucky" pairs (p, n) such that the corresponding C(p, n) is much smaller than the nominal level. The phenomenon of oscillation is both in n, for fixed p, and in p, for fixed n. Furthermore, drastic changes in coverage occur in nearby p for fixed n and in nearby n for fixed p. Let us look at five simple but instructive examples.
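The coverage probability C(p, n) defined above is a finite sum over the outcomes x = 0, ..., n, so it can be evaluated directly. The following is a minimal Python sketch of that calculation for the standard interval (1); it is an illustration under stated assumptions, not the authors' S-PLUS code. The function names `wald_interval`, `binom_pmf` and `wald_coverage` are hypothetical, and the log-space evaluation of the binomial probabilities is a choice made here to keep large n numerically stable.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, alpha=0.05):
    """Standard interval (1): p_hat +/- kappa * sqrt(p_hat * q_hat / n)."""
    kappa = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = kappa * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def binom_pmf(x, n, p):
    """Binomial probability for 0 < p < 1, computed in log space (assumption: avoids overflow)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
               + x * math.log(p) + (n - x) * math.log(1 - p))
    return math.exp(log_pmf)


def wald_coverage(p, n, alpha=0.05):
    """C(p, n): total probability of the outcomes x whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = wald_interval(x, n, alpha)
        if lo <= p <= hi:
            total += binom_pmf(x, n, p)
    return total


# Sample sizes discussed in the text for p = 0.5 and p = 0.005.
for n in (17, 18, 40):
    print("p = 0.5,   n =", n, "coverage =", round(wald_coverage(0.5, n), 3))
for n in (591, 592):
    print("p = 0.005, n =", n, "coverage =", round(wald_coverage(0.005, n), 3))
```

Because each outcome x either contributes its whole probability or nothing, a small change in n or p can move an interval endpoint across p and change the sum abruptly.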

Fig. 1. Standard interval; oscillation phenomenon for fixed p = 0.2 and variable n = 25 to 100.

The probabilities reported in the following plots and tables, as well as those appearing later in this paper, are the result of direct probability calculations produced in S-PLUS. In all cases their numerical accuracy considerably exceeds the number of significant figures reported and/or the accuracy visually obtainable from the plots. (Plots for variable p are the probabilities for a fine grid of values of p, e.g., 2000 equally spaced values of p for the plots in Figure 5.)

Example 1. Figure 1 plots the coverage probability of the nominal 95% standard interval for p = 0.2. The number of trials n varies from 25 to 100. It is clear from the plot that the oscillation is significant and the coverage probability does not steadily get closer to the nominal confidence level as n increases. For instance, C(0.2, 30) = 0.946 and C(0.2, 98) = 0.928. So, as hard as it is to believe, the coverage probability is significantly closer to 0.95 when n = 30 than when n = 98. We see that the true coverage probability behaves contrary to conventional wisdom in a very significant way.

Example 2. Now consider the case of p = 0.5. Since p = 0.5, conventional wisdom might suggest to an unsuspecting user that all will be well if n is about 20. We evaluate the exact coverage probability of the 95% standard interval for 10 ≤ n ≤ 50. In Table 1, we list the values of "lucky" n [defined as C(p, n) ≥ 0.95] and the values of "unlucky" n [defined for specificity as C(p, n) ≤ 0.92]. The conclusions presented in Table 1 are surprising. We note that when n = 17 the coverage probability is 0.951, but the coverage probability equals 0.904 when n = 18. Indeed, the unlucky values of n arise suddenly. Although p is 0.5, the coverage is still only 0.919 at n = 40. This illustrates the inconsistency, unpredictability and poor performance of the standard interval.

Table 1
Standard interval; lucky n and unlucky n for 10 ≤ n ≤ 50 and p = 0.5

Lucky n      17     20     25     30     35     37     42     44     49
C(0.5, n)    0.951  0.959  0.957  0.957  0.959  0.953  0.956  0.951  0.956
Unlucky n    10     12     13     15     18     23     28     33     40
C(0.5, n)    0.891  0.854  0.908  0.882  0.904  0.907  0.913  0.920  0.919

Example 3. Now let us move p really close to the boundary, say p = 0.005. We mention in the introduction that such p are relevant in certain practical applications. Since p is so small, now one may fully expect that the coverage probability of the standard interval is poor. Figure 2 and Table 2 show that there are still surprises and indeed we now begin to see a whole new kind of erratic behavior. The oscillation of the coverage probability does not show until rather large n. Indeed, the coverage probability makes a slow ascent all the way until n = 591, and then dramatically drops to 0.792 when n = 592. Figure 2 shows that thereafter the oscillation manifests in full force, in contrast to Examples 1 and 2, where the oscillation started early on. Subsequent "unlucky" values of n again arise in the same unpredictable way, as one can see from Table 2.

Table 2
Standard interval; late arrival of unlucky n for small p

Unlucky n      592    954    1279   1583   1876
C(0.005, n)    0.792  0.852  0.875  0.889  0.898

2.2 Inadequate Coverage

The results in Examples 1 to 3 already show that the standard interval can have coverage noticeably smaller than its nominal value even for values of n and of np(1 − p) that are not small. This subsection contains two more examples that display further instances of the inadequacy of the standard interval.

Example 4. Figure 3 plots the coverage probability of the nominal 95% standard interval with fixed n = 100 and variable p. It can be seen from Figure 3 that in spite of the "large" sample size, significant change in coverage probability occurs in nearby p. The magnitude of oscillation increases significantly as p moves toward 0 or 1. Except for values of p quite near p = 0.5, the general trend of this plot is noticeably below the nominal coverage value of 0.95.

Example 5. Figure 4 shows the coverage probability of the nominal 99% standard interval with n = 20 and variable p from 0 to 1. Besides the oscillation phenomenon similar to Figure 3, a striking fact in this case is that the coverage never reaches the nominal level. The coverage probability is always smaller than 0.99, and in fact on the average the coverage is only 0.883. Our evaluations show that for all n ≤ 45, the coverage of the 99% standard interval is strictly smaller than the nominal level for all 0 < p < 1.

... in this regard. See Figure 5 for such a comparison. The error in coverage comes from two sources: discreteness and skewness in the underlying binomial distribution. For a two-sided interval, the rounding error due to discreteness is dominant, and the error due to skewness is somewhat secondary, but still important for even moderately large n. (See Brown, Cai and DasGupta, 1999, for more details.) Note that the situation is different for one-sided intervals. There, the error caused by the skewness can be larger than the rounding error. See Hall (1982) for a detailed discussion on one-sided confidence intervals.

The oscillation in the coverage probability is caused by the discreteness of the binomial distribution, more precisely, the lattice structure of the binomial distribution. The noticeable oscillations are unavoidable for any nonrandomized procedure, although some of the competing procedures in Section 3 can be seen to have somewhat smaller oscillations than the standard procedure. See the text of Casella and Berger (1990) for introductory discussion of the oscillation in such a context.

The erratic and unsatisfactory coverage properties of the standard interval have often been remarked on, but curiously still do not seem to be widely appreciated among statisticians. See, for example, Ghosh (1979), Blyth and Still (1983) and Agresti and Coull (1998). Blyth and Still (1983) also show that the continuity-corrected version still has the same disadvantages.
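The role of discreteness can be made explicit with a short worked calculation. This is a sketch built only from the definition of C(p, n) and interval (1); the indicator notation and the endpoint values quoted in the comments were computed for this illustration with κ ≈ 1.96 and are not taken from the paper.

```latex
% Coverage is a finite sum: each outcome x contributes its full
% binomial probability or nothing, depending on whether CI_s(x) contains p.
C(p, n) = \sum_{x=0}^{n} \mathbf{1}\{\, p \in CI_s(x) \,\}\, \binom{n}{x} p^{x} (1-p)^{n-x}

% Example 3, p = 0.005, outcome x = 1:
%   n = 591: upper limit of CI_s(1) \approx 0.005006 > p, so x = 1 is counted;
%   n = 592: upper limit of CI_s(1) \approx 0.004997 < p, so x = 1 is dropped.
C(0.005, 591) - C(0.005, 592) \approx P_{0.005}(X = 1) \approx 0.153
```

The loss of a single lattice point therefore accounts for essentially the whole drop from 0.945 to 0.792 reported in Example 3.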

Fig. 2. Standard interval; oscillation in coverage for small p.

Fig. 3. Standard interval; oscillation phenomenon for fixed n = 100 and variable p.

... confidence coefficient $C(p, n) \to 0$ as $p \to 0$ or 1. Therefore, most major problems arise as regards coverage probability when p is near the boundaries.

Poor coverage probabilities for p near 0 or 1 are widely remarked on, and generally, in the popular texts, a brief sentence is added qualifying when to use the standard confidence interval for p. It is interesting to see what these qualifications are. A sample of 11 popular texts gives the following qualifications:

The confidence interval may be used if:
1. $np$, $n(1 - p)$ are ≥ 5 (or 10);
2. $np(1 - p)$ ≥ 5 (or 10);
3. $n\hat p$, $n(1 - \hat p)$ are ≥ 5 (or 10);
4. $\hat p \pm 3\sqrt{\hat p (1 - \hat p)/n}$ does not contain 0 or 1;
5. n quite large;
6. n ≥ 50 unless p is very small.

It seems clear that the authors are attempting to say that the standard interval may be used if the central limit approximation is accurate. These prescriptions are defective in several respects. In the estimation problem, (1) and (2) are not verifiable. Even when these conditions are satisfied, we see, for instance, from Table 1 in the previous section, that there is no guarantee that the true coverage probability is close to the nominal confidence level.

For example, when n = 40 and p = 0.5, one has np = n(1 − p) = 20 and np(1 − p) = 10, so clearly either of the conditions (1) and (2) is satisfied. However, from Table 1, the true coverage probability in this case equals 0.919, which is certainly unsatisfactory for a confidence interval at nominal level 0.95.

The qualification (5) is useless and (6) is patently misleading; (3) and (4) are certainly verifiable, but they are also useless because in the context of frequentist coverage probabilities, a data-based prescription does not have a meaning. The point is that the standard interval clearly has serious problems and the influential texts caution the readers about that. However, the caution does not appear to serve its purpose, for a variety of reasons.

Here is a result that shows that sometimes the qualifications are not correct even in the limit as $n \to \infty$.

Proposition 1. Let $\gamma > 0$. For the standard confidence interval,

(3)  $\lim_{n \to \infty}\ \inf_{p:\, np,\, n(1-p) \ge \gamma} C(p, n) \le P(a_\gamma < \mathrm{Poisson}(\gamma) \le b_\gamma)$,

Fig. 4. Coverage of the nominal 99% standard interval for fixed n = 20 and variable p.

Fig. 5. Coverage probability for n = 50.

where $a_\gamma$ and $b_\gamma$ are the integer parts of

$\left(\kappa^2 + 2\gamma \pm \kappa\sqrt{\kappa^2 + 4\gamma}\right)/2$,

where the − sign goes with $a_\gamma$ and the + sign with $b_\gamma$.

The proposition follows from the fact that the sequence of $\mathrm{Bin}(n, \gamma/n)$ distributions converges weakly to the Poisson($\gamma$) distribution and so the limit of the infimum is at most the Poisson probability in the proposition by an easy calculation.

Let us use Proposition 1 to investigate the validity of qualifications (1) and (2) in the list above. The nominal confidence level in Table 3 below is 0.95.

Table 3
Standard interval; bound (3) on limiting minimum coverage when $np$, $n(1 - p) \ge \gamma$

$\gamma$                                                                5      7      10
$\lim_{n\to\infty} \inf_{p:\, np,\, n(1-p) \ge \gamma} C(p, n)$         0.875  0.913  0.926

It is clear that qualification (1) does not work at all and (2) is marginal. There are similar problems with qualifications (3) and (4).

3. RECOMMENDED ALTERNATIVE INTERVALS

From the evidence gathered in Section 2, it seems clear that the standard interval is just too risky. This brings us to the consideration of alternative intervals. We now analyze several such alternatives, each with its motivation. A few other intervals are also mentioned for their theoretical importance. Among these intervals we feel three stand out in their comparative performance. These are labeled separately as the "recommended intervals".

3.1 Recommended Intervals

3.1.1 The Wilson interval. An alternative to the standard interval is the confidence interval based on inverting the test in equation (2) that uses the null standard error $(pq)^{1/2} n^{-1/2}$ instead of the estimated standard error $(\hat p \hat q)^{1/2} n^{-1/2}$. This confidence interval has the form

(4)  $CI_W = \dfrac{X + \kappa^2/2}{n + \kappa^2} \pm \dfrac{\kappa\, n^{1/2}}{n + \kappa^2} \left(\hat p \hat q + \kappa^2/(4n)\right)^{1/2}$.

This interval was apparently introduced by Wilson (1927) and we will call this interval the Wilson interval.

The Wilson interval has theoretical appeal. The interval is the inversion of the CLT approximation to the family of equal-tail tests of $H_0: p = p_0$. Hence, one accepts $H_0$ based on the CLT approximation if and only if $p_0$ is in this interval. As Wilson showed, the argument involves the solution of a quadratic equation; or see Tamhane and Dunlop (2000, Exercise 9.39).

Table 4
Values of $\lambda_x$ for the modified lower bound for the Wilson interval

$1 - \alpha$    x = 1    x = 2    x = 3
0.90            0.105    0.532    1.102
0.95            0.051    0.355    0.818
0.99            0.010    0.149    0.436

3.1.2 The Agresti–Coull interval. The standard interval $CI_s$ is simple and easy to remember. For the purposes of classroom presentation and use in texts, it may be nice to have an alternative that has the familiar form $\hat p \pm z\sqrt{\hat p (1 - \hat p)/n}$, with a better and new choice of $\hat p$ rather than $\hat p = X/n$. This can be accomplished by using the center of the Wilson region in place of $\hat p$. Denote $\tilde X = X + \kappa^2/2$ and $\tilde n = n + \kappa^2$. Let $\tilde p = \tilde X/\tilde n$ and $\tilde q = 1 - \tilde p$. Define the confidence interval $CI_{AC}$ for p by

(5)  $CI_{AC} = \tilde p \pm \kappa (\tilde p \tilde q)^{1/2} \tilde n^{-1/2}$.

Both the Agresti–Coull and the Wilson interval are centered on the same value, $\tilde p$. It is easy to check that the Agresti–Coull intervals are never shorter than the Wilson intervals. For the case when $\alpha = 0.05$, if we use the value 2 instead of 1.96 for $\kappa$, this interval is the "add 2 successes and 2 failures" interval in Agresti and Coull (1998). For this reason, we call it the Agresti–Coull interval. To the best of our knowledge, Samuels and Witmer (1999) is the first introductory statistics textbook that recommends the use of this interval. See Figure 5 for the coverage of this interval. See also Figure 6 for its average coverage probability.

3.1.3 Jeffreys interval. Beta distributions are the standard conjugate priors for binomial distributions and it is quite common to use beta priors for inference on p (see Berger, 1985).

Suppose $X \sim \mathrm{Bin}(n, p)$ and suppose p has a prior distribution $\mathrm{Beta}(a_1, a_2)$; then the posterior distribution of p is $\mathrm{Beta}(X + a_1, n - X + a_2)$. Thus a $100(1 - \alpha)\%$ equal-tailed Bayesian interval is given by

$[B(\alpha/2;\ X + a_1,\ n - X + a_2),\ B(1 - \alpha/2;\ X + a_1,\ n - X + a_2)]$,

where $B(\alpha;\ m_1, m_2)$ denotes the $\alpha$ quantile of a $\mathrm{Beta}(m_1, m_2)$ distribution.

The well-known Jeffreys prior and the uniform prior are each a beta distribution. The noninformative Jeffreys prior is of particular interest to us. Historically, Bayes procedures under noninformative priors have a track record of good frequentist properties; see Wasserman (1991). In this problem the Jeffreys prior is $\mathrm{Beta}(1/2, 1/2)$, which has the density function

$f(p) = \pi^{-1} p^{-1/2} (1 - p)^{-1/2}$.

The $100(1 - \alpha)\%$ equal-tailed Jeffreys prior interval is defined as

(6)  $CI_J = [L_J(x), U_J(x)]$,

where $L_J(0) = 0$, $U_J(n) = 1$ and otherwise

(7)  $L_J(x) = B(\alpha/2;\ X + 1/2,\ n - X + 1/2)$,

(8)  $U_J(x) = B(1 - \alpha/2;\ X + 1/2,\ n - X + 1/2)$.

The interval is formed by taking the central $1 - \alpha$ posterior probability interval. This leaves $\alpha/2$ posterior probability in each omitted tail. The exception is for x = 0 (n), where the lower (upper) limits are modified to avoid the undesirable result that the coverage probability $C(p, n) \to 0$ as $p \to 0$ or 1.

The actual endpoints of the interval need to be numerically computed. This is very easy to do using software such as Minitab, S-PLUS or Mathematica. In Table 5 we have provided the limits for the case of the Jeffreys prior for 7 ≤ n ≤ 30.

The endpoints of the Jeffreys prior interval are the $\alpha/2$ and $1 - \alpha/2$ quantiles of the $\mathrm{Beta}(x + 1/2, n - x + 1/2)$ distribution. The psychological resistance among some to using the interval is because of the inability to compute the endpoints at ease without software.

We provide two avenues to resolving this problem. One is Table 5 at the end of the paper. The second is a computable approximation to the limits of the Jeffreys prior interval, one that is computable with just a normal table. This approximation is obtained after some algebra from the general approximation to a Beta quantile given on page 945 of Abramowitz and Stegun (1970).

The lower limit of the $100(1 - \alpha)\%$ Jeffreys prior interval is approximately

(9)  $\dfrac{x + 1/2}{n + 1 + (n - x + 1/2)(e^{2\omega} - 1)}$,

where

$\omega = \dfrac{\kappa\sqrt{4\hat p \hat q/n + (\kappa^2 - 3)/(6n^2)}}{4\hat p \hat q} + \dfrac{(1/2 - \hat p)\left(\hat p \hat q(\kappa^2 + 2) - 1/n\right)}{6n(\hat p \hat q)^2}$.

The upper limit may be approximated by the same expression with $\kappa$ replaced by $-\kappa$ in $\omega$. The simple approximation given above is remarkably accurate. Berry (1996, page 222) suggests using a simpler normal approximation, but this will not be sufficiently accurate unless $n\hat p(1 - \hat p)$ is rather large.
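The closed-form expressions (4), (5) and (9) can be collected in a short Python sketch. This is an illustration written from the formulas above rather than the authors' code; the function names are hypothetical, and the restriction of the Jeffreys approximation to 0 < x < n is an assumption made here because $\hat p \hat q$ appears in the denominators of $\omega$.

```python
import math
from statistics import NormalDist


def kappa(alpha):
    """kappa = z_{alpha/2}, the upper alpha/2 point of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wilson_interval(x, n, alpha=0.05):
    """Wilson interval, equation (4)."""
    k = kappa(alpha)
    p_hat = x / n
    center = (x + k * k / 2) / (n + k * k)
    half = (k * math.sqrt(n) / (n + k * k)
            * math.sqrt(p_hat * (1 - p_hat) + k * k / (4 * n)))
    return center - half, center + half


def agresti_coull_interval(x, n, alpha=0.05):
    """Agresti-Coull interval, equation (5), centered at the Wilson center."""
    k = kappa(alpha)
    n_tilde = n + k * k
    p_tilde = (x + k * k / 2) / n_tilde
    half = k * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def jeffreys_approx_interval(x, n, alpha=0.05):
    """Approximate Jeffreys limits from (9); assumed valid only for 0 < x < n."""
    if not 0 < x < n:
        raise ValueError("approximation (9) needs 0 < x < n")
    k = kappa(alpha)
    p_hat = x / n
    pq = p_hat * (1 - p_hat)

    def limit(k_signed):
        # omega with kappa (lower limit) or -kappa (upper limit) in the first term
        omega = (k_signed * math.sqrt(4 * pq / n + (k * k - 3) / (6 * n * n)) / (4 * pq)
                 + (0.5 - p_hat) * (pq * (k * k + 2) - 1 / n) / (6 * n * pq * pq))
        return (x + 0.5) / (n + 1 + (n - x + 0.5) * (math.exp(2 * omega) - 1))

    return limit(k), limit(-k)


x, n = 3, 10
print("Wilson        ", wilson_interval(x, n))
print("Agresti-Coull ", agresti_coull_interval(x, n))
print("Jeffreys (9)  ", jeffreys_approx_interval(x, n))
```

For x = 3 and n = 10 the approximate Jeffreys limits can be compared with the corresponding entry of Table 5 (0.093, 0.606).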

Table 5 95% Limits of the Jeffreys prior interval xn=7 n=8 n=9 n=10 n=11 n=12

 0   0     0.292   0     0.262   0     0.238   0     0.217   0     0.200   0     0.185
 1   0.016 0.501   0.014 0.454   0.012 0.414   0.011 0.381   0.010 0.353   0.009 0.328
 2   0.065 0.648   0.056 0.592   0.049 0.544   0.044 0.503   0.040 0.467   0.036 0.436
 3   0.139 0.766   0.119 0.705   0.104 0.652   0.093 0.606   0.084 0.565   0.076 0.529
 4   0.234 0.861   0.199 0.801   0.173 0.746   0.153 0.696   0.137 0.652   0.124 0.612
 5                               0.254 0.827   0.224 0.776   0.200 0.730   0.180 0.688
 6                                                           0.270 0.800   0.243 0.757

 x   n=13          n=14          n=15          n=16          n=17          n=18
 0   0     0.173   0     0.162   0     0.152   0     0.143   0     0.136   0     0.129
 1   0.008 0.307   0.008 0.288   0.007 0.272   0.007 0.257   0.006 0.244   0.006 0.232
 2   0.033 0.409   0.031 0.385   0.029 0.363   0.027 0.344   0.025 0.327   0.024 0.311
 3   0.070 0.497   0.064 0.469   0.060 0.444   0.056 0.421   0.052 0.400   0.049 0.381
 4   0.114 0.577   0.105 0.545   0.097 0.517   0.091 0.491   0.085 0.467   0.080 0.446
 5   0.165 0.650   0.152 0.616   0.140 0.584   0.131 0.556   0.122 0.530   0.115 0.506
 6   0.221 0.717   0.203 0.681   0.188 0.647   0.174 0.617   0.163 0.589   0.153 0.563
 7   0.283 0.779   0.259 0.741   0.239 0.706   0.222 0.674   0.207 0.644   0.194 0.617
 8                               0.294 0.761   0.272 0.728   0.254 0.697   0.237 0.668
 9                                                           0.303 0.746   0.284 0.716

 x   n=19          n=20          n=21          n=22          n=23          n=24
 0   0     0.122   0     0.117   0     0.112   0     0.107   0     0.102   0     0.098
 1   0.006 0.221   0.005 0.211   0.005 0.202   0.005 0.193   0.005 0.186   0.004 0.179
 2   0.022 0.297   0.021 0.284   0.020 0.272   0.019 0.261   0.018 0.251   0.018 0.241
 3   0.047 0.364   0.044 0.349   0.042 0.334   0.040 0.321   0.038 0.309   0.036 0.297
 4   0.076 0.426   0.072 0.408   0.068 0.392   0.065 0.376   0.062 0.362   0.059 0.349
 5   0.108 0.484   0.102 0.464   0.097 0.446   0.092 0.429   0.088 0.413   0.084 0.398
 6   0.144 0.539   0.136 0.517   0.129 0.497   0.123 0.478   0.117 0.461   0.112 0.444
 7   0.182 0.591   0.172 0.568   0.163 0.546   0.155 0.526   0.148 0.507   0.141 0.489
 8   0.223 0.641   0.211 0.616   0.199 0.593   0.189 0.571   0.180 0.551   0.172 0.532
 9   0.266 0.688   0.251 0.662   0.237 0.638   0.225 0.615   0.214 0.594   0.204 0.574
10   0.312 0.734   0.293 0.707   0.277 0.681   0.263 0.657   0.250 0.635   0.238 0.614
11                               0.319 0.723   0.302 0.698   0.287 0.675   0.273 0.653
12                                                           0.325 0.713   0.310 0.690

 x   n=25          n=26          n=27          n=28          n=29          n=30
 0   0     0.095   0     0.091   0     0.088   0     0.085   0     0.082   0     0.080
 1   0.004 0.172   0.004 0.166   0.004 0.160   0.004 0.155   0.004 0.150   0.004 0.145
 2   0.017 0.233   0.016 0.225   0.016 0.217   0.015 0.210   0.015 0.203   0.014 0.197
 3   0.035 0.287   0.034 0.277   0.032 0.268   0.031 0.259   0.030 0.251   0.029 0.243
 4   0.056 0.337   0.054 0.325   0.052 0.315   0.050 0.305   0.048 0.295   0.047 0.286
 5   0.081 0.384   0.077 0.371   0.074 0.359   0.072 0.348   0.069 0.337   0.067 0.327
 6   0.107 0.429   0.102 0.415   0.098 0.402   0.095 0.389   0.091 0.378   0.088 0.367
 7   0.135 0.473   0.129 0.457   0.124 0.443   0.119 0.429   0.115 0.416   0.111 0.404
 8   0.164 0.515   0.158 0.498   0.151 0.482   0.145 0.468   0.140 0.454   0.135 0.441
 9   0.195 0.555   0.187 0.537   0.180 0.521   0.172 0.505   0.166 0.490   0.160 0.476
10   0.228 0.594   0.218 0.576   0.209 0.558   0.201 0.542   0.193 0.526   0.186 0.511
11   0.261 0.632   0.250 0.613   0.239 0.594   0.230 0.577   0.221 0.560   0.213 0.545
12   0.295 0.669   0.282 0.649   0.271 0.630   0.260 0.611   0.250 0.594   0.240 0.578
13   0.331 0.705   0.316 0.684   0.303 0.664   0.291 0.645   0.279 0.627   0.269 0.610
14                               0.336 0.697   0.322 0.678   0.310 0.659   0.298 0.641
15                                                           0.341 0.690   0.328 0.672
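
The tabulated limits can be checked numerically. The following is a minimal sketch, assuming that each pair of entries is the 0.025 and 0.975 quantile of a Beta(x + 1/2, n − x + 1/2) posterior, with the lower limit set to 0 at x = 0 (as the x = 0 rows show) and, by symmetry, the upper limit set to 1 at x = n. The function names, the Simpson step count and the bisection depth are illustrative choices and are not taken from the paper.

```python
import math


def jeffreys_cdf(p, x, n, steps=1000):
    """CDF at p of Beta(x + 1/2, n - x + 1/2).

    Substituting p = sin(t)**2 turns the Beta integrand into the smooth
    function sin(t)**(2x) * cos(t)**(2(n - x)), which Simpson's rule
    handles well even when x = 0 or x = n.
    """
    def simpson(upper):
        h = upper / steps  # steps must be even for Simpson's rule
        f = lambda t: math.sin(t) ** (2 * x) * math.cos(t) ** (2 * (n - x))
        total = f(0.0) + f(upper)
        for i in range(1, steps):
            total += (4 if i % 2 else 2) * f(i * h)
        return total * h / 3.0

    return simpson(math.asin(math.sqrt(p))) / simpson(math.pi / 2)


def jeffreys_quantile(q, x, n):
    """Solve jeffreys_cdf(p) = q for p by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if jeffreys_cdf(mid, x, n) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def jeffreys_interval(x, n, alpha=0.05):
    """Equal-tailed Jeffreys limits with the boundary convention assumed above."""
    lower = 0.0 if x == 0 else jeffreys_quantile(alpha / 2, x, n)
    upper = 1.0 if x == n else jeffreys_quantile(1 - alpha / 2, x, n)
    return lower, upper
```

Under these assumptions, `jeffreys_interval(2, 7)` should agree with the n = 7, x = 2 entry of the table to about three decimals.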

Fig. 6. Comparison of the average coverage probabilities. From top to bottom: the Agresti–Coull interval $CI_{AC}$, the Wilson interval $CI_W$, the Jeffreys prior interval $CI_J$ and the standard interval $CI_s$. The nominal confidence level is 0.95.
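
The coverage probability $C(p, n)$ behind Fig. 6 is a finite sum over the binomial outcomes, so it can be evaluated exactly at any p and then averaged over p. The sketch below assumes the usual textbook forms of the standard (Wald) and Wilson intervals, which may differ in notation from the definitions given earlier in the paper; the function names and the midpoint-rule grid are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_interval(x, n, alpha=0.05):
    """Standard interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(x, n, alpha=0.05):
    """Wilson (score) interval in its usual closed form."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(n) / (n + z * z) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def coverage(p, n, interval, alpha=0.05):
    """Exact coverage: total binomial mass of the outcomes x whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, alpha)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


def average_coverage(n, interval, alpha=0.05, grid=400):
    """Midpoint-rule approximation to the integral of the coverage over p in (0, 1)."""
    return sum(coverage((i + 0.5) / grid, n, interval, alpha) for i in range(grid)) / grid
```

For example, comparing `coverage(0.005, 50, wald_interval)` with `coverage(0.005, 50, wilson_interval)` illustrates how far apart the two intervals can be for p near 0.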

In Figure 5 we plot the coverage probability of the standard interval, the Wilson interval, the Agresti–Coull interval and the Jeffreys interval for n = 50 and α = 0.05.

3.2 Coverage Probability

In this and the next subsections, we compare the performance of the standard interval and the three recommended intervals in terms of their coverage probability and length.

Coverage of the Wilson interval fluctuates acceptably near 1 − α, except for p very near 0 or 1. It might be helpful to consult Figure 5 again. It can be shown that, when 1 − α = 0.95,

\[ \lim_{n\to\infty} \inf_{\gamma \ge 1} C\!\left(\frac{\gamma}{n}, n\right) = 0.92, \]

\[ \lim_{n\to\infty} \inf_{\gamma \ge 5} C\!\left(\frac{\gamma}{n}, n\right) = 0.936 \]

and

\[ \lim_{n\to\infty} \inf_{\gamma \ge 10} C\!\left(\frac{\gamma}{n}, n\right) = 0.938 \]

for the Wilson interval. In comparison, these three values for the standard interval are 0.860, 0.870 and 0.905, respectively, obviously considerably smaller.

The modification $CI_{M-W}$ presented in Section 4.1.1 removes the first few deep downward spikes of the coverage function for $CI_W$. The resulting coverage function is overall somewhat conservative for p very near 0 or 1. Both $CI_W$ and $CI_{M-W}$ have the same coverage functions away from 0 or 1.

The Agresti–Coull interval has good minimum coverage probability. The coverage probability of the interval is quite conservative for p very close to 0 or 1. In comparison to the Wilson interval it is more conservative, especially for small n. This is not surprising because, as we have noted, $CI_{AC}$ always contains $CI_W$ as a proper subinterval.

The coverage of the Jeffreys interval is qualitatively similar to that of $CI_W$ over most of the parameter space [0, 1]. In addition, as we will see in Section 4.3, $CI_J$ has an appealing connection to the mid-P corrected version of the Clopper–Pearson "exact" intervals. These are very similar to $CI_J$ over most of the range, and have similar appealing properties. $CI_J$ is a serious and credible candidate for practical use. The coverage has an unfortunate fairly deep spike near p = 0 and, symmetrically, another near p = 1. However, the simple modification of $CI_J$ presented in Section 4.1.2 removes these two deep downward spikes. The modified Jeffreys interval $CI_{M-J}$ performs well.

Let us also evaluate the intervals in terms of their average coverage probability, the average being over p. Figure 6 demonstrates the striking difference in the average coverage probability among four intervals: the Agresti–Coull interval, the Wilson interval, the Jeffreys prior interval and the standard interval. The standard interval performs poorly. The interval $CI_{AC}$ is slightly conservative in terms of average coverage probability. Both the Wilson interval and the Jeffreys prior interval have excellent performance in terms of the average coverage probability; that of the Jeffreys prior interval is, if anything, slightly superior. The average coverage of the Jeffreys interval is really very close to the nominal level even for quite small n. This is quite impressive.

Figure 7 displays the mean absolute errors, $\int_0^1 |C(p, n) - (1 - \alpha)|\,dp$, for n = 10 to 25, and n = 26 to 40. It is clear from the plots that among the four intervals, $CI_W$, $CI_{AC}$ and $CI_J$ are comparable, but the mean absolute errors of $CI_s$ are significantly larger.

3.3 Expected Length

Besides coverage, length is also very important in evaluation of a confidence interval. We compare

both the expected length and the average expected length of the intervals. By definition,

\[ \text{Expected length} = E_{n,p}\bigl(\text{length}(CI)\bigr) = \sum_{x=0}^{n} \bigl(U(x, n) - L(x, n)\bigr) \binom{n}{x} p^x (1 - p)^{n-x}, \]

where U and L are the upper and lower limits of the confidence interval CI, respectively. The average expected length is just the integral $\int_0^1 E_{n,p}\bigl(\text{length}(CI)\bigr)\,dp$.

Fig. 7. The mean absolute errors of the coverage of the standard (solid), the Agresti–Coull (dashed), the Jeffreys (+) and the Wilson (dotted) intervals for n = 10 to 25 and n = 26 to 40.

We plot in Figure 8 the expected lengths of the four intervals for n = 25 and α = 0.05. In this case, $CI_W$ is the shortest when 0.210 ≤ p ≤ 0.790, $CI_J$ is the shortest when 0.133 ≤ p ≤ 0.210 or 0.790 ≤ p ≤ 0.867, and $CI_s$ is the shortest when p ≤ 0.133 or p ≥ 0.867. It is no surprise that the standard interval is the shortest when p is near the boundaries. $CI_s$ is not really in contention as a credible choice for such values of p because of its poor coverage properties in that region. Similar qualitative phenomena hold for other values of n.

Figure 9 shows the average expected lengths of the four intervals for n = 10 to 25 and n = 26 to 40. Interestingly, the comparison is clear and consistent as n changes. Always, the standard interval and the Wilson interval $CI_W$ have almost identical average expected length; the Jeffreys interval $CI_J$ is comparable to the Wilson interval, and in fact $CI_J$ is slightly more parsimonious. But the difference is not of practical relevance. However, especially when n is small, the average expected length of $CI_{AC}$ is noticeably larger than that of $CI_J$ and $CI_W$. In fact, for n up to about 20, the average expected length of $CI_{AC}$ is larger than that of $CI_J$ by 0.04 to 0.02, and this difference can be of definite practical relevance. The difference starts to wear off when n is larger than 30 or so.

4. OTHER ALTERNATIVE INTERVALS

Several other intervals deserve consideration, either due to their historical value or their theoretical properties. In the interest of space, we had to exercise some personal judgment in deciding which additional intervals should be presented.

4.1 Boundary modification

The coverage probabilities of the Wilson interval and the Jeffreys interval fluctuate acceptably near

Fig. 8. The expected lengths of the standard (solid), the Wilson (dotted), the Agresti–Coull (dashed) and the Jeffreys (+) intervals for n = 25 and α = 0.05.

Fig. 9. The average expected lengths of the standard (solid), the Wilson (dotted), the Agresti–Coull (dashed) and the Jeffreys (+) intervals for n = 10 to 25 and n = 26 to 40.

1 − α for p not very close to 0 or 1. Simple modifications can be made to remove a few deep downward spikes of their coverage near the boundaries; see Figure 5.

4.1.1 Modified Wilson interval. The lower bound of the Wilson interval is formed by inverting a CLT approximation. The coverage has downward spikes when p is very near 0 or 1. These spikes exist for all n and α. For example, it can be shown that, when 1 − α = 0.95 and p = 0.1765/n,

\[ \lim_{n\to\infty} P_p(p \in CI_W) = 0.838, \]

and when 1 − α = 0.99 and p = 0.1174/n,

\[ \lim_{n\to\infty} P_p(p \in CI_W) = 0.889. \]

The particular numerical values (0.1174, 0.1765) are relevant only to the extent that, divided by n, they approximate the location of these deep downward spikes.

The spikes can be removed by using a one-sided Poisson approximation for x close to 0 or n. Suppose we modify the lower bound for $x = 1, \ldots, x^*$. For a fixed $1 \le x \le x^*$, the lower bound of $CI_W$ should be replaced by a lower bound of $\lambda_x / n$, where $\lambda_x$ solves

\[ e^{-\lambda}\left(\frac{\lambda^0}{0!} + \frac{\lambda^1}{1!} + \cdots + \frac{\lambda^{x-1}}{(x-1)!}\right) = 1 - \alpha. \tag{10} \]

A symmetric prescription needs to be followed to modify the upper bound for x very near n. The value of $x^*$ should be small. Values which work reasonably well for 1 − α = 0.95 are

$x^* = 2$ for n < 50 and $x^* = 3$ for 51 ≤ n ≤ 100+.

Using the relationship between the Poisson and $\chi^2$ distributions,

\[ P(Y \le x) = P\bigl(\chi^2_{2(1+x)} > 2\lambda\bigr), \]

where $Y \sim \text{Poisson}(\lambda)$, one can also formally express $\lambda_x$ in (10) in terms of the $\chi^2$ quantiles: $\lambda_x = \tfrac12 \chi^2_{2x,\alpha}$, where $\chi^2_{2x,\alpha}$ denotes the 100αth percentile of the $\chi^2$ distribution with 2x degrees of freedom. Table 4 gives the values of $\lambda_x$ for selected values of x and α.

For example, consider the case 1 − α = 0.95 and x = 2. The lower bound of $CI_W$ is ≈ 0.548/(n + 4). The modified Wilson interval replaces this by a lower bound of λ/n where $\lambda = \tfrac12 \chi^2_{4,0.05}$. Thus,

Fig. 10. Coverage probability for n = 50 and p ∈ (0, 0.15). The plots are symmetric about p = 0.5 and the coverage of the modified intervals (solid line) is the same as that of the corresponding interval without modification (dashed line) for p ∈ [0.15, 0.85].

from a $\chi^2$ table, for x = 2 the new lower bound is 0.355/n.

We denote this modified Wilson interval by $CI_{M-W}$. See Figure 10 for its coverage.

4.1.2 Modified Jeffreys interval. Evidently, $CI_J$ has an appealing Bayesian interpretation, and its coverage properties are appealing again except for a very narrow downward coverage spike fairly near 0 and 1 (see Figure 5). The unfortunate downward spikes in the coverage function result because $U_J(0)$ is too small and, symmetrically, $L_J(n)$ is too large. To remedy this, one may revise these two specific limits as

\[ U_{M-J}(0) = p_l \quad\text{and}\quad L_{M-J}(n) = 1 - p_l, \]

where $p_l$ satisfies $(1 - p_l)^n = \alpha/2$ or equivalently $p_l = 1 - (\alpha/2)^{1/n}$.

We also made a slight, ad hoc alteration of $L_J(1)$ and set

\[ L_{M-J}(1) = 0 \quad\text{and}\quad U_{M-J}(n - 1) = 1. \]

In all other cases, $L_{M-J} = L_J$ and $U_{M-J} = U_J$. We denote the modified Jeffreys interval by $CI_{M-J}$. This modification removes the two steep downward spikes and the performance of the interval is improved. See Figure 10.

Fig. 11. Coverage probability of other alternative intervals for n = 50.

4.2 Other intervals

4.2.1 The Clopper–Pearson interval. The Clopper–Pearson interval is the inversion of the equal-tail binomial test rather than its normal approximation. Some authors refer to this as the "exact" procedure because of its derivation from the binomial distribution. If X = x is observed, then the Clopper–Pearson (1934) interval is defined by $CI_{CP} = [L_{CP}(x), U_{CP}(x)]$, where $L_{CP}(x)$ and $U_{CP}(x)$ are, respectively, the solutions in p to the equations

\[ P_p(X \ge x) = \alpha/2 \quad\text{and}\quad P_p(X \le x) = \alpha/2. \]

It is easy to show that the lower endpoint is the α/2 quantile of a beta distribution Beta(x, n − x + 1), and the upper endpoint is the 1 − α/2 quantile of a beta distribution Beta(x + 1, n − x). The Clopper–Pearson interval guarantees that the actual coverage probability is always equal to or above the nominal confidence level. However, for any fixed p, the actual coverage probability can be much larger than 1 − α unless n is quite large, and thus the confidence interval is rather inaccurate in this sense. See Figure 11. The Clopper–Pearson interval is wastefully conservative and is not a good choice for practical use, unless strict adherence to the prescription $C(p, n) \ge 1 - \alpha$ is demanded. Even then, better exact methods are available; see, for instance, Blyth and Still (1983) and Casella (1986).
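
The two defining equations of the Clopper–Pearson interval can be solved directly by bisection on the binomial tail probabilities, without going through beta quantiles. The sketch below does this and also evaluates the exact coverage, so that the conservativeness described above can be inspected. It assumes the usual conventions that the lower limit is 0 when x = 0 and the upper limit is 1 when x = n; all function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P_p(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def _bisect(f, target, increasing, iters=60):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Solve P_p(X >= x) = alpha/2 (lower) and P_p(X <= x) = alpha/2 (upper)."""
    lower = 0.0 if x == 0 else _bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper


def cp_coverage(p, n, alpha=0.05):
    """Exact coverage of the Clopper-Pearson interval at a given p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = clopper_pearson(x, n, alpha)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total
```

For x = 0 the upper equation reduces to $(1 - p)^n = \alpha/2$, which gives a closed form against which the bisection can be checked.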

4.2.2 The arcsine interval. Another interval is based on a widely used variance stabilizing transformation for the binomial distribution (see, e.g., Bickel and Doksum, 1977): $T(\hat p) = \arcsin(\hat p^{1/2})$. This stabilization is based on the delta method and is, of course, only an asymptotic one. Anscombe (1948) showed that replacing $\hat p$ by $\check p = (X + 3/8)/(n + 3/4)$ gives better variance stabilization; furthermore

\[ 2 n^{1/2}\bigl[\arcsin(\check p^{1/2}) - \arcsin(p^{1/2})\bigr] \to N(0, 1) \quad\text{as } n \to \infty. \]

This leads to an approximate 100(1 − α)% confidence interval for p,

\[ CI_{Arc} = \Bigl[\sin^2\bigl(\arcsin(\check p^{1/2}) - \tfrac12 \kappa n^{-1/2}\bigr),\ \sin^2\bigl(\arcsin(\check p^{1/2}) + \tfrac12 \kappa n^{-1/2}\bigr)\Bigr]. \tag{11} \]

See Figure 11 for the coverage probability of this interval for n = 50. This interval performs reasonably well for p not too close to 0 or 1. The coverage has steep downward spikes near the two edges; in fact it is easy to see that the coverage drops to zero when p is sufficiently close to the boundary (see Figure 11). The mean absolute error of the coverage of $CI_{Arc}$ is significantly larger than those of $CI_W$, $CI_{AC}$ and $CI_J$. We note that our evaluations show that the performance of the arcsine interval with the standard $\hat p$ in place of $\check p$ in (11) is much worse than that of $CI_{Arc}$.

4.2.3 The logit interval. The logit interval is obtained by inverting a Wald type interval for the log odds $\lambda = \log\bigl(p/(1 - p)\bigr)$ (see Stone, 1995). The MLE of λ (for 0

The interval (13) has been suggested, for example, in Stone (1995, page 667). Figure 11 plots the coverage of the logit interval for n = 50. This interval performs quite well in terms of coverage for p away from 0 or 1. But the interval is unnecessarily long; in fact its expected length is larger than that of the Clopper–Pearson exact interval.

Remark. Anscombe (1956) suggested that $\hat\lambda = \log\bigl((X + 1/2)/(n - X + 1/2)\bigr)$ is a better estimate of λ; see also Cox and Snell (1989) and Santner and Duffy (1989). The variance of Anscombe's $\hat\lambda$ may be estimated by

\[ V = \frac{(n + 1)(n + 2)}{n(X + 1)(n - X + 1)}. \]

A new logit interval can be constructed using the new estimates $\hat\lambda$ and V. Our evaluations show that the new logit interval is overall shorter than $CI_{Logit}$ in (13). But the coverage of the new interval is not satisfactory.

4.2.4 The Bayesian HPD interval. An exact Bayesian solution would involve using the HPD intervals instead of our equal-tails proposal. However, HPD intervals are much harder to compute and do not do as well in terms of coverage probability. See Figure 11 and compare to the Jeffreys' equal-tailed interval in Figure 5.

4.2.5 The likelihood ratio interval. Along with the Wald and the Rao score intervals, the likelihood ratio method is one of the most used methods for construction of confidence intervals. It is constructed by inversion of the likelihood ratio test which accepts the null hypothesis $H_0: p = p_0$ if

Pearson interval $CI_{CP}$ can be written as

\[ CI_{CP} = \bigl[B(\alpha/2;\ X,\ n - X + 1),\ B(1 - \alpha/2;\ X + 1,\ n - X)\bigr]. \]

It therefore follows immediately that $CI_J$ is always contained in $CI_{CP}$. Thus $CI_J$ corrects the conservativeness of $CI_{CP}$.

It turns out that the Jeffreys prior interval, although Bayesianly constructed, has a clear and convincing frequentist motivation. It is thus no surprise that it does well from a frequentist perspective. As we now explain, the Jeffreys prior interval $CI_J$ can be regarded as a continuity corrected version of the Clopper–Pearson interval $CI_{CP}$.

The interval $CI_{CP}$ inverts the inequality $P_p(X \le L(p)) \le \alpha/2$ to obtain the lower limit and similarly for the upper limit. Thus, for fixed x, the upper limit of the interval for p, $U_{CP}(x)$, satisfies

\[ P_{U_{CP}(x)}(X \le x) \le \alpha/2, \tag{14} \]

and symmetrically for the lower limit.

This interval is very conservative; undesirably so for most practical purposes. A familiar proposal to eliminate this over-conservativeness is to instead invert

\[ P_p(X \le L(p) - 1) + \tfrac12 P_p(X = L(p)) = \alpha/2. \tag{15} \]

This amounts to solving

\[ \tfrac12\bigl\{P_{U_{CP}(x)}(X \le x - 1) + P_{U_{CP}(x)}(X \le x)\bigr\} = \alpha/2, \tag{16} \]

which is the same as

\[ U_{\text{mid-}P}(X) = \tfrac12 B(1 - \alpha/2;\ x,\ n - x + 1) + \tfrac12 B(1 - \alpha/2;\ x + 1,\ n - x), \tag{17} \]

and symmetrically for the lower endpoint. These are the "Mid-P Clopper-Pearson" intervals. They are known to have good coverage and length performance. $U_{\text{mid-}P}$ given in (17) is a weighted average of two incomplete Beta functions. The incomplete Beta function of interest, $B(1 - \alpha/2;\ x,\ n - x + 1)$, is continuous and monotone in x if we formally treat x as a continuous argument. Hence the average of the two functions defining $U_{\text{mid-}P}$ is approximately the same as the value at the halfway point, x + 1/2. Thus

\[ U_{\text{mid-}P}(X) \approx B(1 - \alpha/2;\ x + 1/2,\ n - x + 1/2) = U_J(x), \]

exactly the upper limit for the equal-tailed Jeffreys interval. Similarly, the corresponding approximate lower endpoint is the Jeffreys' lower limit.

Another frequentist way to interpret the Jeffreys prior interval is to say that $U_J(x)$ is the upper limit for the Clopper–Pearson rule with x − 1/2 successes and $L_J(x)$ is the lower limit for the Clopper–Pearson rule with x + 1/2 successes. Strawderman and Wells (1998) contains a valuable discussion of mid-P intervals and suggests some variations based on asymptotic expansions.

5. CONCLUDING REMARKS

Interval estimation of a binomial proportion is a very basic problem in practical statistics. The standard Wald interval is in nearly universal use. We first show that the performance of this standard interval is persistently chaotic and unacceptably poor. Indeed its coverage properties defy conventional wisdom. The performance is so erratic and the qualifications given in the influential texts are so defective that the standard interval should not be used. We provide a fairly comprehensive evaluation of many natural alternative intervals. Based on this analysis, we recommend the Wilson or the equal-tailed Jeffreys prior interval for small n (n ≤ 40). These two intervals are comparable in both absolute error and length for n ≤ 40, and we believe that either could be used, depending on taste.

For larger n, the Wilson, the Jeffreys and the Agresti–Coull intervals are all comparable, and the Agresti–Coull interval is the simplest to present. It is generally true in statistical practice that only those methods that are easy to describe, remember and compute are widely used. Keeping this in mind, we recommend the Agresti–Coull interval for practical use when n ≥ 40. Even for small sample sizes, the easy-to-present Agresti–Coull interval is much preferable to the standard one.

We would be satisfied if this article contributes to a greater appreciation of the severe flaws of the popular standard interval and an agreement that it deserves not to be used at all. We also hope that the recommendations for alternative intervals will provide some closure as to what may be used in preference to the standard method.

Finally, we note that the specific choices of the values of n, p and α in the examples and figures are artifacts. The theoretical results in Brown, Cai and DasGupta (1999) show that qualitatively similar phenomena as regarding coverage and length hold for general n and p and common values of the coverage. (Those results there are asymptotic as n → ∞, but they are also sufficiently accurate for realistically moderate n.)
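
The trade-off behind these recommendations can be explored with the expected-length formula of Section 3.3. The sketch below assumes the usual closed forms of the Wilson interval and of the Agresti–Coull interval (the Wilson midpoint $\tilde p$ with half-width $z_{\alpha/2}(\tilde p \tilde q / \tilde n)^{1/2}$, where $\tilde n = n + z_{\alpha/2}^2$); the function names are hypothetical and the Agresti–Coull limits are deliberately not truncated to [0, 1].

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson_interval(x, n, alpha=0.05):
    """Wilson (score) interval in its usual closed form."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(n) / (n + z * z) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def agresti_coull_interval(x, n, alpha=0.05):
    """Wald-type interval computed after adding z**2 pseudo observations, half of each type."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def expected_length(p, n, interval, alpha=0.05):
    """Sum over x of (U(x, n) - L(x, n)) times the Binomial(n, p) probability of x."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, alpha)
        total += (hi - lo) * comb(n, x) * p**x * (1 - p) ** (n - x)
    return total
```

Because the Agresti–Coull interval contains the Wilson interval for every x, `expected_length(p, n, agresti_coull_interval)` is never smaller than the Wilson value; evaluating both for a small n and for n ≥ 40 shows how the gap narrows as n grows.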

APPENDIX

Table A.1
95% Limits of the modified Jeffreys prior interval

 x   n=7           n=8           n=9           n=10          n=11          n=12
 0   0     0.410   0     0.369   0     0.336   0     0.308   0     0.285   0     0.265
 1   0     0.501   0     0.454   0     0.414   0     0.381   0     0.353   0     0.328
 2   0.065 0.648   0.056 0.592   0.049 0.544   0.044 0.503   0.040 0.467   0.036 0.436
 3   0.139 0.766   0.119 0.705   0.104 0.652   0.093 0.606   0.084 0.565   0.076 0.529
 4   0.234 0.861   0.199 0.801   0.173 0.746   0.153 0.696   0.137 0.652   0.124 0.612
 5                               0.254 0.827   0.224 0.776   0.200 0.730   0.180 0.688
 6                                                           0.270 0.800   0.243 0.757

 x   n=13          n=14          n=15          n=16          n=17          n=18
 0   0     0.247   0     0.232   0     0.218   0     0.206   0     0.195   0     0.185
 1   0     0.307   0     0.288   0     0.272   0     0.257   0     0.244   0     0.232
 2   0.033 0.409   0.031 0.385   0.029 0.363   0.027 0.344   0.025 0.327   0.024 0.311
 3   0.070 0.497   0.064 0.469   0.060 0.444   0.056 0.421   0.052 0.400   0.049 0.381
 4   0.114 0.577   0.105 0.545   0.097 0.517   0.091 0.491   0.085 0.467   0.080 0.446
 5   0.165 0.650   0.152 0.616   0.140 0.584   0.131 0.556   0.122 0.530   0.115 0.506
 6   0.221 0.717   0.203 0.681   0.188 0.647   0.174 0.617   0.163 0.589   0.153 0.563
 7   0.283 0.779   0.259 0.741   0.239 0.706   0.222 0.674   0.207 0.644   0.194 0.617
 8                               0.294 0.761   0.272 0.728   0.254 0.697   0.237 0.668
 9                                                           0.303 0.746   0.284 0.716

 x   n=19          n=20          n=21          n=22          n=23          n=24
 0   0     0.176   0     0.168   0     0.161   0     0.154   0     0.148   0     0.142
 1   0     0.221   0     0.211   0     0.202   0     0.193   0     0.186   0     0.179
 2   0.022 0.297   0.021 0.284   0.020 0.272   0.019 0.261   0.018 0.251   0.018 0.241
 3   0.047 0.364   0.044 0.349   0.042 0.334   0.040 0.321   0.038 0.309   0.036 0.297
 4   0.076 0.426   0.072 0.408   0.068 0.392   0.065 0.376   0.062 0.362   0.059 0.349
 5   0.108 0.484   0.102 0.464   0.097 0.446   0.092 0.429   0.088 0.413   0.084 0.398
 6   0.144 0.539   0.136 0.517   0.129 0.497   0.123 0.478   0.117 0.461   0.112 0.444
 7   0.182 0.591   0.172 0.568   0.163 0.546   0.155 0.526   0.148 0.507   0.141 0.489
 8   0.223 0.641   0.211 0.616   0.199 0.593   0.189 0.571   0.180 0.551   0.172 0.532
 9   0.266 0.688   0.251 0.662   0.237 0.638   0.225 0.615   0.214 0.594   0.204 0.574
10   0.312 0.734   0.293 0.707   0.277 0.681   0.263 0.657   0.250 0.635   0.238 0.614
11                               0.319 0.723   0.302 0.698   0.287 0.675   0.273 0.653
12                                                           0.325 0.713   0.310 0.690

 x   n=25          n=26          n=27          n=28          n=29          n=30
 0   0     0.137   0     0.132   0     0.128   0     0.123   0     0.119   0     0.116
 1   0     0.172   0     0.166   0     0.160   0     0.155   0     0.150   0     0.145
 2   0.017 0.233   0.016 0.225   0.016 0.217   0.015 0.210   0.015 0.203   0.014 0.197
 3   0.035 0.287   0.034 0.277   0.032 0.268   0.031 0.259   0.030 0.251   0.029 0.243
 4   0.056 0.337   0.054 0.325   0.052 0.315   0.050 0.305   0.048 0.295   0.047 0.286
 5   0.081 0.384   0.077 0.371   0.074 0.359   0.072 0.348   0.069 0.337   0.067 0.327
 6   0.107 0.429   0.102 0.415   0.098 0.402   0.095 0.389   0.091 0.378   0.088 0.367
 7   0.135 0.473   0.129 0.457   0.124 0.443   0.119 0.429   0.115 0.416   0.111 0.404
 8   0.164 0.515   0.158 0.498   0.151 0.482   0.145 0.468   0.140 0.454   0.135 0.441
 9   0.195 0.555   0.187 0.537   0.180 0.521   0.172 0.505   0.166 0.490   0.160 0.476
10   0.228 0.594   0.218 0.576   0.209 0.558   0.201 0.542   0.193 0.526   0.186 0.511
11   0.261 0.632   0.250 0.613   0.239 0.594   0.230 0.577   0.221 0.560   0.213 0.545
12   0.295 0.669   0.282 0.649   0.271 0.630   0.260 0.611   0.250 0.594   0.240 0.578
13   0.331 0.705   0.316 0.684   0.303 0.664   0.291 0.645   0.279 0.627   0.269 0.610
14                               0.336 0.697   0.322 0.678   0.310 0.659   0.298 0.641
15                                                           0.341 0.690   0.328 0.672
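
The x = 0 rows of Table A.1 follow from the closed form given in Section 4.1.2, where the modified upper limit is $p_l = 1 - (\alpha/2)^{1/n}$. A minimal sketch for recomputing them at the 95% level is shown below; the function name is hypothetical.

```python
def modified_jeffreys_upper_at_zero(n, alpha=0.05):
    """U_{M-J}(0) = p_l, where p_l solves (1 - p_l)**n = alpha / 2."""
    return 1 - (alpha / 2) ** (1 / n)


# Upper limits at x = 0 for n = 7, ..., 12 (first block of Table A.1).
first_block = [modified_jeffreys_upper_at_zero(n) for n in range(7, 13)]
```

Rounded to three decimals, the entries of `first_block` should match the x = 0 row for n = 7 to 12; the other blocks can be recomputed in the same way by changing the range of n.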

ACKNOWLEDGMENTS

We thank Xuefeng Li for performing some helpful computations and Jim Berger, David Moore, Steve Samuels, Bill Studden and Ron Thisted for useful conversations. We also thank the Editors and two anonymous referees for their thorough and constructive comments. Supported by grants from the National Science Foundation and the National Security Agency.

REFERENCES

Abramowitz, M. and Stegun, I. A. (1970). Handbook of Mathematical Functions. Dover, New York.
Agresti, A. and Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. Amer. Statist. 52 119–126.
Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative binomial data. Biometrika 35 246–254.
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika 43 461–464.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
Berry, D. A. (1996). Statistics: A Bayesian Perspective. Wadsworth, Belmont, CA.
Bickel, P. and Doksum, K. (1977). Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ.
Blyth, C. R. and Still, H. A. (1983). Binomial confidence intervals. J. Amer. Statist. Assoc. 78 108–116.
Brown, L. D., Cai, T. and DasGupta, A. (1999). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. To appear.
Brown, L. D., Cai, T. and DasGupta, A. (2000). Interval estimation in discrete . Technical report, Dept. Statistics, Univ. Pennsylvania.
Casella, G. (1986). Refining binomial confidence intervals. Canad. J. Statist. 14 113–129.
Casella, G. and Berger, R. L. (1990). Statistical Inference. Wadsworth & Brooks/Cole, Belmont, CA.
Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26 404–413.
Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data, 2nd ed. Chapman and Hall, London.
Cressie, N. (1980). A finely tuned continuity correction. Ann. Inst. Statist. Math. 30 435–442.
Ghosh, B. K. (1979). A comparison of some approximate confidence intervals for the binomial parameter. J. Amer. Statist. Assoc. 74 894–900.
Hall, P. (1982). Improving the normal approximation when constructing one-sided confidence intervals for binomial or Poisson parameters. Biometrika 69 647–652.
Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer, New York.
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion; comparison of several methods. Statistics in Medicine 17 857–872.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Samuels, M. L. and Witmer, J. W. (1999). Statistics for the Life Sciences, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Santner, T. J. (1998). A note on teaching binomial confidence intervals. Teaching Statistics 20 20–23.
Santner, T. J. and Duffy, D. E. (1989). The Statistical Analysis of Discrete Data. Springer, Berlin.
Stone, C. J. (1995). A Course in Probability and Statistics. Duxbury, Belmont, CA.
Strawderman, R. L. and Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables (with discussion). J. Amer. Statist. Assoc. 93 1294–1320.
Tamhane, A. C. and Dunlop, D. D. (2000). Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, Englewood Cliffs, NJ.
Vollset, S. E. (1993). Confidence intervals for a binomial proportion. Statistics in Medicine 12 809–824.
Wasserman, L. (1991). An inferential interpretation of default priors. Technical report, Carnegie-Mellon Univ.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc. 22 209–212.

Comment

Alan Agresti and Brent A. Coull

Alan Agresti is Distinguished Professor, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545. Brent A. Coull is Assistant Professor, Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts 02115.

In this very interesting article, Professors Brown, Cai and DasGupta (BCD) have shown that discreteness can cause havoc for much larger sample sizes than one would expect. The popular (Wald) confidence interval for a binomial parameter p has been known for some time to behave poorly, but readers will surely be surprised that this can happen for such large n values.

Interval estimation of a binomial parameter is deceptively simple, as there are not even any nuisance parameters. The gold standard would seem to be a method such as the Clopper–Pearson, based on inverting an "exact" test using the binomial distribution rather than an approximate test using the normal. Because of discreteness, however, this method is too conservative. A more practical, nearly gold standard for this and other discrete problems seems to be based on inverting a two-sided test using the exact distribution but with the mid-P value. Similarly, with large-sample methods it is better not to use a continuity correction, as otherwise it approximates exact inference based on an ordinary P-value, resulting in conservative behavior. Interestingly, BCD note that the Jeffreys interval ($CI_J$) approximates the mid-P value correction of the Clopper–Pearson interval. See Gart (1966) for related remarks about the use of $\tfrac12$ additions to numbers of successes and failures before using frequentist methods.

Fig. 1. A comparison of mean expected lengths for the nominal 95% Jeffreys (J), Wilson (W), Modified Jeffreys (M-J), Modified Wilson (M-W) and Agresti–Coull (AC) intervals for n = 5, 6, 7, 8, 9.

1. METHODS FOR ELEMENTARY STATISTICS COURSES

It's unfortunate that the Wald interval for p is so seriously deficient, because in addition to being the simplest interval it is the obvious one to teach in elementary statistics courses. By contrast, the Wilson interval ($CI_W$) performs surprisingly well even for small n. Since it is too complex for many such courses, however, our motivation for the "Agresti–Coull interval" ($CI_{AC}$) was to provide a simple approximation for $CI_W$. Formula (4) in BCD shows that the midpoint $\tilde p$ for $CI_W$ is a weighted average of $\hat p$ and 1/2 that equals the sample proportion after adding $z_{\alpha/2}^2$ pseudo observations, half of each type; the square of the coefficient of $z_{\alpha/2}$ is the same weighted average of the variance of a sample proportion when $p = \hat p$ and

average $\tilde p$ rather than the weighted average of the variances. The resulting interval $\tilde p \pm z_{\alpha/2}(\tilde p \tilde q / \tilde n)^{1/2}$ is wider than $CI_W$ (by Jensen's inequality), in particular being conservative for p near 0 and 1 where $CI_W$ can suffer poor coverage probabilities.

Regarding textbook qualifications on sample size for using the Wald interval, skewness considerations and the Edgeworth expansion suggest that guidelines for n should depend on p through $(1 - 2p)^2 / [p(1 - p)]$. See, for instance, Boos and Hughes-Oliver (2000). But this does not account for the effects of discreteness, and as BCD point out, guidelines in terms of p are not verifiable. For elementary course teaching there is no obvious alternative (such as t methods) for smaller n, so we think it is sensible to teach a single method that behaves reasonably well for all n, as do the Wilson, Jeffreys and Agresti–Coull intervals.

2. IMPROVED PERFORMANCE WITH BOUNDARY MODIFICATIONS

BCD showed that one can improve the behavior of the Wilson and Jeffreys intervals for p near 0 and 1 by modifying the endpoints for $CI_W$ when x = 1, 2, n − 2, n − 1 (and x = 3 and n − 3 for n > 50) and for $CI_J$ when x = 0, 1, n − 1, n. Once one permits the modification of methods near the sample space boundary, other methods may perform decently besides the three recommended in this article.

For instance, Newcombe (1998) showed that when 0

Fig. 2. A comparison of expected lengths for the nominal 95% Jeffreys (J), Wilson (W), Modified Jeffreys (M-J), Modified Wilson (M-W) and Agresti–Coull (AC) intervals for n = 5.

contains CI_W. The logit interval is the uninformative one [0, 1] when x = 0 or x = n, but substituting the Clopper–Pearson limits in those cases yields coverage probability functions that resemble those for CI_W and CI_AC, although considerably more conservative for small n. Rubin and Schenker (1987) recommended the logit interval after 1/2 additions to numbers of successes and failures, motivating it as a normal approximation to the posterior distribution of the logit parameter after using the Jeffreys prior. However, this modification has coverage probabilities that are unacceptably small for p near 0 and 1 (see Vollset, 1993). Presumably some other boundary modification will result in a happy medium. In a letter to the editor about Agresti and Coull (1998), Rindskopf (2000) argued in favor of the logit interval partly because of its connection with logit modeling. We have not used this method for teaching in elementary courses, since logit intervals do not extend to intervals for the difference of proportions and (like CI_W and CI_J) they are rather complex for that level.

For practical use and for teaching in more advanced courses, some statisticians may prefer the likelihood ratio interval, since conceptually it is simple and the method also applies in a general model-building framework. An advantage compared to the Wald approach is its invariance to the choice of scale, resulting, for instance, both from the original scale and the logit. BCD do not say much about this interval, since it is harder to compute. However, it is easy to obtain with standard statistical software (e.g., in SAS, using the LRCI option in PROC GENMOD for a model containing only an intercept term and assuming a binomial response with logit or identity link function). Graphs in Vollset (1993) suggest that the boundary-modified likelihood ratio interval also behaves reasonably well, although conservative for p near 0 and 1.

For elementary course teaching, a disadvantage of all such intervals using boundary modifications is that making exceptions from a general, simple recipe distracts students from the simple concept of taking the estimate plus and minus a normal score multiple of a standard error. (Of course, this concept is not sufficient for serious statistical work, but some oversimplification and compromise is necessary at that level.) Even with CI_AC, instructors may find it preferable to give a recipe with the same number of added pseudo observations for all α, instead of $z_{\alpha/2}^2$. Reasonably good performance seems to result, especially for small α, from the value $4 \approx z_{0.025}^2$ used in the 95% CI_AC interval (i.e., the "add two successes and two failures" interval). Agresti and Caffo (2000) discussed this and showed that adding four pseudo observations also dramatically improves the Wald two-sample interval for comparing proportions, although again at the cost of rather severe conservativeness when both parameters are near 0 or near 1.

3. ALTERNATIVE WIDTH COMPARISON

In comparing the expected lengths of the three recommended intervals, BCD note that the comparison is clear and consistent as n changes, with the average expected length being noticeably larger for CI_AC than CI_J and CI_W. Thus, in their concluding remarks, they recommend CI_J and CI_W for small n. However, since BCD recommend modifying CI_J and CI_W to eliminate severe downward spikes of coverage probabilities, we believe that a more fair comparison of expected lengths uses the modified versions CI_M-J and CI_M-W. We checked this but must admit that figures analogous to the BCD Figures 8 and 9 show that CI_M-J and CI_M-W maintain their expected length advantage over CI_AC, although it is reduced somewhat.

However, when n decreases below 10, the results change, with CI_M-J having greater expected width than CI_AC and CI_M-W. Our Figure 1 extends the BCD Figure 9 to values of n < 10, showing how the comparison differs between the ordinary intervals and the modified ones. Our Figure 2 has the format of the BCD Figure 8, but for n = 5 instead of 25. Admittedly, n = 5 is a rather extreme case, one for which the Jeffreys interval is modified unless x = 2 or 3 and the Wilson interval is modified unless x = 0 or 5, and for it CI_AC has coverage probabilities that can dip below 0.90. Thus, overall, the BCD recommendations about choice of method seem reasonable to us. Our own preference is to use the Wilson interval for statistical practice and CI_AC for teaching in elementary statistics courses.

4. EXTENSIONS

Other than near-boundary modifications, another type of fine-tuning that may help is to invert a test permitting unequal tail probabilities. This occurs naturally in exact inference that inverts a single two-tailed test, which can perform better than inverting two separate one-tailed tests (e.g., Sterne, 1954; Blyth and Still, 1983).

Finally, we are curious about the implications of the BCD results in a more general setting. How much does their message about the effects of discreteness and basing interval estimation on the Jeffreys prior or the score test rather than the Wald test extend to parameters in other discrete distributions and to two-sample comparisons? We have seen that interval estimation of the Poisson parameter benefits from inverting the score test rather than the Wald test on the count scale (Agresti and Coull, 1998).

One would not think there could be anything new to say about the Wald confidence interval for a proportion, an inferential method that must be one of the most frequently used since Laplace (1812, page 283). Likewise, the confidence interval for a proportion based on the Jeffreys prior has received attention in various forms for some time. For instance, R. A. Fisher (1956, pages 63–70) showed the similarity of a Bayesian analysis with Jeffreys prior to his fiducial approach, in a discussion that was generally critical of the confidence interval method but grudgingly admitted of limits obtained by a test inversion such as the Clopper–Pearson method, "though they fall short in logical content of the limits found by the fiducial argument, and with which they have often been confused, they do fulfil some of the desiderata of statistical inferences." Congratulations to the authors for brilliantly casting new light on the performance of these old and established methods.
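As a concrete illustration of the recipes discussed in this comment, the following Python sketch computes the Wald interval, the Wilson (score) interval, and the Agresti–Coull interval, the last either with $z_{\alpha/2}^2$ pseudo observations or with a fixed number such as 4 ("add two successes and two failures"). The function names and the `pseudo` argument are illustrative assumptions and are not notation used by the discussants; only the Python standard library is assumed.

```python
from math import sqrt
from statistics import NormalDist


def z_value(alpha=0.05):
    """Upper alpha/2 point of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald_interval(x, n, alpha=0.05):
    """Estimate plus and minus a normal score multiple of a standard error."""
    z = z_value(alpha)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(x, n, alpha=0.05):
    """Interval obtained by inverting the score test."""
    z = z_value(alpha)
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = (z * sqrt(n) / (n + z * z)) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def agresti_coull_interval(x, n, alpha=0.05, pseudo=None):
    """Wald formula applied after adding pseudo observations.

    pseudo=None adds z^2 pseudo observations (half successes, half failures);
    pseudo=4 gives the "add two successes and two failures" recipe.
    """
    z = z_value(alpha)
    k = z * z if pseudo is None else pseudo
    n_tilde = n + k
    p_tilde = (x + k / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


if __name__ == "__main__":
    for name, f in [("Wald", wald_interval),
                    ("Wilson", wilson_interval),
                    ("Agresti-Coull", agresti_coull_interval)]:
        lo, hi = f(2, 10)
        print(f"{name:14s} x=2, n=10: ({lo:.4f}, {hi:.4f})")
```

With the default `pseudo=None` the Agresti–Coull and Wilson intervals share the same midpoint, which is why the former contains the latter for every x; note also that the untruncated Agresti–Coull limits can fall outside [0, 1].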

Comment

George Casella

George Casella is Arun Varma Commemorative Term Professor and Chair, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545 e-mail: [email protected]fl.edu.

1. INTRODUCTION

Professors Brown, Cai and DasGupta (BCD) are to be congratulated for their clear and imaginative look at a seemingly timeless problem. The chaotic behavior of coverage probabilities of discrete confidence sets has always been an annoyance, resulting in intervals whose coverage probability can be vastly different from their nominal confidence level. What we now see is that for the Wald interval, an approximate interval, the chaotic behavior is relentless, as this interval will not maintain 1 − α coverage for any value of n. Although fixes relying on ad hoc rules abound, they do not solve this fundamental defect of the Wald interval and, surprisingly, the usual safety net of asymptotics is also shown not to exist. So, as the song goes, "Bye-bye, so long, farewell" to the Wald interval.

Now that the Wald interval is out, what is in? There are probably two answers here, depending on whether one is in the classroom or the consulting room.

Fig. 1. Coverage probabilities of the Blyth–Still interval (upper) and Agresti–Coull interval (lower) for n = 100 and 1 − α = 0.95.

2. WHEN YOU SAY 95%

In the classroom it is (still) valuable to have a formula for a confidence interval, and I typically present the Wilson/score interval, starting from the test formulation. Although this doesn't have the pleasing p̂ ± something, most students can understand the logic of test inversion. Moreover, the fact that the interval does not have a symmetric form is a valuable lesson in itself; the statistical world is not always symmetric.

However, one thing still bothers me about this interval. It is clearly not a 1 − α interval; that is, it does not maintain its nominal coverage probability. This is a defect, and one that should not be compromised. I am uncomfortable in presenting a confidence interval that does not maintain its stated confidence; when you say 95% you should mean 95%!

But the fix here is rather simple: apply the "continuity correction" to the score interval (a technique that seems to be out of favor for reasons I do not understand). The continuity correction is easy to justify in the classroom using pictures of the normal density overlaid on the binomial mass function, and the resulting interval will now maintain its nominal level. (This last statement is not based on analytic proof, but on numerical studies.) Anyone reading Blyth (1986) cannot help being convinced that this is an excellent approximation, coming at only a slightly increased effort.

One other point that Blyth makes, which BCD do not mention, is that it is easy to get exact confidence limits at the endpoints. That is, for X = 0 the lower bound is 0 and for X = 1 the lower bound is $1 - (1 - \alpha)^{1/n}$ [the solution to $P(X = 0) = 1 - \alpha$].

3. USE YOUR TOOLS

The essential message that I take away from the work of BCD is that an approximate/formula-based approach to constructing a binomial confidence interval is bound to have essential flaws. However, this is a situation where brute force computing will do the trick. The construction of a 1 − α binomial confidence interval is a discrete optimization problem that is easily programmed. So why not use the tools that we have available? If the problem will yield to brute force computation, then we should use that solution.

Blyth and Still (1983) showed how to compute exact intervals through numerical inversion of tests, and Casella (1986) showed how to compute exact intervals by refining conservative intervals. So for any value of n and α, we can compute an exact, shortest 1 − α confidence interval that will not display any of the pathological behavior illustrated by BCD. As an example, Figure 1 shows the Agresti–Coull interval along with the Blyth–Still interval for n = 100 and 1 − α = 0.95. While the Agresti–Coull interval fails to maintain 0.95 coverage in the middle p region, the Blyth–Still interval always maintains 0.95 coverage. What is more surprising, however, is that the Blyth–Still interval displays much less variation in its coverage probability, especially near the endpoints. Thus, the simplistic numerical algorithm produces an excellent interval, one that both maintains its guaranteed coverage and reduces oscillation in the coverage probabilities.

ACKNOWLEDGMENT

Supported by NSF Grant DMS-99-71586.
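The brute force evaluation that this comment advocates is easy to sketch. The Python fragment below computes the exact coverage probability of an interval procedure at a given p by summing binomial probabilities over all outcomes whose interval contains p, and evaluates the endpoint quantity $1 - (1 - \alpha)^{1/n}$ quoted in Section 2. It is an illustrative sketch only: it does not implement the Blyth–Still construction, and the function names are assumptions introduced here.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def wald(x, n, alpha=0.05):
    """Standard (Wald) interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def agresti_coull(x, n, alpha=0.05):
    """Agresti-Coull interval with z^2 pseudo observations."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def coverage(interval, n, p, alpha=0.05):
    """Exact coverage at p: total probability of outcomes whose interval covers p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, alpha)
        if lo <= p <= hi:
            total += binom_pmf(x, n, p)
    return total


def endpoint_solution(n, alpha=0.05):
    """The value p solving P(X = 0) = (1 - p)^n = 1 - alpha."""
    return 1 - (1 - alpha) ** (1 / n)


if __name__ == "__main__":
    n = 100
    for p in (0.005, 0.1, 0.5):
        print(p, round(coverage(wald, n, p), 4), round(coverage(agresti_coull, n, p), 4))
    print(endpoint_solution(n))
```

Evaluating `coverage` over a fine grid of p values reproduces the kind of coverage plot shown in Figure 1 for any procedure that can be written as a function of (x, n, α).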

Comment

Chris Corcoran and Cyrus Mehta

Chris Corcoran is Assistant Professor, Department of Mathematics and Statistics, Utah State University, 3900 Old Main Hill, Logan, Utah, 84322-3900 e-mail: corcoran@math.usu.edu. Cyrus Mehta is Professor, Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, Massachusetts 02115 and is with Cytel Software Corporation, 675 Massachusetts Avenue, Cambridge, Massachusetts 02319.

We thank the authors for a very accessible and thorough discussion of this practical problem. With the availability of modern computational tools, we have an unprecedented opportunity to carefully evaluate standard statistical procedures in this manner. The results of such work are invaluable to teachers and practitioners of statistics everywhere. We particularly appreciate the attention paid by the authors to the generally oversimplified and inadequate recommendations made by statistical texts regarding when to use normal approximations in analyzing binary data. As their work has plainly shown, even in the simple case of a single binomial proportion, the discreteness of the data makes the use of some asymptotic procedures tenuous, even when the underlying probability lies away from the boundary or when the sample size is relatively large.

The authors have evaluated various confidence intervals with respect to their coverage properties and average lengths. Implicit in their evaluation is the premise that overcoverage is just as bad as undercoverage. We disagree with the authors on this fundamental issue. If, because of the discreteness of the test statistic, the desired confidence level cannot be attained, one would ordinarily prefer overcoverage to undercoverage. Wouldn't you prefer to hire a fortune teller whose track record exceeds expectations to one whose track record is unable to live up to its claim of accuracy? With the exception of the Clopper–Pearson interval, none of the intervals discussed by the authors lives up to its claim of 95% accuracy throughout the range of p. Yet the authors dismiss this interval on the grounds that it is "wastefully conservative." Perhaps so, but they do not address the issue of how the wastefulness is manifested.

What penalty do we incur for furnishing confidence intervals that are more truthful than was required of them? Presumably we pay for the conservatism by an increase in the length of the confidence interval. We thought it would be a useful exercise to actually investigate the magnitude of this penalty for two confidence interval procedures that are guaranteed to provide the desired coverage but are not as conservative as Clopper–Pearson. Figure 1 displays the true coverage probabilities for the nominal 95% Blyth–Still–Casella (see Blyth and Still, 1983; Casella, 1984) confidence interval (BSC interval) and the 95% confidence interval obtained by inverting the exact likelihood ratio test (LR interval; the inversion follows that shown by Aitken, Anderson, Francis and Hinde, 1989, pages 112–118).

Fig. 1. Actual coverage probabilities for BSC and LR intervals as a function of p (n = 50). Compare to authors' Figures 5, 10 and 11.

There is no value of p for which the coverage of the BSC and LR intervals falls below 95%. Their coverage probabilities are, however, much closer to 95% than would be obtained by the Clopper–Pearson procedure, as is evident from the authors' Figure 11. Thus one could say that these two intervals are uniformly better than the Clopper–Pearson interval.

We next investigate the penalty to be paid for the guaranteed coverage in terms of increased length of the BSC and LR intervals relative to the Wilson, Agresti–Coull, or Jeffreys intervals recommended by the authors. This is shown by Figure 2. In fact the BSC and LR intervals are actually shorter than Agresti–Coull for p < 0.2 or p > 0.8, and shorter than the Wilson interval for p < 0.1 and p > 0.9. The only interval that is uniformly shorter than BSC and LR is the Jeffreys interval. Most of the time the difference in lengths is negligible, and in the worst case (at p = 0.5) the Jeffreys interval is only shorter by 0.025 units. Of the three asymptotic methods recommended by the authors, the Jeffreys interval yields the lowest average probability of coverage, with significantly greater potential relative undercoverage in the (0.05, 0.20) and (0.80, 0.95) regions of the parameter space. Considering this, one must question the rationale for preferring Jeffreys to either BSC or LR.

The authors argue for simplicity and ease of computation. This argument is valid for the teaching of statistics, where the instructor must balance simplicity with accuracy. As the authors point out, it is customary to teach the standard interval in introductory courses because the formula is straightforward and the central limit theorem provides a good heuristic for motivating the normal approximation. However, the evidence shows that the standard method is woefully inadequate. Teaching statistical novices about a Clopper–Pearson type interval is conceptually difficult, particularly because exact intervals are impossible to compute by hand. As the Agresti–Coull interval preserves the confidence level most successfully among the three recommended alternative intervals, we believe that this feature when coupled with its straightforward computation (particularly when α = 0.05) makes this approach ideal for the classroom.

Simplicity and ease of computation have no role to play in statistical practice. With the advent of powerful microcomputers, researchers no longer resort to hand calculations when analyzing data. While the need for simplicity applies to the classroom, in applications we primarily desire reliable, accurate solutions, as there is no significant difference in the computational overhead required by the authors' recommended intervals when compared to the BSC and LR methods. From this perspective, the BSC and LR intervals have a substantial advantage relative to the various asymptotic intervals presented by the authors. They guarantee coverage at a relatively low cost in increased length. In fact, the BSC interval is already implemented in StatXact (1998) and is therefore readily accessible to practitioners.

Fig. 2. Expected lengths of BSC and LR intervals as a function of p compared, respectively, to Wilson, Agresti–Coull and Jeffreys intervals (n = 25). Compare to authors' Figure 8.
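The expected length plotted in comparisons of this kind is the binomial average of the interval length over the possible outcomes, $\sum_x (U(x) - L(x)) \binom{n}{x} p^x (1 - p)^{n - x}$. The Python sketch below computes it for the Wilson and Agresti–Coull intervals only, since those have closed forms; the BSC and LR intervals require a numerical inversion that is not attempted here. The function names are illustrative assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist


def wilson(x, n, alpha=0.05):
    """Wilson (score) interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = (z * sqrt(n) / (n + z * z)) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def agresti_coull(x, n, alpha=0.05):
    """Agresti-Coull interval with z^2 pseudo observations (not truncated to [0, 1])."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def expected_length(interval, n, p, alpha=0.05):
    """Average interval length under X ~ Binomial(n, p)."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, alpha)
        total += (hi - lo) * comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


if __name__ == "__main__":
    n = 25
    for p in (0.05, 0.2, 0.5):
        print(p,
              round(expected_length(wilson, n, p), 4),
              round(expected_length(agresti_coull, n, p), 4))
```

Because the Agresti–Coull interval contains the Wilson interval for every x, its expected length computed this way is never smaller; the two coincide only when every outcome has p̂ = 1/2.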

Comment

Malay Ghosh

Malay Ghosh is Distinguished Professor, Department of Statistics, University of Florida, Gainesville, Florida 32611-8545 e-mail: [email protected]fl.edu.

This is indeed a very valuable article which brings out very clearly some of the inherent difficulties associated with confidence intervals for parameters of interest in discrete distributions. Professors Brown, Cai and Dasgupta (henceforth BCD) are to be complimented for their comprehensive and thought-provoking discussion about the "chaotic" behavior of the Wald interval for the binomial proportion and an appraisal of some of the alternatives that have been proposed.

My remarks will primarily be confined to the discussion of Bayesian methods introduced in this paper. BCD have demonstrated very clearly that the modified Jeffreys equal-tailed interval works well in this problem and recommend it as a possible contender to the Wilson interval for n ≤ 40.

There is a deep-rooted optimality associated with Jeffreys prior as the unique first-order probability matching prior for a real-valued parameter of interest with no nuisance parameter. Roughly speaking, a probability matching prior for a real-valued parameter is one for which the coverage probability of a one-sided Bayesian credible interval is asymptotically equal to its frequentist counterpart. Before giving a formal definition of such priors, we provide an intuitive explanation of why Jeffreys prior is a matching prior. To this end, we begin with the fact that if $X_1, \ldots, X_n$ are iid $N(\theta, 1)$, then $\bar{X}_n = \sum_{i=1}^{n} X_i / n$ is the MLE of θ. With the uniform prior $\pi(\theta) \propto c$ (a constant), the posterior of θ is $N(\bar{X}_n, 1/n)$. Accordingly, writing $z_\alpha$ for the upper 100α% point of the $N(0, 1)$ distribution,

$$P(\theta \le \bar{X}_n + z_\alpha n^{-1/2} \mid \bar{X}_n) = 1 - \alpha = P(\theta \le \bar{X}_n + z_\alpha n^{-1/2} \mid \theta),$$

and this is an example of perfect matching. Now if $\hat{\theta}_n$ is the MLE of θ, under suitable regularity conditions, $\hat{\theta}_n \mid \theta$ is asymptotically (as $n \to \infty$) $N(\theta, I^{-1}(\theta))$, where $I(\theta)$ is the Fisher Information number. With the transformation $g(\theta) = \int^{\theta} I^{1/2}(t)\,dt$, by the delta method, $g(\hat{\theta}_n)$ is asymptotically $N(g(\theta), 1)$. Now, intuitively one expects the uniform prior $\pi(\theta) \propto c$ as the asymptotic matching prior for $g(\theta)$. Transforming back to the original parameter, Jeffreys prior is a probability matching prior for θ. Of course, this requires an invariance of probability matching priors, a fact which is rigorously established in Datta and Ghosh (1996). Thus a uniform prior for $\arcsin(\theta^{1/2})$, where θ is the binomial proportion, leads to Jeffreys Beta (1/2, 1/2) prior for θ. When θ is the Poisson parameter, the uniform prior for $\theta^{1/2}$ leads to Jeffreys' prior $\theta^{-1/2}$ for θ.

In a more formal set-up, let $X_1, \ldots, X_n$ be iid conditional on some real-valued θ. Let $\theta_\pi^{1-\alpha}(X_1, \ldots, X_n)$ denote a posterior (1 − α)th quantile for θ under the prior π. Then π is said to be a first-order probability matching prior if

$$P(\theta \le \theta_\pi^{1-\alpha}(X_1, \ldots, X_n) \mid \theta) = 1 - \alpha + o(n^{-1/2}). \qquad (1)$$

This definition is due to Welch and Peers (1963) who showed by solving a differential equation that Jeffreys prior is the unique first-order probability matching prior in this case. Strictly speaking, Welch and Peers proved this result only for continuous distributions. Ghosh (1994) pointed out a suitable modification of criterion (1) which would lead to the same conclusion for discrete distributions. Also, for small and moderate samples, due to discreteness, one needs some modifications of Jeffreys interval as done so successfully by BCD.

This idea of probability matching can be extended even in the presence of nuisance parameters. Suppose that $\theta = (\theta_1, \ldots, \theta_p)^T$, where $\theta_1$ is the parameter of interest, while $(\theta_2, \ldots, \theta_p)^T$ is the nuisance parameter. Writing $I(\theta) = ((I_{jk}))$ as the Fisher information matrix, if $\theta_1$ is orthogonal to $(\theta_2, \ldots, \theta_p)^T$ in the sense of Cox and Reid (1987), that is, $I_{1k} = 0$ for all $k = 2, \ldots, p$, extending the previous intuitive argument, $\pi(\theta) \propto I_{11}^{1/2}(\theta)$ is a probability matching prior. Indeed, this prior belongs to the general class of first-order probability matching priors

$$\pi(\theta) \propto I_{11}^{1/2}(\theta)\, h(\theta_2, \ldots, \theta_p)$$

as derived in Tibshirani (1989). Here $h(\cdot)$ is an arbitrary function differentiable in its arguments.

In general, matching priors have a long success story in providing frequentist confidence intervals, especially in complex problems, for example, the Behrens–Fisher or the common mean estimation problems where frequentist methods run into difficulty. Though asymptotic, the matching property seems to hold for small and moderate sample sizes as well for many important statistical problems. One such example is Garvan and Ghosh (1997) where such priors were found for general dispersion models as given in Jorgensen (1997). It may be worthwhile developing these priors in the presence of nuisance parameters for other discrete cases as well, for example when the parameter of interest is the difference of two binomial proportions, or the log-odds ratio in a 2 × 2 contingency table.

Having argued so strongly in favor of matching priors, I wonder, though, whether there is any special need for such priors in this particular problem of binomial proportions. It appears that any Beta (a, a) prior will do well in this case. As noted in this paper, by shrinking the MLE X/n toward the prior mean 1/2, one achieves a better centering for the construction of confidence intervals. The two diametrically opposite priors Beta (2, 2) (symmetric concave with maximum at 1/2 which provides the Agresti–Coull interval) and Jeffreys prior Beta (1/2, 1/2) (symmetric convex with minimum at 1/2) seem to be equally good for recentering. Indeed, I wonder whether any Beta (α, β) prior which shrinks the MLE toward the prior mean α/(α + β) becomes appropriate for recentering.

The problem of construction of confidence intervals for binomial proportions occurs in first courses in statistics as well as in day-to-day consulting. While I am strongly in favor of replacing Wald intervals by the new ones for the latter, I am not quite sure how easy it will be to motivate these new intervals for the former. The notion of shrinking can be explained adequately only to a few strong students in introductory statistics courses. One possible solution for the classroom may be to bring in the notion of continuity correction and somewhat heuristically ask students to work with $(X + \tfrac{1}{2},\, n - X + \tfrac{1}{2})$ instead of $(X,\, n - X)$. In this way, one centers around $(X + \tfrac{1}{2})/(n + 1)$ a la Jeffreys prior.

Comment

Thomas J. Santner

I thank the authors for their detailed look at a well-studied problem. For the Wald binomial p interval, there has not been an appreciation of the long persistence (in n) of p locations having substantially deficient achieved coverage compared with the nominal coverage. Figure 1 is indeed a picture that says a thousand words. Similarly, the asymptotic lower limit in Theorem 1 for the minimum coverage of the Wald interval is an extremely useful analytic tool to explain this phenomenon, although other authors have given fixed p approximations of the coverage probability of the Wald interval (e.g., Theorem 1 of Ghosh, 1979).

[...] be consistent in their assessment of the likely values for p. Symmetry (1) is the minimal requirement of a binomial confidence interval. The Wilson and equal-tailed Jeffreys intervals advocated by BCD satisfy the symmetry property (1) and have coverage that is centered (when coverage is plotted versus true p) about the nominal value. They are also straightforward to motivate, even for elementary students, and simple to compute for the outcome of interest.

However, regarding p confidence intervals as the inversion of a family of acceptance regions corresponding to size α tests of $H_0: p = p_0$ versus $H_A: p \ne p_0$ for 0 [...]

Fig. 1. Coverage of nominal 95% symmetric Duffy–Santner p intervals for n = 20 (bottom panel) and n = 50 (top panel).

However, the benefits of having reasonably short and suitably symmetric confidence intervals are sufficient that such intervals have been constructed for several frequently occurring problems of biostatistics. For example, Jennison and Turnbull (1983) and Duffy and Santner (1987) present acceptance set–inversion confidence intervals (both with available FORTRAN programs to implement their methods) for a binomial p based on data from a multistage clinical trial; Coe and Tamhane (1989) describe a more sophisticated set of repeated confidence intervals for $p_1 - p_2$ also based on multistage clinical trial data (and give a SAS macro to produce the intervals). Yamagami and Santner (1990) present an acceptance set–inversion confidence interval and FORTRAN program for $p_1 - p_2$ in the two-sample binomial problem. There are other examples.

To contrast with the intervals whose coverages are displayed in BCD's Figure 5 for n = 20 and n = 50, I formed the multistage intervals of Duffy and Santner that strictly attain the nominal confidence level for all p. The computation was done naively in the sense that the multistage FORTRAN program by Duffy that implements this method was applied using one stage with stopping boundaries arbitrarily set at (a, b) = (0, 1) in the notation of Duffy and Santner, and a small adjustment was made to insure symmetry property (1). (The nonsymmetrical multiple stage stopping boundaries that produce the data considered in Duffy and Santner do not impose symmetry.) The coverages of these systems are shown in Figure 1. To give an idea of computing time, the n = 50 intervals required less than two seconds to compute on my 400 MHz PC. To further facilitate comparison with the intervals whose coverage is displayed in Figure 5 of BCD, I computed the Duffy and Santner intervals for a slightly lower level of coverage, 93.5%, so that the average coverage was about the desired 95% nominal level; the coverage of this system is displayed in Figure 2 on the same vertical scale and compares favorably. It is possible to call the FORTRAN program that makes these intervals within SPLUS which makes for convenient data analysis.

I wish to mention that there are a number of other small sample interval estimation problems of continuing interest to biostatisticians that may well have very reasonable small sample solutions based on analogs of the methods that BCD recommend.

Fig. 2. Coverage of nominal 93.5% symmetric Duffy–Santner p intervals for n = 50.

Most of these would be extremely difficult to handle by the more brute force method of inverting acceptance sets. The first of these is the problem of computing simultaneous confidence intervals for $p_0 - p_i$, $1 \le i \le T$, that arises in comparing a control binomial distribution with T treatment ones. The second concerns forming simultaneous confidence intervals for $p_i - p_j$, the cell probabilities of a multinomial distribution. In particular, the equal-tailed Jeffreys prior approach recommended by the authors has strong appeal for both of these problems.

Finally, I note that the Wilson intervals seem to have received some recommendation as the method of choice in other elementary texts. In his introductory texts, Larson (1974) introduces the Wilson interval as the method of choice although he makes the vague, and indeed false, statement, as BCD show, that the user can use the Wald interval if "n is large enough." One reviewer of Santner (1998), an article that showed the coverage virtues of the Wilson interval compared with Wald-like intervals advocated by another author in the magazine Teaching Statistics (written for high school teachers), commented that the Wilson method was the "standard" method taught in the U.K.

Rejoinder

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta

We deeply appreciate the many thoughtful and constructive remarks and suggestions made by the discussants of this paper. The discussion suggests that we were able to make a convincing case that the often-used Wald interval is far more problematic than previously believed. We are happy to see a consensus that the Wald interval deserves to be discarded, as we have recommended. It is not surprising to us to see disagreement over the specific alternative(s) to be recommended in place of this interval. We hope the continuing debate will add to a greater understanding of the problem, and we welcome the chance to contribute to this debate.

A. It seems that the primary source of disagreement is based on differences in interpretation of the coverage goals for confidence intervals. We will begin by presenting our point of view on this fundamental issue.

We will then turn to a number of other issues, as summarized in the following list:

B. Simplicity is important.
C. Expected length is also important.
D. Santner's proposal.
E. Should a continuity correction be used?
F. The Wald interval also performs poorly in other problems.
G. The two-sample binomial problem.
H. Probability-matching procedures.
I. Results from asymptotic theory.

A. Professors Casella, Corcoran and Mehta come out in favor of making coverage errors always fall only on the conservative side. This is a traditional point of view. However, we have taken a different perspective in our treatment. It seems more consistent with contemporary statistical practice to expect that a γ% confidence interval should cover the true value approximately γ% of the time. The approximation should be built on sound, relevant statistical calculations, and it should be as accurate as the situation allows.

We note in this regard that most statistical models are only felt to be approximately valid as representations of the true situation. Hence the resulting coverage properties from those models are at best only approximately accurate. Furthermore, a broad range of modern procedures is supported only by asymptotic or Monte-Carlo calculations, and so again coverage can at best only be approximately the nominal value. As statisticians we do the best within these constraints to produce procedures whose coverage comes close to the nominal value. In these contexts when we claim γ% coverage we clearly intend to convey that the coverage is close to γ%, rather than to guarantee it is at least γ%.

We grant that the binomial model has a somewhat special character relative to this general discussion. There are practical contexts where one can feel confident this model holds with very high precision. Furthermore, asymptotics are not required in order to construct practical procedures or evaluate their properties, although asymptotic calculations can be useful in both regards. But the discreteness of the problem introduces a related barrier to the construction of satisfactory procedures. This forces one to again decide whether γ% should mean "approximately γ%," as it does in most other contemporary applications, or "at least γ%" as can be obtained with the Blyth–Still procedure or the Clopper–Pearson procedure. An obvious price of the latter approach is in its decreased precision, as measured by the increased expected length of the intervals.

B. All the discussants agree that elementary motivation and simplicity of computation are important attributes in the classroom context. We of course agree. If these considerations are paramount then the Agresti–Coull procedure is ideal. If the need for simplicity can be relaxed even a little, then we prefer the Wilson procedure: it is only slightly harder to compute, its coverage is clearly closer to the nominal value across a wider range of values of p, and it can be easier to motivate since its derivation is totally consistent with Neyman–Pearson theory. Other procedures such as Jeffreys or the mid-P Clopper–Pearson interval become plausible competitors whenever computer software can be substituted for the possibility of hand derivation and computation.

Corcoran and Mehta take a rather extreme position when they write, "Simplicity and ease of computation have no role to play in statistical practice [italics ours]." We agree that the ability to perform computations by hand should be of little, if any, relevance in practice. But conceptual simplicity, parsimony and consistency with general theory remain important secondary conditions to choose among procedures with acceptable coverage and precision.

These considerations will reappear in our discussion of Santner's Blyth–Still proposal. They also leave us feeling somewhat ambivalent about the boundary-modified procedures we have presented in our Section 4.1. Agresti and Coull correctly imply that other boundary corrections could have been tried and that our choice is thus somewhat ad hoc. (The correction to Wilson can perhaps be defended on the principle of substituting a Poisson approximation for a Gaussian one where the former is clearly more accurate; but we see no such fundamental motivation for our correction to the Jeffreys interval.)

C. Several discussants commented on the precision of various proposals in terms of expected length of the resulting intervals. We strongly concur that precision is the important balancing criterion vis-à-vis coverage. We wish only to note that there exist other measures of precision than interval expected length. In particular, one may investigate the probability of covering wrong values. In a charming identity worth noting, Pratt (1961) shows the connection of this approach to that of expected length. Calculations on coverage of wrong values of p in the binomial case will be presented in DasGupta (2001). This article also discusses a number of additional issues and presents further analytical calculations, including a Pearson tilting similar to the chi-square tilts advised in Hall (1983).

Corcoran and Mehta's Figure 2 compares average length of three of our proposals with Blyth–Still and with their likelihood ratio procedure. We note first that their LR procedure is not the same as ours. Theirs is based on numerically computed exact percentiles of the fixed sample likelihood ratio statistic. We suspect this is roughly equivalent to adjustment of the chi-squared percentile by a Bartlett correction. Ours is based on the traditional asymptotic chi-squared formula for the distribution of the likelihood ratio statistic. Consequently, their procedure has conservative coverage, whereas ours has coverage fluctuating around the nominal value. They assert that the difference in expected length is "negligible." How much difference qualifies as negligible is an arguable, subjective evaluation. But we note that in their plot their intervals can be on average about 8% or 10% longer than Jeffreys or Wilson intervals, respectively. This seems to us a nonnegligible difference. Actually, we suspect their preference for their LR and BSC intervals rests primarily on their overriding preference for conservativity in coverage whereas, as we have discussed above, our intervals are designed to attain approximately the desired nominal value.

D. Santner proposes an interesting variant of the original Blyth–Still proposal. As we understand it, he suggests producing nominal γ% intervals by constructing the γ*% Blyth–Still intervals, with γ*% chosen so that the average coverage of the resulting intervals is approximately the nominal value, γ%. The coverage plot for this procedure compares well with that for Wilson or Jeffreys in our Figure 5. Perhaps the expected interval length for this procedure also compares well, although Santner does not say so. However, we still do not favor his proposal. It is conceptually more complicated and requires a specially designed computer program, particularly if one wishes to compute γ*% with any degree of accuracy. It thus fails with respect to the criterion of scientific parsimony in relation to other proposals that appear to have at least competitive performance characteristics.

E. Casella suggests the possibility of performing a continuity correction on the score statistic prior to constructing a confidence interval. We do not agree with this proposal from any perspective. These "continuity-corrected Wilson" intervals have extremely conservative coverage properties, though they may not in principle be guaranteed to be everywhere conservative. But even if one's goal, unlike ours, is to produce conservative intervals, these intervals will be very inefficient at their nominal level relative to Blyth–Still or even Clopper–Pearson. In Figure 1 below, we plot the coverage of the Wilson interval with and without a continuity correction for n = 25 and α = 0.05, and the corresponding expected lengths. It seems clear that the loss in precision more than neutralizes the improvements in coverage and that the nominal coverage of 95% is misleading from any perspective.

Fig. 1. Comparison of the coverage probabilities and expected lengths of the Wilson (dotted) and continuity-corrected Wilson (solid) intervals for n = 25 and α = 0.05.
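The comparison described for Figure 1 can be approximated numerically. The following Python sketch computes exact coverage and expected length by summing over the binomial distribution. It rests on assumptions that are not taken from the article: the continuity-corrected limits are obtained by shifting the observed proportion by 1/(2n) before solving the score equation, the lower limit is set to 0 at x = 0 and the upper limit to 1 at x = n, and the names `score_limits` and `coverage_and_length` are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def score_limits(x, n, z, cc=0.0):
    """Limits from inverting the score test; cc = 1/(2n) gives the corrected version."""
    c = z * z / n
    p_hat = x / n

    def root(a, sign):
        # Solves (a - p)^2 = z^2 * p * (1 - p) / n for p on the chosen side of a.
        disc = (4.0 * a * (1.0 - a) + c) / n
        return (2.0 * a + c + sign * z * sqrt(disc)) / (2.0 * (1.0 + c))

    # Boundary convention (an assumption): limits are pinned at 0 and 1.
    lower = 0.0 if x == 0 else root(p_hat - cc, -1.0)
    upper = 1.0 if x == n else root(p_hat + cc, +1.0)
    return lower, upper


def coverage_and_length(n, p, alpha=0.05, corrected=False):
    """Exact coverage probability and expected length at a given p."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    cc = 1.0 / (2.0 * n) if corrected else 0.0
    cover = length = 0.0
    for x in range(n + 1):
        prob = comb(n, x) * p**x * (1.0 - p) ** (n - x)
        lower, upper = score_limits(x, n, z, cc)
        length += prob * (upper - lower)
        if lower <= p <= upper:
            cover += prob
    return cover, length


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        plain = coverage_and_length(25, p)
        corr = coverage_and_length(25, p, corrected=True)
        print(f"p={p}: Wilson cov={plain[0]:.4f} len={plain[1]:.4f} | "
              f"corrected cov={corr[0]:.4f} len={corr[1]:.4f}")
```

Under these assumptions the corrected interval always contains the uncorrected one, so its coverage can only be higher and its expected length is strictly larger, which is the trade-off at issue.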

Fig. 2. Comparison of the systematic coverage biases. The y-axis is $nS_n(p)$. From top to bottom: the systematic coverage biases of the Agresti–Coull, Wilson, Jeffreys, likelihood ratio and Wald intervals, with n = 50 and α = 0.05.
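A crude numerical proxy for the ordering shown in Figure 2 can be obtained by averaging exact coverage over a grid of p values, which smooths out the oscillation, and scaling the deviation from the nominal level by n. This is only a sketch: it does not evaluate the analytical expression for $S_n(p)$, it covers only three of the five intervals, and the function name `averaged_bias` and the grid on [0.1, 0.9] are assumptions made for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald(x, n, z):
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, z):
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(n) / (n + z * z) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def agresti_coull(x, n, z):
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def averaged_bias(interval, n=50, alpha=0.05, lo=0.1, hi=0.9, steps=400):
    """n * (grid-averaged exact coverage - nominal level) for p in [lo, hi]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    limits = [interval(x, n, z) for x in range(n + 1)]
    total = 0.0
    for k in range(steps):
        p = lo + (hi - lo) * (k + 0.5) / steps  # midpoint grid
        total += sum(comb(n, x) * p**x * (1 - p) ** (n - x)
                     for x, (low, high) in enumerate(limits)
                     if low <= p <= high)
    return n * (total / steps - (1 - alpha))


if __name__ == "__main__":
    for name, rule in [("Agresti-Coull", agresti_coull),
                       ("Wilson", wilson),
                       ("Wald", wald)]:
        print(f"{name:14s} {averaged_bias(rule):+.3f}")
```

Because the Agresti–Coull interval shares its center with the Wilson interval and is never shorter, its averaged coverage cannot fall below Wilson's in this calculation.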

F. Agresti and Coull ask if the dismal performance of the Wald interval manifests itself in other problems, including nondiscrete cases. Indeed it does. In other lattice cases such as the Poisson and negative binomial, both the considerable negative coverage bias and inefficiency in length persist. These features also show up in some continuous exponential family cases. See Brown, Cai and DasGupta (2000b) for details.

In the three important discrete cases, the binomial, Poisson and negative binomial, there is in fact some conformity in regard to which methods work well in general. Both the likelihood ratio interval (using the asymptotic chi-squared limits) and the equal-tailed Jeffreys interval perform admirably in all of these problems with regard to coverage and expected length. Perhaps there is an underlying theoretical reason for the parallel behavior of these two intervals constructed from very different foundational principles, and this seems worth further study.

G. Some discussants very logically inquire about the situation in the two-sample binomial situation. Curiously, in a way, the Wald interval in the two-sample case for the difference of proportions is less problematic than in the one-sample case. It can nevertheless be somewhat improved. Agresti and Caffo (2000) present a proposal for this problem, and Brown and Li (2001) discuss some others.

H. The discussion by Ghosh raises several interesting issues. The definition of "first-order probability matching" extends in the obvious way to any set of upper confidence limits; not just those corresponding to Bayesian intervals. There is also an obvious extension to lower confidence limits. This probability matching is a one-sided criterion. Thus a family of two-sided intervals $(L_n, U_n)$ will be first-order probability matching if

$$\Pr_p(p \le L_n) = \alpha/2 + o(n^{-1/2}) = \Pr_p(p \ge U_n).$$

As Ghosh notes, this definition cannot usefully be literally applied to the binomial problem here, because the asymptotic expansions always have a discrete oscillation term that is $O(n^{-1/2})$. However, one can correct the definition.

One way to do so involves writing asymptotic expressions for the probabilities of interest that can be divided into a "smooth" part, S, and an "oscillating" part, Osc, that averages to $O(n^{-3/2})$ with respect to any smooth density supported within (0, 1). Readers could consult BCD (2000a) for more details about such expansions. Thus, in much generality one could write

(1) $$\Pr_p(p \le L_n) = \alpha/2 + S_{L_n}(p) + \mathrm{Osc}_{L_n}(p) + O(n^{-1}),$$

where $S_{L_n}(p) = O(n^{-1/2})$, and $\mathrm{Osc}_{L_n}(p)$ has the property informally described above. We would then say that the procedure is first-order probability matching if $S_{L_n}(p) = o(n^{-1/2})$, with an analogous expression for the upper limit, $U_n$.

In this sense the equal-tail Jeffreys procedure is probability matching. We believe that the mid-P Clopper–Pearson intervals also have this asymptotic property. But several of the other proposals, including the Wald, the Wilson and the likelihood ratio intervals, are not first-order probability matching. See Cai (2001) for exact and asymptotic calculations on one-sided confidence intervals and hypothesis testing in the discrete distributions.

The failure of this one-sided, first-order property, however, has no obvious bearing on the coverage properties of the two-sided procedures considered in the paper. That is because, for any of our procedures,

(2) $$S_{L_n}(p) + S_{U_n}(p) = 0 + O(n^{-1}),$$

even when the individual terms on the left are only $O(n^{-1/2})$. All the procedures thus make compensating one-sided errors, to $O(n^{-1})$, even when they are not accurate to this degree as one-sided procedures.

This situation raises the question as to whether it is desirable to add as a secondary criterion for two-sided procedures that they also provide accurate one-sided statements, at least to the probability matching $O(n^{-1/2})$. While Ghosh argues strongly for the probability matching property, his argument does not seem to take into account the cancellation inherent in (2). We have heard some others argue in favor of such a requirement and some argue against it. We do not wish to take a strong position on this issue now. Perhaps it depends somewhat on the practical context: if in that context the confidence bounds may be interpreted and used in a one-sided fashion as well as the two-sided one, then perhaps probability matching is called for.

I. Ghosh's comments are a reminder that asymptotic theory is useful for this problem, even though exact calculations here are entirely feasible and convenient. But, as Ghosh notes, asymptotic expressions can be startlingly accurate for moderate sample sizes. Asymptotics can thus provide valid insights that are not easily drawn from a series of exact calculations. For example, the two-sided intervals also obey an expression analogous to (1),

(3) $$\Pr_p(L_n \le p \le U_n) = 1 - \alpha + S_n(p) + \mathrm{Osc}_n(p) + O(n^{-3/2}).$$

The term $S_n(p)$ is $O(n^{-1})$ and provides a useful expression for the smooth center of the oscillatory coverage plot. (See Theorem 6 of BCD (2000a) for a precise justification.) The plot in Figure 2, for n = 50, compares $S_n(p)$ for five confidence procedures. It shows how the Wilson, Jeffreys and chi-squared likelihood ratio procedures all have coverage that well approximates the nominal value, with Wilson being slightly more conservative than the other two.

As we see it our article articulated three primary goals: to demonstrate unambiguously that the Wald interval performs extremely poorly; to point out that none of the common prescriptions on when the interval is satisfactory are correct; and to put forward some recommendations on what is to be used in its place. On the basis of the discussion we feel gratified that we have satisfactorily met the first two of these goals. As Professor Casella notes, the debate about alternatives in this timeless problem will linger on, as it should. We thank the discussants again for a lucid and engaging discussion of a number of relevant issues. We are grateful for the opportunity to have learned so much from these distinguished colleagues.

ADDITIONAL REFERENCES

Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Amer. Statist. 54. To appear.

Aitkin, M., Anderson, D., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford Univ. Press.

Boos, D. D. and Hughes-Oliver, J. M. (2000). How large does n have to be for Z and t intervals? Amer. Statist. 54 121–128.

Brown, L. D., Cai, T. and DasGupta, A. (2000a). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. To appear.

Brown, L. D., Cai, T. and DasGupta, A. (2000b). Interval estimation in exponential families. Technical report, Dept. Statistics, Univ. Pennsylvania.

Brown, L. D. and Li, X. (2001). Confidence intervals for the difference of two binomial proportions. Unpublished manuscript.

Cai, T. (2001). One-sided confidence intervals and hypothesis testing in discrete distributions. Preprint.

Coe, P. R. and Tamhane, A. C. (1993). Exact repeated confidence intervals for Bernoulli parameters in a group sequential clinical trial. Controlled Clinical Trials 14 19–29.

Cox, D. R. and Reid, N. (1987). Orthogonal parameters and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 113–147.

DasGupta, A. (2001). Some further results in the binomial interval estimation problem. Preprint.

Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141–159.

Duffy, D. and Santner, T. J. (1987). Confidence intervals for a binomial parameter based on multistage tests. Biometrics 43 81–94.

Fisher, R. A. (1956). Statistical Methods for Scientific Inference. Oliver and Boyd, Edinburgh.

Gart, J. J. (1966). Alternative analyses of contingency tables. J. Roy. Statist. Soc. Ser. B 28 164–179.

Garvan, C. W. and Ghosh, M. (1997). Noninformative priors for dispersion models. Biometrika 84 976–982.

Ghosh, J. K. (1994). Higher Order Asymptotics. IMS, Hayward, CA.

Hall, P. (1983). Chi-squared approximations to the distribution of a sum of independent random variables. Ann. Statist. 11 1028–1036.

Jennison, C. and Turnbull, B. W. (1983). Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 25 49–58.

Jorgensen, B. (1997). The Theory of Dispersion Models. CRC Chapman and Hall, London.

Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.

Larson, H. J. (1974). Introduction to Probability Theory and Statistical Inference, 2nd ed. Wiley, New York.

Pratt, J. W. (1961). Length of confidence intervals. J. Amer. Statist. Assoc. 56 549–567.

Rindskopf, D. (2000). Letter to the editor. Amer. Statist. 54 88.

Rubin, D. B. and Schenker, N. (1987). Logit-based interval estimation for binomial data using the Jeffreys prior. Sociological Methodology 17 131–144.

Sterne, T. E. (1954). Some remarks on confidence or fiducial limits. Biometrika 41 275–278.

Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608.

Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329.

Yamagami, S. and Santner, T. J. (1993). Invariant small sample confidence intervals for the difference of two success probabilities. Comm. Statist. Simul. Comput. 22 33–59.