The American Statistician

ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: https://www.tandfonline.com/loi/utas20

A Cheap Trick to Improve the Power of a Conservative Hypothesis Test

Thomas J. Fisher & Michael W. Robbins

To cite this article: Thomas J. Fisher & Michael W. Robbins (2019) A Cheap Trick to Improve the Power of a Conservative Hypothesis Test, The American Statistician, 73:3, 232-242, DOI: 10.1080/00031305.2017.1395364

To link to this article: https://doi.org/10.1080/00031305.2017.1395364


Accepted author version posted online: 14 Nov 2017. Published online: 17 Jul 2018.






A Cheap Trick to Improve the Power of a Conservative Hypothesis Test

Thomas J. Fisher (a) and Michael W. Robbins (b)

(a) Department of Statistics, Miami University, Oxford, OH; (b) RAND Corporation, Pittsburgh, PA

ABSTRACT
Critical values and p-values of statistical hypothesis tests are often derived using asymptotic approximations of sampling distributions. However, this sometimes results in tests that are conservative (i.e., understate the probability of an incorrectly rejected null hypothesis by employing too stringent of a threshold for rejection). Although computationally rigorous options (e.g., the bootstrap) are available for such situations, we illustrate that simple transformations can be used to improve both the size and power of such tests. Using a logarithmic transformation, we show that the transformed statistic is asymptotically equivalent to its untransformed analogue under the null hypothesis and is divergent from the untransformed version under the alternative (yielding a potentially substantial increase in power). The transformation is applied to several easily accessible statistical hypothesis tests, a few of which are taught in introductory statistics courses. With theoretical arguments and simulations, we illustrate that the log transformation is preferable to other forms of correction (such as statistics that use a multiplier). Finally, we illustrate application of the method to a well-known dataset. Supplementary materials for this article are available online.

ARTICLE HISTORY
Received March; Revised September

KEYWORDS
Asymptotic; Bootstrap; Conservative; Hypothesis test; Logarithmic transformation; Power; Size distortion

1. Introduction

Hypothesis testing has a rich and extensive history budding from astronomy, finance, genetics, and the social sciences (see Stigler 1986). From its foundations in the trial of the Pyx at the Royal Mint of London in the 13th century (see Stigler 1999, chap. 21), through its early probabilistic and mathematical development by Bernoulli, Euler, Gauss, Laplace, Legendre, and Markov (see Hald 2007, chap. 3-4), to the indispensable results of the early 20th century (consider Student 1908; Pearson 1900, to name a few), the hypothesis test has revolutionized the practice of modern science. The formulation of the modern statistical test can be traced to the competing philosophies of Fisher (1925) and Neyman and Pearson (1933), and a convolution of the two approaches is standard practice today; see Lehmann (1999).

Key to the implementation of a statistical hypothesis test is the sampling distribution of the test statistic. Even with the advent of the bootstrap (Efron 1979) and the practicality of Bayesian methods due to the evolution of computing, in practice, many statistical results are still based on theoretical sampling distributions. In many cases, this distribution is approximated using an asymptotic result (e.g., a central limit theorem), and in finite samples, the critical values and p-values are approximated from the asymptotic distribution. However, the practice of using asymptotic approximations of sampling distributions sometimes results in statistical tests that are conservative (i.e., tests that have a smaller-than-desired rate of rejections of a true null hypothesis). Conservative tests are frequently yielded when the statistic is derived from a point process that has a limit distribution based on a continuous stochastic process, such as the Kolmogorov-Smirnov test (see Lilliefors 1967; Hollander and Wolfe 1999). Likewise, many statistics based on normal theory (the ANOVA F-test, e.g.) tend to be conservative when the underlying data have a distribution with larger tails than the normal distribution (see Pearson 1931; Glass, Peckham, and Sanders 1972). Furthermore, the Wald test included in nearly all statistical software for generalized linear models is known to be overly conservative (see Hauck and Donner 1977; Jennings 1986; Hirji, Mehta, and Patel 1987; Cox and Snell 1989, to name a few).

The reduction in the rate of Type I errors seen in conservative hypothesis tests has the adverse side effect of a reduction in power to detect a false null hypothesis. Therefore, corrections for this issue are of interest, and consequentially methods exist that improve the performance of asymptotic approximations in finite samples. Consider, for example, Edgeworth expansion (Hall 1992), which involves modification of the asymptotic distribution for finite samples by including higher-order moments (skewness and kurtosis); however, this method may require exorbitant algebraic results (the requisite theory has not been developed for many statistics that are conservative). Further, bootstrapping is a popular method wherein a sampling distribution is approximated via a resampling scheme (from the observed data in the form of a nonparametric bootstrap, or via simulation in the form of a parametric bootstrap), but in many applications, this method requires a practitioner to implement the algorithm and can mandate a substantial computational cost (Efron 1979).

Therefore, easily applicable and general methods for correcting conservative test statistics are worth exploring. In this article, we propose a simple transformation that, when applied to a test statistic, will increase its detection power under the alternative hypothesis while retaining the same sampling distribution

CONTACT Thomas J. Fisher fishert@miamioh.edu Department of Statistics, Miami University, Bishop Circle, Oxford, OH. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/TAS. © 2019 American Statistical Association

(asymptotically) under the null. The transformation is argued to improve performance under the null hypothesis when applied to statistics that are conservative; however, it will likely lead to an undesirable rate of Type I error if applied to statistics that are not conservative. In Section 2, we develop our result and relate it to statistical practice, uniformly most powerful (UMP) tests, and the standard undergraduate curriculum. With some commonly used test statistics as motivating examples, Section 3 provides some simulations demonstrating the potential increase in power yielded by our method and comparisons to resampling techniques. Our approach is applied to an interesting empirical dataset in Section 4. We follow standard statistical notation throughout this article. H_0 and H_1 represent the null and alternative hypotheses of a statistical test, respectively. The probability of a Type I error (rejecting H_0 when true) is labeled α, while the probability of a Type II error (failing to reject H_0 when false) is denoted β. The power of a statistical test is 1 − β (the probability of rejecting H_0 when false).

2. Test Statistics

Let X_n = {X_1, X_2, ..., X_n} be a sample and T_n = T_n(X_n) denote a statistic for testing the competing hypotheses H_0 and H_1. Furthermore, assume the following:

(a) T_n is strictly nonnegative: P(T_n ≥ 0) = 1,
(b) when H_0 is true: T_n = O_p(1) (likewise, T_n has a limit distribution),
(c) when H_1 is true: T_n = O_p(n^κ) for some κ > 0; that is, T_n diverges to +∞ at rate n^κ,

where O_p(·) represents order in probability (i.e., X_n = O_p(a_n) means that for any ε > 0, there exists a finite M > 0 such that P(|X_n/a_n| > M) < ε for all n). Note that as a consequence of (c), H_0 is rejected for large values of T_n. Many standard statistical methods satisfy the assumptions (the ANOVA F-test and the Pearson χ² goodness-of-fit test, to name a few). For many commonly used tests (e.g., statistics that have a χ² limit distribution), it holds that κ = 1. However, several observe κ = 1/2 (e.g., z-tests).

As an example of a test statistic that satisfies the above assumptions, consider the classic t-test. That is, let X_i ~iid N(μ, σ²) for i = 1, ..., n, where μ and σ² are unknown, and consider testing H_0: μ = μ_0 against H_1: μ ≠ μ_0. In this case, it is appropriate to choose as a test statistic T_n = |t_0|, with t_0 = √n (X̄ − μ_0)/σ̂, where X̄ is the sample mean and σ̂² is the sample variance. Clearly, T_n is nonnegative, and when H_0 is true, it holds that t_0 follows a Student's t-distribution with n − 1 degrees of freedom and has a standard normal distribution asymptotically (thus, T_n = O_p(1)). However, under H_1, t_0 obeys a noncentral t-distribution so that E[t_0] ≈ √n (μ − μ_0)/σ (hence, T_n = O_p(n^{1/2})). If one were to use T_n = t_0², we would see T_n obey an F-distribution (which is asymptotically χ²) under H_0, whereas T_n = O_p(n) under H_1. In this example, algebraic expressions for the statistic's sampling distribution in finite samples are well known, and asymptotic approximations are rarely used since it is well known that the normal approximation (or χ² in the case of an F statistic) results in inflated Type I error performance (i.e., α = P(|t_0| > t_{1−α/2,ν}) < P(|t_0| > z_{1−α/2}) for finite ν). Similar issues are encountered if using asymptotic approximations for regression-based t- and F-tests. This article focuses on circumstances where the finite-sample distribution of the test statistic is not understood and where critical values are approximated (typically using asymptotic distributions), which may lead to unreliable performance in finite samples. Next, we illustrate how transformations can be used to improve finite-sample performance.

For a given statistic T_n satisfying the assumptions above, we propose the following modified test statistic:

    T*_n = −n^κ log(1 − T_n/n^κ).   (1)

The following two theorems demonstrate that T*_n shares the same asymptotic distribution as T_n under the null hypothesis but is more likely to detect the alternative hypothesis when true (i.e., offers more power).

Theorem 2.1. When H_0 is true, T*_n →p T_n as n → ∞; moreover, T*_n and T_n share the same asymptotic distribution.

Proof. Consider

    T*_n = −n^κ log(1 − T_n/n^κ)
         = n^κ [ T_n/n^κ + (1/2)(T_n/n^κ)² + (1/3)(T_n/n^κ)³ + ··· ]
         = T_n + T_n²/(2n^κ) + T_n³/(3n^{2κ}) + ···
         = T_n + A_n.   (2)

When H_0 is true, we see A_n = O_p(n^{−κ}) from assumption (b), whence T*_n →p T_n, and T*_n shares the same asymptotic distribution as T_n from standard convergence results. □

Theorem 2.2. When H_1 is true, T*_n diverges from T_n and will be more powerful than T_n if decisions are based on the same critical values.

Proof. Consider A_n in (2). If H_1 is true, 0 ≤ A_n = O_p(n^κ) by assumption (c). It follows that for all c, P(T*_n > c) ≥ P(T_n > c), and hence T*_n can offer more power than T_n. □

A consequence of the theory that yields Theorems 2.1 and 2.2 is that if T_n is not conservative, application of (1) will exacerbate the rate of Type I error. For example, consider the classic t- or F-tests based on normal data (wherein the finite-sample distribution is known and asymptotic approximations result in liberal tests). Applying our transformation to a t- or F-test here will distort the performance further. Moreover, note that Theorems 2.1 and 2.2 hold for T*_n defined using any κ > 0 (i.e., it is not needed that κ equal the rate of divergence under H_1 as seen in assumption (c)). From (2), it is evident that the power of T*_n is larger when the value of κ used to calculate it becomes smaller (although the statistic is likely to be liberal if too small a value is chosen). We find that setting κ equal to the H_1 rate of divergence yields a test that performs well under both hypotheses. A brief simulation studying the sensitivity of κ is included in Section 3.5.

We note that computational issues may arise if the statistic takes on large values and the sample size is relatively small. That is, if T_n ≥ n^κ, T*_n is undefined.
One could easily scale T_n, and analogous results to Theorems 2.1 and 2.2 would hold. In practice, we recommend approximating T*_n with

    T*_n ≈ Σ_{i=1}^{m} T_n^i / (i n^{κ(i−1)})   (3)

for some m. In the simulations below, we use m = 6 when T_n ≥ n^κ and report the occurrences of when this approximation is needed. Unless the sample size is particularly small (in which case no statistic will have power), we find that if T_n ≥ n^κ, the alternative hypothesis is likely (overwhelmingly) true, and such a transformation is unneeded as the original statistic is already rejecting the null hypothesis.

Our proposed transformation in (1) is motivated from some results in time series analysis. There, it is well known that the classic Box-Pierce statistic (see Box and Pierce 1970) for detecting serial correlation (which utilizes an asymptotic χ² distribution) is overly conservative, and several modifications have been proposed to improve its finite-sample performance (see, e.g., Ljung and Box 1978; Peña and Rodríguez 2006). Robbins and Fisher (2015) and Fisher and Robbins (2017) illustrate that log-based adjustments have utility for correcting the conservative behavior of Box-Pierce-type statistics; the statistic in (1) is a generalization of those results.

2.1. The Utility of Other Methods for Correction

Note that there are myriad transformations that can be used to correct the Type I error rate of conservative tests. For example, one could set T†_n = b_n T_n, where b_n > 1 for all n with b_n → 1 as n → ∞. Consider b_n = n/(n − k) for k > 0, which is in the vein of the multiplier suggested by Ljung and Box (1978). Results analogous to Theorems 2.1 and 2.2 will hold for this transformation. However, in this case b_n − 1 = k/(n − k) = O(n^{−1}), and consequentially, T†_n − T_n = O_p(n^{κ−1}) under H_1. Thus, we expect that under H_1, the separation between T*_n and T_n will be greater than the separation between T†_n and T_n, implying that T*_n will be the more powerful choice of transformed test statistic.

We present simulations that compare our proposed technique to a multiplicative correction in Section 3.4. Therein, the multiplier b_n is selected using trial and error; obviously, this is not a viable procedure in practice (hence, this comparison is relegated to a short simulation study toward the end of the article). Just as it is not clear how to optimally choose a multiplier, it is not clear how to optimally implement the log transformation (e.g., what value of κ in (1) should be used?). Thus, we do not attack the problem of selecting an optimal transformation scheme in detail; our goal is simply to illustrate the potential utility of basic transformations, and some additional simulations studying the sensitivity of κ are included in Section 3.5.

2.2. Exact Level and UMP Tests

One may note that Theorem 2.2 appears to violate standard statistical findings if T_n happens to be an exact α-level or UMP test. However, the transformed statistic is only more powerful when the same sampling distribution is assumed for it and its untransformed analogue (e.g., when the same critical values are used for both T_n and T*_n). To explain, note that T*_n = f(T_n), where f(x) = −n^κ log(1 − x/n^κ) and f(·) is a monotonic increasing function over the relevant domain. Assuming H_0 is true, let c_1 represent the exact α-level critical point of T_n. It follows that

    α = P(T_n > c_1) = P(T*_n > f(c_1)).   (4)

Hence, c_2 = f(c_1) is an exact α-level critical value of T*_n. Now assume X_n is generated under an alternative hypothesis setting where T_n has power 1 − β. It follows that

    1 − β = P(T_n > c_1) = P(T*_n > c_2),   (5)

and therefore T*_n also has power equal to 1 − β. That is, when their respective true critical values are used, T_n and T*_n are of equivalent power.

The above arguments may lead to a questioning of the usefulness of the proposed result. However, suppose T_n is not an exact α-level test and tends to be conservative in practice; that is, under H_0, P(T_n > c_1) < α using the notation from (4). Given T_n is conservative, one can justify using T*_n in place of T_n with the same critical point c_1 if P(T*_n > c_1) ≈ α. For this reason, we note our transformation is sensible if T_n is notably conservative in practical settings. There are many such cases, and we explore the practical use of our transformation in simulations in Section 3.

Although the arguments that motivate our method are asymptotic in nature, the improvement in power offered by the transformed statistic (which is controlled by the discrepancy term A_n from (2)) is substantial only when T_n is large. Under large samples, for the discrepancy to be meaningful, T_n must be particularly large to balance out the denominator of n^κ seen in A_n. In such situations, the original test would likely reject the null hypothesis (rendering the transformation uninformative). Consequentially, our test is most useful in situations with moderate samples (wherein the discrepancy A_n can be substantial without mandating an excessively large T_n).

2.3. Connections to Education

Although not the primary focus of this article, we note that the aforementioned result may provide a useful example in an undergraduate course. The proof of Theorem 2.1 is an example that relies on two mathematical results with which many undergraduate students struggle (e.g., the application of Taylor/power series expansions and convergence). The proof can be simplified by setting κ = 1 and noting that A_n → 0 as n → ∞ under H_0, whereas T*_n diverges from T_n under H_1. The idea that T*_n is equivalent to T_n under H_0 but more likely to detect the alternative hypothesis follows.

Perhaps more important than the mathematical foundations, we highlight the connections to the concepts of statistical inference: statistical power, Type I error rates, and UMP tests. Our transformation can be used as a technique to help demonstrate why a UMP test is the most powerful test, following the arguments around (5). The arguments in this article can be used to motivate discussion of the limitations (and possible corrections) of asymptotic results. The article can be used for a discussion into the conservative behavior of some statistics and the various approaches of addressing it. Last, in the following sections, we provide several examples of the transformation applied to well-known statistical methods.
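The separation argument in Section 2.1 can be checked numerically. The sketch below (function names and the illustrative choices b_n = n/(n − 3), κ = 1, and T_n = n/4 are ours) compares how far the log transform and a multiplicative correction move the statistic away from T_n as n grows under H_1; the log gap grows linearly in n while the multiplier gap stays bounded:

```python
import math

def log_transform(t, n, kappa=1.0):
    # T*_n from Eq. (1)
    return -(n ** kappa) * math.log(1.0 - t / n ** kappa)

def multiplier_transform(t, n, k=3):
    # T-dagger_n = b_n * T_n with b_n = n / (n - k), in the vein of Ljung-Box
    return n / (n - k) * t

# Under H1 with kappa = 1, T_n grows proportionally with n; take T_n = n/4.
for n in (50, 100, 200):
    t = n / 4
    gap_log = log_transform(t, n) - t          # A_n = O_p(n^kappa)
    gap_mult = multiplier_transform(t, n) - t  # O_p(n^{kappa - 1})
    print(n, round(gap_log, 3), round(gap_mult, 3))
```

Because T_n/n^κ is held fixed here, the log gap doubles exactly when n doubles, while the multiplier gap tends to a constant, matching the O_p rates derived above.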

3. Implementation Examples

The results of the previous section are presented in a general framework. In the following subsections, we provide several applications of the proposed method to well-known (and easy to implement) statistical tests. With each test, to demonstrate the practicality and usefulness of our proposed result, we study its performance in finite samples via simulation. The goal of the simulations is to demonstrate that the proposed method can provide improvements in terms of statistical power while achieving acceptable Type I error rates (for certain conservative tests). For additional comparisons, we include a bootstrapped (nonparametric resampling) version of the original statistic. Specific scenarios are outlined below. We retain much of the aforementioned notation, and each of the presented simulation results is based on 10,000 replications with 1000 bootstrap samples. The simulation examples are followed by a brief discussion of other statistical tests that could be modified to incorporate our modification.

3.1. Tests for Correlation

Our first example considers testing for correlation between two sets of observations, x_i and y_i, for i = 1, ..., n, using the Pearson correlation coefficient, r:

    r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^{n} (x_i − x̄)² Σ_{i=1}^{n} (y_i − ȳ)² ).

Properties of r have been well studied (see Casella and Berger 1990, and others), and r is covered in nearly every introductory statistics textbook. Under the assumption that the data are bivariate normal, we have the well-known result

    F = (n − 2) r² / (1 − r²)   (6)

that behaves as an F-distributed random variable with ν_1 = 1 and ν_2 = n − 2 degrees of freedom (the F statistic above is derived by squaring the standard t-test for correlation). The F test can utilize an asymptotic χ² distribution but is known to be liberal when applied to normal data. Alternatively, one can use the simpler (although less accurate in finite samples) result

    T = n r²,   (7)

which follows a χ²_1 distribution asymptotically. This statistic is a special case of the test in Haugh (1976) to determine if two stationary time series are correlated (specifically, T can be considered a Haugh-type test for cross-correlation at lag 0). It is known that T is conservative since it is negatively biased compared to its asymptotic distribution (see Box and Pierce 1970; Haugh 1976; Ljung and Box 1978, for further details).

Here, we perform a study with leptokurtic data where normal theory results (e.g., the F statistic) tend to be conservative (see Pearson 1931; Glass, Peckham, and Sanders 1972). Kendall and Stuart (1977) justify such a test by demonstrating that the F statistic holds for nonnormal data asymptotically. We will consider transformed versions of the described statistics. This includes F* = −n log(1 − F/n) and T* = −n log(1 − T/n) = −n log(1 − r²). We note that our modified statistic, T*, exists with probability 1 since 0 ≤ r² ≤ 1, while the F* statistic may need to be approximated using (3). The T* tests use critical values from the asymptotic χ² distribution, while the F statistic uses the finite-sample F distribution. For added comparison, we also include F_χ and F*_χ, which are the versions of F and F*, respectively, that utilize a χ²_1 critical value. We also include a bootstrapped version of the T statistic, denoted T_B, where the y_i terms are resampled with replacement.

A sample {x_i}_{i=1}^{n} is generated as a set of uniform random variables in the interval (1, 20). The sample {y_i}_{i=1}^{n} is generated by y_i = 5 + δx_i + 3ε_i, where the ε_i terms are iid t-distributed random variables with three degrees of freedom. Table 1 reports the Type I error rates at α = 1% under an assortment of sample sizes n for the F statistic, the variants F_χ and F*_χ that use an asymptotic χ²_1 distribution, the statistic T, our transformed T*, and the bootstrapped T_B. Here, the observations {y_i}_{i=1}^{n} are generated with δ = 0.

Table 1. Rate of rejections at α = 1%, out of 10,000 replications, of the F statistic based on an F(1, n − 2) distribution, F_χ and F*_χ based on an asymptotic χ²_1 distribution, the χ²-based test T, the transformed T*, and the bootstrapped T_B (based on 1000 resamples) under the null hypothesis at seven sample sizes n.

From the simulation, we can see that the F statistic and the χ²-based result are fairly conservative (this is typical for leptokurtic symmetric distributions; see Pearson 1931; Glass, Peckham, and Sanders 1972), and we note that in most cases T* reports error rates closer to the nominal level. As expected, the bootstrapped T_B provides error rates close to the nominal level. Both F_χ and F*_χ, which utilize the asymptotic χ²_1 distribution, are a bit oversized. The F*_χ test helps to highlight that our transformation will exacerbate finite-sample issues if applied to statistics that are liberal. Last, we note that in zero cases in this simulation did we need to approximate F* using (3), although this is not guaranteed in general (unlike T*).

To study the performance under the alternative hypothesis, we look at n = 35 and let δ, a perturbation parameter, range from 0 to 1 in steps of 0.025. We consider only those statistics with adequate or conservative Type I error rates from the study above. As the value of δ deviates from zero, the null hypothesis is less correct and the statistics should favor the alternative. Figure 1 provides a visualization of the statistics under this scenario. In Figure 1(a), we see the transformed statistic T* provides more than a 5% improvement in power over T and is more powerful than F. Our transformed statistic T* also provides more power than T_B in these simulations, all without the added computational cost.
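The statistics of this subsection are straightforward to compute. A minimal sketch (function names ours; the data-generating step with uniform x_i and t-distributed errors is omitted for brevity):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_stats(x, y):
    """Return the F statistic (6), the chi-square statistic T (7),
    and the transformed T* = -n log(1 - r^2)."""
    n = len(x)
    r2 = pearson_r(x, y) ** 2
    F = (n - 2) * r2 / (1 - r2)       # F(1, n - 2) under bivariate normality
    T = n * r2                        # asymptotically chi^2 with 1 df
    T_star = -n * math.log(1 - r2)    # exists w.p. 1 since 0 <= r^2 <= 1
    return F, T, T_star
```

Note that T* ≥ T always holds (the discrepancy A_n of Eq. (2) is nonnegative), which is exactly the mechanism behind the power gain reported above.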

∗ Figure . Visualization of F, T , T ,andTB (based on  resamples) under the alternative hypothesis as a function of perturbation parameter δ for n = 35.(a)PowerofF, T T ∗ T α = T T ∗ H , ,and B at 1%.(b)Meanvalueof and under 1. power than TB in these simulations—all without the added com- 1tok with the mean from time points k + 1ton.Themagni- putational cost. Furthermore, Figure 1(b) displays the observed tude of maximum absolute discrepancy is mean value of the two test statistics T and T ∗ as a function of δ | |, (F was excluded as it has a different sampling distribution than max CUSUMk (8) 1≤k≤n T). We see that, consistent with Theorem 2.2, T ∗ deviates from T as δ grows. which provides the basis of a test statistic and the value of k that maximizes |CUSUMk| is an estimate of the change point time c. The distribution of (8) depends on the limiting struc- 3.2. Change Point Testing ture of the process t ;seeCsörgoandHorváth(˝ 1997). It follows Our second example involves a common problem in statistics, that the change point problem. Here, we study the problem in the 1 D context of a detecting a change point in the mean of a time series. C = max |CUSUMk| −→ sup |B(t)|, (9) τˆ ≤ ≤ { }n 1 k n 0≤t≤1 Let Xt t=1 be an observed time series from a process with a sta- tionary covariance structure (see Brockwell and Davis 1991,or where τˆ2 is a consistent estimator of the long-run variance: countless other texts). We are interested in determining if {X } t follows the model n ∞ τ 2 = 1  = ( , ) μ +  ≤ ≤ , lim var t cov t t−k t for 1 t c n→∞ n = t=1 =−∞ Xt μ + δ +  < ≤ , k t for c t n ∞ 2 = σ + 2 cov ( ,− ) , for nonzero δ,whereμ is an unknown mean, δ is the magnitude t t k k=1 of the (potential) mean shift at unknown time c,and{t } is a zero-mean stationary series with finite covariance and variance D where −→ denotes convergence in distribution and B(t) is a σ 2. We wish to statistically test H : δ = 0versusH : δ = 0. √ 0 1 standard Brownian bridge. 
Note that C diverges at a rate of n The change point problem is well studied with a colossal vol- under the alternative hypothesis and assumptions (a)–(c) hold ume of literature contributing to the our current understanding. in this setting. Changepoint statistics akin to C are often conser- The work of Page (1954, 1955) is generally credited with intro- vative (see, e.g., Robbins 2009) since the maximum of a discrete ducing the problem. Quandt (1958, 1960)iscreditedwithits set is being approximated with the supremum of a continuous extension into linear models (i.e., segmented ). process. A review of parametric change point analysis can be found in For additional comparisons, the bootstrapped distribution of Chen and Gupta (2012). Here, we use the large sample nonpara- C is found using the stationary bootstrap (Kunsch 1989;Poli- metric methods described in CsörgoandHorváth(˝ 1997)and tis and Romano 1994)andthefixed block bootstrap (see Kirch more recently studied in Robbins et al. (2011). Consider testing 2007). In the former, random blocks of length l (where l is the outlined hypothesis with ( ) chosen from a Geometric distribution with mean log n )are k n sampled to approximate the sampling distribution . In the lat- 1 k = ( ) CUSUMk = √ Xt − Xt ter, blocks of fixed length l log n are sampled. The block- n n  t=1  t=1 sampling technique is a necessary step to retain any serial cor- √  relation in the bootstrapped series. = k − k ¯ − ¯ ∗ , 1 n Xk X  n n k In the simulations below, we let t follow an autoregres-  = φ + η φ = .   sive process of order 1: t t−1 t ,with 0 1and ¯ = −1 k ¯ ∗ = ( − )−1 n η ∼iid ( , ) τ 2 where Xk k t=1 Xt and Xk n k t=k+1 Xt .The t N 0 1 .Toestimatethelong-runvariance ,alinear CUSUM statistic compares the sample mean from time points combination of the variance and serial covariance terms, we use THE AMERICAN STATISTICIAN 237
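The CUSUM statistic C of (9) and its transformed version (with κ = 1/2, since C = O_p(n^{1/2}) under H_1) can be sketched as follows. For simplicity, the sketch (names ours) uses the plain sample variance in place of the Bartlett long-run variance estimator, which is appropriate only for serially uncorrelated noise:

```python
import math

def cusum(x, k):
    # CUSUM_k = (1/sqrt(n)) * (sum_{t<=k} X_t - (k/n) * sum_t X_t)
    n = len(x)
    return (sum(x[:k]) - k / n * sum(x)) / math.sqrt(n)

def change_point_stat(x, tau2=None):
    """C = max_k |CUSUM_k| / tau_hat, Eq. (9). tau2 defaults to the sample
    variance (fine for iid noise; use a Bartlett-type long-run variance
    estimate for serially correlated data)."""
    n = len(x)
    if tau2 is None:
        m = sum(x) / n
        tau2 = sum((v - m) ** 2 for v in x) / n
    return max(abs(cusum(x, k)) for k in range(1, n + 1)) / math.sqrt(tau2)

def transformed_stat(c, n):
    # C* = -sqrt(n) * log(1 - C / sqrt(n)), i.e. Eq. (1) with kappa = 1/2
    rn = math.sqrt(n)
    return -rn * math.log(1.0 - c / rn)
```

For a series with an obvious mid-sample mean shift, C is large and C* exceeds it, whereas for a stable series both remain small.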

Table 2. Rate of rejections at α = 5% (1%) of C, C*, Cf, and Cr (bootstrap-based) under H0 of no change point at several sample sizes n. (The replication counts and tabulated entries are not recoverable from this scan.)

the nonparametric Bartlett-based estimator:

τ̂² = (1/n) Σ_{t=1}^{n} (X_t − X̄)² + 2 Σ_{s=1}^{q_n} (1 − s/(q_n + 1)) · (1/(n − s)) Σ_{t=1}^{n−s} (X_t − X̄)(X_{t+s} − X̄),

with q_n = ⌊n^{1/3}⌋ (see Newey and West 1987), where ⌊·⌋ denotes the floor function. The p-values and null distribution of C can be found via the well-known expression:

P( sup_{0≤t≤1} |B(t)| > x ) = 2 Σ_{k=1}^{∞} (−1)^{k+1} exp{−2k²x²},  x > 0.

We compare the statistic C to a transformed version C* = −n log(1 − C/n) and the two bootstrapped versions, Cf and Cr, the fixed and random blocks, respectively. To begin, consider an empirical size study (δ = 0) and compare the rejection rates at various sample sizes and two significance levels. Table 2 demonstrates that all the statistics exhibit conservative or acceptable Type I error performance in this setting (see Robbins et al. 2011 for a more comprehensive study of C). Overall, the Type I error rates do improve for larger samples.

Next, we explore the power of the statistics as a function of the perturbation parameter δ. The time series follows the aforementioned model with a change point occurring at c = n/2 and magnitude δ. Figure 2 displays the empirical power of C, C*, and Cf as a function of δ (left panel; Cr is excluded from the image as it has slightly less power than Cf) and the empirical distribution of the simulated statistics C and C* at δ = 0.50 (right panel) for sample size n = 100. We see that our proposed method can lead to more than a 10% improvement in terms of power over C. Further, our transformed statistic provides more power than either bootstrap technique, with an increase of roughly 4% in power. Figure 2(b) demonstrates that the distribution of our proposed method results in a statistic shifted to the right of C in the upper tails, hence yielding a more powerful method. We also note that in no case was the approximated (3) transformed statistic needed in these simulations.

3.3. Logistic Regression

As a third example, consider testing the significance of predictor variables in a logistic regression. That is, consider a dichotomous response variable Y with success probability p = P(Y = 1) and predictor variables X_{1i}, X_{2i}, ..., X_{ki} for i = 1, ..., n. We estimate the logarithm of the odds ratio

logit(E[Y_i | X_i]) = log( p_i / (1 − p_i) ) = γ_0 + γ_1 x_{1i} + γ_2 x_{2i} + ··· + γ_k x_{ki}    (10)

using maximum likelihood. Testing of the model (and parameters) can be performed using a Wald statistic, which is a simple quadratic form:

W_γ = γ̂ᵀ Σ_γ^{−1} γ̂ ∼ χ²_k  as n → ∞,    (11)

where γ̂ = (γ̂_1, ..., γ̂_k)ᵀ, Σ_γ is the estimated covariance matrix of the parameters γ̂, and χ²_k is a chi-squared distribution with k degrees of freedom. We note that the Wald statistic is suboptimal and not the only test available for this type of inference (e.g., the likelihood ratio and Rao score tests; see McCullagh and Nelder 1989; Agresti 2015), but it is included by default in both SAS and R routines implementing logistic regression. The Wald statistic can also be used to test the significance of an individual fitted parameter γ̂_j with W_j = γ̂_j² / σ̂²_{γ_j} for j = 1, ..., k, where σ̂²_{γ_j} is the jth diagonal element of Σ_γ, which follows (asymptotically) a χ²_1 distribution.

Consider the transformed statistics W*_γ = −n log(1 − W_γ/n) and the analogous W*_j, which will follow the same asymptotic χ² distributions by Theorem 2.1. We also consider bootstrapped versions of W_γ and W_j, denoted W^b_γ and W^b_j, respectively. There, we resample (with replacement) the response variables Y_i while considering X_1 and X_2 constant.
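The mechanics of the transformation are easy to verify numerically. The sketch below (plain Python, whereas the paper's supplemental code is in R) checks three properties of T* = −n log(1 − T/n): it always exceeds T, it converges to T as n grows with T fixed (so the null χ² approximation is retained), and it therefore yields a smaller χ² p-value than the untransformed statistic.

```python
import math

def log_transform(t, n):
    """Cheap-trick transform T* = -n*log(1 - T/n); requires T < n."""
    if t >= n:
        raise ValueError("T >= n: T* is undefined and must be approximated")
    return -n * math.log(1.0 - t / n)

def chi2_2_sf(x):
    """Exact survival function of a chi-squared(2) variable: P(X > x) = exp(-x/2)."""
    return math.exp(-x / 2.0)

t = 5.99  # the chi-squared(2) critical value at alpha = 5%
for n in (20, 50, 100, 10**6):
    t_star = log_transform(t, n)
    # T* inflates T in finite samples but converges to T as n -> infinity
    print(n, round(t_star, 5), chi2_2_sf(t_star) < chi2_2_sf(t))
```

Since T* ≈ T + T²/(2n), the inflation (and hence the power gain relative to an asymptotic critical value) fades at rate 1/n.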

∗ Figure . Performance of C, C ,andCf (based on  bootstrapped block samples) under the alternative hypothesis. Change point occurs at time n/2 with magnitude δ ∗ ∗ for sample size n = 100.(a)PowerofC, C ,andCf at α = 5%. (b) Distributions of C and C at δ = 0.5. 238 T. J. FISHER AND M. W. ROBBINS

α = Table . Rate of rejections at 5%, out of , replications, of Wald, trans- of pi is near the extreme of 0 but the error rate does increase with formed Wald, and bootstrapped Wald (based on  resamples) statistics under n. The bootstrap test exhibit poor Type I errors in the case when the null hypothesis at three sample sizes n and true proportions pi. p = 0.1 but improves for other values and for larger samples. p = . p = . p = . i 0 1 i 0 3 i 0 50 At the moderate proportion of 0.50, the test statistics are still n          conservative for small samples but closer to the nominal level of 5% at the larger sample size. Last, we note that in one occur- W γ . . . . . . . . . = = . ∗ rence, out of 10,000 replications (n 15, p 0 1) was the test Wγ . . . . . . . . . ∗ b statistic greater than the sample size (wherein Tn would have Wγ . . . . . . . . . W to be approximated using (3)). However, the issue of a sample 1 . . . . . . . . . W ∗ 1 . . . . . . . . . with no variability appears problematic for the bootstrapping W b 1 . . . . . . . . . algorithm, particularly with small sample sizes. For instance, in 2019 of our simulations at n = 15, p = 0.1, an inseparable b andWj , respectively. There, we resample (with replacement) the sample was generated, but even in the 7981 simulations with a response variables Yi while considering X1 and X2 constant. separable sample, on average 158 of the bootstrapped samples Some numerical consideration is necessary when fitting were inseparable (an inseparable sample will result in a Wald logistic models such as (10), particularly when bootstrapping is statistic value of approximately zero), which contributes to the involved. In small samples, it is possible to experience conver- inflatedTypeIerrorrates. gence issues with the iterative nonlinear optimizer. 
Our simula- To simulate under the alternative hypothesis (and there- tion routine performs some error checking and retaining only fore assess power), we set the success probability of Yi to = / = ,..., those samples that result in good numeric fits. Further, in sce- pi X1i 100 for all i 1 n,whereX1i is discrete uniform narios with a large likelihood of 0 (or 1) responses, resampling asbefore.Notethatthischoiceofpi does not satisfy the logistic the binary response variable Yi canresultinaninseparableset model of (10); however, this setting is used since it provides (i.e., all response values take on 0 or 1). We record the occur- a simple situation in which H0 is falsified. Large values of X1i rence of such situations although it does not necessary result in will result in a greater proportion of success in Yi, while smaller a poor fit. The Wald statistic is also an example where, on occa- values of X1i result in smaller proportions, and X2i provides no sion, the approximation in (3) is necessary. We keep track of the contribution. occurrence of times our transformed statistic is approximated Figure 3 provides two visualizations of the Wald, transformed and the number of occurrences with an inseparable sample. statistic, and bootstrapped statistic under this alternative. The ∗ ∗ α = We begin by studying the Type I error rates of Wγ and Wj rate of rejections at 5% significance level is displayed asa compared to Wγ and Wj, respectively, and their bootstrapped function of the sample size n. analogues. Let k = 2andthepredictorX1i is generated as dis- From Figure 3, we see that the transformed statistic can pro- crete uniform random variables between 10 and 90. 
The pre- vide more than a 17% improvement in power (at n = 22 in dictor X2i is selected from the set {50, 100, 150, 200} with Figure 3(a)), and consistently provides an increase in power at equalprobabilityandwetreateachpredictorasacovariate.The smallersamplesizes,overtheuntransformedWaldstatistic.We response variable Yi is generated as a Bernoulli random variable also note that the bootstrapped Wald statistic is most powerful with pi independent of X1i and X2i.Weconsideranassortmentof here; however, its Type I error rate is inflated at smaller sample finite samples n and values of pi ranging from near the extreme sizes. Furthermore, we note that only in nine total cases did the (0.1) to the middle (0.5) of feasible values and the rejection rates unmodified test statistic exceed the sample size (twice each for ∗ ∗ b b γ = , γ = for Wγ , Wγ , Wγ , Wj, Wj and Wj are displayed in Table 3 for W and W1 at n 10 14 and once for W at n 22). In gen- α = 5%. eral, we find that the issue of a non-existent modified statistic We note that, with the exceptions of the bootstrapped ver- appears only under the alternative hypothesis when the unmod- sions, all the statistics are very conservative when the true value ifiedstatisticisalreadyrejectingthenullhypothesis.InthisWald

Figure . Power performance of Wald, transformed statistic, and bootstrapped Wald (based on  resamples) at α = 5% as a function of the sample size n. Success Y X X W W ∗ W b α = W W ∗ W ∗ α = proportion of response i is a linear function of 1i, while 2i provides no contribution. (a) Power of γ , γ and γ at 5%.(b)Powerof 1, 1 and b at 5%. THE AMERICAN STATISTICIAN 239

∗ † Table . Rate of rejections at at three α-levels for Wγ , Wγ ,andWγ , out of , 10,000 iterations at the 0.1% significance level when it has a 4% replications, for sample size n = 50 and success probability .. rejection rate at the 5% significance level. Importantly, as shown ∗ † in Table 4, the log transformation of (1)doesmoretomitigate α Wγ Wγ Wγ this issue than the test that uses a multiplicative correction. 5% . . . ∗ † Choosing the multiplier bn that ensures that the Tn and Tn 1% . . . 0.1% . . . tests observe the sample Type I error rate for a specific n is † equivalent to selecting a critical value that ensures that Tn has a desired Type I error rate. Therefore, in accordance with the α = ∗ test example, the 5% critical values are 5.99 and 3.84 for arguments used in Section 2.2,weanticipatethatiftheTn and † Wγ and W1, respectively; that is to say, for example, that the T testsobservethesamesizeattheα significance level, they ≥ ∗ n event that W1 n (in which case W1 is undefined) only occurs will also observe the same power at that level of significance. when H0 is clearly rejected. Some additional simulations with Nonetheless, we favor the use of the proposed log transforma- an ANOVA F-test (see supplemental code) show a nonexistent tion over a multiplicative correction (as justified with theoreti- modified statistic is only an issue when the unmodified statistic cal arguments in Section 2.1). When bn has been selected so that ∗ † has power near 100%. Tn and Tn observe the same size for some desired significance α ∗ level ,theTn test will be more powerful at significance levels less than α. That is, if the tests report p-values less than α,the 3.4. Comparison to a Multiplicative Correction ∗ Tn test will observe a p-value smaller than that yielded by the † As discussed in Section 2.1,onemayimproveaconservative Tn test.Thisisillustratedempiricallyasfollows. 
hypothesis test Tn with a multiplicative correction bn > 1based Figure 4 displays the empirical power functions of the Wγ , ∗ † δ on n such that bn → 1asn →∞. Here, to demonstrate the pro- Wγ ,andWγ tests as a function of the term from (12) for signif- posed method can have improved performance compared with icance levels of 1% (Figure 4(a)) and 0.1% (Figure 4(b)) when the † = = . ∗ one that uses a multiplicative correction, Tn bnTn,weruna setting of Table 4 is used (e.g., bn 1 038). As expected,Wγ ,and † short simulation. The response variable Yi,fori = 1,...,50, Wγ display equivalent power at α = 5%; therefore, results for ∗ is generated in the logistical regression setting with success that significance level are omitted. In Figure 4 aweseethatWγ † probability provides a roughly 2% improvement over Wγ and both improve over Wγ by upwards of 5%. However, at the smaller α-level of = 1 , = δ pi where z Xi (12) 0.1% significance, we see in Figure 4bthatourmethodcansub- 1 + e−z stantially improve over the unmodified Wald by nearly nearly and Xi is a mean zero normal random variable with standard 20% and over the multiplicative corrected test by 12%. deviation2.WecomparetheWaldtestWγ definedin(11)to ∗ ∗ We also see the divergence of the observed values ofWγ from our modified test Wγ and a multiplicatively corrected version † Wγ and Wγ in our simulations. † = Wγ bnWγ .Themultiplicativetermbn is chosen through trial These observations lead one to wonder why the improve- † ∗ and error such that Wγ and Wγ havethesameempiricalType ment in power offered by our method becomes augmented for α = = . Ierrorrateatnominallevel 5%; this results in bn 1 038 smaller values of α. As indicated in (2), the difference between = ∗ † when we set n 50. Table 4 reports simulated Type I error rates Tn and Tn (and between Tn and Tn) becomes larger for smaller (wherein the true proportion of success is fixed at 0.1824 with n and/or larger Tn. 
By design, Tn is larger when the null hypoth- δ ∗ no influence from X1)forthesethreetestsatthe5%,1%,and esis is “more false” (i.e., grows). (The divergence of Tn from Tn 0.1% significance levels. We note that conservative tests tend with increasing δ is illustrated in Figure 1(b).) However, when α to become, in essence, more conservative for smaller values of is moderate (e.g., 5%), all methods will reject the null hypoth- significance levels. For instance, the Wγ test rejected 0 out of esis by the time that δ is large enough to induce divergence

Figure . Power performance of Wald, our transformed statistic and a multiplicative corrected version at α = 1% and α = 0.1% as a function of perturbation parameter δ. ∗ † ∗ † Success proportion of response Yi is a logit function of δXi, where Xi ∼ N(0,σ = 2).(a)PowerofWγ , Wγ ,andWγ at α = 1%.(b)PowerofWγ , Wγ ,andWγ at α = 0.1%. 240 T. J. FISHER AND M. W. ROBBINS

∗ between Tn and Tn, and consequentially the distinction in power the rate of divergence under H1) provides improvement over the between the methods is not particularly substantial. Decreas- unmodified statistic and is comparable to bootstrapping. ing α increases the threshold for rejection, and as a result the improvement in power offered by the log transformation is more visible. 3.6. Other Examples The above examples constitute just three of the plethora of sce- narios in which our trick can be implemented. The Wald statis- 3.5. Sensitivity of κ value tic is a general result and can be derived when estimation is ∗ performed using maximum likelihood methods (or some other As discussed in Section 2, our transformed statistic T can method with asymptotic normality); most generalized linear κ> be defined with any 0. For simplicity, and based on our models will satisfy this requirement and many software imple- κ previous simulation results, setting totherateofdivergence mentations of such methods include a Wald statistic by default. under H1 appears to provide a good balance of ease-of-use and Furthermore, if the score (or Lagrange multiplier) test is con- performance. Here, via simulation, we briefly study the sensitiv- servative,ittoocouldbemodifiedtoincludeourtrick.This κ ity of and how it may effect the performance of a transformed opens to the door to a number of methods that can potentially statistic. Using a similar framework as the simulation from be improved if they are conservative under certain settings— Section 3.4, we consider a Wald statistic based on the aforemen- consider the log-rank or Kaplan–Meier estimate from survival tioned logistic model. Here, we compare multiple versions of a analysis for instance. 
b transformed statistic to Wγ and Wγ (the bootstrapped Wald): The correlation test provides an example of a normal the- ∗ κ = κ=2 Wγ (that described in the previous sections, 1), Wγ , ory result applied to heavy-tailed data; these methods tend to κ=0.5 our transformed statistic with κ = 2, Wγ ,atransformed be conservative and can be improved with our transforma- κ=0.7981 statistic with κ = 0.5andWγ , a transformed statistic with tion. Additional examples include modifying the ANOVA F- κ = 0.7891. In the latter case, the value of κ was chosen via trial test, tests for linear regression and other designs. In a similar and error so that the transformed statistic has approximately vein, the Hotelling T 2 statistic in multivariate analysis is also 5% Type I errors in the below simulation. Selecting a constant derived under the assumption of normality. Furthermore, these value, whether bn as in Section 3.4 or κ here, dependent on α, methods relate to likelihood ratio test (LRT) for Gaussian data. n, and the underlying model via trial and error is not feasible In general, the distribution of LRT tends to be approximated in general; here, we do so to present results where an optimal (consider the well-known asymptotic χ 2 approximation) and (in terms of Type I error performance) can improve in terms canbehaveconservativelyinsmallsamples.Ourtrickcanoffer of power compared to our more naive approach and over improvement in power in these scenarios as well. bootstrapping. Last, the CUSUM example showed the use of our method Our simulation is analogous to that in Section 3.4;how- when the test statistic follows a point process that is approxi- ever, here, we report the results numerically (as compared mated through a continuous stochastic process—approximation to graphically) to see the possible improvement provided. 
A of a statistic based on a discrete processes with a limit pro- dichotomous response, of length n = 50 is generated following cess that is continuous will usually lead to conservative tests. the logistic model in (12) except here Z =−1.5 + δX1,where This example closely relates to the well-known Kolmogorov– X1 is a uniform r.v. on the interval (0, 3). Table 5 reports the Smirnov statistic. In fact, our method can lead to substantial rate of rejections of the aforementioned statistics as a function power increases there as well when critical values and p-values of the perturbation parameter δ. are approximated from the asymptotic distribution, although We see in the results, that consistent with the results in the exact distribution tends to be available in small samples Section 3.3, bootstrapping provides more power than Wγ and (when our method provides the most improvement); see Birn- ∗ Wγ (albeit a relatively small improvement). We also see that baum (1952), Durbin (1973), and Simard and L’Ecuyer (2011) the transformed statistic with an optimal κ value provides more for a discussion on the distribution of the Kolmogorov–Smirnov power than all the other statistics reported with the exception of statistic. Implementations of a LRT, t-test, ANOVA F-test, and κ=0.5 Wγ , which has inflated Type I errors. Furthermore, we see the Kolmogorov–Smirnov test are included in the supplemental that our recommended approach (where κ is selected to match source code.
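As an illustration of the Kolmogorov–Smirnov remark, the sketch below applies the trick to the scaled statistic K = √n·D_n, whose asymptotic null distribution is the same sup-of-Brownian-bridge law used in the CUSUM example. The scaling K* = −n log(1 − K/n) is our reading of how the trick would transfer, and the sample is hand-picked for illustration; the supplemental code may use a different convention.

```python
import math

def sup_bb_sf(x, terms=100):
    """Asymptotic KS tail: P(sup|B(t)| > x) = 2*sum_{k>=1} (-1)^(k+1) exp(-2 k^2 x^2)."""
    return 2.0 * sum((-1) ** (k + 1) * math.exp(-2.0 * k * k * x * x)
                     for k in range(1, terms + 1))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov distance D_n against a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))

# Hypothetical data: test uniformity on (0, 1) for a small hand-picked sample
sample = [0.05, 0.10, 0.15, 0.22, 0.31, 0.33, 0.40, 0.42, 0.48, 0.55]
n = len(sample)
k_stat = math.sqrt(n) * ks_statistic(sample, lambda x: x)  # K = sqrt(n) * D_n
k_star = -n * math.log(1.0 - k_stat / n)                   # transformed (assumed scaling)
print(round(sup_bb_sf(k_stat), 4), round(sup_bb_sf(k_star), 4))
```

As with the CUSUM and Wald examples, the transformed statistic sits slightly above the original, so the asymptotic p-value shrinks.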

Table 5. Rate of rejection of W_γ, W*_γ, W^{κ=2}_γ, W^{κ=0.5}_γ, W^{κ=0.7981}_γ, and W^b_γ for sample size n = 50, with success probability based on the perturbation parameter δ in the logistic model (12). (The replication count and tabulated entries, one column per value of δ, are not recoverable from this scan.)

4. Empirical Example

Table 6. Test for significance of the Challenger logistic model, reporting Wald (W) and transformed Wald (W*) values and p-values for the full model, temperature, and pressure. (The tabulated values are quoted in the text below.)

On January 28, 1986, the space shuttle Challenger experienced a catastrophic failure 73 sec after launch. It has been well established that a cold launch temperature contributed to the failure of a field-joint O-ring, which, in turn, led to a combustion gas leak in a Solid Rocket Booster that led to the structural breakup and loss of the spacecraft; see the Presidential Commission on the Space Shuttle Challenger Accident (1986). A statistical analysis of pre-Challenger launch data was presented by Dalal, Fowlkes, and Hoadley (1989). Here, we mimic part of the analysis in Section 3.1 of that paper using our proposed methodology. We note that this particular dataset has been well studied using a multitude of methods (see Tufte 1997; Maranzano and Krzysztofowicz 2008) and is included as an example in many introductory engineering and statistics courses.

Consider the binary response of a primary O-ring incident by either erosion or blowby, with possible thermal-distress predictor variables: temperature and leak-check pressure. The raw data can be found in Table 1 of Dalal, Fowlkes, and Hoadley (1989) and are displayed in Figure 5. Visually, we see that most O-ring failures occurred at lower temperatures and higher pressures. Consider fitting a full logistic model with both temperature t and pressure s as predictors (this model is excluded from the dichotomous response discussion in Dalal, Fowlkes, and Hoadley 1989, as s is argued to have little-to-no contribution). The resulting fit is a version of (10) with coefficients γ̂_0 = 13.292, γ̂_t = −0.229, and γ̂_s = 0.010 and with respective standard errors 7.664, 0.110, and 0.009.

Consider a Wald statistic to test whether the model provides some explanation; that is, jointly test whether t and/or s contribute to the response. The resulting Wald statistics are summarized in Table 6. The Wald value for the full model is 5.4027, with a corresponding p-value of 0.0671 from the χ²₂ distribution. This suggests that the model does not explain the response at the 5% significance level. However, our transformed statistic provides a value of 6.1583 with corresponding p-value 0.0460, leading to a significant finding.

When the Wald test is performed on each predictor variable individually, the corresponding values are W_t = 4.3224 and W_s = 1.3415, with p-values 0.0376 and 0.2468, respectively. Using our transformed method, the values are W*_t = 4.7879 and W*_s = 1.3823, with corresponding p-values 0.0286 and 0.2397, respectively. We do not consider the bootstrap here since it was shown, in Section 3.3, to have moderately inflated Type I error rates when parameters were similar to those seen here.

Given the moderate sample size, coupled with the simulation results of Section 3.3, we can conclude that the full model does explain the failure of O-rings even if one includes the (insignificant) predictor variable pressure, which is a different finding than that of the unmodified Wald method. Furthermore, if one considers the popular Bonferroni correction for the multiple comparisons (also a conservative procedure) in looking at the individual predictors, we note that no predictor is significant according to the adjusted significance level of α* = 0.025, although our modified method provides a p-value close to that nominal level. Last, we highlight the behavior of our test statistic: in the case of pressure (an accepted true null hypothesis of no significance), our modified statistic matches the traditional Wald to the second decimal place, whereas in the case of the full model and the temperature variable (alternative hypotheses), we see a stronger deviation from the traditional Wald.

5. Discussion

Asymptotic approximations of sampling distributions of test statistics are popular due to their convenience. For instance, it may not be feasible to extract the algebraic distribution of the test statistic in finite samples, whereas the large-sample distribution may be readily available. Unfortunately, asymptotic approximations can lead to substantial overstatement of the true Type I error rate, and a nontrivial loss of power may be incurred as a consequence. We have illustrated that a simple transformation can be applied to the statistic to improve power and size while maintaining the underlying asymptotic process. Furthermore, it is important to note that although the transformed statistic observes no improvement in power if the true critical value is applied, the specific choice of transformation can have a notable impact on the improvement in power that is observed when critical values based on an asymptotic distribution are used. Specifically, the log transformation proposed here yields a greater power increase than multiplicative transformations, which take the product of the statistic and a constant greater than unity.
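The transformed values reported in the empirical example of Section 4 can be reproduced directly from the quoted Wald statistics. In the sketch below, the sample size n = 23 (the number of launches in the Dalal, Fowlkes, and Hoadley data) is supplied by us rather than quoted in the text above; with it, the computation recovers the reported numbers to the printed precision.

```python
import math

def log_transform(w, n):
    """Transformed Wald statistic W* = -n * log(1 - W/n)."""
    return -n * math.log(1.0 - w / n)

def chi2_sf(x, df):
    """Survival function of chi-squared with df = 1 or 2 (closed forms)."""
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    raise ValueError("only df in {1, 2} handled here")

n = 23  # launches in the pre-Challenger data (our assumption; n is not quoted above)
w_full, w_temp = 5.4027, 4.3224  # Wald values quoted in Section 4
print(round(log_transform(w_full, n), 4), round(chi2_sf(log_transform(w_full, n), 2), 4))
print(round(log_transform(w_temp, n), 4), round(chi2_sf(log_transform(w_temp, n), 1), 4))
```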

Supplementary Material Figure . O-ring failures for  pre-Challenger space shuttle launches as a function of Temperature and Pressure. Note there were two nonfailures at coordinates (, ) Source Code: R-project source code implementing simu- and (, ) and coordinates (, ) and (, ) experience both a failure and non- lations and data analysis presented in this failure, respectively. article. 242 T. J. FISHER AND M. W. ROBBINS

References

Agresti, A. (2015), Foundations of Linear and Generalized Linear Models, New York: Wiley.
Birnbaum, Z. W. (1952), "Numerical Tabulation of the Distribution of Kolmogorov's Statistic for Finite Sample Size," Journal of the American Statistical Association, 47, 425–441.
Box, G. E. P., and Pierce, D. A. (1970), "Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models," Journal of the American Statistical Association, 65, 1509–1526.
Brockwell, P. J., and Davis, R. A. (1991), Time Series: Theory and Methods (2nd ed.), Springer Series in Statistics, New York: Springer-Verlag.
Casella, G., and Berger, R. L. (1990), Statistical Inference, The Wadsworth & Brooks/Cole Statistics/Probability Series, Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
Chen, J., and Gupta, A. K. (2012), Parametric Statistical Change Point Analysis (2nd ed.), Boston, MA: Birkhäuser.
Cox, D., and Snell, E. (1989), Analysis of Binary Data (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC Monographs on Statistics & Applied Probability.
Csörgő, M., and Horváth, L. (1997), Limit Theorems in Change-Point Analysis, with a foreword by David Kendall, Wiley Series in Probability and Statistics, Chichester: Wiley.
Dalal, S. R., Fowlkes, E. B., and Hoadley, B. (1989), "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure," Journal of the American Statistical Association, 84, 945–957.
Durbin, J. (1973), Distribution Theory for Tests Based on the Sample Distribution Function, Philadelphia, PA: Society for Industrial and Applied Mathematics.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1–26.
Fisher, R. A. (1925), Statistical Methods for Research Workers, Edinburgh: Oliver & Boyd.
Fisher, T. J., and Robbins, M. W. (2017), "An Improved Measure for Lack of Fit in Time Series Models," Statistica Sinica, pre-print.
Glass, G. V., Peckham, P. D., and Sanders, J. R. (1972), "Consequences of Failure to Meet Assumptions Underlying the Fixed Effects Analyses of Variance and Covariance," Review of Educational Research, 42, 237–288.
Hald, A. (2007), A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935, Sources and Studies in the History of Mathematics and Physical Sciences, New York: Springer.
Hall, P. (1992), The Bootstrap and Edgeworth Expansion, Springer Series in Statistics, New York: Springer-Verlag.
Hauck, W. W., and Donner, A. (1977), "Wald's Test as Applied to Hypotheses in Logit Analysis," Journal of the American Statistical Association, 72, 851–853.
Haugh, L. D. (1976), "Checking the Independence of Two Covariance-Stationary Time Series: A Univariate Residual Cross-Correlation Approach," Journal of the American Statistical Association, 71, 378–385.
Hirji, K. F., Mehta, C. R., and Patel, N. R. (1987), "Computing Distributions for Exact Logistic Regression," Journal of the American Statistical Association, 82, 1110–1117.
Hollander, M., and Wolfe, D. A. (1999), Nonparametric Statistical Methods, Hoboken, NJ: Wiley.
Jennings, D. E. (1986), "Judging Inference Adequacy in Logistic Regression," Journal of the American Statistical Association, 81, 471–476.
Kendall, M., and Stuart, A. (1977), "The Advanced Theory of Statistics: Inference and Relationship," in Kendall's Advanced Theory of Statistics, London: C. Griffin & Co.
Kirch, C. (2007), "Block Permutation Principles for the Change Analysis of Dependent Data," Journal of Statistical Planning and Inference, 137, 2453–2474.
Künsch, H. R. (1989), "The Jackknife and the Bootstrap for General Stationary Observations," The Annals of Statistics, 17, 1217–1241.
Lehmann, E. L. (1999), Elements of Large-Sample Theory, Springer Texts in Statistics, New York: Springer-Verlag.
Lilliefors, H. W. (1967), "On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown," Journal of the American Statistical Association, 62, 399–402.
Ljung, G. M., and Box, G. E. P. (1978), "On a Measure of Lack of Fit in Time Series Models," Biometrika, 65, 297–303.
Maranzano, C. J., and Krzysztofowicz, R. (2008), "Bayesian Reanalysis of the Challenger O-ring Data," Risk Analysis, 28, 1053–1067.
McCullagh, P., and Nelder, J. (1989), Generalized Linear Models (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC Monographs on Statistics & Applied Probability.
Newey, W. K., and West, K. D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708.
Neyman, J., and Pearson, E. S. (1933), "On the Problem of the Most Efficient Tests of Statistical Hypotheses," Philosophical Transactions of the Royal Society of London, Series A, 231, 289–337.
Page, E. S. (1954), "Continuous Inspection Schemes," Biometrika, 41, 100–115.
——— (1955), "A Test for a Change in a Parameter Occurring at an Unknown Point," Biometrika, 42, 523–527.
Pearson, E. S. (1931), "The Analysis of Variance in Cases of Non-Normal Variation," Biometrika, 23, 114–133.
Pearson, K. (1900), "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it can be Reasonably Supposed to have Arisen from Random Sampling," Philosophical Magazine Series 5, 50, 157–175.
Peña, D., and Rodríguez, J. (2006), "The Log of the Determinant of the Autocorrelation Matrix for Testing Goodness of Fit in Time Series," Journal of Statistical Planning and Inference, 136, 2706–2718.
Politis, D. N., and Romano, J. P. (1994), "The Stationary Bootstrap," Journal of the American Statistical Association, 89, 1303–1313.
Presidential Commission on the Space Shuttle Challenger Accident (1986), Report of the Presidential Commission on the Space Shuttle Challenger Accident (Vols. 1 & 2), Washington, D.C.
Quandt, R. E. (1958), "The Estimation of the Parameters of a Linear Regression System Obeying Two Separate Regimes," Journal of the American Statistical Association, 53, 873–880.
——— (1960), "Tests of the Hypothesis that a Linear Regression System Obeys Two Separate Regimes," Journal of the American Statistical Association, 55, 324–330.
Robbins, M. (2009), Change-Point Analysis: Asymptotic Theory and Applications, Ph.D. thesis, Clemson University.
Robbins, M., Gallagher, C., Lund, R., and Aue, A. (2011), "Mean Shift Testing in Correlated Data," Journal of Time Series Analysis, 32, 498–511.
Robbins, M. W., and Fisher, T. J. (2015), "Cross-Correlation Matrices for Tests of Independence and Causality between Two Multivariate Time Series," Journal of Business & Economic Statistics, 33, 459–473.
Simard, R., and L'Ecuyer, P. (2011), "Computing the Two-Sided Kolmogorov-Smirnov Distribution," Journal of Statistical Software, 39(1), 1–18.
Stigler, S. M. (1986), The History of Statistics: The Measurement of Uncertainty before 1900, Cambridge, MA: Harvard University Press.
——— (1999), Statistics on the Table: The History of Statistical Concepts and Methods, Cambridge, MA: Harvard University Press.
Student (1908), "The Probable Error of a Mean," Biometrika, 6, 1–25.
Tufte, E. R. (1997), Visual Explanations: Images and Quantities, Evidence and Narrative, Cheshire, CT: Graphics Press.