
Resampling Hypothesis Tests for Autocorrelated Fields

D. S. WILKS
Department of Soil, Crop and Atmospheric Sciences, Cornell University, Ithaca, New York
(Manuscript received 5 March 1996, in final form 10 May 1996)

Corresponding author address: Dr. Daniel S. Wilks, Department of Soil, Crop and Atmospheric Sciences, Cornell University, 1123 Bradfield Hall, Ithaca, NY 14853-1901. E-mail: [email protected]

ABSTRACT

Presently employed hypothesis tests for multivariate geophysical data (e.g., climatic fields) require the assumption that either the data are serially uncorrelated, or spatially uncorrelated, or both. Good methods have been developed to deal with temporal correlation, but generalization of these methods to multivariate problems involving spatial correlation has been problematic, particularly when (as is often the case) sample sizes are small relative to the dimension of the data vectors. Spatial correlation has been handled successfully by resampling methods when the temporal correlation can be neglected, at least according to the null hypothesis. This paper describes the construction of resampling tests for differences of means that account simultaneously for temporal and spatial correlation. First, univariate tests are derived that respect temporal correlation in the data, using the relatively new concept of "moving blocks" bootstrap resampling. These tests perform accurately for small samples and are nearly as powerful as existing alternatives. Simultaneous application of these univariate resampling tests to elements of data vectors (or fields) yields a powerful (i.e., sensitive) multivariate test in which the cross correlation between elements of the data vectors is successfully captured by the resampling, rather than through explicit modeling and estimation.

1. Introduction

It is often of interest to compare two datasets for differences with respect to particular attributes. For example, one might want to investigate whether the mean value under one set of conditions is different than the mean in some other circumstance. In order to make fair and reliable comparisons, it is necessary to account for sampling variability in this process. That is, even if there is no difference with respect to an attribute of interest (e.g., the mean) in the generating processes for two batches of data, the two sample estimates of the quantity (the pair of sample means) will rarely be exactly equal. Standard statistical tests (e.g., the familiar t test) have been devised to discern what magnitude of difference, in relation to the variability evident in the data samples, can be declared with high confidence to be real (i.e., statistically significant). Statistical tests have most commonly been applied to atmospheric datasets in the context of climate studies (e.g., Livezey 1985), although their range of potential applicability is much broader (e.g., Daley and Chervin 1985).

Two complications arise when attempting to apply standard hypothesis tests to climatic or other geophysical data. These are illustrated schematically in the middle portion of Fig. 1. First, standard tests, including the t test, rest on the assumption that the underlying data are composed of independent samples from their parent populations. Very often, atmospheric and other geophysical data do not satisfy this assumption even approximately. In particular, such data typically exhibit serial (or auto-) correlation. The effect of this autocorrelation on the sampling distribution of the mean is to increase its variance above the level that would be inferred under the assumption of independence. That is, sample means are less consistent from batch to batch than would be the case for independent data. The estimated variance of the sampling distribution of the mean appears in the denominator of the test statistic for the t test:

    t = \frac{\bar{x} - \mu_0}{[\mathrm{var}(\bar{x})]^{1/2}} .    (1)

Underestimation of the variance of the sample mean thus inflates the standard t statistic, leading to unwarranted rejections of the null hypothesis (e.g., Cressie 1980; Wilks 1995). However, a number of univariate (i.e., scalar) tests have been developed that perform properly, even when the data upon which they operate exhibit serial correlation (Albers 1978; Chervin and Schneider 1976; Jones 1975; Katz 1982; Zwiers and Thiébaux 1987; Zwiers and von Storch 1995). One approach, which is successful when the sample size n is sufficiently large, is to estimate the variance of the sampling distribution of the mean as

© 1997 American Meteorological Society


FIG. 1. Conceptual illustration of the relationships between conventional univariate hypothesis tests (top), univariate tests modified to account for temporal correlation in the data (middle left), multivariate tests assuming serial independence (middle right), and tests respecting both types of correlations (bottom) that are the subject of this paper.

    \mathrm{var}(\bar{x}) = V \, \frac{s_x^2}{n} ,    (2)

where s_x^2 is the sample variance of the data and V is the "variance inflation factor," which depends on the autocorrelation in the data according to

    V = 1 + 2 \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) r_k .    (3)

Here, the r_k are estimates of the autocorrelations at lags k. For independent data, r_k = 0 for k ≠ 0, in which case V = 1 and the conventional t test is recovered in (1). The variance inflation factor is sometimes called the "time between effectively independent samples" (Leith 1973; Trenberth 1984), or the "decorrelation time," although this terminology is only really applicable to (3) for problems involving estimation of the mean (Livezey 1995; Thiébaux and Zwiers 1984; von Storch 1995; Zwiers and von Storch 1995). Similarly, (3) is sometimes used to define the "effective sample size"

    n_e = \frac{n}{V} ,    (4)

which also pertains only to inferences concerning means.
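Equations (2)-(4) are simple to evaluate directly. The following Python/NumPy sketch (the function names are illustrative and not from the paper) computes V by substituting all n - 1 sample autocorrelations into (3), together with the corresponding inflated variance of the mean and the effective sample size. As noted in section 2b, this direct estimate of V can be erratic for short series, so the fragment is intended only to make the definitions concrete.

import numpy as np

def variance_inflation_factor(x):
    """Variance inflation factor V of Eq. (3), using all n - 1 sample autocorrelations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    V = 1.0
    for k in range(1, n):
        r_k = np.sum(xc[:-k] * xc[k:]) / denom   # lag-k sample autocorrelation
        V += 2.0 * (1.0 - k / n) * r_k
    return V

def inflated_variance_of_mean(x):
    """var(xbar) = V * s_x^2 / n, Eq. (2)."""
    x = np.asarray(x, dtype=float)
    return variance_inflation_factor(x) * x.var(ddof=1) / len(x)

def effective_sample_size(x):
    """Effective sample size n_e = n / V, Eq. (4)."""
    return len(x) / variance_inflation_factor(x)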

The second complication indicated in Fig. 1 is that the data of interest may be vector valued, or multivariate. If the elements of the data vectors are mutually independent, it is straightforward to evaluate jointly the results of a collection of scalar hypothesis tests (the "field significance") using the binomial distribution (e.g., Livezey and Chen 1983; von Storch 1982).

However, for problems of practical interest, it is often the case that the data exhibit substantial cross correlation. For example, it might be of interest to test whether the means of two sets of atmospheric fields (these could be values at a spatial array of points, or expansion coefficients for the data projected onto a different basis, such as selected Fourier harmonics) are significantly different between two samples of these fields. If the data are cross correlated but temporally independent, the classical Hotelling test [the multivariate analog of the simple t test, e.g., Johnson and Wichern (1982)] or related tests (e.g., Hasselmann 1979) can be used, provided the number of samples is much larger than the dimension of the vector data.

Alternatively, resampling tests (Chung and Fraser 1958; Livezey and Chen 1983; Mielke et al. 1981; Preisendorfer and Barnett 1983; Wilks 1996; Zwiers 1987) can be employed. These tests involve, in one form or another, developing sampling distributions for a test statistic (e.g., the length of the difference between vector means, according to a particular norm) by repeated random reordering of the data vectors (fields). This procedure is attractive because the resampling procedure captures the cross correlation between elements of the data vectors without requiring that the correlation structure be explicitly modeled and estimated. However, conventional resampling procedures, which repeatedly rearrange the time ordering of individual data points, necessarily destroy important information regarding the temporal correlation that may be present in the data (e.g., Zwiers 1990). Such tests are thus applicable only to situations where either the temporal correlation is negligible, or temporal independence is implied by the null hypothesis.

Multivariate geophysical data exhibiting both time and space correlations, indicated in the lower portion of Fig. 1, are often of interest. In principle, multivariate parametric tests that account for serial correlation could be constructed by extending univariate tests that include serial correlation (Kabaila and Nelson 1985), or through likelihood methods (Hillmer and Tiao 1979). However, either of these approaches would require estimation of a prohibitively large number of parameters for many problems. Although some suggestions for dealing simultaneously with auto- and cross correlation in resampling tests applied to multivariate data have been offered by Livezey (1995), an adequate resampling test respecting both the temporal and spatial correlation typically found in atmospheric data has not previously been devised. This paper develops such a test for the difference of vector means, using a relatively new idea called the "moving blocks" bootstrap, which is a resampling procedure that can capture time dependence in autocorrelated data. The simpler univariate (i.e., for scalar series) bootstrap tests for differences of the mean are developed in section 2. In section 3 these tests are extended to multivariate data (i.e., vector series), where the resampling procedure captures the effects of cross correlation on the test statistic without the need to explicitly model those correlations. Section 4 summarizes and concludes with suggestions for future directions.

2. Univariate tests

a. Bootstrap tests and the moving blocks bootstrap

The bootstrap is a relatively recent computer-based statistical technique (Efron 1982; Efron and Tibshirani 1993). The philosophical basis of the bootstrap is the idea that the sampling characteristics (i.e., batch-to-batch variations exhibited by different samples from a parent population) of a statistic of interest can be simulated by repeatedly treating a single available batch of data in a way that mimics the process of sampling from the parent population. As a nonparametric approach, it is often useful when the validity of assumptions underlying more traditional theoretical approaches is questionable and when the more traditional approaches are either unavailable or intractable.

This procedure is used most frequently to estimate sampling distributions, or particular aspects of sampling distributions, such as standard errors and confidence intervals. Although computationally intensive, the procedure is algorithmically simple. The bootstrap approximation to the sampling distribution of a statistic of interest is constructed by repeatedly resampling the available data with replacement to yield multiple synthetic samples of the same size as the original set of observations. For example, consider an arbitrary statistic S(x), which is some function of a sample of n data values x = {x_1, x_2, x_3, ..., x_n}. A bootstrap sample from x, say x*, is constructed by treating x as nearly as possible as if it were the full parent population. Each bootstrap sample consists of n values drawn independently and with replacement from x and in general will contain multiple copies of some of the original data values while omitting others. Some large number n_B of bootstrap samples is drawn, and the statistic of interest, S(x*), is computed for each. Remarkably, it has been found that the distribution of the resulting n_B values of S(x*) is often a reasonable representation of the sampling distribution of S(x). Consequently, one can, for example, estimate a confidence interval for S(x) on the basis of the dispersion of the distribution of S(x*).

There is a close relationship between confidence intervals and hypothesis tests. In particular, bootstrap confidence intervals can be used as the basis for nonparametric hypothesis tests. If a bootstrap resampling procedure is designed in a way that is consistent with a specified null hypothesis regarding a statistic of interest, the central (1 − α)100% confidence interval for that statistic, as estimated through its bootstrap distribution, comprises the acceptance region for the corresponding two-sided hypothesis test. When the observed statistic lies outside of this interval, the null hypothesis can be rejected at the α100% level. A corresponding one-sided test would be significant at the (α/2)100% level.

An alternative approach to the construction of nonparametric resampling hypothesis tests is through permutation procedures (e.g., Mielke et al. 1981; Preisendorfer and Barnett 1983). A two-sample permutation test operates by assuming that, under the null hypothesis, the two batches of data at hand have been drawn from the same parent population. Accordingly, their labelling as belonging to one batch or the other is arbitrary, and all the data can be pooled into a single body from which pairs of synthetic samples can be drawn repeatedly, without replacement. This process leads, in a manner analogous to the bootstrap procedure, to the construction of a synthetic sampling distribution for the statistic of interest. A limitation of permutation tests is the implication that the null hypothesis specifies that the underlying distributions for the two data samples are the same in all respects. By contrast, bootstrap tests can investigate much more focused null hypotheses. In the bootstrap tests described below, the null hypothesis will be that pairs of means (only) are equal, without assuming equality of variances, autocorrelations, or other aspects of the joint distributions of the data.

In their basic forms, neither bootstrap nor permutation procedures are appropriate tools for resampling time series or other autocorrelated data because independent resampling destroys the correlation structure. For positively correlated time series, a hypothesis test based on the conventional bootstrap would be a nonparametric analog of the t test (1), but with independent bootstrap resampling effectively yielding V = 1 in (2). As is the case for the t test, the resulting bootstrap test would be permissive: the null hypothesis would be rejected more easily and frequently than warranted. One approach to modifying the bootstrap to account for correlated data is to use a fitted parametric model (e.g., an autoregressive scheme for time series data) to produce approximately uncorrelated residuals, bootstrap these "prewhitened" values, and then reconstruct simulated data series by inverting the parametric model (Efron and Tibshirani 1993; Solow 1985). However, extending this approach to multivariate settings would be problematic.

A nonparametric alternative for bootstrapping correlated data is the recently proposed moving blocks bootstrap (Künsch 1989; Liu and Singh 1992). This technique preserves much of the correlation structure in bootstrapped time series by resampling "blocks," or sets of fixed length of consecutive data values, rather than single data points. Figure 2 illustrates the construction of a single bootstrap sample from a data record of length n = 12, using blocks of length l = 4. The basic set of objects operated upon by the moving blocks bootstrap consists of the n − l + 1 distinct blocks of length l, rather than the n individual data values in x. Defining b as the number of blocks comprising each bootstrap sample (b = 3 in Fig. 2), the number of distinct bootstrap samples that can be drawn from a given time series of length n, assuming all the possible blocks are distinct, is (n − l + 1)^b. For l = 1 the moving blocks bootstrap reduces to the conventional bootstrap.

FIG. 2. Schematic illustration of the moving blocks bootstrap. Beginning with a time series of length n = 12 (above), b = 3 blocks of length l = 4 are drawn with replacement. The resulting time series (below) is one of (n − l + 1)^b = 729 equally likely bootstrap samples.

An outstanding problem for implementation of the moving blocks bootstrap is the selection of the block length l. Limited theoretical work on block length selection done to date has indicated that as n → ∞, the block length l should also tend to infinity, but slowly enough that l/n → 0 (Künsch 1989; Liu and Singh 1992). In addition, it is expected that, if other factors are equal, time series exhibiting stronger autocorrelation will require longer block lengths to capture adequately the effects of the dependence (Carlstein 1986). However, as a practical matter, choices for l in applied work have been ad hoc and qualitative (Leger et al. 1992). Prescriptions offered below for the choice of l have been developed specifically for the problem of testing sample means of time series.
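As a concrete illustration of the resampling scheme sketched in Fig. 2, the following minimal Python/NumPy fragment draws a single moving-blocks bootstrap sample. The function name is mine, and the choice of b as the nearest integer to n/l anticipates the prescription given below in section 2b; this is a sketch of the procedure rather than the author's code.

import numpy as np

def moving_blocks_sample(x, l, rng):
    """Draw one moving-blocks bootstrap sample from the series x.

    Blocks of length l are drawn with replacement from the n - l + 1
    overlapping blocks of x and concatenated (cf. Fig. 2).  The number of
    blocks b is taken as the nearest integer to n/l, so the resampled
    series has length n' = b*l, which may differ slightly from n.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = max(1, round(n / l))                       # number of blocks per bootstrap sample
    starts = rng.integers(0, n - l + 1, size=b)    # block starting positions
    return np.concatenate([x[s:s + l] for s in starts])

# Example corresponding to the schematic of Fig. 2: n = 12, l = 4, b = 3.
rng = np.random.default_rng(0)
series = np.arange(12, dtype=float)
print(moving_blocks_sample(series, l=4, rng=rng))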

b. Bootstrap test for AR(1) data

One of the simplest possible time series models is the first-order autoregressive [AR(1)] process (e.g., Box and Jenkins 1976), defined by

    x'_t = \phi \, x'_{t-1} + \epsilon_t .    (5)

Here the primes indicate "anomalies" constructed by subtraction of the mean (i.e., x'_t = x_t − μ, with μ = E[x]), φ is the autoregressive parameter (which is equal to the population value of the lag-1 autocorrelation), and the ε_t series consists of independent random variates with expectation E[ε] = 0 and variance σ_ε². Often it is assumed that the ε_t values follow a Gaussian distribution. Zwiers and von Storch (1995) have suggested that this simple model is a reasonable approximation to the behavior of many atmospheric variables not exhibiting quasi-periodic behavior, in that its spectrum is maximum at zero frequency and decreases monotonically for higher frequencies. Accordingly, they devised a test (hereafter the ZvS test), based on simulations using (5), for inferences concerning means of time series. For AR(1) data the ZvS test yields accurate results for smaller samples than does the test based on (1)-(3) or its two-sample counterpart.

The tests described here will be based on the modified one-sample t-test statistic (1)-(3), or the corresponding two-sample test described below. Both of these require estimation of the variance inflation factor V (3). Direct substitution of sample estimates of the n − 1 sample autocorrelations into (3) leads to erratic results for V (Thiébaux and Zwiers 1984). More stable estimates for V can be obtained by fitting a time series model to the data series x and using the properties of the time series model to extrapolate to lagged correlations beyond the first few (Katz 1982; Thiébaux and Zwiers 1984). When assuming an AR(1) model for the data, only the lag-1 autocorrelation needs to be directly estimated from the data:

    r_1 = \frac{\sum_{t=1}^{n-1} (x_t - \bar{x})(x_{t+1} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2} .    (6)

Then, the properties of the AR(1) process are exploited to compute

    r_k = r_1^{\,k} ,    (7)

which are used for the remaining n − 2 correlation estimates in (3).

Although they are more stable (i.e., lower variance), estimates of V thus obtained are still substantially biased for samples that are not large. The nature of this bias is that the estimates are on average too low, and the tendency for the test in (1)-(3) to be permissive for small samples (e.g., Thiébaux and Zwiers 1984; Zwiers and von Storch 1995) may be attributable in part to this bias. For the variance estimate in (2) to be compatible with the corresponding bootstrap estimates developed below, it is necessary to correct the bias in the estimation of V. Regressions based on equations specifying the logarithm of the ratio of the (known) variance inflation factor to the (biased) V (3), for synthetic series from a variety of AR(1) models and sample sizes, lead to the adjusted variance inflation factor

    V' = V \exp\!\left(\frac{2V}{n}\right) .    (8)

These V' approach V for large n and are very nearly unbiased for all but very small sample sizes in combination with very strong correlations (i.e., n_e < 2), for which none of the available tests operate correctly.

The methods and results here will be presented mainly for two-sample tests for differences of mean. In this setting, there are two mutually independent time series, x_1 and x_2, of lengths n_1 and n_2, respectively, to be compared. However, the methods and most of the results are also valid for, and apply in an obvious way to, the corresponding one-sample tests, the statistic for which is given in (1). One-sample tests are appropriate for problems where one is interested in whether the mean of a single series could be a particular value, perhaps zero. For example, if the available data consisted of two paired (and possibly mutually correlated) series y and z, equality of means for the two could be investigated through a one-sample test operating on x = y − z, whose null hypothesis would be that the mean μ_x = 0.

The basic statistic of interest for the two-sample tests developed below is the difference between the two sample means,

    S(\mathbf{x}_1, \mathbf{x}_2) = \bar{x}_1 - \bar{x}_2 .    (9)

Usually, the null hypothesis (H_0) will be that the means of the two populations from which the samples were drawn are equal. That is,

    S_0 = \mu_1 - \mu_2 = 0 ,    (10)

although investigation of other null hypotheses presents no additional difficulty. Regardless of the specific H_0 of interest, the test statistic used to investigate the plausibility of that null hypothesis will be

    d = \frac{S(\mathbf{x}_1, \mathbf{x}_2) - S_0}{\hat{\sigma}_S} ,    (11)

in which the denominator is the square root of the estimated sampling variance of S(x_1, x_2),

    \hat{\sigma}_S = [\mathrm{var}(\bar{x}_1) + \mathrm{var}(\bar{x}_2)]^{1/2} = \left( V'_1 \frac{s_{x_1}^2}{n_1} + V'_2 \frac{s_{x_2}^2}{n_2} \right)^{1/2} .    (12)

Equation (11) is the two-sample version of (1). Except for the bias-corrected variance-inflation factors, this is the same test statistic used by Katz (1982) and is the basis of the two-sample "usual" test in the terminology of Zwiers and von Storch (1995). Asymptotically (i.e., for sufficiently large sample size), the test statistic d in (11) follows the standard Gaussian distribution when H_0 is true. For the nonparametric tests derived in this section, the form of the distribution of d is unimportant, but the point of scaling the test statistic (11) by the standard deviation in (12) is to improve the accuracy of the resulting tests, by removing the dependence of its distribution on unknown quantities (e.g., Hall and Wilson 1991).

The bootstrap test for the difference of means is based on evaluation of the unusualness of the observed S(x_1, x_2), in terms of the test statistic d, in the context of an estimated sampling distribution for d derived by bootstrapping x_1 and x_2. The moving blocks bootstrap is applied repeatedly to produce n_B realizations of the bootstrap analog of (11),

    d^* = \frac{S(\mathbf{x}_1^*, \mathbf{x}_2^*) - S(\mathbf{x}_1, \mathbf{x}_2)}{\hat{\sigma}_S^*} .    (13)


For a two-tailed test, the null hypothesis is then rejected at the α100% level if the observed value of d would place in the most extreme α100% of the distribution of d*. That is, the test rejects H_0 if

    \frac{\sum_{i=1}^{n_B} I(d_i^* \ge d)}{n_B + 1} = \frac{\#\{d^* \ge d\}}{n_B + 1} \le \frac{\alpha}{2}    (14a)

for the upper tail, or

    \frac{\sum_{i=1}^{n_B} I(d_i^* \le d)}{n_B + 1} = \frac{\#\{d^* \le d\}}{n_B + 1} \le \frac{\alpha}{2}    (14b)

for the lower tail. Here, I(·) is the indicator function, which equals 1 if its argument is true and is zero otherwise. For example, the results presented below will be based on the sampling distribution of d being estimated using n_B = 1999 bootstrap realizations of d*. These two-tailed tests will then reject H_0 at the 5% level either if the observed value of d is within the range of or more extreme than the largest 50, or the smallest 50, of these d* values. The corresponding one-tailed test would reject H_0 at the (α/2)100% level if either (14a) or (14b), as appropriate, were satisfied. Equations (14a) and (14b) constitute the simple "percentile method" (Efron and Tibshirani 1993) for bootstrap confidence intervals. More elaborate refinements to the percentile method exist (Efron 1987; Efron and Tibshirani 1993), but these were not found to materially improve the performance of the difference-of-mean tests described here.

The difference of bootstrap means S(x_1*, x_2*) is compared in the numerator of (13) to the original statistic S(x_1, x_2), not to S_0, which is the corresponding term in (11). This substitution is necessary to ensure that the resulting realizations of d* reflect H_0, considering that, from the perspective of the bootstrap, the samples x_1 and x_2 constitute the population, and the difference of their sample means S(x_1, x_2) is the bootstrap model for the population counterpart S_0 (10). The resulting bootstrap distribution of d* will be centered near zero. Failure to formulate d* in this way leads to tests of low power (i.e., poor sensitivity to violations of H_0). Further discussion on this point can be found in Hall and Wilson (1991).

In (13), the term S(x_1*, x_2*) is computed by separately applying the moving blocks bootstrap to x_1 and x_2, and then computing and differencing the sample means of the resulting pair of bootstrap samples. Note that the two samples x_1 and x_2 are not mixed in the resampling process. Maintaining the separateness of the two samples in the resampling is an important difference between the present test and, for example, the "BP" bootstrap test described by Zwiers (1987). A practical consequence is that the null hypothesis can pertain exclusively to the means of the two samples and that equality of other aspects of the two distributions need not be assumed.

The denominator in (13) is a resampling counterpart of the denominator in (11). However, application of a second pass of moving blocks bootstrapping to the bootstrap samples x_1* and x_2* would introduce further rifts in the time sequences of the original series x_1 and x_2. The approach taken here is to estimate the sampling variability of S(x_1*, x_2*) through the jackknife variance estimates for x_1* and x_2* (e.g., Efron 1982; Efron and Tibshirani 1993), but to compute them by operating on the bootstrap blocks rather than on the individual data values. That is, the denominator in (13) is computed using jackknife-after-bootstrap variance estimates from the two bootstrap samples,

    \hat{\sigma}_S^* = \left(\hat{\sigma}_{\mathrm{JAB},1}^2 + \hat{\sigma}_{\mathrm{JAB},2}^2\right)^{1/2} .    (15)

The two jackknife-after-bootstrap variance estimates are based on b sample means derived from the bootstrap samples x*, each computed by leaving out a different one of the b data blocks. Define

    \bar{x}^*_{(i)} = \frac{\Sigma x^*_{(1)} + \Sigma x^*_{(2)} + \cdots + \Sigma x^*_{(i-1)} + \Sigma x^*_{(i+1)} + \cdots + \Sigma x^*_{(b)}}{n' - l}    (16)

as the ith of these b averages, where Σx*_{(i)} is the sum of the data values over the ith block and n' = bl is the length of the full bootstrap sample x*. (It may happen that n' ≠ n if b and l do not divide n evenly.) Define also

    \bar{x}^*_{(\cdot)} = \frac{1}{b} \sum_{i=1}^{b} \bar{x}^*_{(i)}    (17)

as the average of these averages. Then the jackknife-after-bootstrap estimate of the variance of the mean can be written as

    \hat{\sigma}_{\mathrm{JAB}}^2 = \left(\frac{n'}{n}\right) \left(\frac{b-1}{b}\right) \sum_{i=1}^{b} \left( \bar{x}^*_{(i)} - \bar{x}^*_{(\cdot)} \right)^2 ,    (18)

in which the factor (n'/n) is included for cases where n' ≠ n. As a practical consideration, particularly for small b, it sometimes happens that all the blocks drawn in a particular bootstrap sample are the same. If so, this variance estimate is zero, and a new x* must be drawn for the test to proceed.

It remains to specify the block length l. A practical requirement for the rule used to choose the block length is that it must depend only on sample statistics and not on unknown population quantities. If l is too small, the resulting test will be permissive (H_0 rejected too frequently), with the limit of l = 1 corresponding to the test that ignores the serial correlation altogether (cf. Zwiers 1990). It is also found that the test is stringent (H_0 is rejected too rarely) if l is too large. A workable rule for choice of the block length was developed here by trial and error evaluation of test results for synthetic AR(1) series in cases where H_0 was true, respecting the constraints mentioned in the last paragraph of section 2a. This process led to the rule: select the largest integer no greater than

    l = (n - l + 1)^{(2/3)(1 - 1/V')} ,    (19)

which must be evaluated iteratively, but converges quickly (a reasonable initial guess for l is √n). The specified block length thus increases with increasing sample size, and for a given sample size is larger for larger values of the bias-adjusted variance inflation factor V'. Having chosen the block length, the number of blocks b in the bootstrap sample is determined as the nearest integer to n/l. The two samples x_1 and x_2 may yield different values for b, l, and n' = bl.
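Gathering the pieces presented so far, a compact Python/NumPy sketch of the two-sample AR(1) bootstrap test might look as follows. It implements Eqs. (6)-(8) for the bias-adjusted variance inflation factor, the block-length rule (19), the jackknife-after-bootstrap denominator (15)-(18), the bootstrap statistic (13), and the percentile-method decision (14). Function names, the fixed-point iteration for (19), and the handling of the rare zero-variance redraw are my own choices; this is a sketch of my reading of the procedure, not the author's code.

import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation r1, Eq. (6)."""
    xc = x - x.mean()
    return float(np.sum(xc[:-1] * xc[1:]) / np.sum(xc ** 2))

def ar1_vprime(x):
    """Bias-adjusted variance inflation factor V' for AR(1) data, Eqs. (3), (7), (8)."""
    n, r1 = len(x), lag1_autocorr(x)
    k = np.arange(1, n)
    V = 1.0 + 2.0 * np.sum((1.0 - k / n) * r1 ** k)
    return float(V * np.exp(2.0 * V / n))

def ar1_block_length(x):
    """Block-length rule of Eq. (19), evaluated by fixed-point iteration."""
    n, vp = len(x), ar1_vprime(x)
    expo = (2.0 / 3.0) * (1.0 - 1.0 / vp)
    l = np.sqrt(n)                           # initial guess suggested in the text
    for _ in range(100):
        l = (n - l + 1.0) ** expo
    return max(1, int(l))

def moving_blocks(x, l, b, rng):
    """Draw b moving blocks of length l, with replacement."""
    starts = rng.integers(0, len(x) - l + 1, size=b)
    return [x[s:s + l] for s in starts]

def jab_variance(blocks, n):
    """Jackknife-after-bootstrap variance of the mean, Eqs. (16)-(18)."""
    b, l = len(blocks), len(blocks[0])
    nprime = b * l
    block_sums = np.array([blk.sum() for blk in blocks])
    xbar_i = (block_sums.sum() - block_sums) / (nprime - l)        # Eq. (16)
    return (nprime / n) * ((b - 1) / b) * np.sum((xbar_i - xbar_i.mean()) ** 2)

def ar1_bootstrap_test(x1, x2, nB=1999, alpha=0.05, seed=0):
    """Two-sample, two-sided moving-blocks bootstrap test for equal means, AR(1) version."""
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    # Observed statistic d of Eq. (11), with S0 = 0 and the denominator of Eq. (12).
    s_obs = x1.mean() - x2.mean()
    sigma = np.sqrt(ar1_vprime(x1) * x1.var(ddof=1) / len(x1) +
                    ar1_vprime(x2) * x2.var(ddof=1) / len(x2))
    d = s_obs / sigma
    l1, l2 = ar1_block_length(x1), ar1_block_length(x2)
    b1, b2 = max(1, round(len(x1) / l1)), max(1, round(len(x2) / l2))
    d_star = np.empty(nB)
    for i in range(nB):
        while True:                          # redraw if every block in a sample is identical
            blocks1 = moving_blocks(x1, l1, b1, rng)
            blocks2 = moving_blocks(x2, l2, b2, rng)
            var_star = jab_variance(blocks1, len(x1)) + jab_variance(blocks2, len(x2))
            if var_star > 0.0:
                break
        s_star = np.concatenate(blocks1).mean() - np.concatenate(blocks2).mean()
        d_star[i] = (s_star - s_obs) / np.sqrt(var_star)            # Eq. (13)
    # Percentile-method decision, Eqs. (14a) and (14b).
    upper = np.sum(d_star >= d) / (nB + 1)
    lower = np.sum(d_star <= d) / (nB + 1)
    return d, (upper <= alpha / 2.0) or (lower <= alpha / 2.0)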


FIG. 3. Achieved significance levels for two-sided, two-sample AR(1) tests using sample sizes ranging from n = 8 to n = 1024 and autoregressive parameter φ = 0.1, 0.3, 0.5, 0.7, and 0.9. Shaded region indicates the 95% confidence interval around the nominal significance level of 5%.

FIG. 4. Comparison of power functions for two-sample tests for differences of mean, using AR(1) data and conducted at the 5% level. Horizontal axis is difference in population means between the two samples, standardized by the square root of the white-noise variance.

The criterion used for the choice of (19) for specification of l was that the resulting tests would exhibit accurate rejection probabilities for the smallest possible effective sample sizes n_e, without compromising test performance for large n_e. Figure 3 shows the achieved significance levels (estimated probability of rejecting H_0 when it is true) for the resulting tests, for selected positive values of the autoregressive parameter and for sample sizes ranging from n = 8 to n = 1024. These probabilities are based on two-sided tests at the 5% level, estimated as relative frequencies with which H_0 was rejected among tests on 2000 replications of pairs of AR(1) series having equal means. These results are for synthetic series independent of those used to arrive at (19). The shaded region in Fig. 3 indicates the 95% confidence interval around the nominal significance level of 5% for this number of replications. The figure indicates that the test operates correctly when n_e is greater than approximately 10 [or greater than about 8, according to (8)], which compares favorably to the ZvS test and to the tests described in Zwiers and Thiébaux (1987). For cases with inadequate sample size, the tests are uniformly permissive.

Equation (19) for the block length is also valid for the corresponding one-sample test of the mean. In that case, the statistic S(x) is simply the sample mean of the single sample x, the test statistic d is given by (1) [using also (2) and (3)], and the denominator in (13) is the square root of the jackknife-after-bootstrap variance in (18).

In addition to accurate rejection probabilities when H_0 is true, one is interested in the power, or ability to detect violations of H_0, of these bootstrap tests. The power of a hypothesis test generally increases with the magnitude of the violation of H_0, and the probability with which H_0 will be rejected, as a function of the (true) magnitude of that violation, is known as the power function for that test. Figure 4 compares power functions for the AR(1) bootstrap tests described in this section (light solid curves) to those for the ZvS test (light dashed curves) for selected sample sizes and values of the autoregressive parameter. The power of the bootstrap tests is generally slightly less than, but often indistinguishable from, that of the ZvS test. Zwiers and von Storch (1995) show that the power of the ZvS test is nearly as great as the corresponding likelihood ratio test, which provides a theoretical maximum on test power.

Usually it is assumed that the white-noise series ε_t in (5) consists of independent Gaussian variates. There is no guarantee that real data will be Gaussian, however, so that the robustness of the bootstrap test to violations of this assumption is also of interest. Figure 5 shows achieved significance levels, again estimated from the results of 2000 tests of pairs of series for which H_0 is true, generated using φ = 0.5 and independent white noise following uniform, Gaussian, exponential, and lognormal distributions with E[ε] = 0. The insets show histograms of samples of the resulting x values. As in Fig. 3, the shaded region indicates the 95% confidence band around the nominal significance level of 5%. The results for Gaussian variates are the same as those for φ = 0.5 in Fig. 3. Figure 5 shows that the test performance is even better (i.e., the test is accurate for smaller n) for autoregressive series forced with uniform variates. While somewhat larger sample sizes are required for the test to operate properly when the ε_t values are strongly skewed (exponential and lognormal distributions), the two-sample test is quite robust overall to these violations of the Gaussian assumption. The two-sample ZvS test (results not shown) is similarly robust to these deviations from Gaussian data. Overall, achieved significance levels are comparable to those in Fig. 5 for the corresponding one-sample bootstrap tests, but for the strongly skewed time series the resulting bootstrap distributions are asymmetric, and the probabilities of rejection are higher for the left tail than for the right tail.

FIG. 5. Achieved significance levels for two-sided, two-sample AR(1) tests using sample sizes ranging from n = 8 to n = 1024 and autoregressive parameter φ = 0.5 for data series forced with white noise following four different distributions: uniform (U), Gaussian (G), exponential (E), and lognormal (LN). Insets show histograms of samples of the resulting series values. Shaded region indicates the 95% confidence interval around the nominal significance level of 5%.

Finally, it is of interest to investigate the robustness of this test to violations of the assumed AR(1) time structure of the data. Zwiers and von Storch (1995) speculated that, while atmospheric and other geophysical time series do not necessarily follow an AR(1) model, data not clearly influenced by quasi-periodic processes such as ENSO or the quasi-biennial oscillation have power spectra sufficiently similar to the AR(1) model that the ZvS test should be applicable. Figure 6 shows achieved significance levels for two-sample, two-tailed bootstrap tests, which assume AR(1) time dependence, operating on data with more complicated autocorrelation structures. The results in Fig. 6a are for data generated according to the AR(2) model (section 2c) over that portion of the parameter space corresponding to stationary series with positive lag-1 autocorrelation. Here, the tests operate correctly only for generating processes that are very close to the AR(1), that is, for very small |φ_2|. The tests are permissive for φ_2 > 0 and stringent for φ_2 < 0, with very large deviations from accurate test performance evident for φ_2 far from zero. For φ_2 more negative than about −2/3, none of the 2000 test replications rejected H_0. Corresponding results (Fig. 6b) for the autoregressive-moving average [ARMA(1,1)] generating process (section 2d) are only slightly better, with accurate rejection probabilities only for small |θ_1|, although the wild deviations from the nominal significance level seen in Fig. 6a are absent. Neither of the results in these two panels is improved by increasing the sample size. The corresponding results for the ZvS test are indistinguishable from those shown in Fig. 6.

FIG. 6. Achieved significance levels for two-sided, two-sample AR(1) tests when applied to (a) AR(2) and (b) ARMA(1,1) data series for which r_1 > 0, with n = 128. Shading indicates approximate regions in the respective parameter spaces for which these are not different from the nominal test level of 5% with 95% confidence. Increasing sample size does not improve test performances. Corresponding results for the ZvS test are indistinguishable.

c. Bootstrap test for AR(2) data

The second-order autoregressive [AR(2)] model is an extension of the AR(1) process in (5) to dependence on two time lags. It is defined by the equation

    x'_t = \phi_1 x'_{t-1} + \phi_2 x'_{t-2} + \epsilon_t .    (20)

The AR(2) model is considerably more flexible than the AR(1), which is obtained as a special case of (20) when φ_2 = 0. This more general model can exhibit autocorrelation functions decreasing either more quickly or less quickly than the exponential decay of the AR(1) process given in (7) and also has the capacity to exhibit pseudoperiodic behavior (e.g., Box and Jenkins 1976; Wilks 1995; Zwiers 1990).

Figure 6a shows very strongly that the AR(1) test described in the previous section (as well as the ZvS test) is not robust to the more general case of AR(2) data, either (as expected) for regions of the parameter space where pseudoperiodicities are produced or for the many parameter combinations yielding spectra that are monotonically decreasing functions of frequency. This section describes a bootstrap test analogous to that in section 2b, but that operates correctly when applied to data generated by the more general AR(2) model.

Computing the test statistic d proceeds as in section 2b, with the following modifications. To estimate the variance inflation factor [(3)] for AR(2) data, the first two sample autocorrelations are required. Equation (6) is used for the first of these. The sample lag-2 autocorrelation is estimated here using

    r_2 = \frac{\sum_{t=1}^{n-2} (x_t - \bar{x}_-)(x_{t+2} - \bar{x}_+)}{\left[ \sum_{t=1}^{n-2} (x_t - \bar{x}_-)^2 \; \sum_{t=3}^{n} (x_t - \bar{x}_+)^2 \right]^{1/2}} ,    (21)

which gives more stable results for small samples than does the simple extension of the form of (6) to two lags. In (21), the subscripts "−" and "+" denote sample means over the first and last n − 2 series values, respectively. The properties of the AR(2) process are then used to estimate the remaining n − 3 autocorrelations in (3), according to

    r_k = \hat{\phi}_1 r_{k-1} + \hat{\phi}_2 r_{k-2}, \quad k \ge 3 ,    (22a)

where

    \hat{\phi}_1 = \frac{r_1 (1 - r_2)}{1 - r_1^2}    (22b)

and

    \hat{\phi}_2 = \frac{r_2 - r_1^2}{1 - r_1^2} .    (22c)

As is the case for the AR(1) test, the values of V thus obtained are biased. Empirical analysis of samples of synthetic AR(2) series, of the same kind used to develop (8), leads to the bias-adjusted variance inflation factor

    V' = V \exp\!\left(\frac{3V}{n}\right) ,    (23)

which again is very nearly unbiased except for very small n and strong time dependence.

Bootstrapping for the AR(2) test to generate a distribution of d* (13) also proceeds as described in the previous section, excepting only that the block lengths are chosen using

    l = (n - l + 1)^{(2/3)\left(1 - 1/\sqrt{4V'}\right)} .    (24)

As before, this equation has been developed through trial and error, using the guidelines in the last paragraph of section 2a and according to the criterion that the resulting test operates correctly for the smallest possible sample sizes without compromising performance for larger samples. Figure 7 illustrates this performance for two-sample, two-tailed tests, again over that portion of the AR(2) parameter space corresponding to stationary models with positive lag-1 autocorrelation. The shaded regions indicate parameter combinations for which the achieved significance level is within the 95% confidence band of the nominal significance level of 5%, as estimated from 2000 replications for which H_0 is true. For n = 16, the test operates accurately only for very weak dependence, where V' < 1. The portion of the parameter space exhibiting accurate tests steadily increases with increasing sample size, until for n = 64 only tests operating on data exhibiting very strong dependence (φ_2 ≈ φ_1) are permissive. For n ≥ 128 the test yields accurate rejection probabilities for all AR(2) parameter combinations investigated (also indicated in Fig. 7). Overall, the AR(2) bootstrap test is accurate for n_e larger than about 12 to 15.

Since the AR(1) process can be regarded as a special case of the AR(2), the test described in this section can also be used, and performs accurately, for AR(1) data. Given the results in Fig. 6a, it is worth considering what is lost when the slightly more elaborate AR(2) test is applied to AR(1) series. The answer is that estimation of the second sample autocorrelation in (21) results in the AR(2) bootstrap test having slightly less power. The heavy solid lines in Fig. 4 show the power functions for the AR(2) test when applied in selected AR(1) settings. The AR(2) test is uniformly less powerful, but the loss of power is quite small except for small n_e. The AR(2) test also exhibits robustness to non-Gaussian data comparable to that shown in Fig. 5 for the AR(1) test. Unless one can be quite confident that a data series can be adequately modeled as AR(1), the serious nonrobustness of the AR(1) test to AR(2) data illustrated in Fig. 6a suggests that the AR(2) test should be preferred to either the AR(1) test or the ZvS test.
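For reference, the estimation steps particular to the AR(2) version of the test — Eqs. (21) and (22) for the sample autocorrelations and autoregressive parameters, Eq. (23) for the bias-adjusted variance inflation factor, and the block-length rule (24) as reconstructed above — can be sketched as follows. The function name and the iteration used to solve (24) are mine; the remainder of the test (resampling, jackknife-after-bootstrap denominator, percentile decision) proceeds exactly as in the AR(1) sketch given in section 2b.

import numpy as np

def ar2_vprime_and_block_length(x):
    """Bias-adjusted V' and block length l for the AR(2) version of the test.

    Implements Eqs. (6), (21), (22a)-(22c), (3), (23), and (24)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    r1 = np.sum(xc[:-1] * xc[1:]) / np.sum(xc ** 2)              # Eq. (6)
    xm, xp = x[:-2].mean(), x[2:].mean()                          # means over first/last n-2 values
    r2 = (np.sum((x[:-2] - xm) * (x[2:] - xp)) /
          np.sqrt(np.sum((x[:-2] - xm) ** 2) * np.sum((x[2:] - xp) ** 2)))   # Eq. (21)
    phi1 = r1 * (1.0 - r2) / (1.0 - r1 ** 2)                      # Eq. (22b)
    phi2 = (r2 - r1 ** 2) / (1.0 - r1 ** 2)                       # Eq. (22c)
    r = np.empty(n)                                               # r[k] holds the lag-k autocorrelation
    r[0], r[1], r[2] = 1.0, r1, r2
    for k in range(3, n):
        r[k] = phi1 * r[k - 1] + phi2 * r[k - 2]                  # Eq. (22a)
    k = np.arange(1, n)
    V = 1.0 + 2.0 * np.sum((1.0 - k / n) * r[1:])                 # Eq. (3)
    vprime = V * np.exp(3.0 * V / n)                              # Eq. (23)
    expo = (2.0 / 3.0) * (1.0 - 1.0 / np.sqrt(4.0 * vprime))      # exponent of Eq. (24) as reconstructed
    l = np.sqrt(n)
    for _ in range(100):
        l = (n - l + 1.0) ** expo
    return float(vprime), max(1, int(l))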


FIG. 7. Achieved significance levels for two-sided, two-sample AR(2) tests. Shading indicates approximate regions in the parameter space for which these are not different from the nominal test level of 5% with 95% confidence. For n = 128 and greater, the test operates correctly for all AR(2) parameter combinations considered.

FIG. 8. Achieved significance levels for two-sided, two-sample AR(2) tests when applied to ARMA(1,1) data series with n = 128. Shading indicates approximate region for which these are not different from the nominal test level of 5% with 95% confidence. Test performance improves only very slightly with sample size.

Finally, Fig. 8 indicates the performance of the AR(2) test when applied to ARMA(1,1) data (section 2d). Accurate tests are achieved over a larger portion of this parameter space than was seen for the AR(1) test (compare Fig. 6b), and the achieved significance levels agree with the nominal level to within a factor of 2 throughout. It is always permissive where it does not operate correctly. Figure 8 shows results for n = 128, but improvements with increasing sample sizes are slight. An improvement with respect to the AR(1) test is not surprising, given that the AR(2) model is very much more flexible than the AR(1) model.

d. Bootstrap test for ARMA(1,1) data

Another simple but plausible time series model for geophysical data is the autoregressive-moving average process, of order 1 in both the autoregression and the moving average. This is the ARMA(1,1) process, defined by

    x'_t = \phi_1 x'_{t-1} + \epsilon_t - \theta_1 \epsilon_{t-1} ,    (25)

where φ_1 is the autoregressive parameter, θ_1 is the moving average parameter, and the remaining symbols have the same definitions as in (5) and (20). The ARMA(1,1) process can be viewed as an AR(1) process forced by

noise ε_t that is itself autocorrelated.

Figures 6b and 8 indicate that the tests in the previous two sections, based on the assumptions that the data conform to the AR(1) and AR(2) models, respectively, perform accurately over only limited portions of the ARMA(1,1) parameter space. An analogous test can be developed assuming that the data have come from an


ARMA(1,1) generating process. For this test, the variance inflation factor V is computed using (3), (6), (21), and

    r_k = \hat{\phi}_1 r_{k-1}, \quad k \ge 3 ,    (26a)

where

    \hat{\phi}_1 = r_2 / r_1 .    (26b)

That is, autocorrelation functions for ARMA(1,1) processes decay exponentially from their values at the first lag, at a rate specified by the autoregressive parameter. As before, the estimate of the variance inflation factor so obtained is biased, and a correction is necessary for the bootstrap test to perform properly for relatively small sample sizes. Following again the regression approach used to develop (8), it was found that a large part of this bias for ARMA(1,1) series can be removed by applying

    V' = V \exp\!\left[\frac{(2 + \hat{\theta}_1)\,V}{n}\right] ,    (27a)

where θ̂_1 is obtained as the root of

    (r_1 - \hat{\phi}_1)\,\hat{\theta}_1^2 + (1 - 2\hat{\phi}_1 r_1 + \hat{\phi}_1^2)\,\hat{\theta}_1 + (r_1 - \hat{\phi}_1) = 0    (27b)

that satisfies |θ̂_1| < 1. The bias correction (27a) is somewhat less satisfying than its counterparts (8) and (23) in that it does not reflect the slight tendency for positive bias (i.e., V is too large, on average) for ARMA(1,1) processes for which θ_1 approaches φ_1, both of these parameters are relatively large, and the sample size is moderately large. Somewhat better test performance under these conditions might be obtained if a more refined bias correction were to be found.

The mechanics of the bootstrapping proceed as described before, using (24) to choose the block length. The accuracy of this test over the portion of the ARMA(1,1) parameter space corresponding to stationary series with positive lag-1 autocorrelation is shown as a function of sample size in Fig. 9. For n = 16 (Fig. 9a) the test is accurate over only a relatively small portion of the parameter space, and the inaccurate tests are all permissive. As the sample size increases to n = 64 (Fig. 9c), the test performs accurately over nearly all of the parameter space. Figure 9d illustrates the effect of the deficiency in the bias correction (27a), which, since the small positive bias of V is not corrected, produces tests that are stringent to a small degree in the upper portion of the parameter space. However, as a practical matter, the tests are quite workable even with this small problem, and the ARMA(1,1) test performs essentially correctly for n > 64 for all of the parameter combinations investigated here.

The power of these ARMA(1,1) tests is illustrated by the thick gray lines in Fig. 4. For most of these cases it is somewhat less sensitive to violations of H_0 than the other tests considered. However, for the more strongly autocorrelated data series (φ = 0.8), the ARMA(1,1) test is slightly more powerful than the AR(2) test and is comparable to the AR(1) test.

Finally, the robustness of the ARMA(1,1) test to AR(2) data is shown in Fig. 10, for n = 128. While these results are better than the corresponding results for the AR(1) test (compare Fig. 6a), the ARMA(1,1) tests perform accurately only for relatively minor deviations from the AR(1) process, which is the special case of (25), with θ_1 = 0. For AR(2) parameter combinations where the ARMA(1,1) test does not perform accurately, increasing the sample size yields tests that are increasingly stringent in relation to the results shown in Fig. 10. For smaller n the tests represented by the area below the stippling are still stringent, while tests in the area in and above the stippling are permissive for small samples.

FIG. 10. Achieved significance levels for two-sided, two-sample ARMA(1,1) tests when applied to AR(2) data, with n = 128. Shading indicates approximate region in which these are not different from the nominal test level of 5% with 95% confidence.
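The quantities particular to the ARMA(1,1) version of the test can be sketched in the same style as the earlier fragments. The code below estimates φ̂_1 from (26b), extrapolates the autocorrelation function with (26a), selects the root of the quadratic (27b) with |θ̂_1| < 1 (falling back to θ̂_1 = 0 if no such real root exists, a detail of my own), and applies the bias adjustment (27a); the function name is illustrative.

import numpy as np

def arma11_vprime(x):
    """Bias-adjusted variance inflation factor V' for the ARMA(1,1) test.

    Implements Eqs. (6), (21), (26a)-(26b), (3), and (27a)-(27b)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    r1 = np.sum(xc[:-1] * xc[1:]) / np.sum(xc ** 2)                          # Eq. (6)
    xm, xp = x[:-2].mean(), x[2:].mean()
    r2 = (np.sum((x[:-2] - xm) * (x[2:] - xp)) /
          np.sqrt(np.sum((x[:-2] - xm) ** 2) * np.sum((x[2:] - xp) ** 2)))   # Eq. (21)
    phi1 = r2 / r1                                                           # Eq. (26b)
    r = np.empty(n)
    r[0], r[1], r[2] = 1.0, r1, r2
    for k in range(3, n):
        r[k] = phi1 * r[k - 1]                                               # Eq. (26a)
    k = np.arange(1, n)
    V = 1.0 + 2.0 * np.sum((1.0 - k / n) * r[1:])                            # Eq. (3)
    # Moving-average parameter: the root of the quadratic (27b) with |theta1| < 1.
    coeffs = [r1 - phi1, 1.0 - 2.0 * phi1 * r1 + phi1 ** 2, r1 - phi1]
    real_roots = [z.real for z in np.roots(coeffs)
                  if abs(z.imag) < 1e-10 and abs(z.real) < 1.0]
    theta1 = real_roots[0] if real_roots else 0.0                            # fallback is my own choice
    return float(V * np.exp((2.0 + theta1) * V / n))                         # Eq. (27a)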

3. Multivariate tests

The univariate (scalar data) tests developed in section 2 can be extended to multivariate (vector-valued data) tests by considering now that the data in each of the two samples are matrices [x] with elements x_{t,m}, where t = 1, ..., n is the time index as before, and m = 1, ..., M indexes the elements of the data vectors x_t. The vector elements might correspond to points on a spatial grid, in which case each x_t could be a single realization of a field, or coefficients for these data projected onto a space of smaller dimension. The multivariate tests described below reduce to the univariate tests described in section 2 for M = 1.

The basic idea behind the multivariate tests is to simultaneously carry out the corresponding univariate tests for all data dimensions. That is, instead of block resampling scalar series, the same block resampling procedure will be applied to the vector data. Simultaneous application of the same resampling patterns to all dimensions of the data vectors will yield resampled statistics reflecting the cross correlations in the underlying data, without the necessity of explicitly modeling those cross correlations. Meanwhile, appropriate choice of the block length will allow the influence of the temporal correlation on the distribution of the test statistic to be represented as well.

For each dimension of the data, there will be a corresponding distance d_m computed using (11). The "global," or field, significance relates to the size of the vector d of these "local" test statistics. If the null hypothesis of no real local difference in (vector) mean is true, the size of the vector d, having been drawn from a population with mean 0, should be small. However, there are many choices for judging the size of the test vector d. The performance, with respect to test accuracy and power, of the following four vector norms will be investigated:


FIG. 9. Achieved significance levels for two-sided, two-sample ARMA(1,1) tests. Shading indicates approximate regions for which these are not different from the nominal test level of 5% with 95% confidence.

    D_1 = \sum_{m=1}^{M} |d_m| ,    (28a)

    D_2 = \sum_{m=1}^{M} d_m^2 ,    (28b)

    D_3 = \max_m |d_m| ,    (28c)

and

    D_4 = \#\{p_m \le \alpha\} = \sum_{m=1}^{M} I[p_m \le \alpha] .    (28d)

The first of these norms, D_1 (28a), measures the length of d as the sum of the absolute values of its elements, which has been recommended by Mielke (1985). The vector norm D_2 (28b) is the square of the traditional Euclidean distance in M-space, and this (or, equivalently, its square root) would be the conventional choice. Here, D_3 (28c) is the "max norm," or largest element of d, which typically corresponds to the smallest p value among the local tests, as suggested, for example, by Westfall and Young (1993). Finally, D_4 (28d) is the "counting norm" (Zwiers 1987), or the number of the M local tests that are significant at the α100% level, which has been employed frequently for testing climate data (e.g., Livezey and Chen 1983; Wilks 1996). In order to employ D_4, a specific local test must be chosen to compute the p value p_m, which in the following will be the z test assuming the asymptotic Gaussian distribution of (11).
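The four norms in (28) are trivial to compute from the vector of local test statistics d. The short sketch below (the function name is mine) also evaluates the two-sided z-test p values needed for the counting norm D_4, using the asymptotic Gaussian distribution of (11) as stated above.

import numpy as np
from math import erfc, sqrt

def vector_norms(d, alpha=0.05):
    """The four measures of the size of the local-test vector d, Eqs. (28a)-(28d)."""
    d = np.asarray(d, dtype=float)
    D1 = np.sum(np.abs(d))                                   # Eq. (28a)
    D2 = np.sum(d ** 2)                                      # Eq. (28b), squared Euclidean length
    D3 = np.max(np.abs(d))                                   # Eq. (28c), max norm
    # Two-sided z-test p values for the local statistics, p = 2[1 - Phi(|d_m|)].
    p = np.array([erfc(abs(dm) / sqrt(2.0)) for dm in d])
    D4 = int(np.sum(p <= alpha))                             # Eq. (28d), counting norm
    return D1, D2, D3, D4

# Example: 64 independent local statistics under H0 give D4 near 64 * alpha on average.
rng = np.random.default_rng(2)
print(vector_norms(rng.standard_normal(64)))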


Sampling distributions under H_0 for each of the norms in (28) are developed by bootstrapping the data [x] in the same way as described in section 2 for scalar series, with the same time blocks being chosen in each bootstrap sample for each of the M dimensions. Note that the two data batches [x_1] and [x_2] need not be resampled using the same block length. For each of the n_B bootstrap samples generated in this way, (13) is applied to each of the M elements to yield the vector d*, the length of which, D*, is in turn computed according to the various norms in (28). The (univariate) distribution of each of these collections of n_B realizations of D* is then used to assess the unusualness of the corresponding observed D by applying the percentile method (14), using D and D* rather than d and d*. That is, d is declared to be significantly different from 0 at the α100% level if D is among either the (α/2)100% smallest or the (α/2)100% largest of the n_B values of D*.

The multivariate data used to demonstrate the vector hypothesis tests will be realizations from the simplified vector autoregressive process

    \mathbf{x}'_t = \phi \, \mathbf{x}'_{t-1} + [C] \, \boldsymbol{\epsilon}_t ,    (29)

where ε_t is an M-dimensional vector of independent standard Gaussian variates. The matrix [C] reflects the correlation among the elements of x_t and is related to their (unlagged) correlation matrix according to [C][C]^T = [R_0]. For simplicity, a scalar autoregressive coefficient φ is used in (29), which implies that the matrix of lag-1 correlations [R_1] is symmetric and proportional to [R_0].

Multivariate data with two forms of cross correlation will be investigated, which are the same as those used by Zwiers (1987). These correlations among the M elements of x are introduced through the matrix of unlagged correlations [R_0], whose elements are

    \rho_{i,j} = \begin{cases} 1, & |i-j| = 0 \\ s_1/(1 - s_2), & |i-j| = 1 \\ s_1 \rho_{|i-j|-1} + s_2 \rho_{|i-j|-2}, & |i-j| \ge 2 . \end{cases}    (30)

The two spatial correlation structures to be considered are a spatial analog of an AR(1) process, for which s_1 = 0.9 and s_2 = 0, and a spatial analog of an AR(2) process, for which s_1 = 1.6 and s_2 = −0.8. The (spatial) lag-1 autocorrelation for these two models is nearly identical. However, at larger spatial separations, the autocorrelation structures are quite different. The spatial AR(1) autocorrelation function decreases monotonically toward zero with increasing lag, while the spatial AR(2) autocorrelation function exhibits damped oscillations around zero, with local maxima and minima reminiscent of teleconnection patterns.

In order for the vector block bootstrapping to capture the cross correlation in the data produced by (30), it is essential that the same block length l be applied to each of the vector elements. In the following, this common block length is chosen by separately computing l for each vector element, as described in section 2, and then averaging these M specified block lengths. This procedure is a quick and ad hoc choice, which could very likely be improved upon. Note that it is not necessary for the two data vectors in a two-sample test to be resampled with the same block length.

Figures 11 and 12 show the achieved significance levels of these vector tests for data generated using the spatial AR(2) process and spatial AR(1) process from (30), respectively. The tests are two-sample, two-sided tests, and the shading again indicates 95% confidence intervals around the nominal significance level of 5%. The four panels in each figure show results for tests constructed using each of the four test statistic norms in (28). These results are for sample sizes ranging from n = 16 to n = 2048 and for M = 4 to M = 64 dimensional data. The symbols indicate data generated using φ = 0.2, 0.5, and 0.8 in (29), and only points for which n_e > 10 are plotted.

Achieved significance levels for the multivariate tests based on both D_1 and D_2 are quite close to the nominal level of 5% when the effective sample size n_e is relatively large with respect to the dimension M of the data vectors, and the performances of these two norms in this respect are nearly indistinguishable. These observations apply as well to tests based on the max norm D_3, except when the temporal dependence is strong, in which case the tests tend to be stringent. Tests based on the counting norm D_4 appear to be consistently stringent even for large n_e/M, although not grossly so. However, for weak temporal dependence (φ = 0.2), tests based on D_3 perform well even for small sample sizes in relation to M. For other situations in which n_e < M, all of the tests are stringent, especially for the spatial AR(2) data in Fig. 11, but could still be used in practice given the understanding that the actual test level is somewhat smaller than the nominal one. Tests for data with n_e < 10 (not shown) are progressively more permissive, but exhibit gross inaccuracies (achieved significance levels larger than 0.10) only for n_e smaller than about 2. Results similar to those in Figs. 11 and 12 are obtained for extensions of (29) used to generate vector AR(2) data (not shown), with series having weaker or stronger autocorrelation (as measured by V) exhibiting results similar to those in Figs. 11 and 12, with smaller or larger values of φ, respectively.
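To make the simulation setup of (29) and (30) concrete, here is a small Python/NumPy sketch that builds the unlagged correlation matrix [R_0] and generates cross-correlated vector AR(1) data using a Cholesky factor for [C]. The function names, the burn-in length, and the use of numpy.linalg.cholesky are my choices rather than details given in the paper.

import numpy as np

def spatial_correlation_matrix(M, s1, s2):
    """Unlagged correlation matrix [R0] with the recursive structure of Eq. (30)."""
    rho = np.empty(M)                 # rho[k] is the correlation at spatial separation k
    rho[0] = 1.0
    if M > 1:
        rho[1] = s1 / (1.0 - s2)
    for k in range(2, M):
        rho[k] = s1 * rho[k - 1] + s2 * rho[k - 2]
    i, j = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    return rho[np.abs(i - j)]

def vector_ar1_sample(n, M, phi, s1, s2, rng, burn_in=100):
    """Simulate n steps of the simplified vector AR(1) process of Eq. (29),
    x'_t = phi * x'_{t-1} + [C] eps_t, where [C][C]^T = [R0]."""
    C = np.linalg.cholesky(spatial_correlation_matrix(M, s1, s2))
    state = np.zeros(M)
    x = np.empty((n, M))
    for t in range(burn_in + n):
        state = phi * state + C @ rng.standard_normal(M)
        if t >= burn_in:
            x[t - burn_in] = state
    return x

# Example: the "spatial AR(2)" dependence used for Figs. 11 and 13.
rng = np.random.default_rng(1)
sample = vector_ar1_sample(n=64, M=64, phi=0.5, s1=1.6, s2=-0.8, rng=rng)
print(sample.shape)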


FIG. 11. Achieved significance levels for multivariate two-sample, two-sided tests. Shaded regions indicate 95% confidence intervals around the nominal significance level of 5%. Sample sizes range from n = 16 to n = 2048, and the dimension of the data vectors ranges from M = 4 to M = 64. Dependence between vector elements is specified by (30), with s_1 = 1.6 and s_2 = −0.8. Symbols indicate autoregressive parameter φ_1 = 0.2, 0.5, or 0.8. Only points for which n_e > 10 are plotted.

The most pronounced differences in the performance of tests based on the four norms in (28) are with respect to their power. Figure 13 illustrates the sensitivity to violations of H_0 of the multivariate tests constructed using these four vector norms. The specific case presented pertains to two-sided, two-sample tests at the nominal 5% level, applied to vector AR(1) data with n = M = 64 and cross correlation described by (30), with s_1 = 1.6 and s_2 = −0.8. The six panels in Fig. 13 show power functions for different numbers m_Δ of the M vector elements of the population mean for one of the two samples being increased. The power functions in Fig. 13a pertain to the mean of only m_Δ = 1 of the M = 64 elements being increased, with the identity of that element being chosen randomly for each test. Since one would expect that violations of H_0 would tend to occur with spatial coherency, changes to the vector means in Figs. 13b-f are made randomly, but with the constraint that the changes are made to adjacent vector elements, in contiguous blocks of length m_Δ.

Clearly, the power of the tests increases as m_Δ increases, since increasing the means of larger numbers of vector elements by a fixed amount provides stronger signals for the tests to detect. As expected by extension of Fig. 4, power is greater for tests operating on less strongly correlated data (φ = 0.5) than on more strongly correlated data (φ = 0.8). For clarity, only results for φ = 0.5 are shown in Figs. 13a,b. Power for tests with φ = 0.2 (also not shown) bears the same qualitative relationship to the φ = 0.5 and φ = 0.8 results as in Fig. 4. The levels of the power functions for Δμ/σ_ε = 0 indicate that the tests are stringent for this sample size, particularly for φ = 0.8. This is consistent with Fig. 11 since Fig. 13 pertains to the rather demanding cases of n_e/M = 0.34 and 0.12 for φ = 0.5 and φ = 0.8, respectively.


FIG. 12. As in Fig. 11 but for dependence between vector elements specified by (30) with s1 = 0.9 and s2 = 0.

As expected by extension of Fig. 4, power is greater for tests operating on less strongly correlated data (φ = 0.5) than on more strongly correlated data (φ = 0.8). For clarity, only results for φ = 0.5 are shown in Figs. 13a,b. Power for tests with φ = 0.2 (also not shown) bears the same qualitative relationship to the φ = 0.5 and φ = 0.8 results as in Fig. 4. The levels of the power functions for Δμ/σε = 0 indicate that the tests are stringent for this sample size, particularly for φ = 0.8. This is consistent with Fig. 11, since Fig. 13 pertains to the rather demanding cases of ne/M = 0.34 and 0.12 for φ = 0.5 and φ = 0.8, respectively.

For a given value of mΔ, there are very major differences in test sensitivity among the tests based on the four vector norms in (28). For the most challenging cases of small mΔ, the max norm D3 (heavy solid lines) is clearly superior to the others. The squared-distance norm D2 (light solid lines), absolute-distance norm D1 (light dashed lines), and the counting norm D4 (heavy dashed lines) show progressively decreasing sensitivity, with D4 exhibiting especially low power for small mΔ. As mΔ increases, the power of the D3 test increases only slightly, while the power of the other tests increases substantially, until mΔ = 16, where power for most of the tests is fairly comparable. For mΔ = 32 the power of the D3 tests is noticeably less than for the other three norms, which are now practically identical in power.

The same basic patterns shown in Fig. 13 occur for other sample sizes as well, although with less power for smaller n and greater power for larger n. Results for vector data generated using (30) with s1 = 0.9 and s2 = 0 are also very similar, but with a tendency for the tests to exhibit slightly less power. Power functions for analogous tests performed on temporal AR(2) data, with the component scalar tests as described in section 2c, are also basically the same, but with (analogously to Fig. 4) slightly less power than for these temporal AR(1) data. Power functions for the tests applied to mutually independent multivariate data (i.e., s1 = s2 = 0) are also qualitatively similar, but show noticeably greater power.
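The alternatives used for Fig. 13 perturb the population mean of mΔ adjacent vector elements of one sample. A minimal sketch of that construction is given below; the function name and the assumption of a unit white-noise standard deviation are mine, not the paper's.

    import numpy as np

    def shift_adjacent_means(x, m_delta, delta, seed=None):
        """Add a mean shift delta (here in units of the white-noise standard
        deviation, assumed to be 1) to a randomly located contiguous block of
        m_delta adjacent elements of the (n x M) sample x."""
        rng = np.random.default_rng(seed)
        n, M = x.shape
        start = rng.integers(0, M - m_delta + 1)   # random location of the block
        y = x.copy()
        y[:, start:start + m_delta] += delta
        return y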


FIG. 13. Comparison of power functions for multivariate two-sample tests, using vector AR(1) data with n = 64 and M = 64, conducted at the nominal 5% level. Horizontal axis is the difference between individual elements of the population means, standardized by the square root of the white-noise variance. (a)–(f) Results for differences between the mean vectors applied to mΔ = 1, 2, 4, 8, 16, and 32 randomly selected adjacent vector elements, respectively. Dependence between vector elements is specified by (30), with s1 = 1.6 and s2 = −0.8. Results for φ1 = 0.8 are omitted from (a) and (b) for clarity.

The power functions shown in Fig. 13 are also representative of those for other values of the vector dimension M, except for the performance of tests based on the counting norm D4. The ratio mΔ/M required for tests based on the counting norm to exhibit good power must be larger for smaller M and can be smaller for larger M. As a rule of thumb, for the tests described in this paper at least, those based on the counting norm exhibited good power when the ratio mΔ/M was larger than approximately 2(M)^(−1/2); for M = 64, for example, this corresponds to mΔ of roughly 16 or more. Intuitively, it is reasonable that tests based on D4 would have low power for small mΔ, since this norm does not take account of how strongly each local scalar H0 is rejected.

Finally, it is remarkable that the power of these multivariate tests based on the max norm D3 is so great for the difficult cases where the "signal" is confined to only a very small number mΔ of elements embedded in M-dimensional "noise." Comparison of Fig. 4b to Fig. 13a shows that for mΔ = 1 the test suffers surprisingly little degradation in power as the dimension of the data vectors increases from M = 1 to M = 64, for φ = 0.5 and the relatively small sample size of n = 64. For M = 1 in Fig. 4b, the test achieves 90% power at Δμ/σε = 1.1, while for M = 64 in Fig. 13a, the scaled difference in the mean (of only one of the M vector elements) necessary to achieve 90% power increases to only 1.7. For the intermediate values M = 4 and 16, the corresponding Δμ/σε are 1.2 and 1.4, respectively. Larger sample sizes yield the same qualitative patterns, but with decreasing Δμ/σε.


By focusing on the scalar test indicating violation of H0 most strongly, the max norm D3 appears to filter much of the mere sampling variability in the remaining M − mΔ vector elements. Conversely, however, when mΔ is large this focus leads D3 to ignore important information, leading to tests with lower power.

4. Summary and conclusions

This paper has developed bootstrap tests that are based on a nonparametric analog of Katz's (1982) test for selected simple time-dependence structures. The tests are robust to deviations from Gaussian data and perform well for relatively small sample sizes. The univariate bootstrap tests developed in section 2 are nearly as powerful as the corresponding likelihood ratio tests. Proper choice of the bootstrap block length is important for accurate test performance. The prescriptions offered here for the block length work well for hypothesis tests of the mean, but may not be universally applicable.

The primary motivation for this paper has been construction of nonparametric multivariate tests, by operating the scalar tests of section 2 in parallel and thus resampling directly the cross correlation in vector data. The results in section 3 show that this can be a successful strategy, although in most cases the actual test levels are smaller than the nominal significance levels when the effective sample size ne is comparable in magnitude to, or smaller than, the dimension M of the data vectors. The condition ne/M < 1 occurs frequently enough in practice to be important and yet is not handled well by existing tests. The tests described here can be applied successfully in these instances as well, provided the analyst accounts for their moderate conservatism.

The univariate test presented in section 2b, which assumes that the data arise from a first-order autoregressive process, performs very badly indeed when this assumption is violated. It was found here that this shortcoming applies equally to the test recently developed by Zwiers and von Storch (1995). The test presented in section 2c, which is based on the assumption that the data follow a second-order autoregressive process, is much more robust and widely applicable. If it can be assumed in a univariate test setting that the data are AR(1), then the ZvS test will be preferable to the scalar bootstrap tests developed here, from the standpoints of test power and computational simplicity. More generally, however, the AR(2) bootstrap test developed in section 2c should be preferred. In practice, one could investigate formally which of the three time series models used in sections 2b–d best fits the data series at hand (e.g., Katz and Skaggs 1981).

If a dataset of interest clearly exhibits behavior more elaborate than any of the three models used in section 2, the performance of these tests can be readily investigated through Monte Carlo simulations of the type used here, provided an adequate time series model for the data at hand can be identified. Following such an analysis, judgments regarding the outcome of the tests can be made using the resulting estimate of the achieved significance level, rather than the nominal one. It would clearly also be possible to extend the procedures described above to construct tests oriented toward specific higher-order or more complex models for the time dependence in the data.

The power of the multivariate tests developed in section 3 depends very strongly on the vector norm chosen to summarize the individual scalar test statistics. When differences in vector means are confined to a small number of vector elements, the tests based on the counting norm exhibit extremely poor power. Limited ability to discern violations of the null hypothesis in such "needle in a haystack" situations is expected, as has been pointed out by Hasselmann (1979). However, it is found here that use of the max norm, or largest of the M scalar statistics, yields surprisingly powerful multivariate tests even when the signal is confined to a small number of the vector elements, and that the test power degrades relatively little as M increases. Unless it is known or strongly suspected that violations of H0 will occur for a large proportion of the elements of the data vectors, the max norm D3 is probably the best choice among those presented in (28).

For the multivariate tests it is essential that the vector bootstrapping operate with the same block length for each element of a data vector, although the block lengths for the two vectors in two-sample problems can be different. It is not necessary, however, that the same scalar time series model be assumed for each element of a data vector. For purposes of computing the denominator of (11), some of the scalar series could be modeled as ARMA(1,1), some as AR(2), etc., so long as comparable block lengths would result.

Although attention has been confined here to tests for differences in means, there is no reason why analogous tests addressing other aspects of the data could not be constructed as well. For example, tests involving measures of the variance of data fields, or functions of the spatial correlation patterns, could be of interest. In these cases, the more refined bootstrap confidence intervals described in Efron (1987) might produce significant improvements in test performance.

How the tests should proceed if the elements of a data vector exhibit grossly different degrees of serial correlation is unclear, although for large sample sizes the underlying scalar tests perform well over a fairly wide range of choices for l. The consequences of misspecifying the block length for some of the data elements could be explored more fully. This potential problem is relevant because of the clear need to adapt the block length to the strength of the autocorrelation in the data. This requirement has also motivated the use of parametric time series models to compute V′ in (12), even though a wholly nonparametric approach might be preferable. Future work might also address more general and robust alternatives to this aspect of the construction of the tests.
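As a closing illustration, the sketch below shows one way the vector moving-blocks resampling described above can be organized: whole rows (times) of an (n × M) data matrix are resampled in overlapping blocks of a single common length l, so that every vector element receives the same resampled time indices and the cross correlation between elements is carried along automatically. This is a minimal sketch under those assumptions, not the paper's code; the choice of l follows the prescriptions discussed in section 2.

    import numpy as np

    def moving_blocks_resample(x, l, seed=None):
        """One moving-blocks bootstrap resample of an (n x M) data matrix x.
        Rows are drawn in blocks of l consecutive times (overlapping starting
        points), and all M elements share the same resampled indices, so the
        cross correlation between elements is preserved by the resampling."""
        rng = np.random.default_rng(seed)
        n = x.shape[0]
        n_blocks = int(np.ceil(n / l))
        starts = rng.integers(0, n - l + 1, size=n_blocks)
        idx = np.concatenate([np.arange(s, s + l) for s in starts])[:n]
        return x[idx]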


Acknowledgments. I thank Rick Katz for telling me of the existence of block bootstrapping and Francis Zwiers for helpful discussions. This research has been funded in part by the Northeast Regional Climate Center under National Oceanic and Atmospheric Administration Grant NA16CP-0220-02.

REFERENCES

Albers, W., 1978: Testing the mean of a normal population under dependence. Ann. Stat., 6, 1337–1344.
Box, G. E. P., and G. M. Jenkins, 1976: Time Series Analysis: Forecasting and Control. Holden-Day, 575 pp.
Carlstein, E., 1986: The use of subseries values for estimating the variance of a general statistic from a stationary sequence. Ann. Stat., 14, 1171–1179.
Chervin, R. M., and S. H. Schneider, 1976: On determining the statistical significance of climate experiments with general circulation models. J. Atmos. Sci., 33, 405–412.
Chung, J. H., and D. A. S. Fraser, 1958: Randomization tests for a multivariate two-sample problem. J. Amer. Stat. Assoc., 53, 729–735.
Cressie, N., 1980: Relaxing assumptions in the one sample t test. Aust. J. Stat., 22, 143–153.
Daley, R., and R. M. Chervin, 1985: Statistical significance testing in numerical weather prediction. Mon. Wea. Rev., 113, 814–826.
Efron, B., 1982: The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, 92 pp.
Efron, B., 1987: Better bootstrap confidence intervals. J. Amer. Stat. Assoc., 82, 171–185.
Efron, B., and R. J. Tibshirani, 1993: An Introduction to the Bootstrap. Chapman and Hall, 436 pp.
Hall, P., and S. R. Wilson, 1991: Two guidelines for bootstrap hypothesis testing. Biometrics, 47, 757–762.
Hasselmann, K., 1979: On the signal-to-noise problem in atmospheric response studies. Meteorology over the Tropical Oceans, D. B. Shaw, Ed., Roy. Meteor. Soc., 251–259.
Hillmer, S. C., and G. C. Tiao, 1979: Likelihood function of stationary multiple autoregressive moving average models. J. Amer. Stat. Assoc., 74, 652–660.
Johnson, R. A., and D. W. Wichern, 1982: Applied Multivariate Statistical Analysis. Prentice-Hall, 594 pp.
Jones, R. H., 1975: Estimating the variance of time averages. J. Appl. Meteor., 14, 159–163.
Kabaila, P., and G. Nelson, 1985: On confidence regions for the mean of a multivariate time series. Commun. Stat. Theory Methods, 14, 735–753.
Katz, R. W., 1982: Statistical evaluation of climate experiments with general circulation models: A parametric time series approach. J. Atmos. Sci., 39, 1446–1455.
Katz, R. W., and R. H. Skaggs, 1981: On the use of autoregressive-moving average processes to model meteorological time series. Mon. Wea. Rev., 109, 479–484.
Künsch, H. R., 1989: The jackknife and the bootstrap for general stationary observations. Ann. Stat., 17, 1217–1241.
Leger, C., D. N. Politis, and J. P. Romano, 1992: Bootstrap technology and applications. Technometrics, 34, 378–398.
Leith, C. E., 1973: The standard error of time-average estimates of climatic means. J. Appl. Meteor., 12, 1066–1069.
Liu, R. Y., and K. Singh, 1992: Moving blocks bootstrap captures weak dependence. Exploring the Limits of Bootstrap, R. Lepage and L. Billard, Eds., John Wiley, 225–248.
Livezey, R. E., 1985: Statistical analysis of general circulation model climate simulation: Sensitivity and prediction experiments. J. Atmos. Sci., 42, 1139–1149.
Livezey, R. E., 1995: Field intercomparison. Analysis of Climate Variability, H. von Storch and A. Navarra, Eds., Springer, 159–176.
Livezey, R. E., and W. Y. Chen, 1983: Statistical field significance and its determination by Monte Carlo techniques. Mon. Wea. Rev., 111, 46–59.
Mielke, P. W., 1985: Geometric concerns pertaining to applications of statistical tests in the atmospheric sciences. J. Atmos. Sci., 42, 1209–1212.
Mielke, P. W., K. J. Berry, and G. W. Brier, 1981: Application of multi-response permutation procedures for examining seasonal changes in monthly mean sea-level pressure patterns. Mon. Wea. Rev., 109, 120–126.
Preisendorfer, R. W., and T. P. Barnett, 1983: Numerical model-reality intercomparison tests using small-sample statistics. J. Atmos. Sci., 40, 1884–1896.
Solow, A. R., 1985: Bootstrapping correlated data. Math. Geol., 17, 769–775.
Thiébaux, H. J., and F. W. Zwiers, 1984: The interpretation and estimation of effective sample size. J. Climate Appl. Meteor., 23, 800–811.
Trenberth, K. E., 1984: Some effects of finite sample size and persistence on meteorological statistics. Part I: Autocorrelations. Mon. Wea. Rev., 112, 2359–2368.
von Storch, H., 1982: A remark on Chervin-Schneider's algorithm to test significance of climate experiments with GCMs. J. Atmos. Sci., 39, 187–189.
von Storch, H., 1995: Misuses of statistical analysis in climate research. Analysis of Climate Variability, H. von Storch and A. Navarra, Eds., Springer, 11–26.
Westfall, P. H., and S. S. Young, 1993: Resampling-Based Multiple Testing. John Wiley, 340 pp.
Wilks, D. S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 464 pp.
Wilks, D. S., 1996: Statistical significance of long-range "optimal climate normal" temperature and precipitation forecasts. J. Climate, 9, 827–839.
Zwiers, F. W., 1987: Statistical considerations for climate experiments. Part II: Multivariate tests. J. Climate Appl. Meteor., 26, 477–487.
Zwiers, F. W., 1990: The effect of serial correlation on statistical inferences made with resampling procedures. J. Climate, 3, 1452–1461.
Zwiers, F. W., and H. J. Thiébaux, 1987: Statistical considerations for climate experiments. Part I: Scalar tests. J. Climate Appl. Meteor., 26, 464–476.
Zwiers, F. W., and H. von Storch, 1995: Taking serial correlation into account in tests of the mean. J. Climate, 8, 336–351.
