In a parametric model f(y; θ) for a random variable or vector Y, a statistic A = a(Y) is ancillary for θ if the distribution of A does not depend on θ. As a very simple example, if Y is a vector of independent, identically distributed random variables each with mean θ, and the sample size is determined randomly, rather than being fixed in advance, then A = number of observations in Y is an ancillary statistic. This example could be generalized to more complex structure for the observations Y, and to examples in which the sample size depends on some further parameters that are unrelated to θ. Such models might well be appropriate for certain types of sequentially collected data arising, for example, in clinical trials.

Fisher [5] introduced the concept of an ancillary statistic, with particular emphasis on the usefulness of an ancillary statistic in recovering information that is lost by reduction of the sample to the maximum likelihood estimate θ̂, when the maximum likelihood estimate is not minimal sufficient.

An illustrative, if somewhat artificial, example is a sample (Y1, ..., Yn), where now n is fixed, from the uniform distribution on (θ, θ + 1). The smallest and largest observations, (Y(1), Y(n)) say, form a minimal sufficient statistic for θ. The maximum likelihood estimator of θ is any value in the interval (Y(n) − 1, Y(1)), and the range Y(n) − Y(1) is an ancillary statistic. In this example, while the range does not provide any information about the value of θ that generated the data, it does provide information on the precision of θ̂. In a sample for which the range is 1, θ̂ is exactly equal to θ, whereas a sample with a range of 0 is the least informative about θ.

A theoretically important example discussed in Fisher [5] is the location model (see Location–Scale Family), in which Y = (Y1, ..., Yn), and each Yi follows the model f(y − θ), with f(·) known but θ unknown. The vector of residuals A = (Y1 − Ȳ, ..., Yn − Ȳ), where Ȳ = n⁻¹ ∑ Yi, has a distribution free of θ, as is intuitively obvious, since both Yi and Ȳ are centered at θ. The uniform example discussed above is a special case of the location model, and the range Y(n) − Y(1) is also an ancillary statistic for the present example. In fact, the vector B = (Y(2) − Y(1), ..., Y(n) − Y(1)) is also ancillary, as is C = (Y1 − θ̂, ..., Yn − θ̂), where θ̂ is the maximum likelihood estimate of θ. A maximal ancillary provides the largest possible conditioning set, or the largest possible reduction in dimension, and is analogous to a minimal sufficient statistic. A, B, and C are maximal ancillary statistics for the location model, but the range is only a maximal ancillary in the uniform model.

An important property of the location model is that the exact conditional distribution of the maximum likelihood estimator θ̂, given the maximal ancillary C, can be easily obtained simply by renormalizing the likelihood function:

    p(θ̂ | c; θ) = L(θ; y) / ∫ L(θ; y) dθ,    (1)

where L(θ; y) = ∏ f(yi; θ) is the likelihood function for the sample y = (y1, ..., yn), and the right-hand side is to be interpreted as depending on θ̂ and c, using the equations ∂{∑ log f(yi; θ)}/∂θ |θ=θ̂ = 0 and c = y − θ̂.

The location model example is readily generalized to a linear regression model with nonnormal errors. Suppose that, for i = 1, ..., n, we have independent observations from the model Yi = xi'β + σεi, where the distribution of εi is known. The vector of standardized residuals (Yi − xi'β̂)/σ̂ is ancillary for θ = (β, σ), and there is a formula similar to (1) for the distribution of θ̂, given the residuals.

It is possible, and has been argued, that Fisher's meaning of ancillarity included more than the requirement of a distribution free of θ: that it included a notion of a physical mechanism for generating the data in which some elements of this mechanism are "clearly" not relevant for assessing the value of θ, but possibly relevant for assessing the accuracy of the inference about θ. Thus, Kalbfleisch [7] makes a distinction between an experimental and a mathematical ancillary statistic. Fraser [6] developed the notion of a structural model as a physically generated extension of the location model. Efron & Hinkley [4] gave particular attention to the role of an ancillary statistic in estimating the variance of the maximum likelihood estimator.

Two generalizations of the concept of ancillary statistic have become important in recent work in the theory of inference. The first is the notion of approximate ancillarity, in which the distribution of A is not required to be entirely free of θ, but free of θ to some order of approximation.
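The role of the ancillary range in the uniform example can be illustrated by a small simulation (not from the original article). The sketch below draws samples from Uniform(θ, θ + 1) and compares the conditional precision of a maximum likelihood estimate, here taken as the midpoint of the interval (Y(n) − 1, Y(1)) of maximizers, for samples whose range is near 1 against samples whose range is near 0; the seed, sample size, and cutoffs 0.9 and 0.3 are illustrative choices.

```python
import random
import statistics

random.seed(1)
theta = 2.0            # true parameter; the ancillarity holds for any value
n, reps = 5, 100_000

err_wide, err_narrow = [], []   # MLE errors, split by the ancillary range
for _ in range(reps):
    y = [random.uniform(theta, theta + 1.0) for _ in range(n)]
    lo, hi = min(y), max(y)
    rng_stat = hi - lo          # the range: its distribution is free of theta
    mle = 0.5 * ((hi - 1.0) + lo)   # midpoint of the interval (Y(n)-1, Y(1))
    if rng_stat > 0.9:
        err_wide.append(mle - theta)
    elif rng_stat < 0.3:
        err_narrow.append(mle - theta)

# Conditioning on the ancillary reveals very different precisions:
print(statistics.stdev(err_wide), statistics.stdev(err_narrow))
```

Samples with range near 1 pin θ̂ down almost exactly, while samples with range near 0 leave θ̂ spread over nearly a unit interval, even though the range itself carries no information about θ.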

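The renormalization in (1) can be carried out numerically for any location density f. The following sketch, which is purely illustrative, assumes a standard Cauchy error density and made-up data; the integration grid and step size are arbitrary choices.

```python
import math

# Hypothetical sample, assumed to follow a Cauchy location model f(y - theta)
y = [1.2, 3.4, 2.1, 2.8, 1.9]

def log_lik(theta):
    # log L(theta; y) = sum_i log f(y_i - theta), f the standard Cauchy density
    return sum(-math.log(math.pi * (1.0 + (yi - theta) ** 2)) for yi in y)

# Evaluate the likelihood on a grid and renormalize it to integrate to one:
# by (1), the result is the conditional density of the MLE given the
# configuration ancillary c = y - theta_hat.
step = 0.01
grid = [i * step for i in range(-500, 1001)]           # theta in [-5, 10]
lik = [math.exp(log_lik(t)) for t in grid]
area = step * sum(lik)                                 # Riemann-sum normalizer
density = [l / area for l in lik]

print(step * sum(density))      # ≈ 1.0: renormalized to a proper density
print(grid[max(range(len(lik)), key=lik.__getitem__)])  # grid mode ≈ the MLE
```

The same renormalization applies with any known error density in place of the Cauchy; only log_lik changes.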
For example, we might require that the first few moments of A be constant (in θ), or that the distribution of A be free of θ in a neighborhood of the true value θ0, say. The definition used by Barndorff-Nielsen & Cox [1] is that A is qth order locally ancillary for θ near θ0 if f(a; θ0 + δ/√n) = f(a; θ0) + O(n^(−q/2)). Approximate ancillary statistics are also discussed in McCullagh [9] and Reid [11]. The notion of an approximate ancillary statistic has turned out to be rather important for the asymptotic theory of statistical inference, because the location family model result given in (1) can be generalized, to give the result

    p(θ̂ | a; θ) ≐ c(θ, a) |j(θ̂)|^(1/2) L(θ; y)/L(θ̂; y),    (2)

where c is a normalizing constant, j(θ) = −∂² log L(θ)/∂θ∂θ' is the observed information function, a is an approximately ancillary statistic, and in the right-hand side y is a function of (θ̂, a). This approximation, which is typically much more accurate than the normal approximation to the distribution of θ̂, is known as Barndorff-Nielsen's approximation, or the p* approximation, and is reviewed in Reid [10] and considered in detail in Barndorff-Nielsen & Cox [1]. In (2) the likelihood function is normalized by a slightly more elaborate looking formula than the simple integral in (1), but the principle of renormalizing the likelihood function has still been applied. A distribution function approximation analogous to (2) is also available: see Barndorff-Nielsen & Cox [1] and Reid [11].

Suppose that the parameter θ is partitioned as θ = (ψ, λ), where ψ is the parameter of interest and λ is a nuisance parameter. For example, ψ might parameterize a regression model for survival time as a function of several covariates, and λ might parameterize the baseline hazard function. If we can partition the minimal sufficient statistic for θ as (S, T), such that

    f(s, t; θ) = f(s|t; ψ) f(t; λ),    (3)

then T is an ancillary statistic for ψ in the sense of the above definition. Factorizations of the form given in (3) are the exception, though, and we more often have a factorization of the type

    f(s, t; θ) = f(s|t; ψ) f(t; ψ, λ)    (4)

or

    f(s, t; θ) = f(s|t; ψ, λ) f(t; λ).    (5)

An example of (4) is the two-by-two table, with ψ the log odds ratio. The conditional distribution of a single cell entry, given the row and column totals, depends only on ψ: this is the basis for Fisher's exact test (see Conditionality Principle). Although it is sometimes claimed that the row total is an ancillary statistic for the parameter of interest, this is in fact not the case, at least according to the definition of ancillarity discussed here. Some more general notions of ancillarity have been proposed in the literature, but have not proved to be widely useful in theoretical developments. Further discussion of ancillarity and conditional inference in the presence of nuisance parameters can be found in Liang & Zeger [8] and Reid [10].

Ancillary statistics are defined for parametric models, so would not be defined, for example, in Cox's proportional hazards regression model (see Cox Regression Model). Cox [2] did originally argue, though, that the full likelihood function could be partitioned into a factor that provided information on the regression parameters β and a factor that provided no information about β in the absence of knowledge of the baseline hazard: the situation is analogous to (4) but, as was pointed out by several discussants of [2], the likelihood factor that is used in the analysis is not in fact the conditional likelihood for any observable random variables. Cox [3] developed the notion of partial likelihood to justify the now standard estimates of β.

References

[1] Barndorff-Nielsen, O.E. & Cox, D.R. (1994). Inference and Asymptotics. Chapman & Hall, London. (This is the only book currently available that gives a survey of many of the main ideas in statistical theory along with a detailed discussion of the asymptotic theory for likelihood inference that has been developed since 1980 (see Large-sample Theory). Chapter 2.5 discusses ancillary statistics, and Chapters 6 and 7 consider the p* approximation and the related distribution function approximation.)
[2] Cox, D.R. (1972). Regression models and life tables (with discussion), Journal of the Royal Statistical Society, Series B 34, 187–220.
[3] Cox, D.R. (1975). Partial likelihood, Biometrika 62, 269–276.
[4] Efron, B. & Hinkley, D.V. (1978). Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information (with discussion), Biometrika 65, 457–487.


[5] Fisher, R.A. (1934). Two new properties of mathematical likelihood, Proceedings of the Royal Society, Series A 144, 285–307.
[6] Fraser, D.A.S. (1968). The Structure of Inference. Wiley, New York.
[7] Kalbfleisch, J.D. (1975). Sufficiency and conditionality, Biometrika 62, 251–259.
[8] Liang, K.-Y. & Zeger, S.L. (1995). Inference based on estimating functions in the presence of nuisance parameters, Statistical Science 10, 158–172.
[9] McCullagh, P. (1987). Tensor Methods in Statistics. Chapman & Hall, London.
[10] Reid, N. (1988). Saddlepoint approximations in statistical inference, Statistical Science 3, 213–238. (A review of the p* approximation and related developments, up to 1987.)
[11] Reid, N. (1995). The roles of conditioning in inference, Statistical Science 10, 138–157.

Further Reading

Barndorff-Nielsen, O.E. (1983). On a formula for the distribution of the maximum likelihood estimator, Biometrika 70, 343–365. (Among the early papers on the p* approximation, this is one of the easier ones.)
Barndorff-Nielsen, O.E. (1986). Inference on full or partial parameters based on the standardized signed log likelihood ratio, Biometrika 73, 307–322. (Introduces the distribution function approximation based on the p* approximation.)
Fisher, R.A. (1973). Statistical Methods and Scientific Inference, 3rd Ed. Oliver & Boyd, Edinburgh. (Includes a discussion of ancillary statistics with many examples.)
Jorgensen, B. (1994). The rules of conditional inference: is there a universal definition of nonformation?, Journal of the Italian Statistical Society 3, 355–384. (This considers several definitions of ancillarity in the presence of nuisance parameters.)
Kalbfleisch, J.G. (1985). Probability and Statistical Inference, Vol. II, 2nd Ed. Springer-Verlag, New York. (This is one of the few undergraduate textbooks that treats ancillarity in any depth, in Chapter 15.)

N. REID

Encyclopedia of Biostatistics, Online © 2005 John Wiley & Sons, Ltd. This article is © 2005 John Wiley & Sons, Ltd. This article was published in the Encyclopedia of Biostatistics in 2005 by John Wiley & Sons, Ltd. DOI: 10.1002/0470011815.b2a15002