On the Relation between Frequency Inference and Likelihood

Donald A. Pierce
Radiation Effects Research Foundation
5-2 Hijiyama Park, Hiroshima 732, Japan
[email protected]

1. Introduction

Modern higher-order asymptotic theory for frequency inference, as largely described in Barndorff-Nielsen & Cox (1994), is fundamentally likelihood-oriented. This has led to various indications of more unity in frequency and Bayesian inferences than previously recognised, e.g. Sweeting (1987, 1992). A main distinction, the stronger dependence of frequency inference on the sample space (the outcomes not observed), is isolated in a single term of the asymptotic formula for P-values. We capitalise on that to consider from an asymptotic perspective the relation of frequency inferences to the likelihood principle, with results towards a quantification of that relation as follows. Whereas generally the conformance of frequency inference to the likelihood principle is only to first order, i.e. O(n^{-1/2}), there is at least one major class of applications, likely more, where the conformance is to second order, i.e. O(n^{-1}). There the issue is the effect of censoring mechanisms. On a more subtle point, it is often overlooked that there are few, if any, practical settings where exact frequency inferences exist broadly enough to allow precise comparison to the likelihood principle. To the extent that ideal frequency inference may only be defined to second order, there is sometimes, perhaps usually, no conflict at all.

The strong likelihood principle is that in even totally unrelated experiments, inferences from data yielding the same likelihood function should be the same (the weak one being essentially the sufficiency principle). But at least in theoretical considerations, the relevant experiments are usually not totally unrelated, since it would then be difficult to relate their parameters. Usually there is considered some underlying stochastic model or process, the distinction between the experiments being further specifications governing what aspects of the data from that underlying process are observed. That is, specifications, such as stopping rules or censoring mechanisms, that are 'uninformative' in the sense of not affecting the likelihood function. The version of the likelihood principle then of concern is that a given dataset, in terms of the underlying process, should lead to the same inference when it arises from different observational specifications. Perhaps this should be called the intermediate likelihood principle, but the stronger one may be of sufficiently limited interest as to make that terminology superfluous.

2. Main result

Consider in particular the following two settings, where in both instances the inference under consideration is the P-value for testing an hypothesis.

a) Data are generated sequentially as independent observations, subject to some stopping rule. The issue is to what extent inferences should depend on which of various stopping rules, all consistent with the observed data, was in action.

b) Data such as survival times are generated independently according to a given distribution, but censored. The issue is to what extent inferences should depend on which of various censoring mechanisms, all consistent with the observed data, was in action.

These are two of the most commonly considered examples for likelihood principle considerations, but there is a major distinction between them not previously recognised. In setting (a) it is correctly believed that ideal P-values generally depend on the stopping rule to no less than O(n^{-1/2}). On the other hand, by putting together several substantial results due to others, it can be seen that for setting (b) ideal P-values depend on the censoring mechanism only to O(n^{-1}). In each case n denotes the number of observations in the underlying process. The use of the terminology "ideal P-values" as opposed to "exact P-values" is addressed in the final section.

This is a satisfying result. Attempts to reconcile frequency and likelihood inferences for setting (a) involve the interpretation of P-values rather than their magnitude, e.g. Birnbaum (1961). Many statisticians seem, though, reasonably satisfied with the idea that frequency inferences should depend on the stopping rule. But as first stressed by Pratt (1961), if P-values in setting (b) should depend on the censoring mechanism, then for uncensored data they would depend on what censoring might have been done had some items not failed in the available time. In principle this seems totally unacceptable, and to many statisticians it casts the foundational issues regarding censoring in a different light than those regarding stopping rules. At least to some extent, the results stated above conform to that intuition.

The fundamental distinction between the settings (a) and (b) in these respects is as follows. As indicated later, ideal frequency P-values can be approximated to second order from the likelihood function supplemented by quantities computed from the independent (or martingale) observed contributions to the likelihood. That is, one might say, this second-order approximation to P-values does not depend on 'data not observed', but only on more details regarding the observed data than are carried by the likelihood function. But these details, the contributions to the likelihood, may or may not depend on the specifications governing what aspects of the data from the underlying process are observed, e.g. stopping rules or censoring mechanisms.

In setting (b) it is clear that the contributions to the likelihood are the same regardless of the censoring mechanism, being densities and survival probabilities computed in terms of the underlying stochastic model. But in setting (a), the nature of the (typically martingale) contributions to the likelihood depends on the stopping rule. For example, when the underlying process is Bernoulli trials, for fixed sample size they are probabilities of the individual trials, whereas for stopping at a given number of successes they correspond to the factors introduced to the likelihood when the number of successes steps by unity.
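To make the Bernoulli contrast concrete, the following display is an illustration I am adding rather than one appearing in the original; the notation (y_i for the trial indicators, w_j for the inter-success waiting times) is mine. Both designs yield the identical likelihood function, yet it factorises into different sets of contributions.

```latex
% Bernoulli trials with success probability \theta; the observed sequence
% has k successes in n trials.  The likelihood kernel is the same under
% both designs, but the contributions differ.
\[
L(\theta)
  = \underbrace{\prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}}_{\text{fixed } n:\ n \text{ Bernoulli factors}}
  = \theta^{k}(1-\theta)^{\,n-k}
  = \underbrace{\prod_{j=1}^{k} \theta\,(1-\theta)^{\,w_j-1}}_{\text{stop at the } k\text{th success: } k \text{ geometric factors}}
\]
% Here w_j is the number of trials from just after the (j-1)th success up
% to and including the j-th, so that \sum_{j} w_j = n.
```

The point is that quantities computed from the individual factors, rather than from their product, can and do depend on which stopping rule was in action.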

3. Indication of argument

Consider testing ψ(θ) = ψ against a one-sided alternative, where θ is the full multi-dimensional parameter and the interest parameter ψ is a scalar function. Write l(θ; y) for the log likelihood based on data y, and r = r_ψ(y) for the signed square root of the generalised likelihood ratio statistic for testing the hypothesis. That is, $r = \mathrm{sgn}(\hat\psi - \psi)\,[2\{l(\hat\theta) - l(\tilde\theta)\}]^{1/2}$, where $\hat\theta$ and $\tilde\theta$ are the unconstrained and constrained maximum likelihood estimators. Under the hypothesis r is standard normal to O(n^{-1/2}), and one of the primary advances of modern higher-order asymptotics is Barndorff-Nielsen's modification r* = r + r^{-1} log(u/r), which is standard normal to O(n^{-3/2}) [Barndorff-Nielsen & Cox (1994, Sect. 6.6)]. The quantity u involves partial derivatives of l(θ; y) with respect to aspects of the data y. It is only this that introduces dependence of r* on the reference set, so that resulting P-values do not conform to the likelihood principle.

It is not generally possible to approximate u to O(n^{-1}) in terms of only the likelihood function from given data y. This is seen by considering examples of type (a) of the previous section. For a class of such problems, Pierce & Peters (1994) identified the non-zero coefficient of n^{-1/2} in the asymptotic expansion of the effect on u of stopping rules, more particularly in the expansion of ratios of P-values for the 'same' data arising from different stopping rules. This establishes the claim made regarding examples (a), which comes as no surprise.

There are various second-order approximations to u incorporating extra-likelihood information, and in my view the most promising of these is due to Skovgaard (1996). This involves computing (expected) covariances of θ-scores at different parameter values, and related quantities. For full-rank or curved exponential families this is very simply done, differing little from ordinary information calculations. For full-rank exponential families the approximation to u is exact, and thus it is good when, as is usual, the curvature is small.

Remarkably, Severini (1999) has shown that when the required covariances for that approximation are replaced by sample covariances based on the contributions to the log likelihood, the approximation to u remains valid to second order. Moreover, this approximation to u is also exact for full-rank exponential families. The approximation has a troublesome character, though. Generally speaking, it depends on the data in more detail than through the minimal sufficient statistic, thus violating the sufficiency principle. More particularly, it is usually, to a second-order extent, ill-defined, since the contributions to the likelihood can be defined in more than one sense. For this and other reasons, it seems generally preferable for exponential families to use the Skovgaard approximation, since it is easily calculated.

However, it is precisely this 'defect' in the Severini approximation that provides the main result here. In possibly violating the sufficiency principle, the method utilises non-likelihood information regarding the reference set, in a manner that sheds light on likelihood principle considerations. In particular, for the class of problems (b) involving censored data, the approximation yields precisely the same P-value regardless of the censoring mechanism, since for a given dataset the contributions to the likelihood are the same for any censoring mechanism. Since this common P-value agrees to second order with each of those based on exact values of u, this establishes the claim regarding examples (b).
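The following minimal numerical sketch, my illustration rather than anything from the original, may help fix ideas about r, u and r*. It tests the rate of an exponential sample, a full-rank one-parameter exponential family in which u reduces to the Wald statistic in the canonical parametrisation, and an exact P-value is available for comparison since 2*lam0*S has a chi-squared distribution with 2n degrees of freedom under the hypothesis. The function name and the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2, norm

def exp_rate_pvalues(s, n, lam0):
    """One-sided test of H0: lam = lam0 against lam < lam0, for an
    exponential(rate lam) sample of size n with sufficient statistic
    s = sum of the observations.  Returns P-values based on r, on
    Barndorff-Nielsen's r*, and the exact value from 2*lam0*S ~ chi2(2n).
    """
    lam_hat = n / s                      # unconstrained MLE of the rate

    def loglik(lam):                     # l(lam) = n log(lam) - lam * s
        return n * np.log(lam) - lam * s

    # Signed root likelihood ratio statistic.  The canonical parameter is
    # eta = -lam, so the sign is that of eta_hat - eta0 = lam0 - lam_hat.
    r = np.sign(lam0 - lam_hat) * np.sqrt(2.0 * (loglik(lam_hat) - loglik(lam0)))

    # For a full-rank exponential family, u is the Wald statistic in the
    # canonical parametrisation: u = j(eta_hat)^(1/2) * (eta_hat - eta0),
    # with observed information j = n / lam^2.
    u = (np.sqrt(n) / lam_hat) * (lam0 - lam_hat)

    r_star = r + np.log(u / r) / r       # r* = r + r^{-1} log(u/r)

    exact = chi2.sf(2.0 * lam0 * s, df=2 * n)
    return norm.sf(r), norm.sf(r_star), exact

# n = 10 observations summing to 15, testing lam0 = 1: first order gives
# about 0.085, while r* gives about 0.070, matching the exact 0.070.
print(exp_rate_pvalues(s=15.0, n=10, lam0=1.0))
```

The third-order accuracy of r* seen even at n = 10 is what is meant above by standard normality to O(n^{-3/2}).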

4. Discussion

I have here referred to "ideal frequency P-values" rather than "exact frequency P-values", and this pertains to issues not yet well understood. Only for very special settings are there exact frequency P-values for the setting described at the outset of Section 3. These settings are within full-rank exponential families and transformation (e.g. location-scale) models, and even then only for special forms of the parametric function ψ(θ). When considering data from two experiments leading to the same likelihood function, there are few, if any, examples where exact frequency inference in the presence of nuisance parameters is possible for both settings. In general ideal P-values can only be defined asymptotically, and I have assumed here that to second order these are given by Barndorff-Nielsen's r*. I believe that this is correct, but although the distributional aspects of r* are clear (standard normal to third order), the inferential basis is less clear. It explicitly involves Fisherian notions of conditioning on ancillary statistics, but what is to be resolved hinges on a general theory for dealing with nuisance parameters. Presently it is unclear whether this theory will involve second-order or third-order considerations. If it develops that ideal P-values are only defined to second order, then for the censored data class of examples there will be no conflict at all with the likelihood principle.

It was indicated that along with modern developments in asymptotics have come various indications of more unity in frequency and Bayesian inferences than previously recognised. This mainly relates to how the approaches deal with nuisance parameters. Underlying Barndorff-Nielsen's r* is a decomposition of the adjustment r^{-1} log(u/r) = NP + INF, the first term pertaining to the fitting of nuisance parameters, and the other pertaining to limited information on the interest parameter [Pierce & Peters (1992), Barndorff-Nielsen & Cox (1994, Sect. 6.6.4)]. Whereas exp{-0.5 r^2} is the profile likelihood, L_mp(ψ) ∝ exp{-0.5 (r + NP)^2} is a modified profile likelihood, allowing for fitting of nuisance parameters (op. cit., Sect. 8.2). In most respects this modified profile likelihood can, to second order, be treated as the likelihood arising from a one-parameter model. There are close second-order connections between full Bayesian analysis for inference about ψ(θ) and a "one-parameter" Bayes analysis based on L_mp(ψ) and a prior distribution only for ψ [Sweeting (1987)]. More to the point of this paper, the term INF entering into r*, but not L_mp(ψ), represents to a large extent the relation of frequency inference to likelihood. This was explored in Pierce & Peters (1994), but only for one-parameter problems where NP does not arise. It was conjectured there that to second order the adjustment NP, and hence L_mp(ψ), would conform to the likelihood principle, so that all non-likelihood aspects of frequency inference would be carried by INF. Results here confirm that conjecture regarding NP for setting (b), but not for the reasons we had anticipated, since it then holds for INF as well.

The ideal roles of asymptotic theory include not only numerical approximations, but clarification of the nature of inference by de-emphasising differences between specific settings and focusing on the common underlying issues. The development based on the likelihood ratio approximation to densities [Barndorff-Nielsen & Cox (1994, Sect. 6.2)], and its consequences as used here, achieve both goals remarkably well.
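For reference, the quantities figuring in the preceding paragraphs can be collected in one display. This consolidation is mine, with the normalisation of the profile likelihood spelled out from the definition of r in Section 3 rather than taken verbatim from the cited sources.

```latex
% The adjustment to r and the associated likelihoods, notation as in the
% text; each likelihood is standardised to its maximum.
\[
r^{*} = r + r^{-1}\log(u/r) = r + \mathrm{NP} + \mathrm{INF},
\]
\[
L_{p}(\psi) \propto \exp\{-\tfrac{1}{2}r^{2}\}
            = \exp\{-[\,l(\hat\theta) - l(\tilde\theta_{\psi})\,]\}
  \quad\text{(profile likelihood)},
\]
\[
L_{mp}(\psi) \propto \exp\{-\tfrac{1}{2}(r + \mathrm{NP})^{2}\}
  \quad\text{(modified profile likelihood)}.
\]
```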

REFERENCES

Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. London: Chapman and Hall.

Birnbaum, A. (1961). On the foundations of statistical inference: binary experiments. Annals of Mathematical Statistics 32, 414-435.

Pierce, D.A. and Peters, D. (1992). Practical use of higher-order asymptotics for multiparameter exponential families. Journal of the Royal Statistical Society B 54, 701-737.

Pierce, D.A. and Peters, D. (1994). Higher-order asymptotics and the likelihood principle: one-parameter models. Biometrika 81, 1-10.

Pratt, J. (1961). Review of Lehmann's Testing Statistical Hypotheses. Journal of the American Statistical Association 56, 163-166.

Severini, T.A. (1999). An empirical adjustment to the likelihood ratio statistic. Biometrika 86 (in press).

Skovgaard, I. (1996). An explicit large-deviation approximation to one-parameter tests. Bernoulli 2, 145-165.

Sweeting, T.J. (1987). Discussion of paper by Cox and Reid. Journal of the Royal Statistical Society B 49, 20-21.

Sweeting, T.J. (1992). Discussion of paper by Pierce and Peters. Journal of the Royal Statistical Society B 54, 732-733.

RÉSUMÉ

Modern higher-order asymptotic theory for frequency inference is fundamentally likelihood-oriented. A main distinction between frequency and likelihood inference, the stronger dependence of the former on the sample space, that is, on the set of possible outcomes, is isolated in a single term of the asymptotic formula for P-values. We capitalise on this to quantify the relation of frequency inferences to the likelihood principle. Whereas the conformance of frequency inference to the likelihood principle is in general only to first order, we find that there is at least one major class of applications, and probably more, where the conformance is to second order. In these applications, the issue is the effect of censoring mechanisms. Some further general considerations are discussed, relating to the fact that, except in very special settings, ideal frequency inferences are defined only asymptotically.