On the Relation Between Frequency Inference and Likelihood

Donald A. Pierce
Radiation Effects Research Foundation
5-2 Hijiyama Park
Hiroshima 732, Japan

1. Introduction

Modern higher-order asymptotic theory for frequency inference, as largely described in Barndorff-Nielsen & Cox (1994), is fundamentally likelihood-oriented. This has led to various indications of more unity in frequency and Bayesian inferences than previously recognised, e.g. Sweeting (1987, 1992). A main distinction, the stronger dependence of frequency inference on the sample space (the outcomes not observed), is isolated in a single term of the asymptotic formula for P-values. We capitalise on that to consider, from an asymptotic perspective, the relation of frequency inferences to the likelihood principle, with the following results towards quantifying that relation.

Whereas generally the conformance of frequency inference to the likelihood principle is only to first order, i.e. $O(n^{-1/2})$, there is at least one major class of applications, likely more, where the conformance is to second order, i.e. $O(n^{-1})$. There the issue is the effect of censoring mechanisms. On a more subtle point, it is often overlooked that there are few, if any, practical settings where exact frequency inferences exist broadly enough to allow precise comparison to the likelihood principle. To the extent that ideal frequency inference may only be defined to second order, there is sometimes, perhaps usually, no conflict at all.

The strong likelihood principle is that even in totally unrelated experiments, inferences from data yielding the same likelihood function should be the same (the weak one being essentially the sufficiency principle). But at least in theoretical considerations, the relevant experiments are usually not totally unrelated, since it would then be difficult to relate their parameters.
Usually some underlying stochastic model or process is considered, and the experiments are distinguished by further specifications governing which aspects of the data from that underlying process are observed. These are specifications, such as stopping rules or censoring mechanisms, that are ‘uninformative’ in the sense of not affecting the likelihood function. The version of the likelihood principle then of concern is that a given dataset, in terms of the underlying process, should lead to the same inference when it arises from different observational specifications. Perhaps this should be called the intermediate likelihood principle, but the stronger one may be of sufficiently limited interest as to make that terminology superfluous.

2. Main result

Consider in particular the following two settings, where in both instances the inference under consideration is the P-value for testing a hypothesis.

a) Data are generated sequentially as independent observations, subject to some stopping rule. The issue is to what extent inferences should depend on which of various stopping rules, all consistent with the observed data, was in action.

b) Data such as survival times are generated independently according to a given distribution, but censored. The issue is to what extent inferences should depend on which of various censoring mechanisms, all consistent with the observed data, was in action.

These are two of the examples most commonly considered in discussions of the likelihood principle, but there is a major distinction between them not previously recognised. In setting (a) it is correctly believed that ideal P-values generally depend on the stopping rule to no less than $O(n^{-1/2})$. On the other hand, by putting together several substantial results due to others, it can be seen that for setting (b) ideal P-values depend on the censoring mechanism only to $O(n^{-1})$. In each case $n$ denotes the number of observations in the underlying process.
The use of the terminology “ideal P-values” as opposed to “exact P-values” is addressed in the final section.

This is a satisfying result. Attempts to reconcile frequency and likelihood inferences for setting (a) involve the interpretation of P-values rather than their magnitude, e.g. Birnbaum (1961). Many statisticians seem, though, reasonably satisfied with the idea that frequency inferences should depend on the stopping rule. But as first stressed by Pratt (1961), if P-values in setting (b) should depend on the censoring mechanism, then for uncensored data they would depend on what censoring might have been done had some items not failed in the available time. In principle this seems totally unacceptable, and to many statisticians it casts the foundational issues regarding censoring in a different light than those regarding stopping rules. At least to some extent, the results stated above conform to that intuition.

The fundamental distinction between settings (a) and (b) in these respects is as follows. As indicated later, ideal frequency P-values can be approximated to second order from the likelihood function supplemented by quantities computed from the independent (or martingale) observed contributions to the likelihood. That is, one might say, this second-order approximation to P-values does not depend on ‘data not observed’, but only on more details regarding the observed data than are carried by the likelihood function. But these details, the contributions to the likelihood, may or may not depend on the specifications governing what aspects of the data from the underlying process are observed, e.g. stopping rules or censoring mechanisms.

In setting (b) it is clear that the contributions to the likelihood are the same regardless of the censoring mechanism, being densities and survival probabilities computed in terms of the underlying stochastic model.
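The point about setting (b) can be illustrated with a minimal sketch. It assumes an exponential survival model and uninformative censoring; the model, the numerical values, and the function names `observe` and `loglik_contributions` are hypothetical choices made for illustration and do not come from the paper. Two different censoring mechanisms are applied to the same latent failure times and happen to produce the same observed data, so the per-observation contributions (log density for a failure, log survival probability for a censored item) coincide.

```python
import math

def observe(latent_times, censor_times):
    """Return observed (time, event) pairs; event is True for a failure, False if censored."""
    return [(min(t, c), t <= c) for t, c in zip(latent_times, censor_times)]

def loglik_contributions(observed, rate):
    """Per-observation log-likelihood contributions under an exponential(rate) model.

    A failure at time t contributes the log density, log(rate) - rate * t.
    An item censored at time t contributes the log survival probability, -rate * t.
    """
    return [(math.log(rate) if event else 0.0) - rate * t for t, event in observed]

# Hypothetical latent failure times from the underlying process (illustrative values).
latent = [0.4, 1.3, 2.7, 0.9]

# Mechanism A: every item is censored at a common fixed time.
observed_a = observe(latent, [2.0, 2.0, 2.0, 2.0])
# Mechanism B: item-specific censoring times that happen to yield the same observed data.
observed_b = observe(latent, [5.0, 3.0, 2.0, 1.5])

rate = 0.5  # arbitrary parameter value at which the contributions are evaluated
contrib_a = loglik_contributions(observed_a, rate)
contrib_b = loglik_contributions(observed_b, rate)

print("same observed data:", observed_a == observed_b)
print("same contributions:", contrib_a == contrib_b)
```

The contributions are functions of the observed pairs and the underlying model only, which is why nothing about the mechanism that produced the censoring times enters the calculation.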
But in setting (a), the nature of the (typically martingale) contributions to the likelihood depends on the stopping rule. For example, when the underlying process is Bernoulli trials, for fixed sample size they are probabilities of the trials, whereas for stopping at a given number of successes they correspond to the factors introduced to the likelihood when the number of successes steps by unity.

3. Indication of argument

Consider testing $\psi(\theta) = \psi$ against a one-sided alternative, where $\theta$ is the full multi-dimensional parameter and the interest parameter $\psi$ is a scalar function. Write $l(\theta; y)$ for the log likelihood based on data $y$, and $r = r_\psi(y)$ for the signed square root of the generalised likelihood ratio statistic for testing the hypothesis. That is,

$$ r = \operatorname{sgn}(\hat\psi - \psi)\,\bigl[2\{l(\hat\theta) - l(\tilde\theta)\}\bigr]^{1/2}, $$

where $\hat\theta$ and $\tilde\theta$ are the unconstrained and constrained maximum likelihood estimators, and $\hat\psi = \psi(\hat\theta)$. Under the hypothesis $r$ is standard normal to $O(n^{-1/2})$, and one of the primary advances of modern higher-order asymptotics is Barndorff-Nielsen’s modification $r^* = r + r^{-1}\log(u/r)$, which is standard normal to $O(n^{-3/2})$ [Barndorff-Nielsen & Cox (1994, Sect. 6.6)]. The quantity $u$ involves partial derivatives of $l(\theta; y)$ with respect to aspects of the data $y$. It is only this that introduces dependence of $r^*$ on the reference set, so that resulting P-values do not conform to the likelihood principle.

It is not generally possible to approximate $u$ to $O(n^{-1})$ in terms of only the likelihood function from given data $y$. This is seen by considering examples of type (a) of the previous section. For a class of such problems, Pierce & Peters (1994) identified the non-zero coefficient of $n^{-1/2}$ in the asymptotic expansion of the effect on $u$ of stopping rules, more particularly in the expansion of ratios of P-values for the ‘same’ data arising from different stopping rules. This establishes the claim made regarding examples (a), which comes as no surprise.
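As a concrete instance of setting (a), the following sketch compares the two Bernoulli stopping rules mentioned earlier: a fixed number of trials, and stopping at a given number of successes. The data (3 successes in 12 trials), the null value 0.5, the choice of tail, and all function names are assumptions made for illustration; they are not taken from the paper. The signed root $r$ uses only the likelihood, so it is the same under both rules, whereas the tail probabilities computed from the two sampling distributions differ.

```python
import math

def p_value_fixed_n(successes, n, p0):
    """P(X <= successes) for X ~ Binomial(n, p0): the sample size is fixed in advance."""
    return sum(math.comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes + 1))

def p_value_inverse(successes, n, p0):
    """P(N >= n) when sampling stops at the given number of successes.

    At least n trials are needed exactly when the first n - 1 trials
    contain fewer than `successes` successes.
    """
    return sum(math.comb(n - 1, k) * p0**k * (1 - p0)**(n - 1 - k)
               for k in range(successes))

def signed_root(successes, n, p0):
    """Signed root of the likelihood ratio statistic (requires 0 < successes < n).

    The log likelihood s*log(p) + (n - s)*log(1 - p) is the same under both
    stopping rules, up to an additive constant, so r is the same as well.
    """
    p_hat = successes / n

    def loglik(p):
        return successes * math.log(p) + (n - successes) * math.log(1 - p)

    magnitude = math.sqrt(2 * (loglik(p_hat) - loglik(p0)))
    return math.copysign(magnitude, p_hat - p0)

def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Illustrative data: 3 successes observed in 12 trials, testing p = 0.5 against p < 0.5.
s, n, p0 = 3, 12, 0.5
p_fixed = p_value_fixed_n(s, n, p0)
p_inverse = p_value_inverse(s, n, p0)
r = signed_root(s, n, p0)

print("fixed sample size:      ", p_fixed)
print("stop at s successes:    ", p_inverse)
print("first-order, Phi(r):    ", normal_cdf(r))
```

The first-order approximation based on $r$ cannot distinguish the two stopping rules; any dependence on the rule has to enter through extra-likelihood information of the kind carried by $u$.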
There are various second-order approximations to $u$ incorporating extra-likelihood information, and in my view the most promising of these is due to Skovgaard (1996). This involves computing (expected) covariances of $\theta$-scores at different parameter values, and related quantities. For full-rank or curved exponential families this is very simply done, differing little from ordinary information calculations. For full-rank exponential families the approximation to $u$ is exact, and thus it is good when, as is usual, the curvature is small.

Remarkably, Severini (1999) has shown that when the required covariances for that approximation are replaced by sample covariances based on the contributions to the log likelihood, the approximation to $u$ remains valid to second order. Moreover, this approximation to $u$ is also exact for full-rank exponential families. The approximation has a troublesome character, though. Generally speaking, it depends on the data in more detail than through the minimal sufficient statistic, thus violating the sufficiency principle. More particularly, it is usually, to a second-order extent, ill-defined, since the contributions to the likelihood can be defined in more than one sense. For this and other reasons, it seems generally preferable for exponential families to use the Skovgaard approximation, since it is easily calculated.

However, it is precisely this ‘defect’ in the Severini approximation that provides the main result here. In possibly violating the sufficiency principle, the method utilises non-likelihood information regarding the reference set, in a manner that sheds light on likelihood principle considerations. In particular, for the class of problems (b) involving censored data, the approximation yields precisely the same P-value regardless of the censoring mechanism, since for a given dataset the contributions to the likelihood are the same for any censoring mechanism.
Since this common P-value agrees to second order with each of those based on exact values of $u$, this establishes the claim regarding examples (b).

4. Discussion

I have here referred to “ideal frequency P-values” rather than “exact frequency P-values”, and this pertains to issues not yet well understood.
