On the Relation Between Frequency Inference and Likelihood
Donald A. Pierce
Radiation Effects Research Foundation
5-2 Hijiyama Park, Hiroshima 732, Japan
[email protected]

1. Introduction

Modern higher-order asymptotic theory for frequency inference, as largely described in Barndorff-Nielsen & Cox (1994), is fundamentally likelihood-oriented. This has led to various indications of more unity in frequency and Bayesian inferences than previously recognised, e.g. Sweeting (1987, 1992). A main distinction, the stronger dependence of frequency inference on the sample space—the outcomes not observed—is isolated in a single term of the asymptotic formula for P-values. We capitalise on that to consider from an asymptotic perspective the relation of frequency inferences to the likelihood principle, with results towards a quantification of that as follows. Whereas generally the conformance of frequency inference to the likelihood principle is only to first order, i.e. O(n^{-1/2}), there is at least one major class of applications, likely more, where the conformance is to second order, i.e. O(n^{-1}). There the issue is the effect of censoring mechanisms. On a more subtle point, it is often overlooked that there are few, if any, practical settings where exact frequency inferences exist broadly enough to allow precise comparison to the likelihood principle. To the extent that ideal frequency inference may only be defined to second order, there is sometimes—perhaps usually—no conflict at all.

The strong likelihood principle is that in even totally unrelated experiments, inferences from data yielding the same likelihood function should be the same (the weak one being essentially the sufficiency principle). But at least in theoretical considerations, the relevant experiments are usually not totally unrelated, since it would then be difficult to relate their parameters. There is usually considered some underlying stochastic model or process, with the distinction between the experiments being further specifications governing what aspects of the data from that underlying process are observed. That is, specifications—such as stopping rules or censoring mechanisms—that are 'uninformative' in the sense of not affecting the likelihood function. The version of the likelihood principle then of concern is that a given dataset, in terms of the underlying process, should lead to the same inference when it arises from different observational specifications. Perhaps this should be called the intermediate likelihood principle, but the stronger one may be of sufficiently limited interest as to make that terminology superfluous.

2. Main result

Consider in particular the following two settings, where in both instances the inference under consideration is the P-value for testing an hypothesis.

a) Data are generated sequentially as independent observations, subject to some stopping rule. The issue is to what extent inferences should depend on which of various stopping rules, all consistent with the observed data, was in action.

b) Data such as survival times are generated independently according to a given distribution, but censored. The issue is to what extent inferences should depend on which of various censoring mechanisms, all consistent with the observed data, was in action.

These are two of the most commonly considered examples for likelihood principle considerations, but there is a major distinction between them not previously recognised.
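To fix ideas, the standard illustrations of such 'uninformative' specifications run as follows. For setting (a), suppose the underlying process is Bernoulli trials with success probability p and the data comprise y successes in n trials. With the sample size fixed in advance the likelihood is

    L(p) = (n choose y) p^y (1 − p)^(n − y),

whereas under the rule 'stop at the y-th success' it is

    L(p) = (n − 1 choose y − 1) p^y (1 − p)^(n − y);

the two differ only in a constant factor, so as functions of p they are identical. For setting (b), with t_i denoting the observed failure times, c_i the censoring times, and f and S the density and survivor function of the underlying model, censoring that acts independently of the failure mechanism gives a likelihood of the form

    L(θ) = ∏ f(t_i; θ) × ∏ S(c_i; θ) × (factors involving only the censoring mechanism),

with the first product over observed failures and the second over censored items, so again the function of θ is unaffected by the observational specification.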
In setting (a) it is correctly believed that ideal P-values generally depend on the stopping rule to no less than O(n^{-1/2}). On the other hand, by putting together several substantial results due to others, it can be seen that for setting (b) ideal P-values depend on the censoring mechanism only to O(n^{-1}). In each case n denotes the number of observations in the underlying process. The use of the terminology "ideal P-values" as opposed to "exact P-values" is addressed in the final section.

This is a satisfying result. Attempts to reconcile frequency and likelihood inferences for setting (a) involve the interpretation of P-values rather than their magnitude, e.g. Birnbaum (1961). Many statisticians seem, though, reasonably satisfied with the idea that frequency inferences should depend on the stopping rule. But as first stressed by Pratt (1961), if P-values in setting (b) should depend on the censoring mechanism, then for uncensored data they would depend on what censoring might have been done had some items not failed in the available time. In principle this seems totally unacceptable, and to many statisticians it casts the foundational issues regarding censoring in a different light than those regarding stopping rules. At least to some extent, the results stated above conform to that intuition.

The fundamental distinction between the settings (a) and (b) in these respects is as follows. As indicated later, ideal frequency P-values can be approximated to second order from the likelihood function supplemented by quantities computed from the independent (or martingale) observed contributions to the likelihood. That is, one might say, this second-order approximation to P-values does not depend on 'data not observed', but only on more details regarding the observed data than are carried by the likelihood function. But these details, the contributions to the likelihood, may or may not depend on the specifications governing what aspects of the data from the underlying process are observed, e.g. stopping rules or censoring mechanisms.

In setting (b) it is clear that the contributions to the likelihood are the same regardless of the censoring mechanism, being densities and survival probabilities computed in terms of the underlying stochastic model. But in setting (a), the nature of the (typically martingale) contributions to the likelihood depends on the stopping rule. For example, when the underlying process is Bernoulli trials, for fixed sample size they are probabilities of the trials, whereas for stopping at a given number of successes they correspond to the factors introduced to the likelihood when the number of successes steps by unity.

3. Indication of argument

Consider testing ψ(θ) = ψ against a one-sided alternative, where θ is the full multi-dimensional parameter and the interest parameter ψ is a scalar function. Write l(θ; y) for the log likelihood based on data y, and r = r_ψ(y) for the signed square root of the generalised likelihood ratio statistic for testing the hypothesis. That is, r = sgn(ψ̂ − ψ)[2{l(θ̂) − l(θ̃)}]^{1/2}, where θ̂ and θ̃ are the unconstrained and constrained maximum likelihood estimators. Under the hypothesis r is standard normal to O(n^{-1/2}), and one of the primary advances of modern higher-order asymptotics is Barndorff-Nielsen's modification r* = r + r^{-1} log(u/r), which is standard normal to O(n^{-3/2}) [Barndorff-Nielsen & Cox (1994, Sect. 6.6)]. The quantity u involves partial derivatives of l(θ; y) with respect to aspects of the data y.
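As a minimal numerical sketch of what this involves (hypothetical data and ad hoc function names, not taken from the paper), consider again Bernoulli trials, with 20 successes in 30 trials and p = 0.5 tested against the alternative p > 0.5. The sketch uses the standard fact that in a scalar full-rank exponential family the quantity u reduces to the Wald statistic (θ̂ − θ0) j(θ̂)^{1/2} in the canonical parameterisation. The likelihood, and hence r, is the same whether the sample size was fixed in advance or the experiment stopped at the 20th success, but u differs, and with it the r*-based P-value, reflecting the dependence on the stopping rule noted in Section 2.

```python
# Minimal numerical sketch (hypothetical data, not from the paper) of the
# stopping-rule dependence in setting (a).  It relies on the standard fact
# that in a scalar full-rank exponential family the quantity u in
# r* = r + (1/r) log(u/r) is the Wald statistic
# (theta_hat - theta_0) * j(theta_hat)^(1/2) in the canonical parameterisation.
from math import log, sqrt, erf, copysign

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def r_star(r, u):
    """Barndorff-Nielsen's modified signed root, r* = r + (1/r) log(u/r)."""
    return r + log(u / r) / r

# Hypothetical data: y = 20 successes in n = 30 Bernoulli trials.
# Test p = p0 = 0.5 against the one-sided alternative p > p0.
n, y, p0 = 30, 20, 0.5
p_hat = y / n

def loglik(p):
    # Likelihood kernel, identical under both stopping rules.
    return y * log(p) + (n - y) * log(1 - p)

# The signed root r depends only on the likelihood function.
r = copysign(sqrt(2.0 * (loglik(p_hat) - loglik(p0))), p_hat - p0)

# Stopping rule 1: fixed sample size n (binomial model for y).
# Canonical parameter theta = logit(p); information n * p * (1 - p).
theta_hat, theta0 = log(p_hat / (1 - p_hat)), log(p0 / (1 - p0))
u_fixed = (theta_hat - theta0) * sqrt(n * p_hat * (1 - p_hat))
p_fixed = 1.0 - Phi(r_star(r, u_fixed))           # upper tail in y

# Stopping rule 2: stop at the k-th success, k = y (negative binomial model
# for the number of trials N).  Canonical parameter phi = log(1 - p);
# information k * (1 - p) / p^2.  Evidence for p > p0 is a small N, i.e. the
# lower tail in N; phi_hat increases with N, so the signs of r and u are
# taken from phi_hat - phi0 and the P-value is Phi(r*).
k = y
phi_hat, phi0 = log(1 - p_hat), log(1 - p0)
r_stop = copysign(abs(r), phi_hat - phi0)
u_stop = (phi_hat - phi0) * sqrt(k * (1 - p_hat) / p_hat ** 2)
p_stop = Phi(r_star(r_stop, u_stop))              # lower tail in N

print(f"|r| (common to both stopping rules): {abs(r):.4f}")
print(f"fixed-n   P-value ~ {p_fixed:.4f}")       # roughly 0.034
print(f"stop-at-k P-value ~ {p_stop:.4f}")        # roughly 0.040
```

In this discrete example the conventional tail probabilities Pr(Y ≥ 20; p0) and Pr(N ≤ 30; p0) refer to the same event in the underlying process and so coincide exactly; what differs between the stopping rules is the smoother, mid-P-like quantity that r* tracks, one aspect of the distinction between ideal and exact P-values taken up in the final section.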
It is only the quantity u that introduces dependence of r* on the reference set, so that the resulting P-values do not conform to the likelihood principle. It is not generally possible to approximate u to O(n^{-1}) in terms of only the likelihood function from given data y. This is seen by considering examples of type (a) of the previous section. For a class of such problems, Pierce & Peters (1994) identified the non-zero coefficient of n^{-1/2} in the asymptotic expansion of the effect on u of stopping rules—more particularly in the expansion of ratios of P-values for the 'same' data arising from different stopping rules. This establishes the claim made regarding examples (a), which comes as no surprise.

There are various second-order approximations to u incorporating extra-likelihood information, and in my view the most promising of these is due to Skovgaard (1996). This involves computing (expected) covariances of θ-scores at different parameter values, and related quantities. For full-rank or curved exponential families this is very simply done, differing little from ordinary information calculations. For full-rank exponential families the approximation to u is exact, and thus it is good when, as is usual, the curvature is small.

Remarkably, Severini (1999) has shown that when the required covariances for that approximation are replaced by sample covariances based on the contributions to the log likelihood, the approximation to u remains valid to second order. Moreover, this approximation to u is also exact for full-rank exponential families.

The approximation has a troublesome character, though. Generally speaking, it depends on the data in more detail than through the minimal sufficient statistic, thus violating the sufficiency principle. More particularly, it is usually—to a second-order extent—ill-defined, since the contributions to the likelihood can be defined in more than one sense. For this and other reasons, it seems generally preferable for exponential families to use the Skovgaard approximation, since it is easily calculated.

However, it is precisely this 'defect' in the Severini approximation that provides the main result here. In possibly violating the sufficiency principle, the method utilises non-likelihood information regarding the reference set, in a manner that sheds light on likelihood principle considerations. In particular, for the class of problems (b) involving censored data, the approximation yields precisely the same P-value regardless of the censoring mechanism, since for a given dataset the contributions to the likelihood are the same for any censoring mechanism (such contributions are displayed numerically in the sketch below). Since this common P-value agrees to second order with each of those based on exact values of u, this establishes the claim regarding examples (b).

4. Discussion

I have here referred to "ideal frequency P-values" rather than "exact frequency P-values", and this pertains to issues not yet well understood.
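Returning to the ingredients of Section 3 for setting (b), the following small sketch (made-up survival data; the function names are ad hoc) displays per-observation contributions to the log likelihood for censored exponential data, together with a sample covariance of the corresponding score contributions at two parameter values. This is not Severini's construction of u itself, only its raw inputs; the point is that nothing in these contributions refers to the censoring mechanism.

```python
# Hypothetical censored survival data under an exponential(rate = lam) model.
# The per-observation contribution to the log likelihood is the log density
# for an observed failure and the log survivor probability for a censored
# time; neither refers to the censoring mechanism itself.
import math

times  = [0.7, 1.2, 2.5, 3.1, 0.4, 2.0]   # made-up times
events = [1,   1,   0,   1,   1,   0  ]   # 1 = failure observed, 0 = censored

def contribution(t, d, lam):
    """Log-likelihood contribution of a single observation."""
    return (math.log(lam) - lam * t) if d == 1 else (-lam * t)

def score_contribution(t, d, lam):
    """Derivative of the contribution with respect to lam."""
    return (1.0 / lam - t) if d == 1 else (-t)

def sample_cov(xs, ys):
    """Plain sample covariance of two equal-length sequences."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / m

lam_hat = sum(events) / sum(times)   # MLE of the rate with censored data
lam_0 = 0.5                          # an arbitrary hypothesised rate

scores_hat = [score_contribution(t, d, lam_hat) for t, d in zip(times, events)]
scores_0 = [score_contribution(t, d, lam_0) for t, d in zip(times, events)]

print("contributions at lam_hat:",
      [round(contribution(t, d, lam_hat), 3) for t, d in zip(times, events)])
print("sample covariance of score contributions at (lam_hat, lam_0):",
      round(sample_cov(scores_hat, scores_0), 4))
```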