PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003

Likelihood Inference in the Presence of Nuisance Parameters

N. Reid, D.A.S. Fraser
Department of Statistics, University of Toronto, Toronto, Canada M5S 3G3

We describe some recent approaches to likelihood based inference in the presence of nuisance parameters. Our approach is based on plotting the likelihood function and the p-value function, using recently developed third order approximations. Orthogonal parameters and adjustments to profile likelihood are also discussed. Connections to classical approaches of conditional and marginal inference are outlined.

1. INTRODUCTION

We take the view that the most effective form of inference is provided by the observed likelihood function along with the associated p-value function. In the case of a scalar parameter the likelihood function is simply proportional to the density function. The p-value function can be obtained exactly if there is a one-dimensional statistic that measures the parameter. If not, the p-value can be obtained to a high order of approximation using recently developed methods of likelihood asymptotics. In the presence of nuisance parameters, the likelihood function for a (one-dimensional) parameter of interest is obtained via an adjustment to the profile likelihood function. The p-value function is obtained from quantities computed from the likelihood function using a canonical parametrization ϕ = ϕ(θ), which is computed locally at the data point. This generalizes the method of eliminating nuisance parameters by conditioning or marginalizing to more general contexts. In Section 2 we give some background notation and introduce the notion of orthogonal parameters. In Section 3 we illustrate the p-value function approach in a simple model with no nuisance parameters. Profile likelihood and adjustments to profile likelihood are described in Section 4. Third order p-values for problems with nuisance parameters are described in Section 5. Section 6 describes the classical conditional and marginal likelihood approach.

2. NOTATION AND ORTHOGONAL PARAMETERS

We assume our measurement(s) y can be modelled as coming from a probability distribution with density or mass function f(y; θ), where θ = (ψ, λ) takes values in R^d. We assume ψ is a one-dimensional parameter of interest, and λ is a vector of nuisance parameters. If there is interest in more than one component of θ, the methods described here can be applied to each component of interest in turn. The likelihood function is

    L(θ) = L(θ; y) = c(y) f(y; θ);    (1)

it is defined only up to arbitrary multiples which may depend on y but not on θ. This ensures in particular that the likelihood function is invariant to one-to-one transformations of the measurement(s) y. In the context of independent, identically distributed sampling, where y = (y_1, . . . , y_n) and each y_i follows the model f(y; θ), the likelihood function is proportional to Π f(y_i; θ) and the log-likelihood function becomes a sum of independent and identically distributed components:

    ℓ(θ) = ℓ(θ; y) = Σ log f(y_i; θ) + a(y).    (2)

The maximum likelihood estimate θ̂ is the value of θ at which the likelihood takes its maximum, and in regular models is defined by the score equation

    ℓ′(θ̂; y) = 0.    (3)

The observed Fisher information function j(θ) is the curvature of the log-likelihood:

    j(θ) = −ℓ″(θ)    (4)

and the expected Fisher information is the model quantity

    i(θ) = E{−ℓ″(θ)} = ∫ {−ℓ″(θ; y)} f(y; θ) dy.    (5)

If y is a sample of size n then i(θ) = O(n).

In accord with the partitioning of θ we partition the observed and expected information matrices and use the notation

    i(θ) = [ i_{ψψ}  i_{ψλ} ; i_{λψ}  i_{λλ} ]    (6)

and

    i^{−1}(θ) = [ i^{ψψ}  i^{ψλ} ; i^{λψ}  i^{λλ} ].    (7)

We say ψ is orthogonal to λ (with respect to expected Fisher information) if i_{ψλ}(θ) = 0. When ψ is scalar a transformation from (ψ, λ) to (ψ, η(ψ, λ)) such that ψ is orthogonal to η can always be found (Cox and Reid [1]). The most directly interpreted consequence of parameter orthogonality is that the maximum likelihood estimates of orthogonal components are asymptotically independent.
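These definitions translate directly into code. The following minimal Python sketch (our illustration, not part of the paper) solves the score equation (3) for an i.i.d. exponential model with rate θ and evaluates the observed information (4) at the maximum; the data values are hypothetical.

```python
# Minimal sketch of (2)-(4) for an i.i.d. exponential model with rate theta:
# f(y; theta) = theta * exp(-theta * y); the data below are hypothetical.
import numpy as np
from scipy.optimize import brentq

y = np.array([0.8, 1.2, 0.3, 2.1, 0.9])

def loglik(theta):
    # l(theta; y) = sum_i log f(y_i; theta), equation (2)
    return np.sum(np.log(theta) - theta * y)

def score(theta):
    # l'(theta; y); the MLE solves the score equation (3)
    return len(y) / theta - np.sum(y)

theta_hat = brentq(score, 1e-6, 100.0)   # here simply n / sum(y)

def obs_info(theta):
    # j(theta) = -l''(theta), equation (4)
    return len(y) / theta**2

print(theta_hat, obs_info(theta_hat))    # MLE and curvature at the maximum
```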

Example 1: ratio of Poisson means. Suppose y_1 and y_2 are independent counts modelled as Poisson with means λ and ψλ, respectively. Then the likelihood function is

    L(ψ, λ; y_1, y_2) = e^{−λ(1+ψ)} ψ^{y_2} λ^{y_1+y_2}

and ψ is orthogonal to η(ψ, λ) = λ(ψ + 1). In fact in this example the likelihood function factors as L_1(ψ) L_2(η), which is a stronger property than parameter orthogonality. The first factor is the likelihood for a binomial distribution with index y_1 + y_2 and probability of success ψ/(1 + ψ), and the second is that for a Poisson distribution with mean η.

Example 2: exponential regression. Suppose y_i, i = 1, . . . , n are independent observations, each from an exponential distribution with mean λ exp(−ψ x_i), where x_i is known. The log-likelihood function is

    ℓ(ψ, λ; y) = −n log λ + ψ Σ x_i − λ^{−1} Σ y_i exp(ψ x_i)    (8)

and i_{ψλ}(θ) = 0 if and only if Σ x_i = 0. The stronger property of factorization of the likelihood does not hold.
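The factorization in Example 1 is easy to verify numerically. The sketch below (ours; the counts are made up) checks that the joint Poisson log-likelihood equals the binomial factor in ψ plus the Poisson factor in η = λ(1 + ψ), exactly, at several parameter values.

```python
# Check of the Example 1 factorization: Poisson(y1; lam) * Poisson(y2; psi*lam)
# equals Binomial(y2; y1+y2, psi/(1+psi)) * Poisson(y1+y2; eta), eta = lam*(1+psi).
import numpy as np
from scipy.stats import poisson, binom

y1, y2 = 7, 11                       # made-up counts
for psi in (0.5, 1.0, 2.3):
    for lam in (2.0, 5.5):
        eta = lam * (1 + psi)
        joint = poisson.logpmf(y1, lam) + poisson.logpmf(y2, psi * lam)
        factored = (binom.logpmf(y2, y1 + y2, psi / (1 + psi))
                    + poisson.logpmf(y1 + y2, eta))
        assert np.isclose(joint, factored)
```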

3. LIKELIHOOD INFERENCE WITH NO NUISANCE PARAMETERS

We assume now that θ is one-dimensional. A plot of the log-likelihood function as a function of θ can quickly reveal irregularities in the model, such as a non-unique maximum, or a maximum on the boundary, and can also provide a visual guide to deviation from normality, as the log-likelihood function for a normal distribution is a parabola and hence symmetric about the maximum. In order to calibrate the log-likelihood function we can use the approximation

    r(θ) = sign(θ̂ − θ) [2{ℓ(θ̂) − ℓ(θ)}]^{1/2} ∼ N(0, 1),    (9)

which is equivalent to the result that twice the log likelihood ratio is approximately χ²₁. This will typically provide a better approximation than the asymptotically equivalent result that

    θ̂ − θ ∼ N(0, i^{−1}(θ))    (10)

as it partially accommodates the potential asymmetry in the log-likelihood function. These two approximations are sometimes called first order approximations because in the context where the log-likelihood is O(n), we have (under regularity conditions) results such as

    Pr{r(θ; y) ≤ r(θ; y⁰)} = Pr{Z ≤ r(θ; y⁰)} {1 + O(n^{−1/2})}    (11)

where Z follows a standard normal distribution. It is relatively simple to improve the approximation to third order, i.e. with relative error O(n^{−3/2}), using the so-called r* approximation

    r*(θ) = r(θ) + {1/r(θ)} log{q(θ)/r(θ)} ∼ N(0, 1)    (12)

where q(θ) is a likelihood-based statistic and a generalization of the Wald statistic (θ̂ − θ) j^{1/2}(θ̂); see Fraser [2].

Example 3: truncated Poisson. Suppose that y follows a Poisson distribution with mean θ = b + µ, where b is a background rate that is assumed known. In this model the p-value function can be computed exactly simply by summing the Poisson probabilities. Because the Poisson distribution is discrete, the p-value could reasonably be defined as either

    Pr(y ≤ y⁰; θ)    (13)

or

    Pr(y < y⁰; θ),    (14)

sometimes called the upper and lower p-values, respectively.

For the values y⁰ = 17, b = 6.7, Figure 1 shows the likelihood function as a function of µ and the p-value function p(µ) computed using both the upper and lower p-values. In Figure 2 we plot the mid p-value, which is

    Pr(y < y⁰) + (1/2) Pr(y = y⁰).    (15)

The approximation based on r* is nearly identical to the mid-p-value; the difference cannot be seen on Figure 2. Table I compares the p-values at µ = 0. This example is taken from Fraser, Reid and Wong [3].

Table I: The p-values for testing µ = 0, i.e. that the number of observed events is consistent with the background.

    upper p-value          0.0005993
    lower p-value          0.0002170
    mid p-value            0.0004081
    Φ(r*)                  0.0003779
    Φ(r)                   0.0004416
    Φ{(θ̂ − θ) ĵ^{1/2}}     0.0062427

Figure 1: The likelihood function (top) and p-value function (bottom) for the Poisson model, with b = 6.7 and y⁰ = 17. For µ = 0 the p-value interval is (0.99940, 0.99978).

Figure 2: The upper and lower p-value functions and the mid-p-value function for the Poisson model, with b = 6.7 and y⁰ = 17. The approximation based on Φ(r*) is identical to the mid-p-value function to the drawing accuracy.
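The Table I entries follow directly from these definitions. In the sketch below (ours), q in (12) is taken to be the Wald statistic in the canonical parametrization ϕ = log θ of the Poisson model, an assumption on our part that is consistent with the description of q as a generalization of the Wald statistic; the printed values should match Table I to within rounding.

```python
# Sketch of the Table I computation for Example 3: y0 = 17 observed,
# known background b = 6.7, testing mu = 0 (so theta = b).
import numpy as np
from scipy.stats import poisson, norm

y0, b = 17, 6.7
theta = b

# tail (testing) versions: one minus the p-value functions (13)-(15) at mu = 0
upper = poisson.sf(y0 - 1, theta)            # Pr(y >= y0)
lower = poisson.sf(y0, theta)                # Pr(y >  y0)
mid = lower + 0.5 * poisson.pmf(y0, theta)   # mid p-value, cf. (15)

# likelihood root (9) for l(theta) = y log(theta) - theta, with theta_hat = y0
r = np.sqrt(2 * (y0 * np.log(y0 / theta) - (y0 - theta)))
q = np.sqrt(y0) * np.log(y0 / theta)         # Wald statistic in phi = log(theta)
rstar = r + np.log(q / r) / r                # the r* approximation (12)
wald = (y0 - theta) / np.sqrt(y0)            # (theta_hat - theta) * j_hat^{1/2}

for name, p in [("upper", upper), ("lower", lower), ("mid", mid),
                ("Phi(r*)", norm.sf(rstar)), ("Phi(r)", norm.sf(r)),
                ("Wald", norm.sf(wald))]:
    print(f"{name:8s} {p:.7f}")
```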


4. PROFILE AND ADJUSTED PROFILE LIKELIHOOD FUNCTIONS

We now assume θ = (ψ, λ) and denote by λ̂_ψ the restricted maximum likelihood estimate obtained by maximizing the likelihood function over the nuisance parameter λ with ψ fixed. The profile likelihood function is

    L_p(ψ) = L(ψ, λ̂_ψ);    (16)

also sometimes called the concentrated likelihood or the peak likelihood. The approximations of the previous section generalize to

    r_p(ψ) = sign(ψ̂ − ψ) [2{ℓ_p(ψ̂) − ℓ_p(ψ)}]^{1/2} ∼ N(0, 1)    (17)

and

    ψ̂ − ψ ∼ N(0, i^{ψψ}(θ)).    (18)

These approximations, like the ones in Section 3, are derived from asymptotic results which assume that n → ∞, that we have a vector y of independent, identically distributed observations, and that the dimension of the nuisance parameter does not increase with n. Further regularity conditions are required on the model, such as are outlined in textbook treatments of the asymptotic theory of maximum likelihood. In finite samples these approximations can be misleading: profile likelihood is too concentrated, and can be maximized at the 'wrong' value.

Example 4: normal theory regression. Suppose y_i = x_i′β + ε_i, where x_i = (x_{i1}, . . . , x_{ip}) is a vector of known covariate values, β is an unknown parameter of length p, and ε_i is assumed to follow a N(0, ψ) distribution. The maximum likelihood estimate of ψ is

    ψ̂ = (1/n) Σ (y_i − x_i′β̂)²    (19)

which tends to be too small, as it does not allow for the fact that p unknown parameters (the components of β) have been estimated. In this example there is a simple improvement, based on the result that the likelihood function for (β, ψ) factors into

    L_1(β, ψ; ȳ) L_2{ψ; Σ(y_i − x_i′β̂)²}    (20)

where L_2(ψ) is proportional to the marginal distribution of Σ(y_i − x_i′β̂)². Figure 3 shows the profile likelihood and the marginal likelihood; it is easy to verify that the latter is maximized at

    ψ̂_m = {1/(n − p)} Σ (y_i − x_i′β̂)²    (21)

which in fact is an unbiased estimate of ψ.

Figure 3: Profile likelihood and marginal likelihood for the variance parameter in a normal theory regression with 21 observations and three covariates (the "Stack Loss" data included in the Splus distribution). The profile likelihood is maximized at a smaller value of ψ, and is narrower; in this case both the estimate and its estimated standard error are too small.

Example 5: product of exponential means. Suppose we have independent pairs of observations y_{1i}, y_{2i}, where y_{1i} ∼ Exp(ψ λ_i) and y_{2i} ∼ Exp(ψ/λ_i), i = 1, . . . , n. The limiting normal theory for profile likelihood does not apply in this context, as the dimension of the parameter is not fixed but increasing with the sample size, and it can be shown that

    ψ̂ → (π/4) ψ    (22)

as n → ∞ (Cox and Reid [4]).
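The inconsistency in Example 5 is easy to see by simulation. The sketch below (ours, with arbitrary λᵢ) uses the closed form ψ̂ = n⁻¹ Σ √(y₁ᵢ y₂ᵢ) of the profile maximum likelihood estimate, which follows from maximizing (16) in this model (our derivation), and should show it settling near (π/4)ψ rather than ψ.

```python
# Simulation sketch for Example 5: the profile MLE of psi converges to
# (pi/4)*psi.  Exp(m) here denotes the exponential distribution with mean m.
import numpy as np

rng = np.random.default_rng(0)
n, psi = 20000, 2.0
lam = rng.uniform(0.5, 2.0, size=n)          # arbitrary nuisance parameters
y1 = rng.exponential(psi * lam)              # y1i ~ Exp(psi * lambda_i)
y2 = rng.exponential(psi / lam)              # y2i ~ Exp(psi / lambda_i)

# profiling out lambda_i gives lambda_hat_i = sqrt(y1i / y2i), and then
# l_p(psi) = -2n log(psi) - (2/psi) * sum sqrt(y1i * y2i), maximized at:
psi_hat = np.mean(np.sqrt(y1 * y2))
print(psi_hat, np.pi / 4 * psi)              # these should nearly agree
```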

The theory of higher order approximations can be used to derive a general improvement to the profile likelihood or log-likelihood function, which takes the form

    ℓ_a(ψ) = ℓ_p(ψ) + (1/2) log |j_{λλ}(ψ, λ̂_ψ)| + B(ψ)    (23)

where j_{λλ} is defined by the partitioning of the observed information function, and B(ψ) is a further adjustment function that is O_p(1). Several versions of B(ψ) have been suggested in the statistical literature; we use the one defined in Fraser [5], given by

    B(ψ) = −(1/2) log |ϕ_{λ′}(ψ, λ̂_ψ)′ j_{ϕϕ}(ψ̂, λ̂) ϕ_{λ′}(ψ, λ̂_ψ)|.    (24)

This depends on a so-called canonical parametrization ϕ = ϕ(θ) = ℓ_{;V}(θ; y⁰), which is discussed in Fraser, Reid and Wu [6] and Reid [7].

In the special case that ψ is orthogonal to the nuisance parameter λ, a simplification of ℓ_a(ψ) is available as

    ℓ_CR(ψ) = ℓ_p(ψ) − (1/2) log |j_{λλ}(ψ, λ̂_ψ)|    (25)

which was first introduced in Cox and Reid [1]. The change of sign on log |j| comes from the orthogonality equations. In i.i.d. sampling, ℓ_p(ψ) is O_p(n), i.e. is the sum of n bounded random variables, whereas log |j| is O_p(1). A drawback of ℓ_CR is that it is not invariant to one-to-one reparametrizations of λ, all of which are orthogonal to ψ. In contrast ℓ_a(ψ) is invariant to transformations from θ = (ψ, λ) to θ′ = (ψ, η(ψ, λ)), sometimes called interest-respecting transformations.

Example 5 continued. In this example ψ is orthogonal to λ = (λ_1, . . . , λ_n), and

    ℓ_CR(ψ) = −(3n/2) log ψ − (2/ψ) Σ (y_{1i} y_{2i})^{1/2}.    (26)

The value that maximizes ℓ_CR is 'more nearly consistent' than the maximum likelihood estimate, as ψ̂_CR → (π/3)ψ.

5. P-VALUES FROM PROFILE LIKELIHOOD

The limiting theory for profile likelihood gives first order approximations to p-values, such as

    p(ψ) = Φ(r_p)    (27)

and

    p(ψ) = Φ{(ψ̂ − ψ) j_p^{1/2}(ψ̂)},    (28)

where j_p(ψ) = −ℓ_p″(ψ) is the curvature of the profile log-likelihood, although the discussion in the previous section suggests these may not provide very accurate approximations. As in the scalar parameter case, though, a much better approximation is available using Φ(r*), where

    r*(ψ) = r_p(ψ) + 1/{r_p(ψ)} log{Q(ψ)/r_p(ψ)}    (29)

and Q can also be derived from the likelihood function and a function ϕ(θ; y⁰) as

    Q = (ν̂ − ν̂_ψ)/σ̂_ν

where

    ν(θ) = e_ψ^T ϕ(θ) ,
    e_ψ = ψ_{ϕ′}(θ̂_ψ)/|ψ_{ϕ′}(θ̂_ψ)| ,
    σ̂_ν² = |j_{(λλ)}(θ̂_ψ)| / |j_{(θθ)}(θ̂)| ,
    |j_{(θθ)}(θ̂)| = |j_{θθ}(θ̂)| |ϕ_{θ′}(θ̂)|^{−2} ,
    |j_{(λλ)}(θ̂_ψ)| = |j_{λλ}(θ̂_ψ)| |ϕ_{λ′}(θ̂_ψ)|^{−2} .

The derivation is described in Fraser, Reid and Wu [6] and Reid [7]. The key ingredients are the log-likelihood function ℓ(θ) and a reparametrization ϕ(θ) = ϕ(θ; y⁰), which is defined by using an approximating model at the observed data point y⁰; this approximation in turn is based on a conditioning argument. A closely related approach is due to Barndorff-Nielsen; see Barndorff-Nielsen and Cox [10, Ch. 7]; the two approaches are compared in [7].

Example 6: comparing two binomials. Table II shows the employment history of men and women at the Space Telescope Science Institute, as reported in Science, 14 February 2003. We denote by y_1 the number of males who left and model this as Binomial with sample size 19 and probability p_1; similarly the number of females who left, y_2, is modelled as Binomial with sample size 7 and probability p_2. We write the parameter of interest as

    ψ = log [p_1(1 − p_2) / {p_2(1 − p_1)}].    (30)

The hypothesis of interest is p_1 = p_2, or ψ = 0. The p-value function for ψ is plotted in Figure 4. The p-value at ψ = 0 is 0.00028 using the normal approximation to r_p, and is 0.00048 using the normal approximation to r*. Using Fisher's exact test gives a mid p-value of 0.00090, so the approximations are anticonservative in this case.

Table II: Employment of men and women at the Space Telescope Science Institute, 1998-2002 (from Science magazine, Volume 299, page 993, 14 February 2003).

             Left   Stayed   Total
    Men         1       18      19
    Women       5        2       7
    Total       6       20      26

Figure 4: The p-value function for the log-odds ratio, ψ, for the data of Table II. The value ψ = 0 corresponds to the hypothesis that the probabilities of leaving are equal for men and women.
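For Example 6, both the first order p-value based on r_p and Fisher's exact mid-p-value can be reproduced in a few lines (our sketch); the hypergeometric conditioning in the second computation anticipates Section 6.

```python
# Sketch for Example 6: two-binomial comparison, testing psi = 0.
import numpy as np
from scipy.stats import norm, hypergeom

y1, n1 = 1, 19                     # men who left, total men
y2, n2 = 5, 7                      # women who left, total women

def loglik(p1, p2):
    return (y1 * np.log(p1) + (n1 - y1) * np.log(1 - p1)
            + y2 * np.log(p2) + (n2 - y2) * np.log(1 - p2))

psi_hat = np.log((y1 / n1) * (1 - y2 / n2)) - np.log((y2 / n2) * (1 - y1 / n1))
p0 = (y1 + y2) / (n1 + n2)         # common MLE under p1 = p2, i.e. psi = 0
rp = np.sign(psi_hat) * np.sqrt(2 * (loglik(y1 / n1, y2 / n2) - loglik(p0, p0)))
print(norm.cdf(rp))                # about 0.00028, as quoted in the text

# Fisher's exact test: y1 | y1 + y2 = 6 is hypergeometric
# (26 staff, of whom 19 are men, 6 leavers in total)
h = hypergeom(M=n1 + n2, n=n1, N=y1 + y2)
print(h.cdf(y1 - 1) + 0.5 * h.pmf(y1))   # mid p-value, about 0.00090
```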


Example 7: Poisson with estimated background. Suppose in the context of Example 3 that we allow for imprecision in the background, replacing b by an unknown parameter β with estimated value β̂. We assume that the background estimate is obtained from a Poisson count x, which has mean kβ, and the signal measurement is an independent Poisson count, y, with mean β + µ. We have β̂ = x/k and var(β̂) = β/k, so the estimated precision of the background gives us a value for k. For example, if the background is estimated to be 6.7 ± 2.1 this implies a value for k of 6.7/(2.1)² ≐ 1.5. Uncertainty in the standard error of the background is ignored here. We now outline the steps in the computation of the r* approximation (29).

The log-likelihood function based on the two independent observations x and y is

    ℓ(β, µ) = x log(kβ) − kβ + y log(β + µ) − β − µ    (31)

with canonical parameter ϕ = {log β, log(β + µ)}′. Then

    ϕ_{θ′}(θ) = ∂ϕ(θ)/∂θ′ = [ 1/β  0 ; 1/(β + µ)  1/(β + µ) ],    (32)

    ϕ_{θ′}^{−1} = [ β  0 ; −β  β + µ ],    (33)

from which

    ψ_{ϕ′} = (−β, β + µ).    (34)

Then we have, writing β̂_µ for the constrained maximum likelihood estimate of β with µ fixed, so that θ̂_ψ = (β̂_µ, µ),

    χ(θ̂) = {−β̂_µ log β̂ + (β̂_µ + µ) log(β̂ + µ̂)} / {β̂_µ² + (β̂_µ + µ)²}^{1/2},    (35)

    χ(θ̂_ψ) = {−β̂_µ log β̂_µ + (β̂_µ + µ) log(β̂_µ + µ)} / {β̂_µ² + (β̂_µ + µ)²}^{1/2},    (36)

    |j_{(θθ)}(θ̂)| = xy = k β̂ (β̂ + µ̂),    (37)

    |j_{(λλ)}(θ̂_ψ)| = {x (β̂_µ + µ)² + y β̂_µ²} / {(β̂_µ + µ)² + β̂_µ²},    (38)

and finally

    Q = [(β̂_µ + µ) log{(β̂ + µ̂)/(β̂_µ + µ)} − β̂_µ log(β̂/β̂_µ)] {k β̂ (β̂ + µ̂)}^{1/2} / {k β̂ (β̂_µ + µ)² + (β̂ + µ̂) β̂_µ²}^{1/2}.    (39)

The likelihood root is

    r = sign(Q) [2{ℓ(β̂, µ̂) − ℓ(β̂_µ, µ)}]^{1/2}    (40)
      = sign(Q) (2[k β̂ log(β̂/β̂_µ) + (β̂ + µ̂) log{(β̂ + µ̂)/(β̂_µ + µ)} − k(β̂ − β̂_µ) − {β̂ + µ̂ − (β̂_µ + µ)}])^{1/2}.    (41)

The third order approximation to the p-value function is 1 − Φ(r*), where

    r* = r + (1/r) log(Q/r).    (42)

Figure 5 shows the p-value function for µ using the mid-p-value function from the Poisson with no adjustment for the error in the background, and the p-value function from 1 − Φ(r*). The p-value for testing µ = 0 is 0.00464, allowing for the uncertainty in the background, whereas it is 0.000408 ignoring this uncertainty.

The hypothesis E(y) = β could also be tested by modelling the mean of y as νβ, say, and testing the value ν = 1. In this formulation we can eliminate the nuisance parameter exactly by using the binomial distribution of y conditioned on the total x + y, as described in Example 1. This gives a mid-p-value of 0.00521. The computation is much easier than that outlined above, and seems quite appropriate for testing the equality of the two means. However if inference about the mean of the signal is needed, in the form of a point estimate or confidence bounds, then the formulation as a ratio seems less natural, at least in the context of HEP experiments. A more complete comparison of methods for this problem is given in Linnemann [8].
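The steps (31)-(42) translate directly into code (our sketch below). One caveat: the text's 6.7 ± 2.1 gives k ≈ 1.5, while the caption of Figure 5 quotes a background error of 1.75 (k ≈ 2.2); the latter value appears to reproduce the quoted p-value of 0.00464 at µ = 0, and is used here.

```python
# Sketch of the r* computation (35)-(42) for Example 7.
import numpy as np
from scipy.stats import norm

b_hat, err = 6.7, 1.75              # background estimate and its error (Figure 5)
k = b_hat / err**2                  # from var(beta_hat) = beta / k
x, y = k * b_hat, 17.0              # implied background count, observed signal

def loglik(beta, mu):               # equation (31)
    return x * np.log(k * beta) - k * beta + y * np.log(beta + mu) - beta - mu

def beta_mu(mu):
    # constrained MLE of beta for fixed mu: positive root of the score
    # equation, (k+1) b^2 + {(k+1) mu - x - y} b - x mu = 0
    a1 = (k + 1) * mu - x - y
    return (-a1 + np.sqrt(a1**2 + 4 * (k + 1) * x * mu)) / (2 * (k + 1))

def pvalue(mu):                     # 1 - Phi(r*), valid away from mu = mu_hat
    bh, muh = x / k, y - x / k      # overall maximum likelihood estimates
    bm = beta_mu(mu)
    Q = (((bm + mu) * np.log((bh + muh) / (bm + mu)) - bm * np.log(bh / bm))
         * np.sqrt(k * bh * (bh + muh))
         / np.sqrt(k * bh * (bm + mu) ** 2 + (bh + muh) * bm ** 2))   # (39)
    r = np.sign(Q) * np.sqrt(2 * (loglik(bh, muh) - loglik(bm, mu)))  # (40)
    return norm.sf(r + np.log(Q / r) / r)                             # (42)

print(pvalue(0.0))                  # roughly 0.0046 with these inputs
```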

Figure 5: Comparison of the p-value functions computed assuming the background is known and using the mid-p-value, with the third order approximation allowing a background error of 1.75.

6. CONDITIONAL AND MARGINAL LIKELIHOOD

In special model classes, it is possible to eliminate nuisance parameters by either conditioning or marginalizing. The conditional or marginal likelihood then gives essentially exact inference for the parameter of interest, if this likelihood can itself be computed exactly. In Example 1 above, L_1 is the density for y_2 conditional on y_1 + y_2, so is a conditional likelihood for ψ. This is an example of the more general class of linear exponential families:

    f(y; ψ, λ) = exp{ψ s(y) + λ′ t(y) − c(ψ, λ) − d(y)};    (43)

in which

    f_cond(s | t; ψ) = exp{ψ s − C_t(ψ) − D_t(s)}    (44)

defines the conditional likelihood. The comparison of two binomials in Example 6 is in this class, with ψ as defined at (30) and λ = log{p_2/(1 − p_2)}. The difference of two Poisson means, in Example 7, cannot be formulated this way, however, even though the Poisson distribution is an exponential family, because the parameter of interest ψ is not a component of the canonical parameter.

It can be shown that in models of the form (43) the log-likelihood ℓ_a(ψ) = ℓ_p(ψ) + (1/2) log |j_{λλ}| approximates the conditional log-likelihood ℓ_cond(ψ) = log f_cond(s | t; ψ), and that

    p(ψ) = Φ(r*)    (45)

where

    r* = r_a + (1/r_a) log(Q/r_a) ,
    r_a = [2{ℓ_a(ψ̂_a) − ℓ_a(ψ)}]^{1/2} ,
    Q = (ψ̂_a − ψ) {j_a(ψ̂_a)}^{1/2} ,

with ψ̂_a the maximizer of ℓ_a and j_a = −ℓ_a″, approximates the p-value function with relative error O(n^{−3/2}) in i.i.d. sampling. An asymptotically equivalent approximation based on the profile log-likelihood is

    p(ψ) = Φ(r*)    (46)

where

    r* = r_p + (1/r_p) log(Q/r_p) ,
    r_p = [2{ℓ_p(ψ̂) − ℓ_p(ψ)}]^{1/2} ,
    Q = (ψ̂ − ψ) {j_p(ψ̂)}^{1/2} |j_{λλ}(ψ, λ̂_ψ)|^{1/2} / |j_{λλ}(ψ̂, λ̂)|^{1/2} .

In the latter approximation an adjustment for nuisance parameters is made to Q, whereas in the former the adjustment is built into the likelihood function. Approximation (46) was used in Figure 4.
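As an illustration of (44), here is a sketch (ours) of the conditional log-likelihood for the two-binomial comparison of Example 6: with s = y_1 and t = y_1 + y_2, the distribution of s given t is noncentral hypergeometric, and the nuisance parameter λ is eliminated exactly.

```python
# Conditional log-likelihood (44) for Example 6: s = y1, t = y1 + y2,
# f_cond(s | t; psi) proportional to C(19, s) * C(7, t - s) * exp(psi * s).
import numpy as np
from scipy.special import comb
from scipy.optimize import minimize_scalar

n1, n2 = 19, 7
s_obs, t = 1, 6
s = np.arange(max(0, t - n2), min(n1, t) + 1)    # support of s given t

def lcond(psi):
    # psi * s - C_t(psi) - D_t(s), with C_t(psi) the log normalizing constant
    logw = np.log(comb(n1, s)) + np.log(comb(n2, t - s)) + psi * s
    return (np.log(comb(n1, s_obs)) + np.log(comb(n2, t - s_obs))
            + psi * s_obs - np.logaddexp.reduce(logw))

# conditional MLE of psi; compare with the unconditional estimate of
# about -3.8 computed from the Table II data
res = minimize_scalar(lambda p: -lcond(p), bounds=(-10.0, 5.0), method="bounded")
print(res.x, lcond(res.x))
```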


A similar discussion applies to the class of transformation models, using marginal approximations. Both classes are reviewed in Reid [9].

Acknowledgments

The authors wish to thank Anthony Davison and Augustine Wong for helpful discussion. This research was partially supported by the Natural Sciences and Engineering Research Council.

References

[1] D.R. Cox and N. Reid, "Parameter Orthogonality and Approximate Conditional Inference", J. R. Statist. Soc. B 49, 1, 1987.
[2] D.A.S. Fraser, "Statistical Inference: Likelihood to Significance", J. Am. Statist. Assoc. 86, 258, 1991.
[3] D.A.S. Fraser, N. Reid and A. Wong, "On Inference for Bounded Parameters", arXiv:physics/0303111 v1, 27 Mar 2003; to appear in Phys. Rev. D.
[4] D.R. Cox and N. Reid, "A Note on the Difference Between Profile and Modified Profile Likelihood", Biometrika 79, 408, 1992.
[5] D.A.S. Fraser, "Likelihood for Component Parameters", Biometrika 90, 327, 2003.
[6] D.A.S. Fraser, N. Reid and J. Wu, "A Simple General Formula for Tail Probabilities for Frequentist and Bayesian Inference", Biometrika 86, 249, 1999.
[7] N. Reid, "Asymptotics and the Theory of Inference", Ann. Statist., to appear, 2004.
[8] J.T. Linnemann, "Measures of Significance in HEP and Astrophysics", in these Proceedings, page 35.
[9] N. Reid, "Likelihood and Higher-Order Approximations to Tail Areas: a Review and Annotated Bibliography", Canad. J. Statist. 24, 141, 1996.
[10] O.E. Barndorff-Nielsen and D.R. Cox, "Inference and Asymptotics", Chapman & Hall, London, 1994.
