PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003

Likelihood Inference in the Presence of Nuisance Parameters

N. Reid, D.A.S. Fraser
Department of Statistics, University of Toronto, Toronto, Canada M5S 3G3

We describe some recent approaches to likelihood based inference in the presence of nuisance parameters. Our approach is based on plotting the likelihood function and the p-value function, using recently developed third order approximations. Orthogonal parameters and adjustments to profile likelihood are also discussed. Connections to classical approaches of conditional and marginal inference are outlined.

1. INTRODUCTION

We take the view that the most effective form of inference is provided by the observed likelihood function along with the associated p-value function. In the case of a scalar parameter the likelihood function is simply proportional to the density function. The p-value function can be obtained exactly if there is a one-dimensional statistic that measures the parameter. If not, the p-value can be obtained to a high order of approximation using recently developed methods of likelihood asymptotics. In the presence of nuisance parameters, the likelihood function for a (one-dimensional) parameter of interest is obtained via an adjustment to the profile likelihood function. The p-value function is obtained from quantities computed from the likelihood function using a canonical parametrization \varphi = \varphi(\theta), which is computed locally at the data point. This generalizes the method of eliminating nuisance parameters by conditioning or marginalizing to more general contexts. In Section 2 we give some background notation and introduce the notion of orthogonal parameters. In Section 3 we illustrate the p-value function approach in a simple model with no nuisance parameters. Profile likelihood and adjustments to profile likelihood are described in Section 4. Third order p-values for problems with nuisance parameters are described in Section 5. Section 6 describes the classical conditional and marginal likelihood approach.

2. NOTATION AND ORTHOGONAL PARAMETERS

We assume our measurement(s) y can be modelled as coming from a probability distribution with density or mass function f(y; \theta), where \theta = (\psi, \lambda) takes values in R^d. We assume \psi is a one-dimensional parameter of interest, and \lambda is a vector of nuisance parameters. If there is interest in more than one component of \theta, the methods described here can be applied to each component of interest in turn. The likelihood function is

    L(\theta) = L(\theta; y) = c(y) f(y; \theta);   (1)

it is defined only up to arbitrary multiples which may depend on y but not on \theta. This ensures in particular that the likelihood function is invariant to one-to-one transformations of the measurement(s) y. In the context of independent, identically distributed sampling, where y = (y_1, \ldots, y_n) and each y_i follows the model f(y; \theta), the likelihood function is proportional to \prod f(y_i; \theta) and the log-likelihood function becomes a sum of independent and identically distributed components:

    \ell(\theta) = \ell(\theta; y) = \sum \log f(y_i; \theta) + a(y).   (2)

The maximum likelihood estimate \hat{\theta} is the value of \theta at which the likelihood takes its maximum, and in regular models is defined by the score equation

    \ell'(\hat{\theta}; y) = 0.   (3)

The observed Fisher information function j(\theta) is the curvature of the log-likelihood:

    j(\theta) = -\ell''(\theta)   (4)

and the expected Fisher information is the model quantity

    i(\theta) = E\{-\ell''(\theta)\} = \int -\ell''(\theta; y) f(y; \theta) \, dy.   (5)

If y is a sample of size n then i(\theta) = O(n). In accord with the partitioning of \theta we partition the observed and expected information matrices and use the notation

    i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix}   (6)

and

    i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}.   (7)
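As a concrete illustration of (2)-(5), the following minimal Python sketch computes the log-likelihood, score, and observed and expected information for an i.i.d. Poisson(\theta) sample; the model and data are hypothetical, chosen here only for illustration.

```python
import numpy as np

# A minimal sketch of equations (2)-(5) for an i.i.d. Poisson(theta) sample;
# the data are hypothetical, chosen only to illustrate the definitions.
y = np.array([3, 7, 4, 6, 5])
n = len(y)

def loglik(theta):
    # l(theta) = sum_i log f(y_i; theta), dropping the additive constant a(y)
    return np.sum(y * np.log(theta)) - n * theta

def score(theta):
    # l'(theta); the score equation (3) gives theta_hat = mean(y)
    return np.sum(y) / theta - n

def obs_info(theta):
    # observed Fisher information (4): j(theta) = -l''(theta) = sum(y)/theta^2
    return np.sum(y) / theta**2

theta_hat = y.mean()           # solves the score equation (3)
exp_info = n / theta_hat       # expected information (5): i(theta) = n/theta, O(n)

# For this model, observed and expected information coincide at theta_hat.
print(theta_hat, score(theta_hat), obs_info(theta_hat), exp_info)
```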
We say \psi is orthogonal to \lambda (with respect to expected Fisher information) if i_{\psi\lambda}(\theta) = 0. When \psi is scalar a transformation from (\psi, \lambda) to (\psi, \eta(\psi, \lambda)) such that \psi is orthogonal to \eta can always be found (Cox and Reid [1]). The most directly interpreted consequence of parameter orthogonality is that the maximum likelihood estimates of orthogonal components are asymptotically independent.

Example 1: ratio of Poisson means. Suppose y_1 and y_2 are independent counts modelled as Poisson with mean \lambda and \psi\lambda, respectively. Then the likelihood function is

    L(\psi, \lambda; y_1, y_2) = e^{-\lambda(1+\psi)} \psi^{y_2} \lambda^{y_1+y_2}

and \psi is orthogonal to \eta(\psi, \lambda) = \lambda(\psi + 1). In fact in this example the likelihood function factors as L_1(\psi) L_2(\eta), which is a stronger property than parameter orthogonality. The first factor is the likelihood for a binomial distribution with index y_1 + y_2 and probability of success \psi/(1 + \psi), and the second is that for a Poisson distribution with mean \eta.

Example 2: exponential regression. Suppose y_i, i = 1, \ldots, n are independent observations, each from an exponential distribution with mean \lambda \exp(-\psi x_i), where x_i is known. The log-likelihood function is

    \ell(\psi, \lambda; y) = -n \log \lambda + \psi \sum x_i - \lambda^{-1} \sum y_i \exp(\psi x_i)   (8)

and i_{\psi\lambda}(\theta) = 0 if and only if \sum x_i = 0. The stronger property of factorization of the likelihood does not hold.

3. LIKELIHOOD INFERENCE WITH NO NUISANCE PARAMETERS

We assume now that \theta is one-dimensional. A plot of the log-likelihood function as a function of \theta can quickly reveal irregularities in the model, such as a non-unique maximum, or a maximum on the boundary, and can also provide a visual guide to deviation from normality, as the log-likelihood function for a normal distribution is a parabola and hence symmetric about the maximum. In order to calibrate the log-likelihood function we can use the approximation

    r(\theta) = \mathrm{sign}(\hat{\theta} - \theta)[2\{\ell(\hat{\theta}) - \ell(\theta)\}]^{1/2} \;\dot\sim\; N(0, 1),   (9)

which is equivalent to the result that twice the log-likelihood ratio is approximately \chi^2_1. This will typically provide a better approximation than the asymptotically equivalent result that

    \hat{\theta} - \theta \;\dot\sim\; N(0, i^{-1}(\theta))   (10)

as it partially accommodates the potential asymmetry in the log-likelihood function. These two approximations are sometimes called first order approximations because in the context where the log-likelihood is O(n), we have (under regularity conditions) results such as

    \Pr\{r(\theta; y) \le r(\theta; y^0)\} = \Pr\{Z \le r(\theta; y^0)\}\{1 + O(n^{-1/2})\}   (11)

where Z follows a standard normal distribution. It is relatively simple to improve the approximation to third order, i.e. with relative error O(n^{-3/2}), using the so-called r^* approximation

    r^*(\theta) = r(\theta) + \{1/r(\theta)\} \log\{q(\theta)/r(\theta)\} \;\dot\sim\; N(0, 1)   (12)

where q(\theta) is a likelihood-based statistic and a generalization of the Wald statistic (\hat{\theta} - \theta) j^{1/2}(\hat{\theta}); see Fraser [2].

Example 3: truncated Poisson. Suppose that y follows a Poisson distribution with mean \theta = b + \mu, where b is a background rate that is assumed known. In this model the p-value function can be computed exactly simply by summing the Poisson probabilities. Because the Poisson distribution is discrete, the p-value could reasonably be defined as either

    \Pr(y \le y^0; \theta)   (13)

or

    \Pr(y < y^0; \theta),   (14)

sometimes called the upper and lower p-values, respectively. For the values y^0 = 17, b = 6.7, Figure 1 shows the likelihood function as a function of \mu and the p-value function p(\mu) computed using both the upper and lower p-values. In Figure 2 we plot the mid p-value, which is

    \Pr(y < y^0) + (1/2)\Pr(y = y^0).   (15)

The approximation based on r^* is nearly identical to the mid-p-value; the difference cannot be seen on Figure 2. Table I compares the p-values at \mu = 0. This example is taken from Fraser, Reid and Wong [3].

[Figure 1: likelihood as a function of \mu (0 to 40). Figure 2: p-value functions as a function of \mu (0 to 40).]

Table I: The p-values for testing \mu = 0, i.e. that the number of observed events is consistent with the background.

    upper p-value                                     0.0005993
    lower p-value                                     0.0002170
    mid p-value                                       0.0004081
    \Phi(r^*)                                         0.0003779
    \Phi(r)                                           0.0004416
    \Phi\{(\hat{\theta} - \theta)\hat{j}^{1/2}\}      0.0062427
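The entries of Table I can be reproduced numerically. In the sketch below, the tail-direction conventions at \mu = 0 (upper = Pr(y >= y^0), lower = Pr(y > y^0)) and the use of the canonical parametrization \varphi = \log\theta to form q(\theta) are assumptions made here; they agree with the tabulated values.

```python
import numpy as np
from scipy.stats import norm, poisson

y0, b = 17, 6.7   # observed count and known background, as in Example 3

# Exact tail probabilities at mu = 0 (theta = b).
upper = poisson.sf(y0 - 1, b)             # Pr(y >= y0), ~ 0.0005993
lower = poisson.sf(y0, b)                 # Pr(y >  y0), ~ 0.0002170
mid = lower + 0.5 * poisson.pmf(y0, b)    # mid p-value (15), ~ 0.0004081

# Likelihood quantities: l(theta) = y0 log(theta) - theta up to a constant,
# so theta_hat = y0 and j(theta_hat) = y0 / theta_hat^2 = 1 / y0.
theta_hat = y0
# r of equation (9); sign(theta_hat - b) = +1 here since 17 > 6.7.
r = np.sqrt(2 * (y0 * np.log(theta_hat / b) - (theta_hat - b)))

# q(theta) in the canonical parametrization phi = log(theta):
# q = (phi_hat - phi) * j_phi(phi_hat)^{1/2}, with j_phi(phi_hat) = theta_hat.
q = np.log(theta_hat / b) * np.sqrt(theta_hat)
r_star = r + np.log(q / r) / r            # equation (12)

# Wald statistic (theta_hat - theta) j^{1/2}(theta_hat).
wald = (theta_hat - b) / np.sqrt(theta_hat)

print(upper, lower, mid)                  # exact p-values, rows 1-3 of Table I
print(norm.sf(r_star), norm.sf(r), norm.sf(wald))
# ~ 0.0003779, 0.0004416, 0.0062427: the last three rows of Table I
```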
4. PROFILE AND ADJUSTED PROFILE LIKELIHOOD FUNCTIONS

We now assume \theta = (\psi, \lambda) and denote by \hat{\lambda}_\psi the restricted maximum likelihood estimate obtained by maximizing the likelihood function over the nuisance parameter \lambda with \psi held fixed. The profile log-likelihood is

    \ell_p(\psi) = \ell(\psi, \hat{\lambda}_\psi).   (16)

The approximations of the previous section generalize to

    r(\psi) = \mathrm{sign}(\hat{\psi} - \psi)[2\{\ell_p(\hat{\psi}) - \ell_p(\psi)\}]^{1/2} \;\dot\sim\; N(0, 1),   (17)

and

    \hat{\psi} - \psi \;\dot\sim\; N(0, \{i_{\psi\psi}(\theta)\}^{-1}).   (18)

These approximations, like the ones in Section 3, are derived from asymptotic results which assume that n \to \infty, that we have a vector y of independent, identically distributed observations, and that the dimension of the nuisance parameter does not increase with n. Further regularity conditions are required on the model, such as are outlined in textbook treatments of the asymptotic theory of maximum likelihood. In finite samples these approximations can be misleading: profile likelihood is too concentrated, and can be maximized at the 'wrong' value.

Example 4: normal theory regression. Suppose y_i = x_i' \beta + \epsilon_i, where x_i = (x_{i1}, \ldots, x_{ip}) is a vector of known covariate values, \beta is an unknown parameter of length p, and \epsilon_i is assumed to follow a N(0, \psi) distribution.
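The concentration of profile likelihood can be seen concretely in Example 4, reading the model statement as making the error variance \psi the parameter of interest. The sketch below (with hypothetical data and dimensions) profiles out \beta: for fixed \psi the likelihood is maximized over \beta by ordinary least squares, and the profile maximum \hat{\psi} = RSS/n understates \psi, since E(RSS) = (n - p)\psi.

```python
import numpy as np

# A minimal sketch of the profile log-likelihood in Example 4, assuming the
# interest parameter psi is the error variance; data/dimensions are hypothetical.
rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(scale=1.0, size=n)   # true psi = 1

# For fixed psi, l(beta, psi) is maximized over beta by least squares (the
# restricted MLE beta_hat_psi does not depend on psi here), so up to a constant
#   l_p(psi) = -(n/2) log(psi) - RSS / (2 psi).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = float(np.sum((y - X @ beta_hat) ** 2))

def profile_loglik(psi):
    return -0.5 * n * np.log(psi) - 0.5 * rss / psi

# The profile maximum psi_hat = rss/n is biased downward, since
# E(RSS) = (n - p) psi: the "too concentrated" behaviour noted above.
# Dividing by n - p instead (the classical degrees-of-freedom correction)
# removes the bias.
print(rss / n, rss / (n - p), profile_loglik(rss / n))
```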