ON THE USE OF RELATIVE LIKELIHOOD RATIOS IN

Charles G. Boncelet*, Lisa M. Marvel**, Michael E. Picollelli*

*University of Delaware, Newark, DE USA  **US Army Research Laboratory, Aberdeen Proving Ground, MD USA

CGB can be reached at [email protected], LMM at [email protected], and MEP at [email protected].

ABSTRACT

We ask the question, "How good are parameter estimates?" and offer criticism of confidence intervals as an answer. Instead we suggest the engineering community adopt a little-known idea, that of defining plausible intervals from the relative likelihood ratio. Plausible intervals answer the question, "What range of values could plausibly have given rise to the data we have seen?" We develop a simple theorem for computing plausible intervals for a wide variety of common distributions, including the Gaussian, exponential, and Poisson, among others.

1. INTRODUCTION

We consider a basic question of statistical analysis, "How good are the parameter estimates?" Most often this question is answered with confidence intervals. We argue that confidence intervals are flawed. We propose that a little-known alternative based on relative likelihood ratios be used. We term these "plausible intervals." We present a theorem on computing plausible intervals that applies to a wide range of distributions, including the Gaussian, exponential, and Chi-square. A similar result holds for the Poisson. Lastly, we show how to compute plausible regions for the Gaussian when both location and scale parameters are estimated.

This work is part of a larger effort to understand why statistical and signal processing procedures do not work as well in practice as theory indicates they should. Furthermore, we strive to develop better and more robust procedures. In this work, we begin to understand why "statistically significant" results are, upon further evaluation, not as reliable as thought. One reason is that confidence intervals are too small: they do not accurately reflect the range of possible inputs that could have given rise to the observed data.

This work is not defense specific, though there are numerous defense applications of statistical inference and signal processing. We expect this work to influence a wide range of procedures beyond simple parameter estimation. For instance, Kalman filtering, spectral estimation, and hypothesis testing are potential applications.

Standard statistical procedures including maximum likelihood estimates and confidence intervals can be found in many textbooks, including Bickel and Doksum [1] and Kendall and Stuart [4]. Relative likelihoods have received some attention in the statistical and epidemiological literature, but little attention in the engineering literature. The best reference on relative likelihood methods is the text by Sprott [7]. One engineering reference is a recent paper by Sander and Beyerer [6].

In this paper, we adopt a "frequentist" interpretation of probability and statistical inference. Bayesian statisticians adopt a different view, one with which we have some sympathy, but that view is not explored herein. For a recent discussion of Bayesian statistics, see Jaynes [3].

2. PRELIMINARIES: LIKELIHOOD FUNCTIONS, MAXIMUM LIKELIHOOD ESTIMATES, AND CONFIDENCE INTERVALS

Consider a common estimation problem: estimating the mean of a Gaussian distribution. Let X1, X2, ..., Xn be IID (independent and identically distributed) Gaussian random variables with mean µ and variance σ², i.e., Xi ∼ N(µ, σ²). For the moment, we assume we know σ² and seek to estimate µ. The density of each Xi is

    f(x_i; µ) = (2πσ²)^{−1/2} exp(−(x_i − µ)²/(2σ²))

The likelihood function of µ is

    L(µ; x_1^n) = f(x_1; µ) f(x_2; µ) ··· f(x_n; µ)
                = (2πσ²)^{−n/2} exp(−Σ_{i=1}^n (x_i − µ)²/(2σ²))

where we use the notation x_1^n = x_1, x_2, ..., x_n.

The maximum likelihood estimate (MLE) of µ is found by setting the first derivative of L(µ; x_1^n) to 0,

    0 = (d/dµ) L(µ; x_1^n) |_{µ=µ̂}

The calculations are somewhat easier if we find the maximum of the log-likelihood function,

    0 = (d/dµ) log L(µ; x_1^n) |_{µ=µ̂}
      = (d/dµ) ( −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)² ) |_{µ=µ̂}
      = (1/σ²) ( Σ_{i=1}^n x_i − nµ̂ )

From which we conclude the MLE of µ is the sample mean,

    µ̂ = x̄_n = (1/n) Σ_{i=1}^n x_i

Just how good is the sample mean as an estimator of µ? A commonly used measure of goodness is the confidence interval. Find u(X_1^n) and v(X_1^n) such that

    Pr( u(X_1^n) ≤ µ ≤ v(X_1^n) ) ≥ 1 − α

for some α > 0. Normally, the confidence interval is selected as the minimal range v(X_1^n) − u(X_1^n). For the Gaussian example, the confidence interval is

    u(X_1^n) = µ̂ − cσ/√n
    v(X_1^n) = µ̂ + cσ/√n

where c = Φ^{−1}(1 − α/2) and Φ(·) is the Normal distribution function. In the usual case where α = 0.05, c = 1.96.
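As a concrete illustration of the preliminaries (ours, not from the original paper; the sample size, seed, and parameter values are arbitrary choices), a minimal Python sketch that draws a Gaussian sample, computes the sample-mean MLE, and forms the confidence interval above:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu, sigma, n, alpha = 3.0, 2.0, 25, 0.05

    x = rng.normal(mu, sigma, size=n)    # X_1, ..., X_n ~ N(mu, sigma^2), sigma known
    mu_hat = x.mean()                    # MLE of mu: the sample mean
    c = norm.ppf(1 - alpha / 2)          # c = Phi^{-1}(1 - alpha/2) = 1.96 for alpha = 0.05
    half = c * sigma / np.sqrt(n)        # half-width c*sigma/sqrt(n)
    print(mu_hat - half, mu_hat + half)  # the confidence interval (u, v)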

3. CRITICISMS OF CONFIDENCE INTERVALS

Many criticisms of confidence intervals have been raised. Here we list a few of them.

First, there is considerable confusion as to what a confidence interval actually represents. The standard interpretation goes something like this: Before the experiment is done, we agree to compute a sample average and a confidence interval (u, v) as above. Then the probability the interval will cover µ is 1 − α. Note, however, that after the experiment is done, the Xi have values, xi. Then µ̂, u, and v are numbers. Since µ is not considered to be random, we cannot even ask the question, "What is the probability µ is in the interval (u, v)?" µ is either in the interval or not, but the question is not within the realm of probability.

Second, the confidence interval contains probability mass 1 − α about the true mean only if µ = µ̂. In general µ ≠ µ̂, and the interval contains less mass:

    Φ(√n(v − µ)/σ) − Φ(√n(u − µ)/σ) < 1 − α

Third, the confidence interval is not too helpful at predicting future values either. For instance, consider the following change to the experiment: after observing n measurements, we compute a sample average and a confidence interval. Then we make another n measurements (independent of the first) and ask, what is the probability that the second sample mean is in the confidence interval? Let the second sample mean be denoted µ̂′. Then,

    Pr(u ≤ µ̂′ ≤ v) = Pr(µ̂ − cσ/√n ≤ µ̂′ ≤ µ̂ + cσ/√n)
                   = Pr(−cσ/√n ≤ µ̂′ − µ̂ ≤ cσ/√n)

Since µ̂ and µ̂′ are independent N(µ, σ²/n) random variables, their difference is N(0, 2σ²/n) and the probability is

    Φ(c/√2) − Φ(−c/√2) = 2Φ(c/√2) − 1

For example, when α = 0.05, c = 1.96 and the probability evaluates to only 0.834. To guarantee that µ̂′ is in the confidence interval with probability 1 − α, c must increase to 1.96√2 ≈ 2.77.

We regard the latter criticism as particularly damning. The usual reason to perform statistical analysis is to determine something about future observations. After all, the current observations are already known. Parameter estimates are often useful to the extent they help inform us about future observations. Confidence intervals are misleading indicators of future values.
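A quick numerical check of the 0.834 figure (our illustration, not from the paper): simulate many pairs of independent sample means and count how often the second falls inside the first one's confidence interval.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 0.0, 1.0, 25, 200_000
    c = 1.96                             # Phi^{-1}(0.975)

    # Both sample means are N(mu, sigma^2/n) and independent, so draw them directly.
    mu_hat1 = rng.normal(mu, sigma / np.sqrt(n), size=trials)
    mu_hat2 = rng.normal(mu, sigma / np.sqrt(n), size=trials)

    covered = np.abs(mu_hat2 - mu_hat1) <= c * sigma / np.sqrt(n)
    print(covered.mean())                # ~0.834 = 2*Phi(1.96/sqrt(2)) - 1, well below 0.95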
4. RELATIVE LIKELIHOOD RATIO INTERVALS

We hope to revive an older idea that has received little attention in the engineering literature, relative likelihood ratios. A good reference is the text by Sprott [7]. The relative likelihood ratio is the following:

    R(θ; x_1^n) = L(θ; x_1^n) / sup_θ L(θ; x_1^n) = L(θ; x_1^n) / L(θ̂; x_1^n)

where θ represents the unknown parameter or parameters and θ̂ is the MLE of θ.

The relative likelihood ratio helps answer the question, "What values of θ could plausibly have given the data x_1^n that we observed?" The relative likelihood is useful after the experiment is run, while probabilities are most useful before the experiment is run.

As an example, we consider the Gaussian example above. The unknown parameter is θ = µ and the MLE is θ̂ = µ̂. Compare the relative likelihood ratio to a threshold:

    R(θ; x_1^n) = exp(−Σ_{i=1}^n (x_i − µ)²/(2σ²)) / exp(−Σ_{i=1}^n (x_i − µ̂)²/(2σ²)) ≥ α    (1)

After taking logs and simplifying, the relation becomes

    (µ − µ̂)² ≤ (2σ²/n) log(1/α)    (2)

Solving for µ gives a relative likelihood ratio interval, which we shall refer to as a plausible interval:

    µ̂ − √(2σ² log(1/α)/n) ≤ µ ≤ µ̂ + √(2σ² log(1/α)/n)

or

    µ̂ − cσ/√n ≤ µ ≤ µ̂ + cσ/√n    (3)

where now c = √(2 log(1/α)). When α = 0.05, c = 2.45. We see the plausible interval is bigger than the confidence interval; it is a more conservative measure. To reiterate, the plausible interval gives all values of θ that could have plausibly given rise to the data observed, where "plausibly" is measured by the ratio of the likelihood at θ to the maximum likelihood.

As another example, consider estimating the parameter of an exponential distribution. Let X1, X2, ..., Xn be IID exponential with density f(x) = λe^{−λx}. Then

    L(λ; x_1^n) = λ^n exp(−λ Σ_{i=1}^n x_i)

    λ̂ = n / Σ_{i=1}^n x_i = 1/x̄_n

    R(λ; x_1^n) = (λ/λ̂)^n exp(n(1 − λ/λ̂))

If we let γ = λ/λ̂ and compare the relative likelihood to α, we obtain an interesting result:

    γ e^{1−γ} ≥ α^{1/n}    (4)

This gives a semi-graphical way of determining the plausible interval, as demonstrated in Figure 1: compute α^{1/n} and find, graphically or numerically, the upper and lower values u and v at which γe^{1−γ} crosses that level. An essential feature of the exponential inference problem is the asymmetry of the upper and lower values of the plausible interval: the upper limit is much farther from γ = 1 than is the lower limit.

[Fig. 1: Graphical representation of the plausible interval for an exponential distribution — the curve γe^{1−γ} versus γ, with threshold level α^{1/n} and crossing points u and v.]
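Equation (4) can also be solved in closed form: γe^{1−γ} = t is equivalent to −γe^{−γ} = −t/e, so the two crossings are the two real branches of the Lambert W function. A sketch (ours; the function name is an arbitrary choice, and the semi-graphical method of Figure 1 is equivalent):

    import numpy as np
    from scipy.special import lambertw

    def exponential_plausible_interval(x, alpha=0.05):
        """Plausible interval for the rate lambda of an IID exponential sample x."""
        x = np.asarray(x, dtype=float)
        n = x.size
        lam_hat = n / x.sum()                # MLE: 1 / sample mean
        t = alpha ** (1.0 / n)               # threshold level alpha^(1/n) in (4)
        u = -lambertw(-t / np.e, k=0).real   # lower crossing, 0 < u < 1
        v = -lambertw(-t / np.e, k=-1).real  # upper crossing, v > 1 (note the asymmetry)
        return u * lam_hat, v * lam_hat      # map gamma = lambda/lambda_hat back to lambda

    rng = np.random.default_rng(2)
    print(exponential_plausible_interval(rng.exponential(scale=0.5, size=30)))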
As a third example of the utility of relative likelihoods, consider estimating the parameters of a multinomial distribution. Let X1, X2, ..., Xn be n IID multinomial random variables with parameters p1, p2, ..., pk (p1 + p2 + ··· + pk = 1), and let nj denote the number of observations falling in category j. Then

    L(p1, p2, ..., pk; x_1^n) = p1^{n1} p2^{n2} ··· pk^{nk}

    p̂j = nj/n  for j = 1, 2, ..., k

    R(p1, p2, ..., pk; x_1^n) = ( (p1/p̂1)^{p̂1} (p2/p̂2)^{p̂2} ··· (pk/p̂k)^{p̂k} )^n

After taking logs, multiplying by −1, and comparing to α, the relation reduces to

    −Σ_{j=1}^k p̂j log(pj/p̂j) = KL(p̂‖p) ≤ −(log α)/n    (5)

where KL(p̂‖p) is the Kullback–Leibler divergence between p̂ and p (Kullback and Leibler [5]). In words, the set of plausible p's are those with a Kullback–Leibler divergence from p̂ less than or equal to −log(α)/n.

Below we present a theorem on the calculation of plausible intervals for a wide class of exponential-type distributions.

Theorem 1. Let X1, X2, ..., Xn be IID with common density

    f(x) = C λ^k x^{k−1} e^{−d λ^l x^l}    (6)

where k and l are shape parameters, d is a convenience constant (e.g., for the Gaussian, d = 0.5), and C is a normalizing constant, and let x1, x2, ..., xn denote the corresponding observations. Then the maximum likelihood estimate of λ satisfies

    λ̂^l = nk / (d l Σ_{i=1}^n x_i^l)    (7)

and the relative likelihood constraint reduces to

    γ e^{1−γ} ≥ α^{l/nk}    (8)

where γ = (λ/λ̂)^l and α is the threshold level.

This theorem applies to a wide variety of distributions; some are listed below in Table 1. For example, the exponential distribution has k = 1, l = 1, and d = 1, and (8) reduces to (4), since l/k = 1. The theorem is easy to apply (a code sketch follows the steps):

1. Compute the MLE of λ using (7).
2. Compute α^{l/nk}.
3. Using graphical (Figure 1) or numerical methods, solve (8) for u and v.
4. Solve for the upper and lower bounds on λ using λ = λ̂u^{1/l} and λ = λ̂v^{1/l}.
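The four steps translate directly into code. A sketch under our reading of Theorem 1 (the function name and the brentq bracketing constants are our choices); with k = l = d = 1 it reproduces the exponential interval of (4):

    import numpy as np
    from scipy.optimize import brentq

    def plausible_interval(x, k=1.0, l=1.0, d=1.0, alpha=0.05):
        """Plausible interval for lambda in f(x) = C lam^k x^(k-1) exp(-d lam^l x^l)."""
        x = np.asarray(x, dtype=float)
        n = x.size
        lam_hat = (n * k / (d * l * np.sum(x ** l))) ** (1.0 / l)  # step 1: MLE, eq. (7)
        t = alpha ** (l / (n * k))                                 # step 2: threshold
        g = lambda gam: gam * np.exp(1.0 - gam) - t                # eq. (8) as a root problem
        u = brentq(g, 1e-12, 1.0)                                  # step 3: lower crossing
        v = brentq(g, 1.0, 1e3)                                    # step 3: upper crossing
        return lam_hat * u ** (1.0 / l), lam_hat * v ** (1.0 / l)  # step 4: back to lambda

    rng = np.random.default_rng(3)
    print(plausible_interval(rng.exponential(scale=0.5, size=40)))  # exponential: k = l = d = 1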

    Distribution             k      l    d
    Exponential              1      1    1
    Weibull (shape l)        k = l  l    1
    Erlang (shape k)         k      1    1
    Gaussian (known mean)    1      2    0.5
    Chi-square (k d.o.f.)    k/2    1    0.5
    Rayleigh                 2      2    0.5

Table 1: Some distributions that meet the conditions of Theorem 1.
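Returning to the multinomial example, (5) is a one-line computational test of whether a candidate p is plausible. A small sketch (ours; it assumes p_j > 0 wherever p̂_j > 0):

    import numpy as np

    def is_plausible_multinomial(counts, p, alpha=0.05):
        """Check KL(p_hat || p) <= -log(alpha)/n, the plausibility criterion (5)."""
        counts = np.asarray(counts, dtype=float)
        p = np.asarray(p, dtype=float)
        n = counts.sum()
        p_hat = counts / n
        nz = p_hat > 0                     # empty categories contribute zero to the sum
        kl = np.sum(p_hat[nz] * np.log(p_hat[nz] / p[nz]))
        return kl <= -np.log(alpha) / n

    print(is_plausible_multinomial([18, 12, 10], [1/3, 1/3, 1/3]))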

While the Poisson does not fit the conditions of the theorem, its probability mass function is sufficiently close that the same general conclusions apply. Let X1, X2, ..., Xn be n IID Poisson random variables with parameter λ. Then,

    Pr(Xi = xi) = (λ^{xi} / xi!) e^{−λ}

    L(λ; x_1^n) = (λ^{Σ_{i=1}^n xi} / Π_{i=1}^n xi!) e^{−nλ}

    λ̂ = (1/n) Σ_{i=1}^n xi = x̄_n

    R(λ; x_1^n) = (λ/λ̂)^{nλ̂} e^{nλ̂(1 − λ/λ̂)}

Letting γ = λ/λ̂ and comparing to α results in

    γ e^{1−γ} ≥ α^{1/(nλ̂)}    (9)

The only difference for the Poisson is that α is raised to a data-dependent power, 1/(nλ̂) = 1/(x1 + x2 + ··· + xn).
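In code, the Poisson case is the same root-finding problem with a data-dependent exponent (a sketch, ours; it assumes at least one nonzero count so that λ̂ > 0):

    import numpy as np
    from scipy.optimize import brentq

    def poisson_plausible_interval(x, alpha=0.05):
        """Plausible interval for the Poisson rate lambda, from (9)."""
        x = np.asarray(x, dtype=float)
        lam_hat = x.mean()                       # MLE: the sample mean
        t = alpha ** (1.0 / x.sum())             # alpha^(1/(n*lam_hat)) = alpha^(1/sum(x))
        g = lambda gam: gam * np.exp(1.0 - gam) - t
        u = brentq(g, 1e-12, 1.0)                # lower crossing
        v = brentq(g, 1.0, 1e3)                  # upper crossing
        return u * lam_hat, v * lam_hat          # gamma = lambda/lambda_hat

    rng = np.random.default_rng(4)
    print(poisson_plausible_interval(rng.poisson(lam=3.0, size=50)))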

5. LINEAR REGRESSION WITH GAUSSIAN NOISE

In this section, we consider the linear regression problem with additive Gaussian noise of unknown variance. Let the regression problem be y = Xβ + ε where ε ∼ N(0, σ²I). The MLE of β is

    β̂ = (XᵀX)^{−1} Xᵀ y

The MLE of σ² is

    σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂)/n = (1/n) ‖y − Xβ̂‖²
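In code, both MLEs are one least-squares solve (a minimal numpy sketch, ours; the dimensions and noise level are arbitrary, and lstsq is used instead of forming (XᵀX)^{−1} explicitly because it is numerically safer):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.normal(scale=0.7, size=n)  # eps ~ N(0, sigma^2 I)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # MLE of beta
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / n                     # MLE of sigma^2 (divide by n, not n-p)
    print(beta_hat, sigma2_hat)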

Define two derived quantities:

    δ² = (β − β̂)ᵀ XᵀX (β − β̂) / (n σ̂²)    (10)

    γ = σ̂² / σ²    (11)

Using γ and δ², the relative likelihood can be written as

    R(β, σ; y, X) = γ^{n/2} exp( (n/2)(1 − γ − γδ²) )

Comparing this to α and simplifying results in the following:

    γ e^{1−γ−γδ²} ≥ α^{2/n}    (12)

Since the Gaussian depends on two parameters, the relative likelihood depends on two parameters: δ² represents the error in the location estimate and γ the error in the scale estimate. When compared to α, we obtain plausible regions, a generalization to higher dimensions of plausible intervals.

The plausible region is defined by (12). When δ² = 0, (12) reduces to (8); when γ = 1, (12) reduces to (2).

The function z = γ exp(1 − γ − γδ²) is shown in Figure 2. The plausible location error δ² is large when γ < 1, i.e., when σ̂² < σ².

[Fig. 2: A contour plot of location error, δ², versus scale error, γ, for the Gaussian, with z = γ exp(1 − γ − γδ²).]

When γ = 1, the boundary of (12) gives δ² = −log(α^{2/n}). It is interesting to ask what the maximum value of δ² is when γ is allowed to vary (i.e., when the variance is also unknown). One can solve the following maximization problem:

    max over δ², γ of δ²  such that  γ e^{1−γ−γδ²} ≥ α^{2/n}

The solution is γ = α^{2/n} and δ² = α^{−2/n} − 1. That the maximum value of δ² is larger when γ is allowed to vary is a restatement of the familiar log inequality, x − 1 ≥ log(x).
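A sketch of the region test (12) and of the two worst-case values of δ² derived above (ours; the function name is an arbitrary choice):

    import numpy as np

    def in_plausible_region(beta, beta_hat, sigma2, sigma2_hat, X, alpha=0.05):
        """Test whether (beta, sigma^2) satisfies the plausible-region criterion (12)."""
        n = X.shape[0]
        diff = beta - beta_hat
        delta2 = diff @ (X.T @ X) @ diff / (n * sigma2_hat)  # location error, eq. (10)
        gamma = sigma2_hat / sigma2                          # scale error, eq. (11)
        return gamma * np.exp(1.0 - gamma - gamma * delta2) >= alpha ** (2.0 / n)

    n, alpha = 100, 0.05
    print(-np.log(alpha ** (2.0 / n)))  # max delta^2 with gamma fixed at 1
    print(alpha ** (-2.0 / n) - 1.0)    # max delta^2 with gamma free (always larger)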

6. CONCLUSION

We argue that confidence intervals are flawed and do not accurately represent the range of possible parameter values. Better, we argue, are plausible intervals based on relative likelihood ratios. Plausible intervals answer the question, "What range of parameter values could plausibly have given rise to the data we have observed?" Unlike probabilities, likelihood ratios are informative after the experiment is run.

We have presented a theorem that provides a simple sequence of steps to compute the plausible interval for a wide range of distributions, including the Gaussian, exponential, Weibull, Chi-squared, Erlang, and Poisson. We have extended the theorem to the Gaussian when both location and scale parameters are unknown. The relative likelihood for the multinomial distribution reduces to the Kullback-Leibler divergence.

This work is part of a larger effort to understand why statistical and signal processing procedures often do not work as well in practice as the theory indicates and to develop better, more robust, procedures. Specific work to follow includes more analysis of the Gaussian case, including applications such as Kalman filtering, autoregressive fitting, and spectral analysis. Also, we need to extend Theorem 1 to a wider class of distributions. We believe that plausible intervals can replace the need for techniques such as the bootstrap and the jackknife (Efron [2]). Rather than creating new data, we can use plausible intervals to explore the space of plausible inputs.

7. REFERENCES

[1] P. J. Bickel and K. A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day Series in Probability and Statistics. Holden-Day, Inc., San Francisco, 1977.

[2] B. Efron. The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia, PA, 1982.

[3] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[4] M. Kendall and A. Stuart. The Advanced Theory of Statistics, volume 1. Charles Griffin & Company Limited, London, 1977.

[5] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22(1):79–86, 1951.

[6] J. Sander and J. Beyerer. Decreased complexity and increased problem specificity of Bayesian fusion by local approaches. In 11th International Conference on Information Fusion, pages 1036–1042, June 2008.

[7] D. A. Sprott. Statistical Inference in Science. Springer Series in Statistics. Springer, 2000.