
… however, and it can also be measured for changes to […] and assumptions in non-[…]. There have been some successful attempts to take considerations of sensitivity to assumptions into account explicitly in the formulation of Bayesian models. Model uncertainty (see Draper 1995) and Bayesian model averaging (Hoeting et al. 1999) attempt to develop methodology for expressing directly in the modeling process the belief that more than one out of several competing statistical models might be appropriate for the data, and it may not be clear which one is best. For example, one might not be able to decide whether to model data as normal random variables or as Cauchy random variables. The methods of model uncertainty and model averaging provide ways to pursue both models (as well as others that might seem plausible) without having to pretend as if we really believed firmly in precisely one of them.

Finally, there have been some methods developed to deal with specific common violations of assumptions. The assumption that all observations in a sample have the same distribution has received attention particularly in order to be able to accommodate the occasional outlying observation. One can model data as coming from a mixture of two or more distributions. One of the distributions is the main component on which most interest is focussed. The others represent the types of data that might arise when something goes wrong with the procedure. Each observation can be modeled as coming from one of the component distributions with a probability associated with each component. Indeed, using powerful simulation methods (see Markov Chain Monte Carlo Methods) one can even compute, for each observation, the conditional probability that it arose from each of the components given the observed data.

5. Conclusion

This article has discussed the need for carefully acknowledging the probabilistic assumptions made to justify a statistical procedure and some methods for assessing the sensitivity of the procedure to those assumptions. If one believes that there are violations of the assumptions that might reasonably arise in practice, and if the procedure is overly sensitive to violations of those assumptions, one might wish to select an alternative procedure.

There are two popular approaches to choosing alternatives to standard statistical procedures when one fears that assumptions are likely to be violated. One is to use robust procedures, and the other is to use nonparametric procedures. Robust procedures are designed to be less sensitive to violations of specific assumptions without sacrificing too much of the good performance of standard procedures when the standard assumptions hold. Nonparametric procedures are chosen so that their properties can be verified under fewer assumptions. For more details on these types of procedures see Robustness in Statistics and Nonparametric Statistics: The Field.

See also: Linear Hypothesis: Regression (Basics); Linear Hypothesis: Regression (Graphics); Robustness in Statistics; Statistics: The Field; Time Series: ARIMA Methods; Time Series: General

Bibliography

Benjamini Y 1983 Is the t test really conservative when the parent distribution is long-tailed? Journal of the American Statistical Association 78: 645–54
Box G E P 1953 Non-normality and tests on variances. Biometrika 40: 318–35
Box G E P 1980 Sampling and Bayes' inference in scientific modeling and robustness (with discussion). Journal of the Royal Statistical Society, Series A 143: 383–430
Draper D 1995 Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society, Series B 57: 45–97
Efron B 1969 Student's t-test under symmetry conditions. Journal of the American Statistical Association 64: 1278–302
Hoeting J A, Madigan D, Raftery A E, Volinsky C 1999 Bayesian model averaging: A tutorial (with discussion). Statistical Science 14: 382–417
Hotelling H 1961 The behavior of some standard statistical tests under nonstandard conditions. In: Neyman J (ed.) Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA, Vol. 1, pp. 319–59

M. J. Schervish

Estimation: Point and Interval

Introduction

When sampling is from a population described by a density or mass function f(x | θ), knowledge of θ yields knowledge of the entire population. Hence, it is natural to seek a method of finding a good estimator of the point θ, that is, a good point estimator. However, a point estimator alone is not enough for a complete inference, as a measure of uncertainty is also needed. For that, we use a set estimator, in which the inference is the statement that θ ∈ C, where C ⊆ Θ and C = C(x) is a set determined by the value of the data X = x observed. If θ is real-valued, then we usually prefer the set estimate C to be an interval. Our uncertainty is quantified by the size of the interval and its probability of covering the parameter.

1. Point Estimation

In many cases, there will be an obvious or natural candidate for a point estimator of a particular parameter. For example, the sample mean is a natural candidate for a point estimator of the population mean.

However, when we leave a simple case like this, intuition may desert us, so it is useful to have some techniques that will at least give us some reasonable candidates for consideration. Those that have stood the test of time include the following.

1.1 The Method of Moments

The method of moments (MOM) is, perhaps, the oldest method of finding point estimators, dating back at least to Karl Pearson in the late 1800s. One of the strengths of MOM estimators is that they are usually simple to use and almost always yield some sort of estimate. In many cases, unfortunately, this method yields estimators that may be improved upon.

Let X_1, …, X_n be a sample from a population with density or mass function f(x | θ_1, …, θ_k). MOM estimators are found by equating the first k sample moments to the corresponding k population moments. That is, we define the sample moments by m_j = Σ_{i=1}^n X_i^j / n and the population moments by μ_j(θ_1, …, θ_k) = E X^j for j = 1, …, k. We then set m_j = μ_j(θ_1, …, θ_k) and solve for θ_1, …, θ_k. This solution is the MOM estimator of θ_1, …, θ_k.

1.2 Maximum Likelihood Estimators

For a sample X_1, …, X_n from f(x | θ_1, …, θ_k), the likelihood function is defined by

L(\theta \mid x) = L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k)    (1)

The values of θ that maximize this function are those parameter values for which the observed sample is most likely, and are called the maximum likelihood estimators (MLE). If the likelihood function is differentiable (in θ_i), the MLEs can often be found by solving

\frac{\partial}{\partial \theta_i} \log L(\theta \mid x) = 0, \quad i = 1, \ldots, k    (2)

where the vector with coordinates (∂/∂θ_i) log L(θ | x) is called the score function (see Schervish 1995, Sect. 2.3).

Example. If X_1, …, X_n are i.i.d. Bernoulli (p), the likelihood function is

L(p \mid x) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i}    (3)

and differentiating log L(p | x) and setting the result equal to zero gives the MLE p̂ = Σ_i x_i / n. This is also the method of moments estimator.

If we instead have samples X_1, …, X_n from a binomial (k, p) population where p is known and k is unknown, the likelihood function is

L(k \mid x, p) = \prod_{i=1}^{n} \binom{k}{x_i} p^{x_i} (1-p)^{k-x_i}    (4)

and the MLE must be found by numerical maximization. The method of moments will give the closed form solution

\hat{k} = \frac{\bar{x}^2}{\bar{x} - (1/n) \sum_i (x_i - \bar{x})^2}    (5)

which can take on negative values. This illustrates a shortcoming of the method of moments, one not shared by the MLE. Another, perhaps more serious, shortcoming of the MOM estimator is that it may not be based on a sufficient statistic (see Statistical Sufficiency), which means it could be inefficient in not using all of the available information in a sample. In contrast, both MLEs and Bayes estimators (see Bayesian Statistics) are based on sufficient statistics.
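As a numerical illustration of the comparison above, the following Python sketch implements the method of moments estimator of Eqn. (5) and maximizes the likelihood of Eqn. (4) over integer values of k. It is only a sketch under stated assumptions: the simulated data, the known value of p, and the search range for k are illustrative choices, not part of the original example.

```python
# Sketch: MOM vs. MLE for the binomial(k, p) example of Eqns. (4)-(5),
# with p treated as known and k unknown. All numerical values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_known, k_true, n = 0.3, 20, 25
x = rng.binomial(k_true, p_known, size=n)       # hypothetical sample

# Method of moments, Eqn. (5): closed form, but can be negative or non-integer
xbar = x.mean()
s2 = ((x - xbar) ** 2).mean()                   # (1/n) * sum of squared deviations
k_mom = xbar ** 2 / (xbar - s2)

# MLE, Eqn. (4): maximize the log likelihood numerically over integer k >= max(x)
def log_lik(k):
    return stats.binom.logpmf(x, k, p_known).sum()

k_grid = np.arange(x.max(), x.max() + 200)      # search range is an assumption
k_mle = k_grid[np.argmax([log_lik(k) for k in k_grid])]

print("MOM estimate of k:", round(k_mom, 2), " MLE of k:", k_mle)
```

Repeating this with small samples shows the behavior noted in the text: the MOM value occasionally falls below max(x) or even below zero, while the maximizer of the likelihood cannot.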

1.3 Bayes Estimators

In the Bayesian paradigm a random sample X_1, …, X_n is drawn from a population indexed by θ, where θ is considered to be a quantity whose variation can be described by a probability distribution (called the prior distribution). A sample is then taken from a population indexed by θ and the prior distribution is updated with this sample information. The updated prior is called the posterior distribution.

If we denote the prior distribution by π(θ), and the sampling distribution by f(x | θ), then the posterior distribution, the conditional distribution of θ given the sample, x, is

\pi(\theta \mid x) = f(x \mid \theta)\, \pi(\theta) / m(x)    (6)

where m(x) = ∫ f(x | θ) π(θ) dθ is the marginal distribution of x.

Example. Let X_1, …, X_n be i.i.d. Bernoulli (p). Then Y = Σ_i X_i is binomial (n, p). If p has a Beta (α, β) prior distribution, that is,

\pi(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}    (7)

the posterior distribution of p given y is

f(p \mid y) = \frac{f(y, p)}{f(y)} = \frac{\Gamma(n+\alpha+\beta)}{\Gamma(y+\alpha)\,\Gamma(n-y+\beta)}\, p^{y+\alpha-1} (1-p)^{n-y+\beta-1}    (8)

which is a beta distribution with parameters y + α and n − y + β.

The posterior mean, a Bayes estimator of p, is

\hat{p}_B = \frac{y+\alpha}{\alpha+\beta+n}    (9)

2. Evaluating Point Estimators

There are many methods of deriving point estimators (robust methods, …, invariance), but the three in Sect. 1 are among the most popular. No matter what method is used to derive a point estimator, it is important to evaluate the estimator using some performance criterion.

One way of evaluating the performance of a point estimator W of a parameter θ is through its mean squared error (MSE), defined by E_θ(W − θ)². MSE measures the average squared difference between the estimator W and the parameter θ. Although any increasing function of the absolute distance |W − θ| would serve, there is a nice factorization

E_\theta(W-\theta)^2 = \mathrm{Var}_\theta\, W + (E_\theta W - \theta)^2 = \mathrm{Var}_\theta\, W + (\mathrm{Bias}_\theta\, W)^2    (10)

where we define the bias of a point estimator as Bias_θ W = E_θ W − θ. An estimator whose bias is identically (in θ) equal to zero is called unbiased.

For an unbiased estimator we have E_θ(W − θ)² = Var_θ W, and so, if an estimator is unbiased, its MSE is equal to its variance. If X_1, …, X_n are i.i.d. from a population with mean μ and variance σ², the sample mean is an unbiased estimator since E X̄ = μ, and has MSE

E(\bar{X}-\mu)^2 = \mathrm{Var}\, \bar{X} = \frac{\sigma^2}{n}    (11)

Controlling bias does not guarantee that MSE is minimized. In particular, it is sometimes the case that a trade-off occurs between variance and bias. For example, in sampling from a normal population with variance σ², the usual unbiased estimator of the variance, S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)², has MSE 2σ⁴/(n−1). An alternative estimator for σ² is the maximum likelihood estimator σ̂² = (1/n) Σ_{i=1}^n (X_i − X̄)² = ((n−1)/n) S². This is a biased estimator of σ² with MSE

E(\hat{\sigma}^2-\sigma^2)^2 = \frac{2n-1}{n^2}\,\sigma^4 < \frac{2}{n-1}\,\sigma^4 = E(S^2-\sigma^2)^2    (12)

showing that σ̂² has smaller MSE than S². Thus, by trading off variance for bias, the MSE is improved.

Measuring performance by the squared difference between the estimator and a parameter is a special case of a function called a loss function. The study of the performance, and the optimality, of estimators evaluated through loss functions is a branch of decision theory. In addition to MSE, based on squared error loss, another popular loss function is absolute error loss, L(θ, W) = |W − θ|. Both of these loss functions increase as the distance between θ and W increases, with minimum value L(θ, θ) = 0. That is, the loss is minimum if the estimator is correct.

In a decision theoretic analysis, the worth of an estimator is quantified in its risk function, that is, for an estimator W of θ, the risk function is R(θ, W) = E_θ L(θ, W), so the risk function is the average loss. If the loss is squared error, the risk function is the MSE. Using squared error loss, the risk function (MSE) of the binomial Bayes estimator of p is

E_p(\hat{p}_B - p)^2 = \mathrm{Var}_p\, \hat{p}_B + (\mathrm{Bias}_p\, \hat{p}_B)^2 = \frac{np(1-p)}{(\alpha+\beta+n)^2} + \left(\frac{np+\alpha}{\alpha+\beta+n} - p\right)^2    (13)

In the absence of good prior information about p, we might try to choose α and β to make the risk function of p̂_B constant (called an equalizer rule). The solution is to choose α = β = √(n/4), yielding

E(\hat{p}_B - p)^2 = \frac{n}{4(n+\sqrt{n})^2}
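Both the variance-bias trade-off of Eqn. (12) and the constant risk produced by the equalizer choice α = β = √(n/4) are easy to check numerically. The sketch below is illustrative only; the sample size, population variance, grid of p values, and number of Monte Carlo replications are assumptions made for the example.

```python
# Sketch: (i) Monte Carlo check of Eqn. (12); (ii) the flat risk of Eqn. (13)
# under the equalizer prior alpha = beta = sqrt(n/4). All values are illustrative.
import numpy as np

rng = np.random.default_rng(1)

# (i) MSE of S^2 (divisor n-1) vs. the MLE sigma_hat^2 (divisor n)
n, sigma2, reps = 10, 4.0, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
mse_s2 = ((x.var(axis=1, ddof=1) - sigma2) ** 2).mean()
mse_mle = ((x.var(axis=1, ddof=0) - sigma2) ** 2).mean()
print(mse_s2, 2 * sigma2**2 / (n - 1))          # simulated vs. exact 2*sigma^4/(n-1)
print(mse_mle, (2*n - 1) / n**2 * sigma2**2)    # simulated vs. exact, smaller MSE

# (ii) Risk (MSE) of p_hat_B = (Y + a)/(a + b + n), per Eqn. (13)
def risk_bayes(p, n, a, b):
    var = n * p * (1 - p) / (a + b + n) ** 2
    bias = (n * p + a) / (a + b + n) - p
    return var + bias ** 2

m = 16
a_eq = b_eq = np.sqrt(m / 4)                    # equalizer choice, sqrt(n)/2
print(risk_bayes(np.linspace(0, 1, 11), m, a_eq, b_eq))  # constant in p
print(m / (4 * (m + np.sqrt(m)) ** 2))          # n / (4*(n + sqrt(n))^2) = 0.01 here
```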

We can also use a Bayesian approach to the problem of loss function optimality, where we would use the prior distribution to compute an average risk ∫_Θ R(θ, W) π(θ) dθ, known as the Bayes risk. We then find the estimator that yields the smallest value of the Bayes risk. Such an estimator is called the Bayes rule with respect to a prior π.

To find the Bayes decision rule for a given prior π we write the Bayes risk as

\int_\Theta R(\theta, W)\, \pi(\theta)\, d\theta = \int_{\mathcal{X}} \left[ \int_\Theta L(\theta, W(x))\, \pi(\theta \mid x)\, d\theta \right] m(x)\, dx    (14)

where the quantity in square brackets is the expected value of the loss function with respect to the posterior distribution, called the posterior expected loss. It is a function only of x, and not a function of θ. Thus, for each x, if we choose the estimate W(x) to minimize the posterior expected loss, we will minimize the Bayes risk.

For squared error loss, the posterior expected loss is minimized by the mean of the posterior distribution. For absolute error loss, the posterior expected loss is minimized by the median of the posterior distribution. If we have a sample X_1, …, X_n from a normal distribution with mean θ and variance σ², and the prior is n(μ, τ²), the posterior mean is

E(\theta \mid \bar{x}) = \frac{\tau^2}{\tau^2 + \sigma^2/n}\,\bar{x} + \frac{\sigma^2/n}{\tau^2 + \sigma^2/n}\,\mu    (15)

Since the posterior distribution is normal, it is symmetric and the posterior mean is the Bayes rule for both squared error and absolute error loss. The posterior mean in our binomial/beta example, p̂_B = (y+α)/(α+β+n), is the Bayes estimator against squared error loss.

We can loosely group evaluation criteria into large sample or asymptotic methods, and small sample methods. Our calculations using MSE and risk functions illustrate small sample methods. In large samples, MLEs typically perform very well, being asymptotically normal and efficient, that is, attaining the smallest possible variance. Other types of estimators that are derived in a similar manner (for example, M-estimators; see Robustness in Statistics) also share good asymptotic properties. For a detailed discussion see Casella and Berger (2001), Lehmann (1998), Stuart et al. (1999) or Lehmann and Casella (1998, Chap. 6).
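The posterior mean of Eqn. (15) is a weighted average of the sample mean and the prior mean, with weights determined by τ², σ², and n. A minimal sketch, assuming illustrative values for the data summary and the prior:

```python
# Sketch: the normal/normal posterior mean of Eqn. (15).
# xbar, n, sigma2 (sampling variance) and mu0, tau2 (prior) are illustrative values.
def posterior_mean(xbar, n, sigma2, mu0, tau2):
    """E(theta | xbar) = w * xbar + (1 - w) * mu0, with w = tau^2 / (tau^2 + sigma^2/n)."""
    w = tau2 / (tau2 + sigma2 / n)
    return w * xbar + (1 - w) * mu0

print(posterior_mean(xbar=4.2, n=25, sigma2=9.0, mu0=0.0, tau2=1.0))
# As n grows, the weight on xbar tends to 1 and the Bayes estimate tracks the data.
print(posterior_mean(xbar=4.2, n=2500, sigma2=9.0, mu0=0.0, tau2=1.0))
```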

3. Interval Estimation

Reporting a point estimator of a parameter only provides part of the story. The story becomes more complete if an assessment of the error of estimation is also reported. Informally, this can be accomplished by giving an estimated standard error of the estimator and, more formally, this becomes the reporting of an interval estimate. If X = x is observed, an interval estimate of a parameter θ is a pair of functions, L(x) and U(x), for which the inference θ ∈ [L(x), U(x)] is made. The coverage probability of the random interval [L(X), U(X)] is the probability that [L(X), U(X)] covers the true parameter, θ, and is denoted by P_θ(θ ∈ [L(X), U(X)]).

By definition, the coverage probability depends on the unknown θ, so it cannot be reported. What is typically reported is the confidence coefficient, the infimum of the coverage probabilities, inf_θ P_θ(θ ∈ [L(X), U(X)]).

If X_1, …, X_n are i.i.d. with mean μ and variance σ², a common interval estimator for μ is

\mu \in \bar{x} \pm 2\,\frac{s}{\sqrt{n}}    (16)

where x̄ is the sample mean and s is the sample standard deviation. The validity of this interval can be justified from the Central Limit Theorem, because

\frac{\bar{X} - \mu}{S/\sqrt{n}} \rightarrow n(0, 1)    (17)

the standard normal distribution. We then see that the coverage probability (and confidence coefficient) of Eqn. 16 is approximately 95 percent.

The above interval is a large sample interval since its justification is based on an asymptotic argument. There are many methods for constructing interval estimators that are valid in small samples, including the following.

3.1 Inverting a Test Statistic

There is a correspondence between acceptance regions of tests (see Hypothesis Testing in Statistics) and confidence sets, summarized in the following theorem.

Theorem. For each θ_0 ∈ Θ, let A(θ_0) be the acceptance region of a level α test of H_0: θ = θ_0. For each x ∈ 𝒳, define a set C(x) in the parameter space by

C(x) = \{\theta_0 : x \in A(\theta_0)\}    (18)

Then the random set C(X) is a 1 − α confidence set. Conversely, let C(X) be a 1 − α confidence set. For any θ_0 ∈ Θ, define

A(\theta_0) = \{x : \theta_0 \in C(x)\}    (19)

Then A(θ_0) is the acceptance region of a level α test of H_0: θ = θ_0.

Example. If X_1, …, X_n are i.i.d. n(μ, σ²), with σ² known, the test of H_0: μ = μ_0 vs. H_1: μ ≠ μ_0 will accept the null hypothesis at level α if

\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu_0 \le \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}    (20)

The interval of values, [x̄ − z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n], for which the null hypothesis will be accepted at level α, is a 1 − α confidence interval for μ.
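The confidence coefficient of the interval in Eqn. (20) can be checked by simulation. In the sketch below, the true mean, σ, n, α, and the number of replications are all illustrative assumptions; the empirical coverage should sit near 1 − α.

```python
# Sketch: coverage of the z-interval obtained by inverting the test in Eqn. (20).
# True mean, sigma, n, alpha, and replication count are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu_true, sigma, n, alpha, reps = 5.0, 2.0, 30, 0.05, 100_000
z = stats.norm.ppf(1 - alpha / 2)                # z_{alpha/2}, upper alpha/2 point

x = rng.normal(mu_true, sigma, size=(reps, n))
xbar = x.mean(axis=1)
lower = xbar - z * sigma / np.sqrt(n)
upper = xbar + z * sigma / np.sqrt(n)

coverage = np.mean((lower <= mu_true) & (mu_true <= upper))
print(coverage)                                  # close to 1 - alpha = 0.95
```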

3.2 Pivotal Inference

Perhaps one of the most elegant methods of constructing set estimators is the use of pivotal quantities (Barnard 1949). A random variable Q(X, θ) = Q(X_1, …, X_n, θ) is a pivotal quantity (or pivot) if the distribution of Q(X, θ) is independent of all parameters. If we find a set C such that P(Q(X, θ) ∈ C) = 1 − α, then the set {θ : Q(X, θ) ∈ C} has coverage probability 1 − α.

In location and scale cases, once we calculate the sample mean X̄ and the sample standard deviation S, we can construct the following pivots:

Form of pdf              Type of pdf       Pivotal quantity
f(x − μ)                 location          X̄ − μ
(1/σ) f(x/σ)             scale             X̄/σ
(1/σ) f((x − μ)/σ)       location-scale    (X̄ − μ)/S          (21)

In general, differences are pivotal for location problems, while ratios (or products) are pivotal for scale problems. See also Fiducial and Structural Statistical Inference.

Example. Suppose that X_1, …, X_n are i.i.d. exponential (λ). Then T = Σ_i X_i is a sufficient statistic for λ and T ~ gamma (n, λ). In the gamma pdf, t and λ appear together as t/λ and, in fact, the gamma (n, λ) pdf, (Γ(n) λ^n)^{-1} t^{n-1} e^{-t/λ}, is a scale family. Thus, if Q(T, λ) = 2T/λ, then

Q(T, \lambda) \sim \mathrm{gamma}(n, \lambda(2/\lambda)) = \mathrm{gamma}(n, 2)

which does not depend on λ. The quantity Q(T, λ) = 2T/λ is a pivot with a gamma (n, 2), or χ²_{2n}, distribution, and a 1 − α pivotal interval is

\frac{2T}{\chi^2_{2n,\,\alpha/2}} \le \lambda \le \frac{2T}{\chi^2_{2n,\,1-\alpha/2}}

where P(χ²_{2n} ≥ χ²_{2n, a}) = a.

3.3 Bayesian Intervals

If π(θ | x) is the posterior distribution of θ given X = x, then for any set A ⊆ Θ the credible probability of A is

P(\theta \in A \mid x) = \int_A \pi(\theta \mid x)\, d\theta

If A = A(x) is chosen so that this posterior probability is 1 − α, then A is called a 1 − α credible set for θ. If π(θ | x) corresponds to a discrete distribution, we replace integrals with sums in the above expressions.

The interpretation of the Bayes interval estimator is different from the classical intervals. In the classical approach, to assert 95 percent coverage is to assert that in 95 percent of repeated experiments, the realized intervals will cover the true parameter. In the Bayesian approach, a 95 percent coverage means that the probability is 95 percent that the parameter is in the realized interval. In the classical approach the randomness comes from the repetition of experiments, while in the Bayesian approach the randomness comes from uncertainty about the value of the parameter (summarized in the prior distribution).

Example. Let X_1, …, X_n be i.i.d. Poisson (λ) and assume that λ has a gamma prior, λ ~ gamma (a, b), where a is an integer. The posterior distribution of λ is

\pi(\lambda \mid \textstyle\sum_i x_i) = \mathrm{gamma}\!\left(a + \textstyle\sum_i x_i,\ [n + (1/b)]^{-1}\right)    (22)

Thus the posterior distribution of 2[n + (1/b)]λ is χ²_{2(a+Σx)}, and a 1 − α Bayes credible interval for λ is

\left\{ \lambda : \frac{\chi^2_{2(a+\Sigma x),\,1-\alpha/2}}{2[n+(1/b)]} \le \lambda \le \frac{\chi^2_{2(a+\Sigma x),\,\alpha/2}}{2[n+(1/b)]} \right\}    (23)

We can also form a Bayes credible set by taking the highest posterior density (HPD) region of the parameter space, by choosing c so that

1-\alpha = \int_{\{\lambda:\, \pi(\lambda \mid \Sigma x) \ge c\}} \pi(\lambda \mid \textstyle\sum_i x_i)\, d\lambda    (24)

Such a construction is optimal in the sense of giving the shortest interval for a given 1 − α (although if the posterior is multimodal the set may not be an interval).
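The two interval constructions of Sects. 3.2 and 3.3 can be computed directly with chi-square and gamma quantiles. This is a sketch under stated assumptions: the data are simulated, the prior parameters are arbitrary, and χ²_{df, a} is read as the upper-a point, P(χ²_{df} ≥ χ²_{df, a}) = a, matching the convention above.

```python
# Sketch: (i) the pivotal interval for an exponential scale parameter lambda,
# using 2T/lambda ~ chi2_{2n}; (ii) the equal-tailed Bayes credible interval of
# Eqns. (22)-(23) for a Poisson rate with a gamma(a, b) prior. Values illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha = 0.05

# (i) Pivotal interval: lambda is the scale (mean) of the exponential, as in the text
lam_true, n = 3.0, 20
T = rng.exponential(scale=lam_true, size=n).sum()
lower = 2 * T / stats.chi2.ppf(1 - alpha / 2, 2 * n)   # 2T / chi2_{2n, alpha/2}
upper = 2 * T / stats.chi2.ppf(alpha / 2, 2 * n)       # 2T / chi2_{2n, 1-alpha/2}
print("pivotal interval for lambda:", lower, upper)

# (ii) Bayes credible interval: posterior is gamma(a + sum(x), scale = 1/(n + 1/b))
a, b = 4, 2.0                                    # hypothetical prior parameters
y = rng.poisson(lam=2.5, size=15)                # hypothetical Poisson counts
post_shape = a + y.sum()
post_scale = 1.0 / (len(y) + 1.0 / b)
cred = stats.gamma.ppf([alpha / 2, 1 - alpha / 2], post_shape, scale=post_scale)
print("equal-tailed credible interval:", cred)   # equivalent to Eqn. (23)
```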

4. Other Intervals

We have presented two-sided parametric intervals that are constructed to cover a parameter. Other types of intervals include (a) one-sided intervals, (b) distribution-free intervals, (c) prediction intervals, and (d) tolerance intervals.

One-sided intervals are those in which only one endpoint is estimated, such as θ ∈ [L(X), ∞). Distribution-free intervals are intervals whose probability guarantee holds with little (or no) assumption on the underlying distribution. The other two interval definitions, together with the usual confidence interval, provide us with a hierarchy of inferences, each more stringent than the previous.

If X_1, …, X_n are i.i.d. from a population with cdf F(x | θ), and C(x) = [L(x), U(x)] is an interval, for a specified value 1 − α it is a:
(a) confidence interval if, for all θ, P_θ[L(X) ≤ θ ≤ U(X)] ≥ 1 − α;
(b) prediction interval if, for all θ, P_θ[L(X) ≤ X_{n+1} ≤ U(X)] ≥ 1 − α;
(c) tolerance interval if, for all θ and for a specified value p, P_θ[F(U(X) | θ) − F(L(X) | θ) ≥ p] ≥ 1 − α.

So a confidence interval covers a mean, a prediction interval covers a new random variable, and a tolerance interval covers a proportion of the population. Thus, each gives a different inference, with the appropriate one being dictated by the problem at hand. Vardeman (1992) argues that these 'other intervals' provide an inference that is different from that of a confidence interval and are as important as confidence intervals.

5. Conclusions

Point estimation is one of the cornerstones of statistical analysis, and the basic element on which many inferences are based. Inferences using point estimators gain statistical validity when they are accompanied by an interval estimate, providing an assessment of the uncertainty. We have mainly discussed parametric point and interval estimation, where we assume that the underlying model is correct. Such an assumption can be questioned, and considerations of nonparametric or robust alternatives can address this (see Robustness in Statistics). For more on these subjects see, for example, Hettmansperger and McKean (1998) or Staudte and Sheather (1990). Full treatments of parametric point and interval estimation can be found in Casella and Berger (2001), Stuart et al. (1999), or Schervish (1995).

Bibliography

Barnard G A 1949 Statistical inference (with discussion). Journal of the Royal Statistical Society, Series B 11: 115–39
Casella G, Berger R L 2001 Statistical Inference, 2nd edn. Wadsworth/Brooks Cole, Pacific Grove, CA
Hettmansperger T P, McKean J W 1998 Robust Nonparametric Statistical Methods. Wiley, New York
Lehmann E L 1998 Introduction to Large-Sample Theory. Springer-Verlag, New York
Lehmann E L, Casella G 1998 Theory of Point Estimation, 2nd edn. Springer-Verlag, New York
Schervish M J 1995 Theory of Statistics. Springer-Verlag, New York
Staudte R G, Sheather S J 1990 Robust Estimation and Testing. John Wiley, New York
Stuart A, Ord J K, Arnold S 1999 Advanced Theory of Statistics, Classical Inference and the Linear Model, 6th edn. Arnold, Oxford University Press, London, Vol. 2A
Vardeman S B 1992 What about other intervals? The American Statistician 46: 193–7

G. Casella and R. L. Berger

Ethical Behavior, Evolution of

The distinction between good or bad and what we ought or ought not do constitutes the subject matter of ethics. Early students of the evolution of ethical behavior (EEB) and some sociobiologists attempted to direct the study of EEB into the domain of prescriptive ethics. Twenty-first century sociobiologists are not concerned with the nature of vice, virtue, or the rules of moral behavior, but with the question of the biological origin of ethical behavior. To disregard the distinction between the questions of ethics and the questions of the evolutionary causes and the bases of human ethical behavior is to misunderstand the discipline of EEB, which is not a branch of philosophy or of ethics, but a subdiscipline of sociobiology. No prescriptive code can be derived from the theory of evolution; therefore, the previously used term 'evolutionary ethics' is a misnomer that must be dropped from sociobiological usage. An excellent philosophical examination (Flew 1967) and a superior historical analysis (Richards 1987) on EEB are available.

The ideas on EEB will be exemplified by a review of arguments, first, of the originators of EEB; second, of ecologists, ethologists, geneticists, linguists, neurophysiologists, and cognitive scientists; and third, of sociobiologists and their supporters and opponents.

1. The Originators of the Idea

Charles Darwin, in On the Origin of Species (1859), established the homology between human and nonhuman primate morphology, but not between human and nonhuman behavior. In his revised edition of The Descent of Man (1885), he recognized that the moral sense or conscience differentiates humans from the other animals, and that 'The imperious word ought seems merely to imply the consciousness of the existence of a rule of conduct, however it may have originated.' He believed that the understanding of the EEB would be enlarged by studies of nonhuman behavior, and that nonhuman animals possessing parental and filial instinct would exhibit a kind of rudimentary intellectual and moral sense. He supported this by arguing that, first, social instincts lead animals to social groups and to perform 'social services'; second, as the mental faculties develop, images and feelings endure and are subject to recall; third, development of language facilitates communication; and last, acquisition of habits helps to guide the conduct of individuals within the community.

Humans, to Darwin, are social animals who have lost some early instincts but retain the protohuman instinctive love and sympathy for others. Some instincts are subject to group selection; the altruistic social behaviors of early humans were not for the good
