
Least Squares Estimation

Sara A. van de Geer, Volume 2, pp. 1041–1045

in

Encyclopedia of Statistics in Behavioral Science

ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4

Editors

Brian S. Everitt & David C. Howell

© John Wiley & Sons, Ltd, Chichester, 2005

The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other (see Optimization Methods). We will study the method in the context of a regression problem, where the variation in one variable, called the response variable $Y$, can be partly explained by the variation in the other variables, called covariables $X$ (see Multiple Linear Regression). For example, variation in exam results $Y$ is mainly caused by variation in abilities and diligence $X$ of the students, or variation in survival times $Y$ (see Survival Analysis) is primarily due to variations in environmental conditions $X$. Given the value of $X$, the best prediction of $Y$ (in terms of mean squared error – see Estimation) is the mean $f(X)$ of $Y$ given $X$. We say that $Y$ is a function of $X$ plus noise:

\[
Y = f(X) + \text{noise}.
\]

The function $f$ is called a regression function. It is to be estimated from the observed covariables and responses $(x_1, y_1), \ldots, (x_n, y_n)$.

Suppose $f$ is known up to a finite number $p \le n$ of parameters $\beta = (\beta_1, \ldots, \beta_p)'$, that is, $f = f_\beta$. We estimate $\beta$ by the value $\hat\beta$ that gives the best fit to the data. The least squares estimator, denoted by $\hat\beta$, is that value of $b$ that minimizes

\[
\sum_{i=1}^{n} (y_i - f_b(x_i))^2, \qquad (1)
\]

over all possible $b$.

The least squares criterion is a computationally convenient measure of fit. It corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances. Other measures of fit are sometimes used, for example, least absolute deviations, which is more robust against outliers (see Robust Testing Procedures).
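To make criterion (1) concrete, here is a minimal NumPy sketch that evaluates the sum of squared discrepancies for a candidate parameter vector $b$; the function names and the simple straight-line model used for illustration are assumptions for this sketch, not part of the original entry.

```python
import numpy as np

def lsq_criterion(b, x, y, f):
    """Sum of squared discrepancies between y_i and f_b(x_i), as in (1)."""
    return np.sum((y - f(b, x)) ** 2)

# Illustrative regression function: f_b(x) = b1 + b2 * x.
def f_linear(b, x):
    return b[0] + b[1] * x

rng = np.random.default_rng(0)
x = np.arange(1, 101) / 100
y = 1.0 - 3.0 * x + rng.standard_normal(100)

# The criterion is smaller near the data-generating parameters than far from them.
print(lsq_criterion(np.array([1.0, -3.0]), x, y, f_linear))
print(lsq_criterion(np.array([0.0, 0.0]), x, y, f_linear))
```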

Linear Regression. Consider the case where $f_\beta$ is a linear function of $\beta$, that is,

\[
f_\beta(X) = X_1\beta_1 + \cdots + X_p\beta_p. \qquad (2)
\]

Here, $(X_1, \ldots, X_p)$ stand for the observed variables used in $f_\beta(X)$.

To write down the least squares estimator for the linear model, it will be convenient to use matrix notation. Let $y = (y_1, \ldots, y_n)'$ and let $X$ be the $n \times p$ data matrix of the $n$ observations on the $p$ variables,

\[
X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,p} \end{pmatrix} = \begin{pmatrix} x_1 & \cdots & x_p \end{pmatrix}, \qquad (3)
\]

where $x_j$ is the column vector containing the $n$ observations on variable $j$, $j = 1, \ldots, p$. Denote the squared length of an $n$-dimensional vector $v$ by $\|v\|^2 = v'v = \sum_{i=1}^{n} v_i^2$. Then expression (1) can be written as

\[
\|y - Xb\|^2,
\]

which is the squared distance between the vector $y$ and the linear combination $Xb$ of the columns of the matrix $X$. The distance is minimized by taking the projection of $y$ on the space spanned by the columns of $X$ (see Figure 1).

Figure 1  The projection of the vector $y$ on the plane spanned by $X$

Suppose now that $X$ has full column rank, that is, no column in $X$ can be written as a linear combination of the other columns. Then the least squares estimator $\hat\beta$ is given by

\[
\hat\beta = (X'X)^{-1}X'y. \qquad (4)
\]
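Equation (4) translates directly into code. The sketch below assumes $X$ has full column rank and solves the normal equations rather than forming the inverse explicitly; `numpy.linalg.lstsq` is shown as a numerically robust alternative. The data are simulated for illustration only.

```python
import numpy as np

def least_squares(X, y):
    """Least squares estimator (4); assumes X has full column rank."""
    # Solving the normal equations X'X b = X'y is equivalent to (4)
    # but avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Small illustration with simulated data (not the entry's example).
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
y = X @ np.array([2.0, -1.0]) + 0.5 * rng.standard_normal(n)

beta_hat = least_squares(X, y)
print(beta_hat)                              # close to (2, -1)
print(np.linalg.lstsq(X, y, rcond=None)[0])  # same result via lstsq
```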

The Variance of the Least Squares Estimator. In order to construct confidence intervals for the components of $\hat\beta$, or linear combinations of these components, one needs an estimator of the covariance matrix of $\hat\beta$. Now, it can be shown that, given $X$, the covariance matrix of the estimator $\hat\beta$ is equal to

\[
(X'X)^{-1}\sigma^2,
\]

where $\sigma^2$ is the variance of the noise. As an estimator of $\sigma^2$, we take

\[
\hat\sigma^2 = \frac{1}{n-p}\,\|y - X\hat\beta\|^2 = \frac{1}{n-p}\sum_{i=1}^{n} \hat e_i^2, \qquad (5)
\]

where the $\hat e_i$ are the residuals

\[
\hat e_i = y_i - x_{i,1}\hat\beta_1 - \cdots - x_{i,p}\hat\beta_p. \qquad (6)
\]

The covariance matrix of $\hat\beta$ can, therefore, be estimated by

\[
(X'X)^{-1}\hat\sigma^2.
\]

For example, the estimate of the variance of $\hat\beta_j$ is

\[
\widehat{\mathrm{var}}(\hat\beta_j) = \tau_j^2\,\hat\sigma^2,
\]

where $\tau_j^2$ is the $j$th element on the diagonal of $(X'X)^{-1}$. A confidence interval for $\beta_j$ is now obtained by taking the least squares estimator $\hat\beta_j$ plus or minus a margin:

\[
\hat\beta_j \pm c\sqrt{\widehat{\mathrm{var}}(\hat\beta_j)}, \qquad (7)
\]

where $c$ depends on the chosen confidence level. For a 95% confidence interval, the value $c = 1.96$ is a good approximation when $n$ is large. For smaller values of $n$, one usually takes a more conservative $c$, using the tables for the Student distribution with $n - p$ degrees of freedom.
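The following sketch implements the variance estimator (5), the resulting standard errors, and the interval (7) with the large-sample choice $c = 1.96$; the helper names and the simulated data are illustrative choices, not taken from the entry.

```python
import numpy as np

def lsq_with_se(X, y):
    """Least squares fit with sigma^2-hat from (5) and standard errors from (X'X)^{-1} sigma^2-hat."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p)   # equation (5)
    se = np.sqrt(np.diag(XtX_inv) * sigma2_hat)    # sqrt(tau_j^2 * sigma^2-hat)
    return beta_hat, sigma2_hat, se

def conf_intervals(beta_hat, se, c=1.96):
    """Intervals beta_hat_j +/- c * se_j, as in (7); c = 1.96 for a large-sample 95% level."""
    return np.column_stack([beta_hat - c * se, beta_hat + c * se])

# Small illustration with simulated straight-line data.
rng = np.random.default_rng(2)
n = 100
x = np.arange(1, n + 1) / n
X = np.column_stack([np.ones(n), x])
y = 1.0 - 3.0 * x + rng.standard_normal(n)
beta_hat, sigma2_hat, se = lsq_with_se(X, y)
print(conf_intervals(beta_hat, se))
```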

Numerical Example. Consider a regression with constant, linear, and quadratic terms:

\[
f_\beta(X) = \beta_1 + X\beta_2 + X^2\beta_3. \qquad (8)
\]

We take $n = 100$ and $x_i = i/n$, $i = 1, \ldots, n$. The matrix $X$ is now

\[
X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}. \qquad (9)
\]

This gives

\[
X'X = \begin{pmatrix} 100 & 50.5 & 33.8350 \\ 50.5 & 33.8350 & 25.5025 \\ 33.8350 & 25.5025 & 20.5033 \end{pmatrix}, \quad
(X'X)^{-1} = \begin{pmatrix} 0.0937 & -0.3729 & 0.3092 \\ -0.3729 & 1.9571 & -1.8189 \\ 0.3092 & -1.8189 & 1.8009 \end{pmatrix}. \qquad (10)
\]

We simulated $n$ independent standard normal random variables $e_1, \ldots, e_n$, and calculated, for $i = 1, \ldots, n$,

\[
y_i = 1 - 3x_i + e_i. \qquad (11)
\]

Thus, in this example, the parameters are

\[
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -3 \\ 0 \end{pmatrix}. \qquad (12)
\]

Moreover, $\sigma^2 = 1$. Because this is a simulation, these values are known.

To calculate the least squares estimator, we need the values of $X'y$, which, in this case, turn out to be

\[
X'y = \begin{pmatrix} -64.2007 \\ -52.6743 \\ -42.2025 \end{pmatrix}. \qquad (13)
\]

The least squares estimate is thus

\[
\hat\beta = \begin{pmatrix} 0.5778 \\ -2.3856 \\ -0.0446 \end{pmatrix}. \qquad (14)
\]

From the data, we also calculated the estimated variance of the noise, and found the value

\[
\hat\sigma^2 = 0.883. \qquad (15)
\]

The data are represented in Figure 2. The dashed line is the true regression $f_\beta(x)$. The solid line is the estimated regression $f_{\hat\beta}(x)$.

Figure 2  Observed data, true regression (dashed line), and least squares estimate (solid line)

The estimated regression is barely distinguishable from a straight line. Indeed, the value $\hat\beta_3 = -0.0446$ of the quadratic term is small. The estimated variance of $\hat\beta_3$ is

\[
\widehat{\mathrm{var}}(\hat\beta_3) = 1.8009 \times 0.883 = 1.5902. \qquad (16)
\]

Using $c = 1.96$ in (7), we find the confidence interval

\[
\beta_3 \in -0.0446 \pm 1.96\sqrt{1.5902} = [-2.5162,\ 2.4270]. \qquad (17)
\]

Thus, $\beta_3$ is not significantly different from zero at the 5% level, and, hence, we do not reject the hypothesis $H_0: \beta_3 = 0$.

Below, we will consider general test statistics for testing hypotheses on $\beta$. In this particular case, the test statistic takes the form

\[
T^2 = \frac{\hat\beta_3^2}{\widehat{\mathrm{var}}(\hat\beta_3)} = 0.0012. \qquad (18)
\]

Using this test statistic is equivalent to the method based on the confidence interval above. Indeed, as $T^2 < (1.96)^2$, we do not reject the hypothesis $H_0: \beta_3 = 0$.
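The numerical example can be reproduced along the following lines. Since the noise is freshly simulated and the original random seed is not reported, the computed numbers will differ somewhat from those in (13)–(18).

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed; the entry's seed is unknown
n = 100
x = np.arange(1, n + 1) / n                      # x_i = i/n
X = np.column_stack([np.ones(n), x, x ** 2])     # design matrix (9)
y = 1.0 - 3.0 * x + rng.standard_normal(n)       # model (11) with beta = (1, -3, 0)

XtX_inv = np.linalg.inv(X.T @ X)                 # compare with (10)
beta_hat = XtX_inv @ X.T @ y                     # compare with (14)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - 3)            # compare with (15)

var_b3 = XtX_inv[2, 2] * sigma2_hat              # compare with (16)
ci_b3 = beta_hat[2] + np.array([-1, 1]) * 1.96 * np.sqrt(var_b3)  # compare with (17)
T2 = beta_hat[2] ** 2 / var_b3                   # compare with (18)
print(beta_hat, sigma2_hat, ci_b3, T2)
```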

Under the hypothesis $H_0: \beta_3 = 0$, we use the least squares estimator

\[
\begin{pmatrix} \hat\beta_{1,0} \\ \hat\beta_{2,0} \end{pmatrix} = (X_0'X_0)^{-1}X_0'y = \begin{pmatrix} 0.5854 \\ -2.4306 \end{pmatrix}. \qquad (19)
\]

Here,

\[
X_0 = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \qquad (20)
\]

It is important to note that setting $\beta_3$ to zero changes the values of the least squares estimates of $\beta_1$ and $\beta_2$:

\[
\hat\beta_{1,0} \neq \hat\beta_1, \qquad \hat\beta_{2,0} \neq \hat\beta_2. \qquad (21)
\]

This is because $\hat\beta_3$ is correlated with $\hat\beta_1$ and $\hat\beta_2$. One may verify that the correlation matrix of $\hat\beta$ is

\[
\begin{pmatrix} 1 & -0.8708 & 0.7529 \\ -0.8708 & 1 & -0.9689 \\ 0.7529 & -0.9689 & 1 \end{pmatrix}.
\]

Testing Linear Hypotheses. The testing problem considered in the numerical example is a special case of testing a linear hypothesis $H_0: A\beta = 0$, where $A$ is some $r \times p$ matrix. As another example of such a hypothesis, suppose we want to test whether two coefficients are equal, say $H_0: \beta_1 = \beta_2$. This means there is one restriction, $r = 1$, and we can take $A$ as the $1 \times p$ row vector

\[
A = (1, -1, 0, \ldots, 0). \qquad (22)
\]

In general, we assume that there are no linear dependencies in the $r$ restrictions $A\beta = 0$. To test the linear hypothesis, we use the test statistic

\[
T^2 = \frac{\|X\hat\beta_0 - X\hat\beta\|^2/r}{\hat\sigma^2}, \qquad (23)
\]

where $\hat\beta_0$ is the least squares estimator under $H_0: A\beta = 0$. In the numerical example, this statistic takes the form given in (18). When the noise is normally distributed, critical values can be found in a table for the F distribution with $r$ and $n - p$ degrees of freedom. For large $n$, approximate critical values are in the table of the $\chi^2$ distribution with $r$ degrees of freedom.
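The test statistic (23) can be sketched in code for a general hypothesis $H_0: A\beta = 0$. Here the restricted estimator $\hat\beta_0$ is computed by reparameterizing with a basis of the null space of $A$ (one of several equivalent approaches); the function name and the use of SciPy are choices made for this illustration, not part of the entry.

```python
import numpy as np
from scipy.linalg import null_space
from scipy.stats import f as f_dist

def linear_hypothesis_test(X, y, A):
    """T^2 of equation (23) for H0: A beta = 0, with an F reference distribution."""
    n, p = X.shape
    r = A.shape[0]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)

    # Under H0, beta = N @ gamma for some gamma, where the columns of N span null(A).
    N = null_space(A)
    XN = X @ N
    gamma_hat = np.linalg.solve(XN.T @ XN, XN.T @ y)
    beta0_hat = N @ gamma_hat                      # restricted least squares estimator

    T2 = np.sum((X @ beta0_hat - X @ beta_hat) ** 2) / r / sigma2_hat
    p_value = f_dist.sf(T2, r, n - p)              # normal-noise reference distribution
    return T2, p_value

# Example: test H0: beta_3 = 0 in the quadratic model of the numerical example.
rng = np.random.default_rng(3)
n = 100
x = np.arange(1, n + 1) / n
X = np.column_stack([np.ones(n), x, x ** 2])
y = 1.0 - 3.0 * x + rng.standard_normal(n)
A = np.array([[0.0, 0.0, 1.0]])                    # picks out beta_3
print(linear_hypothesis_test(X, y, A))
```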

Some Extensions

Weighted Least Squares. In many cases, the variance $\sigma_i^2$ of the noise at measurement $i$ depends on $x_i$. Observations where $\sigma_i^2$ is large are less accurate and, hence, should play a smaller role in the estimation of $\beta$. The weighted least squares estimator is that value of $b$ that minimizes the criterion

\[
\sum_{i=1}^{n} \frac{(y_i - f_b(x_i))^2}{\sigma_i^2}
\]

over all possible $b$. In the linear case, this criterion is numerically of the same form, as we can make the change of variables $\tilde y_i = y_i/\sigma_i$ and $\tilde x_{i,j} = x_{i,j}/\sigma_i$ (see the first sketch following this section).

The minimum $\chi^2$-estimator (see Estimation) is an example of a weighted least squares estimator in the context of density estimation.

Nonlinear Regression. When $f_\beta$ is a nonlinear function of $\beta$, one usually needs iterative algorithms to find the least squares estimator (see the second sketch following this section). The variance can then be approximated as in the linear case, with $\dot f_{\hat\beta}(x_i)$ taking the role of the rows of $X$. Here, $\dot f_\beta(x_i) = \partial f_\beta(x_i)/\partial\beta$ is the row vector of derivatives of $f_\beta(x_i)$. For more details, see, for example, [4].

Nonparametric Regression. In nonparametric regression, one only assumes a certain amount of smoothness for $f$ (e.g., as in [1]), or, alternatively, certain qualitative assumptions such as monotonicity (see [3]). Many nonparametric least squares procedures have been developed and their numerical and theoretical behavior discussed in the literature. Related developments include estimation methods for models where the number of parameters $p$ is about as large as the number of observations $n$. The curse of dimensionality in such models is handled by applying various complexity regularization techniques (see, e.g., [2]).
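As noted in the Weighted Least Squares paragraph above, the weighted criterion reduces to ordinary least squares after rescaling. Below is a sketch of that change of variables, assuming the noise standard deviations $\sigma_i$ are known and supplied as a vector (in practice they often have to be estimated); the data and function name are illustrative.

```python
import numpy as np

def weighted_least_squares(X, y, sigma):
    """Weighted least squares via rescaling: minimize sum_i ((y_i - x_i'b) / sigma_i)^2."""
    X_tilde = X / sigma[:, None]      # x_tilde_{i,j} = x_{i,j} / sigma_i
    y_tilde = y / sigma               # y_tilde_i   = y_i / sigma_i
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y_tilde)

# Example: the noise variance grows with x, so later observations get less weight.
rng = np.random.default_rng(4)
n = 100
x = np.arange(1, n + 1) / n
X = np.column_stack([np.ones(n), x])
sigma = 0.5 + 2.0 * x                              # assumed known standard deviations
y = 1.0 - 3.0 * x + sigma * rng.standard_normal(n)
print(weighted_least_squares(X, y, sigma))
```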

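For nonlinear regression, the second sketch uses the iterative solver `scipy.optimize.least_squares` on an illustrative exponential-decay model (not taken from the entry) and approximates the covariance matrix as in the linear case, with the Jacobian of $f_\beta$ at $\hat\beta$ taking the role of $X$.

```python
import numpy as np
from scipy.optimize import least_squares

def f_nl(beta, x):
    """An illustrative nonlinear regression function: beta_1 * exp(-beta_2 * x)."""
    return beta[0] * np.exp(-beta[1] * x)

rng = np.random.default_rng(5)
n = 100
x = np.arange(1, n + 1) / n
y = f_nl([2.0, 1.5], x) + 0.1 * rng.standard_normal(n)

# Iteratively minimize the least squares criterion, starting from a rough guess.
fit = least_squares(lambda b: y - f_nl(b, x), x0=np.array([1.0, 1.0]))
beta_hat = fit.x

# Approximate covariance matrix: (J'J)^{-1} * sigma^2-hat, with J the Jacobian of f
# at beta-hat playing the role of the design matrix X.
p = beta_hat.size
sigma2_hat = np.sum(fit.fun ** 2) / (n - p)   # fit.fun holds the residuals at the solution
J = -fit.jac                                  # Jacobian of the residuals; the sign does not affect J'J
cov_hat = np.linalg.inv(J.T @ J) * sigma2_hat
print(beta_hat, np.sqrt(np.diag(cov_hat)))
```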
References

[1] Green, P.J. & Silverman, B.W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman & Hall, London.
[2] Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.
[3] Robertson, T., Wright, F.T. & Dykstra, R.L. (1988). Order Restricted Statistical Inference, Wiley, New York.
[4] Seber, G.A.F. & Wild, C.J. (2003). Nonlinear Regression, Wiley, New York.

Sara A. van de Geer