Least Squares Estimation

Sara A. van de Geer, Volume 2, pp. 1041–1045, in Encyclopedia of Statistics in Behavioral Science, ISBN-13: 978-0-470-86080-9, ISBN-10: 0-470-86080-4. Editors: Brian S. Everitt & David C. Howell. John Wiley & Sons, Ltd, Chichester, 2005.

The method of least squares is about estimating parameters by minimizing the squared discrepancies between observed data, on the one hand, and their expected values on the other (see Optimization Methods). We will study the method in the context of a regression problem, where the variation in one variable, called the response variable $Y$, can be partly explained by the variation in other variables, called covariables $X$ (see Multiple Linear Regression). For example, variation in exam results $Y$ is mainly caused by variation in the abilities and diligence $X$ of the students, or variation in survival times $Y$ (see Survival Analysis) is primarily due to variation in environmental conditions $X$. Given the value of $X$, the best prediction of $Y$ (in terms of mean square error – see Estimation) is the mean $f(X)$ of $Y$ given $X$. We say that $Y$ is a function of $X$ plus noise:

$$Y = f(X) + \text{noise}.$$

The function $f$ is called a regression function. It is to be estimated from a sample of $n$ covariables and their responses, $(x_1, y_1), \ldots, (x_n, y_n)$.

Suppose $f$ is known up to a finite number $p \le n$ of parameters $\beta = (\beta_1, \ldots, \beta_p)'$, that is, $f = f_\beta$. We estimate $\beta$ by the value $\hat\beta$ that gives the best fit to the data. The least squares estimator, denoted by $\hat\beta$, is the value of $b$ that minimizes

$$\sum_{i=1}^{n} (y_i - f_b(x_i))^2 \qquad (1)$$

over all possible $b$.

The least squares criterion is a computationally convenient measure of fit. It corresponds to maximum likelihood estimation when the noise is normally distributed with equal variances. Other measures of fit are sometimes used, for example, least absolute deviations, which is more robust against outliers (see Robust Testing Procedures).

Linear Regression. Consider the case where $f_\beta$ is a linear function of $\beta$, that is,

$$f_\beta(X) = X_1\beta_1 + \cdots + X_p\beta_p. \qquad (2)$$

Here, $(X_1, \ldots, X_p)$ stand for the observed variables used in $f_\beta(X)$.

To write down the least squares estimator for the linear regression model, it is convenient to use matrix notation. Let $y = (y_1, \ldots, y_n)'$ and let $X$ be the $n \times p$ data matrix of the $n$ observations on the $p$ variables,

$$X = \begin{pmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & & \vdots \\ x_{n,1} & \cdots & x_{n,p} \end{pmatrix} = (x_1 \; \cdots \; x_p), \qquad (3)$$

where $x_j$ is the column vector containing the $n$ observations on variable $j$, $j = 1, \ldots, p$. Denote the squared length of an $n$-dimensional vector $v$ by $\|v\|^2 = v'v = \sum_{i=1}^{n} v_i^2$. Then expression (1) can be written as

$$\|y - Xb\|^2,$$

which is the squared distance between the vector $y$ and the linear combination $Xb$ of the columns of the matrix $X$. The distance is minimized by taking the projection of $y$ on the space spanned by the columns of $X$ (see Figure 1).

[Figure 1: The projection of the vector $y$ on the plane spanned by the columns of $X$.]

Suppose now that $X$ has full column rank, that is, no column of $X$ can be written as a linear combination of the other columns. Then the least squares estimator $\hat\beta$ is given by

$$\hat\beta = (X'X)^{-1}X'y. \qquad (4)$$
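As a concrete illustration of (4), here is a minimal numpy sketch (the function name `least_squares` is ours; we assume `X` is an $n \times p$ array of full column rank and `y` a length-$n$ array):

```python
import numpy as np

def least_squares(X, y):
    """Least squares estimator (4): solve the normal equations X'X b = X'y."""
    # Solving the linear system is numerically safer than forming (X'X)^{-1}.
    return np.linalg.solve(X.T @ X, X.T @ y)
```

In practice, `np.linalg.lstsq(X, y, rcond=None)` computes the same projection via an orthogonal decomposition, which is more stable when the columns of $X$ are nearly collinear.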
The Variance of the Least Squares Estimator. In order to construct confidence intervals for the components of $\hat\beta$, or for linear combinations of these components, one needs an estimator of the covariance matrix of $\hat\beta$. It can be shown that, given $X$, the covariance matrix of the estimator $\hat\beta$ is equal to

$$(X'X)^{-1}\sigma^2,$$

where $\sigma^2$ is the variance of the noise. As an estimator of $\sigma^2$, we take

$$\hat\sigma^2 = \frac{1}{n-p}\,\|y - X\hat\beta\|^2 = \frac{1}{n-p}\sum_{i=1}^{n} \hat e_i^2, \qquad (5)$$

where the $\hat e_i$ are the residuals

$$\hat e_i = y_i - x_{i,1}\hat\beta_1 - \cdots - x_{i,p}\hat\beta_p. \qquad (6)$$

The covariance matrix of $\hat\beta$ can therefore be estimated by

$$(X'X)^{-1}\hat\sigma^2.$$

For example, the estimate of the variance of $\hat\beta_j$ is

$$\widehat{\mathrm{var}}(\hat\beta_j) = \tau_j^2\,\hat\sigma^2,$$

where $\tau_j^2$ is the $j$th element on the diagonal of $(X'X)^{-1}$. A confidence interval for $\beta_j$ is now obtained by taking the least squares estimator $\hat\beta_j$ plus or minus a margin:

$$\hat\beta_j \pm c\sqrt{\widehat{\mathrm{var}}(\hat\beta_j)}, \qquad (7)$$

where $c$ depends on the chosen confidence level. For a 95% confidence interval, the value $c = 1.96$ is a good approximation when $n$ is large. For smaller values of $n$, one usually takes a more conservative $c$, using the tables of the Student distribution with $n - p$ degrees of freedom.

Numerical Example. Consider a regression with constant, linear, and quadratic terms:

$$f_\beta(X) = \beta_1 + X\beta_2 + X^2\beta_3. \qquad (8)$$

We take $n = 100$ and $x_i = i/n$, $i = 1, \ldots, n$. The matrix $X$ is now

$$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}. \qquad (9)$$

This gives

$$X'X = \begin{pmatrix} 100 & 50.5 & 33.8350 \\ 50.5 & 33.8350 & 25.5025 \\ 33.8350 & 25.5025 & 20.5033 \end{pmatrix}, \quad
(X'X)^{-1} = \begin{pmatrix} 0.0937 & -0.3729 & 0.3092 \\ -0.3729 & 1.9571 & -1.8189 \\ 0.3092 & -1.8189 & 1.8009 \end{pmatrix}. \qquad (10)$$

We simulated $n$ independent standard normal random variables $e_1, \ldots, e_n$, and calculated, for $i = 1, \ldots, n$,

$$y_i = 1 - 3x_i + e_i. \qquad (11)$$

Thus, in this example, the parameters are

$$\begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -3 \\ 0 \end{pmatrix}. \qquad (12)$$

Moreover, $\sigma^2 = 1$. Because this is a simulation, these values are known.

To calculate the least squares estimator, we need the values of $X'y$, which, in this case, turn out to be

$$X'y = \begin{pmatrix} -64.2007 \\ -52.6743 \\ -42.2025 \end{pmatrix}. \qquad (13)$$

The least squares estimate is thus

$$\hat\beta = \begin{pmatrix} 0.5778 \\ -2.3856 \\ -0.0446 \end{pmatrix}. \qquad (14)$$

From the data, we also calculated the estimated variance of the noise, and found the value

$$\hat\sigma^2 = 0.883. \qquad (15)$$

The data are represented in Figure 2. The dashed line is the true regression $f_\beta(x)$; the solid line is the estimated regression $f_{\hat\beta}(x)$.

[Figure 2: Observed data, true regression $1 - 3x$ (dashed line), and least squares estimate (solid line).]

The estimated regression is barely distinguishable from a straight line. Indeed, the value $\hat\beta_3 = -0.0446$ of the quadratic term is small. The estimated variance of $\hat\beta_3$ is

$$\widehat{\mathrm{var}}(\hat\beta_3) = 1.8009 \times 0.883 = 1.5902. \qquad (16)$$

Using $c = 1.96$ in (7), we find the confidence interval

$$\beta_3 \in -0.0446 \pm 1.96\sqrt{1.5902} = [-2.5162,\, 2.4270]. \qquad (17)$$

Thus, $\beta_3$ is not significantly different from zero at the 5% level, and, hence, we do not reject the hypothesis $H_0: \beta_3 = 0$.
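A short sketch of the computations in (5)–(7), as an illustration rather than part of the original text; `scipy.stats.t` supplies the Student quantile $c$ for a given confidence level:

```python
import numpy as np
from scipy import stats

def ls_confidence_intervals(X, y, level=0.95):
    """Return beta_hat, sigma2_hat (5), and the intervals (7) for each beta_j."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ (X.T @ y)
    residuals = y - X @ beta_hat                      # residuals (6)
    sigma2_hat = residuals @ residuals / (n - p)      # noise variance estimate (5)
    se = np.sqrt(np.diag(XtX_inv) * sigma2_hat)       # sqrt of var-hat(beta_j)
    c = stats.t.ppf(1 - (1 - level) / 2, df=n - p)    # approx. 1.96 for large n
    return beta_hat, sigma2_hat, np.column_stack((beta_hat - c * se,
                                                  beta_hat + c * se))
```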
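The numerical example can be replicated along the following lines. Note that a fresh simulation draws different noise $e_1, \ldots, e_n$, so the estimates will differ from (13)–(17); only deterministic quantities such as $X'X$ in (10) are reproduced exactly:

```python
import numpy as np

rng = np.random.default_rng()            # unseeded: draws differ from the article's
n = 100
x = np.arange(1, n + 1) / n              # x_i = i/n
X = np.column_stack((np.ones(n), x, x ** 2))             # design matrix (9)
y = 1 - 3 * x + rng.standard_normal(n)                   # model (11), beta = (1, -3, 0)'

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)             # cf. (14)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - 3)   # cf. (15)
var_beta3 = np.linalg.inv(X.T @ X)[2, 2] * sigma2_hat    # cf. (16)
ci = beta_hat[2] + np.array([-1.0, 1.0]) * 1.96 * np.sqrt(var_beta3)  # cf. (17)
print(beta_hat, sigma2_hat, ci)
```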
Below, we will consider general test statistics for testing hypotheses on $\beta$. In this particular case, the test statistic takes the form

$$T^2 = \frac{\hat\beta_3^2}{\widehat{\mathrm{var}}(\hat\beta_3)} = 0.0012. \qquad (18)$$

Using this test statistic is equivalent to the above method based on the confidence interval. Indeed, as $T^2 < (1.96)^2$, we do not reject the hypothesis $H_0: \beta_3 = 0$.

Under the hypothesis $H_0: \beta_3 = 0$, we use the least squares estimator

$$\begin{pmatrix} \hat\beta_{1,0} \\ \hat\beta_{2,0} \end{pmatrix} = (X_0'X_0)^{-1}X_0'y = \begin{pmatrix} 0.5854 \\ -2.4306 \end{pmatrix}. \qquad (19)$$

Here,

$$X_0 = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \qquad (20)$$

It is important to note that setting $\beta_3$ to zero changes the values of the least squares estimates of $\beta_1$ and $\beta_2$:

$$\hat\beta_{1,0} \neq \hat\beta_1, \qquad \hat\beta_{2,0} \neq \hat\beta_2. \qquad (21)$$

This is because $\hat\beta_3$ is correlated with $\hat\beta_1$ and $\hat\beta_2$. One may verify that the correlation matrix of $\hat\beta$ is

$$\begin{pmatrix} 1 & -0.8708 & 0.7529 \\ -0.8708 & 1 & -0.9689 \\ 0.7529 & -0.9689 & 1 \end{pmatrix}.$$

Testing Linear Hypotheses. The testing problem considered in the numerical example is a special case of testing a linear hypothesis $H_0: A\beta = 0$, where $A$ is some $r \times p$ matrix. As another example of such a hypothesis, suppose we want to test whether two coefficients are equal, say $H_0: \beta_1 = \beta_2$. This means there is one restriction, $r = 1$, and we can take $A$ as the $1 \times p$ row vector

$$A = (1, -1, 0, \ldots, 0). \qquad (22)$$

In general, we assume that there are no linear dependencies among the $r$ restrictions $A\beta = 0$. To test the linear hypothesis, we use the statistic

$$T^2 = \frac{\|X\hat\beta_0 - X\hat\beta\|^2 / r}{\hat\sigma^2}, \qquad (23)$$

where $\hat\beta_0$ is the least squares estimator under $H_0: A\beta = 0$. In the numerical example, this statistic takes the form given in (18). When the noise is normally distributed, critical values can be found in a table for the F distribution with $r$ and $n - p$ degrees of freedom. For large $n$, approximate critical values are in the table of the $\chi^2$ distribution with $r$ degrees of freedom. A code sketch of this test is given below, after the next section.

Some Extensions

Weighted Least Squares. In many cases, the variance $\sigma_i^2$ of the noise at measurement $i$ depends on $x_i$. Observations where $\sigma_i^2$ is large are less accurate and, hence, should play a smaller role in the estimation of $\beta$. The weighted least squares estimator is the value of $b$ that minimizes the criterion

$$\sum_{i=1}^{n} \frac{(y_i - f_b(x_i))^2}{\sigma_i^2}$$

over all possible $b$. In the linear case, this criterion is numerically of the same form as before, as we can make the change of variables $\tilde y_i = y_i / \sigma_i$ and $\tilde x_{i,j} = x_{i,j} / \sigma_i$.
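A minimal sketch of this change of variables, assuming the standard deviations $\sigma_1, \ldots, \sigma_n$ are known and passed as an array `sigma` (in practice they usually have to be modeled or estimated):

```python
import numpy as np

def weighted_least_squares(X, y, sigma):
    """Weighted least squares via rescaling: divide row i of X and y_i by
    sigma_i, then apply ordinary least squares to the transformed data."""
    Xw = X / sigma[:, None]
    yw = y / sigma
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
```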
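Returning to the test of $H_0: A\beta = 0$: the statistic (23) can be computed as sketched below. For simplicity, we specify the restricted model directly through its design matrix `X0` (for $H_0: \beta_3 = 0$, this is the matrix in (20)); deriving `X0` from a general restriction matrix $A$ is omitted here:

```python
import numpy as np
from scipy import stats

def f_test(X, X0, y, r, alpha=0.05):
    """Compute T^2 from (23) and compare with the F(r, n-p) critical value."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # unrestricted fit
    beta0_hat = np.linalg.solve(X0.T @ X0, X0.T @ y)    # fit under H0
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    T2 = np.sum((X0 @ beta0_hat - X @ beta_hat) ** 2) / r / sigma2_hat
    return T2, T2 > stats.f.ppf(1 - alpha, r, n - p)    # (T^2, reject H0?)
```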
