
2005 SAMSI Undergraduate Workshop

A Statistical View of Linear Least Squares

Minjung Kyung [email protected] May 22, 2005


Introduction to Linear Regression

• A functional relation between two variables is expressed by a mathematical formula.
  – X denotes the independent variable.
  – Y denotes the dependent variable.
  – A functional relation is of the form Y = f(X).
  – Given a particular value of X, the function f indicates the corresponding value of Y.


Example of a Functional Relation

[Figure: Y (0–300) plotted against X (0–150) for a functional relation Y = f(X).]

SAMSI Linear Least Squares 2005 SAMSI Undergraduate Workshop

Introduction to Linear Regression

• A statistical relation between two variables:
  – is not a perfect one
  – in general, the observations for a statistical relation do not fall directly on the curve of relationship


Scatter Plot and Line of Statistical Relationship

[Figure: two panels, "Scatter Plot" and "Scatter Plot and Line of Statistical Relationship", each showing Work Hrs (100–500) against Lot Size (20–120); the second panel adds the fitted line y = 62.37 + 3.57x.]


Curvilinear Statistical Relation Example

[Figure: Prognosis (10–50) plotted against Days (0–60).]


Introduction to Linear Regression

• A regression model is a formal means of expressing the two essential ingredients of a statistical relation:
  1. A tendency of the response variable Y to vary with the predictor variable X in a systematic fashion.
  2. A scattering of points around the curve of statistical relationship.


Simple Linear Regression Model

Yi = β0 + β1Xi + εi

• Yi is the value of the response variable in the ith trial

• β0 and β1 are parameters (the regression coefficients)

• Xi is the value of the predictor variable in the ith trial

• εi is a random error term
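The model can be made concrete with a short simulation. This is a minimal sketch assuming NumPy; the slope and intercept echo the fitted line y = 62.37 + 3.57x shown earlier, while the sample size and noise level are illustrative choices, not from the slides.

```python
import numpy as np

# A short simulation of the model Yi = beta0 + beta1*Xi + eps_i.
# Sample size n and noise level sigma are illustrative assumptions.
rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 62.37, 3.57, 10.0, 25

X = rng.uniform(20, 120, size=n)        # predictor values for the n trials
eps = rng.normal(0.0, sigma, size=n)    # i.i.d. N(0, sigma^2) error terms
Y = beta0 + beta1 * X + eps             # observed responses
print(Y[:3])
```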


Simple Linear Regression Model

Model Assumptions
1. The error terms are normally distributed with mean 0 and variance σ² for all values of i:

   εi ∼ N(0, σ²)

2. The error terms εi and εj are independent if i ≠ j.
3. Although the model explicitly allows for measurement error in Y, measurements made on X are assumed to be known precisely (there is no measurement error in X).


Simple Linear Regression Model

Important Features of the Simple Linear Regression Model

1. The response Yi is the sum of two components: the deterministic term β0 + β1Xi and the random error term εi. Therefore, Yi is a random variable.

2. The response Yi comes from a probability distribution whose mean is

E[Yi] = β0 + β1Xi.

3. The response Yi exceeds or falls short of the value of the regression function by the error term amount εi.

4. The responses Yi have the same constant variance as the error term εi:

var[Yi] = var[β0 + β1Xi + εi] = σ².


5. The responses Yi and Yj are uncorrelated, since the error terms εi and εj are uncorrelated.

In summary, the responses Yi come from a normal distribution with mean E[Yi] = β0 + β1Xi and variance σ², the same for all levels of X. Further, any two responses Yi and Yj are uncorrelated:

Yi ∼ i.i.d. N(β0 + β1Xi, σ²)


Steps for Selecting an Appropriate Regression Model

1. Exploratory analysis.
2. Develop one or more tentative regression models.
3. Examine and revise the regression models for their appropriateness for the data at hand (or develop new models).
4. Make inferences on the basis of the selected regression model.


Estimation of the Regression Function

Method of least squares

• For the observations (Xi, Yi), consider the deviation of Yi from its expected value under the model:

Yi − (β0 + β1Xi).

• Consider the sum of the squared deviations

Q = Σ_{i=1}^{n} (Yi − β0 − β1Xi)².   (1)

The least squares estimators of β0 and β1 are the values β̂0 and β̂1 that minimize Q for the given sample observations (X1, Y1), (X2, Y2), ..., (Xn, Yn).


Estimation of the Regression Function

Least squares estimators

• The estimators β̂0 and β̂1 that satisfy the least squares criterion can be found in two ways:
  1. Numerical search procedures
  2. Analytical procedures
  We will use the analytical approach.


Estimation of the Regression Function

Least squares estimators

• The values of β0 and β1 that minimize Q can be derived by differentiating (1) with respect to β0 and β1 and setting each result equal to 0:

∂Q/∂β0 = −2 Σ_{i=1}^{n} (Yi − β0 − β1Xi) = 0
∂Q/∂β1 = −2 Σ_{i=1}^{n} Xi(Yi − β0 − β1Xi) = 0

• Simplifying, we get the normal equations

Σ_{i=1}^{n} Yi − nβ0 − β1 Σ_{i=1}^{n} Xi = 0
Σ_{i=1}^{n} XiYi − β0 Σ_{i=1}^{n} Xi − β1 Σ_{i=1}^{n} Xi² = 0

• The normal equations can be solved simultaneously to get estimates of the parameters β0 and β1:

β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)²

β̂0 = (1/n) (Σ_{i=1}^{n} Yi − β̂1 Σ_{i=1}^{n} Xi) = Ȳ − β̂1 X̄

where X̄ and Ȳ are the means of the X and Y observations, respectively.
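The closed-form estimates can be computed directly. A minimal sketch assuming NumPy, with a small hypothetical data set invented for illustration (roughly following Y = 10 + 2X):

```python
import numpy as np

# Hypothetical data (not from the slides).
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])

Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)  # slope estimate
b0 = Ybar - b1 * Xbar                                           # intercept estimate

def Q(beta0, beta1):
    """Sum of squared deviations (1) for a candidate line."""
    return np.sum((Y - beta0 - beta1 * X) ** 2)

# The least squares solution should beat nearby candidate lines.
assert Q(b0, b1) < Q(b0 + 1.0, b1) and Q(b0, b1) < Q(b0, b1 + 0.1)
print(b0, b1)   # 10.0 2.0 for this data set
```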


Estimation of the Regression Function

Residuals

• The fitted value for the ith case is

Ŷi = β̂0 + β̂1Xi

• The ith residual is the difference between the observed value Yi and the fitted value Ŷi:

ei = Yi − Ŷi = Yi − (β̂0 + β̂1Xi).

• Model error term: εi = Yi − (β0 + β1Xi)
  – Represents the vertical deviation of Yi from the unknown true regression line.

• Residual: ei = Yi − Ŷi
  – Represents the vertical deviation of Yi from the fitted value Ŷi on the estimated regression line.
  – Residuals are useful for studying whether a given regression model is appropriate for the given data.


Properties of the fitted regression line

• Σ_{i=1}^{n} ei = 0.
• Σ_{i=1}^{n} ei² is a minimum.
• Σ_{i=1}^{n} Yi = Σ_{i=1}^{n} Ŷi.
• Σ_{i=1}^{n} Xi ei = 0.
• Σ_{i=1}^{n} Ŷi ei = 0.
• The regression line always goes through the point (X̄, Ȳ).
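These properties can be verified numerically. A sketch assuming NumPy, on the same kind of hypothetical data set:

```python
import numpy as np

# Numerical check of the listed properties on hypothetical data.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

assert abs(e.sum()) < 1e-8                           # residuals sum to 0
assert abs(Y.sum() - Yhat.sum()) < 1e-8              # sum of Yi = sum of fitted values
assert abs(np.sum(X * e)) < 1e-6                     # residuals orthogonal to X
assert abs(np.sum(Yhat * e)) < 1e-6                  # residuals orthogonal to fitted values
assert abs((b0 + b1 * X.mean()) - Y.mean()) < 1e-8   # line passes through (Xbar, Ybar)
print("all properties hold")
```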


Estimation of σ²

• A variety of inferences concerning the regression function require an estimate of σ².
  – To get an estimate of σ², first compute the error sum of squares (or residual sum of squares):

SSE = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} ei².

  – The mean square error (MSE) is computed as

MSE = SSE / (n − 2).

  – It can be shown that MSE is an unbiased estimator of σ².
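A minimal NumPy sketch of these two quantities, again on a hypothetical data set:

```python
import numpy as np

# SSE and MSE for the fitted line; data are hypothetical.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
n = len(X)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

SSE = np.sum(e ** 2)      # error (residual) sum of squares
MSE = SSE / (n - 2)       # unbiased estimator of sigma^2
print(SSE, MSE)           # 60.0 7.5 for this data set
```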


Matrix Approach to Least Squares

• The regression model Yi = β0 + β1Xi + εi can be written in matrix notation as

Y = Xβ + ε

where Y = (Y1, Y2, ..., Yn)′ is the n×1 vector of responses, X is the n×2 design matrix whose ith row is (1, Xi), β = (β0, β1)′, and ε = (ε1, ε2, ..., εn)′ is the n×1 vector of error terms.


Matrix Approach to Least Squares

• The normal equations in matrix form are

X′Xβ = X′Y

• The model parameters can be estimated as follows:

β̂ = (X′X)⁻¹X′Y


Matrix Approach to Least Squares

• The residuals are computed using:

e = Y − Ŷ = Y − Xβ̂,

where

Ŷ = Xβ̂ = (β̂0 + β̂1X1, ..., β̂0 + β̂1Xn)′.

• The estimate of σ² is computed as follows:

σ̂² = e′e / (n − 2)
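A minimal NumPy sketch of the matrix approach, solving the normal equations on a hypothetical data set:

```python
import numpy as np

# Matrix-form least squares: solve the normal equations X'X beta = X'Y.
x = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
Xmat = np.column_stack([np.ones_like(x), x])   # design matrix: column of 1s, column of x

beta_hat = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ Y)   # (X'X)^{-1} X'Y
e = Y - Xmat @ beta_hat                        # residual vector
sigma2_hat = (e @ e) / (len(Y) - 2)            # e'e / (n - 2)
print(beta_hat, sigma2_hat)
```

In practice `np.linalg.lstsq(Xmat, Y)` is preferred numerically, but solving the normal equations mirrors the formula on the slide.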


Inferences in Regression Analysis

Inferences concerning β1

• Point estimator of β1:

β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)²

• Estimate of the standard error of β̂1:

SE[β̂1] = sqrt( MSE / Σ_{i=1}^{n} (Xi − X̄)² )

• Confidence interval for β1:

β̂1 ± t_{1−α/2; n−2} SE[β̂1]
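A sketch of the interval computation, assuming NumPy and SciPy (for the t quantile) and a hypothetical data set:

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the slope, following the formulas above.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
n = len(X)
Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()

MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
se_b1 = np.sqrt(MSE / Sxx)                    # SE[beta1_hat]
tcrit = stats.t.ppf(0.975, df=n - 2)          # t_{1-alpha/2; n-2} with alpha = 0.05
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)
print(ci)
```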


Inferences in Regression Analysis

Inferences concerning β1

• To test H0: β1 = 0 vs. Ha: β1 ≠ 0,
  – Test statistic:

    t = (β̂1 − 0) / SE[β̂1]

  – p-value:

    p = P(|T_{n−2}| > |t|)

→ if p-value < α, we reject H0.
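The test can be sketched with NumPy and SciPy on a hypothetical data set:

```python
import numpy as np
from scipy import stats

# Two-sided t test of H0: beta1 = 0, following the formulas above.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
n = len(X)
Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
se_b1 = np.sqrt(MSE / Sxx)

t_stat = (b1 - 0.0) / se_b1                          # test statistic
p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)    # P(|T_{n-2}| > |t|)
reject_H0 = p_value < 0.05                           # decision at alpha = 0.05
print(t_stat, p_value, reject_H0)
```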


Inferences in Regression Analysis

Inferences concerning β0

• Point estimator of β0:

β̂0 = Ȳ − β̂1X̄

• Estimate of the standard error of β̂0:

SE[β̂0] = sqrt( MSE [ 1/n + X̄² / Σ_{i=1}^{n} (Xi − X̄)² ] )

• Confidence interval for β0:

β̂0 ± t_{1−α/2; n−2} SE[β̂0]
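The intercept interval follows the same pattern as the slope interval; a sketch assuming NumPy and SciPy, on a hypothetical data set:

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the intercept, following the formulas above.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
n = len(X)
Sxx = np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
MSE = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)

se_b0 = np.sqrt(MSE * (1.0 / n + X.mean() ** 2 / Sxx))  # SE[beta0_hat]
tcrit = stats.t.ppf(0.975, df=n - 2)                    # t_{1-alpha/2; n-2}
ci = (b0 - tcrit * se_b0, b0 + tcrit * se_b0)
print(ci)
```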


Inferences in Regression Analysis

Inferences concerning β0

• To test H0: β0 = 0 vs. Ha: β0 ≠ 0,
  – Test statistic:

    t = (β̂0 − 0) / SE[β̂0]

  – p-value:

    p = P(|T_{n−2}| > |t|)

→ if p-value < α, we reject H0.


Checking the Model Assumptions

• If the model is appropriate for the data at hand, the observed residuals ei should reflect the properties assumed for the εi.
• Residuals can be used to detect departures from the linear regression model:
  – A residual plot against the fitted values can be used to determine whether the error terms have a constant variance.
  – A plot of the residuals in the order of the data can be used to check for correlation between the error terms. When the error terms are independent, we expect them to fluctuate in a random pattern around 0.
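Plot-free numeric stand-ins for these two checks can be sketched as below; in practice one would also use formal tests such as Breusch-Pagan (constant variance) or Durbin-Watson (serial correlation). The data set and both summary statistics are illustrative choices, not from the slides.

```python
import numpy as np

# Rough residual diagnostics on a hypothetical data set.
X = np.array([30., 20., 60., 80., 40., 50., 60., 30., 70., 60.])
Y = np.array([73., 50., 128., 170., 87., 108., 135., 69., 148., 132.])
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

# Constant variance: compare residual spread in the lower and upper halves
# of the fitted values (a ratio far from 1 suggests non-constant variance).
order = np.argsort(Yhat)
n = len(e)
spread_ratio = e[order[n // 2:]].std() / e[order[:n // 2]].std()

# Independence: lag-1 correlation of residuals in data order
# (values far from 0 suggest correlated error terms).
lag1 = np.corrcoef(e[:-1], e[1:])[0, 1]
print(spread_ratio, lag1)
```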


Prototype Residual Plots

[Figure: two prototype residual plots, "Constant Error Variance" (residuals vs. Lot Size) and "Need Curvilinear Regression" (residuals vs. Days).]


[Figure: two more prototype plots, "Increasing Error Variance" (residuals vs. Age) and "Outlier Effect" (Mortality vs. Temperature).]


Nonlinear Regression

Nonlinear Regression vs. Linear Regression
• A regression model is called nonlinear if the derivatives of the model with respect to the model parameters depend on one or more parameters.
  – e.g., y = δ + (α − δ)/(1 + exp(β log(x/γ))) + ε
  – Take the derivative with respect to δ, for example: ∂y/∂δ = 1 − 1/(1 + exp(β log(x/γ))).
  – This derivative involves other parameters, hence the model is nonlinear.
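The derivative claim can be checked numerically with a pure-Python finite-difference sketch; the parameter values are illustrative assumptions.

```python
import math

# Finite-difference check that dy/d(delta) depends on the other parameters.
# Model: y = delta + (alpha - delta) / (1 + exp(beta * log(x / gamma)))
def eta(x, alpha, beta, gamma, delta):
    return delta + (alpha - delta) / (1.0 + math.exp(beta * math.log(x / gamma)))

def d_eta_d_delta(x, alpha, beta, gamma, delta, h=1e-6):
    """Central-difference approximation to the partial derivative in delta."""
    return (eta(x, alpha, beta, gamma, delta + h)
            - eta(x, alpha, beta, gamma, delta - h)) / (2.0 * h)

# The analytic derivative is 1 - 1/(1 + exp(beta * log(x / gamma))):
# changing beta changes the derivative, so it depends on another parameter.
d1 = d_eta_d_delta(2.0, alpha=10.0, beta=1.0, gamma=1.0, delta=0.0)
d2 = d_eta_d_delta(2.0, alpha=10.0, beta=3.0, gamma=1.0, delta=0.0)
print(d1, d2)   # roughly 2/3 and 8/9: the derivative depends on beta
```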


Nonlinear Regression

Nonlinear Regression vs. Linear Regression
• A regression model is not necessarily nonlinear just because the graphed regression trend is curved.
  – e.g., y = β0 + β1x + β2x² + ε
  – Take the derivatives of y with respect to the parameters β0, β1, and β2: ∂y/∂β0 = 1, ∂y/∂β1 = x, ∂y/∂β2 = x².
  – Since none of these derivatives depends on a model parameter, the model is linear.


Nonlinear Regression

Fitting Nonlinear Regression Models
• The general form of a nonlinear regression model is

y = η(x, β) + ε

where x is a vector of covariates, β is a vector of unknown parameters, and ε is a N(0, σ²) error term.
• The unknown parameters are estimated by solving

min_β Σ_{i=1}^{n} (yi − η(xi, β))².


Nonlinear Regression

Fitting Nonlinear Regression Models
• The variance parameter σ² is estimated by the residual mean square, as in linear regression:

σ̂² = e′e / (n − 2),

where

e = (y1 − η(x1, β̂), ..., yn − η(xn, β̂))′.


Nonlinear Regression

Fitting Nonlinear Regression Models
• There is no explicit formula for the estimates, so iterative procedures are required.
• One of the disadvantages of nonlinear models is that the fitting process is iterative.
  – To estimate the parameters of the model, you commence with a set of user-supplied starting values.
  – Care must be exercised in choosing good starting values.
  – It is thus sensible to start the iterative process with different sets of starting values and to observe whether the program arrives at the same parameter estimates.


• The software tries to improve the quality of the model fit by adjusting the parameter values successively; each such adjustment is one iteration.
• In the next iteration, the program again attempts to improve the fit by modifying the parameters.
• Once no further improvement is possible, the fit is considered converged.
• Note that all results in nonlinear regression are asymptotic: the standard error, for example, is exact only for an infinitely large sample. For any finite sample size, the reported standard error is an approximation that improves with increasing sample size.


Nonlinear Regression

Example of a Nonlinear Regression Model

[Figure: Prognosis (10–50) plotted against Days (0–60), the curvilinear example shown earlier.]

• The formula of this model is y = β0 exp(β1x) + ε.


• The estimated parameters are β̂0 = 58.6066 and β̂1 = −0.0396, with approximate standard errors 1.4845 and 0.0017, respectively.
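A fit of this exponential model can be sketched with SciPy's `curve_fit`. The simulated data below use true values near the reported estimates; the slides' actual data are not reproduced, and the starting values are illustrative. Fitting from two different sets of starting values, as recommended earlier, checks that the iterations reach the same estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

def eta(x, b0, b1):
    """Mean function of the model y = b0 * exp(b1 * x) + eps."""
    return b0 * np.exp(b1 * x)

# Simulated data with true values chosen near the reported estimates.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 60.0, 40)
y = eta(x, 58.6, -0.04) + rng.normal(0.0, 1.0, size=x.size)

# Fit from two different sets of starting values (p0); a trustworthy fit
# should arrive at the same parameter estimates from both.
fit_a, cov_a = curve_fit(eta, x, y, p0=[50.0, -0.1])
fit_b, cov_b = curve_fit(eta, x, y, p0=[70.0, -0.01])
se_a = np.sqrt(np.diag(cov_a))   # asymptotic (approximate) standard errors
print(fit_a, se_a)
```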
