
Sociology 740 John Fox

Lecture Notes

3. Linear Least-Squares Regression

Copyright © 2014 by John Fox

1. Goals

▶ To review/introduce the calculation and interpretation of the least-squares regression coefficients in simple and multiple regression.
▶ To review/introduce the calculation and interpretation of the standard error of the regression and the simple and multiple correlation coefficients.
▶ To introduce and criticize the use of standardized regression coefficients.
▶ Time and interest permitting: To introduce matrix arithmetic and least-squares regression in matrix form.

2. Introduction

▶ Despite its limitations, linear least-squares regression lies at the very heart of applied statistics:
  • Some data are adequately summarized by linear least-squares regression.
  • The effective application of linear regression is expanded by data transformations and diagnostics.
  • The general linear model, an extension of least-squares linear regression, is able to accommodate a very broad class of specifications.
  • Linear least-squares provides a computational basis for a variety of generalizations (such as generalized linear models).
▶ This lecture describes the mechanics and descriptive interpretation of linear least-squares regression.


3. Simple Regression

3.1 Least-Squares Fit

▶ Figure 1 shows Davis's data on the measured and reported weight in kilograms of 101 women who were engaged in regular exercise.
  • The relationship between measured and reported weight appears to be linear, so it is reasonable to fit a line to the data.
▶ Denoting measured weight by $Y$ and reported weight by $X$, a line relating the two variables has the equation $Y = A + BX$.
  • No line can pass perfectly through all of the data points. A residual, $E$, reflects this fact.
  • The regression equation for the $i$th of the $n = 101$ observations is
    $$Y_i = A + BX_i + E_i = \hat{Y}_i + E_i$$

[Figure 1. A least-squares line fit to Davis's data on reported and measured weight (horizontal axis: Reported Weight in kg; vertical axis: Measured Weight in kg). The broken line is the line $Y = X$. Some points are over-plotted.]


  • The residual
    $$E_i = Y_i - \hat{Y}_i = Y_i - (A + BX_i)$$
    is the signed vertical distance between the point and the line, as shown in Figure 2.
▶ A line that fits the data well makes the residuals small.
  • Simply requiring that the sum of residuals, $\sum_{i=1}^{n} E_i$, be small is futile, since large negative residuals can offset large positive ones.
  • Indeed, any line through the point $(\bar{X}, \bar{Y})$ has $\sum E_i = 0$.
▶ Two possibilities immediately present themselves:
  • Find $A$ and $B$ to minimize the absolute residuals, $\sum |E_i|$, which leads to least-absolute-values (LAV) regression.
  • Find $A$ and $B$ to minimize the squared residuals, $\sum E_i^2$, which leads to least-squares (LS) regression.

[Figure 2. The residual $E_i$ is the signed vertical distance between the point $(X_i, Y_i)$ and the line $\hat{Y} = A + BX$.]

▶ In least-squares regression, we seek the values of $A$ and $B$ that minimize
  $$S(A, B) = \sum_{i=1}^{n} E_i^2 = \sum (Y_i - A - BX_i)^2$$
  • For those with calculus:
    – The most direct approach is to take the partial derivatives of the sum-of-squares function with respect to the coefficients:
      $$\frac{\partial S(A, B)}{\partial A} = \sum (-1)(2)(Y_i - A - BX_i)$$
      $$\frac{\partial S(A, B)}{\partial B} = \sum (-X_i)(2)(Y_i - A - BX_i)$$
  • Setting these partial derivatives to zero yields simultaneous linear equations for $A$ and $B$, the normal equations for simple regression:
    $$An + B\sum X_i = \sum Y_i$$
    $$A\sum X_i + B\sum X_i^2 = \sum X_i Y_i$$

  • Solving the normal equations produces the least-squares coefficients:
    $$A = \bar{Y} - B\bar{X}$$
    $$B = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n\sum X_i^2 - \left(\sum X_i\right)^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
    – The formula for $A$ implies that the least-squares line passes through the point-of-means of the two variables. The least-squares residuals therefore sum to zero.

    – The second normal equation implies that $\sum X_i E_i = 0$; similarly, $\sum \hat{Y}_i E_i = 0$. These properties imply that the residuals are uncorrelated with both the $X$'s and the $\hat{Y}$'s, as the sketch below illustrates numerically.
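The following minimal sketch computes the least-squares coefficients from the closed-form solution of the normal equations and confirms the residual properties just listed. The data are hypothetical stand-ins (not Davis's actual observations), and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical reported (X) and measured (Y) weights; not Davis's actual data.
rng = np.random.default_rng(0)
X = rng.uniform(40, 75, size=101)
Y = 1.8 + 0.98 * X + rng.normal(0, 2, size=101)

# Closed-form least-squares coefficients from the normal equations.
B = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - B * X.mean()

Yhat = A + B * X
E = Y - Yhat  # least-squares residuals

print(np.isclose(E.sum(), 0.0))           # residuals sum to zero
print(np.isclose(np.sum(X * E), 0.0))     # residuals uncorrelated with X
print(np.isclose(np.sum(Yhat * E), 0.0))  # ... and with the fitted values
```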


▶ For Davis's data on measured weight ($Y$) and reported weight ($X$):
  $$n = 101 \qquad \bar{Y} = \frac{5780}{101} = 57.23 \qquad \bar{X} = \frac{5731}{101} = 56.74$$
  $$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 4435. \qquad \sum (X_i - \bar{X})^2 = 4539.$$
  $$B = \frac{4435}{4539} = 0.9771 \qquad A = 57.23 - 0.9771 \times 56.74 = 1.789$$
  • The least-squares regression equation is
    $$\widehat{\text{Measured Weight}} = 1.79 + 0.977 \times \text{Reported Weight}$$
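As a quick check on the arithmetic, this short sketch reproduces the calculation from the summary statistics quoted above:

```python
# Reproducing the Davis-data calculations from the quoted summary statistics.
n = 101
Ybar = 5780 / n   # 57.23
Xbar = 5731 / n   # 56.74
SXY = 4435.0      # sum of (X - Xbar)(Y - Ybar)
SXX = 4539.0      # sum of (X - Xbar)^2

B = SXY / SXX             # 0.9771
A = Ybar - B * Xbar       # 1.789
print(f"Measured Weight-hat = {A:.2f} + {B:.3f} * Reported Weight")
```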

▶ Interpretation of the least-squares coefficients:
  • $B = 0.977$: A one-kilogram increase in reported weight is associated on average with just under a one-kilogram increase in measured weight.
    – Since the data are not longitudinal, the phrase "a unit increase" here implies not a literal change over time, but rather a static comparison between two individuals who differ by one kilogram in their reported weights.


  • Ordinarily, we may interpret the intercept $A$ as the fitted value associated with $X = 0$, but it is impossible for an individual to have a reported weight equal to zero.
    – The intercept $A$ is usually of little direct interest, since the fitted value at $X = 0$ is rarely important.
    – Here, however, if individuals' reports are unbiased predictions of their actual weights, then we should have $\hat{Y} = X$, i.e., $A = 0$ and $B = 1$. The intercept $A = 1.79$ is close to zero, and the slope $B = 0.977$ is close to one.

3.2 Simple Correlation

▶ It is of interest to determine how closely the line fits the scatter of points.
▶ The standard deviation of the residuals, $S_E$, called the standard error of the regression, provides one index of fit.
  • Because of estimation considerations, the variance of the residuals is defined using $n - 2$ degrees of freedom:
    $$S_E^2 = \frac{\sum E_i^2}{n - 2}$$
  • The standard error is therefore
    $$S_E = \sqrt{\frac{\sum E_i^2}{n - 2}}$$
  • Since it is measured in the units of the response variable, the standard error represents a type of 'average' residual.


  • For Davis's regression of measured on reported weight, the sum of squared residuals is $\sum E_i^2 = 418.9$, and the standard error is
    $$S_E = \sqrt{\frac{418.9}{101 - 2}} = 2.05 \text{ kg}$$
  • I believe that social scientists overemphasize correlation and pay insufficient attention to the standard error of the regression.
▶ The correlation coefficient provides a relative measure of fit: To what degree do our predictions of $Y$ improve when we base those predictions on the linear relationship between $Y$ and $X$?
  • A relative index of fit requires a baseline: how well can $Y$ be predicted if $X$ is disregarded?
    – To disregard the explanatory variable is implicitly to fit the equation
      $$Y_i = A' + E_i'$$
    – We can find the best-fitting constant $A'$ by least-squares, minimizing
      $$S(A') = \sum E_i'^2 = \sum (Y_i - A')^2$$

    – The value of $A'$ that minimizes this sum of squares is the response-variable mean, $\bar{Y}$.

  • The residuals $E_i = Y_i - \hat{Y}_i$ from the linear regression of $Y$ on $X$ will generally be smaller than the residuals $E_i' = Y_i - \bar{Y}$, and it is necessarily the case that
    $$\sum (Y_i - \hat{Y}_i)^2 \le \sum (Y_i - \bar{Y})^2$$
    – This inequality holds because the 'null model' $Y_i = A' + E_i'$ is a special case of the more general linear-regression 'model' $Y_i = A + BX_i + E_i$, setting $B = 0$.
  • We call
    $$\text{TSS} = \sum E_i'^2 = \sum (Y_i - \bar{Y})^2$$
    the total sum of squares for $Y$, while
    $$\text{RSS} = \sum E_i^2 = \sum (Y_i - \hat{Y}_i)^2$$
    is called the residual sum of squares.


  • The difference between the two, termed the regression sum of squares,
    $$\text{RegSS} \equiv \text{TSS} - \text{RSS}$$
    gives the reduction in squared error due to the linear regression.
  • The ratio of RegSS to TSS, the proportional reduction in squared error, defines the square of the correlation coefficient:
    $$r^2 \equiv \frac{\text{RegSS}}{\text{TSS}}$$
  • To find the correlation coefficient $r$ we take the positive square root of $r^2$ when the simple-regression slope $B$ is positive, and the negative square root when $B$ is negative.
  • If there is a perfect positive linear relationship between $Y$ and $X$, then $r = 1$.
  • A perfect negative linear relationship corresponds to $r = -1$.
  • If there is no linear relationship between $Y$ and $X$, then RSS = TSS, RegSS = 0, and $r = 0$.
  • Between these extremes, $r$ gives the direction of the linear relationship between the two variables, and $r^2$ may be interpreted as the proportion of the total variation of $Y$ that is 'captured' by its linear regression on $X$.
  • Figure 3 depicts several different levels of correlation.
▶ The decomposition of total variation into 'explained' and 'unexplained' components, paralleling the decomposition of each observation into a fitted value and a residual, is typical of linear models.
  • The decomposition is called the analysis of variance for the regression:
    $$\text{TSS} = \text{RegSS} + \text{RSS}$$
  • The regression sum of squares can also be directly calculated as
    $$\text{RegSS} = \sum (\hat{Y}_i - \bar{Y})^2$$
▶ It is also possible to define $r$ by analogy with the correlation $\rho$ between two random variables.


[Figure 3. Scatterplots showing different correlation coefficients $r$: (a) $r = 0$; (b) $r = 0$; (c) $r = .2$; (d) $r = .5$; (e) $r = .8$; (f) $r = 1$. Panel (b) reminds us that $r$ measures linear relationship.]

  • First defining the sample covariance between $X$ and $Y$,
    $$S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
    we may then write
    $$r = \frac{S_{XY}}{S_X S_Y} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$
    where $S_X$ and $S_Y$ are, respectively, the sample standard deviations of $X$ and $Y$.
▶ Some comparisons between $r$ and $B$:
  • The correlation coefficient $r$ is symmetric in $X$ and $Y$, while the least-squares slope $B$ is not.
  • The slope coefficient $B$ is measured in the units of the response variable per unit of the explanatory variable. For example, if dollars of income are regressed on years of education, then the units of $B$ are dollars/year. The correlation coefficient $r$, however, is unitless.


  • A change in scale of $Y$ or $X$ produces a compensating change in $B$, but does not affect $r$. If, for example, income is measured in thousands of dollars rather than in dollars, the units of the slope become thousands of dollars per year, and the value of the slope decreases by a factor of 1000, but $r$ remains the same.
▶ For Davis's regression of measured on reported weight,
  $$\text{TSS} = 4753.8 \qquad \text{RSS} = 418.87 \qquad \text{RegSS} = 4334.9$$
  $$r^2 = \frac{4334.9}{4753.8} = .91188$$

  • Since $B$ is positive, $r = +\sqrt{.91188} = .9549$.
  • The linear regression of measured on reported weight captures 91 percent of the variation in measured weight.

  • Equivalently,
    $$S_{XY} = \frac{4435.9}{101 - 1} = 44.359$$
    $$S_X^2 = \frac{4539.3}{101 - 1} = 45.393 \qquad S_Y^2 = \frac{4753.8}{101 - 1} = 47.538$$
    $$r = \frac{44.359}{\sqrt{45.393 \times 47.538}} = .9549$$
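A brief sketch confirming both routes to $r$ from the quantities just given:

```python
import math

# Route 1: proportional reduction in squared error.
TSS, RSS = 4753.8, 418.87
RegSS = TSS - RSS                      # 4334.9
r = math.sqrt(RegSS / TSS)             # +.9549, since B > 0

# Route 2: sample covariance over the product of standard deviations.
S_XY = 4435.9 / 100                    # sample covariance
S2_X = 4539.3 / 100                    # sample variance of X
S2_Y = 4753.8 / 100                    # sample variance of Y
r_alt = S_XY / math.sqrt(S2_X * S2_Y)  # .9549 again

print(round(r, 4), round(r_alt, 4))
```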


4. Multiple Regression

4.1 Two Explanatory Variables

▶ The linear multiple-regression equation
  $$\hat{Y} = A + B_1 X_1 + B_2 X_2$$
  for two explanatory variables, $X_1$ and $X_2$, describes a plane in the three-dimensional $\{X_1, X_2, Y\}$ space, as shown in Figure 4.
  • The residual is the signed vertical distance from the point to the plane:
    $$E_i = Y_i - \hat{Y}_i = Y_i - (A + B_1 X_{i1} + B_2 X_{i2})$$
  • To make the plane come as close as possible to the points in the aggregate, we want the values of $A$, $B_1$, and $B_2$ that minimize the sum of squared residuals:
    $$S(A, B_1, B_2) = \sum E_i^2 = \sum (Y_i - A - B_1 X_{i1} - B_2 X_{i2})^2$$

[Figure 4. The multiple-regression plane $\hat{Y} = A + B_1 X_1 + B_2 X_2$, showing the residual $E_i$ for the point $(X_{i1}, X_{i2}, Y_i)$ and its fitted value $(X_{i1}, X_{i2}, \hat{Y}_i)$.]

  • Differentiating the sum-of-squares function with respect to the regression coefficients, setting the partial derivatives to zero, and rearranging terms produces the normal equations,
    $$An + B_1 \sum X_{i1} + B_2 \sum X_{i2} = \sum Y_i$$
    $$A\sum X_{i1} + B_1 \sum X_{i1}^2 + B_2 \sum X_{i1}X_{i2} = \sum X_{i1}Y_i$$
    $$A\sum X_{i2} + B_1 \sum X_{i2}X_{i1} + B_2 \sum X_{i2}^2 = \sum X_{i2}Y_i$$
  • This is a system of three linear equations in three unknowns, so it usually provides a unique solution for the least-squares regression coefficients $A$, $B_1$, and $B_2$.

    – Dropping the subscript $i$ for observations, and using asterisks to denote variables in mean-deviation form (e.g., $Y^* \equiv Y_i - \bar{Y}$),
      $$A = \bar{Y} - B_1\bar{X}_1 - B_2\bar{X}_2$$
      $$B_1 = \frac{\sum X_1^* Y^* \sum X_2^{*2} - \sum X_2^* Y^* \sum X_1^* X_2^*}{\sum X_1^{*2} \sum X_2^{*2} - \left(\sum X_1^* X_2^*\right)^2}$$
      $$B_2 = \frac{\sum X_2^* Y^* \sum X_1^{*2} - \sum X_1^* Y^* \sum X_1^* X_2^*}{\sum X_1^{*2} \sum X_2^{*2} - \left(\sum X_1^* X_2^*\right)^2}$$
    – The least-squares coefficients are uniquely defined as long as
      $$\sum X_1^{*2} \sum X_2^{*2} \ne \left(\sum X_1^* X_2^*\right)^2$$
      that is, unless $X_1$ and $X_2$ are perfectly correlated or unless one of the explanatory variables is invariant.

    – If $X_1$ and $X_2$ are perfectly correlated, then they are said to be collinear. (A sketch implementing these closed-form expressions follows.)
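The following minimal sketch implements the mean-deviation formulas above; the function name and the collinearity check are illustrative conveniences, not from the original notes.

```python
import numpy as np

def two_predictor_ls(x1, x2, y):
    """Closed-form least squares for Y on X1 and X2 (numpy arrays)."""
    # Put the variables in mean-deviation ("asterisk") form.
    x1s, x2s, ys = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
    denom = np.sum(x1s**2) * np.sum(x2s**2) - np.sum(x1s * x2s) ** 2
    if denom == 0:
        raise ValueError("X1 and X2 are collinear (or one is invariant)")
    B1 = (np.sum(x1s * ys) * np.sum(x2s**2)
          - np.sum(x2s * ys) * np.sum(x1s * x2s)) / denom
    B2 = (np.sum(x2s * ys) * np.sum(x1s**2)
          - np.sum(x1s * ys) * np.sum(x1s * x2s)) / denom
    A = y.mean() - B1 * x1.mean() - B2 * x2.mean()
    return A, B1, B2
```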


▶ An illustration, using Duncan's occupational prestige data and regressing the prestige of occupations ($Y$) on their educational and income levels ($X_1$ and $X_2$, respectively):
  $$n = 45 \qquad \bar{Y} = \frac{2146}{45} = 47.69 \qquad \bar{X}_1 = \frac{2365}{45} = 52.56 \qquad \bar{X}_2 = \frac{1884}{45} = 41.87$$
  $$\sum X_1^{*2} = 38{,}971. \qquad \sum X_2^{*2} = 26{,}271. \qquad \sum X_1^* X_2^* = 23{,}182.$$
  $$\sum X_1^* Y^* = 35{,}152. \qquad \sum X_2^* Y^* = 28{,}383.$$
  • Substituting these results into the equations for the least-squares coefficients produces $A = -6.070$, $B_1 = 0.5459$, and $B_2 = 0.5987$.
  • The fitted least-squares regression equation is
    $$\widehat{\text{Prestige}} = -6.07 + 0.546 \times \text{Education} + 0.599 \times \text{Income}$$
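Plugging the quoted sums into the closed-form expressions reproduces the reported coefficients:

```python
# Duncan's mean-deviation sums of squares and cross-products, as given above.
S11, S22, S12 = 38971.0, 26271.0, 23182.0
S1Y, S2Y = 35152.0, 28383.0
Ybar, X1bar, X2bar = 47.69, 52.56, 41.87

denom = S11 * S22 - S12**2
B1 = (S1Y * S22 - S2Y * S12) / denom   # 0.5459
B2 = (S2Y * S11 - S1Y * S12) / denom   # 0.5987
A = Ybar - B1 * X1bar - B2 * X2bar     # -6.07
print(round(A, 2), round(B1, 4), round(B2, 4))
```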

▶ A central difference in interpretation between simple and multiple regression: The slope coefficients for the explanatory variables in the multiple regression are partial coefficients, while the slope coefficient in simple regression gives the marginal relationship between the response variable and a single explanatory variable.
  • That is, each slope in multiple regression represents the 'effect' on the response variable of a one-unit increment in the corresponding explanatory variable, holding constant the value of the other explanatory variable.
  • The simple-regression slope effectively ignores the other explanatory variable.
  • This interpretation of the multiple-regression slope is apparent in the figure showing the multiple-regression plane. Because the regression plane is flat, its slope in the direction of $X_1$, holding $X_2$ constant, does not depend upon the specific value at which $X_2$ is fixed.



  • Algebraically, fix $X_2$ to the specific value $x_2$ and see how $\hat{Y}$ changes as $X_1$ is increased by 1, from some specific value $x_1$ to $x_1 + 1$:
    $$[A + B_1(x_1 + 1) + B_2 x_2] - (A + B_1 x_1 + B_2 x_2) = B_1$$
  • A similar result holds for $X_2$.
▶ For Duncan's regression:
  • A unit increase in education, holding income constant, is associated on average with an increase of 0.55 units in prestige.
  • A unit increase in income, holding education constant, is associated on average with an increase of 0.60 units in prestige.
  • The regression intercept, $A = -6.1$, has the following literal interpretation: The fitted value of prestige is $-6.1$ for a hypothetical occupation with education and income levels both equal to zero. No occupations have levels of zero for both income and education, however, and the response variable cannot take on negative values.

4.2 Several Explanatory Variables

▶ For the general case of $k$ explanatory variables, the multiple-regression equation is
  $$Y_i = A + B_1 X_{i1} + B_2 X_{i2} + \cdots + B_k X_{ik} + E_i = \hat{Y}_i + E_i$$
  • It is not possible to visualize the point cloud of the data directly when $k > 2$, but it is simple to find the values of $A$ and the $B$'s that minimize
    $$S(A, B_1, B_2, \ldots, B_k) = \sum_{i=1}^{n} [Y_i - (A + B_1 X_{i1} + B_2 X_{i2} + \cdots + B_k X_{ik})]^2$$


  • Minimization of the sum-of-squares function produces the normal equations for general multiple regression:
    $$An + B_1 \sum X_{i1} + B_2 \sum X_{i2} + \cdots + B_k \sum X_{ik} = \sum Y_i$$
    $$A\sum X_{i1} + B_1 \sum X_{i1}^2 + B_2 \sum X_{i1}X_{i2} + \cdots + B_k \sum X_{i1}X_{ik} = \sum X_{i1}Y_i$$
    $$A\sum X_{i2} + B_1 \sum X_{i2}X_{i1} + B_2 \sum X_{i2}^2 + \cdots + B_k \sum X_{i2}X_{ik} = \sum X_{i2}Y_i$$
    $$\vdots$$
    $$A\sum X_{ik} + B_1 \sum X_{ik}X_{i1} + B_2 \sum X_{ik}X_{i2} + \cdots + B_k \sum X_{ik}^2 = \sum X_{ik}Y_i$$

  • Because the normal equations are linear, and because there are as many equations as unknown regression coefficients ($k + 1$), there is usually a unique solution for the coefficients $A, B_1, B_2, \ldots, B_k$.
  • Only when one explanatory variable is a perfect linear function of the others, or when one or more explanatory variables are invariant, will the normal equations not have a unique solution.
  • Dividing the first normal equation through by $n$ reveals that the least-squares surface passes through the point of means $(\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k, \bar{Y})$.
▶ To illustrate the solution of the normal equations, let us return to the Canadian occupational prestige data, regressing the prestige of the occupations on average education, average income, and the percent of women in each occupation.


  • The various sums, sums of squares, and sums of products that are required are given in the following table:

    Variable         Prestige      Education    Income          Percent Women
    Prestige         253,618.
    Education        55,326.       12,513.
    Income           37,748,108.   8,121,410.   6,534,383,460.
    Percent Women    131,909.      32,281.      14,093,097.     187,312.
    Sum              4777.         1095.        693,386.        2956.

  • Substituting these values into the normal equations and solving for the regression coefficients produces
    $$A = -6.7943 \qquad B_1 = 4.1866 \qquad B_2 = 0.0013136 \qquad B_3 = -0.0089052$$
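As a sketch of how the normal equations can be solved numerically, the table's sums can be assembled into the coefficient matrix and right-hand-side vector (anticipating the matrix notation of Section 5) and handed to a linear-equation solver. The value $n = 102$ is taken from the number of occupations noted below; the layout assumes the table entries are raw sums of squares and cross-products.

```python
import numpy as np

n = 102
# Coefficient matrix of the normal equations: raw sums of squares and
# cross-products for (constant, education, income, percent women).
XtX = np.array([
    [n,        1095.0,     693386.0,      2956.0],
    [1095.0,   12513.0,    8121410.0,     32281.0],
    [693386.0, 8121410.0,  6534383460.0,  14093097.0],
    [2956.0,   32281.0,    14093097.0,    187312.0],
])
# Right-hand side: cross-products of each column with prestige.
Xty = np.array([4777.0, 55326.0, 37748108.0, 131909.0])

b = np.linalg.solve(XtX, Xty)
print(b)  # should reproduce approximately [-6.7943, 4.1866, 0.0013136, -0.0089052]
```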

  • The fitted regression equation is
    $$\widehat{\text{Prestige}} = -6.794 + 4.187 \times \text{Education} + 0.001314 \times \text{Income} - 0.008905 \times \text{Percent Women}$$
  • In interpreting the regression coefficients, we need to keep in mind the units of each variable:
  • Prestige scores are arbitrarily scaled, and range from a minimum of 14.8 to a maximum of 87.2 for these 102 occupations; the hinge-spread of prestige is 24.4 points.
  • Education is measured in years, and hence the impact of education on prestige is considerable: a little more than four points, on average, for each year of education, holding income and gender composition constant.


  • Despite the small absolute size of its coefficient, the partial effect of income is also substantial: about 0.001 points on average for an additional dollar of income, or one point for each 1,000 dollars.
  • The impact of gender composition, holding education and income constant, is very small: an average decline of about 0.01 points for each one-percent increase in the percentage of women in an occupation.

4.3 Multiple Correlation

▶ As in simple regression, the standard error in multiple regression measures the 'average' size of the residuals.
  • As before, we divide by degrees of freedom, here $n - (k + 1) = n - k - 1$, to calculate the variance of the residuals; thus, the standard error is
    $$S_E = \sqrt{\frac{\sum E_i^2}{n - k - 1}}$$
  • Heuristically, we 'lose' $k + 1$ degrees of freedom by calculating the $k + 1$ regression coefficients, $A, B_1, \ldots, B_k$.
  • For Duncan's regression of occupational prestige on the income and educational levels of occupations, the standard error is
    $$S_E = \sqrt{\frac{7506.7}{45 - 2 - 1}} = 13.37$$


    – The response variable here is the percentage of raters classifying the occupation as good or excellent in prestige; an average error of 13 is substantial.
▶ The sums of squares in multiple regression are defined as in simple regression:
  $$\text{TSS} = \sum (Y_i - \bar{Y})^2$$
  $$\text{RegSS} = \sum (\hat{Y}_i - \bar{Y})^2$$
  $$\text{RSS} = \sum (Y_i - \hat{Y}_i)^2 = \sum E_i^2$$
  • The fitted values $\hat{Y}_i$ and residuals $E_i$ now come from the multiple-regression equation.
  • We also have a similar decomposition of variation:
    $$\text{TSS} = \text{RegSS} + \text{RSS}$$
  • The least-squares residuals are uncorrelated with the fitted values and with each of the $X$'s.

▶ The squared multiple correlation $R^2$ represents the proportion of variation in the response variable captured by the regression:
  $$R^2 \equiv \frac{\text{RegSS}}{\text{TSS}}$$
  • The multiple correlation coefficient is the positive square root of $R^2$, and is interpretable as the simple correlation between the fitted and observed $Y$-values.


▶ For Duncan's regression,
  $$\text{TSS} = 43{,}688. \qquad \text{RegSS} = 36{,}181. \qquad \text{RSS} = 7506.7$$
  • The squared multiple correlation is
    $$R^2 = \frac{36{,}181}{43{,}688} = .8282$$
    indicating that more than 80 percent of the variation in prestige among the 45 occupations is accounted for by its linear regression on the income and educational levels of the occupations.
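A quick numerical check of the standard error and $R^2$ from these sums of squares:

```python
import math

n, k = 45, 2
TSS, RegSS, RSS = 43688.0, 36181.0, 7506.7

S_E = math.sqrt(RSS / (n - k - 1))  # 13.37
R2 = RegSS / TSS                    # .8282
print(round(S_E, 2), round(R2, 4))
```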

4.4 Standardized Regression Coefficients

▶ Social researchers often wish to compare the coefficients of different explanatory variables in a regression analysis.
  • When the explanatory variables are commensurable, comparison is straightforward.
  • Standardized regression coefficients permit a limited assessment of the relative effects of incommensurable explanatory variables.
▶ Imagine that the annual dollar income of wage workers is regressed on their years of education, years of labor-force experience, and some other explanatory variables, producing the fitted regression equation

  $$\widehat{\text{Income}} = A + B_1 \times \text{Education} + B_2 \times \text{Experience} + \cdots$$
  • Since education and experience are measured in years, the coefficients $B_1$ and $B_2$ are both expressed in dollars/year, and can be directly compared.


▶ More commonly, explanatory variables are measured in different units.
  • In the Canadian occupational prestige regression, for example, the coefficient for education is expressed in points (of prestige) per year; the coefficient for income is expressed in points per dollar; and the coefficient of gender composition in points per percent-women.
    – The income coefficient (0.001314) is much smaller than the education coefficient (4.187) not because income is a much less important determinant of prestige, but because the unit of income (the dollar) is small, while the unit of education (the year) is relatively large.
    – If we were to re-express income in thousands of dollars, then we would multiply the income coefficient by 1000.

▶ Standardized regression coefficients rescale the $B$'s according to a measure of explanatory-variable spread.
  • We may, for example, multiply each regression coefficient by the hinge-spread of the corresponding explanatory variable. For the Canadian prestige data:

    Variable     K-spread × B_j
    Education:   4.28 × 4.187 = 17.92
    Income:      4131 × 0.001314 = 5.4281
    Gender:      48.68 × (-0.008905) = -0.4335

  • For other data, where the variation in education and income may be different, the relative impact of the two variables may also differ, even if the regression coefficients are unchanged.
  • The following observation should give you pause: If two explanatory variables are commensurable, and if their hinge-spreads differ, then performing this calculation is, in effect, to adopt a rubber ruler.


▶ It is much more common to standardize regression coefficients using the standard deviations of the explanatory variables rather than their hinge-spreads.
  • The usual practice standardizes the response variable as well, but this is inessential:
    $$Y_i = A + B_1 X_{i1} + \cdots + B_k X_{ik} + E_i$$
    $$\bar{Y} = A + B_1 \bar{X}_1 + \cdots + B_k \bar{X}_k$$
    $$Y_i - \bar{Y} = B_1(X_{i1} - \bar{X}_1) + \cdots + B_k(X_{ik} - \bar{X}_k) + E_i$$
    $$\frac{Y_i - \bar{Y}}{S_Y} = B_1\left(\frac{S_1}{S_Y}\right)\left(\frac{X_{i1} - \bar{X}_1}{S_1}\right) + \cdots + B_k\left(\frac{S_k}{S_Y}\right)\left(\frac{X_{ik} - \bar{X}_k}{S_k}\right) + \frac{E_i}{S_Y}$$
    $$Z_{iY} = B_1^* Z_{i1} + \cdots + B_k^* Z_{ik} + E_i^*$$


    – $Z_Y \equiv (Y - \bar{Y})/S_Y$ is the standardized response variable, linearly transformed to a mean of zero and a standard deviation of one.
    – $Z_1, \ldots, Z_k$ are the explanatory variables, similarly standardized.
    – $E^* \equiv E/S_Y$ is the transformed residual which, note, does not have a standard deviation of one.
    – $B_j^* \equiv B_j(S_j/S_Y)$ is the standardized partial regression coefficient for the $j$th explanatory variable.
    – The standardized coefficient is interpretable as the average change in $Y$, in standard-deviation units, for a one standard-deviation increase in $X_j$, holding constant the other explanatory variables.


▶ For the Canadian prestige regression,

    Education:   4.187 × 2.728/17.20 = 0.6639
    Income:      0.001314 × 4246/17.20 = 0.3242
    Gender:      -0.008905 × 31.72/17.20 = -0.01642

  • Because both income and gender composition have substantially non-normal distributions, however, the use of standard deviations here is difficult to justify.
▶ A common misuse of standardized coefficients is to employ them to make comparisons of the effects of the same explanatory variable in two or more samples drawn from populations with different spreads.
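A small sketch of the rescaling $B_j^* = B_j S_j / S_Y$, using the rounded values quoted above (so the results agree with the notes only to about three decimal places; the notes' figures appear to come from unrounded inputs):

```python
S_Y = 17.20  # standard deviation of prestige
B = {"Education": 4.187, "Income": 0.001314, "Gender": -0.008905}
S = {"Education": 2.728, "Income": 4246.0, "Gender": 31.72}

for j in B:
    print(j, round(B[j] * S[j] / S_Y, 4))
# Approximately 0.664, 0.324, -0.0164 (cf. 0.6639, 0.3242, -0.01642 above).
```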

5. Least-Squares Regression Using Matrices (time permitting)

5.1 Basic Definitions

▶ A matrix is a rectangular table of numbers or of numerical variables; for example, the $(4 \times 3)$ matrix
  $$\mathbf{X} = \begin{bmatrix} 1 & -2 & 3 \\ 4 & -5 & -6 \\ 7 & 8 & 9 \\ 0 & 0 & 10 \end{bmatrix}$$
  or, more generally, the $(m \times n)$ matrix
  $$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}$$


▶ The first matrix is of order 4 by 3; the second of order $m$ by $n$.
▶ A one-column matrix is called a column vector,
  $$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{bmatrix}$$
▶ Likewise, a matrix with just one row is called a row vector,
  $$\mathbf{b}' = [b_1\ b_2\ \cdots\ b_n]$$
▶ The transpose of a matrix reverses rows and columns:
  $$\mathbf{X} = \begin{bmatrix} 1 & -2 & 3 \\ 4 & -5 & -6 \\ 7 & 8 & 9 \\ 0 & 0 & 10 \end{bmatrix} \qquad \mathbf{X}' = \begin{bmatrix} 1 & 4 & 7 & 0 \\ -2 & -5 & 8 & 0 \\ 3 & -6 & 9 & 10 \end{bmatrix}$$

▶ A square matrix has equal numbers of rows and columns:
  $$\mathbf{B} = \begin{bmatrix} -5 & 1 & 3 \\ 2 & 2 & 6 \\ 7 & 3 & -4 \end{bmatrix}$$
  • The diagonal of a (generic) square matrix $\mathbf{A}$ of order $n$ consists of the entries $a_{11}, a_{22}, \ldots, a_{nn}$.
  • For example, the diagonal of $\mathbf{B}$ consists of the entries $-5$, $2$, and $-4$.
▶ A symmetric square matrix is equal to its transpose; thus $\mathbf{B}$ is not symmetric, but the following matrix is:
  $$\mathbf{C} = \begin{bmatrix} -5 & 1 & 3 \\ 1 & 2 & 6 \\ 3 & 6 & -4 \end{bmatrix}$$


5.2 Matrix Arithmetic

▶ To be added or subtracted, matrices must be of the same order; matrices are added, subtracted, and negated element-wise. For
  $$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} -5 & 1 & 2 \\ 3 & 0 & -4 \end{bmatrix}$$
  we have
  $$\mathbf{C} = \mathbf{A} + \mathbf{B} = \begin{bmatrix} -4 & 3 & 5 \\ 7 & 5 & 2 \end{bmatrix}$$
  $$\mathbf{D} = \mathbf{A} - \mathbf{B} = \begin{bmatrix} 6 & 1 & 1 \\ 1 & 5 & 10 \end{bmatrix}$$
  $$\mathbf{E} = -\mathbf{A} = \begin{bmatrix} -1 & -2 & -3 \\ -4 & -5 & -6 \end{bmatrix}$$


▶ The inner product of two vectors (each with $n$ entries), say the row vector $\mathbf{a}'$ and the column vector $\mathbf{b}$, denoted $\mathbf{a}' \cdot \mathbf{b}$, is a number formed by multiplying corresponding entries of the vectors and summing the resulting products:
  $$\mathbf{a}' \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$
  • For example,
    $$[2\ 0\ 1\ 3] \cdot \begin{bmatrix} -1 \\ 6 \\ 0 \\ 9 \end{bmatrix} = 2(-1) + 0(6) + 1(0) + 3(9) = 25$$
▶ Two matrices $\mathbf{A}$ and $\mathbf{B}$ are conformable for multiplication in the order given (i.e., $\mathbf{AB}$) if the number of columns of the left-hand factor ($\mathbf{A}$) is equal to the number of rows of the right-hand factor ($\mathbf{B}$).
  • Thus $\mathbf{A}$ and $\mathbf{B}$ are conformable for multiplication if $\mathbf{A}$ is of order $(m \times n)$ and $\mathbf{B}$ is of order $(n \times p)$.


  • For example,
    $$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
    are conformable for multiplication, but
    $$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
    are not.

▶ Let $\mathbf{C} = \mathbf{AB}$ be the matrix product; and let $\mathbf{a}_i'$ be the $i$th row of $\mathbf{A}$ and $\mathbf{b}_j$ be the $j$th column of $\mathbf{B}$.
  • Then $\mathbf{C}$ is a matrix of order $(m \times p)$ in which
    $$c_{ij} = \mathbf{a}_i' \cdot \mathbf{b}_j = \sum_{k=1}^{n} a_{ik} b_{kj}$$

  • Some examples:
    $$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1(1)+2(0)+3(0) & 1(0)+2(1)+3(0) & 1(0)+2(0)+3(1) \\ 4(1)+5(0)+6(0) & 4(0)+5(1)+6(0) & 4(0)+5(0)+6(1) \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
    $$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0 & 3 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 4 & 5 \\ 8 & 13 \end{bmatrix}$$
    $$\begin{bmatrix} 0 & 3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 9 & 12 \\ 5 & 8 \end{bmatrix}$$
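These products are easy to check with numpy; the last two also show that matrix multiplication is not commutative. A minimal sketch, assuming numpy is available:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 3], [2, 1]])

print(A @ B)  # [[ 4  5], [ 8 13]]
print(B @ A)  # [[ 9 12], [ 5  8]]  -- AB != BA in general
```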


▶ A matrix that has ones down the main diagonal and zeroes elsewhere is called an identity matrix, and is symbolized by $\mathbf{I}$; the identity matrix of order 3 is symbolized by $\mathbf{I}_3$.
  • As in the previous example, multiplying a matrix by an identity matrix recovers the original matrix.
▶ In scalar algebra, division is essential to the solution of simple equations.
  • For example,
    $$6x = 12 \qquad x = \frac{12}{6} = 2$$
    or, equivalently,
    $$\frac{1}{6} \times 6x = \frac{1}{6} \times 12 \qquad x = 2$$
    where $\frac{1}{6} = 6^{-1}$ is the scalar inverse of 6.

▶ In matrix algebra, there is no direct analog of division, but most square matrices have a matrix inverse.
  • The inverse of a square matrix $\mathbf{A}$ is a square matrix of the same order, written $\mathbf{A}^{-1}$, with the property that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$.
  • If a square matrix has an inverse, then the matrix is termed nonsingular; a square matrix without an inverse is termed singular.


  • For example, the inverse of the nonsingular matrix
    $$\begin{bmatrix} 2 & 5 \\ 1 & 3 \end{bmatrix}$$
    is the matrix
    $$\begin{bmatrix} 3 & -5 \\ -1 & 2 \end{bmatrix}$$
    as we can verify:
    $$\begin{bmatrix} 2 & 5 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 3 & -5 \\ -1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \checkmark$$
    $$\begin{bmatrix} 3 & -5 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 2 & 5 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \checkmark$$

▶ In scalar algebra, only the number 0 has no inverse, but there are singular nonzero matrices.
  • Let us hypothesize that $\mathbf{B}$ is the inverse of the matrix
    $$\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$
  • But
    $$\mathbf{AB} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} b_{11} & b_{12} \\ 0 & 0 \end{bmatrix} \ne \mathbf{I}_2$$
    so no such inverse exists.
▶ Matrix inverses can be used to solve systems of linear simultaneous equations. For example:
  $$2x_1 + 5x_2 = 4$$
  $$x_1 + 3x_2 = 5$$


  • Writing these equations as a matrix equation,
    $$\begin{bmatrix} 2 & 5 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 4 \\ 5 \end{bmatrix}$$
    that is, $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is $(2 \times 2)$ and $\mathbf{x}$ and $\mathbf{b}$ are $(2 \times 1)$.
  • Then
    $$\mathbf{A}^{-1}\mathbf{A}\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \qquad \mathbf{I}\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \qquad \mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$$
  • For the example,
    $$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 & 5 \\ 1 & 3 \end{bmatrix}^{-1} \begin{bmatrix} 4 \\ 5 \end{bmatrix} = \begin{bmatrix} 3 & -5 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 4 \\ 5 \end{bmatrix} = \begin{bmatrix} -13 \\ 6 \end{bmatrix}$$

5.3 Least-Squares Regression

▶ In scalar (i.e., single-number) form, multiple linear regression is written as
  $$Y_i = A + B_1 X_{i1} + B_2 X_{i2} + \cdots + B_k X_{ik} + E_i$$
  • There are $n$ such equations, one for each observation $i = 1, 2, \ldots, n$.


▶ We can write these $n$ equations as a single matrix equation:
  $$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
  where $\mathbf{y}$ is $(n \times 1)$, $\mathbf{X}$ is $(n \times k{+}1)$, $\mathbf{b}$ is $(k{+}1 \times 1)$, and $\mathbf{e}$ is $(n \times 1)$.

  • $\mathbf{y} = (Y_1, Y_2, \ldots, Y_n)'$ is a vector of observations on the response variable.

  • The matrix
    $$\mathbf{X} = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix}$$
    called the model (or design) matrix, contains the values of the explanatory variables, with an initial column of ones for the regression constant (called the constant regressor).

  • $\mathbf{b} = (A, B_1, \ldots, B_k)'$ contains the regression coefficients.
  • $\mathbf{e} = (E_1, E_2, \ldots, E_n)'$ is a vector of residuals.


▶ The sum of squared residuals is $S(\mathbf{b}) = \mathbf{e}'\mathbf{e}$.
▶ Minimizing the residual sum of squares leads to the least-squares normal equations in matrix form:
  $$\mathbf{X}'\mathbf{X}\,\mathbf{b} = \mathbf{X}'\mathbf{y}$$
  where $\mathbf{X}'\mathbf{X}$ is $(k{+}1 \times k{+}1)$ and $\mathbf{b}$ and $\mathbf{X}'\mathbf{y}$ are $(k{+}1 \times 1)$.


  • This is a system of $k + 1$ linear equations in the $k + 1$ unknown regression coefficients $\mathbf{b}$.
  • The coefficient matrix for the system of equations,
    $$\mathbf{X}'\mathbf{X} = \begin{bmatrix} n & \sum X_{i1} & \sum X_{i2} & \cdots & \sum X_{ik} \\ \sum X_{i1} & \sum X_{i1}^2 & \sum X_{i1}X_{i2} & \cdots & \sum X_{i1}X_{ik} \\ \sum X_{i2} & \sum X_{i2}X_{i1} & \sum X_{i2}^2 & \cdots & \sum X_{i2}X_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \sum X_{ik} & \sum X_{ik}X_{i1} & \sum X_{ik}X_{i2} & \cdots & \sum X_{ik}^2 \end{bmatrix}$$
    contains sums of squares and cross-products among the columns of the model matrix.


  • The right-hand-side vector, $\mathbf{X}'\mathbf{y} = [\sum Y_i,\ \sum X_{i1}Y_i,\ \sum X_{i2}Y_i,\ \ldots,\ \sum X_{ik}Y_i]'$, contains sums of cross-products between each column of the model matrix and the vector of responses.

  • The sums of squares and products $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$ can be calculated directly from the data.

▶ $\mathbf{X}'\mathbf{X}$ is nonsingular if no explanatory variable is a perfect linear function of the others.
  • Then the solution of the normal equations is
    $$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
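A minimal sketch of least-squares regression in matrix form; the model matrix $\mathbf{X}$ (with a leading column of ones) and the response $\mathbf{y}$ are hypothetical data, and the coefficients used to generate them are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 2

# Model matrix with a constant regressor (column of ones).
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Solve the normal equations X'X b = X'y, i.e., b = (X'X)^{-1} X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # close to the generating coefficients [1.0, 2.0, -0.5]
```

In practice, `np.linalg.lstsq(X, y)` is preferred over forming $\mathbf{X}'\mathbf{X}$ explicitly, for numerical stability; the direct solve above simply mirrors the normal-equations derivation in the text.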


6. Summary

▶ In simple regression, the least-squares coefficients are given by
  $$B = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \qquad A = \bar{Y} - B\bar{X}$$
▶ The least-squares coefficients in multiple linear regression are found by solving the normal equations for the intercept $A$ and the slope coefficients $B_1, B_2, \ldots, B_k$.
▶ The least-squares residuals, $E$, are uncorrelated with the fitted values, $\hat{Y}$, and with the explanatory variables, $X_1, \ldots, X_k$.
▶ The linear regression decomposes the variation in $Y$ into 'explained' and 'unexplained' components: TSS = RegSS + RSS.

▶ The standard error of the regression,
  $$S_E = \sqrt{\frac{\sum E_i^2}{n - k - 1}}$$
  gives the 'average' size of the regression residuals.
▶ The squared multiple correlation,
  $$R^2 = \frac{\text{RegSS}}{\text{TSS}}$$
  indicates the proportion of the variation in $Y$ that is captured by its linear regression on the $X$'s.
▶ By rescaling regression coefficients in relation to a measure of variation, e.g., the hinge-spread or standard deviation, standardized regression coefficients permit a limited comparison of the relative impact of incommensurable explanatory variables.


▶ The least-squares regression coefficients can be calculated in matrix form as
  $$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
