
CS 147: Computer Systems Performance Analysis
Multiple and Categorical Regression
2015-06-15


Overview

- Multiple Linear Regression
  - Basic Formulas
  - Example
  - Quality of the Example
- Categorical Models

Multiple Linear Regression

- Develops models with more than one predictor variable
- But each predictor variable has a linear relationship to the response variable
- Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional

Basic Multiple Linear Regression Formula

Response y is a function of k predictor variables x1, x2, ..., xk:

    y = b0 + b1x1 + b2x2 + ... + bkxk + e

A Multiple Linear Regression Model

Given a sample of n observations

    {(x11, x21, ..., xk1, y1), ..., (x1n, x2n, ..., xkn, yn)}

the model consists of n equations (note possible + vs. − typo in book):

    y1 = b0 + b1x11 + b2x21 + ... + bkxk1 + e1
    y2 = b0 + b1x12 + b2x22 + ... + bkxk2 + e2
    ...
    yn = b0 + b1x1n + b2x2n + ... + bkxkn + en

Looks Like It's Matrix Arithmetic Time

    y = Xb + e

    [ y1 ]   [ 1  x11  x21  ...  xk1 ] [ b0 ]   [ e1 ]
    [ y2 ] = [ 1  x12  x22  ...  xk2 ] [ b1 ] + [ e2 ]
    [ :  ]   [ :   :    :         :  ] [ :  ]   [ :  ]
    [ yn ]   [ 1  x1n  x2n  ...  xkn ] [ bk ]   [ en ]

Note that:
- y and e have n elements
- b has k + 1 elements
- X is n by (k + 1)

Analysis of Multiple Linear Regression

- Listed in box 15.1 of Jain
- Not terribly important (for our purposes) how they were derived
  - This isn't a class on statistics
  - But you need to know how to use them
- Mostly matrix analogs to simple linear regression results

Example of Multiple Linear Regression

- IMDB keeps numerical popularity ratings of movies
- Postulate popularity of Academy Award-winning films is based on two factors:
  - Year made
  - Running time
- Produce a regression

      rating = b0 + b1(year) + b2(length)

Some Sample Data

    Title                   Year  Length  Rating
    Silence of the Lambs    1991     118     8.1
    Terms of Endearment     1983     132     6.8
    Rocky                   1976     119     7.0
    Oliver!                 1968     153     7.4
    Marty                   1955      91     7.7
    Gentleman's Agreement   1947     118     7.5
    Mutiny on the Bounty    1935     132     7.6
    It Happened One Night   1934     105     8.0

Now for Some Tedious Matrix Arithmetic

- We need to calculate X, X^T, X^T X, (X^T X)^−1, and X^T y
  - Because b = (X^T X)^−1 (X^T y)
- We will see that b = (18.5430, −0.0051, −0.0086)
- Meaning the regression predicts:

      rating = 18.5430 − 0.0051(year) − 0.0086(length)
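The slides that follow grind through this arithmetic by hand. As a sanity check (a sketch only; it assumes NumPy, which the slides do not use), the same b can be computed directly from the sample data:

```python
import numpy as np

# Sample data from the slides: year, running time, and IMDB rating
year = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934], dtype=float)
length = np.array([118, 132, 119, 153, 91, 118, 132, 105], dtype=float)
rating = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

# X gets a leading column of ones so that b0 acts as the intercept
X = np.column_stack([np.ones_like(year), year, length])

# b = (X^T X)^-1 (X^T y); lstsq solves this without forming the inverse
b, *_ = np.linalg.lstsq(X, rating, rcond=None)
print(b)  # approximately (18.5430, -0.0051, -0.0086)
```

lstsq avoids explicitly inverting X^T X, which is the numerically safer route; the explicit inverse on the later slides is still worth seeing once, since the diagonal of (X^T X)^−1 reappears in the standard-error formulas.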

X Matrix for Example

        [ 1  1991  118 ]
        [ 1  1983  132 ]
        [ 1  1976  119 ]
    X = [ 1  1968  153 ]
        [ 1  1955   91 ]
        [ 1  1947  118 ]
        [ 1  1935  132 ]
        [ 1  1934  105 ]

Transpose to Get X^T

          [    1     1     1     1     1     1     1     1 ]
    X^T = [ 1991  1983  1976  1968  1955  1947  1935  1934 ]
          [  118   132   119   153    91   118   132   105 ]

Multiply to Get X^T X

            [     8     15689      968 ]
    X^T X = [ 15689  30771385  1899083 ]
            [   968   1899083   119572 ]

Invert to Get C = (X^T X)^−1

                     [ 1207.7585  −0.6240   0.1328 ]
    C = (X^T X)^−1 = [   −0.6240   0.0003  −0.0001 ]
                     [    0.1328  −0.0001   0.0004 ]

Multiply to Get X^T y

            [     60.1 ]
    X^T y = [ 117840.7 ]
            [   7247.5 ]

Multiply (X^T X)^−1 (X^T y) to Get b

        [ 18.5430 ]
    b = [ −0.0051 ]
        [ −0.0086 ]

How Good Is This Regression Model?

- How accurately does the model predict film rating based on age and running time?
- Best way to determine this analytically is to calculate errors:

      SSE = y^T y − b^T X^T y

  or

      SSE = Σ e_i²

Calculating the Errors

                          Estimated
    Year  Length  Rating     Rating    e_i   e_i²
    1991     118     8.1        7.4  −0.71   0.51
    1983     132     6.8        7.3   0.51   0.26
    1976     119     7.0        7.5   0.45   0.21
    1968     153     7.4        7.2  −0.20   0.04
    1955      91     7.7        7.8   0.10   0.01
    1947     118     7.5        7.6   0.11   0.01
    1935     132     7.6        7.6  −0.05   0.00
    1934     105     8.0        7.8  −0.21   0.04

Calculating the Errors, Continued

- SSE = 1.08
- SSY = Σ y_i² = 452.91
- SS0 = n ȳ² = 451.5
- SST = SSY − SS0 = 452.9 − 451.5 = 1.4
- SSR = SST − SSE = 0.33
- R² = SSR/SST = 0.33/1.41 = 0.23
- In other words, this regression stinks
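These sums of squares can be checked mechanically from the raw data; a sketch (again assuming NumPy):

```python
import numpy as np

year = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934], dtype=float)
length = np.array([118, 132, 119, 153, 91, 118, 132, 105], dtype=float)
y = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones_like(year), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

sse = y @ y - b @ X.T @ y       # SSE = y'y - b'X'y
ssy = np.sum(y**2)              # SSY = sum of y_i^2
ss0 = len(y) * y.mean()**2      # SS0 = n * ybar^2
sst = ssy - ss0                 # total variation
ssr = sst - sse                 # variation explained by the regression
r2 = ssr / sst

print(round(sse, 2), round(r2, 2))  # 1.08 0.23
```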

Why Does It Stink?

- Let's look at properties of the regression parameters:

      s_e = sqrt(SSE/(n − 3)) = sqrt(1.08/5) = 0.46

- Now calculate standard deviations of the regression parameters (these are estimates only, since we're working with a sample)
- Estimated stdev of
  - b0 is s_e √c00 = 0.46 √1207.76 = 16.16
  - b1 is s_e √c11 = 0.46 √0.0003 = 0.0084
  - b2 is s_e √c22 = 0.46 √0.0004 = 0.0097

Calculating Confidence Intervals of STDEVs

- We will use the 90% level
- Confidence intervals for
  - b0 is 18.54 ± 2.015(16.16) = (−14.02, 51.10)
  - b1 is −0.005 ± 2.015(0.0084) = (−0.022, 0.012)
  - b2 is −0.009 ± 2.015(0.0097) = (−0.028, 0.011)
- None is significant at this level
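The standard deviations and intervals can be reproduced in a few lines; this sketch assumes NumPy and takes the t quantile 2.015 from the slides' table rather than computing it:

```python
import numpy as np

year = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934], dtype=float)
length = np.array([118, 132, 119, 153, 91, 118, 132, 105], dtype=float)
y = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones_like(year), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = len(y), 2
sse = y @ y - b @ X.T @ y
se = np.sqrt(sse / (n - k - 1))   # standard error of the regression

C = np.linalg.inv(X.T @ X)        # C = (X'X)^-1
sb = se * np.sqrt(np.diag(C))     # estimated stdev of each b_i

t = 2.015                         # t[0.95; 5] from the slides' table
lo, hi = b - t * sb, b + t * sb   # 90% confidence intervals

# Each interval straddles zero, so no parameter is significant at 90%
print(all(l < 0 < h for l, h in zip(lo, hi)))  # True
```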

Analysis of Variance

- So, can we really say that none of the predictor variables are significant?
- Not yet; predictors may be correlated
- F-tests can be used for this purpose
  - E.g., to determine if the SSR is significantly higher than the SSE
  - Equivalent to testing that y does not depend on any of the predictor variables
  - Alternatively, that no bi is significantly nonzero

Running an F-Test

- Need to calculate SSR and SSE
- From those, calculate mean squares of regression (MSR) and errors (MSE)
- MSR/MSE has an F distribution
- If MSR/MSE > F_table, predictors explain a significant fraction of response variation
- Note typos in book's table 15.3
  - SSR has k degrees of freedom
  - SST matches y − ȳ, not y − ŷ

F-Test for Our Example

- SSR = 0.33
- SSE = 1.08
- MSR = SSR/k = 0.33/2 = 0.16
- MSE = SSE/(n − k − 1) = 1.08/(8 − 2 − 1) = 0.22
- F-computed = MSR/MSE = 0.76
- F[90; 2, 5] = 3.78
- So it fails the F-test at 90% (miserably)
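The F-test is a one-liner once the sums of squares are in hand; this sketch assumes NumPy and hardcodes the critical value 3.78 from the slides' F table:

```python
import numpy as np

year = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934], dtype=float)
length = np.array([118, 132, 119, 153, 91, 118, 132, 105], dtype=float)
y = np.array([8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0])

X = np.column_stack([np.ones_like(year), year, length])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

n, k = len(y), 2
sse = y @ y - b @ X.T @ y
sst = np.sum(y**2) - n * y.mean()**2
ssr = sst - sse

msr = ssr / k             # mean square of regression, k degrees of freedom
mse = sse / (n - k - 1)   # mean square of errors, n-k-1 degrees of freedom
f = msr / mse

f_table = 3.78            # F[90; 2, 5] from the slides' table
print(f < f_table)        # True: the regression fails the F-test
```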

Multicollinearity

- If two predictor variables are linearly dependent, they are collinear
  - Meaning they are related
  - And thus the second variable does not improve the regression
  - In fact, it can make it worse
- Typical symptom is inconsistent results from various significance tests

Finding Multicollinearity

- Must test correlation between predictor variables
- If it's high, eliminate one and repeat the regression without it
- If significance of the regression improves, it's probably due to collinearity between the variables

Is Multicollinearity a Problem in Our Example?

- Probably not, since significance tests are consistent
- But let's check, anyway
  - Calculate correlation of age and length
  - After tedious calculation, 0.25
  - Not especially correlated
- Important point: adding a predictor variable does not always improve a regression
  - See example on p. 253 of book
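The "tedious calculation" reduces to a single call if NumPy is available (a sketch, not part of the slides):

```python
import numpy as np

year = np.array([1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934], dtype=float)
length = np.array([118, 132, 119, 153, 91, 118, 132, 105], dtype=float)

# Pearson correlation between the two predictor variables
r = np.corrcoef(year, length)[0, 1]
print(round(r, 2))  # 0.25
```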

Why Didn't Regression Work Well Here?

- Check scatter plots
  - Rating vs. year
  - Rating vs. length
- Regardless of how good or bad regressions look, always check the scatter plots

Rating vs. Length

[Scatter plot: rating (0–8) vs. running time (80–160 minutes) for the sample films]

Rating vs. Year

[Scatter plot: rating (0–8) vs. year (1934–1991) for the sample films]

Regression With Categorical Predictors

- Regression methods discussed so far assume numerical variables
- What if some of your variables are categorical in nature?
- If all are categorical, use techniques discussed later in the course
- Levels: number of values a category can take

Handling Categorical Predictors

- If only two levels, define xi as follows
  - xi = 0 for first value
  - xi = 1 for second value
- (This definition is missing from book in section 15.2)
- Can use +1 and −1 as values, instead
- Need k − 1 predictor variables for k levels
  - To avoid implying order in categories
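A concrete sketch of this coding scheme (the category names below are hypothetical, not from the slides): a k-level category becomes k − 1 indicator variables, with one level serving as the all-zeros baseline so no ordering is implied.

```python
# Hypothetical example: coding a 3-level categorical predictor
def dummy_code(value, levels):
    """Map one categorical value to k-1 indicator (0/1) variables.

    The first level is the baseline and maps to all zeros.
    """
    return [1 if value == lvl else 0 for lvl in levels[1:]]

levels = ["ext4", "xfs", "btrfs"]    # k = 3 levels -> 2 dummy variables
print(dummy_code("ext4", levels))    # [0, 0]  (baseline)
print(dummy_code("xfs", levels))     # [1, 0]
print(dummy_code("btrfs", levels))   # [0, 1]
```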

Categorical Variables Example

Which is a better predictor of a high rating in the movie database?

- Winning an Oscar?
- Winning the Golden Palm at Cannes?
- Winning the New York Critics Circle?

Choosing Variables

- Categories are not mutually exclusive
- x1 = 1 if Oscar, 0 otherwise
- x2 = 1 if Golden Palm, 0 otherwise
- x3 = 1 if Critics Circle Award, 0 otherwise
- y = b0 + b1x1 + b2x2 + b3x3

A Few Data Points

    Title                  Rating  Oscar  Palm  NYC
    Gentleman's Agreement     7.5    X           X
    Mutiny on the Bounty      7.6    X
    Marty                     7.4    X     X     X
    If                        7.8          X
    La Dolce Vita             8.1          X
    Kagemusha                 8.2          X
    The Defiant Ones          7.5                X
    Reds                      6.6                X
    High Noon                 8.1                X

And Regression Says. . .

- ŷ = 7.8 − 0.1x1 + 0.2x2 − 0.4x3
- How good is that?
  - R² is 34% of variation
  - Better than age and length
  - But still no great shakes
- Are regression parameters significant at 90% level?
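This fit can be reproduced from the award table above. A sketch (assuming NumPy; note that which award each checkmark in that table belongs to is my reading of the slide, so treat the 0/1 columns as an assumption):

```python
import numpy as np

# Award indicators (x1 = Oscar, x2 = Golden Palm, x3 = NY Critics Circle)
# and ratings, as read from the slide's table
data = [
    # x1 x2 x3 rating
    (1, 0, 1, 7.5),  # Gentleman's Agreement
    (1, 0, 0, 7.6),  # Mutiny on the Bounty
    (1, 1, 1, 7.4),  # Marty
    (0, 1, 0, 7.8),  # If
    (0, 1, 0, 8.1),  # La Dolce Vita
    (0, 1, 0, 8.2),  # Kagemusha
    (0, 0, 1, 7.5),  # The Defiant Ones
    (0, 0, 1, 6.6),  # Reds
    (0, 0, 1, 8.1),  # High Noon
]
A = np.array(data, dtype=float)
X = np.column_stack([np.ones(len(A)), A[:, :3]])  # intercept plus 3 dummies
y = A[:, 3]

b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b.round(1))  # roughly (7.8, -0.1, 0.2, -0.4)

sst = y @ y - len(y) * y.mean()**2
sse = y @ y - b @ X.T @ y
r2 = 1 - sse / sst
print(round(r2, 2))  # 0.34
```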