Categorical Variables in Regression Models


HH Chapter 9, Gill Chapter 4

October 29, 2008

NC Mercury in Fish data

[Figure: [Mercury] (ppm) versus Length (cm), one panel for each label 0 through 15]

Residuals from Regression of Mercury on Length

[Figure: diagnostic plots in four panels: Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance); flagged observations include 70, 105, 117, 152, 162, and 163]

What's Wrong?

Ladder of Powers

Ladders of Powers

[Figure: grid of scatter plots of MERCURY^p against LENGTH^q, with p in {2, 1, 0.5, 0, −0.5, −1} down the rows and q in {−1, −0.5, 0, 0.5, 1, 2} across the columns]
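The grid of transformations can be sketched in R as follows. This is a minimal illustration on simulated stand-in data, not the NC fish data; the helper `ladder()` and the convention that power 0 means the natural log are assumptions made for the example. R² is used only as a rough numerical summary of straightness, and it is not strictly comparable across different transformations of the response, so the plots remain the main tool.

```r
# Ladder-of-powers transform; power 0 is taken to mean the natural log.
ladder <- function(x, p) {
  if (p == 0) log(x) else x^p
}

# Simulated stand-in data (NOT the NC fish data): a power-law relationship
# with multiplicative error, so the log-log panel is linear by construction.
set.seed(1)
LENGTH  <- runif(200, 25, 65)
MERCURY <- exp(-5 + 1.5 * log(LENGTH) + rnorm(200, sd = 0.4))

powers <- c(-1, -0.5, 0, 0.5, 1, 2)

# R^2 of the straight-line fit for every pair of powers:
# rows index the power applied to MERCURY, columns the power applied to LENGTH.
r2 <- sapply(powers, function(q) {
  sapply(powers, function(p) {
    summary(lm(ladder(MERCURY, p) ~ ladder(LENGTH, q)))$r.squared
  })
})
dimnames(r2) <- list(MERCURY = powers, LENGTH = powers)
round(r2, 2)
```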

Residuals from Regression of log(Mercury) on log(Length)

[Figure: diagnostic plots in four panels: Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance); flagged observations include 12, 117, 133, and 141]

Residuals by River

[Figure: residuals for RIVER = 0 and 1]

Residuals by STATION

[Figure: residuals for STATION = 0 through 15]

Dummy Variables

◮ RIVER is an indicator variable
  ◮ 0 = Lumber River
  ◮ 1 = Waccamaw River
◮ Also called a “dummy” variable
◮ Use it to fit a different regression line to each river with a common variance σ²

$$Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \beta_3 D_i X_i + \epsilon_i \tag{1}$$

$$\epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2) \tag{2}$$
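Substituting the two possible values of the indicator into equation (1) shows how one model yields two lines. This is a worked restatement of the model above, not an additional assumption:

$$
\begin{aligned}
D_i = 0:&\quad Y_i = \beta_0 + \beta_2 X_i + \epsilon_i \\
D_i = 1:&\quad Y_i = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) X_i + \epsilon_i
\end{aligned}
$$

So $\beta_1$ is the difference in intercepts between the two groups and $\beta_3$ is the difference in slopes; both lines share the same error variance $\sigma^2$.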

◮ The following are equivalent in R:

    lm(log(MERCURY) ~ log(LENGTH) * RIVER)
    lm(log(MERCURY) ~ log(LENGTH) + RIVER + RIVER * log(LENGTH))
    lm(log(MERCURY) ~ log(LENGTH) + RIVER + RIVER:log(LENGTH))

Coefficients

                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         -5.4557     1.0122  -5.390 2.37e-07 ***
log(LENGTH)          1.4703     0.2767   5.315 3.38e-07 ***
RIVER               -3.6245     1.3196  -2.747  0.00668 **
log(LENGTH):RIVER    1.0045     0.3597   2.792  0.00585 **
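The equivalence of the three formula spellings can be checked on any data set. The sketch below uses simulated data (the data frame `fish_sim` and its generating coefficients are made up for illustration and are not the NC fish data) and then reads the two river-specific lines off the fitted coefficients.

```r
# Simulated stand-in data (NOT the NC fish data).
set.seed(2)
n <- 120
fish_sim <- data.frame(
  LENGTH = runif(n, 25, 65),
  RIVER  = rbinom(n, 1, 0.5)   # 0/1 indicator
)
fish_sim$MERCURY <- exp(-5 + 1.5 * log(fish_sim$LENGTH) - 3 * fish_sim$RIVER +
                        0.9 * fish_sim$RIVER * log(fish_sim$LENGTH) +
                        rnorm(n, sd = 0.4))

# Three spellings of the same model: main effects plus interaction.
m1 <- lm(log(MERCURY) ~ log(LENGTH) * RIVER, data = fish_sim)
m2 <- lm(log(MERCURY) ~ log(LENGTH) + RIVER + RIVER * log(LENGTH), data = fish_sim)
m3 <- lm(log(MERCURY) ~ log(LENGTH) + RIVER + RIVER:log(LENGTH), data = fish_sim)

# Coefficients come out in the order: intercept, log(LENGTH), RIVER, interaction.
b <- unname(coef(m1))

# Intercept and slope of the line for each river.
river_lines <- rbind(
  RIVER0 = c(intercept = b[1],        slope = b[2]),
  RIVER1 = c(intercept = b[1] + b[3], slope = b[2] + b[4])
)
river_lines
```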

◮ Combined Model

    log(MERCURY) = −5.46 + 1.47 log(LENGTH) − 3.62 RIVER + 1.00 log(LENGTH) × RIVER

◮ RIVER = 0:

    log(MERCURY) = −5.46 + 1.47 log(LENGTH)

◮ RIVER = 1:

    log(MERCURY) = (−5.46 − 3.62) + (1.47 + 1.00) log(LENGTH) = −9.08 + 2.47 log(LENGTH)

Residuals by STATION

[Figure: residuals for STATION = 0 through 15]

Categorical Variables

◮ STATION is a categorical variable
  ◮ 0, 1, ..., 15 are labels or “levels”
◮ By default, R treats integer data as numerical

WRONG WAY:

    lm(log(MERCURY) ~ log(LENGTH) * STATION)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4.41371    1.22982  -3.589 0.000436 ***
log(LENGTH)           1.17463    0.33417   3.515 0.000566 ***
STATION              -0.39978    0.13992  -2.857 0.004820 **
log(LENGTH):STATION   0.11140    0.03753   2.968 0.003440 **
---

Residual standard error: 0.4816 on 167 degrees of freedom

Factors

◮ Create a dummy variable for each STATION
◮ The sum of all the dummy variables is one for every observation, so together they are perfectly collinear with the intercept
◮ Drop the dummy variable for the lowest STATION to eliminate the multicollinearity
◮ Use factor(STATION) in the model formula:

    lm(log(MERCURY) ~ factor(STATION) * log(LENGTH))

◮ Can coerce STATION to be a factor with as.factor() or factor():

    fish$STATION = as.factor(fish$STATION)
    lm(log(MERCURY) ~ STATION * log(LENGTH), data = fish)

Residuals by STATION

[Figure: residuals for STATION = 0 through 15]
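The dummy coding behind factor() can be inspected with model.matrix(). The sketch below uses a toy STATION vector with three levels (hypothetical values, not the fish data) to show that the full set of dummies sums to one in every row, that the default coding drops the lowest level, and that a numeric STATION contributes only a single column.

```r
# Toy station labels (hypothetical, not the fish data): three stations coded 0, 1, 2.
STATION <- c(0, 0, 1, 1, 2, 2)

# One indicator column per level, with the intercept removed.
all_dummies <- model.matrix(~ 0 + factor(STATION))
rowSums(all_dummies)   # each row sums to 1, which duplicates an intercept column

# Default coding: intercept plus one indicator for every level except the lowest.
X_factor <- model.matrix(~ factor(STATION))
colnames(X_factor)

# Treated as numeric, STATION contributes a single slope column instead.
X_numeric <- model.matrix(~ STATION)
colnames(X_numeric)
```

With 16 stations, the factor coding therefore uses 15 columns for STATION, while the numeric coding uses one.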

Output

                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            -15.6977     4.7984   -3.27   0.0013
STATION1                11.0015     5.2647    2.09   0.0385
STATION2                12.3825     5.0120    2.47   0.0147
......
STATION15                6.2955     5.7388    1.10   0.2745
log(LENGTH)              4.0854     1.2589    3.25   0.0015
STATION1:log(LENGTH)    -2.8784     1.3914   -2.07   0.0404
STATION2:log(LENGTH)    -3.1260     1.3236   -2.36   0.0196
......
STATION14:log(LENGTH)   -2.7114     1.4469   -1.87   0.0630
STATION15:log(LENGTH)   -1.5107     1.4949   -1.01   0.3140

ANOVA

As a Factor:

                     Df Sum Sq Mean Sq      F Pr(>F)
log(LENGTH)           1  32.67   32.67 226.32 0.0000
STATION              15  17.65    1.18   8.15 0.0000
log(LENGTH):STATION  15   3.91    0.26   1.81 0.0392
Residuals           139  20.06    0.14

As a numeric variable:

                     Df Sum Sq Mean Sq      F Pr(>F)
log(LENGTH)           1  32.67   32.67 140.86 0.0000
STATION               1   0.85    0.85   3.67 0.0572
log(LENGTH):STATION   1   2.04    2.04   8.81 0.0034
Residuals           167  38.73    0.23

Sequential ANOVA

                     Df Sum Sq Mean Sq      F Pr(>F)
log(LENGTH)           1  32.67   32.67 226.32 0.0000
STATION              15  17.65    1.18   8.15 0.0000
log(LENGTH):STATION  15   3.91    0.26   1.81 0.0392
Residuals           139  20.06    0.14
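The STATION and interaction rows of a sequential table like the one above can be read as comparisons of nested models: a common line, parallel lines (a separate intercept per station), and different lines (a separate intercept and slope per station). The sketch below illustrates this on simulated data with four stations; the data frame `sim` and its generating values are made up for the example and are not the fish data.

```r
# Simulated stand-in data (NOT the NC fish data): 4 stations, 30 fish each,
# generated with station-specific intercepts and a shared slope.
set.seed(3)
n_station <- 4
n_per     <- 30
sim <- data.frame(
  STATION = factor(rep(0:(n_station - 1), each = n_per)),
  LENGTH  = runif(n_station * n_per, 25, 65)
)
station_shift <- c(0, 0.3, -0.2, 0.5)[as.integer(sim$STATION)]
sim$MERCURY <- exp(-5 + 1.5 * log(sim$LENGTH) + station_shift +
                   rnorm(nrow(sim), sd = 0.4))

# Three nested models.
common    <- lm(log(MERCURY) ~ log(LENGTH), data = sim)
parallel  <- lm(log(MERCURY) ~ log(LENGTH) + STATION, data = sim)
different <- lm(log(MERCURY) ~ log(LENGTH) * STATION, data = sim)

# Extra-sum-of-squares F-tests: each row compares a model with the one before it,
# using the residual mean square of the largest model as the denominator.
extra_ss <- anova(common, parallel, different)
extra_ss

# Sequential ANOVA of the largest model; its STATION and interaction rows
# carry the same F statistics as the comparison above.
sequential <- anova(different)
sequential
```

In the station table from the slides, the interaction row (F = 1.81 on 15 and 139 degrees of freedom, p = 0.0392) is the test of different lines against parallel lines, and the STATION row (F = 8.15) is the test of parallel lines against a common line.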

Models:
◮ Different Lines
◮ Parallel Lines
◮ Common Line

Test Hypotheses based on “Extra SS” F-test

Interpretations of Coefficients in Original Units (Concentration)
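As one way to read the fitted river model in original units, the log-log lines can be exponentiated. The worked equations below use the rounded estimates from the RIVER coefficient table and assume log() is the natural logarithm (R's default); under the normal-error assumption on the log scale, the back-transformed curve estimates the median concentration rather than the mean.

$$
\begin{aligned}
\text{RIVER} = 0:&\quad \widehat{\text{MERCURY}} = e^{-5.46}\,\text{LENGTH}^{1.47} \approx 0.0043\,\text{LENGTH}^{1.47} \\
\text{RIVER} = 1:&\quad \widehat{\text{MERCURY}} = e^{-9.08}\,\text{LENGTH}^{2.47} \approx 0.00011\,\text{LENGTH}^{2.47}
\end{aligned}
$$

The slope on the log scale acts as an exponent in original units. Doubling LENGTH multiplies the estimated median concentration by $2^{1.47} \approx 2.77$ when RIVER = 0 and by $2^{2.47} \approx 5.54$ when RIVER = 1.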