Two Examples of ANOVA

Example 1: This is the analysis of age (response variable) by diagnosis (factor, or explanatory variable) that was discussed in lecture on 5 Nov. I recoded the diagnosis codes as 0 = Normal, 1 = ACIS, and 2 = AdCa. I entered the data into Minitab and performed the 1-way ANOVA (Stat → ANOVA → 1-way). I pressed the Graphs button and selected Boxplots of Data, Histograms of Residuals, Normal Plot of Residuals, and Residuals vs. Fits. In this case, Residuals vs. Order is meaningless, since I don’t have the order in which the observations were originally taken. The results of the ANOVA are:

One-way Analysis of Variance

Analysis of Variance for Age
Source     DF        SS        MS        F        P
Diag        2       753       377     2.91    0.061
Error      67      8675       129
Total      69      9428

                                   Individual 95% CIs For Mean
                                   Based on Pooled StDev
Level       N      Mean     StDev  --+------+------+------+----
0          13     46.08     15.28  (------*------)
1          38     37.84      8.99  (------*------)
2          19     42.58     12.66  (------*------)
                                   --+------+------+------+----
Pooled StDev =    11.38            35.0   40.0   45.0   50.0
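The F statistic in the table above can be checked from the per-group counts, means, and standard deviations alone. The following is a minimal Python sketch of that arithmetic, not the Minitab procedure itself; the function name `anova_from_summary` and the dictionary layout are illustrative choices, and small differences from the printed table come from rounding in the reported means and standard deviations.

```python
import math

# Per-group summary statistics copied from the Minitab output:
# diagnosis level -> (N, mean age, standard deviation of age)
groups = {
    0: (13, 46.08, 15.28),
    1: (38, 37.84, 8.99),
    2: (19, 42.58, 12.66),
}


def anova_from_summary(groups):
    """Rebuild a one-way ANOVA table from per-group N, mean, and SD.

    Hypothetical helper for illustration only.
    """
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
    # Error (within-group) sum of squares: pooled from the group variances.
    ss_error = sum((n - 1) * s ** 2 for n, _, s in groups.values())

    df_between = len(groups) - 1
    df_error = n_total - len(groups)
    ms_between = ss_between / df_between
    ms_error = ss_error / df_error

    return {
        "df_between": df_between,
        "df_error": df_error,
        "ss_between": ss_between,
        "ss_error": ss_error,
        "ms_between": ms_between,
        "ms_error": ms_error,
        "F": ms_between / ms_error,
        "pooled_sd": math.sqrt(ms_error),
    }


table = anova_from_summary(groups)
for name, value in table.items():
    print(f"{name:>10}: {value:.2f}")
```

The recomputed values should agree with the Minitab table to within rounding (F near 2.91, pooled standard deviation near 11.38). The P-value is not reproduced here because it requires the F distribution, which is outside the Python standard library.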

Here are the graphs that were produced:

[Figure: Boxplots of Age by Diag (means are indicated by solid circles). Axes: Diag, Age.]

[Figure: Normal Probability Plot of the Residuals (response is Age). Axes: Residual, Normal Score.]

[Figure: Histogram of the Residuals (response is Age). Axes: Residual, Frequency.]

[Figure: Residuals Versus the Fitted Values (response is Age). Axes: Fitted Value, Residual.]

The boxplots suggest that the variances may not be equal. To check, I computed summary statistics, including the standard deviations, within each diagnostic category (Stat → Basic Statistics → Display Descriptive Statistics, then select Age for Variables, check By Variable, and select Diag):

Descriptive Statistics

Variable  Diag    N    Mean  Median  TrMean  StDev
Age       0      13   46.08   46.00   46.18  15.28
          1      38   37.84   36.00   37.12   8.99
          2      19   42.58   39.00   41.47  12.66

Variable  Diag  SE Mean  Minimum  Maximum     Q1     Q3
Age       0        4.24    23.00    68.00  30.50  58.00
          1        1.46    25.00    62.00  31.75  40.50
          2        2.90    26.00    78.00  34.00  50.00

The ratio of the largest to smallest sample variance is (15.28/8.99)² ≈ 2.9, which is a little problematic for the equal-variance assumption of the ANOVA. The histogram and normal probability plot suggest that the data may depart somewhat from normality, but this is less problematic than the unequal variances.
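As a quick check of that arithmetic, here is a short Python sketch that squares the reported standard deviations and takes the ratio of the largest variance to the smallest. The variable names are illustrative.

```python
# Sample standard deviations of Age by diagnosis, from the descriptive statistics.
sd_by_diag = {0: 15.28, 1: 8.99, 2: 12.66}

# Variance is the square of the standard deviation.
variances = {diag: sd ** 2 for diag, sd in sd_by_diag.items()}

# Ratio of the largest sample variance to the smallest.
ratio = max(variances.values()) / min(variances.values())
print(f"largest/smallest variance ratio: {ratio:.2f}")
```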

Example 2: In this data set, we have a sample of patients with spinal injuries. The response variable is total hospital charge. The factor is “Financial Code” group, which is essentially the type of insurance or method of payment (1 = Commercial Insurance, 2 = HMO, 3 = Medicaid, 4 = Medicare, 5 = Non-Resource (uninsured or self-insured), and 6 = PPO). The number of observations within each group is:

Summary Statistics for Discrete Variables

group  Count
1        396
2         74
3         82
4         61
5        190
6         97
N=       900

A 1-way ANOVA on the raw data, with the plots:

One-way Analysis of Variance

Analysis of Variance for charge
Source     DF          SS          MS        F        P
group       5   3.351E+11   6.702E+10     9.22    0.000
Error     894   6.500E+12   7.271E+09
Total     899   6.835E+12

                                   Individual 95% CIs For Mean
                                   Based on Pooled StDev
Level       N      Mean     StDev  -----+------+------+------+-
1         396     41218     58974  (--*--)
2          74     34875    108495  (------*-----)
3          82    107746    122656  (-----*-----)
4          61     48728     75372  (------*------)
5         190     44153     70314  (---*---)
6          97     58861    135911  (-----*----)
                                   -----+------+------+------+-
Pooled StDev =    85270            30000  60000  90000 120000

I made the font smaller so that everything fit without line wraparound.
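The "Individual 95% CIs For Mean Based on Pooled StDev" in the output above are drawn only as character plots, so it can help to see approximate numeric endpoints. The sketch below assumes each interval has the form mean ± critical value × pooled StDev / √N. It also substitutes the normal quantile (about 1.96) for the t quantile with 894 error degrees of freedom, which is an approximation chosen to stay within the Python standard library. The helper name `pooled_ci` is hypothetical.

```python
import math
from statistics import NormalDist

# Pooled standard deviation and per-group (N, mean charge) from the ANOVA output.
pooled_sd = 85270.0
groups = {
    1: (396, 41218),   # Commercial Insurance
    2: (74, 34875),    # HMO
    3: (82, 107746),   # Medicaid
    4: (61, 48728),    # Medicare
    5: (190, 44153),   # Non-Resource
    6: (97, 58861),    # PPO
}

# Assumption: normal quantile used as a stand-in for the t quantile (894 df).
z = NormalDist().inv_cdf(0.975)


def pooled_ci(n, mean, pooled_sd, crit):
    """Approximate interval: mean +/- crit * pooled_sd / sqrt(n)."""
    half_width = crit * pooled_sd / math.sqrt(n)
    return mean - half_width, mean + half_width


intervals = {g: pooled_ci(n, m, pooled_sd, z) for g, (n, m) in groups.items()}
for g, (lo, hi) in intervals.items():
    print(f"group {g}: ({lo:9.0f}, {hi:9.0f})")
```

Under these assumptions, the lower endpoint for group 3 (Medicaid) falls above the upper endpoints of the other five groups.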

[Figure: Boxplots of charge by group (means are indicated by solid circles). Axes: group, charge.]

[Figure: Histogram of the Residuals (response is charge). Axes: Residual, Frequency.]

[Figure: Normal Probability Plot of the Residuals (response is charge). Axes: Residual, Normal Score.]

[Figure: Residuals Versus the Fitted Values (response is charge). Axes: Fitted Value, Residual.]

[Figure: Residuals Versus the Order of the Data (response is charge). Axes: Observation Order, Residual.]

I also reran the analysis using the logarithm of the hospital charge as the response variable (apparently base 10, since the group means of the logged charges are near 4.3). Here are the results:

One-way Analysis of Variance

Analysis of Variance for logcharg
Source     DF        SS        MS        F        P
group       5    17.047     3.409    10.84    0.000
Error     894   281.169     0.315
Total     899   298.216

                                   Individual 95% CIs For Mean
                                   Based on Pooled StDev
Level       N      Mean     StDev  ----+------+------+------+--
1         396    4.3025    0.5312  (-*-)
2          74    4.0397    0.5969  (-----*----)
3          82    4.6809    0.6443  (----*----)
4          61    4.2820    0.6095  (----*-----)
5         190    4.3233    0.5328  (--*--)
6          97    4.3833    0.5962  (---*----)
                                   ----+------+------+------+--
Pooled StDev =   0.5608            4.00   4.25   4.50   4.75
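The largest-to-smallest variance ratio used in Example 1 can also be applied to both versions of this analysis, using the StDev columns of the two ANOVA outputs. This is a small Python sketch of that comparison; the function name `variance_ratio` is illustrative.

```python
# Group standard deviations from the two ANOVA outputs (groups 1-6).
raw_sd = {1: 58974, 2: 108495, 3: 122656, 4: 75372, 5: 70314, 6: 135911}
log_sd = {1: 0.5312, 2: 0.5969, 3: 0.6443, 4: 0.6095, 5: 0.5328, 6: 0.5962}


def variance_ratio(sds):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [sd ** 2 for sd in sds.values()]
    return max(variances) / min(variances)


raw_ratio = variance_ratio(raw_sd)
log_ratio = variance_ratio(log_sd)
print(f"raw charge: {raw_ratio:.2f}")
print(f"log charge: {log_ratio:.2f}")
```

The ratio comes out at about 5.3 for the raw charges and about 1.5 for the log charges, so by this measure the group variances are much closer to equal on the log scale.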

Here are the plots:

[Figure: Boxplots of logcharg by group (means are indicated by solid circles). Axes: group, logcharge.]

[Figure: Histogram of the Residuals (response is logcharg). Axes: Residual, Frequency.]

[Figure: Normal Probability Plot of the Residuals (response is logcharg). Axes: Residual, Normal Score.]

[Figure: Residuals Versus the Fitted Values (response is logcharg). Axes: Fitted Value, Residual.]

[Figure: Residuals Versus the Order of the Data (response is logcharg). Axes: Observation Order, Residual.]