
Analysis of Variance

OPRE 6301

The purpose of analysis of variance is to compare two or more populations of interval data. Specifically, we are interested in determining whether differences exist between the population means.

Although we are interested in a comparison of means, the method works by analyzing the sample variance. The intuitive reason for this is that by studying the “sources” of variation in the sample data, we can decide whether any observed differences among multiple sample means can be attributed to chance, or whether they are indicative of actual differences among the means of the corresponding populations.

Analysis of variance is one of the most widely used methods in statistics.

1 One-Way Analysis of Variance . . .

Consider the following scenarios:

— We wish to decide on the basis of sample data whether there is really a difference in the effectiveness of three methods of teaching computer programming.
— We wish to compare the average yield per acre of wheat when different types of fertilizers are used.
— We wish to find out whether differences exist between the travel time from home to work/school along three different routes.
— We wish to find out whether differences exist between average starting salary for MBA graduates from different schools.

In all of these examples, we are trying to find out the effect of a single “factor” on a population; this explains the description “one-way.”

2 Example: Advertising Strategy . . .

We will illustrate the method using the following concrete example.

An apple juice manufacturer is planning to develop a new product — a liquid concentrate. The marketing manager has to decide how to market the new product. Three strategies are considered:

• Emphasize convenience of using the product.
• Emphasize the quality of the product.
• Emphasize the product’s low price.

An experiment was conducted as follows:

• An advertising campaign was launched in three cities.
• In each city, only one of the three characteristics of the new product (convenience, quality, and price) was emphasized.

3 Weekly sales were recorded for twenty weeks following the beginning of the campaigns (see Xm15-01.xls).

Convenience   Quality   Price
       529       804     672
       658       630     531
       793       774     443
       514       717     596
       663       679     602
       719       604     502
       711       620     659
       606       697     689
       461       706     675
       529       615     512
       498       492     691
       663       719     733
       604       787     698
       495       699     776
       485       572     561
       557       523     572
       353       584     469
       557       634     581
       542       580     679
       614       624     532
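For readers working outside Excel, the data can be entered and the three sample means checked in a few lines of Python. This is a sketch, not part of the original notes; the values are transcribed from the table above.

```python
# Weekly sales for the three advertising strategies (20 weeks each),
# transcribed from the Xm15-01 data table above.
convenience = [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
               498, 663, 604, 495, 485, 557, 353, 557, 542, 614]
quality     = [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
               492, 719, 787, 699, 572, 523, 584, 634, 580, 624]
price       = [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
               691, 733, 698, 776, 561, 572, 469, 581, 679, 532]

# Sample means: 577.55, 653.00, 608.65
for name, sample in [("Convenience", convenience),
                     ("Quality", quality),
                     ("Price", price)]:
    print(name, sum(sample) / len(sample))
```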

4 Research Hypothesis

We wish to answer the following question: Do differences in sales exist between the test markets (i.e., advertising strategies)? In other words, our research hypothesis is that at least two of the three mean sales differ.

Therefore,

H0 : µ1 = µ2 = µ3
H1 : At least two of the µjs differ

We now need to develop an appropriate test statistic to test the above pair of hypotheses.

5 Notation

Let

    k  = total number of populations (k = 3 in this example)
    nj = sample size for the jth population (nj = 20 for all 1 ≤ j ≤ k in this example)
    n  = Σ_{j=1}^k nj, the total number of observations
    Xij = the ith observation from the jth population, where 1 ≤ j ≤ k and, for given j, 1 ≤ i ≤ nj
    X̄j = the sample mean of all observations from the jth population, for 1 ≤ j ≤ k; formally,

              X̄j = (1/nj) Σ_{i=1}^{nj} Xij

    sj = the sample standard deviation of all observations from the jth population
    X̄  = the grand mean of all observations; formally,

              X̄ = (1/n) Σ_{j=1}^k Σ_{i=1}^{nj} Xij = (1/n) Σ_{j=1}^k nj X̄j

6 Thus,

[Table: the data arranged in k columns; column j lists the observations X1j, ..., Xnj,j of the jth sample (e.g., X11 is the first observation of the first sample), with the sample size nj and the sample mean x̄j recorded at the bottom of each column.]

Note that in general, the njs do not have to be the same.

7 Generic Terminology

In the context of this example,

Response Variable: weekly sales, denoted by X above
Responses: actual sales values, which are the observed values of the Xijs
Experimental Unit: weeks in the three cities when we recorded sales figures
Factor: the criterion by which we classify the populations — advertising strategy; another possible factor is advertising medium (TV, newspaper, or internet)
Factor Levels: possible treatments for a given factor — convenience, quality, or price

8 Test Statistic

We will consider two measures of variability. The idea is

depicted below ...

[Figure: two side-by-side dot plots of Treatments 1, 2, and 3 with identical sample means (x̄1, x̄2, x̄3 the same in both panels). Left panel: a small variability within the samples makes it easier to draw a conclusion about the population means. Right panel: the sample means are the same as before, but the larger within-sample variability makes it harder to draw a conclusion about the population means.]

9 Formally, we assume that the Xijs are independent and normally distributed with means µj and a common variance σ². That is,

    Xij = µj + εij ,    (1)

where the εijs, the errors, are normally distributed with mean 0 and variance σ².

Now, if the null hypothesis is true, say with all µjs equal to µ, then the key idea is that we can estimate σ2 in two ways . . .

10 Method 1: For each j, the nj observations X1j, X2j, ..., Xnj,j can be viewed as a sample of size nj from a normal population with mean µ and variance σ². It follows that

    (1/σ²) Σ_{i=1}^{nj} (Xij − X̄j)²

has a χ² distribution with nj − 1 degrees of freedom (see equation (3) in notes for Chapter 12). Furthermore, summing the above over j shows that

    (1/σ²) Σ_{j=1}^k Σ_{i=1}^{nj} (Xij − X̄j)²

has a χ² distribution with n − k degrees of freedom (recall that Σ_{j=1}^k nj = n). Since E(χ²) = ν, we see that σ² can be estimated by

    MSE ≡ (1/(n − k)) Σ_{j=1}^k Σ_{i=1}^{nj} (Xij − X̄j)² ,    (2)

where MSE stands for mean square for errors (cf. the white groupings in the previous figure).

11 Method 2: Recall that for each j, the sample mean X̄j also is a normally distributed variable with mean µ and variance σ²/nj. Since we have one mean for each j, i.e., for each treatment, it is intuitive that one should be able to estimate σ² by looking at the variability of the treatment means around the grand mean. Indeed, it can be shown that

    (1/σ²) Σ_{j=1}^k nj (X̄j − X̄)²

(think of this as weighting the squared deviation (X̄j − X̄)² by nj, the number of observations for treatment j) has a χ² distribution with k − 1 degrees of freedom. Again, since E(χ²) = ν, we see that σ² can also be estimated by

    MST ≡ (1/(k − 1)) Σ_{j=1}^k nj (X̄j − X̄)² ,    (3)

where MST stands for mean square for treatments (cf. the horizontal pink lines in the previous figure).

12 Since both (2) and (3) are unbiased point estimates for σ2, we expect the two estimates to be close. This suggests that we consider the statistic

    F ≡ MST / MSE .    (4)

If the null hypothesis is true, we expect F to be close to 1. If, on the other hand, the null hypothesis is not true, we expect F to be pulled up by a greater MST (again, refer to the previous figure). Thus, the F statistic is a measure of relative variability.

How big an F value is considered sufficient for us to reject H0? The answer depends on the distribution of the F statistic, which turns out to be the so-called (naturally) F distribution. (A discussion of the F distribution can be found on pp. 270–274 of the text.)

13 The F distribution is characterized by two parameters ν1 and ν2. In our setting, these parameters correspond to the degrees of freedom for MST and MSE, respectively, and hence ν1 = k − 1 and ν2 = n − k.

The F critical value for a given right-tail probability α can be found by the Excel function FINV(α, ν1, ν2).

In our example, we have ν1 = 3 − 1 = 2 and ν2 = 60 − 3 = 57. For α = 0.05, this yields the critical value

    FINV(0.05, 2, 57) = 3.16 .
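Outside Excel, the same critical value is the (1 − α)-quantile of the F distribution. A sketch using SciPy (assumed available):

```python
from scipy.stats import f

# FINV(alpha, nu1, nu2) in Excel returns the right-tail critical value,
# i.e., the (1 - alpha)-quantile of the F distribution.
crit = f.ppf(1 - 0.05, dfn=2, dfd=57)
print(round(crit, 2))  # approximately 3.16
```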

We are now ready to test H0 ...

14 The F Test

The MST and the MSE are usually computed via

    MST = SST / (k − 1)

and

    MSE = SSE / (n − k) ,

respectively, where

    SST ≡ Σ_{j=1}^k nj (X̄j − X̄)² ,    (5)

which is called the sum of squares for treatments, and

    SSE ≡ Σ_{j=1}^k Σ_{i=1}^{nj} (Xij − X̄j)² ,    (6)

which is called the sum of squares for errors.

15 For our example, ...

SST: The grand mean is calculated by

    X̄ = (1/n) Σ_{j=1}^k nj X̄j = (20(577.55) + 20(653.00) + 20(608.65)) / 60 = 613.07 .

Hence,

    SST = 20(577.55 − 613.07)² + 20(653.00 − 613.07)² + 20(608.65 − 613.07)²
        = 57,512.23 .

SSE: Equivalently to (6), SSE can be computed from the sample variances as SSE = (n1 − 1)s²1 + (n2 − 1)s²2 + (n3 − 1)s²3:

    SSE = 19(10,775.00) + 19(7,238.11) + 19(8,670.24)
        = 506,983.50 .

16 It follows that

    F = (SST/(k − 1)) / (SSE/(n − k)) = (57,512.23/2) / (506,983.50/57) = 3.23 .

Since 3.23 exceeds 3.16, the F critical value at α = 0.05, there is sufficient evidence to reject H0 in favor of H1, and we conclude that the mean sales under at least two of these three strategies differ.
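The whole computation (grand mean, SST, SSE, MST, MSE, and F) can be reproduced in a few lines of Python. This is a sketch, not part of the original notes, using the data transcribed from the table earlier:

```python
# One-way ANOVA by hand, following equations (2)-(6) in the notes.
samples = {
    "convenience": [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
                    498, 663, 604, 495, 485, 557, 353, 557, 542, 614],
    "quality":     [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
                    492, 719, 787, 699, 572, 523, 584, 634, 580, 624],
    "price":       [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
                    691, 733, 698, 776, 561, 572, 469, 581, 679, 532],
}

k = len(samples)                           # number of treatments
n = sum(len(s) for s in samples.values())  # total number of observations
means = {j: sum(s) / len(s) for j, s in samples.items()}
grand_mean = sum(sum(s) for s in samples.values()) / n

# Sum of squares for treatments (5) and for errors (6).
sst = sum(len(s) * (means[j] - grand_mean) ** 2 for j, s in samples.items())
sse = sum((x - means[j]) ** 2 for j, s in samples.items() for x in s)

mst = sst / (k - 1)  # mean square for treatments (3)
mse = sse / (n - k)  # mean square for errors (2)
F = mst / mse        # the F statistic (4)

# SST ≈ 57,512.23, SSE ≈ 506,983.50, F ≈ 3.23
print(round(sst, 2), round(sse, 2), round(F, 2))
```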

As usual, we can also determine the right-side p-value associated with F = 3.23 via

    FDIST(3.23, 2, 57) = 0.0469 ,

which is seen to be lower than 0.05.
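The Excel call FDIST(x, ν1, ν2) corresponds to the survival function of SciPy's F distribution; a sketch:

```python
from scipy.stats import f

# FDIST(x, nu1, nu2) is the right-tail probability P(F > x),
# i.e., the survival function of the F distribution.
p_value = f.sf(3.23, dfn=2, dfd=57)
print(round(p_value, 4))  # approximately 0.0469
```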

[Figure: density of the F distribution with (2, 57) degrees of freedom; the shaded right tail beyond 3.23 has area p-value = P(F > 3.23) = 0.0469.]

17 The Data Analysis tool “Anova: Single Factor” can be used to carry out F tests. This greatly simplifies our task.

The output for our example is shown below.

[Excel output: a summary table with columns Groups, Count, Sum, Average, and Variance for the three treatments, followed by an ANOVA table with columns Source of Variation, SS, df, MS, F, P-value, and F crit.]

Note that the last row of the output gives the total of SST and SSE, defined as:

    SS ≡ Σ_{j=1}^k Σ_{i=1}^{nj} (Xij − X̄)² .    (7)
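Outside Excel, the single-factor ANOVA that the tool performs is one call to `scipy.stats.f_oneway`; a sketch using the transcribed data:

```python
from scipy.stats import f_oneway

convenience = [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
               498, 663, 604, 495, 485, 557, 353, 557, 542, 614]
quality     = [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
               492, 719, 787, 699, 572, 523, 584, 634, 580, 624]
price       = [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
               691, 733, 698, 776, 561, 572, 469, 581, 679, 532]

# One-way ANOVA in a single call: returns the F statistic and p-value.
result = f_oneway(convenience, quality, price)
print(result.statistic, result.pvalue)  # F ≈ 3.23, p ≈ 0.047
```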

18 It can be shown that

    SS = SST + SSE    (8)

and that SS/σ² follows the χ² distribution with n − 1 degrees of freedom. The decomposition of SS into the sum of SST and SSE in (8) is helpful in visualizing how the F statistic is related to the two “sources” of variation, between groups and within groups (groups = treatments). We will encounter a similar decomposition in regression analysis.
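The decomposition (8) is easy to verify numerically; a sketch, with the data as before:

```python
# Verify SS = SST + SSE for the apple-juice data.
samples = [
    [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
     498, 663, 604, 495, 485, 557, 353, 557, 542, 614],
    [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
     492, 719, 787, 699, 572, 523, 584, 634, 580, 624],
    [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
     691, 733, 698, 776, 561, 572, 469, 581, 679, 532],
]
n = sum(len(s) for s in samples)
grand_mean = sum(sum(s) for s in samples) / n
means = [sum(s) / len(s) for s in samples]

ss  = sum((x - grand_mean) ** 2 for s in samples for x in s)               # (7)
sst = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))  # (5)
sse = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)         # (6)
print(abs(ss - (sst + sse)))  # essentially zero (floating-point noise)
```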

19 ANOVA: Two Factors . . .

As noted earlier, advertising medium (TV, newspaper, or internet) could be considered as another factor that has an effect on sales. Visually, this means ...

[Figure: one-way ANOVA has a single factor, with one set of responses for each of treatments 1–3 (levels 1–3); two-way ANOVA has two factors, with a set of responses for each combination of Factor A (levels 1–3) and Factor B (levels 1–2).]

The Data Analysis tool “Anova: Two-Factor ...” can be used for this. The idea is the same, and we leave out the details.

20 Testing for Differences . . .

After having rejected H0 in our example, a natural question arises: Which mean(s) is (are) different and, furthermore, what is the rank order? A casual approach of course is to directly look at the X̄js, but we should rely on a more rigorous statistical test.

Recall that for two populations having a common unknown variance, the confidence interval for the difference between the population means µ1 − µ2 is given by

    (X̄1 − X̄2) ± t_{α/2} √( s²p (1/n1 + 1/n2) ) ,    (9)

where t_{α/2} has n1 + n2 − 2 degrees of freedom and

    s²p = ((n1 − 1)s²1 + (n2 − 1)s²2) / (n1 + n2 − 2) .    (10)

In fact, the F test with k = 2 is equivalent to the t test; that is, F = t2.
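The claimed equivalence F = t² for k = 2 is easy to check numerically; a sketch comparing SciPy's one-way ANOVA with the pooled-variance t test on two of the samples:

```python
from scipy.stats import f_oneway, ttest_ind

a = [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
     498, 663, 604, 495, 485, 557, 353, 557, 542, 614]
b = [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
     492, 719, 787, 699, 572, 523, 584, 634, 580, 624]

F = f_oneway(a, b).statistic
t = ttest_ind(a, b).statistic  # pooled-variance (equal-variance) t test
print(F, t ** 2)               # the two values agree
```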

21 The idea then is to apply (9) and (10) to all pairs of sample means for treatments. However, we will make two revisions to (9) and (10) ...

Revision 1 (Fisher):

In this revision, we replace s²p in (10) by MSE. The reasoning is that MSE is more accurate, as it is based on data from all (not just two) treatment groups.

Formally, this results in the following decision rule: Reject µi = µj (for any pair of i and j) if |X̄i − X̄j| > LSDij, where

    LSDij = t_{α/2} √( MSE (1/ni + 1/nj) ) ,    (11)

with t_{α/2} now having n − k degrees of freedom. The letters LSD stand for “Least Significant Difference.”

22 Revision 2 (Bonferroni):

Let αE (“E” for experiment-wide) be the probability of committing at least one Type I error in all c = k(k − 1)/2 pairwise tests. Then, under the assumption that the tests are only “mildly dependent,” we have the approximation:

    αE ≈ 1 − (1 − α)^c .    (12)
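Approximation (12) can be tabulated to see how fast αE grows with k; a sketch:

```python
# Experiment-wide Type I error: alpha_E ≈ 1 - (1 - alpha)^c,
# where c = k(k - 1)/2 is the number of pairwise tests.
alpha = 0.05
for k in (3, 5, 10):
    c = k * (k - 1) // 2
    alpha_e = 1 - (1 - alpha) ** c
    print(k, c, round(alpha_e, 3))  # already about 0.14 for k = 3
```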

Observe that as k grows, the above probability grows rapidly. It follows that Fisher’s procedure could easily result in an unacceptably high αE.

To remedy this problem, we will rely on the following (Bonferroni) bound: αE ≤ cα (the probability of a union is bounded above by the sum of the probabilities for all individual events). This bound implies that if we wish to ensure that the probability of committing a Type I error is no greater than αE, then we should set α for the pairwise tests to αE/c.

Conclusion: t_{α/2} in (11) should be replaced by t_{αE/(2c)}.

23 Summary

After having rejected H0, the following procedure can be used to conduct pairwise tests of differences in treatment means:

— Compute c, which is given by k(k − 1)/2.

— Set α = αE/c, where αE is the targeted probability for making at least one Type I error in all pairwise tests.

— For a pair of i and j, conclude that µi and µj differ if

    |X̄i − X̄j| > t_{α/2} √( MSE (1/ni + 1/nj) ) ,    (13)

where the degrees of freedom of tα/2 is n − k.

24 Example: Advertising Strategy — Continued

We have rejected H0. Which pair(s) of i and j has (have) different means?

Analysis: For k = 3, we have c = 3 · 2/2 = 3. To achieve (say) αE = 0.05, we let α = 0.05/3 = 0.0167. From Excel, we obtain

    TINV(0.0167, 60 − 3) = 2.467 .

Hence,

    |x̄1 − x̄2| = |577.55 − 653.00| = 75.45 > LSD12 ≈ 73.6
    |x̄1 − x̄3| = |577.55 − 608.65| = 31.10 < LSD13 ≈ 73.6
    |x̄2 − x̄3| = |653.00 − 608.65| = 44.35 < LSD23 ≈ 73.6

where LSDij = 2.467 √( MSE (1/20 + 1/20) ) for every pair.

We conclude that there is a significant difference between µ1 and µ2.
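The pairwise procedure of the Summary can be sketched in Python (not part of the original notes). SciPy provides the t quantile; note that Excel's TINV(α, ν) returns the two-tailed value, i.e., t_{α/2}:

```python
from itertools import combinations
from scipy.stats import t

samples = {
    "convenience": [529, 658, 793, 514, 663, 719, 711, 606, 461, 529,
                    498, 663, 604, 495, 485, 557, 353, 557, 542, 614],
    "quality":     [804, 630, 774, 717, 679, 604, 620, 697, 706, 615,
                    492, 719, 787, 699, 572, 523, 584, 634, 580, 624],
    "price":       [672, 531, 443, 596, 602, 502, 659, 689, 675, 512,
                    691, 733, 698, 776, 561, 572, 469, 581, 679, 532],
}
k = len(samples)
n = sum(len(s) for s in samples.values())
means = {j: sum(s) / len(s) for j, s in samples.items()}
sse = sum((x - means[j]) ** 2 for j, s in samples.items() for x in s)
mse = sse / (n - k)                   # MSE from the F test

c = k * (k - 1) // 2                  # number of pairwise tests (3 here)
alpha = 0.05 / c                      # Bonferroni-adjusted per-test level
t_crit = t.ppf(1 - alpha / 2, n - k)  # two-tailed quantile; TINV(0.0167, 57) ≈ 2.467

# Compare each |difference of sample means| against its LSD (11).
for a, b in combinations(samples, 2):
    lsd = t_crit * (mse * (1 / len(samples[a]) + 1 / len(samples[b]))) ** 0.5
    diff = abs(means[a] - means[b])
    print(a, "vs", b, round(diff, 2),
          "differ" if diff > lsd else "no significant difference")
```

Only the convenience/quality pair exceeds its LSD, matching the conclusion above.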
