UNIT THREE Chi-square test of independence; correlation coefficient; regression analysis
CHI-SQUARE TEST OF INDEPENDENCE
A chi-square test of independence is designed to assess whether, for some population of entities, two categorical variables are independent of one another. (If they are not independent, then they are related.) For example, you might wonder whether, for U.S. undergraduates, gender (with the two categories Male and Female) and college major (with the six categories Humanities or Social Sciences, Arts, Math or Science, Business, Education, and Other) are independent. If the variables were independent, that would mean that--for each possible major--the proportion of males choosing that major equals the proportion of females choosing that major. Alternatively one could say that if the variables were independent, the percentage breakdown of males by major would be identical to the percentage breakdown of females by major. One could test H0: gender and college major are independent vs. H1: gender and college major are related by gathering gender and college major data on a random sample of U.S. undergraduates. The hypothesis testing procedure is outlined on the last page of this packet. Incidentally, even quantitative variables can be “cast” as categorical variables. For example, the variable Income can have as possible categories: “below $30,000,” “$30,000-$60,000,” and “above $60,000.”
Because the test statistic in a chi-square test of independence has approximately a chi-square distribution (hence the term chi-square in the name of the test), here is a brief introduction to that distribution. There is a family of chi-square distributions; each member of the family has a certain number (v, where v is a positive integer) of degrees of freedom. The chi-square distribution with v degrees of freedom is the distribution of the sum of the squares of v independent standard normal variables, and has a mean value equal to v. The figure below depicts five chi-square distributions.
Figure. Distributions depicted, from “tallest” to “shortest”: chi-square distribution with 1, 2, 3, 5, and 10 degrees of freedom, respectively.
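As a rough illustration of how the test statistic for a chi-square test of independence could be computed, the sketch below builds the expected counts under independence (row total times column total, divided by the sample size) and sums (f - e)²/e over the cells. The function name chi_square_independence is hypothetical, and the counts are the ones from Practice Problem 1 later in this packet.

```python
def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for a two-way table.

    `table` is a list of rows of observed counts f.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: (row total * column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Chi-square statistic: sum over all cells of (f - e)^2 / e
    stat = sum(
        (f - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for f, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom: (rows - 1) * (columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


# Observed counts from Practice Problem 1 (capital structure by firm asset size)
observed = [[14, 20, 16],
            [30, 36, 18]]
stat, df, expected = chi_square_independence(observed)
print(round(stat, 2), df)
```

The statistic comes out at about 2.0 with 2 degrees of freedom, in line with the answer key for that problem. The p-value would still be looked up in a chi-square table (or obtained from software) using those degrees of freedom.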
PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT (typically simply called correlation coefficient)
The correlation coefficient is a measure of the strength and direction of the linear association between two quantitative variables X and Y. A correlation coefficient can be determined for an entire population of paired (x,y) observations or for a sample of paired (x,y) observations. In essence, the correlation coefficient measures the extent to which a unit increase in the value of X is associated with a specific change (whether an increase or decrease) in the value of Y. ρ denotes the population correlation coefficient and r denotes the sample correlation coefficient. Each of ρ and r falls in the interval [-1,1].

$\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$, where $\sigma_{xy}$, called the population covariance of X and Y, is defined for a finite population as

$\sigma_{xy} = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$

with (xi,yi) denoting the ith paired observation in the population.

$r = \dfrac{s_{xy}}{s_x s_y}$, where $s_{xy}$, called the sample covariance of X and Y, is defined as

$s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

with (xi,yi) denoting the ith paired observation in the sample.
Unlike the correlation coefficient, the covariance will change when the unit of measurement for X or for Y is changed. For example, a change in unit from thousands of dollars to dollars would increase the magnitude of (absolute value of) the covariance but leave the correlation coefficient unchanged; conversely, a change in unit from inches to feet would decrease the magnitude of (absolute value of) the covariance but leave the correlation coefficient unchanged.
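The effect of a change of units can be checked numerically. The sketch below is only an illustration: the helper names sample_cov and sample_corr are hypothetical, and the data are the nine (output, cost) pairs from the Practice Problem 2 printout later in this packet, with cost expressed first in thousands of dollars and then in dollars.

```python
from math import sqrt


def sample_cov(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """Sample correlation r = s_xy / (s_x * s_y)."""
    s_x = sqrt(sample_cov(xs, xs))  # covariance of X with itself is its variance
    s_y = sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)


output_tons = [1, 2, 4, 8, 6, 5, 8, 9, 7]
cost_thousands = [2, 3, 4, 7, 6, 5, 8, 8, 6]          # cost in $1000's
cost_dollars = [1000 * c for c in cost_thousands]      # same costs in dollars

cov_thousands = sample_cov(output_tons, cost_thousands)
cov_dollars = sample_cov(output_tons, cost_dollars)
r_thousands = sample_corr(output_tons, cost_thousands)
r_dollars = sample_corr(output_tons, cost_dollars)

print(cov_thousands, cov_dollars)   # covariance grows by a factor of 1000
print(r_thousands, r_dollars)       # correlation is the same in both units
```

The correlation agrees with the "Multiple R" value on that printout, while the covariance scales with the unit of measurement.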
As an example of a sample correlation coefficient, consider, with respect to a sample of nine U.S. airlines, 1998 data on the variables X (% flights on time) and Y (complaint rate). The sample of nine (x,y) observations pictured in the scatter diagram below yielded r = -.88, which would be referred to as a high negative correlation, and indicates (for the sample data) a strong inverse (or negative) linear association between % flights on time and the complaint rate.
Figure. Scatter diagram of 1998 data on a sample of U.S. airlines (r = -.88). Horizontal axis: % of flights arriving on time; vertical axis: no. of complaints (per 100,000 passengers).
As another example, consider, with respect to states in the mid-1990s, the variables X (spending per pupil) and Y (mean composite score of students on the NAEP test). For a sample of 35 states, the (x,y) observations pictured below yielded r = .34, a low positive correlation, indicating (for the sample data) a weak direct (or positive) linear association between spending per pupil and average student performance.
Figure. Scatter diagram of mid-90s data on Spending & Student Achievement for 35 States (r = .34). Horizontal axis: Spending per Pupil ($); vertical axis: composite score on NAEP test.
REGRESSION ANALYSIS
Regression analysis deals with developing a statistical model relating some quantitative variable (typically labeled Y, and called the dependent variable or criterion variable) to one or more other variables (typically labeled X1, X2, etc., and called the independent variables or the predictor variables or the explanatory variables), at least one of which is quantitative. Regression analysis has various purposes, including:
(1) prediction, that is, to predict the value of some dependent variable given the value(s) of the independent variable(s). For example, a chain of portrait studios located in medium-sized cities and specializing in children's portraits may want to predict annual sales of a studio in a city from the number of people in the city 16 years of age or less and the per capita disposable personal income of the city’s residents.
(2) estimating effects, that is, to estimate the impact of changes in the value of an independent variable on the value of the dependent variable. For example, a real estate broker may wish to estimate, for houses in a particular community, the impact of each additional bathroom (while controlling for size of house, size of lot, number of bedrooms, and age of house) on the sales price.
(3) testing theories postulating particular relationships between a dependent variable and one or more independent variables.
In simple regression, there is exactly one independent variable. In multiple regression, there are two or more independent variables.
SIMPLE LINEAR REGRESSION MODEL
The (classical normal) simple linear regression model is expressed in the form
Y = ß0 + ß1X + ε, where:
* Y and X are quantitative variables
* ß0 and ß1 are real numbers
* ε is called the residual term or error term

The model is referred to as linear in the parameters ß0 and ß1. Assumptions of the model include (with x denoting any value in the presumed domain of X):

Across all entities with X = x:
1. E(ε) = 0 (⇒ E(Y) = ß0 + ß1x). This may be called the linearity assumption. E(Y) = ß0 + ß1X is called the population regression equation (or population regression line). It follows from this assumption that an entity with ε > 0 has an above average value for Y given its value of X, and an entity with ε < 0 has a below average value for Y given its value of X.
2. ε is normally distributed (⇒ Y is normally distributed). This may be called the normality assumption.
3. VAR(ε) = σ² (⇒ VAR(Y) = σ²). This may be called the equal variances assumption. (Another name for equal variances is homoskedasticity. A name for unequal variances is heteroskedasticity.)

For any two entities:
4. the respective ε’s are independent (⇒ the respective Y’s are independent). This may be called the independence assumption.
The mnemonic LINE (L for linearity, I for independence, N for normality, and E for equal variances) can assist in remembering these assumptions. It is also implicitly assumed that the model is correctly specified (e.g., correct functional form; no missing independent variables).
SIMPLE LINEAR REGRESSION ANALYSIS (delegating calculations to Excel)
Once a random sample of n entities has been drawn, the values of X and Y for those entities have been determined, and an Excel printout containing the results of applying Excel’s REGRESSION tool to the sample data has been obtained, all of the following activities related to a regression analysis can be performed.
(a) Constructing a scatter diagram.
A scatter diagram is a plot of the (x,y) data points in the sample.
(b) Graphing (by hand) the sample regression equation on the scatter diagram.
The sample regression equation (which with only one independent variable may be called the least squares line) is given by est(Y) = b0 + b1X, where est(Y) is verbalized as "estimated Y" and where b0 (the y-intercept) and b1 (the slope) are constants chosen so as to minimize Σ [y - est(y)]², where the sum is taken across the n data points in the sample. Thus, the least squares line is the line for which the sum of the squared vertical distances between the data points in the sample and the line is minimized. The sample regression equation is an estimate of the population regression equation, E(Y) = ß0 + ß1X.
Where do you find the values for b0 (the y-intercept) and b1 (the slope) on the Excel printout? In the Coefficients column of the long narrow table, with the value of b0 provided next to the term Intercept and the value of b1 provided next to the name of X. How do you graph the line by hand? After writing down the equation of the line, pick any value for X and solve for est(Y). Then pick any other value for X and solve for est(Y). Plot the two points and then draw a line through them.
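Although this packet delegates the calculations to Excel, it may help to see where b0 and b1 come from. The sketch below uses the standard closed-form least squares solution (slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope × x̄), which is not stated in this packet; the function names are hypothetical, and the data are the nine (output, cost) pairs from the Practice Problem 2 printout. It then evaluates the line at two X values, as one would when graphing by hand.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # y-intercept
    return b0, b1


output_tons = [1, 2, 4, 8, 6, 5, 8, 9, 7]
cost_thousands = [2, 3, 4, 7, 6, 5, 8, 8, 6]

b0, b1 = least_squares(output_tons, cost_thousands)


def est_y(x):
    """Estimated Y from the sample regression equation est(Y) = b0 + b1*X."""
    return b0 + b1 * x


# Two points on the line, enough to draw it by hand
point_a = (2, est_y(2))
point_b = (8, est_y(8))
print(b0, b1)
print(point_a, point_b)
```

The coefficients should agree with the Intercept and Output rows of the Coefficients column on that printout, and the two points with the predicted costs listed there for outputs of 2 and 8 tons.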
(c) Interpreting the standard error of the estimate (se).
The standard error of the estimate, se, is an estimate of the standard deviation in Y across all entities having the same value of X. To interpret an se of 2.7, for example, one could say: Across all entities having the same value of X, the standard deviation in Y is estimated to be 2.7 units. Where do you find the standard error of the estimate on the Excel printout? In the SUMMARY OUTPUT table next to the term Standard Error.
(d) Constructing a residual plot to informally assess whether model assumptions appear met.
Each entity in the sample has a sample residual score of e = y - est(y), where est(y) is determined by substituting the entity's value for X into the sample regression equation and solving for est(y). e is an estimate of the true residual term ε associated with an entity. The sample residuals (residual scores) are provided on the Excel printout. A residual plot is a graph of all the (x, e) points for the entities in the sample. This plot appears on the Excel printout.
The linearity assumption will appear to be met if, when you visually scan the plot from left to right, the points appear to fluctuate about the horizontal 0 line.
The normality assumption will appear to be met if roughly 68% of the residuals are between -se and se, and roughly 95% of the residuals are between –2se and 2se.
The equal variances assumption will appear to be met if, when you visually scan the plot from left to right, the vertical scatter of the points about the horizontal 0 line remains roughly the same.
Checking the independence assumption requires a natural ordering of the data (e.g., by time, or on the basis of a variable not included in the model). The independence assumption will appear to be met if in a residual plot with the residuals plotted from left to right in a natural ordering (e.g., by time, or on the basis of a variable not included in the model), there is no systematic pattern in the points.
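The 68%/95% rule of thumb for the normality assumption can be tallied directly from the residuals. The sketch below is illustrative only: it uses the nine residuals and the standard error of the estimate as they appear on the Practice Problem 2 printout later in this packet, and the variable names are hypothetical. With so few residuals, the percentages can only be a rough guide.

```python
# Residuals and standard error of the estimate, copied from the
# Practice Problem 2 printout (RESIDUAL OUTPUT and SUMMARY OUTPUT)
residuals = [-0.019643, 0.228571, -0.275, -0.282143, 0.221429,
             -0.026786, 0.717857, -0.033929, -0.530357]
se = 0.388285084

# Count residuals falling between -se and se, and between -2se and 2se
count_within_1se = sum(1 for e in residuals if abs(e) <= se)
count_within_2se = sum(1 for e in residuals if abs(e) <= 2 * se)

share_within_1se = count_within_1se / len(residuals)
share_within_2se = count_within_2se / len(residuals)

print(f"within 1 se: {count_within_1se} of {len(residuals)} ({share_within_1se:.0%})")
print(f"within 2 se: {count_within_2se} of {len(residuals)} ({share_within_2se:.0%})")
```

The observed shares are then compared informally with roughly 68% and 95%.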
The confidence intervals and test statistics referred to below exactly apply only when all the model assumptions are met. Should one or more assumptions appear violated, there are ways to transform the data to try to correct the problem (but we will not be discussing those ways).
(e) Interpreting SST, SSR, SSE, and r².
Each sum below is taken over all the data points in the sample. Thus, the four measures are sample measures.
SST, called sum of squares total, measures the total variation in the y-values. SST = Σ (y - ȳ)²

SSR, called sum of squares regression, measures the variation in the y-values explained by the sample regression equation. SSR = Σ [est(y) - ȳ]²

SSE, called sum of squares error, measures the variation in the y-values not explained by the sample regression equation. SSE = Σ [y - est(y)]². note: the values for b0 and b1 in the sample regression equation are chosen so as to minimize SSE.
SST = SSR + SSE.
r², called the sample coefficient of determination, measures the proportion (percentage) of the total variation in the y-values that is explained by the sample regression equation (i.e., explained by the variation in X). To interpret an r² of .87, for example, one could say: 87% of the variation in the y-values is explained by the sample regression equation. A high r² is necessary, but not sufficient, for being able to use the sample regression equation to predict Y from X with reasonable precision and certainty. Where do you find r² on the Excel printout? In the SUMMARY OUTPUT table next to the term R Square.
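To make the three sums of squares concrete, the sketch below computes them for the nine (output, cost) pairs from the Practice Problem 2 printout and checks that SST = SSR + SSE. The least squares line is fitted with the standard closed-form formulas (an assumption of this sketch; the packet itself reads b0 and b1 off the Excel printout), and all variable names are illustrative.

```python
xs = [1, 2, 4, 8, 6, 5, 8, 9, 7]   # output (tons)
ys = [2, 3, 4, 7, 6, 5, 8, 8, 6]   # production cost ($1000's)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept (standard closed-form solution)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# est(y) for each data point in the sample
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained variation
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
r_squared = ssr / sst                                  # coefficient of determination

print(sst, ssr, sse, r_squared)
```

The three sums should match the SS column of the ANOVA table on that printout, and r² should match R Square.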
(f) Testing for a linear relationship between Y and X.
contextual note: There is a family of what are called F-distributions, each with a certain number of numerator degrees of freedom (d.f.) and a certain number of denominator degrees of freedom. F-distributions are positively skewed. The definition of an F-distribution and a graph of four members of the family may be found at http://www.itl.nist.gov/div898/handbook/eda/section3/eda3665.htm.
If the model assumptions are met, an appropriate test statistic for testing
H0: Y is not linearly related to X (i.e., ß1 = 0), versus
H1: Y is linearly related to X (i.e., ß1 ≠ 0)
is F = MSR/MSE, where MSR = SSR/1 and MSE = SSE/(n-2). MSE is called mean square error. It is an estimate of σ², the variance of ε (same as variance of Y) across all entities having the same value of X. MSR is called mean square regression. If H0 is true, the expected value of MSR is σ². However, if H1 is true, the expected value of MSR will be greater than σ². For that reason, if Fcalculated is large (what’s considered large depends on the sample size), H1 is supported. Consequently, the p-value is calculated as p = P(F ≥ Fcalculated), where Fcalculated is the value of F calculated from the sample of (x,y) data points. Where do you find Fcalculated and its associated p-value on the Excel printout? In the ANOVA table. Fcalculated is given in the F column. The p-value is given in the Significance [of] F column.
The theorem underlying the above choice of test statistic is: If the model assumptions are met and H0 is true (i.e., there is no linear relationship between Y and X, i.e., ß1 = 0), then the sampling distribution of F = MSR/MSE is the F-distribution with 1 numerator d.f. and n-2 denominator d.f.
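The arithmetic behind the F column can be reproduced from the sums of squares. The sketch below takes SSR, SSE, and n from the Practice Problem 2 printout later in this packet (the variable names are illustrative) and forms MSR, MSE, and F. It does not compute the p-value, which in this packet is read from the Significance F column.

```python
from math import sqrt

# Values copied from the ANOVA table of the Practice Problem 2 printout
ssr = 35.16687    # sum of squares regression
sse = 1.055357    # sum of squares error
n = 9             # number of observations

msr = ssr / 1           # mean square regression (1 numerator d.f.)
mse = sse / (n - 2)     # mean square error (n - 2 denominator d.f.)
f_calculated = msr / mse

# The standard error of the estimate is the square root of MSE
se = sqrt(mse)

print(msr, mse, f_calculated, se)
```

The resulting F and se should agree, up to rounding, with the F column of the ANOVA table and the Standard Error entry in the SUMMARY OUTPUT table.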
(g) Obtaining a point estimate for the mean value of Y across all entities having X = x*. To get a point estimate (single-number “best guess”) for the mean value of Y across all entities having X = x*, substitute x* for X in the sample regression equation and solve for est(Y). (One should also —though due to time constraints, we will not—obtain a confidence interval for the mean Y across all entities having X = x*.)
(h) Obtaining a point estimate for the value of Y for a single entity having X = x*. To get a point estimate (single-number “best guess”) for the value of Y for a single entity having X = x*, substitute x* for X in the sample regression equation and solve for est(Y). (One should also— though due to time constraints, we will not—obtain a confidence interval for the value of Y for a single entity having X = x*.)
(i) Interpreting b1, the sample regression coefficient of X. The sample regression coefficient of X--which in a simple regression context is the slope b1 of the least squares line--is an estimate of the change in the mean Y with each additional unit increase in X. More specifically, it is an estimate of (the mean Y across all possible entities having X = x + 1) - (the mean Y across all possible entities having X = x) for any x and x + 1 in the domain of X under examination. To interpret a coefficient of +3, for example, one could say “We estimate that as X increases by 1 unit, the mean Y increases by 3 units.” To interpret a coefficient of -3, for example, one could say “We estimate that as X increases by 1 unit, the mean Y decreases by 3 units.”
(j) Determining a confidence interval for ß1, the (true) change in the mean Y with each additional unit increase in X.
When the model assumptions are met, this confidence interval is given by b1 ± t × sb1, where b1 is the sample regression coefficient of X (i.e., the slope of the least squares line), t is from the t-distribution with n-2 df, and sb1 (called the standard error of the coefficient) is an estimate (derived from the sample data) of the standard deviation in the sample regression coefficient b1 (slope of least squares line) over repeated sampling of n entities. A 95% confidence interval for ß1 is provided on the Excel printout in the “Lower 95%” and “Upper 95%” columns and second row of the long narrow table. To interpret a 95% confidence interval of (.5,.8), for example, one could say: We are 95% confident that as X increases by 1 unit, the mean Y increases by between .5 and .8 units.
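As a check on how the “Lower 95%” and “Upper 95%” entries arise, the sketch below applies b1 ± t × sb1 to the Output row of the Practice Problem 2 printout later in this packet. The critical value 2.365 is an assumption of this sketch: it is the usual t-table entry for a 95% interval with n - 2 = 7 degrees of freedom, not a number taken from the packet. Variable names are illustrative.

```python
# Values copied from the Output row of the Practice Problem 2 printout
b1 = 0.751785714    # sample regression coefficient (slope)
s_b1 = 0.049224     # standard error of the coefficient

# Assumed t-table value for 95% confidence with 7 degrees of freedom
t_critical = 2.365

margin = t_critical * s_b1
lower = b1 - margin
upper = b1 + margin

print(round(lower, 3), round(upper, 3))
```

The endpoints should agree with the printout's interval for Output to about three decimal places; small differences come from rounding the t value.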
(k) Suggesting two additional independent variables which might enable one to better predict or estimate Y. This is a judgment call on your part. Any reasonable answer will be accepted.
Precautions associated with regression analysis
(1) Check the validity of the model assumptions before relying on procedures predicated on those assumptions.
(2) Don't extrapolate, that is, do not attempt to predict or estimate the dependent variable for values of the independent variables outside the region of values of the independent variables encompassed by the sample data.
(3) Don't infer cause and effect from evidence of a linear relationship.
(4) Don't discard an outlier unless there is justification for doing so. (Outliers for any distribution are "quite far"--say, 2.5 standard deviations or more--from the mean.)
Practice Problems for Test #3
1. A financial consultant wishes to determine whether, for firms in a given industry, there is a relationship between firm asset size and capital structure. A random sample of 134 firms in the industry was selected, and cross-classified as follows:
                                   firm asset size
capital structure                low    medium    high
debt less than or = equity        14      20       16
debt greater than equity          30      36       18
Does this sample data provide sufficient evidence to conclude that firm asset size and capital structure are related?
2. An operations manager at a textile mill would like to be able to predict total monthly production costs from the monthly output of textile. For a random sample of 9 months, the production costs (PRODCOST, in $1000's) and output of textile (OUTPUT, in tons) were determined. On the next page is a scatter diagram and results of an Excel-generated regression analysis on this data. In a testing situation, you will be given such Excel-generated output.
(a) Superimpose on the scatter diagram a graph of the least squares line (i.e., the sample regression equation) and label two points belonging to the least squares line.
(b) Assess from the residual plot whether each of the linearity, normality, and equal variances assumptions appears to be met.
(c) Interpret each of SST, SSR, SSE, and r² in the context of this problem.

Assume for the remaining questions that all model assumptions are satisfied.
(d) Test for a linear relationship between production cost and output.
(e) Interpret the sample regression coefficient of output (slope of the least squares line) in the context of this problem.
(f) Interpret the standard error of the estimate in the context of this problem.
(g) What would you predict the total production cost to be for a month during which 5 tons of textile is produced?
(h) Suggest two additional variables that might help in predicting monthly production costs.
3. For problem #2, state the assumptions of the simple linear regression model in the context of the problem.
4. (a) What does a correlation coefficient measure? (b) What type of correlation would you expect between, for operating managers, number of years of managerial experience and most recent job performance rating?
Printout to accompany Practice Problem 2. note: X = output (in tons) and Y = production costs (in $1000's)

Output   Cost
  1        2
  2        3
  4        4
  8        7
  6        6
  5        5
  8        8
  9        8
  7        6

Figure. Scatter diagram of the sample data. Horizontal axis: Output (in tons); vertical axis: Production Cost (in $1000's).

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.985324502
R Square             0.970864373
Adjusted R Square    0.966702141
Standard Error       0.388285084
Observations         9

ANOVA
              df    SS          MS          F           Significance F
Regression     1    35.16687    35.16687    233.2557    1.24E-06
Residual       7    1.055357    0.150765
Total          8    36.22222

             Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept    1.267857143     0.302549          4.19058     0.004083    0.552442     1.983272
Output       0.751785714     0.049224          15.27271    1.24E-06    0.635389     0.868182
RESIDUAL OUTPUT

Observation    Predicted Cost    Residuals
1              2.019642857       -0.019643
2              2.771428571        0.228571
3              4.275             -0.275
4              7.282142857       -0.282143
5              5.778571429        0.221429
6              5.026785714       -0.026786
7              7.282142857        0.717857
8              8.033928571       -0.033929
9              6.530357143       -0.530357

Figure. Residual Plot. Horizontal axis: Output (0 to 10); vertical axis: Residual, with gridlines at 0, ±0.39, ±0.78, and ±1.17.

Answers:
1. H0: firm asset size and capital structure are independent
H1: firm asset size and capital structure are not independent
χ² = Σ (f-e)²/e = (14-16.4)²/16.4 + (20-20.9)²/20.9 + (16-12.7)²/12.7 + (30-27.6)²/27.6 + (36-35.1)²/35.1 + (18-21.3)²/21.3 = 2.00
p = P(χ² ≥ 2.00); .10 < p < .90 [note: use the χ²-distribution with 2 d.f.]
Conclusion: the sample does not provide sufficient evidence that firm asset size and capital structure are related (H0 of independence is not rejected).
2. (a) The least squares line is est(Cost) = 1.268 + .752(Output). Two points belonging to that line are (2,2.8) and (8,7.3). Plot, label, and draw a line through the 2 points you used.
(b) The linearity, normality, and equal variances assumptions appear met.
(c) SST represents the total variation in the production costs. SSR represents the variation in the production costs explained by the sample regression equation. SSE represents the variation in the production costs not explained by the sample regression equation. Interpretation of r² = .971: approximately 97.1% of the variation in the production costs is explained by the sample regression equation.
(d) H0: production cost is NOT linearly related to output
H1: production cost is linearly related to output
F = MSR/MSE = 233.256
p = P(F ≥ 233.256) ≈ 0
Very strong evidence that production cost is linearly related to output.
(e) Interpretation of b1 = .752: We estimate that as output increases by 1 ton, the mean production cost increases by .752 thousands of dollars (i.e., $752).
(f) Interpretation of se = .388: Across all months with the same output level, the standard deviation in the production costs is estimated to be .388 thousands of dollars (i.e., $388).
(g) 5.028 thousands of dollars (i.e., $5,028)
(h) the type of production technology; the experience level of the workers
3. Across all months with the same output: E(ε) = 0, ε is normally distributed, and VAR(ε) = σ². For any two months, the respective ε’s are independent.
4. (a) A correlation coefficient measures the strength and direction of linear association between two variables. (b) a low to moderate positive correlation
Answers to Homework Problems in Chapters 11 and 3

Chapter 11
21. H0: The type of flight and type of ticket of persons traveling for business are independent. H1: The type of flight and type of ticket of persons traveling for business are not independent (which is the same thing as saying they are related). χ² = Σ (f - e)²/e = 100.43. [The e’s are 35.59, 15.41, 150.73, 65.27, 455.68, and 197.32] p = P(χ² ≥ 100.43); p < .005 [Look at chi-square table with df = (3-1)(2-1) = 2.] Very strong evidence that the type of flight and type of ticket of persons traveling for business are not independent, i.e., are related.
22. H0: Brand loyalty and manufacturer are independent. H1: Brand loyalty and manufacturer are not independent (which is the same thing as saying they are related). χ² = Σ (f - e)²/e = 7.36 [The e’s are 109.53, 66.13, 72.33, 155.47, 93.87, and 102.67] p = P(χ² ≥ 7.36) ≈ .025 [Look at chi-square table with df = (2-1)(3-1) = 2.] Evidence that brand loyalty and manufacturer are not independent, i.e., are related.
40. H0: Part quality and production shift are independent. H1: Part quality and production shift are not independent (which is the same thing as saying they are related). χ² = Σ (f - e)²/e = 8.11. [The e’s are 368.44, 31.56, 276.33, 23.67, 184.22, and 15.78] p = P(χ² ≥ 8.11); .01 < p < .025 [Look at chi-square table with df = (3-1)(2-1) = 2.] Evidence that part quality and production shift are not independent, i.e., are related.
Chapter 3
50. r = -.91. For this sample of data on 10 midsize automobiles, there is a strong inverse linear association between the driving speed and mileage (in miles per gallon).
Figure. Scatter diagram: Data on Sample of 10 midsize automobiles. Horizontal axis: speed; vertical axis: mpg.
52. r = .92. For this sample of 10 weeks, there is a strong direct linear association between the closing price for the DJIA and the S&P 500.
Figure. Scatter diagram: Closing Prices for Sample of Weeks in Feb, March, and April of 2000. Horizontal axis: DJIA; vertical axis: S&P 500.
69b. r = .93 which indicates that, for this sample of 20 U.S. cities, there is a strong direct linear association between typical (median?) household income and typical (median?) home price.
Figure. Scatter diagram: Data on Sample of 20 U.S. Cities (from Places Rated Almanac, 2000). Horizontal axis: Household Income (in $1000s); vertical axis: Home Price (in $1000s).
Regression Homework Assignment (followed by answer key)
1. A clothes manufacturer wanted to assess the relationship between the annual maintenance cost (Y, in dollars) and age (X, in years) of a particular variety of sewing machine. For a random sample of 14 machines of this type, maintenance records were examined to determine, for each machine, its maintenance cost last year and its age at the beginning of last year. The data is given below:
Age   Cost
 8    118
 3     55
 1     21
 9    135
 5     75
 7    104
 5     83
 2     40
 1     29
 3     48
 6     85
 2     33
 6     95
 8    130

note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the points on that line.
(c) Interpret, in the context of the problem, the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal variances assumptions appear met.
(e) Interpret, in the context of the problem, each of SST, SSR, SSE, and r².
(f) Test for a linear relationship between annual maintenance cost and age.
(g) Get a point estimate for the mean maintenance cost of all 7 year old machines.
(h) Get a point estimate for the annual maintenance cost of a 3 year old machine.
(i) Interpret, in the context of the problem, the sample regression coefficient of age (which is the slope of the least squares line).
(j) Suggest two additional independent variables that could aid in predicting the annual maintenance cost of a sewing machine.
Homework Problem #1. Sample data and Excel-generated output.
Age   Cost
 8    118
 3     55
 1     21
 9    135
 5     75
 7    104
 5     83
 2     40
 1     29
 3     48
 6     85
 2     33
 6     95
 8    130

Figure. Scatter diagram of the sample data. Horizontal axis: Age (in years); vertical axis: Cost (in dollars).
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9923999
R Square             0.9848576
Adjusted R Square    0.9835957
Standard Error       4.9002087
Observations         14

ANOVA
              df    SS          MS          F           Significance F
Regression     1    18740.78    18740.78    780.4743    2.74E-12
Residual      12    288.1445    24.01205
Total         13    19028.93

             Coefficients    Standard Error    t Stat      P-value     Lower 95%    Upper 95%
Intercept    9.4955752       2.687911          3.532698    0.004126    3.639121     15.35203
Age          13.910029       0.497908          27.93697    2.74E-12    12.82518     14.99488
RESIDUAL OUTPUT

Observation    Predicted Cost    Residuals
1              120.77581         -2.775811
2              51.225664          3.774336
3              23.405605         -2.405605
4              134.68584          0.314159
5              79.045723         -4.045723
6              106.86578         -2.865782
7              79.045723          3.954277
8              37.315634          2.684366
9              23.405605          5.594395
10             51.225664         -3.225664
11             92.955752         -7.955752
12             37.315634         -4.315634
13             92.955752          2.044248
14             120.77581          9.224189

Figure. Residual Plot. Horizontal axis: Age (0 to 10); vertical axis: Residual, with gridlines at 0, ±4.9, ±9.8, and ±14.7.
2. The owner of a large firm that manufactures furniture wishes to assess the relationship in the U.S. between annual national expenditures on furniture (Y) and national personal disposable income (X). Below are data, for 10 years and in billions of dollars, from the Economic Report of the President. [source: Kenkel (1996)]
Expenditures on Furniture    PDI
20                           350
18                           364
22                           385
24                           404
26                           438
30                           473
30                           511
30                           546
38                           591
40                           634

note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation.
(b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the points on that line.
(c) Interpret, in the context of the problem, the standard error of the estimate.
(d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal variances assumptions appear met.
(e) Interpret, in the context of the problem, each of SST, SSR, SSE, and r².
(f) Test for a linear relationship between furniture expenditures and PDI.
(g) Get a point estimate for the mean furniture expenditure over years with a PDI of 425 billion dollars.
(h) Get a point estimate for the furniture expenditure in a single year for which the PDI is anticipated to be 500 billion dollars.
(i) Interpret, in the context of the problem, the sample regression coefficient of PDI, i.e., the slope of the least squares line.
(j) Suggest two additional independent variables that could aid in predicting furniture expenditures.
Homework Problem #2. Sample data and Excel-generated output.
PDI    Expend
350    20
364    18
385    22
404    24
438    26
473    30
511    30
546    30
591    38
634    40

Figure. Scatter diagram of the sample data. Horizontal axis: PDI (in $ billions); vertical axis: Expenditures on Furniture (in $ billions).
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.9741882
R Square             0.9490426
Adjusted R Square    0.9426729
Standard Error       1.7405227
Observations         10

ANOVA
              df    SS          MS          F           Significance F
Regression     1    451.3646    451.3646    148.9938    1.88E-06
Residual       8    24.23535    3.029419
Total          9    475.6

             Coefficients    Standard Error    t Stat       P-value     Lower 95%    Upper 95%
Intercept    -5.977543       2.821428          -2.118623    0.066969    -12.48377    0.528687
PDI          0.0719283       0.005893          12.2063      1.88E-06    0.05834      0.085517
RESIDUAL OUTPUT

Observation    Predicted Expend    Residuals
1              19.197372            0.802628
2              20.204369           -2.204369
3              21.714863            0.285137
4              23.081502            0.918498
5              25.527065            0.472935
6              28.044556            1.955444
7              30.777833           -0.777833
8              33.295324           -3.295324
9              36.532099            1.467901
10             39.625017            0.374983

Figure. Residual Plot. Horizontal axis: PDI (0 to 800); vertical axis: Residual, with gridlines at 0, ±1.74, ±3.48, and ±5.22.
3. “The Jones Rustproofing Company operates a chain of outlets in Chicago. The company rustproofs automobiles. Management believes that the number of customers [Y] in a quarter of the year can be predicted relatively accurately by using a linear regression model in which the explanatory variable is the number of new automobile registrations [X] in Chicago in the previous quarter. The following data show the number of customers in hundreds during the last eight quarters and the number of new car registrations in thousands for each previous quarter.” (Kenkel, 1996)
# Customers    # New autos registered
7.1            14.4
8.2            17.1
6.3            11.9
9.1            20.2
8.7            17.0
6.4            14.0
5.2            11.1
8.1            15.2

note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation. (b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the points on that line. (c) Interpret—in the context of the problem—the standard error of the estimate. (d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal variances assumptions appear met.
(e) Interpret, in the context of the problem, each of SST, SSR, SSE, and r². (f) Test for a linear relationship between number of customers and number of new autos registered.
(g) Get a point estimate for the mean number of customers across all quarters where 18,500 new autos were registered the previous quarter. (h) Get a point estimate for the number of customers next quarter if there were 16,800 new autos registered this quarter. (i) Interpret, in the context of the problem, the sample regression coefficient of number of new autos registered, i.e., the slope of the least squares line. (j) Suggest two additional independent variables which could aid in predicting the number of customers.
NewAutoRegs    Customers
14.4           7.1
17.1           8.2
11.9           6.3
20.2           9.1
17             8.7
14             6.4
11.1           5.2
15.2           8.1
[Scatter plot: number of customers (x100) during quarter versus new auto registrations (x1000) previous quarter]
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.9400942
R Square             0.88377711
Adjusted R Square    0.86440663
Standard Error       0.49888523
Observations         8

ANOVA
              df    SS          MS          F           Significance F
Regression    1     11.35543    11.35543    45.62494    0.000514
Residual      6     1.493319    0.248886
Total         7     12.84875

               Coefficients   Standard Error   t Stat      P-value     Lower 95%    Upper 95%
Intercept      0.8973018      0.976908         0.918512    0.393776    -1.493107    3.287711
NewAutoRegs    0.42945894     0.06358          6.754624    0.000514    0.273884     0.585034
RESIDUAL OUTPUT
Observation    Predicted Customers    Residuals
1              7.08151051             0.018489
2              8.24104964             -0.04105
3              6.00786316             0.292137
4              9.57237235             -0.472372
5              8.19810375             0.501896
6              6.90972693             -0.509727
7              5.66429601             -0.464296
8              7.42507766             0.674922
[Residual plot: residuals versus NewAutoRegs]
4. To assess the effect of an organic fertilizer on tomato yield, differing amounts of organic fertilizer were applied to 10 similar plots of land. The same number and variety of tomato seedlings were grown on each plot under similar growing conditions. For each plot the amount of fertilizer (in pounds) and yield (in pounds) of tomatoes throughout the growing season are given below:
Fertilizer    Yield
0             6
0             8
10            11
10            14
20            18
20            23
30            25
30            28
40            30
40            34
note: Refer as needed to the Excel-generated output on the next page.
(a) Write down the sample regression equation. (b) Graph the sample regression equation (least squares line) on the scatter diagram, and label two of the points on that line. (c) Interpret, in the context of the problem, the standard error of the estimate. (d) Assess, based on the residual plot provided by Excel, whether the linearity, normality, and equal variances assumptions appear met. (e) Interpret, in the context of the problem, each of SST, SSR, SSE, and r². (f) Test for a linear relationship between yield and amount of fertilizer. (g) Get a point estimate for the yield of a plot where 35 pounds of fertilizer is to be used. (h) Get a point estimate for the mean yield across all plots where 15 pounds of fertilizer is to be used. (i) Interpret, in the context of the problem, the sample regression coefficient of fertilizer (which is the slope of the least squares line).
Homework Problem #4. Sample data and Excel-generated output.
Fertilizer    Yield
0             6
0             8
10            11
10            14
20            18
20            23
30            25
30            28
40            30
40            34
[Scatter plot: yield (in pounds) versus fertilizer (in pounds)]
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.97935605
R Square             0.95913827
Adjusted R Square    0.95403056
Standard Error       2.08865986
Observations         10

ANOVA
              df    SS       MS        F             Significance F
Regression    1     819.2    819.2     187.782235    7.75E-07
Residual      8     34.9     4.3625
Total         9     854.1

              Coefficients   Standard Error   t Stat      P-value        Lower 95%    Upper 95%
Intercept     6.9            1.14400612       6.031436    0.000312279    4.261915     9.538085
Fertilizer    0.64           0.04670385       13.70337    7.75086E-07    0.532301     0.747699
RESIDUAL OUTPUT
Observation    Predicted Yield    Residuals
1              6.9                -0.9
2              6.9                1.1
3              13.3               -2.3
4              13.3               0.7
5              19.7               -1.7
6              19.7               3.3
7              26.1               -1.1
8              26.1               1.9
9              32.5               -2.5
10             32.5               1.5
[Residual plot: residuals versus Fertilizer]
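The ANOVA quantities in the Excel output for Homework Problem #4 (SST, SSR, SSE, r², the standard error of the estimate, and F = MSR/MSE) can be recomputed from the ten data pairs. The sketch below is plain Python and assumes only the Fertilizer and Yield values listed in the sample data; the variable names are illustrative.

```python
import math

# Homework Problem #4 sample data (both variables in pounds).
fertilizer = [0, 0, 10, 10, 20, 20, 30, 30, 40, 40]
yield_lb = [6, 8, 11, 14, 18, 23, 25, 28, 30, 34]

n = len(fertilizer)
x_bar = sum(fertilizer) / n
y_bar = sum(yield_lb) / n

# Least squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fertilizer, yield_lb))
sxx = sum((x - x_bar) ** 2 for x in fertilizer)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in fertilizer]

# Decomposition of the total variation: SST = SSR + SSE.
sst = sum((y - y_bar) ** 2 for y in yield_lb)            # total variation
sse = sum((y - f) ** 2 for y, f in zip(yield_lb, fitted))  # unexplained variation
ssr = sst - sse                                          # explained variation

r_squared = ssr / sst
msr = ssr / 1            # one independent variable, so 1 regression df
mse = sse / (n - 2)      # n - 2 residual df in simple linear regression
std_error = math.sqrt(mse)
f_stat = msr / mse

print(f"SST={sst:.1f} SSR={ssr:.1f} SSE={sse:.1f}")
print(f"r^2={r_squared:.4f} s_e={std_error:.4f} F={f_stat:.3f}")
```

The p-value for F still has to come from an F table or from software; this sketch stops at the test statistic.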
Answers to Regression Homework note: Some of our answers may differ due to rounding.
1. (a) The sample regression equation is est(Cost) = 9.4956 + 13.9100(Age). (b) Two of the many points belonging to the line are (3,51.23) and (8,120.78). (c) Across all machines of the same age, the standard deviation in the annual maintenance cost is estimated to be $4.90. (d) The linearity, normality, and equal variances assumptions appear met. (e) SST represents the total variation in the maintenance costs. SSR represents the variation in the maintenance costs explained by the sample regression equation. SSE represents the variation in the maintenance costs not explained by the sample regression equation. Interpretation of r² = .985: 98.5% of the variation in the maintenance costs is explained by the sample regression equation.
(f) H0: annual maintenance cost is not linearly related to age (or β1 = 0)
H1: annual maintenance cost is linearly related to age (or β1 ≠ 0) F = MSR/MSE = 780.4743 p = P(F ≥ 780.4743) ≈ 0 Very strong evidence that annual maintenance cost is linearly related to age. (g) point estimate is 9.4956 + 13.9100(7) = $106.87 (h) point estimate is 9.4956 + 13.9100(3) = $51.23 (i) We estimate that as machine age increases by 1 year, the mean annual maintenance cost increases by $13.91. (j) Hours of usage; percentage of time used on heavy fabric.
2. (a) The sample regression equation is est(Expend) = -5.9775 + .0719(PDI). (b) Two of the many points belonging to the line are (400,22.78) and (600,37.16). (c) We estimate that across all years with the same PDI level, the standard deviation in national expenditures on furniture is 1.74 billion dollars. (d) Based on the residual plot, the linearity, normality, and equal variances assumptions appear met. (e) SST represents the total variation in the annual national furniture expenditures. SSR represents the variation in the annual national furniture expenditures explained by the sample regression equation. SSE represents the variation in the annual national furniture expenditures not explained by the sample regression equation. Interpretation of r² = .949: 94.9% of the variation in the annual national expenditures on furniture is explained by the sample regression equation.
(f) H0: annual national expenditure on furniture is not linearly related to national PDI (or β1 = 0)
H1: annual national expenditure on furniture is linearly related to national PDI (or β1 ≠ 0) F = MSR/MSE = 148.9938 p = P(F ≥ 148.9938) ≈ 0 Very strong evidence that annual national expenditure on furniture is linearly related to national PDI. (g) point estimate is -5.9775 + .0719(425) = 24.6 billion dollars (h) point estimate is -5.9775 + .0719(500) = 30.0 billion dollars (i) We estimate that as PDI increases by 1 billion dollars, the mean expenditure on furniture increases by .072 billion dollars (or 72 million dollars). (j) consumer price index (at the beginning of the year); level of consumer confidence (at the beginning of the year)
3. (a) The sample regression equation is est(Customers) = .8973 + .4295(NewAutoRegs). (b) Two of the many points belonging to the line are (6.0,3.47) and (9.0,4.76). (c) We estimate that across all quarters having the same number of new auto registrations the previous quarter, the standard deviation in the number of customers is 50 (.50 hundred) customers. (d) Based on the residual plot, the linearity, normality, and equal variances assumptions appear met. (e) SST represents the total variation in the numbers of customers. SSR represents the variation in the number of customers explained by the sample regression equation. SSE represents the variation in the number of customers not explained by the sample regression equation. Interpretation of r² = .884: 88.4% of the variation in the number of customers is explained by the sample regression equation.
(f) H0: quarterly number of customers is not linearly related to the number of new auto registrations the previous quarter (or β1 = 0)
H1: quarterly number of customers is linearly related to the number of new auto registrations the previous quarter (or β1 ≠ 0) F = MSR/MSE = 45.6249 p = P(F ≥ 45.6249) = .0005 Very strong evidence that the quarterly number of customers is linearly related to the number of new auto registrations the previous quarter. (g) point estimate is .8973 + .4295(18.5) = 8.8 hundred or 880 customers (h) point estimate is .8973 + .4295(16.8) = 8.1 hundred or 810 customers (i) We estimate that as the number of new auto registrations the previous quarter increases by 1 thousand, the mean number of customers in a quarter increases by .43 hundred (i.e., 43) customers. (j) mean dollar value of new autos registered the previous quarter; whether or not it is a fourth or first quarter (and thus overlaps with winter)
4. (a) The sample regression equation is est(Yield) = 6.9 + .64(Fertilizer). (b) Two points belonging to the equation/line are (0,6.9) and (40,32.5); if you plot those two points (or any other two points belonging to the line) and draw a line through them, you will have a graph of the least squares line. (c) Across all plots with the same amount of fertilizer, the standard deviation in yield is estimated to be 2.1 pounds. (d) All three assumptions appear met. (e) SST: total variation in yield; SSR: variation in yield explained by the sample regression equation; SSE: variation in yield not explained by the sample regression equation. Interpretation of r² = .959: ≈ 96% of the variation in yield is explained by the sample regression equation.
(f) H0: yield is not linearly related to amount of fertilizer (or β1 = 0)
H1: yield is linearly related to amount of fertilizer (or β1 ≠ 0) F = MSR/MSE = 187.782 p = P(F ≥ 187.782) = .0000007751 ≈ 0 Very strong evidence that yield is linearly related to amount of fertilizer. (g) point estimate is 6.9 + .64(35) = 29.3 pounds (h) point estimate is 6.9 + .64(15) = 16.5 pounds (i) We estimate that as the amount of organic fertilizer increases by 1 pound, the mean yield increases by .64 pounds.
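The point estimates in the answers above come from plugging a value of X into the sample regression equation, with care taken over units. As an illustration, the sketch below redoes answers 3(g) and 3(h) in Python, assuming the rounded coefficients from answer 3(a); the function name `estimate_customers` is hypothetical.

```python
# Rounded coefficients from answer 3(a).
# In that equation Customers is measured in hundreds and NewAutoRegs in thousands.
INTERCEPT = 0.8973
SLOPE = 0.4295


def estimate_customers(new_auto_registrations):
    """Point estimate of the number of customers (a raw count) given the raw
    number of new auto registrations in the previous quarter."""
    x_thousands = new_auto_registrations / 1000   # convert to the units of X
    y_hundreds = INTERCEPT + SLOPE * x_thousands  # sample regression equation
    return y_hundreds * 100                       # convert back to a raw count


est_g = estimate_customers(18_500)  # answer 3(g)
est_h = estimate_customers(16_800)  # answer 3(h)
print(round(est_g), round(est_h))
```

The results are about 884 and 811 customers, which agree with the 880 and 810 reported in the answers once those are recognized as rounded to the nearest tenth of a hundred.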
Unit Three Formula Sheet
Chi-square test of independence (p-value approach):
Hypotheses: H0: X and Y are independent
H1: X and Y are related
Test statistic: $\chi^2 = \sum \frac{(f - e)^2}{e}$
note: f denotes the observed frequencies (the counts within the cross-classification, or contingency, table) and e denotes the expected [should H0 be true] frequencies. For each “cell” of the table, e = (row total)(column total)/n, where n is the grand total.
p-value: p = P($\chi^2 \ge \chi^2_{\text{calculated}}$ GIVEN H0 is true). Reference the chi-square distribution with (r – 1)(c – 1) df, where r and c are the numbers of rows and columns of the table.
Conclusion:
p < .005          very strong evidence that H1 is true
.005 ≤ p < .01    strong evidence that H1 is true
.01 ≤ p < .05     evidence that H1 is true
.05 ≤ p < .10     marginal evidence that H1 is true
p ≥ .10           H0 may be true
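To make the chi-square procedure concrete, here is a small Python sketch. The 2-by-3 table of counts is hypothetical (it is not data from this packet), chosen so that the test has 2 degrees of freedom. For 2 df only, the upper-tail area of the chi-square distribution has the simple closed form exp(-x/2); for other df a chi-square table or software would be needed.

```python
import math

# Hypothetical observed frequencies (NOT from the packet):
# rows = two categories of X, columns = three categories of Y.
observed = [
    [30, 20, 50],
    [20, 40, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # grand total

# Expected frequency for each cell: (row total)(column total)/n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Test statistic: sum over all cells of (f - e)^2 / e.
chi_square = sum(
    (f - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for f, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)

# The closed form below is valid only when df == 2.
if df != 2:
    raise ValueError("this shortcut for the p-value requires exactly 2 df")
p_value = math.exp(-chi_square / 2)

print(f"chi-square = {chi_square:.4f}, df = {df}, p = {p_value:.4f}")
```

With these made-up counts the p-value falls between .005 and .01, which the conclusion scale above reads as strong evidence that the two variables are related.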
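Covariance and Correlation Coefficient
Sample: $s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ and $r = \frac{s_{xy}}{s_x s_y}$
Population: $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$ and $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$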
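As a worked example of the sample formulas, the following Python sketch computes the sample covariance and correlation coefficient for the Homework Problem #3 data (new auto registrations in thousands, customers in hundreds). It uses only the standard library; the variable names are illustrative.

```python
import math

# Homework Problem #3 sample data.
new_auto_regs = [14.4, 17.1, 11.9, 20.2, 17.0, 14.0, 11.1, 15.2]  # thousands
customers = [7.1, 8.2, 6.3, 9.1, 8.7, 6.4, 5.2, 8.1]              # hundreds

n = len(new_auto_regs)
x_bar = sum(new_auto_regs) / n
y_bar = sum(customers) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(new_auto_regs, customers)
) / (n - 1)

# Sample standard deviations (also with n - 1 in the denominator).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in new_auto_regs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in customers) / (n - 1))

# Sample correlation coefficient.
r = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.4f}, r = {r:.4f}, r^2 = {r * r:.4f}")
```

Because the slope in Problem #3 is positive, r should match the Multiple R value in the Excel output, and its square should match R Square.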
Simple Linear Regression Analysis:
Testing for a linear relationship between Y and X (p-value approach):
Hypotheses: H0: Y is not linearly related to X (or β1 = 0)
H1: Y is linearly related to X (or β1 ≠ 0) Test statistic: F = MSR/MSE
p-value: p = P(F ≥ Fcalculated GIVEN H0 is true) Conclusion: see above
For your information (not responsible for):
Confidence Interval for the mean value of Y across all entities with X = x*
$\text{est}(y) \pm t\,(s_e)\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum x^2 - n\bar{x}^2}}$ note: reference the t-distribution with n-2 df
Confidence Interval for the value of Y for a single entity with X = x*
$\text{est}(y) \pm t\,(s_e)\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum x^2 - n\bar{x}^2}}$ note: reference the t-distribution with n-2 df
note: In any hypothesis testing situation, reject H0 at significance level α if and only if the p-value is < α.
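For readers curious about the two interval formulas, here is a hedged Python sketch applying them to the Homework Problem #4 data. It assumes 95% confidence and takes the critical value t = 2.306 for 8 df from a standard t table; that value and the helper name `intervals` are assumptions made for this illustration, not something the packet supplies.

```python
import math

# Homework Problem #4 sample data (both variables in pounds).
fertilizer = [0, 0, 10, 10, 20, 20, 30, 30, 40, 40]
yield_lb = [6, 8, 11, 14, 18, 23, 25, 28, 30, 34]

# Assumed: 95% confidence, n - 2 = 8 df, two-sided t table value.
T_CRIT = 2.306

n = len(fertilizer)
x_bar = sum(fertilizer) / n
y_bar = sum(yield_lb) / n

# Denominator used in both formulas: sum of x^2 minus n times x_bar^2.
sxx = sum(x * x for x in fertilizer) - n * x_bar ** 2
sxy = sum(x * y for x, y in zip(fertilizer, yield_lb)) - n * x_bar * y_bar
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Standard error of the estimate, s_e = sqrt(SSE / (n - 2)).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(fertilizer, yield_lb))
s_e = math.sqrt(sse / (n - 2))


def intervals(x_star):
    """Return (point estimate, half-width for the mean of Y,
    half-width for a single value of Y) at X = x_star."""
    est = intercept + slope * x_star
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    half_mean = T_CRIT * s_e * math.sqrt(core)
    half_single = T_CRIT * s_e * math.sqrt(1 + core)
    return est, half_mean, half_single


est15, half_mean15, _ = intervals(15)      # mean yield at 15 pounds
est35, _, half_single35 = intervals(35)    # single plot at 35 pounds
print(f"mean yield at 15 lb: {est15:.1f} +/- {half_mean15:.2f}")
print(f"single plot at 35 lb: {est35:.1f} +/- {half_single35:.2f}")
```

The interval for a single plot is noticeably wider than the interval for a mean, because of the extra 1 under the square root.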
