CHAPTER SIXTEEN

Regression

NOTE TO INSTRUCTORS

This chapter includes a number of complex concepts that may seem intimidating to students. Encourage students to focus on the big picture through some of the discussion questions and classroom activities. You can ease students’ concerns about multiple regression by describing it as similar to simple linear regression except that researchers examine multiple variables rather than only one.

OUTLINE OF RESOURCES

I. Simple Linear Regression
• Discussion Question 16-1
• Discussion Question 16-2
• Classroom Activity 16-1: Make It Your Own
• Discussion Question 16-3
• Discussion Question 16-4
• Discussion Question 16-5
• Classroom Activity 16-2: Finding the Regression Line

II. Interpretation and Prediction
• Classroom Activity 16-3: Make It Your Own
• Discussion Question 16-6
• Discussion Question 16-7

III. Multiple Regression
• Discussion Question 16-8
• Discussion Question 16-9
• Additional Readings
• Online Resources

IV. Next Steps: Structural Equation Modeling (SEM)
• Classroom Activity 16-4: Careers in Prediction
• Classroom Activity 16-5: SEM in Context

V. Handouts
• Handout 16-1: Finding the Regression Line
• Handout 16-2: Careers in Prediction
• Handout 16-3: Examining SEM in Context

CHAPTER GUIDE

I. Simple Linear Regression

1. Simple linear regression is a statistical tool that enables us to predict an individual’s score on the dependent variable from his or her score on one independent variable.
2. Regression allows us to make quantitative predictions that more precisely explain relations among variables.

> Discussion Question 16-1
What is simple linear regression, and why is it useful?
Your students’ answers should include:

• Simple linear regression is a tool that allows us to make predictions.

• Simple linear regression is useful as an extension of correlation that allows us to quantify the relation among variables with greater precision and accuracy.

3. Because simple linear regression helps us to find the equation for a line, we must have data that are linearly related to use it.
4. We can use z scores when making these predictions. Specifically, the formula is ẑY = (rXY)(zX). The first z score is for the dependent variable and the second z score is for the independent variable. The ^ symbol over the z signals that the z score is predicted rather than being the actual score.
5. The tendency for scores that are particularly high or low to drift toward the mean over time is known as regression to the mean.
6. Usually, we want to predict a raw score from a raw score. We will first need to convert a raw score on one variable to a z score. We can then predict a z score for the second variable. Finally, we convert the predicted z score for the second variable to a raw score.

> Discussion Question 16-2
How would we predict a raw score from a raw score?
Your students’ answers should include:

• In order to predict a raw score from a raw score, we must first transform one raw score into a z score. Then we multiply that z score by the correlation coefficient to get the predicted z score for the second variable. Finally, we transform that z score back into a raw score, which is our prediction. A worked sketch of these three steps appears below.
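A minimal sketch of the three-step prediction in Python; the correlation, means, and standard deviations below are hypothetical values chosen for illustration, not figures from the text:

```python
# Predicting a raw score on Y from a raw score on X, via z scores.
# All of these summary statistics are made up for illustration.
r_xy = 0.60              # hypothetical correlation between X and Y
m_x, sd_x = 70.0, 10.0   # hypothetical mean and SD of X
m_y, sd_y = 3.0, 0.5     # hypothetical mean and SD of Y

x = 85.0                 # the raw score we are predicting from

z_x = (x - m_x) / sd_x        # Step 1: convert raw X to a z score
z_y_hat = r_xy * z_x          # Step 2: predicted z for Y (z-hat_Y = r_XY * z_X)
y_hat = z_y_hat * sd_y + m_y  # Step 3: convert predicted z back to a raw Y

print(f"z_X = {z_x:.2f}, predicted z_Y = {z_y_hat:.2f}, predicted Y = {y_hat:.2f}")
```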

Classroom Activity 16-1
Make It Your Own

• Use your students’ weight and height as measures for this exercise, or use height and age if you think using weight would be a sensitive issue.

• Have your students anonymously submit their weight and height or height and age.
• Load these data into SPSS and run the analysis as a correlation and simple regression.

7. The intercept is the predicted value for Y when X is equal to 0, which is the point at which the line crosses, or intercepts, the y-axis.
8. The slope is the amount that Y is predicted to increase for an increase of 1 in X.

> Discussion Question 16-3
What is the difference between the intercept and the slope? Why do we calculate them in simple linear regression?
Your students’ answers should include:

• The difference between the intercept and the slope is that the intercept is the predicted value for Y when X is equal to 0 and the slope is the amount that Y is predicted to increase for an increase of 1 unit in X.

• We calculate them in simple linear regression because they allow us to develop a raw-score regression equation for predicting the raw score for Y.

9. Both the intercept and the slope are needed to calculate the equation of a line: Ŷ = a + b(X).
10. To calculate the intercept, we calculate the z score for X when X = 0 by using the formula zX = (X – MX)/SDX. We then use the z-score regression equation, ẑY = (rXY)(zX), to calculate the predicted z score for Y. We then convert this z score to the predicted raw score for Y using the formula Ŷ = ẑY(SDY) + MY.
11. To calculate the slope, we repeat the previous steps that we used to calculate the intercept but use an X of 1 rather than 0. We then determine the change in Ŷ as X increased from 0 to 1. It is important to include the appropriate sign based on whether there is an increase or decrease in Ŷ.

> Discussion Question 16-4
How do we calculate the slope of the regression line? How is it different from calculating the intercept?
Your students’ answers should include:

• We calculate the slope of the regression line by first calculating a z score for X when X = 1 by using the formula zX = (X – MX)/SDX. We then use the z-score regression equation, ẑY = (rXY)(zX), to calculate the predicted z score for Y. We then convert this z score to the predicted raw score for Y using the formula Ŷ = ẑY(SDY) + MY.

• Calculating the slope of the regression line is different from calculating the intercept of the regression line because, for the slope, we use an X of 1 rather than an X of 0.

12. With both the intercept and slope calculated, we can now use our formula to predict the raw score for Y.
13. If we find at least three other predicted values for Y, we can use these values to draw the regression line. This is also known as the line of best fit.
14. A negative slope means that the regression line starts in the upper left of the graph and ends in the lower right. A positive slope means that the regression line starts in the lower left of the graph and ends in the upper right.

> Discussion Question 16-5
How can you tell whether a slope is positive or negative?
Your students’ answers should include:

• You can tell whether a slope is positive or negative by first drawing a regression line through the dots on a graph corresponding to pairs of scores for X and Ŷ. A negative slope means that the line looks like it’s going downhill as we move from left to right, while a positive slope means that the line looks like it’s going uphill as we move from left to right. The sketch below works through the intercept and slope calculations.
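A minimal sketch, in Python, of finding the intercept and slope by running X = 0 and X = 1 through the z-score steps from points 10 and 11; the summary statistics are again hypothetical:

```python
# Building the raw-score regression equation Y-hat = a + b(X) with the
# z-score method: plug in X = 0 for the intercept, then X = 1 for the slope.
# Summary statistics are hypothetical.
r_xy = 0.60
m_x, sd_x = 70.0, 10.0
m_y, sd_y = 3.0, 0.5

def predict_y(x):
    """Predict a raw Y score from a raw X score via z scores."""
    z_x = (x - m_x) / sd_x
    return (r_xy * z_x) * sd_y + m_y

a = predict_y(0)        # intercept: predicted Y when X = 0
b = predict_y(1) - a    # slope: change in predicted Y as X goes from 0 to 1

# The sign of b tells you whether the line runs uphill (+) or downhill (-).
print(f"Y-hat = {a:.2f} + {b:.2f}(X)")
```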

Classroom Activity 16-2
Finding the Regression Line

• Have students use data created from Classroom Activity 16-3, “Creating Correlations,” from the previous chapter.

• Have students use the data collected to determine the regression line. Handout 16-1, found at the end of this chapter, can be used to aid in this process.

15. The standardized regression coefficient (also known as the beta weight) is a standardized version of the slope in a regression equation: the predicted change in the dependent variable, in terms of standard deviations, for a 1 standard deviation increase in the independent variable.
16. The standardized regression coefficient is symbolized by β and pronounced “beta.” It is calculated using the formula β = (b)√(SSX/SSY). In simple linear regression, β equals the correlation coefficient, as the sketch below confirms.
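A minimal sketch, with made-up data, that computes β from the raw-score slope; every number here is invented for illustration:

```python
import numpy as np

# Standardized regression coefficient: beta = b * sqrt(SS_X / SS_Y).
# The five (X, Y) pairs are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

ss_x = np.sum((x - x.mean()) ** 2)
ss_y = np.sum((y - y.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / ss_x  # raw-score slope

beta = b * np.sqrt(ss_x / ss_y)
r = np.corrcoef(x, y)[0, 1]
print(f"beta = {beta:.3f}, r = {r:.3f}")  # identical in simple regression
```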

II. Interpretation and Prediction

1. The number that best describes how far away, on average, the data points are from the line of best fit is called the standard error of the estimate. In other words, it is a statistic indicating the typical distance between a regression line and the actual data points. (A sketch computing it follows Classroom Activity 16-3.)

Classroom Activity 16-3
Make It Your Own

In this activity, use SAT scores and overall GPAs to demonstrate simple regression.

• Again, anonymously collect the data from the students.
• Have the students frame the research question for a correlation and for a simple regression.

• After running the analysis, have the students discuss the results. It is likely that your data will suffer from a restricted range, but that is a good point for class discussion because real data are messy.
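A minimal sketch of the standard error of the estimate, with made-up data; note that textbooks differ on whether the denominator is N or N − 2, so check the formula your text uses:

```python
import numpy as np

# Standard error of the estimate: the typical distance between the
# regression line and the actual data points. Data are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)   # actual minus predicted scores

# Texts differ on the denominator (N vs. N - 2); this uses N - 2.
see = np.sqrt(np.sum(residuals ** 2) / (len(y) - 2))
print(f"standard error of the estimate = {see:.3f}")
```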

2. The proportionate reduction in error is a statistic that quantifies how much more accurate our predictions are when we use the regression line instead of the mean as a prediction tool.

> Discussion Question 16-6
Why do you think that we would use the mean as a basis of comparison with the regression line? Why would we use the mean instead of some other number from our sample?
Your students’ answers should include:

• We use the mean as a basis of comparison with the regression line because, with limited information, the mean is a fair predictor. By using the mean, we can calculate the coefficient of determination and measure how accurate our predictions are in using the regression line compared to the mean.

• We use the mean instead of some other number from a sample because the mean is involved in calculating the regression equation and, as a result, we can quantify the improvement in prediction that results from using the regression line over the mean.

3. If we were to subtract the mean score of the sample from each person’s score, square that value, and sum all of the values, we would obtain the sum of squared errors around the mean, or the sum of squares total (SStotal). This is the error that results if we were to predict the mean as the score for each person.
4. We want our regression equation to be a substantial improvement over just using the mean as our prediction.
5. To determine how much better our regression equation predicts than the mean, we plug each X value into the regression equation.
6. To find the sum of squared errors, or SSerror, we subtract each person’s predicted score from his or her actual score, square these errors, and sum them.
7. To find the amount of error we’ve reduced, we subtract the sum of squared errors from the sum of squares total. This number is divided by the sum of squares total to obtain a proportion.
8. The proportionate reduction in error is symbolized as r² and is calculated using the formula r² = (SStotal – SSerror)/SStotal.

> Discussion Question 16-7

What is the difference between the SStotal and the SSerror? What is the purpose of calculating them?
Your students’ answers should include:

• The difference between the SStotal and the SSerror is that the SStotal represents the error in prediction from the mean, whereas the SSerror represents the error from predicting Y with our regression equation.

• The purpose of calculating them is to quantify the amount of error that we’ve reduced by using the regression equation instead of the mean.

9. We could also calculate r² by squaring the correlation coefficient, as the sketch below confirms.
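A minimal sketch, with made-up data, that computes SStotal and SSerror and confirms that the proportionate reduction in error matches the squared correlation:

```python
import numpy as np

# Proportionate reduction in error: r^2 = (SS_total - SS_error) / SS_total.
# The five (X, Y) pairs are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

ss_total = np.sum((y - y.mean()) ** 2)   # error from predicting the mean

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
ss_error = np.sum((y - (a + b * x)) ** 2)  # error from the regression line

r_squared = (ss_total - ss_error) / ss_total
print(f"r^2 = {r_squared:.3f}")
print(f"r * r = {np.corrcoef(x, y)[0, 1] ** 2:.3f}")  # same value
```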

III. Multiple Regression

1. An orthogonal variable is an independent variable that makes a separate and distinct contribution in the prediction of a dependent variable, as compared to another independent variable.
2. Multiple regression is a statistical technique that includes two or more predictor variables in a prediction equation.
3. Multiple regression is more widely used than simple linear regression because most dependent variables are best explained by more than one independent variable.

> Discussion Question 16-8
Why is multiple regression an improvement over simple linear regression?
Your students’ answers should include:

• Multiple regression is an improvement over simple linear regression because it provides greater predictive accuracy by incorporating two or more predictor variables into the regression equation.

4. Compared to using averages, multiple regression represents a significant advance in our ability to predict human behavior.
5. When calculating the proportionate reduction in error for multiple regression, its symbol is R² rather than r², to indicate that the statistic is based on more than one independent variable.
6. In stepwise multiple regression, computer software determines the order in which independent variables are included in the equation.
7. Stepwise multiple regression is frequently used because it is the default in many software programs and is useful in the absence of a clear predictive theory.
8. Another approach is hierarchical multiple regression, whereby the researcher adds independent variables into the equation in an order determined by theory.
9. In order to use hierarchical multiple regression, we need to have a specific predictive theory that we are testing. A minimal sketch of a two-predictor regression follows.
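A minimal sketch of a two-predictor multiple regression fit by least squares, with made-up scores; in class, analyses like these would normally run in a statistics package such as SPSS, as in the activities above:

```python
import numpy as np

# A two-predictor multiple regression fit by least squares.
# All scores are made up for illustration.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # first predictor
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])   # second predictor
y = np.array([3.0, 4.0, 6.0, 7.0, 9.0, 10.0])   # dependent variable

# Design matrix: a column of 1s (for the intercept) plus both predictors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coefs

# R^2 (capitalized because prediction now rests on two predictors).
ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum((y - X @ coefs) ** 2)
r_squared = (ss_total - ss_error) / ss_total

print(f"Y-hat = {a:.2f} + {b1:.2f}(X1) + {b2:.2f}(X2), R^2 = {r_squared:.3f}")
```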

> Discussion Question 16-9
What is the difference between stepwise and hierarchical multiple regression? When would you want to use one technique rather than the other?
Your students’ answers should include:

• The difference between stepwise and hierarchical multiple regression is that, in stepwise regression, the computer software program determines the order of variable entry, while in hierarchical regression, the researcher determines the order of variable entry in light of theory.

• A stepwise regression can be used in the absence of theory, such as in model building, while a hierarchical regression can be used to test a specific theory.

Classroom Activity 16-4
Careers in Prediction

The chapter refers to many opportunities for using prediction within certain careers. In this activity, students will expand on this topic using Handout 16-2. The goal of this activity is for students to observe the relevance and usefulness of regression in their daily experience.

IV. Next Steps: Structural Equation Modeling (SEM)

1. Structural equation modeling (SEM) is one of several statistical techniques (and one of the most sophisticated statistical approaches) that quantifies how well sample data fit a theoretical model that hypothesizes a set of relations among multiple variables.
2. When using SEM, statisticians refer to a statistical (or theoretical) model, which is a hypothesized network of relations, often portrayed graphically, among multiple variables.
3. When creating a model that hypothesizes the relation among the factors being tested, we create paths that describe the connection between two variables in a statistical model. We can conduct a path analysis to examine a hypothesized model by conducting a series of regression analyses that quantify the paths at each succeeding step in the model. (A toy sketch of this logic follows Classroom Activity 16-5.)
4. In SEM, we refer to variables that we observe and measure as manifest variables.
5. In contrast, latent variables are ideas that we want to research but cannot directly measure. We try to observe such variables indirectly using appropriate measurement tools.
6. When encountering a model such as SEM, it is important to first figure out what variables the researcher is studying. Next, look at the numbers to see which variables are related, and at the signs of the numbers to see the direction of each relation.

Classroom Activity 16-5
SEM in Context

In this activity, students will try to understand how path analysis is used in context. To do this, students will download or be given copies of the article: Kim, Y. M., & Neff, J. A. (2010). Direct and indirect effects of parental influence upon adolescent alcohol use: A structural equation modeling analysis. Journal of Child & Adolescent Substance Abuse, 19(3), 244–260. Students will use Handout 16-3 in their analysis of the article.
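A toy sketch of the path-analysis logic from point 3: hypothesized paths X → M → Y are quantified as a series of simple regressions on simulated data. This only illustrates the idea; it is not the Kim and Neff (2010) analysis, and a full SEM requires specialized software.

```python
import numpy as np

# Toy path analysis: quantify the paths in a hypothesized model
# X -> M -> Y by running a series of simple regressions. The data are
# simulated for illustration; real SEM requires specialized software.
rng = np.random.default_rng(0)
x = rng.normal(size=200)             # manifest predictor variable
m = 0.5 * x + rng.normal(size=200)   # hypothesized intervening variable
y = 0.4 * m + rng.normal(size=200)   # manifest outcome variable

def slope(pred, out):
    """Least-squares slope of out regressed on pred."""
    dev = pred - pred.mean()
    return np.sum(dev * (out - out.mean())) / np.sum(dev ** 2)

path_xm = slope(x, m)   # quantifies the path X -> M
path_my = slope(m, y)   # quantifies the path M -> Y

# Indirect effect of X on Y through M: the product of the two paths.
print(f"X->M = {path_xm:.2f}, M->Y = {path_my:.2f}, "
      f"indirect effect = {path_xm * path_my:.2f}")
```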

Additional Readings

Harrell, F. E. (2001). Regression modeling strategies. New York: Springer. Beyond discussing regression, this book also explores when and how to use this statistic. It is geared toward graduate students and researchers.

Cohen, J., & Cohen, P. (2002). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Lawrence Erlbaum Associates. This book is data oriented and presents an excellent nonmathematical approach to data analysis. It assumes at least a graduate-level course in statistics but is also an invaluable reference for those wanting more depth in this area.

Online Resources

The following site provides simulations and demonstrations for almost all topics found in the textbook, as well as additional information about each topic: http://onlinestatbook.com/stat_sim/index.html. The “regression by eye” simulation is a good support for students as they learn to grasp regression visually.

The following is the award-winning Web Experimental Psychology Lab site, home of the “Magic” experiment: http://www.psychologie.uzh.ch/sowi/Ulf/Lab/WebExpPsyLab.html. There are a number of fun experiments that your students can explore, including ranking probability terms and learning via tutorial dialogues.

PLEASE NOTE: Due to formatting, the Handouts are only available in Adobe PDF®.