UNDERSTANDING MULTIPLE REGRESSION

From: Ethington, C. A., Thomas, S. L., & Pike, G. R. (2002). Back to the basics: Regression as it should be. In J. C. Smart (Ed.), Higher education: Handbook of theory and research, Vol. 17. New York: Algora Publishing.

Sir Francis Galton (1885) introduced the idea of “regression” to the research community in a study examining the relationship between fathers’ and sons’ heights. He observed that sons do not tend toward their fathers’ heights but instead “regress to” the mean of the population, and he thus formulated the idea of “regression toward mediocrity.” With the development of the method of least squares by Carl Friedrich Gauss (Myers, 1990), multiple regression analysis using ordinary least squares (OLS) has become one of the most common statistical techniques for investigating and modeling relationships among variables.

The two predominant uses of multiple regression are prediction and explanation, and the methodological approach taken in the analysis depends upon the purpose for which the model is estimated. Suppose, for example, that an institutional research office has been charged with determining whether a set of variables (e.g., ability, high school achievement, socioeconomic status, interests, motivation) can predict end-of-freshman-year grade point average. If the purpose is to optimize the prediction and to use the resulting equation in making admission decisions, the goal is the most parsimonious equation with the smallest errors of prediction, so that the best estimates of yield rates from admission pools can be obtained. The aim is to eliminate superfluous variables, not to test theoretically based hypotheses. Theory is not required in selecting variables for such a predictive equation, and the parameter estimates are of little importance. Of more importance are economy, availability of data, ease of obtaining needed information, and accuracy of prediction. Various approaches may be used to identify the smallest number of variables necessary to produce the most accurate prediction estimates (see Montgomery & Peck, 1992, and Myers, 1990, for presentations of the development of regression equations for prediction). Different approaches using the same data may lead to the retention of different sets of variables, but any approach that meets the needs of the researcher and produces accurate estimates is sufficient.

While the higher education research literature is replete with regression studies using the prediction terminology, few actually use the methodological approaches involved in developing the most efficient prediction equation. Almost all regression applications in higher education are for explanatory purposes, and it is this approach that we take in this chapter, for explanation is the essence of behavioral research. Almost all of our research questions seek to understand and explain why some phenomenon under study varies from person to person, and the phenomena studied in higher education are generally “multivariate” in nature. That is, the primary focus of a study (the dependent variable) is conceptualized as being related to and influenced by multiple interrelated factors (the independent variables). Rarely can simple bivariate relationships adequately capture, explain, or model reality.
The complexity of behavioral science phenomena demands that we study the covariation among the independent variables as well as the covariation of the dependent variable with the independent variables. It is this variance and covariance that is the basis of multiple regression. Our search for explanations of variability and our attempts to model reality imply that there is a theoretical or conceptual basis not only for the anticipated relationships among variables, but also for the selection of the variables studied. In his classic text on the application of multiple regression in behavioral research, Pedhazur (1982) argues that “methods per se mean little unless they are integrated within a theoretical context” (p. 3) and goes on to state that “in explanatory research data analysis is designed to shed light on theory” (p. 11). This focus on theory in the application of multiple regression in behavioral research has been argued for since the 1950s, when regression became commonly used in the social sciences. Ezekiel and Fox (1959) stress the necessity for “careful logical analysis, and the need both for good theoretical knowledge of the field in which the problem lies and for thorough technological knowledge of the elements involved in the particular problem” (p. 181). Thus, both good theory and methodological competence are required for the most effective application of multiple regression.

As noted above, the most general application of multiple regression in higher education research is to explain phenomena, and Kerlinger (1973) states that such explanations are called theories. Kerlinger defines theory as “a set of interrelated constructs (concepts), definitions, and propositions that present a systematic view of phenomena by specifying relations among variables, with the purpose of explaining and predicting the phenomena” (p. 9). Thus, the selection of variables to be used in explaining phenomena and the specific hypotheses to be tested are derived from the particular theory in which the research is grounded. The constructs and definitions within the theory lead to the operationalization of the constructs and the measurement of the variables subsequently used in testing the theoretical hypotheses. We assume, therefore, that the set of independent variables and the dependent variable used in the statistical analyses are directly specified by some underlying theory.

Using a theory as the basis for the choice of independent variables in multiple regression also has implications for how the results of the regression analysis can be discussed. The interpretation of estimated regression coefficients as indices of the effects of independent variables on an outcome can be made only within a theoretical context. Pedhazur (1982) argues that without theory as the basis for variable selection, no statement can be made about the effects or meaningfulness of variables. He further argues that with an atheoretical selection of variables, or with the use of statistical variable-selection procedures (e.g., stepwise procedures), the context of the research reverts to prediction, and in a predictive study all that can be concluded about the independent variables is which combination of them best predicts the outcome. Thus, the use of stepwise techniques in developing explanatory models is inappropriate.
Indeed, in an editorial statement in the journal Educational and Psychological Measurement, the editor, Bruce Thompson, delineated the major problems associated with stepwise procedures and suggested that studies utilizing such procedures should not be submitted to that journal (Thompson, 1995).

THE STATISTICAL MODEL

Multiple linear regression falls into a class of linear models based on the statistical model known as the General Linear Model:

Y = Xβ + ε

The specific multiple linear regression model is:

yi = β0 + β1x1i + β2x2i + . . . + βkxki + εi     (i = 1, 2, … , n)

The outcome, yi, is generated by two components. The first component defines the “best” linear relationship between the outcome (yi) and the predictors (β0 + β1x1i + β2x2i + . . . + βkxki). For any given set of values of the x’s, there is a corresponding “predicted” value of y, E[yi] = β0 + β1x1i + β2x2i + . . . + βkxki. The second component, εi, is the stochastic or random source of variation.
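The two components can be illustrated with a small simulation. The following sketch (in Python, with hypothetical parameter values and simulated predictors; none of these numbers come from the chapter) builds the systematic part first and then adds the random error to produce the observed outcome:

# A minimal sketch of the model with two predictors (hypothetical values).
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # number of observations
beta0, beta1, beta2 = 1.0, 0.5, -0.3      # hypothetical population parameters
x1 = rng.normal(size=n)                   # first predictor
x2 = rng.normal(size=n)                   # second predictor
epsilon = rng.normal(scale=1.0, size=n)   # random error: mean 0, constant variance

expected_y = beta0 + beta1 * x1 + beta2 * x2   # systematic ("predicted") component, E[y]
y = expected_y + epsilon                       # observed outcome = systematic part + error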

This error term, εi, is a random variable whose outcomes are governed by a probability distribution, and it has the following properties:

1. E(εi) = 0, the mean of the error term is always equal to 0,

2. Var(εi) = σ², the variance of the error is the same at any level of x (homoscedasticity),

3. Cov(εi, εj) = 0 for i ≠ j, the error terms for any two observations are uncorrelated (independence),

4. εi is normally distributed.

While a number of methods could be used to estimate this model, the procedure used in the original development of multiple regression analysis, and the one most commonly used now, is the method of least squares (often referred to as ordinary least squares [OLS]; it is the default method in most statistical software packages). The OLS procedure estimates the regression coefficients (β) such that the sum of the squared errors (Σεi²) is minimized. The βs are population parameters, and the estimated coefficients are denoted by b.

Mathematically, a regression coefficient measures the change in the dependent variable (y) for each one-unit increase in x, holding all other independent variables constant. A more common interpretation is that the coefficient for the variable x represents the net effect of x on y (i.e., the effect of x on y after controlling for the other independent variables). The magnitude of the coefficient depends on the metric in which the independent variable is measured. The intercept of the equation (β0) generally does not have substantive meaning because it represents the value of the dependent variable y when each of the x’s is 0.

Most software programs also calculate standardized coefficients, or betas (not to be confused with the population parameters, βs). Betas are the regression coefficients that would be produced had all variables been standardized; that is, it is as if all variables had been placed on the same metric. Because the independent variables can then be considered to have a common metric, the relative magnitudes of the betas can be compared, unlike those of the bs.

Once the coefficients are estimated, the total sum-of-squares, which constitutes the variability in the dependent measure, can be decomposed into two parts: the regression (explained) sum-of-squares and the residual (error) sum-of-squares. These sums-of-squares are the basis for the statistical tests.
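As a sketch of these quantities on simulated data (all values hypothetical), the b's can be obtained by least squares, the standardized betas recovered by rescaling each b by the ratio of the standard deviation of its x to that of y, and the total sum-of-squares split into its regression and residual parts:

# OLS estimates, standardized betas, and the sum-of-squares decomposition
# (a sketch on simulated data; parameter values are hypothetical).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with an intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)     # b minimizes the sum of squared errors
y_hat = X @ b                                 # predicted values

ss_total = np.sum((y - y.mean()) ** 2)        # total sum-of-squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)      # regression (explained) sum-of-squares
ss_res = np.sum((y - y_hat) ** 2)             # residual (error) sum-of-squares
# ss_total equals ss_reg + ss_res (up to rounding)

# Standardized coefficients: beta_j = b_j * sd(x_j) / sd(y)
betas = b[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)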

TESTS OF SIGNIFICANCE

The omnibus test for significance of regression is an F-test that determines whether there is a linear relationship between the dependent variable y and any of the independent variables. The null and alternative hypotheses for this F-test are

Ho: β1 = β2 = … = βk = 0

Ha: βj ≠ 0 for at least one j

Rejection of the null hypothesis implies that at least one of the independent variables has a significant relationship to the dependent variable in the presence of the other variables in the model. This F-test is the ratio of the mean square regression to the mean square residual:

F = MSreg / MSres
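A sketch of the computation, again on simulated data with k = 2 predictors (all values hypothetical): the mean squares are the sums of squares divided by their degrees of freedom (k for the regression and n − k − 1 for the residual), and the p-value is the upper tail of the F distribution.

# Computing the omnibus F-test from the sums of squares (hypothetical data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k predictors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
ms_reg = np.sum((y_hat - y.mean()) ** 2) / k            # mean square regression
ms_res = np.sum((y - y_hat) ** 2) / (n - k - 1)         # mean square residual
F = ms_reg / ms_res
p_value = stats.f.sf(F, k, n - k - 1)                   # reject Ho when p is small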

When the F is significant, we then examine tests on the coefficients of the individual independent variables to determine which variables contribute to the explanation of variance and have an effect on the dependent variable. These tests of the coefficients are t-tests and are equivalent to an F-test on the change in the amount of variance explained in the dependent variable that would result from adding the individual independent variable to an equation containing all other independent variables.

The ratio of the sum-of-squares regression to the sum-of-squares total is the coefficient of determination (R²) and gives the proportion of the total variation in y that is explained by, or attributable to, the set of independent variables (most basically, R² is the square of the correlation between the observed and predicted values of y):

R² = SSreg / SStotal

This is referred to as the R-square and is an indicator of model fit, ranging from 0, indicating no fit, to 1.0, indicating a perfect fit. (The F-test above can also be considered a test of the null hypothesis that the proportion of variance explained is 0, Ho: ρ² = 0.)
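In practice these quantities are read directly from fitted-model output. The sketch below uses the statsmodels package on simulated data (the package choice and all values are illustrative, not prescribed by the chapter):

# R-square and coefficient t-tests from a fitted OLS model (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                              # two predictors
y = 1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.rsquared)                # R² = SSreg / SStotal
print(fit.tvalues, fit.pvalues)    # t-test for each coefficient (b divided by its standard error)
print(fit.fvalue, fit.f_pvalue)    # the omnibus F-test discussed above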
Many researchers attempt to decompose the proportion of variance explained by the set of independent variables into components uniquely attributable to each independent variable. The only situation in which this is possible is when all independent variables are uncorrelated, which is almost never the case. Numerous approaches have been developed to attempt this decomposition, and depending upon which approach is taken, the variance attributed to individual variables differs substantially. We refer readers to Pedhazur (1982) for a complete discussion of the difficulties in attempts at variance partitioning, and we agree with the eminent sociologist Otis Dudley Duncan, who recommended that such efforts be dispensed with:

The simplest recommendation—one which saves both work and worry—is to eschew altogether the task of dividing up R² into unique causal components. In a strict sense, it just cannot be done, even though many sociologists, psychologists, and other quixotic persons cannot be persuaded to forego the attempt. (1975, p. 65)

SUBSTANTIVE IMPORTANCE VS. STATISTICAL SIGNIFICANCE

When statistical significance is found, one must then address the issue of the substantive importance of the findings. It is commonly known that large sample sizes increase the likelihood of finding statistically significant effects in any type of statistical analysis, and we often use survey data that represent responses from hundreds if not thousands of subjects. It is the researcher’s responsibility to determine what magnitude of effect is substantively meaningful, given the nature of the data gathered and the question being addressed. Some authors offer guidelines or criteria of importance, but an effect of a certain magnitude that is important in one setting is not necessarily important in other settings. For example, Cohen (1977) suggests that an R² of .01 could be viewed as a small, meaningful effect, but few would agree that explaining only 1% of the variance in the dependent variable with a collection of independent variables is of any importance. Comparing the R² obtained in a study to the R²s reported in similar studies, together with careful consideration of the magnitude of the betas (standardized coefficients), can help place substantive importance on findings. Betas of .05 or less can hardly be argued to be meaningful, given that such a value represents a change of only 5/100 of a standard deviation in the dependent variable for a one standard deviation change in the independent variable, holding other effects constant.

References

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic Press.

Duncan, O. D. (1975). Introduction to structural equation models. New York: Academic Press.

Ezekiel, M., & Fox, K. A. (1959). Methods of correlation and regression analysis. New York: John Wiley & Sons.

Galton, F. (1885). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, 15, 246-263.

Kerlinger, F. N. (1973). Foundations of behavioral research. New York: Holt, Rinehart and Winston.

Montgomery, D. C., & Peck, E. A. (1992). Introduction to linear regression analysis (2nd ed.). New York: John Wiley & Sons.

Myers, R. H. (1990). Classical and modern regression with applications (2nd ed.). Boston: PWS-KENT Publishing Co.

Pedhazur, E. J. (1982). Multiple regression in behavioral research: Explanation and prediction. New York: Holt, Rinehart & Winston.

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55, 525-534.
