MULTIPLE REGRESSION

Introduction

• Used when one wishes to determine the relationship between a single dependent variable and a set of independent variables.

• The dependent variable (Y) is typically continuous.

• The independent variables (X1, X2, X3 . . . XP) are typically continuous, but they can be fixed as well.

• When there is a single independent variable (P = 1), the resulting figure is a line.

• When there are two independent variables (P = 2), the resulting figure is a plane.

• The equation for linear multiple regression can be written as:

Ŷ = b0 + b1X1 + b2X2 + . . . + bPXP

• Where b0 = Y intercept.

• b1 through bP = partial regression coefficients with respect to X1, X2, . . ., XP.

• b1 through bP also represent the slopes of the regression hyperplane with respect to X1, X2, . . ., XP. The fitted surface is a (hyper)plane when the X values are fixed; when the X variables are continuous (variable), the joint distribution of the X and Y values is described by an ellipsoid.

• b1 is the rate of change of Y as a function of X1 when X2, . . ., XP are held constant.
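• For example, in the SAS analysis at the end of these notes the estimated coefficient for fage is −0.02664, meaning ffev1 is estimated to decrease by about 0.027 for each additional year of fage when fheight is held constant.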

• The model for an individual observation can be written as: Y = b0 + b1X1 + b2X2 + . . . + bPXP + e, where e is the error (residual) term.

Analysis of Variance in Multiple Regression

• The null hypothesis being tested is: Ho: ß1 = ß2 = . . . = ßP = 0

• In words, the null hypothesis is that there is no linear relationship between the dependent variable and the independent variables.

Source of variation    df           Sum of Squares    Mean square          F
Due to regression      P            Σ(Ŷ − Ȳ)²         SSReg/P              MSReg/MSRes
Residual               N − P − 1    Σ(Y − Ŷ)²         SSRes/(N − P − 1)
Total                  N − 1        Σ(Y − Ȳ)²

• Where P = number of independent parameters and N = total number of observations.
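• As a concrete check of the F ratio using the SAS output at the end of these notes (Model df = 2, Error df = 147): F = MSReg/MSRes = 10.52848/0.28600 ≈ 36.81, the F value SAS reports.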

• The hypothesis Ho: ßi = 0 can be tested for each independent variable using a t-test.

• t = bi / SE(bi), with df = N − P − 1
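• For example, for the fage coefficient in the SAS output at the end of these notes: t = −0.02664/0.00637 ≈ −4.18 with df = 150 − 2 − 1 = 147, matching the reported t value.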

• Coefficient of determination: R² = SSReg/SSTotal

• The R2 value generally overestimates the population correlation value; thus, an adjusted R2 value may be desired.

• The bias in R² occurs because, as the number of parameters in the model increases, the numerator (SSReg) can only increase while the denominator (the total sum of squares) stays the same. Therefore, each additional variable can never decrease R²; it can only leave it similar or make it larger.

• Adjusted R² = R² − P(1 − R²)/(N − P − 1)
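• As a check against the SAS output at the end of these notes (R² = 0.3337, P = 2, N = 150): adjusted R² ≈ 0.3337 − 2(1 − 0.3337)/147 ≈ 0.3246, which agrees (up to rounding) with the Adj R-Sq value of 0.3247 that SAS prints.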

Regression With Variable-X

• To characterize the joint distribution in the variable-X case, we need the means µ1, µ2, . . ., µP and µY; the variances σ1², σ2², . . ., σP² and σY²; and the covariances of the X and Y variables.

• The variances and covariances estimated from the analysis are displayed as a symmetric matrix, with the variances on the main (left-to-right) diagonal.

       X1      X2      X3      Y
X1     s1²     s12     s13     s1Y
X2     s12     s2²     s23     s2Y
X3     s13     s23     s3²     s3Y
Y      s1Y     s2Y     s3Y     sY²

• The estimates of the correlations also can be displayed as a symmetric matrix, with the diagonal elements equal to 1 because the correlation of a variable with itself is one.

       X1      X2      X3      Y
X1     1       r12     r13     r1Y
X2     r12     1       r23     r2Y
X3     r13     r23     1       r3Y
Y      r1Y     r2Y     r3Y     1
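• The covariance and correlation matrices are linked by the standard relationship r12 = s12/(s1 s2), i.e., a covariance divided by the product of the two standard deviations. For example, from the SAS output at the end of these notes, the covariance of ffev1 and fage is −1.38762, and −1.38762/(0.65075 × 6.89000) ≈ −0.30948, the correlation SAS prints.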

• If tests of significance are wanted on the correlation values (Ho: ρ = 0), they can be calculated using the formula:

t = r √(N − 2) / √(1 − r²), with df = N − 2
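• For example, taking the simple correlation between ffev1 and fage from the output at the end of these notes (r = −0.30948, N = 150): t ≈ −0.30948 √148 / √(1 − 0.30948²) ≈ −3.96 on 148 df, consistent with the reported p-value of 0.0001.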

• Standardized regression coefficients (the coefficients that would be obtained if the X and Y variables were standardized before the analysis) can be determined using the formula:

standardized bi = bi × (standard deviation of Xi) / (standard deviation of Y)
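• A minimal SAS sketch (assuming the same lung function variables used later in these notes): PROC REG will print standardized coefficients directly if the STB option is added to the MODEL statement. This option is not part of the example commands below; it is shown only as an illustration.

PROC REG;
  * STB requests standardized regression coefficients in the output;
  MODEL ffev1 = fage fheight / stb;
RUN;

• Using the coefficients and standard deviations printed in the output at the end of these notes, the standardized coefficient for fage would be roughly −0.02664 × 6.89/0.65075 ≈ −0.28, and for fheight roughly 0.11440 × 2.77919/0.65075 ≈ 0.49.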

Multiple Correlation

• In multiple correlation the correlation coefficient is denoted as R.

• This value represents the correlation between Y and the point on the regression plane for all possible combinations of X.

• Each individual in the population has a Y value and a corresponding point on the plane calculated as:

Y′ = b0 + b1X1 + b2X2 + . . . + bPXP

• The value for R is the population simple correlation between all Y and Y’ values.

• R also is the highest possible simple correlation between Y and any linear combination of X1 to XP.

• Thus, the minimum value of R is 0 and the maximum value is 1.0.

• When R approaches 0, this indicates that the regression plane predicts Y no better than simply using the mean value Ȳ.

• An R = 1.0 indicates a perfect fit of the plane with the points in the population.
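• Note (a standard identity, not printed in the SAS output itself): the sample multiple correlation is the square root of the coefficient of determination, so in the example analysis below R = √0.3337 ≈ 0.58.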

Partial Correlation

• Simple correlation may not always allow us to clearly determine the relationship between two variables because other variables may be influencing the results.

• Partial correlation analysis involves studying the linear relationship between two variables while controlling for the effect of one or more other variables.

• This technique is often used in “causal” modeling of small numbers of variables.

• For example, assume you have the variables Y, X1, and X2 and you wish to determine if the correlation between Y and X1 is influenced by X2.

o You can determine if there is a causal relationship by calculating the partial correlation of Y with X1 while controlling for variable X2 (written as rY1.2).

§ In partial correlation analysis, the first step is to compare the partial correlation (e.g. rY1.2) with the original correlation (rY1).

§ If the partial correlation approaches 0, the inference is that the original correlation may be spurious and that there is no direct causal link between the two original variables (a formula for the partial correlation is sketched below).
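• A sketch of how the controlling is done (this standard first-order partial correlation formula is not written out in the original notes): the partial correlation of Y and X1 controlling for X2 can be computed from the three simple correlations as

rY1.2 = (rY1 − rY2 r12) / √[(1 − rY2²)(1 − r12²)]

Plugging in the simple correlations from the PROC CORR output at the end of these notes (rY1 = −0.30948 for ffev1 and fage, rY2 = 0.50440 for ffev1 and fheight, r12 = −0.05615 for fage and fheight) gives approximately −0.326, which matches the partial correlation SAS reports.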

o An example using the lung function dataset: we wish to determine the partial correlation between father’s lung function (ffev1) and father’s age (fage) while controlling for father’s height (fheight). This partial correlation can be written as r(ffev1 fage.fheight). The numbers in parentheses below are the probabilities of a greater |r| under Ho: ρ = 0.

                      Simple correlation of     Partial correlation,
                      ffev1 and fage            controlling for fheight
                      -0.30948                  -0.32613
                      (0.0001)                  (<0.0001)

• Since the partial correlation value of -0.32613 is similar to the original correlation of -0.30948, we can conclude the original correlation between ffev1 and fage was likely not affected by fheight.

Considerations When Conducting Multiple Regression and Partial Correlation

• Regression is much more sensitive to violations of the assumptions underlying the analyses and to problematic observations such as outliers.

• If you are analyzing economic or other observational data, multicollinearity (i.e., high correlation among independent variables) may be a problem.

• Because of the limited time for this course, we are unable to discuss running diagnostics on your data to identify potential problems.

• If you are going to use multiple regression, I encourage you to work with a statistician to learn more about running diagnostics on your data (a brief SAS sketch of such options follows).
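• A minimal, hedged sketch of what requesting such diagnostics can look like in SAS (VIF, COLLIN, and INFLUENCE are standard MODEL statement options in PROC REG; interpreting their output is beyond the scope of these notes):

PROC REG;
  * VIF prints variance inflation factors (a multicollinearity check),
    COLLIN prints collinearity diagnostics, and
    INFLUENCE prints influence statistics for spotting unusual observations;
  MODEL ffev1 = fage fheight / vif collin influence;
RUN;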

• Any observation having missing values will be excluded from analyses using SAS.

Examples of Analyses

• Using the lung function data.

• Dependent variable is father’s fev1 (ffev1).

• Independent variables are father’s age (fage) and father’s height (fheight).

• SAS commands for multiple regression:

PROC REG;
  MODEL ffev1 = fage fheight;
RUN;

• SAS commands for producing the variance-covariance matrix:

PROC CORR cov noprob;
  VAR ffev1 fage fheight;
RUN;

• SAS commands for producing the partial correlations of ffev1 and fage with fheight controlled.

PROC CORR;
  VAR ffev1 fage;
  PARTIAL fheight;
  TITLE 'Partial Correlation of ffev1 and fage with fheight controlled';
RUN;

Multiple regression of father age and father height on father fev1

The REG Procedure
Model: MODEL1
Dependent Variable: ffev1

Number of Observations Read    150
Number of Observations Used    150

Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model                2         21.05697       10.52848      36.81    <.0001
Error              147         42.04133        0.28600
Corrected Total    149         63.09830

Root MSE             0.53479    R-Square    0.3337
Dependent Mean       4.09327    Adj R-Sq    0.3247
Coeff Var           13.06500

Parameter Estimates

Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept     1             -2.76075           1.13775       -2.43      0.0165
fage          1             -0.02664           0.00637       -4.18      <.0001
fheight       1              0.11440           0.01579        7.25      <.0001

Ŷ = −2.76075 − 0.02664 (fage) + 0.11440 (fheight)
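As a quick arithmetic check on this equation: plugging in the sample means from the PROC CORR output below, fage = 40.13333 and fheight = 69.26, gives −2.76075 − 0.02664(40.13333) + 0.11440(69.26) ≈ 4.093, essentially the dependent mean of 4.09327, as expected since a least-squares plane passes through the point of means.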

Reject Ho:ßfage = 0 at the 95% and 99% levels of confidence.

Reject Ho:ßfheight = 0 at the 95% and 99% levels of confidence.

Both fage and fheight contribute significantly in explaining the variation in ffev1.

33.4% of the variation in ffev1 is explained collectively by fage and fheight.

Variance-Covariance Matrix and Simple Linear Correlation

The CORR Procedure

3 Variables: ffev1 fage fheight

Covariance Matrix, DF = 149

              ffev1           fage           fheight
ffev1         0.42347852     -1.38761969     0.91223221
fage         -1.38761969     47.47203579    -1.07516779
fheight       0.91223221     -1.07516779     7.72389262

Simple Statistics

Variable      N        Mean       Std Dev          Sum      Minimum     Maximum
ffev1       150     4.09327       0.65075    613.99000      2.50000     5.85000
fage        150    40.13333       6.89000         6020     26.00000    59.00000
fheight     150    69.26000       2.77919        10389     61.00000    76.00000

Pearson Correlation Coefficients, N = 150
Prob > |r| under H0: Rho=0

              ffev1        fage         fheight
ffev1       1.00000       -0.30948      0.50440
                           0.0001       <.0001
fage       -0.30948        1.00000     -0.05615
            0.0001                      0.4949
fheight     0.50440       -0.05615      1.00000
            <.0001         0.4949

Partial Correlation of ffev1 and fage with fheight controlled

The CORR Procedure

1 Partial Variables:    fheight
2 Variables:            ffev1 fage

Simple Statistics

Variable      N        Mean      Std Dev          Sum      Minimum     Maximum    Partial Variance    Partial Std Dev
fheight     150    69.26000      2.77919        10389     61.00000    76.00000
ffev1       150     4.09327      0.65075    613.99000      2.50000     5.85000     0.31787             0.56380
fage        150    40.13333      6.89000         6020     26.00000    59.00000    47.64212             6.90233

Pearson Partial Correlation Coefficients, N = 150
Prob > |r| under H0: Partial Rho=0

              ffev1        fage
ffev1       1.00000       -0.32613
                           <.0001
fage       -0.32613        1.00000
            <.0001

Since the partial correlation value of -0.32613 is similar to the original correlation of -0.30948, we can conclude the original correlation between ffev1 and fage was likely not affected by fheight.