Applied Mathematical Sciences, Vol. 11, 2017, no. 13, 601 - 622 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2017.7253

Robust Regression Diagnostic for Detecting and Solving Multicollinearity and Outlier Problems:

Applied Study by Using Financial Data

Mona N. Abdel Bary

Faculty of Commerce, Department of Statistics and Insurance, Suez Canal University, Al Esmalia, Egypt

Copyright © 2017 Mona N. Abdel Bary. This article is distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Multicollinearity often causes a serious explanatory problem in multiple regression analysis. In the presence of multicollinearity, ordinary least squares (OLS) regression may produce highly variable estimates of the regression coefficients. Besides multicollinearity, outliers also constitute a problem in multiple regression analysis. Barnett and Lewis (1994) defined outliers as observations that are inconsistent with the rest of the data; most of the time they remain hidden to the user, since least squares residual plots fail to show these outlying points. There are several methods for dealing with these problems, and one of the best known is Principal Component Regression (PCR). For financial data containing outliers, this paper provides a method to identify outliers and to reduce their influence on the final estimates of the regression coefficients. Multicollinearity is detected by inspecting the correlation matrix, the variance inflation factor (VIF), and the eigenvalues of the correlation matrix. The PCR estimator performed better than OLS when the data set suffered from both problems, so robust regression diagnostic results were achieved by applying the proposed estimator to a financial data set exhibiting both problems.

Keywords: Multiple Regression, Ordinary Least Squares, Multicollinearity, Robust Regression, Principal Components, Variance Inflation Factor, Egyptian Stock Market

1. Introduction

Robust regression provides an alternative to least squares regression that works under less restrictive assumptions. Specifically, it provides much better regression coefficient estimates when outliers are present in the data. Outliers violate the assumption of normally distributed residuals in least squares regression; they tend to distort the least squares coefficients by exerting more influence than they deserve. Robust regression downweights the influence of outliers, which makes the residuals of outlying observations larger and easier to spot. Robust regression is an iterative procedure that seeks to identify outliers and minimize their impact on the coefficient estimates. A number of different techniques for solving the multicollinearity problem have been developed; these range from simple methods based on principal components to more specialized techniques for regularization [8]. The study \cite{Asad:01} applied a high-breakdown robust regression method to three linear models and compared the regression statistics of the LS technique used in the original paper with those of the robust method.
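To make the idea concrete, the following is a minimal sketch (not the author's original code) of how a robust fit might be obtained alongside OLS using statsmodels' RLM, which implements iteratively reweighted least squares with Huber weights; the simulated data and column names are purely illustrative.

```python
# Sketch: robust regression via iteratively reweighted least squares (IRLS),
# using statsmodels' RLM with Huber weights. Data and names are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)
df.loc[:4, "y"] += 15.0                      # contaminate the first 5 rows with outliers

X = sm.add_constant(df[["x1", "x2"]])
ols_fit = sm.OLS(df["y"], X).fit()                               # ordinary least squares
rob_fit = sm.RLM(df["y"], X, M=sm.robust.norms.HuberT()).fit()   # robust fit via IRLS

print(ols_fit.params)        # coefficients pulled toward the outliers
print(rob_fit.params)        # close to the true values 1.0, 2.0, -0.5
print(rob_fit.weights[:5])   # the outlying rows receive small weights
```

The point of the sketch is simply that the robust fit assigns small weights to the contaminated rows, so their influence on the coefficient estimates is reduced.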

Study [5] promoted the application of robust regression instead of least squares regression for the estimation of agricultural and environmental production functions, and [2] estimated the growth effects of fiscal policies by applying robust methods. Numerous analyses of R&D expenditure have been made on the basis of different criteria, such as the source of funds, field of science, type of costs, economic activity, enterprise size class, socioeconomic objectives, and regions.

Study [9] published the results of an empirical analysis of national energy R&D expenditures. Since the launch of the Europe 2020 strategy, many studies, papers, and reports have been released. Commenting on the strategy, some of them make relevant remarks regarding the 3% target for the GERD indicator to be reached within the EU by 2020.

Study [1] investigated to what extent the EU members complied with the R&D investment targets set by the Europe 2020 strategy, their actual spending being below 2% of GDP on average and only three member states reporting an R&D expenditure ratio higher than 3% of GDP.

Multicollinearity is a statistical phenomenon in which there exists a perfect or exact relationship between the predictor variables. When such a relationship exists, it is difficult to come up with reliable estimates of the individual coefficients, and it becomes difficult to draw correct conclusions about the relationship between the outcome variable and the predictor variables. The Ordinary Least Squares (OLS) estimator is the most popular way to estimate the parameters of a regression model, but its performance is inefficient in the presence of multicollinearity: the regression coefficients possess large standard errors and some even have the wrong sign [6].



The PCR method has been proposed as an alternative to OLS estimation when the assumption of independent predictors is not satisfied. In this study, we compare the OLS and PCR methods using real data from the Egyptian stock market.

2. Objectives of the Study:

This paper studies financial data taken from one emerging market to answer the following questions:

• What are the most important variables that affect the movement of stock prices in an emerging stock market?
• Do daily published data affect the movements of stock prices?
• Which daily published data have the greatest influence on the movements of stock prices during trading?

3. The Proposed Framework

3.1 Detection of Multicollinearity

• Studying scatter plots of pairs of independent variables: begin by studying pairwise scatter plots of the independent variables, looking for near-perfect relationships, and also glance at the correlation matrix for high correlations. Unfortunately, multicollinearity does not always show up when the variables are considered two at a time.
• Consider the variance inflation factors (VIF): the VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis.

$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, P-1 \qquad (1)$$

where $R_j^2$ denotes the coefficient of determination when $x_j$ is regressed on all other predictor variables in the model. The VIF provides an index that measures how much the variance of an estimated regression coefficient is increased because of multicollinearity. As a practical rule, if any of the VIF values exceeds 5, or 10, it is an indication that the associated regression coefficients are poorly estimated because of multicollinearity [7].
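As an illustration, a small sketch of how the VIF in (1) might be computed with statsmodels follows; the data frame `X` of predictor columns is an assumed placeholder, not the paper's actual data file.

```python
# Sketch: VIF_j = 1 / (1 - R_j^2) for each predictor column of a DataFrame X.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return the VIF of each predictor, largest first."""
    Xc = sm.add_constant(X)                               # add an intercept column
    vifs = {col: variance_inflation_factor(Xc.values, i)
            for i, col in enumerate(Xc.columns) if col != "const"}
    return pd.Series(vifs).sort_values(ascending=False)

# Example use (hypothetical predictor frame): vif_table(df[["X1", "X2", "X3"]])
# Values above roughly 5 or 10 flag coefficients that are poorly estimated.
```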

• Eigenvalue analysis of the correlation matrix: focus on small eigenvalues of the correlation matrix of the independent variables. An eigenvalue of zero, or close to zero, indicates that an exact linear dependence exists. Instead of looking at the numerical size of the eigenvalues, use the condition number; large condition numbers indicate collinearity. The condition number of the correlation matrix is defined as:

$$K_j = \frac{\lambda_{max}}{\lambda_j}, \qquad j = 1, 2, \ldots, P \qquad (2)$$


where $\lambda_{max}$ is the largest eigenvalue and $\lambda_j$ is the $j$-th eigenvalue of the correlation matrix. If the condition number is less than 100, there is no serious problem with multicollinearity; a condition number between 100 and 1000 implies moderate to strong multicollinearity; and if the condition number exceeds 1000, severe multicollinearity is indicated [7].
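A possible sketch of this eigenvalue-based check, again with an assumed predictor data frame `X`, is:

```python
# Sketch: condition numbers K_j = lambda_max / lambda_j from the correlation
# matrix of the predictors.
import numpy as np
import pandas as pd

def condition_numbers(X: pd.DataFrame) -> pd.Series:
    corr = np.corrcoef(X.values, rowvar=False)      # P x P correlation matrix
    lam = np.sort(np.linalg.eigvalsh(corr))[::-1]   # eigenvalues, largest first
    return pd.Series(lam[0] / lam, index=[f"K{j+1}" for j in range(len(lam))])

# condition_numbers(df[predictor_cols])
# K_j below 100: no serious problem; 100-1000: moderate to strong; above 1000: severe.
```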

3.2 Principal Component Regression

PCR provides a unified way to handle multicollinearity, although it requires some calculations that are not usually included in a standard regression analysis. Principal component analysis follows from the fact that every regression model can be restated in terms of a set of orthogonal explanatory variables. These new variables are obtained as linear combinations of the original explanatory variables and are referred to as the principal components. Consider the following model:

$$Y = X\beta + \varepsilon \qquad (3)$$

where $Y$ is an $n \times 1$ vector of responses, $X$ is an $n \times P$ matrix of independent variables, $\beta$ is a $P \times 1$ vector of unknown constants, and $\varepsilon$ is an $n \times 1$ vector of random errors such that $E(\varepsilon) = 0$ and $E(\varepsilon\varepsilon') = \sigma^2 I$. The OLS estimator is defined as:

$$\hat{\beta} = (X'X)^{-1}X'Y \qquad (4)$$

There exists a matrix C, satisfying:

$$C'(X'X)C = \Lambda \qquad (5)$$

$$C'C = CC' = I$$

$\Lambda$ is a diagonal matrix with the ordered characteristic roots of $X'X$ on the diagonal; the characteristic roots are denoted by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_P$. The matrix $C$ may be used to calculate a new set of explanatory variables, namely:

$$(Z_1, Z_2, \ldots, Z_P) = Z = XC = (X_1, X_2, \ldots, X_P)\,C \qquad (6)$$

which are linear functions of the original explanatory variables. The $Z$'s are referred to as the principal components. Thus the regression model can be restated in terms of the principal components as:

$$Y = Z\alpha + \varepsilon \qquad (7)$$

where $Z = XC$ and $\alpha = C'\beta$, so that

$$Z'Z = C'(X'X)C = C'C\,\Lambda\,C'C \qquad (8)$$


The least squares estimator of $\alpha$ is

$$\hat{\alpha} = (Z'Z)^{-1}Z'Y = \Lambda^{-1}Z'Y \qquad (9)$$

and the variance-covariance matrix of $\hat{\alpha}$ is

$$V(\hat{\alpha}) = \sigma^2 (Z'Z)^{-1} = \sigma^2 \Lambda^{-1} \qquad (10)$$

Thus a small eigenvalue of $X'X$ implies that the variance of the corresponding regression coefficient will be large. Since

$$Z'Z = C'(X'X)C = C'C\,\Lambda\,C'C = \Lambda,$$

we often refer to the eigenvalue $\lambda_j$ as the variance of the $j$-th principal component. If all $\lambda_j$ are equal to unity, the original regressors are orthogonal, while if a $\lambda_j$ is exactly equal to zero, a perfect linear relationship exists between the original regressors. One or more eigenvalues near zero implies that multicollinearity is present.

The principal component regression approach combats multicollinearity by using less than the full set of principal components in the model. To obtain the principal components estimator, assume that the regressors are arranged in order of decreasing eigenvalues, $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_P > 0$.

The principal components corresponding to near-zero eigenvalues are removed from the analysis, and least squares is applied to the remaining components, as sketched below.
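The following is a minimal sketch of this procedure under the assumption that the predictors are standardized first; the function name `pcr_fit` and the choice of how many components to keep are illustrative, not taken from the paper.

```python
# Sketch: principal component regression. Keep the n_keep leading components,
# regress y on their scores Z, then map alpha back via beta = C alpha.
import numpy as np

def pcr_fit(X, y, n_keep):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)       # standardize predictors
    lam, C = np.linalg.eigh(np.corrcoef(Xs, rowvar=False))  # eigen-decomposition of corr(X)
    C = C[:, np.argsort(lam)[::-1]][:, :n_keep]             # eigenvectors, largest eigenvalues first
    Z = Xs @ C                                               # principal component scores
    alpha, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None) # least squares on retained components
    return C @ alpha                                         # coefficients on the standardized-X scale

# beta_pcr = pcr_fit(X_matrix, y_vector, n_keep=8)  # e.g. drop the two smallest components
```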

3.3 Application and Results

3.3.1 Data

This study uses a cross-section of companies officially registered on the Egyptian stock exchange. Data were used for the 91 most active companies for the year 2000/2001, which cover the activities of more than 90% of the total IPO volume.

3.3.2 Variables:

3.3.2.1 Dependent variable: Y: Close price. Unit of measure is number of stocks.

3.3.2.2 Independent variables:

• X1: average account of recorded stock. Unit of measure is number of stocks.
• X2: average account of working stock. Unit of measure is number of stocks.
• X3: average value of working stock. Unit of measure is Egyptian pound.
• X4: average open stock price. Unit of measure is number of stocks.
• X5: average number of processes.
• X6: stock price.
• X7: doubling profit.
• X8: net profit.
• X9: type of the work.

3.3.3 Results

• We start by plotting the close price (Y) against each of the X's individually. It is obvious from the graphs that some of the independent variables seem to have a non-linear relationship with the close price (Y); see Figure 1.

[Figure 1, panels (1-1) through (1-9): scatter plots of the close price Y against each of the predictors X1 through X9.]

Figure 1 (1-9): Dependent variable vs. each predictor.

• We transformed the independent variables (X1, X2, X3, X4, X5, and X6) and the dependent variable (Y) by taking logarithms. They are now called Y*, X1*, X2*, X3*, X4*, X5*, and X6*; see Figure 2.

[Figure 2, panels (2-1) through (2-9): scatter plots of the transformed close price Y* against the transformed predictors X1* through X6* and the untransformed X7, X8, and X9.]

Figure 2 (1-9): Transformed dependent variable vs. transformed predictors.

• We regress Y* on all of the independent variables; see Table 1.

Table 1: Regression of the initial model.

The regression equation is:
Y* = -0.0281 + 0.00281 X1* + 0.00760 X2* - 0.00758 X3* + 2.31 X4* - 0.000001 X5 - 0.00875 X5* + 0.00039 X6* - 0.000042 X7 - 0.000000 X8 + 0.00271 X9

Predictor   Coef         SE Coef      T        P
Constant    -0.02812     0.01022      -2.75    0.007
LOGX1       0.002814     0.001538     1.83     0.071
LOGX2       0.007597     0.005605     1.36     0.179
LOGX3       -0.007581    0.005368     -1.41    0.162
LOGX4       2.31465      0.00624      371.18   0.000
X5          -0.00000088  0.00000020   -4.44    0.000
LOGX5       -0.008745    0.001634     -5.35    0.000
LOGX6       0.000387     0.003014     0.13     0.898
X7          -0.00004223  0.00005559   -0.76    0.450
X8          -0.00000000  0.00000001   -0.11    0.915
X9          0.002705     0.002624     1.03     0.306

S = 0.005640   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source      DF   SS       MS      F          P
Regression  10   91.3276  9.1328  287067.70  0.0000
Residual    80   0.0025
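A regression of this form could be reproduced, for example, with statsmodels' formula interface; the data frame `df` and the column names `Y_star`, `X1_star`, …, `X9` below are hypothetical stand-ins for the transformed variables, not the paper's actual file.

```python
# Sketch: fitting the initial model of Table 1 with statsmodels' formula API.
import statsmodels.formula.api as smf

formula = ("Y_star ~ X1_star + X2_star + X3_star + X4_star + X5 + X5_star"
           " + X6_star + X7 + X8 + X9")
fit = smf.ols(formula, data=df).fit()   # df: assumed data frame of the 91 companies
print(fit.summary())                    # coefficients, SEs, t/p-values, R-squared, ANOVA
```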

The first step is to validate the assumptions of ordinary least squares regression; see [3] and [4]. Assumptions:

1. Linearity: The model should be linear in the parameters: Y = Xβ + ε. The plot of residuals vs. predicted values results in a random scatter, which validates the linearity and homogeneity assumptions; see Figure 3.

[Figure 3: residuals plotted against fitted values for the initial model (response Y*).]

Figure 3: Plot of Residuals vs. Predicted Values

The residuals and the predicted values are orthogonal; transforming the response variable and six of the predictors helped to achieve this.

2. Errors: The errors should be normally distributed and independent of each other, ε ~ N(0, σ²I). Note that the residuals do not have equal variances, so we refer to the internally studentized residuals throughout the analysis. The normal probability plot results in a straight line through the origin with a slope equal to 1; see Figure 4.


[Figure 4: normal probability plot of the residuals (response ln Y).]

Figure 4: Normal Probability Plot

This indicates that the assumption of normality of the residuals holds. The index plot of residuals results in a randomly scattered horizontal band around zero, which means that the residuals are not autocorrelated, i.e. they are independent with zero mean and constant variance. We note that no observation has a high residual; see Figure 5.

[Figure 5: index plot of the residuals against observation order (response Y*).]

Figure 5: Index Plot of Residuals

3. Unusual Observations: Ordinary least squares regression assumes that the observations are equally reliable and have equal influence in determining the least squares results and conclusions. We examine high-leverage points by calculating the leverages and plotting them. The average leverage is $p_{ii} = k/n = 10/91 = 0.1099$. If we consider an observation with $p_{ii} > 2 \times (10/91) = 0.2198$ to be a high-leverage point, then we detect 9 high-leverage points; see Figure 6.
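A sketch of this leverage screen, assuming `fit` is a fitted statsmodels OLS results object for the model above, might look like:

```python
# Sketch: leverages from the hat matrix diagonal, flagging p_ii above twice
# the average leverage (the text uses 2 * 10 / 91 ≈ 0.2198).
import numpy as np

infl = fit.get_influence()                 # OLS influence diagnostics
leverage = infl.hat_matrix_diag            # the p_ii values
k = fit.df_model + 1                       # number of estimated parameters
n = int(fit.nobs)                          # number of observations
cutoff = 2 * k / n                         # twice the average leverage
print(np.where(leverage > cutoff)[0])      # indices of high-leverage observations
```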


[Figure 6: leverage values plotted against observation order.]

Figure 6: Leverages

Using Cook's distance to determine influential points, only 3 points were detected (57, 61, and 63). However, Cook's distance is multiplicative, so it may fail to detect unusual points. We therefore calculate Hadi's influence measure, which is additive, and support it with a potential-residual plot to determine which points are outliers, which are leverage points, and which are influential. The potential-residual plot flags observations 3, 4, 6, 20, 28, 30, 31, 35, 39, 53, 55, 61, 63, 67, 85, 87, 90, and 91. Applying RIRLS, which is the most reliable method for detecting outliers, 21 points were detected as outliers in the X-space; these are observations 2, 3, 4, 6, 17, 19, 28, 30, 35, 38, 39, 57, 61, 63, 64, 65, 66, 85, 87, 89, and 91. If we plot the standardized weighted least squares residuals, we detect 10 outliers in the X-Y space; these are observations 4, 6, 14, 35, 57, 61, 63, 87, 90, and 91; see Figure 8.
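Cook's distance and the internally studentized residuals mentioned above are available from statsmodels; the sketch below again assumes the hypothetical `fit` object. Hadi's influence measure and RIRLS are not built into statsmodels, so a full reproduction of the text's diagnostics would need additional code.

```python
# Sketch: Cook's distance and internally studentized residuals as outlier diagnostics.
import numpy as np

infl = fit.get_influence()
cooks_d, _ = infl.cooks_distance             # Cook's distance per observation
student = infl.resid_studentized_internal    # internally studentized residuals

print(np.where(cooks_d > 4 / fit.nobs)[0])   # a common rule-of-thumb cutoff, not the paper's
print(np.where(np.abs(student) > 2.5)[0])    # candidate outliers in X-Y space
```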

[Figure 8: standardized weighted least squares residuals plotted against observation order.]

Figure 8: Standardized Weighted Least Squares Residuals Plot

Observations 3, 4, 6, 28, 30, 35, 39, 61, 63, 85, 87, 90, and 91 were detected by the potential-residual plot. Observations 2, 14, 17, 19, 38, 57, 64, 65, 66, and 89 were masked, which is why they were not detected previously, while observations 20, 31, 53, 55, and 67 were swamped (detected as outliers) although they actually are not. However, since we have no good reason for deleting the detected outliers, we keep them in the data and simply point out that these observations are unusual with respect to the majority of the data.


4. Independence of the predictors: The predictor variables are supposed to be linearly independent.

We will calculate Kappa:

$$K_j = \sqrt{\frac{\lambda_{max}}{\lambda_j}}$$

to discover whether there is multicollinearity between the independent variables. We note that K1, K2, K3, …, K9 < 10 but K10 = 18.62 > 10; see Tables 2 and 3.

Table 2: Principal Component Analysis (for independent variables)

92 total cases, of which 1 is missing.

Eigenvalues
      Value   Variance Proportion (%)
e1    2.775   27.7
e2    2.582   25.8
e3    1.640   16.4
e4    1.020   10.2
e5    0.753    7.5
e6    0.608    6.1
e7    0.430    4.3
e8    0.094    0.9
e9    0.090    0.9
e10   0.008    0.1

Eigenvectors
       V1      V2      V3      V4      V5      V6      V7      V8      V9      V10
X1*   -0.477   0.093   0.035  -0.057  -0.027  -0.350  -0.786  -0.064   0.115   0.044
X2*   -0.568  -0.032   0.013  -0.023  -0.261   0.176   0.189   0.337  -0.147  -0.637
X3*   -0.456  -0.340  -0.083  -0.029  -0.207   0.323   0.148   0.135   0.153   0.677
X4*    0.115  -0.561  -0.164  -0.006   0.051   0.290  -0.190  -0.313   0.539  -0.365
X5     0.163  -0.223   0.664  -0.057   0.175  -0.104  -0.054   0.587   0.301   0.007
X5*   -0.165  -0.264   0.649  -0.032  -0.069  -0.019   0.109  -0.592  -0.335  -0.007
X6     0.180  -0.532  -0.205   0.004   0.143   0.067  -0.334   0.258  -0.664   0.012
X7    -0.050  -0.097  -0.142  -0.923   0.156  -0.240   0.184  -0.029   0.008  -0.014
X8    -0.163  -0.366  -0.202   0.358   0.112  -0.726   0.352  -0.005   0.086  -0.017
X9     0.338  -0.119   0.002  -0.110  -0.891  -0.241  -0.073   0.051   0.026   0.006

Unrotated Factor Matrix
       F1      F2      F3      F4      F5      F6      F7      F8      F9      F10
X1*   -0.794   0.150   0.045  -0.058  -0.023  -0.273  -0.515  -0.020   0.035   0.004
X2*   -0.946  -0.051   0.017  -0.023  -0.226   0.137   0.124   0.104  -0.044  -0.057
X3*   -0.760  -0.546  -0.107  -0.030  -0.179   0.252   0.097   0.042   0.046   0.061
X4*    0.192  -0.902  -0.211  -0.006   0.045   0.226  -0.125  -0.096   0.162  -0.033
X5     0.272  -0.358   0.850  -0.058   0.152  -0.081  -0.035   0.180   0.090   0.001
X5*   -0.274  -0.424   0.831  -0.033  -0.060  -0.015   0.072  -0.182  -0.101  -0.001
X6     0.300  -0.854  -0.262   0.004   0.124   0.052  -0.219   0.079  -0.199   0.001
X7    -0.083  -0.156  -0.182  -0.932   0.135  -0.187   0.120  -0.009   0.003  -0.001
X8    -0.272  -0.588  -0.258   0.361   0.098  -0.566   0.231  -0.001   0.026  -0.002
X9     0.562  -0.191   0.003  -0.111  -0.773  -0.188  -0.048   0.016   0.008   0.001


Table 3: Calculated Kappa values Kj

K1 = 1.00   K2 = 1.04   K3 = 1.30   K4 = 1.65   K5 = 1.92
K6 = 2.14   K7 = 2.54   K8 = 5.43   K9 = 5.55   K10 = 18.62

This indicates that there is multicollinearity; there is one set of collinearity. By computing the principal components of the independent variables X1*, X2*, …, X9 (see Table 2), we find that there is multicollinearity between X2* and X3*. We calculate the standardized values of all variables, zY*, zX1*, …, zX9, and regress zY* on the ten standardized predictors; see Table 6.

Table 6: Regression of zY* on zX1*, …, zX9

Predictor   Coef          SE Coef     T         P
Constant    -495.299e-18  586.9e-6    -844e-15  0.007
zX1*        0.00158662    867.4e-6    1.83      0.071
zX2*        0.00576792    0.004253    1.36      0.179
zX3*        -0.00632457   0.004475    -1.41     0.162
zX4*        1.00571       0.002709    371.18    0.000
zX5         -0.00587668   0.001323    -4.44     0.000
zX5*        -0.00727483   0.001359    -5.35     0.000
zX6*        186.179e-6    0.001451    0.13      0.898
zX7         -467.041e-6   616.1e-6    -0.76     0.450
zX8         -77.419e-6    723.2e-6    -0.11     0.915
zX9         682.425e-6    662.3e-6    1.03      0.306

S = 0.005599   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source      DF   SS          MS          F          P
Regression  10   89.9975     8.99975     287067.70  0.0000
Residual    80   31.3475e-6  31.3475e-6

Table 7: Principal Component Analysis of (zX1*, …, zX9)

92 total cases, of which 1 is missing.

Eigenvalues
      Value   Variance Proportion (%)
e1    2.775   27.7
e2    2.582   25.8
e3    1.640   16.4
e4    1.020   10.2
e5    0.753    7.5
e6    0.608    6.1
e7    0.430    4.3
e8    0.094    0.9
e9    0.090    0.9
e10   0.008    0.1


Eigenvectors
        V1      V2      V3      V4      V5      V6      V7      V8      V9      V10
zX1*   -0.477   0.093   0.035  -0.057  -0.027  -0.350  -0.786  -0.064   0.115   0.044
zX2*   -0.568  -0.032   0.013  -0.023  -0.261   0.176   0.189   0.337  -0.147  -0.637
zX3*   -0.456  -0.340  -0.083  -0.029  -0.207   0.323   0.148   0.135   0.153   0.677
zX4*    0.115  -0.561  -0.164  -0.006   0.051   0.290  -0.190  -0.313   0.539  -0.365
zX5     0.163  -0.223   0.664  -0.057   0.175  -0.104  -0.054   0.587   0.301   0.007
zX5*   -0.165  -0.264   0.649  -0.032  -0.069  -0.019   0.109  -0.592  -0.335  -0.007
zX6*    0.180  -0.532  -0.205   0.004   0.143   0.067  -0.334   0.258  -0.664   0.012
zX7    -0.050  -0.097  -0.142  -0.923   0.156  -0.240   0.184  -0.029   0.008  -0.014
zX8    -0.163  -0.366  -0.202   0.358   0.112  -0.726   0.352  -0.005   0.086  -0.017
zX9     0.338  -0.119   0.002  -0.110  -0.891  -0.241  -0.073   0.051   0.026   0.006

Unrotated Factor Matrix
        F1      F2      F3      F4      F5      F6      F7      F8      F9      F10
zX1*   -0.794   0.150   0.045  -0.058  -0.023  -0.273  -0.515  -0.020   0.035   0.004
zX2*   -0.946  -0.051   0.017  -0.023  -0.226   0.137   0.124   0.104  -0.044  -0.057
zX3*   -0.760  -0.546  -0.107  -0.030  -0.179   0.252   0.097   0.042   0.046   0.061
zX4*    0.192  -0.902  -0.211  -0.006   0.045   0.226  -0.125  -0.096   0.162  -0.033
zX5     0.272  -0.358   0.850  -0.058   0.152  -0.081  -0.035   0.180   0.090   0.001
zX5*   -0.274  -0.424   0.831  -0.033  -0.060  -0.015   0.072  -0.182  -0.101  -0.001
zX6*    0.300  -0.854  -0.262   0.004   0.124   0.052  -0.219   0.079  -0.199   0.001
zX7    -0.083  -0.156  -0.182  -0.932   0.135  -0.187   0.120  -0.009   0.003  -0.001
zX8    -0.272  -0.588  -0.258   0.361   0.098  -0.566   0.231  -0.001   0.026  -0.002
zX9     0.562  -0.191   0.003  -0.111  -0.773  -0.188  -0.048   0.016   0.008   0.001

Using principal components analysis we obtain the eigenvectors and eigenvalues and form new variables $W_i = zX\,V_i$, the principal component scores; see Table 7. We thereby transform the collinear matrix into an orthogonal one in which each vector is orthogonal to the others, giving W1, …, W10. We then regress zY* on W1, …, W10 and note that the p-values of all the Wi coefficients are < 0.05, which means all the coefficients are significant; see Tables 8 and 9. A sketch of this construction is given below.
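A compact sketch of this construction, assuming `Z` is the matrix of standardized predictors and `zy` the standardized response as numpy arrays, is:

```python
# Sketch: orthogonal scores W = Z V, regression of zY* on W, and the
# back-transformation beta_PCR = V alpha described in the text.
import numpy as np

lam, V = np.linalg.eigh(np.corrcoef(Z, rowvar=False))  # eigenvalues/vectors of corr(Z)
V = V[:, np.argsort(lam)[::-1]]                        # order columns by eigenvalue, largest first
W = Z @ V                                              # orthogonal component scores W1..W10
alpha, *_ = np.linalg.lstsq(W, zy, rcond=None)         # regress zY* on the W's
beta_pcr = V @ alpha                                   # back-transform to the zX scale
print(beta_pcr)
```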

Table 8: Pearson Product Moment Correlation

      W1     W2     W3     W4     W5     W6     W7     W8     W9     W10
W1    1.000
W2    0      1.000
W3    -0     0      1.000
W4    -0     -0.001 0      1.000
W5    0      0.001  0      0      1.000
W6    0      -0.001 0.001  -0     0      1.000
W7    -0     -0     -0.001 -0     -0     0      1.000
W8    0.003  -0     0.001  0      -0     0      0      1.000
W9    0      -0.001 0.002  0.002  -0.001 -0.001 0.001  -0.001 1.000
W10   0.002  0.002  -0.001 0.006  0.003  -0     0.002  0.002  0      1.000


Table 9: Regression of zY* on W1, …, W10
Dependent variable: zY*; 92 total cases, of which 1 is missing.

Predictor   Coef          SE Coef     T         P
Constant    -505.907e-18  586.9e-6    -862e-15  1.0000
W1          0.11523       4.2e-6      325       0.0001
W2          -0.559135     367.3e-6    -1.52e3   0.0001
W3          -0.173521     460.7e-6    -377      0.0001
W4          -0.00564867   584.2e-6    -9.67     0.0001
W5          0.0504161     679.8e-6    74.2      0.0001
W6          0.290882      756.7e-6    384       0.0001
W7          -0.192988     899.6e-6    -215      0.0001
W8          -0.313051     0.001922    -163      0.0001
W9          0.541343      0.001965    276       0.0001
W10         -0.375143     0.006561    -57.2     0.0001

S = 0.005599   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source      DF   SS         MS          F          P
Regression  10   89.9975    8.99975     287067.70  0.0000
Residual    80   0.0025078  31.3475e-6

After that we compute $\beta_{PCR} = V\alpha$:

β_PCR = (0.002, 0.1104, −0.0066, 1.005, −0.0696, 0.399, 0.557, 0.0889, −0.053, −0.0341)

Model 2: The model without outliers. Now we run another regression, excluding the observations that were detected as outliers; see Table 10.

Table 10: Regression of the model without outliers (Model 1*)

The regression equation is:
Y* = -0.00213 - 0.00086 X1* + 0.0146 X2* - 0.0145 X3* + 2.32 X4* + 0.000117 X5 - 0.0147 X5* - 0.00057 X6* + 0.000603 X7 + 0.000000 X8 + 0.00393 X9

Predictor   Coef         SE Coef      T        P
Constant    -0.002129    0.009925     -0.21    0.831
X1*         -0.000864    0.001344     -0.64    0.523
X2*         0.01461      0.01907      0.77     0.447
X3*         -0.01451     0.01930      -0.75    0.455
X4*         2.32177      0.01968      117.99   0.000
X5          0.00011718   0.00005337   2.20     0.032
X5*         -0.014734    0.002634     -5.59    0.000
X6*         -0.000573    0.003990     -0.14    0.886
X7          0.0006033    0.0001743    3.46     0.001
X8          0.00000000   0.00000001   0.05     0.958
X9          0.003929     0.002130     1.84     0.070

S = 0.004039   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance
Source          DF   SS       MS      F          P
Regression      10   57.0696  5.7070  349912.18  0.000
Residual Error  56   0.0009   0.0000
Total           66   57.0705

Again, we validate the four assumptions for the model.

1. Linearity: The plot of residuals vs. predicted values results in a random scatter, which validates the linearity and homogeneity assumptions; see Figure 9.

[Figure 9: residuals plotted against fitted values for the model without outliers (response Y*).]

Figure 9: Plot of Residuals vs. Predicted Values (model without outliers)

2. Errors: The Normal Probability Plot results in a straight line through the origin with a slope equal to 1, see Figure 10.

[Figure 10: normal probability plot of the residuals for the model without outliers (response Y*).]

Figure 10: Normal Probability Plot (model without outliers)


This indicates that the assumption of normality of the residuals holds. The index plot of residuals results in a randomly scattered horizontal band around zero, which means that the residuals are not autocorrelated, i.e. they are independent with zero mean and constant variance; see Figure 11.

Figure 11: Index Plot of Residuals (model without outliers)

3. Unusual Observations: Applying RIRLS, we obtain no outliers in the X-Y space from the plot of Standardized WLS residuals, see Figure 12.

[Figure 12: standardized weighted least squares residuals plotted against observation order for the model without outliers.]

Figure 12: Standardized Weighted Least Squares Residuals Plot (model without outliers)

4. Independence of the Predictors: We note that K1, K2, K3, …, K9 < 10 but K10 = 60.35 > 10, which indicates that there is multicollinearity; there is one set of collinearity. See Tables 11 and 12.


Table 11: Principal Component Analysis (without outliers)

391 total cases, of which 324 are missing.

Eigenvalues
      Value   Variance Proportion (%)
e1    3.642   36.4
e2    2.535   25.3
e3    1.207   12.1
e4    0.798    8.0
e5    0.635    6.4
e6    0.608    6.1
e7    0.437    4.4
e8    0.091    0.9
e9    0.046    0.5
e10   0.001    0

Eigenvectors
       V1      V2      V3      V4      V5      V6      V7      V8      V9      V10
X1*   -0.351   0.099  -0.299  -0.199  -0.365  -0.516  -0.565   0.126  -0.057  -0.004
X2*   -0.478   0.021   0.065  -0.092  -0.151  -0.164   0.525  -0.012   0.228  -0.617
X3*   -0.401  -0.337   0.093  -0.053  -0.008  -0.251   0.411   0.025  -0.167   0.674
X4*    0.075  -0.597   0.061   0.048   0.212  -0.169  -0.108   0.040  -0.616  -0.407
X5    -0.453   0.030   0.257   0.243   0.183   0.122  -0.326  -0.717  -0.002   0.002
X5*   -0.430   0.036   0.320   0.239   0.146   0.345  -0.240   0.675   0.010  -0.009
X6*    0.096  -0.589  -0.015   0.108   0.161  -0.185  -0.193   0.042   0.730   0.004
X7    -0.110  -0.040  -0.657   0.719  -0.085   0.092   0.135   0.009  -0.060  -0.003
X8    -0.109  -0.409  -0.149  -0.307  -0.536   0.634  -0.072  -0.102  -0.008   0.002
X9     0.241  -0.038   0.520   0.456  -0.651  -0.192   0.003  -0.012  -0.022   0.003

Unrotated Factor Matrix
       F1      F2      F3      F4      F5      F6      F7      F8      F9      F10
X1*   -0.669   0.158  -0.329  -0.178  -0.291  -0.403  -0.374   0.038  -0.012  -0
X2*   -0.913   0.033   0.071  -0.082  -0.121  -0.128   0.347  -0.004   0.049  -0.016
X3*   -0.766  -0.536   0.102  -0.047  -0.006  -0.195   0.272   0.008  -0.036   0.017
X4*    0.144  -0.951   0.067   0.043   0.169  -0.132  -0.072   0.012  -0.132  -0.011
X5    -0.864   0.048   0.283   0.217   0.146   0.096  -0.215  -0.216  -0       0
X5*   -0.821   0.058   0.351   0.214   0.116   0.269  -0.159   0.204   0.002  -0
X6*    0.183  -0.938  -0.017   0.096   0.129  -0.144  -0.127   0.013   0.156   0
X7    -0.210  -0.063  -0.722   0.642  -0.068   0.072   0.089   0.003  -0.013  -0
X8    -0.209  -0.652  -0.164  -0.274  -0.427   0.494  -0.048  -0.031  -0.002   0
X9     0.460  -0.060   0.572   0.408  -0.519  -0.149   0.002  -0.004  -0.005   0

Table 12: Calculated Kappa values Kj

K1 = 1       K2 = 1.199   K3 = 1.737   K4 = 2.136   K5 = 2.395
K6 = 2.447   K7 = 2.887   K8 = 6.326   K9 = 8.8929  K10 = 60.35

By computing the principal components of the independent variables X1*, X2*, …, X9, we find that there is multicollinearity between X2* and X3*. We calculate the standardized values of all variables, zY*, zX1*, …, zX9, and regress zY* on the standardized predictors; see Tables 14 and 15.


Table 14: Regression of zY* on zX1*, …, zX9 (without outliers)

391 total cases, of which 324 are missing.
Dependent variable: zY*
R-squared = 100.0%, R-squared (adjusted) = 100.0%
s = 0.004343 with 67 - 11 = 56 degrees of freedom

Source      Sum of Squares   df   Mean Square   F-ratio
Regression  65.9989          10   6.59989       350e3
Residual    0.00105624       56   18.8614e-6

Variable    Coefficient    s.e. of Coeff   t-ratio    prob
Constant    -315.171e-18   530.6e-6        -594e-15   1.0000
zX1*        -462.961e-6    719.2e-6        -0.644     0.5224
zX2*        0.00979101     0.01273         0.769      0.4451
zX3*        -0.0104808     0.01389         -0.755     0.4536
zX4*        1.00559        0.008523        118        0.0001
zX5         0.0029102      0.001325        2.2        0.0322
zX5*        -0.0071342     0.001275        -5.59      0.0001
zX6*        -266.406e-6    0.001855        -0.144     0.8863
zX7         0.00200118     578e-6          3.46       0.0010
zX8         33.1231e-6     643.7e-6        0.0515     0.9591
zX9         0.00110587     600e-6          1.84       0.0706

Table 15: Principal Component Analysis (without outliers) of zX1*, …, zX9

391 total cases, of which 324 are missing.

Eigenvalues
      Value   Variance Proportion (%)
e1    3.642   36.4
e2    2.535   25.3
e3    1.207   12.1
e4    0.798    8.0
e5    0.635    6.4
e6    0.608    6.1
e7    0.437    4.4
e8    0.091    0.9
e9    0.046    0.5
e10   0.001    0

Eigenvectors
        V1      V2      V3      V4      V5      V6      V7      V8      V9      V10
zX1*   -0.351   0.099  -0.299  -0.199  -0.365  -0.516  -0.565   0.126  -0.057  -0.004
zX2*   -0.478   0.021   0.065  -0.092  -0.151  -0.164   0.525  -0.012   0.228  -0.617
zX3*   -0.401  -0.337   0.093  -0.053  -0.008  -0.251   0.411   0.025  -0.167   0.674
zX4*    0.075  -0.597   0.061   0.048   0.212  -0.169  -0.108   0.040  -0.616  -0.407
zX5    -0.453   0.030   0.257   0.243   0.183   0.122  -0.326  -0.717  -0.002   0.002
zX5*   -0.430   0.036   0.320   0.239   0.146   0.345  -0.240   0.675   0.010  -0.009
zX6*    0.096  -0.589  -0.015   0.108   0.161  -0.185  -0.193   0.042   0.730   0.004
zX7    -0.110  -0.040  -0.657   0.719  -0.085   0.092   0.135   0.009  -0.060  -0.003
zX8    -0.109  -0.409  -0.149  -0.307  -0.536   0.634  -0.072  -0.102  -0.008   0.002
zX9     0.241  -0.038   0.520   0.456  -0.651  -0.192   0.003  -0.012  -0.022   0.003

Unrotated Factor Matrix
        F1      F2      F3      F4      F5      F6      F7      F8      F9      F10
zX1*   -0.669   0.158  -0.329  -0.178  -0.291  -0.403  -0.374   0.038  -0.012  -0
zX2*   -0.913   0.033   0.071  -0.082  -0.121  -0.128   0.347  -0.004   0.049  -0.016
zX3*   -0.766  -0.536   0.102  -0.047  -0.006  -0.195   0.272   0.008  -0.036   0.017
zX4*    0.144  -0.951   0.067   0.043   0.169  -0.132  -0.072   0.012  -0.132  -0.011
zX5    -0.864   0.048   0.283   0.217   0.146   0.096  -0.215  -0.216  -0       0
zX5*   -0.821   0.058   0.351   0.214   0.116   0.269  -0.159   0.204   0.002  -0
zX6*    0.183  -0.938  -0.017   0.096   0.129  -0.144  -0.127   0.013   0.156   0
zX7    -0.210  -0.063  -0.722   0.642  -0.068   0.072   0.089   0.003  -0.013  -0
zX8    -0.209  -0.652  -0.164  -0.274  -0.427   0.494  -0.048  -0.031  -0.002   0
zX9     0.460  -0.060   0.572   0.408  -0.519  -0.149   0.002  -0.004  -0.005   0

We again use principal components analysis to obtain the eigenvectors and eigenvalues and form new variables $W_i = zX\,V_i$, giving W1, …, W10; see Table 16. We then regress zY* on W1, …, W10 and note that the p-values of all the Wi coefficients are < 0.05, which means all the coefficients are significant; see Table 17.

Table 17: Regression of zY* on W1, …, W10 (without outliers)

391 total cases, of which 324 are missing.
Dependent variable: zY*
R-squared = 100.0%, R-squared (adjusted) = 100.0%
s = 0.004343 with 67 - 11 = 56 degrees of freedom

Source      Sum of Squares   df   Mean Square   F-ratio
Regression  65.9989          10   6.59989       350e3
Residual    0.00105624       56   18.8614e-6

Variable    Coefficient    s.e. of Coeff   t-ratio    prob
Constant    -299.314e-18   530.6e-6        -564e-15   1.0000
W1          0.0763563      280.6e-6        272        0.0001
W2          -0.597268      335.9e-6        -1.78e3    0.0001
W3          0.0568881      487.1e-6        117        0.0001
W4          0.0465538      581.6e-6        80         0.0001
W5          0.212016       671.2e-6        316        0.0001
W6          -0.170204      685.4e-6        -248       0.0001
W7          -0.106797      808.9e-6        -132       0.0001
W8          0.0331262      0.001769        18.7       0.0001
W9          -0.61575       0.002501        -246       0.0001
W10         -0.422485      0.02059         -20.5      0.0001

We then compute $\beta_{PCR} = V\alpha$:

β_PCR = (−0.0004, 0.0095, −0.01, 1.006, 0.051, −0.007, −0.001, −0.002, −0.00002, −0.002)

The final model (all observations) is:

$$Y^* = 0.002\,X_1^* + 0.006\,X_2^* - 0.006\,X_3^* + 0.93\,X_4^* - 0.006\,X_5 + 0.0004\,X_5^* + 0.00012\,X_6^* + 0.0002\,X_7 - 0.0006\,X_8 - 0.0341\,X_9 + \varepsilon$$


The final model (without outliers) is:

$$Y^* = -0.0004\,X_1^* + 0.0095\,X_2^* - 0.01\,X_3^* + 1.006\,X_4^* + 0.051\,X_5 - 0.007\,X_5^* - 0.001\,X_6^* - 0.002\,X_7 - 0.00002\,X_8 - 0.002\,X_9 + \varepsilon$$

5. Conclusions

The current study focuses on the analysis of financial data to explain the effect of the multicollinearity and outlier problems on the goodness of the regression model. The paper analyzes the financial data to determine the significant variables affecting the close stock price.

• There are differences between the results of the two models.

The final model (all observations) is:

$$Y^* = 0.002\,X_1^* + 0.006\,X_2^* - 0.006\,X_3^* + 0.93\,X_4^* - 0.006\,X_5 + 0.0004\,X_5^* + 0.00012\,X_6^* + 0.0002\,X_7 - 0.0006\,X_8 - 0.0341\,X_9 + \varepsilon$$

The final model (without outliers) is:

$$Y^* = -0.0004\,X_1^* + 0.0095\,X_2^* - 0.01\,X_3^* + 1.006\,X_4^* + 0.051\,X_5 - 0.007\,X_5^* - 0.001\,X_6^* - 0.002\,X_7 - 0.00002\,X_8 - 0.002\,X_9 + \varepsilon$$

We conclude that the outliers are important in explaining the response variable; the final model that includes all observations explains the response variable (close stock price) better.

• The variables X1*, X2*, X3*, X4*, X5, and X7 in the final model (all observations) are positively related to the close stock price, while X8 and X9 are negatively related to the response variable.

References

[1] N. Albu, Research and Development spending in the EU: 2020 growth strategy in perspective, SWP Working Paper FG 1. SWP, Berlin, 2011.

[2] C. Carsten, Growth Effects of Fiscal Policies: An Application of Robust Modified M-Estimator, Applied Economics, 41 (2009), 899 - 912. https://doi.org/10.1080/00036840701736099

[3] S. Chatterjee and A. S. Hadi, Sensitivity Analysis in Linear Regression, Wiley, New York, 1988.

[4] S. Chatterjee and A. S. Hadi, Regression Analysis by Example, 4th Edition, Wiley, New York, 2006. https://doi.org/10.1002/0470055464


[5] R. Finger, W. Hediger, The Application of Robust Regression to a Production Function Comparison - The Example of Swiss Corn, IED Working Paper 2, Institute for Environmental Decisions, ETH Zurich, Switzerland, September 2007. https://doi.org/10.2139/ssrn.1430342

[6] D. N. Gujarati, Basic Econometrics, 4th edition, Tata McGraw Hill, New Delhi, 2004.

[7] D. C. Montgomery, E. A. Peck and G. G. Vining, Introduction to Linear Regression Analysis, 3rd edition, Wiley, New York, 2001.

[8] T. Naes and U. Indahi, A Unified Description of Classical Classification Methods for Multi-collinear Data, J. Chemometrics, 12 (1998), 205-220. https://doi.org/10.1002/(sici)1099-128x(199805/06)12:3<205::aid-cem509>3.0.co;2-n

[9] J. T. Zhang, An Empirical Analysis for National Energy R&D Expenditures, International Journal of Global Energy Issues, 26 (2006), no. 1-2, 141 - 159. https://doi.org/10.1504/ijgei.2006.008388

Received: February 19, 2017; Published: March 10, 2017