A New Approach of Principal Component Regression Estimator with Applications to Collinear Data
International Journal of Engineering Research and Technology, ISSN 0974-3154, Volume 13, Number 7 (2020), pp. 1616-1622. © International Research Publication House. http://www.irphouse.com

1Kayode Ayinde, 2Adewale F. Lukman, 1Olusegun O. Alabi and 1Hamidu A. Bello
1Department of Statistics, Federal University of Technology, Akure, Nigeria.
2Department of Physical Sciences, Landmark University, Omu-Aran, Nigeria.
Corresponding Author Email: [email protected]
ORCID: 0000-0002-2993-6203 (Hamidu A. Bello), ORCID: 0000-0003-2881-1297 (Lukman)

Abstract

In this paper, a new approach to estimating the model parameters from Principal Components (PCs) is developed. Principal Component Analysis (PCA) is a method of variable reduction that has found relevance in combating the multicollinearity problem in multiple linear regression models. The new method uses the PCs as regressors to predict the dependent variable and then regresses the predicted variable, as a dependent variable, on the original explanatory variables by Ordinary Least Squares (OLS). The resulting parameter estimates are the same as those from the usual Principal Component Regression estimator, and the sampling properties of the new estimator are proved to be the same as those of the existing one. The new approach is simpler and easier to use. Real-life data sets are used to illustrate both the conventional and the developed method.

Keywords: Multiple Linear Regression Models, Multicollinearity, Principal Components, Ordinary Least Squares

1. INTRODUCTION

Violation of the independence of explanatory variables in multiple linear regression models results in the multicollinearity problem (Gujarati, 2005). The literature has reported the consequences of using the Ordinary Least Squares (OLS) estimator in the presence of multicollinearity: it produces unbiased but inefficient estimates of the model parameters, the estimated regression coefficients may have wrong signs and high standard deviations, and the parameter estimates and their standard errors become extremely sensitive to slight changes in the data points (Chatterjee and Hadi, 2006; Lukman et al., 2018; Ayinde et al., 2018). This consequence tends to inflate the estimated variance of the predicted values (Fomby et al., 1984; Maddalla, 1988; Ayinde et al., 2015).

Several other techniques are in use to combat this problem, and Principal Components Analysis (PCA) is one of them. It is a multivariate technique developed by Hotelling (1933) for explaining a set of correlated variables by a reduced number of uncorrelated ones with maximum variances, called Principal Components (PCs) (Naes and Marten, 1988; Chatterjee and Hadi, 2006). Other methods range from this approach to more specialized regularization techniques (Naes and Marten, 1988). The PCR method has been proposed as an alternative to the OLS estimator when the independence assumption is not satisfied in the analysis (Massy, 1965; Lee et al., 2015).

2. LITERATURE REVIEW

Consider the standard linear regression model defined as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + u_i, \quad i = 1, 2, \dots, n \qquad (2.1)$$

and defined in matrix form as:

$$y = X\beta + u \qquad (2.2)$$

where $y$ is the $(n \times 1)$ vector of observations of the dependent variable, $X$ is the $(n \times p)$ matrix of observations of the explanatory variables, $\beta$ is the $(p \times 1)$ vector of regression coefficients, and $u$ is the $(n \times 1)$ vector of error terms assumed to follow a multivariate normal distribution with mean zero and variance $\sigma^2 I_n$. Thus, the parameters of the model can be estimated using the Ordinary Least Squares (OLS) estimator, obtained as:

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'y \qquad (2.3)$$

The unbiased variance-covariance matrix of the OLS estimator can also be given as:

$$\mathrm{Var}(\hat{\beta}_{OLS}) = \hat{\sigma}^2 (X'X)^{-1} \qquad (2.4)$$

where $\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p}$ (Maddalla, 1988; Kibria, 2003).
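As a concrete illustration of (2.3) and (2.4), the following NumPy sketch (ours, not part of the original paper; the simulated data and variable names are purely illustrative) computes the OLS estimates, the residual variance $\hat{\sigma}^2$, and the estimated variance-covariance matrix of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulated data: n observations, three regressors plus an intercept column.
n = 50
X0 = rng.normal(size=(n, 3))
X = np.column_stack([np.ones(n), X0])        # design matrix with intercept
beta_true = np.array([1.0, 2.0, -1.5, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Equation (2.3): OLS estimator, beta_hat = (X'X)^{-1} X'y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Equation (2.4): Var(beta_hat) = sigma_hat^2 (X'X)^{-1},
# with sigma_hat^2 = residual sum of squares over the residual degrees of freedom (n - p in the paper).
residuals = y - X @ beta_ols
sigma2_hat = residuals @ residuals / (n - X.shape[1])
var_beta_ols = sigma2_hat * XtX_inv

print("OLS estimates:  ", beta_ols)
print("Standard errors:", np.sqrt(np.diag(var_beta_ols)))
```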
Lukman et al. (2020) reported that the OLS estimator is the Best Linear Unbiased Estimator (BLUE) provided the multicollinearity among the explanatory variables is not perfect. However, the estimator is still not efficient when there is multicollinearity (Chatterjee and Price, 1977; Gujarati, 2005; Lukman et al., 2019; Lukman et al., 2020). Among the estimators commonly used to address this problem are the estimator based on Principal Component Regression developed by Massy (1965), the Ridge Regression Estimator (RRE) (Hoerl and Kennard, 1970), the estimator based on Partial Least Squares (PLS) Regression (Wold, 1985), the Liu estimator (Liu, 1993) and, recently, the KL estimator by Kibria and Lukman (2020).

The estimator based on Principal Component Regression requires reducing the explanatory variables to a smaller number of summary variables, often called Principal Components (or factors), that explain most of the variation in the data. Chatterjee and Price (1977) summarized the algorithm of the procedure using the correlation matrix as follows:

i. Standardize both the response and the explanatory variables. The standardized variables respectively become:

$$y_i^1 = \frac{y_i - \bar{y}}{s_y} \qquad (2.5)$$

where $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n}$ and $s_y = \sqrt{\dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}}$, and

$$x_{ij}^1 = \frac{x_{ij} - \bar{x}_j}{s_{x_j}} \qquad (2.6)$$

where $\bar{x}_j = \dfrac{\sum_{i=1}^{n} x_{ij}}{n}$ and $s_{x_j} = \sqrt{\dfrac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}{n}}$ for $j = 1, 2, 3, \dots, p$.

ii. Obtain the correlation matrix of the explanatory variables, say $R$. Thus,

$$R = \begin{bmatrix} 1 & r_{12} & r_{13} & r_{14} & \cdots & r_{1p} \\ r_{12} & 1 & r_{23} & r_{24} & \cdots & r_{2p} \\ r_{13} & r_{23} & 1 & r_{34} & \cdots & r_{3p} \\ r_{14} & r_{24} & r_{34} & 1 & \cdots & r_{4p} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{1p} & r_{2p} & r_{3p} & r_{4p} & \cdots & 1 \end{bmatrix} \qquad (2.7)$$

where $r_{jk} = \dfrac{\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)}{\sqrt{\left(\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2\right)\left(\sum_{i=1}^{n}\left(x_{ik}-\bar{x}_k\right)^2\right)}}$, $j = 1, 2, 3, \dots, p$; $k = 1, 2, 3, \dots, p$.

iii. Calculate the eigenvalues and hence the eigenvectors of the correlation matrix, say $W$. These become the coefficients (weights) attached to the standardized explanatory variables. Thus,

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} & w_{14} & \cdots & w_{1p} \\ w_{21} & w_{22} & w_{23} & w_{24} & \cdots & w_{2p} \\ w_{31} & w_{32} & w_{33} & w_{34} & \cdots & w_{3p} \\ w_{41} & w_{42} & w_{43} & w_{44} & \cdots & w_{4p} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{1p} & w_{2p} & w_{3p} & w_{4p} & \cdots & w_{pp} \end{bmatrix} \qquad (2.8)$$

iv. Obtain the Principal Components (PCs) using the linear equation given as:

$$PC_{ij} = \sum_{k=1}^{p} w_{kj}\, x_{ik}^1, \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, p \qquad (2.9)$$

These PCs are $p$ orthogonal variables.

v. Determine the number of PCs to use: since there are $p$ of them, and using all of them would reproduce the OLS results, a particular number has to be chosen. Kaiser (1960) suggested using the PCs whose eigenvalues exceed one. Constantin (2014) suggested using any component that accounts for at least 5% or 10% of the total variance. Other authors emphasized that enough components should be selected so that the cumulative percentage of variance accounted for is high, between 80% and 90% (Naes and Marten, 1988; Ali and Robert, 1998; Abdul-Wahab et al., 2005).

vi. Run the OLS regression (without intercept) of the standardized dependent variable on the selected $m$ PCs. Thus,

$$y_i^1 = f(PC_1, PC_2, \dots, PC_m) \qquad (2.10)$$

and obtain the estimates of the regression coefficients as $\hat{\gamma}_1, \hat{\gamma}_2, \dots, \hat{\gamma}_m$.

vii. Obtain the true regression coefficients of the linear model (2.1 or 2.2) based on the Principal Component Regression method using:

$$\hat{\beta}_{PCRj} = \frac{s_y}{s_{x_j}}\left[\sum_{t=1}^{m} w_{jt}\,\hat{\gamma}_t\right], \quad j = 1, 2, \dots, p \qquad (2.11)$$

Also, the intercept can be obtained using:

$$\hat{\beta}_{PCR0} = \bar{y} - \sum_{j=1}^{p} \hat{\beta}_{PCRj}\,\bar{x}_j \qquad (2.12)$$
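The seven steps above reduce to a few lines of linear algebra. The sketch below is ours, not from the paper: it assumes the divisor $n$ in the standardization of (2.5)-(2.6) and, when $m$ is not supplied, retains the components with eigenvalue greater than one (Kaiser's rule from step v); the function and variable names are illustrative.

```python
import numpy as np

def pcr_coefficients(X, y, m=None):
    """Conventional Principal Component Regression via the correlation matrix (steps i-vii).

    X : (n, p) array of explanatory variables
    y : (n,) response vector
    m : number of PCs to retain; if None, keep the PCs with eigenvalue > 1 (Kaiser, 1960)
    Returns the intercept (2.12) and the slope coefficients (2.11) on the original scale.
    """
    # Step i: standardize the response and the regressors (divisor n, as in (2.5)-(2.6)).
    x_bar, s_x = X.mean(axis=0), X.std(axis=0)
    y_bar, s_y = y.mean(), y.std()
    Xs = (X - x_bar) / s_x
    ys = (y - y_bar) / s_y

    # Step ii: correlation matrix R of the regressors (2.7).
    R = np.corrcoef(X, rowvar=False)

    # Step iii: eigenvalues and eigenvectors (the weight matrix W) of R (2.8).
    eigvals, W = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
    eigvals, W = eigvals[order], W[:, order]

    # Step v: choose how many components to keep.
    if m is None:
        m = int(np.sum(eigvals > 1.0))

    # Step iv: principal components, PC = Xs W (2.9); keep the first m columns.
    PC = Xs @ W[:, :m]

    # Step vi: OLS of the standardized response on the selected PCs, no intercept (2.10).
    gamma_hat = np.linalg.lstsq(PC, ys, rcond=None)[0]

    # Step vii: back-transform to the original scale, (2.11) and (2.12).
    beta = (s_y / s_x) * (W[:, :m] @ gamma_hat)
    beta0 = y_bar - beta @ x_bar
    return beta0, beta
```

Note that the sign of each eigenvector is arbitrary; a sign flip in a column of $W$ is absorbed by the corresponding $\hat{\gamma}_t$, so the back-transformed coefficients in (2.11) are unchanged.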
Alternatively, reconsider model (2.2). Then $X'X$ is a $(p \times p)$ matrix whose eigenvectors form the $(p \times p)$ orthogonal matrix $T = (t_1, t_2, \dots, t_p) = [T_r, T_{p-r}]$. Utilizing this and following the studies of Ozkale (2009), Batah et al. (2009), and Chang and Yang (2012), (2.2) can be written as:

$$y = XTT'\beta + u = XT_rT_r'\beta + XT_{p-r}T_{p-r}'\beta + u = Z_r\alpha_r + Z_{p-r}\alpha_{p-r} + u \qquad (2.13)$$

where $Z_r = XT_r$, $\alpha_r = T_r'\beta$, $Z_{p-r} = XT_{p-r}$, and $\alpha_{p-r} = T_{p-r}'\beta$.

Theorem 1: The Principal Component Regression Estimator of a linear regression model $y = X\beta + u$, where $y$ is the $(n \times 1)$ vector of observations of the dependent variable, $X$ is the $(n \times p)$ matrix of observations of the explanatory variables, $\beta$ is the $(p \times 1)$ vector of regression coefficients, and $u$ is the $(n \times 1)$ vector of error terms assumed to follow a multivariate normal distribution with mean zero and variance $\sigma^2 I_n$, which uses $r$ Principal Components, $r \le p$, can be obtained as $\hat{\beta}_{PCR} = (X'X)^{-1}X'\hat{y}_r$, where $\hat{y}_r$ is the predicted variable when $y$ is regressed on the $r$ principal components.

Proof: Recall that the estimator of the parameters of the Principal Component Regression that uses $r$ Principal Components is given in (2.14) as:

$$\hat{\alpha}_{PCR} = (Z_r'Z_r)^{-1}Z_r'y = (T_r'X'XT_r)^{-1}T_r'X'y \qquad (3.1)$$

Then, the predicted value of the dependent variable becomes:

$$\hat{y}_r = Z_r\hat{\alpha}_{PCR} = XT_r(T_r'X'XT_r)^{-1}T_r'X'y \qquad (3.2)$$
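As a quick numerical check of the equivalence stated in Theorem 1 (this sketch is ours and uses simulated collinear data rather than the real-life data sets analysed in the paper), the coefficients obtained by regressing the PC-based fitted values $\hat{y}_r$ back on $X$ coincide with the conventional PCR coefficients $T_r\hat{\alpha}_{PCR}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated collinear regressors: x3 is nearly a linear combination of x1 and x2.
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.3 * x2 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

# Orthogonal matrix T of eigenvectors of X'X; keep the first r columns (T_r).
eigvals, T = np.linalg.eigh(X.T @ X)
T = T[:, np.argsort(eigvals)[::-1]]
r = 2
T_r = T[:, :r]
Z_r = X @ T_r

# Conventional PCR, equation (3.1): alpha_hat = (Z_r' Z_r)^{-1} Z_r' y, then beta = T_r alpha_hat.
alpha_hat = np.linalg.solve(Z_r.T @ Z_r, Z_r.T @ y)
beta_conventional = T_r @ alpha_hat

# New approach (Theorem 1): OLS of the fitted values y_hat_r from (3.2) on the original X.
y_hat_r = Z_r @ alpha_hat
beta_new = np.linalg.solve(X.T @ X, X.T @ y_hat_r)   # (X'X)^{-1} X' y_hat_r

print(np.allclose(beta_conventional, beta_new))      # True: the two routes give the same estimates
```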