New Diagnostic Methods for Influential Observations in Linear Regression with some Biased Estimation Methods
A Ph.D. Dissertation By Muhammad Kashif Roll No.: PHDS-11-05 Session: 2011-2016
DEPARTMENT OF STATISTICS Bahauddin Zakariya University Multan - Pakistan 2019 New Diagnostic Methods for Influential Observations in Linear Regression with some Biased Estimation Methods
A Thesis submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in STATISTICS by Muhammad Kashif (Roll No. PHDS-11-05) Session: 2011-2016
SUPERVISED by Prof. Dr. Muhammad Aman Ullah
CO-SUPERVISED by Dr. Muhammad Aslam
Department of Statistics Bahauddin Zakariya University Multan
2019
New Diagnostic Methods for Influential Observations in Linear
Regression with some Biased Estimation Methods
by
Muhammad Kashif
A thesis
submitted to the Department of Statistics,
Bahauddin Zakariya University, Multan
in fulfillment of the requirements for the Degree of
Doctor of Philosophy
in
Statistics
2019
Author's Declaration
I, Muhammad Kashif, hereby state that my Ph.D. (Statistics) thesis titled New Diagnostic Methods for Influential Observations in Linear Regression with some Biased Estimation Methods is my own work and has not been submitted previously by me for taking any degree from this university (Bahauddin Zakariya University, Multan, Pakistan) or anywhere else in the country/world. If at any time my statement is found to be incorrect, even after my graduation, the University has the right to withdraw my Ph.D. (Statistics) degree.
Muhammad Kashif Date: 27-07-2019
Plagiarism Undertaking
I solemnly declare that the research work presented in the thesis titled New Diagnostic Methods for Influential Observations in Linear Regression with some Biased Estimation Methods is solely my research work, with no significant contribution from any other person. Small contributions/help, wherever taken, have been duly acknowledged, and the complete thesis has been written by me.
I understand the zero-tolerance policy of the HEC and Bahauddin Zakariya University, Multan, Pakistan towards plagiarism. Therefore, I, as an author of the above-titled thesis, declare that no portion of my thesis has been plagiarized and that any material used as reference is properly referred/cited.
I undertake that if I am found guilty of any formal plagiarism in the above-titled thesis, even after the award of the Ph.D. (Statistics) degree, the University reserves the right to withdraw/revoke my Ph.D. (Statistics) degree, and the HEC and the University have the right to publish my name on the HEC/University website on which the names of students who submitted plagiarized theses are placed.
Student's Signature: Name: Muhammad Kashif
This thesis is dedicated to The Holy Prophet Hazrat Muhammad (S.A.W.W)
(Whose teaching enlightened my heart and flourished my thoughts)
Acknowledgements
First and foremost, praises and thanks to Almighty ALLAH for giving me this opportunity, the strength and the patience to complete my dissertation finally, after all the challenges and difficulties. My special praise for the messenger of Allah, the Holy Prophet Hazrat MUHAMMAD (S.A.W.W), the greatest educator, the everlasting source of guidance and knowledge for humanity. He taught the principles of morality and eternal values. I deem it a great honor to express my deep and sincere gratitude to my honorable and estimable supervisor, Prof. Dr. Muhammad Aman Ullah, Professor at the Department of Statistics, Bahauddin Zakariya University, Multan, for his guidance and invaluable advice. His constructive comments and suggestions throughout the thesis work have contributed to the success of this research. His timely and efficient contribution helped me shape this thesis into its final form, and I express my sincere appreciation for his assistance in any way that I may have asked. I consider it my privilege to have accomplished this thesis under his guidance. I feel great pleasure in expressing my sincerest gratitude to my co-supervisor, Dr. Muhammad Aslam, Associate Professor and Chairman, Department of Statistics, Bahauddin Zakariya University, Multan, for his detailed review, constructive suggestions, important support and excellent advice during the preparation of this thesis. I would like to express my gratitude to all the esteemed faculty members and staff of the Department of Statistics, Bahauddin Zakariya University, Multan. Finally, I wish to thank my loving parents, sisters, brothers and my wife for their prayers, encouragement and support, spiritual, emotional, intellectual and otherwise.
Muhammad Kashif
Abstract
This thesis is concerned with the development of diagnostic methods for parametric regression models under some biased estimators, namely the Liu estimator, the modified ridge estimator, the improved Liu estimator and the ridge estimator, which have been developed as alternatives to the ordinary least squares estimator in the presence of multicollinearity in linear regression models. Firstly, we introduce a version of Pena's statistic for each point in Liu regression. Using the forecast change property, we simplify Pena's statistic in a numerical sense. The simplified Pena's statistic is found to behave quite well as far as the detection of influential observations is concerned.
We express Pena's statistic in terms of the Liu leverages and residuals. For numerical evaluation, simulation studies are given and a real data set has been analyzed for illustration. Secondly, we formulated Pena's statistic for each point while considering the modified ridge regression estimator. Using this statistic, we showed that when modified ridge regression was used to mitigate the effects of multicollinearity, the influence of some observations could change significantly. The normality of this statistic was also discussed, and it was proved that it could detect a subset of high modified ridge leverage outliers. Monte Carlo simulations were used for empirical results and an example with real data was presented for illustration. Next,
we introduce a type of Pena’s statistic for each point in the improved Liu estimator.
Using this statistic, we showed that when the improved Liu estimator was used to
mitigate the effects of multicollinearity, the influence of some observations could be
significantly changed. The Monte Carlo simulations were used for empirical results
and an example of real data was presented for illustration. The ridge estimator has growing and ever wider applications in statistical data analysis as an alternative to the ordinary least squares estimator for combating multicollinearity in linear regression models. In regression diagnostics, a large number of influence diagnostic methods based on numerous statistical tools have been discussed. Finally, we focus on a ridge version of the Nurunnabi et al. (2011) method for the identification of multiple influential observations in linear regression. The efficiency of the proposed method is demonstrated through several well-known data sets, an artificial large high-dimensional data set with a heterogeneous sample, and a Monte Carlo simulation study.
List of Symbols and Abbreviations
Abbreviation/Symbol   Description
b                     Prior information vector
β                     Vector of slope coefficients of X1
β̂d                    Liu estimate
β̂(k,b)                Modified ridge estimate
β̂K,D                  Improved Liu estimate
β̂k                    Ridge estimate
BACON                 Blocked adaptive computationally efficient outlier nominator
CIP                   Correct identification in percentage
d                     Biasing parameter
Di                    Cook's distance
Dd,i                  Cook's distance with Liu estimate
D(k,b)i               Cook's distance with modified ridge estimate
DK.D;i                Cook's distance with improved Liu estimate
DR,i                  Cook's distance with ridge estimate
DFFITS(i)             Difference of fits test
DFFITSR(i)            Difference of fits test with ridge estimate
H                     Hat matrix
Hd                    Hat matrix with Liu estimate
H(k,b)                Hat matrix with modified ridge estimate
HK,D                  Hat matrix with improved Liu estimate
HR                    Hat matrix with ridge estimate
hii                   Leverages
hd,i                  Leverages with Liu estimate
h(k,b)i               Leverages with modified ridge estimate
hK.D;i                Leverages with improved Liu estimate
hRii                  Leverages with ridge estimate
I                     Identity matrix
ILE                   Improved Liu estimator
k                     Ridge parameter
LE                    Liu estimator
MAD                   Median absolute deviation
MLR                   Multiple linear regression
MRR                   Modified ridge regression
Mi                    Nurunnabi's measure
MR,i                  Nurunnabi's measure with ridge estimate
OLS                   Ordinary least squares
RR                    Ridge regression
Si                    Pena's statistic
Sd,i                  Pena's statistic with Liu estimate
S(k,b)i               Pena's statistic with modified ridge estimate
SK,D;i                Pena's statistic with improved Liu estimate
SR,i                  Pena's statistic with ridge estimate
X1                    Centered and standardized matrix of variables
y                     Response vector
Contents
List of Symbols and Abbreviations xii
List of Tables xii
List of Figures xiii
1 Introduction 1
  1.1 Presentation 1
  1.2 Significance 5
  1.3 A brief sketch of the Research 6
2 Influence Diagnostic Measures in Linear Regression 8
  2.1 Introduction 8
  2.2 The Linear Model and OLS Estimation 9
  2.3 Residuals and Hat Matrix in Linear Regression 10
  2.4 Diagnostic Approaches in Linear Regression with No Multicollinearity 11
3 Pena's statistic for the Liu regression (Kashif et al. (2018)) 17
  3.1 Introduction 17
  3.2 Pena's statistic 19
    3.2.1 Pena's statistic using the LE 19
    3.2.2 Properties of Pena's statistic for the Liu Regression 21
  3.3 Simulation study 28
    3.3.1 Normality of Proposed measure 28
    3.3.2 Detection of intermediate and high Liu leverage outliers 33
    3.3.3 The Monte Carlo simulation 37
  3.4 Longley data 40
  3.5 Conclusions 43
4 Pena's statistic for the Modified Ridge Regression 47
  4.1 Introduction 47
  4.2 Pena's statistic 49
    4.2.1 Pena's statistic using the MRR 49
    4.2.2 Properties of Pena's statistic for the MRR 53
  4.3 Simulation 60
    4.3.1 Normality of Proposed measure 60
    4.3.2 Detection of intermediate and high modified ridge leverage outliers 65
    4.3.3 Performance of Pena's statistic 69
  4.4 Illustration 70
  4.5 Conclusions 74
5 Pena's statistic for the Improved Liu Estimator 75
  5.1 Introduction 75
  5.2 Pena's statistic 78
    5.2.1 Pena's statistic using the ILE 78
    5.2.2 Properties of Pena's statistic for the ILE 80
  5.3 Simulation 87
    5.3.1 Normality of the Proposed measure 87
    5.3.2 Detection of intermediate and high improved Liu leverage outliers 92
    5.3.3 Performance of Pena's statistic 96
  5.4 Illustration 98
  5.5 Conclusions 101
6 A New Diagnostic Method for Influential Observations in Ridge Regression 102
  6.1 Introduction 102
  6.2 Ridge Regression 105
  6.3 Diagnostic Methods in Ridge Regression 105
    6.3.1 Leverage and Residuals 105
    6.3.2 Cook's distance, DFFITS and Pena's measure 106
  6.4 Proposed Diagnostic Method 108
  6.5 Examples 111
    6.5.1 Longley Data 111
    6.5.2 Tobacco Data 114
    6.5.3 Artificial high dimensional large data set containing heterogeneous cases 117
  6.6 Simulation Findings 119
  6.7 Conclusions 121
7 Conclusions 122
References 125
A Published/Submitted Research Work from Ph.D. Thesis 132
B Data Sets 133
List of Tables
2.1 Influence Diagnostic Measures in Linear Regression 16
3.1 Three data sets and proposed Diagnostic Measures 34
3.2 The Percent of outlier identification by Sd,i in simulation 38
3.3 The Percent of outlier identification by Dd,i in simulation 39
3.4 Five largest values of Sd,i for d = 0.1, 0.5, 0.9 and 1.0 (OLS Case) for the Longley data 42
4.1 Three data sets and proposed Diagnostic Measures 66
4.2 Percent of outlier identification by applying S(k,b)i in the Simulation 70
4.3 Percent of outlier identification by means of D(k,b)i in the Simulation 71
4.4 Five largest observations of S(k,b)i for the Longley data 72
5.1 Three data sets and proposed Diagnostic Measures 93
5.2 Percent of outlier identification through SK.D;i in simulation 97
5.3 Percent of outlier identification through DK.D;i in simulation 98
5.4 Five largest values of SK.D;i for Optimal value of d for the Longley data 99
6.1 Measures of influences for Longley data 112
6.2 Measures of influences for Tobacco blends data set 115
6.3 Simulation results 120
B.1 Longley Data (Longley, 1967) 133
B.2 Tobacco Data (Myers, 1986) 134
List of Figures
3.1 Influence investigation of the three generated data sets with various multicollinearity independent variables. Normal q-q plots of the proposed measure 29
3.2 Influence investigation of the three generated data sets with various multicollinearity variables. Normal q-q plots of Cook's distance 30
3.3 Influence investigation of the three generated data sets with various multicollinearity regressors. Plots of Cook's distance against the proposed measure 31
3.4 Influence investigation of the three generated data sets with various multicollinearity predictors. Plots of the proposed measure versus observations 32
3.5 Graphs of Cook's measure against the suggested diagnostic measure in three situations: (a) no outliers, (b) high Liu leverage outliers and (c) middle Liu leverage outliers 35
3.6 Plots of the proposed measure versus observation in three situations: (a) no outliers, (b) high Liu leverage outliers and (c) central Liu leverage outliers 36
3.7 Histograms of Sd,i of the Longley data for d = 0.1, 0.5, 0.9 and 1.0 44
3.8 Scatter plots of Dd,i against Sd,i for d = 0.1, 0.5, 0.9 and 1.0 45
3.9 Plots of Sd,i against observation number for d = 0.1, 0.5, 0.9 and 1.0 46
4.1 Influence investigation of the three generated data sets with various multicollinearity regressors. Normal q-q plots of the proposed measure 61
4.2 Influence investigation of the three generated data sets with various multicollinearity regressors. Normal q-q plots of Cook's measure 62
4.3 Impact investigation of the three generated data sets with various multicollinearity regressors. Graphs of Cook's distance against the proposed measure 63
4.4 Impact investigation of the three generated data sets with various multicollinearity regressors. Plots of the proposed measure versus observation number 64
4.5 Graphs of Cook's measure against the proposed measure in three situations: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 middle modified ridge leverage outliers 67
4.6 Graphs of the proposed measure against observation in three situations: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 middle modified ridge leverage outliers 68
4.7 Impact assessment of the Longley data set: (a) histogram of the proposed measure; (b) graph of Cook's distance versus the proposed measure; (c) plot of the proposed measure versus cases 73
5.1 Influence investigation of the three generated data sets with various multicollinearity independent variables. Normal q-q plots of the proposed measure 88
5.2 Influence investigation of the three generated data sets with various multicollinearity independent variables. Normal q-q plots of Cook's distance 89
5.3 Influence investigation of the three generated data sets with various multicollinearity independent variables. Graphs of Cook's measure against the proposed measure 90
5.4 Influence investigation of the three generated data sets with various multicollinearity independent variables. Plots of the proposed measure versus observation number 91
5.5 Graphs of Cook's measure against the proposed measure in three situations: (a) no outliers, (b) 3 high improved Liu leverage outliers and (c) 3 middle improved Liu leverage outliers 94
5.6 Plots of the proposed measure versus observation in three situations: (a) no outliers, (b) 3 high improved Liu leverage outliers and (c) 3 middle improved Liu leverage outliers 95
5.7 Influence examination of the Longley data set: (a) histogram of the proposed measure; (b) graph of Cook's distance versus the proposed measure; (c) plot of the proposed measure versus cases 100
6.1 Index plots for Longley data: (a) DR,i measure; (b) DFFITS; (c) Pena's measure SR,i; and (d) suggested influence diagnostic measure MR,i 113
6.2 Index graphs of Tobacco blends data: (a) DR,i measure; (b) DFFITS; (c) Pena's measure SR,i; and (d) proposed diagnostic measure MR,i 116
6.3 Influence investigation of the big data set with high dimensions and heterogeneous sample cases: (a) residuals against fitted plot; (b) histogram of the residuals; (c) index graph of Cook's distance; (d) DFFITS; (e) Pena's statistic; (f) histogram of Pena's statistic; (g) histogram of the proposed measure; (h) index plot of the proposed measure 118
Chapter 1
Introduction
1.1 Presentation
In fitting a multiple linear regression (MLR) model, it is common to apply the ordinary least squares (OLS) technique, largely because of its simplicity of estimation. It is a well-known fact that the OLS estimator achieves its desirable properties only if certain assumptions about the error term hold. If these assumptions are violated, the estimator and the subsequent analyses are affected. Moreover, in applying the method of OLS, we observe that the parameter estimates can be significantly influenced by one or more observations in the data set. This implies that not all observations have equal importance in linear regression, and the same holds for decisions drawn from an analysis (Chatterjee and Hadi, 1986). An extreme value in the explanatory variable(s) can produce an influential observation. Influential observations are values which strongly affect the parameter estimates and fitted values. The inclusion or exclusion of such observations may considerably change the estimated values of the parameters and the further analyses. Therefore, OLS regression can be made a practicable statistical methodology by giving attention to all possible influential observations in the data set, not only to estimation.
According to Chatterjee and Hadi (1988), two approaches are available in the literature for this purpose. The first approach is the case omission technique, which is widely applied for influence analysis in linear regression models; see, e.g., Cook (1977), Belsley et al. (1980), Montgomery et al. (2001), Chatterjee and Hadi (2006), Ullah and Pasha (2009), and the references cited therein. In this approach, one investigates how the various measures involved in a regression analysis change when some of the observations are omitted from the given data set. The other approach is the local influence technique, recommended by Cook (1986) as an alternative to the case omission approach. He describes a measure, called local influence, which depends on differentiation instead of point deletion. This approach has been considered by numerous researchers, for example, Thomas and Cook (1989), Thomas and Cook (1990), Tsai and Wu (1992), Zhao and Lee (1998), Liu (2000), Ortega et al. (2003), Tanaka et al. (2003), Prendergast (2006), Paula and Cysneiros (2010), Cancho et al. (2010) and various others. Comprehensive discussions of both approaches to influence analysis can be found in Cook (1979), Belsley et al. (1980), Cook and Weisberg (1982), Chatterjee and Hadi (1988), Rancel and Sierra (2001), Ullah and Pasha (2009), Amanullah et al. (2013a) and Amanullah et al. (2013b).
Pena (2005) presented another influence measure, which identifies influential observations in a distinctly different way. Instead of assessing the effect of case deletion on the parameter estimates, his approach measures how each case is affected by the remaining data set. Nurunnabi et al. (2011) extended the idea of Pena (2005) to group omission and introduced another measure to detect multiple influential cases. Their proposed technique comprises two stages. In the first step, they detect the suspected influential observations. It is not easy to detect all the influential observations at this step because of the masking and swamping problems; if any suspect case is left in the data, the detection procedure becomes very awkward. So they try to ensure that all potentially influential observations are flagged as suspects while, at the same time, no innocent points are incorrectly omitted, since the omission of such cases, particularly innocent high-leverage ones, might adversely affect the whole influence framework (Habshah et al., 2009). To this end, they utilize the blocked adaptive computationally efficient outlier nominator (BACON) (Billor et al., 2000). In the second stage, they use a group omission version of Pena's measure to confirm whether the suspected observations are really influential or not.
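To make the idea concrete, the following is a minimal numerical sketch of the single-case statistic as we read Pena (2005): for each case i, a vector s_i collects the changes in the forecast of case i when each point j is deleted, using the standard identity ŷᵢ − ŷᵢ(j) = hᵢⱼeⱼ/(1 − hⱼⱼ), and Sᵢ normalizes sᵢ′sᵢ by p·s²·hᵢᵢ. The simulated data and all variable names are illustrative only, not taken from the thesis.

```python
import numpy as np

# Illustrative regression data (not from the thesis).
rng = np.random.default_rng(4)
n, q = 25, 2
Z = np.column_stack([np.ones(n), rng.standard_normal((n, q))])
y = Z @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

p = q + 1
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # hat matrix
h = np.diag(H)                          # leverages
e = y - H @ y                           # OLS residuals
s2 = e @ e / (n - p)                    # residual variance estimate

# Pena's statistic: for each case i, collect the change in its forecast when
# every point j is deleted, yhat_i - yhat_i(j) = h_ij * e_j / (1 - h_jj),
# then normalize: S_i = s_i' s_i / (p * s^2 * h_ii).
S = np.empty(n)
for i in range(n):
    s_i = H[i, :] * e / (1.0 - h)       # forecast changes for case i
    S[i] = s_i @ s_i / (p * s2 * h[i])
```

Cases whose Sᵢ lies far from the bulk of the values are candidates for further scrutiny; the group omission version of Chapter 6 builds on the same quantities.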
The measures used to identify influential observations under the OLS method are easily obtained because the model is simple. However, these measures are not immediately applicable to other biased estimators, which may involve different model assumptions, additional parameters or other complications. Consequently, separate investigation is required for these estimators.
The present research work extends the Pena (2005) measure and the Nurunnabi et al. (2011) group omission measure for detecting multiple influential observations in MLR models. Recently, interest has increased in investigating the features of a given data set under several alternatives to the OLS estimator. Four of the issues that arise in this setting are as follows:
1. The Liu estimator (LE), suggested by Liu (1993), is a commonly used estimator for combating multicollinearity in MLR models. There is a substantial literature justifying the use of the LE in influence diagnostic analysis (Ozkale and Kaciranlar, 2007; Yang et al., 2009; Gruber, 2012; Jahufer and Jianbao, 2012; Jahufer, 2013; Amanullah et al., 2013a). So the use of the Pena (2005) measure along with the LE to identify multiple influential observations is also required in sensitivity analysis.
2. The use of modified ridge regression (MRR; Swindel, 1976) as an alternative to the OLS technique in influence diagnostic analysis for the MLR model has been considered by various researchers (Jahufer and Jianbao, 2009; Jahufer and Jianbao, 2011; Turkan and Toktamis, 2012; Amanullah et al., 2013b). It is also needed to use MRR with the Pena (2005) measure in MLR models.
3. The improved Liu estimator (ILE), suggested by Liu, X. Q. (2011), has been proposed as an alternative to the LE with some better properties. The use of this estimator has not received much attention from researchers in influence diagnostic analysis. We therefore also propose the use of the ILE with the Pena (2005) measure to detect multiple influential cases in MLR models.
4. The issue of finding influential observations using ridge regression (RR; Hoerl and Kennard, 1970) in MLR models as an alternative to the OLS technique has received considerable attention (Walker and Birch, 1988; Billor and Loynes, 1999; Turkan and Toktamis, 2012). The Nurunnabi et al. (2011) group deletion measure for identifying multiple influential observations in MLR models with ridge regression (RR) also needs attention.
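As a concrete illustration of two of the biased estimators named in the list above, the following hedged sketch computes the ridge estimator of Hoerl and Kennard (1970) and the Liu estimator of Liu (1993) in their standard textbook forms on a hypothetical near-collinear design. The data, and the choices k = 0.5 and d = 0.5, are illustrative only; the thesis chapters use data-driven choices of these biasing parameters.

```python
import numpy as np

# Hypothetical multicollinear design: the second regressor is nearly a
# copy of the first, so the OLS solution is unstable.
rng = np.random.default_rng(3)
n = 30
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # centered and standardized
y = X @ np.array([1.0, 1.0]) + rng.standard_normal(n)
y = y - y.mean()

XtX, Xty = X.T @ X, X.T @ y
beta_ols = np.linalg.solve(XtX, Xty)

# Ridge estimator (Hoerl and Kennard, 1970): beta_k = (X'X + kI)^{-1} X'y.
k = 0.5
beta_ridge = np.linalg.solve(XtX + k * np.eye(2), Xty)

# Liu estimator (Liu, 1993): beta_d = (X'X + I)^{-1} (X'y + d * beta_ols).
d = 0.5
beta_liu = np.linalg.solve(XtX + np.eye(2), Xty + d * beta_ols)

# Both biased estimators shrink the OLS solution in every eigen-direction
# of X'X, trading a little bias for a large reduction in variance.
```

For k > 0 and 0 < d < 1 the shrinkage factors λ/(λ + k) and (λ + d)/(λ + 1) are strictly below one in each eigen-direction, which is why both estimators stabilize the coefficients when λ_min of X′X is near zero.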
In our research work, we address the aforementioned issues; they provide the principal root and motivation for the present investigation. We analyze our work on several well-known data sets, a high-dimensional large artificial data set and a Monte Carlo simulation scheme. The significance and scope of our research work are as follows.
1.2 Significance
Evaluating case omission via the Pena (2005) measure and the Nurunnabi et al. (2011) group deletion measure for the identification of multiple influential observations in MLR models with the biased estimators LE, MRR, ILE and RR helps analysts to assess the appropriateness of these influence diagnostic measures. The importance of the matters raised in the previous section stems from the following:

1. The identification of influential observations by means of influence diagnostic techniques, and the subsequent investigation, provides information on the reliability of conclusions and their dependence on the assumed statistical model.
2. The capability and efficiency of case omission via the Pena (2005) measure and the Nurunnabi et al. (2011) group deletion measure for identifying multiple influential observations in MLR models do not remain the same when we use the LE, MRR, ILE and RR to mitigate the effect of multicollinearity instead of the OLS method.
The goal of this research is to examine the degree to which diagnostic values for LE,
MRR, ILE and RR are influenced as we increase control over multicollinearity.
1.3 A brief sketch of the Research
This thesis focuses on the problem of biased estimation of the MLR model in the presence of multicollinearity in the data. Since we use four biased estimators, namely the LE, MRR, ILE and RR, we break our study into the segments stated above, one for each biased estimator.
After reviewing the available diagnostic methods, we formulate them using the biased estimators, with the aim of comparing the performance of the proposed methods with methods based on the OLS estimator. For this purpose, we propose LE, MRR, ILE and RR versions of the diagnostic measures.
In Chapter 2, the definition and estimation of the MLR model are introduced. The MLR residuals, leverages, hat matrix and influence diagnostic measures are presented in this part.
In Chapter 3, we investigate an alternative technique, the LE, which is broadly used and applied in data analysis. In this chapter, different diagnostic measures are derived in an attempt to cover the as yet unexplored diagnostic areas.
In Chapter 4, we study influence diagnostic methods in MLR models with MRR, which is based on prior information. We must address the estimation of the additional parameter k; consequently, the influence diagnostic techniques have to be examined in this new setting as well.
In Chapter 5, we focus on the Pena (2005) influence diagnostic in MLR models with the ILE.
In Chapter 6, we propose ridge versions of diagnostic measures, tailored for multicollinear data by using the residuals of the ridge estimator instead of the customary OLS residuals. We consider multicollinearity in all MLR models. In the formulation of diagnostic measures, residuals obtained from OLS have traditionally been used; in this chapter, following Walker and Birch (1988), we substitute the ridge estimate and its residuals into the different diagnostic measures. Our main focus is to check the performance of these measures in the presence of multicollinearity using ridge regression.
Finally, Chapter 7 is reserved for a summary and concluding remarks on the results derived from all the chapters. All the calculations require extensive computer programs; in the Monte Carlo simulation schemes we deal with a large number of observations, for which purpose we write programs in E-Views, Matlab and the R language.
Chapter 2
Influence Diagnostic Measures in
Linear Regression
2.1 Introduction
MLR models are among the richest tools, with an extensive range of applications across the sciences over the past decades, for explaining the relationship between independent and dependent variables. The OLS method has been broadly used as the standard technique in MLR analysis; under its assumptions, the estimates are unbiased. It is notable, however, that numerical measures based on OLS regression can be affected by one or a few influential cases. Techniques more robust than OLS have received great attention in the available literature; nevertheless, because of its elegant theory and simplicity of calculation, OLS has retained its prominent role in regression analysis.
2.2 The Linear Model and OLS Estimation
Following Walker and Birch (1988), the usual MLR model can be written as
y = 1β0 + X1β1 + ε, (2.1)
where y is an n × 1 vector of the response variable, 1 is an n × 1 vector of ones, β0 is an unknown parameter, X1 is an n × q centered and standardized matrix of explanatory variables, β1 is a q × 1 vector of unknown parameters, and ε is an n × 1 vector of unobservable random errors with zero mean and constant variance σ². If Z = (1, X1), then Eq. (2.1) may be expressed as
y = Zβ + ε (2.2)
For the estimation of the parameter β in Eq. (2.2), the OLS procedure is used under the hypothesis that the errors are normally distributed. The OLS technique minimizes the sum of squared errors:
S(β) = Σᵢ₌₁ⁿ εᵢ² = ε′ε = (y − Zβ)′(y − Zβ),

S(β) = y′y − 2β′Z′y + β′Z′Zβ. (2.3)
Using Eq. (2.3), we get
∂S/∂β = −2Z′y + 2Z′Zβ = 0,
and then the unbiased estimator β̂ of β is defined as

β̂ = (Z′Z)⁻¹Z′y.
It is well known that the variance of β̂ may be written as

Var(β̂) = σ²(Z′Z)⁻¹.
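The derivation above can be checked numerically. The sketch below, with simulated illustrative data (not from the thesis), solves the normal equations for β̂ and forms the residual variance s² and the estimated Var(β̂); NumPy is assumed.

```python
import numpy as np

# Simulated illustrative data: n observations, q centered/standardized regressors.
rng = np.random.default_rng(0)
n, q = 30, 2
X1 = rng.standard_normal((n, q))
X1 = (X1 - X1.mean(axis=0)) / X1.std(axis=0)   # center and standardize X1
Z = np.column_stack([np.ones(n), X1])          # Z = (1, X1)
y = Z @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.standard_normal(n)

# OLS estimate beta_hat = (Z'Z)^{-1} Z'y via the normal equations.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Residuals and the variance estimate s^2 = e'e / (n - p), with p = q + 1.
e = y - Z @ beta_hat
p = q + 1
s2 = e @ e / (n - p)

# Estimated Var(beta_hat) = s^2 (Z'Z)^{-1}.
var_beta = s2 * np.linalg.inv(Z.T @ Z)
```

A quick sanity check is that the residuals are orthogonal to the columns of Z, which is exactly the normal-equations condition ∂S/∂β = 0.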
2.3 Residuals and Hat Matrix in Linear Regression
The residuals and leverages are the fundamental tools for the identification of influential cases. From Eq. (2.2), the fitted values of the response variable are given by
ŷ = Zβ̂.
The estimated residuals are defined as

e = y − ŷ = y − Zβ̂. (2.4)
Furthermore, s² = e′e/(n − p), where p = q + 1.
The hat, or projection, matrix for the MLR model is given by

H = Z(Z′Z)⁻¹Z′. (2.5)
The leverages in MLR models are used to detect influential cases. They are formulated as

hᵢᵢ = zᵢ′(Z′Z)⁻¹zᵢ, (2.6)

where zᵢ is the ith row vector of the matrix Z and 1/n ≤ hᵢᵢ ≤ 1. Note that hᵢᵢ = Σⱼ₌₁ⁿ hᵢⱼ², and hence hᵢᵢ(1 − hᵢᵢ) ≥ 0. The term leverage is specifically connected with the regressors: a high leverage point in the Z space may turn out to be an influential case. The leverage hᵢᵢ measures the influence of the ith case on its own fitted value.
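The stated properties of H and the leverages can be verified numerically; a small sketch with illustrative simulated data (variable names are our own):

```python
import numpy as np

# Illustrative design matrix Z = (1, X1) with n cases and q regressors.
rng = np.random.default_rng(1)
n, q = 20, 2
Z = np.column_stack([np.ones(n), rng.standard_normal((n, q))])

# Hat matrix H = Z (Z'Z)^{-1} Z' and leverages h_ii = z_i'(Z'Z)^{-1} z_i.
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
h = np.diag(H)

# H is a symmetric, idempotent projection: trace(H) = p = q + 1,
# h_ii equals the row sum of squares of H, and 1/n <= h_ii <= 1.
assert np.allclose(H, H @ H)
assert np.isclose(np.trace(H), q + 1)
assert np.allclose(h, (H ** 2).sum(axis=1))
assert np.all(h >= 1.0 / n - 1e-12) and np.all(h <= 1.0 + 1e-12)
```

Because trace(H) = p, the average leverage is p/n, and a common rule of thumb flags cases with hᵢᵢ well above this average for closer inspection.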
2.4 Diagnostic Approaches in Linear Regression with No Multicollinearity
Research on influential observations and the associated statistical measures for MLR with no multicollinearity has received significant attention. Numerous statistical measures using single-case deletion diagnostic techniques for identifying influential observations on the outcomes of a least squares regression are given in Chatterjee and Hadi (1988), who termed the use of these statistical measures influence analysis. Among the many single-case deletion diagnostic measures is Cook's distance (Cook, 1977), which for the ith case in a data set is denoted by Dᵢ and defined as
Dᵢ = (β̂ − β̂₍ᵢ₎)′ Z′Z (β̂ − β̂₍ᵢ₎) / (p σ̂²), (2.7)

where β̂₍ᵢ₎ is the estimate of β with the ith case omitted and σ̂² is calculated from
OLS. The other suitable influence diagnostic measure is DFFITS (Belsley et al., 1980), which is defined by

DFFITS(i) = (ŷᵢ − ŷᵢ₍₋ᵢ₎) / (σ̂₍ᵢ₎ √hᵢᵢ), i = 1, 2, . . . , n, (2.8)

where ŷᵢ₍₋ᵢ₎ is the fitted response and σ̂₍ᵢ₎ is the sample standard error with the ith case omitted. Of these two measures, Welsch (1982) recommended DFFITS as the better choice, since it is more informative about variance than Dᵢ: it captures the simultaneous impact on the parameter estimates and on the estimate of the variance.
Pena (2005) proposed another influence diagnostic measure in an entirely different way. As Pena mentions, instead of examining how the omission of a point affects the vector of forecasts, we look at how the omission of each sample point affects the forecast of a specific observation. That is, for every sample point we measure the estimated change when each other point in the sample is omitted. In his work, he introduced a procedure which measures how each case is influenced by the rest of the data set. He considered the vectors