New Diagnostic Methods for Inuential Observations in with some Biased Estimation Methods

A Ph.D. Dissertation By Muhammad Kashif Roll No.: PHDS-11-05 Session: 2011-2016

DEPARTMENT OF Bahauddin Zakariya University Multan - Pakistan 2019 New Diagnostic Methods for Inuential Observations in Linear Regression with some Biased Estimation Methods

A Thesis submitted in partial fulllment of the requirements for the degree of

Doctor of Philosophy in STATISTICS by Muhammad Kashif (Roll No. PHDS-11-05) Session: 20112016

SUPERVISED by Prof. Dr. Muhammad Aman Ullah

CO-SUPERVISED by Dr. Muhammad Aslam

Department of Statistics Bahauddin Zakariya University Multan

2019 New Diagnostic Methods for Influential Observations in Linear

Regression with some Biased Estimation Methods

by

Muhammad Kashif

A thesis

submitted to the Department of Statistics,

Bahauddin Zakariya University, Multan

in fulfillment of the requirements for the Degree of

Doctor of Philosophy

in

Statistics

2019

Author's Declaration

I Muhammad Kashif, hereby state that my Ph.D. (Statistics) thesis titled New Diagnostic Methods for Inuential Observations in Linear Regression with some Biased Estimation Methods is my own work and has not been submitted previously by me for taking any degree from this University Bahauddin Zakariya University, Multan, Pakistan or anywhere else in the country/world. At any time if my statement is found to be incorrect even after my Graduate the university has the right to withdraw my Ph.D. (Statistics) degree.

Muhammad Kashif Date: 27-07-2019

i Plagiarism Undertaking

I solemnly declare that research work presented in the thesis titled New Diagnostic Methods for Inuential Observations in Linear Regression with some Biased Estimation Methods is solely my research work with no signicant contribution from any other person. Small contribution/help wherever taken has been duly acknowledged and that complete thesis has been written by me.

I understand the zero tolerance policy of the HEC and Bahauddin Zakariya University, Multan, Pakistan towards plagiarism. Therefore I as an Author of the above titled thesis declare that no portion of my thesis has been plagiarized and any material used as reference is properly referred/cited.

I undertake that if I am found guilty of any formal plagiarism in the above titled thesis even after award of Ph.D. (Statistics) degree, the University reserves the rights to withdraw/revoke my Ph.D. (Statistics) degree and that HEC and the University has the right to publish my name on the HEC/University Website on which names of students are placed who submitted plagiarized thesis.

Student's Signature: Name: Muhammad Kashif

ii This thesis is dedicated to The Holy Prophet Hazrat Muhammad (S.A.W.W)

(Whose teaching enlightened my heart and flourished my thoughts)

iii Acknowledgements

First and foremost, praises and thanks to Almighty ALLAH for giving me this opportunity, the strength and the patience to complete my dissertation finally, after all the challenges and difficulties. My special praise for the messenger of Allah, the Holy Prophet Hazrat MUHAMMAD (S.A.W.W), the greatest educator, the everlasting source of guidance and knowledge for humanity. He taught the principles of morality and eternal values. I deem it a great honor to express my deep and sincere gratitude to my honorable and estimable supervisor, Prof. Dr. Muhammad Aman Ullah, Professor at Department of Statistics, Bahauddin Zakariya University, Multan for his guidance and invaluable advice. His constructive comments and suggestions throughout the thesis work have contributed to the success of this research. His timely and efficient contribution helped me shape this thesis into its final form and I express my sincere appreciation for his assistance in any way that I may have asked. I consider it my privilege to have accomplished this thesis under his right guidance. I feel great pleasure in expressing my sincerest gratitude to my Co-supervisor Dr. Muhammad Aslam, Associate Professor and Chairman, Department of Statistics, Bahauddin Zakariya University, Multan for his detailed review, constructive sugges- tions, important support and excellent advice during the preparation of this thesis. I would like to express my gratitude to all the esteemed faculty members and staff members of the Department of Statistics, Bahauddin Zakariya University, Multan. Finally, I wish to thank to my loving parents, sisters, brothers and my wife for their prayers, encouragement and support spiritual, emotional, intellectual and otherwise.

Muhammad Kashif

iv Abstract

This thesis is concerned with the expansion of diagnostic methods in parametric regression models with some biased estimators. Of which, the Liu estimator, modified ridge estimator, improved Liu estimator and ridge estimator have been developed as an alternative to the estimator in the presence of multicollinearity in linear regression models. Firstly, we introduce a type of Pena’s statistic for each point in Liu regression. Using the forecast change property, we simplify the Pena’s statistic in a numerical sense. It is found that the simplified Pena’s statistic behaves quite well as far as detection of influential observations is concerned.

We express Pena’s statistic in terms of the Liu leverages and residuals. For numerical evaluation, simulated studies are given and a real data set has been analyzed for illustration. Secondly, we formulated Pena’s statistic for each point while considering the modified ridge regression estimator. Using this statistic, we showed that when modified ridge regression was used to mitigate the effects of multicollinearity, the influence of some observations could be significantly changed. The normality of this statistic was also discussed and it was proved that it could detect a subset of high modified ridge . The Monte Carlo simulations were used for

v empirical results and an example of real data was presented for illustration. Next,

we introduce a type of Pena’s statistic for each point in the improved Liu estimator.

Using this statistic, we showed that when the improved Liu estimator was used to

mitigate the effects of multicollinearity, the influence of some observations could be

significantly changed. The Monte Carlo simulations were used for empirical results

and an example of real data was presented for illustration. The ridge estimator having

growing and wider applications in statistical data analysis as an alternative technique

to the ordinary least squares estimator to combat multicollinearity in linear regression

models. In regression diagnostics, a large number of influence diagnostic methods

based on numerous statistical tools have been discussed. Finally, we focus on ridge

version of Nurunnabi et al. (2011) method for identification of multiple influential observation in linear regression. The efficiency of the proposed method is presented through several well-known data sets, an artificial large data with high-dimension and heterogeneous sample and a Monte Carlo simulation study.

vi List of Symbols and Abbreviations

Abbreviation/Symbols Description b Prior information vector

β Vector of slope coefficients of X1

ˆ βd Liu estimate

ˆ β(k,b) Modified ridge estimate

ˆ βK,D Improved Liu estimate

ˆ βk Ridge estimate

BACON Block adaptive computationally effective nomina-

tor

CIP Correct identification in percentage d biasing parameter

Di Cook’s distance

Dd,i Cook’s distance with Liu estimate

D(k,b)i Cook’s distance with modified ridge estimate

DK.D;i Cook’s distance with Improved Liu estimate

DR,i Cook’s distance with ridge estimate

vii DFFITS(i) Difference of Fits Test

DFFITSR(i) Difference of Fits Test with ridge estimate

H Hat Matrix

Hd Hat Matrix with Liu estimate

H(k,b) Hat Matrix with modified ridge estimate

HK,D Hat Matrix with Improved Liu estimate

HR Hat Matrix with ridge estimate hii Leverages hd,i Leverages with Liu estimate h(k,b)i Leverages with modified ridge estimate hK.D;i Leverages with Improved Liu estimate hRii Leverages with ridge estimate

I Identity Matrix

ILE Improved Liu Estimator k Ridge parameter

LE Liu Estimator

MAD Median absolute deviation

MLR Multiple linear regression

MRR Modified ridge regression

Mi Nurunnabi’s measure

MR,i Nurunnabi’s measure with ridge estimate

viii OLS Ordinary least squares

RR Ridge regression

Si Pena’s statistic

Sd,i Pena’s statistic with Liu estimate

S(k,b)i Pena’s statistic with modified ridge estimate

SK,D;i Pena’s statistic with Improved Liu estimate

SR,i Pena’s statistic with ridge estimate

X1 Centered and standardized matrix of variables y Response vector

ix Contents

List of Symbols and Abbreviation xii

List of Tables xii

List of Figures xiii

1 Introduction1 1.1 Presentation...... 1 1.2 Significance...... 5 1.3 A brief sketch of the Research...... 6

2 Influence Diagnostic Measures in Linear Regression8 2.1 Introduction...... 8 2.2 The Linear Model and OLS Estimation...... 9 2.3 Residuals and Hat Matrix in Linear Regression...... 10 2.4 Diagnostic Approaches in Linear Regression with No Multicollinearity 11

3 Pena’s statistic for the Liu regression (Kashif et al. (2018)) 17 3.1 Introduction...... 17 3.2 Pena’s statistic...... 19 3.2.1 Pena’s statistic using the LE...... 19 3.2.2 Properties of Pena’s statistic for the Liu Regression...... 21 3.3 Simulation study...... 28 3.3.1 Normality of Proposed measure...... 28 3.3.2 Detection of intermediate and high Liu leverage outliers.... 33 3.3.3 The Monte Carlo simulation...... 37 3.4 Longley data...... 40 3.5 Conclusions...... 43

4 Pena’s statistic for the Modified Ridge Regression 47 4.1 Introduction...... 47 4.2 Pena’s statistic...... 49 4.2.1 Pena’s statistic using the MRR...... 49 4.2.2 Properties of Pena’s statistic for the MRR...... 53 4.3 Simulation...... 60

x 4.3.1 Normality of Proposed measure...... 60 4.3.2 Detection of intermediate and high modified ridge leverage outliers...... 65 4.3.3 Performance of Pena’s statistic...... 69 4.4 Illustration...... 70 4.5 Conclusions...... 74

5 Pena’s statistic for the Improved Liu Estimator 75 5.1 Introduction...... 75 5.2 Pena’s statistic...... 78 5.2.1 Pena’s statistic using the ILE...... 78 5.2.2 Properties of Pena’s statistic for the ILE...... 80 5.3 Simulation...... 87 5.3.1 Normality of the Proposed measure...... 87 5.3.2 Detection of intermediate and high improved Liu leverage outliers 92 5.3.3 Performance of Pena’s statistic...... 96 5.4 Illustration...... 98 5.5 Conclusions...... 101

6 A New Diagnostic Method for Influential Observations in Ridge Regression 102 6.1 Introduction...... 102 6.2 Ridge Regression...... 105 6.3 Diagnostic Methods in Ridge Regression...... 105 6.3.1 Leverage and Residuals...... 105 6.3.2 Cook’s distance, DFFITS and Pena’s measure...... 106 6.4 Proposed Diagnostic Method...... 108 6.5 Examples...... 111 6.5.1 Longley Data...... 111 6.5.2 Tobacco Data...... 114 6.5.3 Artificial high dimensional large data set containing heteroge- neous cases...... 117 6.6 Simulation Findings...... 119 6.7 Conclusions...... 121

7 Conclusions 122

References 125

A Published /Submitted Research Work from Ph.D. Thesis 132

B Data Sets 133

xi List of Tables

2.1 Influence Diagnostic Measures in Linear Regression...... 16

3.1 Three data sets and proposed Diagnostic Measures...... 34 3.2 The Percent of outlier identification by Sd,i in simulation...... 38 3.3 The Percent of outlier identification by Dd,i in simulation...... 39 3.4 Five largest values of Sd,i for d = 0.1, 0.5, 0.9 and 1.0 (OLS Case) for the Longley data...... 42

4.1 Three data sets and proposed Diagnostic Measures...... 66 4.2 Percent of outlier identification by applying S(k,b)i in the Simulation.. 70 4.3 Percent of outlier identification by means of D(k,b)i in the Simulation. 71 4.4 Five largest observations of S(k,b)i for the Longley data...... 72 5.1 Three data sets and proposed Diagnostic Measures...... 93 5.2 Percent of outlier identification through SK.D;i in simulation...... 97 5.3 Percent of outlier identification through DK.D;i in simulation...... 98 5.4 Five largest values of SK.D;i for Optimal value of d for the Longley data. 99 6.1 Measures of influences for Longley data...... 112 6.2 Measures of influences for Tobacco blends data set...... 115 6.3 Simulation results...... 120

B.1 Longley Data (Longley, 1967)...... 133 B.2 Tobacco Data (Myers, 1986)...... 134

xii List of Figures

3.1 Influence investigation of the produced of three data sets with various multicollinearity independent variables. Normal q-q plots of the proposed measure...... 29 3.2 Influence investigation of the produced of three data sets with various multicollinearity variables. Normal q-q plots of Cook’s distance... 30 3.3 Influence investigation of the produced of three data sets with various multicollinearity regressors. Plots of Cook’s distance against proposed measure...... 31 3.4 Influence investigation of the produced of three data sets with various multicollinearity predictors. Plots of proposed measure versus obser- vations...... 32 3.5 Graphs of Cook’s measure against the suggested diagnostic measure three situations: (a) no outliers, (b) high Liu leverage outliers and (c) middle Liu leverage outliers...... 35 3.6 Plots of the proposed measure versus observation three situations: (a) no outliers, (b) high Liu leverage outliers and (c) centeral Liu leverage outliers...... 36 3.7 Histograms of Sd,i of the Longley data for d=0.1, 0.5, 0.9 and 1.0... 44 3.8 Scatter Plots of Dd,i against Sd,i for d=0.1, 0.5, 0.9 and 1.0...... 45 3.9 Plots of Sd,i against observation number for d=0.1, 0.5, 0.9 and 1.0.. 46 4.1 Influence investigation of produced the 3 data sets with various multicollinearity regressors. Normal q-q plots of the proposed measure 61 4.2 Influence investigation of produced the 3 data sets with various multicollinearity regressors. Normal q-q plots of Cook’s measure... 62 4.3 Impact investigation of produced the 3 data sets with various mul- ticollinearity regressors. Graphs of Cook’s distance against proposed measure...... 63 4.4 Impact investigation of produced the 3 data sets with various multi- collinearity regressors. Plots of proposed measure versus observation number...... 
64 4.5 Graphs of Cook’s measure against the proposed measure three situa- tions: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 middle modified ridge leverage outliers...... 67

xiii 4.6 Graphs of the proposed measure against observation three situations: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 middle modified ridge leverage outliers...... 68 4.7 Impact assessment of Longley data set (a) Histogram of proposed measure (b) graph of Cook’s distance verses proposed measure (c) plot of proposed measure verses cases...... 73

5.1 Influence investigation of the produced of 3 data sets with various multicollinearity independent variables. Normal q-q plots of the proposed measure...... 88 5.2 Influence investigation of the produced of 3 data sets with various multicollinearity independent variables. Normal q-q plots of Cook’s distance...... 89 5.3 Influence investigation of the produced of 3 data sets with various multicollinearity independent variables. Graphs of Cook’s measure against proposed measure...... 90 5.4 Influence investigation of the produced of 3 data sets with various multicollinearity independent variables. Plots of proposed measure versus observation number...... 91 5.5 Graphs of Cook’s measure against proposed measure three situations: (a) No outliers, (b) 3 high improved Liu leverage outliers and (c) 3 Middle improved Liu leverage outliers...... 94 5.6 Plots of the proposed measure versus observation three situations: (a) No outliers, (b) 3 high improved Liu leverage outliers and (c) 3 Middle improved Liu leverage outliers...... 95 5.7 Influence examination of Longley data set (a) Histogram of the proposed measure (b) graph of the Cook’s distance verses proposed measure (c) plot of proposed measure verses cases...... 100

6.1 Index plots for Longley data: (a) DR,i measure; (b) DFFITS; (c) Pena’s measure SR,i and (d) Suggested influence diagnostic measure MR,i. .. 113 6.2 Index graphs of Tobacco blends data: (a) DR,i measure; (b) DFFITS; (c) Pena’s measure SR,i and (d) Proposed diagnostic measure MR,i. . 116 6.3 Influence investigation of the big data set with high dimensions and heterogeneous sample cases: (a) Residuals against Fitted plot; (b) Histogram of the Residuals; (c) Index graphs of (a) Cook’s distance; (d) DFFITS; (e) Pena’s statistic; (f) Histogram of Pena’s statistic; (g) Histogram of Proposed measure; (h) Index plot of Proposed measure. 118

xiv Chapter 1

Introduction

1.1 Presentation

In fitting a multiple linear regression (MLR) model, it is regular to apply the ordinary least squares (OLS) technique for the most part in view of simplicity of estimation. It is a well-known fact that, OLS estimator achieve desirable properties only if certain assumptions about the error term holds. If these assumptions violates, this effect the estimator and the subsequent analyses. Moreover, in applying the method of OLS, we observe that the estimates of parameters can be significantly influenced by at least one observation in the data set. It implies not every one of the observations have an equivalent significance in linear regression. Therefore, in decisions that result drawn from an analysis (Chatterjee and Hadi, 1986). Extreme value in explanatory variable(s) results as an influential observation. Influential observations are the values which strongly effect the parameter estimates and fitted values. The inclusion/exclusion of such observations may considerably change the

1 estimated values of parameters and further analyses. Therefore, OLS regression can

be made a practicable statistical methodology by giving more attention to all possible

influential observations in the data set, but not only on estimation.

In literature, two approaches are available for this purpose according to Chatterjee

and Hadi (1988). First approach is case omission technique which is widely applied for

influence analysis in linear regression models, obviously, e.g., Cook (1977), Belsley et

al. (1980), Montgomery et al. (2001), Chatterjee and Hadi (2006), Ullah and Pasha

(2009), additionally, the references referred to in that. According to this approach,

it is investigated that how different measures, involved in , change

when some of the observations have to be omitted from given data set. Other approach

is local influence technique, which is recommended by Cook (1986) as an alternative

methodology to case omission approach. He describes a measure named as the local

influence which depends on differentiation instead of a point deletion method. This

approach has been considered by numerous researchers, for example, Thomas and

Cook (1989), Thomas and Cook (1990), Tsai and Wu (1992), Zhao and Lee (1998),

Liu (2000), Ortega et al. (2003), Tanaka et al. (2003), Prendergast (2006), Paula and

Cysneiros (2010), Cancho et al. (2010) and moreover various others. Comprehensive

discussion on both of these approaches of influence study can be found in Cook (1979),

Belsley et al. (1980), Cook and Weisberg (1982), Chatterjee and Hadi (1988), Rancel and Sierra (2001), Ullah and Pasha (2009), Amanullah et al. (2013a) and Amanullah

et al. (2013b).

Pena (2005) presented another influence measure which is absolutely distinctively to identify influential observations. He introduced a procedure which measure the

2 impact of a case totally in a new way. Instead of assessing the effect of case deletion on estimation of parameters, a new approach is propose based on how each case is being affected by the remaining data set. Nurunnabi et al. (2011) extended the idea of Pena (2005) to group omission and introduced another measure to detect multiple influential cases. Their proposed technique comprises of two stages. In the initial step they tried to detect the suspected influential observations. It is not so natural to detect all the influential observations at the first step as a result of masking and swamping issue. In the event that any suspected case left in the data then the detection procedure turns out to be exceptionally awkward. So they tried to ensure that all the potential observations are hailed as suspected and in the meantime no acquitted points are incorrectly omitted. Since the omission of such cases particularly when they are acquitted leverage ones might unfavourably influence the whole influential framework (Habshah et al., 2009). Thus, they utilize the block adaptive computationally effective outlier nominator (BACON) (Billor et al., 2000).

In the second stage they utilized a group omission type of Pena’s measure to affirm whether the doubted observations are really influential or not.

The measures which are used to identify the influential observations in OLS method are easily found because this is a simple model. But these measures are not immediately relevant to other biased estimators because these estimators may have various model assumptions or additional parameters or some other causes.

Consequently, for these estimators another investigation work is required.

The present research work is an extension of Pena (2005) measure and Nurunnabi et al. (2011) measure of group omission for detecting multiple influential observations

3 in MLR models. Currently, interest has increased in investigating the highlights of a given data with regards to a few other options to the OLS estimator. Four of the issues that may emerge in this setting are as take after;

1. Liu estimator (LE) suggested by (Liu, 1993) is an ordinarily used estimator

to combat multicollinearity in MLR models. There is a colossal literature to

justify the use of LE in influence diagnostic analysis (Ozkale and Kaciranlar,

2007; Yang et al., 2009; Gruber, 2012; Jahufer and Jianbao, 2012; Jahufer, 2013;

Amanullah et al., 2013a). So the use of Pena (2005) measure along with LE to

identify multiple influential observations is also required in sensitivity analysis.

2. The use of modified ridge regression (MRR; Swindle, 1976) as an alternative

to the OLS technique in influence diagnostics analysis when dealing with MLR

model, has been consider by various researchers (Jahufer and Jianbao, 2009;

Jahufer and Jianbao, 2011; Turkan and Toktamis, 2012; Amanullah et al.,

2013b). It is also needed to use MRR with Pena (2005) measure in MLR

models.

3. The improved Liu estimator (ILE) suggested by (Liu, X. Q., 2011) has been

proposed as an alternative to the LE with some better properties. The use of

this estimator has not given much attention of researchers in influence diagnostic

analysis. We additionally proposed the use of ILE with Pena (2005) measure

to detect multiple influential cases in MLR models.

4. The issue of finding influential observations using ridge regression (RR; Hoerl

and Kennard, 1970) in MLR models as an alternative to the OLS technique.

4 This issue has accomplished a significant consideration (Walker and Birch, 1988;

Billor and Loynes, 1999; Turkan and Toktamis, 2012). The Nurunnabi et al.

(2011) measure of group deletion for identifying multiple influential observations

in MLR models with ridge regression (RR) is also needed attention.

In our research work, we tended to the previously mentioned issues. These issues gave the primary root and inspiration for our present investigation. We analyzed our work by various very much alluded data sets, high-dimensional substantial artificial data and Monte Carlo simulation scheme. The significance and extent of our research work is as follow.

1.2 Significance

Evaluation of case omission by using the Pena (2005) measure and Nurunnabi et al.

(2011) measure of group deletion for identification of multiple observations in MLR models with the biased estimators such as LE, MRR, ILE and RR helps analysts to evaluate the appropriateness of these influence diagnostic measures. The importance of the matters brought up in the previous section originates from:

1. The identification of influential observations by using influence diagnostic

techniques and following investigation gives information connecting reliability

of conclusions and their reliance on the assumed statistical model.

2. The capability and efficiency of case omission by using the Pena (2005) measure

and Nurunnabi et al. (2011) measure of group deletion for identifying multiple

5 observations in MLR models does not remain the same when we consider the

LE, MRR, ILE and RR to mitigate the effect of multicollinearity instead of the

OLS method.

The goal of this research is to examine the degree to which diagnostic values for LE,

MRR, ILE and RR are influenced as we increase control over multicollinearity.

1.3 A brief sketch of the Research

This thesis emphases on the problem of biased estimation of MLR model in the presence of multicollinearity in the data. Since we are using four biased estimators namely; LE, MRR, ILE and RR. So we break up our study into above stated segments for biased estimators.

After reviewing the available diagnostic methods, we formulate them by using biased estimators with the aim of comparing the performance of proposed methods with methods based on the OLS estimator. For this purpose, we propose LE, MRR, ILE and RR versions of diagnostic measures.

In Chapter 2, the definition and estimation of the MLR model is introduced. The

MLR residuals, leverages, hat matrix and influence diagnostic measures are displayed in this part.

In Chapter 3, we will investigate another alternative technique called the LE which will be broadly used and applied by data investigation. In this chapter, derivation of different diagnostic measures will be endeavored to cover the yet revealed diagnostic areas.

6 In Chapter 4, we study the influence diagnostic methods in MLR models with MRR, which is based on prior information. We should address the issue of estimation of additional parameter k. Due to this one will need to look at the influence diagnostic techniques in the new settings too.

In Chapter 5, we will emphasis Pena (2005) influence diagnostic in MLR models with the ILE.

In Chapter 6, we propose ridge version of diagnostic measures that have been tailored by using residuals of ridge estimator for multicollinear data instead of applying customary OLS residuals. We consider multicollinearity in all MLR models. In the detailing of diagnostic measures, residuals acquired from OLS have been used. In this chapter, following Walker and Birch (1988), we present ridge estimate and residuals in different diagnostic measures. Our main focus is to check the performance of these measures in presence of multicolllinearity using ridge regression.

Finally, Chapter 7 is reserved for summary and concluding remarks of the results drive from all the chapters and conclusions drawn from the results. All the calculations need a wide computer programs. In Monte Carlo simulation schemes we need to deal with a large number of observations for that purpose we use computer programs using

E-Views, Matlab and R-Language.

7 Chapter 2

Influence Diagnostic Measures in

Linear Regression

2.1 Introduction

The MLR models are the richest tools with extensive assortment of applications in the previous decade in different sciences to clarify the connection amongst independent and dependent variables. The OLS method has been broadly used as a standard technique in MLR investigation. The utilization of such technique infers that the estimates are unbiased. But, it is notable that numerical measures in perspective of

OLS regression can be impacted by one or couple of influential cases. Techniques more robust than OLS has incredible considerations in the available literature. However, because of its inventive theory and simplicity of calculation, OLS has upheld the extraordinary part in regression investigation.

8 2.2 The Linear Model and OLS Estimation

Following Walker and Birch (1988) the usual MLR model can be portrayed as

y = 1β0 + X1β1 + ε, (2.1)

where, y is an n × 1 vector of response variable , 1 is n × 1 vector of ones, β0 is an unknown parameter, X1 is an n × q centered and standardized matrix of random variables, β1 is a q × 1 vector of an unknown parameter, ε is an n × 1 vector of unobservable random variables with zero mean and constant variance σ2. If Z =

(1,X1) , then Eq. (2.1) might be expressed as

y = Zβ + ε (2.2)

For the estimation of parameter β in Eq. (2.2), the OLS procedure is used under the hypothesis that the errors are normally distributed. The OLS technique minimizes the sum of squares of errors as

n X 2 0 0 S (β) = εi = ε ε = (y − Zβ) (y − Zβ) . i=1

S (β) = y0 y − 2β0 Z 0 y + β0 Z 0 Zβ (2.3)

Using Eq. (2.3), we get

∂S 0 0 = −2Z y + 2Z Zβ = 0 ∂β

9 then the unbiased estimator βˆ of β is defined as

 0 −1 0 βˆ = Z Z Z y.

It is already verified that the variance of βˆ might be composed as

   0 −1 V ar βˆ = σ2 Z Z .

2.3 Residuals and Hat Matrix in Linear Regres-

sion

The residuals and leverages are the fundamental devices for the identification of influential cases. Now, from Eq. (2.2), the fitted values of the response variable are given by

yˆ = Zβˆ

The estimated residuals are characterized by

e = y − yˆ = y − Zβ.ˆ (2.4)

10 Furthermore, s2 = e0e/(n − p), where p = q + 1.

The hat or projection matrix for the MLR models is presented as

 0 −1 0 H = Z Z Z Z (2.5)

The leverages in the MLR models are used to detect the influential cases. These are formulated as

0 0 −1 hii = zi(Z Z) zi (2.6)

1 where, zi is the ith vector of matrix Z, 0 < hii < 1 or n < hii < 1. Note that n P 2 hii = hij and also hii (1 − hii) ≤ 0. The term leverage is specifically connected with j=1 the regressors, showed that high leverage point in the Z space occurs as influential cases. The leverages hii measure the influence on the fitted values.

2.4 Diagnostic Approaches in Linear Regression

with No Multicollinearity

An extensive research work on influential observations in the concerned statistical measures for MLR with no multicollinearity has received a significant attention.

Numerous statistical measures using single case deletion diagnostic technique for identifying influential observations on the outcomes of a least square regression are given in Chatterjee and Hadi (1988). They named the use of these statistical measures as influence analysis. Among the numerous single case deletion diagnostic measures,

11 one is Cook’s distance (Cook, 1977), which measure at the ith case in a data denoted

by Di and defined as

0   0   ˆ ˆ ˆ ˆ 2−1 Di = β − β(i) Z Z β − β(i) pσˆ (2.7)

ˆ 2 where β(i) is the estimate of β with the ith case omitted andσ ˆ is calculated from

OLS. The other suitable influence diagnostic measure is DFFITS (Belsley et al., 1980) which is defined by

yˆi − yˆi(−i) DFFITS(i) = √ , i = 1, 2, . . . , n (2.8) σˆ(i) hii

(−i) wherey ˆi is the fitted response andσ ˆ(i) is the sample standard error with the ith case omitted. Among these two measures Welsch (1982) recommended DFFITS as a superior decision since, it is more instructive about variance than that of Di and

it will calculate concurrent impact on estimates of parameter and the estimate of

variance.

Pena (2005) proposed another influence diagnostic measure in an entirely different way. As Pena remarks, instead of examining how the omission of a point affects the vector of forecasts, we look at how the omission of each sample point affects the forecast of a specific observation. That is, for every sample point we measure the estimated change in its forecast when each other point in the sample is omitted. In his work, he introduced a procedure that measures how each case is influenced by the rest of the data set. He considered the vectors

$$s_i = \left(\hat y_i - \hat y_{i(-1)}, \dots, \hat y_i - \hat y_{i(-n)}\right)'$$

where $\hat y_i - \hat y_{i(-j)}$ is the difference between the $i$th fitted value of $y$ with all cases present in the data set and with the $j$th case omitted. He presented a diagnostic measure for the $i$th case as

$$S_i = \frac{s_i' s_i}{p\,\widehat{\mathrm{Var}}(\hat y_i)}; \qquad i = 1, 2, \dots, n. \qquad (2.9)$$

Eq. (2.9) can be expressed as

$$S_i = \frac{1}{p\hat\sigma^2 h_{ii}} \sum_{j=1}^{n} \frac{h_{ji}^2 e_j^2}{(1-h_{jj})^2} \qquad (2.10)$$

where $h_{ii}$ is the $i$th diagonal element and $h_{ji}$ the $(j,i)$th element of the projection matrix $H$, $\hat y_i - \hat y_{i(-j)} = h_{ji}e_j/(1-h_{jj})$ and $\widehat{\mathrm{Var}}(\hat y_i) = \hat\sigma^2 h_{ii}$. Cases with values of $(S_i - E(S_i))/SD(S_i)$ sufficiently large in magnitude might be considered influential. Because the distribution of $S_i$ is itself affected in the presence of influential cases, he declared an observation influential if it satisfied

$$|S_i - \mathrm{Median}(S_i)| \ge 4.5\,\mathrm{MAD}(S_i), \qquad (2.11)$$

where $\mathrm{MAD}(S_i) = \mathrm{Median}\{|S_i - \mathrm{Median}(S_i)|\}/0.6745$ is the median absolute deviation of the values of $S_i$.
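Since Eq. (2.10) expresses $S_i$ entirely in terms of the hat matrix and the OLS residuals, the whole vector of sensitivities can be obtained with a single matrix-vector product. A minimal sketch on hypothetical simulated data (names and sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = Z @ np.ones(p) + rng.normal(size=n)

H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T           # projection matrix
h = np.diag(H)
e = y - H @ y                                  # OLS residuals
s2 = e @ e / (n - p)

# Eq. (2.10): S_i = (1/(p s^2 h_ii)) * sum_j h_ji^2 e_j^2 / (1 - h_jj)^2;
# H is symmetric, so the sum over j is one matrix-vector product with H**2
S = (H**2 @ (e**2 / (1 - h) ** 2)) / (p * s2 * h)

# decision rule (2.11) with the robust MAD scale
med = np.median(S)
mad = np.median(np.abs(S - med)) / 0.6745
influential = np.abs(S - med) >= 4.5 * mad
```

For clean data the values of `S` cluster around $1/p$, and cases flagged by `influential` are those whose sensitivity departs markedly from that benchmark.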

Nurunnabi et al. (2011) broadened Pena's measure to group omission and introduced another measure to detect influential cases in the entire data set. They noted that in Pena's influence diagnostic the leverage observations receive more weight than in the usual influence diagnostics, which is why Pena's measure can be exceptionally valuable for detecting high leverage outliers, normally viewed as the most difficult kind of heterogeneity to identify in regression problems. According to Imon (2005), the leverage values may fail to reveal the genuine influence of observations in the presence of multiple influential cases, particularly when these cases are high leverage outliers, and the single-case omission technique may be unable to focus on their genuine influence. Furthermore, according to Hadi and Simonoff (1993), group deletion measures reduce extreme disturbance by omitting the suspicious group of influential observations at once, making the data more homogeneous than before. For this reason, they proposed a measure which comprises two stages. In the first step, they detect the suspected influential observations; it is not easy to detect all the influential observations at the first attempt because of masking and swamping issues, and if a suspected observation is left in the data the detection process becomes exceptionally cumbersome. So they needed to ensure that all possible observations were highlighted as suspects while, at the same time, no innocent cases were wrongly omitted, since, as indicated by Habshah et al. (2009), the omission of such observations, particularly when they are innocent, might adversely affect the whole influence analysis. For this purpose, they employed the BACON procedure. In the second step of their procedure, they utilized a group deletion version of Pena's measure to confirm whether the suspected observations were really influential.

After discovering a group of $d$ suspected cases among the $n$ observations by the BACON procedure, they denoted the set of observations remaining in the analysis by $R$ and the set of omitted observations by $D$, so that, without loss of generality, these cases are the last $d$ rows of $Z$ and $Y$, that is,

$$Z = \begin{pmatrix} Z_{(R)} \\ Z_{(D)} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{(R)} \\ Y_{(D)} \end{pmatrix}.$$

They considered the vector of differences between $\hat y_{j(-D)}$ and $\hat y_{j(i)(-D)}$ after deleting the $D$ observations as

$$t_{(i)(-D)} = \left(\hat y_{1(-D)} - \hat y_{1(i)(-D)}, \dots, \hat y_{n(-D)} - \hat y_{n(i)(-D)}\right)' \qquad (2.12)$$
$$= \left(t_{1(i)(-D)}, \dots, t_{n(i)(-D)}\right)', \qquad (2.13)$$

where

$$t_{j(i)(-D)} = \hat y_{j(-D)} - \hat y_{j(i)(-D)} = \frac{h_{ji}\, e_{i(-D)}}{1-h_{ii}}, \qquad j = 1, 2, \dots, n, \qquad (2.14)$$

and

$$h_{ji} = z_j'(Z'Z)^{-1} z_i \quad \text{and} \quad e_{i(-D)} = y_i - \hat y_{i(-D)}.$$

Finally, they presented their measure as

$$M_i = \frac{t_{(i)(-D)}'\, t_{(i)(-D)}}{p\,\widehat{\mathrm{Var}}(\hat y_{i(-D)})}; \qquad i = 1, 2, \dots, n, \qquad (2.15)$$

where

$$\widehat{\mathrm{Var}}(\hat y_{i(-D)}) = \hat\sigma^2 h_{ii} \quad \text{and} \quad \hat\sigma^2 = \frac{e_{(-D)}'\, e_{(-D)}}{n-p}.$$

After simplification, using (2.12)-(2.15), $M_i$ can be written as

$$M_i = \frac{1}{p\hat\sigma^2 h_{ii}} \sum_{j=1}^{n} h_{ji}^2 \frac{e_{i(-D)}^2}{(1-h_{ii})^2}. \qquad (2.16)$$

This measure is a generalization of Pena's measure defined in (2.10) and satisfies the same properties of the hat matrix introduced by Pena (2005), because these properties do not change when a group of observations is deleted from the data. Following an argument similar to that of Pena (2005), they viewed an observation as influential if it fulfilled the rule

$$|M_i| \ge \mathrm{Median}(M_i) + 4.5\,\mathrm{MAD}(M_i), \qquad (2.17)$$

where $\mathrm{MAD}(M_i) = \mathrm{Median}\{|M_i - \mathrm{Median}(M_i)|\}/0.6745$.
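A sketch of the second (group-deletion) step, assuming the suspect set $D$ has already been supplied by the first step; here it is simply hard-coded, whereas the thesis uses the BACON procedure for this. The cross-leverages $h_{ji}$ are computed from the retained cases $Z_{(R)}$ and $\hat\sigma^2$ from the retained residuals, which is one reading of the notation in (2.14)-(2.16); all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
Z = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = Z @ np.ones(p) + rng.normal(size=n)
D_idx = np.array([37, 38, 39])                # hypothetical suspect set from step 1
y[D_idx] += 10.0                              # plant three outliers for illustration
R = np.setdiff1d(np.arange(n), D_idx)         # retained cases

beta_R, *_ = np.linalg.lstsq(Z[R], y[R], rcond=None)
e_del = y - Z @ beta_R                        # e_(-D): residuals w.r.t. the clean fit
s2 = e_del[R] @ e_del[R] / (len(R) - p)       # sigma^2-hat from retained residuals

G = Z @ np.linalg.inv(Z[R].T @ Z[R]) @ Z.T    # h_ji = z_j'(Z_R'Z_R)^{-1} z_i
g = np.diag(G)

# Eq. (2.16): M_i = (sum_j h_ji^2) e_{i(-D)}^2 / (p s^2 h_ii (1 - h_ii)^2)
M = (G**2).sum(axis=0) * e_del**2 / (p * s2 * g * (1 - g) ** 2)

# rule (2.17)
med = np.median(M)
mad = np.median(np.abs(M - med)) / 0.6745
flagged = np.flatnonzero(np.abs(M) >= med + 4.5 * mad)
print(flagged)
```

With the planted outliers, `flagged` would be expected to contain the indices in `D_idx`, confirming them as genuinely influential in the second step.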

Table 2.1: Influence Diagnostic Measures in Linear Regression

Diagnostic measure   | Formula                                                                                     | Cut-off point                                               | Reference
Cook's distance      | $D_i = (\hat\beta-\hat\beta_{(i)})' Z'Z (\hat\beta-\hat\beta_{(i)})/(p\hat\sigma^2)$        | $D_i > F_{\alpha}(p, n-p)$                                  | Cook (1977)
DFFITS               | $\mathrm{DFFITS}_i = (\hat y_i - \hat y_{i(-i)})/(\hat\sigma_{(i)}\sqrt{h_{ii}})$           | $|\mathrm{DFFITS}_i| > 2\sqrt{p/n}$                         | Welsch (1982)
Pena's measure       | $S_i = \frac{1}{p\hat\sigma^2 h_{ii}} \sum_{j=1}^{n} h_{ji}^2 e_j^2/(1-h_{jj})^2$           | $|S_i| \ge \mathrm{Median}(S_i) + 4.5\,\mathrm{MAD}(S_i)$   | Pena (2005)
Nurunnabi's measure  | $M_i = \frac{1}{p\hat\sigma^2 h_{ii}} \sum_{j=1}^{n} h_{ji}^2 e_{i(-D)}^2/(1-h_{ii})^2$     | $|M_i| \ge \mathrm{Median}(M_i) + 4.5\,\mathrm{MAD}(M_i)$   | Nurunnabi et al. (2011)

Chapter 3

Pena's statistic for the Liu regression (Kashif et al. (2018))

3.1 Introduction

Influence diagnostics have received a great deal of attention in recent years, most of it concerned with linear regression models in the presence or absence of multicollinearity. One can refer to Belsley et al. (1980), Atkinson (1985) and Ullah and Pasha (2009) for various influence diagnostic measures in linear regression models when no multicollinearity is present. Pena's statistic (Pena, 2005) is a comparatively new influence diagnostic in this context. In Pena's approach, instead of examining how the omission of each point influences the vector of forecasts, we look at how the deletion of each point influences the forecast of a particular case; in Pena's sense, we measure how every point is affected by the remainder of the data set. Hence, the prime advantage of this approach is to observe how sensitive the forecast of the $i$th case is to the omission of every other point in the sample. For linear regression models with multicollinearity, Jahufer and Jianbao (2009) and Amanullah et al. (2013b) studied residuals, leverages and several case-omission measures using the MRR. Walker and Birch (1988) studied leverages, residual case-omission measures and local influence measures in the RR. Liu (1993) recommended the LE in the presence of multicollinearity among predictors; in the available literature, Amanullah et al. (2013a) studied influence measures in this framework. Recently, the influence of observations on the RR using Pena's approach was studied by Emami and Emami (2016). However, no attention has been paid to the impact of anomalous points on the results of the LE, a method that combines the merits of the RR and the Stein (1956) estimator. Thus, the main goal of this chapter is to extend Pena's approach to the LE, where the statistic is defined as the squared norm of the vector of differences in the fitted value of a point when every point is deleted from the sample in turn. Furthermore, we express this diagnostic measure in terms of the LE residuals and leverages. We demonstrate that the distribution of this measure is asymptotically normal in the LE framework and that it is able to detect a set of high Liu leverage identical outliers that cannot be detected by Cook's measure.

The chapter unfolds as follows. The influence diagnostic measure in OLS and the LE for Pena's statistic is presented in Section 3.2, where we also examine some properties of the modified diagnostic measure. A simulation study and an application to real data are presented in Sections 3.3 and 3.4, respectively, for illustration. Finally, closing comments are given in Section 3.5.

3.2 Pena's statistic

3.2.1 Pena’s statistic using the LE

Pena (2005) proposed a new influence diagnostic for measuring the influence of the $i$th case in an entirely different fashion: he presented a methodology to measure how every observation is affected by the rest of the data. He considered the vectors

$$s_i = \left(\hat y_i - \hat y_{i(-1)}, \dots, \hat y_i - \hat y_{i(-n)}\right)' = \left(\frac{h_{1i}e_1}{1-h_{11}}, \dots, \frac{h_{ni}e_n}{1-h_{nn}}\right)', \qquad (3.1)$$

where $\hat y_i - \hat y_{i(-j)}$ is the difference between the $i$th fitted value of $y$ with all cases present in the data set and with the $j$th case deleted, and $e_i$ is the $i$th OLS residual. He denoted and defined his statistic for the $i$th case as

$$S_i = \frac{s_i' s_i}{p\,\sigma^2_{(\hat y_i)}}; \qquad i = 1, 2, \dots, n, \qquad (3.2)$$

which can also be re-expressed as

$$S_i = \frac{1}{p\hat\sigma^2 h_{ii}} \sum_{j=1}^{n} \frac{h_{ji}^2 e_j^2}{(1-h_{jj})^2}, \qquad (3.3)$$

where $h_{ii}$ is the $i$th diagonal element and $h_{ji}$ the $(j,i)$th element of the projection matrix $H$, and $\sigma^2_{(\hat y_i)} = \hat\sigma^2 h_{ii}$. One property of $S_i$ is that for large sample sizes and numerous explanatory variables its distribution is approximately normal, and cut-off points can be found by applying this property: observations with values of $(S_i - E(S_i))/SD(S_i)$ sufficiently large might be declared influential. Since the mean and standard deviation of $S_i$ are affected by the presence of influential observations, Pena (2005) proposed using the median and the median absolute deviation of $S_i$ instead. Therefore, an observation is referred to as influential if

$$|S_i - \mathrm{Median}(S_i)| \ge 4.5\,\mathrm{MAD}(S_i).$$

Liu (1993) proposed the LE, which is denoted and defined as

$$\hat\beta_d = (Z'Z + I)^{-1}\left(Z'y + d\hat\beta\right),$$

where $d$, with $0 < d < 1$, is the parameter of the LE. We now consider Liu regression estimation. Pena's statistic for the LE is defined as

$$S_{d,i} = \frac{s_{d,i}'\, s_{d,i}}{p\,\sigma^2_{(\hat y_{d,i})}}; \qquad i = 1, 2, \dots, n, \qquad (3.4)$$

where $s_{d,i} = \left(\hat y_{d,i} - \hat y_{d,i(-1)}, \dots, \hat y_{d,i} - \hat y_{d,i(-n)}\right)'$ and $\hat y_{d,i} - \hat y_{d,i(-j)}$ is the difference between the $i$th fitted value of $y$ with all cases present in the data set and with the $j$th case omitted, using the LE. After simplification, this statistic can also be expressed as

$$S_{d,i} = \frac{1}{p\hat\sigma^2 h_{d,i}} \sum_{j=1}^{n} \frac{h_{d,ji}^2\, e_{d,j}^2}{(1-h_{d,j})^2}, \qquad (3.5)$$

where $h_{d,i}$ is the $i$th diagonal element and $h_{d,ji}$ the $(j,i)$th element of the projection matrix

$$H_d = Z(Z'Z + I)^{-1}(Z'Z + dI)(Z'Z)^{-1}Z',$$

$e_{d,i} = y_i - \hat y_{d,i}$, $\hat y_{d,i} - \hat y_{d,i(-j)} = h_{d,ji}e_{d,j}/(1-h_{d,j})$ and $\sigma^2_{(\hat y_{d,i})} = \hat\sigma^2 h_{d,i}$. An observation is declared influential if it satisfies the rule

$$|S_{d,i} - \mathrm{Median}(S_{d,i})| \ge 4.5\,\mathrm{MAD}(S_{d,i}), \qquad (3.6)$$

where $\mathrm{MAD}(S_{d,i}) = \mathrm{Median}\{|S_{d,i} - \mathrm{Median}(S_{d,i})|\}/0.6745$ is the median absolute deviation of the values of $S_{d,i}$.
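Because $H_d$ is an explicit function of $Z$ and $d$, the statistic (3.5) and the rule (3.6) can be computed directly. A minimal NumPy sketch on hypothetical collinear data; the value $d = 0.8$ is an arbitrary choice for illustration (in practice $d$ would be estimated, e.g. by the method of Liu (1993)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, d, theta = 50, 3, 0.8, 0.9
w = rng.normal(size=(n, p + 1))
Z = np.sqrt(1 - theta**2) * w[:, :p] + theta * w[:, [p]]   # collinear regressors
y = Z @ np.ones(p) + rng.normal(size=n)

ZtZ = Z.T @ Z
I = np.eye(p)
H = Z @ np.linalg.inv(ZtZ) @ Z.T                            # OLS hat matrix
# Liu projection matrix H_d = Z (Z'Z + I)^{-1} (Z'Z + dI) (Z'Z)^{-1} Z'
Hd = Z @ np.linalg.inv(ZtZ + I) @ (ZtZ + d * I) @ np.linalg.inv(ZtZ) @ Z.T
hd = np.diag(Hd)                                            # Liu leverages h_{d,i}
ed = y - Hd @ y                                             # Liu residuals e_{d,i}
e = y - H @ y
s2 = e @ e / (n - p)                                        # sigma^2-hat from OLS

# Eq. (3.5); Hd is symmetric, so the sum over j is one matrix-vector product
Sd = (Hd**2 @ (ed**2 / (1 - hd) ** 2)) / (p * s2 * hd)

# rule (3.6)
med = np.median(Sd)
mad = np.median(np.abs(Sd - med)) / 0.6745
influential = np.abs(Sd - med) >= 4.5 * mad
```

Note that the three matrices between $Z$ and $Z'$ are all functions of $Z'Z$ and therefore commute, which is why $H_d$ comes out symmetric.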

In MLR models, a main tool among influence diagnostic measures is Cook's distance. For the LE, this measure can be extended to assess influence on the estimator as

$$D_{d,i} = \frac{(\hat\beta_d - \hat\beta_{d(i)})'\, Z'Z\, (\hat\beta_d - \hat\beta_{d(i)})}{p\hat\sigma^2}, \qquad (3.7)$$

where $\hat\beta_d = (Z'Z + I)^{-1}(Z'y + d\hat\beta)$ is the LE and $\hat\beta_{d(i)}$ is obtained with the $i$th case omitted.

3.2.2 Properties of Pena’s statistic for the Liu Regression

Emami and Emami (2016) presented the properties of Pena's statistic for ridge regression. In this section, we present the following theorems regarding the properties of the statistic for Liu regression.

THEOREM 3.2.2.1: When $n \to \infty$ and all $h_{d,i}$ are small, then under the hypothesis of no outlier or high Liu leverage observation, $E(S_{d,i}) \approx 1/p$.

PROOF: Because $e_{d,j} = (1-h_{d,j})y_{d,j}$ implies $\mathrm{Var}(e_{d,j}) = E(e_{d,j}^2) = (1-h_{d,j})\hat\sigma^2$, the mean value of the influence diagnostic $S_{d,i}$ can be derived from (3.5) as

$$E(S_{d,i}) = \frac{1}{p\hat\sigma^2 h_{d,i}} \sum_{j=1}^{n} \frac{h_{d,ji}^2\, E(e_{d,j}^2)}{(1-h_{d,j})^2} \le \frac{1}{p\,h_{d,i}} \sum_{j=1}^{n} \frac{h_{d,ji}^2}{1-h_{d,j}}.$$

If we let $h^\bullet = \max_{1\le i\le n} h_{d,i}$, then the upper bound of the mean value is

$$E(S_{d,i}) \le \frac{1}{p(1-h^\bullet)} = \frac{1}{p} + \frac{h^\bullet}{p(1-h^\bullet)}.$$

On the other hand, since $h_{d,j} \ge 1/n$, we have

$$E(S_{d,i}) = \frac{1}{p\,h_{d,i}} \sum_{j=1}^{n} \frac{h_{d,ji}^2}{1-h_{d,j}} \ge \frac{1}{p(1-1/n)}.$$

These bounds show that if $h^\bullet \to 0$ and $n$ is large, then the mean influence of all observations is approximately $1/p$. This implies that in a sample without high Liu leverage cases, the majority of cases have the same mean sensitivity with respect to the whole sample. Accordingly, observations whose values lie far from this benchmark can be considered influential.
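Theorem 3.2.2.1 can be checked numerically: averaging $S_{d,i}$ over many clean replications should give a value close to $1/p$ when the leverages are small. A rough Monte Carlo sketch (sizes, $d$ and the number of replications are arbitrary choices, not the thesis settings):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, d, reps = 100, 5, 0.7, 200
I = np.eye(p)
means = []
for _ in range(reps):
    Z = rng.normal(size=(n, p))                 # clean data, small leverages
    y = Z @ np.ones(p) + rng.normal(size=n)
    ZtZ = Z.T @ Z
    H = Z @ np.linalg.inv(ZtZ) @ Z.T
    Hd = Z @ np.linalg.inv(ZtZ + I) @ (ZtZ + d * I) @ np.linalg.inv(ZtZ) @ Z.T
    hd = np.diag(Hd)
    ed = y - Hd @ y                             # Liu residuals
    e = y - H @ y                               # OLS residuals
    s2 = e @ e / (n - p)
    Sd = (Hd**2 @ (ed**2 / (1 - hd) ** 2)) / (p * s2 * hd)
    means.append(Sd.mean())

# the average sensitivity should sit near 1/p = 0.2 for these settings
print(round(float(np.mean(means)), 3), "vs 1/p =", 1 / p)
```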

THEOREM 3.2.2.2: In a sample with small $h_{d,i}$, as $n \to \infty$ and $p \to \infty$ with $p/n \to 0$, the distribution of $S_{d,i}$ is normal.

PROOF: This result is proved through the Central Limit Theorem. We assume there is no outlier and that $h^\bullet = \max_{1\le i\le n} h_{d,i} < c\bar h$ for some $c > 0$, where $\bar h = \sum_{i=1}^{n} h_{d,i}/n$. Let $n \to \infty$ and $p \to \infty$ with $p/n \to 0$; then the influence diagnostic can be written as

$$S_{d,i} = \sum_{j=1}^{n} m_{ij} \left(\frac{e_{d,j}}{\hat\sigma}\right)^2, \qquad \text{where} \qquad m_{ij} = \frac{h_{d,ji}^2}{p\,h_{d,i}(1-h_{d,j})^2}.$$

The residuals $e_{d,j}$ are normal variables with covariance matrix $\sigma^2(I - H_d)$. Hence, when $n \to \infty$ and $h_{d,ij} \to 0$, the influence diagnostic $S_{d,i}$ is a weighted combination of chi-squared variables with 1 d.f. The coefficients $m_{ij}$ are positive, and the relative weight of each chi-squared component satisfies $m_{ij}/\sum_{j=1}^{n} m_{ij} \to 0$. Since

$$m_{ij} \le \frac{h_{d,j}}{p(1-h_{d,j})^2} \approx \frac{h_{d,j}(1 + 2h_{d,j})}{p},$$

we have

$$\frac{m_{ij}}{\sum_{j=1}^{n} m_{ij}} \le \frac{h_{d,j}(1 + 2h_{d,j})}{p + 2\sum_{j=1}^{n} h_{d,j}^2} \le \frac{h_{d,j}(1 + 2h_{d,j})}{p},$$

and as $p \to \infty$ the relative weight of each chi-squared component approaches zero, since $h_{d,j} \ge 0$ and $\sum_{j=1}^{n} h_{d,j} \approx p$. Therefore, the asymptotic distribution of $S_{d,i}$ under these hypotheses is normal.

Since the distribution of $S_{d,i}$ may be affected by the presence of influential observations, we can utilize high breakdown estimates: a heterogeneous observation is declared influential if

$$|S_{d,i} - \mathrm{Median}(S_{d,i})| \ge 4.5\,\mathrm{MAD}(S_{d,i}).$$

Under this rule, high Liu leverage outliers are recognized by a low $S_{d,i}$, which would suggest a one-tailed test with the alternative hypothesis on the left side, while middle leverage outliers are recognized by a larger $S_{d,i}$, which would suggest a one-tailed test with the alternative hypothesis on the right side.

THEOREM 3.2.2.3: When the data contain a group of high Liu leverage identical outliers, the sensitivity statistic will identify them.

PROOF: Suppose we have a sample of $n$ points $(y_1, z_1'), \dots, (y_n, z_n')$, and let $Z_0 = [z_1, \dots, z_n]'$, $y_0 = [y_1, \dots, y_n]'$, $\hat\beta_d = (Z'Z + I)^{-1}(Z'y + d\hat\beta)$ and $u_i = y_i - z_i'\hat\beta_d$. We consider $k$ identical high Liu leverage outliers $(y_a, z_a')$ that contaminate the sample; let $u_a = y_a - z_a'\hat\beta_d$ be the residual with respect to the LE and let $e_{d,i} = y_i - z_i'\hat\beta_{d(T)}$ be the Liu residuals for the complete model with $n + k$ points, where $\hat\beta_{d(T)} = (Z_T'Z_T + I)^{-1}(Z_T'y_T + d\hat\beta_d)$, $Z_T = [Z_0', z_a 1_k']'$, $y_T = (y_0', y_a 1_k')'$ and $1_k$ is a $k \times 1$ vector of ones. Let

$$H_{d(T)} = Z_T(Z_T'Z_T + I)^{-1}(Z_T'Z_T + dI)(Z_T'Z_T)^{-1}Z_T'$$

be the hat matrix for the sample of $n + k$ points, and let

$$H_{d(0)} = Z_0(Z_0'Z_0 + I)^{-1}(Z_0'Z_0 + dI)(Z_0'Z_0)^{-1}Z_0'$$

be the projection matrix for the clean data set. We partition the projection matrix $H_{d(T)}$ (written $H_d$ for short) as

$$H_d = \begin{pmatrix} H_{d(11)} & H_{d(12)} \\ H_{d(21)} & H_{d(22)} \end{pmatrix},$$

where $H_{d(11)}$ and $H_{d(22)}$ are of order $n \times n$ and $k \times k$ respectively, and

$$H_{d(11)} = H_{d(0)} - \frac{k}{k h_{d(a)} + 1}\, h_{d(1a)} h_{d(1a)}', \qquad (3.8)$$

where

$$h_{d(a)} = z_a'(Z_0'Z_0 + I)^{-1}(Z_0'Z_0 + dI)(Z_0'Z_0)^{-1} z_a,$$
$$h_{d(1a)} = Z_0(Z_0'Z_0 + I)^{-1}(Z_0'Z_0 + dI)(Z_0'Z_0)^{-1} z_a.$$

Also,

$$H_{d(12)} = H_{d(21)}' = \frac{1}{k h_{d(a)} + 1}\, h_{d(1a)} 1_k', \qquad (3.9)$$

and

$$H_{d(22)} = \frac{1}{k h_{d(a)} + 1}\, h_{d(a)} 1_k 1_k'. \qquad (3.10)$$

The observed Liu residuals $e_{d,i} = y_i - z_i'\hat\beta_{d(T)}$ are related to the true Liu residuals $u_{d,i} = y_i - z_i'\hat\beta_{d(0)}$ by

$$e_{d,i} = u_i - k h_{d(ia)} u_a, \qquad i = 1, \dots, n, \qquad (3.11)$$

and, for the outlier cases, by

$$e_{d,a} = \frac{u_a}{k h_{d(a)} + 1}. \qquad (3.12)$$

Using Eq. (3.11), Cook's distance for the clean points is given by

$$D_{d,i} = \frac{\left(u_i - k h_{d(ia)} u_a\right)^2 h_{d,i}}{p\hat\sigma^2 (1-h_{d,i})^2}. \qquad (3.13)$$

For the outlier points, using Eq. (3.11), this statistic can be written as

$$D_{d(ia)} = \frac{u_a^2\, h_{d(a)}}{p\hat\sigma^2 \left(1 + (k-1)h_{d(a)}\right)\left(1 + k h_{d(a)}\right)^2}. \qquad (3.14)$$

Suppose we have high Liu leverage outliers and let $h_{d(a)} \to \infty$; then $H_{d(12)} = H_{d(21)}' \to 0$, which implies $h_{d(ja)} \to 0$ for $j = 1, \dots, n$, and $H_{d(22)} \to k^{-1} 1_k 1_k'$ implies $h_{d(a)} \to k^{-1}$. Then

$$\gamma_{ja}^2 = \frac{h_{d(ja)}^2}{h_{d,j}\, h_{d(a)}}$$

approaches zero for $j = 1, \dots, n$ and approaches $k n^{-1}$ for $j = n+1, \dots, n+k$. Hence for the clean points we have

$$S_{d,i} = \sum_{j=1}^{n} \gamma_{ji}^2 D_{d,j}, \qquad i = 1, \dots, n,$$

whereas for the outliers,

$$S_{d,i} = k^2 n^{-1} D_{d(a)}, \qquad i = n+1, \dots, n+k. \qquad (3.15)$$

For clean observations, if $h_{d(a)} \to \infty$ then $h_{d(ja)} \to 0$ and, by applying Eq. (3.12), $e_{d,i} \to u_i$. By an argument similar to that of Theorem 3.2.2.1, the mean value of the influence diagnostic for the clean data is approximately $1/p$. However, using Eq. (3.12) for the outliers, when $h_{d(a)} \to \infty$ then $e_{d,a} \to 0$, $D_{d(ia)} \to 0$ and also $S_{d,i} \to 0$. Consequently, in the Liu framework the diagnostic measure will be near 0 for high leverage outliers and near $1/p$ for the clean points. Additionally, the diagnostic measure is appropriate for detecting Liu middle leverage outliers, which are not detected by Cook's measure. In fact, Liu middle leverage outliers are a set of outliers with $h_{d(a)} \ge \max_{1\le i\le n} h_{d,i}$, that is, their Liu leverages are greater than those of the clean points, while the true Liu residual $u_a$ is such that the observed Liu residuals $e_{d,a}$ given by Eq. (3.12) are not near 0. The cross leverages $h_{d(ia)}$ between the clean points and the outliers in Eq. (3.12) will still be small, and thus $\gamma_{ia}^2$ will also be small. Consequently, the diagnostic measure for the Liu outliers will accumulate Cook's distance over these outliers, and the value of $S_{d,i}$ will be greater than its value for clean cases. Hence, intermediate leverage points can easily be detected by $S_{d,i}$.

3.3 Simulation study

3.3.1 Normality of the proposed measure

We follow McDonald and Galarneau (1975), Newhouse and Oman (1971), Liu (2003), Kibria (2003), Aslam (2014) and Emami and Emami (2016) in generating the explanatory variables for the simulation study:

$$x_{ij} = (1-\theta^2)^{1/2} w_{ij} + \theta w_{i,p+1}, \qquad i = 1, \dots, n, \quad j = 1, \dots, p,$$

and

$$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \quad i = 1, 2, \dots, n,$$

where the $w_{ij}$'s and $\varepsilon_i$ are independent standard normal pseudo-random numbers and $\theta^2$ is the correlation between any two explanatory variables. Three correlation levels are considered, corresponding to $\theta$ = 0.7, 0.9 and 0.99, which represent weak, strong and severe multicollinearity between the regressors, respectively. The resulting condition numbers of the generated $X$ equal 5.537e+002, 4.545e+004 and 3.738e+005, respectively. Figures 3.1 to 3.4 show (a) the q-q plot of the proposed measure, (b) the q-q plot of Cook's measure, (c) the plot of Cook's measure versus the proposed measure and (d) the plot of the individual values of the proposed measure versus observation number, for the first data set using $\theta$ = 0.7, n = 200 and p = 10. The second panel of each figure shows the same plots for the second data set using $\theta$ = 0.9, n = 500 and p = 20, and the third panel shows them for the third data set using $\theta$ = 0.99, n = 1000 and p = 40. It may be easily

Figure 3.1: Influence analysis of the three generated data sets with multicollinear regressors: normal q-q plots of the proposed measure.

Figure 3.2: Influence analysis of the three generated data sets with multicollinear regressors: normal q-q plots of Cook's distance.

Figure 3.3: Influence analysis of the three generated data sets with multicollinear regressors: plots of Cook's distance against the proposed measure.

Figure 3.4: Influence analysis of the three generated data sets with multicollinear regressors: plots of the proposed measure versus observation number.

seen that the distribution of Cook's measure is asymmetric in all three conditions, whereas the distribution of the proposed measure is roughly symmetric. It is also easily seen from Fig. 3.4 that the expected value of the proposed measure in the three data sets is near 1/10, 1/20 and 1/40, respectively, which is consistent with Theorem 3.2.2.1.
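The generation scheme above can be sketched as follows. For simplicity a single $(n, p)$ pair is used for all three values of $\theta$, so the printed condition numbers will not reproduce the ones quoted in the text; only the monotone growth of the condition number with $\theta$ is illustrated:

```python
import numpy as np

def make_design(n, p, theta, rng):
    """x_ij = (1 - theta^2)^(1/2) w_ij + theta w_{i,p+1}, w ~ N(0, 1)."""
    w = rng.normal(size=(n, p + 1))
    return np.sqrt(1 - theta**2) * w[:, :p] + theta * w[:, [p]]

rng = np.random.default_rng(5)
conds = {}
for theta in (0.7, 0.9, 0.99):
    X = make_design(200, 10, theta, rng)
    conds[theta] = np.linalg.cond(X.T @ X)     # condition number of X'X
    print(theta, f"{conds[theta]:.3e}")
```

By construction the pairwise correlation between any two columns is approximately $\theta^2$, which is what drives the growth of the condition number.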

3.3.2 Detection of intermediate and high Liu leverage outliers

Following Emami and Emami (2016), the performance of the proposed measure is shown by simulation as follows. In condition (a), a model $y_i = x_{i1} + x_{i2} + \varepsilon_i$ of sample size $n$ is generated, where $x_{i2} = \theta x_{i1} + \upsilon_i$ and $x_{i1}, \upsilon_i \sim N(0,1)$. The last three points of this sample are then altered by introducing three outliers of size 5 but with different Liu leverages: condition (b) corresponds to high Liu leverage ($x_{ij} = 20$, $j = 1, 2$) and condition (c) to intermediate Liu leverage ($x_{ij} = 5$, $j = 1, 2$). A simulated sample of 30 observations with three outliers and $\theta$ = 0.90 is generated; the data and the values of the proposed measure and Cook's measure for the three conditions are presented in Table 3.1.

The condition number of $X'X$ is around 292.485 and the estimated value of $d$ is 0.9799 by the method of Liu (1993). The results of Table 3.1 for the three situations are shown in Figs. 3.5 and 3.6. In (a), all the values of the proposed measure are near its expected value 1/2. In (b), the values of the proposed measure for the high Liu leverage outliers are near 0 and for the good cases they are near the expected value,

Table 3.1: Three data sets and the proposed diagnostic measures.

Row |    y    |   x1    |   x2    |   D    |   Sa   |   Sb   |   Sc
  1 | -1.3898 | -0.1227 |  1.7746 | 0.0012 | 0.5037 | 0.5637 | 0.5205
  2 | -1.4887 |  0.7618 |  1.6265 | 0.0065 | 0.4831 | 0.6203 | 0.5347
  3 | -0.3239 |  0.1498 |  1.1951 | 0.0564 | 0.4472 | 0.5725 | 0.5028
  4 | -0.5833 |  0.6551 |  1.3388 | 0.1542 | 0.3807 | 0.5428 | 0.4967
  5 |  0.4492 | -0.4890 |  0.6772 | 0.0059 | 0.4738 | 0.5945 | 0.4726
  6 |  2.3370 |  2.4673 |  1.8450 | 0.1709 | 0.4011 | 0.6134 | 0.4633
  7 |  0.9967 |  1.4856 | -0.2670 | 0.0004 | 0.5401 | 0.5532 | 0.4823
  8 | -1.8136 | -0.6316 | -2.0073 | 0.0011 | 0.5014 | 0.5603 | 0.5136
  9 | -1.0605 | -1.0897 | -0.5715 | 0.0013 | 0.3253 | 0.5315 | 0.4865
 10 | -6.6688 | -0.7756 |  0.0127 | 0.0452 | 0.3351 | 0.5517 | 0.5174
 11 | -1.1869 | -0.0416 |  0.9717 | 0.0687 | 0.5307 | 0.6367 | 0.5044
 12 |  1.9506 |  0.6939 |  0.5083 | 0.0763 | 0.5512 | 0.6015 | 0.5235
 13 |  1.6402 | -1.9490 | -2.4850 | 0.0162 | 0.4732 | 0.5636 | 0.4954
 14 |  1.3028 |  1.0280 | -0.5792 | 0.0021 | 0.5371 | 0.5372 | 0.5014
 15 |  4.2125 |  0.1168 | -1.3495 | 0.0061 | 0.3989 | 0.5401 | 0.5175
 16 |  0.6906 | -0.5594 |  0.3680 | 0.1546 | 0.5582 | 0.5265 | 0.5034
 17 | -0.9626 |  0.3975 | -0.5164 | 0.0167 | 0.4721 | 0.6472 | 0.5313
 18 |  2.4385 | -0.3691 | -1.6417 | 0.0066 | 0.4506 | 0.6104 | 0.5107
 19 |  2.1242 | -0.3990 |  0.0660 | 0.0032 | 0.3751 | 0.6507 | 0.4723
 20 |  0.4442 |  0.3828 | -0.5217 | 0.0019 | 0.3627 | 0.6737 | 0.4854
 21 | -4.8017 | -1.4104 | -2.1941 | 0.0017 | 0.3109 | 0.6205 | 0.5061
 22 | -3.7075 | -1.3121 | -0.5587 | 0.0238 | 0.4538 | 0.5963 | 0.5163
 23 |  1.0648 |  0.7954 |  1.1551 | 0.0176 | 0.3849 | 0.5735 | 0.4742
 24 | -2.0159 | -0.2967 | -0.3343 | 0.0016 | 0.3124 | 0.5693 | 0.5041
 25 | -3.3424 |  1.1590 |  0.2250 | 0.0005 | 0.3715 | 0.5567 | 0.4907
 26 |  2.5472 | -0.0222 |  0.0974 | 0.0001 | 0.5637 | 0.6193 | 0.5105
 27 |  2.1218 |  0.5596 | -0.4547 | 0.0221 | 0.4963 | 0.6411 | 0.5201
 28 |  0.1445 (5 in b, c) |  1.1991 (20 in b, 5 in c) |  2.4441 (20 in b, 5 in c) | 0.0006 | 0.5287 | 0.1567 | 0.6837
 29 | -2.2983 (5 in b, c) |  0.8106 (20 in b, 5 in c) |  1.6888 (20 in b, 5 in c) | 0.0002 | 0.5527 | 0.1567 | 0.6837
 30 | -5.1350 (5 in b, c) | -0.2381 (20 in b, 5 in c) |  1.1027 (20 in b, 5 in c) | 0.0033 | 0.5473 | 0.1567 | 0.6837

The three data sets share observations 1-27; in the last three observations (28-30) the bracketed values replace the originals in conditions b and c. Sa: no outliers; Sb: high Liu leverage outliers; Sc: middle Liu leverage outliers.

Figure 3.5: Plots of Cook's measure against the proposed diagnostic measure in three situations: (a) no outliers, (b) high Liu leverage outliers and (c) middle Liu leverage outliers.

Figure 3.6: Plots of the proposed measure versus observation number in three situations: (a) no outliers, (b) high Liu leverage outliers and (c) middle Liu leverage outliers.

which is consistent with Theorem 3.2.2.3. In (c), the values of the proposed measure are greater for the outliers than for the good points.

3.3.3 The Monte Carlo simulation

In this section, we perform a simulation designed to compare the performance of the suggested identification technique with Cook's distance $D_{d,i}$. The performance of the proposed technique was examined using the model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i, \qquad 1 \le i \le 40,$$

in which, for $1 \le i \le 40 - n_0$, $\beta_0 = \beta_1 = \beta_2 = \beta_3 = 1$, $\varepsilon_i$ is a random error and the $x_{ij}$'s are produced by

$$x_{ij} = (1-\theta^2)^{1/2} w_{ij} + \theta w_{i,p+1}, \qquad i = 1, \dots, n, \quad j = 1, \dots, p,$$

with $\theta^2$ = 0.70, 0.90 and 0.99, which makes the regressors multicollinear. To construct an outlier pattern for the LE, this sample is then contaminated by a set of $n_0$ identical middle and high leverage outliers. The final $n_0$ cases, that is, $40 - n_0 + 1 \le i \le 40$, are identical, with $x_1 = x_0$, $x_2 = x_3 = 0$ and $y_0 = m x_0$; the values taken for $x_0$ were 5 and 10, and the contamination magnitude $m$ was set at 1, 2, 3 and 4. The number $n_0$ of outliers was taken as 2, 6 and 8, corresponding to 5%, 15% and 20% contamination. The level of 5% was selected to represent a small amount of contamination, while 15% and 20% represent highly contaminated samples. For every contamination

Table 3.2: Percentage of outliers identified by $S_{d,i}$ in the simulation.

                          % detected, x0 = 5             % detected, x0 = 10
  d    θ²   %con    m=1   m=2   m=3   m=4      m=1   m=2   m=3   m=4
 0.1  0.70    5    14.6  15.3  15.5  14.7     17.8  19.8  16.8  17.2
             15    13.2  14.3  14.7  14.2     15.0  13.5  15.0  15.0
             20    12.9  13.1  13.2  12.8     13.2  13.0  12.9  14.2
      0.90    5    15.2  15.4  15.0  12.9     16.7  19.2  19.8  19.2
             15    14.4  14.8  14.7  14.4     16.6  15.5  16.7  15.5
             20    14.3  13.3  12.6  13.7     14.8  14.6  15.0  14.0
      0.99    5    12.3  13.4  11.6  13.7     18.8  18.9  20.9  17.5
             15    12.1  13.2  11.3  12.7     16.9  15.8  15.1  14.4
             20    11.4  12.4  11.1  12.0     14.9  15.1  14.3  13.8
 0.5  0.70    5    17.8  17.7  16.0  18.5     22.3  21.1  23.8  21.8
             15    14.1  15.3  14.4  12.9     16.2  14.2  15.7  14.6
             20    11.9  12.9  12.8  12.4     15.0  14.0  14.1  13.6
      0.90    5    18.5  19.8  20.6  18.8     22.4  23.2  21.5  22.3
             15    14.3  15.1  13.3  14.4     17.1  15.2  14.6  16.4
             20    13.5  13.0  12.9  13.2     14.9  14.2  13.5  15.1
      0.99    5    21.1  23.7  24.5  21.7     26.0  24.9  26.5  26.8
             15    14.8  15.5  15.2  15.1     15.5  16.5  15.2  18.4
             20    12.7  13.7  14.0  13.0     13.7  14.9  14.2  16.2
 0.9  0.70    5    19.4  19.6  17.7  18.9     24.2  24.1  25.2  26.4
             15    14.2  14.3  14.4  15.0     15.3  15.9  17.6  15.6
             20    11.3  11.9  11.7  11.8     13.6  14.3  15.8  14.5
      0.90    5    21.9  21.5  22.0  23.7     24.3  25.9  27.2  26.4
             15    14.5  14.6  13.6  13.9     17.1  16.8  15.9  16.4
             20    11.9  11.6  12.0  12.1     15.6  15.4  14.0  14.8
      0.99    5    28.2  27.1  30.8  28.7     29.2  30.2  28.6  30.6
             15    16.5  17.1  16.4  15.1     16.9  17.1  16.8  17.2
             20    12.2  11.7  11.3  11.6     13.5  16.5  15.0  15.7
 1.0  0.70    5    19.6  18.3  18.7  19.3     26.9  26.0  25.2  25.7
             15    14.5  14.8  14.0  14.1     16.2  16.3  18.6  16.3
             20    11.6  11.1  12.1  11.8     14.5  15.0  17.4  15.3
      0.90    5    24.7  26.1  25.5  23.0     26.8  28.8  28.3  28.0
             15    14.8  15.0  14.3  13.4     15.5  17.5  16.5  16.4
             20    11.3  11.7  12.0  11.5     15.0  16.2  15.7  15.3
      0.99    5    30.5  29.5  30.2  31.2     31.8  31.9  29.3  30.6
             15    17.0  18.0  18.1  17.9     15.8  17.8  16.7  18.4
             20    12.3  12.6  11.6  12.0     14.4  16.2  15.0  16.7

Table 3.3: Percentage of outliers identified by $D_{d,i}$ in the simulation.

                          % detected, x0 = 5             % detected, x0 = 10
  d    θ²   %con    m=1   m=2   m=3   m=4      m=1   m=2   m=3   m=4
 0.1  0.70    5    10.1  11.2  11.3  11.4     13.7  14.1  14.0  14.3
             15     9.7  11.0  11.1  10.4     12.0  11.6  12.5  11.8
             20     9.3  10.3  10.6  10.0     10.8  11.0  10.9  11.1
      0.90    5    11.3  12.0  12.1  10.6     13.4  14.8  14.6  14.5
             15    10.2  10.4  11.3   9.8     13.0  12.5  13.3  12.6
             20     9.9  10.0  11.0   9.5     12.4  11.7  12.0  11.2
      0.99    5    10.4  10.1  10.0   9.9     14.8  13.9  15.8  14.8
             15    10.0   9.4   9.6   9.0     13.0  12.8  12.1  11.4
             20     9.2   9.1   9.1   8.9     10.9  11.1  11.4  10.8
 0.5  0.70    5    14.5  14.7  14.0  15.5     15.5  15.0  16.1  15.6
             15    12.1  13.3  12.4  10.9     13.1  12.8  12.0  12.1
             20    10.8  10.7  10.4   9.8     12.0  11.7  10.3  11.0
      0.90    5    15.5  14.7  16.6  15.5     16.0  16.2  15.6  16.3
             15    12.3  13.1  12.3  12.3     14.2  14.2  13.6  14.4
             20    10.5  12.1  11.9  11.0     12.9  12.2  11.5  12.1
      0.99    5    16.1  15.7  16.5  15.5     16.4  16.9  17.0  16.7
             15    12.2  11.9  12.8  12.5     12.4  13.5  12.2  13.4
             20    10.2  10.0  11.1  10.4     11.7  11.9  10.2  11.3
 0.9  0.70    5    14.5  14.6  15.0  15.2     16.5  16.0  17.2  16.7
             15    11.4  11.6  12.0  12.3     12.3  12.9  13.6  11.6
             20    10.1   9.2   9.7   9.0     11.5  11.3  12.8  11.5
      0.90    5    15.5  15.4  16.0  16.2     16.1  16.3  17.0  16.5
             15    11.9  11.6  11.2  10.8     13.1  13.8  12.9  13.4
             20    10.0  11.3   9.9   9.8     12.6  12.4  11.3  11.8
      0.99    5    16.1  15.9  16.3  15.2     17.0  17.8  16.9  17.6
             15    13.0  12.9  13.1  12.6     13.5  14.4  13.4  14.2
             20    11.8  10.5  10.4  10.2     11.6  12.5  12.0  12.6
 1.0  0.70    5    14.2  14.0  13.9  14.4     16.5  16.0  15.9  15.7
             15    12.3  12.1  12.0  11.8     13.0  13.2  14.5  14.1
             20    10.5  10.1  10.0  10.6     11.4  10.6  12.0  12.6
      0.90    5    16.8  16.2  16.9  15.7     16.0  16.7  16.4  16.6
             15    12.4  13.0  12.3  11.6     12.7  13.0  13.3  13.2
             20    10.2  10.0  10.4  10.0     11.0  12.1  11.9  12.3
      0.99    5    17.1  17.4  17.9  18.0     17.5  17.8  17.0  17.2
             15    13.0  14.2  14.1  13.9     13.0  13.4  13.3  13.2
             20    10.1  10.5  10.0   9.9     11.7  12.0  11.8  12.3

pattern, 500 simulations were run. The outlier identification results of $S_{d,i}$ and Cook's distance $D_{d,i}$ are presented in Tables 3.2 and 3.3, respectively.
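One cell of this experiment can be sketched as follows: generate the contaminated sample, compute $S_{d,i}$, apply rule (3.6), and record the fraction of planted outliers that is flagged. The settings shown ($d = 0.5$, $\theta^2 = 0.90$, $n_0 = 2$, $x_0 = 10$, $m = 2$) pick out one cell of Table 3.2; with a different pseudo-random number stream the rates will only roughly resemble the tabulated ones:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, d = 40, 3, 0.5
theta = np.sqrt(0.90)                          # theta^2 = 0.90
n0, x0, m = 2, 10, 2                           # 5% contamination, high leverage

def detection_rate(reps=100):
    hits = 0
    for _ in range(reps):
        w = rng.normal(size=(n, p + 1))
        X = np.sqrt(1 - theta**2) * w[:, :p] + theta * w[:, [p]]
        y = 1.0 + X.sum(axis=1) + rng.normal(size=n)
        X[-n0:] = 0.0
        X[-n0:, 0] = x0                        # x1 = x0, x2 = x3 = 0
        y[-n0:] = m * x0                       # y0 = m * x0
        Z = np.column_stack([np.ones(n), X])
        q = Z.shape[1]
        ZtZ = Z.T @ Z
        I = np.eye(q)
        Hd = Z @ np.linalg.inv(ZtZ + I) @ (ZtZ + d * I) @ np.linalg.inv(ZtZ) @ Z.T
        hd = np.diag(Hd)
        ed = y - Hd @ y                        # Liu residuals
        e = y - Z @ np.linalg.inv(ZtZ) @ Z.T @ y
        s2 = e @ e / (n - q)
        Sd = (Hd**2 @ (ed**2 / (1 - hd) ** 2)) / (q * s2 * hd)
        med = np.median(Sd)
        mad = np.median(np.abs(Sd - med)) / 0.6745
        flag = np.abs(Sd - med) >= 4.5 * mad   # rule (3.6)
        hits += int(flag[-n0:].sum())
    return hits / (reps * n0)

print(f"detected: {100 * detection_rate():.1f}%")
```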

It can be seen from Tables 3.2 and 3.3 that the $S_{d,i}$ procedure is powerful for detecting outliers at 5% contamination and that it avoids the masking problem when the outliers have high leverage ($x_0 = 10$), while the performance of Cook's distance $D_{d,i}$ decreases as the collinearity increases. The results show that the $S_{d,i}$ procedure performs better than Cook's distance $D_{d,i}$ in every combination of $d$, collinearity level and percentage of contamination. Generally, Cook's distance $D_{d,i}$ loses its power to detect outliers as the number of explanatory variables, the size of the data set and the percentage of contamination increase, whereas $S_{d,i}$ performs satisfactorily in most cases and is significantly better than Cook's distance $D_{d,i}$. The findings are summarized as follows. In the situation with $d$ = 0.1, $\theta^2$ = 0.70, $x_0$ = 5 and $m$ = 1, with 5% contamination, the $S_{d,i}$ procedure identifies 14.6% of the outliers, whereas Cook's distance $D_{d,i}$ correctly identifies only 10.1%. The $S_{d,i}$ procedure gives consistent results with better accuracy. We also note that for large data sets in high dimension, the detection of outliers by the $S_{d,i}$ procedure is more accurate than that by Cook's distance $D_{d,i}$.

3.4 Longley data

The Longley (1967) data set, with n = 16 observations and p = 6 explanatory variables, has been widely used to illustrate the impact of strong multicollinearity on OLS. The scaled condition number of this data set is 43,275 (Belsley et al., 1980); this substantial number indicates the presence of extreme multicollinearity among the regressors. The data set has been considered by several authors in connection with the identification of influential cases. Cook (1977) applied Cook's measure $D_i$ to this data set and found observations 5, 16, 4, 10 and 15 (in this specific order) to be the most influential observations under OLS. Walker and Birch (1988) investigated this data set to identify anomalous observations in RR by applying Cook's distance and DFFITS; they detected cases 16, 10, 4, 15 and 5 (in this specific order) as the most influential. Emami and Emami (2016) likewise investigated these data to identify influential cases in RR using the generalized influence measure of Pena (2005); they identified cases 16, 15, 10, 5 and 1 (in this order) and 16, 15, 10, 5 and 4 (in this specific order) as influential observations for k = 0 and k = 0.0002, respectively. In this chapter, we apply this data set to evaluate the influential cases for the LE using Pena's measure. The influence statistic (3.5) was computed for various values of $d$ and the outcomes are displayed in Table 3.4. It can be seen that cases (3, 10, 11, 4, 5), (10, 3, 4, 15, 16), (10, 4, 15, 5, 16) and (15, 10, 5, 4, 16), in these orders, are the most influential cases for $d$ = 0.1, 0.5, 0.9 and $d$ = 1 (OLS), respectively. The five most influential cases with the largest $S_{d,i}$ are 16, 15, 10, 5 and 4 (in this order). Comparison of these cases with the cut-off point defined in Eq. (3.6) demonstrates that they can be regarded as influential points. The results are also graphed in Figs. 3.7 to 3.9: the histograms of the proposed measure isolate distinct groups in the data set, and the comparison of the largest values of the proposed measure with the cut-off defined in Eq. (3.6) shows that these cases can be viewed as influential.

Table 3.4: Five largest values of Sd,i for d = 0.1, 0.5, 0.9 and 1.0 (OLS case) for the Longley data.

d                Case    Sd,i
0.1              3       0.2604
                 10      0.2509
                 11      0.1477
                 4       0.1344
                 5       0.0165
0.5              10      0.4006
                 3       0.3101
                 4       0.2571
                 15      0.0614
                 16      0.0552
0.9              10      1.0410
                 4       0.7519
                 15      0.6310
                 5       0.2669
                 16      0.0772
1.0 (OLS case)   15      0.9728
                 10      0.9127
                 5       0.7652
                 4       0.7389
                 16      0.5223
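The computation behind Table 3.4 can be sketched in a few lines. The following Python sketch (an illustration with my own function and variable names, not the thesis code) forms the Liu (1993) estimator, its hat matrix, and a Pena-type statistic of the form of Eq. (3.5) built from the Liu residuals and leverages.

```python
import numpy as np

def liu_hat_matrix(X, d):
    """Hat matrix of the Liu (1993) estimator
    beta_d = (X'X + I)^{-1}(X'y + d*beta_ols), so that y_hat_d = H_d y
    with H_d = X (X'X + I)^{-1} (X'X + dI) (X'X)^{-1} X'."""
    p = X.shape[1]
    XtX = X.T @ X
    A = np.linalg.solve(XtX + np.eye(p), XtX + d * np.eye(p))
    return X @ A @ np.linalg.solve(XtX, X.T)

def pena_liu(X, y, d):
    """Pena-type statistic S_{d,i} from Liu residuals and leverages."""
    n, p = X.shape
    H = liu_hat_matrix(X, d)
    e = y - H @ y                    # Liu residuals
    h = np.diag(H)                   # Liu leverages
    sigma2 = e @ e / (n - p)         # residual variance estimate
    w = (e / (1.0 - h)) ** 2         # e_j^2 / (1 - h_jj)^2
    # S_{d,i} = (1/(p*sigma2*h_ii)) * sum_j h_ji^2 * w_j
    return (H**2 * w[:, None]).sum(axis=0) / (p * sigma2 * h)
```

At d = 1 the Liu hat matrix reduces to the OLS hat matrix, so S_{1,i} corresponds to the OLS block of Table 3.4.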

3.5 Conclusions

In this chapter, we discussed Pena's measure for MLR models using the LE. We proved that this measure is approximately normally distributed and can detect a subset of high Liu leverage outliers. The numerical findings demonstrate that the measure is helpful: the ability of Pena's statistic to correctly identify outliers is significantly better than that of Cook's distance when the LE is used. Nevertheless, the detection of unusual observations remains one of the primary tasks of a careful investigator, so in practice it is important to apply influence diagnostic measures alongside knowledge of the data and the expertise of the analyst.

Figure 3.7: Histograms of Sd,i of the Longley data for d = 0.1, 0.5, 0.9 and 1.0.

Figure 3.8: Scatter plots of Dd,i against Sd,i for d = 0.1, 0.5, 0.9 and 1.0.

Figure 3.9: Plots of Sd,i against observation number for d = 0.1, 0.5, 0.9 and 1.0.

Chapter 4

Pena’s statistic for the Modified

Ridge Regression

4.1 Introduction

In fitting a regression model, one or more observations may have a substantial effect on the estimators. Such observations are known as influential observations. Over the last three decades, influence diagnostics have received a great deal of attention, most of it concerned with linear regression models. References on this problem include Cook (1977), Belsley et al. (1980), Chatterjee and Hadi (1988) and Ullah and Pasha (2009) for various influence diagnostic measures.

In the study of influential cases, the most widely recognized technique for quantifying the impact of the ith case is to calculate single-case diagnostic measures with the ith case omitted. Since the work of Cook (1977), case omission diagnostic measures such as Cook's distance have been effectively applied to numerous statistical models, and most ideas for detecting influential cases are based on the case omission method. A new influence diagnostic measure was proposed by Pena (2005) and is known as Pena's statistic in the literature. In Pena's approach, instead of analysing how the omission of a point influences the parameter estimates, we observe how each case is affected by the rest of the data.

In most regression research, the predictors are not orthogonal. In some cases, the absence of orthogonality is not a serious issue, but when the predictor variables have a near-perfect linear relationship, inferences based on the regression model may be seriously misleading. When near-linear dependencies exist among the predictors, the problem of multicollinearity is said to be present. In such circumstances, the technique of OLS yields unstable estimates of the regression coefficients. In the available literature, many important biased estimation methods have been proposed to handle this issue, and among these the RR estimator (Hoerl and Kennard, 1970) is very popular. Shi and Wang (1999) and Walker and Birch (1988) studied leverages, Cook's distance and local influence measures in the RR setting. One of the ridge-type estimators suggested for fitting the model to multicollinear data is the MRR estimator (Swindel, 1976). Jahufer and Jianbao (2009) and Amanullah et al. (2013b) studied residuals, leverages and several case deletion measures for this estimator.

Recently, the influence of a few observations using Pena's statistic for the RR was studied by Emami and Emami (2016), and a modified Pena's statistic for biased estimators was studied by Adewale et al. (2016). However, no attention has been paid to the impact of anomalous cases on the results of the MRR. Consequently, in this chapter we concentrate on Pena's technique, defined as the squared norm of the vector of changes in the fitted value of a point when each of the points is deleted from the sample, applied to the MRR. Furthermore, we express this measure in terms of the MRR residuals and leverages. It is shown that this measure is asymptotically normal. It is also illustrated that this statistic can be used to detect a set of high MRR-leverage alike outliers which cannot be detected by Cook's measure.

This chapter is composed of five sections. Pena's statistic for the OLS and the MRR is introduced in Section 4.2, where we also examine some properties of the modified measure. A simulation study and a real-data example illustrating our results are presented in Sections 4.3 and 4.4, respectively. Lastly, Section 4.5 concludes the chapter.

4.2 Pena’s statistic

4.2.1 Pena’s statistic using the MRR

The goal of influence analysis is to quantify the impact of the ith case, and the most popular technique is to compute case deletion diagnostic measures with the ith case omitted. A number of diagnostics have been suggested for this purpose in regression modelling, including Pena's statistic. Pena (2005) proposed a new influence diagnostic measure that measures the influence of the ith case in an entirely different fashion: rather than deleting cases one at a time, Pena presented a technique to measure how every case is affected by the rest of the data. Pena considers the vectors

s_i = \left(\hat{y}_i - \hat{y}_{i(-1)}, \ldots, \hat{y}_i - \hat{y}_{i(-n)}\right)' = \left(\frac{h_{1i}e_1}{1 - h_{11}}, \ldots, \frac{h_{ni}e_n}{1 - h_{nn}}\right)',  (4.1)

where \hat{y}_i - \hat{y}_{i(-j)} is the difference between the ith fitted value of y with all points present and with the jth point omitted. Pena's measure for the ith case is defined as

S_i = \frac{s_i' s_i}{p\,\sigma^2_{(\hat{y}_i)}}; \qquad i = 1, 2, \ldots, n,  (4.2)

which can also be re-expressed as

S_i = \frac{1}{p\hat{\sigma}^2 h_{ii}} \sum_{j=1}^{n} \frac{h_{ji}^2 e_j^2}{\left(1 - h_{jj}\right)^2},  (4.3)

in which h_{ii} is the ith diagonal element and h_{ji} the (j, i)th element of H, and \sigma^2_{(\hat{y}_i)} = \hat{\sigma}^2 h_{ii}. One property of S_i is that, for large samples with numerous explanatory variables, its distribution is approximately normal. Cut-off points for the statistic can be based on this property: points whose standardized values (S_i - E(S_i))/SD(S_i) are sufficiently large may be considered influential observations. Since the mean and standard deviation of S_i are affected by the presence of influential observations, Pena (2005) proposed using the median and the median absolute deviation of S_i instead of the mean and standard deviation. Therefore, an observation is declared influential if

\left|S_i - \mathrm{Median}\left(S_i\right)\right| \geq 4.5\,\mathrm{MAD}\left(S_i\right),

where \mathrm{MAD}\left(S_i\right) = \mathrm{Median}\left|S_i - \mathrm{Median}\left(S_i\right)\right|/0.6745.
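The statistic of Eq. (4.3) and the median/MAD cut-off rule can be computed directly from the OLS hat matrix. The sketch below is illustrative (function and variable names are mine, not from the thesis); for clean data the values of S_i should cluster around 1/p, as Pena (2005) shows.

```python
import numpy as np

def pena_statistic(X, y):
    """Pena's (2005) statistic of Eq. (4.3):
    S_i = (1/(p*sigma2*h_ii)) * sum_j h_ji^2 e_j^2 / (1 - h_jj)^2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # OLS hat matrix
    e = y - H @ y                           # OLS residuals
    h = np.diag(H)                          # leverages
    sigma2 = e @ e / (n - p)
    w = (e / (1.0 - h)) ** 2
    return (H**2 * w[:, None]).sum(axis=0) / (p * sigma2 * h)

def influential_flags(S, c=4.5):
    """Robust cut-off rule: |S_i - Median(S)| >= c * MAD(S),
    with MAD(S) = Median|S - Median(S)| / 0.6745."""
    med = np.median(S)
    mad = np.median(np.abs(S - med)) / 0.6745
    return np.abs(S - med) >= c * mad
```

Cases flagged by `influential_flags` are then examined as potential influential observations.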

The existence of multicollinearity in the data has some destructive effects on regression analysis. To combat multicollinearity, various remedial measures have been suggested, such as ridge- and Liu-type estimators. Swindel (1976) proposed an estimator with biasing parameter k, defined as

\hat{\beta}_{(k,b)} = \left(Z'Z + kI\right)^{-1}\left(Z'y + kb\right),

in which k > 0 is the MRR biasing parameter and b is a q × 1 prior information vector. If b = 0, then \hat{\beta}_{(k,b)} reduces to the RR estimator \hat{\beta}_{(k)}, while for b = 0 and k = 0, \hat{\beta}_{(k,b)} coincides with the OLS estimator \hat{\beta}. The prior information vector of the MRR is the same as that introduced by Swindel (1976).
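As a quick illustration of the definition above (a sketch with hypothetical variable names), the estimator and its two special cases can be coded directly:

```python
import numpy as np

def mrr_estimator(Z, y, k, b):
    """Swindel's (1976) modified ridge estimator
    beta_(k,b) = (Z'Z + kI)^{-1} (Z'y + k*b)."""
    q = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(q), Z.T @ y + k * b)
```

With b = 0 this is the ordinary ridge estimator; with k = 0 it collapses to OLS; and as k grows the estimate is pulled toward the prior vector b.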

Now, Pena’s statistic in MRR is defined as

0 s s(k,b)i S = (k,b)i ; i = 1, 2, . . . , n. (4.4) (k,b)i pσ2 (yˆ(k,b)i)

where s_{(k,b)i} = \left(\hat{y}_{(k,b)i} - \hat{y}_{(k,b)i(-1)}, \ldots, \hat{y}_{(k,b)i} - \hat{y}_{(k,b)i(-n)}\right)' and \hat{y}_{(k,b)i} - \hat{y}_{(k,b)i(-j)} is the difference between the ith fitted value of y with all points in the data and with the jth case omitted under the MRR. The scale dependence of the modified ridge estimator precludes a simple expansion of the deletion formula for S_{(k,b)i}; the principal difficulty lies in the calculation of \hat{\beta}_{(k,b)(i)}, because the Z_{(i)} matrix must be rescaled to equal column lengths. The modified ridge estimator after deleting the ith observation can also be written as

\hat{\beta}_{(k,b)(i)} = \left(Z'Z - z_i z_i' + kI\right)^{-1}\left(Z'y - z_i y_i + kb\right).

By applying the Sherman-Morrison-Woodbury (SMW) theorem (Belsley et al., 1980),

S(k,b)i can be re-expressed as

S_{(k,b)i} = \frac{1}{p\hat{\sigma}^2 h_{(k,b)i}} \sum_{j=1}^{n} \frac{h_{(k,b)ji}^2 e_{(k,b)j}^2}{\left(1 - h_{(k,b)j}\right)^2},  (4.5)

where h_{(k,b)i} is the ith diagonal element and h_{(k,b)ji} the (j, i)th element of the projection matrix H_{(k,b)} = Z\left(Z'Z + kI\right)^{-1}\left[I + k\left(Z'Z + kI\right)^{-1}\right]Z', while e_{(k,b)i} = y_i - \hat{y}_{(k,b)i}, \hat{y}_{(k,b)i} - \hat{y}_{(k,b)i(-j)} = h_{(k,b)ji} e_{(k,b)j}/\left(1 - h_{(k,b)j}\right) and \sigma^2_{(\hat{y}_{(k,b)i})} = \hat{\sigma}^2 h_{(k,b)i}. An observation is called influential if it satisfies the rule

\left|S_{(k,b)i} - \mathrm{Median}\left(S_{(k,b)i}\right)\right| \geq 4.5\,\mathrm{MAD}\left(S_{(k,b)i}\right),  (4.6)

where \mathrm{MAD}\left(S_{(k,b)i}\right) = \mathrm{Median}\left|S_{(k,b)i} - \mathrm{Median}\left(S_{(k,b)i}\right)\right|/0.6745 is the median absolute deviation of the values of S_{(k,b)i}.
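Because H_(k,b) has a closed form, S_(k,b)i of Eq. (4.5) can be computed without any explicit case deletion. A minimal sketch (illustrative names; the prior b is the one underlying this hat-matrix form, assumed here):

```python
import numpy as np

def mrr_hat_matrix(Z, k):
    """H_(k,b) = Z (Z'Z + kI)^{-1} [I + k (Z'Z + kI)^{-1}] Z',
    the matrix mapping y to the MRR fitted values."""
    q = Z.shape[1]
    A = np.linalg.inv(Z.T @ Z + k * np.eye(q))
    return Z @ A @ (np.eye(q) + k * A) @ Z.T

def pena_mrr(Z, y, k):
    """S_(k,b)i of Eq. (4.5) from the MRR residuals and leverages."""
    n, p = Z.shape
    H = mrr_hat_matrix(Z, k)
    e = y - H @ y                    # e_(k,b)
    h = np.diag(H)                   # h_(k,b)i
    sigma2 = e @ e / (n - p)
    w = (e / (1.0 - h)) ** 2
    return (H**2 * w[:, None]).sum(axis=0) / (p * sigma2 * h)
```

At k = 0 the hat matrix reduces to the OLS projection matrix, and for k > 0 its nonzero eigenvalues λ(λ + 2k)/(λ + k)² stay strictly below 1, so the denominators 1 − h_(k,b)j remain positive.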

In the standard linear regression model, a well-known single-case deletion measure in influence diagnostics is Cook's distance, proposed by Cook (1977). This measure for the MRR, on the same pattern as given by Jahufer and Jianbao (2009), can be defined as

D_{(k,b)i} = \frac{\left(\hat{\beta}_{(k,b)} - \hat{\beta}_{(k,b)(i)}\right)' Z'Z \left(\hat{\beta}_{(k,b)} - \hat{\beta}_{(k,b)(i)}\right)}{p\hat{\sigma}^2},  (4.7)

where \hat{\beta}_{(k,b)} = \left(Z'Z + kI\right)^{-1}\left(Z'y + kb\right) is the MRR estimator (MRRE) and \hat{\beta}_{(k,b)(i)} is obtained when the ith case is deleted.
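Eq. (4.7) can be evaluated by refitting without each case in turn, using the deleted-estimator formula; a brute-force sketch (hypothetical names, not the thesis code):

```python
import numpy as np

def cooks_distance_mrr(Z, y, k, b):
    """D_(k,b)i of Eq. (4.7) via explicit deletion:
    beta_(k,b)(i) = (Z'Z - z_i z_i' + kI)^{-1} (Z'y - z_i y_i + k*b)."""
    n, p = Z.shape
    G = Z.T @ Z + k * np.eye(p)
    Zty = Z.T @ y
    beta = np.linalg.solve(G, Zty + k * b)
    e = y - Z @ beta
    sigma2 = e @ e / (n - p)
    D = np.empty(n)
    for i in range(n):
        zi = Z[i]
        beta_i = np.linalg.solve(G - np.outer(zi, zi), Zty - zi * y[i] + k * b)
        diff = beta - beta_i
        D[i] = diff @ (Z.T @ Z) @ diff / (p * sigma2)
    return D
```

For k = 0 and b = 0 this reduces to the classical Cook's distance, with closed form e_i² h_ii / (p σ̂² (1 − h_ii)²); in practice the Sherman-Morrison identity avoids the n separate matrix solves.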

4.2.2 Properties of Pena’s statistic for the MRR

In this section we present some properties of the influence diagnostic S_{(k,b)i}, on the same pattern as given in Emami and Emami (2016).

THEOREM 4.2.2.1: When n \to \infty and all h_{(k,b)i} are small, then under the hypothesis of no outlier or high modified ridge leverage observation, E\left(S_{(k,b)i}\right) \approx 1/p.

PROOF: The n × 1 vector of modified ridge residuals is defined as

e_{(k,b)} = y - \hat{y}_{(k,b)} = \left(I - H_{(k,b)}\right)y.

If k is assumed non-stochastic, then \mathrm{Var}\left(\hat{y}_{(k,b)i}\right) = \hat{\sigma}^2 \sum_{j=1}^{n} h_{(k,b)ij}^2. It is easy to show that

\mathrm{Var}\left(e_{(k,b)j}\right) = \hat{\sigma}^2\left[1 - 2h_{(k,b)j} + \sum_{i=1}^{n} h_{(k,b)ij}^2\right] \leq \hat{\sigma}^2\left(1 - h_{(k,b)j}\right),

where e_{(k,b)j} is the jth element of the modified ridge residual vector e_{(k,b)} (see Emami and Emami, 2016). The mean value of the influence diagnostic S_{(k,b)i} can be derived from Eq. (4.5) as

E\left(S_{(k,b)i}\right) = \frac{1}{p\hat{\sigma}^2 h_{(k,b)i}} \sum_{j=1}^{n} \frac{h_{(k,b)ji}^2 E\left(e_{(k,b)j}^2\right)}{\left(1 - h_{(k,b)j}\right)^2} \leq \frac{1}{p h_{(k,b)i}} \sum_{j=1}^{n} \frac{h_{(k,b)ji}^2}{1 - h_{(k,b)j}}.

If we set h^* = \max_{1 \leq i \leq n} h_{(k,b)i}, then the upper bound of this mean is

E\left(S_{(k,b)i}\right) \leq \frac{1}{p\left(1 - h^*\right)} = \frac{1}{p} + \frac{h^*}{p\left(1 - h^*\right)}.

On the other hand, since h_{(k,b)j} \geq 1/n, we have

E\left(S_{(k,b)i}\right) = \frac{1}{p h_{(k,b)i}} \sum_{j=1}^{n} \frac{h_{(k,b)ji}^2}{1 - h_{(k,b)j}} \geq \frac{1}{p\left(1 - 1/n\right)}.

These results show that if h^* \to 0 and n is large, then the mean influence of all observations in the sample is approximately 1/p. This means that, in a sample without outliers or high modified ridge leverage observations, all the observations have a similar mean sensitivity with respect to the whole sample. Accordingly, observations whose values lie far from this common mean can be considered influential observations.

THEOREM 4.2.2.2: For a sample with small h_{(k,b)i}, as n \to \infty and p \to \infty but p/n \to 0, the distribution of S_{(k,b)i} is approximately normal.

PROOF: The result follows from the Central Limit Theorem. We assume that there is no outlier and that h^* = \max_{1 \leq i \leq n} h_{(k,b)i} < c\bar{h} for some c > 0, where \bar{h} = \sum_{i=1}^{n} h_{(k,b)i}/n. Let n \to \infty and p \to \infty but p/n \to 0; from Eq. (4.5) the influence diagnostic can be written as

S_{(k,b)i} = \sum_{j=1}^{n} m_{ij} \left(\frac{e_{(k,b)j}}{\hat{\sigma}}\right)^2,

where

m_{ij} = h_{(k,b)ji}^2 \Big/ \left[p\, h_{(k,b)i} \left(1 - h_{(k,b)j}\right)^2\right].

The e_{(k,b)j} are normally distributed with covariance \sigma^2\left(I - 2H_{(k,b)} + H_{(k,b)}H_{(k,b)}'\right).

Thus, as n \to \infty and h_{(k,b)ij} \to 0, the influence diagnostic S_{(k,b)i} is a weighted combination of independent chi-squared variates with 1 d.f. The m_{ij} are positive and the relative weight of each chi-squared variate satisfies m_{ij}/\sum_{j=1}^{n} m_{ij} \to 0. Since

m_{ij} \leq \frac{h_{(k,b)j}}{p\left(1 - h_{(k,b)j}\right)^2} \approx \frac{h_{(k,b)j}\left(1 + 2h_{(k,b)j}\right)}{p},

we have

\frac{m_{ij}}{\sum_{j=1}^{n} m_{ij}} \leq \frac{h_{(k,b)j}\left(1 + 2h_{(k,b)j}\right)}{p + 2\sum_{j=1}^{n} h_{(k,b)j}^2} \leq \frac{h_{(k,b)j}\left(1 + 2h_{(k,b)j}\right)}{p}

as p \to \infty, so the relative weight of each chi-squared variate tends to zero, since h_{(k,b)jj} \geq 0 and \sum_{j=1}^{n} h_{(k,b)jj} \approx p. Therefore, the asymptotic distribution of S_{(k,b)i} under these hypotheses is normal.

The distribution of S_{(k,b)i} may be affected by the presence of influential observations, so we use high breakdown estimates. A heterogeneous observation is declared influential if

\left|S_{(k,b)i} - \mathrm{Median}\left(S_{(k,b)i}\right)\right| \geq 4.5\,\mathrm{MAD}\left(S_{(k,b)i}\right).

Under this rule, high-leverage modified ridge outliers are recognized through a low S_{(k,b)i}, which suggests a one-tailed test with the alternative hypothesis on the left side, whereas intermediate leverage outliers are recognized through a large S_{(k,b)i}, which suggests a one-tailed test with the alternative hypothesis on the right side.

THEOREM 4.2.2.3: The influence diagnostic will detect a set of high modified

ridge leverage outliers from the data.

PROOF: Consider a sample of n cases \left(y_1, z_1'\right), \ldots, \left(y_n, z_n'\right) and let Z_0 = \left[z_1, \ldots, z_n\right]', y_0 = \left[y_1, \ldots, y_n\right]', \hat{\beta}_{(k,b)} = \left(Z'Z + kI\right)^{-1}\left(Z'y + kb\right) and u_i = y_i - z_i'\hat{\beta}_{(k,b)}. We consider k identical high modified ridge leverage outliers \left(y_a, x_a'\right) that contaminate the sample; let u_a = y_a - z_a'\hat{\beta}_{(k,b)} be the residual with respect to the MRR, and let e_{(k,b)i} = y_i - z_i'\hat{\beta}_{(k,b)(T)} be the modified ridge residuals in the full model with n + k observations, where \hat{\beta}_{(k,b)(T)} = \left(Z_T'Z_T + kI\right)^{-1}\left(Z_T'y_T + k\hat{\beta}_{(k,b)}\right), Z_T' = \left[Z_0', z_a 1_k'\right], y_T' = \left[y_0', y_a 1_k'\right], and 1_k is a k × 1 vector of ones. Let

H_{(k,b)(T)} = Z_T\left(Z_T'Z_T + kI\right)^{-1}\left[I + k\left(Z_T'Z_T + kI\right)^{-1}\right]Z_T'

be the projection matrix for the sample of n + k points and let

H_{(k,b)(0)} = Z_0\left(Z_0'Z_0 + kI\right)^{-1}\left[I + k\left(Z_0'Z_0 + kI\right)^{-1}\right]Z_0'

be the projection matrix for the clean data set. We partition H_{(k,b)} as

H_{(k,b)} = \begin{pmatrix} H_{(k,b)(11)} & H_{(k,b)(12)} \\ H_{(k,b)(21)} & H_{(k,b)(22)} \end{pmatrix},

where H_{(k,b)(11)} and H_{(k,b)(22)} are of order n × n and k × k respectively, and

H_{(k,b)(11)} = H_{(k,b)(0)} - \frac{k}{k h_{(k,b)(a)} + 1}\, h_{(k,b)(1a)} h_{(k,b)(1a)}',  (4.8)

where

h_{(k,b)(a)} = z_a'\left(Z_0'Z_0 + kI\right)^{-1}\left[I + k\left(Z_0'Z_0 + kI\right)^{-1}\right]z_a,

h_{(k,b)(1a)} = Z_0\left(Z_0'Z_0 + kI\right)^{-1}\left[I + k\left(Z_0'Z_0 + kI\right)^{-1}\right]z_a.

Also,

H_{(k,b)(12)} = H_{(k,b)(21)}' = \frac{1}{k h_{(k,b)(a)} + 1}\, h_{(k,b)(1a)} 1_k',  (4.9)

and

H_{(k,b)(22)} = \frac{1}{k h_{(k,b)(a)} + 1}\, h_{(k,b)(a)} 1_k 1_k'.  (4.10)

The observed modified ridge residuals, e_{(k,b)i} = y_i - z_i'\hat{\beta}_{(k,b)(T)}, are linked to the genuine modified ridge residuals, u_{(k,b)i} = y_i - z_i'\hat{\beta}_{(k,b)(0)}, by

e_{(k,b)i} = u_{(k,b)i} - k h_{(k,b)(ia)} u_a, \qquad i = 1, \ldots, n,  (4.11)

and are associated with the outlier cases through

e_{(k,b)a} = \frac{1}{k h_{(k,b)(a)} + 1}\, u_a.  (4.12)

Using Eq. (4.11) Cook’s distance for clean points is given by

2 ui − kh(k,b)(ia)ua h(k,b)i D(k,b)i = (4.13) 2 2 pσˆ 1 − h(k,b)i

For the outlier points using Eq. (4.11), this statisticmay be composed as

2 uaha D(k,b)(ia) = (4.14) 2 2  pσˆ 1 + (k − 1) h(k,b)(a) 1 + kh(k,b)(a)

Suppose we have high modified ridge leverage outliers and let h_{(k,b)(a)} \to \infty. Then H_{(k,b)(12)} = H_{(k,b)(21)}' \to 0, which means that h_{(k,b)(ja)} \to 0 for j = 1, \ldots, n, while H_{(k,b)(22)} \to k^{-1} 1_k 1_k', which implies that the full-sample leverage of each outlier tends to k^{-1}, and

\gamma_{ja}^2 = h_{(k,b)(ja)}^2 \left[h_{(k,b)j}\, h_{(k,b)(a)}\right]^{-1}

will tend to zero for j = 1, \ldots, n and to k n^{-1} for j = n + 1, \ldots, n + k. Thus, for the clean cases we have

S_{(k,b)i} = \sum_{j=1}^{n} \gamma_{ji}^2 D_{(k,b)j}, \qquad i = 1, \ldots, n,

whereas for the outliers,

S_{(k,b)i} = k^2 n^{-1} D_{(k,b)(a)}, \qquad i = n + 1, \ldots, n + k.  (4.15)

For the clean cases, if h_{(k,b)(a)} \to \infty then h_{(k,b)(ja)} \to 0 and, by Eq. (4.11), e_{(k,b)i} \to u_i. By the same argument as in Theorem 4.2.2.1, the mean of the influence diagnostic for the clean data will be 1/p. However, using Eq. (4.12) for the outliers, when h_{(k,b)(a)} \to \infty, then e_{(k,b)a} \to 0, D_{(k,b)(ia)} \to 0 and also S_{(k,b)i} \to 0. Consequently, in the modified ridge framework with high leverage outliers, the diagnostic measure will be near 0 for the outliers and near 1/p for the clean points. Additionally, the diagnostic statistic can detect modified ridge intermediate leverage outliers which are not detected by Cook's measure. The modified ridge intermediate leverage outliers are a set of outliers with h_{(k,b)(a)} \geq \max_{1 \leq i \leq n} h_{(k,b)i}, that is, their modified ridge leverages are greater than those of the clean cases, while the genuine modified ridge residual u_a is such that the modified ridge residuals e_{(k,b)a} given by Eq. (4.12) are not near 0. The cross leverages h_{(k,b)(ia)} between the clean cases and the outliers in Eq. (4.13) will still be small, and therefore \gamma_{ia}^2 will also be small. Consequently, the influence diagnostic for the modified ridge outlier cases will accumulate the Cook's distances of those cases, and the S_{(k,b)i} values will be greater than for the clean cases. Hence, the intermediate leverage points can easily be detected by S_{(k,b)i}.

4.3 Simulation

4.3.1 Normality of the proposed measure

In this section, we carry out a simulation to examine the impact of influential cases in multicollinear data. The regressors and the outlier-free values of the response variable are generated following McDonald and Galarneau (1975) and Newhouse and Oman (1971):

x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta w_{i,p+1}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p,

and

y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1), \quad i = 1, 2, \ldots, n,

in which the w_{ij} and \varepsilon_i are independent standard normal random numbers, x_{i1} = 1 for every i = 1, \ldots, n, and \theta^2 is the correlation between any two regressors. The values of \theta are set to 0.9, 0.99 and 0.999; the resulting condition numbers of the generated X equal 17.5493, 41.6682 and 80.5332, respectively.

Figs. 4.1 to 4.4 show, in turn, normal q-q plots of the proposed measure, normal q-q plots of Cook's measure, plots of Cook's measure against the proposed measure, and plots of individual values of the proposed measure against observation number. The first panel of each figure corresponds to the first data set with \theta = 0.9, n = 200 and p = 10; the second panel to the second data set with \theta = 0.99, n = 500 and p = 20; and the third panel to the third data set with \theta = 0.999, n = 1000 and p = 40.

Figure 4.1: Influence investigation of the 3 generated data sets with regressors of varying multicollinearity. Normal q-q plots of the proposed measure.

Figure 4.2: Influence investigation of the 3 generated data sets with regressors of varying multicollinearity. Normal q-q plots of Cook's measure.

Figure 4.3: Influence investigation of the 3 generated data sets with regressors of varying multicollinearity. Plots of Cook's distance against the proposed measure.

Figure 4.4: Influence investigation of the 3 generated data sets with regressors of varying multicollinearity. Plots of the proposed measure versus observation number.

It can be seen that the distribution of Cook's measure is asymmetric in the three situations, whereas the distribution of the proposed measure is symmetric. It is also observed from Fig. 4.4 that the mean value of the proposed measure in the three data sets is near 1/10, 1/20 and 1/40, respectively, which is consistent with Theorem 4.2.2.1.
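The generating scheme above is easy to reproduce. A sketch (with my own function and variable names), which makes the induced correlation θ² between any two regressors visible; note that the intercept column x_{i1} = 1 is omitted here for brevity:

```python
import numpy as np

def mcdonald_galarneau(n, p, theta, rng):
    """x_ij = (1 - theta^2)^{1/2} w_ij + theta * w_{i,p+1}, so any two
    distinct regressors have population correlation theta^2."""
    W = rng.normal(size=(n, p + 1))
    return np.sqrt(1.0 - theta**2) * W[:, :p] + theta * W[:, [p]]

rng = np.random.default_rng(2016)
X = mcdonald_galarneau(5000, 4, 0.99, rng)
# response with all beta_j = 1 and standard normal errors
y = X.sum(axis=1) + rng.normal(size=5000)
```

With theta = 0.99 the sample correlation of any two columns is close to 0.99² ≈ 0.98, which is what drives the large condition numbers quoted above.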

4.3.2 Detection of intermediate and high modified ridge

leverage outliers

Following Emami and Emami (2016), the performance of the proposed measure is illustrated by simulation as follows. In situation (a), a model y_i = x_{i1} + x_{i2} + \varepsilon_i of sample size n is generated, in which x_{i2} = \theta x_{i1} + \varepsilon_i and x_{i1}, \varepsilon_i \sim N(0, 1); this sample contains no outliers. The last three cases of the sample are then modified to introduce 3 outliers of size five but with different modified ridge leverages. Situation (b) corresponds to high modified ridge leverage (x_{ij} = 20, j = 1, 2) and situation (c) to intermediate modified ridge leverage (x_{ij} = 5, j = 1, 2). A simulated sample of size 30 with three outliers and with \theta = 0.90 is generated; the data and the values of the proposed measure and of Cook's measure for the three situations are presented in Table 4.1. The condition number of X'X is approximately 309.4428. The results of Table 4.1 for the three situations are shown in Figs. 4.5 and 4.6. In (a), all values of the proposed measure are near the expected value 1/2. In (b), the values of the proposed measure for the high modified ridge leverage outliers are almost 0, while for the good cases they are close to the expected value, which is consistent with Theorem 4.2.2.3.

65 Table 4.1: Three data sets and proposed Diagnostic Measures.

Row   y         x1        x2        D        Sa       Sb       Sc
1     1.6353    1.158     0.7598    0.0012   0.5024   0.5027   0.493
2     1.8759    0.6464    0.9056    0.0049   0.5156   0.5245   0.5105
3     1.4727    0.2055    0.7261    0.0232   0.4701   0.5069   0.4812
4     3.8593    1.3641    1.8615    0.1154   0.4209   0.5464   0.5148
5     -1.3435   -0.2916   -0.6572   0.0036   0.4312   0.5554   0.4977
6     -1.9906   -1.4872   -0.9209   0.0642   0.4429   0.5908   0.4856
7     4.6198    1.1721    2.2513    0.0025   0.5116   0.5322   0.495
8     -0.6158   -1.2514   -0.2453   0.018    0.501    0.5495   0.5118
9     0.2891    0.1093    0.1391    0.0175   0.3503   0.5228   0.4967
10    6.0853    1.7356    2.9559    0.0498   0.4012   0.5395   0.5126
11    -1.3916   0.211     -0.7064   0.0291   0.5103   0.5866   0.5268
12    2.71      1.9155    1.2592    0.0763   0.5424   0.6002   0.5113
13    0.0524    0.2636    0.013     0.0169   0.492    0.574    0.5445
14    -0.2981   0.2009    -0.1591   0.0027   0.5277   0.5354   0.5042
15    -2.9199   -0.1274   -1.4536   0.0074   0.5023   0.5447   0.5172
16    0.9247    0.5389    0.4354    0.1262   0.5427   0.5558   0.4925
17    -2.4054   -0.7718   -1.1641   0.0255   0.4948   0.6211   0.5042
18    4.0035    0.8954    1.957     0.0066   0.4843   0.6319   0.495
19    -2.6527   -1.0336   -1.2747   0.0081   0.4108   0.6009   0.4664
20    0.9502    0.2358    0.4633    0.0037   0.378    0.6428   0.4704
21    -1.8948   -0.1804   -0.9384   0.0118   0.4011   0.6546   0.5032
22    -5.5814   -0.4811   -2.7666   0.0227   0.485    0.5998   0.518
23    0.0133    -0.6449   0.0389    0.032    0.394    0.5834   0.4996
24    -1.291    -0.1784   -0.6366   0.0028   0.3528   0.5668   0.5232
25    -0.6987   1.4113    -0.4199   0.0033   0.3905   0.5522   0.4964
26    -6.689    -1.5815   -3.2654   0.0018   0.498    0.6117   0.5094
27    5.2383    1.3814    2.5501    0.0276   0.527    0.6206   0.5127
28    -2.3622*  0.0997*   -1.1761*  0.0019   0.5102   0.1587   0.6884
29    -0.1105*  0.7016*   -0.0202*  0.0014   0.5506   0.1587   0.6884
30    -2.0704*  -1.2545*  -0.9725*  0.0035   0.5773   0.1587   0.6884

*Values shown for rows 28-30 are those of situation (a); in situation (b) these rows take y = 5 and x1 = x2 = 20, and in situation (c) y = 5 and x1 = x2 = 5.

Three data sets with the same cases 1-27; in rows 28-30: Sa, no outliers; Sb, 3 high modified ridge leverage outliers; Sc, 3 intermediate modified ridge leverage outliers.

Figure 4.5: Plots of Cook's measure against the proposed measure for the three situations: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 intermediate modified ridge leverage outliers.

Figure 4.6: Plots of the proposed measure against observation number for the three situations: (a) no outliers, (b) 3 high modified ridge leverage outliers and (c) 3 intermediate modified ridge leverage outliers.

In (c), the values of the proposed measure are greater for the outliers than for the good cases.

4.3.3 Performance of Pena’s statistic

In this section, we conduct a simulation study designed to compare the performance of the suggested identification technique with Cook's distance D_{(k,b)i}. The performance of the proposed technique was examined using the following model:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i; \qquad 1 \leq i \leq 40,

in which, for 1 \leq i \leq 40 - n_0, \beta_0 = \beta_1 = \beta_2 = \beta_3 = 1, \varepsilon is a vector of normal random variables and the x_{ij} are produced by

x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta w_{i,p+1}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p,

with \theta^2 = 0.70, 0.90 and 0.99, which produce multicollinear regressors. To construct an outlier pattern for the MRR, the sample was then contaminated by a set of n_0 identical intermediate and high leverage outliers. The final n_0 observations, for 40 - n_0 + 1 \leq i \leq 40, were identical, with x_1 = x_0, x_2 = x_3 = 0 and y_0 = m x_0; the x_0 values were chosen to be 5 and 10, and the contaminating slope m was held constant at 1, 2, 3 and 4. The number n_0 of outliers was taken as 2, 6 and 8, corresponding to 5%, 15% and 20% contamination. The value of 5% was chosen to represent a small amount of contamination, while 15% and 20% represent rather heavily contaminated samples. For each contamination plan, 500 simulations were made. The outlier identification results of S_{(k,b)i} and of Cook's distance D_{(k,b)i} are reported in Tables 4.2 and 4.3, respectively. It can be seen from Tables 4.2 and 4.3 that the S_{(k,b)i} technique is powerful at 5% contamination and avoids the masking problem when the outliers have high leverage (x_0 = 10), while the power of Cook's measure D_{(k,b)i} decreases as the collinearity increases.

Table 4.2: Percent of outlier identification by applying S(k,b)i in the Simulation.

                  % outliers detected, x0 = 5      % outliers detected, x0 = 10
θ²      % inf     m=1   m=2   m=3   m=4            m=1   m=2   m=3   m=4
0.7     5         93    95    94    95             92    94    93    93
        15        82    81    83    83             81    85    84    86
        20        72    75    74    72             79    78    80    78
0.9     5         91    90    92    91             91    93    92    91
        15        79    80    78    79             80    82    83    84
        20        70    74    72    73             78    79    80    79
0.99    5         84    82    83    84             94    91    90    92
        15        77    74    75    76             78    80    79    80
        20        68    69    68    67             70    71    70    72

4.4 Illustration

The Longley (1967) data set, which has n = 16 observations and p = 6 explanatory variables, has been used to illustrate the impact of strong multicollinearity. Its scaled condition number is 43,275 (Belsley et al., 1980); this large value shows the presence of extreme multicollinearity among the regressors. This data set has also been considered

Table 4.3: Percent of outlier identification by means of D(k,b)i in the Simulation.

                  % outliers detected, x0 = 5      % outliers detected, x0 = 10
θ²      % inf     m=1   m=2   m=3   m=4            m=1   m=2   m=3   m=4
0.7     5         84    83    85    83             80    81    83    81
        15        72    71    73    72             73    73    75    74
        20        60    62    61    63             61    60    63    60
0.9     5         81    82    81    83             80    82    81    83
        15        71    72    73    71             72    74    72    73
        20        59    61    60    62             60    62    63    62
0.99    5         70    71    73    71             71    73    75    73
        15        60    62    62    61             63    61    64    63
        20        56    54    55    54             55    54    53    55

by several authors in connection with the identification of influential cases. Cook (1977) applied Cook's measure to this data set and found that points 5, 16, 4, 10 and 15 were the most influential cases when OLS was used for estimation. Walker and Birch (1988) examined this data set to identify anomalous cases in RR by applying Cook's distance Di and DFFITS; they detected points 16, 10, 4, 15 and 5 (in this specific order) as the most influential cases. Emami and Emami (2016) also applied this data to identify influential cases for the RR using the measure of Pena (2005). They identified cases 16, 15, 10, 5 and 1 (in this order) and 16, 15, 10, 5 and 4 (in this specific order) as the most influential cases for OLS (k = 0) and for k = 0.0002, respectively.

In this chapter, we applied this data to detect the influential points for the MRR by means of Pena's measure. The influence statistic (4.5) was computed for k = 0.0002 and the outcomes are presented in Table 4.4. It can be seen that cases 10, 4, 5, 16 and 15 (in this specific order) are among the most influential cases. The five most influential observations with the largest S(k,b)i are 16, 15, 5, 4 and 10 (in this specific order). Comparison of these cases with the cut-off point defined in Eq. (4.6) shows that they can be regarded as influential points. These results are also presented in Fig. 4.7; the histogram of the proposed measure separates distinct groups of the data, and comparison of the largest values of the proposed measure with the cut-off of Eq. (4.6) shows that these cases can be regarded as influential.

Table 4.4: Five largest observations of S(k,b)i for the Longley data.

OLS (k = 0)           MRR (k = 0.0002)
Case    Si            Case    S(k,b)i
5       0.6201        16      0.4958
16      0.5067        15      0.3685
6       0.4684        5       0.3185
15      0.3829        4       0.2937
10      0.2990        10      0.2565

Figure 4.7: Influence assessment of the Longley data set: (a) histogram of the proposed measure, (b) plot of Cook's distance versus the proposed measure, (c) plot of the proposed measure versus observation number.

4.5 Conclusions

Influential observations affect the model estimates and inferences. Such observations are diagnosed by various methods for different models and under various assumptions, but the usual diagnostic measures are not reliable for OLS when the explanatory variables are multicollinear. In this chapter, we modified Pena's statistic for the MRR. We demonstrated that the modified measure is asymptotically normal and can identify a subset of high modified ridge leverage outliers. The numerical results clearly show that the modified statistic identifies influential observations correctly and provides results similar to those of Cook's distance for the MRR. Hence, in practice it would be fruitful to use this influence diagnostic measure when facing the issue of multicollinearity.

Chapter 5

Pena's statistic for the Improved Liu Estimator

5.1 Introduction

Linear regression analysis is a widely applied and well-recognized area of statistical study. One important aspect of regression analysis that has been extensively examined is the detection and handling of influential observations. Detection is normally done by introducing some perturbation into the problem formulation and observing how these perturbations affect the analysis, for example the parameter estimates. Widely cited references on this problem include Cook (1977), Belsley et al. (1980), Chatterjee and Hadi (1988) and Ullah and Pasha (2009), among many others, and the references cited therein.

In the investigation of influential cases, the most widely recognized method for measuring the impact of the ith case is to calculate single-case diagnostic measures with the ith observation deleted. Since the work of Cook (1977), case deletion diagnostic measures such as Cook's distance have been effectively applied to numerous statistical models, and the vast majority of ideas about influential cases depend on the case deletion method.

A new influence diagnostic measure was proposed by Pena (2005) and is known as Pena's statistic in the literature. In Pena's approach, instead of analysing how the omission of a point influences the parameter estimates, we look at how each case is affected by the rest of the data.

In the greater part of regression research, the predictors are not orthogonal. Sometimes the lack of orthogonality is not a serious problem. However, when the predictor variables have a near-perfect linear relationship, inferences based on the regression can be seriously misleading. When near-linear dependencies exist among the predictor variables, the problem of multicollinearity is said to be present. In such circumstances, the OLS technique yields unstable estimates of the regression coefficients. In the available literature, many important biased estimation methods have been proposed to handle this issue, and among these the RR estimator (Hoerl and Kennard, 1970) and the LE (Liu, 1993) are very popular. Shi and Wang (1999) and Walker and Birch (1988) studied leverages, Cook's distance and local influence measures in the RR setting. One of the ridge-type estimators suggested for fitting the model to multicollinear data is the MRR estimator (Swindel, 1976). Jahufer and Jianbao (2009), Amanullah et al. (2013b) and Kashif et al. (2019) studied residuals, leverages and several case deletion measures for this estimator. Amanullah et al. (2013a) studied influence measures in the context of the LE.

Liu (2011) improved the available LE by examining the selection of its biasing parameter under the predicted residual error sum of squares (PRESS) criterion; the resulting estimator is known as the improved Liu estimator (ILE). Recently, the influence of a few observations using Pena's technique on the RR was studied by Emami and Emami (2016). This work motivated us to pay attention to the influence of anomalous observations on the outcomes of the ILE. Therefore, in this chapter, we concentrate on Pena's technique, defined as the squared norm of the vector of changes in the fitted value of a point when each of the points is deleted from the sample, applied to the ILE. Moreover, we express this measure in terms of the ILE residuals and leverages. It is shown that this measure is asymptotically normal. It is also illustrated that this statistic can be used to detect a set of high ILE-leverage alike outliers that cannot be detected through Cook's measure.
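The PRESS idea behind the ILE can be illustrated with a small sketch. Liu (2011) derives the PRESS-minimizing d analytically; the grid search below (my own names, using the standard leave-one-out shortcut e_i/(1 − h_ii), which is an approximation for the Liu smoother) only illustrates the criterion, not the exact thesis formulas:

```python
import numpy as np

def liu_hat(X, d):
    """Hat matrix of the Liu (1993) estimator, y_hat = H_d y."""
    p = X.shape[1]
    XtX = X.T @ X
    A = np.linalg.solve(XtX + np.eye(p), XtX + d * np.eye(p))
    return X @ A @ np.linalg.solve(XtX, X.T)

def press(X, y, d):
    """Approximate PRESS(d) = sum_i (e_i / (1 - h_{d,ii}))^2."""
    H = liu_hat(X, d)
    e = y - H @ y
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))

def best_d(X, y, grid=None):
    """Grid search for the biasing parameter d in [0, 1]."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    return min(grid, key=lambda d: press(X, y, d))
```

The value of d returned by the grid search plays the role of the ILE biasing parameter in the remainder of the chapter.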

The remainder of the chapter is organized as follows. The influence diagnostic measures for Pena's statistic in OLS and in the ILE are presented in Section 5.2, where we also examine some properties of the modified measure. Simulated and real data examples are presented in Sections 5.3 and 5.4, respectively, for illustration. Finally, concluding remarks are given in Section 5.5.

5.2 Pena's statistic

5.2.1 Pena’s statistic using the ILE

The goal of influence analysis is to evaluate the impact of the ith case, and the popular technique is case deletion diagnostics with the ith observation deleted. A number of diagnostics have been suggested for this purpose in regression modelling, including Pena's statistic. Pena (2005) proposed an influence diagnostic for the ith case that works in a totally different way: he presented a technique to measure how every point is affected by the rest of the data.

Consider the vector

$$s_i = \left(\hat{y}_i - \hat{y}_{i(-1)}, \ldots, \hat{y}_i - \hat{y}_{i(-n)}\right)' = \left(\frac{h_{1i}e_1}{1-h_{11}}, \ldots, \frac{h_{ni}e_n}{1-h_{nn}}\right)', \qquad (5.1)$$

where $\hat{y}_i - \hat{y}_{i(-j)}$ is the difference between the ith fitted value with all cases in the data and with the jth case omitted. Pena's measure for the ith case is defined as

$$S_i = \frac{s_i'\,s_i}{p\,\sigma^2_{\hat{y}_i}}; \qquad i = 1, 2, \ldots, n, \qquad (5.2)$$

which can also be re-expressed as

$$S_i = \frac{1}{p\,\hat{\sigma}^2 h_{ii}} \sum_{j=1}^{n} \frac{h_{ji}^2\,e_j^2}{(1-h_{jj})^2}, \qquad (5.3)$$

where $h_{ii}$ is the ith diagonal element and $h_{ji}$ is the $(j,i)$th element of the projection matrix $H$, and $\sigma^2_{\hat{y}_i} = \hat{\sigma}^2 h_{ii}$. One of the properties of $S_i$ is that, for large sample sizes and numerous explanatory variables, the distribution of this statistic is approximately normal. Cut-off points can be found for this statistic using this property: cases whose standardized values $(S_i - E(S_i))/SD(S_i)$ are sufficiently large may be considered influential observations. Since the mean and standard deviation of $S_i$ are themselves affected by the presence of influential observations, Pena (2005) proposed using the median and the median absolute deviation of $S_i$ rather than the mean and standard deviation. Therefore, an observation is referred to as an influential observation if

$$|S_i - \mathrm{Median}(S_i)| \geq 4.5\,\mathrm{MAD}(S_i).$$
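As a concrete illustration, the statistic of Eq. (5.3) and the median/MAD cut-off rule can be computed directly from the hat matrix. The following Python sketch (the function names are ours, not from the cited works) assumes a full-rank design matrix X:

```python
import numpy as np

def pena_statistic(X, y):
    """Pena's (2005) influence statistic, Eq. (5.3):
    S_i = (1 / (p * sigma2 * h_ii)) * sum_j h_ji^2 e_j^2 / (1 - h_jj)^2."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)      # projection (hat) matrix
    e = y - H @ y                              # OLS residuals
    h = np.diag(H)                             # leverages h_ii
    sigma2 = e @ e / (n - p)                   # residual variance estimate
    return np.array([
        np.sum(H[:, i] ** 2 * e ** 2 / (1.0 - h) ** 2) / (p * sigma2 * h[i])
        for i in range(n)
    ])

def flag_influential(S, c=4.5):
    """Median/MAD rule: flag case i if |S_i - Median(S)| >= 4.5 MAD(S)."""
    med = np.median(S)
    mad = np.median(np.abs(S - med)) / 0.6745
    return np.abs(S - med) >= c * mad
```

For clean data with small leverages, the values of $S_i$ returned by this sketch scatter around $1/p$, in line with Theorem 5.2.2.1 below.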

The existence of multicollinearity in the data has some destructive effects on regression analysis. To combat multicollinearity, various remedial measures have been suggested, such as ridge and Liu-type estimators. Liu (2011) proposed an ILE with biasing parameters under the PRESS criterion. This estimator is defined as

$$\hat{\beta}_{K,D} = K\left(Z'Z + I_p\right)^{-1} Z'y + D\left(Z'Z + I_p\right)^{-1}\left(Z'Z\right)^{-1} Z'y,$$

where $K = \mathrm{diag}(k_1, \ldots, k_p)$ and $D = \mathrm{diag}(d_1, \ldots, d_p)$, with $k_1, \ldots, k_p, d_1, \ldots, d_p \in \mathbb{R}$ scalars. Now, Pena's statistic in the ILE is defined as

$$S_{K,D;i} = \frac{s_{K,D;i}'\,s_{K,D;i}}{p\,\sigma^2_{\hat{y}_{K,D;i}}}; \qquad i = 1, 2, \ldots, n. \qquad (5.4)$$

where $s_{K,D;i} = \left(\hat{y}_{K,D;i} - \hat{y}_{K,D;i(-1)}, \ldots, \hat{y}_{K,D;i} - \hat{y}_{K,D;i(-n)}\right)'$ and $\hat{y}_{K,D;i} - \hat{y}_{K,D;i(-j)}$ is the difference between the ith fitted value with all cases in the data set and with the jth case omitted using the ILE. After adjustment, this statistic can also be re-expressed as

$$S_{K,D;i} = \frac{1}{p\,\hat{\sigma}^2 h_{K,D;i}} \sum_{j=1}^{n} \frac{h_{K,D;ji}^2\,e_{K,D;j}^2}{(1-h_{K,D;j})^2}, \qquad (5.5)$$

where $h_{K,D;i}$ is the ith diagonal element and $h_{K,D;ji}$ is the $(j,i)$th element of the hat matrix

$$H_{K,D} = Z\left[K\left(Z'Z + I\right)^{-1} + D\left(Z'Z + I\right)^{-1}\left(Z'Z\right)^{-1}\right]Z',$$

$e_{K,D;i} = y_i - \hat{y}_{K,D;i}$, $\hat{y}_{K,D;i} - \hat{y}_{K,D;i(-j)} = h_{K,D;ji}\,e_{K,D;j}/(1-h_{K,D;j})$ and $\sigma^2_{\hat{y}_{K,D;i}} = \hat{\sigma}^2 h_{K,D;i}$. An observation is called influential if it satisfies the rule

$$|S_{K.D;i} - \mathrm{Median}(S_{K.D;i})| \geq 4.5\,\mathrm{MAD}(S_{K.D;i}), \qquad (5.6)$$

where $\mathrm{MAD}(S_{K.D;i}) = \mathrm{Median}\left\{|S_{K.D;i} - \mathrm{Median}(S_{K.D;i})|\right\}/0.6745$ is the median absolute deviation of the $S_{K.D;i}$ values.

In a standard linear regression model, a well-known single-case deletion measure in influence diagnostics is Cook's distance. Using the ILE, this measure may be defined as

$$D_{K.D;i} = \left(\hat{\beta}_{K.D} - \hat{\beta}_{K.D(i)}\right)' Z'Z \left(\hat{\beta}_{K.D} - \hat{\beta}_{K.D(i)}\right) \Big/ \left(p\,\hat{\sigma}^2\right). \qquad (5.7)$$
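Numerically, $S_{K,D;i}$ of Eq. (5.5) only needs the ILE hat matrix. The sketch below assumes the fitted values are $\hat{y} = Z\hat{\beta}_{K,D}$, so that $H_{K,D} = Z[K(Z'Z+I)^{-1} + D(Z'Z+I)^{-1}(Z'Z)^{-1}]Z'$; this reading of the hat matrix, and all names, are ours:

```python
import numpy as np

def ile_hat_matrix(Z, K, D):
    """Hat matrix implied by yhat = Z beta_hat_{K,D} with
    beta_hat_{K,D} = K (Z'Z + I)^{-1} Z'y + D (Z'Z + I)^{-1} (Z'Z)^{-1} Z'y,
    i.e. H_{K,D} = Z [K A^{-1} + D A^{-1} G^{-1}] Z', G = Z'Z, A = G + I."""
    p = Z.shape[1]
    G = Z.T @ Z
    A_inv = np.linalg.inv(G + np.eye(p))
    M = K @ A_inv + D @ A_inv @ np.linalg.inv(G)
    return Z @ M @ Z.T

def pena_ile(Z, y, K, D):
    """Pena's statistic S_{K,D;i} of Eq. (5.5) built on the ILE quantities."""
    n, p = Z.shape
    H = ile_hat_matrix(Z, K, D)
    e = y - H @ y                      # ILE residuals e_{K,D;j}
    h = np.diag(H)                     # ILE leverages h_{K,D;i}
    sigma2 = e @ e / (n - p)           # rough residual variance estimate
    return np.array([
        np.sum(H[:, i] ** 2 * e ** 2 / (1.0 - h) ** 2) / (p * sigma2 * h[i])
        for i in range(n)
    ])
```

With constant diagonals $K = kI$ and $D = dI$ the inner matrix is symmetric positive definite, so all ILE leverages are strictly positive.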

5.2.2 Properties of Pena’s statistic for the ILE

In this section, we present some properties of the influence diagnostic $S_{K.D;i}$ along the same lines as in Emami and Emami (2016).

THEOREM 5.2.2.1: When $n$ is large and all $h_{K.D;i}$ are small, then under the hypothesis of no outlier or high improved Liu leverage observation, the mean value of the influence diagnostic $S_{K.D;i}$ is approximately $1/p$.

PROOF: As we know, $e_{K.D;j} = (1 - h_{K.D;j})\,y_j$, so that $\mathrm{Var}(e_{K.D;j}) = E\big(e_{K.D;j}^2\big) = (1 - h_{K.D;j})\,\hat{\sigma}^2$. The mean value of the influence diagnostic $S_{K.D;i}$ can therefore be derived from (5.5) as

$$E(S_{K.D;i}) = \frac{1}{p\,\hat{\sigma}^2 h_{K.D;i}} \sum_{j=1}^{n} \frac{h_{K.D;ji}^2\,E\big(e_{K.D;j}^2\big)}{(1 - h_{K.D;j})^2} \leq \frac{1}{p\,h_{K.D;i}} \sum_{j=1}^{n} \frac{h_{K.D;ji}^2}{1 - h_{K.D;j}}.$$

If we let $h^{\star} = \max_{1\leq i\leq n} h_{K.D;i}$, then the upper bound of this mean is

$$E(S_{K.D;i}) \leq \frac{1}{p(1 - h^{\star})} = \frac{1}{p} + \frac{h^{\star}}{p(1 - h^{\star})}.$$

On the other hand, since $h_{K.D;j} \geq 1/n$, we have

$$E(S_{K.D;i}) = \frac{1}{p\,h_{K.D;i}} \sum_{j=1}^{n} \frac{h_{K.D;ji}^2}{1 - h_{K.D;j}} \geq \frac{1}{p(1 - 1/n)}.$$

These results show that, if $h^{\star} \to 0$ and $n$ is large, then the mean influence of all observations in the sample is approximately $1/p$. This implies that, if a sample is free of outliers or high improved Liu leverage cases, then most of the observations have the same mean sensitivity with respect to the whole sample. Accordingly, observations whose values lie far from this influence diagnostic measure can be considered influential observations.

THEOREM 5.2.2.2: For a sample with small $h_{K.D;i}$, as $n \to \infty$ and $p \to \infty$ with $p/n \to 0$, the asymptotic distribution of $S_{K.D;i}$ is normal.

PROOF: The proof proceeds through the Central Limit Theorem. We assume there is no outlier and that $h^{\star} = \max_{1\leq i\leq n} h_{K.D;i} < c\bar{h}$ for some $c > 0$, where $\bar{h} = \sum_{i=1}^{n} h_{K.D;i}/n$. Let $n \to \infty$ and $p \to \infty$ with $p/n \to 0$; then the influence diagnostic can be written as

$$S_{K.D;i} = \sum_{j=1}^{n} m_{ij}\left(\frac{e_{K.D;j}}{\hat{\sigma}}\right)^2,$$

where

$$m_{ij} = \frac{h_{K.D;ji}^2}{p\,h_{K.D;i}\,(1 - h_{K.D;j})^2}.$$

The residuals $e_{K.D;j}$ follow a normal distribution with covariance matrix $\sigma^2(I - H_{K.D})$. In this way, when $n \to \infty$ and $h_{K.D;ij} \to 0$, the influence diagnostic $S_{K.D;i}$ is a weighted combination of chi-squared variates with 1 d.f. The $m_{ij}$ are positive, and the relative weight of every chi-squared variate $m_{ij}/\sum_{j=1}^{n} m_{ij} \to 0$. Since

$$m_{ij} \leq \frac{h_{K.D;j}}{p\,(1 - h_{K.D;j})^2} \approx \frac{h_{K.D;j}(1 + 2h_{K.D;j})}{p},$$

we have

$$\frac{m_{ij}}{\sum_{j=1}^{n} m_{ij}} \leq \frac{h_{K.D;j}(1 + 2h_{K.D;j})}{p + 2\sum_{j=1}^{n} h_{K.D;j}^2} \leq \frac{h_{K.D;j}(1 + 2h_{K.D;j})}{p}$$

for $p \to \infty$, so the relative weight of every chi-squared variate tends to zero, since $h_{K.D;j} \geq 0$ and $\sum_{j=1}^{n} h_{K.D;j} \approx p$. Therefore, the asymptotic distribution of $S_{K.D;i}$ under these hypotheses is normal.

The distribution of $S_{K.D;i}$ may be affected by the existence of influential observations, so we apply high-breakdown estimates. A heterogeneous observation is declared an influential observation if

$$|S_{K.D;i} - \mathrm{Median}(S_{K.D;i})| \geq 4.5\,\mathrm{MAD}(S_{K.D;i}).$$

Under this rule, high-leverage improved Liu outliers are recognized via a low $S_{K.D;i}$, which suggests a one-tailed test with a left-tailed alternative hypothesis, while middle-leverage outliers are recognized by a large $S_{K.D;i}$, which suggests a one-tailed test with a right-tailed alternative hypothesis.

THEOREM 5.2.2.3: The influence diagnostic will detect a set of high improved Liu leverage outliers in the data.

PROOF: Consider a sample of $n$ cases $(y_1, z_1'), \ldots, (y_n, z_n')$ and let

$$Z_0 = [z_1, \ldots, z_n]', \qquad y_0 = [y_1, \ldots, y_n]',$$

$$\hat{\beta}_{K,D} = K\left(Z'Z + I_p\right)^{-1} Z'y + D\left(Z'Z + I_p\right)^{-1}\left(Z'Z\right)^{-1} Z'y,$$

and $u_i = y_i - z_i'\hat{\beta}_{K.D}$. We

83 0  consider k same high improved Liu leverage outliers ya, xa that contaminated

0 ˆ in the sample and let ua = ya − zaβK.D be the residual with appreciate to the

0 ˆ ILE and suppose eK.D;i = yi − ziβK.D(T ) be the improved Liu residuals within the

ˆ 0 −1  0 ˆ  0 full model with n + k cases, where βd(T ) = ZT ZT + I ZT yT + dβd ,ZT =

 0 0  0 0  0 Z0, za1k , yT = y0, ya1k and 1k is a vector 1 s of order k × 1. Suppose HK.D(T ) =

0 −1 0  h 0 −1i 0 −1 0 KZT ZT ZT + I ZT ZT + dI I + (1 − D) ZT ZT + I ZT ZT ZT be the pro- jection matrix for sample of size n + k data set and suppose

0 −1 0  h 0 −1i 0 −1 0 HK.D(0) = KZ0 Z0Z0 + I Z0Z0 + dI I + (1 − D) Z0Z0 + I Z0Z0 Z0 be a projection matrix for the clean data set. We divided HK.D as

$$H_{K.D} = \begin{pmatrix} H_{K.D(11)} & H_{K.D(12)} \\ H_{K.D(21)} & H_{K.D(22)} \end{pmatrix},$$

where $H_{K.D(11)}$ and $H_{K.D(22)}$ have dimensions $n \times n$ and $k \times k$, respectively, and

$$H_{K.D(11)} = H_{K.D(0)} - \frac{k}{k\,h_{K.D(a)} + 1}\, h_{K.D(1a)}\, h_{K.D(1a)}', \qquad (5.8)$$

where

$$h_{K.D(a)} = z_a'\left[K\left(Z_0'Z_0 + I\right)^{-1} + D\left(Z_0'Z_0 + I\right)^{-1}\left(Z_0'Z_0\right)^{-1}\right] z_a,$$

$$h_{K.D(1a)} = Z_0\left[K\left(Z_0'Z_0 + I\right)^{-1} + D\left(Z_0'Z_0 + I\right)^{-1}\left(Z_0'Z_0\right)^{-1}\right] z_a,$$

and also

$$H_{K.D(12)} = H_{K.D(21)}' = \frac{1}{k\,h_{K.D(a)} + 1}\, h_{K.D(1a)}\, 1_k', \qquad (5.9)$$

and

$$H_{K.D(22)} = \frac{1}{k\,h_{K.D(a)} + 1}\, h_{K.D(a)}\, 1_k 1_k'. \qquad (5.10)$$

The observed improved Liu residuals, $e_{K.D;i} = y_i - z_i'\hat{\beta}_{K.D(T)}$, are related to the true improved Liu residuals, $u_{K.D;i} = y_i - z_i'\hat{\beta}_{K.D(0)}$, by

$$e_{K.D;i} = u_{K.D;i} - k\,h_{K.D(ia)}\,u_a, \qquad i = 1, \ldots, n, \qquad (5.11)$$

and, for the outlier cases, by

$$e_{K.D;a} = \frac{1}{k\,h_{K.D(a)} + 1}\, u_a. \qquad (5.12)$$

Using Eq. (5.11), Cook's distance for the clean points is given by

$$D_{K.D;i} = \frac{\left(u_i - k\,h_{K.D(ia)}\,u_a\right)^2 h_{K.D;i}}{p\,\hat{\sigma}^2\left(1 - h_{K.D;i}\right)^2}. \qquad (5.13)$$

For the outlier points, using Eq. (5.12), this statistic can be written as

$$D_{K.D(a)} = \frac{u_a^2\, h_{K.D(a)}^2}{p\,\hat{\sigma}^2\left(1 + (k-1)\,h_{K.D(a)}\right)\left(1 + k\,h_{K.D(a)}\right)^2}. \qquad (5.14)$$

Suppose we have high improved Liu leverage outliers with $h_{K.D(a)} \to \infty$. Then $H_{K.D(12)} = H_{K.D(21)}' \to 0$, which means that $h_{K.D(ja)} \to 0$ for $j = 1, \ldots, n$, and $H_{K.D(22)} \to k^{-1} 1_k 1_k'$, which means that $h_{K.D(a)} \to k^{-1}$, and

$$\gamma_{ja}^2 = \frac{h_{K.D(ja)}^2}{h_{K.D;j}\, h_{K.D(a)}}$$

tends to zero for $j = 1, \ldots, n$ and to $k\,n^{-1}$ for $j = n+1, \ldots, n+k$. In this manner, for the clean cases we have

$$S_{K.D;i} = \sum_{j=1}^{n} \gamma_{ji}^2\, D_{K.D;j}, \qquad i = 1, \ldots, n,$$

whereas, for the outliers,

$$S_{K.D;i} = k^2 n^{-1} D_{K.D(a)}, \qquad i = n+1, \ldots, n+k. \qquad (5.15)$$

For the clean cases, if $h_{K.D(a)} \to \infty$, then $h_{K.D(ja)} \to 0$ and, by Eq. (5.11), $e_{K.D;i} \to u_i$. By the same argument as in Theorem 5.2.2.1, the mean of the influence diagnostic for the clean data set is $1/p$. However, using Eq. (5.12) for the outliers, when $h_{K.D(a)} \to \infty$, then $e_{K.D;a} \to 0$, $D_{K.D(a)} \to 0$ and also $S_{K.D;i} \to 0$. Consequently, in the improved Liu structure for high leverage outliers, the diagnostic measure will be near 0 for the outliers and almost $1/p$ for the clean points. Additionally, the diagnostic statistic is applicable for detecting improved Liu middle leverage outliers that are not detected via Cook's measure. Actually, the improved Liu middle leverage outliers are a set of outliers with $h_{K.D(a)} \geq \max_{1\leq i\leq n} h_{K.D;i}$, that is, their improved Liu leverages are greater than those of the clean cases, while the true improved Liu residual $u_a$ is such that the observed improved Liu residuals $e_{K.D;a}$ given by Eq. (5.12) are not near 0. Accordingly, the cross leverages $h_{K.D(ia)}$ between the clean cases and the outliers will still be small, and thus $\gamma_{ia}^2$ will likewise be small. Consequently, the influence diagnostic measure for the improved Liu outlier cases will accumulate the Cook's measures of those cases, and the $S_{K.D;i}$ values will be larger than those for the clean cases. Hence, intermediate leverage points can easily be detected by $S_{K.D;i}$.

5.3 Simulation

5.3.1 Normality of the Proposed measure

In this section, we carry out a simulation study to examine the effect of influential observations in multicollinear data sets. The regressors and the outlier-free points on the response variable are generated, following McDonald and Galarneau (1975), Liu (2011), Aslam (2014) and Emami and Emami (2016), by

$$x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta\, w_{i,p+1}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p,$$

and

$$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, 1), \quad i = 1, 2, \ldots, n,$$

where the $w_{ij}$'s and $\varepsilon_i$ are independent random numbers, $x_{i1} = 1$ for every $i = 1, \ldots, n$, and $\theta^2$ is the correlation between any two explanatory variables.

[Figure 5.1: Influence investigation of the three generated data sets with various levels of multicollinearity among the independent variables: normal q-q plots of the proposed measure.]

[Figure 5.2: Influence investigation of the three generated data sets with various levels of multicollinearity among the independent variables: normal q-q plots of Cook's distance.]

[Figure 5.3: Influence investigation of the three generated data sets with various levels of multicollinearity among the independent variables: plots of Cook's measure against the proposed measure.]

[Figure 5.4: Influence investigation of the three generated data sets with various levels of multicollinearity among the independent variables: plots of the proposed measure versus observation number.]

The values of $\theta$ are taken as 0.9, 0.99 and 0.999. The resulting condition numbers

of the generated $X$ equal 15.6832, 38.5222 and 91.2375, respectively. Figs. 5.1 to 5.4 show (a) the normal q-q plot of the proposed measure, (b) the q-q plot of Cook's measure, (c) the plot of Cook's measure versus the proposed measure, and (d) the plot of the individual values of the proposed measure against observation number, for the first data set with $\theta = 0.9$, $n = 200$ and $p = 10$. The second panel of each figure shows the same plots for the second data set with $\theta = 0.99$, $n = 500$ and $p = 20$, and the third panel for the third data set with $\theta = 0.999$, $n = 1000$ and $p = 40$. We can see that the distribution of Cook's measure is asymmetrical in all three circumstances, whereas the distribution of the proposed measure is symmetrical. It is likewise observed from Fig. 5.4 that the expected values of the proposed measure in the three data sets are near $1/10$, $1/20$ and $1/40$, respectively, which is consistent with Theorem 5.2.2.1.
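The generation scheme above is easy to reproduce. The sketch below (the helper names are ours) draws the $w_{ij}$ as independent standard normals and reports the condition number of the resulting $X$; for simplicity it omits the intercept column $x_{i1} = 1$:

```python
import numpy as np

def gen_multicollinear(n, p, theta, seed=0):
    """x_ij = (1 - theta^2)^{1/2} w_ij + theta w_{i,p+1}, so that theta^2 is
    the population correlation between any two columns; y = X 1 + eps."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, p + 1))
    X = np.sqrt(1.0 - theta ** 2) * W[:, :p] + theta * W[:, [p]]
    y = X @ np.ones(p) + rng.standard_normal(n)
    return X, y

def condition_number(X):
    """Ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s[-1]
```

Raising $\theta$ from 0.9 towards 0.999 makes the columns of $X$ nearly collinear, which shows up directly as a sharply increasing condition number.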

5.3.2 Detection of intermediate and high improved Liu leverage outliers

Following Emami and Emami (2016), the performance of the proposed measure is demonstrated by simulation. In situation (a), a model $y_i = x_{i1} + x_{i2} + \varepsilon_i$ of sample size $n$ is generated, where $x_{i2} = \theta x_{i1} + \varepsilon_i$ and $x_{i1}, \varepsilon_i \sim N(0, 1)$. The last 3 points of this sample are then modified by introducing 3 outliers of size 5 but with different improved Liu leverages. Situation (b) corresponds to high improved Liu leverage $(x_{ij} = 20,\ j = 1, 2)$ and situation (c) to intermediate improved Liu leverage $(x_{ij} = 5,\ j = 1, 2)$. A simulated sample of size 30 with 3 outliers and $\theta = 0.90$ is generated.

Table 5.1: Three data sets and the proposed diagnostic measures.

Row      y         x1        x2        D       Sa      Sb      Sc
 1    -4.9179    1.1910    0.9756   0.0034  0.4901  0.5403  0.5011
 2     2.5445   -0.7472    0.7367   0.0087  0.5012  0.6048  0.5227
 3     1.6782    1.3767   -1.4788   0.0433  0.4588  0.5296  0.4988
 4    -3.6697   -0.4338   -1.1452   0.1635  0.4042  0.5519  0.5007
 5     1.6360   -1.1277    1.4874   0.0068  0.4608  0.5604  0.4932
 6    -2.2055   -0.7074   -1.0040   0.1811  0.4121  0.6033  0.4729
 7     0.7657   -0.5012   -1.3370   0.0013  0.5560  0.5401  0.4700
 8     1.8661    0.1169    1.3308   0.0021  0.5203  0.5500  0.5238
 9    -0.2508   -0.1340   -0.3019   0.0019  0.3445  0.5196  0.4954
10     0.1949    0.1434    1.5709   0.0523  0.3600  0.5423  0.5397
11     0.9930   -0.8010    0.3946   0.0577  0.5222  0.5907  0.5156
12    -6.0720   -0.0591   -0.3281   0.0837  0.5610  0.6188  0.5058
13    -0.3280    0.0841    0.1758   0.0198  0.4855  0.5544  0.5204
14     0.7523    0.2130   -0.0367   0.0030  0.5498  0.5203  0.5199
15     4.5109   -0.2199    2.0010   0.0069  0.4156  0.5344  0.5034
16    -4.8791    0.7863   -1.5626   0.1711  0.5361  0.5151  0.4819
17     1.7685   -0.2532    1.3426   0.0189  0.4822  0.6382  0.5201
18     0.3450   -2.1865    1.4787   0.0073  0.4610  0.6411  0.4881
19    -0.8137   -0.6599   -0.4408   0.0039  0.4016  0.6276  0.4482
20    -0.3182    1.3108   -0.4145   0.0026  0.3528  0.6501  0.4502
21    -2.8783   -0.7082   -2.3516   0.0022  0.3339  0.6332  0.5111
22    -1.3921   -1.3748   -0.8196   0.0329  0.4751  0.5882  0.5256
23    -4.0384   -0.3515    3.0427   0.0188  0.3967  0.5900  0.4889
24     3.9260   -1.3667    1.6616   0.0018  0.3342  0.5773  0.5172
25    -0.5038   -0.0975   -1.7527   0.0011  0.3807  0.5488  0.4724
26     3.7716    0.6456   -2.6722   0.0015  0.5330  0.6227  0.5121
27     1.8568    0.5391   -0.5250   0.0387  0.5116  0.6310  0.5043
28     2.5404 (a); 5 (b,c)    0.9536 (a); 20 (b); 5 (c)    0.8789 (a); 20 (b); 5 (c)   0.0008  0.5156  0.1766  0.6673
29    -3.4276 (a); 5 (b,c)    0.5003 (a); 20 (b); 5 (c)    2.0371 (a); 20 (b); 5 (c)   0.0012  0.5350  0.1766  0.6673
30     2.7443 (a); 5 (b,c)   -0.2932 (a); 20 (b); 5 (c)    0.2044 (a); 20 (b); 5 (c)   0.0033  0.5671  0.1766  0.6673

Note: in rows 28-30 the y, x1 and x2 entries differ across the three situations, labelled (a), (b) and (c).

The three data sets share the same cases 1-27; in rows 28-30: Sa, no outliers; Sb, 3 high improved Liu leverage outliers; Sc, 3 middle improved Liu leverage outliers.

[Figure 5.5: Plots of Cook's measure against the proposed measure in the three situations: (a) no outliers, (b) 3 high improved Liu leverage outliers and (c) 3 middle improved Liu leverage outliers.]

[Figure 5.6: Plots of the proposed measure versus observation number in the three situations: (a) no outliers, (b) 3 high improved Liu leverage outliers and (c) 3 middle improved Liu leverage outliers.]

The data, the proposed measure and the Cook's distance values for the three situations (for the simulated sample with $\theta = 0.90$) are presented in Table 5.1. The condition number of $X'X$ is about 244.6859, and the optimal value of $d$ is obtained by the PRESS criterion of Liu (2011).

The results of Table 5.1 for the three situations are shown in Figs. 5.5 and 5.6. In (a), all the values of the proposed measure are near its expected value $1/2$. In (b), the values of the proposed measure for the high improved Liu leverage outliers are almost 0, while for the good cases they are near the expected value, which is consistent with Theorem 5.2.2.3. In (c), the values of the proposed measure are greater for the outliers than for the good cases.

5.3.3 Performance of Pena’s statistic

In this section, following Emami and Emami (2016), we conduct a simulation study designed to compare the performance of the suggested identification method with Cook's distance $D_i$. The performance of the proposed technique was examined by applying the following model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i, \qquad 1 \leq i \leq 40,$$

where, for $1 \leq i \leq 40 - n_0$, $\beta_0 = \beta_1 = \beta_2 = \beta_3 = 1$, $\varepsilon$ is a vector of normal random variables and the $x_{ij}$'s are generated by

$$x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta\, w_{i,p+1}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, p,$$

with $\theta^2 = 0.70$, 0.90 and 0.99, which produce multicollinear regressors. To form an outlier pattern for the ILE, the sample was contaminated by a set of $n_0$ identical middle and high leverage outliers. The last $n_0$ observations, for $40 - n_0 + 1 \leq i \leq 40$, were identical, with $x_1 = x_0$, $x_2 = x_3 = 0$ and $y_0 = m x_0$; the $x_0$ values were chosen to be 5 and 10, and the contamination slope $m$ was fixed at 1, 2, 3 and 4. The number $n_0$ of outliers was taken as 2, 6 and 8, corresponding to 5%, 15% and 20% contamination. The 5% value was chosen to represent a small amount of contamination, while 15% and 20% constitute fairly heavily contaminated samples. For every contamination plan, 500 simulations were made. The outlier identification results of $S_{K.D;i}$ and Cook's distance $D_{K.D;i}$ are reported in Tables 5.2 and 5.3, respectively.

It can be seen from Tables 5.2 and 5.3 that the $S_{K.D;i}$ method is robust at 5%

Table 5.2: Percent of outlier identification through SK.D;i in simulation.

                 % outlier detected for x0 = 5      % outlier detected for x0 = 10
theta^2  % inf   m=1   m=2   m=3   m=4              m=1   m=2   m=3   m=4
0.70       5      96    98    98    97               97    96    95    98
0.70      15      81    83    84    85               89    90    89    91
0.70      20      75    77    78    76               85    86    84    87
0.90       5      94    92    93    95               95    94    96    95
0.90      15      78    79    77    78               88    86    88    87
0.90      20      71    76    78    75               84    85    86    85
0.99       5      82    80    83    84               94    93    92    95
0.99      15      76    75    77    78               86    87    85    86
0.99      20      70    72    71    73               83    82    84    84

contamination and avoids the masking issue when the outliers have high leverage $(x_0 = 10)$, whereas the power of $D_{K.D;i}$ decreases as the collinearity increases.

Table 5.3: Percent of outlier identification through $D_{K.D;i}$ in simulation.

                 % outlier detected for x0 = 5      % outlier detected for x0 = 10
theta^2  % inf   m=1   m=2   m=3   m=4              m=1   m=2   m=3   m=4
0.70       5      87    88    88    89               81    83    85    87
0.70      15      74    73    72    75               75    76    77    76
0.70      20      61    63    65    66               63    62    64    65
0.90       5      83    82    84    85               80    82    81    85
0.90      15      72    74    73    75               73    75    75    78
0.90      20      60    62    61    65               61    63    64    66
0.99       5      70    72    75    74               72    74    76    77
0.99      15      61    63    65    64               64    63    66    65
0.99      20      54    55    53    56               57    56    55    58

5.4 Illustration

The Longley (1967) data set, which has $n = 16$ observations and $p = 6$ explanatory variables, is used to illustrate the impact of strong multicollinearity. The scaled condition number of this data set is 43,275 (Belsley et al., 1980). This large number indicates the presence of extreme multicollinearity among the regressors. This data set has also been considered by several authors in connection with the identification of influential cases. Cook (1977) applied Cook's measure $D_i$ to this data set and found that cases 5, 16, 4, 10 and 15 were the most influential points when the OLS was used for estimation. Walker and Birch (1988) studied this data set to identify anomalous observations for the RR by applying Cook's distance and DFFITS; they detected cases 16, 10, 4, 15 and 5 as the most influential cases. Emami and Emami (2016) likewise examined these data to identify influential cases for the RR by applying the generalized influence statistic of Pena (2005). They identified cases 16, 15, 10, 5 and 1 (in this order) and 16, 15, 10, 5 and 4 as the most influential cases for OLS $(k = 0)$ and for $k = 0.0002$, respectively. In this chapter, we apply these data to evaluate the influential cases for the ILE by using Pena's influence measure. The influence statistic (5.5) was computed for the optimal value of $d$ and the outcomes are displayed in Table 5.4. It can be seen that cases 10, 4, 5, 16 and 15 are the most influential observations for the optimal $d$; the five observations with the largest $S_{K.D;i}$ are 10, 4, 5, 16 and 15 (in this specific order). The comparison of these cases with the cut-off point defined in Eq. (5.6) shows that they can be regarded as influential points. The results are also presented in Fig. 5.7: the histogram of the proposed measure separates the different sets of data, and the comparison of the largest proposed-measure values with the cut-off defined in Eq. (5.6) shows that these cases can be regarded as influential cases.

Table 5.4: Five largest values of $S_{K.D;i}$ for the optimal value of $d$ for the Longley data.

        OLS                Optimal value of d
Cases    Si         Cases    SK.D;i
 15     0.9728       10      0.7223
 10     0.9127        4      0.4421
  5     0.7652        5      0.4232
  4     0.7389       16      0.2639
 16     0.5223       15      0.2354

[Figure 5.7: Influence examination of the Longley data set: (a) histogram of the proposed measure, (b) plot of Cook's distance versus the proposed measure, (c) plot of the proposed measure versus cases.]

5.5 Conclusions

Influential observations affect model estimates and inferences. These observations are diagnosed by various methods for different models and under various assumptions. The diagnostic measures for the OLS are not reliable when the explanatory variables are multicollinear. In this chapter, we modified Pena's statistic for the ILE. We demonstrated that our modified measure has an asymptotically normal distribution and can identify a subset of high improved Liu leverage outliers. The numerical results clearly show that the modified statistic can identify influential observations correctly and provides better results than Cook's distance while using the ILE. Hence, in practice it would be more fruitful to use this influence diagnostic measure when facing the issue of multicollinearity.

Chapter 6

A New Diagnostic Method for Influential Observations in Ridge Regression

6.1 Introduction

Influence analysis plays an important role in statistical modeling and has attracted a lot of interest over the last three decades. The purpose of influence analysis is to evaluate whether there exists an observation, or any subset of observations, on which the obtained results depend heavily. Detecting these anomalous observations is indispensable in any statistical analysis, especially when we deal with complicated statistical models. Several diagnostic methods are available in the literature for this purpose, and extensive research is still going on.

Two main approaches have been discussed for developing influence measures in statistical modeling (Chatterjee and Hadi, 1988). The first is the case deletion methodology (Cook, 1977), applied for influence investigation in the MLR model and other statistical techniques. This is a practical technique for computing the outcome of global departures from a data set. Various articles and books have been published on the

identification of influential observations on the basis of this approach. Among these,

we mention precisely to those of Belsley et al. (1980), Banerjee and Frees (1997),

Beckman et al. (1987), Chatterjee and Hadi (1986), Christensen et al. (1992),

Draper and John (1981), Kim et al. (2001), Pena (1991), Preisser and Qaqish (1996),

Prendergast (2006), Rasekh and Fieller (2004), Rasekh and Mohtashami (2007), Roy

and Guria (2008), Weissfeld and Schneider (1990), Xie and Wei (2009), Zewotir and

Galpin (2005), Zhao and Lee (1998), Ullah and Pasha (2009), Martin et al. (2010),

Amanullah et al. (2013a) and Nurunnabi et al. (2016).

Two common single-case omission procedures in the literature are Cook's distance and DFFITS (Belsley et al., 1980). Atkinson (1986) observed that single-case omission diagnostic techniques often fail to disclose influential observations in the presence of masking effects. To remedy this problem of masking and/or swamping, a considerable number of research articles have been written based on group omission techniques (Cook and Weisberg, 1982). The disadvantage of this approach is that it is defined just for the omission of an influential group of observations, not for the influential observations of the whole data set. Imon (2005) introduced a measure for the whole data set in the MLR model named generalized DFFITS (GDFFITS).

Pena (2005) proposed a new measure that identifies the influential cases in a totally different way. He introduced a method to measure how every data point is affected by the rest of the data set. Nurunnabi et al. (2011) extended Pena's idea to group deletion and introduced a new measure to identify influential observations from the whole data set in linear regression models.

It is well known that the presence of multicollinearity in a data set affects the statistical quantities and gives unreliable results. To overcome this issue, RR is used instead of the OLS method for better results. The RR is due to Hoerl and Kennard (1970) and is extensively used to cope with the issue of multicollinearity. The issue lies in discovering influential observations by conventional diagnostic measures when some method alternative to the OLS procedure is used. In this context, the RR has been given considerable attention in the statistical literature. A large number of articles are available on this issue; see, e.g., Mason and Gunst (1985), Steece (1986), Walker and Birch (1988), Takeuchi (1991), Billor and Loynes (1999), Shi and Wang (1999), Billor (1992) and Jahufer and Jianbao (2009). Some books also cover this issue; see, e.g., Belsley et al. (1980) and Belsley (1991).

So far, no consideration has been given to identifying influential cases by applying the concept of forecasted change of Nurunnabi et al. (2011) in the context of RR. Therefore, the goal of this chapter is to develop the above-mentioned diagnostic measures in the RR setting.

The remainder of the chapter is organized as follows. The next section gives some discussion of RR. Section 6.3 contains some discussion of commonly used diagnostic methods using the ridge estimator. In Section 6.4, we propose our diagnostic technique for the detection of multiple influential cases. Section 6.5 gives two data sets used as examples for illustration. A simulation study is presented in Section 6.6. Finally, Section 6.7 offers concluding remarks.

6.2 Ridge Regression

An alternative estimation approach to the OLS is RR, which is used when multicollinearity is present in the data. Ridge estimators are more stable than the OLS estimators because, practically, ridge estimation consists of adding a small constant $k$ to the diagonal of $Z'Z$. The ridge estimator of $\beta$ can be written as $\hat{\beta}_k = \left(Z'Z + kI\right)^{-1} Z'y$, where $k > 0$, as proposed by Hoerl and Kennard (1970).

6.3 Diagnostic Methods in Ridge Regression

6.3.1 Leverage and Residuals

The vector of fitted values using the ridge estimator is given by

$$\hat{y}_R = Z\hat{\beta}_R \qquad (6.1)$$

$$\;\;\;= H_R\, y, \qquad (6.2)$$

where

$$H_R = Z\left(Z'Z + kI\right)^{-1} Z'. \qquad (6.3)$$

The projection matrix $H_R$ using the ridge estimator plays the same role in regression as the usual $H$ does in the OLS. Using the elements of $H_R$, the ith component of the vector of fitted values can be written as $\hat{y}_{R,i} = \sum_{j=1}^{n} h_{R,ij}\, y_j$; consequently, $\partial\hat{y}_{R,i}/\partial y_i = h_{R,ii} \equiv h_{R,i}$. The diagonal elements $h_{R,i}$ of the projection matrix $H_R$ can be understood as leverages in the same way as the diagonal elements $h_{ii}$ of $H$ in the OLS method. In the same way, the ith residual using the ridge estimator may be expressed as

$$e_{R,i} = y_i - \hat{y}_{R,i} = y_i - z_i'\hat{\beta}_R = (1 - h_{R,i})\, y_i. \qquad (6.4)$$
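Eqs. (6.1)-(6.4) translate directly into code. A minimal sketch (the function name is ours) that returns the ridge coefficients, the matrix $H_R$ and the residuals:

```python
import numpy as np

def ridge_fit(Z, y, k):
    """Ridge quantities of Section 6.3.1: beta_k = (Z'Z + kI)^{-1} Z'y,
    H_R = Z (Z'Z + kI)^{-1} Z', leverages h_{R,i} = diag(H_R) and
    residuals e_{R,i} = y_i - yhat_{R,i}."""
    p = Z.shape[1]
    A = Z.T @ Z + k * np.eye(p)
    beta = np.linalg.solve(A, Z.T @ y)         # ridge coefficients
    H = Z @ np.linalg.solve(A, Z.T)            # ridge "hat" matrix H_R
    return beta, H, y - H @ y
```

Increasing $k$ shrinks the fit, so the trace of $H_R$ (the effective number of parameters) and the leverages decrease with $k$.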

6.3.2 Cook’s distance, DFFITS and Pena’s measure

The Cook's distance measure for the ridge estimator in the MLR model is

$$D_{R,i} = \left(\hat{\beta}_R - \hat{\beta}_{R(i)}\right)' Z'Z \left(\hat{\beta}_R - \hat{\beta}_{R(i)}\right) \Big/ \left(p\,\hat{\sigma}^2\right), \qquad (6.5)$$

where $\hat{\beta}_{R(i)}$ is the ridge estimate obtained when the ith observation is deleted and $\hat{\sigma}^2$ is calculated from the OLS. This is the immediate generalization of the $D_i$ measure. For benchmark values of $D_{R,i}$, the standard $F_{(p,\,n-p)}$ distribution may be applied. The DFFITS measure is defined as

$$\mathrm{DFFITS}_{R(i)} = \frac{z_i\left(\hat{\beta}_R - \hat{\beta}_{R(i)}\right)}{SE\left(z_i\hat{\beta}_R\right)}, \qquad i = 1, 2, \ldots, n, \qquad (6.6)$$

which can also be re-expressed as

$$\mathrm{DFFITS}_{R(i)} = \frac{\hat{y}_{R(i)} - \hat{y}_{R(i)(-i)}}{\hat{\sigma}_{R(i)}\sqrt{h_{R,i}}}, \qquad i = 1, 2, \ldots, n, \qquad (6.7)$$

where $\hat{y}_{R(i)(-i)}$ is the fitted response and $\hat{\sigma}_{R(i)}$ is the estimated standard error using the ridge estimator with the ith observation deleted. Belsley et al. (1980) suggested declaring an observation influential if $\mathrm{DFFITS}_{R(i)} \geq 2\sqrt{p/(n+p)}$. The generalized DFFITS (GDFFITS) for this purpose is defined as

$$\mathrm{GDFFITS}_{R(i)} = \begin{cases} \dfrac{\hat{y}_{R(i)(-D)} - \hat{y}_{R(i)(-D-i)}}{\hat{\sigma}_{R(-D-i)}\sqrt{h_{R,i(-D)}}} & \text{for } i \in R, \\[2ex] \dfrac{\hat{y}_{R(i)(-D+i)} - \hat{y}_{R(i)(-D)}}{\hat{\sigma}_{R(-D)}\sqrt{h_{R,i(-D+i)}}} & \text{for } i \in D, \end{cases}$$

where $R$ is the set of observations remaining in the investigation after the deletion of the $D$ suspected cases of the data, $h_{R,i(-D)} = z_i'\left(Z_R'Z_R + kI\right)^{-1} z_i$ and $h_{R,i(-D+i)} = z_i'\left(Z_R'Z_R + z_i z_i' + kI\right)^{-1} z_i = h_{R,i(-D)}/\left(1 + h_{R,i(-D)}\right)$. Imon (2005) recommended considering observations as influential if $\mathrm{GDFFITS}_{R(i)} \geq 3\sqrt{k/(n-d)}$.

Pena's measure is defined as

$$S_{R,i} = \frac{s_{R,i}'\,s_{R,i}}{p\,\widehat{\mathrm{Var}}\left(\hat{y}_{R,i}\right)}; \qquad i = 1, 2, \ldots, n, \qquad (6.8)$$

where $s_{R,i} = \left(\hat{y}_{R,i} - \hat{y}_{R,i(-1)}, \ldots, \hat{y}_{R,i} - \hat{y}_{R,i(-n)}\right)'$ and $\hat{y}_{R,i} - \hat{y}_{R,i(-j)}$ is the difference between the ith fitted value with all cases in the data set and with the jth case omitted using the ridge estimator. After adjustment, this measure can also be re-expressed as

$$S_{R,i} = \frac{1}{p\,\hat{\sigma}^2 h_{R,i}} \sum_{j=1}^{n} \frac{h_{R,ji}^2\,e_{R,j}^2}{(1-h_{R,j})^2}, \qquad (6.9)$$

where $h_{R,i}$ is the ith diagonal element and $h_{R,ji}$ is the $(j,i)$th element of the projection matrix $H_R$, $\hat{y}_{R,i} - \hat{y}_{R,i(-j)} = h_{R,ji}\,e_{R,j}/(1-h_{R,j})$ and $\widehat{\mathrm{Var}}(\hat{y}_{R,i}) = \hat{\sigma}^2 h_{R,i}$. An observation is called influential if it satisfies the rule

$$|S_{R,i}| \geq \mathrm{Median}(S_{R,i}) + 4.5\,\mathrm{MAD}(S_{R,i}), \qquad (6.10)$$

where $\mathrm{MAD}(S_{R,i}) = \mathrm{Median}\left\{|S_{R,i} - \mathrm{Median}(S_{R,i})|\right\}/0.6745$.

6.4 Proposed Diagnostic Method

Nurunnabi et al. (2011) extended Pena's idea to group deletion and introduced a new measure to identify influential observations in the entire data set. They observed that in Pena's measure the leverage values receive more weight than in ordinary impact measures, which is why Pena's measure is particularly useful for detecting high leverage outliers, usually regarded as the most troublesome type of heterogeneity to identify in regression problems. According to Imon (2005), residuals and leverages may fail to separate cases effectively in the presence of multiple influential cases, particularly when these cases are high leverage outliers, and a single-case deletion measure may be unable to focus on the genuine influence of these points. Furthermore, according to Hadi and Simonoff (1993), group deletion measures reduce the maximum disturbance by omitting the suspicious group of influential observations at once, making the data more homogeneous than before. For this reason, they proposed a measure that comprises two stages. In the first step, they attempt to detect the suspected influential observations; it is not easy to detect all influential observations at the first attempt because of masking and swamping problems. If any suspected observation is left in the data set, the detection strategy becomes very cumbersome, so they needed to ensure that all potentially influential observations were highlighted as suspects while, at the same time, no innocent cases were wrongly omitted, since according to Habshah et al. (2009) the omission of such observations, particularly when they are high leverage ones, may adversely affect the whole influence-detection procedure. For this purpose, they employed the BACON procedure. In the second step, they used a group deletion form of Pena's measure to confirm whether the suspected observations were truly influential or not.
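As an illustration of the first step, the subset-growing idea behind BACON can be sketched as follows. This is a much-simplified, single-threshold version of the Billor, Hadi and Velleman (2000) algorithm (which uses chi-square quantiles and further refinements), not the procedure actually used in the thesis:

```python
import numpy as np

def bacon_nominate(X, m=None, cutoff=4.0, max_iter=20):
    """Very simplified BACON-style outlier nominator.
    Starts from the m cases closest to the coordinatewise median and
    iteratively grows the 'clean' subset; cases left outside at
    convergence are nominated as suspects.  The fixed `cutoff` on the
    Mahalanobis distance replaces the chi-square quantile of the
    original algorithm, so this is only an illustrative sketch."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if m is None:
        m = min(n, 4 * p)
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    basic = np.argsort(d0)[:m]
    for _ in range(max_iter):
        mu = X[basic].mean(axis=0)
        S_inv = np.linalg.pinv(np.cov(X[basic], rowvar=False))
        d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu))
        new_basic = np.where(d <= cutoff)[0]
        if set(new_basic) == set(basic):
            break
        basic = new_basic
    return np.setdiff1d(np.arange(n), basic)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[:3] += 10.0                      # three planted outliers
suspects = bacon_nominate(X)
print(suspects)                    # the planted cases 0, 1, 2 are nominated
```

The second step would then apply the group-deleted measure to confirm which nominated suspects are genuinely influential.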

After discovering a group of $d$ suspected cases among the $n$ cases by the BACON procedure, they denote the group of observations "remaining" in the analysis by $R$ and the group of observations "deleted" by $D$, so that, without loss of generality, the deleted cases are the last $d$ rows of $Z$ and $Y$:

$$Z = \begin{pmatrix} Z_{(R)} \\ Z_{(D)} \end{pmatrix}, \qquad Y = \begin{pmatrix} Y_{(R)} \\ Y_{(D)} \end{pmatrix}.$$

The Nurunnabi et al. (2011) measure using RR is defined as

$$M_{R,i} = \frac{t_{R(i)(-D)}'\, t_{R(i)(-D)}}{k\,\widehat{\mathrm{Var}}\left(\hat{y}_{R(i)(-D)}\right)}; \quad i = 1, 2, \ldots, n, \qquad (6.11)$$

where

$$t_{R(i)(-D)} = \left(\hat{y}_{R,1(-D)} - \hat{y}_{R,1(i)(-D)}, \ldots, \hat{y}_{R,n(-D)} - \hat{y}_{R,n(i)(-D)}\right)' \qquad (6.12)$$
$$= \left(t_{R,1(i)(-D)}, \ldots, t_{R,n(i)(-D)}\right)' \qquad (6.13)$$

and

$$t_{R,j(i)(-D)} = \hat{y}_{R,j(-D)} - \hat{y}_{R,j(i)(-D)} = \frac{h_{R,ji}\, e_{R,i(-D)}}{1 - h_{R,i}}, \quad j = 1, 2, \ldots, n, \qquad (6.14)$$

where

$$h_{R,ji} = z_j'\left(Z'Z + kI\right)^{-1} z_i \quad \text{and} \quad e_{R,i(-D)} = y_{R,i} - \hat{y}_{R,i(-D)},$$

and also

$$\widehat{\mathrm{Var}}\left(\hat{y}_{R(i)(-D)}\right) = \hat{\sigma}^2 h_{R,i} \quad \text{and} \quad \hat{\sigma}^2 = \frac{e_{R(-D)}'\, e_{R(-D)}}{n - k}.$$

After simplification, using (6.11)-(6.14), $M_{R,i}$ can be written as

$$M_{R,i} = \frac{1}{k\,\hat{\sigma}^2 h_{R,i}} \sum_{j=1}^{n} h_{R,ji}^2\, \frac{e_{R,i(-D)}^2}{\left(1 - h_{R,i}\right)^2}. \qquad (6.15)$$

Following the same argument as Pena (2005), an observation is considered influential if it satisfies the rule

$$\left|M_{R,i}\right| \ge \mathrm{Median}\left(M_{R,i}\right) + 4.5\,\mathrm{MAD}\left(M_{R,i}\right), \qquad (6.16)$$

where $\mathrm{MAD}\left(M_{R,i}\right) = \mathrm{Median}\left\{\left|M_{R,i} - \mathrm{Median}\left(M_{R,i}\right)\right|\right\}/0.6745$.
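A rough computational sketch of the measure is given below. The suspect set D is supplied directly rather than produced by BACON, and `p` plays the role of the dimension that the denominators of (6.11) and (6.15) denote by k (where k also denotes the ridge parameter) — both are assumptions on our part:

```python
import numpy as np

def m_ridge(Z, y, D, k_ridge):
    """Sketch of the group-deleted ridge measure M_{R,i} of (6.11)-(6.16).
    Z is assumed centred/scaled; D is the suspect set (found here by the
    user, not by BACON); `p` replaces the k of the text's denominators."""
    Z, y = np.asarray(Z, float), np.asarray(y, float)
    n, p = Z.shape
    R = np.setdiff1d(np.arange(n), D)
    A_inv = np.linalg.inv(Z[R].T @ Z[R] + k_ridge * np.eye(p))
    beta_R = A_inv @ Z[R].T @ y[R]          # ridge fit on remaining cases
    e = y - Z @ beta_R                      # e_{R,i(-D)} for all i
    H = Z @ A_inv @ Z.T                     # h_{R,ji} over the full data
    h = np.diag(H)
    sigma2 = (e[R] @ e[R]) / (len(R) - p)
    # M_{R,i} = sum_j h_{R,ji}^2 * e_i^2 / (1-h_i)^2, scaled by p*sigma2*h_i
    M = (H**2).sum(axis=0) * e**2 / ((1 - h)**2 * p * sigma2 * h)
    med = np.median(M)
    mad = np.median(np.abs(M - med)) / 0.6745
    return M, np.where(np.abs(M) >= med + 4.5 * mad)[0]

rng = np.random.default_rng(2)
Z = rng.normal(size=(40, 3))
y = Z @ np.ones(3) + 0.1 * rng.normal(size=40)
Z[-3:] += 4.0                               # three planted high-leverage
y[-3:] += 5.0                               # outliers (cases 37, 38, 39)
M, flagged = m_ridge(Z, y, D=[37, 38, 39], k_ridge=0.5)
print(flagged)                              # planted cases appear among flags
```

Because the suspects are excluded from the fit, their residuals and leverages are computed against the clean subset, which is what protects the measure from masking.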

6.5 Examples

In this section, we examine the performance of our proposed diagnostic measure on two commonly used data sets, on an artificial high-dimensional large data set, and through a simulation study. We evaluate the proposed method against the Di measure, DFFITS and Pena's measure for the detection of influential cases in MLR with the ridge estimate.

6.5.1 Longley Data

Our first example is the well-known Longley data set from Longley (1967). The data set contains 16 observations and 6 regressors. Its scaled condition number is 43,275 according to Belsley et al. (1980); this large value indicates extreme multicollinearity among the regressors.
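The scaled condition number is the ratio of the largest to the smallest singular value of the design matrix after each column is scaled to unit length. The sketch below applies this to the Longley regressors of Appendix B with an added intercept column; the exact value obtained depends on scaling choices, so it illustrates the computation rather than reproducing the Belsley et al. (1980) figure:

```python
import numpy as np

# Longley regressors (Appendix B, Table B.1): X1-X5 and the year.
data = np.array([
    [83.0, 234289, 2356, 1590, 107608, 1947],
    [88.5, 259426, 2325, 1456, 108632, 1948],
    [88.2, 258054, 3682, 1616, 109773, 1949],
    [89.5, 284599, 3351, 1650, 110929, 1950],
    [96.2, 328975, 2099, 3099, 112075, 1951],
    [98.1, 346999, 1932, 3594, 113270, 1952],
    [99.0, 365385, 1870, 3547, 115094, 1953],
    [100.0, 363112, 3578, 3350, 116219, 1954],
    [101.2, 397469, 2904, 3048, 117388, 1955],
    [104.6, 419180, 2822, 2857, 118734, 1956],
    [108.4, 442769, 2936, 2798, 120445, 1957],
    [110.8, 444546, 4681, 2637, 121950, 1958],
    [112.6, 482704, 3813, 2552, 123366, 1959],
    [114.2, 502601, 3931, 2514, 125368, 1960],
    [115.7, 518173, 4806, 2572, 127852, 1961],
    [116.9, 554894, 4007, 2827, 130081, 1962],
])
X = np.column_stack([np.ones(len(data)), data])  # intercept column included
Xs = X / np.linalg.norm(X, axis=0)               # unit column length scaling
s = np.linalg.svd(Xs, compute_uv=False)
kappa = s[0] / s[-1]
print(f"scaled condition number: {kappa:.0f}")   # very large -> severe collinearity
```

Condition numbers above about 30 are commonly read as signalling harmful collinearity; the Longley value is orders of magnitude beyond that.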

Table 6.1 presents various influence measures for the Longley data set. From this table we see that the Di measure detects 3 cases (4, 10 and 16) as influential. DFFITS fails to detect any influential observation. Pena's measure SR,i detects only observation 16 as influential; see Table 6.1 and Fig. 6.1 (c). We then apply our proposed measure to this data set. First, we employ BACON (2000), and 5 cases (4, 5, 10, 15 and 16) are highlighted as suspicious influential cases. Second, we compute the proposed measure MR,i for the entire data set and observe that all the suspicious cases are declared influential. Fig. 6.1 (d) clearly supports the proposed measure MR,i.

Table 6.1: Measures of influence for the Longley data.

Index   DR,i      DFFITR(i)   SR,i      MR,i
Cut-off (0.201)   (|1.323|)   (0.587)   (0.214)
1       0.142      0.370      0.2444    0.0437
2       0.033     -0.182      0.0765    0.0576
3       0.004      0.061      0.1672    0.0387
4       0.219     -0.476      0.2840    0.3674
5       0.094      0.356      0.2031    0.5732
6       0.112     -0.347      0.1839    0.0109
7       0.018     -0.146      0.1437    0.0359
8       0.002      0.043      0.0392    0.0042
9       0.000      0.017      0.2024    0.0032
10      0.251      0.648      0.2919    1.4637
11      0.003      0.063      0.1715    0.0201
12      0.002     -0.048      0.0902    0.0107
13      0.041     -0.215      0.1248    0.0307
14      0.003     -0.055      0.1918    0.0254
15      0.145      0.383      0.4121    0.4256
16      0.582     -1.139      0.6161    0.7682

Note: Italic boldface values indicate cases exceeding the cut-off point.

Figure 6.1: Index plots for the Longley data: (a) DR,i measure; (b) DFFITS; (c) Pena's measure SR,i; (d) proposed influence diagnostic measure MR,i.

6.5.2 Tobacco Data

Myers (1986) presented the tobacco blends data set, with 4 regressors and 30 observations of which 5 (1, 4, 11, 14 and 24) are most influential. The motive for taking this data set is to illustrate our procedure on a fairly large sample. The scaled condition number for this data is 22,293 as indicated by Billor and Loynes (1999). We calculate the different measures for this data set and report the findings in Table 6.2. We observe that Cook's distance detects only a single case (case 4) as influential. DFFITS detects 3 cases (4, 14 and 24) correctly but fails to identify the remaining 2 cases. Pena's measure SR,i detects only case 24 as influential; see Fig. 6.2 (c). When we apply BACON (2000) to this data set, cases 1, 4, 11, 14 and 24 are flagged as suspicious. We then compute the proposed measure MR,i for the whole data set after deletion of the suspected observations and find that the MR,i values for these cases each satisfy rule (6.16), so they can be declared influential; see Fig. 6.2 (d).

Table 6.2: Measures of influence for the tobacco blends data set.

Index   DR,i      DFFITR(i)   SR,i      MR,i
Cut-off (0.297)   (|0.816|)   (0.922)   (0.333)
1       0.061      0.561      0.2454    0.4762
2       0.000     -0.035      0.1949    0.1613
3       0.000     -0.030      0.3725    0.1314
4       0.358      1.777      0.7663    0.5376
5       0.003     -0.125      0.5334    0.2312
6       0.011     -0.238      0.6061    0.2241
7       0.006      0.176      0.2429    0.1154
8       0.033      0.421      0.7131    0.1356
9       0.007      0.185      0.0938    0.1265
10      0.019      0.309      0.2105    0.0271
11      0.078     -0.665      0.4567    0.6185
12      0.021     -0.326      0.2766    0.1075
13      0.003     -0.115      0.5186    0.0101
14      0.176     -1.046      0.4238    0.6584
15      0.041     -0.467      0.6688    0.1462
16      0.001     -0.079      0.2145    0.1321
17      0.069      0.608      0.2600    0.1268
18      0.027      0.374      0.3483    0.1413
19      0.007     -0.194      0.3706    0.1502
20      0.021     -0.323      0.2291    0.1334
21      0.001      0.051      0.2438    0.1037
22      0.000     -0.043      0.3052    0.1002
23      0.001     -0.051      0.3116    0.1108
24      6.626      5.913      6.6532    6.9867
25      0.002     -0.107      0.3899    0.1127
26      0.000      0.018      0.2932    0.0238
27      0.000      0.006      0.3708    0.0052
28      0.004      0.134      0.1648    0.0041
29      0.000     -0.048      0.0893    0.0081
30      0.002     -0.110      0.3145    0.0016

Note: Italic boldface values indicate cases exceeding the cut-off point.

Figure 6.2: Index plots for the tobacco blends data: (a) DR,i measure; (b) DFFITS; (c) Pena's measure SR,i; (d) proposed diagnostic measure MR,i.

6.5.3 Artificial high-dimensional large data set containing heterogeneous cases

The regressor variables and the observations on the response variable with no outliers are generated following Kibria (2003), Liu (2003) and Emami and Emami (2016) by

$$x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta\, w_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p,$$

and

$$y_i = \sum_{j=1}^{20} \beta_j x_{ij} + \beta_{21} W + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \; i = 1, 2, \ldots, n,$$

where the $w_{ij}$'s and $\varepsilon_i$ are independent standard normal pseudo-random numbers, $\theta^2$ is the correlation between any two regressor variables, and the $x$'s are 20-dimensional random variables following a uniform distribution. The first 400 cases are set at $w = 0$ and the last 100 cases at $w = 1$ for the categorical variable $W$. For heterogeneous sample cases, we generate observations for each $x$ corresponding to $w = 0$ from Uniform(0, 10), while the $x$ variables corresponding to $w = 1$ are generated from Uniform(9, 10). The parameters were chosen as $\beta_0 = \beta_1 = \cdots = \beta_{20} = 1$ and $\beta_{21} = -100$.
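The regressor-generation scheme above can be checked numerically: under it, each regressor has unit variance and any two regressors share the common component θ·w_{i,p+1}, so their correlation is θ². A minimal sketch:

```python
import numpy as np

# x_ij = (1 - theta^2)^{1/2} w_ij + theta * w_{i,p+1}  (Kibria, 2003 scheme)
# makes the correlation between any two regressors equal to theta^2.
rng = np.random.default_rng(3)
n, p, theta = 100_000, 4, 0.9
W = rng.normal(size=(n, p + 1))
X = np.sqrt(1 - theta**2) * W[:, :p] + theta * W[:, [p]]

corr = np.corrcoef(X, rowvar=False)
off = corr[np.triu_indices(p, 1)]          # pairwise correlations
print(round(float(off.mean()), 2))         # -> approx 0.81 (= theta^2)
```

Choosing θ close to 1 therefore produces the severe multicollinearity that the ridge estimate is meant to handle.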

We suspect the final 100 observations of being influential and build the deletion set D from them. The proposed measure perfectly detects all 100 observations as influential. Fig. 6.3 offers a range of graphical presentations of Pena's measure and the proposed measure for this synthetic data set. The residual versus fitted value scatter plot in (a) shows no sign of heterogeneity for most of the points. The plots of Cook's distance, DFFITS and Pena's statistic (see (c), (d), (e) and (f)) completely fail to identify the influential observations. The histogram of the proposed measure (see (g)) clearly demonstrates the existence of heterogeneity, and the index plot of the proposed measure (see (h)) shows that it effectively detects all influential observations.

Figure 6.3: Influence investigation of the large high-dimensional data set with heterogeneous sample cases: (a) residuals versus fitted values plot; (b) histogram of the residuals; index plots of (c) Cook's distance and (d) DFFITS; (e) index plot of Pena's statistic; (f) histogram of Pena's statistic; (g) histogram of the proposed measure; (h) index plot of the proposed measure.

6.6 Simulation Findings

In this section, we perform a Monte Carlo simulation designed in a similar fashion to Nurunnabi et al. (2011) to compare the performance of the proposed measure MR,i with Pena's measure SR,i for assessing the influence of cases. We study two sample sizes (n = 50 and 100) with various levels (γ = 10%, 20%, 30% and 40%) of influential observations. We consider the MLR model

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (6.17)$$

where the $x_{ij}$'s are generated by

$$x_{ij} = \left(1 - \theta^2\right)^{1/2} w_{ij} + \theta\, w_{i,p+1}, \quad i = 1, \ldots, n, \; j = 1, \ldots, p.$$

We generate the first 100(1 − γ)% of the X variables from Uniform(1, 4), and the corresponding Y values are computed from Eq. (6.17) with random errors εi from Normal(0, 0.2) and β0 = 2, βj = 1 for j = 1, 2, ..., 5. The remaining 100γ% of the X variables are generated from Normal(7, 0.5), and the corresponding Y values from Normal(2, 0.5). Our results in Table 6.3 are based on 10,000 simulations for every variant of the data. The results report the correct identification percentage (CIP), defined as

$$\mathrm{CIP} = \frac{\text{number of influential cases detected}}{\text{total number of influential cases}} \times 100.$$
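The CIP computation can be written as a small helper function (the names are ours, purely for illustration):

```python
def cip(detected, true_influential):
    """Correct identification percentage: the share of the truly
    influential cases that the diagnostic actually flags."""
    hits = len(set(detected) & set(true_influential))
    return 100.0 * hits / len(true_influential)

# Example: 3 of the 5 true influential cases were flagged.
print(cip(detected=[4, 10, 16, 2], true_influential=[4, 5, 10, 15, 16]))  # -> 60.0
```

Note that CIP ignores swamping: an innocent case wrongly flagged (case 2 above) does not reduce the score.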

Table 6.3: Simulation results.

% inf. cases   n     Measure   CIP (%)
10%            50    SR,i       79.1
                     MR,i      100
               100   SR,i       18.3
                     MR,i      100
20%            50    SR,i       55.1
                     MR,i      100
               100   SR,i       14.4
                     MR,i      100
30%            50    SR,i       27.4
                     MR,i      100
               100   SR,i        9.7
                     MR,i      100
40%            50    SR,i        2.1
                     MR,i      100
               100   SR,i        1.3
                     MR,i      100

We see from the results in Table 6.3 that SR,i performs very poorly in detecting influential observations across the various sample sizes and levels of influential cases. The proposed measure MR,i, in contrast, detects multiple influential cases efficiently and performs significantly better than SR,i: it effectively detects all influential cases regardless of sample size and percentage of influential observations.

6.7 Conclusions

In this chapter, we extended the Nurunnabi et al. (2011) group-deletion measure, originally given for the OLS model, to the ridge estimate for the detection of multiple influential cases in the MLR model. A demonstration of the modified measure on several well-known data sets and a simulation study support the merit of our measure, whereas the existing procedures do not show satisfactory performance. As far as our research experience shows, the newly proposed modified measure is fairly acceptable in situations, such as clusters of high ridge leverage cases in large high-dimensional data sets, that are not easy to handle with the commonly used methods.

Chapter 7

Conclusions

In linear regression analysis, the detection of influential cases has received much attention and comprehensive study in the model building process. The identification of these observations is complicated by the presence of multicollinearity; thus, detecting them is one of the essential steps in regression analysis when multicollinearity is present.

Firstly, we discussed Pena's influence diagnostic for the MLR model with the LE. We proved that this diagnostic measure is asymptotically normal and can identify a subset of high Liu leverage outliers. The numerical results show that the diagnostic measure is helpful. Nonetheless, the detection of unusual observations remains one of the primary tasks of a capable analyst. It is important to note that the influence measures obtained for OLS are not necessarily reliable for the biased estimators; thus, in practice it is more important to apply the influence diagnostics alongside the knowledge and skill of the investigator.

Influential observations affect the model estimates and inferences. These observations are diagnosed by various methods for different models and under various assumptions. In this thesis, we modified Pena's statistic for modified ridge regression. The diagnostic measures derived for OLS are not reliable when the explanatory variables are multicollinear, because the influence of each observation is a function of the shrinkage parameter; Belsley et al. (1980) noted that the shrinkage parameter must be computed before the assessment of influential observations. We demonstrated that our modified statistic has an asymptotically normal distribution and can detect a subset of high modified ridge leverage outliers. The numerical results clearly show that the modified statistic identifies influential observations correctly and provides results similar to Cook's distance in modified ridge regression.

Next, we discussed Pena's influence diagnostic for the MLR model with the improved Liu estimator. We proved that this diagnostic measure is asymptotically normal and can identify a subset of high improved Liu leverage outliers. The numerical results show that the diagnostic measure is helpful. It is important to note that the influence measures obtained for OLS are not necessarily reliable for the biased estimators.

Finally, the ridge estimator has growing and widening applications in statistical data analysis as an alternative to the OLS estimator for combating multicollinearity in linear regression models. In regression diagnostics, a large number of influence diagnostic methods based on numerous statistical tools have been discussed. In this thesis, we initially reviewed the available measures in the literature. We then extended the Nurunnabi et al. (2011) group-deletion measure, originally given for the OLS model, to the ridge estimate for the detection of multiple influential cases in MLR. A demonstration of the modified measure on a number of commonly used data sets and simulation studies supports the merit of our measure, whereas the existing procedures do not show satisfactory performance. As far as our research experience shows, the newly proposed modified measure is fairly acceptable in situations, such as clusters of high ridge leverage cases in large high-dimensional data sets, that are not easy to handle by the commonly used methods.

References

Adewale, F., Lukman, Oyedeji, J., Abiola, R. (2016). Modified Pena's Measure for Detecting Influential Observations in Biased Estimators. Zimbabwe Journal of Science and Technology, 2, 58–65.
Amanullah, M., Pasha, G.R., Aslam, M. (2013a). Assessing influence on the Liu estimates in linear regression models. Communications in Statistics-Theory and Methods, 42(17), 4200–4216.
Amanullah, M., Pasha, G.R., Aslam, M. (2013b). Local influence diagnostics in the modified ridge regression. Communications in Statistics-Theory and Methods, 42, 1851–1869.
Aslam, M. (2014). Performance of Kibria's method for the heteroscedastic ridge regression model: some Monte Carlo evidence. Communications in Statistics-Simulation and Computation, 43, 673–686.
Atkinson, A.C. (1986). Masking unmasked. Biometrika, 73, 533–541.
Atkinson, A.C. (1985). Plots, Transformations and Regression. Oxford, UK: Clarendon Press.
Banerjee, M., Frees, E. W. (1997). Influence diagnostics for linear longitudinal models. Journal of the American Statistical Association, 92, 999–1005.
Beckman, R. J., Nachtsheim, C. J., Cook, R. D. (1987). Diagnostics for mixed-model analysis of variance. Technometrics, 29, 413–426.
Belsley, D.A., Kuh, E., Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Belsley, D.A. (1991). Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: John Wiley and Sons.

Billor, N. (1992). Diagnostic methods in ridge regression and errors-in-variables model. Ph.D. Thesis, University of Sheffield, UK.

Billor, N., Loynes, R. M. (1999). An application of the local influence approach to ridge regression. Journal of Applied Statistics, 26, 177–183.
Billor, N., Hadi, A. S., Velleman, F. (2000). BACON: Blocked adaptive computationally efficient outlier nominators. Computational Statistics and Data Analysis, 34, 279–298.
Cancho, V. G., Ortega, E. M. M., Paula, G. A. (2010). On estimation and influence diagnostics for log-Birnbaum–Saunders Student-t regression models: full Bayesian analysis. Journal of Statistical Planning and Inference, 140, 2486–2496.
Chatterjee, S., Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1, 379–416.
Chatterjee, S., Hadi, A.S. (1988). Sensitivity Analysis in Linear Regression. New York: John Wiley.
Chatterjee, S., Hadi, A. S. (2006). Regression Analysis by Example. 4th ed. New York: Wiley.
Christensen, R., Pearson, L. M., Johnson, W. (1992). Case-deletion diagnostics for mixed models. Technometrics, 34, 38–45.
Cook, R. D. (1977). Detection of influential observations in linear regression. Technometrics, 19, 15–18.
Cook, R. D. (1979). Influential observations in linear regression. Journal of the American Statistical Association, 74, 169–174.
Cook, R. D., Weisberg, S. (1982). Residuals and Influence in Regression. New York: Chapman and Hall.
Cook, R. D. (1986). Assessment of local influence (with discussion). Journal of the Royal Statistical Society, Series B, 48, 133–169.
Draper, N. R., John, J. A. (1981). Influential observations and outliers in regression. Technometrics, 23, 21–26.
Emami, H., Emami, M. (2016). New influence diagnostics in ridge regression. Journal of Applied Statistics, 43(3), 476–489.
Gruber, M. H. J. (2012). Liu and ridge estimators - a comparison. Communications in Statistics-Theory and Methods, 41(20), 3739–3749.
Habshah, M., Norazan, R., Imon, A. H. M. R. (2009). The performance of diagnostic-robust generalized potentials for the identification of multiple high leverage points in linear regression. Journal of Applied Statistics, 36, 507–520.
Hadi, A. S., Simonoff, J. S. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264–1272.

Hoerl, A. E., Kennard, R. W. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Imon, A. H. M. R. (2005). Identifying multiple influential observations in linear regression. Journal of Applied Statistics, 32, 929–946.
Jahufer, A., Jianbao, C. (2009). Assessing global influential observations in modified ridge regression. Statistics and Probability Letters, 79, 513–518.
Jahufer, A., Chen, J. (2012). Identifying local influential observations in Liu estimator. Metrika, 75(3), 425–438.
Jahufer, A. (2013). Detecting global influential observations in Liu regression model. Open Journal of Statistics, 3, 5–11.
Kashif, M., Amanullah, M., Aslam, M. (2018). Pena's statistic for the Liu regression. Journal of Statistical Computation and Simulation, 88(13), 2473–2488.
Kashif, M., Amanullah, M., Aslam, M. (2019). Influential diagnostics with Pena's statistic for the modified ridge regression. Communications in Statistics-Simulation and Computation, DOI: 10.1080/03610918.2019.1634204 (online published).
Kibria, B. M. G. (2003). Performance of some new ridge regression estimators. Communications in Statistics-Simulation and Computation, 32, 419–435.
Kim, C., Lee, Y., Park, B. U. (2001). Cook's distance in local polynomial regression. Statistics and Probability Letters, 54, 33–40.
Liu, K. (1993). A new class of biased estimate in linear regression. Communications in Statistics-Theory and Methods, 22, 393–402.
Liu, S. (2000). On local influence for elliptical linear models. Statistical Papers, 41, 211–224.
Liu, K. (2003). Using Liu-type estimator to combat collinearity. Communications in Statistics-Theory and Methods, 32, 1009–1020.
Liu, X. Q. (2011). Improved Liu estimator in a linear regression model. Journal of Statistical Planning and Inference, 141, 189–196.
Longley, J.W. (1967). An appraisal of least squares programs for the electronic computer from the point of view of the user. Journal of the American Statistical Association, 62, 819–841.
Martin, M.A., Roberts, S., Zheng, L. (2010). Delete-2 and delete-3 jackknife procedures for unmasking in regression. Australian and New Zealand Journal of Statistics, 52, 45–60.
Mason, R. L., Gunst, R. F. (1985). Outlier-induced collinearities. Technometrics, 27, 401–407.

McDonald, G. C., Galarneau, D.I. (1975). A Monte Carlo evaluation of some ridge-type estimators. Journal of the American Statistical Association, 70, 407–416.
Montgomery, D. C., Peck, E. A., Vining, G. G. (2001). Introduction to Linear Regression Analysis. 3rd ed. New York: Wiley.
Myers, R. H. (1986). Classical and Modern Regression with Applications. Boston: Duxbury Press.
Newhouse, J. P., Oman, S.D. (1971). An evaluation of ridge estimators. Rand Report No. R-716-PR, 1–28.
Nurunnabi, A. A. M., Rahmatullah Imon, A. H. M., Naseer, M. (2011). A diagnostic measure for influential observations in linear regression. Communications in Statistics-Theory and Methods, 40(7), 1169–1183.
Nurunnabi, A. A. M., Naseer, M., Imon, A. H. M. R. (2016). Identification and classification of multiple outliers, high leverage points and influential observations in linear regression. Journal of Applied Statistics, 43(3), 509–525.
Ortega, E. M. M., Bolfarine, H., Paula, G. A. (2003). Influence diagnostics in generalized log-gamma regression models. Computational Statistics and Data Analysis, 42, 165–186.
Ozkale, M. R., Kaciranlar, S. (2007). The restricted and unrestricted two-parameter estimators. Communications in Statistics-Theory and Methods, 36(15), 2707–2725.
Paula, G. A., Cysneiros, F. J. A. (2010). Local influence under parameter constraints. Communications in Statistics-Theory and Methods, 39, 1212–1228.
Pena, D. (2005). A new statistic for influence in linear regression. Technometrics, 47, 1–12.
Pena, D. (1991). Measuring influence in dynamic regression models. Technometrics, 33, 93–101.
Preisser, J. S., Qaqish, B. F. (1996). Diagnostics for generalized estimating equations. Biometrika, 83, 551–562.
Prendergast, L. A. (2006). Detecting influential observations in sliced inverse regression analysis. Australian and New Zealand Journal of Statistics, 48, 285–304.
Rancel, M. M. S., Sierra, M. A. G. (2001). Regression diagnostics using local influence: a review. Communications in Statistics-Theory and Methods, 30, 799–813.
Rasekh, A.R., Fieller, N.R.J. (2003). Influence functions in functional measurement error models with replicated data. Statistics, 37, 169–178.
Rasekh, A., Mohtashami, G. (2007). Assessing influence on ridge estimates in functional measurement error models. Annales de l'I.S.U.P., Publications de l'Institut de Statistique de l'Université de Paris, 1(2), 97–109.

Roy, S. S., Guria, S. (2008). Diagnostics in logistic regression models. Journal of the Korean Statistical Society, 37, 89–94.
Shi, L., Wang, X. (1999). Local influence in ridge regression. Computational Statistics and Data Analysis, 31, 341–353.
Steece, B. M. (1986). Regression space outliers in ridge regression. Communications in Statistics-Theory and Methods, 15, 3599–3605.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 197–206.
Swindel, B. F. (1976). Good ridge estimators based on prior information. Communications in Statistics-Theory and Methods, 17, 1065–1075.
Takeuchi, H. (1991). Detecting influential observations by using a new expression of Cook's distance. Communications in Statistics-Theory and Methods, 20, 261–274.
Tanaka, Y., Zhang, F., Mori, Y. (2003). Local influence in principal component analysis: relationship between the local influence and influence function approaches revisited. Computational Statistics and Data Analysis, 44, 143–160.
Thomas, W., Cook, R. D. (1989). Assessing influence on regression coefficients in generalized linear models. Biometrika, 76, 741–749.
Thomas, W., Cook, R. D. (1990). Assessing influence on predictions in generalized linear models. Technometrics, 32, 59–65.
Tsai, C. H., Wu, X. (1992). Assessing local influence in linear regression models with first-order autoregressive or heteroscedastic error structure. Statistics and Probability Letters, 14, 247–252.
Turkan, S., Toktamis, O. (2012). Detection of influential observations in ridge regression and modified ridge regression. Model Assisted Statistics and Applications, 7, 91–97.
Ullah, M. A., Pasha, G. R. (2009). The origin and developments of influence measures in regression. Pakistan Journal of Statistics, 25, 295–307.
Walker, E., Birch, J. B. (1988). Influence measures in ridge regression. Technometrics, 30, 221–227.
Weissfeld, L. A., Schneider, H. (1990). Influence diagnostics for the Weibull model fit to censored data. Statistics and Probability Letters, 9, 67–73.
Welsch, R.E. (1982). Influence functions and regression diagnostics. In: R. Launer, A. Siegel (Eds.), Modern Data Analysis. New York: Academic Press.

Xie, F. C., Wei, B. C. (2009). Diagnostics for generalized Poisson regression models with errors in variables. Journal of Statistical Computation and Simulation, 79, 909–922.
Yang, H., Chang, X., Liu, D. (2009). Improvement of the Liu estimator in weighted mixed regression. Communications in Statistics-Theory and Methods, 38(2), 285–292.
Zewotir, T., Galpin, J. S. (2005). Influence diagnostics for linear mixed models. Journal of Data Science, 3, 153–177.
Zhao, Y., Lee, A. H. (1998). Influence diagnostics for simultaneous equations models. Australian and New Zealand Journal of Statistics, 40, 345–358.

Appendix

Appendix A

Published/Submitted Research Work from Ph.D. Thesis

1. Kashif, M., Amanullah, M., Aslam, M. (2018). Pena's statistic for the Liu regression. Journal of Statistical Computation and Simulation, 88(13), 2473–2488. (Published article is attached.)

2. Kashif, M., Amanullah, M., Aslam, M. (2019). Influential diagnostics with Pena's statistic for the modified ridge regression. Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2019.1634204 (online published).

3. Kashif, M., Amanullah, M., Aslam, M. (2019). Pena's statistic for the improved Liu estimator. Kuwait Journal of Science (submitted).

Appendix B

Data Sets

Table B.1: Longley Data (Longley, 1967)

Obs.No.  X1     X2      X3    X4    X5      X6  Time  Y
1        83     234289  2356  1590  107608   1  1947  60323
2        88.5   259426  2325  1456  108632   2  1948  61122
3        88.2   258054  3682  1616  109773   3  1949  60171
4        89.5   284599  3351  1650  110929   4  1950  61187
5        96.2   328975  2099  3099  112075   5  1951  63221
6        98.1   346999  1932  3594  113270   6  1952  63639
7        99     365385  1870  3547  115094   7  1953  64989
8        100    363112  3578  3350  116219   8  1954  63761
9        101.2  397469  2904  3048  117388   9  1955  66019
10       104.6  419180  2822  2857  118734  10  1956  67857
11       108.4  442769  2936  2798  120445  11  1957  68169
12       110.8  444546  4681  2637  121950  12  1958  66513
13       112.6  482704  3813  2552  123366  13  1959  68655
14       114.2  502601  3931  2514  125368  14  1960  69564
15       115.7  518173  4806  2572  127852  15  1961  69331
16       116.9  554894  4007  2827  130081  16  1962  70551

Table B.2: Tobacco Data (Myers, 1986)

Obs.No.  X1    X2   X3     X4     Y
1        5.5   4    9.55   13.25  527.91
2        6.2   4.3  11.1   15.32  518.29
3        7.7   5.2  12.84  17.41  549.56
4        8.5   5.3  13.32  18.08  738.06
5        11    6.3  17.84  24.16  704.82
6        11.5  6.5  18.57  24.29  697.94
7        13    7.2  21.96  27.29  826.86
8        15    7.6  25.87  31.32  998.18
9        16.2  7.8  26.82  34.62  1040.22
10       16.9  8.7  27.89  36.03  1040.46
11       14.1  7.2  23.99  28.48  803.26
12       17.5  8.8  29.61  36.88  1009.51
13       15    7.5  25.8   31.41  916.44
14       6.3   4.8  11.49  15.59  394.23
15       9.2   5.4  14.68  19.69  583.2
16       11.5  6.5  19.1   25.4   744.81
17       12    7    19.6   26.39  825.93
18       16.8  8.4  27.16  36.49  1070.88
19       14.2  7.4  23.95  28.84  840.91
20       17.1  8.3  27.94  35.4   991.58
21       11.9  6.9  19.56  26.1   767.4
22       13.1  7.1  22.05  27.4   807.18
23       14.3  7.5  23.91  29.03  857.15
24       7.2   5    24.16  16.13  526.05
25       6.7   4.9  11.91  15.98  495.89
26       5.6   4.1  9.56   13.34  476.38
27       7.1   5    11.98  16.09  520.82
28       17    8    27.77  35.45  1066.99
29       16.1  7.6  26.99  34.16  1020.25
30       6.3   4.6  11.26  15.52  494.59
