Comparison of Different Tests for Detecting Heteroscedasticity in Datasets
Anale. Seria Informatică. Vol. XVIII fasc. 2 – 2020
Annals. Computer Science Series. 18th Tome 2nd Fasc. – 2020

Obabire Akinleye A.¹, Agboola Julius O.¹, Ajao Isaac O.¹ and Adegbilero-Iwari Oluwaseun E.²
¹ Department of Mathematics & Statistics, The Federal Polytechnic, Ado-Ekiti, Nigeria
² Department of Community Medicine, Afe Babalola University, Ado-Ekiti, Nigeria
Corresponding author: [email protected], [email protected], [email protected], [email protected]

ABSTRACT: Heteroscedasticity arises mostly from errors in variables, incorrect data transformation, incorrect functional form, omission of important variables, an insufficiently detailed model, outliers, and skewness in the distribution of one or more independent variables in the model. All analyses were carried out in the R statistical package using the lmtest, zoo and base packages. Five heteroscedasticity tests were selected (the Park test, Glejser test, Breusch-Pagan test, White test and Goldfeld-Quandt test) and applied to simulated datasets of sample sizes 20, 30, 40, 50, 60, 70, 80, 90 and 100 at three levels of heteroscedasticity: low (sigma = 0.5), mild (sigma = 1.0) and high (sigma = 2.0). The significance criterion was alpha = 0.05. Each test was repeated 1000 times and the percentage of rejections was computed over the 1000 trials. The Glejser test's average empirical type I error rate was higher than expected, while the Goldfeld-Quandt test had the lowest power. Overall, the Glejser test showed the highest capacity to detect heteroscedasticity on the simulated datasets.

KEYWORDS: Park test, Glejser test, Breusch-Pagan test, White test, Goldfeld-Quandt test

1. INTRODUCTION

One of the assumptions of the classical linear regression model states that the disturbances uᵢ featuring in the population regression function all have the same variance; that is, they are homoscedastic. In other words, the variance of the residuals should not increase with the fitted values of the response variable (SelvaPrabhakara, 2016). When this assumption fails, the consequence is what is termed heteroscedasticity, which can be expressed as:

E(uᵢ²) = σᵢ²,  i = 1, 2, …, n

There are several reasons why the variance of uᵢ may vary. Studies of error-learning models show that as people learn over time, their error rates fall, so that σᵢ² is expected to decrease; practically, as one increases one's number of typing hours, the rate of errors committed decreases, and so does its variance. Also, as income grows, people have more choices about how to dispose of their income; consequently, σᵢ² is likely to increase with income. Likewise, a company with a large profit will pay more dividends to its shareholders than a newly established company. Where data-collecting techniques improve, σᵢ² is likely to decrease: banks with good data-processing equipment are less prone to errors when processing the monthly statements of account of their customers than banks without such facilities.

Heteroscedasticity can also arise from outliers, observations that differ markedly from the rest of the sample, whether much too large or much too small in relation to the remaining observations. It can further result from violation of the assumption that the regression model is correctly specified, or from skewness in the distribution of one or more regressors included in the model (for instance, a right-skewed distribution). Heteroscedasticity may also stem from incorrect data transformation or an incorrect functional form. It is not a property necessarily restricted to cross-sectional data; it can also occur in time series data where an external shock or change in circumstances creates uncertainty about y [7]. Cross-sectional data deal with members of a population at a particular point in time; such members may differ in type, size and composition, whereas time series data are similar in order of magnitude [6]. More often than not, heteroscedasticity arises when important variables are omitted from the model or superfluous variables are included (model specification error).
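To make the simulation design described in the abstract concrete, the sketch below generates data from a simple linear model whose error standard deviation grows with the regressor, then records how often a heteroscedasticity test rejects over repeated trials. It is a minimal illustration, not the authors' code: the data-generating model y = 2 + 0.5x + u, the regressor range, and the use of lmtest's bptest (Breusch-Pagan) as the representative test are all assumptions made here.

    # Sketch: empirical rejection rate of one heteroscedasticity test,
    # following the paper's stated design (1000 trials, alpha = 0.05,
    # sigma in {0.5, 1.0, 2.0}).  Assumed: sd(u) = sigma * x, so larger
    # sigma means stronger heteroscedasticity.
    library(lmtest)  # loads zoo as a dependency

    reject_rate <- function(n, sigma, trials = 1000, alpha = 0.05) {
      rejections <- 0
      for (t in seq_len(trials)) {
        x <- runif(n, 1, 10)
        u <- rnorm(n, mean = 0, sd = sigma * x)  # heteroscedastic errors
        y <- 2 + 0.5 * x + u
        if (bptest(lm(y ~ x))$p.value < alpha) rejections <- rejections + 1
      }
      rejections / trials  # proportion of rejections over the trials
    }

    for (n in c(20, 50, 100)) {
      for (sigma in c(0.5, 1.0, 2.0)) {
        cat("n =", n, " sigma =", sigma,
            " rejection rate =", reject_rate(n, sigma), "\n")
      }
    }

Read this way, the rejection rate approximates the power of the test at each sample size and sigma level, which is how the percentages over 1000 trials in the study can be interpreted.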
1.1 Model specification

Applied econometrics rests on understanding, intuition and skill. Users of economic data must be able to put forward models that other people can rely on [10]. One assumption of the classical linear regression model states that the regression model used in the analysis is correctly specified; otherwise, a problem called model specification error, or model specification bias, is encountered.

Users of economic data should therefore:
a. have a criterion for choosing a model for empirical analysis;
b. recognize the types of model specification error that can be encountered in practice;
c. bear in mind the consequences of specification error;
d. know how to detect specification error and which diagnostic tools can be used;
e. know the remedies for specification error and the benefits of detecting it;
f. know how to judge the strength of competing models.

1.2 How to identify heteroscedasticity

The residuals from a linear regression,

e = Y − Ŷ = Y − Xβ̂,

are used in place of the unobservable errors ε [3]. Residuals are used to examine how the variance behaves across a dataset. Residual plots, in which the residuals eᵢ are plotted on the y-axis against the fitted values ŷᵢ on the x-axis, are the most commonly used tool [2]. When heteroscedasticity is present, especially when the variance is proportional to a power of the mean, the residuals fan out.

This may not be the best method for detecting heteroscedasticity, as the plot is difficult to interpret, particularly when the positive and negative residuals do not exhibit the same general pattern [3]. Cook and Weisberg suggested plotting the squared residuals eᵢ² to account for this; a wedge shape bounded below by 0 would then indicate heteroscedasticity. However, as [2] pointed out, squaring residuals that are large in magnitude creates scaling problems, resulting in a plot where patterns in the rest of the residuals are difficult to see. They instead advocate plotting the absolute residuals: there is then no need to distinguish positive and negative patterns, or to worry about scaling issues. A wedge shape in the absolute residuals likewise indicates heteroscedasticity in which the variance increases with the mean. This is the plotting method used for identifying heteroscedasticity in this study.

2. COMMON HETEROSCEDASTICITY PATTERNS

There are several common assumptions about the pattern of heteroscedasticity, including the following.

Pattern 1:
The error variance is proportional to Xᵢ², i.e.

E(uᵢ²) = σ²Xᵢ²

If, as graphical inspection or the Park and Glejser approaches may reveal, the variance of uᵢ is proportional to the square of the explanatory variable Xᵢ, the original model

E(Yᵢ) = β₁ + β₂Xᵢ        (2.0)

can be transformed by dividing through by Xᵢ:

Yᵢ/Xᵢ = β₁/Xᵢ + β₂ + uᵢ/Xᵢ        (2.1)
      = β₁(1/Xᵢ) + β₂ + vᵢ        (2.2)

where vᵢ is the transformed disturbance term, equal to uᵢ/Xᵢ. It is now easy to verify that

E(vᵢ²) = E[(uᵢ/Xᵢ)²] = (1/Xᵢ²)E(uᵢ²)        (2.3)
       = σ²  (using E(uᵢ²) = σ²Xᵢ²)        (2.4)

Note that in the transformed regression the intercept term β₂ is the slope coefficient of the original equation, and the slope coefficient β₁ is the intercept term of the original model. Therefore, to get back to the original model, the estimated equation 2.2 has to be multiplied by Xᵢ.
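As an illustration of Pattern 1, the following sketch fits the transformed regression of equations 2.1 and 2.2 on simulated data. The data-generating values (β₁ = 2, β₂ = 0.5, and sd(uᵢ) = 0.5Xᵢ so that E(uᵢ²) is proportional to Xᵢ²) are assumptions for the example, not values from the paper.

    # Sketch of the Pattern 1 transformation (equations 2.0-2.2).
    set.seed(1)
    x <- runif(100, 1, 10)
    u <- rnorm(100, mean = 0, sd = 0.5 * x)  # E(u^2) = 0.25 * x^2 (Pattern 1)
    y <- 2 + 0.5 * x + u                     # beta1 = 2, beta2 = 0.5

    fit <- lm(I(y / x) ~ I(1 / x))  # Y/X = beta2 + beta1*(1/X) + v  (eq. 2.2)
    coef(fit)  # (Intercept) estimates beta2; the I(1/x) term estimates beta1

Equivalently, one can fit lm(y ~ x, weights = 1/x^2): dividing the model through by Xᵢ is the same as weighted least squares with weights 1/Xᵢ².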
Pattern 2:
The error variance is proportional to Xᵢ (the square root transformation):

E(uᵢ²) = σ²Xᵢ        (2.5)

If it is believed that the variance of uᵢ, instead of being proportional to the squared Xᵢ, is proportional to Xᵢ itself, then the original model can be transformed as:

Yᵢ/√Xᵢ = β₁/√Xᵢ + β₂√Xᵢ + uᵢ/√Xᵢ        (2.6)
       = β₁(1/√Xᵢ) + β₂√Xᵢ + vᵢ        (2.7)

where vᵢ = uᵢ/√Xᵢ and Xᵢ > 0.

Given pattern 2, one can readily verify that E(vᵢ²) = σ², a homoscedastic situation. One may therefore proceed to apply ordinary least squares to equation 2.7, regressing Yᵢ/√Xᵢ on 1/√Xᵢ and √Xᵢ. It is worthy of note that the transformed model has no intercept term, so a regression-through-the-origin model has to be used to estimate β₁ and β₂. Having estimated equation 2.7, one can get back to the original model simply by multiplying it by √Xᵢ.

Pattern 3:
The error variance is proportional to the square of the mean value of Y:

E(uᵢ²) = σ²[E(Yᵢ)]²        (2.8)

Equation 2.8 asserts that the variance of uᵢ is proportional to the square of the expected value of Yᵢ, where

E(Yᵢ) = β₁ + β₂Xᵢ        (2.9)

Now, the original equation can be transformed by dividing through by E(Yᵢ):

Yᵢ/E(Yᵢ) = β₁/E(Yᵢ) + β₂Xᵢ/E(Yᵢ) + uᵢ/E(Yᵢ)
         = β₁(1/E(Yᵢ)) + β₂(Xᵢ/E(Yᵢ)) + vᵢ

where vᵢ = uᵢ/E(Yᵢ).

3. HETEROSCEDASTICITY TESTS

Park Test:
Park assumed in his work that the variance of the error term is proportional to the square of the independent variable [8]. The test also gives formal backing to the graphical method by suggesting that σᵢ² is some function of the explanatory variable Xᵢ.
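The Park test is commonly run as a two-stage auxiliary regression: fit the model by OLS, then regress the log of the squared residuals on the log of the explanatory variable, and take a significant slope as evidence of heteroscedasticity. The sketch below follows that standard recipe on the same assumed simulated setup as earlier; it is an illustration of the procedure, not the paper's implementation.

    # Sketch of the Park test as a two-stage auxiliary regression.
    set.seed(1)
    x <- runif(100, 1, 10)
    y <- 2 + 0.5 * x + rnorm(100, sd = 0.5 * x)

    e <- resid(lm(y ~ x))                   # stage 1: OLS residuals
    park <- lm(log(e^2) ~ log(x))           # stage 2: ln(e^2) = a + b*ln(x) + v
    summary(park)$coefficients["log(x)", ]  # significant slope b flags heteroscedasticity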