Testing for Spatial Group-Wise Heteroskedasticity. A Specification Scan Test Procedure
Authors and contact person's e-mail: Coro Chasco (Universidad Autónoma de Madrid), Julie Le Gallo (Université de Franche-Comté), Fernando A. López (Universidad Politécnica de Cartagena)

Department: Métodos Cuantitativos e Informáticos
University: Universidad Politécnica de Cartagena

Abstract: Spatial heterogeneity is an important topic in the modelling of regional economies. In regression models, spatial heterogeneity can be reflected by varying coefficients (structural instability) and/or by error variances that vary across blocks of observations, which is called Spatial Group-Wise Heteroskedasticity. Spatial autocorrelation implies local clustering in the values of the variable and has been extensively studied in the literature. Unfortunately, testing for Spatial Group-Wise Heteroskedasticity is a less developed field. In this paper we present a novel and powerful procedure for detecting Spatial Group-Wise Heteroskedasticity based on the Scan methodology.

Keywords: Spatial Group-Wise Heteroskedasticity; Scan tests; Permutational approach; Monte Carlo.

JEL Classification: C21; C50; R15

1. Introduction

Spatial heterogeneity means that the behaviour of a spatial process is not uniform over space. We can think of different sources of instability: (i) in the mean, (ii) in the variance, or (iii) in both moments. Mean instability implies local clustering in the values of the variable, a topic that has been studied extensively in the literature (see Lesage and Pace, 2009, for a recent overview). In other cases, the source of instability is the variance. We are interested in the case of Spatial Group-Wise Heteroskedasticity, which means that the variability of the spatial data is systematically higher in some areas than in others.
Obviously, both sources of instability may also occur simultaneously; we do not focus on this case.

There is a huge literature on spatial dependence but, unfortunately, the detection and modelling of Spatial Group-Wise Heteroskedasticity (SGWH from now on) is less developed. However, we suspect that SGWH is a frequent phenomenon in real data, where it causes serious inference problems. There are several heteroskedasticity tests, usually based on a likelihood approach, that can be adapted to check for SGWH (the GQ test of Goldfeld and Quandt, 1965, and the BP test of Breusch and Pagan, 1980, are obvious candidates). Unfortunately, these tests generally need a priori information about the spatial structure present in the data, which the researcher must supply. This type of information is not always available. In the same spirit, Kelejian and Robinson (1998) introduce a test for spatial heteroskedasticity that assumes the variance can be modelled using some regressors, which must be identified beforehand.

More recently, Ord and Getis (2012) consider the problem of local instability in the variance and introduce a new statistic, called $H_i$. The aim of the local $H_i$ is to identify the limits of the area where the variance changes to another value. The authors draw attention to the lack of papers examining the spatial structure of the variance (p. 530): ‘Spatial statistics’ cluster identification is now common to many fields. (…However) these studies have focused attention upon local means, to the extent that variability is considered at all it is typically assumed that the process has a constant variance (i.e., that it is homoscedastic). A moment’s thought indicates that such an assumption could overlook important information’.

Our contribution tries to fill this gap by developing a flexible and powerful statistic, based on the Scan methodology (Kulldorff et al., 1995, 2009), to detect group-wise heteroskedasticity.
Our procedure systematically explores the entire spatial surface, looking for groups of connected regions where the difference between the variability inside and outside the group is relevant. The origin of this proposal can be traced back to Openshaw et al. (1987) and the so-called Geographical Analysis Machine. The inference framework is computationally intensive because it is based on permutational bootstrapping.

The paper is organized as follows. Section 2 introduces some basic results from the Scan methodology, including our proposal to detect SGWH. The design of a Monte Carlo experiment is presented in Section 3, together with the main results on estimated size and power. Section 4 focuses on what we call accuracy, that is, the ability to identify exactly the location and composition of the clusters of heteroskedasticity. The main conclusions appear in Section 5.

2. A Scan Test for Spatial Group-Wise Heteroskedasticity (SGWH)

This section introduces a technique to detect the presence of SGWH in a spatial data set. The proposal has two objectives: (i) to test the null hypothesis of homoskedasticity and, if the null is rejected, (ii) to identify the spatially linked points that share the same variance. Let $\{x_i\}$ be a spatial process, where $i = 1, \ldots, R$ indexes a set of spatial coordinates. We are interested in testing the following hypothesis (which implicitly assumes normality and spatial independence):

$$H_0: x_i \sim \text{i.i.d. } N(\mu; \sigma^2) \quad \forall i \tag{1}$$

The alternative hypothesis states that there is a group of connected observations, $Z$, where the variance is different, with $\bar{Z}$ denoting the observations outside $Z$:

$$H_A: \begin{cases} x_i \sim \text{i.i.d. } N(\mu; \sigma^2_Z) & \text{for } i \in Z \\ x_i \sim \text{i.i.d. } N(\mu; \sigma^2_{\bar{Z}}) & \text{for } i \notin Z \end{cases} \tag{2}$$

To apply the Scan methodology, we need the likelihood function under both the null and the alternative hypotheses.
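The log-likelihood function under the null hypothesis is simply:

$$l(H_0) = \ln L_0(x; \mu, \sigma^2) = -R \ln\sqrt{2\pi} - R \ln\sigma - \sum_{i=1}^{R} \frac{(x_i - \mu)^2}{2\sigma^2} \tag{3}$$

The maximum likelihood estimates of the mean and variance are:

$$\hat{\mu}_{H_0} = \bar{x} = \frac{\sum_{i=1}^{R} x_i}{R}, \qquad \hat{\sigma}^2_{H_0} = \frac{\sum_{i=1}^{R} (x_i - \hat{\mu}_{H_0})^2}{R} \tag{4}$$

which give the following value of the log-likelihood function:

$$l(H_0) = -\frac{R}{2}\left[\ln\left(2\pi\hat{\sigma}^2_{H_0}\right) + 1\right] \tag{5}$$

Under the alternative hypothesis, the log-likelihood is:

$$l^{I}(H_A) = \ln L_A(x; \mu, \sigma^2_Z, \sigma^2_{\bar{Z}}) = -R \ln\sqrt{2\pi} - R_Z \ln\sigma_Z - (R - R_Z)\ln\sigma_{\bar{Z}} - \sum_{i \in Z} \frac{(x_i - \mu)^2}{2\sigma^2_Z} - \sum_{i \notin Z} \frac{(x_i - \mu)^2}{2\sigma^2_{\bar{Z}}} \tag{6}$$

where $R_Z$ is the number of observations in set $Z$ and $\bar{Z}$ is its complement. The maximum likelihood estimates of the mean and variance for this case are:

$$\hat{\mu}_{H_A} = \bar{x} = \frac{\sum_{i=1}^{R} x_i}{R}, \qquad \hat{\sigma}^2_{H_A}(Z) = \frac{\sum_{i \in Z} (x_i - \hat{\mu}_{H_A})^2}{R_Z}, \qquad \hat{\sigma}^2_{H_A}(\bar{Z}) = \frac{\sum_{i \notin Z} (x_i - \hat{\mu}_{H_A})^2}{R - R_Z} \tag{7}$$

The value of the log-likelihood function at this point is:

$$l^{I}(H_A) = -\frac{R}{2}\left[\ln(2\pi) + 1\right] - \frac{R_Z}{2}\ln\hat{\sigma}^2_{H_A}(Z) - \frac{R - R_Z}{2}\ln\hat{\sigma}^2_{H_A}(\bar{Z}) \tag{8}$$

Finally, subtracting (5) from (8), the Scan statistic for the assumption of equal variances can be written as:

$$\text{Scan}^{I} = \max_{Z \in \Theta}\left[l^{I}(H_A) - l(H_0)\right] = \max_{Z \in \Theta}\left\{\frac{R}{2}\ln\frac{\hat{\sigma}^2_{H_0}}{\hat{\sigma}^2_{H_A}(\bar{Z})} + \frac{R_Z}{2}\ln\frac{\hat{\sigma}^2_{H_A}(\bar{Z})}{\hat{\sigma}^2_{H_A}(Z)}\right\} \tag{9}$$

$\Theta$ is the set of connected regions $Z$, called windows, over which the Scan statistic is computed. The size and shape of the windows must be defined in advance by the researcher, with the aim of striking a good balance between computational cost and effectiveness. For example, evaluating elliptical windows is more time consuming but provides greater flexibility. The set $Z$ where the Scan test attains its maximum value is usually called the Most Likely Cluster, MLC.

The test can be easily extended to the case of simultaneous instability in the mean and in the variance.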
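As an illustration of how the statistic in (9) might be evaluated, the following Python sketch computes $l^{I}(H_A) - l(H_0)$ for each candidate window, using the estimates in (4) and (7), and returns the maximum together with the window that attains it. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `scan_variance_stat` and `knn_windows` are hypothetical, and the window set (each point together with its nearest neighbours) is only one possible choice of $\Theta$, since the text leaves the size and shape of the windows to the researcher.

```python
import numpy as np

def scan_variance_stat(x, windows):
    """Return (max statistic, best window) for the variance-only Scan test.

    x       : 1-D array with the R observations.
    windows : iterable of integer index arrays, each a candidate set Z
              with 0 < R_Z < R (how Z is built is left to the user).
    """
    x = np.asarray(x, dtype=float)
    R = x.size
    dev2 = (x - x.mean()) ** 2        # squared deviations from the common mean
    s2_h0 = dev2.mean()               # null ML variance, as in (4)
    best_stat, best_window = -np.inf, None
    for z in windows:
        inside = np.zeros(R, dtype=bool)
        inside[np.asarray(z)] = True
        r_z = int(inside.sum())
        s2_in = dev2[inside].mean()   # variance inside Z, as in (7)
        s2_out = dev2[~inside].mean() # variance outside Z, as in (7)
        # l(H_A) - l(H_0), obtained by subtracting (5) from (8)
        stat = 0.5 * (R * np.log(s2_h0)
                      - r_z * np.log(s2_in)
                      - (R - r_z) * np.log(s2_out))
        if stat > best_stat:
            best_stat, best_window = stat, np.flatnonzero(inside)
    return best_stat, best_window

def knn_windows(coords, max_size):
    """Hypothetical window set: each point plus its nearest neighbours.

    Produces windows of sizes 2..max_size; max_size should be smaller
    than the number of points so that the outside of Z is never empty.
    """
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)  # column 0 is the point itself
    return [order[i, :k]
            for i in range(len(coords))
            for k in range(2, max_size + 1)]
```

A call such as `scan_variance_stat(x, knn_windows(coords, 10))` would then return the value of the statistic and the indices of the Most Likely Cluster. Because the null variance estimate is a weighted average of the inside and outside estimates, the statistic is non-negative for every window.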
In this case, the null hypothesis continues to be that of (1), whereas the alternative hypothesis now corresponds to:

$$H_A: \begin{cases} x_i \sim \text{i.i.d. } N(\mu_Z; \sigma^2_Z) & \text{for } i \in Z \\ x_i \sim \text{i.i.d. } N(\mu_{\bar{Z}}; \sigma^2_{\bar{Z}}) & \text{for } i \notin Z \end{cases} \tag{10}$$

The log-likelihood function is:

$$l^{II}(H_A) = \ln L_A(x; \mu_Z, \mu_{\bar{Z}}, \sigma^2_Z, \sigma^2_{\bar{Z}}) = -R \ln\sqrt{2\pi} - R_Z \ln\sigma_Z - (R - R_Z)\ln\sigma_{\bar{Z}} - \sum_{i \in Z} \frac{(x_i - \mu_Z)^2}{2\sigma^2_Z} - \sum_{i \notin Z} \frac{(x_i - \mu_{\bar{Z}})^2}{2\sigma^2_{\bar{Z}}} \tag{11}$$

The maximum likelihood estimates of the means and variances are:

$$\hat{\mu}_{H_A}(Z) = \frac{\sum_{i \in Z} x_i}{R_Z}, \qquad \hat{\mu}_{H_A}(\bar{Z}) = \frac{\sum_{i \notin Z} x_i}{R - R_Z}, \qquad \hat{\sigma}^2_{H_A}(Z) = \frac{\sum_{i \in Z} \left(x_i - \hat{\mu}_{H_A}(Z)\right)^2}{R_Z}, \qquad \hat{\sigma}^2_{H_A}(\bar{Z}) = \frac{\sum_{i \notin Z} \left(x_i - \hat{\mu}_{H_A}(\bar{Z})\right)^2}{R - R_Z} \tag{12}$$

Consequently, we define the Scan statistic for the alternative of (10), which is very similar to that obtained in (9) but uses the variance estimates of (12):

$$\text{Scan}^{II} = \max_{Z \in \Theta}\left[l^{II}(H_A) - l(H_0)\right] = \max_{Z \in \Theta}\left\{\frac{R}{2}\ln\frac{\hat{\sigma}^2_{H_0}}{\hat{\sigma}^2_{H_A}(\bar{Z})} + \frac{R_Z}{2}\ln\frac{\hat{\sigma}^2_{H_A}(\bar{Z})}{\hat{\sigma}^2_{H_A}(Z)}\right\} \tag{13}$$

Note that the two statistics in (9) and (13) correspond to classical likelihood ratios, for SGWH in the first case and for SGWH plus a break in the mean in the second (Rao, 1971). To obtain a standard Likelihood Ratio test, LR, the researcher must select in advance the area $Z$ where the test is calculated and the significance level of the test, according to the corresponding probability distribution. In turn, this distribution is usually unknown, so some distributional assumption is needed. Both decisions may undermine confidence in the LR test, especially in relation to size (see, for example, Engle, 1984, or Drton and Williams, 2011).

The Scan tests suffer from a similar weakness in the sense that, in general, the distribution function of the statistics in (9) and (13) is unknown. Our intention is to carry out inference for the Scan tests in a more robust, although more computationally demanding, permutational framework, which avoids data mining and the assumption of normality. Hence, a p-value is obtained through a Monte Carlo testing procedure (Dwass, 1957), by comparing the value of the Scan statistic for the real data set with a large sequence of values corresponding to purely random data sets, generated according to the null hypothesis of the test. The procedure is as follows:

1. Compute the Scan statistic for the original sample $\{x_i\}_{i \in S}$, where $S$ is a set of spatial coordinates, $S = \{(x_{ci}; y_{ci});\ i = 1, 2, \ldots, R\}$.
2. Relabel the set of locations by randomly drawing, without replacement, the spatial coordinates; $\{x_i^r\}_{i \in S}$ is the new, permuted, series, where $r$ is the permutation index.
3. Compute the Scan statistic for each permuted sample $\{x_i^r\}_{i \in S}$, denoted $\text{Scan}^r$.
4. Repeat steps 2 and 3 $(B-1)$ times to obtain $B-1$ realizations of the permuted Scan statistic, $\{\text{Scan}^r\}_{r=1}^{B-1}$.
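The permutation steps listed so far can be sketched in Python as follows. This is an illustrative sketch only, not the authors' code: the names `max_scan_stat` and `permutation_test` are hypothetical, the windows are assumed to be given as fixed sets of location indices (so shuffling the observed values over the locations is equivalent to relabelling the locations), and the last line uses the usual rank-based Monte Carlo p-value, which is an assumption about how the permuted values are compared with the observed one.

```python
import numpy as np

def max_scan_stat(x, windows):
    """Variance-only Scan statistic: max over windows of l(H_A) - l(H_0)."""
    x = np.asarray(x, dtype=float)
    R = x.size
    dev2 = (x - x.mean()) ** 2            # deviations from the common mean
    log_s2_h0 = np.log(dev2.mean())       # log of the null ML variance
    best = -np.inf
    for z in windows:
        inside = np.zeros(R, dtype=bool)
        inside[np.asarray(z)] = True
        r_z = int(inside.sum())
        stat = 0.5 * (R * log_s2_h0
                      - r_z * np.log(dev2[inside].mean())
                      - (R - r_z) * np.log(dev2[~inside].mean()))
        best = max(best, stat)
    return float(best)

def permutation_test(x, windows, B=999, seed=None):
    """Return (observed statistic, B-1 permuted statistics, p-value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Step 1: statistic for the original sample
    observed = max_scan_stat(x, windows)
    # Steps 2-4: shuffle values over locations and recompute, B-1 times
    permuted = np.array([max_scan_stat(rng.permutation(x), windows)
                         for _ in range(B - 1)])
    # Assumed final step: rank of the observed value among all B values
    p_value = (1 + np.count_nonzero(permuted >= observed)) / B
    return observed, permuted, p_value
```

With this convention the smallest attainable p-value is $1/B$, so $B$ should be chosen large enough for the significance level of interest (for instance, $B = 999$ permits p-values as small as about 0.001).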