Geographical Analysis ISSN 0016-7363
Loglinear Residual Tests of Moran’s I Autocorrelation and their Applications to Kentucky Breast Cancer Data
Ge Lin,1 Tonglin Zhang2
1Department of Geology and Geography, West Virginia University, Morgantown, WV, 2Department of Statistics, Purdue University, West Lafayette, IN
This article bridges the permutation test of Moran’s I to the residuals of a loglinear model under the asymptotic normality assumption. It provides versions of Moran’s I based on Pearson residuals ($I_{PR}$) and deviance residuals ($I_{DR}$) so that they can be used to test for spatial clustering while accounting for potential covariates and heterogeneous population sizes. Our simulations showed that both $I_{PR}$ and $I_{DR}$ are effective in accounting for heterogeneous population sizes. The tests based on $I_{PR}$ and $I_{DR}$ are applied to a set of log-rate models for early-stage and late-stage breast cancer with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables can sufficiently explain the spatial clustering of early-stage breast carcinomas, but not that of late-stage carcinomas. For this reason, we used local spatial association terms and located four late-stage breast cancer clusters that could not be explained. The results also confirmed our expectation that a high screening level would be associated with a high incidence rate of early-stage disease, which in turn would reduce late-stage incidence rates.
Correspondence: Ge Lin, Department of Geology and Geography, West Virginia University, Morgantown, WV 26506 e-mail: [email protected]
Submitted: August 22, 2005. Revised version accepted: April 04, 2006.
Geographical Analysis 39 (2007) 293–310 © 2007 The Ohio State University
Introduction
Linear or loglinear spatial regression models are common in spatial epidemiology (Best et al. 2000). A set of ecological variables is often associated with disease rates or counts, and after a final model is derived, residuals can be visually inspected on a map for spatial clusters. For a linear model, a residual test of Moran’s I for spatial autocorrelation can also be performed to detect spatial clustering in the unexplained regression errors. However, there is no corresponding spatial residual test of clustering for loglinear or Poisson regressions on count data, which was the challenge for the current study. The study was motivated by our inquiry into the spatial patterns of breast cancer incidence in Kentucky counties as they relate to the development stage of disease at diagnosis. Breast cancer staging at diagnosis is known to be associated with socioeconomic conditions, mammography screening services, and other variables (Yabroff and Gordis 2003; Barry and Breen 2005). As socioeconomic variables are often spatially autocorrelated (e.g., poor areas tend to be clustered), we expect clustering of breast cancer to occur, at least for the early-stage incidence rates. If there is no significant environmental cause of breast cancer, the clustering tendency should disappear once we introduce area socioeconomic variables.
One way to test for the existence of spatial clustering is to set up a spatial autocorrelation test, such as Moran’s I, for Gaussian or continuous data by using the permutation test of residuals for Moran’s I in a linear regression (Cliff and Ord 1981). Converting incidence to rate, however, is often less appealing than retaining the original counts in spatial data analysis (Griffith and Haining 2006). In addition, Moran’s I test assumes that attribute values (e.g., disease prevalence) either have equal probability among all the geographic units or come from a single parent distribution. These assumptions are often violated in the permutation test of Moran’s I for disease data due to heterogeneous regional populations and large variation in sparsely populated areas (Besag and Newell 1991). Although there have been several extensions of Moran’s I to account for population heterogeneity (Oden 1995; Waldhor 1996; Assuncao and Reis 1999), none of them can include potential ecological covariates. For example, Oden proposes a test statistic $I_{pop}$ that applies regional population sizes to adjust Moran’s I. However, because of a minor modification in the null hypothesis, $I_{pop}$ is no longer comparable to the original Moran’s I (Assuncao and Reis 1999).
Consequently, $I_{pop}$ cannot be extended to evaluate covariates and spatial autocorrelation simultaneously.
A spatial logit association model can include potential explanatory variables and identify high-value and low-value clusters (Lin 2003; Zhang and Lin 2006). It does not, however, have a global measure of spatial clustering that would complement the modeling process for local spatial logit associations. Jacqmin-Gadda et al. (1997) propose a homogeneity score test of a generalized linear model that can also include potential explanatory variables in a correlation test. The test is based on residuals in generalized linear models, a design that its authors claimed to correspond to the permutation test of linear regression errors. However, as we will later show, the test does not adjust variance for heterogeneous population sizes. In addition, because its weight matrix is not necessarily spatially constructed, its null hypothesis is not necessarily spatial independence as one would assume when applying Moran’s I test. Consequently, it is not straightforward to use the score test in a generalized linear model that includes spatial correlation and heterogeneity.
The purpose of this article is to extend the permutation test of residuals of the Moran’s I autocorrelation to generalized linear models so that spatial analysts can directly test for spatial clustering while controlling for potential ecological covariates. While no one has proposed either deviance or Pearson residuals in the spatial statistics literature, Waller and Gotway (2004) point out a form close to Pearson residuals as a way to account for inflated variance in Moran’s I under heteroskedasticity. In this article, we demonstrate that permutation tests are applicable to Pearson or deviance residuals of loglinear models in the same way as in the traditional permutation test of residuals for Moran’s I. In the remaining sections of the article, we first review the permutation test of Moran’s I by using regression residuals and then reformulate it in the context of Poisson data by using the Pearson and deviance residuals of a loglinear model. We then evaluate their statistical properties under the null hypothesis of spatial independence in a series of simulated patterns and apply the Pearson residual Moran’s I test and deviance residual Moran’s I test to breast cancer incidence in Kentucky counties that include potential ecological covariates.
From linear to loglinear residual tests of Moran’s I
Let us consider a study area that has $m$ regions indexed by $i$. Let $z_i$ be the variable of interest in region $i$. Moran’s I (Moran 1950) is expressed as:
$$I = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}(z_i-\bar{z})(z_j-\bar{z})}{\left[\sum_{i=1}^{m}(z_i-\bar{z})^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]} \qquad (1)$$
where $\bar{z}=\sum_{i=1}^{m} z_i/m$ and $w_{ij}$ is an element of a spatial weight matrix $W$, equal to 1 if regions $i$ and $j$ are adjacent and 0 otherwise. Under the assumption of homogeneity, the moments of Moran’s I can be computed under the randomization assumption, which assumes that the observations are generated from a set of random permutations of the observed values. The significance of Moran’s I can be determined by the quantity $I_{std}=[I-E(I)]/\sqrt{\mathrm{Var}(I)}$ under the asymptotic normality assumption, where the respective mean and variance are obtained from the random permutation scheme suggested by Cliff and Ord (1981, p. 21). A significant and positive value of Moran’s I (i.e., $I_{std} > z_{\alpha/2}$) usually indicates a positive autocorrelation, such as the existence of either high-value or low-value clustering. A significant and negative value of Moran’s I (i.e., $I_{std} < -z_{\alpha/2}$) usually indicates a negative autocorrelation, such as a tendency toward the juxtaposition of high values with low values. If there is no spatial dependence, $I$ is often close to the mean value $-1/(m-1)$, which can be approximated by 0 if $m$ is large.
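The statistic in equation (1) and its standardized form can be sketched in a few lines of NumPy. The text cites the random permutation scheme of Cliff and Ord (1981) without printing the moment formulas, so the closed-form expressions below (the usual randomization moments written with the sums S0, S1, S2 and the sample kurtosis b2) are an assumption about what that scheme amounts to. The function names and the four-region toy data are hypothetical.

```python
import numpy as np

def morans_i(z, W):
    """Moran's I of equation (1) for values z and spatial weight matrix W."""
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float)
    d = z - z.mean()
    return (d @ W @ d) / ((d @ d / z.size) * W.sum())

def randomization_moments(z, W):
    """Mean and variance of I when the observed values are randomly permuted."""
    z = np.asarray(z, dtype=float)
    W = np.asarray(W, dtype=float)
    m = z.size
    d = z - z.mean()
    s0 = W.sum()
    s1 = 0.5 * ((W + W.T) ** 2).sum()
    s2 = ((W.sum(axis=0) + W.sum(axis=1)) ** 2).sum()
    b2 = np.mean(d ** 4) / np.mean(d ** 2) ** 2  # sample kurtosis of z
    mean = -1.0 / (m - 1)
    num = (m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
           - b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2))
    var = num / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - mean ** 2
    return mean, var

def standardized_i(z, W):
    """I_std = [I - E(I)] / sqrt(Var(I)) under the randomization assumption."""
    mean, var = randomization_moments(z, W)
    return (morans_i(z, W) - mean) / np.sqrt(var)

# Toy example (hypothetical data): four regions in a row, neighbors share an edge.
W_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
z_toy = np.array([1.0, 2.0, 3.0, 4.0])
i_obs = morans_i(z_toy, W_path)
i_mean, i_var = randomization_moments(z_toy, W_path)
i_std = standardized_i(z_toy, W_path)
```

With so few regions the normal approximation is of course not meaningful; the toy case only makes the arithmetic easy to check by hand.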
In order to account for ecological covariates, it is often suggested that $z_i$ be taken as the $i$th regional residual of a linear regression model when testing for spatial autocorrelation (Cliff and Ord 1981, p. 198). For data resulting purely from a random process, the calculation of Moran’s I in equation (1) based on the residuals is the same as the one based on the observed values. It can be concluded that $I_{std}$ is approximately $N(0, 1)$ as $m \to \infty$ under some regularity conditions (Sen 1976). Schmoyer (1994) also demonstrated the validity of a general form of permutation test based on the residuals of a linear model with independent and identically distributed errors. In Poisson models with heterogeneous population sizes, these justifications cannot be applied (Waldhor 1996; Assuncao and Reis 1999).
The residuals of loglinear models are well established in the statistical literature in a nonspatial context. We can extend them to a spatial context to test for spatial autocorrelation. In his synthesis of previous studies, Agresti (1990, p. 431) showed that the Pearson and log-likelihood (deviance) residuals of loglinear models are asymptotically multivariate normal with mean 0 and the variance–covariance matrix a projection matrix. This particular asymptotic form of the residuals is analogous to that of linear regression residuals. Following this reasoning, we can devise a log-rate model that closely resembles a linear regression model to account for potentially heterogeneous population sizes and ecological covariates. We apply the permutation test based on the asymptotic normality assumption, so that Moran’s I based on the residuals of a log-rate model is analogous to Moran’s I based on regression residuals. In the following, we specify the residuals of a log-rate model for the permutation test of Moran’s I.
Let $n_i$ be the observed count for a Poisson random variable $N_i$ at region $i$ ($i = 1, \ldots, m$), and let $\xi_i$ and $\theta_i$ be the $i$th regional population and relative risk, respectively. Then, the $N_i$ are assumed to be independently Poisson distributed, or $N_i \sim \mathrm{Poisson}(E_i\theta_i)$, where $E_i$ is the expected count for region $i$. The null hypothesis that all the relative risks $(\theta_1, \ldots, \theta_m)$ are 1 can be stated (Elliott et al. 2000, p. 132). Suppose that a set of explanatory variables $(x_{i,1}, \ldots, x_{i,k-1})$ is observed together with $n_i$. A log-rate model can then be expressed as
$$\log(\hat{n}_i/\xi_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1} \qquad (2)$$
In equation (2), $\hat{n}_i$ and $\log(\hat{n}_i/\xi_i)$ are, respectively, the estimated count and estimated log rate for region $i$, $\hat{\beta}_0$ is the estimate of the grand mean, and the other $\hat{\beta}$s are parameter estimates for the explanatory variables. Equation (2) is often estimated by moving $\log(\xi_i)$ to the right-hand side, so that it becomes the offset:
$$\log(\hat{n}_i) = \log(\xi_i) + \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1} \qquad (3)$$
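The offset form in equation (3) is what a Poisson regression routine fits in practice. As an illustration only, the following is a minimal iteratively reweighted least squares (IRLS) sketch in NumPy. It assumes the first column of the design matrix is the intercept, and the function name `fit_log_rate` and the six-region toy data are hypothetical, not taken from the article.

```python
import numpy as np

def fit_log_rate(n, xi, X, n_iter=50, tol=1e-10):
    """Fit log(n_hat_i) = log(xi_i) + X_i @ beta by IRLS (Poisson, log link).

    n  : observed counts
    xi : regional populations (their logs enter as the offset)
    X  : design matrix whose first column is assumed to be the intercept
    Returns (beta, n_hat).
    """
    n = np.asarray(n, dtype=float)
    xi = np.asarray(xi, dtype=float)
    X = np.asarray(X, dtype=float)
    offset = np.log(xi)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(n.sum() / xi.sum())  # start from the pooled log rate
    for _ in range(n_iter):
        eta = offset + X @ beta
        mu = np.exp(eta)
        # Working response with the offset removed; IRLS weights equal mu.
        z_work = (eta - offset) + (n - mu) / mu
        XtW = X.T * mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z_work)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, np.exp(offset + X @ beta)

# Toy data (hypothetical): six regions, one covariate.
n_toy = np.array([12, 7, 30, 18, 9, 25], dtype=float)
xi_toy = np.array([100, 80, 200, 150, 90, 160], dtype=float)
x1_toy = np.array([0.2, 0.1, 0.9, 0.5, 0.3, 0.8])
X_toy = np.column_stack([np.ones_like(x1_toy), x1_toy])
beta_hat, n_hat = fit_log_rate(n_toy, xi_toy, X_toy)

# With an intercept only, the fitted counts are xi_i times the pooled rate.
beta0_only, n_hat0 = fit_log_rate(n_toy, xi_toy, np.ones((n_toy.size, 1)))
```

At convergence the fitted counts satisfy the Poisson score equations, so the residuals `n_toy - n_hat` are orthogonal to every column of the design matrix.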
Based on these notations, the conventional Pearson residual for region $i$ is defined as
$$r_{i,p} = \frac{n_i-\hat{n}_i}{\hat{n}_i^{1/2}} \qquad (4)$$
and the conventional deviance residual (Agresti 1990, p. 452) for region $i$ is defined as
$$r_{i,d} = \mathrm{sign}(n_i-\hat{n}_i)\,\left\{2\left[n_i\log(n_i/\hat{n}_i) - n_i + \hat{n}_i\right]\right\}^{1/2} \qquad (5)$$
where $\mathrm{sign}(a)$ is 1 if $a > 0$, 0 if $a = 0$, and $-1$ if $a < 0$. We can test for spatial autocorrelation based on the residuals in equation (2) by replacing $z_i$ in (1) with either the Pearson residual in (4) or the deviance residual in (5). When $z_i$ is replaced with $r_{i,p}$, for example, Moran’s I becomes the Pearson residual Moran’s I, which we denote as $I_{PR}$ and which can be expressed as
$$I_{PR} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\left(\frac{n_i-\hat{n}_i}{\hat{n}_i^{1/2}} - \frac{1}{m}\sum_{l=1}^{m}\frac{n_l-\hat{n}_l}{\hat{n}_l^{1/2}}\right)\left(\frac{n_j-\hat{n}_j}{\hat{n}_j^{1/2}} - \frac{1}{m}\sum_{l=1}^{m}\frac{n_l-\hat{n}_l}{\hat{n}_l^{1/2}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{n_i-\hat{n}_i}{\hat{n}_i^{1/2}} - \frac{1}{m}\sum_{k=1}^{m}\frac{n_k-\hat{n}_k}{\hat{n}_k^{1/2}}\right)^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]} \qquad (6)$$
When $z_i$ is replaced with $r_{i,d}$, Moran’s I becomes the deviance or log-likelihood residual Moran’s I, which we denote as $I_{DR}$. To implement these tests, the parameters of the explanatory variables together with the residuals are first estimated in the model-fitting process. Then, $I_{PR}$ and $I_{DR}$ are calculated. Finally, the means and variances of $I_{PR}$ and $I_{DR}$ are computed according to the random permutation scheme (Cliff and Ord 1981). The P values of $I_{PR}$ and $I_{DR}$ can be computed based on the asymptotic normality assumption when $m$ is large and the number of covariates is small. If the independence model is rejected while all the ecological covariates can be found and modeled, the residuals that enter $I_{PR}$ or $I_{DR}$ should be completely lacking in spatial autocorrelation. Otherwise, the test may indicate spatial clustering that cannot be explained.
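The steps described above (fit the model, form the residuals, compute Moran’s I on them, then judge significance) can be sketched as follows. Two things here are assumptions rather than the article’s procedure: the fitted counts come from an intercept-only log-rate model so that no fitting routine is needed, and significance is judged with a Monte Carlo permutation P value (one-sided, upper tail) in place of the closed-form permutation moments and normal approximation. All names and data are hypothetical.

```python
import numpy as np

def pearson_residuals(n, n_hat):
    """Pearson residuals of a Poisson model: (n - n_hat) / sqrt(n_hat)."""
    n = np.asarray(n, dtype=float)
    n_hat = np.asarray(n_hat, dtype=float)
    return (n - n_hat) / np.sqrt(n_hat)

def deviance_residuals(n, n_hat):
    """Deviance residuals of a Poisson model (standard form)."""
    n = np.asarray(n, dtype=float)
    n_hat = np.asarray(n_hat, dtype=float)
    # n * log(n / n_hat) is taken as 0 when n == 0, its limiting value.
    ratio = np.where(n > 0, n / n_hat, 1.0)
    term = n * np.log(ratio) - n + n_hat
    return np.sign(n - n_hat) * np.sqrt(2.0 * np.maximum(term, 0.0))

def morans_i(z, W):
    """Moran's I for values z and spatial weight matrix W."""
    d = z - z.mean()
    return (d @ W @ d) / ((d @ d / z.size) * W.sum())

def permutation_p_value(z, W, n_perm=999, seed=0):
    """Monte Carlo upper-tail P value from random permutations of z."""
    rng = np.random.default_rng(seed)
    observed = morans_i(z, W)
    sims = np.array([morans_i(rng.permutation(z), W) for _ in range(n_perm)])
    return (1 + np.sum(sims >= observed)) / (n_perm + 1)

# Toy data (hypothetical): six regions in a row with equal populations,
# low counts on one side and high counts on the other.
n_toy = np.array([3, 5, 4, 20, 25, 22], dtype=float)
xi_toy = np.full(6, 100.0)
W_toy = np.eye(6, k=1) + np.eye(6, k=-1)

# Intercept-only log-rate model: fitted count = population * pooled rate.
n_hat = xi_toy * n_toy.sum() / xi_toy.sum()

r_p = pearson_residuals(n_toy, n_hat)
r_d = deviance_residuals(n_toy, n_hat)
i_pr = morans_i(r_p, W_toy)
i_dr = morans_i(r_d, W_toy)
p_pr = permutation_p_value(r_p, W_toy)
```

Because Moran’s I is unchanged when all values are multiplied by a positive constant, any constant scaling of the residuals leaves the two statistics as they are.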
Although no one has specified both $I_{DR}$ and $I_{PR}$ as above, it is worth noting the difference between $I_{PR}$ and the score test of the residuals in a generalized linear model (Jacqmin-Gadda et al. 1997). The score test, which also accounts for explanatory variables, was derived from a random-effect approach by neglecting an additional term of overdispersion. Based on its original formula, the test statistic is expressed as
$$T = \sum_{i=1}^{m}\sum_{j\neq i} w_{ij}(Y_i-\hat{\mu}_i)(Y_j-\hat{\mu}_j) \qquad (7)$$
where $Y_i$ is the variable of interest in region $i$, assumed to come from the exponential family, $\hat{\mu}_i$ is an estimate of $E(Y_i)$, and $w_{ij}$ is an element of the weight matrix. For a Poisson model, one can take $Y_i = N_i$ so that $y_i$ is the observed count $n_i$ and $\hat{\mu}_i$ is the estimated count $\hat{n}_i$ in our notation. Then,
$$T = \sum_{i=1}^{m}\sum_{j=1,\,j\neq i}^{m} w_{ij}(n_i-\hat{n}_i)(n_j-\hat{n}_j)$$
This particular form of $T$ is different from the formulation of $I_{PR}$ given by (6) because we use $z_i = (n_i-\hat{n}_i)/\hat{n}_i^{1/2}$ instead of $z_i = n_i-\hat{n}_i$. This difference also leads to a variance adjustment problem for $T$, because the variance of $Y_i-\hat{\mu}_i$ still depends on the $i$th regional population size $\xi_i$ even when $\hat{\mu}_i$ equals the true parameter $\mu_i$. In addition, $w_{ij}$ in $T$ may not be based on spatial relationships, such as spatial adjacency or distance, which complicates both the design of a proper weight matrix and the derivation of the P value. Pearson residuals, on the other hand, are well established, and they are much easier to calculate in a standard statistical package than the score test is. Because of these differences, we will not compare $I_{PR}$ and $I_{DR}$ with the score test in our simulations.
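The variance point can be checked numerically. For a Poisson count with known mean $\mu_i$, the raw residual $N_i-\mu_i$ has variance $\mu_i$, which grows with the population, whereas the Pearson-type residual $(N_i-\mu_i)/\mu_i^{1/2}$ has variance 1 regardless of population. The short simulation below illustrates this for a sparse and a dense region; the rate, the two population sizes, and the number of draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 1e-4                              # common disease rate (assumed)
pops = {"sparse": 1e5, "dense": 1e8}     # two hypothetical population sizes
draws = 200_000

raw_var = {}
pearson_var = {}
for name, xi in pops.items():
    mu = rate * xi                       # true Poisson mean for this region
    n = rng.poisson(mu, size=draws)
    raw_var[name] = np.var(n - mu)                     # close to mu
    pearson_var[name] = np.var((n - mu) / np.sqrt(mu)) # close to 1

variance_ratio = raw_var["dense"] / raw_var["sparse"]  # close to 1000
```

The raw residuals of the dense region are about a thousand times more variable than those of the sparse region, while the standardized residuals are on a common scale.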
Simulation study
To evaluate the variance adjustments of the Pearson residual $I_{PR}$ and the deviance residual $I_{DR}$, we carried out simulations under the null hypothesis of no spatial clustering. We included two alternative test statistics, Oden’s $I_{pop}$ and Assuncao and Reis’s EBI, both of which are designed to account for heteroskedasticity without controlling for ecological covariates. As a reference point, we also included the original Moran’s I, denoted by $I_r$, by taking $z_i = n_i/\xi_i$ in equation (1). For ease of discussion, we list some expressions for Oden’s $I_{pop}$ and Assuncao and Reis’s EBI below. Oden’s $I_{pop}$ is defined as
$$I_{pop} = \frac{n^2\sum_{i=1}^{m}\sum_{j=1}^{m} M_{ij}(e_i-d_i)(e_j-d_j) - n(1-2b)\sum_{i=1}^{m} M_{ii}e_i - nb\sum_{i=1}^{m} M_{ii}d_i}{b(1-b)\left[\xi^2\sum_{i=1}^{m}\sum_{j=1}^{m} d_id_jM_{ij} - \xi\sum_{i=1}^{m} d_iM_{ii}\right]} \qquad (8)$$
where $b = n/\xi$, $e_i = n_i/n$, $d_i = \xi_i/\xi$, $M_{ij} = w_{ij}/\sqrt{d_id_j}$, $n = \sum_{i=1}^{m} n_i$, and $\xi = \sum_{i=1}^{m} \xi_i$. Oden also derived the mean and variance of $I_{pop}$. If there is no spatial dependence, $I_{pop}$ is close to the mean value $-1/(\xi-1)$.
Oden’s $I_{pop}$ weights a pair of regional populations together with the spatial weight matrix by using $M_{ij}$. While $w_{ij}$ in $M_{ij}$ is still a spatial weight in the $W$ matrix, its diagonal element $w_{ii} \neq 0$. This feature makes $I_{pop}$ incomparable with Moran’s I when spatial clustering and population heterogeneity both exist. Noticing this difference, Assuncao and Reis (1999) proposed their version of population-adjusted Moran’s I, or EBI. In their definition, $z_i = (p_i-b)/\sqrt{\nu_i}$, where $p_i = n_i/\xi_i$ is the observed rate, $b = n/\xi$, $\nu_i = a + b/\xi_i$, $a = s^2 - b/(\xi/m)$, and $s^2 = \sum_{i=1}^{m}\xi_i(p_i-b)^2/\xi$. Hence,
$$\mathrm{EBI} = \frac{\sum_{i=1}^{m}\sum_{j\neq i} w_{ij}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)\left(\frac{p_j-b}{\sqrt{\nu_j}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{p_i-b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l-b}{\sqrt{\nu_l}}\right)^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1,\,j\neq i}^{m} w_{ij}\right]} \qquad (9)$$
Although they did not notice the connection, Assuncao and Reis point out that $\nu_i$ can be set equal to $b/\xi_i$ if $\nu_i < 0$. If all $\nu_i < 0$, then $I_{PR}$ and EBI are identical in the absence of covariates, because $z_i$ then reduces to $(n_i - b\xi_i)/(b\xi_i)^{1/2}$, the Pearson residual of a log-rate model with only an intercept, for which $\hat{n}_i = b\xi_i$. The justification and evaluation of EBI, in this case, would be directly applicable to $I_{PR}$. However, we never encountered this case in our simulations, and so the general form of EBI is always assumed.
Simulations were based on a 20 × 20 lattice. We defined a spatial weight matrix according to spatial adjacency, that is, $w_{ij} = 1$ if two lattice points $i$ and $j$ are adjacent, 0 otherwise. For $I_{pop}$, we followed Oden and took $w_{ii} = 2$. These weight assignments are identical to those used in Assuncao and Reis’s (1999) simulations.
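The standardization behind EBI, and its link to the Pearson residual when $\nu_i$ is replaced by $b/\xi_i$, can be illustrated with a short NumPy sketch. It follows the definitions given for equation (9) and the fallback described above, with $p_i$ taken as the observed rate $n_i/\xi_i$. The function names and the four-region toy data are hypothetical, and the case $\nu_i = 0$ is not handled.

```python
import numpy as np

def ebi_values(n, xi):
    """Standardized rates z_i = (p_i - b) / sqrt(nu_i) used by EBI."""
    n = np.asarray(n, dtype=float)
    xi = np.asarray(xi, dtype=float)
    m = n.size
    p = n / xi                        # observed rates
    b = n.sum() / xi.sum()            # pooled rate
    s2 = np.sum(xi * (p - b) ** 2) / xi.sum()
    a = s2 - b / (xi.sum() / m)
    nu = a + b / xi
    nu = np.where(nu < 0, b / xi, nu)  # fallback when nu_i is negative
    return (p - b) / np.sqrt(nu)

def pearson_intercept_only(n, xi):
    """Pearson residuals of an intercept-only log-rate model (n_hat = b * xi)."""
    n = np.asarray(n, dtype=float)
    xi = np.asarray(xi, dtype=float)
    n_hat = xi * n.sum() / xi.sum()
    return (n - n_hat) / np.sqrt(n_hat)

# Toy data (hypothetical): four regions.
n_toy = np.array([8, 15, 40, 22], dtype=float)
xi_toy = np.array([100, 120, 300, 260], dtype=float)

z_ebi = ebi_values(n_toy, xi_toy)

# Using nu_i = b / xi_i for every region reproduces the Pearson residuals.
b_toy = n_toy.sum() / xi_toy.sum()
z_fallback = (n_toy / xi_toy - b_toy) / np.sqrt(b_toy / xi_toy)
r_p = pearson_intercept_only(n_toy, xi_toy)
```

Either set of values can then be passed to Moran’s I in equation (1) in place of $z_i$.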
[Figure 1: six lattice panels, each with X and Y axes running from 0 to 20 and regions labeled ξ = 1 or ξ = 1000. Panels: (a1) Half Sparse Half Dense; (a2) Quad Sparse Quad Dense; (a3) One Cluster of Dense; (a4) Two Clusters of Dense; (a5) One Cluster of Sparse; (a6) Two Clusters of Sparse.]
Figure 1. Population patterns for (a1)–(a6) in units of $10^5$.
We denote $\lambda_i$ and $\xi_i$, respectively, as the disease rate and population size at lattice point $i$, and $N_i$ as the Poisson random variable with the mean value $\mu_i = \lambda_i\xi_i$. In each run, 400 Poisson random variables with the constant rate $\lambda_i = 0.0001$ were generated independently on the lattice points, so that there would be no spatial clustering of disease rates. In order to compare a wide range of heteroskedasticity, we included a spatially homogeneous pattern, together with six heterogeneous patterns similar to those used in Waldhor’s (1996) simulations (Fig. 1). We used $u_i$ and $v_i$ to represent the vertical axis and the horizontal axis, respectively, so that a lattice point $i$ can be easily identified by $i = 20(u_i-1)+v_i$. The seven spatial population patterns were defined in units of $10^5$ (e.g., $\xi_i = 1$ means $1 \times 10^5$ and $\xi_i = 1000$ means $1000 \times 10^5$), and they are:
(a0) Homogeneous population with $\xi_i = 1$ for all $i = 1, \ldots, 400$.
(a1) Half sparse and half dense: $\xi_i = 1$ if $u_i \le 10$, $\xi_i = 1000$ if $u_i \ge 11$.
(a2) One quad sparse and one quad dense: $\xi_i = 1$ if $u_i \le 10$ and $v_i \le 10$, or $u_i \ge 11$ and $v_i \ge 11$; $\xi_i = 1000$ if $u_i \le 10$ and $v_i \ge 11$, or $u_i \ge 11$ and $v_i \le 10$.
(a3) All sparse except one cluster with a dense population: $\xi_i = 1$, except when lattice point $i$ is within $\sqrt{(u_i-5)^2+(v_i-5)^2} \le 3$, in which $\xi_i = 1000$.
(a4) All sparse except two clusters with dense populations: $\xi_i = 1$, except when lattice point $i$ is within $\sqrt{(u_i-5)^2+(v_i-5)^2} \le 3$ or $\sqrt{(u_i-15)^2+(v_i-15)^2} \le 3$, in which $\xi_i = 1000$.
(a5) All dense except one cluster with a sparse population: $\xi_i = 1000$, except when lattice point $i$ is within $\sqrt{(u_i-5)^2+(v_i-5)^2} \le 3$, in which $\xi_i = 1$.
(a6) All dense except two clusters with sparse populations: $\xi_i = 1000$, except when lattice point $i$ is within $\sqrt{(u_i-5)^2+(v_i-5)^2} \le 3$ or $\sqrt{(u_i-15)^2+(v_i-15)^2} \le 3$, in which $\xi_i = 1$.
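The seven patterns and one simulated draw can be sketched in NumPy as follows. Two choices are assumptions: "adjacent" is read as sharing an edge (rook adjacency), and pattern (a2) is read as a checkerboard of quadrants, in line with Figure 1. The function names are hypothetical.

```python
import numpy as np

SIDE = 20      # lattice is SIDE x SIDE
UNIT = 1e5     # populations are expressed in units of 10^5
RATE = 1e-4    # constant disease rate

def population_pattern(name, sparse=1.0, dense=1000.0):
    """Population sizes (in units of 10^5) for patterns 'a0' to 'a6'.

    Entry i-1 of the result belongs to lattice point i = 20(u - 1) + v.
    """
    coords = np.arange(1, SIDE + 1)
    u, v = np.meshgrid(coords, coords, indexing="ij")
    near_a = (u - 5) ** 2 + (v - 5) ** 2 <= 9      # within distance 3 of (5, 5)
    near_b = (u - 15) ** 2 + (v - 15) ** 2 <= 9    # within distance 3 of (15, 15)
    dense_masks = {
        "a0": np.zeros_like(near_a),
        "a1": u >= 11,
        "a2": ((u <= 10) & (v >= 11)) | ((u >= 11) & (v <= 10)),
        "a3": near_a,
        "a4": near_a | near_b,
        "a5": ~near_a,
        "a6": ~(near_a | near_b),
    }
    return np.where(dense_masks[name], dense, sparse).ravel()

def rook_weights(side=SIDE):
    """Binary adjacency matrix: w_ij = 1 if lattice points share an edge."""
    m = side * side
    W = np.zeros((m, m))
    for r in range(side):
        for c in range(side):
            i = r * side + c
            if r + 1 < side:                  # neighbor in the next row
                W[i, i + side] = W[i + side, i] = 1.0
            if c + 1 < side:                  # neighbor in the next column
                W[i, i + 1] = W[i + 1, i] = 1.0
    return W

W = rook_weights()
xi_a3 = population_pattern("a3")

# One simulated data set under the null hypothesis for pattern (a3).
rng = np.random.default_rng(0)
counts_a3 = rng.poisson(RATE * xi_a3 * UNIT)
```

With these settings a sparse lattice point has an expected count of 10 and a dense one an expected count of 10,000.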
For a given pattern of population and rate distributions, we ran the simulations 10,000 times and calculated $I_r$, $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI for each run. We fixed the disease rate at 0.0001 for all locations in the seven patterns (a0–a6), so that $E(N_i) = V(N_i) = 10$ if $\xi_i = 1$ unit and $E(N_i) = V(N_i) = 10{,}000$ if $\xi_i = 1000$ units. Under the null hypothesis, we assessed the validity of the permutation test by comparing (a) the observed variance for each test statistic versus the corresponding permutation variance and (b) the rejection rates at the 0.05 nominal level ($\alpha = 0.05$). In the preliminary analysis, we also compared the observed and permutation means for each of $I_r$, $I_{PR}$, $I_{DR}$, and EBI. As the difference between the two could be ignored, we will not compare them.
Table 1 lists the simulation results of the observed and permutation variances of $I_r$, $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI based on the seven patterns (a0–a6), where $\sigma_S^2$ denotes the observed sample variance from the simulation and $\sigma_R^2$ denotes the estimated variance from the random permutation scheme. As $\sigma_S^2$ can be treated as the true value, we compared the ratio of $\sigma_S^2$ to $\sigma_R^2$. If the ratio is close to 1, it suggests that the permutation variance is close to the true variance.
The results show that, when the populations were homogeneous, the ratios between $\sigma_S^2$ and $\sigma_R^2$ were very close to 1. When the populations were heteroge-
Table 1 Comparison of Variance for Selected Test Statistics