Loglinear Residual Tests of Moran’s I Autocorrelation

Geographical Analysis ISSN 0016-7363

Loglinear Residual Tests of Moran’s I Autocorrelation and their Applications to Kentucky Breast Cancer Data

Ge Lin,1 Tonglin Zhang2

1Department of Geology and Geography, West Virginia University, Morgantown, WV, 2Department of Statistics, Purdue University, West Lafayette, IN

This article bridges the permutation test of Moran’s I to the residuals of a loglinear model under the asymptotic normality assumption. It provides versions of Moran’s I based on Pearson residuals ($I_{PR}$) and deviance residuals ($I_{DR}$) so that they can be used to test for spatial clustering while accounting for potential covariates and heterogeneous population sizes. Our simulations showed that both $I_{PR}$ and $I_{DR}$ are effective in accounting for heterogeneous population sizes. The tests based on $I_{PR}$ and $I_{DR}$ are applied to a set of log-rate models for early-stage and late-stage breast cancer with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables can sufficiently explain the spatial clustering of early-stage breast carcinomas, but they cannot explain the clustering of late-stage carcinomas. For this reason, we used local spatial association terms and located four late-stage breast cancer clusters that could not be explained. The results also confirmed our expectation that a high screening level would be associated with a high incidence rate of early-stage disease, which in turn would reduce late-stage incidence rates.

Correspondence: Ge Lin, Department of Geology and Geography, West Virginia University, Morgantown, WV 26506 e-mail: [email protected]

Submitted: August 22, 2005. Revised version accepted: April 04, 2006.

Geographical Analysis 39 (2007) 293–310 © 2007 The Ohio State University

Introduction

Linear or loglinear spatial regression models are common in spatial epidemiology (Best et al. 2000). A set of ecological variables is often associated with disease rates or counts, and after a final model is derived, residuals can be visually inspected on a map for spatial clusters. For a linear model, a residual test of Moran’s I for spatial autocorrelation can also be performed to detect spatial clustering in the unexplained regression errors. However, there is no corresponding spatial residual test of clustering for loglinear or Poisson regressions on count data, which was the challenge for the current study. The study was motivated by our inquiry into the spatial patterns of breast cancer incidence in Kentucky counties as they related to the development stage of disease at diagnosis. Breast cancer staging at diagnosis is known to be associated with socioeconomic conditions, mammography screening services, and other variables (Yabroff and Gordis 2003; Barry and Breen 2005). As socioeconomic variables are often spatially autocorrelated (e.g., poor areas tend to be clustered), we expect clustering of breast cancer to occur, at least for the early-stage incidence rates. If there is no significant environmental cause of breast cancer, the clustering tendency should disappear once we introduce area socioeconomic variables.

One way to test for the existence of spatial clustering is to set up a spatial autocorrelation test, such as Moran’s I, for Gaussian or continuous data by using the permutation test of residuals for Moran’s I in a linear regression (Cliff and Ord 1981). Converting incidence to a rate, however, is often less appealing than retaining the original counts in spatial data analysis (Griffith and Haining 2006). In addition, Moran’s I test assumes that attribute values (e.g., disease prevalence) are either equally probable among all the geographic units or drawn from a single parent distribution. These assumptions are often violated in the permutation test of Moran’s I for disease data because of heterogeneous regional populations and large variation in sparsely populated areas (Besag and Newell 1991). Although there have been several extensions of Moran’s I to account for population heterogeneity (Oden 1995; Waldhor 1996; Assuncao and Reis 1999), none of them can include potential ecological covariates. For example, Oden proposes a test statistic $I_{pop}$ that applies regional population sizes to adjust Moran’s I. However, because of a minor modification in the null hypothesis, $I_{pop}$ is no longer comparable to the original Moran’s I (Assuncao and Reis 1999).
Consequently, $I_{pop}$ cannot be extended to evaluate covariates and spatial autocorrelation simultaneously. A spatial logit association model can include potential explanatory variables and identify high-value and low-value clusters (Lin 2003; Zhang and Lin 2006). It does not, however, have a global measure of spatial clustering that would complement the modeling process for local spatial logit associations. Jacqmin-Gadda et al. (1997) propose a homogeneity score test of a generalized linear model that can also include potential explanatory variables in a correlation test. The test is based on residuals in generalized linear models, a design that its authors claimed to correspond to the permutation test of linear regression errors. However, as we will later show, the test does not adjust the variance for heterogeneous population sizes. In addition, because its weight matrix is not necessarily spatially constructed, its null hypothesis is not necessarily spatial independence, as one would assume when applying Moran’s I test. Consequently, it is not straightforward to use the score test in a generalized linear model that includes spatial correlation and heterogeneity.

The purpose of this article is to extend the permutation test of residuals of the Moran’s I autocorrelation to generalized linear models so that spatial analysts can directly test for spatial clustering while controlling for potential ecological covariates. While no one has proposed either deviance or Pearson residuals in the spatial statistics literature, Waller and Gotway (2004) point out a form close to Pearson residuals as a way to account for inflated variance in Moran’s I under heteroskedasticity. In this article, we demonstrate that permutation tests are applicable to Pearson or deviance residuals of loglinear models in the same way as in the traditional permutation test of residuals for Moran’s I.

In the remaining sections of the article, we first review the permutation test of Moran’s I by using regression residuals and then reformulate it in the context of Poisson data by using the Pearson and deviance residuals of a loglinear model. We then evaluate their statistical properties under the null hypothesis of spatial independence in a series of simulated patterns and apply the Pearson residual Moran’s I test and the deviance residual Moran’s I test to breast cancer incidence in Kentucky counties, with models that include potential ecological covariates.

From linear to loglinear residual tests of Moran’s I

Let us consider a study area that has $m$ regions indexed by $i$. Let $z_i$ be the variable of interest in region $i$. Moran’s I (Moran 1950) is expressed as

$$ I = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}(z_i - \bar{z})(z_j - \bar{z})}{\left[\sum_{i=1}^{m}(z_i - \bar{z})^2/m\right]\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]} \tag{1} $$

where $\bar{z} = \sum_{i=1}^{m} z_i/m$ and $w_{ij}$ is an element of a spatial weight matrix $W$, equal to 1 if regions $i$ and $j$ are adjacent and 0 otherwise. Under the assumption of homogeneity, the moments of Moran’s I can be computed under the randomization assumption, which assumes that the observations are generated from a set of random permutations of the observed values. The significance of Moran’s I can be determined by the quantity $I_{std} = [I - E(I)]/\sqrt{\mathrm{Var}(I)}$ under the asymptotic normality assumption, where the respective mean and variance are obtained from the random permutation scheme suggested by Cliff and Ord (1981, p. 21). A significant and positive value of Moran’s I (i.e., $I_{std} > z_{\alpha/2}$) usually indicates a positive autocorrelation, such as the existence of either high-value or low-value clustering. A significant and negative value of Moran’s I (i.e., $I_{std} < -z_{\alpha/2}$) usually indicates a negative autocorrelation, such as a tendency toward the juxtaposition of high values with low values. If there is no spatial dependence, I is often close to the mean value $-1/(m - 1)$, which can be approximated by 0 if $m$ is large.
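The statistic in equation (1) and its standardized form can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code. The function names and the six-region toy data are hypothetical, and the variance is assumed to be the usual randomization variance for Moran's I, which is what we take the cited Cliff and Ord scheme to be.

```python
import numpy as np

def morans_i(z, w):
    """Moran's I of equation (1) for values z and spatial weights w."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    d = z - z.mean()
    return (d @ w @ d) / ((d @ d / z.size) * w.sum())

def randomization_moments(z, w):
    """E(I) and Var(I) under the random permutation scheme (assumed form)."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    m = z.size
    d = z - z.mean()
    s0 = w.sum()
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=0) + w.sum(axis=1)) ** 2).sum()
    b2 = m * (d ** 4).sum() / (d ** 2).sum() ** 2  # kurtosis of z
    mean = -1.0 / (m - 1)
    a = m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
    b = b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2)
    var = (a - b) / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - mean ** 2
    return mean, var

def standardized_i(z, w):
    """I_std = [I - E(I)] / sqrt(Var(I))."""
    mean, var = randomization_moments(z, w)
    return (morans_i(z, w) - mean) / np.sqrt(var)

# Toy example (hypothetical): six regions on a line, neighbours adjacent.
w_chain = np.zeros((6, 6))
for k in range(5):
    w_chain[k, k + 1] = w_chain[k + 1, k] = 1.0
z_trend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

i_obs = morans_i(z_trend, w_chain)       # 0.6 for this monotone trend
i_std = standardized_i(z_trend, w_chain)
```

For a small data set the randomization mean and variance can be checked by enumerating every permutation of the values. For realistic numbers of regions the closed-form moments are used with the normal approximation instead.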

In order to account for ecological covariates, it is often suggested that $z_i$ be taken as the $i$th regional residual of a linear regression model when testing for spatial autocorrelation (Cliff and Ord 1981, p. 198). For data resulting purely from a random process, the calculation of Moran’s I in equation (1) based on the residuals is the same as the one based on the observed values. It can be concluded that $I_{std}$ is approximately $N(0, 1)$ as $m \to \infty$ under some regularity conditions (Sen 1976). Schmoyer (1994) also demonstrated the validity of a general form of permutation test based on the residuals of a linear model with independent and identically distributed errors. In Poisson models with heterogeneous population sizes, these justifications cannot be applied (Waldhor 1996; Assuncao and Reis 1999).


The residuals of loglinear models are well established in the statistical literature in a nonspatial context. We can extend them to a spatial context to test for spatial autocorrelation. In his synthesis of previous studies, Agresti (1990, p. 431) showed that the Pearson and log-likelihood (deviance) residuals of loglinear models are asymptotically multivariate normal with mean 0 and the variance–covariance matrix a projection matrix. This particular asymptotic form of the residuals is analogous to that of linear regression residuals. Following this reasoning, we can devise a log-rate model that closely resembles a linear regression model to account for potentially heterogeneous population sizes and ecological covariates. We apply the permutation test based on the asymptotic normality assumption, so that Moran’s I based on the residuals of a log-rate model is analogous to Moran’s I based on regression residuals. In the following, we specify the residuals of a log-rate model for the permutation test of Moran’s I.

Let $n_i$ be the observed count for a Poisson random variable $N_i$ in region $i$ ($i = 1, \ldots, m$), and let $\xi_i$ and $\theta_i$ be the $i$th regional population and relative risk, respectively. Then, the $N_i$ are assumed to be independently Poisson distributed, or $N_i \sim \mathrm{Poisson}(E_i\theta_i)$, where $E_i$ is the expected count for region $i$. The null hypothesis that all the relative risks $(\theta_1, \ldots, \theta_m)$ are 1 can be stated (Elliott et al. 2000, p. 132). Suppose that a set of explanatory variables $(x_{i,1}, \ldots, x_{i,k-1})$ are observed together with $n_i$. A log-rate model can then be expressed as

$$ \log(\hat{n}_i/\xi_i) = \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1} \tag{2} $$

In equation (2), $\hat{n}_i$ and $\log(\hat{n}_i/\xi_i)$ are, respectively, the estimated count and estimated log rate for region $i$, $\hat{\beta}_0$ is the estimate of the grand mean, and the other $\hat{\beta}$s are parameter estimates for the explanatory variables. Equation (2) is often estimated by moving $\log(\xi_i)$ to the right-hand side, so that it becomes the offset:

$$ \log(\hat{n}_i) = \log(\xi_i) + \hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \cdots + \hat{\beta}_{k-1} x_{i,k-1} \tag{3} $$

Based on these notations, the conventional Pearson residual for region $i$ is defined as

$$ r_{i,p} = \frac{n_i - \hat{n}_i}{\hat{n}_i^{1/2}} \tag{4} $$

and the conventional deviance residual (Agresti 1990, p. 452) for region $i$ is defined as

$$ r_{i,d} = 2\,\mathrm{sign}(n_i - \hat{n}_i)\left[n_i \log(n_i/\hat{n}_i) - n_i + \hat{n}_i\right]^{1/2} \tag{5} $$

where $\mathrm{sign}(a)$ is 1 if $a > 0$, 0 if $a = 0$, and $-1$ if $a < 0$. We can test for spatial autocorrelation based on the residuals of equation (2) by replacing $z_i$ in (1) with either the Pearson residual in (4) or the deviance residual in (5). When $z_i$ is replaced with $r_{i,p}$, for example, Moran’s I becomes Pearson residual Moran’s I, and we denote it as $I_{PR}$, which can be expressed as

$$ I_{PR} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\left(\frac{n_i - \hat{n}_i}{\hat{n}_i^{1/2}} - \frac{1}{m}\sum_{l=1}^{m}\frac{n_l - \hat{n}_l}{\hat{n}_l^{1/2}}\right)\left(\frac{n_j - \hat{n}_j}{\hat{n}_j^{1/2}} - \frac{1}{m}\sum_{l=1}^{m}\frac{n_l - \hat{n}_l}{\hat{n}_l^{1/2}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{n_i - \hat{n}_i}{\hat{n}_i^{1/2}} - \frac{1}{m}\sum_{k=1}^{m}\frac{n_k - \hat{n}_k}{\hat{n}_k^{1/2}}\right)^2 \Big/ m\right]\left[\sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij}\right]} \tag{6} $$

When $z_i$ is replaced with $r_{i,d}$, Moran’s I becomes deviance or log-likelihood residual Moran’s I, and we denote it as $I_{DR}$. To implement these tests, the parameters of the explanatory variables together with the residuals are first estimated in the model-fitting process. Then, $I_{PR}$ and $I_{DR}$ are calculated. Finally, the means and variances of $I_{PR}$ and $I_{DR}$ are computed according to the random permutation scheme (Cliff and Ord 1981). The P values of $I_{PR}$ and $I_{DR}$ can be computed based on the asymptotic normality assumption when $m$ is large and the number of covariates is small. If the independence model is rejected while all the ecological covariates can be found and modeled, the residuals used in $I_{PR}$ or $I_{DR}$ should be completely lacking in spatial autocorrelation. Otherwise, the test may indicate spatial clustering that cannot be explained.
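The three steps above (fit the log-rate model, form the residuals, standardize Moran's I with the randomization moments) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The model is fitted with a hand-written iteratively reweighted least squares loop, the lattice, populations, and covariate are simulated, and all function names are hypothetical. The deviance residual uses the common convention with the factor 2 inside the square root. A constant rescaling of the residuals leaves Moran's I unchanged, so the convention does not affect the test.

```python
import numpy as np

def lattice_adjacency(rows, cols):
    """Rook adjacency matrix for a rows x cols lattice (w_ij = 1 if adjacent)."""
    m = rows * cols
    w = np.zeros((m, m))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                w[i, i + 1] = w[i + 1, i] = 1.0
            if r + 1 < rows:
                w[i, i + cols] = w[i + cols, i] = 1.0
    return w

def fit_log_rate(n, pop, covariates=None, iters=100, tol=1e-10):
    """Fit log(n_hat) = log(pop) + X beta (equation 3) by IRLS."""
    n = np.asarray(n, dtype=float)
    pop = np.asarray(pop, dtype=float)
    X = np.ones((n.size, 1))
    if covariates is not None:
        X = np.column_stack([X, covariates])
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(n.sum() / pop.sum())      # start at the overall log rate
    for _ in range(iters):
        eta = X @ beta
        mu = pop * np.exp(eta)
        work = eta + (n - mu) / mu             # working response, offset removed
        new_beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * work))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, pop * np.exp(X @ beta)

def pearson_residuals(n, n_hat):
    n = np.asarray(n, dtype=float)
    return (n - n_hat) / np.sqrt(n_hat)

def deviance_residuals(n, n_hat):
    n = np.asarray(n, dtype=float)
    safe_n = np.where(n > 0, n, 1.0)           # n * log(n / n_hat) is 0 when n == 0
    core = np.where(n > 0, n * np.log(safe_n / n_hat), 0.0) - n + n_hat
    return np.sign(n - n_hat) * np.sqrt(2.0 * np.maximum(core, 0.0))

def residual_morans_i(z, w):
    """Return (I, I_std) using the randomization mean and variance."""
    m = z.size
    d = z - z.mean()
    s0 = w.sum()
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=0) + w.sum(axis=1)) ** 2).sum()
    i_obs = (d @ w @ d) / ((d @ d / m) * s0)
    b2 = m * (d ** 4).sum() / (d ** 2).sum() ** 2
    mean = -1.0 / (m - 1)
    a = m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
    b = b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2)
    var = (a - b) / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - mean ** 2
    return i_obs, (i_obs - mean) / np.sqrt(var)

# Simulated example (hypothetical data): 10 x 10 lattice, one covariate.
rng = np.random.default_rng(0)
w = lattice_adjacency(10, 10)
pop = rng.integers(20_000, 100_000, size=100).astype(float)
x1 = rng.normal(size=100)
n = rng.poisson(pop * np.exp(-6.0 + 0.2 * x1))

beta, n_hat = fit_log_rate(n, pop, x1)
i_pr, z_pr = residual_morans_i(pearson_residuals(n, n_hat), w)
i_dr, z_dr = residual_morans_i(deviance_residuals(n, n_hat), w)
```

Comparing `z_pr` and `z_dr` with the standard normal quantile gives the test. In practice the fit would normally come from a statistical package, and only the residual vectors would be passed to the Moran's I routine.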

Although no one has specified both $I_{DR}$ and $I_{PR}$ as above, it is worth noting the difference between $I_{PR}$ and the score test of the residuals in a generalized linear model (Jacqmin-Gadda et al. 1997). The score test, which also accounts for explanatory variables, was derived from a random-effect approach by neglecting an additional term of overdispersion. Based on its original formula, the test statistic is expressed as

$$ T = \sum_{i=1}^{m}\sum_{j \ne i} w_{ij}(Y_i - \hat{\mu}_i)(Y_j - \hat{\mu}_j) \tag{7} $$

where $Y_i$ is the variable of interest in region $i$, assumed to be from the exponential family, $\hat{\mu}_i$ is an estimate of $E(Y_i)$, and $w_{ij}$ is an element of the weight matrix. For a Poisson model, one can take $Y_i = N_i$, so that $y_i$ is the observed count $n_i$ and $\hat{\mu}_i$ is the estimated count $\hat{n}_i$ in our notation. Then,

$$ T = \sum_{i=1}^{m}\sum_{j=1, j \ne i}^{m} w_{ij}(n_i - \hat{n}_i)(n_j - \hat{n}_j) $$

This particular form of $T$ is different from the formulation of $I_{PR}$ given by (6) because we use $z_i = (n_i - \hat{n}_i)/\hat{n}_i^{1/2}$ instead of $z_i = n_i - \hat{n}_i$. This difference also leads to a variance adjustment problem for $T$, because the variance of $Y_i - \hat{\mu}_i$ still depends on the $i$th regional population size $\xi_i$ even when $\hat{\mu}_i$ equals the true parameter $\mu_i$. In addition, $w_{ij}$ in $T$ may not be based on spatial relationships, such as spatial adjacency or distance, which adds complications in designing a proper weight matrix and in deriving the P value. Pearson residuals, on the other hand, are well established, and it is much easier to calculate or implement Pearson residuals in a standard statistical package than it is for the score test. Because of these differences, we will not compare $I_{PR}$ and $I_{DR}$ with the score test in our simulations.
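The variance argument can be made explicit with a short worked step. It assumes a Poisson count with known mean $\mu_i = \lambda_i\xi_i$ (rate times population), which is an idealization because fitted values replace $\mu_i$ in practice.

```latex
\begin{align*}
\mathrm{Var}(N_i - \mu_i) &= \mu_i = \lambda_i \xi_i
  && \text{(raw residual: grows with population size)} \\
\mathrm{Var}\!\left(\frac{N_i - \mu_i}{\mu_i^{1/2}}\right) &= \frac{\mu_i}{\mu_i} = 1
  && \text{(Pearson residual: free of population size)}
\end{align*}
```

With a common rate, a region with 1000 times the population therefore has a raw residual variance 1000 times larger, whereas its Pearson residual stays on the same unit scale as every other region.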

Simulation study

To evaluate the variance adjustments of the Pearson residual statistic $I_{PR}$ and the deviance residual statistic $I_{DR}$, we carried out simulations under the null hypothesis of no spatial clustering. We included two alternative test statistics, Oden’s $I_{pop}$ and Assuncao and Reis’s EBI, both of which are designed to account for heteroskedasticity without controlling for ecological covariates. As a reference point, we also included the original Moran’s I, denoted by $I_r$, by taking $z_i = n_i/\xi_i$ in equation (1). For ease of discussion, we list the expressions for Oden’s $I_{pop}$ and Assuncao and Reis’s EBI below. Oden’s $I_{pop}$ is defined as

$$ I_{pop} = \frac{n^2 \sum_{i=1}^{m}\sum_{j=1}^{m} M_{ij}(e_i - d_i)(e_j - d_j) - n(1 - 2b)\sum_{i=1}^{m} M_{ii} e_i - nb\sum_{i=1}^{m} M_{ii} d_i}{b(1 - b)\left[\xi^2 \sum_{i=1}^{m}\sum_{j=1}^{m} d_i d_j M_{ij} - \xi \sum_{i=1}^{m} d_i M_{ii}\right]} \tag{8} $$

where $b = n/\xi$, $e_i = n_i/n$, $d_i = \xi_i/\xi$, $M_{ij} = w_{ij}/\sqrt{d_i d_j}$, $n = \sum_{i=1}^{m} n_i$ and $\xi = \sum_{i=1}^{m} \xi_i$. Oden also derived the mean and variance of $I_{pop}$. If there is no spatial dependence, $I_{pop}$ is close to the mean value $-1/(\xi - 1)$. Oden’s $I_{pop}$ weights a pair of regional populations together with the spatial weight matrix by using $M_{ij}$. While $w_{ij}$ in $M_{ij}$ is still a spatial weight in the $W$ matrix, its diagonal element $w_{ii} \ne 0$. This feature makes $I_{pop}$ incomparable with Moran’s I when spatial clustering and population heterogeneity both exist.

Noticing this difference, Assuncao and Reis (1999) proposed their version of population-adjusted Moran’s I, or EBI. In their definition, $z_i = (p_i - b)/\sqrt{\nu_i}$, where $p_i = n_i/\xi_i$ is the observed rate, $b = n/\xi$, $\nu_i = a + b/\xi_i$, $a = s^2 - b/(\xi/m)$ and $s^2 = \sum_{i=1}^{m} \xi_i(p_i - b)^2/\xi$. Hence,

$$ EBI = \frac{\sum_{i=1}^{m}\sum_{j \ne i} w_{ij}\left(\frac{p_i - b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)\left(\frac{p_j - b}{\sqrt{\nu_j}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)}{\left[\sum_{i=1}^{m}\left(\frac{p_i - b}{\sqrt{\nu_i}} - \frac{1}{m}\sum_{l=1}^{m}\frac{p_l - b}{\sqrt{\nu_l}}\right)^2 \Big/ m\right]\left[\sum_{i=1}^{m}\sum_{j=1, j \ne i}^{m} w_{ij}\right]} \tag{9} $$

Although they did not notice the connection, Assuncao and Reis point out that $\nu_i$ can be set equal to $b/\xi_i$ if $\nu_i < 0$. If all $\nu_i < 0$, then $I_{PR}$ and EBI are identical in the absence of covariates. The justification and evaluation of EBI, in this case, would be directly applicable to $I_{PR}$. However, we never encountered this case in our simulations, and so the general form of EBI is always assumed. Simulations were based on a $20 \times 20$ lattice. We defined a spatial weight matrix according to spatial adjacency, that is, $w_{ij} = 1$ if two lattice points $(i, j)$ are adjacent, and 0 otherwise. For $I_{pop}$, we followed Oden and took $w_{ii} = 2$. These weight assignments are identical to those used in Assuncao and Reis’s (1999) simulations.
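The connection between EBI and the Pearson residual statistic can be verified in one line. The step below assumes the null log-rate model with an intercept only, whose fitted counts are $\hat{n}_i = b\xi_i$ with $b = n/\xi$, and it uses $p_i = n_i/\xi_i$ for the observed rate.

```latex
\begin{align*}
z_i = \frac{p_i - b}{\sqrt{b/\xi_i}}
    = \frac{n_i/\xi_i - b}{\sqrt{b/\xi_i}}
    = \frac{n_i - b\xi_i}{\xi_i\sqrt{b/\xi_i}}
    = \frac{n_i - b\xi_i}{\sqrt{b\xi_i}}
    = \frac{n_i - \hat{n}_i}{\hat{n}_i^{1/2}}
    = r_{i,p}
\end{align*}
```

With $\nu_i = b/\xi_i$ the EBI values are exactly the Pearson residuals of the null model. Both statistics exclude the diagonal weights under an adjacency matrix, so the two coincide in that case.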

Figure 1. Population patterns for (a1)–(a6), in units of $10^5$. Panels: (a1) Half Sparse Half Dense; (a2) Quad Sparse Quad Dense; (a3) One Cluster of Dense; (a4) Two Clusters of Dense; (a5) One Cluster of Sparse; (a6) Two Clusters of Sparse. Each panel plots the lattice on X and Y axes running from 0 to 20, with regions labeled $\xi = 1$ or $\xi = 1000$.

We denote $\lambda_i$ and $\xi_i$, respectively, as the disease rate and population size at lattice point $i$, and $N_i$ as the Poisson random variable with the mean value $\mu_i = \lambda_i\xi_i$. In each run, 400 Poisson random variables with the constant rate $\lambda_i = 0.0001$ were generated independently on the lattice points, so that there would be no spatial clustering of disease rates. In order to compare a wide range of heteroskedasticity, we included a spatially homogeneous pattern, together with six heterogeneous patterns similar to those used in Waldhor’s (1996) simulations (Fig. 1). We used $u_i$ and $v_i$ to represent the vertical axis and the horizontal axis, respectively, so that a lattice point $i$ can be easily identified by $i = 20(u_i - 1) + v_i$. The seven spatial population patterns were defined in units of $10^5$ (e.g., $\xi_i = 1$ means $1 \times 10^5$ and $\xi_i = 1000$ means $1000 \times 10^5$), and they are:

(a0) Homogeneous population with $\xi_i = 1$ for all $i = 1, \ldots, 400$.

(a1) Half sparse and half dense: $\xi_i = 1$ if $u_i \le 10$, $\xi_i = 1000$ if $u_i \ge 11$.

(a2) One quad sparse and one quad dense: $\xi_i = 1$ if $u_i \le 10$ and $v_i \le 10$ or $u_i \ge 11$ and $v_i \ge 11$; $\xi_i = 1000$ if $u_i \le 10$ and $v_i \ge 11$ or $u_i \ge 11$ and $v_i \le 10$.

(a3) All sparse except one cluster with a dense population: $\xi_i = 1$, except when lattice point $i$ is within $\sqrt{(u_i - 5)^2 + (v_i - 5)^2} \le 3$, in which $\xi_i = 1000$.

(a4) All sparse except two clusters with dense populations: $\xi_i = 1$, except when lattice point $i$ is within $\sqrt{(u_i - 5)^2 + (v_i - 5)^2} \le 3$ or $\sqrt{(u_i - 15)^2 + (v_i - 15)^2} \le 3$, in which $\xi_i = 1000$.

(a5) All dense except one cluster with a sparse population: $\xi_i = 1000$, except when lattice point $i$ is within $\sqrt{(u_i - 5)^2 + (v_i - 5)^2} \le 3$, in which $\xi_i = 1$.

(a6) All dense except two clusters with sparse populations: $\xi_i = 1000$, except when lattice point $i$ is within $\sqrt{(u_i - 5)^2 + (v_i - 5)^2} \le 3$ or $\sqrt{(u_i - 15)^2 + (v_i - 15)^2} \le 3$, in which $\xi_i = 1$.
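The seven population patterns can be generated with a short script. This is a sketch under assumptions: the inequality signs did not survive in the source text, so the cluster boundary is taken as "distance at most 3" and the quadrant rule is read off Figure 1. The function and constant names are hypothetical.

```python
import numpy as np

SPARSE, DENSE = 1.0, 1000.0   # population in units of 10^5
UNIT = 1e5
RATE = 1e-4                   # constant disease rate used in the simulations

def population_pattern(name, size=20):
    """Population vector for patterns 'a0'..'a6', ordered by i = 20(u - 1) + v."""
    axis = np.arange(1, size + 1)
    u, v = np.meshgrid(axis, axis, indexing="ij")   # u: vertical, v: horizontal
    half = size // 2

    def disc(centre, radius=3):
        # Assumed boundary rule: points at distance <= radius belong to the cluster.
        return (u - centre) ** 2 + (v - centre) ** 2 <= radius ** 2

    dense = {
        "a0": np.zeros_like(u, dtype=bool),
        "a1": u > half,
        "a2": ((u <= half) & (v > half)) | ((u > half) & (v <= half)),
        "a3": disc(5),
        "a4": disc(5) | disc(15),
        "a5": ~disc(5),
        "a6": ~(disc(5) | disc(15)),
    }[name]
    return np.where(dense, DENSE, SPARSE).ravel()

# One simulated run for pattern (a3): independent Poisson counts with mean
# RATE * population, i.e. 10 in sparse cells and 10,000 in dense cells.
rng = np.random.default_rng(1)
xi = population_pattern("a3")
counts = rng.poisson(RATE * xi * UNIT)
```

Under this reading each circular cluster contains 29 lattice points, so pattern (a4) has 58 dense cells and pattern (a6) has 58 sparse cells.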

For a given pattern of population and rate distributions, we ran the simulations 10,000 times and calculated $I_r$, $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI for each run. We fixed the disease rate at 0.0001 for all locations in the seven population patterns (a0–a6), so that $E(N_i) = V(N_i) = 10$ if $\xi_i = 1$ unit and $E(N_i) = V(N_i) = 10{,}000$ if $\xi_i = 1000$ units. Under the null hypothesis, we assessed the validity of the permutation test by comparing (a) the observed variance of each test statistic with the corresponding permutation variance and (b) the rejection rates at the 0.05 nominal level ($\alpha = 0.05$). In the preliminary analysis, we also compared the observed and permutation means for each of $I_r$, $I_{PR}$, $I_{DR}$, and EBI. As the difference between the two was negligible, we do not report that comparison.

Table 1 lists the simulation results for the observed and permutation variances of $I_r$, $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI based on the seven patterns (a0–a6), where $s_S^2$ denotes the observed sample variance from the simulation and $s_R^2$ denotes the estimated variance from the random permutation scheme. As $s_S^2$ can be treated as the true value, we compared the ratio of $s_S^2$ to $s_R^2$. A ratio close to 1 suggests that the permutation variance is close to the true variance.

Table 1 Comparison of Variance for Selected Test Statistics

        I_r (×0.001)     I_PR (×0.001)    I_DR (×0.001)    EBI (×0.001)     I_pop (×10^-21)
        s_S^2   s_R^2    s_S^2   s_R^2    s_S^2   s_R^2    s_S^2   s_R^2    s_S^2        s_R^2
(a0)    1.291   1.303    1.291   1.303    1.292   1.303    1.291   1.303    1.112×10^6   1.147×10^6
(a1)    2.453   1.293    1.248   1.303    1.248   1.303    1.257   1.303    4.643        4.730
(a2)    2.443   1.293    1.300   1.303    1.301   1.303    1.308   1.303    4.909        4.891
(a3)    1.364   1.302    1.295   1.303    1.296   1.303    1.497   1.301    273.4        275.0
(a4)    1.407   1.302    1.272   1.303    1.274   1.303    1.369   1.302    69.32        69.82
(a5)    12.87   1.189    1.314   1.303    1.314   1.303    1.315   1.303    1.357        1.374
(a6)    6.889   1.257    1.271   1.303    1.271   1.303    1.273   1.303    1.640        1.675

The results show that, when the populations were homogeneous, the ratios between $s_S^2$ and $s_R^2$ were very close to 1. When the populations were heterogeneous, the rate-based test $I_r$ produced predominantly biased variance estimates, a result consistent with Waldhor’s simulations. When a set of sparsely populated points or grids were clustered, the permutation tended to underestimate the true variance. In the one-cluster (a5) and two-cluster (a6) patterns, the permutation variances were only about 9.2% and 18.2% of the observed variances, respectively. When the population distribution was characterized by (a1) or (a2), the permutation underestimated the true variances by nearly 50%. In contrast, the ratios between $s_S^2$ and $s_R^2$ for $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI were all very close to 1 in patterns (a1) to (a6). The results for $I_{PR}$ and $I_{DR}$ suggest that the estimated values of the variance are trustworthy and that they can effectively reduce potentially inflated variance. Because the ratios were very close to 1 for all four statistics, choosing $I_{PR}$ or $I_{DR}$ over $I_{pop}$ or EBI is not critical if one simply wants to account for heteroskedasticity. Nevertheless, $I_{PR}$ and $I_{DR}$ have an advantage when one wishes to incorporate covariates.

Table 2 Rejection Rates (%) for Selected Test Statistics

        I_r     I_PR    I_DR    EBI     I_pop
(a0)    4.98    4.98    5.03    4.60    4.60
(a1)    15.69   4.60    4.65    4.88    4.87
(a2)    14.98   4.78    4.88    4.94    5.23
(a3)    5.46    4.95    4.91    4.93    4.78
(a4)    5.94    4.52    4.51    4.92    4.68
(a5)    54.92   5.03    5.02    4.82    4.98
(a6)    39.67   4.70    4.71    4.61    4.86

Table 2 lists the rejection rates of the permutation test based on $I_r$, $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI for the seven spatial population patterns. When populations were homogeneous, all the test statistics had a type I error rate between 4.60% and 5.03%, with only $I_{DR}$ exceeding 5%, and then only by a fraction. When populations were heterogeneous, all the test statistics except $I_r$ had consistent rejection rates of around 5%, suggesting that $I_{PR}$, $I_{DR}$, $I_{pop}$, and EBI were all reasonable under the null hypothesis and that they can effectively correct potentially inflated variance with a satisfactory level of type I errors. The test based on $I_r$, in contrast, failed to hold the nominal level of 0.05 because of the inflated variance. Although some patterns (a3 and a4) registered lower rejection rates, all the heterogeneous patterns had a rejection rate of more than 5%. In the cases of one or two sparsely populated regions clustered in a densely populated study area (patterns a5 and a6), the rejection rate for a spatially random pattern was about 40% or higher (54.92% and 39.67%, respectively). These results essentially confirmed Waldhor’s simulation results for $I_r$. Even though the performance of the permutation test for $I_{PR}$ and $I_{DR}$ is comparable with that for $I_{pop}$ and EBI, $I_{PR}$ and $I_{DR}$ have the flexibility of incorporating potential covariates into their tests for spatial clustering, which we will demonstrate in the following section.
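The contrast between $I_r$ and $I_{PR}$ under pattern (a5) can be reproduced in miniature. The sketch below is not the authors' simulation code. It assumes rook adjacency on the lattice, a two-sided test at the 0.05 level, the "distance at most 3" cluster rule, and only 500 runs instead of 10,000, so its rates will differ from Table 2 by sampling error. All names are hypothetical.

```python
import numpy as np

def rook_weights(size):
    """Rook adjacency matrix for a size x size lattice."""
    m = size * size
    w = np.zeros((m, m))
    for r in range(size):
        for c in range(size):
            i = r * size + c
            if c + 1 < size:
                w[i, i + 1] = w[i + 1, i] = 1.0
            if r + 1 < size:
                w[i, i + size] = w[i + size, i] = 1.0
    return w

def standardized_i(z, w, s0, s1, s2):
    """Standardized Moran's I using the randomization mean and variance."""
    m = z.size
    d = z - z.mean()
    ss = d @ d
    i_obs = (d @ w @ d) / ((ss / m) * s0)
    b2 = m * (d ** 4).sum() / ss ** 2
    mean = -1.0 / (m - 1)
    a = m * ((m * m - 3 * m + 3) * s1 - m * s2 + 3 * s0 ** 2)
    b = b2 * ((m * m - m) * s1 - 2 * m * s2 + 6 * s0 ** 2)
    var = (a - b) / ((m - 1) * (m - 2) * (m - 3) * s0 ** 2) - mean ** 2
    return (i_obs - mean) / np.sqrt(var)

def rejection_rates(pop, rate=1e-4, runs=500, seed=0, size=20):
    """Share of null simulations in which |I_std| > 1.96 for I_r and I_PR."""
    rng = np.random.default_rng(seed)
    w = rook_weights(size)
    s0 = w.sum()
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=0) + w.sum(axis=1)) ** 2).sum()
    rejected = {"Ir": 0, "IPR": 0}
    for _ in range(runs):
        n = rng.poisson(rate * pop).astype(float)
        n_hat = pop * n.sum() / pop.sum()   # fitted counts of the null log-rate model
        values = {"Ir": n / pop, "IPR": (n - n_hat) / np.sqrt(n_hat)}
        for name, z in values.items():
            if abs(standardized_i(z, w, s0, s1, s2)) > 1.96:
                rejected[name] += 1
    return {name: count / runs for name, count in rejected.items()}

# Pattern (a5): dense everywhere except one sparse cluster centred at (5, 5).
u, v = np.meshgrid(np.arange(1, 21), np.arange(1, 21), indexing="ij")
sparse = (u - 5) ** 2 + (v - 5) ** 2 <= 9
pop_a5 = np.where(sparse, 1.0, 1000.0).ravel() * 1e5
rates = rejection_rates(pop_a5)
```

The rate-based statistic should reject far more often than the nominal 5% here, while the Pearson residual version should stay near it, in line with the (a5) row of Table 2.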


Application

Data and variables

County-level breast cancer and at-risk population data were obtained from the Kentucky cancer registry for the years 1996–2000. The data set reports breast cancer cases according to their developmental stage as follows: 0, a benign tumor; 1, an in situ tumor; 2, a localized tumor; 3, a regional tumor; 4, a distant tumor; and 5, usually used to code patients who died with later stage disease without an autopsy report on file. For the purpose of our analysis, we deleted the stage 0 cases. Following the U.S. Surveillance, Epidemiology, and End Results (SEER) Program definitions, we regrouped the in situ and localized tumors as early stage and the regional, distant, and unknown tumors as late stage. Generally, early-stage breast carcinomas are confined to the breast and can often be treated successfully, whereas late-stage carcinomas tend to spread beyond the breast and are often fatal. There were a total of 16,055 breast cancer cases during this period, and 80% of the cases were between age 35 and 75. On average, there were 96.7 per 100,000 women diagnosed at an early stage and 36.9 per 100,000 at a later stage. If the overall breast cancer incidence rate is constant across counties, the early-stage breast cancer rate should, theoretically, be negatively related to the late-stage breast cancer rate.

Based on 10-year age-group incidence data during the same period, we calculated the indirectly standardized incidence ratio (SIR), or relative risk, for each county (Waller and Gotway 2004, p. 15), and the results are shown in Figs. 2 and 3 in six quantile classes. We observed that a strip of counties along the boundary between Appalachian and non-Appalachian regions had an elevated early-stage breast cancer risk, as did the westernmost counties. With regard to late-stage breast cancer, there was a cluster of counties with a high incidence rate around the northeastern Appalachian area; counties along the southern border of the state also had a higher relative risk.

Figure 2. Early-stage standard relative risk in Kentucky.

Figure 3. Late-stage standard relative risk in Kentucky.

As screening for early-stage breast cancer tends to be associated with age, socioeconomic conditions, and access to health care within a geographic area (Freeman 1989), we included additional county variables while testing spatial clustering for both early-stage and late-stage breast cancer incidence. We obtained county population and socioeconomic data from the 2000 U.S. Census. Age is expected to be positively associated with breast cancer incidence, and we obtained median age and population age groups as potential control variables. The socioeconomic conditions in a county can be related to breast cancer in two ways (Bradley, Given, and Robert 2001). On the one hand, breast cancer is more prevalent among White women or those with higher socioeconomic status, and so counties that have a higher median family income (MEDFINC) are expected to have greater breast cancer incidence rates. On the other hand, women who have a higher socioeconomic status tend to be more aware of and more able to afford breast cancer screening than are those who have a lower socioeconomic status. Consequently, although counties that have better socioeconomic conditions may have a higher incidence rate of early-stage disease, they may not necessarily have a higher incidence rate of late-stage disease than do counties with worse socioeconomic conditions (Roche, Skinner, and Weinstein 2002).

We relied on several other data sources for access-to-care measures. We obtained breast cancer screening rates from the 1997 and 1998 Behavioral Risk Factor Surveillance Systems (BRFSS). Owing to confidentiality concerns, the original release of cancer screening levels had some counties grouped together, and so we divided rates into tertiles of high (H-screening: >70%), middle (M-screening: 65–70%), and low (L-screening: <65%).
A higher breast cancer screening level (primarily by mammography) is expected to be associated with higher early-stage incidence rates, and negatively associated with late-stage rates. In the preliminary analysis, we found that the differences between the low and middle tertiles were minimal, and we grouped them together in the final analysis. We also obtained the 1998 population-to-primary care physician ratio (POP/PMD) from the Kentucky Department of Public Health; a lower ratio indicates that a physician can pay more attention to each patient. As breast cancer screening is most frequently recommended in a primary care setting, having a greater number of primary care physicians should help to reduce the incidence of all stages of breast cancer. For this reason, we expected the population-to-physician ratio to be negatively associated with both early-stage and late-stage breast cancer rates.

Finally, we used data from a geographic information system to derive geographic access measures for Kentucky counties. We used the TIGER file from the 2000 U.S. Census to derive a measure of access to major highways. A county is coded 1 if a major national highway (HWY) passes through it, and 0 otherwise. It was expected that highway access would increase access to health care facilities and increase early-stage breast cancer diagnoses. We also divided counties into those within and those outside the Appalachian region (Appalachian), as encircled by the bold line in Figs. 2 and 3. Counties within the Appalachian region generally are economically distressed and medically underserved, and the all-cause mortality rate tends to be much higher within the region (Haaga 2004; Mather 2004).

Analysis

To test Moran’s I for spatial clustering, we first computed Moran’s I from the Pearson residuals (IPR) and from the deviance residuals (IDR) of the null model without any covariates. Here, IPR and IDR in the null loglinear model correspond to the traditional Moran’s I without any covariates. We then introduced explanatory variables in the so-called ecological model. In the preliminary analysis, we found that college education, poverty rate, and MEDFINC were highly correlated; we used MEDFINC in the final analysis because it was the most significant variable in terms of the likelihood ratio test. We also experimented with different age variables and found that the proportions of age 40–64 together with age 65 and over were not significant. We selected median age because it was consistently significant in the ecological models for both stages; the greater the age, the greater the likelihood of breast cancer. We expected that the null model would indicate some spatial clustering through significant spatial autocorrelation and that the autocorrelation would weaken or disappear once the explanatory variables were introduced. As most explanatory variables in the literature were developed for the early stage, we expected their effects on the late stage to be weaker. If the autocorrelation persisted or could not be explained by the ecological model, our task was to locate spatially clustered counties and provide our findings to epidemiologists and cancer specialists for further identification of the etiologies associated with breast cancer clusters.

We used a spatial mixed model to search for high-value and low-value spatial clusters by including additional local spatial association terms, as demonstrated by Lin (2003). The method uses the ith vector of the spatial weight matrix as an indicator: an element is 1 if the corresponding region is adjacent to region i or is region i itself, and 0 otherwise. If the term for the ith vector significantly reduces the deviance, it indicates a local association or cluster centered around the ith county. If the association is positive, it suggests a high-value cluster. If the association is negative, it suggests a low-value cluster (Lin and Zhang 2004). After controlling for pockets of high-value and low-value clustering and for ecological covariates, we would then expect an insignificant residual Moran’s I.
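The residual test described above can be sketched for the null log-rate model as follows. This is a minimal illustration on toy data (six hypothetical regions in a row), not the Kentucky counties; the function names are illustrative, and the p-value is obtained by Monte Carlo permutation rather than the asymptotic normal approximation the article relies on. For an ecological model, fitted counts from a Poisson regression would take the place of `fitted`.

```python
import math
import random

def null_log_rate_fit(counts, pops):
    """Fitted counts under the null log-rate model (one common rate)."""
    rate = sum(counts) / sum(pops)
    return [rate * n for n in pops]

def pearson_residuals(counts, fitted):
    """(observed - fitted) / sqrt(fitted) for Poisson counts."""
    return [(y - mu) / math.sqrt(mu) for y, mu in zip(counts, fitted)]

def deviance_residuals(counts, fitted):
    """Signed square root of each region's Poisson deviance contribution."""
    out = []
    for y, mu in zip(counts, fitted):
        term = y * math.log(y / mu) if y > 0 else 0.0
        d = 2.0 * (term - (y - mu))
        out.append(math.copysign(math.sqrt(max(d, 0.0)), y - mu))
    return out

def morans_i(values, w):
    """Moran's I of `values` for a spatial weight matrix `w` (list of rows)."""
    m = len(values)
    mean = sum(values) / m
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)
    cross = sum(w[i][j] * z[i] * z[j] for i in range(m) for j in range(m))
    return (m / s0) * cross / sum(v * v for v in z)

def permutation_p_value(values, w, n_perm=999, seed=0):
    """Observed Moran's I and a one-sided (upper-tail) permutation p-value."""
    rng = random.Random(seed)
    observed = morans_i(values, w)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, w) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data (assumed): six regions in a row, neighbours share a border.
w = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
pops = [1000, 1200, 900, 1500, 1100, 1300]
counts = [30, 34, 25, 15, 12, 14]

fitted = null_log_rate_fit(counts, pops)
i_pr, p_pr = permutation_p_value(pearson_residuals(counts, fitted), w)
i_dr, p_dr = permutation_p_value(deviance_residuals(counts, fitted), w)
print(f"IPR = {i_pr:.3f} (p = {p_pr:.3f}), IDR = {i_dr:.3f} (p = {p_dr:.3f})")
```

Because the residuals are scaled by the fitted counts, regions with very different populations enter the statistic on a comparable footing, which is the property the two residual-based versions of Moran’s I are meant to provide.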

Results

Table 3 lists Moran’s I coefficients and the t values of ecological variables from the early-stage log-rate models. In the null model, both the Pearson residual Moran’s IPR and the deviance residual Moran’s IDR were positive and significant, suggesting a clustering tendency. Once the explanatory variables were introduced into the ecological model, however, IPR and IDR both became insignificant based on their corresponding residuals. Hence, the spatial clustering tendency in the null model reflected the spatial patterning of age structure, socioeconomic status, and access to care. In particular, counties with an older median age, a high level of breast cancer screening, or easy highway access were associated with higher detection rates of early-stage breast cancer, whereas POP/PMD and location in the Appalachian region were negatively associated with detection rates. These results were all consistent with our expectations and the existing literature. Finally, county MEDFINC had a negative and weak (P value < 0.10) association with the early-stage breast cancer detection rate. While it has been reported widely that women in less-developed counties, such as those in the Appalachian area of Kentucky, are less likely to have their breast cancer diagnosed early (Gregorio et al. 2002), it seemed counterintuitive to us that a higher county MEDFINC would be associated with a lower county rate of early-stage breast cancer. When we added only MEDFINC to the null model, its coefficient was positive and significant. The weak association, therefore, may reflect the effect that remains once age, the level of breast cancer screening, and other access-to-care variables are taken into account.

Table 3 Early-Stage Breast Cancer Log-Rate Models

                          Null                    Ecological
                   Coefficient   t value    Coefficient   t value
(Intercept)           6.876      703.68        8.064       36.51
Median age                                     0.0364       7.76
MEDFINC                                        0.006        1.68
HWY                                            0.098        3.72
Appalachian                                    0.269        8.29
High screening                                 0.081        7.06
POP/PMD                                        0.041        4.82
Clustering tests
Moran’s IPR           0.177        3.41        0.061        1.24
Moran’s IDR           0.179        3.44        0.063        1.28
Summary statistics
G²                    597                      278
df                    119                      113

NOTE: G² stands for likelihood ratio χ², and t value is the ratio of the estimated value to its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income.

Turning to the results from the late-stage log-rate models (Table 4), we found that Moran’s IPR and IDR were significant for both the null and ecological models; the explanatory variables in the ecological model were insufficient to account for the clustering tendency of the late-stage incidence rates. Three variables remained significant: median age, population-to-physician ratio, and high screening level. The coefficients for median age and population-to-physician ratio remained consistent with the early-stage model. However, the late-stage rate became negatively associated with a high screening level. This result was consistent with our expectation that a high screening level was associated with high rates of early-stage diagnoses and low rates of late-stage disease.

Table 4 Late-Stage Breast Cancer Log-Rate Models

                          Null                    Ecological                  Mixed
                   Coefficient   t value    Coefficient   t value    Coefficient   t value
(Intercept)           7.949      476.00        8.634       23.25        8.090       28.32
Median age                                     0.024        3.07        0.007        1.11
MEDFINC                                        0.0001       0.009
HWY                                            0.047        1.32
Appalachian                                    0.043        0.78
H-screening                                    0.132        6.64        0.137        7.27
POP/PMD                                        0.090        5.79        0.089        6.13
Barran-cluster                                                          0.293        3.34
Greenup-cluster                                                         0.287        3.49
Marshall-cluster                                                        0.273        3.504
Union-cluster                                                           0.336        3.49
Clustering tests
Moran’s IPR           0.172        3.25        0.132        2.52        0.083        1.66
Moran’s IDR           0.182        3.42        0.134        2.59        0.077        1.54
Summary statistics
G²                    392                      294                      251
df                    119                      113                      112

NOTE: G² stands for likelihood ratio χ², and t value is the ratio of the estimated value to its standard error. POP/PMD, population-to-primary care physician ratio; MEDFINC, median family income.

To further pinpoint the core of clustered counties unexplained by the ecological model, we deleted insignificant variables except median age from the ecological model and applied a stepwise regression by including local spatial association terms. We identified four local association terms, none of which overlapped geographically. Each core county and its adjacent counties constituted a cluster, and including the core county only would not significantly reduce the clustering effect. Except for a cool spot around Barran County, the other three core counties represented the centers of three elevated late-stage clusters (Fig. 4). For instance, the rate in Union County and its adjacent counties was 1.415 times the rate of the counties not included in the clusters. After these clusters were included in the final model, both IPR and IDR were found to be insignificant, suggesting that the clustering tendency disappeared once these clusters were accounted for in the model. It is worth mentioning that, if we dropped Marshall County, the median age effect would be significant, but the clustering effect would still remain. This suggests that a greater proportion of aging population around Marshall and its adjacent counties contributed to this cluster. It is also worth mentioning that Jefferson County, where Louisville is located, had a very low late-stage rate, but its adjacent counties each had a relatively high rate. Although including Jefferson County in the final model would reduce the log-likelihood ratio and the t values for IPR and IDR, counties around Jefferson County would not constitute a cluster.¹

Figure 4. High-value and low-value clusters of late-stage breast cancer in Kentucky.
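The gain from moving from the null to the ecological model can be read as a likelihood ratio test on the drop in G². The sketch below uses the G² and df values printed in Tables 3 and 4, and assumes that G² is the residual deviance and that the difference between nested Poisson models is approximately χ² distributed. The helper `chi2_sf_even_df` is a hypothetical name; its closed form holds only for even degrees of freedom, which is enough here because both comparisons differ by 6 df.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # (x/2)^j / j!
        total += term
    return math.exp(-half) * total

def lr_test(g2_small, df_small, g2_large, df_large):
    """Deviance drop, df difference, and p-value for two nested models."""
    drop = g2_small - g2_large
    ddf = df_small - df_large
    return drop, ddf, chi2_sf_even_df(drop, ddf)

# Null versus ecological model, values as printed in Tables 3 and 4.
early = lr_test(597, 119, 278, 113)
late = lr_test(392, 119, 294, 113)
print("early stage: drop = %d on %d df, p = %.3g" % early)
print("late stage:  drop = %d on %d df, p = %.3g" % late)
```

A significant drop in G² says only that the covariates improve the fit; whether the remaining residuals are still spatially clustered is what IPR and IDR assess.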

Even though the main purpose of using IPR and IDR was to account for heteroskedasticity, analyzing IPR and IDR during the model-fitting process yielded new insights. When the five explanatory variables were sufficient to reduce the spatial clustering effect, as for early-stage breast cancer, they highlighted the underlying socioeconomic and health access processes that operate in geographic space. When known explanatory variables were not sufficient to reduce the spatial clustering effect, as in the case of late-stage breast cancer, local spatial association terms could be used to further identify local clusters. The identification of spatial clusters may help to further reveal unique geographic characteristics in the clustered areas associated with cancer disparities. Without measuring IPR or IDR, we would not be able to evaluate the potential clustering effect in traditional log-rate or Poisson regression models.
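The search for local spatial association terms can be illustrated in a deliberately simplified setting. The sketch below adds a single inclusive-neighbourhood indicator to a covariate-free log-rate model, where the fit has a closed form (one rate inside the candidate cluster, one outside). The article's mixed model also carries ecological covariates and would need an iterative Poisson regression fit, so this is an assumption-laden toy rather than the authors' procedure. The data and function names are hypothetical, and the code assumes positive counts both inside and outside every candidate cluster.

```python
import math

def inclusive_indicator(w, i):
    """1 for region i and each region adjacent to it, 0 elsewhere."""
    return [1 if (j == i or w[i][j]) else 0 for j in range(len(w))]

def poisson_deviance(counts, fitted):
    """G^2 = 2 * sum(y * log(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(counts, fitted):
        total += (y * math.log(y / mu) if y > 0 else 0.0) - (y - mu)
    return 2.0 * total

def fit_with_indicator(counts, pops, ind):
    """Closed-form fit of log(rate) = a + b * ind; returns b and fitted counts."""
    y_in = sum(y for y, u in zip(counts, ind) if u)
    n_in = sum(n for n, u in zip(pops, ind) if u)
    r_in = y_in / n_in
    r_out = (sum(counts) - y_in) / (sum(pops) - n_in)
    fitted = [n * (r_in if u else r_out) for n, u in zip(pops, ind)]
    return math.log(r_in / r_out), fitted

def best_local_term(counts, pops, w):
    """Core region whose indicator gives the largest deviance reduction."""
    rate = sum(counts) / sum(pops)
    null_dev = poisson_deviance(counts, [rate * n for n in pops])
    best = None
    for i in range(len(w)):
        coef, fitted = fit_with_indicator(counts, pops, inclusive_indicator(w, i))
        drop = null_dev - poisson_deviance(counts, fitted)
        if best is None or drop > best[1]:
            best = (i, drop, coef)
    return best

# Toy data (assumed): seven regions in a row with elevated counts at one end.
w = [[1 if abs(i - j) == 1 else 0 for j in range(7)] for i in range(7)]
pops = [1000] * 7
counts = [10, 11, 9, 10, 24, 27, 25]

core, drop, coef = best_local_term(counts, pops, w)
print(f"core region {core}: deviance drop {drop:.2f}, "
      f"coefficient {coef:.3f}, rate ratio {math.exp(coef):.2f}")
```

A positive coefficient marks a high-value cluster and a negative one a low-value cluster; exponentiating the coefficient gives the ratio of the rate inside the cluster to the rate outside it.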

Conclusions

Although the asymptotic validity of permutation tests has been demonstrated in the literature, and Pearson residuals and deviance residuals of loglinear models are asymptotically normal, no one has evaluated and applied them in the spatial context. In the current study, we have bridged Pearson residuals and deviance residuals of loglinear models with the permutation test of Moran’s I under the asymptotic normality assumption. Our simulation study showed that both test statistics, IPR and IDR, were effective in reducing the inflated variance caused by heterogeneous populations, and both had an acceptable type I error rate. In addition, we applied both tests to a set of log-rate models, or Poisson regressions, for early-stage and late-stage breast cancer incidence data, together with socioeconomic and access-to-care data in Kentucky. The results showed that socioeconomic and access-to-care variables were sufficient to account for spatial clustering of early-stage breast carcinomas, with access-to-care measures, such as breast cancer screening and number of primary care providers, being more persistent than county MEDFINC. After controlling for age, access-to-care measures, and regional distress factors in the Appalachian counties, the purported positive association between higher socioeconomic status and early-stage breast cancer could be substantially weakened. For late-stage carcinomas, two salient and persistent factors were the level of breast cancer screening and POP/PMD. In contrast to the finding that a high screening level was associated with a high incidence rate of early-stage breast cancer, the late-stage incidence rate was negatively associated with breast cancer screening level. This result confirmed our expectation: a high screening level is associated with a high incidence rate of early-stage diagnoses, which in turn reduces late-stage incidence rates.
When the two access variables failed to reduce the spatial clustering tendencies of late-stage breast cancer, we searched for local spatial associations based on the likelihood ratio test. We located four clusters: one low-value cluster around Barran County and three high-value clusters. Two of the high-value clusters were adjacent to each other near the western corner of Kentucky. These unexplained clusters provided the basis for further investigation of the etiological and ecological factors of late-stage breast cancer in Kentucky. It should be pointed out that, even though we included median age, the age effect may not be fully accounted for in our model; future work should explore ways to account for both age and ecological effects while testing for spatial clustering.

Spatial regressions have been widely used, but their use with permutation tests of residuals, in either linear or loglinear models, is rarely seen. An advantage of the loglinear residual permutation test over the linear residual permutation test is that the former can account for potential spatially heterogeneous populations, which makes it a viable alternative in the log-rate modeling of disease rates, as demonstrated in our study. The method is expected to complement some spatial cluster tests, such as the spatial scan statistic (Kulldorff 1997) and the G statistic (Getis and Ord 1992). In addition, the ability to show spatial clustering in IPR and IDR is complementary to disease mapping, which intends to display true disease risks while controlling for heterogeneous populations and regional risk factors (Lawson and Clark 2002). Finally, a local version of loglinear residuals can be developed as an exploratory tool to complement other local indicators of spatial association (Anselin 1995).

Acknowledgements

Data used in this publication were provided by the Kentucky Cancer Registry, Lexington, KY. We would like to acknowledge helpful comments from Linda Pickle and Robert Hanham. We would also like to thank three reviewers and the editor for their comments and suggestions.

Note

1 We also analyzed late-stage cases versus all cases at the patient level, and the results were consistent. Population-to-physician ratio, breast cancer screening, and proximity to a highway were associated with a lower likelihood of late-stage diagnosis, while residing in the Appalachian region was associated with a higher likelihood. Counties around Jefferson County had an excess of late-stage breast cancer diagnoses among breast cancer patients.

References

Agresti, A. (1990). Categorical Data Analysis. New York: Wiley.
Anselin, L. (1995). “Local Indicators of Spatial Association-LISA.” Geographical Analysis 27, 93–115.
Assuncao, R., and E. Reis. (1999). “A New Proposal to Adjust Moran’s I for Population Density.” Statistics in Medicine 18, 2147–62.
Barry, J., and N. Breen. (2005). “The Importance of Place of Residence in Predicting Late-Stage Diagnosis of Breast or Cervical Cancer.” Health and Place 11, 15–29.
Besag, J., and J. Newell. (1991). “The Detection of Clusters in Rare Diseases.” Journal of the Royal Statistical Society A 154, 143–55.
Best, N., K. Ickstadt, and R. Wolpert. (2000). “Spatial Poisson Regression for Health and Exposure Data Measured at Disparate Resolutions.” Journal of the American Statistical Association 95, 1076–88.
Bradley, C. J., C. W. Given, and C. Robert. (2001). “Disparities in Cancer Diagnosis and Survival.” Cancer 91, 178–88.
Cliff, A. D., and J. K. Ord. (1981). Spatial Processes: Models and Applications. London: Pion.
Elliott, P., J. Wakefield, N. Best, and D. Briggs. (2000). Spatial Epidemiology. New York: Oxford University Press.
Freeman, H. P. (1989). “Cancer and the Socioeconomically Disadvantaged.” CA: A Cancer Journal for Clinicians 39, 263–95.
Getis, A., and J. Ord. (1992). “The Analysis of Spatial Association by Use of Distance Statistics.” Geographical Analysis 24, 189–206.
Gregorio, D. I., M. Kulldorff, L. Barry, and H. Samociuk. (2002). “Geographic Differences in Invasive and in situ Breast Cancer Incidence According to Precise Geographic Coordinates, Connecticut 1991–95.” International Journal of Cancer 100, 194–98.
Griffith, D., and R. Haining. (2006). “Beyond Mule Kicks: The Poisson Distribution in Geographical Analysis.” Geographical Analysis 38, 123–39.
Haaga, J. H. (2004). Educational Attainment in Appalachia. Population Reference Bureau ARC Series, Vol. 5. Washington, DC: Population Reference Bureau.
Jacqmin-Gadda, H., D. Commenges, C. Nejjari, and J. Dartigues. (1997). “Testing of Geographical Correlation with Adjustment for Explanatory Variables: An Application to Dyspnoea in the Elderly.” Statistics in Medicine 16, 1283–97.
Kulldorff, M. (1997). “A Spatial Scan Statistic.” Communications in Statistics, Theory and Methods 26, 1481–96.
Lawson, A. B., and A. Clark. (2002). “Spatial Mixture Relative Risk Models Applied to Disease Mapping.” Statistics in Medicine 21, 359–70.
Lin, G. (2003). “A Spatial Logit Association Model for Cluster Detection.” Geographical Analysis 35, 329–40.
Lin, G., and T. Zhang. (2004). “A Method for Testing Low-Value Spatial Clustering for Rare Disease.” Acta Tropica 91, 279–89.
Mather, M. (2004). Households and Families in Appalachia. Population Reference Bureau ARC Series, Vol. 4. Washington, DC: Population Reference Bureau.
Moran, P. A. P. (1950). “A Test for the Serial Independence of Residuals.” Biometrika 37, 178–81.
Oden, N. (1995). “Adjusting Moran’s I for Population Density.” Statistics in Medicine 14, 17–26.
Roche, L. S., R. Skinner, and R. B. Weinstein. (2002). “Use of a Geographic Information System to Identify and Characterize Areas with High Proportions of Distant Stage Breast Cancer.” Journal of Public Health Management and Practice 8, 26–32.
Schmoyer, R. L. (1994). “Permutation Tests for Correlation in Regression Errors.” Journal of the American Statistical Association 89, 1507–16.
Sen, A. (1976). “Large Sample-Size Distribution of Statistics Used in Testing for Spatial Correlation.” Geographical Analysis 9, 175–84.
Waldhor, T. (1996). “The Spatial Autocorrelation Coefficient Moran’s I Under Heteroscedasticity.” Statistics in Medicine 15, 887–92.
Waller, L. A., and C. A. Gotway. (2004). Applied Spatial Statistics for Public Health Data. New York: Wiley.
Yabroff, Y. R., and L. Gordis. (2003). “Does Stage at Diagnosis Influence the Observed Relationship Between Socioeconomic Status and Breast Cancer Incidence, Case-Fatality, and Mortality?” Social Science and Medicine 57, 2265–79.
Zhang, T., and G. Lin. (2006). “A Supplemental Indicator of High-Value or Low-Value Spatial Clustering.” Geographical Analysis 38, 211–26.
