Bayesian Probit Regression Models for Spatially-Dependent Categorical Data
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By Candace Berrett, B.S., M.S. Graduate Program in Statistics
The Ohio State University
2010
Dissertation Committee:
Catherine A. Calder, Advisor
L. Mark Berliner
Peter F. Craigmile
Elizabeth A. Stasny

© Copyright by
Candace Berrett
2010

ABSTRACT
Data augmentation/latent variable methods have been widely recognized for facilitating model fitting in the Bayesian probit regression model. First proposed by Albert and Chib (1993) for independent binary and multi-category data, the latent variable representation of the Bayesian probit regression model allows model fitting to be performed using a simple Gibbs sampler and, for more than two categories, also allows the so-called independence of irrelevant alternatives assumption required by the logistic regression model to be relaxed (Hausman and Wise, 1978). To accommodate residual spatial dependence, the latent variable specification of the Bayesian probit regression model can be extended to incorporate standard parametric covariance models typically used in analyses of spatially-dependent continuous data, defining what we term the Bayesian spatial probit regression model. In this dissertation, we develop and extend the Bayesian spatial probit regression model by (i) introducing efficient model-fitting algorithms, (ii) deriving classification methods based on the model, and (iii) extending the model to the multi-category spatial setting.
Statistical models for spatial data are notoriously cumbersome to fit, necessitating the availability of fast and efficient model-fitting algorithms. To improve the efficiency of the Gibbs sampler used to fit the Bayesian regression model for independent categorical response variables, Imai and van Dyk (2005) propose introducing a working parameter into the model and compare various data augmentation strategies resulting from different treatments of the working parameter. We build on this work by investigating the efficiency of modified and extended versions of conditional and marginal data augmentation Markov chain Monte Carlo (MCMC) algorithms for the spatial probit regression model, focusing on the special case of binary spatially-dependent response variables.
Within the classification literature, methods that exploit spatial dependence are limited.
We show how a spatial classification rule can be derived from the Bayesian spatial probit regression model. In addition, we compare our proposed spatial classifier to various other classifiers in terms of training and test error rates using a land-cover/land-use data set.
When extending the spatial probit regression model to the multi-category setting, care must be taken to ensure that model parameters are estimable and interpretable. Considering three types of categorical and spatial covariate information, we discuss various specifications of the latent variable mean structure and the associated parameter interpretations. Additionally, we explore the specification of the latent variable cross space-category dependence structure and discuss how data augmentation MCMC strategies for fitting the Bayesian spatial probit regression model can be extended to the multi-category setting.
Dedicated to my parents, Bob and Nanette, and siblings, Tenille, Nat, Preston, MeChel, and Taylor.
ACKNOWLEDGMENTS
First and foremost, I would like to thank my advisor, Dr. Kate Calder, who over the last four and a half years has devoted a substantial amount of time and effort in training me to be a well-rounded statistician. She has provided me with numerous opportunities to learn and grow through research, teaching, mentoring, and collaboration. She has also become a good friend, whom I admire professionally and personally, and I am grateful for her example and support.
I would like to thank my committee members: Dr. Mark Berliner for his comments on my research, his help with job and fellowship applications, and for allowing me to laugh in his class; Dr. Elizabeth Stasny for her comments on my research, her help with job and fellowship applications, her support as graduate chair, and in encouraging me to come to
Ohio State; and Dr. Peter Craigmile for his valuable comments and contributions to my research.
I would like to thank Dr. Darla Munroe and Dr. Ningchuan Xiao of the Department of
Geography for their generous assistance in obtaining and understanding the land cover data used in this work.
I would like to thank the other professors in the Department of Statistics who have provided guidance and support during my time at Ohio State: Dr. Doug Wolfe, Dr. Tao Shi,
Dr. Chris Hans, Dr. Jackie Miller, Dr. Steve MacEachern, and Dr. Noel Cressie.
I would like to thank Lisa Van Dyke for her help in answering my many graduation questions and in pulling together the final documents of this dissertation. I would also like to thank Terry England for her help with all my travel and posters.
Support for this research was provided by grants from NASA (NNG06GD31G) and the
NSF (ATM-0934595).
Finally, I would like to thank my family and many friends, who all believed in me when I didn’t believe in myself; and God, for giving me strength and understanding, and providing me with opportunities to grow.
VITA
1983 ...... Born - Ogden, Weber, Utah, USA
2005 ...... B.S. Actuarial Science, cum laude, Brigham Young University
2005 - 2006 ...... University Fellow, Graduate School, The Ohio State University
2005 - 2006, 2010 ...... Teaching Assistant, Department of Statistics, The Ohio State University
2007 ...... M.S. Statistics, The Ohio State University
2007 - 2010 ...... Research Assistant, Department of Statistics, The Ohio State University
2009 ...... Graduate Fellow, Statistical and Applied Mathematical Sciences Institute
PUBLICATIONS
Research Publications
Xiao, N., Shi, T., Calder, C.A., Munroe, D.K., Berrett, C., Wolfinbarger, S., and Li, D. (2008) “Spatial Characteristics of the Difference between MISR and MODIS Aerosol Optical Depth Retrievals over Mainland Southeast Asia,” Remote Sensing of Environment, DOI: 10.1016/j.rse.2008.07.011.
FIELDS OF STUDY
Major Field: Statistics
TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapters:
1. Introduction
   1.1 Background and Motivation
   1.2 Modeling Categorical Spatial Data
       1.2.1 The Spatial Generalized Linear Model
       1.2.2 The Spatial Generalized Linear Mixed Model
       1.2.3 Indicator Kriging
       1.2.4 The Autologistic Model
       1.2.5 The Bayesian Spatial Probit Regression Model
   1.3 Overview of Contributions
   1.4 Illustrative Data Set
2. Bayesian Spatial Probit Regression
   2.1 The Bayesian Probit Regression Model
       2.1.1 Albert and Chib's Data Augmentation Strategy
       2.1.2 Multi-Category and Multivariate Extensions
   2.2 The Bayesian Spatial Probit Regression Model
       2.2.1 Model Specification
       2.2.2 Parameterization of the Spatial Correlation Matrix
3. Data Augmentation MCMC Strategies
   3.1 Data Augmentation MCMC Strategies
       3.1.1 Conditional versus Marginal Data Augmentation
       3.1.2 Partially Collapsed Algorithms
       3.1.3 Full Conditional Distributions
   3.2 Simulation Study
       3.2.1 Simulation Set-up
       3.2.2 Simulation Results
   3.3 Application
   3.4 Summary
4. The Bayesian Spatial Probit Regression Model as a Tool for Classification
   4.1 The Classification Problem
   4.2 GLM-Based Classification
       4.2.1 Non-Spatial GLM-Based Classification
       4.2.2 Spatial GLM-Based Classification
   4.3 Alternative Classification Methods
       4.3.1 Discriminant Analysis
       4.3.2 Support Vector Machines
       4.3.3 k-Nearest Neighbors
   4.4 Comparison of Classification Methods
       4.4.1 Parameter Estimation
       4.4.2 Classification Errors
   4.5 Summary
5. Bayesian Spatial Multinomial Probit Regression
   5.1 The Bayesian Spatial Multinomial Probit Regression Model
       5.1.1 Latent Mean Specification
       5.1.2 Parameterization of the Space-Category Covariance Matrix
   5.2 Model-Fitting
       5.2.1 Data Augmentation MCMC Algorithms
   5.3 Summary

6. Contributions and Future Work
LIST OF TABLES
3.1 This table lists the steps in each of the data augmentation algorithms. The first portion shows the non-collapsed data augmentation algorithms introduced in Section 3.1.1. The second portion shows the partially collapsed data augmentation algorithms introduced in Section 3.1.2.

3.2 Scenarios used to compare the marginal and conditional data augmentation algorithms.

3.3 Autocorrelations of the sample paths of $\beta_1$ and $\rho$ for the land cover data analysis.

4.1 Fitted values for the covariance function parameters for both class $C_0$ and class $C_1$.

4.2 Tuning parameter values for each classification method. The optimal value of the tuning parameter is listed along with the CVE associated with this value. The optimal values were chosen by minimizing the five-fold CVE.

4.3 Training and test errors for the SE Asia land cover data obtained using each of the classification methods discussed in this chapter.
LIST OF FIGURES
1.1 Land cover over Southeast Asia, covering the region bounded by 17° to 21°N and 98° to 105°E. The data were taken from the MODIS Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005.

1.2 Elevation (in meters) over the region bounded by 17° to 21°N and 98° to 105°E.

1.3 Standardized value of the measured distance to the nearest major road over the region bounded by 17° to 21°N and 98° to 105°E.

1.4 Standardized value of the measured distance to the coast over the region bounded by 17° to 21°N and 98° to 105°E.

1.5 Standardized value of the measured distance to the nearest big city over the region bounded by 17° to 21°N and 98° to 105°E.
2.1 Illustration of neighborhood structures for a regular grid. The grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with the empty squares for (a) a first-order neighborhood structure and (b) a second-order neighborhood structure.
3.1 Histograms and trace plots for $\beta$ and $\rho$ under Scenario 1 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.2 Histograms and trace plots for $\beta$ and $\rho$ under Scenario 2 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.3 Histograms and trace plots for $\beta$ and $\lambda$ under Scenario 3 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.4 Histograms and trace plots for $\beta$ and $\lambda$ under Scenario 4 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.5 Histograms and trace plots for $\beta$ under Scenario 5 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
3.6 Autocorrelation and partial autocorrelation in the sample paths of $\beta$ and $\rho$ under Scenario 1 for each of the three corresponding algorithms.

3.7 Autocorrelation and partial autocorrelation in the sample paths of $\beta$ and $\rho$ under Scenario 2 for each of the three corresponding algorithms.

3.8 Autocorrelation and partial autocorrelation in the sample paths of $\beta$ and $\lambda$ under Scenario 3 for each of the three corresponding algorithms.

3.9 Autocorrelation and partial autocorrelation in the sample paths of $\beta$ and $\lambda$ under Scenario 4 for each of the three corresponding algorithms.

3.10 Autocorrelation and partial autocorrelation in the sample path of $\beta$ under Scenario 5 for each of the three corresponding algorithms.
3.11 $\sigma$ vs. $\tilde{\beta}$ under Scenario 1 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.12 $\sigma$ vs. $\tilde{\beta}$ under Scenario 2 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.13 $\sigma$ vs. $\tilde{\beta}$ under Scenario 3 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.14 $\sigma$ vs. $\tilde{\beta}$ under Scenario 4 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.15 $\sigma$ vs. $\tilde{\beta}$ under Scenario 5 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
4.1 Illustration of the hyperplane (solid line) and margins (dotted lines) determined by support vector machines separating the two classes (denoted by the two plotting symbols). The hyperplane in (a) is the maximum margin hyperplane and the hyperplane in (b) does not maximize the margin.

4.2 Estimated and fitted variograms for class $C_0$ (left) and $C_1$ (right) fit using the training data set. The numbers 1-4 indicate the estimated variograms for the four covariates, the dots represent the mean of the four variograms, and the line indicates the fitted variogram.
CHAPTER 1
INTRODUCTION
The field of spatial statistics encapsulates a wide array of statistical methodology for analyzing spatial data, or observations associated with particular points or regions in space.
The increasing availability and variety of spatial data has generated an enormous amount of methodological development in the field over the past 20 years, and today the field is one of the most active areas of research in all of statistics (see Banerjee et al., 2004; Waller and Gotway, 2004; Schabenberger and Gotway, 2005; Diggle and Ribeiro Jr., 2007, for a discussion of recent advances). Spatial statistical methods are also increasingly being used in areas of application across the biological, environmental, health, and social sciences, and often specific problems in these fields motivate further methodological developments. In fact, the motivation for much of the research presented in this dissertation is the study of land-cover/land-use change (LCLUC) using satellite-derived land cover observations. Before discussing this motivating application that is used throughout this dissertation, we first provide background and motivation as to why incorporating spatial information is important in statistical analyses of spatially-referenced data in Section 1.1. In Section 1.2 we give an overview of existing spatial statistical methodology for spatially-dependent categorical data. In Section 1.3, we outline the contributions of this dissertation in terms of statistical methodology for analyzing spatially-dependent categorical data using the Bayesian spatial probit regression model. We then discuss the LCLUC motivating application by introducing a specific illustrative data set in Section 1.4.
1.1 Background and Motivation
Let $s_i \in D \subset \mathbb{R}^r$ be an $r \times 1$ vector in Euclidean space. In spatial statistics, typically, $r = 2$ or $3$, indicating two- or three-dimensional space, respectively. Let $Y(s_i) \equiv Y_i$ for $i = 1, \ldots, n$, where, for now, $Y_i$ is a continuous real-valued random variable observed at a point $s_i \in D$. There are three types of spatially-referenced data (Cressie, 1993), defined by the nature of the spatial domain, $D$:

• Geostatistical/point-referenced data, where $s_i$ varies continuously over $D$ and $D$ is a fixed subset of $\mathbb{R}^r$. In this case, $Y(s_i)$ can potentially be observed everywhere within $D$.

• Lattice/gridded data, where $D$ is again a fixed subset of $\mathbb{R}^r$, but in this case, the elements of $D$ are areal units. Thus, $s_i$ may not represent a point in space, but will indicate an areal unit indexed by the point $s_i$.

• Point pattern data, where $D$ is still a subset of $\mathbb{R}^r$, but the $s_i$ are now viewed as random. For point pattern data, the observed $s_i$ are modeled as realizations from a stochastic process, whereas with geostatistical and lattice data, the points are usually assumed to be fixed and non-stochastic.
In this dissertation, we only consider statistical methods for the analysis of geostatistical
and lattice data.
We expect spatially-referenced data to have a specific type of dependence structure, called spatial dependence, where observations near to one another are more similar than observations further apart. Often, a statistical analysis of spatial data will focus on learning about the nature of the spatial dependence or on exploiting it for prediction purposes.
Sometimes, however, we may need to account, or adjust, for spatial dependence when we
want to measure or quantify relationships between two observable quantities of interest. In
particular, in a regression analysis a researcher may be interested in determining how one
variable Y , known as the response variable or dependent variable, varies given a set of
other variables x, known as covariates, independent variables, or explanatory variables.
For example:
• Environmental Health. Researchers and policy makers may be interested in the relationship between exposure to particulate matter or other pollutants and a particular health outcome (e.g., cancer or mortality).
• Real Estate. An economist may be interested in determining factors that contribute
to whether or not a house will sell in a particular period of time.
• Land-Cover/Land-Use Change. Researchers may be interested in determining the
demographic, social, geographical, and political factors contributing to the observed
type of land cover at various locations.
The Classical Linear Regression Model
Before describing the spatial regression model, we first review the linear regression model from both a classical and Bayesian perspective. Let $\{(Y_i, x_i); i = 1, \ldots, n\}$ be a collection of paired observations, where $Y_i \in \mathbb{R}$ is a univariate real-valued response variable or dependent variable, and $x_i$ is a $k \times 1$ vector of continuous or discrete-valued covariates or independent variables. For now, assume the $Y_i$ are conditionally independent for all $i = 1, \ldots, n$ given the independent variables. The linear regression model specifies the following relationship between the components of $x_i$ and the mean of $Y_i$:
$$
E(Y_i) = x_i'\beta,
$$
where $\beta$ is a $k \times 1$ vector of fixed but unknown regression coefficients. The classical linear regression model is often written as
$$
Y_i = x_i'\beta + \epsilon_i, \qquad (1.1)
$$
where $\epsilon_i$ is a random noise error term with mean 0 and variance $\sigma^2$ and is assumed to be independent for all $i = 1, \ldots, n$. It is common to assume that the $\epsilon_i$ are also normally distributed, but that assumption is not necessary.
One method for fitting the linear regression model is maximum likelihood estimation.
When the $\epsilon_i$s are assumed to be normally distributed, the likelihood function is
$$
L(\beta, \sigma^2 | Y) \propto p(Y | x_1, \ldots, x_n, \beta, \sigma^2)
= \prod_{i=1}^n p(Y_i | x_i, \beta, \sigma^2)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(Y_i - x_i'\beta)^2 \right\}, \qquad (1.2)
$$
where $Y = (Y_1, \ldots, Y_n)'$. Maximizing (1.2) with respect to $\beta$ and $\sigma^2$ yields the following maximum likelihood estimator (MLE) for $\beta$:
$$
\hat{\beta}_{MLE} = \left( \frac{1}{n} \sum_{i=1}^n x_i x_i' \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n x_i Y_i \right).
$$
We can write this estimator of $\beta$ in matrix notation by letting $X = (x_1, \ldots, x_n)'$ so that
$$
\hat{\beta}_{MLE} = (X'X)^{-1} X' Y.
$$
The MLE of $\sigma^2$ is
$$
\hat{\sigma}^2_{MLE} = \frac{1}{n} \hat{\epsilon}'\hat{\epsilon},
$$
where $\hat{\epsilon} = Y - X\hat{\beta}_{MLE}$. This likelihood-based model-fitting approach falls within the class of classical or frequentist statistics, and standard inferential statements about the uncertainty in model parameter estimates can be obtained (e.g., confidence intervals, hypothesis tests). Another method for fitting the linear regression model is known as ordinary least squares (OLS), which does not require an assumption about the distribution of the error terms, but instead seeks to minimize the sum of the squared residuals, or the differences between the observed responses $Y_i$ and the predicted values $\hat{Y}_i$. We note that the OLS estimator for $\beta$ is identical to the MLE provided above. As an alternative to these two approaches, we can consider a Bayesian version of the linear regression model, which we discuss below.
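Before turning to the Bayesian model, the following is a minimal R sketch of the estimators above computed on simulated data; the simulated design, variable names, and true values are illustrative assumptions, not taken from the dissertation.

```r
## Minimal sketch of the MLE/OLS computations above on simulated data
## (illustrative only; the design and true parameter values are assumptions).
set.seed(1)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))  # n x k design matrix
beta_true <- c(1, -0.5, 2)
Y <- X %*% beta_true + rnorm(n, sd = 0.8)            # Y = X beta + epsilon

beta_hat   <- solve(t(X) %*% X, t(X) %*% Y)          # (X'X)^{-1} X'Y
eps_hat    <- Y - X %*% beta_hat                     # residuals
sigma2_mle <- sum(eps_hat^2) / n                     # MLE divides by n, not n - k
```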
The Bayesian Linear Regression Model
In Bayesian analyses, we specify prior distributions on the model parameters; for the linear regression model, a prior distribution is needed for $\beta$ and $\sigma^2$, and is denoted $\pi(\beta, \sigma^2)$. A prior distribution is a probability distribution which captures our knowledge about the model parameters prior to observing the data. Then, using the data likelihood and prior distribution, we determine posterior distributions for the model parameters, or the distribution of the model parameters given the data, using Bayes' Theorem. For the linear regression model, the posterior distribution is $\pi(\beta, \sigma^2 | Y)$. Posterior distributions are determined either analytically or numerically using simulation-based techniques such as Markov chain Monte Carlo (MCMC); see Gelman et al. (1995) for a general overview of Bayesian data analysis techniques. In the Bayesian setting, inferences about model parameters are then reported based on summaries of the posterior distributions of model parameters.
Before describing the Bayesian linear regression model, we first discuss the role of the covariates $X$ in the specification of the posterior distribution of the model parameters, following the discussion in Chapter 14 of Gelman et al. (1995). In the Bayesian paradigm, the observed data in a regression analysis include both $Y$ and $X$. Because of this, a Bayesian model will include the specification of a joint distribution on $(Y, X)$, $p(Y, X | \psi_{Y|X}, \psi_X)$, where $\psi_{Y|X}$ are the model parameters associated with the distribution of $Y | X$ (for the Bayesian linear regression model, $\psi_{Y|X} = (\beta, \sigma^2)$) and $\psi_X$ are the parameters associated with $X$. However, the Bayesian linear regression model traditionally assumes that the distribution of $X$ provides no additional information about the conditional distribution of $Y | X$, or that
$$
p(Y, X | \psi_{Y|X}, \psi_X) = p(Y | \psi_{Y|X}, X)\, p(X | \psi_X) \qquad (1.3)
$$
and that $\psi_{Y|X}$ and $\psi_X$ are independent a priori so that
$$
\pi(\psi_{Y|X}, \psi_X) = \pi(\psi_{Y|X})\, \pi(\psi_X). \qquad (1.4)
$$
Thus, a Bayesian regression model inherently assumes that the prior distribution on $X$ provides no information about the regression model parameters, $\psi_{Y|X} = (\beta, \sigma^2)$. To see this, notice that using equations (1.3) and (1.4), we can write the joint posterior distribution of $(\psi_{Y|X}, \psi_X)$ as
$$
\pi(\psi_{Y|X}, \psi_X | Y, X) = \frac{p(Y, X | \psi_{Y|X}, \psi_X)\, \pi(\psi_{Y|X}, \psi_X)}{\iint p(Y, X | \psi_{Y|X}, \psi_X)\, \pi(\psi_{Y|X}, \psi_X)\, d\psi_{Y|X}\, d\psi_X}
$$
$$
= \frac{p(Y | \psi_{Y|X}, X)\, \pi(\psi_{Y|X})\, p(X | \psi_X)\, \pi(\psi_X)}{\int p(Y | \psi_{Y|X}, X)\, \pi(\psi_{Y|X})\, d\psi_{Y|X} \int p(X | \psi_X)\, \pi(\psi_X)\, d\psi_X}
$$
$$
= \pi(\psi_{Y|X} | Y, X)\, \pi(\psi_X | X).
$$
It follows that we can determine the posterior distribution of $\psi_{Y|X}$ by only considering
$$
\pi(\psi_{Y|X} | Y, X) \propto \pi(\psi_{Y|X})\, p(Y | X, \psi_{Y|X}). \qquad (1.5)
$$
To distinguish the roles $X$ and $Y$ play in Bayesian inference on $\psi_{Y|X}$, it is common practice to tacitly condition on $X$ and denote the posterior distribution of $\beta$ and $\sigma^2$ as $\pi(\beta, \sigma^2 | Y)$. We follow this convention throughout this dissertation except in Chapter 4, for reasons that we discuss therein.
Although any probability distribution can be chosen for modeling the error terms of the regression model, typically, the errors are assumed to be normally distributed. Equivalently, it is assumed that
$$
Y | \beta, \sigma^2, X \sim N(X\beta, \sigma^2 I),
$$
where $I$ is the $n \times n$ identity matrix. We then assign prior distributions to the parameters $\beta$ and $\sigma^2$. The standard choice for the Bayesian linear regression model is the noninformative prior distribution
$$
\pi(\beta, \sigma^2 | X) \propto \frac{1}{\sigma^2}.
$$
Using these likelihood and prior distributions results in a posterior distribution that can be decomposed as follows:
$$
\pi(\beta, \sigma^2 | Y) = \pi(\beta | \sigma^2, Y)\, \pi(\sigma^2 | Y),
$$
where
$$
\beta | \sigma^2, Y \sim N(\hat{\beta}_B, \sigma^2 V_B)
$$
and
$$
\sigma^2 | Y \sim \hat{\sigma}^2_B (\chi^2_{n-k})^{-1},
$$
with
$$
\hat{\beta}_B = (X'X)^{-1} X' Y, \qquad V_B = (X'X)^{-1}, \qquad \hat{\sigma}^2_B = \frac{1}{n-k}\left(Y - X\hat{\beta}_B\right)'\left(Y - X\hat{\beta}_B\right),
$$
and where $(\chi^2_{n-k})^{-1}$ denotes the inverse chi-squared distribution with $n - k$ degrees of freedom.
Often in Bayesian analyses, when the full joint posterior distribution is not available in closed form, we use simulation-based approaches to generate samples from the posterior distribution. Here, we can make inferences about $(\beta, \sigma^2)$ by using a simulation-based (Monte Carlo) approach in which we generate samples from the posterior distribution of $(\beta, \sigma^2)$. To do this, we sample from $\sigma^2 | Y$ and then, given the sampled value of $\sigma^2$, sample from $\beta | \sigma^2, Y$. Posterior inferences are then reported based on the empirical distribution of these sampled values.
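The following is a minimal R sketch of this two-step Monte Carlo scheme under the noninformative prior, reusing `X`, `Y`, `n`, `k`, and `beta_hat` from the previous snippet; it is an illustration under those assumptions, not the author's code.

```r
## Sketch: direct Monte Carlo draws from pi(beta, sigma^2 | Y) under the
## noninformative prior (reuses X, Y, n, k, beta_hat from the sketch above).
V_B  <- solve(t(X) %*% X)
s2_B <- sum((Y - X %*% beta_hat)^2) / (n - k)        # sigma-hat^2_B

n_draws      <- 5000
sigma2_draws <- (n - k) * s2_B / rchisq(n_draws, df = n - k)  # scaled inv-chi^2
beta_draws   <- t(sapply(sigma2_draws, function(s2)
  MASS::mvrnorm(1, mu = drop(beta_hat), Sigma = s2 * V_B)))   # beta | sigma^2, Y

## Posterior summaries come from the empirical distribution of the draws:
colMeans(beta_draws)
quantile(sigma2_draws, c(0.025, 0.975))
```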
Comparison to Classical Estimators
Notice that in the Bayesian case, the mean of the conditional posterior distribution for $\beta | \sigma^2, Y$ is the same as the MLE, i.e., $\hat{\beta}_B = \hat{\beta}_{MLE}$. Furthermore, $\hat{\sigma}^2_B = \frac{n}{n-k}\hat{\sigma}^2_{MLE}$. Therefore, Bayesian inferences on model parameters can be similar to inferences based on the classical estimators. A Bayesian analysis, however, offers additional benefits, including:
1. Flexibility in model specification. Although the means of the posterior distributions of $\beta$ and $\sigma^2$ are similar to the classical estimators, this is not always the case. To see this, suppose we have more information a priori about $\beta$: for example, we may know or require $\beta > 0$, where $0$ is a $k \times 1$ vector of zeros. A Bayesian analysis provides a natural and straightforward way to allow this prior knowledge or constraint to be included in an analysis through an informative prior distribution.
2. Interpretability of inferences. In a classical/frequentist setting, inferences are often reported as confidence intervals. Confidence intervals, however, are not probability intervals, although many interpret them as such. In a Bayesian setting, the posterior distribution provides interpretable probabilistic statements about model parameters, making clear the uncertainty associated with parameter estimates. Furthermore, this uncertainty is easily propagated to uncertainty in deterministic and stochastic functions of parameters (e.g., transformations and predicted values).
3. Allowing for increased complexity in model specification. Increasing the complexity of a model (e.g., hierarchical modeling with a large number of unknown parameters, allowing for complicated spatial dependence structures, or modeling non-continuous data) can make model-fitting in a classical/frequentist setting infeasible. However, simulation-based approaches such as MCMC (see discussion below) can make Bayesian model fitting feasible in high-dimensional situations.
The Spatial Linear Regression Model
We now consider a more general form of linear regression in order to account for spatial dependence among the errors. In some instances, the error terms in the linear regression model will not be independent. This is often the case for spatially-referenced data, but there can also be residual dependence in data obtained from studies using repeated measures or when data are observed over time. When an assumption of residual independence is not valid, we can specify a general linear regression model. The model is the same as in (1.1); however, we no longer assume the $\epsilon_i$ are independent, but rather assume that
$$
\mathrm{Var}(\epsilon) = \Sigma,
$$
where $\Sigma$ is a valid and invertible covariance matrix (i.e., positive definite and symmetric). Again, it is common to assume $\epsilon$ is normally distributed, but this assumption is not necessary.
To estimate the parameters of the general linear model, we consider two cases: $\Sigma$ known and $\Sigma$ unknown. When $\Sigma$ is known, for maximum likelihood estimation, we typically specify a normal distribution on the error terms, resulting in the likelihood function
$$
L(\beta | Y) \propto p(Y | X, \beta, \Sigma)
= \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (Y - X\beta)' \Sigma^{-1} (Y - X\beta) \right\}.
$$
The MLE for $\beta$ is then
$$
\hat{\beta}_{MLE} = \left( X' \Sigma^{-1} X \right)^{-1} X' \Sigma^{-1} Y.
$$
Weighted (or generalized) least squares, an extension of OLS, results in the same estimator for $\beta$.
When $\Sigma$ is unknown, we can parameterize $\Sigma$ (i.e., let $\Sigma = \Sigma(\theta)$). The spatial linear model is defined to be the general linear model with a specific type of dependence captured by $\Sigma(\theta)$ (see the discussion in Section 2.2.2). In this case, the likelihood function is
$$
L(\beta, \theta | Y) \propto p(Y | X, \beta, \Sigma(\theta))
= \frac{1}{(2\pi)^{n/2} |\Sigma(\theta)|^{1/2}} \exp\left\{ -\frac{1}{2} (Y - X\beta)' (\Sigma(\theta))^{-1} (Y - X\beta) \right\},
$$
which is maximized with respect to $\beta$ and $\theta$ to obtain the MLE. Another likelihood-based approach to estimation is restricted/residual maximum likelihood (REML; Patterson and Thompson, 1971) estimation, where we introduce a matrix $C$ satisfying $E(CY) = 0$ and $\mathrm{Var}(CY) = \sigma^2 I$. Then, given $C$, we use standard maximum likelihood estimation to fit the model $CY = CX\beta + C\epsilon$. Another approach is based on an extension of OLS called iteratively reweighted least squares (IRLS). IRLS first fixes the covariance $\Sigma$ at some preliminary estimate and, conditioning on this value, uses the weighted least squares estimator to obtain an initial estimate of $\beta$. Using this estimate of $\beta$, IRLS then computes residuals and determines a more accurate estimate of $\Sigma$. IRLS iterates between these two steps until the values of the parameter estimates have converged.
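A minimal R sketch of this IRLS iteration is given below. It assumes an exponential spatial covariance $\Sigma(\theta) = \exp(-d/\theta)$ built from a pairwise distance matrix `d` and a crude grid search for $\theta$; all function names and choices here are illustrative assumptions, not the dissertation's implementation.

```r
## Sketch of IRLS for the general linear model (illustrative assumptions:
## exponential covariance Sigma(theta) = exp(-d/theta), grid search on theta).
gls_beta <- function(X, Y, Sigma) {
  Si <- solve(Sigma)
  solve(t(X) %*% Si %*% X, t(X) %*% Si %*% Y)   # weighted least squares
}
irls_fit <- function(X, Y, d, theta_grid = seq(0.1, 5, 0.1),
                     max_iter = 20, tol = 1e-6) {
  theta <- theta_grid[1]
  beta  <- gls_beta(X, Y, exp(-d / theta))
  for (it in 1:max_iter) {
    r  <- Y - X %*% beta                        # residuals given current beta
    ll <- sapply(theta_grid, function(th) {     # profile log-likelihood in theta
      S <- exp(-d / th)
      -0.5 * (determinant(S)$modulus + drop(t(r) %*% solve(S, r)))
    })
    theta    <- theta_grid[which.max(ll)]       # update covariance estimate
    beta_new <- gls_beta(X, Y, exp(-d / theta)) # update beta given new theta
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(beta = beta, theta = theta)
}
```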
Alternatively, we can specify a Bayesian version of the general linear model by specifying prior distributions on model parameters and using a simulation-based method to approximate the posterior distributions of model parameters. The model-fitting approach is similar to that used for the Bayesian linear regression model; however, we now specify a prior distribution on $\Sigma$ or on the parameters defining $\Sigma$. Because of the parameterization of $\Sigma$, fitting a Bayesian general linear regression model requires more complex simulation-based approaches than those used in the Bayesian linear regression model. In general, the joint posterior distribution of $\beta$ and $\theta$ cannot be decomposed into a product of standard density functions that would allow Monte Carlo samples from the posterior to be drawn directly. Instead, it has become standard practice to use an estimation technique known as Markov chain Monte Carlo (MCMC). Rather than drawing directly from the posterior distribution, an MCMC algorithm draws samples from a Markov chain whose long-run, or stationary, distribution equals the posterior. Two popular MCMC algorithms are:

• The Gibbs sampler, where we draw a sample from the full conditional posterior distribution of each parameter sequentially. For the Bayesian spatial linear regression model, for each iteration we sample

1. $\beta^{[t]} \sim \pi(\beta | Y, \theta^{[t-1]})$

2. $\theta^{[t]} \sim \pi(\theta | Y, \beta^{[t]})$

where $\beta^{[t]}$ and $\theta^{[t]}$ represent the sampled values of $\beta$ and $\theta$ for the $t$th iteration, and each is sampled conditional on the current value of the other parameter.
• The Metropolis-Hastings algorithm, where for each iteration, we draw a sample from a proposal distribution and either accept or reject the proposed value. Let $\psi$ be the parameter of interest and $Y$ the observed data. Then, we accept the proposed value with probability
$$
a(\psi^*, \psi^{[t-1]}) = \min\left\{ \frac{\pi(\psi^* | Y)/J_t(\psi^* | \psi^{[t-1]})}{\pi(\psi^{[t-1]} | Y)/J_t(\psi^{[t-1]} | \psi^*)},\ 1 \right\},
$$
where $\psi^*$ is the proposed value of $\psi$ sampled from the proposal distribution for iteration $t$, $J_t(\cdot | \psi^{[t-1]})$. It follows from Bayes' theorem that this acceptance probability can be simplified to rely only on the prior, likelihood, and proposal distribution, i.e.,
$$
a(\psi^*, \psi^{[t-1]}) = \min\left\{ \frac{\pi(\psi^*)\, p(Y | \psi^*)/J_t(\psi^* | \psi^{[t-1]})}{\pi(\psi^{[t-1]})\, p(Y | \psi^{[t-1]})/J_t(\psi^{[t-1]} | \psi^*)},\ 1 \right\}.
$$
The Metropolis random walk algorithm is a special case of the Metropolis-Hastings algorithm, where the proposal distribution for each parameter is a symmetric distribution centered on the previous sampled value of that parameter. In this case, $J_t(\psi^* | \psi^{[t-1]}) = J_t(\psi^{[t-1]} | \psi^*)$, resulting in the simplified form for the acceptance probability given by
$$
a(\psi^*, \psi^{[t-1]}) = \min\left\{ \frac{\pi(\psi^*)\, p(Y | \psi^*)}{\pi(\psi^{[t-1]})\, p(Y | \psi^{[t-1]})},\ 1 \right\}.
$$
When fitting Bayesian models, it is also common to use a mixture of these two MCMC
algorithms. For example, in the Gibbs sampler for the Bayesian spatial regression model,
we can use a Metropolis-Hastings step in place of Step 2 listed above.
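As a concrete illustration, a single Metropolis random walk update for a scalar parameter can be sketched in R as follows; `log_post` is a stand-in for the sum of the log prior and log likelihood, and everything here is an illustrative assumption rather than a specific algorithm from this dissertation.

```r
## One Metropolis random walk update for a scalar parameter psi (illustrative).
metropolis_step <- function(psi_curr, log_post, prop_sd = 0.1) {
  psi_prop <- rnorm(1, mean = psi_curr, sd = prop_sd)  # symmetric proposal
  log_a <- log_post(psi_prop) - log_post(psi_curr)     # log acceptance ratio
  if (log(runif(1)) < log_a) psi_prop else psi_curr    # accept or keep current
}
```

Working on the log scale avoids numerical underflow in the ratio $\pi(\psi^*)p(Y|\psi^*)/\pi(\psi^{[t-1]})p(Y|\psi^{[t-1]})$.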
We note that we must specify starting values for the parameters for both types of
MCMC algorithms listed above. To reduce the effect of these starting values on the final
parameter inferences and allow the Markov chains to reach their stationary distributions,
typically we discard the sampled draws from a number of the initial iterations. These discarded values are called the burn-in samples.
Although it is a special case of the general linear regression model, we argue that the spatial linear regression model has certain features that distinguish it from other models in this class. First, data modeled using the general linear regression model typically have some sort of replication, where, for example, we observe a response for $n$ individuals $m$ times. However, with spatially-referenced data, although we have $n$ locations, oftentimes we observe a spatial process at these locations only once. Second, the specific nature of the dependence places constraints on $\Sigma$. In this dissertation, we consider two classes of parameterizations for $\Sigma(\theta)$, corresponding to geostatistical and lattice data, and discuss these in more detail in Section 2.2.
By accounting for spatial dependence in the spatial linear regression model, we can:

1. Quantify the relationship between the response variable and covariates while accounting for residual spatial dependence. When we ignore residual spatial dependence, inferences on model parameters will be invalid and the standard errors associated with these estimators tend to be too small (see, for example, Schabenberger and Gotway, 2005).
2. Predict the value of the response variable at an unobserved location using both observed values of the response variable and covariate information. We expect that response variables at unobserved locations will be more similar to nearby observations than to observations which are farther away, and accounting for this phenomenon is imperative in prediction.
Up to this point, we have only considered regression models for continuous response variables. Frequently, however, the response variable will not be continuous but may be reported as belonging to two or more categories. At the beginning of this section, we gave three examples of situations where a spatial regression model would be desirable. However, in each of these examples, the response variable may not be continuous:
• Environmental Health. The health outcome of an individual (e.g., has cancer/does
not have cancer) is a binary response variable.
• Real Estate. Whether or not the house sells during a particular period of time is a
binary response variable.
• Land-Cover/Land-Use Change. The land cover category (e.g., forest, agriculture,
grassland, or urban) at each location is a categorical response variable.
When the response variable is discrete or categorical, the spatial linear regression model
for continuous response variables is not appropriate. In the following section, we give an
overview of existing approaches for modeling spatially-dependent categorical data.
1.2 Modeling Categorical Spatial Data
Let $\{(Y_i, x_i); i = 1, \ldots, n\}$ be paired observations at location $s_i$, where $Y_i$ is now a categorical response variable and $x_i$ is a corresponding $k \times 1$ vector of covariates. Let $Y = (Y_1, \ldots, Y_n)'$ and $X = (x_1, \ldots, x_n)'$.
1.2.1 The Spatial Generalized Linear Model
To motivate the spatial generalized linear model (GLM), we first describe the GLM for independent response variables, first introduced by Nelder and Wedderburn (1972). Let $Y_i$, for $i = 1, \ldots, n$, be independent random variables. If the distribution of $Y_i | \zeta_i, \omega_i$ is a member of the exponential family, we can write
$$
p(Y_i | \zeta_i, \omega_i) = \exp\left\{ \frac{Y_i \zeta_i - b(\zeta_i)}{a(\omega_i)} + c(Y_i, \omega_i) \right\}.
$$
For the special case of a two-category, or binary, response variable, $Y_i | \zeta_i, \omega_i$ can be assumed to have a Bernoulli distribution with
$$
\zeta_i = \log\left( \frac{p_i}{1 - p_i} \right), \qquad b(\zeta_i) = \log(1 + \exp(\zeta_i)), \qquad a(\omega_i) \equiv a = 1, \qquad c(Y_i, \omega_i) \equiv c = 0,
$$
where $p_i$ is the probability that $Y_i = 1$. The mean of $Y_i$ is $E(Y_i) = b'(\zeta_i) \equiv \lambda_i$ and the variance of $Y_i$ is $\mathrm{Var}(Y_i) = b''(\zeta_i) a(\omega) \equiv v_{\lambda_i}$. Therefore, for the Bernoulli distribution,
$$
E(Y_i) = b'(\zeta_i) = \frac{\exp(\zeta_i)}{1 + \exp(\zeta_i)} = p_i
$$
and
$$
\mathrm{Var}(Y_i) = b''(\zeta_i) a(\omega) = \frac{\exp(\zeta_i)}{(1 + \exp(\zeta_i))^2} = p_i(1 - p_i).
$$
For an $n \times 1$ vector $Y = (Y_1, \ldots, Y_n)'$, we have
$$
E(Y) \equiv \lambda = p \quad \text{and} \quad \mathrm{Var}(Y) = V_\lambda,
$$
where $V_\lambda$ is a diagonal matrix with the $i$th diagonal element equal to $v_{\lambda_i} = p_i(1 - p_i)$.
To relate the $Y_i$s to the covariates $x_i$, we introduce a so-called link function, $h(\cdot)$, and assume
$$
h(\lambda_i) \equiv \mu_i = x_i'\beta,
$$
where $\beta$ is a $k \times 1$ vector of coefficients. Two frequently used link functions for the Bernoulli distribution are the logit and probit functions:
$$
\text{Logit: } \mu_i = \log\left( \frac{p_i}{1 - p_i} \right) = x_i'\beta, \qquad
\text{Probit: } \mu_i = \Phi^{-1}(p_i) = x_i'\beta.
$$
Therefore,
$$
\mu = X\beta,
$$
where $X = (x_1, \ldots, x_n)'$ is an $n \times k$ matrix of covariates and $\mu$ is an $n \times 1$ vector.
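As a quick numerical illustration of the two links, note that both inverse link functions map a linear predictor to a probability in $(0, 1)$; the values below are made up:

```r
## Both inverse links map a linear predictor mu_i = x_i' beta to p_i in (0, 1).
mu_i <- 0.7
p_logit  <- plogis(mu_i)   # inverse logit: exp(mu)/(1 + exp(mu)), about 0.668
p_probit <- pnorm(mu_i)    # inverse probit: Phi(mu), about 0.758
```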
One approach to estimating $\beta$ is by specifying a quasi-likelihood (QL; Wedderburn, 1974). Let $U_i = u(\lambda_i; Y_i)$ be the standardized value of $Y_i$, so that $U_i = (Y_i - \lambda_i)/v_{\lambda_i}$. The first two moments of this random variable have the same properties as the log-likelihood derivatives, so
$$
U_i = \frac{Y_i - \lambda_i}{v_{\lambda_i}} = \frac{\partial \int_{Y_i}^{\lambda_i} \frac{Y_i - t}{v_t}\, dt}{\partial \lambda_i} = \frac{\partial Q(\lambda_i; Y_i)}{\partial \lambda_i}.
$$
Thus,
$$
Q(\lambda_i; Y_i) = \int_{Y_i}^{\lambda_i} \frac{Y_i - t}{v_t}\, dt.
$$
Because the data are independent, $Q(\lambda; Y) = \sum_{i=1}^n Q(\lambda_i; Y_i)$, so that $\partial Q(\lambda; Y)/\partial \lambda = V_\lambda^{-1}(Y - \lambda)$. Notice that $Q(\lambda; Y)$ is a function of $\lambda$ and $Y$, but we are actually interested in estimating $\beta$. Therefore, we determine $U(\beta) = \partial Q(\lambda; Y)/\partial \beta$ by calculating
$$
U(\beta) = \frac{\partial Q(\lambda; Y)}{\partial \lambda} \frac{\partial \lambda}{\partial \beta} = \Delta' V_\lambda^{-1} (Y - \lambda),
$$
where $\Delta$ is an $n \times k$ matrix with elements $\Delta_{ij} = \partial \lambda_i / \partial \beta_j$. We then estimate $\beta$ by solving the quasi-likelihood estimating equations (i.e., setting $U(\hat{\beta}) = 0$ and solving for $\hat{\beta}$).
For the generalized linear model, we now consider the case where the elements of $Y$ are not independent. Albert and McShane (1995) and Gotway and Stroup (1995) redefine the variance of $Y$ to be
$$
\mathrm{Var}(Y) = V_\lambda^{1/2} R(\theta) V_\lambda^{1/2} \equiv \Sigma_Y,
$$
where $V_\lambda$ is defined as above, $R(\theta)$ is a correlation matrix, and $\theta$ is an $m \times 1$ vector of dependence parameters with $m \geq 1$. For spatially-referenced data, $R(\theta)$ will be a spatial correlation matrix like those given in Section 2.2.
To fit the GLM for correlated response variables, Liang and Zeger (1986) extend the QL approach to model fitting. They use $\Sigma_Y$ in place of $V_\lambda$, resulting in
$$
U(\beta) = \frac{\partial Q(\lambda; Y)}{\partial \lambda} \frac{\partial \lambda}{\partial \beta} = \Delta' \Sigma_Y^{-1} (Y - \lambda). \qquad (1.6)
$$
These equations, $U(\beta)$, are called generalized estimating equations (GEE).

In the spatial setting, Albert and McShane (1995) use GEE not only to fit $\beta$, but also to fit $\Sigma_Y$. In addition to (1.6), they let
$$
A' \hat{\Sigma}_Y^{-1} (\hat{\sigma}_Y - \sigma_Y) = 0, \qquad (1.7)
$$
where $\hat{\Sigma}_Y$ is the working covariance matrix, $\sigma_Y$ and $\hat{\sigma}_Y$ are $n(n-1)/2 \times 1$ vectors consisting of all entries below the diagonal of $\Sigma_Y$ and $\hat{\Sigma}_Y$, and $A = \partial \sigma_Y / \partial \beta$. They then iterate between solving (1.6) and (1.7) until convergence. Lin and Clayton (2005) show asymptotic normality and consistency of the GEE for binary spatial data with isotropic covariance functions. They prove and apply their results for the logit link function. Lin (2008) extends the GEE to $L_p$ space.
Breslow and Clayton (1993) propose using a pseudolikelihood function to estimate model parameters in the spatial generalized linear model. Their approach is similar to GEE, but it also computes ‘pseudodata’ corresponding to the observed binary process within the iterations of the model-fitting algorithm.
In the literature for spatially-referenced data, there are no Bayesian versions of the GEE and pseudolikelihood methods described above. This is due to the fact that these approaches do not have a true likelihood function. Yin (2009) proposes a Bayesian generalized method of moments that uses a weighted quadratic objective function in place of a likelihood function. To the best of our knowledge, this approach has not been applied in a spatial setting. Instead, it is common to use a Bayesian spatial generalized linear mixed model (see Section 1.2.2).
Albert and McShane (1995) emphasize the fact that the GEE approach treats the spatial
correlation as a nuisance. In contrast, the GLMM approach seeks to estimate the mean of
the data given the spatial random effect, as described below.
1.2.2 The Spatial Generalized Linear Mixed Model
The spatial generalized linear mixed model (GLMM) extends the GLM by introducing a random spatially-dependent error term, or spatial random effect, denoted $S(s_i) \equiv S_i$, into the mean model. The spatial GLMM can be written as
$$
E(Y_i | S_i) = \lambda(s_i) \equiv \lambda_i, \qquad h(\lambda_i) = x_i'\beta + S_i,
$$
where $E(S_i) = 0$ and $\mathrm{Var}(S) = \Sigma(\theta)$ with $S = (S_1, \ldots, S_n)'$ and $\Sigma(\theta)$ an $n \times n$ spatial covariance matrix.
The spatial GLMM was first introduced within the Bayesian setting by Diggle et al. (1998) as a general modeling framework for spatially-dependent discrete data. As an alternative, a classical version of the spatial GLMM was introduced by Heagerty and Lele (1998). Heagerty and Lele propose a composite likelihood approach for fitting a spatial GLMM for binary response data. Paciorek (2007) gives a review of various versions of spatial logistic GLMMs, as well as a discussion of the specification of spatial dependence structure in this class of models.
Diggle et al. (1998) propose fitting the Bayesian spatial GLMM using MCMC algorithms. This model-fitting approach is popular due to established computer software, such as WinBUGS (Lunn et al., 2000), that can be used to fit Bayesian spatial GLMMs (see, for example, Banerjee et al., 2004; Law and Haining, 2004). However, as noted by Paciorek (2007), MCMC algorithms for Bayesian spatial GLMMs often converge slowly and exhibit poor mixing. To improve the performance of these algorithms, Langevin-Hastings methods have been suggested in the literature (Christensen et al., 2001; Christensen and Waagepetersen, 2002; Christensen and Ribeiro Jr., 2002). This method involves modifying a Metropolis random walk algorithm so that the proposal distribution not only relies on the value of the parameter from the previous iteration, but also on the gradient of the log-likelihood function. Christensen et al. (2006) propose further improvements by transforming the posterior values of the spatial random effects, $S_i$, by modifying $S_i | Y_i$ with a Cholesky factorization of the posterior covariance matrix. They illustrate that this method is more robust than either not transforming the $S_i$s or transforming the $S_i$s a priori.
When S has a Gaussian distribution, an alternative Bayesian model-fitting approach to MCMC is using integrated nested Laplace approximations (INLA; Rue et al., 2009).
INLA seeks to approximate posterior marginals rather than producing samples from the joint posterior distribution of model parameters. While this approach is attractive, it is not clear whether its applicability is as general as with MCMC methods.
1.2.3 Indicator Kriging
Rather than modeling relationships between response variables and covariates, indicator kriging predicts values of binary random variables at unobserved locations based only on nearby binary observations (Switzer, 1977; Journel, 1983). Indicator kriging is based on a spatial prediction method called kriging and is a direct extension of this popular method to the binary data setting. Indicator kriging requires a function $\gamma(d_{ij}) = \frac{1}{2}\mathrm{Var}(Y_i - Y_j)$, where $\gamma(\cdot)$ is an isotropic semivariogram and $d_{ij} = \|s_i - s_j\|$ is the distance between locations. Using this function, prediction of $Y(s_0)$, where $s_0$ is an unobserved location, is based on $p^0 \equiv P(Y(s_0) = 1 | Y)$. An estimate of $p^0$, $\hat{p}^0$, is its best linear unbiased estimator (Cressie, 1993; De Oliveira, 2000):
$$
\hat{p}^0 = \left( \gamma + \mathbf{1}\, \frac{1 - \mathbf{1}' \Gamma^{-1} \gamma}{\mathbf{1}' \Gamma^{-1} \mathbf{1}} \right)' \Gamma^{-1} Y,
$$
where $\Gamma$ is an $n \times n$ matrix with the $ij$th element equal to $\gamma(\|s_i - s_j\|)$, $\gamma$ is an $n \times 1$ vector with the $i$th element equal to $\gamma(\|s_0 - s_i\|)$, and $\mathbf{1}$ is an $n \times 1$ vector of ones. Then, $Y(s_0)$ is predicted by taking
$$
\hat{Y}(s_0) = \begin{cases} 1, & \text{if } \hat{p}^0 > \frac{l_0}{l_0 + l_1}, \\ 0, & \text{otherwise}, \end{cases}
$$
where $l_0$ and $l_1$ are the specified losses for mispredicting $Y_i$ as a 0 and 1, respectively.
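A minimal R sketch of this predictor is below. It assumes an isotropic semivariogram supplied as a vectorized function, e.g. `gamma_fun <- function(d) 1 - exp(-d / phi)`; the function names and arguments are illustrative, not from the dissertation.

```r
## Sketch of the indicator kriging predictor p-hat^0 and prediction rule
## (illustrative; gamma_fun is an assumed isotropic semivariogram).
indicator_krige <- function(s0, S, Y, gamma_fun, l0 = 1, l1 = 1) {
  n     <- nrow(S)
  D     <- as.matrix(dist(rbind(s0, S)))        # pairwise distances, s0 first
  Gamma <- gamma_fun(D[-1, -1, drop = FALSE])   # gamma(||s_i - s_j||), n x n
  gam   <- gamma_fun(D[1, -1])                  # gamma(||s0 - s_i||), length n
  Gi    <- solve(Gamma)
  ones  <- rep(1, n)
  adj   <- (1 - drop(ones %*% Gi %*% gam)) / drop(ones %*% Gi %*% ones)
  p_hat <- drop(t(gam + ones * adj) %*% Gi %*% Y)   # BLUE of P(Y(s0) = 1 | Y)
  list(p_hat = p_hat, Y0 = as.integer(p_hat > l0 / (l0 + l1)))
}
```

Note that nothing in the formula forces `p_hat` into $[0, 1]$, which is exactly the weakness discussed below.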
Indicator kriging was originally motivated by considering indicator functions of a collection of real-valued random variables $Z = (Z(s_1), \ldots, Z(s_n))' \equiv (Z_1, \ldots, Z_n)'$ and defining $Y_i = I(Z_i < z)$ for $i = 1, \ldots, n$ (Switzer, 1977; Journel, 1983). Solow (1993) compares the estimates of the probability $p^0 = P(I(Z(s_0) < z) = 1 | I(Z_1 < z), \ldots, I(Z_n < z))$ using indicator kriging to the probability $p^{0*} = P(Z(s_0) < z | Z)$, where $Z$ has a Gaussian distribution. Although $p^0 \not\equiv p^{0*}$, Solow finds that the numbers of incorrectly predicted values based on $Y$ and $Z$ were similar and relatively small. However, he only uses a sample size of $n = 4$.
Indicator kriging has the benefit that no distributional assumptions about the data generating process are required. However, there is no guarantee the estimated probabilities will stay within the appropriate range of $[0, 1]$. Furthermore, if $p^{0*}$ is the probability of interest, Cressie (1993) states that, theoretically, disjunctive kriging provides a better approximation to this probability than indicator kriging. (Disjunctive kriging requires specification of the bivariate distributions of $(Z_i, Z_j)$, $0 \leq i < j \leq n$, and then estimates any measurable function $g(Z_0)$, where in the binary setting, $g(Z_0) = I(Z_0 < z)$ and $E(g(Z_0)) = p^{0*}$.) Diggle et al. (1998) also discuss the weak theoretical assumptions of indicator kriging.
Other suggested approaches for improving indicator kriging are indicator cokriging (Journel, 1983), which includes covariates to supplement predictions, and probability kriging (Cressie, 1993), which utilizes an estimate of the cumulative distribution function of $Z$ within the indicator kriging framework. However, these methods also do not guarantee that estimated probabilities remain in the appropriate range.
1.2.4 The Autologistic Model
First proposed by Besag (1972), the autologistic model directly models the spatial correlation among the binary data by relating the log odds of $Y_i = 1$ to the values of $Y_j$ for $s_j$ in some neighborhood of $s_i$. Let
$$
p_i \equiv P(Y_i = 1 | Y_{-i}, x_i, \beta, \alpha),
$$
where $Y_{-i}$ denotes $Y$ with the $i$th element removed. The autologistic model assumes that
$$
\log\left( \frac{p_i}{1 - p_i} \right) = x_i'\beta + \alpha \sum_{j:\, i \sim j} Y_j,
$$
where $x_i$ is a $k \times 1$ vector of covariates at location $s_i$, $\beta$ is a $k \times 1$ vector of coefficients, $i \sim j$ indicates that location $j$ is a neighbor of location $i$, and $\alpha$ is the spatial dependence parameter. In this setting, the log-odds at each location is not only a linear function of the covariates, but also depends on the neighboring binary observations. Extensions and generalizations of this model have been proposed by Besag (1972), Besag (1974), Augustin et al. (1996), Gumpertz et al. (1974), Sim (2000), Zheng and Zhu (2008), and Zhu et al. (2008).
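The conditional log-odds above translate directly into a short R sketch; the neighborhood matrix `W` (with `W[i, j] = 1` when $j \sim i$ and a zero diagonal) is an assumed input for illustration.

```r
## Conditional probability P(Y_i = 1 | Y_{-i}) under the autologistic model
## (illustrative; W is an assumed 0/1 neighborhood matrix with zero diagonal).
autologistic_p <- function(i, Y, X, beta, alpha, W) {
  eta <- drop(X[i, ] %*% beta) + alpha * sum(W[i, ] * Y)  # log-odds at site i
  1 / (1 + exp(-eta))                                     # p_i is always in [0, 1]
}
```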
Methods for fitting the autologistic model include coding, pseudo-likelihood, Monte Carlo maximum likelihood, and, for a Bayesian version, Gibbs sampling. Coding methods, proposed by Besag (1974), begin by separating the data into subsets. For each subset, conditional on the values in the remaining portion of the data set, the observations in that subset are independent. Given these conditionally independent observations, conditional maximum likelihood is used to estimate the parameters. Because there are multiple ways to separate the data, inconsistencies arise when this method is used. Huffer and Wu (1998) introduce a Monte Carlo maximum likelihood method to fit this model, and show their approach yields more efficient estimates than pseudo-likelihood and coding. Sherman et al. (2006) give an overview and comparison of pseudo-likelihood, generalized pseudo-likelihood, and Monte Carlo maximum likelihood model-fitting methods. Augustin et al. (1996) use a Gibbs sampler to estimate the model parameters.
The autologistic model offers an improvement over indicator kriging in that the estimated $p_i$s must be in $[0, 1]$. In addition, the autologistic model does not require any underlying distributional assumptions. However, Weir and Pettitt (1999) discuss computational challenges associated with fitting the model and note that parameter estimates can be poor when strong spatial dependence is present.
1.2.5 The Bayesian Spatial Probit Regression Model
The final model in the literature used to model spatially-dependent binary data is the Bayesian spatial probit regression model. Because this model is the focus of this dissertation, we provide a detailed introduction to it in Chapter 2.
1.3 Overview of Contributions
In this dissertation, we add to the development of the Bayesian spatial probit regression model in the following three ways:
1. Efficient Model Fitting: Models for spatially-dependent data are notoriously cumbersome to fit. We show how a marginal data augmentation MCMC algorithm can fit the Bayesian spatial probit regression model more efficiently than standard MCMC algorithms.
2. Spatial Classification: Within the classification literature, classification methods which allow for spatial dependence are limited. We show how a spatial classification rule can be derived from the Bayesian spatial probit regression model and provide an example where our spatial classifier outperforms other well-known classification methods.
3. Multinomial Model Specification: When extending the Bayesian spatial probit regression model to the multi-category setting, special considerations must be made when specifying the latent variable mean and covariance structures to ensure model parameters are estimable and interpretable. We discuss various specifications of the latent mean structure and the associated parameter interpretations, and explore the specification of the latent cross spatial-categorical dependence structure. Additionally, we discuss how data augmentation MCMC strategies for fitting the Bayesian spatial probit regression model can be extended to the multi-category setting.
1.4 Illustrative Data Set
To illustrate our methods, we use satellite-derived land cover observations over Southeast Asia. In Southeast Asia, deforestation is a major concern and, over the last century, much of the original forest (as much as 12 percent) has been lost to other land uses (Munroe et al., 2008). Researchers are interested in the economic, geographic, social, and demographic factors that contribute to deforestation and other land cover patterns. The spatial probit regression model is well suited for assessing the strength of the relationships between binary response variables and covariate information while also allowing for residual spatial dependence.
The particular data used in our analyses were taken from the Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005. We selected land cover observations from this data product corresponding to the region bounded by 17° to 21°N and 98° to 105°E, which covers portions of Myanmar, Thailand, Laos, and Vietnam. Figure 1.1 shows an image of the land cover over this region. For this figure, the International Human Dimensions Programme (IHDP) land cover classifications provided by the MODIS data product were collapsed into five categories: forest, shrub/grassland, savanna, cropland, and other.
We also consider four covariates: elevation, distance to the nearest major road, distance to the coast, and distance to the nearest big city. Elevation is measured in meters, and distances are Euclidean and measured in degrees, meaning that no costs were taken into account in calculating distance (e.g., the distance calculations do not account for the fact that it might take longer to go over mountains than to go around them). The covariates are standardized.

Figure 1.1: Land cover over Southeast Asia, covering the region bounded by 17° to 21°N and 98° to 105°E. The data were taken from the MODIS Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005.

Figures 1.2 - 1.5 show images of the four respective covariates. The city in the middle of Figure 1.5 (distance to big city) is Vientiane, the capital of Laos.
Finally, we note that Figure 1.3 has an artificial wavy line break. This is most likely due to the inadequacies in the plotting capabilities of R, the software used to create these images. However, it may also be an artifact of the country borders, and further explanations are being explored. For the data analyses in Sections 3.3 and 4.4, we use a portion of the data not affected by this break (i.e., 17° to 19°N and 98° to 100°E).
Figure 1.2: Elevation (in meters) over the region bounded by 17° to 21°N and 98° to 105°E.

Figure 1.3: Standardized value of the measured distance to the nearest major road over the region bounded by 17° to 21°N and 98° to 105°E.

Figure 1.4: Standardized value of the measured distance to the coast over the region bounded by 17° to 21°N and 98° to 105°E.

Figure 1.5: Standardized value of the measured distance to the nearest big city over the region bounded by 17° to 21°N and 98° to 105°E.
CHAPTER 2
BAYESIAN SPATIAL PROBIT REGRESSION
An attractive alternative to the models for spatially-dependent categorical data reviewed in Section 1.2 is the Bayesian spatial probit regression model. In this chapter, we motivate this model by first describing in Section 2.1.1 the latent variable representation of the Bayesian probit regression model for independent response variables as proposed by Albert and Chib (1993). In Section 2.1.2 we present extensions to the multivariate and multi-category response variable setting. Following this discussion, we introduce the Bayesian spatial probit regression model in Section 2.2.
2.1 The Bayesian Probit Regression Model
2.1.1 Albert and Chib’s Data Augmentation Strategy
Consider $\{(Y_i, x_i); i = 1, \ldots, n\}$, a collection of $n$ binary response variables $Y_i$ and corresponding $k \times 1$ vectors of covariates $x_i$. As discussed in Section 1.2.1, the probit GLM relating the covariates to the binary response variable assumes that the $Y_i$s are conditionally independent given a $k \times 1$ vector of regression coefficients $\beta$ and can be written as
$$
Y_i | \beta \sim \mathrm{Bernoulli}(p_i), \qquad \Phi^{-1}(p_i) = x_i'\beta, \qquad (2.1)
$$
where $\Phi^{-1}(\cdot)$ denotes the inverse standard normal cumulative distribution function. In the Bayesian setting, a prior distribution must be specified for the unknown parameter $\beta$.
Unlike the normal linear regression model where the normal distribution is a conjugate prior for the regression coefficients, a conjugate prior is not available for β in (2.1). Thus, inference on β typically requires numerical integration, which is not feasible if k is large, or a simulation-based approach such as MCMC.
To facilitate the use of the Gibbs sampler in the Bayesian probit regression model, Al- bert and Chib (1993) propose the following data augmentation representation of the model.
˜ ˜ ˜ 0 They introduce a collection of latent variables Z = (Z1,..., Zn) and take ( 1, if Z˜ > 0 Y = i , i ˜ (2.2) 0, if Zi ≤ 0 where
˜ ˜ 2 Z ∼ N(Xβ, σ In), (2.3)
0 X = (x1,..., xn) is an n × k matrix of covariates, In is the n × n identity matrix, and
σ2 is a variance parameter. (The ∼ notation over the random variables used here distin-
guishes identifiable and non-identifiable parameters; we use this notational convention to
aid our discussion of data augmentation strategies in Section 3.1.) Taking σ2 = 1, setting ˜ ˜ Zi = Zi/σ and β = β/σ, and integrating out the Zis, it is straightforward to show that
Albert and Chib’s model specification is equivalent to the probit GLM given in (2.1). In- ˜ troducing the latent Zis and taking the prior on β to be N (mβ,Cβ) facilitates model fitting
via the following Gibbs sampler:
Step 1: Sample Z | Y, β, (σ² = 1).
For i = 1, \ldots, n, sample Z_i from
\[ Z_i \mid Y, \beta, (\sigma^2 = 1) \sim \begin{cases} TN(x_i'\beta, \sigma^2 = 1, 0, \infty), & \text{if } Y_i = 1 \\ TN(x_i'\beta, \sigma^2 = 1, -\infty, 0), & \text{if } Y_i = 0 \end{cases}, \]
where TN(µ, σ2, l, u) denotes a truncated normal distribution with mean µ, variance
σ2, lower bound l, and upper bound u. By sampling from these univariate truncated
normal distributions, we obtain a sample for Z (Geweke, 1991). In this dissertation,
when sampling from the truncated normal distribution, we use Hans and Craigmile
(2009), which provides a tool in R, based on Geweke (1991), for efficiently sampling
from a truncated normal distribution, particularly when sampling in the tails of the
distribution.
Step 2: Sample β|Z, Y , (σ2 = 1).
When β ∼ N(0, Cβ), the full conditional distribution of β is given by
\[ \beta \mid Z, Y, (\sigma^2 = 1) \sim N\!\left( (C_\beta^{-1} + X'X)^{-1} X'Z,\; (C_\beta^{-1} + X'X)^{-1} \right). \]
Each of these steps involves drawing from known distributions from which sampling is easily implemented; see Albert and Chib (1993) for details.
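To make this concrete, the following is a minimal R sketch of the two-step Gibbs sampler above for the independent Bayesian probit model (with σ² = 1). The inverse-CDF truncated normal draw is a simple stand-in for the more robust sampler of Hans and Craigmile (2009), and all function and variable names here are illustrative assumptions, not the implementation used in this dissertation.

# Draw from a univariate truncated normal via the inverse-CDF method;
# this can be numerically unstable far in the tails, where a rejection
# sampler such as that of Hans and Craigmile (2009) is preferable.
rtnorm1 <- function(mu, lo, hi) {
  u <- runif(1, pnorm(lo, mu, 1), pnorm(hi, mu, 1))
  qnorm(u, mu, 1)
}

probit_gibbs <- function(Y, X, C_beta, n_iter = 5000) {
  n <- nrow(X); k <- ncol(X)
  beta <- rep(0, k)
  Z <- rep(0, n)
  V <- solve(solve(C_beta) + crossprod(X))  # (C_beta^{-1} + X'X)^{-1}
  R <- chol(V)
  draws <- matrix(NA, n_iter, k)
  for (t in 1:n_iter) {
    # Step 1: sample each Z_i from its truncated normal full conditional
    m <- drop(X %*% beta)
    for (i in 1:n) {
      Z[i] <- if (Y[i] == 1) rtnorm1(m[i], 0, Inf) else rtnorm1(m[i], -Inf, 0)
    }
    # Step 2: sample beta from N((C_beta^{-1} + X'X)^{-1} X'Z, V)
    beta <- drop(V %*% crossprod(X, Z)) + drop(t(R) %*% rnorm(k))
    draws[t, ] <- beta
  }
  draws
}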
2.1.2 Multi-Category and Multivariate Extensions
Albert and Chib (1993)'s data augmentation strategy for the probit GLM has been extended in several ways. We first define notation for various possible model extensions, then describe each of the model extensions in more detail. Consider observations associated with n 'individuals', each of whom provides m 'responses' (i.e., multivariate observations), which fall into one of ℓ categories. Using this notation, the latent variable representation of the Bayesian probit regression model described in the previous section accommodates n > 1 individuals with m = 1 responses for each individual, and ℓ = 2 possible categories, corresponding to the two categories of a binary response. We now consider various extensions of Albert and Chib's data augmentation strategy when ℓ > 2 or m > 1.
In addition to introducing the latent variable representation of the probit regression model for binary responses, Albert and Chib (1993) also provide a multi-category (multinomial) extension. Now Y_i ∈ {1, …, ℓ} denotes the categorical outcome associated with the ith individual, where ℓ > 2 is the number of categories. In this case, the latent variables \tilde{Z}_{ij} for i = 1, …, n and j = 1, …, ℓ are introduced to facilitate model fitting. The latent variable representation of this model is given by
\[ Y_i = \arg\max_j \{\tilde{Z}_{ij},\; j = 1, \ldots, \ell\}, \]
where
\[ \tilde{Z}_i \sim N([1_\ell \otimes x_i]'\tilde{\beta}, \tilde{\Omega}), \]
and \tilde{Z}_i = (\tilde{Z}_{i1}, \ldots, \tilde{Z}_{i\ell})', 1_\ell is an ℓ × 1 vector of ones, x_i is a k × 1 vector of covariates, \tilde{β} is a k × 1 vector of regression coefficients, and \tilde{Ω} is an ℓ × ℓ covariance matrix. Conditional on \tilde{β} and \tilde{Ω}, the \tilde{Z}_i s are independent. That is,
\[ \mathrm{vec}(\tilde{Z}) \equiv (\tilde{Z}_1', \ldots, \tilde{Z}_n')' \sim N(X\tilde{\beta}, I_n \otimes \tilde{\Omega}), \tag{2.4} \]
where
\[ X \equiv \begin{pmatrix} [1_\ell \otimes x_1]' \\ \vdots \\ [1_\ell \otimes x_n]' \end{pmatrix} \]
is an nℓ × k matrix and ⊗ denotes the Kronecker or tensor product between two matrices. For \tilde{β} to be identifiable, the first diagonal element of \tilde{Ω}, \tilde{Ω}_{1,1}, is typically set equal to one
(Albert and Chib, 1993; McCulloch et al., 2000). McCulloch and Rossi (1994) and Nobile
(1998), however, do not impose identifiability constraints on \tilde{Ω}, but rather assign proper priors on all parameters and report marginal posterior inferences on the identified parameters (e.g., \tilde{β}/\tilde{Ω}_{1,1}). Additional approaches for handling this issue of parameter identifiability (e.g., Imai and van Dyk, 2005) will be discussed in Section 3.1.
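As an implementation aside, the nℓ × k design matrix X above, whose rows stack the blocks 1_ℓ ⊗ x_i', can be assembled with R's built-in Kronecker product; a small sketch with hypothetical dimensions:

# Build the n*ell x k design matrix for the multinomial probit latent model.
n <- 4; ell <- 3; k <- 2
Xcov <- matrix(rnorm(n * k), n, k)   # row i holds the covariates x_i'
one_ell <- matrix(1, ell, 1)
X <- kronecker(Xcov, one_ell)        # stacks the ell x k blocks 1_ell (x) x_i'
dim(X)                               # n*ell by k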
Albert and Chib (1993)’s latent variable representation of the probit GLM has also been
extended by Chib and Greenberg (1998) to the multivariate setting where n > 1 individuals
have m > 1 responses with ` = 2 categories for each response. They model this type of
independent multivariate binary observations, Y_1, …, Y_n, where Y_i = (Y_{i1}, …, Y_{im})', by introducing latent variables \tilde{Z}_{ih} for i = 1, …, n and h = 1, …, m. In this case,
\[ Y_{ih} = \begin{cases} 1, & \text{if } \tilde{Z}_{ih} > 0 \\ 0, & \text{if } \tilde{Z}_{ih} \le 0 \end{cases}, \]
where
\[ \tilde{Z}_i \sim N([1_m \otimes x_i]'\tilde{\beta}, \Sigma), \]
and \tilde{Z}_i = (\tilde{Z}_{i1}, \ldots, \tilde{Z}_{im})', 1_m is the m × 1 vector of ones, x_i is a k × 1 vector of covariates, \tilde{β} is a k × 1 vector of regression coefficients, and Σ is an m × m correlation matrix. Again, conditional on \tilde{β} and Σ, the \tilde{Z}_i s are independent. That is,
\[ \mathrm{vec}(\tilde{Z}) \equiv (\tilde{Z}_1', \ldots, \tilde{Z}_n')' \sim N(X\tilde{\beta}, \sigma^2 I \otimes \Sigma), \]
where
\[ X \equiv \begin{pmatrix} [1_m \otimes x_1]' \\ \vdots \\ [1_m \otimes x_n]' \end{pmatrix} \]
is an nm × k matrix. For \tilde{β} to be identifiable, Chib and Greenberg require Σ to be a
correlation matrix and set σ² = 1. More recently, to facilitate model fitting, Liu and Daniels (2006) relax the assumptions on the correlation matrix by using parameter expansion and reparameterization methods to first sample Σ as a covariance matrix and then translate it back to a correlation matrix.
2.2 The Bayesian Spatial Probit Regression Model
2.2.1 Model Specification
We now discuss incorporating spatial dependence into the Bayesian probit regression model. From the extensions considered in the previous section, this can be done in two ways. The first way to view this model extension is to consider observations at the different locations as n dependent observations (rather than n independent observations as in Section 2.1.1), so that we have n > 1 'individuals' (or locations in the spatial setting), m = 1
‘responses’ at each location, and ` = 2 possible categories for each observation. We could also view the observations across all locations as a single multivariate response variable
(i.e., n = 1 individual with m > 1 responses, each of which can take on ℓ = 2 possible categories).
Although the resulting models are equivalent, we arbitrarily take the first view in defining our notation, and consider n spatially-dependent univariate binary response variables,
Y1,...,Yn. The resulting model is equivalent to De Oliveira (2000)’s “clipped Gaussian random fields” and the basis for the model considered by Weir and Pettitt (2000).
For the spatial probit regression model, we introduce latent variables \tilde{Z} = (\tilde{Z}_1, \ldots, \tilde{Z}_n)', which are realizations of a spatially-dependent Gaussian process, and let
\[ Y_i = \begin{cases} 1, & \text{if } \tilde{Z}_i > 0 \\ 0, & \text{if } \tilde{Z}_i \le 0 \end{cases}, \tag{2.5} \]
where
\[ \tilde{Z} \sim N(X\tilde{\beta}, \sigma^2\Sigma(\theta)), \tag{2.6} \]
with X = (x_1, \ldots, x_n)', and we assume that the marginal variances of the \tilde{Z}_i s are known up to a multiplicative constant σ² and that the matrix Σ(θ) captures the residual spatial dependence structure. Typically, Σ(θ) will be a correlation matrix. However, we allow for the possibility of heteroskedasticity by only assuming that the diagonal elements of
Σ(θ) are fixed and known constants. We note that in Chapter 1, Σ(θ) represented a spatial covariance matrix. In our discussion of the Bayesian spatial probit regression model, we will refer to σ² as the variance of Z_i for i = 1, …, n, and to Σ(θ) as a spatial correlation matrix.
As in the two previous model extensions, β˜ is only identifiable up to a multiplicative
constant. For illustration, consider P(Y_i = 1 | \tilde{β}, Σ(θ), σ²):
\[ P(Y_i = 1 \mid \tilde{\beta}, \Sigma(\theta), \sigma^2) = P(\tilde{Z}_i > 0 \mid \tilde{\beta}, \Sigma(\theta), \sigma^2) = \Phi\!\left( \frac{x_i'\tilde{\beta}}{\sigma} \right) = \Phi(x_i'\beta), \]
where β = β˜/σ is the identifiable parameter. De Oliveira (2000) and Weir and Pettitt
(2000) chose to set σ² = 1 to ensure that the regression coefficients are identifiable, a choice that has implications for the efficiency of the resulting MCMC algorithms, as we discuss in Section 3.1.
Although similar in spirit to that of the spatial GLM and spatial GLMM, this model
fits within a separate class under the GLM framework. For the spatial GLM, the spatial dependence is specified directly on Y , rather than on a latent Gaussian process. The spatial
GLMM is more similar to the Bayesian probit regression model in that we introduce a
latent spatially-dependent Gaussian process in the mean of Y ; however, the probability pi
is specified conditional on the value of the Gaussian process, S_i, i.e.,
\[ p_i = P(Y_i = 1 \mid \beta, S_i) \]
for all i = 1, …, n. In contrast to the spatial GLM and the spatial GLMM, in the Bayesian spatial probit regression model the spatial dependence structure is embedded in the link function.
2.2.2 Parameterization of the Spatial Correlation Matrix
When modeling spatial dependence, it is important to make certain that Σ(θ) is a valid
correlation matrix. Various parameterizations of Σ(θ) that ensure validity and uphold the
characteristics of spatial dependence are available. We consider two classes of parameteri-
zations for Σ(θ), corresponding to geostatistical and lattice data.
For geostatistical/point-referenced data, a geostatistical dependence structure is com-
monly used. In this setting, the correlation of a spatial process at two locations is often
modeled as a function of the distance between the two locations corresponding to the as-
sumptions of second-order stationarity and isotropy. One popular class of parametric spa-
tial correlation functions is the Matérn class. In this case,
\[ \Sigma(\theta) \equiv \Sigma(\nu, \lambda), \]
where the ijth element of \Sigma(\nu, \lambda) is equal to
\[ \frac{1}{2^{\nu-1}\Gamma(\nu)} \left( \frac{2\sqrt{\nu}\, d_{ij}}{\lambda} \right)^{\!\nu} K_\nu\!\left( \frac{2\sqrt{\nu}\, d_{ij}}{\lambda} \right), \]
Γ(·) is the usual gamma function, Kν is the modified Bessel function of order ν (see e.g.,
Abramowitz and Stegun, 1965), dij = ||si−sj|| is the Euclidean distance between locations
s_i and s_j, ν > 0 is a parameter controlling the smoothness of the realized random field, and λ > 0 is the spatial scale parameter. Special cases of the Matérn correlation function include the exponential (ν = 1/2) and the Gaussian (ν → ∞) correlation functions. Both
of these correlation functions can be written in a simpler form, i.e.,
Σ(θ) ≡ Σ(λ).
For the exponential correlation function, the ijth element of Σ(λ) is equal to
\[ \exp\!\left( -\frac{d_{ij}}{\lambda} \right), \tag{2.7} \]
and for the Gaussian correlation function, the ijth element of Σ(λ) is equal to
\[ \exp\!\left( -\frac{d_{ij}^2}{\lambda^2} \right), \tag{2.8} \]
where dij and λ are as defined above. There are other parametric correlation functions for
geostatistical data, and we refer the reader to Cressie (1993), Stein (1999), and Banerjee
et al. (2004) for further examples.
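For concreteness, the exponential and Gaussian correlation matrices in (2.7) and (2.8) can be built in R directly from a matrix of site coordinates; this is a sketch with illustrative names.

# Exponential and Gaussian spatial correlation matrices from coordinates
# (one row per location); dist() computes the Euclidean distances d_ij.
exp_corr <- function(coords, lambda) {
  D <- as.matrix(dist(coords))
  exp(-D / lambda)
}
gauss_corr <- function(coords, lambda) {
  D <- as.matrix(dist(coords))
  exp(-(D / lambda)^2)
}
# Example: a 10 x 10 regular grid, as in the simulation study of Chapter 3
grid <- as.matrix(expand.grid(x = 1:10, y = 1:10))
Sigma <- exp_corr(grid, lambda = 2)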
For lattice/gridded data, spatial dependence is modeled by considering spatial neigh-
borhood structures. Neighborhood structures define a set of neighbors for each partition
of a domain D indexed by locations {s_1, …, s_n}. Figure 2.1 illustrates this concept for a regular lattice/grid. In this figure, the grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with empty squares. Figure
2.1 (a) illustrates a first order neighborhood structure, where the ith and jth grid cells are
neighbors if they share a common edge, and (b) illustrates a second order neighborhood
structure, where the ith and jth grid cells are neighbors if they share a common edge or
corner.
A spatial autoregressive structure is commonly used to capture spatial dependence in
models for lattice/gridded data. One example of an autoregressive dependence structure
Figure 2.1: Illustration of neighborhood structures for a regular grid. The grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with the empty squares for (a) a first order neighborhood structure and (b) a second order neighborhood structure.
is the conditionally autoregressive (CAR) model (e.g., Banerjee et al., 2004). In the CAR model,
\[ \Sigma(\theta) \equiv \Sigma(\rho) = (D_w - \rho W)^{-1}, \tag{2.9} \]
where the ijth element of W is equal to w_{ij}, w_{ij} is equal to 1 if grid cell/partition i is a neighbor of cell/partition j and is equal to 0 if cells i and j are not neighbors, D_w is a diagonal matrix with the ith diagonal element equal to w_{i+} = \sum_{j=1}^n w_{ij}, σ² is the variance, and ρ is the spatial dependence parameter. Other autoregressive dependence structures include the simultaneous autoregressive (SAR) model and the spatial moving average (SMA) model (see, e.g., Cressie, 1993).
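A corresponding sketch of the CAR structure in (2.9) for a unit-spaced regular grid, assuming a first order (shared-edge) neighborhood; W and D_w are built directly from the grid coordinates, and all names are illustrative.

# CAR covariance (D_w - rho W)^{-1} for a regular grid with a first order
# neighborhood: cells are neighbors iff their centers are distance 1 apart.
car_sigma <- function(coords, rho) {
  D <- as.matrix(dist(coords))
  W <- (abs(D - 1) < 1e-8) * 1     # adjacency matrix with entries w_ij
  Dw <- diag(rowSums(W))           # diagonal matrix of row sums w_{i+}
  solve(Dw - rho * W)
}
grid <- as.matrix(expand.grid(x = 1:10, y = 1:10))
Sigma_car <- car_sigma(grid, rho = 0.9)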
CHAPTER 3
DATA AUGMENTATION MCMC STRATEGIES
There has been a recent emphasis in the spatial statistics literature on the development of methods that accommodate large data sets. In the Bayesian setting, these methods include dimension reduction techniques (e.g., Higdon, 2002; Xu et al., 2005; Calder, 2007; Banerjee et al., 2008), integrated nested Laplace approximations (Rue et al., 2009), and covariance tapering (see recent work by Shaby and Ruppert, 2010, for a Bayesian treatment of this technique). In this chapter, instead of focusing on model adjustments to accommodate large data sets or approximations to full Bayesian inference procedures, we investigate strategies for efficient Markov chain Monte Carlo (MCMC) algorithms for fitting the
Bayesian spatial probit regression model. While the MCMC strategies discussed here are not necessarily designed to overcome computational challenges associated with massive data sets, in high dimensional settings having efficient algorithms is clearly desirable.
Data augmentation/latent variable methods have been widely recognized for facilitating model fitting of the Bayesian probit regression model. As discussed in Section 2.1.1, the latent variable representation of the Bayesian probit regression model proposed by Albert and Chib (1993) allows model fitting to be performed using a simple Gibbs sampler. To im- prove the efficiency of the Gibbs sampler in this setting, Imai and van Dyk (2005) propose introducing a working parameter (defined in Section 3.1.1) into the model and compare
various data augmentation strategies resulting from different treatments of the working pa-
rameter. In this chapter, we build on this work by investigating the efficiency of modified
and extended versions of these algorithms for the spatial probit regression model, focus-
ing on the special case of binary response variables. These algorithms include the one
previously proposed by De Oliveira (2000), which we discuss further in Section 3.1.1.
In Section 2.2, we defined the latent variable representation of the Bayesian probit
regression model for spatially-referenced binary data. However, as noted in Section 2.2, the
spatial variance parameter, σ2, is not identifiable. In Section 3.1.1, we discuss conditional
and marginal data augmentation strategies for making use of this non-identifiable variance
parameter within an MCMC model-fitting algorithm. We propose three different Gibbs
sampling algorithms corresponding to these strategies. Furthermore, in Section 3.1.2, we
propose modifications to these algorithms by partially collapsing over the sampling steps
within the Gibbs sampler. We compare the various resulting algorithms using a simulation
study in Section 3.2 and in an analysis of satellite-derived land cover data over Southeast
Asia in Section 3.3.
3.1 Data Augmentation MCMC Strategies
3.1.1 Conditional versus Marginal Data Augmentation
In the previous chapter, “data augmentation” referred to the introduction of continuous
latent variables, Z˜, in the Albert and Chib (1993) representation of the Bayesian probit regression model. In this section, we extend our use of “data augmentation” to include conditional and marginal data augmentation MCMC strategies, where we use a working pa-
rameter to identify fast and easily implemented algorithms. In data augmentation MCMC
strategies, the working parameter is frequently taken to be a parameter that is not identifi-
able under the observed data, Y , but is identifiable under the complete or augmented data,
(Y , Z). In the Bayesian spatial probit regression model, the spatial variance parameter, σ2, can serve as a working parameter.
Meng and van Dyk (1999) and van Dyk and Meng (2001) were the first to distinguish between conditional augmentation and marginal augmentation strategies. Under a conditional augmentation strategy, the working parameter is fixed at an optimal constant within the model-fitting algorithm, whereas a marginal augmentation strategy seeks to marginalize over the working parameter within the algorithm. Meng and van Dyk (1999) show that algorithms using a marginal augmentation strategy have a geometric rate of convergence no larger than that of their conditional augmentation counterparts.
Below we describe conditional and marginal augmentation strategies for the Bayesian spatial probit regression model defined in Section 2.2. Referring back to the notation in
Section 2.2, we use Z˜ and β˜ to denote the non-identifiable parameters, and Z and β to denote the identifiable parameters (i.e., Z˜ = σZ and β˜ = σβ).
Consider the likelihood function for the identifiable parameters (β, θ) in the spatial probit regression model introduced in Section 2.2:
\[ L(\beta, \theta \mid Y) \propto p(Y \mid \beta, \theta) = \int p(Y, Z \mid \beta, \theta)\, dZ, \tag{3.1} \]
where (Y, Z) denotes the complete augmented data. As Imai and van Dyk (2005) point out for an independent multi-category response version of the probit regression model, since the working parameter is not identifiable under the observed data likelihood function, we can condition on a fixed value of the working parameter and the likelihood function
will remain unchanged. This conditional augmentation strategy also holds for the spatial version of the model.
Fixing the working parameter σ² to some constant σ₀² results in a conditional augmentation algorithm, and, without loss of generality, we can take σ₀² = 1. Thus, for conditional augmentation, (3.1) becomes
\[ L(\beta, \theta \mid Y) = L(\beta, \theta, \sigma^2 \mid Y) \propto \int p(Y, Z \mid \beta, \theta, \sigma^2 = \sigma_0^2)\, dZ = \int_{A_1} \!\cdots\! \int_{A_n} \frac{1}{(2\pi\sigma_0^2)^{n/2}|\Sigma(\theta)|^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma_0^2} (Z - X\beta)'\Sigma(\theta)^{-1}(Z - X\beta) \right\} dZ, \tag{3.2} \]
where
\[ A_i = \begin{cases} (-\infty, 0], & \text{if } Y_i = 0 \\ (0, \infty), & \text{if } Y_i = 1 \end{cases}. \tag{3.3} \]
Using this conditional augmentation strategy, the associated Gibbs sampling algorithm for sampling from the posterior distribution of (β, θ) is given in Table 3.1 under the Conditional heading in the Non-Collapsed Algorithms section.
As an alternative to conditioning on a specific value of the working parameter, the working parameter can be assigned a proper prior which we can marginalize over to obtain the likelihood of the identifiable parameters. Using this marginal augmentation strategy,
(3.1) can be expressed as
\[ L(\beta, \theta \mid Y) \propto \int \left[ \int p(Y, Z \mid \beta, \theta, \sigma^2)\, \pi(\sigma^2 \mid \beta, \theta)\, d\sigma^2 \right] dZ = \int_{A_1} \!\cdots\! \int_{A_n} \int_0^\infty \frac{1}{(2\pi\sigma^2)^{n/2}|\Sigma(\theta)|^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma^2} (Z - X\beta)'\Sigma(\theta)^{-1}(Z - X\beta) \right\} \pi(\sigma^2 \mid \beta, \theta)\, d\sigma^2\, dZ, \tag{3.4} \]
De Oliveira (2000) implicitly uses this conditional augmentation strategy in his model for spatially-dependent binary data.
where the A_i are as defined in (3.3).
Following Imai and van Dyk (2005), we consider two marginal data augmentation schemes. In the first scheme, labeled Scheme 1, we marginalize over σ2 completely in updating Z. We do this by sampling (σ2)∗ from its prior distribution π(σ2|β, θ) ≡ π(σ2), sampling from Z˜|Y , β, θ, (σ2)∗, and “sweeping” over σ2 by setting Z = Z˜/σ∗. Thus, the sampled Z is dependent on the identifiable parameter β, but not on the non-identifiable parameter β˜. This approach is valid since σ2 is not likelihood identifiable, therefore we can sample Z˜ conditional on any plausible value of σ2. We then sample from the joint full conditional distribution of (σ2, β˜) and from the full conditional distribution of θ and again “sweep” over the sampled value of σ2 by setting β = β˜/σ. The Gibbs sampling algo- rithm associated with this marginal augmentation scheme is listed in Table 3.1 as Marginal-
Scheme 1 in the Non-Collapsed Algorithms section.
In the second marginal augmentation scheme, labeled Scheme 2, we include σ2 in the
Gibbs sampler in the usual way by assigning it a proper prior distribution and sampling from its full conditional distribution. Then, we sweep over σ by properly normalizing the non-identifiable parameters in each iteration of the algorithm (i.e., set Z = Z˜/σ and β =
β˜/σ). This algorithm is listed in Table 3.1 as Marginal-Scheme 2 of the Non-Collapsed
Algorithms.
In Marginal-Scheme 1, when the prior distribution on σ2 is diffuse, conditioning on a value of σ2 from the prior distribution when sampling Z˜ will allow the distribution of Z˜ to be more diffuse than when conditioning on the value of σ2 obtained from the previous iteration of the algorithm, as in Marginal-Scheme 2. Thus, the sampled values of Z˜/σ∗ =
Z when using Marginal-Scheme 1 will have a smaller autocorrelation. In turn, we expect a similar decrease in autocorrelation in the sample paths of β and θ.
Imai and van Dyk (2005) consider similar conditional and marginal augmentation strate-
gies for fitting the Bayesian multinomial probit regression model for independent multi-
category response variables (see Section 2.1.2). However, in their model, the working ˜ parameter is the first diagonal element of the cross-category covariance matrix, Ω1,1. Com- paring their algorithms to that of McCulloch and Rossi (1994) and Nobile (1998), they find that their marginal augmentation algorithms converge more quickly and are less sensitive to starting values. In the following sections, through a simulation study and data analysis, we show similar benefits for MCMC algorithms based on marginal data augmentation for the spatial probit regression model.
Non-Collapsed Algorithms

Marginal-Scheme 1:
Step 1: Sample (σ²)* ~ π(σ²); sample Z̃ | Y, β, θ, (σ²)*; set Z = Z̃/σ*.
Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.
Step 3: Sample θ | Z̃, Y, β̃, σ².

Marginal-Scheme 2:
Step 1: Sample Z̃ | Y, β, θ, σ²; set Z = Z̃/σ.
Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.
Step 3: Sample θ | Z̃, Y, β̃, σ².

Conditional:
Step 1: Sample Z | Y, β, θ, σ² = 1.
Step 2: Sample β | Z̃, Y, θ, σ² = 1.
Step 3: Sample θ | Z, Y, β, σ² = 1.

Partially Collapsed Algorithms

Marginal-Scheme 1:
Step 1: Sample θ | Y, β; sample (σ²)* ~ π(σ²); sample Z̃ | Y, β, θ, (σ²)*; set Z = Z̃/σ*.
Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.

Marginal-Scheme 2:
Step 1: Sample θ | Y, β, σ²; sample Z̃ | Y, β, θ, σ²; set Z = Z̃/σ.
Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.

Conditional:
Step 1: Sample θ | Y, β, σ² = 1; sample Z | Y, β, θ, σ² = 1.
Step 2: Sample β | Z̃, Y, θ, σ² = 1.
Table 3.1: This table lists the steps in each of the data augmentation algorithms. The first portion shows the non-collapsed data augmentation algorithms introduced in Section 3.1.1. The second portion shows the partially collapsed data augmentation algorithms introduced in Section 3.1.2.
3.1.2 Partially Collapsed Algorithms
In this section, we discuss a method for collapsing Steps 1 and 3 in the algorithms introduced in the previous section. In Step 3 of each algorithm, we draw samples of θ from
θ|Z˜, Y , β˜, σ2. Because Z˜ is a vector of real-valued random variables, by conditioning on it (as opposed to the binary vector Y ) we unnecessarily constrain the distribution of values that θ can take at each iteration of the algorithm, particularly for high dimensional
Z˜. Instead of sampling from θ|Z˜, Y , β˜, σ2, we could marginalize over Z˜ and sample from
θ|Y , β˜, σ2. The latter distribution will be more diffuse than the former, and thus intuitively we might expect improvements in the efficiency of the algorithm.
Consider the joint full conditional distribution of θ and Z̃, where
\[ p(\theta, \tilde{Z} \mid Y, \tilde{\beta}, \sigma^2) \propto p(Y, \tilde{Z} \mid \tilde{\beta}, \sigma^2, \theta)\, \pi(\theta). \tag{3.5} \]
Since the left hand side of (3.5) can be decomposed into the product of p(θ|Y , β˜, σ2) and p(Z˜|Y , β˜, σ2, θ), Steps 1 and 3 can be collapsed into a single step where we sample from
θ|Y , β˜, σ2 and then from Z˜|Y , β˜, σ2, θ. From (3.5),
\[ p(\theta \mid Y, \tilde{\beta}, \sigma^2) \propto \left[ \int_{A_1} \!\cdots\! \int_{A_n} p(Y, \tilde{Z} \mid \tilde{\beta}, \sigma^2, \theta)\, d\tilde{Z} \right] \pi(\theta), \tag{3.6} \]
where the quantity in square brackets is simply the volume under the n-dimensional multivariate normal density function corresponding to the orthant of \mathbb{R}^n defined by A_1 × ⋯ × A_n. Thus, a Metropolis-Hastings step can be used to sample from θ | Y, β̃, σ². Sampling from
Z˜|Y , β˜, σ, θ can be done as before. The modified versions of the algorithms introduced in
Section 3.1.1 that collapse Steps 1 and 3 are listed in Table 3.1 under the heading Partially
Collapsed Algorithms, following terminology used by van Dyk and Park (2008).
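The orthant probability in square brackets in (3.6) can be evaluated numerically, for example with the pmvnorm function in the R package mvtnorm. The following sketch assumes the identified scale (σ² = 1); the names are illustrative, and this is one possible numerical approach rather than necessarily the method used in the dissertation.

library(mvtnorm)
# P(Z in A_1 x ... x A_n) under Z ~ N(X beta, Sigma(theta)), where
# A_i = (0, Inf) if Y_i = 1 and (-Inf, 0] if Y_i = 0.
orthant_prob <- function(Y, mu, Sigma) {
  lower <- ifelse(Y == 1, 0, -Inf)
  upper <- ifelse(Y == 1, Inf, 0)
  pmvnorm(lower = lower, upper = upper, mean = mu, sigma = Sigma)
}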
Finally, we note that it is straightforward to show that
p(θ|Y , β˜, σ2) = p(θ|Y , β, σ2 = 1).
Therefore, in the modified Marginal-Scheme 1 algorithm, we do not condition on a partic- ular value of σ2 in sampling θ, thus ensuring that this partially collapsed algorithm samples from the appropriate posterior distribution (see van Dyk and Park, 2008, for related discus- sion).
3.1.3 Full Conditional Distributions
Using the priors β ~ N(0, C_β), σ² ~ a₀(χ²_{v₀})^{-1}, and θ ~ π(θ), we obtain the following full conditional distributions. We use the superscript [t] to denote the value of a parameter at the tth iteration of the algorithm.
Non-collapsed Algorithms
Step 1: Sample Z^{[t]} from Z | Y, β^{[t−1]}, θ^{[t−1]}, (σ²)*:

Each algorithm treats σ² differently: for Marginal-Scheme 1, (σ²)* ~ π(σ²); for Marginal-Scheme 2, (σ²)* = (σ²)^{[t−1]}; and for Conditional, (σ²)* = 1.

For i = 1, …, n, define Z^*_{\neg i} = (Z_1^{[t]}, \ldots, Z_{i-1}^{[t]}, Z_{i+1}^{[t-1]}, \ldots, Z_n^{[t-1]})' and sample Z̃_i from
\[ \tilde{Z}_i \mid Y, Z^*_{\neg i}, \beta^{[t-1]}, \theta^{[t-1]}, (\sigma^2)^* \sim \begin{cases} TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, 0, \infty), & \text{if } Y_i = 1 \\ TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, -\infty, 0), & \text{if } Y_i = 0 \end{cases}, \]
where TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, \ell, u) is a truncated normal distribution with lower and upper bounds \ell and u, respectively, and mean and variance
\[ \mu_{\tilde{z}_i} = x_i'\sigma^*\beta^{[t-1]} + \Sigma(\theta^{[t-1]})_{i,\neg i} \left[ \Sigma(\theta^{[t-1]})_{\neg i,\neg i} \right]^{-1} \left( \sigma^* Z^*_{\neg i} - X_{\neg i}\sigma^*\beta^{[t-1]} \right), \]
\[ \tau^2_{\tilde{z}_i} = (\sigma^2)^* \left( \Sigma(\theta^{[t-1]})_{i,i} - \Sigma(\theta^{[t-1]})_{i,\neg i} \left[ \Sigma(\theta^{[t-1]})_{\neg i,\neg i} \right]^{-1} \Sigma(\theta^{[t-1]})_{\neg i,i} \right). \]
Set Z_i^{[t]} = \tilde{Z}_i/\sigma^*.
Step 2: Sample ((σ²)^{[t]}, β^{[t]}) from σ², β | Y, Z^{[t]}, θ^{[t−1]}:

For Marginal-Scheme 1 and Marginal-Scheme 2,
\[ (\sigma^2)^{[t]} \sim \left( (\tilde{Z} - X\hat{\beta})'\Sigma(\theta^{[t-1]})^{-1}(\tilde{Z} - X\hat{\beta}) + a_0 + \hat{\beta}'C_\beta^{-1}\hat{\beta} \right) (\chi^2_{n+v_0})^{-1}, \]
where \hat{\beta} = \left( X'\Sigma(\theta^{[t-1]})^{-1}X + C_\beta^{-1} \right)^{-1} X'\Sigma(\theta^{[t-1]})^{-1}\tilde{Z} and Z̃ is taken from Step 1.

For Conditional, set (σ²)^{[t]} = 1.

Then, for all algorithms, sample
\[ \tilde{\beta} \sim N\!\left( \hat{\beta},\; (\sigma^2)^{[t]} \left( X'\Sigma(\theta^{[t-1]})^{-1}X + C_\beta^{-1} \right)^{-1} \right) \]
and set β^{[t]} = β̃/σ^{[t]}.
Step 3: Sample θ^{[t]} from θ | Y, Z^{[t]}, β^{[t]}, (σ²)^{[t]} via a Metropolis-Hastings step:

Sample a proposed value, θ*, from a proposal distribution q(θ | θ^{[t−1]}). Take
\[ \theta^{[t]} = \begin{cases} \theta^*, & \text{with probability } c(\theta^{[t-1]}, \theta^*) \\ \theta^{[t-1]}, & \text{with probability } 1 - c(\theta^{[t-1]}, \theta^*) \end{cases}, \]
where
\[ c(\theta^{[t-1]}, \theta^*) = \min\!\left\{ \frac{\pi(\theta^* \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]})}{\pi(\theta^{[t-1]} \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]})}\, \frac{q(\theta^{[t-1]} \mid \theta^*)}{q(\theta^* \mid \theta^{[t-1]})},\; 1 \right\}. \]
In the acceptance probability,
\[ \pi(\theta \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]}) \propto p(Y, \tilde{Z} \mid \tilde{\beta}, \theta, (\sigma^2)^{[t]})\, \pi(\theta) = \phi(\tilde{Z};\, X\tilde{\beta},\, (\sigma^2)^{[t]}\Sigma(\theta))\, \pi(\theta), \]
where \phi(\tilde{Z}; X\tilde{\beta}, \sigma^2\Sigma(\theta)) is the multivariate normal probability density function with mean Xβ̃ and variance σ²Σ(θ) evaluated at Z̃, and Z̃ is taken from Step 1 and β̃ is taken from Step 2.
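To make Steps 2 and 3 concrete, the following is a hedged R sketch of the joint (σ², β̃) draw and of a random-walk Metropolis-Hastings update for λ under the exponential correlation model, using dmvnorm from the mvtnorm package. The proposal scale, prior bounds, and all names are illustrative assumptions, not the settings used in the analyses below.

library(mvtnorm)

# Step 2: joint (sigma^2, beta-tilde) draw under the marginal schemes.
step2_marginal <- function(Zt, X, Sig, C_beta, a0, v0) {
  Sinv <- solve(Sig)
  A    <- solve(t(X) %*% Sinv %*% X + solve(C_beta))
  bhat <- drop(A %*% t(X) %*% Sinv %*% Zt)
  r    <- Zt - drop(X %*% bhat)
  sc   <- drop(t(r) %*% Sinv %*% r) + a0 + drop(t(bhat) %*% solve(C_beta) %*% bhat)
  sig2 <- sc / rchisq(1, df = length(Zt) + v0)   # scaled inverse chi-squared draw
  btil <- bhat + drop(t(chol(sig2 * A)) %*% rnorm(length(bhat)))
  list(sigma2 = sig2, beta_tilde = btil, beta = btil / sqrt(sig2))
}

# Step 3: random-walk MH update for lambda (exponential correlation);
# the symmetric proposal makes the q(.|.) ratio cancel, and the uniform
# prior on (l_lam, u_lam) reduces to a support check.
step3_mh <- function(lambda, Zt, X, btil, sig2, coords,
                     sd_prop = 0.25, l_lam = 0, u_lam = 20) {
  lam_star <- rnorm(1, lambda, sd_prop)
  if (lam_star <= l_lam || lam_star >= u_lam) return(lambda)
  D <- as.matrix(dist(coords))
  lp <- function(lam) dmvnorm(Zt, mean = drop(X %*% btil),
                              sigma = sig2 * exp(-D / lam), log = TRUE)
  if (log(runif(1)) < lp(lam_star) - lp(lambda)) lam_star else lambda
}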
Partially Collapsed Algorithms
Step 1: Sample (θ^{[t]}, Z^{[t]}) from θ, Z | Y, β^{[t−1]}, (σ²)*:

• Sample θ^{[t]} from θ | Y, β^{[t−1]} via a Metropolis-Hastings step.

Sample a proposed value, θ*, from a proposal distribution q(θ | θ^{[t−1]}). Take
\[ \theta^{[t]} = \begin{cases} \theta^*, & \text{with probability } c(\theta^{[t-1]}, \theta^*) \\ \theta^{[t-1]}, & \text{with probability } 1 - c(\theta^{[t-1]}, \theta^*) \end{cases}, \]
where
\[ c(\theta^{[t-1]}, \theta^*) = \min\!\left\{ \frac{\pi(\theta^* \mid Y, \beta^{[t-1]})}{\pi(\theta^{[t-1]} \mid Y, \beta^{[t-1]})}\, \frac{q(\theta^{[t-1]} \mid \theta^*)}{q(\theta^* \mid \theta^{[t-1]})},\; 1 \right\}. \]
In the acceptance probability,
\[ \pi(\theta \mid Y, \beta^{[t-1]}) \propto p(Y \mid \beta^{[t-1]}, \theta)\, \pi(\theta) = \left[ \int_{A_1} \!\cdots\! \int_{A_n} \phi(Z;\, X\beta^{[t-1]},\, \Sigma(\theta))\, dZ \right] \pi(\theta), \]
where \phi(Z; X\beta, \Sigma(\theta)) is the multivariate normal probability density function with mean Xβ and variance Σ(θ) evaluated at Z and the A_i are as defined in (3.3).
• Sample Z^{[t]} | Y, β^{[t−1]}, θ^{[t]}, (σ²)*.

Each algorithm treats σ² differently: for Marginal-Scheme 1, (σ²)* ~ π(σ²); for Marginal-Scheme 2, (σ²)* = (σ²)^{[t−1]}; and for Conditional, (σ²)* = 1.

For i = 1, …, n, define Z^*_{\neg i} = (Z_1^{[t]}, \ldots, Z_{i-1}^{[t]}, Z_{i+1}^{[t-1]}, \ldots, Z_n^{[t-1]})' and sample Z̃_i from
\[ \tilde{Z}_i \mid Y, Z^*_{\neg i}, \beta^{[t-1]}, \theta^{[t]}, (\sigma^2)^* \sim \begin{cases} TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, 0, \infty), & \text{if } Y_i = 1 \\ TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, -\infty, 0), & \text{if } Y_i = 0 \end{cases}, \]
where TN(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, \ell, u) is a truncated normal distribution with lower and upper bounds \ell and u, respectively, and mean and variance
\[ \mu_{\tilde{z}_i} = x_i'\sigma^*\beta^{[t-1]} + \Sigma(\theta^{[t]})_{i,\neg i} \left[ \Sigma(\theta^{[t]})_{\neg i,\neg i} \right]^{-1} \left( \sigma^* Z^*_{\neg i} - X_{\neg i}\sigma^*\beta^{[t-1]} \right), \]
\[ \tau^2_{\tilde{z}_i} = (\sigma^2)^* \left( \Sigma(\theta^{[t]})_{i,i} - \Sigma(\theta^{[t]})_{i,\neg i} \left[ \Sigma(\theta^{[t]})_{\neg i,\neg i} \right]^{-1} \Sigma(\theta^{[t]})_{\neg i,i} \right). \]
Set Z_i^{[t]} = \tilde{Z}_i/\sigma^*.
Step 2: Sample ((σ²)^{[t]}, β^{[t]}) from σ², β | Y, Z^{[t]}, θ^{[t]}:

For Marginal-Scheme 1 and Marginal-Scheme 2,
\[ (\sigma^2)^{[t]} \sim \left( (\tilde{Z} - X\hat{\beta})'\Sigma(\theta^{[t]})^{-1}(\tilde{Z} - X\hat{\beta}) + a_0 + \hat{\beta}'C_\beta^{-1}\hat{\beta} \right) (\chi^2_{n+v_0})^{-1}, \]
where \hat{\beta} = \left( X'\Sigma(\theta^{[t]})^{-1}X + C_\beta^{-1} \right)^{-1} X'\Sigma(\theta^{[t]})^{-1}\tilde{Z}, and Z̃ is taken from Step 1.

For Conditional, set (σ²)^{[t]} = 1.

Then, for all algorithms, sample
\[ \tilde{\beta} \sim N\!\left( \hat{\beta},\; (\sigma^2)^{[t]} \left( X'\Sigma(\theta^{[t]})^{-1}X + C_\beta^{-1} \right)^{-1} \right) \]
and set β^{[t]} = β̃/σ^{[t]}.
3.2 Simulation Study
3.2.1 Simulation Set-up
In our simulation study, we compare each of the six proposed algorithms in terms of
computational efficiency and sensitivity to starting values. We consider data sets of sample
size n = 100 corresponding to observations on a 10 × 10 regular grid. We generate our
data from the spatial probit regression model given in (2.5) and (2.6) with a single covariate x_i = x_i and regression coefficient β = β. We consider a geostatistical dependence structure based on an exponential covariance function as defined in (2.7), as well as an autoregressive structure based on the CAR model as defined in (2.9). We also consider an independent covariance structure (as in Section 2.1.1), i.e., Σ(θ) ≡ I_n, to use as a baseline for comparison.
When fitting the independent model, we use the three data augmentation algorithms, without fitting the spatial dependence parameter. We note that the issue of collapsing these algorithms is not relevant in this case. These algorithms are identical to those used by
Imai and van Dyk (2005) for the binary response special case of the multi-category probit regression model. We show how adding spatial dependence to the model can impact the convergence of the algorithms.
Under the three dependence structures, we consider both the non-collapsed and partially collapsed algorithms introduced in Section 3.1 and define the five scenarios listed in Table
3.2. For each scenario, we compare the algorithms resulting from the various conditional and marginal data augmentation strategies (i.e., Marginal-Scheme 1, Marginal-Scheme 2, and Conditional).
Scenario   Spatial Dependence Structure   Partially Collapsed
1          CAR                            No
2          CAR                            Yes
3          Geostatistical                 No
4          Geostatistical                 Yes
5          Independent                    –
Table 3.2: Scenarios used to compare the marginal and conditional data augmentation algorithms.
In assigning prior distributions, we must ensure that the resulting models are the same across all al-
gorithms. Thus, where applicable, we assign priors on identifiable parameters (i.e., on β
rather than β˜). Furthermore, because of the non-identifiability within our model, it is im-
portant that the prior distributions on the parameters are proper. For Scenarios 1-5, we
assign a normal prior distribution to β (i.e., β ∼ N(mβ,Cβ) with mβ = 0 and Cβ = 100).
Each algorithm uses the non-identifiable working parameter, σ, differently. For the condi-
tional augmentation algorithm, σ is fixed at 1. For both marginal augmentation algorithms,
σ² ~ a_σ(χ²_{v_σ})^{-1}, where a_σ = 3 and (χ²_{v_σ})^{-1} represents an inverse chi-squared distribution with parameter v_σ = 3. The spatial dependence structures have different parameterizations, necessitating different prior distributions on the spatial dependence parameters. For the CAR spatial dependence structure, ρ ~ Unif(1/ξ_{(1)}, 1/ξ_{(n)}), where ξ_{(1)} and ξ_{(n)} are the smallest and largest eigenvalues of D_w^{-1/2} W D_w^{-1/2} (Banerjee et al., 2004, pg. 80). For the geostatistical dependence structure, λ ~ Unif(l_λ, u_λ) with l_λ = 0 and u_λ = 20.
In our simulation study, we generate data sets following the simulation example used in Nobile (1998) and Imai and van Dyk (2005) for independent binary data, including spatial dependence where appropriate: we independently generate the covariates x_i from the uniform distribution on the interval (−.5, .5); take β = −√2, σ² = 1, and β̃ = σβ; and set the spatial dependence parameters to ρ = .9 (CAR structure) and λ = 2 (geostatistical structure). For each of the data sets generated under the two spatial dependence structures
(CAR and geostatistical), we fit the corresponding model using both the non-collapsed and partially collapsed algorithms. For the independence case, we fit the model using the
(non-collapsed) algorithms.
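For reference, a short R sketch of the data-generating step for the geostatistical scenarios (illustrative seed and names):

# Generate one data set: Unif(-.5, .5) covariate, beta = -sqrt(2),
# sigma^2 = 1, exponential correlation with lambda = 2 on a 10 x 10 grid.
set.seed(1)
grid  <- as.matrix(expand.grid(x = 1:10, y = 1:10))
n     <- nrow(grid)
x     <- runif(n, -0.5, 0.5)
beta  <- -sqrt(2)
Sigma <- exp(-as.matrix(dist(grid)) / 2)
Zt    <- x * beta + drop(t(chol(Sigma)) %*% rnorm(n))  # latent Gaussian field
Y     <- as.integer(Zt > 0)                            # clipped binary responses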
To compare the augmentation algorithms in terms of sensitivity to starting values, we consider two different starting values for (σ, β), namely (σ, β) = (√2, −√2) and (σ, β) =
(10, −2). Each algorithm is run for 50,000 iterations, and we somewhat arbitrarily take the
first 10,000 iterations to be the burn-in period.
3.2.2 Simulation Results
Using the five scenarios in Table 3.2, we compare the various algorithms in terms of mixing, convergence, and sensitivity to starting values. As shown by the histograms in Fig- ures 3.1 - 3.5, the algorithms result in nearly identical inferences on the posterior distribu- tions of the identifiable parameters. Thus, with the inferences consistent across algorithms, the algorithms can be compared in terms of their efficiency.
First, we note the trace plots of β and λ under the conditional algorithm of Scenario
3 shown in Figure 3.3. Here, we see evidence of potential lack of convergence from the different behavior in the chain near iteration 50,000. On the other hand, there is no indica- tion of convergence problems in the trace plots of the two Scenario 3 marginal algorithms, nor in the trace plots for Scenario 5 algorithms – the algorithms for the independent probit regression model – as seen in Figure 3.5. This difference provides evidence of additional difficulties in fitting spatial probit regression models and the need for more efficient MCMC algorithms.
Autocorrelation and partial autocorrelation plots of the (post burn-in) sample paths of model parameters also help in determining whether an MCMC algorithm is mixing well.
Figures 3.6 and 3.8 show the autocorrelation and partial autocorrelation plots for the re- gression coefficient and spatial dependence parameter for Scenarios 1 and 3. It is clear that among the three non-collapsed algorithms, Marginal-Scheme 1 results in the smallest and most quickly decreasing autocorrelation in both parameters’ paths under Scenario 3. Un- der Scenario 1, the top row plots show that the sample path of β under Marginal-Scheme
1 again has the smallest and most quickly decreasing autocorrelation, but the improve-
ment provided by the marginalization strategy is less apparent for the spatial dependence
parameter, ρ.
We also compare the convergence of the three partially collapsed algorithms using au-
tocorrelation plots. Figures 3.7 and 3.9 show the autocorrelation and partial autocorrelation
plots of the regression coefficient and the spatial dependence parameter sample paths for
Scenarios 2 and 4. Here we see that, when compared with the other partially collapsed
algorithms, Marginal-Scheme 1 is again superior when we compare the autocorrelation of
the sampled parameter values using each of the three partially-collapsed algorithms. We
might also expect partial collapsing of the algorithms to further improve autocorrelation
summaries. However, for the CAR structure, partially collapsing the algorithms does not
result in improved autocorrelation summaries, as seen by comparing the autocorrelations
of Figure 3.6 and 3.7. On the other hand, for the geostatistical spatial dependence structure,
partially collapsing the algorithms does appear to improve mixing, as seen by comparing
the autocorrelations of Figures 3.8 and 3.9. In this scenario, the most noticeable benefits of
collapsing appear to be for the conditional algorithms. The differences between the non-
collapsed and partially collapsed marginal algorithms are not nearly as strong. Given the
significant increase in computation time required to run the partially collapsed algorithms
compared with their non-collapsed counterparts (it can take roughly 12 times as long to
generate the same number of posterior samples), partially collapsed marginal augmenta-
tion algorithms appear not to be worthwhile.
All scenarios showed that Marginal-Scheme 1 was the least sensitive to starting values.
Figures 3.11 - 3.15 show scatter plots of sampled pairs of σ versus β˜ for Scenarios 1 - 5 generated under each of the three augmentation algorithms and both starting values. The
53 black dots show the burn-in samples and the colored dots show the draws from the posterior.
Under both starting values, Marginal-Scheme 1 appears to immediately generate samples from the stationary posterior distribution and both parameters easily move around the entire parameter space. However, under the second set of starting values, Marginal-Scheme 2 takes longer to converge and the parameters seem to move around the parameter space more slowly.
Based on our simulation study, we recommend fitting the spatial probit regression model using the non-collapsed Marginal-Scheme 1 algorithm. This algorithm showed su- perior mixing and convergence properties compared to the Marginal-Scheme 2 and Condi- tional algorithms. When computation burden is taken into account, it does not appear that partially collapsing this algorithm is beneficial. In the next section, we compare the per- formance of the non-collapsed Marginal-Scheme 1 and Conditional algorithms in an anal- ysis of land cover data. These algorithms were selected based on the simulation findings
(non-collapsed Marginal-Scheme 1) and existing literature (non-collapsed Conditional; for example, as in De Oliveira, 2000).
3.3 Application
To illustrate our methods, we use a portion of the data described in Section 1.4. For this analysis, we considered the region bounded by 17◦ to 19◦N and 98◦ to 100◦E, which covers a portion of northwestern Thailand and a small part of Myanmar. Using a 24 × 24 grid over this region, we collapsed the response variable to two categories, forest and non-forest: the land cover response variable associated with each grid cell was taken to be the most common observed land cover type within the cell, with forest coded as "1" and non-forest coded as "0". We used the covariate distance to the nearest major road in our analysis, and defined it over the grid to be the median distance for all measurements of this covariate within a grid cell.
Using the spatial probit regression model defined by (2.5) and (2.6) with the CAR spa-
tial dependence structure given in (2.9), we model the binary land cover response variable
and the distance to the nearest major road covariate. Unlike the simulation study, here we
include an intercept parameter. We expect that less accessible locations (high distance to
nearest major road) are more likely to be forested.
We fit the model using the non-collapsed Marginal-Scheme 1 and Conditional algo-
rithms defined in Table 3.1 and compare the two algorithms in terms of mixing based on
the autocorrelation in the sample paths of the model parameters. Each algorithm was run
for 80,000 iterations, and after examining trace plots we decided to discard the first 10,000
samples as burn-in.
Inferences on the model parameters are nearly identical for both model-fitting algo-
rithms. As expected, the estimate for the regression coefficient associated with the dis-
tance to nearest major road covariate is positive (95 percent credible interval on β1 is
(1.687, 3.428)) so that the farther a location is from a major road (i.e., less accessible), the more likely that location is to be forested. The intercept is not significantly different from 0 (95 percent credible interval on β0 is (−1.094, 0.165)), and the 95 percent credible interval for ρ is (0.968, 0.999) indicating strong residual spatial dependence.
Rather than showing sample autocorrelation plots as in Section 3.2, to highlight the differences between the sample autocorrelation summaries, Table 3.3 provides the autocor- relations for the sample paths of β1 and ρ at selected lags. We do not show the autocorrela- tion values for the sample path of β0 because they were approximately zero for all lags and
both algorithms. Table 3.3 shows that the Marginal-Scheme 1 algorithm outperforms the
Conditional algorithm, exhibiting smaller autocorrelation in the sampled paths of β1 and ρ.
Sample Autocorrelations for β₁

Lag   Marginal-Scheme 1   Conditional
 1    0.3194              0.3576
 2    0.1791              0.2133
 3    0.1115              0.1387
 4    0.0786              0.1004
 5    0.0617              0.0741
10    0.0256              0.0284

Sample Autocorrelations for ρ

Lag   Marginal-Scheme 1   Conditional
 1    0.8514              0.8621
 2    0.7303              0.7496
 3    0.6319              0.6576
 4    0.5515              0.5802
 5    0.4849              0.5134
10    0.2589              0.2898
15    0.1447              0.1752
20    0.0830              0.1081
Table 3.3: Autocorrelations of the sample paths of β1 and ρ for the land cover data analysis.
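The autocorrelation summaries in Table 3.3 can be computed from the post-burn-in sample paths with R's acf function; a sketch, where beta1_draws and rho_draws stand for hypothetical vectors of posterior draws:

burn <- 10000
ac_b1  <- acf(beta1_draws[-(1:burn)], lag.max = 10, plot = FALSE)$acf
ac_rho <- acf(rho_draws[-(1:burn)],  lag.max = 20, plot = FALSE)$acf
ac_b1[c(2:6, 11)]   # lags 1-5 and 10 (the first element is lag 0)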
3.4 Summary
In this chapter, we extended the algorithms of Imai and van Dyk (2005) to the spatially- dependent setting and compared the efficiency of three MCMC algorithms via a simulation study and data analysis. Furthermore, we proposed and compared three additional MCMC algorithms, called partially-collapsed algorithms, which marginalize over the latent vari- able when sampling the spatial dependence parameter. In both the simulation study and
data analysis, we found that the non-collapsed Marginal-Scheme 1 algorithm was the most efficient in terms of autocorrelation, sensitivity to starting values, and computational time.
Figure 3.1: Histograms and trace plots for β and ρ under Scenario 1 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
Figure 3.2: Histograms and trace plots for β and ρ under Scenario 2 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
Figure 3.3: Histograms and trace plots for β and λ under Scenario 3 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
Figure 3.4: Histograms and trace plots for β and λ under Scenario 4 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
Figure 3.5: Histograms and trace plots for β under Scenario 5 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.
Figure 3.6: Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 1 for each of the three corresponding algorithms.
Figure 3.7: Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 2 for each of the three corresponding algorithms.
Figure 3.8: Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 3 for each of the three corresponding algorithms.
Figure 3.9: Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 4 for each of the three corresponding algorithms.
Figure 3.10: Autocorrelation and partial autocorrelation in the sample path of β under Scenario 5 for each of the three corresponding algorithms.
Figure 3.11: σ vs. β̃ under Scenario 1 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
Figure 3.12: σ vs. β̃ under Scenario 2 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
Figure 3.13: σ vs. β̃ under Scenario 3 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
Figure 3.14: σ vs. β̃ under Scenario 4 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
Figure 3.15: σ vs. β̃ under Scenario 5 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.
CHAPTER 4
THE BAYESIAN SPATIAL PROBIT REGRESSION MODEL AS A TOOL FOR CLASSIFICATION
There are often two primary goals in regression analyses. The first is to model the relationship between the response variable and the covariates. The second is to accurately predict missing or unobserved responses. Up to this point, we have focused on the first goal, emphasizing the need to allow for residual spatial dependence when analyzing relationships between spatially-referenced binary response variables and associated covariates. In this chapter, we focus on the second goal, prediction.
In some sense, prediction of binary or categorical outcomes can be thought of as a classification problem, where we determine into which class, or category, a collection of
predictors/inputs, or covariates, is likely to fall. For example, in predicting the land cover
category at an unobserved location, we observe a collection of predictors/inputs, such as
elevation or distance to the nearest major road. Using these predictors/inputs, we can de-
termine some decision function which defines a classification rule for classifying each lo-
cation. Within the classification literature, the decision function is determined using one of
a number of available classification methods, each method using statistical decision theory
to determine optimal classification rules. Popular classification techniques, however, do
73 not typically take into account the categories of observations nearby in geographical space.
Instead, the decision function relies only on a function of the predictors/inputs.
In classification problems involving spatially-referenced observations, we argue that neighboring observations along with predictors/inputs should be considered. As we will illustrate, classification rules that rely on neighboring observations can be derived from the
Bayesian spatial probit regression model.
The goal of this chapter is to compare, in terms of classification error, the Bayesian spatial probit regression model-based classifier to various other classifiers. We first define notation for the classification problem in Section 4.1. In Section 4.2 we discuss classi-
fication in the regression setting and, using the Bayesian spatial probit regression model, define a classification rule for spatially-referenced observations. In Section 4.3 we give an overview of popular classification methods, which in Section 4.4, we then compare to the Bayesian spatial probit regression model-based classifier in terms of both the training
(in-sample prediction) and test (out-of-sample prediction) error rates for the Southeast Asia land cover data set used in Section 3.3.
4.1 The Classification Problem
In the classification problem, we again have the set of paired observations {(Yi, xi); i =
1, . . . , n}, where Yi is a binary response variable and xi is a k × 1 vector of covariates used as predictors/inputs in determining decision boundaries for the classes. In the classification setting, the response variable is an indicator identifying the class to which the set of ob- served predictors/inputs belong. Although many of the classification methods listed below can be generalized to the multi-category/class setting, we restrict our description of these
methods to the binary setting. In this setting there are two classes, C_0 and C_1, where C_j rep-
resents the class of observations where Y = j. Of the observations {(Yi, xi); i = 1, . . . , n}, n0 fall into class C0 and n1 fall into class C1, and n0 + n1 = n. Each classification method
defines a decision function δ(ω) where ω is the set of applicable predictors/inputs, model parameters, and in the spatial setting, surrounding observations. Based on this decision function, we can define a classification rule.
Before discussing each of the classification methods, we first give a brief overview of the classification process. Typically, when using classification methods, the sample data is randomly divided into two subsets, the training and test data sets, of sizes ntrain and ntest, respectively. This allows us to fit the model on which the classifier is based using the training data and evaluate the method’s ability to predict based on the test data set. When we compare classification methods, we consider the training and test error rates, or the number of incorrect classifications among training and test data sets, divided by the sample size of each data set, respectively.
Furthermore, some of the classifiers rely on fixed parameters that are not estimated, and thus, require tuning to minimize the prediction error. In this case, it is common to use
five-fold cross-validation to obtain an optimal value for the fixed parameters. We do this by dividing the training data set into five equal subsets. We then repeatedly fit the model, each time leaving the mth subset out, and compute the error rate for the mth subset for
each m = 1,..., 5. Averaging across the five error rates determines the cross-validation
error (CVE). This procedure is repeated for various fixed values of the tuning parameter to
determine a satisfactory value for the analysis.
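A generic R sketch of this five-fold cross-validation loop follows; fit_fun and err_fun are hypothetical placeholders for the classifier being tuned and its error computation.

# Five-fold cross-validation error (CVE) for one value of a tuning parameter.
cv_error <- function(Y, X, tuning, fit_fun, err_fun, n_folds = 5) {
  fold <- sample(rep(1:n_folds, length.out = length(Y)))
  errs <- sapply(1:n_folds, function(m) {
    fit <- fit_fun(Y[fold != m], X[fold != m, , drop = FALSE], tuning)
    err_fun(fit, Y[fold == m], X[fold == m, , drop = FALSE])
  })
  mean(errs)
}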
In each of the following sections, we generically use {(Y_i, x_i); i = 1, …, n} as the data used for fitting the method, and (Y^0, x^0) as the point of prediction or classification.
4.2 GLM-Based Classification
In this section, we first discuss the standard approach for classification using a GLM
and the decision boundaries and classification rules for two commonly used GLMs. We
then introduce a spatial classification technique using the Bayesian spatial probit regres-
sion model. In specifying our underlying probability models, although a slight misuse of
notation, we explicitly condition on x to make it clear which predictors/inputs we rely on for classification. Furthermore, this also makes the classification rule comparable to that of discriminant analysis as discussed in Section 4.3.1. Finally, we note that in GLM-based classification, the xi include a term allowing for an intercept along with the other predic- tors/inputs, as is usually done in regression analyses.
4.2.1 Non-Spatial GLM-Based Classification
Consider the GLM for independent binary response variables
\[ Y_i \mid x_i, \beta \sim \mathrm{Bernoulli}(p_i), \qquad g(p_i) = x_i'\beta, \tag{4.1} \]
where p_i = P(Y_i = 1 | x_i, β), x_i is a fixed k × 1 vector of covariates, β is a k × 1 vector of coefficients, and g(·) is a link function. To classify x^0, in regression analyses we predict
Y^0 = 1 when P(Y^0 = 1 | x^0, β) > P(Y^0 = 0 | x^0, β) and Y^0 = 0 otherwise. This prediction method can be turned into a classification rule as follows.
Analogous to prediction in the regression setting, we define the following decision func- tion,
\[ \delta(\omega) \equiv \delta(x^0, \beta) = \frac{P(Y^0 = 1 \mid x^0, \beta)}{P(Y^0 = 0 \mid x^0, \beta)} = \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)}, \tag{4.2} \]
and set it equal to one. Equivalently, the decision function,
\[ \delta(\omega) \equiv \delta^*(x^0, \beta) \equiv \log\!\left\{ \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} \right\}, \tag{4.3} \]
can be set equal to zero.
Based on (4.2), a classification rule is then
\[ Y^0 = \begin{cases} 1, & \text{if } \delta(x^0, \beta) > 1 \\ 0, & \text{otherwise} \end{cases}. \tag{4.4} \]
Similarly, based on (4.3), a classification rule is
\[ Y^0 = \begin{cases} 1, & \text{if } \delta^*(x^0, \beta) > 0 \\ 0, & \text{otherwise} \end{cases}. \tag{4.5} \]
A popular form for the link function, g(·), is the logit link, where
\[ g(p_i) = \log\!\left( \frac{p_i}{1 - p_i} \right). \]
Using this link function, the decision function in (4.3) becomes
\[ \delta^*(x^0, \beta) = \log\!\left\{ \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} \right\} = x^{0\prime}\beta \tag{4.6} \]
and the classification rule in (4.5) is
\[ Y^0 = \begin{cases} 1, & \text{if } x^{0\prime}\beta > 0 \\ 0, & \text{otherwise} \end{cases}. \tag{4.7} \]
Although there are many approaches for estimating β in a frequentist setting, we estimate
β numerically by maximizing the likelihood function,
\[ L(\beta; Y) = \prod_{i=1}^n p_i^{Y_i}(1 - p_i)^{1-Y_i} = \prod_{i=1}^n \frac{\exp\{Y_i x_i'\beta\}}{1 + \exp\{x_i'\beta\}} \]
with respect to β.
Another common link function is the probit link, where
\[ g(p_i) = \Phi^{-1}(p_i) \]
and Φ is the cumulative standard normal distribution function. Using this link function, the decision function in (4.2) is then
\[ \delta(x^0, \beta) = \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} = \frac{\Phi(x^{0\prime}\beta)}{1 - \Phi(x^{0\prime}\beta)} \tag{4.8} \]
and the classification rule is
\[ Y^0 = \begin{cases} 1, & \text{if } \Phi(x^{0\prime}\beta)/(1 - \Phi(x^{0\prime}\beta)) > 1 \\ 0, & \text{otherwise} \end{cases}. \tag{4.9} \]
Again, β can be estimated a number of ways, one of which is to estimate β numerically by maximizing the likelihood function,
\[ L(\beta; Y) = \prod_{i=1}^n p_i^{Y_i}(1 - p_i)^{1-Y_i} = \prod_{i=1}^n \Phi(x_i'\beta)^{Y_i} \left( 1 - \Phi(x_i'\beta) \right)^{1-Y_i} \]
with respect to β. We also consider a Bayesian approach.
Consider the latent variable representation of the Bayesian probit regression model as defined in Section 2.1.1. Without loss of generality, assume σ2 = 1, β˜ = β, and Z˜ = Z.
Using this representation, P (Y = 1|x, β) = P (Z > 0|x, β) and the decision function is:
\[ \delta(x^0, \beta) = \frac{P(Y^0 = 1 \mid x^0, \beta)}{P(Y^0 = 0 \mid x^0, \beta)} = \frac{P(Z^0 > 0 \mid x^0, \beta)}{P(Z^0 \le 0 \mid x^0, \beta)} = \frac{\Phi(x^{0\prime}\beta)}{1 - \Phi(x^{0\prime}\beta)}, \]
which corresponds to the decision function given by (4.8). Note that in this case, the
decision function is the posterior odds. We can use the classification rule given in (4.9)
which, in this setting, corresponds to a 0-1 loss function (i.e., the loss for predicting Y^0 correctly is 0 and the loss for predicting Y^0 incorrectly is 1). For consistency, we continue to use the frequentist notation for the decision function and classification rule as initially defined earlier in this chapter.
We can estimate P(Y^0 = 1 | x^0, β) using one of the following two approaches. The first is to use a posterior mean classifier, where we estimate E[β | Y], the posterior mean of β, and use this estimate to obtain P(Y^0 = 1 | x^0, \hat{β}) = P(Z^0 > 0 | x^0, \hat{β}) = Φ(x^{0\prime}\hat{β}). In practice, we approximate E[β | Y] by \hat{β} = \sum_{t=1}^T β^{[t]}/T, where the β^{[t]} are draws from the posterior distribution of β and T is the number of draws from the distribution. This first representation is analogous to the estimation method of the previous likelihood approach.
The second is to use a posterior predictive classifier p^0 = E[I(Z^0 > 0) | x^0, Y], the posterior probability that Z^0 > 0. In doing this, we marginalize over the posterior distribution of β to obtain an estimate of P(Y^0 = 1 | x^0, Y) = \int P(Z^0 > 0 | x^0, β)\, π(β | Y)\, dβ. In practice, we estimate p^0 by \hat{p}^0 = \sum_{t=1}^T I(Z^{0[t]} > 0)/T, where the Z^{0[t]} are draws from the posterior distribution of Z^0 and T is the number of draws from the distribution. Thus, for the independent Bayesian probit regression model, the two estimated classification rules are:
1. Posterior Mean Classifier:
\[ Y^0 = \begin{cases} 1, & \text{if } \Phi(x^{0\prime}\hat{\beta})/(1 - \Phi(x^{0\prime}\hat{\beta})) > 1 \\ 0, & \text{otherwise} \end{cases} \tag{4.10} \]

2. Posterior Predictive Classifier:
\[ Y^0 = \begin{cases} 1, & \text{if } \hat{p}^0/(1 - \hat{p}^0) > 1 \\ 0, & \text{otherwise} \end{cases} \tag{4.11} \]
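A sketch of these two estimated rules in R, given a T × k matrix of posterior draws of β; the names are illustrative, and the posterior predictive version simulates Z^0 from its conditional N(x^0'β, 1) distribution at each posterior draw.

# Posterior mean classifier (4.10): plug in the posterior mean of beta.
pm_classifier <- function(beta_draws, x0) {
  bhat <- colMeans(beta_draws)
  as.integer(pnorm(sum(x0 * bhat)) > 0.5)   # odds > 1 iff probability > 1/2
}
# Posterior predictive classifier (4.11): average I(Z0 > 0) over draws.
pp_classifier <- function(beta_draws, x0) {
  z0 <- drop(beta_draws %*% x0) + rnorm(nrow(beta_draws))
  as.integer(mean(z0 > 0) > 0.5)
}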
As discussed in Section 4.1, when using classification techniques, we compute error
rates for both the training data (or the observed data used to fit the model) and the test
data (or the observed data not used to fit the model). For the test data, we can simply use
the latent Zj, for j = 1, . . . , ntest, as sampled within the Gibbs sampler. However, for the training data, the latent Zi, for i = 1, . . . , ntrain, are sampled within the Gibbs sampler given the observed Yi. In this case, to use the latent Zi as predictors, we must sample these
values as if the Yi are unknown. Using the posterior draws of β, we can resample the latent
Zi as follows:
(i) Take samples β^{[t]}, for t = 1, …, T, from π(β | Y) and sample a corresponding Z_i^{[t]} ~ N(x_i'β^{[t]}, 1). (Note that these Z_i^{[t]} are not those sampled in the MCMC algorithm.)

(ii) Determine \hat{p}_i^0 = \sum_{t=1}^T I(Z_i^{[t]} > 0)/T and let Y_i^0 be the predicted value of Y_i using the posterior predictive classifier.

(iii) Repeat (i) and (ii) for all i = 1, …, n_train.

(iv) Compute the training error rate for the posterior predictive classifier: \sum_{i=1}^{n_{train}} I(Y_i^0 \ne Y_i)/n_{train}. (A sketch of this procedure in R appears below.)
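A sketch of this resampling procedure in R, with illustrative names; beta_draws is the T × k matrix of posterior draws and (Y, X) are the training data.

# Training error for the independent probit posterior predictive classifier.
training_error <- function(beta_draws, X, Y) {
  T_draws <- nrow(beta_draws)
  Yhat <- apply(X, 1, function(xi) {
    z <- drop(beta_draws %*% xi) + rnorm(T_draws)  # resample Z_i ignoring Y_i
    as.integer(mean(z > 0) > 0.5)
  })
  mean(Yhat != Y)   # proportion of misclassified training observations
}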
4.2.2 Spatial GLM-Based Classification
As discussed in Section 2.2, the latent variable representation of the probit regression model allows us to include spatial dependence among the response variables. Here we propose extending the use of this model to the classification setting. In this case, the deci- sion function δ(ω) depends on the covariates and regression coefficients, as well as on the categories of the surrounding observations.
Using equations (2.5) and (2.6), it follows that the distribution for the latent variable at an unobserved location is
0 Z |Y , X, β, θ, Z ∼ N(µZ0 , σZ0 ) where
00 0 −1 0 µZ0 = x β + σ(θ)-1 (Σ(θ)) (Z − X β)
2 0 −1 σZ0 = σ(θ)1 − σ(θ)-1 (Σ(θ)) σ(θ)-1
0 0 X = (x1,..., xn) , σ(θ) is an (n + 1) × 1 vector representing the variance of Z and
0 0 0 (Z , Z ) , σ(θ)-1 is σ(θ) with the first element removed, and Σ(θ) is the spatial correlation
0 matrix among Z = (Z1,...,Zn) . We can easily include sampling from this distribution in the first step of the MCMC algorithm (see Section 3.1.1 and Table 3.1), so that we can obtain draws from the posterior distribution of Z0. In this spatially-dependent case,
$$P(Y^0 = 1 \mid \mathbf{x}^0, \boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{Y}, \mathbf{Z}) = P(Z^0 > 0 \mid \mathbf{x}^0, \boldsymbol{\beta}, \boldsymbol{\theta}, \mathbf{Y}, \mathbf{Z}) = \Phi\!\left(\frac{\mu_{Z^0}}{\sqrt{\sigma^2_{Z^0}}}\right).$$
Thus, the decision function at $\mathbf{x}^0$ is
$$\delta(\omega) \equiv \delta(\mathbf{x}^0, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\beta}, \boldsymbol{\theta}) = \frac{P(Z^0 > 0 \mid \mathbf{x}^0, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\beta}, \boldsymbol{\theta})}{1 - P(Z^0 > 0 \mid \mathbf{x}^0, \mathbf{Y}, \mathbf{Z}, \boldsymbol{\beta}, \boldsymbol{\theta})}.$$
Just as for the independent Bayesian probit regression model, we can consider two ways to obtain an estimated classification rule:
1. Posterior Mean Classifier:

Estimate $E[\boldsymbol{\beta} \mid \mathbf{Y}]$, $E[\boldsymbol{\theta} \mid \mathbf{Y}]$, and $E[\mathbf{Z} \mid \mathbf{Y}]$ by computing $\hat{\boldsymbol{\beta}} = \sum_{t=1}^{T} \boldsymbol{\beta}^{[t]}/T$, $\hat{\boldsymbol{\theta}} = \sum_{t=1}^{T} \boldsymbol{\theta}^{[t]}/T$, and $\hat{\mathbf{Z}} = \sum_{t=1}^{T} \mathbf{Z}^{[t]}/T$, respectively. The estimated classification rule is
$$Y^0 = \begin{cases} 1, & \text{if } \dfrac{\Phi\!\left(\hat{\mu}_{Z^0}/\sqrt{\hat{\sigma}^2_{Z^0}}\right)}{1 - \Phi\!\left(\hat{\mu}_{Z^0}/\sqrt{\hat{\sigma}^2_{Z^0}}\right)} > 1 \\ 0, & \text{otherwise} \end{cases} \qquad (4.12)$$
where
$$\hat{\mu}_{Z^0} = \mathbf{x}^{0\prime}\hat{\boldsymbol{\beta}} + \boldsymbol{\sigma}(\hat{\boldsymbol{\theta}})_{-1}' \left(\Sigma(\hat{\boldsymbol{\theta}})\right)^{-1} (\hat{\mathbf{Z}} - \mathbf{X}\hat{\boldsymbol{\beta}})$$
$$\hat{\sigma}^2_{Z^0} = \sigma(\hat{\boldsymbol{\theta}})_1 - \boldsymbol{\sigma}(\hat{\boldsymbol{\theta}})_{-1}' \left(\Sigma(\hat{\boldsymbol{\theta}})\right)^{-1} \boldsymbol{\sigma}(\hat{\boldsymbol{\theta}})_{-1}.$$
2. Posterior Predictive Classifier:

Estimate $p^0 = E[I(Z^0 > 0) \mid \mathbf{x}^0, \mathbf{Y}]$ by computing $\hat{p}^0 = \sum_{t=1}^{T} I(Z^{0[t]} > 0)/T$. This gives an estimate of
$$P(Y^0 = 1 \mid \mathbf{x}^0, \mathbf{Y}) = P(Z^0 > 0 \mid \mathbf{x}^0, \mathbf{Y}) = \int P(Z^0 > 0 \mid \mathbf{Z}, \boldsymbol{\beta}, \boldsymbol{\theta}) \, \pi(\mathbf{Z}, \boldsymbol{\beta}, \boldsymbol{\theta} \mid \mathbf{Y}) \, d\mathbf{Z} \, d\boldsymbol{\beta} \, d\boldsymbol{\theta}.$$
The estimated classification rule is then
$$Y^0 = \begin{cases} 1, & \text{if } \hat{p}^0/(1 - \hat{p}^0) > 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (4.13)$$
Again, each of these classification rules corresponds to a 0-1 loss function on the predicted $Y^0$.
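As an illustration of the conditional draw of $Z^0$ underlying the spatial posterior predictive classifier, the following is a minimal sketch assuming an exponential correlation function with a single range parameter and unit latent variance ($\sigma(\boldsymbol{\theta})_1 = 1$, consistent with the probit scale); the correlation model, function names, and array layouts are assumptions for this example only.

```python
import numpy as np

def exp_corr(dists, phi):
    # Exponential spatial correlation; stands in for Sigma(theta) in the text.
    return np.exp(-dists / phi)

def spatial_predictive_classifier(x0, d0, X, D, beta_draws, phi_draws, Z_draws,
                                  rng=None):
    """Posterior predictive classification of Y^0 at a new location.

    x0         : (k,) covariates at the new site
    d0         : (n,) distances from the new site to the n observed sites
    X, D       : (n, k) design matrix and (n, n) inter-site distance matrix
    beta_draws : (T, k), phi_draws : (T,), Z_draws : (T, n) posterior draws
    """
    if rng is None:
        rng = np.random.default_rng()
    z0 = np.empty(len(phi_draws))
    for t, (beta, phi, Z) in enumerate(zip(beta_draws, phi_draws, Z_draws)):
        Sigma = exp_corr(D, phi)                  # correlation among observed Z
        sig = exp_corr(d0, phi)                   # sigma(theta)_{-1}: new vs observed
        w = np.linalg.solve(Sigma, Z - X @ beta)  # Sigma^{-1} (Z - X beta)
        mu = x0 @ beta + sig @ w                  # mu_{Z^0}
        var = 1.0 - sig @ np.linalg.solve(Sigma, sig)   # sigma^2_{Z^0}
        z0[t] = rng.normal(mu, np.sqrt(var))      # draw Z^{0[t]}
    p_hat = np.mean(z0 > 0.0)
    return int(p_hat > 0.5), p_hat                # rule (4.13)
```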
As with the independent Bayesian probit regression model, for the training data we must resample the $Z_i$s to determine prediction error rates. For the Bayesian spatial probit regression model, however, we rely on the surrounding observations to provide information about the category at unobserved locations. In this case, evaluating the training error is not straightforward. We propose the following two approaches for the spatial probit posterior predictive classifier:
A. One-at-a-Time Training Error:

(i) Take samples $(\boldsymbol{\beta}^{[t]}, \boldsymbol{\theta}^{[t]}, \mathbf{Z}_{-i}^{[t]})$, for $t = 1, \ldots, T$, where $\mathbf{Z}_{-i}^{[t]}$ is the $(n-1) \times 1$ vector of sampled $Z_j$ for $j = 1, \ldots, i-1, i+1, \ldots, n_{train}$, and sample a corresponding
$$Z_i^{[t]} \sim N(\mu_{Z_i}, \sigma^2_{Z_i}),$$
where
$$\mu_{Z_i} = \mathbf{x}_i'\boldsymbol{\beta}^{[t]} + \Sigma(\boldsymbol{\theta}^{[t]})_{i,-i} \left(\Sigma(\boldsymbol{\theta}^{[t]})_{-i,-i}\right)^{-1} (\mathbf{Z}_{-i}^{[t]} - \mathbf{X}_{-i}\boldsymbol{\beta}^{[t]})$$
$$\sigma^2_{Z_i} = \Sigma(\boldsymbol{\theta}^{[t]})_{i,i} - \Sigma(\boldsymbol{\theta}^{[t]})_{i,-i} \left(\Sigma(\boldsymbol{\theta}^{[t]})_{-i,-i}\right)^{-1} \Sigma(\boldsymbol{\theta}^{[t]})_{-i,i},$$
$\mathbf{X}_{-i}$ is $\mathbf{X}$ with the $i$th row removed, and the subscripts on $\Sigma(\boldsymbol{\theta}^{[t]})$, the estimated spatial correlation matrix of $\mathbf{Z}$, indicate the retained rows and columns (e.g., $\Sigma(\boldsymbol{\theta}^{[t]})_{i,-i}$ is the $i$th row with the $i$th column removed). (Note that the $\mathbf{Z}_{-i}^{[t]}$ are the posterior samples obtained from the MCMC algorithm; however, the $Z_i^{[t]}$s are not the same as those sampled in the MCMC algorithm.)

(ii) Determine $\hat{p}_i^0 = \sum_{t=1}^{T} I(Z_i^{[t]} > 0)/T$ and let $Y_i^0$ be the predicted value of $Y_i$ using the posterior predictive classifier.

(iii) Repeat (i) and (ii) for all $i = 1, \ldots, n_{train}$.

(iv) Compute the one-at-a-time training error: $\sum_{i=1}^{n_{train}} I(Y_i^0 \neq Y_i)/n_{train}$.
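A sketch of the one-at-a-time calculation, under the same illustrative exponential-correlation assumption as above (all function and array names are hypothetical; note the nested loop is expensive for large $n$ or $T$):

```python
import numpy as np

def one_at_a_time_training_error(X, y, D, beta_draws, phi_draws, Z_draws,
                                 rng=None):
    """One-at-a-time training error for the spatial probit classifier.
    Each Z_i is redrawn conditionally on the MCMC samples of Z_{-i}."""
    if rng is None:
        rng = np.random.default_rng()
    n, T = len(y), len(phi_draws)
    y_pred = np.empty(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i                      # indices -i
        z_i = np.empty(T)
        for t in range(T):
            Sigma = np.exp(-D / phi_draws[t])         # assumed correlation model
            s_i = Sigma[i, keep]                      # Sigma_{i,-i}
            S_mi = Sigma[np.ix_(keep, keep)]          # Sigma_{-i,-i}
            resid = Z_draws[t, keep] - X[keep] @ beta_draws[t]
            mu = X[i] @ beta_draws[t] + s_i @ np.linalg.solve(S_mi, resid)
            var = Sigma[i, i] - s_i @ np.linalg.solve(S_mi, s_i)
            z_i[t] = rng.normal(mu, np.sqrt(var))     # step (i)
        y_pred[i] = int(np.mean(z_i > 0.0) > 0.5)     # step (ii)
    return np.mean(y_pred != y)                       # step (iv)
```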
B. Joint Training Error:

(i) Take samples $(\boldsymbol{\beta}^{[t]}, \boldsymbol{\theta}^{[t]})$, for $t = 1, \ldots, T$, and sample a corresponding $\mathbf{Z}^{[t]} \sim N(\mathbf{X}\boldsymbol{\beta}^{[t]}, \Sigma(\boldsymbol{\theta}^{[t]}))$. (Note that the $\mathbf{Z}^{[t]}$ are not the same as those sampled in the MCMC algorithm.)

(ii) For $i = 1, \ldots, n_{train}$, compute $\hat{p}_i^0 = \sum_{t=1}^{T} I(Z_i^{[t]} > 0)/T$ and let $Y_i^0$ be the predicted value of $Y_i$ using the posterior predictive classifier.

(iii) Compute the joint training error: $\sum_{i=1}^{n_{train}} I(Y_i^0 \neq Y_i)/n_{train}$.
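The joint version is simpler, since each $\mathbf{Z}^{[t]}$ is drawn unconditionally from its multivariate normal distribution given the sampled parameters. A sketch under the same assumed correlation model:

```python
import numpy as np

def joint_training_error(X, y, D, beta_draws, phi_draws, rng=None):
    """Joint training error: redraw the whole latent vector Z jointly
    from N(X beta^[t], Sigma(theta^[t])) for each posterior draw."""
    if rng is None:
        rng = np.random.default_rng()
    T, n = len(phi_draws), len(y)
    indicators = np.empty((T, n))
    for t in range(T):
        Sigma = np.exp(-D / phi_draws[t])                       # assumed model
        Z = rng.multivariate_normal(X @ beta_draws[t], Sigma)   # step (i)
        indicators[t] = Z > 0.0
    p_hat = indicators.mean(axis=0)                             # step (ii)
    y_pred = (p_hat > 0.5).astype(int)
    return np.mean(y_pred != y)                                 # step (iii)
```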
Both the one-at-a-time and joint training errors allow for spatial dependence among the binary predictions/classifications through the latent random variable. The joint training error allows for spatial dependence only through the spatial dependence structure of the latent variables, $\Sigma(\boldsymbol{\theta})$. In contrast, the one-at-a-time training error allows for spatial dependence through $\Sigma(\boldsymbol{\theta})$, but also by conditioning on the current values of the latent random variables at nearby locations, $\mathbf{Z}_{-i}^{[t]}$. We note that for the independent model, the one-at-a-time and joint training errors will be the same, since $Z_i$ is independent of all other $Z_j$ for $j = 1, \ldots, i-1, i+1, \ldots, n$.
4.3 Alternative Classification Methods
In this section, we give an overview of alternatives to GLM-based classification. In these alternative classification methods, we assume that the $\mathbf{x}_i$s include only the predictors/inputs, and thus do not include a term to allow for an intercept as in the GLM-based classification methods. Unless otherwise noted, the classification methods described here are taken from Hastie et al. (2001).
4.3.1 Discriminant Analysis
In discriminant analysis, rather than considering the explanatory variables $\mathbf{x}$ as fixed, as they are in regression analyses, the $\mathbf{x}$ are viewed as random variables with class-specific density functions $f_j(\mathbf{x})$ corresponding to each class $C_j$. The classes also have prior probabilities $\pi_j$, such that $\pi_0 + \pi_1 = 1$. To determine the probability that a set of predictors falls into class $j$, we employ Bayes' theorem, which implies that
$$P(Y = j \mid \mathbf{x}) = \frac{f_j(\mathbf{x})\,\pi_j}{f_1(\mathbf{x})\,\pi_1 + f_0(\mathbf{x})\,\pi_0}. \qquad (4.14)$$
The Bayes classification rule is to classify an observation to class $C_1$ when $P(Y = 1 \mid \mathbf{x}) > P(Y = 0 \mid \mathbf{x})$ and to $C_0$ otherwise.
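For instance, with two univariate normal class densities, (4.14) and the Bayes rule reduce to a few lines; this is a toy illustration, and the means, variance, and prior used below are made up.

```python
import numpy as np
from scipy.stats import norm

def bayes_classify(x, mu0=-1.0, mu1=1.0, sd=1.0, pi1=0.5):
    """Bayes classification rule (4.14) for two univariate normal classes.
    All parameter values here are illustrative."""
    f1, f0 = norm.pdf(x, mu1, sd), norm.pdf(x, mu0, sd)
    p1 = f1 * pi1 / (f1 * pi1 + f0 * (1.0 - pi1))   # P(Y = 1 | x)
    return int(p1 > 0.5)

# Classify a point between the class means, nudged toward class 1:
print(bayes_classify(0.2))   # -> 1
```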
We first describe discriminant analysis in its general form, allowing a general form for the $f_j(\mathbf{x})$s and the decision function, and then discuss four special cases that we use in our analysis.
Let
$$\mathbf{X}_j = \begin{pmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_{n_j} \end{pmatrix},$$
where $\{\mathbf{x}_1, \ldots, \mathbf{x}_{n_j}\} = \{\mathbf{x}_i : Y_i = j\}$, so that $\mathbf{X}_j$ is an $n_j k \times 1$ vector of predictors/inputs corresponding to observations in class $C_j$. To classify $Y^0$, for each class we define
$$\mathbf{X}_j^0 = \begin{pmatrix} \mathbf{x}^0 \\ \mathbf{X}_j \end{pmatrix},$$
where $\mathbf{X}_j^0$ is an $(n_j + 1)k \times 1$ vector and $\mathbf{x}^0$ is the $k \times 1$ vector of predictors/inputs associated with $Y^0$. Our goal is to determine a decision boundary for classifying $Y^0$.
In discriminant analysis, $f_j(\cdot)$ is typically the multivariate normal density function,
$$f_j(\mathbf{X}_j^0) = \frac{1}{(2\pi)^{(n_j+1)k/2} |\Sigma_j^X|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{X}_j^0 - \boldsymbol{\mu}_j^X)' (\Sigma_j^X)^{-1} (\mathbf{X}_j^0 - \boldsymbol{\mu}_j^X) \right\},$$
where $\boldsymbol{\mu}_j^X$ is the $(n_j+1)k \times 1$ class-specific mean vector and $\Sigma_j^X$ is the $(n_j+1)k \times (n_j+1)k$ class-specific covariance matrix. It follows that
$$f_j(\mathbf{x}^0 \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n_j}) = \frac{1}{(2\pi)^{k/2} |\Sigma_j^{x^0}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}^0 - \boldsymbol{\mu}_j^{x^0})' (\Sigma_j^{x^0})^{-1} (\mathbf{x}^0 - \boldsymbol{\mu}_j^{x^0}) \right\}, \qquad (4.15)$$
where
$$\boldsymbol{\mu}_j^{x^0} = \boldsymbol{\mu}_{j(\{1:k\})}^X + \Sigma_{j(\{1:k\},\text{-}\{1:k\})}^X \left(\Sigma_{j(\text{-}\{1:k\},\text{-}\{1:k\})}^X\right)^{-1} (\mathbf{X}_j - \boldsymbol{\mu}_{j(\text{-}\{1:k\})}^X)$$
$$\Sigma_j^{x^0} = \Sigma_{j(\{1:k\},\{1:k\})}^X - \Sigma_{j(\{1:k\},\text{-}\{1:k\})}^X \left(\Sigma_{j(\text{-}\{1:k\},\text{-}\{1:k\})}^X\right)^{-1} \Sigma_{j(\text{-}\{1:k\},\{1:k\})}^X.$$
We use the subscript notation $(\{1:k\})$ to indicate the first $k$ elements of the corresponding matrix or vector (i.e., those indices corresponding to $\mathbf{x}^0$), and $(\text{-}\{1:k\})$ indicates the matrix or vector without the first $k$ elements (i.e., the remaining indices, corresponding to $\mathbf{X}_j$).
Using the log-odds as the decision function to determine a classification rule for classifying $Y^0$, it follows from (4.14) and (4.15) that
$$
\begin{aligned}
\delta(\omega) &\equiv \delta(\mathbf{x}^0, \boldsymbol{\mu}_0^{x^0}, \boldsymbol{\mu}_1^{x^0}, \Sigma_0^{x^0}, \Sigma_1^{x^0}) = \log \frac{P(Y^0 = 1 \mid \mathbf{x}^0)}{P(Y^0 = 0 \mid \mathbf{x}^0)} \\
&= \underbrace{\log \frac{\pi_1}{\pi_0} + \frac{1}{2} \log \frac{|\Sigma_0^{x^0}|}{|\Sigma_1^{x^0}|} - \frac{1}{2} \boldsymbol{\mu}_1^{x^0\prime} (\Sigma_1^{x^0})^{-1} \boldsymbol{\mu}_1^{x^0} + \frac{1}{2} \boldsymbol{\mu}_0^{x^0\prime} (\Sigma_0^{x^0})^{-1} \boldsymbol{\mu}_0^{x^0}}_{\equiv \alpha_0} \\
&\quad + \mathbf{x}^{0\prime} \underbrace{\left[ (\Sigma_1^{x^0})^{-1} \boldsymbol{\mu}_1^{x^0} - (\Sigma_0^{x^0})^{-1} \boldsymbol{\mu}_0^{x^0} \right]}_{\equiv \boldsymbol{\alpha}_1} - \mathbf{x}^{0\prime} \underbrace{\frac{1}{2} \left[ (\Sigma_1^{x^0})^{-1} - (\Sigma_0^{x^0})^{-1} \right]}_{\equiv \boldsymbol{\alpha}_2} \mathbf{x}^0.
\end{aligned} \qquad (4.16)
$$
Here, $\alpha_0$ is a scalar, $\boldsymbol{\alpha}_1$ is a $k \times 1$ vector, and $\boldsymbol{\alpha}_2$ is a $k \times k$ matrix, which we define for notational convenience. Setting the decision function equal to zero, the discriminant analysis-based classification rule is
$$Y^0 = \begin{cases} 1, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1 - \mathbf{x}^{0\prime}\boldsymbol{\alpha}_2\mathbf{x}^0 > -\alpha_0 \\ 0, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1 - \mathbf{x}^{0\prime}\boldsymbol{\alpha}_2\mathbf{x}^0 \le -\alpha_0. \end{cases} \qquad (4.17)$$
We now consider special cases of (4.17). Each of these special cases assumes that the mean of $\mathbf{x}_i$ is equal across all observations, so that $\boldsymbol{\mu}_j^X = \mathbf{1}_{n_j+1} \otimes \boldsymbol{\mu}_j$, where $\mathbf{1}_{n_j+1}$ is an $(n_j+1) \times 1$ vector of ones and $\boldsymbol{\mu}_j$ is a $k \times 1$ class-specific mean vector. The difference between these special cases lies in the specification of the covariance matrix $\Sigma_j^X$. We first describe three popular discriminant analysis methods (linear discriminant analysis, diagonal linear discriminant analysis, and quadratic discriminant analysis), all of which assume that the $\mathbf{x}_i$ are independent. Then we describe an approach for spatial discriminant analysis (spatial linear discriminant analysis) due to Šaltytė Benth and Dučinskas (2005).
Assuming the $\mathbf{x}_i$ are independent results in the following form for the covariance of $\mathbf{X}_j^0$:
$$\mathrm{var}(\mathbf{X}_j^0) = \Sigma_j^X = I_{n_j+1} \otimes \Lambda_j, \qquad (4.18)$$
where $I_{n_j+1}$ is an $(n_j+1) \times (n_j+1)$ identity matrix and $\Lambda_j$ is a class-specific covariance matrix for the $k$ components of $\mathbf{x}_i$. Under this assumption, $\mathbf{x}^0$ is independent of $\mathbf{x}_1, \ldots, \mathbf{x}_{n_j}$, so
$$f_j(\mathbf{x}^0 \mid \mathbf{x}_1, \ldots, \mathbf{x}_{n_j}) = f_j(\mathbf{x}^0) = \frac{1}{(2\pi)^{k/2} |\Lambda_j|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}^0 - \boldsymbol{\mu}_j)' \Lambda_j^{-1} (\mathbf{x}^0 - \boldsymbol{\mu}_j) \right\}.$$
Assuming a common covariance matrix across classes (i.e., $\Lambda_j = \Lambda$ for $j = 0, 1$) results in linear discriminant analysis (LDA), so called because the decision boundary is linear in the $\mathbf{x}$s. The decision function given in (4.16) can be written in this case as
$$
\begin{aligned}
\delta(\omega) \equiv \delta(\mathbf{x}^0, \boldsymbol{\mu}_0, \boldsymbol{\mu}_1, \Lambda) &= \log \frac{P(Y^0 = 1 \mid \mathbf{x}^0)}{P(Y^0 = 0 \mid \mathbf{x}^0)} \\
&= \underbrace{\log \frac{\pi_1}{\pi_0} - \frac{1}{2} (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_0)' \Lambda^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)}_{\alpha_0^{LDA}} + \mathbf{x}^{0\prime} \underbrace{\Lambda^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)}_{\boldsymbol{\alpha}_1^{LDA}},
\end{aligned}
$$
where $\alpha_0^{LDA}$ is a scalar and $\boldsymbol{\alpha}_1^{LDA}$ is a $k \times 1$ vector, defined for notational convenience. Note that this decision function is effectively equivalent to the one based on the logistic regression model in (4.6); however, in logistic regression we assume the $\mathbf{x}$s are fixed and thus make no distributional assumptions on $\mathbf{x}$ as in discriminant analysis. This results in the following LDA-based classification rule for $Y^0$ given $\mathbf{x}^0$:
$$Y^0 = \begin{cases} 1, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1^{LDA} > -\alpha_0^{LDA} \\ 0, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1^{LDA} \le -\alpha_0^{LDA}. \end{cases} \qquad (4.19)$$
In practice, the parameters $\pi_j$, $\boldsymbol{\mu}_j$, and $\Lambda$ (and thus $\alpha_0^{LDA}$ and $\boldsymbol{\alpha}_1^{LDA}$) are unknown but can be estimated using maximum likelihood:

• $\hat{\pi}_j = n_j/n$

• $\hat{\boldsymbol{\mu}}_j = \sum_{i: Y_i = j} \mathbf{x}_i / n_j$

• $\hat{\Lambda} = \sum_{j \in \{0,1\}} \sum_{i: Y_i = j} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)' / (n - 2)$
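Putting these estimates and rule (4.19) together, a compact LDA sketch (NumPy only; the function and variable names are our own, and the diagonal variant described next would simply replace `Lam` with `np.diag(np.diag(Lam))`):

```python
import numpy as np

def lda_fit_predict(X, y, x0):
    """Fit LDA by the plug-in estimates above and classify x0 via rule (4.19).
    X : (n, k) inputs, y : (n,) 0/1 labels, x0 : (k,) new input."""
    n = len(y)
    pi1 = np.mean(y == 1)
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # Pooled covariance with the n - 2 divisor used in the text.
    R0, R1 = X[y == 0] - mu0, X[y == 1] - mu1
    Lam = (R0.T @ R0 + R1.T @ R1) / (n - 2)
    a1 = np.linalg.solve(Lam, mu1 - mu0)                   # alpha_1^LDA
    a0 = np.log(pi1 / (1 - pi1)) - 0.5 * (mu1 + mu0) @ a1  # alpha_0^LDA
    return int(x0 @ a1 > -a0)
```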
Diagonal linear discriminant analysis (DLDA) additionally assumes independence between the $k$ predictors/inputs, so that $\mathrm{var}(\mathbf{x}_i) = \Lambda$ is a diagonal matrix. The DLDA-based classification rule is the same as (4.19), but using a diagonal matrix $\Lambda$. The $m$th diagonal element of $\Lambda$ is estimated by $\hat{\Lambda}_{(m,m)} = \sum_{j \in \{0,1\}} \sum_{i: Y_i = j} (x_{im} - \hat{\mu}_{jm})^2 / (n - 2)$, where $x_{im}$ and $\hat{\mu}_{jm}$ are the $m$th elements of $\mathbf{x}_i$ and $\hat{\boldsymbol{\mu}}_j$, respectively.
In LDA, we assume a constant covariance for $\mathbf{x}_i$ among the classes (i.e., $\Lambda_j = \Lambda$ for $j = 0, 1$). Quadratic discriminant analysis (QDA), on the other hand, allows each class to have its own covariance. The decision function then contains a quadratic term in the $\mathbf{x}$s:
$$
\begin{aligned}
\delta(\omega) \equiv \delta(\mathbf{x}^0, \boldsymbol{\mu}_0, \boldsymbol{\mu}_1, \Lambda_0, \Lambda_1) &= \log \frac{P(Y^0 = 1 \mid \mathbf{x}^0)}{P(Y^0 = 0 \mid \mathbf{x}^0)} \\
&= \underbrace{\log \frac{\pi_1}{\pi_0} + \frac{1}{2} \log \frac{|\Lambda_0|}{|\Lambda_1|} - \frac{1}{2} \boldsymbol{\mu}_1' \Lambda_1^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_0' \Lambda_0^{-1} \boldsymbol{\mu}_0}_{\alpha_0^{QDA}} \\
&\quad + \mathbf{x}^{0\prime} \underbrace{(\Lambda_1^{-1} \boldsymbol{\mu}_1 - \Lambda_0^{-1} \boldsymbol{\mu}_0)}_{\boldsymbol{\alpha}_1^{QDA}} - \mathbf{x}^{0\prime} \underbrace{\frac{1}{2} (\Lambda_1^{-1} - \Lambda_0^{-1})}_{\boldsymbol{\alpha}_2^{QDA}} \mathbf{x}^0.
\end{aligned}
$$
Here, $\alpha_0^{QDA}$ is a scalar, $\boldsymbol{\alpha}_1^{QDA}$ is a $k \times 1$ vector, and $\boldsymbol{\alpha}_2^{QDA}$ is a $k \times k$ matrix, defined for notational convenience. This results in the QDA-based classification rule for $Y^0$ given $\mathbf{x}^0$:
$$Y^0 = \begin{cases} 1, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1^{QDA} - \mathbf{x}^{0\prime}\boldsymbol{\alpha}_2^{QDA}\mathbf{x}^0 > -\alpha_0^{QDA} \\ 0, & \text{if } \mathbf{x}^{0\prime}\boldsymbol{\alpha}_1^{QDA} - \mathbf{x}^{0\prime}\boldsymbol{\alpha}_2^{QDA}\mathbf{x}^0 \le -\alpha_0^{QDA}, \end{cases} \qquad (4.20)$$
where we estimate the parameters by taking
• $\hat{\pi}_j = n_j/n$

• $\hat{\boldsymbol{\mu}}_j = \sum_{i: Y_i = j} \mathbf{x}_i / n_j$

• $\hat{\Lambda}_j = \sum_{i: Y_i = j} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)' / (n_j - 1)$.
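A matching QDA sketch using these class-specific plug-in estimates and rule (4.20) (again, the names are illustrative):

```python
import numpy as np

def qda_fit_predict(X, y, x0):
    """Fit QDA by the class-specific plug-in estimates and classify x0
    via rule (4.20)."""
    pi1 = np.mean(y == 1)
    mu, Lam_inv, logdet = {}, {}, {}
    for j in (0, 1):
        Xj = X[y == j]
        mu[j] = Xj.mean(axis=0)
        Lam = np.cov(Xj, rowvar=False)        # divisor n_j - 1, as in the text
        Lam_inv[j] = np.linalg.inv(Lam)
        logdet[j] = np.linalg.slogdet(Lam)[1]
    a0 = (np.log(pi1 / (1 - pi1)) + 0.5 * (logdet[0] - logdet[1])
          - 0.5 * mu[1] @ Lam_inv[1] @ mu[1]
          + 0.5 * mu[0] @ Lam_inv[0] @ mu[0])                  # alpha_0^QDA
    a1 = Lam_inv[1] @ mu[1] - Lam_inv[0] @ mu[0]               # alpha_1^QDA
    a2 = 0.5 * (Lam_inv[1] - Lam_inv[0])                       # alpha_2^QDA
    return int(x0 @ a1 - x0 @ a2 @ x0 > -a0)
```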
The final discriminant analysis method we discuss is a special case of the spatial-temporal extension of LDA recently proposed by Šaltytė Benth and Dučinskas (2005). We restrict our discussion to the spatial case, but refer the reader to Šaltytė Benth and Dučinskas (2005) for more details on the spatial-temporal approach. In spatial LDA, we do not assume independence among the $\mathbf{x}_i$ as in the previous three methods; instead, we assume the predictors/inputs are spatially dependent. Following Šaltytė Benth and Dučinskas (2005), the specific form for the covariance is