Bayesian Probit Regression Models for Spatially-Dependent Categorical Data

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Candace Berrett, B.S., M.S.

Graduate Program in Statistics

The Ohio State University

2010

Dissertation Committee:

Catherine A. Calder, Advisor

L. Mark Berliner

Peter F. Craigmile

Elizabeth A. Stasny

© Copyright by

Candace Berrett

2010

ABSTRACT

Data augmentation/latent variable methods have been widely recognized for facilitating model fitting in the Bayesian probit regression model. First proposed by Albert and Chib

(1993) for independent binary and multi-category data, the latent variable representation of the Bayesian probit regression model allows model fitting to be performed using a simple

Gibbs sampler and, for more than two categories, also allows the so-called independence of irrelevant alternatives assumption required by the model to be relaxed (Hausman and Wise, 1978). To accommodate residual spatial dependence, the latent variable specification of the Bayesian probit regression model can be extended to incorporate standard parametric covariance models typically used in analyses of spatially-dependent continuous data, defining what we term the Bayesian spatial probit regression model. In this dissertation, we develop and extend the Bayesian spatial probit regression model by (i) introducing efficient model-fitting algorithms, (ii) deriving classification methods based on the model, and (iii) extending the model to the multi-category spatial setting.

Statistical models for spatial data are notoriously cumbersome to fit, necessitating the availability of fast and efficient model-fitting algorithms. To improve the efficiency of the Gibbs sampler used to fit the Bayesian regression model for independent categorical response variables, Imai and van Dyk (2005) propose introducing a working parameter into the model and compare various data augmentation strategies resulting from different treatments of the working parameter. We build on this work by investigating the efficiency

of modified and extended versions of conditional and marginal data augmentation Markov chain Monte Carlo (MCMC) algorithms for the spatial probit regression model, focusing on the special case of binary spatially-dependent response variables.

Within the classification literature, methods that exploit spatial dependence are limited.

We show how a spatial classification rule can be derived from the Bayesian spatial probit regression model. In addition, we compare our proposed spatial classifier to various other classifiers in terms of training and test error rates using a land-cover/land-use data set.

When extending the spatial probit regression model to the multi-category setting, care must be taken to ensure that model parameters are estimable and interpretable. Considering three types of categorical and spatial covariate information, we discuss various specifications of the latent variable mean structure and the associated parameter interpretations.

Additionally, we explore the specification of the latent variable cross space-category dependence structure and discuss how data augmentation MCMC strategies for fitting the

Bayesian spatial probit regression model can be extended to the multi-category setting.

Dedicated to my parents, Bob and Nanette, and siblings, Tenille, Nat, Preston, MeChel, and Taylor.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Dr. Kate Calder, who over the last four and a half years has devoted a substantial amount of time and effort in training me to be a well-rounded statistician. She has provided me with numerous opportunities to learn and grow through research, teaching, mentoring, and collaboration. She has also become a good friend, whom I admire professionally and personally, and I am grateful for her example and support.

I would like to thank my committee members: Dr. Mark Berliner for his comments on my research, his help with job and fellowship applications, and for allowing me to laugh in his class; Dr. Elizabeth Stasny for her comments on my research, her help with job and fellowship applications, her support as graduate chair, and in encouraging me to come to

Ohio State; and Dr. Peter Craigmile for his valuable comments and contributions to my research.

I would like to thank Dr. Darla Munroe and Dr. Ningchuan Xiao of the Department of

Geography for their generous assistance in obtaining and understanding the land cover data used in this work.

I would like to thank the other professors in the Department of Statistics who have provided guidance and support during my time at Ohio State: Dr. Doug Wolfe, Dr. Tao Shi,

Dr. Chris Hans, Dr. Jackie Miller, Dr. Steve MacEachern, and Dr. Noel Cressie.

I would like to thank Lisa Van Dyke for her help in answering my many graduation questions and in pulling together the final documents of this dissertation. I would also like to thank Terry England for her help with all my travel and posters.

Support for this research was provided by grants from NASA (NNG06GD31G) and the

NSF (ATM-0934595).

Finally, I would like to thank my family and many friends, who all believed in me when I didn’t believe in myself; and God, for giving me strength and understanding, and providing me with opportunities to grow.

VITA

1983: Born - Ogden, Weber, Utah, USA

2005: B.S. Actuarial Science, cum laude, Brigham Young University

2005 - 2006: University Fellow, Graduate School, The Ohio State University

2005 - 2006, 2010: Teaching Assistant, Department of Statistics, The Ohio State University

2007: M.S. Statistics, The Ohio State University

2007 - 2010: Research Assistant, Department of Statistics, The Ohio State University

2009: Graduate Fellow, Statistical and Applied Mathematical Sciences Institute

PUBLICATIONS

Research Publications

Xiao, N., Shi, T., Calder, C.A., Munroe, D.K., Berrett, C., Wolfinbarger, S., and Li, D. (2008) “Spatial Characteristics of the Difference between MISR and MODIS Aerosol Optical Depth Retrievals over Mainland Southeast Asia,” Remote Sensing of Environment, DOI: 10.1016/j.rse.2008.07.011.

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS


Abstract

Dedication

Acknowledgments

Vita

List of Tables

List of Figures

Chapters:

1. Introduction
   1.1 Background and Motivation
   1.2 Modeling Categorical Spatial Data
       1.2.1 The Spatial Generalized Linear Model
       1.2.2 The Spatial Generalized Linear Mixed Model
       1.2.3 Indicator Kriging
       1.2.4 The Autologistic Model
       1.2.5 The Bayesian Spatial Probit Regression Model
   1.3 Overview of Contributions
   1.4 Illustrative Data Set

2. Bayesian Spatial Probit Regression
   2.1 The Bayesian Probit Regression Model
       2.1.1 Albert and Chib's Data Augmentation Strategy
       2.1.2 Multi-Category and Multivariate Extensions
   2.2 The Bayesian Spatial Probit Regression Model
       2.2.1 Model Specification
       2.2.2 Parameterization of the Spatial Correlation Matrix

3. Data Augmentation MCMC Strategies
   3.1 Data Augmentation MCMC Strategies
       3.1.1 Conditional versus Marginal Data Augmentation
       3.1.2 Partially Collapsed Algorithms
       3.1.3 Full Conditional Distributions
   3.2 Simulation Study
       3.2.1 Simulation Set-up
       3.2.2 Simulation Results
   3.3 Application
   3.4 Summary

4. The Bayesian Spatial Probit Regression Model as a Tool for Classification
   4.1 The Classification Problem
   4.2 GLM-Based Classification
       4.2.1 Non-Spatial GLM-Based Classification
       4.2.2 Spatial GLM-Based Classification
   4.3 Alternative Classification Methods
       4.3.1 Discriminant Analysis
       4.3.2 Support Vector Machines
       4.3.3 k-Nearest Neighbors
   4.4 Comparison of Classification Methods
       4.4.1 Parameter Estimation
       4.4.2 Classification Errors
   4.5 Summary

5. Bayesian Spatial Multinomial Probit Regression
   5.1 The Bayesian Spatial Multinomial Probit Regression Model
       5.1.1 Latent Mean Specification
       5.1.2 Parameterization of the Space-Category Covariance Matrix
   5.2 Model-Fitting
       5.2.1 Data Augmentation MCMC Algorithms
   5.3 Summary

6. Contributions and Future Work

LIST OF TABLES


3.1 This table lists the steps in each of the data augmentation algorithms. The first portion shows the non-collapsed data augmentation algorithms introduced in Section 3.1.1. The second portion shows the partially collapsed data augmentation algorithms introduced in Section 3.1.2.

3.2 Scenarios used to compare the marginal and conditional data augmentation algorithms.

3.3 Autocorrelations of the sample paths of β1 and ρ for the land cover data analysis.

4.1 Fitted values for the covariance function parameters for both class C0 and C1.

4.2 Tuning parameter values for each classification method. The optimal value of the tuning parameter is listed along with the CVE associated with this value. The optimal values were chosen by minimizing the five-fold CVE.

4.3 Training and test errors for the SE Asia land cover data obtained using each of the classification methods discussed in this chapter.

LIST OF FIGURES


1.1 Land cover over Southeast Asia, covering the region bounded by 17◦ to 21◦N and 98◦ to 105◦E. The data were taken from the MODIS Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005.

1.2 Elevation (in meters) over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

1.3 Standardized value of the measured distance to the nearest major road over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

1.4 Standardized value of the measured distance to the coast over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

1.5 Standardized value of the measured distance to the nearest big city over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

2.1 Illustration of neighborhood structures for a regular grid. The grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with the empty squares for (a) a first order neighborhood structure and (b) a second order neighborhood structure.

3.1 Histograms and trace plots for β and ρ under Scenario 1 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.2 Histograms and trace plots for β and ρ under Scenario 2 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.3 Histograms and trace plots for β and λ under Scenario 3 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.4 Histograms and trace plots for β and λ under Scenario 4 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.5 Histograms and trace plots for β under Scenario 5 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

3.6 Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 1 for each of the three corresponding algorithms.

3.7 Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 2 for each of the three corresponding algorithms.

3.8 Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 3 for each of the three corresponding algorithms.

3.9 Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 4 for each of the three corresponding algorithms.

3.10 Autocorrelation and partial autocorrelation in the sample path of β under Scenario 5 for each of the three corresponding algorithms.

3.11 σ vs. β̃ under Scenario 1 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.12 σ vs. β̃ under Scenario 2 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.13 σ vs. β̃ under Scenario 3 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.14 σ vs. β̃ under Scenario 4 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

3.15 σ vs. β̃ under Scenario 5 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

4.1 Illustration of the hyperplane (solid line) and margins (dotted lines) determined by support vector machines separating the two classes (denoted by the two plotting symbols). The hyperplane in (a) is the maximum margin hyperplane and the hyperplane in (b) does not maximize the margin.

4.2 Estimated and fitted variograms for class C0 (left) and C1 (right) fit using the training data set. The numbers 1-4 indicate the estimated variogram for the four covariates, the dots represent the mean of the four variograms, and the line indicates the fitted variogram.

CHAPTER 1

INTRODUCTION

The field of spatial statistics encapsulates a wide array of statistical methodology for analyzing spatial data, or observations associated with particular points or regions in space.

The increasing availability and variety of spatial data has generated an enormous amount of methodological development in the field over the past 20 years, and today the field is one of the most active areas of research in all of statistics (see Banerjee et al., 2004; Waller and Gotway, 2004; Schabenberger and Gotway, 2005; Diggle and Ribeiro Jr., 2007, for a discussion of recent advances). Spatial statistical methods are also increasingly being used in areas of application across the biological, environmental, health, and social sciences, and often specific problems in these fields motivate further methodological developments. In fact, the motivation for much of the research presented in this dissertation is the study of land-cover/land-use change (LCLUC) using satellite-derived land cover observations. Before discussing this motivating application that is used throughout this dissertation, we first provide background and motivation as to why incorporating spatial information is important in statistical analyses of spatially-referenced data in Section 1.1. In Section 1.2 we give an overview of existing spatial statistical methodology for spatially-dependent categorical data. In Section 1.3, we outline the contributions of this dissertation in terms of statistical methodology for analyzing spatially-dependent categorical data using the Bayesian spatial

probit regression model. We then discuss the LCLUC motivating application by introducing a specific illustrative data set in Section 1.4.

1.1 Background and Motivation

Let si ∈ D ⊂ Rr be an r × 1 vector in Euclidean space. In spatial statistics, typically, r = 2 or 3, indicating two- or three-dimensional space, respectively. Let Y (si) ≡ Yi for

i = 1, . . . , n, where, for now, Yi is a continuous real-valued random variable observed at

a point si ∈ D. There are three types of spatially-referenced data (Cressie, 1993), defined

by the nature of the spatial domain, D:

• Geostatistical/point-referenced data, where si varies continuously over D and D is a

fixed subset of Rr. In this case, Y (si) can potentially be observed everywhere within D.

• Lattice/gridded data, where D is again a fixed subset of Rr, but in this case, the

elements of D are areal units. Thus, si may not represent a point in space, but will

indicate an areal unit indexed by the point si.

• Point pattern data, where D is still a subset of Rr, but the si are now viewed as

random. For point pattern data, the observed si are modeled as realizations from a

stochastic process, whereas with geostatistical and lattice data, the points are usually

assumed to be fixed and non-stochastic.

In this dissertation, we only consider statistical methods for the analysis of geostatistical

and lattice data.

We expect spatially-referenced data to have a specific type of dependence structure,

called spatial dependence, where observations near to one another are more similar than

observations further apart. Often, a statistical analysis of spatial data will focus on learning about the nature of the spatial dependence or in exploiting it for prediction purposes.

Sometimes, however, we may need to account, or adjust, for spatial dependence when we

want to measure or quantify relationships between two observable quantities of interest. In

particular, in a regression analysis, a researcher may be interested in determining how one

variable Y , known as the response variable or dependent variable, varies given a set of

other variables x, known as covariates, independent variables, or explanatory variables.

For example:

• Environmental Health. Researchers and policy makers may be interested in the re-

lationship between exposure to particulate matter or other pollutants and a particular

health outcome (e.g., cancer or mortality).

• Real Estate. An economist may be interested in determining factors that contribute

to whether or not a house will sell in a particular period of time.

• Land-Cover/Land-Use Change. Researchers may be interested in determining the

demographic, social, geographical, and political factors contributing to the observed

type of land cover at various locations.

The Classical Linear Regression Model

Before describing the spatial regression model, we first review the linear regression model from both a classical and Bayesian perspective. Let {(Yi, xi); i = 1, . . . , n} be a collection of paired observations, where Yi ∈ R is a univariate real-valued response variable or dependent variable, and xi is a k × 1 vector of continuous or discrete-valued covariates or independent variables. For now, assume the Yi are conditionally independent

for all i = 1, . . . , n given the independent variables. The linear regression model specifies

the following relationship between the components of xi and the mean of Yi:

E(Yi) = xi′β,

where β is a k × 1 vector of fixed but unknown regression coefficients. The classical linear regression model is often written as

Yi = xi′β + εi,   (1.1)

where εi is a random noise error term with mean 0 and variance σ2 and is assumed to be independent for all i = 1, . . . , n. It is common to assume that the εi are also normally distributed, but that assumption is not necessary.

One method for fitting the linear regression model is maximum likelihood estimation.

When the εi are assumed to be normally distributed, the likelihood function is

L(β, σ2|Y ) ∝ p(Y |x1, . . . , xn, β, σ2) = ∏_{i=1}^n p(Yi|xi, β, σ2) = ∏_{i=1}^n (2πσ2)^{−1/2} exp{ −(Yi − xi′β)2/(2σ2) },   (1.2)

where Y = (Y1, . . . , Yn)′. Maximizing (1.2) with respect to β and σ2 yields the following maximum likelihood estimator (MLE) for β:

β̂MLE = ( (1/n) Σ_{i=1}^n xi xi′ )^{−1} ( (1/n) Σ_{i=1}^n xi Yi ).

We can write this estimator of β in matrix notation by letting X = (x1, . . . , xn)′ so that

β̂MLE = (X′X)^{−1} X′Y .

The MLE of σ2 is

σ̂2MLE = (1/n) ε̂′ε̂,

where ε̂ = Y − X β̂MLE. This likelihood-based model-fitting approach is considered to be within the class of classical or frequentist statistics, and standard inferential statements about the uncertainty in model parameter estimates can be obtained (e.g., confidence intervals, hypothesis tests). Another method for fitting the linear regression model is known as ordinary least squares (OLS), which does not require an assumption about the distribution of the error terms, but instead seeks to minimize the sum of the squared residuals, or the differences between the observed responses Yi and the predicted values of the responses Ŷi. We note that the OLS estimator for β is identical to the MLE provided above. As an alternative to these two approaches, we can consider a Bayesian version of the linear regression model, which we discuss below.
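As a concrete illustration, the R sketch below computes these closed-form estimators on a small simulated data set; the sample size, design matrix, and coefficient values are illustrative assumptions and are not taken from the land cover analysis.

```r
## Minimal sketch: closed-form MLE/OLS estimates for the linear regression model
set.seed(1)
n <- 100; k <- 3
X    <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))  # design matrix with an intercept
beta <- c(2, -1, 0.5)                                   # true coefficients (assumed)
Y    <- X %*% beta + rnorm(n, sd = 2)                   # simulated responses

beta_hat   <- solve(crossprod(X), crossprod(X, Y))      # (X'X)^{-1} X'Y
resid      <- Y - X %*% beta_hat
sigma2_mle <- sum(resid^2) / n                          # MLE divides by n, not n - k
```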

The Bayesian Linear Regression Model

In Bayesian analyses, we specify prior distributions on the model parameters; for the linear regression model, a prior distribution is needed for β and σ2, and is denoted π(β, σ2).

A prior distribution is a probability distribution which captures our knowledge about the model parameters prior to observing the data. Then, using the data likelihood and prior distribution, we determine posterior distributions for the model parameters, or the distribution of the model parameters given the data, using Bayes' Theorem. For the linear regression model, the posterior distribution is π(β, σ2|Y ). Posterior distributions are determined either analytically or numerically using simulation-based techniques such as Markov chain

Monte Carlo (MCMC); see Gelman et al. (1995) for a general overview of Bayesian data analysis techniques. In the Bayesian setting, inferences about model parameters are then reported based on summaries of the posterior distributions of model parameters.

Before describing the Bayesian linear regression model, we first discuss the role of the covariates X in the specification of the posterior distribution of the model parameters, following the discussion in Chapter 14 of Gelman et al. (1995). In the Bayesian paradigm, the observed data in a regression analysis include both Y and X. Because of this, a Bayesian model will include the specification of a joint distribution on (Y , X), p(Y , X|ψY |X , ψX ), where ψY |X are the model parameters associated with the distribution of Y |X – for the

Bayesian linear regression model, ψY |X = (β, σ2) – and ψX are the parameters associated with X. However, the Bayesian linear regression model traditionally assumes that the distribution of X provides no additional information about the conditional distribution of

Y |X, or that

p(Y , X|ψY |X , ψX ) = p(Y |ψY |X , X) p(X|ψX )   (1.3)

and ψY |X and ψX are assumed independent a priori so that

π(ψY |X , ψX ) = π(ψY |X )π(ψX ). (1.4)

Thus, a Bayesian regression model inherently assumes that the prior distribution on X provides no information about the regression model parameters, ψY |X = (β, σ2). To see this, notice that using equations (1.3) and (1.4), we can write the joint posterior distribution of (ψY |X , ψX ) as

π(ψY |X , ψX |Y , X) = p(Y , X|ψY |X , ψX ) π(ψY |X , ψX ) / ∫∫ p(Y , X|ψY |X , ψX ) π(ψY |X , ψX ) dψY |X dψX

= [ p(Y |ψY |X , X) π(ψY |X ) / ∫ p(Y |ψY |X , X) π(ψY |X ) dψY |X ] [ p(X|ψX ) π(ψX ) / ∫ p(X|ψX ) π(ψX ) dψX ]

= π(ψY |X |Y , X) π(ψX |X).

It follows that we can determine the posterior distribution of ψY |X by only considering

π(ψY |X |Y , X) ∝ π(ψY |X )p(Y |X, ψY |X ). (1.5)

To distinguish the roles X and Y play in Bayesian inference on ψY |X , it is common practice to tacitly condition on X and denote the posterior distribution of β and σ2 as

π(β, σ2|Y ). We follow this convention throughout this dissertation except in Chapter 4 for reasons that we discuss therein.

Although any probability distribution can be chosen for modeling the error terms of the regression model, typically, the errors are assumed to be normally distributed. Equivalently, it is assumed that

Y |β, σ2, X ∼ N(Xβ, σ2I),

where I is the n × n identity matrix. We then assign prior distributions to the parameters β and σ2. The standard choice for the Bayesian linear regression model is the noninformative prior distribution

π(β, σ2|X) ∝ 1/σ2.

Using these likelihood and prior distributions results in a posterior distribution that can be decomposed as follows:

π(β, σ2|Y ) = π(β|σ2, Y )π(σ2|Y ), where

β|σ2, Y ∼ N(β̂B, σ2 VB),

and

σ2|Y ∼ σ̂2B (χ2_{n−k})^{−1},

with

β̂B = (X′X)^{−1} X′Y ,

VB = (X′X)^{−1}, and

σ̂2B = (1/(n − k)) (Y − Xβ̂B)′ (Y − Xβ̂B),

and where (χ2_{n−k})^{−1} denotes the inverse chi-squared distribution with n − k degrees of freedom.

Often in Bayesian analyses, when the full joint posterior distribution is not available in closed form, we use simulation-based approaches to generate samples from the posterior distribution. We can make inferences about (β, σ2) by using a simulation-based (Monte

Carlo) approach where we generate samples from the posterior distribution of (β, σ2).

To do this, we sample from σ2|Y and then, given the sampled value of σ2, sample from

β|σ2, Y . Posterior inferences are then reported based on the empirical distribution of these sampled values.
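The R sketch below illustrates this composition-sampling scheme, reusing the simulated X and Y from the earlier sketch; the number of draws and the scaled inverse chi-squared representation used for the σ2 draw are assumptions of the sketch rather than prescriptions from this dissertation.

```r
## Minimal sketch: composition sampling from the posterior of (beta, sigma^2)
## under the noninformative prior, assuming X, Y, n, k from the previous sketch.
XtX_inv <- solve(crossprod(X))
beta_B  <- XtX_inv %*% crossprod(X, Y)
s2_B    <- sum((Y - X %*% beta_B)^2) / (n - k)

n_draws <- 5000
sigma2_draws <- (n - k) * s2_B / rchisq(n_draws, df = n - k)   # scaled inverse chi-squared draws
beta_draws <- t(sapply(sigma2_draws, function(s2) {
  ## beta | sigma^2, Y ~ N(beta_B, s2 * (X'X)^{-1}); draw via a Cholesky factor
  beta_B + t(chol(s2 * XtX_inv)) %*% rnorm(k)
}))
## Posterior summaries, e.g. apply(beta_draws, 2, quantile, c(0.025, 0.5, 0.975))
```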

Comparison to Classical Estimators

Notice that in the Bayesian case, the mean of the conditional posterior distribution for

β|σ2, Y is the same as the MLE, i.e., β̂B = β̂MLE. Furthermore, σ̂2B = (n/(n − k)) σ̂2MLE. Therefore, Bayesian inferences on model parameters can be similar to inferences based on the classical estimators. A Bayesian analysis, however, offers additional benefits, including:

1. Flexibility in model specification. Although the means of the posterior distributions

of β and σ2 are similar to the classical estimators, this is not always the case. To see

this, suppose we have more information a priori about β: for example, we may know or require β > 0, where 0 is a k × 1 vector of zeros. A Bayesian analysis provides

a natural and straightforward way to allow for this prior knowledge or constraint to

be included in an analysis through an informative prior distribution.

2. Interpretability of inferences. In a classical/frequentist setting, inferences are often

reported as confidence intervals. Confidence intervals, however, are not probability

intervals, although many interpret them as such. In a Bayesian setting, the posterior

distribution provides interpretable probabilistic statements about model parameters,

making clear the uncertainty associated with parameter estimates. Furthermore, this

uncertainty is easily propagated to uncertainty in deterministic and stochastic func-

tions of parameters (e.g., transformations and predicted values).

3. Allowing for increased complexity in model specification. Increasing the complex-

ity of a model (e.g., hierarchical modeling with a large number of unknown pa-

rameters, allowing for complicated spatial dependence structures, or modeling non-

continuous data) can make model-fitting in a classical/frequentist setting infeasible.

However, simulation-based approaches such as MCMC (see discussion below) can

make Bayesian model fitting feasible in high dimensional situations.

The Spatial Linear Regression Model

We now consider a more general form of linear regression in order to account for spatial dependence among the errors. In some instances, the error terms in the linear regression model will not be independent. This is often the case for spatially-referenced data, but there can also be residual dependence in data obtained from studies using repeated measures or when data are observed over time. When an assumption of residual independence is not valid, we can specify a general linear regression model. The model is the same as in (1.1),

however we no longer assume the εi are independent, but rather assume that

Var(ε) = Σ,

where Σ is a valid and invertible covariance matrix (i.e., positive definite and symmetric). Again, it is common to assume ε is normally distributed, but this assumption is not necessary.

To estimate the parameters of the general linear model, we consider two cases: Σ known and Σ unknown. When Σ is known, for maximum likelihood estimation, we typically specify a normal distribution on the error terms, resulting in the likelihood function

L(β|Y ) ∝ p(Y |X, β, Σ) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2) (Y − Xβ)′Σ^{−1}(Y − Xβ) }.

The MLE for β is then

β̂MLE = (X′Σ^{−1}X)^{−1} X′Σ^{−1}Y .

Weighted (or generalized) least squares, an extension of OLS, results in the same estimator for β.

When Σ is unknown, we can parameterize Σ (i.e., let Σ = Σ(θ)). The spatial linear model is defined to be the general linear model with a specific type of dependence captured by Σ(θ) (see the discussion in Section 2.2.2). In this case, the likelihood function is

L(β, θ|Y ) ∝ p(Y |X, β, Σ(θ)) = (2π)^{−n/2} |Σ(θ)|^{−1/2} exp{ −(1/2) (Y − Xβ)′ (Σ(θ))^{−1} (Y − Xβ) },

which is maximized with respect to β and θ to obtain the MLE. Another likelihood-based approach to estimation is restricted/residual maximum likelihood (REML; Patterson and Thompson, 1971) estimation where we introduce a matrix C satisfying E(CY ) = 0 and

Var(CY ) = σ2I. Then, given C, we use standard maximum likelihood estimation to

fit the model CY = CXβ + Cε. Another approach is based on an extension of OLS

called iteratively reweighted least squares (IRLS). IRLS first fixes the covariance Σ to some preliminary estimate and, conditioning on this value, uses the weighted least squares estimator to obtain an initial estimate of β. Using this estimate of β, IRLS estimation then obtains residuals and determines a more accurate estimate of Σ. Using the updated estimates, IRLS iterates between these two steps until the values of the parameter estimates have converged.
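As a rough illustration of this kind of iterative scheme, the R sketch below alternates between the weighted least squares estimate of β given the current covariance and a likelihood-based update of a single covariance parameter θ, assuming an exponential covariance Σ(θ)ij = exp(−dij/θ) and simulated data; the covariance family, parameter values, number of iterations, and variable names are all illustrative assumptions, and the likelihood-based θ update is a simplification of the residual-based update described above.

```r
## Minimal sketch: alternating estimation of beta and a covariance range parameter theta
## for the spatial linear model, with Sigma(theta)_{ij} = exp(-d_ij / theta) (assumed).
set.seed(2)
n <- 200; k <- 2
coords <- cbind(runif(n), runif(n))
D <- as.matrix(dist(coords))                        # pairwise distances between locations
X <- cbind(1, rnorm(n))
beta_true <- c(1, -0.5); theta_true <- 0.2
Y <- X %*% beta_true + t(chol(exp(-D / theta_true))) %*% rnorm(n)

gls_beta <- function(Sigma) {                       # (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} Y
  Si <- solve(Sigma)
  solve(t(X) %*% Si %*% X, t(X) %*% Si %*% Y)
}

neg_loglik_theta <- function(theta, beta) {         # Gaussian negative log-likelihood in theta
  Sigma <- exp(-D / theta)
  r <- Y - X %*% beta
  as.numeric(0.5 * determinant(Sigma)$modulus + 0.5 * t(r) %*% solve(Sigma, r))
}

theta_hat <- 0.5                                    # preliminary value
for (it in 1:10) {                                  # iterate until (approximate) convergence
  beta_hat  <- gls_beta(exp(-D / theta_hat))
  theta_hat <- optimize(neg_loglik_theta, c(0.01, 2), beta = beta_hat)$minimum
}
```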

Alternatively, we can specify a Bayesian version of the general linear model by specifying prior distributions on model parameters and using a simulation-based method to approximate the posterior distributions of model parameters. The model-fitting approach is similar to that used for the Bayesian linear regression model, however, we now specify a prior distribution on Σ or on parameters defining Σ. Because of the parameterization of

Σ, fitting a Bayesian general linear regression model requires more complex simulation-based approaches than those used in the Bayesian linear regression model. In general, the joint posterior distribution of β and θ cannot be decomposed into the product of standard density functions to allow Monte Carlo samples from the posterior to be drawn directly.

Instead, it has become standard practice to use an estimation technique known as Markov chain Monte Carlo (MCMC). Rather than drawing directly from the posterior distribution, an MCMC algorithm draws samples from a Markov chain with a long-run, or stationary, distribution equal to the posterior. Two popular MCMC algorithms are:

• The Gibbs sampler, where we draw a sample from the full conditional posterior dis-

tributions of each parameter sequentially. For the Bayesian spatial linear regression

model, for each iteration we sample

1. β[t] ∼ π(β|Y , θ[t−1])

2. θ[t] ∼ π(θ|Y , β[t])

where β[t] and θ[t] represent the sampled values of β and θ for the tth iteration, and

each is sampled conditional on the current value of the other parameter.

• The Metropolis-Hastings algorithms, where for each iteration, we draw a sample

from a proposal distribution and either accept or reject the proposed value. Let ψ

be the parameter of interest and Y the observed data. Then, we accept the proposed

value with probability

a(ψ∗, ψ[t−1]) = min{ [π(ψ∗|Y )/Jt(ψ∗|ψ[t−1])] / [π(ψ[t−1]|Y )/Jt(ψ[t−1]|ψ∗)], 1 },

where ψ∗ is the proposed value of ψ sampled from the proposal distribution for iter-

ation t, Jt(·|ψ[t−1]). It follows from Bayes' theorem that this acceptance probability

can be simplified to rely only on the prior, likelihood, and proposal distribution, i.e.,

a(ψ∗, ψ[t−1]) = min{ [π(ψ∗)p(Y |ψ∗)/Jt(ψ∗|ψ[t−1])] / [π(ψ[t−1])p(Y |ψ[t−1])/Jt(ψ[t−1]|ψ∗)], 1 }.

The Metropolis random walk algorithm is a special case of the Metropolis-Hastings

algorithm, where the proposal distribution for each parameter is a symmetric dis-

tribution centered on the previous sampled value of that parameter. In this case,

Jt(ψ∗|ψ[t−1]) = Jt(ψ[t−1]|ψ∗), resulting in the simplified form for the acceptance probability given by,

a(ψ∗, ψ[t−1]) = min{ π(ψ∗)p(Y |ψ∗) / [π(ψ[t−1])p(Y |ψ[t−1])], 1 }.

When fitting Bayesian models, it is also common to use a mixture of these two MCMC

algorithms. For example, in the Gibbs sampler for the Bayesian spatial regression model,

we can use a Metropolis-Hastings step in place of Step 2 listed above.
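The sketch below illustrates such a hybrid algorithm for the Bayesian spatial linear regression model, combining a Gibbs draw for β with a Metropolis random walk step for the covariance range parameter θ. It reuses X, Y, D, n, and k from the previous sketch, fixes σ2 = 1, and assumes a flat prior on β, a Uniform(0.01, 2) prior on θ, and a particular proposal standard deviation; all of these are simplifying assumptions made for illustration rather than choices made elsewhere in this dissertation.

```r
## Minimal sketch: Gibbs step for beta and Metropolis random-walk step for theta,
## assuming X, Y, D, n, k from the spatial-regression sketch above and sigma^2 = 1.
n_iter <- 2000
beta_samp  <- matrix(NA, n_iter, k)
theta_samp <- numeric(n_iter)
theta_cur  <- 0.5

loglik <- function(theta, beta) {
  Sigma <- exp(-D / theta)
  r <- Y - X %*% beta
  as.numeric(-0.5 * determinant(Sigma)$modulus - 0.5 * t(r) %*% solve(Sigma, r))
}

for (iter in 1:n_iter) {
  ## Step 1: beta | Y, theta (Gibbs draw from its full conditional under a flat prior)
  Si <- solve(exp(-D / theta_cur))
  V  <- solve(t(X) %*% Si %*% X)
  m  <- V %*% t(X) %*% Si %*% Y
  beta_cur <- as.numeric(m + t(chol(V)) %*% rnorm(k))

  ## Step 2: theta | Y, beta (Metropolis random walk; uniform prior on (0.01, 2))
  theta_prop <- rnorm(1, theta_cur, 0.05)
  if (theta_prop > 0.01 && theta_prop < 2) {
    if (log(runif(1)) < loglik(theta_prop, beta_cur) - loglik(theta_cur, beta_cur)) {
      theta_cur <- theta_prop
    }
  }
  beta_samp[iter, ] <- beta_cur
  theta_samp[iter]  <- theta_cur
}
## Discard an initial burn-in, e.g. beta_samp[-(1:500), ], before summarizing.
```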

We note that we must specify starting values for the parameters for both types of

MCMC algorithms listed above. To reduce the effect of these starting values on the final

parameter inferences and allow the Markov chains to reach their stationary distributions,

typically we discard the sampled draws from a number of the initial iterations. These dis-

carded values are called the burn-in samples.

Although it is a special case of the general linear regression model, we argue that the

spatial linear regression model has certain features that distinguish it from other models

in this class. First, data modeled using the general linear regression model typically have

some sort of replication, where, for example, we observe a response for n individuals m times. However, with spatially-referenced data, although we have n locations, often we observe a spatial process at these locations only once. Secondly, the specific nature of the dependence places constraints on Σ. In this dissertation, we consider two classes of parameterizations for Σ(θ), corresponding to geostatistical and lattice data, and discuss these in more detail in Section 2.2.

By accounting for spatial dependence in the spatial linear regression model, we can:

13 1. Quantify the relationship between the response variable and covariates while ac-

counting for residual spatial dependence. When we ignore residual spatial depen-

dence, inferences on model parameters will be invalid and the standard errors associ-

ated with these estimators tend to be too small (see, for example, Schabenberger and

Gotway, 2005).

2. Predict the value of the response variable at an unobserved location using both ob-

served values of the response variable and covariate information. We expect that re-

sponse variables at unobserved locations will be more similar to nearby observations

than to observations which are farther away, and accounting for this phenomenon is

imperative in prediction.

Up to this point, we have only considered regression models for continuous response variables. Frequently, however, the response variable will not be continuous but may be reported as belonging to two or more categories. At the beginning of this section, we gave three examples of situations where a spatial regression model would be desirable. However, in each of these examples, the response variable may not be continuous:

• Environmental Health. The health outcome of an individual (e.g., has cancer/does

not have cancer) is a binary response variable.

• Real Estate. Whether or not the house sells during a particular period of time is a

binary response variable.

• Land-Cover/Land-Use Change. The land cover category (e.g., forest, agriculture,

grassland, or urban) at each location is a categorical response variable.

14 When the response variable is discrete or categorical, the spatial linear regression model

for continuous response variables is not appropriate. In the following section, we give an

overview of existing approaches for modeling spatially-dependent categorical data.

1.2 Modeling Categorical Spatial Data

Let {(Yi, xi); i = 1, . . . , n} be paired observations at location si, where Yi is now a categorical response variable and xi is a corresponding k × 1 vector of covariates. Let

Y = (Y1, . . . , Yn)′ and X = (x1, . . . , xn)′.

1.2.1 The Spatial Generalized Linear Model

To motivate the spatial generalized linear model (GLM), we first describe the GLM for

independent response variables, first introduced by Nelder and Wedderburn (1972). Let

Yi for i = 1, . . . , n, be independent random variables. If the distribution of Yi|ζi, ωi is a

member of the exponential family, we can write

p(Yi|ζi, ωi) = exp{ (Yiζi − b(ζi))/a(ωi) + c(Yi, ωi) }.

For the special case of a two-category, or binary, response variable, Yi|ζi, ωi can be assumed to have a Bernoulli distribution with

ζi = log( pi/(1 − pi) )

b(ζi) = log(1 + exp(ζi))

a(ωi) ≡ a = 1

c(Yi, ωi) ≡ c = 0

where pi is the probability that Yi = 1. The mean of Yi is E(Yi) = b′(ζi) ≡ λi and the variance

of Yi is Var(Yi) = b′′(ζi)a(ωi) ≡ vλi. Therefore, for the Bernoulli distribution,

E(Yi) = b′(ζi) = exp(ζi)/(1 + exp(ζi)) = pi

and

Var(Yi) = b′′(ζi)a(ωi) = exp(ζi)/(1 + exp(ζi))^2 = pi(1 − pi).

For an n × 1 vector Y = (Y1,...,Yn), we have

E(Y ) ≡ λ = p and

Var(Y ) = Vλ,

where Vλ is a diagonal matrix with the ith diagonal element equal to vλi = pi(1 − pi).

To relate the Yis to the covariates xi, we introduce a so-called link function, h(·), and assume

h(λi) ≡ µi = xi′β,

where β is a k × 1 vector of coefficients. Two frequently used link functions for the

Bernoulli distribution are the logit and probit functions:

Logit: µi = log( pi/(1 − pi) ) = xi′β
Probit: µi = Φ^{−1}(pi) = xi′β.

Therefore,

µ = Xβ,

where X = (x1, . . . , xn)′ is an n × k matrix of covariates and µ is an n × 1 vector.
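For reference, a non-spatial GLM of this form can be fit in R with the built-in glm() function under either link; the simulated covariates and coefficient values below are illustrative placeholders.

```r
## Minimal sketch: fitting (non-spatial) probit and logit GLMs to simulated binary data
set.seed(3)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p  <- pnorm(0.5 - 1.0 * x1 + 0.8 * x2)            # probit link: p_i = Phi(x_i' beta)
y  <- rbinom(n, size = 1, prob = p)

fit_probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
fit_logit  <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
coef(fit_probit)   # estimates of beta under the probit link
```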

One approach to estimate β is by specifying a quasi-likelihood (QL; Wedderburn,

1974). Let Ui = u(λi; Yi) be the standardized value of Yi, so that Ui = (Yi − λi)/vλi. The first two moments of this random variable have the same properties as the log-likelihood derivatives, so

Ui = (Yi − λi)/vλi = ∂/∂λi ∫_{Yi}^{λi} (Yi − t)/vt dt = ∂Q(λi; Yi)/∂λi.

Thus,

Q(λi; Yi) = ∫_{Yi}^{λi} (Yi − t)/vt dt.

Because the data are independent, Q(λ; Y ) = Σ_{i=1}^n Q(λi; Yi), so that ∂Q(λ; Y )/∂λ = Vλ^{−1}(Y − λ). Notice that Q(λ; Y ) is a function of λ and Y , but we are actually interested in estimating β.

Therefore, we determine U(β) = ∂Q(λ; Y )/∂β by calculating

U(β) = [∂Q(λ; Y )/∂λ][∂λ/∂β] = ∆′Vλ^{−1}(Y − λ),

where ∆ is an n × k matrix with elements ∆ij = ∂λi/∂βj. We then estimate β by solving

the quasi-likelihood estimation equations (i.e., setting U(β̂) = 0 and solving for β̂).

For the generalized linear model, we now consider the case where the elements of Y

are not independent. Albert and McShane (1995) and Gotway and Stroup (1995) redefine

the variance of Y to be

Var(Y ) = Vλ^{1/2} R(θ) Vλ^{1/2} ≡ ΣY ,

where Vλ is defined as above, R(θ) is a correlation matrix, and θ is an m × 1 vector of

dependence parameters with m ≥ 1. For spatially-referenced data, R(θ) will be a spatial

correlation matrix like those given in Section 2.2.

17 To fit the GLM for correlated response variables, Liang and Zeger (1986) extend the

QL approach to model fitting. They use ΣY in place of Vλ, resulting in

U(β) = [∂Q(λ; Y )/∂λ][∂λ/∂β] = ∆′ΣY^{−1}(Y − λ).   (1.6)

These equations, U(β), are called generalized estimating equations (GEE).

In the spatial setting, Albert and McShane (1995) use GEE not only to fit β, but also to

fit ΣY . In addition to (1.6), they let

A′Σ̂Y^{−1}(σ̂Y − σY ) = 0,   (1.7)

where Σ̂Y is the working covariance matrix, σY and σ̂Y are n(n − 1)/2 × 1 vectors consisting of all entries below the diagonal of ΣY and Σ̂Y , and A = ∂σY /∂β. They then iterate between solving (1.6) and (1.7) until convergence. Lin and Clayton (2005) show asymptotic normality and consistency of the GEE for binary spatial data with isotropic covariance functions. They prove and apply their results for the logit link function. Lin (2008) extends the GEE to Lp space.

Breslow and Clayton (1993) propose using a pseudolikelihood function to estimate model parameters in the spatial generalized linear model. Their approach is similar to GEE, but it also computes ‘pseudodata’ corresponding to the observed binary process within the iterations of the model-fitting algorithm.

In the literature for spatially-referenced data, there are no Bayesian versions of the

GEE and pseudolikelihood methods described above. This is due to the fact that these approaches do not have a true likelihood function. Yin (2009) proposes a Bayesian generalized method of moments that uses a weighted quadratic objective function in place of a likelihood function. To the best of our knowledge, this approach has not been applied in a spatial setting. Instead, it is common to use a Bayesian spatial generalized linear mixed

model (see Section 1.2.2).

Albert and McShane (1995) emphasize the fact that the GEE approach treats the spatial

correlation as a nuisance. In contrast, the GLMM approach seeks to estimate the mean of

the data given the spatial random effect, as described below.

1.2.2 The Spatial Generalized Linear Mixed Model

The spatial generalized linear mixed model (GLMM) extends the GLM by introducing

a random spatially-dependent error term, or spatial random effect, denoted S(si) ≡ Si, into the mean model. The spatial GLMM can be written as

E(Yi|Si) = λ(si) ≡ λi

h(λi) = xi′β + Si

where E(Si) = 0 and Var(S) = Σ(θ) with S = (S1, . . . , Sn)′ and Σ(θ) an n × n spatial

covariance matrix.
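A minimal R sketch of simulating binary data from a spatial GLMM of this form is given below, using a probit link and an exponential spatial covariance; the link choice, covariance family, range parameter, and regression coefficients are illustrative assumptions.

```r
## Minimal sketch: simulating binary data from a spatial GLMM with a probit link
set.seed(4)
n <- 300
coords <- cbind(runif(n), runif(n))
D <- as.matrix(dist(coords))
Sigma <- exp(-D / 0.2)                           # Sigma(theta): exponential, range 0.2 (assumed)
S <- as.numeric(t(chol(Sigma)) %*% rnorm(n))     # spatial random effects S_1, ..., S_n
x <- rnorm(n)
p <- pnorm(0.5 + 1.0 * x + S)                    # h^{-1}(x_i' beta + S_i) with a probit link
Y <- rbinom(n, size = 1, prob = p)
```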

The spatial GLMM was first introduced within the Bayesian setting by Diggle et al.

(1998) as a general modeling framework for spatially-dependent discrete data. As an al-

ternative, a classical version of the spatial GLMM was introduced by Heagerty and Lele

(1998). Heagerty and Lele propose a composite likelihood approach for fitting a spatial

GLMM for binary response data. Paciorek (2007) gives a review of various versions of

spatial logistic GLMMs, as well as a discussion of the specification of spatial dependence

structure in this class of models.

Diggle et al. (1998) propose fitting the Bayesian spatial GLMM using MCMC algo-

rithms. This model-fitting approach is popular due to established computer software, such

19 as WinBUGS (Lunn et al., 2000), that can be used to fit Bayesian spatial GLMMs (see, for

example Banerjee et al., 2004; Law and Haining, 2004). However, as noted by Paciorek

(2007), MCMC algorithms for Bayesian spatial GLMMs often converge slowly and ex-

hibit poor mixing. To improve the performance of these algorithms, Langevin-Hastings

methods have been suggested in the literature (Christensen et al., 2001; Christensen and

Waagepetersen, 2002; Christensen and Ribeiro Jr., 2002). This method involves modifying

a Metropolis random walk algorithm, so that the proposal distribution not only relies on

the value of the parameter from the previous iteration, but also relies on the gradient of the

log-likelihood function. Christensen et al. (2006) propose further improvements by trans-

forming the posterior values of the spatial random effects, Si, by modifying Si|Yi with a

Cholesky factorization of the posterior covariance matrix. They illustrate that this method is more robust than either not transforming the Sis or transforming the Sis a priori.

When S has a Gaussian distribution, an alternative Bayesian model-fitting approach to MCMC is to use integrated nested Laplace approximations (INLA; Rue et al., 2009).

INLA seeks to approximate posterior marginals rather than producing samples from the joint posterior distribution of model parameters. While this approach is attractive, it is not clear whether its applicability is as general as with MCMC methods.

1.2.3 Indicator Kriging

Rather than modeling relationships between response variables and covariates, indicator kriging predicts values of binary random variables at unobserved locations based only on nearby binary observations (Switzer, 1977; Journel, 1983). Indicator kriging is based on a spatial prediction method called kriging and is a direct extension of this popular method to the binary data setting. Indicator kriging requires a function γ(dij) = (1/2)Var(Yi − Yj), where γ(·) is an isotropic semivariogram and dij = ||si − sj|| is the distance between

locations. Using this function, prediction of Y (s0), where s0 is an unobserved location, is

based on p0 ≡ P (Y (s0) = 1|Y ). An estimate of p0, p̂0, is its best linear unbiased estimator

(Cressie, 1993; De Oliveira, 2000):

p̂0 = [ γ + 1 (1 − 1′Γ^{−1}γ)/(1′Γ^{−1}1) ]′ Γ^{−1} Y ,

where Γ is an n × n matrix with the ijth element equal to γ(||si − sj||), γ is an n × 1 vector

with the ith element equal to γ(||s0 − si||), and 1 is an n × 1 vector of ones. Then, Y (s0)

is predicted by taking

Ŷ (s0) = 1 if p̂0 > l0/(l0 + l1), and Ŷ (s0) = 0 otherwise,

where l0 and l1 are specified losses for mispredicting Yi as a 0 and 1, respectively.
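The R sketch below evaluates this indicator kriging predictor at a single unobserved location, assuming a known exponential semivariogram with unit sill and equal losses l0 = l1; the semivariogram, locations, and binary data are illustrative placeholders rather than estimates from any real analysis.

```r
## Minimal sketch: indicator kriging prediction at one unobserved location s0
set.seed(5)
n <- 100
coords <- cbind(runif(n), runif(n))
Y  <- rbinom(n, 1, 0.5)                                       # placeholder binary data
s0 <- c(0.5, 0.5)                                             # prediction location

gamma_fun <- function(d) 1 - exp(-d / 0.2)                    # assumed semivariogram
Gamma  <- gamma_fun(as.matrix(dist(coords)))                  # gamma(||s_i - s_j||)
gamma0 <- gamma_fun(sqrt(colSums((t(coords) - s0)^2)))        # gamma(||s_0 - s_i||)
one <- rep(1, n)

Gi <- solve(Gamma)
w  <- gamma0 + one * as.numeric((1 - t(one) %*% Gi %*% gamma0) / (t(one) %*% Gi %*% one))
p0_hat <- as.numeric(t(w) %*% Gi %*% Y)                       # estimated P(Y(s0) = 1 | Y)
Y0_hat <- as.numeric(p0_hat > 0.5)                            # threshold l0/(l0 + l1) with l0 = l1
```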

Indicator kriging was originally motivated by considering indicator functions of a col-

lection of real-valued random variables Z = (Z(s1), . . . , Z(sn))′ ≡ (Z1, . . . , Zn)′ and

defining Yi = I(Zi < z) for i = 1, . . . , n (Switzer, 1977; Journel, 1983). Solow (1993)

compares the estimates of the probability p0 = P (I(Z(s0) < z) = 1|I(Z1 < z), . . . , I(Zn < z)) using indicator kriging to the probability p0∗ = P (Z(s0) < z|Z), where Z has a Gaussian distribution. Although p0 ≢ p0∗, Solow finds that the number of incorrectly predicted

values based on Y and Z were similar and relatively small. However, he only uses a sample

size of n = 4.

Indicator kriging has the benefit that no distributional assumptions about the data gen-

erating process are required. However, there is no guarantee the estimated probabilities will

stay within the appropriate range of [0, 1]. Furthermore, if p0∗ is the probability of interest, Cressie (1993) states that theoretically, disjunctive kriging1 provides a better approximation to this probability than indicator kriging. Diggle et al. (1998) also discuss the weak

theoretical assumptions of indicator kriging.

Other suggested approaches for improving indicator kriging are indicator cokriging

(Journel, 1983), which includes covariates to supplement predictions, and probability krig-

ing (Cressie, 1993), which utilizes an estimate of the cumulative distribution function of

Z within the indicator kriging framework. However, these methods also do not guarantee

estimated probabilities remain in the appropriate range.

1.2.4 The Autologistic Model

First proposed by Besag (1972), the autologistic model directly models the spatial cor-

relation among the binary data by relating the log odds of Yi = 1 to the values of Yj for sj in some neighborhood of si. Let

pi ≡ P (Yi = 1|Y-i, xi, β, α),

where Y-i indicates Y with the ith element removed. The autologistic model assumes that

log( pi/(1 − pi) ) = xi′β + α Σ_{j: i∼j} Yj,

where xi is a k × 1 vector of covariates at location si, β is a k × 1 vector of coefficients, i ∼ j indicates that location j is a neighbor to location i, and α is the spatial dependence parameter. In this setting, the log-odds at each location is not only a linear function of the covariates, but is also dependent on the neighboring binary observations. Extensions and generalizations of this model have been proposed by Besag (1972), Besag (1974), Augustin et al. (1996), Gumpertz et al. (1974), Sim (2000), Zheng and Zhu (2008), and Zhu et al. (2008).

1 Disjunctive kriging requires specification of bivariate distributions of (Zi, Zj), 0 ≤ i < j ≤ n, and then estimates any measurable function g(Z0), where in the binary setting, g(Z0) = 1(Z0 < z) and E(g(Z0)) = p0∗.

Methods for fitting the autologistic model include coding, pseudo-likelihood, Monte

Carlo maximum likelihood, and, for a Bayesian version, Gibbs sampling. Coding methods, proposed by Besag (1974), begin by separating the data into subsets. For each subset, conditional on the values in the remaining portion of the data set, the observations in that subset are independent. Given these conditionally independent observations, conditional maximum likelihood is used to estimate the parameters. Because there are multiple ways to separate the data, inconsistencies arise when this method is used. Huffer and Wu

(1998) introduce a Monte Carlo maximum likelihood method to fit this model, and show their approach yields more efficient estimates than pseudo-likelihood and coding. Sherman et al. (2006) give an overview and comparison of pseudo-likelihood, generalized pseudo-likelihood, and Monte Carlo maximum likelihood model-fitting methods. Augustin et al.

(1996) use a Gibbs sampler to estimate the model parameters.
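Because the pseudo-likelihood factors into independent conditional Bernoulli terms, maximum pseudo-likelihood estimates can be obtained from an ordinary logistic regression that includes the neighbor sum as an additional covariate. The R sketch below illustrates this on a simulated regular grid with a first-order (rook) neighborhood; the grid size, covariate, and placeholder response are assumptions of the sketch, and the simulated response ignores true spatial dependence, so it only demonstrates the mechanics of the fit.

```r
## Minimal sketch: maximum pseudo-likelihood fit of the autologistic model on a grid
set.seed(6)
nr <- 30; nc <- 30
x <- rnorm(nr * nc)
Y <- rbinom(nr * nc, 1, plogis(0.3 + 0.8 * x))    # placeholder data (no true spatial term)

## Sum of the four rook neighbors for each cell (first-order neighborhood)
Ymat <- matrix(Y, nr, nc)
pad  <- rbind(0, cbind(0, Ymat, 0), 0)
nbr_sum <- as.vector(pad[1:nr, 2:(nc + 1)] + pad[3:(nr + 2), 2:(nc + 1)] +
                     pad[2:(nr + 1), 1:nc] + pad[2:(nr + 1), 3:(nc + 2)])

## Pseudo-likelihood = product of conditional Bernoulli terms, so glm() maximizes it
fit_pl <- glm(Y ~ x + nbr_sum, family = binomial(link = "logit"))
coef(fit_pl)   # intercept, beta for x, and alpha (spatial dependence) estimates
```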

The autologistic model offers an improvement over indicator kriging in that the estimated pis must be in [0, 1]. In addition, the autologistic model does not require any underlying distributional assumptions. However, Weir and Pettitt (1999) discuss computational challenges associated with fitting the model and note that parameter estimates can be poor when strong spatial dependence is present.

1.2.5 The Bayesian Spatial Probit Regression Model

The final model in the literature used to model spatially-dependent binary data is the

Bayesian spatial probit regression model. Because this model is the focus of this dissertation, we provide a detailed introduction to it in Chapter 2.

1.3 Overview of Contributions

In this dissertation, we add to the development of the Bayesian spatial probit regression model in the following three ways:

1. Efficient Model Fitting: Models for spatially-dependent data are notoriously cumber-

some to fit. We show how a marginal data augmentation MCMC algorithm can more

efficiently fit the Bayesian spatial probit regression model than standard MCMC al-

gorithms.

2. Spatial Classification: Within the classification literature, classification methods

which allow for spatial dependence are limited. We show how a spatial classification

rule can be derived from the Bayesian spatial probit regression model and provide

an example where our spatial classifier outperforms other well known classification

methods.

3. Multinomial Model Specification: When extending the Bayesian spatial probit re-

gression model to the multi-categorical setting, special considerations must be made

when specifying the latent variable mean and covariance structure to ensure model

parameters are estimable and interpretable. We discuss various specifications of the

latent mean structure and the associated parameter interpretation, and explore the

specification of the latent cross spatial-categorical dependence structure. Addition-

ally, we discuss how data augmentation MCMC strategies for fitting the Bayesian

spatial probit regression model can be extended to the multi-category setting.

1.4 Illustrative Data Set

To illustrate our methods, we use satellite-derived land cover observations over South-

east Asia. In Southeast Asia, deforestation is a major concern and, over the last century,

much of the original forests–as much as 12 percent–have been lost to other land uses

(Munroe et al., 2008). Researchers are interested in the economic, geographic, social,

and demographic factors that contribute to deforestation and other land cover patterns. The

spatial probit regression model is well suited for assessing the strength of the relationships

between binary response variables and covariate information while also allowing for resid-

ual spatial dependence.

The particular data used in our analyses were taken from the Moderate Resolution

Imaging Spectroradiometer (MODIS) Land Cover Type Yearly Level 3 Global 500m

(MOD12Q1 and MCD12Q1) data product for the year 2005. We selected land cover ob-

servations from this data product corresponding to the region bounded by 17◦ to 21◦N and

98◦ to 105◦E, which covers portions of Myanmar, Thailand, Laos, and Vietnam. Figure 1.1 shows an image of the land cover over this region. For this figure, the International Human

Dimensions Programme (IHDP) land cover classifications provided by the MODIS data product were collapsed into five categories: forest, shrub/grassland, savanna, cropland, and other.

We also consider four covariates: elevation, distance to the nearest major road, distance to the coast, and distance to the nearest big city. Elevation is measured in meters, and the distances are Euclidean and measured in degrees, meaning that no costs were taken into account in calculating distance (e.g., distance calculations do not take into account the fact that it might take longer to go over mountains than to go around them). The covariates are standardized. Figures 1.2 - 1.5 show images of the four respective covariates. The city in the middle of Figure 1.5 (distance to big city) is Vientiane, the capital of Laos.

Figure 1.1: Land cover over Southeast Asia, covering the region bounded by 17◦ to 21◦N and 98◦ to 105◦E. The data were taken from the MODIS Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005.

Finally, we note that Figure 1.3 has an artificial wavy line break. This is most likely due to the inadequacies in the plotting capabilities of R, the software used to create these images. However, it may also be an artifact of the country borders and further explanations are being explored. For the data analyses in Sections 3.3 and 4.4, we use a portion of the data not affected by this break (i.e., 17◦ to 19◦N and 98◦ to 100◦E).

Figure 1.2: Elevation (in meters) over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

Figure 1.3: Standardized value of the measured distance to the nearest major road over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

Figure 1.4: Standardized value of the measured distance to the coast over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

Figure 1.5: Standardized value of the measured distance to the nearest big city over the region bounded by 17◦ to 21◦N and 98◦ to 105◦E.

CHAPTER 2

BAYESIAN SPATIAL PROBIT REGRESSION

An attractive alternative to the models for spatially-dependent categorical data reviewed in Section 1.2 is the Bayesian spatial probit regression model. In this chapter, we motivate this model by first describing in Section 2.1.1 the latent variable representation of the

Bayesian probit regression model for independent response variables as proposed by Albert and Chib (1993). In Section 2.1.2 we present extensions to the multivariate and multi-category response variable setting. Following this discussion, we introduce the Bayesian spatial probit regression model in Section 2.2.

2.1 The Bayesian Probit Regression Model

2.1.1 Albert and Chib’s Data Augmentation Strategy

Consider {(Yi, xi); i = 1, . . . , n}, a collection of n binary response variables Yi and

corresponding k×1 vectors of covariates xi. As discussed in Section 1.2.1, the probit GLM

relating the covariates to the binary response variable assumes that the Yis are conditionally

independent given a k × 1 vector of regression coefficients β and can be written as

Yi|β ∼ Bernoulli(pi),   (2.1)
Φ^{−1}(pi) = xi′β,

29 where Φ−1(·) denotes the inverse standard normal cumulative distribution function. In the Bayesian setting, a prior distribution must be specified for the unknown parameter β.

Unlike the normal linear regression model where the normal distribution is a conjugate prior for the regression coefficients, a conjugate prior is not available for β in (2.1). Thus, inference on β typically requires numerical integration, which is not feasible if k is large, or a simulation-based approach such as MCMC.

To facilitate the use of the Gibbs sampler in the Bayesian probit regression model, Albert and Chib (1993) propose the following data augmentation representation of the model.

They introduce a collection of latent variables Z̃ = (Z̃1, . . . , Z̃n)′ and take

Yi = 1 if Z̃i > 0, and Yi = 0 if Z̃i ≤ 0,   (2.2)

where

Z̃ ∼ N(Xβ̃, σ2 In),   (2.3)

X = (x1, . . . , xn)′ is an n × k matrix of covariates, In is the n × n identity matrix, and

σ2 is a variance parameter. (The tilde notation over the random variables used here distin-

guishes identifiable and non-identifiable parameters; we use this notational convention to

aid our discussion of data augmentation strategies in Section 3.1.) Taking σ2 = 1, setting ˜ ˜ Zi = Zi/σ and β = β/σ, and integrating out the Zis, it is straightforward to show that

Albert and Chib’s model specification is equivalent to the probit GLM given in (2.1). In- ˜ troducing the latent Zis and taking the prior on β to be N (mβ,Cβ) facilitates model fitting

via the following Gibbs sampler:

Step 1: Sample Z|Y, β, (σ² = 1).

For $i = 1, \ldots, n$, sample $Z_i$ from
\[ Z_i \mid Y, \beta, (\sigma^2 = 1) \sim \begin{cases} \mathrm{TN}(x_i'\beta, \sigma^2 = 1, 0, \infty), & \text{if } Y_i = 1 \\ \mathrm{TN}(x_i'\beta, \sigma^2 = 1, -\infty, 0), & \text{if } Y_i = 0 \end{cases} \]
where $\mathrm{TN}(\mu, \sigma^2, l, u)$ denotes a truncated normal distribution with mean $\mu$, variance $\sigma^2$, lower bound $l$, and upper bound $u$. By sampling from these univariate truncated

normal distributions, we obtain a sample for Z (Geweke, 1991). In this dissertation,

when sampling from the truncated normal distribution, we use Hans and Craigmile

(2009), which provides a tool in R, based on Geweke (1991), for efficiently sampling

from a truncated normal distribution, particularly when sampling in the tails of the

distribution.

Step 2: Sample β|Z, Y , (σ2 = 1).

When β ∼ N(0, Cβ), the full conditional distribution of β is given by

\[ \beta \mid Z, Y, (\sigma^2 = 1) \sim N\!\left( (C_\beta^{-1} + X'X)^{-1} X'Z,\; (C_\beta^{-1} + X'X)^{-1} \right). \]

Each of these steps involves drawing from known distributions from which sampling is easily implemented; see Albert and Chib (1993) for details.
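To make these two steps concrete, the following R sketch implements a minimal version of this Gibbs sampler on simulated data, assuming the prior mean m_β = 0 and using a simple inverse-CDF truncated normal draw in place of the Hans and Craigmile (2009) tool; all object names are ours.

```r
set.seed(1)

## illustrative data: n binary responses with an intercept and one covariate
n <- 200; k <- 2
X <- cbind(1, runif(n, -0.5, 0.5))
beta_true <- c(0.5, -1)
Y <- as.numeric(drop(X %*% beta_true) + rnorm(n) > 0)

## prior beta ~ N(0, C_beta)
C_beta_inv <- solve(diag(100, k))

## inverse-CDF draw from TN(mu, 1, l, u)
rtnorm1 <- function(mu, l, u) mu + qnorm(runif(1, pnorm(l - mu), pnorm(u - mu)))

n_iter <- 2000
beta <- rep(0, k)
beta_draws <- matrix(NA, n_iter, k)
V <- solve(C_beta_inv + crossprod(X))      # posterior covariance of beta (sigma^2 = 1)

for (t in 1:n_iter) {
  ## Step 1: sample the latent Z_i from their truncated normal full conditionals
  mu <- drop(X %*% beta)
  Z <- vapply(1:n, function(i)
         if (Y[i] == 1) rtnorm1(mu[i], 0, Inf) else rtnorm1(mu[i], -Inf, 0),
       numeric(1))
  ## Step 2: sample beta from its normal full conditional
  beta <- drop(V %*% crossprod(X, Z) + t(chol(V)) %*% rnorm(k))
  beta_draws[t, ] <- beta
}

colMeans(beta_draws[-(1:500), ])           # posterior means after a burn-in of 500
```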

2.1.2 Multi-Category and Multivariate Extensions

Albert and Chib (1993)'s data augmentation strategy for the probit GLM has been extended in several ways. We first define notation for various possible model extensions, then describe each of the model extensions in more detail. Consider observations associated with n 'individuals', each of whom provides m 'responses' (i.e., multivariate observations), which fall into one of ℓ categories. Using this notation, the latent variable representation of the Bayesian probit regression model described in the previous section accommodates n > 1 individuals with m = 1 responses for each individual, and ℓ = 2 possible categories, corresponding to the two categories of a binary response. We now consider various extensions of Albert and Chib's data augmentation strategy when ℓ > 2 or m > 1.

In addition to introducing the latent variable representation of the probit regression model for binary responses, Albert and Chib (1993) also provide a multi-category (multinomial) extension. Now $Y_i \in \{1, \ldots, \ell\}$ denotes the categorical outcome associated with the $i$th individual, where $\ell > 2$ is the number of categories. In this case, the latent variables $\tilde{Z}_{ij}$ for $i = 1, \ldots, n$ and $j = 1, \ldots, \ell$ are introduced to facilitate model fitting. The latent variable representation of this model is given by

\[ Y_i = \arg\max_j \{\tilde{Z}_{ij},\; j = 1, \ldots, \ell\} \]

where

\[ \tilde{Z}_i \sim N([1_\ell \otimes x_i']\tilde{\beta}, \tilde{\Omega}), \]

and $\tilde{Z}_i = (\tilde{Z}_{i1}, \ldots, \tilde{Z}_{i\ell})'$, $1_\ell$ is an $\ell \times 1$ vector of ones, $x_i$ is a $k \times 1$ vector of covariates, $\tilde{\beta}$ is a $k \times 1$ vector of regression coefficients, and $\tilde{\Omega}$ is an $\ell \times \ell$ covariance matrix. Conditional on $\tilde{\beta}$ and $\tilde{\Omega}$, the $\tilde{Z}_i$s are independent. That is,
\[ \mathrm{vec}(\tilde{Z}) \equiv \begin{pmatrix} \tilde{Z}_1 \\ \vdots \\ \tilde{Z}_n \end{pmatrix} \sim N(X\tilde{\beta}, I_n \otimes \tilde{\Omega}), \qquad (2.4) \]
where
\[ X \equiv \begin{pmatrix} 1_\ell \otimes x_1' \\ \vdots \\ 1_\ell \otimes x_n' \end{pmatrix} \]
is a $n\ell \times k$ matrix and $\otimes$ denotes the Kronecker or tensor product between two matrices. For $\tilde{\beta}$ to be identifiable, the first diagonal element of $\tilde{\Omega}$, $\tilde{\Omega}_{1,1}$, is typically set equal to one

(Albert and Chib, 1993; McCulloch et al., 2000). McCulloch and Rossi (1994) and Nobile (1998), however, do not impose identifiability constraints on $\tilde{\Omega}$, but rather assign proper priors on all parameters and report marginal posterior inferences on the identified parameters (e.g., $\tilde{\beta}/\tilde{\Omega}_{1,1}$). Additional approaches for handling this issue of parameter identifiability (i.e., Imai and van Dyk, 2005) will be discussed in Section 3.1.
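For concreteness, a small R illustration of how the stacked design matrix X in (2.4) can be built with Kronecker products is given below; the dimensions are arbitrary and the object names are ours.

```r
n <- 4; k <- 3; l <- 2                       # individuals, covariates, categories
x <- matrix(rnorm(n * k), n, k)              # row i holds x_i'
## stack 1_l %x% x_i' over i = 1, ..., n to form the (n*l) x k matrix X in (2.4)
X <- do.call(rbind, lapply(1:n, function(i) rep(1, l) %x% t(x[i, ])))
dim(X)                                       # n*l rows, k columns
```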

Albert and Chib (1993)’s latent variable representation of the probit GLM has also been

extended by Chib and Greenberg (1998) to the multivariate setting where n > 1 individuals

have $m > 1$ responses with $\ell = 2$ categories for each response. They model this type of independent multivariate binary observations, $Y_1, \ldots, Y_n$, where $Y_i = (Y_{i1}, \ldots, Y_{im})'$, by introducing latent variables $\tilde{Z}_{ih}$ for $i = 1, \ldots, n$ and $h = 1, \ldots, m$. In this case,
\[ Y_{ih} = \begin{cases} 1, & \text{if } \tilde{Z}_{ih} > 0 \\ 0, & \text{if } \tilde{Z}_{ih} \le 0 \end{cases} \]
where
\[ \tilde{Z}_i \sim N([1_m \otimes x_i']\tilde{\beta}, \Sigma), \]
and $\tilde{Z}_i = (\tilde{Z}_{i1}, \ldots, \tilde{Z}_{im})'$, $1_m$ is the $m \times 1$ vector of ones, $x_i$ is a $k \times 1$ vector of covariates, $\tilde{\beta}$ is a $k \times 1$ vector of regression coefficients, and $\Sigma$ is an $m \times m$ correlation matrix. Again, conditional on $\tilde{\beta}$ and $\Sigma$, the $\tilde{Z}_i$s are independent. That is,
\[ \mathrm{vec}(\tilde{Z}) \equiv \begin{pmatrix} \tilde{Z}_1 \\ \vdots \\ \tilde{Z}_n \end{pmatrix} \sim N(X\tilde{\beta}, \sigma^2 I_n \otimes \Sigma), \]
where
\[ X \equiv \begin{pmatrix} 1_m \otimes x_1' \\ \vdots \\ 1_m \otimes x_n' \end{pmatrix} \]
is a $nm \times k$ matrix. For $\tilde{\beta}$ to be identifiable, Chib and Greenberg require $\Sigma$ to be a

correlation matrix and set $\sigma^2 = 1$. More recently, to facilitate model fitting, Liu and Daniels (2006) relax the assumptions on the correlation matrix by using parameter expansion and reparameterization methods to first sample $\Sigma$ as a covariance matrix and then translate it back to a correlation matrix.

2.2 The Bayesian Spatial Probit Regression Model

2.2.1 Model Specification

We now discuss incorporating spatial dependence into the Bayesian probit regression model. From the extensions considered in the previous section, this can be done in two ways. The first way to view this model extension is to consider observations at the different locations as n dependent observations (rather than n independent observations as in Section 2.1.1), so that we have n > 1 'individuals' (or locations in the spatial setting), m = 1 'responses' at each location, and ℓ = 2 possible categories for each observation. We could also view the observations across all locations as a single multivariate response variable (i.e., n = 1 individuals with m > 1 responses which can take on ℓ = 2 possible categories).

Although the resulting models are equivalent, we arbitrarily take the first view in defining our notation, and consider n spatially-dependent univariate binary response variables, Y1, ..., Yn. The resulting model is equivalent to De Oliveira (2000)'s "clipped Gaussian random fields" and is the basis for the model considered by Weir and Pettitt (2000).

For the spatial probit regression model, we introduce latent variables $\tilde{Z} = (\tilde{Z}_1, \ldots, \tilde{Z}_n)'$, which are realizations of a spatially-dependent Gaussian process, and let
\[ Y_i = \begin{cases} 1, & \text{if } \tilde{Z}_i > 0 \\ 0, & \text{if } \tilde{Z}_i \le 0 \end{cases} \qquad (2.5) \]
where
\[ \tilde{Z} \sim N(X\tilde{\beta}, \sigma^2\Sigma(\theta)), \qquad (2.6) \]

with X = (x_1, ..., x_n)', and we assume that the marginal variances of the Z_i s are known up to a multiplicative constant σ² and that the matrix Σ(θ) captures the residual spatial dependence structure. Typically, Σ(θ) will be a correlation matrix. However, we allow for the possibility of heteroskedasticity by only assuming that the diagonal elements of

Σ(θ) are fixed and known constants. We note that in Chapter 1, Σ(θ) represented a spatial

covariance matrix. In our discussion of the Bayesian spatial probit regression model, we

will refer to σ² as the variance of Z_i for i = 1, ..., n, and Σ(θ) as a spatial correlation matrix.

As in the two previous model extensions, β˜ is only identifiable up to a multiplicative

constant. For illustration, consider $P(Y_i = 1 \mid \tilde{\beta}, \Sigma(\theta), \sigma^2)$:
\[ P(Y_i = 1 \mid \tilde{\beta}, \Sigma(\theta), \sigma^2) = P(\tilde{Z}_i > 0 \mid \tilde{\beta}, \Sigma(\theta), \sigma^2) = \Phi\!\left( \frac{x_i'\tilde{\beta}}{\sigma} \right) = \Phi(x_i'\beta), \]

where β = β˜/σ is the identifiable parameter. De Oliveira (2000) and Weir and Pettitt

(2000) chose to set σ² = 1 to ensure that the regression coefficients are identifiable, a choice that has implications for the efficiency of the resulting MCMC algorithms, as we discuss in Section 3.1.

Although similar in spirit to the spatial GLM and spatial GLMM, this model

fits within a separate class under the GLM framework. For the spatial GLM, the spatial dependence is specified directly on Y , rather than on a latent Gaussian process. The spatial

GLMM is more similar to the Bayesian probit regression model in that we introduce a

latent spatially-dependent Gaussian process in the mean of Y ; however, the probability pi

is specified conditional on the value of the Gaussian process, Si, i.e.,

pi = P (Yi = 1|β,Si)

for all i = 1, . . . , n. In the Bayesian spatial probit regression model, in contrast to the spatial GLM and the spatial GLMM, the spatial dependence structure is embedded in the link function.

2.2.2 Parameterization of the Spatial Correlation Matrix

When modeling spatial dependence, it is important to make certain that Σ(θ) is a valid

correlation matrix. Various parameterizations of Σ(θ) that ensure validity and uphold the

characteristics of spatial dependence are available. We consider two classes of parameteri-

zations for Σ(θ), corresponding to geostatistical and lattice data.

For geostatistical/point-referenced data, a geostatistical dependence structure is com-

monly used. In this setting, the correlation of a spatial process at two locations is often

modeled as a function of the distance between the two locations corresponding to the assumptions of second-order stationarity and isotropy. One popular class of parametric spatial correlation functions is the Matérn class. In this case,

Σ(θ) ≡ Σ(ν, λ),

where the $ij$th element of $\Sigma(\nu, \lambda)$ is equal to
\[ \frac{1}{2^{\nu-1}\Gamma(\nu)} \left( \frac{2\sqrt{\nu}\, d_{ij}}{\lambda} \right)^{\nu} K_\nu\!\left( \frac{2\sqrt{\nu}\, d_{ij}}{\lambda} \right), \]

Γ(·) is the usual gamma function, Kν is the modified Bessel function of order ν (see e.g.,

Abramowitz and Stegun, 1965), dij = ||si−sj|| is the Euclidean distance between locations

si and sj, ν > 0 is a parameter controlling the smoothness of the realized random field, and λ > 0 is the spatial scale parameter. Special cases of the Matérn correlation functions include the exponential (ν = 1/2) and the Gaussian (ν → ∞) correlation functions. Both

of these correlation functions can be written in a simpler form, i.e.,

Σ(θ) ≡ Σ(λ).

For the exponential correlation function, the ijth element of Σ(λ) is equal to

\[ \exp\!\left( -\frac{d_{ij}}{\lambda} \right), \qquad (2.7) \]

and for the Gaussian correlation function, the ijth element of Σ(λ) is equal to

\[ \exp\!\left( -\frac{d_{ij}^2}{\lambda^2} \right), \qquad (2.8) \]

where dij and λ are as defined above. There are other parametric correlation functions for

geostatistical data, and we refer the reader to Cressie (1993), Stein (1999), and Banerjee

et al. (2004) for further examples.
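As a small illustration, the exponential and Gaussian correlation matrices in (2.7) and (2.8) can be assembled in R from a set of site coordinates; the coordinates and the value of λ below are arbitrary.

```r
## arbitrary site coordinates on a unit square
set.seed(1)
coords <- cbind(runif(25), runif(25))
D <- as.matrix(dist(coords))          # matrix of Euclidean distances d_ij

lambda <- 2
Sigma_exp   <- exp(-D / lambda)       # exponential correlation, (2.7)
Sigma_gauss <- exp(-D^2 / lambda^2)   # Gaussian correlation, (2.8)
```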

For lattice/gridded data, spatial dependence is modeled by considering spatial neighborhood structures. Neighborhood structures define a set of neighbors for each partition of a domain D indexed by locations {s_1, ..., s_n}. Figure 2.1 illustrates this concept for a regular lattice/grid. In this figure, the grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with empty squares. Figure

2.1 (a) illustrates a first order neighborhood structure, where the ith and jth grid cells are

neighbors if they share a common edge, and (b) illustrates a second order neighborhood

structure, where the ith and jth grid cells are neighbors if they share a common edge or

corner.

A spatial autoregressive structure is commonly used to capture spatial dependence in

models for lattice/gridded data. One example of an autoregressive dependence structure


Figure 2.1: Illustration of neighborhood structures for a regular grid. The grid cell with the black dot represents the location of interest and its neighbors are represented by the grid cells with the empty squares for (a) a first order neighborhood structure and (b) a second order neighborhood structure.

is the conditionally autoregressive (CAR) model (e.g., Banerjee et al., 2004). In the CAR model,

\[ \Sigma(\theta) \equiv \Sigma(\rho) = (D_w - \rho W)^{-1} \qquad (2.9) \]

where the $ij$th element of $W$ is equal to $w_{ij}$, $w_{ij}$ is equal to 1 if grid cell/partition $i$ is a neighbor of cell/partition $j$ and is equal to 0 if cells $i$ and $j$ are not neighbors, $D_w$ is a diagonal matrix with the $i$th diagonal element equal to $w_{i+} = \sum_{j=1}^n w_{ij}$, $\sigma^2$ is the variance, and $\rho$ is the spatial dependence parameter. Other autoregressive dependence structures include the simultaneous autoregressive (SAR) model and the spatial moving average (SMA) model (see, e.g., Cressie, 1993).
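For a regular grid, the pieces of (2.9) can be assembled directly; the following R sketch builds a first-order (rook) neighborhood matrix W on a small grid and the implied CAR matrix Σ(ρ). The grid size and the value of ρ are arbitrary.

```r
## first-order (rook) neighborhood matrix on an nr x nc grid
nr <- 5; nc <- 5
grid <- expand.grid(row = 1:nr, col = 1:nc)
n <- nrow(grid)
W <- matrix(0, n, n)
for (i in 1:n) for (j in 1:n) {
  if (abs(grid$row[i] - grid$row[j]) + abs(grid$col[i] - grid$col[j]) == 1) W[i, j] <- 1
}

Dw  <- diag(rowSums(W))                 # diagonal matrix of neighbor counts w_{i+}
rho <- 0.9
Sigma_car <- solve(Dw - rho * W)        # CAR dependence structure, (2.9)
```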

CHAPTER 3

DATA AUGMENTATION MCMC STRATEGIES

There has been a recent emphasis in the spatial statistics literature on the development of methods that accommodate large data sets. In the Bayesian setting, these methods include dimension reduction techniques (e.g., Higdon, 2002; Xu et al., 2005; Calder, 2007; Banerjee et al., 2008), integrated nested Laplace approximations (Rue et al., 2009), and covariance tapering (see recent work by Shaby and Ruppert, 2010, for a Bayesian treatment of this technique). In this chapter, instead of focusing on model adjustments to accommodate large data sets or approximations to full Bayesian inference procedures, we investigate strategies for efficient Markov chain Monte Carlo (MCMC) algorithms for fitting the

Bayesian spatial probit regression model. While the MCMC strategies discussed here are not necessarily designed to overcome computational challenges associated with massive data sets, in high dimensional settings having efficient algorithms is clearly desirable.

Data augmentation/latent variable methods have been widely recognized for facilitating model fitting of the Bayesian probit regression model. As discussed in Section 2.1.1, the latent variable representation of the Bayesian probit regression model proposed by Albert and Chib (1993) allows model fitting to be performed using a simple Gibbs sampler. To improve the efficiency of the Gibbs sampler in this setting, Imai and van Dyk (2005) propose introducing a working parameter (defined in Section 3.1.1) into the model and compare various data augmentation strategies resulting from different treatments of the working parameter. In this chapter, we build on this work by investigating the efficiency of modified and extended versions of these algorithms for the spatial probit regression model, focusing on the special case of binary response variables. These algorithms include the one previously proposed by De Oliveira (2000), which we discuss further in Section 3.1.1.

In Section 2.2, we defined the latent variable representation of the Bayesian probit

regression model for spatially-referenced binary data. However, as noted in Section 2.2, the

spatial variance parameter, σ2, is not identifiable. In Section 3.1.1, we discuss conditional

and marginal data augmentation strategies for making use of this non-identifiable variance

parameter within an MCMC model-fitting algorithm. We propose three different Gibbs

sampling algorithms corresponding to these strategies. Furthermore, in Section 3.1.2, we

propose modifications to these algorithms by partially collapsing over the sampling steps

within the Gibbs sampler. We compare the various resulting algorithms using a simulation

study in Section 3.2 and in an analysis of satellite-derived land cover data over Southeast

Asia in Section 3.3.

3.1 Data Augmentation MCMC Strategies

3.1.1 Conditional versus Marginal Data Augmentation

In the previous chapter, “data augmentation” referred to the introduction of continuous

latent variables, $\tilde{Z}$, in the Albert and Chib (1993) representation of the Bayesian probit regression model. In this section, we extend our use of "data augmentation" to include conditional and marginal data augmentation MCMC strategies, where we use a working parameter to identify fast and easily implemented algorithms. In data augmentation MCMC strategies, the working parameter is frequently taken to be a parameter that is not identifiable under the observed data, Y, but is identifiable under the complete or augmented data, (Y, Z). In the Bayesian spatial probit regression model, the spatial variance parameter, σ², can serve as a working parameter.

(Y , Z). In the Bayesian spatial probit regression model, the spatial variance parameter, σ2, can serve as a working parameter.

Meng and van Dyk (1999) and van Dyk and Meng (2001) were the first to distinguish between conditional augmentation and marginal augmentation strategies. Under a conditional augmentation strategy, the working parameter is fixed to an optimal constant within the model-fitting algorithm, whereas a marginal augmentation strategy seeks to marginalize over the working parameter within the algorithm. Meng and van Dyk (1999) show that algorithms using a marginal augmentation strategy will have a geometric rate of convergence no larger than their conditional augmentation counterparts.

Below we describe conditional and marginal augmentation strategies for the Bayesian spatial probit regression model defined in Section 2.2. Referring back to the notation in

Section 2.2, we use Z˜ and β˜ to denote the non-identifiable parameters, and Z and β to denote the identifiable parameters (i.e., Z˜ = σZ and β˜ = σβ).

Consider the likelihood function for the identifiable parameters (β, θ) in the spatial probit regression model introduced in Section 2.2:

\[ L(\beta, \theta \mid Y) \propto p(Y \mid \beta, \theta) = \int p(Y, Z \mid \beta, \theta)\, dZ, \qquad (3.1) \]
where $(Y, Z)$ denotes the complete augmented data. As Imai and van Dyk (2005) point out for an independent multi-category response version of the probit regression model, since the working parameter is not identifiable under the observed data likelihood function, we can condition on a fixed value of the working parameter and the likelihood function

will remain unchanged. This conditional augmentation strategy also holds for the spatial version of the model.

Fixing the working parameter $\sigma^2$ to some constant $\sigma_0^2$ results in a conditional augmentation algorithm, and without loss of generality, we can take $\sigma_0^2 = 1$. Thus, for conditional augmentation, (3.1) becomes
\[ L(\beta, \theta \mid Y) = L(\beta, \theta, \sigma^2 \mid Y) \propto \int p(Y, Z \mid \beta, \theta, \sigma^2 = \sigma_0^2)\, dZ = \int_{A_1}\!\!\cdots\!\int_{A_n} \frac{1}{(2\pi\sigma_0^2)^{n/2}|\Sigma(\theta)|^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma_0^2}(Z - X\beta)'\Sigma(\theta)^{-1}(Z - X\beta) \right\} dZ, \qquad (3.2) \]
where
\[ A_i = \begin{cases} (-\infty, 0], & \text{if } Y_i = 0 \\ (0, \infty), & \text{if } Y_i = 1 \end{cases}. \qquad (3.3) \]
Using this conditional augmentation strategy, the associated Gibbs sampling algorithm for sampling from the posterior distribution of (β, θ) is given in Table 3.1 under the Conditional heading within the Non-Collapsed Algorithms section.

As an alternative to conditioning on a specific value of the working parameter, the working parameter can be assigned a proper prior which we can marginalize over to obtain the likelihood of the identifiable parameters. Using this marginal augmentation strategy,

(3.1) can be expressed as
\[ L(\beta, \theta \mid Y) \propto \int\!\!\int p(Y, Z \mid \beta, \theta, \sigma^2)\, \pi(\sigma^2 \mid \beta, \theta)\, d\sigma^2\, dZ = \int_{A_1}\!\!\cdots\!\int_{A_n}\!\int_0^\infty \frac{1}{(2\pi\sigma^2)^{n/2}|\Sigma(\theta)|^{1/2}} \exp\!\left\{ -\frac{1}{2\sigma^2}(Z - X\beta)'\Sigma(\theta)^{-1}(Z - X\beta) \right\} \pi(\sigma^2 \mid \beta, \theta)\, d\sigma^2\, dZ, \qquad (3.4) \]

² De Oliveira (2000) implicitly uses this conditional augmentation strategy in his model for spatially-dependent binary data.

where the Ai are as defined in (3.3).

Following Imai and van Dyk (2005), we consider two marginal data augmentation schemes. In the first scheme, labeled Scheme 1, we marginalize over σ2 completely in updating Z. We do this by sampling (σ2)∗ from its prior distribution π(σ2|β, θ) ≡ π(σ2), sampling from Z˜|Y , β, θ, (σ2)∗, and "sweeping" over σ2 by setting Z = Z˜/σ∗. Thus, the sampled Z is dependent on the identifiable parameter β, but not on the non-identifiable parameter β˜. This approach is valid since σ2 is not likelihood identifiable, so we can sample Z˜ conditional on any plausible value of σ2. We then sample from the joint full conditional distribution of (σ2, β˜) and from the full conditional distribution of θ and again "sweep" over the sampled value of σ2 by setting β = β˜/σ. The Gibbs sampling algorithm associated with this marginal augmentation scheme is listed in Table 3.1 as Marginal-Scheme 1 in the Non-Collapsed Algorithms section.

In the second marginal augmentation scheme, labeled Scheme 2, we include σ2 in the

Gibbs sampler in the usual way by assigning it a proper prior distribution and sampling from its full conditional distribution. Then, we sweep over σ by properly normalizing the non-identifiable parameters in each iteration of the algorithm (i.e., set Z = Z˜/σ and β =

β˜/σ). This algorithm is listed in Table 3.1 as Marginal-Scheme 2 of the Non-Collapsed

Algorithms.

In Marginal-Scheme 1, when the prior distribution on σ2 is diffuse, conditioning on a value of σ2 from the prior distribution when sampling Z˜ will allow the distribution of Z˜ to be more diffuse than when conditioning on the value of σ2 obtained from the previous iteration of the algorithm, as in Marginal-Scheme 2. Thus, the sampled values of Z˜/σ∗ =

Z when using Marginal-Scheme 1 will have a smaller autocorrelation. In turn, we expect a similar decrease in autocorrelation in the sample paths of β and θ.

Imai and van Dyk (2005) consider similar conditional and marginal augmentation strategies for fitting the Bayesian multinomial probit regression model for independent multi-category response variables (see Section 2.1.2). However, in their model, the working parameter is the first diagonal element of the cross-category covariance matrix, $\tilde{\Omega}_{1,1}$. Comparing their algorithms to those of McCulloch and Rossi (1994) and Nobile (1998), they find that their marginal augmentation algorithms converge more quickly and are less sensitive to starting values. In the following sections, through a simulation study and data analysis, we show similar benefits for MCMC algorithms based on marginal data augmentation for the spatial probit regression model.

Non-Collapsed Algorithms

  Marginal-Scheme 1
    Step 1: Sample (σ²)* ~ π(σ²); sample Z̃ | Y, β, θ, (σ²)*; set Z = Z̃/σ*.
    Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.
    Step 3: Sample θ | Z̃, Y, β̃, σ².

  Marginal-Scheme 2
    Step 1: Sample Z̃ | Y, β, θ, σ²; set Z = Z̃/σ.
    Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.
    Step 3: Sample θ | Z̃, Y, β̃, σ².

  Conditional
    Step 1: Sample Z | Y, β, θ, σ² = 1.
    Step 2: Sample β | Z̃, Y, θ, σ² = 1.
    Step 3: Sample θ | Z, Y, β, σ² = 1.

Partially Collapsed Algorithms

  Marginal-Scheme 1
    Step 1: Sample θ | Y, β; sample (σ²)* ~ π(σ²); sample Z̃ | Y, β, θ, (σ²)*; set Z = Z̃/σ*.
    Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.

  Marginal-Scheme 2
    Step 1: Sample θ | Y, β, σ²; sample Z̃ | Y, β, θ, σ²; set Z = Z̃/σ.
    Step 2: Sample (σ², β̃) | Z̃, Y, θ; set β = β̃/σ.

  Conditional
    Step 1: Sample θ | Y, β, σ² = 1; sample Z | Y, β, θ, σ² = 1.
    Step 2: Sample β | Z̃, Y, θ, σ² = 1.

Table 3.1: This table lists the steps in each of the data augmentation algorithms. The first portion shows the non-collapsed data augmentation algorithms introduced in Section 3.1.1. The second portion shows the partially collapsed data augmentation algorithms introduced in Section 3.1.2.

3.1.2 Partially Collapsed Algorithms

In this section, we discuss a method for collapsing Steps 1 and 3 in the algorithms introduced in the previous section. In Step 3 of each algorithm, we draw samples of θ from

θ|Z˜, Y , β˜, σ2. Because Z˜ is a vector of real-valued random variables, by conditioning on it (as opposed to the binary vector Y ) we unnecessarily constrain the distribution of values that θ can take at each iteration of the algorithm, particularly for high dimensional

Z˜. Instead of sampling from θ|Z˜, Y , β˜, σ2, we could marginalize over Z˜ and sample from

θ|Y , β˜, σ2. The latter distribution will be more diffuse than the former, and thus intuitively we might expect improvements in the efficiency of the algorithm.

Consider the joint full conditional distribution of θ and Z˜, where

p(θ, Z˜|Y , β˜, σ2) ∝ p(Y , Z˜|β˜, σ2, θ)π(θ). (3.5)

Since the left hand side of (3.5) can be decomposed into the product of p(θ|Y , β˜, σ2) and p(Z˜|Y , β˜, σ2, θ), Steps 1 and 3 can be collapsed into a single step where we sample from

θ|Y , β˜, σ2 and then from Z˜|Y , β˜, σ2, θ. From (3.5),

\[ p(\theta \mid Y, \tilde{\beta}, \sigma^2) \propto \left[ \int_{A_1}\!\!\cdots\!\int_{A_n} p(Y, \tilde{Z} \mid \tilde{\beta}, \sigma^2, \theta)\, d\tilde{Z} \right] \pi(\theta), \qquad (3.6) \]
where the quantity in square brackets is simply the volume under the $n$-dimensional multivariate normal density function corresponding to the orthant of $\mathbb{R}^n$ defined by $A_1 \times \cdots \times A_n$. Thus, a Metropolis-Hastings step can be used to sample from $\theta \mid Y, \tilde{\beta}, \sigma^2$. Sampling from

Z˜|Y , β˜, σ, θ can be done as before. The modified versions of the algorithms introduced in

Section 3.1.1 that collapse Steps 1 and 3 are listed in Table 3.1 under the heading Partially

Collapsed Algorithms, following terminology used by van Dyk and Park (2008).
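One way to evaluate the orthant probability in (3.6) numerically is with the pmvnorm() function, assuming the mvtnorm package is available; the sketch below does so for the exponential correlation structure with a uniform prior on λ (playing the role of θ) and wraps it in a random-walk Metropolis-Hastings update. The function names are ours, and this evaluation becomes expensive as n grows.

```r
library(mvtnorm)

## log of the marginalized full conditional (3.6), up to a constant, for the
## exponential correlation structure with a Unif(l_lambda, u_lambda) prior on lambda;
## D is the matrix of inter-site distances
log_post_lambda <- function(lambda, Y, X, beta, D, l_lambda = 0, u_lambda = 20) {
  if (lambda <= l_lambda || lambda >= u_lambda) return(-Inf)
  Sigma <- exp(-D / lambda)
  lower <- ifelse(Y == 1, 0, -Inf)           # the orthant A_1 x ... x A_n from (3.3)
  upper <- ifelse(Y == 1, Inf, 0)
  log(pmvnorm(lower = lower, upper = upper,
              mean = drop(X %*% beta), sigma = Sigma)[1])
}

## random-walk Metropolis-Hastings update of lambda based on (3.6)
mh_update_lambda <- function(lambda, Y, X, beta, D, step = 0.5) {
  lambda_star <- rnorm(1, lambda, step)
  log_acc <- log_post_lambda(lambda_star, Y, X, beta, D) -
             log_post_lambda(lambda, Y, X, beta, D)
  if (log(runif(1)) < log_acc) lambda_star else lambda
}
```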

Finally, we note that it is straightforward to show that

p(θ|Y , β˜, σ2) = p(θ|Y , β, σ2 = 1).

Therefore, in the modified Marginal-Scheme 1 algorithm, we do not condition on a particular value of σ2 in sampling θ, thus ensuring that this partially collapsed algorithm samples from the appropriate posterior distribution (see van Dyk and Park, 2008, for related discussion).

3.1.3 Full Conditional Distributions

Using the priors $\beta \sim N(0, C_\beta)$, $\sigma^2 \sim a_0(\chi^2_{v_0})^{-1}$, and $\theta \sim \pi(\theta)$, we obtain the following full conditional distributions. We use the superscript $[t]$ to denote the value of a parameter at the $t$th iteration of the algorithm.

Non-collapsed Algorithms

Step 1: Sample Z[t] from Z|Y , β[t−1], θ[t−1], (σ2)∗:

Each algorithm treats σ2 differently, so that for Marginal-Scheme 1 (σ2)∗ ∼ π(σ2),

for Marginal-Scheme 2 (σ2)∗ = (σ2)[t−1], and for Conditional (σ2)∗ = 1.

For $i = 1, \ldots, n$, define $Z^*_{\neg i} = (Z_1^{[t]}, \ldots, Z_{i-1}^{[t]}, Z_{i+1}^{[t-1]}, \ldots, Z_n^{[t-1]})'$ and sample $\tilde{Z}_i$ from
\[ \tilde{Z}_i \mid Y, Z^*_{\neg i}, \beta^{[t-1]}, \theta^{[t-1]}, (\sigma^2)^* \sim \begin{cases} \mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, 0, \infty), & \text{if } Y_i = 1 \\ \mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, -\infty, 0), & \text{if } Y_i = 0 \end{cases}, \]
where $\mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, \ell, u)$ is a truncated normal distribution with lower and upper bounds $\ell$ and $u$, respectively, and mean and variance
\[ \mu_{\tilde{z}_i} = x_i'\sigma^*\beta^{[t-1]} + \left[\Sigma(\theta^{[t-1]})\right]_{i,\neg i} \left[\Sigma(\theta^{[t-1]})\right]_{\neg i,\neg i}^{-1} \left( \sigma^* Z^*_{\neg i} - X_{\neg i}\,\sigma^*\beta^{[t-1]} \right), \]
\[ \tau^2_{\tilde{z}_i} = (\sigma^2)^* \left( \left[\Sigma(\theta^{[t-1]})\right]_{i,i} - \left[\Sigma(\theta^{[t-1]})\right]_{i,\neg i} \left[\Sigma(\theta^{[t-1]})\right]_{\neg i,\neg i}^{-1} \left[\Sigma(\theta^{[t-1]})\right]_{\neg i,i} \right). \]

Set $Z_i^{[t]} = \tilde{Z}_i/\sigma^*$.

Step 2: Sample (σ2)[t], β[t] from σ2, β|Y , Z[t], θ[t−1]:

For Marginal-Scheme 1 and Marginal-Scheme 2,
\[ (\sigma^2)^{[t]} \sim \left( (\tilde{Z} - X\hat{\beta})'\Sigma(\theta^{[t-1]})^{-1}(\tilde{Z} - X\hat{\beta}) + a_0 + \hat{\beta}'C_\beta^{-1}\hat{\beta} \right) (\chi^2_{n+v_0})^{-1}, \]
where $\hat{\beta} = \left( X'\Sigma(\theta^{[t-1]})^{-1}X + C_\beta^{-1} \right)^{-1} X'\Sigma(\theta^{[t-1]})^{-1}\tilde{Z}$ and $\tilde{Z}$ is taken from Step 1.

For Conditional, set (σ2)[t] = 1.

Then, for all algorithms, sample

\[ \tilde{\beta} \sim N\!\left( \hat{\beta},\; (\sigma^2)^{[t]} \left( X'\Sigma(\theta^{[t-1]})^{-1}X + C_\beta^{-1} \right)^{-1} \right) \]

and set β[t] = β˜/σ[t].

Step 3: Sample θ[t] from θ|Y , Z[t], β[t], (σ2)[t] via a Metropolis-Hastings step:

Sample a proposed value, $\theta^*$, from a proposal distribution $q(\theta \mid \theta^{[t-1]})$. Take
\[ \theta^{[t]} = \begin{cases} \theta^*, & \text{with probability } c(\theta^{[t-1]}, \theta^*) \\ \theta^{[t-1]}, & \text{with probability } 1 - c(\theta^{[t-1]}, \theta^*) \end{cases}, \]
where
\[ c(\theta^{[t-1]}, \theta^*) = \min\!\left\{ \frac{\pi(\theta^* \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]})}{\pi(\theta^{[t-1]} \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]})}\, \frac{q(\theta^{[t-1]} \mid \theta^*)}{q(\theta^* \mid \theta^{[t-1]})},\; 1 \right\}. \]
In the acceptance probability,

\[ \pi(\theta \mid Y, \tilde{Z}, \tilde{\beta}, (\sigma^2)^{[t]}) \propto p(Y, \tilde{Z} \mid \tilde{\beta}, \theta, (\sigma^2)^{[t]})\, \pi(\theta) = \phi(\tilde{Z}; X\tilde{\beta}, (\sigma^2)^{[t]}\Sigma(\theta))\, \pi(\theta), \]

where φ(Z˜; Xβ˜, σ2Σ(θ)) is the multivariate normal probability density function

with mean Xβ˜ and variance σ2Σ(θ) evaluated at Z˜, and Z˜ is taken from Step 1

and β˜ is taken from Step 2.
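For concreteness, a minimal R sketch of Steps 1 and 2 of the non-collapsed Marginal-Scheme 1 algorithm is given below, assuming the exponential correlation structure and the priors above; Step 3 is the Metropolis-Hastings update just described. All object names are ours, and the element-wise conditional moments are computed by brute-force matrix inversion purely for clarity.

```r
## one sweep of Steps 1 and 2 of the non-collapsed Marginal-Scheme 1 algorithm
## Y: 0/1 responses; X: n x k design; Z, beta: identifiable values from the previous
## iteration; Sigma: Sigma(theta^{[t-1]}); C_beta_inv, a0, v0: prior quantities
ms1_step12 <- function(Y, X, Z, beta, Sigma, C_beta_inv, a0 = 3, v0 = 3) {
  n <- length(Y); k <- ncol(X)
  rtnorm1 <- function(mu, sd, l, u)
    mu + sd * qnorm(runif(1, pnorm((l - mu) / sd), pnorm((u - mu) / sd)))

  ## Step 1: draw (sigma^2)* from its prior, update each Z_i-tilde, then sweep
  sigma2_star <- a0 / rchisq(1, v0)              # a0 (chi^2_{v0})^{-1} prior draw
  sigma_star  <- sqrt(sigma2_star)
  Z_tilde <- sigma_star * Z
  for (i in 1:n) {
    S_inv  <- solve(Sigma[-i, -i])
    mu_i   <- drop(sigma_star * sum(X[i, ] * beta) +
                   Sigma[i, -i] %*% S_inv %*%
                   (Z_tilde[-i] - sigma_star * X[-i, , drop = FALSE] %*% beta))
    tau2_i <- drop(sigma2_star * (Sigma[i, i] - Sigma[i, -i] %*% S_inv %*% Sigma[-i, i]))
    Z_tilde[i] <- if (Y[i] == 1) rtnorm1(mu_i, sqrt(tau2_i), 0, Inf)
                  else           rtnorm1(mu_i, sqrt(tau2_i), -Inf, 0)
  }
  Z <- Z_tilde / sigma_star

  ## Step 2: joint draw of (sigma^2, beta-tilde), then sweep beta = beta-tilde / sigma
  Sigma_inv <- solve(Sigma)
  V         <- solve(t(X) %*% Sigma_inv %*% X + C_beta_inv)
  beta_hat  <- V %*% t(X) %*% Sigma_inv %*% Z_tilde
  resid     <- Z_tilde - X %*% beta_hat
  sigma2    <- drop(t(resid) %*% Sigma_inv %*% resid + a0 +
                    t(beta_hat) %*% C_beta_inv %*% beta_hat) / rchisq(1, n + v0)
  beta_tilde <- drop(beta_hat + sqrt(sigma2) * t(chol(V)) %*% rnorm(k))
  list(Z = Z, beta = beta_tilde / sqrt(sigma2), sigma2 = sigma2)
}
```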

Partially Collapsed Algorithms

Step 1: Sample θ[t], Z[t] from θ, Z|Y , β[t−1], (σ2)∗:

• Sample θ[t] from θ|Y , β[t−1] via a Metropolis-Hastings step.

Sample a proposed value, $\theta^*$, from a proposal distribution $q(\theta \mid \theta^{[t-1]})$. Take
\[ \theta^{[t]} = \begin{cases} \theta^*, & \text{with probability } c(\theta^{[t-1]}, \theta^*) \\ \theta^{[t-1]}, & \text{with probability } 1 - c(\theta^{[t-1]}, \theta^*) \end{cases} \]

where

\[ c(\theta^{[t-1]}, \theta^*) = \min\!\left\{ \frac{\pi(\theta^* \mid Y, \beta^{[t-1]})}{\pi(\theta^{[t-1]} \mid Y, \beta^{[t-1]})}\, \frac{q(\theta^{[t-1]} \mid \theta^*)}{q(\theta^* \mid \theta^{[t-1]})},\; 1 \right\}. \]

In the acceptance probability,

\[ \pi(\theta \mid Y, \beta^{[t-1]}) \propto p(Y \mid \beta^{[t-1]}, \theta)\, \pi(\theta) = \int_{A_1}\!\!\cdots\!\int_{A_n} \phi(Z; X\beta^{[t-1]}, \Sigma(\theta))\, dZ\; \pi(\theta), \]
where $\phi(Z; X\beta, \Sigma(\theta))$ is the multivariate normal probability density function

with mean Xβ and variance Σ(θ) evaluated at Z and the Ai are as defined in

(3.3).

• Sample Z[t]|Y , β[t−1], θ[t], (σ2)∗.

Each algorithm treats σ2 differently, so that for Marginal-Scheme 1 (σ2)∗ ∼

π(σ2), for Marginal-Scheme 2 (σ2)∗ = (σ2)[t−1], and for Conditional (σ2)∗ =

1.

For $i = 1, \ldots, n$, define $Z^*_{\neg i} = (Z_1^{[t]}, \ldots, Z_{i-1}^{[t]}, Z_{i+1}^{[t-1]}, \ldots, Z_n^{[t-1]})'$ and sample $\tilde{Z}_i$ from
\[ \tilde{Z}_i \mid Y, Z^*_{\neg i}, \beta^{[t-1]}, \theta^{[t]}, (\sigma^2)^* \sim \begin{cases} \mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, 0, \infty), & \text{if } Y_i = 1 \\ \mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, -\infty, 0), & \text{if } Y_i = 0 \end{cases}, \]
where $\mathrm{TN}(\mu_{\tilde{z}_i}, \tau^2_{\tilde{z}_i}, \ell, u)$ is a truncated normal distribution with lower and upper bounds $\ell$ and $u$, respectively, and mean and variance
\[ \mu_{\tilde{z}_i} = x_i'\sigma^*\beta^{[t-1]} + \left[\Sigma(\theta^{[t]})\right]_{i,\neg i} \left[\Sigma(\theta^{[t]})\right]_{\neg i,\neg i}^{-1} \left( \sigma^* Z^*_{\neg i} - X_{\neg i}\,\sigma^*\beta^{[t-1]} \right), \]
\[ \tau^2_{\tilde{z}_i} = (\sigma^2)^* \left( \left[\Sigma(\theta^{[t]})\right]_{i,i} - \left[\Sigma(\theta^{[t]})\right]_{i,\neg i} \left[\Sigma(\theta^{[t]})\right]_{\neg i,\neg i}^{-1} \left[\Sigma(\theta^{[t]})\right]_{\neg i,i} \right). \]

Set $Z_i^{[t]} = \tilde{Z}_i/\sigma^*$.

Step 2: Sample (σ2)[t], β[t] from σ2, β|Y , Z[t], θ[t]:

For Marginal-Scheme 1 and Marginal-Scheme 2

\[ (\sigma^2)^{[t]} \sim \left( (\tilde{Z} - X\hat{\beta})'\Sigma(\theta^{[t]})^{-1}(\tilde{Z} - X\hat{\beta}) + a_0 + \hat{\beta}'C_\beta^{-1}\hat{\beta} \right) (\chi^2_{n+v_0})^{-1}, \]
where $\hat{\beta} = \left( X'\Sigma(\theta^{[t]})^{-1}X + C_\beta^{-1} \right)^{-1} X'\Sigma(\theta^{[t]})^{-1}\tilde{Z}$, and $\tilde{Z}$ is taken from Step 1.

For Conditional, set (σ2)[t] = 1.

Then, for all algorithms, sample

\[ \tilde{\beta} \sim N\!\left( \hat{\beta},\; (\sigma^2)^{[t]} \left( X'\Sigma(\theta^{[t]})^{-1}X + C_\beta^{-1} \right)^{-1} \right) \]

and set β[t] = β˜/σ[t].

3.2 Simulation Study

3.2.1 Simulation Set-up

In our simulation study, we compare each of the six proposed algorithms in terms of

computational efficiency and sensitivity to starting values. We consider data sets of sample

size n = 100 corresponding to observations on a 10 × 10 regular grid. We generate our

data from the spatial probit regression model given in (2.5) and (2.6) with a single covariate xi = xi and regression coefficient β = β. We consider a geostatistical dependence structure based on an exponential covariance function as defined in (2.7), as well as on an autoregressive structure based on the CAR model as defined in (2.9). We also consider an independent covariance structure (as in Section 2.1.1), i.e., Σ(θ) ≡ In, to use as a baseline for comparison.

When fitting the independent model, we use the three data augmentation algorithms, without fitting the spatial dependence parameter. We note that the issue of collapsing these algorithms is not relevant in this case. These algorithms are identical to those used by

Imai and van Dyk (2005) for the binary response special case of the multi-category probit regression model. We show how adding spatial dependence to the model can impact the convergence of the algorithms.

Under the three dependence structures, we consider both the non-collapsed and partially collapsed algorithms introduced in Section 3.1 and define the five scenarios listed in Table

3.2. For each scenario, we compare the algorithms resulting from the various conditional and marginal data augmentation strategies (i.e., Marginal-Scheme 1, Marginal-Scheme 2, and Conditional).

Scenario   Spatial Dependence Structure   Partially Collapsed
   1       CAR                            No
   2       CAR                            Yes
   3       Geostatistical                 No
   4       Geostatistical                 Yes
   5       Independent                    –

Table 3.2: Scenarios used to compare the marginal and conditional data augmentation algorithms.

In assigning prior distributions, the resulting models should be the same across all algorithms. Thus, where applicable, we assign priors on identifiable parameters (i.e., on β rather than β˜). Furthermore, because of the non-identifiability within our model, it is important that the prior distributions on the parameters are proper. For Scenarios 1-5, we assign a normal prior distribution to β (i.e., β ∼ N(m_β, C_β) with m_β = 0 and C_β = 100). Each algorithm uses the non-identifiable working parameter, σ, differently. For the conditional augmentation algorithm, σ is fixed at 1. For both marginal augmentation algorithms, σ² ∼ a_σ(χ²_{v_σ})^{-1}, where a_σ = 3 and (χ²_{v_σ})^{-1} represents an inverse chi-squared distribution with parameter v_σ = 3. The spatial dependence structures have different parameterizations, necessitating different prior distributions on the spatial dependence parameters. For the CAR spatial dependence structure, ρ ∼ Unif(1/ξ_(1), 1/ξ_(n)), where ξ_(1) and ξ_(n) are the smallest and largest eigenvalues of D_w^{-1/2} W D_w^{-1/2} (Banerjee et al., 2004, pg. 80). For the geostatistical dependence structure, λ ∼ Unif(l_λ, u_λ) with l_λ = 0 and u_λ = 20.
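For reference, the endpoints of this uniform prior on ρ can be computed directly from the neighborhood structure; the short R sketch below reuses the W and Dw objects from the CAR illustration in Section 2.2.2.

```r
## bounds of the Unif(1/xi_(1), 1/xi_(n)) prior on rho: reciprocals of the extreme
## eigenvalues of Dw^{-1/2} W Dw^{-1/2} (W and Dw as in the earlier CAR sketch)
Dw_half_inv <- diag(1 / sqrt(diag(Dw)))
xi <- eigen(Dw_half_inv %*% W %*% Dw_half_inv, symmetric = TRUE)$values
c(lower = 1 / min(xi), upper = 1 / max(xi))
```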

In our simulation study, we generate data sets following the simulation example used in Nobile (1998) and Imai and van Dyk (2005) for independent binary data, including spatial dependence where appropriate: we independently generate the covariates xi from the uniform distribution on the interval (-.5, .5); take β = −√2, σ² = 1, and β˜ = σβ; and set the spatial dependence parameters ρ = .9 (CAR structure) and λ = 2 (geostatistical structure). For each of the data sets generated under the two spatial dependence structures

(CAR and geostatistical), we fit the corresponding model using both the non-collapsed and partially collapsed algorithms. For the independence case, we fit the model using the

(non-collapsed) algorithms.
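A data set of the kind used in the geostatistical scenarios can be generated along the following lines in R; the MASS package is assumed for the multivariate normal draw, and the seed is arbitrary.

```r
library(MASS)    # for mvrnorm()
set.seed(42)

## locations on a 10 x 10 regular grid and their distance matrix
coords <- as.matrix(expand.grid(1:10, 1:10))
n <- nrow(coords)
D <- as.matrix(dist(coords))

## single covariate and true parameter values from the set-up above
x <- runif(n, -0.5, 0.5)
beta <- -sqrt(2)
lambda <- 2
Sigma <- exp(-D / lambda)                    # exponential correlation, (2.7)

## latent spatial Gaussian process and clipped binary responses, (2.5)-(2.6), sigma^2 = 1
Z_tilde <- mvrnorm(1, mu = x * beta, Sigma = Sigma)
Y <- as.numeric(Z_tilde > 0)
```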

To compare the augmentation algorithms in terms of sensitivity to starting values, we consider two different starting values for (σ, β), namely (σ, β) = (√2, −√2) and (σ, β) = (10, −2). Each algorithm is run for 50,000 iterations, and we somewhat arbitrarily take the

first 10,000 iterations to be the burn-in period.

3.2.2 Simulation Results

Using the five scenarios in Table 3.2, we compare the various algorithms in terms of mixing, convergence, and sensitivity to starting values. As shown by the histograms in Figures 3.1 - 3.5, the algorithms result in nearly identical inferences on the posterior distributions of the identifiable parameters. Thus, with the inferences consistent across algorithms, the algorithms can be compared in terms of their efficiency.

First, we note the trace plots of β and λ under the conditional algorithm of Scenario

3 shown in Figure 3.3. Here, we see evidence of potential lack of convergence from the different behavior in the chain near iteration 50,000. On the other hand, there is no indication of convergence problems in the trace plots of the two Scenario 3 marginal algorithms, nor in the trace plots for Scenario 5 algorithms – the algorithms for the independent probit regression model – as seen in Figure 3.5. This difference provides evidence of additional difficulties in fitting spatial probit regression models and the need for more efficient MCMC algorithms.

Autocorrelation and partial autocorrelation plots of the (post burn-in) sample paths of model parameters also help in determining whether an MCMC algorithm is mixing well.

Figures 3.6 and 3.8 show the autocorrelation and partial autocorrelation plots for the regression coefficient and spatial dependence parameter for Scenarios 1 and 3. It is clear that among the three non-collapsed algorithms, Marginal-Scheme 1 results in the smallest and most quickly decreasing autocorrelation in both parameters' paths under Scenario 3. Under Scenario 1, the top row plots show that the sample path of β under Marginal-Scheme 1 again has the smallest and most quickly decreasing autocorrelation, but the improvement provided by the marginalization strategy is less apparent for the spatial dependence

parameter, ρ.

We also compare the convergence of the three partially collapsed algorithms using autocorrelation plots. Figures 3.7 and 3.9 show the autocorrelation and partial autocorrelation

plots of the regression coefficient and the spatial dependence parameter sample paths for

Scenarios 2 and 4. Here we see that Marginal-Scheme 1 is again superior to the other two partially collapsed algorithms in terms of the autocorrelation of the sampled parameter values. We

might also expect partial collapsing of the algorithms to further improve autocorrelation

summaries. However, for the CAR structure, partially collapsing the algorithms does not

result in improved autocorrelation summaries, as seen by comparing the autocorrelations

of Figures 3.6 and 3.7. On the other hand, for the geostatistical spatial dependence structure,

partially collapsing the algorithms does appear to improve mixing, as seen by comparing

the autocorrelations of Figures 3.8 and 3.9. In this scenario, the most noticeable benefits of

collapsing appear to be for the conditional algorithms. The differences between the non-

collapsed and partially collapsed marginal algorithms are not nearly as strong. Given the

significant increase in computation time required to run the partially collapsed algorithms

compared with their non-collapsed counterparts (it can take roughly 12 times as long to

generate the same number of posterior samples), partially collapsed marginal augmentation algorithms do not appear to be worthwhile.

All scenarios showed that Marginal-Scheme 1 was the least sensitive to starting values.

Figures 3.11 - 3.15 show scatter plots of sampled pairs of σ versus β˜ for Scenarios 1 - 5 generated under each of the three augmentation algorithms and both starting values. The

black dots show the burn-in samples and the colored dots show the draws from the posterior.

Under both starting values, Marginal-Scheme 1 appears to immediately generate samples from the stationary posterior distribution and both parameters easily move around the entire parameter space. However, under the second set of starting values, Marginal-Scheme 2 takes longer to converge and the parameters seem to move around the parameter space more slowly.

Based on our simulation study, we recommend fitting the spatial probit regression model using the non-collapsed Marginal-Scheme 1 algorithm. This algorithm showed superior mixing and convergence properties compared to the Marginal-Scheme 2 and Conditional algorithms. When the computational burden is taken into account, it does not appear that partially collapsing this algorithm is beneficial. In the next section, we compare the performance of the non-collapsed Marginal-Scheme 1 and Conditional algorithms in an analysis of land cover data. These algorithms were selected based on the simulation findings

(non-collapsed Marginal-Scheme 1) and existing literature (non-collapsed Conditional; for example, as in De Oliveira, 2000).

3.3 Application

To illustrate our methods, we use a portion of the data described in Section 1.4. For this analysis, we considered the region bounded by 17◦ to 19◦N and 98◦ to 100◦E, which covers a portion of northwestern Thailand and a small part of Myanmar. Using a 24 × 24 grid over this region, we collapsed the response variable to two categories, forest and non-forest, where the land cover response variable associated with each grid cell was taken to be the most common observed land cover type; forest was coded as "1" and non-forest as "0". We used the covariate distance to the nearest major road in our analysis, and

defined it over the grid to be the median distance for all measurements of this covariate

within a grid cell.

Using the spatial probit regression model defined by (2.5) and (2.6) with the CAR spatial dependence structure given in (2.9), we model the binary land cover response variable

and the distance to the nearest major road covariate. Unlike the simulation study, here we

include an intercept parameter. We expect that less accessible locations (high distance to

nearest major road) are more likely to be forested.

We fit the model using the non-collapsed Marginal-Scheme 1 and Conditional algorithms defined in Table 3.1 and compare the two algorithms in terms of mixing based on

the autocorrelation in the sample paths of the model parameters. Each algorithm was run

for 80,000 iterations, and after examining trace plots we decided to discard the first 10,000

samples as burn-in.

Inferences on the model parameters are nearly identical for both model-fitting algorithms. As expected, the estimate for the regression coefficient associated with the distance to nearest major road covariate is positive (95 percent credible interval on β1 is

(1.687, 3.428)) so that the farther a location is from a major road (i.e., less accessible), the more likely that location is to be forested. The intercept is not significantly different from 0 (95 percent credible interval on β0 is (−1.094, 0.165)), and the 95 percent credible interval for ρ is (0.968, 0.999) indicating strong residual spatial dependence.

Rather than showing sample autocorrelation plots as in Section 3.2, to highlight the differences between the sample autocorrelation summaries, Table 3.3 provides the autocorrelations for the sample paths of β1 and ρ at selected lags. We do not show the autocorrelation values for the sample path of β0 because they were approximately zero for all lags and

both algorithms. Table 3.3 shows that the Marginal-Scheme 1 algorithm outperforms the

Conditional algorithm, exhibiting smaller autocorrelation in the sampled paths of β1 and ρ.

Sample Autocorrelations for β1

Lag    Marginal-Scheme 1    Conditional
 1     0.3194               0.3576
 2     0.1791               0.2133
 3     0.1115               0.1387
 4     0.0786               0.1004
 5     0.0617               0.0741
10     0.0256               0.0284

Sample Autocorrelations for ρ

Lag    Marginal-Scheme 1    Conditional
 1     0.8514               0.8621
 2     0.7303               0.7496
 3     0.6319               0.6576
 4     0.5515               0.5802
 5     0.4849               0.5134
10     0.2589               0.2898
15     0.1447               0.1752
20     0.0830               0.1081

Table 3.3: Autocorrelations of the sample paths of β1 and ρ for the land cover data analysis.

3.4 Summary

In this chapter, we extended the algorithms of Imai and van Dyk (2005) to the spatially-dependent setting and compared the efficiency of three MCMC algorithms via a simulation study and data analysis. Furthermore, we proposed and compared three additional MCMC algorithms, called partially-collapsed algorithms, which marginalize over the latent variable when sampling the spatial dependence parameter. In both the simulation study and

56 data analysis, we found that the non-collapsed Marginal-Scheme 1 algorithm was the most efficient in terms of autocorrelation, sensitivity to starting values, and computational time.

Figure 3.1: Histograms and trace plots for β and ρ under Scenario 1 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

Figure 3.2: Histograms and trace plots for β and ρ under Scenario 2 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

Figure 3.3: Histograms and trace plots for β and λ under Scenario 3 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

Figure 3.4: Histograms and trace plots for β and λ under Scenario 4 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

Figure 3.5: Histograms and trace plots for β under Scenario 5 for each of the three corresponding algorithms. The black portion represents the burn-in and the colored portion represents draws from the posterior distribution.

Figure 3.6: Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 1 for each of the three corresponding algorithms.

Figure 3.7: Autocorrelation and partial autocorrelation in the sample paths of β and ρ under Scenario 2 for each of the three corresponding algorithms.

Figure 3.8: Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 3 for each of the three corresponding algorithms.

Figure 3.9: Autocorrelation and partial autocorrelation in the sample paths of β and λ under Scenario 4 for each of the three corresponding algorithms.

Figure 3.10: Autocorrelation and partial autocorrelation in the sample path of β under Scenario 5 for each of the three corresponding algorithms.

Figure 3.11: σ vs. β˜ under Scenario 1 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

Figure 3.12: σ vs. β˜ under Scenario 2 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

Figure 3.13: σ vs. β˜ under Scenario 3 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

Figure 3.14: σ vs. β˜ under Scenario 4 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

Figure 3.15: σ vs. β˜ under Scenario 5 for each of the three corresponding algorithms. The black dots represent the burn-in and the colored dots represent the draws from the posterior. Top row: First starting values. Bottom row: Second starting values.

CHAPTER 4

THE BAYESIAN SPATIAL PROBIT REGRESSION MODEL AS A TOOL FOR CLASSIFICATION

There are often two primary goals in regression analyses. The first is to model the relationship between the response variable and the covariates. The second is to accurately predict missing or unobserved responses. Up to this point, we have focused on the first goal, emphasizing the need to allow for residual spatial dependence when analyzing relationships between spatially-referenced binary response variables and associated covariates. In this chapter, we focus on the second goal, prediction.

In some sense, prediction of binary or categorical outcomes can be thought of as a classification problem, where we determine into which class, or category, a collection of

predictors/inputs, or covariates, is likely to fall. For example, in predicting the land cover

category at an unobserved location, we observe a collection of predictors/inputs, such as

elevation or distance to the nearest major road. Using these predictors/inputs, we can determine some decision function which defines a classification rule for classifying each location. Within the classification literature, the decision function is determined using one of

a number of available classification methods, each method using statistical decision theory

to determine optimal classification rules. Popular classification techniques, however, do

not typically take into account the categories of observations nearby in geographical space.

Instead, the decision function relies only on a function of the predictors/inputs.

In classification problems involving spatially-referenced observations, we argue that neighboring observations along with predictors/inputs should be considered. As we will illustrate, classification rules that rely on neighboring observations can be derived from the

Bayesian spatial probit regression model.

The goal of this chapter is to compare, in terms of classification error, the Bayesian spatial probit regression model-based classifier to various other classifiers. We first define notation for the classification problem in Section 4.1. In Section 4.2 we discuss classification in the regression setting and, using the Bayesian spatial probit regression model, define a classification rule for spatially-referenced observations. In Section 4.3 we give an overview of popular classification methods, which we then compare in Section 4.4 to the Bayesian spatial probit regression model-based classifier in terms of both the training

(in-sample prediction) and test (out-of-sample prediction) error rates for the Southeast Asia land cover data set used in Section 3.3.

4.1 The Classification Problem

In the classification problem, we again have the set of paired observations {(Yi, xi); i =

1, . . . , n}, where Yi is a binary response variable and xi is a k × 1 vector of covariates used as predictors/inputs in determining decision boundaries for the classes. In the classification setting, the response variable is an indicator identifying the class to which the set of observed predictors/inputs belongs. Although many of the classification methods listed below can be generalized to the multi-category/class setting, we restrict our description of these

methods to the binary setting. In this setting there are two classes, C0 and C1, where Cj represents the class of observations where Y = j. Of the observations {(Yi, xi); i = 1, . . . , n}, n0 fall into class C0 and n1 fall into class C1, and n0 + n1 = n. Each classification method

defines a decision function δ(ω) where ω is the set of applicable predictors/inputs, model parameters, and in the spatial setting, surrounding observations. Based on this decision function, we can define a classification rule.

Before discussing each of the classification methods, we first give a brief overview of the classification process. Typically, when using classification methods, the sample data is randomly divided into two subsets, the training and test data sets, of sizes ntrain and ntest, respectively. This allows us to fit the model on which the classifier is based using the training data and evaluate the method’s ability to predict based on the test data set. When we compare classification methods, we consider the training and test error rates, or the number of incorrect classifications among training and test data sets, divided by the sample size of each data set, respectively.

Furthermore, some of the classifiers rely on fixed parameters that are not estimated, and thus, require tuning to minimize the prediction error. In this case, it is common to use

five-fold cross-validation to obtain an optimal value for the fixed parameters. We do this by dividing the training data set into five equal subsets. We then repeatedly fit the model, leaving the mth subset out, and compute the error rate for the mth subset for

each m = 1,..., 5. Averaging across the five error rates determines the cross-validation

error (CVE). This procedure is repeated for various fixed values of the tuning parameter to

determine a satisfactory value for the analysis.
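As a generic illustration, five-fold cross-validation of a single tuning parameter can be organized as in the R sketch below; fit_fun() and error_fun() are hypothetical stand-ins for whichever classifier and error rate are being tuned.

```r
## generic five-fold cross-validation error for one value of a tuning parameter;
## fit_fun(train, tune) returns a fitted classifier and error_fun(fit, test) its error rate
cv_error <- function(data, tune, fit_fun, error_fun, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(data)))   # random assignment to K subsets
  errs <- sapply(1:K, function(m) {
    fit <- fit_fun(data[folds != m, ], tune)
    error_fun(fit, data[folds == m, ])
  })
  mean(errs)                                            # the cross-validation error (CVE)
}
```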

In each of the following sections, we generically use $\{(Y_i, x_i); i = 1, \ldots, n\}$ as the data used for fitting the method, and $(Y^0, x^0)$ as the point of prediction or classification.

4.2 GLM-Based Classification

In this section, we first discuss the standard approach for classification using a GLM

and the decision boundaries and classification rules for two commonly used GLMs. We

then introduce a spatial classification technique using the Bayesian spatial probit regression model. In specifying our underlying probability models, although a slight misuse of

notation, we explicitly condition on x to make it clear which predictors/inputs we rely on for classification. Furthermore, this also makes the classification rule comparable to that of discriminant analysis as discussed in Section 4.3.1. Finally, we note that in GLM-based classification, the xi include a term allowing for an intercept along with the other predictors/inputs, as is usually done in regression analyses.

4.2.1 Non-Spatial GLM-Based Classification

Consider the GLM for independent binary response variables

\[ Y_i \mid x_i, \beta \sim \mathrm{Bernoulli}(p_i), \qquad g(p_i) = x_i'\beta, \qquad (4.1) \]
where $p_i = P(Y_i = 1 \mid x_i, \beta)$, $x_i$ is a fixed $k \times 1$ vector of covariates, $\beta$ is a $k \times 1$ vector of coefficients, and $g(\cdot)$ is a link function. To classify $x^0$, in regression analyses we predict

$Y^0 = 1$ when $P(Y^0 = 1 \mid x^0, \beta) > P(Y^0 = 0 \mid x^0, \beta)$ and $Y^0 = 0$ otherwise. This prediction method can be turned into a classification rule as follows.

Analogous to prediction in the regression setting, we define the following decision function,
\[ \delta(\omega) \equiv \delta(x^0, \beta) = \frac{P(Y^0 = 1 \mid x^0, \beta)}{P(Y^0 = 0 \mid x^0, \beta)} = \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)}, \qquad (4.2) \]

and set it equal to one. Equivalently, the decision function,

\[ \delta(\omega) \equiv \delta^*(x^0, \beta) \equiv \log\!\left( \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} \right), \qquad (4.3) \]
can be set equal to zero.

Based on (4.2), a classification rule is then
\[ Y^0 = \begin{cases} 1, & \text{if } \delta(x^0, \beta) > 1 \\ 0, & \text{otherwise} \end{cases}. \qquad (4.4) \]

Similarly, based on (4.3), a classification rule is
\[ Y^0 = \begin{cases} 1, & \text{if } \delta^*(x^0, \beta) > 0 \\ 0, & \text{otherwise} \end{cases}. \qquad (4.5) \]

A popular form for the link function, g(·), is the logit link, where

\[ g(p_i) = \log\!\left( \frac{p_i}{1 - p_i} \right). \]

Using this link function, the decision function in (4.3) becomes

\[ \delta^*(x^0, \beta) = \log\!\left( \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} \right) = x^{0\prime}\beta \qquad (4.6) \]
and the classification rule in (4.5) is
\[ Y^0 = \begin{cases} 1, & \text{if } x^{0\prime}\beta > 0 \\ 0, & \text{otherwise} \end{cases}. \qquad (4.7) \]

Although there are many approaches for estimating β in a frequentist setting, we estimate

β numerically by maximizing the likelihood function,

\[ L(\beta, Y) = \prod_{i=1}^n p_i^{Y_i}(1 - p_i)^{1-Y_i} = \prod_{i=1}^n \frac{\exp\{Y_i x_i'\beta\}}{1 + \exp\{x_i'\beta\}} \]
with respect to β.
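This maximization is available in standard software; for example, the following R lines (with made-up data) fit the logit GLM by maximum likelihood and apply the rule in (4.7) to a new point.

```r
## illustrative data
set.seed(1)
n <- 200
x <- runif(n, -0.5, 0.5)
Y <- rbinom(n, 1, plogis(1 + 2 * x))

## maximum likelihood fit of the logit GLM and the classification rule (4.7)
fit <- glm(Y ~ x, family = binomial(link = "logit"))
x0 <- data.frame(x = 0.25)                          # a new point to classify
eta0 <- predict(fit, newdata = x0, type = "link")   # x^0' beta-hat
Y0 <- as.numeric(eta0 > 0)
```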

Another common link function is the probit link, where

\[ g(p_i) = \Phi^{-1}(p_i) \]
and Φ is the cumulative standard normal distribution function. Using this link function, the decision function in (4.2) is then

\[ \delta(x^0, \beta) = \frac{g^{-1}(x^{0\prime}\beta)}{1 - g^{-1}(x^{0\prime}\beta)} = \frac{\Phi(x^{0\prime}\beta)}{1 - \Phi(x^{0\prime}\beta)} \qquad (4.8) \]
and the classification rule is
\[ Y^0 = \begin{cases} 1, & \text{if } \Phi(x^{0\prime}\beta)/(1 - \Phi(x^{0\prime}\beta)) > 1 \\ 0, & \text{otherwise} \end{cases}. \qquad (4.9) \]

Again, β can be estimated in a number of ways, one of which is to estimate β numerically by maximizing the likelihood function,

\[ L(\beta, Y) = \prod_{i=1}^n p_i^{Y_i}(1 - p_i)^{1-Y_i} = \prod_{i=1}^n \Phi(x_i'\beta)^{Y_i}\left(1 - \Phi(x_i'\beta)\right)^{1-Y_i} \]
with respect to β. We also consider a Bayesian approach.

Consider the latent variable representation of the Bayesian probit regression model as defined in Section 2.1.1. Without loss of generality, assume σ2 = 1, β˜ = β, and Z˜ = Z.

Using this representation, P (Y = 1|x, β) = P (Z > 0|x, β) and the decision function is:

\[ \delta(x^0, \beta) = \frac{P(Y^0 = 1 \mid x^0, \beta)}{P(Y^0 = 0 \mid x^0, \beta)} = \frac{P(Z^0 > 0 \mid x^0, \beta)}{P(Z^0 \le 0 \mid x^0, \beta)} = \frac{\Phi(x^{0\prime}\beta)}{1 - \Phi(x^{0\prime}\beta)}, \]

which corresponds to the decision function given by (4.8). Note that in this case, the

decision function is the posterior odds. We can use the classification rule given in (4.9)

which, in this setting, corresponds to a 0-1 loss function (i.e., the loss for predicting $Y^0$ correctly is 0 and the loss for predicting $Y^0$ incorrectly is 1). For consistency, we continue to use the frequentist notation for the decision function and classification rule as initially defined earlier in this chapter.

We can estimate $P(Y^0 = 1 \mid x^0, \beta)$ using one of the following two approaches. The first is to use a posterior mean classifier where we estimate $E[\beta \mid Y]$, or the posterior mean of β, and use this estimate to obtain $P(Y^0 = 1 \mid x^0, \hat{\beta}) = P(Z^0 > 0 \mid x^0, \hat{\beta}) = \Phi(x^{0\prime}\hat{\beta})$. In practice, we approximate $E[\beta \mid Y]$ by using $\hat{\beta} = \sum_{t=1}^T \beta^{[t]}/T$, where $\beta^{[t]}$ are draws from the posterior distribution of β, and $T$ is the number of draws from the distribution. This first representation is analogous to the estimation method of the previous likelihood approach.

The second is to use a posterior predictive classifier $p^0 = E[I(Z^0 > 0) \mid x^0, Y]$, or the posterior probability that $Z^0 > 0$. In doing this, we marginalize over the posterior distribution of β to obtain an estimate of $P(Y^0 = 1 \mid x^0, Y) = \int P(Z^0 > 0 \mid x^0, \beta)\, \pi(\beta \mid Y)\, d\beta$. In practice, we estimate $p^0$ by using $\hat{p}^0 = \sum_{t=1}^T I(Z^{0[t]} > 0)/T$, where $Z^{0[t]}$ are draws from the posterior distribution of $Z^0$, and $T$ is the number of draws from the distribution. Thus, for the independent Bayesian probit regression model, the two estimated classification rules are:

1. Posterior Mean Classifier:
\[ Y^0 = \begin{cases} 1, & \text{if } \Phi(x^{0\prime}\hat{\beta})/(1 - \Phi(x^{0\prime}\hat{\beta})) > 1 \\ 0, & \text{otherwise} \end{cases} \qquad (4.10) \]

2. Posterior Predictive Classifier:
\[ Y^0 = \begin{cases} 1, & \text{if } \hat{p}^0/(1 - \hat{p}^0) > 1 \\ 0, & \text{otherwise} \end{cases} \qquad (4.11) \]
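Given posterior draws, both estimated rules reduce to a few lines of R. The sketch below assumes beta_draws is a T × k matrix of posterior draws of β (for example, from the Gibbs sampler of Section 2.1.1) and x0 is the covariate vector, including the intercept, at the point to classify; these names are ours.

```r
## posterior mean classifier, (4.10)
beta_hat <- colMeans(beta_draws)
p_mean   <- pnorm(sum(x0 * beta_hat))
Y0_mean  <- as.numeric(p_mean / (1 - p_mean) > 1)      # equivalently p_mean > 0.5

## posterior predictive classifier, (4.11): draw Z^0 for each posterior draw of beta
Z0_draws <- beta_draws %*% x0 + rnorm(nrow(beta_draws))
p_hat    <- mean(Z0_draws > 0)
Y0_pred  <- as.numeric(p_hat / (1 - p_hat) > 1)
```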

As discussed in Section 4.1, when using classification techniques, we compute error

rates for both the training data (or the observed data used to fit the model) and the test

data (or the observed data not used to fit the model). For the test data, we can simply use

the latent Zj, for j = 1, . . . , ntest, as sampled within the Gibbs sampler. However, for the training data, the latent Zi, for i = 1, . . . , ntrain, are sampled within the Gibbs sampler given the observed Yi. In this case, to use the latent Zi as predictors, we must sample these

values as if the Yi are unknown. Using the posterior draws of β, we can resample the latent

Zi as follows:

(i) Take samples $\beta^{[t]}$ for $t = 1, \ldots, T$ from $\pi(\beta \mid Y)$ and sample a corresponding $Z_i^{[t]} \sim N(x_i'\beta^{[t]}, 1)$. (Note that these $Z_i^{[t]}$ are not those sampled in the MCMC algorithm.)

(ii) Determine $\hat{p}_i^0 = \sum_{t=1}^T I(Z_i^{[t]} > 0)/T$ and let $Y_i^0$ be the predicted value of $Y_i$ using the posterior predictive classifier.

(iii) Repeat (i) and (ii) for all i = 1, . . . , ntrain.

(iv) Compute the training error rate for the posterior predictive classifier:

\[ \sum_{i=1}^{n_{train}} I(Y_i \ne Y_i^0)/n_{train}. \]
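A direct translation of steps (i)-(iv) is sketched below, again assuming beta_draws holds the posterior draws of β and that X and Y are the training design matrix and responses.

```r
## training error rate for the posterior predictive classifier, steps (i)-(iv)
T_draws <- nrow(beta_draws)
n_train <- nrow(X)
p_hat <- numeric(n_train)
for (i in 1:n_train) {
  Z_i      <- beta_draws %*% X[i, ] + rnorm(T_draws)  # step (i): resample Z_i as if Y_i were unknown
  p_hat[i] <- mean(Z_i > 0)                           # step (ii)
}
Y_pred <- as.numeric(p_hat > 0.5)
train_error <- mean(Y_pred != Y)                      # step (iv)
```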

4.2.2 Spatial GLM-Based Classification

As discussed in Section 2.2, the latent variable representation of the probit regression model allows us to include spatial dependence among the response variables. Here we propose extending the use of this model to the classification setting. In this case, the decision function δ(ω) depends on the covariates and regression coefficients, as well as on the categories of the surrounding observations.

Using equations (2.5) and (2.6), it follows that the distribution for the latent variable at an unobserved location is

\[
Z' \mid Y, X, \beta, \theta, Z \sim N(\mu_{Z'}, \sigma^2_{Z'}),
\]
where
\[
\mu_{Z'} = x'^{\top}\beta + \sigma(\theta)_{-1}^{\top}\left(\Sigma(\theta)\right)^{-1}(Z - X\beta)
\]
\[
\sigma^2_{Z'} = \sigma(\theta)_{1} - \sigma(\theta)_{-1}^{\top}\left(\Sigma(\theta)\right)^{-1}\sigma(\theta)_{-1},
\]
$X = (x_1, \ldots, x_n)^{\top}$, $\sigma(\theta)$ is the $(n+1) \times 1$ vector of covariances between $Z'$ and $(Z', Z^{\top})^{\top}$, $\sigma(\theta)_{1}$ is its first element, $\sigma(\theta)_{-1}$ is $\sigma(\theta)$ with the first element removed, and $\Sigma(\theta)$ is the spatial correlation matrix among $Z = (Z_1, \ldots, Z_n)^{\top}$. We can easily include sampling from this distribution in the first step of the MCMC algorithm (see Section 3.1.1 and Table 3.1), so that we can obtain draws from the posterior distribution of $Z'$. In this spatially-dependent case,

\[
P(Y' = 1 \mid x', \beta, \theta, Y, Z) = P(Z' > 0 \mid x', \beta, \theta, Y, Z) = \Phi\!\left(\frac{\mu_{Z'}}{\sqrt{\sigma^2_{Z'}}}\right).
\]
Thus, the decision function at $x'$ is

\[
\delta(\omega) \equiv \delta(x', Y, Z, \beta, \theta) = \frac{P(Z' > 0 \mid x', Y, Z, \beta, \theta)}{1 - P(Z' > 0 \mid x', Y, Z, \beta, \theta)}.
\]

Just as for the independent Bayesian probit regression model, we can consider two ways to obtain an estimated classification rule:

1. Posterior Mean Classifier:

Estimate $E[\beta \mid Y]$, $E[\theta \mid Y]$, and $E[Z \mid Y]$ by computing $\hat{\beta} = \sum_{t=1}^{T} \beta^{[t]}/T$, $\hat{\theta} = \sum_{t=1}^{T} \theta^{[t]}/T$, and $\hat{Z} = \sum_{t=1}^{T} Z^{[t]}/T$, respectively. The estimated classification rule is
\[
Y' = \begin{cases} 1, & \text{if } \dfrac{\Phi(\hat{\mu}_{Z'}/\sqrt{\hat{\sigma}^2_{Z'}})}{1 - \Phi(\hat{\mu}_{Z'}/\sqrt{\hat{\sigma}^2_{Z'}})} > 1 \\ 0, & \text{otherwise} \end{cases} \tag{4.12}
\]
where
\[
\hat{\mu}_{Z'} = x'^{\top}\hat{\beta} + \sigma(\hat{\theta})_{-1}^{\top}\left(\Sigma(\hat{\theta})\right)^{-1}(\hat{Z} - X\hat{\beta})
\]
\[
\hat{\sigma}^2_{Z'} = \sigma(\hat{\theta})_{1} - \sigma(\hat{\theta})_{-1}^{\top}\left(\Sigma(\hat{\theta})\right)^{-1}\sigma(\hat{\theta})_{-1}.
\]

2. Posterior Predictive Classifier:

Estimate $p' = E[I(Z' > 0) \mid x', Y]$ by computing $\hat{p}' = \sum_{t=1}^{T} I(Z'^{[t]} > 0)/T$. This gives an estimate of
\[
P(Y' = 1 \mid x', Y) = P(Z' > 0 \mid x', Y) = \int P(Z' > 0 \mid Z, \beta, \theta)\,\pi(Z, \beta, \theta \mid Y)\,dZ\,d\beta\,d\theta.
\]
The estimated classification rule is then
\[
Y' = \begin{cases} 1, & \text{if } \dfrac{\hat{p}'}{1 - \hat{p}'} > 1 \\ 0, & \text{otherwise.} \end{cases} \tag{4.13}
\]

Again, each of these classification rules corresponds to a 0-1 loss function on the predicted $Y'$.
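The sketch below illustrates how the posterior predictive probability at a new location could be approximated from saved posterior draws. It assumes, purely for illustration, a unit-variance exponential correlation function $\exp(-d/\phi)$ for $\Sigma(\theta)$ and $\sigma(\theta)$, which is an assumption and not the dissertation's general covariance specification; coords, s0, x.new, X.train, beta.draws, phi.draws, and Z.draws are hypothetical names, and in the algorithm described above $Z'$ is instead sampled within the Gibbs sampler itself.

# Posterior predictive probability at a new location s0 (spatial probit sketch).
d0 <- sqrt(colSums((t(coords) - s0)^2))     # distances from s0 to the n locations
D  <- as.matrix(dist(coords))               # n x n inter-location distances
T.draws <- nrow(beta.draws)
z0 <- numeric(T.draws)
for (t in 1:T.draws) {
  Sigma  <- exp(-D / phi.draws[t])                      # Sigma(theta), correlation
  sig.m1 <- exp(-d0 / phi.draws[t])                     # sigma(theta)_{-1}
  mu.z0  <- sum(x.new * beta.draws[t, ]) +
            drop(sig.m1 %*% solve(Sigma, Z.draws[t, ] - X.train %*% beta.draws[t, ]))
  s2.z0  <- 1 - drop(sig.m1 %*% solve(Sigma, sig.m1))   # sigma(theta)_1 = 1 here
  z0[t]  <- rnorm(1, mu.z0, sqrt(s2.z0))
}
p0 <- mean(z0 > 0)                          # estimate of p', used in rule (4.13)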

As with the independent Bayesian probit regression model, for the training data, we must resample the $Z_i$ to determine prediction error rates. For the Bayesian spatial probit regression model, however, we rely on the surrounding observations to provide information about the category of the unobserved locations. In this case, evaluating the training error is not straightforward. We propose the following two approaches for the spatial probit posterior predictive classifier:

A. One-at-a-Time Training Error:

(i) Take samples $(\beta^{[t]}, \theta^{[t]}, Z_{-i}^{[t]})$, for $t = 1, \ldots, T$, where $Z_{-i}^{[t]}$ is an $(n-1) \times 1$ vector of sampled $Z_j$ for $j = 1, \ldots, i-1, i+1, \ldots, n_{\text{train}}$, and sample a corresponding
\[
Z_i^{[t]} \sim N(\mu_{Z_i}, \sigma^2_{Z_i}),
\]
where
\[
\hat{\mu}_{Z_i} = x_i^{\top}\beta^{[t]} + \Sigma(\theta^{[t]})_{i,-i}\left(\Sigma(\theta^{[t]})_{-i,-i}\right)^{-1}(Z_{-i}^{[t]} - X_{-i}\beta^{[t]})
\]
\[
\hat{\sigma}^2_{Z_i} = \Sigma(\theta^{[t]})_{i,i} - \Sigma(\theta^{[t]})_{i,-i}\left(\Sigma(\theta^{[t]})_{-i,-i}\right)^{-1}\Sigma(\theta^{[t]})_{-i,i},
\]
$X_{-i}$ is $X$ with the $i$th row removed, and $\Sigma(\theta^{[t]})_{j,-k}$ is the $j$th row of the estimated spatial correlation of $Z$ with the $k$th column removed. (Note that the $Z_{-i}^{[t]}$ are the posterior samples obtained from the MCMC algorithm; however, the $Z_i^{[t]}$ are not the same as those sampled in the MCMC algorithm.)

(ii) Determine $\hat{p}_i' = \sum_{t=1}^{T} I(Z_i^{[t]} > 0)/T$ and let $Y_i'$ be the predicted value of $Y_i$ using the posterior predictive classifier.

(iii) Repeat (i) and (ii) for all i = 1, . . . , ntrain.

(iv) Compute the one-at-a-time training error: $\sum_{i=1}^{n_{\text{train}}} I(Y_i' \neq Y_i)/n_{\text{train}}$.

B. Joint Training Error:

(i) Take samples $(\beta^{[t]}, \theta^{[t]})$ for $t = 1, \ldots, T$ and sample a corresponding $Z^{[t]} \sim N(X\beta^{[t]}, \Sigma(\theta^{[t]}))$. (Note that the $Z^{[t]}$ are not the same as those sampled in the MCMC algorithm.)

(ii) For $i = 1, \ldots, n_{\text{train}}$, compute $\hat{p}_i' = \sum_{v=1}^{V} I(Z_i^{[v]} > 0)/V$ and let $Y_i'$ be the predicted value of $Y_i$ using the posterior predictive classifier.

(iii) Compute the joint training error: $\sum_{i=1}^{n_{\text{train}}} I(Y_i' \neq Y_i)/n_{\text{train}}$.
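The following R fragment sketches the sampling step that differentiates the two schemes for a single posterior draw; Sigma denotes $\Sigma(\theta^{[t]})$ evaluated at that draw, and beta, Z, X.train, and i are hypothetical names for the corresponding quantities.

# One-at-a-time scheme, step (i): conditional draw of Z_i given Z_{-i}.
cond.i <- solve(Sigma[-i, -i], Z[-i] - X.train[-i, ] %*% beta)
mu.i   <- sum(X.train[i, ] * beta) + drop(Sigma[i, -i] %*% cond.i)
s2.i   <- Sigma[i, i] - drop(Sigma[i, -i] %*% solve(Sigma[-i, -i], Sigma[-i, i]))
Z.i    <- rnorm(1, mu.i, sqrt(s2.i))

# Joint scheme, step (i): draw the whole latent vector in one step.
Z.joint <- drop(X.train %*% beta) +
           drop(t(chol(Sigma)) %*% rnorm(nrow(X.train)))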

Both the one-at-a-time and joint training errors allow for spatial dependence among the bi-

nary predictions/classifications through the latent random variable. The joint training error

allows for spatial dependence only through the spatial dependence structure of the latent

variables, Σ(θ). In contrast, the one-at-a-time training error allows for spatial dependence

through Σ(θ), but also allows for spatial dependence by conditioning on the current values

of the latent random variables at nearby locations, $Z_{-i}^{[t]}$. We note that for the independent

model, both the one-at-a-time and joint training errors will be the same since Zi is inde-

pendent of all other Zj for j = 1, . . . , i − 1, i + 1, . . . , n.

4.3 Alternative Classification Methods

In this section, we give an overview of alternatives to GLM-based classification. In these alternative classification methods, we assume that the $x_i$ only include the predictors/inputs, and thus do not include a term to allow for an intercept as in the GLM-based classification methods. Unless otherwise noted, the classification methods described here are taken from Hastie et al. (2001).

4.3.1 Discriminant Analysis

In discriminant analysis, rather than considering the explanatory variables x as fixed as they are in regression analyses, the x are viewed as random variables, with class-specific density functions fj(x) corresponding to each class Cj. The classes also have prior proba- bilities πj, such that π0 + π1 = 1. To determine the probability that a set of predictors will fall into class j, we employ Bayes’ theorem which implies that

\[
P(Y = j \mid x) = \frac{f_j(x)\pi_j}{f_1(x)\pi_1 + f_0(x)\pi_0}. \tag{4.14}
\]

84 The Bayes’ classification rule is to classify an observation to class C1 when P (Y = 1|x) >

P (Y = 0|x) and to C0 otherwise.

We first describe discriminant analysis in its general form, allowing a general form for the fj(x)s and the decision function, and then discuss four special cases that we use in our analysis.

Let
\[
X_j = \begin{pmatrix} x_1 \\ \vdots \\ x_{n_j} \end{pmatrix},
\]
where $\{x_1, \ldots, x_{n_j}\} = \{x_i : Y_i = j\}$, so that $X_j$ is an $n_j k \times 1$ vector of predictors/inputs corresponding to observations in class $C_j$. To classify $Y'$, for each class we define
\[
X_j' = \begin{pmatrix} x' \\ X_j \end{pmatrix},
\]
where $X_j'$ is an $(n_j + 1)k \times 1$ vector and $x'$ is the $k \times 1$ vector of predictors/inputs associated with $Y'$. Our goal is to determine a decision boundary for classifying $Y'$.

In discriminant analysis, $f_j(\cdot)$ is typically the multivariate normal density function,
\[
f_j(X_j') = \frac{1}{(2\pi)^{(n_j+1)k/2}\,|\Sigma_j^{X}|^{1/2}}\exp\!\left\{-\frac{1}{2}(X_j' - \mu_j^{X})^{\top}\left(\Sigma_j^{X}\right)^{-1}(X_j' - \mu_j^{X})\right\},
\]
where $\mu_j^{X}$ is the $(n_j+1)k \times 1$ class-specific mean vector and $\Sigma_j^{X}$ is the $(n_j+1)k \times (n_j+1)k$ class-specific covariance matrix. It follows that
\[
f_j(x' \mid x_1, \ldots, x_{n_j}) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_j^{x'}|^{1/2}}\exp\!\left\{-\frac{1}{2}(x' - \mu_j^{x'})^{\top}\left(\Sigma_j^{x'}\right)^{-1}(x' - \mu_j^{x'})\right\} \tag{4.15}
\]
where
\[
\mu_j^{x'} = \mu_{j(\{1:k\})}^{X} + \Sigma_{j(\{1:k\},-\{1:k\})}^{X}\left(\Sigma_{j(-\{1:k\},-\{1:k\})}^{X}\right)^{-1}(X_j - \mu_{j(-\{1:k\})}^{X})
\]
\[
\Sigma_j^{x'} = \Sigma_{j(\{1:k\},\{1:k\})}^{X} - \Sigma_{j(\{1:k\},-\{1:k\})}^{X}\left(\Sigma_{j(-\{1:k\},-\{1:k\})}^{X}\right)^{-1}\Sigma_{j(-\{1:k\},\{1:k\})}^{X}.
\]
We use the subscript notation $(\{1:k\})$ to indicate the first $k$ elements of the corresponding matrix or vector (i.e., those indices corresponding to $x'$), and $(-\{1:k\})$ indicates the matrix or vector without the first $k$ elements (i.e., the remaining indices corresponding to $X_j$).

Using the log-odds as the decision function to determine a classification rule for classifying $Y'$, it follows from (4.14) and (4.15) that
\[
\delta(\omega) \equiv \delta(x', \mu_0^{x'}, \mu_1^{x'}, \Sigma_0^{x'}, \Sigma_1^{x'}) = \log\frac{P(Y' = 1 \mid x')}{P(Y' = 0 \mid x')}
\]
\[
= \underbrace{\log\frac{\pi_1}{\pi_0} + \frac{1}{2}\log\frac{|\Sigma_0^{x'}|}{|\Sigma_1^{x'}|} - \frac{1}{2}\mu_1^{x'\top}\left(\Sigma_1^{x'}\right)^{-1}\mu_1^{x'} + \frac{1}{2}\mu_0^{x'\top}\left(\Sigma_0^{x'}\right)^{-1}\mu_0^{x'}}_{\equiv\,\alpha_0} \tag{4.16}
\]
\[
+\; x'^{\top}\underbrace{\left[\left(\Sigma_1^{x'}\right)^{-1}\mu_1^{x'} - \left(\Sigma_0^{x'}\right)^{-1}\mu_0^{x'}\right]}_{\equiv\,\alpha_1} \;-\; x'^{\top}\underbrace{\left[\frac{1}{2}\left(\left(\Sigma_1^{x'}\right)^{-1} - \left(\Sigma_0^{x'}\right)^{-1}\right)\right]}_{\equiv\,\alpha_2} x'.
\]
Here, $\alpha_0$ is a scalar, $\alpha_1$ is a $k \times 1$ vector, and $\alpha_2$ is a $k \times k$ matrix, which we define for notational convenience. Setting the decision function equal to zero, the discriminant analysis-based classification rule is
\[
Y' = \begin{cases} 1, & \text{if } x'^{\top}\alpha_1 - x'^{\top}\alpha_2 x' > -\alpha_0 \\ 0, & \text{if } x'^{\top}\alpha_1 - x'^{\top}\alpha_2 x' \le -\alpha_0. \end{cases} \tag{4.17}
\]

We now consider special cases of (4.17). Each of these special cases assumes that the mean of $x_i$ is equal across all observations, so that $\mu_j^{X} = 1_{n_j+1} \otimes \mu_j$, where $1_{n_j+1}$ is an $(n_j + 1) \times 1$ vector of ones and $\mu_j$ is a $k \times 1$ class-specific mean vector. The difference between the special cases is in the specification of the covariance matrix $\Sigma_j^{X}$. We first describe three popular discriminant analysis methods (linear discriminant analysis, diagonal linear discriminant analysis, and quadratic discriminant analysis), all of which assume that the $x_i$ are independent. Then we describe an approach for spatial discriminant analysis (spatial linear discriminant analysis) due to Šaltytė Benth and Dučinskas (2005).

Assuming the $x_i$ are independent results in the following form for the covariance of $X_j'$:
\[
\mathrm{var}(X_j') = \Sigma_j^{X} = I_{n_j+1} \otimes \Lambda_j, \tag{4.18}
\]
where $I_{n_j+1}$ is an $(n_j + 1) \times (n_j + 1)$ identity matrix and $\Lambda_j$ is a class-specific covariance matrix for the $k$ components of $x_i$. Under this assumption, $x'$ is independent of $x_1, \ldots, x_{n_j}$, so
\[
f_j(x' \mid x_1, \ldots, x_{n_j}) = f_j(x') = \frac{1}{(2\pi)^{k/2}|\Lambda_j|^{1/2}}\exp\!\left\{-\frac{1}{2}(x' - \mu_j)^{\top}\Lambda_j^{-1}(x' - \mu_j)\right\}.
\]

Assuming a constant variance across classes (i.e., Λj = Λ for j = 0, 1) results in

linear discriminant analysis (LDA) because the decision boundary is linear in the xs. The

decision function given in (4.16) can be written in this case as

\[
\delta(\omega) \equiv \delta(x', \mu_0, \mu_1, \Lambda) = \log\frac{P(Y' = 1 \mid x')}{P(Y' = 0 \mid x')}
= \underbrace{\log\frac{\pi_1}{\pi_0} - \frac{1}{2}(\mu_1 + \mu_0)^{\top}\Lambda^{-1}(\mu_1 - \mu_0)}_{\alpha_0^{\mathrm{LDA}}} + x'^{\top}\underbrace{\Lambda^{-1}(\mu_1 - \mu_0)}_{\alpha_1^{\mathrm{LDA}}},
\]
where $\alpha_0^{\mathrm{LDA}}$ is a scalar and $\alpha_1^{\mathrm{LDA}}$ is a $k \times 1$ vector, defined for notational convenience. Note that this decision function is effectively equivalent to the one based on the logistic regression model in (4.6); however, in logistic regression, we assume the $x$'s are fixed and thus make no distributional assumptions on $x$ as in discriminant analysis. This results in the following LDA-based classification rule for $Y'$ given $x'$:
\[
Y' = \begin{cases} 1, & \text{if } x'^{\top}\alpha_1^{\mathrm{LDA}} > -\alpha_0^{\mathrm{LDA}} \\ 0, & \text{if } x'^{\top}\alpha_1^{\mathrm{LDA}} \le -\alpha_0^{\mathrm{LDA}} \end{cases} \tag{4.19}
\]

In practice, the parameters $\pi_j$, $\mu_j$, and $\Lambda$ (and thus $\alpha_0^{\mathrm{LDA}}$ and $\alpha_1^{\mathrm{LDA}}$) are unknown but can be estimated using maximum likelihood, as illustrated in the sketch following this list:

• $\hat{\pi}_j = n_j/n$

• $\hat{\mu}_j = \sum_{i: Y_i = j} x_i / n_j$

• $\hat{\Lambda} = \sum_{j \in \{0,1\}} \sum_{i: Y_i = j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^{\top} / (n - 2)$
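A short R sketch of these estimators and the resulting rule (4.19), assuming a predictor matrix X, a 0/1 response vector Y, and a new covariate vector x.new (all hypothetical names), is:

# LDA estimates and classification of a new observation.
n  <- nrow(X); n0 <- sum(Y == 0); n1 <- sum(Y == 1)
pi0 <- n0 / n; pi1 <- n1 / n
mu0 <- colMeans(X[Y == 0, , drop = FALSE])
mu1 <- colMeans(X[Y == 1, , drop = FALSE])
S0  <- crossprod(sweep(X[Y == 0, , drop = FALSE], 2, mu0))
S1  <- crossprod(sweep(X[Y == 1, , drop = FALSE], 2, mu1))
Lambda <- (S0 + S1) / (n - 2)                          # pooled covariance estimate
alpha0 <- log(pi1 / pi0) -
          0.5 * drop(t(mu1 + mu0) %*% solve(Lambda, mu1 - mu0))
alpha1 <- solve(Lambda, mu1 - mu0)
Y.pred <- as.numeric(sum(x.new * alpha1) > -alpha0)    # rule (4.19)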

Diagonal linear discriminant analysis (DLDA) additionally assumes independence between the $k$ predictors/inputs, so that $\mathrm{var}(x_i) = \Lambda$ is a diagonal matrix. The DLDA-based classification rule is the same as (4.19), but using a diagonal matrix $\Lambda$. The $m$th diagonal element of $\Lambda$ is estimated by $\hat{\Lambda}_{(m,m)} = \sum_{j \in \{0,1\}} \sum_{i: Y_i = j} (x_{im} - \hat{\mu}_{jm})^2 / (n - 2)$, where $x_{im}$ and $\hat{\mu}_{jm}$ are the $m$th elements of $x_i$ and $\hat{\mu}_j$, respectively.

In LDA, we assume a constant covariance for xi among the classes (i.e., Λj = Λ for

j = 0, 1). On the other hand, quadratic discriminant analysis (QDA) allows for each class

to have its own covariance. The decision function now contains a quadratic term in the xs:

\[
\delta(\omega) \equiv \delta(x', \mu_0, \mu_1, \Lambda_0, \Lambda_1) = \log\frac{P(Y' = 1 \mid x')}{P(Y' = 0 \mid x')}
\]
\[
= \underbrace{\log\frac{\pi_1}{\pi_0} + \frac{1}{2}\log\frac{|\Lambda_0|}{|\Lambda_1|} - \frac{1}{2}\mu_1^{\top}\Lambda_1^{-1}\mu_1 + \frac{1}{2}\mu_0^{\top}\Lambda_0^{-1}\mu_0}_{\alpha_0^{\mathrm{QDA}}}
+ x'^{\top}\underbrace{\left(\Lambda_1^{-1}\mu_1 - \Lambda_0^{-1}\mu_0\right)}_{\alpha_1^{\mathrm{QDA}}}
- x'^{\top}\underbrace{\left[\frac{1}{2}\left(\Lambda_1^{-1} - \Lambda_0^{-1}\right)\right]}_{\alpha_2^{\mathrm{QDA}}} x'.
\]
Here, $\alpha_0^{\mathrm{QDA}}$ is a scalar, $\alpha_1^{\mathrm{QDA}}$ is a $k \times 1$ vector, and $\alpha_2^{\mathrm{QDA}}$ is a $k \times k$ matrix, which are defined for notational convenience. This results in the QDA-based classification rule for $Y'$ given $x'$:
\[
Y' = \begin{cases} 1, & \text{if } x'^{\top}\alpha_1^{\mathrm{QDA}} - x'^{\top}\alpha_2^{\mathrm{QDA}} x' > -\alpha_0^{\mathrm{QDA}} \\ 0, & \text{if } x'^{\top}\alpha_1^{\mathrm{QDA}} - x'^{\top}\alpha_2^{\mathrm{QDA}} x' \le -\alpha_0^{\mathrm{QDA}}, \end{cases} \tag{4.20}
\]
where we estimate the parameters by taking

• $\hat{\pi}_j = n_j/n$

• $\hat{\mu}_j = \sum_{i: Y_i = j} x_i / n_j$

• $\hat{\Lambda}_j = \sum_{i: Y_i = j} (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^{\top} / (n_j - 1)$

The final discriminant analysis method we discuss is a special case of the spatial-

temporal extension of LDA recently proposed by Šaltytė Benth and Dučinskas (2005). We restrict our discussion to the spatial case, but refer the reader to Šaltytė Benth and Dučinskas (2005) for more details on the spatial-temporal approach. In spatial LDA, we do not assume independence among the $x_i$ as in the previous three methods; instead, we assume the predictors/inputs are spatially dependent. Following Šaltytė Benth and Dučinskas

(2005), the specific form for the covariance is
\[
\mathrm{var}(X_j') = \Sigma_j^{X} = R_j(\theta) \otimes \Lambda_j,
\]
where $R_j(\theta)$ is an $(n_j + 1) \times (n_j + 1)$ class-specific spatial covariance matrix and $\Lambda_j$ is again $\mathrm{var}(x_i)$. (Contrast this specification with that given by (4.18), where we assumed the $x_i$ were independent.) The separable structure for $\Sigma_j^{X}$ implies that each of the predictors/inputs has the same spatial dependence. Just as with LDA, spatial LDA assumes the covariance for each $x_i$ is the same for both classes, i.e., $\Lambda_j = \Lambda$ for $j = 0, 1$.

The spatial LDA-based classification rule is the same as the general discriminant analysis classification rule in (4.17), but we can simplify the parameters $\mu_j^{x'}$ and $\Sigma_j^{x'}$ as follows:
\[
\mu_j^{x'} = \mu_j + \left[R_j(\theta)_{(1,-1)}\left(R_j(\theta)_{(-1,-1)}\right)^{-1} \otimes I_k\right]\left(X_j - (1_{n_j} \otimes \mu_j)\right)
\]
\[
\Sigma_j^{x'} = \left[R_j(\theta)_{(1,1)} - R_j(\theta)_{(1,-1)}\left(R_j(\theta)_{(-1,-1)}\right)^{-1}R_j(\theta)_{(-1,1)}\right] \otimes \Lambda,
\]
where $R_j(\theta)_{(1,1)}$ indicates the first element of $R_j(\theta)$, $R_j(\theta)_{(-1,-1)}$ indicates $R_j(\theta)$ with the first row and column removed, and likewise for $R_j(\theta)_{(1,-1)}$ and $R_j(\theta)_{(-1,1)}$.

To make use of the spatial-temporal classifier, Šaltytė Benth and Dučinskas use a fixed and known spatial-temporal correlation function. They then calculate unbiased forms of maximum likelihood estimators of the unknown parameters. Therefore, to use their approach in our analysis, which only includes spatial dependence, the estimators for the unknown parameters are:

• $\hat{\mu}_j = \left(1_{n_j}^{\top}\left(R_j(\hat{\theta})_{(-1,-1)}\right)^{-1}1_{n_j}\right)^{-1}1_{n_j}^{\top}\left(R_j(\hat{\theta})_{(-1,-1)}\right)^{-1}[x_1, \ldots, x_{n_j}]^{\top}$

• $\hat{\Lambda} = \dfrac{1}{n-2}\sum_{j \in \{0,1\}}\left([x_1, \ldots, x_{n_j}]^{\top} - (1_{n_j} \otimes \hat{\mu}_j^{\top})\right)^{\top}\left(R_j(\hat{\theta})_{(-1,-1)}\right)^{-1}\left([x_1, \ldots, x_{n_j}]^{\top} - (1_{n_j} \otimes \hat{\mu}_j^{\top})\right)$

4.3.2 Support Vector Machines

First introduced by Cortes and Vapnik (1995), the goal of support vector machines

(SVM) is to determine a hyperplane in covariate space separating the classes in such a way

that the margin between the two classes is maximized. The margin is the minimum distance

between the predictors/inputs $x_i$ of the two classes in the direction perpendicular to the

hyperplane. The resulting function determining this hyperplane is the decision function.

An illustration of this approach is provided in Figure 4.1 (a). In this illustration, for each

observation there are two inputs, $x_{i1}$ and $x_{i2}$, which determine the classes (denoted by the

two plotting symbols). The margin is the gap between the two classes, or the space between

the two dotted lines, and the maximum-margin hyperplane is the solid line between the two

classes. The dashed lines from the two points on the edges of the margin (the support

vectors) to the hyperplane show the distance to be maximized. Figure 4.1 (b) shows

another hyperplane separating the two classes, but one that does not maximize the margin.


Figure 4.1: Illustration of the hyperplane (solid line) and margins (dotted lines) determined by support vector machines separating the two classes (denoted by the two plotting sym- bols). The hyperplane in (a) is the maximum margin hyperplane and the hyperplane in (b) does not maximize the margin.

In SVM, the classes are labeled as either 1 or −1 (instead of 1 or 0, as before). To

accommodate this convention, we define $Y_i^* = 2Y_i - 1$, for $i = 1, \ldots, n$, so that $Y_i^* \in \{-1, 1\}$.

Consider the decision function determined by the linear hyperplane {x : δ(x) = 0}

where

\[
\delta(\omega) \equiv \delta(x, \beta, \beta_0) = x^{\top}\beta + \beta_0. \tag{4.21}
\]

Given observations $\{Y_i^*, x_i\}$ for $i = 1, \ldots, n$, maximizing the margin between the two classes and the hyperplane is equivalent to minimizing $\|\beta\|$ subject to $Y_i^*(x_i^{\top}\beta + \beta_0) \ge 1$ for all $i = 1, \ldots, n$. This problem can be represented as the following Lagrange optimization function
\[
\max_{\eta} L = \max_{\eta}\left(\sum_{i=1}^{n}\eta_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i^*=1}^{n}\eta_i\eta_{i^*}Y_i^*Y_{i^*}^* x_i^{\top}x_{i^*}\right), \tag{4.22}
\]
where $\eta = (\eta_1, \ldots, \eta_n)^{\top}$ are the Lagrangian multipliers, which are subject to the constraints that $\sum_{i=1}^{n}\eta_i Y_i^* = 0$ and $\eta_i \ge 0$ for all $i$. The $x_i$ where $\eta_i > 0$ are the support vectors and are the only vectors which influence the position of the hyperplane. In Figure 4.1, the support vectors are the points on the outer boundary of the margins. We can write the relationship between $\eta$ and $\beta$ as

\[
\beta = \sum_{i=1}^{n}\eta_i Y_i^* x_i,
\]
and the relationship between $\eta$ and $\beta_0$ as
\[
\beta_0 = \frac{1}{n_{sv}}\sum_{i: \eta_i > 0}(x_i^{\top}\beta - Y_i^*),
\]
where $n_{sv}$ is the number of support vectors.

While the above hyperplane is linear, SVM can be extended to create nonlinear bound-

aries between the classes. This extension can be achieved by transforming the predic-

tors/inputs into a space where they can be separated linearly, and again find the separating

hyperplane in this transformed covariate space. We can use the Lagrange optimization

function in (4.22) with the transformed predictors/inputs K(xi, xj):

\[
\max_{\eta} L = \max_{\eta}\left(\sum_{i=1}^{n}\eta_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\eta_i\eta_j Y_i^* Y_j^* K(x_i, x_j)\right), \tag{4.23}
\]
subject to $\sum_{i=1}^{n}\eta_i Y_i^* = 0$ and $0 \le \eta_i \le \gamma$ for all $i$, where $\gamma$ is a tuning parameter allowing for crossover among the two classes and $K(\cdot, \cdot)$ is a symmetric positive (semi-) definite

function. In our data analysis, we consider the following three popular kernels:

• Linear: $K(x_i, x_j) = x_i^{\top}x_j$

• $d$th Degree Polynomial: $K(x_i, x_j) = (1 + x_i^{\top}x_j)^d$

• Radial: $K(x_i, x_j) = \exp\{-\|x_i - x_j\|^2/c\}$

Now, (4.21) can be written as

\[
\delta(\omega) \equiv \delta(x, \eta, \beta_0, \gamma) = \sum_{i=1}^{n}\eta_i Y_i^* K(x, x_i) + \beta_0. \tag{4.24}
\]
Thus,
\[
\hat{\delta}(x', \hat{\eta}, \hat{\beta}_0, \gamma) = \sum_{i=1}^{n}\hat{\eta}_i Y_i^* K(x', x_i) + \hat{\beta}_0
\]
and
\[
Y'^* = \mathrm{sign}\left(\hat{\delta}(x', \hat{\eta}, \hat{\beta}_0, \gamma)\right),
\]
so that the SVM-based classification rule for $Y'$ is
\[
Y' = \frac{1}{2}(Y'^* + 1).
\]

When implementing this classification method, we use the R package e1071 (Dimitriadou et al., 2010) to compute $\hat{\eta}_i$ and $\hat{\beta}_0$ via a quadratic optimization function for a fixed value of $\gamma$.
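For reference, a minimal call fitting the radial-kernel SVM with the cost value reported later in Table 4.2 might look as follows; this is a sketch, not the exact code used for the analysis, and train.x, train.y, test.x, and test.y are hypothetical names for the standardized training and test predictors and labels.

# Fit a radial-kernel SVM classifier with e1071 and compute the test error.
library(e1071)
fit.svm <- svm(x = train.x, y = factor(train.y),
               type = "C-classification",
               kernel = "radial", cost = 0.715)   # cost chosen by cross-validation
pred.test  <- predict(fit.svm, newdata = test.x)
test.error <- mean(pred.test != factor(test.y))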

4.3.3 k-Nearest Neighbors

The k-Nearest Neighbors classification method makes no assumptions about an un-

derlying model. Using this method, for a point $\{Y', x'\}$, the closest $k$ points $\{x_{(r)}, r = 1, \ldots, k\}$ to $x'$ are identified, and $Y'$ is assigned to the most popular class among the

k neighbors, where ties are broken at random. “Distance” here is measured in covariate

space, not geographic space, and could be defined using any valid distance metric. In our

implementation of the method, we use Euclidean distance so that

\[
d_{(r)} = \|x_{(r)} - x'\|.
\]

Here, d(i) represent the ordered distances where the minimum is d(1) and the maximum

is d(n), and x(r) are the xi corresponding to d(r). Using this measure of distance requires

standardization of the variables so that no variable is given more weight than another.

For the binary case, a decision function can then be defined as

\[
\delta(\omega) \equiv \delta(x', x, Y) = \sum_{r=1}^{k} Y_{(r)}/k,
\]
where $Y_{(r)}$ is the $Y_i$ associated with $d_{(r)}$. A $k$-nearest neighbor classification rule is then
\[
Y' = \begin{cases} 1, & \text{if } \sum_{r=1}^{k} Y_{(r)}/k > 0.5 \\ 0, & \text{if } \sum_{r=1}^{k} Y_{(r)}/k < 0.5, \end{cases}
\]
and if $\sum_{r=1}^{k} Y_{(r)}/k = 0.5$, $Y' = 1$ with probability 0.5 and $Y' = 0$ with probability 0.5.
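A corresponding R sketch using the knn() function in the class package, with an illustrative choice of $k = 4$, is shown below; the data object names are hypothetical, and the predictors are standardized as discussed above.

# k-nearest neighbor classification after standardizing the predictors.
library(class)
train.std <- scale(train.x)
test.std  <- scale(test.x,
                   center = attr(train.std, "scaled:center"),
                   scale  = attr(train.std, "scaled:scale"))
pred.knn  <- knn(train = train.std, test = test.std,
                 cl = factor(train.y), k = 4)      # ties broken at random
test.error <- mean(pred.knn != factor(test.y))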

4.4 Comparison of Classification Methods

In this section, we discuss the fitting of classification parameters and compare the meth-

ods in terms of prediction error on the training and test data sets. We base our comparisons

on the land cover data used in Section 3.3, but this time we use four covariates: elevation,

distance to the coast, distance to the nearest big city, and distance to the nearest major road.

We randomly select $n_{\text{train}} = 432$ locations as training data, and use the remaining $n_{\text{test}} = 144$ locations as test data.

4.4.1 Parameter Estimation

For each of the classifiers, we compute estimates of the corresponding unknown pa- rameters as described in the previous section. However, further details are required for this specific data set for the spatial LDA-based, SVM, and k-nearest neighbor classifiers.

When fitting the parameters corresponding to the spatial LDA-based classifier, it is necessary to fit a spatial covariance function. As discussed in Section 4.3.1, Šaltytė Benth and Dučinskas (2005) use a specific parametric class of covariance functions. To estimate the parameters of the covariance function corresponding to each class, we use the R package geoR. We use the Gaussian covariance function, where the covariance between two locations $s_i$ and $s_j$ is

\[
(R(\theta))_{ij} \equiv \left(R(\tau^2, \phi)\right)_{ij} = \tau^2\exp\{-(d_{ij}/\phi)^2\},
\]
where $d_{ij}$ is the Euclidean distance between $s_i$ and $s_j$, $\tau^2$ is the variance, and $\phi$ is the range parameter.

Figure 4.2 shows the estimated semivariograms and fitted semivariograms for each of the classes. In each of these plots, the numbers one through four represent the estimated

(empirical) semivariance at the specified distances for each of the four covariates. These estimates are obtained using the variog() function in the geoR package of R, which computes the sample semivariance for all pairs of observations separated by a given distance.

Note that the empirical semivariograms differ for each covariate. This model, however, requires the same covariance function for each of the covariates. Therefore, we fit the semivariogram to the mean of the four estimated semivariograms. In the plot, the mean semivariogram is represented by the dots, and the line represents the fitted semivariogram function. The fitted values of the parameters are listed in Table 4.1. We note that although

Figure 4.2: Estimated and fitted variograms for class C0 (left) and C1 (right) fit using the training data set. The numbers 1-4 indicate the estimated variogram for the four covariates, the dots represent the mean of the four variograms, and the line indicates the fitted variogram.

Parameter   Class C0   Class C1
τ²          0.2162     0.3928
φ           1.2316     1.5099

Table 4.1: Fitted values for the covariance function parameters for both class C0 and C1.

the Gaussian covariance function may be too smooth for some of the four covariates, in our analysis, this function was found to fit the mean of all four covariate semivariograms best when compared with other functions from the Matérn class of covariance functions (e.g., exponential covariance function).
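A sketch of the per-covariate estimation step with geoR might look as follows (the analysis described above then fits the Gaussian model to the mean of the four empirical semivariograms); coords and covar are hypothetical names for the training locations and one covariate, and the initial values passed to variofit() are illustrative only.

# Empirical semivariogram and Gaussian variogram fit for one covariate.
library(geoR)
gdat      <- as.geodata(cbind(coords, covar))        # columns 1-2 coords, 3 data
emp.vario <- variog(gdat)                            # empirical semivariogram
fit.vario <- variofit(emp.vario, cov.model = "gaussian",
                      ini.cov.pars = c(0.3, 1.0))    # illustrative (sill, range) starts
# fit.vario$cov.pars holds the fitted sill and range parameters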

SVM and k-nearest neighbor classifiers require tuning of the parameters, γ and k. We obtain optimal values of γ and k using five-fold cross-validation as discussed in Section

Classification Method   Tuning Parameter   CVE      Optimal Value
Linear SVM              γ                  0.2744   0.13
Cubic SVM               γ                  0.2791   0.88
Radial SVM              γ                  0.2512   0.715
k-NN                    k                  0.2349   4

Table 4.2: Tuning parameter values for each classification method. The optimal value of the tuning parameter is listed along with the CVE associated with this value. The optimal values were chosen by minimizing the five-fold CVE.

4.1, and set each tuning parameter to the value with the lowest cross-validation error (CVE). Table 4.2 shows the tuning parameter for each classification method and the associated chosen value of each parameter. For the SVM classifiers, the tuning parameter is a cost parameter.

4.4.2 Classification Errors

Table 4.3 shows the training and test errors for each of the classification methods. Most of the methods have training error rates around 27% and test error rates around 29%, with a few exceptions. The spatial LDA classifier has particularly large training and test error rates of approximately 76%. Therefore, it appears that adding the restriction of spatial dependence to the predictors/inputs has a negative effect on prediction. One reason for this may be that the underlying model requires each covariate to have the same fitted spatial covariance function. Because of this possible explanation for the poor performance of this classifier, we also fit the model using only one covariate, but found similar error rates, indicating that adding spatial dependence in covariate space may be too restrictive an assumption for these data.

While the k-nearest neighbors classifier has the smallest error rate for the training data, the classifier based on the Bayesian spatial probit regression model (spatial probit classifier) has the smallest error rate for the test data and only a slightly larger one-at-a-time training error rate. The spatial probit classifier joint training error is similar to that observed with the other classifiers. This is due to the fact that the method does not take advantage of the observed class of neighboring locations within the training data when the joint training error is determined. The small test error for the spatial probit classifier also illustrates how conditioning on the observed class of neighboring locations within the training data can improve classification error rates relative to non-spatial classifiers.

For this training sample, we note that the posterior mean and posterior predictive methods result in identical error rates for the spatial probit classifier. To see why these error rates are the same, consider the following features of the model underlying the classifier. The posterior mean classifier is based on $P(Z' > 0 \mid Y, x', \hat{\beta}, \hat{\rho}, \hat{Z})$, while the posterior predictive classifier is based on $P(Z' > 0 \mid Y, x')$, which is obtained by marginalizing over the posterior distributions of $\beta$, $\rho$, and $Z$ rather than conditioning on the posterior means of these parameters. Consider the mean of $Z' \mid Y, x'$:

\[
E(Z' \mid Y, x') = E\left[E(Z' \mid Y, x', \beta, \rho, Z)\right]
= E\left[x'^{\top}\beta + \rho\sum_{j: s' \sim s_j}\frac{w_j}{w_+}(Z_j - x_j^{\top}\beta)\right]
\]
\[
= x'^{\top}E(\beta) + \sum_{j: s' \sim s_j}\frac{w_j}{w_+}\left[E(\rho Z_j) - x_j^{\top}E(\rho\beta)\right],
\]

where $s' \sim s_j$ indicates that $s_j$ and $s'$ are neighbors, $w_j$ is equal to one if $s_j$ and $s'$ are neighbors, and $w_+$ denotes the total number of neighbors for location $s'$. Note that if the

E(ρβ) = E(ρ)E(β) (4.25)

and

E(ρZj) = E(ρ)E(Zj) (4.26)

for all j = 1, . . . , n. We cannot determine analytically the posterior correlations between

these pairs of parameters since the posterior distributions are not available in closed form;

however, we can estimate the correlations between these parameters using the sampled

values. In our analysis,

|Cor(ρ, βk)| < 0.13

for all k = 0, 1,..., 4, and

|Cor(ρ, Zj)| < 0.17

for all j = 1, . . . , n. Although not perfectly uncorrelated, the correlations are small enough

to approximate the relationships in (4.25) and (4.26). Furthermore, for our particular anal-

ysis, the posterior distribution of ρ has a mean close to one and a very small variance, i.e.,

the 95% credible interval of ρ is (0.934, 0.999). Therefore,

\[
E(Z' \mid Y, x') \approx x'^{\top}E(\beta) + E(\rho)\sum_{j: s' \sim s_j}\frac{w_j}{w_+}\left(E(Z_j) - x_j^{\top}E(\beta)\right)
= x'^{\top}\hat{\beta} + \hat{\rho}\sum_{j: s' \sim s_j}\frac{w_j}{w_+}(\hat{Z}_j - x_j^{\top}\hat{\beta})
= E(Z' \mid Y, x', \hat{\beta}, \hat{\rho}, \hat{Z}).
\]
Thus, the estimated probabilities $P(Z' > 0 \mid Y, x', \hat{\beta}, \hat{\rho}, \hat{Z})$ and $P(Z' > 0 \mid Y, x')$ will be similar, and the binary classifications based on these probabilities are likely to be the same, as is observed for the one-at-a-time training error and the test error rates. For a

Classification Method                         Training Error      Test Error
Spatial Probit Classifier
  Posterior Mean                              0.1690              0.1667
  Posterior Predictive (One-at-a-Time/Joint)  0.1690 / 0.2708     0.1667
Bayesian Probit Classifier
  Posterior Mean                              0.2778              0.2986
  Posterior Predictive                        0.2755              0.2917
GLM-Based Maximum Likelihood Classifier
  Logistic                                    0.2824              0.3056
  Probit                                      0.2824              0.2986
Discriminant Analysis Classifier
  LDA                                         0.2778              0.2917
  DLDA                                        0.3611              0.3889
  QDA                                         0.2593              0.2986
  Spatial LDA                                 0.7639              0.7569
SVM Classifier
  Linear SVM                                  0.2847              0.3056
  Cubic SVM                                   0.2755              0.2847
  Radial SVM                                  0.2315              0.2986
k-Nearest Neighbor Classifier
  k-NN                                        0.1667              0.2431

Table 4.3: Training and test errors for the SE Asia land cover data obtained using each of the classification methods discussed in this chapter.

different data set or for different spatial dependence structures, the error rates may not be the same.

Based on our analysis, we conclude that including residual spatial dependence is important in classification of spatially-referenced data and that the classifier based on the Bayesian spatial probit regression model is well-suited for doing so. Furthermore, including spatial dependence in the predictors/inputs, as in the spatial LDA classifier, is not sufficient, and in fact, may be inappropriate.

4.5 Summary

In this chapter, we proposed a spatial classifier based on the Bayesian spatial probit regression model and compared this classifier to other well-known classification methods.

Furthermore, for the Bayesian probit regression model and the Bayesian spatial probit re- gression model, we distinguish between two classifiers, which we call the posterior mean and posterior predictive classifiers. Using a data analysis, we showed that the spatial probit classifier (both the posterior mean and posterior predictive versions) is the best classifier in terms of out-of-sample prediction for our spatially-referenced land cover data example.

CHAPTER 5

BAYESIAN SPATIAL MULTINOMIAL PROBIT REGRESSION

In Section 2.1.2, we introduced the Albert and Chib (1993) Bayesian multinomial probit regression model for multi-category response variables. For these models, the response variable $Y_i$ can take on a value in $\{1, \ldots, \ell\}$, where $\ell$ is the number of categories. In this chapter, we extend the Bayesian multinomial probit regression model and the Bayesian spatial probit regression model to a model for spatially-referenced multi-category response variables, the Bayesian spatial multinomial probit regression model. In Section 5.1, we discuss the specification of this model; specifically, we discuss considerations for the latent mean and covariance structures. In Section 5.2.1, we discuss a data augmentation model-fitting technique, as an extension of Marginal-Scheme 1 proposed in Section 3.1 and of the Imai and van Dyk (2005) model-fitting algorithm for the Bayesian multinomial probit regression model.

5.1 The Bayesian Spatial Multinomial Probit Regression Model

Just as the Bayesian spatial probit regression model is an extension of the Bayesian probit regression model, we can extend the Bayesian multinomial regression model in the same way, by allowing the continuous latent variable to have a spatial covariance structure.

Therefore, the general form of the spatially-dependent multi-category Bayesian model is
\[
Y_i = \arg\max_j\{\tilde{Z}_{ij},\ j = 1, \ldots, \ell\} \tag{5.1}
\]
and
\[
\mathrm{vec}(\tilde{Z}) \equiv \begin{pmatrix} \tilde{Z}_1 \\ \vdots \\ \tilde{Z}_n \end{pmatrix} \sim N(\tilde{\mu}, \tilde{\Sigma}), \tag{5.2}
\]
where $\tilde{Z}_i = (\tilde{Z}_{i1}, \ldots, \tilde{Z}_{i\ell})^{\top}$, $\tilde{\mu}$ is an $n\ell \times 1$ mean vector, and $\tilde{\Sigma}$ is an $n\ell \times n\ell$ spatial covariance matrix across the $\ell$ categories.

For each observation $Y_i$, rather than modeling the $\ell \times 1$ unobserved latent variables, it is common to express each latent variable $\tilde{Z}_{ij}$ with respect to $\tilde{Z}_{i\ell}$ by defining an $(\ell - 1) \times 1$ vector $\tilde{U}_i = (\tilde{U}_{i1}, \ldots, \tilde{U}_{i,\ell-1})^{\top}$, where $\tilde{U}_{ij} = \tilde{Z}_{ij} - \tilde{Z}_{i\ell}$ for $j = 1, \ldots, (\ell - 1)$ (e.g., Daganzo,

1980; McCulloch and Rossi, 1994; McCulloch et al., 2000; Imai and van Dyk, 2005). By

doing this, rather than estimating an ` × 1 latent vector for each observation, we now

estimate only an (` − 1) × 1 latent vector. As a result, all parameters, if appropriately

specified, are (likelihood) identifiable.

This change of variables can be written by defining a matrix $\boldsymbol{\Delta}$ to be an $n(\ell - 1) \times n\ell$ block-diagonal matrix, where the $i$th diagonal block is a matrix $\Delta$ such that the first $\ell - 1$ columns of $\Delta$ are an $(\ell-1) \times (\ell-1)$ identity matrix and the $\ell$th column of $\Delta$ is an $(\ell-1) \times 1$ vector of $-1$s. That is,
\[
\Delta = \begin{pmatrix} 1 & 0 & \cdots & 0 & -1 \\ 0 & 1 & \cdots & 0 & -1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{pmatrix},
\]
and $\boldsymbol{\Delta} \equiv I_n \otimes \Delta$.
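A small R sketch of this construction (with illustrative dimensions; the object names are hypothetical) is:

# Change-of-variables matrices for l categories and n locations.
l <- 4; n <- 3                                   # small illustrative dimensions
Delta.small <- cbind(diag(l - 1), rep(-1, l - 1))  # the (l-1) x l block
Delta.big   <- kronecker(diag(n), Delta.small)     # I_n kron Delta.small
# Given a draw of vec(Z.tilde) of length n*l, the identifiable latent vector is
# vec(U.tilde) <- Delta.big %*% Z.tilde.vec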

In the latent variable representation of the multinomial probit regression model, the observed $Y_i$ are related to the $\tilde{U}_i$ rather than the $\tilde{Z}_i$. To illustrate, consider $\tilde{Z}_i$, where $Y_i = j$ if $\tilde{Z}_{ij} = \max\tilde{Z}_i$. When $\tilde{Z}_{i\ell}$ is the maximum of $\tilde{Z}_i$, $\tilde{U}_{ij} = \tilde{Z}_{ij} - \tilde{Z}_{i\ell} < 0$ for all $j = 1, \ldots, (\ell-1)$. Otherwise, if $\tilde{Z}_{iJ}$ is the maximum of $\tilde{Z}_i$, where $J \in \{1, \ldots, \ell-1\}$, then $\tilde{U}_{iJ} = \tilde{Z}_{iJ} - \tilde{Z}_{i\ell} > 0$ and $\tilde{U}_{iJ} = \tilde{Z}_{iJ} - \tilde{Z}_{i\ell} > \tilde{Z}_{ij} - \tilde{Z}_{i\ell} = \tilde{U}_{ij}$ for all $j \neq J \in \{1, \ldots, \ell-1\}$. Thus, we express (5.1) and (5.2) as
\[
Y_i = \begin{cases} \ell, & \text{if } \max\tilde{U}_i < 0 \\ j, & \text{if } \max\tilde{U}_i = \tilde{U}_{ij} > 0, \end{cases} \tag{5.3}
\]
where

\[
\mathrm{vec}(\tilde{U}) = \boldsymbol{\Delta}\,\mathrm{vec}(\tilde{Z}) \sim N(\boldsymbol{\Delta}\tilde{\mu},\ \boldsymbol{\Delta}\tilde{\Sigma}\boldsymbol{\Delta}^{\top}). \tag{5.4}
\]

In the following subsections, we consider a regression framework and discuss various specifications of the latent variable mean and covariance and the interpretation of the asso- ciated model parameters.

5.1.1 Latent Mean Specification

As discussed in Section 2.2, the parameters of the binary Bayesian probit regression model are only identifiable up to a multiplicative constant. This is also the case in the multi-category setting. However, without loss of generality, in the following discussion concerning the latent mean structure, we express the model only in terms of the identifiable parameters Z, U, and µ. More discussion on identifiability in the latent variable represen- tation of the Bayesian spatial multinomial probit model will be provided in Section 5.2.1.

Analogous to the binary response setting, a multinomial analysis seeks to determine the relationship between the response variable and covariate information. To determine such a relationship, we consider three types of potential covariate information:

A. Category-specific covariates

B. Location-specific covariates

C. Location-category-specific covariates.

To help motivate these different types of covariate information, consider a land cover

example. Category-specific covariates are information that may vary across categories, but

are constant across locations. For example, suppose that across our study region, each land

cover type has an associated tax. We might expect that land cover categories with higher

taxes may be less likely to be observed. Location-specific covariates contain information

that varies across locations, but for a specific location, is constant across categories. Eleva-

tion is an example of a location-specific covariate because it changes for each location, but

at each location it is the same no matter the category. Finally, location-category-specific

covariates are information that varies across both locations and categories. One example of

a location-category-specific covariate is the value, or price, of each land cover for a specific

parcel of land. At each location, or at smaller subregions within our study area, the price

of each land cover type is known and changes across locations or subregions. We would

also expect this type of information to be related to the land cover response variable. For

the multinomial probit regression model, including each of the three types of information

in an analysis requires different specifications of the latent mean structure. In this section,

we consider how each of the three types of covariate information can be accommodated in

the latent mean specification within a regression framework, i.e., µ = Xβ.

Suppose we have k(A) covariates representing category-specific information, k(B) covariates representing location-specific information, and k(C) covariates representing location-category-specific information. Let X = [X(A), X(B), X(C)] where X(A) des- ignates the covariates capturing category-specific information, X(B) designates the covari- ates capturing location-specific information, and X(C) designates the covariates capturing location-category-specific information. A category-specific intercept can be considered a

special case of a category-specific covariate, and is included in $X^{(A)}$, as discussed in Section A below. Similarly, let $\beta = (\beta^{(A)\top}, \beta^{(B)\top}, \beta^{(C)\top})^{\top}$, where $\beta^{(A)}$, $\beta^{(B)}$, and $\beta^{(C)}$ are the collections of coefficients corresponding to $X^{(A)}$, $X^{(B)}$, and $X^{(C)}$, respectively. The exact dimension of each of these covariate and coefficient matrices will depend on the model specification and will be discussed further in the following subsections.

In the following subsections, we discuss two ways to include each type of information in the model. The first is to assume that the relationship between each covariate and each of the categories is the same for all categories. This means that each covariate will have a single regression coefficient for all categories. The second is to assume that the relationship between each covariate and each category may be different. In this case, for each covariate we fit a separate regression coefficient for each category. In discussing these two cases, we only consider continuous covariates, but the methods could be extended for discrete or categorical covariate information. For each type of information, we consider µ = Xβ and the impact the specification of the design matrix has on ∆µ = ∆Xβ.

A. Category-Specific Information

Let $x_m^{(A)} = (x_{m1}^{(A)}, \ldots, x_{m\ell}^{(A)})^{\top}$, where $x_{mj}^{(A)}$ is the $m$th category-specific covariate for the $j$th category, for $m = 1, \ldots, k^{(A)}$. For the moment, we assume the mean only includes category-specific information, i.e., $\mu = X^{(A)}\beta^{(A)}$. We also include the intercept in the model in the standard way by including a column of ones in $X^{(A)}$. Note that the meaning of $x_m^{(A)}$ remains constant for each case discussed below. However, we redefine $X^{(A)}$ based on $x_m^{(A)}$, for $m = 1, \ldots, k^{(A)}$, to distinguish between the cases.

Case 1a: The regression coefficients and intercept are identical across categories.

Let
\[
X^{(A)} \equiv X^{(A1a)} = 1_n \otimes \left[\,1_\ell \ \ x_1^{(A)} \ \cdots \ x_{k^{(A)}}^{(A)}\,\right],
\]
so that $X^{(A1a)}$ is an $n\ell \times (k^{(A)} + 1)$ matrix of covariates, and let
\[
\beta^{(A)} \equiv \beta^{(A1a)} = (\beta_0^{(A1a)}, \beta_1^{(A1a)}, \ldots, \beta_{k^{(A)}}^{(A1a)})^{\top},
\]
so that $\beta^{(A1a)}$ is a $(k^{(A)} + 1) \times 1$ vector of coefficients.

Consider $\mu = X^{(A)}\beta^{(A)}$:
\[
\mu = X^{(A1a)}\beta^{(A1a)} = 1_n \otimes \left[\,\beta_0^{(A1a)}1_\ell + \beta_1^{(A1a)}x_1^{(A)} + \cdots + \beta_{k^{(A)}}^{(A1a)}x_{k^{(A)}}^{(A)}\,\right].
\]
It follows that the mean for the latent $Z_{ij}$ is
\[
\mu_{ij} = \beta_0^{(A1a)} + \beta_1^{(A1a)}x_{1j}^{(A)} + \cdots + \beta_{k^{(A)}}^{(A1a)}x_{k^{(A)}j}^{(A)},
\]
for $i = 1, \ldots, n$ and $j = 1, \ldots, \ell$. In this case, $\beta_m^{(A1a)}$ captures the increase in the mean of the latent variable $Z_{ij}$ for a one unit increase in the $m$th covariate $x_{mj}^{(A)}$, for $m = 1, \ldots, k^{(A)}$; $\beta_0^{(A1a)}$ is the intercept. These parameters are constant across all categories. Furthermore, because the category-specific covariates are constant across locations, $\mu_{ij} = \mu_{i^*j}$ for all $i, i^* = 1, \ldots, n$.

Now consider the mean of the latent variable $U$, $\boldsymbol{\Delta}\mu = \boldsymbol{\Delta}X^{(A)}\beta^{(A)}$:
\[
\boldsymbol{\Delta}\mu = \boldsymbol{\Delta}X^{(A1a)}\beta^{(A1a)} = (I_n \otimes \Delta)\left(1_n \otimes \left[\,1_\ell \ \ x_1^{(A)} \ \cdots \ x_{k^{(A)}}^{(A)}\,\right]\right)\beta^{(A1a)}
= 1_n \otimes \left[\,\beta_0^{(A1a)}\Delta 1_\ell + \beta_1^{(A1a)}\Delta x_1^{(A)} + \cdots + \beta_{k^{(A)}}^{(A1a)}\Delta x_{k^{(A)}}^{(A)}\,\right].
\]
In this case, each element of $\Delta 1_\ell$ equals $1 - 1 = 0$, so that $\beta_0^{(A1a)}\Delta 1_\ell = 0_{\ell-1}$, where $0_{\ell-1}$ is an $(\ell - 1) \times 1$ vector of zeros, and
\[
\Delta x_m^{(A)} = \begin{pmatrix} x_{m1}^{(A)} - x_{m\ell}^{(A)} \\ \vdots \\ x_{m(\ell-1)}^{(A)} - x_{m\ell}^{(A)} \end{pmatrix}, \tag{5.5}
\]
so that
\[
\beta_m^{(A1a)}\Delta x_m^{(A)} = \begin{pmatrix} \beta_m^{(A1a)}(x_{m1}^{(A)} - x_{m\ell}^{(A)}) \\ \vdots \\ \beta_m^{(A1a)}(x_{m(\ell-1)}^{(A)} - x_{m\ell}^{(A)}) \end{pmatrix}, \tag{5.6}
\]
for $m = 1, \ldots, k^{(A)}$. Thus, the mean for the latent $U_{ij}$ is
\[
(\boldsymbol{\Delta}\mu)_{ij} = \beta_1^{(A1a)}(x_{1j}^{(A)} - x_{1\ell}^{(A)}) + \cdots + \beta_{k^{(A)}}^{(A1a)}(x_{k^{(A)}j}^{(A)} - x_{k^{(A)}\ell}^{(A)}),
\]
where $(\boldsymbol{\Delta}\mu)_{ij}$ is the $ij$th element of the vector $\boldsymbol{\Delta}\mu$, for $i = 1, \ldots, n$ and $j = 1, \ldots, (\ell - 1)$.

We can interpret $\beta_m^{(A1a)}$ in terms of comparison to the $\ell$th category, so that $\beta_m^{(A1a)}$ is the increase in the mean of the latent variable $U_{ij}$ for a one unit increase in the difference between $x_{mj}^{(A)}$ and $x_{m\ell}^{(A)}$. This value is constant across all categories. Just as $\mu_{ij} = \mu_{i^*j}$, we have $(\boldsymbol{\Delta}\mu)_{ij} = (\boldsymbol{\Delta}\mu)_{i^*j}$ for all $i, i^* = 1, \ldots, n$. Finally, note that since the intercept is identical across categories, there is effectively no intercept when we compare the $j$th category to the $\ell$th category. For this reason, in Case 1b, we consider allowing the intercept to vary across categories.

Case 1b: The regression coefficients are identical across categories, but each category has

a unique intercept.

Let h i X(A) ≡ X(A1b) = 1 ⊗ (A) (A) n In x1 ··· xk(A)

so that X(A1b) is an n` × (k(A) + `) matrix of covariates and let

(A) (A1b) (A1b)0 (A1b) (A1b) 0 β ≡ β = (β0 , β1 , . . . , βk(A) ) ,

(A1b) (A1b) (A1b) 0 (A1b) (A) where β0 = (β01 , . . . , β0` ) . β is thus a (k + `) × 1 vector of coeffi- cients in this case.

109 Again consider µ:

µ = X(A1b)β(A1b)  (A1b) β0  h i β(A1b) = 1 ⊗ I x(A) ··· x(A)  1  n ` 1 k(A)  .   .  (A1b) βk(A) h i = 1 ⊗ (A1b) (A1b) (A) (A1b) (A) . n I`β0 + β1 x1 + ··· + βk(A) xk(A)

It follows that the mean of the latent Zij is

(A1b) (A1b) (A) (A1b) (A) µij = β0j + β1 x1j + ··· + βk(A) xk(A)j

(A1b) for i = 1, . . . , n and j = 1, . . . , `. As in Case 1a, the covariate coefficients, βm for m = 1, . . . , k(A), are constant across categories and represent the increase in

th (A) the mean of the latent variable Zij for a one unit increase in the m covariate xmj , for m = 1, . . . , k(A). This specification, however, allows for a category-specific

(A1b) intercept, so that β0j is the intercept for category j. The category means are still

∗ the same across locations, i.e., µij = µi∗j for all i, i = 1, . . . , n.

Now consider ∆µ:

∆µ = ∆X(A1b)β(A1b)  (A1b) β0  h i β(A1b) = ∆ 1 ⊗ I x(A) ··· x(A)  1  n ` 1 k(A)  .   .  (A1b) βk(A)  (A1b) β0  h i β(A1b) = (I ⊗ ∆) 1 ⊗ I x(A) ··· x(A)  1  n n ` 1 k(A)  .   .  (A1b) βk(A) h i = 1 ⊗ (A1b) (A1b) (A) (A1b) (A) . n ∆I`β0 + β1 ∆x1 + ··· + βk(A) ∆xk(A)

110 In this case, 1 0 ··· 0 −1 0 1 ··· 0 −1   ∆I` = ......  (5.7) . . . . .  0 0 ··· 1 −1 so that

 (A1b) (A1b)  β01 − β0`  (A1b) (A1b)  (A1b)  β02 − β0`  ∆I`β0 =  .  (5.8)  .   (A1b) (A1b) β0(`−1) − β0`

(A) (A1b) (A) (A1b) (A1a) and ∆xm and βm ∆xm are the same as (5.5) and (5.6) with βm = βm for

(A) m = 1, . . . , k . Thus, the mean for the latent Uij is

(A1b) (A1b) (A1b) (A) (A) (A1b) (A) (A) (∆µ)ij = β0j − β0` + β1 (x1j − x1` ) + ··· + βk(A) (xk(A)j − xk(A)`)

for i = 1, . . . , n and j = 1,..., (` − 1).

The intercept component of the mean ∆µ is the difference between the jth and `th

(A1b) (A1b) category intercepts, or β0j − β0` , for j = 1,..., (` − 1). As in Case 1a, the

(A1b) regression coefficients β represent the increase in the mean of Uij due to a one

(A) (A) unit increase in the difference between xmj and xm` . Also, we again have that the

∗ category means are constant across locations, i.e., (∆µ)ij = (∆µ)i∗j for all i, i =

1, . . . , n.

Case 2: Each covariate has a unique regression coefficient for each category.

The two previous cases require that the increase in the latent variable mean for the

jth category is the same for all categories. Suppose, however, that we want to allow

for distinct coefficients across categories. In the land cover example, this alternative

assumption may be appropriate when including land taxes in the model. The effect

111 of a tax increase may differ across land cover types. For example, a one percent increase in the tax for a residential parcel of land may not be as important as a similar tax increase imposed on a commercial parcel of land. Here we discuss how category- specific covariates can differ according to outcome category.

In this case, let  (A)  xm1 0 ··· 0  0 x(A) ··· 0  D(A2) = diag(x(A)) =  m2  m m  . . .. .   . . . .  (A) 0 0 ··· xm` and let h i X(A) ≡ X(A2) = 1 ⊗ (A2) (A2) , n I` D1 ··· Dk(A) so X(A2) is a n` × (k(1) + 1)` matrix of covariates. Furthermore, let

(A) (A2) (A2)0 (A2)0 (A2)0 0 β ≡ β = (β0 , β1 ,..., βk(A) ) ,

(A2) (A2) (A2) 0 (A) (A2) where βm = (βm1 , . . . , βm` ) for m = 0, 1, . . . , k so that β is a (k(A) + 1)` × 1 vector of coefficients.

Consider µ:

µ = X(A2)β(A2)  (A2) β0  h i β(A2) = 1 ⊗ I D(A2) ··· D(A2)  1  n ` 1 k(A)  .   .  (A2) βk(A) h i = 1 ⊗ (A2) (A2) (A2) (A2) (A2) . n I`β0 + D1 β1 + ··· + Dk(A) βk(A)

This implies that the mean of the latent Zij is

(A2) (A2) (A) (A2) (A) µij = β0j + β1j x1j + ··· + βk(A)jxk(A)j, 112 (A2) for i = 1, . . . , n and j = 1, . . . , `. This time, βmj represents the increase in the

th mean of the latent variable Zij due to a one unit increase in the m covariate for the

th (A) j category, xmj . Note that the regression coefficients differ across category.

Now consider ∆µ:

∆µ = ∆X(A2)β(A2)  (A2) β0  h i β(A2) = ∆ 1 ⊗ I D(A2) ··· D(A2)  1  n ` 1 k(A)  .   .  (A2) βk(A)  (A2) β0  h i β(A2) = (I ⊗ ∆) 1 ⊗ I D(A2) ··· D(A2)  1  n n ` 1 k(A)  .   .  (A2) βk(A) h i = 1 ⊗ (A2) (A2) (A2) (A2) (A2) , n ∆I`β0 + ∆D1 β1 + ··· + ∆Dk(A) βk(A)

(A2) (A2) (A1b) where ∆I`β0 is identical to the expression given in (5.8) with β0 = β0 and

 (A) (A) xm1 0 ··· 0 −xm`  0 x(A) ··· 0 −x(A) (A2)  m2 m`  ∆Dm =  . . . . .   ......   (A) (A) 0 0 ··· xm(`−1) −xm` so that

 (A2) (A) (A2) (A)  βm1 xm1 − βm` xm`  β(A2)x(A) − β(A2)x(A)  (A2) (A2)  m2 m2 m` m`  ∆Dm βm =  .  .  .   (A2) (A) (A2) (A) βm(`−1)xm(`−1) − βm` xm`

Thus, the mean for the latent Uij is

(A2) (A2) (A2) (A) (A2) (A) (A2) (A) (A2) (A) (∆µ)ij = (β0j − β0` ) + (β1j x1j − β1` x1` ) + ··· + (βk(A)jxk(A)j − βj(A)`xk(A)`).

113 (A2) Here, the mean of the latent variable Uij increases by βmj for a one unit increase

(A) (A2) (A) in xmj , but decreases by βm` for a one unit increase in xm` . As desired, this case allows categories to have distinct regression coefficients. However, many of the pa- rameters will not be estimable due to collinearity as we discuss below.

th First, we discuss collinearity within each covariate. The ` column of both ∆I` and

(A2) ∆Dm can be written as linear combinations of the first ` − 1 columns. To see this

th for ∆I`, we first define the notation (M)·j to be the j column of the matrix M.

th Using this notation, (∆I`)·j is the j column of ∆I` for j = 1, . . . , `, and we have that `−1 X (∆I`)·` = −1 (∆I`)·j. j=1

Obviously, this relationship will hold for all i locations, and the columns of 1n ⊗

th ∆I` will also be collinear. Thus, it is not appropriate to fit an ` category-specific

(A2) intercept in a multinomial probit regression model. Rather than fitting both β0j and

(A2) (A2) (A2) β0` , we instead fit the difference (β0j − β0` ). [Does this satsify, ”so what do we do?”]

(A2) (A2) th For ∆Dm , a similar relationship holds. Taking (∆Dm )·j to be the j column of

(A2) ∆Dm for j = 1, . . . , `, we have that

`−1 (A) (A2) X xm` (A2) (∆Dm )·` = − (A) (∆Dm )·j. j=1 xmj Because the category-specific covariates are constant across locations by definition, this relationship will hold across all locations i = 1, . . . , n, and the columns of 1n ⊗

(A2) ∆Dm will also be collinear.

We now discuss the collinearity between the covariates and intercept component of

(A2) (A) the design matrix. Notice that the columns of ∆I` and ∆Dm for m = 1, . . . , k

114 are collinear:

1 (A2) 1 (A2) (∆I`)·j = (A) (∆D1 )·j = ··· = (A) (∆Dk(A) )·j . x1j xk(A)j Because the covariates are constant across locations, the columns making up the

design matrix will also be collinear. Thus, in Case 2 where regression coefficients

vary across categories, only the coefficients corresponding to one covariate or the

intercept can be identified. If we wish to include further category-specific covariates,

we must include them as in Case 1b.

In Case 1a and 1b, we saw that it is necessary to fit a category-specific intercept. In Case

2 we saw, however, that category-specific coefficients and intercept are not identifiable.

Thus, it is necessary to express X(A) using Case 1b, noting that the intercept could be

replaced by a single category-specific covariate.

Thus,  (A) (A)  1 0 ··· 0 x11 ··· xk(A)1 0 1 ··· 0 x(A) ··· x(A)   12 k(A)2  (A) ......  X = 1n ⊗ ......    0 0 ··· 1 x(A) ··· x(A)   1(`−1) k(A)(`−1) (A) (A) 0 0 ··· 0 x1` ··· xk(A)` and

(A) (A)0 (A) (A) 0 β = (β0 , β1 , . . . , βk(A) )

(A) (A) (A) 0 where β0 = (β01 , . . . , β0(`−1)) .

B. Location-Specific Information

(B) (B) (B) 0 (B) th Let xm = (xm1 , . . . , xmn ) where xmi is the m location-specific covariate for the ith location, for m = 1, . . . , k(B). In this discussion, we assume the mean of the latent Z only includes location-specific information, i.e., µ = X(B)β(B). Note that the meaning of

115 (B) (B) xm remains constant for each case discussed below. However, we redefine X based on

(B) (B) xm for m = 1, . . . , k , to distinguish between the cases.

Case 1: The regression coefficients are identical across categories.

Let h i X(B) ≡ X(B1) = (B) (B) ⊗ 1 x1 ··· xk(B) `

so that X(B1) is a n` × k(B) matrix of location-specific covariates, and let

(B) (B1) (B1) (B1) 0 β ≡ β = (β1 , . . . , βk(B) )

so that β(B1) is a k(B) × 1 vector of coefficients.

Now consider µ:

 (B1) β1 (B1) (B1) h (B) (B) i  . µ = X β = x ... x ⊗ 1  .  1 k(B) `  .  (B1) βk(B) h i = (B1) (B) (B1) (B) β1 (x1 ⊗ 1`) + ··· + βk(B) (xk(B) ⊗ 1`)

so that the mean of the latent Zij is

(B1) (B) (B1) (B) µij = β1 x1i + ··· + βk(B) xk(B)i,

(B1) for i = 1, . . . , n and j = 1, . . . , `. Here, βm represents the increase in the mean of

(B) the latent variable Zij corresponding to a one unit increase in xmi .

116 Now consider ∆µ:

 (B1) β1 (B1) (B1) h (B) (B) i  . ∆µ = ∆X β = ∆ x ··· x ⊗ 1  .  1 k(B) `  .  (B1) βk(B)  (B1) β1 h (B) (B) i . = (I ⊗ ∆) x ⊗ 1 ··· x ⊗ 1  .  n 1 ` k(B) `  .  (B1) βk(B) h i = (B1) (B) (B1) (B) β1 (x1 ⊗ ∆1`) + ··· + βk(B) (xk(B) ⊗ ∆1`)  (B1) (B) (B1) (B)  β1 x11 ∆1` + ··· + βk(B) xk(B)1∆1`  (B1) (B) (B1) (B)  β1 x12 ∆1` + ··· + βk(B) xk(B)2∆1` =  .  ,  .   (B1) (B) (B1) (B)  β1 x1n ∆1` + ··· + βk(B) xk(B)n∆1` where    (B) (B) (1 − 1) xmi − xmi x(B)∆1 = x(B)  .  =  .  = 0 mi ` mi  .   .  `−1 (1 − 1) (B) (B) xmi − xmi and

(B1) (B) β1 xmi ∆1` = 0`−1,

(B) for i = 1, . . . , n and m = 1, . . . , k . As a result, the mean of the latent Uij is

(∆µ)ij = 0,

for i = 1, . . . , n and j = 1, . . . , `.

Thus, when including location-specific covariates in a multinomial probit regression

model, regression coefficients that are constant across categories are not identifiable.

Therefore, for location-specific covariates, we must specify a unique regression co-

efficient for each category, as in Case 2.

Case 2: Each covariate has a unique regression coefficient for each category.

117 Let h i X(B) ≡ X(B2) = (B) (B) ⊗ I x1 ··· xk(B) ` so that X(B2) is a n` × k(B)` matrix of location-specific covariates. Let

(B) (B2) (B2)0 (B2)0 0 β ≡ β = (β1 ,..., βk(B) ) ,

(B2) (B2) (B2) 0 (B2) (B) where βm = (βm1 , . . . , βm` ) so that β is a k ` × 1 vector of coefficients.

Consider µ:

 (B2) β1 (B2) (B2) h (B) (B) i  . µ = X β = x ··· x ⊗ I  .  1 k(B) `  .  (B2) βk(B) h i = (B) (B2) (B) (B2) (x1 ⊗ I`)β1 + ··· + (xk(B) ⊗ I`)βk(B)

so that the mean for the latent Zij is

(B2) (B) (B2) (B) µij = β1j x1i + ··· + βk(B)jxk(B)i,

(B2) for i = 1, . . . , n and j = 1, . . . , `. Here, βmj represents the increase in the mean of

(B) (B) the latent variable Zij for a one unit increase in xmi , for m = 1, . . . , k , and each

(B2) βmj is allowed to be different across the ` categories.

118 Now consider ∆µ:

 (B2) β1 (B2) (B2) h (B) (B) i  . ∆µ = ∆X β = ∆ x ··· x ⊗ I  .  1 k(B) `  .  (B2) βk(B)  (B2) β1 h (B) (B) i . = (I ⊗ ∆) x ⊗ I ··· x ⊗ I  .  n 1 ` k(B) `  .  (B2) βk(B) h i = (B) (B2) (B) (B2) (x1 ⊗ ∆)β1 + ··· + (xk(B) ⊗ ∆)βk(B)  (B) (B2) (B) (B2) x11 ∆β1 + ··· + xk(B)1∆βk(B)  (B) (B2) (B) (B2) x12 ∆β1 + ··· + xk(B)2∆βk(B)  =  .   .   (B) (B2) (B) (B2) x1n ∆β1 + ··· + xk(B)n∆βk(B) where  (B) (B) xmi 0 ··· 0 −xmi  0 x(B) ··· 0 −x(B) x(B)∆ =  mi mi  mi  ......   . . . . .  (B) (B) 0 0 ··· xmi −xmi and

 (B2)  β x(B) 0 ··· 0 −x(B) m1 mi mi  β(B2)   0 x(B) ··· 0 −x(B)  m2  (B) (B2)  mi mi   .  xmi ∆βm = . . . . .  .   ......      β(B2)  (B) (B)  m(`−1) 0 0 ··· xmi −xmi (B2) βm`  (B2) (B) (B2) (B)  βm1 xmi − βm` xmi  (B2) (B) (B2) (B)   βm2 xmi − βm` xmi  =  .   .   (B2) (B) (B2) (B) βm(`−1)xmi − βm` xmi  (B2) (B2) (B)  (βm1 − βm` )xmi  (B2) (B2) (B)   (βm2 − βm` )xmi  =  .  .  .   (B2) (B2) (B) (βm(`−1) − βm` )xmi

119 Thus, the mean of the latent Uij is

(B2) (B2) (B) (B2) (B2) (B) (∆µ)ij = (β1j − β1` )x1i + ··· + (βk(B)j − βk(B)`)xk(B)i

for i = 1, . . . , n and j = 1, . . . , `.

(B2) (B2) Here, βmj − βm` represents the increase in the mean of the latent variable Uij due

(B) to a one unit increase in xmi . Note that here we have a similar collinearity problem

th (B) to that seen in Case 2 for the category-specific covariates: the ` column of xmi ∆

(B2) can be written as a linear combination of the first ` − 1 columns. Therefore, βm` will not be estimable.

In Case 1, we saw that we cannot estimate regression coefficients corresponding to location-specific covariates when the regression coefficients are assumed to be constant across categories. We can, however, estimate $\ell - 1$ distinct coefficients. Thus, when specifying the design matrix for location-specific covariates, let
\[
X^{(B)} = \left[\,x_1^{(B)} \ \cdots \ x_{k^{(B)}}^{(B)}\,\right] \otimes \begin{pmatrix} I_{\ell-1} \\ 0_{\ell-1}^{\top} \end{pmatrix}
\]
and
\[
\beta^{(B)} = (\beta_1^{(B)\top}, \ldots, \beta_{k^{(B)}}^{(B)\top})^{\top},
\]
where $\beta_m^{(B)} = (\beta_{m1}^{(B)}, \ldots, \beta_{m(\ell-1)}^{(B)})^{\top}$.

(C) (C) (C) 0 Let xmi = (xmi1, . . . , xmi`) be an ` × 1 vector of covariates for each location i = 1, . . . , n, for m = 1, . . . , k(C) covariates. In this discussion, we assume the mean only in- cludes location-category-specific information, i.e., µ = X(C)β(C). Note that the meaning

(C) (C) of xmi remains constant for each case discussed below. However, we redefine X based

(C) (C) on xmi for m = 1, . . . , k and i = 1, . . . , n, to distinguish between the cases.

120 Case 1: The regression coefficients are identical across categories.

Let  (C) (C)  x11 ··· xk(C)1  (C) (C)  (C) (C1) x12 ··· xk(C)2 X ≡ X =  . . .   . . .   (C) (C)  x1n ··· xk(C)n be an n` × k(C) matrix of covariates, and let

(C) (C1) (C1) (C1) 0 β ≡ β = (β1 , . . . , βk(C) )

be a k(C) × 1 vector of coefficients.

Consider µ:

µ = X(C1)β(C1)  (C) (C)  x11 ··· xk(C)1  (C1)  (C) (C)  β1 x12 ··· x (C)  . =  k 2  .   . . .     . . .  (C1) (C) (C) βk(C) x1n ··· xk(C)n  (C1) (C) (C1) (C)  β1 x11 + ··· + βk(C) xk(C)1  (C1) (C) (C1) (C)  β1 x12 + ··· + βk(C) xk(C)2 =  .   .   (C1) (C) (C1) (C)  β1 x1n + ··· + βk(C) xk(C)n

so that the mean of the latent Zij is

(C1) (C) (C1) (C) µij = β1 x1ij + ··· + β1 xk(C)ij,

(C1) for i = 1, . . . , n and j = 1, . . . , `. Here, βm represents the increase in the mean of

th the latent variable Zij due to a one unit increase in the m location-category-specific

(C) covariate xmij.

121 Now consider ∆µ:

∆µ = ∆X(C1)β(C1)  (C) (C)  x11 ··· xk(C)1  (C1)  (C) (C)  β1 x12 ··· x (C)  . = ∆  k 2  .   . . .     . . .  (C1) (C) (C) βk(C) x1n ··· xk(C)n  (C) (C)  x11 ··· xk(C)1  (C1)  (C) (C)  β1 x12 ··· xk(C)2 . = (In ⊗ ∆)    .   . . .     . . .  (C1) (C) (C) βk(C) x1n ··· xk(C)n  (C1) (C) (C1) (C)  β1 ∆x11 + ··· + βk(C) ∆xk(C)1  (C1) (C) (C1) (C)  β1 ∆x12 + ··· + βk(C) ∆xk(C)2 =  .   .   (C1) (C) (C1) (C)  β1 ∆x1n + ··· + βk(C) ∆xk(C)n where  (C) (C)  xmi1 − xmi` (C) . ∆x =  .  mi  .  (C) (C) xmi(`−1) − xmi` and  (C1) (C) (C)  βm (xmi1 − xmi`) (C) . β(C1)∆x =  .  , m mi  .  (C1) (C) (C) βm (xmi(`−1) − xmi`) (C) for i = 1, . . . , n and m = 1, . . . , k . It follows that the mean of the latent Uij is

(C1) (C1) (∆µ)ij = β1 (x1ij − x1i`) + ··· + βk(C) (xk(C)ij − xk(C)i`), for i = 1, . . . , n and j = 1,..., (` − 1).

(C1) Here, βm represents the increase in the mean of the latent variable Uij for a one

(C) (C) (C1) unit increase in the difference between xmij and xmi`. For each covariate, βm is constant across the categories.

122 Case 2: Each covariate has a unique regression coefficient for each category.

Define  (C)  xmi1 0 ··· 0  0 x(C) ··· 0  D(C2) = diag(x(C)) =  mi2  , mi mi  . . .. .   . . . .  (C) 0 0 ··· xmi` and let  (C2) (C2)  D11 ··· Dk(C)1 (C) (C2)  . . .  X = X =  . . .  (C2) (C2) D1n ··· Dk(C)n be an n` × k(C)` matrix of covariates. Let

(C) (C2)0 (C2)0 0 β = (β1 ,..., βk(C) )

(3) (C2) (C2) (C2) 0 be a k ` × 1 vector of coefficients where βm = (βm1 , . . . , βm` ) .

Consider µ:

µ = X(C2)β(C2)  (C2) (C2)   (C2) D11 ··· Dk(C)1 β1  . . .   .  =  . . .   .  (C2) (C2) β(C2) D1n ··· Dk(C)n k(C)  (C2) (C2) (C2) (C2) D11 β1 + ··· + Dk(C)1βk(C)  .  =  .  , (C2) (C2) (C2) (C2) D1n β1 + ··· + Dk(C)nβk(C)

so that the mean of the latent Zij is

(C2) (C) (C2) (C) µij = β1j x1ij + ··· + βk(C)jxk(C)ij,

(C2) for i = 1, . . . , n and j = 1, . . . , `. Here, βmj represents the increase in the mean of

th th the j category latent variable Zij due to a one unit increase in the m covariate.

Now consider $\Delta\boldsymbol{\mu}$:
\[
\Delta\boldsymbol{\mu} = \boldsymbol{\Delta}\mathbf{X}^{(C2)}\boldsymbol{\beta}^{(C2)}
= (\mathbf{I}_n \otimes \boldsymbol{\Delta})
\begin{pmatrix}
\mathbf{D}^{(C2)}_{11} & \cdots & \mathbf{D}^{(C2)}_{k^{(C)}1} \\
\vdots & & \vdots \\
\mathbf{D}^{(C2)}_{1n} & \cdots & \mathbf{D}^{(C2)}_{k^{(C)}n}
\end{pmatrix}
\begin{pmatrix} \boldsymbol{\beta}^{(C2)}_{1} \\ \vdots \\ \boldsymbol{\beta}^{(C2)}_{k^{(C)}} \end{pmatrix}
= \begin{pmatrix}
\boldsymbol{\Delta}\mathbf{D}^{(C2)}_{11}\boldsymbol{\beta}^{(C2)}_{1} + \cdots + \boldsymbol{\Delta}\mathbf{D}^{(C2)}_{k^{(C)}1}\boldsymbol{\beta}^{(C2)}_{k^{(C)}} \\
\vdots \\
\boldsymbol{\Delta}\mathbf{D}^{(C2)}_{1n}\boldsymbol{\beta}^{(C2)}_{1} + \cdots + \boldsymbol{\Delta}\mathbf{D}^{(C2)}_{k^{(C)}n}\boldsymbol{\beta}^{(C2)}_{k^{(C)}}
\end{pmatrix},
\]
where
\[
\boldsymbol{\Delta}\mathbf{D}^{(C2)}_{mi} =
\begin{pmatrix}
x^{(C)}_{mi1} & 0 & \cdots & 0 & -x^{(C)}_{mi\ell} \\
0 & x^{(C)}_{mi2} & \cdots & 0 & -x^{(C)}_{mi\ell} \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & x^{(C)}_{mi(\ell-1)} & -x^{(C)}_{mi\ell}
\end{pmatrix}
\]
and
\[
\boldsymbol{\Delta}\mathbf{D}^{(C2)}_{mi}\boldsymbol{\beta}^{(C2)}_{m} =
\begin{pmatrix}
\beta^{(C2)}_{m1}x^{(C)}_{mi1} - \beta^{(C2)}_{m\ell}x^{(C)}_{mi\ell} \\
\beta^{(C2)}_{m2}x^{(C)}_{mi2} - \beta^{(C2)}_{m\ell}x^{(C)}_{mi\ell} \\
\vdots \\
\beta^{(C2)}_{m(\ell-1)}x^{(C)}_{mi(\ell-1)} - \beta^{(C2)}_{m\ell}x^{(C)}_{mi\ell}
\end{pmatrix},
\]
for $i = 1, \ldots, n$ and $m = 1, \ldots, k^{(C)}$. It follows that the mean of the latent $U_{ij}$ is
\[
\begin{aligned}
(\Delta\boldsymbol{\mu})_{ij} &= \beta^{(C2)}_{1j}x^{(C)}_{1ij} - \beta^{(C2)}_{1\ell}x^{(C)}_{1i\ell} + \cdots + \beta^{(C2)}_{k^{(C)}j}x^{(C)}_{k^{(C)}ij} - \beta^{(C2)}_{k^{(C)}\ell}x^{(C)}_{k^{(C)}i\ell} \\
&= \beta^{(C2)}_{1\ell}\left(x^{(C)}_{1ij} - x^{(C)}_{1i\ell}\right) + \left(\beta^{(C2)}_{1j} - \beta^{(C2)}_{1\ell}\right)x^{(C)}_{1ij} + \cdots \\
&\qquad + \beta^{(C2)}_{k^{(C)}\ell}\left(x^{(C)}_{k^{(C)}ij} - x^{(C)}_{k^{(C)}i\ell}\right) + \left(\beta^{(C2)}_{k^{(C)}j} - \beta^{(C2)}_{k^{(C)}\ell}\right)x^{(C)}_{k^{(C)}ij}.
\end{aligned}
\]
Notice that Case 1 can be written as a special case of Case 2, where $\beta^{(C1)}_{m} = \beta^{(C2)}_{m\ell}$ and $(\beta^{(C2)}_{mj} - \beta^{(C2)}_{m\ell}) = 0$, for all $j = 1, \ldots, (\ell-1)$.
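The following minimal R sketch, again with hypothetical toy dimensions and values, assembles the Case 2 design matrix from the diagonal blocks $\mathbf{D}^{(C2)}_{mi}$ and confirms numerically that setting each covariate's $\ell$ coefficients equal reproduces the Case 1 mean.

```r
## Minimal R sketch (hypothetical toy dimensions) of the Case 2 design:
## each covariate gets l category-specific coefficients via diag(x_mi) blocks.
## Forcing the l coefficients of a covariate to be equal recovers Case 1.
set.seed(2)
n <- 3; l <- 3; kC <- 2
x <- array(rnorm(kC * n * l), dim = c(kC, n, l))   # x[m, i, j] = x^(C)_{mij}

## Assemble the (n*l) x (kC*l) matrix X^(C2) from the diag(x_mi) blocks
XC2 <- do.call(rbind, lapply(1:n, function(i)
  do.call(cbind, lapply(1:kC, function(m) diag(x[m, i, ])))))

betaC2 <- c(0.5, 0.5, 0.5, -1.2, -1.2, -1.2)       # beta_mj constant in j for each m
muC2 <- XC2 %*% betaC2

## Case 1 version of the same mean
XC1 <- do.call(rbind, lapply(1:n, function(i) t(x[, i, ])))  # (n*l) x kC, rows (i, j)
muC1 <- XC1 %*% c(0.5, -1.2)
all.equal(c(muC1), c(muC2))                        # TRUE: Case 1 is a special case of Case 2
```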

For location-category-specific information, depending on the assumption made about the regression coefficients, either Case 1 or Case 2 may be appropriate. In Case 2, all $k^{(C)}\ell$ coefficient parameters are estimable. Thus, we can take
\[
\mathbf{X}^{(C)} \equiv \mathbf{X}^{(C1)} \text{ and } \boldsymbol{\beta}^{(C)} \equiv \boldsymbol{\beta}^{(C1)}
\qquad\text{or}\qquad
\mathbf{X}^{(C)} \equiv \mathbf{X}^{(C2)} \text{ and } \boldsymbol{\beta}^{(C)} \equiv \boldsymbol{\beta}^{(C2)},
\]
or even a combination of both, depending on the specification that makes the most sense for each location-category-specific covariate.

5.1.2 Parameterization of the Space-Category Covariance Matrix

In (5.2), we expressed the covariance structure for vec(Z̃) generally as Σ̃. In this section, we examine various specifications of Σ̃, as well as the implications these structures have on the covariance of vec(Ũ).

Separable Space-Category Dependence

One specification for Σ̃ is to extend the covariance structure for the independent multinomial probit regression model in (2.4) to allow for spatial dependence by replacing the identity matrix with a spatial correlation structure, i.e.,
\[
\tilde{\boldsymbol{\Sigma}} \equiv \boldsymbol{\Sigma}(\boldsymbol{\theta}) \otimes \tilde{\boldsymbol{\Omega}}, \tag{5.9}
\]
where Σ(θ) is an n × n spatial correlation matrix and Ω̃ is an ℓ × ℓ covariance matrix specifying the dependence among the ℓ categories. Notice that this specification of Σ̃ requires that the spatial dependence is the same for all categories and the categorical dependence is the same for all locations. Thus, Σ̃ given by (5.9) is space-category separable.

From (5.4), the covariance of vec(Ũ) is ∆Σ̃∆′, which for the separable case can be expressed as
\[
\boldsymbol{\Delta}\tilde{\boldsymbol{\Sigma}}\boldsymbol{\Delta}'
= (\mathbf{I}_n \otimes \boldsymbol{\Delta})\left(\boldsymbol{\Sigma}(\boldsymbol{\theta}) \otimes \tilde{\boldsymbol{\Omega}}\right)(\mathbf{I}_n \otimes \boldsymbol{\Delta})'
= (\mathbf{I}_n\,\boldsymbol{\Sigma}(\boldsymbol{\theta})\,\mathbf{I}_n') \otimes (\boldsymbol{\Delta}\,\tilde{\boldsymbol{\Omega}}\,\boldsymbol{\Delta}')
= \boldsymbol{\Sigma}(\boldsymbol{\theta}) \otimes (\boldsymbol{\Delta}\tilde{\boldsymbol{\Omega}}\boldsymbol{\Delta}').
\]
Notice that in this case we again have a separable covariance, where each category relative to category ℓ has the same spatial dependence and the categorical dependence is the same at each location. Therefore, separability in the covariance structure of Z̃ results in separability in the covariance structure of Ũ.

This model is simpler to fit than the model with a general spatial-categorical dependence structure because we only need to work with an n × n matrix and an ℓ × ℓ matrix rather than an nℓ × nℓ matrix. See Section 5.2 for a model-fitting algorithm associated with this covariance structure.
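Both points can be illustrated with a short R sketch using hypothetical toy values: the differencing passes through the Kronecker structure, and the large covariance matrix can be inverted factor by factor.

```r
## Minimal sketch (hypothetical toy values): separability survives the
## differencing, and the Kronecker form keeps the linear algebra small.
set.seed(3)
n <- 5; l <- 3
s <- runif(n)                                        # hypothetical 1-d locations
Sigma <- exp(-abs(outer(s, s, "-")) / 0.5)           # a valid spatial correlation matrix
A <- matrix(rnorm(l * l), l); Omega <- crossprod(A)  # a valid l x l covariance matrix
Delta <- cbind(diag(l - 1), -1)

lhs <- kronecker(diag(n), Delta) %*% kronecker(Sigma, Omega) %*% t(kronecker(diag(n), Delta))
rhs <- kronecker(Sigma, Delta %*% Omega %*% t(Delta))
all.equal(lhs, rhs)                                  # TRUE: Sigma(theta) x (Delta Omega Delta')

## Inverting the n(l-1) x n(l-1) covariance needs only the two small inverses:
OmegaD <- Delta %*% Omega %*% t(Delta)
all.equal(solve(rhs), kronecker(solve(Sigma), solve(OmegaD)))   # TRUE
```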

When we fit the Bayesian spatial multinomial probit model with a separable dependence structure based on Ũ, we only estimate the (ℓ − 1) × (ℓ − 1) categorical covariance Ω̃∆ ≡ ∆Ω̃∆′. The ijth element of this covariance matrix is interpreted as the covariance between the difference between the latent variables associated with categories i and ℓ and the difference between the latent variables associated with categories j and ℓ. We might be interested in modeling the dependence among all ℓ categorical responses; however, because ∆ is not invertible, Ω̃ is not identifiable given Ω̃∆.
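The non-identifiability is easy to see directly: every row of ∆ sums to zero, so adding a constant to all elements of Ω̃ leaves ∆Ω̃∆′ unchanged. A small R check with hypothetical values:

```r
## Two different l x l matrices Omega that induce the same Omega_Delta:
## adding c * J (J a matrix of ones) is absorbed because Delta %*% rep(1, l) = 0.
l <- 3
Delta  <- cbind(diag(l - 1), -1)
Omega1 <- matrix(c(2, .5, .3, .5, 1, .2, .3, .2, 1.5), l, l)
Omega2 <- Omega1 + 0.4 * matrix(1, l, l)   # still symmetric positive definite here
all.equal(Delta %*% Omega1 %*% t(Delta),
          Delta %*% Omega2 %*% t(Delta))   # TRUE, so Omega-tilde is not identifiable
```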

Non-Separable Space-Category Dependence

Suppose that instead of assuming space-category separability, we want to allow the different categories to have different spatial dependence structures. To motivate why this

would be desirable, consider two categories of the land cover example, forest and urban. In

many places on Earth, locations covered by forest take up a larger area than urban locations.

This is seen in Figure 1.1, where urban land cover is included in the category ‘other’. Thus,

the spatial dependence for the category forest will have a longer range, or distance at which locations remain correlated, than the spatial dependence for the category urban.

Therefore, the spatial dependence structures for each of these categories should allow for

this difference.

To accommodate category-specific spatial dependence structures, we can consider the

latent variables of the data augmentation representation of the Bayesian multinomial probit

regression model to be a spatially-dependent multivariate random variable. Then, we can

use non-separable dependence structures defined for Gaussian multivariate spatial data.

For geostatistical data, a non-space-category separable covariance structure can be ob-

tained through the linear model of coregionalization (LMC; e.g., Wackernagel, 1998). The

Bayesian version of the LMC arises as a special case of the conditional hierarchical ap-

proach in Royle and Berliner (1999). Gelfand et al. (2004) propose an extension to the

LMC, called the spatially-varying LMC, by specifying the dependence structure jointly.

When modeling the dependence structure of lattice/gridded data, several multivariate CAR

models have been proposed (e.g., Mardia, 1988; Carlin and Banerjee, 2003; Gelfand and

Vounatsou, 2003). In addition, Sain and Cressie (2007) propose a canonical multivari-

ate conditional autoregressive (CAMCAR) model as a multivariate extension to the CAR

model that allows for space-category asymmetries. We leave determining the dependence

structure of vec(U˜ ) based on each of the previous models as well as the interpretation of the elements of the resulting covariance structures to future research.

5.2 Model-Fitting

5.2.1 Data Augmentation MCMC Algorithms

In this section, we propose a marginal data augmentation algorithm for fitting the space-category separable model with Σ̃ = Σ(θ) ⊗ Ω̃. This algorithm is an extension of the algorithm proposed by Imai and van Dyk (2005) for independent multi-category response variables and of Marginal-Scheme 1 proposed in Section 3.1 for spatially-dependent binary data.

First, let
\[
\mathbf{X}_\Delta \equiv \boldsymbol{\Delta}\mathbf{X}
\qquad\text{and}\qquad
\tilde{\boldsymbol{\Omega}}_\Delta \equiv \boldsymbol{\Delta}\tilde{\boldsymbol{\Omega}}\boldsymbol{\Delta}',
\]
so that
\[
\mathrm{vec}(\tilde{\mathbf{U}}) \sim \mathrm{N}\!\left(\mathbf{X}_\Delta\tilde{\boldsymbol{\beta}},\; \boldsymbol{\Sigma}(\theta) \otimes \tilde{\boldsymbol{\Omega}}_\Delta\right).
\]
Here, $\mathbf{X}_\Delta$ is specified appropriately for each type of covariate information considered, as discussed in Section 5.1.1. We let $\mathbf{X}_\Delta$ be an $n(\ell-1) \times k^{(E)}$ matrix, where $k^{(E)}$ represents the total number of effective covariates, and each covariate may have up to $\ell$ effective covariates.
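For intuition, a draw of vec(Ũ) under this separable specification can be generated from the Cholesky factors of the two small matrices rather than from the full n(ℓ − 1)-dimensional covariance. A minimal R sketch with hypothetical dimensions, an assumed exponential correlation for Σ(θ), and illustrative values:

```r
## Minimal sketch: simulate vec(U~) ~ N(X_D beta~, Sigma(theta) x Omega_D-tilde)
## using chol() of the two factor matrices only.
set.seed(4)
n <- 6; l <- 3; kE <- 2
theta <- 0.5
s <- runif(n)                                  # hypothetical 1-d locations
Sigma  <- exp(-abs(outer(s, s, "-")) / theta)  # exponential correlation (an assumption)
OmegaD <- matrix(c(1, 0.4, 0.4, 1.3), l - 1, l - 1)
XD   <- matrix(rnorm(n * (l - 1) * kE), n * (l - 1), kE)
beta <- c(1, -0.5)

## chol() returns the upper-triangular R with R'R = M, and
## (R_Sigma %x% R_Omega)'(R_Sigma %x% R_Omega) = Sigma %x% OmegaD.
R <- kronecker(chol(Sigma), chol(OmegaD))
U <- XD %*% beta + t(R) %*% rnorm(n * (l - 1))  # one draw of vec(U~)
```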

In this case, following Imai and van Dyk (2005), the working parameter is the first element in Ω̃∆, which we denote as $\omega_\Delta^2$. Therefore, the identifiable parameters in this case will be
\[
\mathbf{U} = \frac{\tilde{\mathbf{U}}}{\omega_\Delta}, \qquad
\boldsymbol{\beta} = \frac{\tilde{\boldsymbol{\beta}}}{\omega_\Delta}, \qquad\text{and}\qquad
\boldsymbol{\Omega}_\Delta = \frac{\tilde{\boldsymbol{\Omega}}_\Delta}{\omega_\Delta^2}.
\]
We assign prior distributions on the identifiable parameters. Specifically, we take

\[
\boldsymbol{\beta} \sim \mathrm{N}(\mathbf{0}, \mathbf{C}_\beta),
\]
where $\mathbf{0}$ is a $k^{(E)} \times 1$ vector of zeros and $\mathbf{C}_\beta$ is a $k^{(E)} \times k^{(E)}$ covariance matrix. We consider only one spatial dependence parameter, i.e., $\boldsymbol{\theta} \equiv \theta$, and its prior distribution is
\[
\theta \sim \mathrm{Unif}(l_\theta, u_\theta),
\]
where $l_\theta$ and $u_\theta$ are appropriate lower and upper bounds, respectively, for $\theta$. Finally, we take the prior on $\tilde{\boldsymbol{\Omega}}_\Delta$ to be inverse Wishart with parameters $\nu_\Omega$ and $\tilde{\mathbf{M}}_\Omega$, i.e., $\tilde{\boldsymbol{\Omega}}_\Delta \sim \mathrm{Inv\ Wishart}(\nu_\Omega, \tilde{\mathbf{M}}_\Omega)$, where $\nu_\Omega$ is a scalar and $\tilde{\mathbf{M}}_\Omega$ is an $(\ell-1) \times (\ell-1)$ matrix. Transforming to $\boldsymbol{\Omega}_\Delta$ yields the following joint density function
\[
\pi(\boldsymbol{\Omega}_\Delta, \omega_\Delta^2) \propto |\boldsymbol{\Omega}_\Delta|^{-(\nu_\Omega+\ell)/2}\,
\exp\!\left\{-\frac{a_\omega^2}{2\omega_\Delta^2}\,\mathrm{trace}\!\left(\mathbf{M}_\Omega\boldsymbol{\Omega}_\Delta^{-1}\right)\right\}
(\omega_\Delta^2)^{-[\nu_\Omega(\ell-1)/2+1]},
\]
where $a_\omega^2$ is a positive constant and $\mathbf{M}_\Omega = \tilde{\mathbf{M}}_\Omega/a_\omega^2$. Therefore,
\[
\omega_\Delta^2 \,|\, \boldsymbol{\Omega}_\Delta \sim a_\omega^2\,\mathrm{trace}\!\left(\mathbf{M}_\Omega\boldsymbol{\Omega}_\Delta^{-1}\right)\big/\chi^2_{\nu_\Omega(\ell-1)}. \tag{5.10}
\]
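Equation (5.10) is a scaled inverse chi-square draw and is straightforward to implement; a minimal R sketch with hypothetical values of $\nu_\Omega$, $\mathbf{M}_\Omega$, $\boldsymbol{\Omega}_\Delta$, and $a_\omega^2$:

```r
## Draw omega^2_Delta | Omega_Delta from (5.10):
## a_w^2 * trace(M %*% Omega^{-1}) / chi^2_{nu_Omega * (l - 1)}
draw_omega2 <- function(a2, M, Omega, nu, l) {
  a2 * sum(diag(M %*% solve(Omega))) / rchisq(1, df = nu * (l - 1))
}

## Hypothetical values for illustration only
l <- 3; nu <- 4; a2 <- 1
M <- diag(l - 1)
Omega <- matrix(c(1, 0.3, 0.3, 1.2), l - 1, l - 1)
draw_omega2(a2, M, Omega, nu, l)
```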

The last element of discussion before giving the steps of the algorithm is to determine the conditional distributions of $\mathbf{U}_i \,|\, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta$ and $U_{ij} \,|\, \mathbf{U}_{i,-j}, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta, \omega_\Delta^2$, where $\mathbf{U}_i = (U_{i1}, \ldots, U_{i(\ell-1)})'$, $[\mathrm{vec}(\mathbf{U})]_{-i}$ denotes $\mathrm{vec}(\mathbf{U})$ with $\mathbf{U}_i$ removed, and $\mathbf{U}_{i,-j}$ denotes $\mathbf{U}_i$ with the $j$th element removed.

Notice that $\mathbf{U}_i \,|\, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta$ has a normal distribution with mean
\[
\begin{aligned}
\mathrm{E}(\mathbf{U}_i \,|\, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta)
&= (\mathbf{X}_\Delta)_i\boldsymbol{\beta} + ([\boldsymbol{\Sigma}(\theta)]_{i,-i} \otimes \boldsymbol{\Omega}_\Delta)([\boldsymbol{\Sigma}(\theta)]_{-i,-i} \otimes \boldsymbol{\Omega}_\Delta)^{-1}\left([\mathrm{vec}(\mathbf{U})]_{-i} - (\mathbf{X}_\Delta)_{-i}\boldsymbol{\beta}\right) \\
&= (\mathbf{X}_\Delta)_i\boldsymbol{\beta} + \left([\boldsymbol{\Sigma}(\theta)]_{i,-i}([\boldsymbol{\Sigma}(\theta)]_{-i,-i})^{-1} \otimes \mathbf{I}_{\ell-1}\right)\left([\mathrm{vec}(\mathbf{U})]_{-i} - (\mathbf{X}_\Delta)_{-i}\boldsymbol{\beta}\right)
\end{aligned}
\]
and variance
\[
\begin{aligned}
\mathrm{Var}(\mathbf{U}_i \,|\, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta)
&= ([\boldsymbol{\Sigma}(\theta)]_{i,i} \otimes \boldsymbol{\Omega}_\Delta) - ([\boldsymbol{\Sigma}(\theta)]_{i,-i} \otimes \boldsymbol{\Omega}_\Delta)([\boldsymbol{\Sigma}(\theta)]_{-i,-i} \otimes \boldsymbol{\Omega}_\Delta)^{-1}([\boldsymbol{\Sigma}(\theta)]_{-i,i} \otimes \boldsymbol{\Omega}_\Delta) \\
&= \left([\boldsymbol{\Sigma}(\theta)]_{i,i} - [\boldsymbol{\Sigma}(\theta)]_{i,-i}([\boldsymbol{\Sigma}(\theta)]_{-i,-i})^{-1}[\boldsymbol{\Sigma}(\theta)]_{-i,i}\right) \otimes \boldsymbol{\Omega}_\Delta \\
&= \underbrace{\left([\boldsymbol{\Sigma}(\theta)]_{i,i} - [\boldsymbol{\Sigma}(\theta)]_{i,-i}([\boldsymbol{\Sigma}(\theta)]_{-i,-i})^{-1}[\boldsymbol{\Sigma}(\theta)]_{-i,i}\right)}_{\equiv\, \sigma(\theta)_i}\, \boldsymbol{\Omega}_\Delta,
\end{aligned}
\]

where $(\mathbf{X}_\Delta)_i$ is the $(\ell-1) \times k^{(E)}$ matrix of covariates associated with observation $i$, $(\mathbf{X}_\Delta)_{-i}$ is the remaining $(n-1)(\ell-1) \times k^{(E)}$ matrix of $\mathbf{X}_\Delta$, and $[\boldsymbol{\Sigma}(\theta)]_{i,-j}$ denotes the $i$th row of $\boldsymbol{\Sigma}(\theta)$ with the $j$th column removed. Using this conditional distribution, we can determine the distribution of $U_{ij} \,|\, \mathbf{U}_{i,-j}, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta$, which is normally distributed with mean

\[
\begin{aligned}
\mu_{U_{ij}} &\equiv \mathrm{E}(U_{ij} \,|\, \mathbf{U}_{i,-j}, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta) \\
&= (\mathbf{X}_\Delta)_{ij}\boldsymbol{\beta} + [\boldsymbol{\Sigma}(\theta)]_{i,-i}([\boldsymbol{\Sigma}(\theta)]_{-i,-i})^{-1}\left(\{[\mathrm{vec}(\mathbf{U})]_{-i}\}_j - \{(\mathbf{X}_\Delta)_{-i}\}_j\boldsymbol{\beta}\right) \\
&\quad + [\boldsymbol{\Omega}_\Delta]_{j,-j}([\boldsymbol{\Omega}_\Delta]_{-j,-j})^{-1}\Big[\left(\mathbf{U}_{i,-j} - (\mathbf{X}_\Delta)_{i,-j}\boldsymbol{\beta}\right) \\
&\qquad\quad - \left([\boldsymbol{\Sigma}(\theta)]_{i,-i}([\boldsymbol{\Sigma}(\theta)]_{-i,-i})^{-1} \otimes \mathbf{I}_{\ell-2}\right)\left(\{[\mathrm{vec}(\mathbf{U})]_{-i}\}_{-j} - \{(\mathbf{X}_\Delta)_{-i}\}_{-j}\boldsymbol{\beta}\right)\Big], \tag{5.11}
\end{aligned}
\]

where $(\mathbf{X}_\Delta)_{ij}$ is the $k^{(E)} \times 1$ vector associated with $U_{ij}$, $(\mathbf{X}_\Delta)_{i,-j}$ is the remaining $(\ell-2) \times k^{(E)}$ component of $(\mathbf{X}_\Delta)_i$, the notation $\{\cdot\}_j$ denotes the selection of only the rows corresponding to category $j$, and $\{\cdot\}_{-j}$ denotes the selection of the rows not corresponding to category $j$. Although the above equation looks daunting, computationally it is just a matter of selecting the appropriate rows and columns of the appropriate matrices.

As noted in Section 5.1.2, these computations are faster than the corresponding computations for the conditional mean under a non-separable dependence structure. The variance is then

\[
\begin{aligned}
\tau^2_{U_{ij}} &\equiv \mathrm{Var}(U_{ij} \,|\, \mathbf{U}_{i,-j}, [\mathrm{vec}(\mathbf{U})]_{-i}, \boldsymbol{\beta}, \theta, \boldsymbol{\Omega}_\Delta) \\
&= \sigma(\theta)_i[\boldsymbol{\Omega}_\Delta]_{j,j} - \sigma(\theta)_i[\boldsymbol{\Omega}_\Delta]_{j,-j}\left(\sigma(\theta)_i[\boldsymbol{\Omega}_\Delta]_{-j,-j}\right)^{-1}\sigma(\theta)_i[\boldsymbol{\Omega}_\Delta]_{-j,j} \\
&= \sigma(\theta)_i\left([\boldsymbol{\Omega}_\Delta]_{j,j} - [\boldsymbol{\Omega}_\Delta]_{j,-j}([\boldsymbol{\Omega}_\Delta]_{-j,-j})^{-1}[\boldsymbol{\Omega}_\Delta]_{-j,j}\right). \tag{5.12}
\end{aligned}
\]
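Computationally, (5.11) and (5.12) are instances of the usual conditional moments of one coordinate of a multivariate normal given the others. A minimal R sketch of that generic calculation, with hypothetical values, also confirms that the scalar σ(θ)ᵢ factors out of the variance as in (5.12).

```r
## Generic univariate full-conditional moments for a normal vector u with
## mean m and covariance V:
##   mean_j | rest = m[j] + V[j,-j] %*% solve(V[-j,-j]) %*% (u[-j] - m[-j])
##   var_j  | rest = V[j,j] - V[j,-j] %*% solve(V[-j,-j]) %*% V[-j,j]
cond_moments <- function(j, u, m, V) {
  B <- V[j, -j, drop = FALSE] %*% solve(V[-j, -j, drop = FALSE])
  list(mean = drop(m[j] + B %*% (u[-j] - m[-j])),
       var  = drop(V[j, j] - B %*% V[-j, j, drop = FALSE]))
}

## Toy check (hypothetical values): with V = sigma_i * Omega_Delta the
## conditional variance equals sigma_i times the Omega_Delta-only quantity,
## as in (5.12).
Omega <- matrix(c(1, 0.3, 0.3, 2), 2, 2); sigma_i <- 0.7
V <- sigma_i * Omega
m <- c(0.1, -0.2); u <- c(0.5, 1.0)
cond_moments(1, u, m, V)$var
sigma_i * (Omega[1, 1] - Omega[1, 2] * Omega[2, 2]^-1 * Omega[2, 1])   # same value
```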

We now specify the Gibbs sampler and full conditional distributions for this separable Bayesian spatial multinomial probit regression model:

Step 1: Sample $\mathrm{vec}(\tilde{\mathbf{U}})$ from $\mathrm{vec}(\mathbf{U}) \,|\, \mathbf{Y}, \boldsymbol{\beta}^{[t-1]}, \theta^{[t-1]}, \boldsymbol{\Omega}_\Delta^{[t-1]}, (\omega_\Delta^2)^*$.

• Sample $(\omega_\Delta^2)^*$ from the prior distribution $\omega_\Delta^2 \,|\, \boldsymbol{\Omega}_\Delta$ as specified in (5.10).

• For each $i = 1, \ldots, n$ and $j = 1, \ldots, (\ell-1)$, sample $\tilde{U}_{ij}$ from
\[
\tilde{U}_{ij} \,|\, \mathbf{Y}, [\mathrm{vec}(\mathbf{U})]_{-i}, \mathbf{U}_{i,-j}, \boldsymbol{\beta}^{[t-1]}, \theta^{[t-1]}, \boldsymbol{\Omega}_\Delta^{[t-1]}, (\omega_\Delta^2)^* \sim
\begin{cases}
\mathrm{TN}\!\left(\omega_\Delta^*\mu_{U_{ij}},\ (\omega_\Delta^2)^*\tau^2_{U_{ij}},\ \max(\omega_\Delta^*\mathbf{U}_{i,-j}, 0),\ \infty\right), & \text{if } Y_i = j,\\[4pt]
\mathrm{TN}\!\left(\omega_\Delta^*\mu_{U_{ij}},\ (\omega_\Delta^2)^*\tau^2_{U_{ij}},\ -\infty,\ \max(\omega_\Delta^*\mathbf{U}_{i,-j}, 0)\right), & \text{if } Y_i \neq j,
\end{cases}
\]
where $\mu_{U_{ij}}$ and $\tau^2_{U_{ij}}$ are as specified in (5.11) and (5.12), respectively, plugging in the current values of the other parameters. (A sketch of this truncated-normal draw, together with the Metropolis step in Step 4, is given after the algorithm.)

Step 2: Sample $\left((\omega_\Delta^2)^*, \boldsymbol{\beta}^{[t]}\right)$ from $\omega_\Delta^2, \boldsymbol{\beta} \,|\, \mathbf{Y}, \mathrm{vec}(\mathbf{U}^{[t]}), \theta^{[t-1]}, \boldsymbol{\Omega}_\Delta^{[t-1]}$.

• Sample $(\omega_\Delta^2)^*$ from
\[
(\omega_\Delta^2)^* \sim \frac{\hat{\omega}^2}{\chi^2_{(n+\nu_\Omega)(\ell-1)}},
\]
where
\[
\hat{\omega}^2 = (\mathrm{vec}(\tilde{\mathbf{U}}) - \mathbf{X}_\Delta\hat{\boldsymbol{\beta}})'\left(\boldsymbol{\Sigma}(\theta^{[t-1]}) \otimes \boldsymbol{\Omega}_\Delta^{[t-1]}\right)^{-1}(\mathrm{vec}(\tilde{\mathbf{U}}) - \mathbf{X}_\Delta\hat{\boldsymbol{\beta}})
+ \hat{\boldsymbol{\beta}}'\mathbf{C}_\beta^{-1}\hat{\boldsymbol{\beta}}
+ \mathrm{trace}\!\left(\tilde{\mathbf{M}}_\Omega\left(\boldsymbol{\Omega}_\Delta^{[t-1]}\right)^{-1}\right)
\]
and
\[
\hat{\boldsymbol{\beta}} = \left(\mathbf{X}_\Delta'\left(\boldsymbol{\Sigma}(\theta^{[t-1]}) \otimes \boldsymbol{\Omega}_\Delta^{[t-1]}\right)^{-1}\mathbf{X}_\Delta + \mathbf{C}_\beta^{-1}\right)^{-1}\mathbf{X}_\Delta'\left(\boldsymbol{\Sigma}(\theta^{[t-1]}) \otimes \boldsymbol{\Omega}_\Delta^{[t-1]}\right)^{-1}\mathrm{vec}(\tilde{\mathbf{U}}).
\]

• Sample
\[
\tilde{\boldsymbol{\beta}} \sim \mathrm{N}\!\left(\hat{\boldsymbol{\beta}},\; (\omega_\Delta^2)^*\left(\mathbf{X}_\Delta'\left(\boldsymbol{\Sigma}(\theta^{[t-1]}) \otimes \boldsymbol{\Omega}_\Delta^{[t-1]}\right)^{-1}\mathbf{X}_\Delta + \mathbf{C}_\beta^{-1}\right)^{-1}\right).
\]

• Set $\boldsymbol{\beta}^{[t]} = \tilde{\boldsymbol{\beta}}/\omega_\Delta^*$.

Step 3: Sample $\left((\omega_\Delta^2)^{[t]}, \boldsymbol{\Omega}_\Delta^{[t]}\right)$ from $\omega_\Delta^2, \boldsymbol{\Omega}_\Delta \,|\, \mathbf{Y}, \mathrm{vec}(\mathbf{U}^{[t]}), \boldsymbol{\beta}^{[t]}, \theta^{[t-1]}$.

• Sample
\[
\tilde{\boldsymbol{\Omega}}_\Delta \sim \mathrm{Inv\ Wishart}\!\left(n + \nu_\Omega,\;
\tilde{\mathbf{M}}_\Omega + [[\mathrm{vec}(\tilde{\mathbf{U}}) - \mathbf{X}_\Delta\tilde{\boldsymbol{\beta}}]]'_{n\times(\ell-1)}\,\boldsymbol{\Sigma}(\theta^{[t-1]})^{-1}\,[[\mathrm{vec}(\tilde{\mathbf{U}}) - \mathbf{X}_\Delta\tilde{\boldsymbol{\beta}}]]_{n\times(\ell-1)}\right),
\]
where the notation $[[\mathbf{c}]]_{a\times b}$ denotes the operator that reorders the $ab \times 1$ vector $\mathbf{c}$ to form the corresponding $a \times b$ matrix, filling this matrix by rows.

• Set $(\omega_\Delta^2)^{[t]}$ equal to the first element of $\tilde{\boldsymbol{\Omega}}_\Delta$.

• Set $U_{ij}^{[t]} = \tilde{U}_{ij}/\omega_\Delta^{[t]}$, for $i = 1, \ldots, n$ and $j = 1, \ldots, (\ell-1)$.

• Set $\boldsymbol{\Omega}_\Delta^{[t]} = \tilde{\boldsymbol{\Omega}}_\Delta/(\omega_\Delta^2)^{[t]}$.

Step 4: Sample $\theta^{[t]}$ from $\theta \,|\, \mathbf{Y}, \mathrm{vec}(\mathbf{U}^{[t]}), \boldsymbol{\beta}^{[t]}, \boldsymbol{\Omega}_\Delta^{[t]}, (\omega_\Delta^2)^{[t]}$ using a Metropolis random walk step.
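The two non-standard pieces of this sampler are the truncated-normal draws in Step 1 and the Metropolis random walk update for θ in Step 4. The following minimal R sketch illustrates both under stated assumptions; rtnorm(), draw_Uij(), theta_update(), the proposal standard deviation, and the user-supplied log-density logpost_theta() are illustrative names, not the dissertation's implementation.

```r
## Truncated N(mean, sd^2) on (lower, upper) via the inverse-CDF method.
## This simple approach can lose accuracy far in the tails; specialized
## samplers (e.g., Geweke, 1991) are preferable there.
rtnorm <- function(n, mean, sd, lower, upper) {
  p <- runif(n, pnorm(lower, mean, sd), pnorm(upper, mean, sd))
  qnorm(p, mean, sd)
}

## Step 1-type draw for one latent U_ij:
## if Y_i == j, truncate below at max(U_{i,-j}, 0); otherwise truncate above.
draw_Uij <- function(mu, tau2, U_i_minus_j, Yi_equals_j) {
  bound <- max(c(U_i_minus_j, 0))
  if (Yi_equals_j) rtnorm(1, mu, sqrt(tau2), bound, Inf)
  else             rtnorm(1, mu, sqrt(tau2), -Inf, bound)
}

## Step 4-type update: random-walk Metropolis for theta with a Unif(l, u) prior.
## logpost_theta() must return the log N(X_D beta, Sigma(theta) x Omega_D)
## density of vec(U); it is left as a user-supplied function here.
theta_update <- function(theta, l_theta, u_theta, logpost_theta, prop_sd = 0.1) {
  theta_star <- theta + rnorm(1, 0, prop_sd)
  if (theta_star < l_theta || theta_star > u_theta) return(theta)  # prior is zero outside bounds
  log_ratio <- logpost_theta(theta_star) - logpost_theta(theta)    # uniform prior cancels
  if (log(runif(1)) < log_ratio) theta_star else theta
}
```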

5.3 Summary

In this chapter, we showed how care must be taken in the specification of the latent variable representation of the Bayesian spatial multinomial probit regression model. In order for regression coefficient parameters to be identifiable, when specifying the design matrix we must consider the type of covariate information (i.e., category-specific, location-specific, or location-category-specific). Furthermore, we discussed various ways to specify the spatial-categorical dependence structure of the latent variable. Under this appropriately specified latent variable representation of the Bayesian spatial multinomial probit regression model, we provided a model-fitting algorithm for the case of a separable spatial-categorical dependence structure.

CHAPTER 6

CONTRIBUTIONS AND FUTURE WORK

In this final chapter, we discuss our contributions to the development of the Bayesian spatial probit regression model and provide a list of future research directions.

We began this work by motivating the need for models that allow for residual spatial dependence when analyzing spatially-referenced categorical response variables, specifically within a regression framework. We reviewed a diverse literature of current methods for analyzing spatially-dependent binary and categorical data, one being the Bayesian spatial probit regression model. To provide context, in Chapter 2 we discussed the general data-augmented version of the Bayesian probit regression model, and showed how extensions of this model fit within the general specification of the model. Specifically, we showed how the Bayesian probit regression model is extended to allow for residual spatial dependence, resulting in the Bayesian spatial probit regression model. We also demonstrated that the variance of the latent variable in the data-augmented Bayesian probit regression models is not identifiable.

To account for, and exploit for computational purposes, this non-identifiability in the

Bayesian spatial probit regression model, in Chapter 3 we introduced and compared several data augmentation MCMC algorithms. These algorithms use the non-identifiable parameter, or working parameter, to improve computational efficiency of model fitting. While

conditional and marginal augmentation strategies for the independent version of this model have been compared in the literature previously, the spatial extension introduces additional complexity. Furthermore, the presence of the spatial dependence parameter allows for the possibility of partially collapsing the data augmentation algorithms. Based on our simulation study and data analysis, we recommend using the Marginal-Scheme 1 algorithm when

fitting the spatial probit regression model. While the differences in the sample autocorrelations among the algorithms may not appear to be dramatic, we return to our discussion at the beginning of Chapter 3 of the recent emphasis in the literature on methodology for massive spatial data: improved mixing of MCMC algorithms can lead to significant gains in computational efficiency, which can be particularly important in analyses of large data sets.

In Chapter 3, our investigations of efficient computational strategies for fitting the spatial probit regression model focused exclusively on the special case where the outcome variable is binary. We proposed an extension of the Marginal-Scheme 1 algorithm for spatially-referenced multi-categorical outcomes in Chapter 5; however, we left implementation of this algorithm to future work. Additionally, data augmentation MCMC strategies also need to be developed for spatially-referenced multivariate or ordinal response variables. For example, Higgs and Hoeting (2010) recently proposed a clipped latent variable model for spatially-dependent ordered categorical data, an extension of the model originally introduced by De Oliveira (2000). To facilitate sampling the latent variables in models for non-spatially-referenced ordered categorical data, Hans et al. (2009) propose using a covariance decomposition technique. Future work will explore data augmentation MCMC strategies and algorithms using this covariance decomposition technique for fitting models for spatially-referenced multivariate and ordered categorical data.

In Chapter 4 we showed how a spatial classifier can be derived from the Bayesian spatial probit regression model. In an analysis of the Southeast Asia land cover data, this spatial classifier was found to be a more accurate predictor of unobserved spatially-dependent binary data than other popular classification techniques. Furthermore, including spatial dependence in the residual dependence structure of a generalized linear model is more effective in terms of prediction/classification than including spatial dependence in the covariate or predictor space, at least in this example.

In this work, we only considered a spatial classifier for two classes. We could also consider a similar spatial classifier for multi-category response variables based on the spatial multinomial probit regression model proposed in Chapter 5 and compare it to popular classification techniques in the multi-class setting.

Finally, in Chapter 5, we extended the spatial probit regression model to the multi-category case. We considered how the mean and covariance specification for all ℓ categories will impact the specification and interpretation of the model when we use the ℓth category as a baseline, i.e., model the first ℓ − 1 categories relative to the ℓth category. We showed how it was important to consider the type of covariate information when specifying the latent mean structure and provided specifications of the design matrix which would provide interpretable and identifiable coefficients. We also provided potential ways for modeling the latent variable dependence structure, but more work needs to be done in considering non-separable space-category dependence structures as well as the interpretation of resulting covariances. Furthermore, we provided a data augmentation algorithm for fitting the

Bayesian multinomial probit regression model when the space-category dependence is separable.

Suppose that after we fit the model using the ℓth category as the baseline category, we want to instead consider the model using the first category as the baseline category. Future work is needed to determine whether or not we can go between baseline categories and estimate the associated parameters of the model with one category as the baseline using the

fitted model with a different category as the baseline.

One area of future work applicable to all areas of this dissertation is to create an R pack-

age for the Bayesian spatial multinomial probit regression model. In addition to providing

an accessible way to fit the Bayesian spatial probit regression model and to use this model

for prediction/classification, this package would also provide functions that accommodate

the special features required for specification of the Bayesian spatial multinomial probit

regression model. Specifically, this package would enable creation of an appropriate de-

sign matrix for different types of covariate information, and upon further development of

spatial-categorical dependence structures, provide a set-up for the specification of various

dependence structures and fit the model using the appropriate data augmentation algorithm.

The final area of future work that we mention here is to extend the Bayesian spatial pro-

bit regression model to the space-time setting. The motivating example of this dissertation

research was the analysis of land-cover/land-use data. For example, researchers may want

to determine patterns of land-cover change over time (e.g., deforestation) and what factors

may have contributed to an observed change. The Bayesian spatial probit regression model

is a natural starting point for modeling spatial-temporal categorical response variables.

BIBLIOGRAPHY

Abramowitz, M. and Stegun, I. A. (1965). Handbook of Mathematical Functions. New

York, NY: Dover.

Albert, J. H. and Chib, S. (1993). “Bayesian analysis of binary and polychotomous re-

sponse data.” Journal of the American Statistical Association, 88, 669–679.

Albert, P. S. and McShane, L. M. (1995). “A generalized estimating equations approach

for spatially correlated binary data: Applications to the analysis of neuroimaging data.”

Biometrics, 51, 627–638.

Augustin, N. H., Mugglestone, M. A., and Buckland, S. T. (1996). “An autologistic model

for the spatial distribution of wildlife.” Journal of Applied Ecology, 33, 339–347.

Banerjee, S., Carlin, B., and Gelfand, A. (2004). Hierarchical Modeling and Analysis for

Spatial Data. Boca Raton, FL: Chapman & Hall/CRC.

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). “Gaussian predictive

process models for large spatial data sets.” Journal of the Royal Statistical Society,

Series B, 70, 825–848.

Besag, J. E. (1972). “Nearest-neighbour systems and the auto-logistic model for binary

data.” Journal of the Royal Statistical Society, Series B, 34, 75–83.

— (1974). “Spatial interaction and the statistical analysis of lattice systems.” Journal of

the Royal Statistical Society, Series B, 36, 2, 192–236.

Breslow, N. E. and Clayton, D. G. (1993). “Approximate inference in generalized linear

mixed models.” Journal of the American Statistical Association, 88, 9–25.

Calder, C. A. (2007). “Dynamic factor process convolution models for multivariate space-

time data with application to air quality assessment.” Environmental and Ecological

Statistics, 14, 229–247.

Carlin, B. P. and Banerjee, S. (2003). “Hierarchical multivariate CAR models for spatio-

temporally correlated survival data (with discussion).” Bayesian Statistics 7, eds. J. M.

Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and

M. West, 45–63. Oxford: Oxford University Press.

Chib, S. and Greenberg, E. (1998). “Analysis of multivariate probit models.” Biometrika,

85, 347–361.

Christensen, O. F., Møller, J., and Waagepetersen, R. P. (2001). “Geometric ergodicity of

Metropolis Hastings algorithms for conditional simulation in generalised linear mixed

models.” Methodology and Computing in Applied Probability, 3, 309–327.

Christensen, O. F. and Ribeiro Jr., P. J. (2002). “geoRglm: A package for generalised linear

spatial models.” R-NEWS, 2, 2, 26–28.

Christensen, O. F., Roberts, G. O., and Sköld, M. (2006). “Robust Markov chain Monte

Carlo methods for spatial generalized linear mixed models.” Journal of Computational

and Graphical Statistics, 15, 1–17.

Christensen, O. F. and Waagepetersen, R. P. (2002). “Bayesian prediction of spatial count

data using generalized linear mixed models.” Biometrics, 58, 280–286.

Cortes, C. and Vapnik, V. (1995). “Support-vector network.” Machine Learning, 20, 273–

297.

Cressie, N. (1993). Statistics for Spatial Data. Revised ed. New York: John Wiley.

Daganzo, C. (1980). Multinomial Probit. New York, NY: Academic Press.

De Oliveira, V. (2000). “Bayesian prediction of clipped Gaussian random fields.” Compu-

tational Statistics and Data Analysis, 34, 299–314.

Diggle, P. J. and Ribeiro Jr., P. J. (2007). Model-based Geostatistics. New York, New York:

Springer.

Diggle, P. J., Tawn, J. A., and Moyeed, R. A. (1998). “Model-based geostatistics.” Applied

Statistics, 47, 299–350.

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., and Weingessel, A. (2010). e1071:

Misc Functions of the Department of Statistics (e1071), TU Wien. R package version

1.5-24.

Gelfand, A., Schmidt, A., Banerjee, S., and Sirmans, C. (2004). “Nonstationary multivari-

ate process modeling through spatially varying coregionalization.” Test, 13, 2, 263–312.

Gelfand, A. E. and Vounatsou, P. (2003). “Proper multivariate conditional autoregressive

models for spatial data analysis.” Biostatistics, 4, 11–25.

Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian Data Analysis. Chapman

and Hall.

Geweke, J. (1991). “Efficient simulation from the multivariate normal and Student t-

distributions subject to linear constraints and the evaluation of constraint probabilities.”

Computing Science and Statistics, 23, 571–578.

Gotway, C. A. and Stroup, W. W. (1995). “A generalized linear model approach to spatial

data analysis and prediction.” Journal of Agricultural, Biological, and Environmental

Statistics, 2, 157–178.

Gumpertz, M. L., Graham, J. M., and Ristaino, J. B. (1974). “Autologistic model of spatial

pattern of phytophthora epidemic in bell pepper: effects of soil variables on disease

presence.” Journal of Agricultural, Biological, and Environmental Statistics, 2, 131–

156.

Hans, C., Allenby, G. M., Craigmile, P. F., Lee, J. H., MacEachern, S. N., and Xu, X.

(2009). “Covariance decompositions for accurate computation in Bayesian scale-usage

models.” Tech. rep., No. 825, Department of Statistics, The Ohio State University,

Columbus, OH, 43210.

Hans, C. and Craigmile, P. F. (2009). truncatedNormals: R functions for truncated normal

distributions. R package version 0.4.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning:

Data Mining, Inference, and Prediction. New York, NY: Springer.

Hausman, J. and Wise, D. (1978). “A conditional probit model for qualitative choice: Dis-

crete decisions recognizing interdependence and heterogeneous preferences.” Econometrica, 46, 403–426.

141 Heagerty, P. J. and Lele, S. R. (1998). “A composite likelihood approach to binary spatial

data.” Journal of the American Statistical Association, 93, 1099–1111.

Higdon, D. (2002). “Space and space-time modeling using process convolutions.” In

Quantitative Methods for Current Environmental Issues, eds. C. Anderson, V. Barnett,

P. C. Chatwin, and A. H. El-Shaarawi, 37–56. Springer Verlag.

Higgs, M. D. and Hoeting, J. A. (2010). “A clipped latent-variable model for spatially

correlated ordered categorical data.” Computational Statistics and Data Analysis, 54,

1999–2011.

Huffer, F. W. and Wu, H. (1998). “Markov chain Monte Carlo for autologistic regression

models with application to the distribution of plant species.” Biometrics, 54, 509–524.

Imai, K. and van Dyk, D. A. (2005). “A Bayesian analysis of the multinomial probit model

using marginal data augmentation.” Journal of Econometrics, 124, 311–334.

Journel, A. G. (1983). “Nonparametric estimation of spatial distributions.” Mathematical

Geology, 15, 445–468.

Law, J. and Haining, R. (2004). “A Bayesian approach to modeling binary data: the case

of high-intensity crime areas.” Geographical Analysis, 36, 197–216.

Liang, K. Y. and Zeger, S. L. (1986). “Longitudinal Data Analysis Using Generalized

Linear Models.” Biometrika, 73, 13–22.

Lin, P. S. (2008). “Efficiency of quasi-likelihood estimation for spatially correlated binary

data on Lp spaces.” Journal of Statistical Planning and Inference, 138, 1528–1541.

Lin, P. S. and Clayton, M. K. (2005). “Analysis of binary spatial data by quasi-likelihood

estimating equations.” The Annals of Statistics, 33, 542–555.

Liu, X. and Daniels, M. J. (2006). “A new algorithm for simulating a correlation matrix

based on parameter expansion and reparameterization.” Journal of Computational and

Graphical Statistics, 15, 897–914.

Lunn, D. J., Thomas, A., Best, N., and Spiegelhalter, D. (2000). “WinBUGS – a Bayesian

modelling framework: concepts, structure, and extensibility.” Statistics and Computing,

10, 325–337.

Mardia, K. V. (1988). “Multi-dimensional multivariate Gaussian Markov random fields

with application to image processing.” Journal of Multivariate Analysis, 24, 265–284.

McCulloch, R. and Rossi, P. E. (1994). “An exact likelihood analysis of the multinomial

probit model.” Journal of Econometrics, 64, 207–240.

McCulloch, R. E., Polson, N. G., and Rossi, P. E. (2000). “A Bayesian analysis of the

multinomial probit model with fully identified parameters.” Journal of Econometrics,

99, 173–193.

Meng, X. L. and van Dyk, D. A. (1999). “Seeking efficient data augmentation schemes via

conditional and marginal augmentation.” Biometrika, 86, 301–320.

Munroe, D. K., Wolfinbarger, S. R., Calder, C. A., Shi, T., Xiao, N., Lamb, C. Q., and

Li, D. (2008). “The relationships between biomass burning, land-cover/-use change,

and the distribution of carbonaceous aerosols in mainland Southeast Asia: a review and

synthesis.” Journal of Land Use Science, 3, 161–183.

Nelder, J. A. and Wedderburn, R. W. A. (1972). “Generalized linear models.” Journal of

the Royal Statistical Society, Series A, 135, 370–384.

Nobile, A. (1998). “A hybrid Markov chain for the Bayesian analysis of the multinomial

probit model.” Statistics and Computing, 8, 229–242.

Paciorek, C. J. (2007). “Computational techniques for spatial logistic regression with large

data sets.” Computational Statistics and Data Analysis, 51, 3631–3653.

Patterson, H. and Thompson, R. (1971). “Recovery of Inter-block Information When Block

Sizes are Unequal.” Biometrika.

Royle, J. A. and Berliner, L. M. (1999). “A hierarchical approach to multivariate spa-

tial modeling and prediction.” Journal of Agricultural, Biological, and Environmental

Statistics, 4, 29–56.

Rue, H., Martino, S., and Chopin, N. (2009). “Approximate Bayesian inference for latent

Gaussian models by using integrated nested Laplace approximations.” Journal of the

Royal Statistical Society, Series B, 71, 1–35.

Sain, S. R. and Cressie, N. (2007). “A spatial model for multivariate lattice data.” Journal

of Econometrics, 140, 226–259.

Schabenberger, O. and Gotway, C. A. (2005). Statistical Methods for Spatial Data Analysis.

Boca Raton, Florida: Chapman & Hall/CRC.

Shaby, B. and Ruppert, D. (2010). “Tapered covariance: Bayesian estimation and asymp-

totics.” Journal of the American Statistical Association, Submitted.

Sherman, M., Apanasovich, T. V., and Carroll, R. J. (2006). “On estimation in binary

autologistic spatial models.” Journal of Statistical Computation and Simulation, 76,

167–179.

Sim, S. (2000). “A test for spatial correlation for binary data.” Statistics & Probability

Letters, 47, 129–134.

Solow, A. R. (1993). “On the efficiency of the indicator approach in geostatistics.” Mathe-

matical Geology, 25, 53–57.

Stein, M. (1999). Interpolation of Spatial Data: Some Theory for Kriging. New York:

Springer-Verlag.

Switzer, P. (1977). “Estimation of spatial distributions from point sources with application

to air pollution measurement.” Bulletin de l’Institute International de Statistique, 47,

123–137.

van Dyk, D. A. and Meng, X. L. (2001). “The art of data augmentation.” Journal of Computational and Graphical Statistics, 10, 1–50.

van Dyk, D. A. and Park, T. (2008). “Partially collapsed Gibbs samplers: theory and

methods.” Journal of the American Statistical Association, 103, 790–796.

Šaltytė Benth, J. and Dučinskas, K. (2005). “Linear discriminant analysis of multivariate

spatial-temporal regressions.” Scandinavian Journal of Statistics, 32, 281–294.

Wackernagel, H. (1998). Multivariate Geostatistics: An Introduction with Applications.

2nd ed. New York, NY: Springer-Verlag.

Waller, L. A. and Gotway, C. A. (2004). Applied Spatial Statistics for Public Health Data.

Hoboken, New Jersey: John Wiley & Sons, Inc.

Wedderburn, R. W. M. (1974). “Quasi-likelihood functions, generalized linear models, and

the Gauss-Newton method.” Biometrika, 61, 439–447.

Weir, I. S. and Pettitt, A. N. (1999). “Spatial modelling for binary data using a hidden con-

ditional autoregressive Gaussian process: a multivariate extension of the probit model.”

Statistics and Computing, 9, 77–86.

— (2000). “Binary probability maps using hidden conditional autoregressive Gaussian

processes with an application to Finnish common toad data.” Applied Statistics, 49,

473–484.

Xu, K., Wikle, C., and Fox, N. (2005). “A kernel-based spatio-temporal dynamical model

for nowcasting radar precipitation.” Journal of the American Statistical Association, 100,

1133–1144.

Yin, G. (2009). “Bayesian generalized method of moments.” Bayesian Analysis, 4, 191–

208.

Zheng, Y. and Zhu, J. (2008). “Markov chain Monte Carlo for a spatial-temporal autologis-

tic regression model.” Journal of Computational and Graphical Statistics, 17, 123–137.

Zhu, J., Zheng, Y., Carroll, A. L., and Aukema, B. H. (2008). “Autologistic regression

analysis of spatial-temporal binary data via Monte Carlo maximum likelihood.” Journal

of Agricultural, Biological, and Environmental Statistics, 13, 84–98.
