Bayesian Probit Regression Models for Spatially-Dependent Categorical Data
DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Candace Berrett, B.S., M.S.
Graduate Program in Statistics

The Ohio State University
2010

Dissertation Committee:
Catherine A. Calder, Advisor
L. Mark Berliner
Peter F. Craigmile
Elizabeth A. Stasny

© Copyright by Candace Berrett 2010

ABSTRACT

Data augmentation/latent variable methods have been widely recognized for facilitating model fitting in the Bayesian probit regression model. First proposed by Albert and Chib (1993) for independent binary and multi-category data, the latent variable representation of the Bayesian probit regression model allows model fitting to be performed using a simple Gibbs sampler and, for more than two categories, also allows the so-called assumption of irrelevant alternatives required by the logistic regression model to be relaxed (Hausman and Wise, 1978). To accommodate residual spatial dependence, the latent variable specification of the Bayesian probit regression model can be extended to incorporate standard parametric covariance models typically used in analyses of spatially-dependent continuous data, defining what we term the Bayesian spatial probit regression model. In this dissertation, we develop and extend the Bayesian spatial probit regression model by (i) introducing efficient model-fitting algorithms, (ii) deriving classification methods based on the model, and (iii) extending the model to the multi-category spatial setting.

Statistical models for spatial data are notoriously cumbersome to fit necessitating the availability of fast and efficient model-fitting algorithms.
To improve the efficiency of the Gibbs sampler used to fit the Bayesian regression model for independent categorical response variables, Imai and van Dyk (2005) propose introducing a working parameter into the model and compare various data augmentation strategies resulting from different treatments of the working parameter. We build on this work by investigating the efficiency of modified and extended versions of conditional and marginal data augmentation Markov chain Monte Carlo (MCMC) algorithms for the spatial probit regression model, focusing on the special case of binary spatially-dependent response variables.

Within the classification literature, methods that exploit spatial dependence are limited. We show how a spatial classification rule can be derived from the Bayesian spatial probit regression model. In addition, we compare our proposed spatial classifier to various other classifiers in terms of training and test error rates using a land-cover/land-use data set.

When extending the spatial probit regression model to the multi-category setting, care must be taken to ensure that model parameters are estimable and interpretable. Considering three types of categorical and spatial covariate information, we discuss various specifications of the latent variable mean structure and the associated parameter interpretations. Additionally, we explore the specification of the latent variable cross space-category dependence structure and discuss how data augmentation MCMC strategies for fitting the Bayesian spatial probit regression model can be extended to the multi-category setting.

Dedicated to my parents, Bob and Nanette, and siblings, Tenille, Nat, Preston, MeChel, and Taylor.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Dr. Kate Calder, who over the last four and a half years has devoted a substantial amount of time and effort in training me to be a well-rounded statistician.
She has provided me with numerous opportunities to learn and grow through research, teaching, mentoring, and collaboration. She has also become a good friend, whom I admire professionally and personally, and I am grateful for her example and support.

I would like to thank my committee members: Dr. Mark Berliner for his comments on my research, his help with job and fellowship applications, and for allowing me to laugh in his class; Dr. Elizabeth Stasny for her comments on my research, her help with job and fellowship applications, her support as graduate chair, and in encouraging me to come to Ohio State; and Dr. Peter Craigmile for his valuable comments and contributions to my research.

I would like to thank Dr. Darla Munroe and Dr. Ningchuan Xiao of the Department of Geography for their generous assistance in obtaining and understanding the land cover data used in this work.

I would like to thank the other professors in the Department of Statistics who have provided guidance and support during my time at Ohio State: Dr. Doug Wolfe, Dr. Tao Shi, Dr. Chris Hans, Dr. Jackie Miller, Dr. Steve MacEachern, and Dr. Noel Cressie.

I would like to thank Lisa Van Dyke for her help in answering my many graduation questions and in pulling together the final documents of this dissertation. I would also like to thank Terry England for her help with all my travel and posters.

Support for this research was provided by grants from NASA (NNG06GD31G) and the NSF (ATM-0934595).

Finally, I would like to thank my family and many friends, who all believed in me when I didn’t believe in myself; and God, for giving me strength and understanding, and providing me with opportunities to grow.

VITA

1983                Born - Ogden, Weber, Utah, USA
2005                B.S. Actuarial Science, cum laude, Brigham Young University
2005 - 2006         University Fellow, Graduate School, The Ohio State University
2005 - 2006, 2010   Teaching Assistant, Department of Statistics, The Ohio State University
2007                M.S. Statistics, The Ohio State University
2007 - 2010         Research Assistant, Department of Statistics, The Ohio State University
2009                Graduate Fellow, Statistical and Applied Mathematical Sciences Institute

PUBLICATIONS

Research Publications

Xiao, N., Shi, T., Calder, C.A., Munroe, D.K., Berrett, C., Wolfinbarger, S., and Li, D. (2008) “Spatial Characteristics of the Difference between MISR and MODIS Aerosol Optical Depth Retrievals over Mainland Southeast Asia,” Remote Sensing of Environment, DOI: 10.1016/j.rse.2008.07.011.

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Background and Motivation
   1.2 Modeling Categorical Spatial Data
      1.2.1 The Spatial Generalized Linear Model
      1.2.2 The Spatial Generalized Linear Mixed Model
      1.2.3 Indicator Kriging
      1.2.4 The Autologistic Model
      1.2.5 The Bayesian Spatial Probit Regression Model
   1.3 Overview of Contributions
   1.4 Illustrative Data Set
2. Bayesian Spatial Probit Regression
   2.1 The Bayesian Probit Regression Model
      2.1.1 Albert and Chib’s Data Augmentation Strategy
      2.1.2 Multi-Category and Multivariate Extensions
   2.2 The Bayesian Spatial Probit Regression Model
      2.2.1 Model Specification
      2.2.2 Parameterization of the Spatial Correlation Matrix
3. Data Augmentation MCMC Strategies
   3.1 Data Augmentation MCMC Strategies
      3.1.1 Conditional versus Marginal Data Augmentation
      3.1.2 Partially Collapsed Algorithms
      3.1.3 Full Conditional Distributions
   3.2 Simulation Study
      3.2.1 Simulation Set-up
      3.2.2 Simulation Results
   3.3 Application
   3.4 Summary
4. The Bayesian Spatial Probit Regression Model as a Tool for Classification
   4.1 The Classification Problem
   4.2 GLM-Based Classification
      4.2.1 Non-Spatial GLM-Based Classification
      4.2.2 Spatial GLM-Based Classification
   4.3 Alternative Classification Methods
      4.3.1 Discriminant Analysis
      4.3.2 Support Vector Machines
      4.3.3 k-Nearest Neighbors
   4.4 Comparison of Classification Methods
      4.4.1 Parameter Estimation
      4.4.2 Classification Errors
   4.5 Summary
5. Bayesian Spatial Multinomial Probit Regression
   5.1 The Bayesian Spatial Multinomial Probit Regression Model
      5.1.1 Latent Mean Specification
      5.1.2 Parameterization of the Space-Category Covariance Matrix
   5.2 Model-Fitting
      5.2.1 Data Augmentation MCMC Algorithms
   5.3 Summary
6. Contributions and Future Work

LIST OF TABLES

3.1 This table lists the steps in each of the data augmentation algorithms. The first portion shows the non-collapsed data augmentation algorithms introduced in Section 3.1.1. The second portion shows the partially collapsed data augmentation algorithms introduced in Section 3.1.2.
3.2 Scenarios used to compare the marginal and conditional data augmentation algorithms.
3.3 Autocorrelations of the sample paths of β1 and ρ for the land cover data analysis.
4.1 Fitted values for the covariance function parameters for both class C0 and C1.
4.2 Tuning parameter values for each classification method. The optimal value of the tuning parameter is listed along with the CVE associated with this value. The optimal values were chosen by minimizing the five-fold CVE.
4.3 Training and test errors for the SE Asia land cover data obtained using each of the classification methods discussed in this chapter.

LIST OF FIGURES

1.1 Land cover over Southeast Asia, covering the region bounded by 17° to 21°N and 98° to 105°E. The data were taken from the MODIS Land Cover Type Yearly Level 3 Global 500m (MOD12Q1 and MCD12Q1) data product for the year 2005.
1.2 Elevation (in meters) over the region bounded by 17° to 21°N and 98° to 105°E.
1.3 Standardized value of the measured distance to the nearest major road over the region bounded by 17° to 21°N and 98° to 105°E.
1.4 Standardized value of the measured distance to the coast over the region bounded by 17° to 21°N and 98° to 105°E.