A Unified Exposure Prediction Approach for Multivariate Spatial
A unified exposure prediction approach for multivariate spatial data: from predictions to health analysis

Zheng Zhu

A Dissertation Presented to the Faculty of University of Cincinnati in Candidacy for the Degree of Doctor of Philosophy

Recommended for Acceptance by the Division of Biostatistics and Bioinformatics, Department of Environmental Health

Adviser: Roman A. Jandarov, Ph.D.

Abstract

Epidemiological cohort studies of health effects often rely on spatial models to predict ambient air pollutant concentrations at participants' residential addresses. Compared with traditional linear regression models, spatial models such as kriging provide more accurate predictions by taking into account spatial correlations within the data. Spatial models utilize regression covariates from high-dimensional databases provided by geographical information systems (GIS). This modeling requires dimension reduction techniques such as partial least squares, lasso, elastic net, etc. In the first chapter of this thesis, we presented a comparison of the performance of four potential spatial prediction models. The first two approaches are based on universal kriging (UK). The third and fourth approaches are based on random forests and Bayesian additive regression trees (BART), with some degree of spatial smoothing.

Multivariate spatial models are often considered for point-referenced spatial data, which contain multiple measurements at each monitoring location, so that correlation between measurements is anticipated. In the second chapter of the thesis, we proposed a chain model for analyzing multivariate spatial data. We showed that the chain model outperforms other spatial models such as universal kriging and the coregionalization model.

In the third chapter, we connected our spatial analysis with epidemiological studies of health effects of environmental chemical mixtures.
Specifically, we investigated the relationship between environmental chemical mixture exposure and the cognitive and motor development of infants. We proposed a framework to analyze health effects of environmental chemical mixtures. In the first stage, we performed dimension reduction of the exposure variables using principal component analysis. In the second stage, we applied best subset regression to obtain the final model.

© Copyright by Zheng Zhu, 2019. All rights reserved.

Acknowledgements

I would like to thank the Division of Biostatistics and Bioinformatics, Department of Environmental Health. I would also like to thank my advisor, Dr. Roman Jandarov, for guiding me through the PhD study. I am grateful for being his student. He set an excellent example as a successful scientific researcher. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my Ph.D. study.

Besides my advisor, I would like to thank the rest of my thesis committee: Dr. MB Rao, Dr. Won Chang and Dr. Sivaraman Balachandran, not only for their insightful comments and encouragement, but also for the hard questions which incented me to widen my research from various perspectives.

Thanks also to my research funding source, the Center for Environmental Genetics, and to the faculty and staff in the UC Department of Environmental Health.

Last but not least, I would like to thank my family: my parents and my wife, for supporting me spiritually throughout writing this thesis and my life in general.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 A comparison of spatial prediction models for traffic-related air pollution data
  1.1 Abstract
  1.2 Introduction
  1.3 Data
    1.3.1 Monitoring data
  1.4 Methods
    1.4.1 Correlated data
    1.4.2 Partial Least Square Regression
    1.4.3 Universal kriging with PLS regression
    1.4.4 Elastic net
    1.4.5 Universal kriging with Elastic net and Best subset
    1.4.6 Random forests
    1.4.7 Bayesian additive regression trees
    1.4.8 Tuning parameter selection
    1.4.9 Cross-validation and prediction accuracy
  1.5 Results
  1.6 Discussion

2 A Chain model for multivariate spatial data
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
    2.3.1 Correlated data
    2.3.2 Variable selection via Lasso
    2.3.3 Universal kriging with Lasso
    2.3.4 Co-Kriging
    2.3.5 Multivariate random forest
    2.3.6 Multivariate spatial regression model
    2.3.7 Chain model for multivariate spatial data
  2.4 Simulation
  2.5 Application to Traffic-related air pollution data
  2.6 Discussion
  2.7 Future work

3 A framework to analyze health effects of environmental chemical mixtures
  3.1 Abstract
  3.2 Introduction
  3.3 Methods
  3.4 Results
  3.5 Discussion

Bibliography

List of Tables

1.1 List of species on passive sample detectors
1.2 Pollutants summary statistics
1.3 10-fold cross-validation R2 statistics for four methods in heating season
1.4 10-fold cross-validation R2 statistics for four methods in non-heating season
1.5 Average prediction for each methods
1.6 Average prediction for each pollutants
2.1 List of GIS variables
2.2 Prediction accuracy for Univariate model vs Chain model (one simulation)
2.3 Prediction accuracy comparison for four spatial models
3.1 Summary statistics for demographic and outcome variables
3.2 Loadings for OC
3.3 Loadings for OP
3.4 Loadings for PBDE
3.5 Loadings for PCB
3.6 Loadings for PHTHBPA
3.7 Loadings for PYR
3.8 Summary of regression model for each biomarker category (mental score)
3.9 Summary of regression model for each biomarker category (motor score)
3.10 Best subset regression model for each biomarker category (mental score)
3.11 Best subset regression model for each biomarker category (motor score)

List of Figures

1.1 Monitor location map in Baltimore
2.1 Simulated spatial processes (one simulation). Note Y1, Y2, and Y3 are independent; Y4, Y5, and Y6 are correlated spatial processes.
2.2 Prediction accuracy density plots for UK and chain model in low and high correlations. The density plot is based on six spatial processes (3 independent and 3 correlated) with 100 simulations.

Chapter 1

A comparison of spatial prediction models for traffic-related air pollution data

1.1 Abstract

We present a geostatistical analysis of the traffic-related air pollution data collected by the Center for Clean Air Research (CCAR) at the University of Washington in the summer (non-heating season) and winter (heating season) of 2013 in Baltimore, MD. We analyze the spatial predictability of ozone (O3), nitrogen dioxide (NO2), nitrogen oxides (NOx), sulfur dioxide (SO2), pentane, isoprene, nonane, decane, undecane, dodecane (winter only), benzene, toluene, m-xylene, and o-xylene, measured at 43 monitoring locations. In this chapter, our goal is to understand the spatial predictability of the pollutants using geographical information system (GIS) covariates at each location. We compare the performance of four potential prediction approaches for spatial data. The first two approaches are based on universal kriging (UK). In the first approach, we model the mean structure in UK using a lower-dimensional representation of the GIS covariates obtained via partial least squares (PLS).
In the second approach, for each pollutant, we perform a two-step variable selection of the GIS covariates: elastic-net penalized regression followed by a search for the best subset of variables with the lowest Mallows Cp score. We then use the selected variables in UK. In the third approach, as an alternative to UK, we apply a random forests algorithm using the GIS covariates and thin-plate splines (to account for spatial variability). Finally, in the fourth approach, we predict the pollutants via an approach based on Bayesian additive regression trees (BART), a Bayesian version of random forests, using the GIS covariates and thin-plate splines. For each approach, the tuning parameters are selected and the performance of the prediction models is evaluated by 10-fold cross-validation. Based on our results, 1) the prediction accuracy for the non-heating season is in general higher than for the heating season (except for isoprene and NOx); 2) the pollutants O3 and SO2 are not predicted well in either season; 3) the performance of the models varies across pollutants, and no universally best model exists.

Keywords: Air pollution, variable selection, partial least squares, spatial misalignment, land use regression, universal kriging, random forest, spatial prediction

1.2 Introduction

In epidemiological cohort studies, estimating health effects of air pollution often relies on spatial models to predict pollutant concentrations at participants' residential addresses based on monitoring data. In contrast to naive models based on predicting individual exposures using region-wide averages or nearest-monitor approaches, the current state-of-the-art methods are more sophisticated and can lead to more accurate predictions.
The two most widely used classes of models in recent research for predicting exposures are land-use regression (LUR) [19, 23, 3, 29, 31], which uses Geographic Information System (GIS) covariates and satellite and remote sensing data, and universal kriging (UK) [2, 22, 24]. In this chapter, we focus on prediction approaches based on universal kriging and random forests. In addition to using GIS covariates to construct the mean structure of the model as in land-use regression, UK also allows for spatial dependence by modeling correlations between residuals at different locations [27, 32]. Random forests, a machine learning method, are often used to solve prediction problems in a broad range of applications.
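
To make the distinction between land-use regression and universal kriging concrete, the following minimal Python sketch predicts a response as a regression mean on covariates plus a kriged spatial residual. It is an illustration only and not the implementation used in this thesis: the exponential covariance function, the parameter values, the simulated data, and all function and variable names are assumptions, and the covariance parameters are treated as known rather than estimated from the data.

```python
import numpy as np

def exp_cov(d, sill=1.0, range_=1.0):
    """Exponential covariance as a function of distance d (assumed form)."""
    return sill * np.exp(-d / range_)

def pairwise_dist(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def uk_predict(coords, X, y, coords_new, X_new, sill, range_, nugget):
    """Universal kriging predictor with known covariance parameters.

    The mean X @ beta is estimated by generalized least squares, and the
    spatially correlated residuals are then kriged to the new locations.
    """
    n = len(y)
    Sigma = exp_cov(pairwise_dist(coords, coords), sill, range_) + nugget * np.eye(n)
    c0 = exp_cov(pairwise_dist(coords_new, coords), sill, range_)  # shape (m, n)
    Si_X = np.linalg.solve(Sigma, X)
    Si_y = np.linalg.solve(Sigma, y)
    beta = np.linalg.solve(X.T @ Si_X, X.T @ Si_y)  # GLS estimate of the mean
    resid = y - X @ beta
    pred = X_new @ beta + c0 @ np.linalg.solve(Sigma, resid)
    return pred, beta

# Simulated example: 43 hypothetical monitors with two made-up covariates.
rng = np.random.default_rng(0)
n, m = 43, 5
coords = rng.uniform(0, 1, size=(n, 2))
coords_new = rng.uniform(0, 1, size=(m, 2))
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X_new = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
beta_true = np.array([1.0, 0.5, -0.3])

# Spatially correlated residuals plus independent measurement noise.
Sigma_true = exp_cov(pairwise_dist(coords, coords), sill=1.0, range_=0.3)
eta = np.linalg.cholesky(Sigma_true + 1e-10 * np.eye(n)) @ rng.normal(size=n)
y = X @ beta_true + eta + rng.normal(scale=np.sqrt(0.1), size=n)

pred_new, beta_hat = uk_predict(coords, X, y, coords_new, X_new,
                                sill=1.0, range_=0.3, nugget=0.1)
print("GLS coefficients:", beta_hat)
print("Predictions at new sites:", pred_new)
```

Dropping the kriged residual term (the second summand of `pred`) leaves a pure regression prediction in the spirit of LUR. In practice the sill, range, and nugget would be estimated, for example by maximum likelihood, rather than fixed in advance.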