An Introduction to Model-Based Geostatistics
Total Page:16
File Type:pdf, Size:1020Kb
An Introduction to Model-based Geostatistics Peter J Diggle School of Health and Medicine, Lancaster University and Department of Biostatistics, Johns Hopkins University September 2009 Outline • What is geostatistics? • What is model-based geostatistics? • Two examples – constructing an elevation surface from sparse data – tropical disease prevalence mapping Example: data surface elevation Z 6 65 70 75 80 85 90 95100 5 4 Y 3 2 1 0 1 2 3 X 4 5 6 Geostatistics • traditionally, a self-contained methodology for spatial prediction, developed at Ecole´ des Mines, Fontainebleau, France • nowadays, that part of spatial statistics which is concerned with data obtained by spatially discrete sampling of a spatially continuous process Kriging: find the linear combination of the data that best predicts the value of the surface at an arbitrary location x Model-based Geostatistics • the application of general principles of statistical modelling and inference to geostatistical problems – formulate a statistical model for the data – fit the model using likelihood-based methods – use the fitted model to make predictions Kriging: minimum mean square error prediction under Gaussian modelling assumptions Gaussian geostatistics (simplest case) Model 2 • Stationary Gaussian process S(x): x ∈ IR E[S(x)] = µ Cov{S(x), S(x′)} = σ2ρ(kx − x′k) 2 • Mutually independent Yi|S(·) ∼ N(S(x),τ ) Point predictor: Sˆ(x) = E[S(x)|Y ] • linear in Y = (Y1, ..., Yn); • interpolates Y if τ 2 = 0 • called simple kriging in classical geostatistics Predictive distribution • choose the target for prediction, F(S), where S = {S(x): x ∈ A} • draw samples Si : i = 1, ..., N from [S|Y ] • then Fi = F(Si): i = 1, ..., N is a sample from required predictive distribution [F(S)|Y ] Interpolating the elevation surface Under Gaussian modelling assumptions, we need to: • identify a parametric family of correlation functions • fit the model • use the model for prediction • identify a parametric family of correlation functions The empirical variogram 1 2 (xi, Yi): i = 1, ..., n uij = ||xi − xj || vij = (yi − yj) 2 The theoretical variogram 1 2 2 V (u) = Var{Y (x) − Y (x − u)} = τ + σ {1 − ρ(u)} 2 Exploratory analysis E[vij ] = V (uij ) ⇒ smoothed scatterplot of (uij ,vij ) identifies rough shape of ρ(u) and initial estimates of model parameters geoR code: library(geoR) data(elevation) summary(elevation) vario<-variog(elevation,uvec=0.2*(0:25)) plot(vario) ?variog vario2<-variog(elevation,uvec=0.2*(0:25),trend="1st") plot(vario2) plot(vario$u,vario$v,type="l",xlim=c(0,5),ylim=c(0,7000), xlab="u",ylab="V(u)") lines(vario2$u,vario2$v,col="red") • identify a parametric family of correlation functions • fit the model 1. Classical: compute maximum likelihood estimates θˆ 2. Bayesian: prior [θ] implies posterior [θ|Y ] geoR code for option 1: mlfit<-likfit(elevation,ini.cov.pars=c(5000,2.0), cov.model="matern",kappa=1) • identify a parametric family of correlation functions • fit the model • use the model for prediction 1. Plug-in:[S|Y ; θˆ] 2. Bayesian:[S|Y ] = R [S|Y ; θ][θ|Y ]dθ geoR code for option 1: region<-matrix(c(0,0,6.4,0,6.4,6.4,0,6.4),4,2,T) grid<-pred_grid(region,by=0.2) KC<-krige.control(obj.model=mlfit) OC<-output.control(n.predictive=100) set.seed(24367) predictions<-krige.conv(geodata=elevation,locations=grid, borders=region,krige=KC,output=OC) image(predictions) points(elevation,add=T) Tropical disease prevalence mapping • “river blindness” – an endemic disease in wet tropics • donation programme of mass treatment with ivermectin • approximately 50 million people treated to date (target is 80 million by 2015) • serious adverse reactions experienced by some patients highly co-infected with Loa loa parasites • precautionary measures put in place before mass treatment in areas of high Loa loa prevalence http://www.who.int/pbd/blindness/onchocerciasis/en/ Diggle et al, Annals of Tropical Medicine and Parasitology, 101, 499–509. The Loa loa prediction problem Ground-truth survey data • random sample of subjects in each of a number of villages • blood-samples test positive/negative for Loa loa Environmental data (satellite images) • measured on regular grid to cover region of interest • elevation, green-ness of vegetation Objectives • predict local prevalence throughout study-region (Cameroon) • compute local exceedance probabilities, P(prevalence > 0.2|data) Loa loa: a generalised linear model • Latent spatial process S(x) ∼ SGP{0, σ2, ρ(u)} ρ(u) = exp(−|u|/φ) • Linear predictor d(x) = environmental variables at location x η(x) = d(x)′β + S(x) p(x) = exp{η(x)}/[1 + exp{η(x)}] • Conditional distribution for positive proportion Yi/ni Yi|S(·) ∼ Bin{ni, p(xi)} The modelling strategy • use relationship between environmental variables and ground-truth prevalence to construct preliminary predictions via logistic regression • use local deviations from regression model to estimate smooth residual spatial variation • use fitted model for predictive inference logit prevalence vs elevation logit prevalence −5 −4 −3 −2 −1 0 0 500 1000 1500 elevation logit prevalence vs max NDVI logit prevalence −5 −4 −3 −2 −1 0 0.65 0.70 0.75 0.80 0.85 0.90 Max Greeness Comparing non-spatial and spatial predictions in Cameroon Non-spatial 60 50 40 30 20 10 0 Observed prevalence (%) 0 10 20 30 Predicted prevalence - 'without ground truth data' Spatial 60 50 40 30 20 10 0 Observed prevalence (%) 0 10 20 30 40 Predicted prevalence - 'with ground truth data' (%) Probabilistic prediction in Cameroon Take-home message • model-based approach: – makes assumptions explicit – makes choice of analysis strategy less subjective – emphasises uncertainty • exceedance probabilty maps are often more useful than point predictions and standard errors • text-book linked to geoR software Diggle, P.J. and Ribeiro, P.J. (2007). Model-based Geostatistics. New York : Springer..