An Introduction to Model-based Geostatistics

Peter J Diggle School of Health and Medicine, Lancaster University and Department of , Johns Hopkins University

September 2009 Outline

• What is geostatistics?

• What is model-based geostatistics?

• Two examples

– constructing an elevation surface from sparse – tropical disease prevalence mapping

Example: surface elevation data

100

95

90

85

Z

80

75

70 65 6

5

4 6 5 3 Y 4 2 3 X 1 2 1 0 Geostatistics

• traditionally, a self-contained methodology for spatial prediction, developed at Ecole´ des Mines, Fontainebleau, France

• nowadays, that part of spatial which is concerned with data obtained by spatially discrete of a spatially continuous process

Kriging: find the linear combination of the data that best predicts the value of the surface at an arbitrary location x Model-based Geostatistics

• the application of general principles of statistical modelling and inference to geostatistical problems

– formulate a for the data – fit the model using likelihood-based methods – use the fitted model to make predictions

Kriging: minimum square error prediction under Gaussian modelling assumptions Gaussian geostatistics (simplest case) Model

2 • Stationary S(x): x ∈ IR

E[S(x)] = µ Cov{S(x), S(x′)} = σ2ρ(kx − x′k)

2 • Mutually independent Yi|S(·) ∼ N(S(x),τ )

Point predictor: Sˆ(x) = E[S(x)|Y ]

• linear in Y = (Y1, ..., Yn);

• interpolates Y if τ 2 = 0

• called simple kriging in classical geostatistics Predictive distribution

• choose the target for prediction, F(S), where S = {S(x): x ∈ A}

• draw samples Si : i = 1, ..., N from [S|Y ]

• then Fi = F(Si): i = 1, ..., N is a sample from required predictive distribution [F(S)|Y ] Interpolating the elevation surface

Under Gaussian modelling assumptions, we need to:

• identify a parametric family of correlation functions

• fit the model

• use the model for prediction • identify a parametric family of correlation functions The empirical

1 2 (xi, Yi): i = 1, ..., n uij = ||xi − xj || vij = (yi − yj) 2

The theoretical variogram

1 2 2 V (u) = Var{Y (x) − Y (x − u)} = τ + σ {1 − ρ(u)} 2

Exploratory analysis

E[vij ] = V (uij ) ⇒ smoothed scatterplot of (uij ,vij ) identifies rough shape of ρ(u) and initial estimates of model parameters geoR code: library(geoR) data(elevation) summary(elevation) vario<-variog(elevation,uvec=0.2*(0:25)) plot(vario) ?variog vario2<-variog(elevation,uvec=0.2*(0:25),trend="1st") plot(vario2) plot(vario$u,vario$v,type="l",xlim=c(0,5),ylim=c(0,7000), xlab="u",ylab="V(u)") lines(vario2$u,vario2$v,col="red") • identify a parametric family of correlation functions

• fit the model

1. Classical: compute maximum likelihood estimates θˆ 2. Bayesian: prior [θ] implies posterior [θ|Y ] geoR code for option 1: mlfit<-likfit(elevation,ini.cov.pars=c(5000,2.0), cov.model="matern",kappa=1) • identify a parametric family of correlation functions

• fit the model

• use the model for prediction

1. Plug-in:[S|Y ; θˆ] 2. Bayesian:[S|Y ] = R [S|Y ; θ][θ|Y ]dθ geoR code for option 1: region<-matrix(c(0,0,6.4,0,6.4,6.4,0,6.4),4,2,T) grid<-pred_grid(region,by=0.2) KC<-krige.control(obj.model=mlfit) OC<-output.control(n.predictive=100) set.seed(24367) predictions<-krige.conv(geodata=elevation,locations=grid, borders=region,krige=KC,output=OC) image(predictions) points(elevation,add=T)

Tropical disease prevalence mapping

• “river blindness” – an endemic disease in wet tropics

• donation programme of mass treatment with ivermectin

• approximately 50 million people treated to date (target is 80 million by 2015)

• serious adverse reactions experienced by some patients highly co-infected with Loa loa parasites

• precautionary measures put in place before mass treatment in areas of high Loa loa prevalence http://www.who.int/pbd/blindness/onchocerciasis/en/ Diggle et al, Annals of Tropical Medicine and Parasitology, 101, 499–509. The Loa loa prediction problem

Ground-truth survey data • random sample of subjects in each of a number of villages

• blood-samples test positive/negative for Loa loa Environmental data (satellite images) • measured on regular grid to cover region of interest

• elevation, green-ness of vegetation Objectives • predict local prevalence throughout study-region (Cameroon)

• compute local exceedance probabilities, P(prevalence > 0.2|data) Loa loa: a generalised

• Latent spatial process S(x) ∼ SGP{0, σ2, ρ(u)} ρ(u) = exp(−|u|/φ)

• Linear predictor d(x) = environmental variables at location x η(x) = d(x)′β + S(x) p(x) = exp{η(x)}/[1 + exp{η(x)}]

• Conditional distribution for positive proportion Yi/ni

Yi|S(·) ∼ Bin{ni, p(xi)} The modelling strategy

• use relationship between environmental variables and ground-truth prevalence to construct preliminary predictions via

• use local deviations from regression model to estimate smooth residual spatial variation

• use fitted model for predictive inference logit prevalence vs elevation logit prevalence −5 −4 −3 −2 −1 0

0 500 1000 1500

elevation logit prevalence vs max NDVI logit prevalence −5 −4 −3 −2 −1 0

0.65 0.70 0.75 0.80 0.85 0.90

Max Greeness Comparing non-spatial and spatial predictions in Cameroon

Non-spatial

60

50

40

30

20

10 Observed prevalence (%) 0 0 10 20 30

Predicted prevalence - 'without ground truth data' Spatial

60

50

40

30

20

10

Observed prevalence (%) 0 0 10 20 30 40

Predicted prevalence - 'with ground truth data' (%) Probabilistic prediction in Cameroon Take-home message

• model-based approach:

– makes assumptions explicit – makes choice of analysis strategy less subjective – emphasises uncertainty

• exceedance probabilty maps are often more useful than point predictions and standard errors

• text-book linked to geoR software

Diggle, P.J. and Ribeiro, P.J. (2007). Model-based Geostatistics. New York : Springer.