Lecture 1: Introduction to Gaussian Markov random fields

David Bolin, Chalmers University of Technology, January 19, 2015

Practical information

Literature: The course will mostly be based on the book Gaussian Markov Random Fields: Theory and Applications by Håvard Rue and Leonhard Held. Additional articles will be used later on.
Homepage: http://www.math.chalmers.se/~bodavid/GMRF2015/
Schedule: We will meet twice each week, Mondays and Tuesdays (10-12). Lectures will be in MVL:14. There will be 10 lectures and 4 computer labs.

Examination

There will be two components in the examination:
• Project assignments introduced in the computer labs
• An oral exam at the end of the course

The projects can be done individually or in pairs.

The final oral exam is individual.

The grading scale comprises Fail (U) and Pass (G).

Successful completion of the course is rewarded with 7.5 hp.

Three relevant questions

Why take this course when we have had one course on Markov random fields and one course on Gaussian random fields this year?

Why is it a good idea to learn about Gaussian Markov random fields?

What’s there to learn? Isn’t it all just Gaussian?

Outline of lectures

Lecture 1: Introduction
Lecture 2: Definitions and basic properties of GMRFs
Lecture 3: Simulation and conditioning
Lecture 4: Numerical methods for sparse matrices
Lecture 5: Intrinsic GMRFs
Lecture 6: MCMC estimation for hierarchical models
Lecture 7: Approximation techniques and INLA
Lecture 8: Stochastic PDEs and FEM
Lecture 9: SPDEs part 2
Lecture 10: Extensions and applications

Random fields

Random fields: A random field (or stochastic field), X(s, ω), s ∈ D, ω ∈ Ω, is a random function specified by its finite-dimensional joint distributions

F(y_1, ..., y_n; s_1, ..., s_n) = P(X(s_1) ≤ y_1, ..., X(s_n) ≤ y_n)

for every finite n and every collection s_1, ..., s_n of locations in D.
• The set D is usually a subset of R^d.
• At every location s ∈ D, X(s, ω) is a random variable, where the event ω lies in some abstract sample space Ω.
• Kolmogorov's existence theorem can be used to ensure that the random field has a valid mathematical specification.
• To simplify the notation, one often writes X(s), removing the dependency on ω from the notation.

Gaussian random fields

An important special case is when the random field is Gaussian.

Gaussian random fields: A Gaussian random field X(s) is defined by a mean function µ(s) = E(X(s)) and a covariance function C(s, t) = Cov(X(s), X(t)). It has the property that, for every finite collection of points {s_1, ..., s_p},

x ≡ (X(s_1), ..., X(s_p))^T ∼ N(µ, Σ),

where Σij = C(si, sj).

For existence of a Gaussian field with a prescribed mean and covariance it is enough to ensure that C is positive definite.

Positive definite functions

A function C(s, t) is positive definite if, for any finite set of locations s_1, ..., s_n in D, the covariance matrix

$$\Sigma = \begin{pmatrix} C(s_1, s_1) & C(s_1, s_2) & \cdots & C(s_1, s_n) \\ C(s_2, s_1) & C(s_2, s_2) & \cdots & C(s_2, s_n) \\ \vdots & \vdots & \ddots & \vdots \\ C(s_n, s_1) & C(s_n, s_2) & \cdots & C(s_n, s_n) \end{pmatrix}$$

is non-negative definite: z^T Σ z ≥ 0 for all real-valued vectors z.

(Note the inconsistency in the notation here: a positive definite function only requires a positive semi-definite matrix.)
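As a quick numerical illustration (a minimal Python sketch, not part of the lecture material), the following snippet builds the covariance matrix Σ for an exponential covariance function at a set of random locations and checks that its smallest eigenvalue is non-negative; the range parameter rho and the locations are arbitrary choices.

```python
# Minimal sketch: the exponential covariance C(s, t) = exp(-||s - t|| / rho)
# yields a non-negative definite covariance matrix at any set of locations.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
s = rng.uniform(0, 10, size=(50, 2))      # 50 random locations in D = [0, 10]^2
rho = 2.0                                  # assumed range parameter
Sigma = np.exp(-cdist(s, s) / rho)         # Sigma_ij = C(s_i, s_j)

# All eigenvalues of Sigma should be >= 0 (up to round-off),
# so z^T Sigma z >= 0 for every real vector z.
print(np.linalg.eigvalsh(Sigma).min())
```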

Stationary random fields

A common simplifying assumption is that the random field is stationary.

Strict stationarity: A random field X(s) is called strictly stationary if, for any vector h and every collection s_1, ..., s_n of locations in D,

F(y_1, ..., y_n; s_1 + h, ..., s_n + h) = F(y_1, ..., y_n; s_1, ..., s_n).

Weak stationarity: A random field X(s) is called weakly stationary if, for any vector h and any locations s, t ∈ D,

µ(s + h) = µ(s), and C(s + h, t + h) = C(s, t) = C(s − t).

There is no distinction between the two concepts in the Gaussian case, and one then simply writes that the field is stationary.

Isotropic fields

An important subclass of the weakly stationary fields is the isotropic fields. These have covariance functions that depend only on the distance, and not the direction, between points, i.e. C(s_1, s_2) = C(‖s_1 − s_2‖).

In most practical applications of Gaussian random fields, the covariance function is chosen from a parametric family of isotropic covariance functions such as:
• the exponential covariance function,
• the Gaussian covariance function,
• the Matérn covariance function.

The standard choice: Gaussian Matérn fields

The Matérn covariance function:

$$C(h) = \frac{2^{1-\nu}\,\phi^2}{(4\pi)^{d/2}\,\Gamma(\nu + \tfrac{d}{2})\,\kappa^{2\nu}}\,(\kappa\|h\|)^{\nu} K_\nu(\kappa\|h\|), \quad h \in \mathbb{R}^d,\ \nu > 0.$$

Here ν is a shape parameter for the covariance function, κ a spatial scale parameter, φ² a variance parameter, Γ is the gamma function, and K_ν is a modified Bessel function of the second kind.
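To make the parameterisation concrete, here is a small Python sketch (illustrative parameter values only, not from the lecture) that evaluates this Matérn covariance using scipy's modified Bessel function of the second kind:

```python
import numpy as np
from scipy.special import kv, gamma

def matern_cov(h, phi=1.0, kappa=2.0, nu=1.0, d=2):
    """C(h) at distances h >= 0; h = 0 is filled with the variance (the h -> 0 limit)."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    const = 2.0**(1 - nu) * phi**2 / ((4 * np.pi)**(d / 2) * gamma(nu + d / 2) * kappa**(2 * nu))
    var = phi**2 * gamma(nu) / ((4 * np.pi)**(d / 2) * gamma(nu + d / 2) * kappa**(2 * nu))
    out = np.full_like(h, var)                 # variance at h = 0
    pos = h > 0
    out[pos] = const * (kappa * h[pos])**nu * kv(nu, kappa * h[pos])
    return out

print(matern_cov([0.0, 0.1, 1.0, 5.0]))
```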

Spectral representations

• An alternative to the covariance-based representation of Gaussian fields is to do the specification in the frequency domain.
• By Bochner's theorem, a function C is a valid covariance function if and only if it can be written as

$$C(h) = \int \exp(i h^T k)\, d\Lambda(k) \qquad (1)$$

for some non-negative and symmetric measure Λ.
• Equation (1) is called the spectral representation of the covariance function, and if the measure Λ has a Lebesgue density S, this is called the spectral density.
• For example, the spectral density associated with the Matérn covariance function is

$$S(k) = \frac{\phi^2}{(2\pi)^d (\kappa^2 + \|k\|^2)^{\nu + d/2}}.$$

Variograms

• Another popular representation, first proposed by Matheron (1971), is the (semi)variogram γ(h), which for a stationary field is defined as

$$\gamma(h) = \frac{1}{2}\,\mathrm{V}(X(s + h) - X(s)).$$

• One popular estimation method in geostatistics is to use so-called empirical variograms (see the sketch below).
• These can be useful for non-differentiable random fields but can be misleading for differentiable processes.
• We will not use variograms at all.
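For completeness, here is a rough sketch of a binned empirical variogram estimator: γ̂(h) is half the average of (X(s_i) − X(s_j))² over pairs whose distance falls in the bin around h. The helper name and binning choices are my own; nothing here is used later in the course.

```python
import numpy as np
from scipy.spatial.distance import pdist

def empirical_variogram(s, x, n_bins=15):
    d = pdist(s)                                              # pairwise distances ||s_i - s_j||
    g = 0.5 * pdist(x.reshape(-1, 1), metric="sqeuclidean")   # 0.5 (x_i - x_j)^2 per pair
    edges = np.linspace(0, d.max(), n_bins + 1)
    idx = np.digitize(d, edges) - 1
    gamma_hat = np.array([g[idx == k].mean() if np.any(idx == k) else np.nan
                          for k in range(n_bins)])
    return 0.5 * (edges[:-1] + edges[1:]), gamma_hat          # bin centres, estimates
```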

Geostatistics and kriging

One of the most important problems in geostatistics is spatial reconstruction of a random field X(s) given a finite number of observations Y = (Y_1, ..., Y_n)^T of the latent field at locations s_1, ..., s_n taken under measurement noise. The most popular method for spatial reconstruction in geostatistics was developed by Georges Matheron. Depending on the assumptions on the mean value function µ(s) for the latent field, linear kriging is usually divided into three cases:

Simple kriging: µ(s) is known.
Ordinary kriging: µ(s) = µ, where µ is unknown.
Universal kriging: µ(s) = Σ_{k=1}^m β_k b_k(s), where the b_k are known basis functions and the parameters β_k are unknown.

The kriging estimator of X(s) at some location s0 is derived as the minimum mean squared error linear predictor.
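A minimal simple-kriging sketch in Python (zero known mean, exponential covariance, additive measurement noise; all parameter values, locations and the simulated data are illustrative assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
s_obs = rng.uniform(0, 10, size=(30, 2))         # observation locations s_1, ..., s_n
s0 = np.array([[5.0, 5.0]])                      # prediction location s_0
rho, sigma2_eps = 2.0, 0.1                       # assumed range and noise variance

cov = lambda a, b: np.exp(-cdist(a, b) / rho)    # exponential covariance C(s, t)
Sigma_yy = cov(s_obs, s_obs) + sigma2_eps * np.eye(len(s_obs))   # Cov(Y)
Sigma_0y = cov(s0, s_obs)                                        # Cov(X(s_0), Y)

y = rng.multivariate_normal(np.zeros(len(s_obs)), Sigma_yy)      # simulated observations

# Simple kriging: the minimum mean squared error linear predictor of X(s_0), and its variance.
x_hat = Sigma_0y @ np.linalg.solve(Sigma_yy, y)
v_hat = cov(s0, s0) - Sigma_0y @ np.linalg.solve(Sigma_yy, Sigma_0y.T)
```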

Hierarchical models

There is a close connection between kriging and estimation in hierarchical models, which is the approach we will use.

A hierarchical model is constructed as a hierarchy of conditional models that, when multiplied together, yield the joint distribution for all quantities in the model.

Typically, we have a three-stage statistical model for data y modelled using a latent field x with hyperparameters θ, structured in a hierarchical way

π(y, x, θ) = π(y | x, θ) π(x | θ) π(θ)

The data y | x, θ

We have been given some data y.
• Normally distributed?
• Count data?
• Binary data?
• Point pattern?
• How was it collected? (Distance sampling? Capture/recapture? Exhaustive sample? Preferential sampling?)

We place all of this information into our likelihood π(y | x, θ). A typical situation is when the latent field is measured under additive noise, Y_i = X(s_i) + ε_i.

A common assumption is that ε_1, ..., ε_n are independent and identically distributed with some variance σ², uncorrelated with the latent process.

The latent field x | θ

In our models, we will assume that the data depends on some unobserved latent components x:
• Covariates
• Unstructured random effects ("")
• Structured random effects (temporal dependency, spatial dependency, smoothness terms)

The dependence between the data and the latent field can be linear or non-linear, but as these are not directly observed, the modelling assumptions need to be more restrictive.

The process model can in itself be written as a hierarchical model, specified by a number of conditional sub-models.

The hyperparameters θ

Both our likelihood and our latent field can depend on some hyperparameters θ:
• Variance of the observation noise
• Probability of a zero (zero-inflated models)
• Variance of the unstructured random field
• Range parameter of a structured random effect (effective correlation distance)

For a Bayesian model, we specify these using a joint prior π(θ).

Frequentists assume that the parameters are fixed but unknown. The model is then sometimes referred to as an empirical-Bayesian model, or empirical hierarchical model.

Inference

Inference in hierarchical models is performed using the posterior distribution

π(X, θ | Y) ∝ π(Y | X, θ) π(X | θ) π(θ).

Kriging predictions are calculated from the marginal posterior distribution

$$\pi(X \mid Y) \propto \int \pi(X \mid Y, \theta)\,\pi(\theta \mid Y)\, d\theta,$$

and one typically reports the posterior mean E(X | Y) as a point estimator and the posterior variance V(X | Y) as a measure of the uncertainty in the predictor.

The posterior distributions for X and θ generally have to be estimated using Markov chain Monte Carlo (MCMC) methods.

Inference II

In an empirical hierarchical model, inference is instead performed using the conditional posterior π(X | Y, θ̂). Here θ̂ is an estimate of θ obtained using, for example, maximum likelihood estimation, or maximum a posteriori estimation in the Bayesian setting.

The parameter model π(θ) can often be chosen so that the posterior mean and variance of X agree with the classical kriging predictions.

Even if this is not done, we will refer to the conditional mean of the posterior distribution as the kriging predictor.

Latent Gaussian Models

We call a Bayesian hierarchical model where π(x | θ) is a Gaussian distribution a latent Gaussian model (LGM):

θ ∼ π(θ)
x | θ ∼ π(x | θ) = N(0, Σ(θ))
y | x, θ ∼ ∏_i π(y_i | η_i, θ)

Note that we also assume that the observations are independent given the latent process.

This is a huge model class that is used in many seemingly unrelated areas, and which is especially useful if we let x be a GMRF

Bayesian linear models

Consider the linear model y_i = µ + β_1 c_i^1 + β_2 c_i^2 + u_i + ε_i.
• y_i is an observation
• µ is the intercept
• c^1 and c^2 are covariates (fixed effects) and β_1 and β_2 are the corresponding weights
• ε_i is i.i.d. normal observation noise
• u is a random effect

To make a Bayesian model, we need to choose some priors. Classical choices:
• β = (µ, β_1, β_2)^T ∼ N(0, σ²_fix I), where σ_fix is a large number

• u ∼ N(0, Σ_u), where the covariance matrix Σ_u is known
• ε ∼ N(0, σ²_n I)

(A small simulation sketch of this model follows below.)
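Here is a sketch of simulating one data set from this Bayesian linear model under the priors listed above; the covariates, σ_fix, σ_n and the particular form of Σ_u are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
c1, c2 = rng.normal(size=n), rng.normal(size=n)             # covariates
sigma_fix, sigma_n = 10.0, 0.5                              # assumed prior/noise scales

beta = rng.normal(0.0, sigma_fix, size=3)                   # (mu, beta_1, beta_2) ~ N(0, sigma_fix^2 I)
t = np.arange(n)
Sigma_u = np.exp(-np.abs(t[:, None] - t[None, :]) / 10.0)   # a known covariance for u (illustrative)
u = rng.multivariate_normal(np.zeros(n), Sigma_u)           # random effect u ~ N(0, Sigma_u)
eps = rng.normal(0.0, sigma_n, size=n)                      # i.i.d. observation noise

y = beta[0] + beta[1] * c1 + beta[2] * c2 + u + eps         # y_i = mu + beta_1 c_i^1 + beta_2 c_i^2 + u_i + eps_i
```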

Bayesian structured additive regression models

GLM/GAM/GLMM/GAMM/+++
• Perhaps the most important class of statistical models
• n-dimensional observation vector y, distributed according to an exponential family
• mean µ_i = E(y_i) linked to a linear predictor

$$\eta_i = g(\mu_i) = \alpha + z_i^T \beta + \sum_\gamma f_\gamma(c_{\gamma,i}) + u_i, \quad i = 1, \dots, n$$

where
α : intercept
β : linear effects of covariates z

{f_γ(·)} : non-linear/smooth effects of covariates c_γ
u : unstructured error terms

Bayesian structured additive regression models cont.

Flexibility due to many different forms of the unknown functions {f_γ(·)}:
• relax the linear relationship of covariates
• include random effects
• temporally and/or spatially indexed covariates

Special cases:
• Generalized linear models (GLM): g(µ) = α + Σ_{j=1}^m β_j z_j
• Generalized additive models (GAM): g(µ) = α + Σ_{j=1}^m f_j(z_j)

A latent Gaussian model is obtained by assigning Gaussian priors to all random terms {α, β, {f_γ(·)}, u} in the linear predictor.

Example: Disease mapping

• Data y_i ∼ Poisson(E_i exp(η_i))
• Log-relative risk: η_i = µ + u_i + v_i + f(c_i)
• Smooth effect of a covariate c
• Structured component u
• Unstructured component v

Example: Spatial ecological count data

• Spatio-temporal observations y_it ∼ nBin(E_it exp(η_it))
• Log-relative risk of bycatch: η_it = u_it + Σ_j f(c_it^j)
• Smooth effects of covariates c^j
• Spatio-temporal random field u

Example: Stochastic volatility

• Log daily differences of the pound-dollar exchange rates
• Data y_t | η_t ∼ N(0, exp(η_t))
• Volatility η_t = µ + u_t
• Unknown mean µ
• AR process u

A more surprising example:

Spatial point process models
• They focus on the random locations at which events happen.
• They make excellent models for 'presence only' data when coupled with an appropriate observation process.
• Realistic models can be quite complicated.

Log-Gaussian Cox processes

The homogeneous Poisson process is often too restrictive. Generalizations include:
• inhomogeneous Poisson processes - inhomogeneous intensity
• Markov point processes - local interactions among individuals
• Cox processes - random intensity

We focus on the Cox process, where the random intensity depends on a Gaussian random field Z(s):

Λ(s) = exp(Z(s))

If Y denotes the set of observed locations, the likelihood is

$$\log(\pi(Y \mid \eta)) = |\Omega| - \int_\Omega \Lambda(s)\, ds + \sum_{s_i \in Y} \log \Lambda(s_i),$$

This is very different to the previous examples!

Or is it?

NB: The number of points in a region R is Poisson distributed with mean ∫_R Λ(s) ds.

[Figure: point pattern of Andersonia heterophylla on a 55 × 55 grid]

• Divide the 'observation window' into rectangles.
• Let y_i be the number of points in rectangle i; then y_i | x_i, θ ∼ Po(e^{x_i}) (sketched below).
• The log-risk surface is replaced with x | θ ∼ N(µ(θ), Σ(θ)).

(Figure and example from Sigrunn Holbek Sørbye, University of Tromsø: Spatial point patterns - simple case studies.)
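The gridding step itself is just binning; a tiny Python sketch (the simulated point locations and the unit observation window are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
pts = rng.uniform(0, 1, size=(500, 2))       # stand-in for the observed point locations
n_grid = 55                                  # 55 x 55 grid, as in the example

counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=n_grid, range=[[0, 1], [0, 1]])
y = counts.ravel()                           # y_i = number of points in rectangle i
# Each y_i is then modelled as y_i | x_i, theta ~ Po(exp(x_i)),
# with x | theta ~ N(mu(theta), Sigma(theta)) as the discretised log-risk surface.
```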

Back to the linear model

Observation: (y, u, β) are jointly Gaussian!

$$\pi(y \mid u, \beta) \propto \exp\left(-\frac{\tau_n}{2}(y - u - X^T\beta)^T(y - u - X^T\beta)\right) = \exp\left(-\frac{\tau_n}{2}\begin{pmatrix} y^T & u^T & \beta^T \end{pmatrix}\begin{pmatrix} I & -I & -X^T \\ -I & I & X^T \\ -X & X & XX^T \end{pmatrix}\begin{pmatrix} y \\ u \\ \beta \end{pmatrix}\right)$$

It follows that

$$\pi(y, u, \beta) = \pi(y \mid u, \beta)\,\pi(u)\,\pi(\beta) \propto \exp\left(-\frac{\tau_n}{2}\begin{pmatrix} y^T & u^T & \beta^T \end{pmatrix}\begin{pmatrix} I & -I & -X^T \\ -I & I + \tau_n^{-1} Q_u & X^T \\ -X & X & XX^T + \frac{\tau_{\mathrm{fix}}}{\tau_n} I \end{pmatrix}\begin{pmatrix} y \\ u \\ \beta \end{pmatrix}\right)$$

Estimation

Let x = (y, u, β). When estimating the parameters in the model, using MCMC or ML, we have to evaluate the log-likelihood

$$-\frac{1}{2}\log|\Sigma(\theta)| - \frac{1}{2}\, x^T \Sigma(\theta)^{-1} x$$

We can easily calculate marginal distributions, for example to do kriging. Recall that if

$$\begin{pmatrix} x_A \\ x_B \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix}\right),$$

then the conditional distribution is given by

$$x_A \mid x_B \sim N\left(\mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\ \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}\right)$$
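A toy Python sketch of this conditioning formula, pulling out the blocks of an arbitrary (made-up) joint covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
n_A, n_B = 2, 3
A, B = np.arange(n_A), np.arange(n_A, n_A + n_B)

M = rng.normal(size=(n_A + n_B, n_A + n_B))
Sigma = M @ M.T + np.eye(n_A + n_B)          # a valid (positive definite) covariance
mu = rng.normal(size=n_A + n_B)
x_B = rng.normal(size=n_B)                   # the observed value of x_B

S_AA = Sigma[np.ix_(A, A)]
S_AB = Sigma[np.ix_(A, B)]
S_BB = Sigma[np.ix_(B, B)]

cond_mean = mu[A] + S_AB @ np.linalg.solve(S_BB, x_B - mu[B])   # E(x_A | x_B)
cond_cov = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)          # Cov(x_A | x_B)
```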

Can we calculate these things in practice?

Evaluating the likelihood and the kriging predictor both require solving Σ(θ)^{-1} x. Evaluating the likelihood also requires calculating |Σ(θ)|.

• Computations scale as O(N³)
• Storage scales as O(N²):
  • 2500 points for 20 years requires ∼20 Gbytes

Thus, even this very simple model is not feasible for large problems

For more complicated models that require MCMC, N does not have to be particularly large for computations to be a major problem

We need to decrease the computational burden!

“Low rank methods”

A popular approach to decrease the computational cost is to approximate X(s) using some basis expansion

$$X(s) = \sum_{j=1}^m w_j \varphi_j(s), \qquad (2)$$

where w_j are Gaussian random variables and {φ_j}_{j=1}^m are some pre-defined basis functions.

This allows us to write Σ(θ) = B Σ_w B^T, which basically gives O(K³) cost instead of O(N³) cost: choose K ≪ N (see the sketch after the list below). There are many ways to obtain these “low rank” approximations:
• Karhunen-Loève transforms
• Empirical orthogonal functions
• Process convolutions
• Fixed-rank kriging or predictive processes
• +++
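A hedged sketch of the low-rank trick in Python: with Σ = B Σ_w B^T + σ² I, a solve against an N-vector only needs an m × m factorisation via the Woodbury identity. The Gaussian-bump basis, m and σ² below are arbitrary illustrations, not a specific method from the list.

```python
import numpy as np

rng = np.random.default_rng(6)
N, m = 2000, 25
s = np.sort(rng.uniform(0, 10, size=N))                  # 1-D locations
centers = np.linspace(0, 10, m)

B = np.exp(-0.5 * ((s[:, None] - centers[None, :]) / 0.8)**2)   # B_ij = phi_j(s_i)
Sigma_w = np.eye(m)                                      # covariance of the weights w
sigma2 = 0.1                                             # nugget / noise variance

# Woodbury: (B Sigma_w B^T + sigma2 I)^{-1} y using only an m x m system.
y = rng.normal(size=N)
K = np.linalg.inv(Sigma_w) + (B.T @ B) / sigma2          # m x m "capacitance" matrix
z = y / sigma2 - B @ np.linalg.solve(K, B.T @ y) / sigma2**2
```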

Key Lesson: Sparse matrices

Definition (Sparse matrix): A matrix Q is called sparse if most of its elements are zero.

• There exist very efficient numerical algorithms to deal with sparse matrices
  • Computations scale as O(N^{3/2})
  • Storage scales as O(N): 2500 points for 20 years requires ∼400 kilobytes

Two possible options:
1. Force Σ to be sparse. This forces independence between variables.
2. Force the precision matrix Q = Σ^{-1} to be sparse. What does this correspond to?

Example: AR(1) process

The simplest example of a GMRF is the AR(1) process:

x_t = φ x_{t−1} + ε_t,  t = 1, 2, ...,  ε_t ∼ N(0, 1)

where t represents time and the distribution of x_0 is chosen as the stationary distribution of the process: x_0 ∼ N(0, 1/(1 − φ²)). The joint density for x is

$$\pi(x) = \pi(x_0)\,\pi(x_1 \mid x_0) \cdots \pi(x_{n-1} \mid x_{n-2}) = \frac{1}{(2\pi)^{n/2}}\, |Q|^{1/2} \exp\left(-\frac{1}{2} x^T Q x\right).$$

The matrix Q = Σ^{-1} is called the precision matrix. The covariance matrix for x is dense since all time points are dependent: Σ_ij = φ^{|i−j|}/(1 − φ²).

Conditional independence

However, the precision matrix Q = Σ^{-1} is sparse:

$$Q = \begin{pmatrix} 1 & -\phi & & & \\ -\phi & 1+\phi^2 & -\phi & & \\ & \ddots & \ddots & \ddots & \\ & & -\phi & 1+\phi^2 & -\phi \\ & & & -\phi & 1 \end{pmatrix}$$

What is the key property of this example that causes Q to be sparse?
• The key lies in the full conditionals

$$x_t \mid x_{-t} \sim N\left(\frac{\phi}{1+\phi^2}(x_{t-1} + x_{t+1}),\ \frac{\sigma^2}{1+\phi^2}\right)$$

• Each time point is only conditionally dependent on the two closest time points, which is the reason for the tridiagonal structure of Q (see the numerical check below).
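A small numerical check in Python that this tridiagonal Q really is the inverse of the dense AR(1) covariance (n and φ are arbitrary small values):

```python
import numpy as np
import scipy.sparse as sp

n, phi = 8, 0.6
main = np.r_[1.0, np.full(n - 2, 1.0 + phi**2), 1.0]     # diagonal: 1, 1+phi^2, ..., 1+phi^2, 1
Q = sp.diags([np.full(n - 1, -phi), main, np.full(n - 1, -phi)],
             offsets=[-1, 0, 1]).toarray()

i = np.arange(n)
Sigma = phi**np.abs(i[:, None] - i[None, :]) / (1.0 - phi**2)   # Sigma_ij = phi^|i-j| / (1 - phi^2)

print(np.allclose(Q @ Sigma, np.eye(n)))                 # True: Q = Sigma^{-1}
```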

Main features of GMRFs

• Analytically tractable
• Modelling using conditional independence
• Merging GMRFs using conditioning (hierarchical models)
• Unified framework for
  • understanding
  • representation
  • computation using numerical methods for sparse matrices
• Fits nicely into the MCMC world
• Can construct faster and more reliable block-MCMC algorithms
• Approximate Bayesian inference
• Approximate GRFs through SPDE representations

Usage of GMRFs (I)

Structural time-series analysis
• Autoregressive models
• Gaussian state-space models
• Computational algorithms based on the Kalman filter and its variants

Analysis of longitudinal and survival data
• temporal GMRF priors
• state-space approaches
• spatial GMRF priors
used to analyse longitudinal and survival data.

Usage of GMRFs (II)

Graphical models
• A key model
• Estimate Q and its (associated) graph from data
• Often used in a larger context

Semiparametric regression and splines
• Model a smooth curve in time or a surface in space
• Intrinsic GMRF models and spline models
• Discretely observed integrated Wiener processes are GMRFs
• GMRF models for coefficients in B-splines

Usage of GMRFs (III)

Image analysis
• Image restoration using the Wiener filter
• Texture modelling and texture discrimination
• Segmentation and object identification
• Deformable templates
• 3D reconstruction
• Restoring ultrasound images

Spatial statistics
• Latent GMRF model analysis of spatial binary data
• Geostatistics using GMRFs
• Analysis of data in social sciences and spatial econometrics
• Spatial and space-time epidemiology
• Environmental statistics
• Inverse problems