
The Bayesian approach to inverse problems

Youssef Marzouk

Department of Aeronautics and Astronautics, Center for Computational Engineering, Massachusetts Institute of Technology
[email protected], http://uqgroup.mit.edu

7 July 2015

Statistical inference

Why is a statistical perspective useful in inverse problems?
- To characterize uncertainty in the inverse solution
- To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
- To make probabilistic predictions
- To choose “good” observations or experiments
- To address questions of model error and model validity


Bayes’ rule

p(θ|y) = p(y|θ) p(θ) / p(y)

Key idea: model parameters θ are treated as random variables (For simplicity, we let our random variables have densities)

Notation
- θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
- p(θ) is the prior probability density
- L(θ) ≡ p(y|θ) is the likelihood function
- p(θ|y) is the posterior probability density
- p(y) is the evidence, or equivalently, the marginal likelihood

Bayesian inference

Summaries of the posterior distribution: what information to extract?
- Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
- Posterior covariance or higher moments of θ
- Quantiles
- Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α. Credible intervals are not uniquely defined by this condition alone; thus consider, for example, the HPD (highest posterior density) region.
- Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations
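As a small illustration (not from the slides), several of these summaries can be estimated directly from posterior samples, e.g. produced by MCMC; a minimal Python/NumPy sketch:

```python
import numpy as np

def posterior_summaries(samples, alpha=0.05):
    """Monte Carlo estimates of common posterior summaries from 1-D samples."""
    samples = np.asarray(samples)
    mean = samples.mean()                                        # posterior mean
    var = samples.var(ddof=1)                                    # posterior variance
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])    # equal-tailed credible interval
    return {"mean": mean, "variance": var, "credible_interval": (lo, hi)}

# Example with synthetic "posterior" samples:
rng = np.random.default_rng(0)
print(posterior_summaries(rng.normal(loc=1.0, scale=0.5, size=10_000)))
```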

Bayesian and frequentist statistics

Understanding both perspectives is useful and important. Key differences between these two statistical paradigms:
- Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods p_θ(y) ≡ p(y|θ) but not priors p(θ) or posteriors; θ is not a random variable.
- In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.
- The estimator θ̂ is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.

Bayesian and frequentist statistics

Key differences (continued):
- Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
- One common estimator is maximum likelihood: θ̂_ML = argmax_θ p(y|θ). Here p(y|θ) defines a family of distributions indexed by θ.
- Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood.”

What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?
- Frequentist: “plug-in” or other estimators of y_new
- Bayesian: posterior prediction via integration
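A tiny illustration of this difference (not from the slides), using a conjugate Gaussian toy model where the posterior is available in closed form; all names and numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i | theta ~ N(theta, 1), prior theta ~ N(0, 1); posterior is N(m, v) in closed form
y = rng.normal(1.0, 1.0, size=20)
v = 1.0 / (1.0 + y.size)                                # posterior variance
m = v * y.sum()                                         # posterior mean
theta_samples = rng.normal(m, np.sqrt(v), size=5_000)

# Bayesian: posterior predictive draws integrate over theta; frequentist: plug in a point estimate
y_new_bayes = rng.normal(theta_samples, 1.0)            # one y_new draw per posterior draw
y_new_plugin = rng.normal(y.mean(), 1.0, size=5_000)    # plug-in with theta_hat = sample mean

print(y_new_bayes.std(), y_new_plugin.std())            # predictive spread is larger under integration
```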

Bayesian inference

Likelihood functions
- In general, p(y|θ) is a probabilistic model for the data.
- In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
- Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.

Selected examples of likelihood functions:
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise
3. Nonlinear forward model with noise + model discrepancy
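For example 2 above (nonlinear forward model with additive Gaussian noise), a minimal sketch of the resulting log-likelihood; the forward model g, the data y, and the noise covariance Γ_obs below are illustrative placeholders:

```python
import numpy as np

def gaussian_log_likelihood(theta, y, g, Gamma_obs):
    """log p(y | theta) for y = g(theta) + eps, eps ~ N(0, Gamma_obs),
    up to the additive constant -(m/2) log(2 pi) - (1/2) log det(Gamma_obs)."""
    r = y - g(theta)                                   # data misfit
    return -0.5 * r @ np.linalg.solve(Gamma_obs, r)

# Toy usage: scalar parameter, simple forward model, two observations
g = lambda theta: np.array([theta, 2.0 * theta])
y = np.array([1.1, 2.3])
Gamma_obs = 0.01 * np.eye(2)
print(gaussian_log_likelihood(1.0, y, g, Gamma_obs))
```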

Bayesian inference

Prior distributions
- In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
- Intuitive idea: assign lower probability to values of θ that you don’t expect to see, higher probability to values of θ that you do expect to see.

Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)

Gaussian process priors

Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω): D × Ω → R is multivariate normal. In other words, θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.
- Specify via a mean function and a covariance function:
  E[θ(x)] = µ(x),  E[(θ(x) − µ(x))(θ(x′) − µ(x′))] = C(x, x′)
- Smoothness of the process is controlled by the behavior of the covariance function as x′ → x.
- Possible restrictions: stationarity, isotropy, ...
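As an illustration (not from the slides), realizations of a zero-mean GP on a 1-D grid can be drawn by factorizing the covariance matrix built from a chosen kernel; a minimal sketch with a squared-exponential kernel:

```python
import numpy as np

def squared_exponential(x1, x2, sigma2=1.0, ell=0.2):
    """C(x, x') = sigma2 * exp(-|x - x'|^2 / (2 ell^2))."""
    return sigma2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
C = squared_exponential(x, x) + 1e-10 * np.eye(x.size)   # small jitter for numerical positive definiteness
L = np.linalg.cholesky(C)
theta = L @ rng.standard_normal(x.size)                   # one GP prior realization on the grid
```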

Example: stationary Gaussian random fields

Prior is a stationary Gaussian random field. [Figure: realizations drawn with an exponential covariance kernel (left) and a Gaussian covariance kernel (right).] Both are θ(x, ω): D × Ω → R, with D = [0, 1]².

Karhunen-Loève expansion:

M(x, ω) = µ(x) + Σ_{i=1}^{K} √λ_i c_i(ω) φ_i(x)
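A minimal numerical sketch of a truncated Karhunen-Loève expansion (not from the slides): build the covariance matrix on a grid, keep its leading eigenpairs, and combine them with independent standard normal coefficients. The kernel and truncation level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)         # exponential covariance kernel
lam, phi = np.linalg.eigh(C)                               # eigenpairs, ascending order
lam, phi = lam[::-1], phi[:, ::-1]                         # sort descending

K = 20                                                     # truncation level
c = rng.standard_normal(K)                                 # iid N(0, 1) KL coefficients c_i
theta = phi[:, :K] @ (np.sqrt(lam[:K]) * c)                # truncated KL realization (mu = 0)
```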

Gaussian Markov random fields

Key idea: discretize space and specify a sparse inverse covariance (“precision”) matrix W:

p(θ) ∝ exp( −(1/2) γ θ^T W θ ),

where γ controls the scale.

- Full conditionals p(θ_i | θ_{∼i}) are available analytically and may simplify dramatically.
- Represent as an undirected graphical model.
- Example: E[θ_i | θ_{∼i}] is just an average of site i’s nearest neighbors.
- Quite flexible; even used to simulate textures.
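A minimal sketch for a 1-D chain (not from the slides): W has nearest-neighbour (second-difference) structure, a small jitter makes the intrinsic precision proper, and the full conditional mean at an interior site reduces to the average of its two neighbours:

```python
import numpy as np

n, gamma = 100, 50.0
# Path-graph Laplacian: an intrinsic (singular) GMRF structure, so add a small jitter for properness
W = np.diag(np.r_[1.0, 2.0 * np.ones(n - 2), 1.0]) - np.eye(n, k=1) - np.eye(n, k=-1)
Q = gamma * W + 1e-4 * np.eye(n)                       # precision matrix (sparse in practice, dense here)

rng = np.random.default_rng(0)
L = np.linalg.cholesky(Q)
theta = np.linalg.solve(L.T, rng.standard_normal(n))   # theta ~ N(0, Q^{-1})

# Full conditional mean at an interior site i: E[theta_i | rest] = -(1/Q_ii) * sum_{j != i} Q_ij theta_j
i = 50
cond_mean = -(Q[i] @ theta - Q[i, i] * theta[i]) / Q[i, i]
print(cond_mean, 0.5 * (theta[i - 1] + theta[i + 1]))  # nearly the neighbour average (up to the jitter)
```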

Priors through differential operators

Key idea: return to the infinite-dimensional setting; again penalize roughness in θ(x).
- Stuart 2010: define the prior using fractional negative powers of the Laplacian A = −∆:
  θ ∼ N(θ_0, β A^{−α})
- Sufficiently large α (α > d/2), along with conditions on the likelihood, ensures that the posterior measure is well defined.

GPs, GMRFs, and SPDEs

In fact, all three “types” of Gaussian priors just described are closely connected.
- Linear fractional SPDE:
  (κ² − ∆)^{β/2} θ(x) = W(x),  x ∈ R^d,  β = ν + d/2,  κ > 0,  ν > 0
- Then θ(x) is a Gaussian field with Matérn covariance:
  C(x, x′) = σ² / (2^{ν−1} Γ(ν)) (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)
- The covariance kernel is the Green’s function of the differential operator:
  (κ² − ∆)^β C(x, x′) = δ(x − x′)
- ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared-exponential covariance.
- One can construct a discrete GMRF that approximates the solution of the SPDE (see Lindgren, Rue, Lindström, JRSSB 2011).
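A minimal sketch (not from the slides) of evaluating the Matérn covariance above with SciPy’s modified Bessel function of the second kind; parameter values are illustrative:

```python
import numpy as np
from scipy.special import kv, gamma

def matern_covariance(r, sigma2=1.0, kappa=10.0, nu=1.5):
    """C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r), with C(0) = sigma2."""
    r = np.asarray(r, dtype=float)
    kr = np.where(r == 0.0, 1.0, kappa * r)            # placeholder at r = 0 to avoid 0 * inf
    c = sigma2 / (2 ** (nu - 1) * gamma(nu)) * kr ** nu * kv(nu, kr)
    return np.where(r == 0.0, sigma2, c)

r = np.linspace(0.0, 0.5, 6)
print(matern_covariance(r, nu=0.5))                    # nu = 1/2 reproduces sigma2 * exp(-kappa * r)
```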

Hierarchical Gaussian priors


[Figure 1 (Calvetti & Somersalo): three realizations drawn from the prior (6) with constant variance θ_j = θ_0 (left), and from the corresponding prior where the variance is 100-fold larger at two points indicated by arrows (right).]

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013. Their first-order autoregressive Markov model gives a smoothness prior π_prior(x) ∝ exp( −(1/2) ‖D^{−1/2} L x‖² ), with L a first-difference matrix and D = diag(θ_1, ..., θ_n) a vector of local variances. Setting all θ_j = θ_0 gives homogeneous smoothness over the support interval; increasing some components makes jumps at those locations more likely (it does not force them). When the number, location, and amplitude of jumps are unknown, this lack of quantitative information is expressed by modeling the variance vector itself as a random variable, so that its estimation becomes part of the inverse problem.


Hierarchical Gaussian priors

[Figure 4 (Calvetti & Somersalo): approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3 and 5 iterations of the cyclic algorithm, using the GMRES method to compute the update of the image at each iteration step.]

[Figure 5 (Calvetti & Somersalo): approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3 and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.]

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013. For the CGLS iteration with an inverse gamma hyperprior, the objective function levels off after about five iterations, which could serve as a stopping criterion; after about seven iterations the norm of the estimation error starts to grow again, the semi-convergence typical of such algorithms, driven partly by speckling of individual pixel values near the discontinuity. The iterations should therefore be stopped soon after the objective function settles.

Non-Gaussian priors

Besov space B^s_{pq}(T): expand

θ(x) = c_0 + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} w_{j,h} ψ_{j,h}(x)

and require

‖θ‖_{B^s_{pq}(T)} := |c_0| + ( Σ_{j=0}^{∞} 2^{jq(s + 1/2 − 1/p)} ( Σ_{h=0}^{2^j − 1} |w_{j,h}|^p )^{q/p} )^{1/q} < ∞.

Consider p = q = s = 1:

‖θ‖_{B^1_{11}(T)} = |c_0| + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} 2^{j/2} |w_{j,h}|.

Then the distribution of θ is a Besov prior if α c_0 and α 2^{j/2} w_{j,h} are independent and Laplace(1). Loosely, π(θ) = exp( −α ‖θ‖_{B^1_{11}(T)} ).
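A minimal sketch (not from the slides) of drawing a realization from this B^1_{11} prior, assuming Haar wavelets for ψ_{j,h}; the truncation level J and the value of α are illustrative:

```python
import numpy as np

def haar_mother(t):
    """Mother Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0, 0.0) - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0)

def sample_besov_b111(n_grid=512, J=8, alpha=10.0, seed=0):
    """One realization of the B^1_{11} prior: alpha*c0 and alpha*2^{j/2}*w_{j,h} are iid Laplace(1)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    theta = np.full(n_grid, rng.laplace() / alpha)                     # c_0 term
    for j in range(J):
        for h in range(2 ** j):
            w = rng.laplace() / (alpha * 2 ** (j / 2))                 # wavelet coefficient w_{j,h}
            theta += w * 2 ** (j / 2) * haar_mother(2 ** j * x - h)    # psi_{j,h}(x)
    return x, theta

x, theta = sample_besov_b111()
```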

Higher-level representations

Marked point processes, and more:

Rue & Hurn, Biometrika 86 (1999) 649–660.

Bayesian inference

Hierarchical modeling
- One of the key flexibilities of the Bayesian construction!
- Hierarchical modeling has important implications for the design of efficient MCMC samplers (later in the lecture).
Examples:

1. Unknown noise variance
2. Unknown variance of a Gaussian process prior (cf. choosing the regularization parameter)
3. Many more, as dictated by the physical models at hand

Example: prior variance hyperparameter in an inverse diffusion problem

[Figure: posterior marginal density of the variance hyperparameter θ, versus quality of data (ς = 10⁻¹ with 13 sensors; ς = 10⁻² with 25 sensors), contrasted with its hyperprior density. “Regularization” ∝ ς²/θ.]

The linear Gaussian model

A key building-block problem:
- Parameters θ ∈ R^n, observations y ∈ R^m
- Forward model f(θ) = Gθ, where G ∈ R^{m×n}
- Additive noise yields observations: y = Gθ + ε, with ε ∼ N(0, Γ_obs) independent of θ
- Endow θ with a Gaussian prior, θ ∼ N(0, Γ_pr)

Posterior probability density:

p(θ|y) ∝ p(y|θ) p(θ) = L(θ) p(θ)
       = exp( −(1/2) (y − Gθ)^T Γ_obs^{−1} (y − Gθ) ) exp( −(1/2) θ^T Γ_pr^{−1} θ )
       = exp( −(1/2) (θ − µ_pos)^T Γ_pos^{−1} (θ − µ_pos) )


Posterior is again Gaussian:

Γ_pos = ( G^T Γ_obs^{−1} G + Γ_pr^{−1} )^{−1}
      = Γ_pr − Γ_pr G^T ( G Γ_pr G^T + Γ_obs )^{−1} G Γ_pr
      = (I − KG) Γ_pr

µ_pos = Γ_pos G^T Γ_obs^{−1} y

- In the context of filtering, K is known as the (optimal) Kalman gain.
- H := G^T Γ_obs^{−1} G is the Hessian of the negative log-likelihood.
- How does the low rank of H affect the structure of the posterior? How does H interact with the prior?
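A minimal numerical sketch of these formulas (not from the slides), with an illustrative random G and simple prior and noise covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 10
G = rng.standard_normal((m, n))                        # forward model
Gamma_pr = np.eye(n)                                   # prior covariance
Gamma_obs = 0.1 * np.eye(m)                            # noise covariance

theta_true = rng.standard_normal(n)
y = G @ theta_true + rng.multivariate_normal(np.zeros(m), Gamma_obs)

H = G.T @ np.linalg.solve(Gamma_obs, G)                # Hessian of the negative log-likelihood
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr)) # posterior covariance
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)   # posterior mean
```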

Likelihood-informed directions

Consider the Rayleigh ratio

R(w) = (w^T H w) / (w^T Γ_pr^{−1} w).

When R(w) is large, the likelihood dominates the prior in direction w. The ratio is maximized by solutions of the generalized eigenvalue problem

H w = λ Γ_pr^{−1} w.

The posterior covariance can be written as a negative update of the prior along these “likelihood-informed” directions, and an approximation is obtained by using only the r largest eigenvalues:

Γ_pos = Γ_pr − Σ_{i=1}^{n} λ_i/(1 + λ_i) w_i w_i^T ≈ Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T.   (1)
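A minimal sketch (not from the slides) of computing the pairs (λ_i, w_i) with SciPy’s generalized symmetric eigensolver and forming the rank-r update in (1); the problem sizes and covariances are illustrative. Note that eigh normalizes the eigenvectors so that w_i^T Γ_pr^{−1} w_j = δ_ij, which is exactly the normalization assumed in (1).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, m, r = 20, 10, 5
G = rng.standard_normal((m, n))
Gamma_pr = np.diag(np.linspace(0.5, 2.0, n))
Gamma_obs = 0.1 * np.eye(m)

H = G.T @ np.linalg.solve(Gamma_obs, G)                 # Hessian of the negative log-likelihood
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))  # exact posterior covariance, for comparison

# Generalized eigenproblem H w = lambda Gamma_pr^{-1} w; eigh returns ascending eigenvalues
lam, W = eigh(H, np.linalg.inv(Gamma_pr))
lam, W = lam[::-1], W[:, ::-1]                          # sort descending

Gamma_pos_r = Gamma_pr - W[:, :r] @ np.diag(lam[:r] / (1 + lam[:r])) @ W[:, :r].T
print(np.linalg.norm(Gamma_pos_r - Gamma_pos) / np.linalg.norm(Gamma_pos))  # relative error of rank-r update
```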


Optimality results for Γ̂_pos

It turns out that the approximation

Γ̂_pos = Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T   (2)

is optimal in a class of loss functions L(Γ̂_pos, Γ_pos) for approximations of the form Γ̂_pos = Γ_pr − KK^T, where rank(K) ≤ r.¹

- Γ̂_pos minimises the Hellinger distance and the KL-divergence between N(µ_pos(y), Γ̂_pos) and N(µ_pos(y), Γ_pos).
- The results can also be used to devise efficient approximations for the posterior mean.
- λ = 1 means that prior and likelihood are roughly balanced. Truncate at λ = 0.1, for instance.


¹ For details see Spantini et al., Optimal low-rank approximations of Bayesian linear inverse problems, http://arxiv.org/abs/1407.3463

Remarks on the optimal approximation

Γ̂_pos = Γ_pr − K K^T,  with the optimal  K K^T = Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T.

- The form of the optimal update is widely used (Flath et al. 2011).
- Compute with Lanczos, randomized SVD, etc.
- The directions w̃_i = Γ_pr^{−1} w_i maximize the relative difference between prior and posterior variance:

  ( Var(w̃_i^T x) − Var(w̃_i^T x | y) ) / Var(w̃_i^T x) = λ_i / (1 + λ_i)

- Using the Frobenius norm as a loss would instead yield directions of greatest absolute difference between prior and posterior variance.

A metric between covariance matrices

Förstner metric. Let A, B ≻ 0, and let (σ_i) be the generalized eigenvalues of the pencil (A, B). Then

d_F²(A, B) = tr[ ln²( B^{−1/2} A B^{−1/2} ) ] = Σ_i ln²(σ_i).

- Compare curvatures: sup_u (u^T A u) / (u^T B u) = σ_1.
- Invariance properties: d_F(A, B) = d_F(A^{−1}, B^{−1}) and d_F(A, B) = d_F(M A M^T, M B M^T).
- The Frobenius distance ‖A − B‖_F does not share the same properties.
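A minimal sketch (not from the slides) of evaluating this metric with a generalized symmetric eigensolver; the matrices are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def foerstner_distance(A, B):
    """d_F(A, B) = sqrt(sum_i ln^2 sigma_i), with sigma_i the generalized eigenvalues of (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)          # eigenvalues of A v = sigma B v
    return np.sqrt(np.sum(np.log(sigma) ** 2))

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.eye(2)
print(foerstner_distance(A, B), foerstner_distance(np.linalg.inv(A), np.linalg.inv(B)))  # equal, by invariance
```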

Example: computerized tomography

X-rays travel from sources to detectors through an object of interest. The intensities from the sources are measured at the detectors, and the goal is to reconstruct the density of the object.

[Figure: measured intensity versus detector pixel.]

This synthetic example is motivated by a real application: real-time X-ray imaging of logs that enter a saw mill for the purpose of automatic quality control.²

² Check www.bintec.fi

Example: computerized tomography

Weaker data → faster decay of generalized eigenvalues → lower order approximations possible.

[Figure: decay of the eigenvalues and generalized eigenvalues versus index i, and the Förstner distance d_F versus the rank of the update, comparing the limited-angle and full-angle cases against the prior.]

In the limited angle case, roughly r = 200 is enough to get a good approximation (with full angle r ≈ 800 needed).

[Figure: comparison of the prior, low-rank posterior approximations with rank 50, 100, and 200, and the full posterior.]

Example: computerized tomography

Approximation of the mean: µ_pos(y) = Γ_pos G^T Γ_obs^{−1} y ≈ A_r y

Note: pre-computing the LIS basis offline allows fast online reconstructions for repeated data!

Questions yet to answer

- How to simulate from or explore the posterior distribution?
- How to make Bayesian inference computationally tractable when the forward model is expensive (e.g., a PDE) and the parameters are high- or infinite-dimensional?
- Downstream questions: model selection, optimal experimental design, decision-making, etc.
