Inverse problems: From Regularization to Bayesian Inference

Daniela Calvetti

Case Western Reserve University

(Based on the forthcoming WIREs overview article "Bayesian methods in inverse problems".)

Inverse problems are everywhere

The solution of an inverse problem is needed when

- Recovering unknown causes from indirect, noisy measurements

- Determining poorly known - or unknown - parameters of a mathematical model

- Inferring the interior of a body from non-invasive measurements

Going against causality

Forward models:

- Move in the causal direction, from causes to consequences

- Are usually well-posed: small perturbations in the causes lead to little change in the consequences

- May be very difficult to solve

Inverse problems:

- Go against causality, mapping consequences to causes

- Their instability is the price for the well-posedness of the associated forward model

- The more forgiving the forward problem, the more unstable the inverse one

Instability vs regularization

- Regularization trades the original unstable problem for a nearby approximate one

- For a linear inverse problem with forward model f(x) = Ax, instability ⇐⇒ a wide range of singular values of A

- Division by small singular values amplifies the noise in the data

- Lanczos (1958) replaces A with a low-rank approximation prior to inversion

- A similar idea was later formulated as truncated SVD (TSVD) regularization (1987), as sketched below
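A minimal sketch of TSVD regularization with NumPy, assuming a generic ill-conditioned matrix A and noisy data b; the truncation index k and the synthetic test problem are illustrative choices, not taken from the talk.

```python
import numpy as np

def tsvd_solve(A, b, k):
    """Truncated SVD solution: keep only the k largest singular values,
    discarding the components that would amplify the noise."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    coeffs = (U[:, :k].T @ b) / s[:k]   # invert only the retained singular values
    return Vt[:k].T @ coeffs

# Small illustration with a synthetic ill-conditioned problem
rng = np.random.default_rng(0)
n = 50
A = np.vander(np.linspace(0, 1, n), n, increasing=True)   # ill-conditioned matrix
x_true = np.sin(np.linspace(0, np.pi, n))
b = A @ x_true + 1e-6 * rng.standard_normal(n)             # noisy data
x_tsvd = tsvd_solve(A, b, k=10)
```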

Tikhonov regularization

- Tikhonov (1960s) leaves the forward model intact

- The solution is penalized for traits due to amplified noise:

$$x_{\mathrm{Tik}} = \operatorname{argmin}_x \left\{ \|b - f(x)\|^2 + \alpha\, G(x) \right\}$$

- Under suitable conditions on f and G, the solution exists and is unique

- G(x) monitors the growth of the instability

- α > 0 sets the relative weight of the fidelity and penalty terms

The Tikhonov penalty requires an a priori belief about the solution!
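A minimal numerical sketch of Tikhonov regularization for the linear case f(x) = Ax with G(x) = ‖Lx‖², solved via the equivalent stacked least-squares problem; A, b, L, and alpha are placeholders the reader supplies, and the example data are synthetic.

```python
import numpy as np

def tikhonov_solve(A, b, L, alpha):
    """Minimize ||b - A x||^2 + alpha * ||L x||^2 by solving the
    equivalent stacked least-squares problem."""
    A_aug = np.vstack([A, np.sqrt(alpha) * L])
    b_aug = np.concatenate([b, np.zeros(L.shape[0])])
    x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x

# Example with L = I (penalize the norm of the solution)
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 40))
x_true = rng.standard_normal(40)
b = A @ x_true + 0.01 * rng.standard_normal(100)
x_tik = tikhonov_solve(A, b, np.eye(40), alpha=0.1)
```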

Regularization operator

Standard choices of the regularization operator are

- $G(x) = \|Lx\|^2$, with L = I, to penalize growth in the Euclidean norm of the solution

- $G(x) = \|Lx\|^2$, with L a discrete first-order differencing operator, to penalize growth in the gradient

- $G(x) = \|Lx\|^2$, with L the discrete Laplacian, to penalize growth in the curvature

- $G(x) = \|x\|_1$, to favor sparse solutions, related to compressed sensing

- The total variation (TV) of x, to favor blocky solutions.

Regularization parameter

- The Tikhonov regularized solution depends heavily on the value of α

- In the limit α → 0, we revert to solving the original unregularized (output least-squares) problem

- If α is very large, the solution may not predict the data well

- Morozov discrepancy principle: if the norm of the noise, $\|e\| \approx \delta$, is known, α = α(δ) should be chosen so that $\|b - f(x_\alpha)\| = \delta$ (see the sketch below)

- If the norm of the noise is not known, the selection of α can be based on heuristic criteria
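A sketch of choosing α by the discrepancy principle in the linear Tikhonov case, assuming a known noise level delta; it uses bisection on the geometric midpoint of an α bracket, relying on the fact that the residual norm of the Tikhonov solution grows monotonically with α. All names and bounds are illustrative.

```python
import numpy as np

def discrepancy_alpha(A, b, L, delta, lo=1e-12, hi=1e6, iters=60):
    """Find alpha so that ||b - A x_alpha|| is approximately delta."""
    def residual_norm(alpha):
        A_aug = np.vstack([A, np.sqrt(alpha) * L])
        b_aug = np.concatenate([b, np.zeros(L.shape[0])])
        x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
        return np.linalg.norm(b - A @ x), x

    for _ in range(iters):
        mid = np.sqrt(lo * hi)          # geometric midpoint of the bracket
        r, x = residual_norm(mid)
        if r < delta:
            lo = mid                    # discrepancy too small: increase alpha
        else:
            hi = mid                    # discrepancy too large: decrease alpha
    return mid, x
```

Here delta would come from knowledge of the noise norm ‖e‖, as in the Morozov principle above.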

Semi-convergence and stopping rules

Given the linear system Ax = b and the initial guess $x_0 = 0$, a Krylov least-squares solver generates a sequence of approximate solutions $\{x_0, x_1, \ldots, x_k, \ldots\}$, with

$$x_k \in \mathcal{K}_k(A^T A, A^T b) = \operatorname{span}\left\{A^T b,\ (A^T A)A^T b,\ \ldots,\ (A^T A)^{k-1} A^T b\right\}.$$

- If Ax = b is a discrete inverse problem, the iterates first approach the solution, then diverge (semi-convergence).

- If the residuals $d_k = b - A x_k$ form a decreasing sequence in norm, iterate until

$$\|b - A x_k\| \leq \delta.$$
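A self-contained CGLS sketch with the discrepancy stopping rule ‖b − A xₖ‖ ≤ δ; this is a generic textbook-style implementation, not code from the talk, and delta is assumed known.

```python
import numpy as np

def cgls(A, b, delta, max_iter=200):
    """CGLS (conjugate gradients applied to A^T A x = A^T b),
    stopped as soon as the residual norm drops below the noise level delta."""
    x = np.zeros(A.shape[1])
    r = b - A @ x                 # residual in data space
    s = A.T @ r                   # gradient of the least-squares functional
    p = s.copy()
    gamma = s @ s
    for k in range(max_iter):
        if np.linalg.norm(r) <= delta:
            break                 # discrepancy principle: stop early
        q = A @ p
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x, k
```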

[Diagram: side-by-side comparison of Tikhonov regularization and iterative regularization. A priori information enters as solution features, encoded by the regularization operator in the penalty term, or as solution smoothness; the noise level determines the regularization parameter or the stopping rule, respectively; in both cases the data and the forward model enter through the fidelity term, and the output is a single solution.]

The Bayesian framework for inverse problems

- Model the unknown as a random variable X, and describe it via a distribution, not a single value

- Model the observed data as a random variable B, describing its uncertainty

- Stochastic extension of $b = f(x) + e$:

$$B = f(X) + E$$

- If X and E are mutually independent, the likelihood is of the form

$$\pi_{B\mid X}(b \mid x) = \pi_E\big(b - f(x)\big).$$

Bayes' formula

Update the belief about X with the information carried by $B = b_{\mathrm{observed}}$ using Bayes' formula,

$$\pi_{X\mid B}(x \mid b) = \frac{\pi_{B\mid X}(b \mid x)\,\pi_X(x)}{\pi_B(b)}, \qquad b = b_{\mathrm{observed}},$$

where

$$\pi_B(b) = \int_{\mathbb{R}^n} \pi_{B\mid X}(b \mid x)\,\pi_X(x)\,dx$$

is the normalization factor.

This is the complete solution of the inverse problem! A small numerical illustration is given below.
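A toy illustration of Bayes' formula on a grid for a scalar unknown: a standard normal prior, Gaussian noise, and a hypothetical nonlinear forward map f(x) = x³ chosen only for illustration; the posterior is obtained by normalizing prior times likelihood numerically.

```python
import numpy as np

# Scalar toy problem: b = f(x) + e, with f(x) = x**3 and e ~ N(0, sigma^2)
f = lambda x: x**3
sigma = 0.1
b_observed = 0.5

x_grid = np.linspace(-3.0, 3.0, 2001)
dx = x_grid[1] - x_grid[0]
prior = np.exp(-0.5 * x_grid**2)                                   # N(0, 1), unnormalized
likelihood = np.exp(-0.5 * ((b_observed - f(x_grid)) / sigma)**2)  # pi_E(b - f(x))

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dx)               # Bayes' formula

posterior_mean = (x_grid * posterior).sum() * dx
```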

The art and science of designing priors

A prior is the initial description of the unknown X before considering the data. There are different types of priors:

- Smoothness priors (0th, 1st and 2nd order)

- Correlation priors

- Structural priors

- Sparsity promoting priors

- Sample-based priors

Smoothness priors: 1st order

- Let

$$X_j - X_{j-1} = \gamma W_j, \qquad W_j \sim \mathcal{N}(0, 1), \qquad 1 \leq j \leq n,$$

or, collectively,

$$L_1 X = \gamma W, \qquad W \sim \mathcal{N}(0, I_n), \qquad L_1 = \begin{bmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix},$$

leading to a Gaussian prior model,

$$\pi_X(x) \propto \exp\left(-\frac{1}{2\gamma^2}\|L_1 x\|^2\right), \qquad x \in \mathbb{R}^n.$$
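A minimal sketch of drawing realizations from the first-order smoothness prior by building L₁ and solving L₁x = γw for a white-noise vector w; the values of n and γ are arbitrary illustrative choices.

```python
import numpy as np

n, gamma = 100, 0.1
rng = np.random.default_rng(2)

# First-order differencing matrix L1: 1 on the diagonal, -1 on the subdiagonal
L1 = np.eye(n) - np.eye(n, k=-1)

# A draw from the prior satisfies L1 x = gamma * w with w ~ N(0, I)
w = rng.standard_normal(n)
x_draw = np.linalg.solve(L1, gamma * w)
```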

Smoothness priors: 2nd order

- Let

$$2X_j - (X_{j-1} + X_{j+1}) = \gamma W_j, \qquad W_j \sim \mathcal{N}(0, 1), \qquad 1 \leq j \leq n-1.$$

- Collectively,

$$L_2 X = \gamma W, \qquad W \sim \mathcal{N}(0, I_{n-1}), \qquad L_2 = \begin{bmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{bmatrix}.$$

This leads to the Gaussian prior

$$\pi_X(x) \propto \exp\left(-\frac{1}{2\gamma^2}\|L_2 x\|^2\right), \qquad x \in \mathbb{R}^{n-1}.$$

Whittle-Matérn priors

- Let D be a discrete approximation of the Laplacian in $\Omega \subset \mathbb{R}^n$

- A prior for X defined in Ω can be described by the stochastic model

$$(-D + \lambda^{-2} I)^{\beta} X = \gamma W,$$

where $\gamma, \beta > 0$, $\lambda > 0$ is the correlation length, and W is a white noise process.
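A sketch of drawing from a one-dimensional Whittle-Matérn prior on a uniform grid, assuming homogeneous Dirichlet boundary conditions and illustrative values of λ, γ, and β; the matrix power is handled with scipy.linalg.fractional_matrix_power.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

n, h = 200, 1.0 / 200               # grid size and spacing on (0, 1)
lam, gamma, beta = 0.1, 1.0, 1.0    # correlation length, scale, smoothness (illustrative)

# Discrete Laplacian with homogeneous Dirichlet boundary conditions
D = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2

# (-D + lambda^{-2} I)^beta X = gamma W  =>  X = gamma * M^{-beta} W
M = -D + lam**-2 * np.eye(n)
M_beta = fractional_matrix_power(M, beta)
w = np.random.default_rng(3).standard_normal(n)
x_draw = gamma * np.linalg.solve(M_beta, w)
```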

Structural priors

- The solution is expected to have jumps across structure boundaries

- Within the structures, the smoothness prior is a valid description

- Let

$$X_j - X_{j-1} = \gamma_j W_j, \qquad W_j \sim \mathcal{N}(0, 1), \qquad 1 \leq j \leq n,$$

where $\gamma_j > 0$ is small if we expect $x_j$ to be near $x_{j-1}$, and large where $x_j$ could differ significantly from $x_{j-1}$.

- This leads to

$$\pi_{\mathrm{struct}}(x) \propto \exp\left(-\frac{1}{2}\|\Gamma^{-1} L_1 x\|^2\right), \qquad \Gamma = \operatorname{diag}(\gamma_1, \ldots, \gamma_n).$$

- Examples:

- geophysical sounding problems with known sediment layers

- electrical impedance tomography with known anatomy

Sparsity promoting priors

These priors:

- assume very small values for most entries, while allowing isolated, sudden changes;

- satisfy

$$\pi_X(x) \propto \exp\left(-\alpha \|x\|_p\right).$$

- $\ell^p$-priors are non-Gaussian, introducing computational challenges.

Sample-based priors

- Assume we can build a generative model that produces typical solutions

- Generate a sample of typical solutions,

$$\mathcal{X} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$$

- Regard the solutions in $\mathcal{X}$ as independent draws from the prior

- Approximate the prior by a Gaussian,

$$\pi_X(x) \propto \exp\left(-\frac{1}{2}(x - \bar{x})^T \Gamma^{-1}(x - \bar{x})\right),$$

where

$$\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x^{(j)}, \qquad \Gamma = \frac{1}{N}\sum_{j=1}^{N} \big(x^{(j)} - \bar{x}\big)\big(x^{(j)} - \bar{x}\big)^T + \delta^2 I.$$
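A sketch of forming the sample-based Gaussian prior from N typical solutions stored as the rows of an array; delta is the small regularization parameter keeping Γ invertible, with an arbitrary illustrative value, and the synthetic "typical solutions" are for demonstration only.

```python
import numpy as np

def sample_based_prior(samples, delta=1e-3):
    """Estimate the mean and (regularized) covariance of a Gaussian prior
    from `samples`, an (N, n) array whose rows are typical solutions."""
    N, n = samples.shape
    x_bar = samples.mean(axis=0)
    centered = samples - x_bar
    Gamma = (centered.T @ centered) / N + delta**2 * np.eye(n)
    return x_bar, Gamma

# Illustration with synthetic smooth "typical solutions"
t = np.linspace(0, 1, 50)
rng = np.random.default_rng(4)
samples = np.array([np.sin(2 * np.pi * (t + rng.uniform())) for _ in range(200)])
x_bar, Gamma = sample_based_prior(samples)
```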

Solving inverse problems the Bayesian way

The main steps in solving inverse problems are:

- constructing an informative and computationally feasible prior;

- constructing the likelihood using the forward model and information about the noise;

- forming the posterior density using Bayes' formula;

- extracting information from the posterior.

[Diagram: the a priori belief (smoothness, sparsity, structure, sample) determines the prior density; the noise model, modeling error, forward model, and data determine the likelihood; Bayes' formula combines prior density and likelihood into the posterior density, from which a single point estimate or a posterior sample is extracted.]


The pure Gaussian case

Assume a Gaussian noise model,

$$E \sim \mathcal{N}(0, C), \qquad \text{or} \qquad \pi_E(e) = \frac{1}{(2\pi)^{m/2}\sqrt{\det(C)}}\exp\left(-\frac{1}{2} e^T C^{-1} e\right),$$

and a zero-mean Gaussian prior with covariance G,

$$\pi_X(x) \propto \exp\left(-\frac{1}{2} x^T G^{-1} x\right).$$

Then, for a linear forward model f(x) = Ax,

$$\pi_{X\mid B}(x \mid b) \propto \exp\left(-\frac{1}{2}(x - \bar{x})^T B^{-1}(x - \bar{x})\right),$$

where the posterior mean and covariance are given by

$$\bar{x} = (G^{-1} + A^T C^{-1} A)^{-1} A^T C^{-1} b, \qquad B = (G^{-1} + A^T C^{-1} A)^{-1}.$$
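A direct computation of the Gaussian posterior mean and covariance for a linear model, following the formulas above; the matrices A, C, and G and the data are generated synthetically for illustration.

```python
import numpy as np

def gaussian_posterior(A, C, G, b):
    """Posterior mean and covariance for b = A x + e,
    with e ~ N(0, C) and prior x ~ N(0, G)."""
    Ci = np.linalg.inv(C)
    precision = np.linalg.inv(G) + A.T @ Ci @ A     # posterior precision
    B_post = np.linalg.inv(precision)               # posterior covariance
    x_post = B_post @ (A.T @ Ci @ b)                # posterior mean
    return x_post, B_post

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 10))
C = 0.01 * np.eye(30)          # noise covariance
G = np.eye(10)                 # prior covariance
x_true = rng.standard_normal(10)
b = A @ x_true + rng.multivariate_normal(np.zeros(30), C)
x_post, B_post = gaussian_posterior(A, C, G, b)
```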

Regularization and Bayes' formula

Assume

- a scaled white noise model, $C = \sigma^2 I$,

- a zero-mean Gaussian prior with covariance G such that $G^{-1} = L^T L$.

- The posterior is then of the form

$$\pi_{X\mid B}(x \mid b) \propto \exp\left(-\frac{1}{2\sigma^2}\left(\|b - f(x)\|^2 + \sigma^2\|Lx\|^2\right)\right).$$

- The Tikhonov regularized solution with $G(x) = \|Lx\|^2$ is the Maximum A Posteriori (MAP) estimate of the posterior,

$$x_\alpha = \operatorname{argmax}_x\, \pi_{X\mid B}(x \mid b), \qquad \alpha = \sigma^2.$$

Hierarchical Bayes models and sparsity

- Conditionally Gaussian prior

$$\pi_{X\mid\Theta}(x \mid \theta) = \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{1}{2}\sum_{j=1}^{n} \frac{x_j^2}{\theta_j} - \frac{1}{2}\sum_{j=1}^{n} \log\theta_j\right)$$

- The solution is sparse, or close to sparse: most variances of the $x_j$ are small, with some outliers

- Gamma hyperpriors

$$\pi_\Theta(\theta) \propto \prod_{j=1}^{n} \theta_j^{\beta - 1} e^{-\theta_j/\theta_0},$$

where $\beta > 0$ and $\theta_0 > 0$ are the shape and scale parameters.

- Posterior distribution

$$\pi_{X,\Theta\mid B}(x, \theta \mid b) \propto \pi_\Theta(\theta)\,\pi_{X\mid B,\Theta}(x \mid b, \theta)$$

Efficient quasi-MAP estimate

To find a point estimate for the pair $(x, \theta) \in \mathbb{R}^n \times \mathbb{R}^n_+$, alternate the following two steps (see the sketch below):

- Given the current estimate $\theta^{(k)} \in \mathbb{R}^n_+$, update $x \in \mathbb{R}^n$ by minimizing

$$E(x \mid \theta^{(k)}) = \frac{1}{2\sigma^2}\|b - f(x)\|^2 + \frac{1}{2}\sum_{j=1}^{n} \frac{x_j^2}{\theta_j^{(k)}};$$

- Given the updated $x^{(k)}$, update $\theta^{(k)} \to \theta^{(k+1)}$ by minimizing

$$E(\theta \mid x^{(k)}) = \frac{1}{2}\sum_{j=1}^{n} \frac{\big(x_j^{(k)}\big)^2}{\theta_j} - \eta\sum_{j=1}^{n} \log\theta_j + \sum_{j=1}^{n} \frac{\theta_j}{\theta_0}, \qquad \eta = \beta - \frac{3}{2}.$$
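A sketch of this alternating scheme for a linear forward model f(x) = Ax: the x-update is a stacked least-squares solve, and the θ-update uses the closed-form minimizer obtained by setting the derivative of E(θ | x) to zero under the sign convention above (it requires β > 3/2). The parameter values are illustrative, not from the talk.

```python
import numpy as np

def quasi_map(A, b, sigma, beta=2.0, theta0=1e-2, n_outer=20):
    """Alternating minimization for the hierarchical model:
    x-update: Tikhonov-like least squares with current variances theta;
    theta-update: componentwise closed-form minimizer."""
    m, n = A.shape
    eta = beta - 1.5
    theta = np.full(n, theta0)
    for _ in range(n_outer):
        # x-update: minimize ||b - A x||^2 / (2 sigma^2) + sum_j x_j^2 / (2 theta_j)
        A_aug = np.vstack([A / sigma, np.diag(1.0 / np.sqrt(theta))])
        b_aug = np.concatenate([b / sigma, np.zeros(n)])
        x, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
        # theta-update: stationary point of E(theta | x), componentwise
        theta = theta0 * (eta / 2 + np.sqrt(eta**2 / 4 + x**2 / (2 * theta0)))
    return x, theta
```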

Bayes meets Krylov...

When the problems are large and the computing time is at a premium:

- The least-squares problems in step 1 (the x-update) can be solved by CGLS with a suitable stopping rule

- A right preconditioner accounts for the information in the prior (see the priorconditioning sketch below)

- A left preconditioner accounts for the observational and modeling errors

- The quasi-MAP estimate is computed very rapidly via an inner-outer iteration scheme

- A number of open questions about the success of this partnership are currently under investigation
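A sketch of the left/right preconditioning idea in the linear Gaussian setting, assuming C = σ²I (left preconditioning whitens the noise) and a prior with G⁻¹ = LᵀL (right "priorconditioning" via the change of variables w = Lx); early termination of the Krylov iteration then acts as regularization. SciPy's LSQR stands in for a generic Krylov least-squares solver, and all problem data are synthetic.

```python
import numpy as np
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(6)
m, n, sigma = 120, 80, 0.01
A = rng.standard_normal((m, n))
L = np.eye(n) - np.eye(n, k=-1)          # first-order smoothness prior, G^{-1} = L^T L
x_true = 0.1 * np.cumsum(rng.standard_normal(n))
b = A @ x_true + sigma * rng.standard_normal(m)

# Left preconditioning: whiten the noise (divide by sigma).
# Right preconditioning: substitute w = L x, i.e. work with A L^{-1}.
A_tilde = (A @ np.linalg.inv(L)) / sigma
b_tilde = b / sigma

# Early-terminated Krylov iteration on the preconditioned system
w = lsqr(A_tilde, b_tilde, iter_lim=15)[0]
x_hat = np.linalg.solve(L, w)            # map back to the original variable
```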

The Bayesian horizon is wide open

When solving inverse problems in the Bayesian framework

- We can compensate for uncertainties in forward models (see Ville's talk)

- We can explore the posterior (see John's talk)

- We can generalize classical regularization techniques to account for all sorts of uncertainties.