
The Bayesian approach to inverse problems

Youssef Marzouk

Department of Aeronautics and Astronautics, Center for Computational Engineering, Massachusetts Institute of Technology
[email protected], http://uqgroup.mit.edu

7 July 2015

Statistical inference

Why is a statistical perspective useful in inverse problems?
- To characterize uncertainty in the inverse solution
- To understand how this uncertainty depends on the number and quality of observations, features of the forward model, prior information, etc.
- To make probabilistic predictions
- To choose “good” observations or experiments
- To address questions of model error and model validity


Bayes’ rule

p(θ|y) = p(y|θ) p(θ) / p(y)

Key idea: model parameters θ are treated as random variables (For simplicity, we let our random variables have densities)

Notation
- θ are model parameters; y are the data; assume both to be finite-dimensional unless otherwise indicated
- p(θ) is the prior probability density
- L(θ) ≡ p(y|θ) is the likelihood function
- p(θ|y) is the posterior probability density
- p(y) is the evidence, or equivalently, the marginal likelihood

Bayesian inference

Summaries of the posterior distribution: what information to extract?
- Posterior mean of θ; maximum a posteriori (MAP) estimate of θ
- Posterior covariance or higher moments of θ
- Quantiles
- Credible intervals: C(y) such that P[θ ∈ C(y) | y] = 1 − α. Credible intervals are not uniquely defined by this condition alone; thus consider, for example, the HPD (highest posterior density) region.
- Posterior realizations: for direct assessment, or to estimate posterior predictions or other posterior expectations
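As a small illustration (not from the slides), several of these summaries can be estimated directly from posterior samples, e.g. produced by MCMC; a minimal Python/NumPy sketch:

```python
import numpy as np

def posterior_summaries(samples, alpha=0.05):
    """Monte Carlo estimates of common posterior summaries from 1-D samples."""
    samples = np.asarray(samples)
    mean = samples.mean()                                        # posterior mean
    var = samples.var(ddof=1)                                    # posterior variance
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])    # equal-tailed credible interval
    return {"mean": mean, "variance": var, "credible_interval": (lo, hi)}

# Example with synthetic "posterior" samples:
rng = np.random.default_rng(0)
print(posterior_summaries(rng.normal(loc=1.0, scale=0.5, size=10_000)))
```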

Bayesian and frequentist statistics

Understanding both perspectives is useful and important. Key differences between these two statistical paradigms:
- Frequentists do not assign probabilities to unknown parameters θ. One can write likelihoods p_θ(y) ≡ p(y|θ) but not priors p(θ) or posteriors; θ is not a random variable.
- In the frequentist viewpoint, there is no single preferred methodology for inverting the relationship between parameters and data. Instead, consider various estimators θ̂(y) of θ.
- The estimator θ̂ is a random variable. Why? The frequentist paradigm considers y to result from a random and repeatable experiment.

Bayesian and frequentist statistics

Key differences (continued):
- Evaluate the quality of θ̂ through various criteria: bias, variance, mean-square error, consistency, efficiency, ...
- One common estimator is maximum likelihood: θ̂_ML = argmax_θ p(y|θ). Here p(y|θ) defines a family of distributions indexed by θ.
- Link to the Bayesian approach: the MAP estimate maximizes a “penalized likelihood.”

What about Bayesian versus frequentist prediction of y_new ⊥ y | θ?
- Frequentist: “plug-in” or other estimators of y_new
- Bayesian: posterior prediction via integration
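A tiny illustration of this difference (not from the slides), using a conjugate Gaussian toy model where the posterior is available in closed form; all names and numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_i | theta ~ N(theta, 1), prior theta ~ N(0, 1); posterior is N(m, v) in closed form
y = rng.normal(1.0, 1.0, size=20)
v = 1.0 / (1.0 + y.size)                                # posterior variance
m = v * y.sum()                                         # posterior mean
theta_samples = rng.normal(m, np.sqrt(v), size=5_000)

# Bayesian: posterior predictive draws integrate over theta; frequentist: plug in a point estimate
y_new_bayes = rng.normal(theta_samples, 1.0)            # one y_new draw per posterior draw
y_new_plugin = rng.normal(y.mean(), 1.0, size=5_000)    # plug-in with theta_hat = sample mean

print(y_new_bayes.std(), y_new_plugin.std())            # predictive spread is larger under integration
```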

Bayesian inference

Likelihood functions
- In general, p(y|θ) is a probabilistic model for the data.
- In the inverse problem or parameter estimation context, the likelihood function is where the forward model appears, along with a noise model and (if applicable) an expression for model discrepancy.
- Contrasting example (but not really!): parametric density estimation, where the likelihood function results from the probability density itself.

Selected examples of likelihood functions:
1. Bayesian linear regression
2. Nonlinear forward model g(θ) with additive Gaussian noise
3. Nonlinear forward model with noise + model discrepancy
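For example 2 above (nonlinear forward model with additive Gaussian noise), a minimal sketch of the resulting log-likelihood; the forward model g, the data y, and the noise covariance Γ_obs below are illustrative placeholders:

```python
import numpy as np

def gaussian_log_likelihood(theta, y, g, Gamma_obs):
    """log p(y | theta) for y = g(theta) + eps, eps ~ N(0, Gamma_obs),
    up to the additive constant -(m/2) log(2 pi) - (1/2) log det(Gamma_obs)."""
    r = y - g(theta)                                   # data misfit
    return -0.5 * r @ np.linalg.solve(Gamma_obs, r)

# Toy usage: scalar parameter, simple forward model, two observations
g = lambda theta: np.array([theta, 2.0 * theta])
y = np.array([1.1, 2.3])
Gamma_obs = 0.01 * np.eye(2)
print(gaussian_log_likelihood(1.0, y, g, Gamma_obs))
```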

Bayesian inference

Prior distributions
- In ill-posed parameter estimation problems, e.g., inverse problems, prior information plays a key role.
- Intuitive idea: assign lower probability to values of θ that you don’t expect to see, higher probability to values of θ that you do expect to see.

Examples:
1. Gaussian processes with specified covariance kernel
2. Gaussian Markov random fields
3. Gaussian priors derived from differential operators
4. Hierarchical priors
5. Besov space priors
6. Higher-level representations (objects, marked point processes)

Gaussian process priors

Key idea: any finite-dimensional distribution of the stochastic process θ(x, ω): D × Ω → R is multivariate normal. In other words, θ(x, ω) is a collection of jointly Gaussian random variables, indexed by x.
- Specify via a mean function and a covariance function:
  E[θ(x)] = µ(x),  E[(θ(x) − µ(x))(θ(x′) − µ(x′))] = C(x, x′)
- Smoothness of the process is controlled by the behavior of the covariance function as x′ → x.
- Possible restrictions: stationarity, isotropy, ...
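As an illustration (not from the slides), realizations of a zero-mean GP on a 1-D grid can be drawn by factorizing the covariance matrix built from a chosen kernel; a minimal sketch with a squared-exponential kernel:

```python
import numpy as np

def squared_exponential(x1, x2, sigma2=1.0, ell=0.2):
    """C(x, x') = sigma2 * exp(-|x - x'|^2 / (2 ell^2))."""
    return sigma2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
C = squared_exponential(x, x) + 1e-10 * np.eye(x.size)   # small jitter for numerical positive definiteness
L = np.linalg.cholesky(C)
theta = L @ rng.standard_normal(x.size)                   # one GP prior realization on the grid
```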

Example: stationary Gaussian random fields

Prior is a stationary Gaussian random field. [Figure: realizations drawn with an exponential covariance kernel (left) and a Gaussian covariance kernel (right).] Both are θ(x, ω): D × Ω → R, with D = [0, 1]².

Karhunen-Loève expansion:

M(x, ω) = µ(x) + Σ_{i=1}^{K} √λ_i c_i(ω) φ_i(x)
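A minimal numerical sketch of a truncated Karhunen-Loève expansion (not from the slides): build the covariance matrix on a grid, keep its leading eigenpairs, and combine them with independent standard normal coefficients. The kernel and truncation level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)         # exponential covariance kernel
lam, phi = np.linalg.eigh(C)                               # eigenpairs, ascending order
lam, phi = lam[::-1], phi[:, ::-1]                         # sort descending

K = 20                                                     # truncation level
c = rng.standard_normal(K)                                 # iid N(0, 1) KL coefficients c_i
theta = phi[:, :K] @ (np.sqrt(lam[:K]) * c)                # truncated KL realization (mu = 0)
```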

Gaussian Markov random fields

Key idea: discretize space and specify a sparse inverse covariance (“precision”) matrix W:

p(θ) ∝ exp( −(1/2) γ θ^T W θ ),

where γ controls the scale.

- Full conditionals p(θ_i | θ_{∼i}) are available analytically and may simplify dramatically.
- Represent as an undirected graphical model.
- Example: E[θ_i | θ_{∼i}] is just an average of site i’s nearest neighbors.
- Quite flexible; even used to simulate textures.
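A minimal sketch for a 1-D chain (not from the slides): W has nearest-neighbour (second-difference) structure, a small jitter makes the intrinsic precision proper, and the full conditional mean at an interior site reduces to the average of its two neighbours:

```python
import numpy as np

n, gamma = 100, 50.0
# Path-graph Laplacian: an intrinsic (singular) GMRF structure, so add a small jitter for properness
W = np.diag(np.r_[1.0, 2.0 * np.ones(n - 2), 1.0]) - np.eye(n, k=1) - np.eye(n, k=-1)
Q = gamma * W + 1e-4 * np.eye(n)                       # precision matrix (sparse in practice, dense here)

rng = np.random.default_rng(0)
L = np.linalg.cholesky(Q)
theta = np.linalg.solve(L.T, rng.standard_normal(n))   # theta ~ N(0, Q^{-1})

# Full conditional mean at an interior site i: E[theta_i | rest] = -(1/Q_ii) * sum_{j != i} Q_ij theta_j
i = 50
cond_mean = -(Q[i] @ theta - Q[i, i] * theta[i]) / Q[i, i]
print(cond_mean, 0.5 * (theta[i - 1] + theta[i + 1]))  # nearly the neighbour average (up to the jitter)
```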

Priors through differential operators

Key idea: return to the infinite-dimensional setting; again penalize roughness in θ(x).
- Stuart 2010: define the prior using fractional negative powers of the Laplacian A = −∆:
  θ ∼ N(θ_0, β A^{−α})
- Sufficiently large α (α > d/2), along with conditions on the likelihood, ensures that the posterior measure is well defined.

GPs, GMRFs, and SPDEs

In fact, all three “types” of Gaussian priors just described are closely connected.
- Linear fractional SPDE:
  (κ² − ∆)^{β/2} θ(x) = W(x),  x ∈ R^d,  β = ν + d/2,  κ > 0,  ν > 0
- Then θ(x) is a Gaussian field with Matérn covariance:
  C(x, x′) = σ² / (2^{ν−1} Γ(ν)) (κ‖x − x′‖)^ν K_ν(κ‖x − x′‖)
- The covariance kernel is the Green’s function of the differential operator:
  (κ² − ∆)^β C(x, x′) = δ(x − x′)
- ν = 1/2 is equivalent to the exponential covariance; ν → ∞ is equivalent to the squared-exponential covariance.
- One can construct a discrete GMRF that approximates the solution of the SPDE (see Lindgren, Rue, Lindström, JRSSB 2011).
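A minimal sketch (not from the slides) of evaluating the Matérn covariance above with SciPy’s modified Bessel function of the second kind; parameter values are illustrative:

```python
import numpy as np
from scipy.special import kv, gamma

def matern_covariance(r, sigma2=1.0, kappa=10.0, nu=1.5):
    """C(r) = sigma2 / (2^(nu-1) Gamma(nu)) * (kappa r)^nu * K_nu(kappa r), with C(0) = sigma2."""
    r = np.asarray(r, dtype=float)
    kr = np.where(r == 0.0, 1.0, kappa * r)            # placeholder at r = 0 to avoid 0 * inf
    c = sigma2 / (2 ** (nu - 1) * gamma(nu)) * kr ** nu * kv(nu, kr)
    return np.where(r == 0.0, sigma2, c)

r = np.linspace(0.0, 0.5, 6)
print(matern_covariance(r, nu=0.5))                    # nu = 1/2 reproduces sigma2 * exp(-kappa * r)
```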

Hierarchical Gaussian priors


[Figure 1 (Calvetti & Somersalo): three realizations drawn from the prior (6) with constant variance θ_j = θ_0 (left), and from the corresponding prior where the variance is 100-fold larger at two points indicated by arrows (right).]

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013. Their first-order autoregressive Markov model gives a smoothness prior π_prior(x) ∝ exp( −(1/2) ‖D^{−1/2} L x‖² ), with L a first-difference matrix and D = diag(θ_1, ..., θ_n) a vector of local variances. Setting all θ_j = θ_0 gives homogeneous smoothness over the support interval; increasing some components makes jumps at those locations more likely (it does not force them). When the number, location, and amplitude of jumps are unknown, this lack of quantitative information is expressed by modeling the variance vector itself as a random variable, so that its estimation becomes part of the inverse problem.


Hierarchical Gaussian priors

[Figure 4 (Calvetti & Somersalo): approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3 and 5 iterations of the cyclic algorithm, using the GMRES method to compute the update of the image at each iteration step.]

[Figure 5 (Calvetti & Somersalo): approximation of the MAP estimate of the image (top row) and of the variance (bottom row) after 1, 3 and 7 iterations of the cyclic algorithm, using the CGLS method to compute the update of the image at each iteration step.]

Calvetti & Somersalo, Inverse Problems 24 (2008) 034013. For the CGLS iteration with an inverse gamma hyperprior, the objective function levels off after about five iterations, which could serve as a stopping criterion; after about seven iterations the norm of the estimation error starts to grow again, the semi-convergence typical of such algorithms, driven partly by speckling of individual pixel values near the discontinuity. The iterations should therefore be stopped soon after the objective function settles.

Non-Gaussian priors

Besov space B^s_{pq}(T): expand

θ(x) = c_0 + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} w_{j,h} ψ_{j,h}(x)

and require

‖θ‖_{B^s_{pq}(T)} := |c_0| + ( Σ_{j=0}^{∞} 2^{jq(s + 1/2 − 1/p)} ( Σ_{h=0}^{2^j − 1} |w_{j,h}|^p )^{q/p} )^{1/q} < ∞.

Consider p = q = s = 1:

‖θ‖_{B^1_{11}(T)} = |c_0| + Σ_{j=0}^{∞} Σ_{h=0}^{2^j − 1} 2^{j/2} |w_{j,h}|.

Then the distribution of θ is a Besov prior if α c_0 and α 2^{j/2} w_{j,h} are independent and Laplace(1). Loosely, π(θ) = exp( −α ‖θ‖_{B^1_{11}(T)} ).
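A minimal sketch (not from the slides) of drawing a realization from this B^1_{11} prior, assuming Haar wavelets for ψ_{j,h}; the truncation level J and the value of α are illustrative:

```python
import numpy as np

def haar_mother(t):
    """Mother Haar wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0, 0.0) - np.where((t >= 0.5) & (t < 1.0), 1.0, 0.0)

def sample_besov_b111(n_grid=512, J=8, alpha=10.0, seed=0):
    """One realization of the B^1_{11} prior: alpha*c0 and alpha*2^{j/2}*w_{j,h} are iid Laplace(1)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    theta = np.full(n_grid, rng.laplace() / alpha)                     # c_0 term
    for j in range(J):
        for h in range(2 ** j):
            w = rng.laplace() / (alpha * 2 ** (j / 2))                 # wavelet coefficient w_{j,h}
            theta += w * 2 ** (j / 2) * haar_mother(2 ** j * x - h)    # psi_{j,h}(x)
    return x, theta

x, theta = sample_besov_b111()
```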

Higher-level representations

Marked point processes, and more:

Rue & Hurn, Biometrika 86 (1999) 649–660.

Bayesian inference

Hierarchical modeling
- One of the key flexibilities of the Bayesian construction!
- Hierarchical modeling has important implications for the design of efficient MCMC samplers (later in the lecture).
Examples:

1. Unknown noise variance
2. Unknown variance of a Gaussian process prior (cf. choosing the regularization parameter)
3. Many more, as dictated by the physical models at hand

Example: prior variance hyperparameter in an inverse diffusion problem

[Figure: posterior marginal density of the variance hyperparameter θ, versus quality of data (ς = 10⁻¹ with 13 sensors; ς = 10⁻² with 25 sensors), contrasted with its hyperprior density. “Regularization” ∝ ς²/θ.]

The linear Gaussian model

A key building-block problem:
- Parameters θ ∈ R^n, observations y ∈ R^m
- Forward model f(θ) = Gθ, where G ∈ R^{m×n}
- Additive noise yields observations: y = Gθ + ε, with ε ∼ N(0, Γ_obs) independent of θ
- Endow θ with a Gaussian prior, θ ∼ N(0, Γ_pr)

Posterior probability density:

p(θ|y) ∝ p(y|θ) p(θ) = L(θ) p(θ)
       = exp( −(1/2) (y − Gθ)^T Γ_obs^{−1} (y − Gθ) ) exp( −(1/2) θ^T Γ_pr^{−1} θ )
       = exp( −(1/2) (θ − µ_pos)^T Γ_pos^{−1} (θ − µ_pos) )


Posterior is again Gaussian:

Γ_pos = ( G^T Γ_obs^{−1} G + Γ_pr^{−1} )^{−1}
      = Γ_pr − Γ_pr G^T ( G Γ_pr G^T + Γ_obs )^{−1} G Γ_pr
      = (I − KG) Γ_pr

µ_pos = Γ_pos G^T Γ_obs^{−1} y

- In the context of filtering, K is known as the (optimal) Kalman gain.
- H := G^T Γ_obs^{−1} G is the Hessian of the negative log-likelihood.
- How does the low rank of H affect the structure of the posterior? How does H interact with the prior?
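A minimal numerical sketch of these formulas (not from the slides), with an illustrative random G and simple prior and noise covariances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 10
G = rng.standard_normal((m, n))                        # forward model
Gamma_pr = np.eye(n)                                   # prior covariance
Gamma_obs = 0.1 * np.eye(m)                            # noise covariance

theta_true = rng.standard_normal(n)
y = G @ theta_true + rng.multivariate_normal(np.zeros(m), Gamma_obs)

H = G.T @ np.linalg.solve(Gamma_obs, G)                # Hessian of the negative log-likelihood
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr)) # posterior covariance
mu_pos = Gamma_pos @ G.T @ np.linalg.solve(Gamma_obs, y)   # posterior mean
```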

Likelihood-informed directions

Consider the Rayleigh ratio

R(w) = (w^T H w) / (w^T Γ_pr^{−1} w).

When R(w) is large, the likelihood dominates the prior in direction w. The ratio is maximized by solutions of the generalized eigenvalue problem

H w = λ Γ_pr^{−1} w.

The posterior covariance can be written as a negative update of the prior along these “likelihood-informed” directions, and an approximation is obtained by using only the r largest eigenvalues:

Γ_pos = Γ_pr − Σ_{i=1}^{n} λ_i/(1 + λ_i) w_i w_i^T ≈ Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T.   (1)
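A minimal sketch (not from the slides) of computing the pairs (λ_i, w_i) with SciPy’s generalized symmetric eigensolver and forming the rank-r update in (1); the problem sizes and covariances are illustrative. Note that eigh normalizes the eigenvectors so that w_i^T Γ_pr^{−1} w_j = δ_ij, which is exactly the normalization assumed in (1).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, m, r = 20, 10, 5
G = rng.standard_normal((m, n))
Gamma_pr = np.diag(np.linspace(0.5, 2.0, n))
Gamma_obs = 0.1 * np.eye(m)

H = G.T @ np.linalg.solve(Gamma_obs, G)                 # Hessian of the negative log-likelihood
Gamma_pos = np.linalg.inv(H + np.linalg.inv(Gamma_pr))  # exact posterior covariance, for comparison

# Generalized eigenproblem H w = lambda Gamma_pr^{-1} w; eigh returns ascending eigenvalues
lam, W = eigh(H, np.linalg.inv(Gamma_pr))
lam, W = lam[::-1], W[:, ::-1]                          # sort descending

Gamma_pos_r = Gamma_pr - W[:, :r] @ np.diag(lam[:r] / (1 + lam[:r])) @ W[:, :r].T
print(np.linalg.norm(Gamma_pos_r - Gamma_pos) / np.linalg.norm(Gamma_pos))  # relative error of rank-r update
```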


Optimality results for Γ̂_pos

It turns out that the approximation

Γ̂_pos = Γ_pr − Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T   (2)

is optimal in a class of loss functions L(Γ̂_pos, Γ_pos) for approximations of the form Γ̂_pos = Γ_pr − KK^T, where rank(K) ≤ r.¹

- Γ̂_pos minimises the Hellinger distance and the KL-divergence between N(µ_pos(y), Γ̂_pos) and N(µ_pos(y), Γ_pos).
- The results can also be used to devise efficient approximations for the posterior mean.
- λ = 1 means that prior and likelihood are roughly balanced. Truncate at λ = 0.1, for instance.


¹ For details see Spantini et al., Optimal low-rank approximations of Bayesian linear inverse problems, http://arxiv.org/abs/1407.3463

Remarks on the optimal approximation

Γ̂_pos = Γ_pr − K K^T,  with the optimal  K K^T = Σ_{i=1}^{r} λ_i/(1 + λ_i) w_i w_i^T.

- The form of the optimal update is widely used (Flath et al. 2011).
- Compute with Lanczos, randomized SVD, etc.
- The directions w̃_i = Γ_pr^{−1} w_i maximize the relative difference between prior and posterior variance:

  ( Var(w̃_i^T x) − Var(w̃_i^T x | y) ) / Var(w̃_i^T x) = λ_i / (1 + λ_i)

- Using the Frobenius norm as a loss would instead yield directions of greatest absolute difference between prior and posterior variance.

A metric between covariance matrices

Förstner metric. Let A, B ≻ 0, and let (σ_i) be the generalized eigenvalues of the pencil (A, B). Then

d_F²(A, B) = tr[ ln²( B^{−1/2} A B^{−1/2} ) ] = Σ_i ln²(σ_i).

- Compare curvatures: sup_u (u^T A u) / (u^T B u) = σ_1.
- Invariance properties: d_F(A, B) = d_F(A^{−1}, B^{−1}) and d_F(A, B) = d_F(M A M^T, M B M^T).
- The Frobenius distance ‖A − B‖_F does not share the same properties.
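A minimal sketch (not from the slides) of evaluating this metric with a generalized symmetric eigensolver; the matrices are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def foerstner_distance(A, B):
    """d_F(A, B) = sqrt(sum_i ln^2 sigma_i), with sigma_i the generalized eigenvalues of (A, B)."""
    sigma = eigh(A, B, eigvals_only=True)          # eigenvalues of A v = sigma B v
    return np.sqrt(np.sum(np.log(sigma) ** 2))

A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.eye(2)
print(foerstner_distance(A, B), foerstner_distance(np.linalg.inv(A), np.linalg.inv(B)))  # equal, by invariance
```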

Example: computerized tomography

X-rays travel from sources to detectors through an object of interest. The intensities from the sources are measured at the detectors, and the goal is to reconstruct the density of the object.

[Figure: measured intensity versus detector pixel.]

This synthetic example is motivated by a real application: real-time X-ray imaging of logs that enter a saw mill for the purpose of automatic quality control.²

² Check www.bintec.fi

Example: computerized tomography

Weaker data → faster decay of generalized eigenvalues → lower order approximations possible.

[Figure: decay of the eigenvalues and generalized eigenvalues versus index i, and the Förstner distance d_F versus the rank of the update, comparing the limited-angle and full-angle cases against the prior.]

In the limited angle case, roughly r = 200 is enough to get a good approximation (with full angle r ≈ 800 needed).

[Figure: comparison of the prior, low-rank posterior approximations with rank 50, 100, and 200, and the full posterior.]

Example: computerized tomography

Approximation of the mean: µ_pos(y) = Γ_pos G^T Γ_obs^{−1} y ≈ A_r y

Note: pre-computing the LIS basis offline allows fast online reconstructions for repeated data!

Questions yet to answer

- How to simulate from or explore the posterior distribution?
- How to make Bayesian inference computationally tractable when the forward model is expensive (e.g., a PDE) and the parameters are high- or infinite-dimensional?
- Downstream questions: model selection, optimal experimental design, decision-making, etc.
