
Statistics for Inverse Problems 101

June 5, 2011

Introduction

- Inverse problem: recover an unknown object from indirect, noisy observations.

  Example: deblurring of a signal f

  Forward problem: µ(x) = ∫_0^1 A(x − t) f(t) dt
  Noisy data: y_i = µ(x_i) + ε_i,  i = 1, ..., n
  Inverse problem: recover f(t) for t ∈ [0, 1]

General Model

- A = (A_i),  A_i : H → IR,  f ∈ H

- y_i = A_i[f] + ε_i,  i = 1, ..., n

- The recovery of f from A[f] is ill-posed

- Errors ε_i modeled as random

- The estimate f̂ is an H-valued random variable

Basic Question (UQ)

How good is the estimate f̂?

- What is the distribution of f̂?

- Summarize characteristics of the distribution of f̂; e.g., how 'likely' is it that f̂ will be 'far' from f?

Review: E, Var & Cov

(1) Expected value of a random variable X with pdf f(x):  E(X) = µ_x = ∫ x f(x) dx

(2) Variance of X:  Var(X) = E( (X − µ_x)^2 )

(3) Covariance of random variables X and Y:

  Cov(X, Y) = E( (X − µ_x)(Y − µ_y) )

(4) Expected value of a random vector X = (X_i):  E(X) = µ_x = ( E(X_i) )

(5) Covariance matrix of X:

  Var(X) = E( (X − µ_x)(X − µ_x)^t ) = ( Cov(X_i, X_j) )

(6) For a fixed n × m matrix A:

  E(AX) = A E(X),   Var(AX) = A Var(X) A^t

Example: X = (X_1, ..., X_n) correlated with Var(X) = σ^2 Σ. Then

  Var( (X_1 + ··· + X_n)/n ) = Var( 1^t X / n ) = (σ^2/n^2) 1^t Σ 1
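A quick numerical sanity check of this identity, with an arbitrary positive-definite Σ chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 4, 2.0
    B = rng.standard_normal((n, n))
    Sigma = B @ B.T + np.eye(n)        # arbitrary symmetric positive-definite Sigma
    ones = np.ones(n)

    # Identity: Var(mean(X)) = Var(1^t X / n) = (sigma^2 / n^2) 1^t Sigma 1
    var_mean = sigma**2 / n**2 * (ones @ Sigma @ ones)
    print(var_mean)

    # Monte Carlo check: rows of X have covariance sigma^2 Sigma
    L = np.linalg.cholesky(sigma**2 * Sigma)
    X = rng.standard_normal((200_000, n)) @ L.T
    print(X.mean(axis=1).var())        # should be close to var_mean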

Review: least-squares

  y = Aβ + ε

- A is n × m with n > m, and A^t A has a stable inverse
- E(ε) = 0,  Var(ε) = σ^2 I

- Least-squares estimate of β:

  β̂ = arg min_b ‖y − Ab‖^2 = (A^t A)^{-1} A^t y

- β̂ is an unbiased estimator of β:

  E_β( β̂ ) = β  ∀β

- Covariance matrix of β̂:

  Var_β( β̂ ) = σ^2 (A^t A)^{-1}

- If ε is Gaussian N(0, σ^2 I), then β̂ is the MLE and

  β̂ ∼ N( β, σ^2 (A^t A)^{-1} )

- Unbiased estimator of σ^2:

  σ̂^2 = ‖y − A β̂‖^2 / (n − m) = ‖y − ŷ‖^2 / (n − dof(H)),

  where ŷ = A β̂ = Hy and H = A (A^t A)^{-1} A^t is the 'hat matrix'
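A minimal numerical sketch of this least-squares review using NumPy; the design matrix, true coefficients, and noise level below are made-up illustration values:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic setup (illustrative only): n observations, m parameters
    n, m, sigma = 50, 3, 0.1
    A = rng.standard_normal((n, m))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = A @ beta_true + sigma * rng.standard_normal(n)

    # Least-squares estimate: beta_hat = (A^t A)^{-1} A^t y
    AtA_inv = np.linalg.inv(A.T @ A)
    beta_hat = AtA_inv @ A.T @ y

    # Hat matrix H = A (A^t A)^{-1} A^t and fitted values y_hat = H y
    H = A @ AtA_inv @ A.T
    y_hat = H @ y

    # Unbiased estimate of sigma^2: ||y - y_hat||^2 / (n - dof(H)), with dof(H) = trace(H) = m
    sigma2_hat = np.sum((y - y_hat) ** 2) / (n - np.trace(H))

    # Estimated covariance matrix of beta_hat: sigma_hat^2 (A^t A)^{-1}
    cov_beta_hat = sigma2_hat * AtA_inv

    print(beta_hat, sigma2_hat)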

Confidence regions

- 1 − α confidence interval for β_i:

  I_i(α):  β̂_i ± t_{α/2, n−m} σ̂ √( ((A^t A)^{-1})_{i,i} )

- C.I.'s are pre-data

- Pointwise vs. simultaneous 1 − α coverage:

  pointwise:  P[ β_i ∈ I_i(α) ] = 1 − α  ∀i

  simultaneous:  P[ β_i ∈ I_i(α) ∀i ] = 1 − α

- It is easy to find a 1 − α joint confidence region R_α for β that gives simultaneous coverage, and if

  R_α^f = { f(b) : b ∈ R_α }  ⇒  P[ f(β) ∈ R_α^f ] ≥ 1 − α
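Continuing the least-squares sketch above, a hedged example of pointwise 1 − α t-intervals for the β_i (names and data are carried over from the earlier snippet):

    import numpy as np
    from scipy import stats

    def pointwise_ci(A, y, alpha=0.05):
        """Pointwise 1 - alpha confidence intervals for beta in y = A beta + eps."""
        n, m = A.shape
        AtA_inv = np.linalg.inv(A.T @ A)
        beta_hat = AtA_inv @ A.T @ y
        resid = y - A @ beta_hat
        sigma2_hat = resid @ resid / (n - m)            # unbiased estimate of sigma^2
        se = np.sqrt(sigma2_hat * np.diag(AtA_inv))     # standard error of each beta_hat_i
        t_crit = stats.t.ppf(1 - alpha / 2, df=n - m)   # t_{alpha/2, n-m}
        return beta_hat - t_crit * se, beta_hat + t_crit * se

    # These intervals have pointwise, not simultaneous, coverage; a Bonferroni
    # correction (alpha / m per coefficient) is one simple route to simultaneous coverage.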

Residuals

  r = y − A β̂ = y − ŷ

- E(r) = 0,  Var(r) = σ^2 (I − H)
- If ε ∼ N(0, σ^2 I), then r ∼ N(0, σ^2 (I − H)) ⇒ the residuals do not behave like the original errors ε
- Corrected residuals: r̂_i = r_i / √(1 − H_ii)

- Corrected residuals may be used for model validation
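A small sketch of the corrected residuals r_i / √(1 − H_ii), which can then be compared with a N(0, σ̂^2) reference as a basic model-validation check:

    import numpy as np

    def corrected_residuals(A, y):
        """Residuals rescaled so that, under the model, each has variance sigma^2."""
        H = A @ np.linalg.inv(A.T @ A) @ A.T    # hat matrix
        r = y - H @ y                           # raw residuals, Var(r) = sigma^2 (I - H)
        return r / np.sqrt(1.0 - np.diag(H))    # corrected residuals r_i / sqrt(1 - H_ii)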

Review: Bayes theorem

- Conditional probability measure P( · | B):

  P(A | B) = P(A ∩ B) / P(B)  if P(B) ≠ 0

- Bayes theorem: for P(A) ≠ 0

  P(A | B) = P(B | A) P(A) / P(B)

- For X and Θ with pdf's:

  f_{Θ|X}(θ | x) ∝ f_{X|Θ}(x | θ) f_Θ(θ)

  More generally:

  dµ_{Θ|X} / dµ_Θ ∝ dµ_{X|Θ} / dν

- Basic tools: conditional probability measures and Radon-Nikodym derivatives

- Conditional probabilities can be defined via Kolmogorov's definition of conditional expectations or via conditional distributions (e.g., disintegration of measures)

Review: Gaussian Bayesian linear model

  y | β ∼ N(Aβ, σ^2 I),   β ∼ N(µ, τ^2 Σ),   with µ, σ, τ, Σ known

- Goal: update the prior distribution of β in light of the data y. Compute the posterior distribution of β given y (Bayes theorem):

  β | y ∼ N(µ_y, Σ_y)

  Σ_y = σ^2 ( A^t A + (σ^2/τ^2) Σ^{-1} )^{-1}

  µ_y = ( A^t A + (σ^2/τ^2) Σ^{-1} )^{-1} ( A^t y + (σ^2/τ^2) Σ^{-1} µ )
      = weighted average of the least-squares estimate β̂_ls and the prior mean µ

- Summarizing the posterior:

  µ_y = E(β | y) = mean and mode (MAP) of the posterior

  Σ_y = Var(β | y) = covariance matrix of the posterior

  1 − α credible interval for β_i:  (µ_y)_i ± z_{α/2} √( (Σ_y)_{i,i} )

  Can also construct a 1 − α joint credible region for β
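A minimal sketch of these posterior formulas; the inputs A, y, σ, τ, µ, Σ are whatever the problem supplies, and the credible intervals are the marginal ones from the slide above:

    import numpy as np
    from scipy import stats

    def gaussian_posterior(A, y, sigma, tau, mu, Sigma):
        """Posterior N(mu_y, Sigma_y) for y | beta ~ N(A beta, sigma^2 I), beta ~ N(mu, tau^2 Sigma)."""
        Sigma_inv = np.linalg.inv(Sigma)
        M = A.T @ A + (sigma**2 / tau**2) * Sigma_inv
        Sigma_y = sigma**2 * np.linalg.inv(M)            # posterior covariance
        mu_y = np.linalg.solve(M, A.T @ y + (sigma**2 / tau**2) * (Sigma_inv @ mu))  # posterior mean
        return mu_y, Sigma_y

    def credible_intervals(mu_y, Sigma_y, alpha=0.05):
        """Marginal 1 - alpha credible intervals (mu_y)_i +/- z_{alpha/2} sqrt((Sigma_y)_ii)."""
        z = stats.norm.ppf(1 - alpha / 2)
        half = z * np.sqrt(np.diag(Sigma_y))
        return mu_y - half, mu_y + half

As the properties and warnings that follow emphasize, these credible intervals summarize the posterior; they need not have 1 − α frequentist coverage.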

Some properties

- µ_y minimizes E‖δ(y) − β‖^2 among all estimators δ(y)

- µ_y → β̂_ls as σ^2/τ^2 → 0;   µ_y → µ as σ^2/τ^2 → ∞

- µ_y can be used as a frequentist estimate of β:
  - µ_y is biased
  - its variance is NOT obtained by sampling from the posterior

- Warning: a 1 − α credible region MAY NOT have 1 − α frequentist coverage

- If µ, σ, τ or Σ are unknown ⇒ use hierarchical or empirical Bayes and MCMC methods

Parametric reductions for inverse problems

- Transform the infinite-dimensional problem into a finite-dimensional one, i.e., A[f] ≈ Aa

- Assumptions:

  y = A[f] + ε = Aa + δ + ε,   δ = A[f] − Aa the discretization error

  Function representation: f(x) = φ(x)^t a

  Penalized least-squares estimate:

  â = arg min_b ‖y − Ab‖^2 + λ^2 b^t S b = (A^t A + λ^2 S)^{-1} A^t y

  f̂(x) = φ(x)^t â

- Example: y_i = ∫_0^1 A(x_i − t) f(t) dt + ε_i

  Quadrature approximation:

  ∫_0^1 A(x_i − t) f(t) dt ≈ (1/m) Σ_{j=1}^m A(x_i − t_j) f(t_j)

  Discretized system:

  y = Af + δ + ε,   A_ij = (1/m) A(x_i − t_j),   f_j = f(t_j)

  Regularized estimate:

  f̂ = arg min_{g ∈ IR^m} ‖y − Ag‖^2 + λ^2 ‖Dg‖^2 = (A^t A + λ^2 D^t D)^{-1} A^t y ≡ Ly

  f̂(x_i) = f̂_i = e_i^t L y
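A sketch of this discretization and regularized estimate in NumPy. The Gaussian kernel, grid sizes, λ, and the second-difference choice for D are illustrative assumptions, not prescriptions from the slides:

    import numpy as np

    # Illustrative kernel and grids (assumptions of this sketch)
    kernel = lambda s: np.exp(-0.5 * (s / 0.05) ** 2)   # blurring kernel A(.)
    n, m, lam = 80, 100, 1e-2
    x = np.linspace(0.0, 1.0, n)                        # observation points x_i
    t = np.linspace(0.0, 1.0, m)                        # quadrature nodes t_j

    # Quadrature matrix: A_ij = (1/m) A(x_i - t_j)
    A = kernel(x[:, None] - t[None, :]) / m

    # Second-difference roughness penalty D (one common choice of regularization operator)
    D = np.diff(np.eye(m), n=2, axis=0)

    def regularized_estimate(A, D, y, lam):
        """f_hat = argmin ||y - A g||^2 + lam^2 ||D g||^2 = (A^t A + lam^2 D^t D)^{-1} A^t y = L y."""
        return np.linalg.solve(A.T @ A + lam**2 * (D.T @ D), A.T @ y)

    # Synthetic test: f(t) = sin(2 pi t) observed through the kernel with noise
    rng = np.random.default_rng(1)
    f_true = np.sin(2 * np.pi * t)
    y = A @ f_true + 0.01 * rng.standard_normal(n)
    f_hat = regularized_estimate(A, D, y, lam)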

- Example: A_i continuous on a Hilbert space H. Then:

  ∃ orthonormal φ_k ∈ H, orthonormal v_k ∈ IR^n and non-decreasing λ_k ≥ 0 such that A[φ_k] = λ_k v_k,

  f = f_0 + f_1,  with f_0 ∈ Null(A), f_1 ∈ Null(A)^⊥

  f_0 is not constrained by the data;  f_1 = Σ_{k=1}^n a_k φ_k

  Discretized system:

  y = V a + ε,   V = ( λ_1 v_1 ··· λ_n v_n )

- Regularized estimate:

  â = arg min_{b ∈ IR^n} ‖y − Vb‖^2 + λ^2 ‖b‖^2 ≡ Ly

  f̂(x) = φ(x)^t Ly,   φ(x) = ( φ_i(x) )

- Example: A_i continuous on the RKHS H = W_m([0, 1]). Then H = N_{m−1} ⊕ H_m and ∃ φ_j ∈ H and κ_j ∈ H such that

  ‖y − A[f]‖^2 + λ^2 ∫_0^1 (f_1^{(m)})^2 = Σ_i ( y_i − ⟨κ_i, f⟩ )^2 + λ^2 ‖P_{H_m} f‖^2

  f = Σ_j a_j φ_j + η,   η ∈ span{φ_k}^⊥

  Discretized system:

  y = Xa + δ + ε

  Regularized estimate:

  â = arg min_{b ∈ IR^n} ‖y − Xb‖^2 + λ^2 b^t P b ≡ Ly

  f̂(x) = φ(x)^t Ly,   φ(x) = ( φ_i(x) )

To summarize:

  y = Aa + δ + ε,   f(x) = φ(x)^t a

  â = arg min_{b ∈ IR^n} ‖y − Ab‖^2 + λ^2 b^t P b
    = (A^t A + λ^2 P)^{-1} A^t y

  f̂(x) = φ(x)^t â

Bias, variance and MSE of f̂

- Is f̂(x) close to f(x) on average?

  Bias( f̂(x) ) = E(f̂(x)) − f(x) = φ(x)^t B_λ a + φ(x)^t G_λ A^t δ

- Pointwise sampling variability:

  Var( f̂(x) ) = σ^2 ‖ A G_λ φ(x) ‖^2

- Pointwise MSE:

  MSE( f̂(x) ) = Bias( f̂(x) )^2 + Var( f̂(x) )

- Integrated MSE:

  IMSE( f̂ ) = E ∫ | f̂(x) − f(x) |^2 dx = ∫ MSE( f̂(x) ) dx
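When the bias and variance formulas above are awkward to evaluate analytically, they can also be approximated by Monte Carlo for a known test function. A hedged sketch for any linear estimator f̂ = Ly (it reuses the A, D, lam, f_true names of the earlier deblurring snippet; the replication count is arbitrary):

    import numpy as np

    def mc_bias_var_mse(A, L, f_true, sigma, n_rep=2000, rng=None):
        """Monte Carlo estimates of pointwise bias, variance and MSE of f_hat = L y, y = A f_true + eps."""
        rng = rng or np.random.default_rng(2)
        mu = A @ f_true                          # noise-free data
        f_hats = np.empty((n_rep, len(f_true)))
        for k in range(n_rep):
            y = mu + sigma * rng.standard_normal(len(mu))
            f_hats[k] = L @ y
        bias = f_hats.mean(axis=0) - f_true      # pointwise bias
        var = f_hats.var(axis=0)                 # pointwise sampling variability
        mse = bias**2 + var                      # pointwise MSE
        imse = mse.mean()                        # crude quadrature of the IMSE on a uniform grid
        return bias, var, mse, imse

    # For the Tikhonov estimator, L = (A^t A + lam^2 D^t D)^{-1} A^t:
    # L = np.linalg.solve(A.T @ A + lam**2 * (D.T @ D), A.T)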

Review: are unbiased estimators good?

- They don't have to exist

- If they exist, they may be very difficult to find

- A biased estimator may still be better: bias-variance trade-off

  i.e., don't get too attached to them

What about the bias of f̂?

- Upper bounds are sometimes possible

- Geometric information can be obtained from bounds

- Optimization approach: find the max/min of the bias subject to f ∈ S

- Average bias: choose a prior for f and compute the mean bias over the prior

- Example: f in an RKHS H, f(x) = ⟨ρ_x, f⟩

  Bias( f̂(x) ) = ⟨A_x − ρ_x, f⟩

  | Bias( f̂(x) ) | ≤ ‖A_x − ρ_x‖ ‖f‖

Confidence intervals

- If ε is Gaussian and |Bias(f̂(x))| ≤ B(x), then

  f̂(x) ± z_{α/2} ( σ ‖A G_λ φ(x)‖ + B(x) )

  is a 1 − α C.I. for f(x)

- Simultaneous 1 − α C.I.'s for E(f̂(x)), x ∈ S: find β such that

  P[ sup_{x ∈ S} |Z^t V(x)| ≥ β ] ≤ α

  with Z_i iid N(0, 1) and V(x) = K G_λ φ(x) / ‖K G_λ φ(x)‖, and use results from the theory of maxima of Gaussian processes

Residuals

  r = y − A â = y − ŷ

- Moments:

  E(r) = −A Bias(â) + δ,   Var(r) = σ^2 (I − H_λ)^2

  Note: Bias(A â) = A Bias(â) may be small even if Bias(â) is significant

- Corrected residuals: r̂_i = r_i / (1 − (H_λ)_ii)

Estimating σ and λ

- λ can be chosen using the same data y (GCV, L-curve, discrepancy principle, etc.) or independent training data. This selection introduces an additional error

- If δ ≈ 0, σ^2 can be estimated as in least-squares:

  σ̂^2 = ‖y − ŷ‖^2 / (n − dof(H_λ));

  otherwise one may consider y = µ + ε with µ = A[f] and use methods from nonparametric regression

Resampling methods

Idea: if f̂ and σ̂ are 'good' estimates, then one should be able to generate synthetic data consistent with the actual observations

- If f̂ is a good estimate ⇒ make synthetic data (see the sketch after this list)

  y_1^* = A[f̂] + ε_1^*, ..., y_B^* = A[f̂] + ε_B^*

  with the ε_i^* drawn from the same distribution as ε. Use the same estimation procedure to get f̂_i^* from y_i^*:

  y_1^* → f̂_1^*, ..., y_B^* → f̂_B^*

- Approximate the distribution of f̂ with that of f̂_1^*, ..., f̂_B^*

Generating ε^* similar to ε

- Parametric resampling: ε^* ∼ N(0, σ̂^2 I)

- Nonparametric resampling: sample ε^* with replacement from the corrected residuals

- Problems:
  - Possibly badly biased and computationally intensive
  - Residuals may also need to be corrected for correlation structure
  - A bad f̂ may lead to misleading results

- Training sets: let {f_1, ..., f_k} be a training set (e.g., historical data)
  - For each f_j use resampling to estimate MSE(f̂_j)
  - Study the variability of MSE(f̂_j) / ‖f_j‖
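A sketch of the parametric version of this bootstrap for a linear estimator f̂ = Ly; the estimator, σ̂, and the number of replicates are placeholders, and the nonparametric variant would instead resample ε^* from the corrected residuals:

    import numpy as np

    def parametric_bootstrap(A, L, f_hat, sigma_hat, B=500, rng=None):
        """Draw B estimates f_hat_b^* from synthetic data y_b^* = A f_hat + eps_b^*, eps^* ~ N(0, sigma_hat^2 I)."""
        rng = rng or np.random.default_rng(3)
        n = A.shape[0]
        boot = np.empty((B, len(f_hat)))
        for b in range(B):
            y_star = A @ f_hat + sigma_hat * rng.standard_normal(n)   # synthetic data
            boot[b] = L @ y_star                                      # re-estimate with the same procedure
        return boot

    # Approximate the distribution of f_hat by the rows of `boot`, e.g. pointwise standard errors:
    # se = parametric_bootstrap(A, L, f_hat, sigma_hat).std(axis=0)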

Bayesian inversion

- Hierarchical model (simple Gaussian case):

  y | a, θ ∼ N(Aa + δ, σ^2 I),   a | θ ∼ F_a(· | θ),   θ ∼ F_θ

  θ = (δ, σ, τ)

- It all reduces to sampling from the posterior F(a | y) ⇒ use MCMC methods

- Possible problems:
  - Selection of priors
  - Convergence of MCMC
  - Computational cost of MCMC in high dimensions
  - Interpretation of results

Warning

Parameters can be tuned so that

  f̂ = E(f | y),

but this does not mean that the uncertainty of f̂ is obtained from the posterior of f given y

Warning

Bayesian or frequentist inversions can be used for uncertainty quantification, but:

- Uncertainties in each have different interpretations that should be understood

- Both make assumptions (sometimes stringent) that should be revealed

- The validity of the assumptions should be explored ⇒ exploratory analysis and model validation

Exploratory data analysis (by example)

Example: ℓ2 and ℓ1 estimates

  data = y = Af + ε = AWβ + ε

  f̂_ℓ2 = arg min_g ‖y − Ag‖_2^2 + λ^2 ‖Dg‖_2^2,   λ from GCV

  f̂_ℓ1 = W β̂,   β̂ = arg min_b ‖y − AWb‖_2^2 + λ ‖b‖_1,   λ from the discrepancy principle

  E(ε) = 0,  Var(ε) = σ^2 I,   σ̂ = smoothing spline estimate

Example: y, Af & f

Tools: robust simulation summaries

- Sample mean and variance of x_1, ..., x_m:

  X̄ = (x_1 + ··· + x_m)/m,   S^2 = (1/m) Σ_i (x_i − X̄)^2

- Recursive formulas (see the sketch below):

  X̄_{n+1} = X̄_n + (x_{n+1} − X̄_n)/(n + 1)

  T_{n+1} = T_n + (n/(n+1)) (x_{n+1} − X̄_n)^2,   S_n^2 = T_n / n

- Not robust
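A sketch of the recursive (Welford-style) updates above, convenient when simulation output is summarized on the fly:

    class RunningMeanVar:
        """Recursive mean/variance: X_bar_{n+1} = X_bar_n + (x_{n+1} - X_bar_n)/(n+1),
        T_{n+1} = T_n + (n/(n+1)) (x_{n+1} - X_bar_n)^2,  S_n^2 = T_n / n."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.T = 0.0

        def update(self, x):
            delta = x - self.mean               # x_{n+1} - X_bar_n
            self.n += 1
            self.mean += delta / self.n         # X_bar_{n+1}
            self.T += delta * (x - self.mean)   # adds (n/(n+1)) * delta^2

        @property
        def var(self):
            return self.T / self.n if self.n else float("nan")

    # acc = RunningMeanVar()
    # for x in data: acc.update(x)
    # acc.mean, acc.var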

Robust measures

- Median and median absolute deviation from the median (MAD):

  X̃ = median{x_1, ..., x_m},   MAD = median{ |x_1 − X̃|, ..., |x_m − X̃| }

- Approximate recursive formulas with good asymptotics are available

Stability of f̂_ℓ2

Stability of f̂_ℓ1

Tools: trace estimators

  Trace(H) = Σ_i H_ii = Σ_i e_i^t H e_i

- Expensive for large H

- Randomized estimator: H symmetric n × n; U_1, ..., U_m with U_ij iid Unif{−1, 1}. Define

  T̂_m(H) = (1/m) Σ_{i=1}^m U_i^t H U_i

- E( T̂_m(H) ) = Trace(H)

- Var( T̂_m(H) ) = (2/m) [ Trace(H^2) − ‖diag(H)‖^2 ]
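A sketch of this randomized trace estimator with Rademacher probes; `matvec` is a placeholder name for any routine returning H u, so H never has to be formed explicitly:

    import numpy as np

    def hutchinson_trace(matvec, n, m=100, rng=None):
        """Estimate Trace(H) by (1/m) sum_i U_i^t H U_i with U_ij iid Uniform{-1, +1}."""
        rng = rng or np.random.default_rng(4)
        total = 0.0
        for _ in range(m):
            u = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
            total += u @ matvec(u)                # U_i^t H U_i
        return total / m

    # Example: effective degrees of freedom dof(H_lam) = Trace(H_lam) of the Tikhonov smoother,
    # using only solves and matrix-vector products:
    # matvec = lambda u: A @ np.linalg.solve(A.T @ A + lam**2 * (D.T @ D), A.T @ u)
    # dof = hutchinson_trace(matvec, n=A.shape[0])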

Some bounds

- Relative variance:

  V_r(T̂_m) = Var( T̂_m(H) ) / Trace(H)^2 ≤ (2/(mn)) ( S_σ^2 / σ̄^2 ),

  where σ̄ = (1/n) Σ_i σ_i and S_σ^2 = (1/n) Σ_i (σ_i − σ̄)^2 (σ_i the eigenvalues of H)

- Concentration: for any t > 0,

  P[ T̂_m(H) ≥ Trace(H)(1 + t) ] ≤ e^{−m t^2 / (4 V_r(T̂_1))}

  P[ T̂_m(H) ≤ Trace(H)(1 − t) ] ≤ e^{−m t^2 / (4 V_r(T̂_1))}

- Example: the 'hat matrix' H in least-squares projects onto a k-dimensional space:

  V_r(T̂_m) ≤ (2/(mk)) ( 1 − k/n )

  P[ T̂_m(H) ≥ Trace(H)(1 + t) ] ≤ e^{−m k t^2 / 8}.

- Example: approximating the determinant of A > 0:

  log( Det(A) ) = Trace( log(A) ) ≈ (1/m) Σ_i U_i^t log(A) U_i

  log(A) U_i can be computed using only matrix-vector products with A

CI for bias of A f̂_ℓ2

- A good estimate of f ⇒ a good estimate of Af

- y: direct observation of Af; less sensitive to λ̂

- 1 − α CIs for the bias of A f̂ (with ŷ = Hy):

  (ŷ_i − y_i)/(1 − H_ii) ± z_{α/2} σ̂ √( 1 + ((H^2)_ii − (H_ii)^2) / (1 − H_ii)^2 )

  (ŷ_i − y_i)/(Trace(I − H)/n) ± z_{α/2} σ̂ / √( Trace(I − H)/n )

95% CIs for bias

Coverage of 95% CIs for bias

Bias bounds for f̂_ℓ2

- For fixed λ:

  Var( f̂_ℓ2 ) = σ^2 G(λ)^{-1} A^t A G(λ)^{-1}

  Bias( f̂_ℓ2 ) = E( f̂_ℓ2 ) − f = B(λ),

  G(λ) = A^t A + λ^2 D^t D,   B(λ) = −λ^2 G(λ)^{-1} D^t D f.

- Median bias: Bias_M( f̂_ℓ2 ) = median( f̂_ℓ2 ) − f

- For λ estimated:

  Bias_M( f̂_ℓ2 ) ≈ median[ B(λ̂) ]

- Hölder's inequality:

  | B(λ̂)_i | ≤ λ̂^2 U_{p,i}(λ̂) ‖Df‖_q,    | B(λ̂)_i | ≤ λ̂^2 U_{p,i}^w(λ̂) ‖β‖_q

  U_{p,i}(λ̂) = ‖ D G(λ̂)^{-1} e_i ‖_p   or   U_{p,i}^w(λ̂) = ‖ W D^t D G(λ̂)^{-1} e_i ‖_p

- Plots of U_{p,i}(λ̂) and U_{p,i}^w(λ̂) vs. x_i

Assessing fit

- Compare y to A f̂ + ε̂

- Parametric/nonparametric bootstrap samples y^*:
  (i) (parametric) ε_i^* ∼ N(0, σ̂^2 I), or (nonparametric) sample from the corrected residuals r_c
  (ii) y_i^* = A f̂ + ε_i^*

- Compare y and {y_1^*, ..., y_B^*} (blue, red)

- Compare statistics of r_c and N(0, σ̂^2 I) (black)

- Compare parametric vs. nonparametric results (red)

- Example: simulations for A f̂_ℓ2 with: y^(1) good, y^(2) with skewed noise, y^(3) biased

Example

  T_1(y) = min(y)

  T_2(y) = max(y)

  T_k(y) = (1/n) Σ_{i=1}^n ( y_i − mean(y) )^k / std(y)^k   (k = 3, 4)

  T_5(y) = sample median(y)

  T_6(y) = MAD(y)

  T_7(y) = 1st sample quartile(y)

  T_8(y) = 3rd sample quartile(y)

  T_9(y) = # of runs above/below median(y)

  P_j = P[ |T_j(y^*)| ≥ |T_j^o| ],   T_j^o = T_j(y)
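A sketch of these statistics and their bootstrap p-values; the runs counter and the use of scipy's skewness/kurtosis are simplified stand-ins for whatever implementation the lecture used:

    import numpy as np
    from scipy import stats

    def fit_statistics(y):
        """Statistics T_1, ..., T_9 used to compare observed and bootstrap data."""
        med = np.median(y)
        above = (y > med).astype(int)
        runs = 1 + int(np.sum(np.diff(above) != 0))   # runs above/below the median
        return np.array([
            np.min(y), np.max(y),
            stats.skew(y),                            # T_3 (k = 3)
            stats.kurtosis(y, fisher=False),          # T_4 (k = 4)
            med, stats.median_abs_deviation(y),
            np.percentile(y, 25), np.percentile(y, 75),
            runs,
        ])

    def bootstrap_pvalues(y, y_star):
        """P_j = P[ |T_j(y^*)| >= |T_j^o| ], estimated over bootstrap samples y_star (one per row)."""
        T_obs = np.abs(fit_statistics(y))
        T_boot = np.abs(np.vstack([fit_statistics(ys) for ys in y_star]))
        return (T_boot >= T_obs).mean(axis=0)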

Assessing fit: Bayesian case

- Hierarchical model:

  y | f, σ, γ ∼ N(Af, σ^2 I),   f | γ ∝ exp( −f^t D^t D f / (2γ^2) ),   (σ, γ) ∼ π,

  with E(f | y) = ( A^t A + (σ^2/γ^2) D^t D )^{-1} A^t y

- Posterior mean = f̂_ℓ2 for γ = σ/λ

- Sample (f^*, σ^*) from the posterior of (f, σ) | y ⇒ simulated data y^* = A f^* + σ^* ε^* with ε^* ∼ N(0, I)

- Explore the consistency of y with the simulated y^*

Simulating f

- Modify the simulations from y_i^* = A f̂ + ε_i^* to

  y_i^* = A f_i + ε_i^*,   f_i 'similar' to f

- Frequentist 'simulation' of f?

  One option: f_i from historical data (a training set)

- Check the relative MSE for the different f_i

- Results have different interpretations. Different meanings of probability: the frequentist P is not the Bayesian P

- Frequentists can use Bayesian methods to derive procedures; Bayesians can use frequentist methods to evaluate procedures

- Consistent results for small problems and lots of data

- Possibly very different results for complex, large-scale problems

- Theorems do not address the relation of theory to reality

Personal view

- Neither is a substitute for detective work, common sense and honesty

- Some say that Bayesian is the only way, others that frequentist is the best way, while those with vision see ...

Application: experimental design

Classic framework

  y_i = f(x_i)^t θ + ε_i  ⇒  y = Fθ + ε,  where:

- Experimental configurations x_i ∈ X ⊂ IR^p
- θ ∈ IR^m unknown parameters
- ε_i iid, zero mean, variance σ^2

- Experimental design question: how many times should each configuration x_i be used to get the 'best' estimate of θ? I.e., the variance σ_i^2 of each selected x_i is controlled by averaging independent replications

- More general approach: control the σ_i^2 to get the 'best' estimate of θ (i.e., not necessarily by averaging)

  Let σ_{t,i} be the 'optimal' target variance for x_i. Set the constraint

  Σ_i σ_i^2 / σ_{t,i}^2 = N   for fixed N ∈ IR.

  Define w_i = σ_i^2 / (N σ_{t,i}^2)  ⇒

  w_i ≥ 0,   Σ_i w_i = 1   and   σ_{t,i}^2 = (σ_i^2 / N) w_i^{-1}

- i-th experimental configuration: (x_i, w_i)

- Experiment ξ: {x_1, ..., x_n, w_1, ..., w_n}

- Well-posed case: F^t F = Σ_i f(x_i) f(x_i)^t is well conditioned

  θ̂ = arg min_θ Σ_i w_i ( y_i − f(x_i)^t θ )^2 = ( F^t W F )^{-1} F^t W y

  W = diag(w_i);  θ̂ is unbiased with

  Var(θ̂) = σ^2 ( F^t W F )^{-1} = σ^2 M(ξ)^{-1},

  M(ξ) = information matrix = F^t W F = Σ_i w_i f(x_i) f(x_i)^t

  Write D(ξ) = M(ξ)^{-1}

- Optimization problem (see the sketch below):

  min_w Ψ( D(ξ) )   s.t.  w ≥ 0,  Σ_i w_i = 1

  D-design: Ψ = det;   A-design: Ψ = Trace ⇒ minimization of MSE(θ̂)

  The optimal w_i are independent of σ and N
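A sketch of the A-design problem as a numerical optimization over the weight simplex with scipy.optimize; the softmax parametrization of w is just one convenient way to enforce the constraints, and the quadratic-regression example is made up:

    import numpy as np
    from scipy.optimize import minimize

    def a_optimal_weights(F):
        """Minimize Trace((F^t W F)^{-1}) over w >= 0 with sum(w) = 1, W = diag(w)."""
        n = F.shape[0]

        def weights(z):                        # softmax keeps the weights on the simplex
            e = np.exp(z - z.max())
            return e / e.sum()

        def objective(z):
            w = weights(z)
            M = F.T @ (w[:, None] * F)         # information matrix M(xi) = sum_i w_i f(x_i) f(x_i)^t
            return np.trace(np.linalg.inv(M))  # A-design criterion Trace(D(xi))

        res = minimize(objective, np.zeros(n), method="Nelder-Mead",
                       options={"maxiter": 20000, "fatol": 1e-10, "xatol": 1e-8})
        return weights(res.x)

    # Example (hypothetical): quadratic regression f(x) = (1, x, x^2) on a 5-point candidate grid
    # x = np.linspace(-1.0, 1.0, 5); F = np.vander(x, 3, increasing=True)
    # w = a_optimal_weights(F)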

Ill-posed problems

- 1D magnetotelluric example:

  d_j = ∫_0^1 exp(−α_j x) cos(γ_j x) m(x) dx + ε_j,   j = 1, ..., n

  γ_j: recording frequencies;  α_j: determined by the frequencies and the background conductivity;  m(x): conductivity

- Goal: choose the γ_j, α_j to determine a smooth m consistent with the data

Ill-posed problems

- Ill-conditioning of F requires regularization

- Regularized estimate:

  θ̂ = arg min_θ Σ_i w_i [ y_i − f(x_i)^t θ ]^2 + λ^2 ‖Dθ‖^2 = ( F^t W F + λ^2 D^t D )^{-1} F^t W y

- MSE(θ̂) depends on the unknown θ ⇒ control an 'average' MSE

- Use prior moment conditions:

  E(y | θ, γ) = Fθ,   Var(y | θ, γ) = σ^2 W^{-1}/N

  E(θ | γ) = θ_0,   Var(θ | γ) = γ^2 Σ

  E(γ^2) = µ_γ,   with σ^2 and Σ known and γ = σ/λ

- Affine estimate minimizing the Bayes risk (δ^2 = E(1/λ^2)):

  θ̂ = ( N δ^2 F^t W F + Σ^{-1} )^{-1} ( N δ^2 F^t W y + Σ^{-1} θ_0 )

  E[ (θ̂ − θ)(θ̂ − θ)^t ] = (τ σ^2 / N) ( τ F^t W F + (1 − τ) Σ^{-1} )^{-1},   τ ∈ (0, 1)

- A-design optimization:

  min_w Trace( ( τ F^t W F + (1 − τ) Σ^{-1} )^{-1} )   s.t.  w ≥ 0,  Σ_i w_i = 1.

- The optimal w_i depend on σ and N

Readings

Basic Bayesian methods:
- Carlin, BP & Louis, TA, 2008, Bayesian Methods for Data Analysis (3rd edn.), Chapman & Hall
- Gelman, A, Carlin, JB, Stern, HS & Rubin, DB, 2003, Bayesian Data Analysis (2nd edn.), Chapman & Hall

Bayesian methods for inverse problems:
- Calvetti, D & Somersalo, E, 2007, Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing, Springer
- Kaipio, J & Somersalo, E, 2004, Statistical and Computational Inverse Problems, Springer

Discussion of frequentist vs Bayesian statistics:
- Freedman, D, 1995, Some issues in the foundation of statistics, Foundations of Science, 1, 19–39; rejoinder pp. 69–83

Tutorial on frequentist and Bayesian methods for inverse problems:
- Stark, PB & Tenorio, L, 2011, A primer of frequentist and Bayesian inference in inverse problems. In Computational Methods for Large-Scale Inverse Problems and Quantification of Uncertainty, pp. 9–32, Wiley

Statistics and inverse problems:
- Evans, SN & Stark, PB, 2003, Inverse problems as statistics, Inverse Problems, 18, R55
- O'Sullivan, F, 1986, A statistical perspective on ill-posed inverse problems, Statistical Science, 1, 502–527
- Tenorio, L, Andersson, F, de Hoop, M & Ma, P, 2011, Data analysis tools for uncertainty quantification of inverse problems, Inverse Problems, 27, 045001
- Wahba, G, 1990, Spline Models for Observational Data, SIAM
- Special section in Inverse Problems: Volume 24, Number 3, June 2008

Mathematical statistics:
- Casella, G & Berger, RL, 2001, Statistical Inference, Duxbury Press
- Schervish, MJ, 1996, Theory of Statistics, Springer