Statistics for Inverse Problems 101
June 5, 2011

Introduction
- Inverse problem: recover an unknown object from indirect, noisy observations.
- Example: deblurring of a signal f
  Forward problem: μ(x) = ∫₀¹ A(x − t) f(t) dt
  Noisy data: y_i = μ(x_i) + ε_i, i = 1, …, n
  Inverse problem: recover f(t) for t ∈ [0, 1]

General model
- A = (A_i), with A_i : H → ℝ and f ∈ H
- y_i = A_i[f] + ε_i, i = 1, …, n
- The recovery of f from A[f] is ill-posed
- The errors ε_i are modeled as random
- The estimate f̂ is an H-valued random variable

Basic question (UQ)
How good is the estimate f̂?
- What is the distribution of f̂?
- Summarize characteristics of the distribution of f̂: e.g., how 'likely' is it that f̂ will be 'far' from f?

Review: E, Var & Cov
(1) Expected value of a random variable X with pdf f(x): E(X) = μ_X = ∫ x f(x) dx
(2) Variance of X: Var(X) = E((X − μ_X)²)
(3) Covariance of random variables X and Y: Cov(X, Y) = E((X − μ_X)(Y − μ_Y))
(4) Expected value of a random vector X = (X_i): E(X) = μ_X = (E(X_i))
(5) Covariance matrix of X: Var(X) = E((X − μ_X)(X − μ_X)ᵗ) = (Cov(X_i, X_j))
(6) For a fixed n × m matrix A: E(AX) = A E(X), Var(AX) = A Var(X) Aᵗ

Example: X = (X_1, …, X_n) correlated with Var(X) = σ²Σ. Then
  Var((X_1 + ⋯ + X_n)/n) = Var(1ᵗX/n) = (σ²/n²) 1ᵗΣ1

Review: least squares
y = Aβ + ε
- A is n × m with n > m, and AᵗA has a stable inverse
- E(ε) = 0, Var(ε) = σ²I
- Least-squares estimate of β: β̂ = argmin_b ‖y − Ab‖² = (AᵗA)⁻¹Aᵗy
- β̂ is an unbiased estimator of β: E_β(β̂) = β for all β
- Covariance matrix of β̂: Var_β(β̂) = σ²(AᵗA)⁻¹
- If ε is Gaussian N(0, σ²I), then β̂ is the MLE and β̂ ~ N(β, σ²(AᵗA)⁻¹)
- Unbiased estimator of σ²:
  σ̂² = ‖y − Aβ̂‖²/(n − m) = ‖y − ŷ‖²/(n − dof(H)),
  where ŷ = Aβ̂ = Hy and H = A(AᵗA)⁻¹Aᵗ is the 'hat matrix'

Confidence regions
- 1 − α confidence interval for β_i:
  I_i(α): β̂_i ± t_{α/2, n−m} σ̂ √(((AᵗA)⁻¹)_{ii})
- C.I.'s are pre-data: the 1 − α probability refers to the procedure, not to a particular realized interval
- Pointwise vs. simultaneous 1 − α coverage:
  pointwise: P[β_i ∈ I_i(α)] = 1 − α for each i
  simultaneous: P[β_i ∈ I_i(α) for all i] = 1 − α
- It is easy to find a 1 − α joint confidence region R_α for β that gives simultaneous coverage; and if R_α^f = {f(b) : b ∈ R_α}, then P[f(β) ∈ R_α^f] ≥ 1 − α

Residuals
r = y − Aβ̂ = y − ŷ
- E(r) = 0, Var(r) = σ²(I − H)
- If ε ~ N(0, σ²I), then r ~ N(0, σ²(I − H)), so the residuals do not behave like the original errors ε
- Corrected residuals: r̃_i = r_i / √(1 − H_ii)
- Corrected residuals may be used for model validation
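A minimal numerical sketch of the least-squares formulas above. NumPy/SciPy, the simulated design matrix, the true coefficients, the noise level, and the seed are illustrative assumptions, not part of the slides.

    # Least-squares estimate, covariance, pointwise confidence intervals,
    # and corrected residuals, following the formulas above.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, m = 50, 3
    A = rng.normal(size=(n, m))                 # illustrative design matrix
    beta_true = np.array([1.0, -2.0, 0.5])
    sigma = 0.3
    y = A @ beta_true + sigma * rng.normal(size=n)

    AtA_inv = np.linalg.inv(A.T @ A)
    beta_hat = AtA_inv @ A.T @ y                # (A'A)^{-1} A'y
    H = A @ AtA_inv @ A.T                       # hat matrix
    y_hat = H @ y
    sigma2_hat = np.sum((y - y_hat) ** 2) / (n - m)   # unbiased estimate of sigma^2

    # Pointwise 1 - alpha confidence intervals for each beta_i
    alpha = 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - m)
    se = np.sqrt(sigma2_hat * np.diag(AtA_inv))
    ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

    # Corrected residuals r_i / sqrt(1 - H_ii), useful for model checking
    r = y - y_hat
    r_corr = r / np.sqrt(1 - np.diag(H))
    print(beta_hat, ci, sep="\n")

Each interval I_i(α) has 1 − α pointwise coverage; simultaneous coverage over all i would require a joint region as noted above.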
Review: Bayes theorem
- Conditional probability measure P(· | B): P(A | B) = P(A ∩ B)/P(B) if P(B) ≠ 0
- Bayes theorem: for P(A) ≠ 0, P(A | B) = P(B | A) P(A)/P(B)
- For X and Θ with pdf's: f_{Θ|X}(θ | x) ∝ f_{X|Θ}(x | θ) f_Θ(θ)
- More generally: dμ_{Θ|X}/dμ_Θ ∝ dμ_{X|Θ}/dν
- Basic tools: conditional probability measures and Radon-Nikodym derivatives
- Conditional probabilities can be defined via Kolmogorov's definition of conditional expectations or via conditional distributions (e.g., through disintegration of measures)

Review: Gaussian Bayesian linear model
y | β ~ N(Aβ, σ²I), β ~ N(μ, τ²Σ), with μ, σ, τ, Σ known
- Goal: update the prior distribution of β in light of the data y.
- Compute the posterior distribution of β given y (use Bayes theorem):
  β | y ~ N(μ_y, Σ_y)
  Σ_y = σ² (AᵗA + (σ²/τ²)Σ⁻¹)⁻¹
  μ_y = (AᵗA + (σ²/τ²)Σ⁻¹)⁻¹ (Aᵗy + (σ²/τ²)Σ⁻¹μ)
      = a weighted average of the least-squares estimate β̂_ls and the prior mean μ
- Summarizing the posterior:
  μ_y = E(β | y) = mean and mode of the posterior (MAP)
  Σ_y = Var(β | y) = covariance matrix of the posterior
  1 − α credible interval for β_i: (μ_y)_i ± z_{α/2} √((Σ_y)_{ii})
  A 1 − α joint credible region for β can also be constructed

Some properties
- μ_y minimizes E‖δ(y) − β‖² among all estimators δ(y)
- μ_y → β̂_ls as σ²/τ² → 0, and μ_y → μ as σ²/τ² → ∞
- μ_y can be used as a frequentist estimate of β: μ_y is biased, and its variance is NOT obtained by sampling from the posterior
- Warning: a 1 − α credible region MAY NOT have 1 − α frequentist coverage
- If μ, σ, τ or Σ are unknown, use hierarchical or empirical Bayes methods and MCMC

Parametric reductions for IP
- Transform the infinite-dimensional problem into a finite-dimensional one, i.e., A[f] ≈ Aa.
- Assumptions:
  y = A[f] + ε = Aa + δ + ε
  Function representation: f(x) = φ(x)ᵗ a
  Penalized least-squares estimate:
  â = argmin_b ‖y − Ab‖² + λ² bᵗSb = (AᵗA + λ²S)⁻¹Aᵗy
  f̂(x) = φ(x)ᵗ â

- Example: y_i = ∫₀¹ A(x_i − t) f(t) dt + ε_i
  Quadrature approximation: ∫₀¹ A(x_i − t) f(t) dt ≈ (1/m) Σ_{j=1}^m A(x_i − t_j) f(t_j)
  Discretized system: y = Af + δ + ε, with A_ij = A(x_i − t_j) and f_j = f(t_j)
  Regularized estimate:
  f̂ = argmin_{g ∈ ℝ^m} ‖y − Ag‖² + λ²‖Dg‖² = (AᵗA + λ²DᵗD)⁻¹Aᵗy ≡ Ly
  f̂(x_i) = f̂_i = e_iᵗ Ly

- Example: each A_i continuous on a Hilbert space H. Then there exist orthonormal φ_k ∈ H, orthonormal v_k ∈ ℝⁿ and singular values λ_k ≥ 0 such that A[φ_k] = λ_k v_k, and
  f = f_0 + f_1, with f_0 ∈ Null(A) and f_1 ∈ Null(A)^⊥
  f_0 is not constrained by the data; f_1 = Σ_{k=1}^n a_k φ_k
  Discretized system: y = Va + ε, V = (λ_1 v_1 ⋯ λ_n v_n)
  Regularized estimate:
  â = argmin_{b ∈ ℝⁿ} ‖y − Vb‖² + λ²‖b‖² ≡ Ly
  f̂(x) = φ(x)ᵗ Ly, φ(x) = (φ_i(x))

- Example: each A_i continuous on the RKHS H = W_m([0, 1]). Then H = N_{m−1} ⊕ H_m, and there exist φ_j ∈ H and representers κ_i ∈ H such that
  ‖y − A[f]‖² + λ² ∫₀¹ (f^(m))² = Σ_i (y_i − ⟨κ_i, f⟩)² + λ² ‖P_{H_m} f‖²
  f = Σ_j a_j φ_j + η, with η ∈ span{φ_k}^⊥
  Discretized system: y = Xa + δ + ε
  Regularized estimate:
  â = argmin_{b ∈ ℝⁿ} ‖y − Xb‖² + λ² bᵗPb ≡ Ly
  f̂(x) = φ(x)ᵗ Ly, φ(x) = (φ_i(x))

To summarize:
  y = Aa + δ + ε, f(x) = φ(x)ᵗ a
  â = argmin_{b ∈ ℝⁿ} ‖y − Ab‖² + λ² bᵗPb = (AᵗA + λ²P)⁻¹Aᵗy
  f̂(x) = φ(x)ᵗ â

Bias, variance and MSE of f̂
- Is f̂(x) close to f(x) on average?
  Bias(f̂(x)) = E(f̂(x)) − f(x) = φ(x)ᵗ B_λ a + φ(x)ᵗ G_λ Aᵗ δ,
  where G_λ = (AᵗA + λ²P)⁻¹ and B_λ = G_λ AᵗA − I
- Pointwise sampling variability: Var(f̂(x)) = σ² ‖A G_λ φ(x)‖²
- Pointwise MSE: MSE(f̂(x)) = Bias(f̂(x))² + Var(f̂(x))
- Integrated MSE: IMSE(f̂) = ∫ E|f̂(x) − f(x)|² dx = ∫ MSE(f̂(x)) dx

Review: are unbiased estimators good?
- They don't have to exist
- If they exist, they may be very difficult to find
- A biased estimator may still be better: the bias-variance trade-off
i.e., don't get too attached to them

What about the bias of f̂?
- Upper bounds are sometimes possible
- Geometric information can be obtained from bounds
- Optimization approach: find the max/min of the bias subject to f ∈ S
- Average bias: choose a prior for f and compute the mean bias over the prior
- Example: f in an RKHS H with f(x) = ⟨ρ_x, f⟩:
  Bias(f̂(x)) = ⟨A_x − ρ_x, f⟩, so |Bias(f̂(x))| ≤ ‖A_x − ρ_x‖ ‖f‖
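The sketch below works through the quadrature-discretized deconvolution example and evaluates the bias and variance formulas above at the grid points. The Gaussian blurring kernel, the first-difference penalty D, the true signal, and the values of σ and λ are illustrative assumptions, not taken from the slides; the quadrature weight 1/m is folded into the matrix A.

    # Tikhonov-regularized deconvolution on a quadrature grid, with the
    # pointwise bias and variance of f_hat evaluated at the grid points.
    import numpy as np

    m = 100
    t = (np.arange(m) + 0.5) / m                  # quadrature nodes t_j = x_i
    A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.05) ** 2) / m  # A_ij = A(x_i - t_j)/m
    f_true = np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)    # illustrative signal
    sigma, lam = 0.01, 1e-3
    D = np.diff(np.eye(m), axis=0)                # first-difference penalty ||D g||^2

    G = np.linalg.inv(A.T @ A + lam ** 2 * D.T @ D)   # G_lambda
    L = G @ A.T                                   # f_hat = L y

    rng = np.random.default_rng(1)
    y = A @ f_true + sigma * rng.normal(size=m)
    f_hat = L @ y

    # Bias and variance at the grid points (discretization error delta taken as 0):
    bias = (L @ A - np.eye(m)) @ f_true           # E(f_hat) - f_true
    var = sigma ** 2 * np.sum(L ** 2, axis=1)     # sigma^2 ||A G_lambda e_i||^2
    mse = bias ** 2 + var                         # pointwise MSE

Because f_true is known in this synthetic setting, the bias can be computed exactly; with real data one only has the bounds or average-bias arguments described above.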
Confidence intervals
- If ε is Gaussian and |Bias(f̂(x))| ≤ B(x), then
  f̂(x) ± z_{α/2} (σ ‖A G_λ φ(x)‖ + B(x))
  is a 1 − α confidence interval for f(x)
- Simultaneous 1 − α C.I.'s for E(f̂(x)), x ∈ S: find β such that
  P[ sup_{x∈S} |Zᵗ V(x)| ≥ β ] ≤ α,
  where the Z_i are iid N(0, 1) and V(x) = K G_λ φ(x)/‖K G_λ φ(x)‖, and use results from the theory of maxima of Gaussian processes

Residuals
r = y − Aâ = y − ŷ
- Moments:
  E(r) = −A Bias(â) + δ, Var(r) = σ² (I − H_λ)², with H_λ = A G_λ Aᵗ
  Note: Bias(Aâ) = A Bias(â) may be small even if Bias(â) is significant
- Corrected residuals: r̃_i = r_i / (1 − (H_λ)_ii)

Estimating σ and λ
- λ can be chosen using the same data y (via GCV, the L-curve, the discrepancy principle, etc.) or using independent training data. This selection introduces an additional error.
- If δ ≈ 0, σ² can be estimated as in least squares:
  σ̂² = ‖y − ŷ‖² / (n − dof(H_λ));
  otherwise one may consider y = μ + ε with μ = A[f] and use methods from nonparametric regression.

Resampling methods
Idea: if f̂ and σ̂ are 'good' estimates, then one should be able to generate synthetic data consistent with the actual observations.
- If f̂ is a good estimate, make synthetic data
  y*_1 = A[f̂] + ε*_1, …, y*_B = A[f̂] + ε*_B,
  with the ε*_i drawn from the same distribution as ε.
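To make the resampling idea concrete, here is a parametric-bootstrap sketch for the discretized deconvolution setup used earlier. The kernel, penalty, and the values of λ and σ̂ are again illustrative, and refitting the replicates and forming a pointwise percentile band is one common way to use them, not a step stated on the slide.

    # Parametric bootstrap: generate y* = A f_hat + eps*, eps* ~ N(0, sigma_hat^2 I),
    # refit each synthetic data set, and summarize the spread of the refits.
    import numpy as np

    rng = np.random.default_rng(2)
    m = 100
    t = (np.arange(m) + 0.5) / m
    A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 0.05) ** 2) / m  # illustrative kernel
    D = np.diff(np.eye(m), axis=0)
    lam, sigma_hat = 1e-3, 0.01                    # pretend these were already estimated
    L = np.linalg.solve(A.T @ A + lam ** 2 * D.T @ D, A.T)   # f_hat = L y

    f_true = np.sin(2 * np.pi * t)
    y = A @ f_true + sigma_hat * rng.normal(size=m)
    f_hat = L @ y                                  # estimate from the actual data

    B = 500
    boot = np.empty((B, m))
    for b in range(B):
        y_star = A @ f_hat + sigma_hat * rng.normal(size=m)   # synthetic data y*_b
        boot[b] = L @ y_star                                   # refitted estimate
    band = np.percentile(boot, [2.5, 97.5], axis=0)            # pointwise bootstrap band

Note that such bands reflect sampling variability around f̂; they do not account for the bias of the regularized estimate discussed above.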