Uncertainty Quantification in Ill-Posed Inverse Problems: Case Studies in the Physical Sciences

Mikael Kuusela, Department of Statistics and Data Science, Carnegie Mellon University

CERN-EP/IT Data Science Seminar

April 1, 2020

Motivation: Heat equation

Consider the heat equation on the real line:

    ∂u(x, t)/∂t = k ∂²u(x, t)/∂x²,  (x, t) ∈ R × [0, ∞),
    u(x, 0) = f(x),  x ∈ R

Forward problem: Given the initial condition f(x) = u(x, 0), what is the temperature distribution g(x) = u(x, t*) at time t*?

Inverse problem: Given the temperature distribution g(x) = u(x, t*) at time t*, what was the initial distribution f(x) = u(x, 0) at time t = 0?

Motivation: Heat equation

The forward problem is easy:

    u(x, t) = ∫_{−∞}^{∞} Φ_t(x − y) f(y) dy,  where  Φ_t(x) = (1/√(4πkt)) exp(−x²/(4kt))

So we have a well-defined integral operator K that maps the initial distribution f to the distribution g at time t*:

    g(x) = u(x, t*) = ∫_{−∞}^{∞} Φ_{t*}(x − y) f(y) dy = (Kf)(x)

The inverse problem is to solve g = Kf for f. This is fundamentally difficult. Let's see why.
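To make the ill-posedness concrete, here is a minimal numerical sketch (my own illustration, not the code behind the plots that follow): it discretizes the heat-kernel forward operator on a grid and applies it to two visibly different initial conditions. The grid, the diffusivity k = 1 and the time t* = 1 are assumed choices.

```python
import numpy as np

# Sketch of the forward operator (Kf)(x) = int Phi_t(x - y) f(y) dy on a uniform grid.
k, t_star = 1.0, 1.0                       # diffusivity and observation time (assumed values)
x = np.linspace(-25.0, 15.0, 801)          # spatial grid
dx = x[1] - x[0]

def forward(f, t=t_star):
    """Apply the heat kernel by quadrature: g_i = sum_j Phi_t(x_i - x_j) f_j dx."""
    diff = x[:, None] - x[None, :]
    Phi = np.exp(-diff**2 / (4.0 * k * t)) / np.sqrt(4.0 * np.pi * k * t)
    return Phi @ f * dx

# Two initial conditions that differ by a visible oscillatory component
f1 = np.exp(-0.5 * (x + 5.0)**2) / np.sqrt(2.0 * np.pi)
f2 = f1 + 0.05 * np.sin(np.pi * x) * np.exp(-0.5 * (x + 5.0)**2)
g1, g2 = forward(f1), forward(f2)

# The inputs differ visibly, the outputs by orders of magnitude less:
print("max |f1 - f2| =", np.max(np.abs(f1 - f2)))
print("max |g1 - g2| =", np.max(np.abs(g1 - g2)))
```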

Motivation: Heat equation

[Figures: the initial distribution together with the temperature distribution at times t = 0.05, 0.2, 0.5, 1, 10 and 100]

Motivation: Heat equation

Now, consider the two initial distributions shown below and run them forward in time.

[Figures: two different initial distributions (left and right) and the corresponding distributions at time t = 1]

The distributions at time t = 1 are almost indistinguishable! If the distribution at time t = 1 is observed with even the tiniest amount of noise, then there is no way to distinguish between these two initial distributions based on the distribution at time t = 1.

Ill-posed inverse problems

Ill-posed inverse problems are problems of the type

    g = K(f),

where the mapping K is such that it can take inputs f_1 and f_2 that look very different into outputs g_1 and g_2 that are very similar
This means that, without further information, data collected in the g-space can only mildly constrain the solution in the f-space

Ill-posed inverse problems

A classical example is the Fredholm integral operator

    g(x) = ∫ k(x, y) f(y) dy,

where k(x, y) is a problem-specific integration kernel

Also more complex operators, including nonlinear ones, arise in important practical applications

Ill-posed inverse problems in the physical sciences

Ill-posed inverse problems are ubiquitous in the physical sciences

Some illustrative examples:

[Images: black hole image from the Event Horizon Telescope; inversion of ice sheet flow parameters (Isaac et al., 2015); positron emission tomography of the human brain]

Case studies

Throughout this talk, I will specifically focus on the following two problems:

Large Hadron Collider: unfolding of detector smearing in differential cross-section measurements
Orbiting Carbon Observatory-2: space-based retrieval of atmospheric carbon dioxide concentration

The unfolding problem

Any differential cross section measurement is affected by the finite resolution of the particle detectors
This causes the observed spectrum of events to be "smeared" or "blurred" with respect to the true one
The unfolding problem is to estimate the true spectrum using the smeared observations
Ill-posed inverse problem with major methodological challenges

[Figure: smeared spectrum (left) and true spectrum (right) as functions of the physical observable; folding maps the true intensity to the smeared one, unfolding goes in the opposite direction]

Problem formulation

Let f be the true, particle-level spectrum and g the smeared, detector-level spectrum
Denote the true space by T and the smeared space by S (both taken to be intervals on the real line for simplicity)
Mathematically f and g are the intensity functions of the underlying Poisson point process
The two spectra are related by

    g(s) = ∫_T k(s, t) f(t) dt,

where the smearing kernel k represents the response of the detector and is given by

    k(s, t) = p(Y = s | X = t, X observed) P(X observed | X = t),

where X is a true event and Y the corresponding smeared event

Task: Infer the true spectrum f given smeared observations from g

Discretization

Problem usually discretized using histograms (splines are also sometimes used)
Let {T_i}_{i=1}^p and {S_i}_{i=1}^n be binnings of the true space T and the smeared space S
Smeared histogram y = [y_1, ..., y_n]^T with mean

    µ = [∫_{S_1} g(s) ds, ..., ∫_{S_n} g(s) ds]^T

Quantity of interest:

    x = [∫_{T_1} f(t) dt, ..., ∫_{T_p} f(t) dt]^T

The mean histograms are related by µ = Kx, where the elements of the response matrix K are given by

    K_{i,j} = (∫_{S_i} ∫_{T_j} k(s, t) f(t) dt ds) / (∫_{T_j} f(t) dt) = P(smeared event in bin i | true event in bin j)

The discretized model becomes

    y ∼ Poisson(Kx)

and we wish to make inferences about x under this model

OCO-2: general model

x ∈ R^p: state vector, F: R^p → R^n: forward model, ε ∈ R^n: instrument noise, y ∈ R^n: radiance observations, so that y = F(x) + ε

OCO-2: linearized surrogate model (Hobbs et al., 2017)

State vector x:
  CO2 profile (layer 1 to layer 20) [20 elements]
  surface pressure [1 element]
  surface albedo [6 elements]
  aerosols [12 elements]
Forward model F: linearized using the Jacobian K(x) = ∂F(x)/∂x
Noise ε: normal approximation of Poisson noise
Observations y: radiances in 3 near-infrared bands [1024 in each band]
  O2 A-band (around 0.76 microns)
  weak CO2 band (around 1.61 microns)
  strong CO2 band (around 2.06 microns)

Synthesis

Both HEP unfolding and CO2 remote sensing can be approximated reasonably well using the Gaussian linear model:

    y = Kx + ε,  ε ∼ N(0, Σ),

where y ∈ R^n are the observations, x ∈ R^p is the unknown quantity and K ∈ R^{n×p} is an ill-conditioned matrix

A few caveats:
  May have n ≥ p or n < p
  Even when n ≥ p, may have rank(K) < p
  Usually have physical constraints for x, for example x ≥ 0

The fundamental problem, though, is that K has a huge condition number
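As a quick illustration of how severe this can be, here is a small sketch (an assumed toy configuration, not the matrices from the case studies): a response matrix for Gaussian smearing with unit resolution on 40 equal-width bins of [−7, 7], using a midpoint approximation within each true bin.

```python
import numpy as np
from scipy.stats import norm

edges = np.linspace(-7.0, 7.0, 41)                 # 40 equal-width bins
centers = 0.5 * (edges[:-1] + edges[1:])

# K[i, j] = P(smeared event in bin i | true event at the center of bin j), Gaussian smearing sigma = 1
K = (norm.cdf(edges[1:, None], loc=centers[None, :], scale=1.0)
     - norm.cdf(edges[:-1, None], loc=centers[None, :], scale=1.0))

print("condition number of K:", np.linalg.cond(K))  # very large -> severely ill-conditioned
```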

Regularization

In both fields, the standard approaches¹ for handling the ill-posedness are very similar
Find a regularized solution by solving

    x̂ = arg min_{x ∈ R^p} (y − Kx)^T Σ^{−1} (y − Kx) + δ ||L(x − x_a)||²    (∗)

Here, the first term is a data-fit term and the second term penalizes physically implausible solutions
→ Balances the competing requirements of fitting the data and finding a well-behaved solution (bias-variance trade-off)
Two viewpoints — the same answer:
1. Frequentist: Penalized maximum likelihood x̂ = arg max_{x ∈ R^p} log L(x) − δ P(x) ⇒ (∗)
2. Bayesian: Using prior x ∼ N(x_a, δ^{−1}(L^T L)^{−1}), estimate x using the mean/maximum of the posterior p(x|y) ∝ p(y|x) p(x) ⇒ (∗)

¹There are many variants of these ideas, but the fundamental issues remain the same.
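A minimal sketch of this penalized least-squares / MAP estimate in the Gaussian linear model (the helper name is mine; y, K, Sigma, L, delta, x_a are assumed inputs, not values from the talk):

```python
import numpy as np

def tikhonov_estimate(y, K, Sigma, L, delta, x_a):
    """Solve (*): xhat = argmin_x (y - Kx)^T Sigma^{-1} (y - Kx) + delta * ||L (x - x_a)||^2.

    Also returns the covariance of xhat by linear error propagation, which is what
    variance-based (variability) intervals are built from."""
    Si = np.linalg.inv(Sigma)
    P = delta * (L.T @ L)                       # penalty / prior precision
    H = K.T @ Si @ K + P
    xhat = np.linalg.solve(H, K.T @ Si @ y + P @ x_a)
    A = np.linalg.solve(H, K.T @ Si)            # xhat is linear in y: xhat = A y + const
    cov_xhat = A @ Sigma @ A.T
    return xhat, cov_xhat
```

A variability interval for a functional θ = h^T x is then θ̂ ± z_{1−α/2} √(h^T cov_xhat h), which is exactly the construction discussed next.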

Regularized uncertainty quantification

For the rest of this talk, let's assume that we are interested in some functional (or potentially some collection of functionals) of x

Let θ = h^T x be a quantity of interest; for example
  OCO-2: θ = XCO2 = weighted average of the CO2 profile
  HEP: θ = e_i^T x = ith unfolded bin, or θ = aggregate of several unfolded bins
In both frequentist and Bayesian approaches, a natural estimator of θ is θ̂ = h^T x̂

While the frequentist and Bayesian perspectives lead to the same point estimator θ̂, the associated uncertainties are different:
1. Frequentist: 1 − α variability intervals

    [θ, θ] = [θ̂ − z_{1−α/2} √var(θ̂), θ̂ + z_{1−α/2} √var(θ̂)]

2. Bayesian: 1 − α posterior credible intervals

    [θ, θ] = [θ̂ − z_{1−α/2} √var(θ|y), θ̂ + z_{1−α/2} √var(θ|y)]

HEP unfolding typically takes approach 1, while remote sensing uses approach 2

Regularization and frequentist coverage

Let's assume that ultimately we would like to quantify the uncertainty of θ using a confidence interval with well-calibrated frequentist coverage
That is, for a given confidence level 1 − α, we are looking for a random interval [θ, θ] such that

    P(θ ∈ [θ, θ]) ≈ 1 − α,  for any x

When constructed based on a regularized point estimator x̂, both the frequentist variability interval and the Bayesian credible interval have trouble satisfying this criterion
There are always x's such that the regularized estimator x̂ is biased (i.e., E(x̂) ≠ x) and the estimator θ̂ inherits this bias
  Variability intervals ignore the bias and only account for the variance
  ⇒ Undercoverage when large bias, correct coverage when small bias
  Credible intervals are always wider than the variability intervals
  ⇒ Undercoverage when large bias, overcoverage when small bias

Regularization and frequentist coverage

When the noise is Gaussian with known covariance, it is possible to write down the coverage of these intervals in closed form:
1. Variability intervals

    P(θ ∈ [θ, θ]) = Φ( bias(θ̂)/√var(θ̂) + z_{1−α/2} ) − Φ( bias(θ̂)/√var(θ̂) − z_{1−α/2} )

2. Credible intervals

    P(θ ∈ [θ, θ]) = Φ( bias(θ̂)/√var(θ̂) + z_{1−α/2} √(var(θ|y)/var(θ̂)) ) − Φ( bias(θ̂)/√var(θ̂) − z_{1−α/2} √(var(θ|y)/var(θ̂)) )

Both of these are even functions of bias(θ̂) and maximized when bias(θ̂) = 0
The variability intervals have coverage 1 − α if and only if bias(θ̂) = 0; otherwise coverage < 1 − α
The credible intervals have coverage > 1 − α for bias(θ̂) = 0; coverage 1 − α for a certain value of |bias(θ̂)|; and coverage < 1 − α for large |bias(θ̂)|
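These closed-form expressions are easy to evaluate numerically; a small sketch (function names are mine) that reproduces the qualitative behavior described above:

```python
from scipy.stats import norm

def variability_coverage(bias, sd, alpha=0.05):
    """Coverage of thetahat +/- z_{1-alpha/2} * sd for a Gaussian estimator
    with the given bias and standard deviation sd = sqrt(var(thetahat))."""
    z = norm.ppf(1.0 - alpha / 2.0)
    b = bias / sd
    return norm.cdf(b + z) - norm.cdf(b - z)

def credible_coverage(bias, sd, posterior_sd, alpha=0.05):
    """Coverage of the credible interval thetahat +/- z_{1-alpha/2} * sqrt(var(theta|y))."""
    z = norm.ppf(1.0 - alpha / 2.0)
    b, r = bias / sd, posterior_sd / sd
    return norm.cdf(b + z * r) - norm.cdf(b - z * r)

print(variability_coverage(0.0, 1.0))    # 0.95 exactly when the bias vanishes
print(variability_coverage(2.0, 1.0))    # well below 0.95 for a large bias
print(credible_coverage(0.0, 1.0, 1.5))  # above 0.95 when unbiased (credible interval is wider)
```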

Unfolding: Simulation setup

    f(t) = λ_tot ( π_1 N(t | −2, 1) + π_2 N(t | 2, 1) + π_3 · 1/|T| )
    g(s) = ∫_T N(s − t | 0, 1) f(t) dt
    f^MC(t) = λ_tot ( π_1 N(t | −2, 1.1²) + π_2 N(t | 2, 0.9²) + π_3 · 1/|T| )

[Figure: true and smeared intensities for this setup]

Undercoverage in unfolding

[Figure: coverage in the SVD variant of Tikhonov regularization: coverage at the right peak as a function of the regularization strength (left), and binwise coverage for the cross-validated (weighted CV) regularization strength (right)]

Bias and coverage for operational OCO-2 retrievals

[Figure: (a) bias distribution and (b) coverage distribution for operational OCO-2 retrievals, for several realizations of the state vector x]

Coverage for various unfolded confidence intervals

[Figure: binwise empirical coverage for EB; HB, Pareto(1, 10⁻¹⁰); HB, Pareto(1/2, 10⁻¹⁰); HB, Gamma(0.001, 0.001); HB, Gamma(1, 0.001); bootstrap basic; and bootstrap percentile intervals]

Unregularized inversion?

At the end of the day, any regularization technique makes unverifiable assumptions about the true solution
If these assumptions are not satisfied, the uncertainties will be wrong
In the absence of oracle information about the true x, there does not seem to be any obvious way around this
So maybe we should reconsider whether explicit regularization is such a good idea to start with?
Instead of finding a regularized estimator of x, what if we simply used the unregularized maximum likelihood / least-squares estimator

    x̂ = arg min_{x ∈ R^p} (y − Kx)^T Σ^{−1} (y − Kx)

When K has full column rank, the solution is x̂ = (K^T Σ^{−1} K)^{−1} K^T Σ^{−1} y
This is unbiased (E(x̂) = x) and hence also the corresponding estimator θ̂ = h^T x̂ of the functional θ = h^T x is unbiased
Therefore, by the previous discussion, the variability interval

    [θ, θ] = [θ̂ − z_{1−α/2} √var(θ̂), θ̂ + z_{1−α/2} √var(θ̂)]

has correct coverage 1 − α
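A minimal sketch of this unregularized (generalized least squares) fit and the corresponding variability interval for a functional θ = h^T x; K, Sigma, y, h are assumed inputs and the helper name is mine:

```python
import numpy as np
from scipy.stats import norm

def unregularized_interval(y, K, Sigma, h, alpha=0.05):
    """Unregularized GLS estimate xhat = (K^T Sigma^{-1} K)^{-1} K^T Sigma^{-1} y
    and the variability interval for theta = h^T x (requires K of full column rank)."""
    Si = np.linalg.inv(Sigma)
    Fisher = K.T @ Si @ K
    xhat = np.linalg.solve(Fisher, K.T @ Si @ y)
    cov_xhat = np.linalg.inv(Fisher)            # covariance of the unbiased estimator
    theta_hat = h @ xhat
    se = np.sqrt(h @ cov_xhat @ h)
    z = norm.ppf(1.0 - alpha / 2.0)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)
```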

Implicit regularization

Of course, when K is ill-conditioned, the unregularized estimator x̂ will have a huge variance
But this does not mean that θ̂ = h^T x̂ needs to have a huge variance!
The mapping x̂ ↦ θ̂ = h^T x̂ can act as an implicit regularizer resulting in a well-constrained interval [θ, θ] for the functional θ = h^T x
This is especially the case when the functional is a smoothing or averaging operation, for example:
  Inference for wide unfolded bins (demo to follow)
  XCO2 in OCO-2 (average CO2 over the atmospheric column)
Of course, there are also functionals that are more difficult to constrain (e.g., point evaluators θ = e_i^T x, derivatives, ...)
In those cases, the intervals [θ, θ] are wide—as they should be, since there is simply not enough information in the data y to constrain these functionals

Wide bin unfolding

In the case of unfolding, one functional we should be able to recover without explicit regularization is the integral of f over a wide unfolded bin:

    H_j[f] = ∫_{T_j} f(t) dt,  width of T_j large

But one cannot simply arbitrarily increase the particle-level bin size in the conventional approaches, since this increases the MC dependence of K
To circumvent this, it is possible to first unfold with fine bins (and no regularization) and then aggregate into wide bins
Let's see how this works using a similar deconvolution setup as before
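The aggregation step itself is just linear error propagation; a small sketch (the aggregation matrix and helper name are mine, and xhat_fine, cov_fine are assumed to come from an unregularized fine-bin fit such as the one sketched earlier):

```python
import numpy as np
from scipy.stats import norm

def aggregate_to_wide_bins(xhat_fine, cov_fine, n_fine_per_wide, alpha=0.05):
    """Sum groups of n_fine_per_wide consecutive fine bins into wide bins,
    propagating the full fine-bin covariance (including bin-to-bin correlations)."""
    p = len(xhat_fine)
    n_wide = p // n_fine_per_wide
    A = np.zeros((n_wide, p))
    for j in range(n_wide):
        A[j, j * n_fine_per_wide:(j + 1) * n_fine_per_wide] = 1.0   # wide bin = sum of fine bins
    xhat_wide = A @ xhat_fine
    cov_wide = A @ cov_fine @ A.T
    z = norm.ppf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(np.diag(cov_wide))
    return xhat_wide, (xhat_wide - half_width, xhat_wide + half_width)
```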

Wide bins, standard approach, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right)]

The response matrix K_{i,j} = (∫_{S_i} ∫_{T_j} k(s, t) f^MC(t) dt ds) / (∫_{T_j} f^MC(t) dt) depends on f^MC
⇒ Undercoverage if f^MC ≠ f

Wide bins, standard approach, correct MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right)]

If f^MC = f, coverage is correct ⇒ But this situation is unrealistic because f of course is unknown

Fine bins, standard approach, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right)]

With narrow bins, less dependence on f^MC so coverage is correct, but the intervals are very wide²
⇒ Let's aggregate these into wide bins, keeping track of the bin-to-bin correlations in the error propagation

²More unfolded realizations are given in the backup.

Wide bins via fine bins, perturbed MC

[Figure: unfolded vs. true intensity (left) and binwise coverage (right)]

Wide bins via fine bins gives both correct coverage and intervals with reasonable length³

³More unfolded realizations are given in the backup.

Relaxing the full rank assumption

This simple approach works as long as the forward model K has full column rank and there are no constraints that x needs to satisfy

The full rank requirement can be quite restrictive in practice, for example:
  Unfolding with more true bins p than smeared bins n ⇒ K column-rank deficient
  The linearized OCO-2 forward model has n ≫ p, but is nevertheless column-rank deficient

When K is column-rank deficient, it has a non-trivial null space ker(K)

Therefore, confidence intervals for θ = h^T x would need to be infinitely long if there are no constraints on x (assuming h not orthogonal to ker(K))

However, simple constraints such as x ≥ 0 or Ax ≤ b can be enough to make the intervals finite
And we would in any case like to make use of constraints, if available

Strict bounds: Motivation

So the question becomes: Assuming model y = Kx + ε, where K need not have full column rank, how does one obtain a finite-sample 1 − α confidence interval for the linear functional θ = h^T x subject to the constraint Ax ≤ b?

In the following:
  We assume that we have transformed the problem so that ε ∼ N(0, I)
  We denote the noise-free data by µ = Kx
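This pre-whitening step is a standard transformation; a minimal sketch of one way to do it (assuming a symmetric positive-definite Σ; names are mine):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def whiten(y, K, Sigma):
    """Transform y = Kx + eps, eps ~ N(0, Sigma), into y_w = K_w x + eps_w with eps_w ~ N(0, I)
    by multiplying with Sigma^{-1/2}, here via the Cholesky factor Sigma = C C^T."""
    C = cholesky(Sigma, lower=True)
    y_w = solve_triangular(C, y, lower=True)   # C^{-1} y
    K_w = solve_triangular(C, K, lower=True)   # C^{-1} K
    return y_w, K_w
```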

Strict bounds (e.g., Stark (1992))

[Figure: schematic of the construction — a confidence set Ξ for µ in the y-space R^n is pulled back through K to D = K^{−1}(Ξ) in the x-space R^p, intersected with the constraint set C, and mapped by h^T to an interval on the real line]

    θ = h^T x,   [θ, θ] = [ min_{x ∈ C ∩ D} h^T x,  max_{x ∈ C ∩ D} h^T x ]

P(µ ∈ Ξ) ≥ 1 − α ⇒ P(x ∈ D) ≥ 1 − α ⇒ P(x ∈ C ∩ D) ≥ 1 − α ⇒ P(θ ∈ [θ, θ]) ≥ 1 − α

Strict bounds

If we construct the confidence set Ξ as

    Ξ = {µ ∈ R^n : ||y − µ||² ≤ χ²_{n,1−α}},

then the endpoints of the confidence interval for θ are given by the solutions of the following two quadratic programs:

    minimize h^T x
    subject to ||y − Kx||² ≤ χ²_{n,1−α}
               Ax ≤ b

and

    maximize h^T x
    subject to ||y − Kx||² ≤ χ²_{n,1−α}
               Ax ≤ b

The resulting interval [θ, θ] has by construction coverage at least 1 − α
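These two programs are convex (linear objective, one quadratic and several linear constraints), so an off-the-shelf solver handles them directly. A minimal sketch assuming the cvxpy package is available and pre-whitened inputs K, y, A, b, h as above (the function name is mine):

```python
import cvxpy as cp
from scipy.stats import chi2

def strict_bounds_interval(K, y, A, b, h, alpha=0.05):
    """Simultaneous strict-bounds interval for theta = h^T x under Ax <= b,
    using the confidence set ||y - Kx||^2 <= chi^2_{n, 1-alpha}."""
    n, p = K.shape
    x = cp.Variable(p)
    radius = chi2.ppf(1.0 - alpha, df=n)
    constraints = [cp.sum_squares(y - K @ x) <= radius, A @ x <= b]
    lower = cp.Problem(cp.Minimize(h @ x), constraints).solve()
    upper = cp.Problem(cp.Maximize(h @ x), constraints).solve()
    return lower, upper
```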

Strict bounds

The main limitation of the previous construction is that there is slack in the last step:

P(x ∈ C ∩ D) ≥ 1 − α ⇒ P(θ ∈ [θ, θ]) ≥ 1 − α

Because C ∩ D is a simultaneous confidence set for x, this construction yields simultaneous confidence intervals for any arbitrarily large collection of functionals of x

This means that, if evaluated as a one-at-a-time interval, [θ, θ] from this construction is necessarily conservative (i.e., it has overcoverage)

One-at-a-time strict bounds

Denote s² = min_{Ax ≤ b} ||y − Kx||². It has been conjectured (Rust and Burrus, 1972) that the following modification gives a better one-at-a-time interval:

    minimize h^T x
    subject to ||y − Kx||² ≤ z²_{1−α/2} + s²
               Ax ≤ b

and

    maximize h^T x
    subject to ||y − Kx||² ≤ z²_{1−α/2} + s²
               Ax ≤ b

The coverage of these intervals has been "proven" (Rust and O'Leary, 1994) and then "disproven" (Tenorio et al., 2007) so, as of now, there is limited theoretical understanding of their properties
We hope to ultimately prove the coverage, but for now we'll simply go ahead and empirically see how these perform in the CO2 retrieval problem with the constraint x_1, ..., x_21 ≥ 0 and column-rank deficient K
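The only change relative to the previous sketch is the radius of the quadratic constraint; a sketch of the modified interval under the same assumptions (cvxpy, whitened inputs; names are mine):

```python
import cvxpy as cp
from scipy.stats import norm

def rust_burrus_interval(K, y, A, b, h, alpha=0.05):
    """One-at-a-time strict bounds with the Rust-Burrus radius z^2_{1-alpha/2} + s^2,
    where s^2 is the constrained minimum of ||y - Kx||^2."""
    n, p = K.shape
    x = cp.Variable(p)
    s2 = cp.Problem(cp.Minimize(cp.sum_squares(y - K @ x)), [A @ x <= b]).solve()
    radius = norm.ppf(1.0 - alpha / 2.0) ** 2 + s2
    constraints = [cp.sum_squares(y - K @ x) <= radius, A @ x <= b]
    lower = cp.Problem(cp.Minimize(h @ x), constraints).solve()
    upper = cp.Problem(cp.Maximize(h @ x), constraints).solve()
    return lower, upper
```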

Coverage for OCO-2 retrievals

Table: Comparison of 95% Bayesian credible intervals (operational) and the one-at-a-time strict bounds intervals (proposed) for inference of XCO2 in OCO-2.

x realization | operational bias | operational coverage | operational length | proposed coverage | proposed avg. length | proposed length s.d.
 1 | 1.4173 | 0.7899 | 3.94 | 0.9515 | 11.20 | 0.29
 2 | 1.3707 | 0.8090 | 3.94 | 0.9511 | 11.20 | 0.28
 3 | 1.2986 | 0.8363 | 3.94 | 0.9510 | 11.20 | 0.29
 4 | 1.2357 | 0.8579 | 3.94 | 0.9515 | 11.20 | 0.28
 5 | 1.1590 | 0.8816 | 3.94 | 0.9513 | 11.20 | 0.28
 6 | 1.0747 | 0.9042 | 3.94 | 0.9512 | 11.21 | 0.27
 7 | 0.9721 | 0.9272 | 3.94 | 0.9515 | 11.20 | 0.29
 8 | 0.8420 | 0.9500 | 3.94 | 0.9513 | 11.19 | 0.31
 9 | 0.6477 | 0.9730 | 3.94 | 0.9508 | 11.19 | 0.32
10 | 0.0001 | 0.9959 | 3.94 | 0.9502 | 11.18 | 0.35

One-at-a-time strict bounds: Discussion

We have tried more complex constraints of the form Ax ≤ b and so far the one-at-a-time strict bounds have always had coverage at least 1 − α
The intervals have excellent empirical coverage but, as of now, we are missing a rigorous proof of coverage and/or optimality
The price to pay for good calibration is an increased interval length in comparison to regularization-based intervals
The OCO-2 problem has n ≫ p and rank(K) = p − 1
  Do these intervals still work when p > n or rank(K) ≪ p?
There are a few things we can say about these intervals though:
  They are variable length (as opposed to fixed length in traditional constructions); important since there are known optimality results for fixed length intervals (Donoho, 1994)
  In the case of full column rank K and no constraints on x, they reduce to the classical fixed-length variability interval

Conclusions

Regularization works well for point estimation, but uncertainty quantification based on regularized estimators is very difficult
Intervals from both frequentist and Bayesian constructions tend to have poor frequentist calibration
It seems that there is a need for a major rethinking of the role of regularization
  Any explicit regularization really means some amount of "cheating" in the uncertainties
  We should probably avoid explicit regularization and instead provide uncertainties for functionals that implicitly regularize the problem
  A simple example is to first unfold with narrow bins and then aggregate into wide bins
One-at-a-time strict bounds provide good calibration even in situations where classical intervals do not apply

Acknowledgments:
  HEP unfolding: Joint work with Victor Panaretos, Philip Stark and Lyle Kim, with valuable input from members of the CMS Statistics Committee
  CO2 retrieval: Joint work with Pratik Patil and Jonathan Hobbs, with valuable input from scientists at NASA Jet Propulsion Laboratory

STAMPS @ CMU

The Statistical Methods for the Physical Sciences (STAMPS) Focus Group at CMU develops statistical methods for analyzing large and complex datasets arising across the physical sciences.
Coordinated by: Ann Lee and Mikael Kuusela
Application areas include:
Common statistical themes: spatio-temporal data, ill-posed inverse problems, uncertainty quantification, high-dimensional data, non-Gaussian data, large-scale simulations, massive datasets, ...
See: http://stat.cmu.edu/stamps/

References

T. Adye. Unfolding algorithms and tests using RooUnfold. In H. B. Prosper and L. Lyons, editors, Proceedings of the PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN-2011-006, pages 313–318, CERN, Geneva, Switzerland, 17–20 January 2011.
V. Blobel. Unfolding methods in high-energy physics experiments. In CERN Yellow Report 85-09, pages 88–127, 1985.
V. Blobel. The RUN manual: Regularized unfolding for high-energy physics experiments. OPAL Technical Note TN361, 1996.
J. Bourbeau and Z. Hampel-Arias. PyUnfold: A Python package for iterative unfolding. The Journal of Open Source Software, 3(26):741, 2018.
A. Bozson, G. Cowan, and F. Spanò. Unfolding with Gaussian processes, 2018.
G. Choudalakis. Fully Bayesian unfolding. arXiv:1201.4612v4 [physics.data-an], 2012.
G. D'Agostini. A multidimensional unfolding method based on Bayes' theorem. Nuclear Instruments and Methods A, 362:487–498, 1995.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
D. L. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.
P. J. Green and B. W. Silverman. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman & Hall, 1994.
P. C. Hansen. Analysis of discrete ill-posed problems by means of the L-curve. SIAM Review, 34(4):561–580, 1992.
J. Hobbs, A. Braverman, N. Cressie, R. Granat, and M. Gunson. Simulation-based uncertainty quantification for estimating atmospheric CO2 from satellite data. SIAM/ASA Journal on Uncertainty Quantification, 5(1):956–985, 2017.
A. Höcker and V. Kartvelishvili. SVD approach to data unfolding. Nuclear Instruments and Methods in Physics Research A, 372:469–481, 1996.
T. Isaac, N. Petra, G. Stadler, and O. Ghattas. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348–368, 2015.
M. Kuusela. Uncertainty quantification in unfolding elementary particle spectra at the Large Hadron Collider. PhD thesis, EPFL, 2016. Available online at: https://infoscience.epfl.ch/record/220015.
M. Kuusela and V. M. Panaretos. Statistical unfolding of elementary particle spectra: Empirical Bayes estimation and bias-corrected uncertainty quantification. The Annals of Applied Statistics, 9(3):1671–1705, 2015.
M. Kuusela and P. B. Stark. Shape-constrained uncertainty quantification in unfolding steeply falling elementary particle spectra. The Annals of Applied Statistics, 11(3):1671–1710, 2017.
K. Lange and R. Carson. EM reconstruction algorithms for emission and transmission tomography. Journal of Computer Assisted Tomography, 8(2):306–316, 1984.
L. B. Lucy. An iterative technique for the rectification of observed distributions. Astronomical Journal, 79(6):745–754, 1974.
B. Malaescu. An iterative, dynamically stabilized (IDS) method of data unfolding. In H. B. Prosper and L. Lyons, editors, Proceedings of the PHYSTAT 2011 Workshop on Statistical Issues Related to Discovery Claims in Search Experiments and Unfolding, CERN-2011-006, pages 271–275, CERN, Geneva, Switzerland, 17–20 January 2011.
N. Milke, M. Doert, S. Klepser, D. Mazin, V. Blobel, and W. Rhode. Solving inverse problems with the unfolding program TRUEE: Examples in astroparticle physics. Nuclear Instruments and Methods in Physics Research A, 697:133–147, 2013.
W. H. Richardson. Bayesian-based iterative method of image restoration. Journal of the Optical Society of America, 62(1):55–59, 1972.
B. W. Rust and W. R. Burrus. Mathematical Programming and the Numerical Solution of Linear Equations. American Elsevier, 1972.
B. W. Rust and D. P. O'Leary. Confidence intervals for discrete approximations to ill-posed problems. Journal of Computational and Graphical Statistics, 3(1):67–96, 1994.
S. Schmitt. TUnfold, an algorithm for correcting migration effects in high energy physics. Journal of Instrumentation, 7:T10003, 2012.
L. A. Shepp and Y. Vardi. Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging, 1(2):113–122, 1982.
P. B. Stark. Inference in infinite-dimensional inverse problems: Discretization and duality. Journal of Geophysical Research, 97(B10):14055–14082, 1992.
A. M. Stuart. Inverse problems: A Bayesian perspective. Acta Numerica, 19:451–559, 2010.
L. Tenorio, A. Fleck, and K. Moses. Confidence intervals for linear discrete inverse problems with a non-negativity constraint. Inverse Problems, 23:669–681, 2007.
Y. Vardi, L. A. Shepp, and L. Kaufman. A statistical model for positron emission tomography. Journal of the American Statistical Association, 80(389):8–20, 1985.
E. Veklerov and J. Llacer. Stopping rule for the MLE algorithm based on statistical hypothesis testing. IEEE Transactions on Medical Imaging, 6(4):313–319, 1987.
I. Volobouev. On the expectation-maximization unfolding with smoothing. arXiv:1408.6500v2 [physics.data-an], 2015.

Backup

Current unfolding methods

Two main approaches:
1. Tikhonov regularization (i.e., SVD by Höcker and Kartvelishvili (1996) and TUnfold by Schmitt (2012)):

    min_{λ ∈ R^p} (y − Kλ)^T Ĉ^{−1} (y − Kλ) + δ P(λ)

   with

    P_SVD(λ) = || L [λ_1/λ_1^MC, λ_2/λ_2^MC, ..., λ_p/λ_p^MC]^T ||²   or   P_TUnfold(λ) = || L(λ − λ^MC) ||²,

   where L is usually the discretized second derivative (also other choices possible)
2. Expectation-maximization iteration with early stopping (D'Agostini, 1995):

    λ_j^{(t+1)} = (λ_j^{(t)} / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{k=1}^p K_{i,k} λ_k^{(t)}),   with λ^{(0)} = λ^MC

All these methods typically regularize by biasing towards a MC ansatz λ^MC
Regularization strength controlled by the choice of δ in Tikhonov or by the number of iterations in D'Agostini
Uncertainty quantification: [λ_i, λ_i] = [λ̂_i − z_{1−α/2} √var(λ̂_i), λ̂_i + z_{1−α/2} √var(λ̂_i)], with var(λ̂_i) estimated using error propagation or resampling

Regularized unfolding

Two popular approaches to regularized unfolding:

1. Tikhonov regularization (Höcker and Kartvelishvili, 1996; Schmitt, 2012)
2. Expectation-maximization iteration with early stopping (D'Agostini, 1995; Richardson, 1972; Lucy, 1974; Shepp and Vardi, 1982; Lange and Carson, 1984; Vardi et al., 1985)

Tikhonov regularization

Tikhonov regularization estimates λ by solving:

    min_{λ ∈ R^p} (y − Kλ)^T Ĉ^{−1} (y − Kλ) + δ P(λ)

The first term is a Gaussian approximation to the Poisson log-likelihood
The second term penalizes physically implausible solutions
Common penalty terms:
  Norm: P(λ) = ||λ||²
  Curvature: P(λ) = ||Lλ||², where L is a discretized 2nd derivative operator
  SVD unfolding (Höcker and Kartvelishvili, 1996): P(λ) = || L [λ_1/λ_1^MC, λ_2/λ_2^MC, ..., λ_p/λ_p^MC]^T ||², where λ^MC is a MC prediction for λ
  TUnfold⁴ (Schmitt, 2012): P(λ) = ||L(λ − λ^MC)||²

⁴TUnfold also implements more general penalty terms

D'Agostini iteration

Starting from some initial guess λ(0) > 0, iterate

    λ_j^{(k+1)} = (λ_j^{(k)} / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{l=1}^p K_{i,l} λ_l^{(k)})

Regularization by stopping the iteration before convergence: λ̂ = λ^{(K)} for some small number of iterations K
Will bias the solution towards λ^{(0)}
Regularization strength controlled by the choice of K
In RooUnfold (Adye, 2011), λ^{(0)} = λ^MC
PyUnfold (Bourbeau and Hampel-Arias, 2018) implements free choice of λ^{(0)}
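A minimal sketch of this iteration (my own toy implementation, not RooUnfold or PyUnfold):

```python
import numpy as np

def dagostini(K, y, lam0, n_iter):
    """D'Agostini / Richardson-Lucy EM iteration for the model y ~ Poisson(K lam).

    K[i, j] = P(smeared bin i | true bin j), y = observed smeared counts,
    lam0 = strictly positive starting point, n_iter = number of iterations (regularization)."""
    lam = np.asarray(lam0, dtype=float).copy()
    eff = K.sum(axis=0)                 # sum_i K_ij: probability a true event in bin j is observed
    for _ in range(n_iter):
        folded = K @ lam                # current expected smeared counts
        lam = lam / eff * (K.T @ (y / folded))
    return lam
```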

D'Agostini iteration

    λ_j^{(k+1)} = (λ_j^{(k)} / Σ_{i=1}^n K_{i,j}) Σ_{i=1}^n K_{i,j} y_i / (Σ_{l=1}^p K_{i,l} λ_l^{(k)})

This iteration has been discovered in various fields, including optics (Richardson, 1972), astronomy (Lucy, 1974) and tomography (Shepp and Vardi, 1982; Lange and Carson, 1984; Vardi et al., 1985)
In particle physics, it was popularized by D'Agostini (1995) who called it "Bayesian" unfolding
But: This is in fact an expectation-maximization (EM) iteration (Dempster et al., 1977) for finding the maximum likelihood estimator of λ in the problem y ∼ Poisson(Kλ)
  As k → ∞, λ^{(k)} → λ̂_MLE (Vardi et al., 1985)
  This is a fully frequentist technique for finding the (regularized) MLE
  The name "Bayesian" is an unfortunate misnomer

D'Agostini demo, k = 0

[Figure: smeared histogram with y, µ and Kλ^(k) (left); true histogram with λ and λ^(k) (right)]

D'Agostini demo, k = 100

[Figure: smeared histogram with y, µ and Kλ^(k) (left); true histogram with λ and λ^(k) (right)]

D'Agostini demo, k = 10000

[Figure: smeared histogram with y, µ and Kλ^(k) (left); true histogram with λ and λ^(k) (right)]

D'Agostini demo, k = 100000

[Figure: smeared histogram with y, µ and Kλ^(k) (left); true histogram with λ and λ^(k) (right)]

Other methods

Bin-by-bin correction factors
  Attempts to unfold resolution effects by performing multiplicative efficiency corrections
  This method is simply wrong and must not be used
Fully Bayesian unfolding (FBU) (Choudalakis, 2012)
  Unfolding using Bayesian statistics where the prior regularizes the ill-posed problem
  Certain priors lead to solutions similar to Tikhonov, but with Bayesian credible intervals as the uncertainties
  Note: D'Agostini has nothing to do with proper Bayesian inference
Gaussian processes (Bozson et al., 2018; Stuart, 2010)
  Very closely related to Tikhonov regularization / penalized maximum likelihood / FBU
  Inherits many of the same limitations
RUN/TRUEE (Blobel, 1985, 1996; Milke et al., 2013)
  Penalized maximum likelihood with B-spline discretization
Shape-constrained unfolding (Kuusela and Stark, 2017)
  Correct-coverage simultaneous confidence intervals by imposing constraints on positivity, monotonicity and/or convexity
Expectation-maximization with smoothing (Volobouev, 2015)
  Adds a smoothing step to each iteration of D'Agostini and iterates until convergence
Iterative dynamically stabilized unfolding (Malaescu, 2011)
  Seems ad-hoc, with many free tuning parameters and unknown (at least to me) statistical properties
  I have not seen this used in CMS, but it seems to be quite common in ATLAS
...

Choice of the regularization strength

A key issue in unfolding is the choice of the regularization strength (δ in Tikhonov, # of iterations in D'Agostini)
The solution and especially the uncertainties depend heavily on this choice
This choice should be done using an objective data-driven criterion
In particular, one must not rely on the software defaults for the regularization strength (such as 4 iterations of D'Agostini in RooUnfold)
Many data-driven methods have been proposed:
1. (Weighted/generalized) cross-validation (e.g., Green and Silverman, 1994)
2. L-curve (Hansen, 1992)
3. Marginal maximum likelihood (MMLE; Kuusela and Panaretos (2015))
4. Goodness-of-fit test in the smeared space (Veklerov and Llacer, 1987)
5. Akaike information criterion (Volobouev, 2015)
6. Minimization of a global correlation coefficient (Schmitt, 2012)
7. Stein's unbiased risk estimate (SURE; new in TUnfold V17.9)
8. ...
Limited experience about the relative merits of these in typical unfolding problems
Note: All of these are designed for point estimation! Not necessarily optimal for uncertainty quantification

Tikhonov regularization, P(λ) = ||λ||², varying δ

[Figure: λ and λ̂ ± SE[λ̂] for δ = 10⁻⁵, 10⁻³, 10⁻² and 1]

Uncertainty quantification in unfolding

Aim: Find random intervals [λ_i(y), λ_i(y)], i = 1, ..., p, with coverage probability 1 − α:

    P(λ_i ∈ [λ_i(y), λ_i(y)]) ≈ 1 − α

Most implementations quantify the uncertainty using the binwise variance (estimated using either error propagation or resampling):

    [λ_i, λ_i] = [λ̂_i − z_{1−α/2} √var(λ̂_i), λ̂_i + z_{1−α/2} √var(λ̂_i)]

But: These intervals may suffer from significant undercoverage since they ignore the regularization bias

Undersmoothed unfolding

Standard methods for picking the regularization strength optimize the point estimation performance
These estimators have too much bias from the perspective of the variance-based uncertainties
One possible solution is to debias the estimator, i.e., to adjust the bias-variance trade-off in the direction of less bias and more variance
The simplest form of debiasing is to reduce δ from the cross-validation / L-curve / MMLE value until the intervals have close-to-nominal coverage
The challenge is to come up with a data-driven rule for deciding how much to undersmooth
I have been working with Lyle Kim to implement the data-driven methods from Kuusela (2016) as an extension of TUnfold
The code is available at:

https://github.com/lylejkim/UndersmoothedUnfolding

If you’re already working with TUnfold, then trying this approach requires adding only one extra line of code to your analysis

Unfolded histograms, λ^MC = 0

[Figure: unfolded histograms. Left: L-curve, τ = √δ = 0.01186. Right: undersmoothing, τ = √δ = 0.00177]

Binwise coverage, λ^MC = 0

[Figure: binwise coverage over 1000 repetitions for the L-curve (left) and undersmoothing (right)]

Coverage as a function of τ = √δ

[Figure: coverage at the right peak of a bimodal density (10000 repetitions) as a function of τ (regularization strength), for undersmoothing and the L-curve]

Interval lengths, λ^MC = 0

[Figure: log(interval length) comparison]
  LcurveScan: mean = 66.017, median = 66.231
  Undersmoothing: mean = 238.09, median = 207.391
  Undersmoothing (oracle): mean = 197.102, median = 197.36

Histograms, coverage and interval lengths when λ^MC ≠ 0

[Figure: binwise coverage and unfolded histograms for ScanLcurve (average interval length 29.3723) and undersmoothing (average interval length 308.004)]

Coverage study from Kuusela (2016)

Method        | Coverage at t = 0     | Mean length
BC (data)     | 0.932 (0.915, 0.947)  | 0.079 (0.077, 0.081)
BC (oracle)   | 0.937 (0.920, 0.951)  | 0.064 (0.064, 0.064)
US (data)     | 0.933 (0.916, 0.948)  | 0.091 (0.087, 0.095)
US (oracle)   | 0.949 (0.933, 0.962)  | 0.070 (0.070, 0.070)
MMLE          | 0.478 (0.447, 0.509)  | 0.030 (0.030, 0.030)
MISE          | 0.359 (0.329, 0.390)  | 0.028
Unregularized | 0.952 (0.937, 0.964)  | 40316

BC = iterative bias-correction
US = undersmoothing
MMLE = choose δ to maximize the marginal likelihood
MISE = choose δ to minimize the mean integrated squared error

Fine bins, standard approach, perturbed MC, 4 realizations

[Figure: four realizations of the unfolded vs. true intensity with the corresponding binwise coverage]

Wide bins via fine bins, perturbed MC, 4 realizations

[Figure: four realizations of the unfolded vs. true intensity with the corresponding binwise coverage]