
Prior models

The prior density should reflect our beliefs on the unknown variable of interest before taking the measurements into account.

Often, the prior knowledge is qualitative in nature, and transferring the information into quantitative form expressed through a prior density can be challenging.

A good prior should have the following property: Denote by x a possible realization of a random variable X ∼ ρ0(x). If E is a collection of expected vectors x (i.e., ones you would expect to see) and U is a collection of unexpected ones, then it should hold that

$$\rho_0(x) \gg \rho_0(x') \qquad \text{when } x \in E,\; x' \in U,$$

i.e., the prior assigns a clearly higher probability to the realizations that we expect to see.

Example: Impulse prior densities

Consider, e.g., an imaging problem where the unknown is the discretized distribution of a physical parameter, i.e., a pixel image.

Assume that our prior information is that the image contains small, well-localized objects on an almost constant background. In such a case, one may try impulse prior densities, which have a low average amplitude but allow outliers. (The 'tail' of an impulse prior density is long, although the expected amplitude is small.)

Examples of impulse prior densities: Let x ∈ R^n represent a pixel image, where the component x_j is the intensity of the jth pixel. (In all of the following examples, X_j and X_k are assumed to be independent for j ≠ k.)

The ℓ1 prior:

$$\rho_0(x) = \frac{\alpha^n}{2^n} \exp(-\alpha \|x\|_1), \qquad \alpha > 0,$$

where the ℓ1-norm is defined as

$$\|x\|_1 = \sum_{j=1}^n |x_j|.$$

A more pronounced impulse effect can be obtained by taking an even smaller power of the components of x:

$$\rho_0(x) \propto \exp\Big(-\alpha \sum_{j=1}^n |x_j|^p\Big), \qquad 0 < p < 1, \; \alpha > 0.$$

Another choice is the Cauchy density, defined via

$$\rho_0(x) = \frac{\alpha^n}{\pi^n} \prod_{j=1}^n \frac{1}{1 + \alpha^2 x_j^2}, \qquad \alpha > 0.$$

The entropy of an image is defined as

$$E(x) = -\sum_{j=1}^n x_j \log \frac{x_j}{x_0},$$

where it is assumed that x_j > 0, j = 1, ..., n, and x_0 > 0 is a given constant. The entropy density is then of the form

$$\rho_0(x) \propto \exp(\alpha E(x)), \qquad \alpha > 0.$$

Log-normal density: The logarithm of a single pixel x ∈ R is normally distributed, i.e.,

$$w = \log x, \qquad w \sim \mathcal{N}(w_0, \sigma^2).$$

The explicit density of x is then

$$\pi(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2\sigma^2}(\log x - w_0)^2\Big), \qquad x > 0.$$

Do these priors represent our beliefs? What do these priors look like? To underline the interpretation as a pixel image, we add a positivity constraint to the priors introduced above, that is, we make the replacement

$$\rho_0(x) \to C\,\pi_+(x)\,\rho_0(x),$$

where π_+(x) is one if all components of x are positive, and zero otherwise. Here, C is a normalizing constant: if ρ0(x) is a probability density, the same does not typically apply to π_+(x)ρ0(x) without appropriate scaling.

For visual inspection we make random draws of pixel images from the constrained densities. As all components are independent, drawing can be done componentwise.

To make the draws from the one-dimensional densities, we calculate the cumulative distribution function of the prior density and employ the Golden Rule, as presented earlier.

Example: Drawing from the ℓ1 prior

The one-dimensional cumulative distribution function of the positively constrained ℓ1 prior is

$$\Phi(t) = \alpha \int_0^t e^{-\alpha s}\, ds = 1 - e^{-\alpha t}.$$

The inverse cumulative distribution function is thus

$$\Phi^{-1}(t) = -\frac{1}{\alpha} \log(1 - t).$$

For each pixel x_j, we draw t_j from the uniform distribution Uniform([0, 1]) and calculate x_j = −(1/α) log(1 − t_j), as in the sketch below.
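A minimal sketch of this inverse-CDF ('Golden Rule') sampler for the positively constrained ℓ1 prior, assuming NumPy is available; the image size and the value of α are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.0          # rate parameter of the l1 prior (example value)
n_pix = 64 * 64      # number of pixels in the example image

# Golden Rule / inverse-CDF sampling: t ~ Uniform([0, 1]), x = -log(1 - t) / alpha
t = rng.uniform(size=n_pix)
x_l1 = -np.log(1.0 - t) / alpha       # componentwise draws from the constrained l1 prior

image = x_l1.reshape(64, 64)          # interpret the draw as a 64 x 64 pixel image
```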

Example: Drawing from the Cauchy prior

The one-dimensional cumulative distribution function of the positively constrained Cauchy prior is

$$\Phi(t) = \frac{2\alpha}{\pi} \int_0^t \frac{1}{1 + \alpha^2 s^2}\, ds = \frac{2}{\pi}\arctan(\alpha t),$$

meaning that the inverse cumulative distribution function is

$$\Phi^{-1}(t) = \frac{1}{\alpha}\tan\Big(\frac{\pi t}{2}\Big).$$

As in the case of the ℓ1 prior, we draw t_j from the uniform distribution and then calculate x_j = (1/α) tan(π t_j / 2); see the sketch below.
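A corresponding sketch for the positively constrained (half-)Cauchy prior, again assuming NumPy and arbitrary example values for α and the image size:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = 1.0
n_pix = 64 * 64

# Inverse-CDF sampling for the positively constrained Cauchy density: x = tan(pi * t / 2) / alpha
t = rng.uniform(size=n_pix)
x_cauchy = np.tan(np.pi * t / 2.0) / alpha

image = x_cauchy.reshape(64, 64)
```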

[Figure: two random draws of pixel images from the Cauchy prior.]

How do these priors compare to white noise? Let us consider a Gaussian prior with a positivity constraint, i.e.,

$$\rho_0(x) \propto \pi_+(x) \exp\Big(-\frac{1}{2\alpha^2}\|x\|^2\Big), \qquad \alpha > 0.$$

Recall that at the previous lecture we implemented drawing from a standard Gaussian distribution with a bound c. In particular, we were able to calculate the inverse of the one-dimensional cumulative distribution function,

$$\Phi^{-1}(t) = \sqrt{2}\,\operatorname{erf}^{-1}\!\big(t\,(1 - \operatorname{erf}(c/\sqrt{2})) + \operatorname{erf}(c/\sqrt{2})\big).$$
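For reference, a sketch of this inverse-CDF sampler for the bounded standard Gaussian, assuming SciPy's erf/erfinv and interpreting c as a lower bound (as the formula suggests); the function name and parameter values are example choices:

```python
import numpy as np
from scipy.special import erf, erfinv

rng = np.random.default_rng(2)

def truncated_std_normal(c, size):
    """Draws from a standard Gaussian restricted to values >= c via the inverse CDF."""
    t = rng.uniform(size=size)
    return np.sqrt(2.0) * erfinv(t * (1.0 - erf(c / np.sqrt(2.0))) + erf(c / np.sqrt(2.0)))

# c = 0 gives the positively constrained case; multiplying by alpha gives variance alpha^2.
samples = 0.5 * truncated_std_normal(c=0.0, size=10_000)   # example with alpha = 0.5
```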

A similar derivation for c = 0 and variance α² instead of 1 yields in the current case

$$\Phi^{-1}(t) = \sqrt{2}\,\alpha\, \operatorname{erf}^{-1}(t).$$

[Figure: random draws of pixel images from the ℓ1 prior, the Cauchy prior, and the white noise prior.]

Discontinuities

Prior information: The unknown is a function of, say, time. It is known to be relatively stable for long periods of time, but contains now and then discontinuities. We may also have information on the size of the jumps or the rate of occurrence of the discontinuities.

A more concrete example: Unknown is a function f : [0, 1] → R. We know that f (0) = 0 and that the function may have large jumps at a few locations.

After discretizing f , impulse priors can be used to construct a prior on the finite difference approximation of the derivative of f . Discretization of the interval [0, 1]: Choose grid points tj = j/N, j = 0,..., N, and set xj = f (tj ).

We write a Cauchy-type prior density

$$\rho_0(x) = \frac{\alpha^N}{\pi^N} \prod_{j=1}^N \frac{1}{1 + \alpha^2 (x_j - x_{j-1})^2}$$

that controls the jumps between the adjacent components of x ∈ R^{N+1}. In particular, the components of X are not independent. (In addition to this prior, we know that X_0 = x_0 = 0.)

To make draws from the above density, we define new variables

$$\xi_j = x_j - x_{j-1}, \qquad 1 \le j \le N,$$

which are the changes in the function of interest between adjacent grid points. Notice that x̃ = [x_1, ..., x_N]^T ∈ R^N satisfies

$$\tilde{x} = A\xi,$$

where A ∈ R^{N×N} is a lower triangular matrix such that A_{jk} = 1 for j ≥ k. Hence, it follows, e.g., from the change of variables rule for probability densities that

$$\rho_0(\xi) = \frac{\alpha^N}{\pi^N} \prod_{j=1}^N \frac{1}{1 + \alpha^2 \xi_j^2}.$$

In particular, due to the product form of ρ0(ξ), the components of Ξ are mutually independent, and can thus be drawn componentwise from a one-dimensional Cauchy density. Subsequently, a random draw from the distribution of X can be constructed by recalling that x_0 = 0 and using the relation x̃ = Aξ; see the sketch below.
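A minimal sketch of this construction, assuming NumPy; the grid size N and the parameter α are example values, and the two-sided Cauchy increments are drawn via the inverse-CDF formula ξ = tan(π(t − 1/2))/α:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 100          # number of grid points on (0, 1] (example value)
alpha = 10.0     # Cauchy scale parameter (example value)

# Independent Cauchy-distributed increments via the inverse CDF.
t = rng.uniform(size=N)
xi = np.tan(np.pi * (t - 0.5)) / alpha

# x_0 = 0 and x_tilde = A xi with A lower triangular of ones,
# i.e. x_j is the cumulative sum of the increments.
x = np.concatenate(([0.0], np.cumsum(xi)))
```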

[Figure: two random draws from the Cauchy-type difference prior.]

Sample-based densities

Assume that we have a large sample of realizations of a random variable X ∈ R^n:

$$S = \{x^1, x^2, \ldots, x^N\}.$$

One way to construct a prior density for X is to approximate ρ0(x) based on S.

Estimates of the mean and the covariance:

$$\mathrm{E}\{X\} \approx \frac{1}{N} \sum_{j=1}^N x^j =: \bar{x},$$

$$\mathrm{cov}(X) = \mathrm{E}\{XX^T\} - \mathrm{E}\{X\}\mathrm{E}\{X\}^T \approx \frac{1}{N} \sum_{j=1}^N x^j (x^j)^T - \bar{x}\bar{x}^T =: C.$$

(Notice that C is not the unbiased sample covariance estimator, but let us anyway follow the notation of the textbook.) The eigenvalue decomposition of C is

$$C = UDU^T,$$

where U ∈ R^{n×n} is orthogonal and has the eigenvectors of C as its columns, and D ∈ R^{n×n} is diagonal with the eigenvalues d_1 ≥ ... ≥ d_n ≥ 0 as its diagonal entries. (Note that C is clearly symmetric and positive semi-definite, and thus it has a full set of eigenvectors with non-negative eigenvalues.)
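A sketch of these sample-based computations, assuming NumPy and that the realizations are stored as the rows of an array S of shape (N, n); the function name is an example choice:

```python
import numpy as np

def sample_mean_cov_eig(S):
    """Sample mean, (biased) sample covariance and its eigendecomposition.

    S is an (N, n) array whose rows are the realizations x^1, ..., x^N.
    """
    N = S.shape[0]
    x_bar = S.mean(axis=0)
    C = S.T @ S / N - np.outer(x_bar, x_bar)   # (1/N) sum x^j (x^j)^T - x_bar x_bar^T
    d, U = np.linalg.eigh(C)                   # eigenvalues in ascending order
    d, U = d[::-1], U[:, ::-1]                 # reorder so that d_1 >= ... >= d_n
    return x_bar, C, d, U
```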

The vectors x^j, j = 1, ..., N, are typically 'somewhat similar', and the matrix C can consequently be singular or almost singular: the eigenvalues often satisfy d_j ≈ 0 for j > r, where 1 < r < n is some cut-off index. In other words, the difference X − E{X} does not seem to vary much in the directions of the eigenvectors u_{r+1}, ..., u_n. Assume this is the case. Then one can postulate that the values of the random variable X − E{X} lie 'with a high probability' in the subspace spanned by the first r eigenvectors of C. One way of trying to state this information quantitatively is to introduce a subspace prior

$$\pi(x) \propto \exp\big(-\alpha \|(I - P)(x - \bar{x})\|^2\big),$$

where P is the orthogonal projector R^n → span{u_1, ..., u_r}. The parameter α > 0 controls how much X − x̄ is allowed to deviate from the subspace span{u_1, ..., u_r}. (Take note that such a subspace prior is not a probability density in the traditional sense.) If C is not almost singular, the inverse C⁻¹ can be computed stably. In this case, the most straightforward way of approximating the (prior) probability density of X is to introduce the Gaussian approximation:

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2} (x - \bar{x})^T C^{-1} (x - \bar{x})\Big).$$
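As an illustration, the two constructions above could be evaluated as follows; this is a sketch continuing the previous snippet, and the cut-off index r, the parameter α and the function names are example choices:

```python
import numpy as np

def subspace_prior_logdensity(x, x_bar, U, r, alpha):
    """Unnormalized log of the subspace prior exp(-alpha ||(I - P)(x - x_bar)||^2)."""
    resid = x - x_bar
    proj = U[:, :r] @ (U[:, :r].T @ resid)   # P(x - x_bar): projection onto span{u_1, ..., u_r}
    return -alpha * np.sum((resid - proj) ** 2)

def gaussian_approx_logdensity(x, x_bar, C):
    """Unnormalized log of the Gaussian approximation exp(-(x - x_bar)^T C^{-1} (x - x_bar) / 2)."""
    resid = x - x_bar
    return -0.5 * resid @ np.linalg.solve(C, resid)
```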

Depending on the higher order statistics of X, this may or may not provide a good approximation for the distribution of X.

Hypermodels

In the statistical framework, the prior densities usually depend on some parameters, such as the variance or the mean. Typically, or at least thus far, these parameters have been assumed to be known.

Some classical regularization methods can be viewed as the construction of estimators based on the posterior density (e.g., the MAP estimate). The regularization parameter, which corresponds to the parameter that defines the prior distribution, is not assumed to be known, but is selected using, e.g., the Morozov discrepancy principle.

What happens if it is not clear how to choose these ‘prior parameters’ in the statistical framework?

If a parameter is not known, it can be estimated as a part of the problem based on the measurement data. This leads to hierarchical models that include hypermodels for the parameters defining the prior density. Assume that the prior distribution depends on a parameter α which is not assumed to be known. Then we write the prior as a conditional density, that is,

ρ0(x | α).

Assuming we have a hyperprior density for α, i.e.,

πhyper(α), we can write the joint distribution of x and α as

π(x, α) = ρ0(x | α)πhyper(α).

Assuming a likelihood model π(y | x) for the measurement data y, we get the posterior density for x and α, given y, from the Bayes formula:

$$\pi(x, \alpha \mid y) \propto \pi(y \mid x)\,\pi(x, \alpha) = \pi(y \mid x)\,\pi(x \mid \alpha)\,\pi_{\mathrm{hyper}}(\alpha).$$

In general, the hyperprior density π_hyper may depend on some additional parameter α_0. In such a case, the main reason for the use of a hyperprior model is that the construction of the posterior is assumed to be more robust with respect to fixing a value for the hyperparameter α_0 than fixing a value for α.

Sometimes α0 can also be treated as a random variable with a respective probability density. Then, we would write

$$\pi_{\mathrm{hyper}}(\alpha \mid \alpha_0),$$

giving rise to nested hypermodels.

Example: Hypermodel for a deconvolution problem

(Adapted from the textbook by Calvetti and Somersalo, Chapter 10.)

Consider a one-dimensional deconvolution problem, the goal of which is to estimate a signal f : [0, 1] → R from noisy, blurred observations modelled as

$$y_i = g(s_i) = \int_0^1 A(s_i, t) f(t)\, dt + e(s_i), \qquad 1 \le i \le m,$$

where {s_i}_{i=1}^m ⊂ [0, 1] are the uniformly distributed measurement points and the blurring kernel is defined to be

$$A(s, t) = \exp\Big(-\frac{1}{2\omega^2}(t - s)^2\Big),$$

and the noise is Gaussian, or more precisely e ∼ N(0, σ²I). To begin with, we discretize the model as

y = Ax + e,

where A ∈ R^{m×n} is obtained by approximating the integral with a suitable quadrature rule, and the vector x contains the values of the unknown signal at the discretization points {t_j}_{j=0}^n that we have chosen to be distributed uniformly over the interval [0, 1]. To be more precise,

$$x_j = f(t_j), \qquad t_j = \frac{j}{n}, \qquad 0 \le j \le n.$$
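A sketch of how the discretized forward model could be assembled, assuming NumPy and, for illustration, a simple rectangle (Riemann) quadrature rule over the nodes t_1, ..., t_n; the values of m and n are example choices, and ω is set to the value quoted later in this example (ω ≈ 0.05):

```python
import numpy as np

m, n = 100, 150          # number of measurements and unknowns (example values)
omega = 0.05             # width of the blurring kernel

s = np.linspace(0.0, 1.0, m)          # measurement points s_i
t = np.arange(1, n + 1) / n           # discretization points t_1, ..., t_n (x_0 = 0 is excluded)

# Rectangle-rule quadrature: each column of the kernel gets the weight 1/n.
A = np.exp(-(t[None, :] - s[:, None]) ** 2 / (2.0 * omega ** 2)) / n
```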

For simplicity we assume it is known that f(0) = x_0 = 0, and define the actual unknown x to be

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n.$$

Assume that as prior information we know that the signal is continuous except for a possible jump discontinuity at a known location. Let us start with a Gaussian first order smoothness prior,

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2\gamma^2}\|Lx\|^2\Big),$$

where L is a first order finite difference matrix (recall that x_0 = 0),

$$L = \begin{bmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix} \in \mathbb{R}^{n\times n}.$$

It is easy to see that L is invertible and

1  1 1  L−1 =   . .. ..  . . .  1 ... 1 1

is a lower triangular matrix. Since (1/γ)L is the whitening matrix of X ∈ R^n distributed according to ρ0(x), it follows that

$$X = L^{-1}W, \qquad W \sim \mathcal{N}(0, \gamma^2 I).$$

Due to the particular shape of L−1, this relation can alternatively be given as a Markov process:

$$X_j = X_{j-1} + W_j, \qquad W_j \sim \mathcal{N}(0, \gamma^2), \qquad j = 1, \ldots, n, \qquad X_0 = 0.$$

Next, we aim at fine-tuning the above smoothness prior so that it allows a jump discontinuity over the interval [t_{k-1}, t_k]. To this end, we modify the above Markov model (only) at j = k by setting

$$X_k = X_{k-1} + W_k, \qquad W_k \sim \mathcal{N}\Big(0, \frac{\gamma^2}{\delta^2}\Big),$$

where δ < 1 is a parameter controlling the variance of W_k, i.e., the expected size of the jump.

Let us walk the above steps backwards: It is easy to see that this new Markov process can alternatively be given as

$$X = L^{-1}(D^{1/2})^{-1}W, \qquad W \sim \mathcal{N}(0, \gamma^2 I),$$

where

$$D^{1/2} = \mathrm{diag}(1, 1, \ldots, \delta, \ldots, 1, 1) \in \mathbb{R}^{n\times n}$$

is defined so that (D^{1/2})^{-1} scales the kth component of W by 1/δ. In consequence, after the above modification in the kth step of the Markov process defining X, the random variable D^{1/2}LX is distributed according to N(0, γ²I), and thus we have introduced the fine-tuned 'jump prior'

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2\gamma^2}\|D^{1/2}Lx\|^2\Big).$$

Let us draw samples from this kind of a prior density. We set n = 150 and γ = 0.1, meaning that we expect increments of the order 0.1 on most of the subintervals. As an exception, at two known locations t ≈ 0.4 and t ≈ 0.8 we use δ < 1 at the corresponding diagonal elements of D^{1/2}, in anticipation of jumps of the order γ/δ = 0.1/δ; a sampling sketch is given below.
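A sketch of drawing from this jump prior with the stated parameters (n = 150, γ = 0.1, jump locations near t ≈ 0.4 and t ≈ 0.8), assuming NumPy; δ = 0.02 is one of the two values shown in the figure below, and the exact jump indices are approximate:

```python
import numpy as np

rng = np.random.default_rng(4)

n, gamma, delta = 150, 0.1, 0.02
jump_indices = [int(0.4 * n), int(0.8 * n)]     # increments corresponding to t ~ 0.4 and t ~ 0.8

# Markov-process formulation: X_j = X_{j-1} + W_j with Var(W_j) = gamma^2,
# except Var(W_k) = gamma^2 / delta^2 at the jump locations.
std = np.full(n, gamma)
std[jump_indices] = gamma / delta
increments = rng.normal(scale=std)
x = np.concatenate(([0.0], np.cumsum(increments)))   # includes x_0 = 0
```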

[Figure: random draws from the jump discontinuity prior with two different values of δ (δ = 0.1 and δ = 0.02).]

As the additive noise was assumed to be Gaussian, the likelihood density corresponding to the considered measurement is

$$\pi(y \mid x) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2\Big),$$

and due to the Bayes formula, the posterior density can thus be written as

$$\pi(x \mid y) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2 - \frac{1}{2\gamma^2}\|D^{1/2}Lx\|^2\Big).$$

Using the results for Gaussian densities from previous lectures, the mean of the posterior, which is also the MAP and the CM estimate, can be written explicitly as

$$x_{\mathrm{CM}} = x_{\mathrm{MAP}} = \Big(\frac{\sigma^2}{\gamma^2} L^T (D^{1/2})^T D^{1/2} L + A^T A\Big)^{-1} A^T y.$$
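A sketch of computing this estimate numerically via the equivalent stacked least-squares formulation, assuming NumPy and that A, y, L, the diagonal of D^{1/2}, σ and γ have been set up as in the preceding discussion; the function name is an example choice:

```python
import numpy as np

def map_estimate(A, y, L, D_sqrt_diag, sigma, gamma):
    """MAP/CM estimate for the Gaussian jump prior via stacked least squares."""
    n = A.shape[1]
    # Minimize ||(1/sigma)(Ax - y)||^2 + ||(1/gamma) D^{1/2} L x||^2.
    K = np.vstack([A / sigma, (D_sqrt_diag[:, None] * L) / gamma])
    rhs = np.concatenate([y / sigma, np.zeros(n)])
    x_map, *_ = np.linalg.lstsq(K, rhs, rcond=None)
    return x_map
```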

[Figure: the original signal f(t) and the measurement data (ω ≈ 0.05).]

Posterior estimates for f without the discontinuity model (i.e., with the mere first order smoothness prior) and with the discontinuity model with known locations and jump sizes (γ = 0.1):

[Figure: MAP estimate without the jump model and MAP estimate with the jump model (known location and size).]

Next we choose γ = 0.01, which corresponds to increments of the order 0.01 on each subinterval, and scale δ accordingly so that it remains in accordance with jump sizes of the order 1.

[Figure: MAP estimate without the jump model and MAP estimate with the jump model (known location and size) for γ = 0.01.]

Assume next that the locations and expected sizes of the jumps are not known, but we expect a slowly varying signal that could have a few jumps at unknown locations. We modify the Markov model to allow different increments at different positions:

$$X_j = X_{j-1} + W_j, \qquad W_j \sim \mathcal{N}\Big(0, \frac{1}{\theta_j}\Big), \qquad \theta_j > 0, \qquad j = 1, \ldots, n.$$

The corresponding prior model can be obtained in the same way as above:

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2\Big),$$

where this time around

$$D^{1/2} = \mathrm{diag}(\theta_1^{1/2}, \theta_2^{1/2}, \ldots, \theta_n^{1/2}).$$

If we knew the vector θ = [θ_1, ..., θ_n]^T, we could proceed as previously. If θ ∈ R^n is not known, it can be considered as a random variable and its estimation can be included as a part of the inference problem. To this end, we need to write the conditional density

ρ0(x | θ).

In this case, the normalizing constant of the density ρ0(x | θ) is no longer a constant, but depends on the random variable θ and thus cannot be ignored.

Recall the probability density of an n-variate Gaussian distribution:

$$\pi(z) = \Big(\frac{1}{(2\pi)^n \det(\Gamma)}\Big)^{1/2} \exp\Big(-\frac{1}{2} z^T \Gamma^{-1} z\Big),$$

where the mean is assumed to be zero. In our case, Γ = (L^T D L)^{-1}, where D = diag(θ) ∈ R^{n×n}. Recall that the determinant of a triangular matrix is the product of its diagonal elements, meaning that det(L) = det(L^T) = 1. Moreover, the determinant of an inverse matrix is the inverse of the determinant of the original matrix. Hence, it holds that

$$\det(\Gamma)^{-1} = \det(L^T D L) = \det(L^T)\det(D)\det(L) = \prod_{j=1}^n \theta_j,$$

and the properly normalized density becomes

$$\rho_0(x \mid \theta) = \Big(\frac{\prod_{j=1}^n \theta_j}{(2\pi)^n}\Big)^{1/2} \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2\Big)$$

$$= \frac{1}{(2\pi)^{n/2}} \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2 + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big).$$

To make the expressions for the prior and posterior slightly simpler, instead of the discretized signal x ∈ R^n we consider the corresponding increments w ∈ R^n.

Recalling from the preceding analysis that x = L−1w and using the change of variables formula for probability densities, with the knowledge that det(L−1) = 1, it follows that

$$\pi(w \mid \theta) \propto \prod_{j=1}^n \theta_j^{1/2} \exp\Big(-\frac{1}{2}\|D^{1/2}L(L^{-1}w)\|^2\Big) = \exp\Big(-\frac{1}{2}\|D^{1/2}w\|^2 + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big).$$

Next we need to choose a hyperprior density for θ. Qualitatively, we should allow some components of θ to deviate strongly from the 'average'.

We decide to use an ℓ1-type impulse prior with a positivity constraint:

$$\pi_{\mathrm{hyper}}(\theta) \propto \pi_+(\theta) \exp\Big(-\frac{\gamma}{2}\sum_{j=1}^n \theta_j\Big),$$

where π_+(θ) is one if all components of θ are positive, and zero otherwise, and γ > 0 is a hyperparameter. The posterior distribution can then be written as

$$\pi(x, \theta \mid y) \propto \pi(y \mid x)\,\pi(x, \theta) = \pi(y \mid x)\,\pi(x \mid \theta)\,\pi_{\mathrm{hyper}}(\theta) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2 - \frac{1}{2}\|D^{1/2}Lx\|^2 - \frac{\gamma}{2}\sum_{j=1}^n \theta_j + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big)$$

if all components of θ are positive, and π(x, θ | y) = 0 otherwise. It is straightforward to see that the corresponding MAP estimate is the minimizer of the functional

$$F(x, \theta) = \left\| \begin{bmatrix} \frac{1}{\sigma} A \\ D^{1/2} L \end{bmatrix} x - \begin{bmatrix} \frac{1}{\sigma} y \\ 0 \end{bmatrix} \right\|^2 + \gamma \sum_{j=1}^n \theta_j - \sum_{j=1}^n \log \theta_j$$

over (x, θ) ∈ R^n × R^n_+. We apply a two-stage minimization algorithm: choose some initial guesses for x and θ, and then repeat the following two steps until convergence is achieved:

1. Keep θ fixed and update x to be the solution of

$$\begin{bmatrix} \frac{1}{\sigma} A \\ D^{1/2} L \end{bmatrix} x = \begin{bmatrix} \frac{1}{\sigma} y \\ 0 \end{bmatrix},$$

where D = diag(θ).

2. Fix x and update θ by minimizing F(x, ·) with respect to the second variable. An easy calculation shows that this minimizer can be given componentwise as

$$\theta_j = \frac{1}{w_j^2 + \gamma}, \qquad j = 1, \ldots, n,$$

where w = Lx ∈ R^n is the vector of increments corresponding to x. A sketch of the resulting alternating algorithm is given below.
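A minimal sketch of the alternating minimization described above, assuming NumPy and that A, y, L and σ have been set up as before; γ and the initial guesses follow the values quoted in the figure caption below, while the fixed iteration count replaces a proper convergence check and the function name is an example choice:

```python
import numpy as np

def alternating_map(A, y, L, sigma, gamma=1e-5, n_iter=50):
    """Alternating minimization of F(x, theta) for the hypermodel MAP estimate."""
    m, n = A.shape
    theta = np.full(n, 1.0 / gamma)          # initial guess theta_{0,j} = 1/gamma
    x = np.zeros(n)                          # initial guess x_0 = 0
    for _ in range(n_iter):
        # Step 1: fix theta and solve the stacked least-squares problem for x.
        D_sqrt = np.sqrt(theta)
        K = np.vstack([A / sigma, D_sqrt[:, None] * L])
        rhs = np.concatenate([y / sigma, np.zeros(n)])
        x, *_ = np.linalg.lstsq(K, rhs, rcond=None)
        # Step 2: fix x and update theta componentwise: theta_j = 1 / (w_j^2 + gamma).
        w = L @ x
        theta = 1.0 / (w ** 2 + gamma)
    return x, theta
```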

[Figure: MAP estimate for x computed with the hypermodel (shown with f(t) and the first iterate) and the corresponding θ (shown with the initial guess θ_0), using γ = 10⁻⁵ and the initial guesses x_0 = 0 and θ_{0,j} = 1/γ, j = 1, ..., n. The data is the same as depicted earlier.]

Another example: the original signal f(t) and the measurement data.

[Figure: the original signal f(t) and the measurement data for the second example.]

MAP estimates for x and θ provided by the above alternating algorithm with γ = 10⁻⁵ and the initial guesses x_0 = 0 and θ_{0,j} = 1/γ, j = 1, ..., n:

[Figure: MAP estimate for x with the hypermodel (shown with f(t) and the first iterate) and the corresponding θ (shown with θ_0) for the second example.]