
Prior models

The prior density should reflect our beliefs on the unknown variable of interest before taking the measurements into account.

Often, the prior knowledge is qualitative in nature, and transferring the information into quantitative form expressed through a prior density can be challenging.

A good prior should have the following property: Denote by x a possible realization of a random variable X ∼ ρ0(x). If E is a collection of expected vectors x (i.e., ones you would expect to see) and U is a collection of unexpected ones, then it should hold that

$$\rho_0(x) \gg \rho_0(x') \qquad \text{when } x \in E,\; x' \in U,$$

i.e., the prior assigns a clearly higher probability to the realizations that we expect to see.

Example: Impulse prior densities

Consider, e.g., an imaging problem where the unknown is the discretized distribution of a physical parameter, i.e., a pixel image.

Assume that our prior information is that the image contains small, well-localized objects on an almost constant background. In such a case, one may try impulse prior densities, which have a low average amplitude but allow outliers. (The 'tail' of an impulse prior density is long, although the expected amplitude is small.)

Examples of impulse prior densities: Let x ∈ R^n represent a pixel image, where the component x_j is the intensity of the jth pixel. (In all of the following examples, X_j and X_k are assumed to be independent for j ≠ k.)

The ℓ1 prior:

$$\rho_0(x) = \frac{\alpha^n}{2^n} \exp(-\alpha \|x\|_1), \qquad \alpha > 0,$$

where the ℓ1-norm is defined as

$$\|x\|_1 = \sum_{j=1}^n |x_j|.$$

A more pronounced impulse effect can be obtained by taking an even smaller power of the components of x:

$$\rho_0(x) \propto \exp\Big(-\alpha \sum_{j=1}^n |x_j|^p\Big), \qquad 0 < p < 1, \; \alpha > 0.$$

Another choice is the Cauchy density, defined via

$$\rho_0(x) = \frac{\alpha^n}{\pi^n} \prod_{j=1}^n \frac{1}{1 + \alpha^2 x_j^2}, \qquad \alpha > 0.$$

The entropy of an image is defined as

$$E(x) = -\sum_{j=1}^n x_j \log \frac{x_j}{x_0},$$

where it is assumed that x_j > 0, j = 1, ..., n, and x_0 > 0 is a given constant. The entropy density is then of the form

$$\rho_0(x) \propto \exp(\alpha E(x)), \qquad \alpha > 0.$$

Log-normal density: The logarithm of a single pixel x ∈ R is normally distributed, i.e.,

$$w = \log x, \qquad w \sim \mathcal{N}(w_0, \sigma^2).$$

The explicit density of x is then

$$\pi(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{1}{2\sigma^2}(\log x - w_0)^2\Big), \qquad x > 0.$$

Do these priors represent our beliefs? What do these priors look like? To underline the interpretation as a pixel image, we add a positivity constraint to the priors introduced above, that is, we make the replacement

$$\rho_0(x) \to C\,\pi_+(x)\,\rho_0(x),$$

where π_+(x) is one if all components of x are positive, and zero otherwise. Here, C is a normalizing constant: if ρ0(x) is a probability density, the same does not typically apply to π_+(x)ρ0(x) without appropriate scaling.

For visual inspection we make random draws of pixel images from the constrained densities. As all components are independent, drawing can be done componentwise.

To make the draws from the one-dimensional densities, we calculate the cumulative distribution function of the prior density and employ the Golden Rule, as presented earlier.

Example: Drawing from the ℓ1 prior

The one-dimensional cumulative distribution function of the positively constrained ℓ1 prior is

$$\Phi(t) = \alpha \int_0^t e^{-\alpha s}\, ds = 1 - e^{-\alpha t}.$$

The inverse cumulative distribution function is thus

$$\Phi^{-1}(t) = -\frac{1}{\alpha} \log(1 - t).$$

For each pixel x_j, we draw t_j from the uniform distribution Uniform([0, 1]) and calculate x_j = −(1/α) log(1 − t_j), as in the sketch below.
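A minimal sketch of this inverse-CDF ('Golden Rule') sampler for the positively constrained ℓ1 prior, assuming NumPy is available; the image size and the value of α are arbitrary example choices:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.0          # rate parameter of the l1 prior (example value)
n_pix = 64 * 64      # number of pixels in the example image

# Golden Rule / inverse-CDF sampling: t ~ Uniform([0, 1]), x = -log(1 - t) / alpha
t = rng.uniform(size=n_pix)
x_l1 = -np.log(1.0 - t) / alpha       # componentwise draws from the constrained l1 prior

image = x_l1.reshape(64, 64)          # interpret the draw as a 64 x 64 pixel image
```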

Example: Drawing from the Cauchy prior

The one-dimensional cumulative distribution function of the positively constrained Cauchy prior is

$$\Phi(t) = \frac{2\alpha}{\pi} \int_0^t \frac{1}{1 + \alpha^2 s^2}\, ds = \frac{2}{\pi}\arctan(\alpha t),$$

meaning that the inverse cumulative distribution function is

$$\Phi^{-1}(t) = \frac{1}{\alpha}\tan\Big(\frac{\pi t}{2}\Big).$$

As in the case of the ℓ1 prior, we draw t_j from the uniform distribution and then calculate x_j = (1/α) tan(π t_j / 2); see the sketch below.
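A corresponding sketch for the positively constrained (half-)Cauchy prior, again assuming NumPy and arbitrary example values for α and the image size:

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = 1.0
n_pix = 64 * 64

# Inverse-CDF sampling for the positively constrained Cauchy density: x = tan(pi * t / 2) / alpha
t = rng.uniform(size=n_pix)
x_cauchy = np.tan(np.pi * t / 2.0) / alpha

image = x_cauchy.reshape(64, 64)
```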

[Figure: two random draws of pixel images from the Cauchy prior.]

How do these priors compare to white noise? Let us consider a Gaussian prior with a positivity constraint, i.e.,

$$\rho_0(x) \propto \pi_+(x) \exp\Big(-\frac{1}{2\alpha^2}\|x\|^2\Big), \qquad \alpha > 0.$$

Recall that at the previous lecture we implemented drawing from a standard Gaussian distribution with a bound c. In particular, we were able to calculate the inverse of the one-dimensional cumulative distribution function,

$$\Phi^{-1}(t) = \sqrt{2}\,\operatorname{erf}^{-1}\!\big(t\,(1 - \operatorname{erf}(c/\sqrt{2})) + \operatorname{erf}(c/\sqrt{2})\big).$$
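For reference, a sketch of this inverse-CDF sampler for the bounded standard Gaussian, assuming SciPy's erf/erfinv and interpreting c as a lower bound (as the formula suggests); the function name and parameter values are example choices:

```python
import numpy as np
from scipy.special import erf, erfinv

rng = np.random.default_rng(2)

def truncated_std_normal(c, size):
    """Draws from a standard Gaussian restricted to values >= c via the inverse CDF."""
    t = rng.uniform(size=size)
    return np.sqrt(2.0) * erfinv(t * (1.0 - erf(c / np.sqrt(2.0))) + erf(c / np.sqrt(2.0)))

# c = 0 gives the positively constrained case; multiplying by alpha gives variance alpha^2.
samples = 0.5 * truncated_std_normal(c=0.0, size=10_000)   # example with alpha = 0.5
```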

A similar derivation for c = 0 and variance α² instead of 1 yields in the current case

$$\Phi^{-1}(t) = \sqrt{2}\,\alpha\, \operatorname{erf}^{-1}(t).$$

[Figure: random draws of pixel images from the ℓ1 prior, the Cauchy prior, and the white noise prior.]

Discontinuities

Prior information: The unknown is a function of, say, time. It is known to be relatively stable for long periods of time, but contains now and then discontinuities. We may also have information on the size of the jumps or the rate of occurrence of the discontinuities.

A more concrete example: Unknown is a function f : [0, 1] → R. We know that f (0) = 0 and that the function may have large jumps at a few locations.

After discretizing f , impulse priors can be used to construct a prior on the finite difference approximation of the derivative of f . Discretization of the interval [0, 1]: Choose grid points tj = j/N, j = 0,..., N, and set xj = f (tj ).

We write a Cauchy-type prior density

$$\rho_0(x) = \frac{\alpha^N}{\pi^N} \prod_{j=1}^N \frac{1}{1 + \alpha^2 (x_j - x_{j-1})^2}$$

that controls the jumps between the adjacent components of x ∈ R^{N+1}. In particular, the components of X are not independent. (In addition to this prior, we know that X_0 = x_0 = 0.)

To make draws from the above density, we define new variables

$$\xi_j = x_j - x_{j-1}, \qquad 1 \le j \le N,$$

which are the changes in the function of interest between adjacent grid points. Notice that x̃ = [x_1, ..., x_N]^T ∈ R^N satisfies

$$\tilde{x} = A\xi,$$

where A ∈ R^{N×N} is a lower triangular matrix such that A_{jk} = 1 for j ≥ k. Hence, it follows, e.g., from the change of variables rule for probability densities that

$$\rho_0(\xi) = \frac{\alpha^N}{\pi^N} \prod_{j=1}^N \frac{1}{1 + \alpha^2 \xi_j^2}.$$

In particular, due to the product form of ρ0(ξ), the components of Ξ are mutually independent, and can thus be drawn componentwise from a one-dimensional Cauchy density. Subsequently, a random draw from the distribution of X can be constructed by recalling that x_0 = 0 and using the relation x̃ = Aξ; see the sketch below.
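A minimal sketch of this construction, assuming NumPy; the grid size N and the parameter α are example values, and the two-sided Cauchy increments are drawn via the inverse-CDF formula ξ = tan(π(t − 1/2))/α:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 100          # number of grid points on (0, 1] (example value)
alpha = 10.0     # Cauchy scale parameter (example value)

# Independent Cauchy-distributed increments via the inverse CDF.
t = rng.uniform(size=N)
xi = np.tan(np.pi * (t - 0.5)) / alpha

# x_0 = 0 and x_tilde = A xi with A lower triangular of ones,
# i.e. x_j is the cumulative sum of the increments.
x = np.concatenate(([0.0], np.cumsum(xi)))
```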

[Figure: two random draws from the Cauchy-type difference prior.]

Sample-based densities

Assume that we have a large sample of realizations of a random variable X ∈ R^n:

$$S = \{x^1, x^2, \ldots, x^N\}.$$

One way to construct a prior density for X is to approximate ρ0(x) based on S.

Estimates of the mean and the covariance:

$$\mathrm{E}\{X\} \approx \frac{1}{N} \sum_{j=1}^N x^j =: \bar{x},$$

$$\mathrm{cov}(X) = \mathrm{E}\{XX^T\} - \mathrm{E}\{X\}\mathrm{E}\{X\}^T \approx \frac{1}{N} \sum_{j=1}^N x^j (x^j)^T - \bar{x}\bar{x}^T =: C.$$

(Notice that C is not the unbiased sample covariance estimator, but let us anyway follow the notation of the textbook.) The eigenvalue decomposition of C is

$$C = UDU^T,$$

where U ∈ R^{n×n} is orthogonal and has the eigenvectors of C as its columns, and D ∈ R^{n×n} is diagonal with the eigenvalues d_1 ≥ ... ≥ d_n ≥ 0 as its diagonal entries. (Note that C is clearly symmetric and positive semi-definite, and thus it has a full set of eigenvectors with non-negative eigenvalues.)
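A sketch of these sample-based computations, assuming NumPy and that the realizations are stored as the rows of an array S of shape (N, n); the function name is an example choice:

```python
import numpy as np

def sample_mean_cov_eig(S):
    """Sample mean, (biased) sample covariance and its eigendecomposition.

    S is an (N, n) array whose rows are the realizations x^1, ..., x^N.
    """
    N = S.shape[0]
    x_bar = S.mean(axis=0)
    C = S.T @ S / N - np.outer(x_bar, x_bar)   # (1/N) sum x^j (x^j)^T - x_bar x_bar^T
    d, U = np.linalg.eigh(C)                   # eigenvalues in ascending order
    d, U = d[::-1], U[:, ::-1]                 # reorder so that d_1 >= ... >= d_n
    return x_bar, C, d, U
```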

The vectors x^j, j = 1, ..., N, are typically 'somewhat similar', and the matrix C can consequently be singular or almost singular: the eigenvalues often satisfy d_j ≈ 0 for j > r, where 1 < r < n is some cut-off index. In other words, the difference X − E{X} does not seem to vary much in the directions of the eigenvectors u_{r+1}, ..., u_n. Assume this is the case. Then one can postulate that the values of the random variable X − E{X} lie 'with a high probability' in the subspace spanned by the first r eigenvectors of C. One way of trying to state this information quantitatively is to introduce a subspace prior

$$\pi(x) \propto \exp\big(-\alpha \|(I - P)(x - \bar{x})\|^2\big),$$

where P is the orthogonal projector R^n → span{u_1, ..., u_r}. The parameter α > 0 controls how much X − x̄ is allowed to deviate from the subspace span{u_1, ..., u_r}. (Take note that such a subspace prior is not a probability density in the traditional sense.) If C is not almost singular, the inverse C⁻¹ can be computed stably. In this case, the most straightforward way of approximating the (prior) probability density of X is to introduce the Gaussian approximation:

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2} (x - \bar{x})^T C^{-1} (x - \bar{x})\Big).$$
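As an illustration, the two constructions above could be evaluated as follows; this is a sketch continuing the previous snippet, and the cut-off index r, the parameter α and the function names are example choices:

```python
import numpy as np

def subspace_prior_logdensity(x, x_bar, U, r, alpha):
    """Unnormalized log of the subspace prior exp(-alpha ||(I - P)(x - x_bar)||^2)."""
    resid = x - x_bar
    proj = U[:, :r] @ (U[:, :r].T @ resid)   # P(x - x_bar): projection onto span{u_1, ..., u_r}
    return -alpha * np.sum((resid - proj) ** 2)

def gaussian_approx_logdensity(x, x_bar, C):
    """Unnormalized log of the Gaussian approximation exp(-(x - x_bar)^T C^{-1} (x - x_bar) / 2)."""
    resid = x - x_bar
    return -0.5 * resid @ np.linalg.solve(C, resid)
```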

Depending on the higher order statistics of X, this may or may not provide a good approximation for the distribution of X.

Hypermodels

In the statistical framework, the prior densities usually depend on some parameters, such as the variance or the mean. Typically, or at least thus far, these parameters have been assumed to be known.

Some classical regularization methods can be viewed as the construction of estimators based on the posterior density (e.g., the MAP estimate). The regularization parameter, which corresponds to the parameter that defines the prior distribution, is not assumed to be known, but is selected using, e.g., the Morozov discrepancy principle.

What happens if it is not clear how to choose these ‘prior parameters’ in the statistical framework?

If a parameter is not known, it can be estimated as a part of the problem based on the measurement data. This leads to hierarchical models that include hypermodels for the parameters defining the prior density. Assume that the prior distribution depends on a parameter α which is not assumed to be known. Then we write the prior as a conditional density, that is,

ρ0(x | α).

Assuming we have a hyperprior density for α, i.e.,

πhyper(α), we can write the joint distribution of x and α as

π(x, α) = ρ0(x | α)πhyper(α).

Assuming a likelihood model π(y | x) for the measurement data y, we get the posterior density for x and α, given y, from the Bayes formula:

$$\pi(x, \alpha \mid y) \propto \pi(y \mid x)\,\pi(x, \alpha) = \pi(y \mid x)\,\pi(x \mid \alpha)\,\pi_{\mathrm{hyper}}(\alpha).$$

In general, the hyperprior density π_hyper may depend on some additional parameter α_0. In such a case, the main reason for the use of a hyperprior model is that the construction of the posterior is assumed to be more robust with respect to fixing a value for the hyperparameter α_0 than fixing a value for α.

Sometimes α0 can also be treated as a random variable with a respective probability density. Then, we would write

$$\pi_{\mathrm{hyper}}(\alpha \mid \alpha_0),$$

giving rise to nested hypermodels.

Example: Hypermodel for a deconvolution problem

(Adapted from the textbook by Calvetti and Somersalo, Chapter 10.)

Consider a one-dimensional deconvolution problem, the goal of which is to estimate a signal f : [0, 1] → R from noisy, blurred observations modelled as

$$y_i = g(s_i) = \int_0^1 A(s_i, t) f(t)\, dt + e(s_i), \qquad 1 \le i \le m,$$

where {s_i}_{i=1}^m ⊂ [0, 1] are the uniformly distributed measurement points and the blurring kernel is defined to be

$$A(s, t) = \exp\Big(-\frac{1}{2\omega^2}(t - s)^2\Big),$$

and the noise is Gaussian, or more precisely e ∼ N(0, σ²I). To begin with, we discretize the model as

y = Ax + e,

where A ∈ R^{m×n} is obtained by approximating the integral with a suitable quadrature rule, and the vector x contains the values of the unknown signal at the discretization points {t_j}_{j=0}^n that we have chosen to be distributed uniformly over the interval [0, 1]. To be more precise,

$$x_j = f(t_j), \qquad t_j = \frac{j}{n}, \qquad 0 \le j \le n.$$
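A sketch of how the discretized forward model could be assembled, assuming NumPy and, for illustration, a simple rectangle (Riemann) quadrature rule over the nodes t_1, ..., t_n; the values of m and n are example choices, and ω is set to the value quoted later in this example (ω ≈ 0.05):

```python
import numpy as np

m, n = 100, 150          # number of measurements and unknowns (example values)
omega = 0.05             # width of the blurring kernel

s = np.linspace(0.0, 1.0, m)          # measurement points s_i
t = np.arange(1, n + 1) / n           # discretization points t_1, ..., t_n (x_0 = 0 is excluded)

# Rectangle-rule quadrature: each column of the kernel gets the weight 1/n.
A = np.exp(-(t[None, :] - s[:, None]) ** 2 / (2.0 * omega ** 2)) / n
```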

For simplicity we assume it is known that f(0) = x_0 = 0, and define the actual unknown x to be

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n.$$

Assume that as prior information we know that the signal is continuous except for a possible jump discontinuity at a known location. Let us start with a Gaussian first order smoothness prior,

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2\gamma^2}\|Lx\|^2\Big),$$

where L is a first order finite difference matrix (recall that x_0 = 0),

$$L = \begin{bmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{bmatrix} \in \mathbb{R}^{n\times n}.$$

It is easy to see that L is invertible and

1  1 1  L−1 =   . .. ..  . . .  1 ... 1 1

is a lower triangular matrix. Since (1/γ)L is the whitening matrix of X ∈ R^n distributed according to ρ0(x), it follows that

$$X = L^{-1}W, \qquad W \sim \mathcal{N}(0, \gamma^2 I).$$

Due to the particular shape of L−1, this relation can alternatively be given as a Markov process:

$$X_j = X_{j-1} + W_j, \qquad W_j \sim \mathcal{N}(0, \gamma^2), \qquad j = 1, \ldots, n, \qquad X_0 = 0.$$

Next, we aim at fine-tuning the above smoothness prior so that it allows a jump discontinuity over the interval [t_{k-1}, t_k]. To this end, we modify the above Markov model (only) at j = k by setting

$$X_k = X_{k-1} + W_k, \qquad W_k \sim \mathcal{N}\Big(0, \frac{\gamma^2}{\delta^2}\Big),$$

where δ < 1 is a parameter controlling the variance of W_k, i.e., the expected size of the jump.

Let us walk the above steps backwards: It is easy to see that this new Markov process can alternatively be given as

$$X = L^{-1}(D^{1/2})^{-1}W, \qquad W \sim \mathcal{N}(0, \gamma^2 I),$$

where

$$D^{1/2} = \mathrm{diag}(1, 1, \ldots, \delta, \ldots, 1, 1) \in \mathbb{R}^{n\times n}$$

is defined so that (D^{1/2})^{-1} scales the kth component of W by 1/δ. In consequence, after the above modification in the kth step of the Markov process defining X, the random variable D^{1/2}LX is distributed according to N(0, γ²I), and thus we have introduced the fine-tuned 'jump prior'

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2\gamma^2}\|D^{1/2}Lx\|^2\Big).$$

Let us draw samples from this kind of a prior density. We set n = 150 and γ = 0.1, meaning that we expect increments of the order 0.1 on most of the subintervals. As an exception, at two known locations t ≈ 0.4 and t ≈ 0.8 we use δ < 1 at the corresponding diagonal elements of D^{1/2}, in anticipation of jumps of the order γ/δ = 0.1/δ; a sampling sketch is given below.
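A sketch of drawing from this jump prior with the stated parameters (n = 150, γ = 0.1, jump locations near t ≈ 0.4 and t ≈ 0.8), assuming NumPy; δ = 0.02 is one of the two values shown in the figure below, and the exact jump indices are approximate:

```python
import numpy as np

rng = np.random.default_rng(4)

n, gamma, delta = 150, 0.1, 0.02
jump_indices = [int(0.4 * n), int(0.8 * n)]     # increments corresponding to t ~ 0.4 and t ~ 0.8

# Markov-process formulation: X_j = X_{j-1} + W_j with Var(W_j) = gamma^2,
# except Var(W_k) = gamma^2 / delta^2 at the jump locations.
std = np.full(n, gamma)
std[jump_indices] = gamma / delta
increments = rng.normal(scale=std)
x = np.concatenate(([0.0], np.cumsum(increments)))   # includes x_0 = 0
```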

[Figure: random draws from the jump discontinuity prior with two different values of δ (δ = 0.1 and δ = 0.02).]

As the additive noise was assumed to be Gaussian, the likelihood density corresponding to the considered measurement is

$$\pi(y \mid x) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2\Big),$$

and due to the Bayes formula, the posterior density can thus be written as

$$\pi(x \mid y) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2 - \frac{1}{2\gamma^2}\|D^{1/2}Lx\|^2\Big).$$

Using the results for Gaussian densities from previous lectures, the mean of the posterior, which is also the MAP and the CM estimate, can be written explicitly as

$$x_{\mathrm{CM}} = x_{\mathrm{MAP}} = \Big(\frac{\sigma^2}{\gamma^2} L^T (D^{1/2})^T D^{1/2} L + A^T A\Big)^{-1} A^T y.$$
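A sketch of computing this estimate numerically via the equivalent stacked least-squares formulation, assuming NumPy and that A, y, L, the diagonal of D^{1/2}, σ and γ have been set up as in the preceding discussion; the function name is an example choice:

```python
import numpy as np

def map_estimate(A, y, L, D_sqrt_diag, sigma, gamma):
    """MAP/CM estimate for the Gaussian jump prior via stacked least squares."""
    n = A.shape[1]
    # Minimize ||(1/sigma)(Ax - y)||^2 + ||(1/gamma) D^{1/2} L x||^2.
    K = np.vstack([A / sigma, (D_sqrt_diag[:, None] * L) / gamma])
    rhs = np.concatenate([y / sigma, np.zeros(n)])
    x_map, *_ = np.linalg.lstsq(K, rhs, rcond=None)
    return x_map
```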

[Figure: the original signal f(t) and the measurement data (ω ≈ 0.05).]

Posterior estimates for f without the discontinuity model (i.e., with the mere first order smoothness prior) and with the discontinuity model with known locations and jump sizes (γ = 0.1):

[Figure: MAP estimate without the jump model and MAP estimate with the jump model (known location and size).]

Next we choose γ = 0.01, which corresponds to increments of the order 0.01 on each subinterval, and scale δ accordingly so that it remains in accordance with jump sizes of the order 1.

[Figure: MAP estimate without the jump model and MAP estimate with the jump model (known location and size) for γ = 0.01.]

Assume next that the locations and expected sizes of the jumps are not known, but we expect a slowly varying signal that could have a few jumps at unknown locations. We modify the Markov model to allow different increments at different positions:

$$X_j = X_{j-1} + W_j, \qquad W_j \sim \mathcal{N}\Big(0, \frac{1}{\theta_j}\Big), \qquad \theta_j > 0, \qquad j = 1, \ldots, n.$$

The corresponding prior model can be obtained in the same way as above:

$$\rho_0(x) \propto \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2\Big),$$

where this time around

$$D^{1/2} = \mathrm{diag}(\theta_1^{1/2}, \theta_2^{1/2}, \ldots, \theta_n^{1/2}).$$

If we knew the vector θ = [θ_1, ..., θ_n]^T, we could proceed as previously. If θ ∈ R^n is not known, it can be considered as a random variable and its estimation can be included as a part of the inference problem. To this end, we need to write the conditional density

ρ0(x | θ).

In this case, the normalizing constant of the density ρ0(x | θ) is no longer a constant, but depends on the random variable θ and thus cannot be ignored.

Recall the probability density of an n-variate Gaussian distribution:

$$\pi(z) = \Big(\frac{1}{(2\pi)^n \det(\Gamma)}\Big)^{1/2} \exp\Big(-\frac{1}{2} z^T \Gamma^{-1} z\Big),$$

where the mean is assumed to be zero. In our case, Γ = (L^T D L)^{-1}, where D = diag(θ) ∈ R^{n×n}. Recall that the determinant of a triangular matrix is the product of its diagonal elements, meaning that det(L) = det(L^T) = 1. Moreover, the determinant of an inverse matrix is the inverse of the determinant of the original matrix. Hence, it holds that

$$\det(\Gamma)^{-1} = \det(L^T D L) = \det(L^T)\det(D)\det(L) = \prod_{j=1}^n \theta_j,$$

and the properly normalized density becomes

$$\rho_0(x \mid \theta) = \Big(\frac{\prod_{j=1}^n \theta_j}{(2\pi)^n}\Big)^{1/2} \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2\Big)$$

$$= \frac{1}{(2\pi)^{n/2}} \exp\Big(-\frac{1}{2}\|D^{1/2}Lx\|^2 + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big).$$

To make the expressions for the prior and posterior slightly simpler, instead of the discretized signal x ∈ R^n we consider the corresponding increments w ∈ R^n.

Recalling from the preceding analysis that x = L−1w and using the change of variables formula for probability densities, with the knowledge that det(L−1) = 1, it follows that

$$\pi(w \mid \theta) \propto \prod_{j=1}^n \theta_j^{1/2} \exp\Big(-\frac{1}{2}\|D^{1/2}L(L^{-1}w)\|^2\Big) = \exp\Big(-\frac{1}{2}\|D^{1/2}w\|^2 + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big).$$

Next we need to choose a hyperprior density for θ. Qualitatively, we should allow some components of θ to deviate strongly from the 'average'.

We decide to use an ℓ1-type impulse prior with a positivity constraint:

$$\pi_{\mathrm{hyper}}(\theta) \propto \pi_+(\theta) \exp\Big(-\frac{\gamma}{2}\sum_{j=1}^n \theta_j\Big),$$

where π_+(θ) is one if all components of θ are positive, and zero otherwise, and γ > 0 is a hyperparameter. The posterior distribution can then be written as

$$\pi(x, \theta \mid y) \propto \pi(y \mid x)\,\pi(x, \theta) = \pi(y \mid x)\,\pi(x \mid \theta)\,\pi_{\mathrm{hyper}}(\theta) \propto \exp\Big(-\frac{1}{2\sigma^2}\|y - Ax\|^2 - \frac{1}{2}\|D^{1/2}Lx\|^2 - \frac{\gamma}{2}\sum_{j=1}^n \theta_j + \frac{1}{2}\sum_{j=1}^n \log \theta_j\Big)$$

if all components of θ are positive, and π(x, θ | y) = 0 otherwise. It is straightforward to see that the corresponding MAP estimate is the minimizer of the functional

$$F(x, \theta) = \left\| \begin{bmatrix} \frac{1}{\sigma} A \\ D^{1/2} L \end{bmatrix} x - \begin{bmatrix} \frac{1}{\sigma} y \\ 0 \end{bmatrix} \right\|^2 + \gamma \sum_{j=1}^n \theta_j - \sum_{j=1}^n \log \theta_j$$

over (x, θ) ∈ R^n × R^n_+. We apply a two-stage minimization algorithm: choose some initial guesses for x and θ, and then repeat the following two steps until convergence is achieved:

1. Keep θ fixed and update x to be the solution of

$$\begin{bmatrix} \frac{1}{\sigma} A \\ D^{1/2} L \end{bmatrix} x = \begin{bmatrix} \frac{1}{\sigma} y \\ 0 \end{bmatrix},$$

where D = diag(θ).

2. Fix x and update θ by minimizing F(x, ·) with respect to the second variable. An easy calculation shows that this minimizer can be given componentwise as

$$\theta_j = \frac{1}{w_j^2 + \gamma}, \qquad j = 1, \ldots, n,$$

where w = Lx ∈ R^n is the vector of increments corresponding to x. A sketch of the resulting alternating algorithm is given below.
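A minimal sketch of the alternating minimization described above, assuming NumPy and that A, y, L and σ have been set up as before; γ and the initial guesses follow the values quoted in the figure caption below, while the fixed iteration count replaces a proper convergence check and the function name is an example choice:

```python
import numpy as np

def alternating_map(A, y, L, sigma, gamma=1e-5, n_iter=50):
    """Alternating minimization of F(x, theta) for the hypermodel MAP estimate."""
    m, n = A.shape
    theta = np.full(n, 1.0 / gamma)          # initial guess theta_{0,j} = 1/gamma
    x = np.zeros(n)                          # initial guess x_0 = 0
    for _ in range(n_iter):
        # Step 1: fix theta and solve the stacked least-squares problem for x.
        D_sqrt = np.sqrt(theta)
        K = np.vstack([A / sigma, D_sqrt[:, None] * L])
        rhs = np.concatenate([y / sigma, np.zeros(n)])
        x, *_ = np.linalg.lstsq(K, rhs, rcond=None)
        # Step 2: fix x and update theta componentwise: theta_j = 1 / (w_j^2 + gamma).
        w = L @ x
        theta = 1.0 / (w ** 2 + gamma)
    return x, theta
```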

[Figure: MAP estimate for x computed with the hypermodel (shown with f(t) and the first iterate) and the corresponding θ (shown with the initial guess θ_0), using γ = 10⁻⁵ and the initial guesses x_0 = 0 and θ_{0,j} = 1/γ, j = 1, ..., n. The data is the same as depicted earlier.]

Another example: the original signal f(t) and the measurement data.

[Figure: the original signal f(t) and the measurement data for the second example.]

MAP estimates for x and θ provided by the above alternating algorithm with γ = 10⁻⁵ and the initial guesses x_0 = 0 and θ_{0,j} = 1/γ, j = 1, ..., n:

[Figure: MAP estimate for x with the hypermodel (shown with f(t) and the first iterate) and the corresponding θ (shown with θ_0) for the second example.]