Superresolution via Student-t Mixture Models
Master Thesis – improved version –
TU Kaiserslautern Department of Mathematics
Johannes Hertrich
supervised by Prof. Dr. Gabriele Steidl
Kaiserslautern, submitted on February 4, 2020
Contents

1. Introduction
2. Preliminaries
   2.1. Definitions and Notations
        2.1.1. Random Variables and Estimators
        2.1.2. Conditional Probabilities, Distribution and Expectation
        2.1.3. Maximum a Posteriori Estimator
        2.1.4. The Kullback-Leibler Divergence
   2.2. The EM Algorithm
   2.3. The PALM and iPALM Algorithms
        2.3.1. Proximal Alternating Linearized Minimization (PALM)
        2.3.2. Inertial Proximal Alternating Linearized Minimization (iPALM)
3. Alternatives of the EM Algorithm for Estimating the Parameters of the Student-t Distribution
   3.1. Likelihood of the Multivariate Student-t Distribution
   3.2. Existence of Critical Points
   3.3. Zeros of F
   3.4. Algorithms
   3.5. Numerical Results
        3.5.1. Comparison of Algorithms
        3.5.2. Unsupervised Estimation of Noise Parameters
4. Superresolution via Student-t Mixture Models
   4.1. Estimating the Parameters
        4.1.1. Initialization
        4.1.2. Simulation Study
   4.2. Superresolution
        4.2.1. Expected Patch Log-Likelihood for Student-t Mixture Models
        4.2.2. Joint Student-t Mixture Models
   4.3. Numerical Results
        4.3.1. Comparison to Gaussian Mixture Models
        4.3.2. FIB-SEM Images
5. Conclusion and Future Work
A. Examples for the EM Algorithm
   A.1. EM Algorithm for Student-t Distributions
   A.2. EM Algorithm for Mixture Models
   A.3. EM Algorithm for Student-t Mixture Models
B. Auxiliary Lemmas
C. Derivatives of the Negative Log-Likelihood Function for Student-t Mixture Models

1. Introduction
Superresolution is the process of reconstructing a high resolution image from a low resolution image. There exist several approaches that use Gaussian mixture models for superresolution (see e.g. [35, 45]). In this thesis, we extend this approach to Student-t mixture models and focus on the estimation of the parameters of Student-t distributions and Student-t mixture models. For this purpose, we first consider numerical algorithms to compute the maximum likelihood estimator of the parameters of a multivariate Student-t distribution and propose three alternatives to the classical Expectation Maximization (EM) algorithm. Then, we extend our considerations to Student-t mixture models and finally, we apply our algorithms to some numerical examples.
The thesis is organized as follows: in Section 2, we review preliminary results. Further, we introduce the EM algorithm, the Proximal Alternating Linearized Minimization (PALM) and the inertial Proximal Alternating Linearized Minimization (iPALM) algorithms in their general forms and cite the corresponding convergence results.
Then, in Section 3, we consider maximum likelihood estimation of the parameters of a multivariate Student-t distribution. This section (including Appendix B) is already contained in the arXiv preprint [16] and submitted for journal publication. In Section 3.1, we introduce the Student-t distribution, the negative log-likelihood function L and their derivatives. In Section 3.2, we provide some results concerning the existence of minimizers of L. Section 3.3 deals with the solution of the equation arising when setting the gradient of L with respect to ν to zero. The results of this section will be important for the convergence considerations of our algorithms in Section 3.4, where we propose three alternatives to the classical EM algorithm. For a fixed degree of freedom ν, the first alternative is known in the literature as the accelerated EM algorithm; it was considered e.g. in [19, 28, 40]. In our case, since we do not fix ν, it cannot be interpreted as an EM algorithm. The other two alternatives differ from the first one in the ν step of the iteration. We show that the objective function L decreases in each iteration step and provide a simulation study to compare the performance of these algorithms. Finally, we provide two kinds of numerical results in Section 3.5. First, we compare the different algorithms by numerical examples, which indicate that the new ν iterations are very efficient for estimating ν of different magnitudes. Second, we come back to the original motivation of this part and estimate the degree of freedom parameter ν from images corrupted by one-dimensional Student-t noise.
In Section 4, we consider superresolution via Student-t mixture models. Section 4.1 deals with the parameter estimation of Student-t mixture models. We propose three alternatives to the EM algorithm. The first alternative differs from the EM algorithm in the update of the Σ and the ν step. The second and third algorithms are the PALM and iPALM algorithms as proposed in [6] and [34], applied to the negative log-likelihood function L of the Student-t mixture model; these algorithms have so far not been used in connection with mixture models. We describe some heuristics to initialize the algorithms and to set the parameters in the PALM and iPALM algorithms. Further, we compare the algorithms by a simulation study. In Section 4.2, we adapt two methods for superresolution with Student-t mixture models, which were originally proposed in [35] and [45] for Gaussian mixture models. Finally, in Section 4.3, we compare our methods with Gaussian mixture models and apply them to images generated by Focused Ion Beam and Scanning Electron Microscopy (FIB-SEM).
Acknowledgement
We would like to thank Professor Thomas Pock from TU Graz for the fruitful discussions on the usage of PALM and iPALM in Section 4.1. Further, we thank the group of Dominique Bernard from the ICMCB material science lab at the University of Bordeaux for generating the FIB-SEM images within the ANR-DFG project "SUPREMATIM", which we used in Section 4.3.
2. Preliminaries
2.1. Definitions and Notations
2.1.1. Random Variables and Estimators
Let (Ω, A, P) be a probability space and let (Ω′, A′) be a measurable space. We call a measurable mapping X: Ω → Ω′ a random element. If Ω′ = R^d and A′ = B, where B denotes the Borel σ-algebra, we call X a random vector. We say X is a random variable if d = 1. For a random element X: Ω → Ω′ we call the probability measure P_X: A′ → [0, 1] defined by

    P_X(A) = P(X⁻¹(A)),  A ∈ A′,

the image measure or distribution of X.
Definition 2.1 (Mean, Variance, Standard deviation). Let X: Ω → R be a random variable. We define the mean of X by

    E(X) = E_P(X) = ∫_Ω X(ω) dP(ω) = ∫_R x dP_X(x).
For 1 ≤ p ≤ ∞ we denote the Banach space of (equivalence classes of) random variables with E(|X|^p) < ∞ by L^p(Ω, A, P). Note that L^q ⊆ L^p for 1 ≤ p < q ≤ ∞. Further, for X ∈ L^2(Ω, A, P) we denote the variance of X by

    Var(X) = E_P((X − E_P(X))²)

and call √Var(X) the standard deviation of X. For X, Y ∈ L^2(Ω, A, P) we call

    Cov(X, Y) = E((X − E(X))(Y − E(Y)))

the covariance of X and Y. For a random vector X = (X_1, ..., X_d)^T: Ω → R^d with X_i ∈ L^1(Ω, A, P) for all i = 1, ..., d we use the notation

    E(X) = (E(X_1), ..., E(X_d))^T

for the mean. Further, if X_i ∈ L^2(Ω, A, P) for all i = 1, ..., d, we call

    Cov(X) = (Cov(X_i, X_j))_{i,j=1}^d

the covariance matrix of X.
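As a numerical illustration, the empirical counterparts of these quantities can be computed as follows (a minimal Python sketch using NumPy; the distribution and sample size below are hypothetical):

```python
import numpy as np

# Hypothetical example: n = 1000 samples of a 2-dimensional random vector X
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mean_hat = x.mean(axis=0)            # empirical counterpart of E(X)
cov_hat = np.cov(x, rowvar=False)    # empirical counterpart of Cov(X)
std_hat = np.sqrt(np.diag(cov_hat))  # empirical standard deviations

print(mean_hat)
print(cov_hat)
```

Note that the empirical covariance matrix is symmetric by construction, mirroring Cov(X_i, X_j) = Cov(X_j, X_i).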
Definition 2.2 (Probability densities). Let X: Ω → R^d be a random vector. If there exists some function f_X: R^d → R_{≥0} with

    P({ω ∈ Ω : X(ω) ∈ A}) = P_X(A) = ∫_A f_X(x) dx,  A ∈ B,

then we call f_X the probability density function of X.
Now, let (Ω, A) be a measurable space and let Θ ⊆ R^d. We call a family of probability measures (P_ϑ)_{ϑ∈Θ} a parametric distribution family. Given some independent identically distributed samples x_1, ..., x_n of a random vector X: Ω → R^{d_1} defined on the probability space (Ω, A, P_ϑ), we want to recover the parameter ϑ of the underlying measure.
Definition 2.3 (Estimators). A measurable mapping T: R^{d_1×n} → Θ is called an estimator of ϑ.
A common choice for an estimator is the maximum likelihood (ML) estimator. Assume that X is a random vector with a probability density function or that X is a discrete random vector. Then we define the likelihood function L: Θ → R ∪ {∞} by

    L(ϑ | x_1, ..., x_n) = ∏_{i=1}^n p(x_i),

where

    p(x) = f_X(x), if X has a density,
    p(x) = P_X(x), if X is a discrete random vector.

Now we define the maximum likelihood estimator by

    ϑ̂ ∈ argmax_{ϑ∈Θ} L(ϑ | x_1, ..., x_n).
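As a concrete illustration of the ML estimator, consider the univariate Gaussian family with parameter ϑ = (μ, σ²), where the maximizer of the likelihood is known in closed form (a minimal Python sketch; the data and parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=5000)  # hypothetical i.i.d. samples

# Maximizing the Gaussian log-likelihood in (mu, sigma^2) yields the
# well-known closed-form ML estimators:
mu_hat = x.mean()                        # ML estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # ML estimate of the variance (1/n, not 1/(n-1))

print(mu_hat, sigma2_hat)
```

For the Student-t distribution considered later in the thesis, no such closed form exists, which motivates the iterative algorithms of Section 3.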
2.1.2. Conditional Probabilities, Distribution and Expectation
We give a short introduction to conditional expectations, probabilities and distributions based on [20, Chapter 8] and [5, Chapter IV]. Let (Ω, A,P ) be a probability space.
Definition 2.4 (Conditional expectation). Let X ∈ L^1(Ω, A, P) and let G ⊆ A be a σ-algebra. We call a G-measurable random variable Z: Ω → R with the property that

    ∫_A X dP = ∫_A Z dP  for all A ∈ G

(a version of) the conditional expectation of X given G and we denote Z = E(X | G). If X = 1_A for some A ∈ A, then we denote E(1_A | G) = P(A | G) and call P(A | G) (a version of) the conditional probability of A given G.
Theorem 2.5 (Existence and uniqueness of the conditional expectation). Let X ∈ L^1(Ω, A, P) and let G ⊆ A be a σ-algebra. Then the conditional expectation E(X | G) exists and is unique P|_G-almost surely.

Proof. See [5, Theorem 15.1].
Theorem 2.6 (Properties of the conditional expectation). Let X, Y ∈ L^1(Ω, A, P) and let F ⊆ G ⊆ A be σ-algebras. Then the following holds true:

(i) (Linearity) E(λX + Y | G) = λ E(X | G) + E(Y | G).

(ii) (Monotonicity) If X ≥ Y almost surely, then E(X | G) ≥ E(Y | G) almost surely.

(iii) If E(|XY|) < ∞ and Y is G-measurable, then it holds

    E(XY | G) = Y E(X | G)  and  E(Y | G) = Y.

(iv) (Tower property) E(E(X | G) | F) = E(E(X | F) | G) = E(X | F).

(v) (Independence) If σ(X) and G are independent, then E(X | G) = E(X).

(vi) (Dominated convergence) Assume that Y ∈ L^1(Ω, A, P), Y ≥ 0 almost surely, and that (X_n)_{n∈N} is a sequence of random variables with |X_n| ≤ Y for all n ∈ N such that X_n → X almost surely as n → ∞. Then it holds

    lim_{n→∞} E(X_n | G) = E(X | G)  almost surely and in L^1(Ω, G, P).

Proof. See [20, Theorem 8.14].
Now let (Ω′, A′) be a measurable space and let Y: Ω → Ω′ be a random element. Then we denote by

    E(X | Y) = E(X | σ(Y))

the conditional expectation of X given Y.
Theorem 2.7 (Factorization lemma). Let Ω_1 be a set and (Ω_2, A_2) a measurable space. Further, let X: Ω_1 → Ω_2 be a mapping. Then for every Y: Ω_1 → R the following are equivalent:

(i) Y is σ(X)-B-measurable, where B is the Borel σ-algebra.

(ii) There exists an A_2-B-measurable mapping g: Ω_2 → R such that Y = g ∘ X.

Proof. See [20, Corollary 1.93].
Thus there exists an A′-measurable mapping g: Ω′ → R such that

    E(X | Y) = g ∘ Y.

This mapping is unique P_Y-almost surely. We define the conditional expectation of X given Y = y by

    E(X | Y = y) = g(y)

with g from above. For A ∈ A and X = 1_A we define the conditional probability of A given Y = y by

    P(A | Y = y) = E(1_A | Y = y).

Remark 2.8. Note that we now obtain the following well-known formulas:

(i) Let A, B ∈ A with P(B) > 0. Then it holds

    ∫_B E(1_A | 1_B) dP = ∫_B 1_A dP = P(A ∩ B).

Since E(1_A | 1_B) is constant on B, we have that for ω ∈ B the conditional probability of A given B reads as

    P(A | B) := P(A | 1_B = 1) = E(1_A | 1_B)(ω) = P(A ∩ B) / P(B).
(ii) If X: Ω → R^{d_1} and Y: Ω → R^{d_2} are discrete random vectors and y ∈ R^{d_2} with P(Y = y) > 0, we get directly from (i) that for all x ∈ R^{d_1} the conditional distribution of X given Y = y reads as

    P_{(X|Y=y)}(x) := P({x} | Y = y) = P(X = x, Y = y) / P(Y = y).
(iii) Let X: Ω → R^{d_1} and Y: Ω → R^{d_2} be random vectors such that the density functions f_X, f_Y and f_{X,Y} exist. Then it holds for all Borel measurable sets A ⊆ R^{d_1} and B ⊆ R^{d_2} that

    ∫_B ∫_A f_{X,Y}(x, y) dx dy = P({X ∈ A} ∩ {Y ∈ B})
                                = ∫_B P(X ∈ A | Y = y) dP_Y(y)
                                = ∫_B P(X ∈ A | Y = y) f_Y(y) dy.

Thus it holds

    ∫_B ( P(X ∈ A | Y = y) f_Y(y) − ∫_A f_{X,Y}(x, y) dx ) dy = 0.

Since this holds for all Borel measurable B, we get almost surely

    P(X ∈ A | Y = y) f_Y(y) = ∫_A f_{X,Y}(x, y) dx.

Thus we get for y ∈ R^{d_2} with f_Y(y) > 0 that for all Borel measurable sets A it holds that

    P(X ∈ A | Y = y) = ∫_A f_{X,Y}(x, y) / f_Y(y) dx.

Therefore the conditional distribution P(X ∈ · | Y = y) of X given Y = y is a probability measure on R^{d_1} with density

    f_{(X|Y=y)}(x) = f_{X,Y}(x, y) / f_Y(y).
(iv) Let X: Ω → R^{d_1} and Y: Ω → R^{d_2} either both be random vectors with densities or both be discrete random vectors. Then we get from (ii) and (iii) directly Bayes' formula for y ∈ R^{d_2} with p_Y(y) > 0, i.e.

    p_{(X|Y=y)}(x) = p_{(Y|X=x)}(y) p_X(x) / p_Y(y),

where p_X (and p_Y analogously) is defined as

    p_X(x) = f_X(x), if X and Y have densities,
    p_X(x) = P(X = x), if X and Y are discrete,

and where p_{(X|Y=y)} (and p_{(Y|X=x)} analogously) is defined as

    p_{(X|Y=y)}(x) = f_{(X|Y=y)}(x), if X and Y have densities,
    p_{(X|Y=y)}(x) = P_{(X|Y=y)}(x), if X and Y are discrete.
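The discrete case of Bayes' formula can be illustrated by a small numerical sketch (all probabilities below are hypothetical):

```python
# Hypothetical two-state example of the discrete Bayes formula
#   p(X = x | Y = y) = p(Y = y | X = x) p(X = x) / p(Y = y).
p_x = {0: 0.99, 1: 0.01}               # prior P(X = x)
p_y_given_x = {0: {0: 0.95, 1: 0.05},  # conditional P(Y = y | X = x)
               1: {0: 0.10, 1: 0.90}}

y = 1
# Marginal P(Y = y) via the law of total probability
p_y = sum(p_y_given_x[x][y] * p_x[x] for x in p_x)
# Posterior P(X = x | Y = y)
posterior = {x: p_y_given_x[x][y] * p_x[x] / p_y for x in p_x}

print(posterior)
```

Even though P(Y = 1 | X = 1) is large, the small prior P(X = 1) keeps the posterior probability of X = 1 well below 1/2, which is exactly the weighting that Bayes' formula encodes.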
(v) Let X: Ω → R^{d_1} and Y: Ω → R^{d_2} be random vectors with densities f_X and f_Y and let h: R^{d_1} → R be measurable such that h ∘ X ∈ L^1(Ω, A, P). Then it holds for every Borel measurable set A ⊆ R^{d_2} that

    ∫_A E(h(X) | Y = y) f_Y(y) dy = ∫_A E(h(X) | Y = y) dP_Y(y) = ∫_{{Y∈A}} h(X) dP
                                  = ∫_A ∫_{R^{d_1}} h(x) f_{X,Y}(x, y) dx dy.

Now it follows that

    ∫_A ( E(h(X) | Y = y) f_Y(y) − ∫_{R^{d_1}} h(x) f_{X,Y}(x, y) dx ) dy = 0.

Therefore it holds for P_Y-almost every y ∈ R^{d_2} with f_Y(y) > 0 that

    E(h(X) | Y = y) = ∫_{R^{d_1}} h(x) f_{X,Y}(x, y) / f_Y(y) dx = ∫_{R^{d_1}} h(x) f_{(X|Y=y)}(x) dx.
Theorem 2.9 (Conditional expectation as projection). Let X ∈ L^2(Ω, A, P) and let G ⊆ A be a σ-algebra. Then E(X | G) is the orthogonal projection of X onto L^2(Ω, G, P). That is, for any G-measurable random variable Y ∈ L^2(Ω, G, P) it holds

    ∫_Ω (X − Y)² dP ≥ ∫_Ω (X − E(X | G))² dP.

Proof. See [20, Corollary 8.16].
Theorem 2.10 (Optimal prediction). Let X ∈ L^2(Ω, A, P) and let Y: Ω → Ω′ be a random element. Then it holds for every A′-measurable mapping φ: Ω′ → R that

    ∫_Ω (X − E(X | Y))² dP ≤ ∫_Ω (X − φ ∘ Y)² dP,

with equality if and only if φ = E(X | Y = ·) P_Y-almost surely.

Proof. Combine Theorem 2.7 and Theorem 2.9.
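The optimal prediction property can be checked by a Monte Carlo sketch (the model below is a hypothetical choice for which E(X | Y) = Y² is known explicitly):

```python
import numpy as np

# Hypothetical model: Y ~ N(0, 1), X = Y**2 + eps with eps ~ N(0, 1)
# independent of Y, so that E(X | Y) = Y**2.
rng = np.random.default_rng(2)
n = 200_000
y = rng.normal(size=n)
x = y ** 2 + rng.normal(size=n)

# Mean squared error of the optimal predictor phi(Y) = E(X | Y) = Y^2 ...
mse_optimal = np.mean((x - y ** 2) ** 2)
# ... versus the constant predictor phi(Y) = E(X) = 1
mse_constant = np.mean((x - 1.0) ** 2)

print(mse_optimal, mse_constant)
```

The optimal predictor attains a mean squared error close to Var(eps) = 1, while any other function of Y, such as the constant predictor, incurs a strictly larger error.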
2.1.3. Maximum a Posteriori Estimator
An alternative to the maximum likelihood estimator is the maximum a posteriori (MAP) estimator. Let (P_ϑ)_{ϑ∈Θ} be a parametric distribution family, where P_ϑ is given by the density function p_ϑ. For the MAP we assume that we are given a prior distribution P with density function p on Θ. Instead of maximizing the likelihood of the observations x = (x_1, ..., x_n), we maximize the posterior distribution on Θ. Using Bayes' formula, this reads as

    p(ϑ | x) ∝ p_ϑ(x) p(ϑ).

Now the MAP estimator is defined by

    ϑ_MAP ∈ argmax_{ϑ∈Θ} p(ϑ | x).

Using the above considerations, we get that this is equivalent to

    ϑ_MAP ∈ argmin_{ϑ∈Θ} { −log(p_ϑ(x)) − log(p(ϑ)) }.
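A minimal sketch of the MAP estimator for a hypothetical conjugate model, where the minimizer of the negative log-posterior is available in closed form:

```python
import numpy as np

# Hypothetical conjugate model: x_i ~ N(theta, sigma^2) with sigma known,
# and Gaussian prior theta ~ N(m0, t0^2). The negative log-posterior
#   -log p_theta(x) - log p(theta)
# is quadratic in theta, so the MAP estimate has a closed form
# (a precision-weighted average of the data and the prior mean).
rng = np.random.default_rng(3)
sigma, m0, t0 = 1.0, 0.0, 0.5
x = rng.normal(loc=2.0, scale=sigma, size=50)
n = len(x)

theta_map = (x.sum() / sigma**2 + m0 / t0**2) / (n / sigma**2 + 1.0 / t0**2)
theta_ml = x.mean()  # ML estimate for comparison (ignores the prior)

print(theta_map, theta_ml)
```

The prior centered at m0 = 0 pulls the MAP estimate away from the ML estimate toward the prior mean; as n grows, the data term dominates and the two estimators coincide in the limit.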
2.1.4. The Kullback-Leibler Divergence
Let f: R^d → R and g: R^d → R be two probability density functions such that for all x ∈ R^d with g(x) = 0 it holds that f(x) = 0. Then we define the Kullback-Leibler divergence by

    KL(f | g) = ∫_{R^d} f(x) log( f(x) / g(x) ) dx.
Lemma 2.11. The Kullback-Leibler divergence fulfills

    KL(f | g) ≥ 0,

with equality if and only if f = g almost everywhere.

Proof. Since log(x) ≤ x − 1 for x > 0, it holds

    ∫_{R^d} f(x) log( g(x) / f(x) ) dx ≤ ∫_{R^d} f(x) ( g(x) / f(x) − 1 ) dx
                                       = ∫_{R^d} g(x) dx − ∫_{R^d} f(x) dx = 0,

and hence KL(f | g) = −∫_{R^d} f(x) log( g(x) / f(x) ) dx ≥ 0.
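Numerically, the nonnegativity of the Kullback-Leibler divergence can be illustrated for discrete distributions, where the integral becomes a sum (the distributions below are hypothetical):

```python
import numpy as np

# Two hypothetical discrete distributions on three states
f = np.array([0.2, 0.5, 0.3])
g = np.array([0.4, 0.4, 0.2])

kl_fg = np.sum(f * np.log(f / g))  # KL(f | g) > 0 since f differs from g
kl_ff = np.sum(f * np.log(f / f))  # KL(f | f) = 0, the equality case of Lemma 2.11

print(kl_fg, kl_ff)
```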