Superresolution via Student-t Mixture Models

Master Thesis – improved version –

TU Kaiserslautern Department of Mathematics

Johannes Hertrich

supervised by Prof. Dr. Gabriele Steidl

Kaiserslautern, submitted on February 4, 2020

Contents

1. Introduction
2. Preliminaries
   2.1. Definitions and Notations
      2.1.1. Random Variables and Estimators
      2.1.2. Conditional Probabilities, Distribution and Expectation
      2.1.3. Maximum a Posteriori Estimator
      2.1.4. The Kullback-Leibler Divergence
   2.2. The EM Algorithm
   2.3. The PALM and iPALM Algorithms
      2.3.1. Proximal Alternating Linearized Minimization (PALM)
      2.3.2. Inertial Proximal Alternating Linearized Minimization (iPALM)
3. Alternatives of the EM Algorithm for Estimating the Parameters of the Student-t Distribution
   3.1. Likelihood of the Multivariate Student-t Distribution
   3.2. Existence of Critical Points
   3.3. Zeros of F
   3.4. Algorithms
   3.5. Numerical Results
      3.5.1. Comparison of Algorithms
      3.5.2. Unsupervised Estimation of Noise Parameters
4. Superresolution via Student-t Mixture Models
   4.1. Estimating the Parameters
      4.1.1. Initialization
      4.1.2. Simulation Study
   4.2. Superresolution
      4.2.1. Expected Patch Log-Likelihood for Student-t Mixture Models
      4.2.2. Joint Student-t Mixture Models
   4.3. Numerical Results
      4.3.1. Comparison to Gaussian Mixture Models
      4.3.2. FIB-SEM Images
5. Conclusion and Future Work
A. Examples for the EM Algorithm
   A.1. EM Algorithm for Student-t Distributions
   A.2. EM Algorithm for Mixture Models
   A.3. EM Algorithm for Student-t Mixture Models
B. Auxiliary Lemmas
C. Derivatives of the Negative Log-Likelihood Function for Student-t Mixture Models

1. Introduction

Superresolution is the process of reconstructing a high-resolution image from a low-resolution image. There exist several approaches that use Gaussian mixture models for superresolution (see e.g. [35, 45]). In this thesis, we extend this to Student-t mixture models and focus on the estimation of the parameters of Student-t distributions and Student-t mixture models. For this purpose, we first consider numerical algorithms to compute the maximum likelihood estimator of the parameters of a multivariate Student-t distribution and propose three alternatives to the classical Expectation Maximization (EM) algorithm. Then, we extend our considerations to Student-t mixture models and finally, we apply our algorithms to some numerical examples.

The thesis is organized as follows: in Section 2 we review preliminary results. Further, we introduce the EM algorithm, the Proximal Alternating Linearized Minimization (PALM) as well as the inertial Proximal Alternating Linearized Minimization (iPALM) in their general forms and cite the corresponding convergence results.

Then, in Section 3, we consider maximum likelihood estimation of the parameters of a multivariate Student-t distribution. This section (including Appendix B) is already contained in the arXiv preprint [16] and submitted for a journal publication. In Section 3.1, we introduce the Student-t distribution, the negative log-likelihood function L and their derivatives. In Section 3.2, we provide some results concerning the existence of minimizers of L. Section 3.3 deals with the solution of the equation arising when setting the gradient of L with respect to ν to zero. The results of this section will be important for the convergence considerations of our algorithms in Section 3.4, where we propose three alternatives to the classical EM algorithm. For fixed degree of freedom ν the first alternative is known from the literature as the accelerated EM algorithm. It was considered e.g. in [19, 28, 40]. In our case, since we do not fix ν, it cannot be interpreted as an EM algorithm. The other two alternatives differ from this one in the ν step of the iteration. We show that the objective function L decreases in each iteration step and provide a simulation study to compare the performance of these algorithms. Finally, we provide two kinds of numerical results in Section 3.5. First, we compare the different algorithms by numerical examples which indicate that the new ν iterations are very efficient for estimating ν of different magnitudes. Second, we come back to the original motivation of this part and estimate the degree of freedom parameter ν from images corrupted by one-dimensional Student-t noise.

In Section 4, we consider superresolution via Student-t mixture models. Section 4.1 deals with the parameter estimation of Student-t mixture models. We propose three alternatives to the EM algorithm. The first alternative differs from the EM algorithm in the update of the Σ and the ν step. The second and third algorithms are the PALM and iPALM algorithms as proposed in [6] and [34], applied to the negative log-likelihood function L of the Student-t mixture model; they were so far not used in connection with mixture models. We describe some heuristics to initialize the algorithms and to set the parameters in the PALM and iPALM algorithms. Further, we compare the algorithms by a simulation study. In Section 4.2, we adapt two methods for superresolution with Student-t mixture models, which were originally proposed in [35] and [45] for Gaussian mixture models. Finally, in Section 4.3, we compare our methods with Gaussian mixture models and apply them to images generated by Focused Ion Beam and Scanning Electron Microscopy (FIB-SEM).

Acknowledgement

We would like to thank Professor Thomas Pock from TU Graz for the fruitful discussions on the usage of PALM and iPALM in Section 4.1. Further, we thank the group of Dominique Bernard from the ICMCB material science lab at the University of Bordeaux for generating the FIB-SEM images within the ANR-DFG project "SUPREMATIM", which we used in Section 4.3.

2. Preliminaries

2.1. Definitions and Notations

2.1.1. Random Variables and Estimators

Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $(\Omega', \mathcal{A}')$ be a measurable space. We call a measurable mapping $X\colon \Omega \to \Omega'$ a random element. If $\Omega' = \mathbb{R}^d$ and $\mathcal{A}' = \mathcal{B}$, where $\mathcal{B}$ denotes the Borel $\sigma$-algebra, we call $X$ a random vector. We say that $X$ is a random variable if $d = 1$. For a random element $X\colon \Omega \to \Omega'$ we call the probability measure $P_X\colon \mathcal{A}' \to [0, 1]$ defined by
$$P_X(A) = P(X^{-1}(A)), \quad A \in \mathcal{A}',$$
the image measure or distribution of $X$.

Definition 2.1 (Mean, Variance, Standard deviation). Let $X\colon \Omega \to \mathbb{R}$ be a random variable. We define the mean of $X$ by
$$E(X) = E_P(X) = \int_\Omega X(\omega)\, \mathrm{d}P(\omega) = \int_{\mathbb{R}} x\, \mathrm{d}P_X(x).$$

For $1 \le p \le \infty$ we denote by $L^p(\Omega, \mathcal{A}, P)$ the Banach space of (equivalence classes of) random variables with $E(|X|^p) < \infty$. Note that $L^q \subset L^p$ for $1 \le p < q \le \infty$, since $P$ is a finite measure. Further, for $X \in L^2(\Omega, \mathcal{A}, P)$ we denote the variance of $X$ by
$$\operatorname{Var}(X) = E_P\big((X - E_P(X))^2\big)$$
and call $\sqrt{\operatorname{Var}(X)}$ the standard deviation of $X$. For $X, Y \in L^2(\Omega, \mathcal{A}, P)$ we call
$$\operatorname{Cov}(X, Y) = E\big((X - E(X))(Y - E(Y))\big)$$
the covariance of $X$ and $Y$. For a random vector $X = (X_1, \dots, X_d)^{\mathrm{T}}\colon \Omega \to \mathbb{R}^d$ with $X_i \in L^1(\Omega, \mathcal{A}, P)$ for all $i = 1, \dots, d$ we use the notation
$$E(X) = (E(X_1), \dots, E(X_d))^{\mathrm{T}}$$
for the mean. Further, if $X_i \in L^2(\Omega, \mathcal{A}, P)$ for all $i = 1, \dots, d$, we call
$$\operatorname{Cov}(X) = (\operatorname{Cov}(X_i, X_j))_{i,j=1}^d$$
the covariance matrix of $X$.

Definition 2.2 (Probability densities). Let $X\colon \Omega \to \mathbb{R}^d$ be a random vector. If there exists some function $f_X\colon \mathbb{R}^d \to \mathbb{R}_{\ge 0}$ with
$$P(\{\omega \in \Omega : X(\omega) \in A\}) = P_X(A) = \int_A f_X(x)\, \mathrm{d}x, \quad A \in \mathcal{B},$$
then we call $f_X$ the probability density function of $X$.

Now, let $(\Omega, \mathcal{A})$ be a measurable space and let $\Theta \subseteq \mathbb{R}^d$. We call a family of probability measures $(P_\vartheta)_{\vartheta \in \Theta}$ a parametric distribution family. Given some independent identically distributed samples $x_1, \dots, x_n$ of a random vector $X\colon \Omega \to \mathbb{R}^{d_1}$ defined on the probability space $(\Omega, \mathcal{A}, P_\vartheta)$, we want to recover the parameter $\vartheta$ of the underlying measure.

Definition 2.3 (Estimators). A measurable mapping $T\colon \mathbb{R}^{d_1 \times n} \to \Theta$ is called an estimator of $\vartheta$.

A common choice for an estimator is the maximum likelihood (ML) estimator. Assume that $X$ is a random vector with a probability density function or that $X$ is a discrete random vector. Then we define the likelihood function $\mathcal{L}\colon \Theta \to \mathbb{R} \cup \{\infty\}$ by
$$\mathcal{L}(\vartheta \mid x_1, \dots, x_n) = \prod_{i=1}^n p(x_i), \quad \text{where} \quad p(x) = \begin{cases} f_X(x), & \text{if } X \text{ has a density}, \\ P_X(x), & \text{if } X \text{ is a discrete random vector}. \end{cases}$$
Now we define the maximum likelihood estimator by
$$\hat{\vartheta} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \mathcal{L}(\vartheta \mid x_1, \dots, x_n).$$
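As a concrete illustration of the ML estimator (our own sketch, not part of the thesis), consider a univariate normal family, where the maximizer of $\mathcal{L}$ is known in closed form. The sketch works with the log-likelihood, since maximizing $\mathcal{L}$ and $\log \mathcal{L}$ is equivalent; all names are illustrative.

```python
import numpy as np

# Illustrative sketch: ML estimation for a univariate normal N(mu, sigma^2).
# The likelihood of i.i.d. samples is the product of the densities; we use
# its logarithm for numerical stability.

def log_likelihood(x, mu, sigma):
    """log L(theta | x_1, ..., x_n) for theta = (mu, sigma)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

# For the normal family the ML estimator is known in closed form:
mu_hat, sigma_hat = x.mean(), x.std()

# Any other parameter choice attains a smaller (log-)likelihood:
assert log_likelihood(x, mu_hat, sigma_hat) >= log_likelihood(x, 1.0, 1.0)
```

For most models considered in this thesis no such closed form exists, which is why iterative schemes such as the EM algorithm of Section 2.2 are needed.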

2.1.2. Conditional Probabilities, Distribution and Expectation

We give a short introduction to conditional expectations, probabilities and distributions based on [20, Chapter 8] and [5, Chapter IV]. Let (Ω, A,P ) be a probability space.

Definition 2.4 (Conditional expectation). Let $X \in L^1(\Omega, \mathcal{A}, P)$ and let $\mathcal{G} \subseteq \mathcal{A}$ be a $\sigma$-algebra. We call a $\mathcal{G}$-measurable random variable $Z\colon \Omega \to \mathbb{R}$ with the property that
$$\int_A X\, \mathrm{d}P = \int_A Z\, \mathrm{d}P \quad \text{for all } A \in \mathcal{G}$$
(a version of) the conditional expectation of $X$ given $\mathcal{G}$, and we denote $Z = E(X \mid \mathcal{G})$. If $X = 1_A$ for some $A \in \mathcal{A}$, then we denote $E(1_A \mid \mathcal{G}) = P(A \mid \mathcal{G})$ and call $P(A \mid \mathcal{G})$ (a version of) the conditional probability of $A$ given $\mathcal{G}$.

Theorem 2.5 (Existence and uniqueness of the conditional expectation). Let $X \in L^1(\Omega, \mathcal{A}, P)$ and let $\mathcal{G} \subseteq \mathcal{A}$ be a $\sigma$-algebra. Then the conditional expectation $E(X \mid \mathcal{G})$ exists and is unique $P|_{\mathcal{G}}$-almost surely.

Proof. See [5, Theorem 15.1].

Theorem 2.6 (Properties of the conditional expectation). Let $X, Y \in L^1(\Omega, \mathcal{A}, P)$ and let $\mathcal{F} \subseteq \mathcal{G} \subseteq \mathcal{A}$ be $\sigma$-algebras. Then the following holds true:

(i) (Linearity) $E(\lambda X + Y \mid \mathcal{G}) = \lambda E(X \mid \mathcal{G}) + E(Y \mid \mathcal{G})$.

(ii) (Monotonicity) If $X \ge Y$ almost surely, then $E(X \mid \mathcal{G}) \ge E(Y \mid \mathcal{G})$ almost surely.

(iii) If $E(|XY|) < \infty$ and $Y$ is $\mathcal{G}$-measurable, then it holds
$$E(XY \mid \mathcal{G}) = Y\, E(X \mid \mathcal{G}) \quad \text{and} \quad E(Y \mid \mathcal{G}) = Y.$$

(iv) (Tower property) $E(E(X \mid \mathcal{G}) \mid \mathcal{F}) = E(E(X \mid \mathcal{F}) \mid \mathcal{G}) = E(X \mid \mathcal{F})$.

(v) (Independence) If $\sigma(X)$ and $\mathcal{G}$ are independent, then $E(X \mid \mathcal{G}) = E(X)$.

(vi) (Dominated convergence) Assume that $Y \in L^1(\Omega, \mathcal{A}, P)$, $Y \ge 0$ almost surely, and that $(X_n)_n$ is a sequence of random variables with $|X_n| \le Y$ for $n \in \mathbb{N}$ such that $X_n \to X$ almost surely as $n \to \infty$. Then it holds
$$\lim_{n \to \infty} E(X_n \mid \mathcal{G}) = E(X \mid \mathcal{G}) \quad \text{almost surely and in } L^1(\Omega, \mathcal{G}, P).$$

Proof. See [20, Theorem 8.14].

Now let $(\Omega', \mathcal{A}')$ be a measurable space and let $Y\colon \Omega \to \Omega'$ be a random element. Then we denote by
$$E(X \mid Y) = E(X \mid \sigma(Y))$$
the conditional expectation of $X$ given $Y$.

Theorem 2.7 (Factorization Lemma). Let $\Omega_1$ be a set and $(\Omega_2, \mathcal{A}_2)$ a measurable space. Further, let $X\colon \Omega_1 \to \Omega_2$ be a mapping. Then for every $Y\colon \Omega_1 \to \mathbb{R}$ the following are equivalent:

(i) $Y$ is $\sigma(X)$-$\mathcal{B}$-measurable, where $\mathcal{B}$ is the Borel $\sigma$-algebra.

(ii) There exists an $\mathcal{A}_2$-$\mathcal{B}$-measurable mapping $g\colon \Omega_2 \to \mathbb{R}$ such that $Y = g \circ X$.

Proof. See [20, Corollary 1.93].

Thus there exists an $\mathcal{A}'$-measurable mapping $g\colon \Omega' \to \mathbb{R}$ such that
$$E(X \mid Y) = g \circ Y.$$
This mapping is unique $P_Y$-almost surely. We define the conditional expectation of $X$ given $Y = y$ by
$$E(X \mid Y = y) = g(y)$$
with $g$ from above. For $A \in \mathcal{A}$ and $X = 1_A$ we define the conditional probability of $A$ given $Y = y$ by
$$P(A \mid Y = y) = E(1_A \mid Y = y).$$

Remark 2.8. Note that we now obtain the following well-known formulas:

(i) Let $A, B \in \mathcal{A}$ with $P(B) > 0$. Then it holds
$$\int_B E(1_A \mid 1_B)\, \mathrm{d}P = \int_B 1_A\, \mathrm{d}P = P(A \cap B).$$
Since $E(1_A \mid 1_B)$ is constant on $B$, we have that for $\omega \in B$ the conditional probability of $A$ given $B$ reads as
$$P(A \mid B) := P(A \mid 1_B = 1) = E(1_A \mid 1_B)(\omega) = \frac{P(A \cap B)}{P(B)}.$$

(ii) If $X\colon \Omega \to \mathbb{R}^{d_1}$ and $Y\colon \Omega \to \mathbb{R}^{d_2}$ are discrete random vectors and $y \in \mathbb{R}^{d_2}$ with $P(Y = y) > 0$, we get directly from (i) that for all $x \in \mathbb{R}^{d_1}$ the conditional distribution of $X$ given $Y = y$ reads as
$$P_{(X \mid Y = y)}(x) := P(\{x\} \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}.$$

(iii) Let $X\colon \Omega \to \mathbb{R}^{d_1}$ and $Y\colon \Omega \to \mathbb{R}^{d_2}$ be random vectors such that the density functions $f_X$, $f_Y$ and $f_{X,Y}$ exist. Then it holds for all Borel measurable sets $A \subseteq \mathbb{R}^{d_1}$ and $B \subseteq \mathbb{R}^{d_2}$ that
$$\int_B \int_A f_{X,Y}(x, y)\, \mathrm{d}x\, \mathrm{d}y = P(\{X \in A\} \cap \{Y \in B\}) = \int_B P(X \in A \mid Y = y)\, \mathrm{d}P_Y(y) = \int_B P(X \in A \mid Y = y) f_Y(y)\, \mathrm{d}y.$$
Thus it holds
$$\int_B \Big( P(X \in A \mid Y = y) f_Y(y) - \int_A f_{X,Y}(x, y)\, \mathrm{d}x \Big)\, \mathrm{d}y = 0.$$
Since this holds for all Borel measurable $B$, we get almost surely
$$P(X \in A \mid Y = y) f_Y(y) = \int_A f_{X,Y}(x, y)\, \mathrm{d}x.$$
Thus we get for $y \in \mathbb{R}^{d_2}$ with $f_Y(y) > 0$ that for all Borel measurable sets $A$ it holds
$$P(X \in A \mid Y = y) = \int_A \frac{f_{X,Y}(x, y)}{f_Y(y)}\, \mathrm{d}x.$$
Therefore the conditional distribution $P(X \in \cdot \mid Y = y)$ of $X$ given $Y = y$ is a probability measure on $\mathbb{R}^{d_1}$ with density
$$f_{(X \mid Y = y)}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}.$$

(iv) Let $X\colon \Omega \to \mathbb{R}^{d_1}$ and $Y\colon \Omega \to \mathbb{R}^{d_2}$ either both be random vectors with densities or both be discrete random vectors. Then we get from (ii) and (iii) directly Bayes' formula for $y \in \mathbb{R}^{d_2}$ with $p_Y(y) > 0$, i.e.
$$p_{(X \mid Y = y)}(x) = \frac{p_{(Y \mid X = x)}(y)\, p_X(x)}{p_Y(y)},$$
where $p_X$ (and $p_Y$ analogously) is defined as
$$p_X(x) = \begin{cases} f_X(x), & \text{if } X \text{ and } Y \text{ have densities}, \\ P(X = x), & \text{if } X \text{ and } Y \text{ are discrete}, \end{cases}$$
and where $p_{(X \mid Y = y)}$ (and $p_{(Y \mid X = x)}$ analogously) is defined as
$$p_{(X \mid Y = y)}(x) = \begin{cases} f_{(X \mid Y = y)}(x), & \text{if } X \text{ and } Y \text{ have densities}, \\ P_{(X \mid Y = y)}(x), & \text{if } X \text{ and } Y \text{ are discrete}. \end{cases}$$

(v) Let $X\colon \Omega \to \mathbb{R}^{d_1}$ and $Y\colon \Omega \to \mathbb{R}^{d_2}$ be random vectors with densities $f_X$ and $f_Y$, and let $h\colon \mathbb{R}^{d_1} \to \mathbb{R}$ be measurable such that $h \circ X \in L^1(\Omega, \mathcal{A}, P)$. Then it holds for every Borel measurable set $A \subseteq \mathbb{R}^{d_2}$ that
$$\int_A E(h(X) \mid Y = y) f_Y(y)\, \mathrm{d}y = \int_A E(h(X) \mid Y = y)\, \mathrm{d}P_Y(y) = \int_{\{Y \in A\}} h(X)\, \mathrm{d}P = \int_A \int_{\mathbb{R}^{d_1}} h(x) f_{X,Y}(x, y)\, \mathrm{d}x\, \mathrm{d}y.$$
Now it follows that
$$\int_A \Big( E(h(X) \mid Y = y) f_Y(y) - \int_{\mathbb{R}^{d_1}} h(x) f_{X,Y}(x, y)\, \mathrm{d}x \Big)\, \mathrm{d}y = 0.$$
Therefore it holds for $P_Y$-almost every $y \in \mathbb{R}^{d_2}$ with $f_Y(y) > 0$ that
$$E(h(X) \mid Y = y) = \int_{\mathbb{R}^{d_1}} h(x) \frac{f_{X,Y}(x, y)}{f_Y(y)}\, \mathrm{d}x = \int_{\mathbb{R}^{d_1}} h(x) f_{(X \mid Y = y)}(x)\, \mathrm{d}x.$$

Theorem 2.9 (Conditional expectation as projection). Let $X \in L^2(\Omega, \mathcal{A}, P)$ and let $\mathcal{G} \subseteq \mathcal{A}$ be a $\sigma$-algebra. Then $E(X \mid \mathcal{G})$ is the orthogonal projection of $X$ onto $L^2(\Omega, \mathcal{G}, P)$. That is, for any $\mathcal{G}$-measurable random variable $Y \in L^2(\Omega, \mathcal{G}, P)$ it holds
$$\int_\Omega (X - Y)^2\, \mathrm{d}P \ge \int_\Omega (X - E(X \mid \mathcal{G}))^2\, \mathrm{d}P.$$

Proof. See [20, Corollary 8.16].

Theorem 2.10 (Optimal prediction). Let $X \in L^2(\Omega, \mathcal{A}, P)$ and let $Y\colon \Omega \to \Omega'$ be a random element. Then it holds for every $\mathcal{A}'$-measurable mapping $\varphi\colon \Omega' \to \mathbb{R}$ that
$$\int_\Omega (X - E(X \mid Y))^2\, \mathrm{d}P \le \int_\Omega (X - \varphi \circ Y)^2\, \mathrm{d}P,$$
with equality if and only if $\varphi = E(X \mid Y = \cdot\,)$ $P_Y$-almost surely.

Proof. Combine Theorem 2.7 and Theorem 2.9.

2.1.3. Maximum a Posteriori Estimator

An alternative to the maximum likelihood estimator is the maximum a posteriori (MAP) estimator. Let $(P_\vartheta)_{\vartheta \in \Theta}$ be a parametric distribution family, where $P_\vartheta$ is given by the density function $p_\vartheta$. For the MAP estimator we assume that we are given a prior distribution $P$ with density function $p$ on $\Theta$. Instead of maximizing the likelihood of the observations $x = (x_1, \dots, x_n)$, we maximize the posterior density on $\Theta$. Using Bayes' formula this reads as
$$p(\vartheta \mid x) \propto p_\vartheta(x)\, p(\vartheta).$$
Now the MAP estimator is defined by
$$\vartheta_{\mathrm{MAP}} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, p(\vartheta \mid x).$$
Using the above considerations we get that this is equivalent to
$$\vartheta_{\mathrm{MAP}} \in \operatorname*{argmin}_{\vartheta \in \Theta}\, \{-\log(p_\vartheta(x)) - \log(p(\vartheta))\}.$$
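As a small illustration of the difference between the ML and the MAP estimator (our own toy example, not from the thesis), consider a normal likelihood with a normal prior on the location parameter: the negative log-posterior is then quadratic and its minimizer has the closed form used below. All names and parameter choices are illustrative.

```python
import numpy as np

# Illustrative sketch: MAP estimation of the mean theta of N(theta, sigma^2)
# under the prior theta ~ N(m, s^2). Minimizing
#     -log(p_theta(x)) - log(p(theta))
# is a quadratic problem with the closed-form minimizer below.

sigma, m, s = 1.0, 0.0, 0.5        # noise std, prior mean, prior std
rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=sigma, size=50)
n = x.size

# Closed-form minimizer of the negative log-posterior:
theta_map = (x.sum() / sigma ** 2 + m / s ** 2) / (n / sigma ** 2 + 1 / s ** 2)

def neg_log_posterior(theta):
    return (np.sum((x - theta) ** 2) / (2.0 * sigma ** 2)
            + (theta - m) ** 2 / (2.0 * s ** 2))

# The MAP estimate beats the ML estimate (the sample mean) on this objective;
# the prior pulls it from the sample mean towards the prior mean m = 0:
assert neg_log_posterior(theta_map) <= neg_log_posterior(x.mean())
assert 0.0 < theta_map < x.mean()
```

For a flat (constant) prior $p$, the penalty $-\log(p(\vartheta))$ is constant and the MAP estimator reduces to the ML estimator.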

12 2.1.4. The Kullback-Leibler Divergence

Let $f\colon \mathbb{R}^d \to \mathbb{R}$ and $g\colon \mathbb{R}^d \to \mathbb{R}$ be two probability density functions where for all $x \in \mathbb{R}^d$ with $g(x) = 0$ it holds that $f(x) = 0$. Then we define the Kullback-Leibler divergence by
$$\operatorname{KL}(f \mid g) = \int_{\mathbb{R}^d} f(x) \log\Big(\frac{f(x)}{g(x)}\Big)\, \mathrm{d}x.$$

Lemma 2.11. The Kullback-Leibler divergence fulfills
$$\operatorname{KL}(f \mid g) \ge 0$$
with equality if and only if $f = g$ almost everywhere.

Proof. Since it holds for $x > 0$ that $\log(x) \le x - 1$, we obtain
$$\int_{\mathbb{R}^d} f(x) \log\Big(\frac{g(x)}{f(x)}\Big)\, \mathrm{d}x \le \int_{\mathbb{R}^d} f(x) \Big(\frac{g(x)}{f(x)} - 1\Big)\, \mathrm{d}x = \int_{\mathbb{R}^d} g(x)\, \mathrm{d}x - \int_{\mathbb{R}^d} f(x)\, \mathrm{d}x = 0,$$
so that $\operatorname{KL}(f \mid g) = -\int_{\mathbb{R}^d} f(x) \log\big(\frac{g(x)}{f(x)}\big)\, \mathrm{d}x \ge 0$. Further, we have equality if and only if it holds almost everywhere that $\log\big(\frac{g(x)}{f(x)}\big) = \frac{g(x)}{f(x)} - 1$, which is equivalent to $\frac{g(x)}{f(x)} = 1$.

Remark 2.12. We can formulate the Kullback-Leibler divergence also for discrete probability measures. Let $P$ and $Q$ be two probability measures on a countable set $\Omega$ such that $Q(\{\omega\}) = 0$ implies $P(\{\omega\}) = 0$ for $\omega \in \Omega$. Then we define the Kullback-Leibler divergence by
$$\operatorname{KL}(P \mid Q) = \sum_{\omega \in \Omega} P(\{\omega\}) \log\Big(\frac{P(\{\omega\})}{Q(\{\omega\})}\Big).$$
We can show analogously to Lemma 2.11 that $\operatorname{KL}(P \mid Q) \ge 0$ with equality if and only if $P(\{\omega\}) = Q(\{\omega\})$ for all $\omega \in \Omega$.
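The discrete version of Remark 2.12 is easy to check numerically. The following small sketch (illustrative only) verifies nonnegativity, the equality case, and the fact that the Kullback-Leibler divergence is not symmetric, which is why it is a divergence rather than a metric.

```python
import numpy as np

# Discrete Kullback-Leibler divergence KL(P | Q) = sum_w P(w) log(P(w)/Q(w)),
# with the convention 0 * log(0/q) = 0; assumes Q(w) = 0 implies P(w) = 0.

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])

assert kl(p, q) >= 0.0               # Lemma 2.11 / Remark 2.12
assert kl(p, p) == 0.0               # equality if and only if P = Q
assert abs(kl(p, q) - kl(q, p)) > 0  # KL is not symmetric
```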

2.2. The EM Algorithm

In many cases the maximum likelihood estimator cannot be computed analytically. In those cases the Expectation Maximization (EM) algorithm is often used. In the following we give a short introduction to the EM algorithm and summarize its convergence properties. We follow the lines of [22].

The EM algorithm was first introduced in [11]. Let $x = (x_1, \dots, x_n) = X(\omega) \in \mathbb{R}^{d_1 \times n}$ be a given realization of a random vector $X\colon \Omega \to \mathbb{R}^{d_1 \times n}$ with density function $f_X(x \mid \vartheta)$. The rough idea is to introduce an artificial random vector $Z\colon \Omega \to \mathbb{R}^{d_2}$ and a measurable mapping $h\colon \mathbb{R}^{d_2} \to \mathbb{R}^{d_1 \times n}$ such that $X = h(Z)$. Further, we assume that $Z$ has the density function $f_Z(z \mid \vartheta)$ and that we can compute the maximum likelihood estimator with respect to $Z$ for $z \in \mathbb{R}^{d_2}$. This yields the relation
$$f_X(x \mid \vartheta) = \int_{h^{-1}(\{x\})} f_Z(z \mid \vartheta)\, \mathrm{d}z, \quad \text{where } h^{-1}(\{x\}) = \{z \in \mathbb{R}^{d_2} : h(z) = x\}.$$

Now we want to find the parameter $\vartheta$ which maximizes the likelihood function of $z = Z(\omega)$ for the density function $f_Z(z \mid \vartheta)$. Since $z$ is unknown, the EM algorithm iterates the following two steps:

E-Step: For a fixed estimate $\vartheta^{(r)}$ of $\vartheta$, we estimate the log-likelihood function $\log(f_Z(Z \mid \vartheta))$ by the minimum mean square error (MMSE) estimator, which is given by
$$Q(\vartheta, \vartheta^{(r)}) = E_{P_{\vartheta^{(r)}}}\big(\log(f_Z(Z \mid \vartheta)) \,\big|\, X = x\big).$$
This function is usually called the Q-function. It is the MMSE estimator, since Theorem 2.10 ensures that for any other measurable function $\varphi\colon \mathbb{R}^{d_1 \times n} \to \mathbb{R}$ it holds
$$\int_\Omega \big(\log(f_Z(Z \mid \vartheta)) - \varphi \circ X\big)^2\, \mathrm{d}P \ge \int_\Omega \big(\log(f_Z(Z \mid \vartheta)) - E(\log(f_Z(Z \mid \vartheta)) \mid X = \cdot\,) \circ X\big)^2\, \mathrm{d}P.$$

M-Step: In this step we update the estimate for $\vartheta$ by maximizing the Q-function, which is an estimate of the log-likelihood function:
$$\vartheta^{(r+1)} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \{Q(\vartheta, \vartheta^{(r)})\}.$$

We summarize the EM algorithm in Algorithm 2.1. Note that in many applications of the EM algorithm the artificial random vector $Z$ is of the form $Z = (X, Y)$ for some random vector $Y$. In this case the function $h$ is given by $h(x, y) = x$ and the Q-function reads as
$$Q(\vartheta, \vartheta^{(r)}) = E_{P_{\vartheta^{(r)}}}\big(\log(f_{X,Y}(X, Y \mid \vartheta)) \,\big|\, X = x\big).$$

Algorithm 2.1 EM Algorithm
Input: $x = (x_1, \dots, x_n) \in \mathbb{R}^{d_1 \times n}$, initial estimate $\vartheta^{(0)} \in \Theta$
for $r = 0, 1, \dots$ do
  E-Step: Compute the Q-function
  $$Q(\vartheta, \vartheta^{(r)}) = E_{P_{\vartheta^{(r)}}}\big(\log(f_Z(Z \mid \vartheta)) \,\big|\, X = x\big).$$
  M-Step: Update the estimate for $\vartheta$ by
  $$\vartheta^{(r+1)} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \{Q(\vartheta, \vartheta^{(r)})\}.$$
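To make Algorithm 2.1 concrete, the following sketch runs EM for a model in which both steps are explicit: an equal-weight mixture of two univariate normal distributions with unit variances and unknown means $\vartheta = (\mu_0, \mu_1)$. Here $Z = (X, Y)$ with $Y$ the hidden component label, so the E-step reduces to computing posterior label probabilities. The setup only anticipates the mixture models derived in Appendix A; all names are ours.

```python
import numpy as np

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def em_two_means(x, mu, n_iter=100):
    """EM for the equal-weight mixture 0.5*N(mu[0], 1) + 0.5*N(mu[1], 1)."""
    for _ in range(n_iter):
        # E-step: posterior probability that sample x_i belongs to component 1
        p0, p1 = normal_pdf(x, mu[0]), normal_pdf(x, mu[1])
        gamma = p1 / (p0 + p1)
        # M-step: the maximizer of the Q-function consists of weighted means
        mu = np.array([np.sum((1.0 - gamma) * x) / np.sum(1.0 - gamma),
                       np.sum(gamma * x) / np.sum(gamma)])
    return mu

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
mu = em_two_means(x, mu=np.array([-1.0, 1.0]))
# The estimated means approach the true values (-2, 3).
```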

Convergence Analysis

Now we analyze the convergence properties of the EM algorithm. The convergence proof in the original paper [11] was incorrect. The convergence properties were first proven in [41] based on results of [43]. We present a convergence analysis based on the Kullback-Leibler proximal point algorithm proposed in [8, 9]. We again follow the lines of [22] and stick to the case that $Z = (X, Y)$ and $h(x, y) = x$. Now, we rewrite the EM algorithm as
$$\vartheta^{(r+1)} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ E_{P_{\vartheta^{(r)}}}\big(\log(f_{X,Y}(x, Y \mid \vartheta)) \,\big|\, X = x\big) \Big\} = \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ \log(f_X(x \mid \vartheta)) + E_{P_{\vartheta^{(r)}}}\Big(\log\Big(\frac{f_{X,Y}(x, Y \mid \vartheta)}{f_X(x \mid \vartheta)}\Big) \,\Big|\, X = x\Big) \Big\}.$$

By definition this becomes
$$\vartheta^{(r+1)} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ \log(f_X(x \mid \vartheta)) + E_{P_{\vartheta^{(r)}}}\big(\log\big(f_{(Y \mid X = x)}(Y \mid \vartheta)\big) \,\big|\, X = x\big) \Big\}.$$

By adding some constant (independent of $\vartheta$) in the objective function we get
$$\vartheta^{(r+1)} \in \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ \log(f_X(x \mid \vartheta)) + E_{P_{\vartheta^{(r)}}}\big(\log f_{(Y \mid X = x)}(Y \mid \vartheta) \,\big|\, X = x\big) - E_{P_{\vartheta^{(r)}}}\big(\log f_{(Y \mid X = x)}(Y \mid \vartheta^{(r)}) \,\big|\, X = x\big) \Big\}$$
$$= \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ \log(f_X(x \mid \vartheta)) - E_{P_{\vartheta^{(r)}}}\Big(\log\Big(\frac{f_{(Y \mid X = x)}(Y \mid \vartheta^{(r)})}{f_{(Y \mid X = x)}(Y \mid \vartheta)}\Big) \,\Big|\, X = x\Big) \Big\}$$
$$= \operatorname*{argmax}_{\vartheta \in \Theta}\, \Big\{ \log(f_X(x \mid \vartheta)) - \operatorname{KL}\big(f_{(Y \mid X = x)}(\cdot \mid \vartheta^{(r)}) \,\big|\, f_{(Y \mid X = x)}(\cdot \mid \vartheta)\big) \Big\},$$
with
$$\operatorname{KL}\big(f_{(Y \mid X = x)}(\cdot \mid \vartheta^{(r)}) \,\big|\, f_{(Y \mid X = x)}(\cdot \mid \vartheta)\big) = \int f_{(Y \mid X = x)}(y \mid \vartheta^{(r)}) \log\Big(\frac{f_{(Y \mid X = x)}(y \mid \vartheta^{(r)})}{f_{(Y \mid X = x)}(y \mid \vartheta)}\Big)\, \mathrm{d}y = E_{P_{\vartheta^{(r)}}}\Big(\log\Big(\frac{f_{(Y \mid X = x)}(Y \mid \vartheta^{(r)})}{f_{(Y \mid X = x)}(Y \mid \vartheta)}\Big) \,\Big|\, X = x\Big).$$

This is the Kullback-Leibler divergence between $f_{(Y \mid X = x)}(\cdot \mid \vartheta^{(r)})$ and $f_{(Y \mid X = x)}(\cdot \mid \vartheta)$. By Lemma 2.11 it holds for two such densities $f$ and $g$ that $\operatorname{KL}(f \mid g) \ge 0$, where $\operatorname{KL}(f \mid g) = 0$ if and only if $f = g$ almost everywhere. For simplicity we use the notation
$$D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta) = \operatorname{KL}\big(f_{(Y \mid X = x)}(\cdot \mid \vartheta^{(r)}) \,\big|\, f_{(Y \mid X = x)}(\cdot \mid \vartheta)\big)$$
and denote by
$$L(\vartheta \mid x_1, \dots, x_n) = -\ell(\vartheta \mid x_1, \dots, x_n) = -\log(f_X(x \mid \vartheta))$$
the negative log-likelihood function. Now we can write the EM algorithm as the Kullback-Leibler proximal point algorithm
$$\vartheta^{(r+1)} \in \operatorname*{argmin}_{\vartheta \in \Theta}\, \big\{ D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta) + L(\vartheta \mid x_1, \dots, x_n) \big\}.$$

We summarize this algorithm in Algorithm 2.2.

Algorithm 2.2 EM Algorithm as Kullback-Leibler Proximal Point Algorithm
Input: $x = (x_1, \dots, x_n) \in \mathbb{R}^{d_1 \times n}$, initial estimate $\vartheta^{(0)} \in \Theta$
for $r = 0, 1, \dots$ do
$$\vartheta^{(r+1)} \in \operatorname*{argmin}_{\vartheta \in \Theta}\, \big\{ D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta) + L(\vartheta \mid x_1, \dots, x_n) \big\} = \operatorname*{argmin}_{\vartheta \in \Theta}\, \Big\{ E_{P_{\vartheta^{(r)}}}\Big(\log\Big(\frac{f_{(Y \mid X = x)}(Y \mid \vartheta^{(r)})}{f_{(Y \mid X = x)}(Y \mid \vartheta)}\Big) \,\Big|\, X = x\Big) - \log(f_X(x \mid \vartheta)) \Big\}.$$

Now assume that the minimum in Algorithm 2.2 is attained in each step and let $(\vartheta^{(r)})_{r \in \mathbb{N}}$ be generated by Algorithm 2.2. Then we can prove that the negative log-likelihood function $L(\vartheta^{(r)} \mid x_1, \dots, x_n)$ is monotonically decreasing.

Proposition 2.13. Let $(P_\vartheta)_{\vartheta \in \Theta}$ be a parametric distribution family. If the minimum in Algorithm 2.2 is attained in each step, then the sequence $(L(\vartheta^{(r)} \mid x))_{r \in \mathbb{N}}$ is monotonically decreasing and satisfies
$$L(\vartheta^{(r)} \mid x) - L(\vartheta^{(r+1)} \mid x) \ge D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r+1)}).$$

Proof. By the definition of $\vartheta^{(r+1)}$ we get
$$D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r+1)}) + L(\vartheta^{(r+1)} \mid x) \le D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r)}) + L(\vartheta^{(r)} \mid x).$$
Now the claim follows from the facts that $D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r)}) = 0$ and $D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r+1)}) \ge 0$.

To show the convergence of some subsequence of $(\vartheta^{(r)})_r$ we need some stronger assumptions.

Assumption 2.14. Let $(P_\vartheta)_{\vartheta \in \Theta}$ be a parametric distribution family. Further, let $L(\vartheta \mid x)$ be the negative log-likelihood function and let $D_{\operatorname{KL}}(\vartheta, \tilde{\vartheta})$ be the Kullback-Leibler distance. Define the following assumptions:

(A1) $L(\cdot \mid x) \in C^2(\mathbb{R}^d)$ and $D_{\operatorname{KL}} \in C^2(\mathbb{R}^d \times \mathbb{R}^d)$.

(A2) $L(\cdot \mid x)$ is coercive, i.e. $\lim_{\|\vartheta\| \to \infty} L(\vartheta \mid x) = \infty$.

(A3) $D_{\operatorname{KL}}(\tilde{\vartheta}, \vartheta) < \infty$ and $\nabla_\vartheta^2 D_{\operatorname{KL}}(\tilde{\vartheta}, \vartheta)$ is positive definite on every bounded $\vartheta$-set for every $\tilde{\vartheta} \in \mathbb{R}^d$.

(A4) $L(\vartheta \mid x) < \infty$ and $\nabla^2 L(\vartheta \mid x)$ is positive definite on every bounded $\vartheta$-set.

Theorem 2.15 (Convergence of the EM algorithm). Let $x_1, \dots, x_n$ be independent identically distributed samples of a parametric distribution family $(P_\vartheta)_{\vartheta \in \Theta}$. Further, let $L(\vartheta \mid x)$ be the negative log-likelihood function and $D_{\operatorname{KL}}(\vartheta, \tilde{\vartheta})$ the Kullback-Leibler distance. Then the following holds:

(i) If (A1) is fulfilled, the fixed points of Algorithm 2.2 are critical points of the negative log-likelihood function $L$. Further, they are minimizers if additionally (A4) is fulfilled.

(ii) If (A2) is fulfilled, the sequence $(\vartheta^{(r)})_r$ is bounded.

(iii) If (A1), (A2) and (A3) are fulfilled, then $\lim_{r \to \infty} \|\vartheta^{(r+1)} - \vartheta^{(r)}\| = 0$.

(iv) If (A1), (A2) and (A3) are fulfilled, then there exists a subsequence $(\vartheta^{(r_k)})_k$ that converges to a critical point of $L$. If additionally (A4) is fulfilled, the whole sequence $(\vartheta^{(r)})_r$ converges to a minimizer of $L$.

Proof. (i) For any fixed point $\vartheta^*$ of Algorithm 2.2 it holds
$$\vartheta^* \in \operatorname*{argmin}_{\vartheta \in \Theta}\, \{ D_{\operatorname{KL}}(\vartheta^*, \vartheta) + L(\vartheta \mid x) \}.$$
Since $L$ and $D_{\operatorname{KL}}$ are smooth, (A1) yields that
$$\nabla_\vartheta D_{\operatorname{KL}}(\vartheta^*, \vartheta^*) + \nabla L(\vartheta^* \mid x) = 0.$$
Since $(\vartheta^*, \vartheta^*)$ is a global minimum of $D_{\operatorname{KL}}$, this implies that
$$\nabla L(\vartheta^* \mid x) = 0.$$

(ii) Since $(L(\vartheta^{(r)} \mid x))_r$ is monotonically decreasing by Proposition 2.13, the coercivity of $L$ implies that $(\vartheta^{(r)})_r$ is bounded.

(iii) Since $(L(\vartheta^{(r)} \mid x))_r$ is monotonically decreasing and bounded from below, it converges and it holds
$$\lim_{r \to \infty} L(\vartheta^{(r)} \mid x) - L(\vartheta^{(r+1)} \mid x) = 0.$$
Now Proposition 2.13 yields that
$$\lim_{r \to \infty} D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r+1)}) = 0.$$
Choose $\lambda > 0$ such that every eigenvalue of $\nabla_\vartheta^2 D_{\operatorname{KL}}(\vartheta^{(r)}, \cdot)$ on the convex hull of $\{\vartheta^{(r)} : r \in \mathbb{N}\}$ is greater than or equal to $\lambda$, which is possible since $\nabla^2 D_{\operatorname{KL}}$ is positive definite by (A3) and continuous by (A1). Further, Taylor's theorem yields that there exists some $\eta_r \in \{t\vartheta^{(r+1)} + (1 - t)\vartheta^{(r)} : t \in [0, 1]\}$ such that
$$D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r+1)}) = D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r)}) + \nabla_\vartheta D_{\operatorname{KL}}(\vartheta^{(r)}, \vartheta^{(r)})^{\mathrm{T}} (\vartheta^{(r+1)} - \vartheta^{(r)}) + \tfrac{1}{2} (\vartheta^{(r+1)} - \vartheta^{(r)})^{\mathrm{T}} \nabla_\vartheta^2 D_{\operatorname{KL}}(\vartheta^{(r)}, \eta_r) (\vartheta^{(r+1)} - \vartheta^{(r)}) \ge \tfrac{\lambda}{2} \|\vartheta^{(r+1)} - \vartheta^{(r)}\|^2.$$

The claim follows for $r \to \infty$.

(iv) By (ii) we know that $(\vartheta^{(r)})_r$ is bounded. Thus there exists a convergent subsequence $(\vartheta^{(r_k)})_k$. Define $\vartheta^* = \lim_{k \to \infty} \vartheta^{(r_k)}$ and note that the definition of $\vartheta^{(r_k + 1)}$ yields that
$$\nabla_\vartheta D_{\operatorname{KL}}(\vartheta^{(r_k)}, \vartheta^{(r_k + 1)}) + \nabla L(\vartheta^{(r_k + 1)} \mid x) = 0.$$
Due to part (iii) of the proof it holds
$$\lim_{k \to \infty} \vartheta^{(r_k + 1)} = \vartheta^*.$$
Since $\nabla D_{\operatorname{KL}}$ and $\nabla L$ are continuous, this yields that
$$\nabla L(\vartheta^* \mid x) = \nabla_\vartheta D_{\operatorname{KL}}(\vartheta^*, \vartheta^*) + \nabla L(\vartheta^* \mid x) = 0.$$
Thus $\vartheta^*$ is a critical point of $L(\cdot \mid x)$. If (A4) is fulfilled, then $L$ is strictly convex. In this case the unique critical point is the global minimizer and the sequence $(\vartheta^{(r)})_r$ has only one accumulation point. Since it is bounded, this yields that the whole sequence converges to the global minimizer.

In Appendix A we derive the EM algorithm for some examples.

2.3. The PALM and iPALM Algorithms

An alternative to the EM algorithm is to minimize the negative log-likelihood function using other numerical algorithms. Later, in Section 4, we will use the Proximal Alternating Linearized Minimization method (PALM) and its inertial version (iPALM). The PALM and iPALM algorithms were first introduced in [6] and [34], respectively. In this section we introduce the general forms of PALM and iPALM and cite the corresponding convergence results.

Problem setting

Let f, g and H be functions fulfilling the following assumption:

Assumption 2.16. (i) The functions $f\colon \mathbb{R}^{d_1} \to (-\infty, \infty]$ and $g\colon \mathbb{R}^{d_2} \to (-\infty, \infty]$ are proper and lower semicontinuous.

(ii) The function $H\colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ is continuously differentiable.

(iii) For any $y \in \mathbb{R}^{d_2}$ the function $\nabla_x H(\cdot, y)$ is globally Lipschitz continuous with Lipschitz constant $L_1(y)$. Similarly, for any $x \in \mathbb{R}^{d_1}$ the function $\nabla_y H(x, \cdot)$ is globally Lipschitz continuous with Lipschitz constant $L_2(x)$.

We consider a function $\Psi\colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to (-\infty, \infty]$ defined by
$$\Psi(x, y) = f(x) + g(y) + H(x, y). \tag{1}$$

2.3.1. Proximal Alternating Linearized Minimization (PALM)

Now, the authors of [6] proposed Algorithm 2.3 for minimizing (1).

Algorithm 2.3 Proximal Alternating Linearized Minimization (PALM)
Input: starting point $(x^{(0)}, y^{(0)}) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, parameters $\gamma_1, \gamma_2 > 1$.
for $r = 0, 1, \dots$ do
  Set $c_r = \gamma_1 L_1(y^{(r)})$ and
  $$x^{(r+1)} \in \operatorname{prox}_{c_r}^{f}\Big( x^{(r)} - \frac{1}{c_r} \nabla_x H(x^{(r)}, y^{(r)}) \Big).$$
  Set $d_r = \gamma_2 L_2(x^{(r+1)})$ and
  $$y^{(r+1)} \in \operatorname{prox}_{d_r}^{g}\Big( y^{(r)} - \frac{1}{d_r} \nabla_y H(x^{(r+1)}, y^{(r)}) \Big).$$
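To illustrate Algorithm 2.3, the following sketch (our own toy example, not from [6]) applies PALM to $\Psi(x, y) = \|x\|_1 + \|y\|_1 + \frac{1}{2}\|Ax + By - b\|^2$, i.e. $f = g = \|\cdot\|_1$ and $H$ the smooth coupling term. Here $\nabla_x H(\cdot, y)$ is Lipschitz with constant $\|A^{\mathrm{T}}A\|_2$ (independent of $y$), and the proximal mapping of the $\ell_1$-norm is soft-thresholding.

```python
import numpy as np

def soft(v, t):
    """Proximal mapping of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
A, B = rng.normal(size=(20, 10)), rng.normal(size=(20, 10))
b = rng.normal(size=20)

L1 = np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad_x H(., y)
L2 = np.linalg.norm(B.T @ B, 2)   # Lipschitz constant of grad_y H(x, .)
gamma = 1.1                       # gamma_1 = gamma_2 > 1

def psi(x, y):
    return (np.abs(x).sum() + np.abs(y).sum()
            + 0.5 * np.sum((A @ x + B @ y - b) ** 2))

x, y = np.zeros(10), np.zeros(10)
values = [psi(x, y)]
for _ in range(200):
    c = gamma * L1                # c_r = gamma_1 * L_1(y^(r))
    x = soft(x - A.T @ (A @ x + B @ y - b) / c, 1.0 / c)
    d = gamma * L2                # d_r = gamma_2 * L_2(x^(r+1))
    y = soft(y - B.T @ (A @ x + B @ y - b) / d, 1.0 / d)
    values.append(psi(x, y))

# Theorem 2.19(i) below: the objective values are nonincreasing.
assert all(v1 <= v0 + 1e-10 for v0, v1 in zip(values, values[1:]))
```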

For the convergence result of the PALM algorithm we need the following additional assumptions:

Assumption 2.17. (i) $\inf_{\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}} \Psi > -\infty$, $\inf_{\mathbb{R}^{d_1}} f > -\infty$ and $\inf_{\mathbb{R}^{d_2}} g > -\infty$.

(ii) There exist $\lambda_1^-, \lambda_2^-, \lambda_1^+, \lambda_2^+ > 0$ such that
$$\inf\{L_1(y^{(k)}) : k \in \mathbb{N}\} \ge \lambda_1^- \quad \text{and} \quad \inf\{L_2(x^{(k)}) : k \in \mathbb{N}\} \ge \lambda_2^-,$$
$$\sup\{L_1(y^{(k)}) : k \in \mathbb{N}\} \le \lambda_1^+ \quad \text{and} \quad \sup\{L_2(x^{(k)}) : k \in \mathbb{N}\} \le \lambda_2^+.$$

(iii) $\nabla H$ is Lipschitz continuous on bounded subsets of $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$.

For $\eta \in (0, \infty]$ we denote by $\Phi_\eta$ the set of all concave continuous functions $\varphi\colon [0, \eta) \to \mathbb{R}_{\ge 0}$ which fulfill the following properties:

(i) $\varphi(0) = 0$.

(ii) $\varphi$ is continuously differentiable on $(0, \eta)$.

(iii) It holds $\varphi'(s) > 0$ for all $s \in (0, \eta)$.

Definition 2.18 (Kurdyka-Łojasiewicz property). Let $\sigma\colon \mathbb{R}^d \to (-\infty, +\infty]$ be proper and lower semicontinuous.

(i) We say that $\sigma$ has the Kurdyka-Łojasiewicz (KL) property at $\bar{u} \in \operatorname{dom} \partial\sigma = \{u \in \mathbb{R}^d : \partial\sigma(u) \ne \emptyset\}$ if there exist $\eta \in (0, \infty]$, a neighborhood $U$ of $\bar{u}$ and a function $\varphi \in \Phi_\eta$ such that for all
$$u \in U \cap \{v \in \mathbb{R}^d : \sigma(\bar{u}) < \sigma(v) < \sigma(\bar{u}) + \eta\}$$
it holds
$$\varphi'(\sigma(u) - \sigma(\bar{u}))\, \operatorname{dist}(0, \partial\sigma(u)) \ge 1.$$

(ii) We say that $\sigma$ is a KL function if it satisfies the KL property at each point $u \in \operatorname{dom} \partial\sigma$.

Now, the following theorem was proven in [6, Lemma 3, Theorem 1].

Theorem 2.19. Suppose that Assumptions 2.16 and 2.17 hold and denote by $(x^{(r)}, y^{(r)})_r$ the sequence generated by PALM. Then the following holds true:

(i) The sequence $(\Psi(x^{(r)}, y^{(r)}))_r$ is nonincreasing and it holds that
$$\frac{\rho}{2} \big\| (x^{(r+1)}, y^{(r+1)}) - (x^{(r)}, y^{(r)}) \big\|^2 \le \Psi(x^{(r)}, y^{(r)}) - \Psi(x^{(r+1)}, y^{(r+1)}),$$
where $\rho = \min\{(\gamma_1 - 1)\lambda_1^-, (\gamma_2 - 1)\lambda_2^-\}$.

(ii) If $\Psi$ is additionally a KL function, then the sequence $(x^{(r)}, y^{(r)})_r$ converges to a critical point $(x^*, y^*)$ of $\Psi$.

2.3.2. Inertial Proximal Alternating Linearized Minimization (iPALM)

A generalization of the PALM algorithm is given by the inertial Proximal Alternating Linearized Minimization algorithm (iPALM). This algorithm was proposed in [34] and reads as Algorithm 2.4. For the convergence result of iPALM we need some further assumptions.

Assumption 2.20. There exists some $\epsilon > 0$ such that the following holds true:

(i) For all $k \in \mathbb{N}$ and $i = 1, 2$ there exist $0 < \bar{\alpha}_i < \frac{1 - \epsilon}{2}$ such that $0 \le \alpha_i^{(k)} \le \bar{\alpha}_i$. Further, it holds $0 \le \beta_i^{(k)} \le \bar{\beta}_i$ for some $\bar{\beta}_i > 0$.

Algorithm 2.4 Inertial Proximal Alternating Linearized Minimization (iPALM)
Input: starting point $(x_1^{(0)}, x_2^{(0)}) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, parameters $\alpha_1^{(r)}, \alpha_2^{(r)}, \beta_1^{(r)}, \beta_2^{(r)} \in [0, 1]$ and $\tau_1^{(r)}, \tau_2^{(r)} > 0$ for $r = 0, 1, \dots$.
for $r = 0, 1, \dots$ do
  Set
  $$y_1^{(r)} = x_1^{(r)} + \alpha_1^{(r)} (x_1^{(r)} - x_1^{(r-1)}), \quad z_1^{(r)} = x_1^{(r)} + \beta_1^{(r)} (x_1^{(r)} - x_1^{(r-1)}),$$
  $$x_1^{(r+1)} \in \operatorname{prox}_{\tau_1^{(r)}}^{f}\Big( y_1^{(r)} - \frac{1}{\tau_1^{(r)}} \nabla_{x_1} H(z_1^{(r)}, x_2^{(r)}) \Big).$$
  Set
  $$y_2^{(r)} = x_2^{(r)} + \alpha_2^{(r)} (x_2^{(r)} - x_2^{(r-1)}), \quad z_2^{(r)} = x_2^{(r)} + \beta_2^{(r)} (x_2^{(r)} - x_2^{(r-1)}),$$
  $$x_2^{(r+1)} \in \operatorname{prox}_{\tau_2^{(r)}}^{g}\Big( y_2^{(r)} - \frac{1}{\tau_2^{(r)}} \nabla_{x_2} H(x_1^{(r+1)}, z_2^{(r)}) \Big).$$

(ii) The parameters $\tau_1^{(k)}$ and $\tau_2^{(k)}$ are given by
$$\tau_1^{(k)} = \frac{(1 + \epsilon)\delta_1 + (1 + \bar{\beta}_1) L_1(x_2^{(k)})}{1 - \alpha_1^{(k)}} \quad \text{and} \quad \tau_2^{(k)} = \frac{(1 + \epsilon)\delta_2 + (1 + \bar{\beta}_2) L_2(x_1^{(k)})}{1 - \alpha_2^{(k)}},$$
where $\lambda_1^+$ and $\lambda_2^+$ are defined in Assumption 2.17 and
$$\delta_1 = \frac{\bar{\alpha}_1 + \bar{\beta}_1}{1 - \epsilon - 2\bar{\alpha}_1} \lambda_1^+ \quad \text{and} \quad \delta_2 = \frac{\bar{\alpha}_2 + \bar{\beta}_2}{1 - \epsilon - 2\bar{\alpha}_2} \lambda_2^+.$$

Now, the following theorem was proven in [34, Theorem 4.1].

Theorem 2.21. Suppose that Assumptions 2.16, 2.17 and 2.20 hold and denote by $(x_1^{(r)}, x_2^{(r)})_r$ the sequence generated by the iPALM algorithm. Further, assume that $\Psi$ is a KL function. Then the sequence $(x_1^{(r)}, x_2^{(r)})_r$ converges to a critical point of $\Psi$.
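For comparison with PALM, the following sketch applies Algorithm 2.4 to the same kind of toy objective, $\Psi(x_1, x_2) = \|x_1\|_1 + \|x_2\|_1 + \frac{1}{2}\|Ax_1 + Bx_2 - b\|^2$. The constant inertial parameters $\alpha = \beta = 0.4 < \frac{1}{2}$ and the conservative step sizes are illustrative choices in the spirit of Assumption 2.20, not the exact parameter rule from [34].

```python
import numpy as np

def soft(v, t):
    """Proximal mapping of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(4)
A, B = rng.normal(size=(20, 10)), rng.normal(size=(20, 10))
b = rng.normal(size=20)
L1 = np.linalg.norm(A.T @ A, 2)
L2 = np.linalg.norm(B.T @ B, 2)

def psi(x1, x2):
    return (np.abs(x1).sum() + np.abs(x2).sum()
            + 0.5 * np.sum((A @ x1 + B @ x2 - b) ** 2))

alpha = beta = 0.4                 # inertial parameters, < 1/2
tau1, tau2 = 4.0 * L1, 4.0 * L2    # conservative step size parameters
x1 = x1_old = np.zeros(10)
x2 = x2_old = np.zeros(10)
start = psi(x1, x2)
for _ in range(300):
    y1 = x1 + alpha * (x1 - x1_old)   # extrapolation for the prox point
    z1 = x1 + beta * (x1 - x1_old)    # extrapolation for the gradient
    x1_old, x1 = x1, soft(y1 - A.T @ (A @ z1 + B @ x2 - b) / tau1, 1.0 / tau1)
    y2 = x2 + alpha * (x2 - x2_old)
    z2 = x2 + beta * (x2 - x2_old)
    x2_old, x2 = x2, soft(y2 - B.T @ (A @ x1 + B @ z2 - b) / tau2, 1.0 / tau2)

# Unlike PALM, iPALM is not monotone, but the iterates still reduce Psi overall.
assert psi(x1, x2) < start
```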

3. Alternatives of the EM Algorithm for Estimating the Parameters of the Student-t Distribution

In this section we consider maximum likelihood estimation for multivariate Student-t distributions. This section (including Appendix B) is already contained in the arXiv preprint [16] and submitted for a journal publication. It is a joint work with Marzieh Hasannasab, Friederike Laus and Gabriele Steidl.

The motivation for this section arises from certain tasks in image processing, where the robustness of methods plays an important role. In this context, the Student-t distribution and the closely related Student-t mixture models became popular in various image processing tasks. In [39] it has been shown that Student-t mixture models are superior to Gaussian mixture models for modeling image patches, and the authors proposed an application in image compression. Image denoising based on Student-t models was addressed in [24] and image deblurring in [12, 42]. Further applications include robust methods [4, 29, 37] as well as robust registration [15, 44]. In one dimension, $d = 1$, and for $\nu = 1$ the Student-t distribution coincides with the one-dimensional Cauchy distribution. One of the first papers which suggested a variational approach for denoising of images corrupted by Cauchy noise was [2]. A variational method consisting of a data term that resembles the noise and a total variation regularization term was proposed in [27, 36]. Based on an ML approach, the authors of [23] introduced a so-called generalized myriad filter which estimates both the location and the scale parameter of the Cauchy distribution. They used the filter in a nonlocal denoising approach, where for each pixel of the image they chose as samples of the distribution those pixels having a similar neighborhood and replaced the initial pixel by its filtered version. We also want to mention that a unified framework for images corrupted by white noise that can handle (range constrained) Cauchy noise as well was suggested in [21]. In contrast to the above pixelwise replacement, the state-of-the-art algorithm of Lebrun et al. [25] for denoising images corrupted by white Gaussian noise restores the image patchwise based on a maximum a posteriori approach.
In the Gaussian setting, their approach is equivalent to minimum mean square error estimation, and more general, the resulting estimator can be seen as a particular instance of a best linear unbiased estimator (BLUE). For denoising images corrupted by additive Cauchy noise, a similar approach was addressed in [24] based on ML estimation for the family of Student-t distributions, of which the Cauchy distribution forms a special case. The authors call this approach generalized multivariate myriad filter. However, all these approaches assume that the degree of freedom parameter ν of the Student-t distribution is known, which might not be the case in practice. In this section we consider the estimation of the degree of freedom parameter based on an ML approach. In contrast to maximum likelihood estimators of the location and/or scatter parameter(s)

µ and Σ, to the best of our knowledge the question of the existence of a joint maximum likelihood estimator has not been analyzed before. In this section we provide first results in this direction. Usually, the negative log-likelihood function of the Student-t distribution and of the corresponding mixture models is minimized using the EM algorithm derived, e.g., in [26, 32]. For fixed ν there exists an accelerated EM algorithm [19, 28, 40] which appears to be more efficient than the classical one for smaller parameters ν. We examine the convergence of the accelerated version if the degree of freedom parameter ν has to be estimated as well. Further, we propose two modifications of the ν iteration step which lead to efficient algorithms for a wide range of parameters ν.

3.1. Likelihood of the Multivariate Student-t Distribution

The density function of the d-dimensional Student-t distribution T_ν(µ, Σ) with ν > 0 degrees of freedom, location parameter µ ∈ R^d and symmetric, positive definite scatter matrix Σ ∈ SPD(d) is given by

p(x | ν, µ, Σ) = Γ((d+ν)/2) / ( Γ(ν/2) ν^{d/2} π^{d/2} |Σ|^{1/2} ( 1 + (1/ν)(x−µ)^T Σ^{−1} (x−µ) )^{(d+ν)/2} ),

with the Gamma function Γ(s) := ∫_0^∞ t^{s−1} e^{−t} dt. The expectation of the Student-t distribution is E(X) = µ for ν > 1 and the covariance matrix is given by Cov(X) = ν/(ν−2) Σ for ν > 2; otherwise these quantities are undefined. The smaller the value of ν, the heavier are the tails of the T_ν(µ, Σ) distribution. For ν → ∞, the Student-t distribution T_ν(µ, Σ) converges to the normal distribution N(µ, Σ), and for ν = 0 it is related to the projected normal distribution on the sphere S^{d−1} ⊂ R^d. Figure 1 illustrates this behavior for the one-dimensional standard Student-t distribution.

As the normal distribution, the d-dimensional Student-t distribution belongs to the class of elliptically symmetric distributions. These distributions are stable under linear transforms in the following sense: Let X ∼ T_ν(µ, Σ), let A ∈ R^{d×d} be an invertible matrix and let b ∈ R^d. Then AX + b ∼ T_ν(Aµ + b, AΣA^T). Furthermore, the Student-t distribution T_ν(µ, Σ) admits the following stochastic representation, which can be used to generate samples from T_ν(µ, Σ) based on samples from the multivariate standard normal distribution N(0, I) and the Gamma distribution Γ(ν/2, ν/2): Let Z ∼ N(0, I) and Y ∼ Γ(ν/2, ν/2) be independent; then

X = µ + Σ^{1/2} Z / √Y ∼ T_ν(µ, Σ).    (2)
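The representation (2) directly yields a sampler. The following is a minimal stdlib-only sketch (the function name is ours); instead of the symmetric square root Σ^{1/2} it uses a Cholesky factor L with LL^T = Σ, which yields the same distribution since Z is rotation invariant.

```python
import math
import random

def sample_student_t(nu, mu, L, rng):
    """One sample from T_nu(mu, Sigma) with Sigma = L L^T via the stochastic
    representation (2): X = mu + Sigma^{1/2} Z / sqrt(Y), with independent
    Z ~ N(0, I) and Y ~ Gamma(nu/2, nu/2) (shape/rate parametrization)."""
    d = len(mu)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # random.gammavariate takes (shape, scale); rate nu/2 corresponds to scale 2/nu
    y = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return [mu[i] + sum(L[i][j] * z[j] for j in range(d)) / math.sqrt(y)
            for i in range(d)]

rng = random.Random(0)
nu, mu = 5.0, [1.0, -1.0]
L = [[1.0, 0.0], [0.5, 1.0]]  # Cholesky factor, Sigma = [[1.0, 0.5], [0.5, 1.25]]
samples = [sample_student_t(nu, mu, L, rng) for _ in range(20000)]
mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
```

Since ν = 5 > 1, the sample mean should be close to µ.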


Figure 1: Standard Student-t distribution T_ν(0, 1) for different values of ν in comparison with the standard normal distribution N(0, 1).

For i.i.d. samples x_i ∈ R^d, i = 1, ..., n, the likelihood function of the Student-t distribution T_ν(µ, Σ) is given by

L(ν, µ, Σ | x_1, ..., x_n) = Γ((d+ν)/2)^n / ( Γ(ν/2)^n (πν)^{nd/2} |Σ|^{n/2} ) · ∏_{i=1}^n ( 1 + (1/ν)(x_i−µ)^T Σ^{−1} (x_i−µ) )^{−(d+ν)/2},

and the log-likelihood function by

ℓ(ν, µ, Σ | x_1, ..., x_n) = n log Γ((d+ν)/2) − n log Γ(ν/2) − (nd/2) log(πν) − (n/2) log|Σ| − ((d+ν)/2) Σ_{i=1}^n log( 1 + (1/ν)(x_i−µ)^T Σ^{−1}(x_i−µ) ).

In the following, we are interested in the negative log-likelihood function, which, up to the factor 2/n, an additive constant and with weights w_i = 1/n, reads as

L(ν, µ, Σ) = −2 log Γ((d+ν)/2) + 2 log Γ(ν/2) − ν log(ν) + (d+ν) Σ_{i=1}^n w_i log( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) + log|Σ|.
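As a sanity check, L and ℓ can be implemented directly and compared: up to the factor −2/n and the constant d log π they agree. A minimal sketch for d = 2 with uniform weights (function names are ours):

```python
import math

def neg_log_likelihood(nu, mu, Sigma, xs):
    """L(nu, mu, Sigma) for d = 2 with uniform weights w_i = 1/n."""
    d, n = 2, len(xs)
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    inv = [[Sigma[1][1] / det, -Sigma[0][1] / det],
           [-Sigma[1][0] / det, Sigma[0][0] / det]]
    val = (-2 * math.lgamma((d + nu) / 2) + 2 * math.lgamma(nu / 2)
           - nu * math.log(nu) + math.log(det))
    for x in xs:
        v = [x[0] - mu[0], x[1] - mu[1]]
        delta = sum(v[i] * inv[i][j] * v[j] for i in range(d) for j in range(d))
        val += (d + nu) * math.log(nu + delta) / n
    return val

def log_likelihood(nu, mu, Sigma, xs):
    """Log-likelihood ell(nu, mu, Sigma | x_1, ..., x_n) for d = 2."""
    d, n = 2, len(xs)
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    inv = [[Sigma[1][1] / det, -Sigma[0][1] / det],
           [-Sigma[1][0] / det, Sigma[0][0] / det]]
    val = (n * math.lgamma((d + nu) / 2) - n * math.lgamma(nu / 2)
           - n * d / 2 * math.log(math.pi * nu) - n / 2 * math.log(det))
    for x in xs:
        v = [x[0] - mu[0], x[1] - mu[1]]
        delta = sum(v[i] * inv[i][j] * v[j] for i in range(d) for j in range(d))
        val -= (d + nu) / 2 * math.log(1 + delta / nu)
    return val

xs = [[0.3, -0.2], [1.5, 0.7], [-0.8, 0.1], [0.2, 2.0]]
nu, mu, Sigma = 3.0, [0.2, 0.3], [[1.0, 0.2], [0.2, 1.5]]
lhs = neg_log_likelihood(nu, mu, Sigma, xs)
rhs = -2 / len(xs) * log_likelihood(nu, mu, Sigma, xs) - 2 * math.log(math.pi)
```

Here lhs and rhs coincide, confirming L = −(2/n) ℓ − d log π.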

In this section, we allow for arbitrary weights from the open probability simplex ∆̊_n := { w = (w_1, ..., w_n) ∈ R^n_{>0} : Σ_{i=1}^n w_i = 1 }. In this way, we may express different levels of confidence in single samples or handle the occurrence of multiple samples. Using ∂ log|X| / ∂X = X^{−1} and ∂ (a^T X^{−1} b) / ∂X = −X^{−T} a b^T X^{−T}, see [33], the derivatives of L with

respect to µ, Σ and ν are given by

∂L/∂µ (ν, µ, Σ) = −2(d+ν) Σ_{i=1}^n w_i Σ^{−1}(x_i−µ) / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ),

∂L/∂Σ (ν, µ, Σ) = −(d+ν) Σ_{i=1}^n w_i Σ^{−1}(x_i−µ)(x_i−µ)^T Σ^{−1} / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) + Σ^{−1},

∂L/∂ν (ν, µ, Σ) = φ(ν/2) − φ((ν+d)/2) + Σ_{i=1}^n w_i ( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) − log( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) ) − 1 ),

with

φ(x) := ψ(x) − log(x), x > 0,

and the digamma function

ψ(x) = d/dx log Γ(x) = Γ'(x)/Γ(x).

Setting the derivatives to zero results in the equations

0 = Σ_{i=1}^n w_i (x_i−µ) / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ),    (3)

I = (d+ν) Σ_{i=1}^n w_i Σ^{−1/2}(x_i−µ)(x_i−µ)^T Σ^{−1/2} / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ),    (4)

0 = F(ν/2) := φ(ν/2) − φ((ν+d)/2) + Σ_{i=1}^n w_i ( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) − log( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) ) − 1 ).    (5)

Computing the trace of both sides of (4) and using the linearity and permutation invariance of the trace operator, we obtain

d = tr(I) = (d+ν) Σ_{i=1}^n w_i tr( Σ^{−1/2}(x_i−µ)(x_i−µ)^T Σ^{−1/2} ) / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) = (d+ν) Σ_{i=1}^n w_i (x_i−µ)^T Σ^{−1}(x_i−µ) / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ),

which yields

1 = (d+ν) Σ_{i=1}^n w_i / ( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ).

We are interested in critical points of the negative log-likelihood function L, i.e., in solutions (µ, Σ, ν) of (3)-(5), and in particular in minimizers of L.

3.2. Existence of Critical Points

In this section, we examine whether the negative log-likelihood function L has a minimizer. We restrict our attention to the case µ = 0. For an approach how to extend the results to arbitrary µ for fixed ν we refer to [24]. For fixed ν > 0, it is known that there exists a unique solution of (4), and for ν = 0 that there exist solutions of (4) which differ only by a multiplicative positive constant, see, e.g., [24]. In contrast, if we do not fix ν, we roughly have to distinguish between the two cases that the samples tend to come from a Gaussian distribution, i.e. ν → ∞, or not. The results are presented in Theorem 3.2. We make the following general assumption:

Assumption 3.1. Any subset of d samples x_i, i ∈ {1, ..., n}, is linearly independent and max{ w_i : i = 1, ..., n } < 1/d.

For µ = 0, the negative log-likelihood function becomes

L(ν, Σ) := −2 log Γ((d+ν)/2) + 2 log Γ(ν/2) − ν log(ν) + (d+ν) Σ_{i=1}^n w_i log( ν + x_i^T Σ^{−1} x_i ) + log|Σ|
= −2 log Γ((d+ν)/2) + 2 log Γ(ν/2) − ν log(ν) + (d+ν) log(ν) + (d+ν) Σ_{i=1}^n w_i log( 1 + (1/ν) x_i^T Σ^{−1} x_i ) + log|Σ|.

Further, for fixed ν > 0, set

L_ν(Σ) := (d+ν) Σ_{i=1}^n w_i log( ν + x_i^T Σ^{−1} x_i ) + log|Σ|.

To prove the next existence theorem we will need two lemmas, whose proofs are given in Appendix B.

Theorem 3.2. Let x_i ∈ R^d, i = 1, ..., n, and w ∈ ∆̊_n fulfill Assumption 3.1. Then exactly one of the following statements holds:

(i) There exists a minimizing sequence (ν_r, Σ_r)_r of L such that { ν_r : r ∈ N } has a finite cluster point. Then argmin_{(ν,Σ) ∈ R_{>0} × SPD(d)} L(ν, Σ) ≠ ∅ and every (ν̂, Σ̂) ∈ argmin_{(ν,Σ) ∈ R_{>0} × SPD(d)} L(ν, Σ) is a critical point of L.

(ii) For every minimizing sequence (ν_r, Σ_r)_r of L(ν, Σ) we have lim_{r→∞} ν_r = ∞. Then (Σ_r)_r converges to the maximum likelihood estimator Σ̂ = Σ_{i=1}^n w_i x_i x_i^T of the normal distribution N(0, Σ).

Proof. Case 1: Assume that there exists a minimizing sequence (ν_r, Σ_r)_r of L such that (ν_r)_r has a bounded subsequence. In particular, using Lemma B.1, we have that (ν_r)_r has a cluster point ν* > 0 and a subsequence (ν_{r_k})_k converging to ν*. Clearly, the sequence (ν_{r_k}, Σ_{r_k})_k is again a minimizing sequence, so that we skip the second index in the following. By Lemma B.2, the set { Σ_r : r ∈ N } is a compact subset of SPD(d). Therefore there exists a subsequence (Σ_{r_k})_k which converges to some Σ* ∈ SPD(d). Now we have by the continuity of L(ν, Σ) that

L(ν*, Σ*) = lim_{k→∞} L(ν_{r_k}, Σ_{r_k}) = min_{(ν,Σ) ∈ R_{>0} × SPD(d)} L(ν, Σ).

Case 2: Assume that for every minimizing sequence (ν_r, Σ_r)_r it holds that ν_r → ∞ as r → ∞. We rewrite the negative log-likelihood function as

L(ν, Σ) = 2 log( Γ(ν/2) (ν/2)^{d/2} / Γ((d+ν)/2) ) + d log(2) + (d+ν) Σ_{i=1}^n w_i log( 1 + (1/ν) x_i^T Σ^{−1} x_i ) + log|Σ|.

Since

lim_{ν→∞} Γ(ν/2) (ν/2)^{d/2} / Γ((d+ν)/2) = 1,

we obtain

lim_{r→∞} L(ν_r, Σ_r) = d log(2) + lim_{r→∞} ( (d+ν_r) Σ_{i=1}^n w_i log( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i ) + log|Σ_r| ).    (6)

Next we show by contradiction that { Σ_r : r ∈ N } is contained in SPD(d) and bounded: Denote the eigenvalues of Σ_r by λ_{r,1} ≥ ... ≥ λ_{r,d}. Assume that either { λ_{r,1} : r ∈ N } is unbounded or that { λ_{r,d} : r ∈ N } has zero as a cluster point. Then we know by [24, Theorem 4.3] that there exists a subsequence of (Σ_r)_r, which we again denote by (Σ_r)_r, such that for any fixed ν > 0 it holds

lim_{r→∞} L_ν(Σ_r) = ∞.

Since k ↦ (1 + x/k)^k is monotone increasing, for ν_r ≥ d + 1 we have

(d+ν_r) Σ_{i=1}^n w_i log( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )
= Σ_{i=1}^n w_i log( ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{ν_r+d} )
≥ Σ_{i=1}^n w_i log( ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{ν_r} )
≥ Σ_{i=1}^n w_i log( ( 1 + (1/(d+1)) x_i^T Σ_r^{−1} x_i )^{d+1} )
= (d+1) Σ_{i=1}^n w_i log( 1 + (1/(d+1)) x_i^T Σ_r^{−1} x_i )
≥ (d+1) Σ_{i=1}^n w_i log( 1 + x_i^T Σ_r^{−1} x_i ) − log( (d+1)^{d+1} ).

By (6) this yields

lim_{r→∞} L(ν_r, Σ_r) ≥ d log(2) − log( (d+1)^{d+1} ) + lim_{r→∞} ( (d+1) Σ_{i=1}^n w_i log( 1 + x_i^T Σ_r^{−1} x_i ) + log|Σ_r| )
= d log(2) − log( (d+1)^{d+1} ) + lim_{r→∞} L_1(Σ_r) = ∞.

This contradicts the assumption that (ν_r, Σ_r)_r is a minimizing sequence of L. Hence { Σ_r : r ∈ N } is a bounded subset of SPD(d).

Finally, we show that any subsequence of (Σ_r)_r has a subsequence which converges to Σ̂ = Σ_{i=1}^n w_i x_i x_i^T. Then the whole sequence (Σ_r)_r converges to Σ̂.

Let (Σ_{r_k})_k be a subsequence of (Σ_r)_r. Since it is bounded, it has a convergent subsequence (Σ_{r_{k_l}})_l which converges to some Σ̃ in the closure of { Σ_r : r ∈ N } ⊂ SPD(d). For simplicity, we denote (Σ_{r_{k_l}})_l again by (Σ_r)_r. Since (Σ_r)_r converges, we know that also (x_i^T Σ_r^{−1} x_i)_r converges and is bounded. By lim_{r→∞} ν_r = ∞ we know that the functions x ↦ (1 + x/ν_r)^{ν_r} converge locally uniformly to x ↦ exp(x) as r → ∞. Thus we obtain

lim_{r→∞} (d+ν_r) Σ_{i=1}^n w_i log( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )
= lim_{r→∞} Σ_{i=1}^n w_i log( ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{d+ν_r} )
= lim_{r→∞} Σ_{i=1}^n w_i log( ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{ν_r} ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{d} )
= Σ_{i=1}^n w_i log( lim_{r→∞} ( 1 + (1/ν_r) x_i^T Σ_r^{−1} x_i )^{ν_r} )
= Σ_{i=1}^n w_i log( exp( x_i^T Σ̃^{−1} x_i ) ) = Σ_{i=1}^n w_i x_i^T Σ̃^{−1} x_i.

Hence we have

inf_{(ν,Σ) ∈ R_{>0} × SPD(d)} L(ν, Σ) = lim_{r→∞} L(ν_r, Σ_r) = d log(2) + Σ_{i=1}^n w_i x_i^T Σ̃^{−1} x_i + log|Σ̃|.

By taking the derivative with respect to Σ we see that the right-hand side is minimal if and only if Σ̃ = Σ̂ = Σ_{i=1}^n w_i x_i x_i^T. On the other hand, by similar computations as above we get

inf_{(ν,Σ) ∈ R_{>0} × SPD(d)} L(ν, Σ) ≤ lim_{r→∞} L(ν_r, Σ̂)
= d log(2) + log|Σ̂| + lim_{r→∞} (d+ν_r) Σ_{i=1}^n w_i log( 1 + (1/ν_r) x_i^T Σ̂^{−1} x_i )
= d log(2) + log|Σ̂| + Σ_{i=1}^n w_i x_i^T Σ̂^{−1} x_i,

so that Σ̃ = Σ̂. This finishes the proof.

3.3. Zeros of F

In this section, we are interested in the existence of solutions of (5), i.e., in zeros of F for arbitrary fixed µ and Σ. Setting x := ν/2 > 0, t := d/2 and

s_i := (1/2) (x_i−µ)^T Σ^{−1} (x_i−µ), i = 1, ..., n,

we rewrite the function F in (5) as

F(x) = φ(x) − φ(x+t) + Σ_{i=1}^n w_i ( (x+t)/(x+s_i) − log( (x+t)/(x+s_i) ) − 1 )    (7)
     = Σ_{i=1}^n w_i F_{s_i}(x) = Σ_{i=1}^n w_i ( A(x) + B_{s_i}(x) ),

where

F_s(x) := A(x) + B_s(x)    (8)

and

A(x) := φ(x) − φ(x+t), B_s(x) := (x+t)/(x+s) − log( (x+t)/(x+s) ) − 1.

The digamma function ψ and φ = ψ − log(·) are well examined in the literature, see [1]. The value φ(x) is the expectation of the logarithm of a random variable which is Γ(x, x) distributed. It holds −1/x < φ(x) < −1/(2x), and it is well known that −φ is completely monotone. This implies that −A is also completely monotone, i.e., for all x > 0 and m ∈ N_0 we have

(−1)^{m+1} φ^{(m)}(x) > 0, (−1)^{m+1} A^{(m)}(x) > 0,

in particular A < 0, A' > 0 and A'' < 0. Further, it is easy to check that

lim_{x→0} φ(x) = −∞, lim_{x→∞} φ(x) = 0^−,    (9)
lim_{x→0} A(x) = −∞, lim_{x→∞} A(x) = 0^−.    (10)

On the other hand, we have B_s ≡ 0 if s = t, in which case F_s = A < 0 has no zero. If s ≠ t, then B_s is completely monotone, i.e., for all x > 0 and m ∈ N_0,

(−1)^m B_s^{(m)}(x) > 0,

in particular B_s > 0, B_s' < 0 and B_s'' > 0, and

B_s(0) = t/s − log(t/s) − 1 > 0, lim_{x→∞} B_s(x) = 0^+.

Hence we have

lim_{x→0} F_s(x) = −∞, lim_{x→∞} F_s(x) = 0.    (11)

If X ∼ N(µ, Σ) is a d-dimensional random vector, then Y := (X−µ)^T Σ^{−1} (X−µ) ∼ χ²_d with E(Y) = d and Var(Y) = 2d. Thus we would expect that for samples x_i from such a random variable X the corresponding values (x_i−µ)^T Σ^{−1}(x_i−µ) lie with high probability in the interval [d−√(2d), d+√(2d)], respectively s_i ∈ [t−√t, t+√t]. These considerations are reflected in the following theorem and corollary.
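The χ²_d concentration behind this heuristic is easy to check numerically. The following stdlib-only Monte Carlo sketch (names are ours) estimates the mean, the variance, and the probability mass of the one-standard-deviation interval [d−√(2d), d+√(2d)] for d = 8:

```python
import math
import random

rng = random.Random(1)
d, n = 8, 20000
# Y = ||X||^2 for X ~ N(0, I_d) is chi-squared distributed with d degrees of freedom
ys = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)) for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
lo, hi = d - math.sqrt(2 * d), d + math.sqrt(2 * d)
frac_inside = sum(lo <= y <= hi for y in ys) / n
```

Here mean ≈ d = 8, var ≈ 2d = 16, and roughly 70 percent of the samples fall into [d−√(2d), d+√(2d)], so "with high probability" should be read as a one-sigma statement.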

Theorem 3.3. For F_s : R_{>0} → R given by (8) the following relations hold true:

i) If s ∈ [t−√t, t+√t] ∩ R_{>0}, then F_s(x) < 0 for all x > 0, so that F_s has no zero.

ii) If s > 0 and s ∉ [t−√t, t+√t], then there exists x_+ such that F_s(x) > 0 for all x ≥ x_+. In particular, F_s has a zero.

Proof. We have

F_s'(x) = φ'(x) − φ'(x+t) − (s−t)² / ( (x+s)²(x+t) )
        = ψ'(x) − ψ'(x+t) − t / ( x(x+t) ) − (s−t)² / ( (x+s)²(x+t) ).

We want to sandwich F_s' between two rational functions P_s and P_s + Q whose zeros can easily be described. Since the trigamma function ψ' has the series representation

ψ'(x) = Σ_{k=0}^∞ 1/(x+k)²,

see [1], we obtain

F_s'(x) = Σ_{k=0}^∞ ( 1/(x+k)² − 1/(x+k+t)² ) − t/( x(x+t) ) − (s−t)²/( (x+s)²(x+t) ).    (12)

For x > 0, we have

I(x) = ∫_0^∞ ( 1/(x+u)² − 1/(x+u+t)² ) du = 1/x − 1/(x+t) = t/( x(x+t) ),

where we denote the integrand by g(u). Let R(x) and T(x) denote the rectangular and trapezoidal rule, respectively, for computing this integral with step size 1. Then we verify

R(x) = Σ_{k=0}^∞ g(k) = Σ_{k=0}^∞ ( 1/(x+k)² − 1/(x+k+t)² ),

so that

F_s'(x) = ( R(x) − T(x) ) + ( T(x) − I(x) ) − (s−t)²/( (x+s)²(x+t) )
        = (1/2) ( 1/x² − 1/(x+t)² ) + ( T(x) − I(x) ) − (s−t)²/( (x+s)²(x+t) ).

By considering the first and second derivatives of g we see that the integrand in I(x) is strictly decreasing and strictly convex. Thus T(x) > I(x), and hence P_s(x) < F_s'(x), where

P_s(x) := (1/2) ( 1/x² − 1/(x+t)² ) − (s−t)²/( (x+s)²(x+t) ) = p_s(x) / ( 2x²(x+s)²(x+t)² )

with p_s(x) := a_3 x³ + a_2 x² + a_1 x + a_0 and

a_0 = t²s² > 0, a_1 = 2st(s+t) > 0, a_2 = t( 4s + t − (s−t)² ), a_3 = 2( t − (s−t)² ).

We have

a_3 ≥ 0 ⟺ s ∈ [t−√t, t+√t]    (13)

and

a_2 ≥ 0 ⟺ s ∈ [t+2−√(4+5t), t+2+√(4+5t)] ⊃ [t−√t, t+√t]

for t ≥ 1. For t = 1/2, it holds [t+2−√(4+5t), t+2+√(4+5t)] ⊃ [0, t+√t]. Thus, for s ∈ [t−√t, t+√t], by Descartes' rule of signs, p_s(x) has no positive zero, which implies

0 ≤ P_s(x) < F_s'(x) for s ∈ [t−√t, t+√t] ∩ R_{>0}.

Hence the continuous function F_s is monotone increasing, and by (11) we obtain F_s(x) < 0 for all x > 0 if s ∈ [t−√t, t+√t] ∩ R_{>0}.

Let s > 0 and s ∉ [t−√t, t+√t]. By

T(x) − I(x) = Σ_{k=0}^∞ ( (1/2)( g(k+1) + g(k) ) − ∫_0^1 g(k+u) du )

and Euler's summation formula, we obtain

T(x) − I(x) = Σ_{k=0}^∞ ( (1/12)( g'(k+1) − g'(k) ) − (1/720) g^{(4)}(ξ_k) ), ξ_k ∈ (k, k+1),

with g'(u) = −2/(x+u)³ + 2/(x+u+t)³ and g^{(4)}(u) = 5!/(x+u)⁶ − 5!/(x+u+t)⁶, so that

T(x) − I(x) = −(1/12) g'(0) + (1/6) Σ_{k=0}^∞ ( 1/(x+ξ_k+t)⁶ − 1/(x+ξ_k)⁶ )    (14)
            < −(1/12) g'(0) = (1/6) ( 3tx² + 3t²x + t³ ) / ( x³(x+t)³ ).

Therefore, we conclude F_s'(x) < P_s(x) + Q(x), where Q(x) := (1/6)( 3tx² + 3t²x + t³ )/( x³(x+t)³ ) and

P_s(x) + Q(x) = ( p_s(x) x(x+t) + ( tx² + t²x + (1/3)t³ )(x+s)² ) / ( 2x³(x+s)²(x+t)³ ).

The leading coefficient (of x⁵) of the polynomial in the numerator is 2( t − (s−t)² ), which fulfills (13). Therefore, if s ∉ [t−√t, t+√t], then there exists x_+ large enough such that the numerator is negative for all x ≥ x_+. Consequently, F_s'(x) ≤ P_s(x) + Q(x) < 0 for all x ≥ x_+. Thus F_s is decreasing on [x_+, ∞). By (11),

we conclude that F_s(x) > 0 for all x ≥ x_+ and that F_s has a zero.

The following corollary states that F_s has exactly one zero if s > t + √t. Unfortunately, we do not have such a result for s < t − √t.

Corollary 3.4. Let F_s : R_{>0} → R be given by (8). If s > t + √t and t ≥ 1, then F_s has exactly one zero.

Proof. By Theorem 3.3 ii) and since lim_{x→0} F_s(x) = −∞ and lim_{x→∞} F_s(x) = 0^+, it remains to prove that F_s has at most one zero. Let x_0 > 0 be the smallest number such that F_s'(x_0) = 0. We prove that F_s'(x) < 0 for all x > x_0. To this end, we show that h_s(x) := F_s'(x) (x+s)² (x+t) is strictly decreasing. By (12) we have

h_s(x) = (x+s)²(x+t) ( Σ_{k=0}^∞ ( 1/(x+k)² − 1/(x+k+t)² ) − t/( x(x+t) ) ) − (s−t)²,

and for s > t further

h_s'(x) = ( 2(x+s)(x+t) + (x+s)² ) ( Σ_{k=0}^∞ ( 1/(x+k)² − 1/(x+k+t)² ) − t/( x(x+t) ) )
        + (x+s)²(x+t) ( Σ_{k=0}^∞ ( −2/(x+k)³ + 2/(x+k+t)³ ) + t(2x+t)/( x²(x+t)² ) )
        ≤ 3(x+s)² ( Σ_{k=0}^∞ ( 1/(x+k)² − 1/(x+k+t)² ) − t/( x(x+t) ) )
        + (x+s)²(x+t) ( Σ_{k=0}^∞ ( −2/(x+k)³ + 2/(x+k+t)³ ) + t(2x+t)/( x²(x+t)² ) )
        = (x+s)² ( R(x) − I(x) ),

where I(x) is the integral and R(x) the corresponding rectangular rule with step size 1 of the function g := g_1 + g_2 defined by

g_1(u) := 3 ( 1/(x+u)² − 1/(x+t+u)² ), g_2(u) := (x+t) ( −2/(x+u)³ + 2/(x+t+u)³ ).

We show that R(x) − I(x) < 0 for all x > 0. Let T(x) and T_i(x) be the trapezoidal rules with step size 1 corresponding to I(x) and I_i(x) = ∫_0^∞ g_i(u) du, i = 1, 2. Then it follows

R(x) − I(x) = R(x) − T(x) + T(x) − I(x) = R(x) − T(x) + T_1(x) − I_1(x) + T_2(x) − I_2(x).

Since g_2 is a decreasing, concave function, we conclude T_2(x) − I_2(x) < 0. Using Euler's summation formula as in (14) for g_1, we get

T_1(x) − I_1(x) = −(1/12) g_1'(0) − (1/720) Σ_{k=0}^∞ g_1^{(4)}(ξ_k), ξ_k ∈ (k, k+1).

Since g_1^{(4)} is a positive function, we can write

R(x) − I(x) < R(x) − T(x) + T_1(x) − I_1(x) ≤ (1/2) g(0) − (1/12) g_1'(0)
= (3/2) ( 1/x² − 1/(x+t)² ) + (1/2)(x+t) ( −2/x³ + 2/(x+t)³ ) + (1/2) ( 1/x³ − 1/(x+t)³ )
= (t/2) ( (−3t+3)x² + (−5t²+3t)x − 2t³ + t² ) / ( x³(x+t)³ ).

All coefficients of this polynomial in x are less than or equal to zero for t ≥ 1, which implies that h_s is strictly decreasing. Hence h_s(x) < h_s(x_0) = 0, i.e., F_s'(x) < 0, for all x > x_0, which finishes the proof.

Theorem 3.3 implies the following corollary.

Corollary 3.5. For F : R_{>0} → R given by (7) and δ_i := (x_i−µ)^T Σ^{−1} (x_i−µ), i = 1, ..., n, the following relations hold true:

i) If δ_i ∈ [d−√(2d), d+√(2d)] ∩ R_{>0} for all i ∈ {1, ..., n}, then F(x) < 0 for all x > 0, so that F has no zero.

ii) If δ_i > 0 and δ_i ∉ [d−√(2d), d+√(2d)] for all i ∈ {1, ..., n}, then there exists x_+ such that F(x) > 0 for all x ≥ x_+. In particular, F has a zero.

Proof. Consider F = Σ_{i=1}^n w_i F_{s_i} with s_i = δ_i/2. If δ_i ∈ [d−√(2d), d+√(2d)] ∩ R_{>0} for all i ∈ {1, ..., n}, then s_i ∈ [t−√t, t+√t] ∩ R_{>0} and we have by Theorem 3.3 that F_{s_i}(x) < 0 for all x > 0. Clearly, the same holds true for the whole function F, so that it cannot have a zero.

If δ_i ∉ [d−√(2d), d+√(2d)] for all i ∈ {1, ..., n}, then we know by Theorem 3.3 that there exist x_{i,+} > 0 such that F_{s_i}(x) > 0 for x ≥ x_{i,+}. Thus, F(x) > 0 for x ≥ x_+ := max_i x_{i,+}. Since lim_{x→0} F(x) = −∞, this implies that F has a zero.
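The sign behavior in Corollary 3.5 is easy to verify numerically. The following sketch evaluates F on a grid for d = 1 (so the interval is [1−√2, 1+√2] ∩ R_{>0}), once with all δ_i inside and once with all δ_i outside; the helper names are ours, and the digamma function is approximated by a standard recurrence-plus-asymptotic-series formula:

```python
import math

def digamma(x):
    """psi(x) via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (r + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def phi(x):
    return digamma(x) - math.log(x)

def F(x, deltas, t):
    """F from (7) in the variable x = nu/2, with t = d/2, s_i = delta_i/2
    and uniform weights w_i = 1/n."""
    n = len(deltas)
    val = phi(x) - phi(x + t)
    for delta in deltas:
        q = (x + t) / (x + delta / 2.0)
        val += (q - math.log(q) - 1.0) / n
    return val

d = 1
t = d / 2.0
grid = [0.05 * k for k in range(1, 400)]  # x in (0, 20]
no_zero_case = all(F(x, [1.0, 1.2, 0.8], t) < 0 for x in grid)    # deltas inside [1 - sqrt(2), 1 + sqrt(2)]
zero_case = any(F(x, [10.0, 12.0, 9.0], t) > 0 for x in grid)     # deltas outside the interval
```

In the first case F stays negative on the whole grid (case i)), in the second case F becomes positive for moderate x (case ii)), so a zero exists.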

3.4. Algorithms

In this section, we propose alternatives to the classical EM algorithm for computing the parameters of the Student-t distribution, along with convergence results. In particular, we are interested in estimating the degree of freedom parameter ν, where the function F is of particular interest.

Algorithm 3.1 with weights w_i = 1/n, i = 1, ..., n, is the classical EM algorithm. Note that the function in the third M-step,

Φ_r(ν) := φ(ν/2) − φ( (ν_r+d)/2 ) + Σ_{i=1}^n w_i ( γ_{i,r} − log(γ_{i,r}) − 1 ) = φ(ν/2) + c_r,

has a unique zero, since by (9) the function φ < 0 is monotone increasing with lim_{x→∞} φ(x) = 0^− and c_r > 0. Concerning the convergence of the EM algorithm, it is known that the values of the objective function L(ν_r, µ_r, Σ_r) are monotonically decreasing in r and that a subsequence of the iterates converges to a critical point of L(ν, µ, Σ) if such a point exists, see [7].

Algorithm 3.1 EM Algorithm (EM)
Input: x_1, ..., x_n ∈ R^d, n ≥ d+1, w ∈ ∆̊_n
Initialization: ν_0 = ε > 0, µ_0 = (1/n) Σ_{i=1}^n x_i, Σ_0 = (1/n) Σ_{i=1}^n (x_i−µ_0)(x_i−µ_0)^T
for r = 0, 1, ... do

E-Step: Compute the weights

δ_{i,r} = (x_i−µ_r)^T Σ_r^{−1} (x_i−µ_r),
γ_{i,r} = (ν_r+d) / (ν_r+δ_{i,r}).

M-Step: Update the parameters

µ_{r+1} = ( Σ_{i=1}^n w_i γ_{i,r} x_i ) / ( Σ_{i=1}^n w_i γ_{i,r} ),
Σ_{r+1} = Σ_{i=1}^n w_i γ_{i,r} (x_i−µ_{r+1})(x_i−µ_{r+1})^T,
ν_{r+1} = zero of φ(ν/2) − φ( (ν_r+d)/2 ) + Σ_{i=1}^n w_i ( γ_{i,r} − log(γ_{i,r}) − 1 ).

Algorithm 3.2 differs from the EM algorithm in the iteration of Σ, where the factor 1/( Σ_{i=1}^n w_i γ_{i,r} ) is now incorporated. The computation of this factor requires no additional computational effort, but speeds up the performance, in particular for smaller ν. This kind of acceleration was suggested in [19, 28]. For fixed ν ≥ 1, it was shown in [40] that this algorithm is indeed an EM algorithm arising from another choice of the hidden variable than the one used in the standard approach, see also [22]. Thus, it follows for fixed ν ≥ 1 that the sequence L(ν, µ_r, Σ_r) is monotonically decreasing. However, we also iterate over ν. In contrast to the EM Algorithm 3.1, our ν iteration step depends on µ_{r+1} and Σ_{r+1} instead of µ_r and Σ_r. This is important for our convergence results. Note that in both cases the accelerated algorithm can no longer be interpreted as an EM algorithm, so that the convergence results of the classical EM approach are no longer available.

Let us mention that a Jacobi variant of Algorithm 3.2 for fixed ν, i.e.,

Σ_{r+1} = ( Σ_{i=1}^n w_i γ_{i,r} (x_i−µ_r)(x_i−µ_r)^T ) / ( Σ_{i=1}^n w_i γ_{i,r} ),

with µ_r instead of µ_{r+1}, including a convergence proof, was suggested in [24]. The main reason for this index choice was that we were able to prove monotone convergence of a simplified version of the algorithm for estimating the location and scale of Cauchy noise (d = 1, ν = 1), which could not be achieved with the variant incorporating µ_{r+1}, see [23]. This simplified version is known as the myriad filter in image processing. In this thesis, we keep the variant (15) from the EM algorithm, since we are mainly interested in the computation of ν. With the next two algorithms we suggest to take the critical point equation (5) more directly into account.

Algorithm 3.2 Accelerated EM-like Algorithm (aEM)
Same as Algorithm 3.1 except for

Σ_{r+1} = ( Σ_{i=1}^n w_i γ_{i,r} (x_i−µ_{r+1})(x_i−µ_{r+1})^T ) / ( Σ_{i=1}^n w_i γ_{i,r} ),    (15)
ν_{r+1} = zero of φ(ν/2) − φ( (ν_r+d)/2 ) + Σ_{i=1}^n w_i ( (ν_r+d)/(ν_r+δ_{i,r+1}) − log( (ν_r+d)/(ν_r+δ_{i,r+1}) ) − 1 ).

Algorithm 3.3 computes a zero of

Ψ_r(ν) := φ(ν/2) − φ( (ν+d)/2 ) + Σ_{i=1}^n w_i ( (ν_r+d)/(ν_r+δ_{i,r+1}) − log( (ν_r+d)/(ν_r+δ_{i,r+1}) ) − 1 ) = A(ν/2) + b_r.

This function has a unique zero, since by (10) the function A(x) = φ(x) − φ(x+t) < 0 is monotone increasing with lim_{x→∞} A(x) = 0^− and b_r > 0.

Algorithm 3.3 Multivariate Myriad Filter (MMF)
Same as Algorithm 3.2 except for

ν_{r+1} = zero of φ(ν/2) − φ( (ν+d)/2 ) + Σ_{i=1}^n w_i ( (ν_r+d)/(ν_r+δ_{i,r+1}) − log( (ν_r+d)/(ν_r+δ_{i,r+1}) ) − 1 ).

Finally, Algorithm 3.4 computes the update of ν by directly finding a zero of the whole function F in (5) given µ_{r+1} and Σ_{r+1}. The existence of such a zero was discussed in the previous section. The zero is computed by an inner loop which iterates the update step of ν from Algorithm 3.3. We will see that these inner iterations indeed converge to a zero of F.

Algorithm 3.4 General Multivariate Myriad Filter (GMMF)
Same as Algorithm 3.2 except for

ν_{r+1} = zero of φ(ν/2) − φ( (ν+d)/2 ) + Σ_{i=1}^n w_i ( (ν+d)/(ν+δ_{i,r+1}) − log( (ν+d)/(ν+δ_{i,r+1}) ) − 1 ),

computed by the inner loop

ν_{r,0} = ν_r,
for l = 0, 1, ... do
ν_{r,l+1} = zero of φ(ν/2) − φ( (ν+d)/2 ) + Σ_{i=1}^n w_i ( (ν_{r,l}+d)/(ν_{r,l}+δ_{i,r+1}) − log( (ν_{r,l}+d)/(ν_{r,l}+δ_{i,r+1}) ) − 1 ).

In the rest of this section, we prove that the sequence (L(ν_r, µ_r, Σ_r))_r generated by Algorithms 3.2 and 3.3 decreases in each iteration step and that there exists a subsequence of the iterates which converges to a critical point. We will need the following auxiliary lemma.

Lemma 3.6. Let F_a, F_b : R_{>0} → R be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b. For any initial value x_0 > 0, assume that the sequence generated by

x_{l+1} = zero of F_a(x) + F_b(x_l)

is uniquely determined, i.e., the function on the right-hand side has a unique zero. Then the following holds:

i) If F(x_0) < 0, then (x_l)_l is strictly increasing and F(x) < 0 for all x ∈ [x_l, x_{l+1}], l ∈ N_0.

ii) If F(x_0) > 0, then (x_l)_l is strictly decreasing and F(x) > 0 for all x ∈ [x_{l+1}, x_l], l ∈ N_0.

Furthermore, assume that there exists x_− > 0 with F(x) < 0 for all x < x_− and x_+ > 0 with F(x) > 0 for all x > x_+. Then the sequence (x_l)_l converges to a zero x* of F.

Proof. We consider case i), i.e., F(x_0) < 0. Case ii) follows in a similar way.

We show by induction that F(x_l) < 0 and that x_{l+1} > x_l for all l ∈ N_0. Then it holds for all l ∈ N_0 and x ∈ (x_l, x_{l+1}) that F_a(x) + F_b(x) < F_a(x) + F_b(x_l) < F_a(x_{l+1}) + F_b(x_l) = 0. Thus F(x) < 0 for all x ∈ [x_l, x_{l+1}], l ∈ N_0.

Induction step: Let F_a(x_l) + F_b(x_l) < 0. Since F_a(x_{l+1}) + F_b(x_l) = 0 > F_a(x_l) + F_b(x_l) and F_a is strictly increasing, we have x_{l+1} > x_l. Using that F_b is strictly decreasing, we get F_b(x_{l+1}) < F_b(x_l) and consequently

F(x_{l+1}) = F_a(x_{l+1}) + F_b(x_{l+1}) < F_a(x_{l+1}) + F_b(x_l) = 0.

Assume now that F(x) > 0 for all x > x_+. Since the sequence (x_l)_l is strictly increasing and F(x_l) < 0, it must be bounded from above by x_+. Therefore it converges to some x* ∈ R_{>0}. By the continuity of F_a and F_b it holds that

0 = lim_{l→∞} ( F_a(x_{l+1}) + F_b(x_l) ) = F_a(x*) + F_b(x*) = F(x*).

Hence x* is a zero of F.

For the setting in Algorithm 3.4, Lemma 3.6 implies the following corollary.

Corollary 3.7. Let F_a(ν) := φ(ν/2) − φ( (ν+d)/2 ) and

F_b(ν) := Σ_{i=1}^n w_i ( (ν+d)/(ν+δ_{i,r+1}) − log( (ν+d)/(ν+δ_{i,r+1}) ) − 1 ), r ∈ N_0.

Assume that there exists ν_+ > 0 such that F := F_a + F_b > 0 for all ν ≥ ν_+. Then the sequence (ν_{r,l})_l generated by the r-th inner loop of Algorithm 3.4 converges to a zero of F.

Note that by Corollary 3.5 the above condition on F is fulfilled in each iteration step, e.g., if δ_{i,r} ∉ [d−√(2d), d+√(2d)] for i = 1, ..., n and r ∈ N_0.

Proof. From the previous section we know that F_a is strictly increasing and F_b is strictly decreasing. Both functions are continuous. If F(ν_r) < 0, then we know from Lemma 3.6 that (ν_{r,l})_l is increasing and converges to a zero ν_r* of F. If F(ν_r) > 0, then we know from Lemma 3.6 that (ν_{r,l})_l is decreasing. The condition that there exists x_− ∈ R_{>0} with F(x) < 0 for all x < x_− is fulfilled since lim_{x→0} F(x) = −∞. Hence, by Lemma 3.6, the sequence converges to a zero ν_r* of F.

To prove that the objective function decreases in each step of Algorithms 3.2-3.4, we need the following lemma.

Lemma 3.8. Let F_a, F_b : R_{>0} → R be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b and let G : R_{>0} → R be an antiderivative of F, i.e., F = G'. For an arbitrary x_0 > 0, let (x_l)_l be the sequence generated by

x_{l+1} = zero of F_a(x) + F_b(x_l).

Then the following holds true:

i) The sequence (G(x_l))_l is monotonically decreasing, with G(x_l) = G(x_{l+1}) if and only if x_0 is a critical point of G. If (x_l)_l converges, then the limit x* fulfills

G(x_0) ≥ G(x_1) ≥ G(x*),

with equality if and only if x_0 is a critical point of G.

ii) Let F = F̃_a + F̃_b be another splitting of F with continuous functions F̃_a, F̃_b, where the first one is strictly increasing and the second one strictly decreasing. Assume that F_a, F̃_a are differentiable with F̃_a'(x) > F_a'(x) for all x > 0. Then it holds for y_1 := zero of F̃_a(x) + F̃_b(x_0) that G(x_0) ≥ G(y_1) ≥ G(x_1), with equality if and only if x_0 is a critical point of G.

Proof. i) If F(x_0) = 0, then x_0 is a critical point of G and x_l = x_0 for all l.

Let F(x_0) < 0. By Lemma 3.6 we know that (x_l)_l is strictly increasing and that F(x) < 0 for x ∈ [x_l, x_{l+1}], l ∈ N_0. By the Fundamental Theorem of Calculus it holds

G(x_{l+1}) = G(x_l) + ∫_{x_l}^{x_{l+1}} F(ν) dν.

Thus G(x_{l+1}) < G(x_l).

Let F(x_0) > 0. By Lemma 3.6 we know that (x_l)_l is strictly decreasing and that F(x) > 0 for x ∈ [x_{l+1}, x_l], l ∈ N_0. Then

G(x_l) = G(x_{l+1}) + ∫_{x_{l+1}}^{x_l} F(ν) dν

implies G(x_{l+1}) < G(x_l). Now the rest of assertion i) follows immediately.

ii) It remains to show that G(x_1) ≤ G(y_1). Let F(x_0) < 0. Then we have y_1 ≥ x_0 and x_1 ≥ x_0. By the Fundamental Theorem of Calculus we obtain

F(x_0) + ∫_{x_0}^{x_1} F_a'(x) dx = F_a(x_0) + ∫_{x_0}^{x_1} F_a'(x) dx + F_b(x_0) = F_a(x_1) + F_b(x_0) = 0,
F(x_0) + ∫_{x_0}^{y_1} F̃_a'(x) dx = F̃_a(x_0) + ∫_{x_0}^{y_1} F̃_a'(x) dx + F̃_b(x_0) = F̃_a(y_1) + F̃_b(x_0) = 0.

This yields

∫_{x_0}^{x_1} F_a'(x) dx = ∫_{x_0}^{y_1} F̃_a'(x) dx,

and since F̃_a'(x) > F_a'(x), further y_1 ≤ x_1, with equality if and only if x_0 = x_1, i.e., if x_0 is a critical point of G. Since F(x) < 0 on (x_0, x_1), it holds

G(x_1) = G(y_1) + ∫_{y_1}^{x_1} F(x) dx ≤ G(y_1),

with equality if and only if x_0 = x_1. The case F(x_0) > 0 can be handled similarly.

Lemma 3.8 implies the following relation between the values of the objective function L for Algorithms 3.2-3.4.

Corollary 3.9. For the same fixed ν_r > 0, µ_r ∈ R^d, Σ_r ∈ SPD(d), define µ_{r+1}, Σ_{r+1} and ν_{r+1}^{aEM}, ν_{r+1}^{MMF}, ν_{r+1}^{GMMF} by Algorithms 3.2, 3.3 and 3.4, respectively. For the GMMF algorithm assume that the inner loop converges. Then it holds

L(ν_r, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{aEM}, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{MMF}, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{GMMF}, µ_{r+1}, Σ_{r+1}).

Equality holds if and only if (d/dν) L(ν, µ_{r+1}, Σ_{r+1}) |_{ν=ν_r} = 0, and in this case ν_r = ν_{r+1}^{aEM} = ν_{r+1}^{MMF} = ν_{r+1}^{GMMF}.

Proof. For G(ν) := L(ν, µ_{r+1}, Σ_{r+1}) we have (d/dν) L(ν, µ_{r+1}, Σ_{r+1}) = F(ν), where

F(ν) := φ(ν/2) − φ( (ν+d)/2 ) + Σ_{i=1}^n w_i ( (ν+d)/(ν+δ_{i,r+1}) − log( (ν+d)/(ν+δ_{i,r+1}) ) − 1 ).

We use the splitting F = F_a + F_b = F̃_a + F̃_b with

F_a(ν) := φ(ν/2) − φ( (ν+d)/2 ), F̃_a(ν) := φ(ν/2)

and

F_b(ν) := Σ_{i=1}^n w_i ( (ν+d)/(ν+δ_{i,r+1}) − log( (ν+d)/(ν+δ_{i,r+1}) ) − 1 ), F̃_b(ν) := −φ( (ν+d)/2 ) + F_b(ν).

By the considerations in the previous section we know that F_a and F̃_a are strictly increasing and that F_b and F̃_b are strictly decreasing. Moreover, since φ' > 0, we have F̃_a' > F_a'. Hence it follows from Lemma 3.8 ii) that L(ν_r, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{aEM}, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{MMF}, µ_{r+1}, Σ_{r+1}). Finally, we conclude by Lemma 3.8 i) that L(ν_{r+1}^{MMF}, µ_{r+1}, Σ_{r+1}) ≥ L(ν_{r+1}^{GMMF}, µ_{r+1}, Σ_{r+1}).

Concerning the convergence of the three algorithms we have the following result.

Theorem 3.10. Let (ν_r, µ_r, Σ_r)_r be the sequence generated by Algorithm 3.2, 3.3 or 3.4, respectively, starting with arbitrary initial values ν_0 > 0, µ_0 ∈ R^d, Σ_0 ∈ SPD(d). For the GMMF algorithm we assume that in each step the inner loop converges. Then it holds for all r ∈ N_0 that

L(ν_r, µ_r, Σ_r) ≥ L(ν_{r+1}, µ_{r+1}, Σ_{r+1}),

with equality if and only if (ν_r, µ_r, Σ_r) = (ν_{r+1}, µ_{r+1}, Σ_{r+1}).

Proof. By the general convergence results of the accelerated EM algorithm for fixed ν, see also [24], it holds

L(ν_r, µ_{r+1}, Σ_{r+1}) ≤ L(ν_r, µ_r, Σ_r),

with equality if and only if (µ_r, Σ_r) = (µ_{r+1}, Σ_{r+1}). By Corollary 3.9 it holds

L(ν_{r+1}, µ_{r+1}, Σ_{r+1}) ≤ L(ν_r, µ_{r+1}, Σ_{r+1}),

with equality if and only if ν_r = ν_{r+1}. The combination of both results proves the claim.

Lemma 3.11. Let T = (T_1, T_2, T_3) : R_{>0} × R^d × SPD(d) → R_{>0} × R^d × SPD(d) be the operator of one iteration step of Algorithm 3.2 (or 3.3). Then T is continuous.

Proof. We show the statement for Algorithm 3.3; for Algorithm 3.2 it can be shown analogously. Clearly, the mapping (T_2, T_3)(ν, µ, Σ) is continuous. Further,

T_1(ν, µ, Σ) = zero of Ψ( ·, ν, T_2(ν, µ, Σ), T_3(ν, µ, Σ) ),

where

Ψ(x, ν, µ, Σ) = φ(x/2) − φ( (x+d)/2 ) + Σ_{i=1}^n w_i ( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) − log( (ν+d)/( ν + (x_i−µ)^T Σ^{−1}(x_i−µ) ) ) − 1 ).

It is sufficient to show that the zero of Ψ depends continuously on ν, T_2 and T_3. The continuously differentiable function Ψ is strictly increasing in x, so that (∂/∂x) Ψ(x, ν, T_2, T_3) > 0. By Ψ(T_1, ν, T_2, T_3) = 0, the Implicit Function Theorem yields the following statement: There exists an open neighborhood U × V of (T_1, ν, T_2, T_3) with U ⊂ R_{>0} and V ⊂ R_{>0} × R^d × SPD(d), and a continuously differentiable function G : V → U, such that for all (x, ν, µ, Σ) ∈ U × V it holds

Ψ(x, ν, µ, Σ) = 0 if and only if G(ν, µ, Σ) = x.

Thus the zero of Ψ depends continuously on ν, T_2 and T_3.

This implies the following theorem.

Theorem 3.12. Let (ν_r, µ_r, Σ_r)_r be the sequence generated by Algorithm 3.2 or 3.3 with arbitrary initial values ν_0 > 0, µ_0 ∈ R^d, Σ_0 ∈ SPD(d). Then every cluster point of (ν_r, µ_r, Σ_r)_r is a critical point of L.

Proof. The mapping T defined in Lemma 3.11 is continuous. Further, we know from its definition that (ν, µ, Σ) is a critical point of L if and only if it is a fixed point of T. Let (ν̂, µ̂, Σ̂) be a cluster point of (ν_r, µ_r, Σ_r)_r. Then there exists a subsequence (ν_{r_s}, µ_{r_s}, Σ_{r_s})_s which converges to (ν̂, µ̂, Σ̂). Further, we know by Theorem 3.10 that L_r = L(ν_r, µ_r, Σ_r) is decreasing. Since (L_r)_r is bounded from below, it converges. Now it holds

L(ν̂, µ̂, Σ̂) = lim_{s→∞} L(ν_{r_s}, µ_{r_s}, Σ_{r_s}) = lim_{s→∞} L_{r_s} = lim_{s→∞} L_{r_s+1} = lim_{s→∞} L(ν_{r_s+1}, µ_{r_s+1}, Σ_{r_s+1}) = lim_{s→∞} L( T(ν_{r_s}, µ_{r_s}, Σ_{r_s}) ) = L( T(ν̂, µ̂, Σ̂) ).

By Theorem 3.10 and the definition of T we have that L(ν, µ, Σ) = L(T(ν, µ, Σ)) if and only if (ν, µ, Σ) = T(ν, µ, Σ). By the definition of the algorithm this is the case if and only if (ν, µ, Σ) is a critical point of L. Thus (ν̂, µ̂, Σ̂) is a critical point of L.

3.5. Numerical Results

In this section we give two numerical examples of the developed theory. First, we compare the four different algorithms in Subsection 3.5.1. Then, in Subsection 3.5.2, we provide an application in image analysis by estimating the degrees of freedom parameter in images corrupted by Student-t noise.

3.5.1. Comparison of Algorithms

In this section, we compare the numerical performance of the classical EM algorithm 3.1 and the proposed Algorithms 3.2, 3.3 and 3.4. To this aim, we did the following Monte Carlo simulation: based on the stochastic representation of the Student-t distribution, see equation (2), we draw n = 1000 i.i.d. realizations of the T_ν(µ, Σ) distribution with location parameter µ = 0 and different scatter matrices Σ and degrees of freedom parameters ν. Then, we used the algorithms to compute the ML-estimator (ν̂, µ̂, Σ̂). We initialize all algorithms with the sample mean for µ and the sample covariance matrix for Σ. Furthermore, we initialize ν_0 = 3, and in all algorithms the zero of the respective function is computed by Newton's method. As a stopping criterion we use the following relative distance:

√( ‖µ_{r+1} − µ_r‖² + ‖Σ_{r+1} − Σ_r‖_F² ) / √( ‖µ_r‖² + ‖Σ_r‖_F² ) + |log(ν_{r+1}) − log(ν_r)| / |log(ν_r)| < 10^{−5}.

We take the logarithm of ν in the stopping criterion because T_ν(µ, Σ) converges to the normal distribution as ν → ∞, and therefore the difference between T_ν(µ, Σ) and T_{ν+1}(µ, Σ) becomes small for large ν. To quantify the performance of the algorithms, we count the number of iterations until the stopping criterion is reached. Since the inner loop of the GMMF is potentially time consuming, we additionally measure the execution time until the stopping criterion is reached. This experiment is repeated N = 10,000 times for different values of ν ∈ {1, 2, 5, 10}. Afterwards we calculate the average number of iterations and the average execution times. The results are given in Table 1. We observe that the performance of the algorithms depends on Σ. Further, the aEM algorithm always performs better than the classical EM algorithm. All algorithms need a longer time to estimate large ν; this seems natural since the likelihood function becomes very flat for large ν. Further, the GMMF needs the lowest number of iterations, but for small ν the execution time of the GMMF is larger than that of the aEM algorithm. This can be explained by the fact that the ν-step has a smaller relevance for small ν but is still time consuming in the GMMF. The MMF needs slightly more iterations than the GMMF, but if ν is not extremely large its execution time is smaller than those of the GMMF and the aEM algorithm. In summary, the MMF algorithm is proposed as the algorithm of choice.
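The relative-distance stopping criterion above can be sketched in code. The following is a minimal numpy sketch; the function and variable names are ours, not from the thesis.

```python
import numpy as np

def stopping_criterion(nu_old, mu_old, Sigma_old, nu_new, mu_new, Sigma_new,
                       tol=1e-5):
    """Relative-distance stopping criterion; nu enters through its logarithm
    since T_nu(mu, Sigma) approaches a normal distribution for large nu."""
    num = np.sqrt(np.linalg.norm(mu_new - mu_old) ** 2
                  + np.linalg.norm(Sigma_new - Sigma_old, 'fro') ** 2)
    den = np.sqrt(np.linalg.norm(mu_old) ** 2
                  + np.linalg.norm(Sigma_old, 'fro') ** 2)
    rel_nu = abs(np.log(nu_new) - np.log(nu_old)) / abs(np.log(nu_old))
    return num / den + rel_nu < tol
```

A fixed-point iteration would call this with the parameters of two consecutive iterates and stop as soon as it returns true.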

In Figure 2 we exemplarily show the functional values L(ν_r, µ_r, Σ_r) of the four algorithms for samples generated with different values of ν and Σ = I. Note that the x-axis of the plots is in log-scale. We see that the convergence speed (in terms of the number of iterations) of the EM algorithm is much slower than that of the MMF/GMMF. For small ν the convergence speed of the aEM algorithm is close to that of the GMMF/MMF, but for large ν it is close to that of the EM algorithm. In Figure 3 we show the histograms of the ν-output of 1000 runs for different values of ν and Σ = I. Since the ν-outputs of all algorithms are very close together, we only plot the output of the MMF. Only for ν = 100 do the ν-outputs of the GMMF and MMF differ from the outputs of the aEM algorithm. Here, we give the histograms for both cases. We see that the ν_r of the GMMF and MMF are greater in the case that a minimum of L does not exist.

3.5.2. Unsupervised Estimation of Noise Parameters

Next, we provide an application in image analysis. To this aim, we consider images corrupted by one-dimensional Student-t noise with µ = 0 and unknown Σ ≡ σ² and ν. We provide a method that allows us to estimate ν and σ in an unsupervised way. The basic

Σ = diag(0.1, 0.1)
  ν     EM                   aEM                  MMF                 GMMF
  1     77.90 ± 6.73         27.82 ± 1.96         26.38 ± 1.86        25.23 ± 1.85
  2     49.72 ± 1.87         28.48 ± 1.25         23.39 ± 0.92        21.01 ± 0.82
  5     60.03 ± 12.79        58.86 ± 9.48         31.36 ± 3.62        21.16 ± 2.75
  10    161.43 ± 56.89       155.82 ± 56.81       55.37 ± 10.61       34.68 ± 5.49
  100   5528.99 ± 4613.79    5525.43 ± 4614.94    580.76 ± 1115.79    261.95 ± 933.98

Σ = diag(1, 1)
  ν     EM                   aEM                  MMF                 GMMF
  1     77.79 ± 6.74         27.79 ± 1.97         26.34 ± 1.87        25.20 ± 1.85
  2     49.70 ± 1.86         28.48 ± 1.24         23.38 ± 0.91        21.00 ± 0.81
  5     59.98 ± 13.13        58.90 ± 9.66         31.37 ± 3.68        21.18 ± 2.77
  10    161.98 ± 54.63       156.37 ± 54.53       55.51 ± 10.55       34.76 ± 5.47
  100   5447.00 ± 4571.26    5443.43 ± 4572.41    582.47 ± 1111.33    259.67 ± 920.55

Σ = diag(10, 10)
  ν     EM                   aEM                  MMF                 GMMF
  1     77.80 ± 6.83         27.79 ± 1.99         26.35 ± 1.89        25.21 ± 1.88
  2     49.69 ± 1.90         28.45 ± 1.26         23.37 ± 0.92        20.99 ± 0.82
  5     59.93 ± 13.01        58.80 ± 9.68         31.33 ± 3.69        21.15 ± 2.78
  10    159.92 ± 50.61       154.30 ± 50.51       55.14 ± 10.03       34.58 ± 5.21
  100   5456.18 ± 4605.11    5452.62 ± 4606.26    562.69 ± 1082.63    257.36 ± 902.01

Σ = (2, −1; −1, 2)
  ν     EM                   aEM                  MMF                 GMMF
  1     77.83 ± 1.97         27.81 ± 6.75         26.36 ± 1.87        25.23 ± 1.86
  2     49.66 ± 1.91         28.46 ± 1.26         23.37 ± 0.93        20.98 ± 0.83
  5     60.08 ± 12.74        58.90 ± 9.37         31.39 ± 3.59        21.18 ± 2.71
  10    161.10 ± 54.71       155.49 ± 54.60       55.34 ± 10.49       34.68 ± 5.43
  100   5584.13 ± 4597.97    5580.59 ± 4599.13    589.88 ± 1107.12    267.17 ± 908.90

Σ = diag(0.1, 0.1)
  ν     EM                   aEM                  MMF                  GMMF
  1     0.010173 ± 0.00181   0.004083 ± 0.00067   0.004074 ± 0.00069   0.005703 ± 0.00108
  2     0.006713 ± 0.00143   0.004223 ± 0.00090   0.003654 ± 0.00078   0.004396 ± 0.00109
  5     0.008844 ± 0.00342   0.009402 ± 0.00325   0.005314 ± 0.00177   0.004464 ± 0.00163
  10    0.020038 ± 0.00712   0.020871 ± 0.00761   0.007940 ± 0.00162   0.007161 ± 0.00143
  100   0.661000 ± 0.55210   0.702780 ± 0.58644   0.076444 ± 0.14557   0.050393 ± 0.12556

Σ = diag(1, 1)
  ν     EM                   aEM                  MMF                  GMMF
  1     0.010025 ± 0.00141   0.004043 ± 0.00049   0.004044 ± 0.00051   0.005653 ± 0.00086
  2     0.006870 ± 0.00151   0.004301 ± 0.00094   0.003718 ± 0.00086   0.004467 ± 0.00117
  5     0.008299 ± 0.00265   0.008881 ± 0.00250   0.004979 ± 0.00131   0.004195 ± 0.00131
  10    0.023450 ± 0.00987   0.024536 ± 0.01049   0.009249 ± 0.00301   0.008309 ± 0.00272
  100   0.810580 ± 0.71158   0.876640 ± 0.77246   0.096937 ± 0.18602   0.063198 ± 0.15795

Σ = diag(10, 10)
  ν     EM                   aEM                  MMF                  GMMF
  1     0.011491 ± 0.00326   0.004588 ± 0.00133   0.004573 ± 0.00132   0.006445 ± 0.00199
  2     0.007862 ± 0.00285   0.004945 ± 0.00185   0.004279 ± 0.00164   0.005164 ± 0.00206
  5     0.008674 ± 0.00329   0.009244 ± 0.00315   0.005216 ± 0.00176   0.004384 ± 0.00161
  10    0.021255 ± 0.00824   0.022352 ± 0.00876   0.008494 ± 0.00252   0.007586 ± 0.00230
  100   0.700530 ± 0.60473   0.750020 ± 0.64754   0.080188 ± 0.15852   0.053959 ± 0.13666

Σ = (2, −1; −1, 2)
  ν     EM                   aEM                  MMF                  GMMF
  1     0.011177 ± 0.00320   0.004496 ± 0.00135   0.004472 ± 0.00137   0.006245 ± 0.00198
  2     0.007527 ± 0.00251   0.004726 ± 0.00159   0.004100 ± 0.00143   0.004935 ± 0.00180
  5     0.008771 ± 0.00330   0.009387 ± 0.00322   0.005291 ± 0.00181   0.004426 ± 0.00167
  10    0.021643 ± 0.00908   0.022709 ± 0.00961   0.008608 ± 0.00265   0.007719 ± 0.00242
  100   0.869910 ± 0.77962   0.926510 ± 0.82809   0.102000 ± 0.20195   0.068415 ± 0.17330

Table 1: Average number of iterations (top) and execution times (bottom) and the corresponding standard deviations of the different algorithms.

(a) ν = 1. (b) ν = 2.

(c) ν = 5. (d) ν = 10.

(e) ν = 100. (f) ν = 200.

Figure 2: Plots of L(νr, µr, Σr) on the y-axis and r on the x-axis for all algorithms.

(a) ν = 1. (b) ν = 2.

(c) ν = 5. (d) ν = 10.

(e) ν = 100 using the GMMF. (f) ν = 100 using the EM algorithm.

Figure 3: Histograms of the output ν from all algorithms.

idea is to consider constant areas of an image, where the signal-to-noise ratio is weak and differences between pixel values are solely caused by the noise.

Constant area detection: In order to detect constant regions in an image, we adopt an idea presented in [38]. It is based on Kendall's τ-coefficient, which is a measure of rank correlation, and the associated z-score, see [17, 18]. In the following, we briefly summarize the main ideas behind this approach. For finding constant regions we proceed as follows: first, the image grid G is partitioned into K small, non-overlapping regions G = ∪_{k=1}^K R_k, and for each region we consider the hypothesis testing problem

H_0: R_k is constant   vs.   H_1: R_k is not constant.

To decide whether to reject H_0 or not, we observe the following: consider a fixed region R_k and let I, J ⊆ R_k be two disjoint subsets of R_k with the same cardinality. Denote by u_I and u_J the vectors containing the values of u at the positions indexed by I and J. Then, under H_0, the vectors u_I and u_J are uncorrelated (in fact even independent) for all choices of I, J ⊆ R_k with I ∩ J = ∅ and |I| = |J|. As a consequence, the rejection of H_0 can be reformulated as the question whether we can find I, J such that u_I and u_J are significantly correlated, since in this case there has to be some structure in the image region R_k and it cannot be constant. In order to quantify the correlation, we make use of Kendall's τ-coefficient. The key idea is to focus on the rank (i.e., on the relative order) of the values rather than on the values themselves. In this vein, a block is considered homogeneous if the ranking of the pixel values is uniformly distributed, regardless of the spatial arrangement of the pixels. In the following, we assume that we have extracted two disjoint subsequences x = u_I and y = u_J from a region R_k with I and J as above. Let (x_i, y_i) and (x_j, y_j) be two pairs of observations. Then, the pairs are said to be

concordant if x_i < x_j and y_i < y_j, or x_i > x_j and y_i > y_j;
discordant if x_i < x_j and y_i > y_j, or x_i > x_j and y_i < y_j;
tied if x_i = x_j or y_i = y_j.

Next, let x, y ∈ ℝ^n be two sequences without tied pairs and let n_c and n_d be the number of concordant and discordant pairs, respectively. Then, Kendall's τ-coefficient [17] is defined as τ: ℝ^n × ℝ^n → [−1, 1],

τ(x, y) = (n_c − n_d) / ( n(n−1)/2 ).

From this definition we see that if the agreement between the two rankings is perfect, i.e. the two rankings are the same, then the coefficient attains its maximal value 1. At the other extreme, if the disagreement between the two rankings is perfect, that is, one ranking is the reverse of the other, then the coefficient attains its minimal value −1. If the sequences x and y are uncorrelated, we expect the coefficient to be approximately zero. Denoting by X and Y the underlying random variables that generated the sequences x and y, we have the following result, whose proof can be found in [17].
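A direct O(n²) implementation of Kendall's τ-coefficient for tie-free sequences can be sketched as follows (an illustrative sketch; the function name is ours):

```python
def kendall_tau(x, y):
    """Kendall's tau for sequences without tied pairs:
    tau = (n_c - n_d) / (n (n - 1) / 2)."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:       # concordant pair
                nc += 1
            elif prod < 0:     # discordant pair
                nd += 1
    return (nc - nd) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1 and reversed rankings give τ = −1, matching the extreme cases discussed above.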

Theorem 3.13. Let X and Y be two arbitrary sequences under H_0 without tied pairs. Then, the random variable τ(X, Y) has an expected value of 0 and a variance of 2(2n+5)/(9n(n−1)). Moreover, for n → ∞, the associated z-score z: ℝ^n × ℝ^n → ℝ,

z(x, y) = ( 3√(n(n−1)) / √(2(2n+5)) ) τ(x, y) = 3√2 (n_c − n_d) / √(n(n−1)(2n+5)),

is asymptotically standard normally distributed,

z(X, Y) → N(0, 1) in distribution as n → ∞.

With a slight adaptation, Kendall's τ-coefficient can be generalized to sequences with tied pairs, see [18]. As a consequence of Theorem 3.13, for a given significance level α ∈ (0, 1), we can use the quantiles of the standard normal distribution to decide whether to reject H_0 or not. In practice, we cannot test every kind of region and every kind of disjoint sequences. As in [38], we restrict our attention to quadratic regions and pairwise comparisons of neighboring pixels. We use four kinds of neighboring relations (horizontal, vertical and two diagonal neighbors) and thus perform in total four tests. We reject the hypothesis H_0 that the region is constant as soon as one of the four tests rejects it. Note that by doing so, the final significance level is smaller than the initially chosen one. We start with blocks of size 64 × 64 whose side length is incrementally decreased until enough constant areas are found.
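The resulting homogeneity test can be sketched as follows, assuming the tie-free case and the asymptotic normal approximation of Theorem 3.13; the default threshold 1.96 corresponds to a two-sided significance level of roughly α = 0.05. All names are ours.

```python
import numpy as np

def kendall_z(x, y):
    """Asymptotic z-score of Kendall's tau under H0 (no tied pairs assumed)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    # n_c - n_d as a sum of sign products over all pairs i < j
    s = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    tau = s / (n * (n - 1) / 2)
    return 3.0 * np.sqrt(n * (n - 1)) * tau / np.sqrt(2.0 * (2 * n + 5))

def reject_constant(x, y, z_alpha=1.96):
    """Reject H0 (region constant) if the rank correlation is significant."""
    return abs(kendall_z(x, y)) > z_alpha
```

In the full procedure, this test would be applied to the four pairwise-neighbor sequences of a block, rejecting H_0 as soon as one of them rejects.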

Parameter estimation. In each constant region we consider the pixel values as i.i.d. samples of a univariate Student-t distribution T_ν(µ, σ²), where we estimate the parameters using Algorithm 3.3. After estimating the parameters in each found constant region, the estimated location parameters µ are discarded, while the estimated scale and degrees of freedom parameters σ and ν, respectively, are averaged to obtain the final estimate of the global noise parameters. At this point, as both ν and σ influence the resulting distribution in a multiplicative way, instead of an arithmetic mean one might use a geometric mean, which is slightly less affected by outliers. In Figure 4 we illustrate this procedure for two different noise scenarios. The left column in each figure depicts the detected constant areas. The middle and right columns show histograms of the estimated values for ν and σ, respectively. For the constant area detection we use the code of [38]¹. The true parameters used to generate the noisy images were ν = 1 and σ = 10 for the top row and ν = 5 and σ = 10 for the bottom row, while the obtained estimates are (geometric mean in brackets) ν̂ = 1.0437 (1.0291) and σ̂ = 10.3845 (10.3111) for the top row and ν̂ = 5.4140 (5.0423) and σ̂ = 10.5500 (10.1897) for the bottom row. A further example is given in Figure 5. Here, the obtained estimates are (geometric mean in brackets) ν̂ = 1.0075 (0.99799) and σ̂ = 10.2969 (10.1508) for the top row and ν̂ = 5.4184 (5.1255) and σ̂ = 10.2295 (10.1669) for the bottom row.
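The aggregation of the per-region estimates can be sketched as follows; the helper name is ours, and the function simply returns both the arithmetic and the geometric mean for comparison:

```python
import numpy as np

def aggregate_noise_estimates(nu_estimates, sigma_estimates):
    """Combine per-region estimates into global noise parameters.
    Returns (arithmetic mean, geometric mean) for nu and sigma; the
    geometric mean is less affected by outliers in these multiplicative
    parameters."""
    nu = np.asarray(nu_estimates, float)
    sigma = np.asarray(sigma_estimates, float)
    geometric = lambda v: float(np.exp(np.log(v).mean()))
    return {"nu": (float(nu.mean()), geometric(nu)),
            "sigma": (float(sigma.mean()), geometric(sigma))}
```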

4. Superresolution via Student-t Mixture Models

In this section, we consider Student-t mixture models. We start with the definition. Then, we present some algorithms to compute the maximum likelihood estimator for the parameters of a Student-t mixture model. At the end of this section we apply Student-t mixture models to patch-based superresolution methods.

Mixture models are motivated by the following scenario: we have K random number generators sampling from different distributions. We first choose one of the random number generators randomly, using the probability weights α = (α_1, ..., α_K)^T ∈ ∆, and then sample from the corresponding distribution. If all random number generators sample from Student-t distributions with different parameters, we arrive at the following formal definition of Student-t mixture models.

1https://github.com/csutour/RNLF


(a) Noisy image with detected homogeneous areas. (b) Histogram of estimates for ν. (c) Histogram of estimates for σ².
(d) Noisy image with detected homogeneous areas. (e) Histogram of estimates for ν. (f) Histogram of estimates for σ².

Figure 4: Unsupervised estimation of the noise parameters ν and σ2.

Top row: noisy image with detected homogeneous areas, histogram of estimates for ν, histogram of estimates for σ².
Bottom row: noisy image with detected homogeneous areas, histogram of estimates for ν, histogram of estimates for σ².

Figure 5: Unsupervised estimation of the noise parameters ν and σ2.

54 A Student-t mixture model is a random variable given by the probability density function

p(x) = ∑_{k=1}^K α_k f(x | ν_k, µ_k, Σ_k),    f(x | ν_k, µ_k, Σ_k) = Γ((d+ν_k)/2) / ( Γ(ν_k/2) (πν_k)^{d/2} |Σ_k|^{1/2} ) · (1 + δ_k/ν_k)^{−(d+ν_k)/2},   (16)

where

δ_k = (x − µ_k)^T Σ_k^{−1} (x − µ_k)

and α = (α_1, ..., α_K)^T ∈ ∆, ν_k ∈ ℝ_{>0}, µ_k ∈ ℝ^d and Σ_k ∈ SPD(d).

We can sample from this distribution as described in the motivation above: let Y be a random variable mapping into {1, ..., K} with P(Y = k) = α_k, and let X_1, ..., X_K be random variables with X_k ∼ T_{ν_k}(µ_k, Σ_k). Then the random variable X_Y is a Student-t mixture model with probability density function (16).
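Sampling X_Y can be sketched directly from this description. For the component draws, the sketch below assumes the standard stochastic representation T = µ_k + Z / √(G/ν_k) with Z ∼ N(0, Σ_k) and G ∼ χ²_{ν_k} (presumably the representation referred to as equation (2)); the names are ours.

```python
import numpy as np

def sample_student_t_mixture(n, alphas, nus, mus, Sigmas, rng=None):
    """Draw n samples of X_Y: first draw the label Y with P(Y = k) = alpha_k,
    then draw from T_{nu_k}(mu_k, Sigma_k) via the stochastic representation
    T = mu_k + Z / sqrt(G / nu_k), Z ~ N(0, Sigma_k), G ~ chi^2_{nu_k}."""
    rng = np.random.default_rng(rng)
    K, d = len(alphas), len(mus[0])
    labels = rng.choice(K, size=n, p=alphas)
    samples = np.empty((n, d))
    for i, k in enumerate(labels):
        z = rng.multivariate_normal(np.zeros(d), Sigmas[k])
        g = rng.chisquare(nus[k])
        samples[i] = mus[k] + z / np.sqrt(g / nus[k])
    return samples, labels
```

Returning the labels as well is convenient for simulation studies where the true component of each sample is needed.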

For (α, ν, µ, Σ) ∈ ∆ × ℝ_{>0}^K × (ℝ^d)^K × (SPD(d))^K the likelihood function is given by

ℒ(α, ν, µ, Σ | x_1, ..., x_n) = ∏_{i=1}^n ∑_{k=1}^K α_k f(x_i | ν_k, µ_k, Σ_k).

Thus the negative log-likelihood function is given by

L(α, ν, µ, Σ | x_1, ..., x_n) = − ∑_{i=1}^n log ( ∑_{k=1}^K α_k f(x_i | ν_k, µ_k, Σ_k) ).   (17)
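Evaluating (17) numerically is delicate because the densities can underflow for distant points; a standard remedy is to work with log-densities and a log-sum-exp. A sketch (the function names are ours):

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def student_t_logpdf(x, nu, mu, Sigma):
    """Log-density of the multivariate Student-t distribution T_nu(mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    delta = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return (gammaln((d + nu) / 2) - gammaln(nu / 2)
            - 0.5 * d * np.log(np.pi * nu) - 0.5 * logdet
            - 0.5 * (d + nu) * np.log1p(delta / nu))

def mixture_neg_log_likelihood(X, alphas, nus, mus, Sigmas):
    """Negative log-likelihood (17), evaluated stably via log-sum-exp."""
    K = len(alphas)
    logp = np.array([[np.log(alphas[k]) + student_t_logpdf(x, nus[k], mus[k], Sigmas[k])
                      for k in range(K)] for x in X])
    return float(-logsumexp(logp, axis=1).sum())
```

For K = 1, d = 1 and ν = 1 the density reduces to the standard Cauchy density, which gives a simple sanity check.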

4.1. Estimating the Parameters

We present four algorithms to find critical points of (17). Algorithm 4.1 is the classical EM algorithm; its derivation for this problem can be found in Appendix A.3. Note that, analogously to Algorithm 3.1, the function in the fourth M-step update,

Φ_j^{(r)}(x) = ψ(x/2) − log(x/2) − ψ((d + ν_j^{(r)})/2) + log((d + ν_j^{(r)})/2) + 1 + c_j^{(r)},
c_j^{(r)} = ∑_{i=1}^n β_ij^{(r)} (log(γ_ij^{(r)}) − γ_ij^{(r)}) / ∑_{i=1}^n β_ij^{(r)},

has a unique zero. By [7] we know that the values of the objective function L(α^{(r)}, ν^{(r)}, µ^{(r)}, Σ^{(r)} | x_1, ..., x_n) are monotonically decreasing in r and that a subsequence of the iterates converges to a critical point of L(α, ν, µ, Σ | x_1, ..., x_n) if the iterates are bounded.

Algorithm 4.1 EM Algorithm for Student-t mixture models
Input: x_1, ..., x_n ∈ ℝ^d, (α^{(0)}, ν^{(0)}, µ^{(0)}, Σ^{(0)}) ∈ ∆ × (ℝ_{>0})^K × (ℝ^d)^K × (SPD(d))^K
for r = 0, 1, ... do
  E-Step: For i = 1, ..., n and j = 1, ..., K compute
    β_ij^{(r)} = α_j^{(r)} f(x_i | ν_j^{(r)}, µ_j^{(r)}, Σ_j^{(r)}) / ∑_{l=1}^K α_l^{(r)} f(x_i | ν_l^{(r)}, µ_l^{(r)}, Σ_l^{(r)}),
    γ_ij^{(r)} = (ν_j^{(r)} + d) / ( ν_j^{(r)} + (x_i − µ_j^{(r)})^T (Σ_j^{(r)})^{−1} (x_i − µ_j^{(r)}) ).
  M-Step: For j = 1, ..., K compute
    α_j^{(r+1)} = (1/n) ∑_{i=1}^n β_ij^{(r)},
    µ_j^{(r+1)} = ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)} x_i / ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)},
    Σ_j^{(r+1)} = ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)} (x_i − µ_j^{(r+1)})(x_i − µ_j^{(r+1)})^T / ∑_{i=1}^n β_ij^{(r)},
    ν_j^{(r+1)} = zero of ψ(x/2) − log(x/2) − ψ((d + ν_j^{(r)})/2) + log((d + ν_j^{(r)})/2) + 1 + ∑_{i=1}^n β_ij^{(r)} (log(γ_ij^{(r)}) − γ_ij^{(r)}) / ∑_{i=1}^n β_ij^{(r)}.
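One E/M iteration of Algorithm 4.1 can be sketched with numpy/scipy as follows. This is an illustrative, unoptimized sketch (names are ours); in particular, the ν-update brackets the zero with `brentq` instead of Newton's method and simply keeps the old ν_j if no sign change is found in the bracket.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.optimize import brentq

def t_pdf(x, nu, mu, Sigma):
    # density f(x | nu, mu, Sigma) of the multivariate Student-t distribution
    d = len(mu)
    diff = x - mu
    delta = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.exp(gammaln((d + nu) / 2) - gammaln(nu / 2)
                  - 0.5 * d * np.log(np.pi * nu) - 0.5 * logdet
                  - 0.5 * (d + nu) * np.log1p(delta / nu))

def em_step(X, alphas, nus, mus, Sigmas):
    """One E/M iteration for a Student-t mixture model (sketch)."""
    n, d = X.shape
    K = len(alphas)
    nus = np.asarray(nus, float)
    # E-step: responsibilities beta_ij and precision weights gamma_ij
    F = np.array([[alphas[j] * t_pdf(x, nus[j], mus[j], Sigmas[j])
                   for j in range(K)] for x in X])
    beta = F / F.sum(axis=1, keepdims=True)
    delta = np.array([[(x - mus[j]) @ np.linalg.solve(Sigmas[j], x - mus[j])
                       for j in range(K)] for x in X])
    gamma = (nus + d) / (nus + delta)
    # M-step
    alphas_new = beta.mean(axis=0)
    mus_new, Sigmas_new, nus_new = [], [], []
    for j in range(K):
        w = beta[:, j] * gamma[:, j]
        mu_j = (w[:, None] * X).sum(axis=0) / w.sum()
        D = X - mu_j
        Sigma_j = (w[:, None, None] * D[:, :, None] * D[:, None, :]).sum(axis=0) \
                  / beta[:, j].sum()
        # nu-update: bracketed root search; keep the old nu_j if no sign change
        c = (beta[:, j] * (np.log(gamma[:, j]) - gamma[:, j])).sum() / beta[:, j].sum()
        t = (d + nus[j]) / 2
        def Phi(x, t=t, c=c):
            return digamma(x / 2) - np.log(x / 2) - digamma(t) + np.log(t) + 1 + c
        try:
            nu_j = brentq(Phi, 1e-6, 1e6)
        except ValueError:
            nu_j = nus[j]
        mus_new.append(mu_j)
        Sigmas_new.append(Sigma_j)
        nus_new.append(nu_j)
    return alphas_new, np.array(nus_new), np.array(mus_new), np.array(Sigmas_new)
```

Iterating `em_step` until the relative-distance stopping criterion is met gives the full algorithm.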

Algorithm 4.2 is inspired by Algorithm 3.3. It differs from the EM algorithm in the Σ- and ν-step. Analogously to Algorithm 3.3, the function

Ψ_j^{(r)}(x) = ψ(x/2) − log(x/2) − ψ((d + x)/2) + log((d + x)/2) + 1 + b_j^{(r)},
b_j^{(r)} = ∑_{i=1}^n β_ij^{(r)} (log(γ_ij^{(r)}) − γ_ij^{(r)}) / ∑_{i=1}^n β_ij^{(r)},

has a unique zero. For Algorithm 4.3 we want to use the Proximal Alternating Linearized Minimization

Algorithm 4.2 Variant of the EM Algorithm for Student-t mixture models
Input: x_1, ..., x_n ∈ ℝ^d, (α^{(0)}, ν^{(0)}, µ^{(0)}, Σ^{(0)}) ∈ ∆ × (ℝ_{>0})^K × (ℝ^d)^K × (SPD(d))^K
for r = 0, 1, ... do
  E-Step: For i = 1, ..., n and j = 1, ..., K compute
    β_ij^{(r)} = α_j^{(r)} f(x_i | ν_j^{(r)}, µ_j^{(r)}, Σ_j^{(r)}) / ∑_{l=1}^K α_l^{(r)} f(x_i | ν_l^{(r)}, µ_l^{(r)}, Σ_l^{(r)}),
    γ_ij^{(r)} = (ν_j^{(r)} + d) / (ν_j^{(r)} + δ_ij^{(r)}),  where  δ_ij^{(r)} = (x_i − µ_j^{(r)})^T (Σ_j^{(r)})^{−1} (x_i − µ_j^{(r)}).
  M-Step: For j = 1, ..., K compute
    α_j^{(r+1)} = (1/n) ∑_{i=1}^n β_ij^{(r)},
    µ_j^{(r+1)} = ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)} x_i / ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)},
    Σ_j^{(r+1)} = ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)} (x_i − µ_j^{(r+1)})(x_i − µ_j^{(r+1)})^T / ∑_{i=1}^n β_ij^{(r)} γ_ij^{(r)},
    ν_j^{(r+1)} = zero of ψ(x/2) − log(x/2) − ψ((d + x)/2) + log((d + x)/2) + 1 + ∑_{i=1}^n β_ij^{(r)} (log(γ_ij^{(r)}) − γ_ij^{(r)}) / ∑_{i=1}^n β_ij^{(r)}.

(PALM) algorithm as proposed in [6]. This algorithm is formulated for minimizing functions of the form Ψ: ℝ^{d_1} × ℝ^{d_2} → ℝ,

Ψ(x, y) = f(x) + g(y) + H(x, y),   (18)

where f and g are proper and lower semi-continuous and H is continuously differentiable. For computing the minimizer of (17) we consider the optimization problem

argmin_{α ∈ ℝ^K, ν ∈ ℝ^K_{>0}, µ ∈ (ℝ^d)^K, Σ ∈ (SPD(d))^K}  L(α, ν, µ, Σ) + ι_∆(α)
= argmin_{α ∈ ℝ^K, ν ∈ ℝ^K, µ ∈ (ℝ^d)^K, Σ ∈ (ℝ^{d×d})^K}  L(α, ν, µ, Σ) + ι_∆(α) + ι_{ℝ^K_{>0}}(ν) + ι_{(SPD(d))^K}(Σ).   (19)

We choose f(α) = ι_∆(α), g(ν, µ, Σ) = ι_{ℝ^K_{>0}}(ν) + ι_{(SPD(d))^K}(Σ) and H(α, ν, µ, Σ) = L(α, ν, µ, Σ). Unfortunately, the function g is not lower semi-continuous, so that the proximal mapping of g does not exist. To overcome this problem we choose

g̃(ν, µ, Σ) = ι_{ℝ^K_{≥ε_1}}(ν) + ι_{{A ∈ Sym(d): A ⪰ ε_2 I_d}^K}(Σ)

for small ε_1, ε_2 > 0 instead. Now PALM reads as Algorithm 4.3. Unfortunately, the function L is not defined on the whole ℝ^K × (ℝ^d)^K × ℝ^K × (ℝ^{d×d})^K, and we cannot extend it in a continuously differentiable way to the whole domain. Thus the assumptions of the convergence proof for PALM in [6] are not fulfilled and we cannot prove convergence of Algorithm 4.3. For the implementation of Algorithm 4.3 we need the derivatives of L with respect to α, ν, µ and Σ. These derivatives are computed in Appendix C.

Algorithm 4.3 Proximal Alternating Linearized Minimization (PALM) for Student-t Mixture Models
Input: x_1, ..., x_n ∈ ℝ^d, α^{(0)} ∈ ∆̊_K, ν^{(0)} ∈ ℝ^K_{>0}, µ^{(0)} ∈ (ℝ^d)^K, Σ^{(0)} ∈ (SPD(d))^K, τ_1^r, τ_2^r for r ∈ ℕ
for r = 1, ... do
  α-Update:
    α^{(r+1)} = Π_{∆_K}( α^{(r)} − (1/τ_1^r) ∇_α L(α^{(r)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) )
  ν, µ, Σ-Update:
    ν^{(r+1)} = Π_{ℝ^K_{≥ε_1}}( ν^{(r)} − (1/τ_2^r) ∇_ν L(α^{(r+1)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) )
    µ^{(r+1)} = µ^{(r)} − (1/τ_2^r) ∇_µ L(α^{(r+1)}, ν^{(r)}, µ^{(r)}, Σ^{(r)})
    Σ^{(r+1)} = Π_{{A ∈ Sym(d): A ⪰ ε_2 I_d}^K}( Σ^{(r)} − (1/τ_2^r) ∇_Σ L(α^{(r+1)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) )

For Algorithm 4.4 we want to use the inertial Proximal Alternating Linearized Minimization (iPALM) algorithm as proposed in [34]. It is a generalization of PALM. Similar to PALM, it can be formulated for minimizing functions of the form (18). In the special case that a_1^r = a_2^r, b_1^r = b_2^r = 0 and τ_1^r = τ_2^r for all r ∈ ℕ, iPALM coincides with iPiano (see [30]). To minimize (19) we choose f and g as above in PALM. Since g is not lower semi-continuous, we again choose g̃ instead. Now iPALM reads as Algorithm 4.4. Note that L is still not defined on the whole ℝ^K × (ℝ^d)^K × ℝ^K × (ℝ^{d×d})^K, so that the assumptions of the convergence proof of iPALM in [34] are not fulfilled. Note further that the algorithm does not ensure that α_z^{(r)} ∈ ∆_K, that ν_z^{(r)} ∈ ℝ^K_{>0} or that Σ_z^{(r)} ∈ (SPD(d))^K. This means that the algorithm might not be well defined in some cases. Thus, we cannot show convergence of Algorithm 4.4. For the implementation we again need the derivatives computed in Appendix C.

Algorithm 4.4 Inertial Proximal Alternating Linearized Minimization (iPALM) for Student-t Mixture Models
Input: x_1, ..., x_n ∈ ℝ^d, α^{(0)} ∈ ∆̊_K, ν^{(0)} ∈ ℝ^K_{>0}, µ^{(0)} ∈ (ℝ^d)^K, Σ^{(0)} ∈ (SPD(d))^K, a_1^r, a_2^r, b_1^r, b_2^r ∈ [0, 1], τ_1^r, τ_2^r for r ∈ ℕ
for r = 1, ... do
  α-Update:
    α_y^{(r)} = α^{(r)} + a_1^r (α^{(r)} − α^{(r−1)})
    α_z^{(r)} = α^{(r)} + b_1^r (α^{(r)} − α^{(r−1)})
    α^{(r+1)} = Π_{∆_K}( α_y^{(r)} − (1/τ_1^r) ∇_α L(α_z^{(r)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) )
  ν, µ, Σ-Update:
    ν_y^{(r)} = ν^{(r)} + a_2^r (ν^{(r)} − ν^{(r−1)}),   ν_z^{(r)} = ν^{(r)} + b_2^r (ν^{(r)} − ν^{(r−1)})
    µ_y^{(r)} = µ^{(r)} + a_2^r (µ^{(r)} − µ^{(r−1)}),   µ_z^{(r)} = µ^{(r)} + b_2^r (µ^{(r)} − µ^{(r−1)})
    Σ_y^{(r)} = Σ^{(r)} + a_2^r (Σ^{(r)} − Σ^{(r−1)}),   Σ_z^{(r)} = Σ^{(r)} + b_2^r (Σ^{(r)} − Σ^{(r−1)})
    ν^{(r+1)} = Π_{ℝ^K_{≥ε_1}}( ν_y^{(r)} − (1/τ_2^r) ∇_ν L(α^{(r+1)}, ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)}) )
    µ^{(r+1)} = µ_y^{(r)} − (1/τ_2^r) ∇_µ L(α^{(r+1)}, ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)})
    Σ^{(r+1)} = Π_{{A ∈ Sym(d): A ⪰ ε_2 I_d}^K}( Σ_y^{(r)} − (1/τ_2^r) ∇_Σ L(α^{(r+1)}, ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)}) )

Choice of the parameters: PALM and iPALM have many parameters and the performance of the algorithms is very sensitive to them. We found numerically that the following heuristics are a good choice. For iPALM we use in our numerical experiments the extrapolation parameters

a_1^r = a_2^r = b_1^r = b_2^r = (r − 1)/(r + 2),   r ∈ {1, 2, ...}.

Now we consider the parameters τ_1^r and τ_2^r in PALM and iPALM. Even though our setting does not fulfill the assumptions of the convergence results of PALM and iPALM (Theorems 2.19 and 2.21), these results indicate that the step size 1/τ_1^r should be chosen proportional to 1/L_1(ν^{(r)}, µ^{(r)}, Σ^{(r)}), where L_1(ν^{(r)}, µ^{(r)}, Σ^{(r)}) is the Lipschitz constant of ∇_α L(·, ν^{(r)}, µ^{(r)}, Σ^{(r)}). Similarly, 1/τ_2^r should be chosen proportional to 1/L_2(α^{(r)}), where L_2(α^{(r)}) is the Lipschitz constant of ∇_{(ν,µ,Σ)} L(α^{(r)}, ·, ·, ·). Since L_1(ν, µ, Σ) and L_2(α) are not necessarily finite, we approximate the Lipschitz constants of ∇_α L(·, ν^{(r)}, µ^{(r)}, Σ^{(r)}) and ∇_{(ν,µ,Σ)} L(α^{(r)}, ·, ·, ·) locally by the following heuristics.

We consider the following well-known lemma.

Lemma 4.1 (Descent Lemma). Let f: ℝ^d → ℝ be a differentiable function with L-Lipschitz continuous gradient. Then it holds for any x, y ∈ ℝ^d that

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖_2².   (20)

In particular, (20) is a necessary condition for L to be a Lipschitz constant of ∇f. Now we use the following iterative heuristic to compute τ_1^r:

1. First we set τ_1^r to τ_1^{r−1}/2.
2. Now we perform the α-update from Algorithm 4.3 or 4.4, respectively.

3. Check if

L(α^{(r+1)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) ≤ L(α_z^{(r)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}) + ⟨∇_α L(α_z^{(r)}, ν^{(r)}, µ^{(r)}, Σ^{(r)}), α^{(r+1)} − α_z^{(r)}⟩ + (τ_1^r/2) ‖α^{(r+1)} − α_z^{(r)}‖_2².   (21)

4. If (21) does not hold true, set τ_1^r to 2τ_1^r and repeat steps 2, 3 and 4.

We use a similar heuristic for τ_2^r:

1. First we set τ_2^r to τ_2^{r−1}/2.
2. Now we perform the ν, µ, Σ-update from Algorithm 4.3 or 4.4, respectively.

3. Check if

L(α^{(r+1)}, ν^{(r+1)}, µ^{(r+1)}, Σ^{(r+1)}) ≤ L(α^{(r+1)}, ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)}) + ⟨∇_{(ν,µ,Σ)} L(α^{(r+1)}, ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)}), (ν^{(r+1)}, µ^{(r+1)}, Σ^{(r+1)}) − (ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)})⟩ + (τ_2^r/2) ‖(ν^{(r+1)}, µ^{(r+1)}, Σ^{(r+1)}) − (ν_z^{(r)}, µ_z^{(r)}, Σ_z^{(r)})‖_2².   (22)

4. If (22) does not hold true, set τ_2^r to 2τ_2^r and repeat steps 2, 3 and 4.
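For a plain (unprojected) gradient step, this backtracking heuristic can be sketched as follows. We assume the quadratic term in the check is (τ/2)‖·‖², matching the descent lemma with L ≈ τ; the function names are ours.

```python
import numpy as np

def backtracking_step(L, grad_L, x, tau_prev):
    """Descent-lemma backtracking for the step-size parameter tau (sketch):
    start from tau_prev / 2 and double tau until the quadratic upper bound
    holds for the gradient step x -> x - grad_L(x) / tau."""
    tau = tau_prev / 2
    g = grad_L(x)
    while True:
        x_new = x - g / tau
        # check: L(x_new) <= L(x) + <g, x_new - x> + (tau / 2) ||x_new - x||^2
        bound = L(x) + g @ (x_new - x) + 0.5 * tau * np.sum((x_new - x) ** 2)
        if L(x_new) <= bound:
            return x_new, tau
        tau *= 2
```

For a quadratic objective the loop doubles τ until it reaches an estimate of the gradient's Lipschitz constant, after which the step is accepted.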

4.1.1. Initialization

Since the likelihood function of a Student-t mixture model is high-dimensional and non-convex, the performance of the algorithms above depends crucially on the initialization. In the following, we present four strategies to initialize the parameters of a Student-t mixture model with K components.

1. Random initialization: The easiest initialization strategy is to randomly assign each data point x_i a class k_i ∈ {1, ..., K}, i = 1, ..., n, and to initialize the parameters of the k-th component as follows: denote by x_{k1}, ..., x_{kn_k} the data points with k_i = k. Then initialize (ν_k, µ_k, Σ_k) by

(ν_k, µ_k, Σ_k) = argmin_{(ν,µ,Σ) ∈ ℝ_{>0} × ℝ^d × SPD(d)} L̃(ν, µ, Σ | x_{k1}, ..., x_{kn_k}),

where L̃ denotes the negative log-likelihood function of the multivariate Student-t distribution. The solution of this minimization problem can be computed e.g. with Algorithm 3.3. Further, we initialize the parameter α_k by α_k = n_k/n.

In practice, the initialization generated with this approach can be far away from a minimizer of the negative log-likelihood function.

2. K-means clustering: A very intuitive idea is to cluster the data points x_1, ..., x_n into K classes. In our implementation we use the K-means algorithm to perform the clustering. Then we initialize the parameters of the k-th component of the mixture model as in the first strategy: denote by x_{k1}, ..., x_{kn_k} the data points which belong to class k. Now we initialize the parameters (ν_k, µ_k, Σ_k) by

(ν_k, µ_k, Σ_k) = argmin_{(ν,µ,Σ) ∈ ℝ_{>0} × ℝ^d × SPD(d)} L̃(ν, µ, Σ | x_{k1}, ..., x_{kn_k}),   (23)

where L̃ denotes the negative log-likelihood function of the multivariate Student-t distribution. The solution of this minimization problem can be computed e.g. with Algorithm 3.3. Again we initialize the parameter α_k by α_k = n_k/n.

In our numerical experiments, we observed that if we apply the K-means algorithm to Student-t distributed data, some of the classes are very small. This can be explained by the fact that the Student-t distribution has heavier tails than the normal distribution. However, small classes lead to numerical problems in (23). Further, this initialization strategy only makes sense if we assume that the location parameters of the components of the mixture model are pairwise distinct.

3. K-nearest neighbors: The next approach is to choose randomly K data points {y_1, ..., y_K} ⊂ {x_1, ..., x_n}. Then we initialize the parameters of the k-th component of the mixture model as follows: denote by {x_{k1}, ..., x_{km}} ⊂ {x_1, ..., x_n} the m nearest neighbors of y_k. Then we initialize the parameters (ν_k, µ_k, Σ_k) by

(ν_k, µ_k, Σ_k) = argmin_{(ν,µ,Σ) ∈ ℝ_{>0} × ℝ^d × SPD(d)} L̃(ν, µ, Σ | x_{k1}, ..., x_{km}),

where L̃ denotes the negative log-likelihood function of the multivariate Student-t distribution. Again we compute the solution of the minimization problem using Algorithm 3.3. Here we initialize α_k by α_k = 1/K.

This approach is used in [3] for Gaussian mixture models. Similar to the second approach, it only makes sense if we assume that the location parameters of the components of the mixture model are pairwise distinct.

4. Nearest center: This approach is a slight modification of the third approach. We again randomly choose K data points {y_1, ..., y_K} ⊂ {x_1, ..., x_n}. Then we assign to each data point x_1, ..., x_n a class k_1, ..., k_n ∈ {1, ..., K} by

k_i ∈ argmin_{k=1,...,K} ‖x_i − y_k‖_2,   i = 1, ..., n.

Now we initialize the parameters as in the first approach: denote by x_{k1}, ..., x_{kn_k} the data points which belong to class k and initialize (ν_k, µ_k, Σ_k) by

(ν_k, µ_k, Σ_k) = argmin_{(ν,µ,Σ) ∈ ℝ_{>0} × ℝ^d × SPD(d)} L̃(ν, µ, Σ | x_{k1}, ..., x_{kn_k}),

where L̃ denotes the negative log-likelihood function of the multivariate Student-t distribution. The solution of this minimization problem can be computed e.g. with Algorithm 3.3. Again we initialize the parameter α_k by α_k = n_k/n.

This approach is used in [35] for Gaussian mixture models. Again, it only makes sense if the location parameters of the components of the mixture model are not equal.
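The nearest-center partition underlying this initialization can be sketched as follows (names are ours; the per-class Student-t fits via Algorithm 3.3 are left out):

```python
import numpy as np

def nearest_center_partition(X, K, rng=None):
    """Nearest-center initialization (sketch): pick K random data points as
    centers, assign every sample to its closest center, and set
    alpha_k = n_k / n. Each returned class would then be fitted separately,
    e.g. with a single-component Student-t ML estimator."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    classes = [X[labels == k] for k in range(K)]
    alphas = np.array([len(c) / len(X) for c in classes])
    return classes, alphas
```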

In our simulation study in Section 4.1.2 we use the first approach. For the numerical examples in Section 4.3 we use the third approach with m ≈ n/K.

4.1.2. Simulation Study

In this section, we compare the numerical performance of the classical EM algorithm 4.1, the proposed Algorithm 4.2, the PALM algorithm 4.3 and the iPALM algorithm 4.4. To generate n i.i.d. samples x_1, ..., x_n of a given Student-t mixture model with parameters α ∈ ∆_K, ν = (ν_1, ..., ν_K), µ = (µ_1, ..., µ_K) ∈ ℝ^{d×K} and Σ = (Σ_1, ..., Σ_K) ∈ (SPD(d))^K we use the following procedure:

1. Generate i.i.d. samples k_1, ..., k_n from the distribution on {1, ..., K} defined by P({k}) = α_k, k = 1, ..., K.
2. Based on the stochastic representation of the Student-t distribution, see equation (2), draw independent realizations x_1, ..., x_n with x_i ∼ T_{ν_{k_i}}(µ_{k_i}, Σ_{k_i}).

To compare the algorithms we do the following Monte Carlo simulation: we draw n = 10000 samples from a Student-t mixture model as described above with parameters (α, ν, µ, Σ) ∈ ∆_K × ℝ^K_{>0} × ℝ^{d×K} × (SPD(d))^K. Then we use Algorithms 4.1, 4.2, 4.3 and 4.4 to compute the ML-estimator (α̂, ν̂, µ̂, Σ̂). We use random initialization (see Section 4.1.1) for all algorithms. To compute the zeros in Algorithms 4.1 and 4.2 we use Newton's method. As a stopping criterion we take the relative distance

√( ‖α^{(r+1)} − α^{(r)}‖² + ‖log(ν^{(r+1)}) − log(ν^{(r)})‖² + ‖µ^{(r+1)} − µ^{(r)}‖_F² + ∑_{k=1}^K ‖Σ_k^{(r+1)} − Σ_k^{(r)}‖_F² ) / √( ‖α^{(r)}‖² + ‖log(ν^{(r)})‖² + ‖µ^{(r)}‖_F² + ∑_{k=1}^K ‖Σ_k^{(r)}‖_F² ) < 10^{−5}.

We iterate this procedure N = 100 times. We generate the samples with the parameters K = 3, d = 2 and

α = (1/2, 1/3, 1/6),
µ = ( (1, 0)^T, (1, 0)^T, (1, 0)^T ),
Σ = ( (1, 0; 0, 2), (3, 0; 0, 2), (4, 1; 1, 2) ).

For ν we use different values. The resulting average numbers of steps and execution times of the algorithms are given in Table 2. To compare the quality of the estimates we also compute the average value of the negative log-likelihood function of the outcomes of the algorithms. In Figure 6 we plot the likelihood values of the estimates after r steps against the execution time of the first r steps of the algorithms.

Average number of iterations:
  ν                EM              Variant of EM   PALM            iPALM
  (1, 2, 5)        6246 ± 4398     5332 ± 3310     4279 ± 2983     3071 ± 4110
  (2, 4, 10)       7369 ± 4190     5525 ± 3574     4503 ± 3741     4270 ± 4800
  (5, 10, 25)      7940 ± 4235     5777 ± 3447     7823 ± 4171     7278 ± 6028
  (10, 20, 50)     6553 ± 3401     4615 ± 3959     4259 ± 3800     8794 ± 5896
  (20, 40, 100)    7134 ± 3769     4706 ± 3480     3412 ± 3054     8759 ± 5069

Average execution times:
  ν                EM              Variant of EM   PALM             iPALM
  (1, 2, 5)        57.35 ± 40.12   46.97 ± 29.57   69.59 ± 48.58    53.28 ± 70.56
  (2, 4, 10)       67.67 ± 38.71   49.94 ± 32.37   73.69 ± 61.25    74.83 ± 83.91
  (5, 10, 25)      67.15 ± 35.67   47.76 ± 29.30   117.28 ± 63.99   117.11 ± 99.30
  (10, 20, 50)     43.05 ± 22.24   29.43 ± 25.21   51.46 ± 45.94    110.54 ± 74.23
  (20, 40, 100)    46.98 ± 24.78   29.87 ± 22.06   41.10 ± 36.80    109.88 ± 63.52

Average negative log-likelihood values:
  ν                EM                  Variant of EM       PALM                iPALM
  (1, 2, 5)        46152.48 ± 202.03   46152.51 ± 201.91   46153.72 ± 201.69   46152.37 ± 201.82
  (2, 4, 10)       41635.15 ± 130.18   41635.12 ± 130.05   41639.15 ± 130.29   41636.01 ± 130.94
  (5, 10, 25)      38991.62 ± 127.23   38991.56 ± 129.10   38996.20 ± 126.73   38992.62 ± 126.76
  (10, 20, 50)     38163.39 ± 110.48   38163.54 ± 110.62   38169.00 ± 110.67   38165.02 ± 111.12
  (20, 40, 100)    37736.13 ± 104.48   37736.36 ± 104.44   37741.18 ± 104.39   37737.64 ± 103.92

Table 2: Average number of iterations (top), execution times (middle) and negative log-likelihood values (bottom) and the corresponding standard deviations of the different algorithms.

We observe that for every choice of $\nu$, Algorithm 4.2 has the lowest execution time. Compared with the EM algorithm 4.1, it also executes fewer steps and reaches a similar value of $L(\alpha^{(r)}, \nu^{(r)}, \mu^{(r)}, \Sigma^{(r)})$ before the stopping criterion is reached. Note that the execution times of a single step of Algorithms 4.1 and 4.2 are similar. In Figure 7 we plot the negative log-likelihood values of the estimates against the number of steps. If we compare PALM and iPALM, we observe that the values of the negative log-likelihood function of the results of PALM are larger, in particular for large $\nu$. This means that PALM stops earlier. Further, until the stopping criterion is reached, iPALM has a lower execution time for small $\nu$ and a higher execution time for large $\nu$. In

Figure 6: Plots of $L(\alpha^{(r)}, \nu^{(r)}, \mu^{(r)}, \Sigma^{(r)})$ on the y-axis and the execution time of the first $r$ steps on the x-axis for all algorithms, for $r = 1, \dots, 1000$. Panels (a)-(f) correspond to $\nu = (1, 2, 5)$, $(2, 4, 10)$, $(5, 10, 25)$, $(10, 20, 50)$, $(20, 40, 100)$ and $(50, 100, 250)$.

particular, the plots in Figure 6 indicate that the lower execution time of PALM is a consequence of the fact that PALM reaches the stopping criterion at a higher value of $L(\alpha^{(r)}, \nu^{(r)}, \mu^{(r)}, \Sigma^{(r)})$ than iPALM. We conclude that iPALM leads to better results than PALM. Overall, the variant of the EM algorithm, Algorithm 4.2, is the fastest algorithm. Therefore we use Algorithm 4.2 for our numerical experiments in Section 4.3.

4.2. Superresolution

In this section we consider the problem of superresolution, that is, creating a high resolution image from a low resolution image. Based on the assumption that the patches of natural images approximately follow a Student-t mixture model, we introduce two algorithms for this task. The algorithms were originally proposed for Gaussian mixture models in [45] and [35].

4.2.1. Expected Patch Log-Likelihood for Student-t Mixture Models

For a given Student-t mixture model with density function $p$ defined by (16), we formulate the Expected Patch Log-Likelihood (EPLL) algorithm proposed in [45]. In practice we estimate the parameters of the mixture model from a given high resolution image using Algorithm 4.2. We use the notations from [31]. For an image $x \in \mathbb{R}^N$ let $y \in \mathbb{R}^M$ be an observation generated by $y = Ax + w$, where $A \in \mathbb{R}^{M\times N}$ is a matrix and $w$ is Gaussian noise with standard deviation $\sigma$. Here, the matrix $A$ is a superresolution operator $A = SH$, which consists of a blur operator $H \in \mathbb{R}^{N\times N}$ and a downsampling operator $S \in \mathbb{R}^{M\times N}$. Using another choice for $A$, the method can also be applied to other problems in imaging (see [31, 45]). Now, for a given observation $y \in \mathbb{R}^M$ we want to reconstruct $x \in \mathbb{R}^N$. The rough idea of this algorithm is to approximate the solutions of the optimization problem

\[
\operatorname*{argmin}_{x\in\mathbb{R}^N}\ \frac{d}{2\sigma^2}\|Ax-y\|_2^2 - \sum_{i\in I}\log p(P_i(x)), \tag{24}
\]
where $P_i$ extracts the $i$-th patch of $x$. The solution of this minimization problem can be interpreted as the MAP estimator (see Section 2.1.3) of $x$ using the prior that the patches of $x$ follow the Student-t mixture model $p$. For simplicity we assume that each pixel is covered by at least one patch. Since this is a large non-convex problem, we use

Figure 7: Plots of $L(\alpha^{(r)}, \nu^{(r)}, \mu^{(r)}, \Sigma^{(r)})$ on the y-axis and the number of steps $r$ on the x-axis for Algorithms 4.1 and 4.2. Panels (a)-(f) correspond to $\nu = (1, 2, 5)$, $(2, 4, 10)$, $(5, 10, 25)$, $(10, 20, 50)$, $(20, 40, 100)$ and $(50, 100, 250)$.

the half-quadratic splitting method [14] and consider the problem

\[
\operatorname*{argmin}_{x\in\mathbb{R}^N}\ \frac{d}{2\sigma^2}\|Ax-y\|_2^2 + \frac{\beta}{2}\sum_{i\in I}\|P_i x - z_i\|_2^2 - \sum_{i\in I}\log p(z_i). \tag{25}
\]

For $\beta \to \infty$ this problem is equivalent to (24). To approximate a solution of (25) we use the alternating optimization scheme

\begin{align}
\hat z_i &:= \operatorname*{argmin}_{z_i\in\mathbb{R}^d}\ \frac{\beta}{2}\|P_i\hat x - z_i\|_2^2 - \log p(z_i), \tag{26}\\
\hat x &:= \operatorname*{argmin}_{x\in\mathbb{R}^N}\ \frac{d}{2\sigma^2}\|Ax-y\|_2^2 + \frac{\beta}{2}\sum_{i\in I}\|P_i x - \hat z_i\|_2^2. \tag{27}
\end{align}

The Hessian of (27) is given by $\frac{d}{\sigma^2}A^{\mathrm T}A + \beta\sum_{i\in I}P_i^{\mathrm T}P_i$. Note that $P_i^{\mathrm T}P_i$ is an $N\times N$ diagonal matrix with diagonal entries $1$ for all pixels which are covered by the $i$-th patch and $0$ for all pixels which are not. Since we assume that each pixel is covered by at least one patch, the Hessian of (27) is positive definite. Thus (27) is strictly convex and we derive its solution by setting its derivative to zero. This can be rewritten as

\[
\hat x = \Big(A^{\mathrm T}A + \frac{\beta\sigma^2}{d}\sum_{i\in I}P_i^{\mathrm T}P_i\Big)^{-1}\Big(A^{\mathrm T}y + \frac{\beta\sigma^2}{d}\sum_{i\in I}P_i^{\mathrm T}\hat z_i\Big).
\]
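The structure of this update is easy to mirror in code. The following sketch solves the linear system behind (27); the dense matrices and the representation of each patch operator $P_i$ by an index array are our own simplifications, and for realistic image sizes one would use an iterative solver instead.

```python
import numpy as np

def x_update(A, y, patch_indices, z_hat, beta, sigma, d):
    """Solve (A^T A + (beta sigma^2 / d) sum_i P_i^T P_i) x
             = A^T y + (beta sigma^2 / d) sum_i P_i^T z_i.
    Each patch operator P_i is given as an index array into x."""
    c = beta * sigma ** 2 / d
    M = A.T @ A
    rhs = A.T @ y
    for idx, z in zip(patch_indices, z_hat):
        M[idx, idx] += c      # P_i^T P_i: ones on the covered diagonal entries
        rhs[idx] += c * z     # P_i^T z_i: scatter the patch back into the image
    return np.linalg.solve(M, rhs)
```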

Since we still cannot solve (26) directly, we use an approximation proposed in [45] for Gaussian mixture models: We select $k_i^* \in \{1, \dots, K\}$ which maximizes the likelihood that the patch $P_i\hat x$ belongs to the $k_i^*$-th component. Then we compute

\[
\hat z_i := \operatorname*{argmin}_{z_i}\ \frac{\beta}{2}\|P_i\hat x - z_i\|_2^2 - \log f_{k_i^*}(z_i).
\]

Now we get Algorithm 4.5. We iterate it for an increasing sequence of β.

Note that if we choose the patches such that each pixel is covered by exactly $d$ patches, then we can rewrite step 4 of Algorithm 4.5 as

\[
\hat x = \big(A^{\mathrm T}A + \beta\sigma^2\,\mathrm{Id}\big)^{-1}\Big(A^{\mathrm T}y + \frac{\beta\sigma^2}{d}\sum_{i\in I}P_i^{\mathrm T}\hat z_i\Big). \tag{28}
\]

Algorithm 4.5 EPLL for Student-t mixture models
Input: initialization $\hat x \in \mathbb{R}^N$, low resolution image $y \in \mathbb{R}^M$, superresolution operator $A \in \mathbb{R}^{M\times N}$, Student-t mixture model $(\alpha, \nu, \mu, \Sigma) \in \Delta_K \times \mathbb{R}_{>0}^K \times (\mathbb{R}^d)^K \times \mathrm{SPD}(d)^K$ and parameters $\beta \in \mathbb{R}_{>0}$ and $\sigma \in \mathbb{R}_{>0}$.
Output: high resolution image $\hat x \in \mathbb{R}^N$.
for $i \in I$ do
1. $\tilde z_i = P_i\hat x$.
2. $k_i^* = \operatorname*{argmin}_{1\le k\le K}\ -\log(\alpha_k) - \log(f_k(\tilde z_i))$.
3. $\hat z_i = \operatorname*{argmin}_{z_i}\ \frac{\beta}{2}\|\tilde z_i - z_i\|^2 + \frac{d+\nu_{k_i^*}}{2}\log\big(1 + \frac{1}{\nu_{k_i^*}}(z_i - \mu_{k_i^*})^{\mathrm T}\Sigma_{k_i^*}^{-1}(z_i - \mu_{k_i^*})\big)$.
4. $\hat x = \big(A^{\mathrm T}A + \frac{\beta\sigma^2}{d}\sum_{i\in I}P_i^{\mathrm T}P_i\big)^{-1}\big(A^{\mathrm T}y + \frac{\beta\sigma^2}{d}\sum_{i\in I}P_i^{\mathrm T}\hat z_i\big)$.

Now we show that for sufficiently large $\beta$ the optimization problem in step 3 has a unique solution, i.e. that the function
\[
\phi(z) = \frac{\beta}{2}\|\tilde z_i - z\|^2 + \frac{d+\nu_{k_i^*}}{2}\log\Big(1 + \frac{1}{\nu_{k_i^*}}(z - \mu_{k_i^*})^{\mathrm T}\Sigma_{k_i^*}^{-1}(z - \mu_{k_i^*})\Big)
\]
has a unique minimizer. Further, we provide an efficient way to compute this minimizer.

Using the notation $\Sigma_{k_i^*} = P^{\mathrm T}DP$ for an orthogonal matrix $P$ and a diagonal matrix $D = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$, we rewrite $\phi$ as
\[
\phi(z) = \frac{\beta}{2}\|P\tilde z_i - Pz\|^2 + \frac{d+\nu_{k_i^*}}{2}\log\Big(1 + \frac{1}{\nu_{k_i^*}}(Pz - P\mu_{k_i^*})^{\mathrm T}D^{-1}(Pz - P\mu_{k_i^*})\Big).
\]

Now use the notations $\tilde z = P\tilde z_i$, $\mu = P\mu_{k_i^*}$ and $\nu = \nu_{k_i^*}$. Define
\[
\psi(z) = \phi(P^{\mathrm T}z) = \frac{\beta}{2}\|\tilde z - z\|_2^2 + \frac{d+\nu}{2}\log\Big(1 + \frac{1}{\nu}(z-\mu)^{\mathrm T}D^{-1}(z-\mu)\Big).
\]
Thus it holds that
\[
\operatorname*{argmin}_{z\in\mathbb{R}^d}\phi(z) = P^{\mathrm T}\operatorname*{argmin}_{z\in\mathbb{R}^d}\psi(z).
\]

Lemma 4.2. There exists a global minimum of $\psi$. Further, if $\beta \ge \frac{1}{2\lambda_j}\frac{d+\nu}{\nu}$ for all $j = 1, \dots, d$, then $\psi$ has a unique global minimizer, which is the unique critical point of $\psi$. Moreover, it can be computed by
\[
C = \text{zero of } g(x) = \sum_{j=1}^d \frac{1}{\lambda_j}\Bigg(\frac{\beta + x\frac{\beta}{\nu}}{\beta + x\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}\Bigg)^2(\tilde z_j - \mu_j)^2 - x,
\]
\[
\hat z = \Big(\big(\beta + C\tfrac{\beta}{\nu}\big)\mathrm{Id} + \big(\tfrac{d}{\nu}+1\big)D^{-1}\Big)^{-1}\Big(\big(\beta + C\tfrac{\beta}{\nu}\big)\tilde z + \big(\tfrac{d}{\nu}+1\big)D^{-1}\mu\Big).
\]

In particular, the lemma yields that for sufficiently large $\beta$ there exists a unique critical point of $\phi$, which is the global minimizer.
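A direct NumPy implementation of Lemma 4.2 might look as follows. This is a sketch under the assumption $\beta \ge \frac{1}{2\lambda_j}\frac{d+\nu}{\nu}$; a plain bisection stands in for Newton's method for simplicity.

```python
import numpy as np

def prox_student_t(z_tilde, mu, Sigma, nu, beta):
    """Unique minimizer of beta/2 ||z_tilde - z||^2
       + (d+nu)/2 log(1 + (z-mu)^T Sigma^{-1} (z-mu)/nu), via Lemma 4.2."""
    d = len(z_tilde)
    lam, P = np.linalg.eigh(Sigma)        # Sigma = P diag(lam) P^T
    zt, m = P.T @ z_tilde, P.T @ mu       # rotate into the eigenbasis
    b = (d / nu + 1) / lam

    def g(x):                             # the scalar equation for C
        a = beta + x * beta / nu
        return np.sum((a / (a + b)) ** 2 * (zt - m) ** 2 / lam) - x

    lo, hi = 0.0, 1.0
    while g(hi) > 0:                      # bracket the unique root
        hi *= 2.0
    for _ in range(200):                  # bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
    C = 0.5 * (lo + hi)
    a = beta + C * beta / nu
    return P @ ((a * zt + b * m) / (a + b))   # componentwise solution
```

A quick sanity check is that the gradient of the objective vanishes at the returned point.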

Proof. 1. We observe that ψ is coercive and continuous. This yields the existence of a global minimizer.

2. We show the uniqueness of critical points. The gradient of ψ is given by

\[
\nabla\psi(z) = \beta(z - \tilde z) + \Big(\frac{d}{\nu}+1\Big)\frac{D^{-1}(z-\mu)}{1 + \frac{1}{\nu}(z-\mu)^{\mathrm T}D^{-1}(z-\mu)}.
\]
Thus $z$ is a critical point of $\psi$ if and only if
\[
0 = \beta(z - \tilde z) + \frac{\beta}{\nu}(z-\mu)^{\mathrm T}D^{-1}(z-\mu)(z - \tilde z) + \Big(\frac{d}{\nu}+1\Big)D^{-1}(z-\mu).
\]

Setting $C = C(z) := (z-\mu)^{\mathrm T}D^{-1}(z-\mu)$, this becomes
\[
0 = \beta(z - \tilde z) + \frac{\beta}{\nu}C(z - \tilde z) + \Big(\frac{d}{\nu}+1\Big)D^{-1}(z-\mu). \tag{29}
\]

We rewrite (29) as

\[
\Big(\beta + C\frac{\beta}{\nu}\Big)\tilde z + \Big(\frac{d}{\nu}+1\Big)D^{-1}\mu = \Big(\Big(\beta + C\frac{\beta}{\nu}\Big)\mathrm{Id} + \Big(\frac{d}{\nu}+1\Big)D^{-1}\Big)z.
\]

This becomes

\[
z = \Big(\Big(\beta + C\frac{\beta}{\nu}\Big)\mathrm{Id} + \Big(\frac{d}{\nu}+1\Big)D^{-1}\Big)^{-1}\Big(\Big(\beta + C\frac{\beta}{\nu}\Big)\tilde z + \Big(\frac{d}{\nu}+1\Big)D^{-1}\mu\Big). \tag{30}
\]

Thus we get for the components zj of z, j = 1, ..., d that

\[
z_j = \frac{\big(\beta + C\frac{\beta}{\nu}\big)\tilde z_j + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)\mu_j}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}.
\]

Inserting this in the definition of C we get

\begin{align*}
C &= \sum_{j=1}^d \frac{1}{\lambda_j}(z_j - \mu_j)^2\\
&= \sum_{j=1}^d \frac{1}{\lambda_j}\Bigg(\frac{\big(\beta + C\frac{\beta}{\nu}\big)\tilde z_j + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)\mu_j}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)} - \frac{\big(\beta + C\frac{\beta}{\nu}\big)\mu_j + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)\mu_j}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}\Bigg)^2\\
&= \sum_{j=1}^d \frac{1}{\lambda_j}\Bigg(\frac{\beta + C\frac{\beta}{\nu}}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}\Bigg)^2(\tilde z_j - \mu_j)^2.
\end{align*}

This is equivalent to

\[
0 = \sum_{j=1}^d \frac{1}{\lambda_j}\Bigg(\frac{\beta + C\frac{\beta}{\nu}}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}\Bigg)^2(\tilde z_j - \mu_j)^2 - C. \tag{31}
\]

Hence it is sufficient to solve (31) and insert the solution into (30) to solve the original problem. To see that this solution is unique, we show that (31) has a unique solution. We see that for $C < 0$ the right hand side of (31) is greater than zero. Thus (31) is equivalent to

\begin{align*}
0 &= \sum_{j=1}^d \frac{1}{\lambda_j}\frac{1}{C}\Bigg(\frac{\beta + C\frac{\beta}{\nu}}{\beta + C\frac{\beta}{\nu} + \frac{1}{\lambda_j}\big(\frac{d}{\nu}+1\big)}\Bigg)^2(\tilde z_j - \mu_j)^2 - 1\\
&= \sum_{j=1}^d \frac{1}{\lambda_j}\frac{1}{C}(\tilde z_j - \mu_j)^2\Bigg(1 - \frac{d+\nu}{\beta\lambda_j(\nu+C) + (d+\nu)}\Bigg)^2 - 1 \tag{32}\\
&= \sum_{j=1}^d \frac{1}{\lambda_j}\frac{1}{C}(\tilde z_j - \mu_j)^2\Bigg(1 - \frac{1}{\beta\lambda_j\frac{\nu+C}{\nu+d}+1}\Bigg)^2 - 1, \qquad C > 0.
\end{align*}

The derivative with respect to C of the right hand side is given by

\begin{align*}
&\sum_{j=1}^d \frac{1}{\lambda_j}(\tilde z_j - \mu_j)^2\Bigg(-\frac{1}{C^2}\Big(1 - \frac{1}{\beta\lambda_j\frac{\nu+C}{\nu+d}+1}\Big)^2 + \frac{2}{C}\Big(1 - \frac{1}{\beta\lambda_j\frac{\nu+C}{\nu+d}+1}\Big)\frac{\frac{\lambda_j\beta}{\nu+d}}{\big(\beta\lambda_j\frac{\nu+C}{\nu+d}+1\big)^2}\Bigg)\\
&= -\sum_{j=1}^d \frac{1}{\lambda_j}(\tilde z_j - \mu_j)^2\frac{1}{C^2}\Big(1 - \frac{1}{\beta\lambda_j\frac{\nu+C}{\nu+d}+1}\Big)\Bigg(1 - \frac{(\nu+d)\big(\beta\lambda_j\frac{\nu+C}{\nu+d}+1\big) + 2C\beta\lambda_j}{(\nu+d)\big(\beta\lambda_j\frac{\nu+C}{\nu+d}+1\big)^2}\Bigg)\\
&= -\sum_{j=1}^d \frac{1}{\lambda_j}(\tilde z_j - \mu_j)^2\frac{1}{C^2}\Big(1 - \frac{1}{\beta\lambda_j\frac{\nu+C}{\nu+d}+1}\Big)\Bigg(1 - \frac{\beta\lambda_j\frac{\nu+3C}{\nu+d}+1}{\big(\beta\lambda_j\frac{\nu+C}{\nu+d}+1\big)^2}\Bigg).
\end{align*}

This is smaller than zero if the last factor is greater than zero for all $j = 1, \dots, d$. This is the case if for all $j = 1, \dots, d$ it holds

\[
\beta\lambda_j\frac{\nu+3C}{\nu+d} + 1 \le \Big(\beta\lambda_j\frac{\nu+C}{\nu+d} + 1\Big)^2.
\]

This is equivalent to

\[
0 \le \frac{\beta\lambda_j}{\nu+d}C^2 + \Big(2\beta\lambda_j\frac{\nu}{\nu+d} - 1\Big)C + \beta\lambda_j\frac{\nu^2}{\nu+d} + \nu. \tag{33}
\]

If $\beta \ge \frac{1}{2\lambda_j}\frac{d+\nu}{\nu}$, then all coefficients in (33) are nonnegative and (33) is fulfilled. In this case, the right hand side of (32) is strictly decreasing. Further, it tends to $\infty$ as $C \to 0$ and to $-1$ as $C \to \infty$. So (31) has exactly one solution on $\mathbb{R}_{\ge 0}$. Now (30) has exactly one solution for $z$.

Remark 4.3. In [31], three modifications of the EPLL for Gaussian mixture models were suggested, resulting in a dramatic acceleration of the algorithm. The first modification is called flat tail spectrum approximation and reduces the dimension of the vectors in steps 1 to 3. The second approximation speeds up the selection of $k_i^*$ using a balanced search tree. However, if we consider Student-t mixture models, the bottleneck of Algorithm 4.5 is the computation of the constant $C$ in step 3 using Lemma 4.2. This step cannot be accelerated by these two modifications. The third modification proposed in [31] is the restriction to a random subset of patches. This can also be easily implemented for Algorithm 4.5 and works as follows:

Instead of computing the loop in Algorithm 4.5 for each patch of $\hat x$, we compute it only for a random subset of the patches. To ensure that each pixel is covered by at least one patch, we consider a regular grid with spacing $s \in [1, \sqrt d]$. Now for each point $(i_0, j_0)$ on the grid we choose one patch with location $(i, j)$ uniformly at random such that
\[
i_0 - \Big\lfloor\frac{\sqrt d - s}{2}\Big\rfloor \le i \le i_0 + \Big\lfloor\frac{\sqrt d - s}{2}\Big\rfloor
\quad\text{and}\quad
j_0 - \Big\lfloor\frac{\sqrt d - s}{2}\Big\rfloor \le j \le j_0 + \Big\lfloor\frac{\sqrt d - s}{2}\Big\rfloor.
\]

This method of patch sampling is also called "jittering", see [10]. Note that the simplification (28) of step 4 in Algorithm 4.5 is no longer applicable, since each pixel is now covered by fewer than $d$ patches. Nevertheless, the authors of [31] suggest using the simplification

\[
\hat x = \big(A^{\mathrm T}A + \beta\sigma^2\,\mathrm{Id}\big)^{-1}\Bigg(A^{\mathrm T}y + \beta\sigma^2\Big(\sum_{i\in I}P_i^{\mathrm T}P_i\Big)^{-1}\sum_{i\in I}P_i^{\mathrm T}\hat z_i\Bigg),
\]
since
\[
\Big(\sum_{i\in I}P_i^{\mathrm T}P_i\Big)^{-1}\sum_{i\in I}P_i^{\mathrm T}\hat z_i
\]

corresponds just to an averaging of the patches, and in the case that each pixel is covered by exactly $d$ patches it holds
\[
\Big(\sum_{i\in I}P_i^{\mathrm T}P_i\Big)^{-1} = \frac{1}{d}\,\mathrm{Id}.
\]
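The jittered patch sampling can be sketched as follows. The clipping at the image border is our own simplification and is not specified in the text above.

```python
import numpy as np

def jitter_positions(height, width, side, s, rng=None):
    """Sample one patch location per grid cell ("jittering"); side = sqrt(d) is
    the patch side length and s in [1, side] the grid spacing."""
    rng = np.random.default_rng(rng)
    r = (side - s) // 2                    # jitter radius floor((sqrt(d)-s)/2)
    positions = []
    for i0 in range(0, height - side + 1, s):
        for j0 in range(0, width - side + 1, s):
            i = min(max(i0 + rng.integers(-r, r + 1), 0), height - side)
            j = min(max(j0 + rng.integers(-r, r + 1), 0), width - side)
            positions.append((int(i), int(j)))
    return positions
```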

4.2.2. Joint Student-t Mixture Models

We adapt the method proposed in [35] for Student-t mixture models. Given some pairs $(h_i, l_i)$, $i = 1, \dots, N$, with $h_i \in \mathbb{R}^{(q\tau)^2}$, $l_i \in \mathbb{R}^{\tau^2}$ of high resolution and low resolution patches, we estimate the parameters of a Student-t mixture model describing the vectors $v_i := \binom{h_i}{l_i}$, $i = 1, \dots, N$, using Algorithm 4.2. We get a Student-t mixture model
\[
p(x) = \sum_{k=1}^K \alpha_k f_k(x), \qquad
f_k(x) = \frac{\Gamma\big(\frac{d+\nu_k}{2}\big)}{\Gamma\big(\frac{\nu_k}{2}\big)(\pi\nu_k)^{\frac d2}|\Sigma_k|^{\frac12}}\Big(1 + \frac{1}{\nu_k}(x-\mu_k)^{\mathrm T}\Sigma_k^{-1}(x-\mu_k)\Big)^{-\frac{d+\nu_k}{2}}
\]
with
\[
\mu_k = \begin{pmatrix}\mu_{H_k}\\ \mu_{L_k}\end{pmatrix}
\quad\text{and}\quad
\Sigma_k = \begin{pmatrix}\Sigma_{H_k} & \Sigma_{HL_k}\\ \Sigma_{HL_k}^{\mathrm T} & \Sigma_{L_k}\end{pmatrix}.
\]
Using this mixture model we estimate for each low resolution patch the corresponding high resolution patch separately: Given a low resolution patch $y$, we first compute the likelihood $\gamma_k$ that $y$ belongs to the $k$-th component of our mixture model, i.e.
\[
\gamma_k = \alpha_k f(y\,|\,\nu_k, \mu_{L_k}, \Sigma_{L_k}),
\]
where $f$ is the density function of the Student-t distribution. Then we select the component $k^*$ of the mixture model such that the likelihood that $y$ belongs to it is maximal, i.e.
\[
k^* = \operatorname*{argmax}_{1\le k\le K}\gamma_k.
\]
Now we estimate the high resolution patch $\hat x$ using the minimal mean square error (MMSE) estimator, assuming that $\binom{x}{y}$ is a realization of a random vector $\binom{X}{Y} \sim T_{\nu_{k^*}}(\mu_{k^*}, \Sigma_{k^*})$. The MMSE estimator given $y$ is defined by

\[
\hat x = \operatorname*{argmin}_{Z}\operatorname{MSE}(Z\,|\,Y=y), \quad\text{where}\quad \operatorname{MSE}(Z\,|\,Y=y) = \mathbb{E}\big((Z - X)^2\,|\,Y=y\big).
\]

One can rewrite the MMSE estimator as $\hat x = \mathbb{E}(X\,|\,Y=y)$, see e.g. [3]. The following lemma was proven in [13].

Lemma 4.4. Let $x \in \mathbb{R}^{d_1}$, $y \in \mathbb{R}^{d_2}$ be realizations of random vectors $X$ and $Y$ with
\[
\begin{pmatrix}X\\ Y\end{pmatrix} \sim T_\nu\left(\begin{pmatrix}\mu_H\\ \mu_L\end{pmatrix}, \begin{pmatrix}\Sigma_H & \Sigma_{HL}\\ \Sigma_{HL}^{\mathrm T} & \Sigma_L\end{pmatrix}\right).
\]

Then the conditional distribution of X given Y = y reads as

\[
P_{(X|Y=y)} = T_{\nu+d_2}\left(\mu_H + \Sigma_{HL}\Sigma_L^{-1}(y - \mu_L),\ \frac{\nu + (y-\mu_L)^{\mathrm T}\Sigma_L^{-1}(y-\mu_L)}{\nu + d_2}\big(\Sigma_H - \Sigma_{HL}\Sigma_L^{-1}\Sigma_{HL}^{\mathrm T}\big)\right).
\]

Since for $\nu > 1$ the expectation of a Student-t distributed random variable is given by its location parameter, Lemma 4.4 yields that
\[
\hat x = \mu_{H_{k^*}} + \Sigma_{HL_{k^*}}\Sigma_{L_{k^*}}^{-1}(y - \mu_{L_{k^*}}).
\]
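This component selection and conditional-mean step translate directly into code. The sketch below is our own (the function names and the list-based parameter layout are assumptions); it implements the Student-t log-density, the selection of the most likely component and the conditional mean.

```python
import numpy as np
from math import lgamma, log, pi

def t_logpdf(x, nu, mu, Sigma):
    """Log-density of the multivariate Student-t distribution T_nu(mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return (lgamma((d + nu) / 2) - lgamma(nu / 2) - d / 2 * log(pi * nu)
            - logdet / 2 - (d + nu) / 2 * log(1 + quad / nu))

def mmse_high_res_patch(y, alpha, nu, mu_H, mu_L, Sigma_HL, Sigma_L):
    """Select the most likely component for the low resolution patch y and
    return the conditional mean of the high resolution patch (needs nu_k > 1)."""
    scores = [log(a) + t_logpdf(y, n, m, S)
              for a, n, m, S in zip(alpha, nu, mu_L, Sigma_L)]
    k = int(np.argmax(scores))
    return mu_H[k] + Sigma_HL[k] @ np.linalg.solve(Sigma_L[k], y - mu_L[k])
```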

To combine the estimates for the patches we use Gaussian windows. For this, we define the weights
\[
w_{ij} = \exp\left(-\frac{\gamma}{2}\left(\Big(i - \frac{q\tau+1}{2}\Big)^2 + \Big(j - \frac{q\tau+1}{2}\Big)^2\right)\right).
\]

Now, for a patch covering the pixels $\{k+1, \dots, k+q\tau\} \times \{l+1, \dots, l+q\tau\}$ we assign the weight $w_{ij}$ to the pixel $(k+i, l+j)$. Then we compute for each pixel the weighted mean of the values from the patches overlapping this pixel. We summarize the method in Algorithm 4.6.
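The Gaussian window above can be sketched as:

```python
import numpy as np

def gaussian_window(q_tau, gamma):
    """Patch weights w_ij = exp(-gamma/2 ((i - c)^2 + (j - c)^2))
    with c = (q*tau + 1)/2 and 1-based indices i, j = 1, ..., q*tau."""
    idx = np.arange(1, q_tau + 1)
    c = (q_tau + 1) / 2
    w = np.exp(-gamma / 2 * (idx - c) ** 2)
    return np.outer(w, w)
```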

Algorithm 4.6 Joint Student-t mixture models (Student-t MMSE)
Input: low resolution image $y \in \mathbb{R}^M$, Student-t mixture model $(\alpha, \nu, \mu, \Sigma) \in \Delta_K \times \mathbb{R}_{>0}^K \times (\mathbb{R}^d)^K \times \mathrm{SPD}(d)^K$ and the parameter $\gamma \in \mathbb{R}_{>0}$.
Output: high resolution image $\hat x \in \mathbb{R}^N$.
for $i \in I$ do
1. $y_i = P_i y$.
2. $k_i^* = \operatorname*{argmin}_{1\le k\le K}\ -\log(\alpha_k) - \log(f(y_i\,|\,\nu_k, \mu_{L_k}, \Sigma_{L_k}))$.
3. $\hat z_i = \mu_{H_{k_i^*}} + \Sigma_{HL_{k_i^*}}\Sigma_{L_{k_i^*}}^{-1}(y_i - \mu_{L_{k_i^*}})$.
4. Average the patches $\{\hat z_i\}_{i\in I}$ by $\hat x_{kl} = \frac{\sum_{i\in I} w_{ikl}\hat z_{ikl}}{\sum_{i\in I} w_{ikl}}$, where the weights $w_{ikl}$ are determined as described above.

4.3. Numerical Results

In this section we apply our methods to some example images. First, in Section 4.3.1, we compare our results with the results from [3]. Then, in Section 4.3.2, we apply our method to material images generated with FIB-SEM microscopy.

As error measures we choose the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). For two images $f, g \in [0,1]^{m\times n}$ the PSNR is defined by
\[
\operatorname{PSNR}(f, g) := 10\log_{10}\left(\frac{mn}{\sum_{i=1}^m\sum_{j=1}^n (f_{ij} - g_{ij})^2}\right).
\]
This means that a higher PSNR indicates a higher similarity of $f$ and $g$. Further, the SSIM of $f$ and $g$ is defined by
\[
\operatorname{SSIM}(f, g) := \frac{(2\mu_f\mu_g + 0.01^2)(2\sigma_{fg} + 0.03^2)}{(\mu_f^2 + \mu_g^2 + 0.01^2)(\sigma_f^2 + \sigma_g^2 + 0.03^2)},
\]
where $\mu_f$, $\mu_g$, $\sigma_f^2$, $\sigma_g^2$ and $\sigma_{fg}$ are defined as follows:

• $\mu_f = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n f_{ij}$ and $\mu_g = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n g_{ij}$ are the means of $f$ and $g$.

• $\sigma_f^2 = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (f_{ij} - \mu_f)^2$ and $\sigma_g^2 = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (g_{ij} - \mu_g)^2$ are the variances of $f$ and $g$.

• $\sigma_{fg} = \frac{1}{mn}\sum_{i=1}^m\sum_{j=1}^n (f_{ij} - \mu_f)(g_{ij} - \mu_g)$ is the covariance of $f$ and $g$.

By some simple calculations one can see that $\operatorname{SSIM}(f, g) \in (0, 1]$ for all $f, g \in [0,1]^{m\times n}$ and that $\operatorname{SSIM}(f, g) = 1$ if and only if $f = g$. Thus, an SSIM close to one indicates a high similarity between $f$ and $g$.
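Both error measures are straightforward to compute. The following sketch (the helper names are our own) implements the PSNR and the global, single-window SSIM as defined above:

```python
import numpy as np

def psnr(f, g):
    """PSNR for images with values in [0, 1]: 10 log10(mn / sum (f-g)^2)."""
    return 10 * np.log10(1.0 / np.mean((f - g) ** 2))

def ssim_global(f, g):
    """Global (single-window) SSIM with the constants 0.01^2 and 0.03^2."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mf, mg = f.mean(), g.mean()
    cov = ((f - mf) * (g - mg)).mean()
    return ((2 * mf * mg + c1) * (2 * cov + c2)) / \
           ((mf ** 2 + mg ** 2 + c1) * (f.var() + g.var() + c2))
```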

4.3.1. Comparison to Gaussian Mixture models

The methods described in the previous sections were implemented for Gaussian mixture models several times (see e.g. [3, 31, 35, 45]). To compare our implementation for Student-t mixture models with Gaussian mixture models we use the results from [3]2. The implementation used in [3] is based on the implementation of [31]3.

2https://github.com/BenoitAune/Stage-Super-Resolution 3https://github.com/pshibby/fepll_public

For the comparison we use the superresolution operator $A$ from the implementation of [31]. This operator consists of a blur operator $H$ and a downsampling operator $S$. The blur operator is given by a convolution with a Gaussian kernel of width $0.5$. The downsampling operator $S : \mathbb{R}^{m,n} \to \mathbb{R}^{m_2,n_2}$ is given by
\[
S = \frac{m_2 n_2}{mn}\,\mathcal{F}_{m_2,n_2}^{-1}\, D\, \mathcal{F}_{m,n},
\]
where $\mathcal{F}_{m,n}$ is the discrete Fourier transform and where for $x \in \mathbb{R}^{m,n}$ the $(i,j)$-th entry of $D(x)$ is given by
\[
D(x)_{i,j} = \begin{cases}
x_{i,j}, & \text{if } i \le \frac{m_2}{2} \text{ and } j \le \frac{n_2}{2},\\
x_{i+m-m_2,j}, & \text{if } i > \frac{m_2}{2} \text{ and } j \le \frac{n_2}{2},\\
x_{i,j+n-n_2}, & \text{if } i \le \frac{m_2}{2} \text{ and } j > \frac{n_2}{2},\\
x_{i+m-m_2,j+n-n_2}, & \text{if } i > \frac{m_2}{2} \text{ and } j > \frac{n_2}{2}.
\end{cases}
\]
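A minimal sketch of the downsampling operator $S$ (assuming even $m_2$, $n_2$; the 0-based FFT indexing is our own bookkeeping) reads:

```python
import numpy as np

def fourier_downsample(x, m2, n2):
    """Apply S = (m2 n2)/(m n) F^{-1} D F: keep the m2 x n2 lowest frequencies."""
    m, n = x.shape
    X = np.fft.fft2(x)
    rows = np.r_[0:m2 // 2, m - m2 // 2:m]
    cols = np.r_[0:n2 // 2, n - n2 // 2:n]
    D = X[np.ix_(rows, cols)]
    return (m2 * n2) / (m * n) * np.fft.ifft2(D).real
```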

We create the low resolution image $y$ by applying the operator $A$ to some example images of size $512 \times 512$ and adding Gaussian noise with standard deviation $\sigma = 2$. We estimate a mixture model based on the upper left quarter of the high resolution image. Then we apply the two methods described in the previous sections. Figure 8 shows the results for $K = 200$ components. The resulting PSNRs and SSIMs with respect to the original image are given in Table 3. We refer to the methods described in Sections 4.2.1 and 4.2.2 as Student-t EPLL and Student-t MMSE, respectively. With GMM-EPLL and GMM-MMSE we refer to the methods proposed in [45] and [35], respectively. The results for the GMM-EPLL and GMM-MMSE are taken from [3]. On the low resolution image, we use the patch size $\tau = 4$ and compute our results for the magnification factor $q = 2$. For the Student-t MMSE we used the parameter $\gamma = 0.12$. Further, we compare our results with the L2-TV minimizer, i.e. the solution of the minimization problem

\[
\hat x = \operatorname*{argmin}_{x\in\mathbb{R}^N}\|Ax - y\|_2^2 + \lambda\|x\|_{TV}.
\]

We observe that the result of the Student-t EPLL is in most cases slightly better than the result of the GMM-EPLL. The differences between the results of the Student-t MMSE and GMM-MMSE are very small. Note that the estimation of the mixture model for the MMSE might be imprecise: We have only approximately 15000 samples to estimate a mixture model of dimension 80 with $K = 100, 200, 300$ components. The parameters of this mixture model live on a manifold of dimension approximately 325000 for $K = 100$ and 975000 for $K = 300$. Thus the number of samples might not be sufficient for an accurate estimation of the mixture model.

Image      | Algorithm        | K = 100 PSNR / SSIM | K = 200 PSNR / SSIM | K = 300 PSNR / SSIM
Cameraman  | GMM-EPLL         | 32.46 / 0.774 | 32.44 / 0.770 | 32.43 / 0.769
           | Student-t EPLL   | 32.20 / 0.770 | 32.21 / 0.771 | 32.17 / 0.769
           | GMM-MMSE         | 32.00 / 0.771 | 32.08 / 0.782 | 32.16 / 0.787
           | Student-t MMSE   | 32.10 / 0.775 | 32.09 / 0.779 | 31.84 / 0.770
           | L2-TV, λ = 0.25  | 31.25 / 0.767 | –             | –
Barbara    | GMM-EPLL         | 25.15 / 0.777 | 25.15 / 0.778 | 25.14 / 0.777
           | Student-t EPLL   | 25.21 / 0.773 | 25.32 / 0.793 | 25.30 / 0.791
           | GMM-MMSE         | 25.01 / 0.776 | 25.01 / 0.778 | 25.01 / 0.778
           | Student-t MMSE   | 24.92 / 0.766 | 24.94 / 0.770 | 24.90 / 0.769
           | L2-TV, λ = 0.07  | 25.17 / 0.754 | –             | –
Hill       | GMM-EPLL         | 31.15 / 0.827 | 31.19 / 0.828 | 31.21 / 0.828
           | Student-t EPLL   | 31.48 / 0.844 | 31.39 / 0.843 | 31.52 / 0.845
           | GMM-MMSE         | 30.90 / 0.835 | 30.99 / 0.838 | 31.14 / 0.844
           | Student-t MMSE   | 30.95 / 0.834 | 31.08 / 0.838 | 31.15 / 0.841
           | L2-TV, λ = 0.35  | 31.02 / 0.834 | –             | –

Table 3: Results of the superresolution methods.

Note that the assumptions of the EPLL and the MMSE differ: The EPLL assumes that the superresolution operator $A$ is known. For the MMSE we consider $A$ as unknown. Further, the MMSE assumes for the estimation of the mixture model that pairs of high resolution and low resolution patches are given. These pairs have to belong to the same part of the image. For the EPLL we only need the high resolution patches to estimate the mixture model.

4.3.2. FIB-SEM images

We apply the Student-t EPLL (see Section 4.2.1) to material images produced by Focused Ion Beam and Scanning Electron Microscopy (FIB-SEM). Here we are given some images (see Figure 9) of resolution $2560 \times 2560$. To test our method we estimate a Student-t mixture model based on the patches of one of the images. We artificially reduce the resolution of the other image using the operator $A$ from the previous section and reconstruct it using the Student-t EPLL. We use magnification factors $q \in \{4, 8\}$. The results are given in Figures 10, 11, 12 and 13. We compare them with L2-TV with $\lambda = 0.05$. We observe that the error measures of the results are not very good. One reason for this can be that the original images are very noisy.

Figure 8: Reconstruction of the high resolution images using the Student-t EPLL with $K = 200$ classes. Each row shows a high resolution image, the corresponding low resolution image and the reconstructed image.

5. Conclusion and Future Work

In Section 3 we proposed Algorithms 3.2, 3.3 and 3.4 as alternatives to the classical EM algorithm 3.1 for estimating the parameters of a multivariate Student-t distribution, in particular the degree of freedom parameter $\nu$. We showed that for all of these algorithms the negative log-likelihood function decreases in each iteration step and that for Algorithms 3.2 and 3.3 cluster points of the iterates are critical points. To prove that the whole sequence of iterates converges is still open. Further, we examined the convergence behavior of all algorithms in some numerical experiments and observed that Algorithm 3.3 has the best performance.

In Section 4 we considered Student-t mixture models. To estimate the parameters of the mixture models, we proposed Algorithms 4.2, 4.3 and 4.4 as alternatives to the classical EM algorithm 4.1. We compared these algorithms in a simulation study and observed that Algorithm 4.2 has the best performance. However, a theoretical result concerning the convergence of Algorithms 4.2, 4.3 and 4.4 is still open. We used Student-t mixture models for superresolution by adapting two superresolution methods which were originally proposed for Gaussian mixture models. Finally, we provided some proof-of-concept examples using these methods. To refine these numerical examples and to compare or combine them with other methods is left for further research. In particular, a comparison to further methods would be interesting. Furthermore, we would like to adapt the numerical examples to 3D images.

A. Examples for the EM Algorithm

A.1. EM Algorithm for Student-t distributions

In this section we derive the classical EM algorithm for the multivariate Student-t distribution $T_\nu(\mu, \Sigma)$, where the degree of freedom $\nu > 0$ is unknown. We follow the lines of [22]. We can represent a Student-t distributed random variable $X$ as
\[
X = \mu + \frac{RZ}{\sqrt Y}
\]

Figure 9: Material images (Material I, images I and II; Material II, images I and II).

Figure 10: Reconstruction of the high resolution image using the Student-t EPLL with $K = 100$ classes and magnification factor $q = 4$ (PSNR = 27.59, SSIM = 0.499). The columns show the high resolution image, the low resolution image, the EPLL result and the L2-TV result; in the second, third and fourth row we zoom into different regions of the image. We compare our results with L2-TV ($\lambda = 0.05$, PSNR = 27.70, SSIM = 0.528).

Figure 11: Reconstruction of the high resolution image using the Student-t EPLL with $K = 100$ classes and magnification factor $q = 8$ (PSNR = 24.32, SSIM = 0.414). The columns show the high resolution image, the low resolution image, the EPLL result and the L2-TV result; in the second, third and fourth row we zoom into different regions of the image. We compare our results with L2-TV ($\lambda = 0.05$, PSNR = 24.37, SSIM = 0.411).

Figure 12: Reconstruction of the high resolution image using the Student-t EPLL with $K = 100$ classes and magnification factor $q = 4$ (PSNR = 29.80, SSIM = 0.734). The columns show the high resolution image, the low resolution image, the EPLL result and the L2-TV result; in the second, third and fourth row we zoom into different regions of the image. We compare our results with L2-TV ($\lambda = 0.05$, PSNR = 30.11, SSIM = 0.7480).

Figure 13: Reconstruction of the high resolution image using the Student-t EPLL with $K = 100$ classes and magnification factor $q = 8$ (PSNR = 23.34, SSIM = 0.5853). The columns show the high resolution image, the low resolution image, the EPLL result and the L2-TV result; in the second, third and fourth row we zoom into different regions of the image. We compare our results with L2-TV ($\lambda = 0.05$, PSNR = 23.75, SSIM = 0.5872).

with $Z \sim N(0, \mathrm{Id})$, $RR^{\mathrm T} = \Sigma$ and $Y \sim \Gamma(\frac\nu2, \frac\nu2)$, where $Y$ and $Z$ are independent. Now let $x_1, \dots, x_n$ be i.i.d. samples of $X_i = \mu + \frac{RZ_i}{\sqrt{Y_i}} \sim T_\nu(\mu, \Sigma)$ and let $y_1, \dots, y_n$ be the corresponding i.i.d. realizations of $Y_i \sim \Gamma(\frac\nu2, \frac\nu2)$. Then the distribution of $X_i$ given $Y_i = y_i$ is given by
\[
(X_i\,|\,Y_i = y_i) \sim N\Big(\mu, \frac{\Sigma}{y_i}\Big).
\]
Thus, for given $y_i$, we can estimate the parameters $\mu$ and $\Sigma$ using the sample mean and covariance. Now we derive the two steps of the EM algorithm as follows:

E-Step: It holds by Bayes formula that

\[
f_{X,Y}(x, y\,|\,\vartheta) = f_{(X|Y=y)}(x\,|\,\vartheta)\,f_Y(y\,|\,\vartheta).
\]

This yields that

\[
\log(f_{X,Y}(x, y\,|\,\vartheta)) = \log(f_{(X|Y=y)}(x\,|\,\vartheta)) + \log(f_Y(y\,|\,\vartheta)).
\]

Thus the complete log-likelihood function $\log(f(x, y\,|\,\vartheta))$ is up to some constants given by
\begin{align*}
\log(f_{X,Y}(x, y\,|\,\vartheta)) =\ &-\frac n2\log(|\Sigma|) + \frac d2\sum_{i=1}^n\log(y_i) - \frac12\sum_{i=1}^n y_i(x_i-\mu)^{\mathrm T}\Sigma^{-1}(x_i-\mu) + \frac{n\nu}2\log\Big(\frac\nu2\Big)\\
&- n\log\Big(\Gamma\Big(\frac\nu2\Big)\Big) + \Big(\frac\nu2 - 1\Big)\sum_{i=1}^n\log(y_i) - \frac\nu2\sum_{i=1}^n y_i + \mathrm{const}.
\end{align*}

Now we compute the Q-function. Note that Bayes formula yields that

\[
f_{(Y|X=x)}(y\,|\,\vartheta)\,f_X(x\,|\,\vartheta) = f_{(X|Y=y)}(x\,|\,\vartheta)\,f_Y(y\,|\,\vartheta).
\]

Thus it holds

\begin{align*}
f_{(Y|X=x)}(y) &\propto \exp\Big(-\frac12(x-\mu)^{\mathrm T}\Big(\frac\Sigma y\Big)^{-1}(x-\mu)\Big)\Big|\frac\Sigma y\Big|^{-\frac12}\,y^{\frac\nu2-1}\exp\Big(-\frac\nu2 y\Big)\mathbb{1}_{\mathbb{R}_{\ge0}}(y)\\
&\propto y^{\frac{\nu+d}2-1}\exp\Big(-\frac12\big(\nu + (x-\mu)^{\mathrm T}\Sigma^{-1}(x-\mu)\big)y\Big)\mathbb{1}_{\mathbb{R}_{\ge0}}(y).
\end{align*}

Up to a normalization constant, this is the density function of a gamma distribution. Thus it holds for the conditional distribution of $Y$ given $X = x$ with respect to $P_\vartheta$ that

\[
P_{(Y|X=x)} = \Gamma\Big(\frac{\nu+d}2,\ \frac{\nu + (x-\mu)^{\mathrm T}\Sigma^{-1}(x-\mu)}2\Big).
\]

This yields
\[
\mathbb{E}_{P_\vartheta}(Y\,|\,X=x) = \frac{\nu+d}{\nu + (x-\mu)^{\mathrm T}\Sigma^{-1}(x-\mu)}.
\]
Now the following lemma allows us to compute the conditional expectation of $\log(Y)$ given $X = x$. For the proof we refer to [22].

Lemma A.1. Let $Y \sim \Gamma(\alpha, \beta)$. Then it holds $\mathbb{E}(\log(Y)) = \psi(\alpha) - \log(\beta)$.

Thus we have that

\[
\mathbb{E}_{P_\vartheta}(\log(Y)\,|\,X=x) = \psi\Big(\frac{\nu+d}2\Big) - \log\Big(\frac{\nu + (x-\mu)^{\mathrm T}\Sigma^{-1}(x-\mu)}2\Big).
\]

Putting everything together we get (up to some constants) that

\begin{align*}
Q(\vartheta, \vartheta^{(r)}) =\ &\mathbb{E}_{P_{\vartheta^{(r)}}}\big(\log(f_{X,Y}(X, Y\,|\,\vartheta))\,|\,X=x\big)\\
=\ &-\frac n2\log(|\Sigma|) + \frac d2\sum_{i=1}^n\Big(\psi\Big(\frac{\nu^{(r)}+d}2\Big) - \log\Big(\frac{\nu^{(r)} + (x_i-\mu^{(r)})^{\mathrm T}(\Sigma^{(r)})^{-1}(x_i-\mu^{(r)})}2\Big)\Big)\\
&-\frac12\sum_{i=1}^n\frac{\nu^{(r)}+d}{\nu^{(r)} + (x_i-\mu^{(r)})^{\mathrm T}(\Sigma^{(r)})^{-1}(x_i-\mu^{(r)})}(x_i-\mu)^{\mathrm T}\Sigma^{-1}(x_i-\mu)\\
&+\frac{n\nu}2\log\Big(\frac\nu2\Big) - n\log\Big(\Gamma\Big(\frac\nu2\Big)\Big)\\
&+\Big(\frac\nu2 - 1\Big)\sum_{i=1}^n\Big(\psi\Big(\frac{\nu^{(r)}+d}2\Big) - \log\Big(\frac{\nu^{(r)} + (x_i-\mu^{(r)})^{\mathrm T}(\Sigma^{(r)})^{-1}(x_i-\mu^{(r)})}2\Big)\Big)\\
&-\frac\nu2\sum_{i=1}^n\frac{\nu^{(r)}+d}{\nu^{(r)} + (x_i-\mu^{(r)})^{\mathrm T}(\Sigma^{(r)})^{-1}(x_i-\mu^{(r)})}.
\end{align*}

Using the notation
\[
\gamma_i^{(r)} = \frac{\nu^{(r)}+d}{\nu^{(r)} + (x_i-\mu^{(r)})^{\mathrm T}(\Sigma^{(r)})^{-1}(x_i-\mu^{(r)})},
\]
the Q-function reads as follows

\[
Q(\vartheta, \vartheta^{(r)}) = Q_1(\nu, \vartheta^{(r)}) + Q_2((\mu, \Sigma), \vartheta^{(r)}) + C, \tag{34}
\]
where $C$ does not depend on $\vartheta$ and $Q_1$ and $Q_2$ are given by
\[
Q_1(\nu, \vartheta^{(r)}) = \frac{n\nu}2\log\Big(\frac\nu2\Big) - n\log\Big(\Gamma\Big(\frac\nu2\Big)\Big) + \frac{n\nu}2\psi\Big(\frac{\nu^{(r)}+d}2\Big) - \frac{n\nu}2\log\Big(\frac{\nu^{(r)}+d}2\Big) + \frac\nu2\sum_{i=1}^n\big(\log(\gamma_i^{(r)}) - \gamma_i^{(r)}\big)
\]
and
\[
Q_2((\mu, \Sigma), \vartheta^{(r)}) = -\frac n2\log(|\Sigma|) - \frac12\sum_{i=1}^n\gamma_i^{(r)}(x_i-\mu)^{\mathrm T}\Sigma^{-1}(x_i-\mu).
\]

M-Step: We maximize the Q function (34). For this we have to maximize Q1 and Q2.

Note that $Q_2$ is up to some constants the log-likelihood function of normal distributed random variables with mean $\mu$ and covariances $\frac{1}{\gamma_i^{(r)}}\Sigma$. Thus the maximizers of $Q_2$ are given by

\[
\mu^{(r+1)} = \frac{\sum_{i=1}^n\gamma_i^{(r)}x_i}{\sum_{i=1}^n\gamma_i^{(r)}}, \qquad
\Sigma^{(r+1)} = \frac1n\sum_{i=1}^n\gamma_i^{(r)}(x_i - \mu^{(r+1)})(x_i - \mu^{(r+1)})^{\mathrm T}.
\]

We derive the maximizer of $Q_1$ by setting the derivative with respect to $\nu$ to zero. Thus the maximizer is given by the solution of
\[
0 = -\psi\Big(\frac\nu2\Big) + \log\Big(\frac\nu2\Big) + \psi\Big(\frac{\nu^{(r)}+d}2\Big) - \log\Big(\frac{\nu^{(r)}+d}2\Big) + 1 + \frac1n\sum_{i=1}^n\big(\log(\gamma_i^{(r)}) - \gamma_i^{(r)}\big).
\]

The solution exists and is unique because the right hand side is strictly decreasing in $\nu$ and tends to $\infty$ as $\nu \to 0$.

Now the whole algorithm is given by Algorithm 3.1.
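The E-step weights and the $\mu$, $\Sigma$ updates of this EM algorithm can be sketched as follows. In this sketch $\nu$ is kept fixed; the scalar root-finding for the $\nu$ update, which requires the digamma function, is omitted.

```python
import numpy as np

def em_step(x, nu, mu, Sigma):
    """One EM iteration for T_nu(mu, Sigma) with fixed nu:
    gamma_i = (nu + d) / (nu + (x_i - mu)^T Sigma^{-1} (x_i - mu)),
    then weighted mean and weighted covariance (scaled by 1/n)."""
    n, d = x.shape
    diff = x - mu
    delta = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigma), diff)
    gamma = (nu + d) / (nu + delta)
    mu_new = gamma @ x / gamma.sum()
    diff_new = x - mu_new
    Sigma_new = (gamma[:, None] * diff_new).T @ diff_new / n
    return mu_new, Sigma_new
```

For very large $\nu$ the weights $\gamma_i$ approach $1$ and the update reduces to the ordinary sample mean, which gives a quick sanity check.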

A.2. EM Algorithm for Mixture Models

In the following we derive the EM algorithm for mixture models. We again follow the lines of [22]. Let $f(x\,|\,\vartheta)$ be the density function of a parametric distribution. We consider a mixture model with $K$ components and density function
\[
f(x\,|\,\alpha, \vartheta) = \sum_{k=1}^K\alpha_k f(x\,|\,\vartheta_k), \qquad \vartheta = (\vartheta_1, \dots, \vartheta_K),\ \alpha \in \Delta_K.
\]

Given some i.i.d. samples $x_1, \dots, x_n$ we want to estimate $(\alpha, \vartheta)$. For $i = 1, \dots, n$ we introduce the labels $z_i = (z_{i1}, \dots, z_{iK})$, where $z_{ij} = 1$ if the sample $x_i$ was generated by the $j$-th component and $z_{ij} = 0$ otherwise. Now the log-likelihood function with respect to $x$ and

87 z is given by

$z$ is given by
\[
\ell(\alpha, \vartheta\,|\,x, z) = \log\Big(\prod_{i=1}^n f(x_i, z_i\,|\,\alpha, \vartheta)\Big) = \sum_{i=1}^n\sum_{k=1}^K z_{ik}\log(\alpha_k f(x_i\,|\,\vartheta_k)).
\]

Now we derive the two steps of the EM algorithm.

E-Step: Since ` is linear in z we get that

\begin{align*}
Q((\alpha, \vartheta), (\alpha^{(r)}, \vartheta^{(r)})) &= \mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(\ell(\alpha, \vartheta\,|\,X, Z)\,|\,X=x)\\
&= \sum_{i=1}^n\sum_{k=1}^K\mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(Z_{ik}\log(\alpha_k f(X_i\,|\,\vartheta_k))\,|\,X=x)\\
&= \sum_{i=1}^n\sum_{k=1}^K\mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(Z_{ik}\,|\,X=x)\log(\alpha_k f(x_i\,|\,\vartheta_k))\\
&= \ell(\alpha, \vartheta\,|\,x, \mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(Z\,|\,X=x))\\
&= \ell(\alpha, \vartheta\,|\,x, \beta^{(r)}),
\end{align*}

(r) (r) where β = (βik )i∈{1,...,n},k∈{1,...,K} is given by

(r) (r) (r) αk f(xi ϑk ) β = EP (r) (r) (Zik X = x) = P(α(r),ϑ(r))(Zik = 1 X = x) = | . ik (α ,ϑ ) PK (r) (r) | | α f(xi ϑ ) j=1 j | j M-Step: For fixed ϑ Θ we maximize Q((α, ϑ), (α(r), ϑ(r))) under the constraint that ∈ PK g(α) = 0, where g(α) = αk 1 using Lagrange multipliers. It holds k=1 − ∂ g(α) = 1 ∂αk and n ∂ 1 X (r) Q((α, ϑ), (α(r), ϑ(r))) = β . ∂α α ik k k i=1 Thus for any maximizer αˆ of Q((α, ϑ), (α(r), ϑ(r))) there exists λ R such that for all ∈ k = 1, ..., K it holds n X (r) αˆk = λ βik . i=1

Using the constraint $g(\hat\alpha) = 0$ we get
\[
1 = \lambda\sum_{k=1}^K\sum_{i=1}^n\beta_{ik}^{(r)} = \lambda n.
\]

Therefore it holds $\hat\alpha_k = \frac1n\sum_{i=1}^n\beta_{ik}^{(r)}$. Further, it holds
\[
\hat\vartheta = \operatorname*{argmax}_{\vartheta\in\Theta^K}\sum_{k=1}^K\sum_{i=1}^n\beta_{ik}^{(r)}\log(\alpha_k f(x_i\,|\,\vartheta_k)) = \Big(\operatorname*{argmax}_{\vartheta_k\in\Theta}\sum_{i=1}^n\beta_{ik}^{(r)}\log(f(x_i\,|\,\vartheta_k))\Big)_{k=1,\dots,K}.
\]

Since $\hat\alpha$ and $\hat\vartheta$ do not depend on each other, the maximizer of $Q$ is given by $(\hat\alpha, \hat\vartheta)$. We summarize the algorithm in Algorithm A.1.

Algorithm A.1 EM Algorithm for Mixture Models
Input: $x = (x_1, \dots, x_n) \in \mathbb{R}^{d_1\times n}$, initial estimates $\alpha^{(0)} \in \Delta_K$, $\vartheta^{(0)} \in \Theta^K$
for $r = 0, 1, \dots$ do
E-Step: For $k = 1, \dots, K$ compute
\[
\beta_{ik}^{(r)} = \frac{\alpha_k^{(r)} f(x_i\,|\,\vartheta_k^{(r)})}{\sum_{j=1}^K\alpha_j^{(r)} f(x_i\,|\,\vartheta_j^{(r)})}.
\]
M-Step: For $k = 1, \dots, K$ compute
\[
\alpha_k^{(r+1)} = \frac1n\sum_{i=1}^n\beta_{ik}^{(r)}, \qquad
\vartheta_k^{(r+1)} = \operatorname*{argmax}_{\vartheta_k}\Big\{\sum_{i=1}^n\beta_{ik}^{(r)}\log(f(x_i\,|\,\vartheta_k))\Big\}.
\]
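The E-step and the weight update of Algorithm A.1 can be sketched as follows; the log-sum-exp normalization is our own numerical safeguard, and `logpdf` stands for any component log-density.

```python
import numpy as np

def em_mixture_step(x, alpha, theta, logpdf):
    """Responsibilities beta_ik and the weight update alpha_k = mean_i beta_ik."""
    logw = np.log(alpha)[None, :] + np.array(
        [[logpdf(xi, th) for th in theta] for xi in x])
    logw -= logw.max(axis=1, keepdims=True)     # stabilize the softmax
    beta = np.exp(logw)
    beta /= beta.sum(axis=1, keepdims=True)
    alpha_new = beta.mean(axis=0)
    return beta, alpha_new
```

The component parameters $\vartheta_k$ are then updated by separate weighted maximum likelihood problems, as in the M-step above.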

A.3. EM Algorithm for Student-t Mixture Models

Now we consider the EM algorithm for Student-t mixture models. Here the density function is given by
\[
f(x\,|\,\alpha, \vartheta) = \sum_{k=1}^K\alpha_k f(x\,|\,\vartheta_k), \qquad \vartheta_k = (\nu_k, \mu_k, \Sigma_k) \in \mathbb{R}_{>0}\times\mathbb{R}^d\times\mathrm{SPD}(d),
\]
where
\[
f(x\,|\,\nu, \mu, \Sigma) = \frac{\Gamma\big(\frac{d+\nu}2\big)}{\Gamma\big(\frac\nu2\big)(\pi\nu)^{\frac d2}|\Sigma|^{\frac12}}\Big(1 + \frac1\nu(x-\mu)^{\mathrm T}\Sigma^{-1}(x-\mu)\Big)^{-\frac{d+\nu}2}.
\]

We assign to each random variable $X_i$ the random variables $Z_{ik}$, $k = 1, \dots, K$, with $Z_{ik} = 1$ if and only if $X_i$ belongs to the $k$-th component of the mixture model. Further, we assign to $X_i$ the random variable $Y_i$ such that the conditional distribution of $(Y_i\,|\,Z_{ik} = 1)$ is given by $\Gamma(\frac{\nu_k}2, \frac{\nu_k}2)$ and the distribution of $(X_i\,|\,Y_i = y_i, Z_{ik} = 1)$ is given by $N\big(\mu_k, \frac{\Sigma_k}{y_i}\big)$.

E-Step: Using Bayes formula, the Q-function is given by

\begin{align*}
Q((\alpha, \vartheta), (\alpha^{(r)}, \vartheta^{(r)})) =\ &\mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(\log(f_{Y,Z}(Y, Z\,|\,\alpha, \vartheta))\,|\,X=x)\\
=\ &\mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(\log(f_{(Y|Z=Z)}(Y\,|\,\alpha, \vartheta))\,|\,X=x) \tag{35}\\
&+ \mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(\log(f_Z(Z\,|\,\alpha, \vartheta))\,|\,X=x).
\end{align*}

Now the first summand is given by

\[
\sum_{i=1}^n\sum_{k=1}^K P_{(\alpha^{(r)},\vartheta^{(r)})}(Z_{ik}=1\,|\,X=x)\,\mathbb{E}_{P_{(\alpha^{(r)},\vartheta^{(r)})}}(\log(f_{(Y_i|Z_{ik}=1)}(y_i\,|\,\alpha, \vartheta))\,|\,X=x).
\]

By Section A.2 we know that $P_{\vartheta^{(r)}}(Z_{ik}=1\,|\,X=x) = \beta_{ik}^{(r)}$ and by Section A.1 we know that

\[
\mathbb{E}_{P_{\vartheta^{(r)}}}(\log(f_{(Y_i|Z_{ik}=1)}(y_i\,|\,\alpha, \vartheta))\,|\,X=x) = Q_{i,k,1}(\nu_k, \vartheta^{(r)}) + Q_{i,k,2}((\mu_k, \Sigma_k), \vartheta^{(r)}) + \mathrm{const},
\]
where
\[
Q_{i,k,1}(\nu_k, \vartheta^{(r)}) = \frac{\nu_k}2\log\Big(\frac{\nu_k}2\Big) - \log\Big(\Gamma\Big(\frac{\nu_k}2\Big)\Big) + \frac{\nu_k}2\psi\Big(\frac{\nu_k^{(r)}+d}2\Big) - \frac{\nu_k}2\log\Big(\frac{\nu_k^{(r)}+d}2\Big) + \frac{\nu_k}2\big(\log(\gamma_{ik}^{(r)}) - \gamma_{ik}^{(r)}\big)
\]
and
\[
Q_{i,k,2}((\mu_k, \Sigma_k), \vartheta^{(r)}) = -\frac12\log(|\Sigma_k|) - \frac12\gamma_{ik}^{(r)}(x_i-\mu_k)^{\mathrm T}\Sigma_k^{-1}(x_i-\mu_k)
\]
with
\[
\gamma_{ik}^{(r)} = \frac{\nu_k^{(r)}+d}{\nu_k^{(r)} + (x_i-\mu_k^{(r)})^{\mathrm T}(\Sigma_k^{(r)})^{-1}(x_i-\mu_k^{(r)})}.
\]

Now the second summand in (35) is given by
\[
\sum_{i=1}^n\sum_{k=1}^K P_{(\alpha^{(r)},\vartheta^{(r)})}(Z_{ik}=1\,|\,X=x)\log(f_Z(1\,|\,\alpha, \vartheta)) = \sum_{i=1}^n\sum_{k=1}^K\beta_{ik}^{(r)}\log(\alpha_k).
\]

Thus the Q-function is given by

(r) (r) (r) (r) (r) (r) (r) (r) Q((α, ϑ), (α , ϑ )) = Q1(ν, (α , ϑ )) + Q2((µ, Σ), (α , ϑ )) + Q3(α, (α , ϑ )), where

$$\begin{aligned} Q_1(\nu, (\alpha^{(r)}, \vartheta^{(r)})) &= \sum_{k=1}^K \sum_{i=1}^n \beta_{ik}^{(r)}\, Q_{i,k,1}(\nu_k, \vartheta^{(r)}), \\ Q_2((\mu, \Sigma), (\alpha^{(r)}, \vartheta^{(r)})) &= \sum_{k=1}^K \sum_{i=1}^n \beta_{ik}^{(r)}\, Q_{i,k,2}((\mu_k, \Sigma_k), \vartheta^{(r)}), \\ Q_3(\alpha, (\alpha^{(r)}, \vartheta^{(r)})) &= \sum_{k=1}^K \sum_{i=1}^n \beta_{ik}^{(r)} \log(\alpha_k). \end{aligned}$$

M-Step: We maximize $Q_1$, $Q_2$ and $Q_3$ separately. With the same calculation as in the M-step of Section A.2, the maximizer of $Q_3$ is given by

$$\hat\alpha_k = \frac{1}{n}\sum_{i=1}^n \beta_{ik}^{(r)}.$$

Using the same arguments as in Section A.1, the maximizer of $\sum_{i=1}^n \beta_{ik}^{(r)} Q_{i,k,2}((\mu_k, \Sigma_k), \vartheta^{(r)})$ is given by

$$\hat\mu_k = \frac{\sum_{i=1}^n \beta_{ik}^{(r)} \gamma_{ik}^{(r)} x_i}{\sum_{i=1}^n \beta_{ik}^{(r)} \gamma_{ik}^{(r)}}, \qquad \hat\Sigma_k = \frac{\sum_{i=1}^n \beta_{ik}^{(r)} \gamma_{ik}^{(r)} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\mathrm{T}}{\sum_{i=1}^n \beta_{ik}^{(r)}},$$

for $k = 1, \ldots, K$. Again by the same arguments as in Section A.1, the maximizer $\hat\nu_k$ of $\sum_{i=1}^n \beta_{ik}^{(r)} Q_{i,k,1}(\nu_k, \vartheta^{(r)})$ is given by the solution of

$$0 = \psi\Big(\frac{d+\nu_k^{(r)}}{2}\Big) - \log\Big(\frac{d+\nu_k^{(r)}}{2}\Big) - \psi\Big(\frac{\nu_k}{2}\Big) + \log\Big(\frac{\nu_k}{2}\Big) + 1 + \frac{\sum_{i=1}^n \beta_{ik}^{(r)}\big(\log(\gamma_{ik}^{(r)}) - \gamma_{ik}^{(r)}\big)}{\sum_{i=1}^n \beta_{ik}^{(r)}}.$$
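This equation has no closed-form solution, but since $\log(x) - \psi(x)$ is strictly decreasing, its left-hand side is strictly decreasing in $\nu_k$ and the root can be found by any bracketing method. A sketch using `scipy.optimize.brentq` (illustrative, not from the thesis; the bracket endpoints are our own choice):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

def update_nu(beta_k, gamma_k, nu_r, d, bracket=(1e-4, 1e4)):
    """Solve the nu-update equation for one component k.

    beta_k, gamma_k: per-sample weights beta_ik^(r), gamma_ik^(r);
    nu_r: previous iterate nu_k^(r); bracket: assumed to contain the root."""
    c = (digamma((d + nu_r) / 2) - np.log((d + nu_r) / 2) + 1
         + (beta_k * (np.log(gamma_k) - gamma_k)).sum() / beta_k.sum())
    # -psi(nu/2) + log(nu/2) decreases strictly from +inf to 0, so the root
    # is unique whenever c < 0
    g = lambda nu: -digamma(nu / 2) + np.log(nu / 2) + c
    return brentq(g, *bracket)
```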

The resulting algorithm is given in Algorithm 4.1.
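The M-step updates $\hat\mu_k$ and $\hat\Sigma_k$ above can be sketched as follows (illustrative code, not from the thesis; `beta_k` and `gamma_k` denote the per-sample weights of one fixed component $k$):

```python
import numpy as np

def m_step_mu_sigma(X, beta_k, gamma_k):
    """Weighted M-step updates for one component k:
    mu_hat    = sum_i beta*gamma*x_i / sum_i beta*gamma,
    Sigma_hat = sum_i beta*gamma*(x_i - mu_hat)(x_i - mu_hat)^T / sum_i beta."""
    bg = beta_k * gamma_k
    mu_hat = (bg[:, None] * X).sum(axis=0) / bg.sum()
    diff = X - mu_hat
    # weighted sum of outer products (x_i - mu_hat)(x_i - mu_hat)^T
    Sigma_hat = (bg[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / beta_k.sum()
    return mu_hat, Sigma_hat
```

With all weights equal to one, these updates reduce to the sample mean and the (biased) sample covariance, which serves as a quick consistency check.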

B. Auxiliary Lemmas

Lemma B.1. Let $x_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, and $w \in \mathring\Delta_n$ fulfill Assumption 3.1. Let $(\nu_r, \Sigma_r)_r$ be a sequence in $\mathbb{R}_{>0} \times \mathrm{SPD}(d)$ with $\nu_r \to 0$ as $r \to \infty$ (or such that $(\nu_r)_r$ has a subsequence which converges to zero). Then $(\nu_r, \Sigma_r)_r$ cannot be a minimizing sequence of $L(\nu, \Sigma)$.

Proof. We write

$$L(\nu, \Sigma) = g(\nu) + L_\nu(\Sigma), \quad \text{where} \quad g(\nu) = 2\log\Big(\Gamma\Big(\frac{\nu}{2}\Big)\Big) - 2\log\Big(\Gamma\Big(\frac{d+\nu}{2}\Big)\Big) - \nu\log(\nu).$$

Then it holds $\lim_{\nu \to 0} g(\nu) = \infty$. Hence it is sufficient to show that $(\nu_r, \Sigma_r)_r$ has a subsequence $(\nu_{r_k}, \Sigma_{r_k})_k$ such that $\big(L_{\nu_{r_k}}(\Sigma_{r_k})\big)_k$ is bounded from below. Denote by $\lambda_{r,1} \ge \ldots \ge \lambda_{r,d}$ the eigenvalues of $\Sigma_r$.

Case 1: Let $\{\lambda_{r,i} : r \in \mathbb{N},\ i = 1, \ldots, d\} \subseteq [a, b]$ for some $0 < a \le b < \infty$. Then it holds $\liminf_{r\to\infty} \log|\Sigma_r| \ge \log(a^d) = d\log(a)$ and

$$\liminf_{r\to\infty} (d+\nu_r) \sum_{i=1}^n w_i \log(\nu_r + x_i^\mathrm{T} \Sigma_r^{-1} x_i) \ge \lim_{r\to\infty} (d+\nu_r) \sum_{i=1}^n w_i \log\Big(\frac{1}{b} x_i^\mathrm{T} x_i\Big) = d \sum_{i=1}^n w_i \log\Big(\frac{1}{b} x_i^\mathrm{T} x_i\Big).$$

Note that Assumption 3.1 ensures $x_i \ne 0$ and thus $x_i^\mathrm{T} x_i > 0$ for $i = 1, \ldots, n$. Then we get

$$\liminf_{r\to\infty} L_{\nu_r}(\Sigma_r) = \liminf_{r\to\infty} \Big( (d+\nu_r) \sum_{i=1}^n w_i \log(\nu_r + x_i^\mathrm{T} \Sigma_r^{-1} x_i) + \log|\Sigma_r| \Big) \ge d \sum_{i=1}^n w_i \log\Big(\frac{1}{b} x_i^\mathrm{T} x_i\Big) + d\log(a).$$

Hence $(L_{\nu_r}(\Sigma_r))_r$ is bounded from below and $(\nu_r, \Sigma_r)_r$ cannot be a minimizing sequence.

Case 2: Let $\{\lambda_{r,i} : r \in \mathbb{N},\ i = 1, \ldots, d\} \not\subseteq [a, b]$ for all $0 < a \le b < \infty$. Define $\rho_r = \|\Sigma_r\|_F$ and $P_r = \frac{\Sigma_r}{\rho_r}$. Then, by concavity of the log function, it holds

$$\begin{aligned} L_{\nu_r}(\Sigma_r) &= (d+\nu_r) \sum_{i=1}^n w_i \log(\nu_r + x_i^\mathrm{T} \Sigma_r^{-1} x_i) + \log(|\Sigma_r|) \\ &\ge d \sum_{i=1}^n w_i \log(x_i^\mathrm{T} \Sigma_r^{-1} x_i) + \nu_r \log(\nu_r) + \log(|\Sigma_r|) \\ &\ge d \sum_{i=1}^n w_i \log\Big(\frac{1}{\rho_r} x_i^\mathrm{T} P_r^{-1} x_i\Big) + \log(\rho_r^d |P_r|) + \mathrm{const} \\ &= \underbrace{d \sum_{i=1}^n w_i \log(x_i^\mathrm{T} P_r^{-1} x_i) + \log(|P_r|)}_{=: L_0(P_r)} + \mathrm{const}. \end{aligned} \tag{36}$$

Denote by $p_{r,1} \ge \ldots \ge p_{r,d} > 0$ the eigenvalues of $P_r$. Since $\{P_r : r \in \mathbb{N}\}$ is bounded, there exists some $C > 0$ with $C \ge p_{r,1}$ for all $r \in \mathbb{N}$. Thus one of the following cases is fulfilled:

i) There exists a constant $c > 0$ such that $p_{r,d} > c$ for all $r \in \mathbb{N}$.

ii) There exists a subsequence $(P_{r_k})_k$ of $(P_r)_r$ which converges to some $P \in \partial\,\mathrm{SPD}(d)$.

Case 2i): Let $c > 0$ with $p_{r,d} \ge c$ for all $r \in \mathbb{N}$. Then $\liminf_{r\to\infty} \log(|P_r|) \ge \log(c^d) = d\log(c)$ and

$$\liminf_{r\to\infty} d \sum_{i=1}^n w_i \log(x_i^\mathrm{T} P_r^{-1} x_i) \ge d \sum_{i=1}^n w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big).$$

By (36) this yields

$$\liminf_{r\to\infty} L_{\nu_r}(\Sigma_r) \ge \liminf_{r\to\infty} \Big( d \sum_{i=1}^n w_i \log(x_i^\mathrm{T} P_r^{-1} x_i) + \log(|P_r|) \Big) + \mathrm{const} \ge d \sum_{i=1}^n w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big) + d\log(c) + \mathrm{const}.$$

Hence $(L_{\nu_r}(\Sigma_r))_r$ is bounded from below and $(\nu_r, \Sigma_r)_r$ cannot be a minimizing sequence.

Case 2ii): We use similar arguments as in the proof of [24, Theorem 4.3]. Let $(P_{r_k})_k$ be a subsequence of $(P_r)_r$ which converges to some $P \in \partial\,\mathrm{SPD}(d)$. For simplicity we denote $(P_{r_k})_k$ again by $(P_r)_r$. Let $p_1 \ge \ldots \ge p_d \ge 0$ be the eigenvalues of $P$. Since $\|P\|_F = \lim_{r\to\infty} \|P_r\|_F = 1$, it holds $p_1 > 0$. Let $q \in \{1, \ldots, d-1\}$ be such that $p_1 \ge \ldots \ge p_q > p_{q+1} = \ldots = p_d = 0$. By $e_{r,1}, \ldots, e_{r,d}$ we denote the orthonormal eigenvectors corresponding to $p_{r,1}, \ldots, p_{r,d}$. Since $(\mathbb{S}^{d-1})^d$ is compact, we can assume (by going over to a subsequence) that $(e_{r,1}, \ldots, e_{r,d})_r$ converges to orthonormal vectors $(e_1, \ldots, e_d)$. Define $S_0 := \{0\}$ and for $k = 1, \ldots, d$ set $S_k := \mathrm{span}\{e_1, \ldots, e_k\}$. Now, for $k = 1, \ldots, d$ define

$$W_k := S_k \setminus S_{k-1} = \big\{ y \in \mathbb{R}^d : \langle y, e_k \rangle \ne 0,\ \langle y, e_l \rangle = 0 \text{ for } l = k+1, \ldots, d \big\}.$$

Further, let

$$\tilde I_k := \big\{ i \in \{1, \ldots, n\} : x_i \in S_k \big\} \quad \text{and} \quad I_k := \big\{ i \in \{1, \ldots, n\} : x_i \in W_k \big\}.$$

Because of $S_k = W_k \,\dot\cup\, S_{k-1}$ we have $\tilde I_k = I_k \,\dot\cup\, \tilde I_{k-1}$ for $k = 1, \ldots, d$. Due to Assumption 3.1 we have $|I_k| \le |\tilde I_k| \le \dim(S_k) = k$ for $k = 1, \ldots, d-1$. Defining, for $j = 1, \ldots, d$,

$$L_j(P_r) := d \sum_{i \in I_j} w_i \log(x_i^\mathrm{T} P_r^{-1} x_i) + \log(p_{r,j}),$$

it holds $L_0(P_r) = \sum_{j=1}^d L_j(P_r)$. For $j \le q$ we get

$$\liminf_{r\to\infty} L_j(P_r) \ge \liminf_{r\to\infty} \Big( d \sum_{i \in I_j} w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big) + \log(p_{r,j}) \Big) = d \sum_{i \in I_j} w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big) + \log(p_j).$$

Since for $k \in \{1, \ldots, d\}$ and $i \in I_k$,

$$x_i^\mathrm{T} P_r^{-1} x_i = \sum_{j=1}^d \frac{1}{p_{r,j}} \langle x_i, e_{r,j} \rangle^2 \ge \frac{1}{p_{r,k}} \langle x_i, e_{r,k} \rangle^2,$$

and $\lim_{r\to\infty} \langle x_i, e_{r,k} \rangle = \langle x_i, e_k \rangle \ne 0$, we obtain

$$\liminf_{r\to\infty} p_{r,k}\, x_i^\mathrm{T} P_r^{-1} x_i \ge \liminf_{r\to\infty} \langle x_i, e_{r,k} \rangle^2 = \langle x_i, e_k \rangle^2 > 0.$$

Hence it holds for $j \ge q+1$ that

$$\begin{aligned} L_j(P_r) &= d \sum_{i \in I_j} w_i \big(\log(x_i^\mathrm{T} P_r^{-1} x_i) + \log(p_{r,j})\big) + \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \\ &= d \sum_{i \in I_j} w_i \log\big(p_{r,j}\, x_i^\mathrm{T} P_r^{-1} x_i\big) + \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}). \end{aligned}$$

Thus we conclude

$$\begin{aligned} \liminf_{r\to\infty} L_0(P_r) &= \liminf_{r\to\infty} \sum_{j=1}^d L_j(P_r) \ge \sum_{j=1}^q \liminf_{r\to\infty} L_j(P_r) + \liminf_{r\to\infty} \sum_{j=q+1}^d L_j(P_r) \\ &\ge \sum_{j=1}^q \Big( d \sum_{i \in I_j} w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big) + \log(p_j) \Big) + \liminf_{r\to\infty} \sum_{j=q+1}^d d \sum_{i \in I_j} w_i \log\big(p_{r,j}\, x_i^\mathrm{T} P_r^{-1} x_i\big) \\ &\quad + \liminf_{r\to\infty} \sum_{j=q+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \\ &\ge \sum_{j=1}^q \Big( d \sum_{i \in I_j} w_i \log\Big(\frac{1}{C} x_i^\mathrm{T} x_i\Big) + \log(p_j) \Big) + \sum_{j=q+1}^d d \sum_{i \in I_j} w_i \log\big(\langle x_i, e_j \rangle^2\big) \\ &\quad + \liminf_{r\to\infty} \sum_{j=q+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \\ &= \mathrm{const} + \liminf_{r\to\infty} \sum_{j=q+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}). \end{aligned}$$

It remains to show that there exists $\tilde c \in \mathbb{R}$ such that

$$\liminf_{r\to\infty} \sum_{j=q+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \ge \tilde c. \tag{37}$$

We prove for $k \ge q+1$ by induction that for sufficiently large $r \in \mathbb{N}$ it holds

$$\sum_{j=k}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \ge \Big( d \sum_{i \in \tilde I_{k-1}} w_i - (k-1) \Big) \log(p_{r,k}). \tag{38}$$

Induction basis $k = d$: Since $\tilde I_k = I_k \,\dot\cup\, \tilde I_{k-1}$ we have

$$\sum_{i \in \tilde I_k} w_i - \sum_{i \in \tilde I_{k-1}} w_i = \sum_{i \in I_k} w_i,$$

and further

$$1 - d \sum_{i \in I_d} w_i = 1 - d \Big( \sum_{i \in \tilde I_d} w_i - \sum_{i \in \tilde I_{d-1}} w_i \Big) = 1 - d \Big( 1 - \sum_{i \in \tilde I_{d-1}} w_i \Big) = d \sum_{i \in \tilde I_{d-1}} w_i - (d-1).$$

If we multiply both sides with $\log(p_{r,d})$, this yields (38) for $k = d$.

Induction step: Assume that (38) holds for some $k+1$ with $d \ge k+1 > q+1$, i.e.,

$$\sum_{j=k+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \ge \Big( d \sum_{i \in \tilde I_k} w_i - k \Big) \log(p_{r,k+1}).$$

Then we obtain

$$\begin{aligned} \sum_{j=k}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) &= \sum_{j=k+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) + \Big(1 - d \sum_{i \in I_k} w_i\Big) \log(p_{r,k}) \\ &\ge \Big( d \sum_{i \in \tilde I_k} w_i - k \Big) \log(p_{r,k+1}) + \Big(1 - d \sum_{i \in I_k} w_i\Big) \log(p_{r,k}), \end{aligned}$$

and since $\sum_{i \in \tilde I_k} w_i < \frac{1}{d} |\tilde I_k| \le \frac{k}{d}$ by Assumption 3.1 and $p_{r,k+1} \le p_{r,k} < 1$ for sufficiently large $r$, finally

$$\ge \Big( d \sum_{i \in \tilde I_k} w_i - k \Big) \log(p_{r,k}) + \Big(1 - d \sum_{i \in I_k} w_i\Big) \log(p_{r,k}) = \Big( d \sum_{i \in \tilde I_{k-1}} w_i - (k-1) \Big) \log(p_{r,k}).$$

This shows (38) for $k \ge q+1$. Using $k = q+1$ in (38), we get

$$\liminf_{r\to\infty} \sum_{j=q+1}^d \Big(1 - d \sum_{i \in I_j} w_i\Big) \log(p_{r,j}) \ge \liminf_{r\to\infty} \underbrace{\Big( d \sum_{i \in \tilde I_q} w_i - q \Big)}_{<0} \underbrace{\log(p_{r,q+1})}_{\text{bounded from above}} > -\infty,$$

which proves (37).

This finishes the proof.

Lemma B.2. Let $(\nu_r, \Sigma_r)_r$ be a sequence in $\mathbb{R}_{>0} \times \mathrm{SPD}(d)$ such that there exists $\nu_- \in \mathbb{R}_{>0}$ with $\nu_- \le \nu_r$ for all $r \in \mathbb{N}$. Denote by $\lambda_{r,1} \ge \cdots \ge \lambda_{r,d}$ the eigenvalues of $\Sigma_r$. If $\{\lambda_{r,1} : r \in \mathbb{N}\}$ is unbounded or $\{\lambda_{r,d} : r \in \mathbb{N}\}$ has zero as a cluster point, then there exists a subsequence $(\nu_{r_k}, \Sigma_{r_k})_k$ of $(\nu_r, \Sigma_r)_r$ such that $\lim_{k\to\infty} L(\nu_{r_k}, \Sigma_{r_k}) = \infty$.

Proof. Without loss of generality we assume (by considering a subsequence) that either $\lambda_{r,1} \to \infty$ as $r \to \infty$ and $\lambda_{r,d} \ge c > 0$ for all $r \in \mathbb{N}$, or that $\lambda_{r,d} \to 0$ as $r \to \infty$. By [24, Theorem 4.3], for fixed $\nu = \nu_-$, we have $L_{\nu_-}(\Sigma_r) \to \infty$ as $r \to \infty$.

The function $h\colon \mathbb{R}_{>0} \to \mathbb{R}$ defined by $\nu \mapsto (d+\nu)\log(\nu+k)$ is monotone increasing for all $k \in \mathbb{R}_{\ge 0}$. This can be seen as follows: The derivative of $h$ fulfills

$$h'(\nu) = \frac{d+\nu}{k+\nu} + \log(\nu+k) \ge \frac{1+\nu}{k+\nu} + \log(\nu+k),$$

and since

$$\frac{\partial}{\partial k} \Big( \frac{1+\nu}{k+\nu} + \log(\nu+k) \Big) = \frac{k-1}{(k+\nu)^2},$$

the latter function is minimal for $k = 1$, so that

$$h'(\nu) \ge \frac{1+\nu}{k+\nu} + \log(\nu+k) \ge \frac{1+\nu}{1+\nu} + \log(\nu+1) = 1 + \log(1+\nu) > 0.$$

Using this relation, we obtain

$$(d+\nu_r) \sum_{i=1}^n w_i \log\big(\nu_r + x_i^\mathrm{T} \Sigma_r^{-1} x_i\big) \ge (d+\nu_-) \sum_{i=1}^n w_i \log\big(\nu_- + x_i^\mathrm{T} \Sigma_r^{-1} x_i\big),$$

and further

$$\begin{aligned} L(\nu_r, \Sigma_r) &= (d+\nu_r) \sum_{i=1}^n w_i \log\big(\nu_r + x_i^\mathrm{T} \Sigma_r^{-1} x_i\big) + \log(|\Sigma_r|) \\ &\ge (d+\nu_-) \sum_{i=1}^n w_i \log\big(\nu_- + x_i^\mathrm{T} \Sigma_r^{-1} x_i\big) + \log(|\Sigma_r|) \\ &= L_{\nu_-}(\Sigma_r) \to \infty \quad \text{as } r \to \infty. \end{aligned}$$

C. Derivatives of the Negative Log-Likelihood Function for Student-t Mixture Models

We compute the derivatives of

$$L(\alpha, \nu, \mu, \Sigma \,|\, x_1, \ldots, x_n) = -\sum_{i=1}^n \log\Big( \sum_{k=1}^K \alpha_k f(x_i \,|\, \nu_k, \mu_k, \Sigma_k) \Big),$$

where

$$f(x_i \,|\, \nu_k, \mu_k, \Sigma_k) = \frac{\Gamma\big(\frac{d+\nu_k}{2}\big)}{\Gamma\big(\frac{\nu_k}{2}\big)\,(\pi\nu_k)^{\frac{d}{2}}\,|\Sigma_k|^{\frac12}\,\big(1 + \frac{1}{\nu_k}\delta_{i,k}\big)^{\frac{d+\nu_k}{2}}}, \qquad \delta_{i,k} = (x_i - \mu_k)^\mathrm{T} \Sigma_k^{-1} (x_i - \mu_k).$$

We use the short notation $\beta_i = \frac{1}{\sum_{k=1}^K \alpha_k f(x_i | \nu_k, \mu_k, \Sigma_k)}$. The derivative with respect to $\alpha_l$ is given by

$$\frac{\partial L(\alpha, \nu, \mu, \Sigma)}{\partial \alpha_l} = -\sum_{i=1}^n \beta_i f(x_i \,|\, \nu_l, \mu_l, \Sigma_l).$$

We compute the derivative with respect to $\mu_l$. We use that $\nabla_{\mu_l} \delta_{i,l} = 2\Sigma_l^{-1}(\mu_l - x_i)$:

$$\begin{aligned} \nabla_{\mu_l} L(\alpha, \nu, \mu, \Sigma) &= -\sum_{i=1}^n \beta_i \alpha_l \nabla_{\mu_l} f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) \\ &= -\sum_{i=1}^n \beta_i \alpha_l \frac{\Gamma\big(\frac{d+\nu_l}{2}\big)}{\Gamma\big(\frac{\nu_l}{2}\big)\,(\pi\nu_l)^{\frac{d}{2}}\,|\Sigma_l|^{\frac12}} \Big( -\frac{d+\nu_l}{2} \Big) \Big( 1 + \frac{1}{\nu_l}\delta_{i,l} \Big)^{-\frac{d+\nu_l+2}{2}} \nabla_{\mu_l} \Big( 1 + \frac{1}{\nu_l}\delta_{i,l} \Big) \\ &= \sum_{i=1}^n \beta_i \alpha_l f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) \frac{d+\nu_l}{2} \frac{1}{1 + \frac{1}{\nu_l}\delta_{i,l}} \frac{1}{\nu_l} \big( 2\Sigma_l^{-1}(\mu_l - x_i) \big) \\ &= \sum_{i=1}^n \beta_i \alpha_l f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) \frac{d+\nu_l}{\nu_l + \delta_{i,l}} \Sigma_l^{-1}(\mu_l - x_i). \end{aligned}$$

Now we compute the derivative with respect to $\Sigma_l$:

$$\nabla_{\Sigma_l} L(\alpha, \nu, \mu, \Sigma) = -\sum_{i=1}^n \beta_i \alpha_l \nabla_{\Sigma_l} f(x_i \,|\, \nu_l, \mu_l, \Sigma_l)$$

$$\begin{aligned} &= -\sum_{i=1}^n \beta_i \alpha_l \Bigg( \nabla_{\Sigma_l}\Big( \frac{1}{|\Sigma_l|^{\frac12}} \Big) \frac{\Gamma\big(\frac{d+\nu_l}{2}\big)}{\Gamma\big(\frac{\nu_l}{2}\big)\,(\pi\nu_l)^{\frac{d}{2}}\,\big(1 + \frac{1}{\nu_l}\delta_{i,l}\big)^{\frac{d+\nu_l}{2}}} + \frac{1}{|\Sigma_l|^{\frac12}}\, \nabla_{\Sigma_l}\Bigg( \frac{\Gamma\big(\frac{d+\nu_l}{2}\big)}{\Gamma\big(\frac{\nu_l}{2}\big)\,(\pi\nu_l)^{\frac{d}{2}}\,\big(1 + \frac{1}{\nu_l}\delta_{i,l}\big)^{\frac{d+\nu_l}{2}}} \Bigg) \Bigg) \\ &= -\sum_{i=1}^n \beta_i \alpha_l \Bigg( -\frac{1}{2} \frac{1}{|\Sigma_l|^{\frac32}}\, \nabla_{\Sigma_l}|\Sigma_l|\, \frac{\Gamma\big(\frac{d+\nu_l}{2}\big)}{\Gamma\big(\frac{\nu_l}{2}\big)\,(\pi\nu_l)^{\frac{d}{2}}\,\big(1 + \frac{1}{\nu_l}\delta_{i,l}\big)^{\frac{d+\nu_l}{2}}} \\ &\qquad\qquad\qquad + \frac{1}{|\Sigma_l|^{\frac12}} \frac{\Gamma\big(\frac{d+\nu_l}{2}\big)}{\Gamma\big(\frac{\nu_l}{2}\big)\,(\pi\nu_l)^{\frac{d}{2}}} \Big( -\frac{d+\nu_l}{2} \Big) \Big( 1 + \frac{1}{\nu_l}\delta_{i,l} \Big)^{-\frac{d+\nu_l+2}{2}} \frac{1}{\nu_l}\, \nabla_{\Sigma_l} \delta_{i,l} \Bigg). \end{aligned}$$

We use Jacobi's formula, i.e., $\nabla_A |A| = \mathrm{adj}(A)^\mathrm{T} = |A| A^{-\mathrm{T}} = |A| A^{-1}$ for symmetric $A$, and $\frac{\partial a^\mathrm{T} A^{-1} b}{\partial A} = -A^{-\mathrm{T}} a b^\mathrm{T} A^{-\mathrm{T}} = -A^{-1} a b^\mathrm{T} A^{-1}$ (see [33]). Then we get:

$$\begin{aligned} \nabla_{\Sigma_l} L(\alpha, \nu, \mu, \Sigma) &= -\sum_{i=1}^n \beta_i \alpha_l \Big( -\frac{1}{2} \Sigma_l^{-1} f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) + \frac{1}{2} \frac{d+\nu_l}{1 + \frac{1}{\nu_l}\delta_{i,l}} \frac{1}{\nu_l} f(x_i \,|\, \nu_l, \mu_l, \Sigma_l)\, \Sigma_l^{-1} (x_i - \mu_l)(x_i - \mu_l)^\mathrm{T} \Sigma_l^{-1} \Big) \\ &= \frac{1}{2} \sum_{i=1}^n \beta_i \alpha_l f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) \Big( \Sigma_l^{-1} - \frac{d+\nu_l}{\nu_l + \delta_{i,l}} \Sigma_l^{-1} (x_i - \mu_l)(x_i - \mu_l)^\mathrm{T} \Sigma_l^{-1} \Big). \end{aligned}$$

We compute the derivative with respect to $\nu_l$:

$$\frac{\partial L(\alpha, \nu, \mu, \Sigma)}{\partial \nu_l} = -\sum_{i=1}^n \beta_i \alpha_l \frac{\partial}{\partial \nu_l} f(x_i \,|\, \nu_l, \mu_l, \Sigma_l) = -\sum_{i=1}^n \beta_i \alpha_l \frac{1}{|\Sigma_l|^{\frac12}} \frac{\partial}{\partial \nu_l} \frac{\overbrace{\Gamma\big(\frac{d+\nu_l}{2}\big)}^{A(\nu_l)}}{\underbrace{\Gamma\big(\frac{\nu_l}{2}\big)}_{B(\nu_l)}\, \underbrace{(\pi\nu_l)^{\frac{d}{2}}}_{C(\nu_l)}\, \underbrace{\big(1 + \frac{1}{\nu_l}\delta_{i,l}\big)^{\frac{d+\nu_l}{2}}}_{D_i(\nu_l)}}$$

$$= -\sum_{i=1}^n \beta_i \alpha_l \frac{1}{|\Sigma_l|^{\frac12}} \frac{A'(\nu_l) B(\nu_l) C(\nu_l) D_i(\nu_l) - A(\nu_l) B'(\nu_l) C(\nu_l) D_i(\nu_l) - A(\nu_l) B(\nu_l) C'(\nu_l) D_i(\nu_l) - A(\nu_l) B(\nu_l) C(\nu_l) D_i'(\nu_l)}{\big(B(\nu_l) C(\nu_l) D_i(\nu_l)\big)^2},$$

where

$$A'(x) = \frac{1}{2}\Gamma'\Big(\frac{d+x}{2}\Big), \qquad B'(x) = \frac{1}{2}\Gamma'\Big(\frac{x}{2}\Big), \qquad C'(x) = \frac{d}{2}\pi^{\frac{d}{2}} x^{\frac{d-2}{2}},$$

and

$$\begin{aligned} D_i'(x) &= \frac{\partial}{\partial x} \exp\Big( \frac{d+x}{2} \log\Big(1 + \frac{1}{x}\delta_{i,l}\Big) \Big) = \exp\Big( \frac{d+x}{2} \log\Big(1 + \frac{1}{x}\delta_{i,l}\Big) \Big) \frac{\partial}{\partial x}\Big( \frac{d+x}{2} \log\Big(1 + \frac{1}{x}\delta_{i,l}\Big) \Big) \\ &= \Big(1 + \frac{1}{x}\delta_{i,l}\Big)^{\frac{d+x}{2}} \Bigg( \frac{1}{2}\log\Big(1 + \frac{1}{x}\delta_{i,l}\Big) - \frac{d+x}{2}\, \frac{\frac{\delta_{i,l}}{x^2}}{1 + \frac{1}{x}\delta_{i,l}} \Bigg) \\ &= \frac{1}{2}\Big(1 + \frac{1}{x}\delta_{i,l}\Big)^{\frac{d+x}{2}} \Bigg( \log\Big(1 + \frac{1}{x}\delta_{i,l}\Big) - \frac{(d+x)\delta_{i,l}}{x^2 + x\delta_{i,l}} \Bigg). \end{aligned}$$
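As a sanity check for these derivatives, the closed-form gradient with respect to $\mu_l$ can be compared against finite differences of the negative log-likelihood. The following sketch (our own illustration, not part of the thesis) implements both:

```python
import numpy as np
from scipy.special import gammaln

def component_pdf(x, nu, mu, Sigma):
    """Returns (f(x | nu, mu, Sigma), delta) for one mixture component."""
    d = len(mu)
    diff = x - mu
    delta = diff @ np.linalg.solve(Sigma, diff)
    logf = (gammaln((d + nu) / 2) - gammaln(nu / 2)
            - 0.5 * d * np.log(np.pi * nu)
            - 0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * (d + nu) * np.log1p(delta / nu))
    return np.exp(logf), delta

def neg_log_likelihood(X, alphas, nus, mus, Sigmas):
    """L = -sum_i log( sum_k alpha_k f(x_i | nu_k, mu_k, Sigma_k) )."""
    return -sum(np.log(sum(a * component_pdf(x, nu, mu, S)[0]
                           for a, nu, mu, S in zip(alphas, nus, mus, Sigmas)))
                for x in X)

def grad_mu_l(X, alphas, nus, mus, Sigmas, l):
    """Closed form: sum_i beta_i alpha_l f_il (d+nu_l)/(nu_l+delta_il) Sigma_l^{-1}(mu_l - x_i)."""
    g = np.zeros_like(mus[l])
    for x in X:
        vals = [component_pdf(x, nu, mu, S) for nu, mu, S in zip(nus, mus, Sigmas)]
        beta = 1.0 / sum(a * f for a, (f, _) in zip(alphas, vals))
        f_l, delta_l = vals[l]
        g += (beta * alphas[l] * f_l * (len(x) + nus[l]) / (nus[l] + delta_l)
              * np.linalg.solve(Sigmas[l], mus[l] - x))
    return g
```

Comparing `grad_mu_l` against central differences of `neg_log_likelihood` confirms the formula for $\nabla_{\mu_l} L$ numerically.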

References

[1] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1965.

[2] A. Antoniadis, D. Leporini, and J.-C. Pesquet. Wavelet thresholding for some classes of non-Gaussian noise. Statistica Neerlandica, 56(4):434–453, 2002.

[3] B. Aune. La super résolution d'images à base de patchs. Master Thesis, University of Bordeaux, 2019.

[4] A. Banerjee and P. Maji. Spatially constrained Student’s t-distribution based mixture model for robust image segmentation. Journal of Mathematical Imaging Vision, 60(3):355–381, 2018.

[5] H. Bauer. Probability theory, volume 23 of De Gruyter Studies in Mathematics. Walter de Gruyter & Co., Berlin, 1996. Translated from the fourth (1991) German edition by Robert B. Burckel and revised by the author.

[6] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2, Ser. A):459–494, 2014.

[7] C. L. Byrne. The EM Algorithm: Theory, Applications and Related Methods. Lecture Notes, University of Massachusetts, 2017.

[8] S. Chrétien and A. O. Hero. Kullback proximal algorithms for maximum-likelihood estimation. IEEE Transactions on Information Theory, 46(5):1800–1810, 2000.

[9] S. Chrétien and A. O. Hero. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12:308–326, 2008.

[10] R. L. Cook. Stochastic sampling in computer graphics. ACM Transactions on Graphics, 5(1):51–72, 1986.

[11] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

[12] M. Ding, T. Huang, S. Wang, J. Mei, and X. Zhao. Total variation with overlapping group sparsity for deblurring images under Cauchy noise. Applied Mathematics and Computation, 341:128–147, 2019.

[13] P. Ding. On the conditional distribution of the multivariate t distribution. The American Statistician, 70(3):293–295, 2016.

[14] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995.

[15] D. Gerogiannis, C. Nikou, and A. Likas. The mixtures of Student’s t-distributions as a robust framework for rigid registration. Image and Vision Computing, 27(9):1285– 1294, 2009.

[16] M. Hasannasab, J. Hertrich, F. Laus, and G. Steidl. Alternatives of the EM algorithm for estimating the parameters of the Student-t distribution. arXiv preprint arXiv:1910.06623, 2019.

[17] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[18] M. G. Kendall. The treatment of ties in ranking problems. Biometrika, pages 239–251, 1945.

[19] J. T. Kent, D. E. Tyler, and Y. Vard. A curious likelihood identity for the multivariate t-distribution. Communications in Statistics-Simulation and Computation, 23(2):441– 453, 1994.

[20] A. Klenke. Probability theory. Universitext. Springer-Verlag London, Ltd., London, 2008. A comprehensive course, Translated from the 2006 German original.

[21] A. Lanza, S. Morigi, F. Sciacchitano, and F. Sgallari. Whiteness constraints in a unified variational framework for image restoration. Journal of Mathematical Imaging and Vision, 60(9):1503–1526, 2018.

[22] F. Laus. Statistical Analysis and Optimal Transport for Euclidean and Manifold- Valued Data. PhD Thesis, TU Kaiserslautern, 2019.

[23] F. Laus, F. Pierre, and G. Steidl. Nonlocal myriad filters for Cauchy noise removal. Journal of Mathematical Imaging and Vision, 60(8):1324–1354, 2018.

[24] F. Laus and G. Steidl. Multivariate myriad filters based on parameter estimation of Student-t distributions. SIAM Journal on Imaging Sciences, 12(4):1864–1904, 2019.

[25] M. Lebrun, A. Buades, and J.-M. Morel. A nonlocal Bayesian image denoising algorithm. SIAM Journal on Imaging Sciences, 6(3):1665–1688, 2013.

[26] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, Inc., 1997.

[27] J.-J. Mei, Y. Dong, T.-Z. Huang, and W. Yin. Cauchy noise removal by nonconvex ADMM with convergence guarantees. Journal of Scientific Computing, 74(2):743–766, 2018.

[28] X.-L. Meng and D. Van Dyk. The EM algorithm - an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3):511–567, 1997.

[29] T. M. Nguyen and Q. J. Wu. Robust Student’s-t mixture model with spatial constraints and its application in medical image segmentation. IEEE Transactions on Medical Imaging, 31(1):103–116, 2012.

[30] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.

[31] S. Parameswaran, C. Deledalle, L. Denis, and T. Q. Nguyen. Accelerating GMM-based patch priors for image restoration: Three ingredients for a 100× speed-up. IEEE Transactions on Image Processing, 28(2):687–698, 2019.

[32] D. Peel and G. J. McLachlan. Robust mixture modelling using the t distribution. Statistics and Computing, 10(4):339–348, 2000.

[33] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Lecture Notes, Technical University of Denmark, 2008.

[34] T. Pock and S. Sabach. Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences, 9(4):1756–1787, 2016.

[35] P. Sandeep and T. Jacob. Single image super-resolution using a joint GMM method. IEEE Transactions on Image Processing, 25(9):4233–4244, 2016.

[36] F. Sciacchitano, Y. Dong, and T. Zeng. Variational approach for restoring blurred images with Cauchy noise. SIAM Journal on Imaging Sciences, 8(3):1894–1922, 2015.

[37] G. Sfikas, C. Nikou, and N. Galatsanos. Robust image segmentation with mixtures of Student's t-distributions. In 2007 IEEE International Conference on Image Processing, volume 1, pages I-273–I-276, 2007.

[38] C. Sutour, C.-A. Deledalle, and J.-F. Aujol. Estimation of the noise level function based on a nonparametric detection of homogeneous image regions. SIAM Journal on Imaging Sciences, 8(4):2622–2661, 2015.

[39] A. Van Den Oord and B. Schrauwen. The Student-t mixture as a natural image patch prior with application to image compression. Journal of Machine Learning Research, 15(1):2061–2086, 2014.

[40] D. A. van Dyk. Construction, implementation, and theory of algorithms based on data augmentation and model reduction. PhD Thesis, The University of Chicago, 1995.

[41] C. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

[42] Z. Yang, Z. Yang, and G. Gui. A convex constraint variational method for restoring blurred images in the presence of alpha-stable noises. Sensors, 18(4):1175, 2018.

[43] W. I. Zangwill. Nonlinear programming: a unified approach, volume 196. Prentice- Hall Englewood Cliffs, NJ, 1969.

[44] Z. Zhou, J. Zheng, Y. Dai, Z. Zhou, and S. Chen. Robust non-rigid point set registration using Student's-t mixture model. PLoS ONE, 9(3):e91381, 2014.

[45] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479–486. IEEE, 2011.
