Robust Model-Based Learning Via Spatial-EM Algorithm

1 Robust Model-based Learning via Spatial-EM Algorithm Kai Yu, Xin Dang, Henry Bart, Jr. and Yixin Chen, Member, IEEE Abstract—This paper presents a new robust EM algorithm for the finite mixture learning procedures. The proposed Spatial-EM algorithm utilizes median-based location and rank-based scatter estimators to replace sample mean and sample covariance matrix in each M step, hence enhancing stability and robustness of the algorithm. It is robust to outliers and initial values. Compared with many robust mixture learning methods, the Spatial-EM has the advantages of simplicity in implementation and statistical efficiency. We apply Spatial-EM to supervised and unsupervised learning scenarios. More specifically, robust clustering and outlier detection methods based on Spatial-EM have been proposed. We apply the outlier detection to taxonomic research on fish species novelty discovery. Two real datasets are used for clustering analysis. Compared with the regular EM and many other existing methods such as K-median, X-EM and SVM, our method demonstrates superior performance and high robustness. Index Terms—Clustering, EM algorithm, finite mixture, spatial rank, outlier detection, robustness F 1 INTRODUCTION et al. [26] proposed a weighted trimmed likelihood framework to accommodate many interesting cases Finite mixture models are powerful and flexible to including the weighted likelihood method. Fujisawa represent arbitrarily complex probabilistic distribu- and Eguchi [16] utilized a so-called β-likelihood to tion of data. Mixture model-based approaches have overcome the unboundedness of the likelihood func- been increasingly popular. Applications in a wide tion and sensitivity of the maximum likelihood esti- range of fields have emerged in the past decades. mator to outliers. Qin and Priebe [35] introduced a They are used for density estimation in unsupervised maximum Lq-likelihood estimation of mixture mod- clustering [28], [15], [13], [48], [9], [25], for estimat- els and studied its robustness property. Peel and ing class-conditional densities in supervised learn- McLachlan [33], [28] considered modelling a mixture ing settings [1], [15], [36], and for outlier detection of t distributions to reduce the effects of outliers. purposes [38], [30], [52]. Comprehensive surveys on Shoham [41] also used t mixtures to handle outliers mixture models and their applications can be found and agglomerated an annealing approach to deal with in the monographs by Titterington et al. [44] and the sensitivity with respect to initial values. McLachlan and Peel [29]. Another common technique for robust fitting of Usually parameters of a mixture model are esti- mixtures is to update the component estimates on the mated by the maximum likelihood estimate (MLE) M-step of the EM algorithm by some robust location via the expectation maximization (EM) algorithm [11], and scatter estimates. M-estimator has been consid- [27]. It is well known that the MLE can be very ered by Campbell [5], Tadjudin and Landgrebe [43]. sensitive to outliers. To overcome this limitation, Hardin and Rocke [17] used Minimum Covariance various robust alternatives have been developed. Determinant (MCD) estimator for cluster analysis. Rather than maximizing the likelihood function of Bashir and Carter [1] recommended the use of S esti- Gaussian mixtures, Markatou [23] used a weighted mator. In this paper, we propose to apply spatial rank likelihood with down-weights on outliers. Neykov based location and scatter estimators. They are highly robust and are computationally and statistically more • K. Yu is with Amazon Web Service,1918 8th Ave, Seattle, WA 98101 efficient than the above robust estimators [54]. We E-mail: [email protected]. • X. Dang is with Department of Mathematics, University of Mis- develop a Spatial-EM algorithm for robust finite mix- sissippi, 315 Hume Hall, University, MS 38677, USA. Telephone: ture learning. Based on the Spatial-EM, supervised (662)915-7409. Fax: (662)915-2361. E-mail: [email protected]. Web- outlier detection and unsupervised clustering meth- page: http://olemiss.edu/∼xdang • H. Bart, Jr. is with Department of Ecology and Evolutionary Bi- ods are illustrated and compared with other existing ology, Tulane University, New Orleans, LA 70118, USA. E-mail: techniques. [email protected]. The remainder of the paper is organized as fol- • Y. Chen is with Department of Computer and Information Sci- ence, University of Mississippi, 201 Weir Hall, University, MS lows. Section 2 reviews mixture elliptical models and 38677, USA. Telephone: (662)915-7438. Fax: (662)915-5623. E-mail: the EM algorithm. Section 3 introduces spatial rank [email protected]. Webpage: http://cs.olemiss.edu/∼ychen related statistics. Section 4 presents the Spatial-EM algorithm for mixture elliptical models. Section 5 2 T formulates mixture model based novelty detection. where yi = (y1i; :::; yKi) is an “unobserved ” indica- We apply the Spatial-EM based outlier detection to tor vector with yji = 1 if xi is from component j, zero new species discovery in taxonomy research. In Sec- otherwise. The log-likelihood of Z is then defined by tion 6, the clustering method based on robust mixture n K learning is illustrated using two data sets from UCI X X Lc(θjZ) = yji log[τjfj(xijθj)]: (2.4) machine learning repository. We end the paper in Sec- i=1 j=1 tion 7 with some concluding remarks and a discussion of possible future work. The EM algorithm obtains a sequence of estimates fθ(t); t = 0; 1; :::g by alternating two steps until some convergence criterion is met. 2 REVIEW OF EM ALGORITHM E-Step: Calculate Q function, the conditional expectation of the complete log-likelihood, given X and 2.1 Finite Mixture Models (t) the current estimate θ . Since Yji is either 1 or 0, (t) (t) A d-variate random vector X is said to follow a K- E(Yjijθ ; xi) = Pr(Yji = 1jθ ; xi), which is denoted (t) component mixture distribution if its density function as Tji . By the Bayes rule, we have has the form of (t) (t) (t) τj fj(xijθi ) K T = : (2.5) X ji PK (t) (t) f(xjθ) = τjfj(xjθj); j=1 τj fj(xijθj ) j=1 (t) th Tji ’s can be interpreted as soft labels at the t f (xjθ ) where j j denotes the conditional probability iteration. Replacing yji with Tji in (2.4), we have th density function of x belonging to the j compo- Q(θjθ(t)). nent parametrized by θj, τ1; :::; τK are the mixing M-step: Update the estimate of the parameters by PK proportions with all τj > 0 and j=1 τj = 1, and maximizing the Q function θ = fθ1; :::; θK ; τ1; :::; τK g is the set of parameters. θ(t+1) = Q(θjθ(t)): For the mixture elliptical distributions, fj(xjθj) can argmaxθ (2.6) be written as For convenience, we define −1=2 T −1 (t) fj(xjµj; Σj) = jΣjj hjf(x − µj ) Σj (x − µj )g; (t) Tji w = : (2.7) (2.1) ji Pn (t) d i=1 Tji for some µj 2 R , a positive definite symmetric d × d matrix Σj. The family of mixture elliptical distribu- (t) wji can be viewed as the current weight of xi con- tions contains a quite rich collection of models. The tributing to component j. In the case of Gaussian mix- most widely used one is the mixture of Gaussian K ture, maximizing Q with respective to fµj; Σj; τjgj=1 distributions, in which provides an explicit close-form solution : −d=2 −t=2 n (t) n hj(t) = (2π) e : (2.2) P (t+1) i=1 Tji 1 X (t) τ = = T ; (2.8) j PK Pn (t) n ji The mixture of t distributions and Laplace distribu- j=1 i=1 Tji i=1 tions are commonly used in modeling data with heavy n (t+1) X (t) tails. For the mixture t distributions, µj = wji xi; (2.9) i=1 −(d+νj )=2 n hj(t) = c(νj; d)(1 + t/νj) ; (t+1) X (t) (t+1) (t+1) T Σj = wji (xi − µj )(xi − µj ) : where νj is the degree freedom and c(νj; d) is the i=1 normalization constant. As a generalization of mul- (2.10) tivariate mixture Laplace distribution, the mixture of EM estimation has been proved to converge to Kotz type distribution [34] has the density maximum likelihood estimation (MLE) of the mixture Γ(d=2) p parameters under mild conditions [11], [51], [27]. The h (t) = e− t: (2.3) j (2π)d=2Γ(d) above simple implementation makes Gaussian mixture models popular. However, a major limitation of For detailed and comprehensive accounts on mixture Gaussian mixture models is their lack of robustness models, see McLachlan and Peel [29]. to outliers. This is easily understood because maximization of likelihood function under an assumed Gaussian distribution is equivalent to finding the 2.2 EM algorithm least-squares solution, whose lack of robustness is In the EM framework for finite mixture models, the well known. Moreover, from the perspective of robust observed sample X = fx1; :::; xng are viewed as in- statistics, using sample mean (2.9) and sample covari- n complete. The complete data shall be Z = fxi; yigi=1, ance (2.10) of each component in the M-step causes the 3 sensitivity problem because they have the lowest pos- centered rank of 0. For an order statistics without a sible breakdown point. Here the breakdown point is a tie x(1) < x(2) < ··· < x(n), their centered ranks are prevailing quantitative robustness measure proposed −1 + 1=n; −1 + 3=n; ··· ; 1 − 3=n; 1 − 1=n, which are by Donoho and Huber [12]. Roughly speaking, the linear transformations from their naturally-ordered breakdown point is the minimum fraction of “bad” ranks 1; :::; n.

Robust Model-Based Learning Via Spatial-EM Algorithm

LECTURE 13 Mixture Models and Latent Space Models

START HERE: Instructions Mixtures of Exponential Family

Robust Mixture Modelling Using the T Distribution

Lecture 16: Mixture Models

Linear Mixture Model Applied to Amazonian Vegetation Classification

Graph Laplacian Mixture Model Hermina Petric Maretic and Pascal Frossard

Comparing Unsupervised Clustering Algorithms to Locate Uncommon User Behavior in Public Travel Data

Semiparametric Estimation in the Normal Variance-Mean Mixture Model∗

Structure of a Mixture Model

Student's T Distribution Based Estimation of Distribution

The Infinite Gaussian Mixture Model

Outlier Detection and Robust Mixture Modeling Using Nonconvex Penalized Likelihood