
Statistics and Computing (2000) 10, 339–348

Robust mixture modelling using the t distribution

D. PEEL and G. J. MCLACHLAN
Department of Mathematics, University of Queensland, St. Lucia, Queensland 4072, Australia
[email protected]

Normal mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster sets of continuous multivariate data. However, for a set of data containing a group or groups of observations with longer than normal tails or atypical observations, the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM algorithm to fit this t mixture model is described, and examples of its use are given in the context of clustering multivariate data in the presence of atypical observations in the form of background noise.

Keywords: finite mixture models, normal components, multivariate t components, maximum likelihood, EM algorithm, cluster analysis

1. Introduction

Finite mixtures of distributions have provided a mathematically based approach to the statistical modelling of a wide variety of random phenomena; see, for example, Everitt and Hand (1981), Titterington, Smith and Makov (1985), McLachlan and Basford (1988), Lindsay (1995), and Böhning (1999). Because of their usefulness as an extremely flexible method of modelling, finite mixture models have continued to receive increasing attention over the years, both from a practical and a theoretical point of view. For multivariate data of a continuous nature, attention has focussed on the use of multivariate normal components because of their computational convenience. They can be easily fitted iteratively by maximum likelihood (ML) via the expectation-maximization (EM) algorithm of Dempster, Laird and Rubin (1977); see also McLachlan and Krishnan (1997).

However, for many applied problems, the tails of the normal distribution are often shorter than required. Also, the estimates of the component means and covariance matrices can be affected by observations that are atypical of the components in the normal mixture model being fitted. The problem of providing protection against outliers in multivariate data is very difficult and increases with the dimension of the data (Rocke and Woodruff (1997) and Kosinski (1999)).

In this paper, we consider the fitting of mixtures of (multivariate) t distributions. The t distribution provides a longer-tailed alternative to the normal distribution. Hence it provides a more robust approach to the fitting of normal mixture models, as observations that are atypical of a component are given reduced weight in the calculation of its parameters. Also, the use of t components gives less extreme estimates of the posterior probabilities of component membership of the mixture model, as demonstrated in McLachlan and Peel (1998). In their conference paper, they reported briefly on robust clustering via mixtures of t components, but did not include details of the implementation of the EM algorithm nor the examples to be given here.

With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom ν. As ν tends to infinity, the t distribution approaches the normal distribution. Hence this parameter ν may be viewed as a robustness tuning parameter. It can be fixed in advance or it can be inferred from the data for each component, thereby providing an adaptive robust procedure, as explained in Lange, Little and Taylor (1989), who considered the use of a single component t distribution in linear and nonlinear regression problems; see also Rubin (1983) and Sutradhar and Ali (1986).

In the past, there have been many attempts at modifying existing methods of cluster analysis to provide robust clustering procedures. Some of these have been of a rather ad hoc nature.

The use of a mixture model of t distributions provides a sound mathematical basis for a robust method of mixture estimation and hence clustering. We shall illustrate its usefulness in the latter context by a cluster analysis of a simulated data set with background noise added and of an actual data set.

2. Previous work

Robust estimation in the context of mixture models has been considered in the past by Campbell (1984), McLachlan and Basford (1988, Chapter 3), and De Veaux and Kreiger (1990), among others, using M-estimates of the means and covariance matrices of the normal components of the mixture model. This line of approach is to be discussed in Section 9.

Recently, Markatou (1998) has provided a formal approach to robust mixture estimation by applying weighted likelihood methodology in the context of mixture models. With this methodology, an estimate of the vector of unknown parameters Ψ is obtained as a solution of the equation

\sum_{j=1}^{n} w(y_j)\, \partial \log f(y_j; \Psi)/\partial \Psi = 0, \qquad (1)

where f(y; Ψ) denotes the specified parametric form for the probability density function (p.d.f.) of the random vector Y on which y_1, ..., y_n have been observed independently. The weight function w(y) is defined in terms of the Pearson residuals; see Markatou, Basu and Lindsay (1998) and the previous work of Green (1984). The weighted likelihood methodology provides robust and first-order efficient estimators, and Markatou (1998) has established these results in the context of univariate mixture models.

One useful application of normal mixture models has been in the important field of cluster analysis. Besides having a sound mathematical basis, this approach is not confined to the production of spherical clusters, such as with k-means-type algorithms that use Euclidean distance rather than the Mahalanobis distance metric, which allows for within-cluster correlations between the variables in the feature vector Y. Moreover, unlike clustering methods defined solely in terms of the Mahalanobis distance, normal mixture-based clustering takes into account the normalizing term |Σ_i|^{-1/2} in the estimate of the multivariate normal density adopted for the component distribution of Y corresponding to the ith cluster. This term can make an important contribution in the case of disparate group-covariance matrices (McLachlan 1992, Chapter 2).

Although even a crude estimate of the within-cluster covariance matrix Σ_i often suffices for clustering purposes (Gnanadesikan, Harvey and Kettenring 1993), it can be severely affected by outliers. Hence it is highly desirable for methods of cluster analysis to provide robust clustering procedures. The problem of making clustering algorithms more robust has received much attention recently as, for example, in Smith, Bailey and Munford (1995), Davé and Krishnapuram (1996), Frigui and Krishnapuram (1996), Jolion, Meer and Bataouche (1996), Kharin (1996), Rousseeuw, Kaufman and Trauwaert (1996) and Zhuang et al. (1996).

3. Multivariate t distribution

We let y_1, ..., y_n denote an observed p-dimensional random sample of size n. With a normal mixture model-based approach to drawing inferences from these data, each data point is assumed to be a realization of the random p-dimensional vector Y with the g-component normal mixture probability density function (p.d.f.)

f(y; \Psi) = \sum_{i=1}^{g} \pi_i\, \phi(y; \mu_i, \Sigma_i),

where the mixing proportions π_i are nonnegative and sum to one, and where

\phi(y; \mu_i, \Sigma_i) = (2\pi)^{-p/2}\, |\Sigma_i|^{-1/2} \exp\bigl\{-\tfrac{1}{2}(y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i)\bigr\} \qquad (2)

denotes the p-variate multivariate normal p.d.f. with mean μ_i and covariance matrix Σ_i (i = 1, ..., g). Here Ψ = (π_1, ..., π_{g-1}, θ^T)^T, where θ consists of the elements of the μ_i and the distinct elements of the Σ_i (i = 1, ..., g). In the above and in the sequel, we are using f as a generic symbol for a p.d.f.

One way to broaden this parametric family to allow for potential outliers or data with longer than normal tails is to adopt the two-component normal mixture p.d.f.

(1 - \epsilon)\, \phi(y; \mu, \Sigma) + \epsilon\, \phi(y; \mu, c\Sigma), \qquad (3)

where c is large and ε is small, representing the small proportion of observations that have a relatively large variance. Huber (1964) subsequently considered more general forms of contamination of the normal distribution in the development of his robust M-estimators of a location parameter, as discussed further in Section 9.
The normal scale mixture model (3) can be written as

\int \phi(y; \mu, \Sigma/u)\, dH(u), \qquad (4)

where H is the probability distribution that places mass (1 − ε) at the point u = 1 and mass ε at the point u = 1/c. Suppose we now replace H by the distribution of a chi-squared random variable on ν degrees of freedom divided by ν; that is, by the distribution of the random variable U distributed as

U \sim \mathrm{gamma}\bigl(\tfrac{1}{2}\nu,\ \tfrac{1}{2}\nu\bigr),

where the gamma(α, β) density function f(u; α, β) is given by

f(u; \alpha, \beta) = \{\beta^{\alpha} u^{\alpha-1}/\Gamma(\alpha)\}\exp(-\beta u)\, I_{(0,\infty)}(u) \qquad (\alpha, \beta > 0),

and the indicator function I_{(0,\infty)}(u) = 1 for u > 0 and is zero elsewhere. We then obtain the t distribution with location parameter μ, positive definite inner product matrix Σ, and ν degrees of freedom,

f(y; \mu, \Sigma, \nu) = \frac{\Gamma\bigl(\frac{\nu+p}{2}\bigr)\, |\Sigma|^{-1/2}}{(\pi\nu)^{p/2}\, \Gamma\bigl(\frac{\nu}{2}\bigr)\bigl\{1 + \delta(y, \mu; \Sigma)/\nu\bigr\}^{(\nu+p)/2}}, \qquad (5)

where

\delta(y, \mu; \Sigma) = (y - \mu)^T \Sigma^{-1} (y - \mu) \qquad (6)

denotes the Mahalanobis squared distance between y and μ (with Σ as the covariance matrix). If ν > 1, μ is the mean of Y, and if ν > 2, ν(ν − 2)^{-1}Σ is its covariance matrix. As ν tends to infinity, U converges to one with probability one, and so Y becomes marginally multivariate normal with mean μ and covariance matrix Σ. The family of t distributions thus provides a heavy-tailed alternative to the normal family with mean μ and covariance matrix that is equal to a scalar multiple of Σ (if ν > 2).
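To make (5) concrete, the following minimal sketch evaluates the log of the multivariate t density directly from (5) and (6). It assumes NumPy and SciPy are available; the function names are our own illustrative choices and are not taken from the paper or from the EMMIX software.

```python
import numpy as np
from scipy.special import gammaln

def mahalanobis_sq(y, mu, sigma):
    """Squared Mahalanobis distance delta(y, mu; Sigma) of equation (6)."""
    dev = y - mu
    return float(dev @ np.linalg.solve(sigma, dev))

def log_t_pdf(y, mu, sigma, nu):
    """Log of the multivariate t density (5) with location mu, positive
    definite inner product (scale) matrix sigma and nu degrees of freedom."""
    p = len(mu)
    delta = mahalanobis_sq(y, mu, sigma)
    _, logdet = np.linalg.slogdet(sigma)
    return (gammaln((nu + p) / 2) - gammaln(nu / 2)
            - 0.5 * (p * np.log(np.pi * nu) + logdet)
            - 0.5 * (nu + p) * np.log1p(delta / nu))
```

For large ν the value approaches the corresponding multivariate normal log density, in line with the limiting behaviour described above.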

4. ML estimation of the t distribution

A brief history of the development of ML estimation of a single component t distribution is given in Liu and Rubin (1995). An account of more recent work is given in Liu (1997). Liu and Rubin (1994, 1995) have shown that the ML estimates can be found much more efficiently by using an extension of the EM algorithm called the expectation-conditional maximization either (ECME) algorithm. Meng and van Dyk (1997) demonstrated that the more promising versions of the ECME algorithm for the t distribution can be obtained using alternative data augmentation schemes. They called this algorithm the alternating expectation-conditional maximization (AECM) algorithm. Following Meng and van Dyk (1997), Liu (1997) considered a class of data augmentation schemes even more general than the class of Meng and van Dyk (1997). This led to new versions of the ECME algorithm for ML estimation of the t distribution with possible missing values, corresponding to applications of the parameter-expanded EM (PX-EM) algorithm (Liu, Wu and Rubin 1998).

The application of the EM algorithm for ML estimation in the case of a single component t distribution has been described in McLachlan and Krishnan (1997, Sections 2.6 and 5.8). The results there can be extended to cover the present case of a g-component mixture of multivariate t distributions.

5. ML estimation of mixture of t distributions

We consider now ML estimation for a g-component mixture of t distributions, given by

f(y; \Psi) = \sum_{i=1}^{g} \pi_i\, f(y; \mu_i, \Sigma_i, \nu_i), \qquad (7)

where now

\Psi = (\pi_1, \ldots, \pi_{g-1}, \theta^T, \nu^T)^T,

ν = (ν_1, ..., ν_g)^T, θ = (ξ_1^T, ..., ξ_g^T)^T, and where ξ_i contains the elements of μ_i and the distinct elements of Σ_i (i = 1, ..., g).

In the EM framework, the complete-data vector is given by

x_c = (x_o^T, z_1^T, \ldots, z_n^T, u_1, \ldots, u_n)^T, \qquad (8)

where x_o = (y_1^T, ..., y_n^T)^T denotes the observed-data vector, z_1, ..., z_n are the component-label vectors defining the component of origin of y_1, ..., y_n, respectively, and z_{ij} = (z_j)_i is one or zero according as to whether y_j belongs or does not belong to the ith component. In the light of the above characterization of the t distribution, it is convenient to view the observed data augmented by the z_j as still being incomplete and to introduce into the complete-data vector the additional missing data u_1, ..., u_n, which are defined so that, given z_{ij} = 1,

Y_j \mid u_j, z_{ij} = 1 \ \sim\ N(\mu_i, \Sigma_i/u_j) \qquad (9)

independently for j = 1, ..., n, and

U_j \mid z_{ij} = 1 \ \sim\ \mathrm{gamma}\bigl(\tfrac{1}{2}\nu_i,\ \tfrac{1}{2}\nu_i\bigr). \qquad (10)

Given z_1, ..., z_n, the U_1, ..., U_n are independently distributed according to (10).

The complete-data likelihood L_c(Ψ) can be factored into the product of the marginal densities of the Z_j, the conditional densities of the U_j given the z_j, and the conditional densities of the Y_j given the u_j and the z_j. Accordingly, the complete-data log likelihood can be written as

\log L_c(\Psi) = \log L_{1c}(\pi) + \log L_{2c}(\nu) + \log L_{3c}(\theta), \qquad (11)

where

\log L_{1c}(\pi) = \sum_{i=1}^{g}\sum_{j=1}^{n} z_{ij} \log \pi_i, \qquad (12)

\log L_{2c}(\nu) = \sum_{i=1}^{g}\sum_{j=1}^{n} z_{ij}\Bigl\{-\log\Gamma\bigl(\tfrac{1}{2}\nu_i\bigr) + \tfrac{1}{2}\nu_i \log\bigl(\tfrac{1}{2}\nu_i\bigr) + \tfrac{1}{2}\nu_i(\log u_j - u_j) - \log u_j\Bigr\}, \qquad (13)

and

\log L_{3c}(\theta) = \sum_{i=1}^{g}\sum_{j=1}^{n} z_{ij}\Bigl\{-\tfrac{1}{2}p\log(2\pi) - \tfrac{1}{2}\log|\Sigma_i| + \tfrac{1}{2}p\log u_j - \tfrac{1}{2}u_j (y_j - \mu_i)^T \Sigma_i^{-1}(y_j - \mu_i)\Bigr\}. \qquad (14)

In (11), π = (π_1, ..., π_g)^T.
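The characterization (9) and (10) also gives a direct way to simulate from the t mixture model (7): draw the component label, then the gamma scaling u, and finally a normal vector with covariance Σ_i/u. The sketch below assumes NumPy; the function name and the use of a default random generator are our own choices.

```python
import numpy as np

def sample_t_mixture(n, pis, mus, sigmas, nus, rng=None):
    """Draw n points from a g-component multivariate t mixture by drawing the
    component label z_j, then u_j ~ gamma(nu_i/2, nu_i/2) as in (10), and then
    y_j | u_j, z_ij = 1 ~ N(mu_i, Sigma_i / u_j) as in (9)."""
    rng = np.random.default_rng() if rng is None else rng
    p = len(mus[0])
    y = np.empty((n, p))
    labels = rng.choice(len(pis), size=n, p=pis)
    for j, i in enumerate(labels):
        # numpy's gamma is parameterised by shape and scale, so the
        # gamma(nu/2, nu/2) of (10) has shape nu/2 and scale 2/nu.
        u = rng.gamma(shape=nus[i] / 2, scale=2.0 / nus[i])
        y[j] = rng.multivariate_normal(mus[i], np.asarray(sigmas[i]) / u)
    return y, labels
```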
6. E-step

The E-step on the (k + 1)th iteration of the EM algorithm requires the calculation of Q(Ψ; Ψ^{(k)}), the current conditional expectation of the complete-data log likelihood function log L_c(Ψ). This E-step can be effected by first taking the expectation of log L_c(Ψ) conditional also on z_1, ..., z_n, as well as x_o, and then finally over the z_j given x_o. It can be seen from (12) to (14) that in order to do this, we need to calculate

E_{\Psi^{(k)}}(Z_{ij} \mid y_j),\qquad E_{\Psi^{(k)}}(U_j \mid y_j, z_j), \qquad \text{and} \qquad E_{\Psi^{(k)}}(\log U_j \mid y_j, z_j) \qquad (15)

for i = 1, ..., g; j = 1, ..., n.

It follows that

E_{\Psi^{(k)}}(Z_{ij} \mid y_j) = \tau_{ij}^{(k)},

where

\tau_{ij}^{(k)} = \frac{\pi_i^{(k)}\, f\bigl(y_j; \mu_i^{(k)}, \Sigma_i^{(k)}, \nu_i^{(k)}\bigr)}{f\bigl(y_j; \Psi^{(k)}\bigr)} \qquad (16)

is the posterior probability that y_j belongs to the ith component of the mixture, using the current fit Ψ^{(k)} (i = 1, ..., g; j = 1, ..., n).

Since the gamma distribution is the conjugate prior distribution for U_j, it is not difficult to show that the conditional distribution of U_j given Y_j = y_j and Z_{ij} = 1 is

U_j \mid y_j, z_{ij} = 1 \ \sim\ \mathrm{gamma}(m_{1i}, m_{2i}), \qquad (17)

where

m_{1i} = \tfrac{1}{2}(\nu_i + p)

and

m_{2i} = \tfrac{1}{2}\bigl\{\nu_i + \delta(y_j, \mu_i; \Sigma_i)\bigr\}. \qquad (18)

From (17), we have that

E(U_j \mid y_j, z_{ij} = 1) = \frac{\nu_i + p}{\nu_i + \delta(y_j, \mu_i; \Sigma_i)}. \qquad (19)

Thus from (19),

E_{\Psi^{(k)}}(U_j \mid y_j, z_{ij} = 1) = u_{ij}^{(k)},

where

u_{ij}^{(k)} = \frac{\nu_i^{(k)} + p}{\nu_i^{(k)} + \delta\bigl(y_j, \mu_i^{(k)}; \Sigma_i^{(k)}\bigr)}. \qquad (20)

To calculate the conditional expectation (15), we need the result that if a random variable R has a gamma(α, β) distribution, then

E(\log R) = \psi(\alpha) - \log \beta, \qquad (21)

where

\psi(s) = \{\partial \Gamma(s)/\partial s\}/\Gamma(s)

is the Digamma function. Applying the result (21) to the conditional density of U_j given y_j and z_{ij} = 1, as specified by (17), it follows that

E_{\Psi^{(k)}}(\log U_j \mid y_j, z_{ij} = 1) = \log u_{ij}^{(k)} + \Bigl\{\psi\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr) - \log\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr)\Bigr\} \qquad (22)

for j = 1, ..., n. The last term on the right-hand side of (22),

\psi\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr) - \log\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr),

can be interpreted as the correction for just imputing the conditional mean value u_{ij}^{(k)} for u_j in log u_j.

On using the results (16), (19) and (22) to calculate the conditional expectation of the complete-data log likelihood from (11), we have that Q(Ψ; Ψ^{(k)}) is given by

Q(\Psi; \Psi^{(k)}) = Q_1(\pi; \Psi^{(k)}) + Q_2(\nu; \Psi^{(k)}) + Q_3(\theta; \Psi^{(k)}), \qquad (23)

where

Q_1(\pi; \Psi^{(k)}) = \sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(k)} \log \pi_i, \qquad (24)

Q_2(\nu; \Psi^{(k)}) = \sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(k)}\, Q_{2j}\bigl(\nu_i; \Psi^{(k)}\bigr), \qquad (25)

and

Q_3(\theta; \Psi^{(k)}) = \sum_{i=1}^{g}\sum_{j=1}^{n} \tau_{ij}^{(k)}\, Q_{3j}\bigl(\xi_i; \Psi^{(k)}\bigr), \qquad (26)

and where, on ignoring terms not involving the ν_i,

Q_{2j}\bigl(\nu_i; \Psi^{(k)}\bigr) = -\log\Gamma\bigl(\tfrac{1}{2}\nu_i\bigr) + \tfrac{1}{2}\nu_i \log\bigl(\tfrac{1}{2}\nu_i\bigr) + \tfrac{1}{2}\nu_i\Bigl\{\log u_{ij}^{(k)} - u_{ij}^{(k)} + \psi\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr) - \log\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr)\Bigr\} \qquad (27)

and

Q_{3j}\bigl(\xi_i; \Psi^{(k)}\bigr) = -\tfrac{1}{2}p\log(2\pi) - \tfrac{1}{2}\log|\Sigma_i| + \tfrac{1}{2}p\log u_{ij}^{(k)} - \tfrac{1}{2}u_{ij}^{(k)}(y_j - \mu_i)^T \Sigma_i^{-1}(y_j - \mu_i). \qquad (28)
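The E-step quantities (16), (20) and (22) can be computed along the following lines. This is a sketch only: it assumes NumPy and a reasonably recent SciPy (1.6 or later provides scipy.stats.multivariate_t), and the array layout and function name are our own conventions rather than the paper's.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import multivariate_t

def e_step(y, pis, mus, sigmas, nus):
    """E-step of Section 6: posterior probabilities tau_ij of (16), weights
    u_ij of (20), and E[log U_j | y_j, z_ij = 1] of (22).
    y is an (n, p) array; mus, sigmas are sequences of length g; nus is a
    length-g array of degrees of freedom.  Returns (n, g) arrays."""
    n, p = y.shape
    g = len(pis)
    nus = np.asarray(nus, dtype=float)
    log_dens = np.empty((n, g))
    delta = np.empty((n, g))
    for i in range(g):
        dev = y - mus[i]
        delta[:, i] = np.einsum('nj,nj->n', dev @ np.linalg.inv(sigmas[i]), dev)
        log_dens[:, i] = np.log(pis[i]) + multivariate_t(
            loc=mus[i], shape=sigmas[i], df=nus[i]).logpdf(y)
    log_dens -= log_dens.max(axis=1, keepdims=True)   # stabilise before normalising
    tau = np.exp(log_dens)
    tau /= tau.sum(axis=1, keepdims=True)              # equation (16)
    u = (nus + p) / (nus + delta)                      # equation (20)
    logu = np.log(u) + digamma((nus + p) / 2) - np.log((nus + p) / 2)   # (22)
    return tau, u, logu
```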
7. M-step

On the M-step at the (k + 1)th iteration of the EM algorithm, it follows from (23) that π^{(k+1)}, ν^{(k+1)}, and θ^{(k+1)} can be computed independently of each other, by separate consideration of (24), (25), and (26), respectively. The solutions for π_i^{(k+1)} and ξ_i^{(k+1)} exist in closed form. Only the updates ν_i^{(k+1)} of the degrees of freedom ν_i need to be computed iteratively.

The mixing proportions are updated by consideration of the first term Q_1(π; Ψ^{(k)}) on the right-hand side of (23). This leads to π_i^{(k+1)} being given by the average of the posterior probabilities of component membership of the mixture. That is,

\pi_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)}\Big/ n \qquad (i = 1, \ldots, g). \qquad (29)

To update the estimates of μ_i and Σ_i (i = 1, ..., g), we need to consider Q_3(θ; Ψ^{(k)}). This is easily undertaken on noting that it corresponds to the log likelihood function formed from n independent observations y_1, ..., y_n with common mean μ_i and covariance matrices Σ_i/u_{i1}^{(k)}, ..., Σ_i/u_{in}^{(k)}, respectively. It is thus equivalent to computing the weighted sample mean and sample covariance matrix of y_1, ..., y_n with weights u_{i1}^{(k)}, ..., u_{in}^{(k)}. Hence

\mu_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)} y_j \Big/ \sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)} \qquad (30)

and

\Sigma_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)} \bigl(y_j - \mu_i^{(k+1)}\bigr)\bigl(y_j - \mu_i^{(k+1)}\bigr)^T}{\sum_{j=1}^{n} \tau_{ij}^{(k)}}. \qquad (31)

The E-step updates the weights u_{ij}^{(k)}, while the M-step effectively chooses μ_i^{(k+1)} and Σ_i^{(k+1)} by weighted least-squares estimation. It can be seen from the form of the equation (30) derived for the MLE of μ_i that, as u_{ij}^{(k)} decreases, the degree of downweighting of an outlier y_j increases. For finite ν_i, as ∥y_j∥ → ∞, the effect of y_j on the ith component location parameter estimate goes to zero, whereas the effect on the ith component scale estimate remains bounded but does not vanish.

Following the proposal of Kent, Tyler and Vardi (1994) in the case of a single component t distribution, we can replace the divisor \sum_{j=1}^{n} \tau_{ij}^{(k)} in (31) by

\sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)}.

This modified algorithm converges faster than the conventional EM algorithm, as reported by Kent, Tyler and Vardi (1994) and Meng and van Dyk (1997) in the case of a single component t distribution (g = 1). In the latter situation, Meng and van Dyk (1997) showed that this modified EM algorithm is optimal among EM algorithms generated from a class of data augmentation schemes. More recently, in the case g = 1, Liu (1997) and Liu, Rubin and Wu (1998) have derived this modified EM algorithm using the PX-EM algorithm.

It can be seen that if the degrees of freedom ν_i are fixed in advance for each component, then the M-step exists in closed form. In this case where ν_i is fixed beforehand, the estimation of the component parameters is a form of M-estimation; see Lange, Little and Taylor (1989, Page 882). However, an attractive feature of the use of the t distribution to model the component distributions is that the degree of robustness as controlled by ν_i can be inferred from the data by computing its ML estimate. In this case, we have to compute also on the M-step the updated estimate ν_i^{(k+1)} of ν_i. On calculating the left-hand side of the equation

\sum_{j=1}^{n} \tau_{ij}^{(k)}\, \partial Q_{2j}\bigl(\nu_i; \Psi^{(k)}\bigr)\big/\partial \nu_i = 0,

it follows that ν_i^{(k+1)} is a solution of the equation

-\psi\bigl(\tfrac{1}{2}\nu_i\bigr) + \log\bigl(\tfrac{1}{2}\nu_i\bigr) + 1 + \frac{1}{n_i^{(k)}}\sum_{j=1}^{n} \tau_{ij}^{(k)}\bigl(\log u_{ij}^{(k)} - u_{ij}^{(k)}\bigr) + \psi\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr) - \log\Bigl(\frac{\nu_i^{(k)} + p}{2}\Bigr) = 0, \qquad (32)

where n_i^{(k)} = \sum_{j=1}^{n} \tau_{ij}^{(k)}.
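A matching sketch of the M-step (29)–(32) follows. The one-dimensional root search of (32) is carried out here with scipy.optimize.brentq over an arbitrary bracket, which may need widening for nearly normal components; the function name and conventions follow the E-step sketch above and are ours, not the paper's.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def m_step(y, tau, u, logu):
    """M-step of Section 7.  tau, u and logu are the (n, g) arrays returned by
    the E-step sketch.  Because logu already contains the correction term of
    (22), the last two terms of (32) are absorbed into the weighted average."""
    n, g = tau.shape
    pis = tau.sum(axis=0) / n                                    # equation (29)
    mus, sigmas, nus = [], [], []
    for i in range(g):
        w = tau[:, i] * u[:, i]
        mu = w @ y / w.sum()                                     # equation (30)
        dev = y - mu
        sigma = (dev * w[:, None]).T @ dev / tau[:, i].sum()     # equation (31)
        # Equation (32): one-dimensional root search for the updated nu_i.
        c = (tau[:, i] * (logu[:, i] - u[:, i])).sum() / tau[:, i].sum()
        nu = brentq(lambda v: -digamma(v / 2) + np.log(v / 2) + 1 + c,
                    0.01, 1000.0)   # illustrative bracket, not a recommendation
        mus.append(mu); sigmas.append(sigma); nus.append(nu)
    return pis, mus, sigmas, np.array(nus)
```

Replacing the divisor tau[:, i].sum() in the covariance line by w.sum() gives the Kent–Tyler–Vardi modification discussed above.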
8. Application of ECM algorithm

For ML estimation of a single t component, Liu and Rubin (1995) noted that the convergence of the EM algorithm is slow for unknown ν and that the one-dimensional search for the computation of ν^{(k+1)} is time consuming. Consequently, they considered extensions of the EM algorithm in the form of the ECM and ECME algorithms; see McLachlan and Krishnan (1997, Section 5.8) and Liu (1997).

We consider the ECM algorithm for this problem, where Ψ is partitioned as (Ψ_1^T, Ψ_2^T)^T, with Ψ_1 = (π_1, ..., π_{g-1}, θ^T)^T and with Ψ_2 equal to ν. On the (k + 1)th iteration of the ECM algorithm, the E-step is the same as given above for the EM algorithm, but the M-step of the latter is replaced by two CM-steps, as follows.

CM-Step 1. Calculate Ψ_1^{(k+1)} by maximizing Q(Ψ; Ψ^{(k)}) with Ψ_2 fixed at Ψ_2^{(k)}; that is, with ν fixed at ν^{(k)}.

CM-Step 2. Calculate Ψ_2^{(k+1)} by maximizing Q(Ψ; Ψ^{(k)}) with Ψ_1 fixed at Ψ_1^{(k+1)}.

But as seen above, Ψ_1^{(k+1)} and Ψ_2^{(k+1)} are calculated independently of each other on the M-step, and so these two CM-steps of the ECM algorithm are equivalent to the M-step of the EM algorithm. Hence there is no difference between the ECM and the EM algorithms here. But Liu and Rubin (1995) used the ECM algorithm to give two modifications that are different from the EM algorithm. These two modifications are a multicycle version of the ECM algorithm and an ECME extension. The multicycle version of the ECM algorithm has an additional E-step between the two CM-steps. That is, after the first CM-step, the E-step is taken with

\Psi = \bigl(\Psi_1^{(k+1)T}, \Psi_2^{(k)T}\bigr)^T

instead of with Ψ = (Ψ_1^{(k)T}, Ψ_2^{(k)T})^T as on the commencement of the (k + 1)th iteration of the ECM algorithm.

The EMMIX algorithm of McLachlan et al. (1999) has an option for the fitting of mixtures of multivariate t components either with or without the specification of the component degrees of freedom. It is available from the software archive StatLib or from the first author's homepage with website address http://www.maths.uq.edu.au/~gjm/.

For a single component t distribution, Liu and Rubin (1994, 1995), Kowalski et al. (1997), Liu (1997), Meng and van Dyk (1997), and Liu, Rubin and Wu (1998) have considered further extensions of the ECM algorithm corresponding to various versions of the ECME algorithm. However, the implementation of the ECME algorithm for mixtures of t distributions is not as straightforward, and so it is not applied here.
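Since the two CM-steps reduce to the single M-step here, a plain EM loop suffices in practice. The driver below is a sketch that alternates the E- and M-step functions sketched above until the observed-data log likelihood (7) stabilises; the starting values must be supplied by the user (for example, from a preliminary k-means or normal mixture fit), and the stopping rule is a simple illustrative choice, not the one used by EMMIX.

```python
import numpy as np
from scipy.stats import multivariate_t

def fit_t_mixture(y, pis, mus, sigmas, nus, max_iter=500, tol=1e-6):
    """Alternate the e_step and m_step sketches from a user-supplied start,
    stopping when the observed-data log likelihood (7) stops increasing
    by more than tol."""
    old_ll = -np.inf
    for _ in range(max_iter):
        tau, u, logu = e_step(y, pis, mus, sigmas, nus)
        pis, mus, sigmas, nus = m_step(y, tau, u, logu)
        dens = sum(pis[i] * multivariate_t(loc=mus[i], shape=sigmas[i],
                                           df=nus[i]).pdf(y)
                   for i in range(len(pis)))
        ll = np.log(dens).sum()
        if ll - old_ll < tol:
            break
        old_ll = ll
    # tau holds the posterior probabilities from the final E-step.
    return pis, mus, sigmas, nus, tau
```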
9. Previous work on M-estimation of mixture components

A common way in which robust fitting of normal mixture models has been undertaken is by using M-estimates to update the component estimates on the M-step of the EM algorithm, as in Campbell (1984) and McLachlan and Basford (1988). In this case, the updated component means μ_i^{(k+1)} are given by (30), but where now the weights u_{ij}^{(k)} are defined as

u_{ij}^{(k)} = \psi\bigl(d_{ij}^{(k)}\bigr)\big/ d_{ij}^{(k)}, \qquad (33)

where

d_{ij}^{(k)} = \Bigl\{\bigl(y_j - \mu_i^{(k)}\bigr)^T \bigl(\Sigma_i^{(k)}\bigr)^{-1} \bigl(y_j - \mu_i^{(k)}\bigr)\Bigr\}^{1/2}

and ψ(s) is Huber's (1964) ψ-function, defined as

\psi(s) = s, \quad |s| \le a; \qquad \psi(s) = \mathrm{sign}(s)\, a, \quad |s| > a, \qquad (34)

for an appropriate choice of the tuning constant a. The ith component-covariance matrix Σ_i^{(k+1)} can be updated as in (31), where u_{ij}^{(k)} is replaced by \{\psi(d_{ij}^{(k)})/d_{ij}^{(k)}\}^2. An alternative to Huber's ψ-function is a redescending ψ-function, for example, Hampel's (1973) piecewise linear function. However, there can be problems in forming the posterior probabilities of component membership, as there is the question as to which parametric family to use for the component p.d.f.'s (McLachlan and Basford 1988, Section 2.8). One possibility is to use the form of the p.d.f. corresponding to the ψ-function adopted. However, in the case of any redescending ψ-function with finite rejection points, there is no corresponding p.d.f. In Campbell (1984), the normal p.d.f. was used, while in the related univariate work in De Veaux and Kreiger (1990), the t density with three degrees of freedom was used, with the location and scale component parameters estimated by the (weighted) median and mean absolute deviation, respectively.

It can therefore be seen that the use of mixtures of t distributions provides a sound statistical basis for formalizing and implementing the somewhat ad hoc approaches that have been proposed in the past. It also provides a framework for assessing the degree of robustness to be incorporated into the fitting of the mixture model, through the specification or estimation of the degrees of freedom ν_i in the t component p.d.f.'s.

As noted in the introduction, the use of t components in place of the normal components will generally give less extreme estimates of the posterior probabilities of component membership of the mixture model. The use of the t distribution in place of the normal distribution leading to less extreme posterior probabilities of group membership was noted in a discriminant analysis context, where the group-conditional densities correspond to the component densities of the mixture model (Aitchison and Dunsmore 1975, Chapter 2). If a Bayesian approach is adopted and the conventional improper or vague prior is specified for the mean and the inverse of the covariance matrix in the normal distribution for each group-conditional density, it leads to the so-called predictive density estimate, which has the form of the t distribution; see McLachlan (1992, Section 3.5).
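For comparison, the two weighting schemes can be written side by side: the Huber-type M-estimation weight ψ(d)/d of (33)–(34), and the weight (20) implied by the t model. The sketch below is illustrative only; the tuning constant a = 2.0 is an arbitrary choice on our part, not a recommendation from the paper.

```python
import numpy as np

def huber_weight(d, a=2.0):
    """M-estimation weight psi(d)/d of (33)-(34) with Huber's psi-function.
    For the covariance update of Section 9 the square of this weight is used."""
    d = np.asarray(d, dtype=float)
    return np.where(np.abs(d) <= a, 1.0, a / np.abs(d))

def t_weight(delta_sq, nu, p):
    """Weight (20) implied by the t component model, as a function of the
    squared Mahalanobis distance; it decays smoothly rather than being
    truncated at a rejection point."""
    return (nu + p) / (nu + delta_sq)
```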

10. Example 1: Simulated noisy data set

One way in which the presence of atypical observations or background noise in the data has been handled when fitting mixtures of normal components has been to include an additional component having a uniform distribution. The support of the latter component is generally specified by the upper and lower extremities of each dimension, defining the rectangular region that contains all the data points. Typically, the mixing proportion for this uniform component is left unspecified, to be estimated from the data. For example, Schroeter et al. (1998) fitted a mixture of three normal components and a uniform distribution to segment magnetic resonance images of the human brain into three regions (gray matter, white matter and cerebro-spinal fluid) in the presence of background noise arising from instrument irregularities and tissue abnormalities.

Here we consider a sample consisting initially of 100 simulated points from a three-component bivariate normal mixture model, to which 50 noise points were added from a uniform distribution over the range −10 to 10 on each variate. The parameters of the mixture model were

\mu_1 = (0, 3)^T, \qquad \mu_2 = (3, 0)^T, \qquad \mu_3 = (-3, 0)^T,

\Sigma_1 = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix}, \qquad \Sigma_3 = \begin{pmatrix} 2 & -0.5 \\ -0.5 & 0.5 \end{pmatrix},

with mixing proportions π_1 = π_2 = π_3 = 1/3. The true grouping is shown in Fig. 1.

Fig. 1. Plot of the true grouping of the simulated noisy data set
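A data set of this form can be generated as follows. This is a sketch under the parameter values just listed; the seed, the random generator and the function name are arbitrary choices of ours, so the result is only of the same kind as the data set analysed in the paper, not a reproduction of it.

```python
import numpy as np

def make_noisy_data(rng=None):
    """Generate a data set like that of Example 1: 100 points from the
    three-component bivariate normal mixture specified above, followed by
    50 uniform noise points on (-10, 10) x (-10, 10).  Noise points are
    labelled 3."""
    rng = np.random.default_rng(0) if rng is None else rng
    mus = [np.array([0.0, 3.0]), np.array([3.0, 0.0]), np.array([-3.0, 0.0])]
    sigmas = [np.array([[2.0, 0.5], [0.5, 0.5]]),
              np.array([[1.0, 0.0], [0.0, 0.1]]),
              np.array([[2.0, -0.5], [-0.5, 0.5]])]
    labels = rng.choice(3, size=100, p=[1 / 3, 1 / 3, 1 / 3])
    signal = np.array([rng.multivariate_normal(mus[i], sigmas[i]) for i in labels])
    noise = rng.uniform(-10.0, 10.0, size=(50, 2))
    y = np.vstack([signal, noise])
    return y, np.concatenate([labels, np.full(50, 3)])
```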

We now consider the clustering obtained by fitting a mixture of three t components with unequal scale matrices but equal degrees of freedom (ν_1 = ν_2 = ν_3 = ν). The values of the weights u_{ij}^{(k)} at convergence, û_{ij}, were examined. The noise points (points 101–150) generally produced much lower û_{ij} values. In this application, an observation y_j is treated as an outlier (background noise) if \sum_{i=1}^{g} \hat z_{ij}\hat u_{ij} is sufficiently small, or equivalently, if

\sum_{i=1}^{g} \hat z_{ij}\, \delta\bigl(y_j, \hat\mu_i; \hat\Sigma_i\bigr) \qquad (35)

is sufficiently large, where

\hat z_{ij} = 1 \ \text{if}\ i = \arg\max_h \hat\tau_{hj},\ \text{and}\ \hat z_{ij} = 0\ \text{otherwise} \qquad (i = 1, \ldots, g;\ j = 1, \ldots, n),

and \hat\tau_{ij} denotes the estimated posterior probability that y_j belongs to the ith component of the mixture.

To decide on how large the statistic (35) must be in order for y_j to be classified as noise, we compared it to the 95th percentile of the chi-squared distribution with p degrees of freedom, where the latter is used to approximate the true distribution of \delta(Y_j, \hat\mu_i; \hat\Sigma_i). The clustering so obtained is displayed in Fig. 2. It compares well with the true grouping in Fig. 1 and with the clustering in Fig. 3, obtained by fitting a mixture of three normal components and an additional uniform component.

Fig. 2. Plot of the result of fitting a mixture of t distributions, with a classification of noise at a significance level of 5%, to the simulated noisy data

Fig. 3. Plot of the result of fitting a three component normal mixture plus a uniform component model to the simulated noisy data

In this particular example, the model of three normal components with an additional uniform component to model the noise works well, since it is the same model used to generate the data in the first instance. However, this model, unlike the t mixture model, cannot be expected to work as well in situations where the noise is not uniform or is unable to be modelled adequately by the uniform distribution.
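The noise rule just described — flag y_j when the Mahalanobis distance (35) to its most likely component exceeds the 95th percentile of the chi-squared distribution with p degrees of freedom — can be sketched as follows, reusing fitted values of the kind returned by the earlier sketches. The function name and array layout are ours.

```python
import numpy as np
from scipy.stats import chi2

def flag_noise(y, tau, mus, sigmas, level=0.95):
    """Classify a point as background noise when the squared Mahalanobis
    distance (35) to its most likely component exceeds the chi-squared(p)
    percentile used in Section 10 to approximate its distribution."""
    n, p = y.shape
    hard = tau.argmax(axis=1)            # the hard assignments z-hat of the text
    dist = np.empty(n)
    for j in range(n):
        i = hard[j]
        dev = y[j] - mus[i]
        dist[j] = dev @ np.linalg.solve(sigmas[i], dev)
    return dist > chi2.ppf(level, df=p), hard
```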

Fig. 4. Plot of the result of fitting a four component normal mixture model to the simulated noisy data

Fig. 5. Scatter plot of the second and third variates of the Blue Crab data set with their true group of origin noted

It is of interest to note that if the number of groups is treated as unknown and a normal mixture fitted, then the number of groups selected via AIC, BIC, and bootstrapping of −2 log λ is four for each of these criteria. The result of fitting a mixture of four normal components is displayed in Fig. 4. Obviously, the additional fourth component is attempting to model the background noise. However, it can be seen from Fig. 4 that this normal mixture-based clustering is still affected by the noise.

11. Example 2: Blue crab data set

To further illustrate the t mixture model-based approach to clustering, we consider now the crab data set of Campbell and Mahon (1974) on the genus Leptograpsus. Attention is focussed on the sample of n = 100 blue crabs, there being n_1 = 50 males and n_2 = 50 females, which we shall refer to as groups G_1 and G_2 respectively. Each specimen has measurements on the width of the frontal lip FL, the rear width RW, the length along the midline CL, the maximum width CW of the carapace, and the body depth BD, in mm. In Fig. 5, we give the scatter plot of the second and third variates with their group of origin noted.

Hawkins' (1981) simultaneous test for multivariate normality and equal covariance matrices (homoscedasticity) suggests it is reasonable to assume that the group-conditional distributions are normal with a common covariance matrix. Consistent with this, fitting a mixture of two t components (with equal scale matrices and equal degrees of freedom) gives only a slightly improved outright clustering over that obtained using a mixture of two normal homoscedastic components. The t mixture model-based clustering results in one cluster containing 32 observations from G_1 and another containing all 50 observations from G_2, along with the remaining 18 observations from G_1; the normal mixture model leads to one additional member of G_1 being assigned to the cluster corresponding to G_2. We note in passing that, although the groups are homoscedastic, a much improved clustering is obtained without restrictions on the scale matrices, with the t and normal mixture model-based clusterings both resulting in 17 fewer misallocations.

In this example, where the normal model for the components appears to be a reasonable assumption, the estimated degrees of freedom for the t components should be large, which they are. The estimate of their common value in the case of equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν) is ν̂ = 22.5; the estimates of ν_1 and ν_2 in the case of unequal scale matrices and unequal degrees of freedom are ν̂_1 = 23.0 and ν̂_2 = 120.3.

The likelihood function can be fairly flat near the maximum likelihood estimates of the degrees of freedom of the t components. To illustrate this, we have plotted in Fig. 6 the profile log likelihood function in the case of equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν), while in Fig. 7 we have plotted the profile likelihood function in the case of unequal scale matrices and unequal degrees of freedom.

Fig. 6. Plot of the profile log likelihood for various values of ν for the Blue Crab data set with equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν)

Fig. 7. Plot of the profile likelihood for ν_1 and ν_2 for the Blue Crab data set with unequal scale matrices and unequal degrees of freedom ν_1 and ν_2
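A profile log likelihood of the kind plotted in Figs 6 and 7 can be traced by refitting with ν held fixed on a grid, that is, by running the E- and M-step sketches above but discarding the ν update. The following sketch assumes those earlier functions are available and is only meant to indicate the computation, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import multivariate_t

def profile_loglik(y, nu_grid, pis0, mus0, sigmas0, n_iter=200):
    """Profile log likelihood over a grid of common degrees of freedom nu:
    for each fixed nu, iterate the e_step/m_step sketches with the nu update
    discarded, then evaluate the observed-data log likelihood (7)."""
    profile = []
    for nu in nu_grid:
        pis = np.array(pis0, dtype=float)
        mus = [np.array(m, dtype=float) for m in mus0]
        sigmas = [np.array(s, dtype=float) for s in sigmas0]
        nus = np.full(len(pis), float(nu))
        for _ in range(n_iter):
            tau, u, logu = e_step(y, pis, mus, sigmas, nus)
            pis, mus, sigmas, _ = m_step(y, tau, u, logu)   # keep nu fixed
        dens = sum(pis[i] * multivariate_t(loc=mus[i], shape=sigmas[i],
                                           df=nu).pdf(y)
                   for i in range(len(pis)))
        profile.append(np.log(dens).sum())
    return np.array(profile)
```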

Table 1. Summary of comparison of error rates when fitting normal and t mixture models with the modification of a single point (a constant added to the second variate of the 25th observation)

Constant   Normal error rate (%)   t mixture error rate (%)     ν̂      û_{1,25}   û_{2,25}
  -15               49                       19                5.76     0.0154     0.0118
  -10               49                       19                6.65     0.0395     0.0265
   -5               21                       20               13.11     0.1721     0.3640
    0               19                       18               23.05     0.8298     1.1394
    5               21                       20               13.11     0.1721     0.3640
   10               50                       20                7.04     0.0734     0.0512
   15               47                       20                5.95     0.0138     0.0183
   20               49                       20                5.45     0.0092     0.0074

Finally, we now compare the t and normal mixture model-based clusterings after some outliers are introduced into the original data set. This was done by adding various values to the second variate of the 25th point. In Table 1, we report the overall misallocation rate of the normal and t mixture-based clusterings for each perturbed version of the original data set. It can be seen that the t mixture-based clustering is robust to these perturbations, unlike the normal mixture-based clustering. It should be noted that in the cases where the constant equals −15, −10, and 20, fitting a normal mixture model results in an outright classification of the outlier into one cluster, with the remaining points allocated to the second cluster, giving an error rate of 49%. In practice the user should identify this situation when interpreting the results and hence remove the outlier, giving an error rate of 20%; however, when the fitting is part of an automatic procedure, this would not be the case.

Concerning the effect of outliers when working with the logs of these data under the normal mixture model (as raised by a referee), it was found that the consequent clustering is still very sensitive to atypical observations introduced as above. However, the assumption of normality is slightly more tenable for the logged data as assessed by Hawkins' (1981) test, and the clustering of the logged data via either the normal or the t mixture models does result in fewer misallocations.

References

Aitchison J. and Dunsmore I.R. 1975. Statistical Prediction Analysis. Cambridge University Press, Cambridge.
Böhning D. 1999. Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease Mapping and Others. Chapman & Hall/CRC, New York.
Campbell N.A. 1984. Mixture models and atypical values. Mathematical Geology 16: 465–477.
Campbell N.A. and Mahon R.J. 1974. A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22: 417–425.
Davé R.N. and Krishnapuram R. 1995. Robust clustering methods: A unified view. IEEE Transactions on Fuzzy Systems 5: 270–293.
Dempster A.P., Laird N.M., and Rubin D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39: 1–38.
De Veaux R.D. and Kreiger A.M. 1990. Robust estimation of a normal mixture. Statistics & Probability Letters 10: 1–7.
Everitt B.S. and Hand D.J. 1981. Finite Mixture Distributions. Chapman & Hall, London.
Frigui H. and Krishnapuram R. 1996. A robust algorithm for automatic extraction of an unknown number of clusters from noisy data. Pattern Recognition Letters 17: 1223–1232.
Gnanadesikan R., Harvey J.W., and Kettenring J.R. 1993. Mahalanobis metrics for cluster analysis. Sankhyā A 55: 494–505.
Green P.J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society B 46: 149–192.
Hampel F.R. 1973. Robust estimation: A condensed partial survey. Z. Wahrscheinlichkeitstheorie verw. Gebiete 27: 87–104.
Hawkins D.M. and McLachlan G.J. 1997. High-breakdown linear discriminant analysis. Journal of the American Statistical Association 92: 136–143.
Huber P.J. 1964. Robust estimation of a location parameter. Annals of Mathematical Statistics 35: 73–101.
Jolion J.-M., Meer P., and Bataouche S. 1995. Robust clustering with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 13: 791–802.
Kent J.T., Tyler D.E., and Vardi Y. 1994. A curious likelihood identity for the multivariate t-distribution. Communications in Statistics – Simulation and Computation 23: 441–453.
Kharin Y. 1996. Robustness in Statistical Pattern Recognition. Kluwer, Dordrecht.
Kosinski A. 1999. A procedure for the detection of multivariate outliers. Computational Statistics and Data Analysis 29: 145–161.
Kowalski J., Tu X.M., Day R.S., and Mendoza-Blanco J.R. 1997. On the rate of convergence of the ECME algorithm for multiple regression models with t-distributed errors. Biometrika 84: 269–281.
Lange K., Little R.J.A., and Taylor J.M.G. 1989. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84: 881–896.
Lindsay B.G. 1995. Mixture Models: Theory, Geometry and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Institute of Mathematical Statistics and the American Statistical Association, Alexandria, VA.
Liu C. 1997. ML estimation of the multivariate t distribution and the EM algorithm. Journal of Multivariate Analysis 63: 296–312.
Liu C. and Rubin D.B. 1994. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81: 633–648.
Liu C. and Rubin D.B. 1995. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica 5: 19–39.
Liu C., Rubin D.B., and Wu Y.N. 1998. Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85: 755–770.
Markatou M. 1998. Mixture models, robustness and the weighted likelihood methodology. Technical Report No. 1998-9, Department of Statistics, Stanford University, Stanford.
Markatou M., Basu A., and Lindsay B.G. 1998. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association 93: 740–750.
McLachlan G.J. 1992. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York.
McLachlan G.J. and Basford K.E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
McLachlan G.J. and Krishnan T. 1997. The EM Algorithm and Extensions. Wiley, New York.
McLachlan G.J. and Peel D. 1998. Robust cluster analysis via mixtures of multivariate t-distributions. In: Amin A., Dori D., Pudil P., and Freeman H. (Eds.), Lecture Notes in Computer Science Vol. 1451. Springer-Verlag, Berlin, pp. 658–666.
McLachlan G.J., Peel D., Basford K.E., and Adams P. 1999. Fitting of mixtures of normal and t-components. Journal of Statistical Software 4(2). (http://www.stat.ucla.edu/journals/jss/).
Meng X.L. and van Dyk D. 1997. The EM algorithm – an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society B 59: 511–567.
Rocke D.M. and Woodruff D.L. 1997. Robust estimation of multivariate location and shape. Journal of Statistical Planning and Inference 57: 245–255.
Rousseeuw P.J., Kaufman L., and Trauwaert E. 1996. Fuzzy clustering using scatter matrices. Computational Statistics and Data Analysis 23: 135–151.
Rubin D.B. 1983. Iteratively reweighted least squares. In: Kotz S., Johnson N.L., and Read C.B. (Eds.), Encyclopedia of Statistical Sciences Vol. 4. Wiley, New York, pp. 272–275.
Schroeter P., Vesin J.-M., Langenberger T., and Meuli R. 1998. Robust parameter estimation of intensity distributions for brain magnetic resonance images. IEEE Transactions on Medical Imaging 17: 172–186.
Smith D.J., Bailey T.C., and Munford G. 1993. Robust classification of high-dimensional data using artificial neural networks. Statistics and Computing 3: 71–81.
Sutradhar B. and Ali M.M. 1986. Estimation of parameters of a regression model with a multivariate t error variable. Communications in Statistics – Theory and Methods 15: 429–450.
Titterington D.M., Smith A.F.M., and Makov U.E. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
Zhuang X., Huang Y., Palaniappan K., and Zhao Y. 1996. Gaussian density mixture modeling, decomposition and applications. IEEE Transactions on Image Processing 5: 1293–1302.