6 Finite Mixture Models with Normal Components
6.1 Finite Mixtures of Normal Distributions
Data for which two or more normal distributions are mixed occur frequently in many areas of applied statistics such as biology, economics, marketing, medicine, or physics.
6.1.1 Model Formulation
A frequently used finite mixture model for univariate continuous data $y_1, \ldots, y_N$ is to assume that the observations are i.i.d. realizations from a random variable $Y$, following a mixture of $K$ univariate normal distributions. The density of this distribution is given by

\[
p(y \mid \vartheta) = \eta_1 f_N(y; \mu_1, \sigma_1^2) + \cdots + \eta_K f_N(y; \mu_K, \sigma_K^2), \tag{6.1}
\]
with $f_N(y; \mu_k, \sigma_k^2)$ being the density of a univariate normal distribution. Finite mixtures of univariate normal distributions are generically identifiable (Teicher, 1963). In its most general form, the model is parameterized in terms of $3K - 1$ distinct model parameters $\vartheta = (\theta_1, \ldots, \theta_K, \eta_1, \ldots, \eta_K)$, where $\theta_k = (\mu_k, \sigma_k^2)$. More specific finite mixture models are obtained by putting constraints either on $\mu_1, \ldots, \mu_K$ or $\sigma_1^2, \ldots, \sigma_K^2$; see also Subsection 6.4.1.

Mixtures of normals are easily extended to deal with multivariate continuous observations $\mathbf{y}_1, \ldots, \mathbf{y}_N$, where $\mathbf{y}_i$ is an $r$-dimensional vector. Typically the various elements $y_{i1}, \ldots, y_{ir}$ of $\mathbf{y}_i$ measure $r$ features for a unit $i$ drawn from a population. A frequently used finite mixture model for multivariate data $\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$ is to assume that the observations are i.i.d. realizations from a multivariate random variable $\mathbf{Y}$ of dimension $r$, arising from a mixture of $K$ multivariate normal distributions. The density of the distribution of $\mathbf{Y}$ is given by

\[
p(\mathbf{y} \mid \vartheta) = \eta_1 f_N(\mathbf{y}; \boldsymbol{\mu}_1, \Sigma_1) + \cdots + \eta_K f_N(\mathbf{y}; \boldsymbol{\mu}_K, \Sigma_K), \tag{6.2}
\]
with $f_N(\mathbf{y}; \boldsymbol{\mu}_k, \Sigma_k)$ being the density of a multivariate normal distribution with mean $\boldsymbol{\mu}_k$ and variance–covariance matrix $\Sigma_k$. Finite mixtures of multivariate normal distributions are generically identifiable (Yakowitz and Spragins, 1968).

A multivariate mixture of normal distributions with general variance–covariance matrices $\Sigma_1, \ldots, \Sigma_K$ is highly parameterized in terms of $K(r + r(r+1)/2 + 1) - 1$ distinct model parameters. When fitting a mixture of 3 multivariate normal distributions to a data set containing 10-dimensional observations $\mathbf{y}_i$, one has to estimate as many as 197 distinct parameters! Thus an unconstrained multivariate mixture may turn out to be too general in various situations. Other interesting multivariate finite mixture models are obtained by putting certain constraints on the variance–covariance matrices $\Sigma_1, \ldots, \Sigma_K$. Such finite mixture models are discussed in Subsection 6.4.1.
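The parameter count $K(r + r(r+1)/2 + 1) - 1$ can be made concrete with a short sketch (the function name here is ours, purely for illustration):

```python
# Illustrative sketch: count the distinct free parameters of an unconstrained
# K-component mixture of r-variate normal distributions.
def mixture_param_count(K: int, r: int) -> int:
    # Per component: r mean entries, r(r+1)/2 distinct covariance entries,
    # and 1 weight; subtract 1 because the weights must sum to one.
    return K * (r + r * (r + 1) // 2 + 1) - 1

# The example from the text: K = 3 components, r = 10 dimensions.
print(mixture_param_count(3, 10))  # → 197
```

Each component contributes $r + r(r+1)/2 + 1$ parameters; the single constraint $\sum_k \eta_k = 1$ removes one degree of freedom.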
Capturing Between- and Within-Group Heterogeneity

For a multivariate mixture, the expected value $\mathrm{E}(\mathbf{Y} \mid \vartheta) = \sum_{k=1}^{K} \eta_k \boldsymbol{\mu}_k$ defines the overall center of the distribution of $\mathbf{Y}$, whereas the spread of the distribution is measured by the variance–covariance matrix of $\mathbf{Y}$, which may be written as
\[
\begin{aligned}
\mathrm{Var}(\mathbf{Y} \mid \vartheta) &= \mathrm{E}(\mathrm{Var}(\mathbf{Y} \mid S, \vartheta)) + \mathrm{Var}(\mathrm{E}(\mathbf{Y} \mid S, \vartheta)) \\
&= \sum_{k=1}^{K} \eta_k \Sigma_k + \sum_{k=1}^{K} \eta_k \bigl(\boldsymbol{\mu}_k - \mathrm{E}(\mathbf{Y} \mid \vartheta)\bigr)\bigl(\boldsymbol{\mu}_k - \mathrm{E}(\mathbf{Y} \mid \vartheta)\bigr)'.
\end{aligned}
\tag{6.3}
\]
Thus the total variance $\mathrm{Var}(\mathbf{Y} \mid \vartheta)$ arises from two sources of variability, namely within-group heterogeneity and between-group heterogeneity. The measure of within-group heterogeneity is a weighted average of the within-group variability $\Sigma_k$. The measure of between-group heterogeneity is based on a weighted distance of the group means $\boldsymbol{\mu}_k$ from the overall mean $\mathrm{E}(\mathbf{Y} \mid \vartheta)$. In both cases the weights are equal to the group sizes.

For a mixture where $\sum_{k=1}^{K} \eta_k \Sigma_k$ is much smaller than $\mathrm{Var}(\mathbf{Y} \mid \vartheta)$ in a suitable matrix norm, most of the variability of $\mathbf{Y}$ results from unobserved between-group heterogeneity. Common measures of how much variability may be attributed to unobserved between-group heterogeneity are the following two coefficients of determination, derived from (6.3):

\[
R_t^2(\vartheta) = 1 - \operatorname{tr}\Bigl(\sum_{k=1}^{K} \eta_k \Sigma_k\Bigr) \Big/ \operatorname{tr}\bigl(\mathrm{Var}(\mathbf{Y} \mid \vartheta)\bigr), \tag{6.4}
\]

\[
R_d^2(\vartheta) = 1 - \Bigl|\sum_{k=1}^{K} \eta_k \Sigma_k\Bigr| \Big/ \bigl|\mathrm{Var}(\mathbf{Y} \mid \vartheta)\bigr|. \tag{6.5}
\]
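As an illustrative sketch (the function names are ours, not from the text), the decomposition (6.3) and the coefficients (6.4) and (6.5) can be evaluated directly from the mixture parameters:

```python
import numpy as np

def mixture_moments(eta, mu, Sigma):
    """Within/between split of Var(Y|theta) for a multivariate normal mixture."""
    eta = np.asarray(eta); mu = np.asarray(mu); Sigma = np.asarray(Sigma)
    mean = eta @ mu                              # E(Y|theta) = sum_k eta_k mu_k
    within = np.einsum('k,kij->ij', eta, Sigma)  # sum_k eta_k Sigma_k
    dev = mu - mean                              # mu_k - E(Y|theta), row per group
    between = np.einsum('k,ki,kj->ij', eta, dev, dev)
    return mean, within, within + between        # total = Var(Y|theta), eq. (6.3)

def r2_coefficients(eta, mu, Sigma):
    """Coefficients of determination (6.4) and (6.5)."""
    _, within, total = mixture_moments(eta, mu, Sigma)
    r2_t = 1 - np.trace(within) / np.trace(total)
    r2_d = 1 - np.linalg.det(within) / np.linalg.det(total)
    return r2_t, r2_d

# Example: two well-separated bivariate components with unit covariances.
eta = [0.5, 0.5]
mu = [[-2.0, 0.0], [2.0, 0.0]]
Sigma = [np.eye(2), np.eye(2)]
print(r2_coefficients(eta, mu, Sigma))  # R_t^2 ≈ 0.667, R_d^2 = 0.8
```

For this example the within matrix is the identity and the between matrix adds 4 to the first diagonal entry, so $R_t^2 = 1 - 2/6$ and $R_d^2 = 1 - 1/5$; with equal means both coefficients collapse to zero, as stated below.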
For the limiting case of equal means, $\boldsymbol{\mu}_1 = \cdots = \boldsymbol{\mu}_K$, these coefficients are equal to 0, whereas they are close to one for well-separated groups; see Figure 6.1 for an illustration.
[Figure 6.1 appears here: four scatter-plot panels of the simulated observations, each on axes from $-4$ to $4$, annotated with the resulting values of $R_t^2$ and $R_d^2$ for that configuration of means.]

Fig. 6.1. 1000 observations simulated from a spherical mixture of three bivariate normal distributions with $\eta_1 = \eta_2 = \eta_3 = 1/3$, $(\sigma_1^2, \sigma_2^2, \sigma_3^2) = (1, 0.4, 0.04)$, with different locations for the means; $R_t^2$ and $R_d^2$ are coefficients of unobserved heterogeneity explained by the mixture, defined in (6.4) and (6.5)
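A simulation of the kind underlying Figure 6.1 can be sketched as follows; the component means below are placeholders chosen for illustration, since the exact locations used in each panel are not stated in the caption:

```python
import numpy as np

# Sketch: draw N points from a spherical three-component bivariate normal
# mixture with the weights and variances given in the caption of Figure 6.1.
rng = np.random.default_rng(0)
N = 1000
eta = np.array([1/3, 1/3, 1/3])           # component weights
sigma2 = np.array([1.0, 0.4, 0.04])       # spherical variances from the caption
mu = np.array([[-2.0, -2.0],              # hypothetical component means,
               [ 2.0, -2.0],              # NOT the ones used in the book's figure
               [ 0.0,  2.0]])

# Two-stage sampling: first draw the component label S_i with P(S_i = k) = eta_k,
# then y_i | S_i = k ~ N(mu_k, sigma2_k * I).
labels = rng.choice(3, size=N, p=eta)
y = mu[labels] + np.sqrt(sigma2[labels])[:, None] * rng.standard_normal((N, 2))
print(y.shape)  # → (1000, 2)
```

Scattering the rows of `y` reproduces a panel of the figure; moving the rows of `mu` closer together or further apart changes $R_t^2$ and $R_d^2$ accordingly.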
6.1.2 Parameter Estimation for Mixtures of Normals
Suppose that a data set $y$ is available, which consists of $N$ i.i.d. observations of a random variable distributed according to a mixture of normal distributions; thus for univariate data $y = \{y_1, \ldots, y_N\}$, whereas for multivariate data $\mathbf{y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$. This subsection is concerned with the estimation of the component parameters and the weight distribution $\eta = (\eta_1, \ldots, \eta_K)$ of the underlying mixture distribution for fixed $K$, based on the data. For univariate mixtures the component means $\mu = (\mu_1, \ldots, \mu_K)$ and the component variances $\sigma^2 = (\sigma_1^2, \ldots, \sigma_K^2)$ have to be estimated; for multivariate mixtures the component mean vectors $\boldsymbol{\mu} = (\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K)$ and the component variance–covariance matrices $\Sigma = (\Sigma_1, \ldots, \Sigma_K)$ have to be estimated.

Pioneering work on the estimation of mixtures of normals is based on the method of moments (Pearson, 1894; Charlier and Wicksell, 1924). The method of moments estimator suggested by Day (1969) turned out to be inefficient compared to maximum likelihood estimation, both for univariate and multivariate mixtures of normals. A more attractive method of moments estimator was suggested by Lindsay (1989) for univariate and by Lindsay and Basak (1993) for multivariate mixtures of normals.

Maximum likelihood (ML) estimation was used for a mixture of two univariate normal distributions with $\sigma_1^2 = \sigma_2^2$ as early as Rao (1948); further pioneering work was done by Hasselblad (1966). Wolfe (1970) suggested an iterative scheme for practical ML estimation of multivariate mixtures of normals that is essentially an early variant of the EM algorithm.
EM Algorithm
Nowadays, the EM algorithm, which was introduced in Subsection 2.4.4, is the preferred method for practical ML estimation of univariate and multivariate mixtures. For univariate mixtures of normals the M-step reads: