6 Finite Mixture Models with Normal Components

6.1 Finite Mixtures of Normal Distributions

Data for which two or more normal distributions are mixed occur frequently in many areas of applied such as biology, , marketing, , or physics.

6.1.1 Model Formulation

A frequently used finite for univariate continuous data y1,..., yN is to assume that the observations are i.i.d. realizations from a Y , following a mixture of K univariate normal distributions. The density of this distribution is given by |ϑ 2 ··· 2 p(y )=η1fN (y; μ1,σ1)+ + ηK fN (y; μK ,σK ), (6.1)

2 with fN (y; μk,σk) being the density of a univariate . Fi- nite mixtures of univariate normal distributions are generically identifiable (Teicher, 1963). In its most general form, the model is parameterized in terms of 3K − 1 distinct model parameters ϑ =(θ1,...,θK ,η1,...,ηK ), where θ 2 k =(μk,σk). More specific finite mixture models are obtained by putting 2 2 constraints either on μ1,...,μK or σ1,...,σK ; see also Subsection 6.4.1. Mixtures of normals are easily extended to deal with multivariate contin- uous observations y1,...,yN , where yi is an r-dimensional vector. Typically the various elements yi1,...,yir of yi measure r features for a unit i drawn from a population. A frequently used finite mixture model for multivariate data y =(y1,...,yN ) is to assume that the observations are i.i.d. realiza- tions from a multivariate random variable Y of dimension r, arising from a mixture of K multivariate normal distributions. The density of the distribu- tion of Y is given by |ϑ μ ··· μ p(y )=η1fN (y; 1, Σ1)+ + ηK fN (y; K , ΣK ), (6.2) 170 6 Finite Mixture Models with Normal Components

μ with fN (y; k, Σk) being the density of a multivariate normal distribution μ with k and –covariance matrix Σk. Finite mixtures of multi- variate normal distributions are generically identifiable (Yakowitz and Spra- gins, 1968). A multivariate mixture of normal distributions with general variance– covariance matrices Σ1,...,ΣK is highly parameterized in terms of K(r + r(r +1)/2+1)− 1 distinct model parameters. When fitting a mixture of 3 multivariate normal distributions to a data set containing 10-dimensional ob- servations yi, one has to estimate as many as 198 distinct parameters! Thus an unconstrained multivariate mixture may turn out to be too general in various situations. Other interesting multivariate finite mixture models are obtained by putting certain constraints on the variance–covariance matrices Σ1,...,ΣK . Such finite mixture models are discussed in Subsection 6.4.1.

Capturing Between- and Within-Group Heterogeneity  |ϑ K μ For a multivariate mixture, the expected value E(Y )= k=1 ηk k defines the overall center of the distribution of Y, whereas the spread of the distri- bution is measured by the variance–covariance matrix of Y, which may be written as

Var(Y|ϑ) = E(Var(Y|S, ϑ)) + Var(E(Y|S, ϑ)) (6.3) K K μ − |ϑ μ − |ϑ = ηkΣk + ηk( k E(Y ))( k E(Y )) . k=1 k=1

Thus the total variance Var(Y|ϑ) arises from two sources of variability, namely within-group heterogeneity and between-group heterogeneity. The measure of within-group heterogeneity is a weighted average of within-group variability Σk. The measure of between-group heterogeneity is based on a weighted dis- μ |ϑ tance of the group mean k from the overall mean E(Y ). In both cases the weights are equal to group sizes. K |ϑ For a mixture, where k=1 ηkΣk is much smaller than Var(Y )ina suitable matrix norm, most of the variability of Y results from unobserved between-group heterogeneity. Common measures of how much variability may be addressed to unobserved between-group heterogeneity, are the following two coefficients of determination, derived from (6.3),   2 ϑ − K |ϑ Rt ( )=1 tr =1 ηkΣk /tr (Var(Y )) , (6.4) ! k ! ! ! 2 ϑ − ! K ! | |ϑ | Rd( )=1 k=1 ηkΣk / Var(Y ) . (6.5)

μ ··· μ For the limiting case of equal , 1 = = K , these coefficients are equal to 0, whereas these coefficients are close to one for well-separated groups; see Figure 6.1 for an illustration. 6.1 Finite Mixtures of Normal Distributions 171

4 4

3 3

2 2

1 1

0 0

−1 −1

−2 −2 R2=0.618 t R2=0.832 R2=0.832 −3 t −3 d R2=0.969 d −4 −4 −4 −2 0 2 4 −4 −2 0 2 4

4 4

3 3

2 2

1 1

0 0

−1 −1

−2 −2 R2=0 2t 2 R =0 R =0.106 d −3 2t −3 R =0.194 d −4 −4 −4 −2 0 2 4 −4 −2 0 2 4

Fig. 6.1. 1000 observations simulated from a spherical mixture of three bivariate 2 2 2 normal distributions with η1 = η2 = η3 =1/3, (σ1 ,σ2 ,σ3 )=(1, 0.4, 0.04); with 2 2 different locations for the means; Rt and Rd are coefficients of unobserved hetero- geneity explained by the mixture, defined in (6.4) and (6.5)

6.1.2 Parameter Estimation for Mixtures of Normals

Suppose that a data set y is available, which consists of N i.i.d. observations of a random variable distributed according to a mixture of normal distributions, thus for univariate data y = {y1,...,yN }, whereas for multivariate data y = {y1,...,yN }. This subsection is concerned with the estimation of the component pa- rameters and the weight distribution η =(η1,...,ηK ) of the underlying for fixed K, based on the data y. For univariate mix- tures the component means μ =(μ1,...,μK ) and the component vari- 172 6 Finite Mixture Models with Normal Components

σ2 2 2 ances =(σ1,...,σK ); for multivariate mixtures the component mean μ μ μ vectors =( 1,..., K ) and the component variance–covariance matrices Σ =(Σ1,...,ΣK ) have to be estimated. Pioneering work on the estimation of mixtures of normals is based on the method of moments (Pearson, 1894; Charlier and Wicksell, 1924). The method of moments estimator suggested by Day (1969) turned out to be inefficient compared to maximum likelihood estimation, both for univariate as well as multivariate mixtures of normals. A more attractive method of moments estimator was suggested by Lindsay (1989) for univariate and by Lindsay and Basak (1993) for multivariate mixtures of normals. Maximum likelihood (ML) estimation was used for a mixture of two uni- 2 2 variate normal distributions with σ1 = σ2 as early as Rao (1948); further pioneering work was done by Hasselblad (1966). Wolfe (1970) suggested an iterative scheme for practical ML estimation of multivariate mixtures of nor- mals that is essentially an early variant of the EM algorithm.

EM Algorithm

Nowadays, the EM algorithm, which was introduced in Subsection 2.4.4, is the preferred method for practical ML estimation of univariate and multivariate mixtures. For univariate mixtures of normals the M-step reads:

N ( ) 1 ( ) μ m = Dˆ m y , k n ik i k i=1 N  2 1 ( ) ( ) (σ2)(m) = Dˆ m y − μ m , k n ik i k k i=1

(m) where Dˆ ik and nk have been defined in (2.32) and (2.33), respectively. For multivariate mixtures the M-step reads:

N ( ) 1 ( ) μ m = Dˆ m y , k n ik i k i=1 N    ( ) 1 ( ) ( ) ( ) Σ m = Dˆ m y − μ m y − μ m . k n ik i k i k k i=1 A certain difficulty with ML estimation is that the EM algorithm breaks down, 2 (m) (m) whenever (σk) is (numerically) zero or Σk is singular or nearly singular, (m) which happens when Dˆ ik is close to zero for too many observations. Then (m+1) at the next iteration the computation of Dˆ ik through (2.32) is no longer possible. Such difficulties arise in particular if the EM algorithm is applied to a finite mixture of normals overfitting the number of components. 6.1 Finite Mixtures of Normal Distributions 173

Unboundedness of the Mixture Likelihood Function

A further difficulty with ML estimation, first noted by Kiefer and Wolfowitz (1956) for univariate mixtures of normals, is that the mixture likelihood func- tion   N K |μ σ2 η 2 p(y , , )= ηkfN (yi; μk,σk) (6.6) i=1 k=1 is unbounded and has many local spurious modes; see Subsection 6.1.3 for an illustration. As first noted by Day (1969), the unboundedness of the mixture likelihood function is also relevant for the multivariate mixtures of normals, as each observation yi gives rise to a singularity on the boundary of the parameter space. Thus the ML estimator as global maximizer of the mixture likelihood func- tion does not exist. Nevertheless, statistical theory outlined in Kiefer (1978) guarantees that a particular local maximizer of the mixture likelihood func- tion is consistent, efficient, and asymptotically normal if the mixture is not overfitting. Several local maximizers may exist for a given sample, and a ma- jor difficulty with the ML approach is to identify if the correct one has been found. Titterington et al. (1985, pp.97), for instance, refer to various empiri- cal studies that report convergence to singularities and spurious local modes using the EM algorithm, even if the true parameter was used as the starting value. To avoid these problems, Hathaway (1985) considers constrained ML esti- mation of univariate mixtures of normals based on the inequality constraint σ min k ≥ c>0, (6.7) k,j σj and proves strong consistency of the resulting estimator. For multivariate mixtures of normals, Hathaway (1985) suggests constraining all eigenvalues −1 of ΣkΣj to be greater than a positive constraint.

Bayesian Parameter Estimation

Geisser and Cornfield (1963), Geisser (1964), and Binder (1978) give pioneer- ing papers discussing a Bayesian approach to classification, clustering, and discrimination analysis based on multivariate mixtures, however, practical ap- plication was in general infeasible at that time. Lavine and West (1992) discuss practical Bayesian estimation of multivariate mixtures using data augmenta- tion and . This approach is described in detail in Section 6.3. The main difference between the ML approach and the Bayesian approach lies in the use of a proper prior distribution on the component parameter, in particular on the component , which usually takes the form of an 174 6 Finite Mixture Models with Normal Components

2 ∼G−1 inverted Gamma prior, σk (c0,C0), for univariate mixtures and an in- −1 ∼W verted Wishart prior, Σk r (c0, C0) with c0 > (r +1)/2, for multivariate mixtures. This has two desirable effects in comparison to ML estimation. First, within Gibbs sampling, which may be seen as a kind of Bayesian 2 version of the EM algorithm, the conditional posterior distribution of σk or Σk is always proper, and sampling yields a well-defined variance even if the corresponding group is empty or contains too few observations to obtain a well-defined sample variance or variance–covariance matrix. Second, as mentioned above, it is complete ignorance about the variance ratio that causes the unboundedness of the mixture likelihood function, and again the Bayesian approach is helpful in this respect, as it allows us to include some prior information on this ratio, however vague this might be. In com- parison to the mixture likelihood function the posterior density is much more regular. It is shown in Subsection 6.2.2 that for univariate mixtures of nor- 2 −2 mals the ratio σkσj is bounded away from 0 under suitable inverted Gamma priors. For multivariate mixtures of normals the introduction of the inverted −1 Wishart prior has a smoothing effect on the eigenvalues of ΣkΣj , which will be bounded away from 0. Hence the introduction of a proper prior on the component variances is related to the constraints introduced by Hathaway (1985).

6.1.3 The Kiefer–Wolfowitz Example

−14 −14 x 10 x 10 1 8 0.8 6

|y) 0.6 ) 2 2 σ , σ , μ

( 0.4 μ 4 π

p(y| 0.2 2 0 0 3 2 2 1 1 0 0 0 2 −1 2 3 −2 −2 4 4 σ −3 6 μ 2 μ σ 2

Fig. 6.2. Kiefer–Wolfowitz Example — simulated data set (N = 20); surface plot of the mixture likelihood function log p(y|μ, σ2) (left-hand side) in comparison to the posterior density p(μ, σ2|y) under a G (1, 4)-prior (right-hand side)

We reconsider the following mixture of two normal distributions, 2 Y ∼ (1 − η2)N (μ, 1) + η2N μ, σ2 , (6.8) 6.1 Finite Mixtures of Normal Distributions 175

−20 μ= 0 −22 μ=y 10 −25 −24

) −26 2 −30 ) 2 σ , σ , μ μ −28 −35 log f(y| log f(y| −30 −40

−32 2 0 −34 −2 0 −4 −0.5 −36 −6 −1 −20 −15 −10 −5 0 5 log σ log σ 2 μ 2

−20 −26

−28 σ =0.04977 2 −40 σ = 2 −30 2 −32 −60 ) )

−342 2 σ σ , , σ μ

μ =1e−007 2 −80 σ = 2 −36 2 log f(y| log f(y| −38 −100 −40

−42 −120 −44

−140 −46 −3 −2 −1 0 1 2 3 −1.5 −1 −0.5 0 0.5 μ μ

Fig. 6.3. Kiefer–Wolfowitz Example — simulated data set (N = 20); various detailed views of the log mixture likelihood function log p(y|μ, σ2); top left: zoom of surface plot for very small value of σ2 (the vertical line indicates the position of true the true parameters); top right: profile over σ2 for μ fixed (full line: μ = μ =0, dashed line: μ = y10 = −0.26627; bottom: profile over μ for σ2 fixed (dashed line: true −7 σ2 = σ2 = 2, full line: σ2 =10 (left-hand side) and σ2 =0.05 (right-hand side)); the bar diagram at the bottom of both figures shows the position of the data

2 where η2 is fixed, whereas μ and σ2 are unknown, which was used by Kiefer and Wolfowitz (1956) to show that each observation in an arbitrary data set y =(y1,...,yN ), of arbitrary size N, generates a singularity in the mixture 2 likelihood function (6.6). Whenever μ = yi, then as σ2 → 0, the mixture 2 likelihood p(y|μ = yi,σ2) is dominated by a term proportional to a constant 2 times 1/σ2. Therefore:

lim p(y|μ = y ,σ2)=∞. 2 i 2 σ2 →0

For illustration, we simulated N = 20 observation from model (6.8), with 2 η2 =0.2, μ = 0, and σ2 = 4. The observations, ordered by size are: 176 6 Finite Mixture Models with Normal Components

–2.541 –1.664 –1.6636 –1.4242 –1.3335 –1.029 –0.70355 –0.54121 –0.27467 –0.26627 –0.26401 –0.24944 –0.011286 –0.00081703 0.2189 0.26166 0.56176 0.79315 1.0727 2.4269 The left-hand side plot in Figure 6.2 shows the surface of the mixture likelihood function p(y|μ, σ2), regarded as a function of μ and σ2 (rather than 2 σ2). There is a local around the true value (μ, σ2)=(0, 2), however, the mixture likelihood is unbounded over a region corresponding to very small values of σ2. Figure 6.3 zooms on this part of the parameter space, where we find many spurious local modes in the surface plot of log p(y|μ, σ2). The profile over σ2, for μ fixed at observation y10 = −0.26627, demonstrates that the log true mixture likelihood is unbounded as σ2 goes to 0, whereas for μ = μ =0 (and any other value μ = yi), the log mixture likelihood is bounded as σ2 goes −7 to 0. The profile over μ for σ2 fixed at the very small value 10 shows that the mixture likelihood surface has a spike whenever μ is close to one of the observations. For σ2 fixed at the slightly larger value 0.05, which is still far from the true value, the profile over μ shows that any subset of observations which are sufficiently close to each other generate a local mode in the mixture likelihood. The local mode around μ equal to the average of the three similar observations y9,y10,y11 has a likelihood value that is even bigger than the local mode around the true parameter value (μ, σ2)=(0, 2). This pathological part of the parameter corresponds to mixtures that fit one component to a small group of similar observations, whereas all other observations are assumed to belong the second component. The surface plot in Figure 6.2 shows that for this small data set the region corresponding to this pathological part of the likelihood is not well separated from the modal region around the true value. Thus the EM algorithm, or any other numerical method for maximizing this likelihood, certainly has a high risk of being trapped at one of the spurious local modes, or of diverging to 0, even if started at the true value. Let us now assume that the mixture likelihood p(y|μ, σ2) is combined 2 2 2 −1 with the prior p(μ, σ2) ∝ p(σ2), where σ2 ∼G (1, 4). We stay noninforma- tive about μ, which is a parameter common to both groups. The right-hand side of Figure 6.2 shows the posterior density p(μ, σ2|y) under this prior for the simulated data set with N = 20. As shown in Subsection 6.2.2, the in- verted Gamma prior introduces a constraint comparable to (6.7), and keeps the variance sufficiently bounded away from 0 to cut out all singularities and local modes, which were apparent for the comparable surface of the mixture likelihood p(y|μ, σ2) in Figure 6.2.

6.1.4 Applications of Mixture of Normal Distributions

Statistical modeling based on finite mixtures of normal distributions has a long history dating back to the application for detection in Newcomb 6.2 Bayesian Estimation of Univariate Mixtures of Normals 177

(1886) and modeling unobserved discrete factors in a biological sample in Pearson (1894). Finite mixtures of normal distributions found widespread application in many areas of applied statistics such as biology, marketing, medicine, physics, and economics. More recently, mixture models are applied in bioinformatics and , see, for instance, Delmar et al. (2005) and Tadesse et al. (2005). Given such a tremendous amount of applied work, it seems impossible to provide an exhaustive list of important references. A comprehensive review of applications of univariate normal mixture models until the mid-eighties appears in Titterington et al. (1985, Chapter 2). More recent references are to be found in the comprehensive bibliography of McLachlan and Peel (2000). Additional useful references appear in B¨ohning (2000) and in a special issue of the journal Computational Statistics and Data Analysis (2003) publishing selected papers from a 2001 Workshop on finite mixture modeling in Hamburg. Chapter 7 discusses several statistical applications of mixtures of normal distributions in such diverse areas as cluster analysis, outlier detection, dis- criminant analysis, and density estimation.

6.2 Bayesian Estimation of Univariate Mixtures of Normals

6.2.1 When the Allocations Are Known

2 This subsection derives the posterior distribution of μk,σk given the complete data S, y, by combining the information from all observations in group k. The derivation is based on a standard result in Bayesian inference outlined, for instance, in Antelman (1997) and Box and Tiao (1973). Relevant group-specific quantities are the number Nk(S) of observations 2 in group k, the group mean yk(S), and the within-group variance sy,k(S):

Nk(S)=#{i : Si = k}, (6.9) 1 yk(S)= yi, (6.10) Nk(S) i:Si=k 2 1 − 2 sy,k(S)= (yi yk(S)) . (6.11) Nk(S) i:Si=k Note that any of these quantities depends on the classification S. Assume that for observation yi the allocation Si is equal to k. Then the observational model for observation yi is a normal distribution with mean μk 2 and variance σk, and the contribution of yi to the complete-data likelihood function p(y|μ, σ2, S) is equal to:  1 1 exp − (y − μ )2 . 2 2σ2 i k 2πσk k 178 6 Finite Mixture Models with Normal Components

After combining the information from all observations, we obtain a complete- data likelihood function with K independent factors, each carrying all infor- mation about the parameters in a certain group:

p(y|μ, σ2, S) = (6.12)    (S) 2 K  1 Nk / 1 (y − μ )2 exp − i k . 2πσ2 2 σ2 k=1 i:Si=k k i:Si=k k

In a Bayesian analysis each of these factors is combined with a prior. When 2 holding the variance σk fixed, the complete-data likelihood function, regarded as a function of μk, is the kernel of a univariate normal distribution. Under ∼N 2 the μk (b0,B0), the posterior density of μk given σk and Nk(S) observations assigned to this group, is again a density from the normal | 2 ∼N distribution, μk σk, S, y (bk(S),Bk(S)), where −1 −1 −2 Bk(S) = B0 + σk Nk(S), (6.13) −2 −1 bk(S)=Bk(S)(σk Nk(S)yk(S)+B0 b0), (6.14) with Nk(S)yk(S) being defined as zero for an empty group with Nk(S)=0. When holding the mean μk fixed, the complete-data likelihood function, 2 regarded as a function in σk, is the kernel of an inverted Gamma density; see Appendix A.1.6 for more details on this distribution family. Under the 2 ∼G−1 2 conjugate inverted Gamma prior σk (c0,C0), the posterior density of σk given μk and Nk(S) observations assigned to this group is again a density from 2| ∼G−1 an inverted , σk μk, S, y (ck(S),Ck(S)), where 1 c (S)=c0 + N (S), (6.15) k 2 k 1 2 C (S)=C0 + (y − μ ) . (6.16) k 2 i k i:Si=k

2 If both μk and σk are unknown, a closed-form solution for the joint posterior 2| 2 p(μk,σk S, y) exists only if the prior variance of μk depends on σk through 2 μ σ2| B0,k = σk/N0. Then the joint posterior p( , S, y) factors as

K μ σ2| | 2 2| p( , S, y)= p(μk σk, y, S)p(σk y, S), (6.17) k=1 2 N where the posterior density of μk given σk arises from an (bk(S),Bk(S)) distribution with

1 2 Bk(S)= σk, (6.18) Nk(S)+N0

N0 Nk(S) bk(S)= b0 + yk(S), (6.19) Nk(S)+N0 Nk(S)+N0 6.2 Bayesian Estimation of Univariate Mixtures of Normals 179

2 whereas the marginal posterior of σk is the inverted Gamma distribution −1 G (ck(S),Ck(S)), where ck(S) is the same as in (6.15), however,  1 2 Nk(S)N0 − 2 Ck(S)=C0 + Nk(S)sy,k(S)+ (yk(S) b0) . (6.20) 2 Nk(S)+N0

6.2.2 Standard Prior Distributions

Diebolt and Robert (1994), Raftery (1996b), and Bensmail et al. (1997), among many others, consider the conditionally conjugate prior

K 2 2 | 2 2 p(μ1,...,μK ,σ1,...,σK )= p(μk σk)p(σk), (6.21) k=1 | 2 ∼N 2 2 ∼G−1 where μk σk b0,σk/N0 and σk (c0,C0). This prior implies that θ 2 a priori the component parameters k =(μk,σk) are pairwise independent 2 across the groups, whereas within each group μk and σk are dependent. As explained in Subsection 6.2.1, this prior offers the advantage of being con- ditionally conjugate with respect to the complete-data likelihood function, leading to a closed-form posterior p(μ, σ2|S, y). For practical Bayesian estimation the δ =(b0,N0,c0,C0) have to be selected carefully, as they may exercise considerable influence on the posterior distribution. A proper prior results if N0 > 0 and c0 > 0. Raftery (1996b) uses the following data-dependent hyperparameters: b0 = y, − 2 N0 =2.6/(ymax ymin), c0 =1.28, and C0 =0.36sy, whereas Bensmail et al. 2 (1997) use b0 = y, N0 =1,c0 =2.5, and C0 =0.5sy. Another commonly used prior, called the independence prior, assumes that 2 μk and σk are independent a priori:

K K 2 2 2 p(μ1,...,μK ,σ1,...,σK )= p(μk) p(σk), (6.22) k=1 k=1 ∼N 2 ∼G−1 where μk (b0,B0) and σk (c0,C0). This prior, where the posterior p(μ, σ2|S, y) is no longer of closed form, has been applied, for instance, in Escobar and West (1995) and Richardson and Green (1997). More general priors, where some prior dependence among the component parameters θk = 2 (μk,σk) is introduced, are discussed in Subsection 6.2.6.

6.2.3 The Influence of the Prior on the Variance Ratio

It is a definite advantage of the Bayesian approach in the context of normal 2 2 mixture models, that selecting a prior on σ1,...,σK corresponds to imposing a structure on the variances, which keeps them sufficiently away from 0 to overrule spurious local modes and the unboundedness of the mixture likelihood 180 6 Finite Mixture Models with Normal Components

2 ∼G−1 function. Under the inverted Gamma prior σk (c0,C0), the ratio of any 2 2 two variances σk and σl with k = l follows an F (2c0, 2c0)-distribution: 2 σk 2 ∼ F(2c0, 2c0) , σl which is obtained using result (A.10) in Appendix A.1.5, after rewriting the 2 ∼ ∼ 2 inverted Gamma prior as σk (2c0)C0/Wk, Wk χ2c0 . As the prior is 2 2 invariant, the ratio σk/σl has the same distribution for any combination of k and l with k = l. The choice of c0 has considerable influence on the ratio of any two variances and could be chosen under the aspect of keeping this ratio within certain limits. It is now relevant, how close this ratio is allowed to be to zero for two 2 2 2 2 variances σk and σl , where σk <σl . From the properties of the F (2c0, 2c0)- distribution discussed in Appendix A.1.5, this density behaves as  2 c0−1 σk 2 σl 2 2 → as σk/σl 0. The density is unbounded at 0 for c0 < 1 and the ratio may be arbitrarily close to 0. The density is bounded at 0 for c0 = 1 and is equal to 0 for c0 > 1. For 1 ≤ c0 < 2, the first derivative of the density is infinite at 0, and the ratio may be rather close to 0. For c0 = 2, the first derivative is bounded, whereas for c0 > 2 the first derivative is equal to 0 and the ratio is bounded away from 0. Furthermore the ratio has a finite variance if c0 > 2. Thus the F (2c0, 2c0)-distribution is a stochastic version of the determin- istic constraint (6.7) introduced by Hathaway (1985). In Table 6.1 we derive a c for which a constraint comparable to (6.7) holds with high , say 95% for K = 2. Note that c is simply given by the 95%-percentile of the F (2c0, 2c0)-distribution. For c0 =2.5, for instance, this ratio is roughly bounded by 0.1.

Table 6.1. Lower bound for the ratio of two variances depending on the hyperpa- rameter c0 of the inverted Gamma prior

c0 0.5 1 1.5 2 2.5 3 4 5 c 0.00025 0.010 0.034 0.063 0.091 0.118 0.166 0.206

6.2.4 Bayesian Estimation Using MCMC

Many authors, in particular Diebolt and Robert (1994) and Raftery (1996b), have considered Gibbs sampling based on data augmentation, as described for general mixtures in Section 3.5, to estimate the parameters of a univariate 6.2 Bayesian Estimation of Univariate Mixtures of Normals 181

mixture of normals. Full conditional Gibbs sampling proceeds along the lines indicated by Algorithm 3.4, where the results of Subsection 6.2.1 are used to 2 sample the mean μk and variance σk in each group, conditional upon a given allocation vector S. Algorithm 6.1: Two-Block Gibbs Sampling for a Univariate Gaussian Mixture

(a) Parameter simulation conditional on the classification S =(S1,...,SN ): (a1) Sample η from the conditional Dirichlet posterior p(η|S)asinAlgo- rithm 3.4. 2 G−1 (a2) Sample σk in each group k from a (ck(S),Ck(S))-distribution. (a3) Sample μk in each group k from an N (bk(S),Bk(S))-distribution. (b) Classification of each observation yi, for i =1,...,N, conditional on knowing μ, σ2, and η: # 1 (y − μ )2 Pr(S = k|μ, σ2, η,y ) ∝ exp − i k η . (6.23) i i 2 2σ2 k 2πσk k

The precise form of bk(S), Bk(S), ck(S), and Ck(S) in steps (a2) and (a3) depends upon the chosen prior distribution family. For the independence prior bk(S) and Bk(S) are given by (6.13) and (6.14), whereas ck(S) and Ck(S) are available from (6.15) and (6.16). Under the conditionally conjugate prior, μk 2 2| and σk are sampled jointly, as the marginal density σk S, y is available in closed form. bk(S) and Bk(S) are given by (6.18) and (6.19), whereas ck(S) and Ck(S) are available from (6.15) and (6.20). Algorithm 6.1 is started with parameter estimation based on some prelimi- nary classification, which may be obtained from partitioning the ordered data into K groups. The sampling order may be reversed, by starting with a classi- fication based on the starting values ηk =1/K, μk = Qk/K+1, where Qα is the 2 2 empirical α-percentile of the data, and σk = R , where R is a robust estimator of the scale, such as the median absolute deviation, R = med |yi − med y|,or the interquartile range, R = Q0.75 −Q0.25. Alternatively, one could adopt ran- dom starting values as in McLachlan and Peel (2000, Section 2.12.2), where N 2 2 2 μk is sampled from y, sy and σk = sy.

Classification Without Parameter Estimation As already discussed in Subsection 3.3.3, Chen and Liu (1996) showed that for many mixture models Bayesian clustering is possible without explicit pa- rameter estimation, using directly the marginal posterior p(S|y) given by p(S|y) ∝ p(y|S)p(S). For a mixture of univariate normal distributions, p(y|S) is given by K | ∝ Γ (c k(S)) p(y S) (S) , (6.24) ck 0 k=1 (Ck(S)) Nk(S)+N

where ck(S), and Ck(S) are the posterior moments of the full conditional 2| posterior density p(σk y, S) under the conditionally conjugate prior. 182 6 Finite Mixture Models with Normal Components

6.2.5 MCMC Estimation Under Standard Improper Priors

0.4 4

3.5 0.2

3 0 2.5 −0.2 2 μ σ 2 −0.4 1.5

−0.6 1

−0.8 0.5

−1 0 1000 2000 3000 4000 5000 6000 1000 2000 3000 4000 5000 6000

10

9 0.2 8 0 7

−0.2 6 2

σ 5 μ −0.4

4 −0.6 3 −0.8 2

−1 1

0 1000 2000 3000 4000 5000 6000 1000 2000 3000 4000 5000 6000

Fig. 6.4. Kiefer–Wolfowitz Example — simulated data set (N = 20); MCMC draws of μ and σ2 from the posterior p(μ, σ2|y) based on different priors; top: im- 2 2 2 −1 proper prior p(μ, σ2 ) ∝ 1/σ2 ; bottom: p(μ) ∝ constant, σ2 ∼G (1, 4); the hori- zontal line indicates the true values

One has to be very careful when running MCMC under an improper prior, because such a prior may cause an improper mixture posterior distribution, as discussed in Subsection 3.2.2. We also refer to Hobert and Casella (1998) for a general discussion of MCMC simulations from improper posterior distri- butions. Nevertheless, Diebolt and Robert (1994) suggested using improper priors and to reject classifications S which lead to problems. Under the improper 2 ∝ 2 prior p(μk,σk) 1/σk, for instance, the posterior moments in steps (a2) and (a3) of Algorithm 6.1 are given by bk(S)=yk(S), Bk(S)=1/Nk(S), − 2 ck(S)=(Nk(S) 1)/2, and Ck(S)=Nk(S)/2sy,k(S). Thus classifications are rejected whenever one group is empty (Nk(S) = 0), or whenever the 2 within-group variance sy,k(S) is 0, either because the group contains only a single observation or two observations with identical value, as may happen for rounded data. 6.2 Bayesian Estimation of Univariate Mixtures of Normals 183

Probability of Being in Group 1

k = 1

0.5

0 −3 −2 −1 0 1 2 3

1 k = 2

0.5

0 −3 −2 −1 0 1 2 3 y

Fig. 6.5. Kiefer–Wolfowitz Example — simulated data set (N = 20); posterior 2 2 classification of the observations for the improper prior p(μ, σ2 ) ∝ 1/σ2

Whereas these classifications are easily detected, it is hard to identify classifications that assign a few similar observations into one group, and act as trapping states for the sampler around a local mode of the mixture likelihood function. As these spurious local modes disappear, when choosing a proper prior on the variances, the problem of trapping states of the MCMC sampler is easy to circumvent, and the use of an improper prior in combination with MCMC estimation of finite mixtures is generally not recommended.

MCMC Estimation of the Kiefer–Wolfowitz Example For further illustration, we discuss MCMC estimation for the data simulated for the Kiefer–Wolfowitz example, discussed earlier in Subsection 6.1.3, under 2 2 the improper prior p(μ, σ2) ∝ 1/σ2. As recommended by Diebolt and Robert (1994), classifications S leading to empty and zero-variance groups are re- jected, nevertheless the MCMC draws in Figure 6.4 clearly indicate that the sampler is trapped at one of the local modes of the mixture likelihood func- tion found earlier in Figure 6.3. The posterior classifications in Figure 6.5 show that the sampler with high probability forms one group from three observa- −1 2 tions that are extremely close to each other. Including a G (1, 4)-prior on σ2 184 6 Finite Mixture Models with Normal Components

−72 x 10 8

6 ) 2 σ , 4 μ f(y| 2

0 0 2 1 0 2

−2 3 4 σ μ 2

Fig. 6.6. Kiefer–Wolfowitz Example — simulated data; surface of the nonnor- 2 2 malized posterior p(μ, σ2|y) under the improper prior p(μ, σ2 ) ∝ 1/σ2 for N = 100

Posterior of σ Posterior of μ 2 3.5 0.9

0.8 3 0.7 2.5 0.6

2 0.5

0.4 1.5 0.3 1 0.2

0.5 0.1

0 0 1 2 3 4 5 6 7 8 9 −0.6 −0.4 −0.2 0 0.2 σ μ 2

Fig. 6.7. Kiefer–Wolfowitz Example — simulated data set (N = 100); poste- rior densities p(μ|y) and p(σ2|y) estimated from the MCMC draws for the improper prior (dashed line) and the G (1, 4)-prior (full line) 6.2 Bayesian Estimation of Univariate Mixtures of Normals 185 improves the performance of the MCMC sampler considerably. The bottom of Figure 6.4 indicates that we now obtain draws from the modal region close to the true value. For a larger data set with N = 100 rather than N = 20, it is possi- ble to run an MCMC sampler without being trapped even for the improper prior. The risk of jumping into the critical area is very small for this data set; see Figure 6.6. When comparing the marginal posterior densities p(μ|y) and p(σ2|y) estimated from the MCMC draws in Figure 6.7, we find little difference between the improper posterior based on a noninformative prior and the proper posterior based on the G−1 (1, 4)-prior. As discussed earlier in Subsection 3.5.4, MCMC sampling under an improper prior may lead to a proper posterior density estimate if the sampler avoids the critical part of the parameter space, simply because it is well separated from the modal region around the true value. However, it is not recommended to rely on this, as an unaccountable risk remains that the sampler jumps to this critical part when the MCMC sampler is run long enough.

6.2.6 Introducing Prior Dependence Among the Components

Rather than assuming fixed hyperparameters δ, an alternative method is to treat δ as an unknown with a prior p(δ) of its own. As men- tioned in Subsection 3.2.4, this introduces prior dependence among the com- ponent parameters. Such priors have been applied to finite mixtures of uni- variate normal distributions in particular by Richardson and Green (1997) and Roeder and Wasserman (1997b). In Richardson and Green (1997) the hyperparameter C0 is treated as an unknown hyperparameter with a prior of its own, whereas the other hyperpa- 2 2 rameters remain fixed. The joint prior p(μ1,...,μK ,σ1,...,σK ,C0) takes the form of a hierarchical independence prior:

K K 2 2 2| p(μ1,...,μK ,σ1,...,σK ,C0)= p(μk) p(σk C0)p(C0), (6.25) k=1 k=1 ∼N 2 ∼G−1 ∼G where μk (b0,B0), σk (c0,C0), and C0 (g0,G0). For their case 2 studies, Richardson and Green (1997) select c0 =2,g0 =0.2, G0 =10/R , 2 b0 = m, and B0 = R , where m and R are the midpoint and the length of the observation interval. Roeder and Wasserman (1997b) extend this concept further by using hi- erarchical priors with improper priors for the hyperparameters. They suggest a partly proper prior for the variances that is equal to the hierarchical prior −1 (6.25), with p(C0) ∝ C0 being the standard improper prior for the scale parameter. Roeder and Wasserman (1997b) prove that the marginal prior dis- 2 2 ∝ −2 tribution of each σk has the usual improper reference prior, p(σk) σk , nev- ertheless the posterior distribution is proper. Roeder and Wasserman (1997b) suggest also a hierarchical prior for the means. They assume that the mean 186 6 Finite Mixture Models with Normal Components b0 appearing in the normal prior on the group mean μk is an unknown hyper- parameter with improper prior: p(b0) ∝ constant. Concerning MCMC estimation, under any hierarchical prior an additional step has to be added to sample the random hyperparameters δ. For the hi- 2 2 erarchical prior (6.25), for instance, the marginal prior p(σ1,...,σK ), where C0 is integrated out, is not conjugate for the complete-data likelihood, and an additional step has to be added in Algorithm 6.1, to sample C0 from 2 p(C0|S, μ, σ , y), which is given by Bayes’ theorem as

K | μ σ2 ∝ 2| ∝ p(C0 S, , , y) p(σk C0)p(C0) k=1 # K c0 C0 g0−1 C exp − C exp{−G0C0} . 0 σ2 0 k=1 k

Obviously this is the kernel of a G (gN ,GN )- density with:

K 1 g = g0 + Kc0,G= G0 + . N N σ2 k=1 k

If b0 is an unknown hyperparameter with improper prior p(b0) ∝ constant, 2 then b0 is sampled in an additional step from p(b0|S, μ, σ , y), given by   K 2 1 −1 b0|μ, σ , y, S ∼N μ ,B /K . K k 0 k=1 Finally, under a hierarchical prior some starting value for the random hyper- parameter δ has to be provided.

Prior Dependence Through Reparameterization

Mengersen and Robert (1996) suggest reparameterizing a mixture of two nor- mal distributions in terms of a perturbation of a global normal distribution N μ, τ 2 : Y ∼ pN μ, τ 2 +(1− p)N μ + τδ,τ2ω2 .

Robert and Mengersen (1999) extend this idea to the general K-component normal mixture. They introduce a sequential parameterization, where for k> 1 the kth component is expressed as a local perturbation of component k − 1: K−2 2 Y ∼ pN μ, τ + (1 − p)(1 − q1) ···(1 − qk−1)qkN (μ + τδ1 + (6.26) k=1 2 ···+ τ ···ωk−1δk,τ ω1 ···ωk)+(1− p)(1 − q1) ···(1 − qK−2)N (μ + 2 τδ1 + ···+ τ ···ωK−2δK−1,τ ω1 ···ωK−1), 6.2 Bayesian Estimation of Univariate Mixtures of Normals 187 with the convention ω0 = 1 and q0 = 0. This parameterization allows us to choose partly proper priors. The prior on μ and τ is improper, p(μ, τ) ∝ 1/τ, the prior on p and q1,...,qK−2 is uniform, and the prior on δk is slightly 2 informative assuming δk ∼N 0,δ for k =1,...,K− 1. The prior on ωk is either uniform, ωk ∼U[0, 1], to force the component variance to be decreasing in k, or equal to ωk ∼ 0.5U [0, 1]+0.5Pa(1, 2), where Pa(1, 2) is uniform on 2 2 1/ωk. Note that δ , the prior variance of δk, is the only parameter that remains for tuning. Robert and Titterington (1998) express this prior in terms of a 2 2 prior on μ1,...,μK and σ1,...,σK in the standard parameterization:

2 K − 2 2 2 ∝ σK 1 {− (μk μk−1) } p(μ1,...,μK ,σ1,...,σK ) 3 2 exp 1/2 2 2 , σ1 σ 2δ σ −1 k=2 k k and show that this partly proper prior leads to a proper posterior distribution. Concerning MCMC estimation, Mengersen and Robert (1996) show how full conditional Gibbs sampling may be applied for K = 2. For general K, Robert and Mengersen (1999) use a hybrid MCMC method combining Gibbs 2 steps for p, q1,...,qK−2, μ, δ1,...,δK−1, and ωK−1 with Metropolis–Hastings 2 2 steps based on tailor-made proposals for ω1,...,ωK−2 and τ. Robert and Tit- terington (1998) show for the more general setting of hidden Markov models, of which mixture models are a special case, that full conditional Gibbs sam- pling may be applied for all parameters in this model.

A Markov Prior Based on Ordered Means

Roeder and Wasserman (1997b) suggest a Markov prior based on the ordered means. Under the assumption that μ1 < ···<μK , they assume that p(μ1) ∝ constant, whereas each μk with k>1 has a normal prior with mean μk−1 that is truncated at μk−1: | 2 2 ∼N 2 −2 −2 μk μk−1,σk,σk−1 μk−1,φ (σk−1 + σk )/2 I{μk>μk−1}, where φ = 5 is a default value in their applications. MCMC sampling under this prior may be carried out by running a single-move Gibbs sampler. For more details on the corresponding conditional densities we refer to Roeder and Wasserman (1997b). As pointed out by Roeder and Wasserman (1997b), this prior makes it difficult to handle mixtures, where two components have the same mean.

6.2.7 Further Sampling-Based Approaches

MCMC methods are not practicable in a dynamic setting when a sequence of t t posterior densities p(ϑ|y ), based on a set of observations y =(y1,...,yt) which increases with t is involved, because MCMC methods require rerunning a new chain for each p(ϑ|yt) without using draws from p(ϑ|yt−1). In such a 188 6 Finite Mixture Models with Normal Components setting, sequential importance sampling schemes, usually referred to as a “par- ticle filter method” are used which are reviewed, for instance, in Doucet et al. (2001). Recently, Chopin (2002) showed that sequential importance sampling can be useful also for estimation in a static scenario and demonstrates this through an application to a mixture of normal distributions. Other flexible sampling-based approaches that were applied to parame- ter estimation for univariate mixtures of normals for the purpose of illustra- tion include adaptive radial-based direction sampling (Bauwens et al., 2006) and hit-and-run sampling in combination with the ratio-of-uniform method (Karawatzki et al., 2005).

6.2.8 Application to the Fishery Data

For the Fishery Data, Titterington et al. (1985) assume that for a certain age group indexed by k, the length of a fish follows a normal distribution N 2 μk,σk , however, age is unobserved. Assuming that a randomly selected fish belongs to age group k with probability ηk, the marginal distribution of the length follows a finite mixture of normal distributions with different means and different variances: ∼ N 2 ··· N 2 Yi η1 μ1,σ1 + + ηK μK ,σK . (6.27)

For illustration, we fit finite mixtures of K normal distributions with increas- ing number K of potential groups to the Fishery Data introduced in Sub- section 1.1. We assume a D (4,...,4)-prior for η, and apply the hierarchical independence prior introduced by Richardson and Green (1997); see Subsec- tion 6.2.6.

Table 6.2. Fishery Data, normal mixtures with K = 4 (hierarchical independence prior) Parameter Posterior Mean SD 95% Confidence Region Lower Upper μ1 3.30 0.119 3.11 3.58 μ2 5.23 0.081 5.06 5.39 μ3 7.29 0.244 6.64 7.63 μ4 8.78 0.771 7.36 10.21 2 σ1 0.159 0.096 0.063 0.416 2 σ2 0.312 0.081 0.181 0.498 2 σ3 0.488 0.327 0.122 1.406 2 σ4 2.82 1.128 1.022 5.229 η1 0.115 0.023 0.076 0.164 η2 0.469 0.059 0.336 0.569 η3 0.222 0.068 0.102 0.377 η4 0.195 0.074 0.084 0.368 6.2 Bayesian Estimation of Univariate Mixtures of Normals 189

μ versus σ2 (K=3) μ versus σ2 (K=4) k k k k 6 9

8 5 7

4 6

5 3 4

2 3

2 1 1

0 0 2 4 6 8 10 2 4 6 8 10 12

μ versus μ (K=3) μ versus μ (K=4) k k’ k k’ 9 11

10 8 9 7 8

6 7

6 5 5 4 4

3 3 2 4 6 8 10 2 4 6 8 10 12

Fig. 6.8. Fishery Data, normal mixtures with K = 3 (left-hand side) and K = 4 (right-hand side) components (hierarchical independence prior); MCMC draws 2 obtained from the random permutation sampler; top: μk plotted against σk; bottom: μk plotted against μk

For estimation, we use Algorithm 6.1 and store 5000 MCMC draws after a burn-in phase of 2000 draws. We estimate unidentified mixtures with K =3 and K = 4 components, respectively. MCMC estimation for an unidentified mixture is carried out with the help of the random permutation Gibbs sampler. The top of Figure 6.8 shows two-dimensional scatter plots of the MCMC draws (m) (2,m) (μk ,σk ) for an arbitrary k for K = 3 and K =4.ForK = 3 and K =4 there are, respectively, three and four clearly separated clusters visible in the MCMC draws. As expected, the mean of the height is different between the groups. The variances are nearly identical for the groups with smaller fish, but rather large for the group with the largest fish. For K = 4, group 1 and 190 6 Finite Mixture Models with Normal Components group 2 remained roughly where they have been for a mixture with K =3, whereas the third group has been split into two separate groups. Interestingly, the variance of the third group is now comparable with the variance of groups 1 and 2; only the variance of the fourth group is considerably larger. The MCMC draws from the marginal bivariate density p(μk,μk |y), with k and k arbitrary, but different, shown in the bottom of Figure 6.8, indi- cate that for K = 3 the K(K − 1) = 6 modes are all bounded away from the line μk = μk which corresponds to a model where two components have equal means. Therefore the constraint μ1 <μ2 <μ3 induces a unique labeling. After reordering the MCMC draws according to this constraint by the permutation sampler, we obtain MCMC draws which could be used for category-specific inference. For K = 4, the MCMC draws from the marginal bivariate density p(μk,μk |y) in Figure 6.8, however, indicate that K(K − 1) = 12 possible modes are not all bounded away from the line μk = μk which corresponds to a model where two components have equal means. Thus the constraint μ1 < ··· <μ4 does not induce a unique labeling. Postprocessing the MCMC draws through unsupervised clustering as explained in Subsection 3.7.7, how- ever, leads to an identified mixture model. These MCMC draws were used for category-specific inference, in particular for parameter estimation for K =4 given in Table 6.2. The prior exercises considerable influence for these data. For the condi- tionally conjugate prior introduced by Bensmail et al. (1997) (see Subsec- tion 6.2.2), for instance, shrinkage is much stronger, whereas the prior of Raftery (1996b) completely failed.

6.3 Bayesian Estimation of Multivariate Mixtures of Normals

6.3.1 Bayesian Inference When the Allocations Are Known μ | In this subsection the conditional posterior distribution of k, Σk S, y is de- rived by combining the information from all observations in group k. Relevant group-specific quantities are the number Nk(S) of observations in group k, the W group mean yk(S), and the unexplained within-group variability k(S):

Nk(S)=#{i : Si = k}, (6.28) 1 yk(S)= yi, (6.29) Nk(S) : = i Si k W − − k(S)= (yi yk(S))(yi yk(S)) . (6.30) i:Si=k

Assume that for observation yi, the allocation is equal to k, Si = k. Then the contribution of yi to the complete-data likelihood function p(y|μ, Σ, S)is equal to 6.3 Bayesian Estimation of Multivariate Mixtures of Normals 191  1 p(y |μ , Σ )=c|Σ |−1/2exp − (y − μ ) Σ−1(y − μ ) , i k k k 2 i k k i k where c =(2π)−r/2, and the complete-data likelihood function has K in- dependent factors, each carrying all information about the parameters in a certain group: p(y|μ, Σ, S) = (6.31)   K 1 cN |Σ |−Nk(S)/2exp − (y − μ ) Σ−1(y − μ ) . k 2 i k k i k k=1 i:Si=k In a Bayesian analysis each of these factors is combined with a prior. When holding the variance–covariance matrix Σk fixed, then the complete-data μ likelihood function, regarded as a function of k, is the kernel of a multi- μ ∼N variate normal distribution. Under the conjugate prior k r (b0, B0), μ the posterior density of k given Σk and Nk(S) observations assigned to this group, is again a density from the multivariate normal distribution, μ | ∼N k Σk, S, y r (bk(S), Bk(S)), where −1 −1 −1 Bk(S)=(B0 + Nk(S)Σk ) , (6.32) −1 −1 bk(S)=Bk(S)(B0 b0 + Σk Nk(S)yk(S)). (6.33) μ When holding the mean k fixed, the complete-data likelihood function, re- −1 garded as a function in Σk , is the kernel of a Wishart density; see Ap- pendix A.1.14 for more detail on this distribution family. Under the conjugate −1 ∼W μ Wishart prior Σk r (c0, C0), the posterior density of Σk given k and Nk(S) observations assigned to this group, is again a density from the Wishart −1|μ ∼W distribution, Σk k, S, y r (ck(S), Ck(S)), where

Nk(S) c (S)=c0 + , (6.34) k 2 1 C (S)=C0 + (y − μ )(y − μ ) . (6.35) k 2 i k i k i:Si=k μ If both k and Σk are unknown, a closed-form solution for the joint posterior μ | μ p( k, Σk S, y) exists only if the prior of k is restricted by assuming that the prior covariance matrix depends on Σk through B0,k = Σk/N0. Then the μ | | μ joint posterior factors as p( k Σk, y, S)p(Σk y, S), where the density of k given Σk arises from a Nr (bk(S), Bk(S)) distribution with 1 Bk(S)= Σk, (6.36) Nk(S)+N0

N0 Nk(S) bk(S)= b0 + yk(S), (6.37) Nk(S)+N0 Nk(S)+N0 −1 W whereas the marginal posterior of Σk is a r (ck(S), Ck(S))-distribution, where ck(S) is the same as in (6.34), however, 192 6 Finite Mixture Models with Normal Components  1 Nk(S)N0 − − Ck(S)=C0 + (yk(S) b0)(yk(S) b0) 2 Nk(S)+N0 1 + W (S). (6.38) 2 k

6.3.2 Prior Distributions

Standard Prior Distributions

Binder (1978), Lavine and West (1992), Robert (1996), and Bensmail et al. (1997) consider the conditionally conjugate prior

K μ μ μ | p( 1,..., K , Σ1,...,ΣK )= p( k Σk)p(Σk), (6.39) k=1 μ | ∼N −1 ∼W where k Σk r (b0, Σk/N0) , and Σk r (c0, C0). As explained in Subsection 6.3.1, this prior has the advantage that the joint posterior p(μ, Σ|S, y) is available in closed form, conditional on knowing the alloca- tions S. This prior is proper if N0 > 1 and c0 > (r − 1)/2. Binder (1978) and Lavine and West (1992) choose hyperparameters N0 and c0 that result in an improper prior, which is not recommended both for the theoretical and practical reasons discussed earlier. Robert (1996) chooses b0 = y, N0 =1, c0 = 3, and C0 =0.75Sy, where y and Sy are the sample mean and the sam- ple variance–covariance matrix, respectively. Bensmail et al. (1997) choose the μ same prior on k, but a slightly different prior on the covariance matrices, namely c0 =2.5 and C0 =0.5Sy.

Understanding the Hyperparameters

−1 c0 will influence the prior distribution of the eigenvalues of ΣkΣj which are given by λk,l/λj,m, where λk,l and λj,m are the eigenvalues of Σk and Σj. These eigenvalues are bounded away from 0, if c0 > 2+(r − 1)/2. This suggests choosing c0 as a function of r. If C0 = φSy as in Robert (1996) and Bensmail et al. (1997), then φ 2 ϑ influences the prior expectation of the amount of heterogeneity E(Rt ( )), explained by differences in the group means:

2 ϑ − φ E(Rt ( ))=1 . (6.40) c0 − (r +1)/2

If for instance c0 =2.5+(r − 1)/2, then choosing φ =0.75 corresponds to a prior expectation of 50 percent of explained heterogeneity, whereas choosing φ =0.5 leads to prior expectation of 2/3 explained heterogeneity. 6.3 Bayesian Estimation of Multivariate Mixtures of Normals 193

Prior Dependence Between the Components

Stephens (1997a) considers an extension of the hierarchical independence prior introduced by Richardson and Green (1997) for univariate mixtures of nor- mals, by generalizing (6.25) in the following way to multivariate mixtures:

K K μ μ μ | p( 1,..., K , Σ1,...,ΣK )= p( k) p(Σk C0), (6.41) k=1 k=1 μ ∼N −1 ∼W where k r (b0, B0), and Σk r (c0, C0). If C0 is a fixed hyper- parameter, the independence prior results, where all component parameters μ μ 1,..., K , Σ1,...,ΣK are pairwise independent a priori. For the hierarchical independence prior, prior dependence between Σ1,...,ΣK is introduced by assuming that the scale matrix C0 of the inverted Wishart prior is a random hyperparameter with a prior of its own, C0 ∼Wr (g0, G0). As emphasized by Stephens (1997a), the posterior distribution will be proper, even if the prior on the scale matrix C0 is improper. Note that the marginal prior p(Σ1,...,ΣK ), where C0 is integrated out, takes the form:

p(Σ1,...,ΣK ) = (6.42) ! !   −gK c0+(r+1)/2 g0 ! K ! K |G0| Γr(gK ) ! −1! −1 ! 0 ! | | K G + Σk Σk , Γr(g0)Γr(c0) ! ! k=1 k=1 where gK = g0 + Kc0 and Γr(α) is the generalized Gamma function defined in Appendix A.1, formula (A.26). Concerning the hyperparameters, Stephens (1997a) considered robust es- timators of location and scale and chose the midpoint and the length of the observation interval for the different components of yi to define location and spread. For bivariate mixtures this prior reads c0 =3,g0 =0.3, and b0, B0 and G0 defined by:     2 100g0 2 0 m1 R1 0 c0R1 b0 = , B0 = 2 , G0 = 100g0 , m2 0 R2 0 2 c0R2 where ml and Rl are the midpoint and the length of the observation interval of the lth component of yi. This prior is easily extended to higher-dimensional mixtures.

6.3.3 Bayesian Parameter Estimation Using MCMC

Several authors, in particular Lavine and West (1992), Bensmail et al. (1997), Stephens (1997a), and Fr¨uhwirth-Schnatter et al. (2004), have considered Gibbs sampling, based on data augmentation, as described for general mix- tures in Section 3.5, to estimate the parameters of a multivariate mixture of 194 6 Finite Mixture Models with Normal Components

normals. Full conditional Gibbs sampling proceeds along the lines indicated by Algorithm 3.4, where the results of Subsection 6.3.1 are used to sample the μ mean k and variance–covariance matrix Σk in each group, conditional upon a given allocation vector S. Algorithm 6.2: Two-Block Gibbs Sampling for a Multivariate Gaussian Mix- ture

(a) Parameter simulation conditional on the classification S =(S1,...,SN ): (a1) Sample η from the conditional Dirichlet posterior p(η|S)asinAlgo- rithm 3.4. −1 W (a2) Sample Σk in each group k from a r (ck(S), Ck(S))-distribution. μ N (a3) Sample k in each group k from an r (bk(S), Bk(S))-distribution. (b) Classification of each observation yi, for i =1,...,N, conditional on knowing μ, Σ, and η: |μ η ∝ μ Pr(Si = k , Σ, , yi) fN (yi; k, Σk)ηk. (6.43)

The precise form of bk(S), Bk(S), ck(S), and Ck(S) in steps (a2) and (a3) depends upon the chosen prior distribution family. For the independence prior and the hierarchical independence prior bk(S) and Bk(S) are given by (6.32) and (6.33), whereas ck(S) and Ck(S) are available from (6.34) and (6.35). Under the conditionally conjugate prior and the hierarchical conditionally μ conjugate prior, k and Σk may be sampled jointly, as the marginal density −1| Σk S, y is available in closed form. bk(S) and Bk(S) are given by (6.36) and (6.37), whereas ck(S) and Ck(S) are available from (6.34) and (6.38). Note that the sampling order in Algorithm 6.2 could be reversed, by start- ing with classification based on starting values for μ, Σ, and η. Whereas choosing starting values is not a problem for univariate mixtures, for multi- variate mixtures it is essential to use some carefully selected starting value, in particular when r is large. As shown, for instance, in Justel and Pe˜na (1996), Gibbs sampling may be trapped at a suboptimal solution when starting with poor parameter estimates and may never converge. In general, it is easier to find a rough classification of the data than good starting values for rather high-dimensional variance–covariance matrices. One solution, dating back at least to Friedman and Rubin (1967) is to utilize repeated random partitioning of the observations into K groups. Another useful solution, suggested by Everitt and Hand (1981) is to run a pilot analysis and to use some simple clustering technique to obtain a starting value for S. Under any hierarchical prior an additional step has to be added to Al- gorithm 6.2, to sample the scale matrix C0 of the hierarchical prior from the conditional posterior p(C0|Σ1,...,ΣK , y), which takes the form of the following ,   K ∼W −1 C0 r g0 + Kc0, G0 + Σk . (6.44) k=1 6.4 Further Issues 195

Classification Without Parameter Estimation As already discussed in Subsection 3.3.3, Chen and Liu (1996) showed that for many mixture models Bayesian clustering is possible without explicit pa- rameter estimation, using directly the marginal posterior p(S|y) given by p(S|y) ∝ p(y|S)p(S). For a mixture of multivariate normal distributions, p(y|S) is given by

K Γ (c (S)) p(y|S) ∝ d k , (6.45) |C (S)|ck(S) k=1 k where ck(S), and Ck(S) are the posterior moments of the full conditional posterior density p(Σk|y, S) under the conditionally conjugate prior, and Γd(·) is the generalized Gamma function, defined in (A.26).

6.3.4 Application to Fisher’s Iris Data This example involves a data set well known in multivariate analysis, namely Fisher’s Iris Data. This data set consists of 150 four-dimensional observa- tions of three species of iris (iris setosa, iris versicolour, iris virginica). The measurements taken for each plant are sepal length, sepal width, petal length and petal width.1 ML estimation of these data has been considered by many authors, for instance, by Everitt and Hand (1981). Everitt and Hand (1981, p.43), when fitting a three-component mixture with unconstrained component variance-covariances matrices to Fisher’s Iris Data, report convergence to the singularities of the likelihood surface for certain starting values. We reanalyze these data using a Bayesian approach, based on the hier- archical independence prior, and fit a three-component multivariate normal mixture with unconstrained component variance–covariance matrices to these data. The posterior draws of μk,1 versus μk,2 and μk,3 versus μk,4 in Figure 6.9 show three clear simulation clusters. Thus identification could be achieved using unsupervised clustering of the MCMC draws of (μk,1,μk,2,μk,3,μk,4) as explained in Subsection 3.7.7. The corresponding estimates are shown in Table 6.3, whereas Figure 6.10 shows the position of the estimated mixture components relative to the data.

6.4 Further Issues

6.4.1 Parsimonious Finite Normal Mixtures If the number of observations N is small compared to the number r of fea- tures, measured for each yi, efficient estimation of unconstrained variance– 1 Data from ftp://ftp.ics.uci.edu/pub/machine-learning-databases/iris/iris.names. These data differ from the data presented in Fisher’s article; errors in the 35th sample in the fourth feature and in the 38th sample in the second and third features were identified by Steve Chadwick ([email protected]). 196 6 Finite Mixture Models with Normal Components

4 2.5

3.8 2 3.6

3.4 μ 1.5 k,2 μ 3.2 k,4

3 1

2.8 0.5 2.6

2.4 0 4.5 5 5.5 6 6.5 7 1 2 3 4 5 6 μ μ k,1 k,3

Fig. 6.9. Fisher’s Iris Data, multivariate normal mixture with K = 3 compo- nents (hierarchical independence prior); MCMC draws obtained from the random permutation sampler; left-hand side: μk,1 plotted against μk,2; right-hand side: μk,3 plotted against μk,4

Table 6.3. Fisher’s Iris Data, multivariate normal mixture with K = 3 compo- nents (hierarchical independence prior); posterior expectation of certain character- istics of the covariance matrices Σk, identification achieved through unsupervised clustering

Group k E(Eigenvalues of Σk|y) E(tr (Σk) |y) E(det Σk|y) 1 0.179 0.0361 0.0218 0.00731 0.244 2.84e-006 2 0.381 0.0526 0.0325 0.00686 0.473 1.26e-005 3 0.512 0.0818 0.0436 0.0236 0.661 0.00012

7 3 3

6 2.5 2.5 Feature 4 Feature 3 Feature 4 2 5 2 1.5 1.5 4 1 1 3 0.5 0.5 2 0 0

1 −0.5 −0.5 4 6 8 10 2 2.5 3 3.5 4 4.5 0 2 4 6 8 Feature 1 Feature 2 Feature 3 Fig. 6.10. Fisher’s Iris Data, multivariate normal mixture with K = 3 compo- nents (hierarchical independence prior); fitted mixture model in relation to the data, identification achieved through unsupervised clustering 6.4 Further Issues 197 covariance matrices Σk is not possible, and the performance of the corre- sponding classification rules will be disappointing (see, e.g., McLachlan and Basford, 1988). Gain in efficiency may be achieved as in Friedman (1989) by shrinking each Σk toward a common variance–covariance matrix Σ, through a hierarchical prior with large degrees of freedom c0. Various authors studied models for component variance–covariance ma- trices that are more parsimonious than unconstrained variance–covariance matrices in terms of the number of parameters, and the most useful ones are briefly reviewed here.

Constrained Mixtures

The easiest way to achieve parsimony is to put simple constraints on the variance–covariance matrices. For a homoscedastic mixture, the variance–covariance matrices are restricted to be the same in each component, Σk ≡ Σ. A spherical mixture is one where Σk ≡ σk²Ir. In the isotropic case, all variance–covariance matrices are restricted to the same scaled identity matrix, Σk ≡ σ²Ir. For an illustration of these various types of multivariate normal mixtures, see the simulated data sets in Figure 6.11.

For MCMC estimation of constrained mixtures, step (a2) in Algorithm 6.2 has to be modified in an appropriate way. The following results hold for conditionally conjugate priors; modifications for independence priors are straightforward. For a homoscedastic mixture with prior Σ ∼ W_r^{-1}(c0, C0), step (a2) reduces to sampling Σ from W_r^{-1}(cN, CN(S)), where cN = c0 + N/2 and

C_N(S) = C_0 + \frac{1}{2} \sum_{k=1}^{K} \left( W_k(S) + \frac{N_0 N_k(S)}{N_0 + N_k(S)} (\bar{y}_k(S) - b_0)(\bar{y}_k(S) - b_0)' \right).

For r = 1 the Wishart distribution reduces to the Gamma distribution; see Subsection A.1.14. For a spherical mixture under the prior σk² ∼ G^{-1}(c0, C0), step (a2) reduces to sampling σk² in each group k as σk² ∼ G^{-1}(ck(S), Ck(S)), where ck(S) = c0 + Nk(S)r/2 and

C_k(S) = C_0 + \frac{1}{2} \left( \mathrm{tr}\left(W_k(S)\right) + \frac{N_0 N_k(S)}{N_0 + N_k(S)} (\bar{y}_k(S) - b_0)'(\bar{y}_k(S) - b_0) \right).

For an isotropic mixture with prior σ² ∼ G^{-1}(c0, C0), step (a2) reduces to sampling σ² as σ² ∼ G^{-1}(cN, CN(S)), where cN = c0 + Nr/2 and

C_N(S) = C_0 + \frac{1}{2} \sum_{k=1}^{K} \left( \mathrm{tr}\left(W_k(S)\right) + \frac{N_0 N_k(S)}{N_0 + N_k(S)} (\bar{y}_k(S) - b_0)'(\bar{y}_k(S) - b_0) \right).
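As a sketch of how the homoscedastic case of step (a2) might be coded, the function below accumulates the posterior scale matrix CN(S) over the groups and draws Σ from the inverted Wishart. The argument names are illustrative, and the mapping of W_r^{-1}(c, C) to scipy's invwishart parameterization (df = 2c, scale = 2C, so that r = 1 reduces to G^{-1}(c, C)) is an assumption about conventions, not something stated here.

```python
import numpy as np
from scipy.stats import invwishart

def sample_sigma_homoscedastic(y, S, K, c0, C0, b0, N0, rng=None):
    """Sketch of step (a2) for a homoscedastic mixture: draw the common Sigma
    from its inverted Wishart full conditional.  S holds the current
    allocations (values 0, ..., K-1); all names are illustrative."""
    N, r = y.shape
    c_N = c0 + N / 2.0
    C_N = np.array(C0, dtype=float).copy()
    for k in range(K):
        yk = y[S == k]                          # observations allocated to group k
        Nk = yk.shape[0]
        if Nk == 0:
            continue
        ybar = yk.mean(axis=0)
        Wk = (yk - ybar).T @ (yk - ybar)        # within-group scatter matrix W_k(S)
        d = (ybar - b0).reshape(-1, 1)
        C_N += 0.5 * (Wk + (N0 * Nk) / (N0 + Nk) * (d @ d.T))
    # Assumption: W_r^{-1}(c, C) corresponds to scipy's invwishart with
    # df = 2c and scale = 2C.
    return invwishart.rvs(df=2 * c_N, scale=2 * C_N, random_state=rng)
```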

Fig. 6.11. 1000 observations simulated from a mixture of three bivariate normal distributions with η1 = η2 = η3 = 1/3, means located at (−1, 1), (0, −2), and (2, 2), and different covariance structures: unrestricted covariances (top, left-hand side), homoscedastic covariances (top, right-hand side), spherical mixture (bottom, left-hand side), isotropic mixture (bottom, right-hand side); full line: 95%-confidence region for each component
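Data sets like those in Figure 6.11 are easy to simulate; the sketch below draws 1000 observations from a three-component bivariate normal mixture under each of the four covariance structures. The means and weights match the figure caption, whereas the particular covariance matrices are made up for illustration and are not those used for the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
eta = np.array([1/3, 1/3, 1/3])
means = np.array([[-1.0, 1.0], [0.0, -2.0], [2.0, 2.0]])

def simulate_mixture(n, means, covs, eta, rng):
    """Draw n observations from a K-component bivariate normal mixture."""
    S = rng.choice(len(eta), size=n, p=eta)                    # latent allocations
    y = np.array([rng.multivariate_normal(means[k], covs[k]) for k in S])
    return y, S

# the four covariance structures of this subsection (example values only)
unrestricted = [np.array([[1.0, 0.6], [0.6, 0.5]]),
                np.array([[0.4, -0.2], [-0.2, 1.2]]),
                np.array([[0.8, 0.0], [0.0, 0.3]])]
homoscedastic = [np.array([[1.0, 0.5], [0.5, 1.0]])] * 3       # Sigma_k = Sigma
spherical = [s2 * np.eye(2) for s2 in (0.3, 1.0, 2.0)]         # Sigma_k = sigma_k^2 I_r
isotropic = [0.8 * np.eye(2)] * 3                              # Sigma_k = sigma^2 I_r

y, S = simulate_mixture(1000, means, spherical, eta, rng)
```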

Parsimony Based on the Eigenvalue Decomposition

Banfield and Raftery (1993) suggested parsimonious modeling of the variance–covariance matrices Σk, based on the spectral decomposition of Σk:

\Sigma_k = \lambda_k D_k A_k D_k', \qquad (6.46)

where λk is the largest eigenvalue of Σk, Ak is a diagonal matrix containing the eigenvalues of Σk in decreasing order, divided by the largest eigenvalue λk, and

Dk is an orthogonal matrix whose columns are the normalized eigenvectors of Σk. A slightly different parameterization has been considered by Bensmail and Celeux (1996), where λk = |Σk|^{1/r} and Ak contains the normalized eigenvalues in decreasing order. In (6.46), λk controls the volume occupied by the cluster in the sample space. The elements of Ak control the shape, with Ak = Diag(1, . . . , 1) corresponding to spherical clusters, whereas a cluster with Ak = Diag(1, Ak2, . . . , Akr), where Akj ≪ 1, will be concentrated around a line in the sample space. Finally, Dk determines the orientation with respect to the coordinate axes, with Dk = Ir corresponding to the case where the principal components are parallel to the coordinate axes. Therefore, when clusters share certain parameters λk, Ak, and Dk, they share certain geometric properties, and parsimony is achieved by imposing constraints on λk, Ak, and Dk. Bensmail and Celeux (1996) discuss 14 different models in total. ML estimation of these models, where all relevant matrices are estimated from the data, is discussed in Celeux and Govaert (1995), whereas a fully Bayesian analysis using MCMC estimation is discussed in Bensmail et al. (1997). These models have been applied to clustering multivariate data (Banfield and Raftery, 1993; Celeux and Govaert, 1995; Bensmail et al., 1997) and to discriminant analysis (Bensmail and Celeux, 1996); see also Chapter 7.
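The decomposition (6.46) of a given covariance matrix is easily computed with a symmetric eigendecomposition; the sketch below returns (λk, Ak, Dk) following the convention above, in which λk is the largest eigenvalue. The function name is illustrative.

```python
import numpy as np

def volume_shape_orientation(Sigma):
    """Decompose Sigma as lambda * D @ A @ D.T as in (6.46):
    lambda = largest eigenvalue, A = diagonal matrix of the eigenvalues in
    decreasing order divided by lambda, D = corresponding orthonormal
    eigenvectors (orientation)."""
    evals, evecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]            # sort in decreasing order
    evals, evecs = evals[order], evecs[:, order]
    lam = evals[0]                             # volume
    A = np.diag(evals / lam)                   # shape
    D = evecs                                  # orientation
    return lam, A, D

# sanity check: the decomposition reproduces Sigma
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
lam, A, D = volume_shape_orientation(Sigma)
assert np.allclose(lam * D @ A @ D.T, Sigma)
```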

6.4.2 Model Selection Problems for Mixtures of Normals

The most common model selection problem is, naturally, selecting the number K of components, but choosing an appropriate structure for the component variance–covariance matrices is also important.

Model Selection Based on Marginal Likelihoods

Bensmail et al. (1997) used the marginal likelihood p(y|MK), defined in Section 5.3, to treat simultaneously the problem of choosing the number of components and of selecting between different variance–covariance models for Σk. They illustrate for various simulated data sets that the marginal likelihood is able to find the true number of components when no misspecifications are present. For numerical approximation, Bensmail et al. (1997) use the Laplace–Metropolis estimator (Lewis and Raftery, 1997); see also Subsection 5.5.3 for more detail on this estimator.

Frühwirth-Schnatter (2004) compared various estimators of the marginal likelihood for bivariate mixtures of normals and found that the bridge sampling estimator, described in Subsection 5.4.6, gave accurate estimates of the marginal likelihood if K is not too large. The bridge sampling estimator may be applied to univariate and multivariate mixtures of normals in a rather straightforward manner. We give more details for a multivariate mixture. The importance density, defined in (5.36), takes the form of a mixture of normal inverted Wishart densities:

q(\mu, \Sigma, \eta) = \frac{1}{S} \sum_{s=1}^{S} p(\eta|S^{(s)}) \prod_{k=1}^{K} p(\mu_k|\Sigma_k, S^{(s)}, y)\, q(\Sigma_k|S^{(s)}, y),

where p(μk|Σk, S(s), y) is the full conditional multivariate normal distribution appearing in step (a3) of Algorithm 6.2. Under the conditionally conjugate prior it is possible to choose q(Σk|S(s), y) = p(Σk|S(s), y), with p(Σk|S(s), y) being the marginal inverted Wishart density appearing in step (a2). For the independence prior, we choose q(Σk|S(s), y) = p(Σk|μ(s−1), S(s), y), with p(Σk|μ(s−1), S(s), y) being the inverted Wishart density appearing in step (a2) of Algorithm 6.2. Under a hierarchical prior, the marginal prior p(Σ1, . . . , ΣK) given in (6.42) has to be used in evaluating the bridge sampling estimator.
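A minimal sketch of evaluating this importance density is given below. It assumes that the posterior hyperparameters implied by each stored sweep S(s) have already been collected (the dictionary layout is invented for illustration), that the conditional posterior covariance of μk under the conditionally conjugate prior is Σk/(N0 + Nk(S)), and that the inverted Wishart W_r^{-1}(c, C) of the text corresponds to scipy's invwishart with df = 2c and scale = 2C; these conventions are assumptions, not statements taken from this section.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import dirichlet, multivariate_normal, invwishart

def log_q(mu, Sigma, eta, stored, N0):
    """Evaluate the log of the mixture importance density built from the
    stored sweeps.  'stored' is an assumed list of per-sweep dictionaries:
    'alpha' (Dirichlet parameters of p(eta|S^(s))), and per component k
    'bk' (conditional posterior mean of mu_k), 'Nk' (group size),
    'ck', 'Ck' (inverted Wishart parameters for Sigma_k)."""
    K = len(eta)
    terms = []
    for sw in stored:
        lp = dirichlet.logpdf(eta, sw['alpha'])
        for k in range(K):
            # conditionally conjugate prior: covariance Sigma_k / (N0 + N_k(S))
            Bk = Sigma[k] / (N0 + sw['Nk'][k])
            lp += multivariate_normal.logpdf(mu[k], mean=sw['bk'][k], cov=Bk)
            # assumed mapping of W^{-1}(c, C) to scipy's parameterization
            lp += invwishart.logpdf(Sigma[k], df=2 * sw['ck'][k],
                                    scale=2 * np.asarray(sw['Ck'][k]))
        terms.append(lp)
    return logsumexp(terms) - np.log(len(stored))
```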

AIC and BIC

AIC and BIC, defined earlier in Subsection 4.4.2, are frequently used for model choice, especially in the context of multivariate normal mixtures; see in particular Fraley and Raftery (1998), Dasgupta and Raftery (1998), Yeung et al. (2001), and Fraley and Raftery (2002) for applications of BIC, and Bozdogan and Sclove (1984) for an application of AIC. For an unconstrained mixture with K components, one chooses the model MK that minimizes

\mathrm{AIC}_K = -2 \log p(y|\hat{\vartheta}_K, \mathcal{M}_K) + 2\big(K(r+1)(r/2+1) - 1\big),
\mathrm{BIC}_K = -2 \log p(y|\hat{\vartheta}_K, \mathcal{M}_K) + \log(N)\big(K(r+1)(r/2+1) - 1\big).

To obtain the ML estimator \hat{\vartheta}_K one could run the EM algorithm; see Subsection 6.1.2.
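The following sketch computes AIC_K and BIC_K for an unconstrained multivariate normal mixture from its maximized log-likelihood; it assumes the ML estimates of (η, μ, Σ) are already available, for instance from the EM algorithm, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_loglik(y, eta, mu, Sigma):
    """Log-likelihood of an unconstrained K-component multivariate normal
    mixture evaluated at the (ML) parameter estimates."""
    dens = np.column_stack([eta[k] * multivariate_normal.pdf(y, mu[k], Sigma[k])
                            for k in range(len(eta))])
    return np.sum(np.log(dens.sum(axis=1)))

def aic_bic(y, eta, mu, Sigma):
    N, r = y.shape
    K = len(eta)
    d = K * (r + 1) * (r / 2 + 1) - 1          # number of distinct parameters
    ll = mixture_loglik(y, eta, mu, Sigma)
    return -2 * ll + 2 * d, -2 * ll + np.log(N) * d
```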

Model Choice Based on Model Space MCMC

Model space MCMC methods are usually applied to selecting the number of components for unconstrained mixtures of normals, whereas variance–covariance selection using model space MCMC methods is still an open issue. Stephens (1997a, 2000a) applied birth and death MCMC methods, reviewed in Subsection 5.2.3, to univariate and multivariate mixtures of normals. The implementation of Algorithm 5.4 for mixtures of normals is in principle straightforward.

The most popular model space MCMC method for mixtures of normals, however, is reversible jump MCMC, reviewed in Subsection 5.2.2, which has been applied by Richardson and Green (1997), Marrs (1998), and Dellaportas and Papageorgiou (2006). To apply Algorithm 5.2 to univariate mixtures of normals, it is necessary to design a suitable split/merge move along the lines discussed in Subsection 5.2.2. Using the functions h1(Y) = Y and h2(Y) = Y² when matching the moments in (5.12) leaves the mean and the variance of the marginal distribution of Y unchanged. For a univariate mixture of normals, (5.12) reduces to

\eta_{k} \mu_{k} = \eta_{k_1} \mu_{k_1} + \eta_{k_2} \mu_{k_2}, \qquad (6.47)
\eta_{k} (\mu_{k}^2 + \sigma_{k}^2) = \eta_{k_1} (\mu_{k_1}^2 + \sigma_{k_1}^2) + \eta_{k_2} (\mu_{k_2}^2 + \sigma_{k_2}^2). \qquad (6.48)

For the reverse split move there are three degrees of freedom, and a random variable u = (u1, u2, u3) of dimension 3 is introduced to specify the new pairs. There are various ways in which a split move may be implemented. In Richardson and Green (1997), u1 ∼ B(2, 2), u2 ∼ B(2, 2), and u3 ∼ B(1, 1), and splitting is made according to the following rule, which fulfills (6.47) and (6.48) and leads to positive variances,

\eta_{k_1} = \eta_k u_1, \qquad \eta_{k_2} = \eta_k (1 - u_1), \qquad (6.49)
\mu_{k_1} = \mu_k - u_2 \sigma_k \sqrt{\eta_{k_2}/\eta_{k_1}}, \qquad \mu_{k_2} = \mu_k + u_2 \sigma_k \sqrt{\eta_{k_1}/\eta_{k_2}}, \qquad (6.50)
\sigma_{k_1}^2 = u_3 (1 - u_2^2) \sigma_k^2 \frac{\eta_k}{\eta_{k_1}}, \qquad \sigma_{k_2}^2 = (1 - u_3)(1 - u_2^2) \sigma_k^2 \frac{\eta_k}{\eta_{k_2}}. \qquad (6.51)

The determinant of the Jacobian for this split move is given by

\frac{\eta_k \, |\mu_{k_1} - \mu_{k_2}| \, \sigma_{k_1}^2 \sigma_{k_2}^2}{u_2 (1 - u_2^2) \, u_3 (1 - u_3) \, \sigma_k^2}.
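A sketch of this split proposal for one univariate normal component is given below; it draws u1, u2, u3 from the Beta distributions above, applies (6.49)–(6.51), and returns the Jacobian factor. Function and variable names are illustrative only, and the surrounding reversible jump bookkeeping (reallocation of observations, acceptance probability) is omitted.

```python
import numpy as np

def rg_split(eta_k, mu_k, sigma2_k, rng):
    """Split one univariate normal component (eta_k, mu_k, sigma2_k) into two,
    following (6.49)-(6.51); returns the two proposed components, the random
    vector u, and the absolute value of the Jacobian determinant."""
    u1 = rng.beta(2.0, 2.0)
    u2 = rng.beta(2.0, 2.0)
    u3 = rng.beta(1.0, 1.0)                     # B(1,1), i.e. uniform on (0, 1)
    sigma_k = np.sqrt(sigma2_k)

    eta1, eta2 = eta_k * u1, eta_k * (1.0 - u1)
    mu1 = mu_k - u2 * sigma_k * np.sqrt(eta2 / eta1)
    mu2 = mu_k + u2 * sigma_k * np.sqrt(eta1 / eta2)
    sigma2_1 = u3 * (1.0 - u2**2) * sigma2_k * eta_k / eta1
    sigma2_2 = (1.0 - u3) * (1.0 - u2**2) * sigma2_k * eta_k / eta2

    jac = (eta_k * abs(mu1 - mu2) * sigma2_1 * sigma2_2
           / (u2 * (1.0 - u2**2) * u3 * (1.0 - u3) * sigma2_k))
    return (eta1, mu1, sigma2_1), (eta2, mu2, sigma2_2), (u1, u2, u3), jac
```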

Marrs (1998) extended the reversible jump MCMC algorithm to multivariate spherical Gaussian mixtures, where Σk = σk²Ir, which reduces for r = 1 to a merge/split move for univariate mixtures of normal distributions that is different from the one described above. When merging two components, Marrs (1998) uses the rule

\eta_k \mu_k = \eta_{k_1} \mu_{k_1} + \eta_{k_2} \mu_{k_2}, \qquad (6.52)

to merge the component means, which reduces to (6.47) for r = 1, but the rule

\eta_k \sigma_k^2 = \eta_{k_1} \sigma_{k_1}^2 + \eta_{k_2} \sigma_{k_2}^2,

rather than (6.48), to define the variance of the new component. The split move reads:

\mu_{k_1} = \mu_k - \sqrt{\eta_{k_2}/\eta_{k_1}}\, u_2 \sigma_k, \qquad \mu_{k_2} = \mu_k + \sqrt{\eta_{k_1}/\eta_{k_2}}\, u_2 \sigma_k,
\sigma_{k_1}^2 = u_3 \sigma_k^2 \frac{\eta_k}{\eta_{k_1}}, \qquad \sigma_{k_2}^2 = (1 - u_3) \sigma_k^2 \frac{\eta_k}{\eta_{k_2}},

where u1, u2 = (u21, . . . , u2r), and u3 are all drawn from a B(2, 2) distribution, and the components of u2 each have probability 0.5 of being negative. The determinant of the Jacobian for this split move is given by

\frac{\eta_k \, \sigma_k^{r+1}}{2 \left( u_1 (1 - u_1) \right)^{(r+1)/2} \sqrt{u_3 (1 - u_3)}}.

The acceptance rate of a combine move is strongly influenced by the choice of the pair: combining similar components is far more likely to be accepted than combining very different ones. For mixtures of normal distributions, Richardson and Green (1997) choose at random a pair

(k1, k2) among all components that are adjacent in terms of their means μk1 and μk2. This is possible because Richardson and Green (1997) do not sample from the unconstrained posterior, but impose an order constraint on the means. In an unconstrained setting, the procedure of Marrs (1998) is useful: first one component k1 is selected at random among all components, and then a distance r(θk1; θk) between k1 and every other component k ≠ k1 is computed. The second component k2 is then chosen with probability proportional to 1/r(θk1; θk2); a small sketch of this selection step is given at the end of this subsection.

Dellaportas and Papageorgiou (2006) adapt reversible jump MCMC to general multivariate mixtures and construct split and merge moves in a manner similar to Richardson and Green (1997). As the dimension of yi increases, the difference in the number of parameters between a mixture with K components and one with K + 1 components, which is equal to (r + 1)(r/2 + 1), grows at order r². The main challenge is to split the covariance matrix in such a way that the overall dispersion remains relatively constant while the new matrices stay positive definite. To solve this problem, Dellaportas and Papageorgiou (2006) operate in the space of the r eigenvectors and eigenvalues of the covariance matrix and propose permuting the current eigenvectors through randomly chosen permutation matrices.

To avoid the complicated splitting of μk and Σk in a split move, Tadesse et al. (2005) marginalize with respect to μ = (μ1, . . . , μK) and Σ = (Σ1, . . . , ΣK), which is possible under the conjugate prior (6.39), and operate only on the space of all allocations S = (S1, . . . , SN), using the marginal likelihood p(y|S) given in (6.45). Split and merge moves are defined directly through the allocations Si. The resulting model space MCMC sampler is closely related to the Markov chain Monte Carlo model composition (MC³) algorithm of Madigan and York (1995). Because this algorithm involves rather simple acceptance rates, it is a promising alternative to the full reversible jump method described above.
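Finally, the pair-selection step of Marrs (1998) mentioned above may be sketched as follows; the Euclidean distance between component means is used purely as a stand-in for r(θk1; θk), whose precise form is not given in this section.

```python
import numpy as np

def choose_merge_pair(mu, rng):
    """Pick (k1, k2) for a merge move: k1 uniformly, k2 with probability
    proportional to 1 / r(theta_k1; theta_k2).  'mu' has shape (K, r)."""
    K = mu.shape[0]
    k1 = rng.integers(K)
    others = np.array([k for k in range(K) if k != k1])
    dist = np.linalg.norm(mu[others] - mu[k1], axis=1)   # illustrative distance
    weights = 1.0 / dist
    k2 = rng.choice(others, p=weights / weights.sum())
    return k1, k2

rng = np.random.default_rng(0)
mu = np.array([[-1.0, 1.0], [0.0, -2.0], [2.0, 2.0], [0.1, -1.8]])
print(choose_merge_pair(mu, rng))
```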