

SIAM REVIEW, Vol. 26, No. 2, April 1984. © 1984 Society for Industrial and Applied Mathematics.

MIXTURE DENSITIES, MAXIMUM LIKELIHOOD AND THE EM ALGORITHM*

RICHARD A. REDNER† AND HOMER F. WALKER‡

Abstract. The problem of estimating the parameters which determine a mixture density has been the subject of a large, diverse body of literature spanning nearly ninety years. During the last two decades, the method of maximum likelihood has become the most widely followed approach to this problem, thanks primarily to the advent of high speed electronic computers. Here, we first offer a brief survey of the literature directed toward this problem and review maximum-likelihood estimation for it. We then turn to the subject of ultimate interest, which is a particular iterative procedure for numerically approximating maximum-likelihood estimates for mixture density problems. This procedure, known as the EM algorithm, is a specialization to the mixture density context of a general algorithm of the same name used to approximate maximum-likelihood estimates for incomplete data problems. We discuss the formulation and theoretical and practical properties of the EM algorithm for mixture densities, focussing in particular on mixtures of densities from exponential families.

Keywords. mixture densities, maximum likelihood, EM algorithm, exponential families, incomplete data

1. Introduction. Of interest here is a parametric family of finite mixture densities, i.e., a family of probability density functions of the form

(1.1)  p(x | Φ) = Σ_{i=1}^{m} α_i p_i(x | φ_i),   x = (x_1, …, x_n)^T ∈ R^n,

where each α_i is nonnegative and Σ_{i=1}^{m} α_i = 1, and where each p_i is itself a density function parametrized by φ_i ∈ Ω_i ⊆ R^{n_i}. We denote Φ = (α_1, …, α_m, φ_1, …, φ_m) and set

Ω = {(α_1, …, α_m, φ_1, …, φ_m) : Σ_{i=1}^{m} α_i = 1 and α_i ≥ 0, φ_i ∈ Ω_i for i = 1, …, m}.
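To fix the notation in (1.1) concretely, the following minimal sketch (not part of the original paper) evaluates a finite mixture density from its mixing proportions and component densities; the two normal components and the parameter values are purely illustrative assumptions.

```python
import math

def mixture_density(x, alphas, component_pdfs, params):
    """Evaluate p(x | Phi) = sum_i alpha_i * p_i(x | phi_i) as in (1.1)."""
    return sum(a * pdf(x, phi) for a, pdf, phi in zip(alphas, component_pdfs, params))

def normal_pdf(x, phi):
    """Univariate normal density with phi = (mu, sigma)."""
    mu, sigma = phi
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical two-component example: alpha = (0.3, 0.7), phi_1 = (0, 1), phi_2 = (4, 2).
alphas = (0.3, 0.7)
params = ((0.0, 1.0), (4.0, 2.0))
print(mixture_density(1.5, alphas, (normal_pdf, normal_pdf), params))
```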

The more general case of a possibly infinite mixture density, expressible as

(1.2)  ∫ p(x | φ(λ)) dα(λ),

is not considered here, even though much of the following is applicable with few modifications to such a density. For general references dealing with infinite mixture densities and related densities not considered here, see the survey of Blischke [13]. Also, it is understood that in determining probabilities, probability density functions are integrated with respect to a measure on R^n which is either Lebesgue measure, counting measure on some finite or countably infinite subset of R^n, or a combination of the two. In the following, it is usually obvious from the context which measure on R^n is appropriate for a particular probability density function, and so measures on R^n are not specified unless there is a possibility of confusion. It is further understood that the topology on Ω is the natural product topology induced by the topology on the real numbers. At times when it is convenient to determine this topology by a norm, we will regard elements of Ω as (m + Σ_{i=1}^{m} n_i)-vectors and consider norms defined on such vectors.

Finite mixture densities arise naturally, and can naturally be interpreted, as densities associated with a statistical population which is a mixture of m component populations with associated component densities {p_i}_{i=1,…,m} and mixing proportions

* Received by the editors April 6, 1982, and in revised form August 5, 1983.
† Division of Mathematical Sciences, University of Tulsa, Tulsa, Oklahoma 74104.
‡ Department of Mathematics, University of Houston, Houston, Texas 77004. The work of this author was supported by the U.S. Department of Energy under grant DE-AS05-76ER05046.


{α_i}_{i=1,…,m}. Such densities appear as fundamental models in areas of applied statistics such as statistical pattern recognition, classification and clustering. (As examples of general references in the broad literature on these subjects, we mention Duda and Hart [46], Fukunaga [51], Hartigan [68], Van Ryzin [148], and Young and Calvert [159]. For some specific applications, see, for example, the Special Issue on Remote Sensing of the Communications in Statistics [33].) In addition, finite mixture densities often are of interest in life testing and acceptance testing (cf. Cox [35], Hald [65], Mendenhall and Hader [101], and other authors referred to by Blischke [13]). Finally, many scientific investigations involving statistical modeling require by their very nature the consideration of mixture populations and their associated mixture densities. The example of Hosmer [75] below is simple but typical. For references to other examples in fishery studies, genetics, medicine, chemistry, psychology and other fields, see Blischke [13], Everitt and Hand [47], Hosmer [74] and Titterington [145].

Example 1.1. According to the International Halibut Commission of Seattle, Washington, the length distribution of halibut of a given age is closely approximated by a mixture of two normal distributions corresponding to the length distributions of the male and female subpopulations. Thus the length distribution is modeled by a mixture density of the form

(1.3)  p(x | Φ) = α_1 p_1(x | φ_1) + α_2 p_2(x | φ_2),   x ∈ R,

where, for i = 1, 2,

(1.4)  p_i(x | φ_i) = (1 / (√(2π) σ_i)) exp(−(x − μ_i)² / (2σ_i²)),   φ_i = (μ_i, σ_i)^T ∈ R²,

and Φ = (α_1, α_2, φ_1, φ_2). Suppose that one would like to estimate Φ on the basis of some sample of length measurements of halibut of a given age. If one had a large sample of measurements which were labeled according to sex, then it would be an easy and straightforward matter to obtain a satisfactory estimate of Φ. Unfortunately, it is reported in [75] that the sex of halibut cannot be easily (i.e., cheaply) determined by humans; therefore, as a practical matter, it is likely to be necessary to estimate Φ from a sample in which the majority of members are not labeled according to sex.

Regarding p in (1.1) as modeling a mixture population, we say that a sample observation on the mixture is labeled if its component population of origin is known with certainty; otherwise, we say that it is unlabeled. The example above illustrates the central problem with which we are concerned here, namely that of estimating Φ in (1.1) using a sample in which some or all of the observations are unlabeled. This problem is referred to in the following as the mixture density estimation problem. (For simplicity, we do not consider here the problem of estimating not only Φ but also the number m of component populations in the mixture.) A variety of cases of this problem and several approaches to its solution have been the subject of or at least touched on by a large, diverse set of papers spanning nearly ninety years. We begin by offering in the next section a cohesive but very sketchy review of those papers of which we are aware which have as their main thrust some aspect of this problem and its solution. It is hoped that this survey will provide both some perspective in which to view the remainder of this paper and a starting point for those who wish to explore the literature associated with this problem in greater depth.

Following the review in the next section, we discuss at some length the method of maximum likelihood for the mixture problem. In rough general terms, a maximum-likelihood estimate of a parameter which determines a density function is a choice of the parameter which maximizes the induced density function (called in this

context the likelihood function) of a given sample of observations. Maximum-likelihood estimation has been the approach to the mixture density estimation problem most widely considered in the literature since the use of high speed electronic computers became widespread in the 1960's. In §3, the maximum-likelihood estimates of interest here are defined precisely, and both their important theoretical properties and aspects of their practical behavior are summarized.

The remainder of the paper is devoted to the subject of ultimate interest here, which is a particular iterative procedure for numerically approximating maximum-likelihood estimates of the parameters in mixture densities. This procedure is a specialization to the mixture density estimation problem of a general method for approximating maximum-likelihood estimates in an incomplete data context which was formalized by Dempster, Laird and Rubin [39] and termed by them the EM algorithm (E for "expectation" and M for "maximization"). The EM algorithm for the mixture density estimation problem has been studied by many authors over the last two decades. In fact, there have been a number of independent derivations of the algorithm from at least two quite distinct points of view. It has been found in most instances to have the advantages of reliable global convergence,¹ low cost per iteration, economy of storage and ease of programming, as well as a certain heuristic appeal. On the other hand, it can also exhibit hopelessly slow convergence in some seemingly innocuous applications. All in all, it is undeniably of considerable current interest, and it seems likely to play an important role in the mixture density estimation problem for some time to come.

We feel that the point of view toward the EM algorithm for mixture densities advanced in [39] greatly facilitates both the formulation of a general procedure for prescribing the algorithm and the understanding of the important theoretical properties of the algorithm. Our objectives in the following are to present this point of view in detail in the mixture density context, to unify and extend the diverse results in the literature concerning the derivation and theoretical properties of the EM algorithm, and to review and add to what is known about its practical behavior. In §4, we interpret the mixture density estimation problem as an incomplete data problem, formulate the general EM algorithm for mixture densities from this point of view, and discuss the general properties of the algorithm. In §5, the focus is narrowed to mixtures of densities from exponential families, and we summarize and augment the results of investigations of the EM algorithm for such mixtures which have appeared in the literature. Finally, in §6, we discuss the performance of the algorithm in practice through qualitative comparisons with other algorithms and numerical studies in simple but important cases.
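As an illustration of Example 1.1 and of the distinction between labeled and unlabeled observations introduced above, here is a small simulation sketch; the mixture parameters are hypothetical and are not taken from the halibut data of [75].

```python
import random

def draw_labeled(alpha1, phi1, phi2):
    """Draw one observation from the mixture (1.3)-(1.4) together with its component label."""
    mu1, sigma1 = phi1
    mu2, sigma2 = phi2
    if random.random() < alpha1:           # component 1 chosen with probability alpha_1
        return 1, random.gauss(mu1, sigma1)
    return 2, random.gauss(mu2, sigma2)

# Hypothetical parameters: alpha = (0.55, 0.45), phi_1 = (95.0, 6.0), phi_2 = (110.0, 8.0).
random.seed(0)
sample = [draw_labeled(0.55, (95.0, 6.0), (110.0, 8.0)) for _ in range(1000)]

labeled = sample                     # component of origin recorded (e.g., sex known)
unlabeled = [x for _, x in sample]   # labels discarded: only the measurements remain
print(unlabeled[:5])
```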

2. A review of the literature. The following is a very brief survey of papers which are primarily directed toward some part of the mixture density estimation problem. No attempt has been made to include papers which are strictly concerned with applications of estimation procedures and results developed elsewhere. Little has been said about papers which are of mainly historical interest or peripheral to the subjects of major interest in the sequel. For additional references relating to mixture densities as well as more detailed summaries of the contents of many of the papers touched on below, we refer the reader to the recently published monograph by Everitt and Hand [47] and the recent survey by Titterington [145]. As a convenience, this survey has been divided somewhat arbitrarily

¹Throughout this paper, we use "global convergence" in the sense of the optimization community, i.e., to mean convergence to a local maximizer from almost any starting point (cf. Dennis and Schnabel [42, p. 5]).

by topics into four subsections. Not surprisingly, many papers are cited in more than one subsection.

2.1. The method of moments. The first published investigation relating to the mixture density estimation problem appears to be that of Pearson [109]. In that paper, as in Example 1.1, the problem considered is the estimation of the parameters in a mixture of two univariate normal densities. The sample from which the estimates are obtained is assumed to be independent and to consist entirely of unlabeled observations on the mixture. (Since this is the sort of sample dealt with in the vast majority of work on the problem at hand, it is understood in this review that all samples are of this type unless otherwise indicated.) The approach suggested by Pearson for solving the problem is known as the method of moments. The method of moments consists generally of equating some set of sample moments to their expected values and thereby obtaining a system of (generally nonlinear) equations for the parameters in the mixture density. To estimate the five independent parameters in a mixture of two univariate normal densities according to the procedure of [109], one begins with equations determined by the first five moments and, after considerable algebraic manipulation, ultimately arrives at expressions for estimates which depend on a suitably chosen root of a single ninth-degree polynomial.

From the time of the appearance of Pearson's paper until the use of computers became widespread in the 1960's, only fairly simple mixture density estimation problems were studied, and the method of moments was usually the method of choice for their solution. During this period, most energy devoted to mixture problems was directed toward mixtures of normal densities, especially toward Pearson's case of two univariate normal densities. Indeed, most work on normal mixtures during this period was intended either to simplify the job of obtaining Pearson's estimates or to offer more accessible estimates in restricted cases. In this connection, we cite the papers of Charlier [25] (who referred to the implementation of Pearson's method as "an heroic task"), Pearson and Lee [111], Charlier and Wicksell [26], Burrau [19], Stromgren [132], Gottschalk [55], Sittig [129], Weichselberger [152], Preston [116], Cohen [32], Dick and Bowden [43] and Gridgeman [57]. More recently, several investigators have studied the statistical properties of moment estimates for mixtures of two univariate normal densities and have compared their performance with that of other estimation methods. See Robertson and Fryer [125], Fryer and Robertson [50], Tan and Chang [138] and Quandt and Ramsey [118].

Some work has been done extending Pearson's method of moments to more general mixtures of normal densities and to mixtures of other continuous densities. Pollard [115] obtained moment estimates for a mixture of three univariate normal densities, and moment estimates for mixtures of multivariate normal densities were studied by Cooper [34], Day [37] and John [81]. These authors all made simplifying assumptions about the mixtures under consideration in order to reduce the complexity of the estimation problem. Gumbel [58] and Rider [123] derived moment estimates for the parameters in a mixture of two exponential densities under the assumption that the mixing proportions are known. Later, Rider offered moment estimates for mixtures of Weibull distributions in [124].
Moment estimates for mixtures of two gamma densities were treated by John [82]. Tallis and Light [136] explored the relative advantages of fractional moments in the case of mixtures of two exponential densities. Also, Kabir [83] introduced a generalized method of moments and applied it to this case.

Moment estimates for a variety of simple mixtures of discrete densities were derived more or less in parallel with moment estimates for mixtures of normal and other continuous densities. In particular, moment estimates for mixtures of binomial and


Poisson densities were investigated by Pearson [110], Muench [102], [103], Schilling [127], Gumbel [58], Arley and Buch [4], Rider [124], Blischke [12], [14] and Cohen [30]. Kabir [83] also applied his generalized method of moments to a mixture of two binomial densities. For additional information on moment estimation and many other topics of interest for mixtures of discrete densities, see the extensive survey of Blischke [13].

Before leaving the method of moments, we mention the important problem of estimating the proportions alone in a mixture density under the assumption that the component densities, or at least some useful statistics associated with them, are known. Most general mixture density estimation procedures can be brought to bear on this problem, and the manner of applying these general procedures to this problem is usually independent of the particular forms of the densities in the mixture. In addition to the general estimation procedures, a number of special procedures have been developed for this problem; these are discussed in §2.3 of this review. The method of moments has the attractive property for this problem that the moment equations are linear in the mixture proportions. Moment estimates of proportions were discussed by Odell and Basu [104] and Tubbs and Coberly [147].

2.2. The method of maximum likelihood. With the arrival of increasingly powerful computers and increasingly sophisticated numerical methods during the 1960's, investigators began to turn from the method of moments to the method of maximum likelihood as the most widely preferred approach to mixture density estimation problems. To reiterate the working definition given in the introduction, we say that a maximum-likelihood estimate associated with a sample of observations is a choice of parameters which maximizes the probability density function of the sample, called in this context the likelihood function. In the next section, we define precisely the maximum-likelihood estimates of interest here and comment on their properties. In this subsection, we offer a very brief tour of the literature addressing maximum-likelihood estimation for mixture densities. Of course, more is said in the sequel about most of the work mentioned below.

Actually, maximum-likelihood estimates and their associated properties were often the subject of wishful thinking prior to the advent of computers, and some work was done then toward obtaining maximum-likelihood estimates for simple mixtures. Specifically, we cite a curious paper of Baker [5] and work by Rao [119] and Mendenhall and Hader [101]. In both [119] and [101], iterative procedures were successfully used to obtain approximate solutions of nonlinear equations satisfied by maximum-likelihood estimates. In [119], a system of four equations in four unknown parameters was solved by the method of scoring; in [101], a single equation in one unknown was solved by Newton's method. Both of these methods are described in §6. Despite these successes, the problem of obtaining maximum-likelihood estimates was generally considered during this period to be completely intractable for computational reasons.

As computers became available to ease the burden of computation, maximum-likelihood estimation was proposed and studied in turn for a variety of increasingly complex mixture densities. As before, mixtures of normal densities were the subject of considerable attention. Hasselblad [70] treated maximum-likelihood estimation for mixtures of any number of univariate normal densities; his major results were later obtained independently by Behboodian [9].
Mixtures of two multivariate normal densities with a common unknown covariance matrix were addressed by Day [37] and John [81]. The general case of a mixture of any number of multivariate normal densities was considered by Wolfe [153], and additional work on this case was done by Duda and Hart [46] and Peters and Walker [113]. Redner [121] and Hathaway [72] proposed,

respectively, penalty terms on the likelihood function and constraints on the variables in order to deal with singularities. Tan and Chang [138] compared the moment and maximum-likelihood estimates for a mixture of two univariate normal densities with common variance by computing the asymptotic variances of the estimates. Hosmer [74] reported on a Monte Carlo study of maximum-likelihood estimates for a mixture of two univariate normal densities when the component densities are not well separated and the sample size is small.

Several interesting variations on the usual estimation problem for mixtures of normal densities have been addressed in the literature. Hosmer [75] compared the maximum-likelihood estimates for a mixture of two univariate normal densities obtained from three different types of samples, the first of which is the usual type consisting of only unlabeled observations and the second two of which consist of both labeled and unlabeled observations and are distinguished by whether or not the labeled observations contain information about the mixing proportions. Such partially labeled samples were also considered by Tan and Chang [137] and Dick and Bowden [43]. (We elaborate on the nature of these samples and how they might arise in §3.) The treatment of John [81] is unusual in that the number of observations from each component in the mixture is estimated along with the usual component parameters. Finally, a number of authors have investigated maximum-likelihood estimates for a "switching regression" model which is a certain type of estimation problem for mixtures of normal densities; see the papers of Quandt [117], Hosmer [76], Kiefer [88] and the comments by Hartley [69], Hosmer [77], Kiefer [89] and Kumar, Nicklin and Paulson [91] on the paper of Quandt and Ramsey [118]. A generalization of the model considered by these authors was touched on by Dennis [40].

Maximum-likelihood estimation has also been studied for a variety of unusual and general mixture density problems, some of which include but are not restricted to the usual normal mixture problem. Cohen [31] considered an unusual but simple mixture of two discrete densities, one of which has support at a single point; he focused in particular on the case in which the other density is a negative binomial density. Hasselblad [71] generalized his earlier results in [70] to include mixtures of any number of univariate densities from exponential families. He included a short study comparing maximum-likelihood estimates with the moment estimates of Blischke [14] for a mixture of two binomial distributions. Baum, Petrie, Soules and Weiss [8] addressed a mixture estimation problem which is both unusual and in one respect more general than the problems considered in the sequel. In their problem, the a priori probabilities of sample observations coming from the various component populations in the mixture are not independent from one observation to the next (that is, they are not simply the proportions of the component populations in the mixture), but rather are specified to follow a Markov chain. Their results are specifically applied to mixtures of univariate normal, gamma, binomial and Poisson densities, and to mixtures of general strictly log concave density functions which are identical except for unknown location and scale parameters. John [82] treated maximum-likelihood estimation for a mixture of two gamma distributions in a manner similar to that of [81].
Gupta and Miyawaki [59] considered uniform mixtures. Peters and Coberly [112] and Peters and Walker [114] treated maximum-likelihood estimates of proportions and subsets of proportions for essentially arbitrary mixture densities. Maximum-likelihood estimates were included by Tubbs and Coberly [147] in their study of the sensitivity of various proportion estimates. Other maximum-likelihood estimation problems which are closely related to those considered here are the latent structure problems touched on by Wolfe [153] (see also Lazarsfeld and Henry [93]) and the problems concerning frequency tables derived by indirect observation addressed by


Haberman [62], [63], [64]. Finally, although infinite mixture densities of the general form (1.2) are specifically excluded from consideration here, we mention a very interesting result of Laird [92] to the effect that, under various assumptions, the maximum-likelihood estimate of a possibly infinite mixture density is actually a finite mixture density. A special case of this result was shown earlier by Simar [128] in a broad study of maximum-likelihood estimation for possibly infinite mixtures of Poisson densities.

2.3. Other methods. In addition to the method of moments and the method of maximum likelihood, a variety of other methods have been proposed for estimating parameters in mixture densities. Some of these methods are general purpose methods. Others are (or were at the time of their derivation) intended for mixture problems the forms of which make (or made) them either ill suited for the application of more widely used methods or particularly well suited for the application of special purpose methods.

For mixtures of any number of univariate normal densities, Harding [67] and Cassie [20] suggested graphical procedures employing probability paper as an alternative to moment estimates, which were at that time practically unobtainable in all but the simplest cases. Later, Bhattacharya [11] prescribed other graphical methods as a particularly simple way of resolving a mixture density into normal components. These graphical procedures work best on mixture populations which are well separated in the sense that each component has an associated region in which the presence of the other components can be ignored. More recently, Fowlkes [49] introduced additional procedures which are intended to be more sensitive to the presence of mixtures than other methods. In the bivariate case, Tarter and Silvers [140] introduced a graphical procedure for decomposing mixtures of any number of normal components.

Also for general mixtures of univariate normal densities, Doetsch [45] exhibited a linear operator which reduces the variances of the component densities without changing their proportions or means and used this operator in a procedure which determines the component densities one at a time. For applications and extensions of this technique to other densities, see Medgyessy [100] (and the review by Mallows [98]), Gregor [56] and Stanat [130]. In [126], Sammon considered a mixture density consisting of an unknown number of component densities which are identical except for translation by unknown location parameters; he derived techniques based on convolution for estimating both the number of components in the mixture and the location parameters.

A number of specialized procedures have been developed for application to the problem of estimating the proportions in a mixture under the assumption that something about the component densities is known. Choi and Bulgren [29] proposed an estimate determined by a least-squares criterion in the spirit of the minimum-distance method of Wolfowitz [154]. A variant of the method of [29] for which smaller bias and mean-square error were reported was offered by Macdonald [96]. A method termed the "confusion matrix method" was given by Odell and Chhikara [105] (see also the review of Odell and Basu [104]). In this method, an estimate is obtained by subdividing R^n into disjoint regions R_1, …, R_m and then solving the equation Pα̂ = ê, in which α̂ is the estimated vector of proportions, ê is a vector whose ith component is the fraction of observations falling in R_i, and the "confusion matrix" P has ijth entry

∫_{R_i} p_j(x | φ_j) dx.

The confusion matrix method is a special case of a method of Macdonald [97], whose formulation of the problem as a least-squares problem allows for a singular or rectangular

confusion matrix. Related methods have been considered recently by Hall [66] and Titterington [146]. Earlier, special cases of estimates of this type were considered by Boes [15], [16]. Guseman and Walton [60], [61] employed certain pattern recognition notions and techniques to obtain numerically tractable confusion matrix proportion estimates for mixtures of multivariate normal densities. James [80] studied several simple confusion matrix proportion estimates for a mixture of two univariate normal densities. Ganesalingam and McLachlan [52] compared the performance of confusion matrix proportion estimates with maximum-likelihood proportion estimates for a mixture of two multivariate normal densities. Another approach to proportion estimation was taken by Anderson [3], who proposed a fairly general method based on direct parametrization and estimation of likelihood ratios. Finally, Walker [151] considered a mixture of two essentially arbitrary multivariate densities and, assuming only that the means of the component densities are known, suggested a simple procedure using linear maps which yields unbiased proportion estimates.

A stochastic approximation algorithm for estimating the parameters in a mixture of any number of univariate normal densities was offered by Young and Coraluppi [160]. In such an algorithm, one determines a sequence of recursively updated estimates from a sequence of observations of indeterminate length considered on a one-at-a-time or few-at-a-time basis. Such an algorithm is likely to be appealing when a sample of desired size is either unavailable in toto at any one point in time or unwieldy because of its size. Stochastic approximation of mixture proportions alone was considered by Kazakos [86].

Quandt and Ramsey [118] derived a procedure called the moment generating function method and applied it to the problem of estimating the parameters in a mixture of two univariate normal densities and in a switching regression model. In brief, a moment generating function estimate is a choice of parameters which minimizes a certain sum of squares of differences between the theoretical and sample moment generating functions. In a comment by Kiefer [89], it is pointed out that the moment generating function method can be regarded as a natural generalization of the method of moments. Kiefer [89] further offers an appealing heuristic explanation of the apparent superiority of moment generating function estimates over moment estimates reported by Quandt and Ramsey [118]. In a comment by Hosmer [77], evidence is presented that moment generating function estimates may in fact perform better than maximum-likelihood estimates in the small-sample case. However, Kumar et al. [91] attribute difficulties to the moment generating function method, and they suggest another method based on the characteristic function.

Minimum chi-square estimation is a general method of estimation which has been touched on by a number of authors in connection with the mixture density estimation problem, but which has not become the subject of much consideration in depth in this context. In minimum chi-square estimation, one subdivides R^n into cells R_1, …, R_k and seeks a choice of parameters Φ which minimizes

χ²(Φ) = Σ_{j=1}^{k} [N_j − E_j(Φ)]² / E_j(Φ),

or some similar criterion function. In this expression, N_j and E_j(Φ) are, respectively, the observed and expected numbers of observations in R_j for j = 1, …, k.
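As a rough numerical illustration of the minimum chi-square criterion just defined (a sketch under simplifying assumptions, not a procedure taken from any of the papers cited), one can estimate the single mixing proportion of a two-component mixture with known normal components by minimizing χ²(Φ) over a grid:

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cell_prob(a, b, alpha, phi1, phi2):
    """P(R_j) under the two-component mixture, for cell R_j = (a, b]."""
    p1 = normal_cdf(b, *phi1) - normal_cdf(a, *phi1)
    p2 = normal_cdf(b, *phi2) - normal_cdf(a, *phi2)
    return alpha * p1 + (1.0 - alpha) * p2

def chi_square(alpha, counts, edges, phi1, phi2):
    """chi^2(alpha) = sum_j (N_j - E_j)^2 / E_j with E_j = N * P(R_j)."""
    n_total = sum(counts)
    val = 0.0
    for j, n_j in enumerate(counts):
        e_j = n_total * cell_prob(edges[j], edges[j + 1], alpha, phi1, phi2)
        val += (n_j - e_j) ** 2 / e_j
    return val

# Hypothetical binned data and known component parameters phi_i = (mu_i, sigma_i).
edges = [-math.inf, -1.0, 0.0, 1.0, 2.0, 3.0, math.inf]
counts = [40, 110, 180, 230, 260, 180]
phi1, phi2 = (0.0, 1.0), (2.0, 1.0)

# Crude grid search over the single unknown proportion.
best = min((chi_square(a / 100.0, counts, edges, phi1, phi2), a / 100.0) for a in range(1, 100))
print("estimated proportion:", best[1])
```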
For mixtures of normal densities, minimum chi-square estimates were mentioned by Hasselblad [70], Cohen [32], Day [37] and Fryer and Robertson [50]. Minimum chi-square estimates of proportions were reviewed by Odell and Basu [104] and included in the sensitivity study of Tubbs and Coberly [147]. Macdonald [97] remarked that his weighted least-squares

approach to proportion estimation suggested a convenient iterative method for computing minimum chi-square estimates.

As a final note, we mention three methods which have been proposed for general mixture density estimation problems. Choi [28] discussed the extension to general mixture density estimation problems of the least-squares method of Choi and Bulgren [29] for estimating proportions. Deely and Kruse [38] suggested an estimation procedure which is in spirit like that of Choi and Bulgren [29] and Choi [28], except that a sup-norm distance is used in place of the square integral norm. Deely and Kruse argued that their procedure is computationally feasible, but no concrete examples or computation results are given in [38]. Yakowitz [156], [157] outlined a very general "algorithm" for constructing consistent estimates of the parameters in mixture densities which are identifiable in the sense described in §2.5 of this review. The sense in which his "algorithm" is really an algorithm in the usually understood meaning of the word is discussed in [157].

2.4. The EM algorithm. At several points in the review above, we have alluded to computational difficulties associated with obtaining maximum-likelihood estimates. For mixture density problems, these difficulties arise because of the complex dependence of the likelihood function on the parameters to be estimated. The customary way of finding a maximum-likelihood estimate is first to determine a system of equations called the likelihood equations which are satisfied by the maximum-likelihood estimate, and then to attempt to find the maximum-likelihood estimate by solving these likelihood equations. The likelihood equations are usually found by differentiating the logarithm of the likelihood function, setting the derivatives equal to zero, and perhaps performing some additional algebraic manipulations. For mixture density problems, the likelihood equations are almost certain to be nonlinear and beyond hope of solution by analytic means. Consequently, one must resort to seeking an approximate solution via some iterative procedure.

There are, of course, many general iterative procedures which are suitable for finding an approximate solution of the likelihood equations and which have been honed to a high degree of sophistication within the optimization community. We have in mind here principally Newton's method, various quasi-Newton methods which are variants of it, and conjugate gradient methods. In fact, the method of scoring, which was mentioned above in connection with the work of Rao [119] and which we describe in detail in the sequel, falls into the category of Newton-like methods and is one such method which is specifically formulated for solving likelihood equations.

Our main interest here, however, is in a special iterative method which is unrelated to the above methods and which has been applied to a wide variety of mixture problems over the last fifteen or so years. Following the terminology of Dempster, Laird and Rubin [39], we call this method the EM algorithm (E for "expectation" and M for "maximization"). As we mentioned in the introduction, it has been found in most instances to have the advantage of reliable global convergence, low cost per iteration, economy of storage and ease of programming, as well as a certain heuristic appeal; unfortunately, its convergence can be maddeningly slow in simple problems which are often encountered in practice.
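For orientation, the following sketch shows the kind of iteration the EM algorithm produces in the simplest setting, an unlabeled sample from a mixture of univariate normal densities; these are the standard updates prescribed in the references discussed below (e.g., [70]), and the data and starting values are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_step(xs, alphas, mus, sigmas):
    """One EM iteration for a finite mixture of univariate normal densities."""
    m, n = len(alphas), len(xs)
    # E-step: posterior probability that observation k came from component i.
    w = [[alphas[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(m)] for x in xs]
    w = [[wk_i / sum(wk) for wk_i in wk] for wk in w]
    # M-step: update proportions, means and standard deviations.
    new_alphas, new_mus, new_sigmas = [], [], []
    for i in range(m):
        s = sum(w[k][i] for k in range(n))
        mu = sum(w[k][i] * xs[k] for k in range(n)) / s
        var = sum(w[k][i] * (xs[k] - mu) ** 2 for k in range(n)) / s
        new_alphas.append(s / n)
        new_mus.append(mu)
        new_sigmas.append(math.sqrt(var))
    return new_alphas, new_mus, new_sigmas

# Hypothetical unlabeled data and starting guess.
xs = [0.2, -0.5, 0.9, 4.1, 3.7, 5.0, 0.1, 4.4, -0.2, 3.9]
alphas, mus, sigmas = [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]
for _ in range(50):
    alphas, mus, sigmas = em_step(xs, alphas, mus, sigmas)
print(alphas, mus, sigmas)
```

Each such pass does not decrease the sample log-likelihood, which is the monotonicity property underlying the global convergence behavior discussed in the survey below.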
In the mixture density context, the EM algorithm has been derived and studied from at least two distinct viewpoints by a number of authors, many of them working independently. Hasselblad [70] obtained the EM algorithm for an arbitrary finite mixture of univariate normal densities and made empirical observations about its behavior. In an extension of [70], he further prescribed the algorithm for essentially arbitrary finite mixtures of univariate densities from exponential families in [71]. The


EM algorithm of [70] for univariate normal mixtures was given again by Behboodian [9], while Day [37] and Wolfe [153] formulated it for, respectively, mixtures of two multivariate normal densities with a common covariance matrix and arbitrary finite mixtures of multivariate normal densities. All of these authors apparently obtained the EM algorithm independently, although Wolfe [153] referred to Hasselblad [70]. They all derived the algorithm by setting the partial derivatives of the log-likelihood function equal to zero and, after some algebraic manipulation, obtained equations which suggest the algorithm.

Following these early derivations, the EM algorithm was applied by Tan and Chang [137] to a mixture problem in genetics and used by Hosmer [74] in the Monte Carlo study of maximum likelihood estimates referred to earlier. Duda and Hart [46] cited the EM algorithm for mixtures of multivariate normal densities and commented on its behavior in practice. Hosmer [75] extended the EM algorithm for mixtures of two univariate normal densities to include the partially labeled samples described briefly above. Hartley [69] prescribed the EM algorithm for a "switching regression" model. Peters and Walker [113] offered a local convergence analysis of the EM algorithm for mixtures of multivariate normal densities and suggested modifications of the algorithm to accelerate convergence. Peters and Coberly [112] studied the EM algorithm for approximating maximum-likelihood estimates of the proportions in an essentially arbitrary mixture density and gave a local convergence analysis of the algorithm. Peters and Walker [114] extended the results of [112] to include subsets of mixture proportions and a local convergence analysis along the lines of [113].

All of the above investigators regarded the EM algorithm as arising naturally from the particular forms taken by the partial derivatives of the log-likelihood function. A quite different point of view toward the algorithm was put forth by Dempster, Laird and Rubin [39]. They interpreted the mixture density estimation problem as an estimation problem involving incomplete data by regarding an unlabeled observation on the mixture as "missing" a label indicating its component population of origin. In doing so, they not only related the mixture density problem to a broader class of statistical problems but also showed that the EM algorithm for mixture density problems is really a specialization of a more general algorithm (also called the EM algorithm in [39]) for approximating maximum-likelihood estimates from incomplete data. As one sees in the sequel, this more general EM algorithm is defined in such a way that it has certain desirable theoretical properties by its very definition. Earlier, the EM algorithm was defined independently in a very similar manner by Baum et al. [8] for very general mixture density estimation problems, by Sundberg [135] for incomplete data problems involving exponential families (and specifically including mixture problems), and by Haberman [62], [63], [64] for mixture-related problems involving frequency tables derived by indirect observation. Haberman also refers in [64] to versions of his algorithm developed by Ceppellini, Siniscalco and Smith [21], Chen [27] and Goodman [54]. In addition, an interpretation of mixture problems as incomplete data problems was given in the brief discussion of mixtures by Orchard and Woodbury [106]. The desirable theoretical properties automatically enjoyed by the EM algorithm suggest in turn the good global convergence behavior of the algorithm which has been observed in practice by many investigators.
Theorems which essentially confirm this suggested behavior have been recently obtained by Redner [121], Vardi [149], Boyles [18] and Wu [155] and are outlined in the sequel.

2.5. Identifiability and information. To complete this review, we touch on two topics which have to do with the general well-posedness of estimation problems rather than with any particular method of estimation. The first topic, identifiability, addresses the

theoretical question of whether it is possible to uniquely estimate a parameter from a sample, however large. The second topic, information, relates to the practical matter of how good one can reasonably hope for an estimate to be. A thorough survey of these topics is far beyond the scope of this review; we try to cover, below, those aspects of them which have a specific bearing on the sequel.

In general, a parametric family of probability density functions is said to be identifiable if distinct parameter values determine distinct members of the family. For families of mixture densities, this general definition requires a special interpretation. For the purposes of this paper, let us first say that a mixture density p(x | Φ) of the form (1.1) is economically represented if, for each pair of integers i and j between 1 and m, one has that p_i(x | φ_i) = p_j(x | φ_j) for almost all x ∈ R^n (relative to the underlying measure on R^n appropriate for p(x | Φ)) only if either i = j or one of α_i and α_j is zero. Then it suffices to say that a family of mixture densities of the form (1.1) is identifiable for Φ ∈ Ω if for each pair Φ' = (α_1', …, α_m', φ_1', …, φ_m') and Φ'' = (α_1'', …, α_m'', φ_1'', …, φ_m'') in Ω determining economically represented densities p(x | Φ') and p(x | Φ''), one has that p(x | Φ') = p(x | Φ'') for almost all x ∈ R^n only if there is a permutation π of (1, …, m) such that α_i' = α''_{π(i)} and, if α_i' ≠ 0, φ_i' = φ''_{π(i)} for i = 1, …, m. For a more general definition suitable for possibly infinite mixture densities of the form (1.2), see, for example, Yakowitz and Spragins [158]. It is tacitly assumed here that all families of mixture densities under consideration are identifiable. One can easily determine the identifiability of specific mixture densities using, for example, the identifiability characterization theorem of Yakowitz and Spragins [158]. For more on identifiability of mixture densities, the reader is referred to the papers of Teicher [141], [142], [143], [144], Barndorff-Nielsen [6], Yakowitz and Spragins [158], and Yakowitz [156], [157], and to the book by Maritz [99].

The Fisher information matrix is given by

(2.5.1)  I(Φ) = ∫_{R^n} [∇_Φ log p(x | Φ)] [∇_Φ log p(x | Φ)]^T p(x | Φ) dμ(x),

provided that p(x | Φ) is such that this expression is well defined. (In writing ∇_Φ, we suppose that one can conveniently redefine Φ as a vector Φ = (Φ_1, …, Φ_ν)^T of unconstrained scalar parameters, and we take ∇_Φ = (∂/∂Φ_1, …, ∂/∂Φ_ν)^T. Also, in (2.5.1), μ denotes the underlying measure on R^n appropriate for p(x | Φ).) The Fisher information matrix has general significance concerning the distribution of unbiased and asymptotically unbiased estimates. For the present purposes, the importance of the Fisher information matrix lies in its role in determining the asymptotic distribution of maximum-likelihood estimates (see Theorem 3.1 below).

A number of authors have considered the Fisher information matrix for finite mixture densities in a variety of contexts. We mention in particular several investigations in which the Fisher information matrix is of central interest. (There have been others in which the Fisher information matrix or some approximation of it has played a significant but less prominent role; see those of Mendenhall and Hader [101], Hasselblad [70], [71], Day [37], Wolfe [153], Dick and Bowden [43], Hosmer [74], James [80] and Ganesalingam and McLachlan [52].) Hill [73] exploited simple approximations obtained in limiting cases from a general power series expansion to investigate the Fisher information for estimating the proportion in a mixture of two normal or exponential densities. Behboodian [10] offered methods for computing the Fisher information matrix for the proportion, means and variances in a mixture of two univariate normal densities; he also provided four-place tables from which approximate information matrices for a variety of parameter values can be easily obtained. In their comparison of moment and

maximum-likelihood estimates, Tan and Chang [138] numerically evaluated the diagonal elements of the inverse of the Fisher information matrix at a variety of parameter values for a mixture of two univariate normal densities with a common variance. Using the Fisher information matrix, Chang [23] investigated the effects of adding a second variable on the asymptotic distribution of the maximum-likelihood estimates of the proportion and parameters associated with the first variable in a mixture of two normal densities. Later, Chang [24] extended the methods of [23] to include mixtures of two normal densities on variables of arbitrary dimension. For a mixture of two univariate normal densities, Hosmer and Dick [78] considered Fisher information matrices determined by a number of sample types. They compared the asymptotic relative efficiencies of estimates from totally unlabeled samples, estimates from two types of partially labeled samples, and estimates from two types of completely labeled samples.
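As a numerical aside (not from the paper), the Fisher information matrix (2.5.1) can be approximated by Monte Carlo: draw observations from p(x | Φ), form the score vector ∇_Φ log p(x | Φ) (here by finite differences), and average its outer products. The two-component normal mixture and its parametrization below are illustrative assumptions.

```python
import math, random

def mixture_pdf(x, theta):
    """p(x | Phi) for a two-component univariate normal mixture.
    theta = (alpha_1, mu_1, sigma_1, mu_2, sigma_2), with alpha_2 = 1 - alpha_1."""
    a1, mu1, s1, mu2, s2 = theta
    def npdf(mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return a1 * npdf(mu1, s1) + (1.0 - a1) * npdf(mu2, s2)

def score(x, theta, h=1e-5):
    """Finite-difference approximation of the score vector grad log p(x | Phi)."""
    g = []
    for i in range(len(theta)):
        tp = list(theta); tp[i] += h
        tm = list(theta); tm[i] -= h
        g.append((math.log(mixture_pdf(x, tp)) - math.log(mixture_pdf(x, tm))) / (2 * h))
    return g

def sample(theta):
    a1, mu1, s1, mu2, s2 = theta
    return random.gauss(mu1, s1) if random.random() < a1 else random.gauss(mu2, s2)

# Hypothetical parameter value at which I(Phi) is approximated.
theta = (0.4, 0.0, 1.0, 3.0, 1.5)
random.seed(1)
n, d = 5000, len(theta)
info = [[0.0] * d for _ in range(d)]
for _ in range(n):
    g = score(sample(theta), theta)
    for i in range(d):
        for j in range(d):
            info[i][j] += g[i] * g[j] / n
print([round(info[i][i], 3) for i in range(d)])  # diagonal of the estimated I(Phi)
```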

3. Maximum likelihood. In this section, maximum-likelihood estimates for mixture densities are defined precisely, and their important properties are discussed. It is assumed that a parametric family of mixture densities of the form (1.1) is specified and that a particular Φ* = (α_1*, …, α_m*, φ_1*, …, φ_m*) ∈ Ω is the "true" parameter value to be estimated. As before, it is both natural and convenient to regard p(x | Φ) in (1.1) as modeling a statistical population which is a mixture of m component populations with associated component densities {p_i}_{i=1,…,m} and mixing proportions {α_i}_{i=1,…,m}.

In order to suggest to the reader the variety of samples which might arise in mixture problems, as well as to provide a framework within which to discuss samples of interest in the sequel, we introduce samples of observations in R^n of four distinct types. All of the mixture density estimation problems which we have encountered in the literature involve samples which are expressible as one or a stochastically independent union of samples of these types, although the imaginative reader can probably think of samples for mixture problems which cannot be so represented. The four types of samples and the notation which we associate with them are given as follows:

Type 1. Suppose that {x_k}_{k=1,…,N} is an independent sample of N unlabeled observations on the mixture, i.e., a set of N observations on independent, identically distributed random variables with density p(x | Φ*). Then S_1 = {x_k}_{k=1,…,N} is a sample of Type 1.

Type 2. Suppose that J_1, …, J_m are arbitrary nonnegative integers and that, for i = 1, …, m, {y_ik}_{k=1,…,J_i} is an independent sample of observations on the ith component population, i.e., a set of J_i observations on independent, identically distributed random variables with density p_i(x | φ_i*). Then S_2 = ∪_{i=1}^{m} {y_ik}_{k=1,…,J_i} is a sample of Type 2.

Type 3. Suppose that an independent sample of K unlabeled observations is drawn on the mixture, that these observations are subsequently labeled, and that, for i = 1, …, m, a set {z_ik}_{k=1,…,K_i} of them is associated with the ith component population, with K = Σ_{i=1}^{m} K_i. Then S_3 = ∪_{i=1}^{m} {z_ik}_{k=1,…,K_i} is a sample of Type 3.

Type 4. Suppose that an independent sample of M unlabeled observations is drawn on the mixture, that the observations in the sample which fall in some set E ⊆ R^n are subsequently labeled, and that, for i = 1, …, m, a set {w_ik}_{k=1,…,M_i} of them is thereby associated with the ith component population while a set {w_0k}_{k=1,…,M_0} remains unlabeled. Then S_4 = ∪_{i=0}^{m} {w_ik}_{k=1,…,M_i} is a sample of Type 4.

A totally unlabeled sample S_1 of Type 1 is the sort of sample considered in almost all of the literature on mixture densities. Throughout most of the sequel, it is assumed as a convenience that samples under consideration are of this type. The major qualitative difference between completely labeled samples S_2 and S_3 of Types 2 and 3, respectively, is that the numbers K_i contain information about the mixing proportions while the numbers J_i do not. Thus if estimation of proportions is of interest, then a sample S_2 is useful only as

a subset of a larger sample which includes samples of other types. For mixtures of two univariate densities, Hosmer [75] considered samples of the forms S_1, S_1 ∪ S_2, and S_1 ∪ S_3. Previously, Tan and Chang [137] considered a problem involving an application of mixtures in explaining genetic variation which is almost identical to that of [75] in which the sample is of the form S_1 ∪ S_2. Also Dick and Bowden [43] used a sample of the form S_1 ∪ S_2 in which m = 2 and J_2 = 0. Hosmer and Dick [78] evaluated the Fisher information matrix for a variety of samples of Types 1, 2, and 3, and their unions.

A sample S_4 of Type 4 is likely to be associated with a mixture problem involving censored sampling. While the numbers M_i contain information about the mixing proportions, as do the numbers K_i of a sample S_3 of Type 3, they also contain information about the parameters of the component densities while the numbers K_i do not. An interesting and informative example of how a sample of Type 4 might arise is the following, which is in the area of life testing and is outlined by Mendenhall and Hader [101].

Example 3.1. In life testing, one is interested in testing "products" (systems, devices, etc.), recording failure times or causes, and hopefully thereby being better able to understand and improve the performance of the product. It often happens that products of a particular type fail as a result of two or more distinct causes. (An example of Acheson and McElwee [1] is quoted in [101] in which the causes of electronic tube failure are divided into gaseous defects, mechanical defects and normal deterioration of the cathode.) It is therefore natural to regard collections of such products as mixture populations, the component populations of which correspond to the distinct causes of failure. The first objective of life testing in such cases is likely to be estimation of the proportions and other statistical parameters associated with the failure component populations. Because of restrictions on time available for testing, life testing must often be concluded after a predetermined length of time has elapsed or after a predetermined number of product units have failed, resulting in censored sampling. If the causes of failure of the failed products are determined in the course of such an experiment, then the (labeled) failed products together with those (unlabeled) products which did not fail constitute a sample of Type 4.

The likelihood function of a sample of observations is the probability density function of the random sample evaluated at the observations at hand. When maximum-likelihood estimates are of interest, it is usually convenient to deal with the logarithm of the likelihood function, called the log-likelihood function, rather than with the likelihood function itself. The following are the log-likelihood functions L_1, L_2, L_3 and L_4 of samples S_1, S_2, S_3 and S_4 of Types 1, 2, 3, and 4 respectively:

(3.1)  L_1(Φ) = Σ_{k=1}^{N} log p(x_k | Φ),

(3.2)  L_2(Φ) = Σ_{i=1}^{m} Σ_{k=1}^{J_i} log p_i(y_ik | φ_i),

(3.3)  L_3(Φ) = Σ_{i=1}^{m} Σ_{k=1}^{K_i} log [α_i p_i(z_ik | φ_i)] + log (K! / (K_1! ⋯ K_m!)),

(3.4)  L_4(Φ) = Σ_{k=1}^{M_0} log p(w_0k | Φ) + Σ_{i=1}^{m} Σ_{k=1}^{M_i} log [α_i p_i(w_ik | φ_i)] + log (M! / (M_0! M_1! ⋯ M_m!)).

Note that if a sample of observations is a union of independent samples of the types considered here, then the log-likelihood function of the sample is just the corresponding sum of log-likelihood functions defined above for the samples in the union.
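The following sketch (with hypothetical data) illustrates how the log-likelihoods (3.1) and (3.3) combine for a union S_1 ∪ S_3 of an unlabeled and a labeled sample; the constant multinomial term in (3.3) is dropped since it does not depend on Φ.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, alphas, phis):
    return sum(a * normal_pdf(x, mu, s) for a, (mu, s) in zip(alphas, phis))

def log_lik_unlabeled(xs, alphas, phis):
    """L_1(Phi) of (3.1): sum over unlabeled observations of log p(x_k | Phi)."""
    return sum(math.log(mixture_pdf(x, alphas, phis)) for x in xs)

def log_lik_labeled(zs, alphas, phis):
    """L_3(Phi) of (3.3), up to the additive constant log(K!/(K_1!...K_m!));
    zs[i] is the list of observations labeled as coming from component i."""
    total = 0.0
    for i, obs in enumerate(zs):
        mu, s = phis[i]
        total += sum(math.log(alphas[i] * normal_pdf(z, mu, s)) for z in obs)
    return total

# Hypothetical parameter value and data.
alphas, phis = (0.5, 0.5), ((0.0, 1.0), (3.0, 1.0))
s1 = [0.3, 2.9, -0.4, 3.5, 0.1]          # unlabeled sample S_1
s3 = [[-0.2, 0.5], [3.1, 2.7, 3.8]]      # labeled sample S_3, grouped by component
print(log_lik_unlabeled(s1, alphas, phis) + log_lik_labeled(s3, alphas, phis))
```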


If S is a sample of observations of the sort under consideration, then by a maximum-likelihood estimate of Φ*, we mean any choice of Φ in Ω at which the log-likelihood function of S, denoted by L(Φ), attains its largest local maximum in Ω. In defining a maximum-likelihood estimate in this way, we have taken into account two practical difficulties associated with maximum-likelihood estimation for mixture densities. The first difficulty is that one cannot always in good conscience take Ω to be a set in which the log-likelihood function is bounded above, and so there are not always points in Ω at which L attains a global maximum over Ω. Perhaps the most notorious mixture problem for which L is not bounded above in Ω is that in which p is a mixture of normal densities and S = S_1, a sample of Type 1. It is easily seen in this case that if one of the mixture means coincides with a sample observation and if the corresponding variance tends to zero (or if the corresponding covariance matrix tends in certain ways to a singular matrix in the multivariate case), then the log-likelihood function increases without bound. This was perhaps first observed by Kiefer and Wolfowitz [87], who offered an example involving a mixture of two univariate normal densities to show that classically defined maximum-likelihood estimates, i.e., global maximizers of the likelihood function, need not exist. They also considered certain "modified" and "neighborhood" maximum-likelihood estimates in [87] and pointed out that in their example the former cannot exist and the latter are not consistent. For the normal mixture problem, an advantage of including labeled observations in a sample is that, with probability one, this difficulty does not occur if the sample includes more than n labeled observations from each component population. This was observed in the univariate case by Hosmer [75]. Other methods of circumventing this difficulty in the normal mixture case include the imposition of constraints on the variables (Hathaway [72]) or penalty terms on the log-likelihood function (Redner [121]).

The second difficulty is that mixture problems are very often such that the log-likelihood function attains its largest local maximum at several different choices of Φ. Indeed, if p_i and p_j are of the same parametric family for some i and j and if S = S_1, a sample of Type 1, then the value of L(Φ) will not change if the component pairs (α_i, φ_i) and (α_j, φ_j) are interchanged in Φ, i.e., if in effect there is "label switching" of the ith and jth component populations. The results reviewed below show that whether or not such "label switching" is a cause for concern depends on whether estimates of the particular component density parameters are of interest, or whether only an approximation of the mixture density is desired. We remark that this "label switching" difficulty can certainly occur in mixtures which are identifiable (see §2.5).

It should be mentioned that there is a problem relating to this second difficulty which can be quite difficult to overcome in practice. The problem is that the log-likelihood function can (and often does) have local maxima which are not largest local maxima. Thus one must in practice decide whether to accept a given local maximum as the largest or to go on to search for others.
At the present time, there is little that can be done about this problem, other than to find in one way or another as many local maxima as is feasible and to compare them to find the largest. There is currently no adequately efficient and reliable way of systematically determining all local optima of a general function, although work has been done toward this end and some algorithms have been developed which have proved useful for some problems. For more information, see Dixon and Szego [44].

In the remainder of this section, our interest is in the important general qualitative properties of maximum-likelihood estimates of mixture density parameters. For convenience, we restrict the discussion to the case which is most often addressed in the

literature, namely, that in which the sample S at hand is a sample S_1 of Type 1. We also assume that each component density p_i is differentiable with respect to φ_i and make the nonessential assumption that the parameters φ_i are unconstrained in Ω_i and mutually independent variables. It is not difficult to modify the discussion below to obtain similar statements which are appropriate for other mixture density estimation problems of interest. For a discussion of the properties of maximum-likelihood estimates of constrained variables, see the papers of Aitchison and Silvey [2] and Hathaway [72].

The traditional general approach to determining a maximum-likelihood estimate is first to arrive at a system of likelihood equations satisfied by the maximum-likelihood estimate and then to try to obtain a maximum-likelihood estimate by solving the likelihood equations. Basically, the likelihood equations are found by considering the partial derivatives of the log-likelihood function with respect to the components of Φ. If Φ̂ = (α̂_1, …, α̂_m, φ̂_1, …, φ̂_m) is a maximum-likelihood estimate, then one has the likelihood equations

(3.5) V,L(I) = 0, i = 1, *,m determinedby theunconstrained parameters 4i, i = 1, - - *, m. (Our conventionis that "V" with a variable appearingas a subscriptindicates the gradientof firstpartial derivativeswith respect to thecomponents of the variable.) To obtainlikelihood equations determined by the proportions, which are constrained to be nonegativeand to sum to one, we followPeters and Walker [114]. Settinga =

a*m) T one sees that

(3.6) 0 _ L((j) (a - a-) forall a= (a1,. ,m) Tsuch that Z Il ai = 1 and ai _ 0, i= 1, ,m. Now (3.6) holdsfor all a satisfyingthe given constraints if and onlyif

O_VC(L(4)(ei - cx, i=1, m, with equality for those values of i for whicha&i > 0. (Here, ei is the vectorthe ith componentof whichis one and theother components of whichare zero.) It followsthat (3.6) is equivalentto

(3.7) 1 T, - 1, .. * *,m, withequality for those values of i forwhich ai > 0. Finally,multiplying each side of (3.7) by ai fori = 1, . . . , m yields likelihood equations in the convenientform

(3.8) ai= fE p(E ) , =1,.* **,m. Nk= P(XkJ~ We remarkthat it is easily seen by consideringthe matrixof second partial derivativesof L with respect to a1, * * , am that L is a concave functionof a = (a1, * , am)T forany fixedset ofvalues E Qi, i = 1, * , m. Thus, forany fixedX i = 1, *. * , m, (3.6) and, hence, (3.7) are sufficientas well as necessary for a' to maximizeL overthe set of all a satisfyingthe given constraints. On theother hand, the likelihoodequations (3.8) are necessarybut not sufficient conditions for a' to maximizeL for fixed Xi, i = 1, . . . , m. Indeed, a' = ei satisfies (3.8) for i = 1, . . . , m. In fact, it followsfrom the concavity of L thatthere is a solutionof (3.8) in each (closed) faceof the simplexof pointsa satisfyingthe givenconstraints. In spiteof perhapssuffering from a

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 210 RICHARD A. REDNER AND HOMER F. WALKER surplusof solutions,the likelihood equations (3.8) neverthelesshave a usefulform which takeson additionalsignificance later in thecontext of the EM algorithm. The equations(3.5) and (3.8) togetherconstitute a fullset of likelihoodequations whichare necessarybut notsufficient conditions for a maximumlikelihood estimate. Of course,some irrelevant solutions of thelikelihood equations can be avoidedin practiceby usingone of a numberof proceduresfor obtaining a numericalsolution of them(among whichis theEM algorithm)which in all butthe most unfortunate circumstances will yield a local maximizerof the log-likelihoodfunction (or a singularitynear whichit grows withoutbound) ratherthan some stationarypoint which is a local minimizeror a saddle point.Still, it is naturalto ask at this pointthe extentto whichsolving the likelihood equationscan be expectedto producea maximum-likelihoodestimate and the extentto whicha maximum-likelihoodestimate can be expectedto be a goodapproximation of IV*. Two generaltheorems are offeredbelow which give a fairsummary of theresults in theliterature most pertinent to thequestion put forth above. As a convenience,we assume that a* > 0 fori = 1, , m. For the purposesof the theoremsand the discussion followingthem, this justifieswriting, say, am = 1 - aifll a, and consideringthe redefined,locally unconstrained variable 1 = (a, * , am-1 * Om)P in the modifiedset

Q ={(?ls* ** Ol_l Xl ** *s m) Ea?i < I andali > 0, XiE Qifor i =1, * * ,m.

The likelihoodequations (3.5) and (3.8) can nowbe writtenin thegeneral unconstrained form

(3.9) V,L(4F) = 0, whichfacilitates our presenting the theorems as generalresults which are notrestricted to the mixtureproblem at hand or, forthat matter,to mixtureproblems at all. In our discussionof the theorems,all statementsregarding measure and integrationare made withrespect to the underlyingmeasure on Rn appropriatefor p(x I4b), which we denote by ,. The firsttheorem states roughlythat, under reasonable assumptions,there is a uniquestrongly consistent solution of the likelihoodequations (3.9), and thissolution at least locally maximizes the log-likelihoodfunction and is asymptoticallynormally distributed.(Consistent in the usual sense meansconverging with probability approach- ing 1 to the trueparameters as the sample size approachesinfinity; strongly consistent meanshaving the same limitwith probability 1.) This theoremis a compendiumof results generalizingthe initial work of Cramer [36] concerningexistence, consistency and asymptoticnormality of the maximum-likelihoodestimate of a singlescalar . The conditionsbelow, on whichthe theorem rests, were essentially given by Chanda [22] as multidimensionalgeneralizations of those of Cramer. With them, Chanda claimedthat thereexists a uniquesolution of the likelihoodequations which is consistentin the usual sense(this fact was correctlyproved by Tarone and Gruenhage[ 139]) and establishedits asymptoticnormal behavior. (See also the summaryin Kiefer [88], the discussionin Zacks [161], Sundberg's [134] treatmentin the case of incompletedata froman exponentialfamily, and the relatedmaterial for constrained maximum-likelihood esti- mates in Aitchisonand Silvey [2] and Hathaway [72].) Using these same conditions, Petersand Walker [1 13, AppendixA] showedthat there is a unique stronglyconsistent

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 211 solutionof the likelihoodequations, and that it at least locally maximizesthe log- likelihoodfunction. In statingthe followingconditions and in the discussionafter the theorem,it is convenientto adopt temporarilythe notation4) = (Q, . . ., (v), wherev = (m - 1 + LZi l n,) and ,jE R' fori = 1, , v. Also,we remarkthat because the resultsof the theorembelow implied by theseconditions are strictlylocal in nature,there is no loss of generalityin restrictingQ to be any neighborhoodof V* ifsuch a restrictionis necessary forthe first condition to be met. Condition1. For all 4 E Q, foralmost all x E R' and fori, j, k = 1, , v, the partialderivatives ap/al,, a2p/aS,i1J and a3p/3Plk jI98k existand satisfy

(p(x ) |32p(X_ | 4)) a3,log p(xj+4))

wherefandf, are integrableandfJ,, satisfies

ffJk(X)P((XI4)d

Condition2. The Fisherinformation matrix I(b) givenby (2.5.1) is welldefined and positivedefinite at 4)* THEOREM 3.1. If Conditions 1 and 2 are satisfiedand any sufficientlysmall neighborhoodof 4b* in Q is given,then with probability 1, thereis for sufficientlylarge N a unique solution4bN of the likelihoodequations (3.9) in that neighborhood,and this solutionlocally maximizesthe log-likelihoodfunction. Furthermore, \/ (4bN - 4*) is asymptoticallynormally distributed with mean zero and covariancematrix I(4)* ) The secondtheorem is directedtoward two questions left unresolved by thetheorem above regarding4)N the unique stronglyconsistent solution of the likelihoodequations. The firstquestion is whether4)N is reallya maximum-likelihoodestimate, i.e., a pointat which the log-likelihoodfunction attains its largest local maximum.The second is whether,even if the answerto the firstquestion is "yes," thereare maximum-likelihood estimatesother than 4DN whichlead to limitingdensities other than p(x 1 V*). Givenour assumptionof identifiabilityof thefamily of mixturedensities p(x l 4), 4) E Q, one easily sees thatthe theorem below implies that if Q' is anycompact subset of Q whichcontains V* in its interior,then with probability 1, 4)N is a maximum-likelihoodestimate in Q' for sufficientlylarge N. Furthermore,every other maximum-likelihood estimate in Q' is obtainedfrom 4)N bythe "label switching"described earlier and, hence, leads to thesame limitingdensity p(x 14)*). Accordingly,we usuallyassume in thesequel thatConditions 1 through4 are satisfiedand referto 4)N as the unique stronglyconsistent maximum- likelihoodestimate. The theoremis a slightlyrestricted version of a generalresult of Redner [ 122] which extends earlier work by Wald [ 150] on the consistencyof maximum-likelihoodestimates. It shouldbe remarkedthat the resultof [122] restson somewhatweaker assumptions than those made hereand is specificallyaimed at families ofdistributions which are notidentifiable. For 4) E Q and sufficientlysmall r > 0, let Nr(4)) denotethe closed ball of radiusr about4) in i and define p(x4), r)= sup p(xI4') VcNr(D)

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 212 RICHARD A. REDNER AND HOMER F. WALKER and p*(xI f,r) = max{1, p(x 14b,r)}. Condition3. For each 41)E Q and sufficientlysmall r > 0,

f logp*(x l, r)p(x I*) dM,< o. Condition4. f logp(x | )p(x I V) d, < oo. THEOREM 3.2. Let Q' be any compactsubset of Q whichcontains V" in its interior, and set C = {c1 c Q': p(x I41) = p(x I4*) almosteverywhere}. If Conditions3 and 4 are satisfiedand D is any closed subseto Q' not intersectingC, thenwith probability 1, N

II P(Xk | k=l lim sup N =0. N-X ED PD

k--i

From a theoreticalpoint of vikw,Theorems 3.1 and 3.2 are adequate formixture densityestimation problems in providingassurance of the existence of strongly consistent maximum-likelihoodestimates, characterizing them as solutionsof the likelihoodequa- tions and prescribingtheir asymptotic behavior. In practice,however, one must still contendwith certain potential mathematical, statistical and even numericaldifficulties associated withmaximum-likelihood estimates. Some possiblemathematical problems have been suggestedabove: The log-likelihoodfunction may have manylocal and global maximaand perhapseven singularities; furthermore, the likelihood equations are likelyto have solutionswhich are not local maximaof the log-likelihoodfunction. According to Theorem3. 1, thestatistical soundness (as measuredby bias and variance)of the strongly consistentmaximum-likelihood estimate is determined,at least forlarge samples, by the Fisherinformation matrix I(4*). As it happens,I(f)*) also plays a role in determining the numericalwell-posedness of the problemof approximatingthe stronglyconsistent maximum-likelihoodestimate for large samples. To show how I(Q*) entersinto the problemof numericallyapproximating dJ N for large samples,we recall thatthe conditionof a problemrefers to the nonuniformityof responseof its solution to perturbationsin thedata associatedwith the problem. This is to say thata problemis ill-conditionedif its solution is verysensitive to someperturbations in the data whilebeing relatively insensitive to others.Another way of lookingat it is to say that a problemis ill-conditionedif some relativelywidely differing approximate solutionsgive about the same fitto the data, whichmay be verygood. For maximum- likelihoodestimation, then, one can associate ill-conditioningwith the log-likelihood function'shaving a "narrowridge" below which the maximum-likelihood estimate lies, a phenomenonoften observed in the mixturedensity case. If a problemis ill-conditioned, thenthe accuracy to whichits solution can be computedis likelyto be limited,sometimes sharplyso. Indeed, in some contextsone can demonstratethis rigorouslythrough backwarderror analysis in whichit is shownthat the computedsolution of a problemis theexact solution of a "nearby"problem with perturbed data.

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 213

For an optimizationproblem, the condition is customarilymeasured by the condition numberof the Hessian matrixof thefunction to be optimized,evaluated at the solution. Denotingthe Hessian by H(Q) and assumingthat it is invertible,we recall that its conditionnumber is definedto be ||H(c1) . 1H(-) 1 1, where|| || is a matrixnorm of interest.(For moreon thecondition number of a matrix,see, for example, Stewart [ 131].) For thelog-likelihood function at hand,the Hessian is givenby N (3.10) H(cJ) = E VVlogp(xk)P I k=l whereV,V' = (a2/a3i)j). If Conditions1 and 2 above are satisfied,then it followsfrom thestrong law oflarge numbers (see Loeve [94]) that,with probability 1,

(3.11) lim-H(cN) N-o N

Since I/NH(4N) has the same conditionnumber as H(IN), (3.1 1) demonstratesthat the conditionnumber of I(Q*) reflectsthe limitsof accuracy which one can expect in computedapproximations of cJN forlarge N. To illustratethe potential severityof the statisticaland numericalproblems associatedwith maximum-likelihood estimates, we augmentthe materialon the Fisher informationmatrix in the literaturecited in ?2.5 with Table 3.3 below, which lists approximatevalues of the conditionnumber and the diagonalelements of the inverseof I(V*) fora mixtureof two univariate normal densities (see (1.3) and (1.4)) at a varietyof choices of VP. To preparethis table, we took (I = (a,, Aul,A2, o,, 2) and numerically evaluatedI(c*), its conditionnumber, and its inversefor selected values of c* using IMSL Libraryroutines DCADRE, EIGRS, and LINV2P on a CDC7600.2 The choicesof f* wereobtained by takinga* = .3 and U2* = 2* = l and varyingthe mean separation l- 42 . In thetable, the condition number of I(1*) is denotedby K, and thefirst through fifthdiagonal elements of I(4D*)-l are denotedby I-l'(a), I-'(Al), I-'(A2), -l() and (u),2 respectively. Table 3.3 reinforcesone's intuitiveunderstanding that, for mixture density estima- tionproblems, maximum-likelihood estimates are moreappealing, from both statistical and numericalstandpoints, if the componentdensities in the mixtureare well separated thanif they are poorlyseparated. Perhaps the most troublesome implication of Table 3.3 is that,if the componentdensities are poorlyseparated, then impractically large sample sizes mightbe requiredin orderto expecteven moderately precise maximum-likelihood estimates.For example,Table 3.3 indicatesthat if one considersdata froma mixtureof twounivariate normal densities with a* = .3, a = a =- 1 and A* - A* = 1, thena samplesize on theorder of 106 is necessaryto insurethat the of each componentof the maximum-likelihoodestimate is about 0.1 or less. Even if a sampleof such horrendoussize wereavailable, the factthat evaluating the log-likelihoodfunction and associatedfunctions such as its derivativesinvolves summation over observations in the sample, consideredtogether with the conditionnumber of 5.18 x 104 for the informationmatrix, suggests that the computingundertaken in seekinga maximum- likelihoodestimate should be carriedout with great care. Similar observationsregarding the asymptoticdependence of the accuracy of maximum-likelihoodestimates on samplesizes and separationof thecomponent popula- tionshave been made by a numberof authors(Mendenhall and Hader [101], Hill [73],

2We are gratefulto the Mathematicsand StatisticsDivision of the Lawrence LivermoreNational Laboratoryfor allowing us to use theircomputing facility in generatingTable 3.3.

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 214 RICHARD A. REDNER AND HOMER F. WALKER

TABLE 3.3 Conditionnumber and diagonal elementsof theinverse of I(Q*) for a mixtureof twounivariate normal densitieswith a* = .3,u = = 1.

jL1 - ~K F.)f1(2) I (0r2) Al - 2 I-'(at g Z) I-1(2 (af2 -1 f2)

0.2 3.06 X 1010 4.39 x 10I" 4.86 x 109 8.98 x 108 2.15 x I07 4.02 x 106

0.5 8.05 x 106 5.54 x 106 3.81 X 106 7.17 x I05 1.04 x I05 2.07 x 104

1.0 5.18 x 104 8.59 x 103 2.32 x 104 4.55 x 103 2.58 x 103 578.

1.5 4.80 X 103 237. 1.43 x 103 290. 383. 95.0

2.0 1.10X 103 20.4 216. 45.8 115. 31.3

3.0 187. .874 18.9 4.81 28.2 8.83

4.0 71.7 .267 5.72 1.95 13.4 4.71

6.0 35.7 .211 3.44 1.45 7.47 3.06

Hasselblad [70], [71], Day [37], Tan and Chang [138], Dick and Bowden[43], Hosmer [74], [75], Hosmerand Dick [78]). Severalof them (Mendenhall and Hader [101], Day [37], Hasselblad [71], Dick and Bowden [43], Hosmer [74]) also suggestedthat things are worsefor small samples(less thana fewhundred observations) than the asymptotic theoryindicates. Hosmer [74] specificallyaddressed the small-sample,poor-separation case fora mixtureof two univariatenormals and concludedthat in thiscase maximum- likelihoodestimates "should be used with extremecaution or not at all." Dick and Bowden[43], Hosmer [75] and Hosmerand Dick [78] offeredevidence which suggests thatconsiderable improvement in theperformance of maximum-likelihood estimates can resultfrom including labeled observationsin the samples by whichthe estimatesare determined,particularly when the component densities are poorlyseparated. In fact,it is pointedout in [78] thatmost of the improvement occurs for small to moderateproportions oflabeled observations in thesample. In spiteof the ratherpessimistic comments above, maximum-likelihoodestimates have faredwell in comparisonswith most other estimates for mixture density estimation problems.Day [37], Hasselblad [71], Tan and Chang [138] and Dick and Bowden[43] foundmaximum-likelihood estimates to be markedlysuperior to momentestimates in theirinvestigations, especially in cases involvingpoorly separated component populations. (See also thecomment by Hosmer[77] on thepaper of Quandtand Ramsey [118].) Day [37] also remarkedthat minimum chi-square and Bayes estimateshave less appeal than maximum-likelihoodestimates, primarily because of the difficultyof obtainingthem in most cases. James [80] and Ganesalingamand McLachlan [52] observedthat their proportionestimates are less efficientthan maximum-likelihood estimates; however, they also outlinedcircumstances in whichtheir estimates might be preferred.On the other hand,as we remarkedin ?2.3, Hosmer[77] has commentedthat the moment generating functionmethod of Quandt and Ramsey[118] providesestimates which may outperform maximum-likelihoodestimates in the small-samplecase, althoughKumar et al. [91] attributecertain difficulties to thismethod. 4. The EM algorithm.We nowderive the EM algorithmfor general mixture density estimationproblems and discuss its importantgeneral properties.As stated in the introduction,we feelthat the EM algorithmfor mixture density estimation problems is

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 215 best regardedas a specializationof the generalEM algorithmformalized by Dempster, Laird and Rubin [39] forobtaining maximum-likelihood estimates from incomplete data. Accordingly,we beginby reviewingthe formulation of the general EM algorithmgiven in [39]. Supposethat one has a measurespace Y of "completedata" and a measurablemap y x(y) ofY to a measurespace X of "incompletedata." Letf(y I4) be a memberof a parametricfamily of probabilitydensity functions defined on Y for4b E Q, and suppose thatg(x I4b) is a probabilitydensity function on X inducedbyf(y I4b). For a givenx E X, the purposeof the EM algorithmis to maximizethe incompletedata log-likelihood L(bI) = log g(x Ib) over(I E Q byexploiting the relationship betweenf(y I 4b) and g(x Idb). It is intendedespecially for applications in whichthe maximization of thecomplete data log-likelihoodlogf(y I4b) over 4b E Q is particularlyeasy. For x E X, set Y(x) = {y E Y: x(y) = x}. The conditionaldensity k(y Ix, 4b) on Y(x) is givenbyf(y Idb) = k(y I x, 4)g(x I4b). For 4b and 4Vin Q, one thenhas L(b) = Q(I IV) - H(QII V), where Q(+ I 4V) = E(logf(y Idb) I x, 4V) and H(dI IV) = E(log k(y Ix, 4b)I x, '). The general EM algorithmof Dempster,Laird and Rubin [39] is the following:Given a currentapproximation V of a maximizerof L(b~), obtaina nextapproximation d+ as follows: 1. E-step.Determine Q(Q I V). 2. M-step.Choose V E arg maxt,uQ(4 . Here,arg max4EU Q(QI Vb)denotes the set of values 4 E Q whichmaximize Q(Q I V) over Q. (Of course,this set must be nonemptyfor the M-step of the algorithmto be well defined.)If thisset is a singleton,then we denoteits sole memberin the same way and writeV+ = arg maxp,uQ(I 4C). Similarnotation is used withoutfurther explanation in thesequel. Fromthis general description, it is notclear thatthe EM algorithmeven deserves to be called an algorithm.However, as we indicatedabove, the EM algorithmis used most oftenin applicationswhich permit the easy maximizationof logf(y I4b) over 4b E Q. In such applications,the M-stepmaximization of Q(I I 4V)over 4 E Q is usuallycarried out withcorresponding ease. In fact,as one sees in thesequel, the E-step and theM-step are usuallycombined into one veryeasily implementedstep in mostapplications involving mixturedensity estimation problems. At any rate,the senseof the EM algorithmlies in the factthat L(dI) _ L(4V). Indeed,the mannerin which V+ is determinedguarantees that Q(c+ IV) _ Q( CI CV); and it follows from Jensen's inequality that H(I+ IV ) _ H(QJCI 4). (See Dempster,Laird and Rubin [39, Thm. 1].) This factimplies thatL is monotoneincreasing on anyiteration sequence generated by the EM algorithm, whichis thefundamental property of the algorithm underlying the convergence theorems givenbelow. To discussthe EM algorithmfor mixture density estimation problems, we assumeas in thepreceding section that a parametricfamily of mixturedensities of theform (1. 1) is specifiedand that a particular 4* = (a", . . . ,a,t, /P, . . .*,*) is the"true" parame- tervalue to be estimated.In the usual way,we regardthis family of densitiesas being associatedwith a statisticalpopulation which is a mixtureof m componentpopulations. 
The EM algorithmfor a mixturedensity estimation problem associated with this family is derivedby firstinterpreting the problemas one involvingincomplete data and then obtainingthe algorithmfrom its general formulationgiven above. The problemis interpretedas one involvingincomplete data by regardingeach unlabeledobservation in thesample at handas "missing"a label indicatingits component population of origin.

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 216 RICHARD A. REDNER AND HOMER F. WALKER

It is instructiveto considerthe forms which the EM algorithmmight take for mixture densityestimation problems involving samples of the typesintroduced in the preceding section.We firstillustrate in some detail the derivationof the functionQ() IV') of the E-stepof the algorithm,assuming for convenience that the sample at hand is a sample SI= {X}kk=l...N of Type 1 describedin the precedingsection. One can regardSI as a sample of incompletedata by consideringeach Xk to be the "known" part of an observationYk = (Xk, ik), whereik is an integerbetween 1 and m indicatingthe component populationof origin. For 4) = (a,, - ,a,a 1, * * ' kOm) E Q, the sample variables X = (X1, * * , XN) and y = (YI, , YN) have associatedprobability density functions g(X 14))= uk=I P(Xk 14))and f(y 1 )) = lk=l aik Pik(Xk I (Pik), respectively.Then for4 = (as,. . *am,. ,'** of,,) E Q, theconditional density k(y Ix, 4)') is givenby

11a1kPik(Xk I (?, k(y Ix, 4)) = 1 px ik4' k=l P (Xk I (D ) and thefunction Q(4 I 4'), whichwe denoteby Qi(4) I 4'), is determinedto be

m m N log=aEk* *(*kEIE log aikPik(Xk I (k?, Q1()((I )'(), iki,PikN=XkkI ik -i Px I4' I'? m N at~x I (4.1) = E logaipi (Xk I i) p(xk ') ~~~~~m N (kI4N ap i E[k=1p [Eaa~p~XkIq)IJ (kI (Xk)] (Xk)1log ai i +I EEE 0Llog p(Xk I Oi)j) a p (Xk | X ) i= Ii Lk= P(Xk14)' iJ = q P, p(XkI4

For samples S2= Uim= {Yik}k=1,--.J,S3 = UI= {Zik}kl,---,K, and S4= Ui=o {Wik}k=1,-,M, ofTypes 2, 3 and 4, one determinesin a similarmanner the respective functions Q2() 14)'), Q3(P 1 ') and Q4(P 14') forthe E-step of the EM algorithmto be m J, (4.2) Q2(4) 1 ') = L L logPi(Yik I?i), i=l k=l m m K, (4.3) Q3(4)1 ') = E K, logai + EL logPi(Zik I')i), i=l ~~~i=lk=I

Q4(4)I4') = E + E ( )] logai P k (4.4) r k=, 0 I + L E 0g Pi(WikI1i) + E logp (wOkI 4i) p(w l)] for4)= (a,, * * * ,am,,,,*1, *I'm)O and ' = (at, . . . ,am,,4Vj . . . ?m) inQ.Wenote thatQ2(4) 1 ') and Q3(4 1 ') are just L2()) and (exceptfor an additiveconstant) L3(4) givenby (3.2) and (3.3), respectively;and one mightwell wonder why they are ofinterest in this context.By way of explanation,we observethat if a sample of interestis a stochasticallyindependent union of smallersamples, then the functionfor the E-stepof the EM algorithmwhich is appropriatefor this sample is just the sum of the functions whichare appropriatefor the smaller samples. Thus, for example, if S = SI U S2 U S3 is a union of independentsamples of Types 1, 2, and 3, then the functionfor the E-step appropriatefor S is Q(4) 1 ') = Q&(D1 ') + Q2(4) 1b') + Q3(D 1 '), whereQ&(D 1 '), Q2(4)1 ') andQ3(4) 1 4') aregiven by (4.1), (4.2) and(4.3), respectively. Having determinedan appropriatefunction Q(4) 1)') for the E-step of the EM algorithmas one or a sum of the functionsQi(4) 14') definedabove, one is likelyto find

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 217 thatthe maximizationproblem of the M-stephas a numberof attractivefeatures. It is clear from(4.1), (4.2), (4.3) and (4.4) thatthis maximization problem separates into two maximizationproblems, the firstinvolving the proportionsa,, * * *, am alone and the secondinvolving only the remaining parameters (PI, * - , m Since log a,, * * *, log am appear linearlyin each functionQi (L I 4V) fori %-2, thefirst maximization problem has a unique solutionif the sample is not strictlyof Type 2, and this solutionis easily and explicitlydetermined regardless of the functionalforms of the componentdensities pi(x l k1).If Xl, - * * , Omare mutuallyindependent variables, then the second maximiza- tionproblem separates further into m componentproblems, each of whichinvolves only one ofthe parameters Xi. Boththese component problems and themaximization problem for the proportionsalone have the appealing propertythat theycan be regardedas "weighted" maximum-likelihoodestimation problems involving sums of logarithms weightedby posteriorprobabilities that sample observationsbelong to appropriate componentpopulations, given the currentapproximate maximum-likelihood estimate of

To illustratethese remarks,we considera sample SI = {Xk}k=l,... ,N of Type 1 and assume that 1, *, m are mutuallyindependent variables. If 4V = (a, , aO , C * * , (p4) is a currentapproximate maximizer of the log-likelihoodfunction LIQ() givenby (3.1), thenone easilyverifies that the nextapproximate maximizer V = (a1, * * *, a 1, . . . ,+) prescribedby the M-step of the EM algorithmsatisfies 1 N ac4pi(Xk I i) (4.5) ai = - p( |z,c N1 ap(xkIk4c)ic

(4.6) &, Earg maxjE logpi(XkI4i) p(xXkl) for i = 1, , m. Note that each weight aicpi(Xk I Iic)IP(Xk I 4C) is the posterior probabilitythat Xk originatedin the ith componentpopulation, given the current approximatemaximum-likelihood estimate Vc. In additionto prescribingeach a+ and 4j+ as thesolution of a heuristicallyappealing weightedmaximum-likelihood estimation problem, there are otherattractions to (4.5) and (4.6). For example,(4.5) insuresthat the nextapproximate proportions a+ inherit fromthe currentapproximate proportions ac the propertyof being nonnegativeand summingto 1. Furthermore,although there is no guaranteethat the maximization problems(4.6) will have nice propertiesin general,it happensthat each 4+ is usually easily(even uniquely and explicitly)determined by (4.6) in mostapplications of interest, especiallyin thoseapplications in whicheach componentdensity pi(x IOi) is one of the commonparametric densities for which ordinary (labeled-sample) maximum-likelihood estimatesof qi are uniquelyand explicitlydetermined. As an illustration,consider the case in whichsome pi(x I i) is a multivariatenormal density, i.e., pi(x I4i) and qi are givenby

(4.7) P(XI = (2r)2(det )1/21/2(x-kiu)T2.I|(X_A,) e- ,= (Ai, i) where,u E R' and Xi is a positive-definitesymmetric n x n matrix.For a given /c = (jA,1c), theunique solution 4o = (gA, ZI) of (4.6) is givenby

(4.8) Al NIXk (XkI4f7)l/ cpi(XkII?)

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 218 RICHARD A. REDNER AND HOMER F. WALKER

(4.9) = - - +)T IaPi(Xk ik)l lJ cpi(Xk ic)II tk=1 /~kP(Xk I c) J/1 P(XkI4c)J (The factorsac havebeen left in thenumerators and denominatorsof these expressions for aestheticreasons only.) Note thatXi is positive-definitesymmetric with probability 1 if N> n. What convergenceproperties hold fora sequenceof iteratesgenerated by applying the EM algorithmto a mixturedensity estimation problem? If nothingin particularis said about the parametricfamily of interest,then the propertieswhich can be specified are essentiallythose obtained by specializingthe convergence results associated with the EM algorithmfor general incomplete data problems.The convergenceresults below are formulatedso thatthey are validfor the EM algorithmin general.Points relating to these resultswhich are ofparticular interest in themixture density context are made in remarks followingthe theorems. The firsttheorem is a global convergenceresult for sequences generated by the EM algorithm.It essentiallysummarizes the resultsof Wu [155] for the general EM algorithmand of Redner[ 121] forthe more specialized case ofthe EM algorithmapplied to a mixtureof densitiesfrom exponential families. Similar but morerestrictive results have been formulatedfor the generalEM algorithmby Boyles [18] and fora special applicationof the EM algorithmby Vardi [149]. Statements(i), (ii) and (iii) of the theoremare valid forany sequenceand are statedhere as a conveniencebecause of their usefulnessin applications.Statements (iv), (v) and (vi) are based on the fact reviewed earlierand reiteratedin the statementof the theoremthat the log-likelihoodfunction increasesmonotonically on a sequencegenerated by the EM algorithm.Through the use of thisfact, the theoremcan be relatedto generalresults in optimizationtheory such as the convergencetheorems of Zangwill [162, pp. 91, 128, 232] concerningpoint-to-set maps which increase an objectivefunction. In fact, such general resultswere used explicitlyby Wu [155] and Vardi [149]. THEOREM 4.1. Suppose thatfor some &?) E Q, 1(D"1j=0,1,2,... is a sequence in Q generatedby the EM algorithm,i.e., a sequencein Q satisfying

Vj(i?) E arg max Q(4 | 4(])), j = 0, 1, 2, 4)Q whereQ (4 I4') is thefunction determined in theE-step of theEM algorithm.Then the log-likelihoodfunction L(b) increasesmonotonically on 0" lj=0,1,2,- to a (possibly infinite)limit L*. Furthermore,denoting the set of limitpoints of {4!'" }=_ 1,2,... in Q byL, one has thefollowing: (i) ? is a closedset inQ. (ii) If{4(i)}11j=12,. is containedin a compactsubset of Q, then? is compact. (iii) If{4(j)}=0,1,2,. is containedin a compactsubset of Q and limj_1"1 i+1) - (DO) = Ofora norm|| * || on Q, then? is connectedas wellas compact. (iv) If L(b) is continuousin Q and ? # 0, thenL* is finiteand L(bi) = L* for

(v) If Q(QLI 4) and H(I I 4') = Q(L I4V) - L(Q) are continuousin 4 and 4V in Q, theneachb E -C satisfies4 E argmax?,E QQ( I V'). (vi) If Q(QLI 4) and H(QI I 4) are continuousin 4 and V in Q and differentiablein 4 at 4 = V = E& ?, thenL(b) is differentiableat 4 = 4 and thelikelihood equationsV4L(4) = 0 are satisfiedby 4 = i. Proof.The monotonicityof L(b) on 0(1Ij=, l 2,... has alreadybeen established;the

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 219 existenceof a (possiblyinfinite) limit L* follows.Statement (i) holdssince closedness is a generalproperty of sets of limit points. To obtain(ii), notethat if {J(J)}_0 1. 2 is contained in a compactsubset of Q, thenL is a closed subsetof thiscompact subset and, hence, is compact.To prove (iii), suppose that {4(i)}.j012... is containedin a compact sub- set of Q, thatlimj-_ ki''+1)- -'J)II = 0, and that? is notconnected. Since ? is com- pact, thereis a minimaldistance between distinct components of L, and the factthat limJ1_IVjF+?) - -AII= 0 impliesthat there is an infinitesubsequence of 1('j} j=0,12, ... whosemembers are boundedaway fromL. This subsequencelies in a compactset, and so it has limitpoints. Since theselimit points cannot be in L, one has a contradiction. Statement(iv) followsimmediately from the monotonicity of L(b) on JVJ)}j=01 2, To prove(v), supposethat Q(4? 1 ') and H(4 1V') are continuousin 4b and 4' in Q and that one can findsome 4i E L and 4bE Q forwhich Q(4P I 4i)> Q(4i I4i). Thenfor every j, + ( L((P(J 1)) = QQJ?(i+I)Hjb'(+?I) IVj)) c(?))) _- Ici?li)) QQ I JJH(4(- IVi) by the M-step determinationof Vj+') and Jensen'sinequality. Since Q(P IV') and H(P I V') are continuous,it followsby takinglimits along a subsequenceconverging to 4 that L* _ Q(P I )- H(i 1i) > Q(i I )- H(i I i) = L(b) = L* which is a contradiction.To establish(vi), suppose that Q(P I4') and H(P IV') are continuousin 4Dand 4' in Q and differentiablein 4Dat 4D= V = Ei C. Then L(b) = Q(P I4i) - H(4PI 4i) is differentiableat 4D= i, and, since4i E arg maxIE Q(4bI 4i) by (v) and 4i C arg max4,EQH(P I4) byJensen's inequality, one has VL4L(I) = 0. This completes theproof. Statement(iii) ofTheorem 4.1 has precedentin suchresults as Ostrowski[ 108,Thm. 28.1]. It is usuallysatisifed in practice,especially in themixture density context. Indeed, itoften happens that each V is uniquelydetermined by the EM algorithmas a functionof 4C whichis continuousin Q. For example,one sees thata+, * * *, a1eare determinedin thisway by (4.5) whenevereach pi (x l Xi) dependscontinuously on 4i. In addition,each Xi is likelyto be determinedin this way by (4.6) whenevereach pi(x k,) is one of the commonparametric densities for which ordinarymaximum-likelihood estimates are determinedas a continuousfunction of X,; see, forexample, (4.8) and (4.9). If V is determinedin thisway from4C and if the conditionsof (v) are also satisfied,then each

C?Lz is a fixedpoint of a continuousfunction. It followsthat if in addition{(i)}_j=01 2,... iS containedin a compactsubset of Q, thenthe elementsof a "tail" sequence1b(j)1J=J,J+I... can all be made to lie arbitrarilyclose to thecompact set ? bytaking J sufficientlylarge and, hence,limj II||(J+j ) - -'0J)= by theuniform continuity near ? of the function determiningV(J+ ) fromVA. It is usefulto expanda littleon the interpretationof statement(vi) in the mixture densitycontext. Assuming that each pi (x IX, ) is differentiablewith respect to hi,that the parametersX, are unconstrainedin Qi and mutuallyindependent, and, forconvenience, that the sample of interestis of Type 1, one can reasonablyinterpret the likelihood equationsV4L(Q) = 0 in thesense of (3.9) at a point1 = (&?', * .,* &,e , . . .,* q,m) e ? whichis suchthat each &,is positive.Now it is certainlypossible for some &ai to be zero for1 E L, in whichcase (3.9) mightnot be valid. Fortunately,(3.5) and (3.8) providea betterinterpretation than (3.9) ofthe likelihood equations in themixture density context, whichis validwhether each &j is positiveor not.Indeed, if the conditions of (v) hold,then it followsfrom (4.5) that the equations(3.8) are satisfiedon ?. Thus in the mixture densitycontext under the present assumptions, (vi) shouldbe replacedwith the following: (vi)' If Q(4PI V) and H(4PI V) are continuousin 4Dand V' in Q and differentiablein

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 220 RICHARD A. REDNER AND HOMER F. WALKER

01, * * * , Omat + = V = ( E LI, thenL(b) is differentiablein 01, * ,'m at = ?P and thelikelihood equations (3.5) and (3.8) are satisfiedby 4?. To illustratethe applicationof Theorem4. 1, we considerthe problemof estimating theproportions in a mixtureunder the assumption that each componentdensity pi (x I0j) is known(and denotedfor the presentpurposes by pi (x) forsimplicity). The theorem below is a global convergenceresult for the EM algorithmapplied to thisproblem. For conveniencein presentingthe theorem, it is assumedthat the sample at hand is a sample SI = {Xk}k=,. . .,N ofType 1. Similarresults hold for other cases in whichthe sample at hand is oneor a unionof the types considered in thepreceding section. For thisproblem, one has simplyb = (a1, * * , am); and it is, of course,always understoodthat Im I ai = 1 and ai 0, i = 1, * * *, m,for all such4?. We remarkthat the condition of the theorem on the matrixof secondderivatives of L1(4P) is quite reasonable.This matrixis alwaysdefined and negative semidefinitewhenever p(Xk I4) t 0 for k = 1, * * * , N; and if p,(x), * * *, pm(x)are linearlyindependent nonvanishing functions on thesupport of the underlyingmeasure on Rnappropriate for p, thenwith probability I it is definedand negativedefinite for all 4bwhenever N is sufficientlylarge. THEOREM 4.2. Suppose thatthe matrix of second derivatives of L1 (4p) is definedand negativedefinite for all 4?.Then there is a uniquemaximum-likelihood estimate, andfor any ?(O) = (aoe), * *, a()) withaoe) > Ofor i = 1, * * *, m, the sequence -(j) = (a(i) a*(j))1=.012,... generatedby the EM algorithm,i.e., determined inductively by

ae(j+ I -) 1aE ' Pi(Xk) N kIP(Xk I 4?(j) =1 .,m convergesto themaximum-likelihood estimate. Proof.It followsfrom Theorem 4.1 and thesubsequent remarks that the set of limit pointsof {4(J)l0l1,2,... is a compact,connected subset of the simplex of proportionvectors 4b on which the likelihoodequations (3.8) are satisfied.Since the matrixof second derivativesof L1 (4?) is negativedefinite, L1 (4?) is strictlyconcave. It followsthat there is a unique maximum-likelihoodestimate and, furthermore,that the likelihoodequations (3.8) have at mostone solutionon the interiorof each face of the proportionsimplex. Consequently,each componentof theset of solutionsof the likelihoodequations consists of a singlepoint, and {14(j) }j=01,2,... mustconverge to one such point.But if {4 j) }_=0,1,2... is convergent,then its limit must be the maximum-likelihoodestimate by Peters and Coberly[ 112, Thm. 2] or Petersand Walker[ 114, Thm. 1]. Despitethe usefulness of Theorem4.1 in characterizingthe set of limitpoints of an iterationsequence generated by theEM algorithm,it leavesunanswered the questions of whethersuch a sequence convergesat all and, if it does, whetherit convergesto a maximum-likelihoodestimate. In an attemptto providereasonable sufficient conditions under which the answer to these questionsis "yes", we offerthe local convergence theorembelow. THEOREM 4.3. Suppose thatConditions 1 through4 of ?3 are satisfiedin Q, and let Q' be a compactsubset of Q whichcontains V* in its interiorand whichis such that p(x Il) - p(x I 4*) almost everywherein x for 4? E Q' only if 4?= V. Suppose further that withprobability 1, thefunction Q(4? 1 ') of the E-step of the EM algorithmis continuousin 4?and 4' in Q' and bothQ(4? 1 4') and thelog-likelihood function L(b) are differentiablein 4?for 4?and 4' in Q' wheneverN is sufficientlylarge. Finally,for ?(0) in Q' denote by {1(j)}1j0,1,2,--- a sequence generatedby the EM algorithmin Q', i.e., a sequencein Q' satisfying 4(j+1) E arg max Q(4?I4V'j'), j = 0, 1, 2, .. 4cEQ'

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 221

Thenwith probability 1, wheneverN is sufficientlylarge, the unique strongly consistent maximum-likelihoodestimate 1N is welldefined in Q' and 1N = lim1_c(J) whenever j(O) is sufficientlynear tN. Proof.It followsfrom Theorems 3.1 and 3.2 thatwith probability 1, N can be taken sufficientlylarge that the uniquestrongly consistent maximum-likelihood estimate 1N iS welldefined, lies in theinterior of Q' and is theunique maximizer of L(b) in Q'. Also with probability1, we can assumethat N is sufficientlylarge that Q (1 IV) is continuousin 1 and V in Q', and Q(4t I V) and L(b) are differentiablein4t for 4t and t' in Q'. Since L(b) is continuous,one can finda neighborhoodQ" of tN ofthe form

Q = 1? C Q': L(b) _ L(1N) - ?} forsome ? > 0 whichlies in theinterior of Q' and whichis suchthat 4pN iS theonly solution ofthe likelihood equations contained in it. If j(0) lies in Q",then 1Pj)} j0,1,2 . .must also lie in Q" sinceL (1) is monotoneincreasing on t' j) }j=0 ln2,....It followsthat each limitpoint of PV('}J=O l,I,2,... lies in Q" and, by statement(vi) of Theorem4. 1, also satisfiesthe likelihood equations.Since 4pN iS the onlysolution of the likelihoodequations in Q", one concludes that N = lim (j). As in the case of Theorem4.1, Theorem4.3 is statedso thatit is valid forthe EM algorithmin general.It shouldbe noted,however, that Theorem 4.3 makesheavy use of Theorems3.1 and 3.2 as well as Theorem4.1, and so formixture density estimation problems,it pertainsas it stands,strictly speaking, to thecase to whichTheorems 3.1 and 3.2 apply,namely that in whichthe sample at handis ofType 1 and L(b) = LI (4) is given by (3.1) and Q(4bIV) = Q (4 IV), by (4.1). Of course,Theorems 3.1 and 3.2 and, therefore,Theorem 4.3 can be modifiedto treatmixture density estimation problems involvingsamples of other types.

5. The EM algorithmfor mixtures of densities from exponential families. Almost all mixturedensity estimation problems which have been studiedin the literatureinvolve mixturedensities whose component densities are membersof exponentialfamilies. As it happens,the EM algorithmis especiallyeasy to implementon problemsinvolving densitiesof thistype. Indeed, in an applicationof the EM algorithmto such a problem, each successiveapproximate maximum-liklihood estimate 'V is uniquelyand explicitly determinedfrom its predecessor V, almostalways in a continuousmanner. Furthermore, a sequenceof iteratesproduced by theEM algorithmon sucha problemis likelyto have relativelynice convergence properties. In thissection, we firstdetermine the special form which the EM algorithmtakes for mixturesof densities from exponential families. We thenlook into the desirable properties of the algorithmand sequences generatedby it which are apparentfrom this form. Finally,we discussseveral specific examples of the EM algorithmfor component densities fromexponential families which are commonlyof interest. A very briefdiscussion of exponentialfamilies of densitiesis in order. For an elaborationon the topics touched on here, the reader is referredto the book of Barndorff-Nielsen[7]. A parametricfamily of densities q(x I0)9 0 c Q C Rk, on Rnis said to be an exponentialfamilyif its members have the form

(5.1) q(x10) = a(0)-b(x)eOTt(x) x ER n whereb: RN R1,t: R Rk and a(0) is given by

a(0) = f b(x) eoTt(x)d, foran appropriateunderlying measure ,u onRn Rn. It is,of course, assumed that b(x) ~_O for

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 222 RICHARD A. REDNER AND HOMER F. WALKER all x E R' and thata(0) < oofor E Q.Q Note thatevery member of an exponentialfamily has thesame supportin R', namelythat of the function b(x). The representation(5.1) of the membersof an exponentialfamily, in whichthe parameter0 appears linearlyin the argumentof the exponentialfunction, is called the "natural"parametrization; and 0 is called the "natural" parameter.If the set Q is open and convexand if the componentfunctions of t(x) togetherwith the functionwhich is identically1 on Rnare linearlyindependent functions on the intersectionof thesupports of b(x) and ,u,then there is anotherparametrization of themembers of the family, called the "expectation"or "mean value" parametrization,in termsof the "expectation" parameter

4)=E(t(X) I0) = t(x)q(x16)dw

Indeed,under these conditions on Q and t(x), one can showthat

[E(t(X) 16') - E(t(X) IO)]T(' - 0) > 0 whenever6' 7 0, and it followsthat the assignment 0 4 = E(t(X) 10) is one-to-oneand ontofrom Q to an open set Q c Rk. In fact,the correspondence0 4 = l E(t(X) 6) is a both-wayscontinuously differentiable mapping between Q and Q. (See Barndorff-Nielsen [7, p. 121].) So underthese conditions on Q and t(x), one can representthe membersof thefamily as

(5.2) p(x l4) = q(x I0(4)) = a(4))-b(x)eO(0)Tt(x) x E R, for 4 E Q, where 6(4)) satisfies4 = E(t(X) I6(4))) and a(6(,O)) is writtenas a(0) for convenience.Note that p(x Iq5) is continuouslydifferentiable in 4, since q(x I0) is continuouslydifferentiable in 0 and 6(4)) is continuouslydifferentiable in 4. Now supposethat a parametricfamily of mixture densities of the form ( 1.1) is given, with4* = (a*, . . . a, 4)*, .*. . 0*, ) the "true" parametervalue to be estimated;and supposethat each componentdensity pi(x 1,i) is a memberof an exponentialfamily. Specifically,we assume that each pi(x I 4i) has the "expectation"parametrization for &icEi C Rn givenby

pi(x I 4i) = ai(04)-)1bi(x)e',(k)TtI(x) x E Rn, forappropriate ai, bi,ti and 6i.We furtherassume that it is validto reparametrizepi(x I4i) in termsof the "natural" parameter 6i = 6i(,Oi)in themanner of the above discussion. To investigatethe special formand propertiesof the EM algorithmfor the given familyof mixturedensities, we assume that 4), - * *, Om are mutuallyindependent variablesand consider for convenience a sample S1 = {Xk}k=l .. .,N ofType 1. (A discussion similarto the followingis valid mutatismutandis for samples of othertypes.) If V = c (a4, * * * a, (P4, * * ),) is a currentapproximate maximizer of the log-likelihood functionL, (f) givenby (3.1), thenthe next approximate maximizer b+ = (a' , a(4, 4+,. . . *, +), prescribedby the M-step of the EM algorithmsatisfies (4.5) and (4.6). For i = 1, * * *,m, what4)i satisfy(4.6)? If one replaceseachpi(xk I i) in thesum in (4.6) by its expressionin the "natural"parameter 6i, differentiates with respect to 6i, equates the sum of derivativesto zero, and finallyrestores the "expectation"parametrization, then one sees thatthe unique o+ whichsatisfies (4.6) is givenexplicitly by

{cow Jr , X~~~~aic pi(Xk I ic) |J aipi(Xk I Oic)

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 223

(As in thecase of (4.8) and (4.9), thefactors a' are leftin thenumerator and denominator foraesthetic reasons only.) Not onlyare (4.5) and (5.3) easilyevaluated and heuristicallyappealing formulas fordetermining V fromVC, they also providethe key to a globalconvergence analysis of iterationsequences generated by the EM algorithmin thecase at handwhich goes beyond Theorem4.1. Theorem5.1 below summarizessuch an analysis.In orderto make the theoremcomplete and self-contained,some of the general conclusions of Theorem 4.1 are repeatedin itsstatement. THEOREM 5.1. Supposethat 1(i) = (a() . . . , a)s i, . . .*, /)}02 is a sequence in Q generatedby the EM iteration(4.5) and (5.3). Then LI(4)) increases monotonicallyon {J(j)}._0,1,2,- .. to a (possiblyinfinite) limit L *. Furthermore,for each i, 101(j)1j=l2,... is containedin theconvex hull of {ti(xk)}k=1,... N. Consequently,the set L of all limitpoints of {(V) }j=0,1,2,... is compact,and thelikelihood equations (3.5) and (3.8) are satisfiedon L = L n Q. If L #0, then L* is finiteand each_4 E L satisfies L(4b) = L* and is a fixed point of the EM iteration.Finally, if L = L c Q, thenL is connectedas wellas compact. Remark. If the convex hull of {ti(Xk)}k=1,. .,N is containedin Qi for each i, then 0 #L = L c Q and all of theconditional conclusions of Theorem5.1 hold.The convex hullof {ti(Xk)}k=1,.. .,N is indeedcontained in Qi for each i in many(but not all) applications. (See theexamples at theend of this section.) Proof.One sees from(5.3) thatfor each i, (+ is alwaysa convexcombination of the values {ti(xk)}k=1,...,N, and it followsthat f0i(y%.0= 12 ... is containedin the convexhull of {ti(Xk)kL 1,.. .,N foreach i. Since theseconvex hulls are compactsets, one concludesthat L is compact. Now each densitypi(x IXi) is continuouslydifferentiable in X,on Qi,and so it is clear from (3.1) and (4.1) that L1(4i) and QQI(I4)1')are continuousin 4i and 4Y and differentablein (k, * * *, Om in Q. Furthermore,it is apparentfrom (4.5) and (5.3) that V dependscontinuously on 4V,and one sees fromthe discussionfollowing Theorem 4.1 that limj 11|0+1) - +)|| = 0 if L = L c Q. In lightof these points,one verifiesthe remainingconclusions of Theorem5.1 via a straightforwardapplication of Theorem4.1 (includingstatement (vi)'), and theproof is complete. One can also exploit(4.5) and (5.3) to obtaina local convergenceresult which goes beyondTheorem 4.3 for mixturedensity estimation problems of the typenow under consideration.Theorem 5.2 and its proof below providenot only a strongerlocal convergencestatement than Theorem 4.3 fora sequenceof iteratesproduced by the EM algorithm,but also a meansof both quantifying the speed of convergence of the sequence and gaininginsight into propertiesof the mixturedensity which affectthe speed of convergence.This theoremis essentiallythe generalizationof Redner [120] of the local convergenceresults of Peters and Walker [113] for mixturesof multivariatenormal densities,and its proofclosely parallels the proofsof thoseresults. In thegeneral case of incompletedata fromexponential families, Sundberg [134], [135] has obtainedconsis- tencyresults for maximum-likelihood estimates and local convergenceresults for the EM algorithm.His resultsare similarto thosebelow in themixture density case, althoughthe readerof [135] is referredto [133] forproofs of the local convergenceresults. 
A proofof Theorem5.2 is includedhere since, to the best of our knowledge,no proofof a local convergenceresult of this type has previouslyappeared in the generallyavailable literature. THEOREM 5.2. Suppose thatthe Fisher information matrix I(4b) givenby (2.5.1) is positivedefinite at J* and that4* = (a* , a*, 0**,* , (k ) is such thata0" > 0 for i = 1, , m. For f(o) in Q, denoteby {1(i)} j=0 12,... thesequence in Q generatedby

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 224 RICHARD A. REDNER AND HOMER F. WALKER the EM iteration(4.5) and (5.3). Then withprobability 1, wheneverN is sufficiently . large,the uniquestrongly consistent solution 4)N = (a N . . . a N, oN , 4$N)of the likelihood equations is well definedand thereis a certainnorm 1 11on Q in which

1(j) -L0,1,2,--..converges linearlyto I)N whenevercf(O) is sufficientlynear bN, i.e., thereis a constantX, 0 _ X < 1,forwhich

(5.4) ffc(I+') -NJJ < A | - BNJ| j = 0, 1,2, wheneverb(O) is sufficientlynear bN. Proof.One sees notonly that Condition 2 of ?3 is satisfied,but also, byrestricting Q to be a smallneighborhood of VI ifnecessary, that Condition 1 holdsas wellfor the family of mixturedensities under consideration.It followsfrom Theorem 3.1 that, with probability1, 1ANiS welldefined whenever N is sufficientlylarge and convergesto CP as N approachesinfinity. It mustbe shownthat, with probability 1, wheneverN is sufficiently large,there is a norm|| * || on Q and a constantX, 0 < X < 1,such that (5.4) holdswhenever b(O)is sufficientlynear cIN. Towardthis end, we observethat the EM iterationof interest is actuallya functionaliteration V = G(QV), whereG(Q) is the functiondefined in the obviousway by (4.5) and (5.3). Note that G(4i) is continuouslydifferentiable in Q and that any 4i which satisfiedthe likelihoodequations (3.5) and (3.8) (and 4i = 4 N in particular)is afixedpoint of G(d1), i.e.,c1 = G(d1). Consequentlyone can write + (5.5) _ 4)N = G(4fC) - G(QN) = G'(QIN)(4)c _ sN) + O(14)c _ sNI12) forany Vcin Q near 4)N and any norm 1111 on Q, whereG' (QN) denotesthe Frechet derivativeof G(d1) evaluatedat 1N. (For questionsconcerning Frechet derivatives, see, for example,Luenberger [95].) We willcomplete the proof by showing that with probability 1, G'(QN) convergesas N approachesinfinity to an operatorwhich has operatornorm less than 1 withrespect to a certainnorm on Q. For convenience,we introducethe following notation for i = 1, , m:

Oi (X) p(XPi I O>N) /p(X I ,bN),

ON _ 07i =J ti (X) ti]t(X) ON ] lpi (X I XiN) d,u,

_yi(X) = oI [ti(X) - ON]. Regardingan element4i C: Q as an (m + EnI ni)-vectorin thenatural way, one can show via a verytedious calculation that G'(IN ) has the (m + ENitni) x (m + ,n I ni) matrix representation

N ~~N G (dI) = diag 1, * ,I - Z (xk)o1 'Y(Xk)l'Y(Xk)T,(X * Nk~~~

fElm(Xk)Um-Ym(Xk) Ym(Xk)T)

- diag(a I .I a., avl 0l, a*I m lm) {+Z V(xk) V(xk)T} where

V(x) = (3I(x), I3m(x),, aIf1 (x)y I(x) a,mf3m(X)I Ym(x) )

(Since cIN convergesto f* withprobability 1, we can assumethat, with probability 1, each aiN is nonzerowhenever N is sufficientlylarge.) It followsfrom the stronglaw of large

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions MIXTURE DENSITIES, MAXIMUM LIKELIHOOD, EM ALGORITHM 225 numbers (see Loeve [94]) that, with probability 1, G'(V)) converges to E[G'(Q*)] = I -QR, where

Q=diag(a *, . . . , a., a,1 1l * f?{a a m and

R = f V(x) V(x)Tp(xI V) dg.

It is understoodthat in theseexpressions defining Q and R, 4)N and its componentshave beenreplaced by 4* and itscomponents. It remainsto be shownthat there is a norm|| * on Q withrespect to which E[G'(Q*)] has operatornorm less than 1. Now Q and R are positive-definitesymmetric operators withrespect to the Euclidean innerproduct, and so QR is a positive-definitesymmetric operatorwith respect to theinner product ( *, *) definedby ( U, W) = UTQ-1 W for(m + Nit,ni)-vectors U and W. Consequently,to provethe theorem, it sufficesto showthat the operatornorm of QR withrespect to thenorm defined by (*, *) is less thanor equal to 1. Since QR is positive-definitesymmetric with respect to (, *), we need onlyshow that (U, QRU) - (U, U) foran arbitrary(m + 2nI n,)-vectorU = (61, * m 19 * **, ,mT)T.One has (U, QRU) = UTRU

in ~~~~~~~~~~~in =Rf L b1f3(x)+ Ej[jj()(] p|* dg

Rf {Z [b~61 + {ibyi(x)],aNi(x)} p(xI4|*) dg

m -'< fZ[ba0` 6i?i ++p ?'iT_Y, (X) ] 2 0i i( xIO'>*)d dg. The inequalityis a consequenceof the followingcorollary of Schwarz's inequality:If < ii >-0 fori = 1, * * *, m and if m i = 1, then{Ni 1m}2 Ni fI2qi forall { i = l ...m. Since

f zy(x)pi(xVk")y d, = 0, one continuesto obtain m (U, QRU) [62a* + pi(x ) d = (U, U).

This completesthe proof. It is instructiveto explorethe consequences of Theorem 5.2 and thedevelopments in itsproof. One sees fromthe proof of Theorem 5.2 that,with probability 1, forsufficiently largeN and cf(O)sufficiently near bN, an inequality(5.4) holdsin which11 11is the norm determinedby the inner product ( *, *) definedin theproof and Xis arbitrarilyclose to the operatornorm of E [G'Q(*)] - I - QR determinedby || * 11Since QR is positive-definite symmetricwith respectto (, * ), this operatornorm is just p(I - QR), the spectral radiusor largestabsolute value ofan eigenvalueof I - QR. Thus,with probability 1, one can obtaina quantitativeestimate of the speed of convergenceto 1N forlarge N of a sequence generatedby the EM iteration(4.5) and (5.3) by takingX , p(I - QR) in (5.4).

This content downloaded from 128.62.52.251 on Thu, 29 Jan 2015 12:50:31 PM All use subject to JSTOR Terms and Conditions 226 RICHARD A. REDNER AND HOMER F. WALKER

What propertiesof the mixturedensity influence the speed of convergenceto 4)N of an EM iterationsequence for large N? Carefulinspection shows that if the component populationsin themixture are "wellseparated" in thesense that

p1(x108 ) PitxI<) t O forx E R', wheneveri j, p(x1d1) p(xIcI) then QR I. It followsthat p(I - QR) 0, and an EM iterationsequence which convergesto 4)N exhibitsrapid linearconvergence. On the otherhand, if the component populationsin the mixtureare "poorlyseparated" in the sense that,say, the ith and jth componentpopulations are suchthat

pi(x | S8 P,( < forx E R', p(xI) p(xI) thenR is nearlysingular. One concludesthat p(I - QR) 1 in thiscase and thatslow linearconvergence of an EM iterationsequence to cIN can be expected. In the interestof obtainingiteration sequences which converge more rapidly than EM iterationsequences, Peters and Walker [113], [114] and Redner [120] considered iterativemethods which proceed at each iterationin the EM directionwith a stepwhose lengthis controlledby a parameterc. In thepresent context, these methods take the form

(5.6) )= F,(QIC) = ( 1 - cJ$ + cGG(cc), where GQ() is the EM iterationfunction defined by (4.5) and (5.3). The idea is to optimizethe speed of convergenceto tN of an iterationsequence generatedby such a methodfor large N by choosingE to minimizethe spectralradius of E[F'(4I*)] = I - cQR. As in [ 113], [114] and [ 120], one can easilyshow that the optimal choice of E is alwaysgreater than one, lies near one if the componentpopulations in the mixtureare "well-separated"in the above sense, and cannot be much smaller than two if the componentpopulations are "poorlyseparated" in theabove sense.The extentto whichthe speed of convergenceof an iterationsequence can be enhancedby makingthe optimal choiceof e in (5.6) is determinedby the lengthof the subintervalof (0, 1] in whichthe spectrumof QR lies. (Greaterimprovements in convergencespeed are realizedfrom the optimal choice of c when this subintervalis relativelynarrow.) The applicationsof iterativeprocedures of the form (5.6) are at presentincompletely explored and mightwell bear furtherinvestigation. We concludethis section by brieflyreviewing some of the special formswhich the EM iterationtakes when a particularcomponent density pi(x IOi) is a memberof one of the commonexponential families. We also commenton some convergenceproperties of sequences {1(j)}j=0,,2,... generatedby the EM algorithmin the examples considered. Hopefully,our comments will prove helpful in determiningconvergence properties of EM iterationsequences b(j)1j=0l 2,... throughthe use of Theorem5.1 or othermeans when all componentdensities are fromone or moreof these example families. Example 5.1. Poisson density.In thisexample, n = 1 and a naturalchoice of Qi is Qi = {i E R': 0 < /i < 4O}.Fori E Qi,one has

We conclude this section by briefly reviewing some of the special forms which the EM iteration takes when a particular component density p_i(x|φ_i) is a member of one of the common exponential families. We also comment on some convergence properties of sequences {Φ^(j)}_{j=0,1,2,...} generated by the EM algorithm in the examples considered. Hopefully, our comments will prove helpful in determining convergence properties of EM iteration sequences {Φ^(j)}_{j=0,1,2,...} through the use of Theorem 5.1 or other means when all component densities are from one or more of these example families.

Example 5.1. Poisson density. In this example, n = 1, and a natural choice of Ω_i is Ω_i = {φ_i ∈ R¹: 0 < φ_i < ∞}. For φ_i ∈ Ω_i, one has

p_i(x \mid \phi_i) = \frac{\phi_i^x}{x!}\, e^{-\phi_i}, \qquad x = 0, 1, 2, \ldots,

and the EM iteration (5.3) for a sample of Type 1 becomes

\phi_i^+ = \frac{\sum_{k=1}^N x_k\, \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)}{\sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)}.


Note that φ_i⁺ is always contained in the convex hull of {x_k}_{k=1,...,N}, which is a compact subset of Ω̄_i. Therefore, the set of limit points of an EM iteration sequence {φ_i^(j)}_{j=0,1,2,...} is a nonempty compact subset of Ω̄_i.
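For concreteness, here is a sketch of one pass of (4.5) together with the Poisson update above for a Type 1 sample; it is a direct transcription of the formulas, with array names of our choosing.

```python
import numpy as np
from scipy.stats import poisson

def em_step_poisson_mixture(x, alpha, lam):
    """One EM iteration for a Type 1 sample from a mixture of m Poisson densities.

    x     : integer observations, shape (N,)
    alpha : current mixing proportions alpha_i^c, shape (m,)
    lam   : current Poisson means phi_i^c, shape (m,)
    """
    # posterior weights w[k, i] = alpha_i^c p_i(x_k | phi_i^c) / p(x_k | Phi^c)
    joint = alpha[None, :] * poisson.pmf(x[:, None], lam[None, :])   # shape (N, m)
    w = joint / joint.sum(axis=1, keepdims=True)
    alpha_new = w.mean(axis=0)                                 # update (4.5)
    lam_new = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)     # weighted sample mean, per (5.3)
    return alpha_new, lam_new
```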

Example 5.2. Binomial density. Here n = 1, and one naturally chooses Ω_i to be the open set {φ_i ∈ R¹: 0 < φ_i < 1}. For φ_i ∈ Ω_i, p_i(x|φ_i) is given by

p_i(x \mid \phi_i) = \binom{\nu_i}{x}\, \phi_i^x (1 - \phi_i)^{\nu_i - x}, \qquad x = 0, 1, \ldots, \nu_i,

for a prescribed integer ν_i. In this case, the EM iteration (5.3) for a sample of Type 1 becomes

\phi_i^+ = \frac{\sum_{k=1}^N x_k\, \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)}{\nu_i \sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)}.

Since p_i(x|φ_i) is nonzero only if x = 0, 1, ..., ν_i, one sees from this expression that the set of limit points of an EM iteration sequence {φ_i^(j)}_{j=0,1,2,...} is a nonempty compact subset of Ω̄_i = {φ_i ∈ R¹: 0 ≤ φ_i ≤ 1}.

Example 5.3. Exponential density. Again, n = 1, and one takes Ω_i = {φ_i ∈ R¹: 0 < φ_i < ∞}. For φ_i ∈ Ω_i, one has

p_i(x \mid \phi_i) = \frac{1}{\phi_i}\, e^{-x/\phi_i}, \qquad 0 \le x < \infty,

and the EM iteration (5.3) for a sample of Type 1 becomes

\phi_i^+ = \frac{\sum_{k=1}^N x_k\, \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)}{\sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)\,/\,p(x_k \mid \Phi^c)},

and one sees that the set of limit points of an EM iteration sequence {φ_i^(j)}_{j=0,1,2,...} is a nonempty compact subset of Ω̄_i.

Example 5.4. Multivariate normal density. In this example, n is an arbitrary positive integer, and φ_i is most conveniently represented as φ_i = (μ_i, Σ_i), where μ_i ∈ Rⁿ and Σ_i is a positive-definite symmetric n × n matrix. (Of course, this representation of φ_i is not the usual representation of the "expectation" parameter.) Then Ω_i is the set of all such φ_i, and p_i(x|φ_i) is given by (4.7). For a sample of Type 1, the EM iteration (5.3) becomes that of (4.8) and (4.9). One can see from (4.9) that each Σ_i⁺ is in the convex hull of

{(x_k − μ_i⁺)(x_k − μ_i⁺)^T}_{k=1,...,N}, a set of rank-one matrices which, of course, are not positive definite. Thus there is no guarantee that a sequence of matrices {Σ_i^(j)}_{j=0,1,2,...} produced by the EM iteration will remain bounded from below. Indeed, it has been observed in practice that sequences of iterates produced by the EM algorithm for a mixture of multivariate normal densities do occasionally converge to "singular solutions" (cf. Duda and Hart [46]), i.e., points on the boundary of Ω_i with associated singular matrices.

It was observed by Hosmer [75] that if enough labeled observations are included in a sample on a mixture of normal densities, then, with probability 1, the log-likelihood function attains its maximum value at a point at which the covariance matrices are positive definite. Similarly, consideration of samples with a sufficiently large number of labeled observations alleviates, with probability 1, the problem of an EM iteration sequence having "singular solutions" as limit points. For example, if one considers a sample S = S1 ∪ S3 which is a stochastically independent union of a sample S1 = {x_k}_{k=1,...,N} of Type 1 and a sample S3 = ∪_{i=1}^m {z_{ik}}_{k=1,...,K_i} of Type 3, then the EM iteration becomes

\alpha_i^+ = \frac{1}{N+K} \left[ \sum_{k=1}^N \frac{\alpha_i^c\, p_i(x_k \mid \phi_i^c)}{p(x_k \mid \Phi^c)} + K_i \right],

\mu_i^+ = \frac{\sum_{k=1}^N x_k\, \alpha_i^c\, p_i(x_k \mid \phi_i^c)/p(x_k \mid \Phi^c) + \sum_{k=1}^{K_i} z_{ik}}{\sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)/p(x_k \mid \Phi^c) + K_i},

\Sigma_i^+ = \frac{\sum_{k=1}^N (x_k - \mu_i^+)(x_k - \mu_i^+)^T\, \alpha_i^c\, p_i(x_k \mid \phi_i^c)/p(x_k \mid \Phi^c) + \sum_{k=1}^{K_i} (z_{ik} - \mu_i^+)(z_{ik} - \mu_i^+)^T}{\sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)/p(x_k \mid \Phi^c) + K_i},

where K = \sum_{i=1}^m K_i. One sees from the expression for Σ_i⁺ that Σ_i⁺ is bounded below by

\frac{\sum_{k=1}^{K_i} (z_{ik} - \mu_i^+)(z_{ik} - \mu_i^+)^T}{\sum_{k=1}^N \alpha_i^c\, p_i(x_k \mid \phi_i^c)/p(x_k \mid \Phi^c) + K_i},

which is in turn bounded below by

\frac{1}{N + K_i} \sum_{k=1}^{K_i} (z_{ik} - \bar z_i)(z_{ik} - \bar z_i)^T,

where

\bar z_i = \frac{1}{K_i} \sum_{k=1}^{K_i} z_{ik}.

Now this last matrix is positive definite with probability 1 whenever K_i > n. Consequently, if K_i > n, then, with probability 1, the elements of a sequence {Σ_i^(j)}_{j=0,1,2,...} produced by the EM algorithm are bounded below by a positive-definite matrix; hence, such a sequence cannot have singular matrices as limit points.
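The following sketch spells out the iteration just described for a mixture of m multivariate normal densities with an unlabeled Type 1 sample and labeled Type 3 samples; it is written directly from the reconstructed updates above, and all names are ours. The labeled-data term in each covariance update is what keeps Σ_i⁺ bounded below when K_i > n.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_normal_mixture_with_labels(x, z, alpha, mu, sigma):
    """One EM iteration for a mixture of m multivariate normal densities,
    using an unlabeled Type 1 sample x and labeled Type 3 samples z.

    x     : unlabeled observations, shape (N, n)
    z     : list of m arrays; z[i] holds the K_i observations known to come from population i
    alpha : mixing proportions, shape (m,)
    mu    : component means, shape (m, n)
    sigma : component covariances, shape (m, n, n)
    """
    N, n = x.shape
    m = len(alpha)
    K = np.array([len(zi) for zi in z])

    # posterior weights w[k, i] = alpha_i p_i(x_k | phi_i) / p(x_k | Phi) for the unlabeled data
    dens = np.column_stack([multivariate_normal.pdf(x, mean=mu[i], cov=sigma[i]) for i in range(m)])
    w = alpha * dens
    w /= w.sum(axis=1, keepdims=True)

    alpha_new = (w.sum(axis=0) + K) / (N + K.sum())
    mu_new = np.empty_like(mu)
    sigma_new = np.empty_like(sigma)
    for i in range(m):
        wi = w[:, i]
        denom = wi.sum() + K[i]
        mu_new[i] = (wi @ x + z[i].sum(axis=0)) / denom
        dx = x - mu_new[i]
        dz = z[i] - mu_new[i]
        # the labeled-data term dz.T @ dz bounds sigma_new[i] below when K_i > n
        sigma_new[i] = ((wi[:, None] * dx).T @ dx + dz.T @ dz) / denom
    return alpha_new, mu_new, sigma_new
```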

6. Performance of the EM algorithm. In this concluding section, we review and summarize features of the EM algorithm having to do with its effectiveness in practice on mixture density estimation problems. As always, it is understood that a parametric family of mixture densities of the form (1.1) is of interest, and that a particular Φ* = (α*_1, ..., α*_m, φ*_1, ..., φ*_m) is the "true" parameter value to be estimated.

In order to provide some perspective, we begin by offering a brief description of the most basic forms of several alternative methods for numerically approximating maximum-likelihood estimates. In describing these methods, it is assumed for convenience that the sample at hand is a sample S1 = {x_k}_{k=1,...,N} of Type 1 described in §3 and that one can write Φ as a vector of ν unconstrained scalar parameters at points of interest in Ω. Each of the methods to be described seeks a maximum-likelihood estimate by attempting to determine a point Φ̂ such that

(6.1)    \nabla_\Phi L_1(\Phi) = 0,

where L_1(Φ) is the log-likelihood function given by (3.1). The features of the methods which concern us here are their speed of convergence, the computation and storage required for their implementation, and the extent to which their basic forms need to be modified in order to make them effective and trustworthy in practice.


The first of the alternative methods to be described is Newton's method. It is the method on which all but the last of the other methods reviewed here are modeled, and it is as follows: Given a current approximation Φ^c of a solution of (6.1), determine a next approximation Φ⁺ by

(6.2)    \Phi^+ = \Phi^c - H(\Phi^c)^{-1} \nabla_\Phi L_1(\Phi^c).

The function H(Φ) in (6.2) is the Hessian matrix of L_1(Φ) given by (3.10). Under reasonable assumptions on L_1(Φ), one can show that a sequence of iterates {Φ^(j)}_{j=0,1,2,...} produced by Newton's method enjoys quadratic local convergence to a solution Φ̂ of (6.1) (see, for example, Ortega and Rheinboldt [107], or Dennis and Schnabel [42]). This is to say that, given a norm ||·|| on Ω, there is a constant β such that if Φ^(0) is sufficiently near Φ̂, then an inequality

(6.3)    \|\Phi^{(j+1)} - \hat\Phi\| \le \beta\, \|\Phi^{(j)} - \hat\Phi\|^2

holds for j = 0, 1, 2, .... Quadratic convergence is ultimately very fast, and it is regarded as the major strength of Newton's method. Unfortunately, there are aspects of Newton's method which are associated with potentially severe problems in some applications. For one thing, Newton's method requires at each iteration the computation of the ν × ν Hessian matrix and the solution of a system of ν linear equations (at a cost of O(ν³) arithmetic operations in general) with this Hessian as the coefficient matrix; thus the computation required for an iteration of Newton's method is likely to become expensive very rapidly as m, n and N grow large. (It should also be mentioned that one must allow for the storage of the Hessian or some set of factors of it.) For another thing, Newton's method in its basic form (6.2) requires for some problems an impractically accurate initial approximate solution Φ^(0) in order for a sequence of iterates {Φ^(j)}_{j=0,1,2,...} to converge to a solution of (6.1). Consequently, in order to be regarded as an algorithm which is safe and effective on applications of interest, the basic form (6.2) is likely to require augmentation with some procedure for enhancing the global convergence behavior of sequences of iterates produced by it. Such a procedure should be designed to insure that a sequence of iterates not only converges, but also does not converge to a solution of (6.1) which is not a local maximizer of L_1(Φ).
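A generic sketch of the basic step (6.2); the routines `grad_L` and `hess_L`, which evaluate ∇_Φ L_1 and the Hessian (3.10), are assumed to be supplied by the user and are not spelled out here.

```python
import numpy as np

def newton_step(phi, grad_L, hess_L):
    """One step of Newton's method (6.2) for the likelihood equations (6.1).

    phi    : current approximation Phi^c, a vector of length nu
    grad_L : callable returning the gradient of L_1 at phi
    hess_L : callable returning the nu-by-nu Hessian H(phi)

    Solving the linear system costs O(nu**3) operations in general; in
    practice the step is usually safeguarded, e.g. by a line search on L_1.
    """
    return phi - np.linalg.solve(hess_L(phi), grad_L(phi))
```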

A broad class of methods which are based on Newton's method are quasi-Newton methods of the general form

(6.4)    \Phi^+ = \Phi^c - B^{-1} \nabla_\Phi L_1(\Phi^c),

in which B is regarded as an approximation of H(Φ^c). Methods of the form (6.4) which are particularly successful are those in which the approximation B ≈ H(Φ^c) is maintained by doing a secant update of B at each iteration (see Dennis and Moré [41] or Dennis and Schnabel [42]). In the applications of interest here, such updates are typically realized as rank-one or (more likely) rank-two changes in B. Methods employing such updates have the advantages over Newton's method of not requiring the evaluation of the Hessian matrix at each iteration and of being implementable in ways which require only O(ν²) arithmetic operations to solve the system of ν linear equations at each iteration. The price paid for these advantages is that the full quadratic convergence of Newton's method is lost; rather, under reasonable assumptions on L_1(Φ), a sequence of iterates {Φ^(j)}_{j=0,1,2,...} produced by one of these methods can only be shown to exhibit local superlinear convergence to a solution Φ̂ of (6.1), i.e., one can only show that if a norm ||·|| on Ω is given and if Φ^(0) is sufficiently near Φ̂ (and an initial approximate Hessian B^(0) is sufficiently near H(Φ̂)), then there exists a sequence {β_j}_{j=0,1,2,...} which converges to zero and is such that


\|\Phi^{(j+1)} - \hat\Phi\| \le \beta_j\, \|\Phi^{(j)} - \hat\Phi\|

for j = 0, 1, 2, .... Like Newton's method, methods of the general form (6.4), including those employing secant updates, are likely to require augmentation with safeguards to enhance global convergence properties and to insure that iterates do not converge to solutions of (6.1) which are not local maximizers of L_1(Φ).
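Secant-update methods of the form (6.4) are what standard optimization libraries provide. As one hedged illustration, the sketch below minimizes the negative log-likelihood with the BFGS routine in scipy; the objective `neg_log_likelihood(phi, x)`, written for the particular mixture in an unconstrained parametrization, is assumed to exist and is not given here.

```python
from scipy.optimize import minimize

def fit_by_quasi_newton(neg_log_likelihood, phi0, x):
    """Maximize L_1 with a rank-two secant-update (BFGS) method of the form (6.4).

    neg_log_likelihood : callable (phi, x) -> -L_1(phi), phi an unconstrained vector
    phi0               : starting point Phi^(0)
    x                  : the Type 1 sample, passed through to the objective
    """
    result = minimize(neg_log_likelihood, phi0, args=(x,), method="BFGS")
    return result.x, -result.fun   # the estimate and the attained log-likelihood
```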

There is a particular method of the form (6.4) which is specifically formulated for solving likelihood equations. This is the method of scoring, mentioned earlier in connection with the work of Rao [119] and reviewed in a general setting by Kale [84], [85]. (Kale [84], [85] also discusses modifications of Newton's method and the method of scoring in which the Hessian matrix or an approximation of it is held fixed for some number of iterations in the hope of reducing overall computational effort.) In the method of scoring, one ideally chooses B in (6.4) to be

(6.5)    B = -N\, I(\Phi^c),

where I(Φ) is the Fisher information matrix given by (2.5.1). Since the computation of I(Φ^c) is likely to be prohibitively expensive for most mixture density problems, a more appealing choice of B than (6.5) might be the sample approximation

(6.6)    B = -\sum_{k=1}^N [\nabla_\Phi \log p(x_k \mid \Phi^c)][\nabla_\Phi \log p(x_k \mid \Phi^c)]^T.
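Since the per-observation scores ∇_Φ log p(x_k|Φ) are needed for the gradient anyway, (6.6) is cheap to assemble; a minimal sketch, with `score` a user-supplied routine returning one such score vector.

```python
import numpy as np

def scoring_matrix(phi, x, score):
    """Form B of (6.6): minus the sum of outer products of the per-sample scores.

    score : callable (phi, x_k) -> gradient of log p(x_k | phi), a vector of length nu
    Assembling B this way costs O(N * nu**2) arithmetic operations.
    """
    S = np.array([score(phi, xk) for xk in x])   # shape (N, nu)
    return -S.T @ S
```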

The choice (6.6) can be justified in the following manner: the Hessian H(Φ) is given by

(6.7)    H(\Phi) = -\sum_{k=1}^N [\nabla_\Phi \log p(x_k \mid \Phi)][\nabla_\Phi \log p(x_k \mid \Phi)]^T + \sum_{k=1}^N \frac{1}{p(x_k \mid \Phi)}\, \nabla_\Phi \nabla_\Phi^T\, p(x_k \mid \Phi).

Now the second sum in (6.7) has zero expectation at Φ = Φ*; furthermore, since the terms ∇_Φ log p(x_k|Φ) must be computed in order to obtain ∇_Φ L_1(Φ), the first sum in (6.7) is available at the cost of only O(Nν²) arithmetic operations, while determining the second sum is likely to involve a great deal more expense. Thus (6.6) is a choice of B which is readily available at relatively low cost and which is likely to constitute a major part of H(Φ^c) when N is large and Φ^c is near Φ*. It is clear from this discussion that the method of scoring with B given by (6.6) is an analogue for general maximum-likelihood estimation of the Gauss-Newton method for nonlinear least-squares problems (see Ortega and Rheinboldt [107]). If the computation of I(Φ^c) is not too expensive, then the choice of B given by (6.5) can be justified in much the same way.

The method of scoring in its basic form requires O(Nν²) arithmetic operations to evaluate B given by (6.6) and O(ν³) arithmetic operations to solve the system of ν linear equations implicit in (6.4). Since these O(Nν²) arithmetic operations are likely to be considerably less expensive than the evaluation of the full Hessian given by (6.7), the cost of computation per iteration of the method of scoring lies between that of a quasi-Newton method employing a low-rank secant update and that of Newton's method. Under reasonable assumptions on L_1(Φ), one can show that, with probability 1, if a solution Φ̂ of (6.1) is sufficiently near Φ* and if N is sufficiently large, then a sequence of iterates {Φ^(j)}_{j=0,1,2,...} generated by the method of scoring with B given by either (6.5) or (6.6) exhibits local linear convergence to Φ̂; i.e., there is a norm ||·|| on Ω and a constant λ,


0 ≤ λ < 1, for which

(6.8)    \|\Phi^{(j+1)} - \hat\Phi\| \le \lambda\, \|\Phi^{(j)} - \hat\Phi\|, \qquad j = 0, 1, 2, \ldots,

whenever Φ^(0) is sufficiently near Φ̂. If Φ̂ is very near Φ* and if N is very large, then this convergence should be fast, i.e., (6.8) should hold for a small constant λ. Like Newton's method and all methods of the general form (6.4), the method of scoring is likely to require augmentation with global convergence safeguards in order to be considered trustworthy and effective.

The last methods to be touched on in this review are conjugate gradient methods. These methods are conceptually quite different from Newton's method and have the particular virtue of not requiring the storage or maintenance of an approximate Hessian. Thus they may be the methods of choice when the number of independent variables is so large that storage is at a premium. The original conjugate gradient method was derived to solve linear equations with positive-definite symmetric coefficient matrices, and a number of extensions and variations of this basic method now exist for both linear and nonlinear problems. Precise general theoretical convergence properties of conjugate gradient methods are not easily stated, but in practice they can be regarded as nearly always linearly convergent for nonlinear problems (see Gill, Murray and Wright [53]). In the nonlinear case, the amount of work required for an iteration is difficult to predict because an approximate maximizer or minimizer must be found along a direction of search. Overall, they appear to be less efficient on small problems than methods modeled after Newton's method, but potentially of great value (and perhaps the only useful methods) for very large problems. The reader is referred to Ortega and Rheinboldt [107] and Gill, Murray and Wright [53] for an expanded discussion of the forms and properties of conjugate gradient methods. Additional references are offered by Dennis and Schnabel [42], as well as by [107] and [53].

Having reviewed the above alternative methods, we return now to the EM algorithm and summarize its attractive features. Its most appealing general property is that it produces sequences of iterates on which the log-likelihood function increases monotonically. This monotonicity is the basis of the general convergence theorems of §4, and these theorems reinforce a large body of empirical evidence to the effect that the EM algorithm does not require augmentation with elaborate safeguards, such as those necessary for Newton's method and quasi-Newton methods, in order to produce iteration sequences with good global convergence characteristics. To be clear on the meaning of "good global convergence characteristics," let us specifically say that, barring very bad luck or some local pathology in the log-likelihood function, one can expect an EM iteration sequence to converge to a local maximizer of the log-likelihood function. We do not intend to suggest that the EM algorithm is especially adept at finding global maximizers, but at this stage of the art, there are no well tested, general purpose algorithms which can reliably and efficiently find global maximizers. There has been a good bit of work directed toward the development of such algorithms, however; see Dixon and Szego [44].

More can be said about the EM algorithm for mixtures of densities from exponential families under the assumption that φ_1, ..., φ_m are mutually independent variables. One sees from (4.5) and (5.3) and similar expressions for samples of types other than Type 1 that it is unlikely that any other algorithm would be nearly as easy to encode on a computer or would require as little storage.
In view of (4.5) and (5.3), it also seems that any constraints on Φ are likely to be satisfied, or at least nearly satisfied, for large samples. For example, it is clear from (4.9) that each Σ_i⁺ generated by the EM algorithm for a mixture of multivariate normal densities is symmetric and, with probability 1, positive definite whenever N > n.

Certainly the mixing proportions generated by (4.5) are always nonnegative and sum to 1. It is also apparent from (4.5) and (5.3) that the computational cost of each iteration of the EM algorithm is low compared to that of the alternative methods reviewed above. In the case of a mixture of multivariate normal densities, for example, the EM algorithm requires O(mn²N) arithmetic operations per iteration, compared to at least [O(m²n⁴N) + O(m³n⁶)] for Newton's method and the method of scoring and [O(mn²N) + O(m²n⁴)] for a quasi-Newton method employing a low-rank secant update. (All of these methods require the same number of exponential function evaluations per iteration.) Arithmetic per iteration for the three latter methods can, of course, be reduced by retaining a fixed approximate Hessian for some number of iterations at the risk of increasing the total number of iterations.

In spite of these attractive features, the EM algorithm can encounter problems in practice. The source of the most serious practical problems associated with the algorithm is the speed of convergence of sequences of iterates generated by it, which can often be annoyingly or even hopelessly slow. In the case of mixtures of densities from exponential families, Theorem 5.2 suggests that one can expect the convergence of EM iteration sequences to be linear, as opposed to the (very fast) quadratic convergence associated with Newton's method, the (fast) superlinear convergence associated with a quasi-Newton method employing a low-rank secant update, the (perhaps fast) linear convergence of the method of scoring, and the linear convergence of conjugate gradient methods. The discussion following Theorem 5.2 suggests further that the speed of this linear convergence depends in a certain sense on the separation of the component populations in the mixture.

To demonstrate the speed of this linear convergence and its dependence on the separation of the component populations, we again consider the example of a mixture of two univariate normal densities (see (1.3) and (1.4)). Table 6.1 summarizes the results of a numerical experiment involving a mixture of two univariate normal densities for the choices of Φ* appearing in Table 3.3. (These choices were obtained as before by taking α*_1 = .3, σ*_1 = σ*_2 = 1, and varying the mean separation μ*_1 − μ*_2. For convenience, we took μ*_2 = −μ*_1.) In this experiment, a Type 1 sample of 1000 observations on the mixture was generated for each choice of Φ*, and a sequence of iterates was produced by the EM algorithm (see (4.5), (4.8) and (4.9)) from starting values α_1^(0) = α_2^(0) = .5, μ_1^(0) = 1.5 μ*_1, μ_2^(0) = 1.5 μ*_2, and (σ_1²)^(0) = (σ_2²)^(0) = .5. An accurate determination of the limit of the sequence was made in each case, and observations were made of the iteration numbers at which various degrees of accuracy were first obtained. These iteration numbers are recorded in Table 6.1 beneath the corresponding degrees of accuracy; in the table, "E" denotes the largest absolute value of the components of the difference between the indicated iterate and the limit. In addition, the spectral radius of the derivative of the EM iteration function at the limit was calculated in each case (cf. Theorem 5.2 and the following discussion). These spectral radii, appearing in the column headed by "ρ" in Table 6.1, provide quantitative estimates of the factors by which errors are reduced from one iteration to the next in each case.
Finally, to give an idea of the point in an iteration sequence at which numerical error first begins to affect the theoretical performance of the algorithm, we observed in each case the iteration numbers at which loss of monotonicity of the log-likelihood function first occurred; these iteration numbers appear in Table 6.1 in the column headed by "LM." In preparing Table 6.1, all computing was done in double precision on an IBM 3032. Eigenvalues were calculated with EISPACK subroutines TRED1 and TQL1, and normally distributed data was obtained by transforming uniformly distributed data generated by the subroutine URAND of Forsythe, Malcolm and Moler [48], based on suggestions of Knuth [90].³
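An experiment of the kind summarized in Table 6.1 can be set up along the following lines; the EM updates (4.5), (4.8), (4.9) for two univariate normal components are written out directly, and the sample size, starting values and one choice of mean separation follow the description above (the random seed and all variable names are ours).

```python
import numpy as np

def em_step_two_normals(x, a1, mu1, mu2, v1, v2):
    """One EM iteration (4.5), (4.8), (4.9) for a mixture of two univariate normals."""
    def npdf(x, mu, v):
        return np.exp(-(x - mu) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
    p1 = a1 * npdf(x, mu1, v1)
    p2 = (1.0 - a1) * npdf(x, mu2, v2)
    w = p1 / (p1 + p2)                                   # posterior weight of component 1
    a1 = w.mean()
    mu1, mu2 = (w * x).sum() / w.sum(), ((1 - w) * x).sum() / (1 - w).sum()
    v1 = (w * (x - mu1) ** 2).sum() / w.sum()
    v2 = ((1 - w) * (x - mu2) ** 2).sum() / (1 - w).sum()
    return a1, mu1, mu2, v1, v2

# Type 1 sample of 1000 observations, e.g. for the row of Table 6.1 with mean separation 2.0
rng = np.random.default_rng(0)
mu_star, alpha_star = 1.0, 0.3
from_first = rng.random(1000) < alpha_star
x = np.where(from_first, rng.normal(mu_star, 1.0, 1000), rng.normal(-mu_star, 1.0, 1000))

params = (0.5, 1.5 * mu_star, -1.5 * mu_star, 0.5, 0.5)   # starting values as described above
for j in range(2000):
    params = em_step_two_normals(x, *params)
```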


TABLE 6.1
Results of applying the EM algorithm to a problem involving a Type 1 sample on a mixture of two univariate normal densities with α*_1 = .3, σ*_1 = σ*_2 = 1.

μ*_1 − μ*_2 | E<10⁻¹ | E<10⁻² | E<10⁻³ | E<10⁻⁴ | E<10⁻⁵ | E<10⁻⁶ | E<10⁻⁷ | E<10⁻⁸ | LM  | ρ
0.2         | 2078   | 2334   | 2528   | 2717   | 2906   | 3095   | 3283   | 3472   | 3056 | .9879
0.5         | 710    | 852    | 985    | 1117   | 1249   | 1381   | 1513   | 1643   | 1361 | .9827
1.0         | 349    | 442    | 526    | 610    | 693    | 777    | 861    | 949    | 779  | .9728
1.5         | 280    | 414    | 537    | 660    | 783    | 906    | 1028   | 1151   | 887  | .9814
2.0         | 126    | 281    | 432    | 582    | 732    | 883    | 1033   | 1183   | 846  | .9849
3.0         | 2      | 31     | 62     | 93     | 124    | 155    | 185    | 216    | 173  | .9280
4.0         | 1      | 6      | 16     | 25     | 35     | 44     | 54     | 63     | 55   | .7864
6.0         | 1      | 1      | 2      | 3      | 4      | 5      | 7      | 8      | 8    | .2143

A number of comments about the contents of Table 6.1 are in order. First, it is clear from the table that an exorbitantly large number of EM iterations may be required to obtain a very accurate numerical approximation of the maximum-likelihood estimate if the sample is from a mixture of poorly separated component populations. However, in such a case, one sees from Table 3.3 that the variance of the estimate is likely to be such that it may be pointless to seek very much accuracy in a numerical approximation.

Second, we remark on the pleasing consistency between the computed values of the spectral radius of the derivative of the EM iteration function and the differences between the iteration numbers needed to obtain varying degrees of accuracy. What we have in mind is the following: If the errors among the members of a linearly convergent sequence are reduced more or less by a factor of ρ, 0 < ρ < 1, from one iteration to the next, then the number of iterations Δk necessary to obtain an additional decimal digit of accuracy is given approximately by Δk ≈ −log 10 / log ρ. This relationship between Δk and ρ is borne out very well in Table 6.1. (In the first row, for instance, ρ = .9879 gives Δk ≈ 189, which agrees closely with the gaps of roughly 190 iterations between the later accuracy columns in that row.) This fact strongly suggests that after a number of EM iterations have been made, the errors in the iterates lie almost entirely in the eigenspace corresponding to the dominant eigenvalue of the derivative of the EM iteration function. We take this as evidence that one might very profitably apply simple relaxation-type acceleration procedures such as those of Peters and Walker [113], [114] and Redner [120] to sequences of iterates generated by the EM algorithm.

Third, in all of the cases listed in Table 6.1 except one, we observed that over 95 percent of the change in the log-likelihood function between the starting point and the limit of the EM iteration sequence was realized after only five iterations, regardless of the number of iterations ultimately required to approximate the limit very closely. (The exceptional case is that in which μ*_1 − μ*_2 = 1.0; in that case, about 83 percent of the change in the log-likelihood function was observed after five iterations.) This suggests to us that even when the component populations in a mixture are poorly separated, the EM algorithm can be expected to produce in a very small number of iterations parameter values such that the mixture density determined by them reflects the sample data very well.

³We are grateful to the Mathematics and Statistics Department of the University of New Mexico for providing the computing support for the generation of Table 6.1.

Fourth, it is evident from Table 6.1 that elements of an EM iteration sequence continue to make steady progress toward the limit even after numerical error has begun to interfere with the theoretical properties of the algorithm.

Fifth, the apparently anomalous decrease in ρ occurring when μ*_1 − μ*_2 decreases from 2.0 to 1.0 happened concurrently with the iteration sequence limit of the proportion of the first population in the mixture becoming very small. (Such very small limit proportions continued to be observed in the cases μ*_1 − μ*_2 = 0.5, 0.2.) We do not know whether this decrease in the limit proportion of the first population indicates a sudden movement of the maximum-likelihood estimate as μ*_1 − μ*_2 drops below 2.0, or whether the iteration sequence limit is something other than the maximum-likelihood estimate in the cases in which μ*_1 − μ*_2 is less than 2.0.

Finally, we also conducted more than 60 trials similar to those reported in Table 6.1, except with samples of 200 rather than 1000 generated observations on the mixture. The results were comparable to those given in Table 6.1. It should be mentioned, however, that the EM iteration sequences obtained using samples of 200 observations did occasionally converge to "singular solutions," i.e., limits associated with zero component variances. Convergence to such "singular solutions" did not occur among the relatively small number of trials involving samples of 1000 observations.

At present, the EM algorithm is being widely applied, not only to mixture density estimation problems, but also to a wide variety of other problems as well. We would like to conclude this survey with a little speculation about the future of the algorithm. It seems likely that the EM algorithm in its basic form will find a secure niche as an algorithm useful in situations in which some resources are limited. For example, the limited time which an experimenter can afford to spend writing programs, coupled with a lack of available library software for safely and efficiently implementing competing methods, could make the simplicity and reliability of the EM algorithm very appealing. Also, the EM algorithm might be very well suited for use on small computers for which limitations on program and data storage are more stringent than limitations on computing time. Although meaningful comparison tests have not yet been made, it seems doubtful to us that the unadorned EM algorithm can be competitive as a general tool with well-designed general optimization algorithms such as those implemented in good currently available software library routines. Our doubt is based on the intolerably slow convergence of sequences of iterates generated by the EM algorithm in some applications. On the other hand, it is entirely possible that the EM algorithm could be modified to incorporate procedures for accelerating convergence, and that such modification would enhance its competitiveness. It is also possible that an effective hybrid algorithm might be constructed which first takes advantage of the good global convergence properties of the EM algorithm by using it initially and then exploits the rapid local convergence of Newton's method or one of its variants by switching to such a method later. Our feeling is that time might well be spent on research addressing these possibilities.

REFERENCES

[1] M. A. ACHESON AND E. M. MCELWEE, Concerningthe reliabilityof electrontubes, The Sylvania Technologist,4 (1951), pp. 105-116. [2] J. AITCHISON AND S. D. SILVEY, Maximum-likelihoodestimation of parameterssubject to restraints, Ann.Math. Statist.,3 (1958), pp. 813-828. [3] J.A. ANDERSON, Multivariatelogistic compounds, Biometrika, 66 (1979), pp. 17-26. [4] N. ARLEY AND K. R. BUCH, Introductionto the Theoryof Probabilityand Statistics,John Wiley, New York,1950.


[5] G. A. BAKER, Maximum-likelihoodestimation of theratio of thetwo components of non-homogeneous populations,Tohoku Math. J.,47 (1940), 304-308. [6] 0. BARNDORFF-NIELSEN,Identifiability of mixturesof exponentialfamilies,J. Math. Anal. Appl., 12 (1965), pp. 115-121. [7] , Informationand ExponentialFamilies inStatistical Theory, John Wiley, New York,1978. [8] L. E. BAUM, T. PETRIE, G. SOULES AND N. WEISS, A maximizationtechnique occuring in thestatistical analysisof probabilistic functions of Markovchains, Ann. Math.Statist., 41 (1970), pp. 164-171. [9] J.BEHBOODIAN, On a mixtureof normaldistributions, Biometrika, 57 Part 1 (1970), 215-217. [10] , Informationmatrix for a mixtureof two normaldistributions, J. Statist.Comput. Simul., 1 (1972), pp. 295-314. [11] C. T. BHATTACHARYA,A simple methodof resolutionof a distributioninto Gaussian components, Biometrics,23 (1967), pp. 115-137. [12] W. R. BLISCHKE, Momentestimators for theparameters of a mixtureof twobinomial distributions, Ann.Math. Statist.,33 (1962), pp. 444-454. [13] , Mixturesof discretedistributions, in Proc. of the InternationalSymposium on Classical and ContagiousDiscrete Distributions, Pergamon Press, New York,1963, pp. 351-372. [14] , Estimatingthe parameters of mixturesof binomialdistributions, J. Amer.Statist. Assoc., 59 (1964), pp. 510-528. [15] D. C. BOES, On theestimation of mixingdistributions, Ann. Math.Statist., 37 (1966), pp. 177-188. [16] , Minimaxunbiased of mixingdistribution forfinite mixtures, Sankhya A, 29 (1967), pp.417-420. [17] K. 0. BOWMANAND L. R. SHENTON, Space of solutionsfora normalmixture, Biometrika, 60 (1973), pp. 629-636. [18] R. A. BOYLES, On the convergenceof the EM algorithm,J. Royal Statist.Soc. Ser. B, 45 (1983), pp. 47-50. [19] C. BURRAU, The half-invariantsof the sum of two typicallaws of errors,with an applicationto the problemof dissectingafrequency curve into components, Skand. Aktuarietidskrift,17 (1934), pp. 1-6. [20] R. M. CASSIE, Some uses ofprobability paper in theanalysis of sizefrequency distributions, Austral. J. Marineand FreshwaterRes., 5 (1954), pp. 513-523. [21] R. CEPPELINI, S. SINISCALCO AND C. A. B. SMITH, The estimationof genefrequencies in a random- matingpopulation, Ann. HumanGenetics, 20 (1955), pp. 97-115. [22] K. C. CHANDA, A note on the consistencyand maxima of the roots of the likelihood equations, Biometrika,41 (1954), pp. 56-61. [23] W. C. CHANG, The effectsof addinga variablein dissectinga mixtureof twonormal populations with a commoncovariance matrix, Biometrika 63, (1976), pp. 676-678. [24] , Confidenceinterval estimation and transformationof data in a mixtureof two multivariate normaldistributions with any given large dimension, Technometrics, 21 (1979), pp. 351-355. [25] C. V. L. CHARLIER, Researchesinto the theory of probability, Acta Univ.Lund. (Neue Folge.Abt. 2), 1 (1906), pp. 33-38. [26] C. V. L. CHARLIER AND S. D. WICKSELL, On thedissection offrequencyfunctions, Arkiv. for Matematik AstronomiOch Fysik,18 (6) (1924), Stockholm. [27] T. CHEN, Mixed-upfrequencies in contingencytables, Ph.D. dissertation,Univ. of Chicago, Chicago, 1972. [28] K. CHOI, Estimatorsfor theparameters of afinite mixture of distributions, Ann. Inst. Statist. Math., 21 (1969), pp. 107-116. [29] K. CHOI AND W. B. BULGREN,An estimationprocedure for mixturesof distributions,J. Royal Statist. Soc. Ser. B, 30 (1968), pp. 444-460. [30] A. C. COHEN, Jr., Estimationin mixturesof discretedistributions, in Proc. 
of the International Symposiumon Classical and ContagiousDiscrete Distributions, Pergamon Press, New York, 1963, pp.351-372. [31] , A noteon certaindiscrete mixed distributions, Biometrics, 22 (1966), pp. 566--571. [32] , Estimationin mixturesof twonormal distributions, Technometrics, 9 (1967), pp. 15-28. [33] Communicationsin Statistics,Special Issue on Remote Sensing,Comm. Statist. Theor. Meth., A5 (1976). [34] P. W. COOPER, Some topicson nonsupervisedadaptive detection for multivariatenormal distributions, in Computerand InformationSciences, II, J. T. Tou, ed., Academic Press,New York, 1967, pp. 143-146. [35] D. R. Cox, The analysisof exponentially distributed lifetimes with two types offailure, J. Royal Statist. Soc., 21 B (1959), pp. 411-421.


[36] H. CRAMER, MathematicalMethods of Statistics, Princeton Univ. Press, Princeton, NJ, 1946. [37] N. E. DAY, Estimatingthe components of a mixtureof normaldistributions, Biometrika, 56 (1969), pp. 463-474. [38] J.J. DEELY AND R. L. KRUSE, Constructionof sequences estimating the mixing distribution, Ann. Math. Statist.,39 (1968), pp. 286-288. [39] A. P. DEMPSTER, N. M. LAIRD AND D. B. RUBIN, Maximum-likelihoodfrom incomplete data via the EM algorithm,J. RoyalStatist. Soc. Ser. B (methodological),39 (1977), pp. 1-38. [40] J. E. DENNIS, JR., Algorithmsfor nonlinearfitting, in Proc. of the NATO Advanced Research Symposium,Cambridge Univ., Cambridge, England, July, 1981. [41] J. E. DENNIS, JR., AND J. J. MORE, Quasi-Newtonmethods, motivation and theory,this Review, 19 (1977), pp. 46-89. [42] J. E. DENNIS, JR., AND R. B. SCHNABEL, Numerical Methodsfor UnconstrainedOptimization and NonlinearEquations, Prentice-Hall, Englewood Cliffs, NJ, 1983. [43] N. P. DICK AND D. C. BOWDEN, Maximum-likelihoodestimation for mixturesof two normal distributions,Biometrics, 29 (1973), pp. 781-791. [44] L. C. W. DIXON AND G. P. SZEGO, Towards Global Optimization,Vols. 1, 2, North-Holland, Amsterdam,1975, 1978. [45] G. DOETSCH, Zerlegungeiner Funktion in GauscheFehlerkurven und zeitliche Zuruckverfolgung eines Temperaturzustandes,Math. Z., 41 (1936), pp. 283-318. [46] R. 0. DUDA AND P. E. HART, PatternClassification and Scene Analysis,John Wiley, New York,1973. [47] B. S. EVERITTAND D. J.HAND, FiniteMixtureDistributions,Chapmanand Hall, London,1981. [48] G. E. FORSYTHE, M. A. MALCOLM AND C. B. MOLER, ComputerMethodsfor Mathematical Computa- tions,Prentice-Hall, Englewood Cliffs, NJ 1977. [49] E. B. FOWLKES, Some methodsfor studyingthe mixtureof twonormal (log normal)distributions, J. Amer.Statist. Assoc., 74 (1979), pp. 561-575. [50] J. G. FRYER AND C. A. ROBERTSON, A comparisonof some methodsfor estimatingmixed normal distributions,Biometrika, 59 (1972), pp. 639-648. [51] K. FUKUNAGA, Introductionto StatisticalPattern Recognition, Academic Press, New York,1972. [52] S. GANESALINGAM AND G. J. MCLACHLAN, Some efficiencyresults for the estimationof the mixing proportionin a mixtureof twonormal distributions, Biometrics, 37 (1981), pp. 23-33. [53] P. E. GILL, W. MURRAY AND M. H. WRIGHT, PracticalOptimization, Academic Press, London, 1981. [54] L. A. GOODMAN, The analysis of systemsof qualitativevariables whensome of the variables are unobservable:Part I-A modifiedlatent structureapproach, Amer. J. Sociol., 79 (1974), pp. 1179-1259. [55] V. H. GOTTSCHALK, Symmetricbimodalfrequency curves, J. FranklinInst., 245 (1948), pp. 245-252. [56] J.GREGOR, An algorithmfor thedecomposition of a distributioninto Gaussian components,Biometrics, 25 (1969), pp. 79-93. [57] N. T. GRIDGEMAN, A comparisonof two methodsof analysis of mixturesof normaldistributions, Technometrics,12 (1970), pp. 823-833. [58] E. J. GUMBEL, La dissectiond'une repartition,Annales de l'Universitede Lyon, (3), A (1939), pp. 39-51. [59] A. K. GUPTA AND T. MIYAWAKI, On a uniformmixture model, Biometrical J., 20(1978), pp. 631-638. [60] L. F. GUSEMAN, JR., AND J. R. WALTON, An applicationof linearfeature selection to estimationof proportions,Comm. Statist. Theor. Meth., A6 (1977), pp. 611-617. [61] , Methodsfor estimatingproportions of convexcombinations of normalsusing linear feature selection,Comm. Statist. Theor. Meth., A7 (1978), pp. 1439-1450. [62] S. J. 
HABERMAN, Log-linearmodels for frequency tables derivedby indirectobservations: Maximum- likelihoodequations, Ann. Statist., 2 (1974), pp. 911-924. [63] , Iterativescaling proceduresfor log-linearmodels for frequencytables derivedby indirect observation,Proc. Amer. Statist. Assoc. (Statist.Comp. Sect. 1975), (1976), pp. 45-50. [64] , Productmodelsforfrequency tables involving indirect observation, Ann. Statist., 5 (1977), pp. 1124-1147. [65] A. HALD, The compoundhypergeometric distribution and a systemof singlesampling inspection plans based onprior distributions and costs,Technometrics, 2 (1960), pp. 275-340. [66] P. HALL, On the non-parametricestimation of mixtureproportions, J. Royal Statist.Soc. Ser. B, 43 (1981), pp. 147-156. [67] J. P. HARDING, The use of probabilitypaper for the graphical analysis of polynomialfrequency distributions,J. MarineBiological Assoc., 28 (1949), pp. 141-153. [68] J.A. HARTIGAN, ClusteringAlgorithms, John Wiley, New York, 1975. [69] M. J.HARTLEY, Commenton [118], J.Amer. Statist. Assoc., 73 (1978), pp. 738-741.


[70] V. HASSELBLAD, Estimationof parametersfor a mixtureof normaldistributions, Technometrics, 8 (1966), pp. 431-444. [71] , Estimationof finite mixtures of distributionsfrom the exponentialfamily, J. Amer.Statist. Assoc.,64 (1969), pp. 1459-1471. [72] R. J. HATHAWAY,Constrained maximum-likelihood estimation for a mixtureof m univariatenormal distributions.Statistics Tech. Rep. 92, 62F10-2,Univ. of South Carolina, Columbia, SC, 1983. [73] B. M. HILL, Informationfor estimatingthe proportionsin mixturesof exponentialand normal distributions,J. Amer. Statist. Assoc., 58 (1963), pp. 918-932. [74] D. W. HOSMER, JR., On MLE of theparameters of a mixtureof two normaldistributions when the samplesize is small,Comm. Statist., 1 (1973), 217-227. [75] , A comparisionof iterativemaximum-likelihood estimates of theparameters of a mixtureof two normal distributionsunder three differenttypes of sample, Biometrics,29 (1973), pp. 761-770. [76] , Maximum-likelihoodestimates of theparameters of a mixtureof tworegression lines, Comm. Statist.,3 (1974), pp. 995-1006. [77] , Commenton [118], J.Amer. Statist. Assoc., 73 (1978), pp. 741-744. [78] D. W. HOSMER, JR., AND N. P. DICK, Informationand mixturesof twonormal distributions, J. Statist. Comput.Simul., 6 (1977), pp. 137-148. [79] V. S. HUZURBAZAR, The likelihoodequation, consistency and the maxima of the likelihoodfunction, Annalsof Eugenics London, 14 (1948), pp. 185-200. [80] I. R. JAMES, Estimationof themixing proportion in a mixtureof twonormal distributions from simple, rapidmeasurements, Biometrics, 34 (1978), pp. 265-275. [81] S. JOHN, On identifyingthe population of originof each observationin a mixtureof observationsfrom twonormal populations, Technometrics, 12 (1970), pp. 553-563. [82] , On identifyingthe population of originof each observationin a mixtureof observationsfrom twogamma populations, Technometrics, 12 (1970), pp. 565-568. [83] A. B. M. L. KABIR, Estimationof parameters of afinitemixture of distributions,J. Royal Statist.Soc. Ser. B (methodological),30 (1968), pp. 472-482. [84] B. K. KALE, On thesolution of thelikelihood equation by iteration processes, Biometrika, 48 (1961), pp. 452-456. [85] -, On the solution of likelihoodequations by iterationprocesses. The multiparametriccase, Biometrika,49 (1962), pp. 479-486. [86] D. KAZAKOS, Recursiveestimation of prior probabilities using a mixture,IEEE Trans. Inform.Theory, IT-23 (1977), pp. 203-211. [87] J. KIEFER AND J. WOLFOWITZ, Consistencyof the maximum-likelihoodestimation in thepresence of infinitelymany incidental parameters, Ann. Math. Statist.,27 (1956), pp. 887-906. [88] N. M. KIEFER, Discreteparameter variation: Efficient estimation of a switchingregression model, Econometrica,46 (1978), pp. 427-434. [89] , Commenton [118], J.Amer. Statist. Assoc., 73 (1978), pp. 744-745. [90] D. E. KNUTH, Seminumericalalgorithms, in The Art of ComputerProgramming, Vol. 2, Addison- Wesley,Reading, MA, (1969). [91] K. D. KUMAR, E. H. NICKLIN AND A. S. PAULSON, Commenton [118], J. Amer.Statist. Assoc., 74 (1979), pp. 52-55. [92] N. LAIRD, Nonparametricmaximum-likelihood estimation of a mixingdistribution, J. Amer.Statist. Assoc.,73 (1978), pp. 805-811. [93] P. F. LAZARSFELD AND N. W. HENRY, LatentStructure Analysis, Houghton Mifflin Company, Boston, 1968. [94] M. LOEVE, ProbabilityTheory, Van Nostrand,New York,1963. [95] D. G. LUENBERGER, Optimizationby VectorSpace Methods,John Wiley, New York,1969. [96] P. D. M. 
MACDONALD, Commenton "An estimationprocedure for mixturesof distribution"by Choi and Bulgren,J. RoyalStatist. Soc. Ser. B, 33 (1971), pp. 326-329. [97] , Estimationof finitedistribution mixtures, in Applied Statistics,R. P. Gupta, ed., North- HollandPublishing Co., Amsterdam,1975. [98] C. L. MALLOWS, reviewof [100], Biometrics,18 (1962), p. 617. [99] J.S. MARITZ, EmpiricalBayes Methods,Methuen and Co., London,1970. [100] P. MEDGYESSY, Decompositionof Superpositionsof DistributionFunctions, Publishing House of the HungarianAcademy of Sciences, Budapest, 1961. [101] W. MENDENHALL AND R. J. HADER, Estimationof parametersof mixed exponentiallydistributed failure timedistributions from censored life test data, Biometrika, 45 (1958), pp. 504-520.


[102] H. MUENCH, Probabilitydistribution of protection test results, J. Amer.Statist. Assoc., 31 (1936), pp. 677-689. [103] , Discretefrequency distributions arising from mixtures of severalsingle probability values, J. Amer.Statist. Assoc., 33 (1938), pp. 390-398. [104] P. L. ODELL AND J. P. BASU, Concerningseveral methodsfor estimatingcrop acreages usingremote sensingdata, Comm.Statist. Theor. Meth., A5 (1976), pp. 1091-1114. [105] P. L. ODELL AND R. CHHIKARA, Estimationof a largearea cropacreage inventory using remote sensing technologyin Annual Report: StatisticalTheory and Methodologyfor Remote Sensing Data Analysis,Rep. NASA/JSC-09703,Univ. of Texas at Dallas, Dallas, TX, 1975. [106] T. ORCHARD AND M. A. WOODBURY, A missinginformation principle: theory and applications,in Proc. of the 6th BerkeleySymposium on MathematicalStatistics and Probability,1972, vol. 1, pp. 697-715. [107] J. M. ORTEGA AND W. C. RHEINBOLDT, IterativeSolution of NonlinearEquations inSeveral Variables, AcademicPress, New York,1970. [108] A. M. OSTROWSKI, Solutions of Equations and Systemsof Equations, Academic Press,New York, 1966. [109] K. PEARSON, Contributionsto the mathematicaltheory of evolution,Phil. Trans. Royal Soc., 185A (1894), pp. 71-110. [110] , On certain types of compoundfrequency distributions in which the componentscan be individuallydescribed by binomial series, Biometrika, 11 (1915-17), pp. 139-144. [111] K. PEARSON AND A. LEE, On thegeneralized probable error in multiplenormal correlation, Biometrika, 6 (1908-09), pp. 59-68. [112] B. C. PETERS, JR., AND W. A. COBERLY, The numericalevaluation of themaximum-likelihood estimate of mixtureproportions, Comm. Statist. Theor. Meth., A5 (1976), pp. 1127-1135. [113] B. C. PETERS, JR., AND H. F. WALKER, An iterativeprocedure for obtainingmaximum-likelihood estimatesof the parametersfor a mixtureof normal distributions,SIAM J. Appl. Math., 35 (1978), pp. 362-378. [114] , The Numerical evaluation of the maximum-likelihoodestimate of a subset of mixture proportions,SIAM J.Appl. Math.,35 (1978), pp. 447-452 [115] H. S. POLLARD, On the relativestability of the and the arithmeticmean, withparticular referenceto certainfrequency distributions which can be dissectedinto normal distributions, Ann. Math. Statist.,5 (1934), pp. 227-262. [116] E. J. PRESTON, A graphical methodfor the analysis of statistical distributionsinto two normal components,Biometrika, 40 (1953), pp.460-464. [117] R. E. QUANDT, A newapproach to estimatingswitching regressions, J. Amer. Statist. Assoc., 67 (1972), pp.306-310. [118] R. E. QUANDT AND J. B. RAMSEY, Estimatingmixtures of normal distributionsand switching regressions,J. Amer. Statist. Assoc., 73 (1978), pp. 730-738. [119] C. R. RAO, The utilizationof multiplemeasurements in problems of biologicalclassification. J. Royal Statist.Soc. Ser. B, 10 (1948), pp. 159-193. [120] R. A. REDNER, An iterativeprocedure for obtainingmaximum liklihood estimates in a mixturemodel, Rep. SR-T1-04081,NASA ContractNAS9-14689, Texas A&M Univ.,College Station, TX, Sept., 1980. [121] , Maximum-likelihoodestimation for mixturemodels, NASA Tech. Memorandum,to appear. [122] , Note on theconsistency of themaximum-likelihood estimate for nonidentifiabledistributions, Ann.Statist., 9 (1981), pp. 225-228. [123] P. R. RIDER, The methodof momentsapplied to a mixtureof twoexponential distributions, Ann. Math. Statist.,32 (1961), pp. 143-147. [124] , Estimatingthe parametersof mixed Poisson, binomial and Weibull distributionsby the methodof moments,Bull. 
Internat.Statist. Inst., 39 Part2 (1962), pp. 225-232. [125] C. A. ROBERTSON AND J.G. FRYER, The bias and accuracyof momentestimators, Biometrika, 57 Part 1 (1970), pp. 57-65. [126] J. W. SAMMON, JR., An adaptivetechnique for multiplesignal detectionand identification,in Pattern Recognition,L. N. Kanal, ed.,Thompson Book Co., London,1968, pp. 409-439. [127] W. SCHILLING, A frequencydistribution represented as thesum of twoPoisson distributions,J. Amer. Statist.Assoc., 42 (1947), pp. 407-424. [128] L. SIMAR, Maximum-liklihoodestimation of a compoundPoisson process, Ann. Statist.,4 (1976), pp. 1200-1209. [129] J. SITTIG, Superpositievan tweefrequentieverdelingen,Statistica, 2 (1948), pp. 206-227. [130] D. F. STANAT, Unsupervisedlearning of mixturesof probability functions, in PatternRecognition, L. N. Kanal, ed.,Thompson Book Co., London,1968, pp. 357-389.


[131] G. W. STEWART, Introductionto Matrix Computations,Academic Press, New York,1973. [132] B. STROMGREN, Tables and diagrams for dissectinga frequencycurve into componentsby the half-invariantmethod, Skand. Aktuarietidskrift,17 (1934), pp. 7-54. [133] R. SUNDBERG,Aaximum-likelihood theory and applicationsfor distributionsgenerated when observ- ing a functionof an exponentialfamily variable,Doctoral thesis,Inst. Math. Stat., Stockholm Univ.,Stockholm, Sweden, 1972. [134] , Maximumliklihood theory for incompletedata from an exponentialfamily,Scand. J.Statist., 1 (1974), pp. 49-58. [135] , An iterativemethod for solution of the likelihood equationsfor incompletedata from exponentialfamilies,Comm. Statist. Simulation Comput., B5, (1976), pp. 55-64. [136] G. M. TALLIS AND R. LIGHT, The use offractionalmoments for estimatingthe parameters of a mixed exponentialdistribution, Technometrics, 10 (1968), pp. 161-175. [137] W. Y. TAN AND W. C. CHANG, Convolutionapproach to geneticanalysis of quantitativecharacters of self-fertilizedpopulation, Biometrics, 28 (1972), pp. 1073-1090. [138] , Some comparisonsof the methodof momentsand the methodof maximum-likelihoodin estimatingparameters of a mixtureof twonormal densities, J. Amer. Statist. Assoc., 67 (1972), pp. 702-708. [139] R. D. TARONE AND G. GRUENHAGE, A noteon the uniquenessof rootsof the likelihoodequations for vector-valuedparameters, J. Amer. Statist. Assoc., 70 (1975), pp. 903-904. [140] M. TARTER AND A. SILVERS, Implementationand applicationsof bivariate Gaussian mixture decompo- sition,J. Amer. Statist. Assoc., 70 (1975), pp. 47-55. [141] H. TEICHER, On themixture of distributions,Ann. Math. Statist.,31 (1960), pp. 55-73. [142] ' Identifiabilityof mixtures,Ann. Math. Statist.,32 (1961), pp. 244-248. [143] , Identifiabilityoffinite mixtures, Ann. Math. Statist.,34 (1963), pp. 1265-1269. [144] , Identifiabilityof mixturesof productmeasures, Ann. Math. Statist.,38 (1967), pp. 1300- 1302. [145] D. M. TITTERINGTON, Some problems with data fromfinite mixturedistributions, Mathematics ResearchCenter, University of Wisconsin-Madison,Technical Summary Report 2369, Madison, WI, 1982. [146] , Minimumdistance non-parametric estimation of mixtureproportions, J. Royal Statist.Soc. Ser. B, 45 (1983), pp. 37-46. [147] J. D. TUBBS AND W. A. COBERLY, An empiricalsensitivity study of mixtureproportion estimators, Comm.Statist. Theor. Meth., A5 (1976), pp. 1 115-1125. [148] J. VAN RYZIN, ed., Classificationand Clustering:Proc. of an AdvancedSeminar Conducted by the MathematicsResearch Center,The Universityof Wisconsin,Madison (May, 1976), Academic Press,New York,1977. [149] Y. VARDi, Nonparametricestimation in renewalprocesses, Ann. Statist., 10 (1982), pp. 772-785. [150] A. WALD, Note on theconsistency of themaximum-likelihood estimate, Ann. Math. Statist., 20 (1949), pp.595-600. [151] H. F. WALKER, Estimatingthe proportions of twopopulations in a mixtureusing linear maps, Comm. Statist.Theor. Meth. A9 (1980), pp. 837-849. [152] K. WEICHSELBERGER, Uber ein graphischesVerfahren zur Trennungvon Mischverteilungenund zur Identifikationkupierter Normalverteilungen bie grossemStichprobenumfang, Metrika, 4 (1951), pp.178-229. [153] J. H. WOLFE, Patternclustering by multivariatemixture analysis, MultivariateBehavioral Res., 5 (1970), pp. 329-350. [154] J. WOLFOWITZ, The minimumdistance method, Ann Math. Statist.,28 (1957), pp. 75-88 [155] C.-F. Wu, On theconvergence of theEM algorithm,Ann. Statist., (1983), to appear. [156] S. J. 
YAKOWITZ, A consistentestimatorfor the identificationof finite mixtures, Ann. Math. Stat., 40 (1969), pp. 1728-1735. [157] , Unsupervisedlearning and theidentification of finite mixtures, IEEE Trans. Inform.Theory, IT-16 (1970), pp. 330-338. [158] S. J. YAKOWITZ AND J. D. SPRAGINS, On the identifiabilityof finite mixtures, Ann. Math. Statist.,39 (1968), pp. 209-214. [159] T. Y. YOUNG AND T. W. CALVERT, Classification,Estimation, and PatternRecognition, American Elsevier,New York, 1973. [160] T. Y. YOUNG AND G. CORALUPPI, Stochasticestimation of a mixtureof normaldensity functions using an informationcriterion, IEEE Trans. Inform.Theory, IT- 16 (1970), pp. 258-263. [161] S. ZACKS, The Theoryof , John Wiley, New York,1971. [162] W. ZANGWILL, NonlinearProgramming. A UnifiedApproach, Prentice-Hall, Englewood Cliffs, NJ (1969).
