
Fitting A Mixture Distribution to Data: Tutorial

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Aydin Ghojogh [email protected]

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray [email protected] Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Abstract

This paper is a step-by-step tutorial for fitting a mixture distribution to data. It merely assumes the reader has a background in calculus and linear algebra. Other required background is briefly reviewed before explaining the main algorithm. In explaining the main algorithm, first, fitting a mixture of two distributions is detailed, and examples of fitting two Gaussians and two Poissons, respectively for the continuous and discrete cases, are introduced. Thereafter, fitting several distributions in the general case is explained, and examples of several Gaussians (Gaussian Mixture Model) and Poissons are again provided. Model-based clustering, as one of the applications of mixture distributions, is also introduced. Numerical simulations are also provided for both the Gaussian and Poisson examples for the sake of better clarification.

1. Introduction

Every random variable can be considered as a sample from a distribution, whether a well-known distribution or a not very well-known (or "ugly") distribution. Some random variables are drawn from one single distribution, such as a normal distribution. But life is not always so easy! Most real-life random variables might have been generated from a mixture of several distributions rather than a single distribution. The mixture distribution is a weighted summation of K distributions {g_1(x; Θ_1), ..., g_K(x; Θ_K)} where the weights {w_1, ..., w_K} sum to one. As is obvious, every distribution in the mixture has its own parameter Θ_k. The mixture distribution is formulated as:

    f(x; Θ_1, ..., Θ_K) = \sum_{k=1}^{K} w_k g_k(x; Θ_k),    (1)
    subject to \sum_{k=1}^{K} w_k = 1.

The distributions can be from different families, for example from the beta and normal distributions. However, this makes the problem very complex and sometimes useless; therefore, usually the distributions in a mixture are from one family (e.g., all normal distributions) but with different parameters. This paper aims to find the parameters of the distributions in the mixture distribution f(x; Θ) as well as the weights (also called "mixing probabilities") w_k.

The remainder of the paper is organized as follows. Section 2 reviews some technical background required for explaining the main algorithm. Afterwards, the methodology of fitting a mixture distribution to data is explained in Section 3. In that section, first the mixture of two distributions, as a special case of mixture distributions, is introduced and analyzed. Then, the general mixture distribution is discussed. Meanwhile, examples of mixtures of Gaussians (for the continuous case) and Poissons (for the discrete case) are given for better clarification. Section 4 briefly introduces clustering as one of the applications of mixture distributions. In Section 5, the discussed methods are implemented through some simulations in order to give a better sense of how these algorithms work. Finally, Section 6 concludes the paper.
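As a concrete illustration of equation (1) (our own addition, not part of the original paper), the short sketch below evaluates and samples from an arbitrary two-component Gaussian mixture; it only assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.stats import norm

# An arbitrary two-component mixture, equation (1) with K = 2:
# f(x) = 0.3 * N(x; -2, 1^2) + 0.7 * N(x; 3, 0.5^2), with weights summing to one.
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 3.0])
stds = np.array([1.0, 0.5])

def mixture_pdf(x):
    """Weighted summation of the component densities, as in equation (1)."""
    x = np.asarray(x, dtype=float)
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, stds))

def mixture_sample(n, seed=0):
    """Draw the component index k with probability w_k, then sample from g_k."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[ks], stds[ks])

print(mixture_pdf([-2.0, 0.0, 3.0]))  # mixture density at a few points
print(mixture_sample(5))              # five draws from the mixture
```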

2. Background

This section reviews some technical background required for explaining the main algorithm. This review includes probability and Bayes rule, probability mass/density functions, expectation, maximum likelihood estimation, expectation maximization, and the Lagrange multiplier.

2.1. Probability and Bayes Rule

If S denotes the total sample space and A denotes an event in this sample space, the probability of event A is:

    P(A) = |A| / |S|.    (2)

The conditional probability, i.e., the probability of occurrence of event A given that event B happens, is:

    P(A|B) = P(A, B) / P(B)    (3)
           = P(B|A) P(A) / P(B),    (4)

where P(A|B), P(B|A), P(A), and P(B) are called the posterior, likelihood, prior, and marginal probabilities, respectively. If we assume that the event A consists of some cases A = {A_1, ..., A_n}, we can write:

    P(A_i|B) = P(B|A_i) P(A_i) / \sum_{j=1}^{n} P(B|A_j) P(A_j).    (5)

Equations (4) and (5) are two versions of Bayes rule.

2.2. Probability Mass/Density Function

In discrete cases, the probability mass function is defined as:

    f(x) = P(X = x),    (6)

where X and x are a random variable and a number, respectively. In continuous cases, the probability density function is:

    f(x) = \lim_{Δx → 0} P(x ≤ X ≤ x + Δx) / Δx = ∂P(X ≤ x) / ∂x.    (7)

In this work, by a mixture of distributions, we imply a mixture of mass/density functions.

2.3. Expectation

Expectation means the value of a random variable X on average. Therefore, expectation is a weighted average where the weights are the probabilities of the random variable X taking different values. In the discrete and continuous cases, the expectation is:

    E(X) = \sum_{dom x} x f(x),    (8)
    E(X) = \int_{dom x} x f(x) dx,    (9)

respectively, where dom x is the domain of X. The conditional expectation is defined as:

    E_{X|Y}(X|Y) = \sum_{dom x} x f(x|y),    (10)
    E_{X|Y}(X|Y) = \int_{dom x} x f(x|y) dx,    (11)

for the discrete and continuous cases, respectively.

2.4. Maximum Likelihood Estimation

Assume we have a sample of size n, i.e., {x_1, ..., x_n}. Also assume that we know the distribution from which this sample has been randomly drawn, but we do not know the parameters of that distribution. For example, we know it is drawn from a normal distribution but the mean and variance of this distribution are unknown. The goal is to estimate the parameters of the distribution using the sample {x_1, ..., x_n} available from it. This estimation of parameters from the available sample is called "point estimation". One of the approaches for point estimation is Maximum Likelihood Estimation (MLE). As is obvious from its name, MLE deals with the likelihood of the data.

We postulate that the values of the sample, i.e., x_1, ..., x_n, are independent random variates of data having the sample distribution. In other words, the data have a joint distribution f_X(x_1, ..., x_n | Θ) with parameter Θ and we assume the variates are independent and identically distributed (iid), i.e., x_i ~ f_X(x_i; Θ) with the same parameter Θ. Considering the Bayes rule, equation (4), we have:

    L(Θ | x_1, ..., x_n) = f_X(x_1, ..., x_n | Θ) π(Θ) / f_X(x_1, ..., x_n).    (12)

The MLE aims to find the parameter Θ which maximizes the likelihood:

    Θ̂ = arg max_Θ L(Θ).    (13)

According to the definition, the likelihood can be written as:

    L(Θ | x_1, ..., x_n) := f(x_1, ..., x_n; Θ) =(a) \prod_{i=1}^{n} f(x_i; Θ),    (14)

where (a) is because the x_1, ..., x_n are iid. Note that in the literature, L(Θ | x_1, ..., x_n) is also denoted by L(Θ) for simplicity.

Usually, for more convenience, we use the log-likelihood rather than the likelihood:

    ℓ(Θ) := log L(Θ)    (15)
          = log \prod_{i=1}^{n} f(x_i; Θ) = \sum_{i=1}^{n} log f(x_i; Θ).    (16)

Often, the logarithm is the natural logarithm for the sake of compatibility with the exponential in the well-known normal density function. Notice that as the logarithm is a monotonic function, it does not change the location of the maximum of the likelihood.

2.5. Expectation Maximization

Sometimes, the data are not fully observable. For example, the data may be known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease, but for the convenience of the patients participating in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as X_i > 0 does not tell us whether X_i = 2 or X_i = 1000.

In this case, MLE cannot be directly applied because we do not have access to the complete information and some data are missing. In this case, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data is missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.
– You are right! We can take the mean of the log-likelihood over the possible values of the missing data. Then everything in the log-likelihood will be known, and then...
– And then we can do MLE!

Assume D^(obs) and D^(miss) denote the observed data (the X_i's = 0 in the above example) and the missing data (the X_i's > 0 in the above example). The EM algorithm includes two main steps, i.e., the E-step and the M-step.

In the E-step, the expectation of the log-likelihood (equation (15)) is taken with respect to the missing data D^(miss) in order to have a mean estimation of it. Let Q(Θ) denote the expectation of the log-likelihood with respect to D^(miss):

    Q(Θ) := E_{D^(miss) | D^(obs), Θ}[ℓ(Θ)].    (17)

Note that in the above expectation, D^(obs) and Θ are conditioned on, so they are treated as constants and not random variables.

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., Q(Θ); therefore:

    Θ̂ = arg max_Θ Q(Θ).    (18)

These two steps are iteratively repeated until convergence of the estimated parameters Θ̂.

2.6. Lagrange Multiplier

Suppose we have a multivariate function Q(Θ_1, ..., Θ_K) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained, and its constraint is the equality P(Θ_1, ..., Θ_K) = c where c is a constant. So, the constrained optimization problem is:

    maximize_{Θ_1, ..., Θ_K} Q(Θ_1, ..., Θ_K),
    subject to P(Θ_1, ..., Θ_K) = c.    (19)

For solving this problem, we can introduce a new variable α which is called the "Lagrange multiplier". Also, a new function L(Θ_1, ..., Θ_K, α), called the "Lagrangian", is introduced:

    L(Θ_1, ..., Θ_K, α) = Q(Θ_1, ..., Θ_K) − α (P(Θ_1, ..., Θ_K) − c).    (20)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004): we set

    ∇_{Θ_1, ..., Θ_K, α} L = 0,    (21)

which gives us:

    ∇_{Θ_1, ..., Θ_K} L = 0  ⟹  ∇_{Θ_1, ..., Θ_K} Q = α ∇_{Θ_1, ..., Θ_K} P,
    ∇_α L = 0  ⟹  P(Θ_1, ..., Θ_K) = c.
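Section 2.6 can be made concrete with a small numerical check (our own illustration, not part of the paper): maximize Q(w_1, ..., w_K) = \sum_k c_k log w_k subject to \sum_k w_k = 1. The Lagrangian condition gives the closed form w_k = c_k / \sum_j c_j, which is exactly the structure that later yields the weight update in equation (41). The sketch assumes NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: Q(w) = sum_k c_k * log(w_k), to be maximized
# subject to the equality constraint sum_k w_k = 1.
c = np.array([2.0, 5.0, 3.0])

def neg_Q(w):
    # minimizing the negative objective is equivalent to maximizing Q
    return -np.sum(c * np.log(w))

constraint = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
bounds = [(1e-9, 1.0)] * len(c)          # keep each w_k strictly positive
w0 = np.full(len(c), 1.0 / len(c))       # feasible starting point

result = minimize(neg_Q, w0, method="SLSQP", bounds=bounds, constraints=[constraint])

print("numerical solution :", result.x)
print("Lagrangian solution:", c / c.sum())   # closed form: w_k = c_k / sum_j c_j
```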
3. Fitting a Mixture Distribution

As was mentioned in the introduction, the goal of fitting a mixture distribution is to find the parameters and weights of a weighted summation of distributions (see equation (1)). First, as a special case of mixture distributions, we work on a mixture of two distributions, and then we discuss the general mixture of distributions.

3.1. Mixture of Two Distributions

Assume that we want to fit a mixture of two distributions g_1(x; Θ_1) and g_2(x; Θ_2) to the data. Note that, in theory, these two distributions are not necessarily from the same distribution family. As we have only two distributions in the mixture, equation (1) is simplified to:

    f(x; Θ_1, Θ_2) = w g_1(x; Θ_1) + (1 − w) g_2(x; Θ_2).    (22)

Note that the parameter w (or w_k in general) is called the "mixing probability" (Friedman et al., 2009) and is sometimes denoted by π (or π_k in general) in the literature.

The likelihood and log-likelihood for this mixture are:

    L(Θ_1, Θ_2) = f(x_1, ..., x_n; Θ_1, Θ_2) =(a) \prod_{i=1}^{n} f(x_i; Θ_1, Θ_2)
                = \prod_{i=1}^{n} [ w g_1(x_i; Θ_1) + (1 − w) g_2(x_i; Θ_2) ],

    ℓ(Θ_1, Θ_2) = \sum_{i=1}^{n} log[ w g_1(x_i; Θ_1) + (1 − w) g_2(x_i; Θ_2) ],

where (a) is because of the assumption that x_1, ..., x_n are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. However, we can use a nice trick here (Friedman et al., 2009). Let Δ_i be defined as:

    Δ_i := 1 if x_i belongs to g_1(x; Θ_1),
           0 if x_i belongs to g_2(x; Θ_2),

and let its probability be:

    P(Δ_i = 1) = w,
    P(Δ_i = 0) = 1 − w.

Therefore, the log-likelihood can be written as:

    ℓ(Θ_1, Θ_2) = \sum_{i=1}^{n} { log[w g_1(x_i; Θ_1)]        if Δ_i = 1,
                                   log[(1 − w) g_2(x_i; Θ_2)]  if Δ_i = 0.

The above expression can be restated as:

    ℓ(Θ_1, Θ_2) = \sum_{i=1}^{n} [ Δ_i log(w g_1(x_i; Θ_1)) + (1 − Δ_i) log((1 − w) g_2(x_i; Θ_2)) ].

The Δ_i here is the incomplete (missing) datum because we do not know whether Δ_i = 0 or Δ_i = 1 for x_i. Hence, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

    Q(Θ_1, Θ_2) = \sum_{i=1}^{n} [ E[Δ_i | X, Θ_1, Θ_2] log(w g_1(x_i; Θ_1))
                                   + E[(1 − Δ_i) | X, Θ_1, Θ_2] log((1 − w) g_2(x_i; Θ_2)) ].

Notice that the above expression is linear with respect to Δ_i, and that is why the two logarithms could be factored out of the expectations.

Assume γ̂_i := E[Δ_i | X, Θ_1, Θ_2], which is called the "responsibility" of x_i (Friedman et al., 2009). The Δ_i is either 0 or 1; therefore:

    E[Δ_i | X, Θ_1, Θ_2] = 0 × P(Δ_i = 0 | X, Θ_1, Θ_2) + 1 × P(Δ_i = 1 | X, Θ_1, Θ_2)
                         = P(Δ_i = 1 | X, Θ_1, Θ_2).

According to Bayes rule (equation (5)), we have:

    P(Δ_i = 1 | X, Θ_1, Θ_2) = P(X, Θ_1, Θ_2, Δ_i = 1) / P(X; Θ_1, Θ_2)
        = P(X, Θ_1, Θ_2 | Δ_i = 1) P(Δ_i = 1) / \sum_{j=0}^{1} P(X, Θ_1, Θ_2 | Δ_i = j) P(Δ_i = j).

The marginal probability in the denominator is:

    P(X; Θ_1, Θ_2) = w g_1(x_i; Θ_1) + (1 − w) g_2(x_i; Θ_2).

Thus:

    γ̂_i = ŵ g_1(x_i; Θ_1) / [ ŵ g_1(x_i; Θ_1) + (1 − ŵ) g_2(x_i; Θ_2) ],    (23)

and

    Q(Θ_1, Θ_2) = \sum_{i=1}^{n} [ γ̂_i log(w g_1(x_i; Θ_1)) + (1 − γ̂_i) log((1 − w) g_2(x_i; Θ_2)) ].    (24)

Some simplification of Q(Θ_1, Θ_2) will help in the next step:

    Q(Θ_1, Θ_2) = \sum_{i=1}^{n} [ γ̂_i log w + γ̂_i log g_1(x_i; Θ_1) + (1 − γ̂_i) log(1 − w) + (1 − γ̂_i) log g_2(x_i; Θ_2) ].

The M-step in EM:

    Θ̂_1, Θ̂_2, ŵ = arg max_{Θ_1, Θ_2, w} Q(Θ_1, Θ_2, w).

Note that the function Q(Θ_1, Θ_2) is also a function of w, and that is why we write it as Q(Θ_1, Θ_2, w). Setting the derivatives to zero gives:

    ∂Q/∂Θ_1 = \sum_{i=1}^{n} [ (γ̂_i / g_1(x_i; Θ_1)) ∂g_1(x_i; Θ_1)/∂Θ_1 ] = 0,    (25)
    ∂Q/∂Θ_2 = \sum_{i=1}^{n} [ ((1 − γ̂_i) / g_2(x_i; Θ_2)) ∂g_2(x_i; Θ_2)/∂Θ_2 ] = 0,    (26)
    ∂Q/∂w = \sum_{i=1}^{n} [ γ̂_i (1/w) + (1 − γ̂_i)(−1/(1 − w)) ] = 0
        ⟹ ŵ = (1/n) \sum_{i=1}^{n} γ̂_i.    (27)

So, the mixing probability is the average of the responsibilities, which makes sense. Solving equations (25), (26), and (27) gives us the estimations Θ̂_1, Θ̂_2, and ŵ in every iteration.

The iterative algorithm for finding the parameters of the mixture of two distributions is shown in Algorithm 1.

Algorithm 1: Fitting a Mixture of Two Distributions
1  START: Initialize Θ̂_1, Θ̂_2, ŵ
2  while not converged do
3      // E-step in EM:
4      for i from 1 to n do
5          γ̂_i ← equation (23)
6      // M-step in EM:
7      Θ̂_1 ← equation (25)
8      Θ̂_2 ← equation (26)
9      ŵ ← equation (27)
10     // Check convergence:
11     Compare Θ̂_1, Θ̂_2, and ŵ with their values in the previous iteration

3.1.1. Mixture of Two Gaussians

Here, we consider a mixture of two one-dimensional Gaussian distributions as an example of a mixture of two continuous distributions. In this case, we have:

    g_1(x; µ_1, σ_1²) = (1 / \sqrt{2π σ_1²}) exp(−(x − µ_1)² / (2σ_1²)) = φ((x − µ_1)/σ_1),
    g_2(x; µ_2, σ_2²) = (1 / \sqrt{2π σ_2²}) exp(−(x − µ_2)² / (2σ_2²)) = φ((x − µ_2)/σ_2),

where φ(x) is the probability density function of the normal distribution. Therefore, equation (22) becomes:

    f(x; µ_1, µ_2, σ_1², σ_2²) = w φ((x − µ_1)/σ_1) + (1 − w) φ((x − µ_2)/σ_2).    (28)

Equation (23) becomes:

    γ̂_i = ŵ φ((x_i − µ_1)/σ_1) / [ ŵ φ((x_i − µ_1)/σ_1) + (1 − ŵ) φ((x_i − µ_2)/σ_2) ].    (29)

The Q(µ_1, µ_2, σ_1², σ_2²) is:

    Q(µ_1, µ_2, σ_1², σ_2²) = \sum_{i=1}^{n} [ γ̂_i log w
        + γ̂_i (−(1/2) log(2π) − log σ_1 − (x_i − µ_1)²/(2σ_1²))
        + (1 − γ̂_i) log(1 − w)
        + (1 − γ̂_i)(−(1/2) log(2π) − log σ_2 − (x_i − µ_2)²/(2σ_2²)) ].

Therefore:

    ∂Q/∂µ_1 = \sum_{i=1}^{n} γ̂_i (x_i − µ_1)/σ_1² = 0  ⟹  µ̂_1 = \sum_{i=1}^{n} γ̂_i x_i / \sum_{i=1}^{n} γ̂_i,    (30)
    ∂Q/∂µ_2 = \sum_{i=1}^{n} (1 − γ̂_i)(x_i − µ_2)/σ_2² = 0  ⟹  µ̂_2 = \sum_{i=1}^{n} (1 − γ̂_i) x_i / \sum_{i=1}^{n} (1 − γ̂_i),    (31)
    ∂Q/∂σ_1 = \sum_{i=1}^{n} γ̂_i (−1/σ_1 + (x_i − µ_1)²/σ_1³) = 0  ⟹  σ̂_1² = \sum_{i=1}^{n} γ̂_i (x_i − µ̂_1)² / \sum_{i=1}^{n} γ̂_i,    (32)
    ∂Q/∂σ_2 = \sum_{i=1}^{n} (1 − γ̂_i)(−1/σ_2 + (x_i − µ_2)²/σ_2³) = 0  ⟹  σ̂_2² = \sum_{i=1}^{n} (1 − γ̂_i)(x_i − µ̂_2)² / \sum_{i=1}^{n} (1 − γ̂_i),    (33)

and ŵ is the same as equation (27).

Iteratively solving equations (29), (30), (31), (32), (33), and (27) using Algorithm 1 gives us the estimations for µ̂_1, µ̂_2, σ̂_1, σ̂_2, and ŵ in equation (28).
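To make Algorithm 1 concrete for this example, here is a minimal NumPy/SciPy sketch of EM for two Gaussians using equations (29)–(33) and (27); the function name, the initialization, and the convergence test are our own choices, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def fit_two_gaussians(x, n_iter=300, tol=1e-6, seed=0):
    """EM for w*N(mu1, s1^2) + (1-w)*N(mu2, s2^2), following Algorithm 1."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.choice(x, size=2, replace=False)   # simple data-driven initialization
    s1 = s2 = np.std(x)
    w = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities, equation (29)
        p1 = w * norm.pdf(x, mu1, s1)
        p2 = (1.0 - w) * norm.pdf(x, mu2, s2)
        gamma = p1 / (p1 + p2)
        # M-step: equations (30)-(33) and (27)
        mu1_new = np.sum(gamma * x) / np.sum(gamma)
        mu2_new = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        s1_new = np.sqrt(np.sum(gamma * (x - mu1_new) ** 2) / np.sum(gamma))
        s2_new = np.sqrt(np.sum((1 - gamma) * (x - mu2_new) ** 2) / np.sum(1 - gamma))
        w_new = np.mean(gamma)
        # convergence check: largest parameter change
        change = np.max(np.abs([mu1_new - mu1, mu2_new - mu2,
                                s1_new - s1, s2_new - s2, w_new - w]))
        mu1, mu2, s1, s2, w = mu1_new, mu2_new, s1_new, s2_new, w_new
        if change < tol:
            break
    return mu1, mu2, s1, s2, w

# usage on synthetic data
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 400), rng.normal(3, 2, 600)])
print(fit_two_gaussians(x))
```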

3.1.2. Mixture of Two Poissons

Here, we consider a mixture of two Poisson distributions as an example of a mixture of two discrete distributions. In this case, we have:

    g_1(x; λ_1) = e^{−λ_1} λ_1^x / x!,
    g_2(x; λ_2) = e^{−λ_2} λ_2^x / x!,

therefore, equation (22) becomes:

    f(x; λ_1, λ_2) = w e^{−λ_1} λ_1^x / x! + (1 − w) e^{−λ_2} λ_2^x / x!.    (34)

Equation (23) becomes:

    γ̂_i = ŵ (e^{−λ̂_1} λ̂_1^{x_i} / x_i!) / [ ŵ (e^{−λ̂_1} λ̂_1^{x_i} / x_i!) + (1 − ŵ)(e^{−λ̂_2} λ̂_2^{x_i} / x_i!) ].    (35)

The Q(λ_1, λ_2) is:

    Q(λ_1, λ_2) = \sum_{i=1}^{n} [ γ̂_i log w + γ̂_i (−λ_1 + x_i log λ_1 − log x_i!)
                                   + (1 − γ̂_i) log(1 − w) + (1 − γ̂_i)(−λ_2 + x_i log λ_2 − log x_i!) ].

Therefore:

    ∂Q/∂λ_1 = \sum_{i=1}^{n} γ̂_i (−1 + x_i/λ_1) = 0  ⟹  λ̂_1 = \sum_{i=1}^{n} γ̂_i x_i / \sum_{i=1}^{n} γ̂_i,    (36)
    ∂Q/∂λ_2 = \sum_{i=1}^{n} (1 − γ̂_i)(−1 + x_i/λ_2) = 0  ⟹  λ̂_2 = \sum_{i=1}^{n} (1 − γ̂_i) x_i / \sum_{i=1}^{n} (1 − γ̂_i),    (37)

and ŵ is the same as equation (27).

Iteratively solving equations (35), (36), (37), and (27) using Algorithm 1 gives us the estimations for λ̂_1, λ̂_2, and ŵ in equation (34).
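An analogous sketch for the two-Poisson case, applying Algorithm 1 with equations (35)–(37) and (27); again, the initialization and stopping rule are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import poisson

def fit_two_poissons(x, n_iter=300, tol=1e-8):
    """EM for w*Poisson(lam1) + (1-w)*Poisson(lam2), following Algorithm 1."""
    lam1, lam2 = np.percentile(x, 25), np.percentile(x, 75)  # rough initialization
    w = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities, equation (35)
        p1 = w * poisson.pmf(x, lam1)
        p2 = (1.0 - w) * poisson.pmf(x, lam2)
        gamma = p1 / (p1 + p2)
        # M-step: equations (36), (37), and (27)
        lam1_new = np.sum(gamma * x) / np.sum(gamma)
        lam2_new = np.sum((1 - gamma) * x) / np.sum(1 - gamma)
        w_new = np.mean(gamma)
        change = max(abs(lam1_new - lam1), abs(lam2_new - lam2), abs(w_new - w))
        lam1, lam2, w = lam1_new, lam2_new, w_new
        if change < tol:
            break
    return lam1, lam2, w

# usage on synthetic counts
rng = np.random.default_rng(2)
x = np.concatenate([rng.poisson(2.0, 500), rng.poisson(9.0, 500)])
print(fit_two_poissons(x))
```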

3.2. Mixture of Several Distributions

Now, assume a more general case where we want to fit a mixture of K distributions g_1(x; Θ_1), ..., g_K(x; Θ_K) to the data. Again, in theory, these K distributions are not necessarily from the same distribution family. For the convenience of the reader, equation (1) is repeated here:

    f(x; Θ_1, ..., Θ_K) = \sum_{k=1}^{K} w_k g_k(x; Θ_k),
    subject to \sum_{k=1}^{K} w_k = 1.

The likelihood and log-likelihood for this mixture are:

    L(Θ_1, ..., Θ_K) = f(x_1, ..., x_n; Θ_1, ..., Θ_K) =(a) \prod_{i=1}^{n} f(x_i; Θ_1, ..., Θ_K)
                     = \prod_{i=1}^{n} \sum_{k=1}^{K} w_k g_k(x_i; Θ_k),

    ℓ(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} log[ \sum_{k=1}^{K} w_k g_k(x_i; Θ_k) ],

where (a) is because of the assumption that x_1, ..., x_n are iid. Optimizing this log-likelihood is difficult because of the summation within the logarithm. We use the same trick as the one mentioned for the mixture of two distributions:

    Δ_{i,k} := 1 if x_i belongs to g_k(x; Θ_k),
               0 otherwise,

and its probability is:

    P(Δ_{i,k} = 1) = w_k,
    P(Δ_{i,k} = 0) = 1 − w_k.

Therefore, the log-likelihood can be written as:

    ℓ(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} { log[w_1 g_1(x_i; Θ_1)]   if Δ_{i,1} = 1 and Δ_{i,k} = 0 ∀k ≠ 1,
                                        log[w_2 g_2(x_i; Θ_2)]   if Δ_{i,2} = 1 and Δ_{i,k} = 0 ∀k ≠ 2,
                                        ...
                                        log[w_K g_K(x_i; Θ_K)]   if Δ_{i,K} = 1 and Δ_{i,k} = 0 ∀k ≠ K.

The above expression can be restated as:

    ℓ(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} Δ_{i,k} log[w_k g_k(x_i; Θ_k)].

The Δ_{i,k} here is the incomplete (missing) datum because we do not know whether Δ_{i,k} = 0 or Δ_{i,k} = 1 for x_i and a specific k. Therefore, using the EM algorithm, we try to estimate it by its expectation.

The E-step in EM:

    Q(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} E[Δ_{i,k} | X, Θ_1, ..., Θ_K] log[w_k g_k(x_i; Θ_k)].

The Δ_{i,k} is either 0 or 1; therefore:

    E[Δ_{i,k} | X, Θ_1, ..., Θ_K] = 0 × P(Δ_{i,k} = 0 | X, Θ_1, ..., Θ_K) + 1 × P(Δ_{i,k} = 1 | X, Θ_1, ..., Θ_K)
                                  = P(Δ_{i,k} = 1 | X, Θ_1, ..., Θ_K).

According to Bayes rule (equation (5)), we have:

    P(Δ_{i,k} = 1 | X, Θ_1, ..., Θ_K) = P(X, Θ_1, ..., Θ_K, Δ_{i,k} = 1) / P(X; Θ_1, ..., Θ_K)
        = P(X, Θ_1, ..., Θ_K | Δ_{i,k} = 1) P(Δ_{i,k} = 1) / \sum_{k'=1}^{K} P(X, Θ_1, ..., Θ_K | Δ_{i,k'} = 1) P(Δ_{i,k'} = 1).

The marginal probability in the denominator is:

    P(X; Θ_1, ..., Θ_K) = \sum_{k'=1}^{K} w_{k'} g_{k'}(x_i; Θ_{k'}).

Assuming that γ̂_{i,k} := E[Δ_{i,k} | X, Θ_1, ..., Θ_K] (called the responsibility of x_i), we have:

    γ̂_{i,k} = ŵ_k g_k(x_i; Θ_k) / \sum_{k'=1}^{K} ŵ_{k'} g_{k'}(x_i; Θ_{k'}),    (38)

and

    Q(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} γ̂_{i,k} log[w_k g_k(x_i; Θ_k)].    (39)

Some simplification of Q(Θ_1, ..., Θ_K) will help in the next step:

    Q(Θ_1, ..., Θ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k + γ̂_{i,k} log g_k(x_i; Θ_k) ].

The M-step in EM:

    Θ̂_k, ŵ_k = arg max_{Θ_k, w_k} Q(Θ_1, ..., Θ_K, w_1, ..., w_K),
    subject to \sum_{k=1}^{K} w_k = 1.

Note that the function Q(Θ_1, ..., Θ_K) is also a function of w_1, ..., w_K, and that is why we write it as Q(Θ_1, ..., Θ_K, w_1, ..., w_K). The above problem is a constrained optimization problem:

    maximize_{Θ_k, w_k} Q(Θ_1, ..., Θ_K, w_1, ..., w_K),
    subject to \sum_{k=1}^{K} w_k = 1,

which can be solved using a Lagrange multiplier (see Section 2.6):

    L(Θ_1, ..., Θ_K, w_1, ..., w_K, α) = Q(Θ_1, ..., Θ_K, w_1, ..., w_K) − α (\sum_{k=1}^{K} w_k − 1)
        = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k + γ̂_{i,k} log g_k(x_i; Θ_k) ] − α (\sum_{k=1}^{K} w_k − 1).

Setting the derivatives to zero gives:

    ∂L/∂Θ_k = \sum_{i=1}^{n} (γ̂_{i,k} / g_k(x_i; Θ_k)) ∂g_k(x_i; Θ_k)/∂Θ_k = 0,    (40)

    ∂L/∂w_k = \sum_{i=1}^{n} γ̂_{i,k}/w_k − α = 0  ⟹  w_k = (1/α) \sum_{i=1}^{n} γ̂_{i,k},
    ∂L/∂α = \sum_{k=1}^{K} w_k − 1 = 0  ⟹  \sum_{k=1}^{K} w_k = 1,
    ∴ (1/α) \sum_{k=1}^{K} \sum_{i=1}^{n} γ̂_{i,k} = 1  ⟹  α = \sum_{i=1}^{n} \sum_{k=1}^{K} γ̂_{i,k},
    ∴ ŵ_k = \sum_{i=1}^{n} γ̂_{i,k} / \sum_{i=1}^{n} \sum_{k'=1}^{K} γ̂_{i,k'}.    (41)

Solving equations (40) and (41) gives us the estimations Θ̂_k and ŵ_k (for k ∈ {1, ..., K}) in every iteration.

The iterative algorithm for finding the parameters of the mixture of several distributions is shown in Algorithm 2.

Algorithm 2: Fitting a Mixture of Several Distributions
1  START: Initialize Θ̂_1, ..., Θ̂_K, ŵ_1, ..., ŵ_K
2  while not converged do
3      // E-step in EM:
4      for i from 1 to n do
5          for k from 1 to K do
6              γ̂_{i,k} ← equation (38)
7      // M-step in EM:
8      for k from 1 to K do
9          Θ̂_k ← equation (40)
10         ŵ_k ← equation (41)
11     // Check convergence:
12     Compare Θ̂_1, ..., Θ̂_K and ŵ_1, ..., ŵ_K with their values in the previous iteration

3.2.1. Mixture of Several Gaussians

Here, we consider a mixture of K one-dimensional Gaussian distributions as an example of a mixture of several continuous distributions. In this case, we have:

    g_k(x; µ_k, σ_k²) = (1 / \sqrt{2π σ_k²}) exp(−(x − µ_k)² / (2σ_k²)) = φ((x − µ_k)/σ_k),  ∀k ∈ {1, ..., K}.

Therefore, equation (1) becomes:

    f(x; µ_1, ..., µ_K, σ_1², ..., σ_K²) = \sum_{k=1}^{K} w_k φ((x − µ_k)/σ_k).    (42)

Equation (38) becomes:

    γ̂_{i,k} = ŵ_k φ((x_i − µ_k)/σ_k) / \sum_{k'=1}^{K} ŵ_{k'} φ((x_i − µ_{k'})/σ_{k'}).    (43)

The Q(µ_1, ..., µ_K, σ_1², ..., σ_K²) is:

    Q(µ_1, ..., µ_K, σ_1², ..., σ_K²) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k
        + γ̂_{i,k} (−(1/2) log(2π) − log σ_k − (x_i − µ_k)²/(2σ_k²)) ].

The Lagrangian is:

    L(µ_1, ..., µ_K, σ_1², ..., σ_K², w_1, ..., w_K, α) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k
        + γ̂_{i,k} (−(1/2) log(2π) − log σ_k − (x_i − µ_k)²/(2σ_k²)) ] − α (\sum_{k=1}^{K} w_k − 1).

Therefore:

    ∂L/∂µ_k = \sum_{i=1}^{n} γ̂_{i,k} (x_i − µ_k)/σ_k² = 0  ⟹  µ̂_k = \sum_{i=1}^{n} γ̂_{i,k} x_i / \sum_{i=1}^{n} γ̂_{i,k},    (44)
    ∂L/∂σ_k = \sum_{i=1}^{n} γ̂_{i,k} (−1/σ_k + (x_i − µ_k)²/σ_k³) = 0  ⟹  σ̂_k² = \sum_{i=1}^{n} γ̂_{i,k} (x_i − µ̂_k)² / \sum_{i=1}^{n} γ̂_{i,k},    (45)

and ŵ_k is the same as equation (41).

Iteratively solving equations (43), (44), (45), and (41) using Algorithm 2 gives us the estimations for µ̂_1, ..., µ̂_K, σ̂_1, ..., σ̂_K, and ŵ_1, ..., ŵ_K in equation (42).
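The following is a minimal NumPy sketch of Algorithm 2 specialized to K univariate Gaussians, i.e., the E-step of equation (43) and the M-step of equations (44), (45), and (41); the function name and the convergence test are ours, and the random initialization follows the scheme suggested later in Section 5 (equations (54)–(56)).

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_mixture_1d(x, K, n_iter=500, tol=1e-6, seed=0):
    """EM for a K-component univariate Gaussian mixture (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(x.min(), x.max(), size=K)        # random means within the data range
    sigma = np.full(K, (x.max() - x.min()) / 6.0)     # spread of the data is roughly 6*sigma
    w = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities, equation (43); array of shape (n, K)
        dens = w * norm.pdf(x[:, None], mu, sigma)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: equations (44), (45), and (41)
        Nk = gamma.sum(axis=0)
        mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma_new = np.sqrt((gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk)
        w_new = Nk / Nk.sum()                          # equals Nk / n
        change = max(np.abs(mu_new - mu).max(),
                     np.abs(sigma_new - sigma).max(),
                     np.abs(w_new - w).max())
        mu, sigma, w = mu_new, sigma_new, w_new
        if change < tol:
            break
    return mu, sigma, w
```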

3.2.2. Multivariate Mixture of Gaussians

The data might be multivariate (x ∈ R^d), and in this case the Gaussian distributions in the mixture model should be multi-dimensional. We consider a mixture of K multivariate Gaussian distributions. In this case, we have:

    g_k(x; µ_k, Σ_k) = (1 / \sqrt{(2π)^d |Σ_k|}) exp(−(x − µ_k)^⊤ Σ_k^{−1} (x − µ_k) / 2),  ∀k ∈ {1, ..., K},

where |Σ_k| is the determinant of Σ_k. Therefore, equation (1) becomes:

    f(x; µ_1, ..., µ_K, Σ_1, ..., Σ_K) = \sum_{k=1}^{K} w_k g_k(x; µ_k, Σ_k).    (46)

Equation (38) becomes:

    γ̂_{i,k} = ŵ_k g_k(x_i; µ_k, Σ_k) / \sum_{k'=1}^{K} ŵ_{k'} g_{k'}(x_i; µ_{k'}, Σ_{k'}),    (47)

where x_1, ..., x_n ∈ R^d, µ_1, ..., µ_K ∈ R^d, Σ_1, ..., Σ_K ∈ R^{d×d}, ŵ_k ∈ R, and γ̂_{i,k} ∈ R.

The Q(µ_1, ..., µ_K, Σ_1, ..., Σ_K) is:

    Q(µ_1, ..., µ_K, Σ_1, ..., Σ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k
        + γ̂_{i,k} (−(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) tr((x_i − µ_k)^⊤ Σ_k^{−1} (x_i − µ_k))) ],

where tr(·) denotes the trace of a matrix. The trace is used here because (x_i − µ_k)^⊤ Σ_k^{−1} (x_i − µ_k) is a scalar, so it is equal to its trace.

The Lagrangian is:

    L(µ_1, ..., µ_K, Σ_1, ..., Σ_K, w_1, ..., w_K, α) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k
        + γ̂_{i,k} (−(d/2) log(2π) − (1/2) log |Σ_k| − (1/2) tr((x_i − µ_k)^⊤ Σ_k^{−1} (x_i − µ_k))) ]
        − α (\sum_{k=1}^{K} w_k − 1).

Therefore:

    ∂L/∂µ_k = \sum_{i=1}^{n} γ̂_{i,k} Σ_k^{−1} (x_i − µ_k) = 0 ∈ R^d
        =(a)⟹ \sum_{i=1}^{n} γ̂_{i,k} (x_i − µ_k) = 0
        ⟹ µ̂_k = \sum_{i=1}^{n} γ̂_{i,k} x_i / \sum_{i=1}^{n} γ̂_{i,k} ∈ R^d,    (48)

    ∂L/∂Σ_k =(b) \sum_{i=1}^{n} γ̂_{i,k} ( −(1/2) Σ_k^{−1} + (1/2) Σ_k^{−1} (x_i − µ_k)(x_i − µ_k)^⊤ Σ_k^{−1} ) = 0 ∈ R^{d×d}
        ⟹ Σ_k \sum_{i=1}^{n} γ̂_{i,k} = \sum_{i=1}^{n} γ̂_{i,k} (x_i − µ_k)(x_i − µ_k)^⊤   (pre- and post-multiplying by Σ_k)
        ⟹ Σ̂_k = \sum_{i=1}^{n} γ̂_{i,k} (x_i − µ_k)(x_i − µ_k)^⊤ / \sum_{i=1}^{n} γ̂_{i,k} ∈ R^{d×d},    (49)

and ŵ_k ∈ R is the same as equation (41). In the above expressions, (a) is because Σ_k^{−1} is nonsingular and does not depend on i, so it can be factored out of the summation and cancelled (note that γ̂_{i,k} is a scalar), and (b) is because ∂ log |Σ_k| / ∂Σ_k = Σ_k^{−1}, tr((x_i − µ_k)^⊤ Σ_k^{−1} (x_i − µ_k)) = tr(Σ_k^{−1} (x_i − µ_k)(x_i − µ_k)^⊤), and ∂ tr(Σ_k^{−1} A) / ∂Σ_k = −Σ_k^{−1} A Σ_k^{−1}.

Iteratively solving equations (47), (48), (49), and (41) using Algorithm 2 gives us the estimations for µ̂_1, ..., µ̂_K, Σ̂_1, ..., Σ̂_K, and ŵ_1, ..., ŵ_K in equation (46). The multivariate mixture of Gaussians is also discussed in (Lee & Scott, 2012). Moreover, note that the mixture of Gaussians is also referred to as the Gaussian Mixture Model (GMM) in the literature.
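In practice, the multivariate updates (47)–(49) and (41) are available in standard libraries; for example, scikit-learn's GaussianMixture class runs the same EM procedure. A minimal sketch (the synthetic data and settings below are our own, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data from two Gaussian clusters (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=400),
    rng.multivariate_normal([5, 5], [[1.5, -0.4], [-0.4, 0.8]], size=600),
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print(gmm.weights_)      # estimates of w_k, cf. equation (41)
print(gmm.means_)        # estimates of mu_k, cf. equation (48)
print(gmm.covariances_)  # estimates of Sigma_k, cf. equation (49)
```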

3.2.3. Mixture of Several Poissons

Here, we consider a mixture of K Poisson distributions as an example of a mixture of several discrete distributions. In this case, we have:

    g_k(x; λ_k) = e^{−λ_k} λ_k^x / x!,

therefore, equation (1) becomes:

    f(x; λ_1, ..., λ_K) = \sum_{k=1}^{K} w_k e^{−λ_k} λ_k^x / x!.    (50)

Equation (38) becomes:

    γ̂_{i,k} = ŵ_k (e^{−λ̂_k} λ̂_k^{x_i} / x_i!) / \sum_{k'=1}^{K} ŵ_{k'} (e^{−λ̂_{k'}} λ̂_{k'}^{x_i} / x_i!).    (51)

The Q(λ_1, ..., λ_K) is:

    Q(λ_1, ..., λ_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k + γ̂_{i,k} (−λ_k + x_i log λ_k − log x_i!) ].

The Lagrangian is:

    L(λ_1, ..., λ_K, w_1, ..., w_K, α) = \sum_{i=1}^{n} \sum_{k=1}^{K} [ γ̂_{i,k} log w_k
        + γ̂_{i,k} (−λ_k + x_i log λ_k − log x_i!) ] − α (\sum_{k=1}^{K} w_k − 1).

Therefore:

    ∂L/∂λ_k = \sum_{i=1}^{n} γ̂_{i,k} (−1 + x_i/λ_k) = 0  ⟹  λ̂_k = \sum_{i=1}^{n} γ̂_{i,k} x_i / \sum_{i=1}^{n} γ̂_{i,k},    (52)

and ŵ_k is the same as equation (41).

Iteratively solving equations (51), (52), and (41) using Algorithm 2 gives us the estimations for λ̂_1, ..., λ̂_K and ŵ_1, ..., ŵ_K in equation (50).

4. Using Mixture Distribution for Clustering

Mixture distributions have a variety of applications, including clustering. Assuming that the number of clusters, denoted by K, is known, the cluster label of a point x_i (i ∈ {1, ..., n}) is determined as:

    label of x_i ← arg max_k g_k(x_i; Θ_k),    (53)

where g_k(x_i; Θ_k) is the k-th distribution fitted to the data x_1, ..., x_n. In other words, f(x; Θ_1, ..., Θ_K) = \sum_{k=1}^{K} w_k g_k(x; Θ_k) is the mixture distribution fitted to the data. The reason why this clustering works is that the density/mass function which has produced a point with higher probability is the best candidate for the cluster of that point. This method of clustering is referred to as "model-based clustering" in the literature (Fraley & Raftery, 1998; 2002).
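A minimal sketch of the assignment rule in equation (53); the helper below assumes the univariate Gaussian parameters come from a fit such as the fit_gaussian_mixture_1d sketch above (an illustrative name of ours, not the paper's). It can optionally weight each density by ŵ_k, which is equivalent to maximizing the responsibility of equation (38) — a common variant of this rule.

```python
import numpy as np
from scipy.stats import norm

def cluster_labels(x, mu, sigma, weights=None):
    """Assign each x_i to arg max_k g_k(x_i; mu_k, sigma_k), cf. equation (53).

    If `weights` is given, arg max_k w_k * g_k(x_i; ...) is used instead,
    which is equivalent to maximizing the responsibilities of equation (38).
    """
    dens = norm.pdf(x[:, None], mu, sigma)   # shape (n, K); entry (i, k) is g_k(x_i)
    if weights is not None:
        dens = weights * dens
    return np.argmax(dens, axis=1)

# usage with an earlier fit (illustrative):
# mu, sigma, w = fit_gaussian_mixture_1d(x, K=3)
# labels = cluster_labels(x, mu, sigma)
```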

5. Simulations

In this section, we perform some simulations on fitting a mixture of densities in both the continuous and discrete cases. For the continuous case, a mixture of three Gaussians is simulated, and for the discrete case, a mixture of three Poissons.

5.1. Mixture of Three Gaussians

A sample of size n = 2200 from three distributions is randomly generated for this experiment:

    φ((x − µ_1)/σ_1) = φ((x + 10)/1.2),
    φ((x − µ_2)/σ_2) = φ((x − 0)/2),
    φ((x − µ_3)/σ_3) = φ((x − 5)/5).

For generality, the sizes of the subsets of the sample generated from the three densities are different, i.e., 700, 1000, and 500. The three densities are shown in Fig. 1.

[Figure 1. The original probability density functions from which the sample is drawn. The mixture includes three different Gaussians shown in blue, red, and green.]

Applying Algorithm 2 and using equations (43), (44), (45), and (41) for a mixture of K = 3 Gaussians gives us the estimated values for the parameters:

    µ̂_1 = −9.99,  σ̂_1 = 1.17,  ŵ_1 = 0.317,
    µ̂_2 = −0.05,  σ̂_2 = 1.93,  ŵ_2 = 0.445,
    µ̂_3 = 4.64,   σ̂_3 = 4.86,  ŵ_3 = 0.237.

Comparing the estimations for µ_1, µ_2, µ_3 and σ_1, σ_2, σ_3 with those of the original densities from which the data were generated verifies the correctness of the estimations.

The progress of the parameters µ_k, σ_k, and w_k through the iterations until convergence is shown in Figures 2, 3, and 4, respectively.

[Figure 2. The change and convergence of µ_1 (blue), µ_2 (red), and µ_3 (green) over the iterations.]
[Figure 3. The change and convergence of σ_1 (blue), σ_2 (red), and σ_3 (green) over the iterations.]
[Figure 4. The change and convergence of w_1 (blue), w_2 (red), and w_3 (green) over the iterations.]

Note that for setting the initial values of the parameters in a mixture of Gaussians, one reasonable option is:

    range ← max_i(x_i) − min_i(x_i),
    µ_k^(0) ~ U(min_i(x_i), max_i(x_i)),    (54)
    σ_k^(0) ~ U(0, range/6),    (55)
    w_k^(0) ~ U(0, 1),    (56)

where U(α, β) is the continuous uniform distribution on the range (α, β). This initialization makes sense because in a normal distribution, the mean belongs to the range of the data and more than 99% of the data falls in the range (µ − 3σ, µ + 3σ); therefore, the spread of the data is roughly 6σ. In the experiment of this section, the mentioned initialization is utilized.

The fitted densities and the mixture distribution are depicted in Fig. 5. Comparing this figure with Fig. 1 verifies the correct estimation of the three densities. Figure 5 also shows the mixture distribution, i.e., the weighted summation of the estimated densities.

Moreover, for the sake of better comparison, a single distribution is also fitted to the data using MLE. The MLE estimates of the parameters are µ̂^(mle) = x̄ = (1/n) \sum_{i=1}^{n} x_i and (σ̂^(mle))² = (1/n) \sum_{i=1}^{n} (x_i − x̄)². This fitted distribution is also depicted in Fig. 5. We can see that this poor estimation has not captured the multi-modality of the data, in contrast to the estimated mixture distribution.

[Figure 5. The estimated probability density functions. The estimated mixture includes three different Gaussians shown in blue, red, and green. The dashed purple density is the weighted summation of these three densities, i.e., \sum_{k=1}^{3} w_k φ((x − µ_k)/σ_k). The dashed brown density is the fitted density whose parameters are estimated by MLE.]
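A hedged sketch of this experiment: the 700 + 1000 + 500 sample is regenerated from the three Gaussians above and a K = 3 mixture is fitted. For brevity, scikit-learn's GaussianMixture (which implements the same EM updates as equations (43)–(45) and (41)) is used instead of the hand-rolled sketch after Section 3.2.1, so the exact estimates will differ slightly from the values reported here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)

# n = 2200 points: 700, 1000, and 500 from the three Gaussians of this experiment.
x = np.concatenate([
    rng.normal(-10.0, 1.2, 700),
    rng.normal(0.0, 2.0, 1000),
    rng.normal(5.0, 5.0, 500),
])

# K = 3 Gaussian mixture fitted by EM.
gmm = GaussianMixture(n_components=3, n_init=5, random_state=0).fit(x.reshape(-1, 1))
print("means  :", np.round(gmm.means_.ravel(), 2))
print("sigmas :", np.round(np.sqrt(gmm.covariances_.ravel()), 2))
print("weights:", np.round(gmm.weights_, 3))

# Single-Gaussian MLE for comparison; it cannot capture the multi-modality.
print("single-Gaussian MLE:", round(x.mean(), 2), round(x.std(), 2))
```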

5.2. Mixture of Three Poissons

A sample of size n = 2666 is made (see Table 1) for this experiment. The frequency of the data, displayed in Fig. 6, suggests that the data are sampled from a mixture of three Poissons.

Table 1. The discrete data for the simulation of fitting a mixture of Poissons.
x:          0    1    2    3    4    5    6    7    8    9   10
frequency: 162  267  271  185  111   61  120  210  215  136   73
x:         11   12   13   14   15   16   17   18   19   20
frequency:  43   14  160  230  243  104   36   15   10    0

[Figure 6. The frequency of the discrete data sample.]

Applying Algorithm 2 and using equations (51), (52), and (41) for a mixture of K = 3 Poissons gives us the estimated values for the parameters:

    λ̂_1 = 1.66,   ŵ_1 = 0.328,
    λ̂_2 = 6.72,   ŵ_2 = 0.256,
    λ̂_3 = 12.85,  ŵ_3 = 0.416.

Comparing the estimations for λ_1, λ_2, λ_3 with Fig. 6 verifies the correctness of the estimations. The progress of the parameters λ_k and w_k through the iterations until convergence is shown in Figures 7 and 8, respectively.

[Figure 7. The change and convergence of λ_1 (blue), λ_2 (red), and λ_3 (green) over the iterations.]
[Figure 8. The change and convergence of w_1 (blue), w_2 (red), and w_3 (green) over the iterations.]

For setting the initial values of the parameters in a mixture of Poissons, one reasonable option is:

    λ_k^(0) ~ U(min_i(x_i), max_i(x_i)),    (57)
    w_k^(0) ~ U(0, 1).    (58)

The reason for this initialization is that the MLE estimate of λ is λ̂^(mle) = x̄ = (1/n) \sum_{i=1}^{n} x_i, which belongs to the range of the data. This initialization is used in this experiment.

The fitted mass functions and the mixture distribution are depicted in Fig. 9. Comparing this figure with Fig. 6 verifies the correct estimation of the three mass functions. The mixture distribution, i.e., the weighted summation of the estimated mass functions, is also shown in Fig. 9.

For better comparison, a single mass function is also fitted to the data using MLE. For that, the parameter λ is estimated using λ̂^(mle) = x̄ = (1/n) \sum_{i=1}^{n} x_i. This fitted distribution is also depicted in Fig. 9. Again, the poor performance of this single mass function in capturing the multi-modality is obvious.

[Figure 9. The estimated probability mass functions. The estimated mixture includes three different Poissons shown in blue, red, and green. The purple mass function is the weighted summation of these three mass functions, i.e., \sum_{k=1}^{3} w_k e^{−λ_k} λ_k^x / x!. The brown mass function is the fitted mass function whose parameter is estimated by MLE.]

6. Conclusion

In this paper, a simple-to-understand and step-by-step tutorial on fitting a mixture distribution to data was proposed. The assumed prior knowledge was calculus and basic linear algebra. For more clarification, fitting a mixture of two distributions was introduced first and then generalized to K distributions. Fitting mixtures of Gaussians and Poissons was also covered as examples for the continuous and discrete cases, respectively. Simulations were also shown for further clarification.

Acknowledgment

The authors hugely thank Prof. Mu Zhu for his great course "Statistical Concepts for Data Science". This great course partly covered the materials mentioned in this tutorial paper.

References

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Fraley, Chris and Raftery, Adrian E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578–588, 1998.

Fraley, Chris and Raftery, Adrian E. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.

Lee, Gyemin and Scott, Clayton. EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Computational Statistics & Data Analysis, 56(9):2816–2829, 2012.