Fitting A Mixture Distribution to Data: Tutorial
Benyamin Ghojogh [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Aydin Ghojogh [email protected]

Mark Crowley [email protected]
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray [email protected]
Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

arXiv:1901.06708v2 [stat.OT] 11 Oct 2020

Abstract

This paper is a step-by-step tutorial for fitting a mixture distribution to data. It merely assumes the reader has a background in calculus and linear algebra. Other required background is briefly reviewed before explaining the main algorithm. In explaining the main algorithm, first, fitting a mixture of two distributions is detailed, and examples of fitting two Gaussians and two Poissons, respectively for the continuous and discrete cases, are introduced. Thereafter, fitting several distributions in the general case is explained, and examples with several Gaussians (Gaussian Mixture Model) and Poissons are again provided. Model-based clustering, as one of the applications of mixture distributions, is also introduced. Numerical simulations are also provided for both the Gaussian and Poisson examples for the sake of better clarification.

1. Introduction

Every random variable can be considered as a sample from a distribution, whether a well-known distribution or a not very well-known (or "ugly") distribution. Some random variables are drawn from one single distribution, such as a normal distribution. But life is not always so easy! Most real-life random variables might have been generated from a mixture of several distributions rather than a single distribution. The mixture distribution is a weighted summation of K distributions {g_1(x; \Theta_1), \dots, g_K(x; \Theta_K)} where the weights {w_1, \dots, w_K} sum to one. As is obvious, every distribution in the mixture has its own parameter \Theta_k. The mixture distribution is formulated as:

    f(x; \Theta_1, \dots, \Theta_K) = \sum_{k=1}^{K} w_k\, g_k(x; \Theta_k),
    subject to  \sum_{k=1}^{K} w_k = 1.    (1)

The distributions can be from different families, for example from the beta and normal distributions. However, this makes the problem very complex and sometimes useless; therefore, mostly the distributions in a mixture are from one family (e.g., all normal distributions) but with different parameters. This paper aims to find the parameters of the distributions in the mixture distribution f(x; \Theta) as well as the weights (also called "mixing probabilities") w_k.

The remainder of the paper is organized as follows. Section 2 reviews some technical background required for explaining the main algorithm. Afterwards, the methodology of fitting a mixture distribution to data is explained in Section 3. In that section, first the mixture of two distributions, as a special case of mixture distributions, is introduced and analyzed. Then, the general mixture distribution is discussed. Meanwhile, examples of mixtures of Gaussians (for the continuous case) and Poissons (for the discrete case) are given for better clarification. Section 4 briefly introduces clustering as one of the applications of mixture distributions. In Section 5, the discussed methods are implemented through some simulations in order to give a better sense of how these algorithms work. Finally, Section 6 concludes the paper.
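As a quick illustration of equation (1), the following Python sketch (added for this write-up, not part of the original paper; it assumes NumPy and SciPy are available, and the weights and component parameters are arbitrary choices) evaluates the density of a two-component Gaussian mixture and draws samples from it:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical two-component Gaussian mixture; the weights sum to one.
    weights = np.array([0.3, 0.7])          # w_1, w_2
    means = np.array([-2.0, 3.0])           # part of Theta_1, Theta_2
    stds = np.array([1.0, 1.5])

    def mixture_pdf(x):
        # f(x; Theta_1, Theta_2) = sum_k w_k g_k(x; Theta_k), with Gaussian components
        return sum(w * norm.pdf(x, loc=m, scale=s)
                   for w, m, s in zip(weights, means, stds))

    def sample_mixture(n, rng):
        # Pick a component index k with probability w_k, then sample from g_k
        ks = rng.choice(len(weights), size=n, p=weights)
        return rng.normal(means[ks], stds[ks])

    rng = np.random.default_rng(0)
    grid = np.linspace(-6.0, 8.0, 5)
    print(mixture_pdf(grid))        # mixture density at a few points
    print(sample_mixture(5, rng))   # five random draws from the mixture

Sampling by first drawing the component index according to the weights and then drawing from that component is the generative reading of equation (1).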
2. Background

This section reviews some technical background required for explaining the main algorithm. This review includes probability and the Bayes rule, probability mass/density functions, expectation, maximum likelihood estimation, expectation maximization, and the Lagrange multiplier.

2.1. Probability and Bayes Rule

If S denotes the total sample space and A denotes an event in this sample space, the probability of event A is:

    P(A) = \frac{|A|}{|S|}.    (2)

The conditional probability, i.e., the probability of occurrence of event A given that event B happens, is:

    P(A|B) = \frac{P(A, B)}{P(B)}    (3)
           = \frac{P(B|A)\, P(A)}{P(B)},    (4)

where P(A|B), P(B|A), P(A), and P(B) are called the posterior, likelihood, prior, and marginal probabilities, respectively. If we assume that the event A consists of some cases A = {A_1, \dots, A_n}, we can write:

    P(A_i|B) = \frac{P(B|A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\, P(A_j)}.    (5)

Equations (4) and (5) are two versions of the Bayes rule.

2.2. Probability Mass/Density Function

In discrete cases, the probability mass function is defined as:

    f(x) = P(X = x),    (6)

where X and x are a random variable and a number, respectively.

In continuous cases, the probability density function is:

    f(x) = \lim_{\Delta x \to 0} \frac{P(x \le X \le x + \Delta x)}{\Delta x} = \frac{\partial P(X \le x)}{\partial x}.    (7)

In this work, by a mixture of distributions, we imply a mixture of mass/density functions.

2.3. Expectation

Expectation means the value of a random variable X on average. Therefore, expectation is a weighted average where the weights are the probabilities of the random variable X taking different values. In the discrete and continuous cases, the expectation is:

    E(X) = \sum_{\text{dom } x} x f(x),    (8)
    E(X) = \int_{\text{dom } x} x f(x)\, dx,    (9)

respectively, where dom x is the domain of X. The conditional expectation is defined as:

    E_{X|Y}(X|Y) = \sum_{\text{dom } x} x f(x|y),    (10)
    E_{X|Y}(X|Y) = \int_{\text{dom } x} x f(x|y)\, dx,    (11)

for the discrete and continuous cases, respectively.

2.4. Maximum Likelihood Estimation

Assume we have a sample of size n, i.e., {x_1, \dots, x_n}. Also assume that we know the distribution from which this sample has been randomly drawn, but we do not know the parameters of that distribution. For example, we know it is drawn from a normal distribution, but the mean and variance of this distribution are unknown. The goal is to estimate the parameters of the distribution using the sample {x_1, \dots, x_n} available from it. This estimation of parameters from the available sample is called "point estimation". One of the approaches for point estimation is Maximum Likelihood Estimation (MLE). As is obvious from its name, MLE deals with the likelihood of the data.

We postulate that the values of the sample, i.e., x_1, \dots, x_n, are independent random variates of data having the sample distribution. In other words, the data have a joint distribution f_X(x_1, \dots, x_n | \Theta) with parameter \Theta, and we assume the variates are independent and identically distributed (iid), i.e., x_i \sim f_X(x_i; \Theta) with the same parameter \Theta. Considering the Bayes rule, equation (4), we have:

    L(\Theta | x_1, \dots, x_n) = \frac{f_X(x_1, \dots, x_n | \Theta)\, \pi(\Theta)}{f_X(x_1, \dots, x_n)}.    (12)

(Strictly speaking, the right-hand side of (12) is the posterior of \Theta; it is proportional to the likelihood f_X(x_1, \dots, x_n | \Theta) whenever the prior \pi(\Theta) is uniform.) The MLE aims to find the parameter \Theta which maximizes the likelihood:

    \widehat{\Theta} = \arg\max_{\Theta} L(\Theta).    (13)

According to the definition, the likelihood can be written as:

    L(\Theta | x_1, \dots, x_n) := f(x_1, \dots, x_n; \Theta) \overset{(a)}{=} \prod_{i=1}^{n} f(x_i; \Theta),    (14)

where (a) is because x_1, \dots, x_n are iid. Note that in the literature, L(\Theta | x_1, \dots, x_n) is also denoted by L(\Theta) for simplicity.
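To make the MLE recipe concrete, the following sketch (added here for illustration; the data are simulated, and NumPy/SciPy are assumed) estimates the mean and standard deviation of a normal sample both in closed form and by numerically maximizing the likelihood of equation (14). The code works with the logarithm of the likelihood, which is discussed next and has the same maximizer:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated iid sample; Theta = (mu, sigma) is treated as unknown

    # Closed-form MLE for a normal distribution: sample mean and (1/n) sample standard deviation.
    mu_hat, sigma_hat = x.mean(), x.std()

    # Numerical MLE: minimize the negative log-likelihood -sum_i log f(x_i; Theta).
    def neg_log_likelihood(theta):
        mu, log_sigma = theta              # sigma is parameterized by its log so it stays positive
        return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

    res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
    print(mu_hat, sigma_hat)               # closed-form estimates
    print(res.x[0], np.exp(res.x[1]))      # numerical estimates; they should agree closely

Both routes return essentially the same estimates, which is the point: for simple distributions the maximizer of (14) is available in closed form, while for mixtures it is not, which motivates the EM algorithm below.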
Usually, for more convenience, we use the log-likelihood rather than the likelihood:

    \ell(\Theta) := \log L(\Theta)    (15)
    = \log \prod_{i=1}^{n} f(x_i; \Theta) = \sum_{i=1}^{n} \log f(x_i; \Theta).    (16)

Often, the logarithm is the natural logarithm, for the sake of compatibility with the exponential in the well-known normal density function. Notice that, as the logarithm is a monotonic function, it does not change the location of the maximum of the likelihood.

2.5. Expectation Maximization

Sometimes, the data are not fully observable. For example, the data may be known only to be either zero or greater than zero. As an illustration, assume the data are collected for a particular disease, but for the convenience of the patients who participated in the survey, the severity of the disease is not recorded; only the existence or non-existence of the disease is reported. So, the data do not give us complete information, as X_i > 0 does not tell us whether X_i = 2 or X_i = 1000.

In this case, MLE cannot be directly applied, as we do not have access to complete information and some data are missing. In this case, Expectation Maximization (EM) is useful. The main idea of EM can be summarized in this short friendly conversation:

– What shall we do? The data is missing! The log-likelihood is not known completely so MLE cannot be used.
– Mmm, probably we can replace the missing data with something...
– Aha! Let us replace it with its mean.

Replacing the missing information with its mean is the E-step: the log-likelihood, which cannot be evaluated completely, is replaced by its expectation over the missing data given the observed data and the current parameters:

    Q(\Theta) := E_{X_{\text{mis}} | X_{\text{obs}}, \Theta} [\ell(\Theta)].    (17)

In the M-step, the MLE approach is used where the log-likelihood is replaced with its expectation, i.e., Q(\Theta); therefore:

    \widehat{\Theta} = \arg\max_{\Theta} Q(\Theta).    (18)

These two steps are iteratively repeated until convergence of the estimated parameters \widehat{\Theta}.

2.6. Lagrange Multiplier

Suppose we have a multivariate function Q(\Theta_1, \dots, \Theta_K) (called the "objective function") and we want to maximize (or minimize) it. However, this optimization is constrained, and its constraint is the equality P(\Theta_1, \dots, \Theta_K) = c, where c is a constant. So, the constrained optimization problem is:

    maximize_{\Theta_1, \dots, \Theta_K}  Q(\Theta_1, \dots, \Theta_K),
    subject to  P(\Theta_1, \dots, \Theta_K) = c.    (19)

For solving this problem, we can introduce a new variable \alpha, which is called the "Lagrange multiplier". Also, a new function L(\Theta_1, \dots, \Theta_K, \alpha), called the "Lagrangian", is introduced:

    L(\Theta_1, \dots, \Theta_K, \alpha) = Q(\Theta_1, \dots, \Theta_K) - \alpha \big( P(\Theta_1, \dots, \Theta_K) - c \big).    (20)

Maximizing (or minimizing) this Lagrangian function gives us the solution to the optimization problem (Boyd & Vandenberghe, 2004):

    \nabla_{\Theta_1, \dots, \Theta_K, \alpha} L \overset{\text{set}}{=} 0,    (21)

which gives us:

    \nabla_{\Theta_1, \dots, \Theta_K} L \overset{\text{set}}{=} 0 \implies \nabla Q = \alpha \nabla P,
    \nabla_{\alpha} L \overset{\text{set}}{=} 0 \implies P(\Theta_1, \dots, \Theta_K) = c,

i.e., the gradient of the objective must be parallel to the gradient of the constraint, and the constraint itself must be satisfied.
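As a small worked instance of equations (19)-(21) (added for illustration; the counts n_1 and n_2 are arbitrary), consider maximizing Q(w_1, w_2) = n_1 \log w_1 + n_2 \log w_2 subject to w_1 + w_2 = 1. Setting \nabla Q = \alpha \nabla P gives n_k / w_k = \alpha, and the constraint then yields \alpha = n_1 + n_2, so w_k = n_k / (n_1 + n_2). The Python sketch below (assuming NumPy and SciPy) checks this closed-form solution against a numerical constrained optimizer:

    import numpy as np
    from scipy.optimize import minimize

    n1, n2 = 30.0, 70.0   # hypothetical counts attached to the two components

    def neg_Q(w):
        # Objective Q(w1, w2) = n1*log(w1) + n2*log(w2); we minimize its negative
        return -(n1 * np.log(w[0]) + n2 * np.log(w[1]))

    # Equality constraint P(w1, w2) = w1 + w2 = 1, handled by SLSQP
    constraint = {"type": "eq", "fun": lambda w: w[0] + w[1] - 1.0}
    res = minimize(neg_Q, x0=[0.4, 0.6], method="SLSQP",
                   bounds=[(1e-9, 1.0), (1e-9, 1.0)], constraints=[constraint])

    print(res.x)                          # numerical solution
    print([n1 / (n1 + n2), n2 / (n1 + n2)])  # Lagrange solution: w_k = n_k / (n1 + n2)

This pattern, a logarithmic objective constrained by weights that must sum to one, is the same one that arises later when the mixing probabilities w_k of equation (1) are estimated.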