MAP Clustering under the Gaussian Mixture Model via Mixed Integer Nonlinear Optimization

Patrick Flaherty 1 Pitchaya Wiratchotisatian 2 Ji Ah Lee 1 Zhou Tang 1 Andrew C. Trapp 3

1 Department of Mathematics and Statistics, University of Massachusetts Amherst, USA. 2 Data Science Program, Worcester Polytechnic Institute, USA. 3 Data Science Program and Foisie Business School, Worcester Polytechnic Institute, USA. Correspondence to: Patrick Flaherty <fl[email protected]>.

Abstract

We present a global optimization approach for solving the maximum a-posteriori (MAP) clustering problem under the Gaussian mixture model. Our approach can accommodate side constraints and it preserves the combinatorial structure of the MAP clustering problem by formulating it as a mixed-integer nonlinear optimization problem (MINLP). We approximate the MINLP through a mixed-integer quadratic program (MIQP) transformation that improves computational aspects while guaranteeing ε-global optimality. An important benefit of our approach is the explicit quantification of the degree of suboptimality, via the optimality gap, en route to finding the globally optimal MAP clustering. Numerical experiments comparing our method to other approaches show that our method finds a better solution than standard clustering methods. Finally, we cluster a real breast cancer gene expression data set incorporating intrinsic subtype information; the induced constraints substantially improve the computational performance and produce more coherent and biologically meaningful clusters.

1 Introduction

In the application of clustering models to real data there is often rich prior information that constrains the relationships among the samples, or the relationships between the samples and the parameters. For example, in biological or clinical experiments, it may be known that two samples are technical replicates and should be assigned to the same cluster, or it may be known that the mean value for control samples is in a certain range. However, standard model-based clustering methods make it difficult to enforce such hard logical constraints and may fail to provide a globally optimal clustering.

Locally optimal and heuristic clustering methods lack optimality guarantees or bounds on solution quality that indicate the potential for further improvement. Even research on bounding the quality of heuristic solutions is scarce (Cochran, 2018). Conversely, a rigorous optimization framework affords bounds that establish the degree of suboptimality for any feasible solution, and a certificate of global optimality thereby guaranteeing that no better clustering exists.

Recent work on developing global optimization methods for supervised learning problems has led to impressive improvements in the size of the problems that can be handled. In linear regression, properties such as stability to outliers and robustness to predictor uncertainty have been represented as a mixed-integer quadratic program and solved for samples of size n ∼ 1,000 (Bertsimas and King, 2016; Bertsimas and Li, 2020). Best subset selection in the context of regression is typically approximated as a convex problem with the ℓ1 norm penalty, but can now be solved exactly using the nonconvex ℓ0 penalty for thousands of data points (Bertsimas et al., 2016). In unsupervised learning, Bandi et al. (2019) use mixed-integer optimization (MIO) to learn parameters for the Gaussian mixture model, showing that an optimization-based approach can outperform the EM algorithm. Greenberg et al. (2019) use a trellis representation to compute the partition function exactly and thus the probability of all possible clusterings. For these reasons we are motivated to develop a method for achieving a globally optimal solution for clustering under the Gaussian mixture model that allows for the incorporation of rich prior constraints.

Finite mixture model  The probability density function of a finite mixture model is p(y|θ, π) = Σ_{k=1}^K π_k p(y|θ_k), where the observed data is y and the parameter set is φ = {θ, π}. The data is an n-tuple of d-dimensional random vectors y = (y_1^T, ..., y_n^T)^T and the mixing proportion parameter π = (π_1, ..., π_K)^T is constrained to the probability simplex P_K = {p ∈ R^K : p ≥ 0 and 1^T p = 1}. When the component density, p(y|θ_k), is a Gaussian density function, p(y|φ) is a Gaussian mixture model with parameters θ = ({µ_1, Σ_1}, ..., {µ_K, Σ_K}). Assuming independent, identically distributed (iid) samples, the Gaussian mixture model probability density function is p(y|θ, π) = Π_{i=1}^n Σ_{k=1}^K π_k p(y_i|µ_k, Σ_k).

Generative model  A generative model for the Gaussian mixture density function is

    Z_i ~ Categorical(π) iid for i = 1, ..., n,
    Y_i | z_i, θ ~ Gaussian(µ_{z_i}, Σ_{z_i}),                                   (1)

where µ = (µ_1, ..., µ_K) and Σ = (Σ_1, ..., Σ_K). To generate data from the Gaussian mixture model, first draw z_i ∈ {1, ..., K} from a categorical distribution with parameter π. Then, given z_i, draw y_i from the associated Gaussian component distribution p(y_i|θ_{z_i}).

Maximum a posteriori (MAP) clustering  The posterior distribution function for the generative Gaussian mixture model is

    p(z, θ, π|y) = p(y|θ, z) p(z|π) p(θ, π) / p(y).                              (2)

The posterior distribution requires the specification of a prior distribution p(θ, π), and if p(θ, π) ∝ 1, then MAP clustering is equivalent to maximum likelihood clustering. The MAP estimate for the component membership configuration can be obtained by solving the following optimization problem:

    maximize_{z, θ, π}   log p(z, θ, π|y)
    subject to           z_i ∈ {1, ..., K}  ∀i,                                  (3)
                         π ∈ P_K.

Assuming iid sampling, the objective function comprises the following conditional density functions:

    p(y_i|µ, Σ, z_i) = Π_{k=1}^K [ (2π)^{−m/2} |Σ_k|^{−1/2} exp( −(1/2)(y_i − µ_k)^T Σ_k^{−1} (y_i − µ_k) ) ]^{z_ik},
    p(z_i|π) = Π_{k=1}^K [π_k]^{z_ik},    p(Σ, π, µ) ∝ 1,

where z_i ∈ {1, ..., K} is recast using binary encoding. To simplify the presentation, consider the case of one-dimensional data (d = 1) and equivariant components (Σ_1 = ··· = Σ_K = σ²). The MAP optimization problem can be written

    min_{z, µ, π}  η Σ_{i=1}^n Σ_{k=1}^K z_ik (y_i − µ_k)² − Σ_{i=1}^n Σ_{k=1}^K z_ik log π_k
    s.t.           Σ_{k=1}^K π_k = 1,
                   Σ_{k=1}^K z_ik = 1,  i = 1, ..., n,                           (4)
                   M_k^L ≤ µ_k ≤ M_k^U,  k = 1, ..., K,
                   π_k ≥ 0,  k = 1, ..., K,
                   z_ik ∈ {0, 1},  i = 1, ..., n,  k = 1, ..., K,

where η = 1/(2σ²) is the precision, and M_k^L and M_k^U are real numbers. In a fully Bayesian setting, even if the MAP clustering is of less interest than the full distribution function, the MAP clustering can still be useful as an initial value for a posterior sampling algorithm as suggested by Gelman and Rubin (1996).
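To make Problem (4) concrete, the following is a minimal sketch of the one-dimensional, equal-variance MINLP written with Pyomo so that it could be handed to a global solver such as BARON. This is not the authors' GAMS implementation; the function name build_map_minlp and the small lower bound pi_min (used only to keep log π_k finite) are illustrative assumptions.

```python
import numpy as np
import pyomo.environ as pyo

def build_map_minlp(y, K, sigma2, pi_min=1e-6):
    """One-dimensional, equal-variance version of Problem (4) as a Pyomo MINLP."""
    y = np.asarray(y, dtype=float)
    n, eta = len(y), 1.0 / (2.0 * sigma2)
    m = pyo.ConcreteModel()
    m.I = pyo.RangeSet(0, n - 1)
    m.K = pyo.RangeSet(0, K - 1)
    m.z = pyo.Var(m.I, m.K, domain=pyo.Binary)                          # memberships z_ik
    m.mu = pyo.Var(m.K, bounds=(float(y.min()), float(y.max())))        # conservative box bounds on mu_k
    m.pi = pyo.Var(m.K, bounds=(pi_min, 1.0))                           # pi_min keeps log(pi_k) finite
    m.simplex = pyo.Constraint(expr=sum(m.pi[k] for k in m.K) == 1)
    m.assign = pyo.Constraint(m.I, rule=lambda m, i: sum(m.z[i, k] for k in m.K) == 1)
    m.obj = pyo.Objective(
        expr=sum(eta * m.z[i, k] * (y[i] - m.mu[k]) ** 2 - m.z[i, k] * pyo.log(m.pi[k])
                 for i in m.I for k in m.K),
        sense=pyo.minimize)
    return m

# e.g. pyo.SolverFactory("baron").solve(build_map_minlp(y, K=3, sigma2=0.4))
```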

Biconvex mixed-integer nonlinear program  While maximum a posteriori clustering for a Gaussian mixture model is a well-known problem, it does not fall neatly into any optimization problem formulation except the broadest class: mixed-integer nonlinear programming. Here we show that, in fact, maximum a posteriori clustering for the Gaussian mixture model is in a more restricted class of optimization problems—biconvex mixed-integer nonlinear problems. This fact can be exploited to develop improved inference algorithms.

Theorem 1. Maximum a posteriori clustering under the Gaussian mixture model (1) with known covariance is a biconvex mixed-integer nonlinear programming optimization problem.

Proof. We briefly sketch a proof that is provided in full in the Supplementary Material. A biconvex mixed-integer nonlinear program (MINLP) is an optimization problem such that if the integer variables are relaxed, the resulting optimization problem is a biconvex nonlinear programming problem. Maximum a posteriori clustering under Model (1) can be written as Problem (4). If z_ik ∈ {0, 1} is relaxed to z_ik ∈ [0, 1], Problem (4) is convex in {π, µ} for fixed z and convex in z for fixed {π, µ}, and it satisfies the criteria of a biconvex nonlinear programming optimization problem (Floudas, 2000). Because the relaxed problem is a biconvex nonlinear program, the original maximum a posteriori clustering problem is a biconvex MINLP.

Goal of this work  Our goal is to solve for the global MAP clustering via a MINLP over the true posterior distribution domain while only imposing constraints that are informed by our prior knowledge and making controllable approximations. Importantly, there are no constraints linking π, µ, and z such as π_k = (1/n) Σ_{i=1}^n z_ik, k = 1, ..., K, which would be a particular estimator.

Computational complexity  Problems in the MINLP class are NP-hard in general, and the MAP problem in particular presents two primary computational challenges (Murty and Kabadi, 1987). First, as the number of data points increases, the size of the configuration space of z increases in a combinatorial manner (Nemhauser and Wolsey, 1988). Second, the nonlinear objective function can have many local minima. Despite these worst-case complexity results, MINLP problems are increasingly often solved to global optimality in practice. Good empirical performance is often due to exploiting problem-specific structures and powerful algorithmic improvements such as branch-and-bound, branch-and-cut, and branch-and-reduce algorithms (Bodic and Nemhauser, 2015; Tawarmalani and Sahinidis, 2005).

Contributions of this work  This work has three main contributions:

• We provide an exact formulation to find globally optimal solutions to the MAP clustering problem and demonstrate that it is tractable for instances formed from small data sets, as well as for larger instances when incorporating side constraints.

• We reformulate the original MINLP by using a piecewise approximation to the objective entropy term and a constraint-based formulation of the mixed-integer-quadratic objective term, thus converting the problem to a mixed-integer quadratic program (MIQP), and show that it achieves approximately two orders of magnitude improvement in computation time over the MINLP for larger data set size instances.

• We apply our approach to a real breast cancer data set and show that our MIQP method results in intrinsic subtype assignments that are more biologically consistent than the standard method for subtype assignment.

Section 2 describes related work and summarizes the relationship between some existing clustering procedures and the MINLP formulation. Section 3 describes our MIQP formulation and a branch-and-bound algorithm. Section 4 reports the results of comparisons between our methods and standard clustering procedures. Section 5 presents a discussion of the results and future work.

2 Related Work

Many MAP clustering methods can be interpreted as a specific combination of a relaxation of Problem (4) and a search algorithm for finding a local or global minimum. Table 1 summarizes these relationships.

EM algorithm  The EM algorithm relaxes the domain such that z_ik ∈ [0, 1] instead of z_ik ∈ {0, 1}. The decision variables of the resulting biconvex optimization problem are partitioned into two groups: {z} and {µ, π}. The search algorithm performs coordinate ascent on these two groups (a small numerical sketch of this view appears at the end of this section). There are no guarantees for the global optimality of the estimate produced by the EM algorithm. While Balakrishnan et al. (2017) showed that the global optima of a mixture of well-separated Gaussians may have a relatively large region of attraction, Jin et al. (2016) showed that inferior local optima can be arbitrarily worse than the global optimum.[1]

Variational EM  The variational EM algorithm introduces a surrogate function q(z, φ|ξ) for the posterior distribution p(z, φ|y) (Beal and Ghahramani, 2003). First, the surrogate is fit to the posterior by solving ξ̂ ∈ arg min_ξ KL(q(φ, z|ξ) || p(φ, z|y)). Then the surrogate is used in place of the posterior distribution in the original optimization problem: φ̂, ẑ ∈ arg min_{φ, z} log q(θ, z|ξ̂). The search algorithm performs coordinate ascent on ξ and {φ, z}. The computational complexity is improved over the original MAP problem by selecting a surrogate that has favorable structure (linear or convex) and by relaxing the domain of the optimization problem. This surrogate function approach has existed in many fields; it is alternatively known as majorization-minimization (Lange et al., 2000) and has deep connections with Frank-Wolfe gradient methods and block coordinate descent methods (Mairal, 2013). The domain of the problem can be viewed as a marginal polytope, and outer approximations of the marginal polytope lead to efficient sequential approximation methods that have satisfying theoretical properties (Wainwright and Jordan, 2007).

SLSQP  Sequential Least-Squares Programming (SLSQP) is a popular general-purpose constrained nonlinear optimization method that uses a quadratic surrogate function to approximate the Lagrangian (Nocedal and Wright, 2006). In SLSQP, the surrogate function is a quadratic approximation of the Lagrangian of the original problem. The domain of the original problem is also relaxed so that the constraint cuts it generates are approximated by linear functions. Like variational EM, SLSQP iterates between fitting the surrogate function and optimizing over the decision variables. Quadratic surrogate functions have also been investigated in the context of variational EM for nonconjugate models (Braun and McAuliffe, 2010; Wang and Blei, 2013).

Simulated annealing  Simulated annealing methods are theoretically guaranteed to converge to a global optimum of a nonlinear objective. However, choosing the annealing schedule for a particular problem is challenging and the guarantee of global optimality only exists in the limit of the number of steps; there is no general way to choose the annealing schedule or monitor convergence (Andrieu et al., 2003). Furthermore, designing a sampler for the binary z can be challenging unless the domain is relaxed. Even so, modern simulated annealing-type methods such as basin hopping have shown promise in practical applications (Wales and Doye, 1997).

[1] Figure 1 of Jin et al. (2016) illustrates the complexity of the likelihood surface for the Gaussian mixture model (GMM).
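For reference, here is a minimal numpy sketch of the standard EM iteration for a one-dimensional, equal-variance GMM, written to expose the two coordinate blocks {z} and {µ, π} discussed above. The function name, initialization, and iteration count are illustrative assumptions, not the scikit-learn implementation used in the experiments.

```python
import numpy as np

def em_gmm_1d(y, K, sigma2, n_iter=200, seed=0):
    """Standard EM for a 1-d, equal-variance GMM, viewed as alternation
    between the {z} block and the {mu, pi} block of the relaxed Problem (4)."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    mu = rng.choice(y, size=K, replace=False)
    pi = np.full(K, 1.0 / K)
    eta = 1.0 / (2.0 * sigma2)
    for _ in range(n_iter):
        # z-block (E-step): soft responsibilities, z_ik in [0, 1]
        logp = -eta * (y[:, None] - mu[None, :]) ** 2 + np.log(pi)[None, :]
        z = np.exp(logp - logp.max(axis=1, keepdims=True))
        z /= z.sum(axis=1, keepdims=True)
        # (mu, pi)-block (M-step): closed-form update for fixed z
        pi = z.mean(axis=0)
        mu = (z * y[:, None]).sum(axis=0) / z.sum(axis=0)
    return z, mu, pi
```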

Table 1: Summary of approximation methods for MAP clustering in an optimization framework.

Method                        Domain Relaxation    Objective Approximation    Search Algorithm
EM Algorithm                  X                    –                          coordinate/stochastic descent
Variational EM                X                    X                          coordinate/stochastic descent
SLSQP                         X                    X                          coordinate descent
Simulated Annealing           X                    –                          stochastic descent
Learning Parameters via MIO   X                    X                          coordinate descent

Branch-and-bound  In many practical optimization problems featuring combinatorial structure, it is critical to obtain the global optimum with a certificate of optimality, that is, with a (computational) proof that no better solution exists. For these situations, branch-and-bound methods, first proposed by Land and Doig (1960), have seen the most success. While MINLP problems remain NP-hard in general, the scale of problems that can be solved to global optimality has increased dramatically in the past 20 years (Bertsimas and King, 2017). The current state-of-the-art solver for general MINLPs is the Branch-and-Reduce Optimization Navigator (BARON). BARON exploits general problem structure to branch-and-bound with domain reduction on both discrete and continuous decision variables (Sahinidis, 2017).

Learning parameters via MIO  Bandi et al. (2019) recently described a mixed-integer optimization formulation of the parameter estimation problem for the Gaussian mixture model. Conditional on the parameter estimates, they computed the one-sample-at-a-time MAP assignments for out-of-sample data. They convincingly demonstrate that a mixed-integer optimization approach can outperform the EM algorithm in terms of out-of-sample accuracy for real-world data sets.

Bandi et al. (2019) and our paper solve related, yet different problems. Their primary objective is density estimation—to find the optimal parameters of the Gaussian mixture model. Our primary objective is MAP clustering—to find an optimal maximum a posteriori assignment of data points to clusters and associated distribution parameters. To assign data points to clusters, Bandi et al. (2019) first estimate the model parameters and then compute the maximum a posteriori assignments conditioned on the estimated model parameters. Even so, Bandi et al. (2019) showed that an optimization framework can yield real benefits for learning problems.

3 Branch-and-Bound for MAP Clustering

Branch-and-bound solves the original problem with integer variables by solving a sequence of relaxations that partition the search space. The branch-and-bound algorithm provides the opportunity to exclude large portions of the search space at each iteration, and if the branching strategy is well-suited to the problem it can substantially reduce the actual computation time for real problems (Grötschel, 2012).

Our innovations to the standard MINLP fall into three categories: changes to the domain, changes to the objective function, and changes to the branch-and-bound algorithm. For the domain constraints, we formulate a symmetry-breaking constraint, specific estimators for π and µ, tightened parameter bounds, and logical constraints. For the objective function, we formulate the prior distribution p(π, µ) as a regularizer, a piecewise linear approximation to the nonlinear logarithm function, and an exact mapping of a mixed-integer quadratic term as a set of constraints.

A common difficulty in obtaining a global optimum for Problem (4) is that the optimal value of the objective is invariant to permutations of the component ordering. In the MINLP framework, we eliminate such permutations from the feasible domain by adding a simple linear constraint π_1 ≤ π_2 ≤ ··· ≤ π_K. This set of constraints enforces an ordering, thereby reducing the search space and improving computational performance.

Though Problem (4) does not specify any particular form for the estimators of π or µ, it may be of interest to specify the estimators with equality constraints. For example, the standard estimators of the EM algorithm are:

    π_k = (1/n) Σ_{i=1}^n z_ik   and   µ_k = ( Σ_{i=1}^n y_i z_ik ) / ( Σ_{i=1}^n z_ik )   ∀k.

The resulting optimization problem can be written entirely in terms of integer variables and the goal is to find the optimal configuration of z. Note that Problem (4) does not specify these constraints and we do not enforce these particular estimators in our experiments in Section 4.

A conservative bound on µ_k is [min(y), max(y)] and this bound is made explicit in Problem (4). Of course, prior information may be available that informs and reduces the range of these bounds. Placing more informative box constraints on the parameter values has been shown by Bertsimas et al. (2016) to greatly improve the computational time of similar MINLPs.

An important aspect of using the MINLP formulation as a Bayesian procedure is the ability to formulate logical constraints that encode rich prior information. These constraints shape the prior distribution domain in a way that is often difficult to express with standard probability distributions. Problem (4) already has one either/or logical constraint specifying that a data point i must belong to exactly one component k. One can specify that data points i and j must be assigned to the same component, z_ik = z_jk, ∀k, or that they must be assigned to different components, z_ik + z_jk ≤ 1, ∀k. A non-symmetric constraint, z_jk ≤ z_ik, can specify that if data point j is assigned to component k, then i must be assigned to k; on the other hand, if i is not assigned to component k, then j cannot be assigned to component k. A useful constraint in the context of the GMM is the requirement that each component has a minimum number of data points assigned to it, Σ_{i=1}^n z_ik ≥ L, for k = 1, ..., K, where L is a specified number of data points. Additional logical constraints can be formulated in a natural way in the MINLP such as: packing constraints (from a set of data points I, select at most L to be assigned to component k), Σ_{i∈I} z_ik ≤ L; partitioning constraints (from a set of data points, select exactly L to be assigned to component k), Σ_{i∈I} z_ik = L; and covering constraints (from a set of data points, select at least L), Σ_{i∈I} z_ik ≥ L.

3.1 Approximation via MIQP

Recall the objective function of Problem (4) is:

    f(z, π, µ; y, η) = η Σ_{k=1}^K Σ_{i=1}^n z_ik (y_i − µ_k)² − Σ_{i=1}^n Σ_{k=1}^K z_ik log π_k.

The objective function is mixed-integer and nonlinear due to the product of z_ik and µ_k² in the first term and the product of binary z_ik and the log function in the second term.

The objective function can be shaped by the prior distribution, p(µ, π). In the formulation of Problem (4) a uniform prior was selected such that p(µ, π) ∝ 1 and the prior does not affect the optimizer. But, an informative prior such as a multivariate Gaussian µ ∼ MVN(0, S) could be used to regularize µ, or a non-informative prior such as Jeffrey's prior could be used in an objective Bayesian framework.

The template-matching term in the objective function has two nonlinearities: 2 y_i z_ik µ_k and z_ik µ_k². Such polynomial-integer terms are, in fact, commonly encountered in problems such as capital budgeting, scheduling problems, and many others (Glover, 1975). We have from Problem (4) that M_k^L ≤ µ_k ≤ M_k^U when the component bounds are not the same. Given z_ik is a binary variable, we can rewrite the term Σ_k z_ik (y_i − µ_k)² as (y_i − Σ_k z_ik µ_k)² because Σ_k z_ik y_i = y_i and each data point is constrained to be assigned to exactly one component. Then, we introduce a new continuous variable t_ik = z_ik µ_k which is implicitly enforced with the following four constraints for each (i, k):

    M_k^L z_ik ≤ t_ik ≤ M_k^U z_ik,                                             (5)
    µ_k − M_k^U (1 − z_ik) ≤ t_ik ≤ µ_k − M_k^L (1 − z_ik).                     (6)

Now, the objective function term (y_i − Σ_k t_ik)² is simply quadratic in the decision variables and the additional constraints are linear in the decision variables.

The cross-entropy term, z_ik log π_k, is a source of nonlinearity in Problem (4). Approximating this nonlinearity with a piecewise linear function has two benefits. First, the accuracy of the approximation can be controlled by the number of breakpoints in the approximation. This results in a single parametric "tuning knob" offering direct control of the tradeoff between computational complexity and statistical accuracy. Second, sophisticated methods from ordinary and partial differential equation integration or spline fitting can be brought to service in selecting the locations of the breakpoints of the piecewise-linear approximation. It may be possible to set breakpoint locations adaptively as the optimization iterations progress to gain higher accuracy in the region of the MAP and the approximation can be left coarser elsewhere. Indeed, good methods have been developed to solve for the optimal breakpoint locations (Magnani and Boyd, 2008; Bandi et al., 2019), but we employ regular breakpoints in the MIQP implemented for our experiments in Section 4 for simplicity.

4 Experiments

In this section, we report results on standard data sets and a real breast cancer gene expression data set.

4.1 Comparison to Local Search Methods

We compare variations of our proposed approach to EM, SLSQP, and simulated annealing on several standard clustering data sets. Our primary interest in this work is in closing the gap between the upper and lower bounds thus achieving a certificate of global optimality for clustering assignments under the Gaussian mixture model. We note that our approach typically obtains a strong feasible solution very quickly; it is the proof of global optimality that is obtained from closing the lower bound that consumes the majority of the computational time. For methods that admit constraints, we employ the following additional constraints: π_1 ≤ ··· ≤ π_K and Σ_{i=1}^n z_ik ≥ 1, ∀k.

Data collection and preprocessing  We obtained the Iris (iris, n = 150), Wine Quality (wine, n = 178), and Wisconsin Breast Cancer (brca/wdbc, n = 569) data sets from the UCI Machine Learning Repository (Dua and Graff, 2017). A 1-d projection of iris was obtained by projecting on the first principal component (designated iris1d), and only the following features were employed for the brca data set: worst area, worst smoothness, and mean texture. The wine data set is 13-dimensional and the brca data set is 3-dimensional. Since our goal is to obtain the global MAP clustering given the data set rather than a prediction, all of the data was used for clustering. The component variance was fixed to 0.4 for the iris1d data and for wine and brca the precision matrices were fixed to the average of the true component precision matrices.

Experimental protocol  The EM, SLSQP, and simulated annealing algorithms were initialized using the K-means algorithm; no initialization was provided to the MINLP and MIQP methods. The EM, SLSQP, and SA experiments were run using algorithms defined in Python/scikit-learn and all variants of our approach were implemented in GAMS (Bisschop and Meeraus, 1982; GAMS Development Corporation, 2017), a standard optimization problem modeling language. The estimates provided by EM, SLSQP and SA are not guaranteed to have z_ik ∈ {0, 1} so we rounded to the nearest integral value (a sketch of this rounding step appears at the end of this subsection), but we note that solutions from these methods are not guaranteed to be feasible for the MAP problem. MINLP problems were solved using BARON (Tawarmalani and Sahinidis, 2005; Sahinidis, 2017) and MIQP problems were solved using Gurobi (Gurobi Optimization, 2018); both are state-of-the-art general-purpose solvers for their respective problems (Kronqvist et al., 2019). We report the upper bound (best found) and lower bound (best possible) of the negative MAP value if the method provides it. For the timing results in Figure 1b we ran both methods on a single core, and for the results in Table 2 we ran all methods on a single core except for the MIQP method which allowed multithreading, where 16 cores were used. The computational time for all methods was limited to 12 hours. The metrics for estimating π, µ, and z are shown in Table 2.

Performance with increasing sample size  We assessed convergence to the global optimum for the iris1d data set restricted to n = 45 data points (15 in each of the three true components). Figure 1a shows that the upper bound converges very quickly to the global optimum, and it takes the majority of the time to (computationally) prove optimality within a predetermined ε threshold. Figure 1b shows the computational time versus sample size for the iris1d data set for the MINLP and the MIQP methods. Our MIQP approach reduces computational time by a multiplier that increases with sample size and is around two orders of magnitude for n = 45.

Figure 1: Global convergence of MINLP and time comparison between MINLP and MIQP. (a) Convergence of upper and lower bounds for the MINLP method shows that the optimal solution is found quickly, with the majority of the time spent proving global optimality of the MAP estimate. (b) Our MIQP approach improves upon the MINLP solution computation time as shown by the shift of the computation (log) time versus sample size curve to the right.

Table 2 shows the comparison of our proposed branch-and-bound methods (MINLP, MIQP) with standard local search methods (EM, SLSQP, SA). Our primary interest lies in achieving a measure of convergence to the global optimum and the relative gap indicates the proximity of the upper and lower bounds. On all of the data sets, all methods find roughly the same optimal value. The MINLP method consistently has a fairly large gap and the MIQP method has a much smaller gap indicating that it provides a tighter bound on the globally optimal value.

EM algorithm with multiple restarts  We explored whether the EM algorithm can identify the global optimum if restarted multiple times from random initial values. We restarted the EM algorithm and ran to convergence for 12 hours and retained the best local MAP value. For the iris1d data set with n = 45, the best local MAP identified by the EM algorithm was 194.9582 and the global MAP identified by MIQP was 194.9455. The global MAP clustering was identified by the MIQP algorithm in 17 seconds, and an additional 125.19 seconds were spent proving global optimality (converging the lower bound as shown in Figure 1a). This shows that EM may get trapped in attractive, albeit local, minima, and so is unable to prove global optimality.

Comparison to Bandi et al.  We implemented the Bandi et al. (2019) method with the Kolmogorov-Smirnov distance as the discrepancy measurement and set the stopping criterion to ε = 0.01. The in-sample cluster assignment accuracies on the iris1d and brca data sets are 90.0% and 74.4% using Bandi et al. (2019). For the wine data set, the clustering accuracy is 58.4%. Our implementation matches the results reported in Bandi et al. (2019) for iris and brca; they did not report results on wine. In comparison, Table 2 shows that our in-sample assignment accuracy is 90.7%, 83.1% and 99.4% for iris1d, brca, and wine. To test whether the discrepancy on the wine data set is due to the estimation of the variance parameters, we fixed the variance parameters and cluster proportion parameters to their true values for the Bandi et al. (2019) method. The cluster assignment accuracy improved but still does not achieve the same level of accuracy as our MIQP method. The remaining possible causes for the discrepancy are that the Bandi et al. (2019) method uses the Kolmogorov-Smirnov distance or total variation distance to estimate parameters rather than maximum likelihood or maximum a posteriori cluster assignments. To underscore the fundamental difference between our approaches, our approach is focused on the in-sample cluster assignment—an important problem in Bayesian statistics and biological data analysis, whereas the Bandi et al. (2019) approach is focused on the out-of-sample prediction problem—a different, but no less important problem.
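Below is a minimal numpy sketch of the rounding step described in the experimental protocol: soft memberships from EM, SLSQP, or SA are rounded to a one-hot z and the negative MAP objective of Problem (4) is evaluated on the result. The names are illustrative; this is not the evaluation harness used to produce Table 2.

```python
import numpy as np

def round_to_assignment(z_soft):
    """Round soft memberships (n x K) to a one-hot matrix z with one component per row."""
    z_soft = np.asarray(z_soft, dtype=float)
    z = np.zeros_like(z_soft)
    z[np.arange(z_soft.shape[0]), z_soft.argmax(axis=1)] = 1.0
    return z

def neg_map_objective(y, z, mu, pi, sigma2):
    """Negative MAP objective of Problem (4) for 1-d data with shared variance sigma2."""
    y = np.asarray(y, dtype=float)
    eta = 1.0 / (2.0 * sigma2)
    quad = eta * (z * (y[:, None] - mu[None, :]) ** 2).sum()   # template-matching term
    xent = -(z * np.log(pi)[None, :]).sum()                    # cross-entropy term
    return quad + xent
```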

Table 2: Comparison of expectation-maximization (EM), sequential least squares programming (SLSQP), basin-hopping simulated annealing (SA), branch-and-bound on the MINLP using BARON (MINLP), and branch-and-bound on the MIQP using Gurobi (MIQP). The solutions reported for MINLP and MIQP are the best feasible solutions found within the time limit. The total variation distance metric is used for π, the L2 distance is used for µ, and the average (across samples) total variation distance is used for z.

                                                Local                            Global (BnB)
Data Set        Metric                          EM        SLSQP     SA           MINLP        MIQP
iris (1 dim)    − log MAP                       280.60    287.44    283.28       280.02       282.71
                LBD                             —         —         —            9.27         161.60
                sup |π̂ − π|                     0.075     0.013     0.000        0.093        0.165
                ‖µ̂ − µ‖_2                       0.278     0.065     0.277        0.356        0.356
                (1/n) Σ_i sup |ẑ_i − z_i|       0.067     0.067     0.087        0.093        0.093

wine (13 dim)   − log MAP                       1367.00   1368.71   1368.71      1366.85      1390.13
                LBD                             —         —         —            −2.2 × 10^5  183.42
                sup |π̂ − π|                     0.005     0.066     0.066        0.006        0.167
                ‖µ̂ − µ‖_2                       2.348     1.602     1.652        1.618        14.071
                (1/n) Σ_i sup |ẑ_i − z_i|       0.006     0.006     0.006        0.006        0.022

brca (3 dim)    − log MAP                       1566.49   1662.97   1662.97      1566.40      1578.49
                LBD                             —         —         —            −2.7 × 10^4  332.30
                sup |π̂ − π|                     0.167     0.127     0.127        0.169        0.122
                ‖µ̂ − µ‖_2                       394.07    321.11    320.60       401.47       418.05
                (1/n) Σ_i sup |ẑ_i − z_i|       0.169     0.139     0.139        0.169        0.174

4.2 Clustering Breast Cancer Gene Expression Data

We evaluated our proposed approach on the Prediction Analysis of Microarray 50 (pam50) gene expression data set. The PAM50 gene set is commonly used to identify the "intrinsic" subtypes of breast cancer among luminal A (LumA), luminal B (LumB), HER2-enriched (Her2), basal-like (Basal), and normal-like (Normal). Different subtypes lead to different treatment decisions, so it is critical to identify the correct subtype. The PAM algorithm finds centroids under a Spearman's rank correlation distance (Tibshirani et al., 2002; Parker et al., 2009). Our primary interest with the experiment is to show the effects of incorporating constraints into the clustering problem. In this data set, we have important side information that we incorporate and find that it significantly improves computational time and accuracy.

Data collection and preprocessing  We used the pam50 data set (n = 232, d = 50) obtained from the UNC MicroArray Database (University of North Carolina, 2020). The pam50 data set contains 139 subjects whose intrinsic subtypes are known, and 93 subjects whose intrinsic subtypes are unknown. Missing data values (2.65%) were filled in with zeros. A 4-d projection of pam50 was obtained by projecting on the first 4 principal components (designated pam50-4d), preserving 58.6% of the information variation. The precision matrix was fixed to the average of the estimated intrinsic subtype component precision matrices.

Experimental protocol  We ran the PAM algorithm using R code provided by the authors as supplementary material. The MINLP and MIQP methods employed additional constraints to Problem (4) such that 139 samples were assigned to known subtypes (a sketch of these constraints follows the PAM comparison below). All variants of our approaches were implemented in the same environment described in Section 4.1, except MIQP was run on 32 cores.

Hard constraints improve computational time  We assessed convergence to the global optimum for the pam50-4d data set using MIQP with and without the constraints on 139 samples with known intrinsic subtypes. Without constraints, the optimality gap was 99.83%, and the addition of constraints allowed the method to achieve a 32.35% optimality gap for the same computation time. The incorporation of hard constraints greatly improves the computational efficiency.

Comparison to PAM algorithm  Figure 2 shows a comparison of cluster assignments of our methods (MINLP, MIQP) with the PAM algorithm. For the 139 samples with known intrinsic subtypes, assignments from the MINLP and MIQP methods have 100% accuracy, while PAM accuracy is 94%. This is due to the capability of MINLP and MIQP to incorporate prior constraints in the clustering problem. For the 93 samples with unknown subtypes, MINLP assignments have 68% concordance with the PAM algorithm, and MINLP has 89% concordance with MIQP assignments.
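To illustrate how the known intrinsic subtypes enter as hard constraints, a minimal gurobipy-style fragment is shown below; the membership variables z and the mapping known_subtype are illustrative assumptions rather than the GAMS code used here.

```python
def fix_known_subtypes(m, z, known_subtype):
    """Fix z[i, k] = 1 for every sample i whose intrinsic subtype k is known.

    m is an optimization model (e.g. a gurobipy Model), z its binary membership
    variables from Problem (4), and known_subtype maps sample index -> subtype index.
    """
    for i, k in known_subtype.items():
        m.addConstr(z[i, k] == 1)   # sample i must be assigned to component k

# usage sketch: fix_known_subtypes(m, z, {0: 2, 5: 1, 17: 0})
```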

Figure 2: Breast cancer intrinsic subtype assignments. pam50 shows the true assignments for 139 samples and unknown assignments (gray) for 93 samples. PAM is the centroid algorithm in Tibshirani et al. (2002); MINLP and MIQP are our methods.

Figure 3: Comparison of gene expression profiles with average intrinsic subtype profiles for samples 13 and 24. (a) Gene expression data comparison of sample 13, the average of LumA, and the average of LumB. The true subtype is LumA; MIQP correctly assigned it to LumA, and PAM incorrectly assigned it to LumB. (b) Gene expression data comparison of sample 24, the average of LumA, and the average of Normal. The true subtype is unidentified; MIQP assigned it to LumA, and PAM assigned it to Normal.

Biological interpretation of MIQP and PAM cluster assignments  We selected two samples that were assigned to different clusters by the PAM algorithm and our MIQP method to explore whether the cluster assignments made by MIQP are more or less biologically meaningful (see Supplementary Material for plots of other samples). Figure 3 shows the gene expression profile for sample 13, which has a known intrinsic subtype (Figure 3a), and sample 24, which has an unknown intrinsic subtype (Figure 3b).

Sample 13 is known to be a Luminal A (LumA) subtype and is correctly assigned to LumA by the MIQP method, but incorrectly assigned to LumB by PAM. The expression profile clearly matches the LumA average more closely for CCNB1, CCNE1, CDC6 and NAT1. NAT1 has been shown to be elevated in individuals with estrogen receptor positive (ER+) breast cancers (Wang et al., 2019) and 30-45% of Luminal A cancers are ER+ (Voduc et al., 2010). Therefore, the assignment to Luminal A is more biologically consistent.

Sample 24 is an unknown intrinsic subtype in the pam50 data set; it is assigned to the Normal subtype by the PAM algorithm and to the LumA subtype by the MIQP method. Clearly, the expression profile of the unknown sample shows an overexpression of ERBB2, ESR1, and the KRT cluster (KRT14, KRT17, and KRT5) which are consistent with the average LumA expression profile and inconsistent with the average Normal expression profile. ERBB2 is commonly known as HER2 and overexpression of HER2 would indicate the sample might be assigned to the HER2 intrinsic subtype. The moderate keratin (KRT) expression levels (KRT14, KRT17, KRT5) are clearly more consistent with LumA (overexpression has been associated with the basal subtype (Fredlund et al., 2012; Gusterson et al., 2005)). Even though the expression of the NAT1 gene in the unknown sample is different than the LumA average, the overall pattern of expression is consistent with LumA. All of this evidence suggests that this sample may be heterogeneous with multiple oncogenic driving events (including HER2 overexpression); and it is almost certainly not Normal.

5 Discussion

We formulated the MAP clustering problem as an MINLP and cast several local MAP clustering methods in the framework of the MINLP. We identified three aspects of the MINLP that can be adjusted to incorporate prior information and improve computational efficiency. We suggested an approximation that transforms the MINLP to an MIQP and offers control over the approximation error at the expense of computational time. We applied our approach to a real breast cancer clustering data set and found that incorporating hard constraints in our method significantly improves both the computational time and the biological interpretation of the resulting sample assignments.

Acknowledgments

This work is supported by NSF HDR TRIPODS award 1934846.

References

Achterberg, T., Koch, T., and Martin, A. (2005). Branching rules revisited. Operations Research Letters, 33(1):42–54.

Andrieu, C., de Freitas, N., Doucet, A., and Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1):5–43.

Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics, 45(1):77–120.

Bandi, H., Bertsimas, D., and Mazumder, R. (2019). Learning a mixture of Gaussians via mixed-integer optimization. INFORMS Journal on Optimization, 1(3):221–240.

Beal, M. J. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A., and West, M., editors, Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, volume 7, pages 453–464. Oxford University Press.

Belotti, P., Kirches, C., Leyffer, S., Linderoth, J., Luedtke, J., and Mahajan, A. (2013). Mixed-integer nonlinear optimization. Acta Numerica, pages 1–131.

Braun, M. and McAuliffe, J. (2010). Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324–335.

Cochran, J. J. (2018). INFORMS Analytics Body of Knowledge. John Wiley & Sons.

Dua, D. and Graff, C. (2017). UCI machine learning repository.

Floudas, C. A. (2000). Deterministic Global Optimization. Nonconvex Optimization and Its Applications. Springer US.

Fredlund, E., Staaf, J., Rantala, J. K., Kallioniemi, O., Borg, Å., and Ringnér, M. (2012). The gene expression landscape of breast cancer is shaped by tumor protein p53 status and epithelial-mesenchymal transition. Breast Cancer Research, 14(4):R113.

GAMS Development Corporation (2017). GAMS — A User's Guide. GAMS Development Corporation.

Gelman, A. and Rubin, D. B. (1996). Markov chain Monte Carlo methods in biostatistics. Statistical Methods in Medical Research, 5(4):339–355.

Glover, F. (1975). Improved linear integer programming formulations of nonlinear integer problems. Management Science, 22(4):455–460.

Gorski, J., Pfeuffer, F., and Klamroth, K. (2007). Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407.

Bertsimas, D. and King, A. (2016). OR forum–An algorithmic approach to linear regression. Operations Research, 64(1):2–16.

Bertsimas, D. and King, A. (2017). Logistic regression: From art to science. Statistical Science, 32(3):367–384.

Bertsimas, D., King, A., and Mazumder, R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852.

Greenberg, C., Monath, N., Kobren, A., Flaherty, P., McGregor, A., and McCallum, A. (2019). Compact representation of uncertainty in clustering. In Proceedings of the Neural Information Processing Systems Meeting.

Grötschel, M. (2012). Optimization stories. Dt. Mathematiker-Vereinigung.

Gurobi Optimization, L. (2018). Gurobi Optimizer reference manual.

Bertsimas, D. and Li, M. L. (2020). Scalable holistic linear regression. Operations Research Letters.

Bisschop, J. and Meeraus, A. (1982). On the development of a general algebraic modeling system in a strategic planning environment. In Applications, volume 20 of Mathematical Programming Studies, pages 1–29. Springer Berlin Heidelberg.

Bodic, P. L. and Nemhauser, G. L. (2015). How important are branching decisions: Fooling MIP solvers. Operations Research Letters, 43(3):273–278.

Gusterson, B. A., Ross, D. T., Heath, V. J., and Stein, T. (2005). Basal cytokeratins and their relationship to the cellular origin and functional classification of breast cancer. Breast Cancer Research, 7(4):143.

Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J., and Jordan, M. I. (2016). Local maxima in the likelihood of Gaussian mixture models: Structural results and algorithmic consequences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4123–4131, USA. Curran Associates Inc.

Kronqvist, J., Bernal, D. E., Lundell, A., and Grossmann, I. E. (2019). A review and comparison of solvers for convex MINLP. Optimization and Engineering, 20(2):397–455.

Wainwright, M. J. and Jordan, M. I. (2007). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

Land, A. H. and Doig, A. G. (1960). An automatic method of solving discrete programming problems. Econometrica, 28(3):497.

Lange, K., Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1.

Magnani, A. and Boyd, S. P. (2008). Convex piecewise-linear fitting. Optimization and Engineering, 10(1):1–17.

Mairal, J. (2013). Optimization with first-order surrogate functions. In Proceedings of the 30th International Conference on Machine Learning.

Wales, D. J. and Doye, J. P. K. (1997). Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. The Journal of Physical Chemistry A, 101(28):5111–5116.

Wang, C. and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1):1005–1031.

Wang, L., Minchin, R. F., Essebier, P. J., and Butcher, N. J. (2019). Loss of human arylamine N-acetyltransferase I regulates mitochondrial function by inhibition of the pyruvate dehydrogenase complex. The International Journal of Biochemistry & Cell Biology, 110:84–90.

Murty, K. G. and Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129.

Nemhauser, G. and Wolsey, L. (1988). Integer and Combinatorial Optimization. John Wiley & Sons, Inc.

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, New York.

Parker, J. S., Mullins, M., Cheang, M. C., Leung, S., Voduc, D., Vickery, T., Davies, S., Fauron, C., He, X., Hu, Z., Quackenbush, J. F., Stijleman, I. J., Palazzo, J., Marron, J., Nobel, A. B., Mardis, E., Nielsen, T. O., and J, M. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, pages 1160–1167.

Sahinidis, N. V. (2017). BARON 17.8.9: Global Optimization of Mixed-Integer Nonlinear Programs, User's Manual.

Tawarmalani, M. and Sahinidis, N. V. (2005). A polyhedral branch-and-cut approach to global optimization. Mathe- matical Programming, 103(2):225–249.

Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, pages 6567–6572.

University of North Carolina (2020). UNC microarray database. https://genome.unc.edu/.

Voduc, K. D., Cheang, M. C., Tyldesley, S., Gelmon, K., Nielsen, T. O., and Kennecke, H. (2010). Breast cancer subtypes and the risk of local and regional relapse. Journal of Clinical Oncology, 28(10):1684–1691.

A Biconvex MINLP Proof

In this section we define the biconvex mixed-integer nonlinear programming problem and show that Problem (4) is such a problem. In Section A.1, we review the definitions related to mixed-integer nonlinear optimization. Then, in Section A.2 we define the biconvex optimization problem. Finally, in Section A.3 we extend definitions from Section A.1 and Section A.2 and apply them to Problem (4).

A.1 Mixed-Integer Nonlinear Optimization

Definition A.1 (MINLP). A mixed-integer nonlinear programming problem is of the form

    minimize_{x, y}   f(x, y)
    subject to        g(x, y) ≤ 0,                                               (7)
                      h(x, y) = 0,

where x ∈ X ⊆ R^n, y ∈ Y ⊆ Z^m, f : R^n × R^m → R, g : R^n × R^m → R^p, and h : R^n × R^m → R^q (Belotti et al., 2013).

Definition A.2 (Convex MINLP). Problem (7) is a convex MINLP if f is separable and linear, f(x, y) = c_1^T x + c_2^T y, and the constraints g_j are jointly convex in x and the relaxed integer variables y_i ∈ [0, 1] for all j and i (Kronqvist et al., 2019).

A.2 Biconvex Optimization

Definition A.3 (Biconvex Set). The set B ⊆ X × Y is called a biconvex set on X × Y, or biconvex, if B_x is convex for every x ∈ X and B_y is convex for every y ∈ Y (Gorski et al., 2007). The x- and y-sections of B are defined as B_x = {y ∈ Y : (x, y) ∈ B} and B_y = {x ∈ X : (x, y) ∈ B}, respectively.

Definition A.4 (Biconvex Function). A function f : B → R on a biconvex set B ⊆ X × Y is a biconvex function on B if f_x(·) = f(x, ·) : B_x → R is a convex function on B_x for every fixed x ∈ X and f_y(·) = f(·, y) : B_y → R is a convex function on B_y for every fixed y ∈ Y (Gorski et al., 2007).

Definition A.5 (Biconvex Optimization Problem). The problem min { f(x, y) : (x, y) ∈ B } is a biconvex optimization problem if the feasible set B is biconvex on X × Y and the objective function f is biconvex on B (Gorski et al., 2007).

A.3 Biconvex Mixed-Integer Nonlinear Optimization

Definition A.6 (Biconvex MINLP). The problem

    minimize_{x, y}   f(x, y)
    subject to        g(x, y) ≤ 0,                                               (8)
                      h(x, y) = 0

is a biconvex MINLP if the feasible set B and the objective function f are biconvex in x and the relaxed integer variables y_i ∈ [0, 1].

Theorem 2. Maximum a posteriori estimation for the Gaussian mixture model (1) with known covariance is a biconvex mixed-integer nonlinear programming optimization problem.

Proof. The structure of the proof is to first relax the binary variables such that z_ik ∈ [0, 1]. Then show the feasible set B is biconvex. Finally show the objective function is biconvex.

The relaxed maximum a posteriori estimation problem is

    min_{z, µ, π}  η Σ_{i=1}^n Σ_{k=1}^K z_ik (y_i − µ_k)² − Σ_{i=1}^n Σ_{k=1}^K z_ik log π_k
    s.t.           Σ_{k=1}^K π_k = 1,
                   Σ_{k=1}^K z_ik = 1,  i = 1, ..., n,
                   M_k^L ≤ µ_k ≤ M_k^U,  k = 1, ..., K,
                   π_k ≥ 0,  k = 1, ..., K,
                   z_ik ∈ [0, 1],  i = 1, ..., n,  k = 1, ..., K.

For the remainder we consider the range of the objective function to be the extended reals R ∪ {−∞, ∞} to simplify the case when π_k = 0 for any k.

Feasible Set is Biconvex  It is easy to see that the feasible set is convex in (π, µ) for fixed z because π is in the K-dimensional simplex and the constraints on µ are box constraints. Likewise, the constraints on z are trivially convex for fixed (π, µ).

Objective Function is Biconvex As in Problem (4), we consider η to be known. For fixed z, the objective function is the linear combination of a quadratic term in µ and a convex − log term in π. Therefore the objective function is convex in (π, µ) for fixed z. For fixed (π, µ), the objective function is linear in z and therefore convex. Therefore, the objective function is biconvex.

B Branch-and-Bound Algorithm

The general branch-and-bound algorithm is outlined in Algorithm 1.

Algorithm 1 Branch-and-Bound

Initialize the candidate priority queue to consist of the relaxed MINLP and set UBD = ∞ and GLBD = −∞
while candidate queue is not empty do
    Pop the first node off of the priority queue, solve the subproblem, and store the optimal value f*.
    if z integral then
        if f* ≤ UBD then
            Update UBD ← f*
            Remove node j from candidate queue where LBD_j > UBD ∀j
        end if
    else
        Select a branching variable z_ik according to branching strategy
        Push node for candidate relaxed subproblem j on queue adding constraint z_ik = 0 and set LBD_j = f*
        Push node for candidate relaxed subproblem j' on queue adding constraint z_ik = 1 and set LBD_j' = f*
    end if
    Set GLBD = min_j LBD_j for all j in candidate queue
end while
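For illustration, here is a minimal Python skeleton of Algorithm 1 built on a heap-based priority queue. The callables solve_relaxation (returning the relaxed optimal value and relaxed z for a given set of fixed variables), select_branch_var, and is_integral are assumptions supplied by the caller; the skeleton mirrors the control flow of Algorithm 1 and is not the solver implementation used in the experiments.

```python
import heapq
import numpy as np

def branch_and_bound(solve_relaxation, select_branch_var, is_integral, tol=1e-6):
    """Schematic of Algorithm 1; `fixed` maps (i, k) -> 0 or 1 for branched z variables."""
    ubd, incumbent = np.inf, None
    queue = [(-np.inf, 0, {})]                    # node = (LBD_j, tie-break id, fixed)
    next_id = 1
    glbd = -np.inf
    while queue:
        lbd, _, fixed = heapq.heappop(queue)
        if lbd > ubd:
            continue                              # lazy fathoming: node cannot beat incumbent
        f_star, z = solve_relaxation(fixed)       # solve the relaxed subproblem
        if is_integral(z, tol):
            if f_star <= ubd:                     # integral solution: update upper bound
                ubd, incumbent = f_star, (fixed, z)
        else:
            i, k = select_branch_var(z)           # branching strategy (e.g., most integral)
            for val in (0, 1):                    # push both children with LBD_j = f*
                child = dict(fixed)
                child[(i, k)] = val
                heapq.heappush(queue, (f_star, next_id, child))
                next_id += 1
        glbd = min((node[0] for node in queue), default=ubd)   # global lower bound
    return incumbent, ubd, glbd
```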

B.1 Branching Strategies

Though the choice of branching strategy should be tailored to the problem, there are three popular general strategies (Achterberg et al., 2005): Most Infeasible Branching — choose the integer variable with the fractional part closest to 0.5 in the relaxed subproblem solution; Pseudo Cost Branching — choose the variable that provided the greatest reduction in the objective when it was previously branched on in the tree; and Strong Branching — test a subset of the variables at both of their bounds and branch on the one that provides the greatest reduction in the objective function.

The strong branching strategy can be computationally expensive because two optimization problems must be solved for each candidate, so loose approximations such as linear programs are often used. The pseudo cost strategy does not provide much benefit early in the iterations because there is not much branching history to draw upon. Rather than using a general-purpose strategy, we have developed a branching strategy more tuned for the finite mixture model problem.

Most Integral Branching  We implemented a most integral branching strategy for the GMM MAP problem. The idea is to first find a solution to a relaxed problem where z_ik ∈ [0, 1]. Then, identify those z_ik variables that are closest to 1 — that is, variables that are most definitively assigned to one component.[2] Those z_ik variables are then chosen for branching with the expectation that the relaxed subproblem branch with the constraint that z_ik = 0 will tend to have a lower bound that is greater than the best upper bound and the subtree can be fathomed.
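A minimal sketch of the most integral branching rule, assuming the relaxed solution is available as an n × K numpy array; strictly fractional entries closest to 1 are preferred. The function name and tolerance are illustrative.

```python
import numpy as np

def most_integral_branch_var(z_relaxed, tol=1e-6):
    """Return the (i, k) index of the fractional z closest to 1 in the relaxed solution."""
    z_relaxed = np.asarray(z_relaxed, dtype=float)
    frac = (z_relaxed > tol) & (z_relaxed < 1 - tol)    # strictly fractional entries
    if not frac.any():
        return None                                      # relaxed solution already integral
    scores = np.where(frac, z_relaxed, -np.inf)          # rank fractional entries by value
    return np.unravel_index(np.argmax(scores), z_relaxed.shape)
```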

B.2 Algorithm Strategies

The general branch-and-bound algorithm (Algorithm 1) is implemented as a tree with nodes indexed in a priority queue. A node in the candidate priority queue contains a relaxed subproblem and a lower bound for the subproblem. Each branching leads to a child node subproblem. If a lower bound of a child node is greater than the best current upper bound, then the subtree is fathomed and the search space is reduced by a factor of up to one half, drastically improving the computational efficiency. Given Theorem 1 the subproblem at each node is a biconvex nonlinear programming problem.

[2] Recall that z_ik = 0 only constrains data point i to be not assigned to component k, but which of the other K − 1 components it is assigned to is not fixed.

C Comparison of Gene Expression Profiles

In our experiments 45 samples were classified differently by PAM and our MIQP method. Here we show how the samples were assigned and plots of the expression profile for the sample and the average of the expression profile of the two assignments.

Table 3: Comparison of assignments of samples that are assigned to different subtypes by PAM and MIQP.

Sample       PAM Assignment    MIQP Assignment
Sample 1     Normal            Her2
Sample 2     LumA              Her2
Sample 3     Basal             Her2
Sample 4     Normal            Her2
Sample 5     Normal            Her2
Sample 6     Normal            Basal
Sample 7     Basal             Her2
Sample 8     LumB              Her2
Sample 9     LumB              Her2
Sample 10    LumB              Her2
Sample 11    LumB              Her2
Sample 12    LumB              Her2
Sample 13    LumB              LumA
Sample 14    LumA              Basal
Sample 15    LumA              Her2
Sample 16    LumA              LumB
Sample 17    LumA              Basal
Sample 18    LumA              LumB
Sample 19    LumB              LumA
Sample 20    LumB              LumA
Sample 21    Her2              LumA
Sample 22    Basal             LumA
Sample 23    LumA              LumB
Sample 24    Normal            LumA
Sample 25    Normal            LumA
Sample 26    LumA              LumB
Sample 27    Basal             LumB
Sample 28    Her2              LumA
Sample 29    Basal             LumB
Sample 30    LumA              LumB
Sample 31    Basal             LumB
Sample 32    LumA              LumB
Sample 33    LumA              LumB
Sample 34    LumA              LumB
Sample 35    LumA              LumB
Sample 36    Normal            Basal
Sample 37    LumA              Her2
Sample 38    LumB              LumA
Sample 39    LumA              Basal
Sample 40    LumA              Her2
Sample 41    LumA              LumB
Sample 42    Normal            Her2
Sample 43    Normal            Her2
Sample 44    Normal            Her2
Sample 45    Basal             Her2

Figure 4: Comparison of gene expression profiles of samples that are assigned to different subtypes from PAM and MIQP. The x-axis is the genes in the PAM50 signature shown in Figure 3.