Divide-and-Conquer Learning by Anchoring a Conical Hull
Divide-and-Conquer Learning by Anchoring a Conical Hull

Tianyi Zhou†, Jeff Bilmes‡, Carlos Guestrin†
†Computer Science & Engineering, ‡Electrical Engineering, University of Washington, Seattle
{tianyizh, bilmes, [email protected]}

Abstract

We reduce a broad class of fundamental machine learning problems, usually addressed by EM or sampling, to the problem of finding the k extreme rays spanning the conical hull of a data point set. These k “anchors” lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the k anchors, we propose a novel divide-and-conquer learning scheme “DCA” that distributes the problem to $O(k \log k)$ same-type sub-problems on different low-D random hyperplanes, each of which can be solved independently by any existing solver. For the 2D sub-problem, we instead present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine inside other algorithms to check whether a point is covered by a conical hull, and thus improves these algorithms by providing significant speedups. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, and show its competitive performance and scalability over other methods on large datasets.

1 Introduction

Expectation-maximization (EM) [10], sampling methods [13], and matrix factorization [20, 25] are three algorithms commonly used to produce maximum likelihood (or maximum a posteriori (MAP)) estimates of models with latent variables/factors, and thus are used in a wide range of applications such as clustering, topic modeling, collaborative filtering, structured prediction, feature engineering, and time series analysis. However, their learning procedures rely on alternating optimization/updates between parameters and latent variables, a process that suffers from local optima. Hence, their quality greatly depends on initialization and on using a large number of iterations for proper convergence [24].

The method of moments [22, 6, 17], by contrast, solves $m$ equations relating the first $m$ moments of the observation $x \in \mathbb{R}^p$ to the $m$ model parameters, and thus yields a consistent estimator with a global solution. In practice, however, sample moments usually suffer from unbearably large variance, which easily leads to the failure of the final estimation, especially when $m$ or $p$ is large. Although recent spectral methods [8, 18, 15, 1] reduce $m$ to 2 or 3 when estimating $O(p) \gg m$ parameters [2], by relating the eigenspace of lower-order moments to the parameters in matrix form up to column scale, the variance of sample moments is still sensitive to large $p$ or data noise, which may result in poor estimation. Moreover, although spectral methods using SVD or tensor decomposition evidently simplify learning, the computation can still be expensive for big data. In addition, recovering a parameter matrix with uncertain column scale might not be feasible for some applications.

In this paper, we reduce the learning of a rich class of models (e.g., matrix factorization and latent variable models) to finding the extreme rays of a conical hull from a finite set of real data points. This is obtained by applying a general separability assumption either to the data matrix in matrix factorization or to the 2nd/3rd-order moments in latent variable models.
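To make the algebraic form of separability concrete before it is formalized below, here is a minimal NumPy sketch (our own illustration, not the paper's code; the sizes $n$, $p$, $k$ and the random generator are arbitrary choices) that constructs a separable matrix $X = FX_A$ whose first $k$ rows are the anchors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 20, 5                 # hypothetical sizes: n points in R^p, k anchors

XA = rng.random((k, p))              # k anchor rows (non-negative)
# Every row of X is a non-negative (conical) combination of the anchors;
# stacking an identity block on top makes the anchors rows 0..k-1 of X itself.
F = np.vstack([np.eye(k), rng.random((n - k, k))])
X = F @ XA                           # separable data matrix: X = F @ X_A
assert np.allclose(X[:k], XA)        # the anchors appear verbatim among the rows of X
```

Recovering the index set of those $k$ rows from $X$ alone is exactly the anchoring problem this paper studies.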
Separability posits that a ground set of $n$ points, given as the rows of a matrix $X$, can be represented by $X = FX_A$, where the rows (bases) in $X_A$ are a subset $A \subset V = [n]$ of the rows of $X$; these rows are called “anchors” and are of interest to various models when $|A| = k \ll n$. This property was introduced in [11] to establish the uniqueness of non-negative matrix factorization (NMF) under simplex constraints, and was later [19, 14] extended to non-negativity constraints. We generalize it further to the model $X = FY_A$ for two (possibly distinct) finite sets of points $X$ and $Y$, and build a new theory for the identifiability of $A$. This generalization enables us to apply it to more general models (ref. Table 1) besides NMF. More interestingly, it leads to a learning method with much higher tolerance to the variance of sample moments or data noise, a unique global solution, and a more interpretable model.

Another primary contribution of this paper is a distributed learning scheme, “divide-and-conquer anchoring” (DCA), for finding an anchor set $A$ such that $X = FY_A$ by solving same-type sub-problems on only $O(k \log k)$ randomly drawn low-dimensional (low-D) hyperplanes. Each sub-problem is of the form $X\Phi = F \cdot (Y\Phi)_A$ with a random projection matrix $\Phi$, and can easily be handled by most solvers due to the low dimension. This is based on the observation that the geometry of the original conical hull is partially preserved after a random projection. We analyze the probability of success for each sub-problem to recover part of $A$, and then study the number of sub-problems needed to recover the whole $A$ with high probability (w.h.p.). In particular, we propose a very fast non-iterative solver for sub-problems on the 2D plane, which only requires computing an array of cosines and its max/min values, and thus results in learning algorithms with speedups of tens to hundreds of times; a toy sketch of this idea appears at the end of this introduction.

[Figure 1: Geometry of the general minimum conical hull problem and the basic idea of divide-and-conquer anchoring (DCA).]

DCA improves multiple aspects of algorithm design since: 1) its idea of divide-and-conquer randomization gives rise to distributed learning that can reduce the original problem to multiple extremely low-D sub-problems that are much easier and faster to solve, and 2) it provides a fast subroutine, checking whether a point is covered by a conical hull, which can be embedded into other solvers. We apply both the conical hull anchoring model and DCA to five learning models: Gaussian mixture models (GMM) [27], hidden Markov models (HMM) [5], latent Dirichlet allocation (LDA) [7], NMF [20], and subspace clustering (SC) [12]. The resulting models and algorithms show significant improvements in efficiency. On generalization performance, they consistently outperform spectral methods and matrix factorization, and are comparable to or even better than EM and sampling.

In the following, we first generalize the separability assumption and the minimum conical hull problem arising from NMF in §2, and then show how to reduce more general learning models to a (general) minimum conical hull problem in §3. §4 presents a divide-and-conquer learning scheme that can quickly locate the anchors of the conical hull by solving the same problem in multiple extremely low-D spaces. Comprehensive experiments and comparisons can be found in §5.
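As a toy illustration of the divide-and-conquer idea just described, the following sketch handles the $X = Y$ (separable NMF) case: it projects the data onto random 2D planes, takes the two points of extreme angle on each plane as anchor candidates, and aggregates candidates by voting. The function name, the mean-direction reference, the voting rule, and the constant in the sub-problem count are our own simplifications, not the paper's algorithm or analysis:

```python
import numpy as np

def dca_anchors_2d(X, k, n_subproblems=None, seed=0):
    """Toy divide-and-conquer anchoring for the X = Y case: solve the
    conical-hull problem on random 2D planes and vote for anchors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # O(k log k) sub-problems, up to a constant chosen arbitrarily here.
    s = n_subproblems or int(10 * np.ceil(k * np.log(k + 1)))
    votes = np.zeros(n)
    for _ in range(s):
        Phi = rng.normal(size=(p, 2))          # random 2D hyperplane
        Z = X @ Phi                            # projected points
        ref = Z.mean(axis=0)                   # reference direction
        perp = np.array([-ref[1], ref[0]])     # ref rotated by 90 degrees
        ang = np.arctan2(Z @ perp, Z @ ref)    # signed angle w.r.t. ref
        votes[np.argmin(ang)] += 1             # the two extreme rays of the
        votes[np.argmax(ang)] += 1             # projected cone are candidates
    return np.sort(np.argsort(votes)[-k:])     # indices of the k most-voted rows
```

On the synthetic separable $X$ from the earlier sketch, `dca_anchors_2d(X, k)` typically recovers the anchor rows $0, \dots, k-1$, though this voting heuristic carries no guarantee; the paper's analysis in §4 is what bounds the number of sub-problems needed to recover all of $A$ w.h.p.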
2 General Separability Assumption and Minimum Conical Hull Problem

The original separability property [11] is defined on the convex hull of a set of data points, namely that each point can be represented as a convex combination of a certain subset of the vertices that define the convex hull. Later works on separable NMF [19, 14] extend it to the conical hull case, which replaces convex with conical combinations. Given the definition of a (convex) cone and conical hull, the separability assumption can be defined both geometrically and algebraically.

Definition 1 (Cone & conical hull). A (convex) cone is a non-empty convex set that is closed with respect to conical combinations of its elements. In particular, $\operatorname{cone}(R)$ can be defined by its $k$ generators (or rays) $R = \{r_i\}_{i=1}^{k}$ such that
$$\operatorname{cone}(R) = \left\{ \sum_{i=1}^{k} \alpha_i r_i \;\middle|\; r_i \in R,\ \alpha_i \in \mathbb{R}_+\ \forall i \right\}. \qquad (1)$$

See [29] for the original separability assumption and the equivalence between separable NMF and the minimum conical hull problem, which is defined as a submodular set cover problem.

2.1 General Separability Assumption and General Minimum Conical Hull Problem

By generalizing the separability assumption, we obtain a general minimum conical hull problem that can reduce more general learning models besides NMF, e.g., latent variable models and matrix factorization, to finding a set of “anchors” on the extreme rays of a conical hull.

Definition 2 (General separability assumption). All the $n$ data points (rows) in $X$ are covered in a finitely generated and pointed cone (i.e., if $x \in \operatorname{cone}(Y_A)$ then $-x \notin \operatorname{cone}(Y_A)$) whose generators form a subset $A \subseteq [m]$ of the data points in $Y$ such that $\nexists\, i \neq j,\ Y_{A_i} = a \cdot Y_{A_j}$. Geometrically, this says
$$\forall i \in [n],\ X_i \in \operatorname{cone}(Y_A),\quad Y_A = \{y_i\}_{i \in A}. \qquad (2)$$
An equivalent algebraic form is $X = FY_A$, where $|A| = k$ and $F \in S \subseteq \mathbb{R}_+^{(n-k)\times k}$. When $X = Y$ and $S = \mathbb{R}_+^{(n-k)\times k}$, it degenerates to the original separability assumption given in [29].

We generalize the minimum conical hull problem from [29]. Under the general separability assumption, it aims to find the anchor set $A$ from the points in $Y$ rather than in $X$.

Definition 3 (General minimum conical hull problem). Given a finite set of points $X$ and a set $Y$ with index set $V = [m]$ over its rows, the general minimum conical hull problem finds the subset of rows in $Y$ that defines a super-cone for all the rows in $X$. That is, find the $A \in 2^V$ that solves
$$\min_{A \subset V} |A|, \quad \text{s.t.}\ \operatorname{cone}(Y_A) \supseteq \operatorname{cone}(X), \qquad (3)$$
where $\operatorname{cone}(Y_A)$ is the cone induced by the rows $A$ of $Y$. When $X = Y$, this also degenerates to the original minimum conical hull problem defined in [29].

A critical question is whether/when the solution $A$ is unique. When $X = Y$ and $X = FX_A$, by following the analysis of the separability assumption in [29], we can prove that $A$ is unique and identifiable given $X$.
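The covering constraint $\operatorname{cone}(Y_A) \supseteq \operatorname{cone}(X)$ in (3) reduces to point-wise membership tests $x \in \operatorname{cone}(Y_A)$, each of which can be written as a non-negative least-squares feasibility problem. Below is a minimal sketch of this naive test (our own baseline illustration; the function name and tolerance are arbitrary, and DCA's subroutine mentioned in the introduction is a faster alternative to this check):

```python
import numpy as np
from scipy.optimize import nnls

def in_conical_hull(x, YA, tol=1e-8):
    """Check whether x lies in cone(YA), i.e. whether x is a non-negative
    combination of the rows of YA, by solving the non-negative least-squares
    problem min_{alpha >= 0} ||YA.T @ alpha - x||."""
    alpha, residual = nnls(YA.T, x)
    return residual <= tol

# Example: a point inside the cone vs. one outside it.
YA = np.array([[1.0, 0.0], [1.0, 1.0]])            # two generators in R^2
print(in_conical_hull(np.array([2.0, 0.5]), YA))   # True: 1.5*r1 + 0.5*r2
print(in_conical_hull(np.array([-1.0, 0.0]), YA))  # False: would need alpha < 0
```

Running this test for every row of $X$ against a candidate $A$ gives a brute-force feasibility check for (3); its cost per query is what motivates the faster conical-hull covering subroutine that DCA provides.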