Divide-and-Conquer Learning by Anchoring a Conical Hull

Tianyi Zhou†, Jeff Bilmes‡, Carlos Guestrin† †Computer Science & Engineering, ‡Electrical Engineering, University of Washington, Seattle {tianyizh, bilmes, guestrin}@u.washington.edu

Abstract

We reduce a broad class of fundamental machine learning problems, usually addressed by EM or sampling, to the problem of finding the k extreme rays spanning the conical hull of a data point set. These k "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the k anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to O(k log k) same-type sub-problems on different low-D random hyperplanes, each of which can be solved independently by any existing solver. For the 2D sub-problem, we instead present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other algorithms to check whether a point is covered by a conical hull, and thus improves those algorithms with significant speedups. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, and show its competitive performance and scalability over other methods on large datasets.

1 Introduction

Expectation-maximization (EM) [10], sampling methods [13], and matrix factorization [20, 25] are three algorithms commonly used to produce maximum likelihood (or maximum a posteriori (MAP)) estimates of models with latent variables/factors, and thus are used in a wide range of applications such as clustering, topic modeling, collaborative filtering, structured prediction, feature engineering, and time series analysis. However, their learning procedures rely on alternating optimization/updates between parameters and latent variables, a process that suffers from local optima. Hence, their quality greatly depends on initialization and on using a large number of iterations for proper convergence [24]. The method of moments [22, 6, 17], by contrast, solves m equations relating the first m moments of the observation x ∈ ℝ^p to the m model parameters, and thus yields a consistent estimator with a global solution. In practice, however, sample moments usually suffer from unbearably large variance, which easily leads to failure of the final estimation, especially when m or p is large. Although recent spectral methods [8, 18, 15, 1] reduce m to 2 or 3 when estimating O(p) ≫ m parameters [2] by relating the eigenspace of lower-order moments to the parameters in matrix form up to column scale, the variance of the sample moments is still sensitive to large p or data noise, which may result in poor estimation. Moreover, although spectral methods using SVDs or tensor decomposition evidently simplify learning, the computation can still be expensive for big data. In addition, recovering a parameter matrix with uncertain column scale might not be feasible for some applications. In this paper, we reduce learning in a rich class of models (e.g., matrix factorization and latent variable models) to finding the extreme rays of a conical hull from a finite set of real data points. This is obtained by applying a general separability assumption to either the data matrix in matrix factorization or the 2nd/3rd order moments in latent variable models. Separability posits that a ground set of n points, given as the rows of a matrix X, can be represented by X = F X_A, where the rows (bases) in X_A are a subset A ⊂ V = [n] of the rows of X; these rows are called "anchors" and are of interest to various

models when |A| = k ≪ n. This property was introduced in [11] to establish the uniqueness of non-negative matrix factorization (NMF) under simplex constraints, and was later [19, 14] extended to non-negative constraints. We generalize it further to the model X = F Y_A for two (possibly distinct) finite sets of points X and Y, and build a new theory for the identifiability of A. This generalization enables us to apply it to more general models (see Table 1) besides NMF. More interestingly, it leads to a learning method with much higher tolerance to the variance of sample moments or data noise, a unique global solution, and a more interpretable model.

Figure 1: Geometry of the general minimum conical hull problem and the basic idea of divide-and-conquer anchoring (DCA).

Another primary contribution of this paper is a distributed learning scheme, "divide-and-conquer anchoring" (DCA), for finding an anchor set A such that X = F Y_A by solving same-type sub-problems on only O(k log k) randomly drawn low-dimensional (low-D) hyperplanes. Each sub-problem has the form XΦ = F · (YΦ)_Ã for a random projection matrix Φ, and can easily be handled by most solvers due to the low dimension. This is based on the observation that the geometry of the original conical hull is partially preserved after a random projection. We analyze the probability of success for each sub-problem to recover part of A, and then study the number of sub-problems needed to recover the whole A with high probability (w.h.p.). In particular, we propose a very fast non-iterative solver for sub-problems on the 2D plane, which only requires computing an array of cosines and its max/min values, and thus results in learning algorithms with speedups of tens to hundreds of times. DCA improves multiple aspects of algorithm design: 1) its divide-and-conquer randomization gives rise to distributed learning that reduces the original problem to multiple extremely low-D sub-problems that are much easier and faster to solve; and 2) it provides a fast subroutine, checking whether a point is covered by a conical hull, which can be embedded into other solvers. We apply both the conical hull anchoring model and DCA to five learning models: Gaussian mixture models (GMM) [27], hidden Markov models (HMM) [5], latent Dirichlet allocation (LDA) [7], NMF [20], and subspace clustering (SC) [12]. The resulting models and algorithms show significant improvements in efficiency. On generalization performance, they consistently outperform spectral methods and matrix factorization, and are comparable to or even better than EM and sampling.

In the following, we first generalize the separability assumption and the minimum conical hull problem arising from NMF in §2, and then show how to reduce more general learning models to a (general) minimum conical hull problem in §3. §4 presents a divide-and-conquer learning scheme that can quickly locate the anchors of the conical hull by solving the same problem in multiple extremely low-D spaces. Comprehensive experiments and comparisons can be found in §5.

2 General Separability Assumption and Minimum Conical Hull Problem

The original separability property [11] is defined on the convex hull of a set of data points, namely that each point can be represented as a convex combination of a certain subset of vertices that define the convex hull. Later works on separable NMF [19, 14] extend it to the conical hull case, replacing convex combinations with conical combinations.
Given the definition of (convex) cone and conical hull, the separability assumption can be defined both geometrically and algebraically.

Definition 1 (Cone & conical hull). A (convex) cone is a non-empty set that is closed with respect to conical combinations of its elements. In particular, cone(R) can be defined by its k generators (or rays) R = {r_i}_{i=1}^k such that

cone(R) = { Σ_{i=1}^k α_i r_i | r_i ∈ R, α_i ∈ ℝ_+ ∀i }.   (1)
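To make Definition 1 concrete, the following short NumPy sketch (our own illustration, not from the paper) builds a point of cone(R) as a conical combination of three generators in ℝ^3 per (1), and certifies membership by recovering nonnegative weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three generators (rays) r_1, r_2, r_3 in R^3, stored as the rows of R.
R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# A point of cone(R) is a conical combination sum_i alpha_i r_i with alpha_i >= 0, as in (1).
alpha = rng.uniform(0.0, 2.0, size=3)   # nonnegative weights
x = alpha @ R                           # x lies in cone(R)

# Since these generators are linearly independent, the weights are recovered by a
# 3x3 linear solve; their nonnegativity certifies that x is inside the cone.
alpha_hat = np.linalg.solve(R.T, x)
print(np.allclose(alpha_hat, alpha), bool(np.all(alpha_hat >= 0)))
```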

See [29] for the original separability assumption and the equivalence between separable NMF and the minimum conical hull problem, which is defined as a submodular set cover problem.

2.1 General Separability Assumption and General Minimum Conical Hull Problem

By generalizing the separability assumption, we obtain a general minimum conical hull problem that can reduce more general learning models besides NMF, e.g., latent variable models and matrix factorization, to finding a set of “anchors” on the extreme rays of a conical hull.

Definition 2 (General separability assumption). All n data points (rows) in X are covered in a finitely generated and pointed cone (i.e., if x ∈ cone(Y_A) then −x ∉ cone(Y_A)) whose generators form a subset A ⊆ [m] of the data points in Y such that ∄ i ≠ j with Y_{A_i} = a · Y_{A_j}. Geometrically, this says

∀ i ∈ [n], X_i ∈ cone(Y_A),  Y_A = {y_i}_{i∈A}.   (2)

An equivalent algebraic form is X = F Y_A, where |A| = k and F ∈ S ⊆ ℝ_+^{n×k}.

When X = Y and S = ℝ_+^{(n−k)×k}, this degenerates to the original separability assumption given in [29]. We generalize the minimum conical hull problem from [29]: under the general separability assumption, it aims to find the anchor set A from the points in Y rather than X.

Definition 3 (General minimum conical hull problem). Given a finite set of points X and a set Y with index set V = [m] over its rows, the general minimum conical hull problem finds the subset of rows in Y that define a super-cone for all rows in X. That is, find A ∈ 2^V that solves:

min_{A⊂V} |A|,  s.t.  cone(Y_A) ⊇ cone(X),   (3)

where cone(Y_A) is the cone induced by the rows A of Y.

When X = Y, this also degenerates to the original minimum conical hull problem defined in [29]. A critical question is whether/when the solution A is unique. When X = Y and X = F X_A, by following the analysis of the separability assumption in [29], we can prove that A is unique and identifiable given X. However, when X ≠ Y and X = F Y_A, there could clearly be multiple legal choices of A (e.g., there could be multiple layers of conical hulls containing a conical hull that covers all points in X). Fortunately, when the rows of Y are rank-one matrices after vectorization (concatenating all columns into a long vector), which is the common case in most latent variable models in §3.2, A can be uniquely determined once X has at least two rows.

Lemma 1 (Identifiability). If X = F Y_A with the additional structure Y_s = vec(O_i^s ⊗ O_j^s), where O_i is a p_i × k matrix and O_i^s is its s-th column, then under the general separability assumption in Definition 2, two (non-identical) rows of X are sufficient to exactly recover the unique A, O_i and O_j.

See [29] for the proof and additional uniqueness conditions when applied to latent variable models.

3 Minimum Conical Hull Problem for General Learning Models

Table 1: Summary of reducing NMF, SC, GMM, HMM and LDA to a conical hull anchoring model X = F Y_A in §3, and their learning algorithms achieved by A = DCA(X, Y, k, M) in Algorithm 1. The minimal conical hull A = MCH(X, Y) is defined in Definition 4. vec(·) denotes the vectorization of a matrix. For GMM and HMM, X_i ∈ ℝ^{n×p_i} is the data matrix for view i (i.e., a subset of features) and for the i-th observation of all triples of sequential observations, respectively. X_{t,i} is the t-th row of X_i and associates with point/triple t. η_t is a vector uniformly drawn from the unit sphere. More details are given in [29].

Model | X in conical hull problem | Y in conical hull problem | k in conical hull problem
NMF | data matrix X ∈ ℝ_+^{n×p} | Y := X | # of factors
SC | data matrix X ∈ ℝ^{n×p} | Y := X | # of bases from all clusters
GMM | [vec[X_1^T X_2]; vec[X_1^T Diag(X_3 η_t) X_2]_{t∈[q]}]/n | [vec(X_{t,1} ⊗ X_{t,2})]_{t∈[n]} | # of components/clusters
HMM | [vec[X_2^T X_3]; vec[X_2^T Diag(X_1 η_t) X_3]_{t∈[q]}]/n | [vec(X_{t,2} ⊗ X_{t,3})]_{t∈[n]} | # of hidden states
LDA | word-word co-occurrence matrix X ∈ ℝ_+^{p×p} | Y := X | # of topics

Algo | Each sub-problem in DCA | Post-processing after A := ∪_i Ã^i | Interpretation of anchors indexed by A
NMF | Ã = MCH(XΦ, XΦ), can be solved by (10) | solving F in X = F X_A | bases X_A are real data points
SC | Ã = anchors of clusters from meanshift over the angles ∠(XΦ, ϕ) | clustering anchors X_A | cluster i is a cone cone(X_{A_i})
GMM | Ã = MCH(XΦ, YΦ), can be solved by (10) | N/A | centers [X_{A,i}]_{i∈[3]} from real data
HMM | Ã = MCH(XΦ, YΦ), can be solved by (10) | solving T in OT = X_{A,3} | emission matrix O = X_{A,2}
LDA | Ã = MCH(XΦ, XΦ), can be solved by (10) | column-normalize {F : X = F X_A} | anchor word for topic i (topic prob. F_i)

In this section, we discuss how to reduce the learning of general models such as matrix factorization and latent variable models to the (general) minimum conical hull problem. Five examples are given in Table 1 to show how this general technique can be applied to specific models.

3.1 Matrix Factorization

Besides NMF, we consider more general matrix factorization (MF) models that can operate on negative features and specify a complicated structure for F. The MF X = FW is a deterministic latent variable model where F and W are deterministic latent factors. By assigning a likelihood p(X_{i,j} | F_i, (W^T)_j) and priors p(F) and p(W), its optimization model can be derived from maximum

likelihood or MAP estimation. The resulting objective is usually a loss function ℓ(·) of X − FW plus regularization terms for F and W, i.e., min ℓ(X, FW) + R_F(F) + R_W(W). Similar to separable NMF, minimizing the objective of a general MF can be reduced to a minimum conical hull problem that selects the subset A with X = F X_A. In this setting, R_W(W) = Σ_{i=1}^k g(W_i), where g(w) = 0 if w = X_i for some i and g(w) = ∞ otherwise. This is equivalent to applying to each row of W a prior p(W_i) whose finite support is the set of rows of X. In addition, the regularization of F can be transformed into geometric constraints between the points in X and those in X_A. Since F_{i,j} is the conical combination weight of X_{A_j} in recovering X_i, a large F_{i,j} intuitively indicates a small angle between X_{A_j} and X_i, and vice versa. For example, the sparse and graph Laplacian priors on the rows of F in subspace clustering can be reduced to "cone clustering" for finding A. See [29] for an example of reducing subspace clustering to the general minimum conical hull problem.
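As a concrete illustration of the post-processing step "solving F in X = F X_A" listed in Table 1, the sketch below (our own hypothetical code; recover_F and the toy data are not from the paper) estimates the nonnegative weights F by one nonnegative least-squares problem per row once the anchors are known.

```python
import numpy as np
from scipy.optimize import nnls

def recover_F(X, anchors):
    """Estimate F >= 0 with X ~= F X_A by solving one nonnegative
    least-squares problem per row of X (a sketch, not the authors' code)."""
    XA = X[anchors]                              # k x p anchor rows
    F = np.zeros((X.shape[0], len(anchors)))
    for i, xi in enumerate(X):
        F[i], _ = nnls(XA.T, xi)                 # min_{f >= 0} ||XA^T f - x_i||_2
    return F

# Toy usage: 2 anchors in R^3 and 4 points generated as their conical combinations.
rng = np.random.default_rng(1)
XA_true = rng.uniform(size=(2, 3))
F_true = rng.uniform(size=(4, 2))
X = np.vstack([XA_true, F_true @ XA_true])       # rows 0-1 are the anchors themselves
F_hat = recover_F(X, anchors=[0, 1])
print(np.allclose(F_hat[2:], F_true, atol=1e-6))
```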

3.2 Latent Variable Model

Different from deterministic MF, we build a system of equations from the moments of probabilistic latent variable models and then formulate it as a general minimum conical hull problem, rather than solving it directly. Let the generative model be h ∼ p(h; α) and x ∼ p(x|h; θ), where h is a latent variable, x stands for the observation, and {α, θ} are parameters. In a variety of graphical models such as GMMs and HMMs, we need to model conditional independence between groups of features. This is also known as the multi-view assumption. W.l.o.g., we assume that x is composed of three groups (views) of features {x_i}_{i∈[3]} such that ∀ i ≠ j, x_i ⊥⊥ x_j | h. We further assume the dimension k of h is smaller than p_i, the dimension of x_i. Since the goal is learning {α, θ}, decomposing the moments of x rather than the data matrix X lets us get rid of the latent variable h and thus avoid alternating minimization between {α, θ} and h. When E(x_i|h) = O_i h (linearity assumption), the second and third order moments can be written in the form of a matrix operator:

E(x_i ⊗ x_j) = E[E(x_i|h) ⊗ E(x_j|h)] = O_i E(h ⊗ h) O_j^T,
E(x_i ⊗ x_j · ⟨η, x_l⟩) = O_i [E(h ⊗ h ⊗ h) ×_3 (O_l η)] O_j^T,   (4)

where A ×_n U denotes the n-mode product of a tensor A by a matrix U, ⊗ is the outer product, and the operator parameter η can be any vector. We will mainly focus on models in which {α, θ} can be exactly recovered from the conditional mean vectors {O_i}_{i∈[3]} and E(h ⊗ h)¹, because they cover the most popular models such as GMMs and HMMs in real applications. The left hand sides (LHS) of both equations in (4) can be directly estimated from training data, while their right hand sides (RHS) can be written in a unified matrix form O_i D O_j^T with O_i ∈ ℝ^{p_i×k} and D ∈ ℝ^{k×k}. By using different η, we can obtain 2 ≤ q ≤ p_l + 1 independent equations, which compose a system of equations for O_i and O_j. Given the LHS, we can obtain the column spaces of O_i and O_j, which respectively equal the column and row space of O_i D O_j^T, a low-rank matrix when p_i > k. In order to further determine O_i and O_j, our discussion falls into two cases depending on the type of D.

When D is a diagonal matrix. This happens when ∀ i ≠ j, E(h_i h_j) = 0. A common example is when h is a label/state indicator such that h = e_i for class/state i, e.g., h in GMM and HMM. In this case, the two D matrices in the RHS of (4) are

E(h ⊗ h) = Diag(E(h_i^2)),
E(h ⊗ h ⊗ h) ×_3 (O_l η) = Diag(E(h_i^3) · O_l η),   (5)

where E(h_i^t) denotes the vector [E(h_1^t), ..., E(h_k^t)]. So either matrix in the LHS of (4) can be written as a sum of k rank-one matrices, i.e., Σ_{s=1}^k σ^{(s)} O_i^s ⊗ O_j^s, where O_i^s is the s-th column of O_i.

The general separability assumption posits that the set of k rank-one basis matrices constructing the RHS of (4) is a unique subset A ⊆ [n] of the n samples of x_i ⊗ x_j constructing the left hand sides, i.e., O_i^s ⊗ O_j^s = [x_i ⊗ x_j]_{A_s} = X_{A_s,i} ⊗ X_{A_s,j}, the outer product of x_i and x_j in the A_s-th data point.
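A quick numerical sanity check of the reasoning behind (5), under the assumption stated above that h is a one-hot state indicator (our own illustration, not the paper's code): the empirical second moment of h is then exactly diagonal, with diagonal entries equal to the state probabilities.

```python
import numpy as np

# Empirical check that E(h (x) h) is diagonal when h is a one-hot state indicator.
rng = np.random.default_rng(0)
k, n, p = 4, 200000, [0.1, 0.2, 0.3, 0.4]
H = np.eye(k)[rng.choice(k, size=n, p=p)]   # each row is h = e_i for the sampled state i
M2 = H.T @ H / n                            # empirical E(h (x) h)
# Off-diagonal entries are exactly zero (h_i h_j = 0 for i != j on every sample);
# the diagonal approaches [E(h_i^2)] = p as n grows.
print(np.round(M2, 3))
```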

¹Note that our method can also handle more complex models that violate the linearity assumption and need higher-order moments for parameter estimation. By replacing x_i in (4) with vec(x_i^{⊗n}), the vectorization of the n-th tensor power of x_i, O_i can contain n-th order moments of p(x_i|h; θ). However, since higher order moments are either not necessary or difficult to estimate due to high sample complexity, we do not study them in this paper.

Therefore, by applying q − 1 different η to (4), we obtain a system of q equations of the following form, where Y^{(t)} is the estimate of the LHS of the t-th equation from training data:

∀ t ∈ [q], Y^{(t)} = Σ_{s=1}^k σ_{t,s} [x_i ⊗ x_j]_{A_s}  ⇔  [vec(Y^{(t)})]_{t∈[q]} = σ [vec(X_{t,i} ⊗ X_{t,j})]_{t∈A}.   (6)

The right equation in (6) is an equivalent matrix representation of the left one. Its LHS is a q × p_i p_j matrix, and its RHS is the product of a q × k matrix σ and a k × p_i p_j matrix. By letting X ← [vec(Y^{(t)})]_{t∈[q]}, F ← σ and Y ← [vec(X_{t,i} ⊗ X_{t,j})]_{t∈[n]}, we can fit (6) to X = F Y_A in Definition 2. Therefore, learning {O_i}_{i∈[3]} is reduced to selecting the k rank-one matrices from {X_{t,i} ⊗ X_{t,j}}_{t∈[n]} indexed by A whose conical hull covers the q matrices {Y^{(t)}}_{t∈[q]}. Given the anchor set A, we have Ô_i = X_{A,i} and Ô_j = X_{A,j} by assigning the real data points indexed by A to the columns of O_i and O_j. Given O_i and O_j, σ can be estimated by solving (6). In many models, a few rows of σ are sufficient to recover α. See [29] for a practical acceleration trick based on matrix completion.

When D is a symmetric matrix with nonzero off-diagonal entries. This happens in "admixture" models: e.g., h can be a general binary vector h ∈ {0, 1}^k or a vector on the probability simplex, and the conditional mean E(x_i|h) is a mixture of columns of O_i. The best known example is LDA, in which each document is generated by multiple topics. We apply the general separability assumption by using only the first equation in (4), and treating the matrix on its LHS as X in X = F X_A. When the data are extremely sparse, which is common for text data, selecting the rows of the denser second order moment as bases is a more reasonable and effective assumption than selecting sparse data points. In this case, the p rows of F contain the k unit vectors {e_i}_{i∈[k]}. This leads to the natural "anchor word" assumption for LDA [3]. See [29] for the examples of reducing the multi-view mixture model, HMM, and LDA to the general minimum conical hull problem. It is also worth noting that, when applied to LDA, our method yields results equal to but faster than a Bayesian inference method [3]; see Theorem 4 in [29].
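To make the reduction concrete, here is a sketch (our own illustration; the function and variable names are not from the paper) of how X and Y in X = F Y_A could be assembled for the three-view case, following the GMM row of Table 1 and (6).

```python
import numpy as np

def build_conical_hull_inputs(X1, X2, X3, q, rng=None):
    """Assemble the conical hull problem of Table 1 (GMM row) / eq. (6).
    X1, X2, X3: n x p_i data matrices of the three views.
    Returns Xcone, the q+1 vectorized moment estimates, and Ycone, the n
    vectorized rank-one candidates vec(X_{t,1} (x) X_{t,2})."""
    rng = rng or np.random.default_rng(0)
    n = X1.shape[0]
    rows = [(X1.T @ X2).ravel() / n]                       # 2nd-order cross moment
    for _ in range(q):
        eta = rng.normal(size=X3.shape[1])
        eta /= np.linalg.norm(eta)                         # eta drawn from the unit sphere
        rows.append((X1.T @ np.diag(X3 @ eta) @ X2).ravel() / n)   # a 3rd-order slice
    Xcone = np.vstack(rows)                                # (q+1) x (p1*p2)
    Ycone = np.vstack([np.outer(X1[t], X2[t]).ravel() for t in range(n)])  # n x (p1*p2)
    return Xcone, Ycone
```

Running DCA on (Xcone, Ycone) then returns the anchor indices A, from which Ô_i = X_{A,i} as described above.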

4 Algorithms for Minimum Conical Hull Problem

4.1 Divide-and-Conquer Anchoring (DCA) for General Minimum Conical Hull Problems

The key insights of DCA come from two observations on the geometry of the conical hull. First, projecting a conical hull to a lower-D hyperplane partially preserves its geometry. This enables us to distribute the original problem to a few much smaller sub-problems, each handled by a solver for the minimum conical hull problem. Second, there exists a very fast anchoring algorithm for a sub-problem on the 2D plane, which only picks two anchor points based on their angles to an axis, without iterative optimization or greedy pursuit. This results in a significantly efficient DCA algorithm that can be used by itself, or embedded as a subroutine checking whether a point is covered by a conical hull.

4.2 Distributing the Conical Hull Problem to Sub-problems in Low Dimensions

Due to the convexity of cones, a low-D projection of a conical hull is still a conical hull; it covers the projections of the same points covered in the original conical hull, and is generated by the projections of a subset of the anchors on the extreme rays of the original conical hull.

Lemma 2. For an arbitrary point x ∈ cone(Y_A) ⊂ ℝ^p, where A is the index set of the k anchors (generators) selected from Y, and for any Φ ∈ ℝ^{p×d} with d ≤ p, we have

∃ Ã ⊆ A : xΦ ∈ cone(Y_Ã Φ).   (7)

Since only a subset of A remains as anchors after projection, solving a minimum conical hull problem on a single low-D hyperplane rarely returns all the anchors in A. However, the whole set A can be recovered from the anchors detected on multiple low-D hyperplanes. By sampling the projection matrix Φ from a random ensemble M, it can be proved that w.h.p. solving only s = O(ck log k) sub-problems is sufficient to find all anchors in A. Note that c/k is the lower bound on the angle α − 2β in Theorem 1, so a large c indicates a less flat conical hull. See [29] for our method's robustness to failures in identifying "flat" anchors.

For the special case of NMF where X = F X_A, the above result is proven in [28]. However, the analysis cannot be trivially extended to the general conical hull problem where X = F Y_A (see Figure 1). A critical reason is that the converse of Lemma 2 does not hold: the uniqueness of the anchor set Ã on a low-D hyperplane could be violated, because non-anchors in Y may have a non-zero probability of being projected as low-D anchors. Fortunately, we can achieve a unique Ã by defining a "minimal conical hull" on a low-D hyperplane. Proposition 1 then reveals when, w.h.p., such an Ã is a subset of A.

Definition 4 (Minimal conical hull). Given two sets of points (rows) X and Y, the conical hull spanned by anchors (generators) Y_A is the minimal conical hull covering all points in X iff for all

{i, j, s} ∈ { (i, j, s) | i ∈ A^C = [m] \ A, j ∈ A, s ∈ [n], X_s ∈ cone(Y_A) ∩ cone(Y_{i∪(A\j)}) }   (8)

we have ∠(X_s, Y_i) > ∠(X_s, Y_j), where ∠(x, y) denotes the angle between two vectors x and y. The solution of the minimal conical hull problem is denoted by A = MCH(X, Y).

Algorithm 1 DCA(X, Y, k, M)
Input: two sets of points (rows) X ∈ ℝ^{n×p} and Y ∈ ℝ^{m×p} in matrix form (see Table 1 for X and Y in different models), the number of latent factors/variables k, and a random matrix ensemble M;
Output: anchor set A ⊆ [m] such that ∀ i ∈ [n], X_i ∈ cone(Y_A);
Divide step (in parallel):
for t = 1 → s := O(k log k) do
    Randomly draw a matrix Φ ∈ ℝ^{p×d} from M;
    Solve the sub-problem Ã_t = MCH(XΦ, YΦ) by any solver, e.g., (10);
end for
Conquer step:
∀ i ∈ [m], compute ĝ(Y_i) = (1/s) Σ_{t=1}^s 1_{Ã_t}(Y_i);
Return A as the index set of the k points with the largest ĝ(Y_i).

It is easy to verify that the minimal conical hull is unique, and that the general minimum conical hull problem X = F Y_A under the general separability assumption (which leads to the identifiability of A) is a special case of A = MCH(X, Y). In DCA, on each low-D hyperplane H_i, the associated sub-problem aims to find the anchor set Ã^i = MCH(XΦ^i, YΦ^i). The following proposition gives the probability that Ã^i ⊆ A in a sub-problem's solution.

Proposition 1 (Probability of success in a sub-problem). As depicted in Figure 2, A_i ∈ A signifies an anchor point in Y_A, C_i signifies a point in X ∈ ℝ^{n×p}, B_i ∈ A^C signifies a non-anchor point in Y ∈ ℝ^{m×p}, the green ellipse marks the intersection hyperplane between cone(Y_A) and the unit sphere S^{p−1}, and the superscript ·′ denotes the projection of a point onto the intersection hyperplane. Define d-dim (d ≤ p) hyperplanes {H_i}_{i∈[4]} such that A′_3A′_2 ⊥ H_1, A′_1A′_2 ⊥ H_2, B′_1A′_2 ⊥ H_3, B′_1C′_1 ⊥ H_4; let α = ∠(H_1, H_2) be the angle between hyperplanes H_1 and H_2, and β = ∠(H_3, H_4) the angle between H_3 and H_4. If H, with associated projection matrix Φ ∈ ℝ^{p×d}, is a d-dim hyperplane uniformly drawn from the Grassmannian manifold Gr(d, p), and Ã = MCH(XΦ, YΦ) is the solution to the minimal conical hull problem, then

Pr(B_1 ∈ Ã) = β / (2π),   Pr(A_2 ∈ Ã) = (α − β) / (2π).   (9)

Figure 2: Illustration of the construction in Proposition 1.

See [29] for the proof, discussion, and an analysis of robustness to unimportant "flat" anchors and data noise.

Theorem 1 (Probability bound). Following the same notation as Proposition 1, suppose p** = min_{{A_1,A_2,A_3,B_1,C_1}} (α − 2β) ≥ c/k > 0. Then, with probability at least 1 − k exp(−cs/(3k)), DCA successfully identifies all k anchors in A, where s is the number of sub-problems solved.

See [29] for the proof. Given Theorem 1, we immediately obtain the following corollary on the number of sub-problems that guarantees the success of DCA in finding A.

Corollary 1 (Number of sub-problems). With probability 1 − δ, DCA correctly recovers the anchor set A by solving Ω((3k/c) log(k/δ)) sub-problems.

See [29] for the idea of divide-and-conquer randomization in DCA and its advantage over methods based on the Johnson-Lindenstrauss (JL) lemma.
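To read Corollary 1 concretely (treating the hidden constant as exactly 3k/c, which it need not be in practice), a hypothetical setting with k = 10 anchors, flatness constant c = 1, and failure probability δ = 0.01 requires roughly

s = (3k/c) log(k/δ) = 30 · log(1000) ≈ 30 × 6.9 ≈ 207

2D sub-problems, i.e., a few hundred random projections regardless of the ambient dimension p.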

4.3 Anchoring on the 2D Plane

Although DCA can invoke any solver for the sub-problem on any low-D hyperplane, a very fast solver for the 2D sub-problem consistently shows high accuracy in locating anchors when embedded into DCA. Its motivation comes from the geometry of a conical hull on a 2D plane, which is a special case of the d-dim hyperplane H in a DCA sub-problem. It leads to a non-iterative algorithm for A = MCH(X, Y) on the 2D plane that only requires computing n + m cosine values, finding the min/max of the n values, and comparing the remaining m values with that min/max.

According to Figure 1, the two anchors Y_Ã Φ on a 2D plane are, among the points in YΦ, the one with the minimum angle (to either axis) that is still larger than all angles of points in XΦ, and the one with the maximum angle that is still smaller than all angles of points in XΦ. This leads to the following closed form for Ã.

Ã = { argmin_{i∈[m]} ( ∠(Y_iΦ, ϕ) − max_{j∈[n]} ∠(X_jΦ, ϕ) )_+ ,  argmin_{i∈[m]} ( min_{j∈[n]} ∠(X_jΦ, ϕ) − ∠(Y_iΦ, ϕ) )_+ },   (10)

where (x)_+ = x if x ≥ 0 and ∞ otherwise, and ϕ can be either the vertical or the horizontal axis of the 2D plane. By plugging (10) into DCA as the solver for the s sub-problems on random 2D planes, we obtain an extremely fast learning algorithm. Note that for the special case X = Y, (10) degenerates to finding the two points in XΦ with the smallest and largest angles to an axis ϕ, i.e., Ã = { argmin_{i∈[n]} ∠(X_iΦ, ϕ), argmax_{i∈[n]} ∠(X_iΦ, ϕ) }. This is used for matrix factorization and for latent variable models with nonzero off-diagonal D. See [29] for how to embed DCA as a fast subroutine into other methods, and for detailed off-the-shelf DCA algorithms for NMF, SC, GMM, HMM and LDA; a brief summary is given in Table 1.
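The following Python sketch (our own illustration; function names and the toy data are not from the paper) combines Algorithm 1 with the X = Y special case of (10): each sub-problem projects the rows onto a random 2D plane and keeps the two points with extreme angles, and the conquer step returns the k most frequently detected indices. As an implementation choice, we measure signed angles relative to the mean projected direction, which selects the same extreme points as a fixed axis whenever the projected cone is pointed, while avoiding sign ambiguities.

```python
import numpy as np

def anchors_2d(P):
    """Special case of (10) for X = Y: on a 2D plane, return the indices of the two
    points with the extreme angles (measured relative to the mean projected direction,
    which matches the extreme rays whenever the projected cone is pointed)."""
    U = P / np.linalg.norm(P, axis=1, keepdims=True)
    r = U.mean(axis=0)
    r /= np.linalg.norm(r)
    theta = np.arctan2(U[:, 0] * r[1] - U[:, 1] * r[0], U @ r)   # signed angle w.r.t. r
    return {int(np.argmin(theta)), int(np.argmax(theta))}

def dca(X, k, s, rng=None):
    """Algorithm 1 (divide-and-conquer anchoring) for the X = Y case,
    using the 2D solver above; a sketch, not the authors' implementation."""
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(X.shape[0])
    for _ in range(s):                                   # divide step
        Phi = rng.normal(size=(X.shape[1], 2))           # random 2D projection
        for i in anchors_2d(X @ Phi):
            counts[i] += 1
    return np.argsort(-counts)[:k]                       # conquer step: top-k frequencies

# Toy separable data: 5 anchors in R^30 and 200 conical combinations of them.
rng = np.random.default_rng(0)
XA = rng.uniform(size=(5, 30))
F = rng.uniform(size=(200, 5))
X = np.vstack([XA, F @ XA])                              # the true anchors are rows 0..4
print(sorted(int(i) for i in dca(X, k=5, s=100, rng=rng)))   # w.h.p.: [0, 1, 2, 3, 4]
```

For the general X ≠ Y case (e.g., the GMM construction of §3.2), the same frame applies with anchors_2d replaced by a 2D solver of MCH(XΦ, YΦ) implementing (10) directly.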

5 Experiments

See [29] for the complete experimental section with results of DCA for NMF, SC, GMM, HMM, and LDA, and comparisons to other methods on more synthetic and real datasets.

Figure 3: Separable NMF on a randomly generated 300 × 500 matrix; each point on each curve is the average of 10 independent random trials. SFO is a greedy algorithm for the submodular set cover problem; LP-test is the backward removal algorithm from [4]. Left: accuracy of anchor detection (higher is better). Middle: negative relative ℓ2 recovery error of anchors (higher is better). Right: CPU seconds.

Figure 4: Clustering accuracy (higher is better) and CPU seconds vs. number of clusters for the Gaussian mixture model on the CMU-PIE (left) and YALE (right) human face datasets. We randomly split the raw pixel features into 3 groups, each associated with a view in our multi-view model.

Figure 5: Likelihood (higher is better) and CPU seconds vs. number of states for an HMM modeling the stock prices of 2 companies from 01/01/1995 to 05/18/2014, collected from Yahoo Finance. Since no ground truth labels are given, we measure likelihood on the training data.

DCA for Non-negative Matrix Factorization on Synthetic Data. The experimental comparison results are shown in Figure 3. The greedy algorithms SPA [14], XRAY [19] and SFO achieve the best

accuracy and smallest recovery error when the noise level is above 0.2, but XRAY and SFO are the two slowest. SPA is slightly faster but still much slower than DCA. DCA with different numbers of sub-problems shows slightly lower accuracy than the greedy algorithms, but the difference is acceptable; considering its significant acceleration, DCA offers an advantageous trade-off. LP-test [4] has an exact solution guarantee, but it is not robust to noise and is too slow. Therefore, DCA provides a much faster and more practical NMF algorithm with performance comparable to the best ones.

DCA for Gaussian Mixture Model on CMU-PIE and YALE Face Datasets. The experimental comparison results are shown in Figure 4. DCA consistently outperforms the other methods (k-means, EM, spectral method [1]) on accuracy, and shows speedups in the range 20-2000. Increasing the number of sub-problems improves the accuracy of DCA. Note that the number of pixels in the face images always exceeds 1000, which results in slow computation of the pairwise distances required by the other clustering methods. DCA is the fastest because the number of sub-problems s = O(k log k) does not depend on the feature dimension, so merely 171 2D random projections are sufficient for a promising clustering result. The spectral method performs worse than DCA due to the large variance of the sample moments; DCA uses the separability assumption in estimating the eigenspace of the moments, and thus effectively reduces this variance.

Table 2: Motion prediction accuracy (higher is better) on the test set for 6 motion capture sequences from the CMU-mocap dataset. The motion of each frame is manually labeled by the authors of [16]. In the table, s13s29(39/63) means that we split sequence 29 of subject 13 into sub-sequences of 63 frames each, of which the first 39 are used for training and the rest for testing. Time is measured in ms.

Sequence | s13s29(39/63) | s13s30(25/51) | s13s31(25/50) | s14s06(24/40) | s14s14(29/43) | s14s20(29/43)
Measure | Acc / Time | Acc / Time | Acc / Time | Acc / Time | Acc / Time | Acc / Time
Baum-Welch (EM) | 0.50 / 383 | 0.50 / 140 | 0.46 / 148 | 0.34 / 368 | 0.62 / 529 | 0.77 / 345
Spectral Method | 0.20 / 80 | 0.25 / 43 | 0.13 / 58 | 0.29 / 66 | 0.63 / 134 | 0.59 / 70
DCA-HMM (s=9) | 0.33 / 3.3 | 0.92 / 1 | 0.19 / 1.5 | 0.29 / 4.8 | 0.79 / 3 | 0.28 / 3
DCA-HMM (s=26) | 0.50 / 3.3 | 1.00 / 1 | 0.65 / 1.6 | 0.60 / 4.8 | 0.45 / 3 | 0.89 / 3
DCA-HMM (s=52) | 0.50 / 3.4 | 0.50 / 1.1 | 0.43 / 1.6 | 0.48 / 4.9 | 0.80 / 3.2 | 0.78 / 3.1
DCA-HMM (s=78) | 0.66 / 3.4 | 0.93 / 1.1 | 0.41 / 1.6 | 0.51 / 4.9 | 0.80 / 6.7 | 0.83 / 3.2

Figure 6: Left: perplexity (smaller is better) on the test set and CPU seconds vs. number of topics for LDA on the NIPS1-17 dataset; we randomly selected 70% of the documents for training and the remaining 30% for testing. Right: mutual information (higher is better) and CPU seconds vs. number of clusters for subspace clustering on the COIL-100 dataset.

DCA for Hidden Markov Model on Stock Price and Motion Capture Data. The experimental comparison results for stock price modeling and motion segmentation are shown in Figure 5 and Table 2, respectively. In the former, DCA always achieves slightly lower but comparable likelihood relative to the Baum-Welch (EM) method [5], while the spectral method [2] performs worse and unstably. DCA shows a significant speed advantage over the others, and thus is preferable in practice. In the latter, we evaluate prediction accuracy on the test set, so the regularization induced by the separability assumption leads to the highest accuracy and fastest speed for DCA.

DCA for Latent Dirichlet Allocation on the NIPS1-17 Dataset. The experimental comparison results for topic modeling are shown in Figure 6. Compared to both traditional EM and Gibbs sampling [23], DCA not only achieves the smallest perplexity (highest likelihood) on the test set and the highest speed, but also the most stable performance as the number of topics increases. In addition, the "anchor words" found by DCA provide more interpretable topics than the other methods.

DCA for Subspace Clustering on the COIL-100 Dataset. The experimental comparison results for subspace clustering are shown in Figure 6. DCA provides a much more practical algorithm that achieves comparable mutual information at a more than 1000 times speedup over state-of-the-art SC algorithms such as SCC [9], SSC [12], LRR [21], and RSC [26].

Acknowledgments: We would like to thank MELODI lab members for proof-reading and the anonymous reviewers for their helpful comments. This work is supported by the TerraSwarm research center administered by the STARnet phase of the Focus Center Research Program (FCRP) sponsored by MARCO and DARPA, by the National Science Foundation under Grant No. IIS-1162606, by Google, Microsoft, and Intel research awards, and by the Intel Science and Technology Center for Pervasive Computing.

References
[1] A. Anandkumar, D. P. Foster, D. Hsu, S. Kakade, and Y. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.
[2] A. Anandkumar, D. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
[3] S. Arora, R. Ge, Y. Halpern, D. M. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.
[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. In STOC, 2012.
[5] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37:1554-1563, 1966.
[6] M. Belkin and K. Sinha. Polynomial learning of distribution families. In FOCS, 2010.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.
[8] J. T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137(1):51-73, 1996.
[9] G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision (IJCV), 81(3):317-330, 2009.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[11] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.
[12] E. Elhamifar and R. Vidal. Sparse subspace clustering. In CVPR, 2009.
[13] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 6(6):721-741, 1984.
[14] N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(4):698-714, 2014.
[15] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
[16] M. C. Hughes, E. B. Fox, and E. B. Sudderth. Effective split-merge Monte Carlo methods for nonparametric models of sequential data. In NIPS, 2012.
[17] A. T. Kalai, A. Moitra, and G. Valiant. Efficiently learning mixtures of two Gaussians. In STOC, 2010.
[18] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In COLT, 2005.
[19] A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable nonnegative matrix factorization. In ICML, 2013.
[20] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
[21] G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.
[22] K. Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London A, 185:71-110, 1894.
[23] I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In SIGKDD, pages 569-577, 2008.
[24] R. A. Redner and H. F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195-239, 1984.
[25] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS, 2008.
[26] M. Soltanolkotabi, E. Elhamifar, and E. J. Candès. Robust subspace clustering. arXiv:1301.2603, 2013.
[27] D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley, New York, 1985.
[28] T. Zhou, W. Bian, and D. Tao. Divide-and-conquer anchoring for near-separable nonnegative matrix factorization and completion in high dimensions. In ICDM, 2013.
[29] T. Zhou, J. Bilmes, and C. Guestrin. Extended version of "Divide-and-conquer learning by anchoring a conical hull". Extended version of an accepted NIPS 2014 paper, 2014.
