Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

A Quantum-inspired Classical Algorithm for Separable Non-negative Matrix Factorization

Zhihuai Chen^{1,2}, Yinan Li^{3}, Xiaoming Sun^{1,2}, Pei Yuan^{1,2} and Jialin Zhang^{1,2}

^1 CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China
^2 University of Chinese Academy of Sciences, 100049, Beijing, China
^3 Centrum Wiskunde & Informatica and QuSoft, Science Park 123, 1098XG Amsterdam, Netherlands

{chenzhihuai, sunxiaoming, yuanpei, zhangjialin}@ict.ac.cn, [email protected]

Abstract

Non-negative Matrix Factorization (NMF) asks to decompose an (entry-wise) non-negative matrix into the product of two smaller-sized non-negative matrices, which has been shown to be intractable in general. To overcome this issue, the separability assumption is introduced, which assumes that all data points lie in a conical hull. This assumption makes NMF tractable and is widely used in text analysis and image processing, but it is still impractical for huge-scale datasets. In this paper, inspired by recent developments in dequantizing techniques, we propose a new classical algorithm for the separable NMF problem. Our new algorithm runs in time polynomial in the rank and logarithmic in the size of the input matrices, which achieves an exponential speedup in the low-rank setting.

1 Introduction

Non-negative Matrix Factorization (NMF) aims to approximate a non-negative data matrix A ∈ R^{m×n}_{≥0} by the product of two non-negative low-rank factors, i.e., A ≈ WH^T, where W ∈ R^{m×k}_{≥0} is called the basis matrix, H ∈ R^{n×k}_{≥0} is called the encoding matrix and k ≪ min{m, n}. In many applications, an NMF often results in a more natural and interpretable part-based decomposition of data [Lee and Seung, 1999]. Therefore, NMF has been widely used in a number of practical applications, such as topic modeling in text, signal separation, social networks, collaborative filtering, dimension reduction, sparse coding, feature selection and hyperspectral image analysis. Since computing an NMF is NP-hard [Vavasis, 2009], a series of heuristic algorithms have been proposed [Lee and Seung, 2001; Lin, 2007; Hsieh and Dhillon, 2011; Kim and Park, 2008; Ding et al., 2010; Guan et al., 2012]. All of these heuristic algorithms aim to minimize the reconstruction error

  min_{W ∈ R^{m×k}_{≥0}, H ∈ R^{n×k}_{≥0}} ||A − WH^T||_F,

which is a non-convex program and lacks an optimality guarantee.

A natural assumption on the data, called the separability assumption, was observed in [Donoho and Stodden, 2004]. From a geometric perspective, the separability assumption means that all rows of A reside in a cone generated by a rather small number of rows. In particular, these generators are called the anchors of A. To solve Separable Non-Negative Matrix Factorization (SNMF), it is sufficient to identify the anchors in the input matrices, which can be done in polynomial time [Arora et al., 2012a; Arora et al., 2012b; Gillis and Vavasis, 2014; Esser et al., 2012; Elhamifar et al., 2012; Zhou et al., 2013; Zhou et al., 2014]. The separability assumption is favored by various practical applications. For example, in the unmixing task in hyperspectral imaging, separability implies the existence of 'pure' pixels [Gillis and Vavasis, 2014]. In the topic detection task, it means that some words are associated with a unique topic [Hofmann, 2017]. In huge datasets, it is useful to pick some representative data points to stand for the other points. Such a 'self-expression' assumption helps to improve the data analysis procedure [Mahoney and Drineas, 2009; Elhamifar and Vidal, 2009].

1.1 Related Work

It is natural to assume that all rows of the input A have unit ℓ1-norm, since ℓ1-normalization translates the conical hull into a convex hull while keeping the anchors unchanged. From this perspective, most algorithms essentially identify the extreme points of the convex hull of the (ℓ1-normalized) data vectors. In [Arora et al., 2012a], the authors use m linear programs in O(m) variables to identify the anchors out of m data points, which is therefore not suitable for large-scale real-world problems. Furthermore, [Recht et al., 2012] presents a single LP in n^2 variables for SNMF to deal with large-scale problems (but it is still impractical for huge-scale problems).

There is another class of algorithms based on greedy schemes. The main idea is to pick a data point in the direction along which the current residual decreases fastest. The algorithms terminate with a sufficiently small error or after a large number of iterations. For example, the Successive Projection Algorithm (SPA) [Gillis and Vavasis, 2014] derives from Gram-Schmidt orthogonalization with row or column pivoting. XRAY [Kumar et al., 2013] detects a new anchor referring to the residual of exterior data points and updates the residual matrix by solving a non-negative least squares regression. Both of these greedy-pursuit algorithms have lower computational cost compared with the LP-based methods. However, the time complexity is still too large for large-scale data.


[Zhou et al., 2013; Zhou et al., 2014] utilize a Divide-and-Conquer Anchoring (DCA) framework to tackle SNMF. Namely, by projecting the dataset onto several low-dimensional subspaces, each projection determines a small set of anchors, and it can be proven that all k anchors can be identified by O(k log k) projections.

Recently, a quantum algorithm for SNMF called the Quantum Divide-and-Conquer Anchoring algorithm (QDCA) has been presented [Du et al., 2018], which uses quantum technology to speed up the random projection step in [Zhou et al., 2013]. QDCA implements the matrix-vector product (i.e., random projection) via quantum principal component analysis, and then a quantum state encoding the projected data points can be prepared efficiently. Moreover, there are also several papers utilizing dequantizing techniques to solve low-rank matrix operations, such as recommendation systems [Tang, 2018] and matrix inversion [Gilyén et al., 2018; Chia et al., 2018]. The dequantizing techniques in those algorithms involve two technologies, Monte-Carlo singular value decomposition and rejection sampling, which can efficiently simulate some special operations on low-rank matrices.

Inspired by QDCA and the dequantizing techniques, we propose a classical randomized algorithm which speeds up the random projection step in [Zhou et al., 2013] and thereby identifies all anchors efficiently. Our algorithm takes time polynomial in the rank k, the condition number κ and the logarithm of the size of the matrix. When the rank k = O(log(mn)), our algorithm achieves an exponential speedup over any other classical algorithm for SNMF.

2 Preliminaries

2.1 Notations

Let [n] := {1, 2, ..., n}. Let span{x_i ∈ R^n | i ∈ [k]} := {∑_{i=1}^k α_i x_i | α_i ∈ R, i ∈ [k]} denote the space spanned by the x_i for i ∈ [k]. For a matrix A ∈ R^{m×n}, A_(i) and A^(j) denote the i-th row and the j-th column of A for i ∈ [m], j ∈ [n], respectively. Let A_R = [A_(i1)^T, A_(i2)^T, ..., A_(ir)^T]^T where A ∈ R^{m×n} and R = {i1, i2, ..., ir} ⊆ [m] (without loss of generality, assume i1 ≤ i2 ≤ ··· ≤ ir). ||A||_F and ||A||_2 refer to the Frobenius norm and the spectral norm, respectively. For a vector v ∈ R^n, ||v|| denotes its ℓ2-norm. For two probability distributions p, q (as density functions) over a discrete universe D, the total variation distance between them is defined as ||p, q||_TV := (1/2) ∑_{i∈D} |p(i) − q(i)|. κ(A) := σ_max/σ_min denotes the condition number of A, where σ_max and σ_min are the maximal and minimal non-zero singular values of A.

2.2 Sample Model

In the query model, algorithms for the SNMF problem require time at least linear in the number of nonzero elements of the matrix, since in the worst case they have to read all entries. However, we expect our algorithm to be efficient even if the datasets are extremely large. Considering the QDCA in [Du et al., 2018], one of its advantages is that the data is prepared in a quantum state and can be accessed in a 'quantum' way (like sampling). Thus, in a quantum algorithm, the quantum state represents the data implicitly and can be read out by measurement only. In order to avoid reading the whole matrix, we introduce a new sample model, other than the query model, based on the idea of the quantum state preparation assumption.

Definition 2.1 (ℓ2-norm Sampling). Let D_v denote the distribution over [n] with density function D_v(i) = v_i^2 / ||v||^2 for v ∈ R^n. A sample from the distribution D_v is called a sample from v.

Lemma 2.2 (Vector Sample Model). There is a data structure storing a vector v ∈ R^n in O(n log n) space and supporting the following operations:
• Querying and updating an entry in O(log n) time;
• Sampling from D_v in O(log n) time;
• Finding ||v|| in O(1) time.

Such a data structure can be easily implemented via a Binary Search Tree (BST) (see Figure 1).

Figure 1: Binary search tree for v = (v1, v2, v3, v4)^T ∈ R^4. The leaf nodes store v_i^2 and each interior node stores the sum of its children. In order to restore the original vector, we also store the sign of v_i in the leaf node. To sample from D_v, we start from the top and recursively descend to a child with probability proportional to its weight.
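As a concrete illustration of Lemma 2.2 and Figure 1, the following is a minimal Python sketch of such a sampling structure; the class and method names are our own, and an array-backed segment tree is used in place of a pointer-based BST.

```python
import random

class VectorSampleModel:
    """Sketch of the tree of Figure 1: leaves hold v_i^2 (plus the sign of v_i),
    inner nodes hold the sum of their children, the root holds ||v||^2."""

    def __init__(self, v):
        self.n = len(v)
        size = 1
        while size < self.n:
            size *= 2
        self.size = size
        self.tree = [0.0] * (2 * size)          # sums of squared entries
        self.sign = [1.0] * size                # signs of the entries
        for i, x in enumerate(v):
            self.update(i, x)

    def update(self, i, x):                     # O(log n)
        self.sign[i] = 1.0 if x >= 0 else -1.0
        j = self.size + i
        self.tree[j] = x * x
        j //= 2
        while j >= 1:
            self.tree[j] = self.tree[2 * j] + self.tree[2 * j + 1]
            j //= 2

    def query(self, i):                         # entry v_i
        return self.sign[i] * self.tree[self.size + i] ** 0.5

    def norm(self):                             # O(1): the root stores ||v||^2
        return self.tree[1] ** 0.5

    def sample(self):                           # O(log n): descend by weight
        j = 1
        while j < self.size:
            left = self.tree[2 * j]
            j = 2 * j if random.random() * self.tree[j] < left else 2 * j + 1
        return j - self.size
```

In this array-backed variant an entry query is O(1); the O(log n) query bound in Lemma 2.2 corresponds to a pointer-based BST.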
Proposition 2.3 (Matrix Sample Model). Consider a matrix A ∈ R^{m×n}, and let Ã and Ã' be the vectors whose entries are ||A_(i)|| and ||A^(j)||, respectively. There is a data structure storing the matrix A ∈ R^{m×n} in O(mn) space and supporting the following operations:
• Querying and updating an entry in O(log m + log n) time;
• Sampling from A_(i) for any i ∈ [m] in time O(log n);
• Sampling from A^(j) for any j ∈ [n] in time O(log m);
• Finding ||A||_F, ||A_(i)|| and ||A^(j)|| in time O(1);
• Sampling from Ã and Ã' in time O(log m) and O(log n), respectively.

This data structure can be easily implemented via Lemma 2.2: we use two arrays of BSTs to store all rows and columns of A, and two extra BSTs to store Ã and Ã'.
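A minimal sketch of Proposition 2.3 built on top of the vector structure above (again our own illustrative wrapper, reusing the hypothetical VectorSampleModel class from the previous sketch; entry updates, which would also have to touch the two norm trees, are omitted for brevity):

```python
class MatrixSampleModel:
    """Sketch of Proposition 2.3: row BSTs, column BSTs, and two norm BSTs."""

    def __init__(self, A):                      # A: list of m rows, each of length n
        self.rows = [VectorSampleModel(r) for r in A]
        self.cols = [VectorSampleModel(c) for c in zip(*A)]
        self.row_norms = VectorSampleModel([r.norm() for r in self.rows])   # Ã
        self.col_norms = VectorSampleModel([c.norm() for c in self.cols])   # Ã'

    def frobenius_norm(self):                   # ||A||_F = ||Ã||
        return self.row_norms.norm()

    def sample_from_row(self, i):               # column index j drawn ∝ A[i][j]^2
        return self.rows[i].sample()

    def sample_row_index(self):                 # row index i drawn ∝ ||A_(i)||^2
        return self.row_norms.sample()
```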

2.3 Low-rank Approximations in Sample Model

The FKV algorithm is a Monte-Carlo algorithm [Frieze et al., 2004] that returns approximate singular vectors of a given matrix A in the matrix sample model. A low-rank approximation of A can be reconstructed from the approximate singular vectors. The query and sample complexity of the FKV algorithm are independent of the size of A. The FKV algorithm outputs a short 'description' of V̂, which approximates the right singular vectors V of the matrix A. Similarly, the FKV algorithm can output a description of approximate left singular vectors Û of A by taking A^T as input. Let FKV(A, k, ε, δ) denote the FKV algorithm, where A is a matrix given in the sample model, k is the rank of the approximating matrix of A, ε is the error parameter, and δ is the failure probability. The FKV algorithm is described in Theorem 2.4.

Theorem 2.4 (Low-rank Approximations, [Frieze et al., 2004]). Given a matrix A ∈ R^{m×n} in the matrix sample model, k ∈ N and ε, δ ∈ (0, 1), the FKV algorithm outputs the description of approximate right singular vectors V̂ ∈ R^{n×k} using O(poly(k, 1/ε, log(1/δ))) samples and queries of A, such that with probability 1 − δ,

  ||AV̂V̂^T − A||_F^2 ≤ min_{D: rank(D) ≤ k} ||A − D||_F^2 + ε ||A||_F^2.

In particular, if A is a matrix with rank exactly k, Theorem 2.4 also implies the inequality ||AV̂V̂^T − A||_F ≤ √ε ||A||_F.

Description of V̂. Note that the FKV algorithm does not output the approximate right singular vectors V̂ directly, since their length is linear in n. It returns a description of V̂, which consists of three components: a row index set T := {i_t ∈ [m] | t ∈ [p]}, a vector set U := {u^(j) ∈ R^p | j ∈ [k]} consisting of singular vectors of a submatrix sampled from A, and the corresponding singular values Σ := {σ_j | j ∈ [k]}, where p = O(poly(k, 1/ε)). In fact, V̂^(i) := A_T^T u^(i) / σ_i for i ∈ [k]. Given a description of V̂, we can sample from V̂^(i) in time O(poly(k, 1/ε)) for i ∈ [k] [Tang, 2018] and query its entries in time O(poly(k, 1/ε)).
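For concreteness, a hypothetical sketch of how a single entry of V̂ can be evaluated from such a description; the function and argument names are ours, and A_row_query(r, c) stands for an entry query to A in the matrix sample model:

```python
def vhat_entry(A_row_query, T, u, sigma, t, i):
    """Entry t of the i-th approximate right singular vector:
    V̂^(i)_t = (A_T^T u^(i))_t / σ_i = Σ_s A[T[s], t] * u[i][s] / σ_i."""
    total = 0.0
    for s, row in enumerate(T):          # p = |T| = O(poly(k, 1/ε)) terms
        total += A_row_query(row, t) * u[i][s]
    return total / sigma[i]
```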

Definition 2.5 (α-orthonormal). Given α > 0, V̂ ∈ R^{n×k} is called α-approximately orthonormal if 1 − α/k ≤ ||V̂^(i)||^2 ≤ 1 + α/k for i ∈ [k] and |V̂^(s)T V̂^(t)| ≤ α/k for s ≠ t ∈ [k].

The next lemma presents some properties of α-approximately orthonormal vectors.

Lemma 2.6 (Properties of α-orthonormal Vectors, [Tang, 2018]). Given a set of k α-approximately orthonormal vectors V̂ ∈ R^{n×k}, there exists a set of k orthonormal vectors V ∈ R^{n×k} spanning the columns of V̂ such that

  ||V − V̂||_F ≤ α/√2 + c_1 α^2,   (1)

  ||Π_V̂ − V̂V̂^T||_F ≤ c_2 α,   (2)

where Π_V̂ := VV^T represents the orthonormal projector onto the image of V̂ and c_1, c_2 > 0 are constants.

Lemma 2.7 ([Frieze et al., 2004]). The output vectors V̂ ∈ R^{n×k} of FKV(A, k, ε, δ) are εk/16-approximately orthonormal.
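Definition 2.5 is straightforward to verify when the vectors are available explicitly; a small sketch of such a check (ours, for dense NumPy arrays) is:

```python
import numpy as np

def is_alpha_orthonormal(V_hat, alpha):
    """Check Definition 2.5 for the columns of V_hat (shape n x k)."""
    k = V_hat.shape[1]
    G = V_hat.T @ V_hat                          # Gram matrix of the columns
    diag_ok = np.all(np.abs(np.diag(G) - 1.0) <= alpha / k)
    offdiag = G - np.diag(np.diag(G))
    return bool(diag_ok and np.all(np.abs(offdiag) <= alpha / k))
```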

3 Fast Anchors Seeking Algorithm

In this section, we present a randomized algorithm for SNMF which we call the Fast Anchors Seeking (FAS) algorithm. In particular, the input A ∈ R^{m×n}_{≥0} of FAS is given in the matrix sample model, which is realized via the data structure described in Section 2. FAS returns the indices of the anchors in time polynomial in the rank and logarithmic in the size of the matrix.

3.1 Description of Algorithm

Recall that SNMF aims to factorize A = F A_R, where R is the index set of anchors. In this paper, an additional constraint is added: the sum of the entries in any row of F is 1. Namely, any data point of A resides in the convex hull, which is the set of all convex combinations of A_R. In fact, normalizing each row of the matrix A by its ℓ1-norm is valid, since the anchors remain unchanged. Moreover, instead of storing the ℓ1-normalized matrix A, we can just maintain the ℓ1-norms of all rows and columns.

The Quantum Divide-and-Conquer Anchoring (QDCA) algorithm is a quantum algorithm for SNMF which achieves an exponential speedup over any classical algorithm [Du et al., 2018]. After projecting a convex hull onto a 1-dimensional space, the geometric information is partially preserved; in particular, the anchors in the 1-dimensional projected subspace are still anchors in the original space. The main idea of QDCA is to quantize the random projection step in DCA. It decomposes SNMF into several subproblems: projecting A onto a set of random unit vectors {β_i ∈ R^n}_{i=1}^s with s = O(k log k), i.e., computing Aβ_i ∈ R^m. Such a matrix-vector product can be efficiently implemented by Quantum Principal Component Analysis (QPCA), which returns a (log m)-qubit quantum state whose amplitudes are proportional to the entries of Aβ_i. Measuring this quantum state yields an index j ∈ [m] which obeys the distribution D_{Aβ_i}. Thus, we can prepare O(poly log m) copies of the quantum state, measure each of them in the computational basis and record the most frequent index. By repeating the procedure above s = O(k log k) times, we can successively identify all anchors with high probability.
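For intuition, one projection step of the underlying DCA idea can be written out naively with dense linear algebra (this is our own sketch; it materializes Aβ in O(mn) time, exactly the cost that the sample-model approach below avoids):

```python
import numpy as np

def dca_projection_step(A, rng):
    """One naive DCA step: project all rows onto a random direction
    and return the index of the row with the largest |projection|."""
    n = A.shape[1]
    beta = rng.standard_normal(n)
    beta /= np.linalg.norm(beta)          # uniform random unit vector in R^n
    proj = A @ beta                       # O(mn): the expensive step
    return int(np.argmax(np.abs(proj)))   # candidate anchor index
```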

As discussed above, the core and most costly procedure is to simulate D_{Aβ_i}. At first sight, traditional algorithms cannot achieve an exponential speedup on account of the limits of the computational model. In QDCA, vectors are encoded into quantum states and we can sample the entries with probability proportional to their magnitudes by measurements. This quantum state preparation overcomes the bottleneck of the traditional computational model. Based on the divide-and-conquer scheme and the sample model (see Section 2.2), we present the Fast Anchors Seeking (FAS) algorithm inspired by QDCA. Designing FAS is quite hard and non-trivial, although FAS and QDCA share the same scheme. Indeed, we could simulate D_{Aβ_i} directly by the rejection sampling technique; however, the number of iterations of rejection sampling would then be unbounded. To overcome this difficulty, we replace the matrix A by its approximation ÛÛ^T A, where the columns of Û ∈ R^{m×k} consist of k approximate left singular vectors of the matrix A and k = rank(A). Next, it is obvious that y = Û^T Aβ_i ∈ R^k is a short vector and we can estimate its entries one by one efficiently (see Lemma 3.5). Now the problem becomes simulating D_{Ûy}, which can be done by Lemma 3.4. Given an error parameter ε/2, the method described above results in ||Aβ_i − ÛÛ^T Aβ_i|| < ε ||A||_F ||β_i|| / 2 via Theorem 2.4, which implies ||D_{Aβ_i} − D_{ÛÛ^T Aβ_i}||_TV ≤ ε ||A||_F ||β_i|| / ||Aβ_i||.

Namely, the method above introduces an error of the form ε ||A||_F ||β_i|| / ||Aβ_i||, which is unbounded if β_i is an arbitrary vector in the entire space R^n. Fortunately, this issue can be solved by generating the random vectors {β_i}_{i=1}^s in the row space of A instead of in the entire space R^n. To generate uniformly random unit vectors in the row space of A, we need a basis of the row space of A. If V ∈ R^{n×k} is an orthonormal basis of the row space of A (the space spanned by the right singular vectors), and x_i is a uniformly random unit vector on S^{k−1}, then β_i = Vx_i is a uniformly random unit vector in the row space of A. Moreover, the FKV algorithm will figure out approximate singular vectors V̂ for V, which help us build an approximation β̂_i = V̂x_i of β_i. Therefore, we estimate the distribution D_{ÛÛ^T AV̂x_i} instead of D_{Aβ_i}. Based on Lemma 3.5, Û^T AV̂ can be estimated efficiently. According to Lemma 3.4, Ûỹ can be sampled efficiently, so we can treat ỹ_i as an estimate of Û^T AV̂x_i (see Figure 2).

Once we can simulate the distribution D_{AVx_i}, we can figure out the index of the largest component of the vector AVx_i by picking up O(poly log m) samples (Theorem 3.6). Moreover, according to [Zhou et al., 2013], by repeating this procedure O(k log k) times, we can find all anchors of A with high probability (for a single step of random projection, see Figure 3).
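The row-space trick of the last paragraph is cheap to state in code; a small sketch (ours), assuming the columns of V_hat approximate an orthonormal basis of the row space:

```python
import numpy as np

def random_row_space_direction(V_hat, rng):
    """Draw x uniformly from S^{k-1} and return (x, beta_hat = V_hat @ x)."""
    k = V_hat.shape[1]
    x = rng.standard_normal(k)
    x /= np.linalg.norm(x)                # uniform on the unit sphere S^{k-1}
    return x, V_hat @ x                   # beta_hat ≈ a unit vector in rowspace(A)
```

In FAS itself V̂ is kept only as a description and β̂_i is never formed explicitly; only x_i enters the computation, through M̃x_i.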


Figure 2: An illustration of how to approximate the distribution D_{Aβ_i}; the chain of intermediate distributions is D_{Aβ_i} = D_{AVx_i} ≈ D_{AV̂x_i} ≈ D_{ÛÛ^T AV̂x_i} ≈ D_{Ûỹ_i} ≈ O_i. O_i represents the final distribution which approximates D_{Aβ_i}; ≈ represents 'approximate' and = represents 'equal' in the sense of total variation distance. To prove the upper bound on ||D_{Aβ_i}, O_i||_TV, we introduce the intermediate distributions D_{AV̂x_i}, D_{ÛÛ^T AV̂x_i} and D_{Ûỹ}. From left to right: (1) use a set of right singular vectors V ∈ R^{n×k} to generate the random unit vector lying in the row space of A; (2) however, since V cannot be obtained efficiently, use the FKV algorithm to figure out an approximation V̂ given by a 'short description' (Lemma 2.6, Lemma 3.2); (3) translate AV̂x_i into ÛÛ^T AV̂x_i since A ≈ ÛÛ^T A (Theorem 2.4), where Û consists of the approximate left singular vectors generated by FKV, also given by a 'short description'; (4) return an estimate M̃ of M = Û^T AV̂ ∈ R^{k×k} by estimating each entry via Lemma 3.5, and then approximate Mx_i (denoted ỹ_i); (5) finally, rejection sampling works for Ûỹ_i by Lemma 3.4 since Û is approximately orthonormal.

Algorithm 1 Fast Anchors Seeking Algorithm
Input: Separable non-negative matrix A ∈ R^{m×n}_{≥0} in the matrix sample model, k = rank(A), condition number κ, a constant δ ∈ (0, 1) and s = O(k log k).
Output: The index set R of anchors of A.
1: Initialize R = ∅.
2: Set ε < 2 sqrt(2 log(4 log^2(m)/δ) / log^2(m)).
3: Set ε_V = O(min{ε/(√k κ), 1/(kκ^2)}), δ_V = 1 − (1 − δ)^{1/4}.
4: Run FKV(A, k, ε_V, δ_V) and output the description of the approximate right singular vectors V̂.
5: Set ε_U = O(min{ε/k, 1/(kκ^2)}), δ_U = 1 − (1 − δ)^{1/4}.
6: Run FKV(A^T, k, ε_U, δ_U) and output the description of the approximate left singular vectors Û.
7: By Lemma 3.5, estimate M := Û^T A V̂ with relative error ζ = O(ε/(k^2 κ)) and failure probability η = 1 − (1 − δ)^{1/4}, and denote the result by M̃.
8: for i = 1 to s do
9:   Generate a unit random vector x_i ∈ R^k.
10:  Directly compute ỹ_i = M̃ x_i.
11:  By rejection sampling (Algorithm 2), simulate the distribution D_{Û ỹ_i} with failure probability γ = 1 − (1 − δ)^{1/(4s)} and pick up O(poly log m) samples. {Let O_i denote the actual distribution which simulates D_{Û ỹ_i}.}
12:  R ← R ∪ {l}, where l is the most frequent index appearing in the O(poly log m) samples.
13: end for
14: Return R.
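A high-level Python sketch of this driver loop, with the heavy subroutines (the FKV descriptions of Theorem 2.4, the entry estimator of Lemma 3.5 and the rejection sampler of Algorithm 2) left as assumed helper functions; their names and signatures are hypothetical, not from the paper:

```python
import numpy as np
from collections import Counter

def fas(A_sample_model, k, s, n_samples, fkv, estimate_M, rejection_sample):
    """Sketch of Algorithm 1. `fkv`, `estimate_M` and `rejection_sample`
    stand in for the subroutines of Theorem 2.4, Lemma 3.5 and Algorithm 2."""
    rng = np.random.default_rng()
    V_desc = fkv(A_sample_model, k)                       # line 4: description of V̂
    U_desc = fkv(A_sample_model, k, transpose=True)       # line 6: description of Û
    M_tilde = estimate_M(U_desc, A_sample_model, V_desc)  # line 7: k x k matrix M̃

    anchors = set()
    for _ in range(s):                                    # lines 8-13
        x = rng.standard_normal(k)
        x /= np.linalg.norm(x)                            # line 9
        y_tilde = M_tilde @ x                             # line 10
        samples = [rejection_sample(U_desc, y_tilde) for _ in range(n_samples)]
        anchors.add(Counter(samples).most_common(1)[0][0])  # lines 11-12
    return anchors
```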

Figure 3: An illustration of finding an anchor of A with rank k = 3. The 3-dimensional space represents the row space of A, and the data points of A (blue and black points) lie on the ℓ1-normalized plane P; the blue points stand for the anchors of A. A random vector β is picked from the row space of A and the data points are projected onto β. The anchors of the projected space on β are still anchors of A, which implies that the red point with the maximum absolute projection component on β is an anchor of A.

3.2 Analysis

Now, we state our main theorem and analyze the correctness and complexity of our algorithm FAS.

Theorem 3.1 (Main Result). Given a separable non-negative matrix A ∈ R^{m×n}_{≥0} in the matrix sample model, its rank k, its condition number κ and a constant δ ∈ (0, 1), Algorithm 1 returns the indices of the anchors with probability at least 1 − δ in time
  O(poly(k, κ, log(1/δ), log(mn))).

Correctness

In this subsection, we analyze the correctness of Algorithm 1. First, we show that the columns of V defined in Lemma 2.6 form a basis of the row space of the matrix A, which is necessary to generate unit vectors in the row space of A. Next, we prove that for each i ∈ [s] the distribution O_i is ε-close to the distribution D_{AVx_i} in total variation distance. Then we show how to obtain the index of the largest component of AVx_i from the distribution O_i. Finally, O(k log k) random projections are enough to obtain all anchors of the matrix A.

The following lemma tells us that the approximate singular vectors output by FKV span the row space of the matrix A. Combining it with Lemma 2.6, we get that V spans the same space, i.e., V forms an orthonormal basis of the row space of the matrix A.

Lemma 3.2. Let V̂ be the output of algorithm FKV(A, k, ε, δ). If ε < 1/(kκ^2), then with probability 1 − δ we obtain
  span{V̂^(i) | i ∈ [l], l ≤ k} = span{A_(i) | i ∈ [m]}.


Proof. By contradiction, assume that span{V̂^(i) | i ∈ [k]} ≠ span{A_(i) | i ∈ [m]}, which implies that there exists a unit vector x ∈ span{A_(i) | i ∈ [m]} with x ⊥ span{V̂^(i) | i ∈ [k]}. Then we obtain ||Ax − AV̂V̂^T x|| = ||Ax|| ≥ σ_min(A) since V̂^T x = 0. On the other hand, according to Theorem 2.4 we have
  ||Ax − AV̂V̂^T x|| ≤ √ε ||A||_F ≤ √(εk) κ σ_min(A).
Thus σ_min(A) ≤ √(εk) κ σ_min(A), which yields a contradiction if ε < 1/(kκ^2). □

By Lemma 3.2, we can generate an approximate random vector in the row space of A with probability 1 − δ in time O(poly(k, 1/ε, log 1/δ)) by FKV(A, k, ε, δ). First, we obtain the description of the approximate right singular vectors by the FKV algorithm, where the error parameter ε is bounded in terms of the rank k and the condition number κ (see Lemma 3.2). Second, we generate a random unit vector x_i ∈ R^k as a coordinate vector with respect to the set of orthonormal vectors in Lemma 2.6. Let V denote the matrix defined in Lemma 2.6; it is clear that its columns form right singular vectors of the matrix A. That is, β̂_i = V̂x_i is an approximation of the random vector β_i = Vx_i. Next, we show that the total variation distance between O_i and D_{AVx_i} is bounded by the constant ε. For convenience, we assume that each step in Algorithm 1 succeeds; the final success probability is given in the next subsection.

Lemma 3.3. For all i ∈ [s], ||O_i, D_{AVx_i}||_TV ≤ ε holds simultaneously with probability 1 − δ.

In the rest, without ambiguity, we use the notations O, x instead of O_i, x_i. By applying the triangle inequality, we divide the left-hand side of the inequality into four parts (for the intuition, please refer to Figure 2):

  ||D_{AVx}, O||_TV ≤ ||D_{AVx}, D_{AV̂x}||_TV (①) + ||D_{AV̂x}, D_{ÛÛ^T AV̂x}||_TV (②) + ||D_{ÛÛ^T AV̂x}, D_{ÛM̃x}||_TV (③) + ||D_{ÛM̃x}, O||_TV (④).

Thus, we only need to prove that ①, ②, ③, ④ < ε/4, respectively. In addition, given u, v ∈ R^n, if ||u − v|| ≤ ε ||u||, then ||D_u, D_v||_TV ≤ 2ε. For ①, ② and ③, we therefore only show their ℓ2-norm versions, i.e.,
  • ||AVx − AV̂x|| ≤ (ε/8) ||AVx||;
  • ||AV̂x − ÛÛ^T AV̂x|| ≤ (ε/8) ||AV̂x||;
  • ||ÛÛ^T AV̂x − ÛM̃x|| ≤ (ε/8) ||ÛÛ^T AV̂x||.
For convenience, in the rest of this part, let α_U = ε_U k/16 and α_V = ε_V k/16 denote the approximation ratios for the orthonormality of Û and V̂, respectively, based on Lemma 2.7.

Before we start the proof, we list two tools which are used to prove ③ and ④, respectively. Based on rejection sampling, Lemma 3.4 shows that sampling from a linear combination of α-approximately orthonormal vectors can be realized quickly without knowledge of the norms of these vectors (see Algorithm 2).

Algorithm 2 Rejection Sampling for D_{Ûy}
Input: A set of approximately orthonormal vectors Û^(j) ∈ R^m (for j = 1, ..., k) in the vector sample model and a vector y ∈ R^k.
Output: A sample s subject to D_{Ûy}.
1: Sample j according to probability proportional to ||Û^(j) y_j||^2.
2: Sample s from D_{Û^(j)}.
3: Compute r_s = (Ûy)_s^2 / (k ∑_i (y_i Û_{si})^2).
4: Accept s with probability r_s; otherwise, restart.

Lemma 3.4 ([Tang, 2018]). Given a set of α-approximately orthonormal vectors V̂ ∈ R^{n×k} in the vector sample model and an input vector w ∈ R^k, there exists an algorithm outputting a sample from a distribution α/(1 − α)-close to D_{V̂w} with probability 1 − γ, using O(k^2 log(1/γ)(1 + O(α))) queries and samples.
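A Python sketch of Algorithm 2, assuming each Û^(j) is held in the vector sample model of Lemma 2.2; we reuse the hypothetical VectorSampleModel interface from Section 2, so this is an illustration rather than the paper's implementation:

```python
import random

def rejection_sample_Uy(U_cols, y):
    """Sample an index s approximately according to D_{Ûy}.
    U_cols[j] supports .norm(), .sample(), .query(i); y is a length-k list."""
    k = len(y)
    weights = [(y[j] * U_cols[j].norm()) ** 2 for j in range(k)]        # step 1
    while True:
        j = random.choices(range(k), weights=weights)[0]                # step 1
        s = U_cols[j].sample()                                          # step 2
        vals = [y[t] * U_cols[t].query(s) for t in range(k)]
        r = sum(vals) ** 2 / (k * sum(v * v for v in vals))             # step 3
        if random.random() < r:                                         # step 4
            return s
```

By the Cauchy-Schwarz inequality the acceptance probability r is at most 1, and the expected number of restarts stays bounded when the columns are approximately orthonormal, which is exactly the point of Lemma 3.4.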
Lemma 3.5. Given A ∈ R^{m×n} in the matrix sample model and L ∈ R^{k1×m} and R ∈ R^{n×k2} in the query model, let M = LAR. Then we can output a matrix M̃ ∈ R^{k1×k2} such that, with probability 1 − η,
  ||M − M̃||_F ≤ ζ ||A||_F ||L||_F ||R||_F
using O(k1 k2 (1/ζ^2) log(1/η)) queries and samples.

Proof. Let M_{ij} = L_(i) A R^(j) with i ∈ [k1] and j ∈ [k2]. By [Tang, 2018], there exists an algorithm that outputs an estimate M̃_{ij} of M_{ij} to precision ζ ||A||_F ||L_(i)|| ||R^(j)|| with probability 1 − η' in time O((1/ζ^2) log(1/η')). Let η' = 1 − (1 − η)^{1/(k1 k2)}. We can then output M̃ with probability 1 − η using O(k1 k2 (1/ζ^2) log(1/η)) queries and samples, where M̃ satisfies

  ||M − M̃||_F^2 = ∑_{i∈[k1], j∈[k2]} |M_{ij} − M̃_{ij}|^2
               ≤ ∑_{i∈[k1], j∈[k2]} ζ^2 ||A||_F^2 ||L_(i)||^2 ||R^(j)||^2
               = ζ^2 ||A||_F^2 ∑_{i∈[k1]} ||L_(i)||^2 ∑_{j∈[k2]} ||R^(j)||^2
               = ζ^2 ||A||_F^2 ||L||_F^2 ||R||_F^2. □
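The single-entry estimator behind this proof can be sketched as importance sampling over the entries of A; the version below is a minimal illustration of the [Tang, 2018] estimator under our own naming, reusing the hypothetical MatrixSampleModel sketch from Section 2 (median-of-means boosting is what gives the log(1/η') dependence):

```python
import statistics

def estimate_entry(A_model, L_row, R_col, n_means=30, n_reps=9):
    """Estimate L_row · A · R_col from ℓ2-norm samples of A."""
    fro2 = A_model.frobenius_norm() ** 2

    def one_mean():
        acc = 0.0
        for _ in range(n_means):
            s = A_model.sample_row_index()          # s with prob ||A_(s)||^2 / ||A||_F^2
            t = A_model.sample_from_row(s)          # t with prob A[s,t]^2 / ||A_(s)||^2
            a = A_model.rows[s].query(t)
            acc += L_row[s] * R_col[t] * fro2 / a   # unbiased: E = Σ_{s,t} L_s A_st R_t
        return acc / n_means

    return statistics.median(one_mean() for _ in range(n_reps))
```

The estimator is unbiased and its variance is at most ||A||_F^2 ||L_row||^2 ||R_col||^2, which matches the per-entry precision used in the proof.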


Proof of Lemma 3.3. Upper bound for ①. By Lemma 3.2, Vx is a unit random vector sampled from the row space of A with probability 1 − δ_V := (1 − δ)^{1/4} if ε_V < 1/(kκ^2). From Eq. (1) in Lemma 2.6, with probability 1 − δ_V,
  ||AVx − AV̂x|| ≤ ||A||_F ||V − V̂||_F ||x|| ≤ (α_V/√2 + c_1 α_V^2) ||A||_F.
Combining this with ||A||_F ≤ √k κ σ_min(A) ≤ √k κ ||AVx||, we obtain
  ||AVx − AV̂x|| ≤ (α_V/√2 + c_1 α_V^2) √k κ ||AVx||.   (3)
Eq. (3) yields ||AVx − AV̂x|| ≤ (ε/8) ||AVx|| with
  ε_V = O(min{ε/(√k κ), 1/(kκ^2)}).

Upper bound for ②. According to Lemma 3.2, the columns of Û span a space equal to the column space of A if ε_U ≤ 1/(kκ^2), with probability 1 − δ_U := (1 − δ)^{1/4}. Let Π_Û denote the orthonormal projector onto the image of Û (the column space of A), and let Π_Û^⊥ denote the orthonormal projector onto the orthogonal complement of the column space of Û. Then
  ||AV̂x − ÛÛ^T AV̂x|| = ||(Π_Û + Π_Û^⊥) AV̂x − ÛÛ^T AV̂x|| = ||Π_Û AV̂x − ÛÛ^T AV̂x|| ≤ ||Π_Û − ÛÛ^T||_F ||AV̂x|| ≤ c_2 α_U ||AV̂x||,   (4)
based on Eq. (2) in Lemma 2.6. If ε_U = O(min{ε/k, 1/(kκ^2)}), then ||AV̂x − ÛÛ^T AV̂x|| ≤ (ε/8) ||AV̂x||.

Upper bound for ③. With ε_U and ε_V as discussed above, with probability 1 − η := (1 − δ)^{1/4} we have
  ||ÛÛ^T AV̂x|| ≥ (1 − ε/8) ||AV̂x|| ≥ (1 − ε/8)^2 ||AVx||.   (5)
According to Lemma 3.5, we obtain
  ||Û^T AV̂ − M̃||_F ≤ ζ ||Û||_F ||V̂||_F ||A||_F ≤ ζ k^{1.5} κ sqrt(1 + α_U/k) sqrt(1 + α_V/k) ||AVx||.   (6)
Combining Eq. (5) and Eq. (6), the following holds:
  ||ÛÛ^T AV̂x − ÛM̃x|| ≤ ||Û||_F ||Û^T AV̂ − M̃||_F
    ≤ ζ k^2 κ (1 + α_U/k) sqrt(1 + α_V/k) ||AVx||
    ≤ ζ k^2 κ (1 + α_U/k) sqrt(1 + α_V/k) / (1 − ε/8)^2 ||ÛÛ^T AV̂x||.   (7)
If ζ = O(ε/(k^2 κ)), then ||ÛÛ^T AV̂x − ÛM̃x|| < (ε/8) ||ÛÛ^T AV̂x|| holds.

Upper bound for ④. Since ε_U = O(min{ε/k, 1/(kκ^2)}) as discussed before, a direct application of Lemma 3.4 gives, with probability 1 − γ := (1 − δ)^{1/(4s)},
  ||D_{Ûỹ}, O||_TV ≤ α_U/(1 − α_U) < ε/4.   (8)

Hence, Algorithm 1 generates distributions O_i which satisfy ||O_i, D_{AVx_i}||_TV ≤ ε for the s random unit vectors generated, simultaneously with probability 1 − δ. □

The following theorem tells us how to find the largest component of AVx_i from the distribution O_i.

Theorem 3.6 (Restatement of Theorem 1 in [Du et al., 2018]). Let D be a distribution over [m] and let D' be another distribution simulating D with total variation error ε. Let x_1, ..., x_N be examples independently sampled from D' and let N_i be the number of examples taking value i. Let D_max = max{D_1, ..., D_m} and D_secmax = max({D_1, ..., D_m} \ {D_max}). If D_max − D_secmax > 2 sqrt(2 log(4N/δ)/N) + ε, then, for any δ > 0, with probability at least 1 − δ, we have
  argmax_i {N_i | 1 ≤ i ≤ m} = argmax_i {D_i | 1 ≤ i ≤ m}.
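Operationally, Theorem 3.6 says that taking the mode of enough samples recovers the index of the largest probability mass, provided the stated gap condition holds; a tiny sketch (ours):

```python
import math
from collections import Counter

def empirical_argmax(samples):
    """Mode of the drawn samples (line 12 of Algorithm 1)."""
    return Counter(samples).most_common(1)[0][0]

def gap_condition_holds(d_max, d_secmax, N, delta, eps):
    """Sufficient condition of Theorem 3.6 for the mode to equal argmax_i D_i."""
    return d_max - d_secmax > 2 * math.sqrt(2 * math.log(4 * N / delta) / N) + eps
```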

As mentioned in [Du et al., 2018], the assumption about the gap between D_max and D_secmax is easy to satisfy in practice. By choosing N = log^2 m and ε < 2 sqrt(2 log(4 log^2 m/δ)/log^2 m), it suffices to have D_max − D_secmax > 4 sqrt(2 log(4 log^2 m/δ)/log^2 m), a threshold which converges to zero as m goes to infinity.

To estimate the number of random projections we need, denote by p_i^* the probability that, after a random projection β, the data point A_(i) is identified as an anchor in the projected subspace, i.e.,
  p_i^* = Pr_β(i = argmax_{i'} {(Aβ)_{i'}}).
In [Zhou et al., 2014], it is shown that if p_i^* > α/k for a constant α, then with s = O((k/α) log k) random projections all anchors can be found with probability at least 1 − k exp(−αs/(3k)).

Complexity and Success Probability

Note that Algorithm 1 involves operations that query and sample from the matrix A and from Û and V̂; those operations can be implemented in O(log(mn) poly(k, κ, 1/ε)) time. Thus, in the following analysis we ignore the time complexity of those operations and multiply it into the final time complexity at the end.

The running time and failure probability are mainly concentrated on lines 4, 6, 7 and 11 of Algorithm 1. The running times of lines 4 and 6 are O(poly(k, 1/ε_V, log(1/δ_V))) and O(poly(k, 1/ε_U, log(1/δ_U))), respectively, according to Theorem 2.4. Line 7 takes O(k^2 (1/ζ^2) log(1/η)) to estimate the matrix M̃ according to Lemma 3.5. Line 11, over its s iterations, spends O(s k^2 log(1/γ) poly log m) in total. Regarding the failure probability, lines 4, 6 and 7 each fail with probability 1 − (1 − δ)^{1/4}, and each iteration of line 11 fails with probability 1 − (1 − δ)^{1/(4s)}.

Altogether, the time complexity of FAS is O(poly(k, κ, log(1/δ), log(mn))), and the success probability is 1 − δ.

4 Conclusion

This paper presents a classical randomized algorithm, FAS, which dramatically reduces the running time needed to find the anchors of a low-rank matrix. In particular, we achieve an exponential speedup when the rank is logarithmic in the input scale. Although our algorithm runs in time polynomial in the logarithm of the matrix dimension, it still has a poor dependence on the rank k. In the future, we plan to improve its dependence on the rank as well as analyze its noise tolerance.

Acknowledgements

Part of this work was done during Y. Li's visit at the Institute of Quantum Computing, Baidu Inc. This work is supported in part by the National Natural Science Foundation of China Grants No. 61433014, 61832003, 61761136014, 61872334, 61502449, the Strategic Priority Research Program of Chinese Academy of Sciences Grant No. XDB28000000, and Anhui Initiative in Quantum Information Technologies, Grant No. AHY150100. Y. Li is supported by ERC Consolidator Grant 615307-QPROGRESS. Y. Li thanks Runyao Duan for hosting his visit at the Institute of Quantum Computing, Baidu Inc.


References

[Arora et al., 2012a] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the Forty-fourth Annual ACM Symposium on Theory of Computing, pages 145–162. ACM, 2012.
[Arora et al., 2012b] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE, 2012.
[Chia et al., 2018] Nai-Hui Chia, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired sublinear classical algorithms for solving low-rank linear systems. arXiv preprint arXiv:1811.04852, 2018.
[Ding et al., 2010] Chris H. Q. Ding, Tao Li, and Michael I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.
[Donoho and Stodden, 2004] David Donoho and Victoria Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, pages 1141–1148, 2004.
[Du et al., 2018] Yuxuan Du, Tongliang Liu, Yinan Li, Runyao Duan, and Dacheng Tao. Quantum divide-and-conquer anchoring for separable non-negative matrix factorization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2093–2099. AAAI Press, 2018.
[Elhamifar and Vidal, 2009] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2797. IEEE, 2009.
[Elhamifar et al., 2012] Ehsan Elhamifar, Guillermo Sapiro, and René Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1600–1607. IEEE, 2012.
[Esser et al., 2012] Ernie Esser, Michael Moller, Stanley Osher, Guillermo Sapiro, and Jack Xin. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 21(7):3239–3252, 2012.
[Frieze et al., 2004] Alan Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. Journal of the ACM (JACM), 51(6):1025–1041, 2004.
[Gillis and Vavasis, 2014] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698–714, 2014.
[Gilyén et al., 2018] András Gilyén, Seth Lloyd, and Ewin Tang. Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. arXiv preprint arXiv:1811.04909, 2018.
[Guan et al., 2012] Naiyang Guan, Dacheng Tao, Zhigang Luo, and John Shawe-Taylor. MahNMF: Manhattan non-negative matrix factorization. stat, 1050:14, 2012.
[Hofmann, 2017] Thomas Hofmann. Probabilistic latent semantic indexing. In ACM SIGIR Forum, volume 51, pages 211–218. ACM, 2017.
[Hsieh and Dhillon, 2011] Cho-Jui Hsieh and Inderjit S. Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1064–1072. ACM, 2011.
[Kim and Park, 2008] Jingu Kim and Haesun Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 353–362. IEEE, 2008.
[Kumar et al., 2013] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In International Conference on Machine Learning, pages 231–239, 2013.
[Lee and Seung, 1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
[Lee and Seung, 2001] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[Lin, 2007] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779, 2007.
[Mahoney and Drineas, 2009] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 2009.
[Recht et al., 2012] Ben Recht, Christopher Re, Joel Tropp, and Victor Bittorf. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems, pages 1214–1222, 2012.
[Tang, 2018] Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. arXiv preprint arXiv:1807.04271, 2018.
[Vavasis, 2009] Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.
[Zhou et al., 2013] Tianyi Zhou, Wei Bian, and Dacheng Tao. Divide-and-conquer anchoring for near-separable nonnegative matrix factorization and completion in high dimensions. In 2013 IEEE 13th International Conference on Data Mining, pages 917–926. IEEE, 2013.
[Zhou et al., 2014] Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. Divide-and-conquer learning by anchoring a conical hull. In Advances in Neural Information Processing Systems, pages 1242–1250, 2014.
