Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

A Quantum-inspired Classical Algorithm for Separable Non-negative Matrix Factorization

Zhihuai Chen^{1,2}, Yinan Li^{3}, Xiaoming Sun^{1,2}, Pei Yuan^{1,2} and Jialin Zhang^{1,2}
^{1}CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, 100190, Beijing, China
^{2}University of Chinese Academy of Sciences, 100049, Beijing, China
^{3}Centrum Wiskunde & Informatica and QuSoft, Science Park 123, 1098XG Amsterdam, Netherlands
{chenzhihuai, sunxiaoming, yuanpei, zhangjialin}[email protected], [email protected]

Abstract

Non-negative Matrix Factorization (NMF) asks to decompose an (entry-wise) non-negative matrix into the product of two smaller non-negative matrices, a problem which has been shown to be intractable in general. To overcome this issue, the separability assumption is introduced, which assumes that all data points lie in a conical hull. This assumption makes NMF tractable and is widely used in text analysis and image processing, but it is still impractical for huge-scale datasets. In this paper, inspired by recent developments in dequantizing techniques, we propose a new classical algorithm for the separable NMF problem. Our new algorithm runs in time polynomial in the rank and logarithmic in the size of the input matrices, which achieves an exponential speedup in the low-rank setting.

1 Introduction

Non-negative Matrix Factorization (NMF) aims to approximate a non-negative data matrix $A \in \mathbb{R}^{m \times n}_{\ge 0}$ by the product of two non-negative low-rank factors, i.e., $A \approx WH^T$, where $W \in \mathbb{R}^{m \times k}_{\ge 0}$ is called the basis matrix, $H \in \mathbb{R}^{n \times k}_{\ge 0}$ is called the encoding matrix, and $k \ll \min\{m, n\}$. In many applications, an NMF often results in a more natural and interpretable part-based decomposition of the data [Lee and Seung, 1999]. Therefore, NMF has been widely used in a number of practical applications, such as topic modeling in text, signal separation, social networks, collaborative filtering, dimension reduction, sparse coding, feature selection and hyperspectral image analysis. Since computing an NMF is NP-hard [Vavasis, 2009], a series of heuristic algorithms have been proposed [Lee and Seung, 2001; Lin, 2007; Hsieh and Dhillon, 2011; Kim and Park, 2008; Ding et al., 2010; Guan et al., 2012]. All of these heuristic algorithms aim to minimize the reconstruction error
$$\min_{W \in \mathbb{R}^{m \times k}_{\ge 0},\; H \in \mathbb{R}^{n \times k}_{\ge 0}} \left\| A - WH^T \right\|_F,$$
which is a non-convex program and lacks optimality guarantees.

A natural assumption on the data, called the separability assumption, was observed in [Donoho and Stodden, 2004]. From a geometric perspective, the separability assumption means that all rows of $A$ reside in a cone generated by a rather small number of rows. In particular, these generators are called the anchors of $A$. To solve Separable Non-negative Matrix Factorization (SNMF), it is sufficient to identify the anchors of the input matrix, which can be done in polynomial time [Arora et al., 2012a; Arora et al., 2012b; Gillis and Vavasis, 2014; Esser et al., 2012; Elhamifar et al., 2012; Zhou et al., 2013; Zhou et al., 2014]. The separability assumption is favored by various practical applications. For example, in the unmixing task in hyperspectral imaging, separability implies the existence of 'pure' pixels [Gillis and Vavasis, 2014], and in the topic detection task it means that some words are associated with a unique topic [Hofmann, 2017]. In huge datasets, it is useful to pick a few representative data points to stand for the other points; such a 'self-expression' assumption helps to improve the data analysis procedure [Mahoney and Drineas, 2009; Elhamifar and Vidal, 2009].

1.1 Related Work

It is natural to assume that all rows of the input $A$ have unit $\ell_1$-norm, since $\ell_1$-normalization turns the conical hull into a convex hull while keeping the anchors unchanged. From this perspective, most algorithms essentially identify the extreme points of the convex hull of the ($\ell_1$-normalized) data vectors. In [Arora et al., 2012a], the authors use $m$ linear programs in $O(m)$ variables to identify the anchors among the $m$ data points, which is therefore not suitable for large-scale real-world problems. Furthermore, [Recht et al., 2012] presents a single LP in $n^2$ variables for SNMF to deal with large-scale problems, but it is still impractical for huge-scale problems.

There is another class of algorithms based on greedy strategies. The main idea is to pick the data point in the direction along which the current residual decreases fastest, and to terminate once the error is sufficiently small or a maximum number of iterations is reached. For example, the Successive Projection Algorithm (SPA) [Gillis and Vavasis, 2014] derives from Gram-Schmidt orthogonalization with row or column pivoting. XRAY [Kumar et al., 2013] detects a new anchor by referring to the residual of the exterior data points and updates the residual matrix by solving a non-negative least squares regression. Both of these greedy-pursuit algorithms have smaller time complexity than the LP-based methods; however, their time complexity is still too large for large-scale data.

[Zhou et al., 2013; Zhou et al., 2014] utilize a Divide-and-Conquer Anchoring (DCA) framework to tackle SNMF: the data set is projected onto several low-dimensional subspaces, and each projection determines a small set of anchors. Moreover, it can be proven that all $k$ anchors can be identified with $O(k \log k)$ projections.

Recently, a quantum algorithm for SNMF, the Quantum Divide-and-Conquer Anchoring algorithm (QDCA), has been presented [Du et al., 2018]; it uses quantum techniques to speed up the random projection step in [Zhou et al., 2013]. QDCA implements the matrix-vector product (i.e., the random projection) via quantum principal component analysis, after which a quantum state encoding the projected data points can be prepared efficiently. Moreover, several papers utilize dequantizing techniques to solve low-rank matrix problems, such as recommendation systems [Tang, 2018] and matrix inversion [Gilyén et al., 2018; Chia et al., 2018]. The dequantizing techniques in those algorithms involve two ingredients, Monte-Carlo singular value decomposition and rejection sampling, which can efficiently simulate certain special operations on low-rank matrices.
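Both QDCA and the algorithm proposed in this paper target the random projection step of the DCA framework just described. For intuition only, the following is a minimal NumPy sketch of that step under the $\ell_1$-normalized (convex-hull) view from above; the function and variable names are illustrative assumptions, not code from [Zhou et al., 2013].

```python
import numpy as np

def dca_candidate_anchors(A, num_projections, seed=None):
    """Illustrative divide-and-conquer anchoring round (not the paper's code).

    Assumes the rows of A are l1-normalized, so every row lies in the convex
    hull of the anchor rows.  For each random direction g, the projection
    A @ g is a linear functional on that hull, so its maximizer (and
    minimizer) over the rows is attained at an extreme point, i.e. an anchor.
    Repeating for O(k log k) directions recovers all k anchors w.h.p.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    candidates = set()
    for _ in range(num_projections):
        g = rng.standard_normal(n)            # random projection direction
        proj = A @ g                          # O(mn) matrix-vector product
        candidates.add(int(np.argmax(proj)))  # extreme rows on this direction
        candidates.add(int(np.argmin(proj)))
    return sorted(candidates)
```

Each exact product $Ag$ costs $O(mn)$ time, which is the bottleneck on huge matrices; this is precisely the computation that the quantum and quantum-inspired approaches avoid performing explicitly.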
Inspired by QDCA and the dequantizing techniques, we propose a classical randomized algorithm which speeds up the random projection step in [Zhou et al., 2013] and thereby identifies all anchors efficiently. Our algorithm runs in time polynomial in the rank $k$, the condition number $\kappa$ and the logarithm of the matrix size. When the rank $k = O(\log(mn))$, our algorithm achieves an exponential speedup over any other classical algorithm for SNMF.

2 Preliminaries

2.1 Notations

Let $[n] := \{1, 2, \ldots, n\}$. Let $\mathrm{span}\{x_i \in \mathbb{R}^n \mid i \in [k]\} := \{\sum_{i=1}^{k} \alpha_i x_i \mid \alpha_i \in \mathbb{R}, i \in [k]\}$ denote the space spanned by the $x_i$ for $i \in [k]$. For a matrix $A \in \mathbb{R}^{m \times n}$, $A_{(i)}$ and $A^{(j)}$ denote the $i$-th row and the $j$-th column of $A$ for $i \in [m]$, $j \in [n]$, respectively. Let $A_R = [A_{(i_1)}^T, A_{(i_2)}^T, \ldots, A_{(i_r)}^T]^T$, where $A \in \mathbb{R}^{m \times n}$ and $R = \{i_1, i_2, \ldots, i_r\} \subseteq [m]$ (without loss of generality, assume $i_1 \le i_2 \le \cdots \le i_r$). $\|A\|_F$ and $\|A\|_2$ refer to the Frobenius norm and the spectral norm, respectively. For a vector $v \in \mathbb{R}^n$, $\|v\|$ denotes its $\ell_2$-norm. For two probability distributions $p, q$ (as density functions) over a discrete universe $D$, the total variation distance between them is defined as $\|p, q\|_{TV} := \frac{1}{2} \sum_{i \in D} |p(i) - q(i)|$. $\kappa(A) := \sigma_{\max}/\sigma_{\min}$ denotes the condition number of $A$, where $\sigma_{\max}$ and $\sigma_{\min}$ are its largest and smallest non-zero singular values.

Definition 2.1 ($\ell_2$-norm Sampling). Let $D_v$ denote the distribution over $[n]$ with density function $D_v(i) = v_i^2 / \|v\|^2$ for $v \in \mathbb{R}^n$. A sample from the distribution $D_v$ is called a sample from $v$.

Lemma 2.2 (Vector Sample Model). There is a data structure storing a vector $v \in \mathbb{R}^n$ in $O(n \log n)$ space and supporting the following operations:
• Querying and updating an entry in $O(\log n)$ time;
• Sampling from $D_v$ in $O(\log n)$ time;
• Finding $\|v\|$ in $O(1)$ time.

Such a data structure can easily be implemented via a Binary Search Tree (BST); see Figure 1.

Figure 1: Binary search tree for $v = (v_1, v_2, v_3, v_4)^T \in \mathbb{R}^4$. Each leaf node stores $v_i^2$ and each interior node stores the sum of its children, so the root stores $\|v\|^2$. In order to restore the original vector, the sign of $v_i$ is also stored in the corresponding leaf node. To sample from $D_v$, start from the root and recursively descend to a child with probability proportional to its weight.

Proposition 2.3 (Matrix Sample Model). Consider a matrix $A \in \mathbb{R}^{m \times n}$, and let $\tilde{A}$ and $\tilde{A}'$ be the vectors whose entries are $\|A_{(i)}\|$ and $\|A^{(j)}\|$, respectively. There is a data structure storing the matrix $A \in \mathbb{R}^{m \times n}$ in $O(mn)$ space and supporting the following operations:
• Querying and updating an entry in $O(\log m + \log n)$ time;
• Sampling from $A_{(i)}$ for any $i \in [m]$ in time $O(\log n)$;
• Sampling from $A^{(j)}$ for any $j \in [n]$ in time $O(\log m)$;
• Finding $\|A\|_F$, $\|A_{(i)}\|$ and $\|A^{(j)}\|$ in time $O(1)$;
• Sampling from $\tilde{A}$ and $\tilde{A}'$ in time $O(\log m)$ and $O(\log n)$, respectively.
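Lemma 2.2 and Figure 1 describe the vector sample model only at a high level, so a compact Python sketch of one possible reading is given below. The class and method names are ours, not the paper's, and the flat-array layout shown here uses $O(n)$ words (the lemma only requires $O(n \log n)$ space): squared entries sit in the leaves of a complete binary tree, every interior node stores the sum of its children, and sampling from $D_v$ walks a single root-to-leaf path.

```python
import math
import random

class VectorSampleModel:
    """Illustrative reading of Lemma 2.2 / Figure 1 (not the authors' code).

    Leaf i stores v_i^2 (the sign of v_i is kept separately), and every
    interior node stores the sum of its two children, so the root holds
    ||v||^2.  All operations touch a single root-to-leaf path.
    """

    def __init__(self, n):
        self.n = n
        self.leaves = 1
        while self.leaves < n:                   # pad the leaf level to a power of two
            self.leaves *= 2
        self.tree = [0.0] * (2 * self.leaves)    # tree[1] is the root
        self.sign = [1] * n

    def update(self, i, value):
        """Set v_i = value, refreshing the O(log n) ancestors of leaf i."""
        self.sign[i] = 1 if value >= 0 else -1
        pos = self.leaves + i
        self.tree[pos] = value * value
        pos //= 2
        while pos >= 1:
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def query(self, i):
        """Return v_i: the stored sign times the square root of the leaf."""
        return self.sign[i] * math.sqrt(self.tree[self.leaves + i])

    def norm(self):
        """Return ||v|| in O(1) time, read off the root."""
        return math.sqrt(self.tree[1])

    def sample(self):
        """Draw index i with probability v_i^2 / ||v||^2 in O(log n) time."""
        r = random.random() * self.tree[1]
        pos = 1
        while pos < self.leaves:                 # descend, choosing a child by weight
            left = self.tree[2 * pos]
            if r < left:
                pos = 2 * pos
            else:
                r -= left
                pos = 2 * pos + 1
        return min(pos - self.leaves, self.n - 1)
```

For example, after `update`-ing the entries of $v = (1, -2, 2)$, `sample()` returns index 1 with probability $4/9$. One natural way to obtain the matrix sample model of Proposition 2.3 from this structure (an assumption about the construction, not a statement from the paper) is to keep one such tree per row of $A$ together with a tree over the row-norm vector $\tilde{A}$, and symmetrically for columns and $\tilde{A}'$; the total space remains $O(mn)$ and an entry update touches $O(\log m + \log n)$ tree nodes.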