Fast and Efficient Boolean Matrix Factorization by Geometric

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Fast and Efficient Boolean Matrix Factorization by Geometric Segmentation Changlin Wan,1,2 Wennan Chang,1,2 Tong Zhao,3 Mengya Li,2 Sha Cao,2,∗ Chi Zhang2,* 1Purdue University, 2Indiana University, 3Amazon {wan82, chang534}@purdue.edu, [email protected], [email protected], {shacao,czhang87}@iu.edu Abstract as well as adapting to the drastic increase of dimensionality is of prominent interests for nowadays data science research. Boolean matrix has been used to represent digital informa- Recent study also showed that some continuous data could tion in many fields, including bank transaction, crime records, benefit from binary pattern mining. For instance, the bina- natural language processing, protein-protein interaction, etc. Boolean matrix factorization (BMF) aims to find an approxi- rization of continuous single cell gene expression data to its mation of a binary matrix as the Boolean product of two low on and off state, can better reflect the coordination patterns rank Boolean matrices, which could generate vast amount of of genes in regulatory networks (Larsson et al. 2019). How- information for the patterns of relationships between the fea- ever, owing to its two value characteristics, the rank of a tures and samples. Inspired by binary matrix permutation the- binary matrix under normal linear algebra can be very high ories and geometric segmentation, we developed a fast and due to certain spike rows or columns. This makes it infeasi- efficient BMF approach, called MEBF (Median Expansion ble to apply established methods such as SVD and PCA for for Boolean Factorization). Overall, MEBF adopted a heuris- BMF (Wall, Rechtsteiner, and Rocha 2003). tic approach to locate binary patterns presented as submatri- Boolean matrix factorization (BMF) has been developed ces that are dense in 1’s. At each iteration, MEBF permu- particularly for binary pattern mining, and it factorizes a bi- tates the rows and columns such that the permutated matrix is approximately Upper Triangular-Like (UTL) with so- nary matrix into approximately the product of two low rank called Simultaneous Consecutive-ones Property (SC1P). The binary matrices following Boolean algebra, as shown in Fig- largest submatrix dense in 1 would lie on the upper triangu- ure 1. The decomposition of a binary matrix into low rank lar area of the permutated matrix, and its location was deter- binary patterns is equivalent to locating submatrices that mined based on a geometric segmentation of a triangular. We are dense in 1. Analyzing binary matrix with BMF shows compared MEBF with other state of the art approaches on its unique power. In the most optimal case, it significantly data scenarios with different density and noise levels. MEBF reduces the rank of the original matrix calculated in nor- demonstrated superior performances in lower reconstruction mal linear algebra to its log scale (Monson, Pullman, and error, and higher computational efficiency, as well as more ac- Rees 1995). Since the binary patterns are usually embedded curate density patterns than popular methods such as ASSO, within noisy and randomly arranged binary matrix, BMF is PANDA and Message Passing. We demonstrated the applica- tion of MEBF on both binary and non-binary data sets, and known to be an NP-hard problem (Miettinen et al. 2008). revealed its further potential in knowledge retrieving and data denoising. Background Related work Introduction BMF was first introduced as a set basis problem in 1975 (Stockmeyer 1975). This area has received wide attention Binary data gains more and more attention during the trans- after a series of work by Mittenin et al (Miettinen et al. formation of modern living (Kocayusufoglu, Hoang, and 2008; Miettinen and Vreeken 2014; Karaev, Miettinen, and Singh 2018; Balasubramaniam, Nayak, and Yuen 2018). It Vreeken 2015). Among them, the ASSO algorithm per- consists of a large domain of our everyday life, where the forms factorization by retrieving binary bases from row-wise 1s or 0s in a binary matrix can physically mean whether correlation matrix in a heuristic manner (Miettinen et al. or not an event of online shopping transaction, web brows- 2008). Despite its popularity, the high computational cost ing, medical record, journal submission, etc, has occurred of ASSO makes it impracticable when dealing with large or not. The scale of these datasets has increased exponen- scale data. Recently, an algorithm called Nassua was devel- tially over the years. Mining the patterns within binary data oped by the same group (Karaev, Miettinen, and Vreeken ∗Corresponding author 2015). Nassua optimizes the initialization of the matrix fac- Copyright c 2020, Association for the Advancement of Artificial torization by locating dense seeds hidden within the matrix, Intelligence (www.aaai.org). All rights reserved. and with improved performance comparing to ASSO. How- 6086 Figure 1: BMF, the addition of rank 1 binary matrices ever, optimal parameter selection remains a challenge for m m0 m m m0 m Nassua. A second series of work called PANDA was devel- 2 2 2 n0 oped by Claudio et al (Lucchese, Orlando, and Perego 2010; 2 n n 2 2013). PANDA aims to find the most significant patterns in n0 2 n n the current binary matrix by discovering core patterns iter- atively (Lucchese, Orlando, and Perego 2010). After each iteration, PANDA only retains a residual matrix with all the non-zero values covered by identified patterns removed. Figure 2: Three simplified scenarios for UTL matrices with Later, PANDA+ was recently developed to reduce the noise direct SC1P. level in core pattern detection and extension (Lucchese, Or- lando, and Perego 2013). These two methods also suffer from inhibitory computational cost, as they need to recalcu- two Boolean matrices as Xn×m = An×k ⊗ Bk×m, where k late a global loss function at each iteration. More algorithms Xij = ∨l=1Ail ∧ Blj. and applications of BMF have been proposed in recent years. FastStep relaxed BMF constraints to non-negativity Problem statement by integrating non-negative matrix factorization (NMF) and X ∈{0, 1}n×m Boolean thresholding (Araujo, Ribeiro, and Faloutsos 2016). Given a binary matrix and a criteria parameter τ, the BMF problem is defined as identifying two But interpreting derived non-negative bases could also be A∗ B∗ challenging. With prior network information, Kocayusu- binary matrices and , called pattern matrices, that γ(A, B; X) τ foglu et al decomposes binary matrix in a stepwise fashion minimize the cost function under criteria , i.e., (A∗,B∗) = argmin (γ(A, B; X)|τ) τ with bases that are sampled from given network space (Ko- A,B . Here the criteria could vary with different problem assumptions. The criteria cayusufoglu, Hoang, and Singh 2018). Bayesian probability A∗ B∗ mapping has also been applied in this field . Ravanbakhsh et used in the current study is to identify and with at k A ∈{0, 1}n×k B ∈{0, 1}k×m al proposed a probability graph model called “factor-graph” most patterns, i.e., , , and the cost function is γ(A, B; X)=|X (A ⊗ B)|. We call to characterize the embedded patterns, and developed a mes- l A l B sage passing approach, called MP (Ravanbakhsh, Poczos,´ the th column of matrix and th row of matrix as the l l l =1, ..., k and Greiner 2016). On the other hand, Ormachine, proposed th binary pattern, or the th basis, . by Rukat et al, provided a probabilistic generative model for BMF (Rukat et al. 2017). Similarly, these Bayesian ap- MEBF Algorithm Framework proaches suffer from low computational efficiency. In ad- Motivation of MEBF dition, Bayesian model fitting could be highly sensitive to noisy data. BMF is equivalent to decomposing the matrix into the sum of multiple rank 1 binary matrices, each of which is also re- Notations ferred as a pattern or basis in the BMF literature (Lucchese, Orlando, and Perego 2010). A matrix is denoted by a uppercase character with a super ∗ ∗ n×m A B script n × m indicating its dimension, such as X , and Lemma 1 (Submatrix detection). Let , be the solu- X X X i j tion to arg minA∈{0,1}n×k,B∈{0,1}k×m |X (A ⊗ B)|, then with subscript i,:, :,j, ij indicating th row, th column, ∗ ∗ or the (i, j)th element, respectively. A vector is denoted as the k patterns identified in A , B correspond to k subma- X a bold lowercase character, such as a, and its subscript ai trices in that are dense in 1’s. In other words, finding A∗ B∗ X ,I ⊂ indicates the ith element. A scalar value is represented by , is equivalent to identify submatrices Il,Jl l { ,...,n} J ⊂{,...,m},l , ..., k, s.t.|X |≥ a lowercase character, such as a, and [a] as its integer part. 1 ; l 1 =1 Il,Jl t |I |∗|J | |I | I |X| and |x| represents the 1 norm of a matrix and a vec- 0( l l ). Here l is the cardinality of the index set l, tor. Under the Boolean algebra, the basic operations include t0 is a positive number between 0 and 1 that controls the X ∧(AND, 1 ∧ 1=1, 1 ∧ 0=0, 0 ∧ 0=0), ∨(OR, 1 ∨ 1= noise level of Il,Jl . 1, 0 ∨ 1=1, 0 ∨ 0=0), ¬(NOT,¬1=0, ¬0=1). De- note the Boolean element-wise sum, subtraction and prod- Proof. ∀l, it suffices to let Il be the indices of the lth column ∗ ∗ uct as A ⊕ B = A ∨ B, A B =(A ∧¬B) ∨ (¬A ∧ B) of A , such that A:,l =1; and let Jl be the indices of the lth ∗ ∗ and A B = A ∧ B, and the Boolean matrix product of row of B such that Bl,: =1.

Load more