The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Fast and Efficient Boolean Factorization by Geometric Segmentation

Changlin Wan,1,2 Wennan Chang,1,2 Tong Zhao,3 Mengya Li,2 Sha Cao,2,∗ Chi Zhang2,* 1Purdue University, 2Indiana University, 3Amazon {wan82, chang534}@purdue.edu, [email protected], [email protected], {shacao,czhang87}@iu.edu

Abstract as well as adapting to the drastic increase of dimensionality is of prominent interests for nowadays data science research. Boolean matrix has been used to represent digital informa- Recent study also showed that some continuous data could tion in many fields, including bank transaction, crime records, benefit from binary pattern mining. For instance, the bina- natural language processing, protein-protein interaction, etc. Boolean matrix factorization (BMF) aims to find an approxi- rization of continuous single cell gene expression data to its mation of a binary matrix as the Boolean product of two low on and off state, can better reflect the coordination patterns rank Boolean matrices, which could generate vast amount of of genes in regulatory networks (Larsson et al. 2019). How- information for the patterns of relationships between the fea- ever, owing to its two value characteristics, the rank of a tures and samples. Inspired by binary matrix permutation the- binary matrix under normal linear algebra can be very high ories and geometric segmentation, we developed a fast and due to certain spike rows or columns. This makes it infeasi- efficient BMF approach, called MEBF (Median Expansion ble to apply established methods such as SVD and PCA for for Boolean Factorization). Overall, MEBF adopted a heuris- BMF (Wall, Rechtsteiner, and Rocha 2003). tic approach to locate binary patterns presented as submatri- Boolean matrix factorization (BMF) has been developed ces that are dense in 1’s. At each iteration, MEBF permu- particularly for binary pattern mining, and it factorizes a bi- tates the rows and columns such that the permutated ma- trix is approximately Upper Triangular-Like (UTL) with so- nary matrix into approximately the product of two low rank called Simultaneous Consecutive-ones Property (SC1P). The binary matrices following Boolean algebra, as shown in Fig- largest submatrix dense in 1 would lie on the upper triangu- ure 1. The decomposition of a binary matrix into low rank lar area of the permutated matrix, and its location was deter- binary patterns is equivalent to locating submatrices that mined based on a geometric segmentation of a triangular. We are dense in 1. Analyzing binary matrix with BMF shows compared MEBF with other state of the art approaches on its unique power. In the most optimal case, it significantly data scenarios with different density and noise levels. MEBF reduces the rank of the original matrix calculated in nor- demonstrated superior performances in lower reconstruction mal linear algebra to its log scale (Monson, Pullman, and error, and higher computational efficiency, as well as more ac- Rees 1995). Since the binary patterns are usually embedded curate density patterns than popular methods such as ASSO, within noisy and randomly arranged binary matrix, BMF is PANDA and Message Passing. We demonstrated the applica- tion of MEBF on both binary and non-binary data sets, and known to be an NP-hard problem (Miettinen et al. 2008). revealed its further potential in knowledge retrieving and data denoising. Background Related work Introduction BMF was first introduced as a set basis problem in 1975 (Stockmeyer 1975). This area has received wide attention Binary data gains more and more attention during the trans- after a series of work by Mittenin et al (Miettinen et al. formation of modern living (Kocayusufoglu, Hoang, and 2008; Miettinen and Vreeken 2014; Karaev, Miettinen, and Singh 2018; Balasubramaniam, Nayak, and Yuen 2018). It Vreeken 2015). Among them, the ASSO algorithm per- consists of a large domain of our everyday life, where the forms factorization by retrieving binary bases from row-wise 1s or 0s in a binary matrix can physically mean whether correlation matrix in a heuristic manner (Miettinen et al. or not an event of online shopping transaction, web brows- 2008). Despite its popularity, the high computational cost ing, medical record, journal submission, etc, has occurred of ASSO makes it impracticable when dealing with large or not. The scale of these datasets has increased exponen- scale data. Recently, an algorithm called Nassua was devel- tially over the years. Mining the patterns within binary data oped by the same group (Karaev, Miettinen, and Vreeken ∗Corresponding author 2015). Nassua optimizes the initialization of the matrix fac- Copyright c 2020, Association for the Advancement of Artificial torization by locating dense seeds hidden within the matrix, Intelligence (www.aaai.org). All rights reserved. and with improved performance comparing to ASSO. How-

6086 Figure 1: BMF, the addition of rank 1 binary matrices ever, optimal parameter selection remains a challenge for m m0 m m m0 m Nassua. A second series of work called PANDA was devel- 2 2 2 n0 oped by Claudio et al (Lucchese, Orlando, and Perego 2010; 2 n n 2 2013). PANDA aims to find the most significant patterns in n0 2 n n the current binary matrix by discovering core patterns iter- atively (Lucchese, Orlando, and Perego 2010). After each iteration, PANDA only retains a residual matrix with all the non-zero values covered by identified patterns removed. Figure 2: Three simplified scenarios for UTL matrices with Later, PANDA+ was recently developed to reduce the noise direct SC1P. level in core pattern detection and extension (Lucchese, Or- lando, and Perego 2013). These two methods also suffer from inhibitory computational cost, as they need to recalcu- two Boolean matrices as Xn×m = An×k ⊗ Bk×m, where k late a global loss function at each iteration. More algorithms Xij = ∨l=1Ail ∧ Blj. and applications of BMF have been proposed in recent years. FastStep relaxed BMF constraints to non-negativity Problem statement by integrating non-negative matrix factorization (NMF) and X ∈{0, 1}n×m Boolean thresholding (Araujo, Ribeiro, and Faloutsos 2016). Given a binary matrix and a criteria pa- rameter τ, the BMF problem is defined as identifying two But interpreting derived non-negative bases could also be A∗ B∗ challenging. With prior network information, Kocayusu- binary matrices and , called pattern matrices, that γ(A, B; X) τ foglu et al decomposes binary matrix in a stepwise fashion minimize the cost function under criteria , i.e., (A∗,B∗) = argmin (γ(A, B; X)|τ) τ with bases that are sampled from given network space (Ko- A,B . Here the criteria could vary with different problem assumptions. The criteria cayusufoglu, Hoang, and Singh 2018). Bayesian probability A∗ B∗ mapping has also been applied in this field . Ravanbakhsh et used in the current study is to identify and with at k A ∈{0, 1}n×k B ∈{0, 1}k×m al proposed a probability graph model called “factor-graph” most patterns, i.e., , , and the cost function is γ(A, B; X)=|X  (A ⊗ B)|. We call to characterize the embedded patterns, and developed a mes- l A l B sage passing approach, called MP (Ravanbakhsh, Poczos,´ the th column of matrix and th row of matrix as the l l l =1, ..., k and Greiner 2016). On the other hand, Ormachine, proposed th binary pattern, or the th basis, . by Rukat et al, provided a probabilistic generative model for BMF (Rukat et al. 2017). Similarly, these Bayesian ap- MEBF Algorithm Framework proaches suffer from low computational efficiency. In ad- Motivation of MEBF dition, Bayesian model fitting could be highly sensitive to noisy data. BMF is equivalent to decomposing the matrix into the sum of multiple rank 1 binary matrices, each of which is also re- Notations ferred as a pattern or basis in the BMF literature (Lucchese, Orlando, and Perego 2010). A matrix is denoted by a uppercase character with a super ∗ ∗ n×m A B script n × m indicating its dimension, such as X , and Lemma 1 (Submatrix detection). Let , be the solu- X X X i j tion to arg minA∈{0,1}n×k,B∈{0,1}k×m |X  (A ⊗ B)|, then with subscript i,:, :,j, ij indicating th row, th column, ∗ ∗ or the (i, j)th element, respectively. A vector is denoted as the k patterns identified in A , B correspond to k subma- X a bold lowercase character, such as a, and its subscript ai trices in that are dense in 1’s. In other words, finding A∗ B∗ X ,I ⊂ indicates the ith element. A scalar value is represented by , is equivalent to identify submatrices Il,Jl l { ,...,n} J ⊂{,...,m},l , ..., k, s.t.|X |≥ a lowercase character, such as a, and [a] as its integer part. 1 ; l 1 =1 Il,Jl t |I |∗|J | |I | I |X| and |x| represents the 1 norm of a matrix and a vec- 0( l l ). Here l is the cardinality of the index set l, tor. Under the Boolean algebra, the basic operations include t0 is a positive number between 0 and 1 that controls the X ∧(AND, 1 ∧ 1=1, 1 ∧ 0=0, 0 ∧ 0=0), ∨(OR, 1 ∨ 1= noise level of Il,Jl . 1, 0 ∨ 1=1, 0 ∨ 0=0), ¬(NOT,¬1=0, ¬0=1). De- note the Boolean element-wise sum, subtraction and prod- Proof. ∀l, it suffices to let Il be the indices of the lth column ∗ ∗ uct as A ⊕ B = A ∨ B, A  B =(A ∧¬B) ∨ (¬A ∧ B) of A , such that A:,l =1; and let Jl be the indices of the lth ∗ ∗ and A  B = A ∧ B, and the Boolean matrix product of row of B such that Bl,: =1.

6087 11 2 2 1 1 2 2 11 11 3 10 8 8 2 2 8 8 10 10 2 9 5 5 11 11 1 1 9 9 1 10 10 8 6 6 5 5 8 8 1234567891011 7 3 3 8 8 3 3 7 7 6 1 1 5 5 11 11 6 6 5 9 9 3 3 6 6 5 5 4 11 11 6 6 7 7 4 4 3 10 10 7 7 4 4 3 3 2 7 7 4 4 10 10 2 2 1 4 4 9 9 9 9 1 1 1234567891011 1011839761524 1011839761524 7961011518243 7961011518243 7519624101183 7519624101183 1234567891011 123 a1. original matrix a2. approx UTL with direct SCIP a3. bidirectional growth 1 a4. residual matrix 1 a5. bidirectional growth 2 a6. residual matrix 2a7. weak signal detection a8. MEBF decomposition a9. MEBF patterns

11 9 9 9 9 11 11 11 11 3 10 8 8 8 8 5 5 10 10 9 2 4 4 4 4 1 1 9 9 1 8 3 3 9 9 8 8 3 3 1234567891011 7 2 2 2 2 8 8 7 7 6 11 11 11 11 4 4 6 6 5 10 10 5 5 3 3 5 5 4 7 7 1 1 2 2 4 4 3 6 6 10 10 10 10 3 3 2 5 5 7 7 7 7 2 2 1 1 1 6 6 6 6 1 1 1234567891011 123678910 45 11 123678910 45 11 12345106789 11 12345106789 11 1234510789 11 6 1234510789 11 6 1234567891011 123 b1. original matrix b2. approx UTL with direct SC1P b3. bidirectional growth 1 b4. residual matrix 1 b5. bidirectional growth 2 b6. residual matrix 2b7. bidirectional growth 3 b8. MEBF decomposition b9. MEBF patterns

Figure 3: A schematic overview of the MEBF pipeline for three data scenarios where the matrix is roughly UTL with SC1P.

Motivated by Lemma 1, instead of looking for patterns they show strong similarity to the first row or column. Af- directly, we turn to identify large submatrices in X that are ter the expansion concludes, one could determine whether enriched by 1, such that each submatrix would correspond to retain the submatrix expanded row-wise or column-wise, to one binary pattern or basis. whichever reduces more of the cost function. Definition 1 (Direct consecutive-ones property, direct It is common that the underlying SC1P pattern may not C1P). A binary matrix X has direct C1P if for each of its exist for a binary matrix, and we turn to find the matrix with row vector, all 1’s occur at consecutive indices. closest SC1P. X Definition 2 (Simultaneous consecutive-ones property, Definition 4 (Closest SC1P). Given a binary matrix and ˆ SC1P). A binary matrix X has direct SC1P, if both X and a nonnegative weight matrix W , a matrix X that has SC1P T X have direct C1P; and a binary matrix X has SC1P, if and minimizes the distance dW (X, Xˆ) is the closest SC1P there exists a permutation of the rows and columns such that matrix of X. the permutated matrix has direct SC1P. Based on Lemma 2, we could find all the submatrices in Definition 3 (Upper Triangular-Like matrix, UTL). A Lemma 1 by first permutating rows and columns of matrix m×n binary matrix X is called an Upper Triangular-Like X to be an UTL matrix with closest direct SC1P, locating m X ≤ m X ≤ ··· ≤ (UTL) matrix, if 1) i=1 i1  i=1 i2 the largest submatrix of all 1’s, to be our first binary pattern. m X n X ≥ n X ≥···≥ Then we are left with a residual matrix whose entries cov- i=1 in;2) j=1 1j j=1 2j n ered by existing patterns are set to zero. Repeat the process Xmj. In other words, the matrix has non-increasing j=2 on the residual matrix until convergence. However, finding row sums from top down, and non-decreasing column sums matrix of closest SC1P of matrix X is NP-hard (Junttila and from left to right. others 2011; Oswald and Reinelt 2009). Lemma 2 (UTL matrix with direct SC1P). Assume X has X X Lemma 3 (Closest SC1P). Given a binary matrix and a no all-zero rows or columns. If is an UTL matrix and has W Xˆ direct SC1P, then an all 1 submatrix of the largest area in X nonnegative weight matrix , finding a matrix that has d X, Xˆ is seeded where one of its column lies in the medium column SC1P and minimizes the distance W ( ) is an NP-hard of the matrix, or one of its row lies in the medium row of the problem. matrix, as shown in Figure 2. The NP-hardness of the closest SC1P problem has been Figure 2 presented three simplified scenarios of UTL ma- shown in(Oswald and Reinelt 2009, Junttila 2011). Both ex- trix that has direct SC1P. In (a), (b), the 1’s are organized act and heuristic algorithms are known for the problem, and in triangular shape, where certain rows in (a) and certain it has also been shown if the number of rows or columns is columns in (b) are all zero, and in (c), the 1’s are shaped in bounded, then solving closest SC1P requires only polyno- block diagonal. After removing all-zero rows and columns, mial time (Oswald 2003). In our MEBF algorithm, we at- the upper triangular area of the shuffled matrix is dense in tempt to address it by using heuristic methods and approxi- 1. It is easy to show that a rectangular with the largest area mation algorithms. in a triangular is the one defined by the three midpoints of the three sides, together with the vertex of the right angle Overview of the triangular, as colored by red in Figure 2. The width Overall, MEBF adopted a heuristic approach to locate sub- and height of the rectangular equal to half of the two legs matrices that are dense in 1’s iteratively. Starting with the m n0 m0 n m n of the triangular, i.e. ( 2 , 2 ), ( 2 , 2 ), ( 2 , 2 ) for the three original matrix as a residual matrix, at each iteration, MEBF scenarios in Figure 2 respectively. According to Lemma 2, permutates the rows and columns of the current residual ma- this largest rectangular contains at least one row or one col- trix so that the 1’s are gathered on entries of the upper tri- umn (colored in yellow) of the largest all 1 submatrix in the angular area. This step is to approximate the . Consequently, starting with one row or column, ex- operation it takes to make a matrix UTL and direct SC1P. pansions with new rows or columns could be done easily if Then as illustrated in Figure 2 and Figure 3, the rectangular

6088 of the largest area in the upper triangular, and presumably, Algorithm 1: MEBF of the highest frequencies of 1’s, will be captured. The pat- X ∈{ , }n×m t ∈ , τ tern corresponding to this submatrix represents a good rank- Inputs: 0 1 , (0 1), Outputs: A∗ ∈{0, 1}n×k, B∗ ∈{0, 1}k×m 1 approximation of the current residual matrix. Before the MEBF X, t, τ end of each iteration, the residual matrix will be updated by ( ): Xresidual ← X, γ0 ← inf flipping all the 1’s located in the identified submatrix in this A∗ ← NULL B∗ ← NULL step to be 0. , while !τ do Shown in Figure 3a, for an input Boolean matrix (a1), (a, b) ← bidirectional growth(Xresidual,t) MEBF first rearranges the matrix to obtain an approximate ∗ Atmp ← append(A , a) UTL matrix with closest direct SC1P. This was achieved ∗ B ← append(B , b) by reordering the rows so that the row norms are non- tmp if γ(A ,B ; X) >γ0 then increasing, and the columns so that the column norms are tmp tmp (a, b) ← weak signal detection(Xresidual,t); non-decreasing (a2). Then, MEBF takes either the column or ∗ Atmp ← append(A , a) row with medium number of 1’s as one basis or pattern (a3). ∗ B ← append(B , b) As the name reveals, MEBF then adopts a median expansion tmp if γ(A ,B ; X) >γ0 then step, where the medium column or row would propogate to tmp tmp break ; other columns or rows with a bidirectional growth algorithm A∗ ← append A∗, until certain stopping criteria is met. Whether to choose ( a) B∗ ← append(B∗, b) the pattern expanded row-wise or column-wise depends on γ ← γ A∗,B∗ X which one minimizes the cost function with regards to the 0 ( ; ) X ← 0 when (a ⊗ b)ij =1 current residual matrix. Before the end of each iteration, residualij MEBF computes a residual matrix by doing a Boolean sub- end traction of the newly selected rank-1 pattern matrix from the current residual matrix (a4). This process continues until the convergence criteria was met. If the patterns identified by the column pattern if γ(X:,(med), e; X) <γ(f,X(med),:; X),or bidirectional growth step stopped deceasing the cost func- the row pattern otherwise. Here, the cost function is defined tion before the convergence criteria was met, another step as γ(a, b; X)=|X (a⊗b)|. This is equivalent to selecting called weak signal detection would be conducted (a6,a7). a pattern that achieves lower overall cost function at the cur- Figure 3b illustrated a special case, where the permutated rent step. Obviously here, a smaller t could achieve higher matrix is roughly block diagonal (b1), which corresponds to coverage with less number of patterns, while a larger t en- the third scenario in Figure 2. The same procedure as shown ables a more sparse decomposition of the input matrix with in 3a could guarantee the accurate location of all the pat- greater number of patterns. Patterns found by bidirectional terns. The computational complexity of bidirectional growth growth does not guarantee a constant decrease of the cost and weak signal detection algorithms are both O(nm) and the function. In the case the cost function increases, we adopt a complexity of each iteration of MEBF is O(nm). The main weak signal detection step before stopping the algorithm. algorithm of MEBF is illustrated below:

Bidirectional Growth Algorithm 2: Bidirectional Growth X ∈{ , }n×m t ∈ , For an input binary (residual) matrix X, we first rearrange Inputs: 0 1 , (0 1] X Outputs: (a,b) by reordering the rows and columns so that the row X, t norms are non-increasing, and the column norms are non- bidirectional growth( ): X ← UTL operation on X decreasing. The rearranged X, after removing its all-zero  ← X ←{ :,j d >t,j , ..., m} columns and rows, is denoted as X , the median column and d :,(med), e ( ) =1    X X X X:,( ) i,: f median row of as :,med and med,:. Denote med and f ← X(med),:, g ←{( < , > >t),i=1, ..., n} X X X f f (med),: as the column and row in corresponding to :,med if γ(d, e; X) >γ(g, f; X) then X X and med,:. The similarity between :,(med) and columns a ← g; b ← f; of X can be computed as a column wise correlation vec- else m :,i :,(med) ← ← tor m ∈ (0, 1) , where mi = . Similarly, a d; b e; :,(med) :,(med) end the similarity between X(med),: and rows of X can be com- n puted as a vector n ∈ (0, 1) ,nj = .A pre-specified threshold t ∈ (0, 1) was further applied, and two vectors e and f indicating the similarity strength of the Weak Signal Detection Algorithm columns and rows of X with X:,(med) and X(med),:, are ob- The bidirectional growth steps do not guarantee a constant tained, where ej =(mj >t) and fi =(ni >t). Here the decrease of the cost function, especially when after the binary vectors e and f each represent one potential BMF ”large” patterns have been identified and the ”small” pat- pattern . In each iteration, we select the row or column pat- terns are easily confused with noise. To identify weak pat- tern whichever fits the current residual matrix better, i.e. the terns from a residual matrix, we came up with a week signal

6089 detection algorithm to locate the regions that may still have each scenario. small but true patterns. Here, from the current residual ma- We evaluate the goodness of the algorithms by con- trix, we search the two columns with the most number of sidering two metrics, namely the reconstruction error 1’s and form a new column that is the intersection of the and density (Belohlavek, Outrata, and Trnecka 2018; two columns; and the two rows with the most number of 1’s Rukat et al. 2017), as defined below: and form a new row that is the intersection of the two rows. |(U ⊗ V )  (A∗ ⊗ B∗)| Starting from the new column and new row as a pattern, sim- Reconstruction error := ilar to bidirectional growth, we locate the rows or columns |U ⊗ V | in the residual matrix that have high enough similarity to the pattern, thus expanding a single row or column into a sub- |A∗n×k| + |B∗k×m| matrix. The one pattern among the two with the lowest cost Density := function with regards to the residual matrix will be selected. (n + m) × k And if addition of the pattern to existing patterns could de- ∗ ∗ Here, U, V are the ground truth patterns while A and B crease the cost function with regards to the original matrix, are the decomposed patterns by each algorithm. The density it will be retained. Otherwise, the algorithm will stop. metric is introduced to evaluate whether the decomposed patterns could reflect the sparsity/density levels of the true Algorithm 3: Weak Signal Detection patterns. It is notable that with the same reconstruction er- Inputs: X ∈{0, 1}n×m, t ∈ (0, 1] ror, patterns of lower density, i.e., higher sparsity are more Outputs: (a, b) desirable, as it leads to more parsimonious models. Weak signal detection(X, t) In Figure 4 and 5, we show that, compared with ASSO, X ← UTL operation on X PANDA and MP, MEBF is the fastest and most robust al- 1   gorithm. Here, the convergence criteria for the algorithms d ← X:,m ∧ X:,m−1 1 are set as: (1) 10 patterns were identified; (2) or for MEBF, 1 ←{ >t,j , ..., m} e ( ) =1 PANDA and ASSO, they will also stop if a newly identified 2   e ← X1,: ∧ X2,: pattern does not decrease the cost function. 2 2 As shown in Figure 4, MEBF has the best performance on d ←{( 2 2 >t),i=1, ..., n} small and big sized matrices for all the four different sce- l l l ← arg minl=1,2 γ(d , e ,X) narios, on 50 simulations each. It achieved the lowest re- a ← dl; b ← el; constructed error with the least computation time compared with all other algorithms. The convergence rate of MEBF also outperforms PANDA and MP. Though ASSO converges early with the least number of patterns, its reconstruction er- Experiment ror is considerably higher than MEBF, especially for high Simulation data density matrices. In addition, ASSO derived patterns tend to 1 be more dense than the true patterns, while those derived We first compared MEBF with three state-of-the-art ap- from the other three methods have similar density levels proaches, ASSO, PANDA and Message Passing (MP), on with the true patterns. By increasing the number of patterns, simulated datasets. Xn×m PANDA stably decreased reconstruction error, but it has a A binary matrix is simulated as considerably slow convergence rate and high computation n×m n×k k×m X = U ⊗ V +f E cost. MP suffered in fitting small size matrices, and in the case of low with noise, MP derived patterns where would not converge. The standard deviations of reconstruc- U ,V ∼ Bernoulli p E ∼ Bernoulli p tion error and density across 50 simulations is quite low, and ij ij ( 0) ij ( ) was demonstrated by the size of the shapes. The computa- ”+f ” is a flipping operation, s.t. tional cost and its standard deviation for each algorithm is  shown as bar plots in Figure 5, and clearly, MEBF has the k ∨l=1Uil ∧ Vlj,Eij =0 best computational efficiency among all. Xij = k ¬∨ Uil ∧ Vlj,Eij =1 l=1 Real world data application Here, p0 controls the density levels of the true patterns, and We next focus on the application of MEBF on two real world E is introduced as noise that could flip the binary values, datasets, and its performance comparison with MP. The and the level of noise could be regulated by the parameter p. datasets are: Chicago Crime records2 (X ∈{0, 1}6787×752) We simulated two data scales, a small one, n = m = 100, and head and neck cancer single cell RNA sequencing and a large one n = m = 1000. For each data scale, data3 (X ∈{0, 1}344×5902). The computational cost of the number of patterns k, is set to 5, and we used two 2 density levels, where p0 =0.2, 0.4, and two noise levels Chicago crime records downloaded on August 20, 2019 from p =0, 0.01. 50 simulation was done for each data scale at https://data.cityofchicago.org/Public-Safety 3This head and neck sequencing data can be accessed at 1The code is available at https://github.com/clwan/MEBF https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103322

6090 ecmae EFadM for noisy MP and be MEBF to compared likely We more are 1’s. 0’s than as observations to matrix, error original reconstruction noisy lower the than be metric higher would 1’s, reasonable a the more regard, of a this recovery higher In meaning occurrence. rate, positive coverage false is with a compared 0 being occurrence, data, negative 1 false world a real be many to in likely possible more hand, avoids other and the easier On interpretation overfitting. levels, further density makes low it meaning sparsity, as high more have are to patterns desirable binary error, reconstruction same the With n oeaelvl.Dniymti a enda nthe in as density defined as is the defined rate coverage was metrics, and metric data, two simulation Density on levels. coverage focused and to we original reasonable the Instead, be to respect not matrix. with may error it reconstruction datasets, noise the two world examine high the real possible the of the from to in ground-truth and level dropped of matrices applied were lack construction they be rank a low so to to scale, Due inhibitive large comparisons. a the too such are of datasets PANDA and ASSO ASSO, MEBF, of cost. comparison computational on MP Performance and PANDA 5: Figure vrl,a hw nTbe1 o both for 1, Table 1. in time Table shown in and presented as patterns are Overall, algorithms derived two the the of of consumption rate coverage and density EFotefrsM naltremaue:hge cover- higher measures: three all in MP outperforms MEBF Time(s) Time(s) Density Reconstruct Error 0.12 0.16 0.20 0.240.0 0.2 0.4 0.6 0.8 1.0 o est / os Low Density W Error Low Density W/O Noise

Low Density W/O Noise Low Density W Error High High Low Density W Error Low Density W/O Noise 0 25 50 75 100 0.0 0.25 0.50 0.75 1.00 Low Density W/O Noise Low Density W Error High High Low Density W Error Low Density W/O Noise 246810 246810 246810 246810 246810 Number ofPatterns iue4 efrac oprsn fMB,AS,PNAadM nteacrc fdecomposition. of accuracy the on MP and PANDA ASSO, MEBF, of comparisons Performance 4: Figure oeaerate Coverage MP PANDA ASSO MEBF

0.12 0.16 0.20 0.24 0.2 0.4 0.6 0.8 1.0 1.2 Big SizeMatrix1000x1000 Small SizeMatrix100x100 246810 Number ofPatterns Small SizeMatrix100x100 := | ( est / ro Low Density W Error Density W/O Error X est / ro Low Density W Error Density W/O Error k 0.25 0.30 0.35 0.40 0.0 0.2 0.4 0.6 0.8 Low Density W Error 1.0 Density High W/O Error · ( =5 246810 A Number ofPatterns | X ∗ k ⊗ | and =5 B k ∗ )) =20 and |

0.25 0.30 0.35 0.40 0.2 0.4 0.6 0.8 1.0 246810 k n the and , Number ofPatterns =20 MP PAND ASSO MEBF A , 6091

0.12 0.16 0.20 0.240.0 0.2 0.4 0.6 0.8 1.0 o est / os Low Density W Error Low Density W/O Noise ple nec ftecntutdbnr arcswt pa- with matrices binary constructed rameters the of each on applied h oa aswt rm omte o regions for committed crime with days total the C h eosrce iaymti sacrigycalculated accordingly is as matrix binary reconstructed The o the for and and year each in qa ie.Frec fte1 er,abnr matrix binary a years, 19 the of each into For sizes. area equal of Chicago standard divided overall of general. We the in allocation regions reflect the of also situation for living could useful and different for is force, date patterns police and crime time on The patterns regions. crime Here 2001 find from month. to data 2019 each crime to Chicago in analyzing pattern in MEBF date apply strong we credit a like has debt regular theft which been repay crimes cards, to as has need example, such the It For for patterns. crimes US. committed date were of the strong have in majority robbery rates the and crime that highest understood it and the well Midwest, of US the one in city has populous most the is Chicago vnyaswr hw u osaelmt n6,from 6A, In limit. space to due shown were years even region h nig ntetodtst sn MEBF. using present datasets and two mining, the data on findings continuous the and mining data crete MP. of robustness in low 0.807 the to suggesting 0.812 data, from cov- crime drops the the unexpectedly patterns, MP of MP. of number rate of increased erage that the to time with 1% the noted, approximately and Also MP, is to MEBF levels of density consumption the half roughly rate, age 246810 246810 246810 246810 al :Cmaio fMB n Po elwrddata world real on MP and MEBF of Comparison 1: Table 246810 igecell Single igecell Single Crime Crime Number ofPatterns d iceedt mining data Discrete iue6sosteciepten vrtm,adonly and time, over patterns crime the shows 6 Figure etw ics ndti h plcto fBFo dis- on BMF of application the detail in discuss we Next j X X = d k=20 k=5 ∗ d  j i,j d = MP PANDA ASSO MEBF hya scntutd where constructed, is year th nyear in k=5 k=20 ( i m t =1 =1 A =0 n d X 0.10 0.15 0.20 0.2 0.4 0.6 0.8 1.0 .9/.0 .3/.6 10.608/992.011 2.913/333.601 0.030/0.066 0.019/0.027 0.496/0.463 0.891/0.807 0.835/0.812 Coverage 0.626/0.580 × d en htciewsrpre tdate at reported was crime that means k 246810 . i,j Number ofPatterns 7 d m ⊗ ,k , . Big SizeMatrix1000x1000 X ersnstettlnme fregions, of number total the represents B 20) = (MEBF/MP) d d k i,j × m =0 rm ne a endas defined was index crime A . n outputs and

0.20 0.25 0.30 0.35 0.40 0.450.0 0.2 0.4 0.6 0.8 Low Density W Error 1.0 Density High W/O Error Density 2 3 tews.MB a then was MEBF otherwise. 246810 . . ∼ Number ofPatterns 06 34 e-4/ e-4/ 800 (MEBF/MP) 2 7 . . 86 22 n ein froughly of regions - 1.846/137.623 e-4 - 5.954/390.217 e-4

A 0.20 0.25 0.30 0.35 stettldates total the is d n × 0.40 0.45 0.2 0.4 0.6 0.8 1.0 k Time/s 246810 Number ofPatterns and j nyear in (MEBF/MP) B X d k d n × i × m in d m , . A 0 100 200 300

B Crime Index Index Crime 0 100 200 300 C

2002 20042006 2008 2010 2012 2014 2016 2018

Figure 6: MEBF analysis of Chicago crime data over the years.

year 2002-2018, the crime index calculated from the recon- Original MEBF m C ∗ X ∗ structed matrix, namely, dj = i=1 di,j was shown on the y-axis for all the regions on x-axis, In 6A, points col- ored in red indicate those regions with crime index equal to total dates of the year, i.e., 365 or 366, meaning these re- gions are heavily plagued with crimes, such that there is no Wí61( day without crime committed. Points colored in green shows Wí61( vice versa, indicating those regions with no crimes commit- ted over the year. Points are otherwise colored in gray. 6B í   shows the crime index on the original matrix, and clearly, í í í     the green and red regions are distinctly separated, i.e. green í   í í í     on the bottom with low crime index, and red on the top with Wí61( Wí61( high crime index. This shows the consistency of the crime patterns between the reconstructed and original crime data, Figure 7: Visualization of single cell clustering before and and thus, validate the effectiveness of MEBF pattern mining. after MEBF. Notably, the dramatic decrease in crime index starting from 2008 as shown in Figure 6A and 6B correlates with the re- ported crime decrease in Chicago area since 2008. Figure 6C uous matrix while clustering. shows the crime trend over the years on the map of Chicago. We collected a single cell RNA sequencing (scRNAseq) data Clearly, from 2008 to now, there is a gradual increase in the (Puram et al. 2017), that measured more than 300 gene ex- 5000×300 green regions, and decrease in the red regions, indicating an pression features for over 5,000 single cells, i.e., X . ∗ ∗ overall good transformation for Chicago. This result also in- We first binarize original matrix X into X , s.t. Xij =1 ∗ dicate that more police force could be allocated in between where Xij > 0,and Xij =0otherwise. Then, applying red and green regions when available. MEBF on X∗ with parameters (t =0.6,k =5)outputs An×k Bk×m X∗∗ A ⊗ B X Continuous data denoising , . Let = and use be the inner X∗ X∗∗ X X  X∗∗ X Binary matrix factorization could also be helpful in contin- product of and , namely, use = . use uous matrix factorization, as the Boolean rank of a matrix represents a denoised version of X, by retaining only the en- could be much smaller than the matrix rank under linear al- tries in X with true non-zero gene expressions. And this is gebra. Recently, clustering of single cells using single-cell reconstructed from the hidden patterns extracted by MEBF. RNA sequencing data has gained extensive utilities in many As shown in Figure 7, clustering on the denoised expres- fields, wherein the biggest challenge is that the high dimen- sion matrix, Xuse, results in much tighter and well separated sional gene features, mostly noise features, makes the dis- clusters (right) than that on the original expression matrix tance measure of single cells in clustering algorithm highly (left), as visualized by t-SNE plots shown in Figure 7. t- unreliable. Here we aim to use MEBF to denoise the contin- SNE is an non-linear dimensional reduction approach for

6092 the visualization of high dimensional data (Maaten and Hin- Kocayusufoglu, F.; Hoang, M. X.; and Singh, A. K. 2018. ton 2008). It is worth noting that, in generating Figure 7, Summarizing network processes with network-constrained Boolean rank of 5 was chosen for the factorization, indi- boolean matrix factorization. In 2018 IEEE International cating that the heterogeneity among cell types with such a Conference on Data Mining (ICDM), 237–246. IEEE. high dimensional feature space could be well captured by Larsson, A. J.; Johnsson, P.; Hagemann-Jensen, M.; Hartma- matrices of Boolean rank equal to 5. Interestingly, we could nis, L.; Faridani, O. R.; Reinius, B.; Segerstolpe, A.;˚ Rivera, see a small portion of fibroblast cell (dark blue) lies much C. M.; Ren, B.; and Sandberg, R. 2019. Genomic encoding closer to cancer cells (red) than to the majority of the fibrob- of transcriptional burst kinetics. Nature 565(7738):251. last population, which could biologically indicate a strong Lucchese, C.; Orlando, S.; and Perego, R. 2010. Mining cancer-fibroblast interaction in cancer micro-environment. top-k patterns from binary datasets in presence of noise. In Unfortunately, such interaction is not easily visible in the Proceedings of the 2010 SIAM International Conference on clustering plot using original noisy matrix. Data Mining, 165–176. SIAM. Conclusion Lucchese, C.; Orlando, S.; and Perego, R. 2013. A uni- fying framework for mining approximate top-k binary pat- We sought to develop a fast and efficient algorithm for terns. IEEE Transactions on Knowledge and Data Engineer- boolean matrix factorization, and adopted a heuristic ap- ing 26(12):2900–2913. proach to locate submatrices that are dense in 1’s iteratively, Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using where each such submatrix corresponds to one binary pat- t-sne. Journal of machine learning research 9(Nov):2579– tern in BMF. The submatrix identification was inspired by 2605. binary matrix permutation theory and geometric segmenta- tion. Approximately, we permutate rows and columns of the Miettinen, P., and Vreeken, J. 2014. Mdl4bmf: Minimum input matrix so that the 1’s could be ”driven” to the upper description length for boolean matrix factorization. ACM triangular of the matrix as much as possible, and a dense Transactions on Knowledge Discovery from Data (TKDD) submatrix could be more easily geometrically located. Com- 8(4):18. pared with three state of the art methods, ASSO, PANDA Miettinen, P.; Mielikainen,¨ T.; Gionis, A.; Das, G.; and Man- and MP, MEBF achieved lower reconstruction error, better nila, H. 2008. The discrete basis problem. IEEE trans- density and much higher computational efficiency on simu- actions on knowledge and data engineering 20(10):1348– lation data of an array of situations. Additionally, we demon- 1362. strated the use of MEBF on discrete data pattern mining and Monson, S. D.; Pullman, N. J.; and Rees, R. 1995. A survey continuous data denoising, where in both case, MEBF gen- of clique and biclique coverings and factorizations of (0, 1)- erated consistent and robust findings. matrices. Bull. Inst. Combin. Appl 14:17–86. Oswald, M., and Reinelt, G. 2009. The simultaneous Acknowledgment consecutive ones problem. Theoretical Computer Science This work was supported by R01 award #1R01GM131399- 410(21-23):1986–1992. 01, NSF IIS (N0.1850360), Showalter Young Investigator Oswald, M. 2003. Weighted consecutive ones problems. Award from Indiana CTSI and Indiana University Grand Ph.D. Dissertation. Challenge Health Initiative. Puram, S. V.; Tirosh, I.; Parikh, A. S.; Patel, A. P.; Yizhak, K.; Gillespie, S.; Rodman, C.; Luo, C. L.; Mroz, E. A.; Em- References erick, K. S.; et al. 2017. Single-cell transcriptomic analy- sis of primary and metastatic tumor ecosystems in head and Araujo, M.; Ribeiro, P.; and Faloutsos, C. 2016. Fast- neck cancer. Cell 171(7):1611–1624. step: Scalable boolean matrix decomposition. In Pacific- Asia Conference on Knowledge Discovery and Data Mining, Ravanbakhsh, S.; Poczos,´ B.; and Greiner, R. 2016. Boolean 461–473. Springer. matrix factorization and noisy completion via message pass- ing. In ICML, 945–954. Balasubramaniam, T.; Nayak, R.; and Yuen, C. 2018. Peo- ple to people recommendation using coupled nonnegative Rukat, T.; Holmes, C. C.; Titsias, M. K.; and Yau, C. 2017. boolean matrix factorization. Bayesian boolean matrix factorisation. In Proceedings of the 34th International Conference on Machine Learning- Belohlavek, R.; Outrata, J.; and Trnecka, M. 2018. Toward Volume 70, 2969–2978. JMLR. org. quality assessment of boolean matrix factorizations. Infor- mation Sciences 459:71–85. Stockmeyer, L. J. 1975. The set basis problem is NP- complete. IBM Thomas J. Watson Research Division. Junttila, E., et al. 2011. Patterns in permuted binary matri- ces. Wall, M. E.; Rechtsteiner, A.; and Rocha, L. M. 2003. Sin- gular value decomposition and principal component analy- Karaev, S.; Miettinen, P.; and Vreeken, J. 2015. Getting sis. In A practical approach to microarray data analysis. to know the unknown unknowns: Destructive-noise resistant Springer. 91–109. boolean matrix factorization. In Proceedings of the 2015 SIAM International Conference on Data Mining, 325–333. SIAM.

6093