Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Infinite Plaid Models for Infinite Bi-Clustering

Katsuhiko Ishiguro, Issei Sato,* Masahiro Nakano, Akisato Kimura, Naonori Ueda
NTT Communication Science Laboratories, Kyoto, Japan
*The University of Tokyo, Tokyo, Japan

Abstract

We propose a probabilistic model for non-exhaustive and overlapping (NEO) bi-clustering. Our goal is to extract a few sub-matrices from a given data matrix, where the entries of each sub-matrix are characterized by a specific distribution or parameters. Existing NEO bi-clustering methods typically require the number of sub-matrices to be extracted, which is essentially difficult to fix a priori. In this paper, we extend the plaid model, known as one of the best NEO bi-clustering algorithms, to allow infinite bi-clustering: NEO bi-clustering without specifying the number of sub-matrices. Our model can formally represent infinitely many sub-matrices. We develop an MCMC inference without finite truncation, which potentially addresses all possible numbers of sub-matrices. Experiments quantitatively and qualitatively verify the usefulness of the proposed model. The results reveal that our model can offer more precise and in-depth analysis of sub-matrices.

Introduction

In this paper, we are interested in bi-clustering for matrix data analysis. Given the data, the goal of bi-clustering is to extract a few (possibly overlapping) sub-matrices¹ from the matrix such that the entries of each sub-matrix are characterized by a specific distribution or parameters. Bi-clustering is potentially applicable in many domains. Consider an mRNA expression data matrix whose rows are experimental conditions and whose columns are genes. Given such data, biologists are interested in detecting pairs of specific conditions × gene subsets that have different expression levels compared to other expression entries. From a user × product purchase record, we can extract sub-matrices of selected users × particular products that sell very well. Successful extraction of such sub-matrices is the basis for efficient ad-targeting.

More specifically, we study non-exhaustive and overlapping (NEO) bi-clustering. Here we distinguish between exhaustive and non-exhaustive bi-clustering. Exhaustive bi-clustering (e.g. (Erosheva, Fienberg, and Lafferty 2004; Kemp et al. 2006; Roy and Teh 2009; Nakano et al. 2014)) is an extension of typical clustering (such as k-means) to matrices: the entire matrix is partitioned into many rectangular blocks, and all matrix entries are assigned to one (or more) of these blocks (Fig. 1). However, for knowledge discovery from the matrix, we are only interested in the few sub-matrices that are essential or interpretable. For that purpose, exhaustive bi-clustering techniques require post-processing in which all sub-matrices are examined to identify "informative" ones, often by painful manual effort. By contrast, the goal of non-exhaustive bi-clustering is to extract only the most informative or significant sub-matrices, not to partition all matrix entries. Non-exhaustive bi-clustering is thus more similar to clique detection in network analysis: not all nodes of a network are extracted as clique members. Non-exhaustive bi-clustering can greatly reduce the man-hour costs of manual inspection and knowledge discovery, because it ignores non-informative matrix entries (Fig. 1).

Hereafter we use "bi-clustering" to mean "NEO bi-clustering". There has been plenty of bi-clustering research over the years, e.g. (Cheng and Church 2000; Lazzeroni and Owen 2002; Caldas and Kaski 2008; Shabalin et al. 2009; Fu and Banerjee 2009). However, one fatal problem has not yet been solved: determining the number of sub-matrices to be extracted. Most existing bi-clustering methods assume that the number of sub-matrices is fixed a priori. This brings two essential difficulties. First, finding the appropriate number of sub-matrices is very difficult. Recall that determining "k" for classical k-means clustering is not trivial in general; the same holds, perhaps more severely (Gu and Liu 2008), for bi-clustering. Second, sub-optimal choices of the number of sub-matrices inevitably degrade bi-clustering performance. For example, assume there are K = 3 sub-matrices incorporated in the given data matrix. Solving the bi-clustering assuming K = 2 (< 3) can never recover the true sub-matrices. If we solve with K = 5 (> 3), the model seeks two extra sub-matrices to fulfill the assumed K, e.g. by splitting a correct sub-matrix into multiple smaller sub-matrices.

There are few works on this problem. One such work (Ben-Dor et al. 2003) suffers computational complexity cubic in the number of columns; thus it is feasible only for matrices with very few columns. Gu and Liu (Gu and Liu 2008) adopted a model selection approach based on BIC. However, model selection inevitably requires multiple inference trials for different choices of model complexity (the number of sub-matrices, K), which consumes a lot of computation and time.

The main contribution of this paper is a probabilistic model that allows infinite bi-clustering: NEO bi-clustering without specifying the number of sub-matrices. The proposed model is based on the plaid models (Lazzeroni and Owen 2002), which are known to be among the best bi-clustering methods (Eren et al. 2013; Oghabian et al. 2014). The proposed Infinite Plaid models introduce a simple extension of the Indian Buffet Process (Griffiths and Ghahramani 2011) and can formulate bi-clustering patterns with infinitely many sub-matrices. We develop an MCMC inference that automatically infers the appropriate number of sub-matrices for the given data. The inference does not require the finite approximation of typical variational methods; thus it can potentially address all possible numbers of sub-matrices, unlike the existing variational inference method for a similar prior model (Ross et al. 2014). Experimental results show that the proposed model quantitatively outperforms the baseline finite bi-clustering method both on synthetic data and on real-world sparse datasets. We also qualitatively examine the extracted sub-matrices and confirm that the Infinite Plaid models offer in-depth sub-matrix analysis for several real-world datasets.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹By sub-matrix, we mean a direct product of a subset of row indices and a subset of column indices.

Background

[Figure 1 (schematic): Exhaustive bi-clustering vs. NEO bi-clustering. Exhaustive bi-clustering partitions the entire data matrix into many blocks, so post-processing must manually inspect all blocks. NEO bi-clustering extracts only "possibly interesting" sub-matrices sharing coherent characteristics, so post-processing handles only a limited number of sub-matrices, reducing man-hour cost. Our proposal, infinite bi-clustering, is NEO bi-clustering with any number of sub-matrices: the number of hidden sub-matrices is inferred automatically, without knowing the exact number of distinctive sub-matrices. Row and column indices need not be consecutive: we extract not consecutive rectangles but direct products of subsets of rows and columns.]

Baseline: (simplified) Bayesian Plaid model

Bi-clustering has been studied intensively for years. A seminal paper (Cheng and Church 2000) applied the bi-clustering technique to the analysis of gene-expression data. Many works have been developed since (e.g. (Shabalin et al. 2009; Fu and Banerjee 2009)). Among them, the Plaid model (Lazzeroni and Owen 2002) is recognized as one of the best bi-clustering methods in several review studies (Eren et al. 2013; Oghabian et al. 2014). Bayesian versions of the plaid model have also been proposed and reported effective (Gu and Liu 2008; Caldas and Kaski 2008).

We adopt a simplified version of the Bayesian Plaid models (Caldas and Kaski 2008) as the baseline. The observed data is a matrix of N_1 × N_2 continuous values, and K is the number of sub-matrices that model the interesting parts. We define the simplified Bayesian Plaid model as follows:

  λ_{1,k} ~ Beta(a^λ_1, b^λ_1),  λ_{2,k} ~ Beta(a^λ_2, b^λ_2),   (1)
  z_{1,i,k} ~ Bernoulli(λ_{1,k}),  z_{2,j,k} ~ Bernoulli(λ_{2,k}),   (2)
  θ_k ~ Normal(μ^θ, (τ^θ)^{-1}),  φ ~ Normal(μ^φ, (τ^φ)^{-1}),   (3)
  x_{i,j} ~ Normal(φ + Σ_k z_{1,i,k} z_{2,j,k} θ_k, (τ_0)^{-1}).   (4)

In the above equations, k ∈ {1,...,K} indexes sub-matrices, and i ∈ {1,...,N_1} and j ∈ {1,...,N_2} index objects in the first (row) and second (column) domains, respectively. λ_{1,k} and λ_{2,k} (Eq. (1)) are the probabilities of assigning an object to the kth sub-matrix in the first and second domains. z_{1,i,k} and z_{2,j,k} in Eq. (2) represent the sub-matrix (factor) memberships: if z_{1,i,k} = 1 (0), then the ith object of the first domain is (is not) a member of the kth sub-matrix, and similarly for z_{2,j,k}. θ_k and φ in Eq. (3) are the mean parameters of the kth sub-matrix and the "background" factor, respectively. Eq. (4) combines these to generate an observation. Note that this observation process is simplified from the original Plaid models; throughout the paper, however, we focus technically on the modeling of Z, so we employ this simplified model.

As stated, existing bi-clustering methods, including the Bayesian Plaid model, require us to fix the number of sub-matrices K beforehand. It is very possible to choose a sub-optimal K in practice, since it is innately difficult to find the best K by hand.
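To make the generative process of Eqs. (1)-(4) concrete, the following is a minimal NumPy sketch that draws one data matrix from the simplified Bayesian Plaid model. The hyperparameter values are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bayesian_plaid(N1, N2, K, a=1.0, b=1.0,
                          mu_theta=3.0, tau_theta=1.0,
                          mu_phi=0.0, tau_phi=1.0, tau0=10.0):
    """Draw one matrix from the simplified Bayesian Plaid model, Eqs. (1)-(4)."""
    lam1 = rng.beta(a, b, size=K)                 # Eq. (1): row activation probs
    lam2 = rng.beta(a, b, size=K)                 # Eq. (1): column activation probs
    Z1 = rng.random((N1, K)) < lam1               # Eq. (2): row memberships
    Z2 = rng.random((N2, K)) < lam2               # Eq. (2): column memberships
    theta = rng.normal(mu_theta, tau_theta ** -0.5, size=K)   # Eq. (3): factor means
    phi = rng.normal(mu_phi, tau_phi ** -0.5)                 # Eq. (3): background mean
    mean = phi + (Z1 * theta) @ Z2.T              # Eq. (4): plaid mean surface
    X = rng.normal(mean, tau0 ** -0.5)            # Eq. (4): observation noise
    return X, Z1, Z2, theta, phi
```

Each planted sub-matrix k raises the mean of the block Z1[:, k] × Z2[:, k] by θ_k above the background φ, which is exactly the structure the bi-clustering methods below try to recover.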

Indian Buffet Process (IBP)

Many researchers have studied sophisticated techniques for exhaustive bi-clustering. In particular, Bayesian Nonparametrics (BNP) has become a standard tool for handling an unknown number of sub-matrices in exhaustive bi-clustering (Kemp et al. 2006; Roy and Teh 2009; Nakano et al. 2014; Zhou et al. 2012; Ishiguro, Ueda, and Sawada 2012). Thus it is reasonable to consider a BNP approach for (NEO) bi-clustering.

For that purpose, we first consider the IBP (Griffiths and Ghahramani 2011), a BNP model for binary factor matrices. Assume the following Beta-Bernoulli process for N collections of K = ∞ feature factors F = {F_i}_{i=1,...,N}: B ~ BP(α, B_0), F_i ~ BerP(B). BP(α, B_0) is a Beta process with concentration parameter α and base measure B_0. B = Σ_{k=1}^∞ λ_k δ_{θ_k} is a collection of infinitely many pairs (λ_k, θ_k), with λ_k ∈ [0, 1] and θ_k ∈ Ω. F_i is sampled from the Bernoulli process (BerP) as F_i = Σ_{k=1}^∞ z_{i,k} δ_{θ_k} using binary variables z_{i,k} ∈ {0, 1}. The IBP is defined as a generative process for z_{i,k}:

  v_l ~ Beta(α, 1),  λ_k = Π_{l=1}^{k} v_l,  z_{i,k} ~ Bernoulli(λ_k).

Z = {z_{i,k}} acts like an infinite-K extension of Z_1 or Z_2 in Eq. (2). Assume we observe N objects, and an object i may involve K = ∞ clusters; then z_{i,k} serves as a membership indicator of object i for cluster k.

Proposed model

Unfortunately, a vanilla IBP cannot achieve infinite bi-clustering. Infinite bi-clustering requires associating a sub-matrix parameter θ_k with two λs, corresponding to the two binary variables z_{1,i,k} and z_{2,j,k} (Eqs. (1, 2)). However, the IBP ties the parameter θ_k to only one weight, as it is built on pairs (λ_k, θ_k) in BP. For example, (Miller, Griffiths, and Jordan 2009) proposed employing an IBP to factorize a square matrix. This model and its followers (Palla, Knowles, and Ghahramani 2012; Kim and Leskovec 2013), however, cannot perform bi-clustering on non-square matrix data, such as a user × item matrix. (Whang, Rai, and Dhillon 2013) employed a product of independent IBPs to over-decompose a matrix into sub-matrices, with associated binary variables over sub-matrices that choose "extract" or "not-extract". This is not yet optimal because the model explicitly represents the not-extracted sub-matrices, resulting in unnecessary model complication.

Infinite Plaid models for infinite bi-clustering

Now we propose the Infinite Plaid models, an infinite bi-clustering model for general non-square matrices with the plaid factor observation models:

  Z_1, Z_2 | α_1, α_2 ~ TIBP(α_1, α_2),   (5)
  θ_k ~ Normal(μ^θ, (τ^θ)^{-1}),  φ ~ Normal(μ^φ, (τ^φ)^{-1}),   (6)
  x_{i,j} ~ Normal(φ + Σ_k z_{1,i,k} z_{2,j,k} θ_k, (τ_0)^{-1}).   (7)

In Eq. (5), the Two-way IBP (TIBP), explained below, is introduced to generate two binary matrices for sub-matrix memberships: Z_1 = {z_{1,i,k}} ∈ {0,1}^{N_1×∞} and Z_2 = {z_{2,j,k}} ∈ {0,1}^{N_2×∞}. Different from the Bayesian Plaid models, the cardinality of sub-matrices may range to K = ∞; i.e. the Infinite Plaid models can formulate bi-cluster patterns with infinitely many sub-matrices. The remaining generative processes are the same as in the Bayesian Plaid models.

Let us now explain what TIBP is and why it remedies the problem of the IBP. TIBP assumes triplets consisting of two weights and an atom, instead of the IBP's pairs of a weight and an atom. Consider an infinite collection of triplets B = {λ_{1,k}, λ_{2,k}, θ_k}, where λ_{1,k}, λ_{2,k} ∈ [0, 1] are the probabilities of activating a sub-matrix (factor) k in the 1st and 2nd domains, respectively, and θ_k is a parameter drawn from the base measure B_0. Given B, we generate feature factors for two-domain indices (i, j): F(i, j) = Σ_{k=1}^∞ z_{1,i,k} z_{2,j,k} δ_{θ_k}, where the two binary variables z_{1,i,k} and z_{2,j,k} represent sub-matrix (factor) responses. More precisely, TIBP is defined as a generative process for such z_{1,i,k} and z_{2,j,k} that extends the IBP:

  v_{1,l} ~ Beta(α_1, 1),  λ_{1,k} = Π_{l=1}^{k} v_{1,l},  z_{1,i,k} ~ Bernoulli(λ_{1,k}),
  v_{2,l} ~ Beta(α_2, 1),  λ_{2,k} = Π_{l=1}^{k} v_{2,l},  z_{2,j,k} ~ Bernoulli(λ_{2,k}).

Repeating this for all (i, j), we obtain Z_1 = {z_{1,i,k}} ∈ {0,1}^{N_1×∞} and Z_2 = {z_{2,j,k}} ∈ {0,1}^{N_2×∞}. In shorthand we write Z_1, Z_2 | α_1, α_2 ~ TIBP(α_1, α_2). Because of the triplets, a parameter atom θ_k and the corresponding index k are associated between Z_1 and Z_2 for k ∈ {1, 2, ..., ∞}. Therefore the binary matrices generated from TIBP can serve as valid membership variables for infinite bi-clustering.

TIBP is a generic BNP prior related to the marked Beta process (Zhou et al. 2012), and it would be applicable to many other problems beyond infinite bi-clustering. For example, (Ross et al. 2014) employed a similar infinite bi-clustering prior for a mixture of Gaussian processes to learn clusters of disease trajectories. In this paper, we employ TIBP for a totally different problem: unsupervised infinite bi-clustering with plaid models.

Inference without truncations

We usually use variational inference (VI) or MCMC to estimate the unknown variables of probabilistic models. The aforementioned work by (Ross et al. 2014) employed VI, which utilizes a finite truncation of TIBP with a fixed K⁺ in its variational approximation, where K⁺ is the assumed "maximum" number of sub-matrices. In exchange, VI achieves an analytical approximation of the posterior distributions.

Instead, we developed an MCMC inference combining Gibbs samplers and Metropolis-Hastings (MH) samplers (c.f. (Meeds et al. 2007)), which approximate the true posteriors by sampling. Roughly speaking, the Gibbs samplers infer the individual parameters (θ, φ) and hidden variables (z), while the MH samplers test drastic changes in the number of sub-matrices, such as split-merge moves and proposals of new sub-matrices. It is noteworthy that our MCMC inference does not require finite truncation of TIBP, unlike VI. Thus it yields a posterior inference of the infinite bi-clustering that potentially addresses all possible numbers of sub-matrices, not truncated at K⁺. In addition, we can infer the hyperparameters to boost bi-clustering performance (c.f. (Hoff 2005; Meeds et al. 2007)). Please consult the supplemental material for inference details.
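For illustration, the TIBP stick-breaking construction can be simulated as follows. The truncation level K_trunc is used only to make this forward simulation finite; it is an illustrative device of this sketch, not part of the paper's truncation-free MCMC.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_tibp(N1, N2, alpha1=2.0, alpha2=2.0, K_trunc=50):
    """Draw (Z1, Z2) from the Two-way IBP via its stick-breaking construction.

    Simulation-only truncation at K_trunc columns; under the prior the
    activation probabilities lam_{d,k} decay to zero as k grows, so only
    finitely many sub-matrices are active in any finite sample.
    """
    v1 = rng.beta(alpha1, 1.0, size=K_trunc)
    v2 = rng.beta(alpha2, 1.0, size=K_trunc)
    lam1 = np.cumprod(v1)            # lam_{1,k} = prod_{l<=k} v_{1,l}
    lam2 = np.cumprod(v2)            # lam_{2,k} = prod_{l<=k} v_{2,l}
    Z1 = rng.random((N1, K_trunc)) < lam1
    Z2 = rng.random((N2, K_trunc)) < lam2
    # a sub-matrix k is realised only when both domains activate it somewhere;
    # the shared index k ties the same atom theta_k to Z1 and Z2
    active = Z1.any(axis=0) & Z2.any(axis=0)
    return Z1[:, active], Z2[:, active]
```

Because the two stick-breaking weights share the index k, each retained column pair (Z1[:, k], Z2[:, k]) defines one candidate sub-matrix, exactly the membership structure Eqs. (5)-(7) consume.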

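As a rough illustration of the Gibbs part of the sampler, the sketch below resamples the row memberships Z_1 from their full conditionals under the plaid likelihood of Eq. (7), for a finite set of currently instantiated sub-matrices. It omits the MH split-merge moves, new-sub-matrix proposals, and hyperparameter sampling described above; all names and settings here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep_z1(X, Z1, Z2, theta, phi, lam1, tau0=1.0):
    """One Gibbs sweep over row memberships Z1 (finite instantiated factors).

    Resamples each z_{1,i,k} from its conditional under
    x_ij ~ Normal(phi + sum_k z1[i,k] z2[j,k] theta[k], 1/tau0).
    """
    N1, K = Z1.shape
    for i in range(N1):
        for k in range(K):
            Z1[i, k] = 0
            base = phi + (Z1[i] * theta) @ Z2.T      # mean of row i without factor k
            resid = X[i] - base
            # log-likelihood gain of turning factor k on for row i
            delta = tau0 / 2 * np.sum(Z2[:, k] * (resid ** 2 - (resid - theta[k]) ** 2))
            logit = np.log(lam1[k] / (1 - lam1[k])) + delta
            p1 = 1.0 / (1.0 + np.exp(-np.clip(logit, -30.0, 30.0)))
            Z1[i, k] = rng.random() < p1
    return Z1
```

A symmetric sweep over Z_2 follows the same pattern with the roles of rows and columns exchanged; with a strong planted signal the conditional probabilities saturate and the sweep recovers the planted memberships.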
[Figure 2: Observations and typical bi-clustering results on synthetic datasets. Panels: (A) synth1 (true K=3), (B) synth2 (true K=3), (C) synth3, (D) synth4; each panel shows the observation with Bayesian Plaid and Infinite Plaid results. A colored rectangle indicates a sub-matrix k ({(i, j) s.t. z_{1,i,k} = z_{2,j,k} = 1}). Sub-matrix overlapping is allowed: for example, in Panel (B), three sub-matrices overlap each other at their corners. For better presentation we sort the rows and columns so that sub-matrices appear as rectangles; in the experiments, row and column indices are randomly permuted.]

Table 1: Average NMI values on synthetic data with different K and K^init. B.P. indicates the baseline Bayesian Plaid models, and Inf. P. indicates the proposed Infinite Plaid models. Bold faces (in the original) indicate statistical significance.

           K, K^init = K^true    K, K^init = 5       K, K^init = 10
           B. P.    Inf. P.      B. P.    Inf. P.    B. P.    Inf. P.
Synth 1    0.848    0.970        0.791    0.860      0.700    0.791
Synth 2    0.782    0.976        0.754    0.924      0.659    0.823
Synth 3    0.938    0.973        0.777    0.972      0.660    0.971
Synth 4    0.975    0.981        0.754    0.963      0.647    0.940

Table 2: Two additional criteria scores on synthetic and real data. B.P. indicates the baseline Bayesian Plaid models, and Inf. P. indicates the proposed Infinite Plaid models. Bold faces (in the original) indicate statistical significance.

             Distinctiveness      F-measure
             B. P.    Inf. P.     B. P.    Inf. P.
Synth 1      2.8      3.5         0.94     0.94
Synth 2      2.6      3.7         0.94     0.94
Synth 3      0.4      1.8         0.79     0.84
Synth 4      0.1      1.4         0.69     0.72
Enron Aug.   1.1      1.2         0.32     0.42
Enron Oct.   0.98     0.96        0.47     0.54
Enron Nov.   1.1      1.0         0.38     0.53
Enron Dec.   0.93     1.1         0.35     0.48
Lastfm       1.3      0.89        0.11     0.18

Experiments

Procedure

We prepared four small synthetic datasets (Fig. 2). All are N_1(=100) × N_2(=200) real-valued matrices, but they differ in the number of sub-matrices and the proportion of sub-matrix overlap. For real-world data, we prepared the following datasets. The Enron E-mail dataset is a collection of e-mail transactions in the Enron Corporation (Klimt and Yang 2004). We computed the number of monthly e-mail transactions sent/received between N_1 = N_2 = 151 employees in 2001, using the transactions of Aug., Oct., Nov., and Dec., when transactions were active. We also collected a larger Lastfm dataset, a co-occurrence count set of artist names and tag words; the matrix consists of N_1 = 10,425 artists and N_2 = 4,212 tag words. All real-world datasets are sparse: the densities of non-zero entries are at most 4% for Enron and 0.3% for Lastfm. For dataset details, please consult the supplemental materials.

For quantitative evaluation, we employed the Normalized Mutual Information (NMI) for overlapping clustering (Lancichinetti, Fortunato, and Kertesz 2009). NMI takes its maximum value of 1.0 if and only if the two clustering sets are completely the same, including the number of clusters. NMI yields a precise evaluation of bi-clustering, but it requires ground-truth sub-matrix assignments, which are unavailable in general. Thus we also consider two generic criteria that work without ground-truth information. First, we want sub-matrices whose θ_k differ remarkably from the background φ. For that purpose, we compute a distinctiveness: min_k |θ_k − φ|. Second, we expect the sub-matrices to include many informative entries and few non-informative entries. Fortunately, it is relatively easy to define (non-)informative entries without the ground truth for sparse datasets: the dominant "zero" entries might be non-informative, while non-zero entries are possibly informative. Let E denote the set of all entries extracted by the K sub-matrices and I the set of all non-zero entries. Then we compute the F-measure 2·Recall·Precision / (Recall + Precision), where Recall = #∩(E, I) / #I and Precision = #∩(E, I) / #E. Larger values of these criteria imply distinct and clean sub-matrices.

Our interest is in how well the Infinite Plaid models solve bi-clustering without knowing the true number of sub-matrices, K^true. Thus we set the same initial hyperparameter values for the baseline Bayesian Plaid models and the proposed Infinite Plaid models. Regarding the choice of K, the Bayesian Plaid models conduct inference with the fixed number of sub-matrices K, whereas the Infinite Plaid models are initialized with K^init sub-matrices and then adaptively infer an appropriate number of sub-matrices during inference. All other hyperparameters of the two models are inferred via MCMC.

Synthetic Data Experiments

Table 1 presents average NMIs on the synthetic datasets for several K and K^init. The proposed Infinite Plaid models achieved significantly better NMIs than the baseline Bayesian Plaid models. Moreover, the proposed model obtained NMI ≈ 1.0, i.e. perfect bi-clustering, in many cases regardless of the K^init value. This contrasts with the Bayesian Plaid models, whose NMIs fall sharply as K increases.
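Both ground-truth-free criteria from the Procedure, the distinctiveness min_k |θ_k − φ| and the F-measure between extracted entries E and non-zero entries I, are straightforward to compute. The following is a small NumPy sketch of both (an illustrative reimplementation, not the authors' evaluation code):

```python
import numpy as np

def distinctiveness(theta, phi):
    """min_k |theta_k - phi|: distance of the closest sub-matrix mean
    from the background mean."""
    return np.min(np.abs(np.asarray(theta) - phi))

def f_measure(Z1, Z2, X):
    """F-measure between extracted entries E (covered by any sub-matrix)
    and informative entries I (the non-zero entries of X)."""
    E = (Z1.astype(int) @ Z2.astype(int).T) > 0   # (i,j) with some z1*z2 = 1
    I = X != 0
    inter = np.sum(E & I)                          # #∩(E, I)
    if inter == 0:
        return 0.0
    recall = inter / np.sum(I)                     # #∩(E,I) / #I
    precision = inter / np.sum(E)                  # #∩(E,I) / #E
    return 2 * recall * precision / (recall + precision)
```

On sparse matrices, a high F-measure thus rewards sub-matrices that cover the non-zero entries while excluding the dominant zero background.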

This is what we expect: we designed the Infinite Plaid model to be capable of inferring the number of sub-matrices, whereas the baseline model and many existing bi-clustering methods cannot perform such inference.

We observe that the first two synthetic datasets (synth1, synth2) are more difficult than the latter two (synth3, synth4). This is reasonable, since all sub-matrices of the first two datasets were designed to have the same value of θ_k, while the latter two were designed with different θ_k values (the θ_k are illustrated in Fig. 2 as the color depth of the rectangles). We also observe that the Bayesian Plaid model does not necessarily obtain perfect recovery even when K = K^true. This may be explained by the fact that MCMC inference on BNP exhaustive bi-clustering models is easily trapped at local optima in practice (Albers et al. 2013). We expect the performance would improve with more MCMC iterations.

The upper half of Table 2 presents the two additional criteria (K, K^init = 10). The Infinite Plaid models are significantly better than the baseline in many cases for both the distinctiveness and the F-measure. Combining these with the NMI results, we can safely say that the proposed model is superior to fixed-K bi-clustering.

Fig. 2 shows examples of bi-clustering results (K, K^init = 10). The Bayesian Plaid models often extract noisy background factors or divide sub-matrices into multiple blocks unnecessarily. Those are exactly the outcomes we were afraid of, induced by the sub-optimal K. In contrast, the Infinite Plaid models achieved perfect bi-clusters, as expected from the NMI scores.

Real-world Datasets Experiments

For the real-world datasets, we rely on the distinctiveness and the F-measure, as there are no ground-truth sub-matrix assignments. Evaluations are presented in the lower half of Table 2 (because we do not know K^true, we present the best scores, in terms of the distinctiveness, among several choices of K and K^init). The Infinite Plaid models always outperform the baseline in terms of the F-measure. This effectively demonstrates the validity of our infinite bi-clustering model for sparse real-world datasets.

Enron dataset Next, we qualitatively examine the results of the Infinite Plaid models on the Enron datasets. Please refer to the supplemental material for the results on the Enron Aug. and Enron Nov. datasets.

[Figure 3: Results on Enron Oct. data. The figure shows all assignments Z with magnified views of three sub-matrices: k=2 ("COO"), k=8 ("VIPs with legal experts"), and k=10 ("VIPs without legal experts"); objects overlapping among sub-matrices are marked.]

Fig. 3 presents an example of sub-matrices from the Enron Oct. data. The k = 2nd sub-matrix (orange colored) highlights the event in which the COO sent many mails to Enron employees. This pattern was also found by existing exhaustive bi-clustering works (Ishiguro et al. 2010; Ishiguro, Ueda, and Sawada 2012), but our model reveals that a few additional employees behaved like the COO. We also find two VIP sub-matrices (k = 8, 10). We may distinguish these two by the presence of legal experts and the founder: the k = 8th sub-matrix (light-blue colored) includes these people, but the k = 10th sub-matrix (dark-blue colored) does not. A few people join both sub-matrices, so the two sub-matrices may be exchanging e-mails about a similar topic but from different viewpoints concerning legal issues. Interestingly, the members of these two sub-matrices are grouped into one sub-matrix in the August data (presented in the supplement).

Fig. 4 presents an example of sub-matrices from Enron Dec., the final days of the bankruptcy. As Enron went bankrupt, human resources mattered a lot. The k = 3rd sub-matrix (green colored) may capture this factor: its main sender is the Chief of Staff, and the receivers were CEOs, presidents, and directors of several Enron group companies. In this month, we also found an interesting pair of sub-matrices: the k = 1st and the k = 6th (red and sky-blue colored). In the k = 1st sub-matrix, the President of Enron Online sent many mails to VIPs. We do not know the content of these e-mails, but they may be of significant interest, as they were sent by the president of one of the most valuable Enron group companies. Interestingly, in the k = 6th sub-matrix the president is the sole receiver of e-mails, and the number of senders (the 1st-domain membership) is smaller than the number of receivers (the 2nd-domain membership) of the k = 1st sub-matrix. This may imply that the k = 6th sub-matrix captured the responses to the e-mails sent in the k = 1st sub-matrix.


[Figure 4: Results on Enron Dec. data. The figure shows all assignments Z with magnified views of sub-matrices k=3 ("Chief of Staff and VIPs"), k=1 ("E-mail from President, Enron Online"), and k=6 ("E-mail to President, Enron Online"); objects overlapping among sub-matrices are marked.]

[Figure 5: Results on Lastfm data. The figure shows magnified views of sub-matrices k=2 ("female vocals"), k=3 ("black metals"), k=4 ("older HR/HM and rocks"), and k=8 ("alternative, punks").]

Lastfm dataset Fig. 5 shows an example of the results of applying the Infinite Plaid models to the Lastfm dataset. The k = 3rd sub-matrix consists of black metal bands (1st-domain objects). This sub-matrix selects only one tag (2nd-domain object), "black metal", which might be appropriate for these bands. The k = 8th sub-matrix is a sub-matrix of relatively pop music, with tags such as "alternative rock", "pop punk", and "grunge". Indeed, it contains alternative rock bands such as "Coldplay" and popular punk bands such as "Green Day", while "Nirvana" is one of the best grunge bands. We also found a sub-matrix of older rock artists at k = 4; the classical artists included are "Beatles", "John Lennon", "Deep Purple" (hard rock), "Judas Priest" (metal), and others.

Finally, we present slightly strange but somewhat reasonable assignments. The k = 2nd sub-matrix consists of various top female artists. The chosen tags are "electronic", "pop", "dance", "female vocalist", and "rnb" (R&B). Among the chosen artists, however, there are two male artists: "Michael Jackson" and "Justin Timberlake". These choices have somewhat reasonable aspects: both are very famous pop and dance music grandmasters, and both are characterized by high-tone voices like some female singers.

Such deep analyses of sub-matrices would cost much with existing exhaustive bi-clustering methods, which require manual checking of all partition blocks, or with existing NEO bi-clustering methods, which require model selection. The Infinite Plaid models reduced the cost of knowledge discovery by automatically extracting remarkably different sub-matrices.

Conclusion and Future Works

We presented an infinite bi-clustering model that solves NEO bi-clustering without knowing the number of sub-matrices. We proposed the Infinite Plaid models, which extend the well-known Plaid models to represent infinitely many sub-matrices via a BNP prior. Our MCMC inference allows the model to infer an appropriate number of sub-matrices to describe the given matrix data. Our inference does not require the finite truncation required by the previous method (Ross et al. 2014); thus the inference algorithm correctly addresses all possible patterns of the infinite bi-clusters. We experimentally confirmed that the proposed model quantitatively outperforms the baseline method, which fixes the number of sub-matrices a priori. Qualitative evaluations showed that the proposed model allows us to conduct more in-depth bi-clustering analysis of real-world datasets.

Plaid models have been intensively employed in gene-expression data analysis, and we are also interested in purchase-log bi-clustering and social data analyses, as discussed in the introduction. In addition, infinite bi-clustering can be used for multimedia data such as images and audio. For example, it is well known that nonnegative matrix factorization can extract meaningful factors from image matrices (Cichocki et al. 2009), such as eyes, mouths, and ears from human face images. Infinite bi-clustering may improve the discovery of meaningful parts by focusing only on distinctive sub-images. Some audio signal processing researchers are interested in audio signals in time-frequency domain representations, which are real- or complex-valued matrix data. We can apply infinite bi-clustering to those matrices to capture time-dependent acoustic harmonic patterns of musical instruments, which are of interest in music information analysis ((Sawada et al. 2013; Kameoka et al. 2009)).

Finally, we list a few open questions. First, explicitly capturing relationships between sub-matrices would be of interest for further deep bi-clustering analysis. Second, it is important to verify the limitations of the Infinite Plaid models with respect to the number of hidden sub-matrices, noise tolerance, and overlaps. Third, computational scalability matters in today's big-data environment; we are considering stochastic inferences (e.g. (Hernandez-Lobato, Houlsby, and Ghahramani 2014; Wang and Blei 2012)) for larger matrices.

A supplemental material and information on a MATLAB demo program package can be found at: http://www.kecl.ntt.co.jp/as/members/ishiguro/index.html

Part of these research results was achieved by "Research and Development on Fundamental and Utilization Technologies for Social Big Data", the Commissioned Research of the National Institute of Information and Communications Technology (NICT), Japan.

References

Albers, K. J.; Moth, A. L. A.; Mørup, M.; and Schmidt, M. N. 2013. Large Scale Inference in the Infinite Relational Model: Gibbs Sampling is not Enough. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

Ben-Dor, A.; Chor, B.; Karp, R.; and Yakhini, Z. 2003. Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology 10(3-4):373–384.

Caldas, J., and Kaski, S. 2008. Bayesian Biclustering with the Plaid Model. In Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

Cheng, Y., and Church, G. M. 2000. Biclustering of Expression Data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), 93–103.

Cichocki, A.; Zdunek, R.; Phan, A. H.; and Amari, S.-i. 2009. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley.

Eren, K.; Deveci, M.; Küçüktunç, O.; and Çatalyürek, Ü. V. 2013. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics 14(3):279–292.

Erosheva, E.; Fienberg, S.; and Lafferty, J. 2004. Mixed-membership Models of Scientific Publications. Proceedings of the National Academy of Sciences of the United States of America (PNAS) 101(Suppl 1):5220–5227.

Fu, Q., and Banerjee, A. 2009. Bayesian Overlapping Subspace Clustering. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM), 776–781.

Griffiths, T. L., and Ghahramani, Z. 2011. The Indian Buffet Process: An Introduction and Review. Journal of Machine Learning Research 12:1185–1224.

Gu, J., and Liu, J. S. 2008. Bayesian biclustering of gene expression data. BMC Genomics 9 Suppl 1:S4.

Hernandez-Lobato, J. M.; Houlsby, N.; and Ghahramani, Z. 2014. Stochastic Inference for Scalable Probabilistic Modeling of Binary Matrices. In Proceedings of the 31st International Conference on Machine Learning (ICML), volume 32.

Hoff, P. D. 2005. Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics 61(4):1027–1036.

Ishiguro, K.; Iwata, T.; Ueda, N.; and Tenenbaum, J. 2010. Dynamic Infinite Relational Model for Time-varying Relational Data Analysis. In Lafferty, J.; Williams, C. K. I.; Shawe-Taylor, J.; Zemel, R. S.; and Culotta, A., eds., Advances in Neural Information Processing Systems 23 (Proceedings of NIPS).

Ishiguro, K.; Ueda, N.; and Sawada, H. 2012. Subset Infinite Relational Models. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 547–555.

Kameoka, H.; Ono, N.; Kashino, K.; and Sagayama, S. 2009. Complex NMF: A new sparse representation for acoustic signals. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2353–2356.

Kemp, C.; Tenenbaum, J. B.; Griffiths, T. L.; Yamada, T.; and Ueda, N. 2006. Learning Systems of Concepts with an Infinite Relational Model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI).

Kim, M., and Leskovec, J. 2013. Nonparametric Multi-group Membership Model for Dynamic Networks. In Advances in Neural Information Processing Systems 26 (Proceedings of NIPS), 1–9.

Klimt, B., and Yang, Y. 2004. The Enron Corpus: A New Dataset for Email Classification Research. In Proceedings of the European Conference on Machine Learning (ECML).

Lancichinetti, A.; Fortunato, S.; and Kertesz, J. 2009. Detecting the overlapping and hierarchical community structure of complex networks. New Journal of Physics 11(3).

Lazzeroni, L., and Owen, A. 2002. Plaid Models for Gene Expression Data. Statistica Sinica 12:61–86.

Meeds, E. W.; Ghahramani, Z.; Neal, R. M.; and Roweis, S. 2007. Modeling Dyadic Data with Binary Latent Factors. In Advances in Neural Information Processing Systems 19 (NIPS), 977–984.

Miller, K. T.; Griffiths, T. L.; and Jordan, M. I. 2009. Nonparametric Latent Feature Models for Link Prediction. In Bengio, Y.; Schuurmans, D.; Lafferty, J.; Williams, C. K. I.; and Culotta, A., eds., Advances in Neural Information Processing Systems 22 (Proceedings of NIPS).

Nakano, M.; Ishiguro, K.; Kimura, A.; Yamada, T.; and Ueda, N. 2014. Rectangular Tiling Process. In Proceedings

1707 of the 31st International Conference on Machine Learning (ICML), volume 32. Oghabian, A.; Kilpinen, S.; Hautaniemi, S.; and Czeizler, E. 2014. Biclustering methods: biological relevance and appli- cation in gene expression analysis. PloS one 9(3):e90801. Palla, K.; Knowles, D. A.; and Ghahramani, Z. 2012. An Infinite Latent Attribute Model for Network Data. In Pro- ceedings of the 29th International Conference on Machine Learning (ICML). Ross, J. C.; Castaldi, P. J.; Cho, M. H.; and Dy, J. G. 2014. Dual Beta Process Priors for Latent Cluster Discovery in Chronic Obstructive Pulmonary Disease. In Proceedings of the 20th ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining (SIGKDD), 155–162. Roy, D. M., and Teh, Y. W. 2009. The Mondrian Process. In Advances in Neural Information Processing Systems 21 (Proceedings of NIPS). Sawada, H.; Kameoka, H.; Araki, S.; and Ueda, N. 2013. Multichannel extensions of non-negative matrix factoriza- tion with complex-valued data. IEEE Transactions on Au- dio, Speech and Language Processing 21(5):971–982. Shabalin, A. A.; Weigman, V. J.; Perou, C. M.; and Nobel, A. B. 2009. Finding large average submatrices in high di- mensional data. The Annals of Applied Statistics 3(3):985– 1012. Wang, C., and Blei, D. M. 2012. Truncation-free Stochastic Variational Inference for Bayesian Nonparametric Models. In Advances in Neural Information Processing Systems 25 (Proceedings of NIPS). Whang, J. J.; Rai, P.; and Dhillon, I. S. 2013. Stochas- tic blockmodel with cluster overlap, relevance selection, and similarity-based smoothing. In Proceedings of the IEEE In- ternational Conference on Data Mining (ICDM), 817–826. Zhou, M.; Hannah, L. A.; Dunson, D. B.; and Carin, L. 2012. Beta-Negative Binomial Process and Poisson Factor Analy- sis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS).

1708