A Convex Optimization Framework for Bi-Clustering

Shiau Hong Lim [email protected] National University of Singapore, 9 Engineering Drive 1, Singapore 117575

Yudong Chen [email protected] University of California, Berkeley, CA 94720, USA

Huan Xu [email protected] National University of Singapore, 9 Engineering Drive 1, Singapore 117575

Abstract

We present a framework for biclustering and clustering where the observations are general labels. Our approach is based on the maximum likelihood estimator and its convex relaxation, and generalizes recent works in graph clustering to the biclustering setting. In addition to the standard biclustering setting, where one seeks to discover clustering structure simultaneously in two domain sets, we show that the same algorithm can be as effective when clustering structure only occurs in one domain. This allows for an alternative approach to clustering that is more natural in some scenarios. We present theoretical results that provide sufficient conditions for the recovery of the true underlying clusters under a generalized stochastic block model. These are further validated by our empirical results on both synthetic and real data.

1. Introduction

In a regular clustering task, we look for clustering structure within a set F through observing the pairwise interactions between elements in this set. In biclustering, we instead have two sets F and G, and the observed pairwise interactions are between object pairs (i, j) with i ∈ F and j ∈ G. The aim is to discover clustering structure within F, G or both through these observations. For example, in a recommender system, F consists of customers and G a set of products. In DNA microarray analysis, F could be biological samples and G a set of genes. The standard clustering task can be viewed as a special case of biclustering with F = G.

While many biclustering algorithms have been proposed and applied to a variety of problems, most are without formal guarantees in terms of the actual clustering performance. A typical approach begins with an objective function that measures the quality or cost of a candidate clustering, and then searches for one that optimizes the objective. Many interesting objective functions have intractable computational complexity, and the focus of many past works has been on finding efficient approximate solutions to these problems.

We propose a tractable biclustering algorithm based on convex optimization. The algorithm can be viewed as a convex relaxation to the computationally intensive problem of finding a maximum likelihood solution under a generalized stochastic block model. The stochastic block model (Holland et al., 1983; Rohe et al., 2011) has been widely used in graph clustering. In this model, it is assumed that a true but unknown underlying clustering exists in both F and G. A probabilistic generative model is defined for the observations, and the performance of the clustering algorithm is evaluated in terms of the ability to recover the true underlying clusters.

Our main contribution is in extending the current advances in standard graph clustering to the biclustering setting. We provide the conditions under which our biclustering algorithm can recover the true clusters with high probability. Our theoretical results are consistent with existing results in graph clustering, which have been shown to be optimal in some cases.

One novel aspect of our result is in providing new insight on the case where clustering structure only occurs in one domain set, say F but not necessarily in G. We show that under reasonable assumptions, the same algorithm can be used to recover the clusters in F regardless of the clustering structure in G.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).
This provides an alternative approach to standard graph clustering, where instead of relying on pairwise interactions within F, we cluster objects in F through their interactions with elements of G.

We employ the observation model proposed in (Lim et al., 2014), where each observation is a label Λij ∈ L. The label set L can be very general. In standard graph clustering, L would consist of two labels "edge" and "no-edge". An additional "unknown" label may be included for partially observed graphs. We refer the reader to (Lim et al., 2014) for examples of other, more complex label sets, which include observations from time-varying graphs.

The paper is organized as follows. After discussing related works in the next section, we present the formal setup of our approach in Section 3. The main algorithm and its theoretical results are presented in Section 4. An implementation of the main algorithm is provided in Section 5. Empirical results on both synthetic and real-world data are presented in Section 6. The proofs of the theoretical results are given in the supplementary materials.

2. Related Work

A comparative study of many biclustering algorithms in the domain of analyzing gene expression data is provided in (Eren et al., 2012), and a comprehensive survey can be found in (Tanay et al., 2005). Beginning with the work of Hartigan (1972) and Cheng & Church (2000), many approaches to biclustering are based on optimizing certain combinatorial objective functions, which are typically NP-hard, and heuristic and approximate algorithms have been developed but with no formal performance guarantees. One notable exception that is related to our setting with labels is the work of Wulff et al. (2013). They proposed a very intuitive monochromatic cost function, proved its NP-hardness and developed a polynomial-time approximate algorithm. Another related approach is correlation clustering (Bansal et al., 2004), which was originally developed for clustering but can be extended to biclustering; results on computational complexity and approximate algorithms can be found in, e.g., (Demaine et al., 2005; Swamy, 2004; Puleo & Milenkovic, 2014). A popular approach to clustering and biclustering is spectral clustering and its variants (Chaudhuri et al., 2012; Rohe et al., 2011; Anandkumar et al., 2013; Kannan et al., 2000; Shamir & Tishby, 2011; Lelarge et al., 2013; McSherry, 2001).

Here we focus on average-case performance under a probabilistic generative model for generalized graphs with labels, and our algorithms are inspired by recent convex optimization approaches to graph clustering (Mathieu & Schudy, 2010; Ames & Vavasis, 2011; Lim et al., 2014; Chen et al., 2012; 2014; Cai & Li, 2014; Vinayak et al., 2014). For biclustering, the work by Ames (2013) and Kolar et al. (2011) considers the setting with a "block-diagonal" structure (in the matrix B to be defined below). The recent work by Xu et al. (2014) studies a more general setting with "block-constant" structure. Both these settings are special cases of ours with 3 labels (1, −1 and "unobserved") and with clusters in both the rows and columns.

3. Problem Setup

A bicluster is defined as a cluster-pair (C, D) with C ⊆ F and D ⊆ G. We say that (i, j) is a member-pair of (C, D) if i ∈ C and j ∈ D. The key property that is shared by member-pairs of the same bicluster is the label distribution. Ideally, if two pairs (i, j) and (i′, j′) belong to the same bicluster, then their respective labels Λij and Λi′j′ should have similar distributions.

Let n1 = |F|, n2 = |G| and n = max{n1, n2}. Without loss of generality, we use i = 1 . . . n1 to denote members of F and j = 1 . . . n2 to denote members of G.

We assume an underlying clustering in F such that there exists a partition of F into r1 disjoint subsets {Cp : p = 1 . . . r1}. Similarly, G is partitioned into {Dq : q = 1 . . . r2}. These partitionings result in a total of r1 × r2 biclusters (Cp, Dq). Let Kp = |Cp| and Lq = |Dq| be the respective cluster sizes of Cp and Dq, with K = minp Kp and L = minq Lq.

We group all the biclusters into two classes. This is specified using an r1 × r2 matrix B whose entries Bpq ∈ {b0, b1}, where b0 and b1 (b0 < b1) are two arbitrarily defined real numbers, identifying the class of each bicluster. Member-pairs of each bicluster share the same class. This is specified using an n1 × n2 matrix Y*, where Y*ij = Bpq if i ∈ Cp and j ∈ Dq.

Associated with each class is a set of label-generating distributions. We assume that each observed label Λij is generated independently from some distribution µij. If Y*ij = b1 then µij ∈ M, otherwise if Y*ij = b0 then µij ∈ N. In the simplest case, there are only two label distributions µ and ν such that M = {µ} and N = {ν}. In this case Λij ∼ µ if Y*ij = b1 and Λij ∼ ν if Y*ij = b0.

Let UC be an n1 × r1 matrix denoting the membership of each cluster Cp, where (UC)ip = 1/√Kp if i ∈ Cp, and (UC)ip = 0 otherwise. Note that each row of UC contains only one non-zero entry. Similarly, VD is an n2 × r2 membership matrix for the clusters Dq. Let K = diag(√K1, . . . , √Kr1) be an r1 × r1 diagonal matrix, and similarly let L = diag(√L1, . . . , √Lr2). Y* can therefore be related to B by

$$Y^* = U_C\, K\, B\, L\, V_D^\top.$$
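To make the notation concrete, the following is a minimal numpy sketch (ours, not from the paper; the cluster sizes and class matrix below are arbitrary illustrative choices) that builds the membership matrices and checks the factorization above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative cluster sizes (assumed): r1 = 3 row clusters, r2 = 2 column clusters.
K_sizes = [4, 3, 5]          # K_p = |C_p|
L_sizes = [6, 4]             # L_q = |D_q|
n1, n2 = sum(K_sizes), sum(L_sizes)
b0, b1 = -1.0, 1.0

# Class matrix B with entries in {b0, b1}, one entry per bicluster (C_p, D_q).
B = rng.choice([b0, b1], size=(len(K_sizes), len(L_sizes)))

# Cluster assignment of each row/column element.
row_cl = np.repeat(np.arange(len(K_sizes)), K_sizes)   # i -> p with i in C_p
col_cl = np.repeat(np.arange(len(L_sizes)), L_sizes)   # j -> q with j in D_q

# Normalized membership matrices: (U_C)_{ip} = 1/sqrt(K_p) iff i in C_p, else 0.
U_C = np.zeros((n1, len(K_sizes)))
U_C[np.arange(n1), row_cl] = 1.0 / np.sqrt(np.array(K_sizes)[row_cl])
V_D = np.zeros((n2, len(L_sizes)))
V_D[np.arange(n2), col_cl] = 1.0 / np.sqrt(np.array(L_sizes)[col_cl])

Kmat = np.diag(np.sqrt(K_sizes))   # diag(sqrt(K_1) ... sqrt(K_r1))
Lmat = np.diag(np.sqrt(L_sizes))   # diag(sqrt(L_1) ... sqrt(L_r2))

# Y*_{ij} = B_{pq} for i in C_p, j in D_q; identical to the factored form above.
Y_star = B[np.ix_(row_cl, col_cl)]
assert np.allclose(Y_star, U_C @ Kmat @ B @ Lmat @ V_D.T)
```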

Let the reduced SVD of KBL be $U_B S_B V_B^\top$. It is easy to see that a valid SVD of Y* is given by $U = U_C U_B$, $V = V_D V_B$ and $S = S_B$, such that $Y^* = U S V^\top$. The difficulty of the biclustering task, especially when B is non-square, depends on the r1 × r2 matrix $KBL = U_B S_B V_B^\top$. To capture this dependence, we introduce the notion of coherence, defined as

$$u_1 = \max_{p\in\{1\ldots r_1\}} \|U_B U_B^\top e_p\|^2 = \max_{p\in\{1\ldots r_1\}} \|U_B^p\|^2, \qquad u_2 = \max_{q\in\{1\ldots r_2\}} \|V_B V_B^\top e_q\|^2 = \max_{q\in\{1\ldots r_2\}} \|V_B^q\|^2,$$

where $e_i$ is the i-th standard basis vector and we define $U_B^p = U_B^\top e_p$ and similarly $V_B^q = V_B^\top e_q$. Since both $U_B$ and $V_B$ have orthonormal columns, it is straightforward to see that $1/r_1 \le u_1 \le 1$ and $1/r_2 \le u_2 \le 1$. Our definition of coherence is based on a similar notion in the literature on low-rank matrix estimation/approximation. In our biclustering setting, it characterizes how easy it is to infer the structure in the columns from that in the rows, and vice versa.

4. Algorithm and Main Results

The biclustering task is to find Y* given the observations Λ. The algorithm consists of two steps. First, a weight function w : L → R is chosen to construct an n1 × n2 weight matrix W, where Wij = w(Λij). The second step involves solving the following convex optimization problem for the n1 × n2 matrix Y:

$$\max_{Y} \;\; \langle W, Y \rangle - \lambda \|Y\|_* \qquad\qquad (1)$$
$$\text{s.t.} \;\; b_0 \le Y_{ij} \le b_1, \;\; \forall i, j$$

where $\lambda = c\sqrt{n \log n}$ for some sufficiently large constant c and $\|Y\|_*$ denotes the nuclear norm of Y.

Our main theorem provides the sufficient condition for the exact recovery of the true clustering matrix Y* using program (1) with high probability. The condition depends on the following population quantities:

$$E_1 := \min_{\mu\in\mathcal{M}} \mathbb{E}_{\mu} w, \qquad E_0 := \min_{\mu\in\mathcal{N}} \mathbb{E}_{\mu}[-w], \qquad V := \max_{\mu\in\mathcal{M}\cup\mathcal{N}} \mathrm{Var}_{\mu} w,$$

where $\mathbb{E}_\mu$ and $\mathrm{Var}_\mu$ denote the expectation and variance under the distribution µ. With this notation, we have the following:

Theorem 1. Suppose that B is full rank and |w(l)| ≤ b, ∀l ∈ L. There exists a universal constant c such that if

$$\min\{E_1, E_0\} > c\left(b\beta\log n + \sqrt{\beta V n\log n}\right)\max\left\{\frac{u_1}{K}, \frac{u_2}{L}\right\}$$

then with probability at least $1 - n^{-\beta}$ the solution of (1) is unique and equals Y*.

Note that the probability in Theorem 1 is with respect to the randomness of the observed labels, given a fixed B and Y*. Theorem 1 is general in the sense that it applies to any bounded weight function w and any choice of b0 < b1.¹ The clustering structure in F and G is reflected by the dependence on u1, u2, K and L. The proof of Theorem 1 (given in the supplementary material) follows the idea of the proof of Theorem 1 in (Lim et al., 2014), but with extra consideration for the above structural properties. For the case of standard clustering, where F = G, both B and Y* are symmetric and we have u1 = u2 and K = L. In this case we recover almost all the results in (Lim et al., 2014).

¹For all practical purposes, the choice of either (b0, b1) = (0, 1) or (b0, b1) = (−1, 1) should suffice.

It is important to note that any two clusters in F or G can only be distinguished if the corresponding rows or columns in B are unique. It is, however, unclear whether the full-rank requirement for B is strictly necessary for the purpose of recovering Y*.

One-sided clusters: For the case of biclustering, both K and L play a similar role in Theorem 1. If one uses the naive bounds u1 ≤ 1 and u2 ≤ 1, then it suggests that sufficiently large clusters in both F and G are necessary for (1) to be successful. This renders the result particularly weak when, for example, L is small relative to K. Yet, the matrix B has the same low rank as long as K is large, regardless of L. In this case, one should still be able to recover the clusters in F.

The following result shows that this is indeed the case, assuming that B has rather uniformly spread out ±1 entries:

Theorem 2. Suppose that $r_1 \le r_2$ and $r_1 \ge \frac{\beta \log r_1}{c}$ for some universal constant c. Let $\phi = \frac{K}{\max_p K_p}$ and $\psi = \frac{L}{\max_q L_q}$, and let b be such that |w(l)| ≤ b, ∀l ∈ L. Suppose that Bpq for each (p, q) is an independent ±1 random variable with Pr(Bpq = +1) = Pr(Bpq = −1) = 1/2. There exists a universal constant c′ such that if

$$\min\{E_1, E_0\} > \frac{c'}{\phi\psi}\sqrt{\frac{n_1}{n_2}}\cdot\frac{b\beta\log n + \sqrt{\beta V n\log n}}{K}$$

then the solution of (1) is unique and equals Y* with probability at least $1 - n^{-\beta}$.

Note that the probability in Theorem 2 is with respect to both the randomness in B as well as the observed labels, holding the clusterings in F and G fixed. The proof (again, given in the supplementary material) is via bounding max{u1/K, u2/L} by applying a bound on the singular values of B by Rudelson & Vershynin (2009).

Theorem 2 implies that the success of (1) is essentially independent of L when we are only interested in clustering F. In particular, we can think of each row in the observed matrix as the feature representation (with G the feature set) of the corresponding element in F. As long as elements that belong to the same cluster in F have the same feature relationships, we can perform clustering through these features regardless of whether the features themselves show a clustering structure. This is illustrated in Figure 1. The example on the left shows clustering structure only in the rows (each column is unique) while the example on the right shows clustering structure in both the rows and the columns. According to Theorem 2, program (1) should be equally effective in both cases, with the same order-wise dependency on K.

[Figure 1. Two clustering/biclustering structures.]

It is interesting to note that in the unbalanced case where n1 is fixed, the condition of success improves as n2 grows, even when the cluster sizes K and L remain the same. This situation cannot occur in standard clustering, where the problem always gets harder as n grows if K stays the same. This phenomenon is not discussed in the previous work (Kolar et al., 2011; Ames, 2013) on biclustering.
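For a given instance, the coherence parameters u1, u2 and the population quantities E1, E0, V entering Theorems 1 and 2 can be computed directly. The following is a small sketch of such a computation (our illustration, not from the paper; it assumes the two-distribution case M = {µ}, N = {ν} over a finite label set, and uses the log-likelihood-ratio weight discussed in Section 4.1 below).

```python
import numpy as np

def coherence(B, K_sizes, L_sizes):
    """u1, u2 from the reduced SVD of K B L, following the definitions above."""
    Kmat = np.diag(np.sqrt(K_sizes))
    Lmat = np.diag(np.sqrt(L_sizes))
    U_B, S_B, V_Bt = np.linalg.svd(Kmat @ B @ Lmat, full_matrices=False)
    r = int(np.sum(S_B > 1e-10))               # keep only the nonzero part
    U_B, V_B = U_B[:, :r], V_Bt[:r, :].T
    u1 = np.max(np.sum(U_B**2, axis=1))         # max_p ||U_B^T e_p||^2
    u2 = np.max(np.sum(V_B**2, axis=1))         # max_q ||V_B^T e_q||^2
    return u1, u2

def population_quantities(mu, nu, w):
    """E1, E0, V for the singleton case M = {mu}, N = {nu}; for multiple
    distributions one would take the min/max over the sets, as defined above."""
    labels = list(w)
    E1 = sum(mu[l] * w[l] for l in labels)        # E_mu[w]
    E0 = sum(nu[l] * (-w[l]) for l in labels)     # E_nu[-w]
    def var(p):
        m = sum(p[l] * w[l] for l in labels)
        return sum(p[l] * (w[l] - m)**2 for l in labels)
    V = max(var(mu), var(nu))
    return E1, E0, V

# Two-label example with illustrative numbers and the log-likelihood-ratio weight.
mu = {'+': 0.3, '-': 0.7}
nu = {'+': 0.2, '-': 0.8}
w = {l: np.log(mu[l] / nu[l]) for l in mu}
print(population_quantities(mu, nu, w))
```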

4.1. Optimal Weights

We now consider the case where there are only two label distributions, i.e. M = {µ} and N = {ν}.

Following previous works such as (Lim et al., 2014) and (Chen et al., 2012), we can view our algorithm as a convex relaxation of the maximum likelihood estimator, which is given by

$$\arg\max_{Y \text{ a cluster matrix}} \log\Pr(\Lambda \mid Y^* = Y)
= \arg\max_{Y \text{ a cluster matrix}} \sum_{i,j} \log\left[\mu(\Lambda_{ij})^{\frac{Y_{ij}-b_0}{b_1-b_0}}\, \nu(\Lambda_{ij})^{\frac{b_1-Y_{ij}}{b_1-b_0}}\right]
= \arg\max_{Y \text{ a cluster matrix}} \sum_{i,j} Y_{ij}\,\log\frac{\mu(\Lambda_{ij})}{\nu(\Lambda_{ij})}.$$

This therefore suggests the weight function

$$w_{\mathrm{MLE}}(l) = \log\frac{\mu(l)}{\nu(l)}.$$

We have the following guarantee for the performance using this weight function.

Theorem 3. Suppose M = {µ} and N = {ν}. Suppose that B is full rank and $|\log\frac{\mu(l)}{\nu(l)}| \le b$, ∀l ∈ L. Let $\zeta = \max\{\frac{D(\mu\|\nu)}{D(\nu\|\mu)}, \frac{D(\nu\|\mu)}{D(\mu\|\nu)}\}$. There exists a universal constant c such that if

$$\min\{D(\mu\|\nu), D(\nu\|\mu)\} > c\,\beta(\zeta+1)(b+2)(n\log n)\max\left\{\frac{u_1^2}{K^2}, \frac{u_2^2}{L^2}\right\}$$

or

$$\sum_{l\in\mathcal{L}}\frac{(\mu(l)-\nu(l))^2}{\mu(l)+\nu(l)} > c\,\beta(\zeta+1)(b+2)(n\log n)\max\left\{\frac{u_1^2}{K^2}, \frac{u_2^2}{L^2}\right\}$$

then with probability at least $1 - n^{-\beta}$ the solution of (1) with $w = w_{\mathrm{MLE}}$ is unique and equals Y*.

D(·‖·) in the theorem denotes the KL-divergence. The weight function $w_{\mathrm{MLE}}$ has been shown to be optimal up to a constant factor, at least for the case with two equal-size clusters (Lim et al., 2014).

4.2. Monotonicity

In the general case where we allow M and N to contain multiple distributions, it is unclear what weight function to use. The following property suggests a "conservative" weight function:

Theorem 4. Suppose the weight function $w(l) = \log\frac{\bar\mu(l)}{\bar\nu(l)}$, ∀l ∈ L, is used in program (1), where $\bar\mu$ and $\bar\nu$ are two distributions on L such that for any distributions µ ∈ M and ν ∈ N we have

$$\frac{\mu(l)}{\nu(l)} \ge \frac{\mu(l)}{\bar\nu(l)} \ge \frac{\bar\mu(l)}{\bar\nu(l)} \ge 1 \qquad\text{or}\qquad \frac{\nu(l)}{\mu(l)} \ge \frac{\nu(l)}{\bar\mu(l)} \ge \frac{\bar\nu(l)}{\bar\mu(l)} \ge 1$$

for all l ∈ L. If $\bar\mu$ and $\bar\nu$ satisfy the conditions of Theorem 3, then with probability at least $1 - n^{-\beta}$ the solution of (1) is unique and equals Y*.

Theorem 4 says that if the distributions in M are label-wise well separated from the distributions in N, at least as much as $\bar\mu$ from $\bar\nu$, then program (1) will perform no worse than if M = {µ̄} and N = {ν̄} with the associated $w_{\mathrm{MLE}}$ based on $\bar\mu$ and $\bar\nu$.

4.3. General Stochastic Block Model for Two Labels

We now discuss the results above by considering a special case of the general stochastic block model where there are only two labels, L = {+, −}. Suppose that all member-pairs of the same bicluster (Cp, Dq) have the same label distribution on L. Let

$$\mu_{pq} = \Pr(\Lambda_{ij} = + \mid i\in C_p,\ j\in D_q), \qquad \forall p, q, i, j.$$

Suppose it is known (or estimated) that all µpq are in a set {µ1, µ2, . . . , µm}. Theorems 3 and 4 suggest a simple strategy to discover the clustering in F and G:

1. Sort the µi for i = 1 . . . m in ascending order and find the largest gap between any two consecutive µi.

2. Suppose the largest gap is between $\mu_{i_0}$ and $\mu_{i_1}$; set $\nu = \mu_{i_0}$ and $\mu = \mu_{i_1}$ with ν < µ, then set N = {µi : µi ≤ ν} and M = {µi : µi ≥ µ}.

3. Solve program (1) with the weight function

$$w(+) = \log\frac{\mu}{\nu} \qquad\text{and}\qquad w(-) = \log\frac{1-\mu}{1-\nu}.$$

To illustrate, we use the example problem posed by Cai & Li (2014). This is a clustering problem (F = G) with 3 clusters, where the µpq for each block is given as follows:

$$\begin{pmatrix} 0.4 & 0.2 & 0.05 \\ 0.2 & 0.3 & 0.05 \\ 0.05 & 0.05 & 0.1 \end{pmatrix}$$

The traditional graph clustering approach where B is a diagonal matrix

$$\begin{pmatrix} b_1 & b_0 & b_0 \\ b_0 & b_1 & b_0 \\ b_0 & b_0 & b_1 \end{pmatrix}$$

could not solve this problem, since the smallest diagonal µpq (0.1) is smaller than the largest off-diagonal µpq (0.2). Using our strategy one could come up with the following assignment, where ν = 0.2 and µ = 0.3:

$$\begin{pmatrix} b_1 & b_0 & b_0 \\ b_0 & b_1 & b_0 \\ b_0 & b_0 & b_0 \end{pmatrix}$$

Suppose µ and ν satisfy the conditions in Theorem 3; then by Theorem 4 the 3 clusters can be identified with high probability, since each row of B is unique.
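A minimal sketch of this three-step strategy (our illustration, not the authors' code): the block probabilities are the Cai & Li example above, and since the largest gap is not unique here, the split ν = 0.2, µ = 0.3 is chosen explicitly, as in the text.

```python
import numpy as np

def largest_gap_split(probs):
    """Steps 1-2: sort the mu_i and split at (one of) the largest gap(s).
    Note the largest gap need not be unique; ties may require trying more
    than one split."""
    p = np.unique(np.asarray(probs, dtype=float))
    i = int(np.argmax(np.diff(p)))
    return p[i], p[i + 1]          # (nu, mu) with nu < mu

def two_label_weights(mu, nu):
    """Step 3: weights for the labels '+' and '-'."""
    return np.log(mu / nu), np.log((1 - mu) / (1 - nu))

# Cai & Li (2014) example from the text: block probabilities mu_pq.
mu_pq = np.array([[0.4, 0.2, 0.05],
                  [0.2, 0.3, 0.05],
                  [0.05, 0.05, 0.1]])

# Several gaps are tied at 0.1 here; following the text we take nu = 0.2, mu = 0.3.
nu, mu = 0.2, 0.3
w_plus, w_minus = two_label_weights(mu, nu)

# Class matrix B: blocks with mu_pq >= mu are class b1, those <= nu are class b0.
B_class = np.where(mu_pq >= mu, 'b1', 'b0')
print(w_plus, w_minus)
print(B_class)   # each row is unique, so Theorem 4 applies
```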

4.4. General Multi-label Multi-distribution Case

In the general case where there are more than two labels, the strategy is to separate all possible distributions into two sets M and N such that they are as "far apart" as possible. The weight function can then be set with respect to a pair of distributions µ ∈ M and ν ∈ N. Sometimes it may be preferable to merge two or more labels into one and use their marginal distributions instead. Ultimately, the performance of program (1) depends on E0, E1 and V in Theorem 1.

Program (1) may be run multiple times, each with a different weight function, to discover finer clustering or biclustering structure within the previous output. Full exploration of these possibilities is left to future work.

5. Implementation

Program (1) can be solved efficiently using the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011). We provide the pseudocode for a complete implementation in Algorithm 1, with an explicit stopping condition. The inputs and output of Algorithm 1 are the same terms used in program (1). We find that in practice, setting $\lambda = \sqrt{2n}$ works well. The threshold for convergence is specified by ε. We find that ε = 10⁻⁴ is a good tradeoff between the speed of convergence and the quality of the solution. All our experiment results, unless otherwise stated, were based on $\lambda = \sqrt{2n}$ and ε = 10⁻⁴.

An optional Step 7 for updating ρ may potentially improve the speed of convergence. This takes an additional parameter τ. If $\|X^{k+1} - Y^{k+1}\|_F > \tau\rho\|Y^{k+1} - Y^{k}\|_F$ then set ρ := 2ρ and $Q^{k+1} := Q^{k+1}/2$. On the other hand, if $\tau\|X^{k+1} - Y^{k+1}\|_F < \rho\|Y^{k+1} - Y^{k}\|_F$ then set ρ := ρ/2 and $Q^{k+1} := 2Q^{k+1}$. Typically τ = 10 is a stable choice. We refer the reader to Boyd et al. (2011) for further details.

Algorithm 1 ADMM solver for Program (1)
Input: W ∈ R^{n1×n2}, λ, b0, b1, ε
Output: Y
1. ρ := 1, k := 0
2. Y^k := 0, Q^k := 0 (Y^k, Q^k ∈ R^{n1×n2})
3. X^{k+1} := U max{Σ − λ/ρ, 0} V^⊤, where UΣV^⊤ is an SVD of (Y^k − Q^k)
4. Y^{k+1} := min{ max{ X^{k+1} + Q^k + (1/ρ)W, b0 }, b1 } (element-wise)
5. Q^{k+1} := Q^k + X^{k+1} − Y^{k+1}
6. If ‖X^{k+1} − Y^{k+1}‖_F ≤ ε max{‖X^{k+1}‖_F, ‖Y^{k+1}‖_F} and ‖Y^{k+1} − Y^k‖_F ≤ ε‖Q^{k+1}‖_F, then stop and output Y := Y^{k+1}
7. (Optional) Update ρ and Q^{k+1}
8. k := k + 1, go to step 3

In practice, due to finite precision and numerical errors, the output Y of Algorithm 1 will not have entries that are exactly b0 or b1, even if it is a successful recovery. A simple rounding, say, around (b0 + b1)/2, would give the correct solution. The clusters in F (resp. G) can then be obtained by sorting the rows (resp. columns) of Y. Even if the recovery is not 100%, a simple k-means algorithm can be applied to the rows/columns to obtain a desired number of clusters.
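As a companion to Algorithm 1, here is a minimal Python/numpy rendering of the same iteration (our translation, not the authors' Matlab implementation; the function name and defaults are illustrative choices). It includes the optional Step 7 update of ρ and the stopping condition above.

```python
import numpy as np

def admm_biclustering(W, lam, b0=-1.0, b1=1.0, eps=1e-4, tau=10.0,
                      max_iter=10000, adapt_rho=True):
    """ADMM solver for program (1): max <W, Y> - lam * ||Y||_*  s.t. b0 <= Y_ij <= b1."""
    n1, n2 = W.shape
    rho = 1.0
    Y = np.zeros((n1, n2))
    Q = np.zeros((n1, n2))
    for _ in range(max_iter):
        # Step 3: singular value soft-thresholding of (Y - Q) at level lam/rho.
        U, s, Vt = np.linalg.svd(Y - Q, full_matrices=False)
        X = U @ np.diag(np.maximum(s - lam / rho, 0.0)) @ Vt
        # Step 4: element-wise projection of X + Q + W/rho onto the box [b0, b1].
        Y_new = np.clip(X + Q + W / rho, b0, b1)
        # Step 5: dual update.
        Q = Q + X - Y_new
        # Step 6: stopping condition.
        if (np.linalg.norm(X - Y_new) <= eps * max(np.linalg.norm(X), np.linalg.norm(Y_new))
                and np.linalg.norm(Y_new - Y) <= eps * np.linalg.norm(Q)):
            return Y_new
        # Step 7 (optional): adapt rho, rescaling Q accordingly.
        if adapt_rho:
            primal = np.linalg.norm(X - Y_new)
            dual = np.linalg.norm(Y_new - Y)
            if primal > tau * rho * dual:
                rho *= 2.0
                Q /= 2.0
            elif tau * primal < rho * dual:
                rho /= 2.0
                Q *= 2.0
        Y = Y_new
    return Y

# Usage sketch: with lambda = sqrt(2n) as reported in the text, then round around (b0+b1)/2.
# n = max(W.shape)
# Y = admm_biclustering(W, lam=np.sqrt(2 * n), b0=-1.0, b1=1.0)
# Y_rounded = np.where(Y > 0.0, 1.0, -1.0)
```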

6. Experiments

6.1. Synthetic Data

We first evaluate the performance of our algorithm on synthetic data, where the complete ground truth is available. We test a simple setup where there are only two labels and two label distributions. Three different sizes for n1 (with n2 = n1) are used: 100, 500, and 1000. The number of clusters is 5 for n1 = 100 and 10 for the others, all with equal sizes. The matrix B has independently generated ±1 entries with uniform probability. A noise level σ is defined such that each observed label Λij equals Y*ij with probability 1 − σ and equals the opposite value with probability σ. This setup is very similar to that used in (Wulff et al., 2013), except that in our case we report the actual "full recovery rate", i.e. the fraction of trials where the final output Y exactly equals Y*, out of 100 independent trials. Error bars in all figures show 95% confidence intervals.

[Figure 2. Full recovery rate and running time of program (1). Panels plot the full recovery rate and the running time (seconds) against the noise level for Y of size 100×100 (B: 5×5), 500×500 (B: 10×10) and 1000×1000 (B: 10×10), for "k-means", "conv" and "conv + k-means".]

Figure 2 shows the results. The plot "conv" indicates results based on the output of Program (1) with each entry of Y rounded to the closest ±1. Since the noise is symmetric (for +1 and −1 observations), the weight for Program (1) can simply be set to Wij = Λij. Included in the figure are two additional plots, "k-means" and "conv + k-means". The "k-means" results are based on running the k-means algorithm directly on the observed labels (since they are numeric), once on the rows and a separate run on the columns. After running k-means, all entries in the same bicluster ("block") are set to take the majority label in the block. The "conv + k-means" results are based on applying the same k-means process to the output of "conv". Note that Program (1) itself does not need to know the number of clusters, while k-means requires the number of clusters as input. We choose k-means as reference since, despite its simplicity, it is still one of the top performing biclustering methods according to the report by Oghabian et al. (2014).

We can observe from Fig. 2 that the simple k-means algorithm performs rather well, but it is significantly outperformed by "conv", especially for larger n. We recommend running k-means on top of the output of program (1), especially when the number of clusters is known, since this produces the best recovery result. We can also validate both Theorems 1 and 3. The ratios √n/K for the three sizes are 0.50, 0.45 and 0.32 respectively, and indeed the problem becomes "easier" as the ratio gets smaller.

The average computational time needed to finish one instance of program (1) in Matlab on a Core i5 desktop machine is also reported in Fig. 2. It is interesting to note that convergence is typically very fast when the problem instance is either very "easy" (e.g. low noise) or too "hard", while the "borderline" instances usually require a much larger number of iterations to converge.
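A sketch of this synthetic setup and the "full recovery rate" computation (our code, not the authors'; it reuses the admm_biclustering helper sketched after Algorithm 1 and assumes the symmetric ±1 flip-noise model described above).

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_instance(n1, n_clusters, sigma):
    """Equal-size clusters in rows and columns, B with uniform +-1 entries,
    and each label flipped independently with probability sigma."""
    K = n1 // n_clusters
    assign = np.repeat(np.arange(n_clusters), K)
    B = rng.choice([-1.0, 1.0], size=(n_clusters, n_clusters))
    Y_star = B[np.ix_(assign, assign)]
    flips = rng.random((n1, n1)) < sigma
    Lambda = np.where(flips, -Y_star, Y_star)
    return Y_star, Lambda

def full_recovery_rate(n1=100, n_clusters=5, sigma=0.1, trials=100):
    hits = 0
    for _ in range(trials):
        Y_star, Lambda = synthetic_instance(n1, n_clusters, sigma)
        # Symmetric noise: the weight matrix is simply the observed label matrix.
        Y = admm_biclustering(Lambda, lam=np.sqrt(2 * n1), b0=-1.0, b1=1.0)
        Y_rounded = np.where(Y > 0.0, 1.0, -1.0)   # round to the closest +-1
        hits += np.array_equal(Y_rounded, Y_star)
    return hits / trials
```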
6.1.1. One-Sided Biclustering

Our second set of experiments evaluates the prediction of Theorem 2. In particular, when large clusters exist in F (i.e. the rows), and B is more or less uniformly distributed, the recovery of Y* depends mainly on K and not L. Figure 3 shows two sets of plots. Both experiments use n1 = 400 and n2 = 800. B has ±1 entries and the observations consist of two labels, 0 or 1. Entries with Y* = +1 produce the label 1 with probability 0.7, while the rest produce the label 1 with probability 0.05. This is the case where the observation noise is non-symmetric.

In the left figure, B in each trial is generated with uniformly random ±1 entries. Equal-size clusters are generated in both the rows and the columns, where each row (resp. column) cluster has size K (resp. L). We test the biclustering performance in two cases: (1) fixed K and varying L, (2) varying K and L. Case 1 is shown in the blue plot while case 2 is in red. As predicted by Theorem 2, the full recovery rate remains high in case 1, where K remains fixed and large regardless of L. In case 2, the performance drops when both K and L become small, as expected.

In the right figure, we evaluate the biclustering performance in a scenario where the assumption of Theorem 2 is violated. In particular, we force B to be a 10 × 10 diagonal matrix: the number of clusters is fixed at 10 in both the rows and the columns. Note that in this case, u1 and u2 both equal 1. The first cluster in the rows (resp. columns) is chosen to have size equal to K (resp. L). The remaining clusters all have equal, but larger, sizes. Here, we observe that the performance drops significantly in both cases, even though K is held large in the first case.

[Figure 3. Full recovery rate for two different setups with varying cluster sizes. Left: B has uniform ±1 entries. Right: B diagonal. Both panels plot the full recovery rate against L for the cases K = 40 and K = L/2.]

6.2. Real World Data

We further evaluate our algorithm on real world data, where the true data generating model is unknown.

6.2.1. Animal-Feature Dataset

Our first dataset is the animal-feature data originally published by Osherson et al. (1991), which has been used in (Wulff et al., 2013) and (Kemp et al., 2006) for biclustering. The dataset contains human-produced associations between a set of 50 animals and 85 features. For each animal-feature pair, the degree of association, a real value between 0 and 100, is provided. The biclustering task is to simultaneously cluster the animals and features. Unfortunately no ground truth is available, so evaluation is subjective. Nevertheless this allows a qualitative comparison with the results reported in, e.g., (Wulff et al., 2013) and (Kemp et al., 2006).

We use the raw data in two ways. First, we follow Kemp et al. (2006) and use the global mean value t ≈ 20.8 as the cutoff point between a "yes" and a "no". For the "yes" entries, we assume that the probability of error decreases linearly with the recorded degree of association, where we set 101 to be the point of 0 error probability while the probability of error at t is 0.5. Similarly, for the "no" entries the point of 0 error is set at −1. We then employ the MLE weight and run program (1). To extract clusters (both the animals and the features) we run k-means on the rows and columns of the program output Y. Given that this is unlikely to be the true model, we expect a need to adjust the regularization parameter λ as well as the number of clusters. We tested a range of λ and cluster sizes, and chose a combination that produces clusters with the most uniform sizes. Figure 4 shows the result with $\lambda = \sqrt{n}$, for 9 animal clusters and 20 feature clusters. There is randomness involved in using k-means that depends on the initialization, but for the chosen parameters the resulting clusters are more-or-less stable. We find the results reasonable and not unlike those obtained by Wulff et al. (2013) and Kemp et al. (2006).
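A sketch of one possible reading of this preprocessing (ours, not the authors' code): the per-entry error model follows the linear description above, but the exact way the error probabilities are turned into MLE weights is not spelled out in the text, so the weight construction below is an assumption on our part.

```python
import numpy as np

def animal_feature_weights(scores, t=20.8):
    """Per-entry error probabilities for association scores in [0, 100], under the
    linear error model described above, and a plausible per-entry log-likelihood
    ratio weight built from them (an assumption, not necessarily the paper's)."""
    scores = np.asarray(scores, dtype=float)
    yes = scores > t
    # "yes" entries: error 0.5 at score t, decreasing linearly to 0 at score 101.
    err_yes = 0.5 * (101.0 - scores) / (101.0 - t)
    # "no" entries: error 0.5 at score t, decreasing linearly to 0 at score -1.
    err_no = 0.5 * (scores + 1.0) / (t + 1.0)
    err = np.where(yes, err_yes, err_no)
    # Treat each entry as a noisy yes/no label with flip probability err;
    # the corresponding weight is +-log((1-err)/err).
    return np.where(yes, 1.0, -1.0) * np.log((1.0 - err) / err)

# Usage (hypothetical): scores is the 50 x 85 animal-feature association matrix.
# W = animal_feature_weights(scores)
# Y = admm_biclustering(W, lam=np.sqrt(max(W.shape)))   # lambda = sqrt(n), as in the text
```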

[Figure 4. Left: Program (1) output Y. Right: Raw data with the same rows/columns order. The animal clusters (A1-A9) and a subset of the feature clusters (F1-F9) shown in the figure are listed below.]

A1: leopard, lion, wolf, bobcat, fox, german shepherd, tiger, grizzly bear, polar bear
A2: blue whale, humpback whale, walrus, seal, dolphin, killer whale, otter
A3: rabbit, mouse, hamster, squirrel, mole, beaver, skunk
A4: giant panda, sheep, cow, ox, pig, buffalo
A5: zebra, horse, antelope, deer, giraffe, moose
A6: chihuahua, persian cat, collie, siamese cat, dalmatian
A7: weasel, rat, raccoon, bat
A8: spider monkey, chimpanzee, gorilla
A9: elephant, hippopotamus, rhinoceros

F1: flippers, ocean, water, swims, fish, arctic
F2: fierce, hunter, stalker, meat, meatteeth, muscle
F3: walks, quadrapedal, ground, furry, chewteeth, brown
F4: smart, fast, active, agility
F5: claws, solitary, paws, small
F6: group, big, strong
F7: newworld, tail, oldworld
F8: white, domestic
F9: hooves, grazer, plains, fields, vegetation

6.2.2. Gene Expression Data

For our second real-world example, we use the gene expression data for leukemia, in the form provided by Monti et al. (2003). This is a filtered version of the original raw data published in (Golub et al., 1999), with mostly uninformative genes removed. These are DNA microarray data collected from bone marrow samples of acute leukemia patients, with 11 acute myeloid leukemia (AML) samples, 8 T-lineage acute lymphoblastic leukemia (ALL) samples, and 19 B-lineage ALL samples. The "features" are the recorded expression levels of 999 genes. The expression levels are real values with widely different ranges for each gene. We scale and shift the expression levels for each gene such that the mean is 0 and the standard deviation is 1. Since the actual model is unknown, we assume a Gaussian distribution, where we use N(1, 1) for µ and N(−1, 1) for ν. The MLE weight is therefore log(µ(l)/ν(l)), where l is the observed expression level.² Note that the resulting weight is simply a linear function of l (indeed, log[N(l; 1, 1)/N(l; −1, 1)] = ((l + 1)² − (l − 1)²)/2 = 2l).

²Although in general the likelihood ratio is unbounded for Gaussian distributions, a standard truncation trick can be used to bound the weights such that our theoretical results still hold (see Lim et al., 2014).

Since the type of each sample is known, we can evaluate the clustering performance for the 38 samples. Again, we ran k-means on the program output and tested a range of λ. Figure 5 shows the clustering performance with respect to λ, against the ground-truth clustering, in terms of the adjusted mutual information and the percent of sample-pairs that are correctly clustered together. The two performance measures give qualitatively similar results and the best performance is achieved around $\lambda = \sqrt{5n}$. In general, a small λ would produce a finer clustering structure while a larger λ produces a coarser structure. Note that using λ = 0 is equivalent to simply thresholding the raw value to the nearest ±1 (right panel of Fig. 6). Running k-means directly on the raw data (middle panel of Fig. 6) results in the worst performance, with an adjusted mutual information of 0.4.

Figure 6 shows the program output Y (for $\lambda = \sqrt{5n}$), along with the raw data and its thresholded version. Each column corresponds to one sample, and they have been arranged according to the ground truth clustering, where columns 1-19 correspond to B-ALL, 20-27 to T-ALL and 28-38 to AML. We observe that finer clustering seems to exist especially within the B-ALL group, and according to Monti et al. (2003) it is widely accepted that meaningful sub-classes of ALL do exist, but the composition and nature of these sub-classes is not as well accepted.

[Figure 5. Clustering performance as a function of λ²/n for "conv + k-means" and "k-means (raw)". Left: Adjusted mutual information. Right: Fraction of pairs correctly clustered together.]

[Figure 6. Left: Output Y from program (1). Middle: Raw data. Right: Raw data (binarized).]

Acknowledgments

S.H. Lim and H. Xu were supported by the Ministry of Education of Singapore through AcRF Tier Two grants R265000443112 and R265000519112, and A*STAR Public Sector Funding R265000540305. Y. Chen was supported by NSF grant CIF-31712-23800 and ONR MURI grant N00014-11-1-0688.

References

Ames, B. and Vavasis, S. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.

Ames, Brendan P.W. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, pp. 1–37, 2013. ISSN 0025-5610. doi: 10.1007/s10107-013-0729-x.

Anandkumar, Anima, Ge, Rong, Hsu, Daniel, and Kakade, Sham M. A tensor spectral approach to learning mixed membership community models. arXiv preprint arXiv:1302.2684, 2013.

Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56(1), 2004.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.

Cai, T. Tony and Li, Xiaodong. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. arXiv preprint arXiv:1404.6000, 2014.

Chaudhuri, K., Chung, F., and Tsiatas, A. Spectral clustering of graphs with general degrees in the extended planted partition model. In COLT, 2012.

Chen, Y., Sanghavi, S., and Xu, H. Clustering sparse graphs. In NIPS, 2012.

Chen, Y., Jalali, A., Sanghavi, S., and Xu, H. Clustering partially observed graphs via convex optimization. Journal of Machine Learning Research, 15:2213–2238, June 2014.

Cheng, Yizong and Church, George M. Biclustering of expression data. In Proc. of the 8th ISMB, pp. 93–103. AAAI Press, 2000.

Demaine, E. D., Immorlica, N., Emmanuel, D., and Fiat, A. Correlation clustering in general weighted graphs. SIAM special issue on approximation and online algorithms, 2005.

Eren, Kemal, Deveci, Mehmet, Küçüktunç, Onur, and Çatalyürek, Ümit V. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 2012.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., and Bloomfield, C. D. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

Hartigan, J. A. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67(337):123–129, 1972. ISSN 01621459. doi: 10.2307/2284710.

Holland, Paul W., Laskey, Kathryn B., and Leinhardt, Samuel. Stochastic blockmodels: Some first steps. Social Networks, 5:109–137, 1983.

Kannan, R., Vempala, S., and Vetta, A. On clusterings: good, bad and spectral. In IEEE Symposium on Foundations of Computer Science, 2000.

Kemp, Charles, Tenenbaum, Joshua B., Griffiths, Thomas L., Yamada, Takeshi, and Ueda, Naonori. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI'06, pp. 381–388. AAAI Press, 2006.

Kolar, Mladen, Balakrishnan, Sivaraman, Rinaldo, Alessandro, and Singh, Aarti. Minimax localization of structural information in large noisy matrices. In NIPS, pp. 909–917, 2011.

Lelarge, Marc, Massoulié, Laurent, and Xu, Jiaming. Reconstruction in the Labeled Stochastic Block Model. In IEEE Information Theory Workshop, Seville, Spain, September 2013. URL http://hal.inria.fr/hal-00917425.

Lim, S.H., Chen, Y., and Xu, H. Clustering from labels and time-varying graphs. In NIPS, 2014.

Mathieu, C. and Schudy, W. Correlation clustering with noisy input. In SODA, pp. 712, 2010.

McSherry, F. Spectral partitioning of random graphs. In FOCS, pp. 529–537, 2001.

Monti, Stefano, Tamayo, Pablo, Mesirov, Jill, and Golub, Todd. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52(1-2):91–118, July 2003.

Oghabian, Ali, Kilpinen, Sami, Hautaniemi, Sampsa, and Czeizler, Elena. Biclustering methods: Biological relevance and application in gene expression analysis. PLoS ONE, 9(3), 2014.

Osherson, D.N., Stern, J., Wilkie, O., Stob, M., and Smith, E.E. Default probability. Cognitive Science, 15(2):251–269, 1991.

Puleo, G. J. and Milenkovic, O. Correlation Clustering with Constrained Cluster Sizes and Extended Weights Bounds. ArXiv e-print 1411.0547, 2014.

Rohe, K., Chatterjee, S., and Yu, B. Spectral clustering and the high-dimensional stochastic block model. Annals of Statistics, 39:1878–1915, 2011.

Rudelson, Mark and Vershynin, Roman. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.

Shamir, O. and Tishby, N. Spectral Clustering on a Budget. In AISTATS, 2011.

Swamy, C. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, 2004.

Tanay, Amos, Sharan, Roded, and Shamir, Ron. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, 2005.

Vinayak, Ramya Korlakai, Oymak, Samet, and Hassibi, Babak. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems, pp. 2996–3004, 2014.

Wulff, S., Urner, R., and Ben-David, S. Monochromatic Bi-Clustering. In ICML, 2013.

Xu, Jiaming, Wu, Rui, Zhu, Kai, Hajek, Bruce, Srikant, R., and Ying, Lei. Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. SIGMETRICS Perform. Eval. Rev., 42(1):29–41, June 2014.