A Convex Optimization Framework for Bi-Clustering

Shiau Hong Lim [email protected] National University of Singapore, 9 Engineering Drive 1, Singapore 117575

Yudong Chen [email protected] University of California, Berkeley, CA 94720, USA

Huan Xu [email protected] National University of Singapore, 9 Engineering Drive 1, Singapore 117575

Abstract

We present a framework for biclustering and clustering where the observations are general labels. Our approach is based on the maximum likelihood estimator and its convex relaxation, and generalizes recent works in graph clustering to the biclustering setting. In addition to the standard biclustering setting, where one seeks to discover clustering structure simultaneously in two domain sets, we show that the same algorithm can be as effective when clustering structure only occurs in one domain. This allows for an alternative approach to clustering that is more natural in some scenarios. We present theoretical results that provide sufficient conditions for the recovery of the true underlying clusters under a generalized stochastic block model. These are further validated by our empirical results on both synthetic and real data.

1. Introduction

In a regular clustering task, we look for clustering structure within a set F through observing the pairwise interactions between elements in this set. In biclustering, we instead have two sets F and G, and the observed pairwise interactions are between object pairs (i, j) with i ∈ F and j ∈ G. The aim is to discover clustering structure within F, G or both through these observations. For example, in a recommender system, F consists of customers and G a set of products. In DNA microarray analysis, F could be biological samples and G a set of genes. The standard clustering task can be viewed as a special case of biclustering with F = G.

While many biclustering algorithms have been proposed and applied to a variety of problems, most are without formal guarantees in terms of the actual clustering performance. A typical approach begins with an objective function that measures the quality or cost of a candidate clustering, and then searches for one that optimizes the objective. Many interesting objective functions have intractable computational complexity, and the focus of many past works has been on finding efficient approximate solutions to these problems.

We propose a tractable biclustering algorithm based on convex optimization. The algorithm can be viewed as a convex relaxation to the computationally intensive problem of finding a maximum likelihood solution under a generalized stochastic block model. The stochastic block model (Holland et al., 1983; Rohe et al., 2011) has been widely used in graph clustering. In this model, it is assumed that a true but unknown underlying clustering exists in both F and G. A probabilistic generative model is defined for the observations, and the performance of the clustering algorithm is evaluated in terms of the ability to recover the true underlying clusters.

Our main contribution is in extending the current advances in standard graph clustering to the biclustering setting. We provide the conditions under which our biclustering algorithm can recover the true clusters with high probability. Our theoretical results are consistent with existing results in graph clustering, which have been shown to be optimal in some cases.

One novel aspect of our result is in providing new insight on the case where clustering structure only occurs in one domain set, say F but not necessarily in G. We show that under reasonable assumptions, the same algorithm can be used to recover the clusters in F regardless of the clustering structure in G.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).
This provides an alternative approach to standard graph clustering, where instead of relying on pairwise interactions within F, we cluster objects in F through their interactions with elements of G.

We employ the observation model proposed in (Lim et al., 2014), where each observation is a label Λij ∈ L. The label set L can be very general. In standard graph clustering, L would consist of two labels "edge" and "no-edge". An additional "unknown" label may be included for partially observed graphs. We refer the reader to (Lim et al., 2014) for examples of other, more complex label sets, which include observations from time-varying graphs.

The paper is organized as follows. After discussing related works in the next section, we present the formal setup of our approach in Section 3. The main algorithm and its theoretical results are presented in Section 4. An implementation of the main algorithm is provided in Section 5. Empirical results on both synthetic and real-world data are presented in Section 6. The proofs of the theoretical results are given in the supplementary materials.

2. Related Work

A comparative study of many biclustering algorithms in the domain of analyzing gene expression data is provided in (Eren et al., 2012), and a comprehensive survey can be found in (Tanay et al., 2005). Beginning with the work of Hartigan (1972) and Cheng & Church (2000), many approaches to biclustering are based on optimizing certain combinatorial objective functions, which are typically NP-hard, and heuristic and approximate algorithms have been developed but with no formal performance guarantees. One notable exception that is related to our setting with labels is the work of Wulff et al. (2013). They proposed a very intuitive monochromatic cost function, proved its NP-hardness and developed a polynomial-time approximate algorithm. Another related approach is correlation clustering (Bansal et al., 2004), which was originally developed for clustering but can be extended to biclustering; results on computational complexity and approximate algorithms can be found in, e.g., (Demaine et al., 2005; Swamy, 2004; Puleo & Milenkovic, 2014). A popular approach to clustering and biclustering is spectral clustering and its variants (Chaudhuri et al., 2012; Rohe et al., 2011; Anandkumar et al., 2013; Kannan et al., 2000; Shamir & Tishby, 2011; Lelarge et al., 2013; McSherry, 2001).

Here we focus on average-case performance under a probabilistic generative model for generalized graphs with labels, and our algorithms are inspired by recent convex optimization approaches to graph clustering (Mathieu & Schudy, 2010; Ames & Vavasis, 2011; Lim et al., 2014; Chen et al., 2012; 2014; Cai & Li, 2014; Vinayak et al., 2014). For biclustering, the work by Ames (2013) and Kolar et al. (2011) considers the setting with a "block-diagonal" structure (in the matrix B to be defined below). The recent work by Xu et al. (2014) studies a more general setting with "block-constant" structure. Both these settings are special cases of ours with 3 labels (1, −1 and "unobserved") and with clusters in both the rows and columns.

3. Problem Setup

A bicluster is defined as a cluster-pair (C, D) with C ⊆ F and D ⊆ G. We say that (i, j) is a member-pair of (C, D) if i ∈ C and j ∈ D. The key property that is shared by member-pairs of the same bicluster is the label distribution. Ideally, if two pairs (i, j) and (i′, j′) belong to the same bicluster, then their respective labels Λij and Λi′j′ should have similar distributions.

Let n1 = |F|, n2 = |G| and n = max{n1, n2}. Without loss of generality, we use i = 1 . . . n1 to denote members of F and j = 1 . . . n2 to denote members of G.

We assume an underlying clustering in F such that there exists a partition of F into r1 disjoint subsets {Cp : p = 1 . . . r1}. Similarly, G is partitioned into {Dq : q = 1 . . . r2}. These partitionings result in a total of r1 × r2 biclusters (Cp, Dq). Let Kp = |Cp| and Lq = |Dq| be the respective cluster sizes of Cp and Dq, with K = minp Kp and L = minq Lq.

We group all the biclusters into two classes. This is specified using an r1 × r2 matrix B whose entries Bpq ∈ {b0, b1}, where b0 and b1 (b0 < b1) are two arbitrarily defined real numbers, identifying the class of each bicluster. Member-pairs of each bicluster share the same class. This is specified using an n1 × n2 matrix Y*, where Y*ij = Bpq if i ∈ Cp and j ∈ Dq.

Associated with each class is a set of label-generating distributions. We assume that each observed label Λij is generated independently from some distribution µij. If Y*ij = b1 then µij ∈ M, otherwise if Y*ij = b0 then µij ∈ N. In the simplest case, there are only two label distributions µ and ν such that M = {µ} and N = {ν}. In this case Λij ∼ µ if Y*ij = b1 and Λij ∼ ν if Y*ij = b0.

Let UC be an n1 × r1 matrix denoting the membership of each cluster Cp, where (UC)ip = 1/√Kp if i ∈ Cp, and (UC)ip = 0 otherwise. Note that each row of UC contains only one non-zero entry. Similarly, VD is an n2 × r2 membership matrix for the clusters Dq. Let K = diag(√K1, . . . , √Kr1) be an r1 × r1 diagonal matrix, and similarly let L = diag(√L1, . . . , √Lr2). Y* can therefore be related to B by

$$Y^* = U_C\, K\, B\, L\, V_D^\top.$$
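To make the notation concrete, the following is a minimal numpy sketch (ours, not from the paper; the cluster sizes and class matrix below are arbitrary illustrative choices) that builds the membership matrices and checks the factorization above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative cluster sizes (assumed): r1 = 3 row clusters, r2 = 2 column clusters.
K_sizes = [4, 3, 5]          # K_p = |C_p|
L_sizes = [6, 4]             # L_q = |D_q|
n1, n2 = sum(K_sizes), sum(L_sizes)
b0, b1 = -1.0, 1.0

# Class matrix B with entries in {b0, b1}, one entry per bicluster (C_p, D_q).
B = rng.choice([b0, b1], size=(len(K_sizes), len(L_sizes)))

# Cluster assignment of each row/column element.
row_cl = np.repeat(np.arange(len(K_sizes)), K_sizes)   # i -> p with i in C_p
col_cl = np.repeat(np.arange(len(L_sizes)), L_sizes)   # j -> q with j in D_q

# Normalized membership matrices: (U_C)_{ip} = 1/sqrt(K_p) iff i in C_p, else 0.
U_C = np.zeros((n1, len(K_sizes)))
U_C[np.arange(n1), row_cl] = 1.0 / np.sqrt(np.array(K_sizes)[row_cl])
V_D = np.zeros((n2, len(L_sizes)))
V_D[np.arange(n2), col_cl] = 1.0 / np.sqrt(np.array(L_sizes)[col_cl])

Kmat = np.diag(np.sqrt(K_sizes))   # diag(sqrt(K_1) ... sqrt(K_r1))
Lmat = np.diag(np.sqrt(L_sizes))   # diag(sqrt(L_1) ... sqrt(L_r2))

# Y*_{ij} = B_{pq} for i in C_p, j in D_q; identical to the factored form above.
Y_star = B[np.ix_(row_cl, col_cl)]
assert np.allclose(Y_star, U_C @ Kmat @ B @ Lmat @ V_D.T)
```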

Let the reduced SVD of KBL be $U_B S_B V_B^\top$. It is easy to see that a valid SVD of Y* is given by $U = U_C U_B$, $V = V_D V_B$ and $S = S_B$, such that $Y^* = U S V^\top$. The difficulty of the biclustering task, especially when B is non-square, depends on the r1 × r2 matrix $KBL = U_B S_B V_B^\top$. To capture this dependence, we introduce the notion of coherence, defined as

$$u_1 = \max_{p\in\{1\ldots r_1\}} \|U_B U_B^\top e_p\|^2 = \max_{p\in\{1\ldots r_1\}} \|U_B^p\|^2, \qquad u_2 = \max_{q\in\{1\ldots r_2\}} \|V_B V_B^\top e_q\|^2 = \max_{q\in\{1\ldots r_2\}} \|V_B^q\|^2,$$

where $e_i$ is the i-th standard basis vector and we define $U_B^p = U_B^\top e_p$ and similarly $V_B^q = V_B^\top e_q$. Since both $U_B$ and $V_B$ have orthonormal columns, it is straightforward to see that $1/r_1 \le u_1 \le 1$ and $1/r_2 \le u_2 \le 1$. Our definition of coherence is based on a similar notion in the literature on low-rank matrix estimation/approximation. In our biclustering setting, it characterizes how easy it is to infer the structure in the columns from that in the rows, and vice versa.

4. Algorithm and Main Results

The biclustering task is to find Y* given the observations Λ. The algorithm consists of two steps. First, a weight function w : L → R is chosen to construct an n1 × n2 weight matrix W, where Wij = w(Λij). The second step involves solving the following convex optimization problem for the n1 × n2 matrix Y:

$$\max_{Y} \;\; \langle W, Y \rangle - \lambda \|Y\|_* \qquad\qquad (1)$$
$$\text{s.t.} \;\; b_0 \le Y_{ij} \le b_1, \;\; \forall i, j$$

where $\lambda = c\sqrt{n \log n}$ for some sufficiently large constant c and $\|Y\|_*$ denotes the nuclear norm of Y.

Our main theorem provides the sufficient condition for the exact recovery of the true clustering matrix Y* using program (1) with high probability. The condition depends on the following population quantities:

$$E_1 := \min_{\mu\in\mathcal{M}} \mathbb{E}_{\mu} w, \qquad E_0 := \min_{\mu\in\mathcal{N}} \mathbb{E}_{\mu}[-w], \qquad V := \max_{\mu\in\mathcal{M}\cup\mathcal{N}} \mathrm{Var}_{\mu} w,$$

where $\mathbb{E}_\mu$ and $\mathrm{Var}_\mu$ denote the expectation and variance under the distribution µ. With this notation, we have the following:

Theorem 1. Suppose that B is full rank and |w(l)| ≤ b, ∀l ∈ L. There exists a universal constant c such that if

$$\min\{E_1, E_0\} > c\left(b\beta\log n + \sqrt{\beta V n\log n}\right)\max\left\{\frac{u_1}{K}, \frac{u_2}{L}\right\}$$

then with probability at least $1 - n^{-\beta}$ the solution of (1) is unique and equals Y*.

Note that the probability in Theorem 1 is with respect to the randomness of the observed labels, given a fixed B and Y*. Theorem 1 is general in the sense that it applies to any bounded weight function w and any choice of b0 < b1.¹ The clustering structure in F and G is reflected by the dependence on u1, u2, K and L. The proof of Theorem 1 (given in the supplementary material) follows the idea of the proof of Theorem 1 in (Lim et al., 2014), but with extra consideration for the above structural properties. For the case of standard clustering, where F = G, both B and Y* are symmetric and we have u1 = u2 and K = L. In this case we recover almost all the results in (Lim et al., 2014).

¹For all practical purposes, the choice of either (b0, b1) = (0, 1) or (b0, b1) = (−1, 1) should suffice.

It is important to note that any two clusters in F or G can only be distinguished if the corresponding rows or columns in B are unique. It is, however, unclear whether the full-rank requirement for B is strictly necessary for the purpose of recovering Y*.

One-sided clusters: For the case of biclustering, both K and L play a similar role in Theorem 1. If one uses the naive bounds u1 ≤ 1 and u2 ≤ 1, then it suggests that sufficiently large clusters in both F and G are necessary for (1) to be successful. This renders the result particularly weak when, for example, L is small relative to K. Yet, the matrix B has the same low rank as long as K is large, regardless of L. In this case, one should still be able to recover the clusters in F.

The following result shows that this is indeed the case, assuming that B has rather uniformly spread out ±1 entries:

Theorem 2. Suppose that $r_1 \le r_2$ and $r_1 \ge \frac{\beta \log r_1}{c}$ for some universal constant c. Let $\phi = \frac{K}{\max_p K_p}$ and $\psi = \frac{L}{\max_q L_q}$, and let b be such that |w(l)| ≤ b, ∀l ∈ L. Suppose that Bpq for each (p, q) is an independent ±1 random variable with Pr(Bpq = +1) = Pr(Bpq = −1) = 1/2. There exists a universal constant c′ such that if

$$\min\{E_1, E_0\} > \frac{c'}{\phi\psi}\sqrt{\frac{n_1}{n_2}}\cdot\frac{b\beta\log n + \sqrt{\beta V n\log n}}{K}$$

then the solution of (1) is unique and equals Y* with probability at least $1 - n^{-\beta}$.

Note that the probability in Theorem 2 is with respect to both the randomness in B as well as the observed labels, holding the clusterings in F and G fixed. The proof (again, given in the supplementary material) is via bounding max{u1/K, u2/L} by applying a bound on the singular values of B by Rudelson & Vershynin (2009).

Theorem 2 implies that the success of (1) is essentially independent of L when we are only interested in clustering F. In particular, we can think of each row in the observed matrix as the feature representation (with G the feature set) of the corresponding element in F. As long as elements that belong to the same cluster in F have the same feature relationships, we can perform clustering through these features regardless of whether the features themselves show a clustering structure. This is illustrated in Figure 1. The example on the left shows clustering structure only in the rows (each column is unique) while the example on the right shows clustering structure in both the rows and the columns. According to Theorem 2, program (1) should be equally effective in both cases, with the same order-wise dependency on K.

[Figure 1. Two clustering/biclustering structures.]

It is interesting to note that in the unbalanced case where n1 is fixed, the condition of success improves as n2 grows, even when the cluster sizes K and L remain the same. This situation cannot occur in standard clustering, where the problem always gets harder as n grows if K stays the same. This phenomenon is not discussed in the previous work (Kolar et al., 2011; Ames, 2013) on biclustering.
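For a given instance, the coherence parameters u1, u2 and the population quantities E1, E0, V entering Theorems 1 and 2 can be computed directly. The following is a small sketch of such a computation (our illustration, not from the paper; it assumes the two-distribution case M = {µ}, N = {ν} over a finite label set, and uses the log-likelihood-ratio weight discussed in Section 4.1 below).

```python
import numpy as np

def coherence(B, K_sizes, L_sizes):
    """u1, u2 from the reduced SVD of K B L, following the definitions above."""
    Kmat = np.diag(np.sqrt(K_sizes))
    Lmat = np.diag(np.sqrt(L_sizes))
    U_B, S_B, V_Bt = np.linalg.svd(Kmat @ B @ Lmat, full_matrices=False)
    r = int(np.sum(S_B > 1e-10))               # keep only the nonzero part
    U_B, V_B = U_B[:, :r], V_Bt[:r, :].T
    u1 = np.max(np.sum(U_B**2, axis=1))         # max_p ||U_B^T e_p||^2
    u2 = np.max(np.sum(V_B**2, axis=1))         # max_q ||V_B^T e_q||^2
    return u1, u2

def population_quantities(mu, nu, w):
    """E1, E0, V for the singleton case M = {mu}, N = {nu}; for multiple
    distributions one would take the min/max over the sets, as defined above."""
    labels = list(w)
    E1 = sum(mu[l] * w[l] for l in labels)        # E_mu[w]
    E0 = sum(nu[l] * (-w[l]) for l in labels)     # E_nu[-w]
    def var(p):
        m = sum(p[l] * w[l] for l in labels)
        return sum(p[l] * (w[l] - m)**2 for l in labels)
    V = max(var(mu), var(nu))
    return E1, E0, V

# Two-label example with illustrative numbers and the log-likelihood-ratio weight.
mu = {'+': 0.3, '-': 0.7}
nu = {'+': 0.2, '-': 0.8}
w = {l: np.log(mu[l] / nu[l]) for l in mu}
print(population_quantities(mu, nu, w))
```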

4.1. Optimal Weights

We now consider the case where there are only two label distributions, i.e. M = {µ} and N = {ν}.

Following previous works such as (Lim et al., 2014) and (Chen et al., 2012), we can view our algorithm as a convex relaxation of the maximum likelihood estimator, which is given by

$$\arg\max_{Y \text{ a cluster matrix}} \log\Pr(\Lambda \mid Y^* = Y)
= \arg\max_{Y \text{ a cluster matrix}} \sum_{i,j} \log\left[\mu(\Lambda_{ij})^{\frac{Y_{ij}-b_0}{b_1-b_0}}\, \nu(\Lambda_{ij})^{\frac{b_1-Y_{ij}}{b_1-b_0}}\right]
= \arg\max_{Y \text{ a cluster matrix}} \sum_{i,j} Y_{ij}\,\log\frac{\mu(\Lambda_{ij})}{\nu(\Lambda_{ij})}.$$

This therefore suggests the weight function

$$w_{\mathrm{MLE}}(l) = \log\frac{\mu(l)}{\nu(l)}.$$

We have the following guarantee for the performance using this weight function.

Theorem 3. Suppose M = {µ} and N = {ν}. Suppose that B is full rank and $|\log\frac{\mu(l)}{\nu(l)}| \le b$, ∀l ∈ L. Let $\zeta = \max\{\frac{D(\mu\|\nu)}{D(\nu\|\mu)}, \frac{D(\nu\|\mu)}{D(\mu\|\nu)}\}$. There exists a universal constant c such that if

$$\min\{D(\mu\|\nu), D(\nu\|\mu)\} > c\,\beta(\zeta+1)(b+2)(n\log n)\max\left\{\frac{u_1^2}{K^2}, \frac{u_2^2}{L^2}\right\}$$

or

$$\sum_{l\in\mathcal{L}}\frac{(\mu(l)-\nu(l))^2}{\mu(l)+\nu(l)} > c\,\beta(\zeta+1)(b+2)(n\log n)\max\left\{\frac{u_1^2}{K^2}, \frac{u_2^2}{L^2}\right\}$$

then with probability at least $1 - n^{-\beta}$ the solution of (1) with $w = w_{\mathrm{MLE}}$ is unique and equals Y*.

D(·‖·) in the theorem denotes the KL-divergence. The weight function $w_{\mathrm{MLE}}$ has been shown to be optimal up to a constant factor, at least for the case with two equal-size clusters (Lim et al., 2014).

4.2. Monotonicity

In the general case where we allow M and N to contain multiple distributions, it is unclear what weight function to use. The following property suggests a "conservative" weight function:

Theorem 4. Suppose the weight function $w(l) = \log\frac{\bar\mu(l)}{\bar\nu(l)}$, ∀l ∈ L, is used in program (1), where $\bar\mu$ and $\bar\nu$ are two distributions on L such that for any distributions µ ∈ M and ν ∈ N we have

$$\frac{\mu(l)}{\nu(l)} \ge \frac{\mu(l)}{\bar\nu(l)} \ge \frac{\bar\mu(l)}{\bar\nu(l)} \ge 1 \qquad\text{or}\qquad \frac{\nu(l)}{\mu(l)} \ge \frac{\nu(l)}{\bar\mu(l)} \ge \frac{\bar\nu(l)}{\bar\mu(l)} \ge 1$$

for all l ∈ L. If $\bar\mu$ and $\bar\nu$ satisfy the conditions of Theorem 3, then with probability at least $1 - n^{-\beta}$ the solution of (1) is unique and equals Y*.

Theorem 4 says that if the distributions in M are label-wise well separated from the distributions in N, at least as much as $\bar\mu$ from $\bar\nu$, then program (1) will perform no worse than if M = {µ̄} and N = {ν̄} with the associated $w_{\mathrm{MLE}}$ based on $\bar\mu$ and $\bar\nu$.

4.3. General Stochastic Block Model for Two Labels

We now discuss the results above by considering a special case of the general stochastic block model where there are only two labels, L = {+, −}. Suppose that all member-pairs of the same bicluster (Cp, Dq) have the same label distribution on L. Let

$$\mu_{pq} = \Pr(\Lambda_{ij} = + \mid i\in C_p,\ j\in D_q), \qquad \forall p, q, i, j.$$

Suppose it is known (or estimated) that all µpq are in a set {µ1, µ2, . . . , µm}. Theorems 3 and 4 suggest a simple strategy to discover the clustering in F and G:

1. Sort the µi for i = 1 . . . m in ascending order and find the largest gap between any two consecutive µi.

2. Suppose the largest gap is between $\mu_{i_0}$ and $\mu_{i_1}$; set $\nu = \mu_{i_0}$ and $\mu = \mu_{i_1}$ with ν < µ, then set N = {µi : µi ≤ ν} and M = {µi : µi ≥ µ}.

3. Solve program (1) with the weight function

$$w(+) = \log\frac{\mu}{\nu} \qquad\text{and}\qquad w(-) = \log\frac{1-\mu}{1-\nu}.$$

To illustrate, we use the example problem posed by Cai & Li (2014). This is a clustering problem (F = G) with 3 clusters, where the µpq for each block is given as follows:

$$\begin{pmatrix} 0.4 & 0.2 & 0.05 \\ 0.2 & 0.3 & 0.05 \\ 0.05 & 0.05 & 0.1 \end{pmatrix}$$

The traditional graph clustering approach where B is a diagonal matrix

$$\begin{pmatrix} b_1 & b_0 & b_0 \\ b_0 & b_1 & b_0 \\ b_0 & b_0 & b_1 \end{pmatrix}$$

could not solve this problem, since the smallest diagonal µpq (0.1) is smaller than the largest off-diagonal µpq (0.2). Using our strategy one could come up with the following assignment, where ν = 0.2 and µ = 0.3:

$$\begin{pmatrix} b_1 & b_0 & b_0 \\ b_0 & b_1 & b_0 \\ b_0 & b_0 & b_0 \end{pmatrix}$$

Suppose µ and ν satisfy the conditions in Theorem 3; then by Theorem 4 the 3 clusters can be identified with high probability, since each row of B is unique.
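A minimal sketch of this three-step strategy (our illustration, not the authors' code): the block probabilities are the Cai & Li example above, and since the largest gap is not unique here, the split ν = 0.2, µ = 0.3 is chosen explicitly, as in the text.

```python
import numpy as np

def largest_gap_split(probs):
    """Steps 1-2: sort the mu_i and split at (one of) the largest gap(s).
    Note the largest gap need not be unique; ties may require trying more
    than one split."""
    p = np.unique(np.asarray(probs, dtype=float))
    i = int(np.argmax(np.diff(p)))
    return p[i], p[i + 1]          # (nu, mu) with nu < mu

def two_label_weights(mu, nu):
    """Step 3: weights for the labels '+' and '-'."""
    return np.log(mu / nu), np.log((1 - mu) / (1 - nu))

# Cai & Li (2014) example from the text: block probabilities mu_pq.
mu_pq = np.array([[0.4, 0.2, 0.05],
                  [0.2, 0.3, 0.05],
                  [0.05, 0.05, 0.1]])

# Several gaps are tied at 0.1 here; following the text we take nu = 0.2, mu = 0.3.
nu, mu = 0.2, 0.3
w_plus, w_minus = two_label_weights(mu, nu)

# Class matrix B: blocks with mu_pq >= mu are class b1, those <= nu are class b0.
B_class = np.where(mu_pq >= mu, 'b1', 'b0')
print(w_plus, w_minus)
print(B_class)   # each row is unique, so Theorem 4 applies
```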

4.4. General Multi-label Multi-distribution Case

In the general case where there are more than two labels, the strategy is to separate all possible distributions into two sets M and N such that they are as "far apart" as possible. The weight function can then be set with respect to a pair of distributions µ ∈ M and ν ∈ N. Sometimes it may be preferable to merge two or more labels into one and use their marginal distributions instead. Ultimately, the performance of program (1) depends on E0, E1 and V in Theorem 1.

Program (1) may be run multiple times, each with a different weight function, to discover finer clustering or biclustering structure within the previous output. Full exploration of these possibilities is left to future work.

5. Implementation

Program (1) can be solved efficiently using the Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011). We provide the pseudocode for a complete implementation in Algorithm 1, with an explicit stopping condition. The inputs and output of Algorithm 1 are the same terms used in program (1). We find that in practice, setting $\lambda = \sqrt{2n}$ works well. The threshold for convergence is specified by ε. We find that ε = 10⁻⁴ is a good tradeoff between the speed of convergence and the quality of the solution. All our experiment results, unless otherwise stated, were based on $\lambda = \sqrt{2n}$ and ε = 10⁻⁴.

An optional Step 7 for updating ρ may potentially improve the speed of convergence. This takes an additional parameter τ. If $\|X^{k+1} - Y^{k+1}\|_F > \tau\rho\|Y^{k+1} - Y^{k}\|_F$ then set ρ := 2ρ and $Q^{k+1} := Q^{k+1}/2$. On the other hand, if $\tau\|X^{k+1} - Y^{k+1}\|_F < \rho\|Y^{k+1} - Y^{k}\|_F$ then set ρ := ρ/2 and $Q^{k+1} := 2Q^{k+1}$. Typically τ = 10 is a stable choice. We refer the reader to Boyd et al. (2011) for further details.

Algorithm 1 ADMM solver for Program (1)
Input: W ∈ R^{n1×n2}, λ, b0, b1, ε
Output: Y
1. ρ := 1, k := 0
2. Y^k := 0, Q^k := 0 (Y^k, Q^k ∈ R^{n1×n2})
3. X^{k+1} := U max{Σ − λ/ρ, 0} V^⊤, where UΣV^⊤ is an SVD of (Y^k − Q^k)
4. Y^{k+1} := min{ max{ X^{k+1} + Q^k + (1/ρ)W, b0 }, b1 } (element-wise)
5. Q^{k+1} := Q^k + X^{k+1} − Y^{k+1}
6. If ‖X^{k+1} − Y^{k+1}‖_F ≤ ε max{‖X^{k+1}‖_F, ‖Y^{k+1}‖_F} and ‖Y^{k+1} − Y^k‖_F ≤ ε‖Q^{k+1}‖_F, then stop and output Y := Y^{k+1}
7. (Optional) Update ρ and Q^{k+1}
8. k := k + 1, go to step 3

In practice, due to finite precision and numerical errors, the output Y of Algorithm 1 will not have entries that are exactly b0 or b1, even if it is a successful recovery. A simple rounding, say, around (b0 + b1)/2, would give the correct solution. The clusters in F (resp. G) can then be obtained by sorting the rows (resp. columns) of Y. Even if the recovery is not 100%, a simple k-means algorithm can be applied to the rows/columns to obtain a desired number of clusters.
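As a companion to Algorithm 1, here is a minimal Python/numpy rendering of the same iteration (our translation, not the authors' Matlab implementation; the function name and defaults are illustrative choices). It includes the optional Step 7 update of ρ and the stopping condition above.

```python
import numpy as np

def admm_biclustering(W, lam, b0=-1.0, b1=1.0, eps=1e-4, tau=10.0,
                      max_iter=10000, adapt_rho=True):
    """ADMM solver for program (1): max <W, Y> - lam * ||Y||_*  s.t. b0 <= Y_ij <= b1."""
    n1, n2 = W.shape
    rho = 1.0
    Y = np.zeros((n1, n2))
    Q = np.zeros((n1, n2))
    for _ in range(max_iter):
        # Step 3: singular value soft-thresholding of (Y - Q) at level lam/rho.
        U, s, Vt = np.linalg.svd(Y - Q, full_matrices=False)
        X = U @ np.diag(np.maximum(s - lam / rho, 0.0)) @ Vt
        # Step 4: element-wise projection of X + Q + W/rho onto the box [b0, b1].
        Y_new = np.clip(X + Q + W / rho, b0, b1)
        # Step 5: dual update.
        Q = Q + X - Y_new
        # Step 6: stopping condition.
        if (np.linalg.norm(X - Y_new) <= eps * max(np.linalg.norm(X), np.linalg.norm(Y_new))
                and np.linalg.norm(Y_new - Y) <= eps * np.linalg.norm(Q)):
            return Y_new
        # Step 7 (optional): adapt rho, rescaling Q accordingly.
        if adapt_rho:
            primal = np.linalg.norm(X - Y_new)
            dual = np.linalg.norm(Y_new - Y)
            if primal > tau * rho * dual:
                rho *= 2.0
                Q /= 2.0
            elif tau * primal < rho * dual:
                rho /= 2.0
                Q *= 2.0
        Y = Y_new
    return Y

# Usage sketch: with lambda = sqrt(2n) as reported in the text, then round around (b0+b1)/2.
# n = max(W.shape)
# Y = admm_biclustering(W, lam=np.sqrt(2 * n), b0=-1.0, b1=1.0)
# Y_rounded = np.where(Y > 0.0, 1.0, -1.0)
```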

6. Experiments

6.1. Synthetic Data

We first evaluate the performance of our algorithm on synthetic data, where the complete ground truth is available. We test a simple setup where there are only two labels and two label distributions. Three different sizes for n1 (with n2 = n1) are used: 100, 500, and 1000. The number of clusters is 5 for n1 = 100 and 10 for the others, all with equal sizes. The matrix B has independently generated ±1 entries with uniform probability. A noise level σ is defined such that each observed label Λij equals Y*ij with probability 1 − σ and equals the opposite value with probability σ. This setup is very similar to that used in (Wulff et al., 2013), except that in our case we report the actual "full recovery rate", i.e. the fraction of trials where the final output Y exactly equals Y*, out of 100 independent trials. Error bars in all figures show 95% confidence intervals.

[Figure 2. Full recovery rate and running time of program (1). Panels plot the full recovery rate and the running time (seconds) against the noise level for Y of size 100×100 (B: 5×5), 500×500 (B: 10×10) and 1000×1000 (B: 10×10), for "k-means", "conv" and "conv + k-means".]

Figure 2 shows the results. The plot "conv" indicates results based on the output of Program (1) with each entry of Y rounded to the closest ±1. Since the noise is symmetric (for +1 and −1 observations), the weight for Program (1) can simply be set to Wij = Λij. Included in the figure are two additional plots, "k-means" and "conv + k-means". The "k-means" results are based on running the k-means algorithm directly on the observed labels (since they are numeric), once on the rows and a separate run on the columns. After running k-means, all entries in the same bicluster ("block") are set to take the majority label in the block. The "conv + k-means" results are based on applying the same k-means process to the output of "conv". Note that Program (1) itself does not need to know the number of clusters, while k-means requires the number of clusters as input. We choose k-means as reference since, despite its simplicity, it is still one of the top performing biclustering methods according to the report by Oghabian et al. (2014).

We can observe from Fig. 2 that the simple k-means algorithm performs rather well, but it is significantly outperformed by "conv", especially for larger n. We recommend running k-means on top of the output of program (1), especially when the number of clusters is known, since this produces the best recovery result. We can also validate both Theorems 1 and 3. The ratios √n/K for the three sizes are 0.50, 0.45 and 0.32 respectively, and indeed the problem becomes "easier" as the ratio gets smaller.

The average computational time needed to finish one instance of program (1) in Matlab on a Core i5 desktop machine is also reported in Fig. 2. It is interesting to note that convergence is typically very fast when the problem instance is either very "easy" (e.g. low noise) or too "hard", while the "borderline" instances usually require a much larger number of iterations to converge.
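A sketch of this synthetic setup and the "full recovery rate" computation (our code, not the authors'; it reuses the admm_biclustering helper sketched after Algorithm 1 and assumes the symmetric ±1 flip-noise model described above).

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_instance(n1, n_clusters, sigma):
    """Equal-size clusters in rows and columns, B with uniform +-1 entries,
    and each label flipped independently with probability sigma."""
    K = n1 // n_clusters
    assign = np.repeat(np.arange(n_clusters), K)
    B = rng.choice([-1.0, 1.0], size=(n_clusters, n_clusters))
    Y_star = B[np.ix_(assign, assign)]
    flips = rng.random((n1, n1)) < sigma
    Lambda = np.where(flips, -Y_star, Y_star)
    return Y_star, Lambda

def full_recovery_rate(n1=100, n_clusters=5, sigma=0.1, trials=100):
    hits = 0
    for _ in range(trials):
        Y_star, Lambda = synthetic_instance(n1, n_clusters, sigma)
        # Symmetric noise: the weight matrix is simply the observed label matrix.
        Y = admm_biclustering(Lambda, lam=np.sqrt(2 * n1), b0=-1.0, b1=1.0)
        Y_rounded = np.where(Y > 0.0, 1.0, -1.0)   # round to the closest +-1
        hits += np.array_equal(Y_rounded, Y_star)
    return hits / trials
```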
6.1.1. One-Sided Biclustering

Our second set of experiments evaluates the prediction of Theorem 2. In particular, when large clusters exist in F (i.e. the rows), and B is more or less uniformly distributed, the recovery of Y* depends mainly on K and not L. Figure 3 shows two sets of plots. Both experiments use n1 = 400 and n2 = 800. B has ±1 entries and the observations consist of two labels, 0 or 1. Entries with Y* = +1 produce the label 1 with probability 0.7, while the rest produce the label 1 with probability 0.05. This is the case where the observation noise is non-symmetric.

In the left figure, B in each trial is generated with uniformly random ±1 entries. Equal-size clusters are generated in both the rows and the columns, where each row (resp. column) cluster has size K (resp. L). We test the biclustering performance in two cases: (1) fixed K and varying L, (2) varying K and L. Case 1 is shown in the blue plot while case 2 is in red. As predicted by Theorem 2, the full recovery rate remains high in case 1, where K remains fixed and large regardless of L. In case 2, the performance drops when both K and L become small, as expected.

In the right figure, we evaluate the biclustering performance in a scenario where the assumption of Theorem 2 is violated. In particular, we force B to be a 10 × 10 diagonal matrix: the number of clusters is fixed at 10 in both the rows and the columns. Note that in this case, u1 and u2 both equal 1. The first cluster in the rows (resp. columns) is chosen to have size equal to K (resp. L). The remaining clusters all have equal, but larger, sizes. Here, we observe that the performance drops significantly in both cases, even though K is held large in the first case.

[Figure 3. Full recovery rate for two different setups with varying cluster sizes. Left: B has uniform ±1 entries. Right: B diagonal. Both panels plot the full recovery rate against L for the cases K = 40 and K = L/2.]

6.2. Real World Data

We further evaluate our algorithm on real world data, where the true data generating model is unknown.

6.2.1. Animal-Feature Dataset

Our first dataset is the animal-feature data originally published by Osherson et al. (1991), which has been used in (Wulff et al., 2013) and (Kemp et al., 2006) for biclustering. The dataset contains human-produced associations between a set of 50 animals and 85 features. For each animal-feature pair, the degree of association, a real value between 0 and 100, is provided. The biclustering task is to simultaneously cluster the animals and features. Unfortunately no ground truth is available, so evaluation is subjective. Nevertheless this allows a qualitative comparison with the results reported in, e.g., (Wulff et al., 2013) and (Kemp et al., 2006).

We use the raw data in two ways. First, we follow Kemp et al. (2006) and use the global mean value t ≈ 20.8 as the cutoff point between a "yes" and a "no". For the "yes" entries, we assume that the probability of error decreases linearly with the recorded degree of association, where we set 101 to be the point of 0 error probability while the probability of error at t is 0.5. Similarly, for the "no" entries the point of 0 error is set at −1. We then employ the MLE weight and run program (1). To extract clusters (both the animals and the features) we run k-means on the rows and columns of the program output Y. Given that this is unlikely to be the true model, we expect a need to adjust the regularization parameter λ as well as the number of clusters. We tested a range of λ and cluster sizes, and chose a combination that produces clusters with the most uniform sizes. Figure 4 shows the result with $\lambda = \sqrt{n}$, for 9 animal clusters and 20 feature clusters. There is randomness involved in using k-means that depends on the initialization, but for the chosen parameters the resulting clusters are more-or-less stable. We find the results reasonable and not unlike those obtained by Wulff et al. (2013) and Kemp et al. (2006).
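A sketch of one possible reading of this preprocessing (ours, not the authors' code): the per-entry error model follows the linear description above, but the exact way the error probabilities are turned into MLE weights is not spelled out in the text, so the weight construction below is an assumption on our part.

```python
import numpy as np

def animal_feature_weights(scores, t=20.8):
    """Per-entry error probabilities for association scores in [0, 100], under the
    linear error model described above, and a plausible per-entry log-likelihood
    ratio weight built from them (an assumption, not necessarily the paper's)."""
    scores = np.asarray(scores, dtype=float)
    yes = scores > t
    # "yes" entries: error 0.5 at score t, decreasing linearly to 0 at score 101.
    err_yes = 0.5 * (101.0 - scores) / (101.0 - t)
    # "no" entries: error 0.5 at score t, decreasing linearly to 0 at score -1.
    err_no = 0.5 * (scores + 1.0) / (t + 1.0)
    err = np.where(yes, err_yes, err_no)
    # Treat each entry as a noisy yes/no label with flip probability err;
    # the corresponding weight is +-log((1-err)/err).
    return np.where(yes, 1.0, -1.0) * np.log((1.0 - err) / err)

# Usage (hypothetical): scores is the 50 x 85 animal-feature association matrix.
# W = animal_feature_weights(scores)
# Y = admm_biclustering(W, lam=np.sqrt(max(W.shape)))   # lambda = sqrt(n), as in the text
```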

[Figure 4. Left: Program (1) output Y. Right: Raw data with the same rows/columns order. The animal clusters (A1-A9) and a subset of the feature clusters (F1-F9) shown in the figure are listed below.]

A1: leopard, lion, wolf, bobcat, fox, german shepherd, tiger, grizzly bear, polar bear
A2: blue whale, humpback whale, walrus, seal, dolphin, killer whale, otter
A3: rabbit, mouse, hamster, squirrel, mole, beaver, skunk
A4: giant panda, sheep, cow, ox, pig, buffalo
A5: zebra, horse, antelope, deer, giraffe, moose
A6: chihuahua, persian cat, collie, siamese cat, dalmatian
A7: weasel, rat, raccoon, bat
A8: spider monkey, chimpanzee, gorilla
A9: elephant, hippopotamus, rhinoceros

F1: flippers, ocean, water, swims, fish, arctic
F2: fierce, hunter, stalker, meat, meatteeth, muscle
F3: walks, quadrapedal, ground, furry, chewteeth, brown
F4: smart, fast, active, agility
F5: claws, solitary, paws, small
F6: group, big, strong
F7: newworld, tail, oldworld
F8: white, domestic
F9: hooves, grazer, plains, fields, vegetation

6.2.2. Gene Expression Data

For our second real-world example, we use the gene expression data for leukemia, in the form provided by Monti et al. (2003). This is a filtered version of the original raw data published in (Golub et al., 1999), with mostly uninformative genes removed. These are DNA microarray data collected from bone marrow samples of acute leukemia patients, with 11 acute myeloid leukemia (AML) samples, 8 T-lineage acute lymphoblastic leukemia (ALL) samples, and 19 B-lineage ALL samples. The "features" are the recorded expression levels of 999 genes. The expression levels are real values with widely different ranges for each gene. We scale and shift the expression levels for each gene such that the mean is 0 and the standard deviation is 1. Since the actual model is unknown, we assume a Gaussian distribution, where we use N(1, 1) for µ and N(−1, 1) for ν. The MLE weight is therefore log(µ(l)/ν(l)), where l is the observed expression level.² Note that the resulting weight is simply a linear function of l (indeed, log[N(l; 1, 1)/N(l; −1, 1)] = ((l + 1)² − (l − 1)²)/2 = 2l).

²Although in general the likelihood ratio is unbounded for Gaussian distributions, a standard truncation trick can be used to bound the weights such that our theoretical results still hold (see Lim et al., 2014).

Since the type of each sample is known, we can evaluate the clustering performance for the 38 samples. Again, we ran k-means on the program output and tested a range of λ. Figure 5 shows the clustering performance with respect to λ, against the ground-truth clustering, in terms of the adjusted mutual information and the percent of sample-pairs that are correctly clustered together. The two performance measures give qualitatively similar results and the best performance is achieved around $\lambda = \sqrt{5n}$. In general, a small λ would produce a finer clustering structure while a larger λ produces a coarser structure. Note that using λ = 0 is equivalent to simply thresholding the raw value to the nearest ±1 (right panel of Fig. 6). Running k-means directly on the raw data (middle panel of Fig. 6) results in the worst performance, with an adjusted mutual information of 0.4.

Figure 6 shows the program output Y (for $\lambda = \sqrt{5n}$), along with the raw data and its thresholded version. Each column corresponds to one sample, and they have been arranged according to the ground truth clustering, where columns 1-19 correspond to B-ALL, 20-27 to T-ALL and 28-38 to AML. We observe that finer clustering seems to exist especially within the B-ALL group, and according to Monti et al. (2003) it is widely accepted that meaningful sub-classes of ALL do exist, but the composition and nature of these sub-classes is not as well accepted.

[Figure 5. Clustering performance as a function of λ²/n for "conv + k-means" and "k-means (raw)". Left: Adjusted mutual information. Right: Fraction of pairs correctly clustered together.]

[Figure 6. Left: Output Y from program (1). Middle: Raw data. Right: Raw data (binarized).]

Acknowledgments

S.H. Lim and H. Xu were supported by the Ministry of Education of Singapore through AcRF Tier Two grants R265000443112 and R265000519112, and A*STAR Public Sector Funding R265000540305. Y. Chen was supported by NSF grant CIF-31712-23800 and ONR MURI grant N00014-11-1-0688.

References

Ames, B. and Vavasis, S. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69–89, 2011.

Ames, Brendan P.W. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, pp. 1–37, 2013. ISSN 0025-5610. doi: 10.1007/s10107-013-0729-x.

Anandkumar, Anima, Ge, Rong, Hsu, Daniel, and Kakade, Sham M. A tensor spectral approach to learning mixed membership community models. arXiv preprint arXiv:1302.2684, 2013.

Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56(1), 2004.

Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, and Eckstein, Jonathan. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.

Cai, T. Tony and Li, Xiaodong. Robust and computationally feasible community detection in the presence of arbitrary outlier nodes. arXiv preprint arXiv:1404.6000, 2014.

Chaudhuri, K., Chung, F., and Tsiatas, A. Spectral clustering of graphs with general degrees in the extended planted partition model. In COLT, 2012.

Chen, Y., Sanghavi, S., and Xu, H. Clustering sparse graphs. In NIPS, 2012.

Chen, Y., Jalali, A., Sanghavi, S., and Xu, H. Clustering partially observed graphs via convex optimization. Journal of Machine Learning Research, 15:2213–2238, June 2014.

Cheng, Yizong and Church, George M. Biclustering of expression data. In Proc. of the 8th ISMB, pp. 93–103. AAAI Press, 2000.

Demaine, E. D., Immorlica, N., Emmanuel, D., and Fiat, A. Correlation clustering in general weighted graphs. SIAM special issue on approximation and online algorithms, 2005.

Eren, Kemal, Deveci, Mehmet, Küçüktunç, Onur, and Çatalyürek, Ümit V. A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics, 2012.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., and Bloomfield, C. D. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

Hartigan, J. A. Direct Clustering of a Data Matrix. Journal of the American Statistical Association, 67(337):123–129, 1972. ISSN 01621459. doi: 10.2307/2284710.

Holland, Paul W., Laskey, Kathryn B., and Leinhardt, Samuel. Stochastic blockmodels: Some first steps. Social Networks, 5:109–137, 1983.

Kannan, R., Vempala, S., and Vetta, A. On clusterings: good, bad and spectral. In IEEE Symposium on Foundations of Computer Science, 2000.

Kemp, Charles, Tenenbaum, Joshua B., Griffiths, Thomas L., Yamada, Takeshi, and Ueda, Naonori. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI'06, pp. 381–388. AAAI Press, 2006.

Kolar, Mladen, Balakrishnan, Sivaraman, Rinaldo, Alessandro, and Singh, Aarti. Minimax localization of structural information in large noisy matrices. In NIPS, pp. 909–917, 2011.

Lelarge, Marc, Massoulié, Laurent, and Xu, Jiaming. Reconstruction in the Labeled Stochastic Block Model. In IEEE Information Theory Workshop, Seville, Spain, September 2013. URL http://hal.inria.fr/hal-00917425.

Lim, S.H., Chen, Y., and Xu, H. Clustering from labels and time-varying graphs. In NIPS, 2014.

Mathieu, C. and Schudy, W. Correlation clustering with noisy input. In SODA, pp. 712, 2010.

McSherry, F. Spectral partitioning of random graphs. In FOCS, pp. 529–537, 2001.

Monti, Stefano, Tamayo, Pablo, Mesirov, Jill, and Golub, Todd. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn., 52(1-2):91–118, July 2003.

Oghabian, Ali, Kilpinen, Sami, Hautaniemi, Sampsa, and Czeizler, Elena. Biclustering methods: Biological relevance and application in gene expression analysis. PLoS ONE, 9(3), 2014.

Osherson, D.N., Stern, J., Wilkie, O., Stob, M., and Smith, E.E. Default probability. Cognitive Science, 15(2):251–269, 1991.

Puleo, G. J. and Milenkovic, O. Correlation Clustering with Constrained Cluster Sizes and Extended Weights Bounds. ArXiv e-print 1411.0547, 2014.

Rohe, K., Chatterjee, S., and Yu, B. Spectral clustering and the high-dimensional stochastic block model. Annals of Statistics, 39:1878–1915, 2011.

Rudelson, Mark and Vershynin, Roman. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.

Shamir, O. and Tishby, N. Spectral Clustering on a Budget. In AISTATS, 2011.

Swamy, C. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, 2004.

Tanay, Amos, Sharan, Roded, and Shamir, Ron. Biclustering algorithms: A survey. In Handbook of Computational Molecular Biology, 2005.

Vinayak, Ramya Korlakai, Oymak, Samet, and Hassibi, Babak. Graph clustering with missing data: Convex algorithms and analysis. In Advances in Neural Information Processing Systems, pp. 2996–3004, 2014.

Wulff, S., Urner, R., and Ben-David, S. Monochromatic Bi-Clustering. In ICML, 2013.

Xu, Jiaming, Wu, Rui, Zhu, Kai, Hajek, Bruce, Srikant, R., and Ying, Lei. Jointly clustering rows and columns of binary matrices: Algorithms and trade-offs. SIGMETRICS Perform. Eval. Rev., 42(1):29–41, June 2014.