High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic

Arifa Nisar∗  Waseem Ahmad†  Wei-keng Liao‡  Alok Choudhary§

∗Department of Electrical Engineering and Computer Science, Northwestern University.
†Current affiliation: A9.COM. Research was carried out while the author was with the ECE department at the University of Illinois, Chicago.
‡Department of Electrical Engineering and Computer Science, Northwestern University.
§Department of Electrical Engineering and Computer Science, Northwestern University.

Abstract

Biclustering refers to the simultaneous clustering of objects and their features. Use of biclustering is gaining momentum in areas such as text mining, gene expression analysis and collaborative filtering. Due to the requirements for high performance in large scale data processing applications, such as collaborative filtering in E-commerce systems and large scale genome-wide gene expression analysis in microarray experiments, a high performance parallel/distributed solution for the biclustering problem is highly desirable. Recently, Ahmad et al. [1] showed that Bipartite Spectral Partitioning, which is a popular technique for biclustering, can be reformulated as a graph drawing problem whose objective is to minimize Hall's energy of the bipartite graph representation of the input data. They showed that the optimal solution to this problem is achieved when nodes are placed at the barycenter of their neighbors. In this paper, we provide a parallel algorithm for biclustering based on this formulation. We show that parallel energy minimization using the barycenter heuristic is embarrassingly parallel. The challenge is to design a bicluster identification scheme which is scalable as well as accurate. We show that our parallel implementation is not just extremely scalable; it is comparable in accuracy as well with the serial implementation. We have evaluated the proposed parallel biclustering algorithm with large synthetic data sets on up to 256 processors. Experimental evaluation shows large superlinear speedups, scalability and a high level of accuracy.

1 Introduction

Biclustering (Subspace Simultaneous Clustering) is a very useful technique for text mining, collaborative filtering and gene expression analysis and profiling [2] [3]. Recently, it has gained increasing popularity in the analysis of gene expression data [3]. Biclustering is a significantly useful and considerably harder problem than traditional clustering. Whereas a cluster is a set of objects with similar values over the entire set of attributes, a bicluster can be composed of objects with similarity over only a subset of attributes. Many clustering techniques, such as those based on Nearest Neighbor Search, are known to suffer from the "Curse of Dimensionality" [4]. With a large number of dimensions, similarity functions such as Euclidean distance tend to perform poorly as similarity diffuses over these dimensions. For example, in text mining, the set of keywords describing different documents is very large and yields a sparse data matrix [5]. In this case, clusters based on the entire keywords set may have no meaning for the end users.

Biclustering finds local patterns where a subset of objects might be similar to each other based on only a subset of attributes. The comparison between biclusters and clusters is illustrated in Figure 1. Figure 1(a) demonstrates the concept of clusters and 1(b) demonstrates the biclusters in an input data matrix. Note that biclusters can cover just part of the rows or columns and may overlap with each other, as shown in Figure 1(b). The biclustering process can generate a wide variety of object groups that capture all the significant correlation information present in a data set. Biclustering has become very popular in discovering patterns from gene microarray experiment data. Since a bicluster can identify patterns among a subset of genes based on a subset of conditions, it models condition-specific patterns of co-expression. Moreover, biclustering can identify overlapping patterns, thus catering to the possibility that a gene may be a member of multiple pathways.

Unprecedented growth in web related text based data repositories [6] and the evolution of large biological data sets [7] have resulted in the need for high performance biclustering in order to cope with the largeness of the data sets efficiently. Given the memory size limit of single processor machines, a parallel/distributed realization of biclustering algorithms is desirable to enable scalable applications for arbitrarily large data sets.

Distributed solutions for the biclustering problem become essential also when the underlying data is inherently distributed in nature. Consider the example of a retail store chain such as Wal-Mart, which has numerous stores spread across different parts of the world. The consumer transaction data gets stored in these stores such that users are different across stores but items are generally the same. It can be viewed as a row-wise distribution of the customer transaction data. Biclustering of this distributed data can be carried out in two ways. One is to transfer the data to a central location and perform subsequent sequential mining. The other way is to perform distributed biclustering of the data. The former approach has serious scalability issues, as it is not feasible to transfer data from thousands of locations to a central location reliably and efficiently. The central approach is also harder to adopt because of security and privacy issues. This leaves us with the task of developing a distributed solution to the biclustering problem. There has already been some work done in the parallel/distributed biclustering area, especially in the context of document-word co-clustering [8], collaborative filtering for online recommendation systems [9] and gene expression analysis [10].

[Figure 1: Illustration of Clusters and Biclusters. (a) Clusters of Objects. (b) Biclusters of Objects.]

Bipartite Spectral Partitioning is a powerful technique to achieve biclustering. Normalized cut is the most popular objective function for the spectral partitioning problem [11], [12]. In the absence of a polynomial time exact solution for this problem, the emphasis is on finding efficient heuristics. A popular implementation [12] requires computation of the Fiedler vector for the Laplacian matrix of the input data. Its computation complexity is quadratic in the data size and hence not suitable for large data. Recently, Ahmad et al. [1] proposed a new formulation of the spectral partitioning problem as a Graph Drawing (GD) problem. They showed that minimization of Hall's energy function [1] corresponds to finding the normalized cut of the bigraph. Moreover, they also showed that an optimal solution to Hall's energy minimization problem is achieved by placing each node at the barycenter of its neighbors. Based on these results, they proved that the barycenter heuristic can provide a scalable linear time solution to the bipartite spectral partitioning problem. In this paper, we study the effectiveness of this approach for dealing with the parallel/distributed biclustering problem on large scale datasets.

Going by the above-mentioned retail chain scenario, we assume a horizontal partitioning of the data where each location has the same feature space but objects are different. Similar assumptions were made in [8] and [9]. The parallel implementation is divided into three stages. The first stage involves a parallel implementation of the barycenter heuristic for crossing minimization. Identification of local biclusters is part of the second stage. The third stage involves determination of global biclusters at each processor. This is accomplished by comparison of local biclusters with bicluster representatives received from other processors. A bicluster representative is merely a vector composed of the means of bicluster columns over all rows of the bicluster. Each processor performs a Euclidean distance based similarity comparison with bicluster representatives received from other processors. The similar biclusters are subsequently merged to obtain global biclusters. Clearly, during both stages of communication, the size of communication is bounded by the number of columns in the data matrix.

We show that parallel/distributed implementations of most of the existing biclustering algorithms incur communication overhead which grows at least linearly with the number of rows. On the other hand, the proposed crossing minimization based parallel algorithm, being independent of the number of rows, has a significant advantage for huge data sets on a large number of processors. We show that this efficient distributed implementation is comparable in accuracy to the sequential one [1]. We have evaluated the performance and accuracy of the parallel biclustering algorithm on large data sets comprising millions of rows. The evaluation results are discussed in detail in Section 6.

Organization: The rest of the paper is organized as follows. We discuss related work in Section 2, Section 3 talks briefly about the sequential biclustering algorithm, and Section 4 proposes a new model for parallel biclustering using the barycenter heuristic and provides an efficient algorithm. Section 5 discusses the complexity of the proposed scheme. The experimental framework is given in Section 6, and conclusions are drawn in Section 7.

2 Related Work

Recently, there have been some research efforts aimed at solving the parallel biclustering problem. ParRescue [8] is one such solution, which is based on the Minimum Sum-Squared Residue (MSSR) clustering algorithm of Dhillon et al. [13]. In our view, the MSSR based biclustering approach inherently suffers from the following problems.

1. The MSSR based biclustering approach is unable to determine overlapping biclusters. Overlapping biclusters are highly desired in emerging application areas such as gene expression analysis and collaborative filtering [14].

2. In this approach, row and column clusters are separately identified. It is unclear how these separately identified row and column clusters can lead to an effective biclustering solution.

3. The parallel solution incurs an O(t·p·n·(k+l)) communication overhead, where t is the total number of iterations, p is the number of processors, n is the number of columns, and k and l refer to the numbers of row and column clusters respectively. As we show later, this communication overhead is around (k + l) times more than that introduced by our implementation.

4. Finally, we believe that the MSSR based approach suffers from an initialization problem. Note that this algorithm initializes the row and column cluster indicator matrices randomly. This random assignment determines the quality of the overall clustering solution in some cases, which is a drawback. Moreover, the algorithm also requires the user to specify the numbers of row and column clusters. Since these numbers have a strong impact on the final output clusters, the algorithm has an undesirable dependence on these input parameters. Note that the barycenter heuristic based biclustering algorithm does not suffer from these drawbacks.

Dhillon et al. in [15] gave an information theoretic biclustering (co-clustering) algorithm. They treat the (normalized) non-negative contingency table as a joint probability distribution between two discrete random variables that take values over the rows and columns. Their subsequent work [16] provided a Bregman Divergence based loss function which is applicable to all density functions belonging to the exponential family. Bregman Divergence is defined as follows [17]. If f is a strictly convex real-valued function, the f-entropy of a discrete measure p(x) ≥ 0 is defined by

H_f(p) = −Σ_x f(p(x))

and the Bregman divergence B_f(p; q) is given as

(2.1)  B_f(p; q) = Σ_x [ f(p(x)) − f(q(x)) − ∇f(q(x)) (p(x) − q(x)) ]

When f(x) = x log x, H_f is the Shannon entropy and B_f(p; q) is the I-divergence; when f(x) = −log(x) we get the Burg entropy and the discrete Itakura-Saito distortion

B_f(p; q) = Σ_x ( log(q(x)/p(x)) + p(x)/q(x) − 1 )
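To make Equation (2.1) concrete, the following minimal C++ sketch (our illustration, not code from [16] or [17]) instantiates the Bregman divergence for the two generating functions just mentioned; it assumes strictly positive inputs, and the function names are ours.

```cpp
#include <cmath>
#include <vector>

// Bregman divergence B_f(p; q) = sum_x [ f(p(x)) - f(q(x)) - f'(q(x)) (p(x) - q(x)) ]
// instantiated for two choices of the convex function f (Equation 2.1).
// Both assume p[x] > 0 and q[x] > 0 for all x.

// f(x) = x log x  ->  I-divergence: sum_x [ p log(p/q) - p + q ]
double i_divergence(const std::vector<double>& p, const std::vector<double>& q) {
    double d = 0.0;
    for (std::size_t x = 0; x < p.size(); ++x)
        d += p[x] * std::log(p[x] / q[x]) - p[x] + q[x];
    return d;
}

// f(x) = -log x  ->  Itakura-Saito distortion: sum_x [ log(q/p) + p/q - 1 ]
double itakura_saito(const std::vector<double>& p, const std::vector<double>& q) {
    double d = 0.0;
    for (std::size_t x = 0; x < p.size(); ++x)
        d += std::log(q[x] / p[x]) + p[x] / q[x] - 1.0;
    return d;
}
```

Expanding f and its derivative in Equation (2.1) for either choice of f reproduces exactly the closed forms computed in the loops above.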
A parallel biclustering (co-clustering) framework was also proposed in [9], aimed at providing a scalable solution to the collaborative filtering problem for online recommendation systems. Their contribution is based on parallelization of the weighted biclustering (co-clustering) algorithm of [16]. The paper does not describe the parallel algorithm in great detail. In their theoretical analysis, they suggest that the overall computation time of the algorithm is O((mn + mkl + nkl)/N + Nkl), where N is the number of processors, m and n are the numbers of rows and columns respectively, and k and l are the numbers of column clusters and row clusters respectively. They have not fully discussed the communication cost of their approach. It is interesting to note that their graph of execution time with increasing number of processors (Figure 4 of [9]) shows that the execution time for the BookCrossing dataset is around 30 seconds for the single processor case, while the execution time in the case of 15 processors has decreased only by a factor of 3, to 10 seconds. This indicates that a 15 times increase in the number of processors has only resulted in a 3 times improvement in execution time. Moreover, the curve essentially straightens up (saturates) right after the 8 processor case. Clearly the algorithm does not have linear speed-up characteristics. We believe this is because of the high communication cost incurred during accumulation of global matrix averages. On the other hand, our proposed parallel implementation of barycenter heuristic based biclustering is able to attain super-linear speedups for even hundreds of processors.

3 Biclustering using Sequential Barycenter Heuristic

3.1 Preliminaries Throughout the paper, we denote a matrix by capital boldface letters such as G, I, etc. Vectors are denoted by small boldface letters such as p, and matrix elements are represented by small letters such as w_ij. Also, we denote a graph by G(V, E), where V is the vertex set and E is the edge set of the graph. Moreover, each edge, denoted by {i, j}, has a weight w_ij. The adjacency matrix of G, denoted as G, is defined as

G_ij = w_ij if there is an edge {i, j}, and 0 otherwise.

Definition 3.1. Bipartite Graph: A graph G(V, E) is termed Bipartite if V = V0 ∪ V1, where V0 and V1 are disjoint sets of vertices (i.e., V0 ∩ V1 = ∅) and each edge in E has one end point in V0 and the other end point in V1.

We consider a weighted bipartite graph G(V0, V1, E, W) with W = (w_ij), where w_ij > 0 denotes the weight of the edge {i, j} between vertices i and j. Moreover, w_ij = 0 if there is no edge between i and j.

For the above mentioned retail chain example, users are represented by the vertex set V0 and items are represented by the vertices in V1. The weight matrix W, in this case, represents the quantity of each item bought by each user. Generally, in microarray experiment data, genes and conditions are represented by the V0 and V1 vertex sets respectively. The edge weight w_ij represents the response of the i'th gene to the j'th condition. Similar is the case for document-word data, where documents are represented by the V0 vertex set and words are represented by the V1 vertex set. In this case, the matrix W is the term frequency matrix for the whole corpus.

In graph theory, a cut is a partition of the vertices of a graph into two sets. Formally, for a partition of the vertex set V into two subsets S and T, a cut can be defined as follows:

cut(S, T) = Σ_{i∈S, j∈T} w_ij

This definition of cut can easily be extended to k vertex subsets:

cut(V1, V2, ..., Vk) = Σ_{i<j} cut(Vi, Vj)

The bipartite graph partitioning problem can then be posed as

cut(V01 ∪ V11, V02 ∪ V12, ..., V0k ∪ V1k) = min_{V1,V2,...,Vk} cut(V1, V2, ..., Vk)

where V1, V2, ..., Vk is any partitioning of the overall vertex set V = V0 ∪ V1 into k vertex subsets, and V0k and V1k denote the k'th subsets of the two groups of vertices in the bipartite graph.

Min-cut may result in unbalanced partitions. The solution is to employ normalized cut instead. The normalized cut for two partitions S and T is defined as

(3.2)  Ncut(S, T) = cut(S, T)/vol(S) + cut(S, T)/vol(T)

and the minimum normalized cut seeks the partition minimizing this quantity. Here vol(S) refers to the total weight of all edges originating from group S. Normalized cut yields a more balanced partitioning.

It is well known that the graph partitioning problem is NP-complete [18]. Many heuristic methods exist that can find a local minimum of the problem. Spectral graph partitioning heuristics based on finding the normalized cut of the graph are known to perform well [11], [12].

Definition 3.2. Bipartite Drawing: A bipartite drawing of a bigraph G is the embedding of its vertex sets V0 and V1 onto distinct points on two horizontal lines y0, y1 respectively, while edges are drawn as straight line segments.

Definition 3.3. Hall's Energy Function for a Bipartite Drawing: Let h(v_k) be the x-coordinate of v_k ∈ V; then the energy of the bipartite drawing is defined as

E = (1/2) Σ_{i,j=1}^{n} w_ij (h(v_i) − h(v_j))^2

Hall's energy function essentially assigns similar coordinates to vertices that are connected by edges with large weights. If the edges indicate the similarity between the vertices, minimization of Hall's energy would amount to bringing similar vertices together. Intuitively, it can also be thought of as a process where the graph is drawn so that strongly connected vertices are pulled close to one another.
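As a small illustration of Definition 3.3 (ours, not part of the original algorithm description), Hall's energy of a drawing can be evaluated directly from the edge list:

```cpp
#include <vector>

struct Edge { int i, j; double w; };  // edge {i, j} with weight w_ij

// Hall's energy E = 1/2 * sum_{i,j} w_ij (h(v_i) - h(v_j))^2.
// Iterating over each undirected edge once accounts for the factor 1/2,
// since the double sum counts every edge twice.
double halls_energy(const std::vector<Edge>& edges, const std::vector<double>& h) {
    double e = 0.0;
    for (const Edge& ed : edges) {
        double d = h[ed.i] - h[ed.j];
        e += ed.w * d * d;
    }
    return e;
}
```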

Theorem 3.4. Let L = D − W be the Laplacian of the graph, where D is the diagonal degree matrix with D_ii = Σ_j w_ij, and let q be the vector of coordinates assigned to the vertices. Then

E = q^T L q = (1/2) Σ_{i,j=1}^{n} w_ij (q_i − q_j)^2

Proof.

E = (1/2) Σ_{i,j=1}^{n} w_ij (q_i − q_j)^2
  = (1/2) Σ_{i,j=1}^{n} w_ij q_i^2 + (1/2) Σ_{i,j=1}^{n} w_ij q_j^2 − Σ_{i,j=1}^{n} w_ij q_i q_j

Since W is symmetric, w_ij = w_ji, so

E = Σ_{i,j=1}^{n} w_ij q_i^2 − Σ_{i,j=1}^{n} w_ij q_i q_j
  = Σ_{i=1}^{n} ( Σ_{j=1}^{n} w_ij ) q_i^2 − Σ_{i,j=1}^{n} w_ij q_i q_j

Since w_ii = 0 for all i,

E = Σ_{i=1}^{n} ( Σ_{j=1}^{n} w_ij ) q_i^2 − Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} w_ij q_i q_j
  = Σ_{i=1}^{n} L_ii q_i^2 + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} L_ij q_i q_j
  = Σ_{i,j=1}^{n} L_ij q_i q_j
  = q^T L q

The above theorem shows that minimization of Hall's energy corresponds to finding the normalized cut of the graph. Now we show that the assignment of x-coordinates to each node based on the barycenter of its neighbors results in the optimal minimum value of Hall's energy function.

Theorem 3.5. The minimum value of Hall's energy function is achieved when

h(v_i) = Σ_{j=1}^{n} w_ij h(v_j) / Σ_{j=1}^{n} w_ij

i.e., each node is assigned a new coordinate value which is the barycenter of its neighbors' coordinates.

Proof. Since Hall's energy function is convex, we can determine its minimum value by simply computing ∂E/∂h(v_i) = 0:

∂E/∂h(v_i) = ∂/∂h(v_i) [ (1/2) Σ_{i,j=1}^{n} w_ij (h(v_i) − h(v_j))^2 ] = 0
⟹ Σ_{j=1}^{n} w_ij (h(v_i) − h(v_j)) = 0
⟹ h(v_i) = Σ_{j=1}^{n} w_ij h(v_j) / Σ_{j=1}^{n} w_ij

It is clear from Theorem 3.5 that a bipartite drawing which assigns each node an x-coordinate that is the barycenter of its neighbors would provide an optimal solution to Hall's energy minimization problem for the given bipartite graph. Also, from Theorem 3.4, we know that Hall's energy function is equivalent to the objective function for the normalized cut of the bigraph, which in turn is the objective function for bipartite spectral partitioning. Spectral clustering solutions are based on finding the Fiedler vector (an eigenvector) of the Laplacian of the given graph. Eigenvalues and their corresponding eigenvectors are computed using techniques such as Singular Value Decomposition (SVD). These techniques typically have quadratic complexity and require random access to complete data sets. This complexity limits their effectiveness for processing very large matrices.

3.2 Bigraph Crossing Minimization using Barycenter Heuristic As described above, the optimal solution to Hall's energy minimization problem is attained when each vertex is placed at the barycenter (mean of ordinal values) of its neighbors. In the graph drawing domain, the barycenter heuristic is a well-known technique which attempts to draw a graph on a plane such that edge crossings are minimized. It is an iterative process. During each iteration, vertex positions on one layer are fixed while the position of each vertex on the other layer is computed as the mean of the positions of its neighbors on the fixed layer. Each iteration of the barycenter heuristic requires O(|E| + |V| log |V|) time. For graphs with a constant degree bound, the barycenter heuristic can also be implemented in linear time per iteration.

Let v_i represent the i'th node in the non-static (dynamic) layer and let N_i represent the set of neighbors of v_i. Also let r_j represent the rank of the j'th member of the set N_i. Then the new rank of v_i, denoted as r_i, can be calculated as follows.

1. We first calculate the weighted mean, denoted as µ, of the ranks of the neighbor nodes using Equation 3.3.

2. These weighted means represent the new ordering of the nodes. Since these means are not necessarily unique, we adjust them so that each node is assigned a unique rank based on its weighted mean.

(3.3a)  s̃ = Σ_{j∈N_i} w_ij × r_j

(3.3b)  s = Σ_{j∈N_i} w_ij

(3.3c)  µ_i = s̃ / s
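A sketch of one barycenter pass implementing Equations (3.3a)-(3.3c) is given below. This is our illustrative rendering, not the authors' exact code; a stable sort is used to turn the (possibly non-unique) weighted means into unique ranks, as required by step 2.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Neighbor { int id; double w; };  // fixed-layer neighbor index and edge weight

// One barycenter iteration: nodes on the dynamic layer are re-ranked by the
// weighted mean (Equation 3.3c) of the current ranks r[] of their neighbors.
void barycenter_pass(const std::vector<std::vector<Neighbor>>& nbrs,
                     const std::vector<double>& r,   // ranks of fixed-layer nodes
                     std::vector<int>& rank)         // out: unique dynamic-layer ranks
{
    std::size_t n = nbrs.size();
    std::vector<double> mu(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double s_tilde = 0.0, s = 0.0;               // (3.3a) and (3.3b)
        for (const Neighbor& nb : nbrs[i]) { s_tilde += nb.w * r[nb.id]; s += nb.w; }
        mu[i] = (s > 0.0) ? s_tilde / s : 0.0;       // (3.3c)
    }
    // Order nodes by weighted mean; the stable sort breaks ties
    // deterministically, so each node receives a unique rank (step 2).
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return mu[a] < mu[b]; });
    rank.assign(n, 0);
    for (std::size_t pos = 0; pos < n; ++pos) rank[order[pos]] = static_cast<int>(pos);
}
```

Alternating this pass between the two layers, with the roles of fixed and dynamic layers swapped, reproduces the iterative process described above.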

3.3 Bicluster Identification After crossings are minimized, we perform the bicluster identification process. This process is performed on the reordered matrix representation of the bigraph. The bicluster identification process starts from the first element of the reordered matrix and keeps adding new columns and rows while calculating and updating a score representing the coherence of the values in the current block of the reordered matrix. This coherence is determined by means of two kinds of distance functions, namely Bregman Divergence and Manhattan Distance.

Bregman Divergence is used to compare two rows over the same set of columns. As described earlier, we can use different functions in the Bregman Divergence equation (Equation 2.1) to emulate Euclidean Distance, KL-Divergence and the Itakura-Saito distortion [17]. Manhattan Distance is used to compare the values of adjacent columns in the same row. The Manhattan distance for two points P1(x1, y1) and P2(x2, y2) in the XY plane is given in Equation 3.4:

(3.4)  D(P1, P2) = |x1 − x2| + |y1 − y2|

Manhattan distance is simple to compute and works well in our case, as we are using it to compare distances among columns of the same row, which are assumed to be identically distributed.

The bicluster identification procedure is a local search procedure. We keep pointers StartRow and StartColumn, which are initialized to the first row and column of the reordered matrix respectively. The coherence score for adjacent rows is calculated using the Bregman distance. Similarly, the evolution pattern over adjacent columns is determined using the Manhattan distance. Our iterative procedure is row-major. We compare two rows at a time to see if they have matching columns. Starting with StartRow and StartRow + 1, we keep a reference set of matching columns. Columns are added to this set if either of the following two conditions is satisfied.

1. The Bregman distance between the two rows over the same set of columns is less than the given threshold δ.

2. The Manhattan distance between adjacent columns on both rows is the same.

The first condition makes sure that we always identify the constant value biclusters constrained by the noise. The second condition, on the other hand, guarantees that we would be able to determine the biclusters which exhibit coherent evolution over the current set of columns.

For each subsequent row, we find out if it has the same coherent columns. If it has, we add it to our row set. If any row has more coherent columns than the ones in the reference set, we set the OverlapFlag and store the current value of the row iterator i in StartRow and that of the column iterator j in StartColumn. If both of the above mentioned conditions fail, we call the current submatrix a bicluster and continue with the bicluster identification process. The next iteration of the identification process starts from StartRow and StartColumn if the OverlapFlag was set; otherwise it continues with the next value of the row iterator till it reaches the last row. Moreover, when we reach the end of the columns, we add the current row to our current row cluster. If we reach the end of the rows, and the current column and row clusters contain more than the required minimum numbers of columns (Column_min) and rows (Row_min) respectively, then we declare these row and column clusters to be a bicluster and add it to the bicluster set S.
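The following deliberately simplified C++ sketch captures only the core of the row-major scan described above; the bookkeeping for OverlapFlag, StartRow and StartColumn is omitted, and the helper names and exact growth policy are our assumptions rather than the authors' precise procedure. Parameters delta, row_min and col_min mirror δ, Row_min and Column_min.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Simplified row-major bicluster scan over a reordered matrix.
// A column survives in the reference set while consecutive rows either stay
// within the Bregman threshold delta (condition 1) or show identical
// Manhattan steps between adjacent reference columns (condition 2).
struct Bicluster { std::vector<int> rows, cols; };

std::vector<Bicluster> identify(const std::vector<std::vector<double>>& m,
                                double delta, std::size_t row_min, std::size_t col_min)
{
    std::vector<Bicluster> out;
    int nrows = static_cast<int>(m.size());
    int ncols = nrows ? static_cast<int>(m[0].size()) : 0;
    int start = 0;
    while (start + 1 < nrows) {
        Bicluster b;
        b.rows.push_back(start);
        for (int j = 0; j < ncols; ++j) b.cols.push_back(j);   // reference set
        int r = start + 1;
        for (; r < nrows; ++r) {
            std::vector<int> kept;
            for (std::size_t k = 0; k < b.cols.size(); ++k) {
                int j = b.cols[k];
                // Condition 1: rows agree on column j up to threshold delta.
                bool bregman = std::fabs(m[r][j] - m[r - 1][j]) < delta;
                // Condition 2: identical step to the adjacent reference column
                // (exact comparison, mirroring the text's "is the same").
                bool manhattan = k + 1 < b.cols.size() &&
                    std::fabs(m[r][b.cols[k + 1]] - m[r][j]) ==
                    std::fabs(m[r - 1][b.cols[k + 1]] - m[r - 1][j]);
                if (bregman || manhattan) kept.push_back(j);
            }
            if (kept.size() < col_min) break;   // the block cannot grow further
            b.cols = kept;
            b.rows.push_back(r);
        }
        if (b.rows.size() >= row_min && b.cols.size() >= col_min)
            out.push_back(b);
        start = r;                              // resume where this scan stopped
    }
    return out;
}
```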
4 Parallel/Distributed Biclustering

The complete process of biclustering comprises two main tasks: first the matrix is reordered by performing crossing minimization, followed by identifying the biclusters in this reordered matrix. The parallel biclustering algorithm likewise consists of a parallel crossing minimization algorithm and parallel bicluster identification. As shown in Figure 2, the input data matrix is horizontally partitioned amongst all the processes. After every process acquires its share of data, normalization of the data is performed to construct the joint distribution of rows and columns. In the second step, rows and columns are reordered iteratively until there is no further change. During this phase only the columns' information is exchanged among the processes; rows are reordered locally. The fact that each process has to share only the column rank information is very useful in privacy preserving biclustering applications (for complete details see [14] and [19]). After reordering of the local matrix, each process enters its second phase, bicluster identification. With the reordered matrix, each process performs a local bicluster search. After identifying the biclusters locally, bicluster representatives are shared amongst all the processes. With this new information acquired by every process, another phase of identifying local biclusters is completed. At the end, the updated information is shared with all the processes and the results are returned. The following sections discuss the above mentioned tasks in detail.

4.1 Model for Distributed Biclustering Data is horizontally partitioned across a set S : {s_1, ..., s_N} of N servers such that each server s_i has the same set A : {a_1, ..., a_m} of m attributes and a set R_i : {r_i1, ..., r_in_i} of n_i different objects (records), such that the global object set R can be represented as R = ∪_{i∈S} R_i.

[Figure 2: System Design of the Parallel Biclustering Algorithm. Crossing minimization: horizontal partitioning and initialization, then iterative column reordering (global) and row reordering (local) until no further change. Bicluster identification: local bicluster search, a one-time global exchange of bicluster representatives, and comparison with local biclusters.]

4.1.1 Parallel/Distributed Barycenter Heuristic Each server can build a bigraph on its local data such that objects are represented by vertices/nodes on one layer (say Layer0) and the attributes/features are represented by vertices on the other layer (Layer1). It would then perform barycenter heuristic based crossing minimization on the bigraph. As we recall from the previous discussion, the barycenter heuristic requires iterative computation of the global ranks of the object nodes and attribute nodes, which are represented by the two layers of the bigraph. Rank computation for each node in one layer of the bigraph requires the knowledge of the ranks of its neighbors on the other layer.

Since we assume horizontal partitioning, each object node in the bigraph built over local data will have all the necessary information (i.e., the ranks of its neighbor attribute nodes) to calculate its rank locally using (3.3). This implies that no inter-server communication is required for calculating the ranks of objects at each server. On the other hand, since attribute nodes are connected to object nodes which might be separated across different servers, we will have to engage in inter-server communication for exact computation of the ranks of these attribute nodes. Each server i shares its local value of µ_ij for the j'th attribute with all other servers and then the global weighted mean is calculated. This global weighted mean µ_G^(j) for an attribute a_j ∈ A is simply the mean of all µ_ij values over the N servers:

µ_G^(j) = ( Σ_{1≤i≤N} µ_ij ) / N

The global weighted mean for an attribute node is the same over all the servers and thus results in the assignment of a unique rank to each attribute node. This is an embarrassingly parallel procedure where the rank of each node can be effectively computed independent of the rank of any other node in the same layer. This is especially useful for hardware based implementations of the biclustering algorithm.
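Since each server contributes one local mean per attribute, the global means can be obtained with a single all-reduce over m values. A minimal MPI/C++ sketch of this communication pattern follows (our illustration, not the exact production code):

```cpp
#include <mpi.h>
#include <vector>

// Global weighted mean per attribute: mu_G(j) = (1/N) * sum_i mu_ij.
// One MPI_Allreduce over m doubles; the message size is independent of the
// number of rows, as argued in the complexity analysis of Section 5.
std::vector<double> global_attribute_means(const std::vector<double>& mu_local,
                                           MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    std::vector<double> mu_global(mu_local.size());
    MPI_Allreduce(mu_local.data(), mu_global.data(),
                  static_cast<int>(mu_local.size()),
                  MPI_DOUBLE, MPI_SUM, comm);
    for (double& v : mu_global) v /= nprocs;   // mean over all N servers
    return mu_global;
}
```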

4.1.2 Parallel/Distributed Bicluster Identification By the end of the above mentioned distributed barycenter heuristic based crossing minimization algorithm, each server will have a reordered representation of its local data. Since the ranks of the object nodes are calculated locally, each server does not know the global ranks of its object nodes. The global rank of object nodes is useful only if each server knows the ranks of objects on other servers too. This is because the bicluster identification procedure works on contiguously ranked object nodes. It would have to be implemented in a distributed setting by having a hash table so that each server s_i can look up the table to determine the set of servers S_lookup which have object nodes adjacent to its own object nodes. The server s_i would then engage in a distributed bicluster identification procedure in collaboration with those servers which belong to the set S_lookup.

This would result in a tremendous increase in communication overhead. Firstly, all servers would have to engage in an all-to-all broadcast of the local ranks of each object node so as to assign a unique global rank to each object node at each server. This is also required for building the required lookup table. Given the large number of object nodes, the communication cost for this process would be prohibitively high. Secondly, during the bicluster identification process each server would have to engage in distributed bicluster identification with other servers belonging to the set S_lookup. This would require the bicluster identification procedure to be performed in a synchronized manner at each server. Given that these servers would be communicating at internet speeds and are distributed geographically, the synchronization requirement would be impractical for most practical applications.

The solution to the above mentioned problem is to employ a simple approach whereby object ranks are computed locally at each server and the bicluster identification process is performed locally as well. Once each server has identified its local biclusters, it broadcasts representatives of these biclusters to all other servers. A bicluster representative is a vector consisting of the mean of each attribute in the bicluster. Upon receiving the bicluster representatives from other servers, each server determines through Euclidean distance whether its local biclusters can be combined with those from other servers. In case it finds strong similarity between one of its biclusters and an incoming bicluster representative, it updates the local bicluster representative for that bicluster such that the attribute means now reflect the global attribute means for the bicluster.
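A hedged sketch of the representative exchange and merge step follows. For brevity it assumes every server contributes the same number k of representatives and uses an illustrative similarity threshold eps; neither assumption is specified in the text, and the simple averaging used here only approximates the global attribute means.

```cpp
#include <cmath>
#include <cstddef>
#include <mpi.h>
#include <vector>

// Gather every server's bicluster representatives (k vectors of m attribute
// means each) and merge any local bicluster whose representative lies within
// Euclidean distance eps of an incoming one.
void exchange_and_merge(std::vector<std::vector<double>>& local_reps,
                        int m, double eps, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    int k = static_cast<int>(local_reps.size());
    std::vector<double> sendbuf(static_cast<std::size_t>(k) * m);
    std::vector<double> recvbuf(static_cast<std::size_t>(nprocs) * k * m);
    for (int b = 0; b < k; ++b)
        for (int j = 0; j < m; ++j) sendbuf[b * m + j] = local_reps[b][j];
    MPI_Allgather(sendbuf.data(), k * m, MPI_DOUBLE,
                  recvbuf.data(), k * m, MPI_DOUBLE, comm);
    for (int b = 0; b < k; ++b) {                  // each local bicluster
        for (int rb = 0; rb < nprocs * k; ++rb) {  // each received representative
            double d2 = 0.0;
            for (int j = 0; j < m; ++j) {
                double diff = local_reps[b][j] - recvbuf[rb * m + j];
                d2 += diff * diff;
            }
            if (std::sqrt(d2) < eps) {
                // Similar bicluster found: fold the incoming means into the
                // local representative so it moves toward the global means.
                for (int j = 0; j < m; ++j)
                    local_reps[b][j] = 0.5 * (local_reps[b][j] + recvbuf[rb * m + j]);
            }
        }
    }
}
```

Note that the buffers exchanged here are of size k × m per server, so the communication volume is bounded by the number of columns, never by the number of rows.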
5 Complexity Analysis

Let us assume that n = |V0| = total number of rows of the input data matrix, m = |V1| = total number of columns of the input matrix, Rc = average number of rows per bicluster, Cc = average number of columns per bicluster, C = total number of biclusters, and k = average number of iterations of the barycenter heuristic.

5.1 Time Complexity of the Sequential Algorithm O(n) is the time to compute the weighted means, O(n log n) the time to perform sorting based on the means, and O(n) the time taken in adjusting node positions. Also, O(C Rc Cc) is the time to identify the biclusters. Combining all of these components yields the sequential time complexity, which is given as T_sequential = O(k(n + n log(n) + n)) + O(C Rc Cc). The worst case complexity for identification of a bicluster is O(nm). But the identification process is carried out such that a sub-matrix which is part of an already identified bicluster won't have to be revisited during identification of subsequent biclusters. In the case where a single bicluster covers the whole matrix, the total number of biclusters equals one, thus keeping the complexity bounded by the data size itself. Here, without loss of generality, we can assume that O(C Rc Cc) = O(nm). This implies that

T_sequential = O(k(n + n log(n) + n)) + O(nm)

We have noted that k is a fairly small number in most cases. This implies that in the expression for T_sequential, the term O(nm) tends to dominate, which essentially means that the sequential biclustering process has complexity roughly the same as the problem size.

5.2 Parallel Complexity Let us assume that we have horizontally partitioned the data among p processors such that each processor now has n/p rows of the original matrix and all m columns. The computation complexity of the scheme when the data is partitioned among p processors is given by T_p = O(k(n/p + (n/p) log(n/p) + n/p)) + O(nm/p). Going by the same arguments as in the sequential case, O(nm/p) is the rough bound on the parallel computation cost. This essentially means that by having p processors work on the problem, we have reduced the overall time complexity by a factor of p, which theoretically yields a linear speedup curve. The algorithm involves All-to-All communication primitives, firstly in the crossing minimization phase (an all-to-all reduction over column positions), and then in the cluster combining phase, where an all-to-all gather is performed once per cluster. All of these All-to-All communication primitives incur O(p²m) in communication cost. This indicates that the communication cost is independent of the number of rows; it depends only upon the number of columns and the number of processors. It grows linearly with the number of columns while growing quadratically with an increasing number of processors.

5.3 Impact of the Memory System on Performance As is clear from the above discussion, there are two components of the proposed parallel biclustering solution. One is parallel crossing minimization and the other is bicluster identification. Both of these strongly depend on the underlying memory system performance. The barycenter heuristic requires in-memory sorting of rows and columns after each iteration. We use Red-Black trees [20] to keep the nodes sorted during each iteration with respect to their current positions. Red-Black trees are very useful as they provide an amortized latency of O(log n) for insertion, deletion and retrieval operations. However, we observe that the cost of keeping the tree balanced increases exponentially as the data size increases beyond a certain threshold, depending on the underlying memory system. This increase in tree balancing cost can be mitigated by increasing the cache size so as to reduce the I/O time spent in accessing off-chip memory and disks. This also motivates parallelizing the code to make use of the caches distributed across the different processors in the cluster.
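For instance, the sorted order can be kept in a std::multimap, which mainstream C++ standard libraries implement as a Red-Black tree [20]; the following minimal sketch (ours) shows repositioning a node after its weighted mean changes:

```cpp
#include <map>

// The dynamic layer kept ordered by weighted mean; std::multimap is a
// Red-Black tree in common implementations, giving O(log n) insertion,
// deletion and retrieval as noted in Section 5.3.
using Order = std::multimap<double, int>;   // weighted mean -> node id

void reposition(Order& order, int node, double old_mu, double new_mu) {
    auto range = order.equal_range(old_mu);  // all entries with the stale key
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == node) { order.erase(it); break; }
    order.emplace(new_mu, node);             // reinsert under the new mean
}
```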

The second component of the parallel implementation is the bicluster identification process. This requires reading the matrix in the order determined by the above mentioned barycenter heuristic. This process is very strongly dependent on the performance of the underlying memory system for the following two reasons.

• During this procedure, matrix rows and columns are scanned based on the order determined through the barycenter heuristic and not by the natural layout in memory. This has an adverse impact on cache performance, especially when the data size is very large.

• As the data size increases, due to the limited memory at each processor, there is a high likelihood that required data might get swapped out of memory and onto the hard disk. Since the I/O latency for hard disk access is orders of magnitude larger than memory access, this effect is very prominent, especially in the case of the sequential algorithm.

Both of the above reasons can be mitigated intrinsically by the parallel implementation. As more processors become available, partitions of large data sets can fit into their local memory more easily. Increased cache also has a strong positive impact on the performance of the parallel algorithm. In the parallel processing domain, it is generally believed that super-linear speedups are attained when the parallel implementation is able to make effective use of the caches of all machines in the cluster. If we consider the retail chain scenario mentioned in previous sections, this data distribution is natural. As we show in the experiments section, these reasons explain the super-linear speedups observed while comparing the performance of the parallel algorithm against the sequential one.

6 Experimental Results

The proposed biclustering algorithm is implemented in MPI/C++. Our implementation of the parallel biclustering algorithm is evaluated on Mercury, a machine at the National Center for Supercomputing Applications (NCSA). Mercury is an 887-node IBM Linux cluster where each node contains two Intel 1.3/1.5 GHz Itanium II processors sharing 4 GB of memory. Running a SuSE Linux operating system, the compute nodes are inter-connected by both Myrinet and Gigabit Ethernet. The implementation has been tested with MPICH version 1.2.6 on the faster CPU machines (1.5 GHz) with 6 MB L3 cache. In all the experiments the execution time was obtained through MPI_Wtime() and is reported in seconds. The execution time does not take into account the time spent in reading data from files, but it contains the time of writing the identified biclusters to files.

[Figure 3: Accuracy Comparison between Parallel and Sequential Algorithm; match score vs. noise level for the two implementations.]

6.1 Accuracy Evaluation The first set of experiments is aimed at judging the accuracy of the proposed algorithm. For this purpose, we evaluate the proposed parallel/distributed implementation of the barycenter heuristic based biclustering algorithm against results from the sequential implementation [1]. For empirical evaluation of the quality of clusters, we use the following function, which was also used in [21] and [22].

Let M1, M2 be two sets of bi-clusters. The match score of M1 with respect to M2 is given by

S(M1, M2) = (1/|M1|) Σ_{A(I1,J1)∈M1} max_{A(I2,J2)∈M2} ( |I1 ∩ I2| |J1 ∩ J2| ) / ( |I1 ∪ I2| |J1 ∪ J2| )

Let M_opt denote the set of implanted bi-clusters and M the set of the output bi-clusters of a biclustering algorithm. S(M_opt, M) represents how well each of the true bi-clusters is discovered by the algorithm.
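The match score can be computed directly from this definition; a small C++ sketch (ours), where each bicluster stores sorted row and column index sets so intersections are computed with std::set_intersection:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

struct BC { std::vector<int> I, J; };   // sorted row and column index sets

static std::size_t inter(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out.size();
}

// S(M1, M2) = (1/|M1|) * sum over A1 in M1 of max over A2 in M2 of
//             (|I1 ∩ I2| |J1 ∩ J2|) / (|I1 ∪ I2| |J1 ∪ J2|)
double match_score(const std::vector<BC>& m1, const std::vector<BC>& m2) {
    double total = 0.0;
    for (const BC& a : m1) {
        double best = 0.0;
        for (const BC& b : m2) {
            double ii = static_cast<double>(inter(a.I, b.I));
            double jj = static_cast<double>(inter(a.J, b.J));
            double ui = a.I.size() + b.I.size() - ii;  // |I1 ∪ I2| by inclusion-exclusion
            double uj = a.J.size() + b.J.size() - jj;
            best = std::max(best, (ii * jj) / (ui * uj));
        }
        total += best;
    }
    return m1.empty() ? 0.0 : total / m1.size();
}
```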
Again, we follow the approach used by Liu et al. [22] for synthetic data generation. To cater for the missing values in real life data, we add noise by replacing some elements in the matrix with random values. There are three variables, b, c and γ, in the generation of the bi-clusters. b and c are used to control the size of the implanted bi-cluster; γ is the noise level of the bi-cluster. The matrix with implanted constant bi-clusters is generated in four steps: (1) generate an n × m matrix A such that all elements of A are 0s, (2) generate √n × √m bi-clusters such that all elements of the bi-clusters are 1s, (3) implant the bi-clusters into A, (4) replace γ(m × n) elements of the matrix with random noise values (0 or 1).

In the experiment, the noise level ranges from 0 to 0.25. The parameter settings used for the two methods are listed in Table 1. The resulting graph is shown in Figure 3.

Table 1: Parameter Settings for the Parallel and Sequential Biclustering Algorithms

  Method       Parameter Settings
  Parallel     δ = 0.5, Iterations = 5, Function = KL-Divergence, Processors = 4
  Sequential   δ = 0.5, Iterations = 5, Function = KL-Divergence, Processors = 1

The graph illustrates the accuracy comparison between the sequential barycenter heuristic based biclustering algorithm and its parallel/distributed implementation (using four processors). The graph shows that at zero noise, this scheme is almost as accurate as the sequential scheme [1]. With the increase in noise the accuracy degrades a little, because of the non-exact nature of the bicluster comparison and merge operations of the algorithm. Note, however, that the accuracy of this algorithm remains considerably high even though it is a distributed implementation which uses a simple bicluster merge framework.

6.2 Performance Evaluation of the Parallel Biclustering Algorithm The second set of experiments is aimed at analyzing the performance of the parallel biclustering algorithm. For this purpose the algorithm was run on the synthetic data sets mentioned above. These data sets have sizes ranging from 200,000 rows to 20,000,000 rows, while keeping 64 columns throughout.

Figure 4 shows the performance improvement in terms of execution time for crossing minimization and bicluster identification with varying problem sizes. In Figure 4(a) the execution time for more than 64 processors is not reported. It is evident from these charts that the bicluster identification execution time dominates the crossing minimization execution time. So, the total execution time shows the behavior of bicluster identification, while crossing minimization becomes insignificant. Each curve represents a different data size; it is evident from the figure that execution time increases significantly with the increase in the number of rows. Although for a single processor the crossing minimization time increases many-fold when increasing the number of rows from 200,000 to 20,000,000, Figure 4(b) shows that bicluster identification exhibits an extreme increase in execution time beyond 4 Million rows. The reason for this behavior, as explained in Section 5.3, is the poor cache utilization and possible swapping of required data to secondary storage when the data size becomes too large to fit in memory. The execution time improves significantly in the case of multiple processors because the data partition at each processor is more likely to fit in memory and it is less likely to suffer from hard disk I/O access latencies because of swapping. Moreover, smaller data sets have better cache utilization as there is less cache pollution. Since there is an orders of magnitude difference between I/O latencies from cache and those from hard disk, the end result is an exponential reduction in execution times for the multiple processor cases. We are not presenting execution times of more than 5 hours in this paper, to avoid imbalanced scaling in the charts. For bigger data sizes, execution time is reported in Figure 4 only for large numbers of processors, because the execution time is considerably high for smaller numbers of processes.

Figure 5 shows the charts for speedup, which is defined as S = Time(p=1)/Time(p=P), for different problem sizes from 1 to 256 processors. The first graph shows speedup curves for parallel crossing minimization; all other graphs show the overall speedup achieved for different problem sizes. For 200,000 rows the speedup decreases as the number of processors is increased beyond 64. This is because the inter-processor communication cost, which is O(p²), starts undermining the performance gains made through data decomposition and the resulting parallel execution of computation tasks. A similar phenomenon could be seen for other problem sizes as well if the number of processors were increased further. The 1 Million rows case also seems to saturate after 256 processors. The smaller the problem size, the sooner saturation will occur. Note that for the cases of 4 Million, 10 Million and 20 Million rows, speedup has been calculated with respect to 8, 32 and 64 processors respectively. It is clear from the graphs that the proposed approach is very scalable and maintains excellent speedup characteristics throughout. As discussed earlier, the superlinear speedups are a result of better cache utilization with smaller data sizes and the avoidance of high latency secondary storage I/O.

7 Conclusions

A parallel/distributed implementation of a biclustering algorithm is highly desired in many application domains. These domains include gene expression analysis in large scale micro-array experiments, text mining on large web based text repositories, and collaborative filtering in E-commerce systems. In order to meet this challenge, a parallel/distributed algorithm for barycenter heuristic based biclustering is proposed. We provide a thorough analysis of the advantages of this scheme over existing solutions. The proposed approach is evaluated on synthetic data sets of large sizes on a cluster with hundreds of machines. Both the accuracy and the performance of the algorithm are tested and verified for these data sets. The experiments reveal that the algorithm not only shows super-linear and scalable speedup characteristics but also maintains good accuracy throughout.

[Figure 4: Execution Time for Crossing Minimization and Bicluster Identification for Different Problem Sizes. (a) Crossing Minimization: time (sec) vs. number of processes (1 to 64) for 200,000 to 20 Million rows. (b) Bicluster Identification: time (sec) vs. number of processes (1 to 256) for the same data sizes.]

[Figure 5: Crossing Minimization's and Overall Speedup for Different Problem Sizes; speedup vs. number of processes (up to 256) for data sizes from 200,000 to 20 Million rows.]

8 Acknowledgements

This work was supported in part by DOE SCIDAC-2: Scientific Data Management Center for Enabling Technologies (CET) grant DE-FC02-07ER25808, DOE SCiDAC award number DE-FC02-01ER25485, NSF HECURA CCF-0621443, NSF SDCI OCI-0724599, NSF CNS-0551639 and IIS-0536994. This research was supported in part by the National Science Foundation through TeraGrid resources provided by NCSA, grant numbers TG-CCR060017T, TG-CCR080019T, and TG-ASC080050N. We gratefully acknowledge Prof. Ashfaq Khokhar of the University of Illinois, Chicago, and the anonymous reviewers for their insightful suggestions and helpful comments.

References

[1] W. Ahmad and A. Khokhar, "cHawk: A highly efficient biclustering algorithm using bigraph crossing minimization," in Second International Workshop on Data Mining and Bioinformatics (VDMB 2007), Vienna, Austria (in conjunction with VLDB 2007), 2007.
[2] Y. Cheng and G. Church, "Biclustering of expression data," in Proceedings of Intelligent Systems for Molecular Biology, 2000.
[3] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, 2004.
[4] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
[5] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, no. 1, pp. 143–175, January 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1007612920971
[6] A. Kittur, E. Chi, B. A. Pendleton, B. Suh, and T. Mytkowicz, "Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie," World Wide Web, vol. 1, no. 2, 2006.
[7] K.-H. Cheung, K. White, J. Hager, M. Gerstein, and P. Miller, "YMD: a microarray database for large-scale gene expression analysis," in Proc. AMIA Symp., 2002, pp. 140–144.
[8] J. Zhou and A. A. Khokhar, "ParRescue: Scalable parallel algorithm and implementation for biclustering over large distributed datasets," in ICDCS. IEEE Computer Society, 2006, p. 21.
[9] T. George and S. Merugu, "A scalable collaborative filtering framework based on co-clustering," in ICDM. IEEE Computer Society, 2005, pp. 625–628.
[10] L. Wei, C. Ling, Q. Hongyu, and Q. Ling, "A parallel biclustering algorithm for gene expression data," IEEE Computer Society Press, 2008.
[11] M. Gu, H. Zha, C. Ding, X. He, and H. Simon, "Spectral relaxation models and structure analysis for k-way graph clustering and bi-clustering," 2001. [Online]. Available: citeseer.ist.psu.edu/article/gu01spectral.html
[12] I. S. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Knowledge Discovery and Data Mining, 2001, pp. 269–274. [Online]. Available: citeseer.ist.psu.edu/dhillon01coclustering.html
[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proceedings of the Fourth SIAM International Conference on Data Mining, 2004, pp. 114–125.
[14] W. Ahmad and A. Khokhar, "An architecture for privacy preserving collaborative filtering on web portals," in The Third International Symposium on Information Assurance and Security (IAS), 2007.
[15] I. S. Dhillon, S. Mallela, and D. S. Modha, "Information-theoretic co-clustering," in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003, pp. 89–98.
[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.
[17] J. Lafferty, S. Della Pietra, and V. Della Pietra, "Statistical learning algorithms based on Bregman distances," in Proceedings of the Canadian Workshop on Information Theory, 1997.
[18] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.
[19] W. Ahmad and A. A. Khokhar, "Phoenix: Privacy preserving biclustering on horizontally partitioned data," in Privacy, Security, and Trust in KDD, First ACM SIGKDD International Workshop, PinKDD 2007, San Jose, CA, USA, August 12, 2007, Revised Selected Papers, ser. Lecture Notes in Computer Science, vol. 4890. Springer, 2008, pp. 14–32.
[20] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms. McGraw-Hill Higher Education, 2001.
[21] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler, "A systematic comparison and evaluation of biclustering methods for gene expression data," Bioinformatics, 2006.
[22] X. Liu and L. Wang, "Computing the maximum similarity bi-clusters of gene expression data," Bioinformatics, vol. 23, no. 1, pp. 50–56, 2007.
