High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic
Total Page:16
File Type:pdf, Size:1020Kb
High Performance Parallel/Distributed Biclustering Using Barycenter Heuristic Arifa Nisar∗ Waseem Ahmady Wei-keng Liaoz Alok Choudhary x Abstract nificantly useful and considerably harder problem than Biclustering refers to simultaneous clustering of objects and traditional clustering. Whereas a cluster is a set of ob- their features. Use of biclustering is gaining momentum jects with similar values over the entire set of attributes, in areas such as text mining, gene expression analysis and a bicluster can be composed of objects with similarity collaborative filtering. Due to requirements for high per- over only a subset of attributes. Many clustering tech- formance in large scale data processing applications such niques such as those based on Nearest Neighbor Search as Collaborative filtering in E-commerce systems and large are known to suffer from "Curse of Dimensionality" [4]. scale genome-wide gene expression analysis in microarray ex- With large number of dimensions, similarity functions periments, a high performance prallel/distributed solution such as Euclidean distance tend to perform poorly as for biclustering problem is highly desirable. Recently, Ah- similarity diffuses over these dimensions. For example, mad et al [1] showed that Bipartite Spectral Partitioning, in text mining, the size of keywords set describing dif- which is a popular technique for biclustering, can be refor- ferent documents is very large and yields sparse data mulated as a graph drawing problem where objective is to matrix [5]. In this case, clusters based on the entire minimize Hall's energy of the bipartite graph representation keywords set may have no meaning for the end users. of the input data. They showed that optimal solution to this Biclustering finds local patterns where a subset of problem is achieved when nodes are placed at the barycenter objects might be similar to each other based on only a of their neighbors. In this paper, we provide a parallel algo- subset of attributes. The comparison between biclus- rithm for biclustering based on this formulation. We show ters and clusters is illustrated in Figure 1. Figure 1(a) that parallel energy minimization using barycenter heuristic demonstrates the concept of clusters and 1(b) demon- is embarrassingly parallel. The challenge is to design a bi- strates the biclusters in input data matrix. Note that cluster identification algorithm which is scalable as well as biclusters can cover just part of rows or columns and accurate. We show that our parallel implementation is not may overlap with each other as shown in figure 1(b). just extremely scalable, it is comparable in accuracy as well Biclustering process can generate a wide variety of ob- with serial implementation. We have evaluated proposed ject groups that capture all the significant correlation parallel biclustering algorithm with large synthetic data sets information present in a data set. Biclustering has be- on upto 256 processors. Experimental evaluation shows large come very popular in discovering patterns from gene mi- superlinear speedups, scalability and high level of accuracy. croarray experiments data. Since a bicluster can iden- tify patterns among a subset of genes based on a subset 1 Introduction of conditions, it therefore models condition-specific pat- terns of co-expression. Moreover, biclustering can iden- Biclustering (Subspace Simultaneous Clustering) is a tify overlapping patterns thus catering to the possibility very useful technique for text mining, collaborative fil- that a gene may be a member of multiple pathways. tering and gene expression analysis and profiling [2] [3]. Unprecedented growth in web related text based Recently, it has gained increasing popularity in the anal- data repositories [6] and evolution of large biological ysis of gene expression data [3] . Biclustering is sig- data sets [7] has resulted into the need of high per- formance biclustering algorithms in order to cope with ∗ Department of Electrical Engineering and Computer Science, the largeness of the data sets efficiently. Given the Northwestern University. yCurrent Affiliation:A9.COM, Research was carried out when memory size limit in single processor machines, paral- author was with ECE department at University of Illinois, lel/distributed realization of the biclustering algorithms Chicago. is desirable to enable scalable applications for arbitrarily zDepartment of Electrical Engineering and Computer Science, large data sets. Northwestern University. Distributed solutions for biclustering problem be- xDepartment of Electrical Engineering and Computer Science, Northwestern University. come essential also when the underlying data is inher- 1050 Copyright © by SIAM. Unauthorized reproduction of this article is prohibited. tic can provide a scalable linear time solution to bipar- Bicluster 3 tite spectral partitioning problem. In this paper, we Cluster 1 study the effectiveness of this approach for dealing with parallel/distributed biclustering problem in large scale datasets. Cluster 2 Bicluster 1 Going by the above-mentioned retail chain scenario, Bicluster 2 we assume a horizontal partitioning of the data where each location has the same feature space but objects are different. Similar assumptions were made in [8] (a) Clusters of Objects. (b) Biclusters of Objects. and [9]. The parallel implementation is divided in to three stages. First stage involves parallel implementa- Figure 1: Illustration of Clusters and Biclusters tion of barycenter heuristic for crossing minimization. Identification of local biclusters is part of second stage. The third stage involves determination of global biclus- ently distributed in nature. Consider the example of a ters at each processor. This is accomplished by virtue retail store chain such as Wal-Mart who have numerous of comparison of local biclusters with bicluster repre- stores spread across different parts of the world. The sentatives received from other processors. A bicluster consumer transaction data gets stored in these stores representative is merely a vector composed of means of such that users are different across stores but items are bicluster columns over all rows of the bicluster. Each generally the same. It can be viewed as row-wise distri- processor performs a Euclidean distance based similar- bution of the customer transaction data. Biclustering ity comparison with bicluster representatives received of this distributed data can be carried out in two ways. from other processors. The similar biclusters are sub- One is to transfer the data at a central location and sequently merged to obtain global biclusters. Clearly perform subsequent sequential mining. The other way during both stages of communication, the size of com- is to perform distributed biclustering of the data. The munication is bounded by the number of columns in the former approach has serious scalability issues as it is not data matrix. feasible to transfer data from thousands of locations to We show that parallel/distributed implementations a central location reliably and efficiently. Central ap- for most of the existing biclustering algorithms, incur proach is also harder to adopt because of security and communication overhead which grows at least linearly privacy issues. This leaves us with the task of developing with the size of rows. On the other hand, proposed par- distributed solution to the biclustering problem. There allel algorithm of crossing minimization based approach has already been some work done in parallel/distributed being independent of the number of rows, has a signif- biclustering area specially in context of document-word icant advantage for huge data sets on a large number co-clustering [8], collaborative filtering for online rec- of processors. We show that this efficient distributed ommendation systems [9] and for gene expression anal- implementation is comparable in accuracy to the se- ysis [10]. quential one [1]. We have evaluated the performance Bipartite Spectral partitioning is a powerful tech- and accuracy of parallel biclustering algorithm on large nique to achieve biclustering. Normalized cut is the data sets comprising of millions of rows. The evaluation most popular objective function for spectral partition- results are discusses in detail in Section 6. ing problem [11], [12]. In the absence of a polynomial Organization: Rest of the paper is organized as time exact solution for this problem, emphasis is on find- follows. We discuss related work in section 2, section 3 ing efficient heuristics. A popular implementation [12] talks briefly about the sequential biclustering algorithm, requires computation of the Fiedler vector for Lapla- section 4 proposes a new model for parallel biclustering cian matrix of input data. Its computation complex- using barycenter heuristic and provides an efficient ity is quadratic in data size and hence not suitable algorithm. Section 5 discusses the complexity of the for large data. Recently Ahmad et al. [1] proposed proposed scheme. Experimental framework is given in a new formulation of the spectral partitioning prob- Section 6, conclusions are drawn in Section 7. lem as Graph Drawing(GD) problem. They showed that minimization of Hall's energy function [1] corre- 2 Related Work sponds to finding the normalized cut of the bigraph. Moreover, they also showed that an optimal solution to Recently, there have been some research efforts aimed Hall's energy minimization problem is achieved by plac- at solving parallel biclustering problem. ParRescue [8] ing each node at the barycenter of