BIDEAL: a Toolbox for Bicluster Analysis - Generation, Visualization and Validation
Nishchal K. Verma, T. Sharma, S. Dixit, P. Agrawal, S. Sengupta, and V. Singh

arXiv:2007.13737v1 [cs.OH] 26 Jul 2020

Abstract: This paper introduces a novel toolbox named BIDEAL for the generation of biclusters, their analysis, visualization, and validation. The objective is to help researchers use leading biclustering algorithms embedded on a single platform. A single toolbox comprising various biclustering algorithms plays a vital role in extracting meaningful patterns from data for detecting diseases, biomarkers, gene-drug associations, etc. BIDEAL consists of seventeen biclustering algorithms, three bicluster visualization techniques, and six validation indices. The toolbox can analyze several types of data, including biological data, through a graphical user interface. It also provides data preprocessing techniques, i.e., binarization, discretization, normalization, and elimination of null and missing values. The effectiveness of the developed toolbox has been demonstrated through testing and validation on the Saccharomyces cerevisiae cell cycle, Leukemia cancer, Mammary tissue profile, and Ligand screen in B-cells datasets. The biclusters of these datasets have been generated using BIDEAL and evaluated in terms of coherency, differential co-expression ranking, and similarity measure. The generated biclusters have also been visualized through a heat map and a gene plot.

Index Terms: Biclustering, Gene expression analysis, Data visualization, Preprocessing, Validation index, MTBA, Coherency.

Nishchal K. Verma, T. Sharma, S. Dixit, P. Agrawal, and V. Singh are with the Dept. of Electrical Engineering, IIT Kanpur, India. S. Sengupta is with the Dept. of Electrical Engineering, Jadavpur University, India. E-mail: [email protected].
The developed software is available for academic and research use at http://www.iitk.ac.in/idea/bideal/.

1 INTRODUCTION

Biclustering has become a prevalent and useful data mining technique among researchers for analyzing data. It has been applied to a wide variety of applications such as bioinformatics, information retrieval, text mining, dimensionality reduction, recommender systems, electoral data analysis, disease identification, association rule discovery in databases, and many more [1]. Among these, bioinformatics [2], [3] has taken particular advantage of biclustering for the analysis of gene expression data. During any biological process under different experimental conditions, genes are examined by their expression levels. The data are arranged in a matrix whose rows represent genes and whose columns represent experimental conditions. The aim is to group genes and conditions into a sub-matrix to obtain crucial biological information, such as the identification of co-regulated patterns among genes. A bicluster $B$ can be represented as

\[
B = \begin{bmatrix}
b_{11} & b_{12} & b_{13} & \cdots & b_{1|J|} \\
b_{21} & b_{22} & b_{23} & \cdots & b_{2|J|} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
b_{|I|1} & b_{|I|2} & b_{|I|3} & \cdots & b_{|I||J|}
\end{bmatrix} \tag{1}
\]

where $b_{ij}$ refers to the expression level of instance $i$ under sample $j$, $\forall\, i \in \{1, 2, \ldots, |I|\}$ and $\forall\, j \in \{1, 2, \ldots, |J|\}$, $|I|$ is the number of instances, and $|J|$ is the number of attributes. Biclustering involves finding the maximum sub-matrices in a data matrix with maximum coherency. Since biclustering is an NP-hard problem, various heuristic and meta-heuristic approaches have been used in the literature to find better solutions [4].

Traditional clustering algorithms give equal importance to all the columns. These algorithms include K-means clustering [5], hierarchical clustering [6], self-optimal clustering [7], improved mountain clustering [8], fuzzy C-means clustering [9], unsupervised fuzzy clustering [10], etc. Each algorithm has its own advantages. Despite their usefulness, they are not very helpful in a variety of problems. For example, in gene expression analysis, not every gene takes part in every condition. Thus, biclustering, which captures combinatorial regulation and joint patterns of gene expression, is essential to understand the complex nature of genes. A plethora of solutions to perform biclustering is presented in [11]. Each algorithm in this pool has its own distinctive approach, heuristic or statistical, with its own merits and demerits. No single approach can be expected to be well-suited for all types of data. Any given problem should therefore be tackled with the respective suitable algorithms, and the best result should be noted. This creates the need for a comprehensive biclustering toolbox in which various algorithms can be tested, validated, and visualized. Toolboxes can be compared in terms of the following:

(a) Number of algorithms embedded in the toolbox.
(b) Number of validation indices present for qualitative analysis of generated biclusters.
(c) Number of visualization methods available for generated biclusters.
(d) User-friendly interface of the toolbox.

Based on the above-mentioned features, it can be summarized that a toolbox must be diverse in nature. In the past decade, the growing demand for biclustering algorithms has led to intense research on developing toolboxes for biclustering.
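As an illustration of the bicluster representation introduced earlier in this section, the following minimal Python sketch extracts a sub-matrix from a data matrix using a row index set I and a column index set J. The matrix values, the index sets, and the helper name `extract_bicluster` are hypothetical and chosen only for illustration; they are not taken from BIDEAL or from any dataset used in the paper.

```python
# Hypothetical 4-gene x 5-condition expression matrix (made-up values).
data = [
    [1.0, 2.0, 9.0, 3.0, 7.0],
    [4.0, 5.0, 1.0, 6.0, 2.0],
    [2.0, 3.0, 8.0, 4.0, 0.0],
    [9.0, 1.0, 5.0, 2.0, 6.0],
]


def extract_bicluster(matrix, rows, cols):
    """Return the sub-matrix B = (b_ij) restricted to the given rows and columns."""
    return [[matrix[i][j] for j in cols] for i in rows]


# Assumed index sets (zero-based): genes 0 and 2, conditions 0, 1 and 3.
I = [0, 2]
J = [0, 1, 3]

B = extract_bicluster(data, I, J)

# B has |I| rows and |J| columns.
for row in B:
    print(row)
```

In this toy matrix, the two selected genes differ by a constant shift on the selected conditions but not on the remaining ones, which is the kind of pattern that a clustering method using all columns may not reveal.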
This paper proposes a user-friendly toolbox, namely "BIDEAL", which incorporates 17 biclustering algorithms, 6 validation indices, and 3 visualization methods. Table 1 summarizes various biclustering toolboxes in terms of available algorithms, validity indices, and visualization methods. Considering the visualization methods or result presentation for generated biclusters, BicAT [12], BicOverlapper 2.0 [13], Expander [14], and BicNET [15] provide only a single visualization method. On the other hand, BiVisu [16], BAT [17], BiBench [18], BiClust [19], MTBA [20], CoClust [21], BicPAMS [22], and BIDEAL have multiple methods of visualization. Among these, CoClust and BIDEAL offer the maximum number of visualization methods. By default, BIDEAL provides bicluster results in a numerical matrix. Another important feature of a toolbox is its set of validation indices, which check the quality of the obtained biclusters. BiVisu, BAT, BiBench, and BiClust offer only one or two validation indices, whereas BIDEAL has six, the maximum among the listed toolboxes. A Graphical User Interface (GUI) for executing various algorithms on a single platform eases the process. The user-friendly interface of BIDEAL makes testing a new dataset quite easy, without any prior knowledge of back-end programming. On the other hand, BiBench, BiVisu, BiClust, CoClust, and MTBA require some familiarity with programming. Moreover, BicAT allows the execution of algorithms only with default parameter settings, which is a constraint, whereas BIDEAL allows these parameters to be changed.

TABLE 1
Summary of the biclustering toolboxes

| Toolbox | Algorithms | Validation Indices | Visualization Methods | Platform |
|---|---|---|---|---|
| BicAT [12] | CC [23], ISA [27], OPSM [26], xMotif [30] | None | Heat Map [48] | Java |
| BiVisu [16] | Greedy version of pCluster | Mean Square Residue, Average Correlation Value | Heat Map, Parallel Coordinate Plots | MATLAB |
| BicOverlapper 2.0 [13] | Visualization Toolbox | None | Venn-like Diagrams | R, Java |
| Expander [14] | SAMBA [46] | None | Heat Map | Java |
| BAT [17] | BiHEA [47] | Pairwise Gene Analysis | Heat Map, Numerical Matrix | Java |
| BiBench [18] | CC, OPSM, xMotif, kSpectral [28], ISA, Plaid [31], BiMax [32], Bayesian, QUBIC [38], FABIA [34], COALESCE | Jaccard Index [40], F-measure | Heat Map, Bicluster Projection, Parallel Coordinates | Python |
| BiClust [19] | BiMax, CC, Plaid | Jaccard Index, Constant Variance | Parallel Plot, Heat Map, Bubble Plot | R |
| BicNET [15] | BicNET | None | Biclustering Network Data | Java |
| MTBA [20] | CC, BSGP [25], ISA, OPSM, kSpectral, ITL [29], xMotif, BiMax, Plaid, FLOC [24], BiMax, LAS [33] | Jaccard Index, SB Score, Constant and Sign Variance | Heat Map, Gene Plot | MATLAB |
| CoClust [21] | Modularity Based, Information-Theoretic Based | None | Cluster Plot, Cluster Size, Heat Map, Cluster Graph | Python |
| BicPAMS [22] | BicPAM, BicNET, Bic2PAM, BiP, BiModule | None | Graphical Display, Heat Map | Java |
| BIDEAL (Proposed Toolbox) | CC, BSGP, OPSM, ISA, kSpectral, ITL, xMotif, Plaid, FLOC, BiMax, LAS, FABIA, BitBit [35], BiSim [36], MSVD [37], QUBIC, ROBA [39] | Jaccard Index, SB Score, Constant and Sign Variance, Hausdorff, MSE | Heat Map, Gene Plot, Cluster Plot, Numerical Matrix | MATLAB |

toolboxes have all these features incorporated on a single platform.

The paper is arranged as follows: Section 2 presents a brief introduction to the biclustering algorithms embedded in BIDEAL, Section 3 describes the validation indices, Section 4 illustrates the GUI of BIDEAL, and Section 5 provides the results on four standard datasets using BIDEAL. Finally, Section 6 concludes the paper.

2 BIDEAL: READY FOR USE BICLUSTERING ALGORITHMS

This section provides a brief overview of the biclustering algorithms embedded in BIDEAL.

Cheng and Church (CC) [23] proposed an algorithm to process expression data on the basis of the Mean Squared Residue (MSR) score, defined as

\[
\mathrm{MSR} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2 \tag{2}
\]

where $a_{iJ}$, $a_{Ij}$, and $a_{IJ}$ denote the row mean, the column mean, and the overall mean of the bicluster, respectively. MSR measures the coherency of genes and conditions using mean values and is used to extract δ-biclusters. Another effective algorithm, FLexible Overlapped biClustering (FLOC) [24], performs probabilistic steps to find overlapped biclusters, which are further refined using the MSR score to overcome the effect of missing values in biclusters. The missing values often create random disturbances which affect the quality and slow down the operation of biclusters

Contributions: This paper introduces the