Application of Biclustering Algorithms to Biological Data
Total Page:16
File Type:pdf, Size:1020Kb
APPLICATION OF BICLUSTERING ALGORITHMS TO BIOLOGICAL DATA MASTERS THESIS Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of the Ohio State University By Kemal Eren, BS Graduate Program in Computer Science The Ohio State University 2012 Thesis Committee: Umit¨ V. C¸ atalyurek,¨ Advisor Srinivasan Parthasarathy c Copyright by Kemal Eren 2012 ABSTRACT Microarrays have made it possible to cheaply collect large gene expression datasets. Biclustering has become established as a popular method for mining patterns in these datasets. Biclustering algorithms simultaneously cluster rows and columns of the data matrix; this approach is well suited to gene expression data because genes are not related across all samples, and vice versa. In the past decade many biclustering algo- rithms that specifically target gene expression data have been published. However, only a few are commonly used in bioinformatics pipelines. There are a few reasons for this omission: implementations for only a small fraction of these algorithms have been published. Those that have been published have different interfaces, and there are few comparisons of algorithms or guidelines for choosing among them in the literature. In this thesis we address three problems: the development of an efficient and effec- tive biclustering algorithm, the development of a software framework for biclustering tasks, and a comprehensive benchmark of biclustering techniques. We improved the Correlated Patterns Biclustering (CPB) algorithm’s running time and accuracy by modifying its heuristic for evaluating rows and columns for inclusion in a bicluster. This calculation was previously performed by an iterative approach, but we developed a more computationally efficient method. We further improved CPB by removing unnecessary parameters and developing a nonparametric method for filtering irrelevant biclusters. ii To provide a common interface and also enable comparison of biclustering algo- rithms, we developed a Python package for bicluster analysis, which we introduce in this thesis. This package, BiBench, provides wrappers to twelve biclustering al- gorithms, as well as functionality for generating synthetic data, downloading gene expression data, transforming datasets, and validating biclusters. Using BiBench we compared twelve algorithms, including the modified version of CPB. The algorithms were tested on synthetic datasets for their ability to recover specific bicluster models, resist noise, recover multiple biclusters, and recover over- lapping biclusters. They were also tested on gene expression data; gene ontology enrichment was used to identify biologically relevant biclusters. iii For Susan Slattery iv ACKNOWLEDGMENTS This work would have been impossible without the help of the people, both past and present, at the HPC Lab. In particular I would like to acknowledge my adviser, Umit¨ C¸ atalyurek,¨ for all of his guidance and support. v VITA 1986 .................................. Born in Cleveland, OH. 2009 .................................. B.S. in Biology, minors in Mathematics and Computer Science, The University of Michigan. 2009-Present .......................... Joined the High Performance Computing Laboratory, The Ohio State University. FIELDS OF STUDY Major Field: Computer Science and Engineering Specialization: Machine Learning, Data Mining, Bioinformatics vi TABLE OF CONTENTS Abstract ....................................... ii Dedication ...................................... iii Acknowledgments .................................. v Vita ......................................... vi List of Figures ................................... ix List of Tables .................................... xi CHAPTER PAGE 1 Introduction ................................. 1 2 Background ................................. 5 2.1 Notation ................................ 7 2.2 Algorithms ............................... 8 2.2.1 Cheng and Church ...................... 8 2.2.2 OPSM ............................. 9 2.2.3 xMOTIFs ........................... 10 2.2.4 Spectral ............................ 10 2.2.5 ISA ............................... 10 2.2.6 Plaid .............................. 11 2.2.7 BiMax ............................. 11 2.2.8 Bayesian Biclustering (BBC) ................. 11 2.2.9 QUBIC ............................. 12 2.2.10COALESCE .......................... 12 2.2.11FABIA ............................. 12 3 The Correlated Patterns Biclustering Algorithm ............. 14 3.1 Original algorithm ........................... 17 3.1.1 The tendency vector ..................... 18 3.1.2 Updating rows of a bicluster ................. 23 vii 3.1.3 Updating columns of a bicluster ............... 24 3.2 Improvements ............................. 26 3.2.1 Improving tendency vector calculation ........... 26 3.2.2 A lower bound for the minimum pairwise correlation ... 29 3.2.3 Removing the row-to-column ratio .............. 32 3.2.4 Filtering biclusters found by chance ............. 33 3.3 Implementations ............................ 34 4 BiBench: a framework for biclustering analysis .............. 37 4.1 Examples of use ............................ 37 4.2 Related work .............................. 38 4.3 Overview of functionality ....................... 39 4.3.1 Algorithms ........................... 39 4.3.2 Datasets ............................ 39 4.3.3 Validation ........................... 40 4.3.4 Visualization ......................... 40 4.4 Availability ............................... 40 5 Comparison of biclustering algorithms .................. 42 5.1 Related work .............................. 43 5.2 Algorithms ............................... 44 5.2.1 Parameter selection ...................... 44 5.3 Synthetic dataset experiments .................... 46 5.3.1 Synthetic data model..................... 46 5.3.2 Bicluster similarity measures ................. 47 5.3.3 Model experiment ....................... 49 5.3.4 Noise experiment ....................... 53 5.3.5 Number experiment ...................... 55 5.3.6 Overlap experiment ...................... 57 5.4 Microarray dataset experiments .................... 59 5.4.1 Gene Ontology enrichment analysis ............. 60 5.4.2 GDS results .......................... 61 5.5 Discussion ............................... 67 6 Conclusion and future work ........................ 69 Bibliography .................................... 70 viii LIST OF FIGURES FIGURE PAGE 1.1 Illustration of a bicluster made contiguous by row and column exchanges. 2 2.1 Three types of bicluster structures. Note that biclusters may not be continuous, as in these figures, until the appropriate row and column permutation. ............................... 6 3.1 Effect of noise on correlation of rows of ten 50x50 matrices with per- fectly correlated rows. Top: pairwise correlation density. Bottom: correlation density with tendency vector. ’Random’ is a control for comparison: randomly generated matrices with no special correlation between rows. ............................... 22 3.2 Relationship between Corr and Error on random vectors with 50 elements. ................................. 25 3.3 Row expression levels before and after making all pairwise correlations positive. .................................. 26 3.4 Running time: iterative vs single-pass tendency vector calculation .. 27 3.5 Scores: iterative vs single-pass tendency vector calculation ...... 28 3.6 The cone formed by tendency vector t and angle θ = arccos(ρmin) .. 30 3.7 ρmin as a function of ρpw......................... 31 3.8 Effect of row-to-column ratio on CPB’s scores for three different bi- cluster sizes: 25, 50, and 100 rows. (Number of bicluster columns were 100, 50, and 25, so that each bicluster always contained 2500 elements. 32 3.9 Improvement of relevance scores after filtering ............. 34 3.10 Running times of original C version and Python version of CPB as the number of rows in the dataset increases. ................ 36 ix 4.1 Four types of visualization supported by BiBench. The first three are performed by the biclust package; the last by BicOverlapper. ..... 41 5.1 The intersection area (dark red) and union area (all shaded areas) involved in the calculation of the Jaccard index. ............ 49 5.2 Bicluster model experiment. Each data point represents the mean of twenty datasets. A score of (1,1) is best. ................ 51 5.3 Results of bicluster model experiment after filtering. ......... 52 5.4 Noise experiments: additive noise. Dots shows the mean score; lines show one standard deviation. ...................... 54 5.5 Number experiment: increasing number of biclusters. ......... 56 5.6 Overlap experiment: increasing amounts of overlap between two clusters. 58 5.7 Number of rows in all GDS biclusters. ................. 62 5.8 Number of rows in enriched GDS biclusters. .............. 63 5.9 Number of rows in enriched, non-overlapping GDS biclusters. .... 64 x LIST OF TABLES TABLE PAGE 5.1 Theoretical support of each bicluster model by algorithm. ...... 48 5.2 GDS datasets ............................... 61 5.3 Aggregated results on all eight GDS datasets. Biclusters were consid- ered enriched if any GO term was enriched with p = 0.05 level after multiple test correction. The set of enriched biclusters was further processed to discard overlapping biclusters. .............. 65 5.4 Five most enriched terms for each algorithm’s best bicluster on GDS589. 66 xi CHAPTER 1 INTRODUCTION Clustering refers to a class of unsupervised