<<

Silesian University of Technology
Faculty of Automatic Control, Electronics and Computer Science
Institute of Informatics

Doctor of Philosophy Dissertation

Bi-clustering – algorithms and applications

Paweł Foszner

Supervisor: prof. dr hab. inż. Andrzej Polański

Gliwice, 2014



To my lovely wife Aleksandra for her full support over those years.



Table of Contents

Acknowledgements ...... 9

1. Introduction ...... 11

2. Aims ...... 13

3. Theses ...... 15

4. Main contribution and original elements of the thesis ...... 16

5. Formulation of main problems ...... 17

5.1. Definition of bi-clusters ...... 17

5.2. Index functions for evaluating quality of bi-clustering systems ...... 22

5.2.1. Mean square residue (MSR) ...... 22

5.2.2. Average Correlation Value (ACV) ...... 22

5.2.3. Average Spearman's rho (ASR) ...... 23

5.3. Stop criteria for bi-clustering algorithms...... 25

5.3.1. Mathematical convergence ...... 25

5.3.2. Connectivity ...... 26

5.3.3. Conditions defined by the user...... 28

6. An overview of bi-clustering methods ...... 29

6.1. Algorithms based on matrix decomposition ...... 29

6.1.1. Based on LSE...... 29

6.1.2. Based on Kullback–Leibler divergence ...... 30

6.1.3. Based on non-smooth Kullback–Leibler divergence...... 30

6.1.4. PLSA ...... 31

6.1.5. FABIA ...... 32

6.2. Algorithms based on bipartite graphs ...... 34

6.2.1. QUBIC...... 34

6.3. Algorithms based on Iterative Row and Column search ...... 36


6.3.1. Coupled Two-Way Clustering (CTWC)...... 36

6.4. Algorithms based on Divide and Conquer approach ...... 37

6.4.1. Block clustering ...... 37

6.5. Algorithms based on Greedy iterative search...... 38

6.5.1. δ-bi-clusters ...... 38

6.6. Algorithms based on Exhaustive bi-cluster enumeration ...... 39

6.6.1. Statistical-Algorithmic Method for Bi-cluster Analysis (SAMBA) ...... 39

6.7. Algorithms based on Distribution parameter identification ...... 40

6.7.1. Plaid Model ...... 40

7. Comparing the results ...... 41

7.1. Similarity measures ...... 41

7.1.1. Jaccard Index ...... 41

7.1.2. Relevance and recovery ...... 42

7.1.3. Consensus score ...... 43

7.2. Hungarian algorithm...... 45

7.3. Generalized Hungarian algorithm ...... 52

7.3.1. Problem formulation ...... 52

7.3.2. Related work ...... 54

7.3.3. Hungarian algorithm ...... 54

7.3.4. Two-dimensional approach ...... 56

7.3.5. Multidimensional approach ...... 61

7.4. Consensus algorithm ...... 64

8. Graphical presentation of results ...... 67

8.1. Presenting bi-clusters ...... 67

8.1.1. BiVoC ...... 67

8.1.2. BicOverlapper ...... 68

8.1.3. BiCluster Viewer ...... 68


8.2. Presenting the results of domain ...... 70

8.2.1. Clusters containing genes ...... 70

8.3. Presenting the results from different experiments...... 71

9. Computational experiments ...... 72

9.1. Environment for data generation and evaluation ...... 72

9.1.1. Data ...... 74

9.1.2. Distributed computing ...... 75

9.1.3. Defining own synthetic matrix ...... 76

9.1.4. Browsing data and results ...... 77

9.1.5. Update functionality ...... 79

9.1.6. Program availability ...... 79

9.2. Third-party software ...... 81

9.3. Data ...... 82

9.3.1. Synthetic data ...... 82

9.3.2. Real data ...... 83

9.4. Computational results ...... 85

9.4.1. Synthetic data ...... 85

9.4.2. Real data ...... 86

10. Conclusions and summary ...... 93

Bibliography ...... 95

List of Symbols and Abbreviations ...... 101

Table of Figures ...... 102

Index of tables ...... 104

Appendix...... 107

A. Synthetic data ...... 107



Acknowledgements

I would like to thank my supervisor for his understanding and patience. Without his help, this thesis would never have been written. I would also like to thank Aleksandra Gruca, who helped me to start the Ph.D. and guided me step by step through my very first publication. Her guidance and advice were very helpful throughout my studies.

This work was supported by the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09).



1. Introduction

Nowadays, we observe very rapid development in the fields of telemetry, biomedical analysis, and many others. As a result of these studies we usually obtain very large and complex data sets. Classical approaches, such as clustering, can extract only part of the relevant information. For example, for gene expression data, which contain information about the expression of genes under different conditions, a simple clustering approach lets us find groups of genes that reveal similar expression under all conditions. Figure 1 shows a comparison between clustering (A) and bi-clustering (B). Even if both techniques find the same cluster of genes, the clustering technique loses information about the conditions under which this group is co-expressed.

Figure 1. Comparison of the classical clustering approach (A) and bi-clustering (B).

Bi-clustering is a technique that, in two-dimensional data, finds a subset of attributes from one dimension that reveals similar behavior only on a subset of attributes from the second dimension.


Figure 2. Simple visualization of bi-clustering.

In very simple words, bi-clustering is about finding sub-matrices in a data matrix, or finding bi-cliques in bipartite graphs (as shown in Figure 2).

Bi-clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The underlying problem belongs to the class of NP-hard problems, and the technique was first presented by Morgan and Sonquist in 1963 [1], then by Hartigan [2] (1972), and by Mirkin in 1996 [3]. In the context of bioinformatics problems, the first to use this technique were Cheng and Church [4]. They proposed bi-clustering of results from microarray experiments by introducing the mean square residue of a bi-cluster. A representative of modern algorithms is QUBIC, introduced by Guojun Li et al. [5]. They proposed a very efficient algorithm for bi-clustering of a matrix with discretized expression data. The authors use a graph representation of the data and, like Cheng and Church, also find bi-clusters with a low mean square residue. Over the years, from the publication of Morgan to this day, a number of different approaches to bi-clustering have arisen. The methods differ from each other both in the approach to modeling the input data (bipartite graphs [5], discrete matrices [6], trees [7]) and in the way of obtaining the final results (exhaustive search [5], decomposition of the matrix [8], graph search [2]).


2. Aims

As is shown in more detail in Chapter 5, we can distinguish multiple classes of bi-clusters with respect to their structure or position in the data matrix. Each case may need a different approach. The task of selecting the appropriate method requires a very good understanding of the data to be analyzed. This is a very difficult task, as complex as the task of choosing the appropriate number of bi-clusters (which is often an input parameter of many bi-clustering algorithms). The process of analyzing data with bi-clustering may look as presented in Figure 3.

Figure 3. Bi-clustering analysis sample workflow.

We are never able to say with absolute certainty that we have data containing bi-clusters of a certain structure. Therefore, the process of obtaining bi-clusters is always an iterative one. Each iteration includes activities related to the selection of parameters and, very often, an attempt to determine the number of bi-clusters. Each of these steps is usually performed manually by the scientist, who is responsible for having excellent knowledge of the analyzed data.


The aims of the thesis were:

- To implement all major literature algorithms for data bi-clustering.
- To apply the implemented algorithms to both artificially created and real datasets.
- To develop methodologies for comparisons of different bi-clustering algorithms and to draw conclusions stemming from using these methodologies on artificial and real datasets.
- To introduce improvements in bi-clustering ideas. The main improvement proposed in the thesis is an algorithm which can be applied to any type of data with any bi-cluster structure. The proposed algorithm is a meta-algorithm which uses ensemble methodology ideas. Later in the thesis it is shown that ensemble approaches relying on the combination of results of different algorithms (specialized for various applications) make the quality of the outcome resistant to the bi-cluster structure.

The strategy of the performed research was oriented towards simplifying the bi-clustering analysis to a pipeline that is as simple as possible: providing data on the input and getting the results on the output (Figure 4). The role of the user in this system is limited to loading the input data; however, the user may also adjust the parameters used in the analysis.

[Figure 4 workflow: Obtaining data → Automatic analysis → Research completed]

Figure 4. Simplified bi-clustering analysis workflow.

The key idea of the proposed method is to compute a large number of bi-clustering algorithms, each of which is specialized in different kinds of bi-clusters. The algorithms used in my analysis are described in Chapter 6. Then, the results of these methods are combined into one. For this purpose, a similarity measure for single bi-clusters (Chapter 7.1) and a modification of the Hungarian algorithm for pairing them (Chapter 7.3) have been proposed.


Once paired, we obtain as the result a set of sets of bi-clusters. Within a single set of bi-clusters, the last step of the algorithm is to connect the bi-clusters composing it into a single one (Chapter 7.4).

3. Theses

On the basis of the performed research the following theses are claimed:

1. The elaborated methodology for comparisons of bi-clustering results, based on the generalized Munkres algorithm, is an efficient and flexible tool well suited for analyses of real datasets.
2. The elaborated meta bi-clustering algorithm improves the performance of bi-clustering.


4. Main contribution and original elements of the thesis

This work is a development of the work carried out during the last five years. The first large-scale analysis was performed in 2009 on data from a tumor tissue bank of patients receiving radiotherapy. The result of this work was the system described and published as a chapter in a monograph [9]. The project aimed at visualization of the data and carrying out simple online statistical analyses. The experience gained while working on this project allowed me to analyze more complex aspects of machine learning. The next research projects were related to clustering and classification issues in microarray data. The first was a system based on [10], which was designed to choose an appropriate clustering or classification method for the provided data. The second project was a system for microarray re-annotation [11]. Microarray data consist of gene expression values taken under different conditions. During the work on gene clustering it became very clear that the key to success would be appropriate extraction of attributes (conditions). Clusters of genes of good quality were obtained only after elimination of the irrelevant ones. This clearly indicated that those clusters reveal some similarity and are recognizable as a group only on a subset of conditions. This raised the issue of bi-clustering.

The first attempt at bi-clustering analysis was made in the publication describing a project on dynamic clustering of documents [12]. The system assumes that documents provided on the input are translated into word-occurrence matrices. On this basis the system performs a bi-clustering analysis and extracts the aspects which appear in the documents. Such information was further used for search purposes. The whole system was based on the non-negative matrix factorization algorithms described in Chapter 6.1. The last two projects were presented at conferences and concerned comparisons of bi-clustering algorithms [13] and a distributed system for running bi-clustering experiments [14]. Those projects are described later in this work.


5. Formulation of main problems

5.1. Definition of bi-clusters

Figure 5. Bi-cluster types: 1) Constant, 2) Constant on columns, 3) Constant on rows, 4) Coherent values (additive model), 5) and 6) Coherent values (multiplicative model) 7) Coherent evolutions

The notation was taken from the paper by Madeira & Oliveira [15], where a bi-cluster is defined by a subset of rows and a subset of columns of the data matrix. Given the data matrix V with a set of rows (X) and a set of columns (Y), a bi-cluster (B) is defined by a sub-matrix (I, J), where I is a subset of X and J is a subset of Y.

$$V = (X, Y),$$

$$V = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{M1} & a_{M2} & \cdots & a_{MN} \end{bmatrix}, \qquad X_i = [a_{i1}\ a_{i2}\ \cdots\ a_{iN}], \qquad Y_j = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{Mj} \end{bmatrix}$$

$$B = (I, J), \qquad I \subseteq X,\ J \subseteq Y$$

A single bi-clustering experiment outputs a result R consisting of K bi-clusters, where K is a number which, depending on the algorithm used, can be a parameter given by the user or a number formed as a result of executing the selected method.

$$R = \{B_i = (I_i, J_i)\}, \quad i = 1 \ldots K, \quad \forall i: I_i \subseteq X,\ J_i \subseteq Y$$
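The notation above can be mirrored directly in code. The following minimal sketch (Python; the class and field names are illustrative and do not come from any particular library) shows one possible representation of a bi-cluster B = (I, J) and of a result set R:

from dataclasses import dataclass

@dataclass(frozen=True)
class Bicluster:
    rows: frozenset   # I, a subset of the row indices of the data matrix V
    cols: frozenset   # J, a subset of the column indices of V

# Example: a bi-cluster spanning rows {0, 2, 5} and columns {1, 3}
B = Bicluster(rows=frozenset({0, 2, 5}), cols=frozenset({1, 3}))
R = [B]               # a bi-clustering result with K = 1 bi-cluster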

Determining the exact number of clusters in the data is a difficult task. Usually the user tries a range of values, so that some index function determining the quality of the bi-clusters is maximized. Examples of quality indexes are described in Chapter 5.2.

We distinguish several classes of bi-clusters (Figure 5):

- Bi-clusters with constant values (1). A perfect bi-cluster in this class is one whose values match the following formula:

푎푖푗 = 휇

Where:
o 휇 – the typical value within the bi-cluster.

This is the easiest type of bi-cluster to find, because it can be read directly from the data matrix. Algorithms specialized in this task usually group similar rows and columns, splitting the original matrix into smaller matrices in which they check the variance within the bi-cluster. Such an approach is able to find clusters with the same value over the whole bi-cluster, but it is not very resistant to noise in the data. The first attempt at finding constant bi-clusters was "Block Clustering" by Hartigan [2]. He implemented the approach described in the previous paragraph – splitting the data matrix into smaller matrices and then computing the variance over their elements:

$$VAR(I, J) = \sum_{i \in I, j \in J} (a_{ij} - a_{IJ})^2$$

Where $a_{IJ}$ is the average value in the bi-cluster. To avoid the situation in which the algorithm splits the data matrix into sub-matrices containing only one element, the author added a stop criterion for the maximum number K of bi-clusters:

$$VAR(I, J)_K = \sum_{k=1}^{K} \sum_{i \in I_k, j \in J_k} (a_{ij}^k - a_{IJ}^k)^2$$

Where $a_{IJ}^k$ is the average value in bi-cluster k. Tibshirani et al. [16] proposed another variance-based algorithm as a modification of Hartigan's [2]. Their modification introduced a backward pruning method for the splitting and a permutation-based method for finding the optimal number of bi-clusters. Another algorithm worth mentioning for finding constant bi-clusters is Double Conjugated Clustering by Busygin et al. [17]. It is a two-way clustering method which performs simple clustering and then computes similarities between rows and columns, which leads to finding constant bi-clusters.

- Bi-clusters with constant values on rows or columns (3 or 2). A perfect bi-cluster in this class is one whose values match the following formula:

푎푖푗 = 휇 + 훼푖 or 푎푖푗 = 휇 + 훽푗

Where:
o 휇 – the typical value within the bi-cluster,
o 훼푖 – the adjustment for row i,
o 훽푗 – the adjustment for column j.

The task of detecting clusters with constant rows or columns is very similar to the detection of constant clusters; it can easily be reduced to it by normalizing the rows or the columns by the row or column mean, respectively.

- Bi-clusters with coherent values (additive model) (4). In the literature also known as "shift bi-clusters". A perfect matrix with coherent values in the additive model follows the expression:

푎푖푗 = 휇 + 훼푖 + 훽푗,

Where:
o 휇 – the typical value within the bi-cluster,
o 훼푖 – the adjustment for row i,
o 훽푗 – the adjustment for column j.

- Bi-clusters with coherent values (multiplicative model) (5, 6). In the literature also known as "scale bi-clusters". A perfect matrix with coherent values in the multiplicative model follows the expression:

푎푖푗 = 휇 ∗ 훼푖 ∗ 훽푗 ,

Where:
o 휇 – the typical value within the bi-cluster,
o 훼푖 – the adjustment for row i,
o 훽푗 – the adjustment for column j.

- Bi-clusters with coherent evolutions (7). In the literature also known as shift-scale bi-clusters. This is definitely the most difficult type of cluster to explore. This point is proven by Kemal Eren [18] in his comparative analysis of bi-clustering algorithms: bi-clusters with coherent evolutions posed a difficulty for almost every analyzed algorithm. Only the CPB algorithm [19] performs quite well on this type of data. In Chapter 7 I try to reproduce and extend their results. The formulas describing data in bi-clusters with coherent evolutions are:

푎푖푗 = 휇 ∗ 훼푖 ∗ 훽푗 + 휌푖 + 훾푗 (scale-shift model)

푎푖푗 = (휇 + 휌푖 + 훾푗 ) ∗ 훼푖 ∗ 훽푗 (shift-scale model)

Where:
o 휇 – the typical value within the bi-cluster,
o 훼푖 – the scaling parameter for row i,
o 훽푗 – the scaling parameter for column j,
o 휌푖 – the shifting parameter for row i,
o 훾푗 – the shifting parameter for column j.

- Plaid model bi-clusters (not included in Figure 5). In the literature also known as the general additive model. Algorithms specialized in this type of bi-clusters can be useful in the case of data presented in Figure 6 (plot number 5). The plaid model consists of a background layer and a series of coherent layers:

$$a_{ij} = (\mu_0 + \alpha_{i0} + \beta_{j0}) + \sum_{k=1}^{K} (\mu_k + \alpha_{ik} + \beta_{jk}) \cdot \delta_{ik} \cdot \omega_{jk}$$

Where:
o $\mu_0$ – the typical value of the background layer,
o $\mu_k$ – the typical value within bi-cluster k,
o $\alpha_{i0}$ – the shifting parameter for row i in the background,
o $\beta_{j0}$ – the shifting parameter for column j in the background,
o $\alpha_{ik}$ – the shifting parameter for row i in bi-cluster k,
o $\beta_{jk}$ – the shifting parameter for column j in bi-cluster k,
o $\delta_{ik}$ – the binary indicator of membership of row i in bi-cluster k,
o $\omega_{jk}$ – the binary indicator of membership of column j in bi-cluster k.


Each of the above formulas describing the data structure of bi-clusters relates to the ideal case where the data do not contain any noise. Unfortunately, real life is not perfect and data without noise do not exist. This should be taken into account in the formulas:

푎푖푗 = 푎푖푗 + 휀푖푗

Where 휀푖푗 is random noise in cell from 푖 row and 푗 column.

Figure 6. Bi-cluster structures.

Bi-clusters can also be divided according to their structure. Figure 6 shows example structures: 1) a single bi-cluster, 2) bi-clusters exclusive on columns and rows. These two types, as a matter of fact, do not require the use of bi-clustering methods; to find them, it is sufficient to use a classic clustering approach. Where bi-clustering is most useful is in more complex structures such as: 3) overlapping columns, 4) overlapping rows, 5) overlapping in both dimensions.


5.2. Index functions for evaluating quality of bi-clustering systems

5.2.1. Mean square residue (MSR)

This score was proposed by Cheng and Church [4] in 2001. It can be applied to results where the bi-cluster structure is known and is constant (over the whole bi-cluster, or only with constant columns/rows). If we have subsets I and J, then we can compute the residue of each element $a_{ij}$ (a single element of the matrix indicated by the subsets I and J):

$$a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}$$

Where:
- $a_{ij}$ – the value of the element in the i-th row and j-th column of the bi-cluster,
- $a_{iJ}$ – the mean of the i-th row,
- $a_{Ij}$ – the mean of the j-th column,
- $a_{IJ}$ – the mean of all elements in the bi-cluster.

The formula for MSR is defined as follows:

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

where

$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \qquad a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$$

and

$$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij} = \frac{1}{|I|} \sum_{i \in I} a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{Ij}$$

The mean square residue is the variance of the set of all elements in the bi-cluster. It should be zero or close to zero for a constant bi-cluster, or below a certain threshold in general. This measure is suitable for bi-clusters with constant values or coherent values in the additive model.
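The definition of H(I, J) translates directly into a few lines of code. The sketch below (Python with NumPy; the function name msr is illustrative) assumes the data matrix is a NumPy array and the bi-cluster is given by lists of row and column indices:

import numpy as np

def msr(V, rows, cols):
    # Mean square residue H(I, J) of the sub-matrix of V given by index sets I (rows) and J (cols)
    A = V[np.ix_(rows, cols)]
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    residue = A - row_means - col_means + overall
    return (residue ** 2).mean()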

5.2.2. Average Correlation Value (ACV)

The ACV measure was proposed by Li Teng and Laiwan Chan [20] in 2007. The authors assume that a bi-cluster should be a subset of attributes from both dimensions that are highly correlated. Based on this assumption, the ACV value of bi-cluster A is calculated using the following formula:

$$R(A) = \max \left\{ \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |r\_row_{ij}| - n}{n^2 - n},\ \frac{\sum_{k=1}^{m} \sum_{l=1}^{m} |r\_col_{kl}| - m}{m^2 - m} \right\}$$

$$R(A) \in [0, 1]$$

Where:
- $r\_row_{ij}$ – the correlation between the i-th and j-th rows,
- $r\_col_{kl}$ – the correlation between the k-th and l-th columns.

A large value of R(A) means that the rows and columns of bi-cluster A are highly correlated with each other. The measure is suitable for constant, additive and multiplicative bi-clusters.
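The formula above translates almost literally into code. A minimal sketch (Python/NumPy; the function name acv is illustrative, and it assumes the bi-cluster has at least two rows and two columns and no constant rows or columns, since the correlation is undefined there):

import numpy as np

def acv(A):
    # Average Correlation Value of a bi-cluster given as a 2-D array A (rows x columns)
    n, m = A.shape
    r_rows = np.corrcoef(A)        # n x n correlations between rows
    r_cols = np.corrcoef(A.T)      # m x m correlations between columns
    row_term = (np.abs(r_rows).sum() - n) / (n * n - n)
    col_term = (np.abs(r_cols).sum() - m) / (m * m - m)
    return max(row_term, col_term)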

5.2.3. Average Spearman's rho (ASR)

This measure was proposed by Wassim Ayadi et al. [21] in response to the previous measure [20]. The authors change the way the correlation is computed in order to improve the results. The formula is as follows:

$$ASR(A) = 2 \cdot \max \left\{ \frac{\sum_{i=1}^{n} \sum_{j=i+1}^{n} p_{ij}}{|I|(|I|-1)},\ \frac{\sum_{k=1}^{m} \sum_{l=k+1}^{m} p_{kl}}{|J|(|J|-1)} \right\}$$

$$-1 \le ASR(A) \le 1$$

Where $p_{ij}$ is Spearman's rank correlation, used to express the correlation between two vectors (i.e. $X_i = (x_1^i, x_2^i, \ldots, x_m^i)$ and $X_j = (x_1^j, x_2^j, \ldots, x_m^j)$), defined as follows:

$$p_{ij} = 1 - \frac{6 \sum_{k=1}^{m} \left( r_k(x_k^i) - r_k(x_k^j) \right)^2}{m(m^2 - 1)}$$

Where $r_k(x_k^i)$ (resp. $r_k(x_k^j)$) is the rank of $x_k^i$ (resp. $x_k^j$).


The measure is suitable for bi-clusters of any type (Figure 5). It yields a value of 1 when the attributes within the bi-cluster are highly correlated, and a value of -1 when they are correlated very weakly.
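A possible implementation of ASR follows the same pattern, using SciPy's spearmanr to obtain the rank correlation matrices. The sketch assumes the bi-cluster has at least three rows and three columns (so that spearmanr returns full correlation matrices); the function name asr is illustrative:

import numpy as np
from scipy.stats import spearmanr

def asr(A):
    # Average Spearman's rho of a bi-cluster given as a 2-D array A (|I| x |J|)
    n, m = A.shape
    rho_rows, _ = spearmanr(A, axis=1)   # n x n rank correlations between rows
    rho_cols, _ = spearmanr(A, axis=0)   # m x m rank correlations between columns
    iu_r = np.triu_indices(n, k=1)       # upper triangle: pairs i < j
    iu_c = np.triu_indices(m, k=1)       # upper triangle: pairs k < l
    row_term = rho_rows[iu_r].sum() / (n * (n - 1))
    col_term = rho_cols[iu_c].sum() / (m * (m - 1))
    return 2 * max(row_term, col_term)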

The following table shows how useful these measures are depending on the type of bi-cluster (from Figure 5). As can clearly be seen, the MSR measure is good only for constant or additive bi-clusters. The ACV and ASR measures are suitable for all types of bi-clusters, but ASR is slightly better than ACV in the case of bi-clusters with coherent evolutions.

Table 1. Comparison of evaluation functions on the bi-cluster types from Figure 5.

Function | Expected value |  1 |  1 |  2 |  3 |  4 |    5 |     6 |      7
MSR      |              0 |  0 |  0 |  0 |  0 |  0 | 0.62 | 2.425 | 131.87
ACV      |              1 |  1 |  1 |  1 |  1 |  1 |    1 |     1 |   0.84
ASR      |              1 |  1 |  1 |  1 |  1 |  1 |    1 |     1 |   0.99


5.3. Stop criteria for bi-clustering algorithms

There are many different methods with different approaches to the bi-clustering task; some of them are described in Chapter 6. After selecting the appropriate method and setting the parameters, the final decision of the user, prior to starting the experiment, is how and when it is going to end. There is no problem if the chosen method is based on an exhaustive enumeration of columns and rows; the end condition in such methods is natural and the user cannot change it. But there are many methods where the end condition must be defined, since the theoretical assumptions imply an infinite number of steps. The most popular approaches to this issue are presented below.

5.3.1. Mathematical convergence

Figure 7. Sample function of change in distance function vs step number.

Convergence is the most natural and intuitive condition for ending update-rule-based algorithms. A good example of such algorithms are those described in the paper by Seung and Lee [8]. The authors proposed two methods based on multiplicative update rules for minimizing distance functions which represent how the data matrix differs from the factor matrices. Their idea for bi-clustering is to provide a factorization of the data matrix A into a product of factor matrices W and H. Bi-clusters are extracted from the factor matrices. Generally, this approach to determining the end of the analysis is applicable to all methods based on distance or divergence functions.

퐴 = 푊퐻

The first step of this algorithm is to initialize the matrices with random values. Then proper update rules, designed with respect to the assumed distance function, are applied to minimize the distance from the factors to the data matrix.

The only drawback of this approach to determining the end of the analysis is time complexity. As presented in Figure 7, the rate of change of the distance function in subsequent steps decreases very quickly but theoretically never reaches zero. In practice this rate is limited by computer precision, but reaching it is impractical due to the long time needed. As clearly shown in the figure, after a certain number of steps the change is imperceptible.
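In practice this observation turns into a simple stopping rule on the recorded values of the distance function. The sketch below is only one possible formulation (the tolerance and window length are illustrative parameters, not values used in the thesis):

def converged(dist_history, tol=1e-6, window=10):
    # Stop when the relative change of the distance function over the last
    # `window` iterations drops below `tol`.
    if len(dist_history) <= window:
        return False
    old, new = dist_history[-window - 1], dist_history[-1]
    return abs(old - new) <= tol * max(abs(old), 1e-12)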

5.3.2. Connectivity matrix

In the case of methods based on matrix decomposition [8], we are more interested in the order of the values than in the exact values. For example, in the non-negative matrix factorization algorithms described above, a single bi-cluster is obtained using a single column vector of matrix W and a single row vector of matrix H. The product of those two vectors represents the data matrix with only the one selected bi-cluster (Figure 8).

Figure 8. Bi-cluster extraction in NMF algorithms.


In the ideal case, the non-zero components of the first vector represent the rows of the data matrix involved in the bi-cluster, and the non-zero components of the second vector represent the columns that are involved. Unfortunately, life is not perfect and, due to noise both inside and outside of the bi-cluster, very often none of the values are zero. But the attributes that actually take part in the resulting cluster should have significantly higher values than the attributes not involved.

A way to select only the relevant attributes is normalization followed by determining a cut-off threshold. One type of such a threshold is keeping the first n elements. With the task defined this way, we can assume that the order of the attributes matters more than the exact values. The order is determined using the exact values: the greatest value is in the first place, the smallest in the last. It can then be assumed that if, within a specific number of steps, the order of attributes in all the vectors does not change, we have achieved convergence.

$$\begin{bmatrix}1\\3\\4\\2\end{bmatrix} \Rightarrow \begin{bmatrix}2\\3\\1\\4\end{bmatrix} \Rightarrow \begin{bmatrix}4\\2\\3\\1\end{bmatrix} \Rightarrow \cdots \Rightarrow \begin{bmatrix}4\\1\\2\\3\end{bmatrix} \Rightarrow \begin{bmatrix}4\\1\\2\\3\end{bmatrix} \Rightarrow \begin{bmatrix}4\\1\\2\\3\end{bmatrix}$$
$$\text{Step } 1 \qquad \text{Step } 2 \qquad \text{Step } 3 \qquad \cdots \qquad \text{Step } n-2 \qquad \text{Step } n-1 \qquad \text{Step } n$$

The above example illustrates the way in which convergence is determined. After the first step, the sample vector of the analyzed matrix contains attributes organized in the following way:

- the attribute in the first place has rank 1 because it has the greatest value,
- the attribute in the fourth place has rank 2 because its value is less than that of the attribute with rank 1 and greater than the rest,
- the attribute in the second place has rank 3 because its value is greater than that of only one attribute,
- the attribute in the third place has rank 4 because it has the lowest value.

This order is calculated after each step. If it does not change for a specified number of steps, the calculations are considered to be terminated.
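One way to implement this rank-order criterion is to compare the argsort of the factor vectors across iterations. The sketch below is an assumption about one possible implementation (names, the per-column ordering and the patience parameter are illustrative), not the exact procedure used in the cited methods:

import numpy as np

def order_stable(history, patience=20):
    # history: list of factor matrices (e.g. H) recorded after each update step.
    # Returns True when the ranking of every column has not changed for
    # `patience` consecutive iterations.
    if len(history) <= patience:
        return False
    reference = np.argsort(history[-1], axis=0)
    return all(np.array_equal(np.argsort(h, axis=0), reference)
               for h in history[-patience - 1:-1])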


5.3.3. Conditions defined by the user.

The approaches described above, despite reducing the computation time from infinite to finite, have one drawback: although the conditions are defined exactly, it is impossible to determine in advance how long it will take to finish the experiment. In order to prevent too long a waiting time, the user typically specifies one or more of the following conditions:

- maximum number of steps,
- maximum duration of the experiment,
- minimum value of the bi-cluster quality function.


6. An overview of bi-clustering methods

6.1. Algorithms based on matrix decomposition

A very wide group of algorithms are those based on data matrix decomposition. In such methods the data matrix (A) is factorized into (usually) much smaller matrices. Such a decomposition is much easier to analyze because of the much smaller matrices, and the obtained matrices reveal previously hidden features. These algorithms are often called NMF algorithms, where NMF stands for non-negative matrix factorization. Two efficient algorithms were introduced by Seung and Lee [8]: the first minimizes the conventional least square error distance function and the second the generalized Kullback–Leibler divergence. The third and last from this group is an algorithm that slightly modifies the second approach: the authors [22] introduce a smoothing matrix to achieve a high degree of sparseness and better interpretability of the results. The data matrix in these techniques is factorized into (usually) two smaller matrices:

퐴 ≈ 푊퐻

Finding the exact solution is a computationally very difficult task. Instead, the existing solutions focus on finding local extrema of the function describing the fit of the model to the data. Some examples of such distance and divergence functions are given below.

6.1.1. Based on LSE.

Distance function:

$$\|V - WH\|^2 = \sum_{ij} \left( V_{ij} - (WH)_{ij} \right)^2$$

Update rules:

$$H_{ij} \leftarrow H_{ij} \frac{(W^T V)_{ij}}{(W^T W H)_{ij}}$$

$$W_{ij} \leftarrow W_{ij} \frac{(V H^T)_{ij}}{(W H H^T)_{ij}}$$
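For illustration, the two update rules can be iterated as follows. This is a minimal Python/NumPy sketch of Lee–Seung-style multiplicative updates; the small constant eps, the number of steps and the random initialization are illustrative choices, not parameters taken from the thesis:

import numpy as np

def nmf_lse(V, k, steps=500, eps=1e-9, seed=0):
    # Multiplicative-update NMF minimizing ||V - WH||^2; eps only guards against division by zero.
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(steps):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H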


6.1.2. Based on Kullback–Leibler divergence

Divergence function:

$$D(V \| WH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)$$

Update rules:

$$H_{ij} \leftarrow H_{ij} \frac{\sum_k W_{ki} V_{kj} / (WH)_{kj}}{\sum_l W_{li}}$$

$$W_{ij} \leftarrow W_{ij} \frac{\sum_k H_{jk} V_{ik} / (WH)_{ik}}{\sum_l H_{jl}}$$

6.1.3. Based on non-smooth Kullback–Leibler divergence.

Divergence function:

$$D(V \| WSH) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{(WSH)_{ij}} - V_{ij} + (WSH)_{ij} \right)$$

The update rules for this method are the same as in the previous one, but in the update rule for H we substitute WS for W, and in the update rule for W we substitute SH for H. The smoothing matrix S looks as follows:

$$S = (1 - \theta) I + \frac{\theta}{q} \mathbf{1}\mathbf{1}^T$$

Where I is the identity matrix, $\mathbf{1}$ is a vector of ones, and $\theta$ should meet the condition $0 \le \theta \le 1$.

Another type of NMF-like algorithms are those based on the expectation-maximization method. Because of this approach, the distance function is replaced by a likelihood function. Examples of such methods are given below.


6.1.4. PLSA

PLSA stands for Probabilistic Latent Semantic Analysis. It was introduced by Thomas Hofmann [1] and is based on maximizing a log-likelihood function. For this purpose the author uses the Expectation-Maximization algorithm [5]. The formulas for computing the results are:

Log-likelihood function:

$$E[L^c] = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k | d_i, w_j) \log \left[ P(w_j | z_k) P(z_k | d_i) \right]$$

E-step:

$$P(z_k | d_i, w_j) = \frac{P(w_j | z_k) P(z_k | d_i)}{\sum_{l=1}^{K} P(w_j | z_l) P(z_l | d_i)}$$

M-step:

$$P(w_j | z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k | d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m) P(z_k | d_i, w_m)}$$

$$P(z_k | d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k | d_i, w_j)}{n(d_i)}$$

The author explains the meaning of these formulas using an example. Factor $w_j$ represents one word from a vocabulary that contains M words. Factor $d_i$ represents one of N documents, and $z_k$ denotes an aspect. The expression $n(d_i)$ denotes the number of words in document i, and $n(d_i, w_j)$ denotes the number of occurrences of word j in document i.

Translating the data generation process into a joint probability model results in the expression:

$$P(w_j | d_i) = \sum_{k=1}^{K} P(w_j | z_k) P(z_k | d_i)$$

In the above equation all possible probabilities $P(w_j | d_i)$ form a data matrix (in our notation V) with M rows and N columns. The authors assume that this matrix contains K bi-clusters. The data matrix is factorized into two smaller matrices. The first one has M rows and K columns, and represents the probability of occurrence of a word in the context of an aspect. The second consists of K rows and N columns, and represents the probability of an aspect in the document. A single bi-cluster is contained in the matrix formed from the product of the k-th column of the first matrix and the k-th row of the second.

6.1.5. FABIA

FABIA stands for Factor Analysis for BIcluster Acquisition. The algorithm was introduced by Hochreiter et al. [23] and is based on the Expectation-Maximization algorithm.

E-step

$$E(z_j | x_j) = (\Lambda^T \Psi^{-1} \Lambda + \Xi_j^{-1})^{-1} \Lambda^T \Psi^{-1} x_j$$

and

$$E(z_j z_j^T | x_j) = (\Lambda^T \Psi^{-1} \Lambda + \Xi_j^{-1})^{-1} + E(z_j | x_j) E(z_j | x_j)^T$$

Where $\Xi_j$ stands for $\mathrm{diag}(\xi_j)$, and the update for $\xi_j$ is:

$$\xi_j = \mathrm{diag}\left( \sqrt{E(z_j z_j^T | x_j)} \right)$$

M-step:

$$\Lambda^{new} = \frac{\frac{1}{l} \sum_{j=1}^{l} x_j E(z_j | x_j)^T - \frac{\alpha}{l} \Psi\, \mathrm{sign}(\Lambda)}{\frac{1}{l} \sum_{j=1}^{l} E(z_j z_j^T | x_j)}$$

$$\mathrm{diag}(\Psi^{new}) = \mathrm{diag}\left( \frac{1}{l} \sum_{j=1}^{l} x_j x_j^T - \Lambda^{new} \frac{1}{l} \sum_{j=1}^{l} E(z_j | x_j) x_j^T \right) + \mathrm{diag}\left( \frac{\alpha}{l} \Psi\, \mathrm{sign}(\Lambda) (\Lambda^{new})^T \right)$$

Where:

- $z$ – vector of factors,
- $x$ – sample from the data matrix,
- $\Lambda$ – sparse prototype matrix,
- $\Psi$ – covariance matrix expressing independent noise,
- $\xi$ – variational parameter,
- $l$ – number of factors.

Data initialization:

1) vectors $\xi_j$ with ones,
2) $\Lambda$ randomly,
3) $\Psi = \mathrm{diag}(\max(\delta, \mathrm{covar}(x) - \Lambda\Lambda^T))$.

The model likelihood is defined as follows:

$$p(x | \Lambda, \Psi) = \int p(x | z, \Lambda, \Psi)\, p(z)\, dz$$

Where:

$$p(z) = \left( \frac{1}{\sqrt{2}} \right)^p \prod_{i=1}^{p} e^{-\sqrt{2}\, |z_i|}$$

The likelihood function introduces a model family parameterized by $\xi$, where the maximum over the models in this family is the true likelihood:

$$\arg\max_{\xi}\ p(x | \xi) = p(x)$$


6.2. Algorithms based on bipartite graphs

6.2.1. QUBIC

QUBIC stands for QUalitative BIClustering algorithm. It was proposed by Guojun Li et al. [5] as a very efficient algorithm for the analysis of gene expression data. The authors proposed a weighted graph representation of discretized expression data. The expression levels are discretized into ranks, whose number is determined by the user through the parameters of the algorithm. The number of ranks is essential and strongly affects the results. The algorithm allows two types of ranks: positive (for up-regulated genes) and negative (for down-regulated genes). The vertices of the graph represent genes, and the edges between them are weighted to reflect the number of conditions for which the two genes have the same rank.

The algorithm starts by translating the data matrix into a new representation, which is a graph whose vertex set is built from the rows. An intermediate step is to create a matrix of integers. This matrix has the same size as the original data matrix and its values are created as follows:

1. For each row 푖 all values are sorted in increasing order:

$$a_{i1} \ldots a_{i,s-1}\ a_{is} \ldots a_{i,c-1}\ a_{i,c}\ a_{i,c+1} \ldots a_{i,m-s+1}\ a_{i,m-s+2} \ldots a_{im}$$

Where:
- $m$ – the number of columns,
- $c = \frac{m}{2}$ – the position of the median value in a row,
- $s = m \cdot q + 1$ – a number which determines how many values will be marked as zero; q is a parameter selected by the user.

2. Values are marked as zero if 푎푖푗 belongs to interval (푎푖푐 − 푑푖, 푎푖푐 + 푑푖) where

$$d_i = \min(a_{ic} - a_{is},\ a_{i,m-s+1} - a_{ic})$$

3. Values are marked with positive ranks from the range <1, r> if $a_{ij} > a_{ic} + d_i$.

4. Values are marked with negative ranks from the range <-r, -1> if $a_{ij} < a_{ic} - d_i$.


Figure 9. Sample QUBIC transformation from matrix of integers to final graph.

Bi-clusters are found one by one. Starting from the single heaviest unused edge as a seed, the algorithm iteratively adds additional edges until the pre-specified consistency level is violated.


6.3. Algorithms based on Iterative Row and Column search

6.3.1. Coupled Two-Way Clustering (CTWC)

CTWC is a bi-clustering technique proposed by Gad Getz et al. [7] in 2000. They deal with gene expression data from microarray experiments. The purpose of their work was to develop an algorithm for identifying biologically relevant partitions in data using unsupervised learning.

The authors present their work using gene expression matrices. A value in such a data matrix represents the expression of a gene measured on some sample. The authors use the following notation:

푔 – set of genes

푠 – set of samples

The first step of the algorithm is to perform standard two-way clustering on the whole data matrix. This means that we start with $g^0$ and $s^0$, which represent respectively the whole set of genes and the whole set of samples. The results of this step will be clusters $g_i^1$ and $s_j^1$, which are respectively subsets of genes and subsets of samples.

Figure 10. Example of the CTWC procedure.

Next, for every step k, two-way clustering is performed between every pair of clusters $(g_i^n, s_j^m)$, where $n$ and $m$ are in the range 0 to k-1. The results after this step will be clusters denoted as $g_l^k$ and $s_h^k$. Such a process is visualized in Figure 10.


6.4. Algorithms based on Divide and Conquer approach

6.4.1. Block clustering

In 1972 Hartigan [2] proposed an algorithm known as "Block Clustering". The idea is based on splitting the original data matrix into sub-matrices and looking for those with small variance:

$$VAR(I, J) = \sum_{i \in I, j \in J} (a_{ij} - a_{IJ})^2$$

Where 푎퐼퐽 is bi-cluster mean value.

A measure defined this way is designed for finding constant bi-clusters, because those have variances equal or close to zero. However, such a measure is obviously likely to favor bi-clusters composed of only one column and one row. To avoid this, one of the input parameters is the maximum number of clusters that we want to find. Due to the quality measure, the algorithm looks only for bi-clusters with constant values, but the author mentions modifications of the merit function which make it possible to find bi-clusters with constant rows or columns, or even coherent values. The idea of block pruning proposed by Hartigan is visualized in Figure 11.

Figure 11. Example of block clustering. Figure taken from original Hartigan publication [2].

In 1999 Tibshirani et al. [16] proposed modifications of Hartigan's method which allow inducing the number of bi-clusters. The modifications were a backward pruning method for block splitting and a permutation-based method for finding the optimal number of clusters. However, the merit function remains the same, so the algorithm is still suitable for constant bi-clusters only.


6.5. Algorithms based on Greedy iterative search

6.5.1. δ-bi-clusters

This algorithm is commonly referred to by the names of its authors, Cheng and Church [4]. In 2001 the authors were the first to apply bi-clustering to microarray data. Their method still remains an important benchmark for every new dataset and method. The proposed approach is based on a measure of how elements differ from the row mean, column mean and overall mean:

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$

where

$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \qquad a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}$$

and

$$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij} = \frac{1}{|I|} \sum_{i \in I} a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{Ij}$$

The method aims at finding the largest bi-clusters whose $H(I, J)$ does not exceed a threshold δ. The algorithm starts with the largest possible bi-cluster and successively removes columns and rows until the value of the quality function drops below the level δ. The algorithm for deleting rows or columns is given below:

Algorithm: node deletion
input: matrix A, row set I, column set J, δ >= 0
output: row set I' and column set J' such that H(I', J') <= δ
while H(I, J) > δ:
    find the row r = argmax_{i∈I} d(i) and the column c = argmax_{j∈J} d(j)
    if d(r) > d(c) then remove r from I
    else remove c from J
return I and J

Where:

$$d(i) = \frac{1}{|J|} \sum_{j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2 \quad \text{and} \quad d(j) = \frac{1}{|I|} \sum_{i \in I} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$$
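The node-deletion loop can be sketched as follows (Python/NumPy; function and variable names are illustrative, and only the single-row/column deletion shown in the pseudocode above is implemented, not the full Cheng and Church procedure):

import numpy as np

def single_node_deletion(V, rows, cols, delta):
    # Repeatedly drop the row or column with the largest d(i) / d(j) until H(I, J) <= delta.
    rows, cols = list(rows), list(cols)
    while True:
        A = V[np.ix_(rows, cols)]
        res = A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()
        H = (res ** 2).mean()
        if H <= delta or (len(rows) <= 1 and len(cols) <= 1):
            return rows, cols
        d_rows = (res ** 2).mean(axis=1)   # d(i)
        d_cols = (res ** 2).mean(axis=0)   # d(j)
        if d_rows.max() > d_cols.max():
            del rows[int(d_rows.argmax())]
        else:
            del cols[int(d_cols.argmax())]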


6.6. Algorithms based on Exhaustive bi-cluster enumeration

6.6.1. Statistical-Algorithmic Method for Bi-cluster Analysis (SAMBA)

The algorithm is based on translating the data into a joint probability model to identify subsets of rows that jointly respond across a subset of columns of the data matrix [24]. The original data is modeled as a bipartite graph whose two vertex sets are, respectively, the rows and the columns. Vertices are weighted according to the probabilistic model, and bi-clusters appear as heavy sub-graphs. The resulting bi-clusters are obtained by a heuristic search and by reducing vertices.

The SAMBA model assumes that the data is represented as a bipartite graph $G = (U, V, E)$, where U is the set of columns, V is the set of rows and E is the set of edges between them. Bi-clusters in such an approach are represented by heavy sub-graphs $B = (U', V', E')$, where U' is a subset of columns that reveals some similarity on the subset of rows V'. The method assumes that bi-clusters represent approximately uniform relations between their elements. This leads to the conclusion that each edge of a bi-cluster occurs with a constant high probability $p_c$. The log likelihood of B is defined as follows:

$$\log L(B) = \sum_{(u,v) \in E'} \log \frac{p_c}{p_{u,v}} + \sum_{(u,v) \in \bar{E}'} \log \frac{1 - p_c}{1 - p_{u,v}}$$

Where $\bar{E}' = (U' \times V') \setminus E'$.


6.7. Algorithms based on Distribution parameter identification

6.7.1. Plaid Model

The plaid model is a modeling method proposed by Lazzeroni and Owen [25], who apply it to gene expression analysis. The key idea is to represent the original matrix as a superposition of layers, which should correspond to bi-clusters.

The model assumes that the data matrix is a sum of a uniform background and K bi-clusters. It is described by the following equation:

$$a_{ij} = \mu_0 + \sum_{k=1}^{K} (\mu_k + \alpha_{ik} + \beta_{jk}) \cdot \delta_{ik} \cdot \omega_{jk}$$

Where:
o $\mu_0$ – the typical value of the background layer,
o $\mu_k$ – the typical value within bi-cluster k,
o $\alpha_{ik}$ – the shifting parameter for row i in bi-cluster k,
o $\beta_{jk}$ – the shifting parameter for column j in bi-cluster k,
o $\delta_{ik}$ – the binary indicator of membership of row i in bi-cluster k,
o $\omega_{jk}$ – the binary indicator of membership of column j in bi-cluster k.

The authors formulate the problem as minimization of a distance function between the model and the original data:

$$\sum_{ij} \left( A_{ij} - \sum_{k=1}^{K} \theta_{ijk} \cdot \delta_{ik} \cdot \omega_{jk} \right)^2$$

Lazzeroni and Owen propose an iterative heuristic to solve this problem of estimating parameters. At every single iteration only one layer is added.


7. Comparing the results

7.1. Similarity measures

7.1.1. Jaccard Index

The easiest way to compare two sets A and B is the Jaccard index $\frac{|A \cap B|}{|A \cup B|}$. It provides a score of 1 if the sets are identical, and 0 if they are mutually exclusive. An index defined this way can be used to compare bi-clusters if we take their constituent clusters individually. If a single bi-cluster $B = (I, J)$, where $I \subseteq X$, $J \subseteq Y$, is treated as a pair consisting of the sets I and J, we can compute the average Jaccard index over both sets.

Figure 12. Graphical representation of bi-cluster similarity.

$$S_{Jacc}(B_1, B_2) = \frac{\frac{|I_1 \cap I_2|}{|I_1 \cup I_2|} + \frac{|J_1 \cap J_2|}{|J_1 \cup J_2|}}{2}$$

But if we do not want to lose the differences arising from the sizes of the individual clusters, then we can use a weighted average:

$$S_{Jacc\_weight}(B_1, B_2) = \frac{\frac{|I_1 \cap I_2|}{|I_1 \cup I_2|} (|I_1| + |I_2|) + \frac{|J_1 \cap J_2|}{|J_1 \cup J_2|} (|J_1| + |J_2|)}{|I_1| + |I_2| + |J_1| + |J_2|}$$

Where $B_1 = (I_1, J_1)$ and $B_2 = (I_2, J_2)$.
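Both variants are straightforward to compute when a bi-cluster is stored as a pair of index sets. A minimal Python sketch (names are illustrative, and it assumes the bars in the weighted formula denote set cardinalities):

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def s_jacc(b1, b2):
    # Average Jaccard index over the row sets and the column sets of two bi-clusters,
    # each given as a (rows, cols) pair of index sets.
    return (jaccard(b1[0], b2[0]) + jaccard(b1[1], b2[1])) / 2

def s_jacc_weighted(b1, b2):
    # Weighted variant: each Jaccard term is weighted by the summed sizes of the compared sets.
    wi = len(b1[0]) + len(b2[0])
    wj = len(b1[1]) + len(b2[1])
    return (jaccard(b1[0], b2[0]) * wi + jaccard(b1[1], b2[1]) * wj) / (wi + wj)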


7.1.2. Relevance and recovery

When comparing the obtained results with the expected ones, two pieces of information are significant:

- Did we find all expected bi-clusters?
- Were all found bi-clusters expected?

The measure describing the first is called recovery, and the second relevance. Both can be computed using the same formula:

$$S_R(R_1, R_2) = \frac{1}{|R_1|} \sum_{B_1 \in R_1} \max_{B_2 \in R_2} S_{Jacc}(B_1, B_2)$$

Where:
- $R_1, R_2$ – two sets of bi-clusters, coming from different experiments or being the expected and the resulting set,
- $B_1, B_2$ – single bi-clusters coming respectively from $R_1$ and $R_2$.

Figure 13. Differences between relevance and recovery.

The similarity function $S_R$ measures how result $R_1$ fits result $R_2$. Figure 13 shows a simple example of how to interpret the results. Suppose that there are two sets of bi-clusters. The first (blue, marked with the letter "E") is known in advance and describes the expected results. The second one (green, marked with the letter "F") is a set of bi-clusters derived from the analysis. In the ideal case, the two sets should be identical.


In the example, the set obtained in the experiment does not contain all desired bi-clusters. For simplification, it is assumed that the bi-clusters that were obtained have exact equivalents in the expected set (the Jaccard index between connected bi-clusters is equal to one). If we check how the found set fits the expected one, $S_R(F, E)$, it is called relevance, because it checks whether all found bi-clusters were expected. If we approach the task from the other side, that is, if we check how the expected set fits the found bi-clusters, $S_R(E, F)$, it is called recovery. It is desirable that both of these measures are equal to one.
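Using the pair similarity from Section 7.1.1, relevance and recovery reduce to the same one-line computation applied in two directions. The sketch below reuses the s_jacc function from the previous example; the function names are illustrative:

def s_r(r1, r2):
    # S_R(R1, R2): for every bi-cluster in R1, take its best Jaccard match in R2 and average.
    return sum(max(s_jacc(b1, b2) for b2 in r2) for b1 in r1) / len(r1)

def relevance(found, expected):
    return s_r(found, expected)   # were all found bi-clusters expected?

def recovery(found, expected):
    return s_r(expected, found)   # were all expected bi-clusters found?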

7.1.3. Consensus score

The Jaccard index can be applied to the comparison of single bi-clusters. When combined with the Hungarian algorithm (Munkres algorithm, described in more detail in Chapter 7.2), it can be extended to comparing different results or methods. This quality index, called by the authors the "consensus score", was proposed by S. Hochreiter et al. in 2010 [23]. The algorithm is as follows:

- Compute similarities between the obtained bi-clusters and the known bi-clusters from the original set (assuming that the bi-clusters are known), or similarities between clusters from the first and second result sets.
- Using the Munkres algorithm, assign the bi-clusters of one set to the bi-clusters of the other one.
- Divide the sum of similarities of the assigned bi-clusters by the number of bi-clusters of the larger set.

Such an approach finds assignments which maximize the following function S:

$$S(R_1, R_2) = \sum_{l=1}^{K} S_{Jacc}(B_l^1, B_{l'}^2)$$

Where $R_1$ and $R_2$ are two independent bi-clustering experiments, and $B_l^1$ and $B_{l'}^2$ are pairs of bi-clusters such that $B_l^1$ is the l-th bi-cluster from result $R_1$ and $B_{l'}^2$ is the bi-cluster corresponding to it from result $R_2$.


As the similarity index, the Jaccard index $S_{Jacc}$ is used, and if the outcome of function S is divided by the number of bi-clusters (K), the similarity of the two results expressed as a fraction is obtained:

$$0 \le \frac{S(R_1, R_2)}{K} \le 1$$

A single experiment gets the value 1 if the received bi-clusters are the same as expected, and the value 0 if they are completely different.
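A sketch of the consensus score using SciPy's Hungarian-method solver linear_sum_assignment (the maximum summed similarity is obtained by minimizing its negative; s_jacc is the pair similarity sketched in Section 7.1.1, and the division by the size of the larger result set follows the description above):

import numpy as np
from scipy.optimize import linear_sum_assignment

def consensus_score(r1, r2):
    # Pair the bi-clusters of r1 and r2 so that the summed Jaccard similarity is maximal,
    # then divide by the number of bi-clusters of the larger set.
    sim = np.array([[s_jacc(b1, b2) for b2 in r2] for b1 in r1])
    rows, cols = linear_sum_assignment(-sim)      # maximize similarity
    return sim[rows, cols].sum() / max(len(r1), len(r2))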

Figure 14. Consensus score algorithm shown by bipartite graph.

This process can also be considered as a bipartite graph analysis. If so, the two groups of vertices are represented by two sets of bi-clusters (from two experiments, or from the experiment and the expected set). Initially, every two vertices from different groups are connected by an edge. Each edge is described by a weight, which determines the similarity ($S_{Jacc\_weight}$ or $S_{Jacc}$) between the two bi-clusters (vertices). After the Hungarian algorithm, only those edges remain that form unique pairs of bi-clusters between the sets and whose weights form the largest sum.


7.2. Hungarian algorithm

The algorithm was developed and published by Harold Kuhn [26] in 1955, who gave it the name "Hungarian algorithm" because it was based on the earlier works of two Hungarian mathematicians: Dénes Kőnig [27] and Jenő Egerváry [28]. Munkres [29] reviewed the algorithm in 1957 and observed that it runs in polynomial time. Since then the algorithm has also been known as the Kuhn-Munkres algorithm. Although the Hungarian algorithm contains the idea of the primal-dual method, it solves the maximum weight bipartite matching problem directly without using any linear programming (LP) machinery. The algorithm is based on Kőnig's theorem (1916):

If the elements of a matrix are divided into two classes by a property R, then the minimum number of lines that contain all the elements with the property R is equal to the maximum number of elements with the property R such that no two of them lie on the same line.

Figure 15. Comparison between Munkres algorithm and classical linear programming approach.

This algorithm is widely used for solving assignment problems in two-dimensional data because of its simplicity and speed. Figure 15 shows a comparison of the time consumption of the Munkres algorithm and a classical linear programming algorithm. The built-in MATLAB function "binprog", which solves the binary integer programming problem, was chosen, together with an implementation of the Hungarian algorithm by Alexander Melin downloaded from the MathWorks web site. As can clearly be seen in the attached figure, the Hungarian algorithm is much faster than the traditional approach, and it is characterized by linear complexity.

The pseudocode of the algorithm is as follows:

- Step 1: For each row, subtract the minimum number in that row from all numbers in that row.
- Step 2: For each column, subtract the minimum number in that column from all numbers in that column.
- Step 3: Draw the minimum number of lines to cover all zeroes. If this number = m, STOP: an assignment can be made.
- Step 4: Determine the minimum uncovered number (call it Θ).
  - Subtract Θ from uncovered numbers.
  - Add Θ to numbers covered by two lines.
  - Numbers covered by one line remain the same.
  - Then, GO TO STEP 3.

And the pseudocode for resolving the problem in Step 3:

Finding the Minimum Number of Lines and Determining the Optimal Solution:
- Step 1: Find a row or column with only one unlined zero and circle it. (If all rows/columns have two or more unlined zeroes, choose an arbitrary zero.)
- Step 2: If the circle is in a row with one zero, draw a line through its column. If the circle is in a column with one zero, draw a line through its row. One approach, when all rows and columns have two or more zeroes, is to draw a line through the one with the most zeroes, breaking ties arbitrarily.
- Step 3: Repeat Step 2 until all circles are lined. If this minimum number of lines equals m, the circles provide the optimal assignment.

Example:

Let us consider a task in which we have to assign four workers to four jobs. Each job can be performed by only one worker, and each worker can perform only one job. In addition, the cost of the final assignment should be minimal. In the classical linear programming approach this task leads to minimization of the following function:


$$\min \sum_{i} \sum_{j} c_{ij} x_{ij}$$

Under the following conditions:

$$\sum_{j} x_{ij} = 1 \quad \text{for each row } i,$$

$$\sum_{i} x_{ij} = 1 \quad \text{for each column } j,$$

$$x_{ij} = 0 \text{ or } 1 \quad \text{for all } i \text{ and } j.$$

Where $x_{ij}$ is an element of the binary matrix representing assignments (it contains 1 if the worker is assigned to the job, and 0 if not).

Table 2. Example assignment task.

          Job 1  Job 2  Job 3  Job 4
Worker 1    20     22     14     24
Worker 2    20     19     12     20
Worker 3    13     10     18     16
Worker 4    22     23      9     28

In linear programming this problem can be represented by the following system of equations:

Min z = 20x11 + 22x12 + 14x13 + 24x14 + 20x21 + 19x22 + 12x23 + 20x24 + 13x31 + 10x32 + 18x33 + 16x34 + 22x41 + 23x42 + 9x43 + 28x44
s.t.
x11 + x12 + x13 + x14 = 1 (row 1)
x21 + x22 + x23 + x24 = 1 (row 2)
x31 + x32 + x33 + x34 = 1 (row 3)
x41 + x42 + x43 + x44 = 1 (row 4)
x11 + x21 + x31 + x41 = 1 (column 1)
x12 + x22 + x32 + x42 = 1 (column 2)
x13 + x23 + x33 + x43 = 1 (column 3)
x14 + x24 + x34 + x44 = 1 (column 4)
xij >= 0 for i = 1, 2, 3, 4 and j = 1, 2, 3, 4 (nonnegativity)

Solving these equations leads to the solution: x11 = 1, x24 = 1, x32 = 1, x43 = 1.

The Hungarian algorithm changes slightly the function that is optimized:

$$c'_{ij} = c_{ij} - (u_i + v_j) \ge 0$$

$$\text{Maximize} \sum_{i=1}^{m} u_i + \sum_{j=1}^{m} v_j$$

So back to our example:

Step 1: Find the minimum values in rows and subtract them within each row.

          Job 1  Job 2  Job 3  Job 4
Worker 1    20     22     14     24
Worker 2    20     19     12     20
Worker 3    13     10     18     16
Worker 4    22     23      9     28

          Job 1  Job 2  Job 3  Job 4
Worker 1     6      8      0     10
Worker 2     8      7      0      8
Worker 3     3      0      8      6
Worker 4    13     14      0     19


Step 2: Find the minimum values in columns and subtract them within each column.

          Job 1  Job 2  Job 3  Job 4
Worker 1     6      8      0     10
Worker 2     8      7      0      8
Worker 3     3      0      8      6
Worker 4    13     14      0     19

          Job 1  Job 2  Job 3  Job 4
Worker 1     3      8      0      4
Worker 2     5      7      0      2
Worker 3     0      0      8      0
Worker 4    10     14      0     13

Step 3: Find the minimum number of lines that cover all zeros.

          Job 1  Job 2  Job 3  Job 4
Worker 1     3      8      0      4
Worker 2     5      7      0      2
Worker 3     0      0      8      0
Worker 4    10     14      0     13

Step 4: Two lines. Find minimum uncovered (Θ).

          Job 1  Job 2  Job 3  Job 4
Worker 1     3-     8-     0      4-
Worker 2     5-     7-     0      2-
Worker 3     0      0      8+     0
Worker 4    10-    14-     0     13-


          Job 1  Job 2  Job 3  Job 4
Worker 1     1      6      0      2
Worker 2     3      5      0      0
Worker 3     0      0     10      0
Worker 4     8     12      0     11

Step 5: Find the minimum number of lines that cover all zeros.

          Job 1  Job 2  Job 3  Job 4
Worker 1     1      6      0      2
Worker 2     3      5      0      0
Worker 3     0      0     10      0
Worker 4     8     12      0     11

Step 6: Three lines. Find minimum uncovered (Θ).

          Job 1  Job 2  Job 3  Job 4
Worker 1     1-     6-     0      2
Worker 2     3-     5-     0      0
Worker 3     0      0     10+     0+
Worker 4     8-    12-     0     11

          Job 1  Job 2  Job 3  Job 4
Worker 1     0      5      0      2
Worker 2     2      4      0      0
Worker 3     0      0     11      1
Worker 4     7     11      0     11


Step 7: Find the minimum number of lines that cover all zeros.

          Job 1  Job 2  Job 3  Job 4
Worker 1     0      5      0      2
Worker 2     2      4      0      0
Worker 3     0      0     11      1
Worker 4     7     11      0     11

Step 8: Four lines – stop the algorithm.

          Job 1  Job 2  Job 3  Job 4
Worker 1     0      5      0      2
Worker 2     2      4      0      0
Worker 3     0      0     11      1
Worker 4     7     11      0     11
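The same optimal assignment can be checked with SciPy's implementation of the Hungarian method. This is a small verification sketch; the expected output corresponds to the solution x11 = x24 = x32 = x43 = 1 found above, whose total cost from Table 2 is 20 + 20 + 10 + 9 = 59:

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[20, 22, 14, 24],
                 [20, 19, 12, 20],
                 [13, 10, 18, 16],
                 [22, 23,  9, 28]])
workers, jobs = linear_sum_assignment(cost)                 # minimum-cost assignment
pairs = [(int(w) + 1, int(j) + 1) for w, j in zip(workers, jobs)]
print(pairs, int(cost[workers, jobs].sum()))
# expected: [(1, 1), (2, 4), (3, 2), (4, 3)] with total cost 59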

Using the algorithm described above, it is possible to find an optimal assignment in any two-dimensional matrix. But if the problem cannot be described by a square matrix, the missing attributes need to be added so that it can be. The values for these attributes are set so as not to distort the solution; usually these are values for which it is not worth making an assignment, so that they are matched last. If we are looking for the maximum cost, they will be zero. If we are looking for the minimum cost, they are set to "infinity".


7.3. Generalized Hungarian algorithm

7.3.1. Problem formulation

The task is to solve the problem of multidimensional assignment. In contrast to the problem described above, where there is an assignment only between workers and jobs, we want to add extra dimensions, such as, for example, tools. It is possible to solve such a problem by reducing it to a series of two-dimensional problems: for example, first resolve the assignment problem between workers and jobs, and next between jobs and tools. But what if we change the order of assignments? For example, based on a cost matrix describing how well each worker is predisposed to each tool, we make assignments between tools and workers, and only then the assignment between workers and jobs. We can get different results, and there is no direct method to determine which one will be better.

We therefore present the problem not as a cost matrix, but as a cost cube (Figure 16). The three dimensions represent jobs, workers and tools. The cells of such a structure contain the combined cost of hiring a worker for a particular job using a particular tool. For a cube of size N, the result will be a set of N cells (unique in each dimension) which gives the smallest cost. Adding further dimensions analogously, we can generalize the problem definition.

Figure 16. Example of multidimensional assignment problem.

The multi-dimensional assignment problem (MAP), sometimes referred to as the multi-index assignment problem, can be defined as a natural extension of the linear assignment problem with minimization of a cost function, or as the problem of finding cliques in d-partite graphs. In very simple words, MAP is a higher-dimensional version of the linear assignment problem, which is defined as follows:

$$\min \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} c_{ij} x_{ij}$$

$$\text{s.t.} \quad \sum_{j=1}^{n_2} x_{ij} = 1, \quad i = 1, 2, \ldots, n_1$$

$$\sum_{i=1}^{n_1} x_{ij} = 1, \quad j = 1, 2, \ldots, n_2$$

$$x_{ij} \in \{0, 1\}$$

Where 푥푖푗 is a decision variable and is defined as:

$$x_{ij} = \begin{cases} 1 & \text{if worker } j \text{ is assigned to job } i \\ 0 & \text{otherwise} \end{cases}$$

The multidimensional assignment problem, as an extension of the linear assignment problem, is defined as follows:

$$\min \sum_{i_1=1}^{n_1} \cdots \sum_{i_d=1}^{n_d} c_{i_1 \ldots i_d}\, x_{i_1 \ldots i_d}$$

$$\text{s.t.} \quad \sum_{i_2=1}^{n_2} \cdots \sum_{i_d=1}^{n_d} x_{i_1 \ldots i_d} = 1, \quad i_1 = 1, 2, \ldots, n_1$$

$$\sum_{i_1=1}^{n_1} \cdots \sum_{i_{k-1}=1}^{n_{k-1}} \sum_{i_{k+1}=1}^{n_{k+1}} \cdots \sum_{i_d=1}^{n_d} x_{i_1 \ldots i_d} = 1, \quad i_k = 1, 2, \ldots, n_k, \quad k = 2, \ldots, d-1$$

$$\sum_{i_1=1}^{n_1} \cdots \sum_{i_{d-1}=1}^{n_{d-1}} x_{i_1 \ldots i_d} = 1, \quad i_d = 1, 2, \ldots, n_d$$

$$x_{i_1 \ldots i_d} \in \{0, 1\}$$

Where:
- $d$ – the number of dimensions,
- $n_k$ – the number of attributes in dimension k,
- $n_1 \le n_k$, $k = 2, \ldots, d$.


In contrast to LAP, which is solvable in polynomial time, MAP is known to be an NP-hard problem. This is caused by the total number of coefficients:

$$\prod_{k=1}^{d} n_k$$

as well as by the number of feasible solutions:

$$\prod_{k=2}^{d} \frac{n_k!}{(n_k - n_1)!}$$

7.3.2. Related work

The multidimensional assignment problem was first mentioned in the literature in 1968 by William Pierskalla [30]. The author defines the problem using a tree in which possible solutions are represented by paths. The algorithm iterates over all feasible paths and finds an optimal solution. The most interesting thing in the article is that, despite the very early years, the algorithm was implemented and tested on a Univac 1107 computer.

After Pierskalla's work there was a vast number of applications of MAP in the literature. In 1994 Poore [31], and four years later Murphey et al. [32], used it for multi-sensor multi-target tracking. In 1996 Pusztaszeri et al. [33] found it useful in the tracking of elementary particles. In 1998 Veenman et al. [34] used it in image recognition. By now there are many algorithms and applications of MAP, as well as surveys of them [35, 36, 37].

7.3.3. Hungarian algorithm

The Hungarian algorithm solves the problem of matching in a two-dimensional matrix or bipartite graph. Such an approach allows assigning bi-clusters from two methods, or from two different experiments under the same method. However, if there are N results for which we want to match bi-clusters, the cost matrix is transformed into a hypercube with N dimensions. If we want to find corresponding bi-clusters between two independent experiments, we want to maximize the following function:

$$S(R_1, R_2) = \sum_{l=1}^{K} S_{Jacc}(B_l^1, B_{l'}^2)$$


Where $R_1$ and $R_2$ are two independent bi-clustering experiments, and $B_l^1$ and $B_{l'}^2$ are pairs of bi-clusters such that $B_l^1$ is the l-th bi-cluster from result $R_1$ and $B_{l'}^2$ is the bi-cluster corresponding to it from result $R_2$.

We want to merge N bi-clustering results, so we need to find an assignment such that the following function is maximized:

$$S(R_1, \ldots, R_N) = \sum_{l=1}^{K} \sum_{i=1}^{N-1} \sum_{j>i}^{N} S_{Jacc}(B_l^i, B_{l'}^j)$$

In other words, to form one of the K groups, we want to choose one bi-cluster from every result in such a way that all of them are as similar as possible within a group. The formula

$$\sum_{i=1}^{N-1} \sum_{j>i}^{N} S_{Jacc}(B_l^i, B_{l'}^j)$$

represents the similarity between all pairs of bi-clusters within the l-th group. If we have N bi-clustering experiments, each with K bi-clusters, the value of the function $S(R_1, \ldots, R_N)$ is in the range:

푁 푁! 0 ≤ 푆(푅 , … , 푅 ) ≤ 퐾 ∗ ( ) = 퐾 ∗ 1 푁 2 2(푁 − 2)

This means that if output is equal 0, than there are N completely different re- 푁 sults. And if output is equal to 퐾 ∗ (2), then all N results are identical.
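To make the above concrete, a minimal sketch of matching the bi-clusters of two results is given below. It assumes that a bi-cluster is represented as a pair of row and column index sets, that the Jaccard similarity is taken over the matrix cells covered by each bi-cluster, and that SciPy's linear_sum_assignment is used as the Hungarian solver (maximization is obtained by negating the similarities); these representational choices are illustrative, not part of AspectAnalyzer.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(b1, b2):
    """Jaccard similarity of two bi-clusters given as (row_set, col_set) pairs."""
    cells1 = {(r, c) for r in b1[0] for c in b1[1]}
    cells2 = {(r, c) for r in b2[0] for c in b2[1]}
    union = cells1 | cells2
    return len(cells1 & cells2) / len(union) if union else 0.0

def match_results(R1, R2):
    """Pair bi-clusters of R1 with bi-clusters of R2 maximizing the summed Jaccard."""
    sim = np.array([[jaccard(b1, b2) for b2 in R2] for b1 in R1])
    # The Hungarian algorithm minimizes, so negate similarities to maximize S(R1, R2).
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows, cols)), sim[rows, cols].sum()

# Hypothetical toy results, two bi-clusters each.
R1 = [({0, 1, 2}, {0, 1}), ({3, 4}, {2, 3})]
R2 = [({3, 4, 5}, {2, 3}), ({0, 1}, {0, 1})]
pairs, score = match_results(R1, R2)
print(pairs, score)   # expected pairing: (0, 1) and (1, 0)
```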


7.3.4. Two-dimensional approach

Finding an optimal solution when matching N results comes down to an analysis in N-dimensional space. However, it can be safely assumed that bi-clustering experiments carried out on the same data with the same number of bi-clusters should be similar to each other. Therefore, in order to minimize the computational complexity, the problem can be reduced to a two-dimensional space. Rather than representing the cost structure as a cube in three-dimensional space (R3), or as a hypercube in the general n-dimensional case (Rn), it is more reasonable from the complexity point of view to put the results in a series. In this method, the data are presented as N-1 connected bipartite graphs (Figure 17), and N-1 Munkres assignments are performed. The function that is maximized simplifies a little and looks like this:

S_{2D}(R_1, \dots, R_N) = \sum_{l=1}^{K} \left( S_{Jacc}(B_l^1, B_{l'}^2) + S_{Jacc}(B_{l'}^2, B_{l''}^3) + \dots + S_{Jacc}(B_{l^{(N-2)}}^{N-1}, B_{l^{(N-1)}}^{N}) \right)

where B_l^1 is the l-th bi-cluster from result R_1 and B_{l'}^2 is the bi-cluster from result R_2 assigned to it; next, B_{l''}^3 is the bi-cluster from result R_3 assigned to bi-cluster B_{l'}^2, and so on. Figure 17 illustrates this algorithm. The Hungarian algorithm is performed on the first pair of results; then the third result is added and the Hungarian algorithm is performed between the second and third. The procedure is repeated until all the results have been added.

The function S_{2D}(R_1, \dots, R_N) lies in the range:

0 \le S_{2D}(R_1, \dots, R_N) \le K \cdot (N - 1)

The upper bounds of the functions S and S_{2D} also indicate the number of pairwise comparisons that have to be made to assess the quality of the overall fit. The value K \cdot (N-1) (bi-clusters are compared only within neighboring results) is usually much smaller than K \cdot \binom{N}{2} (all bi-clusters in a group are compared with each other), but the quality of this approach can be somewhat lower than that of the general approach, because it searches for a local optimum.


Figure 17. The combination of n independent bi-clustering results with k clusters.

After performing the Hungarian algorithm on each pair of neighboring results, K "chains" of bi-clusters are obtained, each consisting of N bi-clusters, one from each of the N results. The final assignment is influenced mainly by the placement of the results: the sequence is crucial, although not always. If all the results are very similar to each other, the order may be irrelevant, and the solution is then optimal.
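A minimal sketch of this chaining procedure, under the same illustrative bi-cluster representation as before and assuming every result contains the same number of bi-clusters, might look as follows; the jaccard function is the one sketched earlier and is passed in as a parameter.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chain_results(results, jaccard):
    """Two-dimensional approach: align result i+1 to result i, for i = 1,...,N-1.

    `results` is a list of N bi-clustering results, each a list of K bi-clusters;
    `jaccard` is a similarity function between two bi-clusters.
    Returns K chains, each holding one bi-cluster from every result.
    """
    chains = [[b] for b in results[0]]            # chain l starts from bi-cluster l of R1
    prev = results[0]
    for nxt in results[1:]:
        sim = np.array([[jaccard(b1, b2) for b2 in nxt] for b1 in prev])
        rows, cols = linear_sum_assignment(-sim)  # maximize similarity between neighbors
        reordered = [None] * len(chains)
        for r, c in zip(rows, cols):
            reordered[r] = nxt[c]
        for l, b in enumerate(reordered):
            chains[l].append(b)
        prev = reordered                          # keep the chain ordering for the next step
    return chains
```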

Example:

Figure 18. Graphical representation of initial graph with results.

Let us consider three results, each derived from experiments carried out on the same data with the same number of bi-clusters: R_1 (green), R_2 (blue) and R_3 (red). The first step of the algorithm is to form two bipartite graphs. The first graph is made by connecting every bi-cluster B_l^1 from result R_1 with every bi-cluster B_l^2 from result R_2. In the next step, the second bipartite graph is added by connecting every bi-cluster B_l^2 from result R_2 with every bi-cluster B_l^3 from result R_3 (l ∈ {1, 2, 3, 4, 5}). The end result is shown in Figure 19. The number of connections (similarities to compute for the cost matrices) amounts to 50.

Figure 19. Graphical representation of graph after analysis.

After building the bipartite graphs, the third step of the algorithm is to perform the Hungarian algorithm twice. The first execution removes unnecessary edges between results R_1 and R_2, leaving only those that represent the best assignments between bi-clusters of these results. The second execution removes unnecessary edges between results R_2 and R_3, leaving only those that represent the best assignments between bi-clusters of these results.

The remaining edges form the following solution:

G_1 = \{B_1^1,\; B_{1'}^2 = B_3^2,\; B_{1''}^3 = B_1^3\},
G_2 = \{B_2^1,\; B_{2'}^2 = B_5^2,\; B_{2''}^3 = B_2^3\},
G_3 = \{B_3^1,\; B_{3'}^2 = B_2^2,\; B_{3''}^3 = B_4^3\},
G_4 = \{B_4^1,\; B_{4'}^2 = B_1^2,\; B_{4''}^3 = B_3^3\},
G_5 = \{B_5^1,\; B_{5'}^2 = B_4^2,\; B_{5''}^3 = B_5^3\}.

The user does not always have such a comfortable situation, however. Individual results may vary in the number of returned bi-clusters. Such a situation is shown in Figure 20, where the i-th result consists of exactly k_i bi-clusters, i ∈ ⟨1, N⟩.


Figure 20. The symbolic diagram showing connected results (with various sizes).

To analyze such data effectively, the following pre-processing must be used:

1. Sort the results by the number of bi-clusters (descending).
2. For each result i, add k_max − k_i empty bi-clusters, where k_max is the maximum number of bi-clusters in a single result and k_i is the number of bi-clusters in the i-th result, i ∈ ⟨1, N⟩.
3. Perform the standard analysis, as described above.

Figure 21. Graphical representation of graph (with empty clusters) after analysis.

Sorting maximizes the number of non-empty bi-clusters matched between "neighboring" results. The additional clusters are empty, so that S_Jacc between them and any non-empty bi-cluster equals 0. This allows them to be combined with others only when it is absolutely necessary, due to the lack of other options.


Figure 21 shows an example of matching such unbalanced results. The first result consists of five bi-clusters, the second of four, and the last of only three; they are already sorted. Empty clusters are marked in gray and deliberately left unconnected, because connecting them would not affect the resulting set anyway.
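A small sketch of this padding step, under the same illustrative (row_set, col_set) representation, where an empty bi-cluster is simply a pair of empty sets:

```python
def pad_results(results):
    """Sort results by size (descending) and pad each with empty bi-clusters.

    An empty bi-cluster has no rows and no columns, so its Jaccard similarity
    with any non-empty bi-cluster is 0.
    """
    results = sorted(results, key=len, reverse=True)
    k_max = len(results[0])
    empty = (frozenset(), frozenset())
    return [r + [empty] * (k_max - len(r)) for r in results]
```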

The biggest drawback of the algorithm described in this subsection is its susceptibility to changes in the order of the results. Poorly matched neighboring experiments (if they are not located at the end) can completely spoil the final assignments. One can protect against this by computing the consensus score (Chapter 7.1.3) between each pair of experiments and then sorting the experiments by this measure, because pairs with higher similarity should impair the final result less.


7.3.5. Multidimensional approach

The two-dimensional approach focuses on finding a local optimum. If the data are more demanding, we have to try a little harder in terms of time and memory complexity. To find the optimal solution we have to find an assignment in which the similarity between all bi-clusters in a group is optimal (not only the similarity between "neighboring" results, as in the two-dimensional approach).

Our goal is to find K groups consisting of N bi-clusters, each coming from a different result (bi-clustering experiment). The number of all possible such groups is therefore K^N. We can present the data as a hypercube in N-dimensional space – a "cost hypercube". Each element of that hypercube has a value equal to the average similarity of the bi-clusters in the group it represents:

\bar{S}_{Jacc_a} = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_{Jacc}(B_a^i, B_a^j)}{\binom{N}{2}}

where:

• a is a single element of the "cost hypercube", consisting of N bi-clusters B_a^i, i = 1, …, N,
• B_a^i is the bi-cluster coming from the i-th result and belonging to element a,
• \bar{S}_{Jacc_a} is the average Jaccard similarity within group a.

The assignment should be unique in the sense that no bi-cluster can participate in more than one resulting group. The solution therefore consists of K elements of the hypercube, and the number of feasible solutions is (K!)^{N-1}, far beyond what a naive method can enumerate.
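To illustrate the objective only (not the generalized Hungarian procedure itself), the sketch below evaluates the average within-group Jaccard similarity and finds the optimal grouping by brute force; this is feasible only for very small N and K, and it uses the illustrative bi-cluster representation and jaccard function from the earlier sketches.

```python
from itertools import combinations, permutations, product

def group_score(group, jaccard):
    """Average pairwise Jaccard similarity within one group of bi-clusters."""
    pairs = list(combinations(group, 2))
    if not pairs:                                  # a group needs at least two members
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def brute_force_grouping(results, jaccard):
    """Enumerate every feasible grouping; practical only for tiny N and K."""
    first, rest = results[0], results[1:]
    K = len(first)
    best_groups, best_score = None, float("-inf")
    # One permutation of 0..K-1 per remaining result assigns its bi-clusters to groups.
    for orders in product(*(permutations(range(K)) for _ in rest)):
        groups = [[first[l]] + [r[order[l]] for r, order in zip(rest, orders)]
                  for l in range(K)]
        score = sum(group_score(g, jaccard) for g in groups)
        if score > best_score:
            best_groups, best_score = groups, score
    return best_groups, best_score
```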

This multidimensional version of the Hungarian algorithm is based on translating König's theorem from two-dimensional space to n-dimensional space. On that basis, the pseudo-code of the translated Hungarian algorithm is as follows:

• Steps 1 and 2 become Steps 1, 2, 3, …, N: in every Step i (where i = 1, …, N), from the cost values in every hyperplane obtained by fixing dimension i we subtract the minimum value of that hyperplane.


• Step N+1: Choose the minimum number of hyperplanes needed to cover all zeroes. If this number equals N, STOP: an assignment can be made.
• Step N+2: Determine the minimum uncovered number, i.e. the smallest value that does not lie on any chosen hyperplane (call it Θ). Then:
o subtract Θ from the uncovered numbers,
o add Θ to the numbers covered by two hyperplanes,
o numbers covered by one hyperplane remain the same,
o go to Step N+1.

The pseudo-code for resolving the covering problem in Step N+1 is as follows:

• Finding the minimum number of lines and determining the optimal solution:
• Step 1: Find a dimension with only one unlined zero and circle it. (If all dimensions have two or more unlined zeroes, choose an arbitrary zero.)
• Step 2: If the circle is in a row with one zero, draw a line through its column. If the circle is in a column with one zero, draw a line through its row. When all rows and columns have two or more zeroes, one approach is to draw a line through the one with the most zeroes, breaking ties arbitrarily.
• Step 3: Repeat Step 2 until all circles are lined. If this minimum number of lines equals m, the circles provide the optimal assignment.

Example:

Let us consider three results, each derived from experiments carried out on the same data with the same number of bi-clusters: R_1 (green), R_2 (blue) and R_3 (red). The first step of the algorithm is to connect all the bi-clusters between the results, each with each. Figure 22 shows an example with three results and 15 bi-clusters in total, so the number of connections in this case amounts to 75. What we really look at are the triangles formed by vertices coming from different results. The number of all such triangles is 125, and they form a cube with dimensions 5x5x5.

Figure 22. Visualization of original data before analysis.

After building the cost hypercube, the next step of the algorithm is to perform the generalized Hungarian algorithm on it. The result is 5 groups, each consisting of 3 bi-clusters. In Figure 23 the solution appears as 5 independent triangles.

Figure 23. Visualization of original data after analysis.


7.4. Consensus algorithm

A single experiment may give unsatisfactory results because the chosen method is designed for a different bi-cluster structure, or because it strongly depends on the initial conditions and provides suboptimal solutions. To find the best solution for the input data, one needs either to fully understand the structure of the data, or to perform many different experiments using as many methods as possible and choose an appropriate one. Moreover, even for the best method, suitable for the data structure, the data may contain noise or be incomplete.

In contrast to relying on single experiments or single methods, this thesis proposes a solution that focuses on integrating the results into one general and more reliable solution. Each result contains correct data (data that should be part of a bi-cluster) and some data that ended up in it because of noise, a local minimum or other errors. The algorithm assumes that the experiment is performed repeatedly (using different initial conditions and/or different methods) and the results are then combined, which should filter out the unwanted data. The final result should consist of K bi-clusters. K may be a number specified by the user or obtained as a result of the calculations. If this number is known, results with fewer bi-clusters are complemented by empty ones, and results with more bi-clusters are reduced by removing the bi-clusters of lowest quality. Finally, each i-th result looks as follows:

R_i = \{B_1^i, B_2^i, \dots, B_K^i\}, \quad B_l^i = (I_l^i, J_l^i), \quad l \in 1, \dots, K

where R_i denotes the i-th result, i ∈ 1, …, N.

Following the experiments, the bi-clusters should be grouped into K groups, such that no two bi-clusters within a group come from the same experiment:

G_l = \{B_{l'}^1, B_{l'}^2, \dots, B_{l'}^N\}, \quad B_{l'}^i = (I_{l'}^i, J_{l'}^i), \quad i \in 1, \dots, N

where G_l denotes the l-th group, l ∈ 1, …, K. The bi-clusters \{B_{l'}^1, B_{l'}^2, \dots, B_{l'}^N\} are chosen so as to maximize the following function:


\sum_{l=1}^{K} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} S_{Jacc}(B_l^i, B_l^j)

Following the grouping, within each group G_l we merge its bi-clusters into one bi-cluster B_l = (I_l, J_l), in such a way that the sets I_l and J_l are formed from the attributes included in as many bi-clusters of group l as possible; in the most restrictive form, in all of them:

I_l = \{x \in X : x \in I_l^1, x \in I_l^2, \dots, x \in I_l^N\}
J_l = \{y \in Y : y \in J_l^1, y \in J_l^2, \dots, y \in J_l^N\}

This condition can be relaxed by allowing the absence of an attribute in a given number of bi-clusters (this may be a threshold set as a parameter of the algorithm).

The proposed method assumes a solution in which this threshold is adjusted during the algorithm so as to meet the parameter MinC (minimum number of attributes in a bi-cluster) or MinQ (minimum quality of the resulting bi-cluster). This parameter may be a number specified by the user or obtained as a result of the calculations.

To summarize the whole process: we have a set of N results, each being the result of an experiment conducted on the same data matrix with the same number of bi-clusters (K). The algorithm is as follows:

• Using the generalized Hungarian algorithm, assign bi-clusters from all methods so as to form K sets, each consisting of N bi-clusters.
• Compute for each bi-cluster one of the quality indices described in Chapter 5.2.
• In each k-th set, remove bi-clusters with a quality index below a certain threshold T_1 (a parameter set by the user or computed automatically).
• For each k-th set compute the average quality index, and remove the whole set if its value is below a certain threshold T_2 (an optional parameter set by the user or computed automatically).
• For each k-th set compute the average n_{i,k} (the number of bi-clusters in set k in which the i-th attribute is present), and remove the whole set if its value is below a certain threshold T_3 (an optional parameter set by the user or computed automatically).
• Assign a weight to each attribute i of set k, such that:

W_{i,k} = \frac{n_{i,k} + \frac{Q_{i,k} - \min Q_k}{\max Q_k - \min Q_k} \cdot N}{2}

where:
o n_{i,k} is the number of bi-clusters in set k in which the i-th attribute is present,
o Q_{i,k} is the average quality index of the bi-clusters in the k-th set which contain attribute i,
o \min Q_k is the minimum value of the quality index in the k-th set,
o \max Q_k is the maximum value of the quality index in the k-th set,
o N is the number of results/elements in the sets.
• Set P = N.
• For every set representing a single bi-cluster:
1. Select only those attributes for which the value of W_{i,k} is equal to or greater than P.
2. If the number of attributes in the bi-cluster is equal to or greater than MinC and/or the quality of the bi-cluster is equal to or greater than MinQ, then stop; otherwise go to 3.
3. Decrease P and go to step 1.
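The fragment below sketches, for the row attributes of one group of matched bi-clusters, how the weights W_{i,k} and the threshold-lowering loop from the steps above could be computed; the (row_set, col_set) representation, the quality values passed in, and the step by which P is decreased are illustrative assumptions rather than the exact AspectAnalyzer implementation.

```python
def attribute_weights(group, qualities):
    """Weights W_i for the row attributes of one group of matched bi-clusters.

    `group` is a list of N bi-clusters, each a (row_set, col_set) pair;
    `qualities` holds one quality-index value per bi-cluster of the group.
    """
    N = len(group)
    q_min, q_max = min(qualities), max(qualities)
    q_range = (q_max - q_min) or 1.0            # guard against identical qualities
    weights = {}
    for i in set().union(*(rows for rows, _ in group)):
        member_q = [q for (rows, _), q in zip(group, qualities) if i in rows]
        n_i = len(member_q)                     # bi-clusters of the group containing attribute i
        q_i = sum(member_q) / n_i               # average quality of those bi-clusters
        weights[i] = (n_i + (q_i - q_min) / q_range * N) / 2
    return weights

def merge_group(group, qualities, min_c):
    """Lower the threshold P from N until at least `min_c` row attributes survive."""
    weights = attribute_weights(group, qualities)
    p = len(group)
    while True:
        selected = {i for i, w in weights.items() if w >= p}
        if len(selected) >= min_c or p <= 0:
            return selected
        p -= 1                                  # relaxation step size is an assumption
```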


8. Graphical presentation of results

8.1. Presenting bi-clusters

Figure 24. Real data from Monica Chagoyen paper [38].

Real data, regardless of their origin (microarray experiments, document-term frequencies, general text mining data, etc.), may at first glance appear to be random and devoid of any structure. Figure 24 shows a visualization of a data matrix containing the relationship between words and genes. The vertical dimension represents genes, and the horizontal dimension the words. At the intersection of these two dimensions is a value denoting the number of occurrences of a word in the context of a given gene; brighter values mean fewer occurrences, darker values more. The data have been very carefully chosen to contain eight bi-clusters.

Figure 25. BiVoC algorithm sample result.

To reveal the hidden structure it is necessary to reorder the rows and columns. The literature contains many examples of algorithms implementing this task.

8.1.1. BiVoC

BiVoC stands for Bi-dimensional Visualization of Clustering, and it is a part of the Biorithm package [39]. Biorithm is a set of tools written in C++ designed to analyze data mainly in molecular systems biology. The software is developed by T. M. Murali's research group and has been maintained for several years. BiVoC, a part of this package, is an algorithm for laying out bi-clusters in a two-dimensional matrix. It takes as input a data matrix and information about the computed bi-clusters. As the very first step, the algorithm removes from the data matrix all irrelevant rows and columns (those not involved in any bi-cluster). After filtering the attributes, the method reorders the rows and columns so that those involved in the same bi-cluster appear next to each other. An example result is shown in Figure 25.

8.1.2. BicOverlapper

BicOverlapper is a visualization tool introduced by Rodrigo Santamaria et al. [40] in 2008. They proposed an approach based on an undirected graph, where bi-clusters are plotted as complete sub-graphs (Figure 26). The nodes correspond to rows and columns of the original data matrix.

Figure 26. BicOverlapper graph representation.

For clarity, the edges of the graph are not drawn. Nodes belonging to bi-clusters are gathered into rounded shapes. The main advantage of this tool is that the visualization is not static: the user can interact with it and change the parameters of the model, the BicOverlapper layout, etc.

8.1.3. BiCluster Viewer

In 2011 Julian Heinrich et al. [41] proposed a tool for visualizing bi-clustering results from gene expression data. The authors draw bi-clusters using a heat-map representation and, interestingly, allow for duplicated columns and rows. Heat-map data values are mapped to grayscale values using linear interpolation between the smallest and largest values of the original data matrix. The algorithm allows the duplication of rows and columns to make sure that all bi-clusters are located in contiguous regions.

Figure 27. Example of BiCluster Viewer, taken from original publication [41].

Figure 27 shows an example of presenting a toy data set in four different representations. The first view (a) is the default and represents each bi-cluster by its major rectangle only. In the second mode (b) all bi-clusters are represented. The third view (c) shows three highlighted bi-clusters, and the last one (d) permanently highlighted bi-clusters.


8.2. Presenting the results of domain

Sometimes determining the bi-clusters is not enough; their quality also needs to be determined. For this purpose it is necessary to interpret the obtained clusters and to assess their quality according to the field to which they belong.

8.2.1. Clusters containing genes

Figure 28. Gene ontology tree composed with gene ontology terms.

To assess the quality of a gene cluster we use the Gene Ontology database. Gene clusters are connected with Gene Ontology terms, and the next step is to build a network from those terms (Figure 28). For this purpose the Cytoscape program [42] with the BiNGO plugin [43] is used. The assumption is that genes strongly correlated with each other will lead to small and dense trees, because they should not be associated with a very diversified group of terms.

In Figure 28, only the colored terms are the result of the analysis. The white ones are used only for visualization purposes, to connect the resulting terms with the root.


8.3. Presenting the results from different experiments.

It is very useful to compare results coming from different experiments performed with the same or different methods. This is especially useful when there is a need to examine how repeatable the methods are, or to merge different results.

Figure 29. Venn Diagram with visualization of merge of different results. Computed using VennMaster tool [44].

Before any analysis, single bi-clusters must first be associated between the methods. A detailed description of how this can be done is presented in Chapter 7.4. As a result, there are sets of bi-clusters which should have a high level of similarity. The analysis mentioned in this paragraph is intended to visualize this similarity, so that it can easily be evaluated by the user.

Similarity between sets can easily be visualized by plotting paired sets on Venn diagrams. An example of such a visualization is presented in Figure 29: there are four different bi-cluster sets from six different experiments. From such an analysis one can judge the level of similarity with respect to size.


9. Computational experiments

9.1. Environment for data generation and evaluation

For the purpose of this PhD thesis, software named AspectAnalyzer was created. It is a distributed system written in the C# programming language on the .NET Framework. It implements several algorithms taken from the literature as well as the consensus methods described in this thesis. The graphical user interface is based on Windows Presentation Foundation. Communication within the program and between different instances of AspectAnalyzer on different nodes is based on Microsoft MSMQ queues, and all mathematical computations are done using ILNumerics.

Figure 30. AspectAnalyzer main window.

ILNumerics is a high-performance math library available under the GPL license. The library extends the .NET Framework with tools for scientific computing and provides simplified and optimized code for matrix operations. The table below (Table 3) shows the differences between a standard C# implementation and an analogous implementation using ILNumerics. The tests were done using the AspectAnalyzer program, and the presented times are for the execution of one pass of the main loop. In some cases the dedicated library was twenty times faster than the regular implementation.


Table 3. Comparison of standard C# implementation and ILNumerics.

Method | C# [s] | ILNumerics [s]
PLSA | 14 | 2
Kullback–Leibler | 12 | 0.7
Least Square Error | 7 | 0.5
NonSmooth Kullback–Leibler | 24 | 0.9


9.1.1. Data

Due to the distributed nature of the system, the data are stored in a Microsoft SQL database. Figure 31 shows the diagram of the AspectAnalyzer database; the data are divided into two groups: data related to matrices and data related to results. In the first group we can find the matrices with their data and all descriptions, such as the matrix noise level, the number of bi-clusters, etc. Every matrix also comes with a type, which can be set to V matrix (the original data matrix) and, optionally (if the algorithm performs matrix factorization), W matrix (the left matrix of the factorization) and H matrix (the right matrix of the factorization). There is no limit on the number of different properties; to add one, it is only necessary to add its description to the PropertiesTypes table.

Figure 31. AspectAnalyzer data diagram.

The schema related to the results contains a little more information. The "Results" table contains detailed data about a single experiment (such as the number of steps, the value of the distance function, etc.). With the results there is also room for features computed after the experiments, for example from the factorized matrices: estimated bi-clusters, values of quality indices, etc. The system was tested on versions 2008 R2 and 2012 of SQL Server Express Edition. In such a configuration the program and all its features are free and available to everyone using the .NET Framework and MSMQ queues. In the free SQL Server Express edition, the only limitation important from the AspectAnalyzer point of view is that the database size is limited to 10 GB (for comparison, commercial versions have the limit set to 524 PB). The free version is, however, sufficient for many applications of AspectAnalyzer, and if not, a large database can be divided into smaller parts to circumvent the restriction.

9.1.2. Distributed computing

Thanks to the use of a database that is not integrated with the program, there is an opportunity to build a distributed system. It is possible to run many instances of AspectAnalyzer on different nodes, in different locations, etc. All instances can be set to a master-slave model, in which one instance is the master node and all others are in slave mode. All nodes report to the master every 5 seconds with information about their current load, completeness of current tasks, etc. The master node can manage the slaves remotely by sending specific instructions to a slave node using its IP address.

Figure 32. Node Manager window from AspectAnalyzer.


Using the Node Manager panel shown in Figure 32, the user can specify tasks and define experiments, and the system will automatically balance those jobs over the running instances, taking into account the current load, number of cores, etc. Remote steering has the same abilities as local control, and the whole communication is done using MSMQ, so the only limitation is that the relevant ports should be open between every slave node and the master node.

Besides defining tasks and balancing them over all connected nodes, the Node Manager can also change all settings of every connected slave (Figure 33).

Figure 33. Slave settings window for node manager.

9.1.3. Defining own synthetic matrix

The user can define their own synthetic matrix with all its relevant properties. The procedure starts with defining the size of the data. After this, a black border appears on the screen, which delimits the data matrix. Within this border it is possible to place bi-clusters as colored squares. For each of them there is an option which allows the data structure inside the bi-cluster to be defined. The user can choose one of the predefined settings or load a sample bi-cluster from a text file.


Figure 34. Painter window from AspectAnalyzer.

After defining the bi-cluster positions, structure and size, there is also the possibility to introduce noise into the data matrix. Noise can be generated inside and/or outside of the bi-clusters automatically, by defining its level and characteristics. There is also the possibility to generate noise manually, by painting it with a special "spray" control. The level is set manually in this mode, but the user is able to define the noise characteristics such as the average value, distribution, etc.

9.1.4. Browsing data and results

Using the form presented in Figure 35, the user is able to browse the results stored in the database. The main window shows only a general view with the list of data matrices and the total number of results for each of them. Double-clicking on a matrix loads it to the main screen together with the options for defining bi-clustering experiments. Another way is to click the "chart and notes" icon, which displays a more detailed summary for the selected matrix in a table. It contains the results grouped by method and number of bi-clusters, with the average, minimum and maximum values of the divergence function (if such a function exists for the selected method). The third level of nesting, available under the icon mentioned above, is a view of the individual results.

Figure 35. Data window from AspectAnalyzer.

In the detailed window with the list of single results, the last available option opens a window with the result description. In this window the user can draw a chart of the divergence function (Figure 36), extract customized bi-clusters from the result, or compute quality indices for it.

Figure 36. Sample chart with changes in divergence function values.


9.1.5. Update functionality

Figure 37. Aspect Analyzer Update Window.

Since the program is publicly available and hosted on a dedicated website, a fully automated update process is possible. The feature is available only if the currently installed version is lower than the one posted on the project site.

9.1.6. Program availability

Figure 38. About window from AspectAnalyzer.

The whole system is based on free and widely available components and is distributed as a ready-to-use installer posted publicly on the dedicated website http://AspectAnalyzer.foszner.pl/ (Figure 39). In addition to the installation version of the program itself, the project site also contains a comprehensive description and user manuals. It is organized in the form of a blog, on which up-to-date information about changes and new versions is published.

Figure 39. Aspect Analyzer official website.


9.2. Third-party software

During the work on this thesis, several third-party software packages were used:

• Bi-Bench [18] – a bi-clustering package consisting of 3 important layers:
o User API – a set of functions written in Python for running all features of the package
o R-CRAN package – for functionalities related to biological data
o bi-clustering algorithm packages installed separately in the operating system
• COLASCE [45]
• CPB [19]
• BBC [46]
• QUBIC [5]
• VennMaster [44]
• BicOverlapper [40]
• Cytoscape [42]
• BiNGO [43]
• ILNumerics [47]
• BiCluster Viewer [41]


9.3. Data

9.3.1. Synthetic data

There is a large number of possible data structures, and algorithms from the literature very often specialize only in a specific one. All possible structures are described in more detail in Chapter 5.1, and a small survey of bi-clustering algorithms is given in Chapter 6. The main characteristic of the synthetic data is its diversity: it contains every important combination of bi-cluster structures and of the degree to which the bi-clusters overlap on rows and columns. The data consist of matrices, each with one of six major structures. Additionally, every matrix representing a single structure appears in one of nine variants regarding the overlapping of bi-clusters over rows and columns. That gives 56 matrices in total. Figure 40 shows examples from that set.

Figure 40. Samples of synthetic data.

Regarding the level of overlapping, the test set consists of various numbers of matrices with different variants of bi-cluster positions in the data matrix. We distinguish data matrices with:

• a single bi-cluster,
• bi-clusters with exclusive rows and columns,
• bi-clusters exclusive on rows and overlapping on columns (25%),
• bi-clusters exclusive on rows and overlapping on columns (50%),
• bi-clusters exclusive on rows and overlapping on columns (75%),
• bi-clusters exclusive on columns and overlapping on rows (25%),
• bi-clusters exclusive on columns and overlapping on rows (50%),
• bi-clusters exclusive on columns and overlapping on rows (75%),
• bi-clusters overlapping on both dimensions (up to 100% of overlap).


Each structure described above appears in six different variants of bi-cluster values. Regarding the bi-cluster structure, we distinguish data with (every single matrix contains only one of the following):

• constant data,
• constant up-regulated data,
• plaid data,
• shifted data,
• scaled data,
• shift-and-scale data.

To sum up, we have nine different variants regarding bi-cluster position and six regarding bi-cluster structure. The final set consists of 56 matrices, each having a different structure and distribution of bi-clusters.

9.3.2. Real data

9.3.2.1. Text mining data

The real data come from the article by Monica Chagoyen et al. [38]. The data were reconstructed based on the information and sources from the article. It is a matrix containing the number of occurrences of words in the context of genes. The genes were selected from the SGD database (the Saccharomyces cerevisiae genome), and each was associated with one of eight broad biological processes (each of them described by a GO Ontology term):

• cell cycle (GO:0007049),
• cell wall organization and biogenesis (GO:0007047),
• DNA metabolism (GO:0006259),
• lipid metabolism (GO:0006629),
• protein biosynthesis (GO:0042158),
• response to stress (GO:0006950),
• signal transduction (GO:0007165),
• transport (GO:0006810).

All genes were annotated by experts with 7080 articles, at least one article per gene. We downloaded all documents listed in the article from the PubMed database. A single document is constructed by concatenating the title and the abstract. After removing very frequent terms (appearing in more than 80% of genes) and very rare terms (less than 4%), we obtained 3031 words. Term frequencies were weighted by the IDF measure [48], which stands for inverse document frequency:

IDF_j = \log\left(\frac{T}{t_j}\right)

where:
• IDF_j is the inverse document frequency for term j,
• T is the total number of documents in the set,
• t_j is the number of documents that contain term j.

The final value of the weighted term frequency of term j in document i can then be defined as:

D_{ij} = tf_{ij} \cdot IDF_j
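A compact sketch of this weighting is shown below; it assumes the term-frequency matrix is given as a NumPy array with documents in rows and terms in columns, and the names are illustrative only.

```python
import numpy as np

def idf_weighted(tf):
    """Weight a documents-by-terms frequency matrix by inverse document frequency.

    tf[i, j] is the raw frequency of term j in document i;
    IDF_j = log(T / t_j), where T is the number of documents and
    t_j is the number of documents containing term j.
    """
    T = tf.shape[0]
    t_j = np.count_nonzero(tf, axis=0)       # documents containing each term
    idf = np.log(T / np.maximum(t_j, 1))     # guard against terms that never occur
    return tf * idf                          # D[i, j] = tf[i, j] * IDF_j

# Toy example: 3 documents, 4 terms.
tf = np.array([[2, 0, 1, 0],
               [1, 1, 0, 0],
               [0, 3, 0, 1]], dtype=float)
print(idf_weighted(tf))
```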

9.3.2.2. Microarray data

Figure 41. Gene expression data from Eng-Juh Yeoh et al. [49] presented as a heatmap.

The data taken from the Eng-Juh Yeoh et al. [49] publication form a gene expression matrix consisting of 360 microarray experiments. Each experiment comes from a different leukemia patient, and each patient in the test group has one of six subtypes of leukemia. A data set defined in this way can be expected to contain six bi-clusters, each associated with a different kind of the disease.


9.4. Computational results

9.4.1. Synthetic data

For each matrix described in Chapter 9.3.1, 100 experiments were performed for the non-deterministic algorithms (each time with different initial conditions) and one experiment for the deterministic algorithms. The following algorithms were used:

• BBC
• Cheng-Church
• BiMax
• CPB
• FABIA
• XMotifs
• Plaid
• ISA
• Qubic

Each algorithm was run on all the data a hundred times, which gives a total of a few thousand experiments, reported below by the kind of data.

For each data set, after all the analyses, the meta-algorithm was performed using the results of the above nine algorithms. This is denoted on the charts by . As clearly seen in the following figures, the algorithm proposed in Chapter 7.4 proved to be the best in most cases on the synthetic data.

The results are presented in Appendix A (Synthetic data). The outcome for each data matrix is presented graphically in a chart and in the form of a numerical table. The chart has "Relevance" on the vertical axis and "Recovery" on the horizontal axis. It is desirable that the result considered the best lies in the upper right corner (1,1), and the worst at the bottom left (0,0).
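For reference, the sketch below computes the relevance and recovery of a found bi-cluster set against an expected (ground-truth) set, assuming the usual best-match Jaccard definitions from Chapter 7.1.2 and the illustrative bi-cluster representation and jaccard function used in the earlier sketches.

```python
def relevance_and_recovery(found, expected, jaccard):
    """Score a set of found bi-clusters against the expected (ground-truth) set.

    relevance: how well, on average, each found bi-cluster matches some expected one;
    recovery:  how well, on average, each expected bi-cluster is matched by some found one.
    """
    relevance = sum(max(jaccard(f, e) for e in expected) for f in found) / len(found)
    recovery = sum(max(jaccard(e, f) for f in found) for e in expected) / len(expected)
    return relevance, recovery
```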


9.4.2. Real data

For the purpose of this analysis two different data sets were chosen. The first set resulted from a text mining analysis of a publication dataset: the data matrix was the matrix of occurrences of words in the context of genes. The documents describe one of eight subjects, so eight bi-clusters are expected, each composed of a set of genes and a set of words. The second data set was a gene expression matrix coming from microarray experiments performed on patients with leukemia. Each of the patients belonged to one of six independent groups, so the analysis should result in six bi-clusters composed of groups of genes and groups of conditions. Both data sets are described in Chapter 9.3.2.

At the beginning, the experiments are performed by the known algorithms. Deterministic algorithms, whose results are repeatable and do not depend on the initial conditions, are performed once. Analyses using non-deterministic algorithms, or ones whose result depends on the initial conditions, are repeated a hundred times.

9.4.2.1. Results repeatability

First, the repeatability of the results over the experiments is examined. To visualize it, bi-clusters from different experiments were assigned to each other by the method described in Chapter 7.4. An additional analysis has been performed in order to filter out algorithms that are not suitable; as such we consider algorithms that return bi-clusters of a quality below a certain threshold and whose results are not repeatable. Then, using the VennMaster software [44], a diagram is produced for each bi-cluster that shows how different the clusters are between experiments. In brackets, after the name of each bi-cluster, the average ACV index value is given (described in Chapter 5.2.2). The average size of a single bi-cluster is represented by two numbers in angle brackets under the diagrams; they represent, respectively, the cardinality of the first and the second cluster (the components of the resulting bi-cluster).

For the example visualizing such an analysis, four non-deterministic algorithms based on non-negative matrix factorization were selected. The algorithms were described in Chapter 6.1; they factorize the data matrix into the product of two other matrices (denoted by W and H):

A \approx W \cdot H

Bi-clusters are read directly from those matrices. We chose PLSA by Hofmann [50] and three NMF variants based on three different distance functions: Least Square Error [8], Kullback–Leibler [8] and non-smooth Kullback–Leibler [22]. These algorithms strongly depend on the initial conditions, which are the randomly generated matrices W and H. For this reason, each experiment was repeated ten times, each time with randomly filled matrices W and H. Examples from that analysis are shown below.

Figure 42. Probabilistic Latent Semantic Analysis.

The average size of a single bi-cluster for the method in Figure 42 is <34, 38>, and the average value of the ACV index is 0.138. As shown, the results are moderately good: only bi-cluster 3 is fully repeatable, and the other bi-clusters have results that are close to each other but not always in the same spot.


Figure 43. NMF based on Kullback–Leibler divergence function.

The average size of a single bi-cluster for the method in Figure 43 is <25, 20>, and the average value of the ACV index is 0.129. The conclusions are very similar to those for the previous figure.

Figure 44. NMF based on Least Square Error distance function.

The average size of a single bi-cluster for the method in Figure 44 is <69, 75>, and the average value of the ACV index is 0.211. This example can be considered very good, because almost every bi-cluster looks the same in every result; the only exception is bi-cluster 2, where three results differ.


Figure 45. NMF based on non-smooth Kullback–Leibler divergence function.

The average size of a single bi-cluster for the method in Figure 45 is <46, 39>, and the average value of the ACV index is 0.253. In this example only half of the bi-clusters produce repeatable clusters over the results.

9.4.2.2. Consensus algorithm comparison

Table 4 and Table 5 present the differences between the ordinary algorithms from the literature and the approach based on merging results. For each non-deterministic method the bi-clustering experiment was carried out one hundred times, and once for each deterministic method. Then, using the method described in Chapter 7.3, connections between bi-clusters were made for each method separately. This results in eight sets, each consisting of one hundred corresponding bi-clusters. After this, the consensus result is created as follows.

Such an algorithm results in eight bi-clusters, each of size <|I_i|, |J_i|>, where |I_i| and |J_i| are the cardinalities of the clusters belonging to the i-th bi-cluster. The average ACV index is computed over all eight bi-clusters, and its value is presented in the rows marked "Consensus" in the "Type" column of the tables. For each method there is a corresponding row marked "Normal"; it contains the average and best ACV values taken from all results of the single method. To make the comparison reliable, the "Normal" values have fixed cluster sizes, set respectively to the average sizes I̅ and J̅.


Table 4. Summary with average bi-cluster quality for text mining data [38].

Method | Type | Average ACV | Best ACV
PLSA | Normal | 0.118 | 0.138
PLSA | Consensus | 0.304 | –
K-L | Normal | 0.129 | 0.147
K-L | Consensus | 0.297 | –
LSE | Normal | 0.211 | 0.245
LSE | Consensus | 0.274 | –
nsK-L | Normal | 0.140 | 0.154
nsK-L | Consensus | 0.253 | –
Cheng-Church | Normal | 0.201 | 0.201
Cheng-Church | Consensus | 0.201 | –
BiMax | Normal | 0.281 | 0.298
BiMax | Consensus | 0.304 | –
CPB | Normal | 0.164 | 0.187
CPB | Consensus | 0.192 | –
FABIA | Normal | 0.288 | 0.299
FABIA | Consensus | 0.320 | –
XMotifs | Normal | 0.312 | 0.315
XMotifs | Consensus | 0.325 | –
ISA | Normal | 0.402 | 0.421
ISA | Consensus | 0.452 | –
Qubic | Normal | 0.187 | 0.207
Qubic | Consensus | 0.159 | –
All results | Normal | 0.221 | 0.421
All results | Consensus | 0.385 | –

Table 5. Summary with average bi-cluster quality for microarray data [49].

Method | Type | Average ACV | Best ACV
PLSA | Normal | 0.118 | 0.138
PLSA | Consensus | 0.304 | –
K-L | Normal | 0.129 | 0.147
K-L | Consensus | 0.297 | –
LSE | Normal | 0.211 | 0.245
LSE | Consensus | 0.274 | –
nsK-L | Normal | 0.140 | 0.154
nsK-L | Consensus | 0.253 | –
Cheng-Church | Normal | 0.198 | 0.198
Cheng-Church | Consensus | 0.198 | –
BiMax | Normal | 0.305 | 0.337
BiMax | Consensus | 0.541 | –
CPB | Normal | 0.158 | 0.185
CPB | Consensus | 0.198 | –
FABIA | Normal | 0.327 | 0.363
FABIA | Consensus | 0.429 | –
XMotifs | Normal | 0.352 | 0.402
XMotifs | Consensus | 0.452 | –
ISA | Normal | 0.502 | 0.596
ISA | Consensus | 0.602 | –
Qubic | Normal | 0.187 | 0.207
Qubic | Consensus | 0.228 | –
All results | Normal | 0.238 | 0.402
All results | Consensus | 0.391 | –

As clearly shown in the tables above, the algorithm based on combining the results of a wide variety of methods gives much better results than the individual algorithms. The resulting bi-cluster set, computed on all available results, is in both cases almost twice as good as the average score.

9.4.2.3. Other ways to determine the result quality.

In the case of real data, we do not know exactly what to expect. We do not have the expected bi-clusters, so we cannot use the same algorithm as in the case of synthetic data. Instead, we will analyze the results in terms of their meaningfulness.

The data contain genes annotated with one of eight Gene Ontology terms from the biological process ontology. Using the Cytoscape program with its BiNGO plugin, we create a term ontology network based on the set of genes from every bi-cluster. In addition, we check how the found sets of genes are associated with the expected terms. In order to link a set of terms with a set of genes we use a hypergeometric test with a significance level of 0.05.


Figure 46. Sample network for gene cluster.

Figure 46 shows an example of a network generated for a single gene cluster by the Kullback–Leibler algorithm. It consists of the terms that have been associated with this cluster using the BiNGO program (colored) and the terms between them and the root (white). Terms are colored from yellow (the largest p-values) to the color denoting the smallest p-values. The attached image shows that the terms for this cluster of genes are detected correctly, because they are found mainly in the branch containing the expected term (which in this case is GO:0006950, "response to stress").

To assess the quality of a single cluster, I chose two values: (1) the network density (higher means better) and (2) the number of nodes in the network (lower means better). The data have been constructed in such a way that they contain eight bi-clusters, each of them described by one ontology term. Therefore, I decided that the best outcome is the one which creates a dense network focused on that term.

Table 6. Comparison of gene ontology trees based on gene clusters.

Method | Network density | Number of nodes
PLSA | 0.066 | 55.875
K-L | 0.093 | 60.125
LSE | 0.090 | 53.375
nsK-L | 0.080 | 60.125
Consensus algorithm | 0.139 | 34.375


10. Conclusions and summary

The aim of this thesis was to develop a universal approach to bi-clustering analysis and a method that is robust to the structure of the data. For this purpose, a synthetic dataset covering almost all relevant data variants was created. The results obtained on its basis showed that the approach proposed in the dissertation is clearly better than the available methods, or no worse than the three best algorithms (for specific data). The quality measure for the synthetic data was the arithmetic mean of the measure defining the coverage of the obtained bi-clusters in the set of expected bi-clusters and the measure defining the coverage of the expected bi-clusters in the set of found bi-clusters; in other words, the arithmetic mean of relevance and recovery.

The proposed method has also been shown to improve performance on real data. For this purpose, two completely different data sets available in the literature were analyzed. It has been shown that this approach significantly improves the quality of the bi-clusters.

To confirm the theses stated above, synthetic data were created (described in Chapter 9.3.1) and two sets of real data were selected from the literature (described in Chapter 9.3.2). For these data sets, analyses were performed and discussed in Chapters 9.4.1 and 9.4.2. Both studies showed a significant improvement in the quality of the results after applying the proposed method.

The result of the work on the algorithm is universal and extensible software for bi-clustering analysis. The software has been released to the public on the Internet, along with an extensive site organized in the form of a blog. At the address http://aspectanalyzer.foszner.pl a ready-to-use installer was posted, along with a complete user manual. In addition, the portal allows reporting bugs, requesting new features, and asking questions about the software. Detailed information about current and planned versions is also published there. All software is provided free of charge as a complete, ready-to-run package.


The original contributions of the dissertation are:

• developed similarity measures between bi-clusters,
• a methodology of combining bi-clustering results based on the generalized Hungarian algorithm,
• a meta-algorithm of bi-clustering combining the results of different methods,
• the publicly available software with a friendly graphical user interface.


Bibliography

[1] J. N. Morgan and J. A. Sonquist, "Problems in the analysis of the survey data, and a proposal," JAm Stat Assoc, pp. 415-434, 1963.

[2] J. N. Hartigan, "Direct clustering of a data matrix," JAm Stat Assoc, pp. 123-129, 1972.

[3] B. Mirkin, "Mathematical Classification and Clustering," Dordrecht: Kluwer., 1996.

[4] Y. Cheng and G. M. Church, "Biclustering of expression data," In Proc. ISMB’00, pp. 93-103, 2000.

[5] G. Li, Q. Ma, H. Tang, A. H. Paterson and Y. Xu, "QUBIC: a qualitative biclustering algorithm for analyses of gene expression data.," Nucleic Acids Res., 2009.

[6] A. Prelic, S. Bleuler and P. Zimmermann, "A systematic comparison and evaluation of biclustering methods for gene expression data," Bioinformatics, p. 1122–9, 2006;.

[7] G. Getz, E. Levine and E. Domany, "Coupled two-way clustering analysis of gene microarray data," In Proceedings of the Natural Academy of Sciences, p. 12079– 12084, 2000.

[8] D. Lee and S. Seung, "Algorithms for Non-negative Matrix Factorization," Advances in neural information processing systems, pp. 556-562, 2000.

[9] P. Foszner, A. Gruca and J. Polańska, "Distant Analysis of the GENEPI-ENTB Databank – System Overview," Computer Networks, 17th Conference, CN 2010, Ustroń, Poland, pp. 245-252, 15-19 June 2010.

[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten, "The WEKA data mining software: an update," ACM SIGKDD explorations newsletter, pp. 10-18, 2009.


[11] P. Foszner, R. Jaksik, A. Gruca, J. Polańska and A. Polański, "Efficient reannotation system for verifying genomic targets of DNA microarray probes," 8th European Conference on Mathematical and Theoretical Biology, and Annual Meeting of The Society for Mathematical Biology, June 28 – July 2, 2011.

[12] P. Foszner, A. Gruca and A. Polański, "Efficient system for clustering of dynamic document database," Lectures Notes in Computer Science 6874, pp. 186 -189, 2011.

[13] P. Foszner, A. Gruca and A. Polański, "Comparisons of biclustering algorithms," IV Zjazd Polskiego Towarzystwa Bioinformatycznego połączony z 9. Warsztatami z Bioinformatyki dla Doktorantów, 2011.

[14] P. Foszner, A. Gruca and A. Polański, "Distributed system for computing bi- clustering algorithms," V Zjazd Polskiego Towarzystwa Bioinformatycznego połączony z 10. Warsztatami z Bioinformatyki dla Doktorantów, 2012.

[15] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: a survey," IEEE/ACM Trans Comput Biol Bioinformatics, pp. 24-45, 2004.

[16] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein and P. Brown, "Clustering methods for the analysis of DNA microarray data," Technical report, Department of Health Research and Policy, Department of Genetics and Department of Biochemestry, Stanford University,, 1999.

[17] S. Busygin, G. Jacobsen and E. Kramer, "Double conjugated clustering applied to leukemia microarray data," In Proceedings of the 2nd SIAM International Conference on Data Mining, Workshop on Clustering High Dimensional Data, 2002.

[18] K. Eren, M. Deveci, O. Kucuktunc and U. V. Catalyurek, "A comparative analysis of biclustering algorithms for gene expression data," BRIEFINGS IN BIOINFORMATICS, pp. 279-292, 2012.


[19] D. Bozdag, J. D. Parvin and U. V. Catalyurek, "A biclustering method to discover co-regulated genes using diverse gene expression datasets," In: Proceedings 1st International Conference Bioinformatics and Computational Biology, p. 151–163, 2009.

[20] L. Teng and L. Chan, "Discovering biclusters by iteratively sorting with weighted correlation coefficient in gene expression data," J. Signal Process. Syst., p. 267–280, 2008.

[21] W. Ayadi, M. Elloumi and J. K. Hao, "A biclustering algorithm based on a bicluster enumeration tree: application to dna microarray data," BioData Mining, 2009.

[22] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann and R. D. Pascual- Marqui, "Non-smooth Non-Negative Matrix Factorization," IEEE Trans on Pattern Analysis and Machine Intelligence, pp. 403-415, 2006.

[23] S. Hochreiter, U. Bodenhofer and M. Heusel, "FABIA: factor analysis for bicluster acquisition," Bioinformatics, p. 1520–1527, 2010.

[24] A. Tanay, R. Sharan, M. Kupiec and R. Shamir, "Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data," Proc Natl Acad Sci, p. 2981–6, 2004.

[25] L. Lazzeroni and A. Owen, "Plaid models for gene expression data," Stat Sin, pp. 61-86, 2000.

[26] H. W. Kuhn, "The Hungarian Method for the assignment problem," Naval Research Logistics Quarterly, p. 83–97, 1955.

[27] D. Konig, "Uber Graphen und ihre Anwendung auf Determinantentheorie und Mengenlehre," Math. Ann., pp. 453-465, 1916.

[28] J. Egervary, "Matrixok kombinatorius tulajdonsagairol," Mat. Fiz. Lapok, pp. 16- 28, 1931.


[29] J. Munkres, "Algorithms for the Assignment and Transportation Problems," Journal of the Society for Industrial and Applied Mathematics, p. 32–38, 1957.

[30] W. Pierskalla, "The multidimensional assignment problem," Operations Research, pp. 422-431, 1968.

[31] A. B. Poore, "Multidimensional assignment formulation of data association problems arising from multitarget and multisensor tracking," Computation Optimization and Applications, pp. 27-54, 1994.

[32] R. Murphey, P. Pardalos and L. Pitsoulis, "A greedy randomized adaptive search procedure for the multitarget multisensor tracking problem," DIMACS Series, pp. 277-302, 1998.

[33] J. F. Pusztaszeri, P. E. Rensing and T. M. Liebling, "Tracking elementary particles near their primary vertex: acombinatorial approach," Journal of Global Optimization, pp. 41-64, 1996.

[34] C. J. Veenman, E. A. Hendriks and M. J. Reinders, "fast and robust point tracking algorithm," Proceedings of the Fifth IEEE International Conferenceon Image Processing, pp. 653-657, 1998.

[35] R. E. Burkard and E. Çela, "Quadratic and three-dimensional assignment problems," Annotated Bibliographies in Combinatorial Optimization, pp. 373- 392, 1997.

[36] R. E. Burkard and E. Çela, "Linear Assignment Problems and extensions," Handbook of Combinatorial Optimization, pp. 75-149, 1999.

[37] E. Çela, "Assignment problems," Oxford University Press, pp. 661-678, 2002.

[38] M. Chagoyen, P. Carmona-Saez, H. Shatkay, J. M. Carazo and A. Pascual-Montano, "Discovering semantic features in the literature: a foundation for building functional associations," BMC Bioinformatics, 2006.

[39] [Online]. Available: http://bioinformatics.cs.vt.edu/~murali/software/biorithm/index.html.

[40] R. Santamaría, R. Therón and L. Quintales, "BicOverlapper: A tool for bicluster visualization," Bioinformatics, pp. 1212-1213, 2008.

[41] J. Heinrich, M. Burch, R. Seifert and D. Weiskopf, "BiCluster Viewer: A Visualization Tool for Analyzing Gene Expression Data," SimTech Cluster of Excellence, 2011.

[42] P. Shannon , A. Markiel , O. Ozier , N. S. Baliga, J. T. Wang , D. Ramage , N. Amin , B. Schwikowski and T. Ideker , "Cytoscape: a software environment for integrated models of biomolecular interaction networks," Genome Research, pp. 2498-2504, 2003.

[43] S. Maere, K. Heymans and M. Kuiper, "BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks," BIOINFORMATICS APPLICATIONS NOTE, p. 3448–3449, 2005.

[44] H. Kestler, A. Müller, J. Kraus, M. Buchholz, T. Gress, H. Liu, D. Kane, B. Zeeberg and J. Weinstein, "VennMaster: Area-proportional diagrams for functional GO analysis of microarrays," BMC Bioinformatics, 2008.

[45] C. Huttenhower, K. T. Mutungu and N. Indik, "Detailing regulatory networks through large scale data integration," Bioinformatics, p. 3267–3274, 2009.

[46] J. Gu and J. S. Liu, "Bayesian biclustering of gene expression data," BMC Genomics, 2008.

[47] I. GmbH. [Online]. Available: http://ilnumerics.net/.

[48] K. Spärck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, pp. 11-21, 1972.

[49] E. J. Yeoh, M. E. Ross and S. A. Shurtleff, "Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling," Cancer, p. 133–143, 2002.


[50] T. Hofmann, "Unsupervised Learning by Probabilistic Latent Semantic Analysis," Machine Learning, pp. 177-196, 2001.

[51] A. Ben-Dor, B. Chor and R. Karp, "Discovering local structure in gene expression data: the order-preserving submatrix problem," Journal of Computational Biology, p. 373–384, 2003.

[52] S. Bergmann, J. Ihmels and N. Barkai, "Iterative signature algorithm for the analysis of large-scale gene expression data," Phys Rev E, 2003.

[53] Y. Kluger, R. Basri and J. T. Chang, "Spectral biclustering of microarray data: coclustering genes and conditions," Genome Res, p. 703–716, 2003.

[54] T. M. Murali and S. Kasif, "Extracting conserved gene expression motifs from gene expression data," Pacific Symposium of Biocomputing, p. 77–88, 2003.


List of Symbols and Abbreviations

• BBC – Bayesian Bi-Clustering
• FABIA – Factor Analysis for Bi-cluster Acquisition
• QUBIC – Qualitative Biclustering
• IDF – Inverse Document Frequency
• LAP – Linear Assignment Problem
• MAP – Multidimensional Assignment Problem
• NMF – Non-negative Matrix Factorization


Table of Figures

Figure 1. Comparison between classical clustering approach versus bi-clustering ...... 11
Figure 2. Simple visualization of bi-clustering ...... 12
Figure 3. Bi-clustering analysis sample workflow ...... 13
Figure 4. Simplified bi-clustering analysis workflow ...... 14
Figure 5. Bi-cluster types: 1) Constant, 2) Constant on columns, 3) Constant on rows, 4) Coherent values (additive model), 5) and 6) Coherent values (multiplicative model), 7) Coherent evolutions ...... 17
Figure 6. Bi-cluster structures ...... 21
Figure 7. Sample function of change in distance function vs step number ...... 25
Figure 8. Bi-cluster extraction in NMF algorithms ...... 26
Figure 9. Sample QUBIC transformation from matrix of integers to final graph ...... 35
Figure 10. Example of hierarchical clustering ...... 36
Figure 11. Example of block clustering. Figure taken from original Hartigan publication [2] ...... 37
Figure 12. Graphical representation of bi-cluster similarity ...... 41
Figure 13. Differences between relevance and recovery ...... 42
Figure 14. Consensus score algorithm shown by bipartite graph ...... 44
Figure 15. Comparison between Munkres algorithm and classical linear programming approach ...... 45
Figure 16. Example of multidimensional assignment problem ...... 52
Figure 17. The combination of n independent bi-clustering results with k clusters ...... 57
Figure 18. Graphical representation of initial graph with results ...... 57
Figure 19. Graphical representation of graph after analysis ...... 58
Figure 20. The symbolic diagram showing connected results (with various sizes) ...... 59
Figure 21. Graphical representation of graph (with empty clusters) after analysis ...... 59
Figure 22. Visualization of original data before analysis ...... 63
Figure 23. Visualization of original data after analysis ...... 63
Figure 24. Real data from Monica Chagoyen paper [38] ...... 67
Figure 25. BiVoC algorithm sample result ...... 67
Figure 26. BicOverlapper graph representation ...... 68
Figure 27. Example of BiCluster Viewer, taken from original publication [41] ...... 69


Figure 28. Gene ontology tree composed with gene ontology terms ...... 70
Figure 29. Venn Diagram with visualization of merge of different results. Computed using VennMaster tool [44] ...... 71
Figure 30. AspectAnalyzer main window ...... 72
Figure 31. AspectAnalyzer data diagram ...... 74
Figure 32. Node Manager window from AspectAnalyzer ...... 75
Figure 33. Slave settings window for node manager ...... 76
Figure 34. Painter window from AspectAnalyzer ...... 77
Figure 35. Data window from AspectAnalyzer ...... 78
Figure 36. Sample chart with changes in divergence function values ...... 78
Figure 37. Aspect Analyzer Update Window ...... 79
Figure 38. About window from AspectAnalyzer ...... 79
Figure 39. Aspect Analyzer official website ...... 80
Figure 40. Samples of synthetic data ...... 82
Figure 41. Gene expression data from Eng-Juh Yeoh et al. [49] presented as a heatmap ...... 84
Figure 42. Probabilistic Latent Semantic Analysis ...... 87
Figure 43. NMF based on Kullback–Leibler divergence function ...... 88
Figure 44. NMF based on Least Square Error distance function ...... 88
Figure 45. NMF based on non-smooth Kullback–Leibler divergence function ...... 89
Figure 46. Sample network for gene cluster ...... 92

Index of tables

Table 1. Comparison of evaluation functions on bi-clusters from Figure 1 ...... 24
Table 2. Example assignment task ...... 47
Table 3. Comparison of standard C# implementation and ILNumerics ...... 73
Table 4. Summary with average bi-cluster quality for text mining data [38] ...... 90
Table 5. Summary with average bi-cluster quality for microarray data [49] ...... 90
Table 6. Comparison of gene ontology trees based on gene clusters ...... 92
Table 7. Numeric results for single bi-cluster data with constant values ...... 109
Table 8. Numeric results for single bi-cluster data with constant up-regulated values ...... 109
Table 9. Numeric results for single bi-cluster data with plaid values ...... 109
Table 10. Numeric results for single bi-cluster data with shift and scale values ...... 110
Table 11. Numeric results for single bi-cluster data with shift values ...... 110
Table 12. Numeric results for single bi-cluster data with scaled values ...... 110
Table 13. Numeric results for exclusive row and columns data with constant values ...... 112
Table 14. Numeric results for exclusive row and columns data with constant up-regulated values ...... 112
Table 15. Numeric results for exclusive row and columns data with plaid values ...... 112
Table 16. Numeric results for exclusive row and columns data with shift and scale values ...... 113
Table 17. Numeric results for exclusive row and columns data with shift values ...... 113
Table 18. Numeric results for exclusive row and columns data with scaled values ...... 113
Table 19. Numeric results for data exclusive on rows and overlapping on columns (25%) with constant values ...... 115
Table 20. Numeric results for data exclusive on rows and overlapping on columns (25%) with constant up-regulated values ...... 115
Table 21. Numeric results for data exclusive on rows and overlapping on columns (25%) with plaid values ...... 115
Table 22. Numeric results for data exclusive on rows and overlapping on columns (25%) with shift and scale values ...... 116
Table 23. Numeric results for data exclusive on rows and overlapping on columns (25%) with shift values ...... 116
Table 24. Numeric results for data exclusive on rows and overlapping on columns (25%) with scaled values ...... 116
Table 25. Numeric results for data exclusive on rows and overlapping on columns (50%) with constant values ...... 118

Table 26. Numeric results for data exclusive on rows and overlapping on columns (50%) with constant up-regulated values ...... 118
Table 27. Numeric results for data exclusive on rows and overlapping on columns (50%) with plaid values ...... 118
Table 28. Numeric results for data exclusive on rows and overlapping on columns (50%) with shift and scale values ...... 119
Table 29. Numeric results for data exclusive on rows and overlapping on columns (50%) with shift values ...... 119
Table 30. Numeric results for data exclusive on rows and overlapping on columns (50%) with scaled values ...... 119
Table 31. Numeric results for data exclusive on rows and overlapping on columns (75%) with constant values ...... 121
Table 32. Numeric results for data exclusive on rows and overlapping on columns (75%) with constant up-regulated values ...... 121
Table 33. Numeric results for data exclusive on rows and overlapping on columns (75%) with plaid values ...... 121
Table 34. Numeric results for data exclusive on rows and overlapping on columns (75%) with shift and scale values ...... 122
Table 35. Numeric results for data exclusive on rows and overlapping on columns (75%) with shift values ...... 122
Table 36. Numeric results for data exclusive on rows and overlapping on columns (75%) with scaled values ...... 122
Table 37. Numeric results for data exclusive on columns and overlapping on rows (25%) with constant values ...... 124
Table 38. Numeric results for data exclusive on columns and overlapping on rows (25%) with constant up-regulated values ...... 124
Table 39. Numeric results for data exclusive on columns and overlapping on rows (25%) with plaid values ...... 124
Table 40. Numeric results for data exclusive on columns and overlapping on rows (25%) with shift and scale values ...... 125
Table 41. Numeric results for data exclusive on columns and overlapping on rows (25%) with shift values ...... 125
Table 42. Numeric results for data exclusive on columns and overlapping on rows (25%) with scaled values ...... 125
Table 43. Numeric results for data exclusive on columns and overlapping on rows (50%) with constant values ...... 127
Table 44. Numeric results for data exclusive on columns and overlapping on rows (50%) with constant up-regulated values ...... 127
Table 45. Numeric results for data exclusive on columns and overlapping on rows (50%) with plaid values ...... 127
Table 46. Numeric results for data exclusive on columns and overlapping on rows (50%) with shift and scale values ...... 128
Table 47. Numeric results for data exclusive on columns and overlapping on rows (50%) with shift values ...... 128
Table 48. Numeric results for data exclusive on columns and overlapping on rows (50%) with scaled values ...... 128
Table 49. Numeric results for data exclusive on columns and overlapping on rows (75%) with constant values ...... 130
Table 50. Numeric results for data exclusive on columns and overlapping on rows (75%) with constant up-regulated values ...... 130
Table 51. Numeric results for data exclusive on columns and overlapping on rows (75%) with plaid values ...... 130
Table 52. Numeric results for data exclusive on columns and overlapping on rows (75%) with shift and scale values ...... 131
Table 53. Numeric results for data exclusive on columns and overlapping on rows (75%) with shift values ...... 131

Table 54. Numeric results for data exclusive on columns and overlapping on rows (75%) with scaled values ...... 131
Table 55. Numeric results for data overlapping on both rows and columns with constant values ...... 133
Table 56. Numeric results for data overlapping on both rows and columns with constant up-regulated values ...... 133
Table 57. Numeric results for data overlapping on both rows and columns with plaid values ...... 133
Table 58. Numeric results for data overlapping on both rows and columns with shift and scale values ...... 134
Table 59. Numeric results for data overlapping on both rows and columns with shift values ...... 134
Table 60. Numeric results for data overlapping on both rows and columns with scaled values ...... 134

Appendix

A. Synthetic data
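
Each table in this appendix lists, for every method, the recovery and relevance measures described in Section 7.1.2, their arithmetic mean in the Score column, and the average number of bi-clusters returned. The minimal sketch below (in Python) illustrates how such values can be obtained when a found bi-clustering result is compared against a ground-truth one, assuming that bi-cluster similarity is the Jaccard index of Section 7.1.1 and that recovery and relevance are best-match averages; the helper names jaccard, match_score and evaluate are illustrative only and are not part of AspectAnalyzer.

# Minimal sketch (assumptions): a bi-cluster is a (rows, columns) pair of index sets,
# bi-cluster similarity is the Jaccard index of the covered cells, and recovery /
# relevance are best-match averages in the spirit of Section 7.1.2.

def jaccard(b1, b2):
    """Jaccard index of two bi-clusters given as (rows, cols) pairs of sets."""
    cells1 = {(r, c) for r in b1[0] for c in b1[1]}
    cells2 = {(r, c) for r in b2[0] for c in b2[1]}
    union = len(cells1 | cells2)
    return len(cells1 & cells2) / union if union else 0.0

def match_score(first, second):
    """Average, over 'first', of the best Jaccard match found in 'second'."""
    if not first or not second:
        return 0.0
    return sum(max(jaccard(b, other) for other in second) for b in first) / len(first)

def evaluate(found, expected):
    relevance = match_score(found, expected)   # how well the found bi-clusters fit the truth
    recovery = match_score(expected, found)    # how much of the truth is recovered
    return recovery, relevance, (recovery + relevance) / 2.0

# Toy example: one hidden bi-cluster, one reported bi-cluster covering half of its rows.
expected = [(set(range(10)), set(range(10)))]
found = [(set(range(5)), set(range(10)))]
print(evaluate(found, expected))   # -> (0.5, 0.5, 0.5)

Read this way, the Score column simply averages the two preceding columns; for example, the CPB row of Table 7 (recovery 0,994, relevance 0,071) gives (0,994 + 0,071)/2 ≈ 0,533.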

Single bi-cluster

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 7. Numeric results for single bi-cluster data with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,385 | 0,385 | 0,385 | 1
Cheng-Church | 0,926 | 0,926 | 0,926 | 1
BiMax | - | - | - | -
CPB | 0,994 | 0,071 | 0,533 | 14,61
FABIA | - | - | - | -
XMotifs | 0,966 | 0,966 | 0,966 | 1
Plaid | - | - | - | -
ISA | 0,099 | 0,077 | 0,088 | 1,577
Qubic | 0,133 | 0,008 | 0,071 | 16,83
Consensus | 1 | 1 | 1 | 1

Table 8. Numeric results for single bi-cluster data with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,553 | 0,553 | 0,553 | 1
Cheng-Church | 0,011 | 0,011 | 0,011 | 1
BiMax | 0,788 | 0,009 | 0,398 | 86,5
CPB | 0,623 | 0,516 | 0,57 | 75,86
FABIA | 0,651 | 0,651 | 0,651 | 1
XMotifs | 0,85 | 0,85 | 0,85 | 1
Plaid | 1 | 1 | 1 | 1
ISA | - | - | - | -
Qubic | 0,723 | 0,085 | 0,404 | 17,37
Consensus | 1 | 1 | 1 | 1

Table 9. Numeric results for single bi-cluster data with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,983 | 0,983 | 0,983 | 1
Cheng-Church | 0,083 | 0,083 | 0,083 | 1
BiMax | 0,519 | 0,003 | 0,261 | 185
CPB | 0,501 | 0,098 | 0,299 | 5,84
FABIA | 0,558 | 0,558 | 0,558 | 1
XMotifs | 0,133 | 0,133 | 0,133 | 1
Plaid | 0,868 | 0,868 | 0,868 | 1
ISA | 0,438 | 0,438 | 0,438 | 1
Qubic | 0,302 | 0,014 | 0,158 | 23,67
Consensus | 1 | 1 | 1 | 1

Table 10. Numeric results for single bi-cluster data with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,594 | 0,594 | 0,594 | 1
Cheng-Church | 0,669 | 0,669 | 0,669 | 1
BiMax | 0,131 | 0,013 | 0,072 | 10
CPB | 0,994 | 0,312 | 0,653 | 1
FABIA | 0,41 | 0,41 | 0,41 | 4,12
XMotifs | 0,214 | 0,214 | 0,214 | 1
Plaid | 0,519 | 0,519 | 0,519 | 1
ISA | 0,409 | 0,091 | 0,25 | 4,57
Qubic | 0,266 | 0,011 | 0,138 | 25,49
Consensus | 1 | 1 | 1 | 1

Table 11. Numeric results for single bi-cluster data with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,633 | 0,633 | 0,633 | 1
Cheng-Church | 0,744 | 0,744 | 0,744 | 1
BiMax | 0,381 | 0,032 | 0,206 | 25,96
CPB | 1 | 0,912 | 0,956 | 1,19
FABIA | 0,506 | 0,506 | 0,506 | 1
XMotifs | 0,2 | 0,2 | 0,2 | 1
Plaid | 0,914 | 0,914 | 0,914 | 1
ISA | 0,399 | 0,191 | 0,295 | 2,12
Qubic | 0,342 | 0,015 | 0,179 | 22,15
Consensus | 0,941 | 0,941 | 0,941 | 1

Table 12. Numeric results for single bi-cluster data with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,425 | 0,425 | 0,425 | 1
Cheng-Church | 0,719 | 0,719 | 0,719 | 1
BiMax | 0,137 | 0,015 | 0,076 | 8,969
CPB | 1 | 0,277 | 0,638 | 4,39
FABIA | 0,314 | 0,314 | 0,314 | 1
XMotifs | 0,399 | 0,399 | 0,399 | 1
Plaid | 0,314 | 0,314 | 0,314 | 1
ISA | 0,406 | 0,365 | 0,385 | 1,24
Qubic | 0,266 | 0,011 | 0,138 | 24,43
Consensus | 1 | 1 | 1 | 1

Bi-clusters with exclusive rows and columns

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 13. Numeric results for exclusive row and columns data with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,693 | 0,693 | 0,693 | 4
Cheng-Church | 0,182 | 0,727 | 0,454 | 4
BiMax | - | - | - | -
CPB | 0,706 | 0,623 | 0,665 | 5,4
FABIA | - | - | - | -
XMotifs | 0,645 | 0,645 | 0,645 | 4
Plaid | - | - | - | -
ISA | 0,024 | 0,069 | 0,047 | 4,078
Qubic | 0,092 | 0,022 | 0,057 | 17,57
Consensus | 0,8 | 0,8 | 0,8 | 4

Table 14. Numeric results for exclusive row and columns data with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,902 | 0,906 | 0,904 | 1
Cheng-Church | 0,008 | 0,032 | 0,02 | 1
BiMax | 0,891 | 0,056 | 0,473 | 64
CPB | 0,224 | 0,342 | 0,283 | 125,36
FABIA | 0,951 | 0,953 | 0,952 | 4
XMotifs | 0,187 | 0,548 | 0,367 | 4
Plaid | 0,493 | 0,997 | 0,745 | 4
ISA | 0,628 | 0,198 | 0,413 | 12,89
Qubic | 0,797 | 0,396 | 0,596 | 8,83
Consensus | 1 | 1 | 1 | 4

Table 15. Numeric results for exclusive row and columns data with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,981 | 0,981 | 0,981 | 4
Cheng-Church | 0,027 | 0,108 | 0,067 | 4
BiMax | 0,431 | 0,051 | 0,241 | 52,18
CPB | 0,496 | 0,14 | 0,318 | 14,72
FABIA | 0,316 | 0,386 | 0,351 | 4
XMotifs | 0,109 | 0,227 | 0,168 | 4
Plaid | 0,468 | 0,48 | 0,474 | 4
ISA | 0,726 | 0,342 | 0,534 | 8,56
Qubic | 0,267 | 0,03 | 0,148 | 40,3
Consensus | 1 | 1 | 1 | 4

Table 16. Numeric results for exclusive row and columns data with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,638 | 0,638 | 0,638 | 4
Cheng-Church | 0,072 | 0,289 | 0,181 | 4
BiMax | 0,134 | 0,092 | 0,113 | 5,92
CPB | 0,846 | 0,549 | 0,698 | 6,53
FABIA | 0,313 | 0,319 | 0,316 | 4
XMotifs | 0,149 | 0,267 | 0,208 | 4
Plaid | 0,135 | 0,399 | 0,267 | 4
ISA | 0,525 | 0,178 | 0,351 | 11,92
Qubic | 0,135 | 0,033 | 0,084 | 16,39
Consensus | 0,675 | 0,675 | 0,675 | 4

Table 17. Numeric results for exclusive row and columns data with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,843 | 0,843 | 0,843 | 4
Cheng-Church | 0,073 | 0,291 | 0,182 | 4
BiMax | 0,261 | 0,057 | 0,159 | 18,68
CPB | 0,989 | 0,622 | 0,806 | 6,71
FABIA | 0,349 | 0,357 | 0,353 | 4
XMotifs | 0,079 | 0,081 | 0,08 | 4
Plaid | 0,446 | 0,746 | 0,596 | 4
ISA | 0,548 | 0,266 | 0,407 | 8,34
Qubic | 0,215 | 0,035 | 0,125 | 24,77
Consensus | 0,676 | 0,676 | 0,676 | 0,676

Table 18. Numeric results for exclusive row and columns data with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,501 | 0,501 | 0,501 | 4
Cheng-Church | 0,104 | 0,415 | 0,259 | 4
BiMax | 0,038 | 0,077 | 0,057 | 4
CPB | 0,916 | 0,604 | 0,76 | 6,3
FABIA | 0,213 | 0,225 | 0,219 | 4
XMotifs | 0,249 | 0,454 | 0,351 | 4
Plaid | 0,093 | 0,371 | 0,232 | 4
ISA | 0,518 | 0,495 | 0,506 | 4,24
Qubic | 0,157 | 0,028 | 0,093 | 22,68
Consensus | 0,547 | 0,547 | 0,547 | 4

Exclusive on rows and overlapping on columns (25%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 19. Numeric results for data exclusive on rows and overlapping on columns (25%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,663 | 0,669 | 0,666 | 4
Cheng-Church | 0,175 | 0,699 | 0,437 | 4
BiMax | - | - | - | -
CPB | 0,878 | 0,367 | 0,622 | 10,4
FABIA | - | - | - | -
XMotifs | 0,628 | 0,628 | 0,628 | 4
Plaid | - | - | - | -
ISA | 0,026 | 0,064 | 0,04 | 4,057
Qubic | 0,108 | 0,022 | 0,065 | 19,86
Consensus | 0,678 | 0,678 | 0,678 | 4

Table 20. Numeric results for data exclusive on rows and overlapping on columns (25%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,85 | 0,852 | 0,851 | 4
Cheng-Church | 0,005 | 0,021 | 0,013 | 4
BiMax | 0,816 | 0,021 | 0,419 | 154
CPB | 0,205 | 0,39 | 0,297 | 66,08
FABIA | 0,947 | 0,947 | 0,947 | 4
XMotifs | 0,161 | 0,452 | 0,30 | 4
Plaid | 0,478 | 0,974 | 0,726 | 4
ISA | 0,577 | 0,03 | 0,304 | 76,48
Qubic | 0,779 | 0,322 | 0,55 | 10,62
Consensus | 1 | 1 | 1 | 4

Table 21. Numeric results for data exclusive on rows and overlapping on columns (25%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,993 | 0,993 | 0,993 | 4
Cheng-Church | 0,022 | 0,09 | 0,056 | 4
BiMax | 0,346 | 0,088 | 0,217 | 90
CPB | 0,495 | 0,13 | 0,313 | 16,01
FABIA | 0,35 | 0,39 | 0,37 | 4
XMotifs | 0,032 | 0,093 | 0,063 | 4
Plaid | 0,581 | 0,595 | 0,588 | 4
ISA | 0,674 | 0,248 | 0,461 | 10,91
Qubic | 0,34 | 0,055 | 0,198 | 28,42
Consensus | 1 | 1 | 1 | 4

Table 22. Numeric results for data exclusive on rows and overlapping on columns (25%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,62 | 0,62 | 0,62 | 4
Cheng-Church | 0,046 | 0,183 | 0,114 | 4
BiMax | 0,208 | 0,058 | 0,133 | 14,5
CPB | 0,819 | 0,579 | 0,699 | 6,07
FABIA | 0,273 | 0,298 | 0,286 | 4
XMotifs | 0,062 | 0,11 | 0,086 | 4
Plaid | 0,123 | 0,417 | 0,27 | 4
ISA | 0,52 | 0,273 | 0,396 | 7,68
Qubic | 0,158 | 0,039 | 0,098 | 16,52
Consensus | 0,783 | 0,783 | 0,783 | 4

Table 23. Numeric results for data exclusive on rows and overlapping on columns (25%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,976 | 0,976 | 0,976 | 4
Cheng-Church | 0,054 | 0,216 | 0,135 | 4
BiMax | 0,248 | 0,058 | 0,153 | 18,24
CPB | 0,995 | 0,557 | 0,776 | 7,44
FABIA | 0,37 | 0,385 | 0,377 | 4
XMotifs | 0,091 | 0,107 | 0,099 | 4
Plaid | 0,487 | 0,783 | 0,635 | 4
ISA | 0,502 | 0,253 | 0,378 | 8,03
Qubic | 0,227 | 0,034 | 0,13 | 27,88
Consensus | 0,943 | 0,943 | 0,943 | 4

Table 24. Numeric results for data exclusive on rows and overlapping on columns (25%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,502 | 0,503 | 0,503 | 4
Cheng-Church | 0,157 | 0,626 | 0,391 | 4
BiMax | 0,041 | 0,162 | 0,102 | 4
CPB | 0,832 | 0,714 | 0,773 | 5,1
FABIA | 0,144 | 0,289 | 0,217 | 4
XMotifs | 0,132 | 0,258 | 0,195 | 4
Plaid | 0,088 | 0,353 | 0,221 | 4
ISA | 0,436 | 0,424 | 0,43 | 4,14
Qubic | 0,145 | 0,024 | 0,085 | 24,65
Consensus | 0,898 | 0,898 | 0,898 | 4

Exclusive on rows and overlapping on columns (50%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 25. Numeric results for data exclusive on rows and overlapping on columns (50%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,518 | 0,547 | 0,532 | 4
Cheng-Church | 0,205 | 0,819 | 0,512 | 4
BiMax | - | - | - | -
CPB | 0,895 | 0,386 | 0,64 | 10,02
FABIA | - | - | - | -
XMotifs | 0,557 | 0,557 | 0,557 | 4
Plaid | - | - | - | -
ISA | 0,022 | 0,066 | 0,044 | 4,018
Qubic | 0,093 | 0,019 | 0,056 | 20,17
Consensus | 0,6 | 0,6 | 0,6 | 4

Table 26. Numeric results for data exclusive on rows and overlapping on columns (50%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,766 | 0,768 | 0,767 | 4
Cheng-Church | 0,008 | 0,032 | 0,02 | 4
BiMax | 0,938 | 0,022 | 0,48 | 167
CPB | 0,292 | 0,511 | 0,401 | 57,33
FABIA | 0,885 | 0,888 | 0,886 | 4
XMotifs | 0,175 | 0,514 | 0,344 | 4
Plaid | 0,621 | 0,725 | 0,673 | 4
ISA | 0,337 | 0,024 | 0,18 | 57,14
Qubic | 0,759 | 0,291 | 0,525 | 10,96
Consensus | 0,884 | 0,884 | 0,884 | 4

Table 27. Numeric results for data exclusive on rows and overlapping on columns (50%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,997 | 0,997 | 0,997 | 4
Cheng-Church | 0,024 | 0,098 | 0,061 | 4
BiMax | 0,484 | 0,009 | 0,246 | 206
CPB | 0,494 | 0,134 | 0,314 | 15,37
FABIA | 0,378 | 0,426 | 0,402 | 4
XMotifs | 0,017 | 0,067 | 0,042 | 4
Plaid | 0,689 | 0,696 | 0,692 | 4
ISA | 0,677 | 0,291 | 0,484 | 9,32
Qubic | 0,267 | 0,048 | 0,157 | 26,51
Consensus | 1 | 1 | 1 | 4

Table 28. Numeric results for data exclusive on rows and overlapping on columns (50%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,647 | 0,648 | 0,648 | 4
Cheng-Church | 0,091 | 0,365 | 0,228 | 4
BiMax | 0,167 | 0,067 | 0,117 | 10,58
CPB | 0,743 | 0,578 | 0,66 | 5,63
FABIA | 0,29 | 0,301 | 0,295 | 4
XMotifs | 0,122 | 0,237 | 0,179 | 4
Plaid | 0,145 | 0,462 | 0,304 | 4
ISA | 0,5 | 0,157 | 0,328 | 12,84
Qubic | 0,255 | 0,054 | 0,154 | 19,18
Consensus | 0,599 | 0,599 | 0,599 | 4

Table 29. Numeric results for data exclusive on rows and overlapping on columns (50%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,975 | 0,975 | 0,975 | 4
Cheng-Church | 0,148 | 0,591 | 0,369 | 4
BiMax | 0,206 | 0,036 | 0,121 | 23,2
CPB | 0,991 | 0,557 | 0,774 | 7,46
FABIA | 0,347 | 0,406 | 0,377 | 4
XMotifs | 0,085 | 0,096 | 0,091 | 4
Plaid | 0,41 | 0,72 | 0,565 | 4
ISA | 0,503 | 0,253 | 0,378 | 8,09
Qubic | 0,198 | 0,039 | 0,118 | 20,98
Consensus | 0,97 | 0,97 | 0,97 | 4

Table 30. Numeric results for data exclusive on rows and overlapping on columns (50%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,44 | 0,443 | 0,441 | 4
Cheng-Church | 0,154 | 0,617 | 0,385 | 4
BiMax | 0,052 | 0,105 | 0,078 | 4
CPB | 0,952 | 0,632 | 0,792 | 6,39
FABIA | 0,204 | 0,273 | 0,238 | 4
XMotifs | 0,184 | 0,208 | 0,196 | 4
Plaid | 0,061 | 0,227 | 0,144 | 4
ISA | 0,529 | 0,519 | 0,524 | 4,09
Qubic | 0,133 | 0,032 | 0,082 | 16,84
Consensus | 0,571 | 0,571 | 0,571 | 4

Exclusive on rows and overlapping on columns (75%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 31. Numeric results for data exclusive on rows and overlapping on columns (75%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,335 | 0,335 | 0,335 | 4
Cheng-Church | 0,141 | 0,565 | 0,353 | 4
BiMax | - | - | - | -
CPB | 0,907 | 0,207 | 0,557 | 18,44
FABIA | - | - | - | -
XMotifs | 0,46 | 0,46 | 0,46 | 4
Plaid | - | - | - | -
ISA | 0,022 | 0,062 | 0,042 | 4,047
Qubic | 0,112 | 0,021 | 0,067 | 21,25
Consensus | 0,495 | 0,495 | 0,495 | 4

Table 32. Numeric results for data exclusive on rows and overlapping on columns (75%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,72 | 0,767 | 0,743 | 4
Cheng-Church | 0,005 | 0,021 | 0,013 | 4
BiMax | 0,722 | 0,013 | 0,368 | 225
CPB | 0,363 | 0,559 | 0,461 | 30,69
FABIA | 0,729 | 0,881 | 0,805 | 4
XMotifs | 0,175 | 0,524 | 0,35 | 4
Plaid | 0,515 | 0,628 | 0,571 | 4
ISA | 0,212 | 0,062 | 0,137 | 14,52
Qubic | 0,193 | 0,081 | 0,137 | 11,32
Consensus | 0,858 | 0,858 | 0,858 | 4

Table 33. Numeric results for data exclusive on rows and overlapping on columns (75%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,926 | 0,926 | 0,926 | 4
Cheng-Church | 0,019 | 0,075 | 0,047 | 4
BiMax | 0,382 | 0,006 | 0,194 | 240
CPB | 0,467 | 0,181 | 0,324 | 10,79
FABIA | 0,407 | 0,478 | 0,442 | 4
XMotifs | 0,017 | 0,067 | 0,042 | 4
Plaid | 0,826 | 0,846 | 0,836 | 4
ISA | 0,591 | 0,262 | 0,427 | 9,06
Qubic | 0,23 | 0,046 | 0,138 | 24,02
Consensus | 1 | 1 | 1 | 4

Table 34. Numeric results for data exclusive on rows and overlapping on columns (75%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,515 | 0,515 | 0,515 | 4
Cheng-Church | 0,073 | 0,293 | 0,183 | 4
BiMax | 0,191 | 0,038 | 0,115 | 24
CPB | 0,827 | 0,437 | 0,632 | 8,18
FABIA | 0,278 | 0,361 | 0,319 | 4
XMotifs | 0,157 | 0,309 | 0,233 | 4
Plaid | 0,13 | 0,433 | 0,282 | 4
ISA | 0,501 | 0,202 | 0,351 | 10,12
Qubic | 0,238 | 0,036 | 0,137 | 27,23
Consensus | 0,606 | 0,606 | 0,606 | 4

Table 35. Numeric results for data exclusive on rows and overlapping on columns (75%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,85 | 0,85 | 0,85 | 4
Cheng-Church | 0,077 | 0,309 | 0,193 | 4
BiMax | 0,284 | 0,014 | 0,149 | 85,5
CPB | 0,987 | 0,575 | 0,781 | 7,28
FABIA | 0,33 | 0,431 | 0,381 | 4
XMotifs | 0,084 | 0,084 | 0,084 | 4
Plaid | 0,145 | 0,436 | 0,291 | 4
ISA | 0,453 | 0,123 | 0,288 | 15,03
Qubic | 0,204 | 0,043 | 0,124 | 19,45
Consensus | 0,903 | 0,903 | 0,903 | 4

Table 36. Numeric results for data exclusive on rows and overlapping on columns (75%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,414 | 0,414 | 0,414 | 4
Cheng-Church | 0,113 | 0,452 | 0,282 | 4
BiMax | 0,041 | 0,162 | 0,102 | 4
CPB | 0,955 | 0,643 | 0,799 | 6,28
FABIA | 0,273 | 0,277 | 0,275 | 4
XMotifs | 0,181 | 0,232 | 0,206 | 4
Plaid | 0,044 | 0,177 | 0,111 | 4
ISA | 0,521 | 0,512 | 0,517 | 4,08
Qubic | 0,13 | 0,026 | 0,078 | 20,24
Consensus | 0,57 | 0,57 | 0,57 | 4

Exclusive on columns and overlapping on rows (25%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 37. Numeric results for data exclusive on columns and overlapping on rows (25%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,638 | 0,64 | 0,639 | 4
Cheng-Church | 0,183 | 0,732 | 0,458 | 4
BiMax | - | - | - | -
CPB | 0,858 | 0,417 | 0,637 | 8,86
FABIA | - | - | - | -
XMotifs | 0,606 | 0,608 | 0,607 | 4
Plaid | - | - | - | -
ISA | 0,022 | 0,066 | 0,044 | 4
Qubic | 0,099 | 0,018 | 0,059 | 22,43
Consensus | 0,578 | 0,578 | 0,578 | 4

Table 38. Numeric results for data exclusive on columns and overlapping on rows (25%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,69 | 0,69 | 0,69 | 4
Cheng-Church | 0,005 | 0,022 | 0,014 | 4
BiMax | 0,789 | 0,02 | 0,404 | 159
CPB | 0,392 | 0,492 | 0,442 | 7,24
FABIA | 0,812 | 0,812 | 0,812 | 4
XMotifs | 0,358 | 0,463 | 0,41 | 4
Plaid | 0,495 | 0,851 | 0,673 | 4
ISA | 0,812 | 0,464 | 0,638 | 7
Qubic | 0,787 | 0,221 | 0,504 | 14,54
Consensus | 1 | 1 | 1 | 4

Table 39. Numeric results for data exclusive on columns and overlapping on rows (25%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,601 | 0,601 | 0,601 | 4
Cheng-Church | 0,024 | 0,096 | 0,06 | 4
BiMax | 0,484 | 0,015 | 0,249 | 133
CPB | 0,44 | 0,126 | 0,283 | 14,88
FABIA | 0,328 | 0,363 | 0,345 | 4
XMotifs | 0,094 | 0,2 | 0,147 | 4
Plaid | 0,562 | 0,564 | 0,563 | 4
ISA | 0,708 | 0,282 | 0,495 | 10,12
Qubic | 0,296 | 0,033 | 0,165 | 39,05
Consensus | 0,526 | 0,526 | 0,526 | 4

Table 40. Numeric results for data exclusive on columns and overlapping on rows (25%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,637 | 0,637 | 0,637 | 4
Cheng-Church | 0,013 | 0,053 | 0,033 | 4
BiMax | 0,126 | 0,052 | 0,089 | 9,65
CPB | 0,667 | 0,572 | 0,619 | 5,2
FABIA | 0,297 | 0,3 | 0,298 | 4
XMotifs | 0,114 | 0,23 | 0,172 | 4
Plaid | 0,144 | 0,408 | 0,276 | 4
ISA | 0,544 | 0,213 | 0,379 | 10,26
Qubic | 0,181 | 0,064 | 0,123 | 11,95
Consensus | 0,557 | 0,557 | 0,557 | 4

Table 41. Numeric results for data exclusive on columns and overlapping on rows (25%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,797 | 0,801 | 0,799 | 4
Cheng-Church | 0,046 | 0,184 | 0,115 | 4
BiMax | 0,219 | 0,069 | 0,144 | 13,22
CPB | 0,977 | 0,722 | 0,849 | 5,63
FABIA | 0,331 | 0,36 | 0,346 | 4
XMotifs | 0,061 | 0,093 | 0,077 | 4
Plaid | 0,629 | 0,823 | 0,726 | 4
ISA | 0,459 | 0,188 | 0,324 | 9,92
Qubic | 0,223 | 0,041 | 0,132 | 22,5
Consensus | 0,934 | 0,934 | 0,934 | 4

Table 42. Numeric results for data exclusive on columns and overlapping on rows (25%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,439 | 0,441 | 0,44 | 4
Cheng-Church | 0,127 | 0,506 | 0,316 | 4
BiMax | 0,041 | 0,162 | 0,102 | 4
CPB | 0,964 | 0,652 | 0,808 | 6,17
FABIA | 0,198 | 0,265 | 0,231 | 4
XMotifs | 0,281 | 0,439 | 0,36 | 4
Plaid | 0,073 | 0,294 | 0,184 | 4
ISA | 0,498 | 0,476 | 0,487 | 4,24
Qubic | 0,142 | 0,026 | 0,084 | 21,93
Consensus | 0,492 | 0,492 | 0,492 | 4

Exclusive on columns and overlapping on rows (50%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 43. Numeric results for data exclusive on columns and overlapping on rows (50%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,529 | 0,529 | 0,529 | 4
Cheng-Church | 0,21 | 0,839 | 0,524 | 4
BiMax | - | - | - | -
CPB | 0,8 | 0,349 | 0,574 | 9,7
FABIA | - | - | - | -
XMotifs | 0,561 | 0,587 | 0,574 | 4
Plaid | - | - | - | -
ISA | 0,026 | 0,071 | 0,048 | 4
Qubic | 0,095 | 0,02 | 0,057 | 19,63
Consensus | 0,508 | 0,508 | 0,508 | 4

Table 44. Numeric results for data exclusive on columns and overlapping on rows (50%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,55 | 0,552 | 0,551 | 4
Cheng-Church | 0,005 | 0,022 | 0,014 | 4
BiMax | 0,72 | 0,017 | 0,369 | 165
CPB | 0,328 | 0,583 | 0,455 | 15,88
FABIA | 0,625 | 0,625 | 0,625 | 4
XMotifs | 0,388 | 0,503 | 0,446 | 4
Plaid | 0,484 | 0,76 | 0,622 | 4
ISA | 0,625 | 0,5 | 0,562 | 5
Qubic | 0,586 | 0,488 | 0,537 | 5,36
Consensus | 0,568 | 0,568 | 0,568 | 4

Table 45. Numeric results for data exclusive on columns and overlapping on rows (50%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,6 | 0,6 | 0,6 | 4
Cheng-Church | 0,02 | 0,081 | 0,05 | 4
BiMax | 0,487 | 0,014 | 0,251 | 138
CPB | 0,419 | 0,207 | 0,313 | 8,77
FABIA | 0,368 | 0,425 | 0,397 | 4
XMotifs | 0,094 | 0,197 | 0,145 | 4
Plaid | 0,623 | 0,624 | 0,623 | 4
ISA | 0,584 | 0,271 | 0,427 | 8,69
Qubic | 0,368 | 0,063 | 0,216 | 24
Consensus | 0,535 | 0,535 | 0,535 | 4

Table 46. Numeric results for data exclusive on columns and overlapping on rows (50%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,591 | 0,591 | 0,591 | 4
Cheng-Church | 0,033 | 0,133 | 0,083 | 4
BiMax | 0,118 | 0,097 | 0,108 | 4,98
CPB | 0,702 | 0,65 | 0,676 | 4,93
FABIA | 0,378 | 0,398 | 0,388 | 4
XMotifs | 0,113 | 0,24 | 0,176 | 4
Plaid | 0,174 | 0,495 | 0,335 | 4
ISA | 0,515 | 0,157 | 0,336 | 13,3
Qubic | 0,295 | 0,059 | 0,177 | 20,29
Consensus | 0,721 | 0,721 | 0,721 | 4

Table 47. Numeric results for data exclusive on columns and overlapping on rows (50%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,602 | 0,602 | 0,602 | 4
Cheng-Church | 0,036 | 0,143 | 0,089 | 4
BiMax | 0,197 | 0,05 | 0,123 | 15,91
CPB | 0,941 | 0,546 | 0,744 | 7,26
FABIA | 0,425 | 0,431 | 0,428 | 4
XMotifs | 0,068 | 0,075 | 0,071 | 4
Plaid | 0,428 | 0,55 | 0,489 | 4
ISA | 0,476 | 0,288 | 0,382 | 6,76
Qubic | 0,251 | 0,039 | 0,145 | 26,19
Consensus | 0,608 | 0,608 | 0,608 | 4

Table 48. Numeric results for data exclusive on columns and overlapping on rows (50%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,389 | 0,392 | 0,39 | 4
Cheng-Church | 0,153 | 0,612 | 0,383 | 4
BiMax | 0,044 | 0,175 | 0,109 | 4
CPB | 0,852 | 0,7 | 0,776 | 5,31
FABIA | 0,179 | 0,255 | 0,217 | 4
XMotifs | 0,02 | 0,08 | 0,05 | 4
Plaid | 0,056 | 0,224 | 0,14 | 4
ISA | 0,474 | 0,457 | 0,465 | 4,18
Qubic | 0,172 | 0,029 | 0,1 | 24,71
Consensus | 0,485 | 0,485 | 0,485 | 4

Exclusive on columns and overlapping on rows (75%)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 49. Numeric results for data exclusive on columns and overlapping on rows (75%) with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,414 | 0,415 | 0,414 | 4
Cheng-Church | 0,151 | 0,606 | 0,379 | 4
BiMax | - | - | - | -
CPB | 0,69 | 0,145 | 0,417 | 19,78
FABIA | - | - | - | -
XMotifs | 0,391 | 0,554 | 0,472 | 4
Plaid | - | - | - | -
ISA | 0,03 | 0,067 | 0,048 | 4,107
Qubic | 0,097 | 0,02 | 0,059 | 19,76
Consensus | 0,666 | 0,666 | 0,666 | 4

Table 50. Numeric results for data exclusive on columns and overlapping on rows (75%) with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,403 | 0,412 | 0,407 | 4
Cheng-Church | 0,005 | 0,021 | 0,013 | 4
BiMax | 0,58 | 0,013 | 0,296 | 184
CPB | 0,268 | 0,366 | 0,317 | 36,63
FABIA | 0,39 | 0,392 | 0,391 | 4
XMotifs | 0,325 | 0,452 | 0,388 | 4
Plaid | 0,36 | 0,47 | 0,415 | 4
ISA | 0,507 | 0,305 | 0,406 | 6,82
Qubic | 0,575 | 0,125 | 0,35 | 18,74
Consensus | 0,674 | 0,674 | 0,674 | 4

Table 51. Numeric results for data exclusive on columns and overlapping on rows (75%) with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,482 | 0,482 | 0,482 | 4
Cheng-Church | 0,02 | 0,081 | 0,051 | 4
BiMax | 0,392 | 0,01 | 0,201 | 159
CPB | 0,416 | 0,164 | 0,29 | 10,82
FABIA | 0,273 | 0,33 | 0,301 | 4
XMotifs | 0,097 | 0,202 | 0,149 | 4
Plaid | 0,55 | 0,558 | 0,554 | 4
ISA | 0,533 | 0,207 | 0,37 | 10,43
Qubic | 0,292 | 0,051 | 0,171 | 24,81
Consensus | 0,7 | 0,7 | 0,7 | 4

Table 52. Numeric results for data exclusive on columns and overlapping on rows (75%) with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,429 | 0,429 | 0,429 | 4
Cheng-Church | 0,09 | 0,358 | 0,224 | 4
BiMax | 0,114 | 0,091 | 0,103 | 5
CPB | 0,558 | 0,439 | 0,498 | 8,62
FABIA | 0,211 | 0,244 | 0,228 | 4
XMotifs | 0,073 | 0,099 | 0,086 | 4
Plaid | 0,127 | 0,398 | 0,262 | 4
ISA | 0,45 | 0,157 | 0,304 | 11,65
Qubic | 0,254 | 0,046 | 0,15 | 22,72
Consensus | 0,455 | 0,455 | 0,455 | 4

Table 53. Numeric results for data exclusive on columns and overlapping on rows (75%) with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,481 | 0,482 | 0,481 | 4
Cheng-Church | 0,069 | 0,276 | 0,172 | 4
BiMax | 0,223 | 0,029 | 0,126 | 31,78
CPB | 0,803 | 0,477 | 0,64 | 7,19
FABIA | 0,324 | 0,341 | 0,333 | 4
XMotifs | 0,086 | 0,097 | 0,092 | 4
Plaid | 0,474 | 0,582 | 0,528 | 4
ISA | 0,411 | 0,117 | 0,264 | 14,21
Qubic | 0,26 | 0,047 | 0,153 | 22,37
Consensus | 0,473 | 0,473 | 0,473 | 4

Table 54. Numeric results for data exclusive on columns and overlapping on rows (75%) with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,279 | 0,281 | 0,28 | 4
Cheng-Church | 0,123 | 0,494 | 0,309 | 4
BiMax | 0,125 | 0,126 | 0,126 | 4
CPB | 0,943 | 0,612 | 0,777 | 6,5
FABIA | 0,255 | 0,255 | 0,255 | 4
XMotifs | 0,113 | 0,224 | 0,168 | 4
Plaid | 0,072 | 0,265 | 0,168 | 4
ISA | 0,492 | 0,389 | 0,44 | 5,24
Qubic | 0,177 | 0,042 | 0,11 | 17,58
Consensus | 0,317 | 0,317 | 0,317 | 4

Overlapping on both (up to 100% of overlap)

[Chart panels (images not reproduced): Constant data, Constant data up-regulated, Plaid data, Shift-Scale data, Shift data, Scale data]

Table 55. Numeric results for data overlapping on both rows and columns with constant values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,377 | 0,545 | 0,461 | 4
Cheng-Church | 0,222 | 0,886 | 0,554 | 4
BiMax | - | - | - | -
CPB | 0,392 | 0,784 | 0,588 | 54,16
FABIA | - | - | - | -
XMotifs | 0,402 | 0,623 | 0,513 | 4
Plaid | - | - | - | -
ISA | 0,02 | 0,059 | 0,04 | 4,02
Qubic | 0,091 | 0,029 | 0,06 | 13,69
Consensus | 0,721 | 0,721 | 0,721 | 4

Table 56. Numeric results for data overlapping on both rows and columns with constant up-regulated values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,456 | 0,467 | 0,462 | 4
Cheng-Church | 0,005 | 0,022 | 0,014 | 4
BiMax | 0,694 | 0,089 | 0,391 | 31,34
CPB | 0,168 | 0,126 | 0,147 | 243,76
FABIA | 0,25 | 1 | 0,625 | 4
XMotifs | 0,08 | 0,216 | 0,148 | 4
Plaid | 0,25 | 1 | 0,625 | 4
ISA | 0,206 | 0,821 | 0,514 | 4
Qubic | 0,563 | 0,148 | 0,355 | 17,31
Consensus | 0,485 | 0,485 | 0,485 | 4

Table 57. Numeric results for data overlapping on both rows and columns with plaid values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,369 | 0,379 | 0,374 | 4
Cheng-Church | 0,017 | 0,067 | 0,042 | 4
BiMax | 0,572 | 0,093 | 0,333 | 36,19
CPB | 0,385 | 0,445 | 0,415 | 7,55
FABIA | 0,437 | 0,645 | 0,541 | 4
XMotifs | 0,037 | 0,101 | 0,069 | 4
Plaid | 0,464 | 0,472 | 0,468 | 4
ISA | 0,499 | 0,641 | 0,57 | 4
Qubic | 0,293 | 0,056 | 0,174 | 21,47
Consensus | 0,615 | 0,615 | 0,615 | 4

Table 58. Numeric results for data overlapping on both rows and columns with shift and scale values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,248 | 0,417 | 0,332 | 4
Cheng-Church | 0,063 | 0,25 | 0,157 | 4
BiMax | 0,09 | 0,181 | 0,136 | 4
CPB | 0,241 | 0,627 | 0,434 | 63,2
FABIA | 0,106 | 0,422 | 0,264 | 4
XMotifs | 0,115 | 0,201 | 0,158 | 4
Plaid | 0,195 | 0,411 | 0,303 | 4
ISA | 0,15 | 0,467 | 0,308 | 4,01
Qubic | 0,215 | 0,061 | 0,138 | 14,39
Consensus | 0,496 | 0,496 | 0,496 | 4

Table 59. Numeric results for data overlapping on both rows and columns with shift values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,248 | 0,483 | 0,365 | 4
Cheng-Church | 0,079 | 0,314 | 0,196 | 4
BiMax | 0,202 | 0,167 | 0,184 | 4,97
CPB | 0,404 | 0,861 | 0,633 | 4,01
FABIA | 0,107 | 0,429 | 0,268 | 4
XMotifs | 0,049 | 0,049 | 0,049 | 4
Plaid | 0,163 | 0,65 | 0,406 | 4
ISA | 0,213 | 0,358 | 0,285 | 4,02
Qubic | 0,364 | 0,068 | 0,216 | 22,03
Consensus | 0,276 | 0,276 | 0,276 | 4

Table 60. Numeric results for data overlapping on both rows and columns with scaled values.

Method name | Recovery | Relevance | Score | Average num. of bi-clusters
BBC | 0,204 | 0,331 | 0,267 | 4
Cheng-Church | 0,165 | 0,659 | 0,412 | 4
BiMax | 0,056 | 0,225 | 0,141 | 4
CPB | 0,321 | 0,637 | 0,479 | 18,85
FABIA | 0,073 | 0,293 | 0,183 | 4
XMotifs | 0,195 | 0,226 | 0,211 | 4
Plaid | 0,081 | 0,325 | 0,203 | 4
ISA | 0,131 | 0,458 | 0,294 | 4
Qubic | 0,25 | 0,057 | 0,153 | 18,21
Consensus | 0,534 | 0,534 | 0,534 | 4
