COMBINATORIAL OPTIMIZATION TECHNIQUES in DATA MINING by Stanislav Busygin August 2007 Chair: Panos M
COMBINATORIAL OPTIMIZATION TECHNIQUES IN DATA MINING

By STANISLAV BUSYGIN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2007

© 2007 Stanislav Busygin

To all the people of goodwill who helped me along the path

ACKNOWLEDGMENTS

First of all, I would like to express my gratitude to Dr. Panos M. Pardalos for his support and guidance during my PhD studies at the University of Florida. I am grateful to the members of my supervisory committee Dr. Stan Uryasev, Dr. Joseph Geunes and Dr. William Hager for their time and good judgement. I am also very grateful to my collaborators and friends Dr. Sergiy Butenko, Dr. Vladimir Boginski, Dr. Artyom Nahapetyan, and Dr. Oleg Prokopyev for their valuable contributions to our joint research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 General Overview
  1.2 Data Mining Problems and Optimization

2 BICLUSTERING IN DATA MINING
  2.1 The Main Concept
  2.2 Formal Setup
  2.3 Visualization of Biclustering
  2.4 Relation to SVD
  2.5 Methods
    2.5.1 “Direct Clustering”
    2.5.2 Node-Deletion Algorithm
    2.5.3 FLOC Algorithm
    2.5.4 Biclustering via Spectral Bipartite Graph Partitioning
    2.5.5 Matrix Iteration Algorithms for Minimizing Sum-Squared Residue
    2.5.6 Double Conjugated Clustering
    2.5.7 Information-Theoretic Based Co-Clustering
    2.5.8 Biclustering via Gibbs Sampling
    2.5.9 Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
    2.5.10 Coupled Two-way Clustering
    2.5.11 Plaid Models
    2.5.12 Order-Preserving Submatrix (OPSM) Problem
    2.5.13 OP-Cluster
    2.5.14 Supervised Classification via Maximal δ-valid Patterns
    2.5.15 CMonkey
  2.6 Discussion and Concluding Remarks

3 CONSISTENT BICLUSTERING VIA FRACTIONAL 0–1 PROGRAMMING
  3.1 Consistent Biclustering
  3.2 Supervised Biclustering
  3.3 Fractional 0–1 Programming
  3.4 Algorithm for Biclustering
  3.5 Computational Results
    3.5.1 ALL vs. AML data set
    3.5.2 HuGE Index data set
  3.6 Conclusions and Future Research

4 AN OPTIMIZATION-BASED APPROACH FOR DATA CLASSIFICATION
  4.1 Basic Definitions
  4.2 Optimization Formulation and Classification Algorithm
  4.3 Computational Experiments
    4.3.1 ALL vs. AML Data Set
    4.3.2 Colon Cancer Data Set
  4.4 Conclusions

5 GRAPH MODELS IN DATA ANALYSIS
  5.1 Cluster Cores Based Clustering
  5.2 Decision-Making under Constraints of Conflicts
  5.3 Conclusions

6 A NEW TRUST REGION TECHNIQUE FOR THE MAXIMUM WEIGHT CLIQUE PROBLEM
  6.1 Introduction
  6.2 The Motzkin–Straus Theorem for Maximum Clique and Its Generalization
  6.3 The Trust Region Problem
  6.4 The QUALEX-MS Algorithm
  6.5 Computational Experiment Results
  6.6 Remarks and Conclusions

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 HuGE index biclustering
6-1 DIMACS maximum clique benchmark results
6-2 Performance of QUALEX-MS vs. PBH on random weighted graphs

LIST OF FIGURES

2-1 Partitioning of samples and features into 3 clusters
2-2 Coclus H1 algorithm
2-3 Coclus H2 algorithm
2-4 Gibbs biclustering algorithm
3-1 Feature selection heuristic
3-2 ALL vs. AML heatmap
3-3 HuGE index heatmap
4-1 Data classification algorithm
5-1 Cluster cores based clustering algorithm
5-2 Example of two CaRTs for a database
6-1 New-best-in weighted heuristic
6-2 NBIW-based graph preprocess algorithm
6-3 Meta-NBIW algorithm
6-4 QUALEX-MS algorithm
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

COMBINATORIAL OPTIMIZATION TECHNIQUES IN DATA MINING

By Stanislav Busygin

August 2007

Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering

My research analyzes the role of combinatorial optimization in data mining research and proposes a collection of new, practically efficient data mining techniques based on combinatorial optimization algorithms. The data mining problems addressed include supervised clustering (classification), biclustering, dimensionality reduction, and outlier detection.

The recent advances and trends in biclustering are surveyed, and the major challenges for its further development are outlined. As with many other data mining methodologies, one of these challenges is the lack of mathematical justification for the significance of purported results. I address this issue by developing the notion of consistent biclustering. The significance of consistent biclustering is mathematically justified by the conic separation theorem, which establishes simultaneous delineation of both sample and attribute classes by convex cones. This required property of the obtained biclustering serves as a powerful tool for selecting those attributes of the data that are relevant to a particular studied phenomenon. As an example of such an application, several well-known DNA microarray data sets are considered, together with the consistent biclustering results obtained for them.
To further advance the application of mathematically well-justified optimization methods to major data mining problems, I developed a new optimization-based data classification framework. It relies upon the same criteria of class separation that serve as the objectives in unsupervised clustering methods, but uses them instead as constraints on feature selection based upon the available training set of samples. The reliability and robustness of the methodology are also empirically confirmed by computational experiments on DNA microarray data.

Next, I discuss the prominent role of graph models in data analysis, with an emphasis on data analysis applications of the maximum clique/independent set problem. The great variety of real-world problems that can be tackled with graph-based models is surveyed, along with the employed methodologies of information retrieval.

Finally, I present a practically efficient maximum clique heuristic, QUALEX-MS. It utilizes a new simple generalization of the Motzkin–Straus theorem for the maximum weight clique problem. This generalization, which is a significant theoretical result in itself, maximally preserves the form of the original Motzkin–Straus formulation and is proved directly, without the use of mathematical induction. QUALEX-MS employs a new trust region heuristic based upon this new quadratic programming formulation. In contrast to usual trust region methods, it takes into account not only the global optimum of a quadratic objective over a sphere, but also a set of other stationary points. The developed method has complexity O(n^3), where n is the number of vertices of the graph. Computational experiments indicate that QUALEX-MS is exact on small graphs and very efficient on the DIMACS benchmark graphs and various random maximum weight clique problem instances. QUALEX-MS was utilized for optimization of classification and regression trees of databases.
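As background for the Motzkin–Straus formulation mentioned above, the following minimal sketch illustrates the classical, unweighted statement of the theorem: for a graph with adjacency matrix A and clique number ω, the maximum of x^T A x over the standard simplex (x ≥ 0, sum of x equal to 1) is 1 − 1/ω, attained at the uniform vector on a maximum clique. It does not reproduce the weighted generalization or the QUALEX-MS heuristic developed in the dissertation. The example graph, the function names, and the brute-force clique search are all assumptions introduced here for illustration only.

```python
import itertools
import random


def build_adjacency(n, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def quadratic_form(adj, x):
    """Evaluate x^T A x for adjacency matrix A given as nested lists."""
    n = len(x)
    return sum(adj[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def max_clique_size(adj):
    """Brute-force clique number; only sensible for very small graphs."""
    n = len(adj)
    for k in range(n, 0, -1):
        for subset in itertools.combinations(range(n), k):
            if all(adj[i][j] for i, j in itertools.combinations(subset, 2)):
                return k
    return 0


def clique_point(n, clique):
    """Uniform weight 1/k on the k clique vertices, zero elsewhere."""
    x = [0.0] * n
    for v in clique:
        x[v] = 1.0 / len(clique)
    return x


def random_simplex_point(n, rng):
    """Random point of the standard simplex via normalized exponentials."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]


# Hypothetical toy graph: a triangle {0, 1, 2} with a path 2-3-4 attached.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
ADJ = build_adjacency(5, EDGES)
OMEGA = max_clique_size(ADJ)

# Value of the quadratic form at the uniform vector on the triangle.
TRIANGLE_VALUE = quadratic_form(ADJ, clique_point(5, [0, 1, 2]))
# Bound predicted by the classical Motzkin-Straus theorem.
MS_BOUND = 1.0 - 1.0 / OMEGA
```

In this sketch, `TRIANGLE_VALUE` should agree with `MS_BOUND` up to rounding, and sampling points with `random_simplex_point` should never yield a value of `quadratic_form` above the bound. Brute-force enumeration is used only to check the identity on a toy instance; it is exponential in the number of vertices and is not how a practical heuristic would operate.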
CHAPTER 1
INTRODUCTION

1.1 General Overview

Due to recent technological advances in such areas as IT and biomedicine, researchers face ever-increasing challenges in extracting relevant information from the enormous volumes of available data. The so-called data avalanche arises because there is no concise set of parameters that can fully describe the state of the complex real-world systems studied nowadays by biologists, ecologists, sociologists, economists, etc. On the other hand,