Optimization Methods for Cluster Analysis in Network-Based Data Mining

OPTIMIZATION METHODS FOR CLUSTER ANALYSIS IN NETWORK-BASED DATA MINING A Dissertation by SEYEDMOHAMMADHOSSEIN HOSSEINIAN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Chair of Committee, Sergiy Butenko Committee Members, Lewis Ntaimo Kiavash Kianfar Jianer Chen Head of Department, Lewis Ntaimo May 2021 Major Subject: Industrial Engineering Copyright 2021 Seyedmohammadhossein Hosseinian ABSTRACT This dissertation focuses on two optimization problems that arise in network-based data mining, concerning identification of basic community structures (clusters) in graphs: the maximum edge weight clique and maximum induced cluster subgraph problems. We propose a continuous quadratic formulation for the maximum edge weight clique problem, and establish the correspon- dence between its local optima and maximal cliques in the graph. Subsequently, we present a combinatorial branch-and-bound algorithm for this problem that takes advantage of a polynomial- time solvable nonconvex relaxation of the proposed formulation. We also introduce a linear-time- computable analytic upper bound on the clique number of a graph, as well as a new method of upper-bounding the maximum edge weight clique problem, which leads to another exact algorithm for this problem. For the maximum induced cluster subgraph problem, we present the results of a comprehensive polyhedral analysis. We derive several families of facet-defining valid inequalities for the IUC polytope associated with a graph. We also provide a complete description of this polytope for some special classes of graphs. We establish computational complexity of the separation problems for most of the considered families of valid inequalities, and explore the effectiveness of employing the corresponding cutting planes in an integer (linear) programming framework for the maximum induced cluster subgraph problem. ii DEDICATION To my parents. iii ACKNOWLEDGMENTS I am extremely grateful to my advisor, Dr. Sergiy Butenko. I learned a lot from Sergiy and enjoyed every day of working with him. I am also very thankful to my advisory committee members, Drs. Lewis Ntaimo, Kiavash Kianfar, and Jianer Chen, and the Associate Department Head for Graduate Affairs, Dr. Alfredo Garcia, for their time and support. I would also like to pay my gratitude to Drs. Kenneth Reinschmidt, recently deceased, and Warren D. Reece, for their encouragement and advice when I was changing my academic path. Finally, my highest appreciation goes to my wife, Mona, without whose understanding and love completing this work would have been impossible. iv CONTRIBUTORS AND FUNDING SOURCES Contributors This work was supported by a dissertation committee consisting of Professors Sergiy Butenko (advisor), Lewis Ntaimo, and Kiavash Kianfar of the Department of Industrial and Systems Engi- neering, and Professor Jianer Chen of the Department of Computer Science and Engineering. All work conducted for the dissertation was completed by the student independently. Funding Sources Graduate study was supported by graduate assistantship from the Department of Industrial and Systems Engineering of Texas A&M University and Dr. Sergiy Butenko, and a graduate teaching fellowship from the College of Engineering of Texas A&M University. v TABLE OF CONTENTS Page ABSTRACT ......................................................................................... ii DEDICATION....................................................................................... iii ACKNOWLEDGMENTS .......................................................................... iv CONTRIBUTORS AND FUNDING SOURCES ................................................. v TABLE OF CONTENTS ........................................................................... vi LIST OF FIGURES ................................................................................. viii LIST OF TABLES................................................................................... ix 1. INTRODUCTION............................................................................... 1 1.1 Summary of Contributions................................................................ 2 2. A NONCONVEX QUADRATIC OPTIMIZATION APPROACH TO THE MAXIMUM EDGE WEIGHT CLIQUE PROBLEM ........................................................ 4 2.1 Introduction................................................................................ 4 2.2 Quadratic Programming Formulation for the MEWC Problem......................... 6 2.2.1 First order necessary conditions (FONC) ....................................... 7 2.2.2 Second order necessary conditions (SONC) .................................... 8 2.2.3 Second order sufficient conditions (SOSC) ..................................... 8 2.3 Optimality Characterizations ............................................................. 8 2.4 Solving the MEWC Problem ............................................................. 13 2.4.1 Construction heuristic ............................................................ 13 2.4.2 Quadratic relaxation bound ...................................................... 16 2.4.3 Combinatorial branch-and-bound procedure .................................... 20 2.5 Computational Experiments .............................................................. 25 3. A LAGRANGIAN BOUND ON THE CLIQUE NUMBER AND AN EXACT ALGO- RITHM FOR THE MAXIMUM EDGE WEIGHT CLIQUE PROBLEM ................... 32 3.1 Introduction................................................................................ 32 3.2 A Lagrangian Relaxation Bound on the Clique Number ................................ 34 3.3 An Exact Solution Method for the MEWC Problem .................................... 38 3.3.1 Upper-bounding method ......................................................... 39 vi 3.3.2 Algorithm ......................................................................... 42 3.4 Computational Experiments .............................................................. 48 4. POLYHEDRAL PROPERTIES OF THE INDUCED CLUSTER SUBGRAPHS ........... 57 4.1 Introduction................................................................................ 57 4.1.1 Terminology and notation ........................................................ 60 4.2 The IUC Polytope ......................................................................... 61 4.3 Facet-producing Structures ............................................................... 64 4.3.1 Chordless cycle and its complement............................................. 65 4.3.2 Star and double-star .............................................................. 76 4.3.3 Fan and wheel .................................................................... 89 4.4 The Separation Problems ................................................................. 97 4.5 Computational Experiments .............................................................. 101 5. SUMMARY AND CONCLUSIONS........................................................... 110 REFERENCES ...................................................................................... 114 APPENDIX A. THE UNIVARIATE LAGRANGIAN RELAXATION PROBLEM............ 128 APPENDIX B. SUPPLEMENTARY ALGORITHMS ........................................... 130 vii LIST OF FIGURES FIGURE Page 3.1 Example graph for the analytic bound on the clique number............................ 38 4.1 Subgraph induced by a hole H, jHj = 3q + 1........................................... 67 4.2 Subgraph induced by an anti-hole. ....................................................... 71 4.3 Example graphs for lifting the hole inequality. .......................................... 72 viii LIST OF TABLES TABLE Page 2.1 Characteristics of the test instances (CBQ algorithm). .................................. 26 2.2 Solution time............................................................................... 27 2.3 Heuristic results. .......................................................................... 29 2.4 Quality of the quadratic relaxation bound. ............................................... 31 3.1 Characteristics of the test instances (combinatorial algorithm). ........................ 49 3.2 Analytic and coloring-based bounds on the clique number. ............................ 50 3.3 CPU times (in milliseconds) for the considered bounds on the clique number......... 51 3.4 Solution time for the MEWC problem. .................................................. 53 3.5 Size of the search tree. .................................................................... 56 4.1 First set of experiments: characteristics of the test instances............................ 102 4.2 First set of experiments: improvement (in solution time) for each family of the valid inequalities........................................................................... 104 4.3 First set of experiments: results of incorporating different combinations of valid inequalities. ................................................................................ 106 4.4 Second set of experiments: cut-generation characteristics. ............................. 107 4.5 Second set of experiments: integer (linear) programming solution results. ............ 109 ix 1. INTRODUCTION Graph theory proves to be a powerful tool in analysis of complex systems with numerous applica- tions in science and engineering. Many real-life systems can be described as a set of components that

Load more