A Multifaceted Approach to Improving the Practicality of Structural Graph Algorithms
ABSTRACT

O’BRIEN, MICHAEL PATRICK. A Multifaceted Approach to Improving the Practicality of Structural Graph Algorithms. (Under the direction of Blair D. Sullivan.)

Graph algorithms have become an integral part of modern data analytics, but existing approaches have struggled to scale to increasing network sizes. The theoretical computer science community has a rich history of research that circumvents these scalability issues through algorithms that exploit the structural sparsity of graphs. Because many real-world networks from diverse domains are known to share properties like sparsity, clustering, and heavy-tailed degree distributions, structure-based algorithms appear on the surface to be an attractive alternative. However, they come with their own set of problems, such as non-constructive proofs, massive constants hidden in big-O notation, and attention paid to exploiting structures that are unlikely to occur in real data. Consequently, there is a large gap between the most theoretically efficient algorithms and the most practical ones. This dissertation focuses on alleviating practical barriers to the use of structural graph algorithms in large-scale data analytics, addressing problems on multiple different fronts. First, we show that some structural features can still be identified even in the presence of missing or unknown data. A second contribution is CONCUSS, a first-of-its-kind implementation of a subgraph isomorphism counting algorithm that exploits the bounded expansion structure of graph classes. Through a thorough experimental evaluation of CONCUSS, we establish conditions under which it is competitive with existing algorithms and highlight ways to enhance the general algorithmic framework.
As extensions to this framework, we introduce p-linear colorings, an alternative graph coloring used to identify bounded expansion structure, as well as practical dynamic programming algorithms to perform a variant of local search in fixed-parameter tractable time in graph classes with bounded expansion. Despite these successes, we prove that recognizing classes of bounded expansion by measuring the density of shallow topological minors is likely not possible in polynomial time. Finally, to demonstrate a concrete way in which structure-based algorithms enable better analysis in scientific domains, we build a framework for extracting neighborhoods from metagenomic sequencing data in an interdisciplinary collaboration with computational biologists. Overall, this work illustrates multiple effective strategies for bridging the gap between efficient and practical algorithms.

© Copyright 2018 by Michael Patrick O’Brien

All Rights Reserved

A Multifaceted Approach to Improving the Practicality of Structural Graph Algorithms

by Michael Patrick O’Brien

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Science

Raleigh, North Carolina 2018

APPROVED BY:

Carla D. Savage
Matthias F. Stallmann
Steffen Heber
Blair D. Sullivan, Chair of Advisory Committee

DEDICATION

To Dr. Mary Jane Cooper O’Brien, who received her doctoral degree at age 71.

BIOGRAPHY

Michael P. O’Brien grew up in Strongsville, OH. Before coming to North Carolina State University, he received a Bachelor’s degree in Computer Science with a minor in Chinese from the University of Notre Dame.

ACKNOWLEDGEMENTS

Many thanks to Dr. Felix Reidl for laying the groundwork for this dissertation, Dr. C. Titus Brown for time spent devoted to a student outside his lab, Dr. Blair D.
Sullivan for advising and encouraging this work, and Shannon Warchol O’Brien for making the whole experience much easier and more enjoyable.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  1.1 Motivation
  1.2 Overarching Challenges
  1.3 Research Questions and Outline

Chapter 2 Background
  2.1 Graph Theory
  2.2 Sparse Graph Hierarchy
    2.2.1 Treedepth
    2.2.2 Excluded Minors
    2.2.3 Bounded Expansion
    2.2.4 Nowhere/Somewhere Density
    2.2.5 Degeneracy and Core Decompositions
  2.3 Random Graph Models
    2.3.1 Asymptotic Properties
    2.3.2 RGMs with Bounded Expansion
  2.4 Parameterized Complexity
    2.4.1 Relationship to Graph Structure
    2.4.2 Parameterized Lower Bounds

Chapter 3 Locally Estimating Core Numbers
  3.1 Introduction
  3.2 Related Work
  3.3 Local Estimation
    3.3.1 Neighborhood-based Estimation
    3.3.2 Structures Leading to Error
    3.3.3 Expected Behavior on Random Graphs
  3.4 Experimental Results
    3.4.1 Methods
    3.4.2 Results
  3.5 Network Treatment
    3.5.1 Problem Statement
    3.5.2 Estimating k-core Exposure Probabilities
  3.6 Conclusion

Chapter 4 CONCUSS
  4.1 Introduction
  4.2 Algorithmic Landscape for Classes of Bounded Expansion
  4.3 Subgraph Isomorphism Counting
    4.3.1 Motif Counting
    4.3.2 Graphlet Degree Distribution
    4.3.3 Existing Algorithms
  4.4 CONCUSS
    4.4.1 Color
    4.4.2 Decompose
    4.4.3 Compute
    4.4.4 Combine
    4.4.5 Extensions to other problems
  4.5 Experimental Design
    4.5.1 Data
    4.5.2 Hardware
  4.6 Competitive Evaluation
    4.6.1 Configuration Testing
    4.6.2 Comparison with NXVF2
  4.7 Bottleneck Identification
    4.7.1 Color Class Distribution
    4.7.2 Color Set Treedepth
  4.8 Conclusion

Chapter 5 Linear Colorings
  5.1 Introduction
  5.2 p-Linear and Linear Colorings
  5.3 Lower Bounds
  5.4 Upper Bounds on Trees
  5.5 Upper Bounds on Interval Graphs
  5.6 Hardness of Recognizing Linear Colorings
  5.7 Conclusion

Chapter 6 Identifying Dense Substructures
  6.1 Introduction
  6.2 Background
  6.3 Algorithmic Considerations
  6.4 NP-Hardness
  6.5 ETH Lower Bounds
  6.6 Conclusion

Chapter 7 Local Search
  7.1 Introduction
  7.2 Background
    7.2.1 Notation
    7.2.2 Problem Definitions
  7.3 Algorithmic Strategy
  7.4 Vertex Cover
    7.4.1 Algorithm Description
    7.4.2 Correctness
    7.4.3 Weighted Variant
  7.5 Maximal Matching
    7.5.1 Algorithm Description
    7.5.2 Correctness
    7.5.3 Weighted Variant
  7.6 Conclusion

Chapter 8 Interdisciplinary Applications
  8.1 Introduction
  8.2 Background
    8.2.1 De Bruijn Graphs
    8.2.2 Data
    8.2.3 Evaluation Metrics
  8.3 Neighborhood Indexing
    8.3.1 Algorithmic Description
    8.3.2 Biological Impact
  8.4 The CAtlas