Practical Approximation Algorithms for Clustering and Covering
A dissertation presented by Jessica May McClintock to The School of Engineering in total fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computing and Information Systems

Student Number: 268135
The University of Melbourne
Melbourne, Australia
June 2017

Thesis advisor: Associate Professor Tony Wirth

Abstract

The aim of this thesis is to provide solutions to computationally difficult network optimisation problems on modern architectures. There are three types of problem that we will be studying – clustering, covering and ordering – all of which are computationally hard to solve, and in practice often involve very large data sets. These models are used in a range of real-world applications, and therefore we will investigate both the practical and theoretical aspects of solving these problems in big data contexts. The main approach we consider for solving such instances is to obtain polynomial-time approximation algorithms, which efficiently solve these problems to within some constant factor of optimality. We also introduce several heuristics for a type of scheduling problem with graph-based constraints and demonstrate their performance in a practical setting, before providing an approximation algorithm and hardness results for a formalised version of the problem. For instances on big data, where the computational bottleneck is the available RAM, we consider models for algorithm design that would allow such instances to be solved. For this purpose, we also design clustering algorithms using the MapReduce paradigm for parallelisation, giving experimental evaluations of their performance in comparison to existing approaches.

Contents
0 Introduction
0.1 Clustering and Parallel Algorithms
0.2 Graph Covering Problems
0.3 Test-case Prioritisation with Precedences
0.4 Min-Sum Set Cover
1 Related Work
1.1 Optimisation and Approximation
1.2 Graph Theory
1.3 Parallelism and Clustering Problems
1.4 Graph Covering Problems
1.5 The Min-Sum Set Cover Problem
1.6 Scheduling and Prioritisation with Precedences
1.7 Summary
2 Efficient Clustering with MapReduce
2.1 Clustering with MapReduce
2.2 Parallel k-centre
2.3 Analysis of EIM sampling
2.4 Runtime
2.5 Experiments
2.6 Results
2.7 Conclusions
3 Parallel Coverings of Graphs
3.1 MapReduce for Covering Problems
3.2 Tree Covering Problems
3.3 The Minimum Path Cover Problem
3.4 The k-star Covering Problem
3.5 Hardness of k-star Covering
4 Scheduling with Precedences
4.1 Test-case Prioritisation with Precedences
4.2 Our Contributions
4.3 Algorithms and Analysis
4.4 Special Cases
4.5 Experiments
4.6 Results
4.7 Conclusions
5 Min-Sum Set-Cover with Precedences
5.1 Our Contributions
5.2 Max-Density Precedence-Closed Subgraphs
5.3 An Algorithm for precMSSC
5.4 Hardness Results
5.5 Conclusions
6 Summary
References

List of Figures
1.1 Gonzalez k-centre solution for k = 3
1.2 An example rooted k-tree covering
1.3 An example rooted k-path covering
1.4 An example clustering solution
1.5 An example k-star covering
1.6 A bad assignment for k-star
1.7 An improved assignment for k-star
2.1 Flowchart for MRG
2.2 A bad assignment and seeding for MRG
2.3 A bad sample for MRG
2.4 An example tight solution for MRG
2.5 A point being satisfied by the sample
2.6 Satisfied and unsatisfied points
2.7 Comparison of performance over k on real datasets
2.8 Runtimes over k
2.9 Average solution value over k on synthetic data
2.10 Runtime for k = 25 over a range of n
2.11 Runtime over k on synthetic data
3.1 Star covering on a line metric
4.1 Fault detection histogram for default ordering
4.2 Fault detection histogram for fault-coverage based ordering
4.3 Fault detection histogram for code-coverage based ordering
4.4 An example precedence graph
4.5 Fault detection histogram for precedence-constrained ordering
4.6 Combined APFD and AF results for real and synthetic dependencies
4.7 APFD scores for real data sets
4.8 AF scores for real data sets
4.9 APFD and AF for data sets with synthetic dependencies
4.10 APFD for Siemens with synthetic dependencies
4.11 AF for Siemens with synthetic dependencies
4.12 Lookahead comparison for gsm2
4.13 Lookahead comparison for tot_info
4.14 Lookahead comparison for replace
4.15 Lookahead comparison for print_tokens2
5.1 A bad precedence graph when d < 1
5.2 An example in-tree
5.3 An example out-tree
5.4 Histograms for greedy and OPT

List of Tables
1.1 Current best-known solutions
1.2 Our contributions
2.1 Comparison of algorithms for k-centre
2.2 Average solution value over k on Gau (n = 1,000,000, k0 = 25)
2.3 Average solution value over k on Unif (n = 100,000)
2.4 Average solution value over k on UnB (n = 200,000, k0 = 25)
2.5 Average solution over f
2.6 Average runtime over f
4.1 An example defect-coverage matrix
4.2 An example code-coverage matrix
4.3 Metrics for real systems
4.4 Metrics for systems with synthetic precedences

List of Algorithms
1 MapReduce-MST(V, E)
2 GON(V, k)
3 k-means(V, k)
4 Greedy(S, U, w)
5 MRG(V, k, m)
6 EIM-MapReduce-Sample(V, E, k, #)
7 Select(H, S)
8 k-Tree-Cover(G, k, B)
9 Min-Path-Cover(G, B)
10 Minimum Star Cover
11 Topological-sort based greedy
12 Coverage-first greedy
13 greedy-subgraph(P = (S, E), U)
14 greedy-subgraph+(P = (S, E), U)
15 subgraph-outtree(T = (S, E), v, Rv)
16 mssc-greedy(S, P, U)

Acknowledgments

Many thanks to Tony Wirth for these years of advice and assistance as my supervisor – without his support and encouragement I certainly would not be where I am today. His attention to detail and discerning advice have provided many lessons that will be long remembered and appreciated. Thanks to Tim Miller for introducing me to the fascinating problem of scheduling with precedences, and providing datasets for the experiments. Thanks to Julian Mestre for providing direction and insight on the min-sum set cover problem. And thanks to Andrew Turpin for providing diversions along the way.

Many thanks to my family and friends for their support and tolerance along this journey, particularly to Robert Marshall for proofreading and encouragement, and to Helen and Diana for providing reassurance and distraction. And finally a special thank you to my mother, Karen McClintock, who has been endlessly supportive even in times of uncertainty.

Declaration

This is to certify that:
1. The thesis comprises only their original work towards the PhD except where indicated in the preface;
2. Due acknowledgement has been made in the text to all other material used;
3. The thesis is fewer than the maximum word limit in length, exclusive of tables, maps, bibliographies and appendices.

Signed, Jessica McClintock
Date

Preface

The original work in this thesis is related to the following papers, one of which has been peer reviewed and appeared in conference proceedings.
• McClintock, J., and Wirth, A., Efficient Parallel Algorithms