Algorithmic Skeletons for Exact Combinatorial Search at Scale

Archibald, Blair (2018) Algorithmic skeletons for exact combinatorial search at scale. PhD thesis. https://theses.gla.ac.uk/31000/ Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] ALGORITHMIC SKELETONS FOR EXACT COMBINATORIAL SEARCH AT SCALE BLAIR ARCHIBALD SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy SCHOOL OF COMPUTING SCIENCE COLLEGE OF SCIENCE AND ENGINEERING UNIVERSITY OF GLASGOW NOVEMBER 4, 2018 © BLAIR ARCHIBALD Abstract Exact combinatorial search is essential to a wide range of application areas including constraint optimisation, graph matching, and computer algebra. Solutions to combinatorial problems are found by systematically exploring a search space, either to enumerate solutions, determine if a specific solution exists, or to find an optimal solution. Combinatorial searches are computationally hard both in theory and practice, and efficiently exploring the huge number of combinations is a real challenge, often addressed using approximate search algorithms. Alternatively, exact search can be parallelised to reduce execution time. However, parallel search is challenging due to both highly irregular search trees and sensitivity to search order, leading to anomalies that can cause unexpected speedups and slowdowns. As core counts continue to grow, parallel search becomes increasingly useful for improving the performance of existing searches, and allowing larger instances to be solved. A high-level approach to parallel search allows non-expert users to benefit from increasing core counts. Algorithmic Skeletons provide reusable implementations of common parallelism patterns that are parameterised with user code which determines the specific computation, e.g. a particular search. We define a set of skeletons for exact search, requiring the user to provide in the minimal case a single class that specifies how the search tree is generated and a parameter that specifies the type of search required. The five are: Sequential search; three general-purpose parallel search methods: Depth-Bounded, Stack-Stealing, and Budget; and a specific parallel search method, Ordered, that guarantees replicable performance. We implement and evaluate the skeletons in a new C++ parallel search framework, YewPar. YewPar provides both high-level skeletons and low-level search specific schedulers and utilities to deal with the irregularity of search and knowledge exchange between workers. YewPar is based on the HPX library for distributed task-parallelism potentially allowing search to execute on multi-cores, clusters, cloud, and high performance computing systems. Underpinning the skeleton design is a novel formal model, MT 3, a parallel operational semantics that describes multi-threaded tree traversals, allowing reasoning about parallel search, e.g. describing common parallel search phenomena such as performance anomalies. YewPar is evaluated using seven different search applications (and over 25 specific instances): Maximum Clique, k-Clique, Subgraph Isomorphism, Travelling Salesperson, Binary Knap- sack, Enumerating Numerical Semigroups, and the Unbalanced Tree Search Benchmark. The search instances are evaluated at multiple scales from 1 to 255 workers, on a 17 host, 272 core Beowulf cluster. The overheads of the skeletons are low, with a mean 6.1% slowdown compared to hand-coded sequential implementation. Crucially, for all search applications YewPar reduces search times by an order of magnitude, i.e hours/minutes to minutes/seconds, and we commonly see greater than 60% (average) parallel efficiency speedups for up to 255 workers. Comparing skeleton performance reveals that no one skeleton is best for all searches, highlighting a benefit of a skeleton approach that allows multiple parallelisations to be explored with minimal refactoring. The Ordered skeleton avoids slowdown anomalies where, due to search knowledge being order dependent, a parallel search takes longer than a sequential search. Analysis of Ordered shows that, while being 41% slower on average (73% worse-case) than Depth-Bounded, in nearly all cases it maintains the following replicable performance properties: 1) parallel executions are no slower than one worker sequential executions 2) runtimes do not increase as workers are added, and 3) variance between repeated runs is low. In particular, where Ordered maintains a relative standard deviation (RSD) of less than 15%, Depth-Bounded suffers from an RSD greater than 50%, showing the importance of carefully controlling search orders for repeatability. Acknowledgements Utmost thanks goes to Phil Trinder, Patrick Maier, and Rob Stewart, for their unfaltering encouragement, guidance and ideas. All have been simultaneously supervisors, colleagues, mentors and friends. This thesis would not have been possible without their support. Thank you to everyone in the School of Computing science at Glasgow, my home for the last seven years. Special thanks to Jeremy Singer, for kindling my interest in research, and Ciaran McCreesh for sending me down the rabbit-hole of parallel search problems and for feedback on this thesis. Also to Stephen McQuistin for feedback on an earlier draft. To all those at EPCC, especially Michele Weiland and Nick Johnson, thank you for hosting me during my internship. I learned a great deal about high performance computing and, of course, cryptic crosswords! Thanks to the Software Sustainability Institute who taught me the wider importance of computing and ensured I didn’t miss the forest for the (search) trees. I am especially grateful to my parents, family, friends, and Jena who have formed the best support network I could ask for. Thank you. Finally, thanks to Chris Jefferson, David Manlove, and Michel Steuwer for making the viva an enjoyable experience. This work was funded from EPSRC grants: EP/K503058/1, EP/L50497X/1 and EP/M508056/1. Table of Contents 1 Introduction 1 1.1 Contributions . .2 1.2 Publications and Authorship . .5 2 Background 7 2.1 Combinatorial Search . .8 2.1.1 General Combinatorial Search . .9 2.1.2 Backtracking Search . 11 2.1.3 Types of Search . 12 2.1.4 Search Ordering and Heuristics . 12 2.2 Parallelism . 15 2.2.1 Parallel Architectures . 15 2.2.2 Parallelism Models . 16 2.2.3 Load-Balancing . 18 2.2.4 High-Level Approaches to Parallelism . 20 2.2.5 Parallel Performance Metrics . 21 2.3 Parallel Combinatorial Search . 22 2.3.1 Methods of Parallelising Combinatorial Search . 22 2.3.2 Challenges of Space-Splitting Search . 23 2.4 Existing Parallel Space-Splitting Searches . 25 2.4.1 Static Approaches . 26 2.4.2 Dynamic Approaches . 29 2.4.3 Scalability of Existing Approaches . 36 2.4.4 The Need For A New Search Skeleton Framework . 36 3 Formalising Parallel Tree Search 41 3.1 Modelling Search Trees . 42 3.1.1 Ordered Trees . 44 3.2 Example: Tree Generators for the Travelling Salesperson Problem . 45 3.3 Semantics of Parallel Tree Search . 47 3.3.1 Creating an MT 3 Model . 48 3.3.2 Rule Ordering . 49 3.3.3 Search State . 49 3.3.4 Traversal Rules . 50 3.3.5 Example Reductions for TSP . 50 3.3.6 Node Processing Rules . 51 3.3.7 Pruning Rules . 54 3.3.8 Spawn Rules . 57 3.4 Lazy Node Generators: A Functional Interface for Tree Search . 57 3.4.1 Node Generators . 58 3.4.2 Node Generators as a Programming Interface . 58 3.4.3 Lazy Node Generators . 61 3.5 Summary . 62 4 Search Skeletons 65 4.1 Skeletons as a High-Level Parallelism Approach . 66 4.1.1 Creating Search Skeletons . 67 4.2 An Abstract Parallel Framework for Tree Search . 67 4.3 Search Coordination Methods . 70 4.3.1 Design Choices . 71 4.3.2 Pseudocode . 72 4.3.3 Sequential Search Coordination . 72 4.3.4 Depth-Bounded Search Coordination . 73 4.3.5 Stack-Stealing Search Coordination . 76 4.3.6 Budget Search Coordination . 81 4.3.7 Search Coordination Summary . 84 4.4 Workpool Design Choices . 85 4.5 Search Types . 87 4.5.1 Enumeration Search Type . 87 4.5.2 Decision Search Type . 88 4.5.3 Optimisation Search Type . 89 4.6 Summary . 90 5 Skeleton Implementation and Case Studies 93 5.1 Implementation . 93 5.1.1 A Prototype Haskell Framework for Tree Search . 93 5.1.2 YewPar: A Framework for Distributed-Memory Tree Search . 95 5.1.3 YewPar Features . 96 5.2 Case Study Applications . 104 5.2.1 Enumeration Case Studies . 104 5.2.2 Decision Case Studies . 108 5.2.3 Optimisation Case Studies . 112 6 Evaluation 117 6.1 Experimental Setup . 117 6.2 Caveats when Evaluating Parallel Search . 118 6.3 The Cost of Generality . 119 6.4 Global Incumbent Management . 121 6.5 Depth-Bounded . 123 6.5.1 Workpool Choice . 129 6.5.2 Depth-Bounded Work-Stealing Performance . 130 6.5.3 Depth-Bounded Summary . 132 6.6 Stack-Stealing . 132 6.6.1 Use of Chunking . 134 6.6.2 Stack-Stealing Work-stealing Performance . 136 6.6.3 Stack-Stealing Summary . 136 6.7 Budget . 138 6.7.1 Budget Work-Stealing Performance . 144 6.7.2 Budget Summary . 145 6.8 Skeleton Comparison . 146 6.8.1 Cumulative Statistics . 148 6.9 Scalability . 149 6.9.1 Finite Geometry: H(4; 32) ...................... 153 6.10 Generality and Ease of Use . 153 6.11 Summary . 154 7 Replicable Branch and Bound Search 157 7.1 Limits of Replicable Search .

Algorithmic Skeletons for Exact Combinatorial Search at Scale

Toward Optimised Skeletons for Heterogeneous

Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming

A Divide-And-Conquer Parallel Skeleton for Unbalanced and Deep Problems

Towards an Algorithmic Skeleton Framework for Programming the Intel R Xeon Phitm Processor

UNIVERSITY of PISA and SCUOLA SUPERIORE SANTANNA Master

A Theoretical Model for Global Optimization of Parallel Algorithms

Arxiv:2005.04094V1 [Cs.DC] 5 May 2020 National University of Defense Technology E-Mail: {J.Fang, Chunhuang, Taotang84}@Nudt.Edu.Cn Z

Skelcl -- a Portable Multi-GPU Skeleton Library

Algorithmic Skeletons and Parallel Design Patterns in Mainstream Parallel Programming

Nested Parallelism with Algorithmic Skeletons A

Algorithmic Skeletons for Python Jolan Philippe, Frédéric Loulergue

Copyrighted Material