Parallel and External High Quality Graph Partitioning
Dissertation approved by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) for the academic degree of Doctor of Natural Sciences (Doktor der Naturwissenschaften), submitted by Yaroslav Akhremtsev from Rostov-on-Don. Date of the oral examination: 29.05.2019. First reviewer: Prof. Dr. Peter Sanders. Second reviewer: Prof. Dr. Henning Meyerhenke.

Dedicated to my mother, my grandmother, and my father.

Abstract

Partitioning graphs into k blocks of roughly equal size such that few edges run between the blocks is a key tool for processing and analyzing large complex real-world networks. The graph partitioning problem has many practical applications in parallel and distributed computing, data storage, image processing, VLSI physical design, and more. Furthermore, the size, variety, and structural complexity of real-world networks have recently grown dramatically. Therefore, there is a demand for efficient graph partitioning algorithms that fully utilize the computational power and memory capacity of modern machines. A popular and successful heuristic for computing high-quality partitions of large networks in reasonable time is the multi-level graph partitioning approach, which contracts the graph while preserving its structure and then partitions it using a complex graph partitioning algorithm. Specifically, the multi-level graph partitioning approach consists of three main phases: coarsening, initial partitioning, and uncoarsening. During the coarsening phase, the graph is recursively contracted while preserving its structure and properties until it is small enough to compute an initial partition during the initial partitioning phase. Afterwards, during the uncoarsening phase, the partition of the contracted graph is projected back onto the original graph and refined using, for example, local search.

Most research on heuristic graph partitioning focuses on sequential algorithms or parallel algorithms in the distributed-memory model. Unfortunately, previous approaches to graph partitioning are not able to process large networks and rarely take several aspects of modern computational machines into account. Specifically, the number of cores per chip grows each year, while the price of RAM decreases more slowly than real-world graphs grow. Since HDDs and SSDs are 50 to 400 times cheaper than RAM, external memory makes it possible to process large real-world graphs at a reasonable price. Therefore, in order to better utilize contemporary machines, we develop efficient multi-level graph partitioning algorithms for the shared-memory and the external memory models.

First, we present an approach to shared-memory parallel multi-level graph partitioning that guarantees balanced solutions, shows high speed-ups for a variety of large graphs, and yields very good quality independently of the number of cores used. Important ingredients include parallel label propagation for both coarsening and uncoarsening, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast locality-preserving hash tables that effectively utilize caches. The main idea of the parallel localized local search is that each processor refines only a small area around a random vertex, reducing interactions between processors.
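To make this idea concrete, the following is a minimal sequential sketch of the work a single thread could perform per seed vertex, under simplifying assumptions: an unweighted graph in adjacency-list form, a hard upper bound on block size, and purely greedy moves. All names (localized_refinement, area_limit, and so on) are illustrative and not taken from the actual framework.

#include <queue>
#include <unordered_set>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists, unweighted

// Refine the partition in a small area around `seed`: visit at most
// `area_limit` vertices in BFS order and greedily move each visited vertex
// to the neighboring block with the largest positive gain, provided the
// target block stays within `max_block_size`.
void localized_refinement(const Graph& g, std::vector<int>& block,
                          std::vector<int>& block_size, int k,
                          int max_block_size, int seed, int area_limit) {
  std::queue<int> bfs;
  std::unordered_set<int> visited{seed};
  bfs.push(seed);
  int expanded = 0;

  while (!bfs.empty() && expanded < area_limit) {
    const int v = bfs.front();
    bfs.pop();
    ++expanded;

    // Number of edges from v into each block.
    std::vector<int> conn(k, 0);
    for (int u : g[v]) ++conn[block[u]];

    // Pick the feasible target block with the largest gain, i.e., the
    // largest reduction in the number of cut edges.
    int best = block[v];
    int best_gain = 0;
    for (int b = 0; b < k; ++b) {
      if (b == block[v] || block_size[b] + 1 > max_block_size) continue;
      const int gain = conn[b] - conn[block[v]];
      if (gain > best_gain) { best_gain = gain; best = b; }
    }
    if (best != block[v]) {
      --block_size[block[v]];
      ++block_size[best];
      block[v] = best;
    }

    // Grow the explored area along v's neighbors.
    for (int u : g[v])
      if (visited.insert(u).second) bfs.push(u);
  }
}

Because each call explores at most area_limit vertices around its seed, threads started from different random seeds touch mostly disjoint parts of the graph; in a real parallel implementation the block sizes and block assignments would additionally have to be updated atomically or under a lock.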
As an example of the resulting performance, on 79 cores our algorithm partitions a graph with more than 3 billion edges into 16 blocks, cutting 4.5% fewer edges than the closest competitor while being more than two times faster. Furthermore, another competitor is not able to partition this graph at all.

We then present an approach to external memory graph partitioning that is able to partition large graphs that do not fit into RAM. Specifically, we consider the semi-external and the external memory model. In both models, a data structure of size proportional to the number of edges does not fit into RAM. The difference is that the former model assumes that a data structure of size proportional to the number of vertices fits into RAM, whereas the latter assumes the opposite. We address the graph partitioning problem in both models by adapting the size-constrained label propagation technique to the semi-external model and by developing a size-constrained clustering algorithm based on graph coloring for the external memory model. Our semi-external size-constrained label propagation algorithm (and the external memory clustering algorithm) can be used to compute graph clusterings and is a prerequisite for the (semi-)external graph partitioning algorithm. These algorithms are then used in both the coarsening and the uncoarsening phase of a multi-level algorithm to compute graph partitions. Our (semi-)external algorithm is able to partition and cluster huge complex networks with billions of edges on cheap commodity machines. Experiments demonstrate that the semi-external graph partitioning algorithm is scalable and computes high-quality partitions in time comparable to the running time of an efficient internal-memory implementation. A parallelization of the algorithm in the semi-external model further reduces the running time.

Additionally, we develop a speed-up technique for hypergraph partitioning algorithms. Hypergraphs are an extension of graphs that allow a single edge to connect more than two vertices. Therefore, they describe models and processes more accurately and, additionally, allow more possibilities for improvement. Most multi-level hypergraph partitioning algorithms perform computations on vertices and their sets of neighbors. Since these computations can be super-linear, they have a significant impact on the overall running time on large hypergraphs. Therefore, to reduce the size of hyperedges, we develop a pin sparsifier based on the min-hash technique that clusters vertices with similar neighborhoods. Vertices that belong to the same cluster are then substituted by one vertex, which is connected to their neighbors, thereby reducing the size of the hypergraph. Our algorithm sparsifies a hypergraph such that the resulting hypergraph can be partitioned significantly faster without loss in quality (or with only insignificant loss). On average, KaHyPar with the sparsifier partitions about 1.5 times faster while preserving solution quality when hyperedges are large. All aforementioned frameworks are publicly available.
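As a rough illustration of the min-hash idea behind the pin sparsifier, the sketch below puts two vertices into the same cluster whenever their sets of incident hyperedges produce identical min-hash signatures. The hash family, the data layout, and all names (seeded_hash, minhash_clusters, IncidenceLists) are simplifying assumptions for illustration and do not reproduce the exact scheme used in KaHyPar.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <vector>

// incident[v] lists the ids of the hyperedges that contain vertex v.
using IncidenceLists = std::vector<std::vector<std::uint64_t>>;

// A simple seeded 64-bit mixer standing in for a family of hash functions.
static std::uint64_t seeded_hash(std::uint64_t key, std::uint64_t seed) {
  std::uint64_t x = key ^ (seed * 0x9E3779B97F4A7C15ULL);
  x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL;
  x ^= x >> 33; x *= 0xC4CEB9FE1A85EC53ULL;
  x ^= x >> 33;
  return x;
}

// Cluster vertices by a signature of `num_hashes` min-hash values: vertices
// with identical signatures, i.e., very similar neighborhoods, receive the
// same cluster id.
std::vector<int> minhash_clusters(const IncidenceLists& incident,
                                  int num_hashes) {
  std::map<std::vector<std::uint64_t>, int> signature_to_cluster;
  std::vector<int> cluster(incident.size());
  for (std::size_t v = 0; v < incident.size(); ++v) {
    std::vector<std::uint64_t> signature(
        num_hashes, std::numeric_limits<std::uint64_t>::max());
    for (std::uint64_t e : incident[v])
      for (int i = 0; i < num_hashes; ++i)
        signature[i] = std::min(signature[i], seeded_hash(e, i + 1));
    // New signatures get the next free cluster id; known ones reuse theirs.
    auto it = signature_to_cluster
                  .emplace(signature, (int)signature_to_cluster.size())
                  .first;
    cluster[v] = it->second;
  }
  return cluster;
}

Vertices that end up in the same cluster would then be contracted into a single representative connected to the union of their neighbors, which is what shrinks the large hyperedges before the actual partitioning starts.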
Deutsche Zusammenfassung (German Summary)

Partitioning graphs into k blocks of roughly equal size such that only few edges run between the blocks is an important tool for processing and analyzing large, complex real-world networks. The graph partitioning problem has several practical applications in parallel and distributed computing, data storage, image processing, VLSI physical design, and many other areas. Furthermore, the size, variety, and structural complexity of real-world networks have grown enormously in recent years. Therefore, there is a need for efficient graph partitioning algorithms that fully exploit the computing power and memory capacity of modern machines. A widespread and successful heuristic for computing high-quality partitions of large networks in reasonable time is the multi-level graph partitioning approach, which contracts the graph while preserving its structure and then partitions it using a complex graph partitioning algorithm. Specifically, the multi-level graph partitioning approach consists of three main phases: coarsening, initial partitioning, and refinement. During the coarsening phase, the graph is recursively contracted while preserving its structure and properties until it is small enough to compute an initial partition during the initial partitioning phase. Afterwards, during the uncoarsening phase, the partition of the contracted graph is projected back onto the original graph and refined, for example by local search.

Most research on heuristic graph partitioning focuses on sequential algorithms or on parallel algorithms in the distributed-memory model. Unfortunately, previous approaches to graph partitioning are not able to process large networks and rarely take several aspects of modern computing machines into account. In particular, the number of cores per chip grows every year, while the price of RAM decreases more slowly than real-world graphs grow. Since HDDs and SSDs are 50 to 400 times cheaper than RAM, using external memory makes it possible to process large real-world graphs at a reasonable price. Therefore, in order to make better use of the capabilities of modern machines, we develop efficient multi-level graph partitioning algorithms for the shared-memory and external memory models. First, we present an approach to shared-memory parallel multi-level graph partitioning that guarantees balanced solutions, achieves significant speed-ups for a wide variety of large graphs, and yields very good quality independently of the number of cores used. Important ingredients of this approach include parallel label propagation for both coarsening and refinement, parallel initial partitioning, a simple yet effective approach to parallel localized local search, and fast hash tables that make effective use of caches.
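The label propagation routine mentioned throughout (in its size-constrained form it drives both the coarsening and the semi-external clustering) can be pictured roughly as follows. This is a minimal sequential sketch under simplifying assumptions: an unweighted graph, a fixed number of rounds, and a hard cluster-size bound; the parallel and (semi-)external variants developed in the thesis additionally organize the computation around concurrency and memory-access constraints that this sketch ignores, and the names used here are purely illustrative.

#include <algorithm>
#include <random>
#include <unordered_map>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists, unweighted

// Every vertex starts in its own cluster and repeatedly adopts the label
// that is most common among its neighbors, as long as the chosen cluster
// does not exceed `max_cluster_size`.
std::vector<int> size_constrained_label_propagation(const Graph& g,
                                                    int max_cluster_size,
                                                    int num_rounds) {
  const int n = (int)g.size();
  std::vector<int> label(n), cluster_size(n, 1);
  for (int v = 0; v < n; ++v) label[v] = v;

  std::vector<int> order(n);
  for (int v = 0; v < n; ++v) order[v] = v;
  std::mt19937 rng(12345);

  for (int round = 0; round < num_rounds; ++round) {
    std::shuffle(order.begin(), order.end(), rng);  // random vertex order
    for (int v : order) {
      // Count how often each label occurs in v's neighborhood.
      std::unordered_map<int, int> freq;
      for (int u : g[v]) ++freq[label[u]];

      // Move v to the most frequent label whose cluster still has room.
      int best = label[v];
      int best_count = freq[label[v]];
      for (const auto& [l, c] : freq)
        if (c > best_count && cluster_size[l] + 1 <= max_cluster_size) {
          best = l;
          best_count = c;
        }
      if (best != label[v]) {
        --cluster_size[label[v]];
        ++cluster_size[best];
        label[v] = best;
      }
    }
  }
  return label;
}

The resulting clustering can serve directly as one contraction step of the multi-level scheme: every cluster becomes a single vertex of the next, coarser graph, and the same routine can be rerun on that graph until it is small enough for initial partitioning.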