The Read-Only Semi-External Model

Guy E. Blelloch Laxman Dhulipala Phillip B. Gibbons Carnegie Mellon University MIT CSAIL Carnegie Mellon University [email protected] [email protected] [email protected] Yan Gu Charles McGuffey Julian Shun UC Riverside Carnegie Mellon University MIT CSAIL [email protected] [email protected] [email protected]

Abstract external memory, with transfers between the two done in We introduce the Read-Only Semi-External (ROSE) Model blocks of size B. The cost of an algorithm is the number of for the design and analysis of algorithms on large graphs. As such transfers, called its I/O complexity. The model captures in the well-studied semi-external model for graph algorithms, the fact that (i) real-world performance is often bottlenecked we assume that the vertices but not the edges fit in a small by the number of transfers (I/Os) to/from the last (slowest, fast (shared) random-access memory, the edges reside in an largest) level of the hierarchy used, (ii) that level is used unbounded (shared) external memory, and transfers between because the second-to-last level is of limited size, and (iii) the two memories are done in blocks of size B. A key transfers are done in large blocks (e.g., cache lines or pages). difference in ROSE, however, is that the external memory Because of its simplicity and saliency, the External Memory can be read from but not written to. This difference is model has proven to be an effective model for algorithm motivated by important practical considerations: because the design and analysis [7, 15, 51, 67, 74]. graph is not modified, a single instance of the graph can be The Semi-External model [1] is a well-studied special shared among parallel processors and even among unrelated case of the External Memory model suitable for graph concurrent graph algorithms without synchronization, that algorithms, in which the vertices of the graph, but not the instance can be stored compressed without the need for edges, fit in the internal memory. This model reflects the re-compression, the graph can be accessed without cache reality that large real-world graphs tend to have at least an coherence issues, and the wear-out problems of non-volatile order of magnitude more edges than vertices. Figure1, for memory, such as Optane NVRAM, can be avoided. example, shows that all the large graphs (at least 1 billion Using ROSE, we analyze parallel algorithms (some edges) in the SNAP [56], LAW [30] and Azad et al. [9] existing, some new) for 18 fundamental graph problems. We datasets have an average degree more than 10, and over half show that these algorithms are work-efficient, highly parallel, have average degree at least 64. The assumptions in the Semi- and read the external memory using only a block-friendly External model have proven to be effective in both theory and (and compression-friendly) primitive: fetch all the edges for practice [1, 42, 57, 60, 69, 75, 76]. a given vertex. Analyzing the maximum times this primitive However, the recent emergence of new nonvolatile is called for any vertex yields an (often tight) bound on the memory (NVRAM) technologies (e.g., Intel’s Optane DC (low) I/O cost of our algorithms. We present new, specially- Persistent Memory) has added a new twist to memory designed ROSE algorithms for triangle counting, FRT trees, hierarchies: writes to NVRAM are much more costly than and strongly connected components, devising new parallel reads in terms of energy, throughput, and wear-out [18, 33, algorithm techniques for ROSE and beyond. 42, 50, 72, 73]. Neither the External Memory model nor the Semi-External model account for this read-write asymmetry. 1 Introduction To partially rectify this, Blelloch et al. 
[18] introduced the Asymmetric External Memory model, a variant of the External Efficient use of the is crucial to obtaining Memory model that charges ω  1 for writes to the external good performance. For the design and analysis of algorithms, memory (the NVRAM), while reads are still unit cost (see it is often useful to consider simple models of computation also [52]). To our knowledge, the Semi-External setting with that capture the most salient aspects of the memory hierarchy. asymmetric read-write costs has not been studied. Although The External Memory model (also known as the I/O or disk- one could readily define such a model, graph algorithms access model) [3], for example, models the memory hierarchy provide an opportunity to go beyond just penalizing writes, as a bounded internal memory of size M and an unbounded by eliminating writes to the external memory altogether!

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited Table 1: Analysis of graph algorithms in ROSE, for a graph G 103 Social of n vertices and m edges, assuming m = Ω(n). ∗ denotes that Web ‡ Biology a bound holds in expectation and denotes that a bound holds with high probability or whp (O(k f (n)) cost with probability at least 1 − 1/nk ). † denotes the bound assumes the average degree davg = m/n = O(logn); for larger davg, one of the logs in the depth 102 should be replaced by davg. B, the block size, is the number of edges that fit in a block. D = diam(G) is the diameter of G. rsrc is the eccentricity from the source vertex.√ ∆ is the maximum degree. α is the arboricity of G. L = min ( m, ∆) + log2 ∆ logn/log logn. ρ Num. Edges / Num. Vertices is the peeling complexity of G [41]. WSP , DSP , and QSP are the work, depth, and I/O complexity of a single-source shortest path 1 10 computation, which depends on the metric used for the FRT trees. 107 108 109 Number of vertices Problem Work Depth I/O Complexity Triangle Counting O(αm)∗ O(α log n)‡ O(α(n + m/B)) FRT Trees O(W log n)∗ O(D log n)‡ O(Q log n)∗ Figure 1: (Adapted from [42]) Number of vertices ( , in log-scale) SP SP SP n Strongly Connected Comp O(m log n)∗ O(D log3 n)‡ O((n + m/B) vs. average degree (m/n, in log-scale) on 52 real-world graphs log n)∗ with m > 109 edges from the SNAP [56], LAW [30] and Azad Breadth-First Search O(m) O(D log n)† O(n + m/B) et al. [9] datasets. All of the graphs have more than 10 times as ∗ ‡† Integral-weight BFS O(rsr c + m) O(rsr c log n) O(n + m/B) many edges as vertices (corresponding to the gray dashed line), and Bellman-Ford O(Dm) O(D log n)† O(D(n 52% of the graphs have at least 64 times as many edges as vertices + m/B)) † (corresponding to the green dashed line). Single-Source Widest Path O(Dm) O(rsr c log n) O(D(n + m/B)) † This paper presents the Read-Only Semi-External (ROSE) Single-Source Betweenness O(m) O(D log n) O(n + m/B) O(k)-Spanner O(m) O(k log n)‡† O(n + m/B) model, for the design and analysis of algorithms on large LDD O(m) O(log2 n)‡† O(n + m/B) graphs. As in the Semi-External model, the ROSE model Connectivity O(m)∗ O(log3 n)‡† O(n + m/B)∗ assumes that the vertices but not the edges fit in a small Spanning Forest O(m)∗ O(log3 n)‡† O(n + m/B)∗ ∗ ∗ fast (shared) random-access memory, the edges reside in an Biconnectivity O(m) O(D log n O(n + m/B) + log3 n)‡† unbounded (shared) external memory, and transfers between Maximal Independent Set O(m) O(log2 n)‡† O(n + m/B) the two memories are done in blocks of size B (where B is Graph Coloring O(m) O(log n O(n + m/B) the number of edges that fit in a block). A key difference + L log ∆)∗† ∗ ‡ in the ROSE model, however, is that the external memory k-core O(m) O(ρ log n) O(n + m/B) Apx. Densest Subgraph O(m) O(log2 n) O(n + m/B) can be read from but not written to. The input graph is PageRank Iteration O(m) O(log n) O(m/B) stored in the read-only external memory, but the output gets written to the read-write internal memory. Unlike general algorithms such as sorting, whose output size is A read-only input graph: Because the graph is not modified, Θ(input size), graph algorithms are amenable to a read-only a single instance of the graph can be shared among parallel external memory setting because their output sizes are often processors and even unrelated concurrent graph algorithms Θ(n) rather than Θ(m), where n (m) is the number of vertices without synchronization. 
Because data from NVRAM, like (edges, respectively) in the graph. DRAM, is brought into CPU caches that are kept coherent The ROSE model is motivated by practical benefits by hardware, read-only access means that these cache lines arising from two main consequences of the model: will avoid the costly invalidation that arises with concurrent No external memory writes: Because of NVRAM’s order(s) readers and writers (or concurrent writers). Finally, graphs of magnitude advantage in latency/throughput/wear-out over are often stored in compressed format [41, 65], to reduce traditional (NAND Flash) SSDs and in capacity/cost-per-byte their footprint and the memory bandwidth needed to access over traditional (DRAM) main memory, the emerging setting them. Accessing the graph in a read-only manner still for large graph algorithms is a hierarchy of DRAM internal requires runtime decoding, but avoids runtime re-encoding memory and NVRAM external memory [42, 46]. In such overheads and allows for better (offline, encode-time heavy) settings, ROSE algorithms avoid the high performance cost compression [14, 30]. of NVRAM writes. Moreover, ROSE algorithm design is Another twist introduced by NVRAM is that, unlike independent of the actual costs of NVRAM writes, which vary SSDs, read latency and throughput are only modestly worse depending on access patterns, technologies, and whether the than DRAM [50, 72]. Thus, while the I/O complexity (num- metric of interest is latency, bandwidth, energy, etc. Finally, ber of external memory reads) remains a good measure, it avoiding writes means avoiding NVRAM wear-out and wear- may no longer be the dominant cost in practice. Accord- leveling overheads. ingly, ROSE includes separate measures for computation

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited work and depth, standard measures for analyzing parallel 2 The ROSE Model algorithms [39, 53]. 2.1 Model Definition. Consider a graph G(V , E) with n = Algorithm design in the ROSE model. While the |V | vertices and m = |E| edges. Depending on the graph benefits of restricting the external memory to be read-only are problem being studied, G is (i) undirected or directed, and clear, the question remains as to whether one can design fast (ii) weighted or unweighted. We assume that G has neither and efficient algorithms in the ROSE model. We show that (undirected) self-edges nor duplicate edges. When reporting indeed such algorithms exist. Specifically, we analyze parallel bounds, we assume that m = Ω(n) and indeed semi-external algorithms (some existing, some new) for 18 fundamental models are relevant only when m  n. Let diam(G) be the graph problems. As shown in Table1, most of the algorithms unweighted (hop) diameter of G, deg(v) be the degree of are work-efficient and highly parallel (often O(polylog(n)) vertex v, and rv be the eccentricity of v, i.e., the longest depth). Interestingly, all the algorithms read from the external shortest-path distance between v and any vertex u reachable memory using only a single primitive, FETCHEDGES(v): from v. Fetch all of the incident edges for a given vertex v. Because The Read-Only Semi-External (ROSE) model consists the ROSE model makes the reasonable assumption that the of a read-write random-access internal memory of Θ(n) words edges for a given vertex are stored consecutively in external and a read-only block-access external memory of unbounded memory (optionally compressed), this leads to good I/O size. Words are Θ(logn) bits. Transfers from the external complexity. Analyzing the maximum times FETCHEDGES memory to the internal memory (i.e., external memory reads) is called for any vertex yields an (often tight) bound on the are done in blocks of size B, where B is measured as the (low) I/O complexity of our algorithms. Although the bounds number of edges that fit in a block. The I/O complexity Q of ( / ) (typically, O n + m B ) may not be optimal for low-degree an algorithm is the number of such transfers. The input graph graphs, many real-world graphs have average degree at least resides in the external memory and the program output gets B, in which case the I/O complexity is optimal. In particular, stored in the internal memory (thus, the model is restricted the block size in bytes of Optane DC Persistent Memory is to graph problems with O(n) output size). We assume the 256. Assuming say 4 bytes per edge, this means 64 edges fit following canonical form for the input graph layout: vertices in a block, i.e., B = 64 (recall that B is measured in edges not are numbered 1 to n and the graph is stored in standard bytes). Figure1 shows that 52% of the graphs have average compressed sparse row (CSR) format as consecutive blocks in degree at least 64. the external memory. An adversary controls the numbering of Our new algorithms are specially-designed ROSE algo- the vertices (and hence their ordering in the CSR format). We rithms for triangle counting, FRT trees, and strongly con- assume any compression of the graph is done in a manner that nected components (SCC). Many interesting algorithmic enables fast decompression of individual compressed blocks. techniques are developed in these algorithms. 
For triangle The work W of an algorithm is the total number of counting, our new algorithm requires integrating ideas from instructions using only the internal memory (i.e., not counting the classic Chiba-Nishizeki triangle counting algorithm to external memory reads, which are accounted for in the achieve work-efficiency, low depth, and low I/O complex- I/O complexity). For parallel algorithms, we assume the ity. For FRT trees, we propose a search-centric view of the binary-forking model [8, 20, 29], which is widely used in algorithm, which is different from all previous algorithms analyzing parallel algorithms [2, 4, 17, 21, 25, 41, 70]. In and requires careful algorithm design and analysis. Finally, this model, a running can spawn two child threads the ROSE SCC algorithm avoids the edge removal using a fork instruction, and then it resumes only after that is part of the state-of-the-art parallel SCC algorithm [23], the children finish. This supports nested-parallelism. All while achieving the same work bounds as this existing SCC threads share both the internal and external memory. A algorithm. Hence, we believe that the algorithmic insights in compare-and-swap (CAS) instruction is allowed on individual this paper may be of interest even in a shared-memory setting words of the internal memory. Some of our algorithms also without considering NVRAMs and the ROSE setting. make use of the fetch-and-add (FA) and priority-write (PW) In summary, the two main contributions of this paper are: instructions, which are widely used in the design of parallel algorithms [40, 41, 42, 63]. The depth D (also called the • We introduce the Read-Only Semi-External (ROSE) span) of a computation is the length of the longest chain of model for the design and analysis of large graph algo- dependences for instructions. We seek parallel algorithms rithms, motivated by real-world systems considerations. that are work-efficient, i.e., their work asymptotically matches the best sequential algorithm, and highly parallel, i.e., their • We design efficient ROSE algorithms for triangle count- depth is polylog(n) or at most O(diam(G)· polylog(n)) (note ing, FRT trees, and SCC, using novel techniques. We that most large real-world graphs have small diameter). In also show that 15 existing graph algorithms are efficient addition to the O(n) shared-memory accessible to all threads, in the ROSE model. we assume each thread can allocate O(polylog(n)) memory in

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited a stack fashion (i.e., the memory is only visible to the thread, community [10], and new computational geometry algorithms and its forked descendants and any memory allocated by a have been devised whose running time is a function of s, the child after a fork needs to be freed before the child finishes). working space in words (S = s logn bits), 1 ≤ s ≤ n. Further discussion of memory allocation and usage can be To get around stringent lower bounds for read-only found in [47]. models, recent work has studied relaxations that still seek to minimize the additional working space beyond the input 2.2 Related Work. External Memory and Semi-External graph. If K = O(m logn) bits are used to store the input models are block-based models, in that they transfer data graph, in-place algorithms [34] are allowed to use those K between the internal and external memory in blocks that bits and polylog(n) more space as the only read-write memory. hold multiple data words. As noted in Section1, prior Gu et al. [47] study in-place graph algorithms in the parallel work on such models did not study the case of a read-only setting, where sublinear additional space is allowed. Restore external memory. Likewise, other sequential and parallel algorithms [35] are in-place algorithms that must restore the block-based hierarchical memory models, e.g., the Ideal input graph to its original state at the end of the algorithm. Cache model used in cache-oblivious algorithm design [45], As discussed in Section1, the ROSE model’s read-only input the Multicore-Cache model [16], and the Parallel Memory graph provides additional benefits beyond just saving space. Hierarchy model [5], do not have read-only memory as part Our work is the first hierarchical model to combine the of the model. Carson et al. [33] studied write-avoiding semi-external assumption (vertices fit in internal memory, but algorithms, which seek to minimize the number of external edges do not) with a random-access read-only external mem- memory writes, but does not disallow them. ory with block transfers. Moreover, unlike most prior work For non-block-based models, most prior work accounting limiting read-write working space, we focus on (efficient) for asymmetric read-write costs in the external memory has parallel algorithms. simply charged more for external memory writes [12, 13, 18, 19, 22, 24, 49]. An exception is our recent work on 3 The FETCHEDGES Primitive  Sage [42], which defines a cost model that charges ω 1 A key property of the algorithms analyzed in this paper is that for external memory writes, but looks to design algorithms they all read the external memory using only a block-friendly that perform no external memory writes. Unlike the present (and compression-friendly) primitive, FETCHEDGES(v), that paper, that paper did not formalize a read-only model, did fetches all of the incident edges in the input graph for a given not consider block transfers, allowed more than ( ) internal O n vertex v. memory in a key variant, and allowed the input graph to be Because the edge list of a vertex is contiguous in the stored replicated on each socket of a multi-socket machine external memory, for each block that contains part of that (to avoid costly NUMA effects). 
We will show that a number list, at least one of the following must be true: (i) the block of the Sage algorithms are efficient in the ROSE model (see contains the beginning of the edge list; (ii) the block contains Section7). the end of the edge list; and/or (iii) the block consists entirely In the Semi-Streaming model [44, 59], graph algorithms of elements of the edge list. For a vertex v, there can be at ( · ( )) can read or write to an internal memory of O n polylog n most one block of each of the first and second types, and bits and can only read the graph in a sequential streaming deg(v)/B blocks of the third type. The call to FETCHEDGES order (with possibly multiple passes). This restrictive access causes one transfer of each of these blocks and no other to the graph is block-transfer-efficient ( / transfers per m B transfers, resulting in the following lemma. pass), and so any semi-streaming algorithm with W work, t passes, and only O(n logn) memory bits is also a ROSE LEMMA 3.1. A call to FETCHEDGES(v) causes at most algorithm with W work and tm/B I/O complexity. The ddeg(v)/Be + 1 transfers from external memory. ROSE model, on the other hand, is not limited to sequential Note that a parallel foreach (u,v) ∈ E loop, which streaming order. fetches all the edges, requires only m/B transfers, because it Classic definitions of space complexity assume that can be implemented using FETCHEDGES on the vertices in the input is in a read-only memory and that the algorithm CSR order. uses S space if its read-write working space is S bits. Any We define a k-read ROSE algorithm to have the follow- lower bounds for time in S = O(n logn) space (O(n) words) ing properties: (1) it only accessses (reads) the input graph immediately carry over to a lower bound for work in the using the FETCHEDGES primitive, and (2) k is an upper bound ROSE model. Some classic models allow for an append- on the number of times that FETCHEDGES(v) is called for only output stream, to enable algorithms whose output size is any vertex v in the graph. Using Lemma 3.1, we can show greater than S. In theory, the ROSE model could be likewise that the I/O complexity of a k-read ROSE algorithm is at most extended. Such Limited Workspace models have attracted k(Í (ddeg(v)/Be + 1)) = O(k(n + m/B)), which gives the considerable recent attention in the computational geometry v ∈V following theorem.

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited Algorithm 1: The ROSE triangle counting algorithm (u,v) ∈ E by hashing the lower-degree endpoint’s neighbors Input: An undirected graph G = (V , E), and a total into the higher-degree endpoint’s neighbors. They then show the following elegant fact: ordering over the vertices >T . Output: The triangle count, T , of G Õ G (4.1) min(deg(u), deg(v)) = O(αm) + 1 Define N (v) = {(u,v) ∈ E s.t. u >T v} (u,v)∈E 2 Set TG ← 0 √ 3 Set R ← V Because α = O( m) [36], the worst-case running time of ( 3/2) 4 while R , ∅ do this algorithm√ is O m . However, for sparser graphs with Íc + α = o( m) the work can be significantly better. For example, 5 Let A ← {a1,..., ac } ⊆ R s.t. 1 |N (ai )| ≤ 2n 6 Let R ← R \ A planar graphs and constant genus graphs have α = O(1), and so the algorithm runs in linear time on such graphs. 7 Build parallel hash tables, Ha representing N +(a), a ∈ A The challenge to overcome in the ROSE model is the 8 parallel∀ foreach (u,v) ∈ E do fact that the input graph is given in the CSR format, and not as a collection of per-vertex hash tables storing the vertices’ 9 if u >T v and u ∈ A then + + neighborhoods. The main idea of our new algorithm is to 10 Let Tuv ← |N (v) ∩ N (u)|, computed by + materialize as many hash tables as will fit in the internal hashing N (v) into Hu memory and perform partial triangle counting using the 11 Atomically increment TG by Tuv materialized neighborhoods. By combining the properties 12 return TG of the CN algorithm with careful use of prefix sums, we obtain the following result.

THEOREM 3.1. The I/O complexity of a k-read ROSE algo- THEOREM 4.1. There is a ROSE algorithm for triangle rithm is at most O(k(n + m/B)). counting on a graph with n vertices, m edges, and α arboricity ( ) ( ) For simplicity, we will only analyze the I/O complexity of that has O αm expected work, O α logn depth whp, and ( ( / )) uncompressed graphs, but the analysis can be readily adapted O α n + m B I/O complexity. to compressed graphs (with an updated Theorem 3.1). In the remainder of this paper, we will frequently make Algorithm. We provide the pseudocode for our triangle use of this theorem as a simple means to derive I/O complexity counting algorithm in Algorithm1. The algorithm takes as bounds. As an example, consider the breadth-first search input an undirected graph, and an ordering over the vertices (BFS) algorithm defined in Dhulipala et al. [42]. This >T . The provided ordering does not impact correctness, but algorithm starts at the source vertex and repeatedly expands a judicious choice for the ordering enables us to obtain the + the frontier of the BFS tree one level at a time in parallel. bounds in Theorem 4.1. We define N (v) to be the neighbors Each vertex in the frontier uses FETCHEDGES to obtain a list of v ranked higher than v (according to the ordering >T ) of its neighbors, and a conditional check ensures that vertices (Line1). Our algorithm first initializes the set of vertices to are not revisited. To ensure that the number of simultaneously be removed (R) to V (Line3). Then, while R is non-empty, it fetched edges is limited to O(n) (and not O(m)) in the worst repeatedly removes a subset of active vertices, A ⊆ R, from case, the algorithm uses a special EDGEMAPCHUNKED R. (Lines5–6). The active vertex subset is chosen such that + function that achieves this limit while not incurring additional the sum of |N (v)| for all vertices v ∈ A is at most 2n, and + external memory reads. Thus, the algorithm is a 1-read ROSE hence fits in the internal memory. Because |N (v)| < n for algorithm, with I/O complexity O(n + m/B). all v ∈ V , we can always find such a subset that sums to more than n, and at most 2n, with the possible exception of the last 4 Triangle Counting iteration. A simple way to find these subsets is to compute the prefix sum of |N +(v)| over for all v ∈ V at the beginning of In this section, we present a work-efficient ROSE triangle- the algorithm. The algorithm can then keep the total value of counting algorithm whose depth is parametrized in terms of |N +(v)| that has already been removed, and perform a binary the arboricity of the input graph. search each round for the vertex with the largest value with Overview. Our approach in this paper is to parallelize the difference at most 2n. classic sequential triangle counting algorithm due to Chiba For each active subset, the algorithm builds parallel hash and Nishizeki [36] (the CN algorithm) that runs in O(αm) tables, Ha, storing the neighbors of the active vertices a ∈ A work where α is the arboricity of the input graph, i.e., the (Line7). It then maps over all edges in the graph in parallel, minimum number of disjoint forests that the edges of the and for each edge checks whether the higher-ranked endpoint graph can be decomposed into. The CN algorithm works by of the edge (with respect to T ) is in A, and thus has its hash intersecting the neighbor lists N (u) and N (v) for each edge table materialized (Line9). If so, the algorithm computes the

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited intersection size using parallel hashing to hash the vertices show that the total work for processing each edge in its active + Í in N (v) (where v is the lower-ranked endpoint of the edge) round is at most (u,v)∈E min(deg(u), deg(v)) = O(αm). into the larger degree vertex’s parallel hash table (Line 10). Therefore the total intersection work is O(αm). Computing the count can be done using a reduction over the Combining the overall work of each step results in a total lower-ranked vertex’s neighborhood. Finally, the algorithm work of O(α(m + n)) = O(αm) since we assume m ∈ Ω(n)). atomically updates the overall triangle count (Line 11). Finally, the depth is just O(α logn). Although we describe this algorithm using atomics for simplicity, this requirement can easily be removed. The idea I/O complexity. We now argue that Algorithm1 has low I/O is to perform a parallel reduction over all edges in the graph, complexity. First, note that each vertex performs one call and then for each edge where the higher-ranked endpoint is to FETCHEDGES in the active round where it materializes in A, perform a parallel reduction over the smaller degree’s its hash table. The algorithm also processes all edges of neighborhood to compute the count. Similarly, the set N +(v) the graph O(α) times, contributing another O(α) calls to does not actually have to be materialized and can be filtered FETCHEDGES per vertex. The remaining calls for each as the algorithm is iterating through v’s neighborhood. vertex v come from edges (u,v) where u >T v. Specifically, Correctness. We argue that each triangle is counted once. we must bound the maximum out-degree of each vertex Consider a triangle (u,v,w). Without loss of generality, let in the ordering.√ Unfortunately, the maximum out-degree ( ) w >T v >T u. Observe that this triangle will be found by can be up√ to O m in the worst case [66], naively leading the (u,v) edge in the iteration of the while-loop where v is to an O( m)-read algorithm. However, we can obtain an active vertex (v ∈ A), because w ∈ N +(u) and w ∈ N +(v). a better bound by directly bounding the number of I/Os The triangle cannot be found by the (u,w) or (v,w) edge for these edges. Specifically, the algorithm will perform + Í because neither u nor v are present in N (w). (u,v)∈E dmin(deg(u)/B, deg(v)/B)e + 1 I/Os for these edges, Choice of ordering. Our analysis of the work, depth, and because each edge will read its lower-degree endpoint’s edges. Using Equation 4.1, this quantity can be simplified to I/O complexity of our algorithm relies on a total ordering, >T , of the vertices. To obtain our results, we utilize the Degree O(m + αm/B), which given that there are at most αn edges, ordering, which is defined for any two vertices u,v ∈ V as is O(α(n + m/B)) I/Os in all. u >deg v iff deg(u) > deg(v), or deg(u) = deg(v) and u > v. Work and depth. To prove the work and depth bounds in 5 Constructing FRT Trees Theorem 4.1, we first bound the number of times the while The FRT tree [43], proposed by Fakcharoenphol, Rao, and loop on Line4 can be invoked. A well-known fact about Talwar in 2003, is an asymptotically optimal algorithm to graphs with arboricity α is that they cannot have many edges: generate probabilistic tree embeddings [11], which embed a in particular, an arboricity α graph can have at most O(nα) finite metric (X,dX ) into a distribution of tree metrics with a edges. 
Because each round of the loop (except possibly minimum expected distance distortion. In particular, for every the last) removes more than n edges, after O(nα/n) = O(α) pair of elements x,y ∈ X, the tree distance is always no less rounds, R will become empty. On each of these rounds, we than dX (x,y), and at most O(dX (x,y) logn) in expectation. process all remaining vertices, and all edges in the graph. FRT trees have been used in many applications, such as The overall work of these steps is thus O(nα) and O(mα), (1) several practical algorithms with good approximation respectively, in the worst case. bounds, such as the k-median problem, buy-at-bulk network To bound the cost of materializing hash tables for vertices design [28], and network congestion minimization [61]; when they are active, observe that each vertex constructs a (2) network algorithms including the generalized Steiner hash table of its incident edges exactly once. The hash table forest problem, the minimum routing cost spanning tree construction can be done in O(deg(v)) expected work for problem, and the k-source shortest paths problem [54]; each vertex v, and O(logn) depth whp. This is done by using (3) solving symmetric diagonally dominant (SDD) linear the CAS primitive provided by the model to insert elements systems [38]; and (4) construction of approximate distance into an open-addressed table via linear probing. The overall oracles (ADOs) [26]. expected work of these operations is O(m) across all vertices. In this paper, we consider the input as a graph metric Finally, consider the work of the intersections. Observe (G,dG ), where G = (V , E) contains n vertices and m edges, that each edge (u,v) is processed in exactly one round, when and dG is the shortest-path distance. Recent work by Andoni, its higher-ranked endpoint (according to >T ) is in A. We call Stein, and Zhong [6] presents an O˜ (m) work, polylogarithmic this round the active round for (u,v), and assume without depth algorithm to construct FRT trees, but it requires O˜ (m) loss of generality that u >T v. The work of processing this intermediate space (read-write memory) and hence does not edge is O(1) in the other rounds because we simply scan fit into the ROSE model. Another recent work by Blelloch over it and do nothing. Finally, in the active round, we hash et al. [23, 26] discussed algorithms for FRT tree construction + |N (v)| ≤ deg(v) times into Hu . Using Equation 4.1, we on a graph both sequentially and in parallel. Their algorithms

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited construct the least-element (LE) lists [37] (see Definition1) The original algorithm [43] was described as a top-down as an intermediate representation, and then construct the FRT clustering algorithm, generating a laminar family of clusters. tree from the LE-lists using an algorithm by Blelloch, Gupta, This corresponds to a tree in which the edge weights start at and Tangwongsan [28]. But LE-lists comprise an expected d∗ at the root and at each level decrease by a factor of two 2n lnn vertex indices and distances, which are generated going down the tree. Such a tree, however, can have a number from n single-source shortest-paths (SSSP) searches, and of nodes that is at least quadratic in the input size. Therefore, hence does not fit into the ROSE model. (In practice, storing in this paper we build a compressed FRT tree [28], for which 2 lnn numbers per vertex likely precludes storing the LE- nodes in the FRT tree with a single child are spliced out and lists in DRAM when the graph contains tens of billions of the incident edge weights combined. This transformation vertices.) On the other hand, the output of a compressed preserves distances in the tree. The leaves correspond to the FRT tree (defined in Section 5.1) is only O(n) words, which input points, and because all internal nodes have at least two fits in the ROSE model. Therefore, our goal is to design a children, the tree is of size O(n). The tree also has depth new algorithm for constructing FRT trees without explicitly O(logn) whp [28]. generating the LE-lists. Instead, in our approach we generate Compressed FRT Trees from LE-lists. The compressed the FRT trees round-by-round, directly based on the distances FRT tree can be generated from LE-lists directly [26, 28], from the SSSP searches. In the rest of this section, we will avoiding the large number of nodes in the full FRT tree. This first review the existing algorithms, and then present our new can be done in three steps: approach. 1. Generate the LE-list for each point based on the random 5.1 Definitions, Existing Algorithms, and Intuition. We permutation of the input. Each such list has size O(logn) first review the definitions for LE-lists, FRT trees, and the whp. Blelloch et al. algorithms for constructing FRT trees from a graph [23, 26, 28]. 2. Take all of the distances dG (vi ,vj ) in the LE-lists and · ∗ LE-lists. Given an ordering of the vertices, the Least-Element replace them with rounded log-distances blog β d c, 2 dG (vi,vj ) lists (LE-lists) for a graph (either unweighted or with non- and then only keep the first entry among equal distances. negative weights) are defined as follows. 3. Treating each modified LE-list as a string, where each DEFINITION 1. (LE-LIST [37]) Given a graph G = (V , E) character is a (vertex, log-distance) pair, build a radix with V = {v1,...,vn }, the LE-lists are: tree on all the lists. Weight the edges based on the top 1  j−1  “character” on the combined edge. L(vi ) = hvj ,dG (vi ,vj )i | vj ∈ V ,dG (vi ,vj ) < mindG (vi ,vk ) k=1 Unfortunately, this algorithm will not suffice for our purposes because our goal is to use only O(n) space, while the LE-lists sorted by dG (vi ,vj ), in decreasing order. themselves require O(n logn) space. As mentioned, we plan In plain language, a vertex vj is in vertex vi ’s LE-list if to integrate the generation of LE-lists and generation of the and only if there are no earlier vertices (vk ,k < j) that are tree. 
This requires understanding and adapting the parallel closer to vi . Often one stores with each vertex vj in L(vi ) LE-lists algorithm. Also the previous using 2 the distance dG (vi ,vj ). Typically a random ordering of the this idea [28] requires O(n log n) work for the third step vertices is used, which ensures that all LE-lists have length because it requires a lexicographic sort. Our new algorithm O(logn) whp [37]. also improves this bound to O(n logn) work, which might be Radix-trees. Given an alphabet Σ, and a set of strings S of independent interest beyond the ROSE model. each from Σ∗, a radix-tree (also called PATRICIA tree or Generating LE-lists in Parallel. We start with the radix trie) of S is generated by taking the trie on S and then BGSS parallel algorithm for constructing LE-lists (Algo- removing vertices with one child by combining the incident rithm2)[ 23]. The algorithm runs for logn rounds, where edges, typically by appending their characters. All interior each round doubles the number of vertices from which it nodes in a radix-tree therefore have at least two children, and does SSSP searches “in parallel”—i.e., on the r’th round it r −1 hence the total number of nodes is O(|S|). runs SSSP searches from the next 2 vertices. The set Si FRT trees and Compressed FRT Trees. The FRT algorithm captures all vertices that are closer to the i’th vertex than for a metric (X,dX ) is based on a random permutation of the earlier vertices (the previous closest distance is stored in input points, and a parameter β ∈ [1, 2) randomly selected δ(·)). Line4 computes Si using a single-source shortest from the probability density function fβ (x) = 1/(x ln 2). We assume that the weights are normalized so that 1 ≤ 1This could cause distances to differ from the original FRT tree by a ∗ δ¯ dX (x,y) ≤ d = 2 for all x , y, where δ¯ is a positive integer. constant factor, but with more care the distances can be made identical.

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited Round Round Round Round Algorithm 2: The BGSS algorithm for constructing LE-lists in parallel [23]. 0 1 2 log2 푛

Input: A graph G = (V , E) with V = {v1,...,vn } Output: The LE-lists L(·) of G …

1 Set δ(v) ← +∞ and L(v) ← œ for all v ∈ V … … 2 for r ← 1 to log2 n do r −1 r 3 parallel foreach i ∈ {2 ,..., 2 − 1} do Figure 2: An illustration for the new algorithm to construct FRT 4 Let Si = {u ∈ V | dG (vi ,u) < δ(u)} trees. We grow the tree a constant number of levels in expectation, Ð 5 parallel foreach u ∈ i Si do and repeat for log2 n rounds. 6 Let l(u) ← {vi | u ∈ Si } the work, depth, and I/O-complexity to compute SSSP on the 7 Sort l(u) based on dG (vi ,u) in descending ROSE mdoel. order and filter out vi that are not in ascending order There are two main algorithmic improvement in the 8 ( ) ( ) Append l u to the end of L u new ROSE algorithm, and we first overview the high-level 9 ( ) ← ( ) | ( ) δ u dG vj ,u vj is the last element in l u ideas and then go into details. The first insight is to avoid 10 return L(·) constructing the entire LE-lists L(·). We note that the LE-lists have O(n logn) elements whp, and the outer loop for r in line2 in Algorithm2 also runs for O(logn) iterations. In paths (SSSP) algorithm (e.g., Dijkstra’s algorithm or a dif- expectation, each iteration will search out for O(n) vertices ferent shortest path algorithm [27, 55, 68, 71]). Then the to add to the LE-lists. Hence, if we can directly integrate δ(·) values are updated based on the searches in this round. the search values of Si to the FRT tree, rather than wait until all elements in LE-lists L(·) are computed, then we can This algorithm requires O(WSP (n,m) logn) expected work bound the memory usage to be linear (some care needs to be and O(DSP (n,m) logn) depth [23] whp, where WSP(n,m) and taken because the O(n) is only in expectation). The second DSP(n,m) are the work and depth, respectively, to compute SSSP for a graph with n vertices and m edges on the ROSE insight is that, as mentioned above, the existing parallel algorithm [28] is not work optimal due to sorting the LE- model. We also care about the I/O-complexity QSP(n,m) to compute the SSSP. We note that for unweighted graphs, we lists lexicographically. We observe that the FRT tree does not can use BFS giving O(m) work, O(diam(G) logn) depth, and need an order, either for the leaf nodes or the interior nodes, O(n +m/B) I/O-complexity. Otherwise we could use a variety because when querying two vertices that correspond to two of solutions [27, 55, 68, 71], although they would have to be leaf nodes in the FRT tree, the output is the tree-path distance, analyzed on the ROSE model. and the ordering of the tree nodes does not matter. A key observation, for our purposes, is that after each An algorithmic overview. The pseudocode of the new ROSE round the algorithm has generated a prefix of each LE-list. FRT algorithm is given in Algorithm3. It runs for O(logn) In particular, after round r, for each vertex i ∈ {1,...,n}, rounds (line5), and in each round, we not only apply the the LE-list for i will consist of its LE-list with all entries SSSP searches (line7), but also directly integrate the search r for vertices with indices up to 2 . Each round extends each results in Si to the FRT tree and discard Si after the round. In Í prefix by an expected constant number of additional elements. expectation, the SSSP search result in each round (i.e., |Si |) We will use this observation to extend the FRT tree on each has size O(n), and later we will discuss how to deal with the round without keeping the full LE-lists. This will be done by rare cases when it is ω(n). 
When generating the FRT tree, we keeping a kind of “prefix” of the compressed FRT tree that is create only the tree nodes that will eventually show up in the updated on each round by adding appropriate descendants to final tree, and the final tree has no more than 2n −1 tree nodes. the tree from the previous round. All other intermediate steps use space proportional to the size Í of SSSP search ( |Si |), so in total we use only O(n) words. 5.2 Our Algorithm. We now describe a parallel work- The conversion from the SSSP search results to an FRT efficient algorithm that only requires O(n) temporary space, tree takes O(n logn) expected work and O(log2 n) depth, and thus can be implemented in the ROSE model. which is interesting even without considering the ROSE model. The depth can be further improved to O(logn) if THEOREM 5.1. A compressed FRT tree can be built from a we first do all searches in O(logn) rounds (line5), and metric based on a graph with n vertices and m edges using then run the rest of the algorithm (line8–21) in one pass, 2 O(WSP (n,m) logn) expected work, O(DSP (n,m) log n) depth although it then requires O(n logn) space. The algorithmic whp, and O(QSP (n,m) logn) expected I/O-complexity on the insight is that previous algorithms either use a point-centric ROSE model, where WSP(n,m), DSP(n,m), and QSP(n,m) are view [23, 26, 28] (for a specific vertex v, we consider which

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited Algorithm 3: The ROSE FRT algorithm the SSSP search from vi will reach the vertex set Si , and we consider and process the FRT tree nodes incident to Si . Input: A graph G = (V , E) with V = {v1,...,vn } Output: An FRT tree Tπ, β The key idea of the algorithm is to maintain on each round a partial compressed FRT tree (the top part), and for 1 Create a uniformly random permutation π : V → V of the each vertex u that has not been added yet, we keep a pointer vertices in G. to a tree node that will eventually be descendant of. In 2 Pick β ∈ [1, 2] using the probability density function τu u fβ (x) = 1/(x ln 2). particular, if we have processed points up to i, the compressed FRT tree will include all edges labeled with vertices up to i 3 We will maintain the following across rounds: (recall the edges are created from the LE-lists, and contain a 1. A partial FRT tree (i.e., the top part), which we will build vertex and a distance to that vertex). For instance, in Figure2, from the root node into the full tree. after round 1, we generate the tree nodes incident to the search 2. For each v ∈ V , a pointer τ to the leaf of the partial FRT from v1. Every other vertex vi is reached by the search from v ∗ tree that will eventually contain v. All τ initially point β ·d v v1, and diverges at level blog2 ( ) c. We create all nodes 4 to the root. dG v1,vi at these levels (line 13), and distribute τvi for each vi to the 3. For each v ∈ V , the smallest distance to v from any corresponding level. As shown in Figure2, we repeat this vertex processed so far, δ(v). All δ(v) are initially set to process for log2 n rounds, and generate the FRT tree. +∞. Another interpretation of this algorithm is that, on each round the BGSS algorithm generates extensions to the LE- 5 for r ← 1 to log2 n do r −1 r 6 parallel foreach i ∈ {2 ,..., 2 − 1} do lists using SSSP searches from a new set of points. We 7 Let Si = {u ∈ V | dG (vi ,u) < δ(u)} apply the same strategy here, but now immediately insert the 8 parallel foreach u ∈ V do elements into the partial FRT tree. Roughly speaking, this βd ∗ 9 Let l(u) ← {hvi , ρi = blog ci : 1 ≤ i ≤ is done by grouping the extensions for each vertex u by τ 2 dG (vi,u) u n | u ∈ Si } (the current node pointed to by u in the partial FRT tree), and 10 Sort l(u) based on ρi in descending order and filter building the part of the tree that extends that leaf, and then out vi that are not in ascending order updating all vertices to point to their new τu , which will be 11 Now group vertices together based on the pair hτu ,l(u)i, and just keep one vertex from each group (duplicates a descendant of the old one. Note that to do this we need removed), and call the resulting set S¯ (can be done with to group by the pair consisting of τu and the contents of the a semisort) extensions of the LE-lists. This can be done with a semisort 12 parallel foreach u ∈ S¯ do since the ordering does not matter. 0 13 Let l (u) ← {hv , ρ i,..., hv , ρ i,v } (j) u1 u1 uj−1 uj−1 uj In the rest of this section, we first describe the algorithm | ≤ ≤ | ( )| and 00 ( ) ← 1 j l u l(j) u in more detail, prove the correctness, and analyze the cost {hvu1 , ρu1 i,..., hvuj , ρuj i} | 0 ≤ j ≤ |l(u)| bounds. Then we discuss how to guarantee the space usage 14 For all u ∈ S¯, j ≤ |l(u)|, using a semisort collect together to be O(n) so that it works in the ROSE model. 
Putting all based on key hτ ,l 0 (u)i along with value ρ , and u (j) uj pieces together, we achieve the bounds in Theorem 5.1. count the appearance of the keys h , 0 ( )i and τu l(j) u h , 00 ( )i (also using semisort) τu l(j) u 0 15 parallel foreach key h , ( )i after semisort do The search-centric conversion algorithm. We now present τu l(j) u 16 Sort the value ρuj in increasing order our algorithm to convert from the SSSP search results to an 0 00 17 if has children or |h , ( )i| < |h , ( )i| FRT tree in O(n logn) expected work and O(logn) depth. The τu τu l(j) u τu l(j−1) u then algorithm uses the search-centric view and uses semisort as 18 Create a tree node for the first entry the crucial building block. Recall that semisort takes the n key- 19 Create a tree node for every entry but the first, point value pairs as input, and groups all pairs with the same key the root of each node to the predecessor using linear expected work and O(logn) depth whp [20, 48]. 20 parallel foreach u ∈ Ð S do i i To do so, we borrow the concepts of a partition se- 21 Update τu to the current corresponding node quence [28] to better illustrate our algorithm. Given a permu- tation π and a parameter β, the partition sequence of a vertex (u) (u) ( ) { ( ) | u, denoted by σπ, β , is the sequence σπ, β i = min π v δ¯−i v ∈ V ,dG (u,v) ≤ β · 2 } for i = 0,..., δ¯, i.e. point v has SSSP searches can reach v), or a level-based view [43] the highest priority among vertices up to level i. Then the (consider the partition with different search radii and refine FRT tree is just a radix tree for the trie constructed based on the partition). The new algorithm uses the search-centric view: the partition sequence for each vertex. We now explain how for a specific SSSP search from vi , we check the reach set the searches in each round modify the partition sequence and and see what tree node this search creates. More specifically, therefore the FRT tree nodes.

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited OBSERVATION 1. Vertex vi creates the node when there discussed in the previous paragraph, so every node created by exists two nodes vj and vk and the partition sequences of Algorithm3 is necessary. vj and vk diverge at vi . Fitting in the ROSE model. We have shown how to convert More accurately, σ (vj ) and σ (vk ) have the same prefix, and the search results Si to FRT tree nodes. We know that they diverge at level x + 1, i.e., σ (vj )(x) = σ (vk )(x) = v , Ín i |Si | = O(n logn) whp and there are log n rounds in (v ) (v ) i=1 2 and σ j (x + 1) , σ k (x + 1) (More details are shown the algorithm, and in expectation the overall search size is in [28]). We now show Algorithm3 creates all FRT tree O(n) [23]. It is easy to check that all steps in line8 to line 21 nodes correctly, and does not create more nodes (otherwise Í use space proportional to |Si | in one round. If we have cn the tree size can be larger than O(n) and does not fit in the internal memory words for a reasonable large constant c, it ROSE model). We first see what FRT tree nodes vi can create. is likely that the algorithm will run just fine. However, it is Í possible that |Si | is large (i.e., ω(n)) in one round, and of BSERVATION Vertex can only create nodes below the O 2. vi course our algorithm should deal with it rather than crashing. nodes created by when < , because of the definition of vj j i The solution is to dynamically adjust the batch size in partition sequence. line6. We assume we have a budget for the internal memory ≥ Namely, the FRT tree can be generated incrementally, by size that can hold cn elements for some constant c 1. Once the SSSP searches in one round (line7) exceed this size, we adding the leaf nodes created by v ,v ,...,v . Recall 1 2 n stop and shrink the range by a half, and repeat if necessary. the case σ (vj ) and σ (vk ) diverge at level x + 1. There This will lead to additional work when the resize is triggered, are two reasons that a FRT tree node is generated: (1) but it will not affect the work asymptotically. This is because σ (vj )(x + 1) = σ (vj )(x) , σ (vk )(x + 1) (or symmetric), and for v ’s search in round r, E[|S |] = n/2r −1. When each (2) σ (vj )(x + 1) , σ (vj )(x) , σ (vk )(x + 1). For the first case, i i (v ) (v ) (v ) (v ) resizing is triggered, we divide the range into two equal-size let vi = σ j (x + 1) = σ j (x) , σ k (x + 1). Since σ j (vk ) halves and each side has O(n) SSSP search size in expectation. and σ diverge at level x + 1, we must have τvj = τvk and they share a common prefix before the first appearance Therefore, the failed SSSP searches do not increase the work asymptotically. of vi , and for vi ’s search, they have different σi . Hence, in the semisort in line 14, they share the same key (τvj ) but have different values, so the tree node will be created 6 Strongly Connected Components either in line 18 or line 19. For the second case, suppose In this section, we discuss the ROSE SCC algorithm that (vj ) (vj ) (v ) σ (x) = va, σ (x +1) = vb , and σ k (x +1) = vc . Based can achieve the same asymptotic work and slightly larger on Observation2, we must have a < b and a < c. By the time depth as compared to the best existing parallel SCC algo- when the algorithm searches from vb and vc , the parent node rithm without the semi-external constraint. The best exist- is already created by va or va’s ancestor. 
Suppose that vb and ing parallel SCC algorithm is referred to as the BGSS algo- vc search in different rounds, and WLOG vb ’s search occurs rithm [23], which takes O(WR (n,m) logn) expected work and 2 earlier. Then in vb ’s search, vc will stay in the parent node, has O(DR (n,m) log n) depth whp on an input graph with n |h 0 ( )i| |h 00 ( )i| ( ) ( ) so the condition τu ,l(j) u < τu ,l(j−1) u in line 17 is vertices and m edges. WR n,m and DR n,m are the work always true and the algorithm will generate a new node for and depth, respectively, for a directed reachability algorithm vj . Then in vc ’s search, we know the parent node for va has from a single vertex that visits n vertices, with m out-edges at least one child already, so again the condition in line 17 is from those vertices. However, this algorithm requires O(m) satisfied, and a new node for vk will be generated. If vb and read-write memory because it explicitly removes edges dur- vc search in the same round, then both will be the first case, ing the execution of the algorithm. The high-level idea in the and a new tree node for each vertex will be constructed in ROSE algorithm is to avoid this edge removal process. line 18. Before going into the details of the new ROSE SCC We have shown that all the tree nodes will be created. algorithm, we will first review the BGSS algorithm on which We now show that no additional tree node is created. First, the new algorithm is based. At the beginning, all vertices each node other than the leaf nodes have at least two children are uniformly randomly permuted. BGSS runs in rounds, so there are no uncompressed nodes. This is because nodes and in round r, it applies 2r −1 forward reachability queries created in line 19 have one child already (other than the and 2r −1 backward reachability queries for vertices with last one, which is a leaf), and at least another node will indices 2r −1,..., 2r − 1. These reachability searches find be filled in line 18. For nodes created in line 18, they are the SCCs that these vertices belong to, and cuts edges to either not the first child of the parent node, or the condition partition vertices into disjoint subsets. In particular any edge |h 0 ( )i| |h 00 ( )i| τu ,l(j) u < τu ,l(j−1) u guarantees that at least a later is removed if any of the reachability queries visited one of its child node will be filled in later. Second, the nodes created endpoints but not the other. All SCCs that the vertices belong in line 18 and line 19 are due to either of the two cases to are removed, and the remaining partitioned graph is left for

Copyright © 2021 by SIAM Unauthorized reproduction of this article is prohibited Algorithm 4: The ROSE SCC algorithm edges are cut exactly when the labels for either direction differ Input: A directed graph G = (V , E) with on the two end points. This is because different labels imply a search visited one endpoint but not the other, and equal labels V = {v1,...,vn }. Output: The set of strongly connected components of implies the last search, and hence all previous searches on G. either, visited both. Consider the following case in one parallel round: 1 // REACHABILITY in line7 and8 for vi only searches vertices x and y can reach z is the forward direction, and vertices vj such that Mi = Mj the search order is first x then y within this round, and z in a 2 M ← {(0, œ), (0, œ),..., (0, œ)} future round. Furthermore, assume x, y, and z are in separate 3 Sscc,Vscc ← {} SCCs. In sequential BGSS, the search from y will reach z iff 4 for r ← 1 to log2 n do y is reachable from x, otherwise x’s search will disconnect r −1 r 5 parallel foreach i ∈ {2 ,..., 2 − 1} do (separate) y and z before y’s search. We will take advantage 6 if vi ∈ Vscc then Break; of this property to generate our sequential labels even though + 7 Si ← FORWARD-REACHABILITY(vi ) we run the searches in parallel. In particular we can look − 8 Si ← BACKWARD-REACHABILITY(vi ) at all searches that reach a vertex z in the parallel (batch) 9 Generate key-value pairs version, scan through those in sequential order (there are only + + + (v , (vi , +)) | v ∈ Si and a constant number in expectation and logarithmic whp), and − − − (v , (vi , −)) | v ∈ Si determine which would have been the last to visit x in the 10 Semisort all the pairs, and let Li be the set of sequential version. values for pairs with keys vi The ROSE algorithm is described in Algorithm4. The r −1 r 11 parallel foreach i ∈ {2 ,..., 2 − 1} do labels are stored in an array M, and the algorithm runs the 12 if not ({(vj , +), (vj , −)} ⊆ Li ), j < i then rounds in parallel as in BGSS. For all rounds and for all + − 13 Sscc ← Sscc ∪ {Si ∩ Si } vertices reached in reachability searches in a round, we create + − 14 Vscc ← Vscc ∪ (Si ∩ Si ) a visited-source pair (line9) and semisort by visited (line 10). 15 parallel foreach Li | vi ∈ V \Vscc, Li , {} do More precisely, for each vertex vi , we collect Li , the indices 16 Sort Li based on the label of the vertices in of all searches that have vi in their reached set. Now for increasing order vertices in Li , we sort by source index (line 16), and set 17 (vl ,sl ) ← Li (1) the temporary label (vl ,sl ) to the earliest (line 17). We now 18 for j ← 2 to |Li | do iterate over the rest in increasing order for each search index 19 (vc ,sc ) ← Li (j) vc . If vc is reachable from vl ’s search in sl direction, then 20 if vc is reachable from vl in sl direction set the current label to (vc ,sc ) (line 20); otherwise leave it. then (vl ,sl ) ← (vc ,sc ) ; Whenever we do not change the label, this corresponds to a 21 Mi ← (vl ,sl ) visit that happened in the parallel algorithm that would not 22 return Sscc happen in the sequential one (vl ’s search on sl direction would have separated them). We note that vi can be reached by vl in at most one direction. Otherwise, vi and vl are strongly connected and v is removed from V already in line 14. the next round. The algorithm iterates for log n rounds and i scc 2 Hence, there will be no duplicates with the same vertex in L . finds all SCCs of a graph. 
THEOREM 6.1. The ROSE SCC algorithm takes O(W_R(n,m) log n) expected work, O(D_R(n,m) log^2 n) depth whp, and O(Q_R(n,m) log n) I/O complexity, where W_R(n,m), D_R(n,m), and Q_R(n,m) are the work, depth, and I/O complexity to compute directed reachability in the ROSE model.

Proof. Similar to the ROSE FRT algorithm, if the searches for all rounds reach O(n) vertices, ROSE SCC applies no additional reachability searches beyond the sequential BGSS, which yields the work and I/O bounds. The additional steps compute the vertex labels, which have size O(n), so they fit in the internal memory. Semisorting in line 10 takes linear expected work and logarithmic depth whp.

Sorting the L_i lists has the same cost as sorting l(u) in ROSE FRT, which also takes linear expected work and logarithmic depth whp. Because whp all the reachability searches touch O(n log n) vertices in total [23], all the additional work in ROSE SCC is hidden by the cost of the reachability searches. Regarding the small possibility that a search in some round of ROSE SCC reaches ω(n) vertices, we can trigger resizing similarly to ROSE FRT, which does not affect the work asymptotically, but increases the depth of ROSE SCC by a logarithmic factor over BGSS SCC.

Using BFS for reachability, the costs are O(m log n) expected work, O(diam(G) log^3 n) span whp, and O((n + m/B) log n) expected I/O complexity. This matches the bounds shown in Table 1.

7 Existing Algorithms in ROSE

In this paper, we also consider a set of 15 parallel graph algorithms recently designed in our previous work on Sage [42]. We will show that each of these algorithms (other than Bellman-Ford and single-source widest path) is an O(1)-read ROSE algorithm (Section 3), and thus by Theorem 3.1 has low I/O complexity (at worst O(n + m/B)) on ROSE. The algorithms inherit the work and depth bounds from [42]. Table 1 (below the mid-line) summarizes the results. As shown in the table, nearly all the algorithms achieve our goals of being work-efficient, highly parallel, and having low I/O complexity. In what follows, we briefly summarize how to show that these algorithms are O(1)-read ROSE algorithms. We refer the interested reader to [41, 42] for more details on these algorithms.

Shortest Path Problems. We consider six shortest-path problems: breadth-first search (BFS), integral-weight SSSP (wBFS), general-weight SSSP (Bellman-Ford), single-source betweenness centrality, single-source widest path, and O(k)-spanner. First, we observe that BFS, wBFS, single-source betweenness centrality, and O(k)-spanner all process each vertex at most a constant number of times. For example, for BFS and wBFS, a vertex v is processed at most once, in the round when the (weighted) breadth-first search frontier contains it. Similarly, in single-source betweenness centrality, a vertex is processed at most twice: once in the forward pass that computes the number of shortest paths to each vertex, and once in the backward pass that computes dependency scores [31, 62]. Finally, the O(k)-spanner algorithm works by computing an LDD, which we discuss below, and mapping over the edges incident to all vertices in parallel, and is thus also an O(1)-read ROSE algorithm. For Bellman-Ford, note that in the worst case the algorithm can process a vertex diam(G) many times, and thus it is an O(diam(G))-read ROSE algorithm. Single-source widest path can be implemented using an approach similar to either wBFS or Bellman-Ford; Table 1 shows the bounds for the Bellman-Ford-based implementation. We note that the wBFS, Bellman-Ford, and single-source widest path algorithms all use the priority-write (PW) primitive.
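For concreteness, here is a minimal sequential Python sketch (not the code from [41, 42]) of the frontier-based pattern behind the O(1)-read claim for BFS: a vertex's edges are fetched only in the single round in which the frontier contains it. The function fetch_edges is a hypothetical stand-in for the FETCHEDGES primitive, and the parent array is the O(n) words of internal memory.

    def bfs(n, fetch_edges, source):
        # parent[] lives in internal memory; the graph is only touched
        # through fetch_edges, which reads one vertex's edges per call.
        parent = [-1] * n
        parent[source] = source
        frontier = [source]
        while frontier:
            nxt = []
            for v in frontier:                 # processed in parallel in the paper
                for u in fetch_edges(v):       # the edges of v are fetched exactly once
                    if parent[u] == -1:
                        parent[u] = v          # a priority-write/CAS in the parallel version
                        nxt.append(u)
            frontier = nxt
        return parent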
Connectivity Problems. We consider four connectivity problems: low-diameter decomposition (LDD), connectivity, spanning forest, and biconnectivity. The LDD algorithm works similarly to BFS, loading the edges incident to a vertex only in the round where it is processed, either as an LDD cluster center or as a vertex on the boundary of an LDD cluster [42, 58]. Because each vertex is processed exactly once, it is an O(1)-read ROSE algorithm. We also consider several connectivity algorithms that build on LDD, including the O(k)-spanner algorithm described above. Our connectivity and spanning forest algorithms are based on the algorithm by Shun et al. [64], and our biconnectivity algorithm is from Dhulipala et al. [41]. All three algorithms use the modifications described in Dhulipala et al. [42] to run in O(n) space whp. We observe that all three of these algorithms process the edges incident to each vertex in the original graph only a constant number of times whp, and are thus expected O(1)-read ROSE algorithms. We note that the biconnectivity algorithm uses the fetch-and-add (FA) primitive when performing leaffix and rootfix scans.

Covering Problems. We consider two covering problems: maximal independent set (MIS) and graph coloring. As shown in Dhulipala et al. [42], both algorithms use only O(n) words of internal memory. Here, we observe that both algorithms process the edges incident to each vertex only once. For MIS, a vertex is processed either when it is added to the MIS by the algorithm, or in the round where it is removed by one of its neighbors joining the MIS. Similarly, for coloring, a vertex is processed only in the round where it is ready to be colored. Thus, both algorithms are O(1)-read ROSE algorithms. We note that the MIS and graph coloring algorithms use the FA primitive.

Substructure Problems. We consider two substructure-based problems from prior work: k-core and approximate densest subgraph. Dhulipala et al. [42] previously argued that both algorithms can be implemented using only O(n) words of internal memory. Here, we observe that their algorithms are actually O(1)-read ROSE algorithms. Specifically, both algorithms process the edges incident to a vertex exactly once, in the round when the vertex is peeled. We note that both algorithms use the FA primitive.

Eigenvector Problems. Lastly, we consider the problem of computing the PageRank vector of the graph. Our algorithm is based on the classic PageRank algorithm [32] and follows the implementation by Dhulipala et al. [42]. Here, we observe that this algorithm processes all of the edges in the graph in every iteration, leading to O(m/B) I/O complexity per iteration.
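As an illustration of this access pattern, below is a minimal sequential Python sketch (not the implementation from [42]) of the PageRank iteration structure: every vertex's out-edges are fetched once per iteration through a hypothetical fetch_edges stand-in for FETCHEDGES, so each iteration reads all m edges, i.e., O(m/B) I/Os in the ROSE model.

    def pagerank_iterations(n, fetch_edges, iters=10, damping=0.85):
        p = [1.0 / n] * n
        for _ in range(iters):
            nxt = [(1.0 - damping) / n] * n
            for v in range(n):                 # every vertex, every iteration
                nbrs = fetch_edges(v)          # all m edges are read each iteration
                if nbrs:
                    share = damping * p[v] / len(nbrs)
                    for u in nbrs:
                        nxt[u] += share
                # (rank mass of dangling vertices is dropped in this sketch)
            p = nxt
        return p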

8 Conclusion

We have introduced the Read-Only Semi-External (ROSE) model for graph algorithms. We have analyzed 18 algorithms in this model, and have shown that they are work-efficient, highly parallel, and have low I/O complexity. Our algorithms make use of the FETCHEDGES primitive to traverse the neighbors of vertices, and by analyzing the number of times this primitive is called, we are able to obtain strong I/O bounds for the algorithms in the ROSE model. Finally, our algorithms for triangle counting, FRT trees, and strongly connected components are specially designed for the ROSE model, with novel techniques. Future work includes conducting experimental studies of ROSE algorithms on real machines and proving lower bounds on the I/O complexity in the ROSE model.

Acknowledgements

This research was supported by DOE Early Career Award DE-SC0018947, NSF CAREER Award CCF-1845763, NSF grants CCF-1725663, CCF-1910030, CCF-1919223 and CCF-2028949, a Google Faculty Research Award, a VMware University Research Fund Award, the Parallel Data Lab (PDL) Consortium (Alibaba, Amazon, Datrium, Facebook, Google, Hewlett-Packard Enterprise, Hitachi, IBM, Intel, Microsoft, NetApp, Oracle, Salesforce, Samsung, Seagate, and TwoSigma), DARPA SDH Award HR0011-18-3-0007, and the Applications Driving Algorithms (ADA) Center, a JUMP Center co-sponsored by SRC and DARPA.

References

[1] J. Abello, A. L. Buchsbaum, and J. R. Westbrook. A functional approach to external graph algorithms. Algorithmica, 32(3):437–458, 2002.
[2] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321–347, 2002.
[3] A. Aggarwal and J. S. Vitter. The Input/Output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, 1988.
[4] K. Agrawal, J. T. Fineman, K. Lu, B. Sheridan, J. Sukha, and R. Utterback. Provably good scheduling for parallel programs that use data structures through implicit batching. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 84–95, 2014.
[5] B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. In Programming Models for Massively Parallel Computers, pages 116–123, 1993.
[6] A. Andoni, C. Stein, and P. Zhong. Parallel approximate undirected shortest paths via low hop emulators. In ACM Symposium on Theory of Computing (STOC), pages 322–335, 2020.
[7] L. Arge, M. T. Goodrich, and N. Sitchinava. Parallel external memory graph algorithms. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–11, 2010.
[8] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems, 34(2):115–144, 2001.
[9] A. Azad, G. A. Pavlopoulos, C. A. Ouzounis, N. C. Kyrpides, and A. Buluç. HipMCL: A high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research, 46(6):e33, 2018.
[10] B. Banyassady, M. Korman, and W. Mulzer. Computational geometry column 67. SIGACT News, 49(2), 2018.
[11] Y. Bartal. On approximating arbitrary metrices by tree metrics. In ACM Symposium on Theory of Computing (STOC), pages 161–168, 1998.
[12] N. Ben-David, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, C. McGuffey, and J. Shun. Parallel algorithms for asymmetric read-write costs. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 145–156, 2016.
[13] N. Ben-David, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, C. McGuffey, and J. Shun. Implicit decomposition for write-efficient connectivity algorithms. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 711–722, 2018.
[14] M. Besta and T. Hoefler. Survey and taxonomy of lossless graph compression and space-efficient graph representations. arXiv preprint arXiv:1806.01799, 2018.
[15] M. Birn, V. Osipov, P. Sanders, C. Schulz, and N. Sitchinava. Efficient parallel and external matching. In European Conference on Parallel Processing (Euro-Par), pages 659–670, 2013.
[16] G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 501–510, 2008.
[17] G. E. Blelloch, D. Ferizovic, and Y. Sun. Just join for parallel ordered sets. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 253–264, 2016.
[18] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, and J. Shun. Sorting with asymmetric read and write costs. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 1–12, 2015.
[19] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, Y. Gu, and J. Shun. Efficient algorithms with asymmetric read and write costs. In European Symposium on Algorithms (ESA), pages 14:1–14:18, 2016.
[20] G. E. Blelloch, J. T. Fineman, Y. Gu, and Y. Sun. Optimal parallel algorithms in the binary-forking model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 89–102, 2020.
[21] G. E. Blelloch, P. B. Gibbons, and H. V. Simhadri. Low depth cache-oblivious algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 189–199, 2010.
[22] G. E. Blelloch and Y. Gu. Improved parallel cache-oblivious algorithms for dynamic programming and linear algebra. In SIAM Symposium on Algorithmic Principles of Computer Systems (APOCS), pages 105–119, 2020.
[23] G. E. Blelloch, Y. Gu, J. Shun, and Y. Sun. Parallelism in randomized incremental algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 467–478, 2016.

[24] G. E. Blelloch, Y. Gu, J. Shun, and Y. Sun. Parallel write-efficient algorithms and data structures for computational geometry. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 235–246, 2018.
[25] G. E. Blelloch, Y. Gu, J. Shun, and Y. Sun. Randomized incremental convex hull is highly parallel. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 103–115, 2020.
[26] G. E. Blelloch, Y. Gu, and Y. Sun. Efficient construction of probabilistic tree embeddings. In Intl. Colloq. on Automata, Languages and Programming (ICALP), pages 26:1–26:14, 2017.
[27] G. E. Blelloch, Y. Gu, Y. Sun, and K. Tangwongsan. Parallel shortest-paths using radius stepping. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 443–454, 2016.
[28] G. E. Blelloch, A. Gupta, and K. Tangwongsan. Parallel probabilistic tree embeddings, k-median, and buy-at-bulk network design. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 205–213, 2012.
[29] R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM J. on Computing, 27(1), 1998.
[30] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In International World Wide Web Conference (WWW), pages 595–601, 2004.
[31] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
[32] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In International World Wide Web Conference (WWW), pages 107–117, 1998.
[33] E. Carson, J. Demmel, L. Grigori, N. Knight, P. Koanantakool, O. Schwartz, and H. V. Simhadri. Write-avoiding algorithms. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 648–658, 2016.
[34] S. Chakraborty, K. Sadakane, and S. Satti. Optimal in-place algorithms for basic graph problems. In Int'l Workshop on Combinatorial Algorithms, pages 126–139. Springer, 2020.
[35] T. M. Chan, J. I. Munro, and V. Raman. Selection and sorting in the "restore" model. ACM Trans. Algorithms, 14(2):11:1–11:18, 2018.
[36] N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14(1):210–223, 1985.
[37] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441–453, 1997.
[38] M. B. Cohen, R. Kyng, G. L. Miller, J. W. Pachocki, R. Peng, A. B. Rao, and S. C. Xu. Solving SDD linear systems in nearly m log^{1/2} n time. In ACM Symposium on Theory of Computing (STOC), pages 343–352, 2014.
[39] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms (3rd edition). MIT Press, 2009.
[40] L. Dhulipala, G. E. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 293–304, 2017.
[41] L. Dhulipala, G. E. Blelloch, and J. Shun. Theoretically efficient parallel graph algorithms can be fast and scalable. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 393–404, 2018.
[42] L. Dhulipala, C. McGuffey, H. Kang, Y. Gu, G. E. Blelloch, P. B. Gibbons, and J. Shun. Sage: Parallel semi-asymmetric graph algorithms for NVRAMs. PVLDB, 13(9):1598–1613, 2020.
[43] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. In ACM Symposium on Theory of Computing (STOC), pages 448–455, 2003.
[44] J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207–216, 2005.
[45] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 285–298, 1999.
[46] G. Gill, R. Dathathri, L. Hoang, R. Peri, and K. Pingali. Single machine graph analytics on massive datasets using Intel Optane DC persistent memory. PVLDB, 13(8):1304–1318, 2020.
[47] Y. Gu, O. Obeya, and J. Shun. Parallel in-place algorithms: Theory and practice. In SIAM Symposium on Algorithmic Principles of Computer Systems (APOCS), 2021. To appear.
[48] Y. Gu, J. Shun, Y. Sun, and G. E. Blelloch. A top-down parallel semisort. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 24–34, 2015.
[49] Y. Gu, Y. Sun, and G. E. Blelloch. Algorithmic building blocks for asymmetric memories. In European Symposium on Algorithms (ESA), pages 44:1–44:15, 2018.
[50] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
[51] R. Jacob, T. Lieber, and N. Sitchinava. On the complexity of list ranking in the parallel external memory model. In International Symposium on Mathematical Foundations of Computer Science, pages 384–395. Springer, 2014.
[52] R. Jacob and N. Sitchinava. Lower bounds in the asymmetric external memory model. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 247–254, 2017.
[53] J. JaJa. Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.
[54] M. Khan, F. Kuhn, D. Malkhi, G. Pandurangan, and K. Talwar. Efficient distributed approximation algorithms via probabilistic tree embeddings. Distributed Computing, 25(3):189–205, 2012.
[55] P. N. Klein and S. Subramanian. A randomized parallel algorithm for single-source shortest paths. Journal of Algorithms, 25(2):205–220, 1997.
[56] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. 2014.
[57] K. Mehlhorn and U. Meyer. External-memory breadth-first search with sublinear I/O. In European Symposium on Algorithms (ESA), pages 723–735, 2002.
[58] G. L. Miller, R. Peng, and S. C. Xu. Parallel graph decompositions using random shifts. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 196–203, 2013.
[59] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.
[60] R. Pearce, M. Gokhale, and N. M. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11, 2010.

[61] H. Räcke. Optimal hierarchical decompositions for congestion minimization in networks. In ACM Symposium on Theory of Computing (STOC), pages 255–264, 2008.
[62] J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 135–146, 2013.
[63] J. Shun, G. E. Blelloch, J. T. Fineman, and P. B. Gibbons. Reducing contention through priority updates. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 152–163, 2013.
[64] J. Shun, L. Dhulipala, and G. E. Blelloch. A simple and practical linear-work parallel algorithm for connectivity. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 143–153, 2014.
[65] J. Shun, L. Dhulipala, and G. E. Blelloch. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In Data Compression Conference (DCC), pages 403–412, 2015.
[66] J. Shun and K. Tangwongsan. Multicore triangle computations without tuning. In IEEE International Conference on Data Engineering (ICDE), pages 149–160, 2015.
[67] N. Sitchinava. Computational geometry in the parallel external memory model. SIGSPATIAL Special, 4(2):18–23, 2012.
[68] T. H. Spencer. Time-work tradeoffs for parallel algorithms. J. ACM, 44(5):742–778, 1997.
[69] P. Sun, Y. Wen, T. N. B. Duong, and X. Xiao. GraphMP: An efficient semi-external-memory big graph processing system on a single machine. In IEEE International Conference on Parallel and Distributed Systems (ICPADS), pages 276–283, 2017.
[70] Y. Sun, D. Ferizovic, and G. E. Blelloch. PAM: Parallel augmented maps. In ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 290–304, 2018.
[71] J. D. Ullman and M. Yannakakis. High-probability parallel transitive-closure algorithms. SIAM Journal on Computing, 20(1):100–125, 1991.
[72] A. van Renen, L. Vogel, V. Leis, T. Neumann, and A. Kemper. Persistent memory I/O primitives. In International Workshop on Data Management on New Hardware, page 12. ACM, 2019.
[73] S. D. Viglas. Write-limited sorts and joins for persistent memory. PVLDB, 7(5):413–424, 2014.
[74] J. S. Vitter. External memory algorithms and data structures. ACM Comput. Surveys, 33(2):209–271, 2001.
[75] D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. FlashGraph: Processing billion-node graphs on an array of commodity SSDs. In USENIX Conference on File and Storage Technologies (FAST 15), pages 45–58, 2015.
[76] D. Zheng, D. Mhembere, V. Lyzinski, J. T. Vogelstein, C. E. Priebe, and R. Burns. Semi-external memory sparse matrix multiplication for billion-node graphs. IEEE Trans. Parallel Distrib. Syst., 28(5):1470–1483, 2017.
