Models and Algorithms Under Asymmetric Read and Write Costs
Guy E. Blelloch*, Jeremy T. Fineman†, Phillip B. Gibbons*, Yan Gu* and Julian Shun‡
*Carnegie Mellon University   †Georgetown University   ‡U.C. Berkeley

Abstract—In several emerging non-volatile technologies for main memory (NVRAM) the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs leads to a desire for "write-efficient" algorithms that minimize the number of writes to the NVRAM. While several prior works have explored write-efficient algorithms for databases or for the unique properties of NAND Flash, our ongoing work seeks to develop a broader theory of algorithm design for asymmetric memories. This talk will highlight our recent progress on models, algorithms, and lower bounds for asymmetric memories [1], [2], [3]. We extend the classic RAM model to the asymmetric case by defining the (M, ω)-ARAM, which consists of a large asymmetric memory and a much smaller symmetric memory of size M, both random access, such that for the asymmetric memory, writes cost ω > 1 times more than reads. We present algorithms for search trees, priority queues, sorting, dynamic programming, BFS- and DFS-related graph algorithms, and shortest paths. We also present novel lower bounds for several problems that reveal fundamental limits on obtaining write-efficient algorithms for these problems. Finally, we propose and study models for asymmetric memories that consider block transfers and/or parallel processing.

I. INTRODUCTION

Fifty years of algorithms research has focused on settings in which reads and writes (to memory) have similar cost. But what if reads and writes to memory have significantly different costs? How would that impact algorithm design? What is the correct model to measure the cost of an algorithm on such memories? These questions are coming to the fore with the arrival of new non-volatile memory technologies, such as phase-change memory (PCM), spin-torque transfer magnetic RAM (STT-RAM), and memristor-based resistive RAM (ReRAM), which have the property that the cost of reading is significantly cheaper than the cost of writing, in terms of lower energy, lower latency, and higher (per-module) bandwidth. Such memories will appear in the marketplace in the first half of 2016 [4] as a block device, and after that as a DIMM off the processor's memory bus.

This talk highlights our recent results [1], [2], [3] providing a first step towards answering these fundamental questions about asymmetric memories. We introduce a simple model for studying such memories, and a number of new results.

Definition 1: We define the (M, ω)-Asymmetric RAM (ARAM), comprised of a symmetric small-memory of size M, an asymmetric large-memory of unbounded size, and an integer write cost ω. The ARAM cost Q is the number of reads from large-memory plus ω times the number of writes to large-memory. The time T is Q plus the number of reads and writes to small-memory.

We focus on settings in which the problem input fits in the large-memory (and hence can be viewed as "unbounded" size) but not in the small-memory (i.e., M ≪ n, where n is the input size and M is measured in units of the size of an input item). The large-memory models the NVRAM, while the small-memory models either DRAM, if present, or cache memory. The time T extends the time complexity of the RAM model to the asymmetric setting. The cost Q accounts for just the data transfers between the small-memory and the large-memory, charging ω for each write and 1 for each read. While T has the advantage of being most similar to the RAM model, it hides the performance gap between different levels of the memory hierarchy. Thus the Q metric may be more relevant in practice, as it focuses on reads and writes to the last relevant level of the memory hierarchy. We use a parameter ω to account for the higher cost of writes so that we can study the dependence of algorithms on the gap between writes and reads. We view ω as being significantly larger than 1, as factors of up to 1–2 orders of magnitude have been reported in the literature (see [1] and the references therein).

Prior work focusing on asymmetric read and write costs for non-volatile memory has explored system issues (e.g., [5]), database algorithms (e.g., [6]), or the unique properties of NAND Flash (e.g., [7]), such as its limited endurance and large erase operation.
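To make Definition 1 concrete, here is a minimal cost-accounting sketch in Python. It is our illustration rather than code from [1], [2], [3], and every name in it (ARAMCost, read_large, and so on) is a hypothetical choice, not part of the model.

    class ARAMCost:
        """Cost accounting for the (M, omega)-ARAM of Definition 1:
        Q = (reads from large-memory) + omega * (writes to large-memory);
        T = Q + (reads and writes to small-memory)."""

        def __init__(self, M, omega):
            assert omega > 1, "writes cost omega > 1 times more than reads"
            self.M = M              # capacity of the symmetric small-memory
            self.omega = omega      # integer write cost for the large-memory
            self.large_reads = 0
            self.large_writes = 0
            self.small_ops = 0

        def read_large(self, k=1):   # k reads from the asymmetric memory
            self.large_reads += k

        def write_large(self, k=1):  # k writes to the asymmetric memory
            self.large_writes += k

        def op_small(self, k=1):     # k reads or writes in the small-memory
            self.small_ops += k

        def Q(self):
            return self.large_reads + self.omega * self.large_writes

        def T(self):
            return self.Q() + self.small_ops

For instance, with omega = 100, an algorithm that performs 10^6 large-memory reads but only 10^4 large-memory writes has Q = 2 * 10^6: the writes, though a hundredfold fewer, account for half the cost.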
II. NEW ALGORITHMS AND LOWER BOUNDS

Search Trees and Priority Queues. We consider the efficiency of balanced binary search trees on the (M, ω)-ARAM. Red-black trees with appropriate rebalancing rules require only O(1) amortized time per update (insertion or deletion) once the location for the key is found. For a tree of size n, finding a key's location uses O(log n) reads but no writes, so the total amortized cost is Q = T = O(ω + log n) per update. Since priority queues can be implemented with a binary search tree, insertion and delete-min have the same bounds. Note that these good bounds are obtained even for M = O(1).

Sorting. Many classic sorting algorithms like quicksort and mergesort require Θ(n log n) reads and writes, and hence would cost O(ωn log n) on the (M, ω)-ARAM. However, it is well known that sorting can be implemented by first inserting each key into a balanced search tree, and then traversing the tree in-order to write out the final sorted order. Using the above-mentioned binary search tree, the sorting algorithm based on this method requires only O(n) writes with no increase in reads (O(n log n)). This algorithm is asymptotically optimal, since comparison sorting n elements requires at least Ω(n log n) comparisons, and n writes are necessary to produce the sorted output in contiguous memory. On the (M, ω)-ARAM, Q = T = O(n log n + ωn), even for M = O(1).

It is also interesting to consider sorting in parallel on a shared-memory multicore. In [1] we discussed a parallel version of the sorting algorithm, which has asymptotically the same number of reads and writes as the above sequential version. The algorithm is highly parallel, with a critical path length of only O(ω log n).
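The write-efficiency of the tree-based sort is easy to observe empirically. The sketch below is ours and, for brevity, substitutes a plain unbalanced binary search tree for the red-black tree in the text; insertion still performs O(1) pointer writes, but the O(log n) read bound per search then holds only in expectation on random input rather than in the worst case.

    import random

    class Node:
        __slots__ = ("key", "left", "right")
        def __init__(self, key):
            self.key = key
            self.left = self.right = None

    def tree_sort(keys):
        """Sort by BST insertion plus in-order traversal, counting
        large-memory reads and writes (the tree and the output array
        are assumed to live in the asymmetric large-memory)."""
        reads = writes = 0
        root = None
        for k in keys:
            node, parent = root, None
            while node is not None:        # search for the insertion point:
                reads += 1                 # reads only, no writes
                parent = node
                node = node.left if k < node.key else node.right
            new = Node(k)
            writes += 1                    # one write to create the node
            if parent is None:
                root = new
            else:
                writes += 1                # one write to link it to its parent
                if k < parent.key:
                    parent.left = new
                else:
                    parent.right = new
        out, stack, node = [], [], root    # in-order traversal of the tree
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            reads += 1                     # read each node once
            out.append(node.key)
            writes += 1                    # one write per output position
            node = node.right
        return out, reads, writes

    keys = random.sample(range(10**6), 10**4)
    out, reads, writes = tree_sort(keys)
    assert out == sorted(keys)
    print(reads, writes)   # reads grow like n log n; writes stay about 3n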
Dynamic Programming. With regards to dynamic programming, some problems are reasonably easy and some harder. The standard Floyd-Warshall algorithm for the all-pairs shortest-path (APSP) problem uses O(n³) writes, and hence would cost Q = T = O(ωn³). Kleene's divide-and-conquer algorithm can be used to reduce the ARAM cost. Each recursive call makes two calls to itself on problems of half the size, and six calls to matrix multiply over the (min, +) semiring. Here we analyze the algorithm in the (M, ω)-ARAM. The matrix multiply on two n × n matrices can be done in the model in Q_M(n) = O(n²(ω + n/√M)) [1]. This leads to the recurrence Q_Kleene(n) = 2 Q_Kleene(n/2) + O(Q_M(n)) + O(ωn²), which solves to Q_Kleene(n) = O(Q_M(n)) since the cost is dominated at the root of the recurrence. This is a factor of n fewer writes than the Floyd-Warshall algorithm; on the standard RAM, in contrast, the two algorithms have the same asymptotic complexity. A similar approach can be used for several other problems, including sequence alignment with gaps, optimal binary search trees, and matrix chain multiplication [2].

Longest common subsequence and edit distance are more challenging. If the answer is directly computed using the classic recursive formulation, the dependency structure of the computation is a diamond DAG, whose properties have been well studied in the symmetric-memory setting. For our asymmetric setting, we proved a new lower bound showing that the ARAM cost Q cannot be asymptotically reduced for the diamond DAG, even when selectively storing only some intermediate values of the diamond DAG and allowing for recomputation of partial results—well-known techniques for reducing the number of writes. To overcome this fundamental limitation, we must violate the classic evaluation method for a DAG, which requires that a node's execution wait until all of its immediate predecessors in the DAG have executed. Instead, we show a new approach that performs partial accumulation at a node based on values computed at only some of its predecessors at a time, combined with selective recomputation, which overcomes this lower bound. Using this new technique, called a "path sketch",

III. OTHER MODELS AND RESULTS

"External" memory sorting. We considered explicitly modeling the transfer of data between the two memories in units of size B > 1. We adapt the well-studied External Memory model (a.k.a. the I/O model, or the Disk model used in databases) to charge ω times more for writing a block of size B to the asymmetric memory than for reading such a block. We examine the three general approaches for sorting in the External Memory model—multi-way mergesort, samplesort, and buffer-tree sort—and show that each of these algorithms can be adapted to use fewer writes. Interestingly, each of the original algorithms has the same optimal cost in the External Memory model, and we show that each can be adapted to the asymmetric setting to have the same Q = O((ωn/B) log_{ωM/B}(n/B)) cost. While the new multi-way mergesort algorithm is similar to prior work [6] and the new samplesort algorithm is readily achieved, the new buffer-tree sort requires a number of new techniques [1].

Cache models.
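Returning to the asymmetric sorting bound above: the sketch below, which is ours and uses purely hypothetical parameter values with all constant factors set to 1, evaluates the cost of a classic (M/B)-way mergesort charged asymmetrically against the adapted bound Q = O((ωn/B) log_{ωM/B}(n/B)), to show how the larger base of the logarithm (ωM/B instead of M/B) reduces the number of passes over the data.

    import math

    def merge_passes(n, B, fan_in):
        # number of merge passes, roughly log_{fan_in}(n/B); the small
        # epsilon guards against floating-point error in the logarithm
        p = math.log(n / B) / math.log(fan_in)
        return max(1, math.ceil(p - 1e-9))

    def q_classic_mergesort(n, M, B, w):
        # (M/B)-way mergesort: each pass reads and writes all n/B blocks,
        # so each pass costs (1 + w) * (n/B) under asymmetric charging
        return (1 + w) * (n // B) * merge_passes(n, B, M / B)

    def q_asymmetric_mergesort(n, M, B, w):
        # adapted bound from [1] with constants dropped:
        # Q = w * (n/B) * log_{wM/B}(n/B)
        return w * (n // B) * merge_passes(n, B, w * M / B)

    # hypothetical parameters: 2^40 items, small-memory of 2^20 items,
    # blocks of 2^10 items, writes 32x more expensive than reads
    n, M, B, w = 2 ** 40, 2 ** 20, 2 ** 10, 32
    print(q_classic_mergesort(n, M, B, w))     # 3 passes at fan-in M/B
    print(q_asymmetric_mergesort(n, M, B, w))  # 2 passes at fan-in wM/B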