Write-Efficient Algorithms

Write-efficient algorithms 15-853: Algorithms in the Real World Yan Gu, May 2, 2018 Classic Algorithms Research …has focused on settings in which reads & writes to memory have equal cost But what if they have very DIFFERENT costs? How would that impact Algorithm Design? Emerging Memory Technologies Motivation: DRAM is volatile DRAM energy cost is significant (~35% energy on data centers) DRAM density (bits/area) is limited Promising candidates: Phase-Change Memory (PCM) Spin-Torque Transfer Magnetic RAM (STT-RAM) Memristor-based Resistive RAM (ReRAM) 3D XPoint Conductive-bridging RAM (CBRAM) Key properties: Persistent, significantly lower energy, can be higher density (10x+) Read latencies approaching DRAM, random-access Another Key Property: Writes More Costly than Reads In these emerging memory technologies, bits are stored as “states” of the given material No energy to retain state Small energy to read state - Low current for short duration Large energy to change state - High current for long duration Writes incur higher energy costs, higherPCM latency, and lower per-DIMM bandwidth (power envelope constraints) Why does it matter? Consider the energy issue and assume a read costs 0.1 nJ and a write costs 10 nJ Sorting algorithm 1: Sorting algorithm 2: 100푛 reads and 100푛 200푛 reads and 2푛 writes writes on 푛 elements on 푛 elements We can sort <1 million We can sort 25 million entries per joule entries per joule read cost write cost read cost write cost Why does it matter? Writes are significantly more costly than reads due to the cost to change the phases of materials higher latency, lower per-chip bandwidth, higher energy costs Higher latency → Longer time for a write → Decrease per-chip (memory) bandwidth Let the parameter 흎 > ퟏ be the cost for writes relative to reads Expected to be between 5 to 30 Evolution on the memory hierarchy CPU New non-volatileExternal Mainmemory Memory memory(Storage class(SSD, L3 (DRAM)memory)HDD) L1 L2 1.5ns 5ns 20ns Read100ns latency ~1004ms ns 64KB 256KB 32MB from64GB 128GB to TBhuge level Common latency / size for a current computer (server) Impacts on Real-World Computation Databases: the data that is kept in the external memory can now be on the main memory Graph processing: large social networks nowadays contain ~billion vertices and >100 billion edges Geometry applications: can handle more precise meshes that support better effects Summary The new non-volatile memory raises the challenge to design write-efficient algorithms What we need: Modified cost models New algorithms New techniques to support efficient computation (cache policy, scheduling, etc.) Experiment New Cost Models Random-Access Machine (RAM) Unit cost for: Any instruction on Θ(log 푛)-bit words Read/write a single memory location from an infinite memory Memory 1 CPU 1 Read/write asymmetry in RAM? A single write cost 휔 instead of 1 But every instruction writes something… Memory 1 CPU write cost 휔 1 흎 (푴, 흎)-Asymmetric RAM (ARAM) Comprise of: a symmetric small-memory (cache) of size 푀, and an asymmetric large-memory (main memory) of unbounded size, and an integer write cost 휔 I/O cost 푄: instructions on cache are free Asymmetric Large-Memory Small-Memory 1 CPU 0 푀 words write cost 휔 흎 (푴, 흎)-Asymmetric RAM (ARAM) Comprise of: a symmetric small-memory (cache) of size 푀, and an asymmetric large-memory (main memory) of unbounded size, and an integer write cost 휔 I/O cost 푄: instructions on cache are free time 푇: instructions on cache take ퟏ unit of time Asymmetric Large-Memory Small-Memory 1 CPU 0 1 푀 words write cost 휔 흎 Lower and Upper Bounds Warm up: Asymmetric sorting Comparison sort on 푛 elements Read and comparison (without writes): Ω(푛 log 푛) Write complexity: Ω(푛) Question: how to sort 푛 elements using 푂(푛 log 푛) instructions (reads) and 푂(푛) writes? Swap-based sorting (i.e. quicksort, heap sort) does not seem to work Mergesort requires strictly 푛 writes for log 푛 rounds Selection sort uses linear write, but not work (read) efficiency Warm up: Asymmetric sorting Comparison sort on 푛 elements Read complexity: Ω(푛 log 푛) Write complexity: Ω(푛) 7 The algorithm: inserting each key in random order into a binary search 3 9 tree. In-order traversing the tree gives the sorted array. (푂 log 푛 tree depth 1 5 w.h.p.) 4 Using balanced BSTs (e.g. AVL trees) gives a deterministic algorithm, but more careful analysis is required Trivial upper bounds 푀 = 푂(1) I/O cost 푄(푛) and Reduction Problem time 푇(푛) ratio Comparison sort Θ(푛 log 푛 + 휔푛) 푂(log 푛) Search tree, priority queue Θ(log 푛 + 휔) 푂(log 푛) 2D convex hull, triangulation 푂(푛 log 푛 + 휔푛) 푂(log 푛) BFS, DFS, SCC, topological sort, block, bipartiteness, Θ(푚 + 푛휔) 푂(푚/푛) floodfill, biconnected components Lower bounds I/O cost Problem Classic Lower bound Algorithm log 푛 log 푛 Sorting network Θ 휔푛 Θ 휔푛 log 푀 log 휔푀 log 푛 log 푛 Fast Fourier Transform Θ 휔푛 Θ 휔푛 log 푀 log 휔푀 Diamond DAG (LCS, 푛2휔 푛2휔 Θ Θ edit distance) 푀 푀 An example of a diamond DAG: Longest common sequence (LCS) A C G T A T A T C G A T An example of Diamond DAG: Longest common sequence (LCS) A C G T A T A 1 1 1 1 1 1 T 1 1 1 2 2 2 C 1 2 2 2 2 2 G 1 2 3 3 3 3 A 1 2 3 3 4 4 T 1 2 3 4 4 5 Computation DAG Rule DAG Rule / pebbling game: To compute the value of a node, must have the values at all incoming nodes High-level proof idea To show that for the computational DAG, there exists a partitioning of the DAG that I/O cost is lower bounded However, since read and write has different cost, previous techniques (e.g. [HK81]) cannot directly apply Standard Observations on DAG 2 A C G T A T DAG (or DP table) has size 푛 (Input size is only 2푛) A Building table explicitly 2 writes, T ⇒ 푛 C but problem only inherently requires writing last value G A T Compute some nodes in cache but don’t write them out Storage lower bound of subcomputation (diamond DAG rule, Cook and Sethi 1976): Solving an 푘 × 푘 sub-DAG requires 푘 space to store intermediate value For 푘 > 푀, some values need to be written out Storage lower bound of subcomputation (diamond DAG rule, Cook and Sethi 1976): Solving an 푘 × 푘 sub-DAG requires 푘 space to store intermediate value 푘 For 푘 > 푀, some values need to be written out Proof sketch of lower bound Computing any 2푀 × 2푀 diamond requires 푀 writes to the large-memory 2푀 storage space, 푀 from small-memory To finish computation, every 2푀 × 2푀 sub-DAG needs to 푛2 be computed, which leads to Ω writes 푀 2푀 푛2 푛2 Θ ⋅ Ω(푀) = Ω 푀2 푀 #sub-problem #write A matching algorithm for the lower bound 푛2 Lower bound: Θ writes 푀 푀 푀 This lower bound is tight when breaking down into × 2 2 sub-DAGs, and read & write-out the boundary only 푀/2 푛 푛2 2 ⋅ ⋅ 푛 = 푂 푀/2 푀 #row/column #read/write Upper bounds on graph algorithms I/O cost 푄(푛, 푚) Problem Classic algorithms New algorithms 푂(min(푛 휔 + 푚/푀 , Single-source 푂 휔 푚 + 푛 log 푛 휔 푚 + 푛 log 푛 , shortest-path 푚(휔 + log 푛))) Minimum 푂(min(푚휔, 푂 푚휔 spanning tree 푚 min(log 푛 , 푛/푀) + 휔푛)) I/O cost of Dijkstra’s algorithm Compute an SSSP requires 푂 푚 DECREASE-KEYs and 푂 푛 EXTRACT-MINs in Dijkstra’s algorithm Classic Fibonacci heap: 푂 휔 푚 + 푛 log 푛 Balanced BST: 푂 푚 휔 + log 푛 Restrict the Fibonacci heap into the small-memory with size 푀: no writes to the large-memory to maintain the heap, 푂 푛/푀 rounds to finish, 푂 푛 휔 + 푚/푀 I/O cost in total small-memory Fibonacci Heap large-memory Parallel Computational Model and Parallel Write-efficient Algorithms 6 + 15 + 15 = 36 A = 1 2 3 4 5 6 7 8 Sum(A): 36 Cut the input array into smaller segments, sum each up individually, and finally sum up the sums Picking the appropriate number of segments can be annoying Machine parameter, runtime environment, algorithmic details A = 1 2 3 4 5 6 7 8 + + + + 3 7 11 15 + + 10 26 + Sum(A): 36 Function SUM(A) If |A| = 1 then return A(1) In Parallel a = SUM(first half of A) b = SUM(second half of A) return a + b A = 1 2 3 4 5 6 7 8 + + + + 3 7 11 15 + + 10 26 + Sum(A): 36 A work-stealing scheduler can run a computation on 푝 cores using time: 푛 푂 + log 푛 The work-stealing scheduler 푝 is used in OpenMP, CilkPlus, Intel TBB, MS PPL, etc. A = 1 2 3 4 5 6 7 8 + + + + 3 7 11 15 + + 10 26 + Sum(A): 36 A work-stealing scheduler can 푊: total run a computation on 푝 cores computation using time: 푊 퐷: longest + 푂 퐷 chain of 푝 all paths A = 1 2 3 4 5 6 7 8 + + + + 3 7 11 15 + + 10 26 + Sum(A): 36 A work-stealing scheduler can 푊: 푂(푛) run a computation on 푝 cores using time: 퐷: 푂(log 푛) 푊 푂 휔 + 퐷 푃 Our new results on scheduler Assumptions: Each processor has its symmetric private cache Each processor can request to access data in other processor’s cache, but cannot write anything All communications are via asymmetric main memory Any non-leaf task uses constant stack space Our new results on scheduler Assumptions: Each processor has its symmetric private cache Each processor can request to access data in other processor’s cache, but cannot write anything All communications are via asymmetric main memory Any non-leaf task uses constant stack space With non-trivial but simple modifications to the work- stealing scheduler, a computation on 푝 cores use time: 푊 + 푂 휔퐷 푝 with memory size 푀 = 푂 퐷 + 푀퐿 (memory for base case) Results of parallel algorithms Problem Work (푊) Depth (퐷) Reduction of writes Reduce Θ 푛 + 휔 Θ log 푛 + 휔 Θ(log 푛) Ordered filter Θ 푛 + 휔푘 ↟ 푂 휔 log 푛 ↟ Θ(log 푛) Comparison sort Θ 푛 log 푛 + 푛휔 ↟ 푂 휔 log 푛 ↟ Θ(log 푛) List and tree Θ 푛 푂 휔 log 푛 ↟ Θ(휔) contraction Minimum 푂(훼 푛 푚 푚/(푛 ⋅ 푂 휔 polylog 푛 ↟ spanning tree + 휔푛 log(min(푚/푛, 휔))) log(min(푚/푛,

Load more