Small Refinements to the DAM Can Have Big Consequences for Data-Structure Design

Michael A. Bender (Stony Brook University), Alex Conway (Rutgers University and VMware Research), Martín Farach-Colton (Rutgers University), William Jannen (Williams College), Yizheng Jiao (The University of North Carolina at Chapel Hill), Rob Johnson (VMware Research), Eric Knorr (Rutgers University), Sara McAllister (Harvey Mudd College), Nirjhar Mukherjee (The University of North Carolina at Chapel Hill), Prashant Pandey (Carnegie Mellon University), Donald E. Porter (The University of North Carolina at Chapel Hill), Jun Yuan (Pace University), Yang Zhan (The University of North Carolina at Chapel Hill)

ABSTRACT

Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations.

The Disk-Access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, the DAM approximates the IO cost on any hardware to within a factor of 2.

Furthermore, the DAM explains the popularity of B-trees in the 70s and the current popularity of Bε-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bε-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size.

In this paper, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively.

We show that the affine model explains node-size choices in B-trees and Bε-trees. Furthermore, the models predict that the B-tree is highly sensitive to variations in the node size whereas Bε-trees are much less sensitive. These predictions are borne out empirically.

Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, Bε-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bε-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently.

We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure, but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.
1 INTRODUCTION

Storage devices have complex performance profiles, including costs to initiate IO (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations.

The Disk-Access Machine (DAM) model [2] simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM has been a success [4, 64], in part because it is easy to use.

The DAM model is also reasonably accurate. If B is set to the hardware's half-bandwidth point, i.e., the IO size where the latency and bandwidth of the hardware are equal, then the DAM model predicts the IO cost of any algorithm to within a factor of 2 on that hardware. That is, if an algorithm replaces all its IOs with IOs of the half-bandwidth size, then the cost of the IOs increases by a factor of at most 2.

Furthermore, the DAM explains some choices that software architects have made. For example, the DAM model gives an analytical explanation for why B-trees [6, 27] took over in the 70s and why Bε-trees [13, 21], log-structured merge trees [11, 47], and external-memory skip lists [8, 15, 55] are taking over now.

But the DAM model has its limits. For example, the DAM does not explain why B-trees in many databases and file systems use nodes of size 16KiB [1, 44, 45, 49, 50], which is well below the half-bandwidth point on most storage devices, whereas B-trees optimized for range queries use larger node sizes, typically up to around 1MB [51, 53]. Nor does it explain why TokuDB's [62] Bε-tree uses 4MiB nodes and LevelDB's [36] LSM-tree uses 2MiB SSTables for all workloads. In a DAM, all IOs, and hence all nodes, are the same size.
How can an optimization parameter that can vary by over three orders of magnitude have escaped algorithmic recognition? The answer is that the DAM model is too blunt an instrument to capture these design issues.

In this paper, we show that the affine [3, 57] and PDAM [2] models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictivity without sacrificing ease of use.

The affine and PDAM models explicitly account for seeks (in spinning disks) and parallelism (in solid-state storage devices). In the affine model, the cost of an IO of k words is 1 + αk, where α ≪ 1 is a hardware parameter.¹ In the PDAM model, an algorithm can perform P IOs of size B in parallel.

¹In reality, storage systems have a minimum write size, but we ignore this issue because it rarely makes a difference in the analysis of data structures and it makes the model cleaner.

Results. We show that the affine and PDAM models improve upon the DAM in three ways.

The affine and PDAM models improve the estimate of the IO cost. In §4, we present microbenchmarks on a large collection of storage devices showing that the affine and PDAM models are good approximations of the performance characteristics of hard drives and SSDs, respectively. We find that, for example, the PDAM is able to correctly predict the run-time of a parallel random-read benchmark on SSDs to within an error of never more than 14% across a broad range of devices and numbers of threads. The DAM, on the other hand, overestimates the completion time for large numbers of threads by roughly P, the parallelism of the device, which ranges from 2.5 to 12. On hard drives, the affine model predicts the time for IOs of varying sizes to within a 25% error, whereas, as described above, the DAM is off by up to a factor of 2.

Researchers have long understood the underlying hardware effects motivating the affine and PDAM models. Nonetheless, it was a pleasant surprise to see how accurate these models turn out to be, even though they are simple tweaks of the DAM.

The affine and PDAM models explain software design choices. In §5 and §6, we reanalyze the B-tree and the Bε-tree in the affine and PDAM models. The affine model explains why B-trees typically use nodes that are much smaller than the half-bandwidth point, whereas Bε-trees have nodes that are larger than the half-bandwidth point. Furthermore, the models predict that the B-tree is highly sensitive to variations in the node size whereas Bε-trees are much less sensitive. These predictions are borne out empirically.

The affine and PDAM models enable better data-structure designs. In a Bε-tree, small nodes optimize point queries and large nodes optimize range queries and insertions. In §6, we show that in the affine model, nodes can be organized with internal structure (such that nodes have subnodes) so that all operations are simultaneously optimal, up to lower-order terms. Since the DAM model loses a factor of 2, it is blind to such fine-grained optimizations.

The PDAM allows us to organize nodes in a search tree so that the tree achieves optimal throughput both when the number of concurrent read threads is large and when it is small. A small number of read threads favors large nodes, and a large number favors small nodes. In §8, we show how to organize nodes so that part or all of them can be read, which allows the data structure to handle both workloads obliviously and optimally.

Discussion. Taking a step back, we believe that the affine and PDAM models are important complements to the DAM model. The DAM is useful as a first cut when designing or analyzing an algorithm or data structure, but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.

2 THE PDAM AND AFFINE MODELS

We present the affine model (most predictive of hard disks) and the PDAM (most predictive of SSDs). The DAM simplifies analysis by assuming all IOs have the same size and cost, which is a reasonable approximation for IOs that are large enough. The affine and PDAM models capture what happens for small and medium IO sizes.

We show that these models are accurate across a range of hardware even though they do not explicitly model most hardware effects. Since they are minor refinements of the DAM, when designing data structures, we can reason in the DAM and then optimize in the affine/PDAM models.

2.1 Disk Access Machine (DAM) Model

The disk-access machine (DAM) model [2] assumes a two-level cache hierarchy with a cache of size M words and a slower storage device that transfers data to and from cache in blocks of size B. The model applies to any two adjacent levels of a cache hierarchy, such as RAM versus disk or L3 versus RAM. Performance in the DAM model is measured by counting the number of block transfers performed during an algorithm's or data structure's execution.
Note that B is distinct from the block size of the underlying hardware. It is a tunable parameter that determines the amount of contiguous data transferred per IO: a bigger B means each IO transfers more data but takes more time.

The DAM model counts IOs but does not assign a cost to each IO. On devices with bounded bandwidth and IO setup time, we can ensure that the number of IOs is within a constant factor of the wall-clock time spent performing IO by setting B to the half-bandwidth point of the device.

The DAM model's simplicity is a strength in terms of usability but a weakness in terms of predictability. On HDDs, it does not model the faster speeds of sequential IO versus random IO. On SSDs and NVMe devices, it does not model internal device parallelism nor the incremental cost of larger IOs.

Inaccuracies of DAM. These inaccuracies limit the effectiveness of the DAM model for optimizing data structures. As we will show, there are asymptotic consequences for these performance approximations.

For example, in the DAM model, the optimal node size for an index such as a B-tree, Bε-tree, or buffered repository tree is B [13, 22, 27]. There is no advantage to nodes smaller than B, since B is the smallest granularity at which data is transferred in the data structure. But using nodes larger than B also does not help.

Could the DAM be right? Maybe the right solution is to pick the best B as an extra-model optimization, and from then on use B in all data-structure design. Alas, no. The best IO size is workload and data-structure dependent [13, 16].

2.2 SSDs and the PDAM

The PDAM model improves performance analysis of SSDs and NVMe devices over the DAM model by accounting for IO parallelism. In its original presentation [2], the external-memory model included one more parameter, P, to represent the number of blocks that can be transferred concurrently. This parameter was originally proposed to model the parallelism available from RAID arrays. However, P is largely ignored in the literature, and almost all theoretical work has been for P = 1.

We argue for reviving P to model the parallelism of SSDs and NVMe devices. Flash storage performs IOs at page granularity, typically 512B–16KB, but has parallelism in terms of multiple channels, packages per channel, and even dies per package [32, 37]. This parallelism is why applications must maintain deep queues in order to get full bandwidth out of an SSD or NVMe device [25, 35].

Definition 1 (PDAM model). In each time step, the device can serve up to P IOs, each of size B. If the application does not present P IOs to the device in a time step, then the unused slots are wasted. Within a time step, the device can serve any combination of reads and writes. Performance is measured in terms of time steps, not the total number of IOs.

Thus, in the PDAM a sequential scan of N items, which uses O(N/B) IOs, can be performed in O(N/PB) time steps.

For the purposes of this paper, IOs are concurrent-read-exclusive-write (CREW) [65] (i.e., if there is a write to location x in a time step, then there are no other concurrent reads or writes to x).

2.3 Hard Disks and the Affine Model

When a hard drive performs an IO, the read/write head first seeks, which has a setup cost of s seconds, and then reads data locally off the disk at a transfer cost of t seconds/byte, which corresponds to a bandwidth cost.

Parameters s and t are approximate, since the setup cost can vary by an order of magnitude. E.g., a track-to-track seek may be ∼1ms while a full seek is ∼10ms. Nonetheless, it is remarkably predictive to view these as fixed [57].

Definition 2 (affine model). IOs can have any size. An IO of size x costs 1 + αx, where the 1 represents the normalized setup cost and α ≤ 1 is the normalized bandwidth cost.

Thus, for a hard disk, α = t/s.

Lemma 1. An affine algorithm with cost C can be transformed into a DAM algorithm with cost 2C, where blocks have size B = 1/α. A DAM algorithm with cost C and blocks of size B = 1/α can be transformed into an affine algorithm with cost 2C.

Thus, if losing a factor of 2 on all operations is satisfactory, then the DAM is good enough.

What may be surprising is how many asymptotic design effects show up when optimizing to avoid losing this factor of 2. A factor of 2 is a lot for an external-memory dictionary. For example, even smaller factors were pivotal for a subset of the authors of this paper when we were building and marketing TokuDB [62]. In fact, losing a factor of 2 on large sequential write performance was a serious setback in making BetrFS a general-purpose file system [28, 41, 67–69].
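To make these definitions concrete, here is a minimal sketch (ours, not from the paper) of the three cost measures; the parameter values are illustrative:

import math

def dam_cost(num_ios):
    # DAM: every IO costs 1, regardless of its size.
    return num_ios

def affine_cost(io_sizes, alpha):
    # Affine (Definition 2): an IO of size x costs 1 + alpha * x.
    return sum(1 + alpha * x for x in io_sizes)

def pdam_time_steps(num_ios, p):
    # PDAM (Definition 1): up to P size-B IOs are served per time step.
    return math.ceil(num_ios / p)

alpha = 0.002                       # e.g., a commodity HDD from Table 2
B = 1 / alpha                       # half-bandwidth point: setup = transfer
print(affine_cost([B], alpha))      # 2.0: one "DAM block" costs 2 affine units
print(affine_cost([B / 8], alpha))  # 1.125: a small IO costs mostly setup
print(pdam_time_steps(64, p=4))     # 16 time steps for 64 IOs at P = 4

The first print illustrates Lemma 1's factor of 2: at the half-bandwidth point, each unit-cost DAM IO corresponds to an affine cost of exactly 2.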
Here we review some common external-memory dictionaries.

B-trees. The classic dictionary for external storage is the B-tree [6, 27]. A B-tree is a balanced search tree with fat nodes of size B, so that a node can have Θ(B) pivot keys and Θ(B) children. All leaves have the same depth, and key-value pairs are stored in the leaves. The height of a B-tree is Θ(log_{B+1} N).

Lemma 2 (folklore). In a B-tree with size-B nodes, point queries, inserts, and deletes take O(log_{B+1}(N/M)) IOs. A range query scanning ℓ elements takes O(⌈ℓ/B⌉) IOs plus the point-query cost.

The systems community often evaluates data structures in terms of their write amplification, which we define below [56].

Definition 3. The write amplification of an update is the amortized amount of data written to disk per operation divided by the amount of data modified per update.

Write amplification is the traditional way of distinguishing between read IOs and write IOs. Distinguishing between reads and writes makes sense because with some storage technologies (e.g., NVMe) writes are more expensive than reads, and this has algorithmic consequences [7, 18, 19, 40]. Moreover, even when reads and writes have about the same cost, other aspects of the system can make writes more expensive. For example, modifications to the data structure may be logged, and so write IOs in the B-tree may also trigger write IOs from logging and checkpointing.

In the DAM model, the write amplification of a dictionary is just B times the amortized number of write IOs per insertion.

Lemma 3. The worst-case write amplification of a B-tree is Θ(B).

Proof. We give a bad example for write amplification. Consider sufficiently large N where N = Ω(BM). Assume that nodes are paged to and from RAM atomically. Then a workload comprised of random insertions and deletions achieves this write amplification.
On average, a (size-B) node is written back to disk after there have been O(1) (unit-sized) elements written to/deleted from that node. The upper bound on write amplification follows because the modifications that take place on the tree are dominated by the modifications at the leaves. □

Bε-trees. The Bε-tree [21, 22, 42] is a write-optimized generalization of the B-tree. (A write-optimized dictionary (WOD) is a searchable data structure that has (1) substantially better insertion performance than a B-tree and (2) query performance at or near that of a B-tree.) The Bε-tree is used in some write-optimized databases and file systems [28, 33, 41, 43, 52, 60, 61, 67–69]. A more detailed description of the Bε-tree is available in the prior literature [13].

As with a B-tree, the Bε-tree is a balanced search tree with fat nodes of size B. A Bε-tree leaf looks like a B-tree leaf, storing key-value pairs in key order. A Bε-tree internal node has pivot keys and child pointers, like a B-tree, but it also has space for a buffer. The buffer is part of the node and is written to disk with the rest of the node when the node is evicted from memory. Modifications to the dictionary are encoded as messages, such as an insertion or a so-called tombstone message for deletion. These messages are stored in the buffers in internal nodes, and eventually applied to the key-value pairs in the leaves. A query must search the entire root-to-leaf path, and logically apply all relevant messages in all of the buffers.

Buffers are maintained using the flush operation. Whenever a node u's buffer is full ("overflows"), then the tree selects (at least one) child v, and moves all relevant messages from u to v. Typically v is chosen to be the child with the most pending messages. Flushes may recurse, i.e., when a parent flushes, it may cause children and deeper descendants to overflow.

The Bε-tree has a tuning parameter ε (0 ≤ ε ≤ 1) that controls the fanout F = B^ε + 1. Setting ε = 1 optimizes for point queries, and the Bε-tree reduces to a B-tree. Setting ε = 0 optimizes for insertions/deletions, and the Bε-tree reduces to a buffered repository tree [22]. Setting ε to a constant in between leads to point-query performance that is within a constant factor of a B-tree, but insertions/deletions that are asymptotically faster. In practice, F is chosen to be in the range [10, 20]; for example, in TokuDB, the target value of F is 16.

Theorem 4 ([13, 21]). In a Bε-tree with size-B nodes and fanout 1 + B^ε, for ε ∈ [0, 1],
(1) insertions and deletions take O((1/B^{1−ε}) log_{B^ε+1}(N/M)) IOs,
(2) point queries take O(log_{B^ε+1}(N/M)) IOs,
(3) a range query returning ℓ elements takes O(⌈ℓ/B⌉) IOs plus the cost of a point query, and
(4) the write amplification is O(B^ε log_{B^ε+1}(N/M)).
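As a concrete, hypothetical comparison, the following sketch (ours, not from the paper) evaluates the DAM-model bounds of Lemma 2 and Theorem 4 with all hidden constants taken to be 1:

import math

def btree_op_ios(N, M, B):
    # Lemma 2: point queries/inserts/deletes take O(log_{B+1}(N/M)) IOs.
    return math.log(N / M, B + 1)

def betree_insert_ios(N, M, B, eps):
    # Theorem 4(1): O((1/B^(1-eps)) * log_{B^eps+1}(N/M)) IOs, amortized.
    return math.log(N / M, B**eps + 1) / B**(1 - eps)

def betree_query_ios(N, M, B, eps):
    # Theorem 4(2): O(log_{B^eps+1}(N/M)) IOs.
    return math.log(N / M, B**eps + 1)

N, M, B = 2**30, 2**20, 2**10
print(btree_op_ios(N, M, B))             # ~1.0 IO per uncached level
print(betree_insert_ios(N, M, B, 0.5))   # ~0.06: inserts ~16x cheaper
print(betree_query_ios(N, M, B, 0.5))    # ~2.0: queries only ~2x slower

For these parameters, ε = 1/2 buys roughly a 16x insertion speedup at the cost of a 2x query slowdown, which is the tradeoff that makes Bε-trees attractive in write-heavy settings.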
4 MICROBENCHMARKS TO VALIDATE THE AFFINE AND PDAM MODELS

We now experimentally validate the accuracy of the affine model for hard disks and the PDAM for SSDs. We show that the models are remarkably accurate, even though they do not explicitly model most hardware effects.

One of the messages of this section is that even though the affine and PDAM models are only tweaks to the DAM, they have much more predictive power. We can even make predictions and reason about constants. As we will see in the next section, optimizing the constant for various operations will cause some design parameters to change asymptotically.

Unless noted otherwise, all experiments in this paper were collected on a Dell PowerEdge T130 with a 3.0GHz Intel E3-1220 v6 processor, 32GiB of DDR4 RAM, two 500GiB TOSHIBA DT01ACA050 HDDs, and one 250GiB Samsung 860 EVO SSD.

4.1 Validating the PDAM Model

The PDAM ignores some issues of real SSDs, such as bank conflicts, which can limit the parallelism available for some sets of IO requests. Despite its simplicity, we verify below that the PDAM accurately reflects real SSD and NVMe performance.

Interestingly, the PDAM predates commercial SSDs [2], but, as we verify, the PDAM fits the performance of a wide range of SSDs. The goodness of fit is particularly striking because SSDs have many complications that are not captured by the PDAM.

To test the PDAM model, we ran many rounds of IO read experiments with different numbers of threads. In each round of the experiment, we spawned p ∈ {1, 2, 4, 8, ..., 64} OS threads that each read 10 GiB of data. We selected 163,840 random logical block address (LBA) offsets and read 64KiB starting from each. Thus, there were always p outstanding IO requests, and the total data read was p × 10GiB per round.

The PDAM predicts that the time to complete the experiment should be the same for all p ≤ P and should increase linearly in p for p > P. Figure 1 shows the time in seconds taken to perform each round of IO read experiments. As Figure 1 shows, the time is relatively constant until around p = 2 or 4, depending on the device, and it increases linearly thereafter. The transition is not perfectly sharp. We suspect this is due to bank conflicts within the device.

We used segmented linear regression to estimate P and B for each device. Segmented linear regression is appropriate for fitting data that is known to follow different linear functions in different ranges. Segmented linear regression outputs the boundaries between the different regions and the parameters of the line of best fit within each region.
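The following sketch (ours, not the authors' analysis code) shows one way to implement such a segmented fit with numpy; the measurements are hypothetical but shaped like Figure 1:

import numpy as np

def fit_segmented(p, t):
    # Try every breakpoint: flat fit on the left, linear fit on the right;
    # keep the split that minimizes the total squared error.
    best = None
    for k in range(1, len(p) - 1):
        flat = t[:k + 1].mean()
        slope, icept = np.polyfit(p[k:], t[k:], 1)
        sse = ((t[:k + 1] - flat) ** 2).sum() + \
              ((t[k:] - (slope * p[k:] + icept)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, p[k], flat, slope)
    return best[1:]   # estimated P, flat time, seconds per extra thread

# Hypothetical measurements shaped like Figure 1 (threads, total seconds):
p = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
t = np.array([21.0, 21.5, 22.0, 43.0, 85.0, 168.0, 333.0])
print(fit_segmented(p, t))   # breakpoint near p = 4, i.e., P ~ 4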

[Figure 1 plots time in seconds (log scale) against the number of threads p ∈ {1, 2, ..., 64} for the Samsung 860 pro, Samsung 970 pro, Silicon Power S55, and Sandisk Ultra II.]

Figure 1: Time to read 10GiB per thread on each SSD tested. The DAM model predicts that the time would increase linearly with the number of LBAs read. However, for a small number of threads the time stays relatively constant.

Device            | P   | ∝PB  | R²
------------------|-----|------|------
Samsung 860 pro   | 3.3 | 530  | 0.999
Samsung 970 pro   | 5.5 | 2500 | 0.986
Silicon Power S55 | 2.9 | 260  | 0.999
Sandisk Ultra II  | 4.6 | 520  | 0.993

Table 1: Experimentally derived PDAM values for real hardware. We used segmented linear regression to calculate P. After P threads, throughput remains nearly constant at ∝PB.

Disk           | Year | s (s) | t (s/4K) | α      | R²
---------------|------|-------|----------|--------|-------
2 TB Seagate   | 2002 | 0.018 | 0.000021 | 0.0012 | 0.9994
250 GB Seagate | 2006 | 0.015 | 0.000033 | 0.0022 | 0.9997
1 TB Hitachi   | 2009 | 0.013 | 0.000041 | 0.0031 | 0.9999
1 TB WD Black  | 2011 | 0.012 | 0.000035 | 0.0029 | 0.9997
6 TB WD Red    | 2018 | 0.016 | 0.000026 | 0.0017 | 0.9972

Table 2: Experimentally derived α values for commodity HDDs. We issued 64 random block-aligned reads with IO sizes ranging from 1 disk block to 16MiB. We conducted linear regression to get the setup cost s and bandwidth cost t. We calculated α as t/s.

                 | Insertion/Deletion                | Query
-----------------|-----------------------------------|-----------------------------------------
B-tree           | Θ(((1 + αB)/log B) log(N/M))      | Θ(((1 + αB)/log B) log(N/M))
Bε-tree (F = √B) | Θ(((1 + αB)/(√B log B)) log(N/M)) | Θ(((1 + αB)/log B) log(N/M))
Bε-tree          | Θ((F(1 + αB)/(B log F)) log(N/M)) | Θ(((F + αF² + αB)/(F log F)) log(N/M))

Table 3: A sensitivity analysis of node sizes for Bε-trees and B-trees. The cost of B-tree update operations grows nearly linearly as a function of B—specifically, (1 + αB)/log B. Bε-trees should optimize F(1 + αB)/(B log F) for inserts, deletes, and upserts, and (2F + αF² + αB)/(F log F) for queries. The cost for inserts and queries increases more slowly in Bε-trees than in B-trees as the node size increases.

Table 1 shows the experimentally derived parallelism, P, and the device saturation, ∝PB, for a variety of flash devices. To verify the goodness of fit, we report the R² value. An R² value of 1 means that the regression coefficients perfectly predicted the observed data. Our R² values are all within 0.1% of 1, and we conclude that the PDAM is an excellent fit for SSDs.

4.2 Validating the Affine Model

In this section, we empirically derive α = t/s for a series of commodity hard disks, and we confirm that the affine model is highly predictive of hard disk performance.

For our experiments, we chose an IO size, I, and issued 64 I-sized reads to block-aligned offsets chosen randomly within the device's full LBA range. We repeated this experiment for a variety of IO sizes, with I ranging from 1 disk block up to 16MiB. Table 2 shows the experimentally derived values for each HDD. To verify the goodness of fit, we report the R² value; a value of 1 indicates that the regression coefficients perfectly predict the observed data. R² values are all within 0.1% of 1, and we conclude that the affine model is an excellent fit for hard disks.
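The fit itself is an ordinary least-squares regression. The sketch below (ours, with hypothetical measurements in the spirit of the 1 TB WD Black row of Table 2) recovers s as the intercept, t as the slope, and α = t/s:

import numpy as np

# Hypothetical measurements (IO size in 4KiB blocks, seconds per IO):
sizes = np.array([1.0, 16, 64, 256, 1024, 4096])
times = np.array([0.0121, 0.0126, 0.0142, 0.0209, 0.0478, 0.1553])

t_cost, s_cost = np.polyfit(sizes, times, 1)   # slope = t, intercept = s
print(s_cost, t_cost, t_cost / s_cost)          # s ~ 0.012, t ~ 0.000035,
                                                # alpha = t/s ~ 0.0029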
5 B-TREE NODES IN THE AFFINE MODEL

In this section, we use the affine model to analyze the effect of changing the size of B-tree nodes. In the next section, we will perform the analysis for Bε-trees.

As mentioned above, practice has led to a dichotomy in B-tree uses: Online Transaction Processing (OLTP) databases favor point queries and insertions; Online Analytical Processing (OLAP) databases favor range queries. As predicted by the analysis in this section, OLTP databases use small leaves and OLAP databases use large leaves.

5.1 Large Nodes Optimize Lookups, Updates, and Range Queries

The following lemma follows immediately from the definition of a B-tree and the definition of the affine model.

Lemma 5. The affine IO cost of a lookup, insert, or delete in a B-tree with size-B nodes is (1 + αB) log_{B+1}(N/M)(1 + o(1)). The affine IO cost of a range query returning ℓ items is O((1 + ℓ/B)(1 + αB)) plus the cost of the query.

Proof. A B-tree node has size B and the cost to perform an IO of size B is 1 + αB. The height of the B-tree is log_{B+1}(N)(1 + o(1)), since the target fanout is B, and the fanout can vary by at most a constant factor. The top Θ(log_{B+1} M) levels can be cached so that accesses to nodes within those levels are free. Thus, the search cost follows from the structure of the tree.

During the course of N inserts/deletes, there are O(N/B) node splits or merges. Thus, the tree-rebalance cost during inserts/deletes is a lower-order term, and so the insert/delete cost is the same as the search cost.

A range query returning ℓ items fits in Θ(⌈ℓ/B⌉) leaves, and each block access costs 1 + αB. □

Corollary 6. In the affine IO model, search, insert/delete, and range queries are asymptotically optimized when B = Θ(1/α).

Proof. Setting the node size to B = 1/α achieves the half-bandwidth point. □

Corollary 6 seems definitive because it says that there is a parameter setting such that both point queries and range queries run within a constant factor of optimal. It is not.

It may be surprising that the half-bandwidth point is not what people usually use to optimize a B-tree. In particular, B-trees in many databases and file systems use nodes of size 16KiB [1, 44, 45, 49, 50], which is too small to amortize the setup cost. As a result, range queries run slowly, under-utilizing disk bandwidth [28, 29, 59]. In contrast, B-trees in databases that are more focused on analytical workloads use larger block sizes, typically up to around 1MB [51, 53], to optimize for range queries.

B-tree nodes are often small. The rest of this section gives analytical explanations for why B-tree nodes are generally smaller than their half-bandwidth point.

Our first explanation is simply that even small constant factors can matter. The following corollary shows that in the affine model, when we optimize for point queries, inserts, and deletes, then the B-tree node size is smaller than indicated in Corollary 6—that is, B = o(1/α). For these smaller node sizes, range queries run asymptotically suboptimally. In contrast, if range queries must run at near disk bandwidth, then point queries, inserts, and deletes are necessarily suboptimal in the worst case.

Corollary 7. Point queries, inserts, and deletes are optimized when the node size is Θ(1/(α ln(1/α))). For this node size, range queries are asymptotically suboptimal.

Proof. From Lemma 5, finding the optimal node size for point queries means finding the minimum of the function

f(x) = (1 + αx) / ln(x + 1).

Taking the derivative, we obtain

f′(x) = α/ln(x + 1) − (1 + αx)/((1 + x) ln²(x + 1)).

Setting f′(x) = 0, the previous equation simplifies to

1 + αx = α ln(x + 1)(1 + x).

Given that 1 < x < 1/α, we obtain x ln x = Θ(1/α), which means that x = Θ(1/(α ln(1/α))). Second derivatives confirm that we have a minimum. □

A straightforward information-theoretic argument shows that Corollary 7 is optimal not just for B-trees, but also for any comparison-based external-memory dictionary.

As Corollary 7 indicates, the optimal node size x is not large enough to amortize the setup cost. This means that as B-trees age, their nodes get spread out across the disk, and range-query performance degrades. This is borne out in practice [28, 29, 31, 59].

A second reason that B-trees often use small nodes has to do with write amplification, which is large in a B-tree; see Lemma 3. Since the B-tree write amplification is linear in the node size, there is downward pressure towards small B-tree nodes. A third reason is that big nodes pollute the cache, making it less effective.

We believe that the distinction between OLAP and OLTP databases is not driven by user need but by the inability of B-trees to keep up with high insertion rates [30], despite optimizations [5, 9, 12, 14, 17, 23, 24, 38, 39, 46, 48, 58, 63, 66]. We next turn to the Bε-tree, which can index data at rates that are orders of magnitude faster than the B-tree.
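The sketch below (ours) performs the minimization in Corollary 7's proof numerically, relying only on the fact that f is unimodal; it illustrates how far the point-query-optimal node size sits below the half-bandwidth point:

import math

def f(x, alpha):
    # Point-query cost per unit of tree progress, from Lemma 5.
    return (1 + alpha * x) / math.log(x + 1)

def argmin_f(alpha, lo=2.0, hi=None, iters=200):
    # Ternary search; f decreases and then increases on (0, infinity).
    hi = hi if hi is not None else 10 / alpha
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1, alpha) < f(m2, alpha):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

alpha = 0.002
print(argmin_f(alpha))   # ~130, i.e., Theta(1/(alpha * ln(1/alpha)))
print(1 / alpha)         # 500: the half-bandwidth point is much larger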
6 Bε-TREE NODES IN THE AFFINE MODEL

In this section, we use the affine model to analyze Bε-trees. We first perform a naïve analysis of the Bε-tree [13, 21] in the affine model, assuming that IOs only read entire nodes—effectively the natural generalization of the DAM analysis.

The analysis reveals that Bε-trees are more robust to node-size choices than B-trees. In the affine model, once the node size B becomes sufficiently large, transfer costs grow linearly in B. For a Bε-tree with ε = 1/2, the transfer costs (and write amplification) of inserts grow proportionally to √B. This means Bε-trees can use much larger nodes than B-trees, and that they are much less sensitive to the choice of node size.

However, the transfer costs of queries in a Bε-tree still grow linearly in B, which means that, in the affine model and with a standard Bε-tree, designers face a trade-off between optimizing for insertions versus optimizing for queries. We then describe three optimizations to the Bε-tree that eliminate this trade-off.

This latter result is particularly exciting because, in the DAM model, there is a tight tradeoff between reads and writes [21]. In the DAM model, a Bε-tree (for 0 < ε < 1) performs inserts a factor of εB^{1−ε} faster than a B-tree, but point queries run a factor of 1/ε times slower. While this is already a good tradeoff, the DAM model actually underestimates the performance advantages of the Bε-tree. The Bε-tree has performance advantages that cannot be understood in the DAM.

We first give the affine IO performance of the Bε-tree:

Lemma 8. Consider a Bε-tree with nodes of size B, where the fanout at any nonroot node is within a constant factor of the target fanout F. Then the amortized insertion cost is

O((F/B + αF) log_F(N/M)).

The affine IO cost of a query is

O((1 + αB) log_F(N/M)).

The affine IO cost of a range query returning ℓ items is O((1 + ℓ/B)(1 + αB)) plus the cost of the query.

Proof. We first analyze the query cost. When we perform a query in a Bε-tree, we follow a root-to-leaf path. We need to search for the element in each buffer along the path, as well as in the target leaf. The cost to read an entire node is 1 + αB.

We next analyze the amortized insertion/deletion cost. The affine IO cost to flush the Θ(B) messages in one node down one level of the tree is Θ(F + αFB). This is because there are Θ(F) IOs (for the node and all its children). The total amount of data being flushed to the leaves is Θ(B), but the total amount of data being transferred by the IOs is Θ(FB), since nodes that are modified may need to be completely rewritten. Thus, the amortized affine IO cost to flush an element down one level of the tree is Θ(F/B + αF). The amortized flushing cost follows from the height of the tree.

The impact of tree rebalancing turns out to be a lower-order effect. If the leaves are maintained between half full and full, then in total, there are only O(N/B) modifications to the Bε-tree's pointer structure in Θ(N) updates. Thus, the amortized affine IO contribution due to rebalances is O(α + 1/B), which is a low-order term. □

We now describe three affine-model optimizations of the Bε-tree. These optimizations use variable-sized IOs to improve the query cost of the Bε-tree without harming its insertion cost, and will enable us to get our robustness and B-tree dominance theorems.

Theorem 9. There exists a Bε-tree with nodes of size B and target fanout F with the following bounds. The amortized insertion cost is

O((F/B + αF) log_F(N/M)).

The affine IO cost of a query is at most

(1 + αB/F + αF) log_F(N/M)(1 + 1/log F).

The affine IO cost of a range query returning ℓ items is O((1 + ℓ/B)(1 + αB)) plus the cost of the query.

Proof. We make three algorithmic decisions to obtain the target performance bounds.
(1) We specify an internal organization of the nodes, and in particular, how the buffers of the nodes are organized.
(2) We store the pivots of a node outside of that node—specifically in the node's parent.
(3) We use a rebalancing scheme in which the nonroot fanouts stay within (1 ± o(1))F.
Our objective is to have a node organization that enables large IOs for insertions/deletions and small IOs for queries—and only one IO per level.

We first describe the internal node/buffer structure. We organize the node/buffer so that all of the elements destined for a particular child are stored contiguously. We maintain the invariant that no more than B/F elements in a node can be destined for a particular child, so the cost to read all these elements is only 1 + αB/F.

Each node u has a set of pivots. However, we do not store node u's pivots in u, but rather in u's parent. The pivots for u are stored next to the buffer that stores elements destined for u. When F = O(√B), storing a node's pivots in its parent increases node sizes by at most a constant factor.

Finally, we describe the rebalancing scheme. Define the weight of a node to be the number of leaves in the node's subtree. We maintain the following weight-balanced invariant. Each nonroot node u at height h satisfies

F^h(1 − 1/log F) ≤ weight(u) ≤ F^h(1 + 1/log F).

The root just maintains the upper bound on the weight, but not the lower bound.
Whenever a node u gets out of balance, e.g., u's weight grows too large or small, then we rebuild the subtree rooted at u's parent v from scratch, reestablishing the balancing invariant.

We next bound the minimum and maximum fanout that a node can have. Consider a node u and parent node v of height h and h + 1, respectively. Then since weight(v) ≤ F^{h+1}(1 + 1/log F) and weight(u) ≥ F^h(1 − 1/log F), the maximum number of children that v can have is

F(1 + 1/log F)/(1 − 1/log F) = F + O(F/log F).

By similar reasoning, if v is a nonroot node, then the minimum fanout that v can have is F − O(F/log F).

As in Lemma 8, the amortized affine IO cost to flush an element down one level of the tree is O(F/B + αF), and so the amortized insert/delete cost follows from the height of the tree.

The amortized rebalance cost is determined as follows. The IO cost to rebuild the subtree rooted at u's parent v is O(weight(v)) = O(F weight(u)), since nodes have size Θ(1/α) and the cost to access any node is O(1). The number of leaves that are added or removed before v needs to be rebuilt again is Ω(weight(u)/log F). There are Ω(1/α) inserts or deletes into a leaf before a new leaf gets split or merged. Thus, the number of inserts/deletes into u's subtree between rebuilds is Ω(weight(u)/(α log F)). Consequently, the amortized cost to rebuild, per element insert/delete, is O(α log F), which is a low-order cost.

The search bounds are determined as follows. Because the pivot keys of a node u are stored in u's parent, we only need to perform one IO per node, and each IO only needs to read one set of pivots followed by one buffer—not the entire node. Thus, the IO cost per node for searching is 1 + αB/F + αF, and the search cost follows directly. □

Theorem 9 can be viewed as a sensitivity analysis for the node size B, establishing that Bε-trees are less sensitive to variations in the node size than B-trees. In particular, Bε-tree insertions are much less sensitive to changes in node size than B-tree insertions. This is particularly easy to see when we take F = √B.

Corollary 10. When B > 1/α, the B-tree query cost increases nearly linearly in B, whereas the query cost of the Bε-tree of Theorem 9 with F = Θ(√B) increases nearly linearly in √B.

We now give a more refined sensitivity analysis, optimizing B given F and α.

Corollary 11. When B = Ω(F²) and B = o(F/α), there exist Bε-trees where the affine IO cost to read each node is 1 + o(1), and the search cost is (1 + o(1)) log_F(N/M).

Proof. For a search, the IO cost per node is

1 + αB/F + αF = 1 + o(1).

This means that the search cost is (1 + o(1)) log_F(N/M). □
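The following sketch (ours) evaluates the per-level query costs behind Table 3 and Corollary 10 for a hypothetical α, contrasting how the two structures respond as B grows:

import math

def btree_query_cost(B, alpha):
    # Per Table 3: (1 + alpha*B)/log(B), per factor of log(N/M).
    return (1 + alpha * B) / math.log2(B)

def betree_query_cost(B, alpha):
    # Theorem 9 with F = sqrt(B): per-node cost 1 + alpha*B/F + alpha*F.
    F = math.sqrt(B)
    return (1 + alpha * B / F + alpha * F) / math.log2(F)

alpha = 0.002
for B in [2**9, 2**13, 2**17, 2**21]:
    print(B, round(btree_query_cost(B, alpha), 2),
             round(betree_query_cost(B, alpha), 2))
# Growing B by a factor of 4096 inflates the B-tree term ~900x,
# but the optimized Betree term only ~2.7x, matching Corollary 10.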
We can now optimize the fanout and node size in Corollary 11. In particular, say that F = √B. Then it is sufficient that B = o(1/α²).

What is interesting about this analysis is that an optimized Bε-tree node size can be nearly the square of the optimal node size for a B-tree for reasonable parameter choices. In contrast, in the DAM, B-trees and Bε-trees always have the same node size, which is the block-transfer size. Small subconstant changes in IO cost can have asymptotic consequences in the data-structure design.

This analysis helps explain why the TokuDB Bε-tree has a relatively large node size (∼4MB), but also has sub-nodes ("basement nodes"), which can be paged in and out independently on searches. It explains the contrast with B-trees, which have much smaller nodes. It is appealing how much asymptotic structure you see just from optimizing the constants, and how predictive it is of real data-structure designs.

data-structure designs. Milliseconds per Operation Finally, we show that even in the affine model we can make a ε B -tree whose search cost is optimal up to low-order terms and 4KiB 16KiB 64KiB 256KiB 1MiB whose insert cost is asymptotically faster than a B-tree. BerkeleyDB Node Size

ε Corollary 12. There exists a B -tree with fanout F = Figure 2: Microseconds per query/insert with increasing 2 Θ(1/α log(1/α)) and node size B = F whose query cost is optimal node size for BerkeleyDB. The fitted line (black) has an al- in the affine model up to low order terms over all comparison-based pha of 1.58357435 × 10−04 and a root mean squared (RMS) of ε external-memory dictionaries. The B -tree’s query cost matches the 8.4. B-tree up to low-order terms, but its amortized insert cost is a factor of Θ(log(1/α)) times faster.

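As a rough, hypothetical instantiation of Corollary 12 (ours; we take logarithms base 2, and we assume α is measured per 4KiB block as in Table 2):

import math

alpha = 0.0029                     # 1 TB WD Black, Table 2
F = 1 / (alpha * math.log2(1 / alpha))
B = F ** 2
print(round(F), round(B))          # F ~ 41, B ~ 1673 blocks
# At 4KiB per block this is a ~6.5MiB node split into ~41 child buffers,
# in the same ballpark as TokuDB's 4MiB nodes with basement nodes.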
7 EMPIRICAL VALIDATION OF B-TREE AND Bε-TREE NODE SIZE

We measured the impact of node size on the average run time of random queries and random inserts on HDDs. We used BerkeleyDB [50] as a typical B-tree and TokuDB [60] as a typical Bε-tree. We first inserted 16 GB of key-value pairs into the database. Then, we performed random inserts and random queries to about a thousandth of the total number of keys in the database. We turned off compression in TokuDB to obtain a fairer comparison. We limited the memory to 4 GiB to ensure that most of the databases were outside of RAM.

Figure 2 presents the query and insert performance of BerkeleyDB on HDDs. We see that the cost of inserts and queries starts to grow once the nodes are larger than 64 KiB, which is larger than the default node size. After the optimal node size of 64 KiB for inserts, the insert and query costs start increasing roughly linearly with the node size, as predicted.

Figure 3 gives performance numbers for TokuDB, which are consistent with Table 3 where F = √B. The optimal node size is around 512 KiB for queries and 4 MiB for inserts. In both cases, the next few larger node sizes decrease performance, but only slightly compared to the BerkeleyDB results.

[Figure 3 plots time per operation against TokuDB node size (64KiB to 4MiB), with separate query and insert axes and affine-model fits.]

Figure 3: Number of microseconds per query/insert with increasing node size for TokuDB. The fitted lines have an alpha value of 1.58357435 × 10⁻³ and the RMS error is 18.7.

8 CONCLUSIONS AND PDAM ALGORITHM DESIGN

We conclude with some observations on how the PDAM model can inform external-memory dictionary design.

Consider the problem of designing a B-tree for a database that serves a dynamically varying number of clients. We want to exploit the storage device's parallelism, regardless of how many clients are currently performing queries.

If we have P clients, then the optimal strategy is to build our B-tree with nodes of size B and let each client perform one IO per time step. If the database contains N items, then each client requires Θ(log_B N) time steps to complete a query (technically Θ(log_{B+1} N), but we use Θ(log_B N) in this section in order to keep the math clean). We can answer P queries every Θ(log_B N) time steps, for an amortized throughput of Θ(P/log_B N) queries per time step.

Now suppose that we have a single client performing queries. Since walking the tree from root to leaf is inherently sequential, a B-tree with nodes of size B is unable to use device parallelism. The client completes one query every Θ(log_B N) time steps, and all device parallelism goes to waste. Now the B-tree performs better with nodes of size PB. The client loads one node per time step, for a total of Θ(log_{PB} N) time steps per query, which is a significant speed-up when P = ω(B).

In summary, to optimize performance, we want nodes of size B when we have many clients, and nodes of size PB when we have only one client. But B-trees have a fixed node size.

The point is that the amount of IO that can be devoted to a query is not predictable. Dilemmas like this are common in external-memory dictionary design, e.g., when dealing with system prefetching [16, 26].
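The following sketch (ours) makes the dilemma concrete by counting PDAM time steps for hypothetical N, B, and P:

import math

def steps_per_query(N, node_blocks):
    # Tree height in the simplified log_{node size} N form used above.
    return math.log(N, node_blocks)

N, B, P = 2**30, 2**10, 8
# P clients, size-B nodes: one IO per client per step.
print(P / steps_per_query(N, B))    # throughput: ~2.7 queries per step
# One client, size-B nodes: P - 1 IO slots idle every step.
print(steps_per_query(N, B))        # latency: 3 steps per query
# One client, size-PB nodes: all P blocks of one node read per step.
print(steps_per_query(N, P * B))    # latency: ~2.3 steps per query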
One way to resolve the dilemma uses ideas from cache-oblivious data structures [10, 34]. Cache-oblivious data structures are universal data structures in that they are optimal for all memory-hierarchy parameterizations. Most cache-oblivious dictionaries are based on the van Emde Boas layout [10, 54].

In the B-tree example, we use nodes of size PB, but organize each node in a van Emde Boas layout. Now suppose there are k ≤ P clients. Each client is given P/k IOs per time slot. Thus, a client can traverse a single node in Θ(log_{PB/k} PB) time steps, and hence traverses the entire root-to-leaf path in Θ(log_{PB/k} N) time steps. When k = 1, this matches the optimal single-threaded B-tree design described above. When k = P, this matches the optimal multi-threaded client throughput given above. Furthermore, this design gracefully adapts when the number of clients varies over time.

Lemma 13. In the PDAM model, a B-tree with nodes of size PB has query throughput Ω(k/log_{PB/k} N) for any k ≤ P concurrent query threads.

This strategy also fits well into current caching and prefetching system designs. In each time step, clients issue IOs for single blocks. Once the system has collected all the IO requests, if there are any unused IO slots in that time step, then it expands the requests to perform read-ahead. So in our B-tree example with a single client, there is only one IO request (for the first block in a node), and the system expands that to P blocks, effectively loading the entire node into cache. As the client accesses the additional blocks of the same B-tree node during its walk of the van Emde Boas layout, the blocks are in cache and incur no additional IO. If, on the other hand, there are two clients, then the system sees two one-block IO requests, which it will expand into two runs of P/2 blocks each.
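A small sketch (ours) of the throughput that Lemma 13 predicts as the number of clients k varies:

import math

def adaptive_query_steps(N, P, B, k):
    # Lemma 13: with size-PB nodes in a vEB layout, each of k <= P clients
    # gets P/k IOs per step, so a query takes ~log_{PB/k}(N) steps.
    return math.log(N, (P * B) / k)

N, B, P = 2**30, 2**10, 8
for k in [1, 2, 4, 8]:
    steps = adaptive_query_steps(N, P, B, k)
    print(k, round(steps, 2), round(k / steps, 2))   # latency, throughput
# k = 1 matches the single-client optimum with size-PB nodes, and k = P
# matches the many-client throughput P / log_B(N) of size-B nodes.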
This basic strategy extends to other cache-oblivious data structures; see, e.g., [11, 20] for write-optimized examples. The PDAM explains how these structures can always make maximal use of device parallelism, even as the number of concurrent threads changes dynamically.

As this and earlier examples illustrate, seemingly small changes in the DAM model have substantial performance and design implications. The more accurate computational models that we consider are more predictive of software practice. We posit that these models are an essential tool for algorithmists seeking to design new algorithms for IO-bound workloads. The simplicity of the DAM model has let important design considerations slip through the cracks for decades.
ACKNOWLEDGMENTS

We gratefully acknowledge support from NSF grants CCF 805476, CCF 822388, CNS 1408695, CNS 1755615, CCF 1439084, CCF 1725543, CSR 1763680, CCF 1716252, CCF 1617618, IIS 1247726, and from VMware and a NetApp Faculty Fellowship.

REFERENCES
[1] [n. d.]. mkfs.btrfs Manual Page. https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs, Last Accessed Sep. 26, 2018.
[2] Alok Aggarwal and Jeffrey Scott Vitter. 1988. The Input/Output Complexity of Sorting and Related Problems. Commun. ACM 31, 9 (Sept. 1988), 1116–1127. https://doi.org/10.1145/48529.48535
[3] Matthew Andrews, Michael A. Bender, and Lisa Zhang. 2002. New Algorithms for Disk Scheduling. Algorithmica 32, 2 (2002), 277–301. https://doi.org/10.1007/s00453-001-0071-1
[4] Lars Arge. 2002. External Memory Geometric Data Structures. Lecture notes of EEF Summer School on Massive Data Sets, Aarhus (2002).
[5] Microsoft Azure. 2016. How to use batching to improve SQL Database application performance. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-use-batching-to-improve-performance.
[6] Rudolf Bayer and Edward M. McCreight. 1972. Organization and Maintenance of Large Ordered Indexes. Acta Informatica 1, 3 (Feb. 1972), 173–189. https://doi.org/10.1145/1734663.1734671
[7] Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, Charles McGuffey, and Julian Shun. 2016. Parallel Algorithms for Asymmetric Read-Write Costs. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 145–156. https://doi.org/10.1145/2935764.2935767
[8] Michael A. Bender, Jon Berry, Rob Johnson, Thomas M. Kroeger, Samuel McCauley, Cynthia A. Phillips, Bertrand Simon, Shikha Singh, and David Zage. 2016. Anti-Persistence on Persistent Storage: History-Independent Sparse Tables and Dictionaries. In Proceedings of the 35th ACM Symposium on Principles of Database Systems (PODS). 289–302.
[9] Michael A. Bender, Jake Christensen, Alex Conway, Martin Farach-Colton, Rob Johnson, and Meng-Tsung Tsai. 2019. Optimal Ball Recycling. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2527–2546. https://doi.org/10.1137/1.9781611975482.155
[10] Michael A. Bender, Erik Demaine, and Martin Farach-Colton. 2005. Cache-Oblivious B-Trees. SIAM J. Comput. 35, 2 (2005), 341–358.
[11] Michael A. Bender, Martin Farach-Colton, Jeremy T. Fineman, Yonatan R. Fogel, Bradley C. Kuszmaul, and Jelani Nelson. 2007. Cache-Oblivious Streaming B-trees. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA). 81–92. https://doi.org/10.1145/1248377.1248393
[12] Michael A. Bender, Martin Farach-Colton, Mayank Goswami, Rob Johnson, Samuel McCauley, and Shikha Singh. 2018. Bloom Filters, Adaptivity, and the Dictionary Problem. In Proceedings of the IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). 182–193.
[13] Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Yang Zhan. 2015. An Introduction to Bε-Trees and Write-Optimization. ;login: magazine 40, 5 (October 2015), 22–28.
[14] Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. 2012. Don't Thrash: How to Cache Your Hash on Flash. Proceedings of the VLDB Endowment 5, 11 (2012), 1627–1637.
[15] Michael A. Bender, Martin Farach-Colton, Rob Johnson, Simon Mauras, Tyler Mayer, Cynthia A. Phillips, and Helen Xu. 2017. Write-Optimized Skip Lists. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS). 69–78. https://doi.org/10.1145/3034786.3056117
[16] Michael A. Bender, Martin Farach-Colton, and Bradley Kuszmaul. 2006. Cache-Oblivious String B-Trees. In Proceedings of the 25th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS). 233–242.
[17] Michael A. Bender, Martín Farach-Colton, and William Kuszmaul. 2019. Achieving Optimal Backlog in Multi-Processor Cup Games. In Proceedings of the 51st Annual ACM Symposium on the Theory of Computing (STOC).
[18] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, and Julian Shun. 2016. Efficient Algorithms with Asymmetric Read and Write Costs. In Proceedings of the 24th Annual European Symposium on Algorithms (ESA). 14:1–14:18. https://doi.org/10.4230/LIPIcs.ESA.2016.0
[19] Guy E. Blelloch, Phillip B. Gibbons, Yan Gu, Charles McGuffey, and Julian Shun. 2018. The Parallel Persistent Memory Model. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA). 247–258. https://doi.org/10.1145/3210377.3210381
[20] Gerth S. Brodal, Erik D. Demaine, Jeremy T. Fineman, John Iacono, Stefan Langerman, and J. Ian Munro. 2010. Cache-Oblivious Dynamic Dictionaries with Update/Query Tradeoffs. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 1448–1456.
[21] Gerth S. Brodal and Rolf Fagerberg. 2003. Lower Bounds for External Memory Dictionaries. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 546–554.
[22] Adam L. Buchsbaum, Michael H. Goldwasser, Suresh Venkatasubramanian, and Jeffery Westbrook. 2000. On External Memory Graph Traversal. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 859–860.
[23] Mark Callaghan. 2011. Something awesome in InnoDB – the insert buffer. https://www.facebook.com/notes/mysql-at-facebook/something-awesome-in-innodb-the-insert-buffer/492969385932/.
[24] Mustafa Canim, Christian A. Lang, George A. Mihaila, and Kenneth A. Ross. 2010. Buffered Bloom Filters on Solid State Storage. In International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS).
[25] Feng Chen, Binbing Hou, and Rubao Lee. 2016. Internal Parallelism of Flash Memory-Based Solid-State Drives. ACM Transactions on Storage (TOS) 12, 3, Article 13 (May 2016), 39 pages. https://doi.org/10.1145/2818376
[26] Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, and Gary Valentin. 2002. Fractal Prefetching B∗-Trees: Optimizing Both Cache and Disk Performance. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 157–168.
[27] Douglas Comer. 1979. The Ubiquitous B-Tree. ACM Computing Surveys 11, 2 (June 1979), 121–137.
[28] Alexander Conway, Ainesh Bakshi, Yizheng Jiao, William Jannen, Yang Zhan, Jun Yuan, Michael A. Bender, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, and Martin Farach-Colton. 2017. File Systems Fated for Senescence? Nonsense, Says Science!. In 15th USENIX Conference on File and Storage Technologies (FAST). 45–58.
[29] Alex Conway, Ainesh Bakshi, Yizheng Jiao, Yang Zhan, Michael A. Bender, William Jannen, Rob Johnson, Bradley C. Kuszmaul, Donald E. Porter, Jun Yuan, and Martin Farach-Colton. 2017. How to Fragment Your File System. ;login: 42, 2 (2017). https://www.usenix.org/publications/login/summer2017/conway
[30] Alexander Conway, Martin Farach-Colton, and Philip Shilane. 2018. Optimal Hashing in External Memory. In Proceedings of the 45th International Colloquium on Automata, Languages and Programming (ICALP). 39:1–39:14. https://doi.org/10.4230/LIPIcs.ICALP.2018.39
[31] Alex Conway, Eric Knorr, Yizheng Jiao, Michael A. Bender, William Jannen, Rob Johnson, Donald E. Porter, and Martin Farach-Colton. 2019. Filesystem Aging: It's More Usage than Fullness. In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage).
[32] Peter Desnoyers. 2013. What Systems Researchers Need to Know about NAND Flash. In Proceedings of the 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage).
[33] John Esmet, Michael A. Bender, Martin Farach-Colton, and Bradley C. Kuszmaul. 2012. The TokuFS Streaming File System. In Proceedings of the 4th USENIX Conference on Hot Topics in Storage and File Systems (HotStorage). 14.
[34] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. 2012. Cache-Oblivious Algorithms. ACM Transactions on Algorithms (TALG) 8, 1 (2012), 4.
[35] Pedram Ghodsnia, Ivan T. Bowman, and Anisoara Nica. 2014. Parallel I/O Aware Query Optimization. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 349–360. https://doi.org/10.1145/2588555.2595635
[36] Google, Inc. [n. d.]. LevelDB: A fast and lightweight key/value database library by Google. https://github.com/google/leveldb, Last Accessed Sep. 26, 2018.
[37] Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. The Unwritten Contract of Solid State Drives. In Proceedings of the Twelfth European Conference on Computer Systems (EuroSys). 127–144.
[38] IBM. 2017. Buffered inserts in partitioned database environments. https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.apdv.embed.doc/doc/c0061906.html.
[39] IBM Informix. [n. d.]. Understanding SQL insert cursors. https://www.ibm.com/support/knowledgecenter/en/SSBJG3_2.5.0/com.ibm.gen_busug.doc/c_fgl_InsertCursors_002.htm
[40] Riko Jacob and Nodari Sitchinava. 2017. Lower Bounds in the Asymmetric External Memory Model. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 247–254. https://doi.org/10.1145/3087556.3087583
[41] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. 2015. BetrFS: A Right-Optimized Write-Optimized File System. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST). 301–315.
[42] Chris Jermaine, Anindya Datta, and Edward Omiecinski. 1999. A Novel Index Supporting High Volume Data Warehouse Insertion. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB). 235–246. http://www.vldb.org/conf/1999/P23.pdf
[43] Bradley C. Kuszmaul. 2009. How Fractal Trees Work. In OpenSQL Camp. Portland, OR, USA. An expanded version was presented at the MySQL User Conference, Santa Clara, CA, USA, April 2010.
[44] Amanda McPherson. [n. d.]. A Conversation with Chris Mason on Btrfs: the next generation file system for Linux. https://www.linuxfoundation.org/blog/2009/06/a-conversation-with-chris-mason-on-btrfs/, Last Accessed Sep. 26, 2018.
[45] MySQL 5.7 Reference Manual. [n. d.]. Chapter 15 The InnoDB Storage Engine. http://dev.mysql.com/doc/refman/5.7/en/innodb-storage-engine.html.
[46] NuDB. 2016. NuDB: A fast key/value insert-only database for SSD drives in C++11. https://github.com/vinniefalco/NuDB.
[47] Patrick O'Neil, Edward Cheng, Dieter Gawlic, and Elizabeth O'Neil. 1996. The Log-Structured Merge-Tree (LSM-tree). Acta Informatica 33, 4 (1996), 351–385. https://doi.org/10.1007/s002360050048
[48] Oracle. 2017. Tuning the Database Buffer Cache. https://docs.oracle.com/database/121/TGDBA/tune_buffer_cache.htm.
[49] Oracle Corporation. [n. d.]. MySQL 5.5 Reference Manual. https://dev.mysql.com/doc/refman/5.5/en/innodb-file-space.html, Last Accessed Sep. 26, 2018.
[50] Oracle Corporation. 2015. Oracle BerkeleyDB Reference Guide. http://sepp.oetiker.ch/subversion-1.5.4-rp/ref/am_conf/pagesize.html, Last Accessed August 12, 2015.
[51] Oracle Corporation. 2016. Setting Up Your Data Warehouse System. https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_system.htm.
[52] Anastasios Papagiannis, Giorgos Saloustros, Pilar González-Férez, and Angelos Bilas. 2016. Tucana: Design and Implementation of a Fast and Efficient Scale-up Key-value Store. In Proceedings of the USENIX 2016 Annual Technical Conference (USENIX ATC). 537–550.
[53] John Paul. [n. d.]. Teradata Thoughts. http://teradata-thoughts.blogspot.com/2013/10/teradata-13-vs-teradata-14_20.html, Last Accessed Sep. 26, 2018.
[54] Harald Prokop. 1999. Cache-Oblivious Algorithms. Master's thesis. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
[55] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP). 497–514. https://doi.org/10.1145/3132747.3132765
[56] Mendel Rosenblum and John K. Ousterhout. 1992. The Design and Implementation of a Log-structured File System. ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26–52. https://doi.org/10.1145/146941.146943
[57] C. Ruemmler and J. Wilkes. 1994. An introduction to disk drive modeling. IEEE Computer 27, 3 (1994), 17–29.
[58] SAP. 2017. RLV Data Store for Write-Optimized Storage. http://help-legacy.sap.com/saphelp_iq1611_iqnfs/helpdata/en/a3/13783784f21015bf03c9b06ad16fc0/content.htm.
[59] Keith A. Smith and Margo I. Seltzer. 1997. File System Aging — Increasing the Relevance of File System Benchmarks. In Measurement and Modeling of Computer Systems. 203–213.
[60] TokuDB. [n. d.]. https://github.com/percona/PerconaFT, Last Accessed Sep. 24, 2018.
[61] Tokutek, Inc. [n. d.]. TokuMX—MongoDB Performance Engine. https://www.percona.com/software/mongo-database/percona-tokumx, Last Accessed Sep. 26, 2018.
[62] Tokutek, Inc. 2013. TokuDB: MySQL Performance, MariaDB Performance. http://www.tokutek.com/products/tokudb-for-mysql/.
[63] Vertica. 2017. WOS (Write Optimized Store). https://my.vertica.com/docs/7.1.x/HTML/Content/Authoring/Glossary/WOSWriteOptimizedStore.htm.
[64] Jeffrey Scott Vitter. 2001. External memory algorithms and data structures: Dealing with massive data. ACM Computing Surveys (CSUR) 33, 2 (2001), 209–271.
[65] James Christopher Wyllie. 1979. The Complexity of Parallel Computations. Ph.D. Dissertation. Cornell University, Ithaca, NY, USA. AAI8004008.
[66] Jimmy Xiang. 2012. Apache HBase Write Path. http://blog.cloudera.com/blog/2012/06/hbase-write-path/.
[67] Jun Yuan, Yang Zhan, William Jannen, Prashant Pandey, Amogh Akshintala, Kanchan Chandnani, Pooja Deo, Zardosht Kasheff, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. 2016. Optimizing Every Operation in a Write-optimized File System. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST). 1–14.
[68] Jun Yuan, Yang Zhan, William Jannen, Prashant Pandey, Amogh Akshintala, Kanchan Chandnani, Pooja Deo, Zardosht Kasheff, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. 2017. Writes Wrought Right, and Other Adventures in File System Optimization. ACM Transactions on Storage (TOS) 13, 1 (2017), 3:1–3:26.
[69] Yang Zhan, Alexander Conway, Yizheng Jiao, Eric Knorr, Michael A. Bender, Martin Farach-Colton, William Jannen, Rob Johnson, Donald E. Porter, and Jun Yuan. 2018. The Full Path to Full-Path Indexing. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST). 123–138.