Flushing Without Cascades

Michael A. Bender¹ Rathish Das¹ Martín Farach-Colton² Rob Johnson³ William Kuszmaul⁴

Abstract

Buffer-and-flush is a technique for transforming standard external-memory search trees into write-optimized search trees. In exchange for faster amortized insertions, buffer-and-flush can sometimes significantly increase the latency of operations by causing cascades of flushes. In this paper, we show that flushing cascades are not a fundamental consequence of the buffer-flushing technique, and can be removed entirely using randomization techniques.

The underlying implementation of buffer flushing relies on a buffer-eviction strategy at each node in the tree. The ability for the user to select the buffer-eviction strategy based on the workload has been shown to be important for performance, both in theory and in practice.

In order to support arbitrary buffer-eviction strategies, we introduce the notion of a universal flush, which uses a universal eviction policy that can simulate any other eviction policy. This abstracts away the underlying eviction strategy, even allowing for workload-specific strategies that change dynamically.

Our deamortization preserves the amortized throughput of the underlying flushing strategy on all workloads. In particular, with our deamortization and a node cache of size poly-logarithmic in the number of insertions performed on the tree, the amortized insertion cost matches the lower bound of Brodal and Fagerberg. For typical parameters, the lower bound is less than 1 I/O per insertion. For such parameters, our worst-case insertion cost is O(1) I/Os.

¹ Stony Brook University, Stony Brook, NY, USA. Email: {bender,radas}@cs.stonybrook.edu.
² Rutgers University, New Brunswick, NJ, USA. Email: [email protected].
³ VMWare Research, Palo Alto, CA, USA. Email: [email protected].
⁴ MIT, Cambridge, MA, USA. Email: [email protected]. This work was supported by a Fannie & John Hertz Foundation Fellowship, an NSF GRFP Fellowship, NSF grants CNS 1408695, CNS 1755615, CCF 1439084, CCF 1725543, CSR 1763680, CCF 1716252, CCF 1617618, CCF 1533644, CSR 1938180, CCF 1715777, and CCF 1712716, and by Sandia National Laboratories.

1 Introduction

Storage systems—including file systems [20, 29, 30, 39, 44, 47] as well as SQL and NoSQL databases [4, 15, 25, 28, 34, 40, 45, 46]—have been transformed over the last decade by new high-performance external-memory data structures. These write-optimized dictionaries (WODs) include log-structured merge trees (LSMs) [8, 38, 43], Bε-trees [9, 14], write-optimized skip lists [10], COLAs [8], and xDicts [13].

The primary performance advantage of WODs over older data structures, such as B-trees [6, 16], is their insertion throughput. Some WODs also support asymptotically optimal point queries, thus matching a B-tree's optimal point-query performance while far exceeding a B-tree's insert performance. WODs also support deletions via delete messages, which are special values indicating that the associated key has been deleted.

Despite the profusion of WODs [8–10, 13, 14, 23, 26, 27, 38, 43], most share the same overall organization. These structures are partitioned into sorted levels that grow exponentially in size. New key-value pairs are inserted into the top level. As new elements arrive, elements are flushed (moved down) from one level to the next. Upserts (inserts, deletes, and updates) are fast because each I/O simultaneously flushes many elements down one level, leading to a small amortized I/O cost per upsert. Indeed, in the common case when one I/O flushes at least a superlogarithmic number of elements, the amortized I/O cost per upsert is o(1). A key search now needs to look in multiple locations, but this need not affect its asymptotic I/O cost.

WODs vary primarily in two ways: their indexing structures and, more significantly, their flushing policies. The choice of indexing structure can have an effect on query performance, but maintaining the indexing structure is typically a lower-order term compared with the cost of upserts; see Section 6 for details. Flushing policies, on the other hand, can have a major impact on upsert performance.

Practitioners have explored a large number of flushing policies in order to optimize for different performance criteria. For example, the LSM community has developed a dizzying array of so-called "compaction strategies," which govern how the data structure chooses which runs of elements get merged together. Example compaction strategies include leveled [21, 23, 27], leveled-N [21], tiered [21], tiered+leveled [21], size-tiered [23, 26], FIFO [21], time-windowed [42], eager [12], adaptive [12], lazy [17], and fluid [17]. These policies make various trade-offs between the I/O and CPU complexity of upserts and queries, the amount of space used on disk, the level of concurrency supported by the key-value store, the amount of cache, and many other concerns. Furthermore, all of these compaction strategies can be combined with partitioned compaction [36], which attempts to partially deamortize compaction (albeit without guarantees) and to make the compactions adaptive to the upsert workload. Bε-trees can also use a variety of different flushing policies, including flush-all, greedy [7], round-robin, random-ball [7], and random.

A major drawback of WODs is that they generally lack latency guarantees. Considerable effort has been devoted to reducing the latency of WODs in practice through partitioning [36] and by performing flushing on background threads [21], but less so in theory [8]. Indeed, most of the authors of this paper have substantial engineering experience in reducing WOD latencies in the field. For example, Tokutek pushed to reduce the latency of TokuDB [45] and to get rid of periods of lower throughput [31]. But TokuDB does not have provable latency guarantees.

Latency is a paramount concern but, as the examples above illustrate, latency reduction must contend with other design goals.¹ Consequently, it is not enough to deamortize a particular flushing rule that is being used in a particular version of a particular system.

¹ For example, the deamortized COLA [8] offers good (although suboptimal) worst-case latency, but the flushing policy sacrifices throughput for the sake of latency. Specifically, the performance of a sequential-insertion workload gets throttled back to that of a random-insertion workload. This is inconsistent with a system that optimizes for common-case workloads.

This paper: flushing with strong latency guarantees. We present a randomized construction that enables a wide class of flushing rules to be deamortized. Our deamortization preserves the instance-specific I/O cost of the underlying buffer-flushing strategy, i.e., if the original strategy incurs C I/Os on some workload W at a node x, then our deamortized version will incur O(C) I/Os on the same workload. For concreteness, we describe our results in terms of Bε-trees, and in Section 6 we explain how our results apply to other WODs.

1.1 Flushing in Bε-trees

We review flushing in Bε-trees and why they provide almost no guarantee on insertion latency.

Bε-trees. The Bε-tree [9, 14] is like a B-tree in that it has nodes of size B, where B is the I/O (i.e., block-transfer) size. Leaves all have the same depth and key-value pairs are stored in the leaves. Balance is maintained via splits and merges, similar to a B-tree. Like a B-tree, Bε-trees support inserts, deletes, point queries, and successor/predecessor queries.

Bε-trees have a fanout of Θ(B^ε), where the constant ε (0 < ε < 1) is a tunable parameter. Thus, a tree with N key-value pairs will consist of n = Θ(N/B) nodes and have height O((1/ε) log_B N), which is O(log_B N), since ε = Θ(1). Only O(B^ε) space in an internal node is needed for pivots, so the rest of the space is used for a buffer. Buffers batch up insertion and deletion messages as they work their way down the tree. Searches are as in a B-tree, except that we need to check buffers along the way for any pending insertion or deletion messages. Since each node, including its buffer, is stored in a single block, searches cost O(log_B N) I/Os. Successor and predecessor queries must search in all buffers along the current root-to-leaf path to find the next/previous key. Performing a sequence of k successor or predecessor queries costs O(k/B + log_B N) I/Os.

New messages are queued in the buffer at the root of the tree. In a standard (amortized) Bε-tree, whenever a buffer overflows (contains B or more messages), we perform a buffer flush, i.e., we move some number of messages from the node's buffer to the buffers of the node's children. When a delete message reaches a leaf, all older key-value pairs for that key are deleted from the leaf, and the delete message is discarded. When an insertion message reaches a leaf, it replaces any existing key-value pair for that key.

The main design question is: how many messages should be moved in a flush, and to which children? The simplest policy is to flush everything. With this policy, Ω(B) messages get moved one level down the tree in a single flush. At most B^ε children need to be accessed, and each access costs O(1) I/Os. Thus, the amortized cost to move one message down one level is O(1/B^(1−ε)). Since the height of the tree is O(log_B N), the amortized message insertion cost is O((log_B N)/B^(1−ε)) I/Os, which is the best possible on a random insertion workload [14].

Many flushing strategies. There are several flushing strategies that can improve upon the performance of the simple strategy given above without sacrificing its good amortized performance guarantees. For non-random (e.g., skewed) insertion workloads, partial flushes can give substantial speedups.

The speedup of a Bε-tree is directly proportional to the average number of messages moved per I/O in a flush. So the goal is to push as many messages down per I/O as possible.

For example, the greedy policy flushes messages only to the single child to which the most buffered messages are directed, and obtains the same bounds as above. Greedy is never worse than the flush-everything policy and, for a purely random workload, will improve throughput by a factor of 2 on average [7]. However, there are some workloads where greedy will do no better than the flush-everything policy, but a "random ball" [7] policy achieves an amortized insertion cost of O((log_B N)/B) I/Os, which is asymptotically faster.

High latency and flushing cascades. In the context of Bε-trees, we can see why WODs generally have large latencies
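To make the buffer-flushing mechanics above concrete, here is a toy Python model of a tree using the flush-everything policy. The class names, the fixed two-level tree shape, and the toy block size are illustrative choices of ours, not from the paper; note how a flush into a child can recursively trigger flushes below it, which is the cascading behavior this paper studies:

```python
# Toy model of buffer-and-flush with the flush-everything policy.
# All names and parameters here are illustrative, not from the paper.

B = 8  # toy block size (messages per buffer)

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []      # pivots[i] separates children[i] from children[i+1]
        self.children = children or []  # empty list for leaves
        self.buffer = []                # pending (key, value) insert messages
        self.records = {}               # key -> value, used only at leaves

    def is_leaf(self):
        return not self.children

    def child_for(self, key):
        i = 0
        while i < len(self.pivots) and key >= self.pivots[i]:
            i += 1
        return self.children[i]

class BETree:
    def __init__(self, root):
        self.root = root

    def insert(self, key, value):
        self.root.buffer.append((key, value))
        if len(self.root.buffer) >= B:
            self.flush_all(self.root)

    def flush_all(self, node):
        # Flush-everything: move every buffered message one level down.
        if node.is_leaf():
            node.records.update(dict(node.buffer))  # later messages win
            node.buffer = []
            return
        for msg in node.buffer:
            node.child_for(msg[0]).buffer.append(msg)
        node.buffer = []
        for child in node.children:
            if len(child.buffer) >= B:   # a cascade can propagate here
                self.flush_all(child)

    def query(self, key):
        node = self.root
        while True:
            # Check pending messages on the root-to-leaf path, newest first.
            for k, v in reversed(node.buffer):
                if k == key:
                    return v
            if node.is_leaf():
                return node.records.get(key)
            node = node.child_for(key)

leaves = [Node(), Node()]
tree = BETree(Node(pivots=[100], children=leaves))
for i in range(20):
    tree.insert(i, i * i)
```

Queries check the buffers along the root-to-leaf path before consulting the leaf, mirroring the search procedure described above.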

Copyright © 2020 by SIAM. Unauthorized reproduction of this article is prohibited.

for upsert workloads—upserts can trigger flushing cascades. A flushing cascade occurs when a flush at one node triggers flushes in multiple of the node's children; these flushes may, in turn, trigger flushes in the grandchildren, etc.

Structural changes to the Bε-tree are another major source of flushing cascades. Regardless of the eviction strategy, when two nodes are merged into a new node, the resulting buffer can overflow, triggering new flushes, which again cause cascades (and more node merges). The result is that a single insert or delete could trigger modifications to Ω(n^(1−o(1))) nodes of the tree.

One could attempt to mitigate cascades by always flushing exactly B^(1−ε) elements from parent to child, but this discards the potential performance gains of sophisticated flushing policies, such as greedy or random ball. Furthermore, as explained above, it does not even work when the workload contains deletions and the structure of the tree changes.

Informally, there appears to be a tension between throughput and latency. Whenever a flushing policy succeeds in moving a large number of elements to a child, that child becomes more likely to need to flush to multiple grandchildren, causing a cascade.

Results. We show that there is no inherent trade-off between flushing-policy throughput and latency. We give a randomized algorithm that eliminates flushing cascades with high probability, for any buffer-eviction policy.

To understand why this is inherently an algorithmic issue, note that there is a small amount of slack in when we are forced to perform a buffer flush. In particular, we are free to defer some flushes, letting some nodes grow up to a constant factor larger than B and flushing others a little early, without asymptotically harming insertion or query performance. We could even allow for very small (only constant-sized) cascades.

We model the problem of deamortizing an arbitrary flushing policy as a game between a deamortizer and a buffer-eviction policy. In each round, the deamortizer picks which node to flush. Then the buffer-eviction policy chooses which and how many messages to flush from that node. By fully deamortizing an adversarial flushing strategy, we deamortize them all. This is important because a flushing strategy may be benevolent in terms of write performance but adversarial in terms of latency.

We obtain the following performance bounds. Let N_t be the size of the data structure, in terms of messages, after the t-th operation, and let N_max = max_t N_t. Then the t-th insert/delete costs O(⌈(log_B N_t)/B^(1−ε)⌉) I/Os with high probability in N_max, which is O(1) for typical values of B and N. This improvement to worst-case insertion performance does not come at a cost in amortized insertion performance: the data structure deterministically (i.e., with probability 1) achieves the same query and amortized insertion performance as a Bε-tree.

Our results support the use of an arbitrary message-eviction strategy in each node. In order to establish this, we introduce the notion of a universal flush, a buffer-eviction strategy that we prove can simulate any other eviction strategy. Universal flushes provide a layer of abstraction between the data structure and the underlying eviction strategy.

However, with essentially any flushing strategy (greedy, flush-all, universal), the constant amount of slack in each node is nowhere close to enough for deterministic bounds. For that matter, neither is randomization alone. In the rest of this paper, we show that by randomizing when flushes happen, caching a small number of nodes, randomizing the rebalances slightly, and performing an asymptotically negligible number of "cleanup" I/Os, we can obtain our optimality bounds.

Paper outline. In Section 2, we present an overview of the main technical ideas in the paper. In Section 3.1, we define universal flushes, and use them to simulate arbitrary buffer-eviction policies. In Sections 3 and 4, we present technical preliminaries and a simplified version of the data structure that supports only static trees, in which the only write operations are updates. Finally, in Section 5, we present the full version of the data structure, which supports arbitrary insert/delete/update operations.

2 Technical Overview

We now give an overview of some of the main technical ideas in the paper. Let N denote the size of the tree. (When being formal in later sections, we instead use N_max, the maximum size of the tree over all time.) Assume throughout that the block size B is at least c log N for a constant c of our choice, that the internal memory M is of size at least log^c N blocks, and that the internal memory satisfies the tall-cache assumption, meaning that M is at least cB².

As a convention, we say a node x is in the first level of a Bε-tree T if x is a leaf, and otherwise we say that x is at level ℓ + 1, where ℓ is the level of x's children. This convention of indexing levels from the bottom of the tree (rather than the top) is particularly natural for Bε-trees, since structurally new levels form when the node at the top of the tree splits.

Simulating eviction strategies with universal flushing. In Section 3.1, we present the definition of a universal flush, and show how to use universal flushes in order to simulate arbitrary buffer-eviction strategies.

A universal flush on a node x can be performed whenever the node's buffer contains 3B or more messages. The universal flush is required to select exactly B messages to evict from that node. We show that, for any buffer-eviction strategy A, one can construct an implementation B of universal flushing that simulates A efficiently. In particular, the number of I/Os expended by B to manage a stream of messages S to a buffer of size 3B is, in the worst case, at most twice the number of I/Os that would be expended by A to manage the same stream S of messages to a buffer of size B.

Since each universal flush at the root of the tree evicts B elements, we can always spread the I/Os for each universal flush across the next B insert/update/delete operations. The I/Os associated with a single universal flush at the root (and the subsequent flushes it induces lower in the tree) are referred to as a flushing round. In general, our goal is to implement flushing rounds using at most O(log_B N) universal flushes. Since each universal flush requires at most O(B^ε) I/Os, and since each universal flush is spread across B insert/update/delete operations, this results in a worst-case per-operation running time of O(⌈B^ε · log_B N / B⌉) = O(⌈log_B N / B^(1−ε)⌉), as desired.

Eliminating flushing cascades in a static tree. We begin by focusing on the simpler problem of eliminating flushing cascades from a tree whose node-structure is static. One can think of such a tree as supporting updates only, and no insertions or deletions.

The first algorithm we consider is called the aggressive randomized flushing algorithm, which works as follows. Whenever a new node x is created, a random threshold r_x is selected uniformly at random from {0, 1, ..., B − 1}, and r_x dummy messages are placed in x's buffer. The algorithm then follows the simple rule of performing a universal flush on each node x whenever the size of x's buffer is 3B or larger.

The purpose of the r_x dummy messages is to randomize the number of messages modulo B in each node x's buffer, also known as the B-signature of node x. Because universal flushes always remove exactly B messages from x's buffer, they do not affect the B-signature. In particular, at any given point in time, the B-signature is given by r_x + u_x mod B, where u_x is the number of messages to have ever been flushed into node x. Since r_x and u_x are independent, and since r_x is uniformly random in {0, 1, ..., B − 1}, it follows that, at any given point in time, the B-signature at each node x is uniformly random in {0, 1, ..., B − 1}.

The randomness of the B-signatures of the nodes x at each level ℓ ensures that, whenever j ≤ B messages are flushed into node x, there is probability exactly j/B that the size of x's buffer crosses a multiple of B. More generally, suppose that during a sequence of k flushing rounds, a total of j_ℓ messages are flushed into level ℓ from level ℓ + 1. We show that the total number of flushes performed at level ℓ will then be a sum of independent indicator random variables with total mean j_ℓ/B. If j_ℓ is sufficiently large in polylog(N) · B, then with high probability in N, the number of messages j_{ℓ−1} flushed from level ℓ to level ℓ − 1 during the k flushing rounds will be at most (1 + 1/log_B N) · j_ℓ. By applying this fact to every level ℓ, we see that during any sequence of k flushing rounds with k sufficiently large in polylog(N), the total number of flushes performed at ℓ levels below the root is at most (1 + 1/log N)^ℓ · k ≤ O(k) for all ℓ ∈ [O(log N)].

The aggressive randomized flushing algorithm achieves low latency for batches of B · polylog(N) consecutive operations, but fails to give a worst-case bound for any individual operation. To achieve a worst-case bound, we take motivation from a recent algorithm for what is known as the single-processor cup game [5, 11, 18, 19, 35]. At the beginning of the single-processor cup game, n cups are initially empty. At each step of the cup game, a filler distributes 1 new unit of water among the n cups, and then an emptier selects one cup out of which to remove 1 + ε units of water, for some resource-augmentation parameter ε. The emptier's goal is to minimize the backlog of the system, which is defined to be the amount of water in the fullest cup. It has been known for decades that the greedy-emptying algorithm (of emptying from the fullest cup) achieves backlog O(log n). In recent work [11], we showed that if the filler is an oblivious adversary (rather than an adaptive one), then a randomized emptying strategy can do significantly better: if ε ≥ 1/polylog(n), then the randomized algorithm achieves backlog O(log log n) with high probability in n. The algorithm works by first inserting a random amount of water r_x between 0 and 1 + ε into each cup x at the beginning of the game, and then, after each step, removing 1 + ε units of water from the fullest cup; if, however, the fullest cup contains less than 1 + ε units of water, then no water is removed from the cup. (This is so that the amount of water modulo 1 + ε in each cup x remains uniformly random as a function of r_x.)

To a first approximation, one can treat each level ℓ of a Bε-tree as its own cup game in which each buffer represents a cup. This suggests the following algorithm, which we call the lazy randomized flushing algorithm: As before, whenever a node is created, we insert a random number r_x ∈ {0, 1, ..., B − 1} of dummy messages into that node's buffer. Then, during each flushing round, we perform at most one flush at each level ℓ, prioritizing the fullest buffer at the level (and flushing from the buffer only if it contains 3B or more messages).

Although the lazy randomized flushing algorithm attempts to treat each level ℓ in the tree as its own cup game, the cup game is missing a key ingredient: resource augmentation. In order to create the (1 + ε) resource augmentation needed, we modify the Bε-tree so that each level ℓ uses a slightly larger block size B_ℓ than does its parent level ℓ + 1. In particular, we set B_1 = Θ(B), and B_{ℓ+1} = B_ℓ − ⌈B/ℓ²⌉. This simultaneously guarantees that B_ℓ = Θ(B) for all ℓ ≤ B, while also guaranteeing that B_ℓ ≥ B_{ℓ+1} · (1 + 1/log² N) for all ℓ. The increase in block size by a factor of 1 + 1/polylog(N) simulates resource augmentation between the corresponding cup games.

Concurrent work to ours presents a randomized algorithm for the single-processor cup game that achieves bounds

on backlog without the use of resource augmentation [33]. As we shall see, however, the resource augmentation used by the lazy flushing algorithm also grants the algorithm additional behavioral properties that will be useful (and, in fact, necessary) in designing our final flushing algorithm.

Whereas the aggressive randomized flushing algorithm deterministically bounds the size of each buffer to O(B) (but may sometimes perform many universal flushes), the lazy algorithm deterministically bounds the number of universal flushes (but may sometimes allow buffers to grow large). Even allowing buffers to have size as large as B log log N can have disastrous performance consequences in the corresponding Bε-tree level, however. In particular, lookups may in the worst case require log log N · log_B N I/Os, which, if B = polylog(N), would result in worst-case read performance of Ω(log N). Thus even a buffer size of Θ(B log log N) is unacceptable.

To handle the large buffers caused by the lazy algorithm, we exploit a second property of the lazy randomized flushing algorithm. Although the algorithm allows for buffers to become overfilled (containing a maximum of O(B log log N) messages), the total number of overfilled buffers at any given moment is at most polylog(N) with high probability in N. In fact, a stronger property is also true with high probability: the sum of the sizes of the overfilled buffers is at most polylog(N) · B. This means that we can maintain an in-memory cache of size polylog(N) · B storing all the contents of any overfilled buffers. By referencing the in-memory cache, operations that access an overfilled buffer need not pay more than O(1) I/Os in order to perform the access.

Building upon the ideas outlined above, Section 4 designs a flushing strategy for the static Bε-tree that completely eliminates flushing cascades. The key challenge then becomes to adapt this strategy to dynamic Bε-trees. This requires us not only to handle flushing cascades due to structural changes, but also to modify the lazy randomized flushing strategy in order to handle difficulties from nodes being merged and split.

We remark that the use of cup-emptying-game techniques in this paper differs significantly from past applications of cup-emptying games to deamortization [2, 3, 18, 19, 22, 24, 32, 37]. Whereas past applications have relied only on backlog guarantees, our application instead relies primarily on what one might normally view as a secondary property of the random-thresholds-based emptying algorithm (specifically, the fact that the algorithm bounds the sum of the sizes of cups that contain more than 1 unit of water). Applying this property to other problems in external-memory data structures and write optimization is therefore an interesting possible direction for future work.

Eliminating cyclic dependencies in dynamic trees. In order for each level of the Bε-tree to be treated as a separate randomized "cup game", it is necessary that the input flushes for a level ℓ (i.e., the flushes performed at level ℓ + 1) are completely independent of the random bits used to determine the flushes at level ℓ. That is, the flushes performed at level ℓ + 1 must be oblivious to the flushes performed at lower levels.

Node merges and splits violate this obliviousness property by introducing cyclic dependencies between levels. The random bits used at level ℓ partially determine which messages make it into lower levels by a given point in time, and thus which messages make it to the leaves by a given point in time. The arrival of messages at the leaves of the Bε-tree determines when (and which) nodes at level 2 of the tree merge and split. The merging and splitting of nodes at level 2 in the tree then, in turn, influences when nodes at level 3 of the tree merge and split, etc. Thus the random bits at level ℓ indirectly influence the set of nodes at level ℓ + 1, which in turn affects the flushing decisions made at level ℓ + 1.

Eliminating the cyclic dependencies requires a node-rebalancing mechanism in which the nodes at a given level ℓ of the tree are a function only of randomness and flushing decisions made at levels ℓ or higher in the tree. Even when splitting a node x into two new nodes x_1 and x_2, the mechanism must not examine node x's children in determining the pivot at which the split should occur, since doing so would introduce a cyclic dependency between levels ℓ and ℓ − 1.

We introduce a node-rebalancing mechanism in which each node x at level ℓ maintains a summary sketch S_ℓ(x) of the contents of the tree T_x rooted at x. The summary sketch simultaneously maintains a size estimate for T_x, and a median estimate, taking the form of a key u ∈ T_x that is guaranteed (with high probability) to have rank between |T_x|/4 and 3|T_x|/4 in tree T_x. Node merges and splits are performed based on the size estimate, ensuring that each node x at level ℓ always has tree-size |T_x| within a constant factor of B^(1−ε) · B^(εℓ); when a node splits, the median estimate u is used as the pivot at which the split occurs. Although node-rebalancing decisions for node x are made independently of x's child set, the weight-based approach to node-rebalancing guarantees that every node has Θ(B^ε) children.

In addition to performing accurate size and median estimation, the summary sketches S_ℓ(x) at each level ℓ must each fit within a single disk block, and be easily composable (meaning, for example, that S_ℓ(x ∪ y) can be constructed from S_ℓ(x) and S_ℓ(y)).

A natural approach to constructing each summary sketch S_ℓ(x) would be to randomly sample the elements of T_x using a hash function, and then to perform size estimation based on the number of sampled elements (and median estimation using the median of the sampled elements). Such an approach can easily be made successful assuming access to a family of hash functions H with a high degree of independence (i.e., N-wise independence). The description bits for

such a family of hash functions would be far too large to fit in memory, however. To solve this, we instead use a collection of B low-independence hash functions, h_1, ..., h_B, each of which performs size and median estimation based on a large constant number of sample-elements from the tree T_x. Each hash function h_i can be shown to perform size and median estimation correctly with probability at least 2/3; by a Chernoff bound, it follows that the median of the size estimates acts as a correct size estimate with high probability, and that the median of the median estimates also works as a correct median estimate with high probability.

When eliminating flushing cascades due to node merges, one additional property of the node-rebalancing mechanism will be required:

• The Breathing-Space Property: Say a node x is touched whenever a universal flush occurs either in x or in one of x's sibling nodes in level ℓ. Whenever a new node x is created at level ℓ, either via a node merge or a node split, the new node x is touched at least Ω(B^(ε(ℓ−1))) times before being again involved in a merge or split.

The Breathing-Space Property can be enforced by the following modification to the summary-sketch-based node-rebalancing mechanism described above: Whenever a node x is created (due to a node split or node merge), the node x is frozen until it has been touched at least B^(ε(ℓ−1))/c times, for some large constant c. The node x is only permitted to take part in merges and splits after the node stops being frozen (even if the summary sketch S_ℓ(x) wishes to perform splits or merges earlier). A key observation is that, whenever the summary sketch S_ℓ(x) for a node x first requires a split or merge, the time that x must wait to become unfrozen (as well as for one of x's siblings to become unfrozen, in the event of a merge) is small enough that the size of node x's tree |T_x| will change by at most a constant fraction during the wait. Thus the Breathing-Space Property can be enforced without compromising the weight-balanced guarantees of the Bε-tree.

Eliminating flushing cascades from node merges. The final problem is to eliminate flushing cascades caused by buffer overflows due to node merges.

One approach would be to eliminate node merges through the use of tombstone records. When a record r is deleted, it is replaced with a tombstone indicating the deletion. Non-tombstone records are gradually copied to a second copy of the tree, which replaces the original tree every O(n) operations.

This global-rebalancing approach is dissatisfying in that it requires the maintenance of two tree structures at a time. In Section 5.2 we show that, by exploiting the techniques already presented above, one can directly support deletions in the tree T without using tombstones.

We begin by describing the approach taken for nodes at levels ℓ > 1/ε + 1.

Whenever two nodes x_1 and x_2 are merged into a new node x at some level ℓ > 1/ε + 1, we consider the new node x to initially have an empty buffer, and the old nodes x_1 and x_2 to be retired (but to still have potentially non-empty buffers). The new node x is deemed the caretaker for the two retired nodes.

Whenever node x or one of its siblings y is flushed out of, an additional trickle flush is performed in each of x_1 and x_2. A trickle flush removes a single message from a node's buffer and passes it to the next level of the Bε-tree.

The Breathing-Space Property ensures that trickle flushes will entirely eliminate both x_1's and x_2's buffers before node x is next involved in either a split or a merge. (Note that, in addition to trickle flushes, "regular" flushes may also be performed in x_1's and x_2's buffers whenever the "cup game" at level ℓ sees fit; this is necessary so that the buffers from retired nodes do not end up clogging the in-memory cache.)

The fact that trickle flushes do not cause additional flushing cascades is a consequence of each level of the tree already being modeled as a "cup game". In particular, the cup game ensures that even if the content that is flushed into a level originates from several nodes at the preceding level (instead of just a single node), this does not increase the risk of flushing cascades.

The strategy outlined above handles flushing cascades at levels ℓ > 1/ε + 1. At lower levels, a similar strategy is used, except with larger trickle flushes consisting of multiple messages (rather than just 1). The larger trickle flushes are needed to compensate for the fact that, when ℓ is small, fewer trickle flushes occur between consecutive node-rebalances. These larger trickle flushes can cause a level ℓ to flush significantly more than B_ℓ messages to level ℓ − 1; we show that this issue can be handled by slightly increasing the resource augmentation between the lower levels of the tree.

3 Preliminaries

The Disk Access Model. Due to the long latencies of accessing data stored on disk drives, algorithms that interact with data stored in external memory are often evaluated in the Disk-Access Model (DAM) [1]. The DAM allows an algorithm to interact with external memory by reading and writing in chunks of B machine words (the quantity B is known as the block size), and defines the running time of an algorithm to be the number of such reads and writes performed, also known as I/Os. Algorithms are typically also allowed to access a small internal memory of some size M for free.

External-memory search trees. In order to exploit the fact

that the block size B is typically quite large, the data on disk is often laid out as a B-tree. A B-tree is a balanced tree (i.e., all leaves have the same depth) in which every node (except possibly the root) stores a list of between B and 3B keys. For internal nodes of the tree, these keys represent the children of the node, and for leaf nodes, these keys are simply Θ(B) consecutive-valued keys in the data structure. Since the Θ(B) keys in each node fit in O(1) blocks, lookups in B-trees can be performed in time proportional to the tree's depth, Θ(log_B N). When B is large, this offers a significant speedup over traditional binary trees.

B-trees can be made to also support insertions in time O(log_B N) by following the rule that whenever the number of keys stored in a node x increases past 3B, the node splits into two nodes, each with 1.5B keys.² Note that this split increases the number of children of x's parent by one, and thus may also recursively trigger a split in x's parent. The only way that a B-tree can increase its depth is when the root node performs a split; this split creates a new root node one level higher.

² Here we describe the simplest node-merging and node-splitting strategies. Often more complex splitting strategies are actually used in order to achieve better constants.

Similarly, deletions can be supported in time O(log_B N) by allowing a node x to merge with one of its neighbors whenever the number of keys stored in x decreases below B. The exception to this rule is the root node r, which never performs a merge and is instead simply removed once its number of children drops to one.

Rather than storing only keys, B-trees are typically used to store key/value pairs, allowing one to store a value associated with each key.

As a convention, we say a node x is in the first level of a tree T if x is a leaf, and otherwise we say that x is at level ℓ + 1, where ℓ is the level of x's children. This convention of indexing levels from the bottom of the tree (rather than the top) is particularly natural for B^ε-trees, since structurally new levels form when the node at the top of the tree splits.

Making writes faster with buffer-flushing techniques. Buffer flushing is a technique that can be used to transform a standard B-tree into a write-optimized data structure. Writes are made faster by buffering together large collections of insert/update/delete operations high in the tree, and then passing down collections of writes together. The first step in doing this is to reduce the number of children at each internal node in the tree to O(B^ε) for some constant 0 < ε < 1. Note that this only affects the depth of the tree by a factor of 1/ε = Θ(1), and thus does not change the asymptotic running time of queries.

Each internal node x stores not only the O(B^ε) children of x, but also a buffer of size O(B) consisting of messages that are meant to be sent to x's children. Insertion/deletion/update operations are performed lazily by placing a message in the buffer of the root node indicating the operation that must occur. Messages percolate down the tree until they reach the leaf node containing the record to which they apply (or the leaf that would contain the record if it were present in the tree), at which point the corresponding insertion/deletion/update is performed. As described in the introduction, buffer flushing can be performed with an arbitrary buffer-eviction scheme; this is formalized in Section 3.1.

Note that, in order for insert/update/delete operations to be performed lazily, those operations cannot be permitted to perform a lookup for the key k that they apply to (e.g., the operation cannot return the previous status of key k). Thus update and insertion operations become essentially the same operation: an upsert on key k with value v operates by inserting the key k with value v if k is not currently present, or updating key k's value to be v if k is currently present.

Since messages higher in the tree may modify a record at a leaf, lookups must examine not only a leaf node but also the root-to-leaf path that may contain any messages relevant to the lookup. In general, we use the term lookup to refer either to membership queries or to predecessor/successor queries. Both can be performed in O(log_B N) I/Os.

Tree conventions. Throughout the paper, when we refer to the size of a tree, we actually mean the logical number of records in the tree. If, for example, a record in a leaf of the tree has a deletion message sitting in a buffer higher in the tree, then that record is not counted in the size. As long as the height of the tree is at least some large constant, the contents of the leaves of the tree will always dominate the mass of the tree (i.e., there are many more records in leaves than messages in buffers), and thus the tree will always have height at most O(log_B N), where N is the size of the tree.

More generally, when discussing a node x, we say the size |x| of x, or equivalently the size of the subtree rooted at x, is the logical number of records in the subtree rooted at x, including the messages in x's buffer.

When two write messages for the same key are present in the same buffer, they can potentially be combined into a single message. For convenience, we assume that these combinations are not performed until the messages arrive at a leaf node. When we refer to the size |b| of a node x's buffer b, we count the total space required by that buffer, including messages with duplicate keys.

We define the siblings of a node x at level ℓ in a tree T to be the (up to two) nodes at level ℓ that are adjacent to x in the value ranges that they store. Note that the siblings of a node x need not have the same parent as x (and in Section 5 the notion of parent will stop being well defined, anyway, since the value range considered by a node will potentially cross between two parents).

As a convention, we use the term message to refer to a

buffered insert/update/delete operation, and the term record to refer to a key-value pair that is either stored in a leaf or is implicitly in the tree due to a message that has not yet made it to the leaves. (That is, the number of records in a tree is the same as the tree's logical size.)

Finally, for convenience, we overload the log_B N operator to return max(log_B N, 1) (so that we do not need to worry about log_B N being sub-constant).

Allowing buffer-sizes to differ by constant factors. Throughout the paper, we will find it useful to allow the values of B used to size buffers at each level of the B^ε-tree to vary slightly. Given a buffer-size sequence B_1, B_2, B_3, ... ∈ ℕ such that B_i ∈ Θ(B) for all i, a (B_1, B_2, B_3, ...)-sized B^ε-tree is one in which the buffer-sizes in nodes at level ℓ are determined by B_ℓ (rather than B). In particular, this means that an oblivious flush on nodes at level ℓ is performed on a buffer of size at least 3B_ℓ, and flushes exactly B_ℓ elements.

We will always assume that B is at least a sufficiently large constant multiple of log_B N. It follows that, when we define a buffer-size sequence B_1, B_2, ..., we need only define the first B terms, B_1, B_2, ..., B_B. Given a sequence p_1, p_2, ..., p_B ∈ [0, 1] satisfying Σ_ℓ p_ℓ ≤ O(1), we define the (p_1, ..., p_B)-induced buffer-size sequence by B_B = B, and B_ℓ = B_{ℓ+1} + ⌈p_ℓ B⌉ for ℓ ∈ {B − 1, ..., 1}. Note that this guarantees B_ℓ ∈ Θ(B) and B_ℓ = (1 + Θ(p_ℓ)) · B_{ℓ+1} for all ℓ ∈ [B − 1].

Placing a small number of small buffers in memory. A key ingredient in our algorithms is the use of a small in-memory cache to store over-sized buffers. The in-memory cache is permitted to consist of polylogarithmically many blocks (i.e., we assume that M ≥ polylog(N) · B). We call the buffer of a node x a cached buffer (and x a cached node) if the buffer is stored in our in-memory cache. Whereas buffers not stored in memory must be flushed before their size exceeds O(B), cached buffers are permitted to grow past O(B) without being flushed. The combined size of all cached buffers, however, must not exceed polylog(N) · B.

3.1 Universal Flushes

In this section, we present the definition of a universal flush, and show how to use universal flushes in order to simulate arbitrary buffer-eviction strategies. This allows us to use universal flushes as the primitive through which our data structure interacts with nodes, while abstracting away the underlying buffer-eviction strategy.

A universal flush on a node x can be performed whenever the node's buffer contains 3B or more records. The universal flush is then required to select exactly B records to evict from that node.

In this section we prove that, given an arbitrary buffer-eviction strategy A, one can implement universal flushing to simulate the behavior of A on a buffer of size B.

Formally, a buffer-eviction strategy A is any algorithm for mapping a B-element multiset M consisting of at most O(B^ε) distinct elements to a subset S ⊆ M. The set M represents the set of records contained in a node's buffer (with each record represented by the child to which it is destined), and the set S represents a selection of which records to evict from the node.

Given a buffer-eviction strategy A, and a sequence of numbers K = (k_1, k_2, ..., k_m) with each k_i ∈ {1, 2, ..., B^ε}, the I/O-complexity of strategy A on sequence K can be computed as follows. Fill a B-element buffer to contain k_1, ..., k_B, and then use A to select a subset S_1 of the elements in the buffer to evict; record the number d_1 of distinct elements in S_1. Then fill the buffer back to size B using elements k_{B+1}, k_{B+2}, ..., and again use A to select a subset S_2 of the elements in the buffer to evict; record the number d_2 of distinct elements in S_2. Repeat this until every element k_i ∈ K has at some point been placed into the buffer, and the buffer size is less than B. The sum Σ_i d_i is the I/O complexity of A on sequence K, and represents the number of distinct times that eviction strategy A touches a child while handling the input stream K.

The I/O-complexity of universal flushing on sequence K is computed in the same way, except that the buffer is refilled to size 3B between successive flushes, instead of only to size B.

The next proposition shows how to use universal buffer flushing in order to simulate any buffer-eviction strategy A.

PROPOSITION 3.1. Given any buffer-eviction strategy A, there is an implementation of universal flushing U such that on any sequence of numbers K = (k_1, k_2, ..., k_m) with each k_i ∈ {1, 2, ..., B^ε}, the I/O-complexity of U on K is at most twice the I/O complexity of A on K.

Proof. Given a buffer-eviction strategy A, we can implement universal flushing U as follows: Partition our buffer u into a future buffer u_1, a current buffer u_2, and a past buffer u_3. Whenever new elements are placed into the buffer u, they are inserted directly into the future buffer u_1. In order to perform a universal flush on the buffer, we first move records from u_1 to u_2 until u_2 is of size B; next we use the buffer-eviction strategy A to select a subset of records in u_2 to evict, and we move those records to u_3; then we repeat these two steps until u_3 has size at least B. Finally, we evict exactly B records from u_3; any remaining records in u_3 will be evicted during the next universal flush.³

³ Several of our algorithms will allow for up to B dummy records to be present in buffer u_1. We may treat these dummy records as always being present in the future buffer u_1 and never being moved to buffer u_2. Since universal flushes are only performed when u = u_1 ∪ u_2 ∪ u_3 has size at least 3B, and since u_3 never exceeds size B at the beginning of a universal flush, one can always fill u_2 to size B using entries from u_1 without using any dummy records.

Now consider the I/O complexity of A on sequence K. Let S_i denote the set of elements evicted in the i-th eviction by A on the sequence K, and let d_i denote the number of distinct elements in S_i. In particular, the I/O complexity of A on K is given by Σ_i d_i.

When the universal flushing strategy U is performed on sequence K, the i-th eviction from the current buffer u_2 evicts exactly the set S_i from u_2 to u_3. Whenever a set S_i is evicted from u_2, it is guaranteed to be evicted from u_3 by the end of the next universal flush. It follows that each set S_i incurs at most 2d_i flushes for U. Thus the I/O complexity of U on K is at most 2 Σ_i d_i, as desired.

The preceding proof shows how to implement universal flushes directly in terms of a buffer-eviction strategy A. More generally, any implementation of universal flushing that selects a set of exactly B records to evict will be compatible with our final data structure. It would be allowable, for example, for the eviction strategy to make eviction decisions with full knowledge of u_1 and u_3. Moreover, even if an eviction strategy were to try to adversarially minimize write-optimization, the fact that universal flushes send B records to O(B^ε) children ensures the bare-minimum amount of write-optimization required to achieve B^ε-tree-like guarantees.

Note that, from the perspective of flushing cascades, universal flushes essentially represent the worst possible flushing strategy. If the B elements flushed by the universal flush go to Θ(B^ε) different children, then each of those children's buffers could easily overflow; those buffers could then also evict to Θ(B^ε) different children each, and so on. Nonetheless, we will present a suite of randomization techniques that allow flushing cascades to be avoided in this setting.

4 The Static Cup-Based B^ε-Tree

We begin by designing flushing strategies for static B^ε-trees, in which the only operations on the tree are update-operations (for records that are already present). In particular, we are interested in designing what we call a global flushing algorithm:

DEFINITION 1. A global flushing algorithm with buffer-size sequence (B_1, B_2, B_3, ...) is any algorithm for deciding which nodes to perform oblivious flushes on during each operation of a B^ε-tree. The oblivious flushes are then performed using buffer-size sequence (B_1, B_2, B_3, ...). Additionally, a global flushing algorithm may mark certain nodes as cached.

A global flushing algorithm must provide two guarantees: (1) If the tree size is N, then the sum of the sizes of the buffers of all cached nodes is at most polylog(N) · B, the size of internal memory; and (2) every uncached node deterministically has buffer-size at most O(B).

The purpose of the section is to prove the following theorem:

THEOREM 4.1. Let 0 < ε < 1 be a constant. Suppose that M ≥ (log N)^c · B for a sufficiently large constant c, and that B is at least a sufficiently large constant multiple of log_B N. Let T be a B^ε-tree on N records, and suppose that T's buffers are initially empty. Consider a workload consisting only of read operations and update operations (to records already in the tree).

Then there exists a global flushing algorithm for T that achieves worst-case update-operation time O(⌈log_B N / B^{1−ε}⌉).

Analyzing Flushing Rounds Instead of Operations.

DEFINITION 2. A flushing round occurs whenever a flush is performed in the root node of a tree. The flushing round consists of any further flushes that are performed as a consequence.

To prove Theorem 4.1, we design a global flushing algorithm with two properties. First, successive flushing rounds are always separated by Ω(B) operations. And second, each flushing round requires at most O(log_B N) universal flushes with high probability in N.

Since each universal flush requires at most O(B^ε) I/Os (in order to touch each of the node's children), the I/O-complexity of a flushing round is O(B^ε log_B N). By spreading the work for each flushing round across Ω(B) operations, we can then obtain worst-case update-time O(⌈log_B N / B^{1−ε}⌉), as desired by Theorem 4.1.

Randomizing node overflows with dummy messages. A key step in our algorithm is to initially place a random number r_x ∈ {0, 1, 2, ..., B_ℓ − 1} of dummy messages into the buffer of each node x in each level ℓ. The quantity r_x is known as the random initial offset at node x. The dummy messages are for bookkeeping purposes only, and are never flushed from the buffer in which they reside. (If universal flushes are implemented in terms of a ball-recycling strategy, as described in Section 3, then one can follow the rule that the dummy messages never leave the future buffer u_1.)

We say that a node x in level ℓ overflows whenever the fill of the node (i.e., the number of messages, including dummy messages, that reside in node x's buffer) increases from a value less than 3B_ℓ to a value at least 3B_ℓ. More generally, we say that a node crosses a threshold whenever the fill of the node increases from a value less than kB_ℓ to a

value at least kB_ℓ for an integer k ≥ 3. (Note that a fill of 3B_ℓ including dummy messages may correspond with a fill of as little as 2B_ℓ + 1 excluding dummy messages.)

The presence of r_x dummy messages in each node x randomizes the threshold crossings so that the number of crossings in nodes at a given level ℓ during a sequence of flushing rounds t_0, ..., t_1 is a sum of well-behaved random variables. Specifically, if j messages are flushed from level ℓ + 1 to level ℓ during a time interval, then the number of crossings in nodes at level ℓ is a sum of independent indicator random variables with total mean j/B_ℓ.

LEMMA 4.1. Suppose that a global flushing algorithm is used in which flushes at levels ℓ + 1, ℓ + 2, ... are independent of the random thresholds r_x used for nodes x at level ℓ. At each level ℓ, define an input round to be the time interval between two consecutive universal flushes at level ℓ + 1 (containing the first of the two).

Let j be the number of messages flushed from level ℓ + 1 to level ℓ during input rounds t_0, ..., t_1 for some t_0, t_1 ∈ ℤ. Then the number of threshold crossings in level ℓ during input rounds t_0, ..., t_1 is bounded above by a sum of independent 0-1 random variables with mean j/B_ℓ.

Proof. For each node x at level ℓ, let g_x be the number of messages placed into node x's buffer during input rounds 1, ..., t_0, including the initial r_x dummy messages used for bookkeeping. Since the random value of r_x is independent of the number of messages g_x − r_x to have arrived from level ℓ + 1 during those input rounds, the quantity g_x is the sum of two independent random variables, one of which is uniformly random in {0, ..., B_ℓ − 1}. It follows that g_x (mod B_ℓ) is uniformly random in {0, ..., B_ℓ − 1}.

Whenever a universal flush is performed in cup x, exactly B_ℓ messages are removed from the buffer. This ensures that, at the beginning of input round t_0, the number of messages a_x presently in node x's buffer satisfies a_x ≡ g_x (mod B_ℓ). Thus a_x (mod B_ℓ) is uniformly random in {0, ..., B_ℓ − 1}.

Let j_x denote the number of messages that arrive into cup x's buffer during input rounds t_0, ..., t_1. Note that every time a threshold crossing occurs in node x, the total number of messages to have been placed in node x's buffer, over all time, must cross a multiple of B_ℓ. (Here, again, we are exploiting the fact that universal flushes remove exactly B_ℓ messages, thereby not changing the number of messages modulo B_ℓ in the buffer.) It follows that the first j_x − (j_x mod B_ℓ) messages to arrive into x's buffer during input rounds t_0, ..., t_1 can induce at most ⌊j_x/B_ℓ⌋ threshold crossings in cup x. The final j_x (mod B_ℓ) messages can induce an additional threshold crossing only if (a_x (mod B_ℓ)) ≥ B_ℓ − (j_x (mod B_ℓ)). Since a_x (mod B_ℓ) is uniformly random, it follows that the extra threshold crossing occurs with probability at most (j_x (mod B_ℓ))/B_ℓ.

The total number of threshold crossings in cup x during input rounds t_0, ..., t_1 is therefore bounded above by

(Σ_{i=1}^{⌊j_x/B_ℓ⌋} 1) + Y_x,

where Y_x is a 0-1 random variable that takes value 1 with probability (j_x (mod B_ℓ))/B_ℓ. Summing over nodes x at level ℓ, the total number of threshold crossings at level ℓ during input rounds t_0, ..., t_1 is at most

Σ_{x ∈ level ℓ} ( Y_x + Σ_{i=1}^{⌊j_x/B_ℓ⌋} 1 ).

The above sum is a sum of independent 0-1 random variables, since the outcome of each Y_x is randomized by r_x, and since the constant random variable 1 is trivially independent of all other random variables. Moreover, the sum has mean

Σ_{x ∈ level ℓ} j_x/B_ℓ = j/B_ℓ,

as desired.

Two randomized flushing algorithms. The first algorithm we consider is the aggressive randomized flushing algorithm, which performs a universal flush on each node x whenever that node overflows. One advantage of the aggressive randomized flushing algorithm is that it forgoes the use of caching; the guarantee that we give for the algorithm is weaker than Theorem 4.1, however, bounding the number of flushes across collections of polylog(N) consecutive flushing rounds, instead of in a single flushing round. We begin by analyzing the aggressive randomized flushing algorithm, and then extend the analysis to our final algorithm, which we call the lazy randomized flushing algorithm.

PROPOSITION 4.1. Suppose that B is at least a sufficiently large constant multiple of log_B N. Let T be a B^ε-tree on N records, and suppose that T's buffers are initially empty. Consider a workload consisting only of read operations and update operations (to records already in the tree). Suppose that the operations are performed using the aggressive randomized flushing algorithm, and using the (1/1², 1/2², 1/3², ...)-induced buffer-size sequence.

Define k = (log N)^c. Then for any sequence of k flushing rounds, the total number of universal flushes in the k rounds is at most O(k · log_B N), with high probability in N.

Proof. Consider a sequence of k flushing rounds t_0, ..., t_1. For each level ℓ, let j_ℓ be the number of universal flushes performed by the aggressive randomized flushing algorithm at level ℓ during rounds t_0, ..., t_1.
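The modular-counting fact at the heart of these arguments can be checked on a toy example. Because every universal flush removes exactly B_ℓ messages, a node's fill mod B_ℓ never changes, so a threshold is crossed exactly when the cumulative count of messages ever placed in the buffer (the dummy messages included) crosses a multiple of B_ℓ. The function name below is invented for illustration:

```python
def threshold_crossings(r_x, arrivals, B):
    # Count crossings of multiples of B by the cumulative message
    # count, starting from the random initial offset r_x.
    total, crossings = r_x, 0
    for batch in arrivals:
        crossings += (total + batch) // B - total // B
        total += batch
    return crossings

# j = 25 messages arrive at a node with buffer parameter B = 10.
# Averaged over the uniformly random offset r_x, the expected number
# of crossings is exactly j/B = 2.5, independent of the batching.
counts = [threshold_crossings(r_x, [7, 8, 10], B=10) for r_x in range(10)]
```

The offsets r_x ∈ {0, ..., 4} yield ⌊j/B⌋ = 2 crossings and the offsets r_x ∈ {5, ..., 9} yield one extra crossing, matching the (j mod B)/B probability in the proof.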

Let h be the height of the tree T. Then j_h is exactly the number of flushing rounds k. For ℓ < h, the number of messages flushed from level ℓ + 1 to level ℓ during rounds t_0, ..., t_1 is exactly

j_{ℓ+1} · B_{ℓ+1}.

By Lemma 4.1, the number of threshold crossings at level ℓ, which is in turn exactly j_ℓ, is a sum of independent 0-1 random variables with mean

j_{ℓ+1} · B_{ℓ+1}/B_ℓ.

The use of the (1/1², 1/2², 1/3², ...)-induced buffer-size sequence ensures that B_{ℓ+1}/B_ℓ ≤ 1 − Ω(1/ℓ²). It follows that for each ℓ < h, j_ℓ is a sum of independent 0-1 random variables with mean at most

j_{ℓ+1}(1 − Ω(1/ℓ²)) ≤ j_{ℓ+1}(1 − Ω(1/log² N)).

By a Chernoff bound, assuming that k ≥ (log N)^c for a large enough constant c, the probability that j_ℓ > k given that j_{ℓ+1} ≤ k is 1/poly(N). Thus, with high probability in N, j_ℓ ≤ k for all levels ℓ.

We remark that the use of the (1/1², 1/2², 1/3², ...)-induced buffer-size sequence is not necessary for Proposition 4.1. In particular, as noted in Section 2, even if every level uses the same block-size, one can argue that over the course of a large number j of flushes from level ℓ + 1 to level ℓ, only at most (1 + 1/log_B N_max) · j flushes are caused at level ℓ. The use of the buffer-size sequence allows for the stronger property that at most j flushes are performed at level ℓ, which slightly simplifies the proof. The induced buffer-size sequence will also play a similar (and more critical) role in the analysis of the aggressive randomized flushing algorithm.

Although Proposition 4.1 shows that the aggressive randomized flushing algorithm behaves well across batches of polylogarithmically many flushing rounds, any individual flushing round could potentially involve an unacceptably large number of universal flushes. The lazy randomized flushing algorithm resolves this problem by caching the buffers for a small number of nodes in internal memory.

During each flushing round, and for each level ℓ (beginning with the root level ℓ = h), the algorithm selects the node x with the fullest buffer, and if x's buffer is of size at least 3B_ℓ, then the algorithm performs a universal flush on x. Any other buffer of size 3B_ℓ or more is cached.

Finally, at the end of each flushing round, there is a cleanup step that occurs with low probability. If, at the end of any round, the total combined size of the cached buffers at a level ℓ reaches M/(1 + log_B N) − B_1, then the algorithm selects the smallest such level ℓ, and performs an additional universal flush in each of levels ℓ, ℓ − 1, ℓ − 2, ..., 2. If there remains a level ℓ with a cache of size at least M/(1 + log_B N) − B_1, then the algorithm again selects the lowest such ℓ, and repeats the process until no such ℓ exists. Note that, by selecting the lowest such ℓ each time, the algorithm ensures that a level ℓ′ can only be flushed into if level ℓ′ has a cache of size smaller than M/(1 + log_B N). The combined sizes of the caches across all levels therefore do not exceed M.

Note that, even with the cleanup step, the lazy randomized flushing algorithm continues to satisfy the requirement of Lemma 4.1 that flushes at levels ℓ + 1, ℓ + 2, ... are independent of the random thresholds r_x used for nodes x at level ℓ. For this property to remain true, it is important that cleanup is performed based on the cache-size at each level, rather than based on the total cache size of the tree.

With the exception of when the cached buffers cause cleanup steps, the lazy algorithm performs at most O(log_B N) universal flushes during each flushing round. In order to complete Theorem 4.1 it suffices to bound the sum of the sizes of the cached buffers, demonstrating that cleanup steps are rare. This is accomplished by Lemma 4.2.

LEMMA 4.2. Suppose that B is at least a sufficiently large constant multiple of log_B N. Let T be a B^ε-tree on n records, and suppose that T's buffers are initially empty. Consider a workload consisting only of read operations and update operations (to records already in the tree). Suppose that the operations are performed using the lazy randomized flushing algorithm, and using the (1/1², 1/2², 1/3², ...)-induced buffer-size sequence.

Then after any flushing round, the combined size of the cached buffers is at most B · polylog(N), with high probability in N.

Proof. As in Lemma 4.1, define an input round at each level ℓ to be the time period beginning just before a universal flush at level ℓ + 1 and ending just before the next universal flush at level ℓ + 1. Note that level ℓ may experience multiple input rounds during a single flushing round due to clean-up steps within the flushing round.

Let h be the level of the root node. For a level ℓ < h and an input round t, define the potential function

Q_ℓ(t) = Σ_{x ∈ A_t} ⌈(|x| − 2B_ℓ)/B_ℓ⌉,

where A_t is the set of cached buffers in level ℓ after input round t.

We prove that for each ℓ and t, Q_ℓ(t) ≤ polylog(N) with high probability in N. This completes the lemma, since 3B_ℓ · Q_ℓ(t) is at least as large as the combined size of the cached buffers at level ℓ.

Note that for each ℓ and t ≥ 1, the value Q_ℓ(t) is either zero, or satisfies

Q_ℓ(t) ≤ Q_ℓ(t − 1) + u_t − 1,

Copyright © 2020 by SIAM. Unauthorized reproduction of this article is prohibited.

where u_t is the number of threshold crossings at level ℓ during input round t. In particular, each threshold crossing increases Q_ℓ by one, and then the universal flush performed by the lazy flushing algorithm reduces Q_ℓ by one (unless Q_ℓ is already 0).

Thus, in order for Q_ℓ(t) to be large, say (log N)^c for a large constant c, there must be some m > 0 such that

    \sum_{i=t-m+1}^{t} (u_i - 1) \ge (\log N)^c.

Rearranging gives

(4.1)    \sum_{i=t-m+1}^{t} u_i \ge (\log N)^c + m.

Lemma 4.1 establishes that \sum_{i=t-m+1}^{t} u_i is bounded above by a sum of independent random variables with mean at most

    m \cdot \frac{B_{\ell+1}}{B_\ell} = m \cdot \left(1 - \Omega(1/\log^2 N)\right),

since at most B_{ℓ+1} messages are flushed from level ℓ+1 to level ℓ during each input round.

By a Chernoff bound, when m ≤ (log N)^c, the probability that Eq. 4.1 occurs is polynomially small in N (i.e., at most 1/poly(N)). Thus with high probability in N, Eq. 4.1 does not hold for any m ≤ (log N)^c.

On the other hand, when m ≥ (log N)^c, we get by a Chernoff bound that

    \Pr\left[\sum_{i=t-m+1}^{t} u_i \ge m\right] \le \exp\left(-\Omega\left(\frac{m}{\log^4 N}\right)\right).

By a union bound over all m ≥ (log N)^c, it follows that with high probability in N, Eq. 4.1 does not hold for any m ≥ (log N)^c.

Since Eq. 4.1 fails for all m with high probability in N, we have that Q_ℓ(t) ≤ (log N)^c with high probability in N, as desired.

An additional property of the Lazy Algorithm. We conclude the section by noting one additional useful property of the lazy randomized flushing algorithm. Specifically, we show that although the sum of the sizes of the cached buffers may in the worst case be M, the size of the largest cached buffer never exceeds O(B log M).

LEMMA 4.3. Let T be a Bε-tree on N records, and suppose that T's buffers are initially empty. Consider a workload consisting only of read operations and update operations (to records already in the tree). Suppose that the flushing operations are performed using the lazy randomized flushing algorithm.

Then after any flushing round, the maximum size of any cached buffer in a level ℓ is O(B · log m), where m is the number of cached nodes in level ℓ. Note that if the cached buffers fit into memory, then this implies that the maximum size of any cached buffer is also at most O(B · log M).

Proof. We prove the lemma by modeling the cached buffers as cups in a cup-emptying game. Each cached buffer x at level ℓ can be thought of as a cup containing w_x units of water, where (w_x + 3)B_ℓ is the number of messages in buffer x. Using this analogy, in each input round, at most 1 unit of water is placed into cups, and then 1 unit of water is removed from the fullest cup (unless that cup contains less than 1 unit of water, in which case the cup is emptied). This is precisely the dynamic cup game analyzed by [11]. It follows that the amount of water in the fullest cup is at most O(log m), where m is the number of non-empty cups. Thus the fullest cached buffer at level ℓ has size at most O(B log m), where m is the number of cached buffers.

5 The Dynamic Cup-Based Bε-Tree

In this section, we introduce the Dynamic Cup-Based Bε-tree, which extends the techniques in Section 4 to support arbitrary workloads of queries and upserts. Throughout the section, we use s_ℓ as shorthand for B^{1-ε} · B^{εℓ}, which one should think of as approximately representing the size of a node at level ℓ.

THEOREM 5.1. Let 0 < ε < 1 be a constant. Suppose that M ≥ cB² + (log N_max)^c B for a sufficiently large constant c, and suppose that B is at least a sufficiently large constant multiple of log N_max.

Consider an arbitrarily long sequence of insert/update/delete operations on an initially empty tree T, and let N_max denote the maximum size of T during those operations. If T is implemented as a dynamic cup-based Bε-tree, then T will deterministically never use more memory than M. The tree T will satisfy the following performance guarantees:

• Worst-Case Write Costs: For each insert/update/delete operation i, with high probability in N_max, the operation i has I/O cost at most O(⌈(log_B N_i)/B^{1-ε}⌉), where N_i is the size of the tree after operation i.

• Domination by Universal Flushes: For a given flushing round i, the number of I/Os incurred by the flushing round is, with high probability in N_max, at most O(B^ε + q), where q is the number of I/Os incurred by universal flushes during the flushing round.
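The cup-game argument behind Lemma 4.3 is easy to sandbox. The following sketch (illustrative only: the filler strategy and all names are ours, not the paper's) plays the dynamic cup game with a greedy emptier and confirms that the backlog stays far below the trivial bound, consistent with the O(log m) guarantee of [11]:

```python
import random

def simulate_cup_game(num_cups, rounds, rng):
    """Dynamic cup game: each round the filler spreads 1 unit of water
    across cups, then the emptier removes 1 unit from the fullest cup
    (emptying it if it holds less than 1 unit)."""
    cups = [0.0] * num_cups
    max_backlog = 0.0
    for _ in range(rounds):
        # Filler: distribute 1 unit of water across a few random cups.
        targets = rng.sample(range(num_cups), k=min(4, num_cups))
        for t in targets:
            cups[t] += 1.0 / len(targets)
        # Greedy emptier: remove 1 unit from the fullest cup.
        fullest = max(range(num_cups), key=lambda i: cups[i])
        cups[fullest] = max(0.0, cups[fullest] - 1.0)
        max_backlog = max(max_backlog, max(cups))
    return max_backlog

rng = random.Random(42)
backlog = simulate_cup_game(num_cups=64, rounds=20000, rng=rng)
# The backlog stays modest (O(log m) for a greedy emptier), far below
# the number of cups.
```

Translated back to the data structure, each unit of water is B_ℓ messages, so a backlog of O(log m) corresponds to a fullest cached buffer of size O(B log m).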

• Amortized Operation Costs: For any n > 0, with high probability in N_max, the total I/O cost of the first n insert/update/delete operations is

    O\left( \sum_{i=1}^{n} \frac{\log_B N_i}{B^{1-\varepsilon}} \right).

• Worst-Case Read Costs: For any insert/update/delete operation i, with high probability in N_max, any read performed after operation i incurs O(log_B N_i) I/Os.

We assume throughout the section that M ≤ poly(B), which is w.l.o.g. since the requirement M ≥ cB² + (log N_max)^c B is already satisfied by some M ∈ poly(B). Importantly, the assumption affects the cleanup phase in the lazy randomized flushing algorithm, and implies by Lemma 4.3 that the maximum size of a stashed buffer never exceeds O(B log B).

Supporting structural changes to the static cup-based Bε-tree introduced in Section 4 leads to two technical issues. The first is the handling of flushing cascades caused by node merges. The second, more subtle, issue is the introduction of cyclic dependencies between levels.

Cyclic dependencies between levels. In order for the randomized flushing strategy at each level ℓ of the tree to be effective (and, specifically, for Lemma 4.1 to hold), it is necessary that the input flushes from level ℓ+1 to level ℓ be independent of the random initial offsets used in level ℓ. That is, for all i ≥ 1, the set of messages flushed by the i-th universal flush in level ℓ+1 must be the same regardless of the random initial offsets used in levels ℓ, ℓ−1, ..., 1.

Node merges and splits violate this property by introducing cyclic dependencies between levels. The random bits used at level ℓ partially determine which messages make it into lower levels by a given point in time, and thus which messages make it to the leaves by a given point in time. The arrival of messages to the leaves of the Bε-tree determines when (and which) nodes at level 2 of the tree merge and split. The merging and splitting of nodes at level 2 of the tree then, in turn, influences when nodes at level 3 of the tree merge and split, etc. Thus the random bits at level ℓ indirectly influence the set of nodes at level ℓ+1, which in turn affects the flushing decisions made in level ℓ+1.

In designing the Dynamic Cup-Based Bε-tree, we begin in Section 5.1 by adapting the lazy randomized flushing algorithm to the dynamic-structure setting in order to eliminate flushing cascades due to node merges. The new flushing algorithm assumes that the underlying node-rebalancing mechanism has a number of useful properties and avoids cyclic level dependencies. In order to design such a node-rebalancing mechanism, Section 5.2 then introduces a randomized weight-balancing approach in which each node maintains a summary sketch of the subtree that it roots, and uses the summary sketch in order to determine when to split or merge with neighbors.

5.1 Eliminating Flushing Cascades. In this section, we adapt the lazy randomized flushing algorithm to handle structural changes to the tree T resulting from node merges and splits. The new flushing algorithm assumes certain important properties of the mechanism that is used for rebalancing nodes (i.e., the algorithm for determining when to perform node merges and node splits). Designing a rebalancing mechanism that guarantees these properties will then be the primary purpose of Section 5.2.

The first property required of the rebalancing mechanism is the Independent Levels Property.

• The Independent Levels Property: The key ranges of the nodes in level ℓ of the tree are determined only by (1) the set of messages that have ever been flushed from level ℓ+1 to level ℓ; and (2) random bits associated with level ℓ of the tree.

The Independent Levels Property eliminates cyclic dependencies between levels. The property ensures that, as long as the flushing algorithm at each level ℓ makes decisions based only on the contents of the buffers at level ℓ, the contents of the i-th flush from level ℓ+1 to level ℓ will be independent of the random bits used by the flushing algorithm (and the rebalancing mechanism) in level ℓ. This, in turn, allows Lemma 4.1 to be applied at each level ℓ.

Note that, in order for the Independent Levels Property to be satisfied, it is necessary for the key ranges at each level ℓ to not necessarily be strict subsets of the key ranges at the higher level ℓ+1. This means that our Bε-tree has a structure similar to that of a skip list, in that adjacent nodes at level ℓ may have a child in common at level ℓ−1.

The second property required of the rebalancing mechanism is the Breathing-Space Property, which will prove useful in adapting the flushing algorithm to handle flushing cascades due to buffer merges.

• The Breathing-Space Property: Say a node x is touched whenever a universal flush occurs either in x or in one of x's sibling nodes in level ℓ. (Note that the sibling nodes of x in level ℓ need not share a parent with x.) Whenever a new node x is created at level ℓ, either via a node merge or a node split, the new node x is touched at least Ω(s_ℓ/B) times before being again involved in a merge or split.

Finally (and perhaps most obviously), the rebalancing mechanism should guarantee a Bε-tree-like structure with high probability in N_max. Specifically, at any point in time, we require that the Balanced-Tree Property hold with high probability in N_max:

• The Balanced-Tree Property: For each level ℓ satisfying s_ℓ < 20|T|, and for each node x at level ℓ, the size of the subtree T_x rooted at x must be Θ(s_ℓ). Moreover, for any level ℓ ≤ B satisfying s_ℓ > 20|T|, the number of nodes at level ℓ must be at most one, and that node must be stored in memory. Additionally, the metadata used by the rebalancing mechanism must consist of, with high probability in N_max, no more than O(B) machine words per node in the tree, and no more than O(B²) additional blocks for non-node-specific metadata.

Note, in particular, that the Balanced-Tree Property bounds the number of children that each node x can have to O(B^ε), which is important both for keeping the size of the node small (so that it fits in a constant number of blocks) and for achieving write-optimized performance for insert/update/delete operations.

We call the flushing algorithm introduced in this section the dynamic lazy flushing algorithm.

PROPOSITION 5.1. Let 0 < ε < 1 be a constant. Suppose that cB² + (log n)^c B ≤ M ≤ poly(B) for a sufficiently large constant c, and that B is at least a sufficiently large constant multiple of log_B n. Assume that the rebalancing mechanism satisfies the Independent Levels Property, the Breathing-Space Property, and the Balanced-Tree Property (with high probability in N_max at any point in time). Then the dynamic lazy flushing algorithm satisfies the properties desired by Theorem 5.1.

The dynamic lazy flushing algorithm performs universal flushes using the same approach as the (static) lazy flushing algorithm. During each flushing round, and for each level ℓ (beginning with the root level), the algorithm selects the node x with the fullest buffer, and if x's buffer is of size at least 3B_ℓ, then the algorithm performs a universal flush on x. Any other buffer of size 3B_ℓ or more is stashed.

At the end of a flushing round, a cleanup step (that occurs with low probability) is also performed in the same manner as for the (static) lazy flushing algorithm. Specifically, if h is the largest level to contain multiple nodes, then cleanup is performed as necessary to ensure that no level ℓ contains a stash of size greater than (M − O(B²))/h − B, where the O(B²) term is the sum of the maximum space in memory that may be required by levels at which there is only a single node, along with the space required by the rebalancing mechanism for non-node-specific metadata.

In addition to performing universal flushes, the dynamic lazy flushing algorithm performs a second type of flush, called a trickle flush, described below.

Retiring buffers on merges and splits. When two nodes x₁ and x₂ at a level ℓ are combined to form a new node y, we create a new buffer b for node y (along with a new random initial offset), and we retire the buffers a₁, a₂ for nodes x₁ and x₂. The new buffer b for node y is assigned as the caretaker for retired buffers a₁ and a₂. Similarly, when a node y is split into two nodes x₁ and x₂, y's buffer b is retired, and two new buffers a₁ and a₂ are created for x₁ and x₂; in this case, both a₁ and a₂ are said to be b's caretakers.

Once a buffer u is retired, it is guaranteed that no further messages will be placed into that buffer (since such messages are now sent to u's caretakers). The lazy randomized flushing algorithm may still perform universal flushes on a retired buffer, however, if that buffer is the fullest of any buffer at level ℓ. (And, importantly, if a stashed buffer is retired, that buffer remains stashed.)

Trickle flushes from retired buffers. Whenever a universal flush is performed on a node x at a level ℓ > 1/ε + 1, one message is flushed from each of the (up to six) retired buffers for which x or x's siblings are caretakers. This is called a trickle flush (since it flushes only a very small amount). By the Breathing-Space Property, whenever a node x at level ℓ > 1/ε + 1 is rebalanced, any nodes y for which x is a caretaker will have had at least Ω(s_ℓ/B) ≥ Ω(B^{1+ε}) trickle flushes. Since B is at least a sufficiently large constant, B^ε is at least a sufficiently large constant multiple of log M ≤ O(log B); it follows that, as long as no retired buffer ever has size greater than O(B log B), any node for which x is a caretaker will have an empty buffer by the time x is rebalanced.

When a universal flush is performed on a node x at a level ℓ ≤ 1/ε + 1, ⌈εB⌉ messages are flushed from each retired buffer for which x or x's siblings are caretakers. This is again called a trickle flush (although it is much larger than the trickle flushes at levels ℓ > 1/ε + 1). Note that the ⌈εB⌉ messages are not flushed using a universal flush, but are instead selected arbitrarily from the retired buffers. Since at least B^ε trickle flushes are performed between rebalances of a node x, by the time the node x is next rebalanced, the retired buffers for which x is a caretaker will each have had at least Ω(εB^{1+ε}) ≫ B log B messages removed from them via trickle flushes; thus all of the buffers for which x is a caretaker are empty when x is next rebalanced (assuming they initially had size no greater than O(B log B)).

We remark that extra care must be taken to ensure that when a node x is a caretaker for a node y, the messages that share a given key k are flushed out of x's and y's buffers in the same order that they arrived (since the order of arrival corresponds with the order in which they should be applied to the key k). That is, if a message r is about to be flushed from x, but y contains a message r′ with the same key as r, then we actually replace the message r′ with r in y's buffer and flush the message r′ instead of the message r.^4

^4 Note that this maintains the invariant that for each key k, the messages with key k in y's buffer are all older than the messages with key k in x's buffer.
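The order-preserving swap described in the remark above can be made concrete. The sketch below is ours, not the paper's implementation (the message layout and all names are hypothetical): when a caretaker is about to flush a message whose key also appears, older, in a retired buffer it cares for, the older message is flushed instead and the newer one is parked in its place.

```python
from collections import deque

def universal_flush_message(msg, retired_buffers, flush_down):
    """Flush one message (key, seqno, payload) out of a caretaker node.

    If a retired buffer cared for by this node holds an older message with
    the same key, that older message is flushed instead, and the newer one
    is parked in its place, so that the messages for each key reach the
    level below in arrival order."""
    key, seqno, _ = msg
    for rb in retired_buffers:
        for i, old in enumerate(rb):
            if old[0] == key and old[1] < seqno:
                rb[i] = msg          # park the newer message in the retired buffer
                flush_down(old)      # the older message goes down first
                return
    flush_down(msg)

# Example: the caretaker flushes ("a", 5, ...); an older message for key "a"
# sits in a retired buffer and must reach the level below first.
out = []
retired = [deque([("a", 1, "old-a"), ("b", 2, "old-b")])]
universal_flush_message(("a", 5, "new-a"), retired, out.append)
```

This preserves the invariant from the footnote: after the swap, every message for key "a" still held in the retired buffer is older than any message for "a" in the caretaker's buffer.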

Analyzing dynamic lazy flushing. Trickle flushes make it so that up to B_{ℓ+1} + 1 messages may be flushed from level ℓ+1 to level ℓ on a given step when ℓ ≥ 1/ε + 1, and up to B_{ℓ+1} + ⌈εB⌉ messages may be flushed from level ℓ+1 to level ℓ when ℓ < 1/ε + 1. To absorb this into the resource augmentation between levels, we modify the buffer-size sequence used to be the (p₁, ..., p_B)-induced buffer-size sequence, where

    p_i = \begin{cases} \frac{1}{i^2} + \varepsilon & \text{if } i \le \frac{1}{\varepsilon} + 1, \\[2pt] \frac{1}{i^2} & \text{if } i > \frac{1}{\varepsilon} + 1. \end{cases}

Since \sum_{i=1}^{B} p_i = O(1), the buffer-size sequence satisfies B_ℓ ∈ Θ(B) for ℓ = 1, ..., B. Moreover, the buffer-size sequence enforces the property that the maximum number of messages r_ℓ that may be flushed from level ℓ+1 to level ℓ during a flushing round satisfies

    \frac{r_\ell}{B_\ell} \le 1 - \frac{1}{\Omega(\ell^2)} \le 1 - \Omega\left(\frac{1}{\log^2 N_{\max}}\right),

which is precisely the property needed from the buffer-size sequence in the proof of Lemma 4.2.

Trickle flushes also violate the property, used in Lemma 4.1, that flushes from a node x at level ℓ always flush exactly B_ℓ elements. This property was used to ensure that the size of each buffer modulo B_ℓ is independent of the random threshold at that buffer, and thus that threshold crossings in each buffer occur at random points in time. However, the property is unnecessary in retired buffers, since new messages are never added to retired buffers, and thus retired buffers never incur threshold crossings.

Besides the issues discussed above, trickle flushes do not interfere with the proofs of Lemma 4.1, Lemma 4.2, or Lemma 4.3. Moreover, the Independent Levels Property ensures that the prerequisite requirements for Lemma 4.1 are met by the dynamic lazy flushing algorithm.

Lemma 4.3 also bounds the size of each buffer (both retired and non-retired) in levels containing only one node by O(B); this ensures that the buffers in levels containing only one node can be stored in memory using space O(B²). This means that, when analyzing the I/O complexity of the Dynamic Cup-Based Bε-tree, we need only consider levels containing multiple nodes; by the Balanced-Tree Property, there are only O(log_B |T|) such levels with high probability in N_max.

Since Lemma 4.3 bounds the size of each buffer by O(B log B), it follows that trickle flushes successfully eliminate retired buffers prior to their caretaker(s) being rebalanced. Thus every node is the caretaker for at most two retired buffers. This is especially important when the retired buffers are not stashed, since read operations accessing a node x must also examine all of the retired buffers for which x is a caretaker. Since each node x is a caretaker for at most two retired buffers, the Balanced-Tree Property ensures that the number of I/Os required by a read operation is at most O(log_B |T|), with high probability in N_max.

Lemma 4.2 bounds the probability of any cleanup occurring during a flushing round by 1/poly(N_max). Thus, with high probability in N_max, there is only one universal flush in each level ℓ of the tree T. Since the trickle flushes corresponding with a universal flush incur only O(1) additional I/Os at each level ℓ > 1/ε + 1, and only O(B^ε) additional I/Os at each level ℓ ≤ 1/ε + 1, it follows that with high probability in N_max, the total number of I/Os required by a given flushing round is at most O(B^ε/ε + q) = O(B^ε + q), where q is the number of I/Os incurred by universal flushes during the flushing round. Successive flushing rounds are separated by B insert/update/delete operations, meaning that if the work for a flushing round is spread out over the following B insert/update/delete operations, then the worst-case insert/update/delete cost becomes O(⌈(B^ε + q)/B⌉).

Since q is, with high probability in N_max, at most O(B^ε · log_B |T|), the i-th operation has, with high probability in N_max, I/O cost at most

    O\left( \left\lceil \frac{\log_B N_i}{B^{1-\varepsilon}} \right\rceil \right),

where N_i is the size of the tree after operation i.

To prove Proposition 5.1, it remains to analyze the amortized performance of insert/update/delete operations. So far we have shown that, with high probability in N_max, each flushing round incurs at most O(B^ε log_B N_i) I/Os, where i is the index of any insert/update/delete performed during the flushing round. The total I/O cost of the first n insert/update/delete operations is therefore at most

    O\left( \sum_{i=1}^{n} \left( X_i + \frac{\log_B N_i}{B^{1-\varepsilon}} \right) \right),

where X_i is 0 for each operation i contained in a flushing round that incurs at most O(B^ε log_B N_i) I/Os, and X_i is N_max otherwise. Since each X_i is zero with high probability in N_max, the sum \sum_{i=1}^{n} X_i has expected value n/poly(N_max). By Markov's inequality, the sum \sum_{i=1}^{n} X_i is at most n with high probability in N_max. The total I/O cost of the first n insert/update/delete operations is therefore at most

    O\left( \sum_{i=1}^{n} \frac{\log_B N_i}{B^{1-\varepsilon}} \right),

with high probability in N_max.
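As a quick sanity check on the modified weight sequence, the snippet below (names are ours; the induced buffer sizes themselves are defined earlier in the paper) computes the weights p_i and verifies that their total is bounded by a constant independent of B and ε, which is what allows B_ℓ ∈ Θ(B):

```python
def weights(B, eps):
    """Weights for the (p_1, ..., p_B)-induced buffer-size sequence:
    p_i = 1/i^2 + eps for i <= 1/eps + 1, and p_i = 1/i^2 otherwise."""
    cutoff = 1.0 / eps + 1.0
    return [1.0 / i**2 + (eps if i <= cutoff else 0.0) for i in range(1, B + 1)]

# Total weight is at most pi^2/6 + eps * (1/eps + 1) <= pi^2/6 + 1 + eps,
# a constant independent of B.
for B in (16, 256, 4096):
    for eps in (0.1, 0.25, 0.5):
        assert 1.0 < sum(weights(B, eps)) < 3.2
```

The extra ε of weight on the first 1/ε + 1 levels is what pays for the larger ⌈εB⌉-message trickle flushes at those levels, while the 1/i² terms supply the 1 − 1/Ω(ℓ²) slack used in Lemma 4.2.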

5.2 Eliminating Inter-Level Dependencies. In this section we introduce a node-rebalancing scheme satisfying the Independent Levels Property, the Breathing-Space Property, and (with high probability in N_max) the Balanced-Tree Property. Combining this with Proposition 5.1 completes the proof of Theorem 5.1. We begin by designing a rebalancing scheme satisfying the Independent Levels Property and the Balanced-Tree Property, and then describe how to further satisfy the Breathing-Space Property at the end of the section.

Each node x_i maintains a simple summary sketch of the tree T_i rooted at x_i. If T is a tree of size in the range [s_ℓ/c^{0.2}, s_ℓ · c^{0.2}] (recall that c is a large constant), the level-ℓ summary sketch S_ℓ(T) satisfies the following properties:

• Unit-Size. With high probability in N_max, S_ℓ(T) consists of at most O(B) machine words. (Moreover, this continues to be true for any T satisfying |T| < s_ℓ · c^{0.2}.)

• Size Estimation. S_ℓ(T) provides an estimate s for |T| (i.e., the logical size of the key set represented by T) satisfying 0.9|T| ≤ s ≤ 1.1|T| with high probability in N_max. (Moreover, if |T| > s_ℓ/c^{0.2} or |T| < s_ℓ · c^{0.2}, then the size estimate s satisfies s ≥ 0.9 s_ℓ/c^{0.2} or s ≤ 1.1 s_ℓ · c^{0.2}, respectively.)

• Median Estimation. S_ℓ(T) provides a message r ∈ T such that, with high probability in N_max, the rank of r in T is between |T|/4 and 3|T|/4.

• Composability. For two trees T₁, T₂, the sketch S_ℓ(T₁ ∪ T₂) can be computed directly from S_ℓ(T₁) and S_ℓ(T₂); moreover, given S_ℓ(T₁ ∪ T₂), and given the key boundaries of T₁ and T₂, one can directly compute S_ℓ(T₁) and S_ℓ(T₂).

For convenience, we will sometimes denote S_ℓ(T) by S_ℓ(x), where x is the root node of the tree T.

Before describing how to construct summary sketches, we explain and analyze their usage in the data structure. Node joins and splits at each level ℓ are performed based on the estimated size of nodes given by the sketches S_ℓ(T_i). Specifically, a node is merged with one of its siblings whenever its estimated size is smaller than s_ℓ, and is split whenever its estimated size is larger than 10 s_ℓ. When a node x is split, the median estimate r given by S_ℓ(T) is used as the boundary on which the split occurs. Note that r might not be used as a boundary at level ℓ−1. This means that our Bε-tree has a structure similar to that of a skip list, in that adjacent nodes at level ℓ may have a child in common at level ℓ−1.

Creating and eliminating root nodes introduces a subtle issue. A natural approach would be to create a new root node whenever the old root splits, and eliminate a root node whenever its number of children becomes one. However, the creation or destruction of a root node at a level h would then be a function of the random bits used in the sketches S_{h−1}, which would violate the Independent Levels Property. To resolve this, we simply make each level ℓ ∈ 1, ..., B always contain at least one node. All levels that contain only a single node fit in memory (with high probability in N_max). Nodes are merged and split according to summary-sketch-based weight balancing, except that if a level contains only one node then that node can never be merged (since it has no siblings).

Establishing the independent levels and balanced-tree properties. The Independent Levels Property follows from the fact that the sketches S_ℓ use different random bits at each level ℓ.

To establish the Balanced-Tree Property (with high probability in N_max), we must be careful about the fact that the random bits determining the sketches S_ℓ also implicitly determine which interval each subtree T_i consists of in level ℓ. That is, the random bits used to determine S_ℓ help determine the partition P = (T₁, T₂, ..., T_k) of the records in levels 1, 2, ..., ℓ (where each T_i corresponds with an interval of keys for which a single node at level ℓ is responsible). This means that the partition P could potentially be adversarial against the sketch S_ℓ. However, since the set of records S = T₁ ∪ T₂ ∪ ⋯ ∪ T_k after a given universal flush at level ℓ+1 is determined entirely by randomness in levels ℓ+1, ℓ+2, ... of the tree (and by the operations on the tree), and since for a given set S there are only O(|S|²) ≤ N_max² options for each T_i, it follows that with high probability in N_max, the sketches S_ℓ(T_i) perform correct size and median estimation even for a worst-case choice of the partition P, and satisfy the Unit-Size Requirement.

The Unit-Size Requirement ensures that the sketch of each subtree T_i can be stored in the root node x_i of the subtree using a constant number of blocks (with high probability in N_max). The composability of sketches makes it so that when a node split or join occurs, the corresponding sketches can also be split or joined. By the accuracy of size- and median-estimation, we get that with high probability in N_max, the Balanced-Tree Property holds after each operation.

Constructing summary sketches. A natural approach to constructing S_ℓ(T) is to randomly sample B ≥ Ω(log N_max) keys from T. Suppose we have access to a fully independent hash function h that maps each key to 0 with probability p and to a non-zero value with probability 1 − p, where p is selected so that p · s_ℓ = B ≥ c log N_max. (Note, however, that our final construction cannot use such a hash function, since the description bits for h would not fit in memory.) Then S_ℓ(T) could store the set of keys r ∈ T for which h(r) = 0. Whenever an insert for a key r is placed into the tree T, the sketch S_ℓ(T) is updated to include r if h(r) = 0; whenever a delete for a key r is placed into the tree T,

the sketch S_ℓ(T) is updated to exclude r. Assuming c is a sufficiently large constant, and that |T| ∈ Θ(s_ℓ), then with high probability in N_max,

    0.9 \cdot |T| \le \frac{|S_\ell(T)|}{p} \le 1.1 \cdot |T|,

allowing S_ℓ(T) to be accurately used for size estimation. Additionally, the median of S_ℓ(T) is, with high probability in N_max, of rank between |T|/4 and 3|T|/4 in T. This is because, with high probability in N_max, no more than 1.1 · (p/4) · |T| of the keys in S_ℓ(T) are among the bottom quartile of T (in terms of ranking), and similarly, no more than 1.1 · (p/4) · |T| of the keys in S_ℓ(T) are among the top quartile of T.

In order to remove the requirement that h be fully independent, we replace h with a collection of low-independence hash functions as follows. Let h₁, ..., h_B be B-wise independent hash functions, with each h_k mapping keys r to 0 with probability q and to a non-zero value with probability 1 − q, where q · s_ℓ = c.

To compute S_ℓ(T) for a tree T, define S₁, ..., S_B so that S_i = {r ∈ T | h_i(r) = 0}, and define the sketch S_ℓ(T) = (S₁, ..., S_B). Composability is immediate from the definition of the sketch. The following lemma demonstrates how to use S_ℓ(T) for size estimation.

LEMMA 5.1. Fix a tree T, and define u_T to be the median of the values |S₁|/q, ..., |S_B|/q. Then with high probability in N_max, if |T| ≥ c^{-0.2} s_ℓ then 0.9|T| ≤ u_T ≤ 1.1|T|; and if |T| ≤ c^{-0.2} s_ℓ then u_T ≤ 1.1 · c^{-0.2} s_ℓ.

Proof. Suppose that |T| ≥ c^{-0.2} s_ℓ, and thus |T| · q ≥ c^{0.8}. Since |S_i| is the sum of |T| B-wise (and thus pairwise) independent 0-1 random variables, each with mean q ≤ 1/2, the variance of |S_i| is given by |T| · q · (1 − q) ≤ |T| · q. The variance of |S_i|/q is therefore at most |T|/q. By Chebyshev's inequality,

    \Pr\left[ \big| |S_i|/q - |T| \big| > \delta |T| \right] \le \frac{|T|/q}{\delta^2 |T|^2} = \frac{1}{\delta^2 |T| \cdot q} \le \frac{1}{\delta^2 c^{0.8}},

for δ > 0. Plugging in δ = 0.1, and using that c is a large constant, yields

(5.2)    \Pr\left[ \big| |S_i|/q - |T| \big| > 0.1 |T| \right] \le \frac{1}{16}.

By a Chernoff bound, with high probability in N_max, at least 2/3 of the B ≥ Ω(log N_max) values |S_i|/q satisfy 0.9|T| ≤ |S_i|/q ≤ 1.1|T|, and thus the median u_T also satisfies 0.9|T| ≤ u_T ≤ 1.1|T|.

Finally, we consider the case of |T| ≤ c^{-0.2} s_ℓ. If T′ is any tree of size c^{-0.2} s_ℓ satisfying T ⊆ T′ (i.e., T′'s record set includes T's record set), then u_T ≤ u_{T′}. By the analysis above, u_{T′} ≤ 1.1 · c^{-0.2} s_ℓ with high probability in N_max, completing the proof.

The next lemma demonstrates how to use S_ℓ(T) in order to achieve median estimation.

LEMMA 5.2. Fix a tree T satisfying |T| ≥ c^{-0.2} s_ℓ, and define v_T to be the median of m(S₁), ..., m(S_B), where m(S_i) is the median of the elements of S_i. (If S_i is empty, then m(S_i) is omitted from the set over which the median v_T is taken.) Then with high probability in N_max, the key v_T has rank between |T|/4 and 3|T|/4 in T.

Proof. By the proof of Lemma 5.1, each S_i satisfies 0.9|T| ≤ |S_i|/q ≤ 1.1|T| with probability at least 15/16. Let X_i denote the set of keys x ∈ S_i that have rank |T|/4 or smaller in T, and let Y_i denote the set of keys y ∈ S_i that have rank greater than 3|T|/4 in T. Then the variance of each of |X_i|/q and |Y_i|/q is at most O(|T|/q), by the same analysis as in Lemma 5.1. Using that c is a large constant, it follows by Chebyshev's inequality that

    \Pr\left[ \big| |X_i|/q - \tfrac{1}{4}|T| \big| > 0.1 |T| \right] \le \frac{1}{16},

    \Pr\left[ \big| |Y_i|/q - \tfrac{1}{4}|T| \big| > 0.1 |T| \right] \le \frac{1}{16}.

Recalling that S_i satisfies q · 0.9|T| ≤ |S_i| ≤ q · 1.1|T| with probability at least 15/16, it follows that with probability at least 13/16 the median element m(S_i) of S_i is in neither X_i nor Y_i.

By a Chernoff bound, with high probability in N_max, at least 2/3 of the m(S_i)'s have rank between |T|/4 and 3|T|/4 in T. It follows that the median v_T of the medians m(S_i) also has rank between |T|/4 and 3|T|/4 in T, as desired.

Finally, we complete the analysis of the summary sketch S_ℓ(T) by proving the unit-size property.

LEMMA 5.3. Fix a tree T satisfying |T| ≤ c^{0.2} s_ℓ. Then with high probability in N_max, the size of the sketch S_ℓ(T) is at most O(B) machine words.

Proof. Here we finally use the B-wise independence of h₁, ..., h_B. (Indeed, for Chebyshev's inequality we only needed 2-independence.) By Theorem 2 of [41], for c′ a sufficiently large constant, each S_i satisfies

    \Pr[|S_i| \ge c' m] \le 2^{-m},

for m ≥ 1. Since the sizes of the S_i's are independent, it follows that \sum_i |S_i| is a sum of independent geometrically-bounded random variables. In particular, the probability that \sum_i |S_i| exceeds its mean by c′ · m for some m ≥ 1 is at most the probability that m − 1 fair coin flips together yield fewer than B total heads. Since B = Ω(log N_max), it follows that with high probability in N_max, \sum_i |S_i| ≤ O(log N_max) ≤ O(B), as desired.
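The two estimators from Lemmas 5.1 and 5.2 can be exercised together. The sketch below is a simplified stand-in, not the paper's construction: a keyed blake2b hash substitutes for the B-wise independent functions h_1, ..., h_B, and all names are ours. It maintains the downsamples S_i, estimates |T| as the median of |S_i|/q, and estimates a median key as the median of the per-sample medians m(S_i).

```python
import hashlib
import statistics

def sampled(i, key, q_inv):
    """Stand-in for hash h_i: maps `key` to 0 with probability 1/q_inv."""
    d = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big") % q_inv == 0

class LevelSketch:
    """Sketch S_l(T) = (S_1, ..., S_B): B independent downsamples of T."""
    def __init__(self, num_hashes, q_inv):
        self.q_inv = q_inv
        self.samples = [set() for _ in range(num_hashes)]

    def insert(self, key):
        for i, s in enumerate(self.samples):
            if sampled(i, key, self.q_inv):
                s.add(key)

    def delete(self, key):
        for s in self.samples:
            s.discard(key)

    def size_estimate(self):
        # u_T: the median of the values |S_i| / q (here q = 1/q_inv).
        return statistics.median(len(s) * self.q_inv for s in self.samples)

    def median_estimate(self):
        # v_T: the median of the per-sample medians m(S_i), skipping empty S_i.
        meds = [statistics.median(s) for s in self.samples if s]
        return statistics.median(meds)

sk = LevelSketch(num_hashes=32, q_inv=64)
for key in range(10000):
    sk.insert(key)
est, med = sk.size_estimate(), sk.median_estimate()
# est concentrates near |T| = 10000; med lands well inside the middle half.
```

Composability also falls out of this representation: the sketch of a union is the index-wise union of the S_i's, and a sketch can be split along a key boundary by partitioning each S_i.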

Bounding the metadata size. For each level ℓ ∈ [B], we must store the random bits for B hash functions. Each of the hash functions h_k at level ℓ must be B-wise independent and map keys to 0 with probability c/s_ℓ.

We may assume w.l.o.g. that c and s_ℓ are powers of two, meaning that there exists a finite field of size s_ℓ/c. Thus if d is the maximum number of bits in a record, then one can select h_k as a random degree-(B−1) polynomial over the finite field F_{max(2^d, s_ℓ/c)}. The number of random bits needed to store h_k is therefore at most max(d, log s_ℓ) · B.

When log s_ℓ ≤ O(d), the random bits for each h_k fit in O(1) blocks on disk, since each block can store Θ(B) d-bit records. On the other hand, if log s_ℓ ≥ 10d, then the maximum size of the tree N_max must satisfy N_max ≪ s_ℓ, and thus we need not actually construct the sketch S_ℓ. (Instead we can simply force level ℓ to consist of a single node.)

Thus each of the B hash functions h_k at each of the B levels requires only O(1) blocks of random bits. The total non-node-specific metadata required by the rebalancing mechanism is O(B²) blocks, as required by the Balanced-Tree Property.

Adding the Breathing-Space Property. Finally, we modify our rebalancing mechanism in order to fulfill the Breathing-Space Property, which we restate below.

• The Breathing-Space Property: Say a node x is touched whenever a universal flush occurs either in x or in one of x's sibling nodes in level ℓ. (Note that the sibling nodes of x in level ℓ need not share a parent with x.) Whenever a new node x is created at level ℓ, either via a node merge or a node split, the new node x is touched at least Ω(s_ℓ/B) times.

We can enforce the Breathing-Space Property with a small modification to our weight-balancing scheme. Call a node x rested if, since the creation of x, the node x has received at least (1/10) · s_ℓ new elements in its buffer. By Lemma 4.3, the buffer-size of a node x is at most O(B log M) ≤ O(B log B). It follows that each rested node x has been flushed out of at least

    \frac{1}{10} \cdot \frac{s_\ell}{B_\ell} - O(\log B)

times, which, by the assumption that B^ε is at least a large constant multiple of log B (i.e., that B is at least a sufficiently large constant), is at least (1/20) · s_ℓ/B_ℓ.

Call a node x tangentially rested if, since the creation of node x, at least one of node x's current siblings (i.e., we do not count siblings that have since been rebalanced) has received at least (1/10) · s_ℓ new elements in its buffer. Note that a tangentially rested node x can potentially stop being tangentially rested when one of its siblings is rebalanced. Again applying Lemma 4.3, the number of flushes applied to x's neighbors since x's creation must be at least (1/20) · s_ℓ/B_ℓ for any tangentially rested node x.

We modify our weight-balancing scheme as follows. Until a node x becomes either rested or tangentially rested, the node x is deemed frozen, meaning that regardless of the value of x's summary sketch, the node x is not permitted to be merged or split. Whenever a non-frozen node y wishes to perform a node-merge, the node waits for at least one of its neighbors to become non-frozen, and then performs a merge with that neighbor (although node y could potentially stop waiting on its neighbors if y's sketch S_ℓ(y) stops indicating the need for a node-merge). If a node x becomes non-frozen, and at least one of node x's neighbors is waiting on x for a merge, then node x is merged with any neighbors waiting on x, and then, if necessary, node x is immediately split. Whenever a node is non-frozen, wishes to perform a split, and has no neighbors waiting on it for a merge, the node performs the split without further delay.

The fact that frozen nodes are never rebalanced ensures the Breathing-Space Property. Moreover, the modifications to the algorithm easily preserve the Independent Levels Property. To complete the analysis of the rebalancing mechanism, we must show that the Balanced-Tree Property continues to hold. Specifically, we must prove that, at any time, each node x in each level ℓ is of size Θ(s_ℓ) with high probability in N_max, unless there is only one node at that level.

LEMMA 5.4. Using the rebalancing mechanism described above, the following property holds: at any point in time, each node x in each level ℓ is of size Θ(s_ℓ) with high probability in N_max, unless there is only one node at that level.

Proof. We consider three cases. The first case is that a node x has sought rebalancing (i.e., S_ℓ(x) estimates x's size to be either less than s_ℓ/2 or at least 10 s_ℓ) after the arrivals of each of the most recent (1/5) · s_ℓ/B_ℓ elements to have arrived in x's buffer; and furthermore, S_ℓ(x) has provided correct size estimates after each of those arrivals. Then node x must have been unfrozen during all of the most recent (1/10) · s_ℓ/B_ℓ arrivals in x's buffer. Moreover, during those most recent (1/10) · s_ℓ/B_ℓ arrivals, at least one of node x's siblings must have become unfrozen, meaning that node x will have successfully been rebalanced, a contradiction.

As a second case, suppose S_ℓ(x) has provided correct size estimates after all of the most recent (1/5) · s_ℓ/B_ℓ arrivals into node x's buffer, and that node x has not required rebalancing after at least one of those arrivals. Then node x must currently be of size between s_ℓ/4 and 11.4 s_ℓ, ensuring that node x is of size Θ(s_ℓ).

Finally, the third case is that S_ℓ(x) has not provided correct size estimates after all of the most recent (1/5) · s_ℓ/B_ℓ arrivals into node x's buffer. We show that with high probability in N_max, this case does not occur for any node x

in the tree T. For a given level ℓ, the arrivals of messages into level ℓ are determined independently of the random bits used to compute the sketches S_ℓ. In the current state of the tree T, there are O(N_max^2) options for the interval [a, b] of which a node x can consist (i.e., the node x has minimum key a and maximum key b). Consider the arrivals into level ℓ of messages in the interval [a, b], and let W_i(a, b) denote the state of the tree T after the i-th most recent such arrival. Let P_i(a, b) denote the set of records at level ℓ (or below) in the interval [a, b] in state W_i(a, b). Since P_i(a, b) is independent of the random bits used to compute S_ℓ, the sketch S_ℓ(P_i(a, b)) correctly estimates the size of P_i(a, b) with high probability in N_max. Taking a union bound over i ∈ [0, ..., (1/5) · s_ℓ/B_ℓ], and over all N_max^2 options for a and b, it follows that, with high probability in N_max, S_ℓ performs correct size estimation on all such P_i(a, b). This, in turn, establishes that the third case occurs with low probability in N_max.

6 Related Work

COLAs [8], write-optimized skip lists [10], and Bε-trees [9, 14] are all on the optimal search–insert trade-off curve [14] and are easily seen to have roughly equivalent structures.

Write-optimized skip lists are just Bε-trees with successor pointers between nodes on the same level and a randomized rebalancing scheme. The successor pointers have no impact on flushing. In fact, it is entirely plausible and useful to add sibling pointers to Bε-trees, so the only difference is the rebalancing scheme.

COLAs can be transformed into Bε-trees by simply storing each level as a sequence of chunks, where block boundaries are placed every Θ(Bε) ghost elements. Breaking each level into chunks is useful for deamortizing flushing, since it makes it easy to perform flushing locally, i.e., the flushing policy can choose to flush from one chunk to another rather than flushing an entire level. Thus the ghost pointers become Bε-tree pivots and the chunks become Bε-tree nodes.

Although LSMs are typically described quite differently from Bε-trees, they are often much closer in practice. For example, LSMs were originally described with each level consisting of a single sorted array but, for obvious practical reasons, many implementations break each level into multiple sub-arrays, which correspond roughly to nodes in a Bε-tree. This already opens the door to localized flushing, rather than level-by-level flushing. The other main difference is that many LSMs do not perform fractional cascading or maintain any indexing information from one level to the next, so that queries must start from scratch at each level. LSM implementers typically assume all this indexing information can fit in RAM, so that fractional cascading is not necessary. Regardless, this design choice doesn't interact with flushing.

One important difference is that many LSMs do not preserve chunk boundaries as they move down the tree. In a Bε-tree, if a key k is the boundary between two nodes at depth i, then no node below level i will span k. Many LSMs don't maintain this invariant. As a result, a single chunk at a higher level may span the ranges covered by an unbounded number of chunks on the next level down. This can stymie any attempt to deamortize flushing, since flushing a single chunk at one level may require touching the entire level below it. However, it is easy to add this invariant to an LSM implementation. Once this is done, each chunk can include the indexing information of its subordinate chunks on the next level down, and now we have essentially the structure of a Bε-tree.

References

[1] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, Sept. 1988.
[2] A. Amir, M. Farach, R. M. Idury, J. A. La Poutré, and A. A. Schäffer. Improved dynamic dictionary matching. Inf. Comput., 119(2):258–282, 1995.
[3] A. Amir, G. Franceschini, R. Grossi, T. Kopelowitz, M. Lewenstein, and N. Lewenstein. Managing unbounded-length keys in comparison-driven data structures with applications to online indexing. SIAM Journal on Computing, 43(4):1396–1416, 2014.
[4] Apache. Accumulo. http://accumulo.apache.org, 2015.
[5] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel. Proportionate progress: A notion of fairness in resource allocation. Algorithmica, 15(6):600–625, 1996.
[6] R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173–189, Feb. 1972.
[7] M. A. Bender, A. Conway, M. Farach-Colton, W. Jannen, Y. Jiao, R. Johnson, E. Knorr, S. McAllister, N. Mukherjee, P. Pandey, D. E. Porter, J. Yuan, and Y. Zhan. Small refinements to the DAM can have big consequences for data-structure design. In Proc. 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 265–274, June 2019.
[8] M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In Proc. 19th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 81–92, 2007.
[9] M. A. Bender, M. Farach-Colton, W. Jannen, R. Johnson, B. C. Kuszmaul, D. E. Porter, J. Yuan, and Y. Zhan. An introduction to Bε-trees and write-optimization. ;login: magazine, 40(5):22–28, October 2015.
[10] M. A. Bender, M. Farach-Colton, R. Johnson, S. Mauras, T. Mayer, C. Phillips, and H. Xu. Write-optimized skip lists. In Proc. 36th ACM Symposium on Principles of Database Systems (PODS), pages 69–78, May 2017.
[11] M. A. Bender, M. Farach-Colton, and W. Kuszmaul. Achieving optimal backlog in multi-processor cup games. In Proc. 51st Annual ACM Symposium on Theory of Computing (STOC), pages 1148–1157, June 2019.

[12] E. Bortnikov, A. Braginsky, E. Hillel, I. Keidar, and G. Sheffi. Accordion: Better memory organization for LSM key-value stores. VLDB, 2018.
[13] G. S. Brodal, E. D. Demaine, J. T. Fineman, J. Iacono, S. Langerman, and J. I. Munro. Cache-oblivious dynamic dictionaries with update/query tradeoffs. In Proc. 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1448–1456, 2010.
[14] G. S. Brodal and R. Fagerberg. Lower bounds for external memory dictionaries. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 546–554, 2003.
[15] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[16] D. Comer. The ubiquitous B-tree. ACM Comput. Surv., 11(2):121–137, June 1979.
[17] N. Dayan and S. Idreos. Dostoevsky: Better space-time trade-offs for LSM-tree based key-value stores via adaptive removal of superfluous merging. In Proc. 2018 International Conference on Management of Data (SIGMOD), pages 505–520, 2018.
[18] P. Dietz and R. Raman. Persistence, amortization and randomization. 1991.
[19] P. Dietz and D. Sleator. Two algorithms for maintaining order in a list. In Proc. 19th Annual ACM Symposium on Theory of Computing (STOC), pages 365–372, 1987.
[20] J. Esmet, M. A. Bender, M. Farach-Colton, and B. C. Kuszmaul. The TokuFS streaming file system. In Proc. 4th USENIX Workshop on Hot Topics in Storage (HotStorage), MA, USA, June 2012.
[21] Facebook, Inc. RocksDB compaction.
[22] J. Fischer and P. Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Annual Symposium on Combinatorial Pattern Matching (CPM), pages 160–171, 2015.
[23] The Apache Software Foundation. Cassandra compaction.
[24] M. T. Goodrich and P. Pszona. Streamed graph drawing and the file maintenance problem. In International Symposium on Graph Drawing (GD), pages 256–267. Springer, 2013.
[25] Google, Inc. LevelDB: A fast and lightweight key/value database library by Google. https://github.com/google/leveldb.
[26] N. Har'El. Scylla's compaction strategies series: Space amplification in size-tiered compaction. https://www.scylladb.com/2018/01/17/compaction-series-space-amplification/.
[27] N. Har'El. Scylla's compaction strategies series: Write amplification in leveled compaction. https://www.scylladb.com/2018/01/31/compaction-series-leveled-compaction/.
[28] Apache HBase. https://hbase.apache.org/.
[29] W. Jannen, J. Yuan, Y. Zhan, A. Akshintala, J. Esmet, Y. Jiao, A. Mittal, P. Pandey, P. Reddy, L. Walsh, M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. BetrFS: A right-optimized write-optimized file system. In Proc. 13th USENIX Conference on File and Storage Technologies (FAST), pages 301–315, 2015.
[30] W. Jannen, J. Yuan, Y. Zhan, A. Akshintala, J. Esmet, Y. Jiao, A. Mittal, P. Pandey, P. Reddy, L. Walsh, M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. BetrFS: Write-optimization in a kernel file system. Transactions on Storage—Special Issue on USENIX FAST 2015, 11(4):18:1–18:29, October 2015.
[31] Z. Kasheff and B. Kuszmaul. Personal communication, 2015.
[32] T. Kopelowitz. On-line indexing for general alphabets via predecessor queries on subsets of an ordered list. In Proc. 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 283–292, 2012.
[33] W. Kuszmaul. Achieving optimal backlog in the vanilla multi-processor cup game. In Proc. 31st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2020.
[34] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, Apr. 2010.
[35] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM (JACM), 20(1):46–61, 1973.
[36] C. Luo and M. J. Carey. LSM-based storage techniques: A survey. CoRR, abs/1812.07527, 2018.
[37] C. W. Mortensen. Fully-dynamic two dimensional orthogonal range and line segment intersection reporting in logarithmic time. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 618–627, 2003.
[38] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[39] K. Ren and G. A. Gibson. TABLEFS: Enhancing metadata efficiency in the local file system. In USENIX Annual Technical Conference, pages 145–156, 2013.
[40] RocksDB. https://rocksdb.org/.
[41] J. P. Schmidt, A. Siegel, and A. Srinivasan. Chernoff–Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8(2):223–250, 1995.
[42] ScyllaDB. Compaction. https://docs.scylladb.com/kb/compaction/.
[43] R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. Proceedings of the VLDB Endowment, 1(1):526–537, 2008.
[44] Tokutek, Inc. ft-index. https://github.com/Tokutek/ft-index. Accessed: 2018-01-28.
[45] Tokutek, Inc. TokuDB: MySQL Performance, MariaDB Performance. http://www.tokutek.com/products/-for-mysql/.
[46] Tokutek, Inc. TokuMX—MongoDB Performance Engine. http://www.tokutek.com/products/tokumx-for-mongodb/.
[47] J. Yuan, Y. Zhan, W. Jannen, P. Pandey, A. Akshintala, K. Chandnani, P. Deo, Z. Kasheff, L. Walsh, M. A. Bender, M. Farach-Colton, R. Johnson, B. C. Kuszmaul, and D. E. Porter. Optimizing every operation in a write-optimized file system. In Proc. 14th USENIX Conference on File and Storage Technologies (FAST), pages 1–14, February 2016.

Copyright © 2020 by SIAM Unauthorized reproduction of this article is prohibited
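The related-work discussion above argues that once each level of an LSM is broken into chunks whose boundaries are preserved from one level to the next, a flush can be performed locally: flushing one chunk touches only the chunks nested directly below it, never the whole level. The following Python sketch illustrates that point only; it is not the paper's data structure. All names (ChunkedLSM, Chunk) are hypothetical, boundaries are fixed up front rather than created by splits, and the toy constants CHUNK_CAPACITY, FANOUT, and KEY_SPACE stand in for Θ(Bε), the level growth factor, and the key universe.

```python
# Toy model of an LSM with chunked levels and boundary-preserving chunks.
# Every chunk boundary at level i is also a boundary at level i+1, so a
# flush from a chunk reaches exactly the FANOUT chunks nested inside it.

CHUNK_CAPACITY = 4   # stands in for Theta(B^eps); tiny so flushes are visible
KEY_SPACE = 64       # keys are ints in [0, KEY_SPACE)
FANOUT = 2           # each level has FANOUT times as many chunks as the last


class Chunk:
    """A run of key-value pairs covering the key range [lo, hi)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.items = {}

    def covers(self, key):
        return self.lo <= key < self.hi


class ChunkedLSM:
    def __init__(self, num_levels=4):
        # Level i has FANOUT**i chunks; boundaries refine downward, which is
        # exactly the invariant discussed in the related-work section.
        self.levels = []
        for i in range(num_levels):
            width = KEY_SPACE // FANOUT ** i
            self.levels.append(
                [Chunk(j * width, (j + 1) * width) for j in range(FANOUT ** i)])

    def insert(self, key, value):
        self.levels[0][0].items[key] = value
        self._flush_if_full(0, self.levels[0][0])

    def _flush_if_full(self, depth, chunk):
        if len(chunk.items) <= CHUNK_CAPACITY or depth + 1 == len(self.levels):
            return
        # Local flush: only the chunks nested inside [chunk.lo, chunk.hi)
        # on the next level can receive keys -- never the entire level.
        targets = [c for c in self.levels[depth + 1]
                   if chunk.lo <= c.lo and c.hi <= chunk.hi]
        for k, v in chunk.items.items():
            next(c for c in targets if c.covers(k)).items[k] = v
        chunk.items.clear()
        for c in targets:
            self._flush_if_full(depth + 1, c)

    def get(self, key):
        # Point query: check the chunk covering `key` at each level, top down.
        for level in self.levels:
            chunk = next((c for c in level if c.covers(key)), None)
            if chunk is not None and key in chunk.items:
                return chunk.items[key]
        return None
```

Without the nesting invariant, the `targets` list in `_flush_if_full` could overlap arbitrarily many chunks of the next level, which is precisely the cascade-inducing behavior the boundary-preservation invariant rules out.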